Use memcpy in the special case amp==0 (no amplification) and optimize the code in
the performance-critical loop. Interestingly, using the likely()/unlikely() macros made the
code slower.
Results (three runs on identical input data on a 32-bit x86 machine under Linux, gcc-4.4.0;
averages in milliseconds):
old with --amp 3:
0m0.776s 0m0.790s 0m0.812s, avg: 792
new with --amp 3:
0m0.456s 0m0.492s 0m0.477s, avg: 475
speedup: 1.67
old with --amp 0:
0m0.791s 0m0.808s 0m0.810s, avg: 803
new with --amp 0:
0m0.100s 0m0.103s 0m0.094s, avg: 99
speedup: 8.1
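
For illustration, the amp==0 fast path described above might look roughly like the
sketch below. This is not the actual patch: the function name amplify(), the 16-bit
sample format, and the scaling rule (multiply by amp + 1, then clamp) are assumptions
made only to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from the patch: amplify n signed 16-bit
 * samples from in to out. The buffers must not overlap.
 *
 * Assumption: amp == 0 means "no amplification", and any other value
 * scales each sample by (amp + 1), clamped to the int16_t range.
 */
static void amplify(const int16_t *in, int16_t *out, size_t n, unsigned amp)
{
	int32_t factor;
	size_t i;

	assert(in != NULL && out != NULL);

	/* Fast path: nothing to compute, so copy the whole buffer at once. */
	if (amp == 0) {
		memcpy(out, in, n * sizeof(*in));
		return;
	}

	/* Slow path: per-sample multiply with clamping. */
	factor = (int32_t)amp + 1;
	for (i = 0; i < n; i++) {
		int32_t v = (int32_t)in[i] * factor;

		if (v > INT16_MAX)
			v = INT16_MAX;
		else if (v < INT16_MIN)
			v = INT16_MIN;
		out[i] = (int16_t)v;
	}
}
```

The point of the early return is that the per-sample loop, including its clamping
branches, is skipped entirely when no amplification is requested.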