Asking for help to benchmark FASTEST MEMMEM function

Sanmayce

Honorable
Oct 30, 2013
4
0
10,520
Hi to all who appreciate extreme speeds.
Recently, I was lucky to finish my 2+ years long quest for writing in C the most optimized function for finding a memory block into another block of memory, the so called MEMMEM.

My function named Railgun is 100% free and you can see it at the external links in Wikipedia article about BMH algorithm:
https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm

The problem is that the beautiful Boyer–Moore–Horspool algorithm has reigned for 33 years (since 1980) and no one bothered to raise the order of checked chars in the rightmost part of the window. I simply raised the order from 1 to 2 and also to 12 for larger patterns. This did lead to thunderous boost in speed performance even on my old Core 2 T7500 laptop, but since I am greedy my wish is to see how fast it can go (especially on AMD Vishera with reduced L1 cache) on different modern CPUs, I would be glad to see my benchmark results on Haswell - that's my request.

Machinely yours,
Georgi 'Sanmayce'
 

Sanmayce

Honorable
Oct 30, 2013
4
0
10,520
I am disappointed, ten days later not a single soul interested in a piece of beauty being the FASTEST search etude!

Since this is a hardware forum and my function 'Railgun_Sekireigan_Shockeroo' (at http://www.codeproject.com/Articles/250566/Fastest-strstr-like-function-in-C) unfolds the ideas behind Boyer–Moore–Horspool algorithm it is more than very interesting.

Sadly I can't get my hands on some fast machine, only Core 2 and 2nd generation i5 mobile CPUs.

On laptop with i5 2.4GHz searching for pattern "In [[ve" into Wikipedia 50MB dump the results is:
Doing Search for Pattern(7bytes) into String(52428800bytes) as-one-line ...
The first instance occurred at position 51265299 within next line:
In [[vertebrate]]s, vigorously contracting [[skeletal muscle]]s (during weightlifting or sprinting, for example) do not receive enough oxygen to meet the energy demand, and so they shift to [[Fermentation (biochemistry)|anaerobic metabolism]], converting glucose to lactate. The [[liver]] regenerates the glucose, using a process called [[gluconeogenesis]]. This process is not quite the opposite of glycolysis, and actually requires three times the amount of energy gained from glycolysis (six molecules of ATP are used, compared to the two gained in glycolysis). Analogous to the above reactions, the glucose produced can then undergo glycolysis in tissues that need energy, be stored as glycogen (or starch in plants), or be converted to other monosaccharides or joined into di- or oligosaccharides. The combined pathways of glycolysis during exercise, lactate's crossing via the bloodstream to the liver, subsequent gluconeogenesis and release of glucose into the bloodstream is called the [[Cori cycle]].<ref>Fromm and Hargrove (2012), pp. 183–194.</ref>
Railgun_Sekireigan_Shockeroo performance: 3938KB/clock

To find such a short needle at such speed is simply AWESOME, the function is written in plain C and can be easily rewritten in other languages. And the best part is that it is 100% FREE as it should.

I still hope someone with Haswell to help me to benchmark 'Railgun_Sekireigan_Shockeroo' - it tends to become faster with newer CPUs.
I have one more outrageous idea to boost current approach by going to order 3 which will use 2MB lookup table, perhaps even the nowadays CPUs are still weak for such greedy utilization.
 

Sanmayce

Honorable
Oct 30, 2013
4
0
10,520
Thanks Alabalcho for your readiness to help, apparently this nasty economic crisis hurts the benchmarking too.

Stupid money, cannot buy a latest edition CPU, the cache system of Haswell is what 'Railgun_Sekireigan_Shockeroo' needs:
superfast + superfast = DOUBLEsuperfast

Current Haswells feature 8 threads, while my FASTEST search full-text tool uses 16 threads executing 16 'Railgun_Sekireigan_WolfRAM' functions (with better Skip Performance than 'Shockeroo').
Caramba, I really need 16 threads capable CPU with quad-channeled 64GB - simply I want to showcase the power of Kazahana on the 42GB Wikipedia.