# iosifovich
Iosifovich is a blazingly fast implementation of the Levenshtein distance
function.

The current version is considerably faster than the original, now lost,
version, which processed 10k words from the words file on my computer in a bit
more than 15 seconds. The current implementation does the same in about 11.3
seconds, a reduction in running time of about 24%.

To my knowledge this is the fastest implementation of the Levenshtein distance
function you can find.
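For reference, the core of any such function is the classic dynamic-programming
recurrence. The sketch below shows the standard two-row formulation; it is a
simplified illustration of the algorithm, not the actual iosifovich code, which
layers further optimizations on top of it.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Classic two-row Levenshtein distance: keep only the previous DP row
// and overwrite it in place. (A simplified sketch, not the actual
// iosifovich implementation.)
std::size_t levenshtein(const std::string& a, const std::string& b)
{
    std::vector<std::size_t> row(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
        row[j] = j;  // distance from the empty prefix of a

    for (std::size_t i = 1; i <= a.size(); ++i) {
        std::size_t diag = row[0];  // D[i-1][j-1]
        row[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t up = row[j];  // D[i-1][j]
            row[j] = std::min({ row[j] + 1,                        // deletion
                                row[j - 1] + 1,                    // insertion
                                diag + (a[i - 1] != b[j - 1]) });  // substitution
            diag = up;
        }
    }
    return row[b.size()];
}
```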
## Licensing
I've kept the code under GPLv3 but would be willing to discuss alternative
licenses if anybody is interested.
## News
The implementation has been moved to a template function. I didn't measure the
performance meticulously, but casual benchmarking revealed no regression.

While templating the code, I also introduced a stack buffer that is used
whenever the required buffer is small enough to fit in it. This optimization
increased performance by about 11% in a benchmark using the first 5000 words
from the words file. Together, templating and the stack-based buffer add up to
a significant performance boost.
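The stack-buffer idea can be illustrated as follows. The helper name and the
threshold constant are hypothetical, chosen for the sketch, not taken from the
actual iosifovich internals:

```cpp
#include <array>
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative small-buffer optimization: run a computation with a row
// buffer that lives on the stack when it fits, falling back to a heap
// allocation otherwise. (Names and the threshold are hypothetical, not
// the actual iosifovich internals.)
constexpr std::size_t kStackRowSize = 128;

template <typename F>
auto with_row_buffer(std::size_t needed, F&& f)
{
    if (needed <= kStackRowSize) {
        std::array<std::size_t, kStackRowSize> stack_row{};
        return f(stack_row.data(), needed);  // no heap allocation
    }
    std::vector<std::size_t> heap_row(needed);
    return f(heap_row.data(), needed);
}
```

A distance function templated on the string type can then fill its DP row
through the pointer it receives, so the common short-string case never pays
for a heap allocation.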
## Plans for further development
Having reduced the size of the buffer, templated the code and introduced the
stack buffer, I don't think I can improve performance much further without
introducing vectorization and parallelism. Both will require a lot of effort
to get right without sacrificing performance for short strings.
1. ~~Make benchmarks.~~
2. Introduce more tests.
   1. Tests with prefix, infix and suffix strings shared between the strings.
   2. Tests where the lengths of the strings are combinations of odd and even.
3. ~~Reduce the size of the buffer. When this was done with the old version,
   performance increased by 100%.~~
4. Look into SIMD instructions.
5. Look into parallelism.
6. ~~Templating the code.~~
## SIMD
I have some ideas for how SIMD instructions might be used to improve
performance, but I doubt they will have much effect on small strings, and they
might even be detrimental if the strings are too short.
The most straightforward approach would be to just do more than one calculation
at a time, shifting the results down the SIMD registers.
SIMD instructions are very CPU specific, so doing anything here would first
mean investigating whether any cross-CPU approach is available that would let
me write the implementation without maintaining a backend for each CPU plus a
generic version that works everywhere. I don't expect massive gains from this,
as I think the current performance bottleneck is memory bandwidth.
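The evaluation order this relies on can be prototyped in scalar code first:
the cells on one anti-diagonal of the DP table depend only on the two previous
diagonals, never on each other, so a SIMD version could compute a whole
diagonal at once. Below is an illustrative scalar sketch of that order, not
the iosifovich code:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance evaluated anti-diagonal by anti-diagonal.
// All cells on diagonal d (where i + j == d) depend only on diagonals
// d-1 and d-2, so the inner loop has no loop-carried dependency --
// exactly the property a SIMD version would exploit.
std::size_t levenshtein_diagonal(const std::string& a, const std::string& b)
{
    const std::size_t m = a.size(), n = b.size();
    std::vector<std::size_t> d2(m + 1), d1(m + 1), d0(m + 1);
    for (std::size_t d = 0; d <= m + n; ++d) {
        const std::size_t lo = d > n ? d - n : 0;
        const std::size_t hi = std::min(d, m);
        for (std::size_t i = lo; i <= hi; ++i) {  // independent iterations
            const std::size_t j = d - i;
            if (i == 0)
                d0[i] = j;
            else if (j == 0)
                d0[i] = i;
            else
                d0[i] = std::min({ d1[i] + 1,      // D[i][j-1], insertion
                                   d1[i - 1] + 1,  // D[i-1][j], deletion
                                   d2[i - 1] + (a[i - 1] != b[j - 1]) });
        }
        std::swap(d2, d1);  // diagonal d-1 becomes d-2 ...
        std::swap(d1, d0);  // ... and the current diagonal becomes d-1
    }
    return d1[m];  // last computed diagonal (d == m + n) holds D[m][n]
}
```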
## Parallelism
It should be possible to do the calculation recursively, by splitting the
longer string in the middle and then calculating the two parts separately.
If that can be done, it should be easy to spin up threads and run this on all
the cores.
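One way that split could work is the divide-and-conquer partition used in
Hirschberg's algorithm: compute the last DP row of the first half against every
prefix of the other string, the last DP row of the reversed second half against
every suffix, and take the minimum over all split points. The two halves are
independent, so one of them can run on another thread. A hedged sketch, with
illustrative names, assuming that partition:

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Last row of the Levenshtein DP table for a against b, i.e. the
// distance from a to every prefix of b.
std::vector<std::size_t> dp_last_row(const std::string& a, const std::string& b)
{
    std::vector<std::size_t> row(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) row[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        std::size_t diag = row[0];
        row[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t up = row[j];
            row[j] = std::min({ row[j] + 1, row[j - 1] + 1,
                                diag + (a[i - 1] != b[j - 1]) });
            diag = up;
        }
    }
    return row;
}

// Split a in the middle and combine the two independent halves,
// running one of them on a separate thread. (Illustrative sketch of
// the divide-and-conquer partition, not the iosifovich code.)
std::size_t levenshtein_split(const std::string& a, const std::string& b)
{
    const std::size_t mid = a.size() / 2;
    const std::string a1 = a.substr(0, mid);
    const std::string a2r(a.rbegin(), a.rbegin() + (a.size() - mid));  // reversed second half
    const std::string br(b.rbegin(), b.rend());                        // reversed b

    auto first = std::async(std::launch::async,
                            [&] { return dp_last_row(a1, b); });
    const std::vector<std::size_t> g = dp_last_row(a2r, br);
    const std::vector<std::size_t> f = first.get();

    // The full distance is the best way to split b between the halves:
    // f[k] matches a1 against b[0..k], g[n-k] matches a2 against b[k..n].
    std::size_t best = f[0] + g[b.size()];
    for (std::size_t k = 1; k <= b.size(); ++k)
        best = std::min(best, f[k] + g[b.size() - k]);
    return best;
}
```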
It might be better not to parallelize inside the function at all, and instead
let the caller rely on being able to process a lot of word pairs quickly.