I benchmarked Cartesian product implementations in Haskell, then compared them with C

by u/CuriousExplorer991 mo ago4 comments

I wrote a small article around implementing Cartesian products, starting from Haskell’s sequence. The article goes through a naive Haskell implementation, a more idiomatic list-comprehension version, native sequence, then versions using Data.Vector.Unboxed, mutable vectors, runST, unsafeFreeze to try a different memory representation. The second half compares those designs with C implementations, mostly to look at what changes when the memory layout and allocation model are made explicit. The most interesting result for me was that changing representation in Haskell reduced allocations a lot without automatically improving runtime. In some cases, fusion helped a bit (no temporary indices). I’d be happy to get feedback on the Haskell side, especially the vector/ST implementations and whether there are more idiomatic or faster ways to express this. Here is the article link: https://julienlargetpiet.tech/articles/the-cartesian-product-disaster-tour-haskell-c-and-25gb-of-allocations.html If you want to share any optimizations, you can do a PR at this repo: https://github.com/julienlargetpiet/PerfLabs It will be mentioned in the next article update. PS: We now managed to find (currently) best version which is this one (in C) running at 160ms for 100k iterations to 55 lists here: https://github.com/julienlargetpiet/PerfLabs/blob/main/CartesianProduct/contrib2/contrib2.c I'm updating after trying some new Haskell impl, thanks everyone for the help and the intuition on where to dig !

// 4 comments

▲1▼

u/NightOwlDev 1 mo ago

Like many languages there is a tradeoff between idiomatic code and tightly optimized numeric hot loops. Applying some optimizations I'm able to get the Haskell vector flat version (fastest version on my hardware) from 4.4x slower than C (your article number) to faster 😄. move salt out of inner hot loop, and instead invalidate once per iter (matching C version behavior, ~7% reduction) switch to PrimArray (faster than both Vector.Unboxed and Storable) (1.6x slower than C) flattening inner loop via a pre-computed nval * j offset table (once at start of bench) instead of nested inner loop with nval * j (1.3x slower than C). I think this is a fair optimization you could really do if this was a critical function, in a real program you would memoize this by stride length but in your bench stride is constant. Why does this inner loop flattening trick work? C compiler sees the benchmark test data dims are 5x5, and is able to automatically unroll the hot loop to 5 straight line copies (since it knows stride = 5 at compile time). We can do the same optimization manually at runtime in Haskell via a lookup table, once we know stride = 5. After modifying the C code so that it's unable to loop unroll (data as runtime input instead of const's) it becomes slower than my optimized Haskell code. Haskell flattened inner loop: 0.77x runtime of runtime data C 😎. Of course, C can benefit from this inner loop flattening also, applying the same optimization to C puts C in the lead again by 14% without vectorization, and by 60% with vectorization. Now the Haskell and C are both running the same algorithm, the gap is roughly broken down as 4/5ths auto-vectorization (Haskell doesn't auto-vectorize, and it's a big win on this problem), 1/5th residual. That 14% residual performance gap is a real performance penalty for Haskell, but it's much much smaller. For any real use case where the performance of this function is critical, you will want to reach for hand written SIMD intrinsics anyway rather than relying on the compiler to always get auto-vectorization correct. Intrinsics are available in both C and Haskell. I think what this shows more than anything else is how challenging it really is to write a good benchmark!

▲1▼

u/CuriousExplorer99 OP 1 mo ago

Thanks for this response, i created a git repo where some ppl reached me out to push their optimized version. I'm interested about your versions. Do you want to contribute ? If yes, i 'll share the link here.

▲1▼

u/NightOwlDev 1 mo ago

Sorry my code is pretty atrocious for this since I threw it together quickly to run the benchmark so it's not in great sharable form. But none of the individual optimizations are particularly complex, I'm sure an LLM can easily re-create these and verify the results if you feed it your code + my comment.

▲0▼

u/CuriousExplorer99 OP 1 mo ago

Ok, i'll do that and update the article mentioning contributions. And again thanks !

Community	/c/programming
Section	Haskell
Created	1 mo ago
Replies	4
Score	3