Add a new pass to split up (recursively):
foo = {l, r};
into the following, with the right indices, iff the concatenation
straddles a wide word boundary.
foo[_:_] = r;
foo[_:_] = l;
This eliminates more wide temporaries.
Another 23% speedup on VeeR EH2 high_perf. Also brings the predicted
stack size from 8M to 40k.
The DFG peephole pass converts all associative trees into right leaning,
which is good for simplifying pattern recognition, but can lead to an
excessive amount of wide intermediate results being constructed for
right leaning concatenations.
Add a new pass to balance concatenation trees by trying to:
- Create VL_EDATASIZE (32-bit) sub-terms, so words can then be packed
easily afterwards
- Try to ensure the operands of a concat are roughly the same width
within a concatenation tree. This does not yield the shortest tree,
but it ensures it has many sub-nodes that are small enough to fit into
machine registers.
This can eliminate a lot of wide intermediate results, which would need
temporaries, and also increases ILP within sub-expressions (assuming the
C compiler can't figure that out itself).
This is over 2x run-time speedup on the high_perf configuration of
VeeR EH2 (which you could arguably also get with -fno-dfg, but oh well).