Recognize the common single-bit scan loop idioms in V3Unroll (before it
unrolls) and lower them to bit-reduction primitives, replacing a literal
W-iteration loop with one intrinsic-backed expression:
target=0; for (i=0;i<W;i++) if (vec[i]) target = i + 1; -> $mostsetbitp1(vec)
target=0; for (i=0;i<W;i++) if (vec[i]) target = target + 1; -> $countones(vec)
The leading-one form lowers to a new AstMostSetBitP1 node, emitted as
VL_MOSTSETBITP1_{I,Q,W}; those runtime helpers now use __builtin_clz where
available (same pattern as VL_REDXOR's __builtin_parity), with the existing
bit scan as fallback. The count-ones form reuses AstCountOnes ($countones,
popcount); as the DFG requires a 32-bit countones result it is built at 32
bits and narrowed to the accumulator width with a select.
Matching is structural to stay sound: the index must start at 0, increment
by exactly 1, and scan all W==width(vec) bits via a single 1-bit select of a
distinct vector, with the target pre-zeroed and no else branch. The loop
bound is accepted as a strict ascending 'idx < W' written either way and
signed or unsigned (Gt/GtS/Lt/LtS). Gated by -fbit-scan-loops (on at -O).
Adds t_bit_scan_loops (I/Q/W, count-ones and unsigned-index positives;
step-2, start-1, idx*2+1, vec[idx+1], target=idx and W!=width negatives, all
self-checked and asserted via --stats not to lower) plus t_bit_scan_loops_off
for the disable flag.
Motivated by a transformer inference design whose 80-bit leading-one detector
ran every cycle (~37% of runtime); the lowering is worth ~39% there.
Extend the decoder-pattern case optimization to selectors that are too
wide for a full 2^width lookup table. A decoder-pattern case (where
every case item assigns constants to a fixed set of LHSs) is lowered to
a new AstMachMasked expression. AstMachMasked is emitted as a run-time
VL_MATCHMASKEd_* function call. It contains a packed constant pool table,
'matchp', which is a list of '(mask, bits)' pairs. At runtime, the index of the
first matching entry is returned, and is used to index a value table. This single
(albeit complicated) expression can replace large if-else trees whole, resulting
in much more compact code with fewer static hard to predict branches. It
is worth about 10% speed and 30% code size in some designs.
Example:
```systemverilog
logic [39:0] sel;
always_comb
casez (sel)
40'b???????????????????????????????????????1: out = 8'h01;
40'b??????????????????????????????????????1?: out = 8'h02;
40'b?????????????????????????????????????1??: out = 8'h03;
default: out = 8'hff;
endcase
```
is compiled to:
```c++
out = TABLE_value[VL_MATCHMASKED_Q(sel, CONST_match)];
```
Where 'CONST_match' contains 4 entries, of a 40-bit mask and 40-bit bit
pattern each, and 'TABLE_value' contains 4 entries of the corresponding
8-bit results. (Entries are aligned to word boundaries to avoid runtime
bit swizzling)
Add an overload that takes `const char*` format string. This avoids
having to construct/destruct a temporary `std::string` at the call site
when calling these functions with a string literal, which is fairly
common. This is worth 25% code size on some assertion heavy design.
Also use `std::string::clear()` instead of assigning an empty string
which is more efficient.
Change WDataInP/WDataOutP to be opaque handles types instead of aliases
to raw pointers. This subsequently eliminates needing an implicit cast
operator in VlWide, which is replaced with implicit constructors of
WDataInP/WDataOutP that can create a handle from a VlWide. This
eliminates some unsafe conversions that the previous implicit cast
operator unintentionally enabled (e.g. #7618). It also eliminates
having to insert ".data()" in various places int he generated code, which
simplifies internals (the only place ".data()" should be needed is in
calls to variadic functions where the expected type of the argument is
not WDataInP/WDataOutP).
The handles otherwise behave like pointers, implementing the minimal
amount of operators required to code the runtime. The handle is still
only a single pointer, and will be passed in registers as before, so
this patch should be performance neutral.
As part of this removed WData, which used to be an alias for EData.
All uses are now either EData*, WDataInP, WDataOutP, or VlWide directly.
Removed the VlTriggerVec type, and refactored to use an unpacked array
of 64-bit words instead. This means the trigger vector and its
operations are now the same as for any other unpacked array. The few
special functions required for operating on a trigger vector are now
generated in V3SchedTrigger as regular AstCFunc if needed.
No functional change intended, performance should be the same.