Rewrite module inlining decision to be based on a bipartite Module/Cell
graph, similar to V3InlineCFuncs. Preserved all old heuristics, but
added 2 new ones:
- If a module, and all the sub-hierarchy below it, is less than 10% of
the total flattened size of the design, then flatten the contents of
that module (but the module itself is not necessarily inlined).
- If the flattened size of all instances of a module is less than 20% of
the total flattened size of the design, then inline all instances of
that module.
These are both relative to the total size of the design, so they
auto-scale with complexity. The net effect is that large shared
instances are preserved, but their contents are flattened out. E.g. in a
multi-core CPU this would keep the cores non-inlined but flatten out
most everything else. This still enables V3Combining and sharing those
later, but avoids potentially big overheads e.g. with small widely used
library modules.
Empirically this yields less generated C++ than the previous version
(due to removing lots of small functions), and can improve performance
by 10-20% while still having meaningful combining relative to the size of
the design.
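The two new size heuristics can be sketched as below. This is only an
illustration of the thresholds described above, not the actual pass:
`ModuleSizes`, `Decision` and `decide` are hypothetical names, and the
size unit is assumed to be whatever node count the pass uses.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-module summary; names are illustrative only.
struct ModuleSizes {
    uint64_t subtreeSize;       // flattened size of the module plus all hierarchy below it
    uint64_t allInstancesSize;  // flattened size summed over every instance of the module
};

struct Decision {
    bool flattenContents;     // inline everything instantiated below this module
    bool inlineAllInstances;  // inline every instance of this module into its parents
};

// Apply the two relative thresholds against the total flattened design size.
Decision decide(const ModuleSizes& m, uint64_t designSize) {
    assert(designSize > 0);
    Decision d;
    // Heuristic 1: module subtree is less than 10% of the design.
    d.flattenContents = m.subtreeSize * 10 < designSize;
    // Heuristic 2: all instances together are less than 20% of the design.
    d.inlineAllInstances = m.allInstancesSize * 5 < designSize;
    return d;
}
```

With a design of size 1000, a core of size 90 instantiated 4 times (360
in total) has its contents flattened but stays non-inlined, matching the
multi-core CPU example above.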
Fix scheduling of writes in virtual interfaces; some triggers were missing (see added test).
Make V3SchedVirtIface handle writes done inside methods called through a
virtual interface. The pass first records direct vif.member writes, VIF
method calls, and candidate interface member VarScopes. It then walks
the methods reachable from those VIF calls: writes to persistent
interface variables in those method bodies are treated as VIF writes,
and nested calls are followed with the same interface context. Function
locals, temps, and events are ignored because they are not persistent
interface storage observable through a later VIF read. Triggers are
still created only from the intersection of (interface type, member
name) writes and matching VarScopes, so unrelated interface variables
and interfaces with no virtual access do not get extra triggers.
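The final intersection step can be pictured as follows. This is a
minimal sketch using plain strings instead of AST nodes; `MemberKey`,
`VarScopeInfo` and `scopesNeedingTrigger` are hypothetical names, not
identifiers from the pass.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// (interface type, member name) pair recorded for each VIF write.
using MemberKey = std::pair<std::string, std::string>;

// Hypothetical stand-in for a candidate interface member VarScope.
struct VarScopeInfo {
    std::string ifaceType;  // interface type the variable belongs to
    std::string member;     // member name inside the interface
    std::string scopeName;  // hierarchical name of this particular VarScope
};

// Return the candidates whose (type, member) pair was written through a VIF.
// Candidates without a matching write get no trigger.
std::vector<std::string> scopesNeedingTrigger(const std::set<MemberKey>& vifWrites,
                                              const std::vector<VarScopeInfo>& candidates) {
    std::vector<std::string> result;
    for (const VarScopeInfo& vs : candidates) {
        assert(!vs.member.empty());  // candidates are expected to be named members
        if (vifWrites.count(MemberKey{vs.ifaceType, vs.member})) {
            result.push_back(vs.scopeName);
        }
    }
    return result;
}
```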
Recognize the common single-bit scan loop idioms in V3Unroll (before it
unrolls) and lower them to bit-reduction primitives, replacing a literal
W-iteration loop with one intrinsic-backed expression:
target=0; for (i=0;i<W;i++) if (vec[i]) target = i + 1; -> $mostsetbitp1(vec)
target=0; for (i=0;i<W;i++) if (vec[i]) target = target + 1; -> $countones(vec)
The leading-one form lowers to a new AstMostSetBitP1 node, emitted as
VL_MOSTSETBITP1_{I,Q,W}; those runtime helpers now use __builtin_clz where
available (same pattern as VL_REDXOR's __builtin_parity), with the existing
bit scan as fallback. The count-ones form reuses AstCountOnes ($countones,
popcount); as the DFG requires a 32-bit countones result it is built at 32
bits and narrowed to the accumulator width with a select.
Matching is structural to stay sound: the index must start at 0, increment
by exactly 1, and scan all W==width(vec) bits via a single 1-bit select of a
distinct vector, with the target pre-zeroed and no else branch. The loop
bound is accepted as a strict ascending 'idx < W' written either way and
signed or unsigned (Gt/GtS/Lt/LtS). Gated by -fbit-scan-loops (on at -O).
Adds t_bit_scan_loops (I/Q/W, count-ones and unsigned-index positives;
step-2, start-1, idx*2+1, vec[idx+1], target=idx and W!=width negatives, all
self-checked and asserted via --stats not to lower) plus t_bit_scan_loops_off
for the disable flag.
Motivated by a transformer inference design whose 80-bit leading-one detector
ran every cycle (~37% of runtime); the lowering is worth ~39% there.
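For reference, the equivalence the lowering relies on can be sketched
for a 32-bit vector as below. This is not the VL_MOSTSETBITP1_I
implementation; the function names are hypothetical, and the sketch
assumes `unsigned int` is 32 bits wide.

```cpp
#include <cassert>
#include <cstdint>

// Literal scan loop: the last set bit seen wins, so the result is the
// index of the most significant set bit plus one, or 0 if none is set.
uint32_t mostSetBitP1Loop(uint32_t vec) {
    uint32_t target = 0;
    for (uint32_t i = 0; i < 32; ++i) {
        if ((vec >> i) & 1U) target = i + 1;
    }
    return target;
}

// Intrinsic-backed form, with the loop as fallback where clz is unavailable.
uint32_t mostSetBitP1Fast(uint32_t vec) {
    if (vec == 0) return 0;  // __builtin_clz(0) is undefined, so handle it first
#if defined(__GNUC__) || defined(__clang__)
    const uint32_t result = 32U - static_cast<uint32_t>(__builtin_clz(vec));
#else
    const uint32_t result = mostSetBitP1Loop(vec);
#endif
    assert(result >= 1 && result <= 32);
    return result;
}

// Literal count loop: the accumulator is incremented once per set bit.
uint32_t countOnesLoop(uint32_t vec) {
    uint32_t target = 0;
    for (uint32_t i = 0; i < 32; ++i) {
        if ((vec >> i) & 1U) target = target + 1;
    }
    return target;
}

// Popcount form, again with the loop as fallback.
uint32_t countOnesFast(uint32_t vec) {
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<uint32_t>(__builtin_popcount(vec));
#else
    return countOnesLoop(vec);
#endif
}
```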
The non *Ovr flavours of AstShift* have better downstream constant
folding, so keep using those if proven safe. Fold overshifts explicitly
instead of introducing *Ovr shifts.
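As a rough illustration of what folding an overshift means for a
logical left shift, consider the sketch below. It assumes a value of at
most 32 bits and a known shift amount; `shiftLFolded` is a hypothetical
helper, not code from the pass.

```cpp
#include <cassert>
#include <cstdint>

// Logical left shift of a `width`-bit value (1 <= width <= 32).
// A shift by `width` or more bits folds to 0; otherwise a plain shift is
// used, which is safe because the amount is then known to be in range.
uint32_t shiftLFolded(uint32_t lhs, uint32_t amount, uint32_t width) {
    assert(width >= 1 && width <= 32);
    const uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    if (amount >= width) return 0;  // overshift: every bit is shifted out
    return (lhs << amount) & mask;  // in-range shift, truncated to width
}
```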
Remove the expression combination counts from the default stats file,
and add a new `--dump-ast-patterns` option, which will dump new
`*_ast_patterns_*.txt` files. These contain the expression combinations
in an S-expression format similar to what Dfg already produces with
`--dump-dfg-stats`. These dumps are not produced by just `--stats` as
they are fairly expensive to compute. Currently the new option will dump
at two points: just before we change to C types via widthMin usage, and
just before emit.
Previously V3InlineCFuncs inlined call sites but never deleted the now
dead callees. It also missed many opportunities due to evaluation order.
Rewrite using a graph based algorithm, using only a single traversal of
the netlist. This is clearer, more accurate, and faster at compile time.
Also add a clean -fno-inline-cfuncs disable. Setting the limits to 0
still disables inlining, except for empty functions, which are inlined
even with 0 limits (they are no-ops). Unless -fno-inline-cfuncs is
given, the pass also prunes unused functions.
The pass now also respects `--output-split`.
When a vertex is made acyclic, conservatively update the SCC map to
propagate and mark connected vertices as acyclic as much as possible.
This way we can stop early if the graph becomes acyclic after some
fixups. This can significantly reduce the number of fixups needing to be
applied, avoiding introducing redundancy.
Extend the decoder-pattern case optimization to selectors that are too
wide for a full 2^width lookup table. A decoder-pattern case (where
every case item assigns constants to a fixed set of LHSs) is lowered to
a new AstMachMasked expression. AstMachMasked is emitted as a run-time
VL_MATCHMASKED_* function call. It contains a packed constant pool table,
'matchp', which is a list of '(mask, bits)' pairs. At runtime, the index of the
first matching entry is returned, and is used to index a value table. This single
(albeit complicated) expression can replace large if-else trees whole, resulting
in much more compact code with fewer static, hard-to-predict branches. It
is worth about 10% speed and 30% code size in some designs.
Example:
```systemverilog
logic [39:0] sel;
always_comb
casez (sel)
40'b???????????????????????????????????????1: out = 8'h01;
40'b??????????????????????????????????????1?: out = 8'h02;
40'b?????????????????????????????????????1??: out = 8'h03;
default: out = 8'hff;
endcase
```
is compiled to:
```c++
out = TABLE_value[VL_MATCHMASKED_Q(sel, CONST_match)];
```
Where 'CONST_match' contains 4 entries, of a 40-bit mask and 40-bit bit
pattern each, and 'TABLE_value' contains 4 entries of the corresponding
8-bit results. (Entries are aligned to word boundaries to avoid runtime
bit swizzling.)
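The first-match lookup can be sketched as below for the example above.
This is a simplified model, not the runtime helper: `MaskBits`,
`matchMaskedSketch` and `decode` are hypothetical names, the table is a
plain struct array rather than a packed constant pool entry, and the
default arm is assumed to be encoded as a final entry with an all-zero
mask so that it always matches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One (mask, bits) pair: the entry matches when (sel & mask) == bits.
struct MaskBits {
    uint64_t mask;
    uint64_t bits;
};

// Return the index of the first matching entry. The last entry is
// assumed to be a catch-all, so its index is returned if nothing
// earlier matches.
std::size_t matchMaskedSketch(uint64_t sel, const MaskBits* table, std::size_t count) {
    assert(count > 0);
    for (std::size_t i = 0; i < count; ++i) {
        if ((sel & table[i].mask) == table[i].bits) return i;
    }
    return count - 1;
}

// Tables for the casez example: bit 0, bit 1, bit 2, then default.
static const MaskBits kMatch[4] = {{0x1, 0x1}, {0x2, 0x2}, {0x4, 0x4}, {0x0, 0x0}};
static const uint8_t kValue[4] = {0x01, 0x02, 0x03, 0xff};

// Equivalent of: out = TABLE_value[VL_MATCHMASKED_Q(sel, CONST_match)];
uint8_t decode(uint64_t sel) {
    return kValue[matchMaskedSketch(sel, kMatch, 4)];
}
```

Because the first match wins, the table order preserves the priority of
the original case items: a selector with both bit 0 and bit 1 set
decodes to 8'h01.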