Previously V3InlineCFuncs inlined call sites but never deleted the
now-dead callees. It also missed many opportunities due to evaluation order.
Rewrite using a graph-based algorithm that needs only a single traversal of
the netlist. This is clearer, more accurate, and faster at compile time.
Also add a clean -fno-inline-cfuncs disable. Setting the limits to 0
still disables inlining, except for empty functions, which are inlined
even with 0 limits (they are no-ops). Unless -fno-inline-cfuncs is given,
the pass also prunes unused functions.
The pass now also respects `--output-split`.
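The pruning step can be pictured as a reachability walk over a call graph. The following is only a minimal sketch, not the V3InlineCFuncs implementation: the `CallGraph` type and `deadFunctions` helper are hypothetical names, and functions are identified by plain strings for illustration.

```c++
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical call graph: function name -> names of the functions it calls
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Return every function in 'graph' that is not reachable from 'roots'
std::set<std::string> deadFunctions(const CallGraph& graph,
                                    const std::vector<std::string>& roots) {
    std::set<std::string> live;
    std::vector<std::string> work(roots);
    while (!work.empty()) {
        const std::string name = work.back();
        work.pop_back();
        if (!live.insert(name).second) continue;  // Already visited
        const auto it = graph.find(name);
        if (it == graph.end()) continue;  // No recorded callees
        for (const std::string& callee : it->second) work.push_back(callee);
    }
    // Anything never reached has no remaining callers and can be deleted
    std::set<std::string> dead;
    for (const auto& entry : graph) {
        if (!live.count(entry.first)) dead.insert(entry.first);
    }
    return dead;
}
```

In this picture, a callee whose every call site has been inlined loses all of its incoming edges and therefore shows up in the dead set.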
Extend the decoder-pattern case optimization to selectors that are too
wide for a full 2^width lookup table. A decoder-pattern case (where
every case item assigns constants to a fixed set of LHSs) is lowered to
a new AstMatchMasked expression. AstMatchMasked is emitted as a run-time
VL_MATCHMASKED_* function call. It contains a packed constant pool table,
'matchp', which is a list of '(mask, bits)' pairs. At run time, the index of the
first matching entry is returned and used to index a value table. This single
(albeit complicated) expression can replace large if-else trees whole, resulting
in much more compact code with fewer static, hard-to-predict branches. It
is worth about 10% speed and 30% code size in some designs.
Example:
```systemverilog
logic [39:0] sel;
always_comb
casez (sel)
40'b???????????????????????????????????????1: out = 8'h01;
40'b??????????????????????????????????????1?: out = 8'h02;
40'b?????????????????????????????????????1??: out = 8'h03;
default: out = 8'hff;
endcase
```
is compiled to:
```c++
out = TABLE_value[VL_MATCHMASKED_Q(sel, CONST_match)];
```
Here 'CONST_match' contains 4 entries, each with a 40-bit mask and a
40-bit bit pattern, and 'TABLE_value' contains 4 entries with the
corresponding 8-bit results. (Entries are aligned to word boundaries to
avoid run-time bit swizzling.)
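The first-match lookup can be sketched as below. This is a simplified illustration, not the actual VL_MATCHMASKED_* run-time code: `MatchEntry`, `matchMasked`, and `decode` are hypothetical names, the table is a plain vector rather than a packed constant pool entry, and the 'default' item is assumed to be encoded as a trailing entry with an all-zero mask.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical entry layout: one (mask, bits) pair per case item
struct MatchEntry {
    uint64_t mask;  // 1 = selector bit is compared, 0 = wildcard
    uint64_t bits;  // Expected value of the compared bits
};

// Return the index of the first entry that matches 'sel'. The last entry
// is assumed to be a catch-all (mask == 0) standing in for 'default'.
size_t matchMasked(uint64_t sel, const std::vector<MatchEntry>& table) {
    assert(!table.empty() && table.back().mask == 0);
    for (size_t i = 0; i < table.size(); ++i) {
        if ((sel & table[i].mask) == table[i].bits) return i;
    }
    return table.size() - 1;  // Not reached, given the catch-all entry
}

// The casez example, expressed with the sketch above
uint8_t decode(uint64_t sel) {
    static const std::vector<MatchEntry> match = {
        {0x1, 0x1}, {0x2, 0x2}, {0x4, 0x4}, {0x0, 0x0}};
    static const uint8_t value[] = {0x01, 0x02, 0x03, 0xff};
    return value[matchMasked(sel, match)];
}
```

Because entries are tested in order, a selector with several low bits set resolves to the earliest case item, which preserves casez priority.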
Recognize "decoder" case statements (where every case item only assigns
constants to a fixed set of left-hand sides) and replace them with a
single packed constant lookup table indexed by the case expression.
Small tables are materialized inline in the generated code, and are
always optimized. Larger ones are placed in the constant pool and only
optimized if deemed beneficial over branches.
While this slightly conflicts with V3Table, and is not worth that much
on its own, there will be a follow-up patch that converts more cases of
this form, which will be much more valuable. This patch does the
necessary analysis and the simple table conversion when possible.
Split -fcase into -fcase-table (this new conversion) and -fcase-tree (the
existing bitwise branch-tree conversion); -fno-case is now an alias for
both.
Default branches, assignments preceding the case (used as default values),
casez wildcards, multiple and partial left-hand sides, and both blocking and
non-blocking assignments are handled. Cases that cannot be safely tabled (e.g.
non-exhaustive with no default, overlapping writes to one variable, or mixed
blocking/non-blocking assignments) fall back to the existing if/else lowering.
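As a rough illustration of the table conversion, consider a hypothetical 3-bit casez with one wildcard item and a default. The names `kOutTable`, `decodeTable`, and `decodeBranch` are made up for this sketch and do not reflect the code Verilator actually emits.

```c++
#include <array>
#include <cstdint>

// Hypothetical source (sel is 3 bits wide):
//   casez (sel)
//     3'b000:  out = 8'h10;
//     3'b01?:  out = 8'h20;  // Wildcard covers indices 2 and 3
//     default: out = 8'hff;
//   endcase
// Every possible selector value gets one precomputed table entry.
constexpr std::array<uint8_t, 8> kOutTable = {
    0x10, 0xff, 0x20, 0x20, 0xff, 0xff, 0xff, 0xff};

// Table form: a single indexed load, no branches
uint8_t decodeTable(uint8_t sel) { return kOutTable[sel & 0x7]; }

// Branch form of the same case statement, for comparison
uint8_t decodeBranch(uint8_t sel) {
    sel &= 0x7;
    if (sel == 0x0) return 0x10;
    if ((sel & 0x6) == 0x2) return 0x20;
    return 0xff;
}
```

The wildcard item simply fills every index it covers, and the default fills whatever is left, so the table is exhaustive by construction.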
Consequently, disable re-inlining of constant pool variables in V3Const,
and rebuild the constant pool hash in V3Dead (previously we didn't
create constant pool entries early enough for this to matter).
When a lot of combinational logic is driven from top-level inputs,
work can be wasted evaluating that logic if the top-level inputs don't
change.
This change adds an optimization that performs change detection on the
top-level inputs and evaluates 'ico' logic only if a top-level input
actually changed. This especially helps with --hierarchical/--lib-create,
which runs the 'ico' of each sub-model in the eval settle loop.
This was observed to yield 40%+ run-time speedup on some partitioned
designs.
The added change detection is cheap, so it is emitted even if the 'ico'
region is small, and is on by default.
The optimization is only sound if the model itself does not write to the
top-level inputs (otherwise the 'previous value' variables, which are not
updated by internal writes, would be out of sync). If we can detect that a
top-level input is written within the design, then for that input we
fall back on always running the relevant logic. With --vpi we cannot
prove safety statically, so --vpi disables this optimization unless it is
explicitly enabled (in which case it is the user's responsibility not to
write to top-level inputs via the VPI).
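The idea can be sketched with a toy model. This is an assumption-laden illustration, not the generated code: the `Model` struct, its member names, and the `icoRuns` counter are all hypothetical.

```c++
#include <cstdint>

// Toy model with two top-level inputs and one combinational output
struct Model {
    uint8_t a = 0;    // Top-level input, written by the testbench
    uint8_t b = 0;    // Top-level input, written by the testbench
    uint8_t sum = 0;  // Driven combinationally from 'a' and 'b'

    // Hypothetical 'previous value' copies used for change detection
    uint8_t prevA = 0;
    uint8_t prevB = 0;
    bool first = true;     // Force one evaluation after construction
    unsigned icoRuns = 0;  // Counts 'ico' evaluations, for illustration

    void evalIco() {
        sum = static_cast<uint8_t>(a + b);
        ++icoRuns;
    }

    void eval() {
        // Run the input-driven logic only if an input actually changed
        const bool changed = first || a != prevA || b != prevB;
        if (!changed) return;
        prevA = a;
        prevB = b;
        first = false;
        evalIco();
    }
};
```

If the model itself wrote to `a`, the `prevA` copy would not see that write and a later testbench write of the old value would be missed, which is why such inputs must always trigger evaluation.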
As per discussion, remove the unsound V3SplitAs pass. The
isolate_assignments attribute/directive is now parsed and ignored in the
frontend for compatibility, but otherwise has no effect.
Fixes #7144
Dumping Dfg patterns can take a non-trivial amount of time, so do it
only with --dump-dfg-patterns, instead of with --stats.
Also further improve dumping format.
Remove parallel (using the FST library writer thread) and offloaded
(separate Verilator internal thread) tracing (only used by FST). These
are not compatible with #6992, and #5806 should yield better performance
in all cases.
Consequently, mark the '--trace-threads' and '--trace-fst-thread' options
as deprecated.
Add jemalloc as an alternative malloc implementation for the Verilator
binary. When both tcmalloc and jemalloc are available, jemalloc is
preferred due to its better performance on RTLMeter.
The new --enable-jemalloc flag (default=check) mirrors the existing
--enable-tcmalloc behavior: auto-detected at configure time, supports
both static and dynamic linking, and is disabled when --enable-dev-asan
is active.