Commit Graph

532 Commits

Author SHA1 Message Date
Thomas Santerre bd6b9161dc
Optimize bit-scan loops into $mostsetbitp1 / $countones (#7822)
Recognize the common single-bit scan loop idioms in V3Unroll (before it
unrolls) and lower them to bit-reduction primitives, replacing a literal
W-iteration loop with one intrinsic-backed expression:

  target=0; for (i=0;i<W;i++) if (vec[i]) target = i + 1;      -> $mostsetbitp1(vec)
  target=0; for (i=0;i<W;i++) if (vec[i]) target = target + 1; -> $countones(vec)

The leading-one form lowers to a new AstMostSetBitP1 node, emitted as
VL_MOSTSETBITP1_{I,Q,W}; those runtime helpers now use __builtin_clz where
available (same pattern as VL_REDXOR's __builtin_parity), with the existing
bit scan as fallback.  The count-ones form reuses AstCountOnes ($countones,
popcount); as the DFG requires a 32-bit countones result it is built at 32
bits and narrowed to the accumulator width with a select.

Matching is structural to stay sound: the index must start at 0, increment
by exactly 1, and scan all W==width(vec) bits via a single 1-bit select of a
distinct vector, with the target pre-zeroed and no else branch.  The loop
bound is accepted as a strict ascending 'idx < W' written either way and
signed or unsigned (Gt/GtS/Lt/LtS).  Gated by -fbit-scan-loops (on at -O).

Adds t_bit_scan_loops (I/Q/W, count-ones and unsigned-index positives;
step-2, start-1, idx*2+1, vec[idx+1], target=idx and W!=width negatives, all
self-checked and asserted via --stats not to lower) plus t_bit_scan_loops_off
for the disable flag.

Motivated by a transformer inference design whose 80-bit leading-one detector
ran every cycle (~37% of runtime); the lowering is worth ~39% there.
2026-06-24 10:43:05 +01:00
Geza Lore eafe9636cf
Internals: Dump Ast expression pattern statistics like Dfg (#7818)
Remove the expression combination counts from the default stats file,
and add a new `--dump-ast-patterns` option, which will dump new
`*_ast_patterns_*.txt` files. These contain the expression combinations
in a similar S-expression format as Dfg already produces with
`--dump-dfg-stats`. These dumps are not produced by just `--stats` as
they are fairly expensive to compute. Currently the new option will dump
at two points: just before we change to C types via widthMin usage, and
just before emit.
2026-06-21 22:17:36 +01:00
Geza Lore bcaa110f60
Optimize generated function inlining (#7811)
Previously V3InlineCFuncs inlined call sites but never deleted the now
dead callees. Also missed a lot of opportunities due to evaluation order.

Rewrite using a graph based algorithm, using only a single traversal of
the netlist. This is clearer, more accurate, and faster at compile time.

Also add a clean -fno-inline-cfuncs disable. Setting the limits to 0
still disables inlining, except of empty functions, which can be inlined
with 0 limits (they are no ops). It will also prune unused functions
without -fno-inline-cfuncs.

Pass now also respects `--output-split`
2026-06-21 18:31:56 +01:00
Wilson Snyder 5fc03ae913 Commentary: Make RST documents round-trip clean. No output change intended. 2026-06-21 10:15:47 -04:00
Igor Zaworski e269b914b2
Support NBAs in initial blocks (#7754) 2026-06-20 17:23:05 -04:00
Wilson Snyder 78d96d23ee Commentary (#7809) 2026-06-20 06:45:51 -04:00
Geza Lore a37e2ee94b
Optimize wide decoder case statements into decoder expressions (#7804)
Extend the decoder-pattern case optimization to selectors that are too
wide for a full 2^width lookup table. A decoder-pattern case (where
every case item assigns constants to a fixed set of LHSs) is lowered to
a new AstMachMasked expression. AstMachMasked is emitted as a run-time
VL_MATCHMASKEd_* function call. It contains a packed constant pool table,
'matchp', which is a list of '(mask, bits)' pairs. At runtime, the index of the 
first matching entry is returned, and is used to index a value table. This single
(albeit complicated) expression can replace large if-else trees whole, resulting
in much more compact code with fewer static hard to predict branches. It
is worth about 10% speed and 30% code size in some designs.

Example:

```systemverilog
    logic [39:0] sel;
    always_comb
      casez (sel)
        40'b???????????????????????????????????????1: out = 8'h01;
        40'b??????????????????????????????????????1?: out = 8'h02;
        40'b?????????????????????????????????????1??: out = 8'h03;
        default:                                      out = 8'hff;
      endcase
```

is compiled to:

```c++
    out = TABLE_value[VL_MATCHMASKED_Q(sel, CONST_match)];
```

Where 'CONST_match' contains 4 entries, of a 40-bit mask and 40-bit bit
pattern each, and 'TABLE_value' contains 4 entries of the corresponding
8-bit results. (Entries are aligned to word boundaries to avoid runtime
bit swizzling)
2026-06-19 19:46:13 +01:00
Wilson Snyder 749b93e405 Commentary: Use standard multiline rst comments, other cleanups 2026-06-18 21:58:01 -04:00
Geza Lore 5712f9b614
Optimize decoder case statements into lookup tables (#7795)
Recognize "decoder" case statements (where every case item only assigns
constants to a fixed set of left-hand sides) and replace them with a
single packed constant lookup table indexed by the case expression.
Small tables are materialized inline in the generated code, and are
always optimized. Larger ones are placed in the constant pool and only
optimized if deemed beneficial over branches.

While this slightly conflicts with V3Table, and is not worth that much
on it's own, there will be a follow up patch that converts more cases of
this form which will be much more valuable. This patch does the
necessary analysis and the simple table conversion when possible.

Split -fcase into -fcase-table (this new conversion) and -fcase-tree (the
existing bitwise branch-tree conversion); -fno-case is now an alias for
both.

Default branches, assignments preceding the case (used as default values),
casez wildcards, multiple and partial left-hand sides, and both blocking and
non-blocking assignments are handled. Cases that cannot be safely tabled (e.g.
non-exhaustive with no default, overlapping writes to one variable, or mixed
blocking/non-blocking assignments) fall back to the existing if/else lowering.

Consequently disabled re-inlining of constant pool variables in V3Const,
and rebuild the constant pool hash in V3Dead (previously we didn't
create constant pool entries early enough for this to matter)
2026-06-18 09:30:50 +01:00
Wilson Snyder c86816476c Commentary: Changes update 2026-06-15 17:37:49 -04:00
Geza Lore 5ab2bf1ec4
Optimize input combinational logic by change detection (#7784)
When a lot of combinational logic is driven from top level inputs,
work can be wasted evaluating that logic if the top level inputs don't
change.

This change adds an optimization by performing a change detect on the
top level inputs, and evaluate 'ico' logic only if the top level input
actually changed. This especially helps with --hierarchical/--lib-create
which runs the 'ico' of each sub-model in the eval settle loop.

This was observed to yield 40%+ run-time speedup on some partitioned
designs.

The added change detection is cheap, so it is emitted even if the 'ico'
region is small, and is on by default.

The optimization is only sound if the model itself does not write to the
top level inputs (otherwise the 'previous value' variables would be out
of sync, which are not updated by internal writes.). If we can detect a
top level input is written within the design, then for that input, we
fall back on always running the relevant logic. With --vpi we cannot
prove safety statically, so --vpi will disable this optimisation unless
explicitly enabled. (In which case it's the user's responsibility to not
write to top level inputs via the VPI.)
2026-06-15 05:42:00 +01:00
Geza Lore df1b1577d9
Deprecate isolate_assignments attribute (#7774)
As per discussion. Remove the unsound V3SplitAs pass. The
isolate_assignments attribute/directive is now parsed and ignored in the
frontend for compatibility but otherwise have no effect.

Fixes #7144
2026-06-13 19:40:29 +01:00
Wilson Snyder 816ab67826 Commentary: Changes update 2026-06-05 18:36:55 -04:00
Matthew Ballance 2886291eba
Support covergroups, coverpoints, and bins (#784) (#7117)
Fixes #784.
2026-06-05 09:35:01 -04:00
Yogish Sekhar 947a08965e
Add hierarchy-aware reporting to `verilator_coverage` (#7657) 2026-06-04 09:32:19 -04:00
Wilson Snyder 99a24c7f39 Commentary: Changes update 2026-05-30 15:16:41 -04:00
Tracy Narine a2fae5eb4b
Add `+verilator+log+file` (#4505) (#7645)
Fixes #4505.
2026-05-27 14:33:19 -04:00
Cookie 8ae0e48103
Fix false MULTIDRIVEN warning on always_ff variables (#7351) (#7621) 2026-05-27 08:34:11 -04:00
Cookie 9460501221
Add NOTREDOP error on reduction and negation operators (#7417) (#7623) (#7624) 2026-05-26 12:20:15 -04:00
Cookie 9e2fedee6f
Fix ALWCOMBORDER on variable ordering (#7350) (#7608) 2026-05-26 06:40:55 -04:00
Yogish Sekhar cf8713aebc
Add `--coverage-per-instance` 2026-05-24 18:08:55 -04:00
Wilson Snyder f0c569ab0d Fix CASEINCOMPLETE to not warn on `unique0 case` (#7647).
Fixes #7647.
2026-05-23 20:04:54 -04:00
Yogish Sekhar f282335600
Support FSM detection in primitive wrappers (#7607) 2026-05-21 13:50:31 -04:00
Yilou Wang 00c9e58006
Fix internal error on consecutive repetition with N > 256 (#7552) (#7603) 2026-05-17 21:54:10 -04:00
Muzaffer Kal 9fe058677b
Support NBAs in initial blocks with delay/event controls (#7566) (#7600)
Fixes #7566.
2026-05-17 07:34:29 -04:00
Yogish Sekhar 8312e9d901
Extend FSM Detect to support 'Wide State Encodings' (#7573) 2026-05-13 06:59:22 -04:00
Yu-Sheng Lin 0ebe01a778
Support new FST writer API (#6871) (#6992) 2026-05-12 07:39:43 -04:00
Wilson Snyder d2047e5bad Commentary: Changes update 2026-05-11 19:50:48 -04:00
Cookie cf9334f2c1
Fix error on mixed-initialization (#7352) (#7357) 2026-05-11 18:32:55 -04:00
Yogish Sekhar f67159de30
Extend FSM coverage detection to case-free FSMs - Use - if/else chains (#7561) 2026-05-10 13:12:58 -04:00
Artur Bieniek c69c11b2db
Support procedural continuous assign/deassign (#7493) 2026-05-08 19:01:11 -04:00
Miguel 0e423a4b39
Fix `+verilator+seed` to default to 1, and 0 to randomly select (#7325) (#7516) 2026-05-05 12:10:51 -04:00
Wilson Snyder d2ae094d43 Commentary (#7532) (#7533) 2026-05-04 18:01:55 -04:00
Igor Zaworski 25d4827bd5
Internals: Four state pre-pull (types) (#7520) 2026-04-30 16:56:15 -04:00
Wilson Snyder 00211c290c Commentary 2026-04-25 11:41:30 -04:00
Wilson Snyder 5064a5ee65 Commentary: Changes update 2026-04-23 00:44:50 -04:00
Yogish Sekhar a680919edc
Support native FSM state and arc coverage (#7412) 2026-04-22 15:18:59 -04:00
Geza Lore 2b9d006097
Change Dfg pattern dumps to use --dump-dfg-patterns (#7455)
Dumping Dfg patterns can take a non-trivial amount of time, so do it
only with --dump-dfg-patterns, instead of with --stats.
Also further improve dumping format.
2026-04-21 12:07:19 +01:00
Wilson Snyder 23ea3d7f11 Commentary: Changes update 2026-04-21 00:33:40 -04:00
Geza Lore 97454a1bc5
Remove multi-threaded FST tracing (#7443)
Remove parallel (using the FST library writer thread) and offloaded
(separate Verilator internal thread) tracing (only used by FST). These
are not compatible with #6992, and #5806 should yield better performance
in all cases.

Consequently mark '--trace-threads' and '--trace-fst-thread' options as
deprecated
2026-04-19 16:02:12 +01:00
Wilson Snyder 707dcea914 Commentary/Tests: Describe PARAMNODEFAULT as top.
Fixes #7441.
2026-04-18 11:34:19 -04:00
Wilson Snyder 1011ea86fa Commentary (#7428) (#7432) 2026-04-15 17:45:48 -04:00
Wilson Snyder ecf6d9b674 Commentary: Changes update 2026-04-09 17:50:40 -04:00
Geza Lore 9f9532ff78
Optimize Dfg only once, after V3Scope (#7362) 2026-04-09 08:31:12 -04:00
Artur Bieniek 8c11d0d0bd
Support rise/fall delays (#7368)
Signed-off-by: Artur Bieniek <abieniek@antmicro.com>
2026-04-07 06:44:52 -04:00
Wilson Snyder 33493cf5b4 Add `+verilator+solver+file` (#7242).
Fixes #7242.
2026-04-04 17:26:43 -04:00
Wilson Snyder 947cbaf330 Deprecate `--structs-packed` (#7222). 2026-03-21 10:59:27 -04:00
Igor Zaworski 907e775aa6
Internals: Add `--fourstate` flag and FUTURE warning (#7279) 2026-03-18 13:45:36 -04:00
Wilson Snyder de2c891ca5 Commentary: Changes update 2026-03-16 22:21:51 -04:00
Yangyu Chen bb5a9dc247
Support jemalloc as the default allocator on Linux (#7250)
Add jemalloc as an alternative malloc implementation for the Verilator
binary. When both tcmalloc and jemalloc are available, jemalloc is
preferred due to its better performance on RTLMeter.

The new --enable-jemalloc flag (default=check) mirrors the existing
--enable-tcmalloc behavior: auto-detected at configure time, supports
both static and dynamic linking, and is disabled when --enable-dev-asan
is active.
2026-03-13 17:08:15 -04:00