After conversion of Ast to Dfg, but before synthesizing AstAlways into
primitives, run a pass to remove variables that are not observable, and
all logic that only computes such variables. This can get rid of a lot
of content early so we don't build redundant Dfgs, and also enables
synthesizing always blocks that use temporaries only in some branches,
which will come in a follow up.
This patch adds IEEE-1800 compliant scheduling support for the Inactive
scheduling region used for #0 delays.
Implementing this requires that **all** IEEE-1800 active region events
are placed in the internal 'act' section. This has simulation
performance implications. It prevents some optimizations (e.g.
V3LifePost), which reduces single threaded performance. It also reduces
the available work and parallelism in the internal 'nba' section, which
reduced the effectiveness of multi-threading severely.
Performance impact on RTLMeter when using scheduling adjusted to support
proper #0 delays is ~10-20% slowdown in single-threaded mode, and ~100%
(2x slower) with --threads 4.
To avoid paying this performance penalty unconditionally, the scheduling
is only adjusted if either:
1. The input contains a statically known #0 delay
2. The input contains a variable #x delay unknown at compile time
If no #0 is present, but #x variable delays are, a ZERODLY warning is
issued advising the use of '--no-sched-zero-delay' which is a promise
by the user that none of the variable delays will evaluate to a zero
delay at run-time. This warning is turned off if '--sched-zero-delay'
is explicitly given. This is similar to the '--timing' option.
If '--no-sched-zero-delay' was used at compile time, then executing
a zero delay will fail at runtime.
A ZERODLY warning is also issued if a static #0 if found, but the user
specified '--no-sched-zero-delay'. In this case the scheduling is not
adjusted to support #0, so executing it will fail at runtime. Presumably
the user knows it won't be executed.
The intended behaviour with all this is the following:
No #0, no #var in the design (#constant is OK)
-> Same as current behaviour, scheduling not adjusted,
same code generated as before
Has static #0 and '--no-sched-zero-delay' is NOT given:
-> No warnings, scheduling adjusted so it just works, runs slow
Has static #0 and '--no-sched-zero-delay' is given:
-> ZERODLY on the #0, scheduling not adjusted, fails at runtime if hit
No static #0, but has #var and no option is given:
-> ZERODLY on the #var advising use of '--no-sched-zero-delay' or
'--sched-zero-delay' (similar to '--timing'), scheduling adjusted
assuming it can be a zero delay and it just works
No static #0, but has #var and '--no-sched-zero-delay' is given:
-> No warning, scheduling not adjusted, fails at runtime if zero delay
No static #0, but has #var and '--sched-zero-delay' is given:
-> No warning, scheduling adjusted so it just works
AstCAwait is only ever uses in statement position, so model it as a
statement. We should never ever have a coroutine that returns a value.
There is no need for it in SV, nor should we rely on it for internals.
Also reworks the fix for V3Life incorrectly constant propagating the
beforeTrig functions (#7072). The property that upsets V3Life is that
a function:
1. Is called from multiple static call sites (multiple AstCCall)
2. Reads model state directly (AstVarRef to non-locals/arguments)
Such function can only be created internally after scheduling (V3Task
throws an unsupported error on a non-inlined function that reads model
state), so added a flag to AstCFunc to mark the dangerous ones for
V3Life.
Many rules in the Dfg Peephole pass check if a node has more than one
sinks. Redundant variables that will ultimately be removed can prevent
these from matching. Remove such variables during the Peeophole pass
itself to enable more matches.
This is an attempt to generate an identical trace file scope hierarchy
both with and without -fno-inline. Primarily because it's needed for
testing in upcoming patch, but also improves consitency prior to #7001
This is primarily cleanup, but there are 2 functional changes included:
- It used to accidentally reorder bodies of AstNodeIf that were outside
an AstAlways. Now it will not touch anything outside an AstAlways.
- Removed one redundant edge from the graph which perturbs the result of
V3Graph::acyclic. This should make no difference for the actual
intended result of reordering NBAs to eliminate shadow variables.
Add a new Dfg pass 'pushDownSel'. This will try to move selects through
a tree of concatenations in order to eliminate temporary nodes holding
intermediate concatenation results. This can get rid of a lot of
variables when packed arrays are assigned in parts (e.g. bit-wise).
Use uint32_t max value instead of zero as sentinel value for a trace
code being unassigned. Prep for follow on patch.
Note the actual trace file will still start codes from one, the codes
in the model are just an offset from the base code.
We use special C++ types for ports, e.g. SystemC types in --sc mode, and native C arrays for unpacked arrays in --cc mode. These types are not substitutable for internal types, e.g. VlUnpacked, however all the runtime primitives expect internal types.
I think the intention was to use these special IO types only for top level ports, but the current implementation also uses them for the ports of all non-inlined modules. This means the output C++ will not compile if such a port is passed to a runtime primitive (e.g. array 'sort' as in the new test) or DPI import.
Changed to use the special IO types only on the top level ports.
Note these are likely still broken if attempting to invoke on a top level port (we might be saved by wrapTop, but later optimizations might eliminate the intermediary)
Re-inline ConstPool entries in V3Subst that have been expanded into
word-wise accessed by V3Expand. This enables downstream constant folding
on the word-wise expressions.
As V3Subst now understands ConstPool entries, we can also omit expanding
straight assignments with a ConstPool entry on the RHS. This allows the
C++ compiler to see the memcpy directly.
V3Expand wide SHIFTL and SHIFTR if the shift amount is know and is a
multiple of VL_EDATA_SIZE. This case results in each word requiring a
simple copy from the original, or store of a constant zero, which
subsequent V3Subst can then eliminate.
A temporary introduced by V3Premit could not be eliminated in V3Subst if
it was involved in an expression that did a write back to a
non-temporary. To enables removing these, we need to track all variables
in V3Subst, not just the ones we would consider for elimination. Note
the new implementation is marginally faster than the old one even though
it does more work. It can eliminate ~5% more of wide temporaries on some
designs. Algorithm is largely the same.
Concatenations that are only used by Sel expressions that do not consume
some bits on the edges can be narrowed to not compute the unused bits.
E.g.: `{a[4:0], b[4:0]}[5:4]` -> `{a[0], b[4]}[1:0]`
This is a superset or the PUSH_SEL_THROUGH_CONCAT DFG pattern, which is
removed.
Minor performance improvement, especially for assertions heavy code.
Strings are often used as temporaries in unlikely branches. Do not
localize them to avoid an unnecessary initialization on function entry.