In order to avoid unexpected breakage on multi-driven variables, we
resolve in DFG construction by using only the first driver encountered.
Also issues the MULTIDRIVEN error for these signals.
Replace the 'run to fixed point' algorithm with a work list driven
approach. Instead of marking the graph as changed, we explicitly add
vertices to the work list, to be visited, when a vertex is changed. This
improves both memory locality (as the work list is processed in last in
first out order), and removed unnecessary visitations when only a few
nodes changes.
Folding an AstLogAnd with a non-zero constant operand used to coerce the
type of the other operand, yielding an ill-typed node that DFG was then
unhappy about. Add a RedOr instead if the width of the replacement
operand is greater than zero.
Fixes#3726
Apart from the representational changes below, this patch renames
AstNodeMath to AstNodeExpr, and AstCMath to AstCExpr.
Now every expression (i.e.: those AstNodes that represent a [possibly
void] value, with value being interpreted in a very general sense) has
AstNodeExpr as a super class. This necessitates the introduction of an
AstStmtExpr, which represents an expression in statement position, e.g :
'foo();' would be represented as AstStmtExpr(AstCCall(foo)). In exchange
we can get rid of isStatement() in AstNodeStmt, which now really always
represent a statement
Peak memory consumption and verilation speed are not measurably changed.
Partial step towards #3420
In V3Active, we try hard to turn `always @(a or b or c)` into an
`always_comb` if the only variables read in the block are also in the
sensitivity list. In addition, also allow this optimization when reading
variables that are not in the sensitivity list, but are known to be
constant/never changing after initialization. In particular lookup
tables introduced by V3Table are covered by this. This can have a
significant impact on designs that use the `always @(a or b or c)` style
for combinational logic.
The cost of an AstCMethodHard right now is generally unknown. However,
VlTriggerVec::at is used a lot in conditions, so we make an effort
to estimate this correctly via 2 changes:
- In general when an AstVarRef appears as the target of an
AstCMethodHard, we cost it as a simple address computation (an add)
- Check for VlTriggerVec::at explicitly when costing AstCMethodHard,
which is essentially a load.
This can have a significant effect when there are a lot of unique
triggers in the design.
In non-static contexts like class objects or stack frames, the use of
global trigger evaluation is not feasible. The concept of dynamic
triggers allows for trigger evaluation in such cases. These triggers are
simply local variables, and coroutines are themselves responsible for
evaluating them. They await the global dynamic trigger scheduler object,
which is responsible for resuming them during the trigger evaluation
step in the 'act' eval region. Once the trigger is set, they await the
dynamic trigger scheduler once again, and then get resumed during the
resumption step in the 'act' eval region.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
This changeset brings support for accesses like:
class Cls#(type TYPE1);
TYPE1::some_method();
endclass
It is done by delaying dot resolution on type parameters until they get
resolved by V3Param, and doing a more thorough reference skip.
In DFG DfgVertex::width() is only defined for vertices representing
packed values, which DfgVertex::hash() used to violate. The only
non-packed values at the moment are DfgVarArray, which is a
DfgVertexVar, which are handled specially anyway, so this is easy to
fix.
Fixes#3682
In order to not leak signal names with --protect-ids, we simply make the
trigger dump function empty (this is a debug only construct).
Partial fix for #3689
The emitted SystemC types (e.g. sc_bv) are not interchangeable with
Verilator internal C++ types (e.g.: VlWide), so the variables themselves
are not interchangeable (but can be assigned to/from each other). We can
preserve correctness simply be not inlining any SystemC variables (i.e.:
don't simplify any 'sc = nonSc' or 'nonSc = sc' assignments). SystemC
types only appear at top level ports so this should have no significant
impact.
Fixes#3688
Prevents the possibility of assigning an integer to a class reference,
both at the SystemVerilog and the emitted C++ levels.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Vertices representing variables (DfgVertexVar) and constants (DfgConst)
are very common (40-50% of all vertices created in some large designs),
and we also need to, or can treat them specially in algorithms. Keep
these as separate lists in DfgGraph for direct access to them. This
improve verilation speed.
Cyclic components are now extracted separately, so there is no
functional reason to have to do a topological sort (previously we used it
to detect cyclic graphs). Removing it to gain some speed.
AstSel is a ternary node, but the 'widthp' is always constant and is
hence redundant, and 'lsbp' is very often constant. As AstSel is fairly
common, we special case as a DfgSel for the constant 'lsbp', and as
'DfgMux` for the non-constant 'lsbp'.
Added a DfgVertex::user() mechanism for storing data in vertices.
Similar in spirit to AstNode user data, but the generation counter is
stored in the DfgGraph the vertex is held under. Use this to cache
DfgVertex::hash results, and also speed up DfgVertex hashing in general.
Use these and additional improvements to speed up CSE.
`V3SchedTiming` currently assumes that if a fork still exists, it must
have statements within it (otherwise it would have been deleted by
`V3Timing`). However, in a case like this:
```
module t;
reg a;
initial fork a = 1; join
endmodule
```
the assignment in the fork is optimized out by `V3Dead` after
`V3Timing`. This leads to `V3SchedTiming` accessing fork's `stmtsp`
pointer, which at this point is null. This patch addresses that issue.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Allow constant folding through adjacent nodes of all associative
operations, for example '((a & 2) & 3)' or '(3 & (2 & a))' can now be
folded into '(a & 2)' and '(2 & a)' respectively. Also improve speed of
making associative expression trees right leaning by using rotation of
the existing vertices whenever instead of allocation of new nodes.
Only apply when there is guaranteed to be a subsequent constant folding
and elimination of some of the expression, otherwise this sometimes
interferes with the simplification of concatenations and harms overall
performance.
Before this change, a design verilated with `--timing` that does not
actually use timing features would be emitted with `eventsPending` and
`nextTimeSlot` declared in the top class. However, their definitions
would be missing, leading to linker errors during design compilation.
This patch makes Verilator always emit the definitions, which prevents
linker errors. Trying to use `nextTimeSlot` without delays in the design
will result in an error at runtime.
Also added a testing only -fno-const-before-dfg option, as otherwise
V3Const eats up a lot of the simple inputs. A lot of the things V3Const
swallows in the simple cases can make it to DFG in complex cases, or DFG
itself can create them during optimization. In any case to save
complexity of testing DFG constant folding, we use this option to turn
off V3Const prior to the DFG passes in the relevant test.
Some optimizations are only a net win if they help us remove a graph
node (or at least ensure they don't grow the graph), or yields otherwise
special logic, so try to apply them only in these cases.
Use the same style, and reuse the bulk of astgen to generate DfgVertex
related code. In particular allow for easier definition of custom
DfgVertex sub-types that do not directly correspond to an AstNode
sub-type. Also introduces specific names for the fixed arity vertices.
No functional change intended.
A lot of optimizations in DFG assume a DAG, but the more things are
representable, the more likely it is that a small cyclic sub-graph is
present in an otherwise very large graph that is mostly acyclic. In
order to avoid loosing optimization opportunities, we explicitly extract
the cyclic sub-graphs (which are the strongly connected components +
anything feeing them, up to variable boundaries) and treat them
separately. This enables optimization of the remaining input.
This change introduces a custom reference-counting pointer class that
allows creating such pointers from 'this'. This lets us keep the
receiver object around even if all references to it outside of a class
method no longer exist. Useful for coroutine methods, which may outlive
all external references to the object.
The deletion of objects is deferred until the next time slot. This is to
make clearing the triggered flag on named events in classes safe
(otherwise freed memory could be accessed).
Added DfgVertexVariadic to represent DFG vetices with a varying number
of source operands. Converted DfgVar to be a variadic vertex, with each
driver corresponding to a fixed range of bits in the packed variable.
This allows us to handle AstSel on the LHS of assignments. Also added
support for AstConcat on the LHS by selecting into the RHS as
appropriate.
This improves OpenTitan ST speed by ~13%
This is only a debugging aid at this point, so compile out of the
release build. This reduces peak memory consumption by 4-5%. We still
keep the global counters to detect the tree have changed, to avoid
unnecessary dumps.
Multiple tricks to reduce the size of class FileLine from 72 to 40
bytes:
- Reduce file name index from 32 to 16 bits. This still allows 64K
unique input files, which is hopefully enough.
- Intern message/warning enable bitset and use a 16-bit index, again
allowing 64K unique sets which is hopefully enough.
- Put the m_waive flag into the sign bit of one of the line numbers.
- Use explicit reference counting to avoid overhead of shared_ptr.
Added assertions to ensure interned data fits within it's index space.
This saves ~5-10% peak memory consumption at no measurable run-time cost
on various designs.
Added a new data-flow graph (DFG) based combinational logic optimizer.
The capabilities of this covers a combination of V3Const and V3Gate, but
is also more capable of transforming combinational logic into simplified
forms and more.
This entail adding a new internal representation, `DfgGraph`, and
appropriate `astToDfg` and `dfgToAst` conversion functions. The graph
represents some of the combinational equations (~continuous assignments)
in a module, and for the duration of the DFG passes, it takes over the
role of AstModule. A bulk of the Dfg vertices represent expressions.
These vertex classes, and the corresponding conversions to/from AST are
mostly auto-generated by astgen, together with a DfgVVisitor that can be
used for dynamic dispatch based on vertex (operation) types.
The resulting combinational logic graph (a `DfgGraph`) is then optimized
in various ways. Currently we perform common sub-expression elimination,
variable inlining, and some specific peephole optimizations, but there
is scope for more optimizations in the future using the same
representation. The optimizer is run directly before and after inlining.
The pre inline pass can operate on smaller graphs and hence converges
faster, but still has a chance of substantially reducing the size of the
logic on some designs, making inlining both faster and less memory
intensive. The post inline pass can then optimize across the inlined
module boundaries. No optimization is performed across a module
boundary.
For debugging purposes, each peephole optimization can be disabled
individually via the -fno-dfg-peepnole-<OPT> option, where <OPT> is one
of the optimizations listed in V3DfgPeephole.h, for example
-fno-dfg-peephole-remove-not-not.
The peephole patterns currently implemented were mostly picked based on
the design that inspired this work, and on that design the optimizations
yields ~30% single threaded speedup, and ~50% speedup on 4 threads. As
you can imagine not having to haul around redundant combinational
networks in the rest of the compilation pipeline also helps with memory
consumption, and up to 30% peak memory usage of Verilator was observed
on the same design.
Gains on other arbitrary designs are smaller (and can be improved by
analyzing those designs). For example OpenTitan gains between 1-15%
speedup depending on build type.
- Rename `--dump-treei` option to `--dumpi-tree`, which itself is now a
special case of `--dumpi-<tag>` where tag can be a magic word, or a
filename
- Control dumping via static `dump*()` functions, analogous to `debug()`
- Make dumping independent of the value of `debug()` (so dumping always
works even without the debug flag)
- Add separate `--dumpi-graph` for dumping V3Graphs, which is again a
special case of `--dumpi-<tag>`
- Alias `--dump-<tag>` to `--dumpi-<tag> 3` as before
Use astgen to generate a more thorough version of AstNode::checkTree,
which checks that operands are or consistent structure and type, as
described in the @astgen op directives. Also change checkTree to always
run when --debug-check is given.
Fix discovered fallout.
Introduce the @astgen directives parsed by astgen, currently used for
the generation child node (operand) accessors. Please see the updated
internal documentation for details.
Introduce the @astgen directives parsed by astgen, currently used for
the generation child node (operand) accessors. Please see the updated
internal documentation for details.
This approach reduced total time of V3Undriven stage from 34,2s to 2,5s
in design containing almost 400 000 unused variables.
Signed-off-by: Kamil Rakoczy <krakoczy@antmicro.com>
Generate type specific static overloads of Ast<Node>::addNext, which
return the correct sub-type of the 'this' they were invoked on.
Also remove AstNode::addNextNull, which is now only used in the parser,
implement in verilog.y directly as a template function.
- Move DType representations into V3AstNodeDType.h
- Move AstNodeMath and subclasses into V3AstNodeMath.h
- Move any other AstNode subtypes into V3AstNodeOther.h
- Fix up out-of-order definitions via inline methods and implementations
in V3Inlines.h and V3AstNodes.cpp
- Enforce declaration order of AstNode subtypes via astgen,
which will now fail when definitions are mis-ordered.
Rely less on strings and represent AstNode classes as a 'class Node',
with all associated properties kept together, rather than distributed
over multiple dictionaries or constructed at retrieval time.
No functional change intended.
Small fixup patch so the 'ico' and 'act' scheduling sections could be
ordered as multi-threaded. However, we still only order these single
threaded at the moment (but switching them to multi-threaded now works).
Before this change, some forked processes were being inlined in
`V3Timing` because they contained no `CAwait`s. This only works under
the assumption that no `CAwait`s will be added there later, which is not
true, as a function called by a forked process could be turned into a
coroutine later. The call would be wrapped in a new `CAwait`, but the
process itself would have already been inlined at this point.
This commit moves the inlining to `transformForks` in `V3SchedTiming`,
which is called at a point when all `CAwait`s are already in place.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
The recent patch to defer substitutions on V3Gate crashes on circular
logic that has cycle length >= 3 with all inlineable signals (cycle
length 2 is detected correctly and is not inlined). Fix by stopping
recursion at the loop-back edge.
Fixes#3543
This is detritus from when V3TraceDecl used to run after V3Gate, today
V3TraceDecl runs before V3Gate and this value has no function at all.
No functional change intended.
dynamic_cast is not free. Replace obvious instances (where the result is
unconditionally dereferenced) with static_cast in contexts with
performance implications.
Replace std::set<SiblingMC> with V3Lists to keep track of SiblingMCs
associated with MTasks, use a std::set<LogicMTask*> for ensuring
uniqueness. This yields a bit more speed in PartContraction.
- Use modern C++
- Implement OrderLogicVertex->LogicMTask map with
OrderLogicVertex::userp(), insteas of std::unordered_map
- Simplify data structures
- Simplify code and assert properties
No functional change.
Refactor ProcessMoveBuildGraph utilizing the fact that OrderGraph is a
bipartite graph, also remove unnecessary unordered_map and distribute
variable domain map. No functional change.
Adds timing support to Verilator. It makes it possible to use delays,
event controls within processes (not just at the start), wait
statements, and forks.
Building a design with those constructs requires a compiler that
supports C++20 coroutines (GCC 10, Clang 5).
The basic idea is to have processes and tasks with delays/event controls
implemented as C++20 coroutines. This allows us to suspend and resume
them at any time.
There are five main runtime classes responsible for managing suspended
coroutines:
* `VlCoroutineHandle`, a wrapper over C++20's `std::coroutine_handle`
with move semantics and automatic cleanup.
* `VlDelayScheduler`, for coroutines suspended by delays. It resumes
them at a proper simulation time.
* `VlTriggerScheduler`, for coroutines suspended by event controls. It
resumes them if its corresponding trigger was set.
* `VlForkSync`, used for syncing `fork..join` and `fork..join_any`
blocks.
* `VlCoroutine`, the return type of all verilated coroutines. It allows
for suspending a stack of coroutines (normally, C++ coroutines are
stackless).
There is a new visitor in `V3Timing.cpp` which:
* scales delays according to the timescale,
* simplifies intra-assignment timing controls and net delays into
regular timing controls and assignments,
* simplifies wait statements into loops with event controls,
* marks processes and tasks with timing controls in them as
suspendable,
* creates delay, trigger scheduler, and fork sync variables,
* transforms timing controls and fork joins into C++ awaits
There are new functions in `V3SchedTiming.cpp` (used by `V3Sched.cpp`)
that integrate static scheduling with timing. This involves providing
external domains for variables, so that the necessary combinational
logic gets triggered after coroutine resumption, as well as statements
that need to be injected into the design eval function to perform this
resumption at the correct time.
There is also a function that transforms forked processes into separate
functions.
See the comments in `verilated_timing.h`, `verilated_timing.cpp`,
`V3Timing.cpp`, and `V3SchedTiming.cpp`, as well as the internals
documentation for more details.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Various optimizations to speed up MTasks coarsening (which is the long
pole in the multi-threaded scheduling of very large designs).
The biggest impact ones:
- Use efficient hand written Pairing Heaps for implementing priority
queues and the scoreboard, instead of the old SortByValueMap. This
helps us avoid having to sort a lot of merge candidates that we will
never actually consider and helps a lot in performance.
- Remove unnecessary associative containers and store data structures
(the heap nodes in particular) directly in the object they relate to.
This eliminates a huge amount of lookups and helps a lot in
performance.
- Distribute storage for SiblingMC instances into the LogicMTask
instances, and combine with the sibling maps. This again eliminates
hash table lookups and makes storage structures smaller.
- Remove some now bidirectional edge maps, keep only the forward map.
There are also some other smaller optimizations:
- Replaced more unnecessary dynamic_casts with static_casts
- Templated some functions/classes to reduce the number of static
branches in loops.
- Improves sorting of edges for sibling candidate creation
- Various micro-optimizations here and there
This speeds up MTask coarsening by 3.8x on a large design, which
translates to a 2.5x speedup of the ordering pass in multi-threaded
mode. (Combined with the earlier optimizations, ordering is now 3x
faster.)
Due to the elimination of a lot of the auxiliary data structures, and
ensuring a minimal size for the necessary ones, memory consumption of
the MTask coarsening is also reduced (measured up to 4.4x reduction
though the accuracy of this is low).
The algorithm is identical except for minor alterations of the order
some candidates are added or removed, this can cause perturbation in the
output due to tied scores being broken based on IDs.
Various optimizations to speed up MTasks coarsening (which is the long
pole in the multi-threaded scheduling of very large designs).
The biggest impact ones:
- Use efficient hand written Pairing Heaps for implementing priority
queues and the scoreboard, instead of the old SortByValueMap. This
helps us avoid having to sort a lot of merge candidates that we will
never actually consider and helps a lot in performance.
- Remove unnecessary associative containers and store data structures
(the heap nodes in particular) directly in the object they relate to.
This eliminates a huge amount of lookups and helps a lot in
performance.
- Distribute storage for SiblingMC instances into the LogicMTask
instances, and combine with the sibling maps. This again eliminates
hash table lookups and makes storage structures smaller.
- Remove some now bidirectional edge maps, keep only the forward map.
There are also some other smaller optimizations:
- Replaced more unnecessary dynamic_casts with static_casts
- Templated some functions/classes to reduce the number of static
branches in loops.
- Improves sorting of edges for sibling candidate creation
- Various micro-optimizations here and there
This speeds up MTask coarsening by 3.8x on a large design, which
translates to a 2.5x speedup of the ordering pass in multi-threaded
mode. (Combined with the earlier optimizations, ordering is now 3x
faster.)
Due to the elimination of a lot of the auxiliary data structures, and
ensuring a minimal size for the necessary ones, memory consumption of
the MTask coarsening is also reduced (measured up to 4.4x reduction
though the accuracy of this is low).
The algorithm is identical except for minor alterations of the order
some candidates are added or removed, this can cause perturbation in the
output due to tied scores being broken based on IDs.
While keeping the client code abstract in PartPropagateCp is nice for
testing, there is performance to be had removing the abstraction. As
this code dominates in scheduling large designs, we eliminate the
abstraction and re-work the testing to use the actual LogicMTask and
MTaskEdge graph types. No functional change intended.
Instead of deleting then re-allocating MTaskEdge instances when merging
two MTasks, just redirect the edged of the donor MTask to the recipient
MTask. This is both faster as it avoids an allocation and a deletion,
together with one update of the sibling maps, and also makes the
algorithm more stable due to MergeCandidate IDs being stable and
allocated up front for all MTaskEdges, before any SiblingMCs are
allocated.
Perturbations in output are expected as the IDs used to break ties
between merge candidates with equal costs are not updated when
redirecting an edge (on purpose). The relinking of only one end of the
graph edges also perturbs the order in which they are enumerated, which
does change candidate opportunities when the number of edges is larger
than PART_SIBLING_EDGE_LIMIT. Confirmed output is identical when
IDs are updated and edges are updated to appear in their original order.
The critical path propagation used to rely on a pointer comparison to
break equal scoring critical path updates. Use the corresponding mtask
ids instead, which is deterministic across invocations.
siblingPairFromRelatives gathers neighbours of a vertex, and sorts them.
It then takes the N best nodes, and creates sibling merge candidates
from them. We now use the unadjusted cost instead of the step cost of
the vertices when sorting. This is both faster as we need not do the
log-space rounding to compute stepCost, and will also make similar but
yet cheaper nodes appear closer to the front as we don't lose precision
in rounding, hence they are more likely to be entered as merge
candidates. Note that when creating the merge candidate, we still use
the stepCost, so it's purpose of reducing the propagation of critical
path updates is maintained in full. In summary, this should make both
Verilator and the generated model very slightly faster, at least in
theory, and I have observed minor improvement in places.
GraphStreamUnordered used to be GraphStream<std::less<const
V3GraphVertex*>>, but a lot of performance improvements can be had by a
specialized implementation, so added a highly optimized one. This helps
a lot with --debug-partition.
Fix compile error for queue method usage, if it is the
first statement in a block of code, and the return
value is not used. Example:
> if (foo)
> void'(bar.pop_front());
* Tests: Add a test to reproduce #3399
* Fix#3399. When reading an inout port in a module, it should refer the
original inout port, not the generated MODTEMP.
Keep a single std::set of key/value pairs, and a single unordered_map
from key to iterators into the set. Also improve some of the accessing
mechanisms using modern C++. This speeds up multi-threaded ordering by
about 10%.
Similarly to the earlier patch that defers constant folding on optimized
logic, now we also defer the variable substitutions as well. This again
eliminates a lot of traversals, and yields another ~10x speedup of V3Gate
on a design where V3Gate used to dominate while producing identical
results.
Rather than constant folding each logic block after every substitution,
only constant fold updated blocks when re-analysed, or at the end. This
removes a lot of invocations of V3Const on large blocks that can be
optimized well, and should yield the same result.
This speeds up V3Gate by ~4x on a design where V3Gate dominates.
Speed improvements:
- Use a direct, recursion-free implementation
- Improve pre-fetching
Functionality:
- Support remove/replace of currently iterated node
__gcov_flush was a private function and was removed from later GCC
versions (at least from 11.2.0, possibly earlier). Replace with the
documented public __gcov_dump.
Set default value of --comp-limit-parens to 240, to respect default
maximum nesting of parentheses in clang (which is controlled by
-fbracket-depth and defaults to 256). For code generation consistency,
also use the same default with gcc.
* Tests: Add a test to reproduce #3509
* Tests: Compile without tautological-compare check because bit op tree optimization is disabled in the test.
* Internals: Dedup code. No functional change is intended.
* Fix#3509.
"2'b10 == (2'b11 & {1'b0, val[0]})" and "2'b10 != (2'b11 & {1'b0, val[0]})" were
wrongly optimized to "!val[0]" and "val[0]" respectively.
Now properly optimize them to 1'b0 and 1'b1.
* Commentary
* Commentary: Update Changes
Associative arrays that specify a wildcard index type may be indexed by
integral expressions of any size, with leading zeros removed
automatically. A natural representation for such expressions is a
string, especially that the standard explicitly specifies automatic
casts from string indices to bit vectors of equivalent size.
The automatic cast part is done implicitly by the existing type system.
A simpler way to just make this work would be to convert wildcard index
type to a string type directly in the parser code, but several new AST
classes are needed to make sure illegal method calls are detected.
The verilated data structure implementation is reused, because there is
no need for differentiating the behavior on C++ side.
All remaining use of conditional compilation in the tracing
implementation of the run-time library are replaced with the use of
VerilatedModel::traceConfig, and is now done at run-time.
Step towards a proper run-time library. Reduce the amount of ifdefs in
the implementation of offloaded tracing. There are still a very small
number of ifdefs left, which will need more careful changes in order to
keep user API compatibility.
Make the external domains provider of ordering populate an output
vector, which then allows us to add multiple external sensitivities to
combinational logic.
Using unlinkFrBackWithNext is O(n) in the size of the list if unlinking
from the middle, so addHereThisAsNext also had this complexity. This
patch implements addHereThisAsNext directly, which is always O(1).
Use std::partial_sort for the non-exhaustive case. This is O(n) instead
of O(n*log(n)) in the size of the candidate list being sorted. (It
actually is O(n*log(k)), but k is constant 6 in the non-exhaustive
case).
* Tests: Add a test to reproduce #3470
* Update LSB during return path of traversal. No functional change is intended.
* Introduce LeafInfo::m_msb
* Update LeafInfo::m_msb when visitin AstCCast
* Internals: Add comment, reorder. No functional change is intended.
* Delete explicit from copy constructor to fix build error.
* Update Changes
* Internals: Remove unused parameter. No functional change is intended.
* Tests: Add explanation to t_const_opt.
* Internals: Let LeafInfo class. No functional change is intended.
* Internals: Rename LeafInfo::width -> LeafInfo::varWidth(). No functional change is intende.
* Tests: Check BitOpTree statistics in t_const_opt.
* Tests: Add a test to reproduce #3445
* Fix#3445. Don't forget LSB of frozen node in BitOpTreeOpt.
* Apply suggestions from code review
Co-authored-by: Geza Lore <gezalore@gmail.com>
VCD tracing is now parallelized using the same thread pool as the model.
We achieve this by breaking the top level trace functions into multiple
top level functions (as many as --threads), and after emitting the time
stamp to the VCD file on the main thread, we execute the tracing
functions in parallel on the same thread pool as the model (which we
pass to the trace file during registration), tracing into a secondary
per thread buffer. The main thread will then stitch (memcpy) the buffers
together into the output file.
This makes the `--trace-threads` option redundant with `--trace`, which
now only affects `--trace-fst`. FST tracing uses the previous offloading
scheme.
This obviously helps a lot in VCD tracing performance, and I have seen
better than Amdahl speedup, namely I get 3.9x on XiangShan 4T (2.7x on
OpenTitan 4T).
V3MergeCond merges consecutive conditional `_ = cond ? _ : _` and
`if (cond) ...` statements. This patch adds an analysis and ordering
phase that moves statements with identical conditions closer to each
other, in order to enable more merging opportunities. This in turn
eliminates a lot of repeated conditionals which reduced dynamic branch
count and branch misprediction rate. Observed 6.5% improvement on
multi-threaded large designs, at the cost of less than 2% increase in
Verilation speed.
For ordering, only the scope of logic vertices should be relevant, so
remove the scope pointer from OrderEitherVertex and move it into
OrderLogicVertex. This does not change single-threaded scheduling at
all. Theoretically, multi-threaded scheduling should not be affected
either though due to some implementation quirk depending on vertex order
in a graph the MT schedule is perturbed by this change, but the
performance effect of this is negligible on all benchmarks I have access
to.
No functional change intended.
Fixes#3442
This is a major re-design of the way code is scheduled in Verilator,
with the goal of properly supporting the Active and NBA regions of the
SystemVerilog scheduling model, as defined in IEEE 1800-2017 chapter 4.
With this change, all internally generated clocks should simulate
correctly, and there should be no more need for the `clock_enable` and
`clocker` attributes for correctness in the absence of Verilator
generated library models (`--lib-create`).
Details of the new scheduling model and algorithm are provided in
docs/internals.rst.
Implements #3278
This is a simple debugging aid to allow constraining the graph layout
via GraphViz rank directives. Note this is not related in any way to the
vertex 'rank' attribute used by some of the graph algorithms.
No functional change.
Some cases of warnings about the use of blocking and non-blocking
assignments in combinational vs sequential processes were suppressed in
a way that is inconsistent with the *actual* current execution model of
Verilator. Turning these back on to, well, warn the user that these might
cause unexpected results. V5 will clean these up, but until then err on
the side of caution.
Fixes#864.
Static variable initializers run before initial blocks, so use an
explicitly different procedure type for them. This also enables us to
now raise errors for assignments to const variables in initial blocks.
At the end of V3Param, fix up the module list to be topologically
sorted. We need to do this at the end as a later instantiation of a
recursive module might instantiate an earlier specialization, which we
cannot know until we processed everything. The rest of the compiler
depends on the module list being topologically sorted.
Fixes#3393
Rename AstNodeModule::hierName -> someInstanceName and explain that this
is only used for user messages.
Rename AstNode::locationStr -> instanceStr and simplify implementation.
In particular, do not report an instance if we can't find a reasonable
guess.
The --prof-threads option has been split into two independent options:
1. --prof-exec, for collecting verilator_gantt and other execution
related profiling data, and
2. --prof-pgo, for collecting data needed for PGO
The implementation of execution profiling is extricated from
VlThreadPool and is now a separate class VlExecutionProfiler. This means
--prof-exec can now be used for single-threaded models (though it does
not measure a lot of things just yet). For consistency VerilatedProfiler
is renamed VlPgoProfiler. Both VlExecutionProfiler and VlPgoProfiler are
in verilated_profiler.{h/cpp}, but can be used completely independently.
Also re-worked the execution profile format so it now only emits events
without holding onto any temporaries. This is in preparation for some
future optimizations that would be hindered by the introduction of function
locals via AstText.
Also removed the Barrier event. Clearing the profile buffers is not
notably more expensive as the profiling records are trivially
destructible.
- Always use a fast function to replace a slow one if available
- Iterate to fixed point (i.e.: if combining made more functions
identical, combine those too). This will be more useful in the future.
- Use only single, const traversal
Introduce VNRef that can be used to wrap AstNode keys in STL
collections, resulting in equality comparisons rather than identity
comparisons. This can then replace the SenTreeSet data-structure.
Using the 'forceable' directive in a configuration file, or the /*
verilator forceable */ metacomment on a variable declaration will
generate additional public signals that allow the specified signals to
be forced/released from the C++ code.
- Add more tests, including for tracing.
- Apply some cleaner, more generic abstractions in the implementation.
- Use clearer AstRelease which is not an assignment.
Avoid cloning the module when inlining the last instance that references
that module. This saves a lot of memory because it saves cloning
singleton modules (those with a single instance), which we always
inline. The top few levels of the hierarchy are often simple wrappers,
including the one added by Verilator in V3LinkLevel::wrapTop. Cloning
these and putting off deleting the originals can be very expensive
because they often have a lot of contents inlined into them, so each
layer of wrapper that is inlined would essentially add a whole new clone
of the large top-level. Directly inlining the module for the last cell
without cloning saves us from all this duplicate memory consumption and
also from having to create the clones in the first place.
Also added minor traversal speedups
This reduces the memory consumption of V3Inline by 80% and peak memory
consumption of Verilator by about 66% on a large design, while speeding
up the V3Inline pass by ~3.5x and the whole of Verilator by ~8% while
producing identical output.
- More efficient comparison by pre-computing sorting keys.
- Remove work items in algorithms known to be redundant earlier.
This greatly reduces data structure sizes.
- Use V3GraphVertex->user() for state tracking instead of unordered_map
while both of these are constant time, they do add up.
- In `makeMinSpanningTree`, instead of batch inserting outgoing edges of
each visited vertex into an ordered set, keep an ordered set of sorted
vectors of edges. This reduces the size of the ordered set
significantly (it is now O(V) rather than O(E), and as the subject
graph is a complete graph, V ~ sqrt(E), so this is a significant gain).
- Use a vector + sorting in `perfectMatching` instead of an ordered set.
This is faster on large working sets.
This yields 3.8x speedup on the variable order pass and overall 14%
verilation speed gain on a large design.
Repeatedly traversing whole modules in emit (due to file splitting)
looking for `systemc_* sections can add up to a lot of time on large
designs that have been flattened and need to be split into many files.
Assuming `systemc_* is a rarely used feature, just don't bother if we
don't need to. This gain 9% verilation speed improvement on a large
benchmark.
* Tests: Add t_hier_block_sc_trace(fst|vcd) that tests tracing hierarchical block on SystemC.
* Add a check that elaboration is done before a trace file is opened.
* Add a check that elaboration is done before trace() is called to verilated SystemC model.
* Tests: call sc_core::sc_start(sc_core::SC_ZERO_TIME) before opening a trace file
* Tests: Fix t_trace_two_sc to call sc_start before opening trace
* Use vl_fatal as suggested in PR review.
Trace initialization (tracep->decl* functions) used to explicitly pass
the complete hierarchical names of signals as string constants. This
contains a lot of redundancy (path prefixes), does not scale well with
large designs and resulted in .rodata sections (the string constants) in
ELF executables being extremely large.
This patch changes the API of trace initialization that allows pushing
and popping name prefixes as we walk the hierarchy tree, which are
prepended to declared signal names at run-time during trace
initialization. This in turn allows us to emit repeat path/name
components only once, effectively removing all duplicate path prefixes.
On SweRV EH1 this reduces the .rodata section in a --trace build by 94%.
Additionally, trace declarations are now emitted in lexical order by
hierarchical signal names, and the top level trace initialization
function respects --output-split-ctrace.
* Add a test to reproduce #3197
* Fix#3197. Optimize correctly even if a variable is >32
* Quick exit instead of continue. No functional change is intended.
* Delete comment-out line.
* update per review comment
IEEE 1800-2017 6.11.3 says these types are unsigned. Until now these
types were treated as not having a signedness (NOSIGN), and nodes having
these types were later resolved by V3Width to be unsigned. This is a bit
problematic when creating nodes of these types after V3Width. Treating
these types as unsigned from the get go is fine, and actually improves
generated code slightly.
* Add a test to reproduce bug3182. Run the same HDL with -Oo to confirm the result is same.
* Hopefully fix#3182. The result can be 0 only when polarity is true (no AstNot is found during traversal).
* add tests to reproduce #3177.
Any random test circuits can be added to t_split_var_4.v later because it uses CRC to check the result while
t_split_var_0.v has just barrel shifters.
* Fix#3177. Don't merge assign statements if a variable is marked split_var.
This is a partial cleanup of V3Order with the aim of increasing clarity:
- Split the initial OrderGraph building and the actual ordering process
into separate classes (OrderVisitor -> OrderBuildVisitor + OrderProcess)
- Remove all the historical cruft from the graph building phase (now in
OrderBuildVisitor), and add more assertions for assumptions.
- Change the dot styling of OrderGraph to use shapes and more easily
distinguishable colors.
- Expand vague comments, remove incorrect comments, and add more.
- Replace some old code with cleaner C++11 constructs.
- Move code about a bit so logically connected sections are closer to
each other, scope some definitions where they are used rather than file
scope.
- The actual ordering process (now in OrderProcess) is still largely
unchanged.
The generated code is identical to before (within the limits of the
exiting non-determinism).
The _CONST suffix on these macros is only lexical notation, pointer
constness can be preserved by overloading the underlying
implementations appropriately. Given that the compiler will catch
invalid const usage (trying to assign a non-const pointer to a const
pointer variable, etc.), and that the declarations of symbols should
make their constness obvious, I see no reason to keep the _CONST
flavours.
Fail at compile time if the result of these macros can be statically
determined (i.e.: they aways succeed or always fail). Remove unnecessary
casts discovered. No functional change.
VN_AS should be used over VN_CAST in code where the author knows up
front (i.e.: statically) what the true type of the node is. This has
multiple benefits over VN_CAST:
- In the debug build: Asserts node type is as expected
- In the optimized build: It is faster as no superfluous type test
- And (I would argue most importantly) it documents intent in the code
No current instances of VN_CAST changed in this patch
- Merge AstNodeIf nodes as well (not just assignment from AstCond)
- Merge merged results recursively (optimizes nested conditionals/ifs)
- Only checking mergeability once per node.
- Don't add redundant masking
- Duplicate cheap statements in both branches, if doing so yields a
larger merge
- Include reduced nodes before the starting conditional in the merge
- Remove redundant casting
- Cheaper final XOR parity flip (~/^1 instead of != 0)
- Support XOR reduction of ~XOR nodes
- Don't add redundant masking of terms
- Support unmasked terms
- Add cheaper implementation for single bit only terms
- Ensure result is clean under all circumstances (this fixes current bugs)
Verilator should now correctly re-evaluate any logic that depends on
state set in a DPI exported function, including if the DPI export is
called outside eval, or if the DPI export is called from a DPI import.
Whenever the design contains a DPI exported function that sets a
non-local variable, we create a global __Vdpi_export_trigger flag, that
is set in the body of the DPI export, and make all variables set in any
DPI exported functions dependent on this flag (this ensures correct
ordering and change detection on state set in DPI exports when needed).
The DPI export trigger flag is cleared at the end of eval, which ensured
calls to DPI exports outside of eval are detected. Additionally the
ordering is modifies to assume that any call to a 'context' DPI import
might call DPI exports by adding an edge to the ordering graph from the
logic vertex containing the call to the DPI import to the DPI export
trigger variable vertex (note the standard does not allow calls to DPI
exports from DPI imports that were not imported with 'context', so we
do not enforce ordering on those).
- Use C++11 initialization syntax
- Use C++11 for loops
- Add const
- Factor out repeated _->fileline() sub-expressions
- Factor out issuing warning message
No functional change.
- Use C++ initialization syntax
- Add const
- Make members static where appropriate
- Factor out repeated _->fileline() sub-expressions
No functional change.
* EmitXml: Added <ccall>, <constpool>, <initarray>/<inititem>, wrapped children of <if> and <while> with <begin> elements to prevent ambiguity
* EmitXml: added signed="true" to signed basicdtypes
All parameters that are required in the output are now emitted as
'static constexpr, except for string or array of strings parameters,
which are still emitted as 'static const' (required as std::string is
not a literal type, so cannot be constexpr). This simplifies handling
of parameters and supports 'real' parameters.
This patch partitions AstCFuncs under an AstNodeModule based on which
header files they require for their implementation, and emits them
into separate files based on the distinct dependency sets. This helps
with incremental recompilation of the output C++.
Apart from adding required AstCUse, V3CUse also used to create some
standard methods for classes. This is now done in a separate pass
V3Common. Note that this is not a performance issue, as V3Common just
iterates through each module, which are stored in a simple linked list
under the netlist, and does not need to traverse the whole netlist.
Other sub-classes of AstNodeCCall do not need the self pointer. Moving
it into the specific sub-class that needs it clarifies V3Descope and
Emit. No functional change intended.
Internal AstNodeModule headers (.h) and implementation (.cpp) files are
now emitted separately in V3EmitC::emitcHeaders() and
V3EmitC::emitcImp() respectively. No functional change intended
A separate V3VariableOrder pass is now used to order module variables
before Emit. All variables are now ordered together, without
consideration for whether they are ports, signals form the design, or
additional internal variables added by Verilator (which used to be
ordered and emitted as separate groups in Emit). For single threaded
models, this is performance neutral. For multi-threaded models, the
MTask affinity based sorting was slightly modified, so variables with no
MTask affinity are emitted last, otherwise the MTask affinity sets are
sorted using the TSP sorter as before, but again, ports, signals, and
internal variables are not differentiated. This yields a 2%+ speedup for
the multithreaded model on OpenTitan.
This patch makes OpenTitan verilation with --debug-check 22% faster, and
the same with --debug --no-dump-tree 91% faster. Functionality is the
same (including when VL_LEAK_CHECKS is defined), except V3Broken can now
always find duplicate references via child/next pointers if the target
node is not `maybePointedTo()` (previously this only happened when
compiled with VL_LEAK_CHECKS). The main change relates to storing the
v3Broken traversal state in the AstNode by stealing a byte from what
used to be unused flags. We retain an unordered_set only for marking
pointers as valid to be referenced via a non-child/non-next member
pointer.
Do not sort and hoist function local variables to the top of the
function definition. The stack layout of automatic variables is not
defined by C so the compilers can lay these out optimally. Simplifies
internals for follow on work. Effect on model performance is neutral to
very slight improvement, so we do not seem to be loosing anything.
V3Broken now checks that AstVar nodes referenced in an AstCFunc are
either external, or appear in the tree before the reference, and are in
scope.
Fix V3Begin to move lifted AstVars to beginning of FTask, rather than
end, which trips the above check.
debug() declared by VL_DEGUB_FUNC used to cache the result of the debug
level lookup (which depends on options) in a static. This meant that if
the debug() function was called before option parsing, the default debug
level of 0 would be used for the rest of the program, even if a --debug
option was given. Fixed by not caching the debug level until after
option parsing is complete.
The bool AstNode flags fit in a padding gap after m_type. This reduces
memory consumption by about 2% on OpenTitan. Initializing the unused
bits in the flags then avoids a read-modify-write in the constructor
(replacing it with a store constant). Overall verilation speed is about
1% faster.
The -G option now correctly parses simple integer literals as signed
numbers, which is in line with the standard and is significant when
overriding parameters without a type specifier.
Fixes#3060
Remove magic code fragments form EmitCTrace, so Emit need not be aware
that a function is tracing related or not (apart from the purpose of
file name generation). All necessary code is now generated via text
nodes in V3TraceDecl and V3Trace. No functional change intended.
For consistency with the rest of the generated code, generated methods
related to tracing now use snake_case instead of camelCase. No
functional change intended.
* Move MTaskState to ThreadSchedule
MTaskState does not concern itself with sandbagging, and thus solely contains information related to the finalized schedule, i.e., completion time, thread ID and next MTask on thread.
* Add .dot graph visualization of ThreadSchedule
Follow-up to #2779.
This commit adds the creation of .dot files - used by GraphViz - to visualize how mtasks are statically scheduled across the set of specified threads.
We visualize each thread as a row, with nodes of a row being the mtasks scheduled for the given thread. The width of the mtask nodes are proportional to their cost. MTask dependencies are shown using an edge between the source and sink mtasks.
V3Simulate reuses allocated AstConst nodes for efficiency, however this
used to be implemented in a way that required a deep copy of a
std::unorderd_map<_, std::deque<_>>, which was quite inefficient when it
grew large. The free list is now managed without any copying. This takes
the V3Table pass from taking 12s to 0.2s on SweRV EH1.
We incorrectly treated two different struct types the same when passed
as an actual parameter to a `parameter type` parameter in an instance,
if the actual parameter expression both hash to the same value and the
structs have the same struct name. This is now corrected.
Fixes#3055.
This patch implements #3032. Verilator creates a module representing the
SystemVerilog $root scope (V3LinkLevel::wrapTop). Until now, this was
called the "TOP" module, which also acted as the user instantiated model
class. Syms used to hold a pointer to this root module, but hold
instances of any submodule. This patch renames this root scope module
from "TOP" to "$root", and introduces a separate model class which is
now an interface class. As the root module is no longer the user
interface class, it can now be made an instance of Syms, just like any
other submodule. This allows absolute references into the root module to
avoid an additional pointer indirection resulting in a potential speedup
(about 1.5% on OpenTitan). The model class now also contains all non
design specific generated code (e.g.: eval loops, trace config, etc),
which additionally simplifies Verilator internals.
Please see the updated documentation for the model interface changes.
This commit removes shadowing of the vlSymsp member of the emitted
modules, allowing models to compile when -Werror=shadow is set. This may
be useful when i.e., an external project which defines its own error
flags depends on the verilated model.
Factored out bits from V3EmitC.cpp that is required to emit a whole
(non-trace) AstCFunc. This is mostly what used to be the EmitCStmts
class plus relevant bits from EmitCImp. These now live in EmitCFunc,
which is reusable by anything that needs to emit a regular AstCFunc
(differences in tracing to be addressed later). EmitCImp now extends
EmitCFunc instead of EmitCStmts. No functional change intended.
Moved these 2 function into V3EmitCBase so we can reuse them later.
emitVarDecl required minor alteration to move building of m_ctorVarsVec
back into V3EmitC (which is now done in V3EmitC::emitSortedVarList).
No functional change intended.
* Add a test to reproduce #3023. Also applied verilog-mode formatting.
* use unique_ptr. No functional change is intended.
* Introduce restorer that reverts changes during iterate() if failed.
This used to be done in the constructor of the top module, but there is
no reason to do it there. Internals are cleaner with this in the Sym
constructor. No functional change intended.
AND(CONST,SHIFTR(_,C)) appears often after V3Expand, with C a large
enough dense mask (i.e.: of the form (1 << n) - 1) to make the masking
redundant. E.g.: 0xff & ((uint32_t)a >> 24). V3Const now replaces these
ANDs with the SHIFTR node.
Similarly, we also simplify the same with SHIFTL,
e.g.: 0xff000000 & ((uint32_t)a << 24)
V3Expand generates a lot of OR nodes that are under a clearing mask, and
have redundant terms, e.g.: 0xff & (a << 8 | b >> 24). The 'a << 8' term
in there is redundant as it's bottom bits are all zero where the mask is
non-zero. V3Const now removes these redundant terms.
Teach V3Localize how to localize variables that are used in multiple
functions, if in all functions where they are used, they are always
written in whole before being consumed. This allows a lot more variables
to be localized (+20k variables on OpenTitan - when building without
--trace), and can cause significant performance improvement (OpenTitan
simulates 8.5% - build single threaded and withuot --trace).
These utility classes can be used to hang advanced data structures off
AstNode user*u() pointers, and they take care of memory management for
the client. Use via the call operator().
V3Localize can now localize variable references that reference variables
located in scopes different from the referencing function. This also
means V3Descope has now moved after V3Localize.
When part selecting bits via an AstSel in a VlWide, V3Expand used to do
something akin to:
word_index = lsb / 32;
bit_index = lsb % 32;
result =
wide[word_index + 1] << (32 - bit_index) | wide[word_index] >> bit_index;
The unconditional "+ 1" can cause an out of bounds access into the
VlWide, when the whole of the select is into the most significant word
(i.e.: when word_index is already the most significant word). We now
emit roughly this instead:
lo_word_index = lsb / 32;
bit_index = lsb % 32;
hi_word_index = (lsb + width - 1) / 32;
result =
wide[hi_word_index] << (32 - bit_index) | wide[lo_word_index] >> bit_index;
i.e.: we explicitly calculate which word the MSB of the select falls
into, and address that word, rather than the unconditional + 1. The
shifts ensure we still yield the right result, even if lo_word_index and
hi_word_index are the same.
Note: The actual expression created by V3Expand can be a bit more
complicated as we might need to access 3 words when the result is a
QData, all 3 word indices are calculated explicitly.
emitVarCmtChg used to emit MTask affinity of variables in comments in
the generated header. This causes unnecessary changes in the output when
scheduling changes slightly between compilation, hindering ccache reuse.
If needing this info for debugging Verilator, add a separate dump file
instead of emitting it in the generated code.
The goal of this patch is to move functionality related to constructing
the thread entry points and then invoking them out of V3EmitC (and into
V3Partition). The long term goal being enabling V3EmitC to emit
functions partitioned based on header dependencies. V3EmitC having to
deal with only AstCFunc instances and no other magic will facilitate
this.
In this patch:
- We construct AstCFuncs for each thread entry point in
V3Partition::finalize and move AstMTaskBody nodes under these functions.
- Add the invocation of the threads as text statements within the
AstExecGraph, so they are still invoked where the exec graph is located.
(the entry point functions are still referenced via AstCCall or
AstAddOrCFunc, so lazy declarations of referenced functions are created
automatically).
- Explicitly handle MTask state variables (VlMTaskVertex in
verilated_threads.h) within Verilator, so no need to text bash a lot of
these any more (some text refs still remain but they are all created
next to each other within V3Partition.cpp).
The effect of all this on the emitted code should be nothing but some
identifier/ordering changes. No functional change intended.
What previously used to be per module static constants created in
V3Table and V3Prelim are now merged globally within the whole model and
emitted as part of a separate constant pool. Members of the constant
pool are global variables which are declared lazily when used (similar to
loose methods).
This patch introduces the concept of 'loose' methods, which semantically
are methods, but are declared as global functions, and are passed an
explicit 'self' pointer. This enables these methods to be declared
outside the class, only when they are needed, therefore removing the
header dependency. The bulk of the emitted model implementation now uses
loose methods.
Check the C++ compiler for -Og via configure and use it if available.
Per the GCC manual:
-Og should be the optimization level of choice for the standard
edit-compile-debug cycle, offering a reasonable level of optimization
while maintaining fast compilation and a good debugging experience. It
is a better choice than -O0 for producing debuggable code because some
compiler passes that collect debug information are disabled at -O0.
The debug exe is painfully slow on large designs, hopefully this is an
improvement.
Similarly, check for and use -gz to compress the debug info as it is
huge otherwise. This should help with distribution and caching on CI.
Also checks for -ggdb via configure for compatibility.
- Initialize variable to avoid 'may be uninitialized' warning
- More reliable segfault (the previous version was compiled into an
undefined instruction by clang sometimes, thew new one is always a store
to zero).
- Better combining, without multiplication, which means a 0 hash value
is now allowed.
- Do not OR in bottom bits (this was used to avoid a 0 hash but had the
side effect of hashing 0 and 1 to the same value, which are actually
common inputs.
- Hash whole content of V3Number. This does not seem to be noticeable in
runtime, but quite often the bottom word can be a special value like
zero while the rest of the content varies.
A few names were incorrectly mangled, which made --protect-ids produce
invalid output when certain SV class constructs were uses. Now fixed and
added a few extra tests to catch this.
// was so far unconditionally treated as comment.
c443e229ee introduced a string literal in
the output that contained a http:// url, which broke the indent
tracking. No functional change intended
astgen now generates ASTGEN_SUPER_* macros, instead of expanding the
ASTGEN_SUPER itself. This means that V3AstNodes.h itself can be included
in V3Ast.h, which helps IDEs (namely CLion) find definitions in
V3AstNodes.h a lot better. Albeit this is a little bit more boilerplate,
writing constructors of Ast nodes should be a lot rarer than trying to
find their definitions, so hopefully this is an improvement overall.
astgen will error if the developer calls the wrong superclass
constructor.
Rework Ast hashing to be stable
Eliminated reliance on pointer values in AstNode hashes in order to make
them stable. This requires moving the sameHash functions into a visitor,
as nodes pointed to via members (and not child nodes) need to be hashed
themselves.
The hashes are now stable (as far as I could test them), and the impact
on verilation time is small enough not to be reliably measurable.
V3Hasher is responsible for computing AstNode hashes, while V3DupFinder
can be used to find duplicate trees based on hashes. Interface of
V3DupFinder simplified somewhat. No functional change intended at this
point, but hash computation might differ in minor details, this however
should have no perceivable effect on output/runtime.
Implements (#2964)
- Remove dead code
- Simplify what is left
- Minor improvements using C++11
- Remove special case AstNode constructors used only in one place in
V3Combine (inlined instead).
No functional changes intended. Boring patch in the hope of being
helpful to new contributors. Although AstNode* base classes needing to
be abstract is described in the internals manual, this might provide a
bit of extra safety.
This seems to belong there, eliminates some code duplication in V3Clock,
and also enables splitting AstAlwaysPost statements into different
functions in V3Order, but should otherwise have little effect.
CFuncs only used to be split at procedure (always/initial/final block),
which on occasion can still yield huge output files if they have large
procedures. This patch make CFuncs split at statement boundaries within
procedures. This has the potential to help a lot, but still does not
help if there are huge statements within procedures.
V3Reloop now can roll up indexed assignments between arrays if there is a
constant offset between indices on the left and right hand sides, e.g.:
a[0] = b[2];
a[1] = b[3];
...
a[x] = b[x + 2];
* Tests: Add more case that does not match native C++ width (8, 16, 32 or 64).
* Use AstVarRef::same() instead of AstNode::sameGateTree() because the latter checks dtype in addition to scope.
AstVarRef may have different minWidth in some cases,
but the difference should be ignored in the context of bitOpTree optimization.
This support code merely adds the capability to skip over the encrypted
parts. Many models have unencrypted module interfaces with ports, and
only encrypt the critical parts.
* Tests: Add another testcase that triggers assertion failure in bitOpTree opt.
* Fix assertion failure in bitOpTree opt reported in #2891. Consider the follwoing case.
CCast -> WordSel -> VarRef(leaf)
* Make sure that m_bitPolarity is expanded enough.
The intention was to not merge impure assignments, but the actual
predicate failed if the assignment was indeed pure.
This fix gains 1.5% speed on SweRV EH1.
This will still warn if a case item is completely covered by previous
items, but will no longer complain about overlaps like this:
priority casez (foo_i)
2'b ?1: bar_o = 3'd0;
2'b 1?: bar_o = 3'd1;
Before, there was a warning for the second statement because the first
two patterns match 2'b11.
Add V3OptionsParser that can suggest correct option.
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
Co-authored-by: github action <action@example.com>
** Add simulation context (VerilatedContext) to allow multiple fully independent
models to be in the same process. Please see the updated examples.
** Add context->time() and context->timeInc() API calls, to set simulation time.
These now are recommended in place of the legacy sc_time_stamp().
* Tests:Add some more signals to t_const_opt_red.v
* Optimize bit op trees such as ~a[2] & a[1] & ~a[0] to 3'b010 == (3'111 & a)
* Update src/V3Const.cpp
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
* Update src/V3Const.cpp
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
* Update src/V3Const.cpp
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
* Apply clang-format
* Don't edit and-or tree.
* Call matchBitOpTree() after V3Expand does its job.
* Internals: Rename newNodep -> newp. No functional change is intended.
* Internals: Remove m_sels. No functional change is intended.
* Internals: Remove stringstream. No functional change is intended.
* Internals: Use V3Number::bitIs1(). No functional change is intended.
* Internals: Use V3Number instead of std::map. Resolved laek. Result should be same.
* Internals: Resolve overload of setPolarity. No functional change is intended.
* Internals: Pass failure reason. No functional change is intended.
* Internals: Add VNUser::toPtr(). No functional change is intended.
* Internals: Use user4 instead of std::map. No functional change is intended.
* Catch up with the AST style aftre V3Expand
* Internals: Rename Context to VarInfo. No functional change is intended.
* Add some more test case
* tests:Add stats to tests
Update stats in t_merge_cond.pl as matchBitopTree does some of them.
* insert CCast if necessary
* small optimization to remove redundant bit mask
* No quick exit even when unoptimizable node is found.
* Simplify removing redundant And
* simplify
* Revert "Internals: Add VNUser::toPtr(). No functional change is intended."
This reverts commit f98dce10db.
* Consider AstWordSel and cleanup
* Update test
* Update src/V3Const.cpp
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
* Update src/V3Const.cpp
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
* Update src/V3Const.cpp
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
* Update src/V3Const.cpp
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
* Update src/V3Const.cpp
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
* Update src/V3Const.cpp
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
* Apply clang-format
* Internals: rename variables. No functional change is intended.
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
Co-authored-by: github action <action@example.com>
When using a "if" statement inside an always block, part of the code may
be unreachable. This can be used to avoid errors, but it generated an
error, this commit demotes this to a warning. Partly fixes#2625.
* Test hierarchical block that is explicitly set its default parameter value.
* Fix hierarchical verilation when a hierarchical block is instantiated with explicit setting of the default value.
Parameterized hierarchical block must have mangled name even when all parameters have default value,
otherwise the parameterized module will be hidden by protect-lib wrapper.
* rename variable names. No functional change is intended.
* Use unique_ptr to manage lifetime. Slightly changed variable name. No functional change is intended.
* Let ParameterizedHierBlocks::areSame() public. No functional change is intended.
* Add a test for hierarchical verilation without timescale
* Emit timeunit in hierarchical wrapper only when it is specified in the input design or command line option.
* Update src/V3AstNodes.h
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
On test suite adds ~7% runtime but will simplify logic and enable some
future changes.
Most large designs I saw were already using verilated_heavy as were using
strings, queues, associative arrays, classes, $readmem, $writemem, or
$valueplusargs.
* Add a test to use unpacked array port in hierarchical verilation and protect-lib.
* V3EmitV supports unpacked array variables
* Can Emit local unpacked array properly
* Update golden of t_debug_emitv
* Support unpacked array port in protect-lib
* Remove t_prot_lib_unpacked_bad test as unpacked array is supported now.
* Add tests for unpacked array in DPI-C
* Add more generic parameter generator to AstNodes
* Supports multi dimensional array in DPI ( DPI argmuments <=> Verilator internal type conversion)
consider typedef in V3Task
fix export test
fix inout for scalar
support export func of time
* V3Premit does not show an error for wide words nor ArraySel
* Unnecessary pack func for unapcked array does not appear anymore
* Support unpacked array in runtime header
- Add an overload for lvalue VL_CVT_PACK_STR_NN
- Allow conversion from void *
* touch up tests for codacy advices
* resolve free functions. no functional change intended.
* collect multiple string literals of "__Vcvt". No functional change is intended.
* temporary variable for DPI-C should be static if its type is string
* Update src/V3EmitC.cpp
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
* Add a test to use string for $fgets
* Use dedicated function for $fgets to std::string
* share the implementation of $fgets
* Pass -1 for bitwidth of std::string to distinguish from POD
* add checks for scanf with string
* apply clang-format
* WIFEXITED missing from MinGW/MSYS2, added defines
* Found source of the WIFEXITED macro in the binutils-gdb repo. Now with less pointer manipulation.
This is to allow C++ verilator toplevel to support
multiple modes of waveform tracing
VM_TRACE_FST can be used inside a #if VM_TRACE
section to switch between classic .vcd tracing and the
more compact .fst format supported by GTKWAVE
* hier_block with cmake test doesn't assume prefix now.
* Add space between files
* don't set -Mdir on cmake build as it will be set by DIRECTORY option
* Use top target name instead of prefix
* Add a test to make sure that lib modules (loaded via -y option) can be a hier_block.
* Add HDL file of the hier_block to the source list if the module is loaded via -y option.
(Each hier_block is treated as a top module when processing the hier_block.)
* Use "\n" for delimiter as the other files
* Add a test to use shared object of protect-lib
* Add a guard to call ctor/dtor just once even when a protec-lib is shared object.
* Pass .a to linker in leaf-last order for older ld.
* Add -flat_namespace for mac
* No need to link the intermediate .so in hierarchical verilation
* apply clang-format
* run all tests in ci instead of cron. DONT_MERGE
* Add -undefined dynamic_lookup for mac environment when linking a shared protect-lib.
* Let's just check on mac for now. DONT_MERGE
* Revert "Let's just check on mac for now. DONT_MERGE"
This reverts commit 533fac6f9f.
* Revert "run all tests in ci instead of cron. DONT_MERGE"
This reverts commit fb4ac1fb42.
* test:Add more tests for checking split_var for unpacked array.
* Fix range calculation of SliceSel and change to UASSERT_OBJ because the range check is done in V3Width beforehand.
cint.mainInt(nodep) walks the tree and populates m_ctorVarsVec.
Reuse EmitCImp cint for the slow mainImp() emition steps to make sure
we emit constructor calls to setup SystemC sc_module names.
* add a test for protect-lib without sequential logic that cause the follwoing error.
%Error: obj_vlt/t_prot_lib_comb/secret/secret.sv:78:14: syntax error, unexpected ')'
78 | always @() begin
| ^
* protect-lib wrapper must contain sequential signal related statements
only if the protected design is a sequential logic.
Diff looks big, but actually just add "if (m_hasClk)" condition.
Signed-off-by: Yutetsu TAKATSUKASA <y.takatsukasa@gmail.com>
Inlining used to set some VarRefs to point to new Var nodes, without
updating the dtype of the VarRef to the dtype of the new Var.
AstNode::sameTree does identity comparison on the dtype, so some trees
that were semantically the same did not compare as such with sameTree
because of the dtype. This in turn results in some missed opportunities
for combining equivalent SenTree sensitivities when the clock came from
outside.
* Pass FileLine so that findDotted() will be able to tell the error location though it is not used yet.
* reorder the parameter order. No functional change is intended.
- Do try to merge after assignment to condition when possible.
- Do not try to merge reduced form if not the expected statement.
This used to cause a crash.