Adds timing support to Verilator. It makes it possible to use delays,
event controls within processes (not just at the start), wait
statements, and forks.
Building a design with those constructs requires a compiler that
supports C++20 coroutines (GCC 10, Clang 5).
The basic idea is to have processes and tasks with delays/event controls
implemented as C++20 coroutines. This allows us to suspend and resume
them at any time.
There are five main runtime classes responsible for managing suspended
coroutines:
* `VlCoroutineHandle`, a wrapper over C++20's `std::coroutine_handle`
with move semantics and automatic cleanup.
* `VlDelayScheduler`, for coroutines suspended by delays. It resumes
them at a proper simulation time.
* `VlTriggerScheduler`, for coroutines suspended by event controls. It
resumes them if its corresponding trigger was set.
* `VlForkSync`, used for syncing `fork..join` and `fork..join_any`
blocks.
* `VlCoroutine`, the return type of all verilated coroutines. It allows
for suspending a stack of coroutines (normally, C++ coroutines are
stackless).
There is a new visitor in `V3Timing.cpp` which:
* scales delays according to the timescale,
* simplifies intra-assignment timing controls and net delays into
regular timing controls and assignments,
* simplifies wait statements into loops with event controls,
* marks processes and tasks with timing controls in them as
suspendable,
* creates delay, trigger scheduler, and fork sync variables,
* transforms timing controls and fork joins into C++ awaits
There are new functions in `V3SchedTiming.cpp` (used by `V3Sched.cpp`)
that integrate static scheduling with timing. This involves providing
external domains for variables, so that the necessary combinational
logic gets triggered after coroutine resumption, as well as statements
that need to be injected into the design eval function to perform this
resumption at the correct time.
There is also a function that transforms forked processes into separate
functions.
See the comments in `verilated_timing.h`, `verilated_timing.cpp`,
`V3Timing.cpp`, and `V3SchedTiming.cpp`, as well as the internals
documentation for more details.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Various optimizations to speed up MTasks coarsening (which is the long
pole in the multi-threaded scheduling of very large designs).
The biggest impact ones:
- Use efficient hand written Pairing Heaps for implementing priority
queues and the scoreboard, instead of the old SortByValueMap. This
helps us avoid having to sort a lot of merge candidates that we will
never actually consider and helps a lot in performance.
- Remove unnecessary associative containers and store data structures
(the heap nodes in particular) directly in the object they relate to.
This eliminates a huge amount of lookups and helps a lot in
performance.
- Distribute storage for SiblingMC instances into the LogicMTask
instances, and combine with the sibling maps. This again eliminates
hash table lookups and makes storage structures smaller.
- Remove some now bidirectional edge maps, keep only the forward map.
There are also some other smaller optimizations:
- Replaced more unnecessary dynamic_casts with static_casts
- Templated some functions/classes to reduce the number of static
branches in loops.
- Improves sorting of edges for sibling candidate creation
- Various micro-optimizations here and there
This speeds up MTask coarsening by 3.8x on a large design, which
translates to a 2.5x speedup of the ordering pass in multi-threaded
mode. (Combined with the earlier optimizations, ordering is now 3x
faster.)
Due to the elimination of a lot of the auxiliary data structures, and
ensuring a minimal size for the necessary ones, memory consumption of
the MTask coarsening is also reduced (measured up to 4.4x reduction
though the accuracy of this is low).
The algorithm is identical except for minor alterations of the order
some candidates are added or removed, this can cause perturbation in the
output due to tied scores being broken based on IDs.
Various optimizations to speed up MTasks coarsening (which is the long
pole in the multi-threaded scheduling of very large designs).
The biggest impact ones:
- Use efficient hand written Pairing Heaps for implementing priority
queues and the scoreboard, instead of the old SortByValueMap. This
helps us avoid having to sort a lot of merge candidates that we will
never actually consider and helps a lot in performance.
- Remove unnecessary associative containers and store data structures
(the heap nodes in particular) directly in the object they relate to.
This eliminates a huge amount of lookups and helps a lot in
performance.
- Distribute storage for SiblingMC instances into the LogicMTask
instances, and combine with the sibling maps. This again eliminates
hash table lookups and makes storage structures smaller.
- Remove some now bidirectional edge maps, keep only the forward map.
There are also some other smaller optimizations:
- Replaced more unnecessary dynamic_casts with static_casts
- Templated some functions/classes to reduce the number of static
branches in loops.
- Improves sorting of edges for sibling candidate creation
- Various micro-optimizations here and there
This speeds up MTask coarsening by 3.8x on a large design, which
translates to a 2.5x speedup of the ordering pass in multi-threaded
mode. (Combined with the earlier optimizations, ordering is now 3x
faster.)
Due to the elimination of a lot of the auxiliary data structures, and
ensuring a minimal size for the necessary ones, memory consumption of
the MTask coarsening is also reduced (measured up to 4.4x reduction
though the accuracy of this is low).
The algorithm is identical except for minor alterations of the order
some candidates are added or removed, this can cause perturbation in the
output due to tied scores being broken based on IDs.
While keeping the client code abstract in PartPropagateCp is nice for
testing, there is performance to be had removing the abstraction. As
this code dominates in scheduling large designs, we eliminate the
abstraction and re-work the testing to use the actual LogicMTask and
MTaskEdge graph types. No functional change intended.
Instead of deleting then re-allocating MTaskEdge instances when merging
two MTasks, just redirect the edged of the donor MTask to the recipient
MTask. This is both faster as it avoids an allocation and a deletion,
together with one update of the sibling maps, and also makes the
algorithm more stable due to MergeCandidate IDs being stable and
allocated up front for all MTaskEdges, before any SiblingMCs are
allocated.
Perturbations in output are expected as the IDs used to break ties
between merge candidates with equal costs are not updated when
redirecting an edge (on purpose). The relinking of only one end of the
graph edges also perturbs the order in which they are enumerated, which
does change candidate opportunities when the number of edges is larger
than PART_SIBLING_EDGE_LIMIT. Confirmed output is identical when
IDs are updated and edges are updated to appear in their original order.
The critical path propagation used to rely on a pointer comparison to
break equal scoring critical path updates. Use the corresponding mtask
ids instead, which is deterministic across invocations.
siblingPairFromRelatives gathers neighbours of a vertex, and sorts them.
It then takes the N best nodes, and creates sibling merge candidates
from them. We now use the unadjusted cost instead of the step cost of
the vertices when sorting. This is both faster as we need not do the
log-space rounding to compute stepCost, and will also make similar but
yet cheaper nodes appear closer to the front as we don't lose precision
in rounding, hence they are more likely to be entered as merge
candidates. Note that when creating the merge candidate, we still use
the stepCost, so it's purpose of reducing the propagation of critical
path updates is maintained in full. In summary, this should make both
Verilator and the generated model very slightly faster, at least in
theory, and I have observed minor improvement in places.
GraphStreamUnordered used to be GraphStream<std::less<const
V3GraphVertex*>>, but a lot of performance improvements can be had by a
specialized implementation, so added a highly optimized one. This helps
a lot with --debug-partition.
Fix compile error for queue method usage, if it is the
first statement in a block of code, and the return
value is not used. Example:
> if (foo)
> void'(bar.pop_front());
* Tests: Add a test to reproduce #3399
* Fix#3399. When reading an inout port in a module, it should refer the
original inout port, not the generated MODTEMP.
Keep a single std::set of key/value pairs, and a single unordered_map
from key to iterators into the set. Also improve some of the accessing
mechanisms using modern C++. This speeds up multi-threaded ordering by
about 10%.
Similarly to the earlier patch that defers constant folding on optimized
logic, now we also defer the variable substitutions as well. This again
eliminates a lot of traversals, and yields another ~10x speedup of V3Gate
on a design where V3Gate used to dominate while producing identical
results.
Rather than constant folding each logic block after every substitution,
only constant fold updated blocks when re-analysed, or at the end. This
removes a lot of invocations of V3Const on large blocks that can be
optimized well, and should yield the same result.
This speeds up V3Gate by ~4x on a design where V3Gate dominates.
Speed improvements:
- Use a direct, recursion-free implementation
- Improve pre-fetching
Functionality:
- Support remove/replace of currently iterated node
__gcov_flush was a private function and was removed from later GCC
versions (at least from 11.2.0, possibly earlier). Replace with the
documented public __gcov_dump.
Set default value of --comp-limit-parens to 240, to respect default
maximum nesting of parentheses in clang (which is controlled by
-fbracket-depth and defaults to 256). For code generation consistency,
also use the same default with gcc.
* Tests: Add a test to reproduce #3509
* Tests: Compile without tautological-compare check because bit op tree optimization is disabled in the test.
* Internals: Dedup code. No functional change is intended.
* Fix#3509.
"2'b10 == (2'b11 & {1'b0, val[0]})" and "2'b10 != (2'b11 & {1'b0, val[0]})" were
wrongly optimized to "!val[0]" and "val[0]" respectively.
Now properly optimize them to 1'b0 and 1'b1.
* Commentary
* Commentary: Update Changes
Associative arrays that specify a wildcard index type may be indexed by
integral expressions of any size, with leading zeros removed
automatically. A natural representation for such expressions is a
string, especially that the standard explicitly specifies automatic
casts from string indices to bit vectors of equivalent size.
The automatic cast part is done implicitly by the existing type system.
A simpler way to just make this work would be to convert wildcard index
type to a string type directly in the parser code, but several new AST
classes are needed to make sure illegal method calls are detected.
The verilated data structure implementation is reused, because there is
no need for differentiating the behavior on C++ side.
All remaining use of conditional compilation in the tracing
implementation of the run-time library are replaced with the use of
VerilatedModel::traceConfig, and is now done at run-time.
Step towards a proper run-time library. Reduce the amount of ifdefs in
the implementation of offloaded tracing. There are still a very small
number of ifdefs left, which will need more careful changes in order to
keep user API compatibility.
Make the external domains provider of ordering populate an output
vector, which then allows us to add multiple external sensitivities to
combinational logic.
Using unlinkFrBackWithNext is O(n) in the size of the list if unlinking
from the middle, so addHereThisAsNext also had this complexity. This
patch implements addHereThisAsNext directly, which is always O(1).
Use std::partial_sort for the non-exhaustive case. This is O(n) instead
of O(n*log(n)) in the size of the candidate list being sorted. (It
actually is O(n*log(k)), but k is constant 6 in the non-exhaustive
case).
* Tests: Add a test to reproduce #3470
* Update LSB during return path of traversal. No functional change is intended.
* Introduce LeafInfo::m_msb
* Update LeafInfo::m_msb when visitin AstCCast
* Internals: Add comment, reorder. No functional change is intended.
* Delete explicit from copy constructor to fix build error.
* Update Changes
* Internals: Remove unused parameter. No functional change is intended.
* Tests: Add explanation to t_const_opt.
* Internals: Let LeafInfo class. No functional change is intended.
* Internals: Rename LeafInfo::width -> LeafInfo::varWidth(). No functional change is intende.
* Tests: Check BitOpTree statistics in t_const_opt.
* Tests: Add a test to reproduce #3445
* Fix#3445. Don't forget LSB of frozen node in BitOpTreeOpt.
* Apply suggestions from code review
Co-authored-by: Geza Lore <gezalore@gmail.com>
VCD tracing is now parallelized using the same thread pool as the model.
We achieve this by breaking the top level trace functions into multiple
top level functions (as many as --threads), and after emitting the time
stamp to the VCD file on the main thread, we execute the tracing
functions in parallel on the same thread pool as the model (which we
pass to the trace file during registration), tracing into a secondary
per thread buffer. The main thread will then stitch (memcpy) the buffers
together into the output file.
This makes the `--trace-threads` option redundant with `--trace`, which
now only affects `--trace-fst`. FST tracing uses the previous offloading
scheme.
This obviously helps a lot in VCD tracing performance, and I have seen
better than Amdahl speedup, namely I get 3.9x on XiangShan 4T (2.7x on
OpenTitan 4T).
V3MergeCond merges consecutive conditional `_ = cond ? _ : _` and
`if (cond) ...` statements. This patch adds an analysis and ordering
phase that moves statements with identical conditions closer to each
other, in order to enable more merging opportunities. This in turn
eliminates a lot of repeated conditionals which reduced dynamic branch
count and branch misprediction rate. Observed 6.5% improvement on
multi-threaded large designs, at the cost of less than 2% increase in
Verilation speed.
For ordering, only the scope of logic vertices should be relevant, so
remove the scope pointer from OrderEitherVertex and move it into
OrderLogicVertex. This does not change single-threaded scheduling at
all. Theoretically, multi-threaded scheduling should not be affected
either though due to some implementation quirk depending on vertex order
in a graph the MT schedule is perturbed by this change, but the
performance effect of this is negligible on all benchmarks I have access
to.
No functional change intended.
Fixes#3442
This is a major re-design of the way code is scheduled in Verilator,
with the goal of properly supporting the Active and NBA regions of the
SystemVerilog scheduling model, as defined in IEEE 1800-2017 chapter 4.
With this change, all internally generated clocks should simulate
correctly, and there should be no more need for the `clock_enable` and
`clocker` attributes for correctness in the absence of Verilator
generated library models (`--lib-create`).
Details of the new scheduling model and algorithm are provided in
docs/internals.rst.
Implements #3278
This is a simple debugging aid to allow constraining the graph layout
via GraphViz rank directives. Note this is not related in any way to the
vertex 'rank' attribute used by some of the graph algorithms.
No functional change.
Some cases of warnings about the use of blocking and non-blocking
assignments in combinational vs sequential processes were suppressed in
a way that is inconsistent with the *actual* current execution model of
Verilator. Turning these back on to, well, warn the user that these might
cause unexpected results. V5 will clean these up, but until then err on
the side of caution.
Fixes#864.
Static variable initializers run before initial blocks, so use an
explicitly different procedure type for them. This also enables us to
now raise errors for assignments to const variables in initial blocks.
At the end of V3Param, fix up the module list to be topologically
sorted. We need to do this at the end as a later instantiation of a
recursive module might instantiate an earlier specialization, which we
cannot know until we processed everything. The rest of the compiler
depends on the module list being topologically sorted.
Fixes#3393
Rename AstNodeModule::hierName -> someInstanceName and explain that this
is only used for user messages.
Rename AstNode::locationStr -> instanceStr and simplify implementation.
In particular, do not report an instance if we can't find a reasonable
guess.
The --prof-threads option has been split into two independent options:
1. --prof-exec, for collecting verilator_gantt and other execution
related profiling data, and
2. --prof-pgo, for collecting data needed for PGO
The implementation of execution profiling is extricated from
VlThreadPool and is now a separate class VlExecutionProfiler. This means
--prof-exec can now be used for single-threaded models (though it does
not measure a lot of things just yet). For consistency VerilatedProfiler
is renamed VlPgoProfiler. Both VlExecutionProfiler and VlPgoProfiler are
in verilated_profiler.{h/cpp}, but can be used completely independently.
Also re-worked the execution profile format so it now only emits events
without holding onto any temporaries. This is in preparation for some
future optimizations that would be hindered by the introduction of function
locals via AstText.
Also removed the Barrier event. Clearing the profile buffers is not
notably more expensive as the profiling records are trivially
destructible.
- Always use a fast function to replace a slow one if available
- Iterate to fixed point (i.e.: if combining made more functions
identical, combine those too). This will be more useful in the future.
- Use only single, const traversal
Introduce VNRef that can be used to wrap AstNode keys in STL
collections, resulting in equality comparisons rather than identity
comparisons. This can then replace the SenTreeSet data-structure.
Using the 'forceable' directive in a configuration file, or the /*
verilator forceable */ metacomment on a variable declaration will
generate additional public signals that allow the specified signals to
be forced/released from the C++ code.
- Add more tests, including for tracing.
- Apply some cleaner, more generic abstractions in the implementation.
- Use clearer AstRelease which is not an assignment.
Avoid cloning the module when inlining the last instance that references
that module. This saves a lot of memory because it saves cloning
singleton modules (those with a single instance), which we always
inline. The top few levels of the hierarchy are often simple wrappers,
including the one added by Verilator in V3LinkLevel::wrapTop. Cloning
these and putting off deleting the originals can be very expensive
because they often have a lot of contents inlined into them, so each
layer of wrapper that is inlined would essentially add a whole new clone
of the large top-level. Directly inlining the module for the last cell
without cloning saves us from all this duplicate memory consumption and
also from having to create the clones in the first place.
Also added minor traversal speedups
This reduces the memory consumption of V3Inline by 80% and peak memory
consumption of Verilator by about 66% on a large design, while speeding
up the V3Inline pass by ~3.5x and the whole of Verilator by ~8% while
producing identical output.
- More efficient comparison by pre-computing sorting keys.
- Remove work items in algorithms known to be redundant earlier.
This greatly reduces data structure sizes.
- Use V3GraphVertex->user() for state tracking instead of unordered_map
while both of these are constant time, they do add up.
- In `makeMinSpanningTree`, instead of batch inserting outgoing edges of
each visited vertex into an ordered set, keep an ordered set of sorted
vectors of edges. This reduces the size of the ordered set
significantly (it is now O(V) rather than O(E), and as the subject
graph is a complete graph, V ~ sqrt(E), so this is a significant gain).
- Use a vector + sorting in `perfectMatching` instead of an ordered set.
This is faster on large working sets.
This yields 3.8x speedup on the variable order pass and overall 14%
verilation speed gain on a large design.
Repeatedly traversing whole modules in emit (due to file splitting)
looking for `systemc_* sections can add up to a lot of time on large
designs that have been flattened and need to be split into many files.
Assuming `systemc_* is a rarely used feature, just don't bother if we
don't need to. This gain 9% verilation speed improvement on a large
benchmark.