Multiple edge timing controls in class methods would cause compilation errors on
the generated C++ code. This is because the `SenExprBuilder` used for these
would get recreated per timing control, resulting in duplicate variable names.
The fix is to have a single `SenExprBuilder` per scope.
* Add VL_ASSERT_CAPABILITY; add assumeLocked and pretendUnlock to V3Mutex.
* Pass jobs as template-arguments and use std::packaged_task.
* Add and use V3ThreadPool::ScopedExclusiveAccess.
Given nested forks, if the inner fork had a `join` or `join_any` at the end,
`V3Sched::transformForks()` would decide that the fork's `VlForkSync` variable
should be passed in from the outside. This resulted in the `VlForkSync` getting
redeclared as a function argument. Ultimately, it led to C++ compilation errors
due to variable redeclaration.
Fixed by rearranging the `if`s that decide whether a variable should be passed
in or left as-is.
`stopRequested()` reads only atomic variables. It doesn't need a mutex
to do this.
This function is called in `waitIfStopRequested()`, which in turn
is called before execution of every job, and inside some jobs. With this
change the mutex inside `waitIfStopRequested` needs to be locked only in
very rare cases instead of every time.
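The fast path described above can be sketched as follows. This is a minimal illustration, not the actual thread pool code: the `StopControl` class, its members, and the `lockCount()` helper are hypothetical names introduced only to show that the mutex is taken solely when a stop has been requested.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the stop handling of a thread pool.
class StopControl {
    std::atomic<bool> m_stopRequested{false};
    std::mutex m_mutex;
    int m_lockCount = 0;  // Counts slow-path entries; for demonstration only

public:
    // Reads only an atomic variable, so no mutex is needed here.
    bool stopRequested() const { return m_stopRequested.load(std::memory_order_acquire); }

    void requestStop() { m_stopRequested.store(true, std::memory_order_release); }

    // Fast path: return without locking when no stop is requested.
    // Slow path: take the mutex (a real implementation would then wait on a
    // condition variable until it is allowed to continue).
    void waitIfStopRequested() {
        if (!stopRequested()) return;
        std::lock_guard<std::mutex> lock(m_mutex);
        ++m_lockCount;
    }

    int lockCount() const { return m_lockCount; }
};
```

Since `waitIfStopRequested()` runs before every job, keeping the common case lock-free removes a contended lock from the hot path.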
Event-triggered coroutines live in two stages: 'uncommitted' and 'ready'. First
they land in 'uncommitted', meaning they can't be resumed yet. Only after the
coroutines from the 'ready' queue are resumed are the 'uncommitted' ones moved
to the 'ready' queue, where they can be resumed. This is to avoid self-triggering
in situations like waiting for an event immediately after triggering it.
However, there is an issue with `wait` statements. A `wait(b)` is translated
into a loop that awaits a change in `b` for as long as `b` is false. If `b` is
false at first, the coroutine is put into the 'uncommitted' queue. If `b` is
set to true before the coroutine is committed, the coroutine won't get
resumed.
This patch fixes that by immediately committing event controls created from
`wait` statements. That means the coroutine from the example above will get
resumed from now on.
This makes the implementation of the detection and propagation of the
suspendable property simpler and easier to read. More importantly, there are no
more jumps around the AST with the `visit` functions, which in some cases could
result in incorrect visitor context while in the `visit` function. See the added
test, which would cause Verilator to segfault before this patch.
In testing, verilation performance was not shown to be affected by this change,
though there is a slight performance improvement from this patch, due to adding
one more check before refreshing the class member cache.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
* V3Common.cpp::makeVlToString: fix `VL_TOSTRING_W` statement generation to include width argument
* fix contribution name
* add testcase for long struct `VL_TO_STRING_W` bug
This patch fixes two cases where methods in base classes were not being marked
as coroutines, even though they were being overridden by coroutines.
- One case is the class member cache not getting refreshed for searched classes.
- The other is when the overriding methods are not declared as `virtual`. In
that case, the `isVirtual()` getter on such a method returns false, which led
to `V3Timing` skipping the step of searching for overridden methods.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Static variables of functions are created in the function. When blocks
in a function use identical names for static variables, we need to name
those variables properly.
Pack the elements of VlTriggerVec as dense bits (instead of a 1 byte
bool per bit), and check whether they are set on a word granularity.
This effectively transforms conditions of the form `if (trig.at(0) |
trig.at(2) | trig.at(64))` into `if (trig.word(0) & 0x5 | trig.word(1) &
0x1)`. This improves OpenTitan ST by about 1%, worth more on some other
designs.
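A minimal sketch of the dense packing, assuming 64-bit words. `PackedTriggers` is an illustrative stand-in and not the real `VlTriggerVec`; the two helper functions only demonstrate that the per-bit and per-word conditions quoted above are equivalent.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative stand-in for a densely packed trigger vector.
template <std::size_t N_BITS>
class PackedTriggers {
    static constexpr std::size_t N_WORDS = (N_BITS + 63) / 64;
    std::array<std::uint64_t, N_WORDS> m_words{};  // One bit per trigger

public:
    void set(std::size_t bit) { m_words[bit / 64] |= std::uint64_t{1} << (bit % 64); }
    bool at(std::size_t bit) const { return ((m_words[bit / 64] >> (bit % 64)) & 1U) != 0; }
    std::uint64_t word(std::size_t idx) const { return m_words[idx]; }
};

// Per-bit form: three separate bit extractions.
template <std::size_t N>
bool anyPerBit(const PackedTriggers<N>& t) {
    return t.at(0) | t.at(2) | t.at(64);
}

// Per-word form: bits 0 and 2 share word 0 (mask 0x5), bit 64 is bit 0 of word 1.
template <std::size_t N>
bool anyPerWord(const PackedTriggers<N>& t) {
    return ((t.word(0) & 0x5) | (t.word(1) & 0x1)) != 0;
}
```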
`VlNow{}` is completely unnecessary, as coroutines are always on the
heap (unless optimized out). Also fix access of var ref passed to forked processes.
Given an await at the end of a block, e.g. at the end of a loop body, a trace
activity setter was not inserted, as there were no following statements. This
patch makes the activity update unconditional.
Before this patch, it was possible to access non-static class members using
static access, which resulted in C++ compilation errors. This adds
verilation-time checks for such situations.
Before this patch, calling tasks directly under forks would result in each
statement of these tasks being executed concurrently. This was due to Verilator
inlining tasks most of the time. Such inlined tasks' statements would simply
replace the original call, and there would be no indication that these used to
be grouped together. Ultimately, this resulted in `V3Timing` treating each
statement as a separate process.
The solution is simply to wrap each fork sub-statement in a begin in `V3Begin`
(except for the ones that are begins, as that would be pointless). `V3Begin` is
already aware of forks, and is supposed to avoid issues like this one, so it
seems like a natural fit. This also protects us from similar bugs, i.e. if some
statement gets replaced or expanded into multiple statements.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
The baked-in DEFENV paths might end up with extra NUL characters
at the end if the binaries are installed by something that patches them
for relocatable installs (e.g. conda). Avoid this issue by immediately
passing them through the std::string::c_str() method to stop at the first NUL.
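The effect of round-tripping through `c_str()` can be shown with a short sketch. The helper name `trimAtFirstNul` is an assumption made for illustration; it is not claimed to be a function in the code base.

```cpp
#include <cassert>
#include <string>

// Returns a copy of `s` truncated at the first NUL character, if there is one.
// Constructing a std::string from a `const char*` stops at the first NUL,
// whereas the original std::string may carry embedded NULs in its length.
inline std::string trimAtFirstNul(const std::string& s) { return std::string(s.c_str()); }
```

For example, a patched 18-byte string holding a 14-character path followed by four NUL bytes comes back as the 14-character path.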
* Support class method calls without parentheses
Signed-off-by: Ryszard Rozak <rrozak@antmicro.com>
* Delete replaced nodes
Signed-off-by: Ryszard Rozak <rrozak@antmicro.com>
---------
Signed-off-by: Ryszard Rozak <rrozak@antmicro.com>
Fixes #3872.
Testing this is a bit tricky, as the front-end fixes up the operand
widths in shifts to match, and we need V3Const to introduce a mismatched
one by reducing `4'd2 ** x` (with x being a 2-bit wide signal) to `4'd1
<< x`, but t_dfg_peephole runs with V3Const disabled exactly because it
makes it hard to write tests. Rather than fixing this one case in
V3Const (which we should do systematically at some point), I fixed DFG
to accept these just in case V3Const generates more of them. The
assertions were there only because of paranoia (as I thought these were
not possible inputs), the code otherwise works.
In order to avoid unexpected breakage on multi-driven variables, we
resolve in DFG construction by using only the first driver encountered.
Also issues the MULTIDRIVEN error for these signals.
Replace the 'run to fixed point' algorithm with a work list driven
approach. Instead of marking the graph as changed, we explicitly add
vertices to the work list, to be visited, when a vertex is changed. This
improves both memory locality (as the work list is processed in last in
first out order), and removes unnecessary visitations when only a few
nodes change.
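The work list scheme can be sketched on a toy graph. Everything here is hypothetical: the `Vertex` layout, the `visit` rule (recompute a value as the sum of the sources), and `runWorkList` stand in for the real DFG vertices and peephole rules. The point is only the control flow: a vertex is visited when it is on the list, and a change enqueues just the vertices that read it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy graph vertex; indices refer to positions in the owning vector.
struct Vertex {
    int value = 0;
    std::vector<std::size_t> sources;  // Vertices this one reads
    std::vector<std::size_t> sinks;    // Vertices that read this one
};

// Toy "optimization": recompute the value as the sum of the sources.
// Returns true if the vertex changed.
inline bool visit(std::vector<Vertex>& graph, std::size_t idx) {
    Vertex& v = graph[idx];
    if (v.sources.empty()) return false;
    int sum = 0;
    for (const std::size_t s : v.sources) sum += graph[s].value;
    if (sum == v.value) return false;
    v.value = sum;
    return true;
}

// Processes the work list in last-in first-out order and returns the number
// of visits. Only the sinks of a changed vertex are added to the list.
inline int runWorkList(std::vector<Vertex>& graph, std::vector<std::size_t> workList) {
    std::vector<bool> queued(graph.size(), false);
    for (const std::size_t idx : workList) queued[idx] = true;
    int visits = 0;
    while (!workList.empty()) {
        const std::size_t idx = workList.back();
        workList.pop_back();
        queued[idx] = false;
        ++visits;
        if (!visit(graph, idx)) continue;
        for (const std::size_t s : graph[idx].sinks) {
            if (queued[s]) continue;
            queued[s] = true;
            workList.push_back(s);
        }
    }
    return visits;
}
```

Compared with re-running every rule over the whole graph until nothing changes, unrelated vertices are never revisited.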
Folding an AstLogAnd with a non-zero constant operand used to coerce the
type of the other operand, yielding an ill-typed node that DFG was then
unhappy about. Add a RedOr instead if the width of the replacement
operand is greater than zero.
Fixes #3726
Apart from the representational changes below, this patch renames
AstNodeMath to AstNodeExpr, and AstCMath to AstCExpr.
Now every expression (i.e.: those AstNodes that represent a [possibly
void] value, with value being interpreted in a very general sense) has
AstNodeExpr as a super class. This necessitates the introduction of an
AstStmtExpr, which represents an expression in statement position, e.g.:
'foo();' would be represented as AstStmtExpr(AstCCall(foo)). In exchange
we can get rid of isStatement() in AstNodeStmt, which now really always
represents a statement.
Peak memory consumption and verilation speed are not measurably changed.
Partial step towards #3420
In V3Active, we try hard to turn `always @(a or b or c)` into an
`always_comb` if the only variables read in the block are also in the
sensitivity list. In addition, also allow this optimization when reading
variables that are not in the sensitivity list, but are known to be
constant/never changing after initialization. In particular lookup
tables introduced by V3Table are covered by this. This can have a
significant impact on designs that use the `always @(a or b or c)` style
for combinational logic.
The cost of an AstCMethodHard right now is generally unknown. However,
VlTriggerVec::at is used a lot in conditions, so we make an effort
to estimate this correctly via 2 changes:
- In general when an AstVarRef appears as the target of an
AstCMethodHard, we cost it as a simple address computation (an add)
- Check for VlTriggerVec::at explicitly when costing AstCMethodHard,
which is essentially a load.
This can have a significant effect when there are a lot of unique
triggers in the design.
In non-static contexts like class objects or stack frames, the use of
global trigger evaluation is not feasible. The concept of dynamic
triggers allows for trigger evaluation in such cases. These triggers are
simply local variables, and coroutines are themselves responsible for
evaluating them. They await the global dynamic trigger scheduler object,
which is responsible for resuming them during the trigger evaluation
step in the 'act' eval region. Once the trigger is set, they await the
dynamic trigger scheduler once again, and then get resumed during the
resumption step in the 'act' eval region.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
This changeset brings support for accesses like:
class Cls#(type TYPE1);
TYPE1::some_method();
endclass
It is done by delaying dot resolution on type parameters until they get
resolved by V3Param, and doing a more thorough reference skip.
In DFG DfgVertex::width() is only defined for vertices representing
packed values, which DfgVertex::hash() used to violate. The only
non-packed values at the moment are DfgVarArray, which is a
DfgVertexVar, which are handled specially anyway, so this is easy to
fix.
Fixes #3682
In order to not leak signal names with --protect-ids, we simply make the
trigger dump function empty (this is a debug only construct).
Partial fix for #3689
The emitted SystemC types (e.g. sc_bv) are not interchangeable with
Verilator internal C++ types (e.g.: VlWide), so the variables themselves
are not interchangeable (but can be assigned to/from each other). We can
preserve correctness simply by not inlining any SystemC variables (i.e.:
don't simplify any 'sc = nonSc' or 'nonSc = sc' assignments). SystemC
types only appear at top level ports so this should have no significant
impact.
Fixes #3688
Prevents the possibility of assigning an integer to a class reference,
both at the SystemVerilog and the emitted C++ levels.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Vertices representing variables (DfgVertexVar) and constants (DfgConst)
are very common (40-50% of all vertices created in some large designs),
and we also need to, or can, treat them specially in algorithms. Keep
these as separate lists in DfgGraph for direct access to them. This
improves verilation speed.
Cyclic components are now extracted separately, so there is no
functional reason to have to do a topological sort (previously we used it
to detect cyclic graphs). Removing it to gain some speed.
AstSel is a ternary node, but the 'widthp' is always constant and is
hence redundant, and 'lsbp' is very often constant. As AstSel is fairly
common, we special case it as a DfgSel for a constant 'lsbp', and as
a DfgMux for a non-constant 'lsbp'.
Added a DfgVertex::user() mechanism for storing data in vertices.
Similar in spirit to AstNode user data, but the generation counter is
stored in the DfgGraph the vertex is held under. Use this to cache
DfgVertex::hash results, and also speed up DfgVertex hashing in general.
Use these and additional improvements to speed up CSE.
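A generation-counter scheme of this kind might look like the sketch below. The class and member names are assumptions made for illustration and do not mirror the actual `DfgGraph`/`DfgVertex` interface; `cachedValue` uses a placeholder formula where a real implementation would compute something expensive, such as a hash.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical graph: owns the generation counter shared by all its vertices.
class Graph {
    std::uint32_t m_userGen = 1;  // Starts at 1 so fresh vertices (generation 0) hold no data

public:
    // Begins a new session, invalidating all stored user data in O(1).
    // (Counter wrap-around is ignored in this sketch.)
    void newUserSession() { ++m_userGen; }
    std::uint32_t userGen() const { return m_userGen; }
};

class Vertex {
    const Graph& m_graph;
    const std::uint64_t m_id;
    std::uint32_t m_userGen = 0;  // Generation in which m_user was last written
    std::uint64_t m_user = 0;

public:
    Vertex(const Graph& graph, std::uint64_t id) : m_graph(graph), m_id(id) {}
    std::uint64_t id() const { return m_id; }
    // User data is valid only if written during the graph's current generation.
    bool hasUser() const { return m_userGen == m_graph.userGen(); }
    std::uint64_t getUser() const { return m_user; }
    void setUser(std::uint64_t value) {
        m_user = value;
        m_userGen = m_graph.userGen();
    }
};

// Caches a per-vertex result for the duration of one session.
inline std::uint64_t cachedValue(Vertex& v, int& computations) {
    if (!v.hasUser()) {
        ++computations;
        v.setUser(v.id() * 31 + 7);  // Placeholder for an expensive computation
    }
    return v.getUser();
}
```

Bumping the counter avoids walking every vertex to clear stale data between passes.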
`V3SchedTiming` currently assumes that if a fork still exists, it must
have statements within it (otherwise it would have been deleted by
`V3Timing`). However, in a case like this:
```
module t;
reg a;
initial fork a = 1; join
endmodule
```
the assignment in the fork is optimized out by `V3Dead` after
`V3Timing`. This leads to `V3SchedTiming` accessing fork's `stmtsp`
pointer, which at this point is null. This patch addresses that issue.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Allow constant folding through adjacent nodes of all associative
operations, for example '((a & 2) & 3)' or '(3 & (2 & a))' can now be
folded into '(a & 2)' and '(2 & a)' respectively. Also improve speed of
making associative expression trees right leaning by using rotation of
the existing vertices whenever possible instead of allocation of new nodes.
Only apply when there is guaranteed to be a subsequent constant folding
and elimination of some of the expression, otherwise this sometimes
interferes with the simplification of concatenations and harms overall
performance.
Before this change, a design verilated with `--timing` that does not
actually use timing features would be emitted with `eventsPending` and
`nextTimeSlot` declared in the top class. However, their definitions
would be missing, leading to linker errors during design compilation.
This patch makes Verilator always emit the definitions, which prevents
linker errors. Trying to use `nextTimeSlot` without delays in the design
will result in an error at runtime.
Also added a testing only -fno-const-before-dfg option, as otherwise
V3Const eats up a lot of the simple inputs. A lot of the things V3Const
swallows in the simple cases can make it to DFG in complex cases, or DFG
itself can create them during optimization. In any case to save
complexity of testing DFG constant folding, we use this option to turn
off V3Const prior to the DFG passes in the relevant test.
Some optimizations are only a net win if they help us remove a graph
node (or at least ensure they don't grow the graph), or yield otherwise
special logic, so try to apply them only in these cases.
Use the same style, and reuse the bulk of astgen to generate DfgVertex
related code. In particular allow for easier definition of custom
DfgVertex sub-types that do not directly correspond to an AstNode
sub-type. Also introduces specific names for the fixed arity vertices.
No functional change intended.
A lot of optimizations in DFG assume a DAG, but the more things are
representable, the more likely it is that a small cyclic sub-graph is
present in an otherwise very large graph that is mostly acyclic. In
order to avoid losing optimization opportunities, we explicitly extract
the cyclic sub-graphs (which are the strongly connected components +
anything feeding them, up to variable boundaries) and treat them
separately. This enables optimization of the remaining input.
This change introduces a custom reference-counting pointer class that
allows creating such pointers from 'this'. This lets us keep the
receiver object around even if all references to it outside of a class
method no longer exist. Useful for coroutine methods, which may outlive
all external references to the object.
The deletion of objects is deferred until the next time slot. This is to
make clearing the triggered flag on named events in classes safe
(otherwise freed memory could be accessed).
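One common way to allow creating a counted pointer from `this` is intrusive reference counting, where the count lives in the object itself. The sketch below illustrates that general technique under assumed names (`RefCounted`, `Ref`, `Tracked`); it is not the class introduced by this change, and it deletes immediately rather than deferring deletion to the next time slot.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

template <typename T>
class Ref;

// Base class that stores the reference count inside the object.
class RefCounted {
    template <typename T>
    friend class Ref;
    std::size_t m_count = 0;

public:
    virtual ~RefCounted() = default;
    std::size_t refCount() const { return m_count; }
};

// Counted pointer. Because the count is in the object, a new Ref can be
// created from any raw pointer to it, including `this`.
template <typename T>
class Ref {
    T* m_objp = nullptr;

    void release() {
        if (m_objp && --m_objp->m_count == 0) delete m_objp;
        m_objp = nullptr;
    }

public:
    Ref() = default;
    explicit Ref(T* objp) : m_objp(objp) {
        if (m_objp) ++m_objp->m_count;
    }
    Ref(const Ref& other) : Ref(other.m_objp) {}
    Ref(Ref&& other) noexcept : m_objp(other.m_objp) { other.m_objp = nullptr; }
    Ref& operator=(Ref other) {
        std::swap(m_objp, other.m_objp);
        return *this;
    }
    ~Ref() { release(); }
    T* operator->() const { return m_objp; }
};

// Example class whose method pins its own receiver.
class Tracked final : public RefCounted {
    bool* const m_destroyedp;  // Set to true on destruction, for demonstration

public:
    explicit Tracked(bool* destroyedp) : m_destroyedp(destroyedp) {}
    ~Tracked() override { *m_destroyedp = true; }
    // The object stays alive for as long as the returned Ref exists.
    Ref<Tracked> pinSelf() { return Ref<Tracked>(this); }
};
```

A coroutine method could hold the result of `pinSelf()` in its frame, so the receiver survives even after every external reference is dropped.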
Added DfgVertexVariadic to represent DFG vertices with a varying number
of source operands. Converted DfgVar to be a variadic vertex, with each
driver corresponding to a fixed range of bits in the packed variable.
This allows us to handle AstSel on the LHS of assignments. Also added
support for AstConcat on the LHS by selecting into the RHS as
appropriate.
This improves OpenTitan ST speed by ~13%
This is only a debugging aid at this point, so compile out of the
release build. This reduces peak memory consumption by 4-5%. We still
keep the global counters to detect whether the tree has changed, to avoid
unnecessary dumps.
Multiple tricks to reduce the size of class FileLine from 72 to 40
bytes:
- Reduce file name index from 32 to 16 bits. This still allows 64K
unique input files, which is hopefully enough.
- Intern message/warning enable bitset and use a 16-bit index, again
allowing 64K unique sets which is hopefully enough.
- Put the m_waive flag into the sign bit of one of the line numbers.
- Use explicit reference counting to avoid overhead of shared_ptr.
Added assertions to ensure interned data fits within its index space.
This saves ~5-10% peak memory consumption at no measurable run-time cost
on various designs.
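The flag-in-the-top-bit trick can be illustrated with a small sketch. `PackedLine` and its accessors are hypothetical and use an unsigned 32-bit field for clarity; the real `FileLine` layout is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Stores a line number in bits 30..0 and a boolean flag in bit 31,
// so the flag costs no additional storage.
class PackedLine {
    static constexpr std::uint32_t FLAG_MASK = 0x80000000U;
    std::uint32_t m_bits = 0;

public:
    std::uint32_t lineno() const { return m_bits & ~FLAG_MASK; }
    // Line numbers above 2^31 - 1 are truncated in this sketch.
    void lineno(std::uint32_t n) { m_bits = (m_bits & FLAG_MASK) | (n & ~FLAG_MASK); }

    bool waive() const { return (m_bits & FLAG_MASK) != 0; }
    void waive(bool flag) { m_bits = flag ? (m_bits | FLAG_MASK) : (m_bits & ~FLAG_MASK); }
};

static_assert(sizeof(PackedLine) == sizeof(std::uint32_t), "flag must not add storage");
```

The cost is one bit of line number range, which is negligible for source line counts.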
Added a new data-flow graph (DFG) based combinational logic optimizer.
Its capabilities cover a combination of those of V3Const and V3Gate, but
it is also more capable of transforming combinational logic into simplified
forms.
This entails adding a new internal representation, `DfgGraph`, and
appropriate `astToDfg` and `dfgToAst` conversion functions. The graph
represents some of the combinational equations (~continuous assignments)
in a module, and for the duration of the DFG passes, it takes over the
role of AstModule. The bulk of the DFG vertices represent expressions.
These vertex classes, and the corresponding conversions to/from AST are
mostly auto-generated by astgen, together with a DfgVVisitor that can be
used for dynamic dispatch based on vertex (operation) types.
The resulting combinational logic graph (a `DfgGraph`) is then optimized
in various ways. Currently we perform common sub-expression elimination,
variable inlining, and some specific peephole optimizations, but there
is scope for more optimizations in the future using the same
representation. The optimizer is run directly before and after inlining.
The pre inline pass can operate on smaller graphs and hence converges
faster, but still has a chance of substantially reducing the size of the
logic on some designs, making inlining both faster and less memory
intensive. The post inline pass can then optimize across the inlined
module boundaries. No optimization is performed across a module
boundary.
For debugging purposes, each peephole optimization can be disabled
individually via the -fno-dfg-peephole-<OPT> option, where <OPT> is one
of the optimizations listed in V3DfgPeephole.h, for example
-fno-dfg-peephole-remove-not-not.
The peephole patterns currently implemented were mostly picked based on
the design that inspired this work, and on that design the optimizations
yield a ~30% single threaded speedup, and a ~50% speedup on 4 threads. As
you can imagine, not having to haul around redundant combinational
networks in the rest of the compilation pipeline also helps with memory
consumption, and a reduction of up to 30% in Verilator's peak memory usage
was observed on the same design.
Gains on other arbitrary designs are smaller (and can be improved by
analyzing those designs). For example OpenTitan gains between 1-15%
speedup depending on build type.
- Rename `--dump-treei` option to `--dumpi-tree`, which itself is now a
special case of `--dumpi-<tag>` where tag can be a magic word, or a
filename
- Control dumping via static `dump*()` functions, analogous to `debug()`
- Make dumping independent of the value of `debug()` (so dumping always
works even without the debug flag)
- Add separate `--dumpi-graph` for dumping V3Graphs, which is again a
special case of `--dumpi-<tag>`
- Alias `--dump-<tag>` to `--dumpi-<tag> 3` as before
Use astgen to generate a more thorough version of AstNode::checkTree,
which checks that operands are of consistent structure and type, as
described in the @astgen op directives. Also change checkTree to always
run when --debug-check is given.
Fix discovered fallout.
Introduce the @astgen directives parsed by astgen, currently used for
the generation of child node (operand) accessors. Please see the updated
internal documentation for details.
This approach reduced the total time of the V3Undriven stage from 34.2 s to
2.5 s in a design containing almost 400 000 unused variables.
Signed-off-by: Kamil Rakoczy <krakoczy@antmicro.com>
Generate type specific static overloads of Ast<Node>::addNext, which
return the correct sub-type of the 'this' they were invoked on.
Also remove AstNode::addNextNull, which is now only used in the parser,
and implement it in verilog.y directly as a template function.
- Move DType representations into V3AstNodeDType.h
- Move AstNodeMath and subclasses into V3AstNodeMath.h
- Move any other AstNode subtypes into V3AstNodeOther.h
- Fix up out-of-order definitions via inline methods and implementations
in V3Inlines.h and V3AstNodes.cpp
- Enforce declaration order of AstNode subtypes via astgen,
which will now fail when definitions are mis-ordered.
Rely less on strings and represent AstNode classes as a 'class Node',
with all associated properties kept together, rather than distributed
over multiple dictionaries or constructed at retrieval time.
No functional change intended.
Small fixup patch so the 'ico' and 'act' scheduling sections could be
ordered as multi-threaded. However, we still only order these single
threaded at the moment (but switching them to multi-threaded now works).
Before this change, some forked processes were being inlined in
`V3Timing` because they contained no `CAwait`s. This only works under
the assumption that no `CAwait`s will be added there later, which is not
true, as a function called by a forked process could be turned into a
coroutine later. The call would be wrapped in a new `CAwait`, but the
process itself would have already been inlined at this point.
This commit moves the inlining to `transformForks` in `V3SchedTiming`,
which is called at a point when all `CAwait`s are already in place.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
The recent patch to defer substitutions on V3Gate crashes on circular
logic that has cycle length >= 3 with all inlineable signals (cycle
length 2 is detected correctly and is not inlined). Fix by stopping
recursion at the loop-back edge.
Fixes #3543
This is detritus from when V3TraceDecl used to run after V3Gate, today
V3TraceDecl runs before V3Gate and this value has no function at all.
No functional change intended.
dynamic_cast is not free. Replace obvious instances (where the result is
unconditionally dereferenced) with static_cast in contexts with
performance implications.
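The kind of replacement meant here can be sketched with a toy node hierarchy. `Node`, `AddNode`, and both functions are hypothetical; the `assert` in the `static_cast` version is an added debug-only guard, not something the text above prescribes.

```cpp
#include <cassert>

enum class Kind { Add, Other };

struct Node {
    virtual ~Node() = default;
    virtual Kind kind() const = 0;
};

struct AddNode final : Node {
    int lhs;
    int rhs;
    AddNode(int l, int r) : lhs(l), rhs(r) {}
    Kind kind() const override { return Kind::Add; }
};

// Before: the result is dereferenced unconditionally, so the run-time type
// check of dynamic_cast buys nothing (a nullptr result would crash anyway).
inline int sumDynamic(const Node& node) {
    const AddNode* const addp = dynamic_cast<const AddNode*>(&node);
    return addp->lhs + addp->rhs;
}

// After: static_cast has no run-time cost. The caller must guarantee the
// dynamic type; the assert documents that in debug builds.
inline int sumStatic(const Node& node) {
    assert(node.kind() == Kind::Add);
    const AddNode* const addp = static_cast<const AddNode*>(&node);
    return addp->lhs + addp->rhs;
}
```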
Replace std::set<SiblingMC> with V3Lists to keep track of SiblingMCs
associated with MTasks, use a std::set<LogicMTask*> for ensuring
uniqueness. This yields a bit more speed in PartContraction.
- Use modern C++
- Implement OrderLogicVertex->LogicMTask map with
OrderLogicVertex::userp(), instead of std::unordered_map
- Simplify data structures
- Simplify code and assert properties
No functional change.
Refactor ProcessMoveBuildGraph utilizing the fact that OrderGraph is a
bipartite graph, also remove unnecessary unordered_map and distribute
variable domain map. No functional change.
Adds timing support to Verilator. It makes it possible to use delays,
event controls within processes (not just at the start), wait
statements, and forks.
Building a design with those constructs requires a compiler that
supports C++20 coroutines (GCC 10, Clang 5).
The basic idea is to have processes and tasks with delays/event controls
implemented as C++20 coroutines. This allows us to suspend and resume
them at any time.
There are five main runtime classes responsible for managing suspended
coroutines:
* `VlCoroutineHandle`, a wrapper over C++20's `std::coroutine_handle`
with move semantics and automatic cleanup.
* `VlDelayScheduler`, for coroutines suspended by delays. It resumes
them at a proper simulation time.
* `VlTriggerScheduler`, for coroutines suspended by event controls. It
resumes them if its corresponding trigger was set.
* `VlForkSync`, used for syncing `fork..join` and `fork..join_any`
blocks.
* `VlCoroutine`, the return type of all verilated coroutines. It allows
for suspending a stack of coroutines (normally, C++ coroutines are
stackless).
There is a new visitor in `V3Timing.cpp` which:
* scales delays according to the timescale,
* simplifies intra-assignment timing controls and net delays into
regular timing controls and assignments,
* simplifies wait statements into loops with event controls,
* marks processes and tasks with timing controls in them as
suspendable,
* creates delay, trigger scheduler, and fork sync variables,
* transforms timing controls and fork joins into C++ awaits
There are new functions in `V3SchedTiming.cpp` (used by `V3Sched.cpp`)
that integrate static scheduling with timing. This involves providing
external domains for variables, so that the necessary combinational
logic gets triggered after coroutine resumption, as well as statements
that need to be injected into the design eval function to perform this
resumption at the correct time.
There is also a function that transforms forked processes into separate
functions.
See the comments in `verilated_timing.h`, `verilated_timing.cpp`,
`V3Timing.cpp`, and `V3SchedTiming.cpp`, as well as the internals
documentation for more details.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Various optimizations to speed up MTasks coarsening (which is the long
pole in the multi-threaded scheduling of very large designs).
The biggest impact ones:
- Use efficient hand written Pairing Heaps for implementing priority
queues and the scoreboard, instead of the old SortByValueMap. This
helps us avoid having to sort a lot of merge candidates that we will
never actually consider and helps a lot in performance.
- Remove unnecessary associative containers and store data structures
(the heap nodes in particular) directly in the object they relate to.
This eliminates a huge amount of lookups and helps a lot in
performance.
- Distribute storage for SiblingMC instances into the LogicMTask
instances, and combine with the sibling maps. This again eliminates
hash table lookups and makes storage structures smaller.
- Remove some now bidirectional edge maps, keep only the forward map.
There are also some other smaller optimizations:
- Replaced more unnecessary dynamic_casts with static_casts
- Templated some functions/classes to reduce the number of static
branches in loops.
- Improves sorting of edges for sibling candidate creation
- Various micro-optimizations here and there
This speeds up MTask coarsening by 3.8x on a large design, which
translates to a 2.5x speedup of the ordering pass in multi-threaded
mode. (Combined with the earlier optimizations, ordering is now 3x
faster.)
Due to the elimination of a lot of the auxiliary data structures, and
ensuring a minimal size for the necessary ones, memory consumption of
the MTask coarsening is also reduced (measured up to 4.4x reduction
though the accuracy of this is low).
The algorithm is identical except for minor alterations of the order in
which some candidates are added or removed. This can cause perturbation in
the output due to tied scores being broken based on IDs.
While keeping the client code abstract in PartPropagateCp is nice for
testing, there is performance to be had removing the abstraction. As
this code dominates in scheduling large designs, we eliminate the
abstraction and re-work the testing to use the actual LogicMTask and
MTaskEdge graph types. No functional change intended.
Instead of deleting then re-allocating MTaskEdge instances when merging
two MTasks, just redirect the edges of the donor MTask to the recipient
MTask. This is both faster as it avoids an allocation and a deletion,
together with one update of the sibling maps, and also makes the
algorithm more stable due to MergeCandidate IDs being stable and
allocated up front for all MTaskEdges, before any SiblingMCs are
allocated.
Perturbations in output are expected as the IDs used to break ties
between merge candidates with equal costs are not updated when
redirecting an edge (on purpose). The relinking of only one end of the
graph edges also perturbs the order in which they are enumerated, which
does change candidate opportunities when the number of edges is larger
than PART_SIBLING_EDGE_LIMIT. Confirmed output is identical when
IDs are updated and edges are updated to appear in their original order.
The critical path propagation used to rely on a pointer comparison to
break equal scoring critical path updates. Use the corresponding mtask
ids instead, which is deterministic across invocations.
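The difference between the two tie-breakers can be sketched as below. The `MTask` struct, the `pickCritical` function, and the choice that the lower id wins are assumptions for illustration; only the idea of comparing ids rather than addresses comes from the text above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical task record: ids are assigned in creation order, so they are
// the same on every run, unlike heap addresses.
struct MTask {
    std::uint32_t id;
    std::uint32_t cost;
};

// Picks the task with the higher cost. Equal costs are broken by id (lower id
// wins here, by assumption), never by comparing the pointers themselves.
inline const MTask* pickCritical(const MTask* ap, const MTask* bp) {
    if (ap->cost != bp->cost) return ap->cost > bp->cost ? ap : bp;
    return ap->id < bp->id ? ap : bp;
}
```

With a pointer comparison, the winner of a tie would depend on where the allocator happened to place each object, which can differ between invocations.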
siblingPairFromRelatives gathers neighbours of a vertex, and sorts them.
It then takes the N best nodes, and creates sibling merge candidates
from them. We now use the unadjusted cost instead of the step cost of
the vertices when sorting. This is both faster as we need not do the
log-space rounding to compute stepCost, and will also make similar but
yet cheaper nodes appear closer to the front as we don't lose precision
in rounding, hence they are more likely to be entered as merge
candidates. Note that when creating the merge candidate, we still use
the stepCost, so its purpose of reducing the propagation of critical
path updates is maintained in full. In summary, this should make both
Verilator and the generated model very slightly faster, at least in
theory, and I have observed minor improvement in places.
GraphStreamUnordered used to be GraphStream<std::less<const
V3GraphVertex*>>, but a lot of performance improvements can be had by a
specialized implementation, so added a highly optimized one. This helps
a lot with --debug-partition.
Fix compile error for queue method usage, if it is the
first statement in a block of code, and the return
value is not used. Example:
> if (foo)
> void'(bar.pop_front());
* Tests: Add a test to reproduce #3399
* Fix #3399. When reading an inout port in a module, it should refer to the
original inout port, not the generated MODTEMP.
Keep a single std::set of key/value pairs, and a single unordered_map
from key to iterators into the set. Also improve some of the accessing
mechanisms using modern C++. This speeds up multi-threaded ordering by
about 10%.
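The layout described above, one ordered set plus one hash map holding iterators into it, might be sketched like this. The class name `SortByValue`, the string keys, and the method names are illustrative assumptions, not the actual container in the code base.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>

// Keeps keys ordered by their value while allowing O(1) average key lookup.
class SortByValue {
    using Entry = std::pair<int, std::string>;  // (value, key): ordered by value, then key
    std::set<Entry> m_sorted;
    // std::set iterators stay valid when other elements are inserted or erased.
    std::unordered_map<std::string, std::set<Entry>::iterator> m_index;

public:
    // Inserts the key, or updates its value if already present.
    void put(const std::string& key, int value) {
        const auto found = m_index.find(key);
        if (found != m_index.end()) {
            m_sorted.erase(found->second);
            found->second = m_sorted.emplace(value, key).first;
        } else {
            m_index.emplace(key, m_sorted.emplace(value, key).first);
        }
    }

    // Removes the key; returns false if it was not present.
    bool erase(const std::string& key) {
        const auto found = m_index.find(key);
        if (found == m_index.end()) return false;
        m_sorted.erase(found->second);
        m_index.erase(found);
        return true;
    }

    // Key with the smallest value; the container must not be empty.
    const std::string& minKey() const { return m_sorted.begin()->second; }
    std::size_t size() const { return m_sorted.size(); }
};
```

Each key is stored in exactly two places, and an update touches one set node and one map slot.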
Similarly to the earlier patch that defers constant folding on optimized
logic, we now defer the variable substitutions as well. This again
eliminates a lot of traversals, and yields another ~10x speedup of V3Gate
on a design where V3Gate used to dominate while producing identical
results.
Rather than constant folding each logic block after every substitution,
only constant fold updated blocks when re-analysed, or at the end. This
removes a lot of invocations of V3Const on large blocks that can be
optimized well, and should yield the same result.
This speeds up V3Gate by ~4x on a design where V3Gate dominates.
Speed improvements:
- Use a direct, recursion-free implementation
- Improve pre-fetching
Functionality:
- Support remove/replace of currently iterated node
__gcov_flush was a private function and was removed from later GCC
versions (at least from 11.2.0, possibly earlier). Replace with the
documented public __gcov_dump.
Set default value of --comp-limit-parens to 240, to respect default
maximum nesting of parentheses in clang (which is controlled by
-fbracket-depth and defaults to 256). For code generation consistency,
also use the same default with gcc.
* Tests: Add a test to reproduce #3509
* Tests: Compile without tautological-compare check because bit op tree optimization is disabled in the test.
* Internals: Dedup code. No functional change is intended.
* Fix #3509.
"2'b10 == (2'b11 & {1'b0, val[0]})" and "2'b10 != (2'b11 & {1'b0, val[0]})" were
wrongly optimized to "!val[0]" and "val[0]" respectively.
Now properly optimize them to 1'b0 and 1'b1.
* Commentary
* Commentary: Update Changes
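The two expressions from the #3509 fix can be checked exhaustively, since `val[0]` is a single bit. The sketch below models the 2-bit arithmetic with plain unsigned integers; the function names are made up for illustration.

```cpp
#include <cassert>

// Models  2'b10 == (2'b11 & {1'b0, val[0]}).
// {1'b0, val[0]} is a 2-bit value whose upper bit is always 0.
constexpr bool eqExpr(unsigned val0) { return 0x2U == (0x3U & (val0 & 0x1U)); }

// Models  2'b10 != (2'b11 & {1'b0, val[0]}).
constexpr bool neExpr(unsigned val0) { return 0x2U != (0x3U & (val0 & 0x1U)); }

// Bit 1 of the masked value can never be set, so the comparison against
// 2'b10 is constant for both possible values of val[0].
static_assert(!eqExpr(0) && !eqExpr(1), "== folds to 1'b0");
static_assert(neExpr(0) && neExpr(1), "!= folds to 1'b1");
```

Neither result depends on `val[0]`, which is why `!val[0]` and `val[0]` were wrong simplifications.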
Associative arrays that specify a wildcard index type may be indexed by
integral expressions of any size, with leading zeros removed
automatically. A natural representation for such expressions is a
string, especially that the standard explicitly specifies automatic
casts from string indices to bit vectors of equivalent size.
The automatic cast part is done implicitly by the existing type system.
A simpler way to just make this work would be to convert wildcard index
type to a string type directly in the parser code, but several new AST
classes are needed to make sure illegal method calls are detected.
The verilated data structure implementation is reused, because there is
no need for differentiating the behavior on C++ side.
All remaining uses of conditional compilation in the tracing
implementation of the run-time library are replaced with the use of
VerilatedModel::traceConfig, so the configuration is now done at run-time.
Step towards a proper run-time library. Reduce the amount of ifdefs in
the implementation of offloaded tracing. There are still a very small
number of ifdefs left, which will need more careful changes in order to
keep user API compatibility.