No functional change.
This patch is just cleanup with some non-functional changes to enable
the next patch. Most importantly createDlyOnSet, which implements NBAs
for arrays, has a new streamlined implementation that does the same
thing. Some output code is perturbed due to statement/local variable
insertion order.
Also renamed Vdlyvfoo to VdlyFoo for easier readability of the generated
code.
Checking the wrong node meant we never actually pushed constant
bit-select indices into the delayed update, as was the intention, but
always generated a temporary instead.
This adheres more to the wording used in IEEE 1800-2017 22.5.1,
specifying the join operator to be more of a delimiter than join:
> A `` delimits lexical tokens without introducing white space,
> allowing identifiers to be constructed from arguments.
Before, string RHS arguments to the join operator were silently dropped.
Signed-off-by: Arkadiusz Kozdra <akozdra@antmicro.com>
The goal here is to use as single ordering heuristic (which can be
improved later) within MTasks as we do for serial code ordering. The
heuristic itself is factored out into the new OrderMoveGraphSerializer.
This also yields slightly nicer ordering than the previously use
GraphStream, so we end up with fewer trigger (domain) conditionals in
the MTasks, this can be worth a few percent speedup.
This has the somewhat nice side-effect of reusing OrderMoveGraphVertex
for both serial and parallel mode, so MTaskMoveGraphVertex can be
removed.
Serial mode yields identical output.
- Adds support for C-style for-loop initializers
- Current implementation supports: for (x a = 1, y b = 2, ...)
- This patch extends support to: for (x a = 1, b = 2, ...)
- Adds unit test for new feature
Instead of carrying around MTask affinity from scheduling, compute it in
V3VariableOrder (where it is used), by tracing through the code. This
simplifies some code and has the benefit of handling variables
introduced after scheduling. It's worth a few % speed at run-time, and
the new implementation of V3VariableOrder is slightly more efficient,
though the speed/space is still dominated by the TSP sort.
There is no strong need to re-map LogicMTask IDs and it just adds extra
processing. Instead we just allocate a separate set of ExecMTask IDs as
they are created, which can also be used as the unique profiling ID as
well. The only effect on the output of this is the change in mtask IDs
emitted, which was fairly arbitrary to begin with.
Wrapping the functions in #4933 broke --prof-exec report as the
predicted MTask times are computed during thread packing, but are
emitted in the wrapping functions.
V3Partition used to contain 2 conceptually separate set of algorithms
- The MTask partitioning/coarsening algorithm used by V3Order. This has
been moved to V3OrderParallel.cpp
- The lowering of AstExecGraph into per thread functions by packing
tasks into threads and creating additional code
(V3Partition::finalize). This has been moved to the new
V3ExecGraph.cpp
This patch is just code movement/rename with minimal fixes required to
do so.
Continuing the idea of decoupling the implementations of the various algorithms.
The main points:
-Move the former "processDomain" stuff, dealing with assigning combinational logic into the relevant sensitivity domains into V3OrderProcessDomains.cpp
-Move the parallel code construction in V3OrderParallel.cpp (Could combine this with some parts of V3Partition - those not called from V3Partition::finalize - but that's not for this patch).
-Move the serial code construction into V3OrderSerial.cpp
-Factored the very small common code between the parallel and serial code construction (processMoveOneLogic) into V3OrderCFuncEmitter.cpp
Move OrderBuildVisitor into V3OrderGraphBuilder.cpp (and rename to
V3OrderGraphBuilder). Move ProcessMoveBuildGraph to
V3OrderMoveGraphBuilder.cpp (and rename to V3OrderGraphBuilder).
This patch is pure code movement/rename, no refactoring at all.
Add a new data-structure V3DfgCache, which can be used to retrieve
existing vertices with some given inputs vertices. Use this in
V3DfgPeephole to eliminate the creation of redundant vertices.
Overall this is performance neutral, but is in prep for some future
work.
DFG could remove forceable signals by replacing them with their
in-design driver. This is a bit of a pain to prevent, and ideally the
forcing transform should happen before DFG, but implementing it there is
a pain due to having to rewrite ports based on direction. This is an
attempted fix in DFG. More cases might remain.
This functionality used to be distributed in the removeVars pass and the
final dfgToAst conversion. Instead added a new 'regularize' pass to
convert DFGs into forms that can be trivially converted back to Ast, and
a new 'eliminateVars' pass to remove/repalce redundant variables. This
simplifies dfgToAst significantly and makes the code a bit easier to
follow.
The new 'regularize' pass will ensure that every sub-expression with
multiple uses is assigned to a temporary (unless it's a trivial memory
reference or constant), and will also eliminate or replace redundant
variables. Overall it is a performance neutral change but it does
enable some later improvements which required the graph to be in this
form, and this also happens to be the form required for the dfgToAst
conversion.
With --stats, we will print DFG pattern combinations, one per line, as
S-expressions to new stat files, together with their frequency, to aid
discovery of new peephole patterns.
When the condition of an AstWhile loop contains a statement (e.g.
through an AstExprStmt), temporaries inserted in that statement need to
go before that statement, not in the AstWhile precondition.
- Enable creating constant pool entries for RHS of simple
var = const assignments
- Never extract ArraySel (it's just pointer arithmetic)
- Remove unnecessary AstTraceInc precond child tree
- Always fully recurse expressions (fix transplanted from #4617)
- General cleanup
Overall the patch is performance neutral to slightly positive, but saves
~10% peak Verialtor memory usage due to not creating temporaries (which
are later expanded) for any ArraySels.
Adds a new field to `AstSenItem` that stores the `iff` condition which is then handled by `SenExprBuilder`.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
This patch addresses two issues with NBAs in non-inlined functions/tasks:
- If the NBA writes to a local automatic var, the var could cease to exist before the NBA executes. This is normally addressed by fork dynscopes (#4356), but NBA-to-fork transformation happens way after `V3Fork` (in `V3Timing`). To solve this, we put NBAs that write to locals under forks in `V3Fork` already. This way, such locals will be put in dynscopes, and will still exist after the task containing the NBA exits.
- The above change means that any writes in forks other than `fork..join` should be handled by `V3Fork`. Thus, in `V3SchedTiming`, we only have to worry about read references, so we can simply copy all remaining locals. Because we copy, lifetimes are not an issue. This fixes a bug that allowed assignment intravals to be overwritten if they go out of scope in the containing function.
This patch enforces the use of the most specific accessors for operands
which have an '@astgen alias' declaration, by making the superclass
accessors of the same operands private. This ensures client code is
cleaner as you can't use multiple different methods to reference the
same operands (which we used to in some places). Also prep for some
refactoring.
V3Subst is currently the pass responsible for peak memory usage. This
patch saves ~16% of peak memory on OpenTitan.
2 changes:
- It is actually safe to delete the substituted expressions immediately,
but this is the lesser contribution
- More importantly, we only ever substitute STMTTEMP variables, which
are always defined within the same CFunc, so we can limit the scope of
the optimization to CFunc. This allows us to reclaim the SubstVarEntry
structures at the end of every CFunc, rather than at the end of the
whole pass, which gives us most of the memory savings.
Generated output is identical
Used to set the wrong public flag on forceEn/forceVal, which means they
were not included in ICO as necessary, but V3Gate tended to inline them,
so this was hard to hit.
Fixes#4577
This saves about 5% memory. V3AstUserAllocator is appropriate for most use
cases, performance is marginally up as we are mostly D-cache bound on
large designs.
The typical find/if-not-exists-insert pattern can be achieved with 1
lookup instead of 2 using emplace with a sentinel value. Also maps value
initialize their values when inserted with the [] operator, this is
defined and so there is no need to explicitly insert zeroes for integer
values.
This patch adds some abstract enums to pass to the trace decl* APIs, so
the VCD/FST specific code can be kept in verilated_{vcd,fst}_*.cc, and
removed from V3Emit*. It also reworks the generation of the trace init
functions (those that call 'decl*' for the signals) such that the scope
hierarchy is traversed precisely once during initialization, which
simplifies the FST writer. This later change also has the side effect of
fixing tracing of nested interfaces when traced via an interface
reference - see the change in the expected t_interface_ref_trace - which
previously were missed.
Some values emitted to the trace files are constant (e.g.: actual
parameter values), and never change. Previously we used to trace these
in the 'full' dumps, which also included all other truly variable
signals. This patch introduces a new generated trace function 'const',
to complement the 'full' and 'chg' flavour, and 'const' now only
contains the constant signals, while 'full' and 'chg' contain only the
truly variable signals. The generate 'full' and 'chg' trace functions
now have exactly the same shape. Note that 'const' signals are still
traced using the 'full*' dump methods of the trace buffers, so there is
no need for a third flavour of those.
V3Stats for "fast" code have bit-rotted a little and is causing some
problems with tests that rely on stats outputs. The problem is that not
all code is necessarily reachable from eval() any more (due to the
complexity of some the features added over the past few years), so it
might miss some things, as for measuring the "fast" code, it is trying
to trace the execution paths via calls, starting from eval(). It also
appears the fast code can also contain calls to slow code in some
circumstances.
To avoid all that, removed trying to trace dynamic execution, and simply
report the static node counts, which is enough for testing.
Similarly, the variable counts are somewhat dubious, as they don't
include all data types, or all instances of a module in some stages.
Removing these as they are not widely used nor dependable. More specific
stats can be added if required and can be well defined.
Prior to this we failed to create implicit nets for inputs of gate
primitives, which is required by the standard (IEEE 1800-2017 6.10).
Note: outputs were covered due to being modeled as the LHS of
assignments, which do create implicit nets.
Again --prof-exec have bit-rotted a little with all the recent changes
to the structure of the generated code. This patch contains a few
improvements:
- Repalce the eval/evl_loop begin/end events with generic
section_push/section_pop events, that can be arbitrarily sprinkled
into the generate code (so long as they are matched correctly) to
measure various sections. The report then contains a nested profile
of the sections, and the VCD trace shows the section names.
- Better handling of exec graphs
- Clearer overall statistics
Still some remains of the --threads 0 mode. Remove unnecessary complexity
from V3EmitCModel. (Also don't pretend there is an MTask in single
threaded mode, when there really isn't.)
Previously V3InstrCount used to completely ignore an AstConcat,
including its children (see the rational in the comment). The problem is
the operands can be huge and expensive compound expressions (especially
since DFG), and not just a simple variable reference. This fix gains
some MT speed improvement.
Since we removed --threads 0 support, the 'threads()' option always
returns a value >= 1. Remove corresponding dead code.
Some of the coverage counters appear to use atomics even if the model is
single threaded. I'm under the impression this was a bug originally so
those ones I changed to use threads() > 1 instead.
According to 1800-2017 36.3, 1800-2017 A.9.3, 1364-2005 20.2 and 1364-2005 A.9.3, user defined system task and function identifiers can use the same character set for the second character as all the following characters.
`getCommonClassTypep` and its helper code has been moved to AstNode
class. This is a lot better place for this functionality. Moreover, it
allowed to get rid of the dependency on V3Width from generic AST-related
code.
This API is used if the user copies the process using `fork`
and similar OS-level mechanisms. The `at_clone` member function
ensures that all model-allocated resources are re-allocated, such
that the copied child process/model can simulate correctly.
A typical allocated resource is the thread pool, which every model
has its own pool.
* Ignore CLion project files and CMake outputs
* Supporting stripping file path that contains backslash
* Set /bigobj flag and increase stack size for windows platform
* Fix MSVC warnings
Multiple edge timing controls in class methods would cause compilation errors on
the generated C++ code. This is because the `SenExprBuilder` used for these
would get recreated per timing control, resulting in duplicate variable names.
The fix is to have a single `SenExprBuilder` per scope.
Multiple edge timing controls in class methods would cause compilation errors on the generated C++ code. This is because the `SenExprBuilder` used for these would get recreated per timing control, resulting in duplicate variable names. The fix is to have a single `SenExprBuilder` per scope.
* Add VL_ASSERT_CAPABILITY; add assumeLocked and pretendUnlock to V3Mutex.
* Pass jobs as template-arguments and use std::packaged_task.
* Add and use V3ThreadPool::ScopedExclusiveAccess.
Given nested forks, if the inner fork had a `join` or `join_any` at the end,
`V3Sched::transformForks()` would decide that the fork's `VlForkSync` variable
should be passed in from the outside. This resulted in the `VlForkSync` getting
redeclared as a function argument. Ultimately, it led to C++ compilation errors
due to variable redeclaration.
Fixed by rearranging the `if`s that decide whether a variable should be passed
in or left as-is.
`stopRequested()` reads only atomic variables. It doesn't need a mutex
to do this.
This function is called in `waitIfStopRequested()`, which in turn
is called before execution of every job, and inside some jobs. With this
change the mutex inside `waitIfStopRequested` needs to be locked only in
very rare cases instead of every time.
Event-triggered coroutines live in two stages: 'uncommitted' and 'ready'. First
they land in 'uncommitted', meaning they can't be resumed yet. Only after
coroutines from the 'ready' queue are resumed, the 'uncommitted' ones are moved
to the 'ready' queue, and can be resumed. This is to avoid self-triggering in
situations like waiting for an event immediately after triggering it.
However, there is an issue with `wait` statements. If you have a `wait(b)`, it's
being translated into a loop that awaits a change in `b` as long as `b` is
false. If `b` is false at first, the coroutine is put into the `uncommitted`
queue. If `b` is set to true before it's committed, the coroutine won't get
resumed.
This patch fixes that by immediately committing event controls created from
`wait` statements. That means the coroutine from the example above will get
resumed from now on.
This makes the implementation of the detection and propagation of the
suspendable property simpler and easier to read. More importantly, there are no
more jumps around the AST with the `visit` functions, which in some cases could
result in incorrect visitor context while in the `visit` function. See the added
test, which would cause Verilator to segfault before this patch.
In testing, verilation performance was not shown to be affected by this change.
Though there is a slight performance improvement from this patch, due to adding
one more check before refreshing class member cache.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
* V3Common.cpp::makeVlToString: fix `VL_TOSTRING_W` statement generation to include width argument
* fix contribution name
* add testcase for long struct `VL_TO_STRING_W` bug
This patch fixes two cases where methods in base classes were not being marked
as coroutines, even though they were being overridden by coroutines.
- One case is the class member cache not getting refreshed for searched classes.
- The other is when the overriding methods are not declared as `virtual`. In
that case, the `isVirtual()` getter on such a method returns false, which led
to `V3Timing` skipping the step of searching for overridden methods.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Static variables of functions are created in the function. When blocks
in a function use identical names for static variables, we need to name
those variables properly.
Pack the elements of VlTriggerVec as dense bits (instead of a 1 byte
bool per bit), and check whether they are set on a word granularity.
This effectively transforms conditions of the form `if (trig.at(0) |
trig.at(2) | trig.at(64))` into `if (trig.word(0) & 0x5 | trig.word(1) &
0x1)`. This improves OpenTitan ST by about 1%, worth more on some other
designs.
`VlNow{}` is completely unnecessary, as coroutines are always on the
heap (unless optimized out). Also fix access of var ref passed to forked processes.
Given an await at the end of a block, e.g. at the end of a loop body, a trace
activity setter was not inserted, as there were no following statements. This
patch makes the activity update unconditional.
Before this patch, it was possible to access non-static class members using
static access, which resulted in C++ compilation errors. This adds
verilation-time checks for such situations.
Before this patch, calling tasks directly under forks would result in each
statement of these tasks being executed concurrently. This was due to Verilator
inlining tasks most of the time. Such inlined tasks' statements would simply
replace the original call, and there would be no indication that these used to
be grouped together. Ultimately resulting in `V3Timing` treating each statement
as a separate process.
The solution is simply to wrap each fork sub-statement in a begin in `V3Begin`
(except for the ones that are begins, as that would be pointless). `V3Begin` is
already aware of forks, and is supposed to avoid issues like this one, so it
seems like a natural fit. This also protects us from similar bugs, i.e. if some
statement gets replaced or expanded into multiple statements.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
The baked in DEFENV paths might end up with extra NULL characters
at the end if the binaries are installed by something that patches them
for relocatable installs (e.g. conda). Avoid this issue by immediately
passing them through std::string::c_str() method to stop at the first NULL
* Support method class without parenthesis
Signed-off-by: Ryszard Rozak <rrozak@antmicro.com>
* Delete replaced nodes
Signed-off-by: Ryszard Rozak <rrozak@antmicro.com>
---------
Signed-off-by: Ryszard Rozak <rrozak@antmicro.com>
Fixes#3872.
Testing this is a bit tricky, as the front-end fixes up the operand
widths in shifts to match, and we need V3Const to introduce a mismatched
one by reducing `4'd2 ** x` (with x being 2 2-bit wide signal) to `4'd1
<< x`, but t_dfg_peephole runs with V3Const disabled exactly because it
makes it hard to write tests. Rather than fixing this one case in
V3Const (which we should do systematically at some point), I fixed DFG
to accept these just in case V3Const generates more of them. The
assertions were there only because of paranoia (as I thought these were
not possible inputs), the code otherwise works.
In order to avoid unexpected breakage on multi-driven variables, we
resolve in DFG construction by using only the first driver encountered.
Also issues the MULTIDRIVEN error for these signals.
Replace the 'run to fixed point' algorithm with a work list driven
approach. Instead of marking the graph as changed, we explicitly add
vertices to the work list, to be visited, when a vertex is changed. This
improves both memory locality (as the work list is processed in last in
first out order), and removed unnecessary visitations when only a few
nodes changes.
Folding an AstLogAnd with a non-zero constant operand used to coerce the
type of the other operand, yielding an ill-typed node that DFG was then
unhappy about. Add a RedOr instead if the width of the replacement
operand is greater than zero.
Fixes#3726
Apart from the representational changes below, this patch renames
AstNodeMath to AstNodeExpr, and AstCMath to AstCExpr.
Now every expression (i.e.: those AstNodes that represent a [possibly
void] value, with value being interpreted in a very general sense) has
AstNodeExpr as a super class. This necessitates the introduction of an
AstStmtExpr, which represents an expression in statement position, e.g :
'foo();' would be represented as AstStmtExpr(AstCCall(foo)). In exchange
we can get rid of isStatement() in AstNodeStmt, which now really always
represent a statement
Peak memory consumption and verilation speed are not measurably changed.
Partial step towards #3420
In V3Active, we try hard to turn `always @(a or b or c)` into an
`always_comb` if the only variables read in the block are also in the
sensitivity list. In addition, also allow this optimization when reading
variables that are not in the sensitivity list, but are known to be
constant/never changing after initialization. In particular lookup
tables introduced by V3Table are covered by this. This can have a
significant impact on designs that use the `always @(a or b or c)` style
for combinational logic.
The cost of an AstCMethodHard right now is generally unknown. However,
VlTriggerVec::at is used a lot in conditions, so we make an effort
to estimate this correctly via 2 changes:
- In general when an AstVarRef appears as the target of an
AstCMethodHard, we cost it as a simple address computation (an add)
- Check for VlTriggerVec::at explicitly when costing AstCMethodHard,
which is essentially a load.
This can have a significant effect when there are a lot of unique
triggers in the design.
In non-static contexts like class objects or stack frames, the use of
global trigger evaluation is not feasible. The concept of dynamic
triggers allows for trigger evaluation in such cases. These triggers are
simply local variables, and coroutines are themselves responsible for
evaluating them. They await the global dynamic trigger scheduler object,
which is responsible for resuming them during the trigger evaluation
step in the 'act' eval region. Once the trigger is set, they await the
dynamic trigger scheduler once again, and then get resumed during the
resumption step in the 'act' eval region.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
This changeset brings support for accesses like:
class Cls#(type TYPE1);
TYPE1::some_method();
endclass
It is done by delaying dot resolution on type parameters until they get
resolved by V3Param, and doing a more thorough reference skip.
In DFG DfgVertex::width() is only defined for vertices representing
packed values, which DfgVertex::hash() used to violate. The only
non-packed values at the moment are DfgVarArray, which is a
DfgVertexVar, which are handled specially anyway, so this is easy to
fix.
Fixes#3682
In order to not leak signal names with --protect-ids, we simply make the
trigger dump function empty (this is a debug only construct).
Partial fix for #3689
The emitted SystemC types (e.g. sc_bv) are not interchangeable with
Verilator internal C++ types (e.g.: VlWide), so the variables themselves
are not interchangeable (but can be assigned to/from each other). We can
preserve correctness simply be not inlining any SystemC variables (i.e.:
don't simplify any 'sc = nonSc' or 'nonSc = sc' assignments). SystemC
types only appear at top level ports so this should have no significant
impact.
Fixes#3688
Prevents the possibility of assigning an integer to a class reference,
both at the SystemVerilog and the emitted C++ levels.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Vertices representing variables (DfgVertexVar) and constants (DfgConst)
are very common (40-50% of all vertices created in some large designs),
and we also need to, or can treat them specially in algorithms. Keep
these as separate lists in DfgGraph for direct access to them. This
improve verilation speed.
Cyclic components are now extracted separately, so there is no
functional reason to have to do a topological sort (previously we used it
to detect cyclic graphs). Removing it to gain some speed.
AstSel is a ternary node, but the 'widthp' is always constant and is
hence redundant, and 'lsbp' is very often constant. As AstSel is fairly
common, we special case as a DfgSel for the constant 'lsbp', and as
'DfgMux` for the non-constant 'lsbp'.
Added a DfgVertex::user() mechanism for storing data in vertices.
Similar in spirit to AstNode user data, but the generation counter is
stored in the DfgGraph the vertex is held under. Use this to cache
DfgVertex::hash results, and also speed up DfgVertex hashing in general.
Use these and additional improvements to speed up CSE.
`V3SchedTiming` currently assumes that if a fork still exists, it must
have statements within it (otherwise it would have been deleted by
`V3Timing`). However, in a case like this:
```
module t;
reg a;
initial fork a = 1; join
endmodule
```
the assignment in the fork is optimized out by `V3Dead` after
`V3Timing`. This leads to `V3SchedTiming` accessing fork's `stmtsp`
pointer, which at this point is null. This patch addresses that issue.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Allow constant folding through adjacent nodes of all associative
operations, for example '((a & 2) & 3)' or '(3 & (2 & a))' can now be
folded into '(a & 2)' and '(2 & a)' respectively. Also improve speed of
making associative expression trees right leaning by using rotation of
the existing vertices whenever instead of allocation of new nodes.
Only apply when there is guaranteed to be a subsequent constant folding
and elimination of some of the expression, otherwise this sometimes
interferes with the simplification of concatenations and harms overall
performance.
Before this change, a design verilated with `--timing` that does not
actually use timing features would be emitted with `eventsPending` and
`nextTimeSlot` declared in the top class. However, their definitions
would be missing, leading to linker errors during design compilation.
This patch makes Verilator always emit the definitions, which prevents
linker errors. Trying to use `nextTimeSlot` without delays in the design
will result in an error at runtime.
Also added a testing only -fno-const-before-dfg option, as otherwise
V3Const eats up a lot of the simple inputs. A lot of the things V3Const
swallows in the simple cases can make it to DFG in complex cases, or DFG
itself can create them during optimization. In any case to save
complexity of testing DFG constant folding, we use this option to turn
off V3Const prior to the DFG passes in the relevant test.
Some optimizations are only a net win if they help us remove a graph
node (or at least ensure they don't grow the graph), or yields otherwise
special logic, so try to apply them only in these cases.
Use the same style, and reuse the bulk of astgen to generate DfgVertex
related code. In particular allow for easier definition of custom
DfgVertex sub-types that do not directly correspond to an AstNode
sub-type. Also introduces specific names for the fixed arity vertices.
No functional change intended.
A lot of optimizations in DFG assume a DAG, but the more things are
representable, the more likely it is that a small cyclic sub-graph is
present in an otherwise very large graph that is mostly acyclic. In
order to avoid loosing optimization opportunities, we explicitly extract
the cyclic sub-graphs (which are the strongly connected components +
anything feeing them, up to variable boundaries) and treat them
separately. This enables optimization of the remaining input.
This change introduces a custom reference-counting pointer class that
allows creating such pointers from 'this'. This lets us keep the
receiver object around even if all references to it outside of a class
method no longer exist. Useful for coroutine methods, which may outlive
all external references to the object.
The deletion of objects is deferred until the next time slot. This is to
make clearing the triggered flag on named events in classes safe
(otherwise freed memory could be accessed).
Added DfgVertexVariadic to represent DFG vetices with a varying number
of source operands. Converted DfgVar to be a variadic vertex, with each
driver corresponding to a fixed range of bits in the packed variable.
This allows us to handle AstSel on the LHS of assignments. Also added
support for AstConcat on the LHS by selecting into the RHS as
appropriate.
This improves OpenTitan ST speed by ~13%
This is only a debugging aid at this point, so compile out of the
release build. This reduces peak memory consumption by 4-5%. We still
keep the global counters to detect the tree have changed, to avoid
unnecessary dumps.
Multiple tricks to reduce the size of class FileLine from 72 to 40
bytes:
- Reduce file name index from 32 to 16 bits. This still allows 64K
unique input files, which is hopefully enough.
- Intern message/warning enable bitset and use a 16-bit index, again
allowing 64K unique sets which is hopefully enough.
- Put the m_waive flag into the sign bit of one of the line numbers.
- Use explicit reference counting to avoid overhead of shared_ptr.
Added assertions to ensure interned data fits within it's index space.
This saves ~5-10% peak memory consumption at no measurable run-time cost
on various designs.
Added a new data-flow graph (DFG) based combinational logic optimizer.
The capabilities of this covers a combination of V3Const and V3Gate, but
is also more capable of transforming combinational logic into simplified
forms and more.
This entail adding a new internal representation, `DfgGraph`, and
appropriate `astToDfg` and `dfgToAst` conversion functions. The graph
represents some of the combinational equations (~continuous assignments)
in a module, and for the duration of the DFG passes, it takes over the
role of AstModule. A bulk of the Dfg vertices represent expressions.
These vertex classes, and the corresponding conversions to/from AST are
mostly auto-generated by astgen, together with a DfgVVisitor that can be
used for dynamic dispatch based on vertex (operation) types.
The resulting combinational logic graph (a `DfgGraph`) is then optimized
in various ways. Currently we perform common sub-expression elimination,
variable inlining, and some specific peephole optimizations, but there
is scope for more optimizations in the future using the same
representation. The optimizer is run directly before and after inlining.
The pre inline pass can operate on smaller graphs and hence converges
faster, but still has a chance of substantially reducing the size of the
logic on some designs, making inlining both faster and less memory
intensive. The post inline pass can then optimize across the inlined
module boundaries. No optimization is performed across a module
boundary.
For debugging purposes, each peephole optimization can be disabled
individually via the -fno-dfg-peepnole-<OPT> option, where <OPT> is one
of the optimizations listed in V3DfgPeephole.h, for example
-fno-dfg-peephole-remove-not-not.
The peephole patterns currently implemented were mostly picked based on
the design that inspired this work, and on that design the optimizations
yields ~30% single threaded speedup, and ~50% speedup on 4 threads. As
you can imagine not having to haul around redundant combinational
networks in the rest of the compilation pipeline also helps with memory
consumption, and up to 30% peak memory usage of Verilator was observed
on the same design.
Gains on other arbitrary designs are smaller (and can be improved by
analyzing those designs). For example OpenTitan gains between 1-15%
speedup depending on build type.
- Rename `--dump-treei` option to `--dumpi-tree`, which itself is now a
special case of `--dumpi-<tag>` where tag can be a magic word, or a
filename
- Control dumping via static `dump*()` functions, analogous to `debug()`
- Make dumping independent of the value of `debug()` (so dumping always
works even without the debug flag)
- Add separate `--dumpi-graph` for dumping V3Graphs, which is again a
special case of `--dumpi-<tag>`
- Alias `--dump-<tag>` to `--dumpi-<tag> 3` as before
Use astgen to generate a more thorough version of AstNode::checkTree,
which checks that operands are or consistent structure and type, as
described in the @astgen op directives. Also change checkTree to always
run when --debug-check is given.
Fix discovered fallout.
Introduce the @astgen directives parsed by astgen, currently used for
the generation child node (operand) accessors. Please see the updated
internal documentation for details.
Introduce the @astgen directives parsed by astgen, currently used for
the generation child node (operand) accessors. Please see the updated
internal documentation for details.
This approach reduced total time of V3Undriven stage from 34,2s to 2,5s
in design containing almost 400 000 unused variables.
Signed-off-by: Kamil Rakoczy <krakoczy@antmicro.com>
Generate type specific static overloads of Ast<Node>::addNext, which
return the correct sub-type of the 'this' they were invoked on.
Also remove AstNode::addNextNull, which is now only used in the parser,
implement in verilog.y directly as a template function.
- Move DType representations into V3AstNodeDType.h
- Move AstNodeMath and subclasses into V3AstNodeMath.h
- Move any other AstNode subtypes into V3AstNodeOther.h
- Fix up out-of-order definitions via inline methods and implementations
in V3Inlines.h and V3AstNodes.cpp
- Enforce declaration order of AstNode subtypes via astgen,
which will now fail when definitions are mis-ordered.
Rely less on strings and represent AstNode classes as a 'class Node',
with all associated properties kept together, rather than distributed
over multiple dictionaries or constructed at retrieval time.
No functional change intended.
Small fixup patch so the 'ico' and 'act' scheduling sections could be
ordered as multi-threaded. However, we still only order these single
threaded at the moment (but switching them to multi-threaded now works).
Before this change, some forked processes were being inlined in
`V3Timing` because they contained no `CAwait`s. This only works under
the assumption that no `CAwait`s will be added there later, which is not
true, as a function called by a forked process could be turned into a
coroutine later. The call would be wrapped in a new `CAwait`, but the
process itself would have already been inlined at this point.
This commit moves the inlining to `transformForks` in `V3SchedTiming`,
which is called at a point when all `CAwait`s are already in place.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
The recent patch to defer substitutions on V3Gate crashes on circular
logic that has cycle length >= 3 with all inlineable signals (cycle
length 2 is detected correctly and is not inlined). Fix by stopping
recursion at the loop-back edge.
Fixes#3543
This is detritus from when V3TraceDecl used to run after V3Gate, today
V3TraceDecl runs before V3Gate and this value has no function at all.
No functional change intended.
dynamic_cast is not free. Replace obvious instances (where the result is
unconditionally dereferenced) with static_cast in contexts with
performance implications.
Replace std::set<SiblingMC> with V3Lists to keep track of SiblingMCs
associated with MTasks, use a std::set<LogicMTask*> for ensuring
uniqueness. This yields a bit more speed in PartContraction.
- Use modern C++
- Implement OrderLogicVertex->LogicMTask map with
OrderLogicVertex::userp(), insteas of std::unordered_map
- Simplify data structures
- Simplify code and assert properties
No functional change.
Refactor ProcessMoveBuildGraph utilizing the fact that OrderGraph is a
bipartite graph, also remove unnecessary unordered_map and distribute
variable domain map. No functional change.
Adds timing support to Verilator. It makes it possible to use delays,
event controls within processes (not just at the start), wait
statements, and forks.
Building a design with those constructs requires a compiler that
supports C++20 coroutines (GCC 10, Clang 5).
The basic idea is to have processes and tasks with delays/event controls
implemented as C++20 coroutines. This allows us to suspend and resume
them at any time.
There are five main runtime classes responsible for managing suspended
coroutines:
* `VlCoroutineHandle`, a wrapper over C++20's `std::coroutine_handle`
with move semantics and automatic cleanup.
* `VlDelayScheduler`, for coroutines suspended by delays. It resumes
them at a proper simulation time.
* `VlTriggerScheduler`, for coroutines suspended by event controls. It
resumes them if its corresponding trigger was set.
* `VlForkSync`, used for syncing `fork..join` and `fork..join_any`
blocks.
* `VlCoroutine`, the return type of all verilated coroutines. It allows
for suspending a stack of coroutines (normally, C++ coroutines are
stackless).
There is a new visitor in `V3Timing.cpp` which:
* scales delays according to the timescale,
* simplifies intra-assignment timing controls and net delays into
regular timing controls and assignments,
* simplifies wait statements into loops with event controls,
* marks processes and tasks with timing controls in them as
suspendable,
* creates delay, trigger scheduler, and fork sync variables,
* transforms timing controls and fork joins into C++ awaits
There are new functions in `V3SchedTiming.cpp` (used by `V3Sched.cpp`)
that integrate static scheduling with timing. This involves providing
external domains for variables, so that the necessary combinational
logic gets triggered after coroutine resumption, as well as statements
that need to be injected into the design eval function to perform this
resumption at the correct time.
There is also a function that transforms forked processes into separate
functions.
See the comments in `verilated_timing.h`, `verilated_timing.cpp`,
`V3Timing.cpp`, and `V3SchedTiming.cpp`, as well as the internals
documentation for more details.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Various optimizations to speed up MTasks coarsening (which is the long
pole in the multi-threaded scheduling of very large designs).
The biggest impact ones:
- Use efficient hand written Pairing Heaps for implementing priority
queues and the scoreboard, instead of the old SortByValueMap. This
helps us avoid having to sort a lot of merge candidates that we will
never actually consider and helps a lot in performance.
- Remove unnecessary associative containers and store data structures
(the heap nodes in particular) directly in the object they relate to.
This eliminates a huge amount of lookups and helps a lot in
performance.
- Distribute storage for SiblingMC instances into the LogicMTask
instances, and combine with the sibling maps. This again eliminates
hash table lookups and makes storage structures smaller.
- Remove some now bidirectional edge maps, keep only the forward map.
There are also some other smaller optimizations:
- Replaced more unnecessary dynamic_casts with static_casts
- Templated some functions/classes to reduce the number of static
branches in loops.
- Improves sorting of edges for sibling candidate creation
- Various micro-optimizations here and there
This speeds up MTask coarsening by 3.8x on a large design, which
translates to a 2.5x speedup of the ordering pass in multi-threaded
mode. (Combined with the earlier optimizations, ordering is now 3x
faster.)
Due to the elimination of a lot of the auxiliary data structures, and
ensuring a minimal size for the necessary ones, memory consumption of
the MTask coarsening is also reduced (measured up to 4.4x reduction
though the accuracy of this is low).
The algorithm is identical except for minor alterations of the order
some candidates are added or removed, this can cause perturbation in the
output due to tied scores being broken based on IDs.
Various optimizations to speed up MTasks coarsening (which is the long
pole in the multi-threaded scheduling of very large designs).
The biggest impact ones:
- Use efficient hand written Pairing Heaps for implementing priority
queues and the scoreboard, instead of the old SortByValueMap. This
helps us avoid having to sort a lot of merge candidates that we will
never actually consider and helps a lot in performance.
- Remove unnecessary associative containers and store data structures
(the heap nodes in particular) directly in the object they relate to.
This eliminates a huge amount of lookups and helps a lot in
performance.
- Distribute storage for SiblingMC instances into the LogicMTask
instances, and combine with the sibling maps. This again eliminates
hash table lookups and makes storage structures smaller.
- Remove some now bidirectional edge maps, keep only the forward map.
There are also some other smaller optimizations:
- Replaced more unnecessary dynamic_casts with static_casts
- Templated some functions/classes to reduce the number of static
branches in loops.
- Improves sorting of edges for sibling candidate creation
- Various micro-optimizations here and there
This speeds up MTask coarsening by 3.8x on a large design, which
translates to a 2.5x speedup of the ordering pass in multi-threaded
mode. (Combined with the earlier optimizations, ordering is now 3x
faster.)
Due to the elimination of a lot of the auxiliary data structures, and
ensuring a minimal size for the necessary ones, memory consumption of
the MTask coarsening is also reduced (measured up to 4.4x reduction
though the accuracy of this is low).
The algorithm is identical except for minor alterations of the order
some candidates are added or removed, this can cause perturbation in the
output due to tied scores being broken based on IDs.
While keeping the client code abstract in PartPropagateCp is nice for
testing, there is performance to be had removing the abstraction. As
this code dominates in scheduling large designs, we eliminate the
abstraction and re-work the testing to use the actual LogicMTask and
MTaskEdge graph types. No functional change intended.
Instead of deleting then re-allocating MTaskEdge instances when merging
two MTasks, just redirect the edged of the donor MTask to the recipient
MTask. This is both faster as it avoids an allocation and a deletion,
together with one update of the sibling maps, and also makes the
algorithm more stable due to MergeCandidate IDs being stable and
allocated up front for all MTaskEdges, before any SiblingMCs are
allocated.
Perturbations in output are expected as the IDs used to break ties
between merge candidates with equal costs are not updated when
redirecting an edge (on purpose). The relinking of only one end of the
graph edges also perturbs the order in which they are enumerated, which
does change candidate opportunities when the number of edges is larger
than PART_SIBLING_EDGE_LIMIT. Confirmed output is identical when
IDs are updated and edges are updated to appear in their original order.
The critical path propagation used to rely on a pointer comparison to
break equal scoring critical path updates. Use the corresponding mtask
ids instead, which is deterministic across invocations.
siblingPairFromRelatives gathers neighbours of a vertex, and sorts them.
It then takes the N best nodes, and creates sibling merge candidates
from them. We now use the unadjusted cost instead of the step cost of
the vertices when sorting. This is both faster as we need not do the
log-space rounding to compute stepCost, and will also make similar but
yet cheaper nodes appear closer to the front as we don't lose precision
in rounding, hence they are more likely to be entered as merge
candidates. Note that when creating the merge candidate, we still use
the stepCost, so it's purpose of reducing the propagation of critical
path updates is maintained in full. In summary, this should make both
Verilator and the generated model very slightly faster, at least in
theory, and I have observed minor improvement in places.
GraphStreamUnordered used to be GraphStream<std::less<const
V3GraphVertex*>>, but a lot of performance improvements can be had by a
specialized implementation, so added a highly optimized one. This helps
a lot with --debug-partition.
Fix compile error for queue method usage, if it is the
first statement in a block of code, and the return
value is not used. Example:
> if (foo)
> void'(bar.pop_front());
* Tests: Add a test to reproduce #3399
* Fix#3399. When reading an inout port in a module, it should refer the
original inout port, not the generated MODTEMP.
Keep a single std::set of key/value pairs, and a single unordered_map
from key to iterators into the set. Also improve some of the accessing
mechanisms using modern C++. This speeds up multi-threaded ordering by
about 10%.
Similarly to the earlier patch that defers constant folding on optimized
logic, now we also defer the variable substitutions as well. This again
eliminates a lot of traversals, and yields another ~10x speedup of V3Gate
on a design where V3Gate used to dominate while producing identical
results.
Rather than constant folding each logic block after every substitution,
only constant fold updated blocks when re-analysed, or at the end. This
removes a lot of invocations of V3Const on large blocks that can be
optimized well, and should yield the same result.
This speeds up V3Gate by ~4x on a design where V3Gate dominates.
Speed improvements:
- Use a direct, recursion-free implementation
- Improve pre-fetching
Functionality:
- Support remove/replace of currently iterated node
__gcov_flush was a private function and was removed from later GCC
versions (at least from 11.2.0, possibly earlier). Replace with the
documented public __gcov_dump.
Set default value of --comp-limit-parens to 240, to respect default
maximum nesting of parentheses in clang (which is controlled by
-fbracket-depth and defaults to 256). For code generation consistency,
also use the same default with gcc.
* Tests: Add a test to reproduce #3509
* Tests: Compile without tautological-compare check because bit op tree optimization is disabled in the test.
* Internals: Dedup code. No functional change is intended.
* Fix#3509.
"2'b10 == (2'b11 & {1'b0, val[0]})" and "2'b10 != (2'b11 & {1'b0, val[0]})" were
wrongly optimized to "!val[0]" and "val[0]" respectively.
Now properly optimize them to 1'b0 and 1'b1.
* Commentary
* Commentary: Update Changes
Associative arrays that specify a wildcard index type may be indexed by
integral expressions of any size, with leading zeros removed
automatically. A natural representation for such expressions is a
string, especially that the standard explicitly specifies automatic
casts from string indices to bit vectors of equivalent size.
The automatic cast part is done implicitly by the existing type system.
A simpler way to just make this work would be to convert wildcard index
type to a string type directly in the parser code, but several new AST
classes are needed to make sure illegal method calls are detected.
The verilated data structure implementation is reused, because there is
no need for differentiating the behavior on C++ side.
All remaining use of conditional compilation in the tracing
implementation of the run-time library are replaced with the use of
VerilatedModel::traceConfig, and is now done at run-time.
Step towards a proper run-time library. Reduce the amount of ifdefs in
the implementation of offloaded tracing. There are still a very small
number of ifdefs left, which will need more careful changes in order to
keep user API compatibility.