No functional change. Postpone the conversion of all AstAssignDlys that
use the 'VdlySet' scheme for array LHSs until after the complete
traversal of the netlist. The next patch takes advantage of this by
using some extra information also gathered through the traversal to
change the conversion.
AstAssignDlys inside suspendable or fork are not deferred and are
processed identical to the previous version.
There are some TODOs in this patch that are fixed in the next patch.
Output code perturbed due to variable ordering.
MULTIDRIVEN message ordering perturbed due to processing order change.
No functional change.
This patch is just cleanup with some non-functional changes to enable
the next patch. Most importantly createDlyOnSet, which implements NBAs
for arrays, has a new streamlined implementation that does the same
thing. Some output code is perturbed due to statement/local variable
insertion order.
Also renamed Vdlyvfoo to VdlyFoo for easier readability of the generated
code.
Checking the wrong node meant we never actually pushed constant
bit-select indices into the delayed update, as was the intention, but
always generated a temporary instead.
This adheres more to the wording used in IEEE 1800-2017 22.5.1,
specifying the join operator to be more of a delimiter than join:
> A `` delimits lexical tokens without introducing white space,
> allowing identifiers to be constructed from arguments.
Before, string RHS arguments to the join operator were silently dropped.
Signed-off-by: Arkadiusz Kozdra <akozdra@antmicro.com>
The goal here is to use as single ordering heuristic (which can be
improved later) within MTasks as we do for serial code ordering. The
heuristic itself is factored out into the new OrderMoveGraphSerializer.
This also yields slightly nicer ordering than the previously use
GraphStream, so we end up with fewer trigger (domain) conditionals in
the MTasks, this can be worth a few percent speedup.
This has the somewhat nice side-effect of reusing OrderMoveGraphVertex
for both serial and parallel mode, so MTaskMoveGraphVertex can be
removed.
Serial mode yields identical output.
- Adds support for C-style for-loop initializers
- Current implementation supports: for (x a = 1, y b = 2, ...)
- This patch extends support to: for (x a = 1, b = 2, ...)
- Adds unit test for new feature
Instead of carrying around MTask affinity from scheduling, compute it in
V3VariableOrder (where it is used), by tracing through the code. This
simplifies some code and has the benefit of handling variables
introduced after scheduling. It's worth a few % speed at run-time, and
the new implementation of V3VariableOrder is slightly more efficient,
though the speed/space is still dominated by the TSP sort.
There is no strong need to re-map LogicMTask IDs and it just adds extra
processing. Instead we just allocate a separate set of ExecMTask IDs as
they are created, which can also be used as the unique profiling ID as
well. The only effect on the output of this is the change in mtask IDs
emitted, which was fairly arbitrary to begin with.
Wrapping the functions in #4933 broke --prof-exec report as the
predicted MTask times are computed during thread packing, but are
emitted in the wrapping functions.
V3Partition used to contain 2 conceptually separate set of algorithms
- The MTask partitioning/coarsening algorithm used by V3Order. This has
been moved to V3OrderParallel.cpp
- The lowering of AstExecGraph into per thread functions by packing
tasks into threads and creating additional code
(V3Partition::finalize). This has been moved to the new
V3ExecGraph.cpp
This patch is just code movement/rename with minimal fixes required to
do so.
Continuing the idea of decoupling the implementations of the various algorithms.
The main points:
-Move the former "processDomain" stuff, dealing with assigning combinational logic into the relevant sensitivity domains into V3OrderProcessDomains.cpp
-Move the parallel code construction in V3OrderParallel.cpp (Could combine this with some parts of V3Partition - those not called from V3Partition::finalize - but that's not for this patch).
-Move the serial code construction into V3OrderSerial.cpp
-Factored the very small common code between the parallel and serial code construction (processMoveOneLogic) into V3OrderCFuncEmitter.cpp
Move OrderBuildVisitor into V3OrderGraphBuilder.cpp (and rename to
V3OrderGraphBuilder). Move ProcessMoveBuildGraph to
V3OrderMoveGraphBuilder.cpp (and rename to V3OrderGraphBuilder).
This patch is pure code movement/rename, no refactoring at all.
Add a new data-structure V3DfgCache, which can be used to retrieve
existing vertices with some given inputs vertices. Use this in
V3DfgPeephole to eliminate the creation of redundant vertices.
Overall this is performance neutral, but is in prep for some future
work.
DFG could remove forceable signals by replacing them with their
in-design driver. This is a bit of a pain to prevent, and ideally the
forcing transform should happen before DFG, but implementing it there is
a pain due to having to rewrite ports based on direction. This is an
attempted fix in DFG. More cases might remain.
This functionality used to be distributed in the removeVars pass and the
final dfgToAst conversion. Instead added a new 'regularize' pass to
convert DFGs into forms that can be trivially converted back to Ast, and
a new 'eliminateVars' pass to remove/repalce redundant variables. This
simplifies dfgToAst significantly and makes the code a bit easier to
follow.
The new 'regularize' pass will ensure that every sub-expression with
multiple uses is assigned to a temporary (unless it's a trivial memory
reference or constant), and will also eliminate or replace redundant
variables. Overall it is a performance neutral change but it does
enable some later improvements which required the graph to be in this
form, and this also happens to be the form required for the dfgToAst
conversion.
With --stats, we will print DFG pattern combinations, one per line, as
S-expressions to new stat files, together with their frequency, to aid
discovery of new peephole patterns.
When the condition of an AstWhile loop contains a statement (e.g.
through an AstExprStmt), temporaries inserted in that statement need to
go before that statement, not in the AstWhile precondition.
- Enable creating constant pool entries for RHS of simple
var = const assignments
- Never extract ArraySel (it's just pointer arithmetic)
- Remove unnecessary AstTraceInc precond child tree
- Always fully recurse expressions (fix transplanted from #4617)
- General cleanup
Overall the patch is performance neutral to slightly positive, but saves
~10% peak Verialtor memory usage due to not creating temporaries (which
are later expanded) for any ArraySels.
Adds a new field to `AstSenItem` that stores the `iff` condition which is then handled by `SenExprBuilder`.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
This patch addresses two issues with NBAs in non-inlined functions/tasks:
- If the NBA writes to a local automatic var, the var could cease to exist before the NBA executes. This is normally addressed by fork dynscopes (#4356), but NBA-to-fork transformation happens way after `V3Fork` (in `V3Timing`). To solve this, we put NBAs that write to locals under forks in `V3Fork` already. This way, such locals will be put in dynscopes, and will still exist after the task containing the NBA exits.
- The above change means that any writes in forks other than `fork..join` should be handled by `V3Fork`. Thus, in `V3SchedTiming`, we only have to worry about read references, so we can simply copy all remaining locals. Because we copy, lifetimes are not an issue. This fixes a bug that allowed assignment intravals to be overwritten if they go out of scope in the containing function.
This patch enforces the use of the most specific accessors for operands
which have an '@astgen alias' declaration, by making the superclass
accessors of the same operands private. This ensures client code is
cleaner as you can't use multiple different methods to reference the
same operands (which we used to in some places). Also prep for some
refactoring.
V3Subst is currently the pass responsible for peak memory usage. This
patch saves ~16% of peak memory on OpenTitan.
2 changes:
- It is actually safe to delete the substituted expressions immediately,
but this is the lesser contribution
- More importantly, we only ever substitute STMTTEMP variables, which
are always defined within the same CFunc, so we can limit the scope of
the optimization to CFunc. This allows us to reclaim the SubstVarEntry
structures at the end of every CFunc, rather than at the end of the
whole pass, which gives us most of the memory savings.
Generated output is identical
Used to set the wrong public flag on forceEn/forceVal, which means they
were not included in ICO as necessary, but V3Gate tended to inline them,
so this was hard to hit.
Fixes#4577
This saves about 5% memory. V3AstUserAllocator is appropriate for most use
cases, performance is marginally up as we are mostly D-cache bound on
large designs.
The typical find/if-not-exists-insert pattern can be achieved with 1
lookup instead of 2 using emplace with a sentinel value. Also maps value
initialize their values when inserted with the [] operator, this is
defined and so there is no need to explicitly insert zeroes for integer
values.
This patch adds some abstract enums to pass to the trace decl* APIs, so
the VCD/FST specific code can be kept in verilated_{vcd,fst}_*.cc, and
removed from V3Emit*. It also reworks the generation of the trace init
functions (those that call 'decl*' for the signals) such that the scope
hierarchy is traversed precisely once during initialization, which
simplifies the FST writer. This later change also has the side effect of
fixing tracing of nested interfaces when traced via an interface
reference - see the change in the expected t_interface_ref_trace - which
previously were missed.
Some values emitted to the trace files are constant (e.g.: actual
parameter values), and never change. Previously we used to trace these
in the 'full' dumps, which also included all other truly variable
signals. This patch introduces a new generated trace function 'const',
to complement the 'full' and 'chg' flavour, and 'const' now only
contains the constant signals, while 'full' and 'chg' contain only the
truly variable signals. The generate 'full' and 'chg' trace functions
now have exactly the same shape. Note that 'const' signals are still
traced using the 'full*' dump methods of the trace buffers, so there is
no need for a third flavour of those.
V3Stats for "fast" code have bit-rotted a little and is causing some
problems with tests that rely on stats outputs. The problem is that not
all code is necessarily reachable from eval() any more (due to the
complexity of some the features added over the past few years), so it
might miss some things, as for measuring the "fast" code, it is trying
to trace the execution paths via calls, starting from eval(). It also
appears the fast code can also contain calls to slow code in some
circumstances.
To avoid all that, removed trying to trace dynamic execution, and simply
report the static node counts, which is enough for testing.
Similarly, the variable counts are somewhat dubious, as they don't
include all data types, or all instances of a module in some stages.
Removing these as they are not widely used nor dependable. More specific
stats can be added if required and can be well defined.
Prior to this we failed to create implicit nets for inputs of gate
primitives, which is required by the standard (IEEE 1800-2017 6.10).
Note: outputs were covered due to being modeled as the LHS of
assignments, which do create implicit nets.
Again --prof-exec have bit-rotted a little with all the recent changes
to the structure of the generated code. This patch contains a few
improvements:
- Repalce the eval/evl_loop begin/end events with generic
section_push/section_pop events, that can be arbitrarily sprinkled
into the generate code (so long as they are matched correctly) to
measure various sections. The report then contains a nested profile
of the sections, and the VCD trace shows the section names.
- Better handling of exec graphs
- Clearer overall statistics
Still some remains of the --threads 0 mode. Remove unnecessary complexity
from V3EmitCModel. (Also don't pretend there is an MTask in single
threaded mode, when there really isn't.)
Previously V3InstrCount used to completely ignore an AstConcat,
including its children (see the rational in the comment). The problem is
the operands can be huge and expensive compound expressions (especially
since DFG), and not just a simple variable reference. This fix gains
some MT speed improvement.
Since we removed --threads 0 support, the 'threads()' option always
returns a value >= 1. Remove corresponding dead code.
Some of the coverage counters appear to use atomics even if the model is
single threaded. I'm under the impression this was a bug originally so
those ones I changed to use threads() > 1 instead.
According to 1800-2017 36.3, 1800-2017 A.9.3, 1364-2005 20.2 and 1364-2005 A.9.3, user defined system task and function identifiers can use the same character set for the second character as all the following characters.