Avoid cloning the module when inlining the last instance that references
that module. This saves a lot of memory because it saves cloning
singleton modules (those with a single instance), which we always
inline. The top few levels of the hierarchy are often simple wrappers,
including the one added by Verilator in V3LinkLevel::wrapTop. Cloning
these and putting off deleting the originals can be very expensive
because they often have a lot of contents inlined into them, so each
layer of wrapper that is inlined would essentially add a whole new clone
of the large top-level. Directly inlining the module for the last cell
without cloning saves us from all this duplicate memory consumption and
also from having to create the clones in the first place.
Also added minor traversal speedups
This reduces the memory consumption of V3Inline by 80% and peak memory
consumption of Verilator by about 66% on a large design, while speeding
up the V3Inline pass by ~3.5x and the whole of Verilator by ~8% while
producing identical output.
- More efficient comparison by pre-computing sorting keys.
- Remove work items in algorithms known to be redundant earlier.
This greatly reduces data structure sizes.
- Use V3GraphVertex->user() for state tracking instead of unordered_map
while both of these are constant time, they do add up.
- In `makeMinSpanningTree`, instead of batch inserting outgoing edges of
each visited vertex into an ordered set, keep an ordered set of sorted
vectors of edges. This reduces the size of the ordered set
significantly (it is now O(V) rather than O(E), and as the subject
graph is a complete graph, V ~ sqrt(E), so this is a significant gain).
- Use a vector + sorting in `perfectMatching` instead of an ordered set.
This is faster on large working sets.
This yields 3.8x speedup on the variable order pass and overall 14%
verilation speed gain on a large design.
Repeatedly traversing whole modules in emit (due to file splitting)
looking for `systemc_* sections can add up to a lot of time on large
designs that have been flattened and need to be split into many files.
Assuming `systemc_* is a rarely used feature, just don't bother if we
don't need to. This gain 9% verilation speed improvement on a large
benchmark.
* Tests: Add t_hier_block_sc_trace(fst|vcd) that tests tracing hierarchical block on SystemC.
* Add a check that elaboration is done before a trace file is opened.
* Add a check that elaboration is done before trace() is called to verilated SystemC model.
* Tests: call sc_core::sc_start(sc_core::SC_ZERO_TIME) before opening a trace file
* Tests: Fix t_trace_two_sc to call sc_start before opening trace
* Use vl_fatal as suggested in PR review.
Trace initialization (tracep->decl* functions) used to explicitly pass
the complete hierarchical names of signals as string constants. This
contains a lot of redundancy (path prefixes), does not scale well with
large designs and resulted in .rodata sections (the string constants) in
ELF executables being extremely large.
This patch changes the API of trace initialization that allows pushing
and popping name prefixes as we walk the hierarchy tree, which are
prepended to declared signal names at run-time during trace
initialization. This in turn allows us to emit repeat path/name
components only once, effectively removing all duplicate path prefixes.
On SweRV EH1 this reduces the .rodata section in a --trace build by 94%.
Additionally, trace declarations are now emitted in lexical order by
hierarchical signal names, and the top level trace initialization
function respects --output-split-ctrace.
* Add a test to reproduce #3197
* Fix#3197. Optimize correctly even if a variable is >32
* Quick exit instead of continue. No functional change is intended.
* Delete comment-out line.
* update per review comment
IEEE 1800-2017 6.11.3 says these types are unsigned. Until now these
types were treated as not having a signedness (NOSIGN), and nodes having
these types were later resolved by V3Width to be unsigned. This is a bit
problematic when creating nodes of these types after V3Width. Treating
these types as unsigned from the get go is fine, and actually improves
generated code slightly.
* Add a test to reproduce bug3182. Run the same HDL with -Oo to confirm the result is same.
* Hopefully fix#3182. The result can be 0 only when polarity is true (no AstNot is found during traversal).
* add tests to reproduce #3177.
Any random test circuits can be added to t_split_var_4.v later because it uses CRC to check the result while
t_split_var_0.v has just barrel shifters.
* Fix#3177. Don't merge assign statements if a variable is marked split_var.
This is a partial cleanup of V3Order with the aim of increasing clarity:
- Split the initial OrderGraph building and the actual ordering process
into separate classes (OrderVisitor -> OrderBuildVisitor + OrderProcess)
- Remove all the historical cruft from the graph building phase (now in
OrderBuildVisitor), and add more assertions for assumptions.
- Change the dot styling of OrderGraph to use shapes and more easily
distinguishable colors.
- Expand vague comments, remove incorrect comments, and add more.
- Replace some old code with cleaner C++11 constructs.
- Move code about a bit so logically connected sections are closer to
each other, scope some definitions where they are used rather than file
scope.
- The actual ordering process (now in OrderProcess) is still largely
unchanged.
The generated code is identical to before (within the limits of the
exiting non-determinism).
The _CONST suffix on these macros is only lexical notation, pointer
constness can be preserved by overloading the underlying
implementations appropriately. Given that the compiler will catch
invalid const usage (trying to assign a non-const pointer to a const
pointer variable, etc.), and that the declarations of symbols should
make their constness obvious, I see no reason to keep the _CONST
flavours.
Fail at compile time if the result of these macros can be statically
determined (i.e.: they aways succeed or always fail). Remove unnecessary
casts discovered. No functional change.
VN_AS should be used over VN_CAST in code where the author knows up
front (i.e.: statically) what the true type of the node is. This has
multiple benefits over VN_CAST:
- In the debug build: Asserts node type is as expected
- In the optimized build: It is faster as no superfluous type test
- And (I would argue most importantly) it documents intent in the code
No current instances of VN_CAST changed in this patch
- Merge AstNodeIf nodes as well (not just assignment from AstCond)
- Merge merged results recursively (optimizes nested conditionals/ifs)
- Only checking mergeability once per node.
- Don't add redundant masking
- Duplicate cheap statements in both branches, if doing so yields a
larger merge
- Include reduced nodes before the starting conditional in the merge
- Remove redundant casting
- Cheaper final XOR parity flip (~/^1 instead of != 0)
- Support XOR reduction of ~XOR nodes
- Don't add redundant masking of terms
- Support unmasked terms
- Add cheaper implementation for single bit only terms
- Ensure result is clean under all circumstances (this fixes current bugs)
Verilator should now correctly re-evaluate any logic that depends on
state set in a DPI exported function, including if the DPI export is
called outside eval, or if the DPI export is called from a DPI import.
Whenever the design contains a DPI exported function that sets a
non-local variable, we create a global __Vdpi_export_trigger flag, that
is set in the body of the DPI export, and make all variables set in any
DPI exported functions dependent on this flag (this ensures correct
ordering and change detection on state set in DPI exports when needed).
The DPI export trigger flag is cleared at the end of eval, which ensured
calls to DPI exports outside of eval are detected. Additionally the
ordering is modifies to assume that any call to a 'context' DPI import
might call DPI exports by adding an edge to the ordering graph from the
logic vertex containing the call to the DPI import to the DPI export
trigger variable vertex (note the standard does not allow calls to DPI
exports from DPI imports that were not imported with 'context', so we
do not enforce ordering on those).
- Use C++11 initialization syntax
- Use C++11 for loops
- Add const
- Factor out repeated _->fileline() sub-expressions
- Factor out issuing warning message
No functional change.
- Use C++ initialization syntax
- Add const
- Make members static where appropriate
- Factor out repeated _->fileline() sub-expressions
No functional change.
* EmitXml: Added <ccall>, <constpool>, <initarray>/<inititem>, wrapped children of <if> and <while> with <begin> elements to prevent ambiguity
* EmitXml: added signed="true" to signed basicdtypes
All parameters that are required in the output are now emitted as
'static constexpr, except for string or array of strings parameters,
which are still emitted as 'static const' (required as std::string is
not a literal type, so cannot be constexpr). This simplifies handling
of parameters and supports 'real' parameters.
This patch partitions AstCFuncs under an AstNodeModule based on which
header files they require for their implementation, and emits them
into separate files based on the distinct dependency sets. This helps
with incremental recompilation of the output C++.
Apart from adding required AstCUse, V3CUse also used to create some
standard methods for classes. This is now done in a separate pass
V3Common. Note that this is not a performance issue, as V3Common just
iterates through each module, which are stored in a simple linked list
under the netlist, and does not need to traverse the whole netlist.
Other sub-classes of AstNodeCCall do not need the self pointer. Moving
it into the specific sub-class that needs it clarifies V3Descope and
Emit. No functional change intended.
Internal AstNodeModule headers (.h) and implementation (.cpp) files are
now emitted separately in V3EmitC::emitcHeaders() and
V3EmitC::emitcImp() respectively. No functional change intended
A separate V3VariableOrder pass is now used to order module variables
before Emit. All variables are now ordered together, without
consideration for whether they are ports, signals form the design, or
additional internal variables added by Verilator (which used to be
ordered and emitted as separate groups in Emit). For single threaded
models, this is performance neutral. For multi-threaded models, the
MTask affinity based sorting was slightly modified, so variables with no
MTask affinity are emitted last, otherwise the MTask affinity sets are
sorted using the TSP sorter as before, but again, ports, signals, and
internal variables are not differentiated. This yields a 2%+ speedup for
the multithreaded model on OpenTitan.
This patch makes OpenTitan verilation with --debug-check 22% faster, and
the same with --debug --no-dump-tree 91% faster. Functionality is the
same (including when VL_LEAK_CHECKS is defined), except V3Broken can now
always find duplicate references via child/next pointers if the target
node is not `maybePointedTo()` (previously this only happened when
compiled with VL_LEAK_CHECKS). The main change relates to storing the
v3Broken traversal state in the AstNode by stealing a byte from what
used to be unused flags. We retain an unordered_set only for marking
pointers as valid to be referenced via a non-child/non-next member
pointer.
Do not sort and hoist function local variables to the top of the
function definition. The stack layout of automatic variables is not
defined by C so the compilers can lay these out optimally. Simplifies
internals for follow on work. Effect on model performance is neutral to
very slight improvement, so we do not seem to be loosing anything.
V3Broken now checks that AstVar nodes referenced in an AstCFunc are
either external, or appear in the tree before the reference, and are in
scope.
Fix V3Begin to move lifted AstVars to beginning of FTask, rather than
end, which trips the above check.
debug() declared by VL_DEGUB_FUNC used to cache the result of the debug
level lookup (which depends on options) in a static. This meant that if
the debug() function was called before option parsing, the default debug
level of 0 would be used for the rest of the program, even if a --debug
option was given. Fixed by not caching the debug level until after
option parsing is complete.
The bool AstNode flags fit in a padding gap after m_type. This reduces
memory consumption by about 2% on OpenTitan. Initializing the unused
bits in the flags then avoids a read-modify-write in the constructor
(replacing it with a store constant). Overall verilation speed is about
1% faster.