Apart from the representational changes below, this patch renames
AstNodeMath to AstNodeExpr, and AstCMath to AstCExpr.
Now every expression (i.e.: those AstNodes that represent a [possibly
void] value, with value being interpreted in a very general sense) has
AstNodeExpr as a super class. This necessitates the introduction of an
AstStmtExpr, which represents an expression in statement position, e.g :
'foo();' would be represented as AstStmtExpr(AstCCall(foo)). In exchange
we can get rid of isStatement() in AstNodeStmt, which now really always
represent a statement
Peak memory consumption and verilation speed are not measurably changed.
Partial step towards #3420
In non-static contexts like class objects or stack frames, the use of
global trigger evaluation is not feasible. The concept of dynamic
triggers allows for trigger evaluation in such cases. These triggers are
simply local variables, and coroutines are themselves responsible for
evaluating them. They await the global dynamic trigger scheduler object,
which is responsible for resuming them during the trigger evaluation
step in the 'act' eval region. Once the trigger is set, they await the
dynamic trigger scheduler once again, and then get resumed during the
resumption step in the 'act' eval region.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Prevents the possibility of assigning an integer to a class reference,
both at the SystemVerilog and the emitted C++ levels.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
* Put suspended coroutine source location in a separate struct,
* Have `dump()` always print, wrap calls in `VL_DEBUG_IF`,
* Improve const correctness.
These are also used as a marker (when non-nullptr) when creating a
buffer. Reset them when they are not valid to avoid invalid write if a
buffer is created after a close (due to a subsequent re-open).
Fixes#3651.
This change introduces a custom reference-counting pointer class that
allows creating such pointers from 'this'. This lets us keep the
receiver object around even if all references to it outside of a class
method no longer exist. Useful for coroutine methods, which may outlive
all external references to the object.
The deletion of objects is deferred until the next time slot. This is to
make clearing the triggered flag on named events in classes safe
(otherwise freed memory could be accessed).
Adds timing support to Verilator. It makes it possible to use delays,
event controls within processes (not just at the start), wait
statements, and forks.
Building a design with those constructs requires a compiler that
supports C++20 coroutines (GCC 10, Clang 5).
The basic idea is to have processes and tasks with delays/event controls
implemented as C++20 coroutines. This allows us to suspend and resume
them at any time.
There are five main runtime classes responsible for managing suspended
coroutines:
* `VlCoroutineHandle`, a wrapper over C++20's `std::coroutine_handle`
with move semantics and automatic cleanup.
* `VlDelayScheduler`, for coroutines suspended by delays. It resumes
them at a proper simulation time.
* `VlTriggerScheduler`, for coroutines suspended by event controls. It
resumes them if its corresponding trigger was set.
* `VlForkSync`, used for syncing `fork..join` and `fork..join_any`
blocks.
* `VlCoroutine`, the return type of all verilated coroutines. It allows
for suspending a stack of coroutines (normally, C++ coroutines are
stackless).
There is a new visitor in `V3Timing.cpp` which:
* scales delays according to the timescale,
* simplifies intra-assignment timing controls and net delays into
regular timing controls and assignments,
* simplifies wait statements into loops with event controls,
* marks processes and tasks with timing controls in them as
suspendable,
* creates delay, trigger scheduler, and fork sync variables,
* transforms timing controls and fork joins into C++ awaits
There are new functions in `V3SchedTiming.cpp` (used by `V3Sched.cpp`)
that integrate static scheduling with timing. This involves providing
external domains for variables, so that the necessary combinational
logic gets triggered after coroutine resumption, as well as statements
that need to be injected into the design eval function to perform this
resumption at the correct time.
There is also a function that transforms forked processes into separate
functions.
See the comments in `verilated_timing.h`, `verilated_timing.cpp`,
`V3Timing.cpp`, and `V3SchedTiming.cpp`, as well as the internals
documentation for more details.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
Various optimizations to speed up MTasks coarsening (which is the long
pole in the multi-threaded scheduling of very large designs).
The biggest impact ones:
- Use efficient hand written Pairing Heaps for implementing priority
queues and the scoreboard, instead of the old SortByValueMap. This
helps us avoid having to sort a lot of merge candidates that we will
never actually consider and helps a lot in performance.
- Remove unnecessary associative containers and store data structures
(the heap nodes in particular) directly in the object they relate to.
This eliminates a huge amount of lookups and helps a lot in
performance.
- Distribute storage for SiblingMC instances into the LogicMTask
instances, and combine with the sibling maps. This again eliminates
hash table lookups and makes storage structures smaller.
- Remove some now bidirectional edge maps, keep only the forward map.
There are also some other smaller optimizations:
- Replaced more unnecessary dynamic_casts with static_casts
- Templated some functions/classes to reduce the number of static
branches in loops.
- Improves sorting of edges for sibling candidate creation
- Various micro-optimizations here and there
This speeds up MTask coarsening by 3.8x on a large design, which
translates to a 2.5x speedup of the ordering pass in multi-threaded
mode. (Combined with the earlier optimizations, ordering is now 3x
faster.)
Due to the elimination of a lot of the auxiliary data structures, and
ensuring a minimal size for the necessary ones, memory consumption of
the MTask coarsening is also reduced (measured up to 4.4x reduction
though the accuracy of this is low).
The algorithm is identical except for minor alterations of the order
some candidates are added or removed, this can cause perturbation in the
output due to tied scores being broken based on IDs.
__gcov_flush was a private function and was removed from later GCC
versions (at least from 11.2.0, possibly earlier). Replace with the
documented public __gcov_dump.
These have been 'deprecated' for 2 years and are otherwise unused except
for using a temporary placeholder value, which I have inlined with the
default value.
Also remove the now VL_TIME_STR_CONVERT utility function (and
corresponding unit tests), which have no references in any project on
GitHub.
All remaining use of conditional compilation in the tracing
implementation of the run-time library are replaced with the use of
VerilatedModel::traceConfig, and is now done at run-time.
Always fail if adding a model to a trace file that has already executed
a dump. We used to do this before as well, though in a less robust way.
We will be relying on this property more in the future, so improve the
check.
Always build the FST libray with -DFST_WRITER_PARALLEL, iff VL_THREADED.
This supports run-time enablement of the FST writer thread, and has no
measurable performance impact on single threaded tracing but simplifies
the library build.
Note: the actual choice of using the fst writer thread is still compile
time, but can now be made run-time easily.
Step towards a proper run-time library. Reduce the amount of ifdefs in
the implementation of offloaded tracing. There are still a very small
number of ifdefs left, which will need more careful changes in order to
keep user API compatibility.
VCD tracing is now parallelized using the same thread pool as the model.
We achieve this by breaking the top level trace functions into multiple
top level functions (as many as --threads), and after emitting the time
stamp to the VCD file on the main thread, we execute the tracing
functions in parallel on the same thread pool as the model (which we
pass to the trace file during registration), tracing into a secondary
per thread buffer. The main thread will then stitch (memcpy) the buffers
together into the output file.
This makes the `--trace-threads` option redundant with `--trace`, which
now only affects `--trace-fst`. FST tracing uses the previous offloading
scheme.
This obviously helps a lot in VCD tracing performance, and I have seen
better than Amdahl speedup, namely I get 3.9x on XiangShan 4T (2.7x on
OpenTitan 4T).
This change is not a functional one; it is only meant to appease the
compiler with respect to warnings such as GCC's `-Wtype-limits`.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
This is a major re-design of the way code is scheduled in Verilator,
with the goal of properly supporting the Active and NBA regions of the
SystemVerilog scheduling model, as defined in IEEE 1800-2017 chapter 4.
With this change, all internally generated clocks should simulate
correctly, and there should be no more need for the `clock_enable` and
`clocker` attributes for correctness in the absence of Verilator
generated library models (`--lib-create`).
Details of the new scheduling model and algorithm are provided in
docs/internals.rst.
Implements #3278
The --prof-threads option has been split into two independent options:
1. --prof-exec, for collecting verilator_gantt and other execution
related profiling data, and
2. --prof-pgo, for collecting data needed for PGO
The implementation of execution profiling is extricated from
VlThreadPool and is now a separate class VlExecutionProfiler. This means
--prof-exec can now be used for single-threaded models (though it does
not measure a lot of things just yet). For consistency VerilatedProfiler
is renamed VlPgoProfiler. Both VlExecutionProfiler and VlPgoProfiler are
in verilated_profiler.{h/cpp}, but can be used completely independently.
Also re-worked the execution profile format so it now only emits events
without holding onto any temporaries. This is in preparation for some
future optimizations that would be hindered by the introduction of function
locals via AstText.
Also removed the Barrier event. Clearing the profile buffers is not
notably more expensive as the profiling records are trivially
destructible.
While GNU 'ar' supports '@' to specify a file, BSD 'ar' does not.
The max line length can be handled by 'xargs' instead, which will know
to break up the command. In case there are multiple calls, only build
the index (specified with '-s') once in a later call.
* Tests: Add t_hier_block_sc_trace(fst|vcd) that tests tracing hierarchical block on SystemC.
* Add a check that elaboration is done before a trace file is opened.
* Add a check that elaboration is done before trace() is called to verilated SystemC model.
* Tests: call sc_core::sc_start(sc_core::SC_ZERO_TIME) before opening a trace file
* Tests: Fix t_trace_two_sc to call sc_start before opening trace
* Use vl_fatal as suggested in PR review.
Trace initialization (tracep->decl* functions) used to explicitly pass
the complete hierarchical names of signals as string constants. This
contains a lot of redundancy (path prefixes), does not scale well with
large designs and resulted in .rodata sections (the string constants) in
ELF executables being extremely large.
This patch changes the API of trace initialization that allows pushing
and popping name prefixes as we walk the hierarchy tree, which are
prepended to declared signal names at run-time during trace
initialization. This in turn allows us to emit repeat path/name
components only once, effectively removing all duplicate path prefixes.
On SweRV EH1 this reduces the .rodata section in a --trace build by 94%.
Additionally, trace declarations are now emitted in lexical order by
hierarchical signal names, and the top level trace initialization
function respects --output-split-ctrace.
C++20 requires that aggregate types do not have a user declared
constructor, not even an explicitly defaulted one. We need these types
to be aggregates for static initialization.
Fixes#3076.
What previously used to be per module static constants created in
V3Table and V3Prelim are now merged globally within the whole model and
emitted as part of a separate constant pool. Members of the constant
pool are global variables which are declared lazily when used (similar to
loose methods).
This patch introduces the concept of 'loose' methods, which semantically
are methods, but are declared as global functions, and are passed an
explicit 'self' pointer. This enables these methods to be declared
outside the class, only when they are needed, therefore removing the
header dependency. The bulk of the emitted model implementation now uses
loose methods.
Using the standard model Makefile, when in addition to an explicit
target, the target 'ccache-report' is also given, a summary of ccache
hits/misses during this invocation of 'make' will be prited at the end
of the build.
Reasons are:
- it's error prone to keep updating whennever m_suffixes is resized
- invalid pointer may be set when there is not signal to trace as in t_trace_dumporder_bad
* Add VL_ATTR_NO_SANITIZE_ALIGN macro to disable alignment check of ubsan
* Mark a function VL_ATTR_NO_SANITIZE_ALIGN because the function is intentionally using unaligned access for the sake of performance.
Co-authored-by: Wilson Snyder <wsnyder@wsnyder.org>
The "file" make function is only available in gmake 4.x, which isn't
available on RHEL/CentOS 7. Use alternative syntax to support older
gmake versions.
Fixes#2920
** Add simulation context (VerilatedContext) to allow multiple fully independent
models to be in the same process. Please see the updated examples.
** Add context->time() and context->timeInc() API calls, to set simulation time.
These now are recommended in place of the legacy sc_time_stamp().
* Fix memory leak of t_trace_cat and t_trace_cat_renew
* Fix memory leak of t_trace_c_api
* Fix memory leak in t_trace_public_func and t_trace_public_func_vlt
* Fix memory leaks in t_flat_build (and probably more).
* Use unique_ptr in testcases
* Guarantee mechanism to initialize just once is now in VerilatedInitializer. No functional change is intended.
* Make sure to initialize Verilated::NonInitialized just once. Fixes
memory leak in t_prot_lib_shared and t_hier_block_prot_lib_shared.
* Call setup() and teardown() of Verilated::NonSerialized.
* Test that indexing into memory word fails
* VPI: don't index into memory word
Memory and memory words share a VerilatedVar, so check for memory word before attempting to index.
* Test that range iterator exhausts after 1 dimension
* Separate range iterator from range object
Range iterators should return NULL after all ranges have been returned
The end iterator always points to an element past the end of the list.
When new elements are added to the back of the list, they are inserted
before the end iterator.
Instead, track the last element in the list at the start of processing
and stop after it's been processed.
Previously, in any given VPI callback, if the callback body registered
the same callback, that registering would be processed in the currently
executing call to the call*Cbs function. In the worse case, this could
lead to an infinite loop.
* Add a test to use string for $fgets
* Use dedicated function for $fgets to std::string
* share the implementation of $fgets
* Pass -1 for bitwidth of std::string to distinguish from POD
* add checks for scanf with string
* apply clang-format
This is to allow C++ verilator toplevel to support
multiple modes of waveform tracing
VM_TRACE_FST can be used inside a #if VM_TRACE
section to switch between classic .vcd tracing and the
more compact .fst format supported by GTKWAVE
* Add a test to use shared object of protect-lib
* Add a guard to call ctor/dtor just once even when a protec-lib is shared object.
* Pass .a to linker in leaf-last order for older ld.
* Add -flat_namespace for mac
* Don't use thin-archive to merge multiple archives because it is gnu-ar specific.
* remove -time option that may caues -Wunused-command-line-argument warning
This allows compiling the run-time library with optimization even when OPT_FAST is not used in order to imporove model build speed, possibly during debug cycles.
Instead of __ALLfast.cpp and __ALLslow.cpp, we now create only a single
__ALL.cpp and compile it with OPT_FAST, this speeds up small builds
where the C compiler does not dominate. A separate patch will follow
turning VM_PARALLEL_BUILDS on by default at a certain size.
Given this change to the build there is now no point in emitting both
fast and slow routines into the same .cpp file when --output-split is
not set as they will be just included in the same __ALL.cpp file. To
keep things simpler and the output easier to comprehend, V3EmitC has
also been changed to always emit the fast and slow files separately.
Also change verilated.mk to apply OPT_SLOW to all slow files, not just
ones called *__Slow.cpp. This change in particular ensures __Syms.cpp
is build as slow.
Part of #2360.
The main goal of this patch is to enable splitting the full and
incremental tracing functions into multiple functions, which can then be
run in parallel at a later stage. It also simplifies further
experimentation as all of the interesting trace code construction now
happens in V3Trace. No functional change is intended by this patch, but
there are some implementation changes in the generated code.
Highlights:
- Pass symbol table directly to trace callbacks for simplicity.
- A new traceRegister function is generated which adds each trace
function as an individual callback, which means we can have multiple
callbacks for each trace function type.
- A new traceCleanup function is generated which clears the activity
flags, as the trace callbacks might be implemented as multiple functions.
- Re-worked sub-function handling so there is no separate sub-function
for each trace activity class. Sub-functions are generate when required
by splitting.
- traceFull/traceChg are now created in V3Trace rather than V3TraceDecl,
this requires carrying the trace value tree in TraceDecl until it
reaches V3Trace where the TraceInc nodes are created (previously a
TraceInc was also created in V3TraceDecl which carries the value).
- Allow arbitrary number of open array dimensions, not just 3. Note
right now this only works with the array querying functions specified
in IEEE 1800-2017 H.12.2
- Issue error when passing dynamic array or queue as DPI open array
(currently unsupported)
- Also tweaked AstVar::vlArgTypeRecurse, which should now error or fail
for unsupported types.
Use SIMD intrinsics to render VCD traces.
I have measured 10-40% single threaded performance increase with VCD
tracing on SweRV EH1 and lowRISC Ibex using SSE2 intrinsics to render
the trace. Also helps a tiny bit with FST, but now almost all of the FST
overhead is in the FST library.
I have reworked the tracing routines to use more precisely sized
arguments. The nice thing about this is that the performance without the
intrinsics is pretty much the same as it was before, as we do at most 2x
as much work as necessary, but in exchange there are no data dependent
branches at all.
Convert trace buffer to 32-bit entries, rather than a union containing a
pointer type. Also tweaked trace entry layouts for a bit more
performance. This gains another 10% on SweRV EH1 CoreMark.
- Change templated trace routines to branch table.
Removed templating from trace chgBus and fullBus and replaced them with
a branch table like the other there is a very small (< 1%) penalty for
this on SwerRV EH1 CoreMark, but this is less than the variability of
disk IO so it's worth it to keep the code simpler and smaller.
- Prefetch VCD suffix buffer at the top of emit*
- Increase ILP in VCD emit* routines
- Use a 64-bit unaligned store to emit the VCD suffix (on x86 only)
The performance difference with these is very small, but the changes
hopefully make this code more performance-portable across various
micro-architectures.
VerilatedVcdC::openNext() failed to flush the tracing thread before
opening the next output file, which caused t_trace_cat.pl to fail
with --vltmt on occasion.
The --trace-threads option can now be used to perform tracing on a
thread separate from the main thread when using VCD tracing (with
--trace-threads 1). For FST tracing --trace-threads can be 1 or 2, and
--trace-fst --trace-threads 1 is the same a what --trace-fst-threads
used to be (which is now deprecated).
Performance numbers on SweRV EH1 CoreMark, clang 6.0.0, Intel i7-3770 @
3.40GHz, IO to ramdisk, with numactl set to schedule threads on different
physical cores. Relative speedup:
--trace -> --trace --trace-threads 1 +22%
--trace-fst -> --trace-fst --trace-threads 1 +38% (as --trace-fst-thread)
--trace-fst -> --trace-fst --trace-threads 2 +93%
Speed relative to --trace with no threaded tracing:
--trace 1.00 x
--trace --trace-threads 1 0.82 x
--trace-fst 1.79 x
--trace-fst --trace-threads 1 1.23 x
--trace-fst --trace-threads 2 0.87 x
This means FST tracing with 2 extra threads is now faster than single
threaded VCD tracing, and is on par with threaded VCD tracing. You do
pay for it in total compute though as --trace-fst --trace-threads 2 uses
about 240% CPU vs 150% for --trace-fst --trace-threads 1, and 155% for
--trace --trace threads 1. Still for interactive use it should be
helpful with large designs.
This patch de-duplicates common functionality between the VCD and FST
trace implementation. It also enables adding new trace formats more
easily and consistently.
No functional nor performance change intended.
The FST trace timescale used to be set in the constructor via
set_time_unit, but at that point we haven't normally opened the
file yet so it was just dropped. On top of that, we actually want
to use set_time_resolution... FST trace timescales now match the VCD.
If the first dump was not at time zero, then the FST trace used
to contain the initial values as if they were set at time zero. Now
they only appear at the time the first dump call is actually made,
and hence match the VCD trace exactly.
Includes `timescale, $printtimescale, $timeformat.
VL_TIME_MULTIPLIER, VL_TIME_PRECISION, VL_TIME_UNIT have been removed
and the time precision must now match the SystemC time precision.
To get closer behavior to older versions, use e.g. --timescale-override
"1ps/1ps".
* Improve tracing performance.
Various tactics used to improve performance of both VCD and FST tracing:
- Both: Change tracing functions to templates to take variable widths as
template parameters. For VCD, subsequently specialize these to the
values used by Verilator. This avoids redundant instructions and hard
to predict branches.
- Both: Check for value changes via direct pointer access into the
previous signal value buffer. This eliminates a lot of simple pointer
arithmetic instructions form the tracing code.
- Both: Verilator provides clean input, no need to mask out used bits.
- VCD: pre-compute identifier codes and use memory copy instead of
re-computing them every time a code is emitted. This saves a lot of
instructions and hard to predict branches. The added D-cache misses
are cheaper than the removed branches/instructions.
- VCD: re-write the routines emitting the changes to be more efficient.
- FST: Use previous signal value buffer the same way as the VCD tracing
code, and only call the FST API when a change is detected.
Performance as measured on SweRV EH1, with the pre-canned CoreMark
benchmark running from DCCM/ICCM, clang 6.0.0, Intel i7-3770 @ 3.40GHz,
and IO to ramdisk:
+--------------+---------------+----------------------+
| VCD | FST | FST separate thread |
| (--trace) | (--trace-fst) | (--trace-fst-thread) |
------------+-----------------------------------------------------+
Before | 30.2 s | 121.1 s | 69.8 s |
============+==============+===============+======================+
After | 24.7 s | 45.7 s | 32.4 s |
------------+--------------+---------------+----------------------+
Speedup | 22 % | 256 % | 215 % |
------------+--------------+---------------+----------------------+
Rel. to VCD | 1 x | 1.85 x | 1.31 x |
------------+--------------+---------------+----------------------+
In addition, FST trace size for the above reduced by 48%.