The behavior of these format specifiers is highly specific to Verilog
(`$time` and `$realtime` are only defined relative to `$timescale`)
and may not fit other languages well, if at all. If they choose to use
it, it is now clear what they are opting into.
This commit also simplifies the CXXRTL code generation for these format
specifiers.
At the moment the only thing it allows is redirecting `$print` cell
output in a context-dependent manner. In the future, it will allow
customizing handling of `$check` cells (where the default is to abort),
of out-of-range memory accesses, and other runtime conditions with
effects.
This context object also allows a suitably written testbench to add
Verilog-compliant `$time`/`$realtime` handling, albeit it involves
the ceremony of defining a `performer` subclass. Most people will
want something like this to customize `$time`:
int64_t time = 0;
struct : public performer {
int64_t *time_ptr;
int64_t time() const override { return *time_ptr; }
} performer = { &time };
p_top.step(&performer);
This approach to tracking simulation time was a mistake that I did not
catch in review. It has several issues:
1. There is absolutely no requirement to call `step()`, as it is
a convenience function. In particular, `steps` will not be
incremented in submodules if `-noflatten` is used.
2. The semantics of `steps` does not match that of the Verilog `$time`
construct.
3. There is no way to make the semantics of `%t` match that of Verilog.
4. The `module` interface is intentionally very barebones. It is little
more than a container for three method pointers, `reset`, `eval`,
and `commit`. Adding ancillary data to it goes against this.
If similar functionality is introduced again it should probably be
a variable that is global per toplevel design using some object that is
unique for an entire hierarchy of modules, and ideally exposed via
the C API. For now, it is being removed (in this commit) and (in next
commit) the capability is being reintroduced through a context object
that can be specified for `eval()`.
This commit achieves three roughly equally important goals:
1. To bring the rendering code in kernel/fmt.cc and in cxxrtl.h as close
together as possible, with an ideal of only having the bigint library
as the difference between the render functions.
2. To make the treatment of `$time` and `$realtime` in CXXRTL closer to
the Verilog semantics, at least in the formatting code.
3. To change the code generator so that all of the `$print`-to-`string`
conversion code is contained inside of a closure.
There are two reasons to aim for goal (3):
a. Because output redirection through definition of a global ostream
object is neither convenient nor useful for environments where
the output is consumed by other code rather than being printed on
a terminal.
b. Because it may be desirable to, in some cases, ignore the `$print`
cells that are present in the netlist based on a runtime decision.
This is doubly true for an upcoming `$check` cell implementing
assertions, since failing a `$check` would by default cause a crash.
This avoids having to devirtualize them later to get performance back,
and simplifies the code a bit.
The change is prompted by the desire to add a similar observer object
to `eval()`, and a repeated consideration of whether the dispatch on
these should be virtual in first place.
The name `on_commit` was terrible since it would not be called, as one
may conclude from the name, on each `commit()`, but only whenever that
method actually updates a value.
This commit adds a reader/writer implementation for a file format
optimized for fast, single-pass storage and retrieval of design state
changes, as well as a recorder/replayer that integrate with the eval
and commit simulation steps to create replay logs and reproduce them
later.
This feature makes it possible to run a simulation once, recording
the stimulus as well as changes to the registers, and navigate to
a past time point in the simulation later without rerunning it.
Both the changes in inputs (stimulus) and changes in state are saved
so that navigation does not require calling `eval()` or `commit()`;
only a series of memory copy operations.
On a representative example of a SoC netlist, saving the replay log
while simulating it takes 150% of the time it would take to simulate
the same design without logging, which is a much lower overhead than
writing an equivalent full view (including memories) VCD waveform dump.
The replay log is also several times smaller than the VCD dump, and
more space savings are available as low hanging fruit.
Replaying the log has not been optimized and currently takes about
the same time as running the simulation in first place. However, it
is still useful since it provides fast navigation to an arbitrary time
point, something that rerunning the simulation does not allow for.
The current file format should be considered preliminary. It is not
very space-efficient, and my testing shows that a lot of time is spent
in the write() syscall in the kernel. Most likely, compression and/or
writing in another thread could improve performance by 10-20%. This
may be done at a later time.
The commit observer is a structure containing a callback that is invoked
whenever the `commit()` method changes a wire or a memory. This allows
code external to the compiled netlist to react to changes in the design
state in a very efficient way. One example of how this feature can be
used is an efficient implementation of record/replay.
Note that the VCD writer does not benefit from this feature because it
must be able to react to changes in any debug items and not just those
that contain design state.
While the VCD format separates the timescale and the timestep (likely
to allow representing the timestep with a small integer type), time in
CXXRTL is represented using a uniform 96-bit number, which allows for
a ±100 year range at femtosecond resolution.
The implementation uses `value<96>`, which provides fast arithmetic and
comparison operations, as well as conversion to/from a more common
representation of integer seconds plus femtoseconds.
Prior to this fix, the `CxxrtlBackend` used the entire path for the include
directive when a separated interface file is generated (via the `-header`
option). This commit updates the code to use the base name of the interface
file.
Since the C++11 standard is used by default, we cannot take advantage of
the `std::filesystem` to get the basename.
In preparation for substantial expansion of CXXRTL's runtime, this commit
reorganizes the files used by the implementation. Only minimal changes are
required in a consumer.
First, change:
-I$(yosys-config --datdir)/include
to:
-I$(yosys-config --datdir)/include/backends/cxxrtl/runtime
Second, change:
#include <backends/cxxrtl/cxxrtl.h>
to:
#include <cxxrtl/cxxrtl.h>
(and do the same for cxxrtl_vcd.h, etc.)
We add a new flow graph node type, PRINT_SYNC, as they don't get handled
with regular CELL_EVALs. We could probably move this grouping out of
the dump method.
Removing some signed checks and logic where we've already guaranteed the
values to be positive. Indeed, in these cases, if a negative value got
through (per my realisation in the signed fuzz harness), it would cause
an infinite loop due to flooring division.
Before this commit, values, wires, and memories with an initializer
were value-initialized in emitted C++ code. After this commit, all
values, wires, and memories are default-initialized, and the default
constructor of generated modules calls the reset() method, which
assigns the members that have an initializer.
The following VCD file crashes GTKWave's VCD loader:
$var wire 1 ! x:1 $end
$enddefinitions $end
In practice, a colon can be a part of a variable name that is
translated from a Verilog function, something like:
update$func$.../hdl/hazard3_csr.v:350$2534.$result
Public wires may alias buffered internal wires, so keep BUFFERED
wires in debug information even if they are private. Debug items are
only created for public wires, so this does not otherwise affect how
debug information is emitted.
Fixes#2540.
Fixes#2841.
While this helper is already useful to squash sequential initializations
into one in cxxrtl, its main purpose is to squash overlapping masked memory
initializations (when they land) and avoid having to deal with them in
cxxrtl runtime.
Similar to the treatment of black boxes, splitting processes into two
scheduling nodes adds sufficient freedom so that netlists with
well-behaved processes (e.g. those emitted by nMigen) can immediately
converge.
Because processes are not emitted into edge-triggered regions, this
approach has comparable performance to -O5 (without -noproc), which
is substantially slower than -O6.
The exact shape of C++ code emitted by CXXRTL has a critical effect
on performance, both compile-time and runtime. CXXRTL's performance
greatly improved when it started localizing and inlining wires, not
only because this assists the optimizer and register allocator, but
also because inlining code into edge-triggered regions cuts the time
spent in eval() by at least a factor of two.
However, the logic of netlist layout has always been ad-hoc, fragile,
and very hard to understand and modify. After commit ece25a45, which
introduced outlining, the same logic started being applied to two
distinct netlists at once instead of one, which barely worked.
This commit does four major changes:
* There is now a single unambiguous source of truth (per subgraph)
for the layout of any emitted wire.
* Netlist layout is now done entirely during analysis using well
known graph algorithms; no graph operations happen when emitting.
* Netlist layout now happens completely separately for eval() and
debug_eval() subgraphs.
* Unreachable (within subgraph scope) netlist nodes are now neither
emitted nor considered for wire inlining decisions.
The netlist layout code should also now closely match the described
semantics.
As a part of this large cleanup, it includes many miscellaneous
improvements:
* The "bare minimum" debug level introduced in commit dd6a761d was
split into two levels; -g1 now emits debug information *only* for
inputs and state wires, and -g2 now emits debug information for
all public members. The old behavior matches -g2. This is done
to avoid bloat on low optimization levels.
* Debug aliases and inlined connections are now handled separately,
and complex RHS never interferes with inlined connections.
* Aliases to outlined wires now carry a pointer to the outline.
* Cell sync outputs can now be emitted in debug_eval().
* Black box debug information now includes comb/sync driver flags.
* The comment emitted for inlined cells is now accurate.
* Debug information statistics now has less noise.
* Netlist layout code is now much better documented.
Due to more precise inlining decisions, unmodified (i.e. with no
Yosys script being used) netlists now have much more logic inlined
into edge-triggered regions. On Minerva SoC SRAM, this improves
runtime by 20-25% across compilers and optimization levels.
Due to more precise reachability analysis, much less C++ code is now
emitted, especially at the maximum debug level. On Minerva SoC SRAM,
this improves clang compile time by 30-50% depending on options.
gcc is not affected.
On Minerva SoC SRAM compiled with clang-11, this change cuts commit
time in half (!) and overall time by 20%. When compiled with gcc-10,
there is no difference.
In C, non-static inline functions require an implementation elsewhere
(even though the body is right there in the header). It is basically
never desirable to use those as opposed to static inline ones.
Implementing outlining has greatly increased the amount of debug
information in a typical build, and consequently exposed performance
issues in C++ compilers, which are similar for both GCC and Clang;
the compile time of Minerva SoC SRAM increased almost twofold.
Although one would expect the slowdown to be caused by the increased
use of templates in `debug_eval()`, it is actually almost entirely
attributable to optimizations and codegen for `debug_items()`.
Fortunately, it is neither possible nor desirable to optimize
`debug_items()`: in most cases it is called exactly once, and its
body is a linear sequence of calls with unique arguments.
This commit turns off optimizations for `debug_items()` on GCC and
Clang, improving -Os compile time of Minerva SoC SRAM by ~40% (!)
Before this commit, if a sequence of wires assigned in a chain would
terminate on a cell, none of the wires would get marked as aliases,
and typically all of the public wires would get outlined. The reason
for this behavior is that alias analysis predates outlining and in
fact runs before it.
After this commit, alias analysis runs after outlining and considers
outlined wires valid aliasees. More importantly, if the chained wires
contain any valid aliasees, then all of the wires are aliased to
the one that is topologically deepest.
Aliased wires incur virtually no overhead for the VCD writer, unlike
outlined wires that would otherwise take their place. On Minerva SoC
SRAM, size of the full VCD dump is reduced by ~65%, and throughput
is increased by ~55%.
Aggressive wire localization and inlining is necessary for CXXRTL to
achieve high performance. However, that comes with a cost: reduced
debug information coverage. Previously, as a workaround, the `-Og`
option could have been used to guarantee complete coverage, at a cost
of a significant performance penalty.
This commit introduces debug information outlining. The main eval()
function is compiled with the user-specified optimization settings.
In tandem, an auxiliary debug_eval() function, compiled from the same
netlist, can be used to reconstruct the values of localized/inlined
signals on demand. To the extent that it is possible, debug_eval()
reuses the results of computations performed by eval(), only filling
in the missing values.
Benchmarking a representative design (Minerva SoC SRAM) shows that:
* Switching from `-O4`/`-Og` to `-O6` reduces runtime by ~40%.
* Switching from `-g1` to `-g2`, both used with `-O6`, increases
compile time by ~25%.
* Although `-g2` increases the resident size of generated modules,
this has no effect on runtime.
Because the impact of `-g2` is minimal and the benefits of having
unconditional 100% debug information coverage (and the performance
improvement as well) are major, this commit removes `-Og` and changes
the defaults to `-O6 -g2`.
We'll have our cake and eat it too!
"Elision" in this context is an unusual and not very descriptive term
whereas "inlining" is common and straightforward. Also, introducing
"inlining" makes it easier to introduce its dual under the obvious
name "outlining".