This patch adds IEEE-1800 compliant scheduling support for the Inactive
scheduling region used for #0 delays.
Implementing this requires that **all** IEEE-1800 active region events
are placed in the internal 'act' section. This has simulation
performance implications. It prevents some optimizations (e.g.
V3LifePost), which reduces single threaded performance. It also reduces
the available work and parallelism in the internal 'nba' section, which
reduced the effectiveness of multi-threading severely.
Performance impact on RTLMeter when using scheduling adjusted to support
proper #0 delays is ~10-20% slowdown in single-threaded mode, and ~100%
(2x slower) with --threads 4.
To avoid paying this performance penalty unconditionally, the scheduling
is only adjusted if either:
1. The input contains a statically known #0 delay
2. The input contains a variable #x delay unknown at compile time
If no #0 is present, but #x variable delays are, a ZERODLY warning is
issued advising the use of '--no-runtime-zero-delay' which is a promise
by the user that none of the variable delays will evaluate to a zero
delay at run-time. This warning is turned off if '--runtime-zero-delay'
is explicitly given. This is similar to the '--timing' option.
If '--no-runtime-zero-delay' was used at compile time, then executing
a zero delay will fail at runtime.
A ZERODLY warning is also issued if a static #0 if found, but the user
specified '--no-runtime-zero-delay'. In this case the scheduling is not
adjusted to support #0, so executing it will fail at runtime. Presumably
the user knows it won't be executed.
The intended behaviour with all this is the following:
No #0, no #var in the design (#constant is OK)
-> Same as current behaviour, scheduling not adjusted,
same code generated as before
Has static #0 and '--no-runtime-zero-delay' is NOT given:
-> No warnings, scheduling adjusted so it just works, runs slow
Has static #0 and '--no-runtime-zero-delay' is given:
-> ZERODLY on the #0, scheduling not adjusted, fails at runtime if hit
No static #0, but has #var and no option is given:
-> ZERODLY on the #var advising use of '--no-runtime-zero-delay' or
'--runtime-zero-delay' (similar to '--timing'), scheduling adjusted
assuming it can be a zero delay and it just works
No static #0, but has #var and '--no-runtime-zero-delay' is given:
-> No warning, scheduling not adjusted, fails at runtime if zero delay
No static #0, but has #var and '--runtime-zero-delay' is given:
-> No warning, scheduling adjusted so it just works
Event-triggered coroutines live in two stages: 'uncommitted' and 'ready'. First
they land in 'uncommitted', meaning they can't be resumed yet. Only after
coroutines from the 'ready' queue are resumed, the 'uncommitted' ones are moved
to the 'ready' queue, and can be resumed. This is to avoid self-triggering in
situations like waiting for an event immediately after triggering it.
However, there is an issue with `wait` statements. If you have a `wait(b)`, it's
being translated into a loop that awaits a change in `b` as long as `b` is
false. If `b` is false at first, the coroutine is put into the `uncommitted`
queue. If `b` is set to true before it's committed, the coroutine won't get
resumed.
This patch fixes that by immediately committing event controls created from
`wait` statements. That means the coroutine from the example above will get
resumed from now on.
In non-static contexts like class objects or stack frames, the use of
global trigger evaluation is not feasible. The concept of dynamic
triggers allows for trigger evaluation in such cases. These triggers are
simply local variables, and coroutines are themselves responsible for
evaluating them. They await the global dynamic trigger scheduler object,
which is responsible for resuming them during the trigger evaluation
step in the 'act' eval region. Once the trigger is set, they await the
dynamic trigger scheduler once again, and then get resumed during the
resumption step in the 'act' eval region.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
* Put suspended coroutine source location in a separate struct,
* Have `dump()` always print, wrap calls in `VL_DEBUG_IF`,
* Improve const correctness.
Adds timing support to Verilator. It makes it possible to use delays,
event controls within processes (not just at the start), wait
statements, and forks.
Building a design with those constructs requires a compiler that
supports C++20 coroutines (GCC 10, Clang 5).
The basic idea is to have processes and tasks with delays/event controls
implemented as C++20 coroutines. This allows us to suspend and resume
them at any time.
There are five main runtime classes responsible for managing suspended
coroutines:
* `VlCoroutineHandle`, a wrapper over C++20's `std::coroutine_handle`
with move semantics and automatic cleanup.
* `VlDelayScheduler`, for coroutines suspended by delays. It resumes
them at a proper simulation time.
* `VlTriggerScheduler`, for coroutines suspended by event controls. It
resumes them if its corresponding trigger was set.
* `VlForkSync`, used for syncing `fork..join` and `fork..join_any`
blocks.
* `VlCoroutine`, the return type of all verilated coroutines. It allows
for suspending a stack of coroutines (normally, C++ coroutines are
stackless).
There is a new visitor in `V3Timing.cpp` which:
* scales delays according to the timescale,
* simplifies intra-assignment timing controls and net delays into
regular timing controls and assignments,
* simplifies wait statements into loops with event controls,
* marks processes and tasks with timing controls in them as
suspendable,
* creates delay, trigger scheduler, and fork sync variables,
* transforms timing controls and fork joins into C++ awaits
There are new functions in `V3SchedTiming.cpp` (used by `V3Sched.cpp`)
that integrate static scheduling with timing. This involves providing
external domains for variables, so that the necessary combinational
logic gets triggered after coroutine resumption, as well as statements
that need to be injected into the design eval function to perform this
resumption at the correct time.
There is also a function that transforms forked processes into separate
functions.
See the comments in `verilated_timing.h`, `verilated_timing.cpp`,
`V3Timing.cpp`, and `V3SchedTiming.cpp`, as well as the internals
documentation for more details.
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>