Set OPT_FAST=-Os as default (#2374)
parent c5da38206e
commit 622f59ad65
3
Changes
@@ -35,6 +35,9 @@ The contributors that suggested a given feature are shown in []. Thanks!
 
 **** The run-time library is now compiled with -Os by default. (#2369, #2373)
 
+**** OPT_FAST is now -Os by default. See the BENCHMARKING & OPTIMIZATION part
+of the manual if you experience issues with compilation speed.
+
 
 * Verilator 4.034 2020-05-03
 

115
bin/verilator
@@ -2071,11 +2071,11 @@ distribution.
 
 =head1 BENCHMARKING & OPTIMIZATION
 
-For best performance, run Verilator with the "-O3 --x-assign fast
---x-initial fast --noassert" flags. The -O3 flag will require longer
-compile times, and "--x-assign fast --x-initial fast" may increase the risk
-of reset bugs in trade for performance; see the above documentation for
-these flags.
+For best performance, run Verilator with the "-O3 --x-assign fast --x-initial
+fast --noassert" flags. The -O3 flag will require longer time to run
+Verilator, and "--x-assign fast --x-initial fast" may increase the risk of
+reset bugs in trade for performance; see the above documentation for these
+flags.
 
 If using Verilated multithreaded, use C<numactl> to ensure you are using
 non-conflicting hardware resources. See L</"MULTITHREADING">.
 
@@ -2087,58 +2087,69 @@ simple change to a clock latch used to gate clocks and gained a 60%
 performance improvement.
 
 Beyond that, the performance of a Verilated model depends mostly on your
-C++ compiler and size of your CPU's caches.
+C++ compiler and size of your CPU's caches. Experience shows that large models
+are often limited by the size of the instruction cache, and as such reducing
+code size if possible can be beneficial.
 
-By default, the lib/verilated.mk file has optimization turned off. This is
-for the benefit of new users, as it improves compile times at the cost of
-simulation runtimes. To add optimization as the default, set one of three variables,
-OPT, OPT_FAST, or OPT_SLOW lib/verilated.mk. Or, use the -CFLAGS and/or
--LDFLAGS option on the verilator command line to pass the flags directly to
-the compiler or linker. Or, just for one run, pass them on the command
-line to make:
+The supplied $VERILATOR_ROOT/include/verilated.mk file uses the OPT, OPT_FAST,
+OPT_SLOW and OPT_GLOBAL variables to control optimization. You can set these
+when compiling the output of Verilator with Make, for example:
 
-make OPT_FAST="-Os -march=native -fno-stack-protector" -f Vour.mk Vour__ALL.a
+make OPT_FAST="-Os -march=native" -f Vour.mk Vour__ALL.a
 
-OPT_FAST specifies optimizations for those parts of the program that are on the
-fast path. This is mostly code that is executed every cycle. OPT_SLOW
-specifies optimizations for slow-path files, which execute only rarely, yet
-take a long time to compile with optimization on. OPT_SLOW is ignored if
-VM_PARALLEL_BUILDS is not 1, in which case all code is compiled with OPT_FAST.
-See also the C<--output-split> option. OPT specifies overall optimization and
-affects all compiles, including those OPT_FAST and OPT_SLOW control. For best
-results, use OPT="-Os -march=native", and link with "-static". Nearly the same
-results can be had with much better compile times with OPT_FAST="-O1
--fstrict-aliasing". Higher optimization such as "-O2" or "-O3" may help, but
-gcc compile times may be excessive under O3 on even medium sized designs.
-There is a third variable, OPT_GLOBAL, which applies to common code in the
-run-time library used by verilated models. This is set to "-Os" by default
-and there should rarely be a need to change it. As the run-time library is
-small in comparison to a lot of verilated models, disabling optimization on
-the run-time library should not have a serious effect on overall compilation
-time, but can have highly detrimental effect on run-time performance,
-especially with tracing. The OPT variable also applies to files that are
-controlled by OPT_GLOBAL.
+OPT_FAST specifies optimization flags for those parts of the model that are on
+the fast path. This is mostly code that is executed every cycle. OPT_SLOW
+applies to slow-path code, which executes rarely, often only once at the
+beginning or end of simulation. Note that OPT_SLOW is ignored if
+VM_PARALLEL_BUILDS is not 1, in which case all generated code will be compiled
+in a single compilation unit using OPT_FAST. See also the C<--output-split>
+option. The OPT_GLOBAL variable applies to common code in the run-time library
+used by verilated models (shipped in $VERILATOR_ROOT/include). Additional C++
+files passed on the verilator command line use OPT_FAST. The OPT variable
+applies to all compilation units in addition to the specific OPT_* variables
+described above.
 
-Unfortunately, using the optimizer with SystemC files can result in
-compiles taking several minutes. (The SystemC libraries have many little
-inlined functions that drive the compiler nuts.)
+You can also use the -CFLAGS and/or -LDFLAGS options on the verilator command
+line to pass flags directly to the compiler or linker.
 
-For best results, use the latest clang compiler (about 10% faster than
-GCC). Note the now fairly old GCC 3.2 and earlier have optimization bugs
-around pointer aliasing detection, which can result in 2x performance
-losses.
+The default values of the OPT_* variables are chosen to yield good simulation
+speed with reasonable C++ compilation times. To this end, OPT_FAST is set to
+"-Os" by default. Higher optimization such as "-O2" or "-O3" may help (though
+often they provide only a very small performance benefit), but compile times
+may be excessively large even with medium sized designs. Compilation times can
+be improved at the expense of simulation speed by reducing optimization, for
+example with OPT_FAST="-O0". Often good simulation speed can be achieved with
+OPT_FAST="-O1 -fstrict-aliasing" but with improved compilation times. Files
+controlled by OPT_SLOW have little effect on performance and therefore OPT_SLOW
+is empty by default (equivalent to "-O0") for improved compilation speed. In
+common use-cases there should be little benefit in changing OPT_SLOW.
+OPT_GLOBAL is set to "-Os" by default and there should rarely be a need to
+change it. As the run-time library is small in comparison to a lot of verilated
+models, disabling optimization on the run-time library should not have a
+serious effect on overall compilation time, but may have detrimental effect on
+simulation speed, especially with tracing. In addition to the above, for best
+results use OPT="-march=native", the latest Clang compiler (about 10% faster
+than GCC), and link statically.
 
-If you will be running many simulations on a single compile, investigate
-feedback driven compilation. With GCC, using -fprofile-arcs, then
+Generally the answer to which optimization level gives the best user experience
+depends on the use case and some experimentation can pay dividends. For a
+speedy debug cycle during development, especially on large designs where C++
+compilation speed can dominate, consider using lower optimization to get to an
+executable faster. For throughput oriented use cases, for example regressions,
+it is usually worth spending extra compilation time to reduce total CPU time.
+
+If you will be running many simulations on a single model, you can investigate
+profile guided optimization. With GCC, using -fprofile-arcs, then
 -fbranch-probabilities will yield another 15% or so.
 
 Modern compilers also support link-time optimization (LTO), which can help
-especially if you link in DPI code. To enable LTO on GCC, pass "-flto" in
-both compilation and link. Note LTO may cause excessive compile times on
-large designs.
+especially if you link in DPI code. To enable LTO on GCC, pass "-flto" in both
+compilation and link. Note LTO may cause excessive compile times on large
+designs.
 
-Using profile driven compiler optimization, with feedback from a real
-design, can yield up to30% improvements.
+Unfortunately, using the optimizer with SystemC files can result in compilation
+taking several minutes. (The SystemC libraries have many little inlined
+functions that drive the compiler nuts.)
 
 If you are using your own makefiles, you may want to compile the Verilated
 code with -DVL_INLINE_OPT=inline. This will inline functions, however this
@@ -5243,15 +5254,15 @@ test_regress/t/t_extend_class files show an example of how to do this.
 =item How do I get faster build times?
 
 When running make pass the make variable VM_PARALLEL_BUILDS=1 so that
-builds occur in parallel. Note this is now set by default if the output
-code size exceeds the value of --output-split.
+builds occur in parallel. Note this is now set by default if an output
+file was large enough to be split due to the --output-split option.
 
 Verilator emits any infrequently executed "cold" routines into separate
 __Slow.cpp files. This can accelerate compilation as optimization can be
-disabled on these routines. See the OPT_FAST and OPT_SLOW make variables.
+disabled on these routines. See the OPT_FAST and OPT_SLOW make variables
+and the BENCHMARKING & OPTIMIZATION section of the manual.
 
-Use a recent compiler. Newer compilers tend do be faster, with the
-now relatively old GCC 3.0 to 3.3 being horrible.
+Use a recent compiler. Newer compilers tend to be faster.
 
 Compile in parallel on many machines and use caching; see the web for the
 ccache, distcc and icecream packages. ccache will skip GCC runs between

@@ -84,18 +84,21 @@ CPPFLAGS += $(VM_USER_CFLAGS)
 LDFLAGS += $(VM_USER_LDFLAGS)
 LDLIBS += $(VM_USER_LDLIBS)
 
-# See the benchmarking section of bin/verilator.
-# Support class optimizations. This includes the tracing and symbol table.
-# SystemC takes minutes to optimize, thus it is off by default.
-#OPT_SLOW =
-# Fast path optimizations. Most time is spent in these classes.
-#OPT_FAST = -Os -fstrict-aliasing
-#OPT_FAST = -O
-#OPT_FAST =
+######################################################################
+# Optimization control.
+
+# See also the BENCHMARKING & OPTIMIZATION section of the manual.
+
+# Optimization flags for non performance-critical/rarely executed code.
+# No optimization by default, which improves compilation speed.
+OPT_SLOW =
+# Optimization for performance critical/hot code. Most time is spent in these
+# routines. Optimizing by default for improved execution speed.
+OPT_FAST = -Os
 # Optimization applied to the common run-time library used by verilated models.
 # For compatibility this is called OPT_GLOBAL even though it only applies to
 # files in the run-time library. Normally there should be no need for the user
-# to change this.
+# to change this as the library is small, but can have significant speed impact.
 OPT_GLOBAL = -Os
 
 #######################################################################

@@ -1111,7 +1111,7 @@ sub compile {
 "-DTEST_VERBOSE=\"".($self->{verbose} ? 1 : 0)."\"",
 "-DTEST_SYSTEMC=\"" .($self->sc ? 1 : 0). "\"",
 "-DCMAKE_PREFIX_PATH=\"".(($ENV{SYSTEMC_INCLUDE}||$ENV{SYSTEMC}||'')."/..\""),
-"-DTEST_OPT_FAST=\"" . ($param{benchmark} ? "-Os" : "") . "\"",
+"-DTEST_OPT_FAST=\"" . ($param{benchmark} ? "-Os" : "-O0") . "\"",
 "-DTEST_OPT_GLOBAL=\"" . ($param{benchmark} ? "-Os" : "-O0") . "\"",
 "-DTEST_VERILATION=\"" . $::Opt_Verilation . "\"",
 ]);
@@ -1130,7 +1130,7 @@ sub compile {
 "TEST_OBJ_DIR=$self->{obj_dir}",
 "CPPFLAGS_DRIVER=-D".uc($self->{name}),
 ($self->{verbose} ? "CPPFLAGS_DRIVER2=-DTEST_VERBOSE=1":""),
-($param{benchmark} ? "OPT_FAST=-Os" : ""),
+($param{benchmark} ? "" : "OPT_FAST=-O0"),
 ($param{benchmark} ? "" : "OPT_GLOBAL=-O0"),
 "$self->{VM_PREFIX}", # bypass default rule, as we don't need archive
 ($param{make_flags}||""),