Commit Graph

6219 Commits

Author SHA1 Message Date
Geza Lore 9ac64d0b92 Improve performance of MTask coarsening
Various optimizations to speed up MTasks coarsening (which is the long
pole in the multi-threaded scheduling of very large designs).

The biggest impact ones:
- Use efficient hand written Pairing Heaps for implementing priority
  queues and the scoreboard, instead of the old SortByValueMap. This
  helps us avoid having to sort a lot of merge candidates that we will
  never actually consider and helps a lot in performance.
- Remove unnecessary associative containers and store data structures
  (the heap nodes in particular) directly in the object they relate to.
  This eliminates a huge amount of lookups and helps a lot in
  performance.
- Distribute storage for SiblingMC instances into the LogicMTask
  instances, and combine with the sibling maps. This again eliminates
  hash table lookups and makes storage structures smaller.
- Remove some now bidirectional edge maps, keep only the forward map.

There are also some other smaller optimizations:
- Replaced more unnecessary dynamic_casts with static_casts
- Templated some functions/classes to reduce the number of static
  branches in loops.
- Improves sorting of edges for sibling candidate creation
- Various micro-optimizations here and there

This speeds up MTask coarsening by 3.8x on a large design, which
translates to a 2.5x speedup of the ordering pass in multi-threaded
mode. (Combined with the earlier optimizations, ordering is now 3x
faster.)

Due to the elimination of a lot of the auxiliary data structures, and
ensuring a minimal size for the necessary ones, memory consumption of
the MTask coarsening is also reduced (measured up to 4.4x reduction
though the accuracy of this is low).

The algorithm is identical except for minor alterations of the order
some candidates are added or removed, this can cause perturbation in the
output due to tied scores being broken based on IDs.
2022-08-20 21:18:50 +01:00
Wilson Snyder 7cc89b8b42 Merge branch 'master' into develop-v5 2022-08-20 14:19:45 -04:00
Wilson Snyder c6607724cb Fix clang warning. 2022-08-20 14:19:00 -04:00
Wilson Snyder ebb37b0156 Merge branch 'master' into develop-v5 2022-08-20 14:02:09 -04:00
Wilson Snyder 90dc04cf93 Add --future0 and --future1 options. 2022-08-20 14:01:13 -04:00
Krzysztof Bieganski 10cf492946
Add support for expressions in event controls (#3550)
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
2022-08-19 20:18:38 +02:00
Geza Lore 4d81eb021d Revert "Improve performance of MTask coarsening"
This reverts commit 83475008d9.
2022-08-19 18:03:45 +01:00
Geza Lore 83475008d9 Improve performance of MTask coarsening
Various optimizations to speed up MTasks coarsening (which is the long
pole in the multi-threaded scheduling of very large designs).

The biggest impact ones:
- Use efficient hand written Pairing Heaps for implementing priority
  queues and the scoreboard, instead of the old SortByValueMap. This
  helps us avoid having to sort a lot of merge candidates that we will
  never actually consider and helps a lot in performance.
- Remove unnecessary associative containers and store data structures
  (the heap nodes in particular) directly in the object they relate to.
  This eliminates a huge amount of lookups and helps a lot in
  performance.
- Distribute storage for SiblingMC instances into the LogicMTask
  instances, and combine with the sibling maps. This again eliminates
  hash table lookups and makes storage structures smaller.
- Remove some now bidirectional edge maps, keep only the forward map.

There are also some other smaller optimizations:
- Replaced more unnecessary dynamic_casts with static_casts
- Templated some functions/classes to reduce the number of static
  branches in loops.
- Improves sorting of edges for sibling candidate creation
- Various micro-optimizations here and there

This speeds up MTask coarsening by 3.8x on a large design, which
translates to a 2.5x speedup of the ordering pass in multi-threaded
mode. (Combined with the earlier optimizations, ordering is now 3x
faster.)

Due to the elimination of a lot of the auxiliary data structures, and
ensuring a minimal size for the necessary ones, memory consumption of
the MTask coarsening is also reduced (measured up to 4.4x reduction
though the accuracy of this is low).

The algorithm is identical except for minor alterations of the order
some candidates are added or removed, this can cause perturbation in the
output due to tied scores being broken based on IDs.
2022-08-19 16:59:20 +01:00
Geza Lore 03ac7ad730 Make PartPropagateCp specific to the MTask graph
While keeping the client code abstract in PartPropagateCp is nice for
testing, there is performance to be had removing the abstraction. As
this code dominates in scheduling large designs, we eliminate the
abstraction and re-work the testing to use the actual LogicMTask and
MTaskEdge graph types. No functional change intended.
2022-08-19 14:06:11 +01:00
Geza Lore cd50949a7e Reuse MTaskEdge instances in MT scheduling
Instead of deleting then re-allocating MTaskEdge instances when merging
two MTasks, just redirect the edged of the donor MTask to the recipient
MTask. This is both faster as it avoids an allocation and a deletion,
together with one update of the sibling maps, and also makes the
algorithm more stable due to MergeCandidate IDs being stable and
allocated up front for all MTaskEdges, before any SiblingMCs are
allocated.

Perturbations in output are expected as the IDs used to break ties
between merge candidates with equal costs are not updated when
redirecting an edge (on purpose). The relinking of only one end of the
graph edges also perturbs the order in which they are enumerated, which
does change candidate opportunities when the number of edges is larger
than PART_SIBLING_EDGE_LIMIT. Confirmed output is identical when
IDs are updated and edges are updated to appear in their original order.
2022-08-19 14:06:11 +01:00
Geza Lore f0040c7b9a Remove reliance on pointer comparison in MT scheduling
The critical path propagation used to rely on a pointer comparison to
break equal scoring critical path updates. Use the corresponding mtask
ids instead, which is deterministic across invocations.
2022-08-19 14:06:11 +01:00
Geza Lore f8a0389e73 Do not use stepCost when gathering sibling merge candidates
siblingPairFromRelatives gathers neighbours of a vertex, and sorts them.
It then takes the N best nodes, and creates sibling merge candidates
from them. We now use the unadjusted cost instead of the step cost of
the vertices when sorting. This is both faster as we need not do the
log-space rounding to compute stepCost, and will also make similar but
yet cheaper nodes appear closer to the front as we don't lose precision
in rounding, hence they are more likely to be entered as merge
candidates. Note that when creating the merge candidate, we still use
the stepCost, so it's purpose of reducing the propagation of critical
path updates is maintained in full. In summary, this should make both
Verilator and the generated model very slightly faster, at least in
theory, and I have observed minor improvement in places.
2022-08-19 14:06:11 +01:00
Geza Lore b436794773 Add specialized GraphStreamUnordered
GraphStreamUnordered used to be GraphStream<std::less<const
V3GraphVertex*>>, but a lot of performance improvements can be had by a
specialized implementation, so added a highly optimized one. This helps
a lot with --debug-partition.
2022-08-19 14:06:11 +01:00
Geza Lore 1404319b28 Merge branch 'master' into develop-v5 2022-08-19 13:39:44 +01:00
Geza Lore 90d22cbec6 Fix `AstNode::exists` return type 2022-08-19 13:22:06 +01:00
Krzysztof Bieganski 33e2acfe61
Fix `AstNode::forall` return type (#3559)
Signed-off-by: Krzysztof Bieganski <kbieganski@antmicro.com>
2022-08-19 12:33:17 +01:00
Ryszard Rozak db5fdfb0ee
Fix === with some tristate constants (#3551). 2022-08-18 07:03:05 -04:00
Krzysztof Bieganski 951cd73fe0
Handle MemberSel in V3EmitV.cpp (#3555) 2022-08-18 06:33:45 -04:00
Arkadiusz Kozdra 0eeb40b975
Fix converting subclasses to string (#3552) 2022-08-17 18:08:43 -04:00
Wilson Snyder 93272c13fd Tests: Confirm fixed (#181) 2022-08-15 22:17:36 -04:00
Wilson Snyder 43abaeb055 Tests: Confirm fixed (#485) 2022-08-15 22:17:17 -04:00
Wilson Snyder 18b9e661c9 Tests: Confirm fixed (#446) 2022-08-15 22:17:09 -04:00
Wilson Snyder f435d96241 Fix case statement comparing string literal (#3544). 2022-08-15 21:56:09 -04:00
github action d32e3f042f Apply 'make format' 2022-08-12 10:56:12 +00:00
Mostafa Gamal df5f95a5bd
Fix nested default assignment for struct pattern (#3511) (#3524) 2022-08-12 06:55:07 -04:00
Drew Ranck b0c475205b
Fix void-cast queue pop_front or pop_back (#3542) (#3364)
Fix compile error for queue method usage, if it is the
first statement in a block of code, and the return
value is not used. Example:

>  if (foo)
>    void'(bar.pop_front());
2022-08-12 06:51:25 -04:00
Wilson Snyder 1e2219347e Internals: Cleanup ifdef, move up not under compilver version ifdef 2022-08-11 17:41:43 -04:00
Wilson Snyder cbe1b8e266 Fix segfault exporting non-existant package (#3535). 2022-08-08 17:53:50 -04:00
Mariusz Glebocki 2b12fe5773
Internals: Construct V3Number with correct type instead of changing it manually. (#3529) 2022-08-08 08:17:02 -04:00
Geza Lore a4fd6d38fb Add operator != to VlWide
This is required by VlUnpacked::neq
2022-08-07 13:13:28 +01:00
Yutetsu TAKATSUKASA d20f22beb1
Fix tristate logic when reading inout port in a module #3399 (#3523)
* Tests: Add a test to reproduce #3399

* Fix #3399. When reading an inout port in a module, it should refer the
original inout port, not the generated MODTEMP.
2022-08-07 21:12:57 +09:00
Wilson Snyder f4fe10844b Tests: Fix t_flag_help.pl (#3532). 2022-08-07 04:57:59 -04:00
Mariusz Glebocki 122e89ffde
Fix V3Number::isMsbXZ(). (#3530) 2022-08-05 19:12:52 +01:00
Geza Lore c266739e9f Merge branch 'master' into develop-v5 2022-08-05 12:17:57 +01:00
Geza Lore 96a4b3e5a5 Update clang-format config and apply
- Regroup and sort #include directives (like we used to, but automatic)
- Set AlwaysBreakTemplateDeclarations to true
2022-08-05 12:00:24 +01:00
Geza Lore 7403226a97 Merge branch 'master' into develop-v5 2022-08-04 10:03:38 +01:00
Geza Lore fac8e76923 Rework SortByValueMap for better performance
Keep a single std::set of key/value pairs, and a single unordered_map
from key to iterators into the set. Also improve some of the accessing
mechanisms using modern C++. This speeds up multi-threaded ordering by
about 10%.
2022-08-03 21:17:02 +01:00
Geza Lore b864f5f5ba V3Partition: use static_cast with LogicMTaskVertex
dynamic_cast is not free, and the mtask graph contains only
LogicMTaskVertex vertices, use static_cast instead for some speedup.
2022-08-03 17:05:01 +01:00
Geza Lore f9f66d787e Fix integer overflow in V3Unroll (#3451) 2022-08-03 09:41:30 +01:00
Geza Lore bd211c87aa astgen: split 'visit' method declarations from definitions
Add definitions to V3Ast.cpp, and use static_cast.
This fixes a lot of clang-tidy noise.
2022-08-02 17:53:19 +01:00
Geza Lore 6c33e6e889 Tell clang-tidy .h files are C++ (not C) headers 2022-08-02 17:53:19 +01:00
Geza Lore 6fc25dae9e Fix clang-tidy warnings (#3522) 2022-08-02 15:58:48 +01:00
Kamil Rakoczy cfb6fd8b34
Reduce max RSS usage (#3483)
By constant folding nodes earlier in V3Expand, we can save some max RSS on large designs.
2022-08-02 13:36:14 +01:00
Geza Lore 39d1a62f9e Fix change detection on unpacked arrays
Expand array assignment when creating the trigger, as V3Expand might
mangle it otherwise.
2022-08-02 13:01:41 +01:00
Geza Lore ba66fa7200 Merge branch 'master' into develop-v5 2022-08-02 11:16:35 +01:00
Geza Lore cb60663d49 V3Gate: Defer substitutions until required as well
Similarly to the earlier patch that defers constant folding on optimized
logic, now we also defer the variable substitutions as well. This again
eliminates a lot of traversals, and yields another ~10x speedup of V3Gate
on a design where V3Gate used to dominate while producing identical
results.
2022-08-01 12:54:41 +01:00
Geza Lore 0d2bf23d82 V3Gate: Defer constant folding until required
Rather than constant folding each logic block after every substitution,
only constant fold updated blocks when re-analysed, or at the end. This
removes a lot of invocations of V3Const on large blocks that can be
optimized well, and should yield the same result.

This speeds up V3Gate by ~4x on a design where V3Gate dominates.
2022-07-31 20:42:04 +01:00
Geza Lore 682a60e325 Cleanup V3Gate, no functional change 2022-07-31 20:07:54 +01:00
Geza Lore 2ab6272cc7 Use AstNode::foreach in V3Gate
This yields a little speedup.
2022-07-31 20:05:25 +01:00
Geza Lore 152a6cd886 Improve AstNode::foreach (also exists and forall)
Speed improvements:
- Use a direct, recursion-free implementation
- Improve pre-fetching

Functionality:
- Support remove/replace of currently iterated node
2022-07-31 19:07:32 +01:00