20 KiB
Kahn's Algorithm BFS for OpenSTA: Functional Specification
1. Motivation
OpenSTA's BfsIterator::visitParallel() processes the timing graph level by level, inserting a thread barrier (dispatch_queue_->finishTasks()) between every level. When a level contains few vertices, most threads sit idle at the barrier while waiting for the rest of the level to complete. This is the dominant source of parallel inefficiency in timing analysis for real designs with uneven level populations.
Kahn's algorithm removes the per-level barrier by processing each vertex as soon as all its predecessors are complete, allowing vertices at different levels to execute concurrently.
2. Functional Specification
2.1 Toggle and Predicate Setup
Kahn's is controlled by two settings on each BfsIterator:
// 1. Provide the edge filter Kahn's uses for discovery.
// This is separate from search_pred_ used by the original BFS.
iterator->setKahnPred(search_adj);
// 2. Enable Kahn's for visitParallel.
iterator->setUseKahns(true);
Both are required. If kahn_pred_ is null when use_kahns_ is true, the iterator falls back to the original level-based BFS silently.
Default for both is off/null (original behavior). The toggle affects only visitParallel(); the sequential visit(), hasNext()/next(), and enqueue() APIs are unchanged.
2.2 Why Kahn's Needs Its Own Predicate
In the original BFS, edge filtering happens inside the visitor at call time:
visitor->visit(vertex)
└─ enqueueAdjacentVertices(vertex, adj_pred_) ← visitor provides the filter
The BFS iterator itself never decides which edges to follow -- the visitor does, one vertex at a time.
Kahn's algorithm cannot work this way. It must discover the entire active subgraph upfront (before any visitor runs) to compute in-degrees. This discovery needs an edge filter to know which edges to follow. The iterator's own search_pred_ is often null (the arrival iterator is constructed with nullptr because the visitor was always expected to provide the filter).
kahn_pred_ solves this by giving Kahn's its own dedicated filter, set once during construction, without changing how search_pred_ or the visitor works.
In practice, Search.cc wires it up at construction:
// Search constructor:
arrival_iter_->setKahnPred(search_adj_);
required_iter_->setKahnPred(search_adj_);
2.3 Behavioral Contract
When enabled, visitParallel(to_level, visitor) must:
- Visit exactly the same set of vertices as the original level-based BFS.
- Visit each vertex exactly once.
- Visit every vertex only after all its predecessors (in the BFS-direction DAG) have been visited and their results are visible.
- Call
visitor->visit(vertex)in a thread-safe manner (one thread per vertex, thread-local visitor copies). - Call
visitor->levelFinished()once after all vertices are processed. - Leave the BFS queue in a consistent state (processed levels cleared, remaining levels tracked).
- Respect the
to_levelbound -- vertices beyond it remain queued for future calls.
2.4 Scope
The implementation is integrated into the existing BfsIterator class hierarchy:
BfsFwdIterator-- forward arrival propagation (out-edges)BfsBkwdIterator-- backward required-time propagation (in-edges)
Both directions are supported through the polymorphic kahnForEachSuccessor() virtual method.
3. Why the Graph Is a DAG
Kahn's algorithm requires a directed acyclic graph. Within a single visitParallel() call, the active graph is guaranteed acyclic because:
| Cycle Source | How It's Broken | Where |
|---|---|---|
| Flip-flop feedback (Q -> ... -> D) | D inputs are timing endpoints; clk-to-Q starts new propagation. SearchAdj skips latchDtoQ and timing-check edges. |
Search.cc:127-131, Search.cc:178-186 |
| Latch D-to-Q | Explicitly excluded by SearchThru::searchThru() and SearchAdj::searchThru(). Convergence handled by multi-pass outer loop in Search::findAllArrivals(). |
Search.cc:130, Search.cc:1004-1012 |
| Combinational loops | Levelizer DFS detects back edges and marks them isDisabledLoop(). All BFS predicates skip disabled-loop edges. |
Levelize.cc:232-330, Levelize.cc:428-446 |
4. Algorithm
4.1 Overview
visitParallel(to_level, visitor):
if thread_count == 1 → sequential visit() (unchanged)
if !use_kahns_ || !kahn_pred_ → original level-based parallel BFS (unchanged)
else → Kahn's three-phase algorithm:
Phase 1+2: Discovery + In-Degree Counting (single-threaded)
Phase 3: Batch-Dispatch Parallel Traversal (multi-threaded)
4.2 Phase 1+2: Discovery + In-Degree Counting
Single-threaded BFS from seed vertices (those already in the level queue). For each vertex discovered:
- Assign in-degree = 0 for seeds, in-degree = count of active predecessors for discovered successors.
- Record vertex in the active set.
- Set
bfsInQueueflag to preventenqueue()from re-adding during Phase 3.
Data structures:
in_degree_init: flatstd::vector<int>indexed bygraph_->id(vertex). Value -1 = not active, >= 0 = in-degree count. Grows dynamically if vertex IDs exceed initial capacity (see Section 6.1).active_vertices: list of all discovered vertices for iteration.- Both are persistent across calls via
KahnStateto avoid re-allocation (see Section 5.4).
The discovery uses kahn_pred_ -- the same SearchAdj filter used by the arrival and required paths -- ensuring identical edge filtering to the original BFS.
4.3 Phase 3: Batch-Dispatch Parallel Traversal
ready_batch = {vertices with in_degree == 0}
while ready_batch is not empty:
next_ready = {}
if batch is small (< thread_count):
process single-threaded
else:
for each vertex in ready_batch:
dispatch_queue_->dispatch(lambda(tid):
visitor_copy[tid]->visit(vertex)
for each successor of vertex:
atomic decrement in_degree[successor]
if in_degree reached 0:
lock; next_ready.push_back(successor)
)
dispatch_queue_->finishTasks()
ready_batch.swap(next_ready)
Key properties:
- One task dispatched per vertex --
DispatchQueuehandles load balancing across its thread pool. finishTasks()uses condition_variable internally (no spin-wait).- Successor in-degree decrements use
std::memory_order_acq_relto ensure predecessor writes are visible. - Newly-ready vertices are collected into
next_readyunder a mutex, then swapped into the next batch.
4.4 Cleanup
After traversal:
- Processed levels are cleared from the
LevelQueue. first_level_/last_level_are recalculated viaresetLevelBounds()to track any remaining queued vertices (e.g., those beyondto_level).- Active vertex IDs are saved in
KahnState::prev_idsfor efficient reset on the next call.
5. Implementation Details
5.1 Files Modified
| File | Changes |
|---|---|
include/sta/Bfs.hh |
Added <functional>, <memory>. Added kahnForEachSuccessor pure virtual, resetLevelBounds, KahnState forward decl + unique_ptr member, use_kahns_ flag, kahn_pred_ pointer, and public accessors. |
search/Bfs.cc |
Added <atomic>. Defined KahnState struct. Rewrote visitParallel with three branches (sequential / level-based / Kahn's). Added kahnForEachSuccessor overrides for Fwd (out-edges) and Bkwd (in-edges). Added resetLevelBounds. |
search/Search.cc |
Added two lines in Search constructor to wire up kahn_pred_ on both arrival and required iterators: arrival_iter_->setKahnPred(search_adj_) and required_iter_->setKahnPred(search_adj_). |
5.2 Class Hierarchy
BfsIterator (base)
├─ kahnForEachSuccessor() = 0 [pure virtual]
├─ resetLevelBounds()
├─ KahnState (persistent arrays, pimpl)
├─ use_kahns_ flag
├─ kahn_pred_ (SearchPred* for Kahn's discovery, separate from search_pred_)
│
├─ BfsFwdIterator
│ └─ kahnForEachSuccessor: iterates out-edges with
│ searchFrom/searchThru/searchTo via kahn_pred_
│
└─ BfsBkwdIterator
└─ kahnForEachSuccessor: iterates in-edges with
searchTo/searchThru/searchFrom via kahn_pred_
5.3 KahnState (Persistent Storage)
struct BfsIterator::KahnState {
std::vector<int> in_degree_init; // flat array, indexed by VertexId
std::unique_ptr<std::atomic<int>[]> in_degree; // atomic copy for parallel phase
size_t in_degree_size = 0;
std::vector<VertexId> prev_ids; // IDs to reset on next call
};
- Allocated lazily on first Kahn's call.
in_degree_initgrows dynamically (never shrinks). Only touched entries are reset between calls viaprev_ids.in_degree(atomic array) is reallocated only when the max vertex ID grows.
5.4 Memory Ordering
| Operation | Ordering | Rationale |
|---|---|---|
in_degree_init writes (discovery) |
Non-atomic | Single-threaded phase; dispatch() provides happens-before. |
in_degree[].store() (setup) |
relaxed |
Single-threaded; dispatch() provides happens-before. |
in_degree[].fetch_sub() (worker) |
acq_rel |
Last predecessor's decrement synchronizes all prior arrival writes to the successor's reader thread. |
total_visited.fetch_add() |
relaxed |
Counter read only after finishTasks() barrier. |
5.5 Interaction with enqueue() During visit()
ArrivalVisitor::visit() calls enqueueAdjacentVertices() which calls enqueue(). During Kahn's:
- Active vertices have
bfsInQueueset during discovery ->enqueue()is a no-op (flag already set). - Vertices beyond
to_levelare not in the active set ->enqueue()adds them to theLevelQueuenormally for future passes.
5.6 kahnForEachSuccessor vs enqueueAdjacentVertices
These two methods have nearly identical edge-iteration logic (same predicates, same edge direction). They are intentionally kept as separate methods because:
enqueueAdjacentVerticesis called millions of times in the non-Kahn's path. Wrapping it instd::functionwould add overhead to all BFS operations.kahnForEachSuccessoraccepts astd::function<void(Vertex*)>callback, used in both discovery and the worker. Thestd::functionoverhead is negligible relative to per-vertex computation.
6. Pitfalls and Bugs Found
6.1 ObjectTable Block-Based Vertex IDs
Problem: graph_->vertexCount() returns the live object count, but graph_->id(vertex) returns (block_index << 7) + slot_index. After vertex deletions, live count drops but blocks persist. A vertex in block 2 can have ID 260 even when only 79 vertices are alive.
How we found it: The rmp.gcd_restructure.tcl OpenROAD test crashed with a segfault. The rmp module deletes cells during restructuring, creating gaps between live vertex count and max vertex ID. Our in-degree array was sized to vertexCount() + 1 and accessed out of bounds.
Solution: in_degree_init grows dynamically during discovery (resize(vid + 128, -1) when any ID exceeds current capacity). Worker lambdas include a bounds check (sid < in_deg_size). The atomic array is sized to max_id + 1 after discovery completes.
Note: The other developer's implementation has the same latent bug (graph_->vertexCount() + 1 sizing) but it hasn't manifested because their code only runs on the delay-calc path, which doesn't encounter the deletion pattern that rmp triggers.
6.2 Null Search Predicate on Arrival Iterator
Problem: The arrival BFS iterator is constructed with search_pred_ = nullptr:
arrival_iter_(new BfsFwdIterator(BfsIndex::arrival, nullptr, this))
This is intentional in the original BFS -- the visitor provides its own predicate (adj_pred_) at call time via enqueueAdjacentVertices(vertex, adj_pred_). The null search_pred_ is never dereferenced.
Kahn's discovery phase needs a predicate upfront (before any visitor runs) to know which edges to follow. Using search_pred_ directly caused a null pointer dereference.
How we found it: The rmp.gcd_restructure.tcl test crashed in kahnForEachSuccessor with pred->searchFrom(vertex) dereferencing null. Stack trace showed the call came from arrival propagation via Search::findArrivals1.
Solution: Introduced kahn_pred_ -- a separate predicate pointer dedicated to Kahn's discovery and successor decrement. Set via setKahnPred() and wired up in the Search constructor:
arrival_iter_->setKahnPred(search_adj_);
required_iter_->setKahnPred(search_adj_);
If kahn_pred_ is null when Kahn's is enabled, visitParallel falls back to the original level-based BFS. This ensures no crash even if a caller enables Kahn's without setting the predicate.
6.3 Memory Visibility Across Threads
Problem: When predecessor P finishes computing arrivals and successor S starts reading them, S must see P's writes.
Original BFS: finishTasks() between levels provides a full memory fence.
Kahn's: The fetch_sub(1, memory_order_acq_rel) on the in-degree counter creates the happens-before chain. When the last predecessor's decrement triggers S's readiness, all prior writes by all predecessors are visible to S's processing thread. The batch-dispatch model adds a finishTasks() barrier between batches as an additional fence.
6.4 "Arrivals Unchanged" Optimization
Original BFS: If ArrivalVisitor::visit() finds arrivals haven't changed, it skips enqueueAdjacentVertices -- fanout is not re-evaluated.
Kahn's: The discovery phase conservatively discovers ALL reachable vertices. Fanout in-degrees are decremented unconditionally after visit, regardless of whether arrivals changed. This means some vertices may be visited unnecessarily. They will find no change and produce no further effect.
Impact: Slightly more work for incremental updates where only a small subset changes. Correct for all cases.
6.5 Latch Multi-Pass Convergence
Latch D-to-Q edges are excluded from the BFS by search predicates. Latch convergence is handled by the outer multi-pass loop in Search::findAllArrivals() / Search::findFilteredArrivals(), which re-seeds latch Q outputs between visitParallel calls. Kahn's operates within a single visitParallel call and is orthogonal to this mechanism.
6.6 levelFinished() Callback
VertexVisitor::levelFinished() is a virtual hook called at level boundaries. No override exists in the codebase (base implementation is empty). With Kahn's, it is called once after all vertices are processed. If a future subclass relies on per-level callbacks, it would need adaptation.
7. Comparison with Other Developer's Approach
The other implementation (BfsFwdInDegreeIterator) in the alternate repository takes a different design approach. Key differences:
7.1 Architecture
| Aspect | Other Developer | Our Approach |
|---|---|---|
| Class design | New standalone BfsFwdInDegreeIterator : StaState |
Integrated into existing BfsIterator with toggle |
| Scope | Forward-only, delay calc (GraphDelayCalc) only |
Forward + backward, any visitParallel caller |
| Integration | Requires caller to use new class and call computeInDegrees() explicitly |
Drop-in: setKahnPred(pred) + setUseKahns(true) on existing iterator |
7.2 Discovery
| Aspect | Other Developer | Our Approach |
|---|---|---|
| Strategy | Iterate ALL vertices + ALL edges (full graph) | BFS from seeds (active subgraph only) |
| Cost | O(V_total + E_total) every call | O(V_active + E_active) per call |
| Incremental variant | Yes (computeInDegrees(invalid_delays) with reachability pass) |
Natural (seed-based discovery covers only dirty subgraph) |
| Loop breaking | to_vertex->level() >= vertex->level() (ad-hoc) |
SearchPred::searchThru() (matches Levelizer exactly) |
7.3 Parallelism
| Aspect | Other Developer | Our Approach |
|---|---|---|
| Dispatch model | Batch: dispatch all ready -> finishTasks() -> next batch |
Same (adopted from their approach -- see Section 7.6) |
| Ready queue | std::vector<Vertex*> with swap() |
Same |
| Newly-ready collection | ready_lock_ mutex on ready_ vector |
next_ready_lock mutex on next_ready vector |
7.4 Thread Safety
| Aspect | Other Developer | Our Approach |
|---|---|---|
| Active-set check | vertex->visited() (non-atomic bool) -- data race risk |
in_degree_init[id] >= 0 (read-only during parallel phase) -- safe |
| Edge dedup | std::set<Edge*> processed_edges_ with per-edge mutex lock -- serialization bottleneck |
Not needed (in-degrees computed correctly upfront) |
| Vertex marking | vertex->setVisited(true) from worker threads |
vertex->setBfsInQueue() (atomic field) |
7.5 Array Sizing
| Aspect | Other Developer | Our Approach |
|---|---|---|
| Size | graph_->vertexCount() + 1 (fixed) |
Dynamic growth during discovery + bounds checks |
| Risk | Same ObjectTable ID bug -- IDs can exceed vertexCount after deletions. Latent bug that hasn't manifested because their code path (delay calc) doesn't trigger the deletion pattern. | Fixed (see Section 6.1) |
7.6 What We Adopted from Their Approach
Our initial implementation used a custom KahnReadyQueue with spin-wait workers (std::this_thread::yield()). Comparing with their approach revealed that dispatching one task per vertex into DispatchQueue and using finishTasks() as a batch barrier is significantly more efficient:
DispatchQueueusescondition_variablefor blocking -- no wasted CPU on spin-wait.- Natural load balancing -- the thread pool picks up work items automatically.
- Simpler code -- no custom queue class needed.
This change cut our test suite overhead from 87s to 28s (vs 27s for original BFS).
7.7 Performance Comparison
| Approach | STA Regression (6109 tests) |
|---|---|
| Original level-based BFS | 27-30s |
| Our Kahn's v1 (hash map + spin-wait) | 87s |
| Our Kahn's v2 (dense array + spin-wait) | 42s |
| Our Kahn's v3 (dense array + batch dispatch) | 28s |
| Other developer (delay-calc only, separate class) | Reports ~45% speedup on large designs |
8. Test Plan and Results
8.1 OpenSTA Standalone Regression
cd tools/OpenROAD/src/sta/build
cmake .. && make -j$(nproc)
ctest -j$(nproc)
Pass criteria: All tests pass with use_kahns_ = true. Results must be bit-identical to use_kahns_ = false.
Result: PASS -- 6109/6109 tests pass with both settings.
8.2 OpenROAD Full Regression
cd tools/OpenROAD/build
cmake --build . -j$(nproc)
ctest -j$(nproc)
Pass criteria: All OpenROAD tests pass, including flows that modify the netlist between timing updates.
Key test cases exercised:
rmp.gcd_restructure.tcl-- restructure deletes cells, causing vertex ID gaps. This test originally crashed (Section 6.1, 6.2) and drove two bug fixes (dynamic array sizing andkahn_pred_separation).rsz.*-- resizer modifies netlist incrementally between timing updates.cts.*-- clock tree synthesis adds buffers and triggers re-timing.
Result: PASS -- all OpenROAD regressions pass after the two fixes.
8.3 Thread Count Sweep
set_thread_count 1 ;# Falls back to sequential visit()
set_thread_count 2
set_thread_count 4
set_thread_count 8
Pass criteria: Identical timing reports across all thread counts.
8.4 Toggle Consistency
# Run 1: use_kahns_ = false
report_checks -digits 6 > results_original.rpt
# Run 2: use_kahns_ = true
report_checks -digits 6 > results_kahns.rpt
# diff results_original.rpt results_kahns.rpt
Pass criteria: Reports are identical.
8.5 Performance Expectations
| Scenario | Expectation |
|---|---|
| Small designs (< 10K vertices) | Kahn's within 10% of original (discovery overhead amortized) |
| Large designs (> 100K vertices) with uneven levels | Kahn's faster due to barrier elimination |
| Incremental updates (small active set) | Kahn's overhead proportional to active set, not total graph |
| High thread counts (8-16 threads) | Kahn's scales better (no idle threads at level barriers) |
8.6 Stress Tests
- Design with a single long chain (worst case -- no parallelism, discovery overhead for zero benefit).
- Design with many parallel chains (best case -- maximum parallel utilization).
- Design with latches (multi-pass convergence must work correctly).
- Rapid incremental updates (persistent KahnState reuse exercised).
- Netlist modification flows: rmp, rsz, cts (exercises ObjectTable ID gaps and graph rebuilds).