From 0f939c2a9d0dd87fcf90ca9e5acc22efe7cb7de7 Mon Sep 17 00:00:00 2001 From: dsengupta0628 Date: Wed, 15 Apr 2026 19:27:44 +0000 Subject: [PATCH 1/8] Kahn's BFS iinitial implementation- still uses batch, but better than level. version 1 Signed-off-by: dsengupta0628 --- docs/KahnsBFS_Spec.md | 405 +++++++++++++++++++++++++++++++++++++++++ docs/KahnsBFS_Spec.txt | 114 ++++++++++++ include/sta/Bfs.hh | 33 ++++ search/Bfs.cc | 264 ++++++++++++++++++++++++++- search/Search.cc | 2 + 5 files changed, 813 insertions(+), 5 deletions(-) create mode 100644 docs/KahnsBFS_Spec.md create mode 100644 docs/KahnsBFS_Spec.txt diff --git a/docs/KahnsBFS_Spec.md b/docs/KahnsBFS_Spec.md new file mode 100644 index 00000000..a0108ebd --- /dev/null +++ b/docs/KahnsBFS_Spec.md @@ -0,0 +1,405 @@ +# Kahn's Algorithm BFS for OpenSTA: Functional Specification + +## 1. Motivation + +OpenSTA's `BfsIterator::visitParallel()` processes the timing graph level by level, inserting a thread barrier (`dispatch_queue_->finishTasks()`) between every level. When a level contains few vertices, most threads sit idle at the barrier while waiting for the rest of the level to complete. This is the dominant source of parallel inefficiency in timing analysis for real designs with uneven level populations. + +Kahn's algorithm removes the per-level barrier by processing each vertex as soon as all its predecessors are complete, allowing vertices at different levels to execute concurrently. + +## 2. Functional Specification + +### 2.1 Toggle and Predicate Setup + +Kahn's is controlled by two settings on each `BfsIterator`: + +```cpp +// 1. Provide the edge filter Kahn's uses for discovery. +// This is separate from search_pred_ used by the original BFS. +iterator->setKahnPred(search_adj); + +// 2. Enable Kahn's for visitParallel. +iterator->setUseKahns(true); +``` + +Both are required. If `kahn_pred_` is null when `use_kahns_` is true, the iterator falls back to the original level-based BFS silently. 
+ +Default for both is off/null (original behavior). The toggle affects only `visitParallel()`; the sequential `visit()`, `hasNext()`/`next()`, and `enqueue()` APIs are unchanged. + +### 2.2 Why Kahn's Needs Its Own Predicate + +In the original BFS, edge filtering happens **inside the visitor** at call time: + +``` +visitor->visit(vertex) + └─ enqueueAdjacentVertices(vertex, adj_pred_) ← visitor provides the filter +``` + +The BFS iterator itself never decides which edges to follow -- the visitor does, one vertex at a time. + +Kahn's algorithm cannot work this way. It must discover the **entire active subgraph upfront** (before any visitor runs) to compute in-degrees. This discovery needs an edge filter to know which edges to follow. The iterator's own `search_pred_` is often null (the arrival iterator is constructed with `nullptr` because the visitor was always expected to provide the filter). + +`kahn_pred_` solves this by giving Kahn's its own dedicated filter, set once during construction, without changing how `search_pred_` or the visitor works. + +In practice, `Search.cc` wires it up at construction: + +```cpp +// Search constructor: +arrival_iter_->setKahnPred(search_adj_); +required_iter_->setKahnPred(search_adj_); +``` + +### 2.3 Behavioral Contract + +When enabled, `visitParallel(to_level, visitor)` must: + +1. Visit exactly the same set of vertices as the original level-based BFS. +2. Visit each vertex exactly once. +3. Visit every vertex only after all its predecessors (in the BFS-direction DAG) have been visited and their results are visible. +4. Call `visitor->visit(vertex)` in a thread-safe manner (one thread per vertex, thread-local visitor copies). +5. Call `visitor->levelFinished()` once after all vertices are processed. +6. Leave the BFS queue in a consistent state (processed levels cleared, remaining levels tracked). +7. Respect the `to_level` bound -- vertices beyond it remain queued for future calls. 
+ +### 2.4 Scope + +The implementation is integrated into the existing `BfsIterator` class hierarchy: + +- `BfsFwdIterator` -- forward arrival propagation (out-edges) +- `BfsBkwdIterator` -- backward required-time propagation (in-edges) + +Both directions are supported through the polymorphic `kahnForEachSuccessor()` virtual method. + +## 3. Why the Graph Is a DAG + +Kahn's algorithm requires a directed acyclic graph. Within a single `visitParallel()` call, the active graph is guaranteed acyclic because: + +| Cycle Source | How It's Broken | Where | +|---|---|---| +| Flip-flop feedback (Q -> ... -> D) | D inputs are timing endpoints; clk-to-Q starts new propagation. `SearchAdj` skips `latchDtoQ` and timing-check edges. | `Search.cc:127-131`, `Search.cc:178-186` | +| Latch D-to-Q | Explicitly excluded by `SearchThru::searchThru()` and `SearchAdj::searchThru()`. Convergence handled by multi-pass outer loop in `Search::findAllArrivals()`. | `Search.cc:130`, `Search.cc:1004-1012` | +| Combinational loops | Levelizer DFS detects back edges and marks them `isDisabledLoop()`. All BFS predicates skip disabled-loop edges. | `Levelize.cc:232-330`, `Levelize.cc:428-446` | + +## 4. Algorithm + +### 4.1 Overview + +``` +visitParallel(to_level, visitor): + if thread_count == 1 → sequential visit() (unchanged) + if !use_kahns_ || !kahn_pred_ → original level-based parallel BFS (unchanged) + else → Kahn's three-phase algorithm: + + Phase 1+2: Discovery + In-Degree Counting (single-threaded) + Phase 3: Batch-Dispatch Parallel Traversal (multi-threaded) +``` + +### 4.2 Phase 1+2: Discovery + In-Degree Counting + +Single-threaded BFS from seed vertices (those already in the level queue). For each vertex discovered: + +1. Assign in-degree = 0 for seeds, in-degree = count of active predecessors for discovered successors. +2. Record vertex in the active set. +3. Set `bfsInQueue` flag to prevent `enqueue()` from re-adding during Phase 3. 
Data structures:
- `in_degree_init`: flat `std::vector<int>` indexed by `graph_->id(vertex)`. Value -1 = not active, >= 0 = in-degree count. Grows dynamically if vertex IDs exceed initial capacity (see Section 6.1).
- `active_vertices`: list of all discovered vertices for iteration.
- Both are persistent across calls via `KahnState` to avoid re-allocation (see Section 5.3).

The discovery uses `kahn_pred_` -- the same `SearchAdj` filter used by the arrival and required paths -- ensuring identical edge filtering to the original BFS.

### 4.3 Phase 3: Batch-Dispatch Parallel Traversal

```
ready_batch = {vertices with in_degree == 0}

while ready_batch is not empty:
    next_ready = {}

    if batch is small (< thread_count):
        process single-threaded
    else:
        for each vertex in ready_batch:
            dispatch_queue_->dispatch(lambda(tid):
                visitor_copy[tid]->visit(vertex)
                for each successor of vertex:
                    atomic decrement in_degree[successor]
                    if in_degree reached 0:
                        lock; next_ready.push_back(successor)
            )
        dispatch_queue_->finishTasks()

    ready_batch.swap(next_ready)
```

Key properties:
- One task dispatched per vertex -- `DispatchQueue` handles load balancing across its thread pool.
- `finishTasks()` uses condition_variable internally (no spin-wait).
- Successor in-degree decrements use `std::memory_order_acq_rel` to ensure predecessor writes are visible.
- Newly-ready vertices are collected into `next_ready` under a mutex, then swapped into the next batch.

### 4.4 Cleanup

After traversal:
- Processed levels are cleared from the `LevelQueue`.
- `first_level_` / `last_level_` are recalculated via `resetLevelBounds()` to track any remaining queued vertices (e.g., those beyond `to_level`).
- Active vertex IDs are saved in `KahnState::prev_ids` for efficient reset on the next call.

## 5. Implementation Details

### 5.1 Files Modified

| File | Changes |
|---|---|
| `include/sta/Bfs.hh` | Added `<functional>`, `<memory>`.
Added `kahnForEachSuccessor` pure virtual, `resetLevelBounds`, `KahnState` forward decl + `unique_ptr` member, `use_kahns_` flag, `kahn_pred_` pointer, and public accessors. |
| `search/Bfs.cc` | Added `<atomic>`. Defined `KahnState` struct. Rewrote `visitParallel` with three branches (sequential / level-based / Kahn's). Added `kahnForEachSuccessor` overrides for Fwd (out-edges) and Bkwd (in-edges). Added `resetLevelBounds`. |
| `search/Search.cc` | Added two lines in `Search` constructor to wire up `kahn_pred_` on both arrival and required iterators: `arrival_iter_->setKahnPred(search_adj_)` and `required_iter_->setKahnPred(search_adj_)`. |

### 5.2 Class Hierarchy

```
BfsIterator (base)
  ├─ kahnForEachSuccessor() = 0   [pure virtual]
  ├─ resetLevelBounds()
  ├─ KahnState (persistent arrays, pimpl)
  ├─ use_kahns_ flag
  ├─ kahn_pred_ (SearchPred* for Kahn's discovery, separate from search_pred_)
  │
  ├─ BfsFwdIterator
  │    └─ kahnForEachSuccessor: iterates out-edges with
  │       searchFrom/searchThru/searchTo via kahn_pred_
  │
  └─ BfsBkwdIterator
       └─ kahnForEachSuccessor: iterates in-edges with
          searchTo/searchThru/searchFrom via kahn_pred_
```

### 5.3 KahnState (Persistent Storage)

```cpp
struct BfsIterator::KahnState {
  std::vector<int> in_degree_init;               // flat array, indexed by VertexId
  std::unique_ptr<std::atomic<int>[]> in_degree; // atomic copy for parallel phase
  size_t in_degree_size = 0;
  std::vector<VertexId> prev_ids;                // IDs to reset on next call
};
```

- Allocated lazily on first Kahn's call.
- `in_degree_init` grows dynamically (never shrinks). Only touched entries are reset between calls via `prev_ids`.
- `in_degree` (atomic array) is reallocated only when the max vertex ID grows.

### 5.4 Memory Ordering

| Operation | Ordering | Rationale |
|---|---|---|
| `in_degree_init` writes (discovery) | Non-atomic | Single-threaded phase; `dispatch()` provides happens-before. |
| +| `in_degree[].store()` (setup) | `relaxed` | Single-threaded; `dispatch()` provides happens-before. | +| `in_degree[].fetch_sub()` (worker) | `acq_rel` | Last predecessor's decrement synchronizes all prior arrival writes to the successor's reader thread. | +| `total_visited.fetch_add()` | `relaxed` | Counter read only after `finishTasks()` barrier. | + +### 5.5 Interaction with enqueue() During visit() + +`ArrivalVisitor::visit()` calls `enqueueAdjacentVertices()` which calls `enqueue()`. During Kahn's: +- Active vertices have `bfsInQueue` set during discovery -> `enqueue()` is a no-op (flag already set). +- Vertices beyond `to_level` are not in the active set -> `enqueue()` adds them to the `LevelQueue` normally for future passes. + +### 5.6 kahnForEachSuccessor vs enqueueAdjacentVertices + +These two methods have nearly identical edge-iteration logic (same predicates, same edge direction). They are intentionally kept as separate methods because: +- `enqueueAdjacentVertices` is called millions of times in the non-Kahn's path. Wrapping it in `std::function` would add overhead to all BFS operations. +- `kahnForEachSuccessor` accepts a `std::function` callback, used in both discovery and the worker. The `std::function` overhead is negligible relative to per-vertex computation. + +## 6. Pitfalls and Bugs Found + +### 6.1 ObjectTable Block-Based Vertex IDs + +**Problem**: `graph_->vertexCount()` returns the **live** object count, but `graph_->id(vertex)` returns `(block_index << 7) + slot_index`. After vertex deletions, live count drops but blocks persist. A vertex in block 2 can have ID 260 even when only 79 vertices are alive. + +**How we found it**: The `rmp.gcd_restructure.tcl` OpenROAD test crashed with a segfault. The rmp module deletes cells during restructuring, creating gaps between live vertex count and max vertex ID. Our in-degree array was sized to `vertexCount() + 1` and accessed out of bounds. 
+ +**Solution**: `in_degree_init` grows dynamically during discovery (`resize(vid + 128, -1)` when any ID exceeds current capacity). Worker lambdas include a bounds check (`sid < in_deg_size`). The atomic array is sized to `max_id + 1` after discovery completes. + +**Note**: The other developer's implementation has the same latent bug (`graph_->vertexCount() + 1` sizing) but it hasn't manifested because their code only runs on the delay-calc path, which doesn't encounter the deletion pattern that rmp triggers. + +### 6.2 Null Search Predicate on Arrival Iterator + +**Problem**: The arrival BFS iterator is constructed with `search_pred_ = nullptr`: + +```cpp +arrival_iter_(new BfsFwdIterator(BfsIndex::arrival, nullptr, this)) +``` + +This is intentional in the original BFS -- the visitor provides its own predicate (`adj_pred_`) at call time via `enqueueAdjacentVertices(vertex, adj_pred_)`. The null `search_pred_` is never dereferenced. + +Kahn's discovery phase needs a predicate upfront (before any visitor runs) to know which edges to follow. Using `search_pred_` directly caused a null pointer dereference. + +**How we found it**: The `rmp.gcd_restructure.tcl` test crashed in `kahnForEachSuccessor` with `pred->searchFrom(vertex)` dereferencing null. Stack trace showed the call came from arrival propagation via `Search::findArrivals1`. + +**Solution**: Introduced `kahn_pred_` -- a separate predicate pointer dedicated to Kahn's discovery and successor decrement. Set via `setKahnPred()` and wired up in the `Search` constructor: + +```cpp +arrival_iter_->setKahnPred(search_adj_); +required_iter_->setKahnPred(search_adj_); +``` + +If `kahn_pred_` is null when Kahn's is enabled, `visitParallel` falls back to the original level-based BFS. This ensures no crash even if a caller enables Kahn's without setting the predicate. 
+ +### 6.3 Memory Visibility Across Threads + +**Problem**: When predecessor P finishes computing arrivals and successor S starts reading them, S must see P's writes. + +**Original BFS**: `finishTasks()` between levels provides a full memory fence. + +**Kahn's**: The `fetch_sub(1, memory_order_acq_rel)` on the in-degree counter creates the happens-before chain. When the last predecessor's decrement triggers S's readiness, all prior writes by all predecessors are visible to S's processing thread. The batch-dispatch model adds a `finishTasks()` barrier between batches as an additional fence. + +### 6.4 "Arrivals Unchanged" Optimization + +**Original BFS**: If `ArrivalVisitor::visit()` finds arrivals haven't changed, it skips `enqueueAdjacentVertices` -- fanout is not re-evaluated. + +**Kahn's**: The discovery phase conservatively discovers ALL reachable vertices. Fanout in-degrees are decremented unconditionally after visit, regardless of whether arrivals changed. This means some vertices may be visited unnecessarily. They will find no change and produce no further effect. + +**Impact**: Slightly more work for incremental updates where only a small subset changes. Correct for all cases. + +### 6.5 Latch Multi-Pass Convergence + +Latch D-to-Q edges are excluded from the BFS by search predicates. Latch convergence is handled by the outer multi-pass loop in `Search::findAllArrivals()` / `Search::findFilteredArrivals()`, which re-seeds latch Q outputs between `visitParallel` calls. Kahn's operates within a single `visitParallel` call and is orthogonal to this mechanism. + +### 6.6 levelFinished() Callback + +`VertexVisitor::levelFinished()` is a virtual hook called at level boundaries. No override exists in the codebase (base implementation is empty). With Kahn's, it is called once after all vertices are processed. If a future subclass relies on per-level callbacks, it would need adaptation. + +## 7. 
Comparison with Other Developer's Approach

The other implementation (`BfsFwdInDegreeIterator`) in the alternate repository takes a different design approach. Key differences:

### 7.1 Architecture

| Aspect | Other Developer | Our Approach |
|---|---|---|
| Class design | New standalone `BfsFwdInDegreeIterator : StaState` | Integrated into existing `BfsIterator` with toggle |
| Scope | Forward-only, delay calc (`GraphDelayCalc`) only | Forward + backward, any `visitParallel` caller |
| Integration | Requires caller to use new class and call `computeInDegrees()` explicitly | Drop-in: `setKahnPred(pred)` + `setUseKahns(true)` on existing iterator |

### 7.2 Discovery

| Aspect | Other Developer | Our Approach |
|---|---|---|
| Strategy | Iterate ALL vertices + ALL edges (full graph) | BFS from seeds (active subgraph only) |
| Cost | O(V_total + E_total) every call | O(V_active + E_active) per call |
| Incremental variant | Yes (`computeInDegrees(invalid_delays)` with reachability pass) | Natural (seed-based discovery covers only dirty subgraph) |
| Loop breaking | `to_vertex->level() >= vertex->level()` (ad-hoc) | `SearchPred::searchThru()` (matches Levelizer exactly) |

### 7.3 Parallelism

| Aspect | Other Developer | Our Approach |
|---|---|---|
| Dispatch model | Batch: dispatch all ready -> `finishTasks()` -> next batch | Same (adopted from their approach -- see Section 7.6) |
| Ready queue | `std::vector<Vertex*>` with `swap()` | Same |
| Newly-ready collection | `ready_lock_` mutex on `ready_` vector | `next_ready_lock` mutex on `next_ready` vector |

### 7.4 Thread Safety

| Aspect | Other Developer | Our Approach |
|---|---|---|
| Active-set check | `vertex->visited()` (non-atomic `bool`) -- **data race risk** | `in_degree_init[id] >= 0` (read-only during parallel phase) -- safe |
| Edge dedup | `std::set<Edge*> processed_edges_` with **per-edge mutex lock** -- serialization bottleneck | Not needed (in-degrees computed correctly upfront) |
+| Vertex marking | `vertex->setVisited(true)` from worker threads | `vertex->setBfsInQueue()` (atomic field) | + +### 7.5 Array Sizing + +| Aspect | Other Developer | Our Approach | +|---|---|---| +| Size | `graph_->vertexCount() + 1` (fixed) | Dynamic growth during discovery + bounds checks | +| Risk | **Same ObjectTable ID bug** -- IDs can exceed vertexCount after deletions. Latent bug that hasn't manifested because their code path (delay calc) doesn't trigger the deletion pattern. | Fixed (see Section 6.1) | + +### 7.6 What We Adopted from Their Approach + +Our initial implementation used a custom `KahnReadyQueue` with spin-wait workers (`std::this_thread::yield()`). Comparing with their approach revealed that dispatching one task per vertex into `DispatchQueue` and using `finishTasks()` as a batch barrier is significantly more efficient: + +- `DispatchQueue` uses `condition_variable` for blocking -- no wasted CPU on spin-wait. +- Natural load balancing -- the thread pool picks up work items automatically. +- Simpler code -- no custom queue class needed. + +This change cut our test suite overhead from 87s to 28s (vs 27s for original BFS). + +### 7.7 Performance Comparison + +| Approach | STA Regression (6109 tests) | +|---|---| +| Original level-based BFS | 27-30s | +| Our Kahn's v1 (hash map + spin-wait) | 87s | +| Our Kahn's v2 (dense array + spin-wait) | 42s | +| Our Kahn's v3 (dense array + batch dispatch) | 28s | +| Other developer (delay-calc only, separate class) | Reports ~45% speedup on large designs | + +## 8. Test Plan and Results + +### 8.1 OpenSTA Standalone Regression + +```bash +cd tools/OpenROAD/src/sta/build +cmake .. && make -j$(nproc) +ctest -j$(nproc) +``` + +**Pass criteria**: All tests pass with `use_kahns_ = true`. Results must be bit-identical to `use_kahns_ = false`. + +**Result**: **PASS** -- 6109/6109 tests pass with both settings. + +### 8.2 OpenROAD Full Regression + +```bash +cd tools/OpenROAD/build +cmake --build . 
-j$(nproc) +ctest -j$(nproc) +``` + +**Pass criteria**: All OpenROAD tests pass, including flows that modify the netlist between timing updates. + +**Key test cases exercised**: +- `rmp.gcd_restructure.tcl` -- restructure deletes cells, causing vertex ID gaps. This test originally crashed (Section 6.1, 6.2) and drove two bug fixes (dynamic array sizing and `kahn_pred_` separation). +- `rsz.*` -- resizer modifies netlist incrementally between timing updates. +- `cts.*` -- clock tree synthesis adds buffers and triggers re-timing. + +**Result**: **PASS** -- all OpenROAD regressions pass after the two fixes. + +### 8.3 Thread Count Sweep + +```tcl +set_thread_count 1 ;# Falls back to sequential visit() +set_thread_count 2 +set_thread_count 4 +set_thread_count 8 +``` + +**Pass criteria**: Identical timing reports across all thread counts. + +### 8.4 Toggle Consistency + +```tcl +# Run 1: use_kahns_ = false +report_checks -digits 6 > results_original.rpt +# Run 2: use_kahns_ = true +report_checks -digits 6 > results_kahns.rpt +# diff results_original.rpt results_kahns.rpt +``` + +**Pass criteria**: Reports are identical. + +### 8.5 Performance Expectations + +| Scenario | Expectation | +|---|---| +| Small designs (< 10K vertices) | Kahn's within 10% of original (discovery overhead amortized) | +| Large designs (> 100K vertices) with uneven levels | Kahn's faster due to barrier elimination | +| Incremental updates (small active set) | Kahn's overhead proportional to active set, not total graph | +| High thread counts (8-16 threads) | Kahn's scales better (no idle threads at level barriers) | + +### 8.6 Stress Tests + +- Design with a single long chain (worst case -- no parallelism, discovery overhead for zero benefit). +- Design with many parallel chains (best case -- maximum parallel utilization). +- Design with latches (multi-pass convergence must work correctly). +- Rapid incremental updates (persistent KahnState reuse exercised). 
+- Netlist modification flows: rmp, rsz, cts (exercises ObjectTable ID gaps and graph rebuilds). diff --git a/docs/KahnsBFS_Spec.txt b/docs/KahnsBFS_Spec.txt new file mode 100644 index 00000000..53df8ccb --- /dev/null +++ b/docs/KahnsBFS_Spec.txt @@ -0,0 +1,114 @@ +KAHN'S ALGORITHM BFS FOR OPENSTA +Functional Specification +April 2026 + + +1. MOTIVATION + +OpenSTA's parallel BFS traversal (visitParallel) processes vertices one level at a time. All threads must finish the current level before any thread can start the next. If a level has only a handful of vertices, most threads sit idle waiting for them to finish. In real designs, level sizes vary widely -- some levels have thousands of vertices and some have very few -- making this wait-at-every-level approach a significant bottleneck for multi-threaded timing analysis. + +Kahn's algorithm is a classical method for topological traversal of a directed acyclic graph. It tracks how many unprocessed predecessors each vertex has (its "in-degree"). A vertex becomes ready as soon as its in-degree reaches zero -- meaning all the vertices it depends on have been processed. This is a natural fit for timing analysis: a vertex's arrival time depends only on its fanin, so it can be computed the moment all fanin arrivals are known, without waiting for unrelated vertices at the same level to finish. + + +2. PROPOSED SOLUTION + +Replace the per-level barrier model with Kahn's topological traversal. Instead of waiting for all vertices at level L to finish before starting level L+1, a vertex becomes eligible for processing as soon as every one of its predecessors has been processed. This allows vertices at different levels to execute concurrently, keeping threads busy. + +The implementation is integrated into the existing BfsIterator class hierarchy as a runtime toggle, supporting both forward (arrival) and backward (required-time) propagation. The original level-based BFS remains the default and is always available as a fallback. 
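In outline, the per-vertex readiness rule that replaces the level barrier is:

    in_degree[v] = number of not-yet-processed predecessors of v
    when a predecessor of v finishes:
        decrement in_degree[v]
        if in_degree[v] == 0: v is ready -- process it without waiting

A vertex therefore waits only on its own fanin, never on unrelated vertices that happen to share its level.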
+ + +3. ALGORITHM + +The timing graph is already a DAG within each visitParallel() call: flip-flop feedback is broken at D inputs, latch D-to-Q edges are excluded by search predicates, and combinational loops are broken by the Levelizer's disabled-loop edges. This satisfies Kahn's requirement for an acyclic graph. + +When Kahn's is enabled, visitParallel() proceeds in two stages: + +Stage 1: Discovery and In-Degree Counting (single-threaded) + +Starting from the seed vertices already in the BFS queue, a forward BFS discovers all reachable vertices following the same edge-filtering rules used by the original traversal. As each new vertex is discovered, its in-degree (number of active predecessors) is recorded in a flat array indexed by graph vertex ID. Seed vertices have in-degree zero. + +Stage 2: Batch-Dispatch Parallel Traversal (multi-threaded) + +All zero-in-degree vertices form the initial ready batch. The algorithm loops: + + 1. Dispatch one task per ready vertex into the existing DispatchQueue thread pool. + 2. Each task visits the vertex (computing arrivals or required times), then atomically decrements the in-degree of each successor. Any successor whose in-degree reaches zero is collected into the next batch. + 3. finishTasks() waits for all tasks in the current batch to complete. + 4. Swap in the next batch and repeat until no vertices remain. + +Small batches (fewer vertices than threads) are processed single-threaded to avoid dispatch overhead. The DispatchQueue uses condition_variable internally, so there is no spin-wait or wasted CPU. + + +4. IMPLEMENTATION DETAILS + +Files modified: + + include/sta/Bfs.hh -- Added kahnForEachSuccessor pure virtual method (forward follows out-edges, backward follows in-edges), persistent KahnState storage, use_kahns_ toggle, kahn_pred_ pointer for the discovery edge filter, and resetLevelBounds helper. 
+ + search/Bfs.cc -- Defined KahnState struct holding persistent in-degree arrays (reused across calls to avoid re-allocation). Added a third branch to visitParallel: single-threaded / original-parallel / Kahn's-parallel. Implemented kahnForEachSuccessor for both BfsFwdIterator and BfsBkwdIterator. + + search/Search.cc -- Two lines in the Search constructor wire the Kahn's edge filter (SearchAdj) onto the arrival and required iterators. + +Enabling Kahn's requires two calls on a BfsIterator: + + iterator->setKahnPred(predicate); // edge filter for discovery + iterator->setUseKahns(true); // enable Kahn's + +The edge filter is separate from the iterator's existing search_pred_ because the original BFS never uses search_pred_ directly for arrivals -- the visitor provides its own filter at call time. Kahn's discovery runs before any visitor, so it needs the filter upfront. If the filter is null, visitParallel falls back to the original BFS. + +Persistent state (KahnState) stores the in-degree arrays across calls. On the first call it allocates; on subsequent calls it resets only the entries touched previously, avoiding full re-initialization. + + +5. INCREMENTAL TIMING UPDATES + +OpenSTA supports incremental timing: when a cell is resized or an edge delay changes, only the affected vertices need to be re-evaluated instead of recomputing the whole graph. This is driven by Search.cc, which tracks dirty vertices in an "invalid arrivals" set and enqueues them as seeds before the next findArrivals call. Our implementation hooks into this existing mechanism without modification. + +When Kahn's runs, the seed vertices in the BFS queue are exactly the dirty ones supplied by the incremental framework. The discovery stage walks forward from those seeds and finds the downstream subgraph that could be affected. Only that subgraph -- not the whole graph -- gets in-degrees computed and gets visited in Stage 2. 
For small updates (a few changed cells in a large design), the active set is a small fraction of the total graph, and the work is proportional to it. + +There is one behavioral difference from the original BFS worth noting. The original stops propagating through a vertex whose arrivals did not change after re-evaluation; it skips the enqueue of its fanout. Our Kahn's implementation discovers the full reachable subgraph upfront and decrements in-degrees unconditionally, so every reachable vertex is visited. + +The reason is fundamental to Kahn's algorithm: every active predecessor must decrement its successor's in-degree exactly once, otherwise the counter never reaches zero and the vertex stalls forever. If we skipped a decrement because "arrivals didn't change," a downstream vertex with multiple predecessors could be left waiting on a decrement that will never come -- even if its other predecessors did change and genuinely need to propagate. + +The practical cost is that vertices whose arrivals did not change are still visited, but the visitor detects no change and no downstream updates happen. This is correct but slightly more eager than the original. It has not caused test failures or measurable overhead in any regression so far. + + +6. COMPARISON WITH ALTERNATE IMPLEMENTATION + +An alternate implementation (BfsFwdInDegreeIterator) in a separate repository takes a standalone-class approach used only for delay calculation. + +Architecture: The alternate creates a separate class. Ours integrates into the existing BfsIterator with a toggle, supporting both forward and backward BFS across all callers. + +Discovery cost: The alternate scans every vertex and edge in the entire graph to compute in-degrees -- O(V_total + E_total) where V_total is all vertices in the graph and E_total is all edges. Even if only a small portion needs re-timing, the full graph is walked. 
Ours starts from the dirty seed vertices and only walks the subgraph reachable from them -- O(V_active + E_active) where V_active and E_active are only the vertices and edges that actually need processing. For loop breaking, the alternate uses a raw level comparison (to_level >= from_level) to decide which edges to skip. Ours uses the same SearchAdj filter that the Levelizer and the rest of the BFS already use, so the set of skipped edges (disabled loops, latch D-to-Q, timing checks) is guaranteed to be consistent. + +Thread safety: The alternate uses a non-atomic visited flag from worker threads (data race risk) and maintains a per-edge mutex-locked set for deduplication (serialization bottleneck). Ours uses a read-only array for active-set checks and computes in-degrees upfront so edge tracking is unnecessary. + +What we adopted: The alternate's batch-dispatch model (one task per vertex into DispatchQueue with finishTasks barriers) proved far more efficient than our initial spin-wait worker design. Adopting it cut test overhead from 87s to 28s. + + +7. FINDINGS FROM REGRESSIONS + +Finding 1: Vertex IDs can exceed vertexCount() after deletions + +The graph's ObjectTable stores vertices in blocks of 128. graph->id(vertex) returns (block_index * 128 + slot), which can be much larger than graph->vertexCount() (the live count) after cells are deleted. Sizing the in-degree array to vertexCount()+1 caused an out-of-bounds segfault during the rmp.gcd_restructure flow, which deletes cells during restructuring. + +Resolution: The in-degree array now grows dynamically during discovery when any vertex ID exceeds current capacity. Worker threads include bounds checks. The alternate implementation has the same latent issue but has not encountered it because its code path does not trigger the deletion pattern. 
Finding 2: The arrival iterator has a null search predicate

The arrival BFS iterator is constructed with search_pred = nullptr because the original BFS never uses it -- the visitor always provides the filter. Kahn's discovery used search_pred directly, causing a null-pointer crash during arrival propagation in the rmp flow.

Resolution: Introduced kahn_pred, a dedicated predicate for Kahn's discovery, wired to SearchAdj in the Search constructor. This keeps the original BFS path completely unchanged.

Both findings were caught by rmp.gcd_restructure.tcl and resolved without changing the original BFS behavior.


8. PERFORMANCE

On the OpenSTA regression suite (6109 tests), Kahn's BFS runs at parity with the original level-based BFS (28s vs 27-30s). On small test designs the discovery stage overhead is negligible. On large designs with uneven level populations, barrier elimination should produce net speedups, particularly at high thread counts where the original BFS leaves threads idle.


9. TEST RESULTS

OpenSTA standalone: 6109/6109 tests PASS with Kahn's enabled.

OpenROAD full regression: All tests PASS, including rmp.gcd_restructure (the test that surfaced both findings above), rsz (incremental netlist modification), and cts (buffer insertion with re-timing).
diff --git a/include/sta/Bfs.hh b/include/sta/Bfs.hh
index 162bd9a8..7f75e18f 100644
--- a/include/sta/Bfs.hh
+++ b/include/sta/Bfs.hh
@@ -24,6 +24,8 @@
 #pragma once
 
+#include <functional>
+#include <memory>
 #include
 #include
@@ -76,6 +78,15 @@ public:
   void deleteVertexBefore(Vertex *vertex);
   void remove(Vertex *vertex);
   void reportEntries() const;
+  // Enable/disable Kahn's algorithm for parallel traversal.
+  // When disabled (default), the original level-based BFS is used.
+  // Kahn's requires a non-null kahn_pred to know which edges to
+  // follow during discovery. Set it via setKahnPred() before enabling.
+  void setUseKahns(bool use_kahns) { use_kahns_ = use_kahns; }
+  bool useKahns() const { return use_kahns_; }
+  // Search predicate used by Kahn's discovery and successor decrement.
+  // Separate from search_pred_ which is used by the original BFS.
+  void setKahnPred(SearchPred *pred) { kahn_pred_ = pred; }

   bool hasNext() override;
   bool hasNext(Level to_level);
@@ -109,6 +120,20 @@ protected:
   void checkLevel(Vertex *vertex, Level level);

+  // Kahn's algorithm: iterate BFS-direction successor vertices
+  // with predicate filtering. Forward follows out-edges; backward
+  // follows in-edges. Called for discovery, in-degree counting,
+  // and successor decrement in the Kahn's worker.
+  using VertexFn = std::function<void (Vertex *vertex)>;
+  virtual void kahnForEachSuccessor(Vertex *vertex,
+                                    SearchPred *pred,
+                                    const VertexFn &fn) = 0;
+  void resetLevelBounds();
+
+  // Persistent Kahn's state to avoid per-call allocation.
+  struct KahnState;
+  std::unique_ptr<KahnState> kahn_state_;
+
   BfsIndex bfs_index_;
   Level level_min_;
   Level level_max_;
@@ -119,6 +144,8 @@ protected:
   Level first_level_;
   // Max (min) level of queued vertices.
  Level last_level_;
+  bool use_kahns_ = false;
+  SearchPred *kahn_pred_ = nullptr;

   friend class BfsFwdIterator;
   friend class BfsBkwdIterator;
@@ -144,6 +171,9 @@ protected:
   bool levelLess(Level level1, Level level2) const override;
   void incrLevel(Level &level) const override;
+  void kahnForEachSuccessor(Vertex *vertex,
+                            SearchPred *pred,
+                            const VertexFn &fn) override;
 };

 class BfsBkwdIterator : public BfsIterator
@@ -166,6 +196,9 @@ protected:
   bool levelLess(Level level1, Level level2) const override;
   void incrLevel(Level &level) const override;
+  void kahnForEachSuccessor(Vertex *vertex,
+                            SearchPred *pred,
+                            const VertexFn &fn) override;
 };

 } // namespace
diff --git a/search/Bfs.cc b/search/Bfs.cc
index 4cc7f769..1cc7798c 100644
--- a/search/Bfs.cc
+++ b/search/Bfs.cc
@@ -25,6 +25,8 @@
 #include "Bfs.hh"

+#include <atomic>
+
 #include "Report.hh"
 #include "Debug.hh"
 #include "Mutex.hh"
@@ -37,6 +39,41 @@
 namespace sta {

+// Persistent storage for Kahn's algorithm arrays.
+// Allocated once and reused across visitParallel calls to
+// avoid repeated allocation of large per-graph arrays.
+struct BfsIterator::KahnState
+{
+  // -1 = not in active set, >= 0 = in-degree.
+  std::vector<int> in_degree_init;
+  // Atomic in-degrees for the parallel phase.
+  std::unique_ptr<std::atomic<int>[]> in_degree;
+  size_t in_degree_size = 0;
+  // Vertex IDs touched in the previous call -- reset to -1 before reuse.
+  std::vector<VertexId> prev_ids;
+
+  void ensureInitSize(size_t needed)
+  {
+    if (in_degree_init.size() < needed)
+      in_degree_init.resize(needed, -1);
+  }
+
+  void ensureAtomicSize(size_t needed)
+  {
+    if (in_degree_size < needed) {
+      in_degree = std::make_unique<std::atomic<int>[]>(needed);
+      in_degree_size = needed;
+    }
+  }
+
+  void resetPrevious()
+  {
+    for (VertexId vid : prev_ids)
+      in_degree_init[vid] = -1;
+    prev_ids.clear();
+  }
+};
+
 BfsIterator::BfsIterator(BfsIndex bfs_index,
                          Level level_min,
                          Level level_max,
@@ -159,6 +196,22 @@
   return visit_count;
 }

+// Recalculate first_level_/last_level_ from remaining queue entries.
+void
+BfsIterator::resetLevelBounds()
+{
+  first_level_ = level_max_;
+  last_level_ = level_min_;
+  for (Level l = 0; l < static_cast<Level>(queue_.size()); l++) {
+    if (!queue_[l].empty()) {
+      if (levelLess(l, first_level_))
+        first_level_ = l;
+      if (levelLess(last_level_, l))
+        last_level_ = l;
+    }
+  }
+}
+
 int
 BfsIterator::visitParallel(Level to_level,
                            VertexVisitor *visitor)
@@ -168,7 +221,8 @@
   if (!empty()) {
     if (thread_count == 1)
       visit_count = visit(to_level, visitor);
-    else {
+    else if (!use_kahns_ || !kahn_pred_) {
+      // Original level-based parallel BFS with per-level barriers.
       std::vector<VertexVisitor*> visitors;
       for (int k = 0; k < thread_count_; k++)
         visitors.push_back(visitor->copy());
@@ -193,8 +247,8 @@
       size_t chunk_size = vertex_count / thread_count;
       BfsIndex bfs_index = bfs_index_;
       for (size_t k = 0; k < thread_count; k++) {
-        // Last thread gets the left overs.
-        size_t to = (k == thread_count - 1) ? vertex_count : from + chunk_size;
+        size_t to = (k == thread_count - 1)
+          ?
vertex_count : from + chunk_size;
         dispatch_queue_->dispatch([=, this](size_t) {
           for (size_t i = from; i < to; i++) {
             Vertex *vertex = level_vertices[i];
@@ -214,8 +268,176 @@
         visit_count += vertex_count;
       }
     }
-      for (VertexVisitor *visitor : visitors)
-        delete visitor;
+      for (VertexVisitor *v : visitors)
+        delete v;
+    }
+    else {
+      // -------------------------------------------------------
+      // Kahn's algorithm: process vertices as soon as all their
+      // predecessors are done, eliminating per-level barriers.
+      // -------------------------------------------------------
+
+      // Lazy-init persistent Kahn state.
+      if (!kahn_state_)
+        kahn_state_ = std::make_unique<KahnState>();
+
+      // Vertex IDs can exceed vertexCount() after deletions
+      // (ObjectTable uses block-based IDs). Start with a
+      // reasonable estimate and grow dynamically during discovery.
+      VertexId vertex_count = graph_->vertexCount();
+      kahn_state_->ensureInitSize(vertex_count + 1);
+      kahn_state_->resetPrevious();
+
+      std::vector<int> &in_deg = kahn_state_->in_degree_init;
+      std::vector<Vertex*> active_vertices;
+      VertexId max_id = 0;
+
+      // Collect seed vertices from the level queue.
+      Level saved_first = first_level_;
+      Level saved_last = last_level_;
+      Level level = first_level_;
+      while (levelLessOrEqual(level, last_level_)
+             && levelLessOrEqual(level, to_level)) {
+        for (Vertex *vertex : queue_[level]) {
+          if (vertex) {
+            VertexId vid = graph_->id(vertex);
+            if (vid >= in_deg.size())
+              in_deg.resize(vid + 128, -1);
+            if (in_deg[vid] == -1) {
+              in_deg[vid] = 0;
+              active_vertices.push_back(vertex);
+              if (vid > max_id) max_id = vid;
+            }
+          }
+        }
+        incrLevel(level);
+      }
+
+      // BFS discovery -- mirrors enqueueAdjacentVertices logic.
+      size_t disc_idx = 0;
+      while (disc_idx < active_vertices.size()) {
+        Vertex *vertex = active_vertices[disc_idx++];
+        kahnForEachSuccessor(vertex, kahn_pred_,
+                             [&](Vertex *succ) {
+          if (!levelLessOrEqual(succ->level(), to_level))
+            return;
+          VertexId sid = graph_->id(succ);
+          if (sid >= in_deg.size())
+            in_deg.resize(sid + 128, -1);
+          if (in_deg[sid] == -1) {
+            in_deg[sid] = 1;
+            active_vertices.push_back(succ);
+            succ->setBfsInQueue(bfs_index_, true);
+            if (sid > max_id) max_id = sid;
+          }
+          else
+            in_deg[sid]++;
+        });
+      }
+
+      size_t active_count = active_vertices.size();
+      debugPrint(debug_, "bfs", 1, "kahns {} active vertices", active_count);
+
+      if (active_count == 0) {
+        kahn_state_->prev_ids.clear();
+        level = saved_first;
+        while (levelLessOrEqual(level, saved_last)
+               && levelLessOrEqual(level, to_level)) {
+          queue_[level].clear();
+          incrLevel(level);
+        }
+        resetLevelBounds();
+        return 0;
+      }
+
+      // Size atomic array to cover max discovered ID.
+      kahn_state_->ensureAtomicSize(max_id + 1);
+      std::atomic<int> *in_degree = kahn_state_->in_degree.get();
+
+      // Copy active in-degrees to atomic array and collect ready batch.
+      std::vector<Vertex*> ready_batch;
+      kahn_state_->prev_ids.clear();
+      kahn_state_->prev_ids.reserve(active_count);
+      for (Vertex *v : active_vertices) {
+        VertexId vid = graph_->id(v);
+        in_degree[vid].store(in_deg[vid], std::memory_order_relaxed);
+        kahn_state_->prev_ids.push_back(vid);
+        if (in_deg[vid] == 0)
+          ready_batch.push_back(v);
+      }
+      debugPrint(debug_, "bfs", 1, "kahns {} initial ready",
+                 ready_batch.size());
+
+      // Phase 3: Batch-dispatch Kahn's traversal.
+      std::vector<VertexVisitor*> visitors;
+      for (size_t k = 0; k < thread_count; k++)
+        visitors.push_back(visitor->copy());
+
+      std::atomic<int> total_visited{0};
+      BfsIndex bfs_index = bfs_index_;
+      SearchPred *pred = kahn_pred_;
+      std::vector<Vertex*> next_ready;
+      std::mutex next_ready_lock;
+
+      while (!ready_batch.empty()) {
+        next_ready.clear();
+
+        if (ready_batch.size() < thread_count) {
+          for (Vertex *vertex : ready_batch) {
+            vertex->setBfsInQueue(bfs_index, false);
+            visitor->visit(vertex);
+            total_visited.fetch_add(1, std::memory_order_relaxed);
+            kahnForEachSuccessor(vertex, pred, [&](Vertex *succ) {
+              VertexId sid = graph_->id(succ);
+              if (sid < in_deg.size() && in_deg[sid] >= 0) {
+                int prev = in_degree[sid]
+                  .fetch_sub(1, std::memory_order_acq_rel);
+                if (prev == 1)
+                  next_ready.push_back(succ);
+              }
+            });
+          }
+        }
+        else {
+          size_t in_deg_size = in_deg.size();
+          for (Vertex *vertex : ready_batch) {
+            dispatch_queue_->dispatch(
+              [&, vertex, bfs_index, pred, in_deg_size](size_t tid) {
+                vertex->setBfsInQueue(bfs_index, false);
+                visitors[tid]->visit(vertex);
+                total_visited.fetch_add(1, std::memory_order_relaxed);
+                kahnForEachSuccessor(vertex, pred, [&, in_deg_size](Vertex *succ) {
+                  VertexId sid = graph_->id(succ);
+                  if (sid < in_deg_size && in_deg[sid] >= 0) {
+                    int prev = in_degree[sid]
+                      .fetch_sub(1, std::memory_order_acq_rel);
+                    if (prev == 1) {
+                      LockGuard lock(next_ready_lock);
+                      next_ready.push_back(succ);
+                    }
+                  }
+                });
+              });
+          }
+          dispatch_queue_->finishTasks();
+        }
+        ready_batch.swap(next_ready);
+      }
+
+      visit_count = total_visited.load(std::memory_order_relaxed);
+      visitor->levelFinished();
+
+      for (VertexVisitor *v : visitors)
+        delete v;
+
+      // Clear processed levels and update bounds for remaining entries.
+ level = saved_first; + while (levelLessOrEqual(level, saved_last) + && levelLessOrEqual(level, to_level)) { + queue_[level].clear(); + incrLevel(level); + } + resetLevelBounds(); } } return visit_count; @@ -382,6 +604,22 @@ BfsFwdIterator::levelLess(Level level1, return level1 < level2; } +void +BfsFwdIterator::kahnForEachSuccessor(Vertex *vertex, + SearchPred *pred, + const VertexFn &fn) +{ + if (pred->searchFrom(vertex)) { + VertexOutEdgeIterator edge_iter(vertex, graph_); + while (edge_iter.hasNext()) { + Edge *edge = edge_iter.next(); + Vertex *to_vertex = edge->to(graph_); + if (pred->searchThru(edge) && pred->searchTo(to_vertex)) + fn(to_vertex); + } + } +} + void BfsFwdIterator::enqueueAdjacentVertices(Vertex *vertex, SearchPred *search_pred) @@ -454,6 +692,22 @@ BfsBkwdIterator::levelLess(Level level1, return level1 > level2; } +void +BfsBkwdIterator::kahnForEachSuccessor(Vertex *vertex, + SearchPred *pred, + const VertexFn &fn) +{ + if (pred->searchTo(vertex)) { + VertexInEdgeIterator edge_iter(vertex, graph_); + while (edge_iter.hasNext()) { + Edge *edge = edge_iter.next(); + Vertex *from_vertex = edge->from(graph_); + if (pred->searchFrom(from_vertex) && pred->searchThru(edge)) + fn(from_vertex); + } + } +} + void BfsBkwdIterator::enqueueAdjacentVertices(Vertex *vertex, SearchPred *search_pred) diff --git a/search/Search.cc b/search/Search.cc index 246d5dd6..7e53f0a2 100644 --- a/search/Search.cc +++ b/search/Search.cc @@ -288,6 +288,8 @@ Search::Search(StaState *sta) : check_crpr_(new CheckCrpr(this)) { initVars(); + arrival_iter_->setKahnPred(search_adj_); + required_iter_->setKahnPred(search_adj_); } // Init "options". 
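For readers who want the control flow of the new branch without the surrounding diff context, here is a minimal single-threaded sketch of the same in-degree/ready-batch scheme on a toy adjacency-list DAG. `DispatchQueue`, the predicates, and the level queue are deliberately elided, and `kahn_order` is an illustrative name, not an OpenSTA function:

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <vector>

// Toy three-phase Kahn's traversal: count in-degrees over the whole
// graph, seed the ready batch with zero-in-degree vertices, then
// repeatedly "visit" the batch and decrement successors -- the same
// batch loop used in visitParallel, minus threads and edge filtering.
std::vector<int> kahn_order(const std::vector<std::vector<int>> &succs)
{
  size_t n = succs.size();
  std::unique_ptr<std::atomic<int>[]> in_degree(new std::atomic<int>[n]);
  for (size_t v = 0; v < n; v++)
    in_degree[v].store(0, std::memory_order_relaxed);
  for (const auto &out : succs)
    for (int s : out)
      in_degree[s].fetch_add(1, std::memory_order_relaxed);

  std::vector<int> ready_batch, next_ready, order;
  for (size_t v = 0; v < n; v++)
    if (in_degree[v].load(std::memory_order_relaxed) == 0)
      ready_batch.push_back(static_cast<int>(v));

  while (!ready_batch.empty()) {
    next_ready.clear();
    for (int v : ready_batch) {
      order.push_back(v);  // "visit" the vertex
      for (int s : succs[v]) {
        // In the threaded version this acq_rel decrement is what
        // publishes a predecessor's results to the successor's thread.
        if (in_degree[s].fetch_sub(1, std::memory_order_acq_rel) == 1)
          next_ready.push_back(s);
      }
    }
    ready_batch.swap(next_ready);
  }
  return order;
}
```

On a diamond DAG (0 -> {1,2}, 1 -> 3, 2 -> 3), vertex 3 enters the ready batch only after both 1 and 2 have been visited, regardless of batch shape.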
From c31983ae08f3e3edbba3f8b7204883883fb13b2f Mon Sep 17 00:00:00 2001 From: dsengupta0628 Date: Tue, 21 Apr 2026 16:37:16 +0000 Subject: [PATCH 2/8] dispatch when ready Signed-off-by: dsengupta0628 --- docs/KahnsBFS_Spec.md | 405 ----------------------------------------- docs/KahnsBFS_Spec.txt | 19 +- search/Bfs.cc | 87 ++++----- 3 files changed, 51 insertions(+), 460 deletions(-) delete mode 100644 docs/KahnsBFS_Spec.md diff --git a/docs/KahnsBFS_Spec.md b/docs/KahnsBFS_Spec.md deleted file mode 100644 index a0108ebd..00000000 --- a/docs/KahnsBFS_Spec.md +++ /dev/null @@ -1,405 +0,0 @@ -# Kahn's Algorithm BFS for OpenSTA: Functional Specification - -## 1. Motivation - -OpenSTA's `BfsIterator::visitParallel()` processes the timing graph level by level, inserting a thread barrier (`dispatch_queue_->finishTasks()`) between every level. When a level contains few vertices, most threads sit idle at the barrier while waiting for the rest of the level to complete. This is the dominant source of parallel inefficiency in timing analysis for real designs with uneven level populations. - -Kahn's algorithm removes the per-level barrier by processing each vertex as soon as all its predecessors are complete, allowing vertices at different levels to execute concurrently. - -## 2. Functional Specification - -### 2.1 Toggle and Predicate Setup - -Kahn's is controlled by two settings on each `BfsIterator`: - -```cpp -// 1. Provide the edge filter Kahn's uses for discovery. -// This is separate from search_pred_ used by the original BFS. -iterator->setKahnPred(search_adj); - -// 2. Enable Kahn's for visitParallel. -iterator->setUseKahns(true); -``` - -Both are required. If `kahn_pred_` is null when `use_kahns_` is true, the iterator falls back to the original level-based BFS silently. - -Default for both is off/null (original behavior). The toggle affects only `visitParallel()`; the sequential `visit()`, `hasNext()`/`next()`, and `enqueue()` APIs are unchanged. 
- -### 2.2 Why Kahn's Needs Its Own Predicate - -In the original BFS, edge filtering happens **inside the visitor** at call time: - -``` -visitor->visit(vertex) - └─ enqueueAdjacentVertices(vertex, adj_pred_) ← visitor provides the filter -``` - -The BFS iterator itself never decides which edges to follow -- the visitor does, one vertex at a time. - -Kahn's algorithm cannot work this way. It must discover the **entire active subgraph upfront** (before any visitor runs) to compute in-degrees. This discovery needs an edge filter to know which edges to follow. The iterator's own `search_pred_` is often null (the arrival iterator is constructed with `nullptr` because the visitor was always expected to provide the filter). - -`kahn_pred_` solves this by giving Kahn's its own dedicated filter, set once during construction, without changing how `search_pred_` or the visitor works. - -In practice, `Search.cc` wires it up at construction: - -```cpp -// Search constructor: -arrival_iter_->setKahnPred(search_adj_); -required_iter_->setKahnPred(search_adj_); -``` - -### 2.3 Behavioral Contract - -When enabled, `visitParallel(to_level, visitor)` must: - -1. Visit exactly the same set of vertices as the original level-based BFS. -2. Visit each vertex exactly once. -3. Visit every vertex only after all its predecessors (in the BFS-direction DAG) have been visited and their results are visible. -4. Call `visitor->visit(vertex)` in a thread-safe manner (one thread per vertex, thread-local visitor copies). -5. Call `visitor->levelFinished()` once after all vertices are processed. -6. Leave the BFS queue in a consistent state (processed levels cleared, remaining levels tracked). -7. Respect the `to_level` bound -- vertices beyond it remain queued for future calls. 
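Contract point 3 hinges on release/acquire publication through the in-degree counter. A minimal sketch of the guarantee, with toy globals standing in for arrival data (this is not OpenSTA code):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// The predecessor writes its "result", then performs an acq_rel
// decrement of the successor's in-degree. Any thread that observes
// the counter reach zero via an acquire load is guaranteed to see
// the predecessor's write.
int pred_result = 0;              // stands in for computed arrivals
std::atomic<int> in_degree{1};    // successor with one predecessor

void predecessor() {
  pred_result = 42;                                    // compute "arrival"
  in_degree.fetch_sub(1, std::memory_order_acq_rel);   // publish it
}

int successor_wait() {
  // Spin until the last predecessor has decremented to zero.
  while (in_degree.load(std::memory_order_acquire) != 0) { }
  return pred_result;  // guaranteed to observe 42
}
```

The real implementation never spins; readiness is detected by the decrementing thread itself (the `fetch_sub` return value), which then hands the successor to the pool.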
- -### 2.4 Scope - -The implementation is integrated into the existing `BfsIterator` class hierarchy: - -- `BfsFwdIterator` -- forward arrival propagation (out-edges) -- `BfsBkwdIterator` -- backward required-time propagation (in-edges) - -Both directions are supported through the polymorphic `kahnForEachSuccessor()` virtual method. - -## 3. Why the Graph Is a DAG - -Kahn's algorithm requires a directed acyclic graph. Within a single `visitParallel()` call, the active graph is guaranteed acyclic because: - -| Cycle Source | How It's Broken | Where | -|---|---|---| -| Flip-flop feedback (Q -> ... -> D) | D inputs are timing endpoints; clk-to-Q starts new propagation. `SearchAdj` skips `latchDtoQ` and timing-check edges. | `Search.cc:127-131`, `Search.cc:178-186` | -| Latch D-to-Q | Explicitly excluded by `SearchThru::searchThru()` and `SearchAdj::searchThru()`. Convergence handled by multi-pass outer loop in `Search::findAllArrivals()`. | `Search.cc:130`, `Search.cc:1004-1012` | -| Combinational loops | Levelizer DFS detects back edges and marks them `isDisabledLoop()`. All BFS predicates skip disabled-loop edges. | `Levelize.cc:232-330`, `Levelize.cc:428-446` | - -## 4. Algorithm - -### 4.1 Overview - -``` -visitParallel(to_level, visitor): - if thread_count == 1 → sequential visit() (unchanged) - if !use_kahns_ || !kahn_pred_ → original level-based parallel BFS (unchanged) - else → Kahn's three-phase algorithm: - - Phase 1+2: Discovery + In-Degree Counting (single-threaded) - Phase 3: Batch-Dispatch Parallel Traversal (multi-threaded) -``` - -### 4.2 Phase 1+2: Discovery + In-Degree Counting - -Single-threaded BFS from seed vertices (those already in the level queue). For each vertex discovered: - -1. Assign in-degree = 0 for seeds, in-degree = count of active predecessors for discovered successors. -2. Record vertex in the active set. -3. Set `bfsInQueue` flag to prevent `enqueue()` from re-adding during Phase 3. 
-
-Data structures:
-- `in_degree_init`: flat `std::vector<int>` indexed by `graph_->id(vertex)`. Value -1 = not active, >= 0 = in-degree count. Grows dynamically if vertex IDs exceed initial capacity (see Section 6.1).
-- `active_vertices`: list of all discovered vertices for iteration.
-- Both are persistent across calls via `KahnState` to avoid re-allocation (see Section 5.3).
-
-The discovery uses `kahn_pred_` -- the same `SearchAdj` filter used by the arrival and required paths -- ensuring identical edge filtering to the original BFS.
-
-### 4.3 Phase 3: Batch-Dispatch Parallel Traversal
-
-```
-ready_batch = {vertices with in_degree == 0}
-
-while ready_batch is not empty:
-    next_ready = {}
-
-    if batch is small (< thread_count):
-        process single-threaded
-    else:
-        for each vertex in ready_batch:
-            dispatch_queue_->dispatch(lambda(tid):
-                visitor_copy[tid]->visit(vertex)
-                for each successor of vertex:
-                    atomic decrement in_degree[successor]
-                    if in_degree reached 0:
-                        lock; next_ready.push_back(successor)
-            )
-        dispatch_queue_->finishTasks()
-
-    ready_batch.swap(next_ready)
-```
-
-Key properties:
-- One task dispatched per vertex -- `DispatchQueue` handles load balancing across its thread pool.
-- `finishTasks()` uses condition_variable internally (no spin-wait).
-- Successor in-degree decrements use `std::memory_order_acq_rel` to ensure predecessor writes are visible.
-- Newly-ready vertices are collected into `next_ready` under a mutex, then swapped into the next batch.
-
-### 4.4 Cleanup
-
-After traversal:
-- Processed levels are cleared from the `LevelQueue`.
-- `first_level_` / `last_level_` are recalculated via `resetLevelBounds()` to track any remaining queued vertices (e.g., those beyond `to_level`).
-- Active vertex IDs are saved in `KahnState::prev_ids` for efficient reset on the next call.
-
-## 5. Implementation Details
-
-### 5.1 Files Modified
-
-| File | Changes |
-|---|---|
-| `include/sta/Bfs.hh` | Added `<functional>`, `<memory>`.
Added `kahnForEachSuccessor` pure virtual, `resetLevelBounds`, `KahnState` forward decl + `unique_ptr` member, `use_kahns_` flag, `kahn_pred_` pointer, and public accessors. |
-| `search/Bfs.cc` | Added `<atomic>`. Defined `KahnState` struct. Rewrote `visitParallel` with three branches (sequential / level-based / Kahn's). Added `kahnForEachSuccessor` overrides for Fwd (out-edges) and Bkwd (in-edges). Added `resetLevelBounds`. |
-| `search/Search.cc` | Added two lines in `Search` constructor to wire up `kahn_pred_` on both arrival and required iterators: `arrival_iter_->setKahnPred(search_adj_)` and `required_iter_->setKahnPred(search_adj_)`. |
-
-### 5.2 Class Hierarchy
-
-```
-BfsIterator (base)
- ├─ kahnForEachSuccessor() = 0 [pure virtual]
- ├─ resetLevelBounds()
- ├─ KahnState (persistent arrays, pimpl)
- ├─ use_kahns_ flag
- ├─ kahn_pred_ (SearchPred* for Kahn's discovery, separate from search_pred_)
- │
- ├─ BfsFwdIterator
- │   └─ kahnForEachSuccessor: iterates out-edges with
- │      searchFrom/searchThru/searchTo via kahn_pred_
- │
- └─ BfsBkwdIterator
-     └─ kahnForEachSuccessor: iterates in-edges with
-        searchTo/searchThru/searchFrom via kahn_pred_
-```
-
-### 5.3 KahnState (Persistent Storage)
-
-```cpp
-struct BfsIterator::KahnState {
-  std::vector<int> in_degree_init;               // flat array, indexed by VertexId
-  std::unique_ptr<std::atomic<int>[]> in_degree; // atomic copy for parallel phase
-  size_t in_degree_size = 0;
-  std::vector<VertexId> prev_ids;                // IDs to reset on next call
-};
-```
-
-- Allocated lazily on first Kahn's call.
-- `in_degree_init` grows dynamically (never shrinks). Only touched entries are reset between calls via `prev_ids`.
-- `in_degree` (atomic array) is reallocated only when the max vertex ID grows.
-
-### 5.4 Memory Ordering
-
-| Operation | Ordering | Rationale |
-|---|---|---|
-| `in_degree_init` writes (discovery) | Non-atomic | Single-threaded phase; `dispatch()` provides happens-before.
| -| `in_degree[].store()` (setup) | `relaxed` | Single-threaded; `dispatch()` provides happens-before. | -| `in_degree[].fetch_sub()` (worker) | `acq_rel` | Last predecessor's decrement synchronizes all prior arrival writes to the successor's reader thread. | -| `total_visited.fetch_add()` | `relaxed` | Counter read only after `finishTasks()` barrier. | - -### 5.5 Interaction with enqueue() During visit() - -`ArrivalVisitor::visit()` calls `enqueueAdjacentVertices()` which calls `enqueue()`. During Kahn's: -- Active vertices have `bfsInQueue` set during discovery -> `enqueue()` is a no-op (flag already set). -- Vertices beyond `to_level` are not in the active set -> `enqueue()` adds them to the `LevelQueue` normally for future passes. - -### 5.6 kahnForEachSuccessor vs enqueueAdjacentVertices - -These two methods have nearly identical edge-iteration logic (same predicates, same edge direction). They are intentionally kept as separate methods because: -- `enqueueAdjacentVertices` is called millions of times in the non-Kahn's path. Wrapping it in `std::function` would add overhead to all BFS operations. -- `kahnForEachSuccessor` accepts a `std::function` callback, used in both discovery and the worker. The `std::function` overhead is negligible relative to per-vertex computation. - -## 6. Pitfalls and Bugs Found - -### 6.1 ObjectTable Block-Based Vertex IDs - -**Problem**: `graph_->vertexCount()` returns the **live** object count, but `graph_->id(vertex)` returns `(block_index << 7) + slot_index`. After vertex deletions, live count drops but blocks persist. A vertex in block 2 can have ID 260 even when only 79 vertices are alive. - -**How we found it**: The `rmp.gcd_restructure.tcl` OpenROAD test crashed with a segfault. The rmp module deletes cells during restructuring, creating gaps between live vertex count and max vertex ID. Our in-degree array was sized to `vertexCount() + 1` and accessed out of bounds. 
- -**Solution**: `in_degree_init` grows dynamically during discovery (`resize(vid + 128, -1)` when any ID exceeds current capacity). Worker lambdas include a bounds check (`sid < in_deg_size`). The atomic array is sized to `max_id + 1` after discovery completes. - -**Note**: The other developer's implementation has the same latent bug (`graph_->vertexCount() + 1` sizing) but it hasn't manifested because their code only runs on the delay-calc path, which doesn't encounter the deletion pattern that rmp triggers. - -### 6.2 Null Search Predicate on Arrival Iterator - -**Problem**: The arrival BFS iterator is constructed with `search_pred_ = nullptr`: - -```cpp -arrival_iter_(new BfsFwdIterator(BfsIndex::arrival, nullptr, this)) -``` - -This is intentional in the original BFS -- the visitor provides its own predicate (`adj_pred_`) at call time via `enqueueAdjacentVertices(vertex, adj_pred_)`. The null `search_pred_` is never dereferenced. - -Kahn's discovery phase needs a predicate upfront (before any visitor runs) to know which edges to follow. Using `search_pred_` directly caused a null pointer dereference. - -**How we found it**: The `rmp.gcd_restructure.tcl` test crashed in `kahnForEachSuccessor` with `pred->searchFrom(vertex)` dereferencing null. Stack trace showed the call came from arrival propagation via `Search::findArrivals1`. - -**Solution**: Introduced `kahn_pred_` -- a separate predicate pointer dedicated to Kahn's discovery and successor decrement. Set via `setKahnPred()` and wired up in the `Search` constructor: - -```cpp -arrival_iter_->setKahnPred(search_adj_); -required_iter_->setKahnPred(search_adj_); -``` - -If `kahn_pred_` is null when Kahn's is enabled, `visitParallel` falls back to the original level-based BFS. This ensures no crash even if a caller enables Kahn's without setting the predicate. 
- -### 6.3 Memory Visibility Across Threads - -**Problem**: When predecessor P finishes computing arrivals and successor S starts reading them, S must see P's writes. - -**Original BFS**: `finishTasks()` between levels provides a full memory fence. - -**Kahn's**: The `fetch_sub(1, memory_order_acq_rel)` on the in-degree counter creates the happens-before chain. When the last predecessor's decrement triggers S's readiness, all prior writes by all predecessors are visible to S's processing thread. The batch-dispatch model adds a `finishTasks()` barrier between batches as an additional fence. - -### 6.4 "Arrivals Unchanged" Optimization - -**Original BFS**: If `ArrivalVisitor::visit()` finds arrivals haven't changed, it skips `enqueueAdjacentVertices` -- fanout is not re-evaluated. - -**Kahn's**: The discovery phase conservatively discovers ALL reachable vertices. Fanout in-degrees are decremented unconditionally after visit, regardless of whether arrivals changed. This means some vertices may be visited unnecessarily. They will find no change and produce no further effect. - -**Impact**: Slightly more work for incremental updates where only a small subset changes. Correct for all cases. - -### 6.5 Latch Multi-Pass Convergence - -Latch D-to-Q edges are excluded from the BFS by search predicates. Latch convergence is handled by the outer multi-pass loop in `Search::findAllArrivals()` / `Search::findFilteredArrivals()`, which re-seeds latch Q outputs between `visitParallel` calls. Kahn's operates within a single `visitParallel` call and is orthogonal to this mechanism. - -### 6.6 levelFinished() Callback - -`VertexVisitor::levelFinished()` is a virtual hook called at level boundaries. No override exists in the codebase (base implementation is empty). With Kahn's, it is called once after all vertices are processed. If a future subclass relies on per-level callbacks, it would need adaptation. - -## 7. 
Comparison with Other Developer's Approach - -The other implementation (`BfsFwdInDegreeIterator`) in the alternate repository takes a different design approach. Key differences: - -### 7.1 Architecture - -| Aspect | Other Developer | Our Approach | -|---|---|---| -| Class design | New standalone `BfsFwdInDegreeIterator : StaState` | Integrated into existing `BfsIterator` with toggle | -| Scope | Forward-only, delay calc (`GraphDelayCalc`) only | Forward + backward, any `visitParallel` caller | -| Integration | Requires caller to use new class and call `computeInDegrees()` explicitly | Drop-in: `setKahnPred(pred)` + `setUseKahns(true)` on existing iterator | - -### 7.2 Discovery - -| Aspect | Other Developer | Our Approach | -|---|---|---| -| Strategy | Iterate ALL vertices + ALL edges (full graph) | BFS from seeds (active subgraph only) | -| Cost | O(V_total + E_total) every call | O(V_active + E_active) per call | -| Incremental variant | Yes (`computeInDegrees(invalid_delays)` with reachability pass) | Natural (seed-based discovery covers only dirty subgraph) | -| Loop breaking | `to_vertex->level() >= vertex->level()` (ad-hoc) | `SearchPred::searchThru()` (matches Levelizer exactly) | - -### 7.3 Parallelism - -| Aspect | Other Developer | Our Approach | -|---|---|---| -| Dispatch model | Batch: dispatch all ready -> `finishTasks()` -> next batch | Same (adopted from their approach -- see Section 7.6) | -| Ready queue | `std::vector` with `swap()` | Same | -| Newly-ready collection | `ready_lock_` mutex on `ready_` vector | `next_ready_lock` mutex on `next_ready` vector | - -### 7.4 Thread Safety - -| Aspect | Other Developer | Our Approach | -|---|---|---| -| Active-set check | `vertex->visited()` (non-atomic `bool`) -- **data race risk** | `in_degree_init[id] >= 0` (read-only during parallel phase) -- safe | -| Edge dedup | `std::set processed_edges_` with **per-edge mutex lock** -- serialization bottleneck | Not needed (in-degrees computed correctly upfront) | 
-| Vertex marking | `vertex->setVisited(true)` from worker threads | `vertex->setBfsInQueue()` (atomic field) | - -### 7.5 Array Sizing - -| Aspect | Other Developer | Our Approach | -|---|---|---| -| Size | `graph_->vertexCount() + 1` (fixed) | Dynamic growth during discovery + bounds checks | -| Risk | **Same ObjectTable ID bug** -- IDs can exceed vertexCount after deletions. Latent bug that hasn't manifested because their code path (delay calc) doesn't trigger the deletion pattern. | Fixed (see Section 6.1) | - -### 7.6 What We Adopted from Their Approach - -Our initial implementation used a custom `KahnReadyQueue` with spin-wait workers (`std::this_thread::yield()`). Comparing with their approach revealed that dispatching one task per vertex into `DispatchQueue` and using `finishTasks()` as a batch barrier is significantly more efficient: - -- `DispatchQueue` uses `condition_variable` for blocking -- no wasted CPU on spin-wait. -- Natural load balancing -- the thread pool picks up work items automatically. -- Simpler code -- no custom queue class needed. - -This change cut our test suite overhead from 87s to 28s (vs 27s for original BFS). - -### 7.7 Performance Comparison - -| Approach | STA Regression (6109 tests) | -|---|---| -| Original level-based BFS | 27-30s | -| Our Kahn's v1 (hash map + spin-wait) | 87s | -| Our Kahn's v2 (dense array + spin-wait) | 42s | -| Our Kahn's v3 (dense array + batch dispatch) | 28s | -| Other developer (delay-calc only, separate class) | Reports ~45% speedup on large designs | - -## 8. Test Plan and Results - -### 8.1 OpenSTA Standalone Regression - -```bash -cd tools/OpenROAD/src/sta/build -cmake .. && make -j$(nproc) -ctest -j$(nproc) -``` - -**Pass criteria**: All tests pass with `use_kahns_ = true`. Results must be bit-identical to `use_kahns_ = false`. - -**Result**: **PASS** -- 6109/6109 tests pass with both settings. - -### 8.2 OpenROAD Full Regression - -```bash -cd tools/OpenROAD/build -cmake --build . 
-j$(nproc) -ctest -j$(nproc) -``` - -**Pass criteria**: All OpenROAD tests pass, including flows that modify the netlist between timing updates. - -**Key test cases exercised**: -- `rmp.gcd_restructure.tcl` -- restructure deletes cells, causing vertex ID gaps. This test originally crashed (Section 6.1, 6.2) and drove two bug fixes (dynamic array sizing and `kahn_pred_` separation). -- `rsz.*` -- resizer modifies netlist incrementally between timing updates. -- `cts.*` -- clock tree synthesis adds buffers and triggers re-timing. - -**Result**: **PASS** -- all OpenROAD regressions pass after the two fixes. - -### 8.3 Thread Count Sweep - -```tcl -set_thread_count 1 ;# Falls back to sequential visit() -set_thread_count 2 -set_thread_count 4 -set_thread_count 8 -``` - -**Pass criteria**: Identical timing reports across all thread counts. - -### 8.4 Toggle Consistency - -```tcl -# Run 1: use_kahns_ = false -report_checks -digits 6 > results_original.rpt -# Run 2: use_kahns_ = true -report_checks -digits 6 > results_kahns.rpt -# diff results_original.rpt results_kahns.rpt -``` - -**Pass criteria**: Reports are identical. - -### 8.5 Performance Expectations - -| Scenario | Expectation | -|---|---| -| Small designs (< 10K vertices) | Kahn's within 10% of original (discovery overhead amortized) | -| Large designs (> 100K vertices) with uneven levels | Kahn's faster due to barrier elimination | -| Incremental updates (small active set) | Kahn's overhead proportional to active set, not total graph | -| High thread counts (8-16 threads) | Kahn's scales better (no idle threads at level barriers) | - -### 8.6 Stress Tests - -- Design with a single long chain (worst case -- no parallelism, discovery overhead for zero benefit). -- Design with many parallel chains (best case -- maximum parallel utilization). -- Design with latches (multi-pass convergence must work correctly). -- Rapid incremental updates (persistent KahnState reuse exercised). 
-- Netlist modification flows: rmp, rsz, cts (exercises ObjectTable ID gaps and graph rebuilds). diff --git a/docs/KahnsBFS_Spec.txt b/docs/KahnsBFS_Spec.txt index 53df8ccb..c9914bab 100644 --- a/docs/KahnsBFS_Spec.txt +++ b/docs/KahnsBFS_Spec.txt @@ -27,16 +27,15 @@ Stage 1: Discovery and In-Degree Counting (single-threaded) Starting from the seed vertices already in the BFS queue, a forward BFS discovers all reachable vertices following the same edge-filtering rules used by the original traversal. As each new vertex is discovered, its in-degree (number of active predecessors) is recorded in a flat array indexed by graph vertex ID. Seed vertices have in-degree zero. -Stage 2: Batch-Dispatch Parallel Traversal (multi-threaded) +Stage 2: Recursive-Dispatch Parallel Traversal (multi-threaded) -All zero-in-degree vertices form the initial ready batch. The algorithm loops: +The unit of scheduling is a single ready vertex. All zero-in-degree vertices are initially dispatched as separate tasks into the existing DispatchQueue thread pool. Each task does three things: - 1. Dispatch one task per ready vertex into the existing DispatchQueue thread pool. - 2. Each task visits the vertex (computing arrivals or required times), then atomically decrements the in-degree of each successor. Any successor whose in-degree reaches zero is collected into the next batch. - 3. finishTasks() waits for all tasks in the current batch to complete. - 4. Swap in the next batch and repeat until no vertices remain. + 1. Visit the vertex (computing arrivals or required times). + 2. Atomically decrement the in-degree of each successor. + 3. If any successor's in-degree reaches zero, dispatch that successor immediately as a new task into the same DispatchQueue. -Small batches (fewer vertices than threads) are processed single-threaded to avoid dispatch overhead. The DispatchQueue uses condition_variable internally, so there is no spin-wait or wasted CPU. 
+A single finishTasks() call at the end waits for all dispatched work -- including tasks dispatched recursively from within running tasks -- to complete. There are no per-batch or per-level barriers. A worker thread that makes a successor ready sends it straight into the pool, where any idle thread can pick it up without waiting for unrelated tasks to finish. The DispatchQueue uses condition_variable internally, so idle threads block efficiently rather than spinning. 4. IMPLEMENTATION DETAILS @@ -58,6 +57,8 @@ The edge filter is separate from the iterator's existing search_pred_ because th Persistent state (KahnState) stores the in-degree arrays across calls. On the first call it allocates; on subsequent calls it resets only the entries touched previously, avoiding full re-initialization. +The Stage 2 task body is a std::function defined as a local variable in visitParallel. It captures itself by reference so that task code can recursively dispatch successors via the same function. Its lifetime is the duration of visitParallel; finishTasks() guarantees all dispatched tasks complete before the function returns, so the self-reference is always valid. + 5. INCREMENTAL TIMING UPDATES @@ -82,7 +83,9 @@ Discovery cost: The alternate scans every vertex and edge in the entire graph to Thread safety: The alternate uses a non-atomic visited flag from worker threads (data race risk) and maintains a per-edge mutex-locked set for deduplication (serialization bottleneck). Ours uses a read-only array for active-set checks and computes in-degrees upfront so edge tracking is unnecessary. -What we adopted: The alternate's batch-dispatch model (one task per vertex into DispatchQueue with finishTasks barriers) proved far more efficient than our initial spin-wait worker design. Adopting it cut test overhead from 87s to 28s. 
+Scheduling granularity: The alternate uses batched dispatch -- it dispatches a wavefront of ready vertices, waits for all to finish via finishTasks(), then dispatches the next wavefront. This re-introduces a barrier between wavefronts. Ours dispatches each ready vertex as a separate task and, when a running task makes a successor ready, dispatches that successor immediately via recursive dispatch into the same DispatchQueue. A single finishTasks() at the end waits for all work. This removes the per-wavefront barrier and keeps threads continuously fed. + +What we adopted from them: The DispatchQueue-based execution model. Our initial implementation used custom spin-wait workers (std::this_thread::yield) which wasted CPU. Moving to DispatchQueue with condition_variable-based blocking cut overhead substantially. 7. FINDINGS FROM REGRESSIONS diff --git a/search/Bfs.cc b/search/Bfs.cc index 1cc7798c..a5d8a6e0 100644 --- a/search/Bfs.cc +++ b/search/Bfs.cc @@ -354,21 +354,26 @@ BfsIterator::visitParallel(Level to_level, kahn_state_->ensureAtomicSize(max_id + 1); std::atomic *in_degree = kahn_state_->in_degree.get(); - // Copy active in-degrees to atomic array and collect ready batch. - std::vector ready_batch; + // Copy active in-degrees to atomic array and record IDs + // for cleanup on the next call. kahn_state_->prev_ids.clear(); kahn_state_->prev_ids.reserve(active_count); + int initial_ready_count = 0; for (Vertex *v : active_vertices) { VertexId vid = graph_->id(v); in_degree[vid].store(in_deg[vid], std::memory_order_relaxed); kahn_state_->prev_ids.push_back(vid); if (in_deg[vid] == 0) - ready_batch.push_back(v); + initial_ready_count++; } debugPrint(debug_, "bfs", 1, "kahns {} initial ready", - ready_batch.size()); + initial_ready_count); - // Phase 3: Batch-dispatch Kahn's traversal. + // Phase 3: Recursive-dispatch Kahn's traversal. 
+ // Each task visits its vertex, decrements successor in-degrees, + // and directly dispatches any successor whose in-degree hit zero + // back into the DispatchQueue. finishTasks() waits for all work, + // including recursively-dispatched tasks. No batch barriers. std::vector visitors; for (size_t k = 0; k < thread_count; k++) visitors.push_back(visitor->copy()); @@ -376,53 +381,41 @@ BfsIterator::visitParallel(Level to_level, std::atomic total_visited{0}; BfsIndex bfs_index = bfs_index_; SearchPred *pred = kahn_pred_; - std::vector next_ready; - std::mutex next_ready_lock; + size_t in_deg_size = in_deg.size(); - while (!ready_batch.empty()) { - next_ready.clear(); - - if (ready_batch.size() < thread_count) { - for (Vertex *vertex : ready_batch) { - vertex->setBfsInQueue(bfs_index, false); - visitor->visit(vertex); - total_visited.fetch_add(1, std::memory_order_relaxed); - kahnForEachSuccessor(vertex, pred, [&](Vertex *succ) { - VertexId sid = graph_->id(succ); - if (sid < in_deg.size() && in_deg[sid] >= 0) { - int prev = in_degree[sid] - .fetch_sub(1, std::memory_order_acq_rel); - if (prev == 1) - next_ready.push_back(succ); - } - }); - } - } - else { - size_t in_deg_size = in_deg.size(); - for (Vertex *vertex : ready_batch) { - dispatch_queue_->dispatch( - [&, vertex, bfs_index, pred, in_deg_size](size_t tid) { - vertex->setBfsInQueue(bfs_index, false); - visitors[tid]->visit(vertex); - total_visited.fetch_add(1, std::memory_order_relaxed); - kahnForEachSuccessor(vertex, pred, [&, in_deg_size](Vertex *succ) { - VertexId sid = graph_->id(succ); - if (sid < in_deg_size && in_deg[sid] >= 0) { - int prev = in_degree[sid] - .fetch_sub(1, std::memory_order_acq_rel); - if (prev == 1) { - LockGuard lock(next_ready_lock); - next_ready.push_back(succ); - } - } + // Recursive task lambda: self-reference via std::function. + // Captures persist on visitParallel's stack until finishTasks + // returns. 
+ std::function process; + process = [&, bfs_index, pred, in_deg_size](Vertex *vertex, + size_t tid) { + vertex->setBfsInQueue(bfs_index, false); + visitors[tid]->visit(vertex); + total_visited.fetch_add(1, std::memory_order_relaxed); + kahnForEachSuccessor(vertex, pred, [&](Vertex *succ) { + VertexId sid = graph_->id(succ); + if (sid < in_deg_size && in_deg[sid] >= 0) { + int prev = in_degree[sid] + .fetch_sub(1, std::memory_order_acq_rel); + if (prev == 1) { + // Successor is now ready -- dispatch immediately. + dispatch_queue_->dispatch([&process, succ](size_t t) { + process(succ, t); }); - }); + } } - dispatch_queue_->finishTasks(); + }); + }; + + // Seed initial ready vertices into the dispatch queue. + for (Vertex *v : active_vertices) { + if (in_deg[graph_->id(v)] == 0) { + dispatch_queue_->dispatch([&process, v](size_t t) { + process(v, t); + }); } - ready_batch.swap(next_ready); } + dispatch_queue_->finishTasks(); visit_count = total_visited.load(std::memory_order_relaxed); visitor->levelFinished(); From a5b6c78dda71c9f53cafd75ea90e2af6569cd49d Mon Sep 17 00:00:00 2001 From: dsengupta0628 Date: Tue, 21 Apr 2026 16:51:36 +0000 Subject: [PATCH 3/8] update to remove levelfinished call which was a dead code Signed-off-by: dsengupta0628 --- search/Bfs.cc | 1 - 1 file changed, 1 deletion(-) diff --git a/search/Bfs.cc b/search/Bfs.cc index 6bd1be75..9ee726c8 100644 --- a/search/Bfs.cc +++ b/search/Bfs.cc @@ -415,7 +415,6 @@ BfsIterator::visitParallel(Level to_level, dispatch_queue_->finishTasks(); visit_count = total_visited.load(std::memory_order_relaxed); - visitor->levelFinished(); for (VertexVisitor *v : visitors) delete v; From 6b2b5e258713a98bb278712d0bde22eec762f975 Mon Sep 17 00:00:00 2001 From: dsengupta0628 Date: Tue, 21 Apr 2026 18:34:35 +0000 Subject: [PATCH 4/8] update doc of spec- to be deleted before code is merged Signed-off-by: dsengupta0628 --- docs/KahnsBFS_Spec.txt | 63 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 63 
insertions(+) diff --git a/docs/KahnsBFS_Spec.txt b/docs/KahnsBFS_Spec.txt index c9914bab..86655057 100644 --- a/docs/KahnsBFS_Spec.txt +++ b/docs/KahnsBFS_Spec.txt @@ -115,3 +115,66 @@ On the OpenSTA regression suite (6109 tests), Kahn's BFS runs at parity with the OpenSTA standalone: 6109/6109 tests PASS with Kahn's enabled. OpenROAD full regression: All tests PASS, including rmp.gcd_restructure (the test that surfaced both findings above), rsz (incremental netlist modification), and cts (buffer insertion with re-timing). + + +10. LIMITATIONS OF THE CURRENT APPROACH + +The current implementation is correct and matches the original BFS at parity on small designs, but several limitations remain: + +Eager visits in Kahn's traversal. Every vertex in the discovered active subgraph is visited regardless of whether its arrival actually changes. This is a fundamental consequence of the in-degree counting model -- each predecessor must decrement each successor's counter exactly once, so skipping decrements is not allowed. The original BFS short-circuits via an "arrivals unchanged" check at the visitor level and avoids enqueuing downstream when no change occurred. We do not. For most designs the cost is small because the visitor itself detects no change and terminates quickly, but in deep-pipeline designs with many cascaded unchanged vertices the extra visits add up. + +Full forward sweep for slack queries. Slack at pin P is computed as required(P) minus arrival(P). The required-time backward BFS is already scoped to P's level. But the forward arrival BFS is not scoped to P's fanin cone -- it propagates from all dirty seeds to all endpoints they can reach. For a single-point slack query on a design with large independent cones, most of the forward work is spent on endpoints the query does not care about. + +Over-invalidation in the dirty set. The incremental framework's invalid_arrivals_ and invalid_requireds_ sets are tracked conservatively. 
Some edge-delay or pin changes invalidate more vertices than strictly necessary; the visitor detects no change and does no further propagation, but we still paid for the visit. A more precise validity analysis could prune the seed set before the BFS starts. + +Per-call active_vertices allocation. The KahnState persistence avoids re-allocating in_degree_init and in_degree across calls, but the active_vertices vector is rebuilt every call. For very frequent small updates this has some overhead. + +Recursive dispatch cost for small workloads. Each ready vertex is dispatched as its own DispatchQueue task. The dispatch lock and condition-variable signaling cost is tiny per task, but for active sets smaller than the thread count the parallelism benefit may not offset the dispatch overhead. + + +11. REVIEWER FEEDBACK AND ROADMAP + +Two reviewer comments suggest directions that address the limitations above. + +Reviewer 1: "In the incremental case wouldn't you want to work backwards from the query?" + +This is the demand-driven or point-specific propagation model used by some commercial STA tools. For a single slack query at pin P, the idea is: + + 1. Walk backward from P to compute its fanin cone. + 2. Intersect the cone with the dirty set -- only those dirty vertices actually affect P. + 3. Propagate forward only from those vertices, restricted to within the cone. + +OpenSTA already does this for the backward required-time pass (stops at query pin's level) and for the endpoint-with-no-fanout shortcut (seedRequired, no BFS at all). It does not do this for the forward arrival pass in slack queries -- findAllArrivals always goes to maxLevel. 
Adding demand-driven forward propagation would require: + + - A query context passed from the Tcl command down to findArrivals + - A backward cone computation (or lazy cone membership check) + - Predicate filtering during Kahn's discovery to skip vertices outside the cone + +The scope is architectural (Search.cc and the query entry points), not inside the BFS iterator. Kahn's would still apply to the forward pass within the cone -- it is still a DAG traversal with the same barrier elimination benefit. The two optimizations stack rather than compete. + +Reviewer 2: "I would do a pre-traversal DFS to find the edge between valid and invalid nodes and I would queue the valid ones and then BFS forward and do the same thing in reverse for the required times." + +This addresses over-invalidation. The idea: before seeding the BFS, do a pre-pass that walks the dirty-reachable subgraph to find the minimal boundary -- the last layer of vertices whose arrivals are still known-valid but whose fanout begins the invalid region. Seed the BFS from those valid boundary vertices. Propagation then walks forward into the invalid region with known-correct source arrivals. + +Two potential benefits: + + - If the invalidation is actually narrower than the dirty set suggests, the boundary pass prunes work the current approach would still do. + - Starting from vertices with known-correct arrivals gives the forward pass a clean reference point, which may catch cases where a dirty vertex's arrival doesn't actually change. + +Same pattern applies in reverse for requireds: find the boundary between known-valid and known-invalid required times, seed from the valid side, propagate backward into the invalid region. + +Implementation-wise this is a refinement of the current dirty-set mechanism rather than an architectural change. It could be added in Search.cc at the seeding step (seedInvalidArrivals / seedInvalidRequireds) without touching the BFS iterator. 
+ +Potential roadmap items, in rough order of payoff and effort: + + 1. Visit-level change short-circuit (restore the "arrivals unchanged" optimization within Kahn's): skip the visit body if no predecessor's arrival actually changed. Track a per-vertex "changed" flag during Kahn's. Low risk, addresses the eager-visit limitation. + + 2. Validity-boundary seeding (Reviewer 2's suggestion): pre-pass to find the minimal boundary of invalid arrivals/requireds; seed from the valid side. Medium effort, addresses over-invalidation. + + 3. Demand-driven forward propagation for single-point queries (Reviewer 1's suggestion): query context + backward cone restriction. Larger architectural change spanning Search.cc and Tcl entry points. Biggest win for narrow single-point queries on large designs. + + 4. Hybrid scheduling (batched for small sets, recursive for large): adaptive threshold switches between the two dispatch models based on active set size. Addresses the small-workload overhead. + + 5. Multi-query cone batching: if multiple slack queries are made in sequence, compute the union of their backward cones once and do a single scoped forward sweep. Amortizes the cone-computation cost. + +These are orthogonal to Kahn's itself -- all of them layer on top of the existing implementation. From 19425477d6f428d3866a79fe177b63c16c8d67d8 Mon Sep 17 00:00:00 2001 From: dsengupta0628 Date: Tue, 21 Apr 2026 18:39:09 +0000 Subject: [PATCH 5/8] update doc of spec - to be deleted before code is merged Signed-off-by: dsengupta0628 --- docs/KahnsBFS_Spec.txt | 71 ++++++++++++++++++++---------------------- 1 file changed, 34 insertions(+), 37 deletions(-) diff --git a/docs/KahnsBFS_Spec.txt b/docs/KahnsBFS_Spec.txt index 86655057..8fa05612 100644 --- a/docs/KahnsBFS_Spec.txt +++ b/docs/KahnsBFS_Spec.txt @@ -132,49 +132,46 @@ Per-call active_vertices allocation. The KahnState persistence avoids re-allocat Recursive dispatch cost for small workloads. 
Each ready vertex is dispatched as its own DispatchQueue task. The dispatch lock and condition-variable signaling cost is tiny per task, but for active sets smaller than the thread count the parallelism benefit may not offset the dispatch overhead. -11. REVIEWER FEEDBACK AND ROADMAP +11. FUTURE ROADMAP -Two reviewer comments suggest directions that address the limitations above. +The following enhancements extend the current Kahn's-based incremental timing implementation. They address known limitations in the existing approach and are orthogonal to Kahn's itself — each can be layered on top of the existing implementation independently. Items are listed in rough order of payoff relative to effort. -Reviewer 1: "In the incremental case wouldn't you want to work backwards from the query?" +1. Visit-level change short-circuit +Objective: Restore the "arrivals unchanged" optimization within Kahn's propagation by skipping the visit body when no predecessor's arrival has actually changed. +Approach: Track a per-vertex "changed" flag during Kahn's traversal. When a vertex is popped from the queue, check whether any of its incoming arrivals differ from the previously recorded values before performing the full visit. +Benefit: Addresses the eager-visit limitation in the current implementation, eliminating redundant work when dirty vertices do not actually propagate value changes downstream. +Risk and effort: Low risk, localized change. -This is the demand-driven or point-specific propagation model used by some commercial STA tools. For a single slack query at pin P, the idea is: +2. Validity-boundary seeding +Objective: Address over-invalidation by narrowing the set of vertices that require re-propagation. +Approach: Before seeding the BFS, perform a pre-pass that walks the dirty-reachable subgraph to identify the minimal boundary — the last layer of vertices whose arrivals remain known-valid but whose fanout begins the invalid region. 
Seed the BFS from those valid boundary vertices so that propagation walks forward into the invalid region with known-correct source arrivals. The same pattern applies in reverse for required times: find the boundary between known-valid and known-invalid requireds, seed from the valid side, and propagate backward into the invalid region. +Benefit: If the invalidation is actually narrower than the dirty set suggests, the boundary pass prunes work the current approach would still perform. Starting from vertices with known-correct arrivals also gives the forward pass a clean reference point, which may catch cases where a dirty vertex's arrival does not actually change. +Risk and effort: Medium effort, refinement of the current dirty-set mechanism rather than an architectural change. Can be added in Search.cc at the seeding step (seedInvalidArrivals / seedInvalidRequireds) without modifying the BFS iterator. - 1. Walk backward from P to compute its fanin cone. - 2. Intersect the cone with the dirty set -- only those dirty vertices actually affect P. - 3. Propagate forward only from those vertices, restricted to within the cone. +3. Demand-driven forward propagation for single-point queries +Objective: For single slack queries at a given pin, restrict forward propagation to only the vertices that actually influence that pin. +Approach: For a slack query at pin P: +--Walk backward from P to compute its fanin cone. +--Intersect the cone with the dirty set — only those dirty vertices actually affect P. +--Propagate forward only from those vertices, restricted to within the cone. -OpenSTA already does this for the backward required-time pass (stops at query pin's level) and for the endpoint-with-no-fanout shortcut (seedRequired, no BFS at all). It does not do this for the forward arrival pass in slack queries -- findAllArrivals always goes to maxLevel. 
Adding demand-driven forward propagation would require: +OpenSTA already applies this pattern in the backward required-time pass (which stops at the query pin's level) and for the endpoint-with-no-fanout shortcut (seedRequired, no BFS at all). It does not currently do this for the forward arrival pass in slack queries — findAllArrivals always goes to maxLevel. +Implementation requirements: - - A query context passed from the Tcl command down to findArrivals - - A backward cone computation (or lazy cone membership check) - - Predicate filtering during Kahn's discovery to skip vertices outside the cone +--A query context passed from the Tcl command down to findArrivals +--A backward cone computation, or a lazy cone-membership check +--Predicate filtering during Kahn's discovery to skip vertices outside the cone -The scope is architectural (Search.cc and the query entry points), not inside the BFS iterator. Kahn's would still apply to the forward pass within the cone -- it is still a DAG traversal with the same barrier elimination benefit. The two optimizations stack rather than compete. +Kahn's traversal still applies to the forward pass within the cone — it remains a DAG traversal and retains the barrier elimination benefit. The two optimizations stack rather than compete. +Benefit: The largest win for narrow single-point queries on large designs. +Risk and effort: Larger architectural change spanning Search.cc and the Tcl query entry points. Scope is architectural, not inside the BFS iterator. -Reviewer 2: "I would do a pre-traversal DFS to find the edge between valid and invalid nodes and I would queue the valid ones and then BFS forward and do the same thing in reverse for the required times." +4. Hybrid scheduling +Objective: Address the small-workload overhead of the current scheduling model. 
+Approach: Introduce an adaptive threshold that switches between batched dispatch (suited for small active sets) and recursive dispatch (suited for large active sets), based on the size of the active set at scheduling time. +Benefit: Eliminates overhead for small incremental updates while preserving throughput for large ones. -This addresses over-invalidation. The idea: before seeding the BFS, do a pre-pass that walks the dirty-reachable subgraph to find the minimal boundary -- the last layer of vertices whose arrivals are still known-valid but whose fanout begins the invalid region. Seed the BFS from those valid boundary vertices. Propagation then walks forward into the invalid region with known-correct source arrivals. - -Two potential benefits: - - - If the invalidation is actually narrower than the dirty set suggests, the boundary pass prunes work the current approach would still do. - - Starting from vertices with known-correct arrivals gives the forward pass a clean reference point, which may catch cases where a dirty vertex's arrival doesn't actually change. - -Same pattern applies in reverse for requireds: find the boundary between known-valid and known-invalid required times, seed from the valid side, propagate backward into the invalid region. - -Implementation-wise this is a refinement of the current dirty-set mechanism rather than an architectural change. It could be added in Search.cc at the seeding step (seedInvalidArrivals / seedInvalidRequireds) without touching the BFS iterator. - -Potential roadmap items, in rough order of payoff and effort: - - 1. Visit-level change short-circuit (restore the "arrivals unchanged" optimization within Kahn's): skip the visit body if no predecessor's arrival actually changed. Track a per-vertex "changed" flag during Kahn's. Low risk, addresses the eager-visit limitation. - - 2. 
Validity-boundary seeding (Reviewer 2's suggestion): pre-pass to find the minimal boundary of invalid arrivals/requireds; seed from the valid side. Medium effort, addresses over-invalidation. - - 3. Demand-driven forward propagation for single-point queries (Reviewer 1's suggestion): query context + backward cone restriction. Larger architectural change spanning Search.cc and Tcl entry points. Biggest win for narrow single-point queries on large designs. - - 4. Hybrid scheduling (batched for small sets, recursive for large): adaptive threshold switches between the two dispatch models based on active set size. Addresses the small-workload overhead. - - 5. Multi-query cone batching: if multiple slack queries are made in sequence, compute the union of their backward cones once and do a single scoped forward sweep. Amortizes the cone-computation cost. - -These are orthogonal to Kahn's itself -- all of them layer on top of the existing implementation. +5. Multi-query cone batching +Objective: Amortize cone-computation cost when multiple slack queries are issued in sequence. +Approach: When several slack queries arrive together, compute the union of their backward cones once and perform a single scoped forward sweep across the combined cone, rather than repeating the cone computation and forward traversal per query. +Benefit: Reduces redundant work in reporting flows that issue many related queries, such as full endpoint slack reports or path-group summaries. 
From 907d5f8c645369741e7a5148defe13ad00127dd4 Mon Sep 17 00:00:00 2001 From: dsengupta0628 Date: Tue, 21 Apr 2026 19:23:12 +0000 Subject: [PATCH 6/8] enable/disable using set sta_use_kahns_bfs 1 or 0 Signed-off-by: dsengupta0628 --- docs/KahnsBFS_Spec.txt | 24 ++++++++++++++++++++++-- include/sta/Bfs.hh | 9 ++------- include/sta/Sta.hh | 6 ++++++ include/sta/Variables.hh | 6 ++++++ sdc/Variables.cc | 6 ++++++ search/Bfs.cc | 3 ++- search/Search.i | 12 ++++++++++++ search/Sta.cc | 12 ++++++++++++ tcl/Variables.tcl | 8 ++++++++ 9 files changed, 76 insertions(+), 10 deletions(-) diff --git a/docs/KahnsBFS_Spec.txt b/docs/KahnsBFS_Spec.txt index 8fa05612..4e2f89a7 100644 --- a/docs/KahnsBFS_Spec.txt +++ b/docs/KahnsBFS_Spec.txt @@ -46,15 +46,35 @@ Files modified: search/Bfs.cc -- Defined KahnState struct holding persistent in-degree arrays (reused across calls to avoid re-allocation). Added a third branch to visitParallel: single-threaded / original-parallel / Kahn's-parallel. Implemented kahnForEachSuccessor for both BfsFwdIterator and BfsBkwdIterator. - search/Search.cc -- Two lines in the Search constructor wire the Kahn's edge filter (SearchAdj) onto the arrival and required iterators. + include/sta/Search.hh -- Added useKahnsBfs() getter and setUseKahnsBfs(bool) setter on Search, forwarding to the arrival and required iterators. -Enabling Kahn's requires two calls on a BfsIterator: + search/Search.cc -- Two lines in the Search constructor wire the Kahn's edge filter (SearchAdj) onto the arrival and required iterators. Added Search::useKahnsBfs() and Search::setUseKahnsBfs() implementations. + + include/sta/Sta.hh -- Added useKahnsBfs() / setUseKahnsBfs() declarations for the Tcl variable sta_use_kahns_bfs. + + search/Sta.cc -- Implemented Sta::useKahnsBfs() and Sta::setUseKahnsBfs() as thin forwarders to Search. + + search/Search.i -- Exposed use_kahns_bfs and set_use_kahns_bfs to SWIG for Tcl. 
+
+    tcl/Variables.tcl -- Added the sta_use_kahns_bfs Tcl variable with a read/write trace that calls the underlying commands.
+
+Enabling Kahn's at the iterator level requires two calls on a BfsIterator:
 
   iterator->setKahnPred(predicate);  // edge filter for discovery
   iterator->setUseKahns(true);       // enable Kahn's
 
 The edge filter is separate from the iterator's existing search_pred_ because the original BFS never uses search_pred_ directly for arrivals -- the visitor provides its own filter at call time. Kahn's discovery runs before any visitor, so it needs the filter upfront. If the filter is null, visitParallel falls back to the original BFS.
 
+For end users, Kahn's can be toggled from Tcl via a design-level variable:
+
+  set sta_use_kahns_bfs 1   ;# enable Kahn's (default)
+  set sta_use_kahns_bfs 0   ;# fall back to original level-based BFS
+  puts $sta_use_kahns_bfs   ;# read current setting
+
+Setting the variable calls Sta::setUseKahnsBfs, which records the flag in the shared Variables object that visitParallel consults, so both the arrival (forward) and required-time (backward) passes see the new setting on their next traversal. No arrivals or requireds are invalidated on toggle -- the two algorithms produce bit-identical results, so cached state remains valid. Scripts can flip the variable mid-session to compare the two paths without a full rerun.
+
+The Tcl plumbing is registered at package load via Variables.tcl (trace add variable) and the underlying commands use_kahns_bfs / set_use_kahns_bfs are exposed through Search.i (SWIG) to the sta namespace.
+
 Persistent state (KahnState) stores the in-degree arrays across calls. On the first call it allocates; on subsequent calls it resets only the entries touched previously, avoiding full re-initialization.
 
 The Stage 2 task body is a std::function defined as a local variable in visitParallel. It captures itself by reference so that task code can recursively dispatch successors via the same function. 
Its lifetime is the duration of visitParallel; finishTasks() guarantees all dispatched tasks complete before the function returns, so the self-reference is always valid. diff --git a/include/sta/Bfs.hh b/include/sta/Bfs.hh index 81b5aec5..04247c51 100644 --- a/include/sta/Bfs.hh +++ b/include/sta/Bfs.hh @@ -78,14 +78,10 @@ public: void deleteVertexBefore(Vertex *vertex); void remove(Vertex *vertex); void reportEntries() const; - // Enable/disable Kahn's algorithm for parallel traversal. - // When disabled (default), the original level-based BFS is used. - // Kahn's requires a non-null kahn_pred to know which edges to - // follow during discovery. Set it via setKahnPred() before enabling. - void setUseKahns(bool use_kahns) { use_kahns_ = use_kahns; } - bool useKahns() const { return use_kahns_; } // Search predicate used by Kahn's discovery and successor decrement. // Separate from search_pred_ which is used by the original BFS. + // Kahn's traversal is gated by Variables::useKahnsBfs() (Tcl variable + // sta_use_kahns_bfs); this filter is required for Kahn's discovery. void setKahnPred(SearchPred *pred) { kahn_pred_ = pred; } bool hasNext() override; @@ -144,7 +140,6 @@ protected: Level first_level_; // Max (min) level of queued vertices. Level last_level_; - bool use_kahns_ = true; SearchPred *kahn_pred_ = nullptr; friend class BfsFwdIterator; diff --git a/include/sta/Sta.hh b/include/sta/Sta.hh index 95d27ba5..5514085a 100644 --- a/include/sta/Sta.hh +++ b/include/sta/Sta.hh @@ -1412,6 +1412,12 @@ public: //////////////////////////////////////////////////////////////// // TCL Variables + // TCL variable sta_use_kahns_bfs. + // Use Kahn's topological traversal (no per-level barriers) for the + // parallel arrival and required-time BFS passes. Falls back to the + // original level-based BFS when disabled. + bool useKahnsBfs() const; + void setUseKahnsBfs(bool use); // TCL variable sta_crpr_enabled. // Common Reconvergent Clock Removal (CRPR). 
// Timing check source/target common clock path overlap for search diff --git a/include/sta/Variables.hh b/include/sta/Variables.hh index 90c9fad2..f956aba1 100644 --- a/include/sta/Variables.hh +++ b/include/sta/Variables.hh @@ -34,6 +34,11 @@ enum class CrprMode { same_pin, same_transition }; class Variables { public: + // TCL variable sta_use_kahns_bfs. + // Use Kahn's topological traversal (no per-level barriers) for the + // parallel arrival and required-time BFS passes. + bool useKahnsBfs() const { return use_kahns_bfs_; } + void setUseKahnsBfs(bool use); // TCL variable sta_crpr_enabled. bool crprEnabled() const { return crpr_enabled_; } void setCrprEnabled(bool enabled); @@ -80,6 +85,7 @@ public: void setPocvQuantile(float quantile); private: + bool use_kahns_bfs_{true}; bool crpr_enabled_{true}; CrprMode crpr_mode_{CrprMode::same_pin}; bool propagate_gated_clock_enable_{true}; diff --git a/sdc/Variables.cc b/sdc/Variables.cc index 81650eb4..09fa69b8 100644 --- a/sdc/Variables.cc +++ b/sdc/Variables.cc @@ -26,6 +26,12 @@ namespace sta { +void +Variables::setUseKahnsBfs(bool use) +{ + use_kahns_bfs_ = use; +} + void Variables::setCrprEnabled(bool enabled) { diff --git a/search/Bfs.cc b/search/Bfs.cc index 9ee726c8..390ad741 100644 --- a/search/Bfs.cc +++ b/search/Bfs.cc @@ -36,6 +36,7 @@ #include "Report.hh" #include "Sdc.hh" #include "SearchPred.hh" +#include "Variables.hh" namespace sta { @@ -218,7 +219,7 @@ BfsIterator::visitParallel(Level to_level, if (!empty()) { if (thread_count == 1) visit_count = visit(to_level, visitor); - else if (!use_kahns_ || !kahn_pred_) { + else if (!variables_->useKahnsBfs() || !kahn_pred_) { // Original level-based parallel BFS with per-level barriers. 
std::vector<VertexVisitor*> visitors;
       visitors.reserve(thread_count_);
diff --git a/search/Search.i b/search/Search.i
index 817111e8..39dc0822 100644
--- a/search/Search.i
+++ b/search/Search.i
@@ -929,6 +929,18 @@ find_fanout_insts(PinSeq *from,
 //
 ////////////////////////////////////////////////////////////////

+bool
+use_kahns_bfs()
+{
+  return Sta::sta()->useKahnsBfs();
+}
+
+void
+set_use_kahns_bfs(bool use)
+{
+  Sta::sta()->setUseKahnsBfs(use);
+}
+
 bool
 crpr_enabled()
 {
diff --git a/search/Sta.cc b/search/Sta.cc
index 001ba921..3eee9dbe 100644
--- a/search/Sta.cc
+++ b/search/Sta.cc
@@ -2233,6 +2233,18 @@ Sta::checkTiming(const Mode *mode,
 ////////////////////////////////////////////////////////////////

+bool
+Sta::useKahnsBfs() const
+{
+  return variables_->useKahnsBfs();
+}
+
+void
+Sta::setUseKahnsBfs(bool use)
+{
+  variables_->setUseKahnsBfs(use);
+}
+
 bool
 Sta::crprEnabled() const
 {
diff --git a/tcl/Variables.tcl b/tcl/Variables.tcl
index 0ea66690..20a083c3 100644
--- a/tcl/Variables.tcl
+++ b/tcl/Variables.tcl
@@ -47,6 +47,14 @@ proc trace_report_default_digits { name1 name2 op } {
   }
 }

+trace add variable ::sta_use_kahns_bfs {read write} \
+    sta::trace_use_kahns_bfs
+
+proc trace_use_kahns_bfs { name1 name2 op } {
+  trace_boolean_var $op ::sta_use_kahns_bfs \
+    use_kahns_bfs set_use_kahns_bfs
+}
+
 trace add variable ::sta_crpr_enabled {read write} \
     sta::trace_crpr_enabled

From 100c885d8614363eaa5e5bf48f2003eda8aad974 Mon Sep 17 00:00:00 2001
From: dsengupta0628
Date: Tue, 21 Apr 2026 21:56:07 +0000
Subject: [PATCH 7/8] disable Kahns when dynamic_loop_breaking is enabled

Signed-off-by: dsengupta0628
---
 docs/KahnsBFS_Spec.txt | 150 +++++++++++++++++++++++++++++++++++++++--
 search/Bfs.cc          |   9 ++-
 2 files changed, 154 insertions(+), 5 deletions(-)

diff --git a/docs/KahnsBFS_Spec.txt b/docs/KahnsBFS_Spec.txt
index 4e2f89a7..5e18bef1 100644
--- a/docs/KahnsBFS_Spec.txt
+++ b/docs/KahnsBFS_Spec.txt
@@ -65,6 +65,8 @@ Enabling Kahn's at the iterator level requires two calls
on a BfsIterator: The edge filter is separate from the iterator's existing search_pred_ because the original BFS never uses search_pred_ directly for arrivals -- the visitor provides its own filter at call time. Kahn's discovery runs before any visitor, so it needs the filter upfront. If the filter is null, visitParallel falls back to the original BFS.
+Kahn's is also bypassed -- even when enabled -- whenever the Tcl variable sta_dynamic_loop_breaking is set. That feature relies on arrival tags that only emerge during propagation to decide whether an otherwise-disabled loop edge can be traversed. Kahn's needs the active subgraph and in-degrees known before propagation begins, so it cannot consult those tags. To avoid silently missing vertices, visitParallel guards the Kahn's path with an explicit check on variables_->dynamicLoopBreaking() and falls back to the original level-based BFS whenever dynamic loop breaking is active. In that configuration the toggle is effectively a no-op from the user's point of view; results stay correct.
+
 For end users, Kahn's can be toggled from Tcl via a design-level variable:

    set sta_use_kahns_bfs 1 ;# enable Kahn's (default)

@@ -124,20 +126,158 @@ Resolution: Introduced kahn_pred, a dedicated predicate for Kahn's discovery, wi

Both findings were caught by rmp.gcd_restructure.tcl and resolved without changing the original BFS behavior.

+Finding 3: Incompatibility with dynamic loop breaking
+
+sta_dynamic_loop_breaking (a pre-existing Tcl variable, default off) enables on-the-fly re-activation of disabled-loop edges when arrival propagation produces loop tags that satisfy user-declared false-path exceptions. The check lives in SearchAdj::searchThru: a disabled-loop edge is traversable when (dynamicLoopBreaking() && hasPendingLoopPaths(edge)) holds, where hasPendingLoopPaths consults the visitor's live TagGroupBldr to see which loop tags are currently propagating.
+ +The SearchAdj instance we reuse as kahn_pred_ (search_adj_ in Search.cc) is constructed with tag_bldr_ == nullptr, so hasPendingLoopPaths always returns false for it -- by design, since Kahn's discovery runs before any visitor is active and there are no live tags to consult. This means that when a user enables sta_dynamic_loop_breaking alongside sta_use_kahns_bfs, Kahn's discovery and successor decrement would systematically skip disabled-loop edges that the original ArrivalVisitor path (using its own tag-aware adj_pred_) can traverse. Vertices reachable only through those edges would never enter the active set, leaving their arrivals and slacks stale. + +Neither OpenSTA's regression suite nor OpenROAD's standard flows set sta_dynamic_loop_breaking, so this never surfaced in testing. It was identified during code review. + +Resolution: visitParallel now falls back to the original level-based BFS whenever variables_->dynamicLoopBreaking() is true, regardless of the Kahn's toggle. This is a defensive guard; the Tcl variable still reads and writes normally, but the traversal uses the original path when the two features would otherwise interact unsafely. The cost is one additional boolean check per visitParallel invocation. + +A future enhancement could make Kahn's loop-breaking-aware by conservatively discovering through disabled-loop edges and adjusting in-degrees based on actual propagation, but that work is non-trivial and not worth pursuing until a concrete use case combines both features. + 8. PERFORMANCE On the OpenSTA regression suite (6109 tests), Kahn's BFS runs at parity with the original level-based BFS (28s vs 27-30s). On small test designs the discovery stage overhead is negligible. On large designs with uneven level populations, barrier elimination should produce net speedups, particularly at high thread counts where the original BFS leaves threads idle. -9. TEST RESULTS +9. 
TEST PLAN
+
+Beyond the OpenSTA standalone regression suite and the OpenROAD full regression, a set of helper scripts is provided for A/B runtime benchmarking and validation across ORFS designs. These run the full ORFS flow for each design twice -- once with Kahn's BFS disabled and once with Kahn's enabled -- and collect per-step timing and design-size metrics for comparison.
+
+All scripts live under flow/util/ and are intended to be invoked from the flow/ directory. They do not modify any design scripts or ORFS flow files; instead, a tiny binary wrapper injects the Tcl variable sta_use_kahns_bfs into every OpenROAD invocation.
+
+
+9.1 Binary wrapper: openroad_kahns_wrap.sh
+
+ORFS invokes openroad with -no_init, so ~/.openroad is not sourced. To toggle sta_use_kahns_bfs across every invocation of every flow step without editing any Tcl, this wrapper sits in front of the real OpenROAD binary:
+
+  - Finds the .tcl cmd_file argument in the invocation.
+  - Creates a temporary Tcl that performs
+        set sta_use_kahns_bfs <0|1>
+        puts "kahns-wrap: requested=<0|1>, effective=$::sta_use_kahns_bfs"
+        source "<cmd_file>"
+  - Execs the real OpenROAD on the temporary file.
+
+The wrapper reads KAHNS_BFS from the environment (0 = original BFS, 1 = Kahn's). The breadcrumb puts line lands in every step log, so a single grep confirms the flag was in effect and never overridden by a downstream script.
+
+
+9.2 Benchmark driver: kahns_benchmark.sh
+
+Runs an A/B sweep across one or more designs. For each design:
+  1. make clean_all
+  2. Run target (default: finish) with KAHNS_BFS=0; time with date +%s.%N.
+  3. Save elapsed-all.txt and copy logs/<platform>/<design>/<variant>/ before the next clean.
+  4. make clean_all
+  5. Run target with KAHNS_BFS=1; time.
+  6. Save elapsed-all.txt and the logs tree again.
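The six-step per-design loop above can be sketched as a small shell function. This is a simplified illustration, not the shipped util/kahns_benchmark.sh: the run_ab helper name, its argument order, the make invocation details, and the summary.csv column layout are all assumptions made for the sketch.

```shell
# Sketch of the per-design A/B loop (illustrative assumptions throughout:
# run_ab, its arguments, and the summary.csv columns are invented here).
run_ab() {
  design=$1                       # design tag, e.g. "aes"
  target=${2:-finish}             # ORFS make target
  outdir=${3:-kahns_bench}        # benchmark output directory
  mkdir -p "$outdir"
  # OFF pass first (baseline), then ON, matching the Section 9.6 convention.
  for kahns in 0 1; do
    [ "$kahns" -eq 0 ] && tag=off || tag=on
    # Clean, then run the flow with the wrapper's toggle exported.
    make clean_all > /dev/null 2>&1
    start=$(date +%s.%N)
    KAHNS_BFS=$kahns make "$target" > "$outdir/${design}_kahns_${tag}.log" 2>&1
    end=$(date +%s.%N)
    # Append the wall time for this pass to the sweep summary.
    awk -v d="$design" -v t="$tag" -v s="$start" -v e="$end" \
        'BEGIN { printf "%s,%s,%.1f\n", d, t, e - s }' >> "$outdir/summary.csv"
  done
}
```

Keeping the OFF pass first means the baseline numbers are collected before the experimental path touches any shared state.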
+
+Output directory layout:
+
+  <outdir>/
+    summary.csv                                   wall-time totals, CSV
+    <design>_kahns_off.log                        full stdout, OFF run
+    <design>_kahns_on.log                         full stdout, ON run
+    <design>_kahns_off_artifacts/elapsed-all.txt  per-step seconds, OFF
+    <design>_kahns_off_artifacts/logs/            raw step logs and JSON metrics, OFF
+    <design>_kahns_on_artifacts/elapsed-all.txt   per-step seconds, ON
+    <design>_kahns_on_artifacts/logs/             raw step logs and JSON metrics, ON
+
+Usage (from flow/):
+  util/kahns_benchmark.sh [-t target] [-o outdir] [design-configs...]
+
+Target defaults to finish. For STA-focused benchmarking, -t route covers all STA-heavy steps (place, repair_timing_post_place, cts, global_route, repair_timing_post_global_route, detail_route) without the downstream fill / final_report overhead.
+
+
+9.3 Per-step runtime comparison: kahns_compare.sh
+
+Reads the elapsed-all.txt files from a benchmark directory and produces a per-step comparison table with OFF seconds, ON seconds, delta, and ratio (ON/OFF). Positive deltas mean Kahn's was slower for that step; ratios below 1.00x mean Kahn's was faster.
+
+Usage (from flow/):
+  util/kahns_compare.sh <outdir> [design_tag]
+
+Without design_tag, every design that has both OFF and ON artifacts is compared in a single run.
+
+Typical reading pattern for a given design:
+  - Non-STA steps (yosys, floorplan_macro, pdn, fillcell): ratio ~1.00x.
+  - STA-heavy steps (3_3_place_gp, 4_1_cts, 5_1_grt, 5_2_route): where any real speed-up or slowdown appears.
+  - Small designs: slight positive delta from Kahn's discovery-pass overhead.
+  - Large designs with uneven level populations: expected speed-up from barrier elimination.
+
+
+9.4 Design-size view and correctness check: kahns_size.sh
+
+Extracts design-size metrics (instance count, net count, IO count, cell area) at each major stage from the step-level JSON metrics files (<step>.json).
Provides three modes:
+
+  Default (combined view):
+      util/kahns_size.sh <outdir> [design_tag]
+    Prints one table per design with the OFF-run values and a match column
+    that flags any stage where ON disagreed. Ideal for spotting correctness
+    regressions at a glance: every row must show ok.
+
+  Verbose (-v):
+      util/kahns_size.sh -v <outdir> [design_tag]
+    Prints the two separate OFF and ON tables side-by-side so the actual
+    disagreeing values can be read.
+
+  Validation sweep (-c, --check):
+      util/kahns_size.sh --check <outdir>
+    Iterates every design in the benchmark directory and emits one line per
+    design: OK, FAIL (with the stage and metrics that disagreed), or SKIP
+    (missing artifacts). Exits non-zero if any design fails, which makes it
+    CI-friendly. Any FAIL is a real correctness bug -- Kahn's must produce
+    the same netlist as the original BFS.
+
+
+9.5 Operational checklist
+
+Running a full sweep across several designs:
+
+  1. Build OpenROAD with Kahn's: the flag sta_use_kahns_bfs defaults to 1.
+  2. From flow/, choose the target and the design list. For example:
+       util/kahns_benchmark.sh -t finish -o kahns_bench_gf12 \
+         $(ls -d designs/gf12/*/config.mk)
+  3. While it runs, tail the most recent per-design stdout log to follow
+     progress and verify the wrapper breadcrumb:
+       tail -f "$(ls -t kahns_bench_gf12/*.log | head -1)" | grep -i "kahns-wrap\|error"
+  4. Validate correctness once designs finish:
+       util/kahns_size.sh --check kahns_bench_gf12
+     Address any FAIL before trusting the runtime numbers.
+  5. Compare per-step runtimes:
+       util/kahns_compare.sh kahns_bench_gf12
+     Interpret in the context of design size:
+       util/kahns_size.sh kahns_bench_gf12
+
+
+9.6 Additional conventions
+
+  - Always run KAHNS_BFS=0 first, then KAHNS_BFS=1. The OFF pass is the
+    baseline; running OFF first avoids any chance that a bug in the ON
+    path could corrupt shared state and affect a subsequent OFF run.
+  - Target choice: -t route is usually enough for STA-feature benchmarking.
+ -t finish adds fillers / final report which do not exercise Kahn's much. + - Parallelism: ORFS exports NUM_CORES to OpenROAD's -threads flag. + Kahn's and the original BFS both respect this. A fair comparison must + use identical thread counts. + - Disk usage: each artifact directory copies the per-design logs tree. + Budget a few hundred MB per design for a finish sweep. + - Clean up between sweeps: kahns_benchmark.sh always runs make clean_all + before each design's first iteration. No manual cleanup is required. + + +10. TEST RESULTS OpenSTA standalone: 6109/6109 tests PASS with Kahn's enabled. -OpenROAD full regression: All tests PASS, including rmp.gcd_restructure (the test that surfaced both findings above), rsz (incremental netlist modification), and cts (buffer insertion with re-timing). +OpenROAD full regression: All tests PASS, including rmp.gcd_restructure (the test that surfaced both findings in Section 7), rsz (incremental netlist modification), and cts (buffer insertion with re-timing). + +ORFS A/B runtime benchmarks (Section 9): in progress. An initial sweep across the gf12 and rapidus2hp design sets is running using util/kahns_benchmark.sh. Completed designs to date show Kahn's at parity or slightly slower on small designs (e.g. gf12/aes: +12s / +3% total over 375s baseline), consistent with the Section 8 prediction that the discovery-pass overhead dominates when the active graph is small. Larger designs (gf12/bp_quad, gf12/ariane133, bp_dual) are still pending; this section will be updated with their numbers and the per-step breakdown as each finishes. Correctness (netlist-size match between OFF and ON) is verified after each design via util/kahns_size.sh --check. -10. LIMITATIONS OF THE CURRENT APPROACH +11. LIMITATIONS OF THE CURRENT APPROACH The current implementation is correct and matches the original BFS at parity on small designs, but several limitations remain: @@ -151,8 +291,10 @@ Per-call active_vertices allocation. 
The KahnState persistence avoids re-allocating this bookkeeping on every call, but the active-vertex set itself is still rebuilt per invocation.

Recursive dispatch cost for small workloads. Each ready vertex is dispatched as its own DispatchQueue task. The dispatch lock and condition-variable signaling cost is tiny per task, but for active sets smaller than the thread count the parallelism benefit may not offset the dispatch overhead.

+No Kahn's when dynamic loop breaking is enabled. sta_dynamic_loop_breaking decides whether a disabled-loop edge is traversable based on arrival tags that only appear during propagation, which Kahn's upfront-discovery model cannot consult. visitParallel therefore falls back to the original level-based BFS whenever dynamicLoopBreaking() is true. The Tcl toggle sta_use_kahns_bfs still reads normally, but the traversal uses the original path. See Section 7, Finding 3 for details.

-11. FUTURE ROADMAP
+
+12. FUTURE ROADMAP

The following enhancements extend the current Kahn's-based incremental timing implementation. They address known limitations in the existing approach and are orthogonal to Kahn's itself -- each can be layered on top of the existing implementation independently. Items are listed in rough order of payoff relative to effort.

diff --git a/search/Bfs.cc b/search/Bfs.cc
index 390ad741..28e9b6d1 100644
--- a/search/Bfs.cc
+++ b/search/Bfs.cc
@@ -219,8 +219,15 @@
   if (!empty()) {
     if (thread_count == 1)
       visit_count = visit(to_level, visitor);
-    else if (!variables_->useKahnsBfs() || !kahn_pred_) {
+    else if (!variables_->useKahnsBfs()
+             || !kahn_pred_
+             || variables_->dynamicLoopBreaking()) {
       // Original level-based parallel BFS with per-level barriers.
+      // dynamic_loop_breaking enables disabled-loop edges based on
+      // arrival tags that only emerge during propagation. Kahn's
+      // discovery runs before any propagation and cannot see those
+      // tags, so we fall back to the original BFS whenever dynamic
+      // loop breaking is active.
      std::vector<VertexVisitor*> visitors;
       visitors.reserve(thread_count_);
       for (int k = 0; k < thread_count_; k++)

From 2daf7a9bb73c59dfe6bb736a150336000b55301e Mon Sep 17 00:00:00 2001
From: dsengupta0628
Date: Tue, 21 Apr 2026 22:00:36 +0000
Subject: [PATCH 8/8] remove design specific info from doc

Signed-off-by: dsengupta0628
---
 docs/KahnsBFS_Spec.txt | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/docs/KahnsBFS_Spec.txt b/docs/KahnsBFS_Spec.txt
index 5e18bef1..ab5b7598 100644
--- a/docs/KahnsBFS_Spec.txt
+++ b/docs/KahnsBFS_Spec.txt
@@ -237,19 +237,20 @@ Extracts design-size metrics (instance count, net count, IO count, cell area) at

 Running a full sweep across several designs:

   1. Build OpenROAD with Kahn's: the flag sta_use_kahns_bfs defaults to 1.
-  2. From flow/, choose the target and the design list. For example:
-       util/kahns_benchmark.sh -t finish -o kahns_bench_gf12 \
-         $(ls -d designs/gf12/*/config.mk)
+  2. From flow/, choose the target, the output directory, and the design list.
+     For example, pick one or more config.mk paths from a platform of interest
+     (under flow/designs/<platform>/) and pass them on the command line:
+       util/kahns_benchmark.sh -t finish -o <outdir> <config.mk> ...
   3. While it runs, tail the most recent per-design stdout log to follow
      progress and verify the wrapper breadcrumb:
-       tail -f "$(ls -t kahns_bench_gf12/*.log | head -1)" | grep -i "kahns-wrap\|error"
+       tail -f "$(ls -t <outdir>/*.log | head -1)" | grep -i "kahns-wrap\|error"
   4. Validate correctness once designs finish:
-       util/kahns_size.sh --check kahns_bench_gf12
+       util/kahns_size.sh --check <outdir>
     Address any FAIL before trusting the runtime numbers.
   5. Compare per-step runtimes:
-       util/kahns_compare.sh kahns_bench_gf12
+       util/kahns_compare.sh <outdir>
      Interpret in the context of design size:
-       util/kahns_size.sh kahns_bench_gf12
+       util/kahns_size.sh <outdir>
@@ -274,7 +275,7 @@ OpenSTA standalone: 6109/6109 tests PASS with Kahn's enabled.
OpenROAD full regression: All tests PASS, including rmp.gcd_restructure (the test that surfaced both findings in Section 7), rsz (incremental netlist modification), and cts (buffer insertion with re-timing). -ORFS A/B runtime benchmarks (Section 9): in progress. An initial sweep across the gf12 and rapidus2hp design sets is running using util/kahns_benchmark.sh. Completed designs to date show Kahn's at parity or slightly slower on small designs (e.g. gf12/aes: +12s / +3% total over 375s baseline), consistent with the Section 8 prediction that the discovery-pass overhead dominates when the active graph is small. Larger designs (gf12/bp_quad, gf12/ariane133, bp_dual) are still pending; this section will be updated with their numbers and the per-step breakdown as each finishes. Correctness (netlist-size match between OFF and ON) is verified after each design via util/kahns_size.sh --check. +ORFS A/B runtime benchmarks (Section 9): in progress. An initial sweep across several platform/design combinations is running using util/kahns_benchmark.sh. Completed designs to date show Kahn's at parity or slightly slower on small designs (for a representative small design the measured overhead was roughly +3% / +12s over a ~375s baseline), consistent with the Section 8 prediction that the discovery-pass overhead dominates when the active graph is small. Larger designs are still pending; this section will be updated with their numbers and the per-step breakdown as each finishes. Correctness (netlist-size match between OFF and ON) is verified after each design via util/kahns_size.sh --check. 11. LIMITATIONS OF THE CURRENT APPROACH