updated doc to remove blocked content from doc

Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>
updated doc to include testing strategies with vtune
2026-04-22 21:02:57 +00:00 · 2026-04-22 20:59:05 +00:00 · 2026-04-22 16:55:15 +00:00 · 2026-04-22 15:33:45 +00:00 · 2026-04-22 15:09:36 +00:00 · 2026-04-22 14:24:41 +00:00
17 changed files with 650 additions and 47 deletions
--- a/docs/KahnsBFS_Spec.txt
+++ b/docs/KahnsBFS_Spec.txt
@@ -27,15 +27,36 @@ Stage 1: Discovery and In-Degree Counting (single-threaded)

 Starting from the seed vertices already in the BFS queue, a forward BFS discovers all reachable vertices following the same edge-filtering rules used by the original traversal. As each new vertex is discovered, its in-degree (number of active predecessors) is recorded in a flat array indexed by graph vertex ID. Seed vertices have in-degree zero.
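
The Stage 1 pass described above can be sketched as a plain BFS over an adjacency list. This is only an illustration under stated assumptions: discover_in_degrees and the successors vector are hypothetical names, edge filtering is omitted, and -1 is used to mark vertices outside the active set (the same sentinel the spec mentions for in_deg later on). It is not the code in search/Bfs.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical sketch of Stage 1: forward BFS from the seeds that records,
// for every reachable vertex, how many reachable predecessors it has.
// Returns a flat array indexed by vertex ID; -1 means "never discovered".
std::vector<int> discover_in_degrees(
    const std::vector<std::vector<int>>& successors,
    const std::vector<int>& seeds) {
  std::vector<int> in_degree(successors.size(), -1);
  std::deque<int> frontier;
  for (int seed : seeds) {
    assert(seed >= 0 && static_cast<std::size_t>(seed) < successors.size());
    if (in_degree[seed] < 0) {
      in_degree[seed] = 0;  // seeds start with in-degree zero
      frontier.push_back(seed);
    }
  }
  while (!frontier.empty()) {
    int v = frontier.front();
    frontier.pop_front();
    for (int w : successors[v]) {
      if (in_degree[w] < 0) {  // first time w is seen: join the active set
        in_degree[w] = 0;
        frontier.push_back(w);
      }
      ++in_degree[w];  // one count per active predecessor edge
    }
  }
  return in_degree;
}
```

Each discovered vertex is expanded exactly once, so every active edge contributes exactly one increment. The sketch assumes no seed is a successor of another active vertex; a seed that is would end up with a non-zero count here.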

-Stage 2: Recursive-Dispatch Parallel Traversal (multi-threaded)
+Stage 2: Parallel Traversal (multi-threaded)

-The unit of scheduling is a single ready vertex. All zero-in-degree vertices are initially dispatched as separate tasks into the existing DispatchQueue thread pool. Each task does three things:
+The unit of scheduling is a single ready vertex. All zero-in-degree vertices are initially handed to workers in the DispatchQueue thread pool as "seed shards"; workers then expand the frontier themselves. Each worker task does three things for every vertex it processes:

  1. Visit the vertex (computing arrivals or required times).
  2. Atomically decrement the in-degree of each successor.
-  3. If any successor's in-degree reaches zero, dispatch that successor immediately as a new task into the same DispatchQueue.
+  3. If any successor's in-degree reaches zero, that successor becomes ready to visit and is enqueued for processing.

-A single finishTasks() call at the end waits for all dispatched work -- including tasks dispatched recursively from within running tasks -- to complete. There are no per-batch or per-level barriers. A worker thread that makes a successor ready sends it straight into the pool, where any idle thread can pick it up without waiting for unrelated tasks to finish. The DispatchQueue uses condition_variable internally, so idle threads block efficiently rather than spinning.
+A single finishTasks() call at the end waits for all work -- including newly-ready successors produced during running tasks -- to complete. There are no per-batch or per-level barriers. Any idle worker can pick up a ready vertex without waiting for unrelated tasks to finish. The DispatchQueue uses condition_variable internally, so idle threads block efficiently rather than spinning.
+
+The initial implementation dispatched each newly-ready successor individually back through the DispatchQueue, which serialized on its internal mutex at high thread counts; Section 13 documents the per-worker local-batching fix that moved Stage 2 to a mostly lock-free scheduling model. The concurrency contract on the in-degree counter itself, described next, is independent of the scheduling choice.
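
The three per-vertex steps listed above can be sketched with a single worker draining its own ready list. This is a minimal illustration, not the visitParallel implementation: drain_single_worker is a hypothetical name, "visiting" is reduced to recording the vertex ID, and only one thread runs, so the atomic is there to show the shape of the decrement rather than to arbitrate a real race.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-worker sketch of the Stage 2 loop.
// in_degree_init comes from Stage 1; -1 marks vertices outside the active set.
std::vector<int> drain_single_worker(
    const std::vector<std::vector<int>>& successors,
    const std::vector<int>& in_degree_init) {
  const std::size_t n = successors.size();
  assert(in_degree_init.size() == n);
  std::vector<std::atomic<int>> in_degree(n);
  std::vector<int> local_ready;
  for (std::size_t v = 0; v < n; ++v) {
    in_degree[v].store(in_degree_init[v], std::memory_order_relaxed);
    if (in_degree_init[v] == 0)
      local_ready.push_back(static_cast<int>(v));  // initially ready vertices
  }
  std::vector<int> visit_order;
  while (!local_ready.empty()) {
    int v = local_ready.back();
    local_ready.pop_back();
    visit_order.push_back(v);  // step 1: visit the vertex
    for (int w : successors[v]) {
      if (in_degree_init[w] < 0)
        continue;  // successor is not in the active set
      // step 2: atomically decrement the successor's in-degree
      int prev = in_degree[w].fetch_sub(1, std::memory_order_acq_rel);
      if (prev == 1)
        local_ready.push_back(w);  // step 3: successor just became ready
    }
  }
  return visit_order;
}
```

Because a vertex is pushed only when its counter reaches zero, every vertex appears in visit_order after all of its active predecessors, with no per-level barrier involved.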
+
+
+Stage 2 concurrency: atomic fetch_sub
+
+Two workers can visit predecessors of the same successor simultaneously. Suppose vertex C has two active predecessors A and B, so in_degree[C] == 2 after Stage 1. Worker W1 visits A, W2 visits B, and both reach the successor decrement concurrently:
+
+    int prev = in_degree[C].fetch_sub(1, std::memory_order_acq_rel);
+    if (prev == 1)
+      <enqueue C for visiting>
+
+Three properties make this correct without locks.
+
+Atomic RMW prevents lost updates. fetch_sub compiles to LOCK XADD on x86 (equivalent atomic RMW on other architectures); the hardware guarantees the read-decrement-write triple is indivisible. Two workers cannot both read the same value and each write the same decremented value. Cache coherence serializes the two attempts into a total order and returns each caller a distinct before-value.
+
+Unique-winner detection. fetch_sub returns the old value. For in_degree[C] transitioning 2 -> 1 -> 0, exactly one worker sees prev == 1 -- the one whose decrement produced zero -- and is uniquely responsible for enqueueing C. The other sees prev == 2 and does nothing. No single worker "owns" C; ownership emerges from the atomic result. This makes the "who enqueues" question self-answering -- no external coordination, no duplicate visits.
+
+Memory ordering: acq_rel. W1's visit(A) writes arrival state that C's eventual visit(C) will later read. Without acq_rel, those writes could be reordered past the fetch_sub, so the worker that wins C could observe a stale predecessor state. Release on the store side makes everything W1 wrote before the decrement visible to anyone observing the decrement's effect; acquire on the load side guarantees the winning worker sees all earlier predecessor writes. This establishes the happens-before edge that makes "visit predecessors before successors" hold at the memory level, not just at the scheduling level.
+
+No lock is required for the decrement or the ownership handoff itself. The shared DispatchQueue mutex is touched only at initial seeding and at spill-overflow; see Section 13.1.3.
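
The three properties above can be exercised in a small standalone program. The sketch below is an illustration only: join_predecessors, JoinResult, and the arrival vector are hypothetical stand-ins, and each "visit" is a single plain write. Every predecessor thread writes its slot, then decrements the shared counter; the thread that observes prev == 1 reads all slots.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct JoinResult {
  int winners;              // how many threads observed prev == 1
  long sum_seen_by_winner;  // predecessor data as read by the winner
};

// Hypothetical sketch: pred_count workers each "visit" one predecessor of a
// shared successor and then decrement the successor's in-degree.
JoinResult join_predecessors(int pred_count) {
  assert(pred_count > 0);
  std::vector<long> arrival(pred_count, 0);  // plain, non-atomic state
  std::atomic<int> in_degree{pred_count};
  std::atomic<int> winners{0};
  long sum_seen = -1;  // written only by the unique winner
  std::vector<std::thread> workers;
  for (int i = 0; i < pred_count; ++i) {
    workers.emplace_back([&, i] {
      arrival[i] = i + 1;  // visit(predecessor i): write before the decrement
      int prev = in_degree.fetch_sub(1, std::memory_order_acq_rel);
      if (prev == 1) {  // this decrement produced zero: unique winner
        winners.fetch_add(1, std::memory_order_relaxed);
        long sum = 0;
        for (long a : arrival)
          sum += a;  // acquire side: all predecessor writes are visible
        sum_seen = sum;
      }
    });
  }
  for (std::thread& w : workers)
    w.join();
  return {winners.load(), sum_seen};
}
```

Exactly one thread takes the prev == 1 branch regardless of interleaving, and the release/acquire chain through the counter means its reads of arrival are not a data race even though the slots are ordinary variables.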


 4. IMPLEMENTATION DETAILS
@@ -288,9 +309,19 @@ Full forward sweep for slack queries. Slack at pin P is computed as required(P)

 Over-invalidation in the dirty set. The incremental framework's invalid_arrivals_ and invalid_requireds_ sets are tracked conservatively. Some edge-delay or pin changes invalidate more vertices than strictly necessary; the visitor detects no change and does no further propagation, but we still paid for the visit. A more precise validity analysis could prune the seed set before the BFS starts.

-Per-call active_vertices allocation. The KahnState persistence avoids re-allocating in_degree_init and in_degree across calls, but the active_vertices vector is rebuilt every call. For very frequent small updates this has some overhead.
+Per-call active_vertices allocation. (Addressed in Section 13.) The active_vertices, seeds, and per-worker local_ready vectors used to be rebuilt every call. They now live in KahnState alongside in_degree_init and in_degree -- cleared per call, capacity retained -- so incremental flows with many small visitParallel invocations no longer pay a re-allocation on every call.

-Recursive dispatch cost for small workloads. Each ready vertex is dispatched as its own DispatchQueue task. The dispatch lock and condition-variable signaling cost is tiny per task, but for active sets smaller than the thread count the parallelism benefit may not offset the dispatch overhead.
+Recursive dispatch cost for small workloads. (Addressed in Section 13.) The original recursive-dispatch scheme called DispatchQueue::dispatch per ready successor, which contended heavily on the shared mutex at high thread counts. This has been replaced with per-worker local batches (Section 13) that drain in-line and spill only when deep. The steady-state dispatch count dropped from O(visited_vertices) to O(thread_count + spills).
+
+Ref-pin-delay vertex orphaning in cleanup. (Addressed in Section 13.2.) ArrivalVisitor::enqueueRefPinInputDelays pushes vertices into queue_ during Stage 2 that are outside the precomputed Kahn active set. The post-Stage-2 cleanup used to call queue_[level].clear(), which dropped the pointer while leaving bfsInQueue=true stuck on the vertex and silently suppressing all future enqueue() calls for it. Cleanup now filters, keeping bfsInQueue=true entries for the next call.
+
+Eager Kahn's walk past reg CK under clks_only. (Addressed in Section 13.3.) Kahn's Stage 1 discovery used to follow graph successors unconditionally. Under findClkArrivals (ArrivalVisitor with clks_only=true) the original BFS stops at register CK via postponeClkFanouts; Kahn's now mirrors that stop via VertexVisitor::stopDiscoveryAtRegClk() so the discovered active set is narrowed identically. Kahn's itself still runs; only the reg-CK fanout is excluded.
+
+Eager visits in Kahn's traversal. Beyond the clks_only case fixed above, the general "every vertex in the active set is visited regardless of whether its arrival changes" limitation remains. This is a fundamental consequence of the in-degree counting model -- each predecessor must decrement each successor's counter exactly once, so skipping decrements is not allowed. The original BFS short-circuits via an "arrivals unchanged" check at the visitor level and avoids enqueuing downstream when no change occurred. We do not. For most designs the cost is small because the visitor itself detects no change and terminates quickly, but in deep-pipeline designs with many cascaded unchanged vertices the extra visits add up. Section 12 item 1 (visit-level change short-circuit) is the roadmap fix.
+
+Full forward sweep for slack queries. Slack at pin P is computed as required(P) minus arrival(P). The required-time backward BFS is already scoped to P's level. But the forward arrival BFS is not scoped to P's fanin cone -- it propagates from all dirty seeds to all endpoints they can reach. For a single-point slack query on a design with large independent cones, most of the forward work is spent on endpoints the query does not care about.
+
+Over-invalidation in the dirty set. The incremental framework's invalid_arrivals_ and invalid_requireds_ sets are tracked conservatively. Some edge-delay or pin changes invalidate more vertices than strictly necessary; the visitor detects no change and does no further propagation, but we still paid for the visit. A more precise validity analysis could prune the seed set before the BFS starts.

 No Kahn's when dynamic loop breaking is enabled. sta_dynamic_loop_breaking decides whether a disabled-loop edge is traversable based on arrival tags that only appear during propagation, which Kahn's upfront-discovery model cannot consult. visitParallel therefore falls back to the original level-based BFS whenever dynamicLoopBreaking() is true. The Tcl toggle sta_use_kahns_bfs still reads normally, but the traversal uses the original path. See Section 7, Finding 3 for details.

@@ -338,3 +369,241 @@ Benefit: Eliminates overhead for small incremental updates while preserving thro
 Objective: Amortize cone-computation cost when multiple slack queries are issued in sequence.
 Approach: When several slack queries arrive together, compute the union of their backward cones once and perform a single scoped forward sweep across the combined cone, rather than repeating the cone computation and forward traversal per query.
 Benefit: Reduces redundant work in reporting flows that issue many related queries, such as full endpoint slack reports or path-group summaries.
+
+
+13. POST-INTEGRATION FINDINGS AND FIXES
+
+After the initial Kahn's implementation shipped, three follow-up issues were identified: a performance regression on large designs (per-vertex dispatch overhead), a latent correctness bug with reference-pin input delays, and a performance regression under the clks_only (findClkArrivals) path. This section documents each, the fix, and the shared diagnostic harness used to catch future regressions.
+
+
+13.1 Parallel-dispatch overhead
+
+13.1.1 Observed symptom
+
+On a large SoC-scale A/B sweep with 32 threads, the clock tree synthesis step ran ~24% slower with Kahn's enabled than with Kahn's disabled. Other STA-heavy steps showed similar regressions in the ~20-25% range. Parity on the OpenSTA standalone regression suite (Section 8) and on small ORFS designs was undisturbed; the slowdown was specific to designs whose active subgraph during arrival propagation is large.
+
+
+13.1.2 Root cause
+
+Two pieces of evidence pinned the issue to DispatchQueue contention rather than algorithmic cost:
+
+  - Isolated STA measurement (harness in Section 13.4.5) showed the same
+    ~25% gap on a full arrival sweep (report_checks -path_delay max)
+    when loading a post-CTS database directly, outside of any ORFS
+    step. The effect was not CTS-specific; it was the STA engine.
+
+  - A dispatch() counter added to DispatchQueue revealed that Kahn's
+    issued roughly 5-7x more dispatch() calls than the original
+    level-based BFS on identical workloads. Each dispatch() call
+    acquires DispatchQueue::lock_ (a std::mutex) and signals a
+    condition_variable; at 32 threads, the hundreds of thousands of
+    mutex acquisitions produced severe contention.
+
+The algorithmic atomics in the hot path (in_degree[sid].fetch_sub(1, memory_order_acq_rel)) are lock-free and do not loop — they map to a single LOCK XADD on x86 — so the contention was not in the algorithm. It was in the work-queue: the level-based BFS dispatches O(thread_count * heavy_levels) tasks, while the original Kahn's dispatched O(visited_vertices) tasks. On a ~2k-cell reference design this was a 5.2x ratio; on designs where the active subgraph is hundreds of thousands of vertices, the ratio and the wall-clock impact grow accordingly.
+
+
+13.1.3 Fix: per-worker local batches with spill
+
+The recursive-dispatch worker lambda in visitParallel was replaced with a per-worker local ready-batch model:
+
+  - Each DispatchQueue worker runs with a stable thread-id tid. The
+    Kahn's branch allocates a stack-local std::vector<Vertex *>
+    local_ready[tid], one slot per worker, and the worker drains its
+    own slot in-line with a while-loop rather than calling dispatch().
+  - When a successor's in-degree transitions to zero, it is pushed to
+    the current worker's batch instead of dispatched. The same worker
+    pops it on the next drain iteration.
+  - When the local batch exceeds a spill threshold
+    (kKahnBatchSpillThreshold = 64), the older half is handed back to
+    DispatchQueue via a batched dispatch() so idle workers can steal.
+    This keeps the worker's most-recent frontier hot locally while
+    preventing starvation in the presence of fan-out spikes.
+  - Initial seeding no longer dispatches one task per seed. Seeds are
+    sharded round-robin across min(thread_count, seeds.size())
+    dispatches; each dispatched task pre-loads its shard into the
+    worker's local batch and then enters the drain loop.
+
+Per-tid exclusivity is guaranteed because DispatchQueue workers are fixed-id (each worker thread is constructed with its index i in dispatch_thread_handler). Two concurrent tasks never share a tid, so each local_ready[tid] slot is touched only by its owning worker and no locking on the vector is needed. The in-degree fetch_sub retains its memory_order_acq_rel ordering; it is the happens-before edge that establishes visitor-writes-before-successor-reads. Switching from shared dispatch to local push does not weaken that ordering: on the local path the thread that wins the decrement is the thread that subsequently visits the successor, and on the spill path the successor is handed over through DispatchQueue::dispatch(), which acquires DispatchQueue::lock_ as it did in the original scheme. The setBfsInQueue(bfs_index_, false) call still fires exactly once per vertex, at the top of each drain-loop iteration.
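
The seeding and spill rules above can be sketched as two small helpers. These are illustrations under stated assumptions, not the code in visitParallel: shard_seeds and spill_older_half are hypothetical names, vertices are plain ints, and the worker is assumed to pop from the back of its batch, so the "older half" is taken from the front. kSpillThreshold mirrors the kKahnBatchSpillThreshold value of 64 quoted in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kSpillThreshold = 64;  // value quoted in the spec

// Round-robin the seeds into min(thread_count, seeds.size()) shards.
// Each shard would be pre-loaded into one worker's local batch.
std::vector<std::vector<int>> shard_seeds(const std::vector<int>& seeds,
                                          std::size_t thread_count) {
  const std::size_t shard_count = std::min(thread_count, seeds.size());
  std::vector<std::vector<int>> shards(shard_count);
  for (std::size_t i = 0; i < seeds.size(); ++i)
    shards[i % shard_count].push_back(seeds[i]);
  return shards;
}

// If the local batch exceeds the threshold, remove and return its older
// half (assumed to sit at the front); otherwise return an empty vector.
// The returned vertices are what would be handed back to the shared queue.
std::vector<int> spill_older_half(std::vector<int>& local_ready) {
  std::vector<int> spilled;
  const std::size_t original_size = local_ready.size();
  if (original_size > kSpillThreshold) {
    auto half = local_ready.begin() +
                static_cast<std::ptrdiff_t>(original_size / 2);
    spilled.assign(local_ready.begin(), half);
    local_ready.erase(local_ready.begin(), half);
  }
  assert(spilled.size() + local_ready.size() == original_size);
  return spilled;
}
```

With this shape, a batch of 65 vertices spills its 32 oldest entries and keeps the 33 most recent ones local, which matches the stated goal of keeping the hot frontier on the worker that produced it.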
+
+
+13.1.4 Measured outcome
+
+  - Dispatch-count regression: the ON/OFF ratio on the reference
+    regression (Section 13.4.2) dropped from 5.22x to 0.27x. Kahn's
+    now dispatches fewer tasks than the level-based BFS, because
+    level-BFS still pays thread_count dispatches per heavy level
+    while Kahn's pays only shard-seeds plus a small number of spills.
+
+  - Wall-clock regression on the large-design STA benchmark
+    (Section 13.4.5): the ~25% slowdown on full arrival propagation
+    at 32 threads disappeared. On an isolated run, ON matched OFF
+    within measurement noise (~2%).
+
+Kahn's is not yet faster than level-BFS on the isolated post-CTS sweep, which was expected — the algorithmic advantage of Kahn's (no per-level barrier) only materializes when barriers dominate, which is not the case at this thread count on designs whose visit() body is the cost center. What the fix delivers is elimination of the overhead regression: Kahn's is safe to leave enabled by default.
+
+
+13.2 Ref-pin-delay vertex orphaning in cleanup
+
+13.2.1 Observed symptom
+
+No ORFS flow shows a runtime symptom, because no shipped regression exercises set_input_delay -reference_pin; the issue was flagged by external review. It manifests as missed arrival propagation on any subsequent invalidate + report_checks after a Kahn's run that triggered enqueueRefPinInputDelays.
+
+13.2.2 Root cause
+
+ArrivalVisitor::visit() on a clock-like vertex calls enqueueRefPinInputDelays(ref_pin, sdc), which iterates sdc->refPinInputDelays(ref_pin) and calls arrival_iter_->enqueue(target_vertex) for each. That enqueue sets bfsInQueue=true on the target vertex and appends a pointer into queue_[level]. If the target was outside Kahn's precomputed active set (in_deg[target_vid] stays -1), the Stage 2 worker never visits it and never clears bfsInQueue.
+
+The original post-Stage-2 cleanup was queue_[level].clear(). This dropped the pointer but left bfsInQueue stuck at true on the vertex. Every subsequent enqueue() on the same vertex short-circuited via the "already queued" check in BfsIterator::enqueue, silently suppressing its future propagation. A test that did arrivals_invalid and then report_checks produced reports missing paths through the ref-pin target.
+
+13.2.3 Fix: preserve-queued cleanup
+
+search/Bfs.cc now has a helper BfsIterator::dropProcessedEntries(first, last, to_level) that walks each level's queue_ entries, keeps those where v->bfsInQueue(bfs_index_) is still true, and drops the rest via erase-remove. It is called from both the active_count==0 early-return and the post-Stage-2 cleanup. The invariant enforced is "after visitParallel, queue_ contains exactly the vertices the caller can legitimately treat as queued" -- matching the enqueue/dequeue contract the rest of BfsIterator relies on.
+
+Ref-pin targets enqueued during Stage 2 with bfsInQueue=true now survive cleanup and become seeds on the next call, so their propagation is no longer silently dropped.
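
The preserve-queued cleanup for one level can be sketched with the erase-remove idiom. This is an illustration only: SketchVertex and drop_processed_entries are hypothetical stand-ins, the bfsInQueue flag is modeled as a plain bool, and the real helper also handles the level range and resetLevelBounds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a graph vertex with a single BFS queue flag.
struct SketchVertex {
  int id;
  bool bfs_in_queue;
};

// Drop entries whose flag is already false (visited in Stage 2) and keep the
// ones that are still marked queued. Returns the number of entries dropped.
std::size_t drop_processed_entries(std::vector<SketchVertex*>& level_queue) {
  const std::size_t before = level_queue.size();
  auto first_dropped = std::remove_if(
      level_queue.begin(), level_queue.end(), [](const SketchVertex* v) {
        assert(v != nullptr);
        return !v->bfs_in_queue;  // already processed: safe to drop
      });
  level_queue.erase(first_dropped, level_queue.end());
  return before - level_queue.size();
}
```

Compared with an unconditional clear(), the queue and the per-vertex flag stay consistent: no vertex is left with the flag set but no queue entry behind it.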
+
+13.2.4 Verification: search/test/search_kahns_bfs_refpin.tcl
+
+See Section 13.4.3.
+
+
+13.3 Narrowed Kahn's discovery under clks_only
+
+13.3.1 Observed symptom
+
+Runtime regression on the findClkArrivals path (exercised by report_clock_skew, report_clock_latency, and any command routed through Sta::ensureClkArrivals). Flagged by external review. Level-BFS stops at reg CK via postponeClkFanouts (see ArrivalVisitor::visit at Search.cc:1179: if clks_only_ && vertex->isRegClk() then postpone, else enqueue fanout). Kahn's Stage 1 discovery walked graph successors unconditionally, so it pre-discovered the entire data fanout past every register CK even on a clocks-only pass. Stage 2 then visited all of it, turning a narrow clock-arrival pass into broad propagation work.
+
+13.3.2 Fix: visitor-directed stop at reg CK
+
+Added VertexVisitor::stopDiscoveryAtRegClk() (include/sta/VertexVisitor.hh), returning false by default. ArrivalVisitor overrides it to return clks_only_. search/Bfs.cc queries this once at the top of the Kahn's branch and captures it as stop_at_reg_clk. Two gates respect the flag:
+
+  - Stage 1 discovery (the disc_idx loop in visitParallel): if
+    stop_at_reg_clk && vertex->isRegClk(), the loop skips the
+    kahnForEachSuccessor recursion. The reg CK itself is still in
+    active_vertices (it was added when its predecessor discovered
+    it); only the successor walk is pruned.
+  - Stage 2 worker lambda (the process drain loop): same check
+    before the successor-decrement kahnForEachSuccessor call.
+    Redundant with the in_deg[sid] >= 0 guard for correctness --
+    successors past CK were never added, so every decrement would
+    short-circuit -- but it avoids iterating the edge list at all.
+
+Kahn's still runs under clks_only; only the reg-CK fanout is excluded from discovery, matching the level-BFS active set exactly. postpone semantics are unchanged: the reg CK vertex is visited, ArrivalVisitor::visit still calls postponeClkFanouts, and the next non-clks_only pass consumes pending_clk_endpoints_ via enqueuePendingClkFanouts.
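
The Stage 1 gate can be sketched as a virtual query captured once before the discovery loop. This is a minimal illustration: SketchVisitor, SketchClkVisitor, and discover_active_count are hypothetical stand-ins for the real visitor classes and the disc_idx loop, and the graph is a plain adjacency list with an is_reg_clk flag per vertex.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the visitor interface: default is "do not stop".
struct SketchVisitor {
  virtual ~SketchVisitor() = default;
  virtual bool stopDiscoveryAtRegClk() const { return false; }
};

// Hypothetical stand-in for an arrival visitor that stops under clks_only.
struct SketchClkVisitor : SketchVisitor {
  explicit SketchClkVisitor(bool clks_only) : clks_only_(clks_only) {}
  bool stopDiscoveryAtRegClk() const override { return clks_only_; }
  bool clks_only_;
};

// Returns the size of the discovered active set.
std::size_t discover_active_count(
    const std::vector<std::vector<int>>& successors,
    const std::vector<bool>& is_reg_clk,
    const std::vector<int>& seeds,
    const SketchVisitor& visitor) {
  assert(is_reg_clk.size() == successors.size());
  const bool stop_at_reg_clk = visitor.stopDiscoveryAtRegClk();  // read once
  std::vector<bool> active(successors.size(), false);
  std::vector<int> active_vertices;
  for (int s : seeds) {
    if (!active[s]) {
      active[s] = true;
      active_vertices.push_back(s);
    }
  }
  for (std::size_t disc_idx = 0; disc_idx < active_vertices.size(); ++disc_idx) {
    const int v = active_vertices[disc_idx];
    if (stop_at_reg_clk && is_reg_clk[v])
      continue;  // the reg CK stays active; only its fanout walk is pruned
    for (int w : successors[v]) {
      if (!active[w]) {
        active[w] = true;
        active_vertices.push_back(w);
      }
    }
  }
  return active_vertices.size();
}
```

On a chain clock source -> reg CK -> data -> data, the clocks-only visitor yields an active set of two vertices while the default visitor yields all four.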
+
+13.3.3 Verification: search/test/search_kahns_bfs_clks_only.tcl
+
+See Section 13.4.4.
+
+
+13.4 Diagnostic harness
+
+A small harness was added so both correctness and performance of this path can be checked quickly and deterministically. The pieces are independent and can be used in isolation.
+
+
+13.4.1 Tcl accessor: sta::dispatch_call_count
+
+DispatchQueue carries a std::atomic<uint64_t> counter that increments on every dispatch() call. The counter is exposed via a Sta forwarder and a SWIG inline in util/Util.i:
+
+    set before [sta::dispatch_call_count]
+    # ... some STA work ...
+    set dispatches [expr {[sta::dispatch_call_count] - $before}]
+
+The counter is monotonic since Sta construction. There is no reset command — callers compute deltas between two reads. The counter is wall-clock-independent and therefore suitable for golden-file regressions in CI, unlike raw elapsed times.
+
+Accessor forwarding:
+    Sta::dispatchCallCount() (search/Sta.cc) -> DispatchQueue::dispatchCallCount()
+    Tcl command sta::dispatch_call_count (util/Util.i)
+
+Header visibility is confined to DispatchQueue.hh; neither StaState.hh nor Util.i reaches directly into DispatchQueue internals.
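
On the C++ side, the counter-and-delta pattern can be sketched as follows. This is an illustration only: CountingQueue and dispatch_delta are hypothetical names, the task runs inline instead of on a worker thread, and relaxed ordering is an assumption that suffices here because the counter orders nothing else.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for a queue that counts dispatch() calls.
class CountingQueue {
 public:
  void dispatch(const std::function<void()>& op) {
    dispatch_call_count_.fetch_add(1, std::memory_order_relaxed);
    op();  // the sketch runs the task inline; a real queue hands it to a worker
  }
  std::uint64_t dispatchCallCount() const {
    return dispatch_call_count_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<std::uint64_t> dispatch_call_count_{0};
};

// No reset exists, so callers measure a region by differencing two reads.
std::uint64_t dispatch_delta(CountingQueue& queue,
                             const std::function<void()>& work) {
  const std::uint64_t before = queue.dispatchCallCount();
  work();
  const std::uint64_t after = queue.dispatchCallCount();
  assert(after >= before);  // the counter is monotonic
  return after - before;
}
```

Differencing keeps the measurement independent of whatever work ran earlier in the process, which is what makes the value usable in a golden file.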
+
+
+13.4.2 Regression: search/test/search_kahns_bfs_dispatch.tcl
+
+Captures dispatch count during a report_checks (full propagation) sweep on the bundled gcd_sky130hd example, once with sta_use_kahns_bfs=0 and once with =1. thread_count=2 is fixed so the golden is portable. Prints off_dispatches, on_dispatches, and on_to_off_ratio. A future change to either BFS path shifts these numbers; the failing diff forces conscious review. Typical healthy values under the current fix are ON << OFF.
+
+Invocation from sta/search/test/:
+    ./regression search_kahns_bfs_dispatch
+
+
+13.4.3 Regression: search/test/search_kahns_bfs_refpin.tcl
+
+Covers the ref-pin-delay orphaning fix (Section 13.2). Adds set_input_delay -reference_pin [get_ports clk] on a test design so that clock propagation fires enqueueRefPinInputDelays during Stage 2. Runs report_checks OFF (baseline), then Kahn's ON (triggers the cleanup-preservation path), then arrivals_invalid + report_checks OFF (the observation point). Asserts the two OFF reports are byte-identical and prints refpin_preservation=ok. A regression that reintroduces the blind queue_[level].clear() makes the second report miss the ref-pin path and the diff fires.
+
+Invocation:
+    ./regression search_kahns_bfs_refpin
+
+
+13.4.4 Regression: search/test/search_kahns_bfs_clks_only.tcl
+
+Covers the narrowed-discovery fix (Section 13.3). Runs report_clock_skew (which routes through clkSkewPreamble -> ensureClkArrivals -> findClkArrivals with clks_only=true) on gcd_sky130hd, captures the arrival-BFS cumulative visit count with Kahn's OFF and ON, and prints clks_only_off_visits, clks_only_on_visits, clks_only_on_to_off_ratio.
+
+Visit count -- not dispatch count -- is the correct observable for this regression: the wider Kahn's walk fills per-worker batches but rarely crosses the 64-vertex spill threshold on small designs, so the dispatch counter is blind to it. Every Kahn's-discovered vertex is visited exactly once in Stage 2, so the cumulative visit count tracks active-set size one-to-one.
+
+Under the fix the ratio is close to parity (Kahn's discovers the same narrowed set as level-BFS). A regression that removes the stop makes Kahn's visit every data-path vertex downstream of reg CK and the ON count jumps sharply.
+
+Accessor: Sta::arrivalVisitCount() forwards to BfsFwdIterator::visitCountCumulative(), a std::atomic<uint64_t> on BfsIterator that accumulates visit_count at the end of each visitParallel call. Exposed to Tcl as sta::arrival_visit_count via util/Util.i.
+
+Invocation:
+    ./regression search_kahns_bfs_clks_only
+
+
+13.4.5 Isolated STA wall-clock harness: flow/util/kahns_sta_isolated.tcl
+
+For pinpointing STA-level wall-clock regressions without running ORFS steps around them, a generic Tcl harness loads a post-step ORFS database, applies the liberty set (globbed from platform/objects dirs), and times two report_checks passes -- full cold arrival+required, then incremental required only -- under a caller-supplied Kahn's mode. Required Tcl vars are kahn_mode and kahn_design_path; kahn_step defaults to 4_cts. A bash wrapper flow/util/kahns_sta_repeat.sh runs N iterations per mode, parses the STA_MS full/incr lines, and reports median/min/max/stddev so small deltas can be distinguished from measurement noise. Modes 0, 1, or both are selectable via -m.
+
+Invocation from flow/:
+    util/kahns_sta_repeat.sh -d PLATFORM/DESIGN/VARIANT [-s step] [-m mode] [-n N] [-t threads]
+
+Two separate openroad invocations are used per iteration, deliberately, to avoid any stale-arrival or process-cached state from contaminating the comparison.
+
+Per-run output. Each invocation prints the following lines so that timing, dispatch count, and visit count can be compared across OFF and ON without opening another tool:
+
+    === design=... step=... kahn_mode=N (sta_use_kahns_bfs=N) threads=N libs=N ===
+    STA_MS full:   <ms>
+    DISPATCH full: <DispatchQueue::dispatch() delta during -path_delay max>
+    VISITS full:   <arrival BFS visit delta during -path_delay max>
+    STA_MS incr:   <ms>
+    DISPATCH incr: <dispatch delta during -path_delay min>
+    VISITS incr:   <visit delta during -path_delay min>
+
+DISPATCH and VISITS come from the Section 13.4.1 counters.
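
The summary that the repeat wrapper reports over the per-iteration STA_MS values can be sketched as below. The wrapper itself is a bash script; this C++ version only illustrates the arithmetic. summarize_ms and TimingSummary are hypothetical names, and two details are assumptions because the text does not specify them: the median of an even-sized sample is taken as the mean of the two middle values, and stddev is the population form (divide by N).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct TimingSummary {
  double median;
  double min;
  double max;
  double stddev;
};

// Summarize N wall-clock samples (milliseconds) for one Kahn's mode.
TimingSummary summarize_ms(std::vector<double> samples_ms) {
  assert(!samples_ms.empty());
  std::sort(samples_ms.begin(), samples_ms.end());
  const std::size_t n = samples_ms.size();
  const double median =
      (n % 2 == 1) ? samples_ms[n / 2]
                   : 0.5 * (samples_ms[n / 2 - 1] + samples_ms[n / 2]);
  double mean = 0.0;
  for (double s : samples_ms)
    mean += s;
  mean /= static_cast<double>(n);
  double variance = 0.0;
  for (double s : samples_ms)
    variance += (s - mean) * (s - mean);
  variance /= static_cast<double>(n);  // population variance (assumption)
  return {median, samples_ms.front(), samples_ms.back(), std::sqrt(variance)};
}
```

A difference between the ON and OFF medians that is smaller than the stddev of either mode is best read as noise rather than as a regression.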
+
+
+13.4.6 VTune hotspot harness: flow/util/kahns_vtune.sh
+
+For attributing time to specific C++ functions, a wrapper runs VTune hotspot collection twice -- once with Kahn's OFF, once with ON -- against the same isolated Tcl harness (13.4.5). It uses VTune's -start-paused option together with the collector's -command resume/pause to gate sampling to only the report_checks windows, so liberty parsing, read_db, and link_design do not pollute the profile. Both result directories (off/ and on/) are written under a timestamped parent so vtune -report hotspots can directly compare the two algorithms on the same propagation workload.
+
+Invocation from flow/:
+    util/kahns_vtune.sh -d PLATFORM/DESIGN/VARIANT [-s step] [-m mode] [-t threads] [-o outdir]
+
+Output layout:
+    <outdir>/off/     vtune result dir, Kahn's OFF
+    <outdir>/off.log  full openroad stdout/stderr (with STA_MS / DISPATCH / VISITS lines)
+    <outdir>/off.tcl  generated wrapper Tcl (reproducible)
+    <outdir>/on/      vtune result dir, Kahn's ON
+    <outdir>/on.log
+    <outdir>/on.tcl
+
+Resume/pause mechanism. kahns_sta_isolated.tcl reads the env variables VTUNE_RESULT_DIR and VTUNE that kahns_vtune.sh sets. When they are present, Tcl procs vtune_resume / vtune_pause shell out "vtune -command resume -r $VTUNE_RESULT_DIR" and "vtune -command pause -r ...". These bracket each report_checks call so the captured profile contains only STA propagation work. When VTUNE_RESULT_DIR is empty (direct Tcl invocation, or the kahns_sta_repeat.sh path), the procs no-op and the harness still runs standalone. No change to openroad's build is required; the gating uses VTune's own CLI to signal the collector from inside the profiled process.
+
+Comparing results:
+    vtune -report hotspots -result-dir <outdir>/off -format=csv > off.csv
+    vtune -report hotspots -result-dir <outdir>/on  -format=csv > on.csv
+
+For a side-by-side of the top functions:
+    paste <(head -21 off.csv) <(head -21 on.csv) | column -t -s $'\t' | less -S
+
+Filtering out VTune's own symbol-resolution overhead (libdwarf, libpindwarf, libc-dynamic) makes the STA-relevant hotspots stand out:
+    awk -F, '$6 !~ /dwarf|pindwarf|libc-dynamic/' off.csv | head -20 > off.stripped
+    awk -F, '$6 !~ /dwarf|pindwarf|libc-dynamic/' on.csv  | head -20 > on.stripped
+    paste off.stripped on.stripped | column -t -s $'\t' | less -S
+
+Expected signatures under the current fix. pthread_mutex_lock/unlock CPU time drops measurably in ON -- this is the dispatch-contention fix visible at the profiler level. VertexOutEdgeIterator::next and VertexInEdgeIterator::next increase slightly in ON -- Kahn's walks each edge twice (Stage 1 discovery plus Stage 2 decrement) while level-BFS walks each edge once. These two shifts roughly cancel on designs where neither barrier nor dispatch dominates; see Section 13.1.4.
+
+Permission note. VTune's SQLite layer wants write access to <outdir>/off/sqlite-db and on/sqlite-db even in -report mode. If the collection ran under a different uid (e.g. inside a container) and the current user sees "Insufficient permissions" on report, either chown -R the result dir to the current user, or cp -a it into /tmp/ before running -report.
+
+
+The combination covers CI and investigation: Section 13.4.2 (full-propagation dispatch count) catches the most general overhead regression; Section 13.4.3 (ref-pin preservation) catches the cleanup invariant; Section 13.4.4 (clks_only visit count) catches the narrowed-discovery invariant; Section 13.4.1 exposes the raw counter for ad-hoc scripts; Section 13.4.5 validates wall-clock on real designs at full thread counts; Section 13.4.6 attributes time to C++ functions for identifying the residual Kahn-vs-level cost structure.
+
+
+13.5 What remains
+
+Kahn's at 32 threads on a large design is now at parity with level-BFS on the STA-isolated benchmark. Real algorithmic wins from barrier elimination are still expected on designs where level populations are highly uneven — the profile that originally motivated Kahn's — but measuring them requires designs larger than the reference corpus here. Future work in that direction belongs under Section 12, particularly items 1 (visit-level change short-circuit) and 3 (demand-driven forward propagation), which reduce the visit() cost that currently dominates on the designs tested.
--- a/include/sta/Bfs.hh
+++ b/include/sta/Bfs.hh
@@ -24,6 +24,8 @@

 #pragma once

+#include <atomic>
+#include <cstdint>
 #include <functional>
 #include <memory>
 #include <mutex>
@@ -97,6 +99,15 @@ public:
  // Returns the number of vertices that are visited.
  int visitParallel(Level to_level,
 		    VertexVisitor *visitor);
+  // Monotonic total of visits across all visitParallel calls since
+  // construction. Wall-clock-independent and therefore suitable for
+  // regression golden files. Used by the clks_only test to observe
+  // Stage 1 discovery narrowing directly (visits track active set
+  // size one-to-one).
+  uint64_t visitCountCumulative() const
+  {
+    return visit_count_cumulative_.load(std::memory_order_relaxed);
+  }

 protected:
  BfsIterator(BfsIndex bfs_index,
@@ -125,6 +136,14 @@ protected:
                                    SearchPred *pred,
                                    const VertexFn &fn) = 0;
  void resetLevelBounds();
+  // Post-Kahn cleanup: drop already-visited entries (bfsInQueue=false)
+  // from each level in [first, min(last, to_level)] while keeping any
+  // that are still queued. Necessary because ArrivalVisitor::
+  // enqueueRefPinInputDelays() can push vertices into queue_ during
+  // Stage 2 that are outside the precomputed Kahn active set; their
+  // bfsInQueue flag is still true and they must survive for the next
+  // visitParallel to see them as seeds. Also calls resetLevelBounds.
+  void dropProcessedEntries(Level first, Level last, Level to_level);

  // Persistent Kahn's state to avoid per-call allocation.
  struct KahnState;
@@ -141,6 +160,7 @@ protected:
  // Max (min) level of queued vertices.
  Level last_level_;
  SearchPred *kahn_pred_ = nullptr;
+  std::atomic<uint64_t> visit_count_cumulative_{0};

  friend class BfsFwdIterator;
  friend class BfsBkwdIterator;
--- a/include/sta/DispatchQueue.hh
+++ b/include/sta/DispatchQueue.hh
@@ -30,6 +30,11 @@ public:
  // Dispatch and move.
  void dispatch(fp_t&& op);
  void finishTasks();
+  // Monotonic total of dispatch() calls since construction. Exposed so
+  // the Kahn's BFS regression can measure concurrency overhead via
+  // dispatch-count deltas (Kahn's dispatches per seed shard and batch
+  // spill vs the original BFS's per level-chunk).
+  uint64_t dispatchCallCount() const;

  // Deleted operations
  DispatchQueue(const DispatchQueue& rhs) = delete;
@ -47,6 +52,7 @@ private:
  std::condition_variable cv_;
  std::atomic<size_t> pending_task_count_;
  bool quit_ = false;
+  std::atomic<uint64_t> dispatch_call_count_{0};
 };

 } // namespace sta
--- a/include/sta/Search.hh
+++ b/include/sta/Search.hh
@ -773,6 +773,10 @@ public:
  void copyState(const StaState *sta) override;
  void visit(Vertex *vertex) override;
  VertexVisitor *copy() const override;
+  // Under clks_only, visit() uses postponeClkFanouts at reg CK
+  // boundaries. Kahn's Stage 1 needs the same stop so the discovered
+  // active set matches -- otherwise Kahn's eagerly walks past CK.
+  bool stopDiscoveryAtRegClk() const override { return clks_only_; }
  // Return false to stop visiting.
  bool visitFromToPath(const Pin *from_pin,
                       Vertex *from_vertex,
--- a/include/sta/Sta.hh
+++ b/include/sta/Sta.hh
@ -122,6 +122,13 @@ public:
  // Default number of threads to use.
  virtual int defaultThreadCount() const;
  void setThreadCount(int thread_count);
+  // Cumulative dispatch() calls on the worker queue. Used by the
+  // Kahn's BFS regression to measure parallel-dispatch overhead.
+  uint64_t dispatchCallCount() const;
+  // Cumulative vertices visited by the arrival BFS iterator. Used
+  // by the clks_only regression to observe Stage 1 discovery
+  // narrowing (one visit per active-set vertex).
+  uint64_t arrivalVisitCount() const;

  // define_corners compatibility.
  void makeScenes(const StringSeq &scene_names);
--- a/include/sta/VertexVisitor.hh
+++ b/include/sta/VertexVisitor.hh
@ -37,6 +37,13 @@ public:
  virtual VertexVisitor *copy() const = 0;
  virtual void visit(Vertex *vertex) = 0;
  void operator()(Vertex *vertex) { visit(vertex); }
+  // If true, Kahn's Stage 1 discovery and successor decrement must
+  // stop at register-CK boundaries. Mirrors the clks_only branch in
+  // ArrivalVisitor::visit (postponeClkFanouts) at the discovery
+  // layer so Kahn's active set matches the narrowed one level-BFS
+  // walks under findClkArrivals. Kahn's itself still runs; only the
+  // reg-CK fanout is excluded.
+  virtual bool stopDiscoveryAtRegClk() const { return false; }
 };

 // Collect visited pins into a PinSet.
--- a/search/Bfs.cc
+++ b/search/Bfs.cc
@ -43,6 +43,12 @@ namespace sta {
 // Persistent storage for Kahn's algorithm arrays.
 // Allocated once and reused across visitParallel calls to
 // avoid repeated allocation of large per-graph arrays.
+//
+// Thread-safety: active_vertices, seeds, and the outer local_ready
+// vector are written only during the single-threaded discovery and
+// seeding phase at the top of the Kahn's branch. In Stage 2 each
+// worker mutates only its own local_ready[tid] slot. Keep it that
+// way -- sharing persistent state across workers invites data races.
 struct BfsIterator::KahnState
 {
  // -1 = not in active set, >= 0 = in-degree.
@ -52,6 +58,15 @@ struct BfsIterator::KahnState
  size_t in_degree_size = 0;
  // Vertex IDs touched in the previous call -- reset to -1 before reuse.
  std::vector<VertexId> prev_ids;
+  // Discovered active set and zero-in-degree seed roots. Cleared at
+  // the start of each call; capacity is retained so incremental
+  // flows with many small calls avoid repeated allocation.
+  std::vector<Vertex *> active_vertices;
+  std::vector<Vertex *> seeds;
+  // Per-worker local ready batches. Outer size is kept in sync with
+  // thread_count at the call site; each inner vector is cleared per
+  // call and its backing store is retained.
+  std::vector<std::vector<Vertex *>> local_ready;

  void ensureInitSize(size_t needed)
  {
@ -210,6 +225,28 @@ BfsIterator::resetLevelBounds()
  }
 }

+void
+BfsIterator::dropProcessedEntries(Level first, Level last, Level to_level)
+{
+  Level level = first;
+  while (levelLessOrEqual(level, last)
+         && levelLessOrEqual(level, to_level)) {
+    VertexSeq &level_vertices = queue_[level];
+    if (!level_vertices.empty()) {
+      auto write = level_vertices.begin();
+      // remove() nulls out entries (see BfsIterator::remove), so the
+      // null check is required -- not defensive padding.
+      for (Vertex *v : level_vertices) {
+        if (v != nullptr && v->bfsInQueue(bfs_index_))
+          *write++ = v;
+      }
+      level_vertices.erase(write, level_vertices.end());
+    }
+    incrLevel(level);
+  }
+  resetLevelBounds();
+}
+
 int
 BfsIterator::visitParallel(Level to_level,
                           VertexVisitor *visitor)
@ -294,8 +331,14 @@ BfsIterator::visitParallel(Level to_level,
      kahn_state_->resetPrevious();

      std::vector<int> &in_deg = kahn_state_->in_degree_init;
-      std::vector<Vertex*> active_vertices;
+      std::vector<Vertex *> &active_vertices = kahn_state_->active_vertices;
+      active_vertices.clear();
      VertexId max_id = 0;
+      // Under clks_only (findClkArrivals), the visitor uses
+      // postponeClkFanouts to stop at reg CK; mirror that at the
+      // discovery layer so the active set is narrowed instead of
+      // eagerly walking the full data fanout.
+      const bool stop_at_reg_clk = visitor->stopDiscoveryAtRegClk();

      // Collect seed vertices from the level queue.
      Level saved_first = first_level_;
@ -319,9 +362,13 @@ BfsIterator::visitParallel(Level to_level,
      }

      // BFS discovery -- mirrors enqueueAdjacentVertices logic.
+      // Under stop_at_reg_clk, don't recurse past reg CK vertices
+      // (matches Search.cc:1179 postponeClkFanouts semantics).
      size_t disc_idx = 0;
      while (disc_idx < active_vertices.size()) {
        Vertex *vertex = active_vertices[disc_idx++];
+        if (stop_at_reg_clk && vertex->isRegClk())
+          continue;
        kahnForEachSuccessor(vertex, kahn_pred_,
                             [&](Vertex *succ) {
          if (!levelLessOrEqual(succ->level(), to_level))
@ -345,13 +392,7 @@ BfsIterator::visitParallel(Level to_level,

      if (active_count == 0) {
        kahn_state_->prev_ids.clear();
-        level = saved_first;
-        while (levelLessOrEqual(level, saved_last)
-               && levelLessOrEqual(level, to_level)) {
-          queue_[level].clear();
-          incrLevel(level);
-        }
-        resetLevelBounds();
+        dropProcessedEntries(saved_first, saved_last, to_level);
        return 0;
      }

@ -374,11 +415,11 @@ BfsIterator::visitParallel(Level to_level,
      debugPrint(debug_, "bfs", 1, "kahns {} initial ready",
                 initial_ready_count);

-      // Phase 3: Recursive-dispatch Kahn's traversal.
-      // Each task visits its vertex, decrements successor in-degrees,
-      // and directly dispatches any successor whose in-degree hit zero
-      // back into the DispatchQueue. finishTasks() waits for all work,
-      // including recursively-dispatched tasks. No batch barriers.
+      // Phase 3: Kahn's traversal with per-worker local batches.
+      // Each worker drains a thread-local ready vector in-line,
+      // only spilling half back to DispatchQueue when the batch
+      // exceeds kKahnBatchSpillThreshold so idle workers can steal.
+      // Collapses O(vertex) dispatches to O(thread_count + spills).
      std::vector<VertexVisitor *> visitors;
      for (size_t k = 0; k < thread_count; k++)
        visitors.push_back(visitor->copy());
@ -388,37 +429,92 @@ BfsIterator::visitParallel(Level to_level,
      SearchPred *pred = kahn_pred_;
      size_t in_deg_size = in_deg.size();

+      // Steady-state fan-out rarely exceeds this; wide bursts (clock
+      // boundaries) spill so work reaches idle workers.
+      constexpr size_t kKahnBatchSpillThreshold = 64;
+      std::vector<std::vector<Vertex *>> &local_ready
+          = kahn_state_->local_ready;
+      // Resize outer to thread_count (thread_count can change between
+      // calls via sta::set_thread_count). Inner backing stores are
+      // retained; only newly-added slots allocate.
+      if (local_ready.size() != thread_count)
+        local_ready.resize(thread_count);
+      for (auto &b : local_ready) {
+        b.clear();
+        if (b.capacity() < kKahnBatchSpillThreshold * 2)
+          b.reserve(kKahnBatchSpillThreshold * 2);
+      }
+
      // Recursive task lambda: self-reference via std::function.
      // Captures persist on visitParallel's stack until finishTasks
      // returns.
      std::function<void(Vertex*, size_t)> process;
-      process = [&, bfs_index, pred, in_deg_size](Vertex *vertex,
-                                                  size_t tid) {
-        vertex->setBfsInQueue(bfs_index, false);
-        visitors[tid]->visit(vertex);
-        total_visited.fetch_add(1, std::memory_order_relaxed);
-        kahnForEachSuccessor(vertex, pred, [&](Vertex *succ) {
-          VertexId sid = graph_->id(succ);
-          if (sid < in_deg_size && in_deg[sid] >= 0) {
-            int prev = in_degree[sid]
-                .fetch_sub(1, std::memory_order_acq_rel);
-            if (prev == 1) {
-              // Successor is now ready -- dispatch immediately.
-              dispatch_queue_->dispatch([&process, succ](size_t t) {
-                process(succ, t);
+      process = [&, bfs_index, pred, in_deg_size,
+                 stop_at_reg_clk](Vertex *seed,
+                                  size_t tid) {
+        auto &batch = local_ready[tid];
+        batch.push_back(seed);
+        while (!batch.empty()) {
+          Vertex *vertex = batch.back();
+          batch.pop_back();
+          vertex->setBfsInQueue(bfs_index, false);
+          visitors[tid]->visit(vertex);
+          total_visited.fetch_add(1, std::memory_order_relaxed);
+          // Skip the edge-list walk past reg CK under stop_at_reg_clk;
+          // per-edge in_deg guard below would reject each hit anyway.
+          if (stop_at_reg_clk && vertex->isRegClk())
+            continue;
+          kahnForEachSuccessor(vertex, pred, [&](Vertex *succ) {
+            VertexId sid = graph_->id(succ);
+            if (sid < in_deg_size && in_deg[sid] >= 0) {
+              int prev = in_degree[sid]
+                  .fetch_sub(1, std::memory_order_acq_rel);
+              if (prev == 1)
+                batch.push_back(succ);
+            }
+          });
+          if (batch.size() > kKahnBatchSpillThreshold) {
+            // Hand older half back so idle workers can steal;
+            // keep the most-recent frontier hot locally.
+            size_t spill = batch.size() / 2;
+            for (size_t i = 0; i < spill; i++) {
+              Vertex *v = batch[i];
+              dispatch_queue_->dispatch([&process, v](size_t t) {
+                process(v, t);
              });
            }
+            batch.erase(batch.begin(), batch.begin() + spill);
          }
-        });
+        }
      };

-      // Seed initial ready vertices into the dispatch queue.
+      // Shard seeds across up to thread_count dispatches instead of
+      // dispatching per-seed. Each worker pre-loads its local batch
+      // with its shard and then runs the drain loop.
+      std::vector<Vertex *> &seeds = kahn_state_->seeds;
+      seeds.clear();
+      if (seeds.capacity() < static_cast<size_t>(initial_ready_count))
+        seeds.reserve(initial_ready_count);
      for (Vertex *v : active_vertices) {
-        if (in_deg[graph_->id(v)] == 0) {
-          dispatch_queue_->dispatch([&process, v](size_t t) {
-            process(v, t);
+        if (in_deg[graph_->id(v)] == 0)
+          seeds.push_back(v);
+      }
+      size_t shards = std::min<size_t>(thread_count,
+                                       std::max<size_t>(seeds.size(), 1));
+      for (size_t s = 0; s < shards; s++) {
+        std::vector<Vertex *> chunk;
+        chunk.reserve(seeds.size() / shards + 1);
+        for (size_t i = s; i < seeds.size(); i += shards)
+          chunk.push_back(seeds[i]);
+        if (chunk.empty())
+          continue;
+        dispatch_queue_->dispatch(
+          [&process, &local_ready, chunk = std::move(chunk)](size_t t) {
+            auto &batch = local_ready[t];
+            for (size_t i = 1; i < chunk.size(); i++)
+              batch.push_back(chunk[i]);
+            process(chunk[0], t);
          });
-        }
      }
      dispatch_queue_->finishTasks();

@ -427,16 +523,10 @@ BfsIterator::visitParallel(Level to_level,
      for (VertexVisitor *v : visitors)
        delete v;

-      // Clear processed levels and update bounds for remaining entries.
-      level = saved_first;
-      while (levelLessOrEqual(level, saved_last)
-             && levelLessOrEqual(level, to_level)) {
-        queue_[level].clear();
-        incrLevel(level);
-      }
-      resetLevelBounds();
+      dropProcessedEntries(saved_first, saved_last, to_level);
    }
  }
+  visit_count_cumulative_.fetch_add(visit_count, std::memory_order_relaxed);
  return visit_count;
 }

--- a/search/Sta.cc
+++ b/search/Sta.cc
@ -29,6 +29,7 @@
 #include <string>

 #include "ArcDelayCalc.hh"
+#include "Bfs.hh"
 #include "CheckCapacitances.hh"
 #include "CheckFanouts.hh"
 #include "CheckMaxSkews.hh"
@ -338,6 +339,19 @@ Sta::setThreadCount1(int thread_count)
    dispatch_queue_ = new DispatchQueue(thread_count);
 }

+uint64_t
+Sta::dispatchCallCount() const
+{
+  return dispatch_queue_ ? dispatch_queue_->dispatchCallCount() : 0;
+}
+
+uint64_t
+Sta::arrivalVisitCount() const
+{
+  BfsFwdIterator *iter = search_ ? search_->arrivalIterator() : nullptr;
+  return iter ? iter->visitCountCumulative() : 0;
+}
+
 void
 Sta::updateComponentsState()
 {
--- a/search/test/CMakeLists.txt
+++ b/search/test/CMakeLists.txt
@ -19,6 +19,9 @@ sta_module_tests("search"
    genclk_latch_deep
    genclk_property_report
    json_unconstrained
+    kahns_bfs_clks_only
+    kahns_bfs_dispatch
+    kahns_bfs_refpin
    latch
    latch_timing
    levelize_loop_disabled
--- a/search/test/search_kahns_bfs_clks_only.ok
+++ b/search/test/search_kahns_bfs_clks_only.ok
@ -0,0 +1,4 @@
+Warning 198: ../../examples/gcd_sky130hd.v line 527, module sky130_fd_sc_hd__tapvpwrvgnd_1 not found. Creating black box for TAP_11.
+clks_only_off_visits=1007
+clks_only_on_visits=333
+clks_only_on_to_off_ratio=0.33x
--- a/search/test/search_kahns_bfs_clks_only.tcl
+++ b/search/test/search_kahns_bfs_clks_only.tcl
@ -0,0 +1,46 @@
+# Regression for Kahn's BFS narrowed discovery under clks_only.
+#
+# ArrivalVisitor with clks_only=true (the path exercised by
+# Search::findClkArrivals) uses postponeClkFanouts to stop propagation
+# at register CK boundaries. Kahn's Stage 1 discovery must mirror that
+# stop via VertexVisitor::stopDiscoveryAtRegClk(); otherwise it eagerly
+# walks the full data fanout past every reg CK.
+#
+# Dispatch count alone doesn't catch the regression on small designs:
+# the wider Kahn's walk still drains through per-worker batches without
+# exceeding the spill threshold, so the dispatch counter doesn't move.
+# The direct observable is the arrival BFS visit count -- every vertex
+# Kahn's Stage 1 discovers is visited exactly once in Stage 2, so the
+# cumulative visit count tracks the active-set size one-to-one.
+
+read_liberty ../../test/sky130hd/sky130hd_tt.lib
+read_verilog ../../examples/gcd_sky130hd.v
+link_design gcd
+
+sta::set_thread_count 2
+source ../../examples/gcd_sky130hd.sdc
+
+# OFF: level-based BFS with clks_only. Level-BFS stops at reg CK via
+# ArrivalVisitor's postponeClkFanouts, so the visit count is bounded
+# to the clock network only.
+sta::arrivals_invalid
+set sta_use_kahns_bfs 0
+set before [sta::arrival_visit_count]
+report_clock_skew -setup > /dev/null
+set off_visits [expr {[sta::arrival_visit_count] - $before}]
+
+# ON: Kahn's with stop-at-reg-CK under clks_only. Stage 1 discovery
+# mirrors the level-BFS narrowing, so visit count should be in the
+# same ballpark as OFF. A regression that removes the stop will make
+# Kahn's visit every data-path vertex downstream of reg CK and this
+# number will jump sharply.
+sta::arrivals_invalid
+set sta_use_kahns_bfs 1
+set before [sta::arrival_visit_count]
+report_clock_skew -setup > /dev/null
+set on_visits [expr {[sta::arrival_visit_count] - $before}]
+
+puts "clks_only_off_visits=$off_visits"
+puts "clks_only_on_visits=$on_visits"
+puts [format "clks_only_on_to_off_ratio=%.2fx" \
+          [expr {$on_visits / double(max($off_visits, 1))}]]
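The narrowing that this test observes can be illustrated with a toy model of Stage 1 discovery. The following is a minimal sketch under stated assumptions, not the code in Bfs.cc: `ToyGraph`, `discover`, and `activeCount` are hypothetical names, the graph is a plain adjacency list, and the only rule modeled is "do not explore successors of a register clock pin when the stop flag is set".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy graph: succs[v] lists the successors of vertex v and
// is_reg_clk[v] marks register clock pins.
struct ToyGraph {
  std::vector<std::vector<int>> succs;
  std::vector<bool> is_reg_clk;
};

// Forward discovery from the seeds. Returns the in-degree per vertex, where
// -1 means "not in the active set". When stop_at_reg_clk is true, the
// successors of register clock pins are not explored.
std::vector<int> discover(const ToyGraph &g,
                          const std::vector<int> &seeds,
                          bool stop_at_reg_clk)
{
  std::vector<int> in_degree(g.succs.size(), -1);
  std::vector<int> active;
  for (int s : seeds) {
    if (in_degree[s] < 0) {
      in_degree[s] = 0;
      active.push_back(s);
    }
  }
  // The active list doubles as the BFS work list.
  for (std::size_t i = 0; i < active.size(); i++) {
    int v = active[i];
    if (stop_at_reg_clk && g.is_reg_clk[v])
      continue;
    for (int succ : g.succs[v]) {
      if (in_degree[succ] < 0) {
        in_degree[succ] = 0;
        active.push_back(succ);
      }
      in_degree[succ]++;
    }
  }
  return in_degree;
}

// Number of vertices that made it into the active set.
std::size_t activeCount(const std::vector<int> &in_degree)
{
  std::size_t n = 0;
  for (int d : in_degree) {
    if (d >= 0)
      n++;
  }
  return n;
}
```

On a chain such as clock port, buffer, register CK, register Q, data pin, the active set contains all five vertices without the stop and only the first three with it, which is the kind of drop the visit counter is meant to expose.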
--- a/search/test/search_kahns_bfs_dispatch.ok
+++ b/search/test/search_kahns_bfs_dispatch.ok
@ -0,0 +1,4 @@
+Warning 198: ../../examples/gcd_sky130hd.v line 527, module sky130_fd_sc_hd__tapvpwrvgnd_1 not found. Creating black box for TAP_11.
+off_dispatches=203
+on_dispatches=55
+on_to_off_ratio=0.27x
--- a/search/test/search_kahns_bfs_dispatch.tcl
+++ b/search/test/search_kahns_bfs_dispatch.tcl
@ -0,0 +1,31 @@
+# Performance regression: DispatchQueue dispatch() counts under Kahn's
+# BFS OFF vs ON. The count is a wall-clock-independent proxy for
+# parallel-dispatch overhead, so a shift in the golden signals a real
+# change in the BFS/dispatch strategy.
+
+read_liberty ../../test/sky130hd/sky130hd_tt.lib
+read_verilog ../../examples/gcd_sky130hd.v
+link_design gcd
+
+sta::set_thread_count 2
+source ../../examples/gcd_sky130hd.sdc
+
+# OFF phase: level-based BFS.
+set sta_use_kahns_bfs 0
+set before [sta::dispatch_call_count]
+report_checks -path_delay min_max -group_count 10 > /dev/null
+set off_dispatches [expr {[sta::dispatch_call_count] - $before}]
+
+# ON phase: Kahn's BFS. Invalidate arrivals so the iterators
+# re-propagate through visitParallel instead of returning cached
+# results.
+sta::arrivals_invalid
+set sta_use_kahns_bfs 1
+set before [sta::dispatch_call_count]
+report_checks -path_delay min_max -group_count 10 > /dev/null
+set on_dispatches [expr {[sta::dispatch_call_count] - $before}]
+
+puts "off_dispatches=$off_dispatches"
+puts "on_dispatches=$on_dispatches"
+puts [format "on_to_off_ratio=%.2fx" \
+          [expr {$on_dispatches / double(max($off_dispatches, 1))}]]
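To make the dispatch-count comparison concrete, here is a minimal single-threaded sketch of the two scheduling schemes. It is an illustration under assumptions rather than the Bfs.cc implementation: `Graph`, `Result`, `kahnTraverse`, and `makeChain` are hypothetical names, a `std::deque` stands in for DispatchQueue, one simulated worker drains everything, and seeds are dispatched one by one instead of being sharded. A `spill_threshold` of 0 models the earlier dispatch-per-ready-vertex behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Toy DAG: succs[v] lists the successors of vertex v.
using Graph = std::vector<std::vector<int>>;

struct Result {
  std::size_t visited = 0;     // vertices visited
  std::size_t dispatches = 0;  // simulated dispatch() calls
};

// Sequential simulation of Kahn's traversal with one local ready batch.
// spill_threshold == 0 dispatches every ready vertex as its own task.
Result kahnTraverse(const Graph &succs, std::size_t spill_threshold)
{
  std::vector<int> in_degree(succs.size(), 0);
  for (const auto &out : succs)
    for (int s : out)
      in_degree[s]++;

  Result r;
  std::deque<int> queue;  // stands in for DispatchQueue
  auto dispatch = [&](int v) {
    queue.push_back(v);
    r.dispatches++;
  };

  for (std::size_t v = 0; v < succs.size(); v++) {
    if (in_degree[v] == 0)
      dispatch(static_cast<int>(v));
  }

  std::vector<int> batch;
  while (!queue.empty()) {
    batch.push_back(queue.front());
    queue.pop_front();
    while (!batch.empty()) {
      int v = batch.back();
      batch.pop_back();
      r.visited++;
      for (int s : succs[v]) {
        if (--in_degree[s] == 0) {
          if (spill_threshold == 0)
            dispatch(s);
          else
            batch.push_back(s);
        }
      }
      if (spill_threshold != 0 && batch.size() > spill_threshold) {
        // Hand the older half back to the queue; keep the newest locally.
        std::size_t spill = batch.size() / 2;
        for (std::size_t i = 0; i < spill; i++)
          dispatch(batch[i]);
        batch.erase(batch.begin(),
                    batch.begin() + static_cast<std::ptrdiff_t>(spill));
      }
    }
  }
  return r;
}

// A chain 0 -> 1 -> ... -> n-1: each vertex becomes ready only after its
// predecessor has been visited.
Graph makeChain(int n)
{
  Graph g(static_cast<std::size_t>(n));
  for (int v = 0; v + 1 < n; v++)
    g[static_cast<std::size_t>(v)].push_back(v + 1);
  return g;
}
```

On a 100-vertex chain, the per-vertex scheme issues one dispatch per vertex, while the batched scheme with a threshold of 64 issues a single dispatch because the local batch never holds more than one entry. Both visit every vertex exactly once.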
--- a/search/test/search_kahns_bfs_refpin.ok
+++ b/search/test/search_kahns_bfs_refpin.ok
@ -0,0 +1 @@
+refpin_preservation=ok
--- a/search/test/search_kahns_bfs_refpin.tcl
+++ b/search/test/search_kahns_bfs_refpin.tcl
@ -0,0 +1,75 @@
+# Regression for Kahn's BFS vs reference-pin input delay.
+#
+# set_input_delay -reference_pin causes ArrivalVisitor::visit() to fire
+# enqueueRefPinInputDelays() while propagating a clock arrival. That
+# call pushes new vertices into the arrival iterator's queue_ during
+# Stage 2 with bfsInQueue=true. The Kahn Phase 3 post-finishTasks
+# cleanup has to keep those entries -- a blind queue_[level].clear()
+# drops the pointer but leaves the flag stuck at true, making every
+# future enqueue() of the same vertex a silent no-op and causing
+# missed propagation on any subsequent invalidate + report_checks.
+#
+# Test flow:
+#   1. Baseline report_checks with Kahn's OFF.
+#   2. Invalidate arrivals; run report_checks with Kahn's ON.
+#      During Stage 2, enqueueRefPinInputDelays(clk) pushes the in1
+#      vertex into queue_; the cleanup fix must preserve it instead
+#      of orphaning its bfsInQueue flag.
+#   3. Invalidate arrivals again; run report_checks with Kahn's OFF.
+#      If step 2 orphaned in1, arrivals_invalid's enqueue is a no-op,
+#      level-BFS runs without in1 as a seed, and the report differs
+#      from the baseline.
+#
+# The test asserts that steps 1 and 3 produce byte-identical reports,
+# which is only possible if Kahn's cleanup preserved the ref-pin
+# vertex correctly.
+
+read_liberty ../../test/nangate45/Nangate45_typ.lib
+read_verilog search_test1.v
+link_design search_test1
+
+sta::set_thread_count 2
+
+create_clock -name clk -period 10 [get_ports clk]
+# in1 carries an input delay whose ARRIVAL reference is the clk pin.
+# When clk propagates and ArrivalVisitor::visit() sees is_clk=true on
+# clk's vertex, it calls enqueueRefPinInputDelays(clk, sdc) which
+# enqueues the in1 vertex into the arrival iterator.
+set_input_delay -clock clk -reference_pin [get_ports clk] 1.5 [get_ports in1]
+set_input_delay -clock clk 1.0 [get_ports in2]
+set_output_delay -clock clk 2.0 [get_ports out1]
+
+proc capture_checks {} {
+  sta::redirect_string_begin
+  report_checks -path_delay max -group_count 10
+  return [sta::redirect_string_end]
+}
+
+# Step 1: baseline, Kahn's OFF.
+set sta_use_kahns_bfs 0
+set baseline [capture_checks]
+
+# Step 2: full re-propagation under Kahn's ON. This is the call that
+# exercises enqueueRefPinInputDelays during Stage 2 and depends on
+# the post-finishTasks cleanup to preserve the ref-pin vertex's
+# bfsInQueue flag.
+sta::arrivals_invalid
+set sta_use_kahns_bfs 1
+set _ [capture_checks]   ;# output discarded, only side effects matter
+
+# Step 3: fall back to OFF and re-propagate. If Kahn's orphaned in1,
+# the arrivals_invalid below is a silent no-op for that vertex, and
+# level-BFS runs without in1 as a seed -- changing the report.
+sta::arrivals_invalid
+set sta_use_kahns_bfs 0
+set after_kahn [capture_checks]
+
+if { $baseline eq $after_kahn } {
+  puts "refpin_preservation=ok"
+} else {
+  puts "refpin_preservation=FAIL"
+  puts "--- BASELINE (Kahn's OFF, first run) ---"
+  puts $baseline
+  puts "--- AFTER KAHN (Kahn's OFF, after ON-then-invalidate) ---"
+  puts $after_kahn
+}
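The failure mode this test guards against can be reproduced in a few lines. The sketch below is a simplified model under assumptions, not the OpenSTA classes: `ToyVertex` and `ToyQueue` are hypothetical stand-ins that keep only the in-queue flag and a single queue level. It contrasts a blind clear with the selective compaction that `dropProcessedEntries` performs.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a graph vertex: only the in-queue flag matters.
struct ToyVertex {
  bool in_queue = false;
};

struct ToyQueue {
  std::vector<ToyVertex *> level;  // one level of the BFS queue

  // A vertex whose flag is already set is assumed to be queued, so the
  // call does nothing and reports false.
  bool enqueue(ToyVertex *v)
  {
    if (v->in_queue)
      return false;
    v->in_queue = true;
    level.push_back(v);
    return true;
  }

  // Blind cleanup: drops every pointer, including ones still flagged.
  void clearAll() { level.clear(); }

  // Selective cleanup: keep non-null entries whose flag is still set.
  // The write position never passes the read position, so compacting
  // in place is safe.
  void dropProcessed()
  {
    auto write = level.begin();
    for (ToyVertex *v : level) {
      if (v != nullptr && v->in_queue)
        *write++ = v;
    }
    level.erase(write, level.end());
  }
};
```

After a blind clear, a vertex that was enqueued mid-traversal is gone from the queue while its flag is still set, so a later `enqueue` of that vertex is a no-op. With the selective cleanup the vertex stays queued and is seen as a seed on the next pass.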
--- a/util/DispatchQueue.cc
+++ b/util/DispatchQueue.cc
@ -62,9 +62,16 @@ DispatchQueue::finishTasks()
    std::this_thread::yield();
 }

+uint64_t
+DispatchQueue::dispatchCallCount() const
+{
+  return dispatch_call_count_.load(std::memory_order_relaxed);
+}
+
 void
 DispatchQueue::dispatch(const fp_t& op)
 {
+  dispatch_call_count_.fetch_add(1, std::memory_order_relaxed);
  std::unique_lock<std::mutex> lock(lock_);
  q_.push(op);
  pending_task_count_++;
@ -78,6 +85,7 @@ DispatchQueue::dispatch(const fp_t& op)
 void
 DispatchQueue::dispatch(fp_t&& op)
 {
+  dispatch_call_count_.fetch_add(1, std::memory_order_relaxed);
  std::unique_lock<std::mutex> lock(lock_);
  q_.push(std::move(op));
  pending_task_count_++;
--- a/util/Util.i
+++ b/util/Util.i
@ -101,6 +101,20 @@ set_thread_count(int count)
  Sta::sta()->setThreadCount(count);
 }

+// See Sta::dispatchCallCount.
+unsigned long long
+dispatch_call_count()
+{
+  return Sta::sta()->dispatchCallCount();
+}
+
+// See Sta::arrivalVisitCount.
+unsigned long long
+arrival_visit_count()
+{
+  return Sta::sta()->arrivalVisitCount();
+}
+
 ////////////////////////////////////////////////////////////////

 void
Author	SHA1	Message	Date
dsengupta0628	bc3791beb4	updated doc to remove blocked content from doc Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>	2026-04-22 21:02:57 +00:00
dsengupta0628	9356d4b91b	updated doc to include testing strategies with vtune Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>	2026-04-22 20:59:05 +00:00
dsengupta0628	2368afc19c	fix for walk past reg/ck for clk-only pass Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>	2026-04-22 16:55:15 +00:00
dsengupta0628	a062f2b1f1	fix to address codex review regarding vertices remaining permanently marked as in-queue and be skipped by future enqueue() Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>	2026-04-22 15:33:45 +00:00
dsengupta0628	9305ce864b	testcase showing the problem: post-Stage-2 cleanup called queue_[level].clear() which dropped vertices that ArrivalVisitor::enqueueRefPinInputDelays had enqueued during Stage 2 without clearing their bfsInQueue flag leaving it stuck at true and permanently suppressing all future enqueue() calls for those vertices. Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>	2026-04-22 15:09:36 +00:00
dsengupta0628	a530706187	address variable persistence-gemini fdbk, add local batch dispatch instead of immediate dispatch-when-ready Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>	2026-04-22 14:24:41 +00:00
dsengupta0628	d53029aba1	add some debug info for dispatch queue and regression Signed-off-by: dsengupta0628 <dsengupta@precisioninno.com>	2026-04-22 04:19:34 +00:00