From d96bfd7a8b64d6888d1ff205ba364b6c70e353bd Mon Sep 17 00:00:00 2001
From: John McMaster
Date: Mon, 10 Sep 2018 15:54:39 -0700
Subject: [PATCH] timfuz: cleanup README

Signed-off-by: John McMaster
---
 timfuz/README.md   | 288 +++++++++++++++++++++++++++++++++++++++++++++
 timfuz/README.txt  |  71 -----------
 timfuz/checksub.py |   1 +
 3 files changed, 289 insertions(+), 71 deletions(-)
 create mode 100644 timfuz/README.md
 delete mode 100644 timfuz/README.txt

diff --git a/timfuz/README.md b/timfuz/README.md
new file mode 100644
index 00000000..ae90a689
--- /dev/null
+++ b/timfuz/README.md
@@ -0,0 +1,288 @@

# Timing analysis fuzzer (timfuz)

WIP: 2018-09-10: this process is just starting to come together and is going to get significant cleanup. But here's the general idea.

This runs various designs through Vivado and processes the
resulting timing information in order to create very simple timing models.
While Vivado might have more involved models (say RC delays, fanout, etc.),
timfuz creates simple models that bound realistic min and max element delays.

Currently this document focuses exclusively on fabric timing delays.
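The underlying idea, treating each observed path delay as a linear constraint over per-element delays and bounding the elements, can be sketched on toy data with scipy's linprog. All numbers here are hypothetical; the real flow uses the scripts described below:

```python
import numpy as np
from scipy.optimize import linprog

# Toy model: three delay elements t0, t1, t2. Each row of A marks which
# elements a timing path crosses; b holds the observed path delays.
# For a max-corner model, every path imposes: sum of element delays >= observed.
A = np.array([
    [1, 1, 0],   # a path through t0 and t1 measured at 10 units
    [0, 1, 1],   # a path through t1 and t2 measured at 12 units
    [1, 0, 1],   # a path through t0 and t2 measured at 8 units
])
b = np.array([10.0, 12.0, 8.0])

# linprog expects A_ub @ x <= b_ub, so negate both sides to express A @ t >= b.
# Minimizing total delay keeps the bounds tight; delays stay non-negative.
res = linprog(c=np.ones(3), A_ub=-A, b_ub=-b, bounds=(0, None))
print(res.x)  # approximately t0=3, t1=7, t2=5
```

In this toy case the three constraints pin down a unique tightest assignment; on real data the system is under-constrained, which is what the RREF grouping below works around.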
## Quick start

TODO: make this into a more formal makefile flow

```
# Pre-processing
# Create speed.json
./speed.sh

# Create csvs
make N=1
csv=specimen_001/timing3.csv

# Main workflow
# Discover which variables can be separated
python3 timfuz_rref.py --simplify --out sub.json $csv
# Verify sub.json makes a solvable solution
python3 checksub.py --sub-json sub.json group.csv
# Separate variables
python3 csv_flat2group.py --sub-json sub.json --strict $csv group.csv
# Create a rough timing model that approximately fits the given paths
python3 solve_leastsq.py --sub-json sub.json group.csv --out leastsq.csv
# Tweak rough timing model, making sure all constraints are satisfied
python3 solve_linprog.py --sub-json sub.json --sub-csv leastsq.csv --massage group.csv --out linprog.csv
# Take separated variables and back-annotate them to the original timing variables
python3 csv_group2flat.py --sub-json sub.json --sort linprog.csv flat.csv

# Final processing
# Create tile.json (where timing models are in tile fabric)
python3 tile_txt2json.py timgrid/specimen_001/tiles.txt tiles.json
# Insert timing delays into actual tile layouts
python3 tile_annotate.py flat.csv tilea.json
```


## Vivado background

Examples are for an XC750T on Vivado 2017.2.

TODO maybe move to: https://github.com/SymbiFlow/prjxray/wiki/Timing


### Speed index

Vivado seems to associate each delay model with a "speed index".
The fabric has these in two elements: wires (ie one delay element per tile) and pips.
For example, LUT output node A (ex: CLBLL_L_X12Y100/CLBLL_LL_A) has a single wire, also called CLBLL_L_X12Y100/CLBLL_LL_A.
This has speed index 733. Speed models can be queried, and we find this corresponds to model C_CLBLL_LL_A.
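As an illustration, the index-to-model association behaves like a plain lookup table. The dict below is a hypothetical sketch; the real mapping comes from speed.sh / speed.json, whose exact schema may differ:

```python
# Illustrative speed-index table using values mentioned in this document;
# the real data is extracted from Vivado, and this shape is an assumption.
SPEED_MODELS = {
    733: {"name": "C_CLBLL_LL_A", "type": "wire"},
    659: {"name": "R_ZERO", "type": "switch"},
    0xFFFF: None,  # never appears on fabric elements; marks unused/special models
}

def model_name(index):
    """Return the speed model name for a speed index, or None if unused."""
    entry = SPEED_MODELS.get(index)
    return entry["name"] if entry else None

print(model_name(733))     # C_CLBLL_LL_A
print(model_name(0xFFFF))  # None
```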
There are various speed model types:
* bel_delay
* buffer
* buffer_switch
* content_version
* functional
* inpin
* outpin
* parameters
* pass_transistor
* switch
* table_lookup
* tl_buffer
* vt_limits
* wire

IIRC the interconnect is only composed of switch and wire types.

Indices with value 65535 (0xFFFF) never appear in timing paths; presumably these mark unused models.
This index is used for some special models, such as those of type "content_version".
For example, the "xilinx" model is of type "content_version".

There are also "cost codes", but these seem to be very coarse (only around 30 of them)
and are suspected to be related more to PnR than to the timing model.


### Timing paths

The Vivado timing analyzer can easily output the following:
* Full: delay from BEL pin to BEL pin
* Interconnect only (ICO): delay from BEL pin to BEL pin, but only report interconnect delays (ie exclude site delays)

There is also theoretically an option to report delays up to a specific pip,
but this option is poorly documented and I was unable to get it to work.

Each timing path reports min and max values for both a fast and a slow process corner, so four process values are reported in total:
* fast_max
* fast_min
* slow_max
* slow_min

For example, if the device is end of life, was poorly made, and is at an extreme temperature, the delay may be up to the slow_max value.
Since ICO can be reported for each of these, fully analyzing a timing path results in 8 values.

Finally, part of this work was analyzing tile regularity to discover what a reasonably compact timing model would be.
We verified that all tiles of the same type have exactly the same delay elements.


## Methodology

Make sure you've read the Vivado background section first.


### Background

This section briefly describes some of the mathematics used by this technique that readers may not be familiar with.
These definitions are intended to be good enough to provide a high level understanding and may not be precise.

Numerical analysis: the study of algorithms that use numerical approximation (as opposed to general symbolic manipulations).

numpy: a popular numerical analysis python library. Often written np (import numpy as np).

scipy: provides higher level functionality on top of numpy.

sympy ("symbolic python"): like numpy, but designed to work with rational numbers.
For example, python actually stores 0.1 as 0.1000000000000000055511151231257827021181583404541015625.
However, sympy can represent this as the fraction 1/10, eliminating numerical approximation issues.

Least squares (ex: scipy.optimize.least_squares): approximation method to do a best fit of several variables to a set of equations.
For example, given the equations "x = 1" and "x = 2" there isn't an exact solution.
However, "x = 1.5" is a good compromise since it reasonably satisfies both equations.

Linear programming (ex: scipy.optimize.linprog aka linprog): approximation method that finds a set of variables satisfying a set of inequalities.
For example, given "x >= 1" and "x + y >= 10", linprog can find a solution such as "x = 1, y = 9" that satisfies every inequality while minimizing an objective (ex: the sum of the variables).

Reduced row echelon form (RREF, ex: sympy.Matrix.rref): the simplest form to which a system of linear equations can be reduced.
For example, given "x = 1" and "x + y = 9", one can solve for "x = 1" and "y = 8".
However, given "x + y = 1" and "x + y + z = 9", there are more variables than independent equations, so the system can't be solved fully.
In this case RREF provides a best effort by giving the ratios between correlated variables.
One variable is normalized to 1 in each of these ratios and is called the "pivot".
Note that if numpy.linalg.solve encounters an unsolvable matrix it may either complain
or generate a false solution due to numerical approximation issues.


### What didn't work

First, some quick background on things that didn't work, to illustrate why the current approach was chosen.
I first tried to directly throw things into linprog, but it unfairly weighted towards arbitrary shared variables. For example, feeding in:
* t0 >= 10
* t0 + t1 >= 100

It would declare "t0 = 100", "t1 = 0" instead of the more intuitive "t0 = 10", "t1 = 90".
I tried to work around this in several ways, notably by subtracting equations from each other to produce additional constraints.
This worked okay, but was relatively slow and wasn't converging on nearly-solved solutions, even when throwing a lot of data at it.

Next we tried randomly combining a bunch of the equations together and solving them like a regular linear algebra matrix (numpy.linalg.solve).
However, this illustrated that the system was under-constrained.
Further analysis revealed that there are some delay element combinations that simply can't be linearly separated.
This was checked primarily using numpy.linalg.matrix_rank, with some use of numpy.linalg.slogdet.
matrix_rank was preferred over slogdet since it's more flexible with non-square matrices.


### Process

The above ultimately led to the idea that we should come up with a set of substitutions that would make the system solvable.
This has several advantages:
* Easy to evaluate which variables aren't covered well enough by the source data
* Easy to evaluate which variables weren't solved properly (if it's fully constrained it should have had a non-zero delay)

At a high level, the above learnings gave this process:
* Find correlated variables by using RREF (sympy.Matrix.rref) to create variable groups
 - Note pivots
 - You must input a fractional type (ex: fractions.Fraction, but surprisingly not int) to get exact results; otherwise it seems to fall back to numerical approximation
 - This is by far the most computationally expensive step
 - Mixing RREF substitutions from one data set to another may not be recommended
* Use the RREF result to substitute groups in the input data, creating new meta variables and ultimately reducing the number of columns
* Pick a corner
 - Examples assume fast_max, but other corners are applicable with appropriate column and sign changes
* De-duplicate by removing equations that are less constrained
 - Ex: if solving for a max corner and given:
  - t0 + t1 >= 10
  - t0 + t1 >= 12
 - The first equation is redundant since the second provides a stricter constraint
 - This significantly reduces computation time
* Use least squares (scipy.optimize.least_squares) to fit variables near the input constraints
 - Helps fairly weight delays vs the original input constraints
 - Does not guarantee all constraints are met. For example, given (ignoring that these would have been de-duplicated):
  - t0 = 10
  - t0 = 12
 - It may decide something like t0 = 11, which means the second constraint was not satisfied given we actually want t0 >= 12
* Use linear programming (scipy.optimize.linprog aka linprog) to formally meet all remaining constraints
 - Start by filtering out all constraints that are already met.
This should eliminate nearly all equations.
* Map resulting constraints onto different tile types
 - Group delays map onto the group pivot variable, typically setting other elements to 0 (if the processed set is not the one used to create the pivots, they may be non-zero)


## TODO

Milestone 1 (MVP)
* DONE
* Provide any process corner with at least some of the fabric

Milestone 2
* Provide all four fabric corners
* Simple makefile based flow
* Cleanup/separate fabric input targets

Milestone 3
* Create site delay model

Final
* Investigate ZERO models
* Investigate virtual switchboxes
* Compare our output vs Xilinx's on random designs


### Improve test cases

Test cases are somewhat random right now. We could make much more targeted cases using custom routing to improve various fanout estimates and such.
Also, there are a lot more elements that are not covered.
At a minimum these should be moved to their own directory.


### ZERO models

Background: there are a number of speed models with ZERO in their name.
These generally seem to be zero delay, although this needs more investigation.

Example: see the virtual switchbox item below.

The timing models will probably improve significantly if these are removed.
In the past I was removing them, but decided to keep them in for now in the spirit of being more conservative.

They include:
 * _BSW_CLK_ZERO
 * BSW_CLK_ZERO
 * _BSW_ZERO
 * BSW_ZERO
 * _B_ZERO
 * B_ZERO
 * C_CLK_ZERO
 * C_DSP_ZERO
 * C_ZERO
 * I_ZERO
 * _O_ZERO
 * O_ZERO
 * RC_ZERO
 * _R_ZERO
 * R_ZERO


### Virtual switchboxes

Background: several low level configuration details are abstracted with virtual configurable elements.
For example, LUT inputs can be rearranged to reduce routing congestion.
However, the LUT configuration must be changed to match the switched inputs.
This is handled by the CLBLL_L_INTER switchbox, which doesn't encode any physical configuration bits.
However, this contains PIPs with delay models.

For example, LUT A input A1 has node CLBLM_M_A1; this node's pip junction CLBLM_M_A1 has PIP CLBLM_IMUX7->CLBLM_M_A1
with speed index 659 (R_ZERO).

This might be further evidence for the related issue that ZERO models should probably be removed.


### Incorporate fanout

We could probably significantly improve model granularity by studying how delay varies with fanout.


### Investigate RC delays

We suspect accuracy could be significantly improved by moving to SPICE based models, but this will take significantly more characterization.


### Characterize real hardware

A few people have expressed interest in running tests on real hardware. This will take some thought given we don't have direct access.


### Review approximation errors

Ex: one known issue is that the objective function linearly weights small and large delays.
This is only recommended when variables are approximately the same order of magnitude.
For example, carry chain delays are on the order of 7 ps while other delays are around 100 ps.
It's very easy to put a large delay on the carry chain when it could have been more appropriately put somewhere else.

diff --git a/timfuz/README.txt b/timfuz/README.txt
deleted file mode 100644
index 5d37ec78..00000000
--- a/timfuz/README.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-Timing analysis fuzzer
-This runs some random designs through Vivado and extracts timing information in order to derive timing models
-While Vivado has more involved RC (spice?)
models incorporating fanout and other things,
-for now we are shooting for simple, conservative models with a min and max timing delay
-
-
-*******************************************************************************
-Background
-*******************************************************************************
-
-Vivado seems to associate each delay model with a "speed index"
-In particular, we are currently looking at pips and wires, each of which have a speed index associated with them
-For every timeing path, we record the total delay from one site to another, excluding site delays
-(timing analyzer provides an option to make this easy)
-We then walk along the path and record all wires and pips in between
-These are converted to their associated speed indexes
-This gives an equation that a series of speed indexes was given a certain delay value
-These equations are then fed into scipy.optimize.linprog to give estimates for the delay models
-
-However, there are some complications. For example:
-Given a system of equations like:
-t0 = 5
-t0 + t1 = 10
-t0 + t1 + t2 = 12
-The solver puts all the delays in t0
-To get around this, we subtract equations from each other
-
-Some additional info here: https://github.com/SymbiFlow/prjxray/wiki/Timing
-
-
-*******************************************************************************
-Quick start
-*******************************************************************************
-
-./speed.sh
-python timfuz_delay.py --cols-max 9 timfuz_dat/s1_timing2.txt
-Which will report something like
-Delay on 36 / 162
-
-Now add some more data in:
-python timfuz_delay.py --cols-max 9 timfuz_dat/speed_json.json timfuz_dat/s*_timing2.txt
-Which should get a few more delay elements, say:
-Delay on 57 / 185
-
-
-*******************************************************************************
-From scratch
-*******************************************************************************
-
-Roughly something like this
-Edit generate.tcl
-Uncomment speed_models2
-Run "make N=1"
-python speed_json.py specimen_001/speed_model.txt speed_json.json
-Edit generate.tcl
-Comment speed_models2
-Run "make N=4" to generate some more timing data
-Now run as in the quick start
-python timfuz_delay.py --cols-max 9 speed_json.json specimen_*/timing2.txt
-
-
-*******************************************************************************
-TODO:
-*******************************************************************************
-
-Verify elements are being imported correctly throughout the whole chain
-Can any wires or similar be aggregated?
-    Ex: if a node consisents of two wire delay models and that pair is never seen elsewhere
-Look at virtual switchboxes. Can these be removed?
-Look at suspicous elements like WIRE_RC_ZERO

diff --git a/timfuz/checksub.py b/timfuz/checksub.py
index f91fb38e..213fcb33 100644
--- a/timfuz/checksub.py
+++ b/timfuz/checksub.py
@@ -90,6 +90,7 @@ def run(fns_in, sub_json=None, verbose=False):
     print
     # https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.linalg.matrix_rank.html
     print('rank: %s / %d col' % (np.linalg.matrix_rank(Amat), len(names)))
+    # doesn't work on non-square matrices
     if 0:
         # https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.slogdet.html
         sign, logdet = np.linalg.slogdet(Amat)