From d96bfd7a8b64d6888d1ff205ba364b6c70e353bd Mon Sep 17 00:00:00 2001
From: John McMaster
Date: Mon, 10 Sep 2018 15:54:39 -0700
Subject: [PATCH] timfuz: cleanup README

Signed-off-by: John McMaster
---
 timfuz/README.md   | 288 +++++++++++++++++++++++++++++++++++++++++++++
 timfuz/README.txt  |  71 -----------
 timfuz/checksub.py |   1 +
 3 files changed, 289 insertions(+), 71 deletions(-)
 create mode 100644 timfuz/README.md
 delete mode 100644 timfuz/README.txt

diff --git a/timfuz/README.md b/timfuz/README.md
new file mode 100644
index 00000000..ae90a689
--- /dev/null
+++ b/timfuz/README.md
@@ -0,0 +1,288 @@

# Timing analysis fuzzer (timfuz)

WIP: 2018-09-10: this process is just starting to come together and is going to get significant cleanup. But here's the general idea.

This runs various designs through Vivado and processes the
resulting timing information in order to create very simple timing models.
While Vivado might have more involved models (say RC delays, fanout, etc.),
timfuz creates simple models that bound realistic min and max element delays.

Currently this document focuses exclusively on fabric timing delays.
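The underlying idea, treating each observed path delay as a linear constraint over per-element delays and bounding the elements, can be sketched on toy data with scipy's linprog. All numbers here are hypothetical; the real flow uses the scripts described below:

```python
import numpy as np
from scipy.optimize import linprog

# Toy model: three delay elements t0, t1, t2. Each row of A marks which
# elements a timing path crosses; b holds the observed path delays.
# For a max-corner model, every path imposes: sum of element delays >= observed.
A = np.array([
    [1, 1, 0],   # a path through t0 and t1 measured at 10 units
    [0, 1, 1],   # a path through t1 and t2 measured at 12 units
    [1, 0, 1],   # a path through t0 and t2 measured at 8 units
])
b = np.array([10.0, 12.0, 8.0])

# linprog expects A_ub @ x <= b_ub, so negate both sides to express A @ t >= b.
# Minimizing total delay keeps the bounds tight; delays stay non-negative.
res = linprog(c=np.ones(3), A_ub=-A, b_ub=-b, bounds=(0, None))
print(res.x)  # approximately t0=3, t1=7, t2=5
```

In this toy case the three constraints pin down a unique tightest assignment; on real data the system is under-constrained, which is what the RREF grouping below works around.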
## Quick start

TODO: make this into a more formal makefile flow

```
# Pre-processing
# Create speed.json
./speed.sh

# Create csvs
make N=1
csv=specimen_001/timing3.csv

# Main workflow
# Discover which variables can be separated
python3 timfuz_rref.py --simplify --out sub.json $csv
# Verify sub.json makes a solvable solution
python3 checksub.py --sub-json sub.json group.csv
# Separate variables
python3 csv_flat2group.py --sub-json sub.json --strict $csv group.csv
# Create a rough timing model that approximately fits the given paths
python3 solve_leastsq.py --sub-json sub.json group.csv --out leastsq.csv
# Tweak rough timing model, making sure all constraints are satisfied
python3 solve_linprog.py --sub-json sub.json --sub-csv leastsq.csv --massage group.csv --out linprog.csv
# Take separated variables and back-annotate them to the original timing variables
python3 csv_group2flat.py --sub-json sub.json --sort linprog.csv flat.csv

# Final processing
# Create tile.json (where timing models are in tile fabric)
python3 tile_txt2json.py timgrid/specimen_001/tiles.txt tiles.json
# Insert timing delays into actual tile layouts
python3 tile_annotate.py flat.csv tilea.json
```


## Vivado background

Examples are for an XC750T on Vivado 2017.2.

TODO maybe move to: https://github.com/SymbiFlow/prjxray/wiki/Timing


### Speed index

Vivado seems to associate each delay model with a "speed index".
The fabric has these in two elements: wires (ie one delay element per tile) and pips.
For example, LUT output node A (ex: CLBLL_L_X12Y100/CLBLL_LL_A) has a single wire, also called CLBLL_L_X12Y100/CLBLL_LL_A.
This has speed index 733. Speed models can be queried, and we find this corresponds to model C_CLBLL_LL_A.
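As an illustration, the index-to-model association behaves like a plain lookup table. The dict below is a hypothetical sketch; the real mapping comes from speed.sh / speed.json, whose exact schema may differ:

```python
# Illustrative speed-index table using values mentioned in this document;
# the real data is extracted from Vivado, and this shape is an assumption.
SPEED_MODELS = {
    733: {"name": "C_CLBLL_LL_A", "type": "wire"},
    659: {"name": "R_ZERO", "type": "switch"},
    0xFFFF: None,  # never appears on fabric elements; marks unused/special models
}

def model_name(index):
    """Return the speed model name for a speed index, or None if unused."""
    entry = SPEED_MODELS.get(index)
    return entry["name"] if entry else None

print(model_name(733))     # C_CLBLL_LL_A
print(model_name(0xFFFF))  # None
```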
There are various speed model types:
* bel_delay
* buffer
* buffer_switch
* content_version
* functional
* inpin
* outpin
* parameters
* pass_transistor
* switch
* table_lookup
* tl_buffer
* vt_limits
* wire

IIRC the interconnect is only composed of switch and wire types.

Indices with value 65535 (0xFFFF) never appear in timing paths; presumably these mark unused models.
This index is used for some special models, such as those of type "content_version".
For example, the "xilinx" model is of type "content_version".

There are also "cost codes", but these seem to be very coarse (only around 30 of them)
and are suspected to be related more to PnR than to the timing model.


### Timing paths

The Vivado timing analyzer can easily output the following:
* Full: delay from BEL pin to BEL pin
* Interconnect only (ICO): delay from BEL pin to BEL pin, but only report interconnect delays (ie exclude site delays)

There is also theoretically an option to report delays up to a specific pip,
but this option is poorly documented and I was unable to get it to work.

Each timing path reports min and max values for both a fast and a slow process corner, so four process values are reported in total:
* fast_max
* fast_min
* slow_max
* slow_min

For example, if the device is end of life, was poorly made, and is at an extreme temperature, the delay may be up to the slow_max value.
Since ICO can be reported for each of these, fully analyzing a timing path results in 8 values.

Finally, part of this work was analyzing tile regularity to discover what a reasonably compact timing model would be.
We verified that all tiles of the same type have exactly the same delay elements.


## Methodology

Make sure you've read the Vivado background section first.


### Background

This section briefly describes some of the mathematics used by this technique that readers may not be familiar with.
These definitions are intended to be good enough to provide a high level understanding and may not be precise.

Numerical analysis: the study of algorithms that use numerical approximation (as opposed to general symbolic manipulations).

numpy: a popular numerical analysis python library. Often written np (import numpy as np).

scipy: provides higher level functionality on top of numpy.

sympy ("symbolic python"): like numpy, but designed to work with rational numbers.
For example, python actually stores 0.1 as 0.1000000000000000055511151231257827021181583404541015625.
However, sympy can represent this as the fraction 1/10, eliminating numerical approximation issues.

Least squares (ex: scipy.optimize.least_squares): approximation method to do a best fit of several variables to a set of equations.
For example, given the equations "x = 1" and "x = 2" there isn't an exact solution.
However, "x = 1.5" is a good compromise since it reasonably satisfies both equations.

Linear programming (ex: scipy.optimize.linprog aka linprog): approximation method that finds a set of variables satisfying a set of inequalities.
For example, given "x >= 1" and "x + y >= 10", linprog can find a solution such as "x = 1, y = 9" that satisfies every inequality while minimizing an objective (ex: the sum of the variables).

Reduced row echelon form (RREF, ex: sympy.Matrix.rref): the simplest form to which a system of linear equations can be reduced.
For example, given "x = 1" and "x + y = 9", one can solve for "x = 1" and "y = 8".
However, given "x + y = 1" and "x + y + z = 9", there are more variables than independent equations, so the system can't be solved fully.
In this case RREF provides a best effort by giving the ratios between correlated variables.
One variable is normalized to 1 in each of these ratios and is called the "pivot".
Note that if numpy.linalg.solve encounters an unsolvable matrix it may either complain
or generate a false solution due to numerical approximation issues.


### What didn't work

First, some quick background on things that didn't work, to illustrate why the current approach was chosen.
I first tried to directly throw things into linprog, but it unfairly weighted towards arbitrary shared variables. For example, feeding in:
* t0 >= 10
* t0 + t1 >= 100

It would declare "t0 = 100", "t1 = 0" instead of the more intuitive "t0 = 10", "t1 = 90".
I tried to work around this in several ways, notably by subtracting equations from each other to produce additional constraints.
This worked okay, but was relatively slow and wasn't converging on nearly-solved solutions, even when throwing a lot of data at it.

Next we tried randomly combining a bunch of the equations together and solving them like a regular linear algebra matrix (numpy.linalg.solve).
However, this illustrated that the system was under-constrained.
Further analysis revealed that there are some delay element combinations that simply can't be linearly separated.
This was checked primarily using numpy.linalg.matrix_rank, with some use of numpy.linalg.slogdet.
matrix_rank was preferred over slogdet since it's more flexible with non-square matrices.


### Process

The above ultimately led to the idea that we should come up with a set of substitutions that would make the system solvable.
This has several advantages:
* Easy to evaluate which variables aren't covered well enough by the source data
* Easy to evaluate which variables weren't solved properly (if it's fully constrained it should have had a non-zero delay)

At a high level, the above learnings gave this process:
* Find correlated variables by using RREF (sympy.Matrix.rref) to create variable groups
 - Note pivots
 - You must input a fractional type (ex: fractions.Fraction, but surprisingly not int) to get exact results; otherwise it seems to fall back to numerical approximation
 - This is by far the most computationally expensive step
 - Mixing RREF substitutions from one data set to another may not be recommended
* Use the RREF result to substitute groups in the input data, creating new meta variables and ultimately reducing the number of columns
* Pick a corner
 - Examples assume fast_max, but other corners are applicable with appropriate column and sign changes
* De-duplicate by removing equations that are less constrained
 - Ex: if solving for a max corner and given:
  - t0 + t1 >= 10
  - t0 + t1 >= 12
 - The first equation is redundant since the second provides a stricter constraint
 - This significantly reduces computation time
* Use least squares (scipy.optimize.least_squares) to fit variables near the input constraints
 - Helps fairly weight delays vs the original input constraints
 - Does not guarantee all constraints are met. For example, given (ignoring that these would have been de-duplicated):
  - t0 = 10
  - t0 = 12
 - It may decide something like t0 = 11, which means the second constraint was not satisfied given we actually want t0 >= 12
* Use linear programming (scipy.optimize.linprog aka linprog) to formally meet all remaining constraints
 - Start by filtering out all constraints that are already met.
This should eliminate nearly all equations.
* Map resulting constraints onto different tile types
 - Group delays map onto the group pivot variable, typically setting other elements to 0 (if the processed set is not the one used to create the pivots, they may be non-zero)


## TODO

Milestone 1 (MVP)
* DONE
* Provide any process corner with at least some of the fabric

Milestone 2
* Provide all four fabric corners
* Simple makefile based flow
* Cleanup/separate fabric input targets

Milestone 3
* Create site delay model

Final
* Investigate ZERO models
* Investigate virtual switchboxes
* Compare our output vs Xilinx's on random designs


### Improve test cases

Test cases are somewhat random right now. We could make much more targeted cases using custom routing to improve various fanout estimates and such.
Also, there are a lot more elements that are not covered.
At a minimum these should be moved to their own directory.


### ZERO models

Background: there are a number of speed models with ZERO in their name.
These generally seem to be zero delay, although this needs more investigation.

Example: see the virtual switchbox item below.

The timing models will probably improve significantly if these are removed.
In the past I was removing them, but decided to keep them in for now in the spirit of being more conservative.

They include:
 * _BSW_CLK_ZERO
 * BSW_CLK_ZERO
 * _BSW_ZERO
 * BSW_ZERO
 * _B_ZERO
 * B_ZERO
 * C_CLK_ZERO
 * C_DSP_ZERO
 * C_ZERO
 * I_ZERO
 * _O_ZERO
 * O_ZERO
 * RC_ZERO
 * _R_ZERO
 * R_ZERO


### Virtual switchboxes

Background: several low level configuration details are abstracted with virtual configurable elements.
For example, LUT inputs can be rearranged to reduce routing congestion.
However, the LUT configuration must be changed to match the switched inputs.
This is handled by the CLBLL_L_INTER switchbox, which doesn't encode any physical configuration bits.
However, this contains PIPs with delay models.

For example, LUT A input A1 has node CLBLM_M_A1; this node's pip junction CLBLM_M_A1 has PIP CLBLM_IMUX7->CLBLM_M_A1
with speed index 659 (R_ZERO).

This might be further evidence for the related issue that ZERO models should probably be removed.


### Incorporate fanout

We could probably significantly improve model granularity by studying how delay varies with fanout.


### Investigate RC delays

We suspect accuracy could be significantly improved by moving to SPICE based models, but this will take significantly more characterization.


### Characterize real hardware

A few people have expressed interest in running tests on real hardware. This will take some thought given we don't have direct access.


### Review approximation errors

Ex: one known issue is that the objective function linearly weights small and large delays.
This is only recommended when variables are approximately the same order of magnitude.
For example, carry chain delays are on the order of 7 ps while other delays are around 100 ps.
It's very easy to put a large delay on the carry chain when it could have been more appropriately put somewhere else.

diff --git a/timfuz/README.txt b/timfuz/README.txt
deleted file mode 100644
index 5d37ec78..00000000
--- a/timfuz/README.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-Timing analysis fuzzer
-This runs some random designs through Vivado and extracts timing information in order to derive timing models
-While Vivado has more involved RC (spice?)
models incorporating fanout and other things,
-for now we are shooting for simple, conservative models with a min and max timing delay
-
-
-*******************************************************************************
-Background
-*******************************************************************************
-
-Vivado seems to associate each delay model with a "speed index"
-In particular, we are currently looking at pips and wires, each of which have a speed index associated with them
-For every timeing path, we record the total delay from one site to another, excluding site delays
-(timing analyzer provides an option to make this easy)
-We then walk along the path and record all wires and pips in between
-These are converted to their associated speed indexes
-This gives an equation that a series of speed indexes was given a certain delay value
-These equations are then fed into scipy.optimize.linprog to give estimates for the delay models
-
-However, there are some complications. For example:
-Given a system of equations like:
-t0 = 5
-t0 + t1 = 10
-t0 + t1 + t2 = 12
-The solver puts all the delays in t0
-To get around this, we subtract equations from each other
-
-Some additional info here: https://github.com/SymbiFlow/prjxray/wiki/Timing
-
-
-*******************************************************************************
-Quick start
-*******************************************************************************
-
-./speed.sh
-python timfuz_delay.py --cols-max 9 timfuz_dat/s1_timing2.txt
-Which will report something like
-Delay on 36 / 162
-
-Now add some more data in:
-python timfuz_delay.py --cols-max 9 timfuz_dat/speed_json.json timfuz_dat/s*_timing2.txt
-Which should get a few more delay elements, say:
-Delay on 57 / 185
-
-
-*******************************************************************************
-From scratch
-*******************************************************************************
-
-Roughly something like this
-Edit generate.tcl
-Uncomment speed_models2
-Run "make N=1"
-python speed_json.py specimen_001/speed_model.txt speed_json.json
-Edit generate.tcl
-Comment speed_models2
-Run "make N=4" to generate some more timing data
-Now run as in the quick start
-python timfuz_delay.py --cols-max 9 speed_json.json specimen_*/timing2.txt
-
-
-*******************************************************************************
-TODO:
-*******************************************************************************
-
-Verify elements are being imported correctly throughout the whole chain
-Can any wires or similar be aggregated?
-    Ex: if a node consisents of two wire delay models and that pair is never seen elsewhere
-Look at virtual switchboxes. Can these be removed?
-Look at suspicous elements like WIRE_RC_ZERO

diff --git a/timfuz/checksub.py b/timfuz/checksub.py
index f91fb38e..213fcb33 100644
--- a/timfuz/checksub.py
+++ b/timfuz/checksub.py
@@ -90,6 +90,7 @@ def run(fns_in, sub_json=None, verbose=False):
     print
     # https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.linalg.matrix_rank.html
     print('rank: %s / %d col' % (np.linalg.matrix_rank(Amat), len(names)))
+    # doesn't work on non-square matrices
     if 0:
         # https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.slogdet.html
         sign, logdet = np.linalg.slogdet(Amat)