\section{Modules} \label{sec:modules} This section provides an overview of the main modules that are used in an SRAM. For each module, we will provide both an architectural description and an explanation of how that design is generated and used in OpenRAM. The modules described below are provided in the first release of OpenRAM, but by no means is this an exhaustive list of the possible circuits that can be adapted into a SRAM architecture; refer to Section~\ref{sec:implementation} for more information on adding different module designs to the compiler. Data structures for schematic and layout are provided in the \verb|base| directory. These implement a generic design object and have many auxiliary functions for routing, pin access, placement, DRC/LVS, etc. These are discussed further in Section~\ref{sec:implementation}. Each module has a corresponding Python class in the \verb|compiler/modules| directory. These classes are used to generate both the GDSII layout and spice netlists. A module can consist of hard library cells (Section~\ref{sec:techdir}), paramterized cells (Section~\ref{sec:parameterized}) or other modules. When combining modules at any level of hierarchy, DRC rules for minimum spacing of metals, wells, etc. must be followed and DRC and LVS are run by default after each hierarchical module's creation. A module is responsible for creating its own pins to enable routing at the next level up in the hierarchy. A module must also define its height and width assuming a (0,0) offset for the lower-left coordinate to aid with placement. \subsection{The Bitcell and Bitcell Array} \label{sec:bitcellarray} OpenRAM can work with any cell as the bitcell. This could be a foundry created one or a user design rule cell for experiments. In addition, it could be a common 6T cell or it could be replaced with an 8T, 10T or other cell, depending on needs. By default, OpenRAM uses a standard 6T cell as shown in Figure~\ref{fig:6t_cell}. The cross coupled inverters hold a single data bit that can either be driven into, or read from the cell by the bitlines. The access transistors are used to isolate the cell from the bitlines so that data is not corrupted while a cell is not being accessed. \begin{figure}[h!] \centering \includegraphics[scale=.9]{figs/cell_6t_schem.pdf} \caption{Standard 6T cell.} \label{fig:6t_cell} \end{figure} % tiling memory cells The 6T cells are tiled together in both the horizontal and vertical directions to make up the memory array. % keeping it square It is common practice to keep the aspect ratio of a memory array roughly ``square'' to ensure that the bitlines and wordlines do not become too long. If the bitlines are too long, this can increase the bitline capacitance, slow down the operation and lead to bitline leakage problems. To make an array ``more square'', multiple words can share rows by interleaving the bits of each word. The column mux in Section~\ref{sec:column_mux} is responsbile for selecting a subset of bitcells in a row to extract a word during read and write operations. % memory cell is a library cell In OpenRAM, we provide a library cell for the 6T cell that can be swapped with a fab memory cell, if available. The transitors in the cell are sized appropriately considering read and write noise margins. % bitcell and bitcell_array classes The bitcell class in \verb|modules/bitcell.py| is a single memory cell and is usually a pre-made library cell. % bitcell_array The bitcell\_array class in \verb|modules/bitcell_array.py| dynamically implements the memory cell array by instantiating a the bitcell class in rows and columns. % abutment connections During the tiling process, bitcells are abutted so that all bitlines and word lines are connected in the vertical and horizontal directions respectively. This is done by using the boundary layer to define the height and width of the cell. If this is not specified, OpenRAM will use the bounding box of all shapes as the boundary. The boundary layer should be offset at (0,0) in the lower left coordinate. % flipping In order to share supply rails, bitcells are flipped in alternating rows. \subsection{Precharge Circuitry} \label{sec:precharge} The precharge circuit is depicted in Figure~\ref{fig:precharge} and is implemented by three PMOS transistors. The input signal to the cell, clk, enables all three transistors during the first half of a read or write cycle (i.e. while the clock signal is low). M1 and M2 charge BL and BLB to Vdd and M3 helps to equalize the voltages seen on BL and BLB. \begin{figure}[h!] \centering \includegraphics[width=5cm]{./figs/precharge_schem.pdf} \caption{Schematic of a single precharge cell. \fixme{Change PCLK to CLK.}} \label{fig:precharge} \end{figure} In OpenRAM, the precharge citcuitry is dynamically generated using the parameterized transistor class (\verb|ptx|). The \verb|precharge| class in \verb|precharge.py| dynamically generates a single precharge cell. The offsets of the bitlines and the width of the precharge cell are equal to the 6T cell so that the bitlines are correctly connected down to the 6T cell. The \verb|precharge_array| class is then used to generate a precharge array, which is a single row of \textbf{n} precharge cells, where \textbf{n} equals the number of columns in the bitcell array. \subsection{Address Decoders} \label{sec:addressdecoder} The address decoder takes the row address bits from the address bus as inputs, and asserts the appropriate wordline in the row that data is to be read or written. A n-bit address input controls $2^n$ word lines. OpenRAM provides a hierarchical address decoder as the default, but will soon have other options. \subsubsection{Hierarchical Decoder} \label{sec:hierdecoder} Hierarchical decoder is a type of decoder which the constrcution takes place hierarchically. The simple 2:4 decoder is shown in the Figure~\ref{fig:2 to 4 decoder}. The operation of this decoder can be explained as follows: soon after the address signals A0 and A1 are put on the address lines, depending on the signal combination, one of the wordlines will rise after a brief amount of time. For example if the address input is A0A1=00 then the output is W0W1W2W3=1000. The 2:4 address decoder uses inverters and two input nand gates for its constrcution while the gates are sized to have equal rise and fall time. As the decoder size increases the size of the nand gates required for decoding also increases. Table~\ref{table:2-4 hierarchical_decoder} gives the detailed input and output siganls for the 2:4 hierarchical decoder. \begin{figure}[h!] \centering \includegraphics[scale=.6]{./figs/2t4decoder.pdf} \caption{Schematic of 2-4 simple decoder.} \label{fig:2 to 4 decoder} \end{figure} \begin{table}[h!] \begin{center} \begin{tabular}{| c | c |} \hline A[1:0] & Selected WL\\ \hline 00 & 0\\ \hline 01 & 1\\ \hline 10 & 2\\ \hline 11 & 3\\ \hline \end{tabular} \end{center} \caption{Truth table for 2:4 hierarchical decoder.} \label{table:2-4 hierarchical_decoder} \end{table} An $n$-bit decoder requires {$2^n$} logic gates, each with $n$ inputs. For example, with $n$ = 6, 64 $NAND6$ gates are needed to drive 64 inverters to implement the decoder. It is clear that gates with more than 3 inputs create large series resistances and long delays. Rather than using $n$-input gates, it is preferable to use a cascade of gates. Typically two stages are used: a predecode stage and a final decode stage. The predecode stage generates intermediate signals that are used by multiple gates in the final decode stage. \begin{figure}[h!] \centering \includegraphics[scale=.6]{./figs/4t16decoder.pdf} \caption{Schematic of 4 to 16 hierarchical decoder.} \label{fig:4 to 16 decoder} \end{figure} Figure~\ref{fig:4 to 16 decoder} shows the 4 to 16 heirarchical decoder. The structure of the decoder consists of two 2:4 decoders for predecoding and 2-input nand gates and inverters for final decoding to form the 4:16 decoder. In the predecoder, a total of 8 intermediate signals are generated from the address bits and their complements. The concept of using predecoing and final decoding stage for construction of address decoder is very procutive since small decoders like 2:4 decoder is used for predecoding. The operation of 4:16 heirarchical decoder can explained with an example. If the address is A0A1A2A3=0000 the output of the predecoder1 and predeocder2 will be WL0WL1WL2WL3=1000 and WL0WL1WL2WL3=1000, respectively. According to the connections in figure~\ref{fig:4 to 16 decoder} the wordline 0 of predecoder1 and predecoder2 are conneted to the first 2-input nand gate in the decode stage representing the wordline 0 of the final decoding stage. Hence depengin on the combination of the input signal one of the wordline will rise. In this case since the address input is A0A1A2A3=0000 the wordline 0 should go high. Table~\ref{table:4-16 hierarchical_decoder} gives the detailed input and output siganls for the 4:16 hierarchical decoder. \begin{table}[h!] \begin{center} \begin{tabular}{| c | c | c | c |} \hline A[3:0] & predecoder1 & predecoder2 & Selected WL\\ \hline 0000 & 1000 & 1000 & 0\\ \hline 0001 & 1000 & 0100 & 1\\ \hline 0010 & 1000 & 0010 & 2\\ \hline 0011 & 1000 & 0001 & 3\\ \hline 0100 & 0100 & 1000 & 4\\ \hline 0101 & 0100 & 0100 & 5\\ \hline 0110 & 0100 & 0010 & 6\\ \hline 0111 & 0100 & 0001 & 7\\ \hline 1000 & 0010 & 1000 & 8\\ \hline 1001 & 0010 & 0100 & 9\\ \hline 1010 & 0010 & 0010 & 10\\ \hline 1011 & 0010 & 0001 & 11\\ \hline 1100 & 0001 & 1000 & 12\\ \hline 1101 & 0001 & 0100 & 13\\ \hline 1110 & 0001 & 0010 & 14\\ \hline 1111 & 0001 & 0001 & 15\\ \hline \end{tabular} \end{center} \caption{Truth table for 4:16 hierarchical decoder.} \label{table:4-16 hierarchical_decoder} \end{table} As the size of the address line increases higher level decoder can be created using the lower level decoders. For example for a 8:256 decoder, two instances of 4:16 followed by 256 2-input nand gates and inverters can form the decoder. In order to construct the 8:256 decoder, first 4:16 decoder should be constructed through using 2:4 deccoders. Hence the name is hierarchical decoder. \subsection{Wordline Driver} \label{sec:wldriver} Word line drivers are inserted, in between the word line output of the address decoder and the word line input of the bitcell-array. The word line drivers ensure that as the size of the memory array increases, and the word line length and capacitance increases, the word line signal is able to turn on the access transistors in the 6T cell. Also, as the bank select signal in multi-bank structures is $ANDED$ with the word line output of decoder, bitcells turn on only when bank is selected. Figure~\ref{fig:wordline_driver} shows the diagram of word line driver and its input/output pins. In OpenRAM, word line drivers are created by using the \verb|pinv| and \verb|nand2| classes which takes the transistor size and cell height as inputs (so that it can abutt the 6T cell). Word line driver is added as seperate module in \verb|compiler|. \begin{figure}[h!] \centering \includegraphics[scale=.8]{./figs/wordline_driver.pdf} \caption{Diagram of word line driver.} \label{fig:wordline_driver} \end{figure} \subsection{Column Mux} \label{sec:column_mux} The column mux takes the column address bits from the address bus selects the appropriate bitlines for the word that is to be read from or written to. It takes n-bits from the address bus and can select $2^n$ bitlines. The column mux is used for both the read and write operations; it connects the bitline of the memory array to both the sense ampflifier and the write driver. OpenRAM provides several options for column mux, but the default is a single-level column mux which is sized for optimal speed. %% \subsubsection{Tree\_Decoding Column Mux} %% \label{sec:tree_decoding_column_mux} %% The schematic for a 4-1 tree %% multiplexer is shown in Figure~\ref{fig:colmux}. %% \begin{figure}[h!] %% \centering %% \includegraphics[scale=.9]{./figs/tree_column_mux_schem.pdf} %% \caption{Schematic of 4-1 tree column mux that passes both of the bitlines.} %% \label{fig:colmux} %% \end{figure} %% \fixme{Shading/opacity is different on different platforms. Make this a box in the image. It doesn't work on OSX.} %% This tree mux selects pairs of bitlines (both BL and BL\_B) as inputs %% and outputs. This 4-1 tree mux illustrates the process of choosing %% the correct bitlines if there are 4 words per row in the memory array. %% Each bitline pair represents a single bit from each word. A binary %% reduction pattern, shown in Table~\ref{table:colmux}, is used to %% select the appropriate bitlines. As the number of words per row in %% the memory array increases, the depth of the column mux grows. The %% depth of the column mux is equal to the number of bits in the column %% address bus. The 4-1 tree mux has a depth of 2. In level 1, the %% least significant bit from the column address bus selects either the %% first and second words or the third and fourth words. In level 2, the %% most signifant column address bit selects one of the words passed down %% from the previous level. Relative to other column mux designs, the %% tree mus uses significantly less devices. But, this type of design %% can provide poor performance if a large decoder with many levels are %% needed. The delay of of a tree mux quadratically increases with each %% level. Due to this fact, other types of column %% decoders should be considered for larger arrays. %% \begin{table}[h!] %% \begin{center} %% \begin{tabular}{| c | c | c | c |} %% \hline %% Selected BL & Inp1 & Inp2 & Binary\\ \hline %% BL0 & SEL0\_bar & SEL1\_bar & 00\\ \hline %% BL1 & SEL0 & SEL1\_bar & 01\\ \hline %% BL2 & SEL0\_bar & SEL1 & 10\\ \hline %% BL3 & SEL0 & SEL1 & 11\\ %% \hline %% \end{tabular} %% \end{center} %% \caption{Binary reduction pattern for 4-1 tree column mux.} %% \label{table:colmux} %% \end{table} %% In OpenRAM, the tree column mux is a dynamically generated design. The %% \verb|tree_mux_array| is made up of two dynamically generated cells: \verb|muxa| %% and \verb|mux_abar|. The only diffference between these cells is that input %% select signal is either hooked up to the \textbf{SEL} or %% \textbf{SEL\_bar} signals (see highlighted boxes in %% Figure~\ref{fig:colmux}). These cells are initialized the the %% \verb|column_muxa| and \verb|column_muxabar| classes in \verb|columm_mux.py|. Instances %% of \verb|ptx| PMOS transistors are added to the design and the necessary %% routing is performed using the \verb|add_rect()| function. A horizontal rail %% is added in metal2 for both the SEL and Sel\_bar signals. Underneath %% those input rails, horizontal straps are added. These straps are used %% to connect the BL and BL\_B outputs from \verb|muxa| to the BL and BL\_B %% outputs of \verb|mux_abar|. Vertical conenctors in metal3 are added at the %% bottom of the cell so that connections can be made down to the sense %% amp. Vertical connectors are also added in metal1 so that the cells %% can connect down to other mux cells when the depth of the tree mux is %% more than one level. %% The \verb|tree_mux_array| class is used to generate the tree mux. %% Instances of both the \verb|muxa| and \verb|mux_abar| cells are instantiated and %% are tiled row by row. The offset of the cell in a row is determined %% by the depth of that row in the tree mux. The pattern used to %% determine the offset of the mux cells is %% $muxa.width*(i)*(2*row\_depth)$ where is the column number. As the %% depth increases, the mux cells become further apart. A separate %% ``for'' loop is invoked if the $depth>1$, which extends the %% power/ground and select rails across the entire width of the array. %% Similarly, if the $depth>1$, spice net names are created for the %% intermediate connection made at the various levels. This is necessary %% to ensure that a correct spice netlist is generated and that the %% input/output pins of the column mux match the pins in the modules that %% it is connected to. \subsubsection{Single\_Level Column Mux} \label{sec:single_level_column_mux} The optimal design for column mux uses a single NMOS device, driven by the input address or decoded input addresses. Figure~\ref{fig:2t1_single_level_column_mux} shows the schematic of a 2:1 single-level column mux. In this column mux one bit of address and its complementry drive the pass transistors. Selected transistors will connect their corresponding bitlines ( 1 set of column out of 2 set of columns) to sense-amp and write-driver circuitry for read or write operation. Figure~\ref{fig:4t1_single_level_column_mux} shows the schematic of a 4:1 single-level column mux. In this column mux, 2 input address are decoded using a 2:4 decoder ( 2:4 decoder is explain in section~\ref{sec:hierdecoder}). 2:4 decoder provides a one-hot set of outputs, so only one set of columns will be selected and connected to sense-amp and write-driver ( in figure~\ref{fig:4t1_single_level_column_mux} one set of column out of four sets of column is selected). In OpenRAM, the \verb|single-level_mux_array| is a dynamically generated design and it is made up of dynamically generated cell (\verb|single-level_mux|). \verb|single-level_mux| uses the parameterized transistor class \verb|ptx| to generate two NMOS transistors which will connect the BL and BLB of selected columns to sense-amp and write-driver. Horizontal rails are added for $sel$ signals. Vertical straps connect the BL and BLB of bitcell\_array to BL and BLB of single-level column mux and also BL-out and BLB-out of single-level column mux to BL and BLB of sense-amp and write-driver. \begin{figure}[h!] \centering \includegraphics[scale=.7]{./figs/2t1_single_level_column_mux.pdf} \caption{Schematic of a 2:1 single level column mux.} \label{fig:2t1_single_level_column_mux} \end{figure} \begin{figure}[h!] \centering \includegraphics[scale=.6]{./figs/4t1_single_level_column_mux.pdf} \caption{Schematic of a 4:1 single level column mux.} \label{fig:4t1_single_level_column_mux} \end{figure} \subsection{Sense Amplifier} \label{sec:senseamp} The sense amplifier is used to sense the difference between the bitline and bitline bar while a read operation is performed. The sense amp is necessary to recover the signals from the bitlines because they do not experience full voltage swing. As the size of the memory array grows, the load of the bitlines increases and the voltage swing is limited by the small memory cell driving this large load. A differential sense amplifier is used to``sense'' the small voltage difference between the bitlines. \begin{figure}[h!] \centering \includegraphics[scale=.8]{./figs/sense_amp_schem.pdf} \caption{Schematic of a single sense amplifier cell.} \label{fig:sense_amp} \end{figure} The schematic for the sense amp is shown in Figure~\ref{fig:sense_amp}. The sense amplifier is enable by the SCLK signal, which initiates the read operation. Before the sense amplifier is enable, the bitlines are precharged to Vdd by the precharge unit. When the sense amp is enabled, one of the bitlines experiences a voltage drop based on the value stored in the memory cell. If a zero is stored, the bitline voltage drops. If a one is stored, the bitline bar voltage drops. The output signal is then taken to a true logic level and latched for output to the data bus. In OpenRAM, the sense amplifier is a libray cell. The associated layout and spice netlist can be found in the \verb|gds_lib| and \verb|sp_lib| in the FreePDK45 directory. The \verb|sense_amp| class in \verb|sense_amp.py| instantiates a single instance of the sense amp library cell. The \verb|sense_amp_array| class handles the tiling of the sense amps cells. One sense amp cell is needed per data bit and the sense amp cells need to be appropriately spaced so that they can hook up to the column mux bitline pairs. The spacing is determined based on the number of words per row in the memory array. Instances are added and then Vdd, Gnd and SCLK rails that span the entire width of the array are drawn using the add\_rect() function. We chose to leave the sense amp as a libray cell so that custom amplifier designs could be swapped into the memory as needed. The two major things that need to be considered while designing the sense amplifier cell are the size of the cell and the bitline/input pitches. Optimally, the cell should be no larger than the 6T cell so that it abuts to the column mux and no extra routing or space is needed. Also, the bitline inputs of the sense amp need to line up with the outputs of the write driver. In the current version of OpenRAM, the write driver is situated under the sense amp, which had bitlines spaning the entire height of the cell. In this case, the sense amplifier is disabled during a write operation but the bitlines still connect the write driver to the column mux without any extra routing. \subsection{Write Driver} \label{sec:writedriver} The write driver is used to drive the input signal into the memory cell during a write operation. It can be seen in Figure~\ref{fig:write_driver} that the write driver consists of two tristate buffers, one inverting and one non-inverting. It takes in a data bit, from the data bus, and outputs that value on the bitline, and its complement on bitline bar. The bitlines need to be complements so that the data value can be correctly stored in the 6T cell. Both tristates are enabled by the EN signal. \begin{figure}[h!] \centering \includegraphics[scale=.8]{./figs/write_driver_schem.pdf} \caption{Schematic of a write driver cell, which consists of 2 tristates (non-inverting and inverting) to drive the bitlines.} \label{fig:write_driver} \end{figure} Currently, in OpenRAM, the write driver is a library cell. The associated layout and spice netlist can be found in the \verb|gds_lib| and \verb|sp_lib| in the FreePDK45 directory. Similar to the \verb|sense_amp_array|, the \verb|write_driver_array| class tiles the write driver cells. One driver cell is needed per data bit and Vdd, Gnd, and EN signals must be extended to span the entire width of the cell. It is not optimal to have the write driver as a library cell because the driver needs to be sized based on the capacitance of the bitlines. A large memory array needs a stronger driver to drive the data values into the memory cells. We are working on creating a parameterized tristate class, which will dynamically generate write driver cells of different sizes/strengths. \subsection{Flip-Flop Array} In a synchronous SRAM it is necessary to synchronize the inputs and outputs with a clock signal by using flip-flops. In FreePDK45 we provide a library cell for a simple master-slave flip-flop, see schematic in Figure~\ref{fig:ms_flop}. In our library cell we provide both Q and Q\_bar as outputs of the flop because inverted signals are used in various modules. The \verb|ms_flop| class in \verb|ms_flop.py| instatitates a single master-slave flop, and the \verb|ms_flop_array| class generates an array of flip-flops. Arrays of flops are necessary for the data bus (an array for both the inputs and outputs) as well as the address bus (an array for row and column inputs). The \verb|ms_flop_array| takes the number of flops and the type of array as inputs. Currently, the type of the array must be either ``data\_in'', ``data\_out'', ``addr\_row'', or ``addr\_col'' verbatim. The array type input is used to look up that associated pin names for each of the flop arrays. This was implemented very quickly and should be improved in the near future... \begin{figure}[h!] \centering \includegraphics[scale=.7]{./figs/ms_flop_schem.pdf} \caption{Schematic of a master-slave flip-flop provided in FreePDK45 library} \label{fig:ms_flop} \end{figure} \subsection{Control Logic} The details of the control logic architecture are outlined in Section~\ref{sec:control}. The control logic module, \verb|control_logic.py|, instantiates a \verb|control_logic| class that arranges all of the flip-flops and logic associated with the control signals into a single module. Flip-flops are instantiated for each control signal input and library NAND and NOR gates are used for the logic. A delay chain, of variable length, is also generted using parameterized inverters. The associated layouts and spice netlists can be found in the \verb|gds_lib| and \verb|sp_lib| in the FreePDK45 directory. \section{Bank and SRAM} \label{sec:bank} The overall memory architecture is shown in figure~\ref{fig:bank}. As shown in this figure one Bank contains different modules including precharge-array which is positioned above the bitcell-array, column-mux-array which is located below the bitcell-array, sense-amp-array, write-driver-array, data-in-ms-flop-array to synchronize the input data with negative edge of the clock, tri-gata-array to share the bidirectional data-bus between input and output data, hierarchical decoder which is placed on the right side of the bitcell-array (predecoder + decoder), wordline-driver which drives the wordlines horizontally across the bitcell-array and address-ms-flops to synchronize the input address with positive edge of the clock. In bitcell-array each memory cell is mirrored vertically and horizontally inorder to share VDD and GND rails with adjacent cells and form the array. Data-bus is connected to tri-gate, address-bus is connected to address-ms-flops and bank-select signal will enable the bank when it goes high. To complete the SRAM design, bank is connected to control-logic as shown in figure~\ref{fig:bank}. Control-logic controls the timing of modules inside the bank. CSb, OEb, Web and clk are inputs to the control logic and output of control logic will ANDed with bank-select signal and send to the corresponding modules. \begin{figure}[h!] \centering \includegraphics[scale=1]{./figs/bank.pdf} \caption{Overal bank and SRAM architecture.} \label{fig:bank} \end{figure} In order to reduce the delay and power, divided wordline strategy have been used in this compiler. Part of the address bits are used to define the global wordline (bank-select) and rest of address bits are connected to hierarchical decoder inside each bank to generate local wordlines that actually drive the bitcell access transistors. As shown in figure~\ref{fig:bank2} SRAM is divided to two banks which share data-bus, address-bus, control-bus and control-logic. In this case one bit of address (most significant bit) goes to an ms-flop and outputs of ms-flop (address-out and address-out-bar) are connected to banks as bank-select signals. Control logic is shared between two banks and based on which bank is selected, control signals will activate modules inside the selected bank. In this architecture, the total cell capacitance is reduced by up to a factor of two. Therefore the power will be reduced greatly and the delay among the wordlines is also reduced. \begin{figure}[h!] \centering \includegraphics[scale=.9]{./figs/bank2.pdf} \caption{SRAM is divided to two banks which share the control-logic.} \label{fig:bank2} \end{figure} In figure~\ref{fig:bank4}, four banks are connected together. In this case a 2:4 decoder is added to select one of the banks using two most significant bits of input address. Control signals are connected to all banks but will turn on only the selected bank. \begin{figure}[h!] \centering \includegraphics[scale=.9]{./figs/bank4.pdf} \caption{SRAM is divided to 4 banks wich are controlled by the control-logic and a 2:4 decoder.} \label{fig:bank4} \end{figure}