### Hybrid Josephson CMOS FIFO

by

Amit Mehrotra

### **Research Project**

Submitted to the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, in partial satisfaction of the requirements for the degree of Master of Science, Plan II

Approval for the Report and Comprehensive Examination:

Committee:

Professor Theodore Van Duzer, Research Advisor

Date

\* \* \* \* \* \*

Professor Alberto L. Sangiovanni-Vincentelli, Second Reader

Date

## Hybrid Josephson CMOS FIFO

#### Amit Mehrotra

### Abstract

We describe the design and measurement results of a hybrid Josephson-CMOS FIFO that adopts a ring pointer architecture. The FIFO is designed for throughput at frequencies exceeding 10 GHz in a 1.2 $\mu$m CMOS process optimized for operation at 4 K. It consists of a CMOS memory core surrounded by a high-speed Josephson multiplexer-demultiplexer pair. We use a split-output latch TSPC shift register as the ring pointer for the FIFO. We have modified the standard three-transistor CMOS DRAM cell to achieve a high-speed write and current-mode read. We have developed a new Josephson-CMOS interface amplifier circuit which achieves high throughput at reasonable power dissipation. We have also investigated the design of the demultiplexer and the multiplexer.

To my parents

# Contents

- List of Figures
- List of Tables
- 1 Introduction
  - 1.1 Applications
  - 1.2 Project Overview
  - 1.3 Report Organization
- 2 System Architecture
  - 2.1 Functional Description of a FIFO
  - 2.2 FIFO Architecture
    - 2.2.1 Event-Controlled Architecture
    - 2.2.2 Ring-Pointer Architecture
  - 2.3 Josephson-CMOS FIFO Architecture
  - 2.4 System Timing
- 3 Cold CMOS Device Properties
  - 3.1 Process Description
  - 3.2 to 3.7 …
- 4 Ring Pointer Shift Register
  - 4.1 True Single Phase Clock (TSPC) Edge Triggered Flip-Flops
    - 4.1.1 Double C<sup>2</sup>MOS Flip-Flop
    - 4.1.2 Split-Output Latch Flip-Flop
  - 4.2 Shift Register Modifications
  - 4.3 Design Specifications and Implementation
    - 4.3.1 Flip-Flop Design
    - 4.3.2 Clock Distribution Network Design
  - 4.4 Layout
  - 4.5 Simulation Results
  - 4.6 Experimental Results
- 5 Memory Cell
  - 5.1 CMOS Memory Design at 4K
    - 5.1.1 Static Random Access Memory (SRAM)
    - 5.1.2 Dynamic Random Access Memory (DRAM)
  - 5.2 DRAM Modifications for Current Mode Read
  - 5.3 Design Specifications and Implementation
  - 5.4 Layout
  - 5.5 Simulation Results
- 6 Josephson-CMOS Interface Circuits
  - 6.1 Interface Constraints
  - 6.2 Suzuki Stack Preamplification
    - 6.2.1 Suzuki Stack Structure
    - 6.2.2 Suzuki Stack Biasing and Output Modification
  - 6.3 Interface Amplifier
  - 6.4 Design Specification and Implementation
  - 6.5 Layout
  - 6.6 Simulation Results
  - 6.7 Calibration
  - 6.8 Experimental Results
- 7 Josephson Current Sense
  - 7.1 Fundamentals of Zero Impedance Current Sensing
  - 7.2 Single Ended Current Sensing Techniques
  - 7.3 4JL Current Sense
- 8 Josephson Demultiplexer
  - 8.1 Edge Triggered Logic
    - 8.1.1 Clock Regulator
    - 8.1.2 Buffer
    - 8.1.3 Inverter
    - 8.1.4 Buffered AND
  - 8.2 Demultiplexer Architecture
  - 8.3 Circuit Design
- 9 Josephson Multiplexer
  - 9.1 Multiplexer Architecture
  - 9.2 Circuit Design
- 10 Conclusions and Future Work
- Bibliography

# List of Figures

| Figure | Caption |
|--------|---------|
| 2.1 | Functional description of FIFO operation. |
| 2.2 | Event controlled FIFO architecture. |
| 2.3 | Ring pointer FIFO architecture. |
| 2.4 | Proposed Josephson-CMOS FIFO architecture. |
| 2.5 | Parallel FIFO block. |
| 2.6 | Write path timing. |
| 3.1 | Interconnect capacitance model. |
| 3.2 | Diffusion capacitance model. |
| 4.1 | Double C<sup>2</sup>MOS flip-flop. |
| 4.2 | Split-output latch flip-flop. |
| 4.3 | Circular shift register layout. |
| 4.4 | TSPC shift register cell with reset to "1" feature. |
| 4.5 | TSPC shift register cell with reset to "0" feature. |
| 4.6 | Two shift registers with skewed clocks. |
| 4.7 | Output of the first and the second shift registers for different clock skews. |
| 4.8 | Clock distribution network. |
| 4.9 | Flip-Flop layout. |
| 4.10 | Simulation of the complete read path. |
| 5.1 | CMOS SRAM cell. |
| 5.2 | Three-transistor DRAM cell. |
| 5.3 | One transistor DRAM cell. |
| 5.4 | Three transistor DRAM cell with current mode read-out. |
| 5.5 | Three transistor DRAM cell layout. |
| 5.6 | Simulation results for the three transistor DRAM cell. |
| 6.1 | Interface architecture. |
| 6.2 | The Suzuki stack. |
| 6.3 | Equivalent circuit of a resetting stack. |
| 6.4 | CMOS bias for a Suzuki stack. |
| 6.5 | CMOS bias for a Suzuki stack with improved reset. |
| 6.6 | Stack output for standard and shunted CMOS loads. |
| 6.7 | Interface amplifier. |
| 6.8 | Interface amplifier layout. |
| 6.9 | Simulation results of interface amplifier. |
| 6.10 | Interface amplifier with calibration circuit. |
| 6.11 | Timing diagram for interface calibration. |
| 6.12 | Simulation of interface calibration. |
| 6.13 | Experimental results of the interface circuit operation at 10 MHz. |
| 6.14 | Experimental results of the interface circuit operation at 70 MHz. |
| 7.1 | Josephson junction I-V characteristics. |
| 7.2 | Block diagram of single-ended current sensing. |
| 7.3 | MVTL based current sense. |
| 7.4 | RCJL based current sense. |
| 7.5 | 4JL based current sense. |
| 7.6 | Dynamic operation of a 4JL current sense element. |
| 8.1 | Clock shaping circuit. |
| 8.2 | Regulation of a 10 GHz sinusoidal clock. |
| 8.3 | Edge-triggered buffer. |
| 8.4 | Simulation results for edge-triggered buffer. |
| 8.5 | Edge-triggered inverter. |
| 8.6 | Simulation results for edge-triggered inverter. |
| 8.7 | Edge-triggered AND gate. |
| 8.8 | Simulation results for edge-triggered AND gate. |
| 8.9 | Josephson demultiplexer architecture. |
| 8.10 | Shift register with read-out latches. |
| 8.11 | Josephson demultiplexer simulation results. |
| 8.12 | Shift register simulation results. |
| 9.1 | Multiplexer architecture. |
| 9.2 | Circular shift register for the multiplexer. |
| 9.3 | Circular shift register simulation results. |
| 9.4 | Simulation results for the complete demultiplexer. |

# List of Tables

| Table | Caption |
|-------|---------|
| 4.1 | TSPC Shift Register Cell Device Parameters. |
| 4.2 | Performance Measure of TSPC Shift Register. |
| 4.3 | Clock Buffer Sizes. |
| 6.1 | Interface Amplifier Device Parameters. |
| 6.2 | Clock Buffer Sizes. |
| 7.1 | 4JL current sense circuit parameters. |
| 8.1 | Edge-triggered buffer parameters. |
| 8.2 | Clock shaping circuit parameters for the edge-triggered buffer. |
| 8.3 | Edge-triggered inverter parameters. |
| 8.4 | Clock shaping circuit parameters for the edge-triggered inverter. |

#### Acknowledgements

I would like to thank my advisor Prof. Van Duzer for his constant support, guidance and encouragement during the course of this research. I would also like to express my sincere thanks to Prof. Sangiovanni-Vincentelli for his valuable feedback on this work.

I would like to thank Uttam Ghoshal and Kishore Seendripu for introducing me to the topic of cold CMOS and for the interesting discussions we had during the course of this project. I would like to thank Alex Flores for helping me in testing. I would also like to thank other members of the hybrid group, Arnold Feldman and Hui Zhang for helping me in various stages of this project.

My Berkeley experience has been a thoroughly rewarding one. I would like to thank the people who were largely responsible for it - Alok Agrawal, Adnan Aziz, Anu Bhat, Dubravka Bilić, Mike Branch, Premal Buch, Luca Carloni, Edoardo Charbon, Yatin Chawathe, John Deng, Jonathan Du, Alberto Ferrari, Keng Fong, Wilsin Gosti, Mark Jeffrey, Sunil Khatri, Sriram Krishnan, Gurmeet Manku, Amit Marathe, Paolo Miliozzi, Rajeev Murgai, Amit Narayan, Ashwin Nayak, Ali Niknejad, Willem Perold, Pramote Piriyapoksombut, Mukul Prasad, Shaz Qadeer, Sriram Rajamani, Rajeev Ranjan, Jagesh Sanghavi, Kumud Sanwal, Vigyan Singhal, Dennis Sinitsky, Osamu Takahashi, Zuoqin Wang, Steve Whiteley and Phillip Xie.

## Chapter 1

# Introduction

High-speed communication networks, digital signal processing systems and microprocessors are among the largest application areas of digital circuits. Almost all such applications place severe demands on the bandwidth of either random access memories (RAM) or first-in-first-out (FIFO) memories. In modern high speed microprocessors and application specific integrated circuits (ASICs), memory performance often limits the overall performance of the system. The primary reason has been that while the speed of arithmetic and floating point processors has increased at an exponential rate with the scaling of technology, memory speed has not kept up [1].

For very high performance digital circuits based on bipolar emitter coupled logic (ECL), gallium arsenide metal-semiconductor field effect transistors (MESFETs) or Josephson junctions, memory speed becomes even more critical. Josephson memory devices have been proposed in the past [2], but they have invariably suffered from both hard failures due to limited circuit margins and performance degradation due to soft errors resulting from flux trapping in the memory element. These problems become even more severe as the memory size is increased [3].

Recently, CMOS based solutions have been proposed to solve the memory bottleneck of Josephson-based processors [4]. These exploit the high density of CMOS-based memory and its enhanced performance at cryogenic temperatures. Availability of high speed Josephson logic devices can be exploited to improve the performance of these memories beyond the basic performance enhancement due to cryogenic temperature operation.

This report concentrates on the design of a high bandwidth first-in-first-out (FIFO) memory in an optimized cryogenic CMOS technology. Even though the main ideas of cryogenic CMOS based memory designs carry over to the design of a FIFO, differences in the implementation of a FIFO and a RAM can be exploited to achieve better performance.

## 1.1 Applications

Before describing the details of the hybrid Josephson-CMOS FIFO, we consider several applications in communications and digital signal processing to appreciate the need for high-bandwidth FIFOs. We first examine an asynchronous transfer mode (ATM) switch that is increasingly becoming the standard digital communication technology [5]. We also examine a fast Fourier transform (FFT) processor for real-time image processing applications using a FIFO-based architecture [6].

As mentioned before, ATM switches are fast becoming the de facto standard in digital communication technology due to their ability to handle the conflicting requirements of voice, data and video transmission. The basic ATM data unit (henceforth referred to as a "packet") is a fixed 53-byte block, which is suitable for transmission on both local area networks (LANs) and wide area networks (WANs). The data rates defined by the SONET standard include transmission rates exceeding one gigabit per second (Gbps). Future standards are likely to be adopted in the range of tens of gigabits per second.
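To give a feel for the time budget these rates imply, the short sketch below computes how long one 53-byte packet occupies the line. The two line rates are illustrative values chosen to match "one gigabit" and "tens of gigabits" per second; they are assumptions, not figures taken from a particular standard.

```python
ATM_CELL_BYTES = 53   # fixed packet size stated in the text
BITS_PER_BYTE = 8


def cell_time_ns(line_rate_gbps: float) -> float:
    """Time to transfer one 53-byte packet at the given line rate, in nanoseconds."""
    bits = ATM_CELL_BYTES * BITS_PER_BYTE   # 424 bits per packet
    return bits / line_rate_gbps            # bits / (Gbit/s) gives nanoseconds


# Illustrative rates only (assumed, not taken from a standard).
for rate_gbps in (1.0, 10.0):
    print(f"{rate_gbps:5.1f} Gbps: {cell_time_ns(rate_gbps):.1f} ns per packet")
```

At these rates a buffer must accept a complete packet every few hundred nanoseconds or less, which is the bandwidth pressure on the FIFO that this section describes.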

Now consider a conflict in such a high speed system when two packets arrive at a switch at the same time from two different sources. During the resolution of conflict, a buffer implemented as a FIFO is required to store the data. After the conflict is resolved, a FIFO is again used to store the packet that was not transmitted immediately for subsequent retransmission. As the data rates increase, FIFO performance typically limits the overall data rate that can be handled by the ATM switch while meeting the bit error rate specification of the system.

The Josephson junction digital signal processor currently being developed is designed for high performance signal processing applications. For real time use, these applications require very high speed and extensive FFT computation. The proposed system implements a data flow architecture based on a standard CMOS chip, the United Technologies IQMAC. Such an architecture makes extensive use of FIFOs to feed and recirculate data to the arithmetic logic unit (ALU). As mentioned before, memory in these systems typically limits the amount of data available to the ALU and therefore the overall system throughput. Once again, a high-bandwidth FIFO is required to achieve the specified data rate.

#### **1.2 Project Overview**

In each of the above applications, FIFO bandwidth determines the overall system performance. As a result, we explore the use of a hybrid Josephson-CMOS approach to improve the bandwidth of a FIFO. Such an approach requires the development of a new architecture and the use of new circuit techniques to take advantage of the strengths of both CMOS and Josephson technologies. At the architecture level we partition the FIFO into a high speed Josephson demultiplexer and multiplexer and a relatively low speed CMOS ring pointer-based FIFO core. This functional decomposition exploits the high speed logic capability of Josephson circuits and the high density of CMOS memory.

At the circuit level, the availability of Josephson devices and the device characteristics of CMOS at cryogenic temperatures allow us to improve upon standard techniques. We use a three-transistor dynamic random access memory (DRAM) cell for single-ended write and single-ended, pure current-mode read. The availability of Josephson circuits allows us to use a zero-impedance, current-mode read instead of the standard, power-hungry CMOS sense amplifiers. Both these techniques allow for extremely high speed operation for a given power consumption or low power operation for a given system speed. A new Josephson-CMOS interface circuit achieves high throughput with acceptable power dissipation. In addition, the Josephson multiplexer and demultiplexer, implemented using edge-triggered logic, allow high throughput and simplified system timing.

We develop specifications for the FIFO in the available optimized CMOS technology. The process employs 1.2  $\mu$ m CMOS optimized for cryogenic temperature operation. Josephson junctions are implemented in a niobium process with current density of 1000 A/cm<sup>2</sup>. In such a process the overall FIFO is expected to operate at 10 GHz and the CMOS core at 1 GHz. The ten-to-one ratio between the Josephson and the CMOS clocks implies that a ten-to-one multiplexer is required at the output and a one-to-ten demultiplexer is required at the input.

### 1.3 Report Organization

This report is organized as follows. In Chapter 2 we discuss the system level framework for the FIFO design and describe standard CMOS FIFO architectures along with the modifications for the present application. In Chapter 3 we discuss the characteristics of CMOS devices at cryogenic temperatures and develop circuit techniques which exploit these characteristics. We also discuss some limitations of the cryogenic operation of CMOS. In Chapter 4 we describe the CMOS ring pointer architecture including the clock driver circuit. In Chapter 5 we describe the DRAM based memory core while in Chapter 6 we discuss the design of the interface circuit. Chapter 7 describes the various techniques of Josephson current sensing. Chapters 8 and 9 describe the Josephson demultiplexer and the multiplexer design.

## Chapter 2

## System Architecture

Before examining the specific circuit details, we address system level architecture issues. After a brief functional description of the FIFO, we describe the two most widely used CMOS FIFO architectures and investigate their relative merits for application to a hybrid Josephson-CMOS system. This helps us select a proper architecture for the implementation of the FIFO before we start adapting and optimizing it for the hybrid Josephson-CMOS process. We also briefly discuss the system timing issues.

## 2.1 Functional Description of a FIFO

A first-in-first-out (FIFO) memory stores a finite length queue of data. As the name suggests, the first datum written in is the first datum read out. All subsequent data is also read out in the same order as it is written in. Figure 2.1 shows schematically a typical sequence of FIFO operation from a behavioral perspective. Shaded boxes indicate that the memory cell contains valid data while empty boxes indicate that the memory cell is empty or contains invalid data. Suppose that the FIFO initially contains some valid data indicated by state 1. A write operation adds a new bit of data in the next empty cell causing the FIFO to change to state 2. A read operation causes a datum to be read out and the FIFO ends up in state 3. In addition, empty and full conditions must be checked to maintain data integrity. The logic implementing this is not shown in the figure.



Figure 2.1: Functional description of FIFO operation.

### 2.2 FIFO Architecture

The functional description of the FIFO outlined in the previous section suggests several possible implementations. The two most widely used architectures are the event-controlled architecture, which uses a shift register based approach [7], and the ring pointer architecture [8], which uses a RAM based approach. We discuss both these approaches, enumerating their advantages and disadvantages. This discussion leads to the selection of an appropriate architecture for a hybrid Josephson-CMOS system. Finally we describe architectural modifications for the complete system.

#### 2.2.1 Event-Controlled Architecture

To obtain high speed operation in a shift register type structure, an asynchronous event-controlled architecture typically is used. A block diagram of the event-controlled architecture is shown in Figure 2.2 [7].

The event-controlled FIFO employs the micropipeline asynchronous handshaking protocol. The micropipeline protocol uses two-cycle handshaking which requires single *Request* and *Acknowledge* lines to communicate between stages. Each stage in the FIFO consists of a stage controller to handle the handshaking and data storage in the form of D flip-flops as shown in Figure 2.2. Cascades of data storage elements form a number of parallel asynchronous shift registers, employing a common stage controller.

Figure 2.2: Event controlled FIFO architecture.

Consider an example of FIFO operation. Stage controller k gives the WE (write enable) signal to storage element k. The AND gate in the stage controller synchronizes the WE signal with the logical product of the Request signal from the $(k-1)^{th}$ stage and the Acknowledge signal from the $(k+1)^{th}$ stage. After writing data to the storage element, the stage controller issues a Request signal to the $(k+1)^{th}$ stage and an Acknowledge signal to the $(k-1)^{th}$ stage. The process repeats itself for subsequent stages of the FIFO. When neither Request nor Acknowledge signals are received by the stage controller, the storage element holds its data.

The event-controlled FIFO achieves high speed operation because the loading at each stage is relatively light. In addition, the asynchronous design avoids heavily loaded clock nodes which can adversely affect speed. Unfortunately, power dissipation is large because each piece of data must ripple through the entire structure. This implies a large number of transitions which, in turn, suggests large dynamic power dissipation.
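The power argument can be made concrete with a small behavioral model. The sketch below is not the circuit of [7]: it assumes an idealized shift-register FIFO in which a datum advances one stage whenever the next stage is empty, and it simply counts how many storage-element writes occur. The function name and the stage count used afterwards are illustrative choices.

```python
def ripple_writes(num_stages: int, num_items: int) -> int:
    """Count storage-element writes in an idealized shift-register FIFO.

    Assumption: every datum enters at stage 0 and must be copied into each
    following stage in turn before it can be read from the last stage.
    """
    stages = [None] * num_stages        # None marks an empty stage
    pending = list(range(num_items))    # data waiting to be written
    writes = 0
    delivered = 0
    while delivered < num_items:
        # The last stage hands its datum to the reader.
        if stages[-1] is not None:
            stages[-1] = None
            delivered += 1
        # Advance data by one stage, scanning from the output end.
        for k in range(num_stages - 1, 0, -1):
            if stages[k] is None and stages[k - 1] is not None:
                stages[k] = stages[k - 1]
                stages[k - 1] = None
                writes += 1
        # Accept a new datum when the first stage is free.
        if stages[0] is None and pending:
            stages[0] = pending.pop(0)
            writes += 1
    return writes
```

Under these assumptions each datum costs one write per stage, so `ripple_writes(128, 10)` returns 1280, whereas a RAM-based FIFO writes each datum into memory only once.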

#### 2.2.2 Ring-Pointer Architecture

The ring pointer architecture employs a dual-port RAM array with both the write and read word lines controlled by a ring pointer (circular shift register). A typical implementation is shown in Figure 2.3 [8].

Figure 2.3: Ring pointer FIFO architecture.

In contrast to the event-controlled architecture employing shift register data storage, the ring pointer FIFO stores data in a dual-port CMOS RAM array. The dual-port array provides separate read and write paths through the memory, allowing simultaneous reads and writes. The write pointer implemented as a circular CMOS shift register points to the first empty row of memory in the FIFO queue. A write operation is initiated by issuing the Write signal which latches data into the input latches. Data is then transferred to the row in memory pointed to by the write pointer. Subsequently, the write pointer shifts one bit and points to the next empty row in memory. Similar to the write pointer, the read pointer is implemented as a circular CMOS shift register. The read pointer points to the row of memory corresponding to the front of the FIFO queue. When the Read signal is issued, the row of memory pointed to by the read pointer places its data on the read bit lines. Sense amplifiers boost the data signal level, and data is transferred to the output latches. Subsequently, the read pointer shifts one bit to point to the new front of the FIFO queue. A relationship between the read and write pointers, or equivalently, active read and write word lines, determines full and empty conditions. Full and empty flag match circuitry tests for these conditions. Then, flag generator circuitry latches a flag signifying full or empty and provides an external output.
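The sequence of operations described above can be sketched as a behavioral model. This is only an illustration under stated assumptions: the one-hot lists stand in for the circular shift registers, an occupancy counter stands in for the flag match circuitry, and the class and method names are hypothetical.

```python
class RingPointerFIFO:
    """Behavioral model of a ring pointer FIFO (not a circuit description)."""

    def __init__(self, depth: int):
        self.rows = [None] * depth
        # One-hot pointers: exactly one entry is 1, as in a circular shift register.
        self.write_ptr = [1] + [0] * (depth - 1)
        self.read_ptr = [1] + [0] * (depth - 1)
        self.count = 0   # assumed stand-in for the full/empty match logic

    @staticmethod
    def _shift(ptr):
        # Move the single 1 by one position, wrapping around at the end.
        return [ptr[-1]] + ptr[:-1]

    def full(self) -> bool:
        return self.count == len(self.rows)

    def empty(self) -> bool:
        return self.count == 0

    def write(self, word):
        if self.full():
            raise OverflowError("FIFO full")
        self.rows[self.write_ptr.index(1)] = word   # row on the active write word line
        self.write_ptr = self._shift(self.write_ptr)
        self.count += 1

    def read(self):
        if self.empty():
            raise IndexError("FIFO empty")
        word = self.rows[self.read_ptr.index(1)]    # row on the active read word line
        self.read_ptr = self._shift(self.read_ptr)
        self.count -= 1
        return word
```

Note that the stored words never move: only the two pointers shift, which is the property that removes the ripple transitions of the shift-register approach.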

For low power dissipation and low latency, the ring pointer architecture is preferred. Dynamic power dissipation due to transitions from shifting data along the length of a shift register in the event-controlled architecture is eliminated. Latency is reduced to the sum of the write time and the access time of the memory rather than the time required for data to move the length of a shift register. However, throughput is reduced in the ring pointer architecture, limited by the charging of the write bit lines in the write path and the memory cycle time in the read path which, in turn, is limited by the sense amplifier.

#### 2.3 Josephson-CMOS FIFO Architecture

The Josephson-CMOS FIFO must achieve high throughput, in order to keep up with a Josephson processor, as well as low power dissipation due to the limited heat removal capability of liquid helium. While the event-controlled FIFO offers speed advantages, the power dissipation is simply too large for the cryogenic environment. As a result, we selected the ring pointer approach as the basic architecture [9].

As mentioned earlier, the ring pointer approach is limited in speed by the charging of the read and the write bit lines. Read time is reduced significantly by replacing the conventional voltage-based CMOS sense amplifiers with zero impedance current-sense Josephson circuits. However, the write time is still large due to charging and discharging of bit line capacitances. This limits the operation of the overall CMOS system to 1 GHz clock speed for reasonable power dissipation. However, the system clock for Josephson circuits is at 10 GHz. Hence, the incoming data is demultiplexed into ten parallel data streams of 1 GHz. These parallel data streams are then stored in the CMOS FIFO. At the output, the ten data streams are multiplexed again to recover data at 10 GHz.
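The rate conversion described above can be illustrated with a short sketch. It models only the data reordering, not the Josephson circuits or their timing; the function names and the test pattern are illustrative assumptions.

```python
def mux_ratio(fast_clock_hz: int, slow_clock_hz: int) -> int:
    """Number of parallel streams needed to bridge the two clock rates."""
    if fast_clock_hz % slow_clock_hz:
        raise ValueError("fast clock must be an integer multiple of the slow clock")
    return fast_clock_hz // slow_clock_hz


def demultiplex(serial_bits, ratio):
    """Group a serial stream into words of `ratio` bits, one word per slow clock cycle."""
    if len(serial_bits) % ratio:
        raise ValueError("stream length must be a multiple of the ratio")
    return [serial_bits[i:i + ratio] for i in range(0, len(serial_bits), ratio)]


def multiplex(words):
    """Serialize the parallel words back into one fast stream, preserving order."""
    return [bit for word in words for bit in word]


ratio = mux_ratio(10_000_000_000, 1_000_000_000)   # 10 GHz and 1 GHz from the text
stream = [(i * 7) % 3 % 2 for i in range(30)]      # arbitrary test pattern
words = demultiplex(stream, ratio)                 # what the CMOS core would store
recovered = multiplex(words)                       # what leaves the output multiplexer
```

Each entry of `words` corresponds to one ten-bit row written into the CMOS core per 1 GHz cycle, and `recovered` equals `stream` as long as rows are read in the order they were written.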

The hybrid FIFO consists of a basic CMOS ring pointer FIFO and mainly Josephson peripheral blocks as shown in Figure 2.4. The partition into Josephson and CMOS blocks seeks to exploit the advantages of each technology: high-speed Josephson logic at the periphery and a dense CMOS memory core. Full and empty flag generation to check the overflow and underflow conditions is omitted in the initial architecture so that the basic read and write data paths can be tested. However, flag generation could be easily added to future designs. For the initial design, synchronous operation is assumed as described in the next section, which discusses timing issues.

Figure 2.4: Proposed Josephson-CMOS FIFO architecture.

Consider a typical write operation. A ten bit serial data stream enters the Josephson demultiplexer synchronized with the 10 GHz clock. These data are demultiplexed into ten parallel streams synchronized with a 1 GHz clock (WCLK). The Josephson-CMOS interface circuit amplifies millivolt-level Josephson signals to volt-level CMOS signals and drives these data onto the write bit lines of the memory array. Subsequently, the write pointer shifts one bit, latching the data into memory. The write pointer now points to the next empty column of memory.

Now consider a typical read operation. On the rising edge of the 1 GHz read clock (RCLK), the pointer shifts one bit to activate the appropriate row of memory cells, which place their data onto the read bit lines. Josephson junctions, which replace CMOS sense amplifiers due to their higher speed and superior current sensitivity, sense and latch the output data. Subsequently, a Josephson multiplexer serializes the ten bit wide parallel data stream and synchronizes it to a 10 GHz clock.

For bit-parallel Josephson processors, a number of the Josephson-CMOS FIFO blocks shown in Figure 2.4 may be placed in parallel. Ring pointers may be shared between adjacent blocks to reduce complexity and power dissipation. The basic idea is shown in Figure 2.5.



Figure 2.5: Parallel FIFO block.

## 2.4 System Timing

The key timing issue in a hybrid Josephson-CMOS system is the synchronization of high speed superconducting logic with slower CMOS circuits. Asynchronous operation would be complicated because local clock generation or handshaking requires interfacing millivolt-level Josephson circuits to volt-level CMOS circuits at high speed, or equivalently with low latency, which is difficult. However, synchronous approaches are also problematic because a high-speed Josephson clock must be phase locked to a low speed CMOS clock. For the hybrid FIFO, we selected a synchronous approach with the clocks phase synchronized externally. This moved the difficult timing problems off chip and out of the cryogenic environment, where they should be easier to handle.

Within the synchronous timing framework, the approach was to minimize the number of clock phases required. Minimizing the number of clock phases tends to simplify circuit layout and reduce CMOS dynamic power dissipation. Recall that in CMOS circuits, dynamic power dissipation is directly proportional to total capacitance.

$$P_{dynamic} = CV^2 f \tag{2.1}$$

In the above equation $P_{dynamic}$ is the total dynamic power dissipated, C is the total capacitance being charged or discharged, V is the supply voltage and f is the rate of high-to-low or low-to-high transitions. Single phase timing minimizes the capacitance at the clock node because less wiring and fewer clock buffers are required as compared to a multi-phase system. Thus, power dissipation is reduced. We therefore decided to implement the CMOS circuit with a single phase clock.

Figure 2.6: Write path timing.
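As a rough feel for the magnitudes involved in equation (2.1), the following sketch evaluates it for one node. The numbers are illustrative assumptions chosen for this example (a 200 fF node, a 1 V supply and a 1 GHz transition rate), not measured values for the clock network.

```python
def dynamic_power(capacitance_f: float, supply_v: float, transition_rate_hz: float) -> float:
    """Equation (2.1): P = C * V^2 * f, returned in watts."""
    return capacitance_f * supply_v ** 2 * transition_rate_hz


# Illustrative values only: 200 fF, 1 V, 1 GHz
p_node = dynamic_power(200e-15, 1.0, 1e9)

# Halving the switched capacitance halves the power at fixed V and f
p_half = dynamic_power(100e-15, 1.0, 1e9)
```

With these assumed values `p_node` is about 0.2 mW, and the linear dependence on C is why removing clock wiring and buffers translates directly into lower dissipation.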

At the Josephson-CMOS interface, we employ edge-triggered Josephson logic using two phase clocks [9]. The edge triggered logic ensures that outputs of the demultiplexer become valid at well defined clock edges, which allows interface circuits to amplify for almost half the clock period, thus maximizing bandwidth. Furthermore, the edge triggered logic requires one fewer clock phase than the standard MVTL approach.

The write path timing diagram is shown in Figure 2.6. Two phase dc offset 10 GHz sinusoidal clocks, which are 180 degrees out of phase, drive the high speed portion of the demultiplexer. A single phase 1 GHz square wave clock externally phase locked to the high speed clock $\Phi_1$ drives the interface and write ring pointer. The read path employs a similar timing scheme. Note that the voltage levels shown in Figure 2.6 for the high speed clocks are arbitrary.

## Chapter 3

# **Cold CMOS Device Properties**

We first give a brief description of the cryogenic CMOS process. We then give a qualitative description of device characteristics and transient behavior at cryogenic temperatures which are critical for digital circuit design. We also discuss the modeling of line capacitances at low temperatures.

### 3.1 Process Description

We describe the 1.2 $\mu$m CMOS process of Orbit Semiconductors, which is optimized for cryogenic temperature operation. The key modifications to the standard CMOS process include the use of dual poly, i.e., n+ over NMOS and p+ over PMOS devices. Substrate doping profiles have also been optimized for low temperature operation. This results in a threshold voltage of 0.3 V for NMOS devices and -0.3 V for PMOS devices at cryogenic temperatures, which allows low voltage operation of these circuits.

## 3.2 Carrier Statistics at Cryogenic Temperatures

In silicon MOSFETs, acceptor and donor levels are typically very close to the corresponding band edges. At room temperature, this small energy difference between the dopant levels and the corresponding bands permits ionization of these impurities and availability of free carriers in the semiconductor. At cryogenic temperatures, a very large fraction of these impurities are not ionized, resulting in a dramatic reduction in free carrier density and hence the conductivity in silicon. Henceforth we will refer to this effect as "freeze out".

Degenerately doped drain and source regions have different freeze-out behavior. For typical source or drain impurity concentrations ($\gtrsim 10^{19}\ \text{cm}^{-3}$), which exceed Mott's semiconductor-to-metal transition in silicon, there is no freeze out at all.

The carrier freeze out condition in the channel is different from that in the bulk due to band bending. Under steady state conditions, acceptors in the surface depletion region are ionized due to applied potential at the gate. By integrating Poisson's equation expressed in terms of band-bending potential  $\Psi_s$ , it can be shown [10] that the depletion charge for an n-channel device can be expressed as

$$Q_d = \sqrt{2q\epsilon_{Si}N_A \left[\Psi_s - kT\log_e\left(1 + g_A \exp\left(\frac{E_A - E_F}{kT}\right)\right)\right]} \tag{3.1}$$

where $Q_d$ is the depletion charge, q is the electron charge, $\epsilon_{Si}$ is the dielectric permittivity of silicon, $N_A$ is the acceptor impurity concentration, $\Psi_s$ is the band bending potential, k is Boltzmann's constant, T is the temperature, $g_A$ is the ground state degeneracy factor for the acceptor level, $E_A$ is the acceptor level relative to the edge of the valence band and $E_F$ is the Fermi level.

At cryogenic temperatures, the second term inside the square bracket is approximately equal to  $E_A - E_F$ . For an n-channel device,  $E_A - E_F$  is very small and can be neglected in comparison to the band-bending potential  $\Psi_s$ . The depletion charge is thus unaffected by freeze out effects and can be calculated by the standard formula

$$Q_d = \sqrt{2q\epsilon_{Si}N_A\Psi_s} \tag{3.2}$$
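The approximation can be checked numerically. The sketch below evaluates the bracketed term of (3.1) and compares the resulting depletion charge with (3.2). All inputs are hypothetical values chosen for illustration ($E_A - E_F = 0.02$ eV, $g_A = 4$, $\Psi_s = 1.1$ V, T = 4.2 K), with energies in eV and the potential in volts so that the two are numerically comparable.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def freeze_out_term(ea_minus_ef_ev: float, temp_k: float, g_a: float = 4.0) -> float:
    """kT * ln(1 + g_A * exp((E_A - E_F) / kT)), evaluated without overflow."""
    kt = K_B_EV * temp_k
    a = math.log(g_a) + ea_minus_ef_ev / kt
    # ln(1 + e^a), rewritten so the exponential argument is never large and positive
    log_term = a + math.log1p(math.exp(-a)) if a > 0 else math.log1p(math.exp(a))
    return kt * log_term


# Hypothetical values, for illustration only
ea_minus_ef = 0.02   # eV
psi_s = 1.1          # V
term = freeze_out_term(ea_minus_ef, temp_k=4.2)

# Ratio of Q_d from (3.1) to Q_d from (3.2); the common prefactor cancels
charge_ratio = math.sqrt((psi_s - term) / psi_s)
```

For these assumed inputs the term stays within a millielectronvolt of $E_A - E_F$, and the two expressions for $Q_d$ agree to within about one percent.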

#### 3.3 Threshold Voltage

Low threshold voltages are important for high-speed operation at reduced supply voltage because the current drive capability of MOSFETs is determined by the overdrive gate voltage  $V_{gs} - V_t$ . The threshold voltage of an NMOS transistor is given by [11]

$$V_t = \Phi_{gs} - \frac{Q_o}{C_{ox}} + 2\Psi_B + \frac{\sqrt{2\epsilon_{Si}qN_A(2\Psi_B)}}{C_{ox}} \tag{3.3}$$

where  $\Phi_{gs}$  is the work function difference between the gate material and the p-type silicon,  $Q_o$  is the effective oxide charge per unit area at the silicon and oxide interface,  $\Psi_B$  is the band-bending potential at the onset of strong inversion and  $C_{ox}$  is the gate capacitance.


For a dual-poly process, i.e., n+ over NMOS and p+ over PMOS,  $\Phi_{gs} = -E_g/2q - \Psi_B$ . Neglecting the oxide charge, for cryogenic temperatures where the band bending required for inversion is approximately the band gap voltage, the threshold voltage is given by

$$V_t = \frac{\sqrt{2\epsilon_{Si}qN_A E_g}}{C_{ox}} \tag{3.4}$$

For the doping concentrations of the MOS device channels of this process, the threshold voltage is 0.3 V for an NMOS device and -0.3 V for a PMOS device.
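The step from (3.3) to (3.4) can be spelled out as follows, under the assumptions already stated in the text: the oxide charge $Q_o$ is neglected, and the band bending at inversion $2\Psi_B$ is approximately the band gap voltage $E_g/q$. The first three terms of (3.3) then cancel:

$$\Phi_{gs} + 2\Psi_B = -\frac{E_g}{2q} - \Psi_B + 2\Psi_B = \Psi_B - \frac{E_g}{2q} \approx 0$$

leaving only the depletion charge term,

$$V_t \approx \frac{\sqrt{2\epsilon_{Si}qN_A(2\Psi_B)}}{C_{ox}} \approx \frac{\sqrt{2\epsilon_{Si}qN_A(E_g/q)}}{C_{ox}}$$

which is (3.4) if $E_g$ there is read as the band gap expressed in volts.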

#### **3.4** Device Characteristics

At cryogenic temperatures, the low field mobility of carriers in silicon is extremely high due to the absence of thermal scattering effects. However, for a MOS device in saturation, the drain current is limited by the saturation velocity of the carriers. The saturation velocity of carriers increases by 40% for cryogenic temperatures. Hence the critical field beyond which the carriers move with saturation limited velocity is dramatically reduced at low temperatures. The saturation drain current is given by  $I_d = WC_{ox}v_s(V_{GS} - V_t)$ , where W is the transistor width and  $v_s$  is the saturation velocity. This was reflected in the measured device characteristics where the drain current increased by a factor of approximately 1.6 at cryogenic temperatures [10].

#### 3.5 Subthreshold Conduction

Reduction of the subthreshold currents is critical for reducing the power dissipation in both logic and memory circuits. Besides static power dissipation, subthreshold current causes leakage of stored charge at dynamic nodes.

According to the classical theory of subthreshold conduction [11], the subthreshold swing S, defined as the change in gate voltage required to change the drain current by one decade, is given by

$$S = 2.3 \frac{kT}{q} \left( \frac{C_{ox} + C_D}{C_{ox}} \right) \tag{3.5}$$

where  $C_D$  is the depletion capacitance per unit area corresponding to the surface potential  $\Psi_s$  given by

$$C_D = \sqrt{\frac{q\epsilon_{Si}N_A}{2\Psi_s}} \tag{3.6}$$

According to the above expression for S, at 4 K the subthreshold swing is smaller than its room temperature value by a factor of approximately 75. However, it can be shown that this expression does not hold for very low temperatures. Experimentally, it has been determined that the subthreshold swing at cryogenic temperatures is smaller than the room temperature subthreshold swing by a factor of 12. This implies that the subthreshold current at $V_{gs} = 0$ is much smaller than in normal room temperature MOS devices, even with smaller threshold voltages.
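The factor of 75 follows directly from the linear temperature dependence in (3.5). A minimal sketch, assuming the capacitance ratio $C_D/C_{ox}$ is the same at both temperatures and using $\ln 10$ for the coefficient written as 2.3 in the text:

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # electron charge, C


def subthreshold_swing(temp_k: float, cd_over_cox: float = 0.0) -> float:
    """Equation (3.5): S = ln(10) * (kT/q) * (1 + C_D/C_ox), in volts per decade."""
    return math.log(10) * K_B * temp_k / Q_E * (1.0 + cd_over_cox)


s_300 = subthreshold_swing(300.0)   # classical room temperature value
s_4 = subthreshold_swing(4.0)       # classical prediction at 4 K
classical_ratio = s_300 / s_4       # equals 300 / 4 for any fixed C_D/C_ox
```

With $C_D/C_{ox} = 0$ this gives roughly 60 mV per decade at 300 K, and `classical_ratio` is 75. This reproduces only the classical prediction; the measured improvement quoted above is the smaller factor of 12.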

#### **3.6** Hot Carrier Effects

Hot carrier effects are aggravated at low temperatures due to the increase in the mean free path of free carriers. This implies that free carriers gain much more energy from the applied field before being scattered by coulombic charges (recall that phonon scattering is absent at cryogenic temperatures). Hence these high energy or "hot" electrons can cause impact ionization much more easily than at room temperature. The most noticeable consequence of hot carriers at 4 K is the kink observed in the IV characteristics of devices at high $V_{ds}$. Hot electrons cause impact ionization, which results in the creation of electron-hole pairs. The electrons are swept away to the drain and cause a slight increase in the drain current. The holes, on the other hand, stay in the substrate due to the extremely high resistivity of the substrate resulting from freeze out. This body effect causes the substrate bias to change and increases the drain current. Therefore, for drain voltages slightly above the impact ionization potential, a sharp kink is observed in the drain current characteristics.

The hot carrier effect is insignificant below the impact ionization threshold (1.7 eV for electrons). Hence, by choosing a supply voltage of 1.5 V or less, this region of operation can be avoided. We chose a supply voltage of 1 V to further reduce the power dissipation in the circuit.

### 3.7 Dynamic Characteristics

As mentioned earlier, cryogenic operation causes an increase in the drain current by a factor of approximately 1.6. However, delay measurements of 51-stage ring oscillators show that the stage delay decreases by a factor of 2.4. Hence the increase in drive capability alone cannot account for the reduction in ring oscillator stage delay. In the following discussion, we will investigate the effect of substrate freeze out on the behavior of various charge storage elements present in a typical digital circuit.

Figure 3.1: Interconnect capacitance model.

- Line Capacitance. Since the substrate is frozen out at cryogenic temperatures, the conductance from the oxide-silicon interface to the backplane ground contact is extremely small and can be neglected. Figure 3.1 shows the model of the line capacitance. The capacitance of the silicon substrate is approximately four orders of magnitude smaller than the capacitance of the oxide dielectric, because typical wafer thickness is four orders of magnitude greater than typical dielectric thickness. This capacitance $C_{sub}$ appears in series with the line capacitance $C_{line}$, which in the ideal case reduces the overall capacitance by approximately four orders of magnitude.

In practice, the reduction in capacitance is not that dramatic. The segments of a metal line which cross over diffusion areas or other layers of metal experience no reduction of capacitance due to freeze out. Other segments which do not overlap any metal line or diffusion area also experience capacitive coupling to these areas due to fringing. These contributions to the line capacitance are difficult to characterize accurately. For room temperature operation, the dominant contribution to the line capacitance is the line to substrate capacitance. However, for cryogenic operation, this contribution is not present and hence the line capacitance is almost exclusively due to fringe and overlap capacitance.



Figure 3.2: Diffusion capacitance model.

- Gate Capacitance. Freeze out has no effect on the channel capacitance. As mentioned before, because of the band bending potential, the depletion and inversion charges at cryogenic temperature do not change from their room temperature values. This implies that gate capacitance does not change from its room temperature value.
- Drain and Source Diffusion Capacitance. Drain and source diffusion capacitances are negligible due to substrate freeze out. The source and drain junctions can be modeled as a capacitor $C_j$ in series with a parallel combination of a resistor $R_{sub}$ and a capacitor $C_{sub}$, which represent the conductance and the capacitance of the substrate as shown in Figure 3.2. As in the previous case, the conductance of the substrate is vanishingly small and the extremely small substrate capacitance causes the effective diffusion capacitance to be negligible.
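The series-capacitance argument used for both the line and the diffusion capacitance can be checked with a few lines of arithmetic. The sketch below assumes the ideal case from the text, in which the substrate conductance is zero and $C_{sub}$ is four orders of magnitude smaller than the capacitance in series with it; the normalized values are illustrative.

```python
def series_capacitance(c1: float, c2: float) -> float:
    """Equivalent capacitance of two capacitors in series."""
    return c1 * c2 / (c1 + c2)


c_line = 1.0            # normalized oxide (line) capacitance
c_sub = 1e-4 * c_line   # substrate capacitance, four orders of magnitude smaller
c_eff = series_capacitance(c_line, c_sub)
```

The result is slightly below `c_sub`, that is, the smaller capacitor dominates the series combination. Overlap and fringing paths that bypass the frozen-out substrate are not included in this idealized model.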

The above discussion of the behavior of charge storage elements is far from complete. More work is needed to accurately characterize various capacitive elements in the circuit to come up with accurate models. However we can draw some guidelines while designing circuits.

1. Diffusion capacitances of MOS devices need not be considered.
2. Capacitances of long interconnections need not be taken into account if they are fairly isolated from the rest of the circuit. A good example is the interconnections in the clock buffer tree. However, room temperature capacitance values should be used for those lines which overlap a lot of diffusion area or other metal layers. Good examples of such interconnects are the bit lines and word lines in the memory core.
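Returning to the ring oscillator measurement quoted at the start of this section, the two measured factors can be combined into an implied change in effective load capacitance. This is a back-of-the-envelope sketch that assumes a first-order delay model, stage delay proportional to $CV/I$, with the same voltage swing at both temperatures; the model and the function name are assumptions, not part of the measurements.

```python
def implied_capacitance_ratio(current_gain: float, delay_reduction: float) -> float:
    """Cold-to-warm effective load capacitance, assuming stage delay ~ C * V / I.

    t_cold / t_warm = (C_cold / C_warm) / current_gain = 1 / delay_reduction
    """
    return current_gain / delay_reduction


# Factors quoted in the text: drain current x1.6, stage delay reduced x2.4
ratio = implied_capacitance_ratio(current_gain=1.6, delay_reduction=2.4)
```

Under this model the effective switched capacitance at cryogenic temperature is about two thirds of its room temperature value, which is consistent in direction with the freeze-out effects listed above.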

## Chapter 4

## **Ring Pointer Shift Register**

We describe the CMOS ring pointer circuits. We begin by first describing the standard true single phase clock (TSPC) shift registers. We then discuss the modifications required to use the TSPC shift register as a ring pointer. We describe the design specifications and the actual design of the circuit and present the simulation results.

## 4.1 True Single Phase Clock (TSPC) Edge Triggered Flip-Flops

TSPC shift registers [12] are composed of either positive-edge-triggered or negative-edge-triggered D flip-flops. The biggest advantage of this logic style is that it requires only a single phase clock. This results in a significant reduction in the complexity and power dissipation of the clock distribution network over conventional designs based on two phase clocks. The race condition is eliminated by using the dual-stage approach.

The limitation of this latch is that it malfunctions when the slope of the clock is not sufficiently steep. Slow clock edges cause both NMOS and PMOS clocked transistors to be ON simultaneously, resulting in undefined values of the state as well as race conditions. We therefore introduce four stages of buffering to ensure the quality of the clock signals. The dynamic nature of this circuit does not pose a problem at cryogenic temperatures because charge stored at dynamic nodes is retained essentially indefinitely.

We now describe two possible implementations of an edge-triggered D flip-flop using TSPC latches.



Figure 4.1: Double  $C^2MOS$  flip-flop.

#### 4.1.1 Double C<sup>2</sup>MOS Flip-Flop

The positive edge triggered double C<sup>2</sup>MOS flip-flop is shown in Figure 4.1. When the clock $\Phi$ is low, transistors M2 and M6 are ON and M4 and M8 are OFF. As a result M1 and M3 form an inverter and node X gets the value $\overline{D}$. Node Y is high; therefore, M9 is OFF and node $\overline{Q}$ holds its previous value. As $\Phi$ goes high, thereby turning M2 and M6 OFF and M4 and M8 ON, node Y goes low (high) if X is high (low). M7 and M9 form an inverter and $\overline{D}$ appears at the output.

The advantage of this kind of a configuration is that the gates of all the transistors are driven to either rail, allowing maximum switching. The drawback is that the clock line drives the gates of four transistors.

#### 4.1.2 Split-Output Latch Flip-Flop

The positive edge triggered split-output latch flip-flop is shown in Figure 4.2. When  $\Phi$  is low, M2 is ON, M1 and M3 form an inverter, and voltage appears at the gates of M4 and M6. When  $\Phi$  goes high, M2 switches OFF and M5 switches ON so data is transferred to the output through M4, M6, M7 and M8.

Figure 4.2: Split-output latch flip-flop.

This configuration requires only two transistors to be driven by the clock line, thus reducing the load on the clock lines. However, not all nodes experience the full voltage swing. For instance, for high input, the gate of M4 is driven down only to $V_{tp}$, the threshold voltage of a PMOS device. Similarly, the gate of M8 is driven only to $V_{DD} - V_{tn}$, where $V_{DD}$ is the supply voltage and $V_{tn}$ is the threshold voltage of an NMOS device. This causes a reduction in the drive voltage on the gates of these transistors. However, for the cold CMOS process, the threshold voltages are very low compared with a normal room temperature process, so this penalty is not very severe. A lower clock load makes this type of latch the natural choice for cryogenic CMOS circuits.

Note that the output is inverted for both configurations and hence an inverter is needed to restore the correct polarity. Also note that at any time, one of the stages is always OFF; hence, a race condition can never occur if the clock edges are maintained sufficiently sharp.

## 4.2 Shift Register Modifications

While the latch shown in Figure 4.2 can be used in a linear shift register, modifications are required for a circular shift register at the layout stage to ensure that capacitance of a single line does not become excessive and degrade the performance. This is accomplished by connecting the input of a shift register to every other cell as shown in Figure 4.3.

Figure 4.3: Circular shift register layout.

Another modification is necessary to ensure that data are initialized properly on all the dynamic nodes. This is accomplished by providing a reset capability in the flip-flop. The circuit in Figure 4.4 shows the TSPC shift register cell with reset to "1" capability. Consider the initialization phase when $\Phi$ is held low. The reset signal *RES* is high and its complement $\overline{RES}$ is held low. This causes MR1 to pull the gate of M7 high, turning it OFF, and causes MR2 to force the input of the output stage inverter (M9 and M10) low. This initializes the output to "1". The circuit in Figure 4.5 shows the TSPC shift register cell with reset to "0" capability. Its operation can be explained in a similar manner.

Figure 4.4: TSPC shift register cell with reset to "1" feature.

Figure 4.5: TSPC shift register cell with reset to "0" feature.

#### 4.3 Design Specifications and Implementation

#### 4.3.1 Flip-Flop Design

A 10x128 memory array was laid out and the capacitances of the read and write word lines were extracted to determine the loading on the shift register output. Since line capacitances at cryogenic temperatures are not well characterized, room temperature values were used. The actual capacitance at cryogenic temperatures is always less than the room



Figure 4.6: Two shift registers with skewed clocks.

temperature values because of substrate freeze-out. However, as mentioned previously, the extent of lowering of the capacitance is determined by the amount of overlap with other metal lines and diffusion areas which are not frozen out. This overlap is not easily determined and characterized. We believe the use of room temperature values for the line capacitance is not overly conservative. The extracted read word and write word line capacitances were 5 fF per memory cell. This resulted in a total load of 50 fF of line capacitance plus the gate capacitance of the transistors in the read and the write word paths. The maximum capacitance appears on the write word line, where each transistor is a 7.2  $\mu$m $\times$ 1.2  $\mu$m device having a gate capacitance of 15 fF. Thus each shift register was designed to drive the input of another shift register and a 200 fF load. The choice of DRAM as the memory cell resulted in shorter word lines, hence obviating word line drivers. This resulted in a significant reduction in power consumption.
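The 200 fF figure follows directly from the per-cell numbers above. The short sketch below reproduces the arithmetic, assuming one write transistor gate per cell on the word line and the ten cells per word line of the 10x128 array. The function name is illustrative.

```python
def word_line_load_fF(n_cells, line_cap_fF, gate_cap_fF):
    """Total word line load in fF: wiring plus one transistor gate per cell."""
    return n_cells * (line_cap_fF + gate_cap_fF)


# 10 cells per word line, 5 fF of wiring and 15 fF of gate capacitance per cell.
load = word_line_load_fF(n_cells=10, line_cap_fF=5, gate_cap_fF=15)
print(f"write word line load = {load} fF")  # write word line load = 200 fF
```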

As mentioned earlier, a single edge-triggered flip-flop was found to be sufficient to avoid race conditions. This was verified by simulation where two shift registers were cascaded and the clocks for the two were skewed as shown in Figure 4.6. A race condition occurs if the output of the second shift register appears at the same time as the output of the first one. Figure 4.7 shows the output of the second shift register for different values of clock skew. Clearly, no race condition developed for any amount of skew.

#### 4.3.2 Clock Distribution Network Design

It was noted that the ratio of the PMOS to NMOS channel widths was to be kept at 1.4 for the given process if symmetric rise times and fall times were to be maintained



Figure 4.7: Output of the first and the second shift registers for different clock skews.

for the waveforms. Henceforth for inverter designs, this ratio was used. A minimum sized inverter therefore had a PMOS of width 2.4  $\mu$ m and NMOS of width 1.8  $\mu$ m.

The clock distribution network for the shift registers was designed to minimize the skew. Figure 4.8 shows the complete clock distribution network. Four TSPC cells were driven by one clock buffer at the final stage. At the third level of the tree, four final-stage buffers were driven by one third-stage buffer. At the second level of the tree, four third-stage buffers were driven by a second-stage buffer. At the root of the clock distribution tree, two second-stage buffers were driven by a first-stage buffer. The buffers were laid out so that the lengths of the signal lines were about the same, thereby minimizing the clock skew.
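As a quick consistency check, the fan-outs quoted above (two at the root, four at every later level) can be multiplied out to count the buffers at each level and the number of TSPC cells served. The helper below is an illustrative sketch, not part of the design flow.

```python
def clock_tree_counts(fanouts):
    """Return the number of nodes at each level, from the root downward.

    fanouts[i] is the number of loads driven by each node at level i.
    """
    counts = [1]
    for fanout in fanouts:
        counts.append(counts[-1] * fanout)
    return counts


# The root buffer drives 2 buffers; every later buffer drives 4 loads.
print(clock_tree_counts([2, 4, 4, 4]))  # [1, 2, 8, 32, 128]
```

The last entry, 128, matches the number of cells in the 128-bit shift register.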

#### 4.4 Layout

The layout was done keeping in mind the desired wiring of the circular shift register cells, which obviates a very long wire to connect the first and the last cell. Hence an extra signal route was provided in each cell and two kinds of cells were laid out: one for forward propagation of the signal and one for backward propagation. The cells were designed to be  $45 \lambda$  in height and  $156 \lambda$  in width. The cells were laid out in such a manner that adjacent cells were abutted together to make the power rails common. This resulted in an overlap of  $6 \lambda$  so the final cell height was  $39 \lambda$ . The layout of a single TSPC cell is shown in Figure 4.9.

#### 4.5 Simulation Results

The device dimensions for the shift register were optimized to meet the goals of sharp rise and fall times and minimal delay on the output while keeping the power dissipation reasonable. Various device sizes are listed in Table 4.1 and the rise and the fall time and the delays are listed in Table 4.2. The sizes of the various clock buffers are listed in Table 4.3. These sizes have been reported as multiples of the minimum sized inverter which has a PMOS device of width 2.4  $\mu$ m and an NMOS device with width 1.8  $\mu$ m. Note that all the device lengths are minimum size, i.e., 1.2  $\mu$ m.

The full 128-bit shift register with the complete clock buffer network and loaded with ten memory cells was simulated for one complete cycle. Simulations show one complete



Figure 4.8: Clock distribution network.



Figure 4.9: Flip-Flop layout.

| Device | Width $(\mu m)$ |
|--------|-----------------|
| M1     | 3.6             |
| M2     | 5.4             |
| M3     | 3.6             |
| M4     | 11.4            |
| M5     | 12.0            |
| M6     | 6.0             |
| M7     | 12.0            |
| M8     | 13.2            |
| M9     | 15.0            |
| M10    | 7.8             |
| MR1    | 1.8             |
| MR2    | 1.8             |

Table 4.1: TSPC Shift Register Cell Device Parameters.

| Performance Measure | Value (ns) |
|---------------------|------------|
| Rise Time           | 0.26       |
| Fall Time           | 0.30       |
| Delay (rise)        | 0.37       |
| Delay (fall)        | 0.43       |

Table 4.2: Performance Measure of TSPC Shift Register.

| Buffer | Size |
|--------|------|
| 1      | 40x  |
| 2      | 30x  |
| 3      | 30x  |
| 4      | 20x  |

Table 4.3: Clock Buffer Sizes.

circle of bits shifting with correct functionality with good rise and fall times and delay for the current in the current sense circuit. Figure 4.10 shows the currents in the current sense circuit corresponding to the first column and the fourth and the fifth row of the memory array. Both the memory cells had a "1" stored.

#### 4.6 Experimental Results

A layout error in the sixteen bit ring pointer shift register prevented us from directly testing the performance of the TSPC ring pointers. However, the room temperature experimental results of the 16 bit and 4 bit wide FIFO confirmed the operation of the TSPC based ring pointers in the FIFO. The FIFO was configured such that the read and the write shift registers were driven by the same clock. Therefore any data that was written into the FIFO was immediately read out. The reset signal was applied every 32 clock cycles. This meant that the shift registers were correctly initialized and data circulated through the entire length of the circular shift registers twice. The frequency of operation of the complete FIFO was 10 MHz.

Since the threshold voltages of both the PMOS and the NMOS devices were significantly more than the specified amount, the power supply needed to be increased to 1.5 V even at room temperature. This meant that the power supply voltage required for proper operation of the circuit at low temperatures would be more than the impact ionization energy of the electrons so it would lead to rapid failure of the circuit.



Figure 4.10: Simulation of the complete read path.

## Chapter 5

# Memory Cell

We describe the design of a three transistor DRAM cell. After reviewing the standard CMOS SRAM and DRAM, we describe the necessary modifications for zero impedance current sensing. We then describe the design of the memory cell. Finally we present the simulation results.

#### 5.1 CMOS Memory Design at 4K

We first review the design of the standard CMOS static and dynamic random access memories.

#### 5.1.1 Static Random Access Memory (SRAM)

The standard CMOS SRAM is shown in Figure 5.1 [13]. The basic cell consists of two cross-coupled inverters. Access to the cell is enabled by the word line which controls the two pass transistors M5 and M6. During the write operation, the bit lines are impressed with the proper voltage and the pass transistors are activated. This causes the cross-coupled inverters to toggle if the previously stored value is different from the one currently being stored, and the correct value is stored in the cell. For instance, suppose the cell initially stored a value "1". This means that Q was high and  $\overline{Q}$  was low. While writing a "0" in the cell, BL is held low and  $\overline{BL}$  is held high. Initially M4 and M2 are OFF. The applied voltages at the bit lines pull node Q towards ground and node  $\overline{Q}$  towards the supply rail. This causes the cross-coupled inverters to toggle and M2 and M4 turn ON. This causes node



Figure 5.1: CMOS SRAM cell.

Q to be pulled all the way to ground and node  $\overline{Q}$  to be pulled all the way to the supply rail due to positive feedback.

During a read operation, the bit lines are typically precharged to half the power supply voltage and the word line is turned ON. This causes the bit lines to charge towards the supply voltage or discharge to ground through the pass transistors M5 and M6 depending upon the data stored in the memory cell. When the voltage difference between the two bit lines is sufficiently large, it is sensed by a sense amplifier and the proper value is read out.

#### 5.1.2 Dynamic Random Access Memory (DRAM)

The standard three transistor DRAM cell is shown in Figure 5.2 [13]. The cell is written by placing the appropriate data value on the bit line BL1 and asserting the write-



Figure 5.2: Three-transistor DRAM cell.



Figure 5.3: One transistor DRAM cell.

word line WWL. This causes M1 to turn ON and data is transferred to the charge storage node. Once the write-word line is lowered, the datum is retained as charge on the capacitor  $C_s$ . For a read operation, the bit line BL2 is precharged to the supply voltage and the read word line RWL is raised, which turns M3 ON. The series connection of M2 and M3 pulls the bit line down if "1" is stored in the cell. This is sensed and inverted for correct polarity of the read-out data. Data is periodically refreshed to compensate for the charge leakage from the storage capacitor. A refresh cycle simply consists of reading the cell and writing the same value back into the cell. It should be noted that the read-out operation does not destroy the stored data.

Another simplified memory cell, called the one transistor DRAM cell, is shown in Figure 5.3 [13]. During the write operation, the datum is placed on the bit line BL and the word line WL is raised. Depending upon the data value, the storage capacitor  $C_s$  is either charged or discharged. Before the read operation, the bit line is precharged to a voltage  $V_{precharge}$ . When the word line is asserted, charge redistributes between the bit line capacitor and the storage capacitor. This results in a voltage change on the bit line and the direction of this voltage change determines the value of the data stored. Since the bit line capacitance is much larger than the storage capacitance, every read operation destroys the charge stored in the cell capacitor, and the cell has to be refreshed after every read. This is achieved by imposing the output of the sense amplifier on the bit line during read-out. Normal refresh operation is also performed if the memory is not read out for a long time.
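The size of the bit line signal and the reason the read is destructive both follow from charge conservation. The sketch below uses the standard charge-sharing expression. The capacitance and voltage values are hypothetical and chosen only for illustration, and the function name is not from the original design.

```python
def bitline_after_read(c_s, c_bl, v_cell, v_pre):
    """Bit line voltage after charge redistribution with the cell capacitor."""
    # Total charge is conserved when the access transistor connects C_s to C_bl.
    return (c_s * v_cell + c_bl * v_pre) / (c_s + c_bl)


# Hypothetical values for illustration only (not taken from the design).
C_S, C_BL = 20e-15, 400e-15   # storage and bit line capacitance, farads
V_PRE = 0.5                   # precharge voltage, volts

for v_cell in (0.0, 1.0):
    v_bl = bitline_after_read(C_S, C_BL, v_cell, V_PRE)
    print(f"stored {v_cell:.1f} V -> bit line moves {1e3 * (v_bl - V_PRE):+.1f} mV")
```

With these numbers the bit line moves by only about 24 mV in either direction, while the cell node ends up at nearly the precharge voltage, so the original stored level is lost.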

In order to obtain the maximum possible throughput, an approach which does not require periodic refresh would be desirable. Infinite retention of charge on dynamic nodes at cryogenic temperatures enables correct operation of DRAMs without requiring refresh due to charge leakage. Compared to SRAM, the DRAM cell is more compact so the capacitive load on the word line drivers is lowered considerably; as mentioned in Section 4.3.1, this choice obviates word line drivers for this design. The one transistor memory cell is even more compact but it requires refresh for every read cycle. This is expensive because, as described in the next section, the read out is current mode as against voltage mode. A refresh operation would therefore require milli-volt level Josephson sense signals to be amplified to volt level signals to ensure proper refresh.

Another advantage of the DRAM approach is that it requires only one bit line per cell for each of the read and write operations. This results in a significant reduction in power dissipation for the bit line drivers of the interface circuit as compared to the SRAM approach, which requires both the data and its complement to be present, hence necessitating two read bit lines and two write bit lines.

#### 5.2 DRAM Modifications for Current Mode Read

As mentioned previously, zero impedance current sensing allows current mode for read operation of the memory cell. Figure 5.4 shows the three transistor DRAM cell modified for current mode read [10]. The write operation proceeds exactly as for the normal DRAM cell. During a read operation, the read word line (RWL) is asserted and this causes a current to flow through the read bit line (RBL) if a "1" is stored in the cell.

The advantage of this read-out scheme over the conventional voltage based readout schemes is that the bit line capacitance does not need to be charged and discharged for every read operation. This results in significant power reduction over voltage read-out schemes. Also since the current sense is zero impedance, the effect of error due to cell offset is also minimized.

#### 5.3 Design Specifications and Implementation

The read word transistor M3 was designed to be minimum size because the read path is current mode and hence does not require charging or discharging of the large bit line capacitance. The storage transistor M2 was designed such that it was large enough to be



Figure 5.4: Three transistor DRAM cell with current mode read-out.

able to generate sufficient current when a "1" is read out for sensing by the superconductive current sense circuit. This current value was determined to be 100  $\mu$ A and the required width of M2 was 7.2  $\mu$ m. The write word transistor was designed to write the data on a 20 fF storage capacitor within 1 ns. The required width for this transistor was 7.2  $\mu$ m.

#### 5.4 Layout

The dimensions of the memory cell were determined by the height of the TSPC shift register cell and the width of the interface amplifier. The resulting cell size was  $39\lambda$  by  $38\lambda$ . The word lines were pitch-matched with the read and the write shift registers while the bit lines were pitch-matched to the interface amplifier. The layout of one DRAM cell is shown in Figure 5.5.

#### 5.5 Simulation Results

Figure 5.6 shows the simulation results of the DRAM demonstrating the read and write operation of a "1" followed by a "0". During the first cycle, a "1" is stored in the memory cell and the charge storing node is charged to 550 mV. This results in a 127  $\mu$ A current during the read-out cycle. During the second write cycle, a "0" is written on the cell and no current results during the second read-out phase.

Since the charge storing node is charged to 550 mV when a "1" is stored, the voltage at the write bit line need not rise to 1 V during the write phase to write a "1". It can be shown that the write bit line needs to swing to a minimum of 600 mV to be able to write a "1".

Figure 5.5: Three transistor DRAM cell layout.



Figure 5.6: Simulation results for the three transistor DRAM cell.


## Chapter 6

# Josephson-CMOS Interface Circuits

We begin by describing the constraints for the interface circuit for this application. We then review Suzuki stack preamplification. We also describe the modifications to the standard Suzuki stack structure tailored for this application. We then describe the basic interface amplifier stage. We describe the design of the interface amplifier and give the simulation results. Finally we discuss the calibration of the interface amplifier to remove the offsets.

#### 6.1 Interface Constraints

The interface circuit converts millivolt level Josephson signals to volt level CMOS signals. For this application, the interface must operate on a single phase clock and drive a bit line capacitance of approximately 400 fF. Simulations show that the bit lines need to swing to a minimum of 600 mV to write a correct "1" on the cell. The interface amplifier must also be pitch-matched to the bit lines of the memory cell, and its power dissipation must be kept to a minimum.

#### 6.2 Suzuki Stack Preamplification

A single Josephson junction provides a 2.8 mV signal which is insufficient to drive a CMOS amplifier at high speeds. Hence the Josephson signal is preamplified by a Josephson preamplifier and then fed to a CMOS amplifier. The preamplification stage typically amplifies the signal by a factor of ten to thirty. The overall gain of the system is the product of the gain of the preamplification stage  $A_{pre}$  and the gain of the CMOS stage  $A_{CMOS}$ .

Figure 6.1: Interface architecture.

Lowering the gain requirement for the CMOS amplifier results in higher bandwidth of the CMOS system because the gain bandwidth product is constant in CMOS technology. The overall interface architecture is shown in Figure 6.1.
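To make the gain trade-off concrete, the sketch below computes the CMOS gain that remains to be provided for a given preamplifier gain, using the 2.8 mV junction signal quoted above. The 1 V output target is an assumed full-swing value, and the function name is illustrative.

```python
V_JUNCTION = 2.8e-3   # single-junction signal from the text, volts


def required_cmos_gain(v_out, a_pre, v_in=V_JUNCTION):
    """CMOS gain needed so that a_pre * a_cmos * v_in reaches v_out."""
    return v_out / (a_pre * v_in)


# v_out = 1 V is an assumed full-swing target, not a figure from the text.
for a_pre in (10, 20, 30):
    gain = required_cmos_gain(1.0, a_pre)
    print(f"A_pre = {a_pre:2d}: A_CMOS >= {gain:.1f}")
```

Under this assumption the required CMOS gain falls from roughly 36 to roughly 12 as the preamplifier gain rises from ten to thirty.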

#### 6.2.1 Suzuki Stack Structure

The Suzuki stack [14] uses a series-parallel array of Josephson junctions to generate a voltage ideally equal to  $nV_g$  where n is the number of junctions in the stack and  $V_g$  is the gap voltage. In practice, not all junctions will switch so the interface circuit must be designed to handle such cases. Figure 6.2 shows a typical Suzuki stack along with the bias circuit.

The Suzuki stack is typically biased with a current  $I_{bias}$  which is less than twice the critical current of the junctions. When there is no input current,  $I_{bias}$  divides equally between the two branches of the stack and the junctions are in the superconducting state since the resistors are of equal values. When a current  $I_{in}$  is applied, it divides equally in the two branches. The first half passes to ground through the resistor of the first branch while the other half goes to the second branch through the stack of junctions in the first branch. Thus the total current through the stack of junctions in the first branch reduces by  $I_{in}/2$  while the current through the stack of junctions in the second branch increases by the same amount. This current causes the junction with the lowest critical current in the second branch to switch. Ideally, all the junctions should switch at the same time but process variation will cause spread in critical currents. After the first junction has switched, the bias current is diverted to the first branch in which the junctions are still in the superconducting state. This will cause all the junctions in the stack to switch to the



Figure 6.2: The Suzuki stack.

resistive state and the bias current is diverted to the second branch which causes the rest of the junctions to switch to gap voltage. This results in an output voltage which is n times the gap voltage. The stack is then reset by switching the bias current off.

Dynamic performance of the stack is limited by the reset time of the junctions. Figure 6.3 shows the equivalent circuit of a resetting stack [9]. The time constant for resetting of the stacks can be modeled as

$$\tau = \frac{n}{2} R_{sg} \left( C_l + \frac{2}{n} C_j \right) \tag{6.1}$$

where  $\tau$  is the resetting time constant, n is the stack height,  $C_l$  is the load capacitance,  $C_j$  is the junction capacitance and  $R_{sg}$  is the sub-gap resistance of the junctions. This equation suggests that the load capacitance should be minimized to achieve high speed operation. The stack height should also be minimized to reduce the time constant. However, the stack height should be large enough to overcome the input offset voltage of the CMOS amplifier and provide sufficient drive at the CMOS amplifier inputs. It was found that a stack height of ten to twenty was sufficient to meet the throughput requirements depending on the CMOS amplifier design.
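Equation 6.1 can be evaluated directly to see how the time constant scales with stack height. In the sketch below the element values are hypothetical and chosen only to show the trend, and the function name is illustrative.

```python
def reset_time_constant(n, r_sg, c_l, c_j):
    """Equation 6.1: tau = (n/2) * R_sg * (C_l + (2/n) * C_j)."""
    return (n / 2) * r_sg * (c_l + (2 / n) * c_j)


# Hypothetical element values, chosen only to show the trend.
R_SG = 100.0     # sub-gap resistance per junction, ohms
C_J = 0.5e-12    # junction capacitance, farads
C_L = 50e-15     # load capacitance, farads

for n in (10, 20, 30):
    tau = reset_time_constant(n, R_SG, C_L, C_J)
    print(f"n = {n:2d}: tau = {tau * 1e12:.0f} ps")
```

Expanding the expression gives $\tau = \frac{n}{2} R_{sg} C_l + R_{sg} C_j$, so the junction term does not depend on n while the load term grows linearly with the stack height.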



Figure 6.3: Equivalent circuit of a resetting stack.

#### 6.2.2 Suzuki Stack Biasing and Output Modification

As described earlier, the Suzuki stack requires a reset phase where the bias current is switched off. The clocked bias for the Suzuki stacks can either be provided in the CMOS system or externally from the superconducting system. The advantage of using the superconducting scheme is lower power dissipation. The CMOS scheme results in a higher power dissipation because current is drawn from the supply. However, a big advantage of the CMOS scheme is the ease of clock synchronization. The same clock which drives the interface amplifier can be used to clock the bias circuit. This turns out to be a significant advantage over the superconductor based biasing scheme.

Figure 6.4 shows the bias circuit for the Suzuki stack. This consists of a cascoded NMOS current source. NMOS transistors are chosen because higher mobility of an n-channel device results in lower bias and clock transistor size thereby reducing the clock load and area. Transistor M2 does not suffer from body effect because the switching of the Suzuki stack causes the source voltage to change only by a couple of tens of millivolts. This does not produce significant body bias to alter the current of the cascode current source.

The circuit in Figure 6.4 can be further modified to improve stack resetting. This is achieved by adding a resistor in parallel with the load capacitor. Figure 6.5 shows the resultant circuit. In effect, this resistor comes in parallel with the subgap resistance of the junctions and reduces the resetting time constant. This load resistor is implemented as



Figure 6.4: CMOS bias for a Suzuki stack.

a transistor biased in the linear region. Figure 6.6 shows the simulation results for a ten junction Suzuki stack with and without the transistor at the output of the stack. Adding the transistor at the load causes a significant reduction in the reset time. For 1 GHz operation, we observe that without the transistor shunting the load, the stack does not reset fully and this may cause false switching of the interface amplifier.

For a fully differential CMOS amplifier, the Suzuki stack preamplifier requires four lines for each column of memory, two for the bias and two for the output. This takes up a lot of pads to interface the CMOS part of the FIFO with the Josephson part. A different approach is to take the output from the same node as the bias current node. Notice that the voltage at this node will be higher than the output voltage by an amount equal to  $I_{bias}R$ , which is typically negligible for low values of R. For the Suzuki stack, the typical values of R are of the order of 1  $\Omega$ . Also, this voltage appears as a systematic offset for a single-ended circuit, while for a differential implementation it appears as a common-mode voltage. Therefore, for our implementation, this will not affect the performance of the amplifier.

#### 6.3 Interface Amplifier

We now describe the design of the differential CMOS interface amplifier. The amplifier is single stage and consists of cross-coupled flip-flops which drive an output driver



Figure 6.5: CMOS bias for a Suzuki stack with improved reset.

buffer chain which drives the bit line capacitance. The interface amplifier is shown in Figure 6.7.

The front-end flip-flop is a dynamic cross-coupled structure. Transistors M2, M6, M3 and M7 form the cross-coupled structure. When the clock  $\Phi$  is low, M1 and M4 precharge the dynamic nodes to the supply voltage. When the clock goes high, M9 activates the flip-flop. Transistors M5 and M8 convert the differential voltage at the input to a current which discharges the cross-coupled nodes. The voltage differential which develops at the cross-coupled nodes due to the difference in currents causes the flip-flop to switch depending on the input voltage. This voltage difference is latched by the two latches which in turn drive the buffer train.

#### 6.4 Design Specification and Implementation

As mentioned earlier, a 10x128 memory array was laid out first and drive requirements for shift registers and interface amplifiers were obtained by extracting the line capacitance and MOS capacitances on those lines if any. The write bit line capacitance was extracted to be 412 fF. Since the substrate is frozen out at cryogenic temperatures, substrate and sidewall capacitances for the n and p diffusion were neglected. Thus the amplifier had to be designed to drive a 412 fF capacitive load at 1 GHz while maintaining a swing of at least 600 mV.

| Device | Width $(\mu m)$ |
|--------|-----------------|
| M1     | 9               |
| M2     | 10.8            |
| M3     | 10.8            |
| M4     | 9               |
| M5     | 3.6             |
| M6     | 7.2             |
| M7     | 7.2             |
| M8     | 3.6             |
| M9     | 18.0            |
| M10    | 3.6             |
| M11    | 3.6             |
| M12    | 3.6             |
| M13    | 3.6             |
| M14    | 3.6             |
| M15    | 3.6             |

Table 6.1: Interface Amplifier Device Parameters.

| Buffer | Size |
|--------|------|
| 1      | 1x   |
| 2      | 3x   |
| 3      | 9x   |
| 4      | 27x  |

Table 6.2: Bit Line Driver Buffer Sizes.



Figure 6.6: Stack output for standard and shunted CMOS loads.

The device sizes were optimized for lowest power dissipation. Using a clocked power supply resulted in a significant reduction in power consumption. This was achieved by transistor M9. During the phase when the clock is low, M1 and M4 are switched ON. Since the capacitors are always precharged to 1 V, M5 and M8 would be ON and would dissipate an excessive amount of power in the "dead" clock phase. This is avoided by including M9, which is switched OFF during this phase.

Various device parameters are listed in Table 6.1 and the buffer sizes are listed in Table 6.2. Note that the buffer sizes have been reported relative to a minimum sized inverter with an NMOS channel width of 1.8  $\mu$m and a PMOS channel width of 2.4  $\mu$m. All three transistors in the bias circuit are 3.6  $\mu$m wide. Note that all devices have a length of 1.2  $\mu$m.



Figure 6.7: Interface amplifier.

#### 6.5 Layout

Since the interface amplifiers were pitch-matched to the memory cells, the width of the interface amplifier was constrained to be as small as possible. A large width would result in long read and write word lines which, in turn, would require word line drivers at the output of the TSPC shift registers. This would mean excessive power dissipation in these line drivers. The layout of the main amplifier without the calibration circuit and the bit line driver buffers is shown in Figure 6.8. The main amplifier was  $44\lambda$  in width and  $161\lambda$  in height.



Figure 6.8: Interface amplifier layout.

#### 6.6 Simulation Results

Figure 6.9 shows the simulation results of the complete interface amplifier along with the buffers. The amplifier was simulated for two data inputs, a "1" followed by a "0". The output has a delay of one clock cycle. Note that the current pulse is high only for 500 ps. This means that this amplifier can also amplify current pulses of very short durations. This results in additional robustness of the amplifier. The overall power dissipation was 830  $\mu$ W. The main amplifier stage dissipated 300  $\mu$ W at 1 GHz while the buffer train dissipated 530  $\mu$ W. The simulation results show the bit line being charged to supply voltage and discharged to ground according to the input data with acceptable rise and fall times.



Figure 6.9: Simulation results of interface amplifier.

#### 6.7 Calibration

Differential interface circuits employ calibrated capacitors at the input of the CMOS amplifiers. The capacitors are calibrated at power-up to reduce the amplifier offset voltage and bias the input transistors. Due to infinite charge storage of capacitive nodes at cryogenic temperatures, calibration is required to be performed only at power-up and the circuit need not be recalibrated. As a result, the calibration process can be low speed and as accurate as possible.

The general approach of calibration is to charge the capacitor on one end of the amplifier to a known voltage and charge the capacitor at the other end by a known amount in every clock cycle. At the end of every clock cycle, the flip-flop output is examined. Initially the flip-flop will switch in the direction determined by the capacitor charged to the known voltage. When the flip-flop switches in the other direction, calibration is complete. The resolution of the calibration process depends on the amount of voltage increment at each clock cycle.

Figure 6.10 shows the interface amplifier with the calibration circuit. For the sake of clarity, clock buffers have not been included. Since calibration is low speed, all transistors are minimum sized. Large poly-to-poly capacitors are employed to minimize the effect of charge injection and provide good resolution for the calibration process. A small minimum size capacitor sets the step size for charging up the bias capacitor on the other side. Absolute capacitance values are not critical for circuit operation. The only requirement is that a large ratio must be maintained between the bias capacitors and the charging capacitor. From the layout, the ratio of the bias and the charging capacitor was determined to be approximately 20. This meant that the offset of CMOS amplifiers could be nulled out to within 50 mV.
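The quoted resolution is consistent with simple charge sharing. As a sketch, let $C_b$ denote the bias capacitor, $C_c$ the charging capacitor and $V_b$ the present voltage on the bias capacitor, and assume ideal switches with no charge injection and $0 \le V_b \le V_{nom}$ (these symbols and assumptions are introduced here for illustration). Connecting the charging capacitor, precharged to $V_{nom}$, to the bias capacitor changes the bias voltage by

$$\Delta V_b = \frac{C_c}{C_b + C_c}\left(V_{nom} - V_b\right) \le \frac{V_{nom}}{1 + C_b/C_c}$$

With $C_b/C_c \approx 20$, and assuming $V_{nom}$ is close to the 1 V supply, the largest step is about $1\,\mathrm{V}/21 \approx 48$ mV, in line with the 50 mV figure.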

The timing diagram for the calibration circuit, shown in Figure 6.11, illustrates the timing of the various clocks of the calibration circuit. The capacitor on the left side is charged to the supply voltage by applying a one shot pulse  $\Phi_x$  to the transmission gate TG3. The charging capacitor is charged to  $V_{nom}$  every time the clock  $\Phi_2$  is high. Upon application of  $\Phi$ , the flip-flop switches in the direction determined by the precharged bias capacitor. This causes the gate of M16 to be high. Therefore, every time  $\Phi_1$  is switched high, transmission gate TG2 is turned ON and causes charge to redistribute between the charging and the bias capacitors. Thus in every clock cycle, the bias capacitor is charged. The application of  $\Phi$ ,  $\Phi_1$  and  $\Phi_2$  continues till the flip-flop switches in the other direction thereby completing



Figure 6.10: Interface amplifier with calibration circuit.



Figure 6.11: Timing diagram for interface calibration.



Figure 6.12: Simulation of interface calibration.

the calibration process. Figure 6.12 shows the voltages on the two capacitors for a full calibration cycle. This simulation demonstrates the correct operation of the calibration scheme.
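The calibration loop can be summarized with a small behavioral model. The sketch below is a simplification under stated assumptions: the amplifier is reduced to a comparator with a single input-referred offset, the bias capacitor starts discharged, the switches are ideal, and all numerical values are hypothetical. The function and parameter names are illustrative and do not come from the design.

```python
def calibrate(v_ref, v_nom, offset, cap_ratio=20.0, max_cycles=1000):
    """Step the bias capacitor until the flip-flop decision reverses.

    v_ref     : voltage on the precharged reference capacitor
    v_nom     : voltage the small charging capacitor is refreshed to
    offset    : input-referred amplifier offset (hypothetical model)
    cap_ratio : bias capacitor / charging capacitor
    Returns (final bias-capacitor voltage, number of cycles used).
    """
    share = 1.0 / (1.0 + cap_ratio)   # fraction moved per charge-sharing event
    v_bias = 0.0                      # bias capacitor assumed to start discharged
    for cycle in range(1, max_cycles + 1):
        v_bias += share * (v_nom - v_bias)   # charge redistribution step
        if v_bias >= v_ref + offset:         # flip-flop now switches the other way
            return v_bias, cycle
    raise RuntimeError("flip-flop never switched; v_nom too low for this offset")


if __name__ == "__main__":
    v_final, cycles = calibrate(v_ref=1.0, v_nom=1.2, offset=0.03)
    print(f"stopped after {cycles} cycles at {v_final:.3f} V")
```

In this model the residual error after calibration is bounded by the size of the last charge-sharing step, which is why a large capacitor ratio gives fine resolution.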

It was later realized that the proposed scheme would fail to calibrate the amplifier properly if the amplifier powers up with the voltage at the inverting input initialized to the power supply and the offset correction requires this input to be less than the supply voltage. However, this can be corrected by adding another transmission gate TG4, shown in the shaded box in Figure 6.10, which initializes this node to ground when the one-shot  $\Phi_x$  is applied.



Figure 6.13: Experimental results of the interface circuit operation at 10 MHz.

#### 6.8 Experimental Results

As pointed out earlier, the threshold voltages of the fabricated devices were significantly higher than the specified values. This meant that the devices were no longer conducting at room temperature and the circuit functionality could be verified even at room temperature. However, this required us to increase the supply voltage to approximately 1.5 V even at room temperature operation. We also found that the calibration circuit was nonfunctional in some chips. However, we were able to bypass the calibration process altogether by applying the offsets directly instead of calibrating the circuit and storing them on the input capacitors. Also, at room temperature, the offset charge would leak off quickly requiring periodic recalibration.

Figures 6.13 and 6.14 show the output of the interface amplifier to a "10000000" input at 10 and 70 MHz respectively at room temperature. Note the output voltage is developed by driving the gate of an NMOS device whose drain is connected to the supply rail and the output is taken off the source. In our case the output was developed across the 50  $\Omega$  input resistance of the sampling scope. We also find that the rising and falling edges



Figure 6.14: Experimental results of the interface circuit operation at 70 MHz.

of the main clock cause a glitch at the output. This problem becomes more severe as the clock speed is increased. It is due to capacitive feed-through of the clock to the output nodes through the substrate and to the low threshold voltage at room temperature. At room temperature the threshold voltages of the devices are approximately 0.3 V. Therefore the subthreshold current at room temperature is approximately four orders of magnitude larger than that of transistors with a threshold voltage of 0.7 V at room temperature. Hence any glitch at the nodes of these devices will cause a larger glitch at the output than it would for devices with larger threshold voltages. However, at 4K, the substrate will be frozen out and therefore substrate feed-through will be minimal. The subthreshold conduction will also be substantially reduced (about 12 orders of magnitude lower at 4K than at room temperature), which will also prevent glitching.
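The four-orders-of-magnitude estimate can be checked with a short calculation. The sketch below assumes a room temperature subthreshold swing of 100 mV/decade, an illustrative value that is not taken from the measurements; the function and variable names are likewise hypothetical.

```python
def subthreshold_decades(vt_high, vt_low, swing_v_per_decade):
    """Decades by which subthreshold current rises when Vt drops from vt_high to vt_low."""
    return (vt_high - vt_low) / swing_v_per_decade

# Threshold voltages from the text; the swing is an assumed, illustrative value.
decades = subthreshold_decades(vt_high=0.7, vt_low=0.3, swing_v_per_decade=0.1)
ratio = 10 ** decades

print(f"{decades:.1f} decades, current ratio of about {ratio:.0e}")
```

Under this assumed swing, the 0.4 V difference in threshold voltage corresponds to about four decades of subthreshold current.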

## Chapter 7

## Josephson Current Sense

We describe several techniques for zero-impedance current sensing of the dual port memory cells used in the hybrid Josephson-CMOS FIFO. We will restrict our discussion to voltage state techniques because the sense circuitry must drive the multiplexer which employs edge triggered voltage state logic. After the discussion of the fundamentals of zero impedance current sensing, we describe the single-ended techniques using Josephson junctions.

#### 7.1 Fundamentals of Zero Impedance Current Sensing

Josephson junctions provide a high speed, accurate, pure current-mode alternative for CMOS sense amplifiers in the read path of the FIFO array. Consider the Josephson



Figure 7.1: Josephson junction I-V characteristics.

junction I-V characteristics shown in Figure 7.1. If the junction is biased below its critical current  $I_c$ , the voltage across the junction is zero, i.e., it is in the superconducting state. When the bias current is raised above the critical current (for instance, when additional current is applied from the memory cell) the junction switches to the gap voltage in several picoseconds. Since the RC time constant of current on the shorted bit lines is negligibly small and there is no voltage swing on the bit lines, sensing is limited by the time of flight delay. Inductive time constants are also insignificant because the source impedance from the memory cell is large. In a CMOS based approach, sensing is dominated by the capacitive delay of the bit lines as well as the limited bandwidth of the CMOS sense amplifiers. Josephson junction based circuits can accurately sense 100  $\mu$ A of current with good margins in several picoseconds.

#### 7.2 Single Ended Current Sensing Techniques

Single ended current sensing techniques employ either superconducting quantum interference devices (SQUIDs) or any one of a number of standard Josephson digital gates. Figure 7.2 shows a block diagram of the single ended current sensing operation of a column of memory cells. The single-ended current output of the DRAM array is fed to the current sensing circuit. For a cell storing "0", no current is present in the bit line and the current sensing circuit stays in the superconducting state. For a cell storing "1", the current in the read bit line switches the current sensing circuit to the voltage state. Figures 7.3 [15], 7.4 [16] and 7.5 [17] show the various logic families used for the current sensing.
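The threshold behavior described above can be illustrated with a minimal behavioral sketch. It treats the sensing circuit as a single junction that switches when bias plus cell current exceeds its critical current. The 100 $\mu$A read current is from the text; the gap voltage (a typical niobium value), the critical current, the bias current and all names are assumptions chosen only for illustration.

```python
V_GAP = 2.8e-3    # assumed niobium gap voltage in volts (not stated in the text)
I_CELL = 100e-6   # read current for a stored "1", from the text
I_C = 200e-6      # hypothetical critical current of the sensing junction
I_BIAS = 150e-6   # hypothetical bias, chosen so that I_BIAS < I_C < I_BIAS + I_CELL

def sense(bit):
    """Return the voltage across the sensing junction for one stored bit."""
    total = I_BIAS + (I_CELL if bit else 0.0)
    # Below the critical current the junction stays superconducting (0 V);
    # above it, the junction switches to the gap voltage.
    return V_GAP if total > I_C else 0.0

column = [1, 0, 1, 1, 0]
voltages = [sense(b) for b in column]
decoded = [1 if v > 0 else 0 for v in voltages]
```

In this simplified picture a stored "0" leaves the junction at zero voltage and a stored "1" produces the gap voltage, so `decoded` reproduces `column`.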

#### 7.3 4JL Current Sense

We use a 4JL based current sense circuit as shown in Figure 7.5 [10]. The circuit is biased with a current less than the sum of the critical currents of the junctions in the two branches. This current divides in the two branches in the ratio of the critical currents of the junctions. When an input current pulse is applied, it divides into J2 and the series combination of J1, J3 and J4. This current adds to the bias current of J1 and this junction switches to voltage state. Now the full bias current is diverted to the J3 J4 branch and this causes J3 and J4 to switch. The full bias current is now diverted to the input branch causing J1 to switch. Now the bias current is diverted to the load resistance to produce the


Figure 7.2: Block diagram of single-ended current sensing.



Figure 7.3: MVTL based current sense.



Figure 7.4: RCJL based current sense.



Figure 7.5: 4JL based current sense.

| Parameter  | Value         | Margin     |
|------------|---------------|------------|
| $I_{c1}$   | 67 µA         | $\pm 66\%$ |
| $I_{c2}$   | 67 µA         | $\pm 66\%$ |
| $I_{c3}$   | 200 µA        | $\pm 50\%$ |
| $I_{c4}$   | 200 µA        | $\pm 50\%$ |
| $R_{in}$   | $1 \Omega$    | $\pm 50\%$ |
| $R_{out}$  | $14 \ \Omega$ | $\pm 50\%$ |
| $I_{bias}$ | 200 µA        | $\pm 33\%$ |
| $I_{in}$   | 100 µA        | $\pm 50\%$ |

Table 7.1: 4JL current sense circuit parameters.

desired output voltage.

The sense amplifier is to be designed to sense  $100 \ \mu\text{A}$  of current which appears on the read bit line when a "1" is being read. The output of the circuit is to feed the multiplexer hence it should deliver Josephson level signals at the output. The output voltage is required to have small rise time and reset time so that the multiplexer can latch the correct output in the required time.

Figure 7.6 shows the simulation result of the current sense circuit with an input pattern of "1010". It shows good rise time and reset time at the output. Table 7.1 shows the circuit parameters of the current sense circuit and their margins. The worst case margins for the circuit are $\pm 33\%$. The rise time and the fall time at the output are 37 ps and 24 ps, respectively. Note that the margins reported in Table 7.1 are based on the transient response of the circuit and hence are much tighter than the margins one would obtain considering dc operation only. A set of parameter values was rejected, even if the circuit operated correctly, when the rise time of the output was more than 50 ps or the delay of the output from the rising edge of the clock was more than 50 ps.
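As a small illustration of how the worst case margin follows from Table 7.1, the sketch below tabulates the nominal values and margins and computes the allowed window for each parameter. The dictionary layout and the helper name `window` are hypothetical; only the numbers come from the table.

```python
# (nominal value, symmetric margin in percent) from Table 7.1.
# Currents are in amperes, resistances in ohms.
table_7_1 = {
    "Ic1":   (67e-6, 66),
    "Ic2":   (67e-6, 66),
    "Ic3":   (200e-6, 50),
    "Ic4":   (200e-6, 50),
    "Rin":   (1.0, 50),
    "Rout":  (14.0, 50),
    "Ibias": (200e-6, 33),
    "Iin":   (100e-6, 50),
}

def window(nominal, margin_pct):
    """Return the (low, high) range allowed by a symmetric percentage margin."""
    return nominal * (1 - margin_pct / 100), nominal * (1 + margin_pct / 100)

# The worst case margin is set by the most sensitive parameter.
worst_param = min(table_7_1, key=lambda name: table_7_1[name][1])
worst_margin = table_7_1[worst_param][1]
bias_low, bias_high = window(*table_7_1["Ibias"])
```

Here the bias current sets the worst case margin of 33%, which corresponds to a bias window of roughly 134 $\mu$A to 266 $\mu$A.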



Figure 7.6: Dynamic operation of a 4JL current sense element.

## Chapter 8

# Josephson Demultiplexer

We now describe the design of the demultiplexer. We design the demultiplexer and the multiplexer using edge triggered logic, which requires a two-phase clock [18]. We begin by reviewing the design of the basic logic blocks of the edge triggered logic. We reoptimize the circuit parameters used in a previous design [9], which operated at 5 GHz, for the 10 GHz operation required by our design. We present simulation results of individual blocks at 10 GHz clock speed. We then describe the architecture of the demultiplexer and the circuit modifications needed to incorporate different drive requirements. Finally we present the simulation results for the complete demultiplexer.

### 8.1 Edge Triggered Logic

We review the design of the basic gates used in the design of Josephson multiplexer and demultiplexer using edge triggered logic.

#### 8.1.1 Clock Regulator

It has been shown that edge triggered logic gates operate with good margins only if the clock waveform is of a particular shape. Simulation results demonstrate that gates in the edge triggered logic family fail if the clock edges are too steep. Fortunately, the desired clock shape can be realized using a clock shaping junction. The complete clock regulator consists of a sinusoidal voltage source which drives a resistor in series with a Josephson junction which is shunted by another resistor. The critical current of the junction and the values of the shunt and series resistances depend on the drive capability and the speed of the circuit. The basic clock regulation circuit is shown in Figure 8.1. Figure 8.2 shows the simulation results of the clock shaping circuit with a 10 GHz sinusoidal clock. The circuit



Figure 8.1: Clock shaping circuit.

parameters are shown in Table 8.2.

As the size of the circuit increases, the drive requirement of the clock shaping junction becomes more and more severe at a given clock speed if one junction is used for driving the whole bus for a given clock phase. This requires bigger junctions with more critical current and smaller values of shunt resistance. Another drawback of this approach is that the clock shaping junctions are located far from actual circuit and may be susceptible to process variation. This will reduce the margins of the circuit or might cause it to fail altogether. An obvious rectification of this is to design clock shaping junction for each sub-block of the circuit. This results in reasonable size junctions which can be placed very close to the actual circuit and hence will track any process variation extremely well.

#### 8.1.2 Buffer

The buffer [18] consists of a series stack with a SQUID on the bottom and a single junction on top as shown in Figure 8.3. The critical current of the SQUID is typically larger than the critical current of the junction at the top. If an input current is applied, the critical current of the SQUID is suppressed below the critical current of the series junction. Upon the application of the clock signal $\Phi$, the SQUID switches to voltage state and the current is diverted to the load resistor. If no input current is applied, the SQUID remains superconducting upon the application of $\Phi$ and no current is delivered to the load. The circuit parameters are listed in Table 8.1 while the clock regulation circuit parameters are listed in Table 8.2. Figure 8.4 shows the simulation results of the buffer operation



Figure 8.2: Regulation of a 10 GHz sinusoidal clock.



Figure 8.3: Edge-triggered buffer.

| Parameter | Value             |
|-----------|-------------------|
| $I_{c1}$  | 275 µA            |
| $I_{c2}$  | 160 µA            |
| $I_{c3}$  | 160 µA            |
| $R_s$     | $4 \Omega$        |
| $R_{out}$ | $10 \ \Omega$     |
| $R_d$     | $1 \ \Omega$      |
| L1        | $3.2 \mathrm{pH}$ |
| L2        | $3.2 \mathrm{pH}$ |
| L3        | $6.4 \mathrm{pH}$ |
| L4        | $6.4~\mathrm{pH}$ |

Table 8.1: Edge-triggered buffer parameters.

| Parameter   | Value         |
|-------------|---------------|
| $I_c$       | 275 µA        |
| $R_{shunt}$ | $4 \ \Omega$  |
| $R_s$       | $20 \ \Omega$ |

Table 8.2: Clock shaping circuit parameters for the edge-triggered buffer.



Figure 8.4: Simulation results for edge-triggered buffer.

at 10 GHz clock. Simulations show that the buffer operates with a margin of  $\pm 42\%$  at 10 GHz. The buffer achieves a propagation delay of 7 ps.

#### 8.1.3 Inverter

The inverter [18] consists of a series stack with a SQUID on the top and a single junction on bottom as shown in Figure 8.5. The critical current of the SQUID is typically larger than the critical current of the junction at the bottom. If an input current is applied, the critical current of the SQUID is suppressed below the critical current of the series junction. Upon the application of the clock signal  $\Phi$ , the SQUID switches to voltage state and no current is diverted to the load resistor. If no current is applied, the SQUID remains superconducting upon the application of  $\Phi$  and the lower junction switches, thus diverting current to the load resistor.



Figure 8.5: Edge-triggered inverter.

The circuit parameters are listed in Table 8.3 while the clock regulation circuit parameters are listed in Table 8.4. Figure 8.6 shows the simulation results of the inverter operation at a 10 GHz clock. Simulations show that the inverter operates with a margin of $\pm 42\%$ at 10 GHz. The inverter achieves a propagation delay of 6 ps.
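The switching behavior of the buffer (Section 8.1.2) and the inverter can be summarized in a simple behavioral model. The sketch below is not a circuit simulation. It assumes that $I_{c1}$ in Tables 8.1 and 8.3 is the single series junction, that $I_{c2}$ and $I_{c3}$ are the two SQUID junctions whose critical currents add, and that an applied input halves the SQUID critical current. The suppression factor and all function names are hypothetical.

```python
I_C_SQUID_MAX = 160e-6 + 160e-6   # Ic2 + Ic3 (assumed to be the SQUID junctions), amperes
SUPPRESSION = 0.5                 # hypothetical fraction of SQUID Ic left with input applied

def first_to_switch(input_on, i_c_junction):
    """Return which element of the series stack switches when the clock rises."""
    i_c_squid = I_C_SQUID_MAX * (SUPPRESSION if input_on else 1.0)
    # The element with the lower critical current switches first.
    return "squid" if i_c_squid < i_c_junction else "junction"

def buffer(input_on, clock_on, i_c_junction=275e-6):
    """Buffer output is high only when the clocked SQUID switches."""
    return clock_on and first_to_switch(input_on, i_c_junction) == "squid"

def inverter(input_on, clock_on, i_c_junction=250e-6):
    """Inverter output is high only when the clocked single junction switches."""
    return clock_on and first_to_switch(input_on, i_c_junction) == "junction"

buffer_table = [buffer(x, True) for x in (False, True)]
inverter_table = [inverter(x, True) for x in (False, True)]
```

With the clock applied, the model reproduces the expected truth tables: the buffer output follows the input and the inverter output is its complement. With the clock low, both outputs stay low.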

| Parameter | Value             |
|-----------|-------------------|
| $I_{c1}$  | 250 µA            |
| $I_{c2}$  | 160 µA            |
| $I_{c3}$  | 160 µA            |
| $R_s$     | $4 \Omega$        |
| $R_{out}$ | $10 \ \Omega$     |
| $R_d$     | $1 \Omega$        |
| L1        | $3.2~\mathrm{pH}$ |
| L2        | $3.2~\mathrm{pH}$ |
| L3        | $6.4~\mathrm{pH}$ |
| L4        | $6.4~\mathrm{pH}$ |

Table 8.3: Edge-triggered inverter parameters.

| Parameter   | Value         |
|-------------|---------------|
| $I_c$       | 275 µA        |
| $R_{shunt}$ | $4 \Omega$    |
| $R_s$       | $20 \ \Omega$ |

Table 8.4: Clock shaping circuit parameters for the edge-triggered inverter.



Figure 8.6: Simulation results for edge-triggered inverter.

#### 8.1.4 Buffered AND

We use an approach similar to the MVTL realization of Josephson AND gates [18]. The two input signals are buffered and fed to a junction. The critical current of this junction is such that it does not switch when only one of the buffers delivers current. However, when both the buffers switch, the current is sufficient to switch J4 and this diverts the current to the load resistor. Figure 8.7 shows the edge-triggered buffered AND. The circuit parameters



Figure 8.7: Edge-triggered AND gate.

of the buffers used are the same as the stand-alone buffer except the load resistors have been changed to 7.5  $\Omega$ . The output junction has a critical current of 300  $\mu$ A and output resistor is 10  $\Omega$ . The simulation results for 10 GHz operation are shown in Figure 8.8.
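The threshold condition on the output junction can be illustrated as follows. The 300 $\mu$A critical current is from the text; the current delivered by each switched buffer is not given, so the value below is a hypothetical one chosen to lie between half and all of the output critical current.

```python
I_C_OUT = 300e-6    # critical current of the output junction, from the text
I_BUFFER = 200e-6   # hypothetical current delivered by one switched buffer

def buffered_and(a, b):
    """Return True when the summed buffer currents exceed the output critical current."""
    current = (int(a) + int(b)) * I_BUFFER
    return current > I_C_OUT

truth_table = [(a, b, buffered_and(a, b)) for a in (0, 1) for b in (0, 1)]
```

Any buffer current satisfying $I_{C,out}/2 < I_{buffer} < I_{C,out}$ gives the same truth table in this idealized model; the available margin depends on how far the actual current sits from both limits.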

### 8.2 Demultiplexer Architecture

The demultiplexer converts a 10 GHz bit serial data stream into ten parallel data streams of both inverting and noninverting outputs which drive the Suzuki stack. For a one-to-ten demultiplexer, the simplest architecture is a ten bit shift register with parallel read-out as shown in Figure 8.9 [9]. The demultiplexer timing is shown in Figure 2.6.

We now describe the operation of the demultiplexer. On every rising edge of  $\Phi_1$ , a new bit is shifted into the shift register. After ten cycles, the read-out circuit latches the data. On the rising edge of WCLK which is phase-locked to  $\Phi_1$ , data are latched to the parallel read-out circuit and sent to the output. Given this nature of the architecture, the read-out circuit must be driven from shift register cells which operate on  $\Phi_2$  to maintain



Figure 8.8: Simulation results for edge-triggered AND gate.



Figure 8.9: Josephson demultiplexer architecture.

proper timing. Also the read-out circuit must latch the data for the entire time that WCLK is high to ensure robust operation in the presence of clock skews between the read-out latches and the Suzuki stacks in the interface amplifier.
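The shift-and-latch operation described above can be sketched behaviorally. The model below ignores the two clock phases and all analog timing: it shifts one bit per cycle and copies the register to the parallel outputs every tenth cycle. The function name and the ordering of the parallel outputs are assumptions made for illustration.

```python
def demultiplex(bits, width=10):
    """Shift serial bits into a register and latch a parallel word every `width` bits."""
    register = [0] * width
    words = []
    for n, bit in enumerate(bits, start=1):
        # A new bit enters the shift register on each high-speed clock cycle.
        register = [bit] + register[:-1]
        # Every `width` cycles the read-out latches copy the register (WCLK edge).
        if n % width == 0:
            words.append(list(register))
    return words

serial_input = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # the "1000000000" test pattern
parallel_words = demultiplex(serial_input)
```

In this model the first bit to enter ends up in the last register position by the time the word is latched, so the single "1" appears at the tenth parallel output.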

### 8.3 Circuit Design

The basic shift register consists of a cascade of two inverters driven by different clock phases. The output of the inverter driven by  $\Phi_2$  drives the read-out circuit which is driven by WCLK which is synchronized with  $\Phi_1$ . Note that several modifications must be made to take into account the new fan-out requirement.

- Reoptimizing the inverters for the new fan-out misaligns the margins of the two inverters in the basic shift register cell. These have to be realigned to widen the bias margins of the resulting circuit.
- The inverter in the read-out circuit must be modified to latch on to the data when WCLK goes high and hold that data until WCLK goes low. This can be achieved by adding a bleed resistor $R_{bleed}$ of 12 $\Omega$ in parallel to the SQUID, which ensures that the single junction remains biased in spite of the changing input.

Figure 8.10 shows the simulation results of the complete shift register for a "1000000000" input. The plot shows a single bit shifting by one position on every clock pulse. Figure 8.11 shows



Figure 8.10: Demultiplexer shift register simulation results.



Figure 8.11: Shift register with read-out latches.



Figure 8.12: Josephson demultiplexer simulation results.

one bit of the shift register and readout latch combination. Figure 8.12 shows the simulation results of the complete demultiplexer for a "1000000000" input. Note that the output appears in the next WCLK clock cycle.

# Chapter 9

# Josephson Multiplexer

We briefly describe the design of the ten-to-one multiplexer for 10 GHz operation. We first describe the multiplexer architecture using ring pointers and buffered AND. Subsequently we describe the multiplexer circuit design for 10 GHz operation using edge triggered logic.

### 9.1 Multiplexer Architecture

The basic architecture of the ten-to-one multiplexer is shown in Figure 9.1 [9]. It consists of a pointer which is high every tenth cycle and a series of AND gates whose outputs are wire-ORed together. The multiplexer timing is similar to that of the demultiplexer except that a second low speed clock which is out of phase with the read clock is required to maintain data synchronization. The shift register requires two 10 GHz clock phases $\Phi_1$ and $\Phi_2$ while all the AND gates are clocked with $\Phi_1$. The first five low speed inputs from the current sense circuit must be synchronized with RCLK while the next five low speed inputs must be synchronized to $\overline{RCLK}$ to maintain throughput at the multiplexer output. Note that this can be trivially achieved by shifting the clock signal of the corresponding current sense circuits by 180°.

The pointer consists of a shift register which shifts a single "1". Hence the input of the AND gate connected to the pointer is high once every ten clock cycles and data at the other input is transferred to the output. Outputs from all ten AND gates are wire-ORed together to produce the desired multiplexed output.
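The pointer, AND and wire-OR structure can be sketched behaviorally. The model below uses a one-hot list as the ring pointer and ignores the two clock phases and the RCLK and $\overline{RCLK}$ split of the low speed inputs; the function name and data layout are hypothetical.

```python
def multiplex(words, width=10):
    """Serialize parallel words using a one-hot ring pointer, AND gates and a wire-OR."""
    pointer = [1] + [0] * (width - 1)   # a single circulating "1"
    serial = []
    for word in words:
        for _ in range(width):
            # Each AND gate passes its data bit only when its pointer bit is high.
            and_outputs = [p & d for p, d in zip(pointer, word)]
            # The wire-OR combines the ten AND outputs into one serial bit.
            serial.append(int(any(and_outputs)))
            # The pointer advances by one position on each high-speed cycle.
            pointer = pointer[-1:] + pointer[:-1]
    return serial

word = [1, 0, 0, 0, 1, 1, 0, 0, 1, 1]   # the "1000110011" test pattern
serial_output = multiplex([word])
```

Because exactly one pointer bit is high in each cycle, the wire-OR never has to resolve more than one active AND output, and the serial output reproduces the parallel word in order.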

The desired shift register with a single circulating "1" can be implemented in the



Figure 9.1: Multiplexer architecture.



Figure 9.2: Circular shift register for the multiplexer.

edge triggered logic because an inverter is available. The basic shift register cell can be implemented either by a cascade of two buffers or two inverters. Consider the buffer-inverter chain shown in Figure 9.2. It consists of nine buffer based shift registers and an inverter based shift register. The circuit operates as a normal ten bit shift register. However, at power-up, $\Phi_1$ turns high before $\Phi_2$ and hence a "1" appears at the inverter output, thereby injecting a single "1" in the shift register loop.

### 9.2 Circuit Design

The ring pointer circuit was designed using inverters and buffers having the same parameters as the basic gates for a fan-out of one. Simulation results demonstrated that no further optimization was required when the fan-out requirement changed from one to two. Figure 9.3 shows the simulation result of the circular shift register for two complete clock cycles. Simulations yield a bias margin of $\pm 20\%$ for the complete pointer circuit at 10 GHz.

Figure 9.4 shows the simulation results of the complete multiplexer for a "1000110011" input. Note that all the buffered AND gates operate on $\Phi_1$, thereby causing an imbalance in clock loading for the two phases. During the layout phase, this is corrected by adding a dummy 1.2 $\Omega$ resistor from the clock source to ground. Bias margins for the clock are $\pm 15\%$ for the complete multiplexer.



Figure 9.3: Circular shift register simulation results.



Figure 9.4: Simulation results for the complete multiplexer.

## Chapter 10

# **Conclusions and Future Work**

We have designed a hybrid Josephson-CMOS FIFO for high performance digital signal processing and communications applications. The FIFO employs 1.2 $\mu$m CMOS technology optimized for cryogenic temperature operation. The superconducting part of the circuit is based on niobium Josephson junction technology with a critical current density of 1000 A/cm<sup>2</sup>. In this conclusion, we summarize the architectural features and key circuit design techniques employed in the FIFO and discuss the expected performance of the overall design.

At the architectural level, we modified the standard low power CMOS ring pointer architecture to take advantage of the hybrid technology. As a result, we first demultiplexed the high-speed Josephson data to parallel streams of slower data rate. We took advantage of the high density of CMOS to implement the memory core. We then multiplexed the data streams back to achieve high speed Josephson output. We thus took advantage of superior memory cells available in CMOS and high speed in superconducting logic.

We employed new circuit techniques to maximize the throughput and minimize the power dissipation for the overall architecture. We used a three-transistor DRAM cell with single ended read and write and pure current-mode read with Josephson junctions instead of the conventional CMOS sense amplifier. The interface circuit provided the required throughput without excessive power dissipation and illustrated the need for customizing the interface design to a particular application. In the Josephson circuits, we employed edge-triggered logic to simplify system timing and easily obtain inverting gates. The resulting multiplexer and demultiplexer required only buffers, inverters and buffered AND gates.

As mentioned earlier, we have not implemented the control logic for the FIFO to check the overflow and empty conditions. Any future design should also include this logic to enable complete verification of the FIFO. The Josephson multiplexer and demultiplexer should also be verified separately before integration in the overall design.

Low power operation and design is possible only by taking advantage of the freeze out effect of the capacitive elements in the circuit. However, full advantage of this cannot be taken without complete characterization of the process. Proper characterization of various line capacitances for this process at cryogenic temperature should also result in lower power dissipation if the circuit is optimized for those capacitance values. This includes line to substrate capacitance, line to line capacitance at the same metal layer, line to line capacitance of different metal layers and line to diffusion area capacitance. Proper dc characterization for the low temperature operation should also result in more optimized designs.

The circuit should also be characterized at liquid nitrogen temperatures for possible future integration with high-$T_c$ YBCO-based superconducting logic, which can operate at liquid nitrogen temperatures.

The Josephson multiplexer and demultiplexer have been designed using the two phase edge triggered logic. Simulation results suggest that the demultiplexer operates at 10 GHz clock speed with sufficiently wide margins if the critical currents within a block track each other well. However, the multiplexer margins were found to be $\pm 15\%$ at 10 GHz clock speeds. Alternative architectures which do not require the use of AND gates can be explored for the multiplexer design. Other logic families, such as MVTL and the new COSL, must also be investigated for better margins and high speed performance.

In summary, we have designed a hybrid Josephson-CMOS FIFO. The FIFO operates at a 10 GHz system clock with the CMOS core operating at 1 GHz. A layout of the complete CMOS part of the FIFO occupies 4 mm × 1.1 mm including the pad area. While future work remains, the basic architecture and design guidelines presented in this report should provide a feasible starting point for any future designs.

# Bibliography

- [1] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1996.
- [2] P. F. Yuh, "A 2-kb Superconducting Memory Chip," IEEE Trans. Appl. Superconductivity, vol. 3, pp. 3013–3021, June 1993.
- [3] Y. Wada, "Josephson Memory Technology," Proc. IEEE, vol. 77, pp. 1194–1207, Aug. 1989.
- [4] U. Ghoshal, D. Hebert, and T. VanDuzer, "Josephson-CMOS Memories," in *ISSCC Dig. Tech. Papers*, pp. 54–55, 1993.
- [5] A. Miller, "From Here to ATM," IEEE Spectrum, vol. 31, pp. 20–24, June 1994.
- [6] J. Kaschmitter, S. Azevedo, G. Roberson, and D. Schneberk, "JJDSP Functional Description," tech. rep., Department of Commerce Advanced Technology Program, 1994.
- [7] Y. Watanabe, Y. Nakasha, Y. Kato, K. Odani, and M. Abe, "A 9.6 Gb/s HEMT ATM Switch LSI with Event-Controlled FIFO," *IEEE J. Solid-State Circuits*, vol. 28, pp. 935–940, Sept. 1993.
- [8] L. Fenstermaker and K. O'Connor, "A Low-Power Generator based FIFO using Ring Pointers and Current-Mode Sensing," in *ISSCC Dig. Tech. Papers*, pp. 242–243, 1993.
- [9] A. Feldman, "Hybrid Josephson CMOS FIFO," Master's thesis, University of California at Berkeley, 1994.
- [10] U. Ghoshal, Josephson-CMOS Memory. PhD thesis, University of California at Berkeley, 1994.
- [11] S. M. Sze, *Physics of Semiconductor Devices*. Wiley, 1981.
- [12] J. Yuan and C. Svensson, "High Speed CMOS Circuit Techniques," IEEE J. Solid-State Circuits, vol. 24, pp. 62–70, Feb. 1989.
- [13] D. A. Hodges and H. Jackson, Analysis and Design of Digital Integrated Circuits. McGraw-Hill, 1988.
- [14] H. Suzuki, A. Inoue, T. Imamura, and S. Hasuo, "A Josephson Driver to Interface Josephson Junctions to Semiconductor Transistors," in *IEDM Dig. Tech. Papers*, pp. 290–293, 1988.
- [15] N. Fujimaki, S. Kotani, T. Imamura, and S. Hasuo, "Josephson Modified Variable Threshold Logic Gates for Use in Ultra High Speed LSI," *IEEE Trans. Electron Devices*, vol. 34, pp. 433–466, Feb. 1989.
- [16] J. Sone, T. Yoshida, and H. Abe, "Resistor Coupled Josephson Logic," Applied Physics Letters, vol. 40, pp. 741–744, Apr. 1982.
- [17] H. Nakagawa, E. Sogawa, S. Kosaka, S. Takada, and H. Hayakawa, "Operating Characteristics of Josephson Four-Junction Logic (4JL) Gates," *Japanese Journal of Applied Physics*, vol. 21, pp. L198–L200, Apr. 1982.
- [18] P. Yuh and C. Yao, "Josephson Edge-Triggered Gates for Sequential Circuits," Applied Physics Letters, vol. 40, pp. 741–744, Apr. 1982.