February 13, 2008
Xilinx Virtex-5 User-Guide Lite
As opposed to wading through more than 1,000 pages of Virtex-5 User-Guide documentation, this "User Guide Lite" boils all the key details down into a few easily-digestible pages.
By Peter Alfke, Xilinx
Editor's Note: Generally speaking we (Programmable Logic DesignLine) are not in the business of publishing user guides for specific device families. But one of my favorite sayings (in addition to someone else exclaiming "My round, I think!") is the classic: "Rules are intended for the guidance of wise men and the blind obedience of fools."
The point is, where do you go to learn more about a specific family of FPGAs, for example? The vendor's data sheets are great if you are already an expert looking for a specific nugget of information, but more often than not they are a pain in the rear end, telling you everything except the fact you're trying to tie down. At the other end of the spectrum are the vendor's main User Guides, but these can number hundreds or thousands of pages and are presented in such excruciating detail as to bring even the strongest amongst us to our knees.
If only there were something in between . . . Which brings us to this article, which is a User-Guide Lite for the Xilinx Virtex-5 family of FPGAs. In fact I think that this is an incredibly good idea. I would love to see the same treatment for all of the major FPGA and CPLD families from all of the vendors. My message is: "If you write them, they will come. . ." So, over the course of time, I hope to build a little "library" of these guides . . . watch this space!
What is the purpose of this paper?
This paper gives potential users an easy-to-grasp idea of the capabilities of Xilinx Virtex-5 FPGAs. It describes the functionality of these devices in far more detail than the data sheet, but avoids the minute implementation details covered in the various Virtex-5 FPGA User Guides.
Any designer contemplating designing with Virtex-5 FPGAs faces a dilemma: The first four pages of the data sheet give very concentrated information about the whole family, without describing the capabilities in enough detail. By comparison, the User Guides give all the details that the designer needs, but – at more than a thousand pages – it may require weeks of work to read and understand all of the details.
This paper describes the capabilities (what you can do) in detail, but leaves out the implementation details (how to utilize the capabilities). The idea is to give the designer enough information to evaluate the capabilities, without requiring weeks of study. This paper should create significant enthusiasm in many designers who before did not have the patience or the motivation to study the full-up User Guides.
Peter Alfke
Configuration
Like all other Xilinx FPGAs, Virtex-5 FPGAs store their customized configuration in SRAM-type internal latches. The array size is between 8 Mb and 79 Mb (1 to 10 MB), depending on device size but independent of the specific user-design implementation, unless compression mode is used. The configuration storage is volatile and must be reloaded whenever the FPGA is powered up. This storage can also be reloaded at any time by pulling the PROG pin Low. Several methods and data formats for loading configuration are available, determined by the levels on the three Mode pins.
Bit-serial configurations can be either Master Serial where the FPGA generates the configuration clock (CCLK) signal, or Slave Serial where the external configuration data source also clocks the FPGA. For byte- and word-wide configurations, Master SelectMap mode generates the CCLK signal while Slave SelectMap mode receives the CCLK signal for the 8-, 16-, or 32-bit-wide transfer. Alternatively, Serial Peripheral Interface (SPI) and Byte Peripheral Interface (BPI) modes interface with industry-standard flash memories and are clocked by the FPGA's CCLK output. JTAG mode uses Boundary-Scan protocols to load bit-serial configuration data.
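As a rough feel for what the array sizes and transfer widths above mean in practice, the sketch below estimates load time. It assumes one bus-wide transfer per CCLK cycle with no protocol overhead, treats 1 Mb as 10^6 bits, and uses a hypothetical 50 MHz CCLK; none of these figures come from the data sheet.

```python
def config_load_time_s(bitstream_bits: int, bus_width_bits: int, cclk_hz: float) -> float:
    """Idealized load time: one bus-wide transfer per CCLK cycle, no overhead."""
    if bus_width_bits not in (1, 8, 16, 32):
        raise ValueError("bus width must be 1 (bit-serial), 8, 16, or 32 bits")
    return bitstream_bits / (bus_width_bits * cclk_hz)


SMALLEST_BITS = 8_000_000   # roughly 8 Mb, the low end quoted above
LARGEST_BITS = 79_000_000   # roughly 79 Mb, the high end quoted above
CCLK_HZ = 50e6              # hypothetical clock rate, for illustration only

for width in (1, 8, 16, 32):
    t = config_load_time_s(LARGEST_BITS, width, CCLK_HZ)
    print(f"{width:2d}-bit transfers: {t * 1e3:8.2f} ms")
```

Under these assumptions the largest device needs well over a second in bit-serial mode, which is one reason the wider SelectMap transfers exist.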
The bitstream configuration information is generated by the Xilinx ISE development software using a program called BitGen. The configuration process always executes the following sequence:
- Detects power-up (Power-On Reset) or PROG being Low.
- Clears the whole configuration memory.
- Samples the mode pins to determine the configuration mode (master or slave, bit-serial or parallel, etc.).
- Loads the configuration data starting with a synchronization word and a check for the proper device code and ending with a cyclic redundancy check (CRC) of the complete bitstream.
- Start-up executes a user-defined sequence of events: releasing the internal reset (or preset) of flip-flops, optionally waiting for the DCMs to lock, activating the output drivers, and making DONE go High.
Dynamic Reconfiguration Port (DRP)
The DRP gives the user easy access to configuration bits and status registers for the following three block types:
- 32 locations for each Clock Tile (both DCM and PLL)
- 128 locations for the System Monitor
- 128 locations for each MGT GTP_DUAL tile
DRP behaves like memory-mapped IO, and can access and modify block-specific configuration bits, as well as status and control registers.
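The memory-mapped view can be illustrated with a toy model. This is only a behavioral sketch: the class name, the 16-bit word width, and the address and data values are assumptions for illustration, and the real port has its own handshake signals that are not modeled here.

```python
class DrpBlock:
    """Toy model of one block's DRP address space (not the real port signals)."""

    WORD_MASK = 0xFFFF  # assumption: 16-bit locations

    def __init__(self, locations: int):
        self.regs = [0] * locations

    def _check(self, addr: int) -> None:
        if not 0 <= addr < len(self.regs):
            raise IndexError(f"address {addr} outside 0..{len(self.regs) - 1}")

    def read(self, addr: int) -> int:
        self._check(addr)
        return self.regs[addr]

    def write(self, addr: int, value: int) -> None:
        self._check(addr)
        self.regs[addr] = value & self.WORD_MASK

    def modify(self, addr: int, mask: int, value: int) -> None:
        """Read-modify-write: change only the bits selected by mask."""
        old = self.read(addr)
        self.write(addr, (old & ~mask) | (value & mask))


clock_tile = DrpBlock(32)        # 32 locations per Clock Tile
system_monitor = DrpBlock(128)   # 128 locations for the System Monitor

system_monitor.write(0x40, 0x1234)           # arbitrary address and value
system_monitor.modify(0x40, 0x00F0, 0x00A0)  # touch one 4-bit field only
print(hex(system_monitor.read(0x40)))
```

The read-modify-write helper reflects the usual way such ports are used: a location often packs several unrelated fields, so only the field of interest should change.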
Encryption, Readback, Compression, and Partial Re-configuration
As a special option, the bitstream can be AES-encrypted to prevent unauthorized copying of the design. The Virtex-5 FPGA performs the decryption using the internally stored 256-bit key that can use battery backup to remain non-volatile.
Most configuration data can be read back without affecting the user operation. Configuration data compression takes advantage of repetition in the configuration data structure. In most cases, configuration is an "all-or-nothing" operation, but the Virtex-5 FPGA also supports partial reconfiguration: reconfiguring only a portion of the FPGA while the rest of the logic remains active. In suitable designs, this can greatly improve the versatility of the FPGA.
Subsets of the different logic types such as CLBs, BRAMs, I/Os, etc. can be designated as reconfigurable by using Xilinx PlanAhead and ISE software. A floorplan is created that includes the amount and type of logic required for the hierarchical block of the design that will be partially reconfigured. After the design is implemented, a partial bit file is generated for each component of the design that will be reconfigured.
Downloading the partial bit file is exactly like downloading a full bit file. Simply download the partial bit file to the JTAG, Serial, or SelectMap ports and the FPGA will be partially reconfigured. The Internal Configuration Access Port (ICAP) also supports partial reconfiguration, so that an external interface such as JTAG, Serial, or SelectMap may not be required.
Logic Fabric
Four-input look-up tables (LUTs) have been the mainstay of the logic fabric in FPGAs for almost 20 years. As advances in technology have made regular structures more space-efficient but interconnects more dominant, LUT capacity has been increased from 16 bits to 64 bits (6 inputs).
The LUTs in Virtex-5 FPGAs can be configured as either 6-input LUTs (64-bit ROMs) with one output, or as two 5-input LUTs (32-bit ROMs) with separate outputs but common addresses or logic inputs. Four such LUTs and four flip-flops as well as multiplexers and arithmetic carry logic form a slice, and two slices form a Configurable Logic Block (CLB). Virtex-5 FPGA slices implement multiplexers very efficiently: four 4:1, two 8:1, or one 16:1 multiplexer in any slice. In addition to this, between 25 and 50% of all slices can also use their LUTs as distributed 64-bit RAM or as 32-bit shift registers (SRL32) or as two SRL16s. Modern synthesis tools know how to take advantage of these highly efficient features, but expert users can also instantiate them.
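The "LUT as a 64-bit ROM" idea can be sketched in a few lines. The function names and the bit ordering (first input as least significant address bit, lower 32 bits as the first 5-input LUT) are assumptions made for this illustration, not a statement about the device's actual INIT encoding.

```python
from typing import Callable, Tuple


def lut6_init(func: Callable[..., int]) -> int:
    """Build a 64-bit truth table: bit i holds func() for input pattern i."""
    init = 0
    for addr in range(64):
        inputs = [(addr >> k) & 1 for k in range(6)]  # inputs[0] is address bit 0
        if func(*inputs):
            init |= 1 << addr
    return init


def lut6_read(init: int, addr: int) -> int:
    """Evaluate the LUT: the six inputs simply address one stored bit."""
    return (init >> (addr & 0x3F)) & 1


def dual_lut5_read(init: int, addr5: int) -> Tuple[int, int]:
    """Two 5-input LUTs sharing five inputs: lower and upper 32 bits."""
    a = addr5 & 0x1F
    return (init >> a) & 1, (init >> (32 + a)) & 1


# Any function of six inputs fits in one LUT, for example a 6-input AND ...
and6 = lut6_init(lambda a, b, c, d, e, f: a & b & c & d & e & f)
# ... or a 4:1 multiplexer (four data inputs plus two select inputs).
mux4 = lut6_init(lambda d0, d1, d2, d3, s0, s1: (d0, d1, d2, d3)[s0 + 2 * s1])
# 6-input parity, also readable as two complementary 5-input functions.
xor6 = lut6_init(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)

print(hex(and6), hex(mux4), hex(xor6))
```

The 4:1 multiplexer case shows why a slice with four LUTs can hold four such multiplexers: four data lines plus two select lines use exactly six inputs.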
Clock Management
Each Virtex-5 FPGA has two to six clock management tiles, each consisting of two digital clock managers (DCMs) and one phase-locked loop/phase-matched clock divider (PLL/PMCD). These three subblocks can be used individually or concatenated as desired.
Digital Clock Manager
The DCM can act as a zero-delay clock buffer when a clock signal drives CLKIN, while the CLK0 output is fed back to the CLKFB input. The DCM also provides three additional phases of the input frequency, shifted 90°, 180°, and 270° (CLK90, CLK180, and CLK270, respectively), as well as a doubled frequency CLK2X and its complement CLK2X180. The CLKDV output provides a fractional clock frequency that is phase-aligned to CLK0. The fraction is programmable as every integer from 2 to 16, as well as 1.5, 2.5, 3.5 . . . 7.5.
Frequency Synthesis
Independent of the DCM functionality already described, the frequency synthesis outputs CLKFX and CLKFX180 can be programmed to generate any output frequency that is FIN (the DCM input frequency) multiplied by M and simultaneously divided by D, where M can be any integer from 2 to 33 and D can be any integer from 1 to 32.
Multiplication and division are performed as a combined mathematical operation. Assume FIN = 50 MHz, M = 25, and D = 8. In this case, CLKFX is then 156.25 MHz, even though FIN × 25 = 1.25 GHz, which is well above the maximum frequency of 550 MHz.
If CLKFX is fed back to CLKFB, the CLKFX outputs are phase aligned to CLKIN whenever that is mathematically possible. In the example above, phase alignment occurs on every 8th CLKIN period, which is every 25th period of CLKFX.
If CLKFX is not fed back to CLKFB (i.e. the DFS is used by itself), then the input frequency may be as low as 1 MHz, provided the output meets the minimum frequency requirement of 19 MHz.
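The frequency-synthesis arithmetic and the phase-alignment rule can be checked with a short sketch. The function names are illustrative; the M and D ranges and the worked example (50 MHz, M = 25, D = 8) are taken from the text above.

```python
from math import gcd
from typing import Tuple


def clkfx_hz(fin_hz: float, m: int, d: int) -> float:
    """CLKFX frequency: FIN multiplied by M and divided by D in one step."""
    if not 2 <= m <= 33:
        raise ValueError("M must be an integer from 2 to 33")
    if not 1 <= d <= 32:
        raise ValueError("D must be an integer from 1 to 32")
    return fin_hz * m / d


def alignment_periods(m: int, d: int) -> Tuple[int, int]:
    """(CLKIN periods, CLKFX periods) between coinciding clock edges."""
    g = gcd(m, d)
    return d // g, m // g


print(clkfx_hz(50e6, 25, 8) / 1e6, "MHz")
print(alignment_periods(25, 8))
```

Reducing M/D by their greatest common divisor matters when the two share a factor: with M = 4 and D = 2, for instance, the edges line up on every CLKIN period rather than every second one.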
Phase Shifting
With CLK0 connected to CLKFB, all nine CLK outputs (CLK0, CLK90, CLK180, CLK270, CLK2X, CLK2X180, CLKDV, CLKFX, and CLKFX180) can be shifted by a common amount, defined as any integer multiple of the CLKIN period divided by 256. This shift value can be established by configuration and can also be incremented or decremented dynamically by either 1/256 of the FIN period, or by one internal tap (less than 40 ps).
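To put the 1/256 granularity into time units, here is a minimal sketch. The function name and the 100 MHz input clock are hypothetical and chosen only to give round numbers.

```python
def phase_shift_ps(fin_hz: float, steps: int) -> float:
    """Shift in picoseconds for a given number of 1/256-period steps.

    Negative step counts represent a shift in the opposite direction.
    """
    period_ps = 1e12 / fin_hz
    return steps * period_ps / 256


FIN_HZ = 100e6  # hypothetical CLKIN frequency (10 ns period)
print(phase_shift_ps(FIN_HZ, 1))   # one step
print(phase_shift_ps(FIN_HZ, 64))  # 64 steps, a quarter period
```

At this input frequency one step is about 39 ps, so the period-relative step and the sub-40 ps tap step mentioned above are of similar size; at lower input frequencies the period-relative step becomes proportionally coarser.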
Phase-Locked Loop (PLL)
The PLL can serve as a frequency synthesizer for a wider range of frequencies and as a jitter filter for incoming clocks in conjunction with the DCMs. The heart of the PLL is a voltage-controlled oscillator (VCO) with a frequency range of 400 MHz to 1100 MHz, thus spanning more than one octave. Three sets of programmable frequency dividers (D, M, and O) adapt the VCO to the required application.
The pre-divider D (programmable by configuration) reduces the input frequency and feeds one input of the traditional PLL phase comparator. The feedback divider M (programmable by configuration) acts as a multiplier because it divides the VCO output frequency before feeding the other input of the phase comparator. D and M must be chosen appropriately to keep the VCO within its controllable frequency range.
The VCO has eight equally-spaced outputs (0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°), each of which can be selected to drive one of the six output dividers, O0 to O5 (each programmable by configuration to divide by any integer from 1 to 127).
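The relationship between the three dividers can be written down as a small calculation. This is a sketch under the figures given above (VCO range 400 MHz to 1100 MHz, output division from 1 to 127); the function names and the 100 MHz example input are assumptions.

```python
VCO_MIN_HZ = 400e6
VCO_MAX_HZ = 1100e6


def pll_vco_hz(fin_hz: float, d: int, m: int) -> float:
    """VCO frequency: the loop forces fin / D to equal f_vco / M."""
    f_vco = fin_hz * m / d
    if not VCO_MIN_HZ <= f_vco <= VCO_MAX_HZ:
        raise ValueError(f"VCO at {f_vco / 1e6:.1f} MHz is outside 400..1100 MHz")
    return f_vco


def pll_output_hz(fin_hz: float, d: int, m: int, o: int) -> float:
    """Frequency after one output divider O."""
    if not 1 <= o <= 127:
        raise ValueError("output divider must be an integer from 1 to 127")
    return pll_vco_hz(fin_hz, d, m) / o


def vco_phase_delay_ps(f_vco_hz: float, tap: int) -> float:
    """Delay of VCO phase tap 0..7 (45 degree steps) relative to tap 0."""
    if not 0 <= tap <= 7:
        raise ValueError("tap must be 0..7")
    return tap * (1e12 / f_vco_hz) / 8


# Hypothetical example: 100 MHz in, D = 1, M = 8 gives an 800 MHz VCO.
print(pll_output_hz(100e6, d=1, m=8, o=4) / 1e6, "MHz")
print(vco_phase_delay_ps(800e6, tap=2), "ps for the 90 degree tap")
```

The range check is the practical constraint: many D and M pairs give the same output frequency, but only those that keep the VCO inside its range are usable.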
PLL Programmable Features
The PLL has two input-jitter filter options: low bandwidth or high bandwidth mode. Low bandwidth has the best jitter attenuation, but should not be used with clocks that change frequency rapidly. By comparison, high bandwidth mode has less jitter attenuation and should be used with input clocks that might change their frequency quickly.
Clock Distribution
Ideally, clock lines should be plentiful, reach every flip-flop on the device, and have very short propagation delay and extremely low skew. These requirements are difficult to combine; each Virtex-5 FPGA, therefore, has three different types of clock lines.
Global Clock Lines
In each Virtex-5 FPGA, 32 global clock lines have the highest fan-out and can reach every flip flop and clock enable as well as many logic inputs. There is a limit of 10 global clock lines within any region. Global clock lines must be driven by global clock buffers, which can also perform glitchless clock multiplexing and the clock enable function. Global clocks are often driven from the clock management tile, which can completely eliminate the basic clock distribution delay.
Regional Clocks
Regional clocks can drive all clock destinations in their region as well as the region above and below. A region is defined as any area that is 40 I/O high and half the chip wide. Virtex-5 FPGAs have between 8 and 24 regions. Each regional clock buffer can be driven from any of four clock-capable input pins, and its frequency can optionally be divided by any integer from 1 to 8.
I/O Clocks
I/O clocks are especially fast and serve only the localized IDELAY/ODELAY circuits and the I/O serializer/deserializer (SERDES) circuits, as described further down in the I/O logic section.
Block RAM with FIFO
Every Virtex-5 FPGA has between 32 and 324 true dual-port block RAMs, each having 36K bits.
- Synchronous operation: Each memory access, read and write, is controlled by the clock. All inputs, data, address, clock enable, and write enable are registered. "Nothing happens without a clock." The data output is always latched, retaining data until the next operation. An optional output data pipeline register allows higher clock rates at the cost of an extra cycle of latency. During a write operation, the data output can be made to reflect the previously stored data, the newly written data, or remain unchanged.
- Aspect ratio control: Each port can be configured as 32K × 1, 16K × 2, 8K × 4, 4K × 9, 2K × 18, or 1K × 36. The two ports can have different aspect ratios.
- True dual-port operation: The block RAM has two completely independent ports that share nothing but the stored data.
- The optional Simple Dual-Port primitive dedicates one port as a write port and the other as a read port. The data width can thus be extended to 72 bits for the 36 Kb full block RAM or 36 bits for the "split" 18K block RAM.
- Each block RAM can be divided into two completely independent 18 Kb RAMs.
- Two adjacent block RAMs can be configured as one 64K × 1 true dual-port RAM with no additional logic.
- Error detection and correction: Each 64-bit wide BlockRAM can generate, store and utilize 8 additional "Hamming" bits, and perform single-bit error correction and double-bit error detection (ECC) during the read process. The ECC logic can also be used when writing to, or reading from, external 64/72-wide memories.
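The figure of 8 check bits for 64 data bits matches the textbook bound for a Hamming code extended with one overall parity bit. The sketch below shows only that counting argument; it is not a description of the exact code the device implements.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Add one overall parity bit so double-bit errors can be detected."""
    return hamming_check_bits(data_bits) + 1


for width in (8, 16, 32, 64):
    check = secded_check_bits(width)
    print(f"{width:2d} data bits -> {check} check bits, {width + check} bits stored")
```

For 64 data bits this gives 7 + 1 = 8 check bits and a 72-bit stored word, which lines up with the 72-bit width of the Simple Dual-Port configuration and with the 64/72-wide external memories mentioned above.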
FIFO Controller
The built-in FIFO controller for single-clock (synchronous) or dual-clock (asynchronous a.k.a. multi-rate) operation increments the internal addresses and provides four handshaking flags: full, empty, almost full, and almost empty. The almost full and almost empty flags are freely programmable. FIFO width and depth are programmable like the block RAM, but the write and read ports always have identical width. "First-word-fall-through" is an option that presents the first-written word on the data output even before the first read operation. After the first word has been read, there is no difference between this mode and the normal mode.
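The four flags can be illustrated with a behavioral model. This is a single-clock sketch only: the class name, the offset parameters, and the exact flag thresholds are assumptions, and the real controller's flag latency and dual-clock behavior are not modeled.

```python
from collections import deque


class FifoModel:
    """Behavioral single-clock FIFO with four status flags (illustrative)."""

    def __init__(self, depth: int, almost_empty_offset: int, almost_full_offset: int):
        self.depth = depth
        self.almost_empty_offset = almost_empty_offset
        self.almost_full_offset = almost_full_offset
        self.words = deque()

    @property
    def empty(self) -> bool:
        return len(self.words) == 0

    @property
    def full(self) -> bool:
        return len(self.words) == self.depth

    @property
    def almost_empty(self) -> bool:
        # Assumed meaning: at most `almost_empty_offset` words remain.
        return len(self.words) <= self.almost_empty_offset

    @property
    def almost_full(self) -> bool:
        # Assumed meaning: at most `almost_full_offset` free slots remain.
        return len(self.words) >= self.depth - self.almost_full_offset

    def write(self, word: int) -> None:
        if self.full:
            raise OverflowError("write to a full FIFO")
        self.words.append(word)

    def read(self) -> int:
        if self.empty:
            raise IndexError("read from an empty FIFO")
        return self.words.popleft()


fifo = FifoModel(depth=8, almost_empty_offset=2, almost_full_offset=2)
for word in range(6):
    fifo.write(word)
print(fifo.almost_full, fifo.full, fifo.almost_empty, fifo.empty)
```

The programmable "almost" flags give the producer or consumer early warning, so it can throttle before the hard full or empty condition is reached.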
Digital Signal Processing Element DSP48E Slice
DSP applications use many binary multipliers and accumulators, which are slower, dissipate much more power, and consume more area when implemented in the programmable fabric.
This is why all Virtex-5 FPGAs have dedicated, full-custom, low-power DSP slices (32 to 640). They combine high speed with small size, while retaining programmability and thus user flexibility.
Each DSP48E slice fundamentally consists of a dedicated 25 × 18 bit two's complement multiplier and a 48-bit accumulator, both capable of operating at 550 MHz. The multiplier can be dynamically bypassed, and two 48-bit inputs can feed a single-instruction-multiple-data (SIMD) arithmetic unit (dual 24-bit add/sub/acc or quad 12-bit add/sub/acc), or a logic unit that can generate any one of 10 different logic functions of the two operands.
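The arithmetic described above can be modeled at the bit level. This is a functional sketch with illustrative function names; it models the operand widths and the 48-bit wraparound, not the slice's pipeline registers or opcode controls.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc: int, a: int, b: int) -> int:
    """acc + a*b with a 25-bit by 18-bit signed multiply and a 48-bit accumulator."""
    product = to_signed(a, 25) * to_signed(b, 18)
    return to_signed(acc + product, 48)


def simd_add(x: int, y: int, lanes: int) -> int:
    """Lane-wise add on 48-bit words: 2 lanes of 24 bits or 4 lanes of 12 bits.

    Carries do not propagate from one lane into the next.
    """
    if lanes not in (2, 4):
        raise ValueError("lanes must be 2 (dual 24-bit) or 4 (quad 12-bit)")
    width = 48 // lanes
    mask = (1 << width) - 1
    result = 0
    for i in range(lanes):
        shift = i * width
        lane_sum = ((x >> shift) & mask) + ((y >> shift) & mask)
        result |= (lane_sum & mask) << shift
    return result


print(mac(10, -3, 5))                       # multiply-accumulate
print(hex(simd_add(0x000FFF, 0x000001, 4))) # lane 0 wraps, lane 1 untouched
print(hex(simd_add(0x000FFF, 0x000001, 2))) # same bits as one 24-bit lane
```

The last two calls show the point of SIMD mode: the same input bits produce a carry into bit 12 when treated as a 24-bit lane, but wrap within the lane when treated as four 12-bit lanes.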
The DSP48E slice provides extensive pipelining and extension capabilities that enhance speed and efficiency of many applications, even beyond digital signal processing, such as wide dynamic bus shifters, memory address generators, wide bus multiplexers, and memory-mapped I/O register files. Obviously, the accumulator can also be used as a synchronous up/down counter, and the multiplier can be used as a barrel shifter.
Input/Output (I/O)
The number of I/O pins varies with device and package size from 220 to 1200. Each I/O pin is configurable and can be made to comply with a large number of standards. The User Guide uses three full pages to describe the I/O compatibilities of the various I/O options. With the exception of supply pins and a few dedicated configuration and clocking pins, all other package pins have the same I/O capabilities, constrained only by certain banking rules.
All I/O pins are organized in banks, with 40 pins per bank (20 pins in some banks in the central column). Each bank has one common VCCO output supply-voltage pin, which also powers certain input buffers. Some single-ended input buffers require an externally applied reference voltage VREF. One of every 20 pins can serve that purpose, if required.
I/O Electrical Characteristics
Single-ended outputs use a conventional CMOS push/pull output structure driving High towards VCCO or Low towards ground, and can be put into high-Z state. The user can specify the slew rate and the output strength, which is determined internally by the number of parallel output transistors. The input is always active, but is usually ignored while the output is active. Each pin can optionally have a weak pull-up or a weak pull-down resistor.
Any signal pin pair can be configured as LVDS input pair or output pair. LVDS input pin pairs can optionally be terminated with a 100 Ohm internal resistor.
Digitally Controlled Impedance (DCI)
DCI can control the output drive impedance (series termination) or can provide parallel termination of input signals to VCCO, or even split (Thevenin) termination to VCCO/2. DCI uses two pins per bank as reference pins, but one such pair can also control multiple banks. VRN must be resistively pulled to VCCO, while VRP must be resistively connected to ground. The resistor must be either 1× or 2× the characteristic trace impedance, typically close to 50 Ohms.
I/O Logic
IDELAY and ODELAY
This section describes the available logic resources behind the I/O interfaces. All inputs and outputs can be configured as either combinatorial or registered. Double data rate is supported by all inputs and outputs. Any input or output can be individually delayed by up to 64 increments of ~75 ps each. This is known as IODELAY. The number of delay steps can be set by configuration and can also be incremented or decremented while in use. Since IDELAY and ODELAY share a common delay mechanism, only one of the two can be active per I/O.
For using either IDELAY or ODELAY, the user must instantiate the IDELAY control block and clock it with a frequency close to 200 MHz. Each 64-tap total IDELAY or ODELAY is servo-controlled to be equal to the ~5 ns period of that frequency, thus unaffected by temperature, supply voltage, and processing variations.
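Because the full 64-tap delay line is locked to the reference clock period, the tap size follows directly from the reference frequency. A minimal sketch, with illustrative function names:

```python
TAPS = 64  # total taps in one IODELAY line, per the text above


def tap_delay_ps(ref_clk_hz: float) -> float:
    """Delay of one tap when 64 taps span one reference clock period."""
    return 1e12 / ref_clk_hz / TAPS


def delay_ps(taps: int, ref_clk_hz: float = 200e6) -> float:
    """Total delay for a given tap setting."""
    if not 0 <= taps <= TAPS:
        raise ValueError("tap count must be between 0 and 64")
    return taps * tap_delay_ps(ref_clk_hz)


print(tap_delay_ps(200e6), "ps per tap")
print(delay_ps(32), "ps at mid-scale")
```

With a 200 MHz reference, the ~5 ns period divided by 64 gives roughly 78 ps per tap, consistent with the ~75 ps figure quoted above.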
SerDes
Many applications combine high-speed bit-serial I/O with slower parallel operation inside the chip. This requires a serializer and deserializer (SerDes) inside the I/O structure. Each input has access to its own deserializer (serial-to-parallel converter) with programmable parallel width of 2, 3, 4, 5, 6, 7, 8, or 10 bits. Each output has access to its own serializer (parallel to serial converter) with programmable parallel width of up to 8 bits wide for single data rate, or up to 10 bits wide for double data rate.
System Monitor
Each Virtex-5 FPGA contains exactly one System Monitor circuit. Its heart is a 10-bit 200 ksps analog-to-digital converter that can measure internal supply voltages and device temperature, as well as external voltages that are applied to a dedicated pin pair, or to 16 general-purpose programmable input pin pairs.
- Temperature resolution is a few degrees C, VCC resolution is ~3 mV, external input resolution is ~1 mV.
- Extensive signal storage and analysis tools are available.
- The digital information can be averaged, threshold-detected, and max/min-logged. It can also be used to power down the device when too hot, and can keep on monitoring while the chip is powered down.
The System Monitor starts operating right after power-up, even before configuration begins, so it can monitor supply voltages before, during, and after configuration. The results can be read out via the JTAG TAP.
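The voltage resolution figures quoted above are what a 10-bit converter yields for full-scale ranges of about 1 V and 3 V. Those two full-scale values are assumptions chosen here because they reproduce the ~1 mV and ~3 mV figures; the text itself does not state them.

```python
ADC_BITS = 10
CODES = 1 << ADC_BITS  # 1024 output codes

EXTERNAL_FULL_SCALE_V = 1.0  # assumption, not stated in the text
SUPPLY_FULL_SCALE_V = 3.0    # assumption, not stated in the text


def lsb_mv(full_scale_v: float) -> float:
    """Size of one ADC step in millivolts."""
    return full_scale_v * 1000 / CODES


def code_to_volts(code: int, full_scale_v: float) -> float:
    """Convert a raw ADC code to volts for a given full-scale range."""
    if not 0 <= code < CODES:
        raise ValueError("code must fit in 10 bits")
    return code * full_scale_v / CODES


print(f"external input step: {lsb_mv(EXTERNAL_FULL_SCALE_V):.3f} mV")
print(f"supply sensor step:  {lsb_mv(SUPPLY_FULL_SCALE_V):.3f} mV")
```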
The following features are available in all 'LXT and 'SXT devices, but are not available in the 'LX devices.
Low-Power Gigabit Transceiver
Ultra-fast data transmission between chips, over the backplane, or over longer distances is becoming increasingly popular and important. It requires specialized dedicated on-chip circuitry and differential I/O capable of coping with the signal integrity issues at these high data rates.
Each Virtex-5 LXT or SXT device has between 8 and 24 Gigabit Transceiver-with-low-Power (GTP) circuits. Each of these is a combined transmitter and receiver capable of operating at a data rate between 100 Mb/s and 3.75 Gb/s. The transmitter and receiver are independent circuits, sharing only a common reference clock that uses a PLL to multiply the reference frequency input by certain programmable numbers between 2 and 25, to become the bit-serial data clock. Two GTP transceivers (i.e. two transmitters and two receivers) are combined as a slice using common Fref and PLL but are otherwise independent of each other. Each GTP has a large number of user-definable features and parameters. All of these can be defined during device configuration, and many can also be modified during operation.
Transmitter
The transmitter is effectively a parallel-to-serial converter with a conversion ratio of 8, 10, 16, or 20. The transmitter output drives the PC board with a single-channel differential current mode logic (CML) output signal.
TXOUTCLK is the appropriately divided serial data clock and can be used directly to register the parallel data coming from the internal logic. That incoming parallel data is fed through a small FIFO, and can optionally be modified with the 8B/10B algorithm to guarantee a sufficient number of transitions. The bit-serial output signal drives two package pins with complementary CML signals. This output signal pair has programmable signal swing as well as programmable pre-emphasis to compensate for PC board losses and other interconnect characteristics.
Receiver
The receiver is effectively a serial-to-parallel converter, changing the incoming bit-serial differential signal into a parallel stream of words, each 8, 10, 16, or 20 bits wide. The receiver takes the incoming differential data stream, feeds it through a programmable equalizer (to compensate for PC-board and other interconnect characteristics), and uses the Fref input to initiate clock recognition. There is no separate clock line. The data pattern uses non-return-to-zero (NRZ) encoding and optionally guarantees sufficient data transitions by using 8B/10B encoding. Parallel data is then transferred into the FPGA fabric using the RXUSRCLK clock.
Out-of-band signaling
The GTP transceivers can provide Out-of-Band (OOB) signaling, often used to send low-speed signals from the transmitter to the receiver, while high-speed serial data transmission is not active, typically when the link is in a power-down state or has not been initialized.
PCI Express Endpoint Block
PCI Express is a packet-based high-speed point-to-point bit-serial I/O standard. The differential signal transmission uses an embedded clock, which eliminates the clock-to-data skew problems of traditional wide parallel buses. PCIe Base Specification defines a bit rate of 2.5 Gbps per lane. Using 8B/10B encoding this supports a data rate of 2.0 Gbps per lane, 16 Gbps for 8 lanes, or 64 Gbps for 32 lanes.
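The data-rate figures above follow from the 8B/10B overhead: every 8 payload bits travel as 10 line bits. A short sketch of that arithmetic, with illustrative names:

```python
LINE_RATE_GBPS = 2.5  # per-lane bit rate from the base specification
PAYLOAD_BITS = 8      # 8B/10B: 8 data bits ...
CODED_BITS = 10       # ... are sent as 10 line bits


def pcie_data_rate_gbps(lanes: int) -> float:
    """Aggregate data rate after removing the 8B/10B encoding overhead."""
    if lanes < 1:
        raise ValueError("lane count must be positive")
    return lanes * LINE_RATE_GBPS * PAYLOAD_BITS / CODED_BITS


for lanes in (1, 8, 32):
    print(f"x{lanes}: {pcie_data_rate_gbps(lanes)} Gbps")
```

These are raw data rates per direction; packet headers and other protocol overhead, which this sketch ignores, reduce the usable throughput further.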
Virtex-5 LXT and SXT devices each include one built-in endpoint block compliant with PCI Express base specification 1.1. This block is highly configurable to user requirements, and can operate 1, 2, 4 or 8 lanes. The built-in PCI Express block interfaces to GTP or GTX transceivers for serialization/de-serialization, and to Block RAMs for data buffering. The combined PCI Express block implements the physical layer, data link layer and the transaction layer of the protocol.
Xilinx also provides a configurable ease-of-use soft wrapper that ties the various building blocks – the transceivers, Block RAM and user logic – into a compliant Endpoint solution. The user has control over the following parameters: Lane width, maximum payload size, fabric interface speeds, reference clock frequency, and Base Address register decoding and filtering.
10-100-1000 Mb/s Ethernet Controller
A hard-coded Tri-Mode Ethernet MAC core has been available in the Virtex-4 FX device, where it is coupled to the PowerPC processor. Virtex-5 LXT and SXT devices offer a version that is easily connected to the fabric and to the GTP modules, as well as to the SelectIO interface. This hard-coded version saves fabric resources and design effort. Each LXT or SXT device has 4 EMAC cores (2 blocks with 2 cores each), implementing the LINK layer of the OSI protocol stack. The CORE Generator software GUI helps to configure flexible interfaces to GTP or SelectIO technology, to the fabric and to a microprocessor (when required).
About Xilinx Virtex-5 FPGAs
The Virtex-5 family represents the fifth generation in the Virtex series. Built upon 65 nm triple-oxide technology, ExpressFabric technology, and the ASMBL architecture, the Virtex-5 family includes four domain-optimized platforms for high-speed logic (LX), digital signal processing (SXT), embedded processing and serial connectivity applications (LXT).
Production devices are shipping now and may be purchased online or through Xilinx distributors. For even further cost reductions, the Virtex-5 EasyPath program offers up to 75 percent cost reduction. Visit www.xilinx.com/virtex5 for more information.
For detailed technical information see:
http://www.xilinx.com/support/documentation/virtex-5.htm#19297
Peter Alfke joined Xilinx in 1988 as director of applications engineering. He currently serves as Distinguished Engineer in the Advanced Products Group.
Peter graduated in electronic engineering from the Technical University in Hannover, Germany in 1957. He went on to work in telecom and computer design with LM Ericsson and Litton Industries before moving to California in 1968. He has spent forty years in Applications Engineering with Fairchild, Zilog, AMD, and now Xilinx. Peter holds more than thirty patents, has authored many application notes, and given worldwide seminars on digital integrated circuits. He is active in the newsgroup comp.arch.fpga.