January 27, 2009
Specialized, heterogeneous logic architectures can offer designers the cost and power efficiency they seek while managing development costs and keeping their time-to-market edge.
By Jack Ogawa, Cswitch Corp
The programmable logic industry has remained stubbornly resistant to change over its 20-year existence. If you follow this industry, you know that the FPGA logic fabric implemented in the latest offerings from Altera and Xilinx has not fundamentally changed since its commercial introduction in the 1980s, exemplifying what Harvard Business School Professor Clayton Christensen described in his book "The Innovator's Dilemma".
In his book, Christensen presents a behavioral model in which large, successful companies have trouble discovering and nurturing "disruptive" product technologies that initially do not look attractive but eventually prove superior.
Looking at Altera and Xilinx, you can see that their products have evolved largely due to Moore's Law, which is a "sustaining" technology in Christensen's model. In other words, the incumbent vendors have relied on process technology advancement, rather than true architectural innovation, to sustain the marketability of their products over time.
Now, with programmable logic markets such as carrier Ethernet, data centers, and wireless infrastructure being awakened from their post-bubble slumber by the YouTube generation, the time has come for innovation. Packet-based equipment is moving to the next level of throughput (generically referred to as bandwidth), with increasing touches per packet due to security and quality of service requirements.
Unfortunately, equipment companies that relied on programmable logic to provide flexibility in their hardware during the heady bubble growth of the late 1990s are now finding that the 20-year-old FPGA cannot meet these new challenges, even with the help of Moore's Law. Programmable logic applications are evolving, and today's FPGAs cannot service them.
How wide, how fast?
How dramatic is this problem? It received some attention from Clive Maxfield(1) when Altera announced its Stratix IV family. As noted in Maxfield's article, Altera's 40nm Stratix IV family supports a typical system clock frequency of 350 MHz "across the fabric". While this number is somewhat optimistic (100 to 200 MHz is the range most often cited by designers), it nonetheless highlights the problem that many high-bandwidth designers face today:
"So... we have 8.5 Gbps coming in (for a single serial I/O lane). After we strip out the 8b/10b coding we're left with 8.5 / 10 * 8 = 6.8 gigabits per second. If the receiver converts this into byte-wide chunks, we now have 6.8 / 8 = 0.85 gigabytes per second."
Maxfield goes on to note:
"If all we wanted to do was load these values in to an 8-bit register we'd still need to be clocking our register at 850 MHz."
Clearly, 350 MHz falls well short of 850 MHz. But you can always make your data processing logic more parallel to meet the throughput requirements, right? In Altera's case, for example, the logic fabric interface from the SERDES can be as wide as 40 bits, since the fabric-side interface is limited to 250 MHz(2).
For a 40G application, then, you would need six channels (6 x 6.8 gigabits per second), presenting a total of 240 bits of data that must stay aligned at 250 MHz as they are routed through the device. Yes, you can spread things out, but this is a daunting timing closure challenge, to say the least.
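To see how quickly these numbers add up, here is a minimal back-of-the-envelope sketch in Python. The line rate, coding overhead, and interface limits come from the figures cited above; the six-lane 40G arrangement is the illustrative configuration described in the text, not a vendor specification.

```python
# Back-of-the-envelope throughput math for a serial lane feeding FPGA fabric.
# Figures follow the 8.5 Gbps / 8b10b example above; the 40G lane count is illustrative.

LINE_RATE_GBPS = 8.5          # raw serial line rate per lane
PAYLOAD_RATIO = 8 / 10        # 8b/10b coding overhead
FABRIC_IF_WIDTH_BITS = 40     # maximum SERDES-to-fabric interface width (note 2)
FABRIC_IF_MAX_MHZ = 250       # maximum fabric-side interface clock (note 2)

payload_gbps = LINE_RATE_GBPS * PAYLOAD_RATIO                   # 6.8 Gbps of usable data per lane
byte_clock_mhz = payload_gbps / 8 * 1000                        # 850 MHz if handled one byte at a time
fabric_clock_mhz = payload_gbps / FABRIC_IF_WIDTH_BITS * 1000   # 170 MHz with a 40-bit interface

# A 40G datapath built from these lanes (six lanes, as in the text)
lanes = 6
total_payload_gbps = lanes * payload_gbps               # 40.8 Gbps aggregate payload
total_bus_width_bits = lanes * FABRIC_IF_WIDTH_BITS     # 240 bits to keep aligned

print(f"payload per lane: {payload_gbps:.1f} Gbps")
print(f"byte-wide clock:  {byte_clock_mhz:.0f} MHz")
print(f"40-bit clock:     {fabric_clock_mhz:.0f} MHz")
print(f"40G datapath:     {total_bus_width_bits} bits wide at up to {FABRIC_IF_MAX_MHZ} MHz")
```

Widening the interface buys back clock frequency, but the price is a 240-bit-wide bus that must meet timing everywhere it travels across the fabric.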
Gate efficiency is key
Logic density is another challenge as bandwidth requirements increase. Today's programmable logic is notoriously area-inefficient, making the additional logic required to process more gigabits per second extremely expensive from a power and cost perspective. For example:
10G Ethernet MAC = 10,370 logic elements (4,148 ALUTs(3) x 2.5(4))
40G Ethernet MAC = 41,600 LEs (26,000 LUTs(5) x 1.6(6))
100G Ethernet MAC = 107,200 LEs (67,000 LUTs(5) x 1.6(6))
This implies that every 10 Gbps of Ethernet data terminated requires roughly 10,000 logic elements. A protocol conversion (e.g., 100G Ethernet to Interlaken), which is a common FPGA application, doubles that requirement to 20,000 logic elements per 10 Gbps. A 40G application therefore consumes about 80,000 logic elements simply to terminate its protocols. This is an expensive proposition, especially when you consider the commonly held belief that programmable gates are 20x less area efficient than ASIC gates.
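As a rough sanity check, the following sketch reproduces this arithmetic. The LUT counts and LE conversion factors are taken from the notes below; treating the MAC as the only termination logic is a simplifying assumption.

```python
# Rough logic-element (LE) budget for Ethernet MAC termination, using the
# published LUT counts and LE conversion factors cited in the notes below.

macs = {
    "10G MAC":  (4_148, 2.5),   # ALUTs x ALUT-to-LE factor (notes 3, 4)
    "40G MAC":  (26_000, 1.6),  # LUTs x LUT-to-LE factor (notes 5, 6)
    "100G MAC": (67_000, 1.6),
}

for name, (luts, factor) in macs.items():
    les = luts * factor
    print(f"{name}: ~{les:,.0f} LEs")

# Rule of thumb from the text: ~10,000 LEs per 10 Gbps terminated.
les_per_10g = 10_000
# A protocol conversion (e.g. Ethernet to Interlaken) terminates both sides,
# doubling the cost to ~20,000 LEs per 10 Gbps.
conversion_les_per_10g = 2 * les_per_10g

bandwidth_gbps = 40
total_les = bandwidth_gbps / 10 * conversion_les_per_10g
print(f"40G protocol conversion: ~{total_les:,.0f} LEs just for termination")
```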
For an FPGA architect, the ultimate measure of fabric elegance is gate efficiency. Setting aside exotic process technologies still in development, improving gate efficiency basically means embedding more "hard" gates in the architecture. Embedding increases gate density, performance, and power efficiency, all desirable effects.
However, the trick to embedding in any programmable device is to make the gates configurable so that they are not locked into a single function. With some creativity and proper scope, this is entirely possible. But, therein lies the "Dilemma": with every R&D dollar at the incumbent vendors being held against a return-on-investment (ROI) metric, it will take a brave soul indeed to argue for innovation that has less breadth than their current products. So, where does that leave today's logic designers?
Specialized, heterogeneous logic architectures can offer designers the cost and power efficiency they seek while managing development costs and keeping their time-to-market edge. By using configurable, application-specific embedded elements, these architectures can reduce development costs relative to ASICs and FPGAs, eliminating the timing closure problems of a generic logic fabric while providing ASIC-like performance.
Furthermore, bandwidth bottlenecks can be eliminated by using an interconnect structure designed to support the datapath topologies common to a given application. Imagine a densely populated city such as Tokyo with only one choice of travel: surface streets. Yes, they are the most flexible, serving all destinations, but they are inefficient for traveling any significant distance or for moving large volumes of people. Fortunately, Tokyo has tailored resources, such as freeways and a train system, each with its own merits. Like Tokyo, new programmable logic architectures will offer density with flexibility at the local level, and high performance and efficiency for traveling from function to function.
Configurable Switch Array
One such architecture is the Configurable Switch Array (CSA) offered by Cswitch. The CSA has been designed to support the performance-demanding datapath functions of packet-based applications.
Figure 1. Configurable Switch Array (CSA) high-level architecture.
The cornerstone of this architecture is the interconnect structure. It is a two-level structure, with the fabric-level interconnect designed to offer maximum flexibility, and the upper level, known as the dataCrossconnect (DCC) network, offering up to 2.56 Tbps of cross-sectional bandwidth through a synchronous mesh.
The logic fabric itself is heterogeneous, with an array of configurable embedded elements called Configurable Packet Engines that support header parsing of any protocol, fast look-ups, and fast polynomial calculations common to networking, such as CRC. These engines operate at up to 1 GHz.
Together with the DCC network, which operates at up to 2 GHz, the Configurable Switch Array can process 20 to 100 Gbps streams without undue area and power consumption. Because the speed-critical blocks are embedded, achieving performance is as simple as managing latency in the datapath. Cswitch's Andara development tool suite provides an HDL-based design flow utilizing instantiation and inference technology provided by Magma Design Automation.
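A quick sketch shows what the published DCC figures imply for datapath width. The assumption that cross-sectional bandwidth is simply aggregate bus width times clock rate is mine for illustration, not a Cswitch specification.

```python
# What a 2.56 Tbps cross-sectional bandwidth implies, assuming it equals
# aggregate bus width x clock rate (an illustrative assumption, not a
# Cswitch specification).

dcc_bandwidth_tbps = 2.56
dcc_clock_ghz = 2.0

cross_section_bits = dcc_bandwidth_tbps * 1e12 / (dcc_clock_ghz * 1e9)
print(f"implied cross-sectional width: {cross_section_bits:.0f} bits")   # 1280 bits

# Width needed to carry a single 100 Gbps stream across the mesh at 2 GHz
stream_gbps = 100
stream_width_bits = stream_gbps * 1e9 / (dcc_clock_ghz * 1e9)
print(f"one 100 Gbps stream needs only {stream_width_bits:.0f} bits of that width")
```

Under that assumption, a single 100 Gbps stream occupies only a small slice of the mesh's cross-section, which is the headroom that lets the fabric-level logic run at a more comfortable clock rate.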
Conclusion
As much as the incumbent programmable logic vendors would like to think so, innovation is not defined as riding Moore's Law. While advancement in transistor density provides an increasingly capable piece of silicon, it is up to fabless programmable logic vendors to take advantage of this manufacturing capability and to create products that truly help designers achieve their goals.
In many cases, today's FPGAs are dramatically underserving their markets. Think about this: the FPGA device that is being designed into an LTE cellular baseband processing card is the same device that is being designed into a 10G Layer 2 Ethernet switch, and it is the exact same device that is being designed into a plasma display.
Yes, this breadth helps programmable logic companies stay financially efficient. However, this "lowest common denominator" approach is making FPGAs harder to use and less cost efficient for any given application, and the vendors are ultimately passing this burden on to their customers.
The growth of ASIC cost of ownership has been well documented by industry journalists, but today's FPGAs are facing a similarly escalating cost of ownership. FPGAs don't have the mask and manufacturing costs associated with ASICs, but they do have increasing design capture and verification costs.
Yes, you can verify at system speed in hardware, exposing your design to billions of real vectors instead of thousands of software vectors. But consider this: how long will it take you to close timing on a 20 Gbps traffic manager spread across multiple FPGAs? How many times will you have to re-architect your design to achieve the needed throughput? Will you ignore the hundreds of paths that miss by 1.5 ns simply because there are too many of them? Each of these common FPGA challenges costs you money, much like ASIC verification. The lack of FPGA architectural innovation is driving up the development costs associated with design capture and verification.
What does this mean? With engineering costs rising and time-to-market pressures as strong as they have ever been, the time is right for an innovative programmable logic approach to emerge that better meets the logic needs of specific applications, just as PALs, EPLDs, and CPLDs did in the past. The programmable logic market will again fragment into specialized architectures, but this time the specialization will be manifested in more embedded IP. The time is right for smaller companies such as Cswitch to fill the void and bring customer focus and innovation back to the programmable logic industry.
Notes/References
- "How do we use the data from I/Os running at 8.5 Gbps?" by Clive Maxfield, Programmable Logic DesignLine, May 23 2008. (www.plddesignline.com).
- Stratix IV Device Handbook Vol 2 by Altera, Nov 2008, page 131.
- AN516: 10Gbps Ethernet Reference Design by Altera, Nov 2008, page 5.
- Conversion factor from Altera Stratix IV GX ALMs to logic elements, which is defined as a 4LUT and a register. (www.altera.com).
- 40G/100G Ethernet IP Core data sheet by Sarance, version 1.1, Aug 5 2008, page 2.
- Conversion factor from Xilinx Virtex5 LUTs to logic cells or elements (www.xilinx.com)
Jack Ogawa is the Vice President of Marketing at Cswitch (www.cswitch.com), a fabless semiconductor startup.
Prior to Cswitch, Jack was with Altera Corporation for over 15 years serving key leadership roles in Applications Engineering, Product Marketing, Product Planning, Strategic Marketing, Business Development, and Sales.
Jack's experience includes four years as Senior Director of Marketing and Acting Country Manager for Altera in Tokyo, Japan. Jack, who can be contacted at
jogawa@cswitch.com, holds a BSEE degree from the University of California at Davis.