Vector Unit
Programmable acceleration for AI
The Vector Unit processes multiple data elements simultaneously, delivering high-throughput parallel compute for AI and HPC workloads.
What is a Vector Unit?
A Vector Unit is composed of several "Vector Cores", each roughly equivalent to a GPU core, that perform multiple calculations in parallel. Each Vector Core has arithmetic units capable of addition, subtraction, fused multiply-add, division, square root, and logic operations.
Parallel Efficiency
Processes numerous computations simultaneously to maximize speed
Functional Versatility
Handles a wide array of tasks, from basic math to complex logic
Massive Throughput
Leverages a GPU-like architecture to handle heavy data loads effectively
Key Benefits
Optimized performance through massive bandwidth, flexible data widths, and full support for the RISC-V Vector 1.0 (RVV 1.0) specification
Up to 2048-bit data path
A native data path of up to 2048 bits lets the Vector Unit sustain high memory bandwidth
FP & Int, 8b to 64b
Supports integer and floating-point formats from 8 to 64 bits, including bfloat16
DLEN up to 2048b
Data path length can be scaled to your needs, from 128 up to 2048 bits
VLEN up to 4096b
Vector length can be scaled to your needs, from 128 up to 4096 bits
AI-Ready
Tensor instructions seamlessly integrated
RVV 1.0
Implements the complete RISC-V Vector 1.0 Specification
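To make the VLEN and data-format figures above concrete, here is a minimal Python sketch of how many elements fit in one vector register. It relies only on the RVV 1.0 relation VLMAX = LMUL × VLEN / SEW; the function name `vlmax` is illustrative and not part of any product API, and fractional LMUL is left out for brevity.

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul: int = 1) -> int:
    """Elements per vector register group: VLMAX = LMUL * VLEN / SEW (RVV 1.0)."""
    if sew_bits not in (8, 16, 32, 64):
        raise ValueError("SEW must be 8, 16, 32 or 64 bits")
    if lmul not in (1, 2, 4, 8):
        raise ValueError("this sketch only handles integer LMUL of 1, 2, 4 or 8")
    return lmul * vlen_bits // sew_bits


# Largest VLEN quoted above (4096 bits), one register (LMUL = 1).
for sew in (8, 16, 32, 64):
    print(f"VLEN=4096, SEW={sew}: {vlmax(4096, sew)} elements per register")
```

At VLEN = 4096, a single register holds 512 8-bit elements or 64 64-bit elements.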
Programmable acceleration for AI
Our vector core can be tailored to support different data types: FP64, FP32, FP16, BF16, INT64, INT32, INT16 or INT8, depending on the customer’s target application domain. The size in bits of the largest supported data type defines the vector core width, known as ELEN.
You can build a Vector Unit from 4, 8, 16 or 32 vector cores, which covers a very wide range of power-performance-area trade-offs.
Once these choices are made, the total Vector Unit data path width, or DLEN, is ELEN × the number of vector cores. For example, ELEN = 64 with 32 vector cores gives DLEN = 2048b. We support DLEN configurations from 128b to 2048b.
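The sizing rule above can be sketched in a few lines of Python. This is an illustrative calculator, not a vendor configuration tool: the names `TYPE_WIDTH_BITS`, `elen_bits` and `dlen_bits` are hypothetical, and rejecting results outside 128b to 2048b is an assumption based on the DLEN range quoted in the text.

```python
# Bit widths of the data types named in the text.
TYPE_WIDTH_BITS = {
    "FP64": 64, "FP32": 32, "FP16": 16, "BF16": 16,
    "INT64": 64, "INT32": 32, "INT16": 16, "INT8": 8,
}
SUPPORTED_CORE_COUNTS = (4, 8, 16, 32)     # core counts listed in the text
DLEN_MIN_BITS, DLEN_MAX_BITS = 128, 2048   # DLEN range quoted in the text


def elen_bits(data_types):
    """ELEN is the width of the largest data type the core must support."""
    return max(TYPE_WIDTH_BITS[t] for t in data_types)


def dlen_bits(elen, cores):
    """DLEN = ELEN x number of vector cores, checked against the quoted range."""
    if cores not in SUPPORTED_CORE_COUNTS:
        raise ValueError(f"unsupported core count: {cores}")
    dlen = elen * cores
    if not DLEN_MIN_BITS <= dlen <= DLEN_MAX_BITS:
        # Assumption: combinations outside the quoted range are not offered.
        raise ValueError(f"DLEN={dlen}b is outside {DLEN_MIN_BITS}b..{DLEN_MAX_BITS}b")
    return dlen


# An inference-oriented mix: the largest type is FP32, so ELEN = 32.
print(dlen_bits(elen_bits(["FP32", "FP16", "INT8"]), 16))  # 512
# An HPC-oriented mix with FP64 and the largest core count.
print(dlen_bits(elen_bits(["FP64", "FP32"]), 32))          # 2048
```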
Our Vector Unit is equipped with a high-performance, cross-vector-core network that provides all-to-all connectivity between the vector cores at high bandwidth, even in the largest, 32-vector-core configuration.
The cross-vector-core network is used by the instructions in the RISC-V standard that shuffle data between vector cores, such as vrgather and the slide instructions (vslideup, vslidedown).
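To show why these instructions need all-to-all connectivity, here is a small Python reference model of their element-level behaviour as defined in RVV 1.0. It is a sketch under simplifying assumptions (unmasked, vl equal to VLMAX, vectors modelled as plain lists) and says nothing about how the hardware maps elements to vector cores.

```python
def vrgather(vs2, index):
    """vd[i] = vs2[index[i]]; an out-of-range index yields 0."""
    vlmax = len(vs2)
    return [vs2[j] if 0 <= j < vlmax else 0 for j in index]


def vslidedown(vs2, offset):
    """vd[i] = vs2[i + offset]; positions past the end of the source read 0."""
    vlmax = len(vs2)
    return [vs2[i + offset] if i + offset < vlmax else 0 for i in range(vlmax)]


def vslideup(vd, vs2, offset):
    """vd[i + offset] = vs2[i]; elements below offset keep their old value."""
    return [vd[i] if i < offset else vs2[i - offset] for i in range(len(vs2))]


src = [10, 20, 30, 40]
print(vrgather(src, [3, 3, 0, 9]))      # [40, 40, 10, 0]
print(vslidedown(src, 1))               # [20, 30, 40, 0]
print(vslideup([0, 0, 0, 0], src, 2))   # [0, 0, 10, 20]
```

Because any destination element can be sourced from any position in the input, elements may have to travel between arbitrary pairs of vector cores.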
We offer a second key choice in the Vector Unit: the number of bits in each vector register (known as VLEN) can also be tailored to the customer’s needs.
While most other vendors assume that VLEN is equal to DLEN (i.e., a 1X ratio), we offer 2X, 4X and 8X ratios. When VLEN is larger than DLEN, a vector operation takes multiple cycles to execute, which helps tolerate large memory latencies and reduces power.
For example, when VLEN=2048 and DLEN=512, each vector arithmetic operation takes VLEN/DLEN = 4 clocks to execute.
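The same arithmetic can be written as a short Python sketch. It assumes one DLEN-wide slice is processed per clock, a single register per operation (LMUL = 1) and no pipeline start-up cost; the function name is hypothetical, and including the 1X ratio alongside the 2X, 4X and 8X ratios named above is also an assumption.

```python
SUPPORTED_RATIOS = (1, 2, 4, 8)  # 1X is assumed; 2X, 4X and 8X are from the text


def cycles_per_vector_op(vlen_bits, dlen_bits):
    """Clocks to push one VLEN-bit register through a DLEN-bit data path."""
    if vlen_bits % dlen_bits != 0:
        raise ValueError("VLEN must be a multiple of DLEN")
    ratio = vlen_bits // dlen_bits
    if ratio not in SUPPORTED_RATIOS:
        raise ValueError(f"unsupported VLEN:DLEN ratio {ratio}X")
    return ratio


print(cycles_per_vector_op(2048, 512))  # 4, the example from the text
print(cycles_per_vector_op(4096, 512))  # 8
```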