AYAKA_Transformer
check: hdpe_unit_new.v
- Input Matrix A: 5 rows x 4 columns
- Weight Matrix B: 4 rows x 3 columns
- Output Matrix C: 5 rows x 3 columns
Output Stationary
In Output Stationary, the output values remain in place during computation. Inputs and weights move through the compute array. Matrix shift-in pattern:
                     0    b14  b23  b32  b41              N(i/p)
                     b04  b13  b22  b31  b40                |
                     b03  b12  b21  b30  0        (i/p)W--0--E (o/p)
                     b02  b11  b20  0    0                  |
                     b01  b10  0    0    0                S(o/p)
                     b00  0    0    0    0
0   a03 a02 a01 a00  pe00 pe01 pe02 pe03 pe04 -- c00 c01 c02 c03
a13 a12 a11 a10 0    pe10 pe11 pe12 pe13 pe14 -- 0   c10 c11 c12
a22 a21 a20 0   0    pe20 pe21 pe22 pe23 pe24 -- 0   0   c20 c21
a31 a30 0   0   0    pe30 pe31 pe32 pe33 pe34 -- 0   0   0   c30
                      |    |    |    |    |
                     c00  0    0    0    0
                     c10  c01  0    0    0
                     c20  c11  c02  0    0
                     c30  c21  c12  c03  0
                     0    c31  c22  c13  c04
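The output-stationary schedule can be checked with a small behavioural model. The sketch below is not taken from the RTL in this repo. It assumes one PE per output element, inputs entering from the west with row `i` delayed by `i` cycles, and weights entering from the north with column `j` delayed by `j` cycles. The names `output_stationary_matmul` and `reference_matmul` are hypothetical.

```python
def reference_matmul(A, B):
    """Plain triple-loop matrix product used as a golden model."""
    K = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(K))
             for j in range(len(B[0]))] for i in range(len(A))]


def output_stationary_matmul(A, B):
    """Cycle-by-cycle sketch of an output-stationary systolic array.

    Assumption: PE(i, j) owns accumulator C[i][j]; A values shift east,
    B values shift south, one hop per cycle.
    """
    M, K, N = len(A), len(A[0]), len(B[0])
    acc = [[0] * N for _ in range(M)]    # outputs stay in place
    a_reg = [[0] * N for _ in range(M)]  # input held by each PE
    b_reg = [[0] * N for _ in range(M)]  # weight held by each PE

    for t in range(M + N + K - 2):
        # Shift inputs one PE to the east and inject the skewed A stream.
        for i in range(M):
            for j in range(N - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < K else 0
        # Shift weights one PE to the south and inject the skewed B stream.
        for j in range(N):
            for i in range(M - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < K else 0
        # Every PE multiplies what it currently holds and accumulates locally.
        for i in range(M):
            for j in range(N):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


if __name__ == "__main__":
    A = [[i * 4 + k + 1 for k in range(4)] for i in range(5)]  # 5 x 4
    B = [[k * 3 + j + 1 for j in range(3)] for k in range(4)]  # 4 x 3
    assert output_stationary_matmul(A, B) == reference_matmul(A, B)
```

With these assumptions the array needs `M + N + K - 2` cycles before the last accumulator is complete. Draining the results out of the array is not modelled.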
Input Stationary
In Input Stationary, the input matrix values are pre-loaded and stay fixed in each PE's local memory. Weights stream past them, and partial products (pp) accumulate into the outputs as they move through the PEs. Matrix shift-in pattern:
                  0       b14     b23     b32
                  b04     b13     b22     b31
                  b03     b12     b21     b30
                  b02     b11     b20     0
                  b01     b10     0       0
                  b00     0       0       0
a00 a01 a02 a03:  a00->pp a01->pp a02->pp a03--> c00 c01 c02 c03 c04
a10 a11 a12 a13:  a10     a11     a12     a13--> 0   c10 c11 c12 c13
a20 a21 a22 a23:  a20     a21     a22     a23--> 0   0   c20 c21 c22
a30 a31 a32 a33:  a30     a31     a32     a33--> 0   0   0   c30 c31
   [pre load]
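A behavioural sketch of the input-stationary schedule, under assumptions that are not confirmed by the RTL: PE(i, k) holds `A[i][k]`, row `k` of B enters array column `k` from the north delayed by `k` cycles, weights move south, and partial products move east until they leave the last column as finished outputs. The function name `input_stationary_matmul` is hypothetical.

```python
def input_stationary_matmul(A, B):
    """Cycle-by-cycle sketch of an input-stationary systolic array."""
    M, K, N = len(A), len(A[0]), len(B[0])
    held = [row[:] for row in A]         # pre-loaded inputs, never move
    b_reg = [[0] * K for _ in range(M)]  # weight currently inside each PE
    p_reg = [[0] * K for _ in range(M)]  # partial product leaving each PE
    C = [[0] * N for _ in range(M)]

    for t in range(M + K + N - 2):
        new_b = [[0] * K for _ in range(M)]
        new_p = [[0] * K for _ in range(M)]
        for i in range(M):
            for k in range(K):
                if i == 0:
                    j = t - k            # skewed injection at the north edge
                    b_in = B[k][j] if 0 <= j < N else 0
                else:
                    b_in = b_reg[i - 1][k]   # weight from the PE above
                p_in = p_reg[i][k - 1] if k > 0 else 0  # from the PE to the west
                new_b[i][k] = b_in
                new_p[i][k] = p_in + held[i][k] * b_in
        b_reg, p_reg = new_b, new_p
        # The east edge of row i emits C[i][j] at cycle i + j + K - 1.
        for i in range(M):
            j = t - (K - 1) - i
            if 0 <= j < N:
                C[i][j] = p_reg[i][K - 1]
    return C


if __name__ == "__main__":
    A = [[i * 4 + k + 1 for k in range(4)] for i in range(4)]  # 4 x 4
    B = [[k * 5 + j + 1 for j in range(5)] for k in range(4)]  # 4 x 5
    expected = [[sum(A[i][k] * B[k][j] for k in range(4))
                 for j in range(5)] for i in range(4)]
    assert input_stationary_matmul(A, B) == expected
```

The self-check uses a 4 x 4 input and a 4 x 5 weight matrix, matching the element indices shown in the shift-in pattern.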
Weight Stationary
In Weight Stationary, weights are fixed in the compute elements. Inputs stream in, and partial sums propagate through the array to form the output. Matrix shift-in pattern:
                             [pre load]
                             b00 b01 b02 b03 b04
                             b10 b11 b12 b13 b14
                             b20 b21 b22 b23 b24
                             b30 b31 b32 b33 b34
                             ... ... ... ... ...
0   0   0   a30 a20 a10 a00  b00 b01 b02 b03 b04
                                     pp
0   0   a31 a21 a11 a01 0    b10 b11 b12 b13 b14
                                     pp
0   a32 a22 a12 a02 0   0    b20 b21 b22 b23 b24
                                     pp
a33 a23 a13 a03 0   0   0    b30 b31 b32 b33 b34
                              |   |   |   |   |
                              V   V   V   V   V
                             c00 0   0   0   0
                             c10 c01 0   0   0
                             c20 c11 c02 0   0
                             c30 c21 c12 c03 0
                             0   c31 c22 c13 c04
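A behavioural sketch of the weight-stationary schedule, under assumptions that are not confirmed by the RTL: PE(k, j) is pre-loaded with `B[k][j]`, column `k` of A enters array row `k` from the west delayed by `k` cycles, inputs move east, and partial sums (pp) move south until they leave the bottom row as finished outputs. The function name `weight_stationary_matmul` is hypothetical.

```python
def weight_stationary_matmul(A, B):
    """Cycle-by-cycle sketch of a weight-stationary systolic array."""
    M, K, N = len(A), len(A[0]), len(B[0])
    weights = [row[:] for row in B]      # pre-loaded, never move
    a_reg = [[0] * N for _ in range(K)]  # input currently inside each PE
    p_reg = [[0] * N for _ in range(K)]  # partial sum leaving each PE
    C = [[0] * N for _ in range(M)]

    for t in range(M + K + N - 2):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for j in range(N):
                if j == 0:
                    i = t - k            # skewed injection at the west edge
                    a_in = A[i][k] if 0 <= i < M else 0
                else:
                    a_in = a_reg[k][j - 1]   # input from the PE to the west
                p_in = p_reg[k - 1][j] if k > 0 else 0  # sum from the PE above
                new_a[k][j] = a_in
                new_p[k][j] = p_in + a_in * weights[k][j]
        a_reg, p_reg = new_a, new_p
        # The south edge of column j emits C[i][j] at cycle i + j + K - 1.
        for j in range(N):
            i = t - (K - 1) - j
            if 0 <= i < M:
                C[i][j] = p_reg[K - 1][j]
    return C


if __name__ == "__main__":
    A = [[i * 4 + k + 1 for k in range(4)] for i in range(4)]  # 4 x 4
    B = [[k * 5 + j + 1 for j in range(5)] for k in range(4)]  # 4 x 5
    expected = [[sum(A[i][k] * B[k][j] for k in range(4))
                 for j in range(5)] for i in range(4)]
    assert weight_stationary_matmul(A, B) == expected
```

Under these assumptions column `j` of the array emits `C[0][j]` first and one further row of C per cycle, which gives the diagonal output pattern drawn above.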
Memory 3 Layout
Memory 3 [20 X 100] (location: file_dump/mem3.hex)
===================
All regions span rows 0 to 19. Columns are numbered 0 to 99.

    Columns   Size    Contents
    -------   -----   --------------------------------------------------------
    0-3       20X4    T^ rpas
    4-7       20X4    Wq^ rpas
    8-11      20X4    Wkv^ rpas
    12-21     20X10   Q^: h1 (20X4) in 12-15, h2 (20X4) in 16-19, rest marked X
    22-31     20X10   W^: h1 (20X4) in 22-25, h2 (20X4) in 26-29, rest marked X
    32-41     20X10   Q^_rpas: h1 (20X4) in 32-35, h2 (20X4) in 36-39, rest marked X
    42-51     20X10   W^_rpas: h1 (20X4) in 42-45, h2 (20X4) in 46-49, rest marked X
    52-71     20X20   A^ = Q^_rpas X W^_rpas (h1)
    72-91     20X20   A^ = Q^_rpas X W^_rpas (h2)
    92-93     20X2    MASK A (h1)
    94        20X1    MASK Q (h1)
    95        20X1    MASK KV (h1)
    96-97     20X2    MASK A (h2)
    98        20X1    MASK Q (h2)
    99        20X1    MASK KV (h2)
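For scripts that inspect a memory dump, the column ranges above can be kept in one lookup table. This is a minimal sketch: the region names and the helper `mem3_region` are hypothetical, and it assumes the dump has already been parsed into 20 rows of 100 values (the on-disk format of `mem3.hex` is not described here).

```python
MEM3_ROWS, MEM3_COLS = 20, 100

# Inclusive (first_column, last_column) per region; names are illustrative.
MEM3_REGIONS = {
    "T_rpas": (0, 3),
    "Wq_rpas": (4, 7),
    "Wkv_rpas": (8, 11),
    "Q_h1": (12, 15), "Q_h2": (16, 19),
    "W_h1": (22, 25), "W_h2": (26, 29),
    "Q_rpas_h1": (32, 35), "Q_rpas_h2": (36, 39),
    "W_rpas_h1": (42, 45), "W_rpas_h2": (46, 49),
    "A_h1": (52, 71), "A_h2": (72, 91),
    "mask_A_h1": (92, 93), "mask_Q_h1": (94, 94), "mask_KV_h1": (95, 95),
    "mask_A_h2": (96, 97), "mask_Q_h2": (98, 98), "mask_KV_h2": (99, 99),
}


def mem3_region(mem, name):
    """Return the named region as a list of 20 row slices."""
    if len(mem) != MEM3_ROWS or any(len(row) != MEM3_COLS for row in mem):
        raise ValueError("expected a 20 x 100 memory image")
    first, last = MEM3_REGIONS[name]
    return [row[first:last + 1] for row in mem]


if __name__ == "__main__":
    # Fill every cell with its own column index to make slices easy to check.
    mem = [list(range(MEM3_COLS)) for _ in range(MEM3_ROWS)]
    assert mem3_region(mem, "Q_h2")[0] == [16, 17, 18, 19]
```

The columns marked X (20-21, 30-31, 40-41, 50-51) are left out of the table on purpose.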
Reference
Y. Qin et al., "Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow," IEEE Journal of Solid-State Circuits, vol. 59, no. 10, pp. 3342–3356, Oct. 2024. doi: 10.1109/JSSC.2024.3397189.