### A systolic architecture for supporting Wiedemann's algorithm

Rainer Steinwandt<sup>1</sup>

(joint work with Willi Geiselmann<sup>1</sup>, Adi Shamir<sup>2</sup> and Eran Tromer<sup>2</sup>)

<sup>1</sup> O Universität Karlsruhe, Germany
 <sup>2</sup> Weizmann Institute of Science, Israel

# NFS & (block) Wiedemann

**NFS:** relation collection + linear algebra (LA) step dominating for total running time

### (Block) Wiedemann for GF(2):

E

reduces LA step to iterated matrix-by-vector multiplications  $A\nu$ ,  $A^2\nu$ ,  $A^3\nu$ , ...,  $A^k\nu$ 

with sparse (... but potentially large) matrix A

1024 bit:  $A \in GF(2)^{10^{10} \times 10^{10}}$ 

... doing this fast could be nice for GF(p), too (→[Frey04])

# LA hardware: basic approach

Currently most promising hardware devices for LA step:

- offer methods for efficiently computing the vector chains Av,  $A^2v$ ,  $A^3v$ , ...,  $A^kv$  using a **2-D mesh architecture**:
  - 2-D sorting ( $\rightarrow$ [Bernstein '01])
  - 2-D routing ( $\rightarrow$ [Lenstra et al. '02])
- ✓ impose another 2-D splitting for doing with small chips
   (→[Geiselmann, S. '03])
   ..., cheap

... not utopian, but not as simple & efficient as desirable

# Multiplying with $\nu \in GF(p)^n$





Ë,

## **Additional parallelization**

Needed arithmetics is not space-consuming process k>1 vector components in parallel

station #1: entries of rows #1 ... # $s_1$ i station #u: entries of rows # $n-s_u+1$  ... #n



### ... using intra-station buses

### Handling *k* vector components in parallel in each station:



#### Circular buses for intra-station transport of *v*-entries.

# Multiplying with A again

Actually needed:  $A \cdot v$ ,  $A \cdot Av$ ,  $A \cdot A^2 v$ , ...

result of multiplication must go back into vector pipeline

rearrange stations:

E

#### ... have each station scan v in a different cyclic order

### **Doing another multiplication**

GF(*p*)-addition commutative

1 complete cycle yields A·V

stations switch to 2<sup>nd</sup> mem. bank holding *A*·*v* 

E



14

Device is immediately prepared for next multiplication.

### **Critical parameters**

#### I/O Bandwith, number of pins:

limits the speed at which  $\nu$  can be fed into the stations & therewith overall LA time

### Memory:

representing the non-zero entries of A &storing the vector(s) v requires large amount of (D)RAM

### Clock rate:

simple logic allowing high clocking rate vs. (slow) space-optimized memory

# Techn(olog)ical limitations

#pins limited through chip size (>2<sup>12</sup> pins means large chips)
 logic for systolic design simpler than for mesh-based designs
 increasing clocking rate to 1 GHz seems doable

What about the memory?

vector v: dense, 2×(D)RAM for m (=10<sup>10</sup>) GF(p)-entries matrix A: GF(p)-entry, row coord. within CPU, auxiliary flags no need for random access, DRAM sufficient



### **Matrix handling**

| "External table" for reading v-entries:  |                |                                                               |                                           |
|------------------------------------------|----------------|---------------------------------------------------------------|-------------------------------------------|
| cycles                                   | "read it" flag | bus no. to write on                                           |                                           |
|                                          |                |                                                               |                                           |
|                                          |                |                                                               |                                           |
| "Internal table" for storing the matrix: |                |                                                               |                                           |
| cycles                                   | "read it" flag | bus no. to read from                                          |                                           |
| -entry                                   | row coord.     | "delete it" flag                                              |                                           |
|                                          | cycles         | cycles "read it" flag rnal table" for s cycles "read it" flag | cycles "read it" flag bus no. to write on |

O

### **Distributing the matrix**

As with mesh based designs, we can **split** *A* into submatrices  $(\rightarrow [Geiselmann, S. '03])$ :



### **Block matrix multiplication**

- assign a multiplication circuit to each submatrix A<sub>i,i</sub>
- distribute/load appropriate v-parts into each circuit
- compute all A<sub>i,j</sub> · v<sub>i,j</sub> -values
- output all subproducts & add them in a pipeline

result must be split & loaded into the device

Limiting factor for run time: I/O bandwidth/#pins

### Systolic parallelization

#### Increased blocking factor without repeatedly storing A:



### ... combining it all

splitting of A into submatrices can be combined with systolic parallelization

short vectors + small matrices + simple logic

### small interconnected chips

... may be fast, but not that trivial to implement

**2D-systolic looks preferable** 

### Systolic vs. mesh based design

- Features of (pure) systolic approach:
- simple logic
- use of small chip sizes seems doable
- integration of error handling looks doable
  no need for heuristic complexity bounds
  simulation in software is possible

### ... how fast can we go?

- ✓ using similar chip area as currently fastest mesh-based proposals (→[Geiselmann et al. '05]), a significant speed-up, say factor 2, seems realistic
- various possibilites for optimization: area × time, cost, ...

LA step for 1024 bit looks (even more  $\bigcirc$ ) doable.

### Conclusion

- systolic design looks preferable to mesh-based approach: seems to be simpler, faster and require smaller chips
- topic of "optimal" parameter choice (purely systolic, matrix splitting, ...) not fully explored yet

... coping with LA step for 1024 bit is not utopian 🙂

