2008

A reconfigurable and scalable efficient architecture for AES

Li, Ke

Lethbridge, Alta. : University of Lethbridge, Department of Mathematics and Computer Science, 2008

http://hdl.handle.net/10133/778

Downloaded from University of Lethbridge Research Repository, OPUS
A RECONFIGURABLE AND SCALABLE EFFICIENT ARCHITECTURE FOR AES

KE LI
Bachelor of Science, University of Electronic Science and Technology of China, 2003

A Thesis
Submitted to the School of Graduate Studies of the University of Lethbridge in Partial Fulfillment of the Requirements for the Degree

MASTER OF SCIENCE

Department of Mathematics and Computer Science
University of Lethbridge
LETHBRIDGE, ALBERTA, CANADA

© Ke Li, 2008
For my family, who offered me unconditional love and support throughout the course of this thesis.
Abstract

A new 32-bit reconfigurable FPGA implementation of AES algorithm is presented in this thesis. It employs a single round architecture to minimize the hardware cost. The combinatorial logic implementation of S-Box ensures the suitability for non-Block RAMs (BRAMs) FPGA devices. Fully composite field $GF((2^4)^2)$ based encryption and keyschedule lead to the lower hardware complexity and convenience for the efficient subpipelining. For the first time, a subpipelined on-the-fly keyschedule over composite field $GF((2^4)^2)$ is applied for the all standard key sizes (128-, 192-, 256-bit). The proposed architecture achieves a throughput of 805.82Mbits/s using 523 slices with a ratio throughput/slice of 1.54Mbps/Slice on Xilinx Virtex2 XC2V2000 ff896 device.
Acknowledgments

I would like to express many thanks to my supervisor Dr. Hua Li, for his invaluable advice and ideas on the research and also for his devotion of time to me during this program. His support and expertise resolved many hurdles that I encountered throughout the research.

I am also grateful to other committee members Dr. Howard Cheng and Dr. Gongbing Shan for their advice.

Finally, I would like thank my parents for their support of me.
## Contents

**Approval/Signature Page** ............................................................ ii

**Dedication** ........................................................................ iv

**Abstract** ............................................................................. iv

**Acknowledgments** ............................................................... v

**Table of Contents** ............................................................... vi

**List of Tables** ....................................................................... viii

**List of Figures** .................................................................... ix

1 **Introduction** ........................................................................ 1
   1.1 Motivation ........................................................................ 2
   1.2 32-bit Subpipelined Architecture ...................................... 2
   1.3 Thesis Outline ................................................................... 4

2 **Mathematical Background** .................................................. 5
   2.1 Finite Fields ...................................................................... 5
      2.1.1 AES Arithmetic over Field $GF(2^8)$ ............................. 6
   2.2 Composite Fields ............................................................ 9
      2.2.1 AES Arithmetic over Composite Field $GF((2^4)^2)$ ..... 10

3 **AES Algorithm** ................................................................. 13
   3.1 Subbytes and Invsubbytes ............................................... 14
   3.2 Shiftrows and Invshiftrows .............................................. 17
   3.3 Mixcolumns and Invmixcolumns ...................................... 18
   3.4 Addroundkey .................................................................. 19
   3.5 Keyschedule .................................................................... 20

4 **Reconfigurable and Compact Architecture of the AES** .......... 23
   4.1 32-bit Single Round Unit ................................................. 23
   4.2 Full Composite Field Encryptor and Keyschedule .............. 24
   4.3 Subpipelined Encryptor and Keyschedule ........................... 27
   4.4 Double-Block Subpipelined Architecture ............................ 28
      4.4.1 Column Fashion Shiftrows ......................................... 32
      4.4.2 Subpipelined Subbytes .............................................. 36
      4.4.3 Mixcolumns on $GF((2^4)^2)$ ..................................... 48
      4.4.4 Subpipelined Keyschedule ......................................... 51

vi
List of Tables

3.1 Key-Block-Round Combinations [20] ................................. 13
4.1 AES Encryption Sequence ............................................. 29
4.2 Four Control Signals .................................................... 32
4.3 Path Delays and Number of Slices for Spartan2E and Virtex2 .... 47
4.4 Key128 Roundkey Sequence ........................................... 57
4.5 Key192 Roundkey Sequence ........................................... 59
4.6 Key256 Roundkey Sequence ........................................... 66
5.1 Comparisons of BRAMs Based AES Architecture .................. 69
5.2 Comparisons of Non-BRAMs Architectures ......................... 70
5.3 Comparisons of AES Architectures Functions ....................... 72
## List of Figures

<table>
<thead>
<tr>
<th>Figure</th>
<th>Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>3.1</td>
<td>State array input and output</td>
<td>14</td>
</tr>
<tr>
<td>3.2</td>
<td>AES architecture</td>
<td>15</td>
</tr>
<tr>
<td>3.3</td>
<td>AES S-box</td>
<td>16</td>
</tr>
<tr>
<td>3.4</td>
<td>AES IS-box</td>
<td>17</td>
</tr>
<tr>
<td>3.5</td>
<td>AES Shiftrows</td>
<td>18</td>
</tr>
<tr>
<td>3.6</td>
<td>AES Invshiftrows</td>
<td>18</td>
</tr>
<tr>
<td>3.7</td>
<td>Pseudo Code for Key Expansion [20]</td>
<td>21</td>
</tr>
<tr>
<td>3.8</td>
<td>AES Keyschedule</td>
<td>22</td>
</tr>
<tr>
<td>4.1</td>
<td>Unfolded Architecture(a) - Single Round Unit(b) - 32-bit Single Round Unit(c)</td>
<td>23</td>
</tr>
<tr>
<td>4.2</td>
<td>Partial Composite Field (a) - Full Composite Field (b)</td>
<td>24</td>
</tr>
<tr>
<td>4.3</td>
<td>Pipelining (a) and Subpipelining (b)</td>
<td>27</td>
</tr>
<tr>
<td>4.4</td>
<td>AES Encryption Architecture</td>
<td>30</td>
</tr>
<tr>
<td>4.5</td>
<td>Column Fashion Shiftrows</td>
<td>33</td>
</tr>
<tr>
<td>4.6</td>
<td>Two States’ Arrangement in Shiftrows Registers</td>
<td>34</td>
</tr>
<tr>
<td>4.7</td>
<td>Input of Shiftrows in Figure (4.6)</td>
<td>36</td>
</tr>
<tr>
<td>4.8</td>
<td>Subbytes in composite field $GF(2^4)[34]$</td>
<td>36</td>
</tr>
<tr>
<td>4.9</td>
<td>Pipelined Subbytes in composite field $GF((2^4)^2)$</td>
<td>47</td>
</tr>
<tr>
<td>4.10</td>
<td>$GF((2^4)^2)$ Based Mixcolumns</td>
<td>49</td>
</tr>
<tr>
<td>4.11</td>
<td>Architecture of Keyschedule 128</td>
<td>55</td>
</tr>
<tr>
<td>4.12</td>
<td>Architecture of Keyschedule 192</td>
<td>58</td>
</tr>
<tr>
<td>4.13</td>
<td>Architecture of Keyschedule 256</td>
<td>64</td>
</tr>
</tbody>
</table>
Chapter 1

Introduction

Cryptography is of importance in digital communications systems. The security aspects of many applications such as Automated Teller Machines (ATMs), e-commerce, internet bank services depend on various cryptographic schemes.

A symmetric-key cryptography algorithm, Data Encryption Standard (DES), has been the encryption standard since 1977. It has been widely used and no attack better than the brute force search has been discovered. But its 56-bit key size has been criticized since its inception. 3DES with triple key size of DES offers higher security but it is inefficient in software, because DES was primarily designed for hardware implementations [30].

In 2001, the National Institute of Standards and Technology (NIST) announced the approval of the Federal Information Processing Standard (FIPS) for the Advanced Encryption Standard (AES), FIPS-197 [20]. This standard specifies the Rijndael algorithm [7] as an FIPS-approved symmetric encryption algorithm that may be used by U.S. government organizations (and others) to protect sensitive information.

As a replacement of DES, AES is presently widely used in both software and hardware implementations. Hardware approaches are attractive because it provides better throughput as well as higher physical security. Besides, the byte-wise arithmetic in AES gives hardware approaches more convenience. There are mainly two categories of hardware implementations: Application-Specific Integrated Circuit (ASIC) and Field Programmable Gate Array (FPGA). Compared with ASIC, FPGA becomes more and more popular because of its scalability, re-programmability and obvious advantage on time-to-market [19].
1.1 Motivation

The standard announced by NIST [20] indicates that AES is a block cipher with 128-bit block size and 128-, 192-, 256-bit key sizes. These three key sizes are specified for various security levels. The capability to deal with all key sizes makes reconfigurability an important feature of AES implementations.

Numerous FPGA [5, 9, 22, 23, 34] and ASIC [2, 25, 28] implementations of the AES have been presented and evaluated. To date, most implementations feature high speed and high cost suitable for high-end applications only. Fully unrolled scheme makes a convenient platform for pipelining technology to get efficient area cost and high throughput by unfolding all the ten (128-bit key) rounds on the device, which is applied in literature [8, 9, 11, 17, 34].

The issue of secure communication in computing restricted environments, such as Personal Digital Assistants (PDAs), wireless devices, and many other embedded devices, has become more important recently. In order to apply AES in these devices, the AES implementations must be cost efficient. An opposite approach to fully unrolled scheme is to implement a single round unit on hardware [1, 2, 26, 27, 31]. When no further optimization effort is made, a block of data needs ten (128-bit key) cycles to go through encryption. The economic area cost is obtained by sacrificing the speed.

In this thesis, a compact design of AES with low hardware cost and adequate throughput is proposed and implemented in a non-BRAM FPGA.

1.2 32-bit Subpipelined Architecture

The following list summarizes the major contributions in this thesis.

- **32-bit Single Round Unit:** By extending one cycle’s job to ten cycles (128-bit
key), single round unit requires approximately 1/10 hardware area as fully unrolled scheme; by chopping a block data (128-bit) to four words, theoretically, a 32-bit single round unit costs 1/40 hardware area as the common 128-bit fully unrolled scheme as in [8, 9, 11, 17, 34]. Nevertheless, when 32-bit datapath is used, the shiftrows transformations can not be simply implemented by rewiring. We use the column fashion shiftrows which naturally cuts one round unit into four substages.

- **Complete Composite Field Based AES**: In a non-BRAM design, combinational logic is the approach used for subbytes, also known as S-Box. It is the most costly transformation in AES, in both time and area aspects. Rijmen [24] suggested an alternative approach to calculate multiplicative inverses in S-Box. Since then, the relevant research has proved that the composite field $GF((2^4)^2)$ based arithmetic provides the least gate count and the shortest critical path for calculating multiplicative inverse of a byte, which is the key step in S-Box. This conversion involves an isomorphic map function before and after inversion in each round. As in [9, 11, 28, 31, 34], when key size is 128-bit, it needs ten map functions for each block (128-bit) from finite field to composite field and ten inverse map functions for encryption. If key generator, which also has S-Boxes, is included, another ten mappings and ten inverse mappings are needed. To save the overhead caused by mapping, our design converts the whole AES algorithm from $GF(2^8)$ to $GF((2^4)^2)$, which needs only one forward mapping before the initial round and one backward mapping after the final round. Only one forward mapping is needed for the keyschedule.

- **Subpipelined On-the-fly Keyschedule and Encryptor**: On-the-fly keyschedule supports instant key changing. The previous works of [1, 2, 4, 9, 12, 14, 17, 33, 26, 28, 34] applied the on-the-fly key generator, but only [1, 2, 17] integrate on-the-fly keyschedule for all three key sizes (128, 192, 256-bit). These three designs employ
subpipelining to optimize throughput/area ratio. However, none of them uses it in
keyschedule. When pipelining and on-the-fly keyschedule are both employed in an
AES implementation, the keyschedule must be synchronized with the cipher because
they share the same clock. The designs in [34, 26] made a subpipelined keyschedule,
but they only support 128-bit key size.

1.3 Thesis Outline

Chapter 2 introduces the mathematical background of finite fields which are relevant to
AES. We also present the definition of composite fields in this chapter.

Chapter 3 gives an overview of AES standard. We focus on encryption and keyschedule
in this thesis. However, the complete AES standard, including decryption, is presented in
this chapter.

Chapter 4 describes the proposed architecture in detail. The formulas for the non-trivial
transformations in field $GF((2^4)^2)$ are presented. The keyschedules for three key sizes are
demonstrated in figures.

Chapter 5 presents the implementation and compares the proposed architecture with the
previous designs.

Chapter 6 provides conclusion of the design.
Chapter 2
Mathematical Background

This chapter introduces the mathematical background of AES. Finite Fields, also referred to as Galois Fields, is the arithmetic basis of AES. The Rijndael algorithm [7] is derived from the finite field $GF(2^8)$. C. Paar [21] demonstrated that by decomposing field $GF(2^8)$ into composite field $GF((2^4)^2)$, we can make hardware implementations consuming less area. The following sections introduce the relevant properties and definitions in finite field $GF(2^8)$ and composite field $GF((2^4)^2)$. All statements are given without proof, but they are referred to the appropriate literature.

2.1 Finite Fields

This section introduces the definition of finite fields, followed by the basic AES mathematical representations and operations over finite field $GF(2^8)$. We start with the definition of group.

**Definition 2.1** [21] A set $G$ together with a binary operation $G \times G \longrightarrow G$ is called a group if the following condition are satisfied:

- The binary operation is associative: $(a \circ b) \circ c = a \circ (b \circ c)$, for all $a, b, c \in G$;
- There is an identity element $e \in G$ such that $a \circ e = e \circ a$, for all $a \in G$;
- For any element $a \in G$, there exists an inverse element $a' \in G$ such that $a \circ a' = a' \circ a = e$.

If a group satisfies the additional condition that $a \circ b = b \circ a$, for all $a, b \in G$, the group is commutative or abelian.
**Definition 2.2** [29] Let $F$ be a set of elements on which two binary operations, called addition “+” and multiplication “·”, are defined. The set $F$ together with the two binary operation $+$ and $\cdot$ is a **field** if the following conditions are satisfied:

- $F$ is a **commutative group** under addition $+$. The identity element with respect to addition is called the **zero element** or the **additive identity** of $F$ and is denoted by $0$;
- The set of nonzero elements in $F$ is a **commutative group** under multiplication $\cdot$. The **identity element** with respect to multiplication is called the **unit element** or the **multiplicative identity** of $F$ and is denoted by $1$;
- Multiplication is distributive over addition; that is, for any three elements $a, b$ and $c$ in $F$: $a \cdot (b + c) = a \cdot b + a \cdot c$.

Fields with a finite number of elements are called **Finite or Galois Fields**, denoted as $GF(q)$. Here, $q$ is the number of field elements, which is also the **order** of $GF(q)$. The **extension field** is of order $q^m$ and is denoted by $GF(q^m)$ [21], which can be constructed by an **irreducible polynomial** $P(x)$ [29] of degree $m$ over $GF(q)$. The field $GF(q)$ is a **subfield** of $GF(q^m)$ [16]. Every element of field $GF(q^m)$ can be represented as polynomial with a maximum degree of $m - 1$ over $GF(q)$, which is the residue modulo $P(x)$. Hence $P(x)$ determines the arithmetic operations in field $GF(q^m)$.

**2.1.1 AES Arithmetic over Field $GF(2^8)$**

AES is built on the specific finite field $GF(q^m)$, when $q = 2, m = 8$. $GF(2^8)$ is an extension field of $GF(2)$. We use the same notations and conventions as the AES specification in [20], except the multiplication denotation of two elements in $GF(2^8)$. Instead of using $\cdot$, we use $\otimes$, for a consistency with the figures in the subsequent chapters. The basic unit for
processing in the AES algorithm is a byte. Each 8-bit sequence of input, output, states, cipherkey or roundkeys is treated as a single entity.

A. 3 Notations of An Element

1. Binary notation: A concatenation of 8 individual bits. The bit value is 0 or 1.

\[ \{a_7a_6a_5a_4a_3a_2a_1a_0\} \]

2. Polynomial notation: Because \( GF(2^8) \) is the extension field of \( GF(2) \), its element can be represented as a polynomial over \( GF(2) \) (Equation (2.1)). Bit \( a_i \) is the coefficients of the polynomial with the value of 0 or 1.

\[
a(x) = \sum_{i=0}^{7} a_i x^i = a_7 x^7 + a_6 x^6 + a_5 x^5 + a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0 \quad (2.1)
\]

3. Hexadecimal notation: \{AB\}, \( A \) denotes \( a_7a_6a_5a_4 \) in hexadecimal representation, \( B \) denotes \( a_3a_2a_1a_0 \) in hexadecimal representation.

For example, \{01100011\} (binary notation) can be represented as \( x^6 + x^5 + x + 1 \) (polynomial notation) and \{63\} (hexadecimal notation).

B. Addition

The addition of two elements in \( GF(2^8) \) is adding their corresponding polynomial coefficients modulo 2, which is the XOR-operation denoted by \( \oplus \). For \( a(x), b(x) \in GF(2^8) \) (\( a(x) \) is in Equations (2.1); \( b(x) = b_7 x^7 + b_6 x^6 + b_5 x^5 + b_4 x^4 + b_3 x^3 + b_2 x^2 + b_1 x + b_0 \)), it can be implemented by Equation (2.2)

\[
a(x) \oplus b(x) = \sum_{i=0}^{7} a_i x^i \oplus \sum_{i=0}^{7} b_i x^i = \sum_{i=0}^{7} (a_i \oplus b_i) x^i \quad (2.2)
\]
C. Multiplication

For \( a(x), b(x) \in GF(2^8) \), \( \otimes \) is the multiplication operation in \( GF(2^8) \), \( \times \) is the normal polynomial multiplication.

Polynomial (2.3) is the irreducible polynomial used in AES. The multiplication of \( a(x) \) and \( b(x) \) is done by multiplying these two polynomials followed by a modular reduction over \( m(x) \) (Equation (2.4)). The modular reduction is made to ensure that the result is an element in \( GF(2^8) \).

\[
m(x) = x^8 + x^4 + x^3 + x + 1
\]  

(2.3)

Given \( q(x) \in GF(2^8) \), \( q(x) = q_7x^7 + q_6x^6 + q_5x^5 + q_4x^4 + q_3x^3 + q_2x^2 + q_1x + q_0 \), we have:

\[
q(x) = a(x) \otimes b(x) = (a(x) \times b(x)) \mod m(x)
\]  

(2.4)

FIPS gives an efficient method to do multiplication in \( GF(2^8) \) in [20]. It uses the multiplication by \( x \), which is denoted as \( xtimes(a(x)) \). Given: \( t(x) \in GF(2^8) \), \( t(x) = t_7x^7 + t_6x^6 + t_5x^5 + t_4x^4 + t_3x^3 + t_2x^2 + t_1x + t_0 \), we have:

\[
t(x) = xtimes(a(x)) = (a(x) \times x) \mod m(x)
\]  

(2.5)

\[
t_0 = a_7, \ t_1 = a_0 \oplus a_7, \ t_2 = a_1, \ t_3 = a_2 \oplus a_7 \\
t_4 = a_3 \oplus a_7, \ t_5 = a_4, \ t_6 = a_5, \ t_7 = a_6
\]

In Equation (2.5), \( t(x) \) is the multiplication result of \( a(x) \) and \( x \) in \( GF(2^8) \). It is calculated by multiplying \( a(x) \) with \( x \), followed by the modular reduction over \( m(x) \). Based on Equations (2.5), we can use Equation (2.6) to conduct the multiplication in \( GF(2^8) \)
(Equation (2.4)).

\[ q(x) = \sum_{i=0}^{7} P_i(x) \times b_i \]

\[ P_i(x) = \times times(P_{i-1}(x)) \quad (P_0(x) = a(x)) \]

In Equation (2.6), the partial multiplications \((P_i(x))\) is performed first, followed by adding the corresponding coefficients. Bit \(b_i\) is the coefficient in \(b(x)\), which are 0 or 1.

D. Multiplicative Inverses

\[ \forall a \in GF(2^8) \setminus \{0\} : a \otimes a^{-1} = \{1\} \]  \hspace{1cm} (2.7)

\(a^{-1}\) is the multiplicative inverse of \(a\) in \(GF(2^8)\). A popular algorithm for inversion is the Extended Euclidean Algorithm [18], but it is not suitable for hardware implementation because of its high hardware complexity.

2.2 Composite Fields

Two Galois Fields of the same order are isomorphic, but they may have different hardware complexity which depends on the representations of their field elements. Green and Taylor [10] introduced a certain type of extension fields called composite field, which can simplify field operations in AES arithmetic.

**Definition 2.4**

We call two pairs

\[ \{GF(2^n), Q(y) = y^n + \sum_{i=0}^{n-1} q_i y^i, q_i \in GF(2)\} \]

\[ \{GF((2^n)^m), P(x) = x^m + \sum_{i=0}^{m-1} p_i x^i, p_i \in GF(2)\} \]
a composite field if

- \( GF(2^n) \) is constructed from \( GF(2) \) by \( Q(y) \);
- \( GF((2^n)^m) \) is constructed from \( GF(2^n) \) by \( P(x) \).

Composite field is denoted by \( GF((2^n)^m) \). A composite field \( GF((2^n)^m) \) is isomorphic to the field \( GF(2^k) \), \( k = nm \) [15].

### 2.2.1 AES Arithmetic over Composite Field \( GF((2^4)^2) \)

The specific composite field used in this thesis is \( GF((2^4)^2) \), which is isomorphic to field \( GF(2^8) \) \( (k = 8, n = 4, m = 2) \). Taking field \( GF(2^8) \) as a quadratic extension of the field \( GF(2^4) \), an element \( a \in GF(2^8) \) is represented as a linear polynomial with coefficients in \( GF(2^4) \).

#### A. Notation

Wolkerstorfer et al. introduced a **two-term polynomial** in [32], which is the representation of \( GF((2^4)^2) \) used in the thesis.

\[
a \cong a_h x + a_l, \ a \in GF(2^8), \ a_h, a_l \in GF(2^4) \tag{2.8}
\]

The two-term polynomial \( a_h x + a_l \) is an isomorphic representation of \( a \). Hence, all mathematical operations applied to elements of \( GF(2^8) \) can also be computed in this representation.

#### B. Addition

Adding the corresponding coefficients.

\[
(a_h x + a_l) \oplus (b_h x + b_l) = (a_h \oplus b_h)x + (a_l \oplus b_l) \tag{2.9}
\]
C. Multiplication

There are two irreducible polynomials needed for the two-term polynomial multiplication: $n(x)$ (Equations (2.10)) and $m(x)$ (Equations (2.11)).

$$n(x) = x^2 + \{1\}x + \{E\} \quad \{E\} \text{ denotes } "1110" \quad (2.10)$$

$$m(x) = x^4 + x + 1 \quad (2.11)$$

Equation (2.10) is used to reduce the result to a two-term polynomial. The coefficients of $n(x)$ are written in hexadecimal notation which are elements in $GF(2^4)$ (Section (2.1.1)). Multiplication of two-term polynomials is denoted by $\otimes$. Normal polynomials multiplication is denoted by $\times$. Multiplying two two-term polynomials, followed by a modular reduction over $n(x)$, is described by Equations (2.12).

$$(a_hx + a_l) \otimes (b_hx + b_l) = ((a_hx + a_l) \times (b_hx + b_l)) \mod n(x) \quad (2.12)$$

Equation (2.11) is used to ensure that, the result of multiplication in subfield $GF(2^4)$ (Equation (2.13)), where $(a'(x), b'(x) \in GF(2^4))$, is an element of $GF(2^4)$.

$$a'(x) \otimes b'(x) = (a'(x) \times b'(x)) \mod m(x) \quad (2.13)$$

These two irreducible polynomials $n(x)$ and $m(x)$ are chosen by Wolkerstorfer et al. [32] to optimize the arithmetic.

D. Multiplicative Inverses

A multiplication of a two-term polynomial with its inverse yields the 1-element of the field $GF((2^4)^2)$

$$(a_hx + a_l) \otimes (a'_lx + a'_l) = \{0\}x + \{1\} \quad (2.14)$$
where $a_h, a_l, a_h', a_l' \in GF(2^4)$.

\[
(a_h x + a_l)^{-1} = (a_h' x + a_l') = (a_h \otimes d) x + (a_h \oplus a_l) \otimes d
\]  \hspace{1cm} (2.15)

where \(d = ((a_h^2 \otimes \{E\}) \oplus (a_h \otimes a_l) \otimes a_l^2)^{-1} = ((a_h^2 \otimes \{e\}) \oplus ((a_h \oplus a_l) \otimes a_l))^{-1}\) (\(\oplus\) is addition in \(GF(2^4)\); \(\otimes\) is multiplication in \(GF(2^4)\)).

This multiplicative inversion equation is proposed by Wolkerstorfer et al. [32]. Reconfiguration of this equation can provide good quality for subpipeling. We will explain this in Section (4.4.2).
Chapter 3

AES Algorithm

This chapter introduces the AES algorithm presented by NIST in 2001 [20].

The AES algorithm, also known as the Rijndael algorithm is the encryption standard designed by two Belgian cryptographers John Daemen and Vincent Rijmen [7]. AES is a symmetric-key cipher where both the encryptor and decryptor use the same key. It is an iterative algorithm. Each iteration is called a round. According to NIST, AES is a symmetric block cipher with block size of 128-bit and three key sizes of (128-, 192-, or 256-bit). The AES parameters depend on the key size (Table 3.1):

- $N_k$ is the number of 32-bit words comprising the cipher key;
- $N_b$ is the number of 32-bit words comprising a data block, which is four in AES standard;
- $N_r$ is the number of rounds which is 10, 12 or 14 for AES-128, AES-192 and AES-256, respectively.

The internal operations of AES are performed on a $4 \times 4$ matrix of bytes, termed the state (Figure 3.1). An individual byte of the state is referred as $S_{r,c}$ ($r$ represents the row

<table>
<thead>
<tr>
<th>Key Length (Nk words)</th>
<th>Block Size (Nb words)</th>
<th>Number of Rounds (Nr)</th>
</tr>
</thead>
<tbody>
<tr>
<td>AES-128</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>AES-192</td>
<td>6</td>
<td>4</td>
</tr>
<tr>
<td>AES-256</td>
<td>8</td>
<td>4</td>
</tr>
</tbody>
</table>
number and \( c \) represents the column number: \( 0 \leq r < 4, 0 \leq c < 4 \). A word \( W_i \) (\( 0 \leq i < 4 \)) consists of the four bytes of column \( i \).

AES runs iteratively on four transformations (inv-/subbytes, inv-/shiftrows, inv-/mixcolumns and addroundkey) with different sequence in encryption and decryption. Figure (3.2) illustrates the basic architecture of AES. In the initial round (\( r = 0 \)), only addroundkey is performed; in the final round (\( r = Nr \)), it skips inv-/mixcolumns. The key schedule module expands cipherkey to \((Nr + 1) \times 4\) words of roundkeys. Each round applies a unique 128-bit roundkey in the addroundkey operation.

### 3.1 Subbytes and Invsubbytes

Inv-/subbytes is the only non-linear transformation in AES which is also called S-Box.

**A. Subbytes** – Uses an S-Box to perform a non-linear byte-by-byte substitution of the state.

S-Box is a \( 16 \times 16 \) matrix containing all possible 256 8-bit values.

Consider a byte \( \{x_7x_6x_5x_4x_3x_2x_1x_0\} \). Subbytes transformation has two steps:

1. \( \{x'_7x'_6x'_5x'_4x'_3x'_2x'_1x'_0\} \) is its multiplicative inverse in \( GF(2^8) \) field, modulo the irre-
ducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$; \{00000000\}’s multiplicative inverse in $GF(2^8)$ field is itself;

2. An affine transformation over $GF(2)$ (Equation (3.1)) is conducted on the inverse, which is the result of the first step.
Figure (3.3) shows the S-Box diagram:

![S-Box Diagram](image)

Figure 3.3: AES S-box

**B. Invsubbytes** – Uses an inverse S-Box (IS-Box) to perform a non-linear byte-by-byte substitution of the state.

Considering a byte \{y_7, y_6, y_5, y_4, y_3, y_2, y_1, y_0\}. Inverse subbytes transformation has two steps:

1. The inverse affine transformation over $GF(2)$ (Equation (3.2)) is performed first
\[
\begin{bmatrix}
{x'_0} \\
{x'_1} \\
{x'_2} \\
{x'_3} \\
{x'_4} \\
{x'_5} \\
{x'_6} \\
{x'_7}
\end{bmatrix} =
\begin{bmatrix}
0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 1 & 0
\end{bmatrix}
\begin{bmatrix}
y_0 \\
y_1 \\
y_2 \\
y_3 \\
y_4 \\
y_5 \\
y_6 \\
y_7
\end{bmatrix} +
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
\tag{3.2}
\]

2. \(\{x_7x_6x_5x_4x_3x_2x_1x_0\}\) is the multiplicative inverse of \(\{x'_7x'_6x'_5x'_4x'_3x'_2x'_1x'_0\}\) in \(GF(2^8)\) field, modulo the irreducible polynomial \(m(x) = x^8 + x^4 + x^3 + x + 1; \{00000000\}\) is mapped onto itself.

Figure (3.4) shows the IS-Box diagram:

\[\{y'_7y'_6y'_5y'_4y'_3y'_2y'_1y'_0\} \rightarrow \text{IS-Box} \rightarrow \{x_7x_6x_5x_4x_3x_2x_1x_0\}\]

Figure 3.4: AES IS-box

### 3.2 Shiftrows and Invshiftrows

This transformation circularly shifts each row of the state to the left on encryption or to the right on decryption. The top row of the state is denoted as \(row(0)\) and the bottom row is denoted as \(row(3)\). The shift offset of each row corresponds to the row number.

**A. Shiftrows** – Each row of the state is left shifted cyclically a certain number of bytes.

Performs \(i\)-byte circular left shift to \(row(i)(i = 0, 1, 2, 3)\). Figure (3.5) illustrates the shiftrows operation.
### B. Invshiftrows

Each row of the state is right shifted cyclically a certain number of bytes. Performs $i$-byte circular right shift to $\text{row}(i) (i = 0, 1, 2, 3)$. Figure (3.6) illustrates the invshiftrows operation.

```
<table>
<thead>
<tr>
<th></th>
<th>row(0)</th>
<th>row(1)</th>
<th>row(2)</th>
<th>row(3)</th>
</tr>
</thead>
<tbody>
<tr>
<td>S0,0</td>
<td>S0,1</td>
<td>S0,2</td>
<td>S0,3</td>
<td></td>
</tr>
<tr>
<td>S1,0</td>
<td>S1,1</td>
<td>S1,2</td>
<td>S1,3</td>
<td></td>
</tr>
<tr>
<td>S2,0</td>
<td>S2,1</td>
<td>S2,2</td>
<td>S2,3</td>
<td></td>
</tr>
<tr>
<td>S3,0</td>
<td>S3,1</td>
<td>S3,2</td>
<td>S3,3</td>
<td></td>
</tr>
</tbody>
</table>
```

```
<table>
<thead>
<tr>
<th></th>
<th>row(0)</th>
<th>row(1)</th>
<th>row(2)</th>
<th>row(3)</th>
</tr>
</thead>
<tbody>
<tr>
<td>S0,0</td>
<td>S0,1</td>
<td>S0,2</td>
<td>S0,3</td>
<td></td>
</tr>
<tr>
<td>S1,0</td>
<td>S1,1</td>
<td>S1,2</td>
<td>S1,3</td>
<td></td>
</tr>
<tr>
<td>S2,0</td>
<td>S2,1</td>
<td>S2,2</td>
<td>S2,3</td>
<td></td>
</tr>
<tr>
<td>S3,0</td>
<td>S3,1</td>
<td>S3,2</td>
<td>S3,3</td>
<td></td>
</tr>
</tbody>
</table>
```

Figure 3.6: AES Invshiftrows

### 3.3 Mixcolumns and Invmixcolumns

This transformation treats each column of the state as a four-term polynomial over $GF(2^8)$ and transforms each column to a new one by multiplying it with a constant polynomial $a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$ modulo $x^4 + 1$. The inverse mixcolumns operation is a multiplication of each column with $b(x) = a^{-1}(x) = \{0B\}x^3 + \{0D\}x^2 + \{09\}x + \{0E\}$
modulo \( x^4 + 1 \).

A. MixColumns – Left multiplies the state with a mixcolumns matrix.

Mixcolumns transformation gives each byte of a column a new value based on all four bytes in that column. In matrix form, the mixcolumns can be expressed as:

\[
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02 \\
\end{bmatrix}
\begin{bmatrix}
s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\
s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} \\
s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} \\
s_{3,0} & s_{3,1} & s_{3,2} & s_{3,3} \\
\end{bmatrix}
=
\begin{bmatrix}
s'_{0,0} & s'_{0,1} & s'_{0,2} & s'_{0,3} \\
s'_{1,0} & s'_{1,1} & s'_{1,2} & s'_{1,3} \\
s'_{2,0} & s'_{2,1} & s'_{2,2} & s'_{2,3} \\
s'_{3,0} & s'_{3,1} & s'_{3,2} & s'_{3,3} \\
\end{bmatrix}
\tag{3.3}
\]

B. Invmixcolumns – Left multiplies the state with a invmixcolumns matrix.

In matrix form, the invmixcolumns can be expressed as:

\[
\begin{bmatrix}
0E & 0B & 0D & 09 \\
09 & 0E & 0B & 0D \\
0D & 09 & 0E & 0B \\
0B & 0D & 09 & 0E \\
\end{bmatrix}
\begin{bmatrix}
s'_{0,0} & s'_{0,1} & s'_{0,2} & s'_{0,3} \\
s'_{1,0} & s'_{1,1} & s'_{1,2} & s'_{1,3} \\
s'_{2,0} & s'_{2,1} & s'_{2,2} & s'_{2,3} \\
s'_{3,0} & s'_{3,1} & s'_{3,2} & s'_{3,3} \\
\end{bmatrix}
=
\begin{bmatrix}
s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\
s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} \\
s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} \\
s_{3,0} & s_{3,1} & s_{3,2} & s_{3,3} \\
\end{bmatrix}
\tag{3.4}
\]

3.4 Addroundkey

The addroundkey is a simple logical XOR of the current state with a roundkey which is generated by the keyschedule.

Addroundkey – The state is XORed with the 128-bit roundkey (Equation (3.5)).
\[ \begin{bmatrix} s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\ s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} \\ s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} \\ s_{3,0} & s_{3,1} & s_{3,2} & s_{3,3} \end{bmatrix} \oplus \text{roundkey} = \begin{bmatrix} s'_{0,0} & s'_{0,1} & s'_{0,2} & s'_{0,3} \\ s'_{1,0} & s'_{1,1} & s'_{1,2} & s'_{1,3} \\ s'_{2,0} & s'_{2,1} & s'_{2,2} & s'_{2,3} \\ s'_{3,0} & s'_{3,1} & s'_{3,2} & s'_{3,3} \end{bmatrix} \quad (3.5) \]

3.5 Keyschedule

**Keyschedule** – Derives roundkeys from the cipherkey. It consists of two steps:

1. **Key Expansion** - Uses the AES Key Expansion Algorithm (Figure (3.7)) to generate \( 4 \times (N_r + 1) \) words of roundkeys \( (W_0, W_1, \ldots, W_{4(N_r+1)-1}) \). The cipherkey is divided into \( N_k \) words used as the first \( N_k \) roundkeys. Keyschedule repeats to generate the rest roundkeys.

2. **Roundkey Selection** - The first 4 roundkeys are the first 4 words, the second 4 roundkeys are the second 4 words, etc. Each roundkey has 128 bits: \( \text{roundkey}(i) = (W_{4i}, W_{4i+1}, W_{4i+2}, W_{4i+3}) \).

Figure (3.8) shows keyschedule’s architecture which generates roundkeys for AES-128, AES-192 and AES-256.

- **Rotword:** One-byte circular left shift on a word. For example, word \((a, b, c, d)\) becomes \((b, c, d, a)\).

- **Subword:** Using S-Box to perform a byte substitution on each byte.

- **Xorrrcon:** XORing with a round constant \( rcon[j], j = 1, 2, \ldots, N_r \).

\[ rcon[j] = (RC[j], 0, 0, 0), \text{ with } RC[1] = 1, RC[j] = 2 \cdot RC[j-1] \text{ and with multiplication defined over the field } GF(2^8). \]
KeyExpansion(byte key[4*Nk], word w[4*(Nr+1)], Nk)
begin
    word temp
    i=0
    while(i<Nk)
        w[i]=word(key[4*i], key[4*i+1], key[4*i+2], key[4*i+3])
        i=i+1
    end while
    i=Nk
    while(i<Nb*(Nr+1))
        temp=w[i-1]
        if(i mod Nk=0)
            temp=subword(rotword(temp)) xor rcon[i/Nk]
        else if (Nk>6 and i mod Nk=4)
            temp=subword(temp)
        end if
        w[i]=w[i-Nk] xor temp
        i=i+1
    end while
end

Figure 3.7: Pseudo Code for Key Expansion [20]
Figure 3.8: AES Keyschedule
Chapter 4

Reconfigurable and Compact Architecture of the AES

In this chapter, the reconfigurable and compact AES architecture is proposed. We introduce the contributions in detail, followed by the four transformations (shiftrows, subbytes, mixcolumns and addroundkey). The three key schedules with different key sizes (128-, 192-, 256-bit) are explained individually.

4.1 32-bit Single Round Unit

Roll unfolded architecture (Figure (4.1(a))) is widely used to achieve high speed. It processes several blocks of data during one clock cycle by implementing more than one round units on the hardware. The more round units the architecture implements, the higher the hardware cost. The opposite scheme, which is called the single round unit architecture (Figure (4.1(b))), can be applied to simplify the hardware complexity. Instead of unfolding
all the round units in devices, it implements a single round unit which costs approximately $1/Nr$ area as the unfolded scheme by sacrificing the speed (Figure (4.1(a))).

Both Figure ((4.1(a)) and ((4.1(b)) use 128-bit data path. Sticking to the goal of making a compact design, we propose a 32-bit single round unit (Figure (4.1 (c))). It needs four iterations to perform a round on a block (128-bit). This 32-bit data path scheme saves about 75% hardware, compared with the 128-bit single round unit (Figure (4.1 (b))).

4.2 Full Composite Field Encryptor and Keyschedule

![Diagram of Full Composite Field Encryptor and Keyschedule](image)

Figure 4.2: Partial Composite Field (a)- Full Composite Field (b)

Many high-end FPGA devices possess Block-RAMs (BRAMs) which are efficient for the implementation of S-Box. S-Box, also referred as subbytes, is the key part in both
encryptor and keyschedule modules. However, these BRAM-based designs cannot be implemented in the low-cost devices which do not have BRAMs. An alternative approach for S-Box implementation is using combinational logic. But this method may lead to high hardware complexity because of the mathematic operations of AES over finite field $GF(2^8)$.

The key step of S-Box is calculating multiplicative inverse of each byte (Section (3.1)). Since the introduction of composite field $GF((2^4)^2)$ based S-Box, numerous research [9, 11, 28, 31, 34] has investigated the calculation of the multiplicative inverses over $GF((2^4)^2)$, instead of $GF(2^8)$, to decrease hardware complexity (Figure (4.2(a))). In Figure (4.2), the arithmetic in the shadow area is performed over field $GF((2^4)^2)$. Figure (4.2(a)) shows that the architecture implements only multiplicative inverse in $GF((2^4)^2)$.

The architectures in [33, 22] extend the field $GF((2^4)^2)$ to affine transformation which makes all S-Box block operations performed in $GF((2^4)^2)$. By decomposing these operations from $GF(2^8)$ to its subfield $GF(2^4)$, the hardware complexity is decreased.

As in Figure (4.2(a)), in each round before S-Box, it needs an isomorphic mapping function ($MAP$) to convert a representation from $GF(2^8)$ to $GF((2^4)^2)$; to convert inversely after, it needs the inverse mapping ($MAP^{-1}$). If key size is 128 bits, it applies the S-Box to the plaintext and the cipherkey ten times, which means that it needs 20 $MAP$s and 20 $MAP^{-1}$s for the encryption of 128-bit data. In [32], for every byte, $MAP$ costs 11 XOR gates with 2 gates in critical path; $MAP^{-1}$ costs 15 XOR gates with 3 gates in critical path. $MAP$ and $MAP^{-1}$ together cost 33.3% in critical path and 21% gates in total for the subbytes transformation.

In order to reduce the cost of $MAP$ and $MAP^{-1}$ as much as possible, we propose the complete composite field approach (Figure (4.2(b))). The $GF((2^4)^2)$ field covers both encryptor and keyschedule. As illustrated in Figure (4.2(b)), one $MAP$ and one $MAP^{-1}$ are applied in encryption, one $MAP^{-1}$ is applied in keyschedule. This is a constant overhead
which is not affected by the round count. No matter what the key size is, the cost of mapping is the same.

The isomorphic mapping functions between field $GF(2^8)$ and field $GF((2^4)^2)$ are determined by the irreducible polynomials of field $GF(2^8)$ (Equation (2.3)) and field $GF((2^4)^2)$ (Equations (2.10) and (2.11)). We use the mapping formulas in [32] to conduct the transition of representations between $GF(2^8)$ and $GF((2^4)^2)$:

$$a_h x + a_l = MAP(a), \quad a_h, a_l \in GF(2^4), \quad a \in GF(2^8)$$

$$a_A = a_1 \oplus a_7, \quad a_B = a_5 \oplus a_7, \quad a_C = a_4 \oplus a_6$$

$$a_{l_0} = a_C \oplus a_0 \oplus a_5, \quad a_{l_1} = a_1 \oplus a_2, \quad a_{l_2} = a_A, \quad a_{l_3} = a_2 \oplus a_4$$

$$a_{h_0} = a_C \oplus a_5, \quad a_{h_1} = a_A \oplus a_C, \quad a_{h_2} = a_B \oplus a_2 \oplus a_3, \quad a_{h_3} = a_B$$

In Equation (4.1), $a$ is an element in field $GF(2^8)$. $MAP(a)$ convert $a$ to its isomorphic element in $GF((2^4)^2)$, which is represented as $a_h x + a_l$.

$$a = MAP^{-1}(a_h x + a_l), \quad a \in GF(2^8), \quad a_h, a_l \in GF(2^4)$$

$$a_A = a_{l_1} \oplus a_{h_3}, \quad a_B = a_{h_0} \oplus a_{h_1}$$

$$a_0 = a_{l_0} \oplus a_{h_0}, \quad a_1 = a_B \oplus a_{h_3}, \quad a_2 = a_A \oplus a_B$$

$$a_3 = a_B \oplus a_{l_1} \oplus a_{h_2}, \quad a_4 = a_A \oplus a_B \oplus a_{l_3}, \quad a_5 = a_B \oplus a_{l_2}$$

$$a_6 = a_A \oplus a_{l_2} \oplus a_{l_3} \oplus a_{h_0}, \quad a_7 = a_B \oplus a_{l_2} \oplus a_{h_3}$$

In Equation (4.2), $a_h x + a_l$ in an element in field $GF((2^4)^2)$. $MAP^{-1}(a_h x + a_l)$ convert $a_h x + a_l$ to its isomorphic element in $GF(2^8)$, which is represented as $a$. 

26
4.3 Subpipelined Encryptor and Keyschedule

The technique of pipelining is applied in the AES designs to optimize speed/area ratio in [1, 2, 8, 9, 11, 17, 33, 26, 27, 31, 34]. By inserting registers among combinational logic, multiple blocks are processed simultaneously. The frequency is determined by the maximum delay between two registers. When the maximum delay between two registers is decreased, the frequency is increased.

Figure (4.3(a)) is the fully unrolled pipelining architecture, which includes two steps. First, unfold the Nr round units on the device; second, insert registers between each round unit. In this case, the maximum delay is the period of one round which contains four transformations.

By cutting one round unit into more substages, we can further improve the frequency. This technology is called subpipelining [34]. Figure (4.3(b)) gives an example where registers are placed both between and inside each round unit. The frequency is determined by
the maximum delay of a substage. In this thesis, we propose a single round subpipelined architecture, where one round unit is implemented and subpipelined into eight substages.

To generate the roundkeys, we design an on-the-fly keyschedule, which generates a 32-bit roundkey at each clock cycle. The encryption unit and the key expansion unit share the same clock which leads to the fact that the general frequency is determined by the maximum delay in both units. Hence, the substage balance of keyschedule is as important as in encryptor. We propose a new subpipelined keyschedule on composite field for all standard key sizes. The most costly part of keyschedule is still the S-Box. We divide it into the same substages as in encryptor.

4.4 Double-Block Subpipelined Architecture

An equivalent decryptor along with the AES was introduced in FIPS [20], where the same architecture can be used in both encryption and decryption. Figure 5.7 in [30] illustrates the equivalent inverse cipher. It makes use of the fact that the order of subbytes and shiftrows can be exchanged because subbytes changes the value of each byte individually while shiftrows only rearranges their positions. So it changes the order of invshiftrows and invsubbytes, and add an extra step to conduct invmixcolumns on each roundkey. We can also change the sequence of shiftrows and subbytes in encryptor to obtain the same result. In this design, we put shiftrows before subbytes.

Figure (4.4) illustrates the proposed encryption architecture. The eight 32-bit registers (four in shiftrows, three in subbytes and one between subbytes and mixcolumns) are used to cut one round unit into eight substages, which leads to an eight clock cycles initial delay to generate the first 32-bit ciphertext. \textit{clk\_counter} in Figure (4.4) is a clock register counter generated in keyschedule. It is repeating from 0 to $8N_r + 7$ (Table (4.1)) and is used to synchronize encryptor and keyschedule.
Table 4.1: AES Encryption Sequence

<table>
<thead>
<tr>
<th>clk_counter</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
</tr>
</thead>
<tbody>
<tr>
<td>plaintext</td>
<td>PA(0)</td>
<td>PA(1)</td>
<td>PA(2)</td>
<td>PA(3)</td>
<td>PB(0)</td>
<td>PB(1)</td>
<td>PB(2)</td>
<td>PB(3)</td>
</tr>
<tr>
<td>cipherkey</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td>KB(0)</td>
<td>KB(1)</td>
<td>KB(2)</td>
<td>KB(3)</td>
</tr>
<tr>
<td>outcome(0)</td>
<td>OA(0)</td>
<td>OA(1)</td>
<td>OA(2)</td>
<td>OA(3)</td>
<td>OB(0)</td>
<td>OB(1)</td>
<td>OB(2)</td>
<td>OB(3)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>clk_counter</th>
<th>8</th>
<th>9</th>
<th>10</th>
<th>11</th>
<th>12</th>
<th>13</th>
<th>14</th>
<th>15</th>
</tr>
</thead>
<tbody>
<tr>
<td>input(1)</td>
<td>OA(0)</td>
<td>OA(1)</td>
<td>OA(2)</td>
<td>OA(3)</td>
<td>OB(0)</td>
<td>OB(1)</td>
<td>OB(2)</td>
<td>OB(3)</td>
</tr>
<tr>
<td>roundkey(1)</td>
<td>KA(4)</td>
<td>KA(5)</td>
<td>KA(6)</td>
<td>KA(7)</td>
<td>KB(4)</td>
<td>KB(5)</td>
<td>KB(6)</td>
<td>KB(7)</td>
</tr>
<tr>
<td>outcome(1)</td>
<td>OA(4)</td>
<td>OA(5)</td>
<td>OA(6)</td>
<td>OA(7)</td>
<td>OB(4)</td>
<td>OB(5)</td>
<td>OB(6)</td>
<td>OB(7)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>clk_counter</th>
<th>8Nr</th>
<th>8Nr+1</th>
<th>8Nr+2</th>
<th>8Nr+3</th>
<th>8Nr+4</th>
<th>8Nr+5</th>
<th>8Nr+6</th>
<th>8Nr+7</th>
</tr>
</thead>
<tbody>
<tr>
<td>input(Nr)</td>
<td>OA(4Nr-4)</td>
<td>OA(4Nr-3)</td>
<td>OA(4Nr-2)</td>
<td>OA(4Nr-1)</td>
<td>OB(4Nr-4)</td>
<td>OB(4Nr-3)</td>
<td>OB(4Nr-2)</td>
<td>OB(4Nr-1)</td>
</tr>
<tr>
<td>roundkey(Nr)</td>
<td>KA(4Nr)</td>
<td>KA(4Nr+1)</td>
<td>KA(4Nr+2)</td>
<td>KA(4Nr+3)</td>
<td>KB(4Nr)</td>
<td>KB(4Nr+1)</td>
<td>KB(4Nr+2)</td>
<td>KB(4Nr+3)</td>
</tr>
<tr>
<td>ciphertext</td>
<td>CA(0)</td>
<td>CA(1)</td>
<td>CA(2)</td>
<td>CA(3)</td>
<td>CB(0)</td>
<td>CB(1)</td>
<td>CB(2)</td>
<td>CB(3)</td>
</tr>
</tbody>
</table>
We use a double-block (block A and B) data flow to avoid the eight clock cycles initial delay. Table (4.1(a)) illustrates the data sequence of the initial round ($Nr = 0$).

\[
\{ PA(0), PA(1), PA(2), PA(3) \}: 128\text{-bit plaintext of block A} \\
\{ PB(0), PB(1), PB(2), PB(3) \}: 128\text{-bit plaintext of block B}
\]

They are put into AES during the first eight clock cycles and then processed alternately.

\[
\{ KA(0), KA(1), KA(2), KA(3) \}: \text{cipherkey for block A} \\
\{ KB(0), KB(1), KB(2), KB(3) \}: \text{cipherkey for block B}
\]
Because in the initial round, the encryption involves only addroundkey, which is the simple XOR operation, and the according roundkey is the MAPed cipherkey, the operation in this round is not delayed by registers. Hence, the outcome of the initial round (outcome(0)) is produced from the very beginning.

\[
\{OA(0), OA(1), OA(2), OA(3)\} : \text{outcome of round 0 for block A} \\
\{OB(0), OB(1), OB(2), OB(3)\} : \text{outcome of round 0 for block B}
\]

Table (4.1(b)) is for round 1, which goes through the eight substages. At the eighth clock cycle, \(OA(0)\) finishes the eight substages and XORes the the according roundkey \((KA(4))\) to generate the outcome \((OA(4))\) for block \(A\), so as block \(B\).

Table (4.1(c)) is for the last round \(Nr\).

\[
\{CA(0), CA(1), CA(2), CA(3)\} : \text{ciphertext for block A} \\
\{CB(0), CB(1), CB(2), CB(3)\} : \text{ciphertext for block B}
\]

Now we explain the 3-to-1 multiplexer (mul) controlled by the \(clk\_counter\):

- **Case a:** In initial round, where \(0 \leq clk\_counter < 8\), 128-bit plaintext is MAPed into \(GF((2^4)^2)\) and XORed with the according roundkey in four clock cycles, 32 bits at each clock cycle. The result is the outcome of the initial round \((Nr = 0)\) which is the input of the second round;

- **Case b:** In normal rounds, where \(8 \leq clk\_counter < Nr \times 8\), the outcome of mix-columns XORs with the according roundkey to produce the outcome of this round.

- **Case c:** The last round, where \(Nr \times 8 \leq clk\_counter < (Nr + 1) \times 8\), the transformation mixcolumns is skipped. The result of subbytes is added with its roundkey.

Finally, the outcome of the last round goes through \(MAP^{-1}\) to generate the ciphertext.
4.4.1 Column Fashion Shiftrows

This subsection proposes the column fashion shiftrows (Figure (4.5)). It includes 16 8-bit registers \((\text{Row0}, \text{Col0}, \text{Row1}, \text{Col1}, \ldots, \text{Row3}, \text{Col3})\) and three 2 to 1 multiplexers \((M1, M2, \text{and } M3)\). Both input and output of shiftrows is a state. Each column is a word \((W0, W1, W2, \text{and } W3)\), which includes four bytes. Every clock cycle it processes a 32-bit word (one column of a state), so four clock cycles are needed to produce a 128-bit state. The first 3 clock cycles are initial clock cycles, so the first word is shifted out at the 4th clock cycle. Figure (4.6) shows how it works in the first eight clock cycles. \(R_{00}, R_{01}, \ldots \text{ and } R_{33}\) stand for registers \(\text{Row0}, \text{Col0}, \text{Row1}, \text{Col1}, \ldots \) \text{ and } \text{Row3}, \text{Col3}.\) Each row shows their values at each clock cycle \((clk0, clk1, \ldots \text{ and } clk7)\). We will explain the shadow area and black border in the following text.

Four counters \((\text{Counter}_W0, \text{Counter}_W1, \text{Counter}_W2 \text{ and } \text{Counter}_W3)\) control the registers and the multiplexers. Table (4.2) shows how these signals are generated.

<table>
<thead>
<tr>
<th>(W)</th>
<th>(\text{Counter}_W0)</th>
<th>(\text{Counter}_W1)</th>
<th>(\text{Counter}_W2)</th>
<th>(\text{Counter}_W3)</th>
</tr>
</thead>
<tbody>
<tr>
<td>(W0)</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>(W1)</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>(W2)</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>(W3)</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>

When the first word \((W0)\) of a state is shifted in, \(\text{Counter}_W0 = 1\);  
When the second word \((W1)\) of a state is shifted in, \(\text{Counter}_W1 = 1\);  
When the third word \((W2)\) of a state is shifted in, \(\text{Counter}_W2 = 1\);  
When the forth word \((W3)\) of a state is shifted in, \(\text{Counter}_W3 = 1\).

Certain registers are controlled by special enable signals \((\text{Enable}_1, \text{col3}, \text{Enable}_2, \text{col23}, \text{and } \text{Enable}_3, \text{col123})\), others use the general enable signal, which
Figure 4.5: Column Fashion Shiftrows
is not shown.

<table>
<thead>
<tr>
<th></th>
<th>R00</th>
<th>R01</th>
<th>R02</th>
<th>R03</th>
<th>R10</th>
<th>R11</th>
<th>R12</th>
<th>R13</th>
<th>R20</th>
<th>R21</th>
<th>R22</th>
<th>R23</th>
<th>R30</th>
<th>R31</th>
<th>R32</th>
<th>R33</th>
</tr>
</thead>
<tbody>
<tr>
<td>clk0</td>
<td>A0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
<td>A0,1</td>
<td>A1,1</td>
<td>A2,1</td>
<td>A3,1</td>
<td>A0,2</td>
<td>A1,2</td>
<td>A2,2</td>
<td>A3,2</td>
<td>A0,3</td>
<td>A1,3</td>
<td>A2,3</td>
<td>A3,3</td>
</tr>
<tr>
<td>clk1</td>
<td>A0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
<td>A0,1</td>
<td>A1,1</td>
<td>A2,1</td>
<td>A3,1</td>
<td>A0,2</td>
<td>A1,2</td>
<td>A2,2</td>
<td>A3,2</td>
<td>A0,3</td>
<td>A1,3</td>
<td>A2,3</td>
<td>A3,3</td>
</tr>
<tr>
<td>clk2</td>
<td>A0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
<td>A0,1</td>
<td>A1,1</td>
<td>A2,1</td>
<td>A3,1</td>
<td>A0,2</td>
<td>A1,2</td>
<td>A2,2</td>
<td>A3,2</td>
<td>A0,3</td>
<td>A1,3</td>
<td>A2,3</td>
<td>A3,3</td>
</tr>
<tr>
<td>clk3</td>
<td>A0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
<td>A0,1</td>
<td>A1,1</td>
<td>A2,1</td>
<td>A3,1</td>
<td>A0,2</td>
<td>A1,2</td>
<td>A2,2</td>
<td>A3,2</td>
<td>A0,3</td>
<td>A1,3</td>
<td>A2,3</td>
<td>A3,3</td>
</tr>
<tr>
<td>clk4</td>
<td>B0,0</td>
<td>A0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>B0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
<td>B0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
<td>B0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
</tr>
<tr>
<td>clk5</td>
<td>B0,0</td>
<td>A0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>B0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
<td>B0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
<td>B0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
</tr>
<tr>
<td>clk6</td>
<td>B0,0</td>
<td>A0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>B0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
<td>B0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
<td>B0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
</tr>
<tr>
<td>clk7</td>
<td>B0,0</td>
<td>A0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>B0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
<td>B0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
<td>B0,0</td>
<td>A1,0</td>
<td>A2,0</td>
<td>A3,0</td>
</tr>
</tbody>
</table>

Output for the 1st state:

Figure 4.6: Two States’ Arrangement in Shiftrows Registers

Enable $\text{Enable}_{\text{row1}}_{\text{col3}}$ ($\text{Enable}_{\text{row1}}_{\text{col3}} = \text{Counter}_{\text{W3}}$) controls register $\text{Row1} \_\text{Col3}$. This enable signal is negative when clock is $\text{clk0}$, $\text{clk1}$, $\text{clk2}$, $\text{clk4}$, $\text{clk5}$, $\text{clk6}$, etc. $\text{Row1} \_\text{Col3}$ does not work during these clock cycles, which corresponds to the shadow areas of the column $\text{R13}$ in Figure (4.6);

Enable $\text{Enable}_{\text{row2}}_{\text{col23}}$ ($\text{Enable}_{\text{row2}}_{\text{col23}} = \text{Counter}_{\text{W2}} \lor \text{Counter}_{\text{W3}}$) controls registers $\text{Row2} \_\text{Col2}$ and $\text{Row2} \_\text{Col3}$. This enable signal is negative when clock is $\text{clk0}$, $\text{clk1}$, $\text{clk4}$, $\text{clk5}$, etc. $\text{Row2} \_\text{Col2}$ and $\text{Row2} \_\text{Col3}$ do not work during these clock cycles, which corresponds to the shadow area of the columns $\text{R22}$ and $\text{R23}$ in Figure (4.6);

Enable $\text{Enable}_{\text{row3}}_{\text{col123}}$ ($\text{Enable}_{\text{row3}}_{\text{col123}} = \text{Counter}_{\text{W1}} \lor \text{Counter}_{\text{W2}} \lor \text{Counter}_{\text{W3}}$) controls registers $\text{Row3} \_\text{Col1}$, $\text{Row3} \_\text{Col2}$ and $\text{Row3} \_\text{Col3}$. This enable signal is negative when clock is $\text{clk0}$, $\text{clk4}$, etc. $\text{Row3} \_\text{Col1}$, $\text{Row3} \_\text{Col2}$ and $\text{Row3} \_\text{Col3}$ do not work during these clock cycles, which corresponds to the shadow area of the columns $\text{R31}$, $\text{R32}$ and $\text{R33}$ in Figure (4.6).

For each input word:
The input (a state) of shiftrows is the MAPed ciphertext if it is the initial round; otherwise it is the outcome of the previous round. Each word of the state ($W_0$, $W_1$, $W_2$ and $W_3$) is shifted into the first column of the registers ($Row0\_Col0$, $Row1\_Col0$, $Row2\_Col0$ and $Row3\_Col0$) at each clock cycle.

**For each output word:**

- 1st byte is shifted out from $Row0\_Col3$, which corresponds to the black border area of column $R_{03}$ in Figure (4.6);

- 2nd byte is shifted out from $Row1\_Col3$ if ($Counter_{W2}$) is active, otherwise from $Row1\_Col2$, which corresponds to the black border area of columns $R_{12}$ and $R_{13}$ in Figure (4.6);

- 3rd byte is shifted out from $Row2\_Col3$ if ($Counter_{W1}$ or $Counter_{W2}$) is active, otherwise from $Row2\_Col1$, which corresponds to the black border area of columns $R_{21}$ and $R_{23}$ in Figure (4.6);

- 4th byte is shifted out from $Row3\_Col3$ if ($Counter_{W0}$ or $Counter_{W1}$ or $Counter_{W2}$) is active, otherwise from $Row3\_Col0$, which corresponds to the black border area of columns $R_{30}$ and $R_{33}$ in Figure (4.6).

Figure (4.6) takes two states $A$ and $B$ (Figure (4.7)) as the input of shiftrows. During the first eight clock cycles, each word of state $A$ and $B$ is shifted into the first column of registers ($R_{00}$, $R_{10}$, $R_{20}$ and $R_{30}$) one after another. The first three clock cycles are the initial cycles with no output.

At $clk3$, the first column of state $A$ is generated from registers ($R_{03}$, $R_{12}$, $R_{21}$ and $R_{30}$);

At $clk4$, the second column of state $A$ is generated from registers ($R_{03}$, $R_{12}$, $R_{21}$ and $R_{33}$);

At $clk5$, the third column of state $A$ is generated from registers ($R_{03}$, $R_{12}$, $R_{23}$ and $R_{33}$);
At clk6, the forth column of state A is generated from registers ($R_{03}, R_{13}, R_{23}$ and $R_{33}$).

The first output state is shown in the right down corner of the Figure (4.6).

4.4.2 Subpipelined Subbytes

The key step of subbytes is the calculation of the multiplicative inverse. Figure (4.8)
illustrates the architecture of subbytes used in [34], which applies Equation (2.15). As shown in this figure, it uses multiplication in $GF(2^4)$ three times. In order to distinguish the multipliers, we indicate them as $\times_1$, $\times_2$, $\times_3$. It also needs one inversion ($x^{-1}$), one constant multiplier with $\{E\}$ ($\times e$), $\{E\}$ is in hexadecimal notation, which is '1110' in binary notation), one squarer ($x^2$) and two 4-bit XORs ($\oplus$). These arithmetic operations are over field $GF(2^4)$.

Considering $x, y, z \in GF(2^4)$, $x, y$ and $z$ are represented in binary notation where $x = \{x_3x_2x_1x_0\}$, $y = \{y_3y_2y_1y_0\}$, $z = \{z_3z_2z_1z_0\}$. Let $a, b, c, d, e$ and $f$ are 1-bit value, which equals to 0 or 1. $\oplus$ stands for XOR-operation. $x_0y_1$ means $x_0 \land y_1$.

The following Equations (4.3), (4.4), (4.5) and (4.6) are used to calculate squaring, constant multiplication with $\{E\}$, multiplication and multiplicative inverse [32].

\[
y = x^2
\]

\[
y_0 = x_0 \oplus x_2, \quad y_1 = x_2
\]

\[
y_2 = x_1 \oplus x_3, \quad y_3 = x_3
\]

\[
y = x \times \{E\}
\]

\[
a = x_0 \oplus x_1, \quad b = x_2 \oplus x_3
\]

\[
y_0 = x_1 \oplus b, \quad y_1 = a
\]

\[
y_2 = a \oplus x_2, \quad y_3 = a \oplus b
\]
\[ z = x \times y \]

\[ a = x_0 \oplus x_3, \quad b = x_2 \oplus x_3, \quad c = x_1 \oplus x_2 \]

\[ z_0 = x_0 y_0 \oplus x_3 y_1 \oplus x_2 y_2 \oplus x_1 y_3 \]  \hspace{1cm} (4.5)

\[ z_1 = x_1 y_0 \oplus a y_1 \oplus b y_2 \oplus c y_3 \]

\[ z_2 = x_2 y_0 \oplus x_1 y_1 \oplus a y_2 \oplus b y_3 \]

\[ z_3 = x_3 y_0 \oplus x_2 y_1 \oplus x_1 y_2 \oplus a y_3 \]

\[ y = x^{-1} \]

\[ a = x_1 \oplus x_2 \oplus x_3 \oplus x_1 x_2 x_3 \]  \hspace{1cm} (4.6)

\[ y_0 = a \oplus x_0 \oplus x_0 x_2 \oplus x_1 x_2 \oplus x_0 x_1 x_2 \]

\[ y_1 = x_0 x_1 \oplus x_0 x_2 \oplus x_1 x_2 \oplus x_3 \oplus x_1 x_3 \oplus x_0 x_1 x_3 \]

\[ y_2 = x_0 x_1 \oplus x_2 \oplus x_0 x_2 \oplus x_3 \oplus x_0 x_3 \oplus x_0 x_2 x_3 \]

\[ y_3 = a \oplus x_0 x_3 \oplus x_1 x_3 \oplus x_2 x_3 \]

As illustrated in Figure (4.4), subbytes should be cut into four substages. The key to an efficient subpipelining technology is to balance the delays of these substages. Previous research [34] calculate the delay of an individual substage by counting the gates in critical path.

Xilinx ISE provides synthesis tool to yield the maximum combinational delay of an entity. A more straightforward method to achieve the optimal balance is to cut subbytes in different manners and use this synthesis tool to measure the delay of each substage (an entity). The most even delays of these substages stand for the optimal balanced substages arrangement.

Based on our experiments, Equation (4.6) is not suitable for this 4-substage subbytes.
With this equation, the substage including \( x^{-1} \) yields the longest delay, hence decreasing this substage’s delay can increase the general frequency. We derive a new Equation (4.7) from Equation (4.6) to reduce the delay caused by \( x^{-1} \). Equation (4.7) is derived in three steps:

1. In Equation (4.6), replace \( a \) by its expression, we have:

\[
\begin{align*}
\gamma_0 &= x_0 \oplus x_1 \oplus x_2 \oplus x_3 \oplus x_0 x_2 \oplus x_1 x_2 \oplus x_0 x_1 x_2 \oplus x_1 x_2 x_3 \\
\gamma_1 &= x_0 x_1 \oplus x_0 x_2 \oplus x_1 x_2 \oplus x_3 \oplus x_1 x_3 \oplus x_0 x_1 x_3 \\
\gamma_2 &= x_0 x_1 \oplus x_2 \oplus x_0 x_2 \oplus x_0 x_3 \oplus x_0 x_2 x_3 \\
\gamma_3 &= x_1 \oplus x_2 \oplus x_3 \oplus x_1 x_2 x_3 \oplus x_0 x_3 \oplus x_1 x_3 \oplus x_2 x_3
\end{align*}
\]

2. The expressions in step 1 can be equally changed to:

\[
\begin{align*}
\gamma_0 &= x_1 \oplus x_2 \oplus x_1 x_2 \oplus x_0 x_2 \oplus (x_0 \oplus x_3) (1 \oplus x_1 x_2) \\
\gamma_1 &= x_1 x_2 \oplus x_0 x_2 \oplus x_0 x_1 \oplus x_3 (1 \oplus x_1 \oplus x_0 x_1) \\
\gamma_2 &= x_2 \oplus x_0 x_2 \oplus x_0 x_1 \oplus x_3 (1 \oplus x_0 \oplus x_0 x_2) \\
\gamma_3 &= x_1 \oplus x_2 \oplus x_3 (1 \oplus x_0 \oplus x_1 \oplus x_2 \oplus x_1 x_2)
\end{align*}
\]

3. Let \( a = x_1 x_2, \ b = x_0 x_2, \ c = x_0 x_1, \ d = x_1 \oplus x_2, \ e = 1 \oplus a \) and \( f = b \oplus c \), we have:

\[
\begin{align*}
y &= x^{-1} \\
\gamma_0 &= a \oplus b \oplus d \oplus (x_0 \oplus x_3) e \\
\gamma_1 &= a \oplus f \oplus x_3 (x_1 \oplus 1 \oplus c) \\
\gamma_2 &= f \oplus x_2 \oplus x_3 (b \oplus 1 \oplus x_0) \\
\gamma_3 &= d \oplus x_3 (e \oplus x_0 \oplus d)
\end{align*}
\]
According to Equation (4.7), we design the circuit Figure (4.9) to perform $x^{-1}$ over $GF(2^4)$.

Besides multiplicative inversion, other expensive operations in Figure (4.8) are the three multiplications ($\times_1$, $\times_2$ and $\times_3$). In order to decrease the maximum delay caused by multiplication, we separate each multiplication into two steps and put each step in different substages. The registers between each stage store the result of the first step of multiplication and pass it to the second step. We decompose these three multipliers into two different manners (AB-type and MN-type) to achieve the best balance.

**AB-type:** Equation (4.8) is derived from Equation (4.5). $p_0, p_1, ..., p_{15}$ are 1-bit values, which represents one AND term in Equation (4.5). *Step A* calculates the value of all the terms; *Step B* conducts XOR of every four values to generate $z_0$, $z_1$, $z_2$ and $z_3$. A register is inserted between *Step A* and *Step B* to store $p_0, p_1, ..., p_{15}$. $\times_1$ in Figure (4.8) is separated in this way, as $\times_{1A}$ and $\times_{1B}$ in Figure (4.9):

$$z = x \times y \ (AB-type)$$

---

**Step A**

\[ a = x_0 \oplus x_3, \quad b = x_2 \oplus x_3, \quad c = x_1 \oplus x_2 \]

\[ p_0 = x_0 y_0, \quad p_1 = x_3 y_1, \quad p_2 = x_2 y_2, \quad p_3 = x_1 y_3 \]

\[ p_4 = x_1 y_0, \quad p_5 = a y_1, \quad p_6 = b y_2, \quad p_7 = c y_3 \]

\[ p_8 = x_2 y_0, \quad p_9 = x_1 y_1, \quad p_{10} = a y_2, \quad p_{11} = b y_3 \]

\[ p_{12} = x_3 y_0, \quad p_{12} = x_2 y_1, \quad p_{14} = x_1 y_2, \quad p_{15} = a y_3 \]

---

**Step B**

\[ z_0 = p_0 \oplus p_1 \oplus p_2 \oplus p_3 \]

\[ z_1 = p_4 \oplus p_5 \oplus p_6 \oplus p_7 \]

\[ z_2 = p_8 \oplus p_9 \oplus p_{10} \oplus p_{11} \]

\[ z_3 = p_{12} \oplus p_{13} \oplus p_{14} \oplus p_{15} \]
MN-type: Equation (4.9) is also derived from Equation (4.5). *Step M* creates the value of $a$, $b$ and $c$; *Step N* finishes the rest of Equation (4.5). A register is inserted between *Step M* and *Step N* to store $a, b, c$. $\times_2$ and $\times_3$ in Figure (4.8) are separated in this way, as $\times_2M$ and $\times_2N$, $\times_3M$ and $\times_3N$ in Figure (4.9).

\[ z = x \times y \ (MN\text{-type}) \]

---

\[ a = x_0 \oplus x_3, \quad b = x_2 \oplus x_3, \quad c = x_1 \oplus x_2 \]

---

\[ z_0 = x_0y_0 \oplus x_3y_1 \oplus x_2y_2 \oplus x_1y_3 \]

\[ z_1 = x_1y_0 \oplus ay_1 \oplus by_2 \oplus cy_3 \]

\[ z_2 = x_2y_0 \oplus x_1y_1 \oplus ay_2 \oplus by_3 \]

\[ z_3 = x_3y_0 \oplus x_2y_1 \oplus x_1y_2 \oplus ay_3 \]

(4.9)

The last operation in subbytes is the affine transformation. We derive Equation (4.16) to do the affine transformation, based on Equation (3.1), Equation (4.1) and Equation (4.2). First, we change the format of Equation (4.1) and Equation (4.2).

Consider $p \in GF((2^4)^2)$, $q \in GF(2^8)$:

\[ p = \{p_7p_6p_5p_4p_3p_2p_1p_0\} \]

\[ q = \{q_7q_6q_5q_4q_3q_2q_1q_0\} \]

For Equation (4.1):

1. In expression of $a_{l0}, \ldots, a_{l3}$, replace $a_A$, $a_B$ and $a_C$ by their expression

\[ a_{l0} = a_4 \oplus a_6 \oplus a_0 \oplus a_5 \]

\[ a_{l1} = a_1 \oplus a_2 \]

\[ a_{l2} = a_1 \oplus a_7 \]
\[ a_{l3} = a_2 \oplus a_4 \]
\[ a_{h0} = a_4 \oplus a_6 \oplus a_5 \]
\[ a_{h1} = a_1 \oplus a_7 \oplus a_4 \oplus a_6 \]
\[ a_{h2} = a_5 \oplus a_7 \oplus a_2 \oplus a_3 \]
\[ a_{h3} = a_5 \oplus a_7 \]

2. Let \( p \) replace \( a_{hx} + a_1 \), \( q \) replace \( a \), we have Equation (4.10)

\[
p = MAP(q), \ p \in GF((2^4)^2), \ q \in GF(2^8)
\]

\[
p_0 = q_0 \oplus q_4 \oplus q_5 \oplus q_6
\]
\[
p_1 = q_1 \oplus q_2
\]
\[
p_2 = q_1 \oplus q_7
\]
\[
p_3 = q_2 \oplus q_4
\]
\[
p_4 = q_4 \oplus q_5 \oplus q_6
\]
\[
p_5 = q_1 \oplus q_4 \oplus q_6 \oplus q_7
\]
\[
p_6 = q_2 \oplus q_3 \oplus q_5 \oplus q_7
\]
\[
p_7 = q_5 \oplus q_7
\]

The same steps for Equation (4.2):

1. In expression of \( a_0, \ldots, a_7 \), replace \( a_A \) and \( a_B \) by their expression

\[
a_0 = a_{l0} \oplus a_{h0}
\]
\[
a_1 = a_{h0} \oplus a_{h1} \oplus a_{h3}
\]
\[
a_2 = a_{l1} \oplus a_{h3} \oplus a_{h0} \oplus a_{h1}
\]
\[
a_3 = a_{h0} \oplus a_{h1} \oplus a_{l1} \oplus a_{h2}
\]
\[
a_4 = a_{l1} \oplus a_{h3} \oplus a_{h0} \oplus a_{h1} \oplus a_{l3}
\]
\[
a_5 = a_{h0} \oplus a_{h1} \oplus a_{l2}
\]
\[ a_6 = a_{l1} \oplus a_{h3} \oplus a_{l2} \oplus a_{l3} \oplus a_{h0} \]
\[ a_7 = a_{h0} \oplus a_{h1} \oplus a_{l2} \oplus a_{h3} \]

2. Let \( q \) replace \( a \), \( p \) replace \( a_{h}x + a_{l} \), we have Equation (4.11)

\[ q = MAP^{-1}(p), \; q \in GF(2^8), \; p \in GF((2^4)^2) \]

\[ q_0 = p_0 \oplus p_4 \]
\[ q_1 = p_4 \oplus p_5 \oplus p_7 \]
\[ q_2 = p_1 \oplus p_4 \oplus p_5 \oplus p_7 \]
\[ q_3 = p_1 \oplus p_4 \oplus p_5 \oplus p_6 \]
\[ q_4 = p_1 \oplus p_3 \oplus p_4 \oplus p_5 \oplus p_7 \]
\[ q_5 = p_2 \oplus p_4 \oplus p_5 \]
\[ q_6 = p_1 \oplus p_2 \oplus p_3 \oplus p_4 \oplus p_7 \]
\[ q_7 = p_2 \oplus p_4 \oplus p_5 \oplus p_7 \]

Now we use Equation (3.1), Equation (4.10) and Equation (4.11) to derive Equation (4.16).

Let \( x', y \) be the elements in \( GF(2^8) \):

\[ x' = \{x'_7, x'_6, x'_5, x'_4, x'_3, x'_2, x'_1, x'_0\} \]
\[ y = \{y_7y_6y_5y_4y_3y_2y_1y_0\} \]

According to Equation (3.1), we have:
\[ y_0 = x'_0 \oplus x'_4 \oplus x'_5 \oplus x'_6 \oplus x'_7 \oplus 1 \]
\[ y_1 = x'_0 \oplus x'_1 \oplus x'_5 \oplus x'_6 \oplus x'_7 \oplus 1 \]
\[ y_2 = x'_0 \oplus x'_1 \oplus x'_2 \oplus x'_6 \oplus x'_7 \]
\[ y_3 = x'_0 \oplus x'_1 \oplus x'_2 \oplus x'_3 \oplus x'_7 \]
\[ y_4 = x'_0 \oplus x'_1 \oplus x'_2 \oplus x'_3 \oplus x'_4 \]
\[ y_5 = x'_1 \oplus x'_2 \oplus x'_3 \oplus x'_4 \oplus x'_5 \oplus 1 \]
\[ y_6 = x'_2 \oplus x'_3 \oplus x'_4 \oplus x'_5 \oplus x'_6 \oplus 1 \]
\[ y_7 = x'_3 \oplus x'_4 \oplus x'_5 \oplus x'_6 \oplus x'_7 \]  

(4.12)

In the following, we convert the result of \( y \) to the field \( GF((2^4)^2) \), and use the \( GF((2^4)^2) \) format to represent \( x' \). Thus, we can derive the affine transformation in \( GF((2^4)^2) \).

1. We let \( w \) to represent \( y \) in \( GF((2^4)^2) \) (\( w \) is one element in \( GF((2^4)^2) \)). According to Equation (4.10) (Map from \( GF(2^8) \) to \( GF((2^4)^2) \)):

\[ w_0 = y_0 \oplus y_4 \oplus y_5 \oplus y_6 \]
\[ w_1 = y_1 \oplus y_2 \]
\[ w_2 = y_1 \oplus y_7 \]
\[ w_3 = y_2 \oplus y_4 \]
\[ w_4 = y_4 \oplus y_5 \oplus y_6 \]
\[ w_5 = y_1 \oplus y_4 \oplus y_6 \oplus y_7 \]
\[ w_6 = y_2 \oplus y_3 \oplus y_5 \oplus y_7 \]
\[ w_7 = y_5 \oplus y_7 \]  

(4.13)

2. Next, we use \( GF((2^4)^2) \) format to represent \( x' \) in Equation 4.12. Let \( z \) be the \( GF((2^4)^2) \) format of \( x' \). From Equation 4.11, we have:
\[\begin{align*}
x'_0 &= z_0 \oplus z_4 \\
x'_1 &= z_4 \oplus z_5 \oplus z_7 \\
x'_2 &= z_1 \oplus z_4 \oplus z_5 \oplus z_7 \\
x'_3 &= z_1 \oplus z_4 \oplus z_5 \oplus z_6 \\
x'_4 &= z_1 \oplus z_3 \oplus z_4 \oplus z_5 \oplus z_7 \\
x'_5 &= z_2 \oplus z_4 \oplus z_5 \\
x'_6 &= z_1 \oplus z_2 \oplus z_3 \oplus z_4 \oplus z_7 \\
x'_7 &= z_2 \oplus z_4 \oplus z_5 \oplus z_7 \\
\end{align*}\] (4.14)

3. Now, we replace \( y \) with its \( GF(2^4) \) format \((w)\), and replace \( x' \) with its \( GF(2^4) \) format \((z)\):

\[
w_0 = y_0 \oplus y_4 \oplus y_5 \oplus y_6 \\
= (x'_0 \oplus x'_4 \oplus x'_6 \oplus x'_7 \oplus 1) \oplus (x'_1 \oplus x'_2 \oplus x'_3 \oplus x'_7) \oplus (x'_1 \oplus x'_2 \oplus x'_3 \oplus x'_4 \oplus x'_5 \oplus 1) \oplus (x'_2 \oplus x'_3 \oplus x'_4 \oplus x'_5 \oplus x'_6 \oplus 1) \quad \text{(By Equation (4.12))}
\]

\[
= x'_2 \oplus x'_3 \oplus x'_5 \oplus x'_7 \oplus 1 \\
= (z_1 \oplus z_4 \oplus z_5 \oplus z_7) \oplus (z_1 \oplus z_4 \oplus z_5 \oplus z_6) \oplus (z_2 \oplus z_4 \oplus z_5) \oplus (z_2 \oplus z_4 \oplus z_5 \oplus z_7) \oplus 1 \quad \text{(By Equation (4.14))}
\]

\[
= z_6 \oplus 1 = \overline{z_6}
\]

In the same way, we can get:
\[ \begin{align*}
w_0 &= z_6 \\
w_1 &= z_1 \oplus z_2 \oplus z_7 \\
w_2 &= z_0 \oplus z_5 \oplus z_6 \oplus z_3 \\
w_3 &= z_1 \oplus z_5 \oplus z_6 \oplus z_7 \\
w_4 &= z_0 \oplus z_2 \oplus z_4 \oplus z_5 \oplus z_6 \oplus z_7 \\
w_5 &= z_1 \oplus z_5 \oplus z_6 \\
w_6 &= z_2 \oplus z_6 \oplus z_7 \\
w_7 &= z_3 \oplus z_5
\end{align*} \] (4.15)

4. Finally, for the consistency of the other equations in this thesis, we replace \( w \) by \( y \), \( z \) by \( x \) \((x, y \in GF((2^4)^2))\). Let \( a = x_5 \oplus x_6 \oplus x_7 \), we have:

\[ y = AFF_{\text{TRAN}}(x) \]

\[ a = x_5 \oplus x_6 \oplus x_7 \]

\[ \begin{align*}
y_0 &= x_6, & y_1 &= x_1 \oplus x_2 \oplus x_7 \\
y_2 &= x_0 \oplus x_3 \oplus x_5 \oplus x_6, & y_3 &= x_1 \oplus a \\
y_4 &= x_0 \oplus x_2 \oplus x_4 \oplus a, & y_5 &= x_1 \oplus x_5 \oplus x_6 \\
y_6 &= x_2 \oplus x_6 \oplus x_7, & y_7 &= x_3 \oplus x_5
\end{align*} \] (4.16)

Figure (4.9) describes the proposed subpipelined architecture of subbytes in \( GF((2^4)^2) \). The following symbols represent the equation for each arithmetic block in Figure (4.9), except the \( \oplus \), which is a simple 4-bit XOR operation. The dashed lines in Figure (4.9) stand for the registers.
\[ x^2 \quad \text{--- Equation (4.3)[32]} \]
\[ \times e \quad \text{--- Equation (4.4)[32]} \]
\[ \times_{1A} \quad \text{--- Equation (4.8) Step A} \]
\[ \times_{1B} \quad \text{--- Equation (4.8) Step B} \]
\[ \times_{2M} \text{ and } \times_{3M} \quad \text{--- Equation (4.9) Step M} \]
\[ \times_{2N} \text{ and } \times_{3N} \quad \text{--- Equation (4.9) Step N} \]
\[ x^{-1} \quad \text{--- Equation (4.7)} \]
\[ \text{AFF_TRAN} \quad \text{--- Equation (4.16)} \]

Figure 4.9: Pipelined Subbytes in composite field \( GF((2^4)^2) \)

Table (4.3) shows the time (ns) and area (slices) cost of each substage (I, II, III, IV in Figure (4.9)) when it runs on different FPGA devices. We cut an AES round unit into 8 substages with the maximum delay determined by part II in subbytes.

Table 4.3: Path Delays and Number of Slices for Spartan2E and Virtex2

<table>
<thead>
<tr>
<th>Delay(ns):Slices</th>
<th>I</th>
<th>II</th>
<th>III</th>
<th>IV</th>
</tr>
</thead>
</table>
4.4.3 Mixcolumns on $GF((2^4)^2)$

Mixcolumns is another transformation which involves mathematical operations on $GF((2^4)^2)$. We derive the equations to perform mixcolumns in the composite field in this subsection.

Subsection (3.3) describes mixcolumn in finite field $GF(2^8)$. Since $GF((2^4)^2)$ is an isomorphic field to $GF(2^8)$, and in $GF((2^4)^2)$, $\{02\}$ is mapped to $\{26\}$, $\{03\}$ is mapped to $\{27\}$, $\{01\}$ is still $\{01\}$, Equation (3.3) can be mapped directly to Equation (4.17).

\[
\begin{bmatrix}
26 & 27 & 01 & 01 \\
01 & 26 & 27 & 01 \\
01 & 01 & 26 & 27 \\
27 & 01 & 01 & 26
\end{bmatrix}
\begin{bmatrix}
s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\
s_{1,0} & s_{1,1} & s_{1,2} & s_{1,3} \\
s_{2,0} & s_{2,1} & s_{2,2} & s_{2,3} \\
s_{3,0} & s_{3,1} & s_{3,2} & s_{3,3}
\end{bmatrix}
= 
\begin{bmatrix}
s'_{0,0} & s'_{0,1} & s'_{0,2} & s'_{0,3} \\
s'_{1,0} & s'_{1,1} & s'_{1,2} & s'_{1,3} \\
s'_{2,0} & s'_{2,1} & s'_{2,2} & s'_{2,3} \\
s'_{3,0} & s'_{3,1} & s'_{3,2} & s'_{3,3}
\end{bmatrix} \tag{4.17}
\]

Observing that in $GF((2^4)^2)$, $\{27\} = \{26\} \oplus \{01\}$, Equation (4.17) is equal to Equation (4.18), where $j = 0, 1, 2, 3$:

\[
\begin{align*}
S'_{0,j} &= \{26\} \times (S_{0,j} \oplus S_{1,j}) \oplus S_{1,j} \oplus S_{2,j} \oplus S_{3,j} \\
S'_{1,j} &= \{26\} \times (S_{1,j} \oplus S_{2,j}) \oplus S_{0,j} \oplus S_{2,j} \oplus S_{3,j} \\
S'_{2,j} &= \{26\} \times (S_{2,j} \oplus S_{3,j}) \oplus S_{0,j} \oplus S_{1,j} \oplus S_{3,j} \\
S'_{3,j} &= \{26\} \times (S_{0,j} \oplus S_{3,j}) \oplus S_{0,j} \oplus S_{1,j} \oplus S_{2,j} \tag{4.18}
\end{align*}
\]

Equation (4.18) presents the mixcolumn transformation of one column of a state. We implement the mixcolumn transformation as the structure in Figure (4.10).

In the following, we derive Equation (4.22) to calculate $x \times 26$ in $GF((2^4)^2)$. That is, we represent the results of $x \times \{02\}$ in $GF((2^4)^2)$:

1. Let $x, y \in GF(2^8)$, using Equation (2.5) to calculate $y = x \times \{02\}$. 

48
Figure 4.10: $GF((2^4)^2)$ Based Mixcolumns

\[ y_0 = x_7, \ y_1 = x_0 \oplus x_7, \ y_2 = x_1 \]
\[ y_3 = x_2 \oplus x_7, \ y_4 = x_3 \oplus x_7, \ y_5 = x_4 \]
\[ y_6 = x_5, \ y_7 = x_6 \]  

(4.19)

2. Convert $y$ to the field element in $GF((2^4)^2)$. Let $w$ to represent $y$ in $GF((2^4)^2)$ ($w$ is one element in $GF((2^4)^2)$). We have the same equation as Equation (4.13).

3. Next, we use $GF((2^4)^2)$ format to represent $x$. Let $z$ be the $GF((2^4)^2)$ format of $x$. $z$ is one element in $GF((2^4)^2)$. By Equation (4.11), we have:
\[ x_0 = z_0 \oplus z_4 \]
\[ x_1 = z_4 \oplus z_5 \oplus z_7 \]
\[ x_2 = z_1 \oplus z_4 \oplus z_5 \oplus z_7 \]
\[ x_3 = z_1 \oplus z_4 \oplus z_5 \oplus z_6 \]
\[ x_4 = z_1 \oplus z_3 \oplus z_4 \oplus z_5 \oplus z_7 \]
\[ x_5 = z_2 \oplus z_4 \oplus z_5 \]
\[ x_6 = z_1 \oplus z_2 \oplus z_3 \oplus z_4 \oplus z_7 \]
\[ x_7 = z_2 \oplus z_4 \oplus z_5 \oplus z_7 \] (4.20)

4. We replace \( x \) and \( y \) with their corresponding \( GF((2^4)^2) \) format, \( z \) and \( w \), we have:

\[ w_0 = y_0 \oplus y_4 \oplus y_5 \oplus y_6 \] (By Equation (4.13))
\[ = (x_7) \oplus (x_3 \oplus x_7) \oplus (x_4) \oplus (x_5) \] (By Equation (4.19))
\[ = x_3 \oplus x_4 \oplus x_5 \]
\[ = (z_1 \oplus z_4 \oplus z_5 \oplus z_6) \oplus (z_1 \oplus z_3 \oplus z_4 \oplus z_5 \oplus z_7) \oplus (z_2 \oplus z_4 \oplus z_5) \] (By Equation (4.20))
\[ = z_2 \oplus z_3 \oplus z_4 \oplus z_5 \oplus z_6 \oplus z_7 \]

By the same method, we derive:

\[ w_0 = z_2 \oplus z_3 \oplus z_4 \oplus z_5 \oplus z_6 \oplus z_7 \]
\[ w_1 = z_0 \oplus z_2 \oplus z_4 \]
\[ w_2 = z_0 \oplus z_1 \oplus z_3 \oplus z_4 \oplus z_5 \]
\[ w_3 = z_1 \oplus z_2 \oplus z_4 \oplus z_5 \oplus z_6 \] (4.21)
\[ w_4 = z_3 \oplus z_6 \]
\[ w_5 = z_0 \oplus z_3 \oplus z_6 \oplus z_7 \]
\[ w_6 = z_1 \oplus z_4 \oplus z_7 \]
\[ w_7 = z_2 \oplus z_5 \]
5. To be consistent, we replace \( z \) with \( x \), and replace \( w \) with \( y \) \((x, y \in GF((2^4)^2))\). In addition, in order to calculate the mixcolumns operations efficiently, we store the intermediate results. Let \( a = x_2 \oplus x_4 \), \( b = x_3 \oplus x_6 \oplus x_7 \), \( c = x_1 \oplus x_5 \), we have:

\[
y = x \otimes 26, \ x, y \in GF((2^4)^2)
\]

\[
a = x_2 \oplus x_4, \quad b = x_3 \oplus x_6 \oplus x_7, \quad c = x_1 \oplus x_5
\]

\[
y_0 = a \oplus b \oplus x_5, \quad y_1 = a \oplus x_0
\]

\[
y_2 = c \oplus x_0 \oplus x_3 \oplus x_4, \quad y_3 = c \oplus a \oplus x_6
\]

\[
y_4 = x_3 \oplus x_6, \quad y_5 = b \oplus x_0
\]

\[
y_6 = x_1 \oplus x_4 \oplus x_7, \quad y_7 = x_2 \oplus x_5
\]

This mixcolumns architecture (Figure (4.10)) is a 32-bit parallel combinational logic. When synthesized on Virtex2 XC2V2000, it costs 28 1-bit 2-input XOR gates, 44 1-bit 3-input XOR gates, four 1-bit 4-input XOR gates and four 8-bit 2-input XOR gates. The maximum combinational path delay is 7.922ns.

In the above section, we have designed all the modules in AES over field \( GF((2^4)^2) \). In this way, each byte of the data needs only one MAP before the initial round and one inverse MAP after the last round.

### 4.4.4 Subpipelined Keyschedule

There are two approaches to implement keyschedule: (1) pre-calculated keyschedule and (2) on-the-fly keyschedule. In the pre-calculated keyschedule, the \((Nr + 1)\) 128-bit round-keys are generated before the encryption or decryption begins and stored in the memory. The addroundkey operation accesses the roundkeys by referring the corresponding address in the memory. The advantage of this approach is that the keyschedule only needs to be
performed once; however, the drawbacks include:

1. The \((Nr + 1)\) roundkeys cost \((Nr + 1) \times 128\) bits memory space;

2. The cipherkey cannot change frequently. Every time it changes, the roundkeys must be recalculated.

In this thesis, we propose a new 32-bit on-the-fly keyschedule in composite field \((GF((2^4)^2))\) with 128-, 192-, 256-bit key sizes, where each 128-bit roundkey is generated at every four clock cycles (32-bit at each clock). This is suitable for our 32-bit encryption architecture.

Table (4.1) shows the 32-bit roundkeys at each clock cycle. The following list explains this table for the three key sizes.

- When key size=128 bits, \(Nr=10\), it generates 11 128-bit roundkeys for both block A and B from cycles 0 to 87.

The roundkeys for block A:

\[
\begin{align*}
\text{roundkey}[0] &= \{KA(0), KA(1), KA(2), KA(3)\} \\
\text{roundkey}[1] &= \{KA(4), KA(5), KA(6), KA(7)\} \\
\text{......} \\
\text{roundkey}[10] &= \{KA(40), KA(41), KA(42), KA(43)\}
\end{align*}
\]

The roundkeys for block B:

\[
\begin{align*}
\text{roundkey}[0] &= \{KB(0), KB(1), KB(2), KB(3)\} \\
\text{roundkey}[1] &= \{KB(4), KB(5), KB(6), KB(7)\} \\
\text{......} \\
\text{roundkey}[10] &= \{KB(40), KB(41), KB(42), KB(43)\}
\end{align*}
\]

- When key size=192, \(Nr=12\), it generates 13 roundkeys for both block A and B from
cycles 0 to 103.
The roundkeys for block A:
roundkey[0]=\{KA(0), KA(1), KA(2), KA(3)\}
roundkey[1]=\{KA(4), KA(5), KA(6), KA(7)\}
......
roundkey[12]=\{KA(48), KA(49), KA(50), KA(51)\}
The roundkeys for block B:
roundkey[0]=\{KB(0), KB(1), KB(2), KB(3)\}
roundkey[1]=\{KB(4), KB(5), KB(6), KB(7)\}
......
roundkey[12]=\{KB(48), KB(49), KB(50), KB(51)\}

• When key size=256, Nr=14, it generates 15 roundkeys for both block A and B from cycles 0 to 119.
The roundkeys for block A:
roundkey[0]=\{KA(0), KA(1), KA(2), KA(3)\}
roundkey[1]=\{KA(4), KA(5), KA(6), KA(7)\}
......
roundkey[14]=\{KA(56), KA(57), KA(58), KA(59)\}
The roundkeys for block B:
roundkey[0]=\{KB(0), KB(1), KB(2), KB(3)\}
roundkey[1]=\{KB(4), KB(5), KB(6), KB(7)\}
......
roundkey[14]=\{KB(56), KB(57), KB(58), KB(59)\}
Because we are using the on-the-fly keyschedule, keyschedule and encryptor are sharing the same clock, which means the general frequency is determined by the maximum delay in both keyschedule and encryptor modules. To achieve an efficient pipelining, proper division in keyschedule is as important as in encryptor. We know that subword is the most costly part in keyschedule. In order to make the same maximum delay in both modules, we implement subword in the same way as subbytes in encryptor.

In keyschedule module, rotword rearranges the position of each byte without changing its value, hence the sequence of rotword and subword can be changed. We do the subword operation before rotword to save one multiplexer in keyschedule 256.

All mathematic operations in keyschedule are transformed into field $GF((2^4)^2)$. Subword shares the same structure as in subbytes. Xorrcon is a simple XOR operation with a round constant, which is initially $\{01\}$ and multiplied by $\{02\}$ each keyschedule round. Keyschedule round is defined in this way. It begins when $clk\_counter = 0$. If key size is 128, keyschedule round cycle is four; if key size is 192, keyschedule round cycle is six; if key size is 256, keyschedule round cycle is eight. As explained in Subsection (4.4.3), in $GF((2^4)^2)$, $\{01\}$ is still $\{01\}$, $\{02\}$ is mapped to $\{26\}$. We can use Equation (4.22) to generate round constant for each keyschedule round in $GF((2^4)^2)$.

This keyschedule has three key size options: Key128, Key192 and Key256. In the following section, we discuss the generation of roundkeys in details for these three key size options. In the rest of the chapter, roundkey$_{32}$ stands for 32-bit roundkey for each clock cycle, roundkey stands for 128-bit roundkey for a round of AES.

**Key128**

When key size is 128 bits, the encryptor round count is ten. Two blocks $A$ and $B$ need 22 roundkeys. Figure (4.11) illustrates the architecture of keyschedule when key size is 128
In our design, the first step is to map (MAP) cipherkey from $GF(2^8)$ to $GF((2^4)^2)$. After that, it performs its isomorphic functions in $GF((2^4)^2)$. The output of keystable are $\text{roundkey}_{32s}$ represented in $GF((2^4)^2)$. They are the exact format required in encryption where the message blocks are also represented in $GF((2^4)^2)$, hence no inverse MAP follows roundkey.

In Figure (4.11), W7, ..., W0 are 32-bit words separated by eight registers, which are used to store the previous eight $\text{roundkey}_{32s}$. SA, SB, SC and SD are the results of the 4 parts of subword. We place three registers among the four substages in subword, same as in Figure (4.9). RW is the outcome of rotword. RC generates the round constant for xorxcon in $GF((2^4)^2)$. mul is a 3-to-1 multiplexer controlled by $\text{clk}_\text{counter}$, which is the same signal as in Figure (4.4) and Table (4.1). There are three different cases (a, b, c) to generate the current $\text{roundkey}_{32}$. Table (4.4) explains the value of each register in Figure (4.11) during the first 15 clock cycles. In this table, the row titled $\text{mul}$ stands for the multiplexer in Figure (4.4). $a$, $b$ and $c$ are the three cases. In the following expressions, $\text{roundkey}_{32}[i]$ stands for the cell of this table with row ID of $\text{roundkey}_{32}$ and column ID of $\text{clk}_\text{counter} = i$. 

Figure 4.11: Architecture of Keyschedule 128
**Case a:** \( \text{clk} \_\text{counter} < 8 \) (initial round):

\[
\text{roundkey}_{32}[\text{clk} \_\text{counter}] = \text{MAP}(\text{cipherkey}[\text{clk} \_\text{counter}]);
\]

**Case b:** \( \text{clk} \_\text{counter} \geq 8 \) and \( \text{clk} \_\text{counter} \mod 4 \neq 0 : \)

\[
\text{roundkey}_{32}[\text{clk} \_\text{counter}] = \text{roundkey}_{32}[\text{clk} \_\text{counter} - 1] \oplus \text{roundkey}_{32}[\text{clk} \_\text{counter} - 8]
\]

\( \text{roundkey}_{32}[\text{clk} \_\text{counter} - 1] \) is stored in \( W7 \)

\( \text{roundkey}_{32}[\text{clk} \_\text{counter} - 8] \) is stored in \( W0 \);

**Case c:** \( \text{clk} \_\text{counter} \geq 8 \) and \( \text{clk} \_\text{counter} \mod 4 = 0 : \)

\[
\text{roundkey}_{32}[\text{clk} \_\text{counter}] = \text{rotword}(\text{subword}(\text{roundkey}_{32}[\text{clk} \_\text{counter} - 5])) \oplus \text{roundkey}_{32}[\text{clk} \_\text{counter} - 8] \oplus \text{Rcon}
\]

\( \text{rotword}(\text{subword}(\text{roundkey}_{32}[\text{clk} \_\text{counter} - 5])) \) is stored in \( RW \)

\( \text{roundkey}_{32}[\text{clk} \_\text{counter} - 8] \) is stored in \( W0 \).

We give examples for each case.

- When \( \text{clk} \_\text{counter} = 0 \), the first \( \text{roundkey}_{32} \) is MAPed from the first word of cipherkey. \( \text{roundkey}_{32}[0] = \text{KA}(0) \). (Case a)

- When \( \text{clk} \_\text{counter} = 1 \), the second \( \text{roundkey}_{32} \) is MAPed from the second word of cipherkey. \( \text{roundkey}_{32}[1] = \text{KA}(1) \). \( \text{KA}(0) \) is moved to \( W7 \). In the mean time, \( \text{KA}(0) \) finished the first part of subword and is stored in \( SA \). (Case a)

  ......  

- When \( \text{clk} \_\text{counter} = 8 \), \( \text{KA}(3) \) is moved in to \( RW \), which means

\[
\text{KA}(3) = \text{rotword}(\text{subword}(\text{KA}(3))). \text{Now} \ \text{roundkey}_{32}[8] = \text{KA}(4) = \text{KA}(3) \oplus \text{KA}(0) \oplus \text{Rcon}. \text{(Case c)}
\]

56
### Table 4.4: Key128 Roundkey Sequence

<table>
<thead>
<tr>
<th>clk_counter</th>
<th>a</th>
<th>b</th>
<th>c</th>
<th>d</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
</tr>
<tr>
<td>1</td>
<td>KA(4)</td>
<td>KA(5)</td>
<td>KA(6)</td>
<td>KA(7)</td>
</tr>
<tr>
<td>2</td>
<td>KB(0)</td>
<td>KB(1)</td>
<td>KB(2)</td>
<td>KB(3)</td>
</tr>
<tr>
<td>3</td>
<td>KB(4)</td>
<td>KB(5)</td>
<td>KB(6)</td>
<td>KB(7)</td>
</tr>
<tr>
<td>4</td>
<td>w7</td>
<td>w6</td>
<td>w5</td>
<td>w4</td>
</tr>
<tr>
<td>5</td>
<td>w3</td>
<td>w2</td>
<td>w1</td>
<td>w0</td>
</tr>
<tr>
<td>6</td>
<td>SA</td>
<td>SB</td>
<td>SC</td>
<td>SD</td>
</tr>
<tr>
<td>7</td>
<td>RW</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
• when \( \text{clk	extunderscore counter} = 9 \), \( KA(4) \) is put into \( W7 \) and \( SA \) (after finishes the first part of subword). \( \text{roundkey}_{32}[9] = KA(5) = KA(4) \oplus KA(1) \). **(Case b)**

......

**Key192**

When key size is 192 bits, the encryptor round count is 12. Block \( A \) and block \( B \) need 104 \( \text{roundkey}_{32} \). Cipherkey size does not affect the function entities. So it shares the same subword, rotword and xorcon as in Figure (4.11). However, due to key size, the structure becomes more complex. When key size is 192, the keyschedule round cycle is six while the encryptor cycle is still four. This cycle difference requires extra treatment for the input of subword. We can see in Figure (4.11) that, when key size is 128 bits, the input of subword is the roundkey. But when key size is 192 bits, the input of subword is classified into three cases. We use multiplexer \( \text{mul1} \) in Figure (4.12) to choose the input from case \( x \), \( y \) and \( z \).

![Figure 4.12: Architecture of Keyschedule 192](image-url)
Table 4.5: Key192 Roundkey Sequence

<table>
<thead>
<tr>
<th>cipherkey</th>
<th>KA(0)</th>
<th>KA(1)</th>
<th>KA(2)</th>
<th>KA(3)</th>
<th>KB(0)</th>
<th>KB(1)</th>
<th>KB(2)</th>
<th>KB(3)</th>
<th>KA(4)</th>
<th>KA(5)</th>
<th>KB(4)</th>
<th>KB(5)</th>
<th>KB(6)</th>
<th>KB(7)</th>
<th>KB(8)</th>
<th>KB(9)</th>
<th>KA(10)</th>
<th>KA(11)</th>
<th>KA(12)</th>
<th>KA(13)</th>
<th>KA(14)</th>
<th>KA(15)</th>
</tr>
</thead>
<tbody>
<tr>
<td>cipherkey6</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td>KB(0)</td>
<td>KB(1)</td>
<td>KB(2)</td>
<td>KB(3)</td>
<td>KA(4)</td>
<td>KA(5)</td>
<td>KB(4)</td>
<td>KB(5)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>mul2</td>
<td>a</td>
<td>f</td>
<td>b</td>
<td>a</td>
<td>d</td>
<td>c</td>
<td>e</td>
<td>b</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>clk_counter</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>11</td>
<td>12</td>
<td>13</td>
<td>14</td>
<td>15</td>
<td>16</td>
<td>17</td>
<td>24</td>
<td>30</td>
<td>31</td>
<td></td>
</tr>
<tr>
<td>roundkey32</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td>KB(0)</td>
<td>KB(1)</td>
<td>KB(2)</td>
<td>KB(3)</td>
<td>KA(4)</td>
<td>KA(5)</td>
<td>KA(6)</td>
<td>KA(7)</td>
<td>KB(8)</td>
<td>KB(9)</td>
<td>KA(10)</td>
<td>KA(11)</td>
<td>KA(12)</td>
<td>KA(13)</td>
<td>KA(14)</td>
<td>KA(15)</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>w13</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td>KB(0)</td>
<td>KB(1)</td>
<td>KB(2)</td>
<td>KB(3)</td>
<td>KA(4)</td>
<td>KA(5)</td>
<td>KA(6)</td>
<td>KA(7)</td>
<td>KB(8)</td>
<td>KB(9)</td>
<td>KB(10)</td>
<td>KB(11)</td>
<td>KB(12)</td>
<td>KB(13)</td>
<td>KB(14)</td>
<td>KB(15)</td>
<td></td>
<td></td>
</tr>
<tr>
<td>w12</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td>KB(0)</td>
<td>KB(1)</td>
<td>KB(2)</td>
<td>KB(3)</td>
<td>KA(4)</td>
<td>KA(5)</td>
<td>KA(6)</td>
<td>KA(7)</td>
<td>KB(8)</td>
<td>KB(9)</td>
<td>KB(10)</td>
<td>KB(11)</td>
<td>KB(12)</td>
<td>KB(13)</td>
<td>KB(14)</td>
<td>KB(15)</td>
<td></td>
<td></td>
</tr>
<tr>
<td>w11</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td>KB(0)</td>
<td>KB(1)</td>
<td>KB(2)</td>
<td>KB(3)</td>
<td>KA(4)</td>
<td>KA(5)</td>
<td>KA(6)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>w10</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td>KB(0)</td>
<td>KB(1)</td>
<td>KB(2)</td>
<td>KB(3)</td>
<td>KA(4)</td>
<td>KA(5)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>w9</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td>KB(0)</td>
<td>KB(1)</td>
<td>KB(2)</td>
<td>KB(3)</td>
<td>KA(4)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>w8</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td>KB(0)</td>
<td>KB(1)</td>
<td>KB(2)</td>
<td>KB(3)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>w7</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td>KB(0)</td>
<td>KB(1)</td>
<td>KB(2)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>w6</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td>KB(0)</td>
<td>KB(1)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>w5</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td>KB(0)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>w4</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>w3</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>w2</td>
<td>KA(0)</td>
<td>KA(1)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>w1</td>
<td>KA(0)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>w0</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>SA</td>
<td></td>
<td>KA(5)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>SB</td>
<td></td>
<td></td>
<td>KA(5)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>SC</td>
<td></td>
<td></td>
<td></td>
<td>KA(5)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>SD</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>KA(5)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>RW</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>KA(5)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Table (4.5) shows the value of each register in Figure (4.12) during the first 32 clock cycles. Row \textit{mul1} and \textit{mul2} correspond to these two multiplexers. In the following section, we explain the two multiplexers (\textit{mul1} and \textit{mul2}) in Figure (4.12).

1. Multiplexer 1 (\textit{mul1})

\textbf{Case x:} (\texttt{clk\_counter} = 6 or 10) : $SA = MAP(\text{cipherkey}6)$; (Cipherkey6 is the sixth 32-bit of the 192-bit cipherkey. We can see from Table (4.5) that, when \texttt{clk\_counter} = 10, $KA(5)$ must finish subword and rotword, so that we can produce $KA(6)$, where $KA(6) = KA(5) \oplus \text{rotword}(\text{subword}(KA(5))) \oplus Rcon$. Because it needs five clock cycles to complete $\text{rotword}(\text{subword}(KA(5)))$, $KA(5)$ must be shifted into $SA$ when \texttt{clk\_counter} = 6. That’s why we need cipherkey6 to provide $KA(5)$.)

\textbf{Case y:} (\texttt{clk\_counter mod 24} = 6 or 10) and (\texttt{clk\_counter} $> 23$) : $SA = W_{11} \oplus W_2 \oplus W_3$; (One example is when \texttt{clk\_counter} = 30, $KA(17)$ needs to be shifted into $SA$. Because $KA(17)$ is not stored in any register, we need to calculate it from the existing data. $KA(17) = KA(16) \oplus KA(11) = KA(15) \oplus KA(10) \oplus KA(11)$. $KA(15), KA(10)$ and $KA(11)$ are stored in register $W_{11}, W_2$ and $W_3$, respectively.)

\textbf{Case z:} (\texttt{clk\_counter mod 24} = 0 or 20) and (\texttt{clk\_counter} $\neq 0$) : $SA = W_{13}$;

This is why we need multiplexer \textit{mul1} to differentiate three cases for the input of subword when key size is 192 bits.

2. Multiplexer 2 (\textit{mul2})

- Generate $\text{roundkey}_{32s}$ from cipherkey directly, $KA(i), KB(i), i = 0,...,5$
**Case a:** $(\text{clk}\_\text{counter} < 10 \text{ or } 12, 13) : \text{roundkey}_{32}[\text{clk}\_\text{counter}] = \text{MAP}(\text{cipherkey}[\text{clk}\_\text{counter}])$ (Because $\text{roundkey}_{32}$ is generated when it is needed in encryption, the arrangement of row $\text{cipherkey}$ in Table (4.5) is determined by encryptor)

• Generate $\text{roundkey}_{32}$s $\text{KA}(i)$, $\text{KB}(i)$, $i \geq 8$ and $i \text{ mod } 6 \neq 0$ (Because of the round cycle difference between $\text{keyschedule}$ and encryptor, we need to classify it into three sub-cases, based on the value $(\text{clk}\_\text{counter} \text{ mod } 4)$. Table (4.5) shows these three sub-cases (b, c and d), where $\text{roundkey}_{32}$s are generated by the formula $\text{KA}/B(i) = \text{KA}/B(i-1) \oplus \text{KA}/B(i-6)$.)

**Case b:** $(\text{clk}\_\text{counter} \text{ mod } 4 = 3 \text{ or } 0)$ and $(\text{clk}\_\text{counter} > 7)$ :
\[
\text{roundkey}_{32}[\text{clk}\_\text{counter}] = \text{roundkey}_{32}[\text{clk}\_\text{counter} - 1] \oplus \text{roundkey}_{32}[\text{clk}\_\text{counter} - 10];
\]
$\text{roundkey}_{32}[\text{clk}\_\text{counter} - 1]$ is stored in register W13;
$\text{roundkey}_{32}[\text{clk}\_\text{counter} - 10]$ is stored in register W4;

**Case c:** $(\text{clk}\_\text{counter} \text{ mod } 4 = 2)$ and $(\text{clk}\_\text{counter} > 7)$ :
\[
\text{roundkey}_{32}[\text{clk}\_\text{counter}] = \text{roundkey}_{32}[\text{clk}\_\text{counter} - 1] \oplus \text{roundkey}_{32}[\text{clk}\_\text{counter} - 14];
\]
$\text{roundkey}_{32}[\text{clk}\_\text{counter} - 1]$ is stored in register W13;
$\text{roundkey}_{32}[\text{clk}\_\text{counter} - 14]$ is stored in register W0;

**Case d:** $(\text{clk}\_\text{counter} \text{ mod } 4 = 1)$ and $(\text{clk}\_\text{counter} \text{ mod } 24 > 7)$ :
\[
\text{roundkey}_{32}[\text{clk}\_\text{counter}] = \text{roundkey}_{32}[\text{clk}\_\text{counter} - 5] \oplus \text{roundkey}_{32}[\text{clk}\_\text{counter} - 14];
\]
$\text{roundkey}_{32}[\text{clk}\_\text{counter} - 5]$ is stored in register W9;
$\text{roundkey}_{32}[\text{clk}\_\text{counter} - 14]$ is stored in register W0;

• Generate $\text{roundkey}_{32}$s $\text{KA}(i)$, $\text{KB}(i)$, $i \geq 8$ and $i \text{ mod } 6 = 0$ (The sub-cases are
caused by the same reason as the above case. The following two sub-cases are based on the formula $KA/B(i) = \text{rotword}(\text{subword}(KA/B(i-1))) \oplus KA/B(i-6) \oplus Rcon$)

**Case e:** ($\text{clk\_counter mod 24} = 0$ or $4$) and ($\text{clk\_counter} > 7$):
\[
\text{roundkey}_{32}[\text{clk\_counter}] = \text{rotword}(\text{subword}(\text{roundkey}_{32}[\text{clk\_counter} - 1])) \\
\oplus \text{roundkey}_{32}[\text{clk\_counter} - 14] \oplus RC;
\]
\[
\text{rotword}(\text{subword}(\text{roundkey}_{32}[\text{clk\_counter} - 1]))\text{ is stored in register RW;}
\]
\[
\text{roundkey}_{32}[\text{clk\_counter} - 14]\text{ is stored in register W0;}
\]

**Case f:** ($\text{clk\_counter mod 24} = 10$ or $14$):
\[
\text{roundkey}_{32}[\text{clk\_counter}] = \text{rotword}(\text{subword}(\text{roundkey}_{32}[\text{clk\_counter} - 1])) \\
\oplus \text{roundkey}_{32}[\text{clk\_counter} - 10] \oplus RC;
\]
\[
\text{rotword}(\text{subword}(\text{roundkey}_{32}[\text{clk\_counter} - 1]))\text{ is stored in register RW;}
\]
\[
\text{roundkey}_{32}[\text{clk\_counter} - 10]\text{ is stored in register W4;}
\]

Table (4.5) lists instances for each case for both $mul1$ and $mul2$ during the first 32 clock cycles.

- When $\text{clk\_counter} = 0$, $\text{roundkey}_{32}[0] = KA(0)$, which is MAPed from cipherkey(Case a);
  ......  

- When $\text{clk\_counter} = 6$, $\text{roundkey}_{32}[6] = KB(2)$, which is MAPed from cipherkey(Case a). $SA = KA(5)$, where $KA(5)$ is MAPed from cipherkey6 and shifted into $SA$ after finished subword’s first part(Case x);
  ......  

- When $\text{clk\_counter} = 10$, $\text{roundkey}_{32}[10] = KA(6) = \text{rotword}(\text{subword}(KA(5))) \oplus KA(0) \oplus Rcon$ (Case f);
When $clk\_counter = 11$, $\text{roundkey}_{32}[11] = KA(7) = KA(6) \oplus KA(1)$ (Case b);

......

When $clk\_counter = 16$, $\text{roundkey}_{32}[16] = KA(8) = KA(7) \oplus KA(2)$ (Case d);

......

When $clk\_counter = 17$, $\text{roundkey}_{32}[17] = KA(9) = KA(8) \oplus KA(3)$ (Case c);

......

When $clk\_counter = 24$, $\text{roundkey}_{32}[24] = KA(12) = \text{rotword}(\text{subword}(KA(11))) \oplus KA(6) \oplus Rcon$ (Case e). $SA = KB(11)$, where $KB(11)$ is shifted into $SA$ after finished subword’s first part (Case z);

......

When $clk\_counter = 30$, $SA = KA(17) = KA(16) \oplus KA(11) = (KA(15) \oplus KA(10)) \oplus KA(11)$ (Case y);

......

Key256

Keyschedule 256 is slightly different from keyschedule 128. The keyschedule round cycle is eight clock cycles. As shown in Figure (4.13):

There are four different cases to generate $\text{roundkey}_{32}$s:

**Case a:** ($clk\_counter < 16$) :

$$\text{roundkey}_{32}[clk\_counter] = MAP(cipherkey[clk\_counter]);$$

**Case b:** ($clk\_counter \geq 16$) and ($clk\_counter \ mod \ 4 \neq 0$) :

$$\text{roundkey}_{32}[clk\_counter] = \text{roundkey}_{32}[clk\_counter - 1] \oplus \text{roundkey}_{32}[clk\_counter - 16];$$
$GF((2^4)^2)$

Figure 4.13: Architecture of Keyschedule 256

$\text{roundkey}_{32}[\text{clk\_counter} - 1]$ is stored in register W15;

$\text{roundkey}_{32}[\text{clk\_counter} - 16]$ is stored in register W0;

**Case c:** $(\text{clk\_counter} \geq 16)$ and $(\text{clk\_counter} \mod 8 = 0)$:

$\text{roundkey}_{32}[\text{clk\_counter}] = \text{rotword}(\text{subword}(\text{roundkey}_{32}[\text{clk\_counter} - 5]))\text{RW}$

$\oplus \text{roundkey}_{32}[\text{clk\_counter} - 16] \oplus \text{RC};$

$\text{rotword}(\text{subword}(\text{roundkey}_{32}[\text{clk\_counter} - 5]))$ is stored in register RW;

$\text{roundkey}_{32}[\text{clk\_counter} - 16]$ is stored in register W0;

**Case d:** $(\text{clk\_counter} \geq 16)$ and $(\text{clk\_counter} \mod 4 = 0)$ and $(\text{clk\_counter} \mod 8 \neq 0)$:

$\text{roundkey}_{32}[\text{clk\_counter}] = \text{subword}(\text{roundkey}_{32}[\text{clk\_counter} - 5]) \oplus \text{roundkey}_{32}[\text{clk\_counter} - 16];$

$\text{subword}(\text{roundkey}_{32}[\text{clk\_counter} - 5])$ is stored in register RW;

$\text{roundkey}_{32}[\text{clk\_counter} - 16]$ is stored in register W0.

This is why we change the sequence of subword and rotword. Putting rotword before subword saves one multiplexer when key size is 256 bits.

Table (4.4.4) gives instances for each case.
• When $clk\_counter = 0$, $roundkey_{32}[0] = KA(0)$, where $KA(0)$ is MAPed from cipherkey (Case a);

......

• When $clk\_counter = 16$, $roundkey_{32}[16] = KA(8) = rotword(subword(KA(7))) \oplus KA(0) \oplus Rcon$ (Case c);

• When $clk\_counter = 17$, $roundkey_{32}[17] = KA(9) = KA(8) \oplus KA(1)$ (Case b);

......

• When $clk\_counter = 24$, $roundkey_{32}[12] = KA(12) = subword(KA(11)) \oplus KA(4)$ (Case d);

......
<table>
<thead>
<tr>
<th>cipherkey</th>
<th>KA(0)</th>
<th>KB(7)</th>
<th>KA(8)</th>
<th>KA(9)</th>
<th>KA(10)</th>
<th>KA(11)</th>
<th>KB(8)</th>
<th>KA(12)</th>
<th>KB(12)</th>
</tr>
</thead>
<tbody>
<tr>
<td>mul</td>
<td>a</td>
<td>c</td>
<td>b</td>
<td>c</td>
<td>d</td>
<td>d</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>clk_reg</td>
<td>0</td>
<td>15</td>
<td>16</td>
<td>17</td>
<td>18</td>
<td>19</td>
<td>20</td>
<td>24</td>
<td>28</td>
</tr>
<tr>
<td>roundkey32</td>
<td>KA(0)</td>
<td>KB(7)</td>
<td>KA(8)</td>
<td>KA(9)</td>
<td>KA(10)</td>
<td>KA(11)</td>
<td>KB(8)</td>
<td>KA(12)</td>
<td>KB(12)</td>
</tr>
<tr>
<td>w15</td>
<td>KB(6)</td>
<td>KB(7)</td>
<td>KA(8)</td>
<td>KA(9)</td>
<td>KA(10)</td>
<td>KA(11)</td>
<td>KB(11)</td>
<td>KA(15)</td>
<td></td>
</tr>
<tr>
<td>w14</td>
<td>KB(5)</td>
<td>KB(6)</td>
<td>KB(7)</td>
<td>KA(8)</td>
<td>KA(9)</td>
<td>KA(10)</td>
<td>KB(10)</td>
<td>KA(14)</td>
<td></td>
</tr>
<tr>
<td>w13</td>
<td>KB(4)</td>
<td>KB(5)</td>
<td>KB(6)</td>
<td>KB(7)</td>
<td>KA(8)</td>
<td>KA(9)</td>
<td>KB(9)</td>
<td>KA(13)</td>
<td></td>
</tr>
<tr>
<td>w12</td>
<td>KA(7)</td>
<td>KB(4)</td>
<td>KB(5)</td>
<td>KB(6)</td>
<td>KB(7)</td>
<td>KA(8)</td>
<td>KB(8)</td>
<td>KA(12)</td>
<td></td>
</tr>
<tr>
<td>w11</td>
<td>KA(6)</td>
<td>KA(7)</td>
<td>KB(4)</td>
<td>KB(5)</td>
<td>KB(6)</td>
<td>KB(7)</td>
<td>KA(11)</td>
<td>KB(11)</td>
<td></td>
</tr>
<tr>
<td>w0</td>
<td></td>
<td>KA(0)</td>
<td>KA(1)</td>
<td>KA(2)</td>
<td>KA(3)</td>
<td>KB(0)</td>
<td>KA(4)</td>
<td>KB(4)</td>
<td></td>
</tr>
<tr>
<td>SA</td>
<td>KB(6)</td>
<td>KB(7)</td>
<td>KA(8)</td>
<td>KA(9)</td>
<td>KA(10)</td>
<td>KA(11)</td>
<td>KB(11)</td>
<td>KA(15)</td>
<td></td>
</tr>
<tr>
<td>SB</td>
<td>KB(5)</td>
<td>KB(6)</td>
<td>KB(7)</td>
<td>KA(8)</td>
<td>KA(9)</td>
<td>KA(10)</td>
<td>KB(10)</td>
<td>KA(14)</td>
<td></td>
</tr>
<tr>
<td>SC</td>
<td>KB(4)</td>
<td>KB(5)</td>
<td>KB(6)</td>
<td>KB(7)</td>
<td>KA(8)</td>
<td>KA(9)</td>
<td>KB(9)</td>
<td>KA(13)</td>
<td></td>
</tr>
<tr>
<td>SD</td>
<td>KA(7)</td>
<td>KB(4)</td>
<td>KB(5)</td>
<td>KB(6)</td>
<td>KB(7)</td>
<td>KA(8)</td>
<td>KB(8)</td>
<td>KA(12)</td>
<td></td>
</tr>
<tr>
<td>RW</td>
<td>KA(6)</td>
<td>KA(7)</td>
<td>KB(4)</td>
<td>KB(5)</td>
<td>KB(6)</td>
<td>KB(7)</td>
<td>KA(11)</td>
<td>KB(11)</td>
<td></td>
</tr>
</tbody>
</table>
Chapter 5
Implementation Performance And Comparison

Literature regarding hardware implementation of AES have been published. The comparison tables listed in the literatures are synthesized by various design tools on different FPGA chips. Although the difficulty of comparison about FPGA implementations was reported, there is still no proved measure to get a real fair comparison among different architectures. Even for the devices from the same company (Xilinx), different families use different technology which leads to different frequency. For example, a slice in Virtex 5 has four LUTs (Look Up Tables) instead of two in previous families [6], which leads to different area cost (number of slice).

Since AES standard includes encryption, decryption and keyschedule with three key sizes, it is up to the designers to choose which function they would like to realize. Obviously, more functions need more resource. Hence it is reasonable to compare architectures providing similar functions.

In this chapter, we first classify previous AES architectures into different categories and then use tables to compare their performance.

1. **Encryption and Decryption**: AES architectures include encryption and decryption units. In [1, 3, 5, 8, 9, 17, 33, 22, 27, 28], they provide functions for both encryption and decryption. As a symmetric algorithm, encryption and decryption share same units. With the parameter indicated by the user, it executes encryption or decryption exclusively. Some other AES architectures only focus on encryption [2, 4, 11, 12, 25, 26, 31].

2. **Key Sizes**: AES uses data size of 128 bits but offers three key sizes (128, 192 and 256 bits). 128-bit is the most common choice in the reported designs [3, 4, 5, 9, 12,
However, as reconfigurability is one of most important factors for FPGA implementations, options for all three key sizes are included in a number of designs [1, 2, 17, 22].

3. **Key Expansion:** The keyschedule in AES generates roundkeys for each round. The roundkeys can be previously calculated and stored in memory [1, 2, 3, 5, 22, 27]. This method results in an acceptable initial delay when the data size is relatively large compared with the key size. A more flexible approach is the on-the-fly keyschedule [4, 9, 12, 17, 33, 26, 28, 34] which conducts an on-line calculation of roundkeys for each 128-bit data block. On-the-fly keyschedule affects the general frequency as both the data unit and key unit share the same clock, especially when it is employed for all the three key sizes. There are also some architectures that do not include keyschedule [8, 11, 13, 31].

4. **BRAM based S-Box and combinational logic based S-Box:** Different approaches for S-Box implementation have obvious impact on AES performance. BRAM based approaches [5, 8, 13, 17, 26, 27] are preferred when low area cost is required. It saves the slices required in combinational logic based approach. Hence it is not reasonable to compare the ratio of throughput/slice between BRAM-based S-Box and combinational logic based S-Box [1, 2, 9, 11, 12, 13, 33, 22, 25, 28, 31, 34]. Good et al. [9] used a term (32bits/slice) to convert number of BRAMs to number of slices required to implement the equivalent distributed memory. But, the estimates vary between 8 and 32 bits/slice depending on the functionality required. In this thesis, we only compare our design throughput/slice with non-BRAM implementations.

The above four categories summarize the majors factors affecting the performance in hardware implementation of AES. Table 5.1 compares the performance of the architectures
Table 5.1: Comparisons of BRAMs Based AES Architecture

<table>
<thead>
<tr>
<th>Design</th>
<th>Device</th>
<th>Frequency (MHz)</th>
<th>Slices</th>
<th>BRAMs</th>
<th>Throughput (Mbps)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Samanta et al. [27]</td>
<td>VIRTEX2 2V6000</td>
<td>76.699</td>
<td>1051</td>
<td>11</td>
<td>111.56</td>
</tr>
<tr>
<td>Chodowiec [8]</td>
<td>VIRTEX XCV1000</td>
<td>95</td>
<td>12600</td>
<td>80</td>
<td>12100</td>
</tr>
<tr>
<td>Chodowiec et al. [5]</td>
<td>SPARTAN2 XC2S30</td>
<td>60</td>
<td>222</td>
<td>3</td>
<td>166</td>
</tr>
<tr>
<td>Chang et al. [4]</td>
<td>SPARTAN2 XC2S30</td>
<td>38.50</td>
<td>200</td>
<td>2</td>
<td>38</td>
</tr>
<tr>
<td>Saggese et al. [26]</td>
<td>VIRTEXE XCV2000E</td>
<td>142</td>
<td>648</td>
<td>10</td>
<td>1820</td>
</tr>
<tr>
<td>McLoone et al. [17]</td>
<td>VIRTEXE XCV3200E</td>
<td>54.35</td>
<td>2222</td>
<td>100</td>
<td>6956</td>
</tr>
</tbody>
</table>

using BRAMs. Table 5.2 compares the architectures without BRAMs. Table 5.3 summarizes the functions provided by these architectures.

Among the architectures using BRAMs, Chodowiec [8] employed fully unrolled sub-pipelining achieving the highest throughput with the largest resource cost. Recently, Chodowiec et al. made a compact design costing 222 slices in a Spartan2 device offering a throughput of 166Mbps [5].

In our proposed architecture, we do not use BRAM. In Table 5.2, it can be seen that Good et al. achieves the highest throughput of 25.107 Gbps on Spartan3 XC3S2000. It employs fully parallel loop unrolled architecture which calculates multiplicative inverse of each byte over composite field $GF((2^4)^2)$. It also gets the frequency of 196.1MHz. But it only deals with 128-bit key size and costs 17425 slices. Another fully unrolled architecture is proposed by Zhang et al. [34]. This design used number-of-gates-in-critical-path to place the pipeline cuts. It subpipelines a round into seven substages and achieves 21.556 Gbps with the throughput/slice ratio of 1.956.

Compared with the previous architectures, our design focuses on the low cost, non-
Table 5.2: Comparisons of Non-BRAMs Architectures

<table>
<thead>
<tr>
<th>Design</th>
<th>Device</th>
<th>Frequency (MHz)</th>
<th>Area (Slices)</th>
<th>Throughput (Mbps)</th>
<th>Mbps/Slice</th>
</tr>
</thead>
<tbody>
<tr>
<td>Good et al. [9]</td>
<td>SPARTAN3 XC3S2000</td>
<td>196.1</td>
<td>17425</td>
<td>25107</td>
<td>1.441</td>
</tr>
<tr>
<td>Zhang et al. [34]</td>
<td>VIRTEXE XCV1000E</td>
<td>168.4</td>
<td>11022</td>
<td>21560</td>
<td>1.956</td>
</tr>
<tr>
<td>Jarvinen et al. [12]</td>
<td>VIRTEX XC2V2000</td>
<td>139.1</td>
<td>10750</td>
<td>17800</td>
<td>1.656</td>
</tr>
<tr>
<td>Lemsitzer et al. [13]</td>
<td>VIRTEX4 FX100</td>
<td>110</td>
<td>7300</td>
<td>3500</td>
<td>0.479</td>
</tr>
<tr>
<td>Bulens et al. [3]</td>
<td>SPARTAN3</td>
<td>150</td>
<td>1800</td>
<td>1700</td>
<td>0.944</td>
</tr>
<tr>
<td>Standaert et al. [31]</td>
<td>VIRTEXE XCV1000E</td>
<td>167</td>
<td>1767</td>
<td>2085</td>
<td>1.180</td>
</tr>
<tr>
<td>Pramstaller et al. [22]</td>
<td>VIRTEXE XCV1000E</td>
<td>161</td>
<td>1125</td>
<td>215</td>
<td>0.191</td>
</tr>
<tr>
<td><strong>Our Design</strong></td>
<td>VIRTEX2 E XCV2V2000E</td>
<td>277.4</td>
<td>523</td>
<td>807</td>
<td>1.543</td>
</tr>
<tr>
<td>Alam et al. [1]</td>
<td>VIRTEXE XCV1000E</td>
<td>135</td>
<td>510</td>
<td>432</td>
<td>0.847</td>
</tr>
</tbody>
</table>
BRAM implementations. There were not many literatures in the low-cost AES designs. Pramstaller et al. proposed a compact design costing 1125 slices in [22]. Its pre-calculate key generator can deal with three key sizes. Standaert et al. [31] made a single encryption architecture with 1767 slices which provides Gbps-level throughput. Alam et al. [1] reported a design including encryption, decryption and on-the-fly keyschedule for 3 key sizes, which achieves 432 Mbps with the frequency of 135MHz.

Compared with similar previous works, our proposed low-cost and efficient AES architecture only uses 523 slices, and achieves the throughput of 806Mbps when implemented in Virtex 2 XCV2V2000. The throughput/area ratio is 1.543, which is relatively high in low-cost designs (< 2000 slices). The proposed design can be efficiently applied in computing-resources restricted environments, such as wireless devices and embedded devices.
<table>
<thead>
<tr>
<th>Design</th>
<th>Encryption</th>
<th>Decryption</th>
<th>KeySchedule</th>
<th>KeySize</th>
<th>BRAMs</th>
</tr>
</thead>
<tbody>
<tr>
<td>Samanta [27]</td>
<td>•</td>
<td>•</td>
<td>Pre-Calculate</td>
<td>128</td>
<td>•</td>
</tr>
<tr>
<td>Chodowiec [8]</td>
<td>•</td>
<td>•</td>
<td></td>
<td></td>
<td>•</td>
</tr>
<tr>
<td>Chodowiec et al. [5]</td>
<td>•</td>
<td>•</td>
<td>Pre-Calculate</td>
<td>128</td>
<td>•</td>
</tr>
<tr>
<td>Satoh et al. [28]</td>
<td>•</td>
<td>•</td>
<td>On-The-Fly</td>
<td>128</td>
<td></td>
</tr>
<tr>
<td>Hodjat et al. [11]</td>
<td>•</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Jarvinen et al. [12]</td>
<td>•</td>
<td></td>
<td>On-The-Fly</td>
<td>128</td>
<td></td>
</tr>
<tr>
<td>Good et al. [9]</td>
<td>•</td>
<td>•</td>
<td>On-The-Fly</td>
<td>128</td>
<td></td>
</tr>
<tr>
<td>Zhang et al. [34]</td>
<td>•</td>
<td></td>
<td>On-The-Fly</td>
<td>128</td>
<td></td>
</tr>
<tr>
<td>Chang et al. [4]</td>
<td>•</td>
<td></td>
<td>On-The-Fly</td>
<td>128</td>
<td>•</td>
</tr>
<tr>
<td>Pramstaller et al. [22]</td>
<td>•</td>
<td>•</td>
<td>Pre-Calculate</td>
<td>128/192/256</td>
<td></td>
</tr>
<tr>
<td>Saggese et al. [26]</td>
<td>•</td>
<td></td>
<td>On-The-Fly</td>
<td>128</td>
<td>•</td>
</tr>
<tr>
<td>Standaert et al. [31]</td>
<td>•</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>McLoone et al. [17]</td>
<td>•</td>
<td>•</td>
<td>On-The-Fly</td>
<td>128/192/256</td>
<td>•</td>
</tr>
<tr>
<td>Lemsitzer et al. [13]</td>
<td>•</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Bulens et al. [3]</td>
<td>•</td>
<td>•</td>
<td>Pre-Calculate</td>
<td>128</td>
<td>•</td>
</tr>
<tr>
<td>Alam et al. [1]</td>
<td>•</td>
<td></td>
<td>On-The-Fly</td>
<td>128/192/256</td>
<td></td>
</tr>
<tr>
<td><strong>Our Design</strong></td>
<td>•</td>
<td></td>
<td>On-The-Fly</td>
<td>128/192/256</td>
<td></td>
</tr>
</tbody>
</table>
Chapter 6

Conclusion

AES is an important and popular cryptographic algorithm to secure the information and data transmission. In this thesis, we propose a compact reconfigurable FPGA architecture for the AES implementation.

The 32-bit single round unit design results in low area cost, which makes it suitable for low-end devices. The combinational logic approach of AES implementation eliminates the need for BRAMs. Full composite field \( GF\left((2^4)^2\right) \) based design decreases hardware complexity of arithmetic operations in AES. We apply subpipelining technology in both encryptor and keyschedule modules to optimize the speed/area ratio, which achieves 1.543Mbps/Slice in Virtex 2 XCV2V2000. Besides, the capability to deal with three key sizes makes our design an efficient reconfigurable architecture of AES.

The throughput of our proposed design achieves 805.8Mbps. It requires less than a quarter of the resources of a Xilinx Spartan2 FPGA, which is one of the smallest FPGA devices. The performance comparison indicates that the proposed AES architecture achieves higher throughput than previous compact designs.

FIPS standard [20] provides an equivalent inverse cipher which switches the sequence of the four transformations in decryption round so that the encryption and decryption can share the same functions, such as the multiplicative inversion in subbytes. In our design, the encryption conducts shiftrows before subbytes. When implementing the equivalent inverse cipher, it only needs to switch the relative sequence of inv-mixcolumns and addroundkey. The positions of inv-shiftrows and inv-subbytes are not changed. The proposed design can be easily modified into an equivalent cipher.

In conclusion, the proposed compact and reconfigurable AES architecture has high throughput and low area cost, which is very useful in the computing restricted environment.
and wireless devices.
Bibliography


