Sakineh Sharifi ${ }^{1}$, Mohammad Esmaeildoust ${ }^{2}$, Mohammad Reza Taheri ${ }^{3}$, Keivan Navi ${ }^{4}$<br>${ }^{I}$ Department of Information Technology Engineering, Qom University, Qom, Iran<br>${ }^{2}$ Faculty of Marine Engineering, Khorramshahr University of Marine Science and Technology, Iran<br>${ }^{3}$ Nanotechnology and Quantum Computing Laboratory, Shahid Beheshti University, GC, Tehran, Iran<br>${ }^{4}$ Faculty of Electrical and Computer Engineering, Shahid Beheshti University GC, Tehran, Iran<br>m_doust@kmsu.ac.ir

Article history:
Received June 2014
Accepted July 2014
Available online August 2014


#### Abstract

Point multiplication is the most important part of elliptic curve cryptography which consumes remarkable time of implementation. Therefore efficiency enhancement of entire system is depending on efficiency of this part. Increasing the efficiency of the modular multiplication improve overall performance of the cryptographic system as it frequency used in some application such as Elliptic Curve Cryptography. By applying Residue Number System (RNS) to Montgomery multiplication as a method for modular multiplication, delay of modular multiplication will be reduced. Appropriate RNS moduli sets replace time consuming operation of multiplication by smaller operations. In this paper two balanced moduli set with proper dynamic range is presented and the efficiency of conversion from RNS to RNS which is the most time consuming part of the Montgomery modular multiplication will be increased.


Keywords: residue number system (RNS), RNS Montgomery, reverse converter, elliptic curve cryptography

## 1. Introduction

One of the best public key cryptography systems are RSA [1] and Elliptic curves cryptography [2] [3]. ECC can provide security equivalent RSA with smaller key and a fewer calculations. ECC systems are
face with several problems. As well as we know the important operation in ECC is point multiplication that include two basic operations are on field P: point doubling (P.D) and point addition (P.A). In order to improving performance of operation on P, Montgomery multiplication [4] can be employed. Montgomery multiplication performs modulo reduction without division. So as to increase efficiency of Montgomery, RNS Montgomery presented [5] [6]. RNS is a non-weighted number system and operations on large numbers are done over small moduli. As described in [5], two RNS bases are required to perform RNS Montgomery multiplication and conversion from one basis to another is needed in this process. In [7], RNS bases in the form of $2^{n}-c_{i}$ where $0 \leq c_{i}<2^{k / 2}$ are considered. In [8],[9], in one bases efficient RNS moduli set such as $\left\{2^{n}, 2^{n}-1,2^{n}+1,2^{n-2(n+1) / 2}+1,2^{n+2(n+1) / 2}+1\right\}$ [10], $\left\{2^{n}-1,2^{n}, 2^{n}+1,2^{2 n}+1-1\right\}[11]$ and $\left\{2^{n}-1,2^{n}, 2 n+1,2^{2 n}+1\right\}[12]$ are considered and for first bases the moduli set in the form of $2^{n}-c_{i}$ [7] are considered. Although in [7] it is proved that in the special case of $2^{n}-c_{i}$ where $0 \leq c_{i}<2^{k / 2}$, modulo reduction can be efficiently implemented, but using well formed moduli such as $2^{n}-1,2^{n}+1,2^{n}, 2^{n}-3,2^{n}+3$, where efficient modulo adder [13-15], multiplier and residue to binary converter and binary to residue converter are presented for these moduli by researcher [16-21], can leads to more efficient RNS Montgomery implementation.
In this paper, two efficient RNS bases $\left\{2^{n}, 2^{n}+1,2^{n-1}-1\right\}$ and $\left\{2^{n}+3,2^{n}-1,2^{n}-3\right\}$ are selected for RNS Montgomery multiplication and required conversions are efficiently implemented. The selected RNS bases are balanced and this lead to modulo channel with approximately same delay. Fast and efficient implementation of RNS Montgomery multiplication leads to meliorate the performance of P.A and P.D.

This paper organized as follow: Section 2 provides the related background of RNS and RNS Montgomery. In section 3, the RNS bases and efficient design of required conversions in RNS Montgomery regarding to proposed bases are presented. Performance of proposed work are evaluated and compared in section 4 and finally section 5 concludes the paper.

## 2. Related Background

This section in three subsections mathematical background of RNS Montgomery multiplication and RNS will be discussed, respectively.

### 2.1 Montgomery in RNS

RNS Montgomery multiplication is presented by [5]. In RNS Montgomery multiplication two RNS bases (moduli set) are required. Considering $X$ and $Y$ as two large integer number with RNS representation $\left(x_{1}, \ldots, x_{m}\right)$ and $\left(y_{1}, \ldots, y_{m}\right)$ in first basis $\left(p_{1}, \ldots, p_{m}\right)$ and in the second basis we consider $X$ and $Y$ as $\left(x_{1}^{\prime}, \ldots, x_{m}^{\prime}\right)$ and $\left(y_{1}^{\prime}, \ldots, y_{m}^{\prime}\right)$ in second basis $\left(p_{1}^{\prime}, \ldots, p_{m}^{\prime}\right)$.

Algorithm 1 shows the RNS Montgomery multiplication [7]. $M=p_{1} \times p_{2} \ldots \times p_{i}$ and $M^{\prime}=p_{1}^{\prime} \times p^{\prime}{ }_{2} \ldots \times p_{i}^{\prime}$ are the dynamic range for first and second bases, respectively. Consider $T$ which is $T<M<M^{\prime}$, so that $\operatorname{gcd}(T, M)=\operatorname{gcd}\left(T, M^{\prime}\right)=\operatorname{gcd}\left(M, M^{\prime}\right)=1$. Montgomery multiplication performs modulo reduction without division. The most important part of Montgomery algorithm is moduli selection that leads to design pretty faster converter and efficient arithmetic unit. Choosing moduli set is necessary to provide these features, so in this approach the RNS basis in order to achieve the high performance of multiplication is proposed.

According to algorithm 1, in the process of Montgomery multiplication conversion from one basis to another is required. Figure 1 shows the required conversions.


Figure 1. Overview of base extension in Montgomery algorithm
Conversion from one basis to another needs the mathematical background of the RNS. Therefore in the following the related background of the RNS is detailed.

### 2.2 Residue Number System

The RNS is an unconventional number system which is defined in terms of relatively-prime moduli set $\left\{m_{1}, m_{2}, \ldots, m_{n}\right\}$ that is $\operatorname{gcd}\left(m_{i}, m_{j}\right)=1$ for $i \neq j$. A weighted number $X$ can be represented as $X=\left(x_{1}\right.$, $x_{2}, \ldots, x_{n}$ ), where

$$
\begin{equation*}
x_{i}=X \bmod m_{i}=|X|_{m_{i}}, 0 \leq x_{i}<m_{i} \tag{1}
\end{equation*}
$$

Such a representation is unique for any integer $X$ in the range $[0, M-1]$, where $M$ is the dynamic range of the moduli set $\left\{m_{1}, m_{2}, \ldots, m_{n}\right\}$ which is equal to the product of mi terms $\left(M=\prod_{i=1}^{n} m_{i}\right)[22]$.

RNS includes three main parts: forward converter, reverse converter and arithmetic operation [23]. Several algorithms for reverse conversion can be employed such as Chinese reminder Theorem (CRT), Mixed Radix Conversion (MRC) or the modified version of these algorithms [23]. Since MRC is used in the required conversion in this paper, the mathematical details of MRC is discussed in the following.

In MRC [24], [25], the number $X$ can be calculated from residues by:

$$
\begin{equation*}
X=v_{n} \prod_{i=1}^{n-1} P_{i}+\ldots+v_{3} P_{2} P_{1}+v_{2} P_{1}+v_{1} \tag{2}
\end{equation*}
$$

The coefficients $\left\{v_{1}, v_{2}, \ldots, v_{n}\right\}$ can be obtained from residues as follows:

$$
\begin{align*}
& v_{1}=x_{1}  \tag{3}\\
& v_{2}=\left.\left.\left|\left(x_{2}-v_{1}\right)\right| P_{1}^{-1}\right|_{P_{2}}\right|_{P_{2}}  \tag{4}\\
& v_{3}=\left.\left.\left|\left(\left(x_{3}-v_{1}\right)\left|P_{1}^{-1}\right|_{P_{3}}-v_{2}\right)\right| P_{2}^{-1}\right|_{P_{3}}\right|_{P_{3}} \tag{5}
\end{align*}
$$

In the general case

$$
\begin{equation*}
v_{n}=\left.\left.\left|\left(\left(\left(x_{n}-v_{1}\right)\left|P_{1}^{-1}\right|_{P_{n}}-v_{2}\right)\left|P_{2}^{-1}\right|_{P_{n}}-\cdots-v_{n-1}\right)\right| P_{n-1}^{-1}\right|_{P_{n}}\right|_{P_{n}} \tag{6}
\end{equation*}
$$

$\left|P_{i}^{-1}\right|_{P_{j}}$ denotes the multiplicative inverse of $P_{i}$ modulus $P_{j}$.

## 3. Selected RNS Bases

In order to increase the efficiency of Montgomery in RNS, efficient RNS bases are required. To achieve this, for first and second bases $\left\{2^{n}, 2^{n}+1,2^{n-1}-1\right\}$ and $\left\{2^{n}+3,2^{n}-1,2^{n}-3\right\}$ when $n$ is even, are considered as moduli sets, respectively. As mentioned before, conversion from one base to another is needed in the process of RNS Montgomery multiplication. In the following, the required conversion will be detailed.

### 3.1 Residue-to-binary conversion in first bases

By using MRC and considering $P_{1}=2^{n}, P_{2}=2^{n}+1$ and $P_{3}=2^{n-1}-1$ and residues $Z=\left(x_{1}, x_{2}, x_{3}\right)$, we have

$$
\begin{equation*}
X=v_{1}+v_{2} 2^{n}+v_{3}\left(2^{n}+1\right)\left(2^{n}\right) \tag{7}
\end{equation*}
$$

Where

$$
\begin{align*}
& v_{1}=x_{1}  \tag{8}\\
& v_{2}=\left.\left.\left|\left(x_{2}-x_{1}\right)\right| P_{1}^{-1}\right|_{P_{2}}\right|_{P_{2}}  \tag{9}\\
& v_{3}=\left.\left|\left(x_{3}-x_{1}\right)\right| P_{1}^{-1}\right|_{P_{3}}-\left.v_{2}| | P_{2}^{-1}\right|_{P_{3}} \|_{P_{3}} \tag{10}
\end{align*}
$$

Theorem 1: Required multiplicative inverses in Eq. (10), are $\left|P_{1}^{-1}\right|_{P_{3}}=2^{n-2},\left|P_{1}^{-1}\right|_{P_{2}}=-1$ and $\left|\left|P_{2}^{-1}\right|_{P_{3}}\right|=\sum_{i=0}^{(n / 2)-1} 2^{2 i}$.

Proof.

$$
\begin{aligned}
& \left|p_{1}^{-1}\right|_{p_{2}}=-1 \longrightarrow\left|2^{n} \times(-1)\right|_{2^{n}+1}=\left|-2^{n}\right|_{2^{n+1}}=1 \\
& \left|P_{1}^{-1}\right|_{P_{3}}=2^{n-2} \longrightarrow\left|2^{n} \times 2^{n-2}\right|_{2^{n-1}-1}=\left|2^{2 n-2}\right|_{2^{n-1}-1}=1 \\
& \left|\left|P_{2}^{-1}\right|_{P_{3}}\right|=\sum_{i=0}^{(n / 2)-1} 2^{2 i} \longrightarrow\left|\left(2^{n}+1\right) \times\left|P_{2}^{-1}\right|_{P_{3}}\right|_{2^{n-1}-1}=1 \longrightarrow\left|3 \times\left|P_{2}^{-1}\right|_{P_{3}}\right|_{2^{n-1}-1}=1 \\
& \longrightarrow\left|P_{2}^{-1}\right|_{P_{3}}=\left|\frac{1}{3}\right|_{2^{n-1}-1}=\left|\frac{2^{n}-1}{3}\right|_{2^{n-1}-1}=\sum_{i=0}^{(n / 2)-1} 2^{2 i}
\end{aligned}
$$

Using theorem $1, v_{2}$ can be calculated as:

$$
\begin{equation*}
v_{2}=\left|x_{1}-x_{2}\right|_{2^{n+1}} \tag{1}
\end{equation*}
$$

Based on equation of $v_{3}$ in MRC we have:

$$
\begin{equation*}
v_{3}=\left.\left|\left(x_{3}-x_{1}\right)\right| P_{1}^{-1}\right|_{P_{3}}-\left.v_{2}| | P_{2}^{-1}\right|_{P_{3}} \|_{P_{3}} \tag{12}
\end{equation*}
$$

Using theorem $1, v_{3}$ can be calculated as:

$$
\begin{equation*}
v_{3}=|\underbrace{\left(x_{3}-x_{1}\right) 2^{n-1}}_{Y}-\underbrace{v_{2}\left(2^{0}+2^{2}+\ldots+2^{n-2}\right)}_{K}|_{2^{n-1}-1} \tag{10}
\end{equation*}
$$

Lemma 1: In modulo $2^{n}-1$, multiplication of $n$-bit residue number $x$ by $2^{k}$ is equal to $k$-bit circular left shift residue number $x$ [26].

Lemma 2: In modulo $2^{n}-1$, the negative of residue number $x$ is obtained by one's complement of $x$, where $0 \leq x<2^{n}-1$ [26].

By using Lemma 1 and 2 we have:

$$
\begin{equation*}
v_{3}=|Y+\bar{K}|_{2^{n-1}-1} \tag{14}
\end{equation*}
$$

Where

$$
\begin{equation*}
\bar{K}=\left|\left(C L S\left(\bar{v}_{2}, 0\right)+\ldots \ldots+C L S\left(\bar{v}_{2}, n-2\right)\right)\right|_{2^{n-1}-1} \tag{15}
\end{equation*}
$$

$C L S(k, p)$ denotes $p$-bit left shift of $k$ [27]. For $Y$ calculation we have:

$$
\begin{equation*}
Y=|(x_{3, n-2} \ldots x_{3,0}-\underbrace{00 \ldots 0}_{n-2 \text { bit }} x_{1, n-1} \ldots x_{1,0}) 2^{n-2}|_{2^{n-1}-1} \tag{16}
\end{equation*}
$$

Based on Lemma 1 and 2, $Y$ can be rewritten as follows:


Figure 2. Hardware implementation of $X$ conversion in first basis

$$
\begin{align*}
& Y=|(x_{3,0} x_{3, n-2} \ldots x_{3,1}+\underbrace{\bar{x}_{1, n-1} \ldots \bar{x}_{1,0} \overbrace{11 \ldots 1}^{(n-2)}}_{2(n-1) \text { bit }})|_{2^{n-1}-1}  \tag{17}\\
& Y_{1}=\left|\left(x_{3,0} x_{3, n-2} \ldots x_{3,1}\right)\right|_{2^{n-1}-1} \tag{18}
\end{align*}
$$

$$
\begin{align*}
& Y_{2}=\left|\left(\bar{x}_{1, n-1} \ldots \bar{x}_{1,1}\right)\right|_{2^{n-1}-1}  \tag{19}\\
& Y_{3}=|(\bar{x}_{1,0} \overbrace{11 \ldots 1}^{(n-2)})|_{2^{n-1}-1} \tag{20}
\end{align*}
$$

Finally after calculation of coefficients $\left\{v_{1}, v_{2}, v_{3}\right\}$ we came back to general equation of $X$. Thus:

$$
\begin{align*}
& X=v_{1}+v_{2} 2^{n}+v_{3} 2^{2 n}+\underbrace{00 \ldots 0}_{n \text { bit }} v_{3} 2^{n}  \tag{21}\\
& X=\underbrace{v_{3} v_{1}}_{(2 n-1) \text { bit }}+v_{2} 2^{n}+\underbrace{0 \ldots 0}_{2 n \text { bit }} v_{3} 2^{2 n}  \tag{22}\\
& X=\underbrace{v_{3} \overbrace{v_{3} v_{1}}^{(2 n-1)-\text { bit }}}_{k_{1}}+\underbrace{v_{2} 2^{n}}_{k_{2}} \tag{23}
\end{align*}
$$

Hardware implementation of $X$ in Eq. (23) is shown in figure 2. Operand preparation unit (OPU) 1, 2 and 3 provides the required negation and shift according to Eq. (18-23). Details report of the required hardware and considered Carry save adder (CSA) with end around carry (EAC) [27], Modulo adder [29][30], subtractor [31] and carry propagate adder (CPA) with EAC [32] which used in this paper are included in section 4 and table 2.

## 3.2 binary-to-Residue conversion in second bases

Calculation of number $X$ from moduli set of $\left\{2^{n}+3,2^{n}-1,2^{n}-3\right\}$ can be done with two MRC levels as shown in figure 3:


Figure 3. Two MRC levels of conversion in second basis

As mentioned before the coefficients $\left\{v_{1}, v_{2}, v_{3}\right\}$ have to be obtained. Assume $P_{1}=2^{n}+3, P_{2}=2^{n}-1$, $P_{3}=2^{n}-3$ and $Z=\left(x_{1}, x_{2}, x_{3}\right)$. Thus:

$$
\begin{align*}
& v_{1}=x_{1}  \tag{24}\\
& Y=v_{1}+v_{2}\left(2^{n}+3\right)  \tag{25}\\
& v_{2}=\left.\left.\left|\left(x_{2}-x_{1}\right)\right| P_{1}^{-1}\right|_{P_{2}}\right|_{P_{2}} \tag{26}
\end{align*}
$$

Theorem 2: Required multiplicative inverse in Eq. (26), is $\left|P_{1}^{-1}\right|_{2^{n}-1}=2^{n-2}$
Proof.

$$
\left|P_{1}^{-1}\left(2^{n}+3\right)\right|_{2^{n}-1}=1 \longrightarrow\left|P_{1}^{-1}\right|_{2^{n}-1}=2^{n-2}
$$

Then Eq. (26) can be rewritten as follow:

$$
\begin{equation*}
v_{2}=\left|\left(x_{2}-x_{1}\right) 2^{n-2}\right|_{2^{n}-1} \tag{27}
\end{equation*}
$$

By using Lemma 1 and 2 we have:

$$
\begin{align*}
& v_{2}=\mid \underbrace{\left(x_{2, n-1} \ldots \ldots x_{2,0}\right.}_{n \text { bit }}-\underbrace{00 . .0}_{n-1} \text { bit } \underbrace{x_{n, 1} \ldots \ldots x_{0,1}}_{n+1}) \times\left. 2^{n-2}\right|_{2^{n}-1}  \tag{28}\\
& v_{2}=|\underbrace{x_{2,1} x_{2,0} x_{2, n-1} \ldots \ldots x_{2,2}}_{n \text { bit }}-0 \underbrace{0 x_{n, 1} \ldots \ldots x_{0,1}}_{n+1} \frac{00 . .0}{n-1}|_{\text {bit }^{n}-1}  \tag{29}\\
& v_{2}=|\underbrace{x_{3,1} x_{3,0} x_{3, n-1} \ldots x_{3,2}}_{n \text { bit }}+\underbrace{1 \bar{x}_{n, 1} \ldots \bar{x}_{2,1}}_{n \text { bit }} \underbrace{\bar{x}_{1,1} \bar{x}_{0,1} 111 \ldots .1}_{n \text { bit }}|_{2^{n}-1}  \tag{30}\\
& v_{2}=|\underbrace{x_{3,1} x_{3,0} x_{3, n-1} \ldots x_{3,2}}_{n \text { bit }}+\underbrace{1 \bar{x}_{n, 1} \ldots \bar{x}_{2,1}}_{n \text { bit }}+\underbrace{\bar{x}_{1,1} \bar{x}_{0,1} 111 \ldots .1}_{n \text { bit }}|_{2^{n}-1}  \tag{31}\\
& v_{2}=\left|K_{1}+K_{2}+K_{3}\right|_{2^{n}-1} \tag{32}
\end{align*}
$$

Where

$$
\begin{align*}
& K_{1}=x_{3,1} x_{3,0} x_{3, n-1} \ldots . . x_{3,2}  \tag{33}\\
& K_{2}=1 \bar{x}_{n, 1} \ldots \bar{x}_{2,1}  \tag{34}\\
& K_{3}=\bar{x}_{1,1} \bar{x}_{0,1} 111 \ldots .1 \tag{35}
\end{align*}
$$

At the end of MRC1, $Y$ can be computed by:

$$
\begin{align*}
& Y=v_{1}+\left(2^{n}+3\right) v_{2}  \tag{36}\\
& Y=v_{1}+2^{n} v_{2}+3 v_{2}=v_{1} v_{2}+3 v_{2}=\underbrace{v_{1} v_{2}}_{v_{21}}+\underbrace{v_{2} 0}_{v_{22}}+v_{2} \tag{37}
\end{align*}
$$

Hardware implementation of $Y$ in Eq. (37) in first level of MRC conversion is shown in figure 4.
For implementation of second level of MRC conversion as shown in figure 3 we used Eq. (40) as follows:

$$
\begin{align*}
& v_{1}=Y  \tag{38}\\
& X=v_{1}+v_{2} P_{3}  \tag{39}\\
& v_{2}=\left.\left.\left|\left(x_{3}-Y\right)\right| P_{1,2}{ }^{-1}\right|_{P_{3}}\right|_{P_{3}} \tag{40}
\end{align*}
$$

Theorem 3: Required multiplicative in Eq. (40) is

$$
\left|P_{1,2} 2^{-1}\right|_{P_{3}}=\left|\left(\left(2^{n}+3\right)\left(2^{n}-1\right)\right)^{-1}\right|_{\left(2^{n}-3\right)}
$$

Proof.

$$
\begin{aligned}
& \left.\left.\left|\left(2^{n}+3\right)\left(2^{n}-1\right)\right|_{1,2^{-1}}^{-1}\right|_{P_{3}}\right|_{\left(2^{n}-3\right)}=\left.\left.1 \longrightarrow|12| P_{1,2}^{-1}\right|_{P_{3}}\right|_{\left(2^{n}-3\right)}=1 \\
& \left.\left.\longrightarrow|12| P_{1,2}^{-1}\right|_{P_{3}}\right|_{\left(2^{n}-3\right)}=\left|\frac{1}{12}\right|_{\left(2^{n}-3\right)}=\left|\frac{-\left(2^{n}-4\right)}{12}\right|_{\left(2^{n}-3\right)}=-\left(1+2^{2}+\ldots+2^{n-4}\right)
\end{aligned}
$$



Figure 4. Hardware implementation of $X$ conversion in second basis

Based on Theorem 3 and Lemma 1 and 2 Eq. (40) can be written as the following equation:

$$
\begin{align*}
& v_{2}=\left|\left(Y-x_{3, n-1} \ldots x_{3,0}\right)\left(1+2^{2}+\ldots+2^{n-4}\right)\right|_{\left(2^{n}-3\right)}  \tag{41}\\
& v_{2}=\left|\begin{array}{l}
C L S(Y, 0)+\ldots \ldots+C L S(Y, n-4)+ \\
C L S\left(\bar{x}_{3}, 0\right)+\ldots \ldots+C L S\left(\bar{x}_{3}, n-4\right)
\end{array}\right|_{\left(2^{n}-3\right)} \tag{42}
\end{align*}
$$

Finally $X$ is calculated as the following:

$$
\begin{align*}
& X=Y+v_{2}\left(2^{n}-3\right)\left(2^{n}-1\right)=Y+v_{2}\left(2^{2 n}+2^{n+1}-3\right)  \tag{43}\\
& =Y+2^{2 n} v_{2}+2^{n+1} v_{2}-v_{2} 0-v_{2} \\
& X=Y+2^{2 n}(\underbrace{0 \ldots .0 v_{2}}_{2 n \text { bit }})+2^{n+1}(\underbrace{0 \ldots \ldots v_{2}}_{(n+1) \text { bit }})+\bar{v}_{2} 1+\bar{v}_{2}+2  \tag{44}\\
& X=Y+\underbrace{v_{2} \underbrace{00 \ldots 0}_{n \text { bit }} \bar{v}_{2}}_{x_{1}}+\underbrace{v_{2} \bar{v}_{2} 1}_{x_{2}}+2 \tag{45}
\end{align*}
$$

By calculating X in Eq. (45) for second level of MRC conversion, the residue to binary converter for the moduli set $\left\{2^{n}+3,2^{n}-1,2^{n}-3\right\}$ is designed completely and shown in figure 4 . Details report of the required hardware is included in section 4 and table 3 .

### 3.3 Forward conversion

In order to achieve RNS to RNS conversion according to figure 1, Forward conversion in the considered moduli set is required. Forward converter for the moduli $2^{n}, 2^{n}+1,2^{n}+3,2^{n}-1$ and $2^{n}-3$ are designed by researchers [19] [20].

Critical moduli in first and second bases are $2^{n}+1$ and $2^{n}+3$, respectively. Forward conversion for ( $3 n$ )bit word in moduli $2^{n}+1$ and ( $3 n+1$ )-bit word in critical moduli $2^{n}+3$ in second bases have been calculated and show in table 1 as follow:

Table 1. Delay of critical channel of forward conversion in $2^{n}+1$ and $2^{n}+3$

|  | Critical moduli | Hardware cost | Delay for conversion |
| :--- | :---: | :---: | :---: |
| First bases | $\left\{2^{n}+1\right\}$ | $(4 n+10) t_{\mathrm{FA}}$ | $(2 n+5) t_{\mathrm{FA}}[19]$ |
| Second bases | $\left\{2^{n}+3\right\}$ | $(5 n+19) t_{\mathrm{FA}}$ | $(3 n+10) t_{\mathrm{FA}}[20]$ |

## 4. Performance evaluation

In this section, in order to evaluate the performance of the proposed conversion for comparison with [7], we are going to calculate the hardware cost and conversion delay of two proposed sets $\left\{2^{n}, 2^{n}+1\right.$, $\left.2^{n-1}-1\right\}$ and $\left\{2^{n}+3,2^{n}-1,2^{n}-3\right\}$ when $t_{\mathrm{NOT}}, t_{\mathrm{FA}}, t_{\mathrm{XNOR} / O R}$ and $t_{\mathrm{XOR} / O R}$ are delay represented of NOT gate, a full adder (FA), a pairs of XNOR/OR and a pairs of XOR/OR, respectively. As shown in figure 2 for first set, for calculating Eq. (18-20), (2n-2) NOT gates in OPU1 is used, so delay OPU1 equals to $t_{\text {NOT }}$. In Eq. (11) for calculating $v_{2}$ is used a modulo ( $2 n+1$ ) subtractor which has ( $2 n+2$ )-bit delay [30]. Hence, for obtaining Eq. (14) from Eq. (15) and Eq. (17-20) is used from ( $n+1$ ) NOT gates in OPU2 and a 4 - operand modulo ( $2^{n-1}-1$ ) adder is required which include three ( $n-1$ )-bit CSA with EAC following by ( $n-1$ )-bit CPA with EAC. Each CSA with EAC has the delay of FA, and the delay of a

CPA with EAC is twice the delay of a regular CPA. Since Eq. (19) and Eq. (20) have constant value of 1 then FAs in CSA1 reduced to pairs of XNOR/OR gates. Area for CSA Tree is (n-3)(n-1)-bit CSA with EAC tree which arranged in $l$ levels. Delay of circular shifting or bits rearrangement in OPU3 is ignored, since it is a rearrangement of wires. Realization of Eq. (23) required (3n)-bit CPA. Finally total delay and hardware cost for first set are calculated in table 2.

Table 2. Hardware and delay specification of reverse converter for the moduli set $\left\{2^{n}, 2^{n}+1,2^{n-1}-1\right\}$

| Parts | FA | NOT | XNOR/OR <br> pairs | Delay |
| :---: | :---: | :---: | :---: | :---: |
| OPU1 |  | $2 n-2$ |  | $t_{\mathrm{NOT}}$ |
| CSA1 | 1 |  | $n-2$ | $t_{\mathrm{FA}}$ |
| Modulo subtractor <br> $[28]$ | $2 n+2$ |  |  | $(n+1) t_{\mathrm{FA}}$ |
| OPU2 |  | $n+1$ |  | $t_{\mathrm{NOT}}$ |
| CSA Tree | $n^{2}-4 n-3$ |  |  | $l_{\text {l. } ._{\mathrm{FA}}}$ |
| CSA2 | $n-1$ |  |  | $t_{\mathrm{FA}}$ |
| CSA3 | $n-1$ |  |  | $t_{\mathrm{FA}}$ |
| CPA1 | $n-1$ |  |  | $(2 n-2) t_{\mathrm{FA}}$ |
| CPA2 | $3 n$ |  |  | $(3 \mathrm{n}) t_{\mathrm{FA}}$ |
| Total delay | $n^{2}+4 n-3$ | $3 n-1$ | $n-2$ | $(6 n+2+l) t_{\mathrm{FA}}+2 t_{\mathrm{NOT}}$ |

*Here $l$ is the number of levels of CSA tree with ( $n-1$ ) input

In second set, required hardware for convert from Residue to binary in Eq. (29) is (2n) NOT gates in OPU1 following by $n$-bit CSA with EAC and $n$-bit CPA with EAC are used. Since Eq. (34) and Eq. (35) have constant value of 1 then FAs in CSA1 reduced to pairs of XNOR/OR. For calculating the Eq. (37) need to $n$-bit CSA that following by ( $2 n+1$ )-bit CPA with EAC has been used. In Eq. (41) OPU3 has $n$ NOT gates and two CSA Trees which one of them consists of ( $n-5$ )( $4 n$ )-bit CSA with EAC tree area which arranged in $k$ levels for calculation of $Y$ into two (4n)-bit parts which has constant value of 0 then FAs in CSA tree reduced to pairs of XOR/AND gates and following by OPU4, eleven $n$-bit CSA with EAC and $n$-bit CPA with EAC. As mention before delay of circular shifting or bits rearrangement in OPU4 was ignored, since it is a rearrangement of wires. Another CSA Tree for $x_{3}$ calculation has $(n-5)(2 n)$-bit CSA with EAC tree area which arranged in $I$ levels which has constant value of 1 then FAs in CSA tree reduced to pairs of XNOR/OR gates. Finally for operations of Eq. (45) OPU5 with ( $n+1$ ) NOT gates, one ( $3 n+1$ )-bit CSA and ( $3 n+1$ )-bit CPA has been applied. Since Eq. (45) has constant values of 0 then full adders in CSA9 reduced to pairs of XOR/AND gates. Total delay for second set is calculated in table 3 .

Table 3. Hardware and Delay Specification of Reverse Converter for the moduli set $\left\{2^{n}+3,2^{n}-1,2^{n}-3\right\}$

| Parts | FA | NOT | XOR/AND <br> pairs | XNOR/OR <br> Pairs | Delay |
| :---: | :---: | :---: | :---: | :---: | :---: |
| OPU1 |  | $2 n$ |  |  | $t_{\mathrm{NOT}}$ |
| CSA1 | 1 |  |  | $n-1$ | $t_{\mathrm{FA}}$ |
| CPA1 | $n$ |  |  |  | $(2 n) t_{\mathrm{FA}}$ |
| CSA 2 | $2 n+1$ |  |  |  | $t_{\mathrm{FA}}$ |
| CPA2 | $2 n+1$ |  |  |  | $(2 n+1) t_{\mathrm{FA}}$ |
| OPU3 |  | $n$ |  |  | $t_{\mathrm{NOT}}$ |
| CSA Tree1 | $4 n^{2}-20 n$ |  | $2 n^{2}-7 n+3$ |  | ${ }^{2} k \cdot t_{\mathrm{FA}}$ |
| CSA Tree2 | $2 n^{2}-10 n$ |  |  | $n^{2}-3 n$ | ${ }^{*} t_{\mathrm{FA}}$ |
| CSA 3-13 | $n \times 11$ |  |  |  | $(1 \times 11) t_{\mathrm{FA}}$ |
| CPA3 | $n$ |  |  |  | $(2 n) t_{\mathrm{FA}}$ |
| OPU 5 |  | $n+1$ |  |  | $t_{\mathrm{NOT}}$ |
| CSA9 | $2 n+1$ |  | $n$ |  | $t_{\mathrm{FA}}$ |
| CPA3 | $3 n+1$ |  |  |  | $(3 n+1) t_{\mathrm{FA}}$ |
| Total delay | $6 n^{2}-8 n+5$ | $4 n+1$ | $2 n^{2}-6 n+3$ | $n^{2}-2 n-1$ | $(9 n+16+k+I) t_{\mathrm{FA}}+3 t_{\mathrm{NOT}}$ |

*Here $I$ and $k$ are the number of levels of two CSA trees with ( $n-3$ ) input

Total delay and hardware cost for RNS to RNS conversion from first to second basis and inverse are computed in table 4.

Table 4. Total delay and hardware cost for two proposed sets

|  | Hardware cost | Delay |
| :--- | :--- | :--- |
| RNS to RNS conversion from <br> first to second basis | $\left(n^{2}+8 n+7\right) t_{\mathrm{FA}}+(3 n-1) t_{\mathrm{NOT}}+(n-2) t_{\mathrm{XNOR} / \mathrm{OR}}$ | $(8 n+7+l) t_{\mathrm{FA}}+2 t_{\mathrm{NOT}}$ |
| RNS to RNS conversion from | $\left(6 n^{2}-3 n+24\right) t_{\mathrm{FA}}+(4 n+1) t_{\mathrm{NOT}}+\left(2 n^{2}-\right.$ | $(12 n+26+k+l) t_{\mathrm{FA}}$ |
| second to first basis | $6 n+3) t_{\mathrm{NOR} / \mathrm{AND}}+\left(n^{2}-2 n-1\right) t_{\mathrm{XNOR} / \mathrm{OR}}$ | $+3 t_{\mathrm{NOT}}$ |

For $p=192-\mathrm{b}$ implementation according to NIST report [33], $p=2^{192}-2^{64}-1$ is considered. By the proposed RNS bases, moduli sets $\left\{2^{66}, 2^{66}+1,2^{65}-1\right\}$ and $\left\{2^{66}+3,2^{66}-1,2^{66}-3\right\}$ when $n=66$ are achieved. In table 5 delay and hardware cost of proposed sets with [7] has been compared. Before beginning, in order to have a better comparison to obtain total area and delay estimation, the unit gate model is considered [14]. According to this model [14], time and area requirements for each of the following components are explained as follows: each two-input monotonic gate (e.g., AND, NAND) counts as one gate in area and delay, an XOR/XNOR gate has two gates in area and delay and a FA counts as seven gates in area and has four gates in delay.

Table 5. Total delay of required RNS to RNS conversion for 192 bit key length

|  | Delay | Unit gate delay |
| :--- | :---: | :---: |
| proposed | $(1298) t_{\mathrm{FA}}$ | 5192 |
| $[7]$ | $(5376) t_{\mathrm{FA}}$ | 21504 |

Table 5 shows the delay comparison with three moduli RNS bases proposed in [7]. As the result shows, noticeable improvement in RNS to RNS conversion in RNS Montgomery multiplication is achieved. Besides, using well formed RNS moduli results arithmetic operation in RNS Montgomery multiplication with higher efficiency [23].

## 5. CONCLUSION

Montgomery modular multiplication in RNS is an efficient way to achieve higher speed of modular multiplication. Conversion from RNS to RNS is the critical part of this approach. In this paper, efficient RNS bases are selected to achieve high speed operation and the required RNS to RNS conversions are designed in efficient way. The proposed design enjoys efficient bases with suitable arithmetic unit as well as high speed RNS to RNS conversion that is proper for ECC. The results shows noticeable improvements in terms of delay of conversions are achieved compared to state-of-the-art-work in literature.

## References

[1] R. L. Rivest, A. Shamir, and L. M. Adleman, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems,"Commun. ACM, vol. 21, no. 2, pp. 120-126, 1978.
[2] V. S. Miller, "Use of elliptic curves in cryptography," in Lecture notes in computer sciences; 218 on Advances in cryptology-CRYPTO 85. New York, NY, USA: Springer-Verlag New York, Inc., 1986, pp. 417-426.
[3] N. Koblitz, "Elliptic Curve Cryptosystem," J. Cryptology Math. Comp., vol. 48, pp. 203-209, 1987.
[4] P. Montgomery, "Modular Multiplication without Trial Division," Mathematics of Computation, vol. 44, no. 170, pp. 519521, Apr. 1985.
[5] J. Bajard, L. Didier, and P. Kornerup, "An RNS Montgomery's Modular Multiplication Algorithm," IEEE Trans. Computers, vol. 47, no. 2, pp. 167-178, Feb. 1998.
[6] J. Bajard, L. Didier, and P. Kornerup, "Modular Multiplication and Base Extensions in Residue Number Systems," Proc. 15th IEEE Symp. Computer Arithmetic (ARITH’01), pp. 59-65, 2001.
[7] J. C. Bajard, M. Kaihara, T. Plantard, "Selected RNS Bases for Modular Multiplication," In 19th IEEE International Symposium on Computer Arithmetic, pp. 25-32, 2009.
[8] M. Gerami, M. Esmaeildoust, Sh. Rezaie, K. Navi and O. Hashemipour, "Four Moduli RNS Bases for Efficient Design of Modular Multiplication," Journal of Computations \& Modelling, vol. 1, no. 2, pp 73-96, 2011.
[9] Sh. Rezaie, M. Esmaeildoust, M. Gerami, K. Navi and O. Hashemipour, "High Dynamic Range RNS Bases for Modular Multiplication, " IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 4, No 1, July 2011.
[10] A. A. Hiasat, "VLSI implementation of New Arithmetic Residue to Binary decoders," IEEE Transactions on VLSI systems, vol. 13, no. 1, pp. 153-158, 2005.
[11] A. S. Molahosseini, K. Navi, C. Dadkhah, O. Kavehei and S. Timarchi, "Efficient Reverse Converter Designs for the new 4Moduli Set $\left\{2^{n}-1,2^{n}, 2^{n}+1,2^{2 n+1}-1\right\}$ and $\left\{2^{n}-1,2^{n}+1,2^{2 n}, 2^{2 n}+1\right\}$ Based on New CRTs, " IEEE Transactions on Circuits and Systems-I, 57(4), (2010), 823-835.
[12] B. Cao, C. Chang and T. Srikanthan, "An Efficient Reverse Converter for the 4-Moduli Set $\left\{2^{n}-1,2^{n}, 2^{n}+1,2^{2 n}+1\right\}$ Based on the New Chinese Remainder Theorem," IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Applications, 50(10), (October, 2003), 1296-1303.
[13] P. M. Matutino, R. Chaves, and L. Sousa, "Arithmetic units for RNS moduli $\left\{2^{n}-3\right\}$ and $\left\{2^{n}+3\right\}$ operations," in 13th EUROMICRO Conference on Digital System Design: Architectures, Methods and Tools, 2010.
[14] R. Zimmermann, "Efficient vlsi implementation of modulo ( $2 \mathrm{n} \pm 1$ ) addition and multiplication," in 14th IEEE Symposium on Computer Arithmetic, 1999.
[15] S. B. R.A. Patel, M. Benaissa and N. Powell, "Power-delay-area efficient modulo $2^{n}+1$ adder architecture for rns," Electronics Letters, 2005.
[16] R. Zimmermann, "Binary adder architectures for cell-based VLSI and their synthesis," Ph.D. dissertation, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998.
[17] R. Chaves and L. Sousa, "Improved 2n +1 multipliers," Technical Report RT/14/2003, INESC-ID, July 2003.
[18] R. Chaves and L. Sousa, " $2^{n}+1,2^{n}+\mathrm{k}, 2^{n}-1$ : A new RNS moduli set extension," in EUROMICRO Systems on Digital System Design, 2004.
[19] P. M. Matutino, R. Chaves, and L. Sousa, "Binary-to-RNS conversion units for moduli $\{2 \mathrm{n} \pm 3\}$," in 14th EUROMICRO Conference on Digital System Design: Architectures, Methods and Tools, 2011.
[20] F. J. Taylor, "Residue arithmetic: A tutorial with examples," Computer,vol. 17, pp. 50-62, May 1984.
[21] M. Esmaeildoust, A. Kaabi, "High Speed Reverse Converter for the Five Moduli Set $\left\{2^{n}, 2^{n}-1,2^{n}+1,2^{n}-3,{ }^{2 n-1}-1\right\}$ ", TJMCS, pp. 438-450, 2009.
[22] K. Navi, A. S. Molahosseini, and M. Esmaeildoust, "How to teach residue number system to computer scientists and engineers?" IEEETrans. Edu., vol. 54, no. 1, pp. 156-163, Feb. 2011.
[23] B. Parhami, Computer Arithmetic: Algorithms and Hardware Design.Oxford, U.K.: Oxford Univ. Press, 2000.
[24] B. Cao, C. H. Chang, and T. Srikanthan, "Adder based residue to binary converters for a new balanced 4-moduli set," in Proc. 3 rd IEEE Symp.Image, Signal Process. Anal., 2003, vol. 2, pp. 820-825.
[25] N Keivan, M Esmaeildoust, AS Molahosseini, "A General Reverse Converter Architecture with Low Complexity and High Performance", IEICE TRANSACTIONS on Information and Systems 94 (2), (2011) 264-273.
[26] B. Cao, T. Srikanthan, C.H. Chang, "Efficient reverse converters for the four-moduli sets $\{2 n-1,2 n, 2 n+1,2 n+1-1\}$ and $\{2 \mathrm{n}-1,2 \mathrm{n}, 2 \mathrm{n}+1,2 \mathrm{n}-1-1\}$," IEE Proc. Comput. Digit.Tech., vol. 152, 687-96, 2005.
[27] K. Hwang, "Computer Arithmetic: Principle, Architecture and Design", New York: Wiley, 1979.
[28] S.J. Piestrak, "Design of residue generators and multioperand modular adders using carry-save adders", IEEE Trans. Comput., Vol. 423, pp. 68-77, 1994.
[29] S.J. Piestrak, "A high speed realization of a residue to binary converter", IEEE Trans. Circuits Syst. II, Analog.Digit. Signal Process., Vol. 42, pp. 661-663, Oct. 1995.
[30] D.Younes, P.Steffan, "Novel Modulo $2^{n}+1$ Subtractor and Multiplier ", ICONS: The Sixth International Conference on Systems, pp.36-38, 2011.
[31] Reto Zimmermann, "Lecture notes on Computer Arithmetic: Principles, Architectures and VLSI Design," Integrated System Laboratory, Swiss Federal Institute of Technology (ETH) Zurich, Mar, 16, 1999.
[32] D. M. Schinianakis, A. P. Fournaris, H. E. Michail, A. P. Kakarountas, and T. Stouraitis, "An RNS Implementation of an Fp Elliptic Curve Point Multiplier", IEEE Transactions on Circuits and Systems-I, VOL. 56, NO. 6, JUNE 2009.

