METHOD OF FAST MATRIX MULTIPLICATION UNDER ARM ARCHITECTURE USING SIMD INSTRUCTIONS

Background. Matrix multiplication is a rather complicated algorithm with a large number of operations. An additional problem is the nonlinear memory traversal of matrices. Matrix multiplication is widely used in various fields, such as neural networks, solutions of linear equation systems, matrix transformations, and so on. Therefore, it is important to develop a method of matrix multiplication, which will take into account the problems of the location of the matrices in memory, and will effectively manage the data when reused.

Objective. The purpose of the paper is to develop a method of fast matrix multiplication of two matrices, as well as multiplying the matrix by the transposed matrix and by a list of vectors (including special case for only one vector), as well as to implement it as a function with optimization for ARM architecture processors. The function must be able to handle different types of data and submatrices. The integer result can be scaled.

Methods. The main ideas of the developed method are simultaneous work with several rows/columns of input matrices and their splitting into blocks, which will allow the algorithm to run on the same memory for a while. The C programming language was chosen for implementation. SIMD instructions were used to increase productivity. We also need to properly organize the memory preloading for effective implementation under the ARM architecture.

Results. A function that performs matrix multiplication by the developed method with the necessary parameters was implemented as a result of the study. Tests on various sizes and types have shown that the implemented function is faster than analogues from the OpenCV2 and Eigen 3 libraries. Testing was done using the vipmed utility for running and measuring features developed for enterprise use at VIT.

Conclusions. The proposed matrix multiplication method gives the expected acceleration of matrix multiplication operations, has passed evaluation test for use and meets the target requirements. For further work, it is necessary to study in more detail the influence of the cache at different levels and compare with other existing libraries.

Keywords: matrix multiplication; ARM architecture; vector operations; matrix transposition.

© The Author(s).
The article is distributed under the terms of the license CC BY 4.0.
Overview of the existing solutions

The problem of using matrix multiplication is that it has the burden of performing a great deal of operations. In example, we define two matrices $A$ and $B$:

$$ A = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & \cdots & b_{1p} \\ \vdots & \ddots & \vdots \\ b_{m1} & \cdots & b_{mp} \end{pmatrix}. $$

Assume that the result is $C$:

$$ C = \begin{pmatrix} c_{11} & \cdots & c_{1p} \\ \vdots & \ddots & \vdots \\ c_{n1} & \cdots & c_{np} \end{pmatrix}. $$

Therefore, matrix multiplication formula is:

$$ c_{ij} = a_{ij}b_{j} + \cdots + a_{im}b_{mj} = \sum_{k=1}^{m} a_{ik}b_{kj}, $$

for $i = 1, \ldots, n$ and $j = 1, \ldots, p$.

The complexity of this elementary algorithm is $O(nmp)$ or $O(n^3)$ if matrices are square ($n = m = p$) [2]. It is not necessary to go into the details that the cubic complexity is a bad property.

The existing solutions having open source code are not effective enough (this will be proven in the results).

There are many algorithms for rapid matrix multiplication that reduce the complexity of the operation. The best-known and used in practice is the Strassen algorithm, which reduces the complexity to $O(n^{\log27})$, which is approximately equal to $O(n^{2.81})$ [3]. All the other algorithms are only theoretical and approximate, so they are practically not used [4].

These algorithms are purely mathematical and do not take into account such an important point as placement the matrices in memory. They only reduce the number of multiplications, so in practical programming this is not enough.

Considering the chosen processor architecture, memory management is very important. To speed up this work, we use memory preload in cache. The auto-preloader of X86 processors, unlike ARM, works quite well, so it makes little sense to do this job manually. In ARM architecture, the well-timed use of the memory preloader can result in speeding up operation many-fold. However, if one makes a mistake, there is a significant drop in performance.

Therefore, we propose a new method of the matrix multiplication considering above information.

There are many libraries, including open source code, that implement matrix multiplication. In the specification of basic linear algebra subroutines (BLAS), this operation has a more enhanced interface and is called gemm. There are many implementations of this specification and most of them implement matrix multiplication using SIMD technology [5].

OpenCV is a library of software functions primarily focused on real-time computer vision algorithms. This library is a very powerful tool: it has many useful features, is cross-platform and implemented in several programming languages. It is distributed as BSD-licensed open source code software. OpenCV is not BLAS compatible, but implements a similar gemm function [6].

Eigen is a template library, it provides a simple and very common C++ 98 template interface for matrix/vector operations and related algorithms. This library, like OpenCV, contains implementations that leverage vector operations for optimization. Important thing is that there are some (more optimal) implementations for some fixed sizes. The main feature of this library is that it is fully implemented in the headers, so one only needs to download these files to be used [7].

These libraries stated that they have optimized algorithms for matrix operations, so we choose them for comparing with ours.

It is pointless to describe in detail how the matrix multiplication algorithm is implemented in the above libraries. The result of their work and comparison with the proposed method will be shown below. Their main disadvantage is the lack of performance, so the purpose of this article is to develop a faster method of implementing matrix multiplication.

Description of the proposed method

The matrix multiplication of the $M \times K$ matrix $A$ and $K \times N$ matrix $B$ results in the creation of $M \times N$ matrix $C$. Each element in matrix $C$ can be considered as a scalar product of the corresponding row of matrix $A$ and column of matrix $B$.

It is possible to implement all matrix multiplications by using a primitive scalar product, but such implementation would be far from effective. In a scalar product, we load two elements for each multiplication-add operation, and on modern processors, this implementation will be limited by memory or cache bandwidth instead of the computing power of multiplication-add units. Neverthe-
less, a minor modification — calculating point products from several rows in $A$ and several columns in $B$ at a time — improves performance significantly.

The modified primitive takes $MR$ elements of elements in $A$ and $NR$ of $B$ elements and performs multiplication operations with $MR \times NR$ accumulation. The number of registers and other details of the processor architecture limit maximum $MR$ and $NR$ values. But in most modern systems, they are large enough to make the operation limited, and all high-performance implementations of matrix-matrix multiplication are built on this primitive microkernel commonly called PDOT (panel dot product). The $M = N = 4$ workaround was selected in the method stated below. Such a small number causes the limitations of the chosen architecture.

This paper considers the matrix multiplication method for the ARM v7 architecture. It is 32-bit and has some limitations on the number of registers. With the exception of Armv6-M, Armv7-M, Armv8-M.baseline and Armv8-M.mainline based processors, there are 33 32-bit general-purpose registers, including redundant SP and LR registers. Fifteen general-purpose registers are visible at any time, depending on the current processor mode. These are R0-R12, SP and LR. PC (R15) is not considered a general-purpose register [8].

SP (or R13) is a stack pointer. C and C++ compilers always use SP as a stack pointer. Arm devalues most SP applications as a general-purpose registry. In T32 state, SP is strictly defined as a stack pointer. ARM official documentation describes when SP and PC can be used.

In a user mode, LR (or R14) is used as a register of links to store a return address when a subroutine is called. It can also be used as a general-purpose register, if the return address is stored in the stack.

In exception handling modes, the LR stores the return address for the exception or the return address of the subroutine, if subroutine calls are made within the exception. The LR can be used as a general-purpose register if the return address is stored in the stack.

From the above it is clear that only 13 regular registers are available for general use without any restrictions.

The algorithm requires much more regular registers to move with the four rows of the first to the second matrix, as noted above. Matrix row pointers (4 first + 4 second) also need pointers to the resulting matrix rows, registers to store sizes and iterators for each of them. Since matrices can actually only be parts of larger matrices, the notion of a step between rows is introduced. This step is defined in bytes and is equal to the actual width of the entire image, multiplied by the size of one element. The user with an understanding of the storage area in which one operates should transmit this data. These steps, for each of the three matrices, accordingly, require registers. So, even not taking into account the extra registers that may be needed to calculate, the required number equals already to twenty. Therefore, to save all the necessary values, one needs to allocate additional temporary memory. To use this data, they should be loaded in free registers and stored back in time, in case of necessity, the register should be freed up (saving constant values, such as matrix sizes and line spacing, each time is not required).

The main practical problem in calculating the product of matrices is the inefficient bypass of the second matrix, because the result of a particular element of the result matrix is the product of the row of the first matrix and the second matrix column. Column matrix bypass is quite inefficient in Row-Major mode (when a row in memory is in sequence). To solve this problem, the accumulation of results in a temporary buffer was chosen, while moving linearly on the second matrix. With this approach, the number of bypasses of the second matrix increases. In fact, with each row of the first matrix, the second one is completely read-out. However, gradual read-out of the data slows down the program operation to such an extent, that multiple linear reading of the matrix still faster than just one inconsistent read. Considering that the bypass will be performed at once by four rows of the first matrix, the number of bypasses of the second one is reduced by four times (it is fully read-out once per every four rows of the first matrix).

Of course, the disadvantage of this approach is the considerable amount of additional dedicated memory (4* the width of the second matrix). Thus, the resulting calculation is divided into two parts: first, there is an accumulation in the temporary buffer, and then, only at the last iteration, the entry in the resulting matrix. The last iteration is the calculation on the last rows of the second matrix. The magnitude of the so-called matrix tail will be equal to the remainder of the division of the second matrix height (or the width of the first, since they are equal) by the number of rows that are bypassed within one iteration (in this case by four). If there is no remainder, then one iteration less is performed in the total cycle and when processing the last four rows, the result is immediately written to the resulting matrix.
As a result, we have the following general algorithm for matrices $A(M\times K)$ and $B(K\times N)$ multiplied into the matrix $C(M\times N)$ by the blocks $m\times R$ and $R\times n$ with the tail having $t$ value:

1. Allocation of necessary additional memory (for variables and accumulation), initial data initialization (including memory reset into which the result will be accumulated) and their storage.

2. Reading-out $R$ elements from $m$ rows of the first matrix.

3. Reading-out $n$ elements from the $R$ rows of the second matrix.

4. Read $n$ elements from $m$ lines of temporary buffer with intermediate results.

5. Scalar multiplication of blocks $m\times R$ and $R\times n$ and their accumulation to data read-out from a temporary buffer.

6. Record of intermediate results back to the temporary buffer.

7. Implementation of items 3–6 $N/n$ times.

8. Upon bypass of the entire width of the second matrix, the transition to next $R$ rows of that matrix and the $R$ elements of the first one is performed. Temporary buffer pointers are moved to the beginning and accumulation will further occur in it.

9. Implementation of items 2–8 $(K/R-t)$ times.

10. At the tail iteration, we read-out the last $t$ elements from the $m$ rows of the first matrix.

11. We read $n$ elements from the last $t$ rows of the second matrix.

12. Same as item 4.

13. Scalar multiplication of blocks $m\times t$ and $t\times n$ and their accumulation to read-out data from the temporary buffer.

14. Resetting the temporary buffer.

15. Writing $m$ rows of $n$ elements in the resulting matrix.

16. Implementation of items 11–15 $N/n$ times.

17. Transition to the next $m$ rows of the first and the resulting matrices.

18. Implementation of items 2–17 $M/m$ times.

**Implementation using SIMD instructions**

In practical application, this algorithm is good enough. It is worth to note that aliquant values are not taken into account here, that is, it is necessary to additionally process the remainders, but it is decided to describe the algorithm without such, quite clear, details.

Of course, fast calculation requires more than just an efficient algorithm. One has to additionally look for optimization methods (both algorithmic and architecture related). Using algorithmic optimization, one can specify items related to tail processing. In a simple algorithm, individual points are not dedicated to this: first, the result is completely obtained in the temporary buffer (i.e., points 2–8 are performed $K/R$ times), then the result is rewritten into matrix $C$ and the last step is the resetting of the temporary buffer. In this algorithm, all these operations occur in one pass.

As to lower-level optimization, one should start with the vector instructions. Such operations allow executing operation with several values written in vector registers.

SIMD is a class of parallel programming, which is based on such operations. Most modern processors are designed to support SIMD instructions to enhance performance. This class is particularly popular in signal processing, where, as a rule, a large number of identical data is processed with similar operations. SIMD also allows processing several similar data types with the same instruction (as indicated in the name — Single instruction, multiple data; which is rendered as: one instruction for lots of data).

In ARM architecture processors, SIMD is represented as NEON (Advanced SIMD) extension. The Registry Bank of this extension is a collection of registers that can be accessed both as 64-bit and 128-bit vector registers. Advanced SIMD and VFP (floating-point values operations) use the same registers and differ from the main ARM register bank.

128-bit registers are called Q-registers, and 64-bit — D-registers. Each Q-register corresponds to 2 D-registers, they are overlapping. The mapping between the registers is as follows:

$D\left(2n\right)$ maps the least significant half of $Q\left(n\right)$;

$D\left(2n+1\right)$ maps the most significant half of $Q\left(n\right)$.

For example, one can access the least significant half of the vector elements in Q6 by referring to D12 and the most significant half of the elements — by referring to D13.

Therefore, in general, the registry bank can be represented by:

- Sixteen 128-bit registers Q0–Q15;
- Thirty-two 64-bit registers D0–D31;
- A combination of D and Q registers.

The SIMD extension treats each register as containing 1, 2, 4, 8, or 16 elements of the same size and type (the number depends on the register size and the element, respectively). Individual elements can also be accessed as scalars.
Let us consider using this technique in the proposed algorithm.

We omit the moments with reading and writing, we assume that the matrix elements have already been read into the vector registers and will be written from them.

Let us have a more detailed look at item 5. We will consider the example of a 32-bit float type. In the developed method, for \( m = 4 \) and \( n = 4 \), \( R \) is also assumed as four, due to the limitation of the registers number. Thus, \( 4 \times 4 \) elements of the first matrix have been read in item 2. This corresponds to 64 bytes, so four Q registers are required. In items 3 and 4 the same number of Q registers of the second matrix and time buffer was read out, respectively.

According to the matrix multiplication algorithm, it is necessary to multiply the first row of the second matrix by the first element of each row of the first one, the second row — by every second element, and so on. Moreover, the products obtained from the \( i \)-th row of the first matrix and the \( j \)-th column of the second matrix correspond to the element \((i, j)\) of the temporary buffer.

Given the specifics of the actions described, a vector-to-vector operation is not appropriate, and as stated above, NEON allows getting access to an individual element. Therefore, some instructions allow performing vector-scalar operations. VMLA is one such instruction [9].

VMLA (Vector Multiple and Accumulate) — multiplies the corresponding elements of two vectors and adds the product to the corresponding elements of the result register. In the scalar-vector case, each element of the vector is multiplied by a scalar. The general syntax of the instruction is as follows:

\[
\text{VMLA \{cond\} .datatype \{Qd\}, Qn, Qm.}
\]

And the vector scalar option looks like:

\[
\text{VMLA \{cond\} .datatype \{Qd\}, Qn, Dm [x].}
\]

Let us consider each element of the structure:

- \( \{\text{cond}\} \) — an optional parameter, NEON allows conditional execution of instructions, cond box is clicked for the condition under which the instruction will be executed;
- \( \{\text{datatype}\} \) — a vector register element type, in our case F32, denoting a 32-bit floating point number;
- \( \{Qd\}, Qn, Qm \) — input registers for the operation, the following operation is conditionally performed: \( \text{Qd} = \text{Qn} \times \text{Qm} \);
- \( \text{Dm}[x] \) — a scalar, for the vector-scalar operation it is important that the Q-registers are used as vectors, but the scalar gets from the D-register; \( x \) — the index of the desired element from the vector Dm.

Therefore, we assume that vectors Q4—Q7 were read from the first matrix, from the second matrix — Q8—Q11 vectors, and from the temporary buffer — Q0—Q3. Then the set of commands required to get the result will look as follows:

Multiplying the 1st row of the second matrix by the first element of each row of the first matrix with accumulation in the temporary buffer.

\[
\begin{align*}
\text{VMLA.F32 Q0, Q8, D8[0]} \\
\text{VMLA.F32 Q1, Q8, D10[0]} \\
\text{VMLA.F32 Q2, Q8, D12[0]} \\
\text{VMLA.F32 Q3, Q8, D14[0]}
\end{align*}
\]

Multiplication of the 1st row of the second matrix by the first element of each row of the first matrix with accumulation in the temporary buffer, etc.

\[
\begin{align*}
\text{VMLA.F32 Q0, Q9, D8[1]} \\
\text{VMLA.F32 Q1, Q9, D10[1]} \\
\text{VMLA.F32 Q2, Q9, D12[1]} \\
\text{VMLA.F32 Q3, Q9, D14[1]} \\
\text{VMLA.F32 Q0, Q10, D9[0]} \\
\text{VMLA.F32 Q1, Q10, D11[0]} \\
\text{VMLA.F32 Q2, Q10, D13[0]} \\
\text{VMLA.F32 Q3, Q10, D15[0]} \\
\text{VMLA.F32 Q0, Q11, D9[1]} \\
\text{VMLA.F32 Q1, Q11, D11[1]} \\
\text{VMLA.F32 Q2, Q11, D13[1]} \\
\text{VMLA.F32 Q3, Q11, D15[1]}
\end{align*}
\]

So, we obtained 16 vector VMLA operations. A similar solution by the conventional method would produce 64 multiplication and the same number of addition operations.

Another way to optimize is linearly reading and writing to the temporary buffer. In a simple way, for convenience, the temporary buffer corresponds to the four rows of the resulting matrix. Given that this buffer is only a temporary one, it is possible to read and write it linearly. At first glance, it is just an elementary change, when it comes to vector registers and if we consider the further processing of remainders, but, in fact, it leads to serious confusion. These
details will not be described in this publication. The main thing is to correctly understand at what point the corresponding elements for accumulation are located and what registers are to be written in sums and where. It should be noted here that NEON allows to effectively read/write vector registers with one instruction. However, there are two limitations: one instruction can write at most two Q-registers at a time; the registers should be serial. That is, reading/writing of Q1, Q3 registers with one instruction is impossible. These limitations are one of the reasons for the problems with the transition from reading the temporary buffer by 4 lines to serial.

Given that the bypass of the second matrix is performed a large number of times, the overall execution time is greatly affected by the padding size. This is a value equal to the width of the matrix subtracted from the step between the rows. Briefly, this is the size of the region of the entire matrix that does not take part in the multiplication (if the multiplication is performed on the part of the larger matrix). By reducing the ratio between the width at which the multiplication is performed and the width at which it is not performed, the rate of execution decreases. Sometimes the deterioration is such that it is quicker to make a copy of the sub-matrix into a new memory where paddings will be removed and to perform the operation without them. It is due to these reasons that one more optimization occurred: from the above data, as well as the height of the first matrix (the number of the second matrix full readouts depends on this), the conversion factor is calculated, which makes it advantageous to first make a copy to the extra memory and only then to multiply the matrices on that memory.

For the ARM architecture the proper placement of the memory reboot is very important. As practice shows, ×86 architecture types have a good auto-preloader, unlike ARM. In this processor family, correct and timely memory rebooting can result in a huge acceleration. On the contrary, if programmer makes a mistake, the performance can drop significantly. The official documentation does not give any flexible advice. It only recommends preloading with 128-byte indentation unit.

The PLD instruction performs a 64-byte preload of the transmitted pointer memory with some preset indentation unit. As mentioned above, the official documentation recommends that 128 bytes indentation should be made in advance. However, practical use shows that this is only a minimum of the real capabilities of this instruction. In different situations accelerations produce single preloads before the start of the general cycle, or in the cycle itself (not always with 128 byte indentation unit, and sometimes their number can be more than one). The PLD instruction preloads 64 bytes at once. The above design only makes 16-bytes pass, so this operation is redundant at every iteration. The simplest solution to this problem is to spin a 64-byte cycle. In fact, one such iteration will simply contain 4 blocks of operations of 16 bytes each, but this approach makes it possible to perform preloading much more efficiently.

Finally, it can be noted that due to the different behavior when bypassing the matrices (first, second and temporary), the schemes of their preload also differ.

Similarly, not only matrix multiplication on floating-point numbers is implemented, but integer 8-bit scaling as well. The option of the matrix multiplication by the transposed one and the list of vectors is also implemented (the actual implementation is one, the transposed goes to the vectors list). This option allows to read both matrices sequentially without additional memory buffer and to use a vector-vector type VMLA instruction.

The results of the comparison of the proposed method with others

The OpenCV and Eigen libraries described above were selected for the purpose of comparison. Matrix multiplication functions in all libraries (including the developed one) return the same result, so we assume that it is accurate. The data was verified by a vipmed application designed specifically to test and verify image and matrix management functions. This software allows running various mathematical functions of different libraries with the transfer of the necessary parameters, compare their results, and measure the execution time within the accuracy of a microsecond using C library function “clock”.

All measurements were made on a NVIDIA Tegra K1 chip with ARM Cortex-A15 processors that support ARMv7 and NEON.

Testing was performed on square matrices with dimensions 10×10, 100×100, 200×200, 500×500, 1,000×1,000, 2,000×2,000 and 4,000×4,000 elements with padding of 0 and 4,000 bytes on 8-bit unsigned integers (u8) and 32-bit floating point numbers (f32). Variants of ordinary matrix multiplication and matrix multiplication by the transposed one were tested.

For illustration purposes, the tables show the percentage of run time of existing OpenCV and Eigen methods from the one developed for Vipm library.
Table 1 represents multiplying the \( N \times N \) matrix by the \( N \times N \) matrix using 8-bit unsigned integers without a padding.

**Table 1.** Matrix multiplication on type u8 with 0 byte padding

<table>
<thead>
<tr>
<th>N</th>
<th>Vipm (ms)</th>
<th>OpenCV (ms)</th>
<th>Eigen (ms)</th>
<th>cv (%)</th>
<th>ei (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>0.001</td>
<td>0.005</td>
<td>0.002</td>
<td>500</td>
<td>200</td>
</tr>
<tr>
<td>100</td>
<td>0.203</td>
<td>1.799</td>
<td>1.167</td>
<td>886</td>
<td>575</td>
</tr>
<tr>
<td>200</td>
<td>1.658</td>
<td>16.574</td>
<td>8.733</td>
<td>1000</td>
<td>527</td>
</tr>
<tr>
<td>500</td>
<td>26.487</td>
<td>281.808</td>
<td>134.164</td>
<td>1064</td>
<td>507</td>
</tr>
<tr>
<td>1000</td>
<td>216</td>
<td>2326</td>
<td>1064</td>
<td>1078</td>
<td>493</td>
</tr>
<tr>
<td>2000</td>
<td>2259</td>
<td>18706</td>
<td>8457</td>
<td>828</td>
<td>374</td>
</tr>
<tr>
<td>4000</td>
<td>20956</td>
<td>151309</td>
<td>67598</td>
<td>722</td>
<td>323</td>
</tr>
</tbody>
</table>

Table 2 represents multiplying the \( N \times N \) matrix by the \( N \times N \) matrix using 8-bit unsigned integers with 4,000 bytes padding.

**Table 2.** Matrix multiplication on type u8 with 4000 bytes padding

<table>
<thead>
<tr>
<th>N</th>
<th>Vipm (ms)</th>
<th>OpenCV (ms)</th>
<th>Eigen (ms)</th>
<th>cv (%)</th>
<th>ei (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>0.002</td>
<td>0.005</td>
<td>0.003</td>
<td>250</td>
<td>150</td>
</tr>
<tr>
<td>100</td>
<td>0.208</td>
<td>1.799</td>
<td>1.175</td>
<td>865</td>
<td>565</td>
</tr>
<tr>
<td>200</td>
<td>1.672</td>
<td>16.721</td>
<td>8.748</td>
<td>1000</td>
<td>523</td>
</tr>
<tr>
<td>500</td>
<td>28.331</td>
<td>281.932</td>
<td>134.221</td>
<td>995</td>
<td>474</td>
</tr>
<tr>
<td>1000</td>
<td>216</td>
<td>2341</td>
<td>1067</td>
<td>1085</td>
<td>494</td>
</tr>
<tr>
<td>2000</td>
<td>2336</td>
<td>18709</td>
<td>8470</td>
<td>801</td>
<td>363</td>
</tr>
<tr>
<td>4000</td>
<td>20957</td>
<td>151314</td>
<td>67651</td>
<td>722</td>
<td>323</td>
</tr>
</tbody>
</table>

Table 3 represents multiplying the \( N \times N \) matrix by the \( N \times N \) matrix using 32-bit floating-point numbers without padding.

**Table 3.** Matrix multiplication on type f32 with 0 byte padding

<table>
<thead>
<tr>
<th>N</th>
<th>Vipm (ms)</th>
<th>OpenCV (ms)</th>
<th>Eigen (ms)</th>
<th>cv (%)</th>
<th>ei (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>0.001</td>
<td>0.003</td>
<td>0.002</td>
<td>300</td>
<td>200</td>
</tr>
<tr>
<td>100</td>
<td>0.223</td>
<td>1.727</td>
<td>0.309</td>
<td>774</td>
<td>139</td>
</tr>
<tr>
<td>200</td>
<td>1.708</td>
<td>15.725</td>
<td>2.301</td>
<td>921</td>
<td>135</td>
</tr>
<tr>
<td>500</td>
<td>26.170</td>
<td>264.676</td>
<td>39.118</td>
<td>1011</td>
<td>149</td>
</tr>
<tr>
<td>1000</td>
<td>248</td>
<td>2302</td>
<td>318</td>
<td>926</td>
<td>128</td>
</tr>
<tr>
<td>2000</td>
<td>2160</td>
<td>18563</td>
<td>2475</td>
<td>860</td>
<td>115</td>
</tr>
<tr>
<td>4000</td>
<td>17938</td>
<td>150348</td>
<td>19131</td>
<td>838</td>
<td>107</td>
</tr>
</tbody>
</table>

Table 4 represents multiplying the \( N \times N \) matrix by the \( N \times N \) matrix using 32-bit floating-point numbers with 4,000 bytes padding.

**Table 4.** Matrix multiplication on type f32 with 4000 bytes padding

<table>
<thead>
<tr>
<th>N</th>
<th>Vipm (ms)</th>
<th>OpenCV (ms)</th>
<th>Eigen (ms)</th>
<th>cv (%)</th>
<th>ei (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>0.002</td>
<td>0.003</td>
<td>0.002</td>
<td>150</td>
<td>100</td>
</tr>
<tr>
<td>100</td>
<td>0.231</td>
<td>1.933</td>
<td>0.327</td>
<td>837</td>
<td>142</td>
</tr>
<tr>
<td>200</td>
<td>1.746</td>
<td>15.725</td>
<td>2.361</td>
<td>901</td>
<td>135</td>
</tr>
<tr>
<td>500</td>
<td>26.728</td>
<td>265.320</td>
<td>39.345</td>
<td>993</td>
<td>147</td>
</tr>
<tr>
<td>1000</td>
<td>250</td>
<td>2303</td>
<td>319</td>
<td>920</td>
<td>127</td>
</tr>
<tr>
<td>2000</td>
<td>2210</td>
<td>18576</td>
<td>2477</td>
<td>840</td>
<td>112</td>
</tr>
<tr>
<td>4000</td>
<td>18630</td>
<td>150349</td>
<td>19140</td>
<td>807</td>
<td>103</td>
</tr>
</tbody>
</table>

Table 5 represents multiplication of the \( N \times N \) matrix by the transposed \( N \times N \) matrix using 8-bit unsigned integers without padding.

**Table 5.** Matrix multiplication (transposed multiplicand) on type u8 with 0 byte padding

<table>
<thead>
<tr>
<th>N</th>
<th>Vipm (ms)</th>
<th>OpenCV (ms)</th>
<th>Eigen (ms)</th>
<th>cv (%)</th>
<th>ei (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>0.001</td>
<td>0.007</td>
<td>0.002</td>
<td>700</td>
<td>200</td>
</tr>
<tr>
<td>100</td>
<td>0.264</td>
<td>2.243</td>
<td>1.171</td>
<td>850</td>
<td>444</td>
</tr>
<tr>
<td>200</td>
<td>2.184</td>
<td>16.969</td>
<td>8.745</td>
<td>777</td>
<td>400</td>
</tr>
<tr>
<td>500</td>
<td>33.740</td>
<td>272.955</td>
<td>135.278</td>
<td>809</td>
<td>401</td>
</tr>
<tr>
<td>1000</td>
<td>246</td>
<td>2154</td>
<td>1069</td>
<td>877</td>
<td>435</td>
</tr>
<tr>
<td>2000</td>
<td>2035</td>
<td>17105</td>
<td>8473</td>
<td>841</td>
<td>416</td>
</tr>
<tr>
<td>4000</td>
<td>16002</td>
<td>136829</td>
<td>67651</td>
<td>855</td>
<td>423</td>
</tr>
</tbody>
</table>

Table 6 represents multiplying the \( N \times N \) matrix by the transposed \( N \times N \) matrix using 8-bit unsigned integers with 4,000 bytes padding.

**Table 6.** Matrix multiplication (transposed multiplicand) on type u8 with 4000 bytes padding

<table>
<thead>
<tr>
<th>N</th>
<th>Vipm (ms)</th>
<th>OpenCV (ms)</th>
<th>Eigen (ms)</th>
<th>cv (%)</th>
<th>ei (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>0.002</td>
<td>0.007</td>
<td>0.003</td>
<td>350</td>
<td>150</td>
</tr>
<tr>
<td>100</td>
<td>0.265</td>
<td>2.246</td>
<td>1.172</td>
<td>848</td>
<td>442</td>
</tr>
<tr>
<td>200</td>
<td>2.187</td>
<td>17.007</td>
<td>8.746</td>
<td>778</td>
<td>400</td>
</tr>
<tr>
<td>500</td>
<td>33.815</td>
<td>272.958</td>
<td>135.370</td>
<td>807</td>
<td>400</td>
</tr>
<tr>
<td>1000</td>
<td>247</td>
<td>2181</td>
<td>1069</td>
<td>885</td>
<td>434</td>
</tr>
<tr>
<td>2000</td>
<td>2044</td>
<td>17106</td>
<td>8475</td>
<td>837</td>
<td>415</td>
</tr>
<tr>
<td>4000</td>
<td>16405</td>
<td>136841</td>
<td>67651</td>
<td>834</td>
<td>412</td>
</tr>
</tbody>
</table>
Table 7 represents multiplying the $N \times N$ matrix by the transposed $N \times N$ matrix using 32-bit floating-point numbers without padding.

**Table 7.** Matrix multiplication (transposed multiplicand) on type f32 with 0 byte padding

<table>
<thead>
<tr>
<th>N</th>
<th>Vipm (ms)</th>
<th>OpenCV (ms)</th>
<th>Eigen (ms)</th>
<th>cv (%)</th>
<th>ei (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>0.001</td>
<td>0.004</td>
<td>0.002</td>
<td>400</td>
<td>200</td>
</tr>
<tr>
<td>100</td>
<td>0.245</td>
<td>2.141</td>
<td>0.313</td>
<td>874</td>
<td>128</td>
</tr>
<tr>
<td>200</td>
<td>1.786</td>
<td>17.022</td>
<td>2.298</td>
<td>953</td>
<td>129</td>
</tr>
<tr>
<td>500</td>
<td>26.884</td>
<td>269.551</td>
<td>38.846</td>
<td>1003</td>
<td>144</td>
</tr>
<tr>
<td>1000</td>
<td>275</td>
<td>2166</td>
<td>317</td>
<td>789</td>
<td>115</td>
</tr>
<tr>
<td>2000</td>
<td>2425</td>
<td>17275</td>
<td>2466</td>
<td>712</td>
<td>102</td>
</tr>
<tr>
<td>4000</td>
<td>18909</td>
<td>138233</td>
<td>19104</td>
<td>731</td>
<td>101</td>
</tr>
</tbody>
</table>

Table 8 represents multiplying the $N \times N$ matrix by the transposed $N \times N$ matrix using 32-bit floating-point numbers with 4,000 bytes padding.

**Table 8.** Matrix multiplication (transposed multiplicand) on type f32 with 4000 bytes padding

<table>
<thead>
<tr>
<th>N</th>
<th>Vipm (ms)</th>
<th>OpenCV (ms)</th>
<th>Eigen (ms)</th>
<th>cv (%)</th>
<th>ei (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>0.002</td>
<td>0.004</td>
<td>0.002</td>
<td>200</td>
<td>100</td>
</tr>
<tr>
<td>100</td>
<td>0.245</td>
<td>2.142</td>
<td>0.315</td>
<td>874</td>
<td>129</td>
</tr>
<tr>
<td>200</td>
<td>1.789</td>
<td>17.042</td>
<td>2.301</td>
<td>953</td>
<td>129</td>
</tr>
<tr>
<td>500</td>
<td>27.042</td>
<td>269.964</td>
<td>38.867</td>
<td>998</td>
<td>144</td>
</tr>
<tr>
<td>1000</td>
<td>290</td>
<td>2167</td>
<td>317</td>
<td>749</td>
<td>110</td>
</tr>
<tr>
<td>2000</td>
<td>2426</td>
<td>17277</td>
<td>2466</td>
<td>712</td>
<td>102</td>
</tr>
<tr>
<td>4000</td>
<td>18996</td>
<td>138337</td>
<td>19114</td>
<td>728</td>
<td>101</td>
</tr>
</tbody>
</table>

The results show that in all cases OpenCV works about in the same way (time only increases with image size, which is quite logical), and options of the floating point are much more efficiently implemented in Eigen. The algorithm developed is also executed at about the same rate in all conditions, but generally, runs much faster than OpenCV — sometimes ten-fold. Eigen is also much more efficient than OpenCV, but the 8-bit unsigned numbers are still significantly inferior to the developed algorithm. It can be assumed that this library does not have a direct implementation for this type of values, so it is executed by additional conversions to floating point numbers, which causes a delay. When using floating-point numbers, the algorithm for the given values is still faster. For higher values, the efficiency of the Eigen library approaches the developed algorithm.

Testing was performed on various sizes and types using vipmed software (designed for in-house use by VIT).

**Conclusions**

The proposed matrix multiplication method gives the expected speed of matrix multiplication operations (not slower than existing analogues) and has passed evaluation test for use.

The efficiency of the method is tested by practical application.

The method is used in the corporate library of one of the leading companies in Ukraine.

The article can help to understand how pre-loaders and caches are work and how to use them in operations such as matrix multiplication.

This algorithm do not consider that block sizes are depend on cache size. Therefore, speed on machines with a different cache size can be unexpected.

For future work, we need to explore more existed libraries with matrix multiplication. It will help understand in what way we can move next.

**References**


І.А. Дичка, Д.А. Винник, Ю.В. Бухтіяров, В.Я. Юрчишин

МЕТОД РЕАЛІЗАЦІЇ ШВІДКОГО МАТРИЧНОГО МНОЖЕННЯ ПІД АРХІТЕКТУРОЮ ARM IЗ ВИКОРИСТАННЯМ SIMD-ІНСТРУКЦІЙ

Проблематика. Матричне множення є досить складним алгоритмом із великою кількістю операцій. Додатковою проблемою також є нелінійний обхід матриць по пам'яті. Операція матричного множення широко використовується в різних сферах, таких як нейронні мережі, розв’язки систем лінійних рівнянь, матричні перетворення тощо. Тож важливо розробити метод матричного множення, що враховуватиме проблеми з розташуванням матриць у пам'яті, а також ефективно розпоряджатись даними при їх попередньому використанні.

Мета дослідження. Розробити метод швидкого матричного множення двох матриць, множення матриць на транспоновану та на список векторів (у т.ч. окремий випадок для одного вектора); розрахувати його у вигляді функції з оптимізацією для процес- сорів архітектури ARM. Функція має вміти працювати з різними типами данних та з підмatriцями. Цілочисловий результат може бути масштабований.

Методика реалізації. Головними ідеями розробленого методу є одночасних проход декількома рядками/стовпчики вхідних матриць та їх розбиття на блоки, що дасть алгоритму змогу деякий час працювати на одній і тій самій пам'яті. Для реалізації було вибрано мову програмування С. Для збільшення продуктивності використано SIMD-інструкції. Для ефективної реалізації під архітектуру ARM також необхідно правильно організувати роботу з попереднім завантаженням пам'яті.

Результати дослідження. Реалізовано функцію, що виконує матричне множення за розробленим методом із необхідними параметрами. Перевири на різних розмірах та типах показали, що реалізована функція є швидкою за аналогії з бібліотек OpenCV2 та Eigen 3. Тестування підбивалося за допомогою утиліти вірнед для запусків і замірів характеристик, розробленої для корпора- тивного користування у компанії VIT.

Висновки. Запропонований метод множення матриць дає очікуване приріст зміни операції множення матриць, пройшов оцінювальний тест на використання та відповідає заданим у меті вимогам. Для подальшої роботи необхідно детальніше дослідити вплив кшів різного рівня та порівняти з іншими існуючими бібліотеками.

Ключові слова: матричне множення; архітектура ARM; векторні операції; транспонування матриці.

І.А. Дичка, Д.А. Винник, Ю.В. Бухтіяров, В.Я. Юрчишин

МЕТОД РЕАЛІЗАЦІЇ БІСТРОГО МАТРИЧНОГО УМНОЖЕННЯ ПОД АРХІТЕКТУРУ ARM ІЗ ВИКОРИСТАННЯМ SIMD-ИНСТРУКЦІЙ

Проблематика. Матричне умножение является достаточно сложным алгоритмом с большим количеством операций. Дополнительной проблемой также является нелинейный обход матриц по памяти. Операція матричного умножения широко используется в различных сферах, таких как нейронные сети, решение систем линейных уравнений, матричные преобразования и т.п. Поэтому важно разработать метод матричного умножения, который будет учитывать проблемы расположения матриц в памяти, а также эффективно будет распределяться данными при их повторном использовании.

Цель исследования. Разработать метод быстрого матричного умножения двух матриц, умножения матрицы на транспонированную и на список векторов (в т.ч. частный случай для одного вектора); реализовать его в виде функции с оптимизацией для процессоров архитектуры ARM. Функция должна уметь работать с различными типами данных и с подматрицами. Целочисленный результат может быть отмасштабирован.

Методика реализации. Главными идеями разработанного метода является одновременный проход несколькими стро-ками/столбцами входных матриц и их разбиение на блоки, что позволяет алгоритму некоторое время работать на одной и той же памяти. Для реализации был выбран язык программирования С. Для увеличения производительности использованы SIMD- инструкции. Для эффективной реализации под архитектуру ARM также необходимо правильно организовать работу с предварительной загрузкой памяти.

Результаты исследования. Реализованная функция, которая выполняет матричное умножение по разработанному методу с необходимыми параметрами. Проверки на разных размерах и типах показали, что реализованная функция быстрее аналогов из библиотек OpenCV2 и Eigen 3. Тестирование проходило с помощью утилит вирнед для запусков и замеров характеристик, разработанной для корпоративного пользования в компании VIT.

Выводы. Предложенный метод умножения матриц дает ожидаемое ускорение операции умножения матриц, прошел оценочный тест на использование и соответствует заданным в целях требованиям. Для дальнейшей работы необходимо подробнее исследовать влияние кшів разного уровня и сравнить с другими существующими библиотеками.

Ключевые слова: матричное умножение; архитектура ARM; векторные операции; транспонирование матриц.

Рекомендована Радою факультету прикладной математики КПІ ім. Ігоря Сікорського

Надійшла до редакції
03 лютого 2020 року

Прийнята до публікації
05 червня 2020 року