ARM NEON optimization

Posted on 2021-5-20 14:07

1. Introduction

After reading the article ARM NEON programming quick reference, you should have a basic understanding of ARM NEON programming. But when applying NEON to real-world applications, there are many programming skills to observe. This article introduces some common NEON optimization skills drawn from development practice, and also discusses the choice between NEON assembly and intrinsics.
2. NEON optimization skills
When using NEON to optimize applications, the following skills are commonly used.

2.1 Remove data dependencies

On the ARMv7-A platform, NEON instructions usually take more cycles than ARM instructions. To reduce instruction latency, it is better to avoid using the destination register of the current instruction as a source register of the next instruction.

Example:

C code:
float SumSquareError_C(const float* src_a, const float* src_b, int count)
{
  float sse = 0.0f;
  int i;
  for (i = 0; i < count; ++i) {
    float diff = src_a[i] - src_b[i];
    sse += (float)(diff * diff);
  }
  return sse;
}

NEON implementation 1:

float SumSquareError_NEON1(const float* src_a, const float* src_b, int count)
{
  float sse;
  asm volatile (
    // Clear the accumulators q8-q11
    "veor       q8, q8, q8                 \n"
    "veor       q9, q9, q9                 \n"
    "veor       q10, q10, q10              \n"
    "veor       q11, q11, q11              \n"
  "1:                                      \n"
    "vld1.32    {q0, q1}, [%[src_a]]!      \n"
    "vld1.32    {q2, q3}, [%[src_a]]!      \n"
    "vld1.32    {q12, q13}, [%[src_b]]!    \n"
    "vld1.32    {q14, q15}, [%[src_b]]!    \n"
    "subs       %[count], %[count], #16    \n"
    // q0-q3 are the destinations of vsub and, immediately
    // after, the sources of vmla.
    "vsub.f32   q0, q0, q12                \n"
    "vmla.f32   q8, q0, q0                 \n"
    "vsub.f32   q1, q1, q13                \n"
    "vmla.f32   q9, q1, q1                 \n"
    "vsub.f32   q2, q2, q14                \n"
    "vmla.f32   q10, q2, q2                \n"
    "vsub.f32   q3, q3, q15                \n"
    "vmla.f32   q11, q3, q3                \n"
    "bgt        1b                         \n"
    "vadd.f32   q8, q8, q9                 \n"
    "vadd.f32   q10, q10, q11              \n"
    "vadd.f32   q11, q8, q10               \n"
    "vpadd.f32  d2, d22, d23               \n"
    "vpadd.f32  d0, d2, d2                 \n"
    "vmov.32    %[sse], d0[0]              \n"
    : [src_a] "+r"(src_a),
      [src_b] "+r"(src_b),
      [count] "+r"(count),
      [sse] "=r"(sse)
    :
    : "memory", "cc", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11",
      "q12", "q13", "q14", "q15");
  return sse;
}

NEON implementation 2:
float SumSquareError_NEON2(const float* src_a, const float* src_b, int count)
{
  float sse;
  asm volatile (
    // Clear the accumulators q8-q11
    "veor       q8, q8, q8                 \n"
    "veor       q9, q9, q9                 \n"
    "veor       q10, q10, q10              \n"
    "veor       q11, q11, q11              \n"
  "1:                                      \n"
    "vld1.32    {q0, q1}, [%[src_a]]!      \n"
    "vld1.32    {q2, q3}, [%[src_a]]!      \n"
    "vld1.32    {q12, q13}, [%[src_b]]!    \n"
    "vld1.32    {q14, q15}, [%[src_b]]!    \n"
    "subs       %[count], %[count], #16    \n"
    "vsub.f32   q0, q0, q12                \n"
    "vsub.f32   q1, q1, q13                \n"
    "vsub.f32   q2, q2, q14                \n"
    "vsub.f32   q3, q3, q15                \n"
    "vmla.f32   q8, q0, q0                 \n"
    "vmla.f32   q9, q1, q1                 \n"
    "vmla.f32   q10, q2, q2                \n"
    "vmla.f32   q11, q3, q3                \n"
    "bgt        1b                         \n"
    "vadd.f32   q8, q8, q9                 \n"
    "vadd.f32   q10, q10, q11              \n"
    "vadd.f32   q11, q8, q10               \n"
    "vpadd.f32  d2, d22, d23               \n"
    "vpadd.f32  d0, d2, d2                 \n"
    "vmov.32    %[sse], d0[0]              \n"
    : [src_a] "+r"(src_a),
      [src_b] "+r"(src_b),
      [count] "+r"(count),
      [sse] "=r"(sse)
    :
    : "memory", "cc", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11",
      "q12", "q13", "q14", "q15");
  return sse;
}

In NEON implementation 1, the destination register is used as a source register immediately. In NEON implementation 2, the instructions are rescheduled to give each result as much latency as possible. The test result indicates that implementation 2 is ~30% faster than implementation 1, so reducing data dependencies can improve performance significantly. The good news is that the compiler can reschedule NEON intrinsics automatically to avoid data dependencies, which is one of their big advantages.

Note: this test ran on Cortex-A9. The result may be different on other platforms.
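The same dependency-breaking idea can be sketched in portable C (an illustration written for this discussion, not code from the article): splitting the single accumulator into four independent ones, mirroring q8-q11 above, lets the multiply-accumulate chains proceed without waiting on each other. The sketch assumes count is a multiple of 4.

```c
/* Plain-C sketch of the dependency-breaking technique (illustration,
 * not the article's code). Four independent accumulators play the
 * role of q8..q11 in the assembly above. Assumes count % 4 == 0. */
float SumSquareError_split(const float* src_a, const float* src_b, int count)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i + 3 < count; i += 4) {
        float d0 = src_a[i]     - src_b[i];
        float d1 = src_a[i + 1] - src_b[i + 1];
        float d2 = src_a[i + 2] - src_b[i + 2];
        float d3 = src_a[i + 3] - src_b[i + 3];
        s0 += d0 * d0;  /* each chain updates its own accumulator, */
        s1 += d1 * d1;  /* so no iteration has to wait for the     */
        s2 += d2 * d2;  /* previous multiply-accumulate to retire  */
        s3 += d3 * d3;
    }
    return (s0 + s1) + (s2 + s3);  /* combine once, at the end */
}
```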

2.2 Reduce branches

There is no branch instruction in the NEON instruction set; when a branch is needed, the ARM jump instructions are used. ARM processors use branch prediction extensively, but when a prediction fails the penalty is rather high, so it is better to avoid jump instructions. In fact, logical operations can replace branches in some cases.
Example:
C implementation:
if (flag)
{
    dst[x * 4]     = a;
    dst[x * 4 + 1] = a;
    dst[x * 4 + 2] = a;
    dst[x * 4 + 3] = a;
}
else
{
    dst[x * 4]     = b;
    dst[x * 4 + 1] = b;
    dst[x * 4 + 2] = b;
    dst[x * 4 + 3] = b;
}
NEON implementation:
// dst[x * 4]     = (a & flag) | (b & ~flag);
// dst[x * 4 + 1] = (a & flag) | (b & ~flag);
// dst[x * 4 + 2] = (a & flag) | (b & ~flag);
// dst[x * 4 + 3] = (a & flag) | (b & ~flag);
VBSL qFlag, qA, qB

The ARM NEON instruction set provides the following instructions to help implement the logical operation above:

  • VCEQ, VCGE, VCGT, VCLE, VCLT
  • VBIT, VBIF, VBSL

Reducing branches is not specific to NEON; it is a commonly used trick that is worth the effort even in plain C code.
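As a plain-C illustration of the mask-and-blend idea (hypothetical code, not from the article): a comparison is turned into an all-ones or all-zeros mask, and bitwise operations then pick one of the two candidates without a branch, which is what VCGT followed by VBSL does for each lane.

```c
#include <stdint.h>

/* Branchless select: the scalar analogue of VCGT + VBSL.
 * The comparison yields 1 or 0; negating it produces an
 * all-ones (0xFFFFFFFF) or all-zeros mask. */
uint32_t max_branchless(uint32_t x, uint32_t y)
{
    uint32_t mask = (uint32_t)-(int32_t)(x > y); /* like VCGT */
    return (x & mask) | (y & ~mask);             /* like VBSL */
}
```

On NEON the same pattern applies to a whole vector at once, with one comparison and one bit-select instruction per four 32-bit lanes.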

2.3 Preload data (PLD)

ARM processors use a load/store architecture: apart from load/store instructions, all operations work on registers. Therefore, increasing the efficiency of load/store instructions is very important when optimizing an application.

The preload instruction allows the processor to signal the memory system that a data load from an address is likely in the near future. If the data is preloaded into the cache correctly, it helps improve the cache hit rate, which can boost performance significantly. But preload is not a panacea: it is very hard to use well on recent processors, and a bad preload will actually reduce performance.

PLD syntax:
  • PLD{cond} [Rn {, #offset}]
  • PLD{cond} [Rn, +/-Rm {, shift}]
  • PLD{cond} label

Where:

cond   - an optional condition code.
Rn     - the register on which the memory address is based.
offset - an immediate offset. If offset is omitted, the address is the value in Rn.
Rm     - contains an offset value and must not be PC (or SP, in Thumb state).
shift  - an optional shift.
label  - a PC-relative expression.

Features of the PLD operation:

  • It is independent of load and store instruction execution.
  • It happens in the background while the processor continues to execute other instructions.
  • The offset needs to be tuned for the real use case.
2.4 Misc
In ARM NEON programming, different instruction sequences can be used to perform the same operation, but fewer instructions do not always produce better performance. It depends on the benchmark and profiling results of the specific case. Some special cases from development practice are listed below.

2.4.1 Floating-point VMLA/VMLS instructions

This example is specific to Cortex-A9; for other platforms, the result needs to be verified again.

Usually, VMUL+VADD/VMUL+VSUB can be replaced by VMLA/VMLS because fewer instructions are used. But compared to floating-point VMUL, floating-point VMLA/VMLS has a longer instruction latency. If there are no other instructions that can be inserted into the delay slot, using floating-point VMUL+VADD/VMUL+VSUB will show better performance.

A real-world example is the floating-point FIR function in Ne10. The code snippets are as follows:

Implementation 1: there is only one instruction (VEXT) between two VMLAs, and VMLA needs 9 execution cycles according to the table of NEON floating-point instruction timings.
VEXT qTemp1,qInp,qTemp,#1
VMLA qAcc0,qInp,dCoeff_0[0]
VEXT qTemp2,qInp,qTemp,#2
VMLA qAcc0,qTemp1,dCoeff_0[1]
VEXT qTemp3,qInp,qTemp,#3
VMLA qAcc0,qTemp2,dCoeff_1[0]
VMLA qAcc0,qTemp3,dCoeff_1[1]
Implementation 2: there is still a data dependency on qAcc0, but VADD/VMUL needs only 5 execution cycles.
VEXT qTemp1,qInp,qTemp,#1
VMLA qAcc0,qInp,dCoeff_0[0]
VMUL qAcc1,qTemp1,dCoeff_0[1]
VEXT qTemp2,qInp,qTemp,#2
VMUL qAcc2,qTemp2,dCoeff_1[0]
VADD qAcc0, qAcc0, qAcc1
VEXT qTemp3,qInp,qTemp,#3
VMUL qAcc3,qTemp3,dCoeff_1[1]
VADD qAcc0, qAcc0, qAcc2
VADD qAcc0, qAcc0, qAcc3

The benchmark shows that implementation 2 has better performance.

For more code details, see GitHub.

NEON floating-point instruction timings:

[Table image: NEON floating-point instruction timings, from appendix [ii]]

Where:

  • Cycles: instruction issue cycles
  • Result: cycles until the result is available

2.5 Summary

NEON optimization techniques are summarized as follows:

  • Utilize instruction delay slots as much as possible.
  • Avoid branches.
  • Pay attention to the cache hit rate.

3. NEON assembly and intrinsics

In the “ARM NEON programming quick guide”, there is a simple comparison of the pros and cons of NEON assembly and intrinsics:
[Table image: comparison of NEON assembly and intrinsics]
But the reality is far more complex than that comparison suggests, especially for ARMv7-A/v8-A cross-platform development. The following sections analyze this issue further with some examples.
3.1 Programming
For NEON beginners, intrinsics are easier to use than assembly. But experienced developers may be more familiar with NEON assembly programming, and need time to adapt to coding with intrinsics. Some issues that may occur in real development are described below.

3.1.1 Instruction flexibility

From the perspective of instruction usage, assembly is more flexible than intrinsics. This is mainly reflected in data loads and stores.

Example:

[Image: example of load/store flexibility]
This issue will be fixed as compilers are upgraded. Sometimes the compiler is already able to translate two intrinsics into one assembly instruction, such as:

[Image: compiler translating two intrinsics instructions into one assembly instruction]

Therefore, it is expected that intrinsics will gain the same flexibility as assembly instructions as the ARMv8 toolchain is upgraded.
3.1.2 Register allocation

When programming in NEON assembly, registers have to be allocated by the user, who must know exactly which registers are occupied. One of the benefits of programming with intrinsics is that users only need to define variables; the compiler allocates registers automatically. This is an advantage, but it can be a weakness in some cases. Practice has shown that using too many NEON registers simultaneously in intrinsics code can trigger a register-allocation problem in the gcc compiler: a lot of data is spilled to the stack, which greatly hurts the program's performance. Users should pay attention to this when programming with intrinsics. When there is a performance anomaly (such as the C version outperforming the NEON version), first check the disassembly for register-allocation issues. On ARMv8-A AArch64 there are more registers (32 128-bit NEON registers), so the impact of this issue is significantly reduced.

3.2 Performance and the compiler

On a given platform, the performance of NEON assembly is decided only by the implementation and has nothing to do with the compiler. The benefit is that you can predict and control the program's performance when hand-tuning the code; the downside is that there are no surprises.

Conversely, the performance of NEON intrinsics depends greatly on the compiler used; different compilers may deliver very different performance. In general, the older the compiler, the worse the performance. When compatibility with older compilers is needed, consider carefully whether intrinsics will fit the need. In addition, when fine-tuning the code, it is hard to predict how performance will change because of the compiler's intervention. But there may be surprises: sometimes intrinsics bring better performance than assembly. This is very rare, but it does occur.
The compiler also has an impact on the NEON optimization process. The following figure describes the general process of NEON implementation and optimization.

[Figure: process of NEON implementation and optimization]

NEON assembly and intrinsics share the same implementation process: coding, debugging, and performance testing. But their optimization steps differ. The methods of assembly fine-tuning are:

  • Change the implementation, such as changing the instructions used or adjusting the parallelism.
  • Adjust the instruction sequence to reduce data dependencies.
  • Try the skills described in section 2.
When fine-tuning assembly, a sophisticated approach is to:

  • Know exactly how many instructions are used.
  • Get the execution cycles of the program using the PMU (Performance Monitoring Unit).
  • Adjust the instruction sequence based on the timings of the instructions used, minimizing instruction delays as much as possible.
The disadvantage of this approach is that the changes are specific to one micro-architecture: when the platform is switched, the performance improvement may be lost. It is also very time-consuming for often comparatively small gains. Fine-tuning NEON intrinsics is more difficult:

  • Try the methods used in NEON assembly optimization.
  • Look at the generated assembly and check the data dependencies and register usage.
  • Check whether the performance meets the expectation. If yes, the optimization work is done; then the performance with other compilers needs to be verified again.

When porting ARMv7-A assembly code to intrinsics for ARMv7-A/v8-A compatibility, the performance of the assembly can be used as a reference, so it is easy to check whether the work is done. However, when intrinsics are used to optimize ARMv8-A code, there is no performance reference, and it is difficult to determine whether the performance is optimal. Based on experience with ARMv7-A, there may be doubt about whether assembly would perform better. I think the impact of this issue will become smaller and smaller as the ARMv8-A environment matures.
3.3 Cross-platform support and portability

Today, most existing NEON assembly code can only run on ARMv7-A platforms or in ARMv8-A AArch32 mode. To run it in ARMv8-A AArch64 mode, the code must be rewritten, which takes a lot of work. If the code is written with NEON intrinsics instead, it can run directly in ARMv8-A AArch64 mode. This cross-platform capability is one of the great advantages. Meanwhile, with intrinsics you only need to maintain one set of code for different platforms, which also significantly reduces the maintenance effort. However, because the hardware resources differ between the ARMv7-A and ARMv8-A platforms, sometimes two sets of code are still needed even with intrinsics. The FFT implementation in the Ne10 project is an example:
// radix-4 butterfly with twiddles
scratch[0].r = scratch_in[0].r;
scratch[0].i = scratch_in[0].i;
scratch[1].r = scratch_in[1].r * scratch_tw[0].r - scratch_in[1].i * scratch_tw[0].i;
scratch[1].i = scratch_in[1].i * scratch_tw[0].r + scratch_in[1].r * scratch_tw[0].i;
scratch[2].r = scratch_in[2].r * scratch_tw[1].r - scratch_in[2].i * scratch_tw[1].i;
scratch[2].i = scratch_in[2].i * scratch_tw[1].r + scratch_in[2].r * scratch_tw[1].i;
scratch[3].r = scratch_in[3].r * scratch_tw[2].r - scratch_in[3].i * scratch_tw[2].i;
scratch[3].i = scratch_in[3].i * scratch_tw[2].r + scratch_in[3].r * scratch_tw[2].i;
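Each twiddle multiply above is just a complex multiplication. A minimal self-contained version of that core step (an illustration; the struct name here is hypothetical, Ne10 defines its own complex types):

```c
/* Complex multiply, the core of the twiddle step in the butterfly:
 * out = a * w = (a.r + i*a.i) * (w.r + i*w.i). The struct is a
 * stand-in for Ne10's own complex type. */
typedef struct { float r, i; } cplx_f32;

cplx_f32 cmul(cplx_f32 a, cplx_f32 w)
{
    cplx_f32 out;
    out.r = a.r * w.r - a.i * w.i;
    out.i = a.i * w.r + a.r * w.i;
    return out;
}
```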

The above code snippet shows the basic element of the FFT, the radix-4 butterfly. From the code, the following can be concluded:
  • 20 64-bit NEON registers are needed if 2 radix-4 butterflies are executed in one loop.
  • 20 128-bit NEON registers are needed if 4 radix-4 butterflies are executed in one loop.

And, for ARMv7-A/v8-A AArch32 and v8-A AArch64:

  • There are 32 64-bit or 16 128-bit NEON registers on ARMv7-A/v8-A AArch32.
  • There are 32 128-bit NEON registers on ARMv8-A AArch64.
Considering the above factors, the FFT implementation of Ne10 eventually has an assembly version for ARMv7-A/v8-A AArch32, in which 2 radix-4 butterflies are executed per loop, and an intrinsics version for ARMv8-A AArch64, in which 4 radix-4 butterflies are executed per loop. This example illustrates that you need to pay attention to such exceptions when maintaining one set of code across the ARMv7-A/v8-A platforms.

3.4 Future

Many issues about using NEON assembly and intrinsics have been discussed, but they are temporary. In the long term, intrinsics will be the better choice. By using intrinsics, you can reap the benefits of hardware and compiler upgrades without reprogramming; some classical algorithms only need to be implemented once.

The compiler will help adapt such code to new hardware, which reduces the workload significantly. Pffft is an example. The following figure shows the performance of pffft and the Ne10 real FFT on an ARM Cortex-A9 platform with gcc. The X-axis is the FFT length; the Y-axis is the execution time, where smaller is better. Pffft is implemented with NEON intrinsics; the Ne10 real FFT is implemented with NEON assembly. They do not use the same algorithm, but they have similar performance.

[Figure: performance of pffft and the Ne10 real FFT on Cortex-A9 (ARMv7-A)]

In ARMv8-A AArch64 mode, the Ne10 real FFT was rewritten with both NEON assembly and intrinsics. Section 3.3 explained that ARMv8-A can process 4 butterflies in parallel while ARMv7-A can only process 2, so theoretically the effect of FFT optimization on ARMv8-A should be better than on ARMv7-A. However, the following figure makes it clear that pffft has the best performance. This result suggests that the compiler performs very good optimizations specific to the ARMv8 architecture.

[Figure: performance of pffft and the Ne10 real FFT in ARMv8-A AArch64 mode]

From this example it can be concluded that pffft, whose performance is not the best on ARMv7-A, shows very good performance in ARMv8-A AArch64 mode. That proves the point: the compiler can adjust intrinsics automatically for ARMv8-A to achieve good performance. In the long run, existing NEON assembly code has to be rewritten for ARMv8, and if NEON is upgraded again in the future, the code has to be rewritten again and again. NEON intrinsics code, by contrast, can be expected to perform well on ARMv8-A with the help of compilers; even if NEON is upgraded, you can look forward to compiler upgrades instead of rewrites.

3.5 Summary

In this section, the pros and cons of NEON assembly and intrinsics have been analyzed with examples. The benefits of intrinsics far outweigh the drawbacks: compared to assembly, intrinsics are easier to program and have better compatibility between ARMv7 and ARMv8. Things to watch when using NEON intrinsics:

  • The number of registers used
  • The compiler and its version
  • Do look at the generated assembly
4. End

This blog mainly discusses some common NEON optimization skills and analyzes the pros and cons of NEON assembly and intrinsics with some examples. I hope this helps NEON developers in actual development.
