Benchmarks
From Albatross
Motivation
Because the Intel XScale PXA255 CPU has no Floating Point Unit (FPU), operations on floating point numbers (float and double data types in C) are emulated by software, and as a result floating point performance is quite poor.
Floating Point Emulation
The classical technique for doing floating point on ARM is to pretend there is a hardware FPU present, and use the hardware FPU instructions. Because the processor has no FPU, these instructions are invalid and are caught by the operating system's undefined instruction trap. The trap handler in the kernel emulates the floating point operation using a number of integer and logic instructions, and returns the result to the user space program. This is known as a Floating Point Emulator (FPE). ARM Linux has two FPEs to choose from: NetWinder FPE and FastFPE. NetWinder supports both single and double precision values, correctly handles floating point exceptions, and is regarded as accurate and reliable. FastFPE seems to support only single-precision values and does not always handle floating point exceptions correctly. It is not as reliable or accurate as NetWinder.
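To give a feel for what "emulating the operation using integer and logic instructions" involves, here is a minimal sketch of a single-precision multiply done with integer arithmetic only. It is not code from NetWinder FPE or FastFPE; the name soft_mul is made up for this example, and it assumes IEEE 754 single-precision floats. It handles only normal, finite operands, truncates instead of rounding, and ignores zero, denormals, infinities, NaNs, overflow and underflow.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back (assumes IEEE 754). */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Hypothetical helper: multiply two normal, finite floats using only
 * integer operations. Truncates the result; no special-case handling. */
float soft_mul(float a, float b)
{
    uint32_t ua = float_bits(a), ub = float_bits(b);

    /* Sign of the product is the XOR of the operand signs. */
    uint32_t sign = (ua ^ ub) & 0x80000000u;

    /* Add the biased exponents and remove one bias of 127. */
    int32_t exp = (int32_t)((ua >> 23) & 0xFF) + (int32_t)((ub >> 23) & 0xFF) - 127;

    /* Restore the implicit leading 1 of each 24-bit significand. */
    uint64_t ma = (ua & 0x7FFFFFu) | 0x800000u;
    uint64_t mb = (ub & 0x7FFFFFu) | 0x800000u;

    /* 24 x 24 bit product: the top bit is bit 46 or bit 47. */
    uint64_t prod = ma * mb;
    if (prod & (1ULL << 47)) {
        prod >>= 24;   /* product was in [2, 4): renormalise */
        exp += 1;
    } else {
        prod >>= 23;   /* product was in [1, 2) */
    }

    return bits_float(sign | ((uint32_t)exp << 23) | ((uint32_t)prod & 0x7FFFFFu));
}
```

Even this stripped-down version needs a dozen or so integer operations for one multiply, before any trap entry and exit cost is added, which is consistent with the poor FPE figures in the results.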
Soft Float
The other approach is to modify the C compiler to not use hardware FPU instructions, and instead replace FP operations with function calls to a library that performs the FP emulation in the same user-space program. This eliminates two or more context switches between the user space application and the kernel, greatly increasing speed. The disadvantage is that everything on the entire system, including the C libraries, has to be completely rebuilt so that it does not use hardware FPU instructions (the ones an FPE would otherwise have to trap). With GCC, the soft-float options are used when building a cross-compiler. The soft-float in the GCC we are using supports both single and double precision correctly, and can handle FP exceptions correctly.
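As a rough illustration, assuming a GCC-based soft-float toolchain: a function like the one below contains no FPU instructions after compilation. The multiply becomes an ordinary call to a support routine in libgcc (commonly __muldf3 for doubles; the exact helper name depends on the toolchain and ABI, so treat it as an assumption). The function name scale_double is made up for this example.

```c
/* Hypothetical example function. With a soft-float compiler, the
 * expression q * x is compiled into a plain function call to a libgcc
 * helper (assumed here to be __muldf3) rather than an FPU instruction,
 * so no undefined-instruction trap and no kernel entry is involved. */
double scale_double(double q, double x)
{
    return q * x;
}
```

The C source is identical for all three FP methods; only the toolchain and the libraries it links against differ, which is why the whole system has to be rebuilt.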
Benchmarks
I have used three benchmarks to evaluate relative performance of the three FP methods (NetWinder FPE, FastFPE, soft-float). Two are commonly used real-world benchmarks, and the last was hacked together to time individual FP operations.
- Linpack: performs Gaussian elimination (row reduction with partial pivoting) on a large (200x200) matrix of floating point values. A mix of FP operations (add, subtract, multiply and divide) is exercised, but it is not clear how many of each are used. The output is a compensated figure in MFLOPS, i.e. millions of floating point operations performed per second. It is not as dependent on memory subsystem performance as STREAM.
- STREAM: [1] exercises the memory system and also individual floating-point operations (add; scale, i.e. multiply; triad, i.e. multiply-accumulate). Outputs data streaming speeds in MB/s for each test:
name     kernel                   bytes/iter   FLOPS/iter
------------------------------------------------------------------
COPY:    a(i) = b(i)              16           0
SCALE:   a(i) = q*b(i)            16           1
SUM:     a(i) = b(i) + c(i)       24           1
TRIAD:   a(i) = b(i) + q*c(i)     24           2
- MyFlops: performs each floating point primitive operation 1 million times and measures how long it takes. Outputs the time in microseconds for each primitive for single or double precision.
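The MyFlops source is not reproduced here, so the following is only a sketch of the general approach it describes: repeat one primitive a million times and divide the elapsed time by the iteration count. The function name time_double_op_us and the operand values are assumptions made for this example. The operands are volatile so the compiler cannot fold the arithmetic away; loop and switch overhead is included in the reported time.

```c
#include <time.h>

#define ITERATIONS 1000000L

/* Hypothetical helper: average time in microseconds for one
 * double-precision operation. op is one of '+', '-', '*', '/'.
 * The measurement includes loop and switch overhead. */
double time_double_op_us(char op)
{
    volatile double a = 1.2345, b = 6.789, r = 0.0;
    clock_t start, end;
    long i;

    start = clock();
    for (i = 0; i < ITERATIONS; i++) {
        switch (op) {
        case '+': r = a + b; break;
        case '-': r = a - b; break;
        case '*': r = a * b; break;
        case '/': r = a / b; break;
        default:  break;
        }
    }
    end = clock();
    (void)r;

    /* Total CPU seconds -> microseconds per iteration. */
    return (double)(end - start) / CLOCKS_PER_SEC * 1e6 / ITERATIONS;
}
```

Calling time_double_op_us('/') would give a figure comparable in kind to the "div" entries in the results, although the absolute numbers depend on how much of the loop overhead the real benchmark subtracts.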
Results
STREAM
- NWFPE, double precision:
Function   Rate (MB/s)   Avg time   Min time   Max time   MFLOPS
Copy:      84.9866       0.3767     0.3765     0.3768     -
Scale:     3.9847        8.0630     8.0306     8.1525     0.25
Add:       4.8545        9.9009     9.8878     9.9239     0.20
Triad:     3.2661        14.7567    14.6964    14.8879    0.27
- FastFPE, double precision (kind of...):
Function   Rate (MB/s)   Avg time   Min time   Max time   MFLOPS
Copy:      84.9128       0.3769     0.3769     0.3771     -
Scale:     7.7499        4.1509     4.1291     4.1619     0.48
Add:       10.1022       4.7602     4.7514     4.7780     0.42
Triad:     6.8614        7.0079     6.9957     7.0141     0.57
- Soft-float, double precision:
Function   Rate (MB/s)   Avg time   Min time   Max time   MFLOPS
Copy:      90.9437       0.3520     0.3519     0.3521     -
Scale:     52.6172       0.6332     0.6082     0.6459     3.28
Add:       54.2478       0.9013     0.8848     0.9096     2.26
Triad:     34.9572       1.3735     1.3731     1.3739     2.91
We can see that memory might be a limiting factor for Soft-Float (or at least contributing to the performance in some way), because the FP tests Scale and Add use over half of the memory subsystem bandwidth measured by Copy.
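The MFLOPS column appears to follow directly from the streaming rate and the bytes/iter and FLOPS/iter figures listed for each STREAM kernel, to within rounding. A minimal sketch of that arithmetic, assuming 1 MB = 10^6 bytes as STREAM reports it (the function name stream_mflops is made up for this example):

```c
/* Hypothetical helper: convert a STREAM rate in MB/s into MFLOPS.
 * iterations per second (in millions) = rate / bytes per iteration,
 * then multiply by the floating point operations per iteration. */
double stream_mflops(double rate_mb_per_s, int bytes_per_iter, int flops_per_iter)
{
    return rate_mb_per_s / bytes_per_iter * flops_per_iter;
}
```

For example, the Soft-float Scale rate of 52.6172 MB/s with 16 bytes and 1 FLOP per iteration gives roughly 3.3 MFLOPS, and the Triad rate of 34.9572 MB/s with 24 bytes and 2 FLOPs per iteration gives roughly 2.9 MFLOPS.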
Linpack
- Soft-Float, single precision: 5.73 MFLOPS.
- Soft-Float, double precision: 3.31 MFLOPS.
- NetWinder FPE, single precision: 0.41 MFLOPS.
- NetWinder FPE, double precision: 0.32 MFLOPS.
- FastFPE, single precision: 1.17 MFLOPS.
- FastFPE, double precision (kind of): 1.11 MFLOPS.
MyFlops
- NetWinder FPE:
  - Single Precision:
    - add - 1.25 us
    - sub - 1.36 us
    - mul - 1.54 us
    - div - 1.89 us
  - Double Precision:
    - add - 2.02 us
    - sub - 2.28 us
    - mul - 1.91 us
    - div - 3.66 us
- FastFPE (single precision only):
  - Single Precision (float):
    - add - 0.34 us
    - sub - 0.40 us
    - mul - 0.34 us
    - div - 2.30 us
- Soft-Float:
  - Single Precision (float):
    - add - 0.19 us
    - sub - 0.19 us
    - mul - 0.13 us
    - div - 0.36 us
  - Double Precision (double):
    - add - 0.27 us
    - sub - 0.27 us
    - mul - 0.21 us
    - div - 1.53 us
FP Conclusions
Soft-float is by far the fastest, at about 6x faster than FastFPE and about 11x faster than NetWinder FPE. It is also more elegant in my opinion; emulating an FPU strikes me as a bit of a dirty hack.
Integer/General Benchmarks
These benchmarks measure general and integer performance; they contain no floating point tests.
- Dhrystone v2.1:
- PXA255 @ 400 MHz, Non-soft-float toolchain: 652046 Dhrystones/Sec (1.5 us per Dhrystone) => 371.1 VAX MIPS rating.
- PXA255 @ 400 MHz, Soft-float toolchain: 652004 Dhrystones/Sec (1.5 us per Dhrystone) => 371.0 VAX MIPS rating.
- Comparison: P4 PC @ 1.6 GHz: 2174243.9 Dhrystones/Sec (0.5 us per Dhrystone) => 1237.475 VAX MIPS rating.
- Dhrystone v1.1:
- PXA255 @ 400 MHz, either toolchain: 937836.6 Dhrystones/second => 533.772 VAX MIPS rating.
- Comparison: P4 PC @ 1.6 GHz: 2297088.4 Dhrystones/second => 1307.392 VAX MIPS rating.
These results are as expected for the PXA255; in fact, we can see that it performs (significantly) better per clock than the Pentium 4. Also, note that there is no difference in performance between the FPE and soft-float toolchains (as expected, because Dhrystone does not use any floating point operations).
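The VAX MIPS ratings and the per-clock comparison can be reproduced from the raw Dhrystone scores. The sketch below uses the conventional reference score of 1757 Dhrystones/second for the VAX 11/780 (1 VAX MIPS), which matches the ratings listed above; the function names are made up for this example.

```c
/* Conventional Dhrystone score of the VAX 11/780, defined as 1 VAX MIPS. */
#define VAX_11_780_DHRYSTONES_PER_SEC 1757.0

/* Hypothetical helper: VAX MIPS rating from a Dhrystones/second score. */
double vax_mips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC;
}

/* Hypothetical helper: VAX MIPS per MHz of CPU clock, for
 * comparing processors that run at different clock speeds. */
double vax_mips_per_mhz(double dhrystones_per_sec, double clock_mhz)
{
    return vax_mips(dhrystones_per_sec) / clock_mhz;
}
```

With the Dhrystone v2.1 figures, vax_mips_per_mhz(652046.0, 400.0) comes to roughly 0.93 for the PXA255 and vax_mips_per_mhz(2174243.9, 1600.0) to roughly 0.77 for the Pentium 4, which is the per-clock advantage referred to above.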
