Hardware crc32 instruction for Power7 (ppc64 architecture) CPU with big-endian byte order

I have seen hardware crc32 instructions for Intel and ARM processors. I am wondering if a similar crc32 instruction exists specifically for the Power7 processor. I have searched through the Power ISA manual but couldn't find any such CRC instruction. Any suggestions or comments would be greatly appreciated.

Related

What is the /d2vzeroupper MSVC compiler optimization flag doing?
I was reading through this Compiler Options Quick Reference Guide
for Epyc CPUs from AMD: https://developer.amd.com/wordpress/media/2020/04/Compiler%20Options%20Quick%20Ref%20Guide%20for%20AMD%20EPYC%207xx2%20Series%20Processors.pdf
For MSVC, to "Optimize for 64-bit AMD processors", they recommend enabling /favor:AMD64 /d2vzeroupper.
What /favor:AMD64 does is clear; there is documentation about it in the MSVC docs. But I can't seem to find /d2vzeroupper mentioned anywhere on the internet at all, no documentation anywhere. What does it do?
TL;DR: When using /favor:AMD64 add /d2vzeroupper to avoid very poor performance of SSE code on both current AMD CPUs and Intel CPUs.
Generally /d1... and /d2... are "secret" (undocumented) MSVC options to tune compiler behavior. /d1... options apply to the compiler front-end, /d2... options apply to the compiler back-end.
/d2vzeroupper enables compiler-generated vzeroupper instructions.
See Do I need to use _mm256_zeroupper in 2021? for more information.
Normally it is enabled by default. You can disable it with /d2vzeroupper-. See here: https://godbolt.org/z/P48crzTrb
The /favor:AMD64 switch suppresses vzeroupper, so /d2vzeroupper re-enables it.
Up-to-date Visual Studio 2022 has fixed that: /favor:AMD64 still emits vzeroupper, so /d2vzeroupper is no longer needed to enable it.
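If you want to check the behavior on your own toolchain, a minimal snippet along these lines (my own illustration; the file and function names are not from the original post) can be compiled with and without /favor:AMD64 and /d2vzeroupper, and the assembly listing (/FA) inspected for a vzeroupper before the return:

// Minimal AVX function for inspecting the compiler's vzeroupper placement.
// Example command lines (assumptions, adjust to your setup):
//   cl /O2 /arch:AVX2 /FA avx_sum.cpp
//   cl /O2 /arch:AVX2 /favor:AMD64 /FA avx_sum.cpp
//   cl /O2 /arch:AVX2 /favor:AMD64 /d2vzeroupper /FA avx_sum.cpp
#include <immintrin.h>

float sum8(const float* p) {
    __m256 v  = _mm256_loadu_ps(p);          // 256-bit load dirties the YMM upper halves
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    // With vzeroupper generation enabled, the compiler should place a
    // vzeroupper before returning to callers that may run legacy-SSE code.
    return _mm_cvtss_f32(s);
}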
Reason: current AMD optimization guides (available from AMD site; direct pdf link) suggest:
2.11.6 Mixing AVX and SSE
There is a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the
YMM registers contain non-zero data. Transitioning in either direction will cause a micro-fault to
spill or fill the upper 128 bits of all 16 YMM registers. There will be an approximately 100 cycle
penalty to signal and handle this fault. To avoid this penalty, a VZEROUPPER or VZEROALL
instruction should be used to clear the upper 128 bits of all YMM registers when transitioning from
AVX code to SSE or unknown code.
Older AMD processors did not need vzeroupper, so /favor:AMD64 implemented an optimization for them, even though it penalizes Intel CPUs. From the MS docs:
/favor:AMD64
(x64 only) optimizes the generated code for the AMD Opteron, and Athlon processors that support 64-bit extensions. The optimized code can run on all x64 compatible platforms. Code that is generated by using /favor:AMD64 might cause worse performance on Intel processors that support Intel64.

Reading/writing 32-bit data types on a 64-bit architecture

I have a question that needs some clarification. When I store a data type like a 4-byte integer, how does a processor with a 64-bit architecture read or write that 4-byte integer? As far as I know, the processor reads/writes a word, so is there any padding involved here? I would be thankful for some clarification, because I can't understand how it works, or maybe I'm missing something that I should read more about. Does it differ from compiler to compiler or language to language? Thanks a lot.
as I know the processor reads/writes a word
Word-oriented CPUs can load/store in 64-bit chunks, but also narrower chunks. (Storing the low part of a register, or loading with zero-extension or sign-extension). Capability to do narrow stores is fairly essential for writing device drivers for most hardware, as well as for implementing efficiently sized integers that don't waste a huge amount of cache footprint, and for some kinds of string processing.
Some CPUs (like x86-64) are not really word-oriented at all, and have about the same efficiency for every operand-size. Although the default operand-size in x86-64 machine code is 32-bit.
All mainstream 64-bit architectures natively support 32-bit operand-size, including even DEC Alpha which was aggressively 64-bit and chose not to provide 8-bit or 16-bit loads/stores. (See Can modern x86 hardware not store a single byte to memory? for more details)
There might be some highly obscure 64-bit architecture where only 64-bit load/store is possible, but that seems unlikely. Also note that most modern 64-bit ISAs evolved out of 32-bit ISAs.
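As a concrete illustration (my own minimal example, not from the original answer): on an x86-64 target, a 4-byte integer is accessed with 32-bit load/store instructions, with no padding of the access itself to 64 bits.

// Sketch: 32-bit accesses on a 64-bit architecture.
// On x86-64, store32 typically compiles to a 32-bit store such as
// "mov dword ptr [rcx], edx", and load32_widen to a zero-extending
// 32-bit load -- no 64-bit padding or read-modify-write of a full
// word is needed.
#include <cstdint>

void store32(uint32_t* p, uint32_t v) {
    *p = v;          // 4-byte store
}

uint64_t load32_widen(const uint32_t* p) {
    return *p;       // 4-byte load, zero-extended into a 64-bit register
}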

How can I determine how many AVX registers my processor has?

Currently I'm developing a function that computes an integral using AVX registers. I want to know if there are enough of them on my computer. How can I find that out?
Assuming a CPU with AVX at all (i.e. not Pentium/Celeron, even latest-generation):
32-bit mode always has 8 architectural YMM registers. 32-bit mode is mostly obsolete for high-performance computing.
64-bit mode has 16 YMM regs, or 32 with AVX512VL if you count using EVEX-encoded 256-bit versions of instructions.
In either case, these are renamed onto a larger physical register file (PRF), avoiding write-after-write and write-after-read hazards. https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ has some info about out-of-order execution window size being limited by PRF size, instead of by the ReOrder Buffer (ROB).
You could detect 64-bit mode with #if defined(__x86_64__) on most compilers, #if defined(_M_X64) on MSVC.
Compile-time detection of AVX is __AVX__, of AVX512VL is __AVX512VL__. (Mainstream CPUs with AVX-512 have AVX512VL; Xeon Phi (KNL / KNM) doesn't, supporting only legacy SSE or full-width AVX-512 ZMM.) You may want to only do runtime detection of AVX instead of enabling it as a baseline for all your source files, though.
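Put together, a compile-time sketch using the macros mentioned above might look like this (only the macro names come from the answer; the constant names are my own illustration):

// Compile-time view of the architectural YMM register count, based on the
// predefined macros mentioned above. Runtime CPU detection is still needed
// if you don't compile AVX in as a baseline.
#if defined(__x86_64__) || defined(_M_X64)
  #if defined(__AVX512VL__)
    constexpr int kArchYmmRegs = 32;   // EVEX encoding reaches ymm0..ymm31
  #elif defined(__AVX__)
    constexpr int kArchYmmRegs = 16;   // 64-bit mode, VEX encoding
  #else
    constexpr int kArchYmmRegs = 0;    // AVX not enabled for this translation unit
  #endif
#elif defined(__AVX__)
  constexpr int kArchYmmRegs = 8;      // 32-bit mode
#else
  constexpr int kArchYmmRegs = 0;
#endif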

Are there in x86 any instructions to accelerate SHA (SHA1/2/256/512) encoding?

For example, x86 has an instruction set for hardware-accelerated AES. But are there any x86 instructions to accelerate SHA (SHA-1/2/256/512) hashing, and which library is the fastest for computing SHA on x86?
Intel has upcoming instructions for accelerating the calculation of SHA-1 and SHA-256 hashes.
You can read more about them, how to detect whether your CPU supports them, and how to use them here.
(But not SHA-512, you'll still need to manually vectorize that with regular SIMD instructions. AVX512 should help for SHA-512 (and for SHA-1 / SHA-256 on CPUs with AVX512 but not SHA extensions), providing SIMD rotates as well as shifts, for example https://github.com/minio/sha256-simd)
It was hoped that Intel's Skylake microarchitecture would have them, but it doesn't. Intel CPUs with it are low-power Goldmont in 2016, then Goldmont Plus in 2017. Intel's first mainstream CPU with SHA extensions will be Cannon Lake. Skylake / Kaby Lake / Coffee Lake do not have them.
AMD Ryzen (2017) has SHA extension.
A C/C++ programmer is probably best off using OpenSSL, which will use whatever CPU features it can to hash quickly. (Including SHA extensions on CPUs that have them, if your version of OpenSSL is new enough.)
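For example, a minimal SHA-256 digest through OpenSSL's EVP interface could look roughly like the sketch below (assumes OpenSSL 1.1.0+ for EVP_MD_CTX_new); OpenSSL selects the fastest implementation at runtime, including the SHA-extension code path on CPUs that have it:

// Sketch: hashing a buffer with OpenSSL's EVP API (link with -lcrypto).
#include <openssl/evp.h>
#include <cstdio>

int main() {
    const unsigned char msg[] = "hello";
    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int len = 0;

    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha256(), nullptr);
    EVP_DigestUpdate(ctx, msg, sizeof(msg) - 1);   // exclude trailing '\0'
    EVP_DigestFinal_ex(ctx, digest, &len);
    EVP_MD_CTX_free(ctx);

    for (unsigned int i = 0; i < len; ++i)
        printf("%02x", digest[i]);
    printf("\n");
}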
Are there in x86 any instructions to accelerate SHA (SHA1/2/256/512) encoding?
It's November 2016 and the answer is finally Yes. But it's only SHA-1 and SHA-256 (and, by extension, SHA-224).
Intel CPUs with SHA extensions hit the market recently. It looks like the processors which support them are the Goldmont microarchitecture:
Pentium J4205 (desktop)
Pentium N4200 (mobile)
Celeron J3455 (desktop)
Celeron J3355 (desktop)
Celeron N3450 (mobile)
Celeron N3350 (mobile)
I looked through offerings at Amazon for machines with the architecture or the processor numbers, but I did not find any available (yet). I believe Acer had one laptop with a Pentium N4200 expected to be available in December 2016 that would meet testing needs.
For some of the technical details why it's only SHA-1, SHA-224 and SHA-256, then see crypto: arm64/sha256 - add support for SHA256 using NEON instructions on the kernel crypto mailing list. The short answer is, above SHA-256, things are not easily parallelizable.
You can find source code for both Intel SHA intrinsics and ARMv8 SHA intrinsics at Noloader GitHub | SHA-Intrinsics. They are C source files, and provide the compress function for SHA-1, SHA-224 and SHA-256. The intrinsic-based implementations increase throughput approximately 3× to 4× for SHA-1, and approximately 6× to 12× for SHA-224 and SHA-256.
2019 Update:
OpenSSL does use H/W acceleration when present.
On Intel's side, the Goldmont µarch (Atom series) has SHA-NI support, and desktop/mobile CPUs from Cannon Lake (10 nm) onwards have it; Cascade Lake server CPUs and older do not support it. Yes, support is non-linear on the timeline because parallel CPU/µarch lines exist.
In 2017 AMD released their Zen µarch, so all current server and desktop CPUs based on Zen fully support it.
My benchmark of openssl speed sha256 showed a 550% speed increase with a block size of 8 KiB.
For real 1 GB and 5 GB files loaded into RAM, the hashing was roughly 3× faster.
(Benchmarked on a Ryzen 1700 @ 3.6 GHz with 2933CL16 RAM; OpenSSL 1.0.1 without support vs. 1.1.1 with support.)
Absolute values for comparison against other hash functions:
sha1 (1.55 GHz): 721.1 MiB/s
sha256 (1.55 GHz): 668.8 MiB/s
sha1 (3.8 GHz): 1977.9 MiB/s
sha256 (3.8 GHz): 1857.7 MiB/s
See this for details until there's a way to add tables on SO.
CPUID identification, page 298: leaf 07h in EAX → EBX bit 29 == 1 (a runtime check along these lines is sketched after this list).
Intel's Instruction Set Reference, page 1264ff.
Agner Fog's Instruction tables where he benchmarks instruction latency/µops etc. (currently Zen, Goldmont, Goldmont Plus available)
Code example, SIMD comparison: minio/sha256-simd
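A minimal sketch of that CPUID check (my own illustration; assumes a compiler providing __cpuidex for MSVC or __get_cpuid_count for recent GCC/Clang):

// Sketch: runtime check for the SHA extensions feature bit,
// CPUID.(EAX=07H, ECX=0):EBX bit 29, as referenced above.
#ifdef _MSC_VER
  #include <intrin.h>
#else
  #include <cpuid.h>
#endif

bool cpu_has_sha_ext() {
#ifdef _MSC_VER
    int regs[4];                      // EAX, EBX, ECX, EDX
    __cpuidex(regs, 7, 0);
    return (regs[1] >> 29) & 1;
#else
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (ebx >> 29) & 1;
#endif
}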
Try something open source such as OpenSSL
I have personally used their MD5 hashing functions and those worked pretty well.
You might also want to take a look at hashlib2++.
As far as I know, Intel hasn't made a dedicated instruction set for SHA-1 or SHA-2. They may in upcoming architectures, as CodesInChaos indicated in a comment. The major component in most hashing algorithms is the XOR operation, which is already in the instruction set.

Using AVX CPU instructions: Poor performance without "/arch:AVX"

My C++ code uses SSE and now I want to improve it to support AVX when it is available. So I detect when AVX is available and call a function that uses AVX commands. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX.
To use AVX, it is necessary to include this:
#include "immintrin.h"
and then you can use AVX intrinsic functions like _mm256_mul_ps, _mm256_add_ps, etc.
The problem is that by default, VS2010 produces code that works very slowly and shows the warning:
warning C4752: found Intel(R) Advanced Vector Extensions; consider
using /arch:AVX
It seems VS2010 actually does not use AVX instructions, but instead emulates them. I added /arch:AVX to the compiler options and got good results. But this option tells the compiler to use AVX instructions everywhere when possible. So my code may crash on a CPU that does not support AVX!
So the question is how to make the VS2010 compiler produce AVX code, but only where I specify AVX intrinsics directly. For SSE it works: I just use SSE intrinsic functions and it produces SSE code without any compiler options like /arch:SSE. But for AVX it does not work for some reason.
2021 update: Modern versions of MSVC don't need manual use of _mm256_zeroupper() even when compiling AVX intrinsics without /arch:AVX. VS2010 did.
The behavior that you are seeing is the result of expensive state-switching.
See page 102 of Agner Fog's manual:
http://www.agner.org/optimize/microarchitecture.pdf
Every time you improperly switch back and forth between SSE and AVX instructions, you will pay an extremely high (~70) cycle penalty.
When you compile without /arch:AVX, VS2010 will generate SSE instructions, but will still use AVX wherever you have AVX intrinsics. Therefore, you'll get code that has both SSE and AVX instructions - which will have those state-switching penalties. (VS2010 knows this, so it emits that warning you're seeing.)
Therefore, you should use either all SSE, or all AVX. Specifying /arch:AVX tells the compiler to use all AVX.
It sounds like you're trying to make multiple code paths: one for SSE, and one for AVX.
For this, I suggest you separate your SSE and AVX code into two different compilation units (one compiled with /arch:AVX and one without). Then link them together and make a dispatcher that chooses based on what hardware it's running on, as sketched below.
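A rough sketch of such a dispatcher (my own illustration; sum_sse and sum_avx are hypothetical functions compiled in separate translation units, the AVX one built with /arch:AVX). Note that besides the AVX CPUID bit, the OS must also save YMM state (OSXSAVE + XGETBV):

// Sketch: runtime dispatch between an SSE and an AVX code path (MSVC-style
// intrinsics from <intrin.h>/<immintrin.h>).
#include <intrin.h>
#include <immintrin.h>

float sum_sse(const float* p, int n);   // built without /arch:AVX
float sum_avx(const float* p, int n);   // built with    /arch:AVX

static bool os_and_cpu_support_avx() {
    int regs[4];
    __cpuid(regs, 1);
    bool osxsave = (regs[2] >> 27) & 1;  // OS uses XSAVE/XRSTOR
    bool avx     = (regs[2] >> 28) & 1;  // CPU supports AVX
    if (!osxsave || !avx)
        return false;
    // Check that the OS actually saves XMM (bit 1) and YMM (bit 2) state.
    unsigned long long xcr0 = _xgetbv(0);
    return (xcr0 & 0x6) == 0x6;
}

// Resolved once at startup; later calls go straight through the pointer.
static float (*sum_impl)(const float*, int) =
    os_and_cpu_support_avx() ? sum_avx : sum_sse;

float sum(const float* p, int n) { return sum_impl(p, n); }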
If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.
tl;dr for old versions of MSVC only
Use _mm256_zeroupper(); or _mm256_zeroall(); around sections of code using AVX (before or after depending on function arguments). Only use option /arch:AVX for source files with AVX rather than for an entire project to avoid breaking support for legacy-encoded SSE-only code paths.
In modern MSVC (and the other mainstream compilers, GCC/clang/ICC), the compiler knows when to use a vzeroupper asm instruction. Forcing extra vzerouppers with intrinsics can hurt performance when inlining. See Do I need to use _mm256_zeroupper in 2021?
Cause
I think the best explanation is in the Intel article, "Avoiding AVX-SSE Transition Penalties" (PDF). The abstract states:
Transitioning between 256-bit Intel® AVX instructions and legacy Intel® SSE instructions within a program may cause performance penalties because the hardware must save and restore the upper 128 bits of the YMM registers.
Separating your AVX and SSE code into different compilation units may NOT help if you switch between calling code from both SSE-enabled and AVX-enabled object files, because the transition may occur when AVX instructions or assembly are mixed with any of (from the Intel paper):
128-bit intrinsic instructions
SSE inline assembly
C/C++ floating point code that is compiled to Intel® SSE
Calls to functions or libraries that include any of the above
This means there may even be penalties when linking with external code using SSE.
Details
There are 3 processor states defined by the AVX instructions, and one of the states is where all of the YMM registers are split, allowing the lower half to be used by SSE instructions. The Intel document "Intel® AVX State Transitions: Migrating SSE Code to AVX" provides a diagram of these states:
When in state B (AVX-256 mode), all bits of the YMM registers are in use. When an SSE instruction is called, a transition to state C must occur, and this is where there is a penalty. The upper half of all YMM registers must be saved into an internal buffer before SSE can start, even if they happen to be zeros. The cost of the transitions is on the "order of 50-80 clock cycles on Sandy Bridge hardware". There is also a penalty going from C -> A, as diagrammed in Figure 2.
You can also find details about the state-switching penalty causing this slowdown on page 130, Section 9.12, "Transitions between VEX and non-VEX modes" in Agner Fog's optimization guide (version updated 2014-08-07), referenced in Mystical's answer. According to his guide, any transition to/from this state takes "about 70 clock cycles on Sandy Bridge". Just as the Intel document states, this is an avoidable transition penalty.
Skylake has a different dirty-upper mechanism that causes false dependencies for legacy-SSE with dirty uppers, rather than one-time penalties. Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
Resolution
To avoid the transition penalties you can either remove all legacy SSE code, instruct the compiler to convert all SSE instructions to their VEX encoded form of 128-bit instructions (if compiler is capable), or put the YMM registers in a known zero state before transitioning between AVX and SSE code. Essentially, to maintain the separate SSE code path, you must zero out the upper 128-bits of all 16 YMM registers (issuing a VZEROUPPER instruction) after any code that uses AVX instructions. Zeroing these bits manually forces a transition to state A, and avoids the expensive penalty since the YMM values do not need to be stored in an internal buffer by hardware. The intrinsic that performs this instruction is _mm256_zeroupper. The description for this intrinsic is very informative:
This intrinsic is useful to clear the upper bits of the YMM registers when transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions. There is no transition penalty if an application clears the upper bits of all YMM registers (sets to ‘0’) via VZEROUPPER, the corresponding instruction for this intrinsic, before transitioning between Intel® Advanced Vector Extensions (Intel® AVX) instructions and legacy Intel® Supplemental SIMD Extensions (Intel® SSE) instructions.
In Visual Studio 2010+ (maybe even older), you get this intrinsic with immintrin.h.
Note that zeroing out the bits with other methods does not eliminate the penalty - the VZEROUPPER or VZEROALL instructions must be used.
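A minimal sketch of the manual approach (function and variable names are my own, not from the answer): run the AVX section, then call _mm256_zeroupper() before control returns to code that may execute legacy-SSE instructions.

// Sketch: zeroing the YMM upper halves before leaving an AVX section,
// to avoid the AVX<->SSE transition penalty described above.
#include <immintrin.h>

void scale_avx(float* dst, const float* src, int n, float k) {
    __m256 vk = _mm256_set1_ps(k);
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(v, vk));
    }
    _mm256_zeroupper();   // back to state A: YMM upper 128 bits are zero
    // Scalar/SSE tail or SSE-only callers can now run without the penalty.
    for (int i = n & ~7; i < n; ++i)
        dst[i] = src[i] * k;
}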
One automatic solution implemented by the Intel Compiler is to insert a VZEROUPPER at the beginning of each function containing Intel AVX code if none of the arguments are a YMM register or __m256/__m256d/__m256i datatype, and at the end of functions if the returned value is not a YMM register or __m256/__m256d/__m256i datatype.
In the wild
This VZEROUPPER solution is used by FFTW to generate a library with both SSE and AVX support. See simd-avx.h:
/* Use VZEROUPPER to avoid the penalty of switching from AVX to SSE.
See Intel Optimization Manual (April 2011, version 248966), Section
11.3 */
#define VLEAVE _mm256_zeroupper
Then VLEAVE(); is called at the end of every function using intrinsics for AVX instructions.