ARMv8 with Scalable Vector Extension (SVE) - C++

I came across the point that ARMv8 now supports variable-length vector registers from 128 bits to 2048 bits (the Scalable Vector Extension, SVE).
It is always good to have wider registers to achieve data-level parallelism. But on what basis do we need to select the register size, from 128 bits to 2048 bits, to achieve maximum performance?
For example, I want to do Sobel filtering with a 3x3 mask on a 1920 x 1080 Y (luma) image. What register width do I need to select?

The Scalable Vector Extension is a module for the AArch64 execution state that extends the A64 instruction set. It is focused on high-performance computing rather than on media; for that you have NEON.
The register width is decided by the hardware designer/manufacturer depending on what that implementation is trying to solve. The possible vector lengths are multiples of 128 bits: 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, and 2048.
From the programmer's point of view, the programming model is vector-length agnostic, meaning that the same application will work on implementations with different register widths (vector lengths).
The specification is out; however, there is no hardware available yet with SVE implemented. For the time being, you can use the ARM Instruction Emulator (armie) to run your programs.
So, answering your question: unless you are manufacturing hardware, you need not select any specific vector length, as that varies from one implementation to another.
If you are testing with armie, you can select whichever length you want.

SVE essentially increments loop indices for you based on the hardware-defined vector width, so you don't have to worry about it.
Check out the worked Daxpy example at https://www.rico.cat/files/ICS18-gem5-sve-tutorial.pdf to understand what that means in more detail, and this minimal runnable example with an assertion.
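To make "vector-length agnostic" concrete, here is a minimal daxpy sketch of my own using the ACLE SVE intrinsics from arm_sve.h (assuming a toolchain with SVE support, e.g. -march=armv8-a+sve, run natively or under armie). Note that the loop never names a register width: svcntd() reports how many doubles fit in one vector on whatever implementation runs the code, and the svwhilelt_b64 predicate masks off the tail.
#include <arm_sve.h>
#include <cstdint>

// y[i] += a * x[i], with no cleanup loop and no hard-coded vector length.
void daxpy(double a, const double *x, double *y, int64_t n) {
    for (int64_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);        // lanes with i + lane < n are active
        svfloat64_t vx = svld1(pg, &x[i]);        // contiguous load of x
        svfloat64_t vy = svld1(pg, &y[i]);        // contiguous load of y
        vy = svmla_m(pg, vy, vx, svdup_f64(a));   // vy += vx * a on active lanes only
        svst1(pg, &y[i], vy);                     // store back the active lanes
    }
}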

Related

Why is GL_UNPACK_ALIGNMENT default 4?

I'm trying to upload RGB (no alpha channel) pixel data to the GPU using:
GLInternalFormat = GL_RGB8;
GLFormat = GL_RGB;
GLType = GL_UNSIGNED_BYTE;
Here's how my data is structured (u_int8* pixels):
RGBRGBRGB -> row 1
RGBRGBRGB -> row 2
RGBRGBRGB -> row 3
The texture width is not a multiple of 4 for this particular texture.
I was able to make it work using glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
I understand now how GL_UNPACK_ALIGNMENT works. My question now is: why is 1 not the default in OpenGL? (source: https://docs.gl/gl3/glPixelStore)
pname                Type     Initial value   Value range
GL_UNPACK_ALIGNMENT  integer  4               1, 2, 4, or 8
Is it safe to always set it to 1? Otherwise uploads break on textures whose width is not divisible by 4.
Why is the default value 4? It seems arbitrary...
A texture or other image is a sizable chunk of memory that you want copied from CPU memory to the GPU or vice versa. (Back in the early 1990s when OpenGL first appeared, GPUs were fixed-function graphics accelerators, not truly general purpose.) Just like copying a block of memory to/from your disk drive or network card, the copy should run as fast as possible.
In the early 1990s there were many more types of CPU around than today. Most of the ones that might be running OpenGL were 32-bit, and 64-bit RISCs were appearing.
Many of the RISC systems, including those sold by SGI, could only read and write 32-bit/4-byte values that were aligned on a 4-byte boundary. The M680x0 family, which was also popular for workstations, needed 2-byte alignment. Only the Intel x86 could read 32 bits from any boundary, and even it ran faster if 32-bit values were 4-byte aligned.
(On a system with a DMA controller that could run in parallel with the CPU, the controller would most likely have the same alignment and performance requirements as the CPU itself.)
So defaulting to 4-byte alignment gave the best performance on the most systems. Specifying alignment 1 would force the transfer to drop down to reading/writing a byte at a time: not necessarily 4x slower, but slower on most systems.
These early-1990s systems also had enough RAM and disk space that going from 24 bits per pixel to 32 bits wasn't too bad. You can still find image file/memory format definitions that have 8 unused bits just to get 4-byte alignment with 24-bit RGB.
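For reference, here is a small sketch of my own of the row-stride rule the unpack path applies: with GL_UNPACK_ALIGNMENT = a, each source row is assumed to start on an a-byte boundary, so its length in bytes is rounded up to a multiple of a. That rounding is exactly what breaks tightly packed RGB rows whose width is not a multiple of 4.
#include <cstddef>

// Effective row length, in bytes, that GL assumes when reading client pixel data.
std::size_t unpackRowStride(std::size_t widthPixels, std::size_t bytesPerPixel,
                            std::size_t alignment /* 1, 2, 4 or 8 */) {
    std::size_t rowBytes = widthPixels * bytesPerPixel;
    return (rowBytes + alignment - 1) / alignment * alignment;   // round up
}

// Example: a 10-pixel-wide tightly packed RGB row is 30 bytes.
//   unpackRowStride(10, 3, 4) == 32  -> GL expects 2 padding bytes per row
//   unpackRowStride(10, 3, 1) == 30  -> matches the u_int8* layout in the question,
//                                       which is why GL_UNPACK_ALIGNMENT = 1 fixes it.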

Why is bzip2's maximum blocksize 900k?

bzip2 (i.e. this program by Julian Seward) lists its available block sizes as 100k through 900k:
$ bzip2 --help
bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
usage: bzip2 [flags and input files in any order]
-1 .. -9 set block size to 100k .. 900k
This number corresponds to the hundred_k_blocksize value written into the header of a compressed file.
From the documentation, memory requirements are as follows:
Compression: 400k + ( 8 x block size )
Decompression: 100k + ( 4 x block size ), or
100k + ( 2.5 x block size )
At the time the original program was written (1996), I imagine 7.6M (400k + 8 * 900k) might have been a hefty amount of memory on a computer, but for today's machines it's nothing.
My question is two part:
1) Would better compression be achieved with larger block sizes? (Naively, I'd assume yes.) Is there any reason not to use larger blocks? How does the CPU time for compression scale with block size?
2) Practically, are there any forks of the bzip2 code (or alternate implementations) that allow for larger block sizes? Would this require significant revision to the source code?
The file format seems flexible enough to handle this. For example ... since hundred_k_blocksize holds an 8-bit character that indicates the block size, one could continue along the ASCII table to indicate larger block sizes (e.g. ':' = 0x3A => 1000k, ';' = 0x3B => 1100k, '<' = 0x3C => 1200k, ...).
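As a quick sketch of that idea (purely illustrative; real bzip2 only accepts '1' through '9'), the header byte maps to a block size like this:
#include <cassert>

// The level character stored after the "BZh" magic encodes the block size in
// units of 100k; counting past '9' is the hypothetical extension described above.
constexpr int blockSizeKB(char hundred_k_blocksize) {
    return (hundred_k_blocksize - '0') * 100;
}

static_assert(blockSizeKB('9') == 900,  "current maximum");
static_assert(blockSizeKB(':') == 1000, "hypothetical ':' extension");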
Your intuition that a larger block size should lead to a higher compression ratio is supported by Matt Mahoney's compilation of programs in his large text compression benchmark. For example, the open-source BWT program BBB (http://mattmahoney.net/dc/text.html#1640) shows a ~40% improvement in compression ratio going from a block size of 10^6 to 10^9. Between these two values, the compression time doubles.
Now that the xz program, which uses an LZ variant (LZMA2) originally described by the 7-Zip author Igor Pavlov, is beginning to overtake bzip2 as the default strategy for compressing source code, it is worth studying the possibility of upping bzip2's block size to see if it might be a viable alternative. Also, bzip2 avoided arithmetic coding due to patent restrictions, which have since expired. Combined with the possibility of using the fast asymmetric numeral systems for entropy coding developed by Jarek Duda, a modernized bzip2 could very well be competitive in both compression ratio and speed with xz.

Using only part of crc32

Will using only the upper or lower 2 bytes of a CRC-32 sum make it weaker than a CRC-16?
Background:
I'm currently implementing a wireless protocol.
I have chunks of 64 bytes each, and according to
Data Length vs CRC Length
I would need at most a CRC-16.
Using CRC-16 instead of CRC-32 would free up bandwidth for forward error correction (64 bytes is one block in the FEC).
However, my hardware is quite low-powered but has hardware support for CRC-32.
So my idea was to use the hardware CRC-32 engine and just throw away 2 of the result bytes.
I know that this is not a CRC-16 sum, but that does not matter because I control both sides of the transmission.
In case it matters: I can use either CRC-32 (poly 0x04C11DB7) or CRC-32C (poly 0x1EDC6F41).
Yes, it will be weaker, but only for small numbers of bit errors. By taking half of a CRC-32 instead, you get none of the guarantees of a CRC-16, e.g. the number of bits in a burst that are always detectable.
What is the noise source that you are trying to protect against?
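For illustration, here is a minimal sketch of the scheme the question proposes: a plain bitwise CRC-32 (the reflected form 0xEDB88320 of poly 0x04C11DB7, standing in for the hardware engine) with only the low 16 bits kept. As noted above, this is not a true CRC-16.
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (zlib/IEEE convention); a stand-in for the hardware engine.
uint32_t crc32(const uint8_t *data, std::size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Truncated check value; both ends of the link must agree on which half to keep.
uint16_t truncatedCheck(const uint8_t *chunk, std::size_t len /* 64 */) {
    return static_cast<uint16_t>(crc32(chunk, len) & 0xFFFFu);
}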

How to find out whether we are really using 48-, 56- or 64-bit pointers

I am using tricks to store extra information in pointers. At the moment some pointer bits are unused (the highest 16 bits), but this will change in the future. I would like a way to detect whether we are compiling or running on a platform that will use more than 48 bits of a pointer.
Related:
Why can't OS use entire 64-bits for addressing? Why only the 48-bits?
http://developer.amd.com/wordpress/media/2012/10/24593_APM_v2.pdf
The solution is needed for x86-64, Windows, C/C++, preferably something that can be done at compile time.
Solutions for other platforms are also of interest but will not be marked as the correct answer.
Windows has exactly one switch, for both 32-bit and 64-bit programs, that determines the top of their virtual address space:
IMAGE_FILE_LARGE_ADDRESS_AWARE
For both types, omitting it limits the program to the lower 2 GB of address space, severely reducing the memory an application can map and thus also reducing the effectiveness of address space layout randomization (ASLR, an attack-mitigation mechanism).
There is one upside to it though, and it is just what you seem to want: only the lower 31 bits of a pointer can be set, so pointers can be safely round-tripped through an int (32-bit integer, sign- or zero-extension), as the sketch below shows.
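A tiny sketch of my own of that round-trip (a real build relies on linking with /LARGEADDRESSAWARE:NO rather than on an assert):
#include <cassert>
#include <cstdint>

// Without IMAGE_FILE_LARGE_ADDRESS_AWARE every user-mode pointer lies below 2 GB,
// so narrowing to 32 bits and widening again is lossless.
void *roundTrip(void *p) {
    auto wide = reinterpret_cast<std::uintptr_t>(p);
    assert(wide <= 0x7FFFFFFFu);                   // holds when the flag is absent
    auto narrow = static_cast<std::uint32_t>(wide);
    return reinterpret_cast<void *>(static_cast<std::uintptr_t>(narrow));
}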
At run-time the situation is slightly better:
Just use the CPUID instruction, function EAX=80000008H, and read the maximum number of virtual address bits from bits 8-15 of EAX.
The OS cannot use more address bits than the CPU supports; remember that Intel insists on canonical addresses (sign-extended).
See here for how to use cpuid from C++: CPUID implementations in C++
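As a sketch of that run-time check (my own illustration, using the MSVC __cpuid intrinsic since the question targets Windows; GCC/Clang would use __get_cpuid from <cpuid.h> instead):
#include <intrin.h>

// CPUID leaf 0x80000008: EAX[7:0] = physical address bits,
// EAX[15:8] = virtual (linear) address bits, 48 on most current x86-64 parts.
int virtualAddressBits() {
    int regs[4] = {};                 // EAX, EBX, ECX, EDX
    __cpuid(regs, 0x80000008);
    return (regs[0] >> 8) & 0xFF;
}

// Usage for the pointer-tagging trick in the question:
//   if (virtualAddressBits() > 48) { /* the top 16 bits are no longer spare */ }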

OpenGL - How is GLenum an unsigned 32-bit integer?

To begin with, there are 8 types of Buffer Objects in OpenGL:
GL_ARRAY_BUFFER
GL_ELEMENT_ARRAY_BUFFER
GL_COPY_READ_BUFFER
...
They are enums, or more specifically GLenums, where GLenum is an unsigned 32-bit integer that can hold values up to about 4,294,967,295.
Most uses of buffer objects involve binding them to a certain target, e.g.:
glBindBuffer (GL_ARRAY_BUFFER, Buffers [size]);
[void glBindBuffer (GLenum target, GLuint buffer)] documentation
My question is: if it's an enum whose only values must be 0, 1, 2, 3, 4 .. 7, why go all the way and make it a 32-bit integer when it only has values up to 7? Pardon my knowledge of CS and OpenGL; it just seems wasteful.
Enums aren't used just for the buffers, but everywhere a symbolic constant is needed. Currently, several thousand enum values are assigned (look into your GL.h and the latest glext.h). Note that vendors get allocated their own official enum ranges so they can implement vendor-specific extensions without interfering with others, so a 32-bit enum space is not a bad idea. Furthermore, on modern CPU architectures, using fewer than 32 bits won't be any more efficient, so this is not a problem performance-wise.
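To illustrate the shared token space (the values below are the ones in the standard GL headers), the buffer-binding targets from the question are not numbered 0..7 but live alongside every other GL constant:
#include <cstdint>

// GL.h uses 'typedef unsigned int GLenum'; uint32_t is equivalent on common platforms.
typedef std::uint32_t GLenum;

constexpr GLenum GL_ARRAY_BUFFER         = 0x8892;
constexpr GLenum GL_ELEMENT_ARRAY_BUFFER = 0x8893;
constexpr GLenum GL_COPY_READ_BUFFER     = 0x8F36;
constexpr GLenum GL_TEXTURE_2D           = 0x0DE1;   // unrelated target, same token space

static_assert(sizeof(GLenum) == 4, "GLenum is 32 bits wide");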
UPDATE:
As Andon M. Coleman pointed out, currently only 16-bit enumerant ranges are being allocated. It might be useful to link to the OpenGL Enumerant Allocation Policies, which also contain the following remark:
Historically, enumerant values for some single-vendor extensions were allocated in blocks of 1000, beginning with the block [102000,102999] and progressing upward. Values in this range cannot be represented as 16-bit unsigned integers. This imposes a significant and unnecessary performance penalty on some implementations. Such blocks that have already been allocated to vendors will remain allocated unless and until the vendor voluntarily releases the entire block, but no further blocks in this range will be allocated.
Most of these seem to have been removed in favor of 16-bit values, but 32-bit values have been in use. In the current glext.h, one can still find some (obsolete) enumerants above 0xffff, such as:
#ifndef GL_PGI_misc_hints
#define GL_PGI_misc_hints 1
#define GL_PREFER_DOUBLEBUFFER_HINT_PGI 0x1A1F8
#define GL_CONSERVE_MEMORY_HINT_PGI 0x1A1FD
#define GL_RECLAIM_MEMORY_HINT_PGI 0x1A1FE
...
Why would you use a short anyway? In what situation would you ever save more than 8k of RAM (if the reports of nearly a thousand GLenums are correct) by using a short or uint8_t instead of GLuint for enums and const declarations? Considering the potential hardware incompatibilities and cross-platform bugs you would introduce, it's odd to try to save something like 8k of RAM even in the context of the original 2 MB Voodoo 3D graphics hardware, much less the SGI supercomputers OpenGL was created for.
Besides, modern x86 and GPU hardware aligns on 32 or 64 bits at a time; you would actually stall the CPU/GPU, as 24 or 56 bits of the register would have to be zeroed out and THEN read/written, whereas it could operate on a standard int as soon as it was copied in. From the start of OpenGL, compute resources have tended to be more valuable than memory: while you might do billions of state changes during a program's life, you'd be saving about 10 kB of RAM at most if you replaced every 32-bit GLuint enum with a uint8_t one. I'm trying hard not to be extra-cynical right now, heh.
There are valid reasons for types like uint8_t: large data buffers/algorithms where the data fits in that bit depth. The difference between 1024 ints and 1024 uint8_t variables on the stack is a few kilobytes; are we going to split hairs over that? Now consider a 4K raw bitmap image of 4000 x 2500 pixels at 32 bits per pixel: that is tens of megabytes, and it would be 8 times the size with 64-bit-per-channel RGBA buffers in place of standard 8-bit-per-channel RGBA8 buffers, or quadruple the size with 32-bit-per-channel encoding. Multiply that by the number of textures open or pictures saved, and trading a bit of CPU work for all that memory makes sense, especially in the context of that type of work.
That is where using a non-standard integer type makes sense. But unless you're on a 64k machine or something like an old-school beeper (good luck running OpenGL on that), trying to save a few bytes of memory on something like a const declaration or a reference counter is just wasting everyone's time.