I have 2 functions:
unsigned long long getLineAsRow(unsigned long long board, int col) {
    unsigned long long column = (board >> (7 - (col - 1))) & col_mask_right;
    column *= magicLineToRow;
    return (column >> 56) & row_mask_bottom;
}

unsigned long long getDiagBLTR_asRow(unsigned long long board, int line, int row) {
    unsigned long long result = board & diagBottomLeftToTopRightPatterns[line][row];
    result = result << diagBLTR_shiftUp[line][row];
    result = (result * col_mask_right) >> 56;
    return result;
}
The only big difference I see is the access to a two-dimensional array, defined like
int diagBRTL_shiftUp[9][9] = {};
I call both functions 10,000,000 times:
getLineAsRow ... time used: 1.14237s
getDiagBLTR_asRow ... time used: 2.18997s
I tested it with cl (VC++) and g++; there was nearly no difference between the compilers.
The gap between the two functions is really huge, though. Do you have any advice?
What creates the difference between the execution times of your two functions really cannot be answered without seeing either the resulting assembler code or knowing which of the globals you are accessing are actually constants that can be compiled right into the code. Anyway, analyzing your functions, we see that
function 1
reads two arguments from stack, returns a single value
reads three globals, which may or may not be constants
performs six arithmetic operations (the two minuses in 7-(col-1) can be collapsed into a single subtraction)
function 2
reads three arguments from stack, returns a single value
reads one global, which may or may not be a constant
dereferences two pointers (not four, see below)
does five arithmetic operations (three which you see, two which produce the array indices)
Note that accesses to 2D arrays actually boil down to a single memory access. When you write diagBottomLeftToTopRightPatterns[line][row], your compiler transforms it to something like diagBottomLeftToTopRightPatterns[line*9 + row]. That's two extra arithmetic instructions, but only a single memory access. What's more, the result of the calculation line*9 + row can be recycled for the second 2D array access.
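As a small illustration of that lowering (just a sketch; the array name is reused from your code, the element type is my assumption, and the assert is only there to show the equivalence):
#include <cassert>

unsigned long long diagBottomLeftToTopRightPatterns[9][9] = {};

unsigned long long manualIndex(int line, int row) {
    // what the compiler effectively emits for a [9][9] array:
    // one multiply, one add, then a single load
    return *(&diagBottomLeftToTopRightPatterns[0][0] + line * 9 + row);
}

int main() {
    assert(manualIndex(3, 4) == diagBottomLeftToTopRightPatterns[3][4]);
    return 0;
}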
Arithmetic operations are fast (on the order of a single CPU cycle), reads from memory may take four to twenty CPU cycles. So I guess that the three globals you access in function 1 are all constants which your compiler built right into the assembler code. This leaves function 2 with more memory accesses, making it slower.
However, one thing bothers me: if I assume you have a normal CPU with at least 2 GHz clock frequency, your times suggest that your functions consume more than 200 or 400 cycles, respectively. This is significantly more than expected. Even if your CPU has no values in cache, your functions shouldn't take more than roughly 100 cycles. So I would suggest taking a second look at how you are timing your code; I assume that you have some more code in your measuring loop which spoils your results.
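For what it's worth, a minimal measuring loop would look roughly like this (a sketch only; it assumes getLineAsRow and its globals from your code are in scope, the test inputs are arbitrary, and the volatile sink is just there to keep the calls from being optimised away):
#include <chrono>
#include <cstdio>

int main() {
    volatile unsigned long long sink = 0;  // keeps the compiler from removing the calls
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000000; ++i)
        sink = sink + getLineAsRow(0x00FF00FF00FF00FFULL, 3);  // arbitrary test inputs
    auto stop = std::chrono::steady_clock::now();
    std::printf("time used: %fs\n",
                std::chrono::duration<double>(stop - start).count());
}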
Those functions do completely different things, but I assume that's not relevant to the question.
Sometimes these tests don't show the real cost of a function.
In this case the main cost is accessing the array in memory. After the first access it will be in the cache, and after that your function is going to be fast. So you don't really measure this characteristic. Even though the test does 10,000,000 iterations, you pay the price only once.
Now if you execute this function in a batch, calling it many times in bulk, then it's a non-issue. The cache will be warm.
If you access it sporadically, in an application which has high memory demands and frequently flushes the CPU caches, it could be a performance problem. But that of course depends on the context: how often it's called, etc.
Related
I'm developing software that runs on a DE10 board, on an ARM Cortex-A9 processor.
This software has to access physical memory addresses in order to communicate with the FPGA on the DE10, and this is done by mapping /dev/mem, a method described here.
I have a situation where I have to select which of 4 addresses to send some values to, and this could be done in one of two ways:
Using an if statement, checking an integer variable (which is always 0 or 1 at that part of the loop), and only writing if it's 1.
Multiplying the values that should be sent by the aforementioned variable and writing to all addresses without any conditional, because writing zero doesn't have any effect on my system (see the sketch below).
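Roughly, the two options would look like this (the names address, value and flag are just illustrative, not my actual code):
#include <cstdint>

// Option 1: check the 0/1 flag and only write if it's 1
void send_if(uint8_t* address, uint8_t value, int flag) {
    if (flag == 1)
        *address = value;
}

// Option 2: always write, multiplied by the flag; a 0 flag writes 0,
// which has no effect on this system
void send_mul(uint8_t* address, uint8_t value, int flag) {
    *address = value * flag;
}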
I was curious about which would be faster, so I tried this:
First, I made this loop:
int test = 0;
for (int i = 0; i < 1000000; i++)
{
    if (test == 9)
    {
        test = 15;
    }
    test++;
    if (test == 9)
    {
        test = 0;
    }
}
The first if statement should never be satisfied, so its only contribution to the time taken in the loop is from its comparison itself.
The increment and the second if statement are just things I added in an attempt to prevent the compiler from just "optimizing out" the first if statement.
This loop is run once without being benchmarked (just in case there's any frequency-scaling ramp-up, although I'm pretty sure there is none) and then it's run once again while being benchmarked; it takes around 18350 μs to complete.
Without the first if statement, it takes around 17260 μs.
Now, if I replace that first if statement with a line that sets the value of a memory-mapped address to the value of the integer test, like this:
for (int i = 0; i < 1000000; i++)
{
    *(uint8_t*)address = test;
    test++;
    if (test == 9)
    {
        test = 0;
    }
}
This loop takes around 253600 μs to complete, almost 14× slower.
Reading from that address instead of writing to it barely changes anything.
Is this what it really is, or is there some kind of compiler optimization possibly frustrating my benchmarking?
Should I expect this difference in performance (and thus favoring the comparison method) in the actual software?
I am trying to vectorize this for loop. After using the Rpass flag, I am getting the following remark for it:
int someOuterVariable = 0;
for (unsigned int i = 7; i != -1; i--)
{
    array[someOuterVariable + i] -= 0.3 * anotherArray[i];
}
Remark:
The cost-model indicates that vectorization is not beneficial
the cost-model indicates that interleaving is not beneficial
I want to understand what this means. Does "interleaving is not beneficial" mean the array indexing is not proper?
It's hard to answer without more details about your types, but in general, starting a loop incurs some cost and vectorising also implies some cost (such as moving data to/from SIMD registers and ensuring proper alignment of the data).
I'm guessing that the compiler is telling you that the vectorisation cost is bigger than simply running the 8 iterations without it, so it's not doing it.
Try increasing the number of iterations, or help the compiler compute alignment, for example.
Typically, unless the array's element type has exactly the proper alignment for a SIMD vector, accessing an array from an "unknown" offset (what you've called someOuterVariable) prevents the compiler from emitting efficient vectorised code.
EDIT: About the "interleaving" question, it's hard to guess without knowing your tool. But in general, interleaving usually means mixing 2 streams of computations so that the compute units of the CPU are all busy. For example, if you have 2 ALUs in your CPU and the program is doing:
c = a + b;
d = e * f;
The compiler can interleave the computations so that both the addition and the multiplication happen at the same time (provided you have 2 ALUs available). Typically, this means that the multiplication, which takes a bit longer to compute (for example 6 cycles), will be started before the addition (for example 3 cycles). You'll then get the result of both operations after only 6 cycles instead of the 9 you'd get if the compiler serialized the computations. This is only possible if there are no dependencies between the computations (if d required c, it could not work). A compiler is very cautious about this and, in your example, will not apply this optimization if it can't prove that array and anotherArray don't alias.
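As an illustration of the aliasing point (a sketch only; I'm assuming double elements since your types aren't shown, and __restrict is a compiler extension on GCC/Clang/MSVC rather than standard C++):
// Promising the compiler that the two arrays don't overlap removes one obstacle
// to vectorisation; whether it actually vectorises still depends on the cost model.
void scale_sub(double* __restrict array,
               const double* __restrict anotherArray,
               int someOuterVariable)
{
    for (unsigned int i = 0; i < 8; ++i)
        array[someOuterVariable + i] -= 0.3 * anotherArray[i];
}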
Suppose I have
x & (num - 1)
where x is an unsigned long long and num a regular int and & is the bitwise and operator.
I'm getting a significant speed reduction as the value of num increases. Is that normal behavior?
These are the other parts of the code that are affected:
int* hash = new int[num];
I don't think that the bitwise operation itself is getting slower; I think you're executing it a lot more times. And probably it isn't even the bitwise operation that's taking too long, but whatever else you're also doing more times.
Use a profiler.
If you're executing the code in a tight loop, it's wholly possible that you'll see the performance lessen the higher num gets. I'm guessing that your C++ compiler isn't able to find a native instruction to perform the & with an unsigned long long; as you've stated you're getting a slowdown for each power of two, I'd expect that the code that results from the & is repeatedly "dividing num" by 2 until it's zero and performing the AND bit by bit.
Another possibility is that the CPU you're running on is lame and doesn't perform AND in a fixed number of cycles.
Problem solved: it had to do with the CPU cache and locality. The larger num gets, the larger the hash array, so the accesses spread over a bigger working set and miss the cache more often.
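For the record, a simplified sketch of the pattern that caused it (not my exact code; the random input is just a stand-in):
#include <cstdlib>

int main() {
    const int num = 1 << 20;                  // power of two, as in the question
    int* hash = new int[num]();               // table size grows with num
    for (int i = 0; i < 10000000; ++i) {
        unsigned long long x = std::rand();   // stand-in for the real input values
        hash[x & (num - 1)]++;                // num is a power of two, so this is x % num
    }
    // Once num * sizeof(int) exceeds the cache size, these scattered accesses
    // start missing the cache; that, not the &, is what slows down.
    delete[] hash;
    return 0;
}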
What is the fastest way to write a bitstream on x86/x86-64? (codeword <= 32bit)
By writing a bitstream I refer to the process of concatenating variable bit-length symbols into a contiguous memory buffer.
Currently I've got a standard container with a 32-bit intermediate buffer to write to:
void write_bits(SomeContainer<unsigned int>& dst, unsigned int& buffer,
                unsigned int& bits_left_in_buffer, int codeword, short bits_to_write)
{
    if (bits_to_write < bits_left_in_buffer) {
        buffer |= codeword << (32 - bits_left_in_buffer);
        bits_left_in_buffer -= bits_to_write;
    } else {
        unsigned int full_bits = bits_to_write - bits_left_in_buffer;
        unsigned int towrite = buffer | (codeword << (32 - bits_left_in_buffer));
        buffer = full_bits ? (codeword >> bits_left_in_buffer) : 0;
        dst.push_back(towrite);
        bits_left_in_buffer = 32 - full_bits;
    }
}
Does anyone know of any nice optimizations, fast instructions or other info that may be of use?
Cheers,
I once wrote a quite fast implementation, but it has several limitations: it works on 32-bit x86 when you write and read the bitstream, and I don't check for buffer limits here; I was allocating a larger buffer and checking it from time to time from the calling code.
unsigned char* membuff;
unsigned bit_pos; // current BIT position in the buffer, so its max size is 512 MB

// input bit buffer: we'll decode the byte address so that it's even, and the
// DWORD from that address will surely have at least 17 free bits
inline unsigned int get_bits(unsigned int bit_cnt) { // bit_cnt MUST be in range 0..17
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1; // rounding down by 2
    unsigned int bits = *(unsigned int*)(membuff + byte_offset);
    bits >>= bit_pos & 0xF;
    bit_pos += bit_cnt;
    return bits & BIT_MASKS[bit_cnt];
}

// output buffer, the whole destination should be memset'ed to 0
inline void put_bits(unsigned int val, unsigned int bit_cnt) {
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1;
    *(unsigned int*)(membuff + byte_offset) |= val << (bit_pos & 0xF);
    bit_pos += bit_cnt;
}
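BIT_MASKS isn't shown above; presumably it is just a table of low-bit masks, something along these lines (my guess at the definition, not code from the original):
// presumed definition: BIT_MASKS[n] has the n lowest bits set
unsigned int BIT_MASKS[18];

void init_bit_masks() {
    for (unsigned int n = 0; n < 18; ++n)
        BIT_MASKS[n] = (1u << n) - 1;
}
A round trip would then be something like put_bits(0x5, 3); put_bits(0x1AB, 9); followed by resetting bit_pos to 0 and reading the values back with get_bits(3) and get_bits(9).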
It's hard to answer in general because it depends on many factors such as the distribution of bit-sizes you are reading, the call pattern in the client code and the hardware and compiler. In general, the two possible approaches for reading (writing) from a bitstream are:
Using a 32-bit or 64-bit buffer and conditionally reading (writing) from the underlying array when you need more bits. That's the approach your write_bits method takes.
Unconditionally reading (writing) from the underlying array on every bitstream read (write) and then shifting and masking the resultant values.
The primary advantages of (1) include:
Only reads from the underlying buffer the minimally required number of times in an aligned fashion.
The fast path (no array read) is somewhat faster since it doesn't have to do the read and associated addressing math.
The method is likely to inline better since it doesn't have reads - if you have several consecutive read_bits calls, for example, the compiler can potentially combine a lot of the logic and produce some really fast code.
The primary advantage of (2) is that it is completely predictable - it contains no unpredictable branches.
Just because there is only one advantage for (2) doesn't mean it's worse: that advantage can easily overwhelm everything else.
In particular, you can analyze the likely branching behavior of your algorithm based on two factors:
How often will the bitstream need to read from the underlying buffer?
How predictable is the number of calls before a read is needed?
For example if you are reading 1 bit 50% of the time and 2 bits 50% of time, you will do 64 / 1.5 = ~42 reads (if you can use a 64-bit buffer) before requiring an underlying read. This favors method (1) since reads of the underlying are infrequent, even if mis-predicted. On the other hand, if you are usually reading 20+ bits, you will read from the underlying every few calls. This is likely to favor approach (2), unless the pattern of underlying reads is very predictable. For example, if you always read between 22 and 30 bits, you'll perhaps always take exactly three calls to exhaust the buffer and read the underlying1 array. So the branch will be well-predicted and (1) will stay fast.
Similarly, it depends on how you call these methods, and how the compiler can inline and simplify the code. Especially if you ever call the methods repeatedly with a compile-time constant size, a lot of simplification is possible. Little to no simplification is available when the size is only known at runtime.
Finally, you may be able to get increased performance by offering a more complex API. This mostly applies to implementation option (1). For example, you can offer an ensure_available(unsigned size) call which ensures that at least size bits (usually limited to the buffer size) are available to read. Then you can read up to that number of bits using unchecked calls that don't check the buffer size. This can help you reduce mis-predictions by forcing the buffer fills to a predictable schedule and lets you write simpler unchecked methods.
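A rough sketch of what such an API could look like (the names and the 64-bit accumulator are my assumptions, not code from the question):
#include <cstdint>

struct BitReader {
    const uint8_t* src;      // underlying byte stream
    uint64_t buffer = 0;     // bit accumulator
    unsigned available = 0;  // number of valid bits currently in the accumulator

    // Ensure at least 'size' bits (size <= 57) are buffered; this is the only place
    // that touches the underlying array, so refills follow a predictable schedule.
    void ensure_available(unsigned size) {
        while (available < size) {
            buffer |= (uint64_t)*src++ << available;
            available += 8;
        }
    }

    // Unchecked read: the caller must have called ensure_available() first.
    uint32_t read_bits_unchecked(unsigned count) {
        uint32_t value = (uint32_t)(buffer & ((1ull << count) - 1));
        buffer >>= count;
        available -= count;
        return value;
    }
};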
1 This depends on exactly how your "read from underlying" routine is written, as there are a few options here: Some always fill to 64-bits, some fill to between 57 and 64-bits (i.e., read an integral number of bytes), and some may fill between 32 or 33 and 64-bits (like your example which reads 32-bit chunks).
You'll probably have to wait until 2013 to get hold of real HW, but the "Haswell" new instructions will bring proper vectorised shifts (i.e. the ability to shift each vector element by a different amount specified in another vector) to x86/AVX. Not sure of the details (plenty of time to figure them out), but that will surely enable a massive performance improvement in bitstream construction code.
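(For reference, these shipped as the AVX2 variable-shift instructions. A minimal illustration, assuming a compiler and CPU with AVX2 support, compiled with -mavx2:)
#include <immintrin.h>

// shift eight 32-bit codewords left by eight independent bit counts in one instruction
__m256i shift_codewords(__m256i codewords, __m256i bit_counts) {
    return _mm256_sllv_epi32(codewords, bit_counts);  // VPSLLVD
}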
I don't have the time to write it for you (not too sure your sample is actually complete enough to do so), but if you must, I can think of:
using translation tables for the various input/output bit-shift offsets; this optimization would make sense for fixed units of n bits (with n sufficiently large, 8 bits perhaps, to expect performance gains)
In essence, you'd be able to do
destloc &= (lookuptable[bits_left_in_buffer][input_offset][codeword]);
disclaimer: this is very sloppy pseudo code, I just hope it conveys my idea of a lookup table to prevent bit-shift arithmetic
writing it in assembly (I know i386 has XLAT, but then again, a good compiler might already use something like that)
Also, XLAT seems limited to 8 bits and the AL register, so it's not really versatile
Update
Warning: be sure to use a profiler and test your optimization for correctness and speed. Using a lookup table can result in poorer performance in light of locality of reference. So, you might need to pin the bit-streaming thread to a single core (set thread affinity) to get the benefits, and you might have to adapt the lookup table size to the processor's L2 cache.
Also, have a look at the SIMD, SSE4 or GPU (CUDA) instruction sets if you know you'll have certain features at your disposal.
Consider the following array of bytes that is intended to be converted into a single unsigned integer:
unsigned char arr[3] = {0x23, 0x45, 0x67};
Each byte represents the equivalent byte of the integer. Now, which of the following methods would you suggest, especially performance-wise?
unsigned int val1 = arr[2] << 16 | arr[1] << 8 | arr[0];

// or

unsigned int val2 = arr[0];
*((char*)&val2 + 1) = arr[1];
*((char*)&val2 + 2) = arr[2];
I prefer the first method because it is portable. The second isn't due to endianness issues.
This depends on your specific processor, a lot.
For example, on the PowerPC, the second form -- writing through the character pointers -- runs into a tricky implementation detail called a load-hit-store. This is a CPU stall that occurs when you store to a location in memory, then read it back again before the store has completed. The load op cannot complete until the store has finished (most PPCs do not have memory store-forwarding), and the store may take many cycles to make it from the CPU out to the memory cache.
Because of the way the store and arithmetic units are arranged in the pipeline, the CPU will have to flush the pipeline completely until the store completes: this can be a stall of twenty cycles or more during which the CPU has stopped dead. In general, writing to memory and then reading it back immediately is very bad on this platform. So on this case, the sequential bitshifts will be much faster, as they all occur on registers, and will not incur a pipeline stall.
On the Pentium series, the situation may be entirely reversed, because that chipset does have store forwarding and a fast stack architecture, and relatively few architectural registers. On the Core Duos and i7s, it may reverse yet again, because their pipelines are very deep.
Remember: it is not the case that every opcode takes one cycle. CPUs are not simple, and things like superscalar pipes and data hazards may cause instructions to take many cycles, or even many instructions to occur per cycle, depending on just how you arrange your code.
All of this just to underscore the point: this sort of optimization is extremely specific to a particular compiler and chipset. So you must compile, test and measure.
The first is faster when translated to x86 asm, but it depends on your architecture anyway. Compilers are usually able to optimize the first expression very well, and it's more portable too.
The performance depends on the compiler and the machine. For example, in my experiment with gcc 4.4.5 on x64 the second was marginally faster, while others report the first as being faster. Therefore I recommend sticking with the first one because it is cleaner (no casts) and safer (no endianness issues).
I believe bit-shifting will be the fastest solution. In my mind the CPU can just slide the values in, whereas by going directly to the address, like in your second example, it will have to use many temporary stores.
I would suggest a solution with a union:
union color {
    // first representation (member of union)
    struct s_color {
        unsigned char a, b, g, r;
    } uc_color;

    // second representation (member of union)
    unsigned int int_color;
};

int main()
{
    color a;
    a.int_color = 0x23567899;
    a.uc_color.a;
    a.uc_color.b;
    a.uc_color.g;
    a.uc_color.r;
}
Take care that it is platform dependent (endianness).
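A quick way to see which byte lands where (a sketch; strictly speaking, reading a union member other than the last one written is a common compiler extension in C++ rather than guaranteed behaviour):
#include <cstdio>

union color {
    struct s_color { unsigned char a, b, g, r; } uc_color;
    unsigned int int_color;
};

int main()
{
    color c;
    c.int_color = 0x23456789;
    // On a little-endian machine this prints "89 67 45 23":
    // uc_color.a aliases the least significant byte of int_color.
    std::printf("%02x %02x %02x %02x\n",
                c.uc_color.a, c.uc_color.b, c.uc_color.g, c.uc_color.r);
    return 0;
}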