How to atomically add and fetch a 128-bit number in C++? - c++

I use Linux x86_64 and clang 3.3.
Is this even possible in theory?
std::atomic<__int128_t> doesn't work (undefined references to some functions).
__atomic_add_fetch also doesn't work ('error: cannot compile this atomic library call yet').
Both std::atomic and __atomic_add_fetch work with 64-bit numbers.

It's not possible to do this with a single instruction, but you can emulate it and still be lock-free. Except for the very earliest AMD64 CPUs, x64 supports the CMPXCHG16B instruction. With a little multi-precision math, you can do this pretty easily.
I'm afraid I don't know the intrinsic for CMPXCHG16B in GCC, but hopefully you get the idea of having a spin loop of CMPXCHG16B. Here's some untested code for VC++:
// atomically adds 128-bit src to dst, with src getting the old dst.
// Index 0 holds the low 64 bits, index 1 the high 64 bits (little-endian layout,
// matching what _InterlockedCompareExchange128 reads/writes through ComparandResult).
void fetch_add_128b(uint64_t *dst, uint64_t *src)
{
    uint64_t srclo = src[0];
    uint64_t srchi = src[1];
    uint64_t olddst[2] = { dst[0], dst[1] };
    uint64_t exchlo, exchhi;
    do
    {
        exchlo = srclo + olddst[0];
        exchhi = srchi + olddst[1] + (exchlo < srclo); // add and carry
    }
    while (!_InterlockedCompareExchange128((long long*)dst,
                                           exchhi, exchlo,
                                           (long long*)olddst)); // olddst is refreshed on failure
    src[0] = olddst[0];
    src[1] = olddst[1];
}
Edit: here's some untested code going off of what I could find for the GCC intrinsics:
// atomically adds 128-bit src to dst, returning the old dst.
__uint128_t fetch_add_128b(__uint128_t *dst, __uint128_t src)
{
    __uint128_t dstval, olddst;
    dstval = *dst;
    do
    {
        olddst = dstval;
        dstval = __sync_val_compare_and_swap(dst, dstval, dstval + src);
    }
    while (dstval != olddst);
    return dstval;
}
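A quick usage sketch (hedged: with GCC/clang on x86-64 this typically needs -mcx16, or a -march level that implies it, so that the 16-byte __sync builtin becomes lock cmpxchg16b instead of an unresolved library call):
__uint128_t counter = 0;
__uint128_t delta = (__uint128_t)1 << 64;              // adds 1 to the high 64 bits
__uint128_t before = fetch_add_128b(&counter, delta);  // 'before' is the pre-add value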

That isn't possible. There is no x86-64 instruction that does a 128-bit add, and to do something atomically, the basic starting point is that it is a single instruction (there are some instructions which aren't atomic even then, but that's another matter).
You will need to use some other lock around the 128-bit number.
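For completeness, a minimal sketch of that lock-based fallback (untested; the names are just illustrative):
#include <mutex>

static std::mutex lock128;
static __int128_t value128;

__int128_t locked_add_fetch(__int128_t x)
{
    std::lock_guard<std::mutex> guard(lock128); // serialize all access to value128
    value128 += x;
    return value128;
}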
Edit: It is possible that one could come up with something that uses something like this:
__asm__ __volatile__(
    "    movq   (%0), %%rax\n"     // low half of the current value
    "    movq   8(%0), %%rdx\n"    // high half
    "1:\n"
    "    movq   (%1), %%rbx\n"     // reload the addend each attempt
    "    movq   8(%1), %%rcx\n"
    "    addq   %%rax, %%rbx\n"    // add low halves
    "    adcq   %%rdx, %%rcx\n"    // add high halves with carry
    "    lock; cmpxchg16b (%0)\n"  // %0 must be 16-byte aligned; on failure RDX:RAX gets the current value
    "    jnz    1b\n"
    :
    : "r"(&arg1), "r"(&arg2)
    : "rax", "rbx", "rcx", "rdx", "memory", "cc");
That's just something I hacked up, and I haven't compiled it, never mind validated that it will work. But the principle is that it repeats until the compare succeeds.
Edit2: Darn, typing too slow, Cory Nelson just posted the same thing, but using intrinsics.
Edit3: Updated the loop to not unnecessarily read memory that doesn't need reading... CMPXCHG16B does that for us.

Yes; you need to tell your compiler that you're on hardware that supports it.
This answer is going to assume you're on x86-64; there's likely a similar spec for ARM.
From the generic x86-64 microarchitecture levels, you'll want at least x86-64-v2 to let the compiler know that you have the cmpxchg16b instruction.
Here's a working godbolt, note the compiler flag -march=x86-64-v2:
https://godbolt.org/z/PvaojqGcx
For more reading on the x86-64-psABI, the spec is published here.
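For reference, here is a hedged sketch of the fetch-add itself built on std::atomic<__int128_t>; it assumes the compiler has been told about cmpxchg16b as above (and that any libatomic support your toolchain requires is linked):
#include <atomic>

__int128_t fetch_add_128(std::atomic<__int128_t>& a, __int128_t x)
{
    __int128_t expected = a.load(std::memory_order_relaxed);
    // compare_exchange_weak reloads 'expected' with the current value on failure.
    while (!a.compare_exchange_weak(expected, expected + x))
        ;
    return expected; // the value before the add, like fetch_add would return
}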

Related

Why is std::fill(0) slower than std::fill(1)?

I have observed on a system that std::fill on a large std::vector<int> was significantly and consistently slower when setting a constant value 0 compared to a constant value 1 or a dynamic value:
5.8 GiB/s vs 7.5 GiB/s
However, the results are different for smaller data sizes, where fill(0) is faster:
With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):
This raises the secondary question, why the peak bandwidth of fill(1) is so much lower.
The test system for this was a dual socket Intel Xeon CPU E5-2680 v3 set at 2.5 GHz (via /sys/cpufreq) with 8x16 GiB DDR4-2133. I tested with GCC 6.1.0 (-O3) and Intel compiler 17.0.1 (-fast); both get identical results. GOMP_CPU_AFFINITY=0,12,1,13,2,14,3,15,4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 was set. STREAM add with 24 threads gets 85 GiB/s on the system.
I was able to reproduce this effect on a different Haswell dual socket server system, but not any other architecture. For example on Sandy Bridge EP, memory performance is identical, while in cache fill(0) is much faster.
Here is the code to reproduce:
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>

using value = int;
using vector = std::vector<value>;

constexpr size_t write_size = 8ll * 1024 * 1024 * 1024;
constexpr size_t max_data_size = 4ll * 1024 * 1024 * 1024;

void __attribute__((noinline)) fill0(vector& v) {
    std::fill(v.begin(), v.end(), 0);
}

void __attribute__((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}

void bench(size_t data_size, int nthreads) {
#pragma omp parallel num_threads(nthreads)
    {
        vector v(data_size / (sizeof(value) * nthreads));
        auto repeat = write_size / data_size;
#pragma omp barrier
        auto t0 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill0(v);
#pragma omp barrier
        auto t1 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill1(v);
#pragma omp barrier
        auto t2 = omp_get_wtime();
#pragma omp master
        std::cout << data_size << ", " << nthreads << ", " << write_size / (t1 - t0) << ", "
                  << write_size / (t2 - t1) << "\n";
    }
}

int main(int argc, const char* argv[]) {
    std::cout << "size,nthreads,fill0,fill1\n";
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, 1);
    }
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, omp_get_max_threads());
    }
    for (int nthreads = 1; nthreads <= omp_get_max_threads(); nthreads++) {
        bench(max_data_size, nthreads);
    }
}
Presented results compiled with g++ fillbench.cpp -O3 -o fillbench_gcc -fopenmp.
From your question + the compiler-generated asm from your answer:
fill(0) is an ERMSB rep stosb which will use 256b stores in an optimized microcoded loop. (Works best if the buffer is aligned, probably to at least 32B or maybe 64B).
fill(1) is a simple 128-bit movaps vector store loop. Only one store can execute per core clock cycle regardless of width, up to 256b AVX. So 128b stores can only fill half of Haswell's L1D cache write bandwidth. This is why fill(0) is about 2x as fast for buffers up to ~32kiB. Compile with -march=haswell or -march=native to fix that.
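For example, mirroring the compile line from the question (the output name here is just illustrative):
g++ fillbench.cpp -O3 -march=haswell -o fillbench_haswell -fopenmp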
Haswell can just barely keep up with the loop overhead, but it can still run 1 store per clock even though it's not unrolled at all. But with 4 fused-domain uops per clock, that's a lot of filler taking up space in the out-of-order window. Some unrolling would maybe let TLB misses start resolving farther ahead of where stores are happening, since there is more throughput for store-address uops than for store-data. Unrolling might help make up the rest of the difference between ERMSB and this vector loop for buffers that fit in L1D. (A comment on the question says that -march=native only helped fill(1) for L1.)
Note that rep stosd (which could be used to implement fill(1) for int elements) will probably perform the same as rep stosb on Haswell.
Although the official documentation only guarantees that ERMSB gives fast rep stosb (but not rep stosd), actual CPUs that support ERMSB use similarly efficient microcode for rep stosd. There is some doubt about IvyBridge, where maybe only the b variant is fast. See @BeeOnRope's excellent ERMSB answer for updates on this.
gcc has some x86 tuning options for string ops (like -mstringop-strategy=alg and -mmemset-strategy=strategy), but IDK if any of them will get it to actually emit rep stosd for fill(1). Probably not, since I assume the code starts out as a loop, rather than a memset.
With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):
A normal movaps store to a cold cache line triggers a Read For Ownership (RFO). A lot of real DRAM bandwidth is spent on reading cache lines from memory when movaps writes the first 16 bytes. ERMSB uses a no-RFO protocol for its stores, so the memory controllers are only writing. (Except for miscellaneous reads, like page tables if any page-walks miss even in L3 cache, and maybe some load misses in interrupt handlers or whatever).
@BeeOnRope explains in comments that the difference between regular RFO stores and the RFO-avoiding protocol used by ERMSB has downsides for some ranges of buffer sizes on server CPUs where there's high latency in the uncore/L3 cache. See also the linked ERMSB answer for more about RFO vs non-RFO, and the high latency of the uncore (L3/memory) in many-core Intel CPUs being a problem for single-core bandwidth.
movntps (_mm_stream_ps()) stores are weakly-ordered, so they can bypass the cache and go straight to memory a whole cache-line at a time without ever reading the cache line into L1D. movntps avoids RFOs, like rep stos does. (rep stos stores can reorder with each other, but not outside the boundaries of the instruction.)
Your movntps results in your updated answer are surprising.
For a single thread with large buffers, your results are movnt >> regular RFO > ERMSB. So that's really weird that the two non-RFO methods are on opposite sides of the plain old stores, and that ERMSB is so far from optimal. I don't currently have an explanation for that. (edits welcome with an explanation + good evidence).
As we expected, movnt allows multiple threads to achieve high aggregate store bandwidth, like ERMSB. movnt always goes straight into line-fill buffers and then memory, so it is much slower for buffer sizes that fit in cache. One 128b vector per clock is enough to easily saturate a single core's no-RFO bandwidth to DRAM. Probably vmovntps ymm (256b) is only a measurable advantage over vmovntps xmm (128b) when storing the results of a CPU-bound AVX 256b-vectorized computation (i.e. only when it saves the trouble of unpacking to 128b).
movnti bandwidth is low because storing in 4B chunks bottlenecks on 1 store uop per clock adding data to the line-fill buffers, not on sending those full buffers to DRAM (until you have enough threads to saturate memory bandwidth).
@osgx posted some interesting links in comments:
Agner Fog's asm optimization guide, instruction tables, and microarch guide: http://agner.org/optimize/
Intel optimization guide: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf.
NUMA snooping: http://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/
https://software.intel.com/en-us/articles/intelr-memory-latency-checker
Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture
See also other stuff in the x86 tag wiki.
I'll share my preliminary findings, in the hope of encouraging more detailed answers. I just felt this would be too much as part of the question itself.
The compiler optimizes fill(0) to an internal memset. It cannot do the same for fill(1), since memset only works on bytes.
Specifically, both glibc's __memset_avx2 and __intel_avx_rep_memset are implemented with a single hot instruction:
rep stos %al,%es:(%rdi)
Whereas the manual loop compiles down to an actual 128-bit store instruction:
add $0x1,%rax
add $0x10,%rdx
movaps %xmm0,-0x10(%rdx)
cmp %rax,%r8
ja 400f41
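As a side note on why only the zero case can legally become a memset: memset repeats a single byte, so filling an int with 1 that way would actually produce 0x01010101. A minimal illustration:
#include <cstring>

void demo() {
    int a[4];
    std::memset(a, 0, sizeof a); // every int becomes 0x00000000 == 0, same result as fill(..., 0)
    std::memset(a, 1, sizeof a); // every int becomes 0x01010101 != 1, so fill(..., 1) can't be a memset
}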
Interestingly, while there is a template/header optimization to implement std::fill via memset for byte types, in this case it is a compiler optimization that transforms the actual loop.
Strangely, for a std::vector<char>, gcc also begins to optimize fill(1). The Intel compiler does not, despite the memset template specialization.
Since this happens only when the code is actually working in memory rather than in cache, it appears the Haswell-EP architecture fails to efficiently consolidate the single-byte writes.
I would appreciate any further insight into the issue and the related micro-architecture details. In particular it is unclear to me why this behaves so differently for four or more threads and why memset is so much faster in cache.
Update:
Here is a result in comparison with:
fill(1) using -march=native (avx2 vmovdq %ymm0) - it works better in L1, but similar to the movaps %xmm0 version for the other memory levels.
Variants of 32-, 128- and 256-bit non-temporal stores. They perform consistently, with the same performance regardless of the data size. All outperform the other variants in memory, especially for small numbers of threads. 128-bit and 256-bit perform exactly alike; for low numbers of threads, 32-bit performs significantly worse.
For <= 6 threads, vmovnt has a 2x advantage over rep stos when operating in memory.
Single threaded bandwidth:
Aggregate bandwidth in memory:
Here is the code used for the additional tests with their respective hot-loops:
void __attribute__ ((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}
┌─→add $0x1,%rax
│ vmovdq %ymm0,(%rdx)
│ add $0x20,%rdx
│ cmp %rdi,%rax
└──jb e0
void __attribute__ ((noinline)) fill1_nt_si32(vector& v) {
    for (auto& elem : v) {
        _mm_stream_si32(&elem, 1);
    }
}
┌─→movnti %ecx,(%rax)
│ add $0x4,%rax
│ cmp %rdx,%rax
└──jne 18
void __attribute__ ((noinline)) fill1_nt_si128(vector& v) {
    assert((long)v.data() % 32 == 0); // alignment
    const __m128i buf = _mm_set1_epi32(1);
    size_t i;
    int* data;
    int* end4 = &v[v.size() - (v.size() % 4)];
    int* end = &v[v.size()];
    for (data = v.data(); data < end4; data += 4) {
        _mm_stream_si128((__m128i*)data, buf);
    }
    for (; data < end; data++) {
        *data = 1;
    }
}
┌─→vmovnt %xmm0,(%rdx)
│ add $0x10,%rdx
│ cmp %rcx,%rdx
└──jb 40
void __attribute__ ((noinline)) fill1_nt_si256(vector& v) {
    assert((long)v.data() % 32 == 0); // alignment
    const __m256i buf = _mm256_set1_epi32(1);
    size_t i;
    int* data;
    int* end8 = &v[v.size() - (v.size() % 8)];
    int* end = &v[v.size()];
    for (data = v.data(); data < end8; data += 8) {
        _mm256_stream_si256((__m256i*)data, buf);
    }
    for (; data < end; data++) {
        *data = 1;
    }
}
┌─→vmovnt %ymm0,(%rdx)
│ add $0x20,%rdx
│ cmp %rcx,%rdx
└──jb 40
Note: I had to do manual pointer calculation in order to get the loops so compact. Otherwise it would do vector indexing within the loop, probably due to the intrinsic confusing the optimizer.

Reading CF, PF, ZF, SF, OF

I am writing a virtual machine for my own assembly language. I want to be able to set the carry, parity, zero, sign and overflow flags as they are set in the x86-64 architecture when I perform operations such as addition.
Notes:
I am using Microsoft Visual C++ 2015 & Intel C++ Compiler 16.0
I am compiling as a Win64 application.
My virtual machine (currently) only does arithmetic on 8-bit integers
I'm not (currently) interested in any other flags (e.g. AF)
My current solution is using the following function:
void update_flags(uint16_t input)
{
    Registers::flags.carry = (input > UINT8_MAX);
    Registers::flags.zero = (input == 0);
    Registers::flags.sign = (input < 0);
    Registers::flags.overflow = (int16_t(input) > INT8_MAX || int16_t(input) < INT8_MIN);
    // I am assuming that overflow is handled by truncation
    uint8_t input8 = uint8_t(input);
    // The parity flag
    int ones = 0;
    for (int i = 0; i < 8; ++i)
        if ((input8 & (1 << i)) != 0) ++ones;
    Registers::flags.parity = (ones % 2 == 0);
}
Which for addition, I would use as follows:
uint8_t a, b;
update_flags(uint16_t(a) + uint16_t(b));
uint8_t c = a + b;
EDIT:
To clarify, I want to know if there is a more efficient/neat way of doing this (such as by accessing RFLAGS directly)
Also my code may not work for other operations (e.g. multiplication)
EDIT 2: I have updated my code to this:
void update_flags(uint32_t result)
{
    Registers::flags.carry = (result > UINT8_MAX);
    Registers::flags.zero = (result == 0);
    Registers::flags.sign = (int32_t(result) < 0);
    Registers::flags.overflow = (int32_t(result) > INT8_MAX || int32_t(result) < INT8_MIN);
    Registers::flags.parity = (_mm_popcnt_u32(uint8_t(result)) % 2 == 0);
}
One more question: will my code for the carry flag work properly? I also want it to be set correctly for "borrows" that occur during subtraction.
Note: The assembly language I am virtualising is of my own design, meant to be simple and based on Intel's implementation of x86-64 (i.e. Intel64), so I would like these flags to behave in mostly the same way.
TL:DR: use lazy flag evaluation, see below.
input is a weird name. Most ISAs update flags based on the result of an operation, not the inputs. You're looking at the 16-bit result of an 8-bit operation, which is an interesting approach. In the C++ code, you should just use unsigned int, which is guaranteed to be at least 16 bits. It will compile to better code on x86, where unsigned is 32-bit. 16-bit ops take an extra prefix and can lead to partial-register slowdowns.
That might help with the 8bx8b->16b mul problem you noted, depending on how you want to define the flag-updating for the mul instruction in the architecture you're emulating.
I don't think your overflow detection is correct. See this tutorial linked from the x86 tag wiki for how it's done.
This will probably not compile to very fast code, especially the parity flag. Do you need the ISA you're emulating/designing to have a parity flag? You never said you're emulating an x86, so I assume it's some toy architecture you're designing yourself.
An efficient emulator (esp. one that needs to support a parity flag) would probably benefit a lot from some kind of lazy flag evaluation. Save a value that you can compute flags from if needed, but don't actually compute anything until you get to an instruction that reads flags. Most instructions only write flags without reading them, and they just save the uint16_t result into your architectural state. Flag-reading instructions can either compute just the flag they need from that saved uint16_t, or compute all of them and store that somehow.
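A minimal sketch of that idea (hypothetical names; it assumes every flag-writing 8-bit op stashes its widened result, as in your code):
#include <cstdint>

struct LazyFlags {
    uint16_t result;   // widened result of the last flag-writing 8-bit op

    void record(uint16_t r) { result = r; }   // cheap: done by every ALU op

    // Derived only when a flag-reading instruction actually needs them:
    bool zero()  const { return uint8_t(result) == 0; }
    bool sign()  const { return int8_t(result) < 0; }
    bool carry() const { return result > UINT8_MAX; } // for addition
    // Overflow needs more saved state (e.g. a sign-extended result as well),
    // along the lines of the signed/unsigned split shown later in this thread.
};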
Assuming you can't get the compiler to actually read PF from the result, you might try _mm_popcnt_u32((uint8_t)x) & 1. Or, horizontally XOR all the bits together:
x = (x & 0b00001111) ^ (x >> 4);
x = (x & 0b00000011) ^ (x >> 2);
PF = (x & 0b00000001) ^ (x >> 1); // tweaking this to produce better asm is probably possible
I doubt any of the major compilers can peephole-optimize a bunch of checks on a result into LAHF + SETO al, or a PUSHF. Compilers can be led into using a flag condition to detect integer overflow to implement saturating addition, for example. But having it figure out that you want all the flags, and actually use LAHF instead of a series of setcc instructions, is probably not possible. The compiler would need a pattern-recognizer for when it can use LAHF, and probably nobody's implemented that because the use-cases are so vanishingly rare.
There's no C/C++ way to directly access flag results of an operation, which makes C a poor choice for implementing something like this. IDK if any other languages do have flag results, other than asm.
I expect you could gain a lot of performance by writing parts of the emulation in asm, but that would be platform-specific. More importantly, it's a lot more work.
I appear to have solved the problem by splitting the arguments to update_flags into an unsigned and a signed result, as follows:
void update_flags(int16_t unsigned_result, int16_t signed_result)
{
    Registers::flags.zero = unsigned_result == 0;
    Registers::flags.sign = signed_result < 0;
    Registers::flags.carry = unsigned_result < 0 || unsigned_result > UINT8_MAX;
    Registers::flags.overflow = signed_result < INT8_MIN || signed_result > INT8_MAX;
}
For addition (which should produce the correct result for both signed & unsigned inputs) I would do the following:
int8_t a, b;
int16_t signed_result = int16_t(a) + int16_t(b);
int16_t unsigned_result = int16_t(uint8_t(a)) + int16_t(uint8_t(b));
update_flags(unsigned_result, signed_result);
int8_t c = a + b;
And for signed multiplication I would do the following:
int8_t a, b;
int16_t result = int16_t(a) * int16_t(b);
update_flags(result, result);
int8_t c = a * b;
And so on for the other operations that update the flags
Note: I am assuming here that int16_t(a) sign extends, and int16_t(uint8_t(a)) zero extends.
I have also decided against having a parity flag; my _mm_popcnt_u32 solution should work if I change my mind later.
P.S. Thank you to everyone who responded, it was very helpful. Also if anyone can spot any mistakes in my code, that would be appreciated.

SHLD+BSR decoder?

Reading the following blog post. A so-called "SHLD+BSR" Huffman decoder is mentioned, which is then further expanded to MOV, MOV, SHLD, OR, BSR, MOV, SHR, MOV, OR, ADD, ADC; however, I have not found any reference or source code that describes such decoding. Does anyone know which decoding method is being referred to?
I haven't really succeeded at understanding this method for decoding Huffman codes, but the relevant "inner loop" here contains something like this (edited slightly to make the SHLD and BSR obvious):
uint32 posidx = pos >> 5;
uint32 code = src32[posidx];
uint32 extrabits = src32[posidx + 1];
SHLD(code, extrabits, pos);
code |= 1;
uint32 idx = BSR(code);
const uint8 *p = (const uint8 *)(table->mBsrLenTable[idx] +
                                 2*(code >> table->mBsrShiftTable[idx]));
result = p[0];
pos += p[1];
That accounts for the MOV, MOV, SHLD, OR, BSR, MOV, SHR, MOV but then I'm not sure anymore. I think the ADD refers to the multiplication by 2, the ADC is actually a trick to add p[1] to pos and the OR joins the mBsrLenTable entry and the rest of the code, but that seems to be in the wrong order and then the OR would correspond with an addition in the source. Perhaps I shouldn't do this sort of thing after midnight..
You'd probably better take a look at the source yourself, because to be honest, my answer is useless. I got it here: sourceforge.net/projects/virtualdub/files/virtualdub-win/1.9.11.32842/VirtualDub-1.9.11-src.7z. Look for the file src\Meia\source\decode_huffyuv.cpp; it starts with the table initialization, and the actual decoding is in a macro called DECODE, about 200 lines down.

What's the reason behind applying two explicit type casts in a row?

What's the reason behind applying two explicit type casts as below?
if (unlikely(val != (long)(char)val)) {
Code taken from lxml.etree.c source file from lxml's source package.
That's a cheap way to check whether there's any junk in the high bits. The char cast chops off the upper 8, 24 or 56 bits (depending on sizeof(val)) and then promotes it back. If char is signed, it will sign-extend as well.
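A quick worked example, assuming a typical signed 8-bit char:
long val = 0x1F5;              // 501: junk above the low byte
long back = (long)(char)val;   // (char) keeps 0xF5, sign-extension gives -11
// val != back, so the unlikely() branch is taken;
// for val in -128..127 the round trip is lossless and the branch is skipped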
A better test might be:
if (unlikely(val & ~0xff)) {
or
if (unlikely(val & ~0x7f)) {
depending on whether this test cares about bit 7.
Just for grins and completeness, I wrote the following test code:
void RegularTest(long val)
{
    if (val != ((int)(char)val)) {
        printf("Regular = not equal.");
    }
    else {
        printf("Regular = equal.");
    }
}

void MaskTest(long val)
{
    if (val & ~0xff) {
        printf("Mask = not equal.");
    }
    else {
        printf("Mask = equal.");
    }
}
And here's what the cast code turns into in debug in visual studio 2010:
movsx eax, BYTE PTR _val$[ebp]
cmp DWORD PTR _val$[ebp], eax
je SHORT $LN2#RegularTes
this is the mask code:
mov eax, DWORD PTR _val$[ebp]
and eax, -256 ; ffffff00H
je SHORT $LN2#MaskTest
In release, I get this for the cast code:
movsx ecx, al
cmp eax, ecx
je SHORT $LN2#RegularTes
In release, I get this for the mask code:
test DWORD PTR _val$[ebp], -256 ; ffffff00H
je SHORT $LN2#MaskTest
So what's going on? In the cast case it's doing a byte mov with sign extension (ha! bug - the code is not the same because chars are signed) and then a compare and to be totally sneaky, the compiler/linker has also made this function use register passing for the argument. In the mask code in release, it has folded everything up into a single test instruction.
Which is faster? Beats me - and frankly unless you're running this kind of test on a VERY slow CPU or are running it several billion times, it won't matter. Not in the least.
So the answer in this case, is to write code that is clear about its intent. I would expect a C/C++ jockey to look at the mask code and understand its intent, but if you don't like that, you should opt for something like this instead:
#define BitsAbove8AreSet(x) ((x) & ~0xff)
#define BitsAbove7AreSet(x) ((x) & ~0x7f)
or:
inline bool BitsAbove8AreSet(long t) { return (t & ~0xff) != 0; } // make it a bool to be nice
inline bool BitsAbove7AreSet(long t) { return (t & ~0x7f) != 0; }
And use the predicates instead of the actual code.
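For example, the original check would then read:
if (unlikely(BitsAbove8AreSet(val))) {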
In general, I think "is it cheap?" is not a particularly good question to ask about this unless you're working in some very specific problem domains. For example, I work in image processing and when I have some kind of operation going from one image to another, I often have code that looks like this:
BYTE *srcPixel = PixelOffset(src, x, y, srcrowstride, srcdepth);
int srcAdvance = PixelAdvance(srcrowstride, right, srcdepth);
BYTE *dstPixel = PixelOffset(dst, x, y, dstrowstride, dstdepth);
int dstAdvance = PixelAdvance(dstrowstride, right, dstdepth);
for (y = top; y < bottom; y++) {
    for (x = left; x < right; x++) {
        ProcessOnePixel(srcPixel, srcdepth, dstPixel, dstdepth);
        srcPixel += srcdepth;
        dstPixel += dstdepth;
    }
    srcPixel += srcAdvance;
    dstPixel += dstAdvance;
}
And in this case, assume that ProcessOnePixel() is actually a chunk of inline code that will be executed billions and billions of times. In this case, I care a whole lot about not doing function calls, not doing redundant work, not rechecking values, ensuring that the computational flow will translate into something that will use registers wisely, etc. But my actual primary concern is that the code can be read by the next poor schmuck (probably me) who has to look at it.
And in our current coding world, it is FAR FAR CHEAPER for nearly every problem domain to spend a little time up front ensuring that your code is easy to read and maintain than it is to worry about performance out of the gate.
Speculations:
cast to char: to mask the 8 low bits,
cast to long: to bring the value back to signed (if char is unsigned).
If val is a long then the (char) will strip off all but the bottom 8 bits. The (long) casts it back for the comparison.

How to detect what CPU is being used during runtime?

How can I detect which CPU is being used at runtime? The C++ code needs to differentiate between AMD and Intel architectures. I'm using gcc 4.2.
The cpuid instruction, used with EAX=0, will return a 12-character vendor string in EBX, EDX, ECX, in that order.
For Intel, this string is "GenuineIntel". For AMD, it's "AuthenticAMD". Other companies that have created x86 chips have their own strings. The Wikipedia page for cpuid has many (all?) of the strings listed, as well as an example ASM listing for retrieving the details.
You really only need to check if ECX matches the last four characters. You can't use the first four, because some Transmeta CPUs also start with "Genuine".
For Intel, this is 0x6c65746e
For AMD, this is 0x444d4163
If you convert each byte in those to a character, they'll appear to be backwards. This is just a result of the little endian design of x86. If you copied the register to memory and looked at it as a string, it would work just fine.
Example Code:
bool IsIntel() // returns true on an Intel processor, false on anything else
{
    int id_str;                               // the last four characters of the vendor ID string
    int max_leaf;                             // EAX result (highest supported cpuid leaf)
    __asm__ ("cpuid"                          // run the cpuid instruction with...
             : "=a" (max_leaf), "=c" (id_str) // id_str set to the value of ECX after cpuid runs...
             : "0" (0)                        // and EAX set to 0 to run the proper cpuid function.
             : "ebx", "edx");                 // cpuid also clobbers EBX and EDX.
    if (id_str == 0x6c65746e)                 // "letn": little-endian bytes of the "ntel" in GenuineI[ntel]
        return true;
    else
        return false;
}
EDIT: One other thing - this can easily be changed into an IsAMD function, IsVIA function, IsTransmeta function, etc. just by changing the magic number in the if.
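If your GCC is new enough to ship <cpuid.h> (the question mentions 4.2, so this may not apply there), the same check can be written without hand-rolled asm; a hedged sketch:
#include <cpuid.h>

bool IsIntel()
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return false;             // cpuid leaf 0 not available
    return ecx == 0x6c65746e;     // "letn", the last four bytes of "GenuineIntel"
}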
If you're on Linux (or on Windows running under Cygwin), you can figure that out by reading the special file /proc/cpuinfo and looking for the line beginning with vendor_id. If the string is GenuineIntel, you're running on an Intel chip. If you get AuthenticAMD, you're running on an AMD chip.
void get_vendor_id(char *vendor_id) // must be at least 13 bytes
{
    FILE *cpuinfo = fopen("/proc/cpuinfo", "r");
    if(cpuinfo == NULL)
        ; // handle error
    char line[256];
    while(fgets(line, 256, cpuinfo))
    {
        if(strncmp(line, "vendor_id", 9) == 0)
        {
            char *colon = strchr(line, ':');
            if(colon == NULL || colon[1] == 0)
                ; // handle error
            strncpy(vendor_id, colon + 2, 12);
            vendor_id[12] = 0;
            fclose(cpuinfo);
            return;
        }
    }
    // if we got here, handle error
    fclose(cpuinfo);
}
If you know you're running on an x86 architecture, a less portable method would be to use the CPUID instruction:
void get_vendor_id(char *vendor_id) // must be at least 13 bytes
{
    // GCC inline assembler
    __asm__ __volatile__
        ("movl $0, %%eax\n\t"
         "cpuid\n\t"
         "movl %%ebx, %0\n\t"
         "movl %%edx, %1\n\t"
         "movl %%ecx, %2\n\t"
         : "=m"(*(unsigned int *)&vendor_id[0]),      // outputs: the three 4-byte chunks
           "=m"(*(unsigned int *)&vendor_id[4]),
           "=m"(*(unsigned int *)&vendor_id[8])
         :                                            // no inputs
         : "%eax", "%ebx", "%edx", "%ecx", "memory"); // clobbered registers
    vendor_id[12] = 0;
}
int main(void)
{
    char vendor_id[13];
    get_vendor_id(vendor_id);
    if(strcmp(vendor_id, "GenuineIntel") == 0)
        ; // it's Intel
    else if(strcmp(vendor_id, "AuthenticAMD") == 0)
        ; // it's AMD
    else
        ; // other
    return 0;
}
On Windows, you can use the GetNativeSystemInfo function
On Linux, try sysinfo
You probably should not check at all. Instead, check whether the CPU supports the features you need, e.g. SSE3. The differences between two Intel chips might be greater than between AMD and Intel chips.
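A sketch of that feature-based approach (the builtin needs a reasonably recent GCC, newer than the 4.2 in the question; on older compilers you would test the CPUID feature bits yourself):
#include <cstdio>

int main()
{
    if (__builtin_cpu_supports("sse3"))
        std::puts("SSE3 available");
    else
        std::puts("SSE3 not available");
    return 0;
}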
You have to define it in your Makefile, e.g. arch = `uname -p 2>&1`, and then use #ifdef i386 ... #endif blocks for the different architectures.
I have posted a small project:
http://sourceforge.net/projects/cpp-cpu-monitor/
which uses the libgtop library and exposes data through UDP. You can modify it to suit your needs. GPL open-source. Please ask if you have any questions regarding it.