Is there any good way to optimize this function in terms of execution time? My final goal is to parse a long string composed of several integers (thousands of integers per line, and thousands of lines). This was my initial solution.
int64_t get_next_int(char *newLine) {
    char *token = strtok(newLine, " ");
    if (token == NULL) {
        exit(0);
    }
    return atoll(token);
}
More details: I need the "state"-based behavior of strtok, so the padding written by strtok should remain in the final string. atoll does not need any kind of verification.
Target system: Intel x86_64 (Xeon series)
Related topics:
atoi optimization: C++ most efficient way to convert string to int (faster than atoi)
First off: I find optimizing string conversion routines in signal processing chains to be in vain most of the time. Data in string form will probably arrive from some mass storage, where it was put by something that didn't care about performance (otherwise it wouldn't have chosen a string format in the first place). If you compare the read speed of anything short of a cluster of PCIe-attached SSDs with how fast atoll is, you'll notice that you lose a negligible amount of time to inefficient conversion. If you pipeline loading parts of that string with conversion, the time spent waiting for storage won't even remotely be filled up by converting, so even without any algorithmic optimization, pipelining/multi-threading will eliminate practically all time spent on conversion.
I'm going to go ahead and assume your integer-containing string is sufficiently large. Like, tens of millions of integers. Otherwise, all optimization might be pretty premature, considering there's little to complain about in std::iostream performance.
Now, the trick is that no performance optimization can be done once the performance of your conversion routine hits the memory bandwidth barrier. To push that barrier as far as possible, it's crucial to optimize usage of CPU caches – hence, doing linear access and shuffling memory as little as possible is crucial here. Also, if you care for speed, you don't want to call a function every time you need to convert a few-digit number – the call overhead (saving/restoring stack, jumping back and forth) will be significant. So if you're after performance, you'll do the conversion of the whole string at once, and then just access the resulting integer array.
So, on a modern, SSE4.2-capable x86 processor, you'd have roughly something like this:
Outer loop, jumps in steps of 16:
load 128 bit of input string into 128 bit SIMD register
run something like _mm_cmpestri to find the indices of delimiters and the \0 terminator in all these 16 bytes at once
inner loop over the found indices
Use SSE copy/shift/immediate instructions to isolate substrings; fill the others with 0
prepend saved "last characters" from previous iteration (if any – should only be the case for first inner loop iteration per outer loop iteration)
subtract '0' from each of the digits, again using SSE instructions to do up to 16 subtractions with a single instruction (_mm_sub_epi8)
convert the eight 16-bit subwords to eight 128-bit words containing two packed 64-bit integers each (one instruction per subword, _mm_cvtepi8_epi64, I think)
initialize a __m128i register with [10^15 10^14], let's call it powers
loop over the dual-64-bit words: (each step should be one SSE instruction)
multiply first with powers
divide powers by [100 100]
multiply second with powers
add results to dual-64bit accumulator
sum the two values in accumulator
store the result to integer array
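To make the delimiter-finding step concrete, here is a minimal, untested sketch of just that part, assuming an SSE4.2 build; the helper name find_delim16 is mine, and the substring isolation and digit arithmetic from the outline are left out:

#include <nmmintrin.h>   // SSE4.2, build with -msse4.2

// Return the index (0..15) of the first ' ' in the 16 bytes at p, or 16 if none.
// _mm_cmpistri treats both operands as implicit-length strings, so a '\0' inside
// the chunk also ends the search, which suits this use case.
static int find_delim16(const char *p)
{
    const __m128i delims = _mm_setr_epi8(' ', 0, 0, 0, 0, 0, 0, 0,
                                         0, 0, 0, 0, 0, 0, 0, 0);
    const __m128i chunk  = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
    return _mm_cmpistri(delims, chunk,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT);
}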
I'd rather use something along the lines of a std::istringstream:
int64_t get_next_int(std::istringstream& line) {
    int64_t token;
    if (!(line >> token))
        exit(0);
    return token;
}
std::istringstream line(newLine);
int64_t i = get_next_int(line);
strtok() has well known drawbacks, and you don't want to use it at all.
What about
int64_t n = 0;
// Find the token
for ( ; *newline == ' '; newline++)
    ;
if (*newline == 0)
    // Not found
    exit(0);
// Scan and convert the token
for ( ; unsigned(*newline - '0') < 10; newline++)
    n = 10 * n + *newline - '0';
return n;
As far as I can tell from your code, it returns at the first split. At the first token (before the space character) it will return 0 if the entry is not a number, or if it mixes letters and digits with a letter at the beginning. If it mixes them with the digits at the beginning, it will return just the number. In other words, you only need a string for the conversion, so you don't need tokenizing at all; just check whether the string is null. You can change the return type as well: if you need a type with _exactly_ 64 bits, use (u)int64_t; if you need _at least_ 64 bits, (unsigned) long long is perfectly fine, as would be (u)int_least64_t. I think your code is a little gobbledygook; show exactly what you want without simplification.
/*
* ascii-to-longlong conversion
*
* no error checking; assumes decimal digits
*
* efficient conversion:
* start with value = 0
* then, starting at first character, repeat the following
* until the end of the string:
*
* new value = (10 * (old value)) + decimal value of next character
*
*/
long long my_atoll(char *instr)
{
    if (instr[0] == '\0')
        return -1;
    long long retval = 0;
    for (; *instr; instr++) {
        retval = 10*retval + (*instr - '0');
    }
    return retval;
}
Related
I have more than 1e7 sequences of tokens, where each token can only take one of four possible values.
In order to make this dataset fit into memory, I decided to encode each token in 2 bits, which allows to store 4 tokens in a byte instead of just one (when using a char for each token / std::string for a sequence). I store each sequence in a char array.
For some algorithm, I need to test arbitrary subsequences of two token sequences for exact equality. Each subsequence can have an arbitrary offset. The length is typically between 10 and 30 tokens (random) and is the same for the two subsequences.
My current method is to operate in chunks:
Copy up to 32 tokens (each having 2 bits) from each subsequence into a uint64_t. This is realized in a loop over the tokens that selects the correct char in the array and writes the bits into the correct position of the uint64_t.
Compare the two uint64_t. If they are not equal, return.
Repeat until all tokens in the subsequences have been processed.
#include <climits>
#include <cstdint>

using Block = char;

constexpr int BitsPerToken = 2;
constexpr int TokenPerBlock = sizeof(Block) * CHAR_BIT / BitsPerToken;

Block getTokenFromBlock(Block b, int nt) noexcept
{
    return (b >> (nt * BitsPerToken)) & ((1UL << (BitsPerToken)) - 1);
}

bool seqEqual(Block const* seqA, int startA, int endA, Block const* seqB, int startB, int endB) noexcept
{
    using CompareBlock = uint64_t;
    constexpr int TokenPerCompareBlock = sizeof(CompareBlock) * CHAR_BIT / BitsPerToken;

    const int len = endA - startA;

    int posA = startA;
    int posB = startB;

    CompareBlock curA = 0;
    CompareBlock curB = 0;

    for (int i = 0; i < len; ++i, ++posA, ++posB)
    {
        const int cmpIdx = i % TokenPerBlock;

        const int blockA = posA / TokenPerBlock;
        const int idxA = posA % TokenPerBlock;
        const int blockB = posB / TokenPerBlock;
        const int idxB = posB % TokenPerBlock;

        if ((i % TokenPerCompareBlock) == 0)
        {
            if (curA != curB)
                return false;
            curA = 0;
            curB = 0;
        }

        curA += getTokenFromBlock(seqA[blockA], idxA) << (BitsPerToken * cmpIdx);
        curB += getTokenFromBlock(seqB[blockB], idxB) << (BitsPerToken * cmpIdx);
    }

    if (curA != curB)
        return false;

    return true;
}
I figured that this should be quite fast (comparing 32 tokens simultaneously), but it is more than two times slower than using an std::string (with each token stored in a char) and its operator==.
I have looked into std::memcmp, but cannot use it because the subsequence might start somewhere within a byte (at a multiple of 2 bits, though).
Another candidate would be boost::dynamic_bitset, which basically implements the same storage format. However, it does not include equality tests.
How can I achieve fast equality tests using this compressed format?
First of all, this is the kind of computation where the target processor, RAM, compiler, and compiler flags can drastically change the results. Unfortunately, this critical information is not provided. Let's assume you use a fairly recent mainstream x86-64 processor, common DDR4 SDRAM, a relatively up-to-date compiler like Clang/GCC, and that optimizations are enabled (i.e. -O3 and possibly -march=native).
Clang and GCC use fast comparison functions for comparing strings: memcmp for GCC 12 and bcmp for Clang 15, respectively. The two functions are highly optimized on most platforms: they typically compare short strings in blocks of 8 bytes (uint64_t) and large strings using SIMD instructions.
Your optimization is good for reducing the memory footprint, but it introduces more computation, and there is a high chance the operation is already compute-bound if the input buffer is already in the CPU cache. In addition, the computation is not SIMD-friendly due to the inner loop: the compiler will almost certainly not generate efficient code because of the bit-wise operations. The thing is, scalar code is slow. In fact, scalar byte-per-byte computations are generally so slow that they are usually far from being able to saturate the RAM bandwidth (at least the bandwidth achievable using only 1 core), as opposed to memcmp. For example, a Skylake/Coffee Lake processor at 4 GHz can only read 8 GiB/s from the L1 cache using scalar byte-per-byte code, while an AVX2 SIMD code can read 256 GiB/s. For writes it is half that: 4 GiB/s vs 128 GiB/s. A 1-channel DDR4-SDRAM @ 3200 MHz can theoretically reach ~24 GiB/s, that is, far more than a byte-per-byte scalar sequential code. The L3 cache has a much bigger bandwidth.
If you want fast code for large sequences, then you need to either help your compiler so it can use SIMD instructions (not so easy in this case), use non-portable SIMD intrinsics, or possibly use a relatively portable SIMD library to generate quite good SIMD code (though low-level platform-dependent intrinsics are more flexible/featureful).
I expect the main bottleneck to come from the "loop over the tokens that selects the correct char in the array and writes the bits into the correct position of the uint64_t". Indeed, this loop will likely generate a dependency chain of instructions (operating on the same uint64_t variable) that cannot be executed efficiently by the processor nor easily optimized by the compiler.
A typical solution would be to read blocks of 8 bytes (using memcpy to do it correctly, and hope the compiler optimizes it properly). The bytes can be reordered using a bswap instruction on x86-64 processors; this is not needed on big-endian processors. A shift+mask can then be applied so as to compare only the useful part. Here is an (untested) example to show the idea:
if (length >= 16)
{
    uint64_t block1, block2;
    uint64_t prev_block1 = 0, prev_block2 = 0;
    unsigned int shift1 = (start1 % 4) * 2;
    unsigned int shift2 = (start2 % 4) * 2;
    uint64_t mask = 0xFFFFFFFFFFFFFF00ull;

    // Read blocks 7 bytes at a time for the sake of simplicity
    for (size_t i = 0; i < length - 7; i += 7)
    {
        // Safe and cheap on GCC/Clang
        memcpy(&block1, charArray1 + i, 8);
        memcpy(&block2, charArray2 + i, 8);

        // Architecture-dependent: reorder bytes on little-endian processors.
        // There is a fast instruction for that on x86-64 processors: bswap.
        // See: https://stackoverflow.com/questions/36497605
        block1 = reorder_bytes(block1);
        block2 = reorder_bytes(block2);

        block1 = (block1 << shift1) & mask;
        block2 = (block2 << shift2) & mask;

        if (block1 != block2)
            return false;
    }
}

// TODO: compute the remainder part for the last block
This operation can be done using the SSE/AVX instruction sets so as to be faster for large sequences. Note that you can perform a special optimization when shift1 == shift2 (especially when both are equal to 0).
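For instance, a minimal sketch of the 32-byte comparison step using AVX2 intrinsics, assuming the shift/alignment described above has already been applied (the function name is mine):

#include <immintrin.h>   // AVX2, build with -mavx2

// Compare 32 packed bytes (128 tokens) at once.
static inline bool equal32bytes(const unsigned char *a, const unsigned char *b)
{
    const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(a));
    const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(b));
    const __m256i eq = _mm256_cmpeq_epi8(va, vb);
    return _mm256_movemask_epi8(eq) == -1;   // all 32 bytes matched
}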
One should keep in mind that the bit-packing computation is pretty expensive, even using SIMD code. It will certainly not be faster than a memcmp unless the operation is memory-bound, which is unlikely to be the case. For example, a Skylake/Coffee Lake processor can load and compare 2 blocks of 32 bytes (i.e. 32 tokens per block) in only 1 cycle (reciprocal throughput) using the AVX2 SIMD instruction set, while there is no chance each iteration of the above bit-packing loop can take less than 2 cycles to compute 7 bytes (i.e. 28 tokens). Using AVX2 to optimize the above code is possible, but the AVX lanes and the byte reordering require several additional instructions, so it will certainly still be slightly slower than a basic very fast comparison (a few cycles to compare ~120 tokens).
The only use-case where packing can help is when multiple cores are used to do the computation. Indeed, in that case, the bit-packing code can scale well because it is likely compute-bound, while the string-based version will quickly be limited by the speed of the RAM since it is likely memory-bound.
If there are only 10 million tokens total, that's 20 Mbit, or 2-3 MB. If you keep their shifted versions in different arrays, from 2-bit-shifted to 30-bit-shifted (assuming 4-byte comparisons at once; ignore a 32-bit shift as it just means a different starting position), you can do a direct comparison (std::memcmp) with no shifting involved (fast) after selecting the right array with the modulo of the arbitrary offset. But this requires the token sequence to be constant through many function calls (if not the lifetime of the program).
If these tokens are part of much bigger data, you can put a caching layer (one that caches fixed-length chunks and joins them to get the requested sub-sequences for A and B) just before the shifted initialization. Maybe LRU/LFU works fast enough if the token access pattern is cache-friendly. If it's not cache-friendly, then perhaps just reaching the arrays could be the bottleneck, with or without shifting.
If you do the checking per byte instead of per 4 bytes, it requires only 4 arrays instead of 16 and shouldn't add too big a memory requirement on top of the caching; a sketch of this per-byte variant follows.
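A rough sketch of that per-byte variant with 4 pre-shifted copies is shown below; all names are illustrative, the tail (the last len % 4 tokens) is omitted, and the packing order is assumed to match getTokenFromBlock above:

#include <cstdint>
#include <cstring>
#include <vector>

struct ShiftedTokens
{
    std::vector<uint8_t> copy[4];   // copy[s] is the bitstream shifted down by 2*s bits

    explicit ShiftedTokens(const uint8_t* packed, std::size_t nBytes)
    {
        for (int s = 0; s < 4; ++s) {
            copy[s].resize(nBytes, 0);
            for (std::size_t i = 0; i < nBytes; ++i) {
                const uint8_t lo = uint8_t(packed[i] >> (2 * s));
                const uint8_t hi = (s != 0 && i + 1 < nBytes)
                                   ? uint8_t(packed[i + 1] << (8 - 2 * s)) : 0;
                copy[s][i] = uint8_t(lo | hi);
            }
        }
    }

    // Pointer to the byte where token index 'start' begins, now byte-aligned.
    const uint8_t* aligned(std::size_t start) const
    {
        return copy[start % 4].data() + start / 4;
    }
};

// Compare len tokens starting at startA / startB; the remaining len % 4 tokens
// still need a masked comparison, omitted here.
bool equalTokens(const ShiftedTokens& a, std::size_t startA,
                 const ShiftedTokens& b, std::size_t startB, std::size_t len)
{
    return std::memcmp(a.aligned(startA), b.aligned(startB), len / 4) == 0;
}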
You can also add an XOR result of fixed-length (like 50-100) sub-sequences for every offset as a way of exiting more quickly. Again, this requires 4x more memory space. If the XOR results of the first tokens (over that fixed length) are not equal, then the sequences are not equal. This would at least reduce the number of comparisons.
Another way is to directly cache f(x,y)->bool, like Python does with its own caching. But this would be much worse than the "fixed-length chunked caching & joining" approach, due to non-reusable parts and a lot of duplication.
I was given this algorithm to write a hash function:
BEGIN Hash (string)
UNSIGNED INTEGER key = 0;
FOR_EACH character IN string
key = ((key << 5) + key) ^ character;
END FOR_EACH
RETURN key;
END Hash
The << operator shifts bits to the left. The ^ refers to the XOR operation, and the character refers to the ASCII value of the character. Seems pretty straightforward.
Below is my code
unsigned int key = 0;
for (int i = 0; i < data.length(); i++) {
    key = ((key<<5) + key) ^ (int)data[i];
}
return key;
However, I keep getting ridiculously huge positive and negative numbers when I should actually get a hash value from 0 to n, where n is a value set by the user beforehand. I'm not sure where things went wrong, but I'm thinking it could be the XOR operation.
Any suggestions or opinions will be greatly appreciated. Thanks!
The output of this code is a 32-bit (or 64-bit or however wide your unsigned int is) unsigned integer. To restrict it to the range from 0 to n−1, simply reduce it modulo n, using the % operator:
unsigned int hash = key % n;
(It should be obvious that your code, as written, cannot return "a hash value from 0 - n", since n does not appear anywhere in your code.)
In fact, there's a good reason not to reduce the hash value modulo n too soon: if you ever need to grow your hash table, storing the unreduced hash codes of your strings saves you the effort of recalculating them whenever n changes.
Finally, a few general notes on your hash function:
As Joachim Pileborg comments above, the explicit (int) cast is unnecessary. If you want to keep it for clarity, it really should say (unsigned int) to match the type of key, since that's what the value actually gets converted into.
For unsigned integer types, ((key<<5) + key) is equal to 33 * key (since shifting left by 5 bits is the same as multiplying by 2^5 = 32). On modern CPUs, using multiplication is almost certainly faster; on old or very low-end processors with slow multiplication, it's likely that any decent compiler will optimize multiplication by a constant into a combination of shifts and adds anyway. Thus, either way, expressing the operation as a multiplication is IMO preferable.
You don't want to call data.length() on every iteration of the loop. Call it once before the loop and store the result in a variable.
Initializing key to zero means that your hash value is not affected by any leading zero bytes in the string. The original version of your hash function, due to Dan Bernstein, uses a (more or less random) initial value of 5381 instead.
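Putting those notes together, a revised version might look like the sketch below (the signature is assumed; the modulo-n reduction is left to the call site):

#include <string>

unsigned int hashCode(const std::string& data)
{
    unsigned int key = 5381;                 // Bernstein's initial value
    const std::size_t len = data.length();   // hoisted out of the loop
    for (std::size_t i = 0; i < len; i++) {
        key = 33 * key ^ (unsigned int)(unsigned char)data[i];
    }
    return key;                              // reduce with key % n at the point of use
}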
My goal is as the following,
Generate successive values, such that each new one was never generated before, until all possible values are generated. At this point, the counter starts the same sequence again. The main point here is that all possible values are generated without repetition (until the period is exhausted). It does not matter if the sequence is simply 0, 1, 2, 3, ..., or in some other order.
For example, if the range can be represented simply by an unsigned, then
void increment (unsigned &n) {++n;}
is enough. However, the integer range is larger than 64 bits. For example, in one place I need to generate a 256-bit sequence. A simple implementation is like the following, just to illustrate what I am trying to do:
typedef std::array<uint64_t, 4> ctr_type;
static constexpr uint64_t max = ~((uint64_t) 0);
void increment (ctr_type &ctr)
{
    if (ctr[0] < max) {++ctr[0]; return;}
    if (ctr[1] < max) {++ctr[1]; return;}
    if (ctr[2] < max) {++ctr[2]; return;}
    if (ctr[3] < max) {++ctr[3]; return;}
    ctr[0] = ctr[1] = ctr[2] = ctr[3] = 0;
}
So if ctr starts with all zeros, then ctr[0] is first increased one by one until it reaches max, and then ctr[1], and so on. If all 256 bits are set, then we reset it to all zeros and start again.
The problem is that such an implementation is surprisingly slow. My current improved version is roughly equivalent to the following:
void increment (ctr_type &ctr)
{
    std::size_t k = (!(~ctr[0])) + (!(~ctr[1])) + (!(~ctr[2])) + (!(~ctr[3]));
    if (k < 4)
        ++ctr[k];
    else
        memset(ctr.data(), 0, 32);
}
If the counter is only manipulated with the above increment function, and always starts with zero, then ctr[k] == 0 if ctr[k - 1] == 0. And thus the value k will be the index of the first element that is less than the maximum.
I expected the first to be faster, since branch misprediction shall happen only once in every 2^64 iterations. For the second, though misprediction only happens every 2^256 iterations, that shall not make a difference. And apart from the branching, it needs four bitwise negations, four boolean negations, and three additions, which might cost much more than the first.
However, with clang, gcc, and Intel icpc alike, the generated binaries show the second to be much faster.
My main question is: does anyone know a faster way to implement such a counter? It does not matter whether the counter starts by increasing the first integer, or whether it is implemented as an array of integers at all, as long as the algorithm generates all 2^256 combinations of 256 bits.
What makes things more complicated is that I also need a non-uniform increment. For example, each time the counter is incremented by K where K > 1; K almost always remains a constant. My current implementation is similar to the above.
To provide some more context, one place I am using the counters is as input to the AES-NI aesenc instruction. So from a distinct 128-bit integer (loaded into an __m128i), after going through 10 (or 12 or 14, depending on the key size) rounds of the instruction, a distinct 128-bit integer is generated. If I generate one __m128i integer at a time, then the cost of the increment matters little. However, since aesenc has quite a bit of latency, I generate integers in blocks. For example, I might have 4 blocks, ctr_type block[4], initialized equivalently to the following:
block[0]; // initialized to zero
block[1] = block[0]; increment(block[1]);
block[2] = block[1]; increment(block[2]);
block[3] = block[2]; increment(block[3]);
And each time I need new output, I increment each block[i] by 4 and generate 4 __m128i outputs at once. By interleaving instructions, overall I was able to increase the throughput and reduce the cycles per byte of output (cpB) from 6 to 0.9 when using two 64-bit integers as the counter and 8 blocks. However, if I instead use four 32-bit integers as the counter, the throughput, measured as bytes per second, is reduced by half. I know for a fact that on x86-64, 64-bit integers can be faster than 32-bit ones in some situations. But I did not expect such a simple increment operation to make such a big difference. I have carefully benchmarked the application, and the increment is indeed what slows down the program. Since the loading into __m128i and the storing of the __m128i output into usable 32-bit or 64-bit integers are done through aligned pointers, the only difference between the 32-bit and 64-bit versions is how the counter is incremented. I expected that the AES-NI instructions, after loading the integers into __m128i, would dominate the performance. But when using 4 or 8 blocks, that was clearly not the case.
So to summarize, my main question is whether anyone knows a way to improve the above counter implementation.
It's not only slow, but impossible. The total energy of the universe is insufficient for 2^256 bit changes, and even that count assumes a Gray-code counter (only one bit flips per increment).
The next thing, before optimization, is to fix the original implementation:
void increment (ctr_type &ctr)
{
    if (++ctr[0] != 0) return;
    if (++ctr[1] != 0) return;
    if (++ctr[2] != 0) return;
    ++ctr[3];
}
If each ctr[i] were not allowed to overflow to zero, the period would be just 4*(2^64), as in 0-9, 19, 29, 39, 49, ..., 99, 199, 299, ..., and 1999, 2999, 3999, ..., 9999.
As a reply to the comment: it takes 2^64 iterations to have the first overflow. Being generous, up to 2^32 iterations could take place in a second, meaning that the program would have to run for 2^32 seconds to see the first carry out. That's about 136 years.
EDIT
If the original implementation with 2^66 states is really what is wanted, then I'd suggest changing the interface and the functionality to something like:
(*counter) += 1;
while (*counter == 0)
{
    counter++; // Move to next word
    if (counter > tail_of_array) {
        counter = head_of_array;
        memset(counter, 0, 16);
        break;
    }
}
The point being, that the overflow is still very infrequent. Almost always there's just one word to be incremented.
If you're using GCC or compilers with __int128 like Clang or ICC
unsigned __int128 H = 0, L = 0;
L++;
if (L == 0) H++;
On systems where __int128 isn't available
std::array<uint64_t, 4> c{};

c[0]++;
if (c[0] == 0)
{
    c[1]++;
    if (c[1] == 0)
    {
        c[2]++;
        if (c[2] == 0)
        {
            c[3]++;
        }
    }
}
In inline assembly it's much easier to do this using the carry flag. Unfortunately, most high-level languages don't have a means to access it directly. Some compilers do have intrinsics for adding with carry, like __builtin_uaddll_overflow in GCC and __builtin_addcll in Clang.
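As an illustration, here is a hedged sketch of a 256-bit increment built on __builtin_uaddll_overflow (GCC/Clang only; the struct and function names are mine):

#include <cstdint>

struct ctr256 { uint64_t w[4]; };   // little-endian limb order, w[0] is least significant

inline void increment256(ctr256 &c)
{
    unsigned long long r;
    bool carry = __builtin_uaddll_overflow(c.w[0], 1ULL, &r);
    c.w[0] = r;
    for (int i = 1; carry && i < 4; ++i) {   // propagate the (rare) carry
        carry = __builtin_uaddll_overflow(c.w[i], 1ULL, &r);
        c.w[i] = r;
    }
}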
Anyway, this is rather a waste of time, since the total number of particles in the universe is only about 10^80 and you cannot even count a 64-bit counter all the way up within your lifetime.
Neither of your counter versions increment correctly. Instead of counting up to UINT256_MAX, you are actually just counting up to UINT64_MAX 4 times and then starting back at 0 again. This is apparent from the fact that you do not bother to clear any of the indices that has reached the max value until all of them have reached the max value. If you are measuring performance based on how often the counter reaches all bits 0, then this is why. Thus your algorithms do not generate all combinations of 256 bits, which is a stated requirement.
You mention "Generate successive values, such that each new one was never generated before"
To generate a set of such values, look at linear congruential generators
The sequence x = (x*1 + 1) % (power_of_2): you already thought about it; these are simply sequential numbers.
The sequence x = (x*13 + 137) % (power_of_2): this generates unique numbers with a predictable period (the full power_of_2, since the multiplier is 1 mod 4 and the increment is odd), and the unique numbers look more "random", kind of pseudo-random. You need to resort to arbitrary-precision arithmetic to get it working, along with all the trickery of multiplication by constants. This will give you a nice way to start.
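A hedged 64-bit illustration (the 256-bit case additionally needs the multi-word arithmetic just mentioned; the function name is mine):

#include <cstdint>

// Full-period LCG modulo 2^64: with a multiplier that is 1 mod 4 and an odd
// increment, every 64-bit value is visited exactly once per period.
inline uint64_t lcg_next(uint64_t x)
{
    return x * 13u + 137u;   // wraps modulo 2^64
}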
You also complain that your simple code is "slow"
At a 4.2 GHz frequency, running 4 instructions per cycle and using AVX-512 vectorization, on a 64-core computer with a multithreaded version of your program doing nothing else than increments, you get only 64x8x4x2^32 = 8796093022208 increments per second; that is, 2^64 increments are reached in 25 days. This post is old; you might have reached 841632698362998292480 by now, running such a program on such a machine, and you will gloriously reach 1683265396725996584960 in 2 years' time.
You also require "until all possible values are generated".
You can only generate a finite number of values, depending how much you are willing to pay for the energy to power your computers. As mentioned in the other responses, with 128 or 256-bit numbers, even being the richest man in the world, you will never wrap around before the first of these conditions occurs:
getting out of money
end of humankind (nobody will get the outcome of your software)
burning the energy from the last particles of the universe
Multi-word addition can easily be accomplished in portable fashion by using three macros that mimic three types of addition instructions found on many processors:
ADDcc adds two words, and sets the carry if there was unsigned overflow
ADDC adds two words plus the carry (from a previous addition)
ADDCcc adds two words plus the carry, and sets the carry if there was unsigned overflow
A multi-word addition of two words uses ADDcc on the least significant words followed by ADDC on the most significant words. A multi-word addition with more than two words forms the sequence ADDcc, ADDCcc, ..., ADDC. The MIPS architecture is a processor architecture without condition codes and therefore without a carry flag. The macro implementations shown below basically follow the techniques used on MIPS processors for multi-word additions.
The ISO-C99 code below shows the operation of a 32-bit counter and a 64-bit counter based on 16-bit "words". I chose arrays as the underlying data structure, but one might also use struct, for example. Use of a struct will be significantly faster if each operand only comprises a few words, as the overhead of array indexing is eliminated. One would want to use the widest available integer type for each "word" for best performance. In the example from the question that would likely be a 256-bit counter comprising four uint64_t components.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

#define ADDCcc(a,b,cy,t0,t1) \
    (t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
    (t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
    (t0=(b)+cy, t1=(a), t0+t1)

typedef uint16_t T;

/* increment a multi-word counter comprising n words */
void inc_array (T *counter, const T *increment, int n)
{
    T cy, t0, t1;
    counter [0] = ADDcc (counter [0], increment [0], cy, t0, t1);
    for (int i = 1; i < (n - 1); i++) {
        counter [i] = ADDCcc (counter [i], increment [i], cy, t0, t1);
    }
    counter [n-1] = ADDC (counter [n-1], increment [n-1], cy, t0, t1);
}

#define INCREMENT (10)
#define UINT32_ARRAY_LEN (2)
#define UINT64_ARRAY_LEN (4)

int main (void)
{
    uint32_t count32 = 0, incr32 = INCREMENT;
    T count_arr2 [UINT32_ARRAY_LEN] = {0};
    T incr_arr2 [UINT32_ARRAY_LEN] = {INCREMENT};
    do {
        count32 = count32 + incr32;
        inc_array (count_arr2, incr_arr2, UINT32_ARRAY_LEN);
    } while (count32 < (0U - INCREMENT - 1));
    printf ("count32 = %08x arr_count = %08x\n",
            count32, (((uint32_t)count_arr2 [1] << 16) +
                      ((uint32_t)count_arr2 [0] << 0)));

    uint64_t count64 = 0, incr64 = INCREMENT;
    T count_arr4 [UINT64_ARRAY_LEN] = {0};
    T incr_arr4 [UINT64_ARRAY_LEN] = {INCREMENT};
    do {
        count64 = count64 + incr64;
        inc_array (count_arr4, incr_arr4, UINT64_ARRAY_LEN);
    } while (count64 < 0xa987654321ULL);
    printf ("count64 = %016llx arr_count = %016llx\n",
            count64, (((uint64_t)count_arr4 [3] << 48) +
                      ((uint64_t)count_arr4 [2] << 32) +
                      ((uint64_t)count_arr4 [1] << 16) +
                      ((uint64_t)count_arr4 [0] << 0)));
    return EXIT_SUCCESS;
}
Compiled with full optimization, the 32-bit example executes in about a second, while the 64-bit example runs for about a minute on a modern PC. The output of the program should look like so:
count32 = fffffffa arr_count = fffffffa
count64 = 000000a987654326 arr_count = 000000a987654326
Non-portable code that is based on inline assembly or proprietary extensions for wide integer types may execute about two to three times as fast as the portable solution presented here.
I was studying hash-based sort and I found that using prime numbers in a hash function is considered a good idea, because multiplying each character of the key by a prime number and adding the results up would produce a unique value (because primes are unique) and a prime number like 31 would produce better distribution of keys.
key(s) = s[0]*31^(len-1) + s[1]*31^(len-2) + ... + s[len-1]
Sample code:
public int hashCode( )
{
    int h = hash;
    if (h == 0)
    {
        for (int i = 0; i < chars.length; i++)
        {
            h = MULT*h + chars[i];
        }
        hash = h;
    }
    return h;
}
I would like to understand why the use of even numbers for multiplying each character is a bad idea in the context of this explanation below (found on another forum; it sounds like a good explanation, but I'm failing to grasp it). If the reasoning below is not valid, I would appreciate a simpler explanation.
Suppose MULT were 26, and consider hashing a hundred-character string. How much influence does the string's first character have on the final value of 'h'? The first character's value will have been multiplied by MULT 99 times, so if the arithmetic were done in infinite precision the value would consist of some jumble of bits followed by 99 low-order zero bits -- each time you multiply by MULT you introduce another low-order zero, right? The computer's finite arithmetic just chops away all the excess high-order bits, so the first character's actual contribution to 'h' is ... precisely zero! The 'h' value depends only on the rightmost 32 string characters (assuming a 32-bit int), and even then things are not wonderful: the first of those final 32 bytes influences only the leftmost bit of 'h' and has no effect on the remaining 31. Clearly, an even-valued MULT is a poor idea.
I think it's easier to see if you use 2 instead of 26. They both have the same effect on the lowest-order bit of h. Consider a 33 character string of some character c followed by 32 zero bytes (for illustrative purposes). Since the string isn't wholly null you'd hope the hash would be nonzero.
For the first character, your computed hash h is equal to c[0]. For the second character, you take h * 2 + c[1]. So now h is 2*c[0]. For the third character h is now h*2 + c[2] which works out to 4*c[0]. Repeat this 30 more times, and you can see that the multiplier uses more bits than are available in your destination, meaning effectively c[0] had no impact on the final hash at all.
The end math works out exactly the same with a different multiplier like 26, except that the intermediate hashes will modulo 2^32 every so often during the process. Since 26 is even it still adds one 0 bit to the low end each iteration.
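A quick way to see this concretely is the small test program below (my own sketch; it uses the even multiplier 26 from the quoted text and the odd multiplier 31 for contrast):

#include <cstdint>
#include <cstdio>
#include <cstring>

static uint32_t hash_mult(const char *s, uint32_t mult)
{
    uint32_t h = 0;
    for (; *s; ++s)
        h = h * mult + (uint32_t)(unsigned char)*s;
    return h;
}

int main()
{
    char a[40], b[40];
    std::memset(a, 'x', 39); a[39] = '\0'; a[0] = 'A';
    std::memcpy(b, a, 40);   b[0] = 'B';       // differs only in the first character
    std::printf("%08x %08x\n", hash_mult(a, 26), hash_mult(b, 26));  // identical
    std::printf("%08x %08x\n", hash_mult(a, 31), hash_mult(b, 31));  // different
    return 0;
}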
This hash can be described like this (here ^ is exponentiation, not xor).
hash(string) = sum_over_i(s[i] * MULT^(strlen(s) - i - 1)) % (2^32).
Look at the contribution of the first character. It's
(s[0] * MULT^(strlen(s) - 1)) % (2^32).
If the string is long enough (strlen(s) > 32) then this is zero.
Other people have posted the answer -- if you use an even multiple, then only the last characters in the string matter for computing the hash, as the early character's influence will have shifted out of the register.
Now let's consider what happens when you use a multiplier like 31. Well, 31 is 32-1, or 2^5 - 1. So when you use that, your final hash value will be:
\sum_i c_i 2^{5(len-i)} - \sum_i c_i
Unfortunately Stack Overflow doesn't understand TeX math notation, so the above is hard to read, but it is two summations over the characters in the string, where the first one shifts each character left by 5 bits for each subsequent character in the string. So on a 32-bit machine, that will shift off the top for all except the last seven characters of the string.
The upshot of this is that using a multiplier of 31 means that while characters other than the last seven have an effect on the hash, it is completely independent of their order. If you take two strings that have the same last 7 characters, and whose other characters are also the same but in a different order, you'll get the same hash for both. You'll also get the same hash for things like "az" and "by" anywhere other than in the last 7 chars.
So using a prime multiplier, while much better than an even multiplier, is still not very good. Better is to use a rotate instruction, which shifts the bits back into the bottom when they shift out the top. Something like:
public unsigned hashCode(string chars)
{
    unsigned h = 0;
    for (int i = 0; i < chars.length; i++) {
        h = (h<<5) + (h>>27); // ROL by 5, assuming 32 bits here
        h += chars[i];
    }
    return h;
}
Of course, this depends on your compiler being smart enough to recognize the idiom for a rotate instruction and turn it into a single instruction for maximum efficiency.
This also still has the problem that swapping 32-character blocks in the string will give the same hash value, so it's far from strong, but it is probably adequate for most non-cryptographic purposes.
would produce a unique value
Stop right there. Hashes are not unique. A good hash algorithm will minimize collisions, but the pigeonhole principle assures us that perfectly avoiding collisions is not possible (for any datatype with non-trivial information content).
I'm looking for an extremely fast atof() implementation on IA32 optimized for US-en locale, ASCII, and non-scientific notation. The windows multithreaded CRT falls down miserably here as it checks for locale changes on every call to isdigit(). Our current best is derived from the best of perl + tcl's atof implementation, and outperforms msvcrt.dll's atof by an order of magnitude. I want to do better, but am out of ideas. The BCD related x86 instructions seemed promising, but I couldn't get it to outperform the perl/tcl C code. Can any SO'ers dig up a link to the best out there? Non x86 assembly based solutions are also welcome.
Clarifications based upon initial answers:
Inaccuracies of ~2 ulp are fine for this application.
The numbers to be converted will arrive in ascii messages over the network in small batches and our application needs to convert them in the lowest latency possible.
What is your accuracy requirement? If you truly need it "correct" (always getting the nearest floating-point value to the decimal specified), it will probably be hard to beat the standard library versions (other than removing locale support, which you've already done), since this requires doing arbitrary-precision arithmetic. If you're willing to tolerate an ulp or two of error (and more than that for subnormals), the sort of approach proposed by cruzer can work and may be faster, but it definitely will not produce <0.5 ulp output. You will do better accuracy-wise by computing the integer and fractional parts separately and computing the fraction at the end (e.g. for 12345.6789, compute it as 12345 + 6789 / 10000.0, rather than 6*.1 + 7*.01 + 8*.001 + 9*.0001), since 0.1 has no exact binary representation and error accumulates rapidly as you compute 0.1^n. This also lets you do most of the math with integers instead of floats.
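For example, the integer/fraction split might look roughly like this minimal sketch (no sign, exponent, rounding, or overflow handling for very long digit strings; the name is mine):

#include <cstdint>

double parse_fixed(const char *s)
{
    std::uint64_t ipart = 0, fpart = 0, fdiv = 1;
    while (*s >= '0' && *s <= '9')
        ipart = ipart * 10 + std::uint64_t(*s++ - '0');
    if (*s == '.') {
        ++s;
        while (*s >= '0' && *s <= '9') {
            fpart = fpart * 10 + std::uint64_t(*s++ - '0');
            fdiv *= 10;
        }
    }
    return double(ipart) + double(fpart) / double(fdiv);   // one division at the end
}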
The BCD instructions haven't been implemented in hardware since (IIRC) the 286, and are simply microcoded nowadays. They are unlikely to be particularly high-performance.
This implementation I just finished coding runs twice as fast as the built-in 'atof' on my desktop. It converts 1024*1024*39 number inputs in 2 seconds, compared to 4 seconds with my system's standard GNU 'atof' (including the setup time, getting memory, and all that).
UPDATE:
Sorry I have to revoke my twice as fast claim. It's faster if the thing you're converting is already in a string, but if you're passing it hard coded string literals, it's about the same as atof. However I'm going to leave it here, as possibly with some tweaking of the ragel file and state machine, you may be able to generate faster code for specific purposes.
https://github.com/matiu2/yajp
The interesting files for you are:
https://github.com/matiu2/yajp/blob/master/tests/test_number.cpp
https://github.com/matiu2/yajp/blob/master/number.hpp
Also, you may be interested in the state machine that does the conversion.
It seems to me you want to build (by hand) what amounts to a state machine where each state handles the Nth input digit or exponent digits; this state machine would be shaped like a tree (no loops!). The goal is to do integer arithmetic wherever possible, and (obviously) to remember state variables ("leading minus", "decimal point at position 3") in the states implicitly, to avoid assignments, stores and later fetch/tests of such values. Implement the state machine with plain old "if" statements on the input characters only (so your tree gets to be a set of nested ifs). Inline accesses to buffer characters; you don't want a function call to getchar to slow you down.
Leading zeros can simply be suppressed; you might need a loop here to handle ridiculously long leading-zero sequences. The first nonzero digit can be collected without zeroing an accumulator or multiplying by ten. The first 4-9 nonzero digits (for 16-bit or 32-bit integers) can be collected with integer multiplies by the constant value ten (turned by most compilers into a few shifts and adds). [Over the top: zero digits don't require any work until a nonzero digit is found, and then a multiply by 10^N for N sequential zeros is required; you can wire all this into the state machine]. Digits following the first 4-9 may be collected using 32- or 64-bit multiplies depending on the word size of your machine.
Since you don't care about accuracy, you can simply ignore digits after you've collected 32 or 64 bits' worth; I'd guess that you can actually stop when you have some fixed number of nonzero digits, based on what your application actually does with these numbers.
A decimal point found in the digit string simply causes a branch in the state machine tree. That branch knows the implicit location of the point and therefore later how to scale by a power of ten appropriately. With effort, you may be able to combine some state machine sub-trees if you don't like the size of this code.
[Over the top: keep the integer and fractional parts as separate (small) integers. This will require an additional floating point operation at the end to combine the integer and fraction parts, probably not worth it].
[Over the top: collect 2 characters for digit pairs into a 16-bit value, and look up the 16-bit value in a table. This avoids a multiply in the registers in trade for a memory access; probably not a win on modern machines.]
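For what it's worth, a hedged sketch of that pair-lookup idea (the table and helper names are mine; a little-endian load of the two characters is assumed):

#include <cstdint>
#include <cstring>

static std::uint8_t pair_value[65536];   // 0..99 for the byte pairs "00".."99"

static void init_pair_table()
{
    for (int hi = 0; hi < 10; ++hi)
        for (int lo = 0; lo < 10; ++lo) {
            const std::uint16_t key = std::uint16_t(('0' + hi) | (('0' + lo) << 8));
            pair_value[key] = std::uint8_t(hi * 10 + lo);
        }
}

// acc = acc * 100 + value of the two digits at p
static inline std::uint64_t add_pair(std::uint64_t acc, const char *p)
{
    std::uint16_t key;
    std::memcpy(&key, p, 2);
    return acc * 100 + pair_value[key];
}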
On encountering "E", collect the exponent as an integer as above; look up accurately precomputed/scaled powers of ten in a table of precomputed multipliers (reciprocals if a "-" sign is present in the exponent) and multiply the collected mantissa (don't ever do a float divide). Since each exponent-collection routine is in a different branch (leaf) of the tree, it has to adjust for the apparent or actual location of the decimal point by offsetting the power-of-ten index.
[Over the top: you can avoid the cost of ptr++ if you know the characters for the number are stored linearly in a buffer and do not cross the buffer boundary. In the kth state along a tree branch, you can access the kth character as *(start+k). A good compiler can usually hide the "...+k" in an indexed offset in the addressing mode.]
Done right, this scheme does roughly one cheap multiply-add per nonzero digit, one cast-to-float of the mantissa, and one floating multiply to scale the result by exponent and location of decimal point.
I have not implemented the above. I have implemented versions of it with loops, they're pretty fast.
I've implemented something you may find useful.
In comparison with atof it's about x5 faster and if used with __forceinline about x10 faster.
Another nice thing is that it seems to have exactly the same arithmetic as the CRT implementation.
Of course it has some cons too:
it supports only single precision float,
and doesn't scan any special values like #INF, etc...
__forceinline bool float_scan(const wchar_t* wcs, float* val)
{
int hdr=0;
while (wcs[hdr]==L' ')
hdr++;
int cur=hdr;
bool negative=false;
bool has_sign=false;
if (wcs[cur]==L'+' || wcs[cur]==L'-')
{
if (wcs[cur]==L'-')
negative=true;
has_sign=true;
cur++;
}
else
has_sign=false;
int quot_digs=0;
int frac_digs=0;
bool full=false;
wchar_t period=0;
int binexp=0;
int decexp=0;
unsigned long value=0;
while (wcs[cur]>=L'0' && wcs[cur]<=L'9')
{
if (!full)
{
if (value>=0x19999999 && wcs[cur]-L'0'>5 || value>0x19999999)
{
full=true;
decexp++;
}
else
value=value*10+wcs[cur]-L'0';
}
else
decexp++;
quot_digs++;
cur++;
}
if (wcs[cur]==L'.' || wcs[cur]==L',')
{
period=wcs[cur];
cur++;
while (wcs[cur]>=L'0' && wcs[cur]<=L'9')
{
if (!full)
{
if (value>=0x19999999 && wcs[cur]-L'0'>5 || value>0x19999999)
full=true;
else
{
decexp--;
value=value*10+wcs[cur]-L'0';
}
}
frac_digs++;
cur++;
}
}
if (!quot_digs && !frac_digs)
return false;
wchar_t exp_char=0;
int decexp2=0; // explicit exponent
bool exp_negative=false;
bool has_expsign=false;
int exp_digs=0;
// even if value is 0, we still need to eat exponent chars
if (wcs[cur]==L'e' || wcs[cur]==L'E')
{
exp_char=wcs[cur];
cur++;
if (wcs[cur]==L'+' || wcs[cur]==L'-')
{
has_expsign=true;
if (wcs[cur]=='-')
exp_negative=true;
cur++;
}
while (wcs[cur]>=L'0' && wcs[cur]<=L'9')
{
if (decexp2>=0x19999999)
return false;
decexp2=10*decexp2+wcs[cur]-L'0';
exp_digs++;
cur++;
}
if (exp_negative)
decexp-=decexp2;
else
decexp+=decexp2;
}
// end of wcs scan, cur contains value's tail
if (value)
{
while (value<=0x19999999)
{
decexp--;
value=value*10;
}
if (decexp)
{
// ensure 1bit space for mul by something lower than 2.0
if (value&0x80000000)
{
value>>=1;
binexp++;
}
if (decexp>308 || decexp<-307)
return false;
// convert exp from 10 to 2 (using FPU)
int E;
double v=pow(10.0,decexp);
double m=frexp(v,&E);
m=2.0*m;
E--;
value=(unsigned long)floor(value*m);
binexp+=E;
}
binexp+=23; // rebase exponent to 23 bits of mantissa
// so the value is: +/- VALUE * pow(2,BINEXP);
// (normalize mantissa to 24 bits, update exponent)
while (value&0xFE000000)
{
value>>=1;
binexp++;
}
if (value&0x01000000)
{
if (value&1)
value++;
value>>=1;
binexp++;
if (value&0x01000000)
{
value>>=1;
binexp++;
}
}
while (!(value&0x00800000))
{
value<<=1;
binexp--;
}
if (binexp<-127)
{
// underflow
value=0;
binexp=-127;
}
else
if (binexp>128)
return false;
//exclude "implicit 1"
value&=0x007FFFFF;
// encode exponent
unsigned long exponent=(binexp+127)<<23;
value |= exponent;
}
// encode sign
unsigned long sign=negative<<31;
value |= sign;
if (val)
{
*(unsigned long*)val=value;
}
return true;
}
I remember we had a Winforms application that performed so slowly while parsing some data interchange files, and we all thought it was the db server thrashing, but our smart boss actually found out that the bottleneck was in the call that was converting the parsed strings into decimals!
The simplest is to loop for each digit (character) in the string, keep a running total, multiply the total by 10 then add the value of the next digit. Keep on doing this until you reach the end of the string or you encounter a dot. If you encounter a dot, separate the whole number part from the fractional part, then have a multiplier that divides itself by 10 for each digit. Keep on adding them up as you go.
Example: 123.456
running total = 0, add 1 (now it's 1)
running total = 1 * 10 = 10, add 2 (now it's 12)
running total = 12 * 10 = 120, add 3 (now it's 123)
encountered a dot, prepare for fractional part
multiplier = 0.1, multiply by 4, get 0.4, add to running total, makes 123.4
multiplier = 0.1 / 10 = 0.01, multiply by 5, get 0.05, add to running total, makes 123.45
multiplier = 0.01 / 10 = 0.001, multiply by 6, get 0.006, add to running total, makes 123.456
Of course, testing for a number's correctness as well as negative numbers will make it more complicated. But if you can "assume" that the input is correct, you can make the code much simpler and faster.
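In code, that running-total method might look roughly like this (a minimal sketch under the "assume the input is correct" simplification mentioned above):

double simple_atof(const char *s)
{
    double total = 0.0;
    while (*s >= '0' && *s <= '9')
        total = total * 10.0 + (*s++ - '0');
    if (*s == '.') {
        double multiplier = 0.1;
        ++s;
        while (*s >= '0' && *s <= '9') {
            total += (*s++ - '0') * multiplier;
            multiplier /= 10.0;
        }
    }
    return total;
}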
Have you considered looking into having the GPU do this work? If you can load the strings into GPU memory and have it process them all you may find a good algorithm that will run significantly faster than your processor.
Alternately, do it in an FPGA - There are FPGA PCI-E boards that you can use to make arbitrary coprocessors. Use DMA to point the FPGA at the part of memory containing the array of strings you want to convert and let it whizz through them leaving the converted values behind.
Have you looked at a quad core processor? The real bottleneck in most of these cases is memory access anyway...
-Adam