I was studying hash-based sorting, and I found that using prime numbers in a hash function is considered a good idea: multiplying each character of the key by a prime number and adding up the results would supposedly produce a unique value (because primes are unique), and a prime like 31 would give a better distribution of keys.
key(s) = s[0]*31^(len-1) + s[1]*31^(len-2) + ... + s[len-1]
Sample code:
public int hashCode()
{
    int h = hash;                      // cached hash (0 means "not computed yet")
    if (h == 0)
    {
        for (int i = 0; i < chars.length; i++)
        {
            h = MULT * h + chars[i];   // Horner's rule: multiply, then add next char
        }
        hash = h;                      // cache the result
    }
    return h;
}
I would like to understand why using even numbers to multiply each character is a bad idea, in the context of the explanation below (found on another forum; it sounds like a good explanation, but I'm failing to grasp it). If the reasoning below is not valid, I would appreciate a simpler explanation.
Suppose MULT were 26, and consider hashing a hundred-character string. How much influence does the string's first character have on the final value of h? The first character's value will have been multiplied by MULT 99 times, so if the arithmetic were done in infinite precision the value would consist of some jumble of bits followed by 99 low-order zero bits -- each time you multiply by MULT you introduce another low-order zero, right? The computer's finite arithmetic just chops away all the excess high-order bits, so the first character's actual contribution to h is ... precisely zero! The h value depends only on the rightmost 32 string characters (assuming a 32-bit int), and even then things are not wonderful: the first of those final 32 bytes influences only the leftmost bit of h and has no effect on the remaining 31. Clearly, an even-valued MULT is a poor idea.
I think it's easier to see if you use 2 instead of 26; they both have the same effect on the lowest-order bit of h. Consider a 33-character string of some character c followed by 32 zero bytes (for illustrative purposes). Since the string isn't wholly null, you'd hope the hash would be nonzero.
For the first character, the computed hash h is equal to c[0]. For the second character, you take h * 2 + c[1], so now h is 2*c[0]. For the third character, h becomes h*2 + c[2], which works out to 4*c[0]. Repeat this 30 more times and the contribution of c[0] has been multiplied by 2^32, which needs more bits than your destination has, meaning c[0] effectively had no impact on the final hash at all.
The end math works out exactly the same with a different multiplier like 26, except that the intermediate hashes wrap modulo 2^32 every so often during the process. Since 26 is even, it still adds one zero bit at the low end on each iteration.
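A quick way to see this concretely is to hash two long strings that differ only in their first character, using a hypothetical hash_mult helper in C++ that follows the same Horner-style loop as the sample code above: with MULT = 26 the two hashes collide, because 26^32 contains a factor of 2^32 and therefore vanishes in 32-bit arithmetic, while with MULT = 31 they differ.

#include <cstdint>
#include <cstdio>
#include <cstring>

// Same polynomial hash as above, with a configurable multiplier.
std::uint32_t hash_mult(const char *s, std::uint32_t mult)
{
    std::uint32_t h = 0;
    for (; *s; ++s)
        h = mult * h + (unsigned char)*s;
    return h;
}

int main()
{
    char a[40], b[40];
    std::memset(a, 'x', 33); a[0] = 'A'; a[33] = '\0';   // 33 chars, first char 'A'
    std::memcpy(b, a, sizeof a); b[0] = 'B';             // identical except the first char
    std::printf("MULT=26: %u vs %u\n", hash_mult(a, 26), hash_mult(b, 26));  // collide
    std::printf("MULT=31: %u vs %u\n", hash_mult(a, 31), hash_mult(b, 31));  // differ
    return 0;
}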
This hash can be described like this (here ^ is exponentiation, not xor).
hash(string) = sum_over_i(s[i] * MULT^(strlen(s) - i - 1)) % (2^32).
Look at the contribution of the first character. It's
(s[0] * MULT^(strlen(s) - 1)) % (2^32).
If MULT is even and the string is long enough (strlen(s) > 32), then this term is zero mod 2^32.
Other people have posted the answer -- if you use an even multiplier, then only the last characters in the string matter for computing the hash, as the early characters' influence will have shifted out of the register.
Now let's consider what happens when you use a multiplier like 31. Well, 31 is 32 - 1, or 2^5 - 1. So when you use it, your final hash value will be:
\sum_i{c_i 2^{5(len-i)}} - \sum_i{c_i}
Unfortunately Stack Overflow doesn't understand TeX math notation, so the above is hard to read, but it's two summations over the characters in the string, where the first one shifts each character left by 5 bits for each subsequent character in the string. On a 32-bit machine, that shifts off the top for all except the last seven characters of the string.
The upshot of this is that using a multiplier of 31 means that while characters other than the last seven have an effect on the hash, it's completely independent of their order. If you take two strings that have the same last 7 characters, and whose other characters are also the same but in a different order, you'll get the same hash for both. You'll also get the same hash for things like "az" versus "by" anywhere other than in the last 7 chars.
So using a prime multiplier, while much better than an even multiplier, is still not very good. Better is to use a rotate instruction, which shifts the bits back into the bottom when they shift out the top. Something like:
unsigned hashCode(const char *chars, int length)
{
    unsigned h = 0;
    for (int i = 0; i < length; i++) {
        h = (h << 5) | (h >> 27);   // ROL by 5, assuming 32-bit unsigned here
        h += (unsigned char)chars[i];
    }
    return h;
}
Of course, this depends on your compiler being smart enough to recognize the idiom for a rotate instruction and turn it into a single instruction for maximum efficiency.
This also still has the problem that swapping 32-character blocks in the string will give the same hash value, so it's far from strong, but probably adequate for most non-cryptographic purposes.
would produce a unique value
Stop right there. Hashes are not unique. A good hash algorithm will minimize collisions, but the pigeonhole principle assures us that perfectly avoiding collisions is not possible (for any datatype with non-trivial information content).
Related
Is there any good way to optimize this function in terms of execution time? My final goal is to parse a long string composed of several integers (thousands of integers per line, and thousands of lines). This was my initial solution.
int64_t get_next_int(char *newLine) {
    char *token = strtok(newLine, " ");
    if (token == NULL) {
        exit(0);
    }
    return atoll(token);
}
More details: I need the "state"-based behaviour of strtok, so the padding handled by strtok should remain in the final string. atoll does not need any kind of verification.
Target system: Intel x86_64 (Xeon series)
Related topics:
atoi optimization: C++ most efficient way to convert string to int (faster than atoi)
First off: I find that optimizing string-conversion routines in signal processing chains is, most of the time, totally in vain. Your system loads data in string form from some mass storage (where it was put by something that didn't care about performance, since otherwise it wouldn't have chosen a string format in the first place), and if you compare the read speed of anything but clusters of SSDs attached via PCIe with how fast atoll is, you'll notice that you're losing a negligible amount of time to inefficient conversion. If you pipeline loading parts of that string with conversion, the time spent waiting for storage will not even remotely be filled up by converting, so even without any algorithmic optimization, pipelining/multi-threading will eliminate practically all time spent on conversion.
I'm going to go ahead and assume your integer-containing string is sufficiently large. Like, tens of millions of integers. Otherwise, all optimization might be pretty premature, considering there's little to complain about in std::iostream performance.
Now, the trick is that no performance optimization can be done once the performance of your conversion routine hits the memory bandwidth barrier. To push that barrier as far as possible, it's crucial to optimize usage of CPU caches – hence, doing linear access and shuffling memory as little as possible is crucial here. Also, if you care for speed, you don't want to call a function every time you need to convert a few-digit number – the call overhead (saving/restoring stack, jumping back and forth) will be significant. So if you're after performance, you'll do the conversion of the whole string at once, and then just access the resulting integer array.
So, on a modern, SSE4.2-capable x86 processor, you'd have roughly something like:
Outer loop, jumps in steps of 16:
load 128 bit of input string into 128 bit SIMD register
run something like _mm_cmpestri to find the indices of the delimiters and the \0 terminator in all these 16 bytes at once
inner loop over the found indices
Use SSE copy/shift/immediate instructions to isolate substrings; fill the others with 0
prepend saved "last characters" from previous iteration (if any – should only be the case for first inner loop iteration per outer loop iteration)
subtract '0' from each of the digits, again using SSE instructions to do up to 16 subtractions with a single instruction (_mm_sub_epi8)
convert the eight 16bit subwords to eight 128 bit words containing two packed 64bit integers each (one instruction per 16bit, _mm_cvtepi8_epi64, I think)
initialize a __m128i register with [10^15 10^14], let's call it powers
loop over pairs of dual-64-bit words: (each step should be one SSE instruction)
multiply first with powers
divide powers by [100 100]
multiply second with powers
add results to dual-64bit accumulator
sum the two values in accumulator
store the result to integer array
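If you strip away the SIMD part, the "convert the whole string at once and then just access the resulting integer array" structure can be sketched as a plain scalar baseline like the following (the function name and the delimiter set are illustrative assumptions; no overflow or error handling):

#include <cstddef>
#include <cstdint>
#include <vector>

// One linear pass over the whole buffer, appending each parsed integer.
// Delimiters are assumed to be spaces and newlines.
std::vector<int64_t> parse_all(const char *buf, std::size_t len)
{
    std::vector<int64_t> out;
    std::size_t i = 0;
    while (i < len) {
        while (i < len && (buf[i] == ' ' || buf[i] == '\n'))
            ++i;                                // skip delimiters
        if (i >= len)
            break;
        bool neg = (buf[i] == '-');
        if (neg)
            ++i;
        int64_t v = 0;
        while (i < len && buf[i] >= '0' && buf[i] <= '9')
            v = v * 10 + (buf[i++] - '0');
        out.push_back(neg ? -v : v);
    }
    return out;
}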
I'd rather use something along the lines of a std::istringstream:
int64_t get_next_int(std::istringstream& line) {
    int64_t token;
    if (!(line >> token))
        exit(0);
    return token;
}
std::istringstream line(newLine);
int64_t i = get_next_int(line);
strtok() has well known drawbacks, and you don't want to use it at all.
What about:

int64_t n = 0;

// Find the token: skip leading spaces
for ( ; *newline == ' '; newline++)
    ;
if (*newline == 0)
    exit(0);                // not found

// Scan and convert the token
for ( ; (unsigned)(*newline - '0') < 10; newline++)
    n = 10 * n + (*newline - '0');

return n;
As far as I understand from your code, it returns at the first split. It seems that at the first parse (before the space character) it will return 0 if the entry is non-numeric, or alphanumeric with the alphabetic part at the beginning; if the number comes first, it will return just the number. In other words, you just need a string for the conversion, so you don't need tokenizing; just check whether the string is null. You can change the return type as well: if you need a type with _exactly_ 64 bits, use (u)int64_t; if you need _at least_ 64 bits, (unsigned) long long is perfectly fine, as would be (u)int_least64_t. I think your code is a bit of gobbledygook; show exactly what you want, without simplification.
/*
* ascii-to-longlong conversion
*
* no error checking; assumes decimal digits
*
* efficient conversion:
* start with value = 0
* then, starting at first character, repeat the following
* until the end of the string:
*
* new value = (10 * (old value)) + decimal value of next character
*
*/
long long my_atoll(char *instr)
{
    if (instr[0] == '\0')
        return -1;

    long long retval = 0;
    for (; *instr; instr++) {
        retval = 10 * retval + (*instr - '0');
    }
    return retval;
}
I have the following hash algorithm:
unsigned long specialNum = 0x4E67C6A7;
unsigned int ch;
char inputVal[] = " AAPB2GXG";

for (int i = 0; i < strlen(inputVal); i++)
{
    ch = inputVal[i];
    ch = ch + (specialNum * 32);
    ch = ch + (specialNum / 4);
    specialNum = bitXor(specialNum, ch);
}
unsigned int outputVal = specialNum;
The bitXor simply does the Xor operation:
int bitXor(int a, int b)
{
    return (a & ~b) | (~a & b);
}
Now I want to find an algorithm that can generate an "inputVal" when the outputVal is given. (The generated inputVal does not necessarily have to be the same as the original inputVal; that's why I want to find a collision.)
This means that I need an algorithm that generates a solution which, when fed into the algorithm above, produces the same result as the specified "outputVal".
The length of solution to be generated should be less than or equal to 32.
Method 1: Brute force. Not a big deal, because your "specialNum" is always in the range of an int, so after trying on average a few billion input values, you find the right one. Should be done in a few seconds.
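A minimal sketch of this brute force in C++ (assuming 32-bit arithmetic throughout, which matches the unsigned int output; the alphabet, the fixed 4-character candidate length, and the function names are arbitrary illustration choices, and you may need longer candidates or a larger alphabet before a match turns up):

#include <cstdint>
#include <string>

// Recompute the hash exactly as in the question.
std::uint32_t hash_input(const std::string &s)
{
    std::uint32_t specialNum = 0x4E67C6A7;
    for (char c : s) {
        std::uint32_t ch = (unsigned char)c;
        ch += specialNum * 32;
        ch += specialNum / 4;
        specialNum ^= ch;               // bitXor
    }
    return specialNum;
}

// Try every 4-character string over a printable alphabet until one hashes
// to the target; returns an empty string if nothing in this space matches.
std::string brute_force(std::uint32_t target)
{
    const std::string alphabet =
        " !#$%&'()*+,-./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    std::string s(4, ' ');
    for (char a : alphabet) for (char b : alphabet)
        for (char c : alphabet) for (char d : alphabet) {
            s[0] = a; s[1] = b; s[2] = c; s[3] = d;
            if (hash_input(s) == target)
                return s;
        }
    return "";
}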
Method 2: Brute force, but clever.
Consider the specialNum value before the last ch is processed. You first calculate (specialNum * 32) + (specialNum / 4) + ch. Since -128 <= ch < 128 or 0 <= ch < 256 depending on the signedness of char, you know the highest 23 bits of the result, independent of ch. After xor'ing ch with specialNum, you also know the highest 23 bits (if ch is signed, there are two possible values for the highest 23 bits). You check whether those 23 bits match the desired output, and if they don't, you have excluded all 256 values of ch in one go. So the brute force method will end on average after 16 million steps.
Now consider the specialNum value before the last two ch are processed. Again, you can determine the highest possible 14 bits of the result (if ch is signed with four alternatives) without examining the last two characters at all. If the highest 14 bits don't match, you are done.
Method 3: This is how you do it. Consider in turn all strings s of length 0, 1, 2, etc. (however, your algorithm will most likely find a solution much quicker). Calculate specialNum after processing the string s. Following your algorithm, and allowing for char to be signed, find the up to 4 different values that the highest 14 bits of specialNum might have after processing two further characters. If any of those matches the desired output, then examine the value of specialNum after processing each of the 256 possible values of the next character, and find the up to 2 different values that the highest 23 bits of specialNum might have after examining another char. If one of those matches the highest 23 bits of the desired output then examine what specialNum would be after processing each of the 256 possible next characters and look for a match.
This should run in under a millisecond. If char is unsigned, it is faster.
I was given this algorithm to write a hash function:
BEGIN Hash (string)
    UNSIGNED INTEGER key = 0;
    FOR_EACH character IN string
        key = ((key << 5) + key) ^ character;
    END FOR_EACH
    RETURN key;
END Hash
The << operator shifts bits to the left, ^ is the XOR operation, and character is the ASCII value of the character. Seems pretty straightforward.
Below is my code
unsigned int key = 0;
for (int i = 0; i < data.length(); i++) {
    key = ((key << 5) + key) ^ (int)data[i];
}
return key;
However, I keep getting ridiculously large positive and negative numbers, when I should actually get a hash value from 0 to n (n is a value set by the user beforehand). I'm not sure where things went wrong, but I'm thinking it could be the XOR operation.
Any suggestions or opinions will be greatly appreciated. Thanks!
The output of this code is a 32-bit (or 64-bit or however wide your unsigned int is) unsigned integer. To restrict it to the range from 0 to n−1, simply reduce it modulo n, using the % operator:
unsigned int hash = key % n;
(It should be obvious that your code, as written, cannot return "a hash value from 0 - n", since n does not appear anywhere in your code.)
In fact, there's a good reason not to reduce the hash value modulo n too soon: if you ever need to grow your hash table, storing the unreduced hash codes of your strings saves you the effort of recalculating them whenever n changes.
Finally, a few general notes on your hash function:
As Joachim Pileborg comments above, the explicit (int) cast is unnecessary. If you want to keep it for clarity, it really should say (unsigned int) to match the type of key, since that's what the value actually gets converted into.
For unsigned integer types, ((key<<5) + key) is equal to 33 * key (since shifting left by 5 bits is the same as multiplying by 2^5 = 32). On modern CPUs, using multiplication is almost certainly at least as fast; on old or very low-end processors with slow multiplication, it's likely that any decent compiler will optimize multiplication by a constant into a combination of shifts and adds anyway. Thus, either way, expressing the operation as a multiplication is IMO preferable.
You don't want to call data.length() on every iteration of the loop. Call it once before the loop and store the result in a variable.
Initializing key to zero means that your hash value is not affected by any leading zero bytes in the string. The original version of your hash function, due to Dan Bernstein, uses a (more or less random) initial value of 5381 instead.
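Putting those suggestions together (multiply by 33, hoist length() out of the loop, seed with 5381, and reduce modulo n only at lookup time), a sketch of the tweaked function might look like this; the function name is made up, but the structure follows the question's code:

#include <cstddef>
#include <string>

unsigned int hash_string(const std::string &data)
{
    unsigned int key = 5381u;                     // Bernstein's initial value
    const std::size_t len = data.length();        // call length() once
    for (std::size_t i = 0; i < len; ++i)
        key = 33u * key ^ (unsigned char)data[i]; // 33*key == (key<<5)+key
    return key;
}

// At lookup time, reduce the full hash into the table of size n:
// unsigned int bucket = hash_string(s) % n;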
I hope this finds you well.
I am trying to convert an index (a number) into a word, using the ASCII codes for that.
for ex:
index 0 -> " "
index 94 -> "~"
index 625798 -> "e#A"
index 899380 -> "!$^."
...
As we can all see, the 4th index corresponds to a 4-character string. Unfortunately, at some point these combinations get really big (e.g., for a word of 8 chars I need to perform operations with 16-digit numbers, such as 6634204312890625, and it gets much worse if I raise the number of chars in the word).
To support such big numbers, I had to upgrade some variables of my program from unsigned int to unsigned long long, but then I realized that modf() from C++ uses doubles and uint32_t (http://www.raspberryginger.com/jbailey/minix/html/modf_8c-source.html).
The question is: is it possible to adapt modf() to use 64-bit numbers like unsigned long long? I'm afraid that, if this is not possible, I'll be limited to the number of digits a double can hold.
Can anyone enlighten me, please? =)
16-digit numbers fit within the range of a 64-bit number, so you should use uint64_t (from <stdint.h>). The % operator should then do what you need.
If you need bigger numbers, then you'll need to use a big-integer library. However, if all you're interested in is modulus, then there's a trick you can pull, based on the following properties of modulus:
mod(a * b) == mod(mod(a) * mod(b))
mod(a + b) == mod(mod(a) + mod(b))
As an example, let's express a 16-digit decimal number, x as:
x = x_hi * 1e8 + x_lo; // this is pseudocode, not real C
where x_hi is the 8 most-significant decimal digits, and x_lo the least-significant. The modulus of x can then be expressed as:
mod(x) = mod(mod(x_hi) * mod(1e8) + mod(x_lo));
where mod(1e8) is a constant which you can precalculate.
All of this can be done in integer arithmetic.
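A small sketch of that trick in C++ (the names are illustrative; it assumes the modulus m is small enough, say below 2^32, that the intermediate product fits in 64 bits):

#include <cstdint>

// x is given as its high and low 8-decimal-digit halves: x = x_hi * 1e8 + x_lo.
std::uint64_t mod_split(std::uint64_t x_hi, std::uint64_t x_lo, std::uint64_t m)
{
    const std::uint64_t MOD_1E8 = 100000000ull % m;   // mod(1e8), can be precalculated
    return ((x_hi % m) * MOD_1E8 + (x_lo % m)) % m;
}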
I could actually use a comment that was deleted right afterwards (I wonder why), which said:
modulus = a - a/b * b;
I cast the division to unsigned long long.
Now... I was a bit disappointed, because in my problem I thought I could keep raising the number of characters in the word with no problem. Nevertheless, I've started to get size issues at 7 characters. Why? 95^7 already gives huge numbers.
I was hoping to be able to write a word like "my cat is so fat I 1234r5s" and calculate its index, but this word has almost 30 characters:
95^26 = 2635200944657423647039506726457895338535308837890625 combinations.
Anyway, thanks for the answer.
I'm looking for an extremely fast atof() implementation on IA-32, optimized for the US-English locale, ASCII, and non-scientific notation. The Windows multithreaded CRT falls down miserably here, as it checks for locale changes on every call to isdigit(). Our current best is derived from the best of Perl's and Tcl's atof implementations, and outperforms msvcrt.dll's atof by an order of magnitude. I want to do better, but am out of ideas. The BCD-related x86 instructions seemed promising, but I couldn't get them to outperform the Perl/Tcl C code. Can any SO'ers dig up a link to the best out there? Non-x86 assembly-based solutions are also welcome.
Clarifications based upon initial answers:
Inaccuracies of ~2 ulp are fine for this application.
The numbers to be converted will arrive in ASCII messages over the network in small batches, and our application needs to convert them with the lowest latency possible.
What is your accuracy requirement? If you truly need it "correct" (always getting the nearest floating-point value to the decimal specified), it will probably be hard to beat the standard library versions (other than removing locale support, which you've already done), since this requires doing arbitrary-precision arithmetic. If you're willing to tolerate an ulp or two of error (and more than that for subnormals), the sort of approach proposed in cruzer's answer can work and may be faster, but it definitely will not produce <0.5 ulp output. You will do better accuracy-wise to compute the integer and fractional parts separately and compute the fraction at the end (e.g. for 12345.6789, compute it as 12345 + 6789 / 10000.0, rather than 6*.1 + 7*.01 + 8*.001 + 9*.0001), since 0.1 has no exact binary representation and error accumulates rapidly as you compute 0.1^n. This also lets you do most of the math with integers instead of floats.
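A minimal sketch of that "integer and fractional parts separately" idea (plain "ddd.ddd" ASCII input assumed, no sign, no exponent, no error checking; accurate to within an ulp or two, not correctly rounded):

#include <cstdint>

double fast_atof_sketch(const char *p)
{
    std::uint64_t int_part = 0;
    while (*p >= '0' && *p <= '9')
        int_part = int_part * 10 + (std::uint64_t)(*p++ - '0');

    std::uint64_t frac_part = 0;
    std::uint64_t frac_scale = 1;
    if (*p == '.') {
        ++p;
        while (*p >= '0' && *p <= '9') {
            frac_part = frac_part * 10 + (std::uint64_t)(*p++ - '0');
            frac_scale *= 10;
        }
    }
    // One division at the very end instead of accumulating powers of 0.1.
    return (double)int_part + (double)frac_part / (double)frac_scale;
}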
The BCD instructions haven't been implemented in hardware since (IIRC) the 286, and are simply microcoded nowadays. They are unlikely to be particularly high-performance.
This implementation I just finished coding runs twice as fast as the built-in atof on my desktop. It converts 1024*1024*39 numeric inputs in 2 seconds, compared to 4 seconds with my system's standard GNU atof (including the setup time, getting memory, and all that).
UPDATE:
Sorry, I have to revoke my twice-as-fast claim. It's faster if the thing you're converting is already in a string, but if you're passing it hard-coded string literals, it's about the same as atof. However, I'm going to leave it here, since with some tweaking of the ragel file and state machine you may be able to generate faster code for specific purposes.
https://github.com/matiu2/yajp
The interesting files for you are:
https://github.com/matiu2/yajp/blob/master/tests/test_number.cpp
https://github.com/matiu2/yajp/blob/master/number.hpp
Also, you may be interested in the state machine that does the conversion, which is defined by the ragel file in that repository.
It seems to me you want to build (by hand) what amounts to a state machine where each state handles the Nth input digit or exponent digits; this state machine would be shaped like a tree (no loops!). The goal is to do integer arithmetic wherever possible, and (obviously) to remember state variables ("leading minus", "decimal point at position 3") in the states implicitly, to avoid assignments, stores and later fetch/tests of such values. Implement the state machine with plain old "if" statements on the input characters only (so your tree gets to be a set of nested ifs). Inline accesses to buffer characters; you don't want a function call to getchar to slow you down.
Leading zeros can simply be suppressed; you might need a loop here to handle ridiculously long leading-zero sequences. The first nonzero digit can be collected without zeroing an accumulator or multiplying by ten. The first 4-9 nonzero digits (for 16-bit or 32-bit integers) can be collected with integer multiplies by the constant ten (turned by most compilers into a few shifts and adds). [Over the top: zero digits don't require any work until a nonzero digit is found, and then a multiply by 10^N for N sequential zeros is required; you can wire all this into the state machine.]
Digits following the first 4-9 may be collected using 32- or 64-bit multiplies, depending on the word size of your machine. Since you don't care about accuracy, you can simply ignore digits after you've collected 32 or 64 bits' worth; I'd guess that you can actually stop when you have some fixed number of nonzero digits, based on what your application actually does with these numbers.
A decimal point found in the digit string simply causes a branch in the state machine tree. That branch knows the implicit location of the point and therefore how to scale by a power of ten later. With effort, you may be able to combine some state machine sub-trees if you don't like the size of this code.
[Over the top: keep the integer and fractional parts as separate (small) integers. This will require an additional floating point operation at the end to combine the integer and fraction parts, probably not worth it].
[Over the top: collect 2 characters for digit pairs into a 16-bit value, and look the 16-bit value up in a table. This avoids a multiply in the registers in trade for a memory access; probably not a win on modern machines.]
On encountering "E", collect the exponent as an integer as above; look up accurately precomputed/scaled powers of ten up in a table of precomputed multiplier (reciprocals if "-" sign present in exponent) and multiply the collected mantissa. (don't ever do a float divide). Since each exponent collection routine is in a different branch (leaf) of the tree, it has to adjust for the apparent or actual location of the decimal point by offsetting the power of ten index.
[Over the top: you can avoid the cost of ptr++ if you know the characters for the number are stored linearly in a buffer and do not cross the buffer boundary. In the kth state along a tree branch, you can access the kth character as *(start+k). A good compiler can usually hide the "...+k" in an indexed offset in the addressing mode.]
Done right, this scheme does roughly one cheap multiply-add per nonzero digit, one cast-to-float of the mantissa, and one floating multiply to scale the result by exponent and location of decimal point.
I have not implemented the above. I have implemented versions of it with loops, they're pretty fast.
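As an illustration of the "look up precomputed powers of ten, never do a float divide" step, a small sketch (the table range of -8..+8 and the names are arbitrary choices for brevity; a real table would cover the full double range):

#include <cstdint>

static const double POW10[] = {
    1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1,
    1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7, 1e8
};

// mantissa: the digits collected as an integer; exp10: the decimal exponent,
// already adjusted for the position of the decimal point (assumed in range).
inline double scale_by_exponent(std::uint64_t mantissa, int exp10)
{
    return (double)mantissa * POW10[exp10 + 8];   // multiply, never divide
}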
I've implemented something you may find useful.
In comparison with atof it's about x5 faster and if used with __forceinline about x10 faster.
Another nice thing is that it seems to have exactly the same arithmetic as the CRT implementation.
Of course it has some cons too:
it supports only single precision float,
and doesn't scan any special values like #INF, etc...
__forceinline bool float_scan(const wchar_t* wcs, float* val)
{
int hdr=0;
while (wcs[hdr]==L' ')
hdr++;
int cur=hdr;
bool negative=false;
bool has_sign=false;
if (wcs[cur]==L'+' || wcs[cur]==L'-')
{
if (wcs[cur]==L'-')
negative=true;
has_sign=true;
cur++;
}
else
has_sign=false;
int quot_digs=0;
int frac_digs=0;
bool full=false;
wchar_t period=0;
int binexp=0;
int decexp=0;
unsigned long value=0;
while (wcs[cur]>=L'0' && wcs[cur]<=L'9')
{
if (!full)
{
if ((value>=0x19999999 && wcs[cur]-L'0'>5) || value>0x19999999)
{
full=true;
decexp++;
}
else
value=value*10+wcs[cur]-L'0';
}
else
decexp++;
quot_digs++;
cur++;
}
if (wcs[cur]==L'.' || wcs[cur]==L',')
{
period=wcs[cur];
cur++;
while (wcs[cur]>=L'0' && wcs[cur]<=L'9')
{
if (!full)
{
if ((value>=0x19999999 && wcs[cur]-L'0'>5) || value>0x19999999)
full=true;
else
{
decexp--;
value=value*10+wcs[cur]-L'0';
}
}
frac_digs++;
cur++;
}
}
if (!quot_digs && !frac_digs)
return false;
wchar_t exp_char=0;
int decexp2=0; // explicit exponent
bool exp_negative=false;
bool has_expsign=false;
int exp_digs=0;
// even if value is 0, we still need to eat exponent chars
if (wcs[cur]==L'e' || wcs[cur]==L'E')
{
exp_char=wcs[cur];
cur++;
if (wcs[cur]==L'+' || wcs[cur]==L'-')
{
has_expsign=true;
if (wcs[cur]=='-')
exp_negative=true;
cur++;
}
while (wcs[cur]>=L'0' && wcs[cur]<=L'9')
{
if (decexp2>=0x19999999)
return false;
decexp2=10*decexp2+wcs[cur]-L'0';
exp_digs++;
cur++;
}
if (exp_negative)
decexp-=decexp2;
else
decexp+=decexp2;
}
// end of wcs scan, cur contains value's tail
if (value)
{
while (value<=0x19999999)
{
decexp--;
value=value*10;
}
if (decexp)
{
// ensure 1bit space for mul by something lower than 2.0
if (value&0x80000000)
{
value>>=1;
binexp++;
}
if (decexp>308 || decexp<-307)
return false;
// convert exp from 10 to 2 (using FPU)
int E;
double v=pow(10.0,decexp);
double m=frexp(v,&E);
m=2.0*m;
E--;
value=(unsigned long)floor(value*m);
binexp+=E;
}
binexp+=23; // rebase exponent to 23 bits of mantissa
// so the value is: +/- VALUE * pow(2,BINEXP);
// (normalize mantissa to 24 bits, update exponent)
while (value&0xFE000000)
{
value>>=1;
binexp++;
}
if (value&0x01000000)
{
if (value&1)
value++;
value>>=1;
binexp++;
if (value&0x01000000)
{
value>>=1;
binexp++;
}
}
while (!(value&0x00800000))
{
value<<=1;
binexp--;
}
if (binexp<-127)
{
// underflow
value=0;
binexp=-127;
}
else
if (binexp>128)
return false;
//exclude "implicit 1"
value&=0x007FFFFF;
// encode exponent
unsigned long exponent=(binexp+127)<<23;
value |= exponent;
}
// encode sign
unsigned long sign=negative<<31;
value |= sign;
if (val)
{
*(unsigned long*)val=value;
}
return true;
}
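A hypothetical usage example for the routine above (wide-character input, as written; the literal is just for illustration):

#include <cstdio>

int main()
{
    float v = 0.0f;
    if (float_scan(L"  -12.5e-1", &v))
        std::printf("parsed: %f\n", v);
    else
        std::printf("parse failed\n");
    return 0;
}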
I remember we had a Winforms application that performed so slowly while parsing some data interchange files, and we all thought it was the db server thrashing, but our smart boss actually found out that the bottleneck was in the call that was converting the parsed strings into decimals!
The simplest is to loop for each digit (character) in the string, keep a running total, multiply the total by 10 then add the value of the next digit. Keep on doing this until you reach the end of the string or you encounter a dot. If you encounter a dot, separate the whole number part from the fractional part, then have a multiplier that divides itself by 10 for each digit. Keep on adding them up as you go.
Example: 123.456
running total = 0, add 1 (now it's 1)
running total = 1 * 10 = 10, add 2 (now it's 12)
running total = 12 * 10 = 120, add 3 (now it's 123)
encountered a dot, prepare for fractional part
multiplier = 0.1, multiply by 4, get 0.4, add to running total, makes 123.4
multiplier = 0.1 / 10 = 0.01, multiply by 5, get 0.05, add to running total, makes 123.45
multiplier = 0.01 / 10 = 0.001, multiply by 6, get 0.006, add to running total, makes 123.456
Of course, testing for the number's correctness, as well as handling negative numbers, will make it more complicated. But if you can "assume" that the input is correct, you can make the code much simpler and faster.
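A direct sketch of that walk-through (assuming well-formed, non-negative input, as stated; the function name is made up):

double simple_atof(const char *s)
{
    double total = 0.0;
    while (*s >= '0' && *s <= '9')
        total = total * 10.0 + (*s++ - '0');   // whole part: total = total*10 + digit

    if (*s == '.') {
        ++s;
        double multiplier = 0.1;               // shrinks by 10 for each fractional digit
        while (*s >= '0' && *s <= '9') {
            total += (*s++ - '0') * multiplier;
            multiplier /= 10.0;
        }
    }
    return total;
}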
Have you considered looking into having the GPU do this work? If you can load the strings into GPU memory and have it process them all you may find a good algorithm that will run significantly faster than your processor.
Alternately, do it in an FPGA - There are FPGA PCI-E boards that you can use to make arbitrary coprocessors. Use DMA to point the FPGA at the part of memory containing the array of strings you want to convert and let it whizz through them leaving the converted values behind.
Have you looked at a quad core processor? The real bottleneck in most of these cases is memory access anyway...
-Adam