Haxe: Efficiency of mapping a variable to one of two values - if-statement

I feel this question is too general and poorly asked. Improvement suggestions would be appreciated.
Say I have an integer variable which I need to convert to one of two other values. If the variable is 0, it will remain 0, otherwise, it will become 1. I can think of two ways to do this.
Method 1: An inline if/else assignment.
function runFunc(input:Int):Void
{
    // do something with input
}

for (index in 0...5)
{
    runFunc(if (index == 0) {0;} else {1;});
}
Method 2: Division and rounding.
function runFunc(input:Int):Void
{
    // do something with input
}

for (index in 0...5)
{
    runFunc(Math.round(index / index));
}
Method 1 is more standardised and would work for other data-types and other values, but Method 2 seems like it would take less processing power (especially if this needs to be done almost constantly).
Assuming that Method 2 wouldn't have a problem with rounding, is there enough of a difference in times between the two methods to consider using one over the other? How do different things like if/else statements and Math.round() impact processing time?

Well, first things first, your method 2 has a division by zero. So it's not even a valid solution.
However, I assume you want an answer "in general". And, of course, this kind of question comes with a TON of "it depends" qualifications. It depends on the type of CPU, the programming language, the optimizations the compiler might make, runtime optimizations, etc, etc.
However, in general, the order of cost of the relevant operations is roughly:
Logic and arithmetic operations are cheapest
Of these, integer operations are cheaper than floating point operations.
Branches (ifs and loops) are moderately expensive.
Division is moderately expensive.
Function calls are very expensive.
Basically, this is because (respectively):
This is what CPUs do, and they are optimized to do it quickly
Branches represent two possible paths, which can slow down CPU parallelization and optimizations.
Division is a multi-step process.
Function calls have to push and pop the stack.
Thus, I'd probably bet on your method 1 above. And, just for clarity, I'd write the if statement in your first example with an equivalent ternary operator:
for (index in 0...5)
{
    runFunc( index == 0 ? 0 : 1 );
}
This is well and good for mapping a simple boolean (either index is 0 or it isn't). But what about when the mapping becomes more complex? There are a couple constructs that generally give good performance when mapping one value to another: a lookup table or a hash map.
The lookup table is useful if your keys are integers, bounded between some minimum and maximum values. The hash map is good if your keys are hashable (often Ints, Strings, or Objects can be used as hash keys. Look at haxe.ds.IntMap, haxe.ds.StringMap and haxe.ds.ObjectMap.)
Example 1: A common look-up table might map numeric integer "day of the week" to the word used for that day. Assuming day:Int is always 0-6, a day int to string lookup table would be:
var day_name_lut = [ "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday" ];
So now trace('Today is a '+day_name_lut[Date.now().getDay()]); will tell us what day it is today. That's mapping an integer to a String for VERY low cost. Cheaper than a function call or an IntMap<String>.
Example 2: A sample StringMap might be used to store some object by a string value (for example, maybe a Person by their name.) This will allow us to lookup people by name later on:
var people_by_name = new StringMap<Person>();
var joe = new Person("Joe");
people_by_name.set(joe.name, joe);
var bob = new Person("Bob");
people_by_name.set(bob.name, bob);
This is very common for caching expensive operations. Imagine caching an HTTP response by its URL. It'd be much cheaper to look it up the second time from the StringMap than going back to fetch the response again.
--- update ---
I also note you say "inline if/else assignment", referring to this:
runFunc(if (index == 0) {0;} else {1;});
Note that actually writing the if inline is the kind of thing that makes absolutely no performance difference. It'll perform exactly the same as:
var tmp = if (index == 0) {0;} else {1;};
runFunc(tmp);
And these are exactly the same as well:
runFunc( if (index == 0) 0 else 1);
runFunc( (index == 0) ? 0 : 1);
These types of semantic differences don't make any difference to the final execution of code. The processor still has to compute and temporarily store the value to call the function, whether you wrote it inline, stored it in a tmp variable, or wrote the branch with an if/else or ternary operator. Compilers and VMs are optimized to take your (potentially variable) code and boil it down to the best instructions for the target CPU.

How about binary op index & 1 ?

Related

Storing multiple int values into one variable - C++

I am doing an algorithmic contest, and I'm trying to optimize my code. Maybe what I want to do is stupid and impossible but I was wondering.
I have these requirements:
An inventory which can contain 4 distinct types of item. This inventory can't contain more than 10 items (all types included). Example of a valid inventory: 1 / 1 / 1 / 0. Examples of invalid inventories: 11 / 0 / 0 / 0 or 5 / 5 / 5 / 0
I have some recipe which consumes or adds items into my inventory. The recipe can't add or consume more than 10 items, since the inventory can't have more than 10 items. Example of a valid recipe: -1 / -2 / 3 / 0. Example of an invalid recipe: -6 / -6 / +12 / 0
For now, I store the inventory and the recipe into 4 integers. Then I am able to perform some operations like:
ApplyRecepe: Inventory(1/1/1/0).Apply(Recepe(-1/1/0/0)) = Inventory(0/2/1/0)
CanAfford: Inventory(1/1/0/0).CanAfford(Recepe(-2/1/0/0)) = False
I would like to know if it is possible (and if yes, how) to store the 4 values of an inventory/recipe into one single integer and to performs previous operations on it that would be faster than comparing / adding the 4 integers as I'm doing now.
I thought of something like having the inventory like that:
int32: XXXX (number of items of the first type) - YYYY (number of items of the second type) - ZZZ (number of items of the third type) - WWW (number of item of the fourth type)
But I have two problems with that:
I don't know how to handle the possible negative values
It seems to me much slower than just adding the 4 integers since I have to bit shift the inventory and the recipe to get the value I want and then proceed with the addition.
Here are two alternatives:
An array. The advantage of this is that you may iterate over the elements:
int variable[] {
    1,
    1,
    1,
    0,
};
Or a class. The advantage of this is the ability to name the members:
struct {
    int X;
    int Y;
    int Z;
    int W;
} variable {
    1,
    1,
    1,
    0,
};
Then I am able to perform some operations like:
Those look like SIMD vector operations (Single Instruction Multiple Data). The array is the way to go in this case. Since the number of operands appears to be constant and small in your description, an efficient way to perform them is vector operations on the CPU.[1]
There is no standard way to use SIMD operations directly in C++. To give the compiler optimal opportunity to use them, these steps need to be followed:
Make sure that the CPU you use supports the operations that you need. The AVX2 instruction set and its extensions have wide support for integer vector operations.
Make sure that you tell the compiler that the program should be optimised for that architecture.
Make sure to tell the compiler to perform vectorisation optimisations.
Make sure that the integers are sufficiently aligned as required by the operations. This can be achieved with alignas.
Make sure that the number of integers is known at compile time.
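Putting those steps together might look something like this (a sketch of mine, not from the original answer; -O3 and -mavx2 are typical GCC/Clang flags, but check your compiler's documentation):

// Compile with, e.g.: g++ -O3 -mavx2 recipe.cpp
#include <cstddef>

constexpr std::size_t count = 4;

// Fixed, compile-time size plus aligned storage gives the optimiser
// its best chance to turn this loop into vector instructions.
void apply(int* inventory, const int* recipe)
{
    for (std::size_t i = 0; i < count; ++i)
        inventory[i] += recipe[i];
}

int main()
{
    alignas(16) int inventory[count] = { 1, 1, 1, 0};
    alignas(16) int recipe[count]    = {-1, 1, 0, 0};
    apply(inventory, recipe);
}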
If the prospect of relying on the optimiser worries you, then you may instead prefer to use vector extensions that may be provided by your compiler. The use of language extensions would come at the cost of portability to other compilers naturally. Here is an example with GCC:
#include <iostream>

constexpr int count = 4;
using v4si = int __attribute__ ((vector_size (sizeof(int) * count)));

int main()
{
    v4si inventory { 1, 1, 1, 0};
    v4si recepe    {-1, 1, 0, 0};
    v4si applied = inventory + recepe;
    for (int i = 0; i < count; i++) {
        std::cout << applied[i] << ", ";
    }
}
[1] If the number of operands were large, then a specialised vector processor such as a GPU could be faster.
Especially if you're learning, it's not a bad opportunity to try implementing your own helper class for vectorization, and consequently deepen your understanding about data in C++, even if your use case might not warrant the technique.
The insight you want to exploit is that arithmetic on several small values packed side by side into one wide integer mostly works out, provided you account for the pesky carry bit and the effects of signedness (e.g. two's complement). But precisely because of these latter factors, it's much better to use a standardized underlying type like an int8_t[], as @Botje suggests.
To begin, implement the following functions. (My C++ is rusty, consider this pseudocode.)
int8_t* add(int8_t[], int8_t[], size_t);
int8_t* multiply(int8_t[], int8_t[], size_t);
int8_t* zeroes(size_t); // additive identity
int8_t* ones(size_t); // multiplicative identity
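As a starting point, add might be sketched like this (my assumptions: an out parameter instead of a returned pointer, and wrap-around on overflow; both are policy choices covered by the questions below):

#include <cstdint>
#include <cstddef>

// Element-wise add: out[i] = a[i] + b[i]. Overflow simply wraps here.
void add(const int8_t* a, const int8_t* b, int8_t* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<int8_t>(a[i] + b[i]);
}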
Also consider:
How would you like to handle overflows and underflows? Let them be and ask the developer to be cautious? Or throw exceptions?
Maybe you'd like to pin down the size of the array and avoid having to deal with a dynamic size_t?
Maybe you'd like to go as far as overloading operators?
The end result of an exercise like this, but generalized and polished, is something like Armadillo. But you'll understand it on a whole different level by doing the exercise yourself first. Also, if all this makes sense so far, you can now take a look at How to vectorize my loop with g++?—even the compiler can vectorize for you in certain cases.
Bitpacking as @Botje mentions is another step beyond this. You won't even have the safety and convenience of an integer type like int8_t or int4_t. Which additionally means the code you write might stop being platform-independent. I recommend at least finishing the vectorization exercise before delving into this.
This will be something of a non-answer, just intended to show what you're up against if you do bitpacking.
Suppose, for simplicity's sake, that recipes can only remove from inventory, and only contain positive values (you could represent negative numbers using two's complement, but it would take more bits, and add much complexity to working with the bit-packed numbers).
You then have 11 possible values for an item, so you need 4 bits for each item. Four items can then be represented in one uint16.
So, say you have an inventory with 10,4,6,9 items; this would be uint16_t inv = 0b1010'0100'0110'1001.
Then, a recipe with 2,2,2,2 items or uint16_t rec = 0b0010'0010'0010'0010.
inv - rec would give 0b1000'0010'0100'0111 for 8,2,4,7 items.
So far, so good. No need here to shift and mask to get at the individual values before doing the calculation. Yay.
Now, a recipe with 6,6,6,6 items which would be 0b0110'0110'0110'0110, giving inv - rec = 0b0011'1110'0000'0011 for 3,14,0,3 items.
Oops.
The arithmetic will work, but only if you check beforehand that the individual 4-bit results don't go out of bounds; in this example this would mean that you know beforehand that there are enough items in the inventory to fill a recipe.
You could get at, say, the third item in the inventory by doing: (inv >> 4) & 0b1111, or uint16_t(inv << 8) >> 12 (keeping the shift in 16-bit arithmetic), for doing your checks.
For testing, you would then get expressions like:
if (((inv >> 4) & 0b1111) >= ((rec >> 4) & 0b1111))
or, comparing the 4 bits "in place":
if ((inv & 0b0000000011110000) >= (rec & 0b0000000011110000))
for each 4-bit part.
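Collected into a helper, that per-nibble check might look like this (a sketch assuming the all-positive packing used above):

#include <cstdint>

// True only if every 4-bit inventory field covers the corresponding
// 4-bit recipe field (four unsigned nibbles packed into a uint16_t).
bool can_afford(uint16_t inv, uint16_t rec)
{
    for (int shift = 0; shift < 16; shift += 4)
        if (((inv >> shift) & 0b1111) < ((rec >> shift) & 0b1111))
            return false;
    return true;
}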
All these things are doable, but do you want to? It probably won't be faster than what is suggested in the other answers after the compiler has done its job, and it certainly won't be more readable.
It becomes even more horrible when you allow negative numbers (two's complement or otherwise) in recipes, especially if you want to bit-shift them.
So, bitpacking is nice for storage, and in some rare cases you can even do math without unpacking the bits, but I wouldn't try to go there (unless you are very performance and memory constrained).
Having said that, it could be fun to try to get it to work; there's always that.

How to use if condition in intrinsics

I want to compare two floating point variables using intrinsics. If the comparison is true, do something else do something. I want to do this as a normal if..else condition. Is there any way using intrinsics?
//normal code
vector<float> v1, v2;
for(int i = 0; i < v1.size(); ++i)
    if(v1[i] < v2[i])
    {
        //do something
    }
    else
    {
        //do something
    }
How to do this using SSE2 or AVX?
If you expect that v1[i] < v2[i] is almost never true, almost always true, or usually stays the same for a long run (even if overall there might be no particular bias), then another technique is also applicable, one which offers "true conditionality" (i.e. not "do both, discard one result") at a price, of course; but you also get to actually skip work instead of just ignoring some results.
That technique is fairly simple, do the comparison (vectorized), gather the comparison mask with _mm_movemask_ps, and then you have 3 cases:
All comparisons went the same way and they were all false, execute the appropriate "do something" code that is now maybe easier to vectorize since the condition is gone.
All comparisons went the same way and they were all true, same.
Mixed, use more complicated logic. Depending on what you need, you could check all bits separately (falling back to scalar code, but now just 1 FP compare for the whole lot), or use one of the "iterate only over (un)set bits" tricks (combines well with bitscan to recover the actual index), or sometimes you can fall back to doing masking and merging as usual.
Not all 3 cases are always relevant; usually you're applying this because the predicate almost always goes the same way, making one of the "all the same" cases so rare that you can just lump it in with "mixed".
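As a sketch, the pattern looks like the following (the do_* handlers are placeholders of mine, standing in for whatever the two sides of the branch actually do):

#include <xmmintrin.h> // SSE

// Placeholder handlers for the three cases.
void do_all_less(const float*, const float*) { /* vectorized "if" side */ }
void do_none_less(const float*, const float*) { /* vectorized "else" side */ }
void do_mixed(const float*, const float*, int) { /* scalar fallback */ }

// One group of four floats: compare, gather the mask, pick a case.
void process4(const float* a, const float* b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    int mask = _mm_movemask_ps(_mm_cmplt_ps(va, vb)); // one bit per lane
    if (mask == 0xF)
        do_all_less(a, b);    // every lane has a[i] < b[i]
    else if (mask == 0)
        do_none_less(a, b);   // no lane has a[i] < b[i]
    else
        do_mixed(a, b, mask); // mixed: inspect the individual bits
}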
This technique is definitely not always useful. The "mixed" case is complicated and slow. The fast path has to be common and fast enough to be worth testing whether you can take it.
But it can be useful, maybe one of the sides is very slow and annoying, while the other side of the branch is nice simple vectorizable code that doesn't take all that long in comparison. For example, maybe the slow side has to do argument reduction for an otherwise fast approximated transcendental function, or maybe it has to normalize some vectors before taking their dot product, or orthogonalize a matrix, maybe even get data from disk..
Or, maybe neither side is exactly slow, but they evict each other's data from cache (maybe both sides are a loop over an array that fits in cache, but the arrays don't fit in it together), so doing them unconditionally slows both of them down. This is probably a real thing, but I haven't seen it in the wild (yet).
Or, maybe one side cannot be executed unconditionally, doing some generally destructive things, maybe even some IO. For example if you're checking for error conditions and logging them.
SIMD conditional operations are done with branchless techniques. You use a packed-compare instruction to get a vector of elements that are all-zero or all-one.
e.g. you can conditionally add 4 to elements in an accumulator when a corresponding element matches a condition with code like:
__m128i match_counts = _mm_setzero_si128(); // intrinsics from <emmintrin.h>
for (...) {
    __m128 fvec = something;
    __m128i condition = _mm_castps_si128( _mm_cmplt_ps(fvec, _mm_setzero_ps()) ); // for elements less than zero
    __m128i masked_constant = _mm_and_si128(condition, _mm_set1_epi32(4));
    match_counts = _mm_add_epi32(match_counts, masked_constant);
}
Obviously this only works well if you can come up with a branchless way to do both sides of the branch. A blend instruction can often help.
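For instance, a per-lane select with SSE4.1's _mm_blendv_ps (a sketch; both x and y must already be computed, the blend just picks lanes):

#include <smmintrin.h> // SSE4.1

// result[i] = (a[i] < b[i]) ? x[i] : y[i]
__m128 select_lt(__m128 a, __m128 b, __m128 x, __m128 y)
{
    __m128 mask = _mm_cmplt_ps(a, b); // all-ones lanes where a < b
    return _mm_blendv_ps(y, x, mask); // take x where the mask is set, else y
}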
It's likely that you won't get any speedup at all if there's too much work in each side of the branch, especially if your element size is 4 bytes or larger. (SIMD is really powerful when you're doing 16 operations in parallel on 16 separate bytes, less powerful when doing 4 operations on four 32-bit elements).
I found a document which is very useful for conditional SIMD instructions.
It is a perfect solution to my question about if...else conditions.
Document: http://saluc.engr.uconn.edu/refs/processors/intel/sse_sse2.pdf

Which operator is faster: != or >

Which operator is faster: > or !=?
Example: I want to test a value (which can have a positive value or -1) against -1 :
if(time > -1)
// or
if (time != -1)
time has type "int"
The standard doesn't say. So it's up to what opcodes the given compiler generates in its given version, and how fast a given CPU executes them.
I.e., implementation / platform defined.
You can find out for a specific compiler / platform combination by looking at / benchmarking the executable code.
But I seriously doubt it will make much of a difference; this is the kind of micro-optimization that is almost always dwarfed by higher-level architectural decisions.
It is platform-dependent. Generally though, those two operations will translate directly to the assembler instructions "branch if greater than" and "branch if not equal". It is unlikely that there is any performance difference between those two, and if there would be, it would be non-significant.
The only branch instruction which is ever so slightly faster than the others is usually "branch if zero"/"branch if not zero".
(In the dark ages when compilers sucked, C programmers therefore liked to write loops as down-counting to zero, instead of up-counting, so that comparisons would be done against zero instead of a value, in order to gain a few nanoseconds. Modern compilers can do that optimization themselves, but you still see such loops now and then.)
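The idiom looks like this (a trivial sketch):

#include <cstddef>

// Down-counting loop: the termination test compares against zero,
// which maps onto "decrement and branch if not zero" style code.
long sum(const int* a, std::size_t n)
{
    long s = 0;
    for (std::size_t i = n; i-- > 0; )
        s += a[i];
    return s;
}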
In general, you shouldn't concern yourself with micro-management of performance. If you spend time pondering if > is faster than !=, instead of pondering about program design, readability and functionality, you need to set your priorities straight asap.
Semantically these conditions are different. The first one checks whether time is non-negative.
if(time > -1)
In this case it would be better to write
if( time >= 0 )
However some functions return either a non-negative value or -1. For example a search function can return -1 if it did not find an element in an array. Or -1 can signal an error state or an absence of a value.
In this case it is better to use the condition
if ( time != -1 )
As for speed: the compiler can generate a single machine instruction for the comparison in both cases.
This is not a case where you should think about speed. You should think about which condition is more expressive and shows the intention of the programmer.

How do jump-tables work?

In the following document, pages 4-5:
http://www.open-std.org/jtc1/sc22/wg21/docs/ESC_Boston_01_304_paper.pdf
typedef int (* jumpfnct)(void * param);

static int CaseError(void * param)
{
    return -1;
}

static jumpfnct const jumptable[] =
{
    CaseError, CaseError, ...
    .
    .
    .
    Case44, CaseError, ...
    .
    .
    .
    CaseError, Case255
};
result = index <= 0xFF ? jumptable[index](param) : -1;
it is comparing IF-ELSE vs SWITCH and then introduces this "Jump table". Apparently it is the fastest implementation of the three. What exactly is it? I cannot see how it could work??
The jumptable is a method of mapping some input integer to an action. It stems from the fact that you can use the input integer as the index of an array.
The code sets up an array of pointers to functions. Your input integer is then used to select one of these function-pointers. Generally, it looks like it's going to be a pointer to the function CaseError. However, every now and again, it will be a different function that is being pointed to.
It's designed so that
jumptable[62] = Case62;
jumptable[95] = Case95;
jumptable[35] = Case35;
jumptable[34] = CaseError; /* For example... and so it goes on */
Thus, selecting the right function to call is constant time... with the if-elses and switches, the time taken to select the correct function depends on the input integer... assuming the compiler doesn't optimize the switch into a jumptable itself... if it's for embedded code, then there's a chance that optimizations of this kind have been disabled... you'd have to check.
Once the correct function-pointer is found, the last line simply calls it:
result = index <= 0xFF ? jumptable[index](param) : -1;
becomes
result = index <= 0xFF           /* Check that the index is within
                                    the range of the jump table */
         ? jumptable[index](param) /* jumptable[index] selects a function,
                                      then it gets called with (param) */
         : -1;                   /* If the index is out of range, set result to -1.
                                    Personally, I think a better choice would be
                                    to call CaseError(param) here */
jumpfnct is a pointer-to-function type. jumptable is an array consisting of a number of jumpfncts. The functions can be called just by referencing their position in the array.
For example, jumptable[0](param) will execute the first function, passing along param. jumptable[1](param) will execute the second function, etc.
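A stripped-down, runnable version of the pattern (the Case0/Case1 handlers are made up for illustration):

#include <cstdio>

typedef int (*jumpfnct)(void* param);

static int CaseError(void*) { return -1; }
static int Case0(void*) { return 100; }
static int Case1(void*) { return 200; }

static jumpfnct const jumptable[] = { Case0, Case1, CaseError };

int main()
{
    for (int index = 0; index < 4; ++index) {
        // one bounds check, then one indexed call -- no if/else chain
        int result = index <= 2 ? jumptable[index](nullptr) : -1;
        std::printf("index %d -> %d\n", index, result);
    }
}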
If you don't know about function pointers, you shouldn't use this trick. They're very handy, in a narrow domain.
It's very fast and space efficient, when what you're doing is switching between a large number of similar function calls. You are adding a function call overhead that a switch statement doesn't necessarily have, so it might not be appropriate in all circumstances. If your code is something like this:
switch(x) {
case 1:
    function1();
    break;
case 2:
    function2();
    break;
...
}
A jump table might be a good substitution. If, though, your switch is something like this:
switch(x) {
case 1:
    y++;
    break;
case 1023:
    y--;
    break;
...
}
It probably wouldn't be worth doing.
I've used them in a toy FORTH language interpreter, where they were invaluable, but in most cases you're not going to see a speed benefit that makes them worth using. Use them if it makes the logic of your program clearer, not for optimization.
This jumptable returns a pointer-to-function by its index. You define this table in a way that invalid indexes point to the function that returns some invalid code (like -1 in the example) and valid indexes point to the functions you need to call.
The construction
jumptable[index]
returns a pointer-to-function, and this function gets called:
jumptable[index](param)
where param is some custom parameter.
A Jump-Table is an obvious, but rarely used optimization, that for some reason seems to have fallen out of favor.
Briefly, instead of testing a value and exiting out of a switch/case or if-else block to branch to a function or code path, you create an array filled with the addresses of the functions the program can branch to.
Once completed, this arrangement eliminates the relentless if testing attendant with if-else and switch/case blocks. The code uses the variable that would otherwise be tested with if as a subscript into the function-pointer array, and proceeds directly to the appropriate code - sans ANY if testing. A perfectly efficient branch. The assembly code should literally be a jump.
If you profile code and find a hot-spot where the program is spending a large % of its time, look to this kind of optimization to improve performance. A little bit of this can go a long way if it's part of a code's hot-spot.
Thanks for the link. Nice find!
As mentioned in the comment above, whether this solution is more or less efficient than, for example, a switch statement depends on the amount of work needed for each case.
Writing a regular switch statement for the values you want to process will definitely be a clearer way to see what the code does. So unless either space or speed requirements dictate a more sophisticated solution, I would suggest that this is not a "better" solution.
Tables of function pointers are however an efficient and good way to solve certain problems. I use function pointers in a table quite regularly to do things like "benchmark 11 different solutions to a problem", where I have a struct with the name of the function, the function pointer itself, and perhaps some parameters. Then I have one function to time and loop over the code a few million times (or whatever it takes to get a long enough measurement to make sense).
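That kind of benchmarking table might be sketched like this (the two candidate functions are stand-ins of mine):

#include <chrono>
#include <cstdio>

static int solution_a(int x) { return x * 2; }  // candidate implementations
static int solution_b(int x) { return x << 1; }

struct Candidate {
    const char* name;
    int (*fn)(int);
};

int main()
{
    const Candidate table[] = { {"solution_a", solution_a},
                                {"solution_b", solution_b} };
    for (const Candidate& c : table) {
        volatile int sink = 0; // keep the calls from being optimised away
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < 1000000; ++i)
            sink = c.fn(i);
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("%s: %lld ns\n", c.name, static_cast<long long>(ns));
    }
}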

Any better alternatives for getting the digits of a number? (C++)

I know that you can get the digits of a number using modulus and division. The following is how I've done it in the past (pseudocode, so as to make students reading this do some work for their homework assignment):
int pointer getDigits(int number)
    initialize int pointer to array of some size
    initialize int i to zero
    while number is greater than zero
        store result of number mod 10 in array at index i
        divide number by 10 and store result in number
        increment i
    return int pointer
Anyway, I was wondering if there is a better, more efficient way to accomplish this task? If not, is there any alternative methods for this task, avoiding the use of strings? C-style or otherwise?
Thanks. I ask because I'm going to be wanting to do this in a personal project of mine, and I would like to do it as efficiently as possible.
Any help and/or insight is greatly appreciated.
The time it takes to extract the digits will be dwarfed by the time required to dynamically allocate the array. Consider returning the result in a struct:
struct extracted_digits
{
    int number_of_digits;
    char digits[12];
};
You'll want to pick a suitable value for the maximum number of digits (12 here, which is enough for a 32-bit integer). Alternatively, you could return a std::array<char, 12> and encode the terminal by using an invalid value (so, after the last value, store a 10 or something else that isn't a digit).
Depending on whether you want to handle negative values, you'll also have to decide how to report the unary minus (-).
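A sketch of the extraction routine filling that struct (folding the sign into unsigned arithmetic is my choice here, one way of answering the unary-minus question above):

#include <cstdint>

struct extracted_digits
{
    int number_of_digits;
    char digits[12];
};

// Digits are stored least significant first; no dynamic allocation.
extracted_digits get_digits(int32_t number)
{
    extracted_digits out = {0, {0}};
    // Negate in unsigned arithmetic so INT32_MIN is handled safely.
    uint32_t n = number < 0 ? 0u - static_cast<uint32_t>(number)
                            : static_cast<uint32_t>(number);
    do {
        out.digits[out.number_of_digits++] = static_cast<char>(n % 10);
        n /= 10;
    } while (n > 0);
    return out;
}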
Unless you want the representation of the number in a base that's a power of 2, that's about the only way to do it.
Smacks of premature optimisation. If profiling proves it matters, then be sure to compare your algo to itoa - internally it may use some CPU instructions that you don't have explicit access to from C++, and which your compiler's optimiser may not be clever enough to employ (e.g. AAM, which divs while saving the mod result). Experiment (and benchmark) coding the assembler yourself. You might dig around for assembly implementations of ITOA (which isn't identical to what you're asking for, but might suggest the optimal CPU instructions).
By "avoiding the use of strings", I'm going to assume you're doing this because a string-only representation is pretty inefficient if you want an integer value.
To that end, I'm going to suggest a slightly unorthodox approach which may be suitable. Don't store them in one form; store them in both. The code below is in C - it will work in C++, but you may want to consider using C++ equivalents - the idea behind it doesn't change, however.
By "storing both forms", I mean you can have a structure like:
typedef struct {
    int ival;
    char sval[sizeof("-2147483648")]; // enough for 32-bits
    int dirtyS;
} tIntStr;
and pass around this structure (or its address) rather than the integer itself.
By having macros or inline functions like:
inline void intstrSetI (tIntStr *is, int ival) {
    is->ival = ival;
    is->dirtyS = 1;
}

inline char *intstrGetS (tIntStr *is) {
    if (is->dirtyS) {
        sprintf (is->sval, "%d", is->ival);
        is->dirtyS = 0;
    }
    return is->sval;
}
Then, to set the value, you would use:
tIntStr is;
intstrSetI (&is, 42);
And whenever you wanted the string representation:
printf ("%s\n", intstrGetS(&is));
fprintf (logFile, "%s\n", intstrGetS(&is));
This has the advantage of calculating the string representation only when needed: the fprintf above would not have to recalculate the string representation, and the printf would only do so if the value were dirty.
This is similar to a trick I use in SQL with precomputed columns and triggers. The idea there is that you only perform calculations when needed. So an extra column holding the indexed lowercased last name, along with an insert/update trigger to calculate it, is usually a lot more efficient than select lower(non_lowercased_last_name). That's because it amortises the cost of the calculation (done at write time) across all reads.
In that sense, there's little advantage if your code profile is set-int/use-string/set-int/use-string.... But, if it's set-int/use-string/use-string/use-string/use-string..., you'll get a performance boost.
Granted this has a cost, at the bare minimum extra storage required, but most performance issues boil down to a space/time trade-off.
And, if you really want to avoid strings, you can still use the same method (calculate only when needed), it's just that the calculation (and structure) will be different.
As an aside: you may well want to use the library functions to do this rather than handcrafting your own code. Library functions will normally be heavily optimised, possibly more so than your compiler can make from your code (although that's not guaranteed of course).
It's also likely that an itoa, if you have one, will probably outperform sprintf("%d") as well, given its limited use case. You should, however, measure, not guess! Not just in terms of the library functions, but also this entire solution (and the others).
It's fairly trivial to see that a base-100 solution could work as well, using the "digits" 00-99. In each iteration, you'd do a %100 to produce such a digit pair, thus halving the number of steps. The tradeoff is that your digit table is now 200 bytes instead of 10. Still, it easily fits in L1 cache (obviously, this only applies if you're converting a lot of numbers, but otherwise efficiency is moot anyway). Also, you might end up with a leading zero, as in "0128".
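That base-100 idea might be sketched like this (names and layout are mine; the table is built once up front):

#include <cstdio>

// 200-byte table: pairs[2*i], pairs[2*i+1] spell out i as "00".."99".
static char pairs[200];

static void init_pairs()
{
    for (int i = 0; i < 100; ++i) {
        pairs[2 * i]     = static_cast<char>('0' + i / 10);
        pairs[2 * i + 1] = static_cast<char>('0' + i % 10);
    }
}

// Convert n to decimal in buf (room for 11 chars); returns the length.
// Each % 100 step peels off two digits, halving the number of divisions.
static int to_decimal(unsigned n, char* buf)
{
    char tmp[10]; // digits, least significant first
    int len = 0;
    while (n >= 100) {
        unsigned p = n % 100;
        n /= 100;
        tmp[len++] = pairs[2 * p + 1];
        tmp[len++] = pairs[2 * p];
    }
    tmp[len++] = static_cast<char>('0' + n % 10); // last one or two digits
    if (n >= 10)
        tmp[len++] = static_cast<char>('0' + n / 10);
    for (int i = 0; i < len; ++i)
        buf[i] = tmp[len - 1 - i]; // reverse into the caller's buffer
    buf[len] = '\0';
    return len;
}

int main()
{
    init_pairs();
    char buf[11];
    to_decimal(1234567890u, buf);
    std::printf("%s\n", buf); // prints 1234567890
}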
Yes, there is a more efficient way, but it is not portable. Intel's FPU has a special BCD number format. So, all you have to do is call the corresponding assembler instruction, which converts ST(0) to BCD format and stores the result in memory. The instruction name is FBSTP.
Mathematically speaking, the number of decimal characters of an integer a is 1 + int(log10(abs(a))) + (a < 0) for a ≠ 0; the (a < 0) term accounts for the minus sign, and a = 0 is a one-character special case.
You will not use strings, but go through floating point and the log functions. If your platform has any kind of FP accelerator (every PC or similar has one), that will not be a big deal, and it will beat whatever "string based" algorithm (which is nothing more than an iterative divide-by-ten and count).
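That formula, as code (a sketch; a == 0 is the one special case):

#include <cmath>
#include <cstdlib>

// Characters in the decimal representation of a; the (a < 0) term
// counts the minus sign.
int decimal_chars(long a)
{
    if (a == 0)
        return 1;
    return 1 + static_cast<int>(std::log10(std::labs(a))) + (a < 0);
}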