I suppose the real question is how to convert base2/binary to base10. The most common application of this would probably be in creating strings for output: turning a chunk of binary numerical data into an array of characters. How exactly is this done?
my guess:
Seeing as there probably isn't a string predefined for each numerical value, I'm guessing that the computer goes through each bit of the integer from right to left, each time incrementing the appropriate values in the char array/base-10 notation places. If we take the number 160 in binary (10100000), it would know that a 1 in the 8th place means 128, so it places 1 into the third column, 2 into the second, and 8 into the first. The 1 in the 6th place means 32, so it adds 3 and 2 to the second and first places, carrying over if needed. After this it's an easy conversion to actual char codes.
while number != 0:
nextdigit = number % 10
AddToLeft(result, convert nextdigit to char)
number = number / 10
It's left as an exercise for the reader to handle zero and negative numbers.
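Here is a minimal C++ sketch of that loop, filling in the zero and negative cases (my own illustration; the name to_decimal is just for the example):

#include <cstdint>
#include <string>

// Convert a signed integer to its decimal string representation.
std::string to_decimal(int64_t number) {
    if (number == 0) return "0";                 // the loop below would otherwise produce ""
    bool negative = number < 0;
    // Work on the magnitude as unsigned so INT64_MIN doesn't overflow on negation.
    uint64_t n = negative ? 0 - static_cast<uint64_t>(number) : static_cast<uint64_t>(number);
    std::string result;
    while (n != 0) {
        char digit = static_cast<char>('0' + n % 10);   // next (rightmost) decimal digit
        result.insert(result.begin(), digit);           // "AddToLeft"
        n /= 10;
    }
    if (negative) result.insert(result.begin(), '-');
    return result;
}

For example, to_decimal(160) returns "160" and to_decimal(-42) returns "-42".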
How it's done depends on the platform. Intel-type processors, for example, have built-in support for packed BCD (Binary Coded Decimal) arithmetic.
Let's say that the register al contains binary 00101010, decimal 42.
pushf ;store flags on the stack
std ;set decimal flag
sub bl, bl ;clear bl register
add bl, al ;add al to bl using BCD arithmetic
popf ;restore flags from the stack
The bl register now contains 01000010.
The upper four bits contain 0100, or decimal 4.
The lower four bits contain 0010, or decimal 2.
To convert this into characters, extract the four bit values from the register and add 48 to get the character code for the digit.
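In C++ that last step might look like this (a small sketch of my own, assuming the packed-BCD byte has already been copied into an ordinary variable):

#include <cstdint>
#include <iostream>

int main() {
    uint8_t bcd = 0x42;                                   // packed BCD for decimal 42
    char high = static_cast<char>('0' + (bcd >> 4));      // upper nibble: 48 + 4 -> '4'
    char low  = static_cast<char>('0' + (bcd & 0x0F));    // lower nibble: 48 + 2 -> '2'
    std::cout << high << low << '\n';                     // prints "42"
}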
The implementation of printf in the Linux kernel is pretty readable. See lib/vsprintf.c:number().
Well, okay, mostly pretty readable. do_div is a macro with assembler in it.
Firstly, this is a tricky question because it obviously depends on the platform and language.
Take Java, for example: integers declared as int are actually 32 bits long.
So to represent the decimal value 0 we should have
00000000000000000000000000000000 <== the leading (most significant) bit denotes whether the value is negative or not.
Well, this is because Java stores int values as half negative and half positive...
So, my guess is Java would do the following:
step 1: get the contents of the 32-bit chunk of memory that the variable or literal refers to
step 2: calculate the decimal value of that bit pattern, so that the binary content is converted to the number 0
step 3: (JDK 5+) use Integer.toString() to return the string literal "0"
This might be wrong because I have never thought about a question like this.
I really don't think any language will try converting the value into an array of chars, because that is a lot of overhead to add...
OR, to convert binary values to decimal, based on my math experience, you would calculate based on the value, not its literal representation:
1 1 0 1 in binary
1*2^3 + 1*2^2 + 0*2^1 + 1*2^0 = 13 in decimal
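For what it's worth, that positional calculation is easy to express in code; here is a small C++ sketch of my own that evaluates a string of bits this way:

#include <cstdint>
#include <iostream>
#include <string>

// Evaluate a binary string positionally: each step shifts the running value left and adds the next bit.
uint64_t binary_to_value(const std::string& bits) {
    uint64_t value = 0;
    for (char c : bits)
        value = value * 2 + (c - '0');
    return value;
}

int main() {
    std::cout << binary_to_value("1101") << '\n';   // prints 13
}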
Let's say that I have a VIN like this: SB164ABN10E082986.
Now, I want to assign an integer to each possible VIN (without the WMI, which is the first three characters, leaving 64ABN10E082986) in a manner that lets me retrieve the VIN from this integer afterwards.
What would be the best way of doing so? The algorithm can take advantage of the fact that the first 10 characters can only be composed from these values:
1234567890 ABCDEFGH JKLMN P RSTUVWXYZ
and the last 4 can only be single-digit numbers (0-9).
Background: I want to be able to save memory. So, in a sense, I'm searching for a special kind of compression. I calculated that an 8-byte integer would suffice under those conditions. I am only missing the way of doing "the mapping".
This is how it should work:
VIN -> ALGORITHM -> INDEX
INDEX -> ALGORITHM REVERSED -> VIN
Each character becomes a digit in a variable-base integer. Then convert those digits to the integer.
Those that can be digits or one of 23 letters are base 33. Those that can only be digits are base 10. The total number of possible combinations is 33^10 times 10^4. The logarithm base two of that is 63.73, so it will just fit in a 64-bit integer.
You start with zero. Add the first digit. Multiply by the base of the next digit (33 or 10). Add that digit. Continue until all digits processed. You have the integer. Each digit is 0..32 or 0..9. Take care to properly convert the discontiguous letters to the contiguous numbers 0..32.
Your string 64ABN10E082986 is then encoded as the integer 2836568518287652986. (I gave the digits the values 0..9, and the letters 10..32.)
You can reverse the process by taking the integer and both dividing it by the last base and taking it modulo the last base. The result of the modulo is the last digit. Continue with the quotient from the division for the next digit.
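A rough C++ sketch of that mixed-base encoding and decoding (my own illustration; the character-to-value mapping below is one possible choice, and the exact integer you get depends on the mapping you pick):

#include <cstdint>
#include <stdexcept>
#include <string>

// Digits 0-9 map to 0..9, the 23 allowed letters map to 10..32.
const std::string kLetters = "ABCDEFGHJKLMNPRSTUVWXYZ";

int char_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    std::size_t pos = kLetters.find(c);
    if (pos == std::string::npos) throw std::invalid_argument("invalid VIN character");
    return 10 + static_cast<int>(pos);
}

// The first 10 positions are base 33 (digit or letter), the last 4 are base 10 (digit only).
int base_at(std::size_t i) { return i < 10 ? 33 : 10; }

uint64_t encode(const std::string& vin14) {               // expects the 14-character tail
    uint64_t acc = 0;
    for (std::size_t i = 0; i < vin14.size(); ++i)
        acc = acc * base_at(i) + char_value(vin14[i]);     // multiply by this position's base, add the digit
    return acc;
}

std::string decode(uint64_t acc) {
    std::string out(14, '?');
    for (std::size_t i = 14; i-- > 0; ) {                  // walk backwards: last digit first
        int b = base_at(i);
        int v = static_cast<int>(acc % b);
        acc /= b;
        out[i] = v < 10 ? static_cast<char>('0' + v) : kLetters[v - 10];
    }
    return out;
}

With this, decode(encode("64ABN10E082986")) round-trips back to "64ABN10E082986".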
By the way, in the US anyway the last five characters of the VIN must be numeric digits. I don't know why you are only considering four.
Assign a 6-bit number to each valid character/digit and encode all ten in less than 64 bits. This means it would fit in 8 bytes, i.e. a uint64_t in C/C++, and would be easy to store in a database etc.
Count the valid characters:
echo -n "1234567890ABCDEFGHJKLMNPRSTUVWXYZ"| wc -c
33
The minimum number of bits that can represent 33 distinct values is 6, and 10 * 6 = 60.
If the idea is to make it as small as possible, where the length may vary based on the VIN, then that would be a different answer; looking at the actual Wikipedia page for VIN, there are likely quite a few ways to do that.
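A minimal sketch of that 6-bit packing in C++ (my own illustration; it packs the ten variable characters into 60 bits of a uint64_t, uses the 33-character table from the wc line above, and assumes the input is already validated):

#include <cstdint>
#include <string>

const std::string kTable = "1234567890ABCDEFGHJKLMNPRSTUVWXYZ";   // the 33 valid characters

// Pack ten characters into 6 bits each (10 * 6 = 60 bits).
uint64_t pack10(const std::string& chars10) {
    uint64_t packed = 0;
    for (char c : chars10)
        packed = (packed << 6) | static_cast<uint64_t>(kTable.find(c));
    return packed;
}

std::string unpack10(uint64_t packed) {
    std::string out(10, '?');
    for (int i = 9; i >= 0; --i) {
        out[i] = kTable[packed & 0x3F];    // the lowest 6 bits hold the most recently packed character
        packed >>= 6;
    }
    return out;
}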
I had a Single (which I believe is equivalent to float in C++) in VBA in an Excel workbook module. Anyway, the value I originally assigned (876.34497) is rounded off to 876.345 in the Immediate Window, in the Watch window, and in the hover tooltip when I set a breakpoint on the VBA. However, if I pass this Single to a C++ DLL, C++ reports it as the original value, 876.34497.
So, is it actually stored in memory as the original value? Is this some limitation of the debugger? I'm unsure what is going on here. It makes it difficult to test whether what I'm passing is what I'm getting on the C++ side.
I tried:
?CStr(test)
876.345
?CDbl(test)
876.344970703125
?CSng(test)
876.345
VBA isn't very straightforward, so at some level it must be stored as 876.34497 in memory. Otherwise, I don't think CDbl would be correct like it is.
VBA variables of type "single" are stored as "32-bit hardware implementation of IEEE 754[-]1985 [sic]." [see: https://msdn.microsoft.com/en-us/library/ee177324.aspx].
What this means in English is, "single" precision numbers are converted to binary then truncated to fit in a 4 byte (32-bit) sequence. The exact process is very well described in Wikipedia under http://en.wikipedia.org/wiki/Single-precision_floating-point_format . The upshot is that all single precision numbers are expressed as
(1) a 23-bit "fraction" between 0 and 1 (to which an implied leading 1 is added, giving a significand between 1 and 2), *times*
(2) an 8-bit exponent which represents a multiplier between 2^(-126) and 2^127, *times*
(3) one more bit for positive or negative.
The process of converting numbers to binary and back causes two types of rounding errors:
(1) Significant Digits -- as you have noticed, there is a limit on significant digits. A 23-bit fraction can only take on 8,388,608 distinct values. Stated another way, no number can be expressed with greater than +/- 0.000012% precision. Reaching back to high school science, you may recall that that is another way of saying you cannot count on more than six significant digits (well, decimal digits, at least ... of course you have 23 significant binary digits, 24 counting the implied leading bit). So any representation of a number with more than six significant digits will get rounded off. However, it won't get rounded off to the nearest decimal digit ... it will get rounded off to the nearest binary digit. This often causes some unexpected results (like yours).
(2) Binary conversion -- The other type of error is even more pernicious. There are some numbers with significantly less than six (decimal) digits that will get rounded off. For example, 1/5 in decimal is 0.2000000. It never gets "rounded off." But the same number in binary is 0.00110011001100110011.... repeating forever. (That sequence is equivalent to 1/8 + 1/16 + 1/16*(1/8+1/16) + 1/256*(1/8+1/16) ... ) If you used an arbitrary number of binary digits to represent 0.20, then converted it back to decimal, you will NEVER get exactly 0.20. For example, if you used eight bits, you would have 0.00110011 in binary which is:
0.12500000
0.06250000
0.00781250
+ 0.00390625
------------
0.19921875
No matter how many binary digits you use, you will never get exactly 0.20, because 0.20 cannot be expressed as a finite sum of powers of two.
That in a nutshell explains what's going on. When you assign 876.34497 to "test," it gets converted internally to:
0 10001000 10110110001011000010100
136 5,969,428
Which is (+1) * 2^(136-127) * (1 + 5,969,428/(2^23))
Excel is automatically truncating the display of your single-precision number to show only six significant digits, because it knows that the seventh digit might be wrong. I can't tell you what the number is exactly because my excel doesn't display enough significant digits! But you get the point.
When you coerce the value into double precision, it uses the entire binary string and then adds another 29 zero bits to the end of the fraction. It now allows you to display twice as many significant figures because it is double precision, but as you can see, the conversion from 8 decimal digits to 23 binary digits and then appending a long string of zeros has introduced some errors. Not really errors, if you understand what it's doing; just artifacts. After all, it's doing exactly what you told it to do ... you just didn't know what you were telling it to do!
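If you want to see those fields directly, here is a small C++ sketch of my own (assuming an IEEE 754 float, which is what VBA's Single maps to) that prints the sign, exponent, and fraction of the stored single:

#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    float test = 876.34497f;                     // the nearest single-precision value gets stored
    uint32_t bits;
    std::memcpy(&bits, &test, sizeof bits);      // reinterpret the 4 bytes as an integer

    uint32_t sign     = bits >> 31;              // 1 bit
    uint32_t exponent = (bits >> 23) & 0xFF;     // 8 bits, biased by 127
    uint32_t fraction = bits & 0x7FFFFF;         // 23 bits, with an implied leading 1

    std::cout << "sign=" << sign << " exponent=" << exponent
              << " fraction=" << fraction << '\n';
    std::cout.precision(17);
    std::cout << "stored value: " << static_cast<double>(test) << '\n';   // 876.344970703125
}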
The following lines are part of my really "useless" C++ program... which calculates powers of 2 only up to 2^63 instead of 2^128 (which is what is being asked), due to the limited width of the "unsigned long long" type, which only gives about 19-20 decimal digits...!!!
Just that... I need a variable of 16 bytes or more, which is not provided by:
- __int128 (Visual Studio 2010 turns the letters blue, but shows a red underline and an error when building: "keyword not supported on this architecture" on a 32-bit system)
- Boost projects... after I googled it; being a newcomer, "I was lost in the universe" when I came across the professional sites (does boost::bigint... exist??? not a rhetorical question)
- (and of course not by just typing more longs)
#include <iostream>   // cout, endl
#include <iomanip>    // setw
#include <cstdlib>    // system
using namespace std;

int main()
{
    unsigned long long result;
    int i;
    const int max = 128;
    for (i = 0, result = 1ULL; i <= max; ++i, result *= 2)
        cout << setw(3) << i << setw(32) << result << endl;   // overflows to 0 once i exceeds 63
    system("pause");
    return 0;
}
You could find a "bigint" implementation in C++ that implements operator<<() to output to ostreams, but if all you want to do is print out powers of 2 to a console or text string, and you don't need to actually do "bigint" math (except to compute those powers of 2), there's a simpler approach that will give you powers of 2 out to pretty much as large as you want to go and have the patience to look through:
Store each decimal digit (numbers 0 through 9) as a separate entity, perhaps as an array of chars or ints or in a std::list of the digits. Using a std::list has the advantage that you can easily add new digit places at the front as your number gets bigger, but you can do that almost as easily by storing the digits in reverse order in a std::vector (of course to print them, you have to iterate from the back to the front to print the digits in their proper order).
Once you figure out how you want to store the digits, your algorithm for doubling the number is as follows: iterate over the digits of the large number, doubling each (mod 10, of course) and carrying any overflow (i.e. a flag that says whether the result, before the %10, was greater than 9) from that digit to the next. On the next digit, double it first and then add 1 if the previous digit overflowed. If that result overflows, carry the overflow on to the next digit, and continue to the end of all of the digits. At the end of the digits, if doubling the last digit and adding any overflow from the previous digit caused an overflow in that last digit, add a new digit and set it to 1. Then print the resulting list of digits.
With this algorithm, you can print powers-of-2 as large as you like. Of course they're not "numbers" in the sense that you can't use them directly in C++ math ops.
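Here is a minimal C++ sketch of that approach (my own illustration): the digits are stored least-significant-first in a std::vector, and the number is doubled in place once per power:

#include <iostream>
#include <vector>

int main() {
    std::vector<int> digits{1};                  // the number 1, least significant digit first
    for (int power = 1; power <= 128; ++power) {
        int carry = 0;
        for (int& d : digits) {                  // double every digit, propagating the carry
            int doubled = d * 2 + carry;
            d = doubled % 10;
            carry = doubled / 10;                // 0 or 1
        }
        if (carry) digits.push_back(1);          // the number grew by one decimal digit
        std::cout << "2^" << power << " = ";
        for (auto it = digits.rbegin(); it != digits.rend(); ++it)   // print most significant first
            std::cout << *it;
        std::cout << '\n';
    }
}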
SSE and AVX intrinsics go up to 256 bits, given a modern CPU. The types are named __m128i and __m256i.
A 128-bit integer is a really big integer. You should implement your own data type. You can create an array of shorts, store the digits there, and implement multiplication just like you do in your math notebook; that's probably the simplest approach.
This one is not finished, of course! The '2' is still missing ;)
I am currently developing a utility that handles all arithmetic operations on bitsets.
The bitset can auto-resize to fit any number, so it can perform addition / subtraction / division / multiplication and modulo on very big bitsets (I've gone as far as loading a 700 MB movie into one to treat it just like a primitive integer).
I'm facing one problem though: I need my addition to resize the bitset to fit the exact number of bits needed after the addition, but I couldn't come up with an absolute rule to know exactly how many bits would be needed to store everything, knowing only the number of bits that both operands occupy (whether the representation is positive or negative doesn't matter).
I have the whole code, which I can share with you to point out the problem, if my question is not clear enough.
Thanks in advance.
jav974
but I couldn't come up with an absolute rule to know exactly how many bits would be needed to store everything, knowing only the number of bits that both operands occupy (whether the representation is positive or negative doesn't matter)
Nor will you: there's no way to know, given only the number of bits that both operands occupy.
In the case of same-signed numbers, you may need one extra bit - you can start at the most significant bit of the smaller number, and scan for 0s that would absorb the impact of a carry. For example:
1010111011101 +
..10111010101
..^ start here
As both numbers have a 1 here you need to scan left until you hit a 0 (in which case the result has the same number of digits as the larger input), or until you reach the most significant bit of the larger number (in which case there's one more digit in the result).
1001111011101 +
..10111010101
..^ start here
In this case where the longer input has a 0 at the starting location, you first need to do a right-moving scan to establish whether there'll be a carry from the right of that starting position before launching into the left-moving scan above.
When signs differ:
if one value has 2 or more digits less than the other, then the number of digits required in the result will be either the same or one less than the digits in the larger input
otherwise, you'll have to do more of the work for an addition just to work out how many digits the result needs.
This is assuming the sign bit is separate from the count of magnitude bits.
Finally, the number of bits needed after an addition is at most the number of bits of the larger operand, plus 1.
Here is an explanation, using an unsigned char:
For max unsigned char :
11111111 (255)
+ 11111111 (255)
= 111111110 (510)
Naturally, since max + max needs (bits of max) + 1 bits, then for x and y between 0 and max the result needs at most (bits of max) + 1 bits (the very maximum).
This works the same way with signed integers.
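A tiny C++ check of that bound, using 64-bit values as a stand-in for the bitset (my own sketch; bit_length is a made-up helper name):

#include <algorithm>
#include <cstdint>
#include <iostream>

// Number of bits needed to represent an unsigned value (treating 0 as needing 1 bit).
int bit_length(uint64_t v) {
    int bits = 1;
    while (v >>= 1) ++bits;
    return bits;
}

int main() {
    uint64_t a = 255, b = 255;                                   // both need 8 bits
    int bound = std::max(bit_length(a), bit_length(b)) + 1;
    std::cout << "sum needs " << bit_length(a + b)               // 9 bits for 510
              << " bits, bound is " << bound << " bits\n";
}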
Or, maybe, what I don't get is unary coding:
In Golomb, or Rice, coding, you split a number N into two parts by dividing it by another number M and then code the integer result of that division in unary and the remainder in binary.
In the Wikipedia example, they use 42 as N and 10 as M, so we end up with a quotient q of 4 (in unary: 11110) and a remainder r of 2 (in binary: 010), so that the resulting message is 11110,010, or 8 bits (the comma can be skipped). The simple binary representation of 42 is 101010, or 6 bits.
To me, this seems to be due to the unary representation of q, which always takes more bits than binary.
Clearly, I'm missing some important point here. What is it?
The important point is that Golomb codes are not meant to be shorter than the shortest binary encoding for one particular number. Rather, by providing a specific kind of variable-length encoding, they reduce the average length per encoded value compared to fixed-width encoding, if the encoded values are from a large range, but the most common values are generally small (and hence are using only a small fraction of that range most of the time).
As an example, if you were to transmit integers in the range from 0 to 1000, but a large majority of the actual values were in the range between 0 and 10, in a fixed-width encoding, most of the transmitted codes would have leading 0s that contain no information:
To cover all values between 0 and 1000, you need a 10-bit wide encoding in fixed-width binary. Now, as most of your values would be below 10, at least the first 6 bits of most numbers would be 0 and would carry little information.
To rectify this with Golomb codes, you split the numbers by dividing them by 10 and encoding the quotient and the remainder separately. For most values, all that would have to be transmitted is the remainder which can be encoded using 4 bits at most (if you use truncated binary for the remainder it can be less). The quotient is then transmitted in unary, which encodes as a single 0 bit for all values below 10, as 10 for 10..19, 110 for 20..29 etc.
Now, for most of your values, you have reduced the message size to 5 bits max, but you are still able to transmit all values unambiguously without separators.
This comes at a rather high cost for the larger values (for example, values in the range 990..999 need 100 bits for the quotient), which is why the coding is optimal for 2-sided geometric distributions.
The long runs of 1 bits in the quotients of larger values can be addressed with subsequent run-length encoding. However, if the quotients consume too much space in the resulting message, this could indicate that other codes might be more appropriate than Golomb/Rice.
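A small C++ sketch of that scheme (my own illustration; it writes the remainder in plain fixed-width binary rather than truncated binary, so it is slightly longer than the optimal Golomb code, and golomb_encode is a made-up name):

#include <cstdint>
#include <iostream>
#include <string>

// Encode n with divisor M: quotient in unary (q ones then a terminating zero),
// remainder in fixed-width binary.
std::string golomb_encode(uint32_t n, uint32_t M, int remainder_bits) {
    uint32_t q = n / M, r = n % M;
    std::string out(q, '1');
    out += '0';                                        // unary terminator
    for (int b = remainder_bits - 1; b >= 0; --b)      // remainder, most significant bit first
        out += ((r >> b) & 1) ? '1' : '0';
    return out;
}

int main() {
    std::cout << golomb_encode(42, 10, 4) << '\n';     // 11110 + 0010 -> 9 bits (8 with truncated binary)
    std::cout << golomb_encode(7, 10, 4) << '\n';      // 0 + 0111 -> 5 bits
}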
One difference between the Golomb coding and binary code is that binary code is not a prefix code, which is a no-go for coding strings of arbitrarily large numbers (you cannot decide if 1010101010101010 is a concatenation of 10101010 and 10101010 or something else). Hence, they are not that easily comparable.
Second, the Golomb code is optimal for geometric distribution, in this case with parameter 2^(-1/10). The probability of 42 is some 0.3 %, so you get the idea about how important is this for the length of the output string.