How can I get the UTF-8 char number from binary in C++?

For example, I have: 11100011 10000010 10100010. It is the binary of: ア;
its number in UTF-8 is:12450
How can I get this number from binary?

The byte sequence you're showing is the UTF-8 encoded version of the character.
You need to decode the UTF-8 to get to the Unicode code point.
For this exact sequence of bytes, the following bits make up the code point:
11100011 10000010 10100010
    ****   ******   ******
So, concatenating the asterisked bits we get the number 0011000010100010, which equals 0x30a2 or 12450 in decimal.
See the Wikipedia description for details on how to interpret the encoding.
In a nutshell: if bit 7 is set in the first byte, the number of adjacent bits below it that are also set (call it m; here m = 2) gives the number of bytes that follow for this code point. The number of bits to extract is (8 - 1 - 1 - m) from the first byte and 6 from each subsequent byte. So here we get (8 - 1 - 1 - 2) = 4 bits from the first byte plus 2 * 6 = 12 bits from the two continuation bytes, 16 bits in total.
As pointed out in comments, there are plenty of libraries for this, so you might not need to implement it yourself.
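As a minimal sketch of the bit extraction described above, restricted to the three-byte case only: the helper name decode_utf8_3 is made up for illustration, and it performs no validation at all, so it is not a substitute for a real decoder.

```cpp
#include <cstdint>

// Hypothetical helper: decodes one three-byte UTF-8 sequence of the form
// 1110xxxx 10xxxxxx 10xxxxxx. It assumes the input is well formed.
std::uint32_t decode_utf8_3(std::uint8_t b0, std::uint8_t b1, std::uint8_t b2) {
    return (std::uint32_t(b0 & 0x0F) << 12)   // low 4 bits of the lead byte
         | (std::uint32_t(b1 & 0x3F) << 6)    // low 6 bits of the first continuation byte
         |  std::uint32_t(b2 & 0x3F);         // low 6 bits of the second continuation byte
}
```

For the bytes in the question, decode_utf8_3(0xE3, 0x82, 0xA2) yields 0x30A2, which is 12450.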

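Working from the Wikipedia page, I came up with this:
// Requires <stdexcept>.
unsigned utf8_to_codepoint(const char* str) {
    // Work on unsigned bytes: plain char may be signed, which breaks the comparisons.
    const unsigned char* ptr = reinterpret_cast<const unsigned char*>(str);
    if (*ptr < 0x80) return *ptr;
    if (*ptr < 0xC0 || *ptr >= 0xF8) throw std::runtime_error("invalid utf8 lead byte");
    unsigned result = 0;
    int shift = 0;
    // The branches must be exclusive, otherwise a later one overwrites an earlier one.
    if (*ptr < 0xE0) { result = *ptr & 0x1F; shift = 1; }
    else if (*ptr < 0xF0) { result = *ptr & 0x0F; shift = 2; }
    else { result = *ptr & 0x07; shift = 3; }
    for (; shift > 0; --shift) {
        ++ptr;
        // Continuation bytes are 10xxxxxx, i.e. in the range 0x80..0xBF.
        if (*ptr < 0x80 || *ptr >= 0xC0)
            throw std::runtime_error("invalid utf8 continuation byte");
        result <<= 6;
        result |= *ptr & 0x3F;
    }
    return result;
}
Note that this is a very poor implementation: it accepts a lot of invalid values that it probably shouldn't (overlong encodings, surrogate code points, values above 0x10FFFF). I put this up merely to show that it's a lot harder than you'd think, and that you should use a good Unicode library.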

Related

16-bit to 10-bit conversion code explanation

I came across the following code to convert 16-bit numbers to 10-bit numbers and store it inside an integer. Could anyone maybe explain to me what exactly is happening with the AND 0x03?
// Convert the data to 10-bits
int xAccl = (((data[1] & 0x03) * 256) + data[0]);
if(xAccl > 511) {
xAccl -= 1024;
}
Link to where I got the code: https://www.instructables.com/id/Measurement-of-Acceleration-Using-ADXL345-and-Ardu/

The bitwise operator & applies a mask; in this case, it clears the 6 highest bits of data[1] and keeps only the lowest 2.
Basically, this code does a modulo % 1024 (for unsigned values).

data[1] takes the 2nd byte; & 0x03 masks that byte with binary 11, so it keeps 2 bits; * 256 is the same as << 8, i.e. it pushes those 2 bits into the 9th and 10th positions; adding data[0] combines these two bytes (personally I'd have used |, not +).
So xAccl is now the low 10 bits of the 16-bit value, with data[0] as the low byte and data[1] as the high byte (little-endian byte order).
The > 511 seems to be a sign check; essentially, it is saying "if the 10th bit is set, treat the entire thing as a negative integer as though we'd used 10-bit two's complement rules".
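The same steps can be written out as a small function. This is only a sketch: the name decode10 is made up for illustration, and it assumes, as in the question's code, that the low byte comes first and the high byte second.

```cpp
#include <cstdint>

// Hypothetical helper: combines two bytes into a signed 10-bit value.
// lo is data[0] (low byte), hi is data[1] (high byte).
int decode10(std::uint8_t lo, std::uint8_t hi) {
    int value = ((hi & 0x03) << 8) | lo;   // keep 2 bits of hi, place them above lo
    if (value > 511) {                     // bit 9 set: negative in 10-bit two's complement
        value -= 1024;
    }
    return value;
}
```

With this sketch, the raw value 1023 (lo = 0xFF, hi = 0x03) comes out as -1, and any bits of hi above the lowest two are ignored.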

Converting Big Endian Formatted Bits to Intended Decimal Value While Ignoring First Bit

I am reading a binary file and trying to convert from IBM 4-byte floating point to double in C++. How exactly would one use the first byte of the IBM data to find the ccccccc in the given picture?
IBM to value conversion chart
The code below gives an exponent way larger than what the data should have. I am confused with how the line
exponent = ((IBM4ByteValue[0] & 127) - 64);
executes; I do not understand the use of the & operator in this statement. Essentially, what the previous author of this code implied is that (IBM4ByteValue[0]) holds the ccccccc, so does this mean that the ampersand sets a maximum value that the left side of the operator can equal? Even if this is correct, I'm not sure how this line accounts for the fact that there is big-endian bitwise notation in the first byte (I believe it is big endian after viewing the picture). Not to mention that 1000001 and 0000001 should have the same exponent (-63), but they will not with my current interpretation of the previously mentioned line.
So in short, could someone show me how to find the ccccccc (shown in the picture linked above) using the first byte, IBM4ByteValue[0]? Maybe by accessing each individual bit? However, I do not know the code to do this using my array.
**This code is using the std namespace.
**I believe ret should be mantissa * pow(16, 24+exponent); however, if I'm wrong about the exponent I'm probably wrong about this (I got the IBM conversion from a previously asked Stack Overflow question). **I would have just commented on the old post, but this question was a bit too large, pun intended, for a comment. It is also different in that I am asking how exactly one accesses the bits in an array storing whole bytes.
Code I put together using an IBM conversion from a previous question's answer:
for (long pos = 0; pos < fileLength; pos += BUF_LEN) {
file.seekg(bytePosition);
file.read((char *)(&IBM4ByteValue[0]), BUF_LEN);
bytePosition += 4;
printf("\n%8ld: ", pos);
//IBM Conversion
double ret = 0;
uint32_t mantissa = 0;
uint16_t exponent = 0;
mantissa = (IBM4ByteValue[3] << 16) | (IBM4ByteValue[2] << 8)|IBM4ByteValue[1];
exponent = ((IBM4ByteValue[0] & 127) - 64);
ret = mantissa * exp2(-24 + 4 * exponent);
if (IBM4ByteValue[0] & 128) ret *= -1.;
printf(":%24f", ret);
printf("\n");
system("PAUSE");
}

The & operator performs a bitwise AND between that array element and 127 (binary 01111111). A result bit is 1 only if the bit in the array element and the corresponding bit of 127 are both 1; 1 & 0, 0 & 0 and 0 & 1 all give 0. Since only the top bit of 127 is 0, this clears the sign bit and leaves the low seven bits (the ccccccc) unchanged. You then subtract 64 from that value to get your exponent.
In floating point we always have a bias (in this case, 64) for the exponent. This means that if your exponent is 5, 69 will be stored. So what this code is trying to do is find the original value of the exponent.
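Putting the mask and the bias together, a conversion might look like the sketch below. It assumes the usual IBM single-precision layout (one sign bit, a 7-bit excess-64 exponent of base 16, then a 24-bit fraction stored most significant byte first), since the picture from the question is not available here; the function name ibm32_to_double is made up. Note that the exponent is held in a signed int: with an unsigned type such as uint16_t, subtracting 64 from a small stored exponent wraps around to a very large number, which may be why the question's code produced exponents that were far too big.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: converts one big-endian IBM 4-byte float to double.
// Assumed layout: b[0] = sign bit + 7-bit excess-64 exponent,
//                 b[1..3] = 24-bit fraction, most significant byte first.
double ibm32_to_double(const std::uint8_t b[4]) {
    const std::uint32_t mantissa = (std::uint32_t(b[1]) << 16)
                                 | (std::uint32_t(b[2]) << 8)
                                 |  std::uint32_t(b[3]);
    const int exponent = int(b[0] & 0x7F) - 64;   // signed, so it may be negative
    // value = (mantissa / 2^24) * 16^exponent = mantissa * 2^(4 * exponent - 24)
    const double value = std::ldexp(double(mantissa), 4 * exponent - 24);
    return (b[0] & 0x80) ? -value : value;        // top bit of b[0] is the sign
}
```

For example, the bytes 41 10 00 00 decode to 1.0 under these assumptions.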

Printing integers as a set of 4 bytes arranged in little endian?

I have an array of 256 unsigned integers called frequencies[256] (one integer for each ASCII value). My goal is to read through an input and, for each character, increment the integer in the array that corresponds to it (for example, the character 'A' will cause the frequencies[65] integer to increase by one). When the input is over, I must output each integer as 4 characters in little-endian form.
So far I have made a loop that goes through the input and increases each corresponding integer in the array. But I am very confused about how to output each integer in little-endian form. I understand that each of the four bytes of each integer should be output as a character (for instance, the unsigned integer 1 in little endian is "00000001 00000000 00000000 00000000", which I would want to output as the 4 ASCII characters that correspond to those bytes).
But how do I get at the binary representation of an unsigned integer in my code, and how would I go about chopping it up and rearranging it?
Thanks for the help.

For hardware portability, please use the following solution:
unsigned freqs[256];   // unsigned, so that >> is well defined for every value
for (int i = 0; i < 256; ++i)
    printf("%02x %02x %02x %02x\n", (freqs[i] >> 0 ) & 0xFF
                                  , (freqs[i] >> 8 ) & 0xFF
                                  , (freqs[i] >> 16) & 0xFF
                                  , (freqs[i] >> 24) & 0xFF);

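You can use memcpy, which copies a block of memory:
char tab[4];
memcpy(tab, frequencies+i, sizeof(int));
Now tab[0], tab[1], etc. will be your characters, in the machine's native byte order (which is little endian on x86, for example, but not on every platform).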

A program to swap from big to little endian: Little Endian - Big Endian Problem.
To understand if your system is little or big endian: https://stackoverflow.com/a/1024954/2436175.
Transform your chars/integers in a set of printable bits: https://stackoverflow.com/a/7349767/2436175

It's not really clear what you mean by "little endian" here.
Integers don't have endianness per se; endianness only comes
into play when you cut them up into smaller pieces. So which
smaller pieces do you mean: bytes or characters? If characters,
just convert in the normal way, and reverse the generated
string. If bytes (or any other smaller piece), each individual
byte can be represented as a function of the int: i & 0xFF
calculates the low order byte, (i >> 8) & 0xFF the next
lowest, and so forth. (If the bytes aren't 8 bits, then change
the shift value and the mask correspondingly.)
And with regards to your second paragraph: a single byte of an
int doesn't necessarily correspond to a character, regardless
of the encoding. For the four bytes you show, for example, none
of them corresponds to a character in any of the usual
encodings.
With regards to the last paragraph: to get the binary
representation of an unsigned integer, use the same algorithm
that you would use for any representation:
std::string
asText( unsigned int value, int base, int minDigits = 1 )
{
static std::string digits( "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" );
assert( base >= 2 && base <= digits.size() );
std::string results;
while ( value != 0 || minDigits > 0 ) {
results += digits[ value % base ];
value /= base;
-- minDigits;
}
// results is now little endian. For the normal big-endian
// representation, reverse it.
std::reverse( results.begin(), results.end() );
return results;
}
Called with base equal to 2, this will give you your binary
representation.
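If what is wanted is the four raw bytes rather than text, the shift-and-mask approach described above can be sketched as follows. The names to_le_bytes and put_le32 are made up for illustration, and the sketch assumes 8-bit bytes and a 32-bit unsigned value.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: splits a 32-bit value into bytes, least significant first.
// The result does not depend on the endianness of the machine running the code.
std::array<unsigned char, 4> to_le_bytes(std::uint32_t v) {
    return {{ static_cast<unsigned char>( v        & 0xFF),
              static_cast<unsigned char>((v >> 8)  & 0xFF),
              static_cast<unsigned char>((v >> 16) & 0xFF),
              static_cast<unsigned char>((v >> 24) & 0xFF) }};
}

// Hypothetical helper: writes the four bytes to a stream as raw characters.
void put_le32(std::uint32_t v, std::FILE* out) {
    for (unsigned char b : to_le_bytes(v)) {
        std::fputc(b, out);
    }
}
```

Calling put_le32(frequencies[i], stdout) for each i would then emit the counts as the question describes; on platforms that distinguish text and binary streams, the output stream should be in binary mode.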

how to optimize C++/C code for a large number of integers

I have written the code below. It checks the first bit of every byte. If the first bit of a byte is equal to 0, it concatenates that byte's value with the previous bytes and stores the result in a different variable, var1. Here data points to the bytes of an integer. An integer in my implementation is a uint64_t and can occupy up to 8 bytes.
uint64_t func(char* data)
{
    uint64_t var1 = 0; int i=0;
    while ((data[i] >> 7) == 0)
    {
        var1 = (var1 << 7) | (data[i]);
        i++;
    }
    return var1;
}
I am calling func() repeatedly, trillions of times for trillions of integers, so it runs slowly. Is there a way to optimize this code?
EDIT: Thanks to Joe Z.: it is indeed a form of uleb128 unpacking.

I have only tested this minimally; I am happy to fix glitches with it. With modern processors, you want to bias your code heavily toward easily predicted branches. And, if you can safely read the next 10 bytes of input, there's nothing to be saved by guarding their reads by conditional branches. That leads me to the following code:
// fast uleb128 decode
// assumes you can read all 10 bytes at *data safely.
// assumes standard uleb128 format, with LSB first, and
// ... bit 7 indicating "more data in next byte"
uint64_t unpack( const uint8_t *const data )
{
uint64_t value = ((data[0] & 0x7F ) << 0)
| ((data[1] & 0x7F ) << 7)
| ((data[2] & 0x7F ) << 14)
| ((data[3] & 0x7F ) << 21)
| ((data[4] & 0x7Full) << 28)
| ((data[5] & 0x7Full) << 35)
| ((data[6] & 0x7Full) << 42)
| ((data[7] & 0x7Full) << 49)
| ((data[8] & 0x7Full) << 56)
| ((data[9] & 0x7Full) << 63);
if ((data[0] & 0x80) == 0) value &= 0x000000000000007Full; else
if ((data[1] & 0x80) == 0) value &= 0x0000000000003FFFull; else
if ((data[2] & 0x80) == 0) value &= 0x00000000001FFFFFull; else
if ((data[3] & 0x80) == 0) value &= 0x000000000FFFFFFFull; else
if ((data[4] & 0x80) == 0) value &= 0x00000007FFFFFFFFull; else
if ((data[5] & 0x80) == 0) value &= 0x000003FFFFFFFFFFull; else
if ((data[6] & 0x80) == 0) value &= 0x0001FFFFFFFFFFFFull; else
if ((data[7] & 0x80) == 0) value &= 0x00FFFFFFFFFFFFFFull; else
if ((data[8] & 0x80) == 0) value &= 0x7FFFFFFFFFFFFFFFull;
return value;
}
The basic idea is that small values are common (and so most of the if-statements won't be reached), but assembling the 64-bit value that needs to be masked is something that can be efficiently pipelined. With a good branch predictor, I think the above code should work pretty well. You might also try removing the else keywords (without changing anything else) to see if that makes a difference. Branch predictors are subtle beasts, and the exact character of your data also matters. If nothing else, you should be able to see that the else keywords are optional from a logic standpoint, and are there only to guide the compiler's code generation and provide an avenue for optimizing the hardware's branch predictor behavior.
Ultimately, whether or not this approach is effective depends on the distribution of your dataset. If you try out this function, I would be interested to know how it turns out. This particular function focuses on standard uleb128, where the value gets sent LSB first, and bit 7 == 1 means that the data continues.
There are SIMD approaches, but none of them lend themselves readily to 7-bit data.
Also, if you can mark this inline in a header, then that may also help. It all depends on how many places this gets called from, and whether those places are in a different source file. In general, though, inlining when possible is highly recommended.

Your code is problematic
uint64_t func(const unsigned char* pos)
{
uint64_t var1 = 0; int i=0;
while ((pos[i] >> 7) == 0)
{
var1 = (var1 << 7) | (pos[i]);
i++;
}
return var1;
}
First a minor thing: i should be unsigned.
Second: You don't assert that you don't read beyond the boundary of pos. E.g. if all values of your pos array are 0, then you will reach pos[size] where size is the size of the array, hence you invoke undefined behaviour. You should pass the size of your array to the function and check that i is smaller than this size.
Third: If pos[i] has its most significant bit equal to zero for i=0,..,k with k>10, then previous work gets discarded (as you push the old value out of var1).
The third point actually helps us:
uint64_t func(const unsigned char* pos, size_t size)
{
size_t i(0);
while ( i < size && (pos[i] >> 7) == 0 )
{
++i;
}
// At this point, i is either equal to size or
// i is the index of the first pos value you don't want to use.
// Therefore we want to use the values
// pos[i-10], pos[i-9], ..., pos[i-1]
// if i is less than 10, we obviously need to ignore some of the values
const size_t start = (i >= 10) ? (i - 10) : 0;
uint64_t var1 = 0;
for ( size_t j(start); j < i; ++j )
{
var1 <<= 7;
var1 += pos[j];
}
return var1;
}
In conclusion: we separated the logic and got rid of all discarded entries. The speed-up depends on the actual data you have. If lots of entries are discarded, you save a lot of writes to var1 with this approach.
Another thing: mostly, if one function is called massively, the best optimization you can do is to call it less. Perhaps you can come up with an additional condition that makes the call of this function unnecessary.
Keep in mind that if you actually use 10 values, the first value ends up being truncated.
64 bits means that 9 values are represented with their full 7 bits of information, leaving exactly one bit for the tenth. You might want to switch to a 128-bit type (uint128_t).

A small optimization would be:
while ((pos[i] & 0x80) == 0)
A bitwise AND can be cheaper than a shift on some platforms, although on most modern CPUs both are single-cycle operations. It's also possible that the compiler will do this optimization itself.

Can you change the encoding?
Google came across the same problem, and Jeff Dean describes a really cool solution on slide 55 of his presentation:
http://research.google.com/people/jeff/WSDM09-keynote.pdf
http://videolectures.net/wsdm09_dean_cblirs/
The basic idea is that reading the first bit of several bytes is poorly supported on modern architectures. Instead, let's take 8 of these bits, and pack them as a single byte preceding the data. We then use the prefix byte to index into a 256-item lookup table, which holds masks describing how to extract numbers from the rest of the data.
Note that this is not how protocol buffers themselves are encoded: their wire format uses plain base-128 varints, with a continuation bit in every byte.

Can you change your encoding? As you've discovered, using a bit on each byte to indicate if there's another byte following really sucks for processing efficiency.
A better way to do it is to model UTF-8, which encodes the length of the full int into the first byte:
0xxxxxxx // one byte with 7 bits of data
110xxxxx 10xxxxxx // two bytes with 11 bits of data
1110xxxx 10xxxxxx 10xxxxxx // three bytes with 16 bits of data
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx // four bytes with 21 bits of data
// etc.
But UTF-8 has special properties to make it easier to distinguish from ASCII. This bloats the data and you don't care about ASCII, so you'd modify it to look like this:
0xxxxxxx // one byte with 7 bits of data
10xxxxxx xxxxxxxx // two bytes with 14 bits of data.
110xxxxx xxxxxxxx xxxxxxxx // three bytes with 21 bits of data
1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx // four bytes with 28 bits of data
// etc.
This has the same compression level as your method (up to 64 bits = 9 bytes), but is significantly easier for a CPU to process.
From this you can build a lookup table for the first byte which gives you a mask and length:
// byte_counts[255] contains the number of additional
// bytes if the first byte has a value of 255.
uint8_t const byte_counts[256]; // a global constant.
// byte_masks[255] contains a mask for the useful bits in
// the first byte, if the first byte has a value of 255.
uint8_t const byte_masks[256]; // a global constant.
And then to decode:
// the resulting value.
uint64_t v = 0;
// mask off the data bits in the first byte.
v = *data & byte_masks[*data];
// read in the rest.
switch(byte_counts[*data])
{
case 3: v = v << 8 | *++data;
case 2: v = v << 8 | *++data;
case 1: v = v << 8 | *++data;
case 0: return v;
default:
// If you're on VC++, this'll make it take one less branch.
// Better make sure you've got all the valid inputs covered, though!
__assume(0);
}
No matter the size of the integer, this hits only one branch point: the switch, which will likely be put into a jump table. You can potentially optimize it even further for ILP by not letting each case fall through.
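For comparison, here is a table-free sketch of the same modified encoding, in which the number of leading 1 bits in the first byte gives the number of bytes that follow. It computes the mask and the count on the fly instead of looking them up, so it is easier to check but not the branch-light version described above; the name decode_prefixed and the consumed parameter are made up for illustration, and the caller is assumed to guarantee that enough bytes are readable.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decodes one value in the "length in the first byte" format.
// The count of leading 1 bits in data[0] (0 to 8) is the number of extra bytes.
// If consumed is not null, it receives the total number of bytes used.
std::uint64_t decode_prefixed(const std::uint8_t* data, std::size_t* consumed) {
    const std::uint8_t first = data[0];
    unsigned extra = 0;
    while (extra < 8 && (first & (0x80u >> extra))) {
        ++extra;                                     // count the leading 1 bits
    }
    // Data bits of the first byte: everything below the terminating 0 bit.
    std::uint64_t v = first & (0xFFu >> (extra + 1));
    for (unsigned i = 1; i <= extra; ++i) {
        v = (v << 8) | data[i];                      // following bytes carry 8 data bits each
    }
    if (consumed) {
        *consumed = extra + 1;
    }
    return v;
}
```

Under this layout the value 300 would be stored as the two bytes 0x81 0x2C.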

First, rather than shifting, you can do a bitwise test on the
relevant bit. Second, you can use a pointer, rather than
indexing (but the compiler should do this optimization itself).
Thus:
uint64_t
readUnsignedVarLength( unsigned char const* pos )
{
uint64_t results = 0;
while ( (*pos & 0x80) == 0 ) {
results = (results << 7) | *pos;
++ pos;
}
return results;
}
At least, this corresponds to what your code does. For variable
length encoding of unsigned integers, it is incorrect, since
1) variable length encodings are little endian, and your code is
big endian, and 2) your code doesn't or in the high order byte.
Finally, the Wiki page suggests that you've got the test
inverted. (I know this format mainly from BER encoding and
Google protocol buffers, both of which set bit 7 to indicate
that another byte will follow.)
The routine I use is:
uint64_t
readUnsignedVarLen( unsigned char const* source )
{
    int shift = 0;
    uint64_t results = 0;
    uint8_t tmp = *source ++;
    while ( ( tmp & 0x80 ) != 0 ) {
        // widen before shifting, or the bits above bit 31 are lost
        results |= static_cast< uint64_t >( tmp & 0x7F ) << shift;
        shift += 7;
        tmp = *source ++;
    }
    return results | (static_cast< uint64_t >( tmp ) << shift);
}
For the rest, this wasn't written with performance in mind, but
I doubt that you could do significantly better. An alternative
solution would be to pick up all of the bytes first, then
process them in reverse order:
uint64_t
readUnsignedVarLen( unsigned char const* source )
{
    unsigned char buffer[10];
    unsigned char* p = std::begin( buffer );
    while ( p != std::end( buffer ) && (*source & 0x80) != 0 ) {
        *p = *source & 0x7F;
        ++ p;
        ++ source;      // advance the input as well, or the loop never ends
    }
    assert( p != std::end( buffer ) );
    *p = *source;
    ++ p;
    uint64_t results = 0;
    while ( p != std::begin( buffer ) ) {
        -- p;
        results = (results << 7) + *p;
    }
    return results;
}
The necessity of checking for buffer overrun will likely make
this slightly slower, but on some architectures, shifting by
a constant is significantly faster than shifting by a variable,
so this could be faster on them.
Globally, however, don't expect miracles. The motivation for
using variable length integers is to reduce data size, at
a cost in runtime for decoding and encoding.

How to write individual bytes to file in C++

Given that I generate a string containing "0" and "1" characters of a random length, how can I write the data to a file as bits instead of ASCII text?
Given my random string has 12 bits, I know that I should write 2 bytes (or add 4 more 0 bits to make 16 bits) in order to write the 1st byte and the 2nd byte.
Regardless of the size, given I have an array of char[8] or int[8] or a string, how can I write each individual group of bits as one byte in the output file?
I've googled a lot everywhere (it's my 3rd day looking for an answer) and didn't understand how to do it.
Thank you.

You don't do I/O with an array of bits.
Instead, you do two separate steps. First, convert your array of bits to a number. Then, do binary file I/O using that number.
For the first step, the types uint8_t and uint16_t found in <stdint.h> and the bit manipulation operators << (shift left) and | (or) will be useful.

You haven't said what API you're using, so I'm going to assume you're using I/O streams. To write data to the stream just do this:
f.write(buf, len);
You can't write single bits, the best granularity you are going to get is bytes. If you want bits you will have to do some bitwise work to your byte buffer before you write it.
If you want to pack your 8 element array of chars into one byte you can do something like this:
char data[8] = ...;
char byte = 0;
for (unsigned i = 0; i != 8; ++i)
{
byte |= (data[i] & 1) << i;
}
f.put(byte);
If data contains ASCII '0' or '1' characters rather than actual 0 or 1 bits replace the |= line with this:
byte |= (data[i] == '1') << i;

Make an unsigned char out of the bits in an array:
unsigned char make_byte(char input[8]) {
unsigned char result = 0;
for (int i=0; i<8; i++)
if (input[i] != '0')
result |= (1 << i);
return result;
}
This assumes input[0] should become the least significant bit in the byte, and input[7] the most significant.
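To tie the pieces together for a string of arbitrary length, one possible sketch is shown below. It keeps the same convention as above (the first character of each group of eight becomes the least significant bit) and pads the last byte with zero bits; the names pack_bits and write_bits are made up for illustration, and any character other than '1' is treated as a 0 bit.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: packs a string of '0'/'1' characters into bytes.
// Character i goes into bit (i % 8) of byte (i / 8); missing bits stay 0.
std::vector<unsigned char> pack_bits(const std::string& bits) {
    std::vector<unsigned char> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i] == '1') {
            bytes[i / 8] |= static_cast<unsigned char>(1u << (i % 8));
        }
    }
    return bytes;
}

// Hypothetical helper: writes the packed bytes to a file opened in binary mode.
// Returns false if the file could not be opened or written.
bool write_bits(const std::string& bits, const char* path) {
    const std::vector<unsigned char> bytes = pack_bits(bits);
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(bytes.data()),
              static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

A 12-character string therefore produces 2 bytes, with the top four bits of the second byte left at zero. If the most significant bit should come first instead, the shift would become (7 - i % 8).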
