Reading alternately as char* and wchar_t* - C++

I'm trying to write a program that parses ID3 tags, for educational purposes (so please explain in depth, as I'm trying to learn). So far I've had great success, but I'm stuck on an encoding issue.
When reading the mp3 file, the default encoding for all text is ISO-8859-1. All header info (frame IDs etc) can be read in that encoding.
This is how I've done it:
ifstream mp3File("../myfile.mp3", ios::binary); // binary mode, so bytes aren't translated
mp3File.read(mp3Header, 10); // char mp3Header[10];
// .... Parsing the header
// After reading the main header, we get into the individual frames.
// Read the first 10 bytes from buffer, get size and then read data
char encoding[1];
while (1) {
    char frameHeader[10] = {0};
    mp3File.read(frameHeader, 10);
    ID3Frame frame(frameHeader); // Parses frameHeader
    if (frame.frameId[0] == 'T') { // Text Information Frame
        mp3File.read(encoding, 1); // Get encoding
        if (encoding[0] == 1) {
            // We're dealing with UCS-2 encoded Unicode with BOM
            char data[frame.size];
            mp3File.read(data, frame.size);
        }
    }
}
This is bad code, because data is a char*; its contents look like this (non-printable characters converted to ints):
data = [0xFF, 0xFE, 'C', 0, 'r', 0, 'a', 0, 'z', 0, 'y', 0]
Two questions:
What are the first two bytes? - Answered.
How can I read wchar_t from my already open file? And then get back to reading the rest of it?
Edit Clarification: I'm not sure if this is the correct way to do it, but essentially what I wanted to do was: read the first 11 bytes into a char array (header + encoding), then the next 12 bytes into a wchar_t array (the name of the song), and then the next 10 bytes into a char array (the next header). Is that possible?

I figured out a decent solution: create a new wchar_t buffer and add the characters from the char array in pairs.
#define BIGENDIAN 0
#define LITTLEENDIAN 1

wchar_t* charToWChar(char* cArray, int len) {
    char wideChar[2];
    wchar_t wideCharW;
    // len bytes hold len/2 UTF-16 code units; the BOM's slot is reused for the terminator
    wchar_t *wArray = (wchar_t *) malloc(sizeof(wchar_t) * (len / 2));
    int counter = 0;
    int endian = BIGENDIAN;
    // Check the BOM to determine endianness
    if ((uint8_t) cArray[0] == 0xFF && (uint8_t) cArray[1] == 0xFE)
        endian = LITTLEENDIAN;
    else if ((uint8_t) cArray[0] == 0xFE && (uint8_t) cArray[1] == 0xFF)
        endian = BIGENDIAN;
    for (int j = 2; j < len; j += 2) {
        switch (endian) {
            case LITTLEENDIAN: wideChar[0] = cArray[j]; wideChar[1] = cArray[j + 1]; break;
            default:
            case BIGENDIAN:    wideChar[1] = cArray[j]; wideChar[0] = cArray[j + 1]; break;
        }
        wideCharW = (uint16_t)((uint8_t)wideChar[1] << 8 | (uint8_t)wideChar[0]);
        wArray[counter] = wideCharW;
        counter++;
    }
    wArray[counter] = L'\0';
    return wArray;
}
Usage:
if (encoding[0] == 1) {
    // We're dealing with UCS-2 encoded Unicode with BOM
    char data[frame.size];
    mp3File.read(data, frame.size);
    wcout << charToWChar(data, frame.size) << endl;
}

Related

Encoding Vietnamese characters from ISO88591, UTF8, UTF16BE, UTF16LE, UTF16 to Hex and vice versa using C++

I have edited my post. Currently what I'm trying to do is to encode an input string from the user and then convert it to hex formats. I can do it properly if the input does not contain any Vietnamese characters, e.g. if my inputString is "Hello"; but when I try to input a string such as "Tôi", I don't know how to do it.
enum Encodings { USASCII, ISO88591, UTF8, UTF16BE, UTF16LE, UTF16, BIN, OCT, HEX };

switch (encoding) // 'encoding' holds one of the Encodings values
{
case USASCII:
    ASCIIToHex(inputString, &ascii); // hello output 48656C6C6F
    return new ByteField(ascii.c_str());
case ISO88591:
    ASCIIToHex(inputString, &ascii); // hello output 48656C6C6F
                                     // tôi output 54F469
    return new ByteField(ascii.c_str());
case UTF8:
    ASCIIToHex(inputString, &ascii); // hello output 48656C6C6F
                                     // tôi output 54C3B469
    return new ByteField(ascii.c_str());
case UTF16BE:
    ToUTF16(inputString, &ascii, encoding); // hello output 00480065006C006C006F
                                            // tôi output 005400F40069
    return new ByteField(ascii.c_str());
case UTF16:
    ToUTF16(inputString, &ascii, encoding); // hello output FEFF00480065006C006C006F
                                            // tôi output FEFF005400F40069
    return new ByteField(ascii.c_str());
case UTF16LE:
    ToUTF16(inputString, &ascii, encoding); // hello output 480065006C006C006F00
                                            // tôi output 5400F4006900
    return new ByteField(ascii.c_str());
}
void StringUtilLib::ASCIIToHex(std::string s, std::string *result)
{
    int n = s.length();
    for (int i = 0; i < n; i++)
    {
        unsigned char c = s[i];
        long val = long(c);
        std::string bin = "";
        while (val > 0)
        {
            (val % 2) ? bin.push_back('1') : bin.push_back('0');
            val /= 2;
        }
        reverse(bin.begin(), bin.end());
        result->append(ConvertBinToHex(bin));
    }
}
std::string ToUTF16(std::string s, std::string *result, int encodings) {
    int n = s.length();
    if (encodings == UTF16) {
        result->append("FEFF");
    }
    for (int i = 0; i < n; i++)
    {
        int val = int(s[i]);
        std::string bin = "";
        while (val > 0)
        {
            (val % 2) ? bin.push_back('1') : bin.push_back('0');
            val /= 2;
        }
        reverse(bin.begin(), bin.end());
        if (encodings == UTF16 || encodings == UTF16BE) {
            result->append("00" + ConvertBinToHex(bin));
        }
        if (encodings == UTF16LE) {
            result->append(ConvertBinToHex(bin) + "00");
        }
    }
}
std::string ConvertBinToHex(std::string str) {
    long long temp = atoll(str.c_str());
    int dec_value = 0;
    int base = 1;
    int i = 0;
    while (temp) {
        int last_digit = temp % 10;
        temp = temp / 10;
        dec_value += last_digit * base;
        base = base * 2;
    }
    char hexaDeciNum[10];
    while (dec_value != 0)
    {
        int digit = dec_value % 16;
        if (digit < 10)
            hexaDeciNum[i] = digit + 48;
        else
            hexaDeciNum[i] = digit + 55;
        i++;
        dec_value = dec_value / 16;
    }
    str.clear();
    for (int j = i - 1; j >= 0; j--) {
        str = str + hexaDeciNum[j];
    }
    return str;
}
The question is completely unclear. To encode something you need an input, right? So when you say "encoding Vietnamese characters to UTF-8, UTF-16", what is your input string and what is its encoding before converting to UTF-8/16? How do you input it - from a file or the console?
And why on earth are you converting to binary and then to hex? You can print bytes directly as binary or hex; there's no need to go through a binary string. Note that converting to binary like that is fine for testing but vastly inefficient in production code. I also don't know what you mean by "But what if my letter is "Á" or "À" which is a Vietnamese letter I cannot get the value of it". Please show a minimal, reproducible example along with the input/output.
But I think you just want to output the UTF encoded bytes from a string literal in the source code like "ÁÀ". In that case it isn't called "encoding a string" but just "outputting a string"
Both Á and À in Unicode can be represented by precomposed characters (U+00C1 and U+00C0) or combining characters (A + U+0301 ◌́/U+0300 ◌̀). You can switch between them by selecting "Unicode dựng sẵn" or "Unicode tổ hợp" in Unikey. Suppose you have those characters in string literal form then std::string str = "ÁÀ" contains a series of bytes that corresponds to the above letters in the source file encoding. So depending on which encoding you save the *.cpp file as (CP1252, CP1258, UTF-8...), the output byte values will be different
To force UTF-8/16/32 encoding you just need to use the u8, u and U suffix respectively, along with the correct type (char8_t, char16_t, char32_t or std::u8string/std::u16string/std::u32string)
std::u8string utf8 = u8"ÁÀ";
std::u16string utf16 = u"ÁÀ";
std::u32string utf32 = U"ÁÀ";
Then just use c_str() to get the underlying buffers and print the bytes. In C++14 std::u8string is not available yet so just save the file as UTF-8 and use std::string. Similarly you can read std::u*string directly from std::cin to print the encoding of a user-input string
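For instance, here is a minimal sketch of dumping the UTF-16 code units of a literal in hex (the helper name `utf16_hex` is mine; the characters are written as `\u` escapes so the result doesn't depend on the source-file encoding):

```cpp
#include <cstdio>
#include <string>

// Return the UTF-16 code units of a u"" literal as a hex string,
// e.g. "00C1 00C0" for precomposed "ÁÀ" (U+00C1 U+00C0).
std::string utf16_hex(const std::u16string &s)
{
    std::string out;
    char buf[8];
    for (char16_t cu : s) {
        std::snprintf(buf, sizeof buf, "%04X ", static_cast<unsigned>(cu));
        out += buf;
    }
    if (!out.empty()) out.pop_back(); // drop the trailing space
    return out;
}
```

Calling `utf16_hex(u"\u00C1\u00C0")` yields `"00C1 00C0"`, which is exactly the UTF-16BE byte sequence the asker is trying to produce.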
Edit:
To convert between UTF encodings use the standard std::codecvt, std::wstring_convert, std::codecvt_utf8_utf16...
Working on non-Unicode encodings is trickier and needs some external library like ICU or OS-dependent APIs
WideCharToMultiByte and MultiByteToWideChar on Windows
iconv on Linux
Limiting to ISO-8859-1 makes it easier but you still need many lookup tables, and there's no way to convert other encodings to ASCII without loss of information
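A minimal sketch of the UTF-8 ↔ UTF-16 direction using the standard facets (note: `std::wstring_convert`/`std::codecvt_utf8_utf16` work in C++11/14 but are deprecated as of C++17 with no in-library replacement; the function names here are mine):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Decode a UTF-8 byte string into UTF-16 code units.
std::u16string utf8_to_utf16(const std::string &utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// Encode UTF-16 code units back into a UTF-8 byte string.
std::string utf16_to_utf8(const std::u16string &utf16)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

For example, the UTF-8 bytes `"T\xC3\xB4i"` ("Tôi") decode to the code units `u"T\u00F4i"`, matching the 005400F40069 output the asker expects for UTF-16BE.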
-64 is the correct representation of À if you are using signed char and CP1258. If you want a positive number you need to cast to unsigned char first.
If you are indeed using CP1258, you are probably on Windows. To convert your input string to UTF-16, you probably want to use a Windows platform API such as MultiByteToWideChar which accepts a code page parameter (of course you have to use the correct code page). Alternatively you may try a standard function like mbstowcs but you need to set up your locale correctly before using it.
You might find it easier to switch to wide characters throughout your application, and avoid most transcoding.
As a side note, converting an integer to binary only to convert that to hexadecimal is not an easy or efficient way to display a hexadecimal representation of an integer.

Converting binary data in string to char

So I have some data I convert from packet to string, in binary (datagram):
std::string Packet::packetToString()
{
    // packing to one bitset
    std::bitset<208> pak(std::string(std::bitset<2>(type).to_string() +
                                     std::bitset<64>(num1).to_string() +
                                     std::bitset<64>(num2).to_string() +
                                     std::bitset<64>(num3).to_string() +
                                     std::bitset<4>(state).to_string() +
                                     std::bitset<4>(id).to_string() + "000000"));
    std::string temp;
    std::bitset<8> tempBitset(0);
    for (int i = pak.size() - 1; i >= 0; i--)
    {
        tempBitset[i % 8] = pak[i];
        if (i % 8 == 0)
        {
            char t = static_cast<char>(tempBitset.to_ulong());
            temp.push_back(t);
        }
    }
    return temp;
}
Then I want to convert this string to char array (in this case char buffer[26];) and send it with SendTo("127.0.0.1", 1111, buffer, 26);
What's the problem:
Packet pak1(... data I input ... );
string packet;
packet = pak1.packetToString();
char buffer[26];
strcpy_s(buffer, packet.c_str());
Data sent with this array seems to be cut off wherever a 0x00 (NUL) byte appears. This is caused by c_str(), I guess. How can I deal with this? :)
strcpy() and strcpy_s() copy null-terminated C strings. So indeed, if there's any 0x00 char in the c_str(), the copy will stop there.
Use std::copy() to copy the full data, regardless of any 0x00 bytes you might encounter:
copy (packet.begin(), packet.end(), buffer); // assumes packet.size() <= 26
or with copy_n():
copy_n (packet.begin(), 26, buffer); // assumes packet is at least 26 bytes
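A small self-contained sketch (function names mine) showing the difference: strcpy stops at the first embedded 0x00, while std::copy moves every byte of the payload:

```cpp
#include <algorithm>
#include <cstring>
#include <string>

// Copy via strcpy: stops at the first NUL byte inside the packet.
// Returns how many bytes actually arrived in the buffer.
std::size_t copy_with_strcpy(const std::string &packet, char *buffer)
{
    std::strcpy(buffer, packet.c_str()); // truncates at embedded 0x00
    return std::strlen(buffer);
}

// Copy via std::copy: moves all bytes, NULs included.
std::size_t copy_with_stdcopy(const std::string &packet, char *buffer)
{
    std::copy(packet.begin(), packet.end(), buffer); // assumes buffer is large enough
    return packet.size();
}
```

With a 5-byte packet "AB\0CD", the strcpy version delivers only 2 bytes, the std::copy version all 5.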

Converting a byte into bit and writing the binary data to file

Suppose I have a character array, char a[8], containing "10101010". If I store this data in a .txt file, the file has a size of 8 bytes. How can I convert this data to binary format and save it as 8 bits (not 8 bytes), so that the file size is only 1 byte?
Also, once I convert these 8 bytes to a single byte, which file format should I save the output in: .txt, .dat or .bin?
I am working on Huffman encoding of text files. I have already converted the text into its binary form, i.e. 0's and 1's, but when I store this output in a file, each digit (1 or 0) takes a byte instead of a bit. I want a solution where each digit takes only a bit.
char buf[100];
char *code[128]; // per-symbol code strings (assumed declared alongside buf)

void build_code(node n, char *s, int len)
{
    static char *out = buf;
    if (n->c) {
        s[len] = 0;
        strcpy(out, s);
        code[n->c] = out;
        out += len + 1;
        return;
    }
    s[len] = '0'; build_code(n->left, s, len + 1);
    s[len] = '1'; build_code(n->right, s, len + 1);
}
This is how I build up my code tree with help of a Huffman tree. And
void encode(const char *s, char *out)
{
    while (*s)
    {
        strcpy(out, code[*s]);
        out += strlen(code[*s++]);
    }
}
This is how I Encode to get the final output.
Not entirely sure how you end up with a string representing the binary representation of a value,
but you can get an integer value from a string (in any base) using standard functions like std::strtoul.
That function provides an unsigned long value, since you know your value is within 0-255 range you can store it in an unsigned char:
unsigned char v=(unsigned char)(std::strtoul(binary_string_value.c_str(),0,2) & 0xff);
To write it to disk, you can use std::ofstream.
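For example, a minimal sketch of both steps (function names and the file path are mine):

```cpp
#include <cstdlib>
#include <fstream>
#include <string>

// Parse a string of '0'/'1' digits as a base-2 number and keep the low byte.
unsigned char bin_str_to_byte(const std::string &binary_string_value)
{
    return (unsigned char)(std::strtoul(binary_string_value.c_str(), 0, 2) & 0xff);
}

// Write that single byte to disk; the resulting file is exactly 1 byte long.
void write_byte(const char *path, unsigned char v)
{
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char *>(&v), 1);
}
```

`bin_str_to_byte("10101010")` gives 0xAA, and `write_byte("data.bin", 0xAA)` produces a 1-byte file.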
Which File format should I save the output in? .txt or .dat or .bin?
Keep in mind that the extension (the .txt, .dat or .bin) does not really mandate the format (i.e. the structure of the contents). The extension is a convention commonly used to indicate that you're using some well-known format (and in some OS/environments, it drives the configuration of which program can best handle that file). Since this is your file, it is up to you define the actual format... and to name the file with any extension (or even no extension) you like best (or in other words, any extension that best represent your contents) as long as it is meaningful to you and to those that are going to consume your files.
Edit: additional details
Assuming we have a buffer of some length where you're storing your string of '0' and '1'
int codeSize;        // size of the code buffer
char *code;          // code array/pointer
std::ofstream file;  // File stream we're writing to.
unsigned char *byteArray = new unsigned char[codeSize / 8 + ((codeSize % 8 != 0) ? 1 : 0)];
int bytes = 0;
int i;
for (i = 8; i <= codeSize; i += 8) {
    std::string binstring(code + (i - 8), 8); // create a temp string from the slice of the code
    byteArray[bytes++] = (unsigned char)(std::strtoul(binstring.c_str(), 0, 2) & 0xff);
}
if (i > codeSize) {
    // At this point, if the number of bits is not a multiple of 8,
    // there are some bits that have not been written.
    // Not sure how you would like to handle it.
    // One option is to pad with 0 bits up to the next multiple of 8...
    // but it all depends on what you're representing.
}
file.write(reinterpret_cast<const char *>(byteArray), bytes);
Function converting input 8 chars representing bit representation into one byte.
char BitsToByte( const char in[8] )
{
    char ret = 0;
    for( int i = 0, pow = 128; i < 8; ++i, pow /= 2 )
        if( in[i] == '1' ) ret += pow;
    return ret;
}
We iterate over the array passed to the function (of size 8 for obvious reasons) and, based on its content, increase our return value (the first element of the array represents the most significant bit). pow is set to 128 because 2^(n-1) is the value of the n-th bit.
You can shift them into a byte pretty easily, like this:
byte x = (s[3] - '0') + ((s[2] - '0') << 1) + ((s[1] - '0') << 2) + ((s[0] - '0') << 3);
In my example, I only shifted a nibble, or 4-bits. You can expand the example to shift an entire byte. This solution will be faster than a loop.
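Expanded to a full byte, the same shifting idea fits in a short loop (a sketch; the function name is mine):

```cpp
// Shift eight '0'/'1' characters into one byte,
// s[0] being the most significant bit.
unsigned char bits_to_byte(const char s[8])
{
    unsigned char x = 0;
    for (int i = 0; i < 8; ++i)
        x = (x << 1) | (s[i] - '0'); // make room, then append the next bit
    return x;
}
```

For example, `bits_to_byte("10101010")` returns 0xAA.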
One way:
/** Converts 8 chars ('0'/'1') to one byte **/
unsigned char BinStrToNum(const char a[8])
{
    return (('1' == a[0]) ? 128 : 0)
         + (('1' == a[1]) ? 64 : 0)
         + (('1' == a[2]) ? 32 : 0)
         + (('1' == a[3]) ? 16 : 0)
         + (('1' == a[4]) ? 8 : 0)
         + (('1' == a[5]) ? 4 : 0)
         + (('1' == a[6]) ? 2 : 0)
         + (('1' == a[7]) ? 1 : 0);
}
Save it in any of the formats you mentioned; or invent your own!
int main()
{
    int rCode = 0;
    const char *a = "10101010";
    unsigned char byte;
    FILE *fp = NULL;
    fp = fopen("data.xyz", "wb");
    if (NULL == fp)
    {
        rCode = errno;
        fprintf(stderr, "fopen() failed. errno:%d\n", errno);
        goto CLEANUP;
    }
    byte = BinStrToNum(a);
    fwrite(&byte, 1, 1, fp);
CLEANUP:
    if (fp)
        fclose(fp);
    return rCode;
}

utf8 aware strncpy

I find it hard to believe I'm the first person to run into this problem but searched for quite some time and didn't find a solution to this.
I'd like to use strncpy but have it be UTF8 aware so it doesn't partially write a utf8 character into the destination string.
Otherwise you can never be sure that the resulting string is valid UTF8, even if you know the source is (when the source string is larger than the max length).
Validating the resulting string can work but if this is to be called a lot it would be better to have a strncpy function that checks for it.
glib has g_utf8_strncpy, but this copies a certain number of unicode chars, whereas I'm looking for a copy function that limits by the byte length.
To be clear, by "utf8 aware", I mean that it should not exceed the limit of the destination buffer and it must never copy only part of a utf-8 character. (Given valid utf-8 input must never result in having invalid utf-8 output).
Note:
Some replies have pointed out that strncpy nulls all bytes and that it won't ensure zero termination. In retrospect I should have asked for a utf8-aware strlcpy, however at the time I didn't know of the existence of this function.
I've tested this on many sample UTF8 strings with multi-byte characters. If the source is too long, it does a reverse search of it (starts at the null terminator) and works backward to find the last full UTF8 character which can fit into the destination buffer. It always ensures the destination is null terminated.
char* utf8cpy(char* dst, const char* src, size_t sizeDest)
{
    if (sizeDest) {
        size_t sizeSrc = strlen(src); // number of bytes not including null
        while (sizeSrc >= sizeDest) {
            const char* lastByte = src + sizeSrc; // Initially, pointing to the null terminator.
            while (lastByte-- > src)
                if ((*lastByte & 0xC0) != 0x80) // Found the initial byte of the (potentially) multi-byte character (or found null).
                    break;
            sizeSrc = lastByte - src;
        }
        memcpy(dst, src, sizeSrc);
        dst[sizeSrc] = '\0';
    }
    return dst;
}
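The back-off idea can also be tested in isolation. A minimal self-contained variant (the name `utf8_fit` is mine) returns how many bytes of a valid UTF-8 string fit into a limit without ending inside a multi-byte character, using the same `(c & 0xC0) == 0x80` continuation-byte test:

```cpp
#include <cstring>

// Return the largest n <= limit such that the first n bytes of the
// (assumed valid) UTF-8 string src never end inside a multi-byte
// character. Continuation bytes look like 10xxxxxx.
size_t utf8_fit(const char *src, size_t limit)
{
    size_t n = strlen(src);
    if (n <= limit)
        return n;       // whole string fits
    n = limit;
    // Back off while the cut point lands on a continuation byte.
    while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
        --n;
    return n;
}
```

For "h\xC3\xA9llo" ("héllo"), a limit of 2 backs off to 1 byte (just "h") rather than splitting the 2-byte "é".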
I'm not sure what you mean by UTF-8 aware; strncpy copies bytes, not
characters, and the size of the buffer is given in bytes as well. If
what you mean is that it will only copy complete UTF-8 characters,
stopping, for example, if there isn't room for the next character, I'm
not aware of such a function, but it shouldn't be too hard to write:
int
utf8Size( char ch )
{
    static int const sizeTable[] =
    {
        // ...
    };
    return sizeTable[ static_cast<unsigned char>( ch ) ];
}

char*
stru8ncpy( char* dest, char* source, int n )
{
    while ( *source != '\0' && utf8Size( *source ) < n ) {
        n -= utf8Size( *source );
        switch ( utf8Size( *source ) ) {
        case 6:
            *dest ++ = *source ++;
        case 5:
            *dest ++ = *source ++;
        case 4:
            *dest ++ = *source ++;
        case 3:
            *dest ++ = *source ++;
        case 2:
            *dest ++ = *source ++;
        case 1:
            *dest ++ = *source ++;
            break;
        default:
            throw IllegalUTF8();
        }
    }
    *dest = '\0';
    return dest;
}
(The contents of the table in utf8Size are a bit painful to generate,
but this is a function you'll be using a lot if you're dealing with
UTF-8, and you only have to do it once.)
To reply to my own question, here's the C function I ended up with (not using C++ for this project):
Notes:
- Realize this is not a clone of strncpy for utf8; it's more like strlcpy from OpenBSD.
- utf8_skip_data is copied from glib's gutf8.c.
- It doesn't validate the utf8 - which is what I intended.
Hope this is useful to others; I'm interested in feedback, but please no pedantic zealotry about NULL termination behavior unless it's an actual bug or misleading/incorrect behavior.
Thanks to James Kanze, whose answer provided the basis for this, but was incomplete and in C++ (I need a C version).
static const size_t utf8_skip_data[256] = {
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,6,6,1,1
};
char *strlcpy_utf8(char *dst, const char *src, size_t maxncpy)
{
    char *dst_r = dst;
    size_t utf8_size;
    if (maxncpy > 0) {
        while (*src != '\0' &&
               (utf8_size = utf8_skip_data[*((unsigned char *)src)]) < maxncpy)
        {
            maxncpy -= utf8_size;
            switch (utf8_size) {
                case 6: *dst++ = *src++; /* fall through */
                case 5: *dst++ = *src++; /* fall through */
                case 4: *dst++ = *src++; /* fall through */
                case 3: *dst++ = *src++; /* fall through */
                case 2: *dst++ = *src++; /* fall through */
                case 1: *dst++ = *src++;
            }
        }
        *dst = '\0';
    }
    return dst_r;
}
strncpy() is a terrible function:
If there is insufficient space, the resulting string will not be nul terminated.
If there is enough space, the remainder is filled with NULs. This can be painful if the target string is very big.
Even if the characters stay in the ASCII range (0x7f and below), the resulting string will not be what you want. In the UTF-8 case it might not be nul-terminated and might end in an invalid UTF-8 sequence.
Best advice is to avoid strncpy().
EDIT:
ad 1):
#include <stdio.h>
#include <string.h>

int main (void)
{
    char buff[4];
    strncpy(buff, "hello world!\n", sizeof buff);
    printf("%s\n", buff);
    return 0;
}
Agreed, the buffer will not be overrun. But the result is still unwanted. strncpy() solves only part of the problem. It is misleading and unwanted.
UPDATE (2012-10-31): Since this is a nasty problem, I decided to hack my own version, mimicking the ugly strncpy() behavior. The return value is the number of bytes copied, though.
#include <stdio.h>
#include <string.h>
size_t utf8ncpy(char *dst, char *src, size_t todo);
static int cnt_utf8(unsigned ch, size_t len);
static int cnt_utf8(unsigned ch, size_t len)
{
    if (!len) return 0;
    if ((ch & 0x80) == 0x00) return 1;
    else if ((ch & 0xe0) == 0xc0) return 2;
    else if ((ch & 0xf0) == 0xe0) return 3;
    else if ((ch & 0xf8) == 0xf0) return 4;
    else if ((ch & 0xfc) == 0xf8) return 5;
    else if ((ch & 0xfe) == 0xfc) return 6;
    else return -1; /* Default (Not in the spec) */
}
size_t utf8ncpy(char *dst, char *src, size_t todo)
{
    size_t done, idx, chunk, srclen;
    srclen = strlen(src);
    for (done = idx = 0; idx < srclen; idx += chunk) {
        int ret;
        for (chunk = 0; done + chunk < todo; chunk++) {
            ret = cnt_utf8(src[idx + chunk], srclen - (idx + chunk));
            if (ret == 1) continue;  /* Normal character: collect it into chunk */
            if (ret < 0) continue;   /* Bad stuff: treat as normal char */
            if (ret == 0) break;     /* EOF */
            if (!chunk) chunk = ret; /* a UTF8 multibyte character */
            else ret = 1;            /* we already collected a number (chunk) of normal characters */
            break;
        }
        if (ret > 1 && done + chunk > todo) break;
        if (done + chunk > todo) chunk = todo - done;
        if (!chunk) break;
        memcpy(dst + done, src + idx, chunk);
        done += chunk;
        if (ret < 1) break;
    }
    /* This is part of the dreaded strncpy() behavior:
    ** pad the destination string with NULs
    ** up to its intended size
    */
    if (done < todo) memset(dst + done, 0, todo - done);
    return done;
}
int main(void)
{
    char *string = "Hell\xc3\xb6 \xf1\x82\x82\x82, world\xc2\xa1!";
    char buffer[30];
    unsigned result, len;
    for (len = sizeof buffer - 1; len < sizeof buffer; len -= 3) {
        result = utf8ncpy(buffer, string, len);
        /* remove the following line to get the REAL strncpy() behaviour */
        buffer[result] = 0;
        printf("Chop #%u\n", len);
        printf("Org:[%s]\n", string);
        printf("Res:%u\n", result);
        printf("New:[%s]\n", buffer);
    }
    return 0;
}
Here is a C++ solution:
u8string.h:
#ifndef U8STRING_H
#define U8STRING_H 1
#include <stddef.h>
#ifdef __cplusplus
extern "C" {
#endif
/**
* Copies the first few characters of the UTF-8-encoded string pointed to by
* \p src into \p dest_buf, as many UTF-8-encoded characters as can be written in
* <code>dest_buf_len - 1</code> bytes or until the NUL terminator of the string
* pointed to by \p str is reached.
*
* The string of bytes that are written into \p dest_buf is NUL terminated
* if \p dest_buf_len is greater than 0.
*
* \returns \p dest_buf
*/
char * u8slbcpy(char *dest_buf, const char *src, size_t dest_buf_len);
#ifdef __cplusplus
}
#endif
#endif
u8slbcpy.cpp:
#include "u8string.h"
#include <cstring>
#include <utf8.h>
char * u8slbcpy(char *dest_buf, const char *src, size_t dest_buf_len)
{
    if (dest_buf_len <= 0) {
        return dest_buf;
    } else if (dest_buf_len == 1) {
        dest_buf[0] = '\0';
        return dest_buf;
    }
    size_t num_bytes_remaining = dest_buf_len - 1;
    utf8::unchecked::iterator<const char *> it(src);
    const char *prev_base = src;
    while (*it++ != '\0') {
        const char *base = it.base();
        ptrdiff_t diff = (base - prev_base);
        if (num_bytes_remaining < (size_t)diff) {
            break;
        }
        num_bytes_remaining -= diff;
        prev_base = base;
    }
    size_t n = dest_buf_len - 1 - num_bytes_remaining;
    std::memmove(dest_buf, src, n);
    dest_buf[n] = '\0';
    return dest_buf;
}
The function u8slbcpy() has a C interface, but it is implemented in C++. My implementation uses the header-only UTF8-CPP library.
I think that this is pretty much what you are looking for, but note that there is still the problem that one or more combining characters might not be copied if the combining characters apply to the nth character (itself not a combining character) and the destination buffer is just large enough to store the UTF-8 encoding of characters 1 through n, but not the combining characters of character n. In this case, the bytes representing characters 1 through n are written, but none of the combining characters of n are. In effect, you could say that the nth character is partially written.
To comment on the answer above, "strncpy() is a terrible function:".
I hate to even comment on such blanket statements at the expense of creating yet another internet programming jihad, but I will anyhow, since statements like this are misleading to those who might come here looking for answers.
Okay, maybe C string functions are "old school". Maybe all strings in C/C++ should be in some kind of smart container, maybe one should use C++ instead of C (when you have a choice); these are more matters of preference and arguments for other topics.
I came here looking for a UTF-8 strncpy() myself. Not that I couldn't make one (the encoding is IMHO simple and elegant), but I wanted to see how others made theirs, and perhaps find an ASM-optimized one.
To the "god's gift to programming" people: put your hubris aside for a moment and look at some facts.
There is nothing wrong with strncpy(), or any of the similar functions with the same side effects and issues, like _snprintf(), etc.
I say: "strncpy() is not terrible", but rather "terrible programmers use it terribly".
What is "terrible" is not knowing the rules.
Furthermore, because of the security (buffer overrun) and program-stability implications of the whole subject, there wouldn't be a need, for example, for Microsoft to add "safe string functions" to its CRT lib if the rules were just followed.
The main ones:
"sizeof()" returns the size of a static string including the terminator.
"strlen()" returns the length of a string without the terminator.
Most if not all "n" functions just clamp to 'n' without adding a terminator.
There is implicit ambiguity about what "buffer size" means in functions that take an input buffer size, i.e. the "(char *pszBuffer, int iBufferSize)" types.
It is safer to assume the worst: pass a size one less than the actual buffer size, and add a terminator at the end to be sure.
For string inputs, buffers, etc., set and use a reasonable size limit based on the expected average and maximum, to hopefully avoid input truncation and to eliminate buffer overruns, period.
This is how I personally handle such things, and other rules that are just to be known and practiced.
A handy macro for static string size:
// Size of a string with out terminator
#define SIZESTR(x) (sizeof(x) - 1)
When declaring local/stack string buffers:
A) The size for example limited to 1023+1 for terminator to allow for strings up to 1023 chars in length.
B) I'm initializing the string to zero length, plus terminating at the very end to cover a possible 'n' truncation.
char szBuffer[1024]; szBuffer[0] = szBuffer[SIZESTR(szBuffer)] = 0;
Alternately one could do just:
char szBuffer[1024] = {0};
of course, but then there is a performance implication from the compiler-generated memset()-like call to zero the whole buffer. It makes things cleaner for debugging though, and I prefer this style for static (vs. local/stack) string buffers.
Now a "strncpy()" following the rules:
char szBuffer[1024]; szBuffer[0] = szBuffer[SIZESTR(szBuffer)] = 0;
strncpy(szBuffer, pszSomeInput, SIZESTR(szBuffer));
There are other "rules" and issues of course, but these are the main ones that come to mind.
You just got to know how the lib functions work and to use safe practices like this.
Finally in my project I use ICU anyhow so I decided to go with it and use the macros in "utf8.h" to make my own "strncpy()".

Dealing with hex values in C/C++

I receive values using winsock from another computer on the network. It is a TCP socket, with the first 4 bytes of the message carrying its size. The rest of the message is formatted by the server using protobuf (protocol buffers from Google).
The problem, I think, is that the values sent by the server seem to be hex values sent as char (i.e. only 10 received for 0x10). To receive the values, I do this:
bytesreceived = recv(sock, buffer, msg_size, 0);
for (int i = 0; i < bytesreceived; i++)
{
    data_s << hex << buffer[i];
}
where data_s is a stringstream. Them I can use the ParseFromIstream(&data_s) method from protobuf and recover the information I want.
The problem that I have is that this is VERY VERY slow (I have another implementation using QSock that I can't use for my project but which is much faster, so there is no problem on the server side).
I tried many things that I took from here and everywhere on the internet (using Arrays of bytes, strings), but nothing works.
Do I have any other options ?
Thank you for your time and comments ;)
not sure if this will be of any use, but I've used a similar protocol before (first 4 bytes holds an int with the length, rest is encoded using protobuf) and to decode it I did something like this (probably not the most efficient solution due to appending to strings):
// Once I've got the first 4 bytes, cast it to an int:
int msgLen = ntohl(*reinterpret_cast<const int*>(buffer));

// Check I've got enough bytes for the message; if I have, then
// just parse the buffer directly
MyProtobufObj obj;
if( bytesreceived >= msgLen+4 )
{
    obj.ParseFromArray(buffer+4, msgLen);
}
else
{
    // just keep appending buffer to an STL string until I have
    // msgLen+4 bytes and then do
    // obj.ParseFromString(myStlString)
}
I wouldn't use the stream operators. They're for formatted data and that's not what you want.
You can keep the values received in a std::vector with the char type (vector of bytes). That would essentially just be a dynamic array. If you want to continue using a string stream, you can use the stringstream::write function which takes a buffer and a length. You should have the buffer and number of bytes received from your call to recv.
If you want to use the vector method, you can use std::copy to make it easier.
#include <algorithm>
#include <iterator>
#include <vector>
char buf[256];
std::vector<char> bytes;
size_t n = recv(sock, buf, 256, 0);
std::copy(buf, buf + n, std::back_inserter(bytes));
Your question is kind of ambiguous. Let's follow your example. You receive 10 as characters and you want to retrieve this as a hex number.
Assuming recv will give you this character string, you can do this.
First of all make it null terminated:
bytesreceived[msg_size] = '\0';
then you can very easily read the value from this buffer using standard *scanf function for strings:
int hexValue;
sscanf(bytesreceived, "%x", &hexValue);
There you go!
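A minimal sketch of that parse (the wrapper name is mine; note %x wants an unsigned int*):

```cpp
#include <cstdio>

// Parse a hex-digit string such as "10" into its numeric value
// (so "10" -> 0x10 == 16).
unsigned int parse_hex(const char *s)
{
    unsigned int hexValue = 0;
    std::sscanf(s, "%x", &hexValue);
    return hexValue;
}
```

`parse_hex("10")` returns 16, which is the 0x10 value the asker expects to recover from the two received characters.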
Edit: If you receive the number in reverse order (so 01 for 10), probably your best shot is to convert it manually:
int hexValue = 0;
int positionValue = 1;
for (int i = 0; i < msg_size; ++i)
{
    int digit = 0;
    if (bytesreceived[i] >= '0' && bytesreceived[i] <= '9')
        digit = bytesreceived[i] - '0';
    else if (bytesreceived[i] >= 'a' && bytesreceived[i] <= 'f')
        digit = bytesreceived[i] - 'a' + 10;
    else if (bytesreceived[i] >= 'A' && bytesreceived[i] <= 'F')
        digit = bytesreceived[i] - 'A' + 10;
    else // Some kind of error!
        return error;
    hexValue += digit * positionValue;
    positionValue *= 16;
}
This is just a clear example though. In reality you would do it with bit shifting for example rather than multiplying.
What data type is buffer?
The whole thing looks like a great big no-op, since operator<<(stringstream&, char) ignores the base specifier. The hex specifier only affects formatting of non-character integral types. For certain you don't want to be handing textual data to protobuf.
Just hand the buffer pointer to protobuf, you're done.
OK, a shot into the dark: Let's say your ingress stream is "71F4E81DA...", and you want to turn this into a byte stream { 0x71, 0xF4, 0xE8, ...}. Then we can just assemble the bytes from the character literals as follows, schematically:
char * p = getCurrentPointer();
while (chars_left() >= 2)
{
    unsigned char b;
    b = get_byte_value(*p++) << 4; // high nibble: one hex digit is 4 bits, so shift by 4
    b += get_byte_value(*p++);     // low nibble
    output_stream.insert(b);
}
Here we use a little helper function:
unsigned char get_byte_value(char c)
{
    if ('0' <= c && c <= '9') return c - '0';
    if ('A' <= c && c <= 'F') return 10 + c - 'A';
    if ('a' <= c && c <= 'f') return 10 + c - 'a';
    return 0; // error
}