C++ convert ASCII-escaped Unicode string into UTF-8 string

I need to read in a standard ASCII-style string with Unicode escaping and convert it into a std::string containing the UTF-8 encoded equivalent. For example, "\u03a0" (a std::string with 6 characters) should be converted into a std::string with two characters, the raw bytes 0xCE and 0xA0.
I'd be most happy if there's a simple answer using ICU or Boost, but I haven't been able to find one.
(This is similar to Convert a Unicode string to an escaped ASCII string, but NB that I ultimately need to arrive at the UTF-8 encoding. If we can use Unicode as an intermediate step, that's fine.)
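As it happens, ICU can do this almost directly: icu::UnicodeString::unescape() expands \uXXXX (and \UXXXXXXXX) escapes, and toUTF8String() produces the UTF-8 bytes. A minimal sketch, assuming ICU is installed and linked (error handling omitted; unescape_to_utf8 is just an illustrative name):

#include <unicode/unistr.h>
#include <string>

// Sketch: expand \uXXXX escapes in an ASCII string and return the UTF-8 bytes.
std::string unescape_to_utf8(const std::string& escaped)
{
    icu::UnicodeString us = icu::UnicodeString::fromUTF8(escaped).unescape();
    std::string utf8;
    us.toUTF8String(utf8); // appends the UTF-8 encoding to utf8
    return utf8;
}
// unescape_to_utf8("\\u03a0") yields the two bytes 0xCE 0xA0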

Try something like this:
std::string to_utf8(uint32_t cp)
{
    /*
    If using C++11 or later, you can do this instead (deprecated in C++17,
    but still available):

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes( (char32_t)cp );

    Otherwise...
    */
    std::string result;

    int count;
    if (cp <= 0x007F)
        count = 1;
    else if (cp <= 0x07FF)
        count = 2;
    else if (cp <= 0xFFFF)
        count = 3;
    else if (cp <= 0x10FFFF)
        count = 4;
    else
        return result; // or throw an exception

    result.resize(count);

    if (count > 1)
    {
        // fill the continuation bytes (10xxxxxx) from the low 6 bits upward
        for (int i = count-1; i > 0; --i)
        {
            result[i] = (char) (0x80 | (cp & 0x3F));
            cp >>= 6;
        }
        // set the length-prefix bits of the lead byte (110..., 1110..., 11110...)
        for (int i = 0; i < count; ++i)
            cp |= (1 << (7-i));
    }

    result[0] = (char) cp;

    return result;
}
std::string str = ...; // "\\u03a0"
std::string::size_type startIdx = 0;
do
{
    startIdx = str.find("\\u", startIdx);
    if (startIdx == std::string::npos) break;

    std::string::size_type endIdx = str.find_first_not_of("0123456789abcdefABCDEF", startIdx+2);
    if (endIdx == std::string::npos) endIdx = str.length(); // the escape may end the string
    if (endIdx - (startIdx+2) > 4) endIdx = startIdx + 6;   // a \u escape is exactly 4 hex digits

    std::string tmpStr = str.substr(startIdx+2, endIdx-(startIdx+2));
    std::istringstream iss(tmpStr);

    uint32_t cp;
    if (iss >> std::hex >> cp)
    {
        std::string utf8 = to_utf8(cp);
        str.replace(startIdx, 2+tmpStr.length(), utf8);
        startIdx += utf8.length();
    }
    else
        startIdx += 2;
}
while (true);

(\u03a0 is the Unicode code point for GREEK CAPITAL LETTER PI whose UTF-8 encoding is 0xCE 0xA0)
You need to:
Get the number 0x03a0 from the string "\u03a0": drop the backslash and the u and parse 03a0 as hex, into a wchar_t. Repeat until you get a (wide) string.
Convert 0x3a0 into UTF-8. C++11 has a codecvt_utf8 that may help.
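Putting those two steps together, a minimal C++11 sketch (std::wstring_convert and std::codecvt_utf8 are deprecated in C++17 but still work; surrogate pairs and malformed escapes are not handled):

#include <codecvt>
#include <locale>
#include <string>

// Sketch: replace every \uXXXX escape with its UTF-8 encoding.
std::string unescape_utf8(const std::string& in)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::string out;
    for (std::size_t i = 0; i < in.size(); )
    {
        if (in.compare(i, 2, "\\u") == 0 && i + 6 <= in.size())
        {
            char32_t cp = (char32_t) std::stoul(in.substr(i + 2, 4), nullptr, 16);
            out += conv.to_bytes(cp); // the UTF-8 bytes for this code point
            i += 6;
        }
        else
            out += in[i++]; // ordinary character, copy through
    }
    return out;
}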

My solution:
convert_unicode_escape_sequences(str)
input: "\u043f\u0440\u0438\u0432\u0435\u0442"
output: "привет"
Boost is used for the wchar_t/char conversion:
#include <boost/locale/encoding_utf.hpp>
#include <boost/numeric/conversion/cast.hpp> // for boost::numeric_cast, used below
#include <sstream>
using boost::locale::conv::utf_to_utf;
inline uint8_t get_uint8(uint8_t h, uint8_t l)
{
    uint8_t ret = 0; // assumes h and l are valid hex digits

    if (h - '0' < 10)
        ret = h - '0';
    else if (h - 'A' < 6)
        ret = h - 'A' + 0x0A;
    else if (h - 'a' < 6)
        ret = h - 'a' + 0x0A;

    ret = ret << 4;

    if (l - '0' < 10)
        ret |= l - '0';
    else if (l - 'A' < 6)
        ret |= l - 'A' + 0x0A;
    else if (l - 'a' < 6)
        ret |= l - 'a' + 0x0A;

    return ret;
}

std::string wstring_to_utf8(const std::wstring& str)
{
    return utf_to_utf<char>(str.c_str(), str.c_str() + str.size());
}
std::string convert_unicode_escape_sequences(const std::string& source)
{
    std::wstringstream wis;
    auto s = source.begin();
    while (s != source.end())
    {
        if (*s == '\\')
        {
            if (std::distance(s, source.end()) > 5) // room left for "\uXXXX"
            {
                if (*(s + 1) == 'u')
                {
                    unsigned int v = get_uint8(*(s + 2), *(s + 3)) << 8;
                    v |= get_uint8(*(s + 4), *(s + 5));
                    s += 6;
                    wis << boost::numeric_cast<wchar_t>(v);
                    continue;
                }
            }
        }
        wis << wchar_t(*s);
        s++;
    }
    return wstring_to_utf8(wis.str());
}

Related

C++ string of Greek characters and .at() operator

With English characters it is easy to extract, so to say, a char from a string; e.g., the following code should have y as output:
string my_word = "mystring";
cout << my_word.at(1);
If I try to do the same with Greek characters, I get a funny character:
string my_word = "λογος";
cout << my_word.at(1);
Output:
�
My question is: what can I do to make .at() (or a similar function) work?
Many thanks!
std::string is a sequence of narrow characters (char). But many national alphabets use more than one char to encode a single letter under a UTF-8 locale. So when you take s.at(0), you get half of a whole letter, or even less. You should use wide characters: std::wstring instead of std::string, std::wcout instead of std::cout, and L"λογος" as the string literal.
Also, you should set the right locale before any printing, using the std::locale machinery.
Code example for this case:
#include <iostream>
#include <string>
#include <locale>
int main(int, char**) {
    std::locale::global(std::locale("en_US.utf8"));
    std::wcout.imbue(std::locale());
    std::wstring s = L"λογος";
    std::wcout << s.at(0) << std::endl;
    return 0;
}
The problem is complex. Non-Latin characters have to be encoded properly, and there are a couple of standards for that. The question is which encoding your system is using.
In UTF-8, one character is represented by multiple bytes: from 1 to 4, depending on what kind of character it is.
For example, λ is represented by two bytes (in hex): CE BB.
I don't know offhand which other character encodings give single-byte characters for Greek letters, but there is at least one (ISO 8859-7, for instance).
Note that my_word.length() most probably returns 10, not 5.
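If you really do need the i-th character, another option is to decode the UTF-8 by hand rather than switching to wchar_t. A small sketch, assuming the string is valid UTF-8 and the index is in range (utf8_at is just an illustrative name):

#include <string>

// Sketch: return the idx-th Unicode code point of a valid UTF-8 string.
char32_t utf8_at(const std::string& s, std::size_t idx)
{
    std::size_t i = 0;
    while (idx-- > 0)                                        // skip idx whole characters
    {
        ++i;
        while (i < s.size() && (s[i] & 0xC0) == 0x80) ++i;   // skip continuation bytes
    }
    unsigned char lead = (unsigned char) s.at(i);
    int trailing = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    char32_t cp = trailing == 0 ? lead : (char32_t)(lead & (0x3F >> trailing));
    for (int k = 0; k < trailing; ++k)
        cp = (cp << 6) | ((unsigned char) s.at(++i) & 0x3F); // add 6 payload bits per byte
    return cp;
}

For the string "λογος", utf8_at(s, 1) yields U+03BF (ο); to display it you still have to encode it back for output.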
As others have said, it depends on your encoding. An at() function is problematic once you move to internationalisation, because Hebrew, for example, has vowels written around the character. Not all scripts consist of discrete sequences of glyphs.
Generally it's best to treat strings as atomic, unless you are writing the display / word-manipulation code itself, when of course you need the individual glyphs. To read UTF, check out the code in Baby X (it's a windowing system that has to draw text to the screen).
Here's the link: https://github.com/MalcolmMcLean/babyx/blob/master/src/common/BBX_Font.c
Here's the UTF-8 code - it's quite a hunk of code, but fundamentally straightforward.
static const unsigned int offsetsFromUTF8[6] =
{
    0x00000000UL, 0x00003080UL, 0x000E2080UL,
    0x03C82080UL, 0xFA082080UL, 0x82082080UL
};
static const unsigned char trailingBytesForUTF8[256] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};
int bbx_isutf8z(const char *str)
{
    int len = 0;
    int pos = 0;
    int nb;
    int i;
    int ch;

    while(str[len])
        len++;
    while(pos < len && *str)
    {
        nb = bbx_utf8_skip(str);
        if(nb < 1 || nb > 4)
            return 0;
        if(pos + nb > len)
            return 0;
        for(i=1;i<nb;i++)
            if( (str[i] & 0xC0) != 0x80 )
                return 0;
        ch = bbx_utf8_getch(str);
        if(ch < 0x80)
        {
            if(nb != 1)
                return 0;
        }
        else if(ch < 0x800) /* 2-byte sequences encode U+0080..U+07FF */
        {
            if(nb != 2)
                return 0;
        }
        else if(ch < 0x10000)
        {
            if(nb != 3)
                return 0;
        }
        else if(ch < 0x110000)
        {
            if(nb != 4)
                return 0;
        }
        pos += nb;
        str += nb;
    }

    return 1;
}
int bbx_utf8_skip(const char *utf8)
{
    return trailingBytesForUTF8[(unsigned char) *utf8] + 1;
}

int bbx_utf8_getch(const char *utf8)
{
    int ch;
    int nb;

    nb = trailingBytesForUTF8[(unsigned char)*utf8];
    ch = 0;
    switch (nb)
    {
        /* these fall through deliberately */
        case 3: ch += (unsigned char)*utf8++; ch <<= 6;
        case 2: ch += (unsigned char)*utf8++; ch <<= 6;
        case 1: ch += (unsigned char)*utf8++; ch <<= 6;
        case 0: ch += (unsigned char)*utf8++;
    }
    ch -= offsetsFromUTF8[nb];

    return ch;
}

int bbx_utf8_putch(char *out, int ch)
{
    char *dest = out;

    if (ch < 0x80)
    {
        *dest++ = (char)ch;
    }
    else if (ch < 0x800)
    {
        *dest++ = (ch>>6) | 0xC0;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else if (ch < 0x10000)
    {
        *dest++ = (ch>>12) | 0xE0;
        *dest++ = ((ch>>6) & 0x3F) | 0x80;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else if (ch < 0x110000)
    {
        *dest++ = (ch>>18) | 0xF0;
        *dest++ = ((ch>>12) & 0x3F) | 0x80;
        *dest++ = ((ch>>6) & 0x3F) | 0x80;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else
        return 0;

    return dest - out;
}

int bbx_utf8_charwidth(int ch)
{
    if (ch < 0x80)
        return 1;
    else if (ch < 0x800)
        return 2;
    else if (ch < 0x10000)
        return 3;
    else if (ch < 0x110000)
        return 4;
    else
        return 0;
}

int bbx_utf8_Nchars(const char *utf8)
{
    int answer = 0;

    while(*utf8)
    {
        utf8 += bbx_utf8_skip(utf8);
        answer++;
    }

    return answer;
}

C++ ShiftJIS to UTF8 conversion

I need to convert double-byte characters, in my special case Shift-JIS, into something better to handle, preferably with standard C++.
The following question ended up without a workaround:
Doublebyte encodings on MSVC (std::codecvt): Lead bytes not recognized
So, is there anyone with a suggestion or a reference on how to handle this conversion with standard C++?
Normally I would recommend using the ICU library, but for this alone, using it is way too much overhead.
First, a conversion function which takes a std::string with Shift-JIS data and returns a std::string with UTF-8 (note from 2019: no idea anymore if it works :))
It uses a uint8_t array of 25088 elements (25088 bytes), which is used as convTable in the code. The function does not fill this variable; you have to load it from, e.g., a file first. The second code part below is a program that can generate the file.
The conversion function doesn't check if the input is valid Shift-JIS data.
// convTable: the 25088-byte mapping table described above, loaded externally
std::string sj2utf8(const std::string &input)
{
    std::string output(3 * input.length(), ' '); // Shift-JIS won't give 4-byte UTF-8, so max. 3 bytes per input char are needed
    size_t indexInput = 0, indexOutput = 0;

    while(indexInput < input.length())
    {
        char arraySection = ((uint8_t)input[indexInput]) >> 4;

        size_t arrayOffset;
        if(arraySection == 0x8) arrayOffset = 0x100;       // these are two-byte Shift-JIS
        else if(arraySection == 0x9) arrayOffset = 0x1100;
        else if(arraySection == 0xE) arrayOffset = 0x2100;
        else arrayOffset = 0;                              // this is one-byte Shift-JIS

        // determining the real array offset
        if(arrayOffset)
        {
            arrayOffset += (((uint8_t)input[indexInput]) & 0xf) << 8;
            indexInput++;
            if(indexInput >= input.length()) break;
        }
        arrayOffset += (uint8_t)input[indexInput++];
        arrayOffset <<= 1;

        // the Unicode number is...
        uint16_t unicodeValue = (convTable[arrayOffset] << 8) | convTable[arrayOffset + 1];

        // converting to UTF-8
        if(unicodeValue < 0x80)
        {
            output[indexOutput++] = unicodeValue;
        }
        else if(unicodeValue < 0x800)
        {
            output[indexOutput++] = 0xC0 | (unicodeValue >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
        else
        {
            output[indexOutput++] = 0xE0 | (unicodeValue >> 12);
            output[indexOutput++] = 0x80 | ((unicodeValue & 0xfff) >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
    }

    output.resize(indexOutput); // remove the unnecessary bytes
    return output;
}
About the helper file: I used to have a download here, but nowadays I only know unreliable file hosters. So... either http://s000.tinyupload.com/index.php?file_id=95737652978017682303 works for you, or:
First download the "original" data from ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT . I can't paste it here because of the length, so we have to hope at least unicode.org stays online.
Then use this program while piping/redirecting the above text file in, and redirecting the binary output to a new file. (Needs a binary-safe shell; no idea if it works on Windows.)
#include <iostream>
#include <string>
#include <cstdio>
#include <cstdint>
using namespace std;

// pipe SHIFTJIS.TXT in and pipe to (binary) file out
int main()
{
    string s;
    uint8_t *mapping; // same big-endian array as in the converting function
    mapping = new uint8_t[2*(256 + 3*256*16)];

    // initializing with space for invalid values, and then ASCII control chars
    for(size_t i = 32; i < 256 + 3*256*16; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = 0x20;
    }
    for(size_t i = 0; i < 32; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = i;
    }

    while(getline(cin, s)) // pipe the file SHIFTJIS.TXT to stdin
    {
        if(s.substr(0, 2) != "0x") continue; // comment lines

        uint16_t shiftJisValue, unicodeValue;
        if(2 != sscanf(s.c_str(), "%hx %hx", &shiftJisValue, &unicodeValue)) // getting hex values
        {
            puts("Error hex reading");
            continue;
        }

        size_t offset; // array offset
        if((shiftJisValue >> 8) == 0) offset = 0;
        else if((shiftJisValue >> 12) == 0x8) offset = 256;
        else if((shiftJisValue >> 12) == 0x9) offset = 256 + 16*256;
        else if((shiftJisValue >> 12) == 0xE) offset = 256 + 2*16*256;
        else
        {
            puts("Error input values");
            continue;
        }
        offset = 2 * (offset + (shiftJisValue & 0xfff));

        if(mapping[offset] != 0 || mapping[offset + 1] != 0x20)
        {
            puts("Error mapping not 1:1");
            continue;
        }
        mapping[offset] = unicodeValue >> 8;
        mapping[offset + 1] = unicodeValue & 0xff;
    }

    fwrite(mapping, 1, 2*(256 + 3*256*16), stdout);
    delete[] mapping;
    return 0;
}
Notes:
Two-byte big-endian raw Unicode values (more than two bytes are not necessary here)
First 256 chars (512 bytes) for the single-byte Shift-JIS chars, value 0x20 for invalid ones.
Then 3 * 256*16 chars for the groups 0x8???, 0x9??? and 0xE???
= 25088 bytes
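A sketch of loading the generated table at run time (the file name is whatever you redirected the generator's output to; the name shown here is a placeholder):

#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>

// Sketch: slurp the 25088-byte mapping file into memory for use as convTable.
std::vector<uint8_t> load_conv_table(const char* path /* e.g. "shiftjis.bin" */)
{
    std::ifstream f(path, std::ios::binary);
    std::vector<uint8_t> table((std::istreambuf_iterator<char>(f)),
                               std::istreambuf_iterator<char>());
    if (table.size() != 2 * (256 + 3 * 256 * 16)) // expect exactly 25088 bytes
        table.clear();                            // wrong size: signal failure
    return table;
}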
For those looking for the Shift-JIS conversion table data, you can get the uint8_t array here:
https://github.com/bucanero/apollo-ps3/blob/master/include/shiftjis.h
Also, here's a very simple function to convert basic Shift-JIS chars to ASCII:
const char SJIS_REPLACEMENT_TABLE[] =
" ,.,..:;?!\"*'`*^"
"-_????????*---/\\"
"~||--''\"\"()()[]{"
"}<><>[][][]+-+X?"
"-==<><>????*'\"CY"
"$c&%#&*#S*******"
"*******T><^_'='";
// Convert Shift-JIS characters to their ASCII equivalent (needs <cstring> and <cstdint>)
void sjis2ascii(char* bData)
{
    uint16_t ch;
    int i, j = 0;
    int len = strlen(bData);

    for (i = 0; i < len; i += 2)
    {
        // cast through uint8_t to avoid sign extension on bytes >= 0x80
        ch = ((uint8_t)bData[i] << 8) | (uint8_t)bData[i+1];

        // 'A' .. 'Z'
        // '0' .. '9'
        if ((ch >= 0x8260 && ch <= 0x8279) || (ch >= 0x824F && ch <= 0x8258))
        {
            bData[j++] = (ch & 0xFF) - 0x1F;
            continue;
        }
        // 'a' .. 'z'
        if (ch >= 0x8281 && ch <= 0x829A)
        {
            bData[j++] = (ch & 0xFF) - 0x20;
            continue;
        }
        if (ch >= 0x8140 && ch <= 0x81AC)
        {
            bData[j++] = SJIS_REPLACEMENT_TABLE[(ch & 0xFF) - 0x40];
            continue;
        }
        if (ch == 0x0000)
        {
            // End of the string
            bData[j] = 0;
            return;
        }
        // Character not found
        bData[j++] = bData[i];
        bData[j++] = bData[i+1];
    }

    bData[j] = 0;
    return;
}

Optimizing Hexadecimal To Ascii Function in C++

This is a function in C++ that takes a HEX string and converts it to its equivalent ASCII characters:
string HEX2STR (string str)
{
    string tmp;
    const char *c = str.c_str();
    unsigned int x;
    while(*c != 0) {
        sscanf(c, "%2X", &x);
        tmp += x;
        c += 2;
    }
    return tmp;
}
If you input the following string:
537461636b6f766572666c6f77206973207468652062657374212121
The output will be:
Stackoverflow is the best!!!
Say I were to input 1,000,000 unique HEX strings into this function; it takes a while to compute.
Is there a more efficient way to complete this?
Of course. Look up two characters at a time:
unsigned char val(char c)
{
    if ('0' <= c && c <= '9') { return c - '0'; }
    if ('a' <= c && c <= 'f') { return c + 10 - 'a'; }
    if ('A' <= c && c <= 'F') { return c + 10 - 'A'; }
    throw "Eeek";
}

std::string decode(std::string const & s)
{
    if ((s.size() % 2) != 0) { throw "Eeek"; }

    std::string result;
    result.reserve(s.size() / 2);

    for (std::size_t i = 0; i < s.size() / 2; ++i)
    {
        unsigned char n = val(s[2 * i]) * 16 + val(s[2 * i + 1]);
        result += n;
    }

    return result;
}
Just since I wrote it anyway, this should be fairly efficient :)
const char lookup[32] =
{0,10,11,12,13,14,15,0,0,0,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0};

std::string HEX2STR(std::string str)
{
    std::string out;
    out.reserve(str.size()/2);
    const char* tmp = str.c_str();
    unsigned char ch = 0, last = 1; // ch initialized so the first shift is well-defined
    while(*tmp)
    {
        ch <<= 4;
        ch |= lookup[*tmp&0x1f]; // the low 5 bits of '0'-'9', 'a'-'f', 'A'-'F' index the table
        if(last ^= 1)            // emit a byte on every second digit
            out += ch;
        tmp++;
    }
    return out;
}
Don't use sscanf. It's a very general, flexible function, which means it's slow so as to allow all those use cases. Instead, walk the string and convert each character yourself; that's much faster.
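For what it's worth, C++17's std::from_chars is a standard, locale-free way to do exactly that. A sketch (input characters are assumed to be valid hex digits; hex_to_str is just an illustrative name):

#include <charconv>
#include <string>

// Sketch (C++17): decode a hex string two characters at a time.
std::string hex_to_str(const std::string& hex)
{
    std::string out;
    out.reserve(hex.size() / 2);
    for (std::size_t i = 0; i + 1 < hex.size(); i += 2)
    {
        unsigned char byte = 0;
        std::from_chars(hex.data() + i, hex.data() + i + 2, byte, 16); // parses exactly 2 digits
        out += (char) byte;
    }
    return out;
}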
This routine takes a string with (what I call) hexwords, often used in embedded ECUs, for example "31 01 7F 33 38 33 37 30 35 31 30 30 20 20 49", and transforms it into readable ASCII where possible. It transforms by taking care of the discontinuity in the ASCII table (0-9: 48-57, A-F: 65-70). The snippet was posted without its enclosing function, so a reconstructed signature is shown:
// The signature below is a reconstruction; ascii_bufferCurrentLength is
// presumably a member of the surrounding class.
bool hexWordsToAscii(const char *stringWithHexWords)
{
    int i, j, len = strlen(stringWithHexWords);
    char ascii_buffer[250];
    char c1, c2, r;

    i = 0;
    j = 0;
    while (i < len) {
        c1 = stringWithHexWords[i];
        c2 = stringWithHexWords[i+1];
        if ((int)c1 != 32) { // if space found, skip next section and bump index only once
            // skip scary ASCII codes
            if (32 < (int)c1 && 127 > (int)c1 && 32 < (int)c2 && 127 > (int)c2) {
                // transform by taking the first hexdigit * 16 and adding the second hexdigit,
                // both with the correct offset
                r = (char) (16 * ((int)c1 < 64 ? ((int)c1 - 48) : ((int)c1 - 55))
                               + ((int)c2 < 64 ? ((int)c2 - 48) : ((int)c2 - 55)));
                if (31 < (int)r && 127 > (int)r)
                    ascii_buffer[j++] = r; // check result for readability
            }
            i++; // bump index
        }
        i++; // bump index once more for next hexdigit
    }
    ascii_bufferCurrentLength = j;
    return true;
}
The hexToString() function will convert a hex string to a readable ASCII string:
int hexCharToInt(char a){
    if(a >= '0' && a <= '9')
        return(a - '0');
    else if(a >= 'A' && a <= 'Z')
        return(a - 'A' + 10);
    else
        return(a - 'a' + 10);
}

string hexToString(string str){
    std::stringstream HexString;
    for(size_t i = 0; i < str.length(); i++){ // assumes an even number of hex digits
        char a = str.at(i++);
        char b = str.at(i);
        int x = hexCharToInt(a);
        int y = hexCharToInt(b);
        HexString << (char)((16 * x) + y);
    }
    return HexString.str();
}

CString Hex value conversion to Byte Array

I have been trying to carry out a conversion from a CString that contains a hex string to a byte array, and have been unsuccessful so far. I have looked on forums and none of them seem to help. Is there a function with just a few lines of code to do this conversion?
My code:
BYTE abyData[8]; // BYTE = unsigned char
CString sByte = "0E00000000000400";
Expecting:
abyData[0] = 0x0E;
abyData[6] = 0x04; // etc.
You can simply gobble up two characters at a time:
unsigned int value(char c)
{
    if (c >= '0' && c <= '9') { return c - '0'; }
    if (c >= 'A' && c <= 'F') { return c - 'A' + 10; }
    if (c >= 'a' && c <= 'f') { return c - 'a' + 10; }
    return -1; // Error! (wraps to UINT_MAX, since the return type is unsigned)
}

for (unsigned int i = 0; i != 8; ++i)
{
    abyData[i] = value(sByte[2 * i]) * 16 + value(sByte[2 * i + 1]);
}
Of course 8 should be the size of your array, and you should ensure that the string is precisely twice as long. A checking version of this would make sure that each character is a valid hex digit and signal some type of error if that isn't the case.
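For instance, a checking version built on the same value() helper might look like this (a sketch; the > 15 test catches the wrapped-around error value):

bool hex_to_bytes(const CString& sByte, BYTE* out, unsigned int count)
{
    if ((unsigned int) sByte.GetLength() != 2 * count)
        return false;                             // wrong length
    for (unsigned int i = 0; i != count; ++i)
    {
        unsigned int hi = value(sByte[2 * i]);
        unsigned int lo = value(sByte[2 * i + 1]);
        if (hi > 15 || lo > 15)
            return false;                         // non-hex digit
        out[i] = (BYTE) (hi * 16 + lo);
    }
    return true;
}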
How about something like this (needs <cctype>):
for (int i = 0; i < sizeof(abyData) && (i * 2) < sByte.GetLength(); i++)
{
    char ch1 = sByte[i * 2];
    char ch2 = sByte[i * 2 + 1];

    int value = 0;

    if (std::isdigit(ch1))
        value += ch1 - '0';
    else
        value += (std::tolower(ch1) - 'a') + 10;

    // That was the four high bits, so make them that
    value <<= 4;

    if (std::isdigit(ch2))
        value += ch2 - '0';
    else
        value += (std::tolower(ch2) - 'a') + 10;

    abyData[i] = value;
}
Note: The code above is not tested.
You could:
#include <stdint.h>
#include <sstream>
#include <iostream>
int main() {
    unsigned char result[8];
    std::stringstream ss;
    ss << std::hex << "0E00000000000400";
    ss >> *( reinterpret_cast<uint64_t *>( result ) );
    std::cout << static_cast<int>( result[1] ) << std::endl;
}
However, take care of alignment and aliasing issues!
Plus, on a little-endian machine the result is in the reverse order from what you would expect:
result[0] = 0x00
result[1] = 0x04
...
result[7] = 0x0E

Convert MAC address std::string into uint64_t

I have a hexadecimal MAC address held in a std::string. What would be the best way to turn that MAC address into an integer-type held in a uint64_t?
I'm aware of stringstream, sprintf, atoi, etc. I've actually written little conversion functions with the first two of those, but they seem sloppier than I would like.
So, can someone show me a good, clean way to convert
std::string mac = "00:00:12:24:36:4f";
into a uint64_t?
PS: I don't have boost/TR1 facilities available and can't install them where the code will actually be used (which is also why I haven't copy pasted one of my attempts, sorry about that!). So please keep solutions to straight-up C/C++ calls. If you have an interesting solution with a UNIX system call I'd be interested too!
uint64_t string_to_mac(std::string const& s) {
    unsigned char a[6];
    int last = -1;
    int rc = sscanf(s.c_str(), "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx%n",
                    a + 0, a + 1, a + 2, a + 3, a + 4, a + 5,
                    &last);
    if(rc != 6 || s.size() != last)
        throw std::runtime_error("invalid mac address format " + s);
    return
        uint64_t(a[0]) << 40 |
        uint64_t(a[1]) << 32 | (
            // 32-bit instructions take fewer bytes on x86, so use them as much as possible.
            uint32_t(a[2]) << 24 |
            uint32_t(a[3]) << 16 |
            uint32_t(a[4]) << 8 |
            uint32_t(a[5])
        );
}
My solution (requires C++11):
#include <string>
#include <cstdint>
#include <algorithm>
#include <stdlib.h>

uint64_t convert_mac(std::string mac) {
    // Remove colons
    mac.erase(std::remove(mac.begin(), mac.end(), ':'), mac.end());

    // Convert to uint64_t
    return strtoul(mac.c_str(), NULL, 16);
}
Use sscanf:
std::string mac = "00:00:12:24:36:4f";
unsigned u[6];
int c = sscanf(mac.c_str(), "%x:%x:%x:%x:%x:%x", u, u+1, u+2, u+3, u+4, u+5);
if (c != 6) raise_error("input format error"); // raise_error: whatever error handling you use
uint64_t r = 0;
for (int i = 0; i < 6; i++) r = (r << 8) + u[i];
// or: for (int i = 0; i < 6; i++) r = (r << 8) + u[5-i];
I can't think of any magic tricks. Here's a random attempt that may or may not be better than what you've done. It's simplish, but I bet there are far faster solutions.
uint64_t mac2int(std::string s) {
    uint64_t r = 0;
    std::string::iterator i;
    std::string::iterator end = s.end();
    for(i = s.begin(); i != end; ++i) {
        char let = *i;
        if (let >= '0' && let <= '9') {
            r = r*16 + (let - '0');
        } else if (let >= 'a' && let <= 'f') {
            r = r*16 + (let - 'a' + 10);
        } else if (let >= 'A' && let <= 'F') {
            r = r*16 + (let - 'A' + 10);
        }
    }
    return r;
}
This will just shift hex digits through until the string runs out, not caring about delimiters or total length. But it converts the input string to the desired uint64_t format.
#include <string>
#include <stdint.h>

uint64_t cvt(std::string &v)
{
    std::string::iterator i;
    std::string digits = "0123456789abcdefABCDEF";
    uint64_t result = 0;
    size_t pos = 0;

    i = v.begin();
    while (i != v.end())
    {
        // search for character in hex digits set
        pos = digits.find(*i);

        // if found in valid hex digits
        if (pos != std::string::npos)
        {
            // handle upper/lower case hex digit
            if (pos > 0xf)
            {
                pos -= 6;
            }
            // shift a nibble in
            result <<= 4;
            result |= pos;
        }
        ++i;
    }
    return result;
}
Another faster version without calling library functions:
inline unsigned read_hex_byte(char const** beg, char const* end) {
    if(end - *beg < 2)
        throw std::invalid_argument("");
    unsigned hi = (*beg)[0], lo = (*beg)[1];
    *beg += 2;
    hi -= hi >= '0' && hi <= '9' ? '0' :
          hi >= 'a' && hi <= 'f' ? 'a' - 10 :
          hi >= 'A' && hi <= 'F' ? 'A' - 10 :
          throw std::invalid_argument("");
    lo -= lo >= '0' && lo <= '9' ? '0' :
          lo >= 'a' && lo <= 'f' ? 'a' - 10 :
          lo >= 'A' && lo <= 'F' ? 'A' - 10 :
          throw std::invalid_argument("");
    return hi << 4 | lo;
}

uint64_t string_to_mac2(std::string const& s) {
    char const *beg = s.data(), *end = beg + s.size();
    uint64_t r;
    try {
        r = read_hex_byte(&beg, end);
        beg += beg != end && ':' == *beg;
        r = r << 8 | read_hex_byte(&beg, end);
        beg += beg != end && ':' == *beg;
        r = r << 8 | read_hex_byte(&beg, end);
        beg += beg != end && ':' == *beg;
        r = r << 8 | read_hex_byte(&beg, end);
        beg += beg != end && ':' == *beg;
        r = r << 8 | read_hex_byte(&beg, end);
        beg += beg != end && ':' == *beg;
        r = r << 8 | read_hex_byte(&beg, end);
    } catch(std::invalid_argument&) {
        beg = end - 1;
    }
    if(beg != end)
        throw std::runtime_error("invalid mac address format " + s);
    return r;
}
My 2 cents:
uint64_t ParseMac(const std::string& str)
{
    std::istringstream iss(str);
    uint64_t octet; // each extraction reads one two-digit hex group
    uint64_t result(0);
    iss >> std::hex;
    while(iss >> octet) {
        result = (result << 8) + octet;
        iss.get(); // consume the ':' separator
    }
    return result;
}
A more C++11 way, without input-data validation:
uint64_t stomac( const std::string &mac )
{
    static const std::regex r{ "([\\da-fA-F]{2})(:|$)" };
    auto it = std::sregex_iterator( mac.begin(), mac.end(), r );
    static const auto end = std::sregex_iterator();
    // the initial value must be uint64_t, or std::accumulate truncates the 48-bit result to int
    return std::accumulate( it, end, uint64_t( 0 ), []( uint64_t i, const std::sregex_iterator::value_type &v ) {
        return ( i << 8 ) + std::stol( v.str(1), nullptr, 16 );
    } );
}
You can also use the ASCII-to-struct-ether_addr conversion routine ether_aton, or its thread-safe version ether_aton_r (a GNU extension).
#include <netinet/ether.h>
#include <stdint.h>
#include <string>

#define ETHER_ADDR_ERR UINT64_C(~0)

uint64_t ether_atou64( const std::string& addr_str ) {
    union {
        uint64_t result;
        struct ether_addr address;
    };
    result = 0;
    struct ether_addr* ptr = ether_aton_r( addr_str.c_str(), &address );
    if( !ptr ) {
        return ETHER_ADDR_ERR;
    }
    return result; // note: byte order follows the host's endianness, not the textual order
}
Sorry, I cannot comment yet.
For the answer from @AB71E5, you need to change "strtoul" to "strtoull".
E.g., 01:02:03:04:05:06 = 48 bits, but "unsigned long" may be only 32 bits.
The final result is:
#include <string>
#include <cstdint>
#include <algorithm>
#include <stdlib.h>

uint64_t convert_mac(std::string mac) {
    // Remove colons
    mac.erase(std::remove(mac.begin(), mac.end(), ':'), mac.end());

    // Convert to uint64_t
    return strtoull(mac.c_str(), NULL, 16);
}