With English characters it is easy to extract a single char from a string; e.g., the following code should have y as output:
string my_word = "my_word"; // any string whose second character is 'y'
cout << my_word.at(1);
If I try to do the same with Greek characters, I get a funny character:
string my_word = "λογος";
cout << my_word.at(1);
Output:
�
My question is: what can I do to make .at() or whatever similar function work?
Many thanks!
std::string is a sequence of narrow characters (char). But many national alphabets use more than one char to encode a single letter in a UTF-8 locale, so when you take s.at(0) you get half of a whole letter, or even less. You should use wide characters: std::wstring instead of std::string, std::wcout instead of std::cout, and L"λογος" as the string literal.
Also, you should set the right locale before any printing, using the std::locale facilities.
Code example for this case:
#include <iostream>
#include <string>
#include <locale>
int main(int, char**) {
    std::locale::global(std::locale("en_US.utf8"));
    std::wcout.imbue(std::locale());
    std::wstring s = L"λογος";
    std::wcout << s.at(0) << std::endl;
    return 0;
}
The problem is complex. Non-Latin characters have to be encoded properly, and there are a couple of standards for that; the question is which encoding your system is using.
In the UTF-8 encoding one character is represented by multiple bytes: anywhere from 1 to 4, depending on what kind of character it is.
For example: λ is represented by two bytes (in hex): CE BB.
I don't know offhand which other character encodings give single-byte characters for Greek letters, but I'm sure there is one such encoding (ISO 8859-7, for example).
Note that my_word.length() most probably returns 10, not 5.
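You can check this yourself with a small program (my sketch, assuming the source file and the execution character set are both UTF-8):
#include <cstdio>
#include <string>

int main() {
    std::string s = "λογος"; // 5 letters, but 10 bytes in UTF-8
    std::printf("length: %zu\n", s.length()); // prints 10, not 5
    // the first two bytes are the UTF-8 encoding of λ
    std::printf("first bytes: %02X %02X\n",
                (unsigned char)s[0], (unsigned char)s[1]); // CE BB
    return 0;
}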
As others have said, it depends on your encoding. An at() function is also problematic once you move to internationalisation, because Hebrew, for example, has vowels written around the character; not all scripts consist of discrete sequences of glyphs.
Generally it's best to treat strings as atomic, unless you are writing the display / word-manipulation code itself, when of course you need the individual glyphs. To read UTF-8, check out the code in Baby X (it's a windowing system that has to draw text to the screen).
Here's the link: https://github.com/MalcolmMcLean/babyx/blob/master/src/common/BBX_Font.c
Here's the UTF-8 code. It's quite a hunk of code, but fundamentally straightforward.
static const unsigned int offsetsFromUTF8[6] =
{
    0x00000000UL, 0x00003080UL, 0x000E2080UL,
    0x03C82080UL, 0xFA082080UL, 0x82082080UL
};

static const unsigned char trailingBytesForUTF8[256] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};
int bbx_utf8_skip(const char *utf8);   /* forward declarations; defined below */
int bbx_utf8_getch(const char *utf8);

int bbx_isutf8z(const char *str)
{
    int len = 0;
    int pos = 0;
    int nb;
    int i;
    int ch;

    while (str[len])
        len++;
    while (pos < len && *str)
    {
        nb = bbx_utf8_skip(str);
        if (nb < 1 || nb > 4)
            return 0;
        if (pos + nb > len)
            return 0;
        /* continuation bytes must have the form 10xxxxxx */
        for (i = 1; i < nb; i++)
            if ((str[i] & 0xC0) != 0x80)
                return 0;
        ch = bbx_utf8_getch(str);
        /* reject overlong encodings: each range must use its shortest form */
        if (ch < 0x80)
        {
            if (nb != 1)
                return 0;
        }
        else if (ch < 0x800) /* 2-byte sequences encode up to U+07FF */
        {
            if (nb != 2)
                return 0;
        }
        else if (ch < 0x10000)
        {
            if (nb != 3)
                return 0;
        }
        else if (ch < 0x110000)
        {
            if (nb != 4)
                return 0;
        }
        else
            return 0; /* beyond U+10FFFF is not valid Unicode */
        pos += nb;
        str += nb;
    }
    return 1;
}
int bbx_utf8_skip(const char *utf8)
{
    return trailingBytesForUTF8[(unsigned char) *utf8] + 1;
}
int bbx_utf8_getch(const char *utf8)
{
    int ch;
    int nb;

    nb = trailingBytesForUTF8[(unsigned char)*utf8];
    ch = 0;
    switch (nb)
    {
        /* these fall through deliberately */
        case 3: ch += (unsigned char)*utf8++; ch <<= 6;
        case 2: ch += (unsigned char)*utf8++; ch <<= 6;
        case 1: ch += (unsigned char)*utf8++; ch <<= 6;
        case 0: ch += (unsigned char)*utf8++;
    }
    /* subtract out the lead-byte markers and continuation-byte prefixes */
    ch -= offsetsFromUTF8[nb];

    return ch;
}
int bbx_utf8_putch(char *out, int ch)
{
    char *dest = out;
    if (ch < 0x80)
    {
        *dest++ = (char)ch;
    }
    else if (ch < 0x800)
    {
        *dest++ = (ch >> 6) | 0xC0;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else if (ch < 0x10000)
    {
        *dest++ = (ch >> 12) | 0xE0;
        *dest++ = ((ch >> 6) & 0x3F) | 0x80;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else if (ch < 0x110000)
    {
        *dest++ = (ch >> 18) | 0xF0;
        *dest++ = ((ch >> 12) & 0x3F) | 0x80;
        *dest++ = ((ch >> 6) & 0x3F) | 0x80;
        *dest++ = (ch & 0x3F) | 0x80;
    }
    else
        return 0;
    return dest - out;
}
/* returns the number of bytes ch occupies when encoded as UTF-8 (0 if out of range) */
int bbx_utf8_charwidth(int ch)
{
    if (ch < 0x80)
        return 1;
    else if (ch < 0x800)
        return 2;
    else if (ch < 0x10000)
        return 3;
    else if (ch < 0x110000)
        return 4;
    else
        return 0;
}
int bbx_utf8_Nchars(const char *utf8)
{
    int answer = 0;

    while (*utf8)
    {
        utf8 += bbx_utf8_skip(utf8);
        answer++;
    }

    return answer;
}
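To tie this back to the original question, here is a small usage sketch (mine, assuming the functions above are in scope) that walks a UTF-8 string one character at a time, which is essentially what a UTF-8-aware at() has to do:
#include <cstdio>

int main() {
    const char *s = "λογος"; // assumes the source file is saved as UTF-8
    int i = 0;
    while (*s) {
        std::printf("character %d: U+%04X\n", i++, bbx_utf8_getch(s));
        s += bbx_utf8_skip(s); // advance by one whole character, not one byte
    }
    return 0;
}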
I need to convert double-byte characters, in my particular case Shift-JIS, into something easier to handle, preferably with standard C++.
The following question ended up without a workaround:
Doublebyte encodings on MSVC (std::codecvt): Lead bytes not recognized
So is there anyone with a suggestion or a reference on how to handle this conversion with C++ standard?
Normally I would recommend using the ICU library, but for this alone, using it is way too much overhead.
First, a conversion function which takes an std::string with Shift-JIS data and returns an std::string with UTF-8 (note from 2019: no idea anymore if it works :))
It uses a uint8_t array of 25088 elements (25088 bytes), which is used as convTable in the code. The function does not fill this variable; you have to load it from, e.g., a file first. The second code part below is a program that can generate the file.
The conversion function doesn't check if the input is valid Shift-JIS data.
std::string sj2utf8(const std::string &input)
{
    std::string output(3 * input.length(), ' '); // Shift-JIS won't give 4-byte UTF-8, so max. 3 bytes per input char are needed
    size_t indexInput = 0, indexOutput = 0;

    while (indexInput < input.length())
    {
        char arraySection = ((uint8_t)input[indexInput]) >> 4;

        size_t arrayOffset;
        if (arraySection == 0x8) arrayOffset = 0x100;       // these are two-byte Shift-JIS
        else if (arraySection == 0x9) arrayOffset = 0x1100;
        else if (arraySection == 0xE) arrayOffset = 0x2100;
        else arrayOffset = 0;                               // this is one-byte Shift-JIS

        // determining real array offset
        if (arrayOffset)
        {
            arrayOffset += (((uint8_t)input[indexInput]) & 0xf) << 8;
            indexInput++;
            if (indexInput >= input.length()) break;
        }
        arrayOffset += (uint8_t)input[indexInput++];
        arrayOffset <<= 1;

        // unicode number is...
        uint16_t unicodeValue = (convTable[arrayOffset] << 8) | convTable[arrayOffset + 1];

        // converting to UTF-8
        if (unicodeValue < 0x80)
        {
            output[indexOutput++] = unicodeValue;
        }
        else if (unicodeValue < 0x800)
        {
            output[indexOutput++] = 0xC0 | (unicodeValue >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
        else
        {
            output[indexOutput++] = 0xE0 | (unicodeValue >> 12);
            output[indexOutput++] = 0x80 | ((unicodeValue & 0xfff) >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
    }

    output.resize(indexOutput); // remove the unnecessary bytes
    return output;
}
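A usage sketch for the function (my addition; the file name "shiftjis_table.bin" and the loading code are just an example, any way of getting the 25088 bytes into convTable works):
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

uint8_t convTable[25088]; // declare before sj2utf8 so the function can see it

// ... sj2utf8 from above goes here ...

int main() {
    std::ifstream f("shiftjis_table.bin", std::ios::binary);
    if (!f.read(reinterpret_cast<char*>(convTable), sizeof convTable)) {
        std::cerr << "could not load conversion table\n";
        return 1;
    }
    std::string sjis = "\x83\x65\x83\x58\x83\x67"; // "テスト" in Shift-JIS
    std::cout << sj2utf8(sjis) << '\n';            // prints テスト as UTF-8
    return 0;
}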
About the helper file: I used to have a download here, but nowadays I only know unreliable file hosters. So... either http://s000.tinyupload.com/index.php?file_id=95737652978017682303 works for you, or:
First download the "original" data from ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT . I can't paste this here because of the length, so we have to hope at least unicode.org stays online.
Then use this program, piping/redirecting the text file above in and redirecting the binary output to a new file. (Needs a binary-safe shell; no idea if it works on Windows.)
#include <iostream>
#include <string>
#include <cstdio>
#include <cstdint>

using namespace std;

// pipe SHIFTJIS.TXT in and pipe to (binary) file out
int main()
{
    string s;
    uint8_t *mapping; // same big-endian array as in the converting function
    mapping = new uint8_t[2 * (256 + 3 * 256 * 16)];

    // initializing with space for invalid values, and then ASCII control chars
    for (size_t i = 32; i < 256 + 3 * 256 * 16; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = 0x20;
    }
    for (size_t i = 0; i < 32; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = i;
    }

    while (getline(cin, s)) // pipe the file SHIFTJIS.TXT to stdin
    {
        if (s.substr(0, 2) != "0x") continue; // comment lines

        uint16_t shiftJisValue, unicodeValue;
        if (2 != sscanf(s.c_str(), "%hx %hx", &shiftJisValue, &unicodeValue)) // getting hex values
        {
            puts("Error hex reading");
            continue;
        }

        size_t offset; // array offset
        if ((shiftJisValue >> 8) == 0) offset = 0;
        else if ((shiftJisValue >> 12) == 0x8) offset = 256;
        else if ((shiftJisValue >> 12) == 0x9) offset = 256 + 16 * 256;
        else if ((shiftJisValue >> 12) == 0xE) offset = 256 + 2 * 16 * 256;
        else
        {
            puts("Error input values");
            continue;
        }
        offset = 2 * (offset + (shiftJisValue & 0xfff));

        if (mapping[offset] != 0 || mapping[offset + 1] != 0x20)
        {
            puts("Error mapping not 1:1");
            continue;
        }
        mapping[offset] = unicodeValue >> 8;
        mapping[offset + 1] = unicodeValue & 0xff;
    }

    fwrite(mapping, 1, 2 * (256 + 3 * 256 * 16), stdout);
    delete[] mapping;
    return 0;
}
Notes:
Two-byte big-endian raw Unicode values (more than two bytes is not necessary here)
First 256 entries (512 bytes) are for the single-byte Shift-JIS chars, with value 0x20 for invalid ones
Then 3 * 256*16 entries for the groups 0x8???, 0x9??? and 0xE???
= 25088 bytes in total
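As a worked example of this layout (my addition): the Shift-JIS code 0x82A0, Hiragana あ (U+3042), is in the 0x8??? group, so both programs above agree on where it lives in the table:
#include <cstdio>

int main() {
    unsigned sjis = 0x82A0;                // Hiragana あ, maps to U+3042
    unsigned entry = 256 + (sjis & 0xfff); // 0x8??? group starts at entry 256
    unsigned byteOffset = 2 * entry;       // two bytes per entry, big endian
    std::printf("entry %u, byte offset %u\n", entry, byteOffset); // 928, 1856
    // after generation, convTable[1856] == 0x30 and convTable[1857] == 0x42
    return 0;
}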
For those looking for the Shift-JIS conversion table data, you can get the uint8_t array here:
https://github.com/bucanero/apollo-ps3/blob/master/include/shiftjis.h
Also, here's a very simple function to convert basic Shift-JIS chars to ASCII:
const char SJIS_REPLACEMENT_TABLE[] =
" ,.,..:;?!\"*'`*^"
"-_????????*---/\\"
"~||--''\"\"()()[]{"
"}<><>[][][]+-+X?"
"-==<><>????*'\"CY"
"$c&%#&*#S*******"
"*******T><^_'='";
#include <string.h>
#include <stdint.h>

// Convert Shift-JIS characters to their ASCII equivalent
void sjis2ascii(char* bData)
{
    uint16_t ch;
    int i, j = 0;
    int len = strlen(bData);

    for (i = 0; i < len; i += 2)
    {
        // cast through unsigned char so bytes >= 0x80 don't sign-extend
        ch = ((uint8_t)bData[i] << 8) | (uint8_t)bData[i+1];

        // 'A' .. 'Z'
        // '0' .. '9'
        if ((ch >= 0x8260 && ch <= 0x8279) || (ch >= 0x824F && ch <= 0x8258))
        {
            bData[j++] = (ch & 0xFF) - 0x1F;
            continue;
        }
        // 'a' .. 'z'
        if (ch >= 0x8281 && ch <= 0x829A)
        {
            bData[j++] = (ch & 0xFF) - 0x20;
            continue;
        }
        if (ch >= 0x8140 && ch <= 0x81AC)
        {
            bData[j++] = SJIS_REPLACEMENT_TABLE[(ch & 0xFF) - 0x40];
            continue;
        }
        if (ch == 0x0000)
        {
            // end of the string
            bData[j] = 0;
            return;
        }
        // character not found: copy it through unchanged
        bData[j++] = bData[i];
        bData[j++] = bData[i+1];
    }

    bData[j] = 0;
    return;
}
This is a function in C++ that takes a hex string and converts it to the equivalent ASCII characters.
string HEX2STR (string str)
{
    string tmp;
    const char *c = str.c_str();
    unsigned int x;
    while (*c != 0) {
        sscanf(c, "%2X", &x);
        tmp += x;
        c += 2;
    }
    return tmp;
}
If you input the following string:
537461636b6f766572666c6f77206973207468652062657374212121
The output will be:
Stackoverflow is the best!!!
Say I were to input 1,000,000 unique hex strings into this function; it takes a while to compute.
Is there a more efficient way to complete this?
Of course. Look up two characters at a time:
unsigned char val(char c)
{
    if ('0' <= c && c <= '9') { return c - '0'; }
    if ('a' <= c && c <= 'f') { return c + 10 - 'a'; }
    if ('A' <= c && c <= 'F') { return c + 10 - 'A'; }
    throw "Eeek";
}

std::string decode(std::string const & s)
{
    if ((s.size() % 2) != 0) { throw "Eeek"; }
    std::string result;
    result.reserve(s.size() / 2);
    for (std::size_t i = 0; i < s.size() / 2; ++i)
    {
        unsigned char n = val(s[2 * i]) * 16 + val(s[2 * i + 1]);
        result += n;
    }
    return result;
}
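For instance, feeding it the sample string from the question (my snippet, assuming val and decode from above are in scope):
#include <iostream>

int main() {
    std::cout << decode("537461636b6f766572666c6f77206973207468652062657374212121") << '\n';
    // prints: Stackoverflow is the best!!!
    return 0;
}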
Just since I wrote it anyway, this should be fairly efficient :)
const char lookup[32] =
    {0,10,11,12,13,14,15,0,0,0,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0};

std::string HEX2STR(std::string str)
{
    std::string out;
    out.reserve(str.size()/2);
    const char* tmp = str.c_str();
    unsigned char ch = 0, last = 1;
    while (*tmp)
    {
        ch <<= 4;
        ch |= lookup[*tmp & 0x1f]; // '0'-'9', 'a'-'f' and 'A'-'F' land on distinct slots mod 32
        if (last ^= 1)             // emit a byte on every second nibble
            out += ch;
        tmp++;
    }
    return out;
}
Don't use sscanf. It's a very general, flexible function, which means it is slow so that it can cover all those use cases. Instead, walk the string and convert each character yourself; that's much faster.
This routine takes a string with (what I call) hexwords, often used in embedded ECUs, for example "31 01 7F 33 38 33 37 30 35 31 30 30 20 20 49", and transforms it into readable ASCII where possible.
It transforms by taking care of the discontinuity in the ASCII table ('0'-'9': 48-57, 'A'-'F': 65-70):
// The enclosing declaration was missing from the original post; a plausible
// signature is assumed here, with ascii_bufferCurrentLength as a member or
// global of the surrounding code.
bool hexWordsToAscii(const char *stringWithHexWords)
{
    int i, j, len = strlen(stringWithHexWords);
    char ascii_buffer[250];
    char c1, c2, r;

    i = 0;
    j = 0;
    while (i < len) {
        c1 = stringWithHexWords[i];
        c2 = stringWithHexWords[i+1];
        if ((int)c1 != 32) { // if space found, skip next section and bump index only once
            // skip scary ASCII codes
            if (32 < (int)c1 && 127 > (int)c1 && 32 < (int)c2 && 127 > (int)c2) {
                // transform by taking the first hex digit * 16 and adding the second,
                // converting each digit with the correct offset for its ASCII range
                r = (char)(16 * ((int)c1 < 64 ? ((int)c1 - 48) : ((int)c1 - 55))
                         + ((int)c2 < 64 ? ((int)c2 - 48) : ((int)c2 - 55)));
                if (31 < (int)r && 127 > (int)r)
                    ascii_buffer[j++] = r; // check result for readability
            }
            i++; // bump index
        }
        i++; // bump index once more for next hex digit
    }
    ascii_bufferCurrentLength = j;
    return true;
}
The hexToString() function will convert a hex string to a readable ASCII string:
int hexCharToInt(char a); // forward declaration

string hexToString(string str)
{
    std::stringstream HexString;
    for (size_t i = 0; i < str.length(); i++) {
        char a = str.at(i++);
        char b = str.at(i);
        int x = hexCharToInt(a);
        int y = hexCharToInt(b);
        HexString << (char)((16 * x) + y);
    }
    return HexString.str();
}

int hexCharToInt(char a)
{
    if (a >= '0' && a <= '9')
        return (a - 48);   // '0'..'9' -> 0..9
    else if (a >= 'A' && a <= 'F')
        return (a - 55);   // 'A'..'F' -> 10..15
    else
        return (a - 87);   // 'a'..'f' -> 10..15
}
I have a hexadecimal MAC address held in a std::string. What would be the best way to turn that MAC address into an integer-type held in a uint64_t?
I'm aware of stringstream, sprintf, atoi, etc. I've actually written little conversion functions with the first two of those, but they seem sloppier than I would like.
So, can someone show me a good, clean way to convert
std::string mac = "00:00:12:24:36:4f";
into a uint64_t?
PS: I don't have boost/TR1 facilities available and can't install them where the code will actually be used (which is also why I haven't copy pasted one of my attempts, sorry about that!). So please keep solutions to straight-up C/C++ calls. If you have an interesting solution with a UNIX system call I'd be interested too!
uint64_t string_to_mac(std::string const& s) {
    unsigned char a[6];
    int last = -1;
    int rc = sscanf(s.c_str(), "%hhx:%hhx:%hhx:%hhx:%hhx:%hhx%n",
                    a + 0, a + 1, a + 2, a + 3, a + 4, a + 5,
                    &last);
    if (rc != 6 || s.size() != static_cast<size_t>(last))
        throw std::runtime_error("invalid mac address format " + s);
    return
        uint64_t(a[0]) << 40 |
        uint64_t(a[1]) << 32 | (
            // 32-bit instructions take fewer bytes on x86, so use them as much as possible.
            uint32_t(a[2]) << 24 |
            uint32_t(a[3]) << 16 |
            uint32_t(a[4]) << 8 |
            uint32_t(a[5])
        );
}
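A quick check against the example from the question (my snippet, assuming string_to_mac from above is in scope):
#include <cassert>
#include <iostream>

int main() {
    uint64_t mac = string_to_mac("00:00:12:24:36:4f");
    std::cout << std::hex << mac << '\n'; // prints 1224364f
    assert(mac == 0x00001224364fULL);
    return 0;
}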
My solution (requires C++11):
#include <string>
#include <cstdint>
#include <algorithm>
#include <stdlib.h>

uint64_t convert_mac(std::string mac) {
    // Remove colons
    mac.erase(std::remove(mac.begin(), mac.end(), ':'), mac.end());
    // Convert to uint64_t
    return strtoul(mac.c_str(), NULL, 16);
}
Use sscanf:
std::string mac = "00:00:12:24:36:4f";
unsigned u[6];
int c = sscanf(mac.c_str(), "%x:%x:%x:%x:%x:%x", u, u+1, u+2, u+3, u+4, u+5);
if (c != 6) raise_error("input format error");

uint64_t r = 0;
for (int i = 0; i < 6; i++) r = (r << 8) + u[i];
// or: for (int i = 0; i < 6; i++) r = (r << 8) + u[5-i];
I can't think of any magic tricks. Here's a random attempt that may or may not be better than what you've done. It's simplish, but I bet there are far faster solutions.
uint64_t mac2int(std::string s) {
    uint64_t r = 0;
    std::string::iterator i;
    std::string::iterator end = s.end();

    for (i = s.begin(); i != end; ++i) {
        char let = *i;
        if (let >= '0' && let <= '9') {
            r = r * 16 + (let - '0');      // shift one hex digit in
        } else if (let >= 'a' && let <= 'f') {
            r = r * 16 + (let - 'a' + 10);
        } else if (let >= 'A' && let <= 'F') {
            r = r * 16 + (let - 'A' + 10);
        }
    }
    return r;
}
This will just shift hex digits through until the string runs out, not caring about delimiters or total length. But it converts the input string to the desired uint64_t format.
#include <string>
#include <stdint.h>

uint64_t cvt(std::string &v)
{
    std::string::iterator i;
    std::string digits = "0123456789abcdefABCDEF";
    uint64_t result = 0;
    size_t pos = 0;

    i = v.begin();
    while (i != v.end())
    {
        // search for character in hex digits set
        pos = digits.find(*i);

        // if found in valid hex digits
        if (pos != std::string::npos)
        {
            // handle upper/lower case hex digit
            if (pos > 0xf)
            {
                pos -= 6;
            }
            // shift a nibble in
            result <<= 4;
            result |= pos;
        }
        ++i;
    }
    return result;
}
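For example (my snippet), the delimiters really are ignored, so a colon-separated MAC and a bare hex string give the same value:
#include <iostream>

int main() {
    std::string a = "00:00:12:24:36:4f", b = "00001224364f";
    std::cout << std::hex << cvt(a) << ' ' << cvt(b) << '\n'; // 1224364f 1224364f
    return 0;
}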
Another faster version without calling library functions:
inline unsigned read_hex_byte(char const** beg, char const* end) {
    if (end - *beg < 2)
        throw std::invalid_argument("");
    unsigned hi = (*beg)[0], lo = (*beg)[1];
    *beg += 2;
    hi -= hi >= '0' && hi <= '9' ? '0' :
          hi >= 'a' && hi <= 'f' ? 'a' - 10 :
          hi >= 'A' && hi <= 'F' ? 'A' - 10 :
          throw std::invalid_argument("");
    lo -= lo >= '0' && lo <= '9' ? '0' :
          lo >= 'a' && lo <= 'f' ? 'a' - 10 :
          lo >= 'A' && lo <= 'F' ? 'A' - 10 :
          throw std::invalid_argument("");
    return hi << 4 | lo;
}

uint64_t string_to_mac2(std::string const& s) {
    char const *beg = s.data(), *end = beg + s.size();
    uint64_t r;
    try {
        r = read_hex_byte(&beg, end);
        beg += beg != end && ':' == *beg;
        r = r << 8 | read_hex_byte(&beg, end);
        beg += beg != end && ':' == *beg;
        r = r << 8 | read_hex_byte(&beg, end);
        beg += beg != end && ':' == *beg;
        r = r << 8 | read_hex_byte(&beg, end);
        beg += beg != end && ':' == *beg;
        r = r << 8 | read_hex_byte(&beg, end);
        beg += beg != end && ':' == *beg;
        r = r << 8 | read_hex_byte(&beg, end);
    } catch (std::invalid_argument&) {
        beg = end - 1;
    }
    if (beg != end)
        throw std::runtime_error("invalid mac address format " + s);
    return r;
}
My 2 cents:
uint64_t ParseMac(const std::string& str)
{
    std::istringstream iss(str);
    uint64_t octet; // each extraction reads one full hex byte, not a single nibble
    uint64_t result(0);
    iss >> std::hex;
    while (iss >> octet) {
        result = (result << 8) + octet;
        iss.get(); // skip the ':' separator
    }
    return result;
}
A more C++11 way, without input data validation:
uint64_t stomac( const std::string &mac )
{
    static const std::regex r{ "([\\da-fA-F]{2})(:|$)" };
    auto it = std::sregex_iterator( mac.begin(), mac.end(), r );
    static const auto end = std::sregex_iterator();
    // the initial value must be a uint64_t, otherwise std::accumulate
    // deduces int and truncates the 48-bit result
    return std::accumulate( it, end, uint64_t( 0 ), []( uint64_t i, const std::sregex_iterator::value_type &v ) {
        return ( i << 8 ) + std::stol( v.str(1), nullptr, 16 );
    } );
}
You can also use the ASCII-to-struct-ether_addr conversion routine ether_aton, or its thread-safe version ether_aton_r (a GNU extension).
#include <netinet/ether.h>
#include <stdint.h>
#include <string>

#define ETHER_ADDR_ERR UINT64_C(~0)

uint64_t ether_atou64( const std::string& addr_str ) {
    union {
        uint64_t result;
        struct ether_addr address;
    };
    result = 0;
    struct ether_addr* ptr = ether_aton_r( addr_str.c_str(), &address );
    if( !ptr ) {
        return ETHER_ADDR_ERR;
    }
    // note: the octets land in memory order, so the integer value depends
    // on host endianness (unlike the shift-based answers above)
    return result;
}
Sorry, I cannot comment yet.
For the answer from @AB71E5, you need to change strtoul to strtoull.
E.g. 01:02:03:04:05:06 is 48 bits, but unsigned long is only 32 bits on many platforms.
The final result is:
#include <string>
#include <cstdint>
#include <algorithm>
#include <stdlib.h>

uint64_t convert_mac(std::string mac) {
    // Remove colons
    mac.erase(std::remove(mac.begin(), mac.end(), ':'), mac.end());
    // Convert to uint64_t (strtoull handles the full 48 bits)
    return strtoull(mac.c_str(), NULL, 16);
}