Unicode to CodePoint C++

How can I get the codepoint from a Unicode value?
According to the character code table, the code for the pictogram '丂' is 8140 (that is its GBK/GB18030 encoding), and its Unicode code point is \u4E02.
I made this app in C++ to try to get the code for a Unicode string value:
#include <iostream>
#include <atlstr.h>
#include <iomanip>
#include <codecvt>

void hex_print(const std::string& s);

int main()
{
    std::wstring test = L"丂";       // assign pictogram directly
    std::wstring test2 = L"\u4E02";  // assign value via Unicode

    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv1;
    std::string u8str = conv1.to_bytes(test);
    hex_print(u8str);

    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv2;
    std::string u8str2 = conv2.to_bytes(test2);
    hex_print(u8str2);

    return 1;
}

void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    for (unsigned char c : s)
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
    std::cout << std::dec << '\n';
}
Output:
00 81 00 40
4e 02
What can I do to get 00 81 00 40 when the value is \u4E02?

On Windows you can use WideCharToMultiByte; code page 54936 is GB18030:
int main()
{
    std::wstring test = L"丂";       // assign pictogram directly
    std::wstring test2 = L"\u4E02";  // assign value via Unicode

    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv1;
    std::string u8str = conv1.to_bytes(test);
    hex_print(u8str);

    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv2;
    std::string u8str2 = conv2.to_bytes(test2);
    hex_print(u8str2);

    // First call computes the required buffer size, second call does the conversion.
    int len = WideCharToMultiByte(54936, 0, test2.c_str(), -1, NULL, 0, NULL, NULL);
    char* strGB18030 = new char[len + 1];
    WideCharToMultiByte(54936, 0, test2.c_str(), -1, strGB18030, len, NULL, NULL);
    hex_print(std::string(strGB18030));
    delete[] strGB18030;

    return 1;
}
Output:
4e 02
4e 02
81 40
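For reference, 81 40 is exactly the GBK/GB18030 byte sequence the character table lists for '丂'. Here is a minimal sketch that wraps the usual two-call WideCharToMultiByte pattern into a helper (to_codepage is my own name, not from the answer above):

#include <string>
#include <windows.h>

// Hypothetical helper: convert a wide string to bytes in the given code page.
// First call queries the required size, second call does the conversion.
std::string to_codepage(const std::wstring& ws, UINT cp)
{
    int len = WideCharToMultiByte(cp, 0, ws.c_str(), (int)ws.size(),
                                  NULL, 0, NULL, NULL);
    std::string out(len, '\0');
    WideCharToMultiByte(cp, 0, ws.c_str(), (int)ws.size(),
                        &out[0], len, NULL, NULL);
    return out;
}

// hex_print(to_codepage(L"\u4E02", 54936)); // prints: 81 40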

Related

How do you convert a `std::string` hex value to an `unsigned char` [closed]

Sample input:
a8 49 7f ac 24 77 c3 6e 70 ca 99 ca fc e2 c5 7b
This function converts the hex values in the sample into strings, to be later converted into unsigned chars:
std::vector<unsigned char> cipher_as_chars(std::string cipher)
{
    std::vector<unsigned char> hex_char;
    int j = 0;
    for (int i = 0; i < cipher.length();)
    {
        std::string x = "";
        x = x + cipher[i] + cipher[i+1];
        unsigned char hexchar[2];
        strcpy( (char*) hexchar, x.c_str() );
        hex_char[j] = *hexchar;
        j++;
        cout << "Current Index : " << i << " " << x << " <> " << hexchar << endl;
        i = i + 3;
    }
    return hex_char;
}
As a very simple solution, you can use an istringstream, which allows parsing hex strings:
#include <cstdio>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

std::vector<unsigned char> cipher_as_chars(std::string const& cipher) {
    std::istringstream strm{cipher};
    strm >> std::hex;
    return {std::istream_iterator<int>{strm}, {}};
}

int main() {
    auto const cipher = "a8 49 7f ac 24 77 c3 6e 70 ca 99 ca fc e2 c5 7b";
    auto const sep = cipher_as_chars(cipher);
    for (auto elm : sep) {
        std::printf("%hhx ", elm);
    }
    std::putchar('\n');
}

c++ hex set of strings to array of bytes not return vector [duplicate]

This question already has answers here:
How to turn a hex string into an unsigned char array?
(7 answers)
Closed 2 years ago.
So basically, I want to create a function that
takes this hex pattern:
"03 C6 8F E2 18 CA 8C E2 94 FD BC E5 03 C6 8F E2"
and returns this array of bytes:
BYTE pattern[] = { 0x03, 0xC6, 0x8F, 0xE2, 0x18, 0xCA, 0x8C, 0xE2, 0x94, 0xFD, 0xBC, 0xE5, 0x03, 0xC6, 0x8F, 0xE2 };
My main problem is that I need each value, such as 0x03, in one byte cell of the output array, exactly as described above.
If I use this:
#include <windows.h>

std::vector<BYTE> strPatternToByte(const char* pattern, std::vector<BYTE> bytes)
{
    std::stringstream converter;
    std::istringstream ss( pattern );
    std::string word;
    while( ss >> word )
    {
        BYTE temp;
        converter << std::hex << "0x" + word;
        converter >> temp;
        bytes.push_back( temp );
    }
    return bytes;
}
int main()
{
    const char* pattern = "03 C6 8F E2 18 CA 8C E2 D4 FD BC E5 03 C6 8F E2";
    std::vector<BYTE> bytes;
    bytes = strPatternToByte(pattern, bytes);
    BYTE somePtr[16];
    for ( int i = 0; i < 16; i++)
    {
        somePtr[i] = bytes[i];
    }
    for (unsigned char i : somePtr)
    {
        std::cout << i << std::endl;
    }
    /*
     * output should be:
     * 0x03
     * 0xC6
     * 0x8F
     * etc.
     */
    return 0;
}
It doesn't actually do what I need: when I debug it and look at the bytes vector, I see it puts '0' in a cell, 'x' in a cell, '0' in a cell, '3' in a cell, which is not what I want. Is there any way to solve this kind of problem? The output isn't what it should be; the expected output is shown in the comment in the code above. The somePtr array is what my final output should be, as described above.
Thanks.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

template <typename T>
std::vector<uint8_t> bytesFromHex(std::basic_istream<T>& stream, size_t reserve = 0x100)
{
    std::vector<uint8_t> result;
    result.reserve(reserve);
    auto flags = stream.flags();
    stream.setf(std::ios_base::hex, std::ios_base::basefield);
    std::copy(std::istream_iterator<unsigned>{stream}, {}, std::back_inserter(result));
    stream.flags(flags);
    return result;
}

template <typename T>
std::vector<uint8_t> bytesFromHex(std::basic_string_view<T> s, size_t reserve = 0x100)
{
    std::basic_istringstream<T> stream{std::basic_string<T>{s}};
    return bytesFromHex(stream, reserve);
}
https://godbolt.org/z/aW915b
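A quick usage sketch of the string_view overload (my own example; it assumes the includes added above):

#include <cstdio>

int main() {
    // T is deduced as char from the std::string_view argument.
    auto bytes = bytesFromHex(std::string_view{"03 C6 8F E2 18 CA"});
    for (unsigned b : bytes)
        std::printf("0x%02X ", b);   // prints: 0x03 0xC6 0x8F 0xE2 0x18 0xCA
    std::putchar('\n');
}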
You have a number of minor errors.
Because you reuse your converter object, it no longer works after the first conversion: it gets into an error state. Try this version, which recreates the stringstream each time round the loop (you could also call converter.clear(); at the end of the loop):
while (ss >> word)
{
    int temp;
    std::stringstream converter;
    converter << std::hex << "0x" + word;
    converter >> temp;
    bytes.push_back(temp);
}
Also temp should be an int (you are reading integers after all).
The output loop is wrong; try this:
for (int i : somePtr)
{
    std::cout << "0x" << std::hex << std::setfill('0') << std::setw(2) << i << std::endl;
}
Again note that i is an int, and I've added formatting to get the effect you wanted. You will need to #include <iomanip>.
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

typedef unsigned char BYTE; // Remove this line if you've included windows.h

std::vector<BYTE> strPatternToByte(std::string hex_chars) {
    std::istringstream hex_chars_stream(hex_chars);
    std::vector<BYTE> bytes;
    unsigned c;
    hex_chars_stream >> std::hex;
    while (hex_chars_stream >> c) bytes.push_back(c);
    return bytes;
}

int main() {
    std::string hex_chars = "03 C6 8F E2 18 CA 8C E2 D4 FD BC E5 03 C6 8F E2";
    auto bytes = strPatternToByte(hex_chars);
    std::cout << std::setfill('0') << std::hex << std::uppercase;
    for (unsigned i : bytes) std::cout << "0x" << std::setw(2) << i << '\n';
}
Output:
0x03
0xC6
0x8F
0xE2
0x18
0xCA
0x8C
0xE2
0xD4
0xFD
0xBC
0xE5
0x03
0xC6
0x8F
0xE2

Convert a signed Int to a hex string with spaces

I was wondering if I could get some help converting an integer to a hex string with a space between each byte, like so:
int val = -2147483648;
char hexval[32];
sprintf(hexval, "%x", val);
Output = 80000000
How could I add spaces between each byte so I would have a string like 80 00 00 00?
Is there an easier way than malloc'ing memory and moving a pointer around?
Thanks!
A simple function:
/**
 * hexstr(char *str, int val);
 *
 * `str` needs to point to a char array with at least 12 elements.
 **/
int hexstr(char *str, int val) {
    return snprintf(str, 12, "%02hhx %02hhx %02hhx %02hhx", val >> 24, val >> 16, val >> 8, val);
}
Example:
#include <stdio.h>

int main(void) {
    int val = -2147483648;
    char hexval[12];
    hexstr(hexval, val);
    printf("Integer value: %d\n", val);
    printf("Result string: %s\n", hexval);
    return 0;
}
Integer value: -2147483648
Result string: 80 00 00 00
As an alternative, you may consider using std::hex. Example:
#include <iostream>

int main() {
    int n = 255;
    std::cout << std::hex << n << std::endl;
    return 0;
}
UPDATE:
A more flexible implementation that does not rely on printing the content can be:
void gethex(int n, std::ostream &o) {
    o << std::hex << n;
}
then
std::ostringstream ss;
gethex(myNumber, ss);
std::cout << "Hex number: " << ss.str() << std::endl;

How do I read a text file having Unicode codes?

I initialize a string using the following code.
std::string unicode8String = "\u00C1 M\u00F3ti S\u00F3l";
Printing it using cout, the output is Á Móti Sól.
But when I read the same string from a text file using ifstream, store it in a std::string, and print it, the output is \u00C1 M\u00F3ti S\u00F3l.
The content of my file is \u00C1 M\u00F3ti S\u00F3l and I want to print it as Á Móti Sól. Is there any way to do this?
Off the top of my head (completely untested)
std::string convert_string(const std::string& in)
{
    // Convert one hex digit to its numeric value (inputs are pre-checked with isxdigit).
    auto hexval = [](char c) {
        return c <= '9' ? c - '0' : tolower(c) - 'a' + 10;
    };
    std::string out;
    for (size_t i = 0; i < in.size(); )
    {
        if (i + 5 < in.size() && in[i] == '\\' && in[i+1] == 'u' &&
            in[i+2] == '0' && in[i+3] == '0' &&
            isxdigit(in[i+4]) && isxdigit(in[i+5]))
        {
            // Decode the two hex digits into a single (Latin-1) byte.
            out += (char)(16 * hexval(in[i+4]) + hexval(in[i+5]));
            i += 6;
        }
        else
        {
            out += in[i];
            ++i;
        }
    }
    return out;
}
But this won't work for Unicode values above 255 (e.g. \u1234), because you have the fundamental problem that your string stores 8-bit characters, and Unicode code points can need up to 21 bits.
As I said completely untested, but I'm sure you get the idea.
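A quick usage sketch, equally untested. Note the doubled backslashes: inside a C++ string literal, the file's \u00C1 has to be spelled \\u00C1. This assumes the convert_string function above:

#include <cctype>
#include <iostream>
#include <string>

int main()
{
    std::string raw = "\\u00C1 M\\u00F3ti S\\u00F3l"; // what the file actually contains
    std::cout << convert_string(raw) << '\n';         // Á Móti Sól on a Latin-1 terminal
}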
Can you try printing using std::wcout?
The Unicode characters have a different representation in a text file (there is no \u).
For evaluation:
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

int main()
{
    // Write
    {
        std::string s = "\u00C1 M\u00F3ti S\u00F3l";
        std::ofstream out("/tmp/test.txt");
        out << s;
    }
    // Read Text
    {
        std::string s;
        std::ifstream in("/tmp/test.txt");
        std::getline(in, s);
        std::cout << "Result: " << s << std::endl;
    }
    // Read Binary
    {
        std::ifstream in("/tmp/test.txt");
        in.unsetf(std::ios_base::skipws);
        std::istream_iterator<unsigned char> first(in);
        std::istream_iterator<unsigned char> last;
        std::vector<unsigned char> v(first, last);
        std::cout << "Result: ";
        for (unsigned c : v) std::cout << std::hex << c << ' ';
        std::cout << std::endl;
    }
    return 0;
}
On Linux with UTF8:
Result: Á Móti Sól
Result: c3 81 20 4d c3 b3 74 69 20 53 c3 b3 6c

How do I HTML-/ URL-Encode a std::wstring containing Unicode characters?

I have yet another question. If I had a std::wstring looking like this:
ドイツ語で検索していてこちらのサイトにたどり着きました。
How could I possibly get it to be URL-Encoded (%nn, n = 0-9, a-f) to:
%E3%83%89%E3%82%A4%E3%83%84%E8%AA%9E%E3%81%A7%E6%A4%9C%E7%B4%A2%E3%81%97%E3%81%A6%E3%81%84%E3%81%A6%E3%81%93%E3%81%A1%E3%82%89%E3%81%AE%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AB%E3%81%9F%E3%81%A9%E3%82%8A%E7%9D%80%E3%81%8D%E3%81%BE%E3%81%97%E3%81%9F%E3%80%82
... and also HTML-Encoded (&#nnn(nn);, n = 0-9(?)) to:
&#12489;&#12452;&#12484;&#35486;&#12391;&#26908;&#32034;&#12375;&#12390;&#12356;&#12390;&#12371;&#12385;&#12425;&#12398;&#12469;&#12452;&#12488;&#12395;&#12383;&#12393;&#12426;&#30528;&#12365;&#12414;&#12375;&#12383;&#12290;
Please help me, as I am totally lost right now and don't even know where to start. By the way, performance isn't very important to me right now.
Thanks in advance!
Here is an example which shows two methods, one based on the Qt library and one based on the ICU library. Both should be fairly platform-independent:
#include <iostream>
#include <sstream>
#include <iomanip>
#include <stdexcept>

#include <boost/scoped_array.hpp>

#include <QtCore/QString>
#include <QtCore/QUrl>
#include <QtCore/QVector>

#include <unicode/utypes.h>
#include <unicode/ustring.h>
#include <unicode/unistr.h>
#include <unicode/schriter.h>

void encodeQt() {
    const QString str = QString::fromWCharArray(L"ドイツ語で検索していてこちらのサイトにたどり着きました。");
    const QUrl url = str;
    std::cout << "URL encoded: " << url.toEncoded().constData() << std::endl;

    typedef QVector<uint> CodePointVector;
    const CodePointVector codePoints = str.toUcs4();
    std::stringstream htmlEncoded;
    for (CodePointVector::const_iterator it = codePoints.constBegin(); it != codePoints.constEnd(); ++it) {
        htmlEncoded << "&#" << *it << ';';
    }
    std::cout << "HTML encoded: " << htmlEncoded.str() << std::endl;
}

void encodeICU() {
    const std::wstring cppString = L"ドイツ語で検索していてこちらのサイトにたどり着きました。";

    int bufSize = cppString.length() * 2;
    boost::scoped_array<UChar> strBuffer(new UChar[bufSize]);
    int size = 0;
    UErrorCode error = U_ZERO_ERROR;
    u_strFromWCS(strBuffer.get(), bufSize, &size, cppString.data(), cppString.length(), &error);
    if (error) return;
    const UnicodeString str(strBuffer.get(), size);

    bufSize = str.length() * 4;
    boost::scoped_array<char> buffer(new char[bufSize]);
    u_strToUTF8(buffer.get(), bufSize, &size, str.getBuffer(), str.length(), &error);
    if (error) return;
    const std::string urlUtf8(buffer.get(), size);

    std::stringstream urlEncoded;
    urlEncoded << std::hex << std::setfill('0');
    for (std::string::const_iterator it = urlUtf8.begin(); it != urlUtf8.end(); ++it) {
        urlEncoded << '%' << std::setw(2) << static_cast<unsigned int>(static_cast<unsigned char>(*it));
    }
    std::cout << "URL encoded: " << urlEncoded.str() << std::endl;

    std::stringstream htmlEncoded;
    StringCharacterIterator it = str;
    while (it.hasNext()) {
        const UChar32 pt = it.next32PostInc();
        htmlEncoded << "&#" << pt << ';';
    }
    std::cout << "HTML encoded: " << htmlEncoded.str() << std::endl;
}

int main() {
    encodeQt();
    encodeICU();
}
You see, before you can convert a character to a URL escape sequence, you have to convert your wstring* into a byte encoding; the escape sequences in your example are UTF-8 bytes. ICU could be a good place to start: you can pass your wstring to it and get back a UTF-8 sequence. Then simply iterate through the resulting chars and convert each one to an escape sequence:

std::stringstream URL;
URL << std::hex << std::setfill('0');
for (auto it = myUtf8String.begin(); it != myUtf8String.end(); ++it) // myUtf8String: the UTF-8 bytes from ICU
    URL << '%' << std::setw(2) << (int)(unsigned char)*it;

Take a look here for more info on how to format the string.
* I'm assuming that your wstring is UTF-16, which usually is the case, although you didn't specify.
Here's a version that converts from UTF-16 (wchar) to hex-encoded UTF-8 using the Win32-specific WideCharToMultiByte() function.
#include <string>
#include <iostream>
#include <stdio.h>
#include <windows.h>

std::string wstring_to_utf8_hex(const std::wstring &input)
{
    std::string output;
    int cbNeeded = WideCharToMultiByte(CP_UTF8, 0, input.c_str(), -1, NULL, 0, NULL, NULL);
    if (cbNeeded > 0) {
        char *utf8 = new char[cbNeeded];
        if (WideCharToMultiByte(CP_UTF8, 0, input.c_str(), -1, utf8, cbNeeded, NULL, NULL) != 0) {
            for (char *p = utf8; *p; p++) {  // note: p++, not *p++
                char onehex[5];
                _snprintf(onehex, sizeof(onehex), "%%%02.2X", (unsigned char)*p);
                output.append(onehex);
            }
        }
        delete[] utf8;
    }
    return output;
}

int main(int, char*[])
{
    std::wstring ja = L"ドイツ語で検索していてこちらのサイトにたどり着きました。";
    std::cout << "result=" << wstring_to_utf8_hex(ja) << std::endl;
    return 0;
}
To go the other way, you'll need to use some parsing to decode the hex values into a UTF-8 buffer, and then call the complementary MultiByteToWideChar() to get it back into a wchar array.
#include <string>
#include <iostream>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

std::string unhexlify(const std::string &input)
{
    std::string output;
    for (const char *p = input.c_str(); *p; ) {
        if (p[0] == '%' && isxdigit(p[1]) && isxdigit(p[2])) {
            // %XX percent escape
            int ch = (isdigit(p[1]) ? p[1] - '0' : toupper(p[1]) - 'A' + 10) * 16 +
                     (isdigit(p[2]) ? p[2] - '0' : toupper(p[2]) - 'A' + 10);
            output.push_back((char)ch);
            p += 3;
        } else if (p[0] == '&' && p[1] == '#' && isdigit(p[2])) {
            // &#nnn; decimal HTML entity (only values below 256 survive the cast)
            int ch = atoi(p + 2);
            output.push_back((char)ch);
            p += 2;
            while (*p && isdigit(*p)) p++;
            if (*p == ';') p++;
        } else {
            output.push_back(*p++);
        }
    }
    return output;
}

std::wstring utf8_hex_to_wstring(const std::string &input)
{
    std::wstring output;
    std::string utf8 = unhexlify(input);
    int cchNeeded = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, NULL, 0);
    if (cchNeeded > 0) {
        wchar_t *widebuf = new wchar_t[cchNeeded];
        if (MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, widebuf, cchNeeded) != 0) {
            output = widebuf;
        }
        delete[] widebuf;
    }
    return output;
}
int main(int, char*[])
{
    std::wstring ja = L"ドイツ語で検索していてこちらのサイトにたどり着きました。";
    std::string hex = "%E3%83%89%E3%82%A4%E3%83%84%E8%AA%9E%E3%81%A7%E6%A4%9C%E7%B4%A2%E3%81%97%E3%81%A6%E3%81%84%E3%81%A6%E3%81%93%E3%81%A1%E3%82%89%E3%81%AE%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AB%E3%81%9F%E3%81%A9%E3%82%8A%E7%9D%80%E3%81%8D%E3%81%BE%E3%81%97%E3%81%9F%E3%80%82";
    std::wstring newja = utf8_hex_to_wstring(hex);
    std::cout << "match?=" << (newja == ja ? "yes" : "no") << std::endl;
    return 0;
}
First, convert to UTF-8.
Then a normal URL/HTML encode will do the right thing.
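A minimal sketch of that two-step idea in portable C++11 (my own example; std::wstring_convert is deprecated since C++17, and codecvt_utf8 only handles BMP characters on platforms with 16-bit wchar_t):

#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>

std::string url_encode_wide(const std::wstring& ws)
{
    // Step 1: wide string -> UTF-8 bytes.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string utf8 = conv.to_bytes(ws);

    // Step 2: percent-encode every byte. A real encoder would pass
    // unreserved ASCII (A-Z a-z 0-9 - _ . ~) through unescaped.
    std::string out;
    char buf[4];
    for (unsigned char c : utf8) {
        std::snprintf(buf, sizeof buf, "%%%02X", c);
        out += buf;
    }
    return out;
}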
I find that in C# it's simple, so I use C++/CLI as a wrapper around the C# code:
string encodedStr = System.Web.HttpUtility.UrlEncode(inputstr);
In C++/CLI, mark the method as __declspec(dllexport) so it can be called from C++. The C++/CLI syntax is:
String^ encodedStr = System::Web::HttpUtility::UrlEncode(inputStr);
Here is a tutorial about how to call C++/CLI from C++: How to call a C# library from Native C++ (using C++/CLI and IJW)