I'm trying to convert Unicode code points to percent encoded UTF-8 code units.
The Unicode -> UTF-8 conversion seems to be working correctly: in testing with Hindi and Chinese characters, the output shows up correctly in Notepad++ with UTF-8 encoding and can be translated back properly.
I thought the percent encoding would be as simple as adding '%' in front of each UTF-8 code unit, but that doesn't quite work. Rather than the expected %E5%84%A3, I'm seeing %xE5%x84%xA3 (for the code point U+5123).
What am I doing wrong?
Added code (note that utf8.h belongs to the UTF8-CPP library).
#include <fstream>
#include <iostream>
#include <vector>
#include "utf8.h"
std::string unicode_to_utf8_units(int32_t unicode)
{
    unsigned char u[5] = {0, 0, 0, 0, 0};
    unsigned char *iter = u, *limit = utf8::append(unicode, u);
    std::string s;
    for (; iter != limit; ++iter) {
        s.push_back(*iter);
    }
    return s;
}

int main()
{
    std::ofstream ofs("test.txt", std::ios_base::out);
    if (!ofs.good()) {
        std::cout << "ofstream encountered a problem." << std::endl;
        return 1;
    }

    utf8::uint32_t unicode = 0x5123;
    auto s = unicode_to_utf8_units(unicode);
    for (auto &c : s) {
        ofs << "%" << c;
    }

    ofs.close();
    return 0;
}
You actually need to convert each byte value to the corresponding ASCII hex string. For example:
"é" in UTF-8 is the byte sequence { 0xC3, 0xA9 }. Please note that these are bytes, i.e. char values in C++.
Each byte needs to be converted to "%C3" and "%A9" respectively.
The best way to do so is to use a std::ostringstream (from <sstream>):
std::ostringstream out;
std::string utf8str = "\xE5\x84\xA3";
for (std::size_t i = 0; i < utf8str.length(); ++i) {
    out << '%' << std::hex << std::uppercase << (int)(unsigned char)utf8str[i];
}
Or in C++11:
for (auto c : utf8str) {
    out << '%' << std::hex << std::uppercase << (int)(unsigned char)c;
}
Please note that each byte needs to be cast to int; otherwise operator<< would print it as a character rather than as a number.
Casting to unsigned char first is also needed, because otherwise the sign bit would propagate into the int value, producing output such as FFFFFFE5.
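For completeness, here is a self-contained sketch of the same approach. The percent_encode helper name is mine, and the std::setw(2)/std::setfill('0') padding is an addition so that byte values below 0x10 still come out as two hex digits (the snippet above would print them with a single digit). Note that a real URL encoder would leave unreserved ASCII characters unescaped; this sketch encodes every byte, as the question asks.

#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

// Percent-encode every byte of an already UTF-8 encoded string.
std::string percent_encode(const std::string& utf8str)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (unsigned char byte : utf8str) {                  // unsigned char avoids sign extension
        out << '%' << std::setw(2) << static_cast<int>(byte);
    }
    return out.str();
}

int main()
{
    std::cout << percent_encode("\xE5\x84\xA3") << '\n';  // prints %E5%84%A3
}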
Related
I have trouble transforming a string to lowercase with the tolower() function in C++. With normal strings it works as expected; however, special characters are not converted successfully.
This is how I use it:
string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    LowerCase += tolower(NotLowerCase[i]);
}
For example:
1. Test -> test
2. TeST2 -> test2
3. Grüßen -> gr????en
4. (§) -> ()
Examples 3 and 4 are not working as expected, as you can see.
How can I fix this issue? I have to keep the special characters, but as lowercase.
The sample code below, taken from the reference documentation for std::tolower, shows how to fix this: you have to use something other than the default "C" locale.
#include <iostream>
#include <cctype>
#include <clocale>

int main()
{
    unsigned char c = '\xb4'; // the character Ž in ISO-8859-15,
                              // but ´ (acute accent) in ISO-8859-1
    std::setlocale(LC_ALL, "en_US.iso88591");
    std::cout << std::hex << std::showbase;
    std::cout << "in iso8859-1, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
    std::setlocale(LC_ALL, "en_US.iso885915");
    std::cout << "in iso8859-15, tolower('0xb4') gives "
              << std::tolower(c) << '\n';
}
You might also change std::string to std::wstring, which holds Unicode on many C++ implementations.
wstring NotLowerCase = L"Grüßen";
wstring LowerCase;
for (auto&& ch : NotLowerCase) {
    LowerCase += towlower(ch);
}
Guidance from Microsoft is to "Normalize strings to uppercase", so you might use toupper or towupper instead.
Keep in mind that a character-by-character transformation might not work well for some languages. For example, in German as spoken in Germany, making Grüßen all upper-case turns it into GRÜSSEN, with ß becoming SS (although there is now a capital ẞ). There are numerous other "problems", such as combining characters; if you're doing real "production" work with strings, you really want a completely different approach.
Finally, C++ has more sophisticated support for managing locales, see <locale> for details.
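For instance, here is a minimal sketch using the two-argument std::tolower overload from <locale> on a wide string. The "en_US.UTF-8" locale name is an assumption: it must be installed on the system, otherwise the std::locale constructor throws.

#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::locale loc("en_US.UTF-8");   // assumed to be installed; throws std::runtime_error otherwise
    std::wstring in = L"Grüßen";
    std::wstring out;
    for (wchar_t ch : in) {
        out += std::tolower(ch, loc); // per-character, locale-aware lowering
    }
    std::wcout.imbue(loc);
    std::wcout << out << L'\n';       // prints "grüßen"
}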
I think the most portable way to do this is to use the user-selected locale, which is achieved by constructing the locale from "" (the empty string).
std::locale::global(std::locale(""));
That sets the locale to whatever was in use where the program was run, and it affects the standard character conversion routines (std::mbsrtowcs and std::wcsrtombs) that convert between multi-byte and wide-string characters.
Then you can use those functions to convert from the system/user-selected multi-byte encoding (such as UTF-8) to the system's wide-character codes, which can be used with functions like std::tolower that operate on one character at a time.
This is important because multi-byte encodings like UTF-8 cannot be processed with single-character operations such as std::tolower().
Once you have converted the wide-string version to upper/lower case, it can be converted back to the system/user multi-byte character set for printing to the console.
#include <cstdlib>     // EXIT_SUCCESS / EXIT_FAILURE
#include <cwchar>      // std::mbsrtowcs / std::wcsrtombs
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// Convert from multi-byte codes to wide string codes
std::wstring mb_to_ws(std::string const& mb)
{
    std::wstring ws;
    std::mbstate_t ps{};
    char const* src = mb.data();
    // With a null destination the length argument is ignored; this call
    // only computes the number of wide characters needed.
    std::size_t len = 1 + std::mbsrtowcs(nullptr, &src, 0, &ps);
    ws.resize(len);
    src = mb.data();
    std::mbsrtowcs(&ws[0], &src, ws.size(), &ps);
    if (src)
        throw std::runtime_error("invalid multibyte character after: '"
            + std::string(mb.data(), src) + "'");
    ws.pop_back(); // drop the terminating null written by mbsrtowcs
    return ws;
}

// Convert from wide string codes to multi-byte codes
std::string ws_to_mb(std::wstring const& ws)
{
    std::string mb;
    std::mbstate_t ps{};
    wchar_t const* src = ws.data();
    std::size_t len = 1 + std::wcsrtombs(nullptr, &src, 0, &ps);
    mb.resize(len);
    src = ws.data();
    std::wcsrtombs(&mb[0], &src, mb.size(), &ps);
    if (src)
        throw std::runtime_error("invalid wide character");
    mb.pop_back(); // drop the terminating null written by wcsrtombs
    return mb;
}
int main()
{
    // set locale to the one chosen by the user
    // (or the one set by the system default)
    std::locale::global(std::locale(""));

    try
    {
        std::string NotLowerCase = "Grüßen";
        std::cout << NotLowerCase << '\n';

        // convert system/user multibyte character codes
        // to wide string versions
        std::wstring ws1 = mb_to_ws(NotLowerCase);
        std::wstring ws2;
        for (unsigned int i = 0; i < ws1.length(); i++) {
            // use the system/user locale
            ws2 += std::tolower(ws1[i], std::locale(""));
        }

        // convert wide string character codes back
        // to system/user multibyte versions
        std::string LowerCase = ws_to_mb(ws2);
        std::cout << LowerCase << '\n';
    }
    catch (std::exception const& e)
    {
        std::cerr << e.what() << '\n';
        return EXIT_FAILURE;
    }
    catch (...)
    {
        std::cerr << "Unknown exception." << '\n';
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
Code not heavily tested
Use ASCII only:
string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
    if (NotLowerCase[i] < 65 || NotLowerCase[i] > 122) {
        LowerCase += '?';
    }
    else {
        LowerCase += tolower(NotLowerCase[i]);
    }
}
I want to convert a string/char to \uXXXX format in a C/C++ program.
Suppose I have the character 'A'; I want to convert it to \u0041 (its standard Unicode escape).
Secondly, I was using a Unix command-line utility (printf) to print a \uXXXX string as a character. When I tried "\u092b", it printed a different character than the one in my font file. Can anyone please explain the reason behind this?
Here's a function using standard C++ to do this (though depending on CharT it relies on behavior that is implementation-defined and that not every conforming implementation provides).
#include <codecvt>
#include <sstream>
#include <iomanip>
#include <iostream>

template<typename CharT, typename traits, typename allocator>
std::basic_string<CharT, traits, allocator>
to_uescapes(std::basic_string<CharT, traits, allocator> const &input)
{
    // String converter from CharT to char. If CharT is char then no conversion is done.
    // If CharT is char32_t or char16_t then the conversion is UTF-32/16 -> UTF-8
    // (not all implementations support this yet).
    // If CharT is something else then this uses implementation-defined encodings
    // and will only work for us if the implementation uses UTF-8 as the narrow encoding.
    std::wstring_convert<std::codecvt<CharT, char, std::mbstate_t>, CharT> convertA;

    // String converter from UTF-8 -> UTF-32 (not all implementations support this yet).
    std::wstring_convert<std::codecvt<char32_t, char, std::mbstate_t>, char32_t> convertB;

    // Convert from the input encoding to UTF-32 (assuming convertA produces UTF-8).
    std::u32string u32input = convertB.from_bytes(convertA.to_bytes(input));

    std::basic_stringstream<CharT, traits, allocator> ss;
    ss.fill('0');
    ss << std::hex;
    for (char32_t c : u32input) {
        if (c < U'\U00010000')
            ss << convertA.from_bytes("\\u") << std::setw(4) << (unsigned int)c;
        else
            ss << convertA.from_bytes("\\U") << std::setw(8) << (unsigned int)c;
    }
    return ss.str();
}

template<typename CharT>
std::basic_string<CharT>
to_uescapes(CharT const *input)
{
    return to_uescapes(std::basic_string<CharT>(input));
}

int main() {
    std::string s = to_uescapes(u8"Hello \U00010000");
    std::cout << s << '\n';
}
This should print:
\u0048\u0065\u006c\u006c\u006f\u0020\U00010000
I'm building an application that receives runtime strings with encoded Unicode over TCP; an example string would be "\u7cfb\u8eca\u4e21\uff1a\u6771\u5317 ...". I have the following, but unfortunately I can only benefit from it at compile time, because of "incomplete universal character name \u": the compiler expects 4 hexadecimal characters at compile time.
QString restoreUnicode(QString strText)
{
    QRegExp rx("\\\\u([0-9a-z]){4}");
    return strText.replace(rx, QString::fromUtf8("\u\\1"));
}
I'm seeking a solution at runtime. I could break these strings up and convert the hexadecimal digits after each "\u" delimiter to base 10, then pass them into the constructor of a QChar, but I'm looking for a better way if one exists, as I am concerned about the time complexity incurred by such a method and am not an expert.
Does anyone have any solutions or tips?
You should decode the string yourself. Just find the Unicode escape (rx.indexIn(strText)), parse its hex digits (int result; std::istringstream iss(s); if (!(iss >> std::hex >> result).fail()) ...) and replace the original \\uXXXX substring with (wchar_t)result.
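As a rough illustration of that approach in plain standard C++ (the function name decode_u_escapes is mine, and it assumes well-formed \uXXXX sequences with exactly four hex digits and BMP code points only):

#include <iostream>
#include <sstream>
#include <string>

// Replace every \uXXXX escape with the corresponding wide character.
std::wstring decode_u_escapes(const std::string& in)
{
    std::wstring out;
    for (std::size_t i = 0; i < in.size(); ) {
        if (in[i] == '\\' && i + 5 < in.size() && in[i + 1] == 'u') {
            std::istringstream iss(in.substr(i + 2, 4));
            unsigned int code;
            if (iss >> std::hex >> code) {      // four hex digits -> code point
                out += static_cast<wchar_t>(code);
                i += 6;
                continue;
            }
        }
        out += static_cast<wchar_t>(static_cast<unsigned char>(in[i])); // copy plain characters as-is
        ++i;
    }
    return out;
}

int main()
{
    std::wstring ws = decode_u_escapes("\\u7cfb\\u8eca\\u4e21");
    std::wcout << std::hex;
    for (wchar_t c : ws)
        std::wcout << static_cast<unsigned long>(c) << L'\n';  // prints 7cfb, 8eca, 4e21
}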
For closure, and for anyone who comes across this thread in the future, here is my initial solution (before optimising the scope of these variables). I'm not a fan of it, but it works given the unpredictable mix of Unicode and/or ASCII in the stream, over which I have no control (client side only). While the Unicode presence is low, it is better to handle it than to show ugly \u1234 sequences.
QString restoreUnicode(QString strText)
{
    QRegExp rxUnicode("\\\\u([0-9a-z]){4}");

    bool bSuccessFlag;
    int iSafetyOffset = 0;
    int iNeedle = strText.indexOf(rxUnicode, iSafetyOffset);

    while (iNeedle != -1)
    {
        QChar cCodePoint(strText.mid(iNeedle + 2, 4).toInt(&bSuccessFlag, 16));
        if (bSuccessFlag)
            strText = strText.replace(strText.mid(iNeedle, 6), QString(cCodePoint));
        else
            iSafetyOffset = iNeedle + 1; // hop over non code point to avoid lock
        iNeedle = strText.indexOf(rxUnicode, iSafetyOffset);
    }

    return strText;
}
#include <assert.h>
#include <iostream>
#include <string>
#include <sstream>
#include <locale>
#include <codecvt>   // C++11
using namespace std;

int main()
{
    char const data[] = "\\u7cfb\\u8eca\\u4e21\\uff1a\\u6771\\u5317";
    istringstream stream(data);

    wstring ws;
    int code;
    char slashCh, uCh;
    while (stream >> slashCh >> uCh >> hex >> code)
    {
        assert(slashCh == '\\' && uCh == 'u');
        ws += wchar_t(code);
    }

    cout << "Unicode code points:" << endl;
    for (auto it = ws.begin(); it != ws.end(); ++it)
    {
        cout << hex << 0 + *it << endl;
    }
    cout << endl;

    // The following is C++11 specific.
    cout << "UTF-8 encoding:" << endl;
    wstring_convert<codecvt_utf8<wchar_t>> converter;
    string const bytes = converter.to_bytes(ws);
    for (auto it = bytes.begin(); it != bytes.end(); ++it)
    {
        cout << hex << 0 + (unsigned char)*it << ' ';
    }
    cout << endl;
}
I have the following string: s = "80". I need to put this in an unsigned char k[]. The array should look like this: unsigned char k[] = {0x38, 0x30}, where 0x38 is '8' and 0x30 is '0' (the hexadecimal character codes for '8' and '0'). How do I do this? I need some help.
Please give some code. Thanks.
I am working in C++ on Ubuntu.
I use this for encryption; I need 0x38 in an unsigned char. Please help, I need some code.
EDIT:
How do I obtain the decimal character values and put them in an unsigned char k[]?
I've realised that it's fine if the unsigned char[] holds the decimal values {56, 48} of the '8' and '0' that I have in the string.
Assuming you want this string interpreted as ASCII (or UTF-8), it is already in the correct format.
std::string s="80";
std::cout << "0x" << std::hex << static_cast<int>(s[0]) << "\n";
std::cout << "0x" << std::hex << static_cast<int>(s[1]) << "\n";
If you want it in an int array, then just copy it:
int data[2];
std::copy(s.begin(), s.end(), data);
I think that no matter whether you store '8' or 0x38, the computer represents them with the same binary value.
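A quick way to convince yourself of that, assuming an ASCII-compatible execution character set (my assumption, not something the answer states):

#include <iostream>

int main()
{
    static_assert('8' == 0x38, "assumes an ASCII-compatible execution character set");
    std::cout << static_cast<int>('8') << '\n';   // prints 56, i.e. 0x38
}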
I think you do not really understand what you are asking.
The following are synonyms:
std::string s = "\x38\x30";
std::string s = "80";
As the following are synonyms:
char c = '8',  d = '0';
char c = s[0], d = s[1];
char c = 0x38, d = 0x30;
It is exactly the same (unless your basic character encoding is not ASCII). This is not encryption.
std::string s = "80";
unsigned char *pArray = new unsigned char[s.size()];

const char *p = s.c_str();
unsigned char *p2 = pArray;
while (*p)
    *p2++ = *p++;

delete [] pArray;
You can try this. I did not write this code myself; I found it, just like you can.
#include <algorithm>
#include <sstream>
#include <iostream>
#include <iterator>
#include <iomanip>
#include <string>
#include <cstdlib>

namespace {
    const std::string test = "mahmutefe";
}

int main() {
    std::ostringstream result;
    result << std::setw(2) << std::setfill('0') << std::hex << std::uppercase;
    std::copy(test.begin(), test.end(), std::ostream_iterator<unsigned int>(result, " "));
    std::cout << test << ":" << result.str() << std::endl;
    system("PAUSE");
}
Convert that string to a char array, then subtract '0' from each char.
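A minimal sketch of what that answer describes; note that subtracting '0' gives the numeric digit values {8, 0}, not the character codes {0x38, 0x30} mentioned in the question:

#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::string s = "80";
    std::vector<unsigned char> digits;
    for (char c : s) {
        digits.push_back(static_cast<unsigned char>(c - '0'));   // '8' -> 8, '0' -> 0
    }
    for (unsigned char d : digits) {
        std::cout << static_cast<int>(d) << ' ';                  // prints "8 0"
    }
    std::cout << '\n';
}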
I have tried to find this topic on the web, but I couldn't find the answer I need.
I have a string of characters:
char * tempBuf = "qj";
The result I want is 0x716A, and that value is then going to be converted into a decimal value.
Is there any function in VC++ that can be used for that?
You can use a stringstream to convert each character to a hexadecimal representation.
#include <iostream>
#include <sstream>
#include <cstring>

int main()
{
    const char* tempBuf = "qj";

    std::stringstream ss;
    const char* it = tempBuf;
    const char* end = tempBuf + std::strlen(tempBuf);
    for (; it != end; ++it)
        ss << std::hex << unsigned(static_cast<unsigned char>(*it)); // go via unsigned char so bytes >= 0x80 don't sign-extend

    unsigned result;
    ss >> result; // the stream's hex flag also applies to extraction
    std::cout << "Hex value: " << std::hex << result << std::endl;
    std::cout << "Decimal value: " << std::dec << result << std::endl;
}
So, if I understood the idea correctly...
#include <stdint.h>

// Packs up to four bytes of the string into a uint32_t.
// Note: the resulting value depends on the machine's byte order.
uint32_t charToUInt32(const char* src) {
    uint32_t ret = 0;
    char* dst = (char*)&ret;
    for (int i = 0; (i < 4) && (*src); ++i, ++src)
        dst[i] = *src;
    return ret;
}
If I understand what you want correctly: just loop over the characters, start to finish; at each character, multiply the sum so far by 256 and add the value of the current character. That gives the decimal value in one shot.
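A minimal sketch of that accumulation (assuming the input is at most four characters, so the result fits in a 32-bit unsigned value):

#include <cstdint>
#include <iostream>

int main()
{
    const char* tempBuf = "qj";
    std::uint32_t value = 0;
    for (const char* p = tempBuf; *p; ++p) {
        value = value * 256 + static_cast<unsigned char>(*p);   // shift in the next byte
    }
    std::cout << value << '\n';   // prints 29034, which is 0x716A
}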
What you are looking for is called "hex encoding". There are a lot of libraries out there that can do that (unless what you were looking for was how to implement one yourself).
One example is Crypto++.
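If I remember the Crypto++ API correctly, hex-encoding a buffer looks roughly like the sketch below; treat it as untested and check the library documentation for the exact signatures.

#include <iostream>
#include <string>
#include <cryptopp/hex.h>       // HexEncoder
#include <cryptopp/filters.h>   // StringSource, StringSink

int main()
{
    std::string input = "qj";
    std::string encoded;
    // Pump the input through a HexEncoder into a StringSink.
    CryptoPP::StringSource ss(input, true,
        new CryptoPP::HexEncoder(new CryptoPP::StringSink(encoded)));
    std::cout << encoded << '\n';   // expected: 716A (HexEncoder defaults to uppercase)
}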