How to parse std::string containing unicode literals? - c++

I have std::string which stores characters encoded in UTF. Example:
std::string a = "\\u00c1\\u00c4\\u00d3";
Note that the length of a is 18 (3 characters, 6 ASCII symbols for each UTF character).
Question: How can I convert a into C++ string that have only 3 characters? Are there any standard functions (libraries) to do that?

There is nothing in the standard C++ library to handle this kind of conversion automatically for you. You are going to have to parse this string yourself, manually converting each 6-char "\uXXXX" substring into a 1-wchar value 0xXXXX that you can then store into a std::wstring or std::u16string as needed.
For example:
std::string a = "\\u00c1\\u00c4\\u00d3";
std::wstring ws;
ws.reserve(a.size());
for(size_t i = 0; i < a.size();)
{
char ch = a[i++];
if ((ch == '\\') && (i < a.size()) && (a[i] == 'u'))
{
wchar_t wc = static_cast<wchar_t>(std::stoi(a.substr(++i, 4), nullptr, 16));
i += 4;
ws.push_back(wc);
}
else
{
// depending on the charset used for encoding the string,
// this may or may not need to be decoded further...
ws.push_back(static_cast<wchar_t>(ch));
}
}
Live Demo
Alternatively:
std::string a = "\\u00c1\\u00c4\\u00d3";
std::wstring ws;
ws.reserve(a.size());
size_t start = 0;
do
{
size_t found = a.find("\\u", start);
if (found == std::string::npos) break;
if (start < found)
{
// depending on the charset used for encoding the string,
// this may or may not need to be decoded further...
ws.insert(ws.end(), a.begin()+start, a.begin()+found);
}
wchar_t wc = static_cast<wchar_t>(std::stoi(a.substr(found+2, 4), nullptr, 16));
ws.push_back(wc);
start = found + 6;
}
while (true);
if (start < a.size())
{
// depending on the charset used for encoding the string,
// this may or may not need to be decoded further...
ws.insert(ws.end(), a.begin()+start, a.end());
}
Live Demo
Otherwise, use a 3rd party library that already does this kind of translation for you.

Related

Split string path with space

I am writing a program that should receive 3 parameters by User: file_upload "local_path" "remote_path"
code example:
std::vector split(std::string str, char delimiter) {
std::vector<string> v;
std::stringstream src(str);
std::string buf;
while(getline(src, buf, delimiter)) {
v.push_back(buf);
}
return v;
}
void function() {
std::string input
getline(std::cin, input);
// user input like this: file_upload /home/Space Dir/file c:\dir\file
std::vector<std::string> v_input = split(input, ' ');
// the code will do something like this
if(v_input[0].compare("file_upload") == 0) {
FILE *file;
file = fopen(v_input[1].c_str(), "rb");
send_upload_dir(v_input[2].c_str());
// bla bla bla
}
}
My question is: the second and third parameter are directories, then they can contain spaces in name. How can i make the split function does not change the spaces of the second and third parameter?
I thought to put quotes in directories and make a function to recognize, but not work 100% because the program has other functions that take only 2 parameters not three. can anyone help?
EDIT: /home/user/Space Dir/file.out <-- path with space name.
If this happens the vector size is greater than expected, and the path to the directory will be broken.. this can not happen..
the vector will contain something like this:
vector[1] = /home/user/Space
vector[2] = Dir/file.out
and what I want is this:
vector[1] = /home/user/Space Dir/file.out
Since you need to accept three values from a single string input, this is a problem of encoding.
Encoding is sometimes done by imposing fixed-width requirements on some or all fields, but that's clearly not appropriate here, since we need to support variable-width file system paths, and the first value (which appears to be some kind of mode specifier) may be variable-width as well. So that's out.
This leaves 4 possible solutions for variable-width encoding:
1: Unambiguous delimiter.
If you can select a separator character that is guaranteed never to show up in the delimited values, then you can split on that. For example, if NUL is guaranteed never to be part of the mode value or the path values, then we can do this:
std::vector<std::string> v_input = split(input,'\0');
Or maybe the pipe character:
std::vector<std::string> v_input = split(input,'|');
Hence the input would have to be given like this (for the pipe character):
file_upload|/home/user/Space Dir/file.out|/home/user/Other Dir/blah
2: Escaping.
You can write the code to iterate through the input line and properly split it on unescaped instances of the separator character. Escaped instances will not be considered separators. You can parameterize the escape character. For example:
std::vector<std::string> escapedSplit(std::string str, char delimiter, char escaper ) {
std::vector<std::string> res;
std::string cur;
for (size_t i = 0; i < str.size(); ++i) {
if (str[i] == delimiter) {
res.push_back(cur);
cur.clear();
} else if (str[i] == escaper) {
++i;
if (i == str.size()) break;
cur.push_back(str[i]);
} else {
cur.push_back(str[i]);
} // end if
} // end for
if (!cur.empty()) res.push_back(cur);
return res;
} // end escapedSplit()
std::vector<std::string> v_input = escapedSplit(input,' ','\\');
With input as:
file_upload /home/user/Space\ Dir/file.out /home/user/Other\ Dir/blah
3: Quoting.
You can write the code to iterate through the input line and properly split it on unquoted instances of the separator character. Quoted instances will not be considered separators. You can parameterize the quote character.
A complication of this approach is that it is not possible to include the quote character itself inside a quoted extent unless you introduce an escaping mechanism, similar to solution #2. A common strategy is to allow repetition of the quote character to escape it. For example:
std::vector<std::string> quotedSplit(std::string str, char delimiter, char quoter ) {
std::vector<std::string> res;
std::string cur;
for (size_t i = 0; i < str.size(); ++i) {
if (str[i] == delimiter) {
res.push_back(cur);
cur.clear();
} else if (str[i] == quoter) {
++i;
for (; i < str.size(); ++i) {
if (str[i] == quoter) {
if (i+1 == str.size() || str[i+1] != quoter) break;
++i;
cur.push_back(quoter);
} else {
cur.push_back(str[i]);
} // end if
} // end for
} else {
cur.push_back(str[i]);
} // end if
} // end for
if (!cur.empty()) res.push_back(cur);
return res;
} // end quotedSplit()
std::vector<std::string> v_input = quotedSplit(input,' ','"');
With input as:
file_upload "/home/user/Space Dir/file.out" "/home/user/Other Dir/blah"
Or even just:
file_upload /home/user/Space" "Dir/file.out /home/user/Other" "Dir/blah
4: Length-value.
Finally, you can write the code to take a length before each value, and only grab that many characters. We could require a fixed-width length specifier, or skip a delimiting character following the length specifier. For example (note: light on error checking):
std::vector<std::string> lengthedSplit(std::string str) {
std::vector<std::string> res;
size_t i = 0;
while (i < str.size()) {
size_t len = std::atoi(str.c_str());
if (len == 0) break;
i += (size_t)std::log10(len)+2; // +1 to get base-10 digit count, +1 to skip delim
res.push_back(str.substr(i,len));
i += len;
} // end while
return res;
} // end lengthedSplit()
std::vector<std::string> v_input = lengthedSplit(input);
With input as:
11:file_upload29:/home/user/Space Dir/file.out25:/home/user/Other Dir/blah
I had similar problem few days ago and solve it like this:
First I've created a copy, Then replace the quoted strings in the copy with some padding to avoid white spaces, finally I split the original string according to the white space indexes from the copy.
Here is my full solution:
you may want to also remove the double quotes, trim the original string and so on:
#include <sstream>
#include<iostream>
#include<vector>
#include<string>
using namespace std;
string padString(size_t len, char pad)
{
ostringstream ostr;
ostr.fill(pad);
ostr.width(len);
ostr<<"";
return ostr.str();
}
void splitArgs(const string& s, vector<string>& result)
{
size_t pos1=0,pos2=0,len;
string res = s;
pos1 = res.find_first_of("\"");
while(pos1 != string::npos && pos2 != string::npos){
pos2 = res.find_first_of("\"",pos1+1);
if(pos2 != string::npos ){
len = pos2-pos1+1;
res.replace(pos1,len,padString(len,'X'));
pos1 = res.find_first_of("\"");
}
}
pos1=res.find_first_not_of(" \t\r\n",0);
while(pos1 < s.length() && pos2 < s.length()){
pos2 = res.find_first_of(" \t\r\n",pos1+1);
if(pos2 == string::npos ){
pos2 = res.length();
}
len = pos2-pos1;
result.push_back(s.substr(pos1,len));
pos1 = res.find_first_not_of(" \t\r\n",pos2+1);
}
}
int main()
{
string s = "234 \"5678 91\" 8989";
vector<string> args;
splitArgs(s,args);
cout<<"original string:"<<s<<endl;
for(size_t i=0;i<args.size();i++)
cout<<"arg "<<i<<": "<<args[i]<<endl;
return 0;
}
and this is the output:
original string:234 "5678 91" 8989
arg 0: 234
arg 1: "5678 91"
arg 2: 8989

C++ XOR encryption truncates file

I'm currently trying to implement file encryption via XOR. Simple as it is, though, I struggle with encryption of multiline files.
Actually, my first problem was that XOR can produce zero chars, which are interpreted as line-end by std::string, thus my solution was:
std::string Encryption::encrypt_string(const std::string& text)
{ //encrypting string
std::string result = text;
int j = 0;
for(int i = 0; i < result.length(); i++)
{
result[i] = 1 + (result[i] ^ code[j]);
assert(result[i] != 0);
j++;
if(j == code.length())
j = 0;
}
return result;
}
std::string Encryption::decrypt_string(const std::string& text)
{ // decrypting string
std::string result = text;
int j = 0;
for(int i = 0; i < result.length(); i++)
{
result[i] = (result[i] - 1) ^ code[j];
assert(result[i] != 0);
j++;
if(j == code.length())
j = 0;
}
return result;
}
Not neat, but fine for the first attempt. But when trying to crypt text files, I understood, that depending on encryption key, my output file gets truncated in random places. My best thought was, that \n is handled incorrectly, because strings from keyboard (even with \n) don't break the code.
bool Encryption::crypt(const std::string& input_filename, const std::string& output_filename, bool encrypt)
{ //My file function
std::fstream finput, foutput;
finput.open(input_filename, std::fstream::in);
foutput.open(output_filename, std::fstream::out);
if (finput.is_open() && foutput.is_open())
{
std::string str;
while (!finput.eof())
{
std::getline(finput, str);
if (encrypt)
str.append("\n");
std::string encrypted = encrypt ? encrypt_string(str) : decrypt_string(str);
foutput.write(encrypted.c_str(), str.length());
}
finput.close();
foutput.close();
return true;
}
return false;
}
What could be the problem, given that console input is XOR'ed fine?
XOR can produce zero chars, which are interpreted as line-end by std::string
std::string provides overloads to most functionality which allow you to specify the size of input data. It allows you to also check for the size of the stored data. Therefore, a 0-value char inside of std::string is perfectly reasonable and acceptable.
Therefore, the problem isn't std::string treating nulls as end-of-line but perhaps std::getline() which may be doing that.
I see that you're using std::ostream::write() so I see you're already familiar with using sizes as parameters. So why not also use std::istream::read() instead of std::getline()?
Therefore, you can simply read in "chunks" or "blocks" of the file instead of needing to treat line separators as a special case.

c++ string erase does not work for UTF8 string, what library can I use?

Program:
void foo() {
string sourceStr = "Tag:贾鑫#VoltDB";
string insertStr = "XinJia";
int start = 4;
int length = 2;
sourceStr.erase(start, length);
sourceStr.insert(start, insertStr);
cout << sourceStr << endl;
}
For this program, I want to get output as "Tag:XinJia#VoltDB", but it seems that the std string erase and insert does not work for UTF-8 string.
Is there any boost library that I can use? How should I solve this problem?
After talking with others, I realize that there is no standard library that can solve this problem. So I write a function to do my work and would like to share it with others who have this similar problem:
std::string overlay_function(const char* sourceStr, size_t sourceLength,
std::string insertStr, size_t startPos, size_t length) {
int32_t i = 0, j = 0;
while (i < sourceLength) {
if ((sourceStr[i] & 0xc0) != 0x80) {
if (++j == startPos) break;
}
i++;
}
std::string result = std::string(sourceStr, i);
result.append(insertStr);
bool reached = false;
j = 0;
while (i < sourceLength) {
if ((sourceStr[i] & 0xc0) != 0x80) {
if (reached) break;
if (++j == length) reached = true;
}
i++;
}
result.append(std::string(&sourceStr[i], sourceLength - i));
return result;
}
With this funciton, my program can be:
cout << overlay_function(sourceStr, sourceStr.length(), 4+1, 2) << endl;
Hope it helps.
Indices in C++ string are encoding value indices, not character (or in your case ideogram) indices. With UTF-8 each character can be composed of more than one encoding unit, and in your case it is so. Find the correct encoding unit index.
Tip 1: I'd use .substr and + string concatenation for this.
Tip 2: it seems that you can search for the characters : and #. Note that these encoding units cannot occur in multi-unit UTF-8 character. Check out the methods of string.

How to convert UTF-8 std::string to UTF-16 std::wstring?

If I have a UTF-8 std::string how do I convert it to a UTF-16 std::wstring? Actually, I want to compare two Persian words.
This is how you do it with C++11:
std::string str = "your string in utf8";
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>> converter;
std::wstring wstr = converter.from_bytes(str);
And these are the headers you need:
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
A more complete example available here:
http://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
Here's some code. Only lightly tested and there's probably a few improvements. Call this function to convert a UTF-8 string to a UTF-16 wstring. If it thinks the input string is not UTF-8 then it will throw an exception, otherwise it returns the equivalent UTF-16 wstring.
std::wstring utf8_to_utf16(const std::string& utf8)
{
std::vector<unsigned long> unicode;
size_t i = 0;
while (i < utf8.size())
{
unsigned long uni;
size_t todo;
bool error = false;
unsigned char ch = utf8[i++];
if (ch <= 0x7F)
{
uni = ch;
todo = 0;
}
else if (ch <= 0xBF)
{
throw std::logic_error("not a UTF-8 string");
}
else if (ch <= 0xDF)
{
uni = ch&0x1F;
todo = 1;
}
else if (ch <= 0xEF)
{
uni = ch&0x0F;
todo = 2;
}
else if (ch <= 0xF7)
{
uni = ch&0x07;
todo = 3;
}
else
{
throw std::logic_error("not a UTF-8 string");
}
for (size_t j = 0; j < todo; ++j)
{
if (i == utf8.size())
throw std::logic_error("not a UTF-8 string");
unsigned char ch = utf8[i++];
if (ch < 0x80 || ch > 0xBF)
throw std::logic_error("not a UTF-8 string");
uni <<= 6;
uni += ch & 0x3F;
}
if (uni >= 0xD800 && uni <= 0xDFFF)
throw std::logic_error("not a UTF-8 string");
if (uni > 0x10FFFF)
throw std::logic_error("not a UTF-8 string");
unicode.push_back(uni);
}
std::wstring utf16;
for (size_t i = 0; i < unicode.size(); ++i)
{
unsigned long uni = unicode[i];
if (uni <= 0xFFFF)
{
utf16 += (wchar_t)uni;
}
else
{
uni -= 0x10000;
utf16 += (wchar_t)((uni >> 10) + 0xD800);
utf16 += (wchar_t)((uni & 0x3FF) + 0xDC00);
}
}
return utf16;
}
There are some relevant Q&A here and here which is worth a read.
Basically you need to convert the string to a common format -- my preference is always to convert to UTF-8, but your mileage may wary.
There have been lots of software written for doing the conversion -- the conversion is straigth forwards and can be written in a few hours -- however why not pick up something already done such as the UTF-8 CPP
To convert between the 2 types, you should use: std::codecvt_utf8_utf16< wchar_t>
Note the string prefixes I use to define UTF16 (L) and UTF8 (u8).
#include <string>
#include <codecvt>
int main()
{
std::string original8 = u8"הלו";
std::wstring original16 = L"הלו";
//C++11 format converter
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
//convert to UTF8 and std::string
std::string utf8NativeString = convert.to_bytes(original16);
std::wstring utf16NativeString = convert.from_bytes(original8);
assert(utf8NativeString == original8);
assert(utf16NativeString == original16);
return 0;
}
Microsoft has developed a beautiful library for such conversions as part of their Casablanca project also named as CPPRESTSDK. This is marked under the namespaces utility::conversions.
A simple usage of it would look something like this on using namespace
utility::conversions
utf8_to_utf16("sample_string");
This page also seems useful: http://www.codeproject.com/KB/string/UtfConverter.aspx
In the comment section of that page, there are also some interesting suggestions for this task like:
// Get en ASCII std::string from anywhere
std::string sLogLevelA = "Hello ASCII-world!";
std::wstringstream ws;
ws << sLogLevelA.c_str();
std::wstring sLogLevel = ws.str();
Or
// To std::string:
str.assign(ws.begin(), ws.end());
// To std::wstring
ws.assign(str.begin(), str.end());
Though I'm not sure the validity of these approaches...

how to convert ucs4 to ucs2 using C++ and ucs2 to ucs4?

Is there any C++ method support this conversion?
By now i just fill character '0' to convert ucs2 to ucs4, is it safe?
thanks!
It's correct for UCS2, but that's most likely not what you have. Nowadays, you're more likely to encounter UTF-16. Unlike UCS-2, UTF-16 encodes Unicode characters as either one or two 16-bit units. This is necessary because Unicode has more than 65536 characters in its current version.
The more complex conversions usually can be done by your OS, and there are several (non-standard) libraries that offer the same functionality, e.g. ICU.
I have something like that. Hope it will help:
String^ StringFromUCS4(const char32_t* element, int length)
{
StringBuilder^ result = gcnew StringBuilder(length);
const char32_t* pUCS4 = element;
int characterCount = 0;
while (*pUCS4 != 0)
{
wchar_t cUTF16;
if (*pUCS4 < 0x10000)
{
cUTF16 = (wchar_t)*pUCS4;
}
else
{
unsigned int t = *pUCS4 - 0x10000;
unsigned int h = (((t << 12) >> 22) + 0xD800);
unsigned int l = (((t << 22) >> 22) + 0xDC00);
cUTF16 = (wchar_t)((h << 16) | (l & 0x0000FFFF));
}
result->Append((wchar_t)*pUCS4);
characterCount++;
if (characterCount >= length)
{
break;
}
pUCS4++;
}
return result->ToString();
}