How to convert wstring to string as \u escapes in C++

I have a wstring; what's the best way to convert it to a string in escaped form, like \u043d\u043e\u043c\u0430?
The one below works but does not seem to be the best:
string output;
for (wchar_t chr : wstr) {
    char code[7]; // "\u" + 4 hex digits + '\0'
    sprintf(code, "\\u%04X", chr);
    output += code;
}

A less compact but faster version that a) allocates ahead of time, b) avoids the cost of printf re-interpreting the format string every iteration, and c) avoids the function-call overhead of printf.
std::wstring wstr(L"\x043d\x043e\x043c\x0430");
std::string sstr;

// Reserve memory in 1 hit to avoid lots of copying for long strings.
static size_t const nchars_per_code = 6;
sstr.reserve(wstr.size() * nchars_per_code);

char code[nchars_per_code];
code[0] = '\\';
code[1] = 'u';
static char const* const hexlut = "0123456789abcdef";

std::wstring::const_iterator i = wstr.begin();
std::wstring::const_iterator e = wstr.end();
for (; i != e; ++i) {
    unsigned wc = *i;
    code[2] = hexlut[(wc >> 12) & 0xF];
    code[3] = hexlut[(wc >> 8) & 0xF];
    code[4] = hexlut[(wc >> 4) & 0xF];
    code[5] = hexlut[wc & 0xF];
    sstr.append(code, code + nchars_per_code);
}
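For reference, a minimal self-contained harness for the lookup-table version above; the function name escape_wstring and the main() driver are my additions for illustration:
#include <iostream>
#include <string>

// Hypothetical wrapper around the loop above; the name is illustrative.
std::string escape_wstring(const std::wstring& wstr) {
    static char const* const hexlut = "0123456789abcdef";
    std::string sstr;
    sstr.reserve(wstr.size() * 6);
    char code[6] = { '\\', 'u' }; // remaining 4 slots filled per character
    for (wchar_t wc : wstr) {
        unsigned u = wc;
        code[2] = hexlut[(u >> 12) & 0xF];
        code[3] = hexlut[(u >> 8) & 0xF];
        code[4] = hexlut[(u >> 4) & 0xF];
        code[5] = hexlut[u & 0xF];
        sstr.append(code, code + 6);
    }
    return sstr;
}

int main() {
    std::cout << escape_wstring(L"\x043d\x043e\x043c\x0430") << '\n';
    // prints: \u043d\u043e\u043c\u0430
}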

Related

longest palindromic substring. Error: AddressSanitizer, heap overflow

#include <string>
#include <cstring>
using std::string;

class Solution {
    void shift_left(char* c, const short unsigned int bits) {
        const unsigned short int size = sizeof(c);
        memmove(c, c + bits, size - bits);
        memset(c + size - bits, 0, bits);
    }
public:
    string longestPalindrome(string s) {
        char* output = new char[s.length()];
        output[0] = s[0];
        string res = "";
        char* n = output;
        auto e = s.begin() + 1;
        while (e != s.end()) {
            char letter = *e;
            char* c = n;
            (*++n) = letter;
            if ((letter != *c) && (c == &output[0] || letter != (*--c))) {
                ++e;
                continue;
            }
            while ((++e) != s.end() && c != &output[0]) {
                if ((letter = *e) != (*--c)) {
                    const unsigned short int bits = c - output + 1;
                    shift_left(output, bits);
                    n -= bits;
                    break;
                }
                (*++n) = letter;
            }
            string temp(output);
            res = temp.length() > res.length() ? temp : res;
            shift_left(output, 1);
            --n;
        }
        return res;
    }
};
Input: longestPalindrome("babad");
The program works fine and prints "bab" as the longest palindrome, but there's a heap overflow somewhere. An error like this appears:
Read of size 6 at ...memory address... thread T0
"babad" is size 5, and after going over this for an hour I don't see the point where the iteration ever exceeds 5.
There are 3 pointers here that iterate:
e, as the element of string s;
n, which is the pointer to the next char of output;
and c, which is a copy of n and decrements until it reaches the address &output[0].
Maybe it's something with the memmove or memset, since I've never used them before.
I'm completely lost.
TL;DR: mixing char* and std::string is not a good idea if you don't understand exactly how they work.
If you want the length of a string, you can't write const unsigned short int size = sizeof(c); because sizeof returns the size of the pointer (commonly 4 on a 32-bit machine and 8 on a 64-bit one). You must write const size_t size = strlen(c); instead.
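A quick illustration of the trap (my sketch; the pointer size is platform-dependent):
#include <cstdio>
#include <cstring>

int main() {
    char buf[32] = "hello";
    char* p = buf;
    std::printf("%zu\n", sizeof(p));      // size of the pointer: typically 8 on 64-bit
    std::printf("%zu\n", std::strlen(p)); // 5, the actual string length
}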
AddressSanitizer is right that you are (indirectly) trying to access memory which does not belong to you.
How does the constructor of string from char* work?
Answer: the char* is treated as a C-style string, which means it must be null ('\0') terminated.
More details: the constructor of string from char* calls a strlen-like function which looks roughly like this:
https://en.cppreference.com/w/cpp/string/byte/strlen
size_t strlen(const char* begin) {
    size_t k = 0;
    while (*begin != '\0') {
        ++k;
        ++begin;
    }
    return k;
}
If the C-style char* string does not contain a '\0', this causes access to memory that doesn't belong to you.
How to fix?
Answer (two options):
don't mix char* and std::string; or
replace char* output = new char[s.length()]; with char* output = new char[s.length() + 1]; memset(output, 0, s.length() + 1);
Also, you must delete[] every allocation you made with new[], so add delete[] output; before return res;.
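A minimal sketch of option 2 applied (the palindrome logic itself is elided; the function name process is just for illustration):
#include <cstring>
#include <string>

std::string process(const std::string& s) {
    // One extra byte for '\0', zero-filled, so the buffer is always a
    // valid C-string for std::string's char* constructor.
    char* output = new char[s.length() + 1];
    std::memset(output, 0, s.length() + 1);
    // ... run the algorithm, writing into output ...
    std::string res(output);
    delete[] output; // release everything allocated with new[]
    return res;
}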

How to convert accented character to hex in C++? [duplicate]

This question already has answers here:
Convert string from UTF-8 to ISO-8859-1
(3 answers)
Closed 1 year ago.
Referring to the ISO-8859-1 (Latin-1) encoding:
The capital E acute (É) has a hex value of C9.
I am trying to write a function that takes a std::string and then converts it to hex according to the ISO-8859-1 encoding above.
Currently, I am only able to write a function that converts an ASCII string to hex:
std::string Helper::ToHex(std::string input) {
    std::stringstream strstream;
    std::string output;
    for (int i = 0; i < input.length(); i++) {
        strstream << std::hex << unsigned(input[i]);
    }
    strstream >> output;
    return output; // note: the original snippet was missing this return
}
However, this function can't do the job when the input has accented characters. It will convert É to a hex value of ffffffc3ffffff89.
std::string has no encoding of its own. It can easily hold characters encoded in ASCII, UTF-8, ISO-8859-x, Windows-125x, etc. They are just raw bytes, as far as std::string is concerned. So, before you can print your output in ISO-8859-1 specifically, you need to first know what the std::string is already holding so it can be converted to ISO-8859-1 if needed.
FYI, ffffffc3ffffff89 is simply the two char values 0xc3 0x89 (the UTF-8 encoded form of É) being sign-extended to 32 bits, which means your compiler implements char as a signed type rather than an unsigned type. To eliminate the leading fs, you need to cast each char to unsigned char before casting to unsigned. You will also need to account for values < 0x10 so that the output is exactly 2 hex digits per char, eg:
strstream << std::hex << std::setw(2) << std::setfill('0') << static_cast<unsigned>(static_cast<unsigned char>(input[i]));
So, it appears that your std::string is encoded in UTF-8. There are plenty of libraries available that can convert text from one encoding to another, such as ICU or ICONV. Or platform-specific APIs, like WideCharToMultiByte()/MultiByteToWideChar() on Windows, std::mbstowcs()/std::wcstombs(), etc (provided suitable locales are installed in the OS). But there is nothing really built-in to C++ for this exact UTF-8 to ISO-8859-1 conversion. Though, you could use the (deprecated) std::wstring_convert to decode the UTF-8 std::string to a UTF-16/32 encoded std::wstring, or a UTF-16 encoded std::u16string, at least. And then you can convert that to ISO-8859-1 using whatever library you want as needed.
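As a concrete example of the std::wstring_convert route (deprecated since C++17, shown here only as a sketch), decoding a UTF-8 std::string to a UTF-16 std::u16string:
#include <codecvt>
#include <locale>
#include <string>

std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}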
Or, knowing that the input is UTF-8 and the output is ISO-8859-1, it is really not that hard to just convert the data manually, decoding the UTF-8 into codepoints, and then encoding those codepoints to bytes. Both encodings are well-documented and fairly easy to write code for without too much effort, eg:
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

size_t nextUtf8CodepointLen(const char* data)
{
    unsigned char ch = static_cast<unsigned char>(*data);
    if ((ch & 0x80) == 0) {
        return 1; // ASCII
    }
    if ((ch & 0xE0) == 0xC0) {
        return 2;
    }
    if ((ch & 0xF0) == 0xE0) {
        return 3;
    }
    if ((ch & 0xF8) == 0xF0) {
        return 4;
    }
    return 0; // not a valid lead byte
}

unsigned nextUtf8Codepoint(const char* &data, size_t &data_size)
{
    if (data_size == 0) return -1;
    unsigned char ch = static_cast<unsigned char>(*data);
    size_t len = nextUtf8CodepointLen(data);
    ++data;
    --data_size;
    if (len < 2) {
        // single byte: ASCII if valid, U+FFFD replacement if invalid
        return (len == 1) ? static_cast<unsigned>(ch) : 0xFFFD;
    }
    --len; // number of continuation bytes still expected
    unsigned cp;
    if (len == 1) {
        cp = ch & 0x1F;
    }
    else if (len == 2) {
        cp = ch & 0x0F;
    }
    else {
        cp = ch & 0x07;
    }
    if (len > data_size) {
        // truncated sequence: consume what's left and emit U+FFFD
        data += data_size;
        data_size = 0;
        return 0xFFFD;
    }
    for (size_t j = 0; j < len; ++j) {
        ch = static_cast<unsigned char>(data[j]);
        if ((ch & 0xC0) != 0x80) {
            cp = 0xFFFD; // invalid continuation byte
            break;
        }
        cp = (cp << 6) | (ch & 0x3F);
    }
    data += len;
    data_size -= len;
    return cp;
}

std::string Helper::ToHex(const std::string &input) {
    const char *data = input.c_str();
    size_t data_size = input.size();
    std::ostringstream oss;
    unsigned cp;
    while ((cp = nextUtf8Codepoint(data, data_size)) != -1) {
        if (cp > 0xFF) {
            cp = static_cast<unsigned>('?'); // not representable in ISO-8859-1
        }
        oss << std::hex << std::setw(2) << std::setfill('0') << cp;
    }
    return oss.str();
}
Online Demo
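For instance (assuming ToHex is callable as shown; the surrounding Helper class is not part of the original code):
// "É" is 0xC3 0x89 in UTF-8 and the single byte 0xC9 in ISO-8859-1,
// so the expected output is "c9".
std::string hex = Helper::ToHex("\xC3\x89");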

Adding a space char after each item appended to a string - then having to erase the last one. How to improve?

This function, vec2string, takes a vector of chars and converts it to a hex string representation, but with a blank space between each byte value (just a formatting requirement in my app). Can anyone think of a way to remove the need for that final erase?
std::string& vec2string(const std::vector<char>& vec, std::string& s) {
    static const char hex_lookup[] = "0123456789ABCDEF";
    for (std::vector<char>::const_iterator it = vec.begin(); it != vec.end(); ++it) {
        s.append(1, hex_lookup[(*it >> 4) & 0xf]);
        s.append(1, hex_lookup[*it & 0xf]);
        s.append(1, ' ');
    }
    // remove very last space - I would ideally like to remove this***
    if (!s.empty())
        s.erase(s.size() - 1);
    return s;
}
std::string& vec2string(const std::vector<char>& vec, std::string& s) {
    static const char hex_lookup[] = "0123456789ABCDEF";
    if (vec.empty())
        return s;
    // 3 chars per byte (2 hex digits + separator), minus the one missing separator
    s.reserve(s.size() + vec.size() * 3 - 1);
    // handle the first byte outside the loop so the separator goes before each later byte
    std::vector<char>::const_iterator it = vec.begin();
    s.append(1, hex_lookup[(*it >> 4) & 0xf]);
    s.append(1, hex_lookup[*it & 0xf]);
    for (++it; it != vec.end(); ++it) {
        s.append(1, ' ');
        s.append(1, hex_lookup[(*it >> 4) & 0xf]);
        s.append(1, hex_lookup[*it & 0xf]);
    }
    return s;
}
You could add a check in the loop, before appending the characters: if the string is not empty, add a space first. Like:
for (...)
{
    if (!s.empty())
        s += ' ';
    ...
}
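Filled in, that suggestion might look like this (my sketch; it assumes s arrives empty, otherwise the first hex pair also gets a leading space):
std::string& vec2string(const std::vector<char>& vec, std::string& s) {
    static const char hex_lookup[] = "0123456789ABCDEF";
    for (std::vector<char>::const_iterator it = vec.begin(); it != vec.end(); ++it) {
        if (!s.empty())
            s += ' '; // separator only before the 2nd and later bytes
        s.append(1, hex_lookup[(*it >> 4) & 0xf]);
        s.append(1, hex_lookup[*it & 0xf]);
    }
    return s;
}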
If you have Boost, use algorithm/string/join.hpp. Otherwise, you could try the Duff's Device approach:
string hex_str(const vector<char>& bytes)
{
    string result;
    if (!bytes.empty())
    {
        const char hex_lookup[] = "0123456789ABCDEF";
        vector<char>::const_iterator it = bytes.begin();
        goto skip; // jump into the loop body to skip the first separator
        do {
            result += ' ';
        skip:
            result += hex_lookup[(*it >> 4) & 0xf];
            result += hex_lookup[*it & 0xf];
        } while (++it != bytes.end());
    }
    return result;
}
Use boost::trim (see the documentation):
boost::trim(your_string);
I'd start by separating the hex conversion into its own little function:
std::string to_hex(char in) {
    static const char hex_lookup[] = "0123456789ABCDEF";
    std::string s;
    s.push_back(hex_lookup[(in >> 4) & 0xf]);
    s.push_back(hex_lookup[in & 0xf]);
    return s;
}
Then I'd use std::transform to apply that to the whole vector, with my infix_ostream_iterator and a std::stringstream to put the pieces together.
#include <sstream>
#include <algorithm>
#include <string>
#include <vector>
#include "infix_iterator.h"

std::string vec2string(const std::vector<char>& vec) {
    std::stringstream s;
    std::transform(vec.begin(), vec.end(),
                   infix_ostream_iterator<std::string>(s, " "),
                   to_hex);
    return s.str();
}
Also note that rather than modifying an existing string, this creates and returns a new string. At least in my opinion, modifying an existing string is a poor idea -- simply producing a string is much cleaner and more modular. If the caller wants to combine the results into a longer string, that's fine, but it's better for the lower-level function to just do one thing cleanly, and let the higher level function decide what to do with the result.
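For example (assuming the pieces above compile together with the infix_iterator header):
std::vector<char> bytes;
bytes.push_back(static_cast<char>(0xDE));
bytes.push_back(static_cast<char>(0xAD));
std::string s = vec2string(bytes); // s == "DE AD"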

Converting from char string to an array of uint8_t?

I'm reading a string from a file so it's in the form of a char array. I need to tokenize the string and save each char array token as a uint8_t hex value in an array.
char* starting = "001122AABBCC";
// ...
uint8_t[] ending = {0x00,0x11,0x22,0xAA,0xBB,0xCC}
How can I convert from starting to ending? Thanks.
Here is a complete working program. It is based on Rob I's solution, but fixes several problems and has been tested to work.
#include <string>
#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <iostream>

const char* starting = "001122AABBCC";

int main()
{
    std::string starting_str = starting;
    std::vector<unsigned char> ending;
    ending.reserve(starting_str.size());
    for (int i = 0; i < starting_str.length(); i += 2) {
        std::string pair = starting_str.substr(i, 2);
        ending.push_back(::strtol(pair.c_str(), 0, 16));
    }
    for (int i = 0; i < ending.size(); ++i) {
        printf("0x%X\n", ending[i]);
    }
}
strtoul will convert text in any base you choose into an integer value. You have to do a little work to chop the input string into individual digit pairs, or you can convert 32 or 64 bits at a time.
P.S. uint8_t[] ending = {0x00,0x11,0x22,0xAA,0xBB,0xCC} doesn't mean anything: you aren't storing the data in a uint8_t "as hex", you are storing bytes. It's up to how you (or your debugger) interpret the binary data.
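A sketch of the "convert 32 bits at a time" idea (my illustration; it assumes an even number of hex digits and valid input):
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

std::vector<uint8_t> hex_to_bytes(const char* hex) {
    std::vector<uint8_t> out;
    size_t len = std::strlen(hex);
    size_t i = 0;
    while (i < len) {
        size_t take = (len - i < 8) ? (len - i) : 8; // up to 8 hex digits = 32 bits
        char chunk[9] = {0};
        std::memcpy(chunk, hex + i, take);
        unsigned long v = std::strtoul(chunk, NULL, 16);
        for (size_t b = take; b >= 2; b -= 2)        // emit the bytes high-to-low
            out.push_back(static_cast<uint8_t>(v >> (4 * (b - 2))));
        i += take;
    }
    return out;
}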
With C++11, you may use std::stoi for that :
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

std::vector<uint8_t> convert(const std::string& s)
{
    if (s.size() % 2 != 0) {
        throw std::runtime_error("Bad size argument");
    }
    std::vector<uint8_t> res;
    res.reserve(s.size() / 2);
    for (std::size_t i = 0, size = s.size(); i != size; i += 2) {
        std::size_t pos = 0;
        res.push_back(std::stoi(s.substr(i, 2), &pos, 16));
        if (pos != 2) {
            throw std::runtime_error("bad character in argument");
        }
    }
    return res;
}
Live example.
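Usage is then straightforward:
std::vector<uint8_t> bytes = convert("001122AABBCC");
// bytes == {0x00, 0x11, 0x22, 0xAA, 0xBB, 0xCC}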
I think any canonical answer (w.r.t. the bounty notes) would involve some distinct phases in the solution:
- Error checking for valid input:
  - length check, and
  - data content check
- Element conversion
- Output creation
Given the usefulness of such conversions, the solution should probably include some flexibility w.r.t. the types being used and the locale required.
From the outset, given the date of the request for a "more canonical answer" (circa August 2014) liberal use of C++11 will be applied.
An annotated version of the code, with types corresponding to the OP:
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

std::vector<std::uint8_t> convert(std::string const& src)
{
    // error check on the length
    if ((src.length() % 2) != 0) {
        throw std::invalid_argument("conversion error: input is not even length");
    }
    auto ishex = [] (decltype(*src.begin()) c) {
        return std::isxdigit(c, std::locale());
    };
    // error check on the data contents
    if (!std::all_of(std::begin(src), std::end(src), ishex)) {
        throw std::invalid_argument("conversion error: input values are not all xdigits");
    }
    // allocate the result, initialised to 0, and size it to the correct length
    std::vector<std::uint8_t> result(src.length() / 2, 0);
    // run the actual conversion
    auto str = src.begin(); // track the location in the string
    std::for_each(result.begin(), result.end(), [&str](decltype(*result.begin())& element) {
        element = static_cast<std::uint8_t>(std::stoul(std::string(str, str + 2), nullptr, 16));
        std::advance(str, 2); // next two characters
    });
    return result;
}
The template version of the code adds flexibility;
template <typename Int /*= std::uint8_t*/,
          typename Char = char,
          typename Traits = std::char_traits<Char>,
          typename Allocate = std::allocator<Char>,
          typename Locale = std::locale>
std::vector<Int> basic_convert(std::basic_string<Char, Traits, Allocate> const& src, Locale locale = Locale())
{
    using string_type = std::basic_string<Char, Traits, Allocate>;
    auto ishex = [&locale] (decltype(*src.begin()) c) {
        return std::isxdigit(c, locale);
    };
    if ((src.length() % 2) != 0) {
        throw std::invalid_argument("conversion error: input is not even length");
    }
    if (!std::all_of(std::begin(src), std::end(src), ishex)) {
        throw std::invalid_argument("conversion error: input values are not all xdigits");
    }
    std::vector<Int> result(src.length() / 2, 0);
    auto str = std::begin(src);
    std::for_each(std::begin(result), std::end(result), [&str](decltype(*std::begin(result))& element) {
        element = static_cast<Int>(std::stoul(string_type(str, str + 2), nullptr, 16));
        std::advance(str, 2);
    });
    return result;
}
The convert() function can then be based on the basic_convert() as follows:
std::vector<std::uint8_t> convert(std::string const& src)
{
return basic_convert<std::uint8_t>(src, std::locale());
}
Live sample.
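A usage sketch of both entry points (the wide-string call relies on the defaulted template parameters and locale):
std::vector<std::uint8_t>  a = convert("001122AABBCC");
std::vector<std::uint16_t> b = basic_convert<std::uint16_t>(std::wstring(L"001122AABBCC"));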
uint8_t is typically no more than a typedef of an unsigned char. If you're reading characters from a file, you should be able to read them into an unsigned char array just as easily as a signed char array, and an unsigned char array is a uint8_t array.
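For example, reading a file's raw bytes straight into uint8_t storage (my sketch; the function name read_bytes is illustrative):
#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>

std::vector<uint8_t> read_bytes(const char* path) {
    std::ifstream f(path, std::ios::binary);
    return std::vector<uint8_t>((std::istreambuf_iterator<char>(f)),
                                std::istreambuf_iterator<char>());
}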
I'd try something like this:
std::string starting_str = starting;
uint8_t* ending = new uint8_t[starting_str.length() / 2];
for (size_t i = 0; i < starting_str.length(); i += 2) {
    std::string pair = starting_str.substr(i, 2); // substr takes (pos, count)
    ending[i / 2] = ::strtol(pair.c_str(), 0, 16);
}
Didn't test it but it looks good to me...
You may add your own conversion from the set of chars { '0','1',...,'E','F' } to uint8_t:
uint8_t ctoa(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    else if (c >= 'a' && c <= 'f') return 0xA + c - 'a';
    else if (c >= 'A' && c <= 'F') return 0xA + c - 'A';
    else return 0;
}
Then it will be easy to convert the string into an array:
uint32_t endingSize = strlen(starting) / 2;
uint8_t* ending = new uint8_t[endingSize];
for (uint32_t i = 0; i < endingSize; i++)
{
    ending[i] = (ctoa(starting[i*2]) << 4) + ctoa(starting[i*2 + 1]);
}
This simple solution should work for your problem:
const char* starting = "001122AABBCC";
uint8_t ending[12];

// This algo will work for any size of starting;
// however, you have to make sure that ending has enough space.
// Note that it stores one hex digit (nibble) per output element,
// not one byte per pair of digits.
int i = 0;
while (i < strlen(starting))
{
    // convert the character to a one-digit string
    char str[2] = "\0";
    str[0] = starting[i];
    // convert the string to int, base 16 (atoi takes no base argument)
    ending[i] = (uint8_t)strtol(str, NULL, 16);
    i++;
}
uint8_t* ending = reinterpret_cast<uint8_t*>(starting); // note: this merely reinterprets the existing character bytes; it does not parse the hex digits into values

How to return the MD5 hash as a string in this C++ code?

I have this code that correctly shows me the MD5 of a string.
I would prefer the function to return a string, but I have some problems converting the MD5 bytes into my string.
This is the code:
string calculatemd5(string msg)
{
    string result;
    const char* test = msg.c_str();
    int i;
    MD5_CTX md5;
    MD5_Init(&md5);
    MD5_Update(&md5, (const unsigned char*)test, msg.length());
    unsigned char buffer_md5[16];
    MD5_Final(buffer_md5, &md5);
    printf("Input: %s", test);
    printf("\nMD5: ");
    for (i = 0; i < 16; i++) {
        printf("%02x", buffer_md5[i]);
        result[i] = buffer_md5[i];
    }
    std::cout << "\nResult:" << result[i] << endl;
    return result;
}
For example, result[i] is a strange ASCII char like this: .
How is it possible to solve this problem?
A cleaner way (and faster) might be like this:
std::string result;
result.reserve(32); // C++11 only, otherwise ignore
for (std::size_t i = 0; i != 16; ++i)
{
    result += "0123456789ABCDEF"[hash[i] / 16];
    result += "0123456789ABCDEF"[hash[i] % 16];
}
return result;
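Wrapped into a self-contained function (my sketch; hash corresponds to buffer_md5 in the question):
#include <cstddef>
#include <string>

std::string digest_to_hex(const unsigned char hash[16]) {
    std::string result;
    result.reserve(32); // 16 bytes -> 32 hex digits
    for (std::size_t i = 0; i != 16; ++i) {
        result += "0123456789ABCDEF"[hash[i] / 16];
        result += "0123456789ABCDEF"[hash[i] % 16];
    }
    return result;
}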
replace
for (i = 0; i < 16; i++) {
    printf("%02x", buffer_md5[i]);
    result[i] = buffer_md5[i];
}
with
char buf[32];
for (i = 0; i < 16; i++) {
    sprintf(buf, "%02x", buffer_md5[i]);
    result.append(buf);
}
Notice that when you print out the result, print result, not result[i], to get the whole string.
If you put the buffer_md5[i] values directly into the string you may get problems, since the raw digest bytes may contain an embedded 0.
It seems that you are using OpenSSL.
Use the constant MD5_DIGEST_LENGTH.
You can also use the MD5() function instead of MD5_Init, MD5_Update and MD5_Final.
MD5() may take most of the time, but if you want to reduce the cost of sprintf, build the hex string manually.
Like this:
{
    static const char hexDigits[17] = "0123456789ABCDEF"; // 16 digits + '\0'
    unsigned char digest[MD5_DIGEST_LENGTH];
    char digest_str[2*MD5_DIGEST_LENGTH + 1];
    int i;
    // Compute the digest
    MD5((const unsigned char*)msg.c_str(), msg.length(), digest);
    // Convert the hash into hex string form
    for (i = 0; i < MD5_DIGEST_LENGTH; i++)
    {
        digest_str[i*2]     = hexDigits[(digest[i] >> 4) & 0xF];
        digest_str[i*2 + 1] = hexDigits[digest[i] & 0xF];
    }
    digest_str[MD5_DIGEST_LENGTH*2] = '\0';
    std::cout << "\nResult:" << digest_str << endl;
}
not tested, so there may be bugs.
#include <sstream>
...
std::stringstream ss;
for (i = 0; i < 16; i++) {
    printf("%02x", buffer_md5[i]);
    ss << std::hex << buffer_md5[i];
}
result = ss.str();
std::hex might not do exactly what you want here, since a char is streamed as a character rather than a number. Perhaps this will be better (note it maps a single 0-15 value to one hex digit; a full byte still has to be split into two nibbles):
for (i = 0; i < 16; i++) {
    printf("%02x", buffer_md5[i]);
    if (buffer_md5[i] < 10)
        ss << static_cast<char>('0' + buffer_md5[i]);
    else
        ss << static_cast<char>('a' + buffer_md5[i] - 10);
}