Writing std::string with non-ascii data to file

Below is a simplified example of my problem. I have some external byte data which appears to be a string with cp1252 encoded degree symbol 0xb0. When it is stored in my program as an std::string it is correctly represented as 0xffffffb0. However, when that string is then written to a file, the resulting file is only one byte long with just 0xb0. How do I write the string to the file? How does the concept of UTF-8 come into this?
#include <iostream>
#include <fstream>
typedef struct
{
char n[40];
} mystruct;
static void dump(const std::string& name)
{
std::cout << "It is '" << name << "'" << std::endl;
const char *p = name.data();
for (size_t i=0; i<name.size(); i++)
{
printf("0x%02x ", p[i]);
}
std::cout << std::endl;
}
int main()
{
const unsigned char raw_bytes[] = { 0xb0, 0x00};
mystruct foo;
foo = *(mystruct *)raw_bytes;
std::string name = std::string(foo.n);
dump(name);
std::ofstream my_out("/tmp/out.bin", std::ios::out | std::ios::binary);
my_out << name;
my_out.close();
return 0;
}
Running the above program produces the following on STDOUT
It is '�'
0xffffffb0

First of all, this is a must read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Now, when you are done with that, you have to understand what type p[i] is.
It is char, which in C and C++ is a small integer type whose signedness is implementation-defined. On most common platforms it is signed, so a char can be negative!
Now, since you have cp1252 characters, they are outside the range of ASCII. This means that bytes such as 0xb0 are seen as negative values!
Now, when they are converted to int, the sign bit is replicated, and when you print the result in hex you will see 0xffffff<actual byte value>.
To handle that in C, first you should cast to unsigned char:
printf("0x%02x ", (unsigned char)p[i]);
then the default conversion will fill in the missing bits with zeros and printf() will give you a proper value.
Now, in C++ this is a bit more nasty, since char and unsigned char are treated by stream operators as a character representation. So to print them in hex manner, it should be like this:
int charToInt(char ch)
{
return static_cast<int>(static_cast<unsigned char>(ch));
}
std::cout << std::hex << charToInt(s[i]);
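Now, converting the char directly to unsigned int will not fix the problem: the negative value is sign-extended during that conversion, exactly as if the compiler had silently converted it to int first.
See here: https://wandbox.org/permlink/sRmh8hZd78Oar7nF
UTF-8 has nothing to do with this issue. The file is also correct as written: the string holds the single byte 0xb0, and that one byte is exactly what ends up in the file.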
Off-topic: please, when you write pure C++ code, do not use C. It is pointless and makes code harder to maintain, and it is not faster. So:
do not use char* or char[] to store strings. Just use std::string.
do not use printf(), use std::cout (or the fmt library, if you like format strings; it became the basis of std::format in C++20).
do not use alloc(), malloc(), free() - in modern C++, use std::make_unique() and std::make_shared().

Related

C++ - Converting a char to wchar_t. Getting a segfault

I'm trying to write small program that reads in a character from an istream and converts it to a wchar_t. I'm getting a segfault. Here's my code
#include <iostream>
using namespace std;
wchar_t read(istream &stream) {
char *c;
stream.read(c, sizeof(*c));
cout << *c << endl;
wchar_t retChar = static_cast<wchar_t>(*c);
return retChar;
}
int main() {
cout << "Write something" << endl;
read(cin);
}
My logic here is:
Create an array of chars because read only takes arrays of chars.
Read in bytes equal to the size of a character. i.e. read a character and store it in the array c.
Create a wchar_t and cast that character *c into a wchar_t.
return wchar_t
Since I'm getting a segfault, there's obviously something wrong here. I can't see it though. Any help would be appreciated.
Thanks SO

Stepping through the code to give OP a look at what's going on and why it won't work. Then we'll take a look at a method to do what they want that is as close as possible to their intent. Then a hint on how to do this a bit better in the C++ world.
wchar_t read(istream &stream) {
char *c;
Declares a pointer c and doesn't point it at anything. c is an uninitialized variable. Think of it like being invited to Steve's house for a party, but no one told you where he lived. Odds are very good that where ever you go, it won't be Steve's house.
stream.read(c, sizeof(*c));
sizeof(*c) is the size of one char, which is 1 byte by definition (usually 8 bits), but c still hasn't been pointed at anything, so this is Undefined Behaviour. There is no telling what the program will do, but most likely it reads one byte into some unknown space in memory. Maybe this causes a crash because you can't write there. Maybe it writes over something that it is allowed to write over and screws up something else.
cout << *c << endl;
Tries to print out *c. If the program survived the read above, odds are good it will survive this as well, but this is also Undefined Behaviour.
wchar_t retChar = static_cast<wchar_t>(*c);
This will literally stuff one character's worth of data into a wide character. It will not convert it according to locale or any other character encoding. char is a numeric code that has been defined to be interpreted as a character. A cast will stupidly put the character value, say 'A' and ASCII encoding into retChar. retChar now equals 65. 65 could mean anything depending on the encoding used by wchar_t. It might still mean 'A', but sorry Ayn Rand, this is one case where A may well not be A.
return retChar;
}
To do what OP was trying to do (and ignoring that there are better ways to do this for the time being):
#include <iostream>
using namespace std;
wchar_t read(istream &stream) {
char c[2];
Allocates an array of characters. Why? Because the easiest way I know of is to do the conversion on a string.
stream.read(c, sizeof(c[0]));
c is now an array which decays to a pointer. We only want to read one char, so sizeof(c[0]) gets the size of the first element in the array.
c[1] = '\0';
cout << c << endl;
Null terminate and print.
wchar_t retChar[2];
Again, an array.
mbstowcs(retChar, c, 1);
Converts one character from char to wide char using whatever locale has been set. Read more on locales here: http://en.cppreference.com/w/cpp/locale/setlocale
And documentation on mbstowcs: http://en.cppreference.com/w/cpp/string/multibyte/mbstowcs
return retChar[0];
}
Put all together with a quick tester:
#include <iostream>
#include <cstdlib>
wchar_t read(std::istream &stream)
{
char c[2];
stream.read(c, sizeof(c[0]));
c[1] = '\0';
std::cout << c << std::endl;
wchar_t retChar[2];
mbstowcs(retChar, c, 1);
return retChar[0];
}
int main()
{
std::wcout << read(std::cin) << std::endl;
}
This is simple, but ugly in the C++ world, where you should stick to strings where possible. In that case look into std::wstring_convert (note that it has been deprecated since C++17).

Converting an integer to an hexadecimal unsigned char array in C++

I need to convert an integer int parameter to an hexadecimal unsigned char buffer[n].
If integer is for example 10 then the hexadecimal unsigned char array should be 0x0A
To do so I have the following code:
int parameter;
std::stringstream ss;
unsigned char buffer[];
ss << std::hex << std::showbase << parameter;
typedef unsigned char byte_t;
byte_t b = static_cast<byte_t>(ss); //ERROR: invalid static_cast from type ‘std::stringstream {aka std::basic_stringstream<char>}’ to type ‘byte_t {aka unsigned char}’
buffer[0]=b;
Does anyone know how to avoid this error?
If there is a better way of converting the integer parameter into a hexadecimal unsigned char than first doing ss << std::hex << std::showbase << parameter; that would be even better.

Consulting my psychic powers, it seems you actually want to see an int value as its representation in bytes (byte_t). Well, judging from your comment
I want the same number represented in hexadecimal and assigned to a unsigned char buffer[n].
not much psychic power was needed, but you should note that hexadecimal representation is a matter of formatting, not of the internal representation of the integer.
The easiest way is to use a union like
union Int32Bytes {
int ival;
byte_t bytes[sizeof(int)];
};
and use it like
Int32Bytes x;
x.ival = parameter;
for(size_t i = 0; i < sizeof(int); ++i) {
std::cout << std::hex << std::showbase << (int)x.bytes[i] << ' ';
}
Be aware that you may see unexpected results due to the endianness of your current CPU architecture.

Problem 1: buffer is of undetermined size in your snippet. I'll suppose that you have declared it with a sufficient size.
Problem 2: the result of your conversion will be several chars (at least 3 due to the 0x prefix). So you need to copy all of them. This won't work with an = unless you'd have strings.
Problem 3: your intermediary cast won't succeed: you can't hope to convert a complex stringstream object to a single unsigned char. Fortunately, you don't need this.
Here is a possible solution using std::copy(), and adding a null terminator to the buffer:
const string& s = ss.str();
*copy(s.cbegin(), s.cend(), buffer)='\0';

c++ pass by value segmentation fault

I have a function f() which receives a char* p and gives a const char* to it.
void f(char *p) {
string s = "def";
strcpy(p, s.c_str());
}
In the main() below I expect to get q = "def".
int main(){
char *q = "abc";
f(q);
cout << q << endl;
}
By running this I get segmentation fault and as I am new in C++ I don't understand why.
I also get a segmentation fault when I do not initialize q thus:
int main(){
char *q;
f(q);
cout << q << endl;
}
Given that the function's parameter and the way it's called must not change, is there any workaround that I can apply inside the function? Any suggestions?
Thanks for your help.

You are trying to change a string literal. Any attempt to change a string literal results in undefined behaviour of the program.
Take into account that string literals have types of constant character arrays. So it would be more correct to write
const char *q = "abc";
From the C++ Standard (2.14.5 String literals)
8 Ordinary string literals and UTF-8 string literals are also referred
to as narrow string literals. A narrow string literal has type
“array of n const char”, where n is the size of the string as
defined below, and has static storage duration
You could write your program the following way
//...
void f(char *p) {
string s = "def";
strcpy(p, s.c_str());
}
//..
int main(){
char q[] = "abc";
f(q);
cout << q << endl;
}
If you need to use a pointer then you could write
//...
void f(char *p) {
string s = "def";
strcpy(p, s.c_str());
}
//..
int main(){
char *q = new char[4] { 'a', 'b', 'c', '\0' };
f(q);
cout << q << endl;
delete []q;
}

This is an issue that, in reality, should fail at compilation time, but for really old legacy reasons compilers allow it.
"abc" is not a mutable string and therefore it should be illegal to point a mutable pointer to it.
Really any legacy code that does this should be forced to be fixed, or have some pragma around it that lets it compile or some permissive flag set in the build.
But a long time ago in the old days of C there was no such thing as a const modifier, and literals were stored in char * parameters and programmers had to be careful what they did with them.
The latter construct, where q is not initialised at all is an error because now q could be pointing anywhere, and is unlikely to be pointing to a valid memory place to write the string. It is actually undefined behaviour, for obvious reason - who knows where q is pointing?
The normal construct for such a function like f is to request not only a pointer to a writable buffer but also a maximum available size (capacity). Usually this size includes the null-terminator, sometimes it might not, but either way the function f can then write into it without an issue. It will also often return a status, possibly the number of bytes it wanted to write. This is very common for a "C" interface. (And C interfaces are often used even in C++ for better portability / compatibility with other languages).
To make it work in this instance, you need at least 4 bytes in your buffer.
int main()
{
char q[4];
f(q);
std::cout << q << std::endl;
}
would work.
Inside the function f you can use std::string::copy to copy into the buffer. Let's modify f.
(We assume this is a prototype and in reality you have a meaningful name and it returns something more meaningful that you retrieve off somewhere).
size_t f( char * buf, size_t capacity )
{
std::string s = "def";
size_t copied = s.copy( buf, capacity-1 ); // leave a space for the null
buf[copied] = '\0'; // std::string::copy doesn't write this
return s.size() + 1; // the number of bytes you need
}
int main()
{
char q[3];
size_t needed = f( q, 3 );
std::cout << q << " - needed " << needed << " bytes " << std::endl;
}
Output should be:
de - needed 4 bytes
In your question you suggested you can change your function but not the way it is called. Well in that case, you actually have only one real solution:
void f( const char * & p )
{
p = "def";
}
Now you can happily do
int main()
{
const char * q;
f( q );
std::cout << q << std::endl;
}
Note that in this solution I am actually moving your pointer to point to something else. This works because it is a static string. You cannot have a local std::string then point it to its c_str(). You can have a std::string whose lifetime remains beyond the scope of your q pointer e.g. stored in a collection somewhere.

Look at the warnings you get while compiling your code (and if you don’t get any, turn up the warning levels or get a better compiler).
You will notice that despite the type declaration, the value of q is not really mutable. The compiler was humoring you because not doing so would break a lot of legacy code.

You can't do that because you assigned a string literal to your char*. And this is memory you can't modify.

With your f, you should do
int main(){
char q[4 /*or more*/];
f(q);
std::cout << q << std::endl;
}

The problem is that you are trying to write to a read-only place in the process address space, since string literals are typically placed in read-only data. char *q = "abc"; creates a pointer that points into the read-only section where the string literal is stored, and when you copy using strcpy or memcpy, or even try q[1] = 'x', the program attempts to write to write-protected memory.
That was the problem. Among many possible solutions, one is:
int main(){
char *q = "abc"; // the strings are placed in a read-only portion
// of the process address space, so you cannot write there
q = new char[4]; // makes q point at space on the heap, which is writable
f(q);
cout << q << endl;
delete[] q;
}
The initialization of q is unnecessary here. After the second line q points to space for 4 characters on the heap (3 for the chars and one for the null character). This would work, but there are many other and better solutions to this problem, which vary from situation to situation.

msgpack C++ implementation: How to pack binary data?

I am making use of C++ msgpack implementation. I have hit a roadblock as to how to pack binary data. In terms of binary data I have a buffer of the following type:
unsigned char* data;
The data variable points to an array which is actually an image. What I want to do is pack this using msgpack. There seems to be no example of how to actually pack binary data. From the format specification raw bytes are supported, but I am not sure how to make use of the functionality.
I tried using a vector of character pointers like the following:
msgpack::sbuffer temp_sbuffer;
std::vector<char*> vec;
msgpack::pack(temp_sbuffer, vec);
But this results in a compiler error since there is no function template for T=std::vector.
I have also simply tried the following:
msgpack::pack(temp_sbuffer, "Hello");
But this also results in a compilation error (i.e. no function template for T=const char [6]).
Thus, I was hoping someone could give me advice on how to use msgpack C++ to pack binary data represented as a char array.

Josh provided a good answer but it requires the copying of byte buffers to a vector of char. I would rather minimize copying and use the buffer directly (if possible). The following is an alternative solution:
Looking through the source code and trying to determine how different data types are packed according to the specification, I happened upon msgpack::packer<>::pack_raw(size_t l) and msgpack::packer<>::pack_raw_body(const char* b, size_t l). While there appears to be no documentation for these methods, this is how I would describe them.
msgpack::packer<>::pack_raw(size_t l): This method appends the type identification to buffer (i.e. fix raw, raw16 or raw32) as well as the size information (which is an argument for the method).
msgpack::packer<>::pack_raw_body(const char* b, size_t l): This method appends the raw data to the buffer.
The following is a simple example of how to pack a character array:
msgpack::sbuffer temp_sbuffer;
msgpack::packer<msgpack::sbuffer> packer(&temp_sbuffer);
packer.pack_raw(5); // Indicate that you are packing 5 raw bytes
packer.pack_raw_body("Hello", 5); // Pack the 5 bytes
The above example can be extended to pack any binary data. This allows one to pack directly from byte arrays/buffers without having to copy to an intermediate (i.e. a vector of char).

If you can store your image in a vector<unsigned char> instead of a raw array of unsigned char, then you can pack that vector:
#include <iostream>
#include <string>
#include <vector>
#include <msgpack.hpp>
int main()
{
std::vector<unsigned char> data;
for (unsigned i = 0; i < 10; ++i)
data.push_back(i * 2);
msgpack::sbuffer sbuf;
msgpack::pack(sbuf, data);
msgpack::unpacked msg;
msgpack::unpack(&msg, sbuf.data(), sbuf.size());
msgpack::object obj = msg.get();
std::cout << obj << std::endl;
}
Strangely, this only works for unsigned char. If you try to pack a buffer of char instead (or even an individual char), it won't compile.

MessagePack has a raw_ref type which you could use like so:
#include "msgpack.hpp"
class myClass
{
public:
msgpack::type::raw_ref r;
MSGPACK_DEFINE(r);
};
int _tmain(int argc, _TCHAR* argv[])
{
const char* str = "hello";
myClass c;
c.r.ptr = str;
c.r.size = 6;
// From here on down its just the standard MessagePack example...
msgpack::sbuffer sbuf;
msgpack::pack(sbuf, c);
msgpack::unpacked msg;
msgpack::unpack(&msg, sbuf.data(), sbuf.size());
msgpack::object o = msg.get();
myClass d;
o.convert(&d);
OutputDebugStringA(d.r.ptr);
return 0;
}
Disclaimer: I found this by poking around the header files, not through reading the non-existent documentation on serialising raw bytes, so it may not be the 'correct' way (though it was defined along with all the other 'standard' types a serialiser would want to explicitly handle).

msgpack-c has been updated since the question and answers were posted.
Here is the current situation.
Since msgpack-c version 2.0.0, C-style arrays have been supported. See https://github.com/msgpack/msgpack-c/releases
msgpack-c can pack a const char array such as "hello".
The type conversion rules are documented at https://github.com/msgpack/msgpack-c/wiki/v2_0_cpp_adaptor#predefined-adaptors.
A char array is mapped to STR. If you want to use BIN instead of STR, you need to wrap it with msgpack::type::raw_ref.
That is the packing overview.
Unpacking and converting are described here:
https://github.com/msgpack/msgpack-c/wiki/v2_0_cpp_object#conversion
Unpack means creating msgpack::object from MessagePack formatted byte stream. Convert means converting to C++ object from msgpack::object.
If the MessagePack formatted data is STR and the convert target type is a char array, the data is copied to the array, and if the array has extra capacity, '\0' is added. If the MessagePack formatted data is BIN, '\0' is not added.
Here is a code example based on the original question:
#include <msgpack.hpp>
#include <iomanip>
#include <iostream>
inline
std::ostream& hex_dump(std::ostream& o, char const* p, std::size_t size ) {
o << std::hex << std::setfill('0');
// std::setw applies only to the next output, so set it for every byte
while(size--) o << std::setw(2) << (static_cast<int>(*p++) & 0xff) << ' ';
return o;
}
int main() {
{
msgpack::sbuffer temp_sbuffer;
// since 2.0.0 char[] is supported.
// See https://github.com/msgpack/msgpack-c/wiki/v2_0_cpp_adaptor#predefined-adaptors
msgpack::pack(temp_sbuffer, "hello");
hex_dump(std::cout, temp_sbuffer.data(), temp_sbuffer.size()) << std::endl;
// packed as STR See https://github.com/msgpack/msgpack/blob/master/spec.md
// '\0' is not packed
auto oh = msgpack::unpack(temp_sbuffer.data(), temp_sbuffer.size());
static_assert(sizeof("hello") == 6, "");
char converted[6];
converted[5] = 'x'; // to check overwriting, put NOT '\0'.
// '\0' is automatically added if char-array has enough size and MessagePack format is STR
oh.get().convert(converted);
std::cout << converted << std::endl;
}
{
msgpack::sbuffer temp_sbuffer;
// since 2.0.0 char[] is supported.
// See https://github.com/msgpack/msgpack-c/wiki/v2_0_cpp_adaptor#predefined-adaptors
// packed as BIN
msgpack::pack(temp_sbuffer, msgpack::type::raw_ref("hello", 5));
hex_dump(std::cout, temp_sbuffer.data(), temp_sbuffer.size()) << std::endl;
auto oh = msgpack::unpack(temp_sbuffer.data(), temp_sbuffer.size());
static_assert(sizeof("hello") == 6, "");
char converted[7];
converted[5] = 'x';
converted[6] = '\0';
// only first 5 bytes are written if MessagePack format is BIN
oh.get().convert(converted);
std::cout << converted << std::endl;
}
}
Running Demo:
https://wandbox.org/permlink/mYJyYycfsQIwsekY

char* to char[]

I have char* which is of fixed (known) width but is not null terminated.
I want to pass it into LOG4CPLUS_ERROR("Bad string " << char_pointer); but as it's not null-terminated, it will print past the end of the data.
Any suggestions of some light weight way of getting "(char[length])*char_pointer" without performing a copy?

No, you'll have to deep-copy and null-terminate it. That code expects a null-terminated string and it means a contiguous block of characters ending with a null terminator.

If your goal is to print such a string, you could:
Store the last byte.
Replace it with \0.
Print the string.
Print the stored byte.
Put the stored byte back into the last position of the string.
Wrap all this in a function.

Real iostreams
When you're writing to a real iostream, then you can just use ostream.write() which takes a char* and a size for how many bytes to write -- no null termination necessary. (In fact, any null characters in the string would be written to the ostream, and would not be used to determine the size.)
Logging libraries
In some logging libraries, the stream that you write to is not a real iostream. This is the case in Log4CPP.
However, in Log4CPlus, which appears to be what matt is using, the object that you're writing to is a std::basic_ostringstream<tchar> (see loggingmacros.h and streams.h for the definition, since none of this is obvious from the documentation). There's just one problem: in the macro LOG4CPLUS_ERROR, the first << is already built into the macro, so he won't be able to call LOG4CPLUS_ERROR(.write(char_pointer,length)) or anything like that. Unfortunately, I don't see an easy way around this without deconstructing the LOG4CPLUS_ERROR macro and getting into the internals of Log4CPlus.
Solution
I'm not sure why you're trying to avoid copying the string at this point, since you can see that there's a lot of copying going on inside the logging library. Any attempt to avoid that extra string copy is probably unwarranted optimization.
I'm going to assume that it's an issue of code cleanliness, and maybe an issue of making sure the copy happens inside the LOG4CPLUS_ERROR macro, as opposed to outside it. In that case, just use:
LOG4CPLUS_ERROR("Bad string " << std::string(char_pointer, length));

We're getting hung up on the semantics of conversion between char* and char[]. Take a step back: what are you trying to do? If this is simply a case of streaming out the content of a structure on an error condition, why not do it properly?
e.g.
struct foo
{
char a1[10];
char a2[10];
char a3[10];
char a4[10];
};
// free function to stream the above structure properly..
std::ostream& operator<<(std::ostream& str, foo const& st)
{
str << "foo::a1[";
std::copy(st.a1, st.a1 + sizeof(st.a1), std::ostream_iterator<char>(str));
str << "]\n";
str << "foo::a2[";
std::copy(st.a2, st.a2 + sizeof(st.a2), std::ostream_iterator<char>(str));
str << "]\n";
// ... and so on for a3 and a4
return str;
}
Now you can simply stream out an instance of foo and don't have to worry about null terminated string etc.!

I keep a string reference class in my toolkit just for these type of situations. Here is a greatly abbreviated version of that class. I trimmed away anything that is not relevant to this particular problem:
#include <iostream>
class stringref {
public:
stringref(const char* ptr, unsigned len) : ptr(ptr), len(len) {}
unsigned length() const { return len; }
const char* data() const { return ptr; }
private:
const char* ptr;
unsigned len;
};
std::ostream& operator<< (std::ostream& os, stringref sr) {
const char* data = sr.data();
for (unsigned len = sr.length(); len--; )
os << *data++;
return os;
}
using namespace std;
int main (int argc, const char * argv[])
{
cout << "string: " << stringref("test", 4) << endl;
}
or, in your case:
LOG4CPLUS_ERROR("Bad string " << stringref(char_pointer, length));
should work.
The idea of a string reference class is to keep enough information about a string (a size and a pointer) to refer to any block of memory which represents a string. It relies on you to make sure that the underlying string data is valid throughout the lifetime of a stringref object. This way you can pass around and process string information with a minimum of overhead.

When you know it is of fixed length, why not simply add one more character to the size of the array? Then you can fill this last char with the terminating \0 and all will be fine.

No, you'll have to copy it. There is no proper conversion in the language that you can use to get the array type out of it.
It seems very odd that you want to do this, or that you have a non-terminated C-style string in the first place.
Why are you not using std::string?
