C++ - Converting a char to wchar_t. Getting a segfault - c++

I'm trying to write a small program that reads a character from an istream and converts it to a wchar_t. I'm getting a segfault. Here's my code:
#include <iostream>
using namespace std;
wchar_t read(istream &stream) {
char *c;
stream.read(c, sizeof(*c));
cout << *c << endl;
wchar_t retChar = static_cast<wchar_t>(*c);
return retChar;
}
int main() {
cout << "Write something" << endl;
read(cin);
}
My logic here is:
Create an array of chars because read only takes arrays of chars.
Read in bytes equal to the size of a character. i.e. read a character and store it in the array c.
Create a wchar_t and cast that character *c into a wchar_t.
return wchar_t
Since I'm getting a segfault, there's obviously something wrong here. I can't see it though. Any help would be appreciated.
Thanks SO

Let's step through the code to show OP what's going on and why it won't work. Then we'll look at a way to do what they want that stays as close as possible to their intent, followed by a hint on how to do this a bit better in the C++ world.
wchar_t read(istream &stream) {
char *c;
Declares a pointer c and doesn't point it at anything. c is an uninitialized variable. Think of it like being invited to Steve's house for a party, but no one told you where he lives. Odds are very good that wherever you go, it won't be Steve's house.
stream.read(c, sizeof(*c));
sizeof(*c) is the size of one char, which is always 1 byte (usually 8 bits), but c still hasn't been pointed at anything, so this is Undefined Behaviour. There is no telling what the program will do, but most likely it reads one byte into some unknown place in memory. Maybe this causes a crash because you can't write there. Maybe it writes over something that it is allowed to write over and screws up something else.
cout << *c << endl;
Tries to print the character that c points at. If the program survived the read above, odds are good it will survive this as well, but this is also Undefined Behaviour.
wchar_t retChar = static_cast<wchar_t>(*c);
This will literally stuff one character's worth of data into a wide character. It will not convert it according to locale or any other character encoding. char is a numeric code that has been defined to be interpreted as a character. A cast will blindly copy that value into retChar: for 'A' in ASCII encoding, retChar now equals 65. 65 could mean anything depending on the encoding used by wchar_t. It might still mean 'A', but sorry Ayn Rand, this is one case where A may well not be A.
return retChar;
}
To do what OP was trying to do (and ignoring that there are better ways to do this for the time being):
#include <iostream>
using namespace std;
wchar_t read(istream &stream) {
char c[2];
Allocates an array of characters. Why? Because the easiest way I know of is to do the conversion on a string.
stream.read(c, sizeof(c[0]));
c is now an array which decays to a pointer. We only want to read one char, so sizeof(c[0]) gets the size of the first element in the array.
c[1] = '\0';
cout << c << endl;
Null terminate and print.
wchar_t retChar[2];
Again, an array.
mbstowcs(retChar, c, 1);
Converts one character from char to wide char using whatever locale has been set. Read more on locales here: http://en.cppreference.com/w/cpp/locale/setlocale
And documentation on mbstowcs: http://en.cppreference.com/w/cpp/string/multibyte/mbstowcs
return retChar[0];
}
Put all together with a quick tester:
#include <iostream>
#include <cstdlib>
wchar_t read(std::istream &stream)
{
char c[2];
stream.read(c, sizeof(c[0]));
c[1] = '\0';
std::cout << c << std::endl;
wchar_t retChar[2];
mbstowcs(retChar, c, 1);
return retChar[0];
}
int main()
{
std::wcout << read(std::cin) << std::endl;
}
This is simple, but ugly in the C++ world, where you should stick to strings where possible. In that case, look into std::wstring_convert (note that it is deprecated as of C++17).

Related

Writing std::string with non-ascii data to file

Below is a simplified example of my problem. I have some external byte data which appears to be a string with cp1252 encoded degree symbol 0xb0. When it is stored in my program as an std::string it is correctly represented as 0xffffffb0. However, when that string is then written to a file, the resulting file is only one byte long with just 0xb0. How do I write the string to the file? How does the concept of UTF-8 come into this?
#include <iostream>
#include <fstream>
typedef struct
{
char n[40];
} mystruct;
static void dump(const std::string& name)
{
std::cout << "It is '" << name << "'" << std::endl;
const char *p = name.data();
for (size_t i=0; i<name.size(); i++)
{
printf("0x%02x ", p[i]);
}
std::cout << std::endl;
}
int main()
{
const unsigned char raw_bytes[] = { 0xb0, 0x00};
mystruct foo;
foo = *(mystruct *)raw_bytes;
std::string name = std::string(foo.n);
dump(name);
std::ofstream my_out("/tmp/out.bin", std::ios::out | std::ios::binary);
my_out << name;
my_out.close();
return 0;
}
Running the above program produces the following on STDOUT
It is '�'
0xffffffb0
First of all, this is a must read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Now, when you are done with that, you have to understand what type p[i] is.
It is char, which is a small integer type, and on most platforms it is signed: char can be negative!
Now, since you have cp1252 characters, some of them are outside the range of ASCII. With a signed char, these characters are seen as negative values!
When they are converted to int, the sign bit is replicated, and when you print the result you see 0xffffff<actual byte value>.
To handle that in C, first you should cast to unsigned char:
printf("0x%02x ", (unsigned char)p[i]);
then the default conversion will fill in the missing bits with zeros and printf() will give you a proper value.
Now, in C++ this is a bit more nasty, since the stream operators treat char and unsigned char as characters. So to print them as hex, do it like this:
int charToInt(char ch)
{
return static_cast<int>(static_cast<unsigned char>(ch));
}
std::cout << std::hex << charToInt(s[i]);
Now, a direct conversion from char to unsigned int will not fix the problem, since the compiler will silently perform a conversion to int first.
See here: https://wandbox.org/permlink/sRmh8hZd78Oar7nF
UTF-8 has nothing to do with this issue: the string holds a single byte, 0xb0, and that is exactly what gets written to the file.
Off-topic: please, when you write pure C++ code, do not use C. It is pointless and makes code harder to maintain, and it is not faster. So:
do not use char* or char[] to store strings. Just use std::string.
do not use printf(), use std::cout (or the fmt library, if you like format strings; it became the basis for std::format in C++20).
do not use alloc(), malloc(), free() - in modern C++, use std::make_unique() and std::make_shared().

Pointer of a character in C++

Going by the books, the first cout line should print the address of the location where the char variable b is stored, as it does for the int variable a. But the first cout statement prints an odd 'dh^#', while the second statement correctly prints a hex value, '0x23fd68'. Why is this happening?
#include<iostream>
using namespace std;
int main()
{
char b='d';
int a=10;
char *c=new char[10];
c=&b;
int *e=&a;
cout<<"c: "<<c<<endl;
cout<<"e: "<<e;
}
There is a non-member overload of operator<< for std::basic_ostream that takes a const char*; it doesn't write the address, but rather the (presumed) C-style string (see footnote 1). In your case, since you have assigned the address of a single character, there is no NUL terminator, and thus no valid C-style string. The code exhibits undefined behavior.
The behavior for int* is different, as there is no special handling for pointers to int, and the statement writes the address to the stream, as expected.
If you want to get the address of the character instead, use a static_cast:
std::cout << static_cast<void*>( c ) << std::endl;
1) A C-style string is a sequence of characters, terminated by a NUL character ('\0').
Actually, this program has a problem: there is a memory leak.
char *c=new char[10];
c=&b;
This allocates 10 characters on the heap, but then the pointer to the heap is overwritten with the address of the variable b.
When a char* is written to cout with operator<<, it is treated as a null-terminated C string. Since c holds the address of b, a single character containing 'd', operator<< keeps reading past it on the stack until it finds the first null character. It seems that one was found after a few characters, so dh^# is written (the d is the value of variable b; the rest is just whatever bytes happened to be on the stack before the first \0).
If you want to get the address try to use static_cast<void*>(c).
My example:
int main() {
char *c;
char b = 'd';
c = &b;
cout << c << ", " << static_cast<void*>(c) << endl;
}
And the output:
dÌÿÿ, 0xffffcc07
See the strange characters after 'd'.
I hope this could help a bit!

Returning base address in C(Pointers)

I am learning pointers and I tried the following program:
#include <iostream>
#include <cstdlib>
#include <cstdio>
using namespace std;
char* getword()
{
char*temp=(char*)malloc(sizeof(char)*10);
cin>>temp;
return temp;
}
int main()
{
char *a;
a=getword();
cout<<a;
return 0;
}
To my level of understanding, a is a pointer to a character, and in the function getword() I returned temp, which I think is the base address &temp[0]. I thought that the output would be the first character of the string I enter, but I got the entire string in stdout. How does this work?
In the tradition of C, a char* represents a string. Indeed, any string literal in your program (e.g. "hello") is an array of const char that decays to a const char *.
Thus, operator<<(std::ostream&, const char *) is implemented as a string output. It will output characters beginning at the address it is given, until it encounters the string terminator (otherwise known as the null terminator, or '\0').
If you want to output a single character, you need to dereference the pointer into a char type. You can choose one of the following syntaxes:
cout << *a; // Dereference the pointer
cout << a[0]; // Use array index of zero to return the value at that address
It should be noted that the code you provided isn't very C++ish. For starters, we generally don't use malloc in C++. You then leak the memory by not calling free later. The memory is uninitialised and relies on cin succeeding (which might not be the case). Also, you can only handle input strings of up to 9 characters before you will get undefined behaviour.
Perhaps you should learn about the <string> library and start using it.
It's true that char* "points to a character". But, by convention, and because with pointers there is no other way to do so, we also use it to "point to more than one character".
Since use of char* almost always means you're using a pointer to a C-style string, the C++ streams library makes this assumption for you, printing the char that your pointer points to … and the next … and the next … and the next until a null terminator ('\0') is found. That's just the way it's been designed to work.
You can print just that character if you like by dereferencing the pointer to obtain an actual char.
operator<< is overloaded for std::cout, and when it receives a char * as an operand it treats it as a pointer to a C-style string and prints the entire string.
If you want to print the first character of the string then use
cout << *a;
or
cout << a[0];
In your code, std::cout is an ostream and providing a char* variable as input to operator<< invokes a particular operator function overload to write characters to the ostream.
std::ostream also has an operator overload for writing a single character to itself.
I'm assuming you now know how to dereference a char* variable, but you should be using std::string instead of an unsafe char* type.
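Here is a corrected C version:
#include <stdio.h>
#include <stdlib.h>
char* getword()
{
    char *temp = (char*)malloc(sizeof(char) * 10);
    /* %9s reads at most 9 characters, leaving room for the '\0' */
    if (temp != NULL && scanf("%9s", temp) != 1)
        temp[0] = '\0'; /* nothing was read: return an empty string */
    return temp;
}
int main()
{
    char *a = getword();
    if (a == NULL)
        return 1;
    int currChar = 0; /* index of the character to print; increment to get the next one */
    printf("%c", *(a + currChar));
    free(a); /* release the buffer allocated in getword() */
    return 0;
}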

How do you best utilize wcsdup?

I'm writing code and a good portion of it requires returning wchar arrays. Returning wstrings isn't really an option (although I can use them) and I know I can pass a pointer as an argument and populate that, but I'm looking specifically to return a pointer to this array of wide chars. In the first few iterations, I found that I would return the arrays all right, but by the time they were processed and printed, the memory would be overwritten and I would be left with gibberish. To fix this, I started using wcsdup, which fixed everything, but I'm struggling to grasp exactly what is happening, and thus when it should be called so that it works and I leak no memory. As it is, I pretty much use wcsdup every time I return a string and every time a string is returned, which I know leaks memory. Here is what I'm doing. Where and why should I use wcsdup, or is there a better solution than wcsdup altogether?
wchar_t *intToWChar(int toConvert, int base)
{
wchar_t converted[12];
/* Conversion happens... */
return converted;
}
wchar_t *intToHexWChar(int toConvert)
{
/* Largest int is 8 hex digits, plus "0x", plus \0 is 11 characters. */
wchar_t converted[11];
/* Prefix with "0x" for hex string. */
converted[0] = L'0';
converted[1] = L'x';
/* Populate the rest of converted with the number in hex. */
wchar_t *hexString = intToWChar(toConvert, 16);
wcscpy((converted + 2), hexString);
return converted;
}
int main()
{
wchar_t *hexConversion = intToHexWChar(12345);
/* Other code. */
/* Without wcsdup calls, this spits out gibberish. */
wcout << "12345 in Hex is " << hexConversion << endl;
}
wchar_t *intToWChar(int toConvert, int base)
{
wchar_t converted[12];
/* Conversion happens... */
return converted;
}
This returns a pointer to a local variable.
wchar_t *hexString = intToWChar(toConvert, 16);
After this line, hexString will point to invalid memory and using it is undefined (may still have value or may be garbage!).
You do the same thing with the return from intToHexWChar.
Solutions:
use std::wstring
use std::vector<wchar_t>
pass in an array to the function for it to use
use smart pointers
use dynamic memory allocation (please don't!)
Note: you might also need to change to wcout instead of cout
Since you tagged your question with 'C++' the answer is a resounding: no, you should not use wcsdup at all. Instead, for passing arrays of wchar_t values around, use std::vector<wchar_t>.
If needed, you can turn those into a wchar_t* by taking the address of the first element (since vectors are guaranteed to be stored in contiguous memory), as long as the vector ends with a terminating L'\0', e.g.
wcout << "12345 in Hex is " << &hexConversion[0] << endl;

Memory allocation of character array in C++

#include<iostream>
using namespace std;
int main()
{
cout<<"Enter\n";
char ch[0];
cin>>ch;
cout<<sizeof ch;
cout<<sizeof &ch;
cout<<"\n You entered \n"<<ch<<"\n";
return 0;
}
I use the g++ compiler to compile the C++ program. What is the difference in memory allocation between char ch and char ch[0]? ch can accept one character, but ch[0] can accept many characters (I put in qqqqqqqq). Why? Also, why does sizeof ch return 0 whereas sizeof &ch gives 4, yet ch accepts more than four characters?
Let's take this line by line:
char ch[0];
declares a zero-length array of characters. As mentioned in the comments, the C++ standard doesn't actually allow this, but g++ accepts it as an extension.
cin >> ch;
puts whatever is read into the zero-length array. Since cin doesn't know the length of your array (it decays into a char * when it gets passed to the stream), it writes as much as it reads, which means it can store more than zero characters, overwriting whatever memory follows the array. That is undefined behaviour.
cout << sizeof ch;
outputs the size (ie. total space used in bytes) of the array, which is zero, as mentioned earlier.
cout << sizeof &ch;
outputs the size of the address of the array, which is 4 bytes on your architecture, since it's just a pointer.
cout << "You entered\n" << ch << "\n";
outputs whatever was stored into the array, which isn't well-defined, because, as I mentioned, zero-length arrays are not standard C++. Further, since cout just writes memory until it encounters a null byte (\0), it writes as much as you stored, since again, cout doesn't care whether you declared ch with size 0 or size 413.
What is the difference in memory allocation of char ch and char ch[0].
The difference is in how much the compiler is going to protect you from yourself. An array is just a sequential block of memory; C++ has no array-out-of-bounds exceptions and doesn't check array bounds before writing. By using
char ch[0];
you have told the C++ compiler you will be responsible for bounds checking. This is a buffer overflow (remember those?). When you read something like 'qqqq' into ch, you overwrite some other piece of memory that belongs to some other variable or part of your program. Try running the program below if you need a better understanding.
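// Note that I'm setting arr[10], which in Java (or many other modern languages) would
// throw an array-out-of-bounds exception. In C++ it is undefined behaviour.
// I haven't run this program, but you may well get 'a' printed
// to standard out if achar happens to sit right after arr in memory
#include <iostream>
int main()
{
    char arr[10], achar;
    arr[10] = 'a'; // one past the last valid index, arr[9]
    std::cout << achar;
}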