C++ Null characters in string? - c++

I want to read a txt file and convert two cells from each line to floats.
If I first run:
someString = someString.substr(1, tempLine.size());
And then:
std::stof(someString)
it only converts the first number in 'someString' to a number. The rest of the string is lost.
When I handled the string in my IDE I noticed that copying it and pasting it inside quotation marks gives me "\u00005\u00007\u0000.\u00007\u00001\u00007\u00007\u0000" and not 57.7177.
If I instead do:
std::string someOtherString = "57.7177"
std::stof(someOtherString)
I get 57.7177.
Minimal working example is:
int main() {
std::string someString = "\u00005\u00007\u0000.\u00007\u00001\u00007\u00007\u0000";
float someFloat = std::stof(someString);
return 0;
}
The same problem occurs with both UTF-8 and UTF-16 encoding.
What is happening and what should I do differently? Should I remove the null-characters somehow?

"I want to read a txt file"
What is the encoding of the text file? "Text" is not an encoding. What I suspect is happening is that you wrote code that reads the file as either UTF-8 or Windows-1250 and stored the bytes in a std::string. From the bytes, I can see that the file is actually UTF-16BE, so you need to decode it into a std::u16string. If your program will only ever run on Windows, you can get by with a std::wstring.
You probably have followup questions, but your original question is vague enough that I can't predict what those questions would be.
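As a rough illustration of the point above, here is a minimal sketch that assumes the bytes really are UTF-16BE and that the field only contains ASCII-range characters (digits, sign, decimal point). The helper names utf16be_to_ascii and parse_utf16be_float are made up for this example; they are not from any library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: collapse UTF-16BE byte pairs into a narrow string.
// Assumption: only code units in the ASCII range matter; everything else
// (including NUL) is dropped. A trailing odd byte is ignored.
std::string utf16be_to_ascii(const std::string& bytes) {
    std::string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        unsigned char hi = static_cast<unsigned char>(bytes[i]);
        unsigned char lo = static_cast<unsigned char>(bytes[i + 1]);
        if (i == 0 && hi == 0xFE && lo == 0xFF) {
            continue;  // skip a leading big-endian byte order mark
        }
        if (hi == 0 && lo != 0 && lo < 0x80) {
            out.push_back(static_cast<char>(lo));
        }
    }
    return out;
}

// Hypothetical helper: decode first, then let std::stof see plain digits.
float parse_utf16be_float(const std::string& bytes) {
    return std::stof(utf16be_to_ascii(bytes));
}
```

With the byte sequence from the question (a zero byte in front of every character), utf16be_to_ascii yields "57.7177", which std::stof can then parse in full. For text outside the ASCII range, a real UTF-16 decoder is the safer choice.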

Related

What exactly is the string "^A"?

I run my code on an online judge, and I log the string key. Below is my code:
fprintf(stderr, "key=%s, and key.size()=%d\n", key.c_str(), key.size());
But the result is this:
key=^A, and key.size()=8
I want to know what ^A represents in ASCII. ^A looks like 2 characters rather than 8, but the log shows a size of 8. I view the result in vim, and the log file is encoded as UTF-8. Why?
Your viewer is electing to show you the bytes interpreted using a character encoding of its choosing and electing to show the resulting characters in caret notation.
Other viewers could make different choices on both counts or allow you to indicate what you want. For example, control picture characters (␁) instead of caret notation.
For a std::string, c_str() is terminated by an additional \x00 byte following the actual value. You often use c_str() with functions that expect a string to be \x00-terminated. This applies to fprintf. In such cases, what's read ends just before the first \x00 seen.
You have several \x00 bytes in your string, which, of course, contribute to size(), but fprintf will stop right at the first one (and not count it).
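To make the difference concrete, here is a small sketch. The helper names c_string_length and escape_bytes are invented for this example; the second one is just one possible way to log a string so that embedded \x00 bytes stay visible.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

// What a C-style consumer such as fprintf("%s") would see: everything
// up to, but not including, the first \x00 byte.
std::size_t c_string_length(const std::string& s) {
    return std::strlen(s.c_str());
}

// Hypothetical logging helper: keep printable characters and render every
// other byte as \xNN, so no byte is silently dropped from the log.
std::string escape_bytes(const std::string& s) {
    static const char hex[] = "0123456789abcdef";
    std::string out;
    for (unsigned char c : s) {
        if (std::isprint(c)) {
            out.push_back(static_cast<char>(c));
        } else {
            out += "\\x";
            out.push_back(hex[c >> 4]);
            out.push_back(hex[c & 0x0F]);
        }
    }
    return out;
}
```

For a string built with an explicit length, such as std::string key("\x01\0\0\0\0end", 8), size() reports 8 while c_string_length(key) reports 1, which matches the "^A" followed by size 8 seen in the log.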
I have solved it myself. If you write a std::string "\x01\x00\x00\x00\x00end" to a file and open it with vim later, you will get '^A'.
This is my test code:
// Note: this const char* constructor stops at the first \x00, so sss holds only "\x01".
string sss("\x01\x00\x00\x00\x00end");
ofstream of("of.txt");
for (int i=0; i<sss.size(); i++) {
of.put(sss[i]);
}
of.close();
After opening the file "of.txt", I saw "^A".

Extra character when reading a file. C++

I'm writing two programs that communicate by reading files which the other one writes.
My problem is that when the other program is reading a file created by the first program it outputs a weird character at the end of the last data. This only happens seemingly at random, as adding data to the textfile can result in a normal output.
I'm utilizing C++ and Qt4. This is the part of program 1:
std::ofstream idxfile_new;
QString idxtext;
std::string fname2="some_textfile.txt"; //Imported from a file browser in the real code.
idxfile_new.open (fname2.c_str(), std::ios::out);
idxtext = ui->indexBrowser->toPlainText(); //Grabs data from a dialog of the GUI.
//See 'some_textfile.txt' below
idxfile_new<<idxtext.toStdString();
idxfile_new.clear();
idxfile_new.close();
some_textfile.txt:
3714.1 3715.1 3716.1 3717.1 3719.1 3739.1 3734.1 3738.1 3562.1 3563.1 3623.1
part of program 2:
std::string indexfile = "some_textfile.txt"; //Imported from file browser in the real code
std::ifstream file;
std::string sub;
file.open(indexfile.c_str(), std::ios::in);
while(file>>sub)
{
cerr<<sub<<"\n"; //Stores values in an array in the real code
}
This outputs:
3714.1
3715.1
3716.1
3717.1
3719.1
3739.1
3734.1
3738.1
3562.1
3563.1
3623.1�
If I add more data it works at times. Sometimes it can output data such as
3592.�
or
359�
at the end. So it is not consistent in reading the whole data either. At first I figured it wasn't reading the eof properly, and I have read and tried many solutions to similar problems but can't get it to work correctly.
Thank you guys for the help!
I managed to solve the problem by myself this morning.
For anyone with the same problem I will post my solution.
The problem was the UTF-8 encoding when creating the file. Here's my solution:
Part of program 1:
std::ofstream idxfile_new;
QString idxtext;
std::string fname2="some_textfile.txt";
idxfile_new.open (fname2.c_str(), std::ios::out);
idxtext = ui->indexBrowser->toPlainText();
QByteArray qstr = idxtext.toUtf8(); //Enables Utf8 encoding
idxfile_new<<qstr.data();
idxfile_new.clear();
idxfile_new.close();
The other program is left unchanged.
A hex viewer displayed the extra character as 'ef bf bd', which is the UTF-8 encoding of the replacement character U+FFFD; it replaces invalid bytes when converting to UTF-8.
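If you want to check for this in code rather than in a hex viewer, a minimal sketch could look like the following. The function name find_replacement_char is made up for this example; it only searches for the three bytes EF BF BD and makes no attempt to validate the rest of the data.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset of the first UTF-8 encoded
// U+FFFD (bytes EF BF BD) in `bytes`, or std::string::npos if there is none.
std::size_t find_replacement_char(const std::string& bytes) {
    static const std::string kReplacement = "\xEF\xBF\xBD";
    return bytes.find(kReplacement);
}
```

Running this over each token read by program 2 would flag a value such as "3623.1" followed by the stray character at offset 6, before it ends up in the array.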

error about UTF_8 format while creating my xml using libxml and c++

I created an XML file using libxml and C++. What I want to do now is read from a .txt file and put this text between some specific tags.
I have tried the following code, just reading from a file and write it between tags:
char * s ;
double d;
fichier>>i>>s>>d;
// fichier.close();
cout << s << endl ;
xmlNewChild(root_node, NULL, BAD_CAST "metadata",
BAD_CAST s );
While running this code, I get this error:
output error : string is not in UTF-8
So I guess that there is a format incompatibility between the input and output. Can you help me please? I don't know how to fix this.
You need to convert your input string into UTF-8 using one of the functions defined in the encoding module (or using any other encoding library you like, such as ICU). You can find details about the encoding module here: http://www.xmlsoft.org/html/libxml-encoding.html
My guess is that you want to preserve the bytes so that what you need is something like (VERY untested and derived purely from the docs.)
// Get the encoding handler
xmlCharEncodingHandlerPtr encoder = xmlGetCharEncodingHandler(XML_CHAR_ENCODING_ASCII);
// ASCII bytes map 1:1 to UTF-8; doubling leaves room for encodings such as
// Latin-1, where one byte can take 2 UTF-8 bytes. The +1 is for the terminator.
unsigned char* buffer_utf8 = new unsigned char[length_of_s*2 + 1];
// Do the encoding
int consumed = length_of_s;
int encoded_length = length_of_s*2;
int len = encoder->input(buffer_utf8, &encoded_length, (const unsigned char*)s, &consumed);
if( len<0 ) { .. error .. }
buffer_utf8[len]=0; // I'm not sure if this is automatically appended or not.
// Now you can use buffer_utf8 rather than s.
If your input is in a different encoding supported by libxml, then it should just be a matter of changing XML_CHAR_ENCODING_ASCII to the right constant, though you may need to change the number of bytes allocated for buffer_utf8 too.
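For comparison, here is a sketch that does the conversion by hand instead of going through libxml's encoding handlers. It assumes the input is ISO-8859-1 (Latin-1), which is only a guess about the file in the question; latin1_to_utf8 is a name invented for this example. It also shows where the "at most 2 bytes per input byte" sizing comes from.

```cpp
#include <string>

// Assumption: `input` holds ISO-8859-1 (Latin-1) bytes, so every byte value
// is also the Unicode code point of the character it represents.
std::string latin1_to_utf8(const std::string& input) {
    std::string out;
    out.reserve(input.size() * 2);  // worst case: every byte becomes two
    for (unsigned char c : input) {
        if (c < 0x80) {
            // ASCII range: identical in UTF-8
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF: two-byte UTF-8 sequence 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

The result could then be passed on the same way as in the question, for example as BAD_CAST converted.c_str() in the xmlNewChild call, where converted is the returned string.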

how to get a single character from UTF-8 encoded URDU string written in a file?

I am working on Urdu-Hindi translation/transliteration. My objective is to translate an Urdu sentence into Hindi and vice versa. I am using Visual C++ 2010 with the C++ language. I have written an Urdu sentence in a text file saved in UTF-8 format. Now I want to read that file one character at a time so that I can convert each character into its equivalent Hindi character. When I try to read a single character from the input file and write it to the output file, I get some unknown, garbled character in the output file. Kindly help me with proper code. My code is as follows:
#include<iostream>
#include<fstream>
#include<cwchar>
#include<cstdlib>
using namespace std;
void main()
{
wchar_t arry[50];
wifstream inputfile("input.dat",ios::in);
wofstream outputfile("output.dat");
if(!inputfile)
{
cerr<<"File not open"<<endl;
exit(1);
}
while (!inputfile.eof()) // i am using this while just to
// make sure copy-paste operation of
// written urdu text from one file to
// another when i try to pick only one character
// from file, it does not work.
{ inputfile>>arry; }
int i=0;
while(arry[i] != '\0') // i want to get urdu character placed at
// each-index so that i can work on it to convert
// it into its equivalent hindi character
{ outputfile<<arry[i]<<endl;
i++; }
inputfile.close();
outputfile.close();
cout<<"Hello world"<<endl;
}
Assuming you are on Windows, the easiest way to get "useful" characters is to read a larger chunk of the file (for example a line, or the entire file) and convert it to UTF-16 using the MultiByteToWideChar function. Use the "pseudo"-codepage CP_UTF8. In many cases, decoding the UTF-16 isn't required, but I don't know about the languages you are referring to; if you expect non-BMP characters (with codes above 65535), you might want to consider decoding the UTF-16 (or decoding the UTF-8 yourself) to avoid having to deal with surrogate pairs (characters that take two 16-bit words).
You can also write your own UTF-8 decoder, if you prefer. It's not complicated, and just requires some bit-juggling to extract the proper bits from the input bytes and assemble them into the final unicode value.
HINT: Windows also has a NormalizeString() function, which you can use to make sure the characters from the file are what you expect. This can be used to transform characters that have several representations in Unicode into their "canonical" representation.
EDIT: if you read up on UTF-8 encoding, you can easily see that you can read the first byte, figure out how many more bytes you need, read these as well, and pass the whole thing to MultiByteToWideChar or your own decoder (although your own decoder could just read from the file, of course). That way you could really do a "read one char at a time".
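A minimal sketch of the "write your own decoder" route described above, in portable C++ rather than with MultiByteToWideChar. The function names utf8_sequence_length and decode_utf8 are invented for this example. It substitutes U+FFFD for malformed input, but it does not reject overlong encodings or surrogate code points, so treat it as a starting point only.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in a UTF-8 sequence, judged from its lead byte.
// Returns 0 for a continuation byte or an invalid lead byte.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Decode UTF-8 bytes into one code point per element.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = utf8_sequence_length(lead);
        if (len == 0 || i + len > bytes.size()) {
            out.push_back(0xFFFD);  // invalid or truncated sequence
            ++i;
            continue;
        }
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        if (!ok) {
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Each element of the result is one whole character, so the Urdu letters alef and beh (bytes D8 A7 D8 A8 in the file) come out as the two code points U+0627 and U+0628, which can then be mapped to their Hindi equivalents.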
'w' classes do not read and write UTF-8. They read and write UTF-16. If your file is in UTF-8, reading it with this code will produce gibberish.
You will need to read it as bytes and then convert it, or write it in UTF-16 in the first place.

UCS-2LE text file parsing

I have a text file which was created using some Microsoft reporting tool. The text file includes the BOM 0xFFFE in the beginning and then ASCII character output with nulls between characters (i.e "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8 using UCS-2LE as an input format and UTF-8 as an output format... it works great.
My problem is that I want to read in lines from the UCS-2LE file into strings and parse out the field values and then write them out to a ASCII text file (i.e. Field1 Field2). I have tried the string and wstring-based versions of getline – while it reads the string from the file, functions like substr(start, length) do interpret the string as 8-bit values, so the start and length values are off.
How do I read the UCS-2LE data into a C++ String and extract the data values? I have looked at boost and icu as well as numerous google searches but have not found anything that works. What am I missing here? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring srcBuf;
..
..
while( getline(srcFile, srcBuf) )
{
wstring field1;
field1 = srcBuf.substr(12, 12);
...
...
}
So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using boost (or something else) to read these strings from the file and convert them to a fixed width representation for internal use?
BTW, I am on a Mac using Eclipse and gcc.. Is it possible my STL does not understand wide character strings?
Thanks!
Having spent some good hours tackling this question, here are my conclusions:
Reading a UTF-16 (or UCS-2LE) file is apparently manageable in C++11, see How do I write a UTF-8 encoded string to a file in Windows, in C++
C++11 added the <codecvt> header to the standard library, so one can just use std::codecvt_utf16 without needing boost::locale (see bullet below for eventual code samples); note that it was deprecated in C++17
However, in older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode
Alternatively, one can also try this method of reading, though it did not work in my case. The output would be missing lines which were replaced by garbage chars.
I wasn't able to get this done in my pre-C++11 compiler and had to resort to scripting it in Ruby and spawning a process (it's just in test so I think that kind of complications are ok there) to execute my task.
Hope this spares others some time, happy to help.
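For readers who cannot rely on codecvt_utf16, here is a sketch of a different route: read the file as raw bytes and combine the byte pairs by hand. This is not the codecvt-based approach from the linked answers, and all three helper names (read_file_bytes, decode_utf16le, to_ascii) are made up for this example. It assumes little-endian input with an optional FF FE byte order mark.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read a whole file as raw bytes (binary mode, no conversion).
std::string read_file_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: combine little-endian byte pairs into UTF-16 code
// units, skipping a leading FF FE byte order mark. A trailing odd byte is ignored.
std::u16string decode_utf16le(const std::string& bytes) {
    std::u16string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE) {
        i = 2;
    }
    for (; i + 1 < bytes.size(); i += 2) {
        unsigned lo = static_cast<unsigned char>(bytes[i]);
        unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return out;
}

// Narrow to ASCII for output; anything outside ASCII becomes '?'.
std::string to_ascii(const std::u16string& text) {
    std::string out;
    for (char16_t unit : text) {
        out.push_back(unit < 0x80 ? static_cast<char>(unit) : '?');
    }
    return out;
}
```

After decode_utf16le(read_file_bytes(argv[1])), substr(start, length) counts 16-bit code units instead of bytes, so fixed-position fields line up as expected as long as the data stays within UCS-2. Lines can be split on u'\n' before extracting fields.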
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>
using namespace std;
int main()
{
wstring s1 = L"Hello, world";
wstring s2 = s1.substr(3,5);
wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect. It converts the file from the locale encoding to wchar_t, which causes each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.