TZipFile read UTF-8 - C++

I am trying to extract a (UTF-8) text file from a zip file:
TZipFile *zFile = new TZipFile;
zFile->Open(L"C:\\test.zip", zmRead);
TByteDynArray bda;
zFile->Read(L"test.txt", bda);
zFile->Close();
ShowMessage(WideStringOf(bda));
This doesn't work. I get a string, but with weird content.
If I use zFile->Extract() it works fine, but I don't want to use the disk (performance).
Is there a way to use the read function on a UTF-8 file?

The problem is not with TZipFile itself; the real problem is with WideStringOf().
TZipFile::Read() returns the raw bytes of the specified archived file (decompressing if needed), so your bda variable is a UTF-8 encoded byte array. However, WideStringOf() expects a byte array that is encoded as UTF-16LE instead. That is why you are seeing incorrect results.
To decode the byte array as UTF-8, use this instead:
ShowMessage(TEncoding::UTF8->GetString(bda));
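Putting it together, a minimal sketch of the fully in-memory read (same file names as the question; error handling omitted):

TZipFile *zFile = new TZipFile;
zFile->Open(L"C:\\test.zip", zmRead);
TByteDynArray bda;
zFile->Read(L"test.txt", bda);  // raw decompressed bytes, UTF-8 encoded
zFile->Close();
delete zFile;
ShowMessage(TEncoding::UTF8->GetString(bda));  // decode as UTF-8, not UTF-16LE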

C++ Null characters in string?

I want to read a txt file and convert two cells from each line to floats.
If I first run:
someString = someString.substr(1, tempLine.size());
And then:
std::stof(someString)
it only converts the first number in 'someString' to a number. The rest of the string is lost.
When I handled the string in my IDE I noticed that copying it and pasting it inside quotation marks gives me "\u00005\u00007\u0000.\u00007\u00001\u00007\u00007\u0000" and not 57.7177.
If I instead do:
std::string someOtherString = "57.7177"
std::stof(someOtherString)
I get 57.7177.
Minimal working example is:
int main() {
    std::string someString = "\u00005\u00007\u0000.\u00007\u00001\u00007\u00007\u0000";
    float someFloat = std::stof(someString);
    return 0;
}
The same problem occurs using both UTF-8 and UTF-16 encodings.
What is happening and what should I do differently? Should I remove the null-characters somehow?
"I want to read a txt file"
What is the encoding of the text file? "Text" is not an encoding. What I suspect is happening is that you wrote code that reads the file as either UTF-8 or Windows-1250 encoding and stored it in a std::string. From the bytes, I can see that the file is actually UTF-16BE, so you need to read it into a std::u16string. If your program will only ever run on Windows, then you can get by with a std::wstring.
You probably have followup questions, but your original question is vague enough that I can't predict what those questions would be.
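As an illustration, a minimal sketch of reading a UTF-16BE file into a std::u16string (ReadUtf16Be is a hypothetical helper; it assumes well-formed input and does no BOM handling):

#include <fstream>
#include <iterator>
#include <string>

// Read raw bytes and combine each big-endian pair into a char16_t.
std::u16string ReadUtf16Be(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());
    std::u16string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
    {
        unsigned char hi = bytes[i], lo = bytes[i + 1];
        out.push_back(char16_t((hi << 8) | lo));  // high byte first
    }
    return out;
}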

Do we need to consider encoding (UTF-8) while constructing a string from char* buffer

I am working on an HTTP client module which receives information from a server in a character buffer that is UTF-8 encoded. I wanted to create a std::string object from this character buffer.
Can I create a string object directly by passing the character buffer like this?
std::string receivedstring(receievedbuffer,bufferlength);
Here receievedbuffer is a char[] array which contains data received from a TCP/IP connection, and bufferlength contains the number of bytes received. I am really confused by the term UTF-8. I understand that it's a Unicode encoding; do I need to take any steps before the conversion?
std::string receivedstring(receievedbuffer,bufferlength);
It does not do any conversion, it just copies from receievedbuffer to receivedstring.
If your receievedbuffer was UTF-8 encoded, then the exact same bytes will be stored into receivedstring.
std::string is just a storage format and does not reflect the encoding of the data stored in it.
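A small self-contained illustration (the byte values here are just an example):

#include <iostream>
#include <string>

int main()
{
    // UTF-8 bytes for "café": 'c', 'a', 'f', 0xC3, 0xA9.
    const char receivedbuffer[] = "caf\xC3\xA9";
    std::string receivedstring(receivedbuffer, sizeof(receivedbuffer) - 1);

    // size() counts bytes, not characters: 4 code points, 5 bytes.
    std::cout << receivedstring.size() << '\n';  // prints 5
    return 0;
}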

How to get a single character from a UTF-8 encoded Urdu string written in a file?

I am working on Urdu-Hindi translation/transliteration. My objective is to translate an Urdu sentence into Hindi and vice versa. I am using Visual C++ 2010 with the C++ language. I have written an Urdu sentence in a text file saved in UTF-8 format. Now I want to get a single character at a time from that file so that I can work on it to convert it into its equivalent Hindi character. When I try to get a single character from the input file and write this single character to the output file, I get some unknown, ugly-looking character placed in the output file. Kindly help me with proper code. My code is as follows:
#include <iostream>
#include <fstream>
#include <cwchar>
#include <cstdlib>
using namespace std;

void main()
{
    wchar_t arry[50];
    wifstream inputfile("input.dat", ios::in);
    wofstream outputfile("output.dat");
    if (!inputfile)
    {
        cerr << "File not open" << endl;
        exit(1);
    }
    // i am using this while just to make sure the copy-paste operation of
    // the written urdu text from one file to another works; when i try to
    // pick only one character from the file, it does not work
    while (!inputfile.eof())
    {
        inputfile >> arry;
    }
    // i want to get the urdu character placed at each index so that i can
    // work on it to convert it into its equivalent hindi character
    int i = 0;
    while (arry[i] != '\0')
    {
        outputfile << arry[i] << endl;
        i++;
    }
    inputfile.close();
    outputfile.close();
    cout << "Hello world" << endl;
}
Assuming you are on Windows, the easiest way to get "useful" characters is to read a larger chunk of the file (for example a line, or the entire file), and convert it to UTF-16 using the MultiByteToWideChar function. Use the "pseudo"-codepage CP_UTF8. In many cases, decoding the UTF-16 isn't required, but I don't know about the languages you are referring to; if you expect non-BMP characters (with codes above 65535) you might want to consider decoding the UTF-16 (or decoding the UTF-8 yourself) to avoid having to deal with 2-word (surrogate-pair) characters.
You can also write your own UTF-8 decoder, if you prefer. It's not complicated, and just requires some bit-juggling to extract the proper bits from the input bytes and assemble them into the final unicode value.
HINT: Windows also has a NormalizeString() function, which you can use to make sure the characters from the file are what you expect. This can be used to transform characters that have several representations in Unicode into their "canonical" representation.
EDIT: if you read up on UTF-8 encoding, you can easily see that you can read the first byte, figure out how many more bytes you need, read these as well, and pass the whole thing to MultiByteToWideChar or your own decoder (although your own decoder could just read from the file, of course). That way you could really do a "read one char at a time".
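For illustration, a minimal sketch of the whole-file approach (ReadUtf8File is a hypothetical name; error checking omitted):

#include <windows.h>
#include <fstream>
#include <iterator>
#include <string>

// Read the whole file as raw bytes, then convert UTF-8 to UTF-16.
std::wstring ReadUtf8File(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)),
                      std::istreambuf_iterator<char>());

    // First call computes the required length in wchar_t units.
    int len = MultiByteToWideChar(CP_UTF8, 0, bytes.data(),
                                  (int)bytes.size(), NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, bytes.data(),
                        (int)bytes.size(), &wide[0], len);
    return wide;
}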
The 'w' classes do not read and write UTF-8; they read and write UTF-16. If your file is in UTF-8, reading it with this code will produce gibberish.
You will need to read it as bytes and then convert it, or write it in UTF-16 in the first place.

LPBYTE data to CString in MFC

I am encrypting data using the CryptProtectData function and I am getting the encrypted data in LPBYTE format. I want to save that data into a file and then read it back for decryption.
In order to write the string to a file, I used the following to convert the LPBYTE data to a CString:
CString strEncrUName = (wchar_t *)encryptedUN;
I even tried the approach from "How to convert from BYTE array to CString in MFC?" but it's still not working.
The character set used is Unicode.
Thanks in advance
The encrypted data is a buffer of raw bytes, not characters. If you want to convert it to a string, you'll have to encode it somehow, such as by converting it to hex chars.
e.g. byte 0xd5 becomes 2 chars: "D5"
Looping through each byte and converting it to hex chars is an easy exercise left up to the reader (a sketch follows below).
Of course, you'll have to convert it back to binary after you read the file.
Are you sure you want to save it to a text file? Your other option is to save the binary encrypted data to a binary file: no need to convert to/from a string.
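A sketch of that exercise (BytesToHex is a hypothetical helper):

// Sketch: hex-encode a raw byte buffer so it can be stored as text.
CString BytesToHex(const BYTE* data, DWORD len)
{
    CString hex;
    for (DWORD i = 0; i < len; ++i)
        hex.AppendFormat(_T("%02X"), data[i]);  // e.g. 0xD5 -> "D5"
    return hex;
}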
If your pointer represents a zero-terminated string:
LPBYTE pByte;
CString str(LPCSTR(pByte));

Reading file with cyrillic

I have to open a file with Cyrillic symbols. I've encoded the file into UTF-8. Here is an example:
en: Couldn't your family afford a costume for you
ru: Не ваша семья позволить себе костюм для вас
Here is how I open the file:
ifstream readFile(fileData.c_str());
while (!readFile.eof())
{
    std::getline(readFile, buffer);
    ...
}
The first trouble is that there is some symbol before the text 'en' (I saw this in the debugger):
"ï»¿en: least"
And another trouble is the Cyrillic symbols:
" ru: Ð½Ð°Ð¸Ð¼ÐµÐ½ÑŒÑˆÐ¸Ð¹"
What's wrong?
there is some symbol before text 'en'
That's a faux-BOM, the result of encoding a U+FEFF BYTE ORDER MARK character into UTF-8.
Since UTF-8 is an encoding that does not have a byte order, the faux-BOM shouldn't ever be used, but unfortunately quite a bit of existing software (especially in the MS world) does nonetheless. Load the messages file into a text editor and save it back out again as UTF-8, using a “UTF-8 without BOM” encoding if one is especially listed.
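If re-saving the file isn't an option, you could instead strip the three BOM bytes yourself after reading; a minimal sketch against the question's loop (assuming buffer is a std::string):

std::getline(readFile, buffer);
// Drop a leading UTF-8 BOM (bytes EF BB BF), if present.
if (buffer.size() >= 3 &&
    (unsigned char)buffer[0] == 0xEF &&
    (unsigned char)buffer[1] == 0xBB &&
    (unsigned char)buffer[2] == 0xBF)
{
    buffer.erase(0, 3);
}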
ru: Ð½Ð°Ð¸Ð¼ÐµÐ½ÑŒÑˆÐ¸Ð¹
That's what you get when you've got a UTF-8 byte string (representing наименьший) and you print it as if it were a Code Page 1252 (Windows Western European) byte string. It's not an input problem; you have read in the string OK and have a UTF-8 byte string. But then, in code you haven't quoted, it gets output as cp1252.
If you're just printing it to the console, this is to be expected, as the console always uses the system default code page (1252 on a Western Windows install), and not UTF-8. If you need to send Unicode to the console you'll have to convert the bytes to native-Unicode wchars and write them from there. I don't know what the final destination for your strings is, though; if you're just going to write them to another file or something you could keep them as bytes and not care about what encoding they're in.
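For the console case, a rough sketch on Windows (PrintUtf8 is a hypothetical helper; error checking omitted):

#include <windows.h>
#include <string>

// Convert a UTF-8 byte string to UTF-16 and write it to the console.
void PrintUtf8(const std::string& utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  (int)utf8.size(), NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        (int)utf8.size(), &wide[0], len);
    DWORD written = 0;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE),
                  wide.data(), (DWORD)wide.size(), &written, NULL);
}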
I suppose that your OS is Windows. Several simple ways exist:
Use wchar_t, wstring, wifstream, etc.
Use the ICU library
Use some other library (there are really many of them)
Note: for console printing you must use WinAPI functions to convert UTF-8 to cp866 (my default Cyrillic Windows encoding is cp1251), because the Windows console supports only DOS encodings.
Note: for file printing you need to know what encoding your file uses.
Use libiconv to convert the text to a usable encoding after reading.
Use ICU to convert the text.