Read special characters from file - c++

For a few days I have been trying to solve an encoding issue in my C++ code.
I have a txt file (saved as UTF-8) with the following content: "québécois". (I have only read rights over the file.)
I am reading it with the ReadFile function into a std::string variable.
The first surprise is that strContent is not "québécois" but "quÃ©bÃ©cois" (double encoded?).
Why?
How can I read its content into a std::string variable in order to obtain "québécois" (UTF-8 to ANSI?)
So far I have two ways to get somewhere close to what I want, but neither satisfies me enough.
1. Convert in several steps:
std::string to std::wstring (using MultiByteToWideChar)
create a CString from the wstring
use CT2CA to obtain a new std::string
and finally use MultiByteToWideChar again for a final std::wstring containing "québécois"
I think this method is unsafe and not very smart, and I end up with a wstring, not a string.
2. I found this article:
How to convert from UTF-8 to ANSI using standard c++
If I use the utf8_to_string function twice I obtain what I want, but I am still not sure if it's the right way.
Can anybody help me with a solution for this issue?
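One possible reading of the symptom: the file bytes are read correctly, but the UTF-8 bytes for "é" (0xC3 0xA9) are displayed as two ANSI characters, "Ã©". Under that assumption a single UTF-8 to single-byte conversion is enough. The sketch below is a portable illustration, not the linked article's code: utf8_to_latin1 is a hypothetical helper, and it targets Latin-1, which matches the Windows ANSI code page 1252 for "é" but not for every character.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 and keeps only code points that fit
// in one Latin-1 byte (U+0000..U+00FF). Anything else is reported as an error.
std::string utf8_to_latin1(const std::string& utf8)
{
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);  // plain ASCII, copied unchanged
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < utf8.size()) {
            const unsigned char c2 = static_cast<unsigned char>(utf8[i + 1]);
            if ((c2 & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            // Two-byte sequence 110xxxxx 10yyyyyy carries the bits xxxxxyyyyyy.
            out += static_cast<char>(((c & 0x1F) << 6) | (c2 & 0x3F));
            ++i;
        } else {
            throw std::runtime_error("character not representable in Latin-1");
        }
    }
    return out;
}
```

Calling utf8_to_latin1 on the bytes read from the file would give a std::string with one byte per character. On Windows the more usual route is MultiByteToWideChar with CP_UTF8 followed by WideCharToMultiByte with CP_ACP, which also covers the code page 1252 characters this sketch rejects.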

Related

What function should I use to open file in mac OS X?

I use C++ Builder to create my cross-platform application.
In the app, I get the file name/path from an open file dialog.
On Windows, handling the Unicode string is no problem (e.g. "C:\測試").
On Mac OS X, I can get the correct string from UnicodeString, but I can't find a good method to convert it to a char array and use "fopen" to open the file correctly.
I tried to assign the UnicodeString to AnsiString directly but it became "C:\??".
Because "fopen" only accepts "char*" and UnicodeString can only export "char16*", I need to convert it to char for "fopen".
Any idea?
Just because fopen() takes a char* does not mean you should give it an ANSI string. POSIX APIs on OSX accept UTF-8 encoded filenames, so use UTF8String instead of AnsiString. A char* can point to a UTF-8 string.
Otherwise, don't use fopen() directly. Use the RTL's own functions instead, like FileCreate()/FileOpen() in System.SysUtils.hpp unit, or the TFileStream class in System.Classes.hpp unit. Let the RTL decide internally how to interact with platform APIs for you.
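To illustrate the first suggestion outside C++ Builder, here is a rough standard C++ sketch of what a UTF-16 to UTF-8 conversion does before the path reaches fopen(). Both function names are hypothetical, and the sketch assumes a POSIX system where fopen() accepts UTF-8 paths; in C++ Builder itself, UTF8String already performs this conversion.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: encodes a UTF-16 string as UTF-8.
// Unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Opens a file whose name is held as UTF-16, assuming fopen() takes UTF-8.
std::FILE* open_utf16_path(const std::u16string& path, const char* mode)
{
    return std::fopen(utf16_to_utf8(path).c_str(), mode);
}
```

The point is only that the char* handed to fopen() carries UTF-8 bytes, so no character has to be replaced by "?".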

urlDecode - php function in c++

I have a urlDecode function.
But when I decode a string like:
P%C4%99dz%C4%85cyJele%C5%84
I get output: PędzącyJeleń
Of course this is not the correct output. I think it's broken because of the Polish characters.
I tried setting these options in the compiler:
Use Unicode Character Set
or Use Multi-Byte Character Set
I tried to do it using wstrings but I got a lot of errors :|
I suppose that I should use wstring, not string, but could you tell me how? Is there no easier way to solve my problem? (I have heard a lot about wstring and string and understand little of it. Supposedly wstring should not be used on Linux, but I am on Windows.)
//link to my functions at bottom
http://bogomip.net/blog/cpp-url-encoding-and-decoding/
//EDIT
When I change every string to wstring and fstream to wfstream,
the problem is still there. Look:
z%C5%82omiorz - this wstring (from the file) != złomiorz; instead the function prints L"z197130omiorz"
What is 197130? How do I fix that?
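A note on the numbers: 197 and 130 are 0xC5 and 0x82, the two UTF-8 bytes of "ł", printed as integers. Percent-decoding produces bytes, not characters, so one approach is to decode into a plain std::string of UTF-8 bytes and convert to wide characters only when displaying. The sketch below is a minimal, hypothetical url_decode written for illustration; it is not the code from the linked blog post.

```cpp
#include <cctype>
#include <string>

// Value of one hexadecimal digit; the caller guarantees that c is a hex digit.
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    return std::tolower(c) - 'a' + 10;
}

// Hypothetical url_decode: every %XX becomes one raw byte, so the result
// holds UTF-8 when the URL was encoded from UTF-8 text.
std::string url_decode(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()
            && std::isxdigit(static_cast<unsigned char>(in[i + 1]))
            && std::isxdigit(static_cast<unsigned char>(in[i + 2]))) {
            out += static_cast<char>(hex_value(in[i + 1]) * 16
                                     + hex_value(in[i + 2]));
            i += 2;
        } else if (in[i] == '+') {
            out += ' ';  // PHP's urldecode() also maps '+' to a space
        } else {
            out += in[i];
        }
    }
    return out;
}
```

With this, url_decode("z%C5%82omiorz") holds the UTF-8 bytes of "złomiorz". To print it on a Windows console, the bytes would still need converting to UTF-16, for instance with MultiByteToWideChar and CP_UTF8.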

c++ fstreams open file with utf-16 name

At first I built my project on Linux and it was built around streams.
When I started moving to Windows I ran into some problems.
I have a name of the file that I want to open in UTF-16 encoding.
I try to do it using fstream:
QString source; // content of source is shown on image
char *op= (char *) source.data();
fstream stream(op, std::ios::in | std::ios::binary);
But file cannot be opened.
When I check it,
if(!stream.is_open())
{} // I always get that it's not opened. But file indeed exists.
I tried to do it with wstream, but the result is the same, because wstream accepts only char * too. As I understand it, this happens because the string that is sent as char * is truncated after the first zero, so only one symbol of the file's name is sent and the file is never found. I know wfstream in Visual Studio can accept a wchar_t * as the name, but my compiler of choice is MinGW and it doesn't have such a signature for the wfstream constructor.
Is there any way to do it with STL streams?
ADDITION
That string can contain not only ASCII symbols; it can contain Russian, German and Chinese symbols simultaneously. I don't want to limit myself to ASCII or the local encoding.
NEXT ADDITION
Also data can be different, not only ASCII, otherwise I wouldn't bother myself with Unicode at all.
E.g.
Thanks in advance!
Boost.Filesystem, especially its fstream.hpp header, may help.
If you are using MSVC and its implementation of the C++ standard library, something like this should work:
QString source; // content of source is shown on image
// QString stores UTF-16, and wchar_t is 16 bits wide on Windows
const wchar_t *op = reinterpret_cast<const wchar_t *>(source.utf16());
fstream stream(op, std::ios::in | std::ios::binary);
This works because the Microsoft c++ implementation has an extension to allow fstream to be opened with a wide character string.
Convert the UTF-16 string using WideCharToMultiByte with CP_ACP before passing the filename to fstream.
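Another option, assuming a C++17 toolchain whose standard library implements std::filesystem properly (not a given for older MinGW builds), is to let std::filesystem::path do the conversion from UTF-16 to the native path encoding. The sketch below is an illustration under that assumption, and open_utf16_name is a hypothetical helper name.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Sketch assuming C++17: std::filesystem::path accepts UTF-16 input and
// converts it to the platform's native path encoding, and the stream
// constructors accept a path directly.
std::ifstream open_utf16_name(const std::u16string& name)
{
    const std::filesystem::path p(name);
    return std::ifstream(p, std::ios::in | std::ios::binary);
}
```

With Qt, the argument could come from source.toStdU16String(), assuming a Qt version that provides it. Unlike a CP_ACP conversion, this does not drop characters that fall outside the local code page.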

UCS-2LE text file parsing

I have a text file which was created using some Microsoft reporting tool. The text file includes the BOM 0xFFFE in the beginning and then ASCII character output with nulls between characters (i.e "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8 using UCS-2LE as an input format and UTF-8 as an output format... it works great.
My problem is that I want to read in lines from the UCS-2LE file into strings and parse out the field values and then write them out to a ASCII text file (i.e. Field1 Field2). I have tried the string and wstring-based versions of getline – while it reads the string from the file, functions like substr(start, length) do interpret the string as 8-bit values, so the start and length values are off.
How do I read the UCS-2LE data into a C++ String and extract the data values? I have looked at boost and icu as well as numerous google searches but have not found anything that works. What am I missing here? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring srcBuf;
..
..
while( getline(srcFile, srcBuf) )
{
wstring field1;
field1 = srcBuf.substr(12, 12);
...
...
}
So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using boost (or something else) to read these strings from the file and convert them to a fixed width representation for internal use?
BTW, I am on a Mac using Eclipse and gcc.. Is it possible my STL does not understand wide character strings?
Thanks!
Having spent some good hours tackling this question, here are my conclusions:
Reading a UTF-16 (or UCS-2LE) file is apparently manageable in C++11, see How do I write a UTF-8 encoded string to a file in Windows, in C++
Since C++11 added the <codecvt> header to the standard library, one can just use std::codecvt_utf16 (see the bullet below for code samples)
However, in older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode
Alternatively, one can also try this method of reading, though it did not work in my case. The output would be missing lines which were replaced by garbage chars.
I wasn't able to get this done in my pre-C++11 compiler and had to resort to scripting it in Ruby and spawning a process (it's just in test so I think that kind of complications are ok there) to execute my task.
Hope this spares others some time, happy to help.
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>
using namespace std;
int main()
{
wstring s1 = L"Hello, world";
wstring s2 = s1.substr(3,5);
wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect. It converts the file from the locale encoding to wchar_t, which causes each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.
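One way around that, sketched here under the assumption that the file really is UTF-16LE with an optional FF FE byte order mark, is to read the raw bytes in binary mode and assemble the 16-bit code units by hand. After that, substr() counts characters rather than bytes. Both function names are hypothetical.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Decodes raw UTF-16LE bytes into 16-bit code units, skipping a leading
// byte order mark (FF FE) if present. A trailing odd byte is ignored.
std::u16string decode_utf16le(const std::string& bytes)
{
    std::size_t i = 0;
    if (bytes.size() >= 2
        && static_cast<unsigned char>(bytes[0]) == 0xFF
        && static_cast<unsigned char>(bytes[1]) == 0xFE)
        i = 2;
    std::u16string out;
    for (; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        out += static_cast<char16_t>(lo | (hi << 8));  // low byte comes first
    }
    return out;
}

// Reads a whole file in binary mode and decodes it as UTF-16LE.
std::u16string read_utf16le_file(const char* path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return decode_utf16le(bytes);
}
```

For the example line from the question, decode_utf16le(...).substr(12, 12) then yields "generalities". Since the fields are ASCII, each code unit can be narrowed to char when writing the output file.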

Read Unicode files C++

I have a simple question to ask. I have a UTF-16 text file to read which starts with FFFE. What are the C++ tools to deal with this kind of file? I just want to read it, filter some lines, and display the result.
It looks simple, but I only have experience working with plain ASCII files and I'm in a hurry. I'm using VS C++, but I don't want to work with managed C++.
Regards
Here is a very simple example
wifstream file;
file.open("C:\\appLog.txt", ios::in);
const int bSize = 2048;
wchar_t buffer[bSize];
file.seekg(2); // skip the 2-byte BOM
file.getline(buffer, bSize-1);
wprintf(L"%s\n", buffer);
file.close();
You can use fgetws, which reads 16-bit characters. Your file is in little-endian byte order. Since x86 machines are also little-endian you should be able to handle the file without much trouble. When you want to do output, use fwprintf.
Also, I agree more information could be useful. For instance, you may be using a library that abstracts away some of this.
Since you are in a hurry, use ifstream in binary mode and do your job. I had the same problem as you and this saved my day. (It is not a recommended solution, of course, it's just a hack.)
ifstream file;
file.open("k:/test.txt", ifstream::in|ifstream::binary);
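wchar_t buffer[2048] = {0}; // zero-filled so the text read below stays null-terminated
file.seekg(2); // skip the 2-byte BOM
// line_length is the number of bytes to read; keep it at most sizeof(buffer) - sizeof(wchar_t)
file.read((char*)buffer, line_length);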
wprintf(L"%s\n", buffer);
file.close();
For what it's worth, I think I've read you have to use a Microsoft function which allows you to specify the encoding.
http://msdn.microsoft.com/en-us/library/z5hh6ee9(VS.80).aspx
The FFFE is just the initial BOM (byte order mark). Just read from the file like you normally do, but into a wide char buffer.