I need to read files with different encodings. Unicode files are correctly read using
wxFileInputStream fileInputStream(dialog->GetPath());
wxTextInputStream textInputStream(fileInputStream);
If I need to read, say, Cyrillic (cp1251) files, I use:
wxFileInputStream fileInputStream(dialog->GetPath());
wxTextInputStream textInputStream(fileInputStream, " \n", wxCSConv(wxFONTENCODING_CP1251));
But neither of these ways works with both kinds of files. In .NET we can just use:
new StreamReader(file, Encoding.Default)
So what's the equivalent of Encoding.Default in wxWidgets, or in C++ in general?
Thank you
I believe wxFONTENCODING_SYSTEM would be analogous to Encoding.Default.
The problem was solved by using wxConvAuto(wxFONTENCODING_SYSTEM) instead of wxCSConv(wxFONTENCODING_SYSTEM). The wxConvAuto class first tries to read the file as a Unicode document, and if that fails, falls back to the system encoding to read the ANSI file. It works great!
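In essence, wxConvAuto sniffs the start of the data for a Unicode byte-order mark and only falls back to the system encoding when none is found. A minimal standalone sketch of that detection idea (the function and return values here are my own, not part of the wx API):

```cpp
#include <string>
#include <vector>

// Rough sketch of BOM-based detection, similar in spirit to what
// wxConvAuto does: recognize a UTF-8 or UTF-16 BOM, otherwise assume
// the system/ANSI code page. (wxConvAuto can also detect BOM-less
// UTF-8 by validating the bytes, which this sketch omits.)
std::string DetectEncoding(const std::vector<unsigned char>& data) {
    if (data.size() >= 3 &&
        data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (data.size() >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";
    if (data.size() >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";
    return "system";  // no BOM: fall back to the system encoding
}
```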
I'm writing a C++ program in Visual Studio for class. I am using certain Unicode characters within my program like:
╚, █, ╗, ╝, & ║
I have figured out how to print these characters to the console properly, but I have yet to find a way to output them to a file properly.
In Visual Studio, choosing [OEM United States - Codepage 437] encoding when saving the .cpp file allows it to display properly onto the console.
Now I just need a way to output these characters to a file without errors.
Hopefully someone knows how. Thank You!
Create the file using a wofstream, which uses wide (wchar_t) characters, instead of an ofstream (which uses char).
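A minimal sketch of this, with the assumption that writing the file as UTF-8 (rather than codepage 437) is acceptable; the facet below converts wchar_t to UTF-8 bytes on output. std::codecvt_utf8 is deprecated since C++17 but still shipped by the major standard libraries:

```cpp
#include <codecvt>  // std::codecvt_utf8 (deprecated since C++17, still available)
#include <fstream>
#include <locale>
#include <string>

// Write a wide string to `path`, converting wchar_t to UTF-8 on the
// way out. Swap in a different facet if you truly need CP437 bytes.
void WriteWideFile(const std::string& path, const std::wstring& text) {
    std::wofstream out(path);
    out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
    out << text;
}
```

Usage would be e.g. WriteWideFile("box.txt", L"\u255A\u2588\u2557") for ╚█╗.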
I need to be able to use UTF-8-encoded strings with log4cxx. I can print the strings just fine with std::cout (the characters are displayed correctly). With log4cxx, i.e. putting the strings into the LOG4CXX_DEBUG() macro with a ConsoleAppender, the output shows "??" instead of the special characters. I found one solution:
LOG4CXX_DECODE_CHAR(logstring, str);
LOG4CXX_DEBUG(logstring);
where str is my input string, but this does not work. Does anyone have an idea how this might work? I googled around a bit, but I couldn't find anything useful.
You can use
setlocale(LC_CTYPE, "UTF-8");
to set only the character encoding, without changing any other information about the locale.
I met the same problem and searched a lot. I found this post; the setlocale approach may work, but I didn't like that solution, so I did more research and finally a solution came out.
I reconfigured log4cxx and rebuilt it, and the problem was solved!
Add two more configure options to log4cxx:
./configure --prefix=blabla --with-apr=blabla --with-apr-util=blabla --with-charset=utf-8 --with-logchar=utf-8
Hope this helps anyone who needs it.
One solution is to use
setlocale(LC_ALL, "en_US.UTF-8");
in my main function. This is OK for me, but if you want a more localizable application, this will probably become hard to track and maintain.
The first answer didn't work for me, and the second one is more than I want. So I combined the two answers:
setlocale(LC_CTYPE, "xx_XX.UTF-8"); // or "xx_XX.utf8", it means the same
where xx_XX is some language tag. I needed to log strings in many languages with different alphabets (on Linux, including Chinese and both left-to-right and right-to-left languages), so I tried:
setlocale(LC_CTYPE, "it_IT.UTF-8");
and it worked with every language I tested. I cannot understand why the plain "UTF-8" without a language tag xx_XX doesn't work, since I use UTF-8 precisely to be language-independent and one shouldn't have to indicate a language. (If somebody knows the reason for that, it would be an interesting improvement to this answer.) Maybe this also depends on the operating system.
Finally, on Linux you can get a list of the encodings by typing on shell:
# locale -a | grep utf
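One wrinkle worth noting: setlocale() returns NULL when the requested locale isn't installed, so it pays to check the return value and fall back. A small sketch combining the ideas above (the specific locale names tried here are just examples; which ones exist varies by system):

```cpp
#include <clocale>

// Try to select a UTF-8 character-type locale, falling back through
// progressively more generic choices. Returns the name of the locale
// actually selected, or the environment default as a last resort.
const char* SetUtf8Locale() {
    if (const char* loc = std::setlocale(LC_CTYPE, "it_IT.UTF-8"))
        return loc;  // example language tag; any installed xx_XX.UTF-8 works
    if (const char* loc = std::setlocale(LC_CTYPE, "C.UTF-8"))
        return loc;  // available on many Linux systems
    return std::setlocale(LC_CTYPE, "");  // environment default
}
```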
I try to generate some xml files (TMX) on our servers.
The servers are Solaris SPARC servers, but the destination of the files are some legacy Windows CAT Tools.
The CAT-Tool requires CR+LF line endings, as is the default on Windows. Writing the files with libxml2, using xmlWriter, is easy and works quite well. But I haven't figured out a way to force the lib to emit CR+LF instead of the Unix-standard LF. The lib only seems to support the line ending of the platform it runs on.
Has somebody found a way to generate files with a line ending other than the default of the platform they run on? Currently my workaround is to open the written file and write a new file with the changed line endings using a simple C loop. That works, but it is annoying to have such an unnecessary step in our chain.
I haven't tried this myself, but from xmlsave, I can see two possibilities
xmlSaveToBuffer: save to a buffer, convert to CR/LF and write it out yourself.
xmlSaveToIO: register an iowrite callback and convert to CR/LF while writing in your callback function
Maybe, there are other options, but I haven't found them.
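Either way, the core of the work is the same LF-to-CR/LF translation, whether applied to the buffer from xmlSaveToBuffer or to each chunk inside an xmlSaveToIO write callback. A sketch of that conversion (helper name is mine; note that with xmlSaveToIO a '\n' could in principle arrive at a chunk boundary, which a chunked version would need to track):

```cpp
#include <cstddef>
#include <string>

// Convert bare LF line endings to CR/LF, leaving existing CR/LF pairs
// untouched so the conversion is safe to apply to mixed input.
std::string LfToCrLf(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\n' && (i == 0 || in[i - 1] != '\r'))
            out += '\r';
        out += in[i];
    }
    return out;
}
```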
The CAT-Tool requires CR+LF line endings as is the default on Windows.
FWIW, that means the CAT-Tool has a broken XML parser. It shouldn't care about this, as the XML spec says:
To simplify the tasks of applications, the XML processor must behave as if it normalized all line breaks ... by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
I know often these things are out of our control, but if you can lean on the CAT-Tool vendor to fix their software, it could become a more future-proof solution.
According to the source code (as of April 2013), libxml2 just puts "\n" into the output stream, at least when writing the DTD part of a document. Therefore, re-encoding the stream on the fly is the only option to get "\r\n" as the result.
If you are lucky (as I was) and your tool runs on Windows, you can open the file in text mode and the OS will do the conversion for you.
I have a binary file read/write module in C++. It works fine for English text but fails to read and write the French character set. What changes do I need to make? Does a special encoding type need to be specified? (I have access to the C++ standard library and Qt 4.7 library functions.)
You can try QString::fromUtf8(yourString)
For starters, make sure that your data files are UTF-8 and that you open them as UTF-8. Make sure that your source code files are UTF-8, too, especially if you use any explicit strings in them, but it's better to avoid using explicit strings.
If I'm given a .doc file with special tags in it such as [first_name], how do I go about replacing all occurrences of it with something like "Clark"? A simple binary replacement only works if the replacement string is the exact same length.
Haskell, C, and C++ answers would be best, but any compiled language would do. I'd also prefer to do this without an external library since it has to be deployed on Windows and Linux and cross-platform dependency handling is a bitch.
To summarize...
.doc -> magic program -> .doc with strings replaced
You could use the Word COM component ("Word.Application") on Windows to open the file, do the replacements, save the file, and close it. However, this is Windows-only and can be buggy.
Another thing you could do is use the OpenOffice.org command line interface to convert the file to the ODF format, unzip the file (ODF is mostly zipped XML), do the replacements with the files inside, re-zip the file, and re-convert it to .doc format. However, OpenOffice.org doesn't always read Word files correctly (especially if there is a lot of complex formatting) and it can make it harder to distribute (users must either have OpenOffice.org or you must distribute it with your program).
Also, if you have a file in the .docx format, you can unzip it, do the replacements, and re-zip it.
First read the Word Document Specification.
If that hasn't terrified you, then you should find it fairly straightforward to figure out how to read and write it. It must be possible; Word manages to do it most of the time.
You probably have to use .Net programming (VB or C#) to create an object of Word.Application and then use the MS Word object model to manipulate your document.
Why do you want to be using C/C++/Haskell or another compiled language? I'm not too familiar with Haskell, but in general I would say that C is not a great language for performing text processing. A lot of interpreted languages (Perl, Python, etc.) also have powerful regular expression libraries that are suited for finding and replacing phrases.
With that said, as the other posters have noted, you will still have to deal with the eccentricities of the .doc format.