Using UTF-8 characters in log4cxx - C++

I need to be able to use UTF-8-encoded strings with log4cxx. I can print the strings just fine with std::cout (the characters are displayed correctly), but using log4cxx, i.e. putting the strings into the LOG4CXX_DEBUG() macro with a ConsoleAppender, outputs "??" instead of the special characters. I found one suggested solution:
LOG4CXX_DECODE_CHAR(logstring, str);
LOG4CXX_DEBUG(logstring);
where str is my input string, but this does not work. Does anyone have an idea how to make this work? I googled around a bit, but I couldn't find anything useful.
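For reference, the usual calling pattern for these macros looks roughly like the sketch below; the logger variable and the sample string are just placeholders for illustration, not taken from my code:

#include <log4cxx/logger.h>
#include <log4cxx/logstring.h>
#include <string>

log4cxx::LoggerPtr logger = log4cxx::Logger::getLogger("example");
std::string str = "grüße";            // UTF-8 encoded input
LOG4CXX_DECODE_CHAR(logstring, str);  // converts the char data to a log4cxx::LogString
LOG4CXX_DEBUG(logger, logstring);     // the macro takes the logger as its first argument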

You can use
setlocale(LC_CTYPE, "UTF-8");
to set only the character encoding, without changing any other information about the locale.

I ran into the same problem and searched and searched. I found this post; the setlocale approach may work, but I don't like that kind of solution, so I did some more research and finally found a solution.
I reconfigured log4cxx and rebuilt it, and the problem was solved!
Add two more configure options when building log4cxx:
./configure --prefix=blabla --with-apr=blabla --with-apr-util=blabla --with-charset=utf-8 --with-logchar=utf-8
Hope this helps anyone who needs it.

One solution is to use
setlocale(LC_ALL, "en_US.UTF-8");
in my main function. This is OK for me, but if you want a more localizable application, this will probably become hard to track and maintain.
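A minimal sketch of where that call goes, assuming logging is set up with a simple console configuration; the logger name and the message are just examples:

#include <clocale>
#include <log4cxx/logger.h>
#include <log4cxx/basicconfigurator.h>

int main() {
    std::setlocale(LC_ALL, "en_US.UTF-8");       // set the locale before any logging happens
    log4cxx::BasicConfigurator::configure();     // plain console appender
    log4cxx::LoggerPtr logger = log4cxx::Logger::getLogger("main");
    LOG4CXX_DEBUG(logger, "UTF-8 text: grüße");  // special characters should now survive
    return 0;
}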

The first answer didn't work for me, and the second one is more than I want, so I combined the two answers:
setlocale(LC_CTYPE, "xx_XX.UTF-8"); // or "xx_XX.utf8", it means the same
where xx_XX is some language tag. I tried to log strings in many languages with different alphabets (on Linux), including Chinese and both left-to-right and right-to-left languages; so I tried:
setlocale(LC_CTYPE, "it_IT.UTF-8");
and it worked with every language I tested. I cannot understand why a plain "UTF-8" without a language tag xx_XX doesn't work, since I use UTF-8 precisely to be language-independent and shouldn't have to indicate a language. (If somebody knows the reason for that, it would be an interesting improvement to this answer.) Maybe this also depends on the operating system.
Finally, on Linux you can get a list of the available UTF-8 locales by typing in the shell:
# locale -a | grep utf


Linux Printing - How To

I find it hard to explain, but I will try my best. Sometimes in Linux, in the terminal, things get printed but you can still write over them, e.g. when using wget you get a progress bar like this:
[===================> ]
Now if you type something while it is doing this, it will 'overwrite' it. My question is how to recreate this in C++.
Will you use something like
cout <<
or something else?
I hope you understand what I am getting at...
BTW, I am using the most recent version of Arch with Xfce 4.
Printing a carriage return character \r is typically interpreted in Linux as returning you to the beginning of the line. Try this, for example:
std::cout << "Hello\rJ";
The output will be:
Jello
This does depend on your terminal, however, so you should look up the meaning of particular control characters for your terminal.
For a more cross-platform solution and the ability to do more complex text-based user interfaces, take a look at ncurses.
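To tie this to the wget-style progress bar from the question, here is a minimal sketch; the bar width and the sleep interval are arbitrary choices for illustration, not anything wget actually does:

#include <chrono>
#include <iostream>
#include <string>
#include <thread>

int main() {
    const int width = 40;                               // total width of the bar
    for (int percent = 0; percent <= 100; ++percent) {
        int filled = width * percent / 100;             // number of '=' characters to draw
        std::cout << '\r' << '['
                  << std::string(filled, '=') << '>'
                  << std::string(width - filled, ' ')
                  << "] " << percent << '%' << std::flush;  // '\r' rewinds, flush forces output
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    std::cout << '\n';
    return 0;
}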
You can print the special character \b to go back one space. Then you can print a space to blank it out, or another character to overwrite what was there. You can also use \r to return to the beginning of the current output line and write again from there.
Controlling the terminal involves sending various escape sequences to it, in order to move the cursor around and such.
http://www.ibiblio.org/pub/historic-linux/ftp-archives/tsx-11.mit.edu/Oct-07-1996/info/vt102.codes
You could also use ncurses to do this.
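If you go the ncurses route, a minimal sketch looks something like this (compile with -lncurses; the text drawn is just an illustration):

#include <ncurses.h>

int main() {
    initscr();                                        // start curses mode
    mvprintw(0, 0, "Downloading...");                 // print at row 0, column 0
    mvprintw(1, 0, "[===================>          ]");
    refresh();                                        // push the changes to the screen
    getch();                                          // wait for a key press
    endwin();                                         // restore normal terminal behaviour
    return 0;
}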

what locale does wstring support?

In my program I used wstring to print out the text I need, but it gave me random characters (due to a different encoding scheme). For example, I have this block of code:
wstring text;
text.append(L"Some text");
Then I use DirectX to render it on screen. I used to use wchar_t, but I heard it has portability problems, so I switched to wstring. wchar_t worked fine, but it seemed to only accept English characters as far as I could tell (the printout just totally ignored the non-English characters entered), which was fine, until I switched to wstring: then I only got random characters that looked like Chinese and Korean mixed together. Interestingly, my computer's locale for non-Unicode text is Chinese. Based on what I saw, I suspected it would render Chinese characters correctly, so I tried that, and it does display the characters correctly but with a square in front (which is still a kind of incorrect display). I then guessed the encoding might depend on the language locale, so I switched the locale to English (US) (I use Windows 8), restarted, and saw that the Chinese test characters in my source file turned into random garbage (my file is not saved in a Unicode format, since all the text is English). I then tried with English characters, but no luck: the display looked exactly the same and seemed to have nothing to do with the locale. I don't understand why it doesn't display correctly and looks like Asian characters (even when I use the English locale).
Is there some conversion that should be done, or should I save my file in a different encoding? The problem is that I want to display English characters correctly, which should be the default.
In the absence of code that demonstrates your problem, I will give you a correspondingly general answer.
You are trying to display English characters but see Chinese characters. That is what happens when you pass 8-bit ANSI text to an API that expects UTF-16 text. Look for somewhere in your program where you cast from char* to wchar_t*.
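A minimal sketch of that kind of bug and one way to fix it on Windows; the function name widen and the use of CP_ACP are my own choices for illustration, not taken from the question:

#include <windows.h>
#include <string>

std::wstring widen(const std::string& narrow) {
    // Wrong: reinterpreting 8-bit bytes as wide characters produces garbage glyphs.
    // const wchar_t* bad = (const wchar_t*)narrow.c_str();

    // Right: convert the ANSI text to UTF-16 explicitly.
    int len = MultiByteToWideChar(CP_ACP, 0, narrow.c_str(), -1, nullptr, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_ACP, 0, narrow.c_str(), -1, &wide[0], len);
    wide.resize(len - 1);   // drop the extra terminating null counted by the API
    return wide;
}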
First of all, what type of file are you trying to store the text in? Normal txt files are stored as ANSI by default (so are Excel files). So when you try to print a Unicode character to an ANSI file, it will print junk. Two ways of overcoming this problem are:
open the file in UTF-8 or UTF-16 mode and then write to it, or
convert the Unicode text to ANSI before writing it to the file. If you are using Windows, MSDN documents API functions for Unicode-to-ANSI conversion and vice versa. If you are using Linux, search for Unicode-to-ANSI conversion; there are lots of solutions out there.
Hope this helps!
std::wstring does not have any locale/internationalisation support at all. It is just a container for storing sequences of wchar_t.
The problem with wchar_t is that its encoding is unspecified. It might be Unicode UTF-16, or Unicode UTF-32, or Shift-JIS, or something completely different. There is no way to tell from within a program.
You will have the best chances of getting things to work if you ensure that the encoding of your source code is the same as the encoding used by the locale under which the program will run.
But, the use of third-party libraries (like DirectX) can place additional constraints due to possible limitations in what encodings those libraries expect and support.
Bug solved; it turns out to be a casting problem (not a rendering problem as previously stated).
The garbled text is an intermediate product of an internal conversion step using wstringstream (which I forgot to mention); the code is as follows:
wstringstream wss;
wstring text;
text.append(L"some text");
wss << timer->getTime();
text.append(wss.str());
Right after this step the debugger shows the text as a bunch of random characters, but later it somehow converts back so it is readable. The real problem appears at the rendering stage using DirectX: I had left in a C-style cast to wchar_t*, which results in the incorrect rendering.
old:
LPCWSTR lpcwstrText = (LPCWSTR)textToDraw->getText();
new:
LPCWSTR lpcwstrText = (*textToDraw->getText()).c_str();
Changing that solves the problem.
So this was caused by a bad cast, as some kind people pointed out when correcting my statement.

__FILE__ Returns a String with "\/" in the Path

I use the __FILE__ macro for error messages. However, sometimes the path comes back as E:\x\y\/z.ext. It does this for specific files.
For example, E:\programming\v2\wwwindowclass.h comes back as E:\programming\v2\/wwwindowclass.h and E:\programming\v2\test.cpp comes back as E:\programming\v2\test.cpp. In fact, the only file in the directory that works seems to be test.cpp.
To work around this, I used jmucchiello's answer to this question to replace any occurrence of "/" with "\". This worked fine, and the displayed path changed to a normal one.
The problem appeared when I tried it on Windows 7 (after using XP): the string came up as (null) after calling the function.
Along with this, I sometimes get seemingly random "error 2: File not found" errors. I'm unsure whether this is related at all, but if there's an explanation, it would be nice to hear it.
I've tried to find why __FILE__ would be returning the wrong string, but to no avail. I'm using GNU g++ 4.6.1. I'm not actually sure yet if the paths that were wrong in XP were wrong in Windows 7 too. Any insight is appreciated.
The function in the linked question appears to return NULL if there are no changes to make. Probably Windows 7 doesn't suffer from the \/ problem (in some cases).
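A minimal sketch of a replacement helper that sidesteps the NULL-on-no-change pitfall; the function name and the use of std::string are my own choices, not taken from the linked answer:

#include <string>

// Replace every '/' with '\' and return the path unchanged when there is nothing to replace.
std::string normalize_path(const char* file) {
    std::string path(file);
    for (char& c : path) {
        if (c == '/') c = '\\';
    }
    return path;
}

// Usage: std::string where = normalize_path(__FILE__);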
As per MSalters's comment:
Typically, the compiler produces such a path when you pass #include "v2/wwwindowclass.h" to it.
Since every file has its own include statements, you can (but shouldn't) mix the two styles.
This was the case. My compiler automatically adds a forward slash.

How do I work with a C++ program containing non-Latin characters?

I have a C++ program that was written by a Russian-speaking developer, and so it contains Cyrillic characters. When I open the sources they are displayed as garbage. How do I solve this on Windows?
The actual problem is your IDE/editor doesn't display Cyrillic characters correctly. You solve this by changing the IDE/editor settings to use a font that contains Cyrillic characters - for example, Courier New if you're on Windows.
Well, assuming they've actually used ISO C and not some weird Russian variant, the language constructs and standard library calls will be in English (or its strange cousin, American).
The only thing you'll really need to convert are the strings (such as for user output or logging), code comments and variable names.
And even the comments and variable names may not have to change, though they may make the code harder to understand for a non-Russian reader.
If the code contains characters that your current editor doesn't understand, well, you need to get yourself an editor that does. Or get your Russian friends to turn it into English for you.
There isn't a different C++ programming language in Russia, so you just need to translate the strings into the other language, i.e. English. Care must be taken when processing input, since that is where you may find handling of individual characters.
A better approach would be to prepare a proper localization: read all strings from a resource or file, so that you can select the resource that matches your target language.
If you mean that the strings of the program are written in Russian and you want to add English text, you first need to internationalize (i18n) your program, using a library like Gettext instead of static strings (a minimal sketch follows below); then you need to add support for the English locale.
If you mean that the variables and the comments are in Russian and you want them in English, well.. find a translator ;)
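For context, a minimal sketch of what the Gettext route looks like in C++; the domain name "myapp" and the locale directory are placeholders, and on some platforms you may need to link against libintl explicitly:

#include <libintl.h>
#include <clocale>
#include <iostream>

#define _(msgid) gettext(msgid)   // common shorthand for marking translatable strings

int main() {
    std::setlocale(LC_ALL, "");                    // use the locale from the environment
    bindtextdomain("myapp", "/usr/share/locale");  // where the compiled .mo catalogs live
    textdomain("myapp");                           // select this program's message catalog

    // The source keeps one language; the catalog supplies the translation at run time.
    std::cout << _("Hello, world") << std::endl;
    return 0;
}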
Find a translator and give him the code.

How can I make a case-insensitive regexp match for Russian letters?

I have a list of catalog paths and need to filter out some of them. My match pattern is in a non-Unicode encoding.
I tried the following:
require 5.004;
use POSIX qw(locale_h);
my $old_locale = setlocale(LC_ALL);
setlocale(LC_ALL, "ru_RU.cp1251");
@{$data->{doc_folder_rights}} =
    grep {
        # catalog path pattern in $_REQUEST{q}
        $_->{doc_folder} =~ /$_REQUEST{q}/i;
    }
    @{$data->{doc_folder_rights}};
setlocale(LC_ALL, $old_locale);
What I need is case-insensitive regexp matching when the pattern contains Russian letters.
There are several (potential) issues with your code:
Your code filters out all doc_folders that do not match the regexp in $_REQUEST{q}, however the question suggests that you want to do the opposite.
You might have an encoding issue. Setting the locale (using setlocale) changes Perl's handling of upper- and lower-case conversions, but it does not change any encoding. You need to ensure that $_REQUEST{q} is interpreted correctly.
For simplicity you can assume that any Perl string contains Unicode data in some internal representation that you need not know about in detail. Only when Perl does I/O is there an implicit or explicit conversion. When reading from stdin, ARGV or the environment, Perl assumes that the bytes are encoded using the current locale and converts implicitly.
If you have an encoding issue, there are several ways to fix it:
Fix the environment in which Perl runs so that it knows about the correct locale from the very start. That will fix the implicit conversion.
In the unlikely case that $_REQUEST is loaded from a filehandle, you could explicitly tell Perl to convert using binmode($fh, ":encoding(cp1251)");. Do that prior to reading $_REQUEST.
There is the $string = Encode::decode(Encoding, $octets) function, which tells Perl to forget its assumptions about the encoding of $octets and instead treat the contents of $octets as a byte stream that needs to be converted to Unicode using Encoding. You need to do that before touching the contents of $octets, or strange things may happen.
Since $_REQUEST was probably loaded by some CGI module, and was probably URL-encoded in transit, you could just tell the CGI module how to do the decoding correctly.