GetFileAttributeW fails for non-ASCII characters - c++

So I am trying to check if a given file exists or not. Following this answer I tried GetFileAttributesW. It works just fine for any ascii input, but it fails for ß, ü and á (and any other non-ascii character I suspect). I get ERROR_FILE_NOT_FOUND for filenames with them and ERROR_PATH_NOT_FOUND for pathnames with them, as one would expect if they didn't exists.
I made 100% sure that they did. I spend 15 minutes on copying filenames to not make typos and using literals to avoid any bad input. I couldn't find any mistake.
Since all of these characters are non-ascii characters I stopped trying, because I suspected I might have screwed up with encodings. I just can't spot it. Is there something I am missing? I link against Kernel32.lib
Thanks!
#include <stdio.h>
#include <iostream>
#include <string>
#include "Windows.h"
void main(){
while(true){
std::wstring file_path;
std::getline(std::wcin, file_path);
DWORD dwAttrib = GetFileAttributesW(file_path.data());
if(dwAttrib == INVALID_FILE_ATTRIBUTES){
printf("error: %d\n", GetLastError());
continue;
}
if(!(dwAttrib & FILE_ATTRIBUTE_DIRECTORY))
printf("valid!\n");
else
printf("invalid!\n");
}
}

It's extremely hard to make Unicode work well in a console program on Windows, so let's start by removing that aspect of it (for now).
Modify your program so that it looks like this:
#include <cstdio>
#include <iostream>
#include <string>
#include "Windows.h"
int main() {
std::wstring file_path = L"fooß.txt";
DWORD dwAttrib = GetFileAttributesW(file_path.data());
if (dwAttrib == INVALID_FILE_ATTRIBUTES)
printf("error: %d\n", GetLastError());
if (!(dwAttrib & FILE_ATTRIBUTE_DIRECTORY))
printf("valid!\n");
else
printf("invalid!\n");
return 0;
}
Make sure this file is saved with a byte-order mark (BOM), even if you're using UTF-8. Windows applications, including Visual Studio and the compilers, can be very picky about that. If your editor won't do that, use Visual Studio to edit the file and then use Save As, click the down arrow next to the Save button, choose With Encoding. In the Advanced Save Options dialog, choose "Unicode (UTF-8 with signature) - Codepage 65001".
Make sure you have a file named fooß.txt in the current folder. I strongly recommend using a GUI program to create this file, like Notepad or Explorer.
This program works. If you still get a file-not-found message, check to make sure the temporary file is in the working directory or change the program to use an absolute path. If you use an absolute path, use backslashes and make sure they are all properly escaped. Check for typos, the extension, etc. This code does work.
Now, if you take the file name from standard input:
std::wstring file_path;
std::getline(std::wcin, file_path);
And you enter fooß.txt in the console window, you'll probably find that it doesn't work. And if you look in the debugger, you'll see that the character that should be ß is something else. For me, it's á, but it might be different for you if your console codepage is something else.
ß is U+00DF in Unicode. In Windows 1252 (the most common codepage for Windows users in the U.S.), it's 0xDF, so it might seem like there's no chance of a conversion problem. But the console windows (by default) use OEM code pages. In the U.S., the common OEM codepage is 437. So when I try to type ß in the console, that's actually encoded as 0xE1. Surprise! That's the same as the Unicode value for á. And if you manage to enter a character with the value 0xDF, you'll see that corresponds to the block character you reported in the original question.
You would think (well, I would think) that asking for the input from std::wcin would do whatever conversion is necessary. But it doesn't, and there's probably some legacy backward compatibility reason for that. You could try to imbue the stream with the "proper" codepage, but that gets complicated, and I've never bothered trying to make it work. I've simply stopped trying to use anything other than ASCII on the console.

Related

C++ output Unicode in variable

I'm trying to output a string containing unicode characters, which is received with a curl call. Therefore, I'm looking for something similar to u8 and L options for literal strings, but than applicable for variables. E.g.:
const char *s = u8"\u0444";
However, since I have a string containing unicode characters, such as:
mit freundlichen Grüßen
When I want to print this string with:
cout << UnicodeString << endl;
it outputs:
mit freundlichen Gr??en
When I use wcout, it returns me:
mit freundlichen Gren
What am I doing wrong and how can I achieve the correct output. I return the output with RapidJSON, which returns the string as:
mit freundlichen Gr��en
Important to note, the application is a CGI running on Ubuntu, replying on browser requests
If you are on Windows, what I would suggest is using Unicode UTF-16 at the Windows boundary.
It seems to me that on Windows with Visual C++ (at least up to VS2015) std::cout cannot output UTF-8-encoded-text, but std::wcout correctly outputs UTF-16-encoded text.
This compilable code snippet correctly outputs your string containing German characters:
#include <fcntl.h>
#include <io.h>
#include <iostream>
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
// ü : U+00FC
// ß : U+00DF
const wchar_t * text = L"mit freundlichen Gr\u00FC\u00DFen";
std::wcout << text << L'\n';
}
Note the use of a UTF-16-encoded wchar_t string.
On a more general note, I would suggest you using the UTF-8 encoding (and for example storing text in std::strings) in your cross-platform C++ portions of code, and convert to UTF-16-encoded text at the Windows boundary.
To convert between UTF-8 and UTF-16 you can use Windows APIs like MultiByteToWideChar and WideCharToMultiByte. These are C APIs, that can be safely and conveniently wrapped in C++ code (more details can be found in this MSDN article, and you can find compilable C++ code here on GitHub).
On my system the following produces the correct output. Try it on your system. I am confident that it will produce similar results.
#include <string>
#include <iostream>
using namespace std;
int main()
{
string s="mit freundlichen Grüßen";
cout << s << endl;
return 0;
}
If it is ok, then this points to the web transfer not being 8-bit clean.
Mike.
containing unicode characters
You forgot to specify which unicode encoding does the string contain. There is the "narrow" UTF-8, which can be stored in a std::string and printed using std::cout, as well as wider variants, which can't. It is crucial to know which encoding you're dealing with. For the remainder of my answer, I'm going to assume you want to use UTF-8.
When I want to print this string with:
cout << UnicodeString << endl;
EDIT:
Important to note, the application is a CGI running on Ubuntu, replying on browser requests
The concerns here are slightly different from printing onto a terminal.
You need to set the Content-Type response header appropriately or else the client cannot know how to interpret the response. For example Content-Type: application/json; charset=utf-8.
You still need to make sure that the source string is in fact the correct encoding corresponding to the header. See the old answer below for overview.
The browser has to support the encoding. Most modern browsers have had support for UTF-8 a long time now.
Answer regarding printing to terminal:
Assuming that
UnicodeString indeed contains an UTF-8 encoded string
and that the terminal uses UTF-8 encoding
and the font that the terminal uses has the graphemes that you use
the above should work.
it outputs:
mit freundlichen Gr??en
Then it appears that at least one of the above assumptions don't hold.
Whether 1. is true, you can verify by inspecting the numeric value of each code unit separately and comparing it to what you would expect of UTF-8. If 1. isn't true, then you need to figure out what encoding does the string actually use, and either convert the encoding, or configure the terminal to use that encoding.
The terminal typically, but not necessarily, uses the system native encoding. The first step of figuring out what encoding your terminal / system uses is to figure out what terminal / system you are using in the first place. The details are probably in a manual.
If the terminal doesn't use UTF-8, then you need to convert the UFT-8 string within your program into the character encoding that the terminal does use - unless that encoding doesn't have the graphemes that you want to print. Unfortunately, the standard library doesn't provide arbitrary character encoding conversion support (there is some support for converting between narrow and wide unicode, but even that support is deprecated). You can find the unicode standard here, although I would like to point out that using an existing conversion implementation can save a lot of work.
In the case the character encoding of the terminal doesn't have the needed grapehemes - or if you don't want to implement encoding conversion - is to re-configure the terminal to use UTF-8. If the terminal / system can be configured to use UTF-8, there should be details in the manual.
You should be able to test if the font itself has the required graphemes simply by typing the characters into the terminal and see if they show as they should - although, this test will also fail if the terminal encoding does not have the graphemes, so check that first. Manual of your terminal should explain how to change the font, should it be necessary. That said, I would expect üß to exist in most fonts.

printing Unicode characters C++

I'm trying to write a simple command line app to teach myself Japanese, but can't seem to get Unicode characters to print. What am I missing?
#include <iostream>
using namespace std;
int main()
{
wcout << L"こんにちは世界\n";
wcout << L"Hello World\n"
system("pause");
}
In this example only "Press any key to continue" is displayed. Tested on Visual C++ 2013.
This is not easy on Windows. Even when you manage to get the text to the Windows console you still need to configure cmd.exe to be able to display Japanese characters.
#include <iostream>
int main() {
std::cout << "こんにちは世界\n";
}
This works fine on any system where:
The compiler's source and execution encodings include the characters.
The output device (e.g., the console) expects text in the same encoding as the compiler's execution encoding.
A font with the appropriate characters is available (usually not a problem).
Most platforms these days use UTF-8 by default for all these encodings and so can support the entire Unicode range with code similar to the above. Unfortunately Windows is not one of these platforms.
wcout << L"こんにちは世界\n";
In this line the string literal data is (at compile time) converted from the source encoding to the execution wide encoding and then (at run time) wcout uses the locale it is imbued with to convert the wchar_t data to char data for output. Where things go wrong is that the default locale is only required to support characters from the basic source character set, which doesn't even include all ASCII characters, let alone non-ASCII characters.
So the conversion results in an error, putting wcout into a bad state. The error has to be cleared before wcout will function again, which is why the second print statement does not output anything.
You can work around this for a limited range of characters by imbuing wcout with a locale that will successfully convert the characters. Unfortunately the encoding that is needed to support the entire Unicode range this way is UTF-8; Although Microsoft's implementation of streams supports other multibyte encodings it very specifically does not support UTF-8.
For example:
wcout.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8_utf16<wchar_t>()));
SetConsoleOutputCP(CP_UTF8);
wcout << L"こんにちは世界\n";
Here wcout will correctly convert the string to UTF-8, and if the output were written to a file instead of the console then the file would contain the correct UTF-8 data. However the Windows console, even though configured here to accept UTF-8 data, simply will not accept UTF-8 data written in this way.
There are a few options:
Avoid the standard library entirely:
DWORD n;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), L"こんにちは世界\n", 8, &n, nullptr);
Use non-standard magical incantation that will break standard code:
#include <fcntl.h>
#include <io.h>
_setmode(_fileno(stdout), _O_U8TEXT);
std::wcout << L"こんにちは世界\n";
After setting this mode std::cout << "Hello, World"; will crash.
Use a low level IO API along with manual conversion:
#include <codecvt>
#include <locale>
SetConsoleOutputCP(CP_UTF8);
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
std::puts(convert.to_bytes(L"こんにちは世界\n"));
Using any of these methods, cmd.exe will display the correct text to the best of its ability, by which I mean it will display unreadable boxes. Seven little boxes, for the given string.
You can copy the text out of cmd.exe and into notepad.exe or whatever to see the correct glyphs.
There's a whole article about dealing with Unicode in Windows console
http://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/
http://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/
Basically, you may implement you own streambuf for std::cout (or std::wcout) in terms of WriteConsoleW and enjoy writing UTF-8 (or whatever Unicode you want) to Windows console without depending on locales, console code pages and even without using wide characters.
It may not look very straightforward, but it's convenient and reusable solution, which is also able to give you a portable utf8-everywhere style user code. Please, don't beat me for my English :)
Or you can change Windows locale to Japanese.

Windows Unicode C++ Stream Output Failure

I am currently writing an application which requires me to call GetWindowText on arbitrary windows and store that data to a file for later processing. Long story short, I noticed that my tool was failing on Battlefield 3, and I narrowed the problem down to the following character in its window title:
http://www.fileformat.info/info/unicode/char/2122/index.htm
So I created a little test app which just does the following:
std::wcout << L"\u2122";
Low and behold that breaks output to the console window for the remainder of the program.
Why is the MSVC STL choking on this character (and I assume others) when APIs like MessageBoxW etc display it just fine?
How can I get those characters printed to my file?
Tested on both VC10 and VC11 under Windows 7 x64.
Sorry for the poorly constructed post, I'm tearing my hair out here.
Thanks.
EDIT:
Minimal test case
#include <fstream>
#include <iostream>
int main()
{
{
std::wofstream test_file("test.txt");
test_file << L"\u2122";
}
std::wcout << L"\u2122";
}
Expected result: '™' character printed to console and file.
Observed result: File is created but is empty. No output to console.
I have confirmed that the font I"m using for my console is capable of displaying the character in question, and the file is definitely empty (0 bytes in size).
EDIT:
Further debugging shows that the 'failbit' and 'badbit' are set in the stream(s).
EDIT:
I have also tried using Boost.Locale and I am having the same issue even with the new locale imbued globally and explicitly to all standard streams.
To write into a file, you have to set the locale correctly, for example if you want to write them as UTF-8 characters, you have to add
const std::locale utf8_locale
= std::locale(std::locale(), new std::codecvt_utf8<wchar_t>());
test_file.imbue(utf8_locale);
You have to add these 2 include files
#include <codecvt>
#include <locale>
To write to the console you have to set the console in the correct mode (this is windows specific) by adding
_setmode(_fileno(stdout), _O_U8TEXT);
(in case you want to use UTF-8).
For this you have to add these 2 include files:
#include <fcntl.h>
#include <io.h>
Furthermore you have to make sure that your are using a font that supports Unicode (such as for example Lucida Console). You can change the font in the properties of your console window.
The complete program now looks like this:
#include <fstream>
#include <iostream>
#include <codecvt>
#include <locale>
#include <fcntl.h>
#include <io.h>
int main()
{
const std::locale utf8_locale = std::locale(std::locale(),
new std::codecvt_utf8<wchar_t>());
{
std::wofstream test_file("c:\\temp\\test.txt");
test_file.imbue(utf8_locale);
test_file << L"\u2122";
}
_setmode(_fileno(stdout), _O_U8TEXT);
std::wcout << L"\u2122";
}
Are you always using std::wcout or are you sometimes using std::cout? Mixing these won't work. Of course, the error description "choking" doesn't say at all what problem you are observing. I'd suspect that this is a different problem to the one using files, however.
As there is no real description of the problem it takes somewhat of a crystal ball followed by a shot in the dark to hit the problem... Since you want to get Unicode characters from you file make sure that the file stream you are using uses a std::locale whose std::codecvt<...> facet actually converts to a suitable Unicode encoding.
I just tested GCC (versions 4.4 thru 4.7) and MSVC 10, which all exhibit this problem.
Equally broken is wprintf, which does as little as the C++ stream API.
I also tested the raw Win32 API to see if nothing else was causing the failure, and this works:
#include <windows.h>
int main()
{
HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
DWORD n;
WriteConsoleW( stdout, L"\u03B2", 1, &n, NULL );
}
Which writes β to the console (if you set cmd's font to something like Lucida Console).
Conclusion: wchar_t output is horribly broken in both large C++ Standard library implementations.
Although the wide character streams take Unicode as input, that's not what they produce as output - the characters go through a conversion. If a character can't be represented in the encoding that it's converting to, the output fails.

C++: output contents of a Unicode file to console in Windows

I've read a bunch of articles and forums posts discussing this problem all of the solutions seem way too complicated for such a simple task.
Here's a sample code straight from cplusplus.com:
// reading a text file
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main () {
string line;
ifstream myfile ("example.txt");
if (myfile.is_open())
{
while ( myfile.good() )
{
getline (myfile,line);
cout << line << endl;
}
myfile.close();
}
else cout << "Unable to open file";
return 0;
}
It works fine as long as example.txt has only ASCII characters. Things get messy if I try to add, say, something in Russian.
In GNU/Linux it's as simple as saving the file as UTF-8.
In Windows, that doesn't work. Converting the file into UCS-2 Little Endian (what Windows seems to use by default) and changing all the functions into their wchar_t counterparts doesn't do the trick either.
Isn't there some kind of a "correct" way to get this done without doing all kinds of magic encoding conversions?
The Windows console supports unicode, sort of. It does not support left-to-right and "complex scripts". To print a UTF-16 file with Visual C++, use the following:
_setmode(_fileno(stdout), _O_U16TEXT);
And use wcout instead of cout.
There is no support for a "UTF8" code page so for UTF-8 you will have to use MultiBytetoWideChar
More on console support for unicode can be found in this blog
The right way to output to a console on Windows using cout is to first call GetConsoleOutputCP, and then convert the input you have into the console code page. Alternatively, use WriteConsoleW, passing a wchar_t*.
For reading UTF-8 or UTF-16 strings from a file, you can use the extended mode string of _wfopen_s and fgetws. I don't think there is a C++ interface for these extensions yet. The easiest way to print to the console is described in Michael Kaplan's blog:
#include <fcntl.h>
#include <io.h>
#include <stdio.h>
int main(void) {
_setmode(_fileno(stdout), _O_U16TEXT);
wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");
return 0;
}
Avoid GetConsoleOutputCP, it is only retained for compatibility with the 8-bit API.
While Windows console windows are UCS-2 based, they don't support UTF-8 properly.
You might make things work by setting the console window's active output code page to UTF-8 temporarily, using the appropriate API functions. Note that those functions distinguish between input code page and output code page. However, [cmd.exe] really doesn't like UTF-8 as active code page, so don't set that as a permanent code page.
Otherwise, you can use the Unicode console window functions.
Cheers & hth.,
#include <stdio.h>
int main (int argc, char *argv[])
{
// do chcp 65001 in the console before running this
printf ("γασσο γεο!\n");
}
Works perfectly if you do chcp 65001 in the console before running your program.
Caveats:
I'm using 64 bit Windows 7 with VC++ Express 2010
The code is in a file encoded as UTF-8 without BOM - I wrote it in a text editor, not using the VC++ IDE, then used VC++ to compile it.
The console has a TrueType font - this is important
Don't know if these things make too much difference...
Can't speak for chars off the BMP, give it a whirl and leave a comment.
Just to be clear, some here have mentioned UTF8. UTF8 is a multibyte format, which in some documentation is mistakenly referred to as Unicode. Unicode is always just two bytes.
I've used this previously posted solution with Visual Studio 2008. I don't know if if works with later versions of Visual Studio.
#include <iostream>
#include <fnctl.h>
#include <io.h>
#include <tchar.h>
<code ommitted>
_setmode(_fileno(stdout), _O_U16TEXT);
std::wcout << _T("This is some text to print\n");
I used macros to switch between std::wcout and std::cout, and also to remove the _setmode call for ASCII builds, thus allowing compiling either for ASCII and UNICODE. This works. I have not yet tested using std::endl, but I that might work wcout and Unicode (not sure), i.e.
std::wcout << _T("This is some text to print") << std::endl;

Can't read unicode (japanese) from a file

Hi I have a file containing japanese text, saved as unicode file.
I need to read from the file and display the information to the stardard output.
I am using Visual studio 2008
int main()
{
wstring line;
wifstream myfile("D:\sample.txt"); //file containing japanese characters, saved as unicode file
//myfile.imbue(locale("Japanese_Japan"));
if(!myfile)
cout<<"While opening a file an error is encountered"<<endl;
else
cout << "File is successfully opened" << endl;
//wcout.imbue (locale("Japanese_Japan"));
while ( myfile.good() )
{
getline(myfile,line);
wcout << line << endl;
}
myfile.close();
system("PAUSE");
return 0;
}
This program generates some random output and I don't see any japanese text on the screen.
Oh boy. Welcome to the Fun, Fun world of character encodings.
The first thing you need to know is that your console is not unicode on windows. The only way you'll ever see Japanese characters in a console application is if you set your non-unicode (ANSI) locale to Japanese. Which will also make backslashes look like yen symbols and break paths containing european accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but people still use to this day...)
So first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise to the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is is UTF-8? UTF-16 (and if so, little or big endian?) Shift-JIS? EUC-JP? You can only use a wstream to read directly if the file is in little-endian UTF-16. And even then you need to futz with its internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And this is all only the case on Windows as well! Other OSes may have a different wstream representation. It's best not to use wstreams at all really.
So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream - not using a wstream. You must then convert this character string into UTF-16 (assuming you're using windows! Other OSes tend to use UTF-8 char*s). On windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value, and CP_ACP or CP_OEMCP are almost always the wrong answer.
Now, you may be wondering how to determine which code page (ie, character encoding) is correct. The short answer is you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints - eg, if you see a byte order mark, chances are it's whatever variant of unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess, relying on the user to correct you if you're wrong, or you have to select a fixed character set and don't attempt to support any others.
Someone here had the same problem with Russian characters (He's using basic_ifstream<wchar_t> wich should be the same as wifstream according to this page). In the comments of that question they also link to this which should help you further.
If understood everything correctly, it seems that wifstream reads the characters correctly but your program tries to convert them to whatever locale your program is running in.
Two errors:
std::wifstream(L"D:\\sample.txt");
And do not mix cout and wcout.
Also check that your file is encoded in UTF-16, Little-Endian. If not so, you will be in trouble reading it.
wfstream uses wfilebuf for the actual reading and writing of the data. wfilebuf defaults to using a char buffer internally which means that the text in the file is assumed narrow, and converted to wide before you see it. Since the text was actually wide, you get a mess.
The solution is to replace the wfilebuf buffer with a wide one.
You probably also need to open the file as binary.
const size_t bufsize = 128;
wchar_t buffer[bufsize];
wifstream myfile("D:\\sample.txt", ios::binary);
myfile.rdbuf()->pubsetbuf(buffer, 128);
Make sure the buffer outlives the stream object!
See details here: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx