FillConsoleOutputCharacter/WriteConsoleOutput and special characters - c++

I'm messing with some of the native Windows console functions, and am impressed by their speed, if not their ease of use.
Anyway, I have long known that the following code will produce some interesting characters:
for(int i = 0; i < 256; i++)
{
    cout << char(i) << endl;
}
However, I cannot get FillConsoleOutputCharacter or WriteConsoleOutput to produce all of those characters (many simply appear as question marks).
Here is an example of the code I am using:
COORD spot = {0,0};
HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
DWORD Written;
for(int i = 0; i < 256; i++)
{
    FillConsoleOutputAttribute(hOut, 7, 1, spot, &Written);
    FillConsoleOutputCharacterW(hOut, char(i), 1, spot, &Written);
    spot.Y++;
}
Does anyone know of a relatively convenient way to write those characters with the native functions?
By the way, I am using Visual Studio 2010 on Windows 7 x64.

Try using FillConsoleOutputCharacterA instead of FillConsoleOutputCharacterW; the W version takes a Unicode (UTF-16) character, which requires a little knowledge to use correctly.
Edit: I tried FillConsoleOutputCharacterA and it gives output equivalent to your first (cout) case.
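That is, the one-line change to the loop from the question:
FillConsoleOutputCharacterA(hOut, char(i), 1, spot, &Written); // 'A' variant: the char is interpreted per the console's code page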

FillConsoleOutputCharacterA should write the same set of characters that cout does. These characters are determined by the console's current code page.
With FillConsoleOutputCharacterW, you can still generate all the same characters (as well as any additional characters that may be included in the console font) but you need to use the Unicode (16-bit) codes for these characters, rather than the 8-bit codes used with cout.
Note that Windows internally uses an out-of-date, 16-bit version of Unicode, with characters limited to the range 0-65535, rather than Unicode proper, whose code points run from 0 to 0x10FFFF (1,114,111), although most of those codes remain unassigned. I believe the console's Unicode character set corresponds to plane 0 of Unicode, the Basic Multilingual Plane.
The question marks appear when you write a control character or a character that isn't included in the current font.
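For example, a minimal sketch of the wide-character route (assuming the console font actually contains the glyph; U+263A is WHITE SMILING FACE):

#include <windows.h>

int main()
{
    HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
    COORD spot = {0, 0};
    DWORD written;
    // Pass a 16-bit UTF-16 code unit directly, not an 8-bit code-page value
    FillConsoleOutputCharacterW(hOut, L'\u263A', 1, spot, &written);
    return 0;
}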

Related

Text on MFC Controls - Unicode Characters such as Japanese get cut off

Background
I'm working on a C++/MFC application and we've been converting it to display Unicode characters to support foreign languages. For the most part this has been successful, and Unicode characters are displayed correctly. But I've encountered an issue where certain text on certain controls gets cut off.
Example
Here you can see a button that should display "ログアウト/終了" but gets cut off and displays an unknown character in its place.
But if I pad the string with spaces, it displays fine. The number of spaces needed varies by string: this string needed 4 spaces to display correctly, whereas another string with one less character needed 5; there doesn't seem to be a correlation or pattern in the number of spaces needed. Also, I don't want to pad strings randomly throughout the code, especially when other languages don't need this at all.
What I've tried (doesn't work)
Shrinking the font size
Resizing the control
Changing the font facename
Changing the font character set
Copying the control properties from another control in the application that does not have this issue
Adding extra null terminators
Padding with zero-width characters
Using SetWindowTextW
Changing source and execution character sets
Changing system locale
The only thing I've found that works is padding with an arbitrary amount of spaces which is certainly not an ideal solution.
Other info
I've only noticed this issue for Japanese characters, but have only tested English, German, and Japanese.
Japanese characters use 3 bytes of data in UTF-8, which I suspect has something to do with this, but I don't know what or why. English characters use 1 byte and certain German characters use 2 bytes.
A control (button/label/etc.) in one place may have the issue whereas a control in a different place that contains the same text does not, even if they're both buttons, etc.
When the text is cut off, it typically displays either a question-mark box (like the first image) or a random character/letter at the end. This character changes each time I run the application, but the question-mark box is the most common.
For my padding "fix", it doesn't matter if the spaces are at the beginning or end of the string, as long as the number of spaces is enough. It also doesn't need to be spaces, any non-zero-width character works.
Compiled using MBCS (Multi-Byte Character Set) with the Windows 10 UTF-8 Unicode Support setting enabled. (As opposed to compiling with UNICODE defined, which isn't an option for this large, old codebase.)
EDIT: Here is an example on how the text is set
GetDlgItem(IDC_SOME_CTRL_ID)->SetWindowText(GetTranslation("Some String"));
Where GetTranslation() is our own function to look up the translation of "Some String" (basically a lookup table) and return a CString. Using a debugger I can see the returned CString always has the correct string value. I can replace GetTranslation with a hardcoded Japanese string and the issue will still happen.
EDIT 2: I got complaints that this code wasn't enough.
myapp.rc
// Microsoft Visual C++ generated resource script.
//
#include "resource.h"
#define APSTUDIO_READONLY_SYMBOLS
#include "afxres.h"
#undef APSTUDIO_READONLY_SYMBOLS
IDD_VIEW_MENU DIALOGEX 0, 0, 50, 232
STYLE DS_SETFONT | WS_CHILD
FONT 14, "Verdana", 0, 0, 0x1
BEGIN
    CONTROL "btn0",IDC_BUTTON_MENU_0,"Button",BS_3STATE | BS_PUSHLIKE,12,38,25,13
END
resource.h
#define IDC_BUTTON_MENU_0 6040
ViewMenu.cpp
#include "stdafx.h"
#include "ViewMenu.h"
CViewMenu::CViewMenu() : CFormView(CViewMenu::IDD)
{
}

void CViewMenu::DoDataExchange(CDataExchange* pDX)
{
    CFormView::DoDataExchange(pDX);
    DDX_Control(pDX, IDC_BUTTON_MENU_0, m_ctrlMenuButton0);
}

void CViewMenu::OnInitialUpdate()
{
    CFormView::OnInitialUpdate();
}

void CViewMenu::OnDraw(CDC* pDC)
{
    CFormView::OnDraw(pDC);
    GetDlgItem(IDC_BUTTON_MENU_0)->SetWindowText("ログアウト/終了");
    return;
}
ViewMenu.h
#include "resource.h"
class CViewMenu : public CFormView
{
protected:
    CViewMenu();
public:
    enum { IDD = IDD_VIEW_MENU };
    CButton m_ctrlMenuButton0;
};
The following should work in Windows 10 versions 1903 and later, regardless of the default system locale, and fulfills OP's requirements (string literals, MBCS build, no Unicode windows etc). It was verified to work in version 2004 set to En-US locale, without "Beta: Use Unicode UTF-8 for worldwide language support" checked, using VS 2019 16.7.5 to build.
Save source files containing characters outside the active codepage in UTF-8 encoding, with or without BOM.
Compile with _MBCS defined (in the IDE: Properties / Advanced / Character Set = MBCS).
Compile with the /utf-8 switch (C/C++ / Command Line / Additional Options = /utf-8).
Create a manifest file declaring UTF-8 as the target codepage for the process (per the activeCodePage documentation).
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1" xmlns:asmv3="urn:schemas-microsoft-com:asm.v3">
  <asmv3:application>
    <asmv3:windowsSettings xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">
      <activeCodePage>UTF-8</activeCodePage>
    </asmv3:windowsSettings>
  </asmv3:application>
</assembly>
Add the manifest file to the project (in the IDE: Manifest Tool / General / Input and Output / Additional Manifest Files = manifest file created at the previous step).
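To sanity-check at runtime that the manifest took effect, you can query the active code page; with the UTF-8 manifest active, GetACP should report 65001 (CP_UTF8). A minimal check:

#include <windows.h>
#include <cstdio>

int main()
{
    // Prints 65001 when the activeCodePage manifest entry was honored
    printf("Active code page: %u\n", GetACP());
    return 0;
}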
This ain't Python. With C++ you need to know why your code works; otherwise it doesn't.
GetDlgItem(IDC_BUTTON_MENU_0)->SetWindowText("ログアウト/終了");
That's where you and your compiler start to disagree. You think this should be UTF-8. Your compiler, on the other hand, trusts you, and assumes that you are using the source character set.
As long as you are unaware of the concept of a source character set, you'll stay confused about something that should be expected: garbage in, garbage out.
If you feel like fixing the "Garbage in" part (now, clearly, that is your job), read up on C++ string literals. In case you don't make it to the end, the quickest way to fix your ungodly workaround is to use a u8 prefix.
Seriously, though, the real solution is to use Windows' native character encoding. Which, oddly, you seem to reject, even though you could use it, given a string literal. I mean, it's not like you have to change anything global. Just call SetWindowTextW and use an L prefix.
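A sketch of that suggestion, reusing the control ID from the question (calling the raw Win32 wide API directly, so it works even in an MBCS build):

// Wide literal + wide API: no narrow execution-character-set conversion involved
::SetWindowTextW(GetDlgItem(IDC_BUTTON_MENU_0)->GetSafeHwnd(), L"ログアウト/終了");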
Just saying, you know...

Printing smiley face c++

I'm trying to print the smiley face (from ASCII) as many times as the user asks for, but on the console output screen it only shows a square with another one inside it. Where have I gone wrong?
#include <iostream>
using namespace std;

int main()
{
    int smile;
    cout << "How many smiley faces do you want to see? ";
    cin >> smile;
    for (int i = 0; i < smile; i++)
    {
        cout << static_cast<char>(1) << "\t";
    }
    cout << endl;
    return 0;
}
ASCII does not have smileys (in ASCII you'd write :-) and expect your reader to understand it as a smiley). But Unicode has several, e.g. ☺ (WHITE SMILING FACE, U+263A); see http://unicodeemoticons.com/ or http://www.unicode.org/emoji/charts/emoji-list.html for a nice table of them.
In 2017, it is reasonable to use UTF-8 everywhere (in terminals and output). UTF-8 is a very common encoding for Unicode, and many Unicode characters are encoded as several bytes in UTF-8.
So in a terminal using UTF-8, with a font that has the character available, since ☺ is UTF-8 encoded as "\342\230\272", use:
for (int i = 0; i < smile; i++)
{
    cout << "\342\230\272" << "\t";
}
In 2017, most "consoles" are terminal emulators, because real terminals (like the mythical VT100) are today in museums, and you can at least configure these terminal emulators to use the UTF-8 encoding. On many operating systems (notably most Linux distributions and MacOSX) they use UTF-8 by default.
If your C++11 compiler accepts UTF-8 in strings (and a UTF-8 source file), as most do today, you could even have "☺" directly in your source code. To type it you'll often use some copy-and-paste technique from an outside source. On my Linux system I often use some character map utility (e.g. run charmap in a terminal) to get such characters.
In ASCII, the character of code 1 is a control character, Start Of Heading. Perhaps you are confusing ASCII with CP437, which is no longer in common use (but in the 1980s encoded a smiley-like glyph at code 1).
You need to use Unicode and understand it. Today, in 2017, you cannot afford to use other encodings externally (they are historical legacy, for museums). Of course, if you use unusual characters, you should document that the user of your program needs a font that has them (but the most common fonts used in terminal emulators cover a very wide part of Unicode, so in practice this is not a problem). However, on my Linux computers, many fonts lack U+1F642 SLIGHTLY SMILING FACE (e.g. "\360\237\231\202" in a C++ program), which appeared only in Unicode 7.0 in 2014.
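Putting that together with the question's program, a sketch (the SetConsoleOutputCP call is a Windows-only addition, since the Windows console does not default to UTF-8; on Linux/MacOSX it can be omitted):

#include <iostream>
#ifdef _WIN32
#include <windows.h>
#endif
using namespace std;

int main()
{
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8); // interpret output bytes as UTF-8
#endif
    int smile;
    cout << "How many smiley faces do you want to see? ";
    cin >> smile;
    for (int i = 0; i < smile; i++)
    {
        cout << "\342\230\272" << "\t"; // U+263A, UTF-8 encoded
    }
    cout << endl;
    return 0;
}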
Just do this in Visual Studio Code to print one:
cout << "\2";
(Character code 2 is the black smiley ☻ in the old OEM character set; code 1 is the white one.)

Writing unicode(?) character directly from source code to WriteConsoleOutput

I'm trying to use WriteConsoleOutput from the WinApi to write characters to the command prompt window buffer. The thing is, I'd really like to be able to write characters such as ☺ directly into the source code, as-is, instead of using some kind of encoding/notation like '\uFFFF' or '0xFF', since I don't understand them too well (differences between codepages/character sets/etc.)
The code below showcases the simplest form of my problem. Running this code does not print ☺ into the command prompt window, but a question mark (?) instead.
#include <Windows.h>

int main()
{
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    CHAR_INFO c[1] = {0};
    COORD cS = {1, 1};
    COORD cH = {0, 0};
    SMALL_RECT sr = {0, 0, 0, 0};
    c[0].Attributes = FOREGROUND_INTENSITY;
    c[0].Char.UnicodeChar = '☺';
    WriteConsoleOutput(h, c, cS, cH, &sr);
    Sleep(5000);
    return 0;
}
It is vital for my code to display output identically between all Windows versions, regardless of the languages installed/used. So to my knowledge (which admittedly is absolutely minimal), I'd need to set a specific codepage (one which would hopefully be supported by the command prompt in any language Windows).
I've tried:
• Changing from using the CHAR_INFO.UnicodeChar to CHAR_INFO.AsciiChar
• Fiddling around with SetConsoleCP and SetConsoleOutputCP functions, but I haven't got a clue on how to utilize them to help me with this problem.
• Changing the Visual Studio -> Project -> Project properties.. -> Character Set setting to every possible value.
• Using specifically either WriteConsoleOutputA or WriteConsoleOutputW in addition to the aforementioned settings
• Changing the source code file encoding to UTF-8 with(/out) signature.
In my project I'm programmatically setting the command prompt font to 8x8 Terminal, which to my knowledge does not support actual Unicode characters. The available characters are displayed here. Those characters do include '☺', so I'm not entirely sure my question is about Unicode. I have no idea anymore. Please help.
C source has to be ASCII-only. If you embed non-ASCII characters in a C source file, an IDE might show them in what appears to be the correct form, but the compiler quite likely treats them differently, and the function you pass them to can treat them differently still. It's just not portable or reliable. But you can use the \x escape sequence to embed arbitrary bytes in C strings.
UTF-8 is good for internal use, but the Windows APIs don't yet support it, so you need to convert to Windows 16-bit chars (nearly, but not quite, UTF-16) to display extended characters. You also have to ensure that you are calling the wide-character version of the Windows API: most Windows API functions that take strings come in A and W versions (ANSI and wide) for binary backwards compatibility. If you query the identifier in the IDE (Go To Definition, etc.) you should see which version you have.
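For instance, a sketch of the question's program using the wide version explicitly, with an escape sequence instead of a literal ☺ so the source file encoding no longer matters:

#include <Windows.h>

int main()
{
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    CHAR_INFO c[1] = {0};
    COORD cS = {1, 1};
    COORD cH = {0, 0};
    SMALL_RECT sr = {0, 0, 0, 0};
    c[0].Attributes = FOREGROUND_INTENSITY;
    c[0].Char.UnicodeChar = L'\u263A'; // explicit UTF-16 code unit for U+263A
    WriteConsoleOutputW(h, c, cS, cH, &sr); // the W version, unambiguously
    Sleep(5000);
    return 0;
}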

c++ Lithuanian language, how to get more than ascii

I am trying to use Lithuanian in my C++ application, but every attempt has been unsuccessful.
A multi-byte character set is used. I have tried everything I could think of; I am new to C++ and have never tried to do anything in Lithuanian before.
I tried every setlocale variant: setlocale(LC_ALL, "en_US.utf8");, setlocale(LC_ALL, "Lithuanian");, ...
I researched for 2 hours and didn't find proper examples or a solution.
I have an average-sized project which needs Lithuanian translations from a database, and it can't understand most of "ĄČĘĖĮŠŲŪąčęėįšųū".
Compiler - Visual Studio 2013
Database - sqlite3.
I can't even get simple strings (defined myself) to work and output as Lithuanian in a Win32 application.
In Windows use wide character strings (UTF-16 encoding(1), wchar_t type) for internal text handling, and preferably UTF-8 for external text files and networking.
Note that Visual C++ will translate narrow text literals from the source encoding to Windows ANSI, which is a platform-dependent usually single-byte encoding (you can check which one via the GetACP API function), i.e., Visual C++ has the platform-specific Windows ANSI as its narrow C++ execution character set.
But also do note that for an app restricted to non-Windows platforms, i.e. Unix-land, it makes practical sense to do everything in UTF-8, based on char type.
For the database communication you may need to translate to and from the program's internal text representation.
This depends on what the database interface requires, which is not stated.
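For example, if the database hands back UTF-8 (sqlite3's default), a minimal conversion sketch to the program's internal wide strings (Utf8ToWide is just an illustrative helper name):

#include <windows.h>
#include <string>

// Convert UTF-8 text (e.g. from sqlite3) to UTF-16 for use with wide Windows APIs
std::wstring Utf8ToWide(const std::string& utf8)
{
    if (utf8.empty()) { return std::wstring(); }
    const int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring wide(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], n);
    return wide;
}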
Example for console output in Windows:
#include <iostream>
#include <fcntl.h>
#include <io.h>

auto main() -> int
{
    // Switch stdout to wide (UTF-16) text mode so that wcout can emit
    // characters outside the narrow execution character set.
    _setmode( _fileno( stdout ), _O_WTEXT );
    using namespace std;
    wcout << L"ĄČĘĖĮŠŲŪąčęėįšųū" << endl;
}
To make this compile by default with g++, the source code encoding needs to be UTF-8. Then, to make it produce correct results with Visual C++, the source code encoding needs to be UTF-8 with BOM, which happily is also accepted by modern versions of g++. Otherwise the Visual C++ compiler will assume the Windows ANSI encoding and produce an incorrect UTF-16 string.
Not coincidentally, this is the default meaning of UTF-8 in Windows, e.g. in the Notepad editor: namely, UTF-8 with BOM.
But note that while in Windows the problem is that the main system compiler requires a BOM for UTF-8, in Unix-land the problem is the opposite: many old tools can't handle the BOM (for example, even MinGW g++ 4.9.1 isn't entirely up to speed: it sometimes includes the BOM bytes, incorrectly interpreted, in its error messages).
1) On other platforms wide character text can be encoded in other ways, e.g. with UTF-32. In fact the Windows convention is in direct conflict with the C and C++ standards which require that a single wchar_t should be able to encode any character in the extended character set. However, this requirement was, AFAIK, imposed after Windows adopted UTF-16, so the fault probably lies with the politics of the C and C++ standardization process, not yet another Microsoft'ism.
Complexity of internationalisation
There are several related but distinct topics here, and mismatches between them make a trial-and-error approach very tedious:
type used for storing strings and chars: Windows uses wchar_t by default, but most APIs also have char-equivalent functions
character set encoding: this defines how the chars stored in the type are to be understood, for example Unicode (UTF-8, UTF-16, UTF-32), 7-bit ASCII, or 8-bit ANSI. In Windows, by default, it is UTF-16 for wchar_t and a Windows ANSI code page for char
locale: defines, among other things, the character set assumptions used when processing strings. This permits using language-independent functions like isalpha(i, loc), islower(i, loc), ispunct(i, loc) to find out whether a given character is alphabetic, lower case, or punctuation, for example to break user text down into words (see the sketch after this list). C++ offers portable functions here
output code page or font: used to show a character to the user. This assumes that the font shows the characters using the same character set that the code uses internally
source code encoding: for example, your editor could assume an ANSI encoding such as the Windows-1252 character set
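A minimal sketch of those locale-aware classification functions (assuming your C++ runtime provides a Lithuanian UTF-8 locale under this name; on Windows the name may be "lt-LT" instead, and std::locale throws if the name is unknown):

#include <locale>
#include <iostream>

int main()
{
    std::locale loc("lt_LT.UTF-8"); // throws std::runtime_error if not installed
    wchar_t ch = L'Č';
    // Prints 1 if 'Č' is classified as alphabetic under this locale
    std::wcout << std::isalpha(ch, loc) << L'\n';
}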
Most typical errors
Problem number one is Win32 console output, as Unicode is not well supported by the console. But that is not your problem here.
Another cause of mismatch is the encoding used by your text editor. It might not be Unicode, but a Windows code page. In that case, you type "Č" and the editor displays it as such, but it might use the Windows-1257 encoding for Lithuanian and store 0xC8 in the file. If you then display this literal with a Windows Unicode function, it will interpret 0xC8 as "Latin capital E with grave accent" and print something else, because the right Unicode code for "Č" is 0x010C!
It can be even worse: the compiler may have its own assumptions about the character set encoding in use and convert your literals into Unicode based on false assumptions (this happened to me when I used some exotic code generation switch).
How to proceed?
To figure out what goes wrong, proceed by elimination:
First, for plain Windows, use the native Unicode setting. OK, it's UTF-16 and wchar_t instead of UTF-8 and char, and thus comes with some drawbacks, but it's native and well supported.
Then use explicit Unicode codes in literals, for example TEXT("\u010C") instead of TEXT("Č"). This avoids editor/compiler mismatches.
If it's still not the right character, make sure that your font FULLY supports Unicode. The default system font, for instance, doesn't, while most others do. You can easily check with the Windows font panel (WindowsKey+R, fonts, then click on "search char") to display the character table of your font.
Set fonts explicitly in your code.
For example, a very tiny experiment:
...
case WM_PAINT:
{
    hdc = BeginPaint(hWnd, &ps);
    auto hf = CreateFont(24, 0, 0, 0, 0, TRUE, 0, 0, 0, 0, 0, 0, 0, L"Times New Roman");
    auto hfOld = SelectObject(hdc, hf); // if you comment this out, € and Č won't display
    TextOut(hdc, 50, 50, L"Test with éç € \u010C special chars", 30);
    SelectObject(hdc, hfOld);
    DeleteObject(hf);
    EndPaint(hWnd, &ps);
    break;
}

CEdit::GetLine() windows 7

I have the following segment of code where m_edit is a CEdit control:
TCHAR lpsz[MAX_PATH+1];
// get the edit box text
m_edit.GetLine(0,lpsz, MAX_PATH);
This works perfectly on computers running Windows XP and earlier. I have not tested it on Vista, but on Windows 7, lpsz gets junk Unicode characters inserted into it (as well as the actual text, sometimes). Any idea as to what is going on here?
Since you're using MFC, why aren't you taking advantage of its CString class? That's one of the reasons many programmers were drawn to MFC, because it makes working with strings so much easier.
For example, you could simply write:
int len = m_edit.LineLength(m_edit.LineIndex(0));
CString path;
LPTSTR p = path.GetBuffer(len);
m_edit.GetLine(0, p, len);
path.ReleaseBuffer();
(The above code is tested to work fine on Windows 7.)
Note that the copied line does not contain a null-termination character (see the "Remarks" section in the documentation). That could explain the nonsense characters you're seeing in later versions of Windows.
It's not null-terminated. You need to do this:
int count = m_edit.GetLine(0, lpsz, MAX_PATH);
lpsz[count] = 0; // safe: lpsz has MAX_PATH+1 elements and count <= MAX_PATH