How do I widen a char in C++

I'm writing a game in C++ using SFML, and I found a font that supports French characters. The program reads all of its text from files so it can support different languages, but I don't know how to extract the text into wide strings without errors.
Here's my code:
using namespace std;
using namespace sf;
void SettingsW :: initialize()
{
// using normal characters it reads the files correctly
wifstream settTextFile;
settTextFile.open(pageTextSource);
wstring temp;
getline(settTextFile, temp);
pageTitle.setFont(pageFont);
pageTitle.setString(temp);
getline(settTextFile, temp, L' ');
languageTitle.setFont(pageFont);
languageTitle.setString(temp);
//here is the problem
char g=' ';
ios::widen(g);
getline(settTextFile, temp, ' ');
// I want to use get line with this delimiter and when I use the L the error goes away
//but it doesn't display properly: é is displayed as ã
}

It's not too clear what your problem is. The code you present
shouldn't compile; ios::widen is a member function, and can
only be called on an ios (which is a typedef for
std::basic_ios<char>, of which you have no instance in your
code). Also, ios::widen returns the widened character, except
that ios::widen (as opposed to
std::basic_ios<wchar_t>::widen) doesn't widen, since it returns
a char. If you want to use the character in g as the delimiter
in the last call to std::getline, then you could use:
std::getline( settTextFile, temp, settTextFile.widen( g ) );
(Of course, you should verify that std::getline succeeded
before using the value it read.)
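Putting that together, a minimal sketch of the corrected read might look like this (pageTextSource, pageFont and languageTitle are reused from the question's code):
wifstream settTextFile(pageTextSource);
wstring temp;
char g = ' ';
// widen g through the stream so the delimiter matches the stream's character type
if (getline(settTextFile, temp, settTextFile.widen(g)))
{
    languageTitle.setFont(pageFont);
    languageTitle.setString(temp);  // only used once the read is known to have succeeded
}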
With regards to the "it doesn't display properly":
you'll have to give more information about how you are
displaying it for us to be sure, but it seems likely to me that
you just haven't imbued your output stream with the same
encoding as the code page of the window (supposing Windows), or
with the encoding of the font used in the window (supposing
Unix). But you'll have to show us exactly what you're
displaying, how you're displaying it, and give us some
information about the environment if you want a complete answer.
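As a hedged illustration only (the correct locale depends entirely on your environment, so treat this as a sketch, not the fix), imbuing the streams with the environment's locale before doing any I/O is the usual first thing to try; pageTextSource is again the name from the question:
#include <fstream>
#include <iostream>
#include <locale>

std::wifstream settTextFile;
settTextFile.imbue(std::locale(""));  // use the environment's encoding for the narrow-to-wide conversion
settTextFile.open(pageTextSource);    // open after imbuing, before any characters are read
std::wcout.imbue(std::locale(""));    // keep the output side consistent with the console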

Related

getline() doesn't read accented characters correctly

I'm trying to get accented characters from the user using the getline() command, but it does not print them correctly.
I tried including some libraries, such as locale, but in vain.
Here's my code:
#include <iostream>
#include <cstdlib>
#include <string>
#include <locale>
using namespace std;
class Pers {
public:
string name;
int age;
string weapon;
};
int main()
{
setlocale(LC_ALL, "");
Pers pers;
cout << "Say the name of your character: ";
getline(cin, pers.name);
cout << pers.name;
}
When I type Mark Coração, the accented characters come out garbled.
How do I fix it?
Actually, the problem does not come from getline().
std::cout (respectively std::cin) does not support special characters: the size of standard characters limits you to what you can find in the ASCII table. For this, you have to use std::wcout (respectively std::wcin), which uses wide characters. You need bigger characters to store the special characters too, which is what wide characters are for: std::string handles standard characters, std::wstring handles wide characters.
A way to do this could be:
std::wstring a(L"Coração");
std::wcout << a << std::endl;
Output:
Coração
To make it work with getline():
std::wstring a;
getline(std::wcin, a);
std::wcout << a << std::endl;
I hope it can help.
There are two levels to the same problem. The problem is that you are using characters outside the ASCII charset. The two levels are:
how they are converted to narrow characters on input
how they will be displayed on output
The Windows console is a rather disturbing application in that respect: it is able to internally process UCS-2 characters, that is, any Unicode character in the Basic Multilingual Plane, or said differently any character with a code point of at most 0xFFFF. On input into narrow characters, it tries to map any character not representable in the current charset to what it thinks is closest; on output, it just emits the value of each byte in its current charset. So the most reliable way is to ensure that the current locale has a correct collating sequence and that the console has a correct code page (charset, in Windows language). After seeing the displayed output, I assume that you are using code page 437, which contains semi-graphic characters but few non-ASCII ones.
As you only need Western European characters, I would advise you to use code page 1252. It is a Windows variant of the standard Latin-1 or ISO-8859-1 charset (characters with a code point of at most 0xFF).
So, if possible, you should try to configure the system in a non-English Western European language (Portuguese would be fine, but French seems to be enough, so I would assume that Spanish would work too).
And you must configure the console with a correct code page: chcp 1252.
If that is not enough (I cannot currently test anything), you could try to use wide characters (wstring, wcin, wcout). But without changing the code page from 437, the console will not display accented characters.
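If you would rather set the code page from the program than type chcp by hand, a minimal Windows-only sketch using the Win32 console API would be something like this (the string reuses the question's example; the hex values assume Windows-1252):
#include <windows.h>
#include <clocale>
#include <iostream>

int main()
{
    SetConsoleOutputCP(1252);        // same effect as "chcp 1252" for output
    SetConsoleCP(1252);              // and for input
    std::setlocale(LC_ALL, "");      // let the CRT pick up the user's locale
    std::cout << "Mark Cora\xe7\xe3o\n"; // 0xE7 = ç and 0xE3 = ã in Windows-1252
    return 0;
}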

C++ Spanish question mark

I am getting started with C++ development and I am writing a simple console calculator. When my program asks the user whether they want to exit, the character '¿' doesn't appear (questions in Spanish are written between '¿' and '?').
Can someone help me?
PS: The problem only happens on Windows, not on Linux.
EDIT: Here is the code that outputs the text:
cout << '¿' << "Desea salir (S/N)? ";
There are a few ways to deal with this problem.
The fundamental problem is not that the ¿ doesn't exist in the console, but that the console and your C++ text editor disagree on what that character is. The two are using different character codes for many characters beyond those needed for English. Character codes 32-126 (letters, numbers, punctuation and brackets) are universally the same. However, character codes 128 through 255, which from a Spanish point of view include all the accented characters, "u with diaeresis" (e.g. "pingüino"), Ñ, and the opening ¿ and ¡, depend on the specific environment.
Why there is such an inconvenient disagreement in character codes is a historical accident, interesting on its own but outside the scope of this question. To keep it simple: in the Windows OS, "consoles" (typically) use the list of characters described in OEM Code Page 437, while Windows applications like your C++ editor (typically) use the Windows-1252 Code Page.
There is no portable (universal) solution for this problem, because the issue of differing charsets is a platform-specific problem. Windows is unfortunately somewhat unique in that the editor and (console) outputs use different sets.
The first and simplest solution - which is fine for toy programs - is to just look up the character code that you want from the OEM 437 code-page, and use that. For ¿, that's #168 (0xa8 in hex, or \250 in octal). You can just embed the character code in the string to make clear what you're trying to do, either of these:
std::cout << "\xa8" "Cu" "\xa0" "l es el primer n" "\xa3" "mero?\n"; // hex
std::cout << "\250Cu\240l es el primer n\243mero?\n"; // octal
Outputs:
¿Cuál es el primer número?
Note how I had to do the same thing with the ú and the á. Unfortunately, writing strings like this gets unwieldy quickly. Using macros or const chars can help, but not much.
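For instance, a few named constants help a little (the names are my own invention, purely illustrative; the values are the code page 437 codes quoted above):
const char OEM_INVERTED_QUESTION = '\xa8'; // ¿ in code page 437
const char OEM_A_ACUTE           = '\xa0'; // á
const char OEM_U_ACUTE           = '\xa3'; // ú

std::cout << OEM_INVERTED_QUESTION << "Cu" << OEM_A_ACUTE
          << "l es el primer n" << OEM_U_ACUTE << "mero?\n";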
A second alternative is to use a Windows function such as CharToOemA. For example [1]:
#include <windows.h>
...
...
char pregunta[] = "¿Cuál es el primer número\n";
char *pregunta_oem = new char[sizeof(pregunta)/sizeof(char)];
CharToOemA(pregunta, pregunta_oem);
std::cout << pregunta_oem;
delete []pregunta_oem;
For a more complex program, I would wrap that pattern into a utility function or class.
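As a hedged sketch of that idea (the helper name to_oem is my own), letting std::string manage the buffer also sidesteps the sizeof pitfalls described in the footnote below:
#include <windows.h>
#include <string>

// Convert a Windows-1252 (ANSI) string to the console's OEM code page.
std::string to_oem(const std::string& ansi)
{
    std::string oem(ansi.size() + 1, '\0'); // room for the terminator CharToOemA writes
    CharToOemA(ansi.c_str(), &oem[0]);
    oem.resize(ansi.size());                // drop the explicit terminator again
    return oem;
}

// Usage: std::cout << to_oem("¿Cuál es el primer número?\n");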
A different approach is to change the Code Page of the console, so that it agrees with your C++ editor and the rest of Windows. You can do that via the CHCP console command, or via the SetConsoleOutputCP() function, but that doesn't work on the default "raster font" used by consoles, so you have to change the font as well. When the font is set to a unicode font like Lucida Console, this works:
std::cout << "¿Cuál es el primer número?\n"; // ┐Cußl es el...
UINT originalCP = GetConsoleOutputCP();
SetConsoleOutputCP(1252);
std::cout << "¿Cuál es el primer número?\n"; // ¿Cuál es el...
SetConsoleOutputCP(originalCP);
(I don't know if you can change the font from the program itself; I have to look that up. The standard way to do it from the console is to click on the tiny icon on the corner, click Properties, Font tab, and pick a font from the list).
[1] I have to warn that this snippet contains a number of subtleties that can easily trip up a beginner. You have to make sure the source of the text is a char array; if you're using a char pointer, sizeof won't work correctly and you have to use strlen(source)+1 instead. For the source I used the natural option of a char array initialized to a literal, but you can't do that for the destination, because the contents of such an array are read-only. If you are using a new'd char array, or one that is not initialized to a literal, you can use the same char array for the source and the destination. This example feels very C-like.
You can use the _setmode function to do that:
#include <iostream>
#include <string>
#if defined(WIN32) && !defined(UNIX)
# include <io.h> // for _setmode()
# include <fcntl.h> // for _O_U16TEXT
#endif // WIN32 && !UNIX
int main()
{
#if defined(WIN32) && !defined(UNIX)
_setmode(_fileno(stdout), _O_U16TEXT);
//^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#endif // WIN32 && !UNIX
std::wstring wstr = L"'¿' and '?'";
std::wcout << L"WString : " << wstr << std::endl;
system("pause");
return 0;
}
To write Unicode characters (UTF-16 LE, the standard Windows variant) to the console with the iostream library, call _setmode() with _O_U16TEXT and then use wcout.
But you can't use cout anymore after that; it triggers an assertion.
Check this answer.
Assuming you are using a simple call to std::cout, you should be able to print Unicode strings if you set your command line to Unicode mode:
1. Change code page to UTF-8
You can do this by simply calling the command below in your cmd:
chcp 65001
2. Make sure you are using a font which has the characters you want to display
Lucida Console should do the trick, as it supports ¿ (and the other characters included in WGL4).
This character is simply not included in basic ASCII. Try using wstring: http://www.cplusplus.com/reference/string/wstring/
As you can see in the code page 437 table (often loosely called "extended ASCII"), the symbol ¿ has code 168. You can use a \ddd octal escape in the output stream to print such a special character.
This is because the command console does not support non-ASCII characters by default (ASCII mainly covers English-language characters and a few accented ones). To get support for characters in other character sets, play around with the chcp command. Refer to its documentation.
In your case I think you need to run chcp 850 in the console before running your program.

How to get a single character from a UTF-8 encoded Urdu string written in a file?

I am working on Urdu-Hindi translation/transliteration. My objective is to translate an Urdu sentence into Hindi and vice versa. I am using Visual C++ 2010 with the C++ language. I have written an Urdu sentence in a text file saved in UTF-8 format. Now I want to get a single character at a time from that file so that I can work on converting it into its equivalent Hindi character. When I try to get a single character from the input file and write that single character to the output file, I get some unknown, ugly-looking character in the output file. Kindly help me with proper code. My code is as follows:
#include<iostream>
#include<fstream>
#include<cwchar>
#include<cstdlib>
using namespace std;
int main()
{
wchar_t arry[50];
wifstream inputfile("input.dat",ios::in);
wofstream outputfile("output.dat");
if(!inputfile)
{
cerr<<"File not open"<<endl;
exit(1);
}
// I am using this while loop just to make sure the copy-paste operation
// of the written Urdu text from one file to another works; when I try to
// pick only one character from the file, it does not work.
while (!inputfile.eof())
{ inputfile >> arry; }
int i = 0;
// I want to get the Urdu character placed at each index so that I can
// work on it to convert it into its equivalent Hindi character.
while (arry[i] != L'\0')
{
    outputfile << arry[i] << endl;
    i++;
}
inputfile.close();
outputfile.close();
cout<<"Hello world"<<endl;
}
Assuming you are on Windows, the easiest way to get "useful" characters is to read a larger chunk of the file (for example a line, or the entire file), and convert it to UTF-16 using the MultiByteToWideChar function. Use the "pseudo" code page CP_UTF8. In many cases, decoding the UTF-16 isn't required, but I don't know about the languages you are referring to; if you expect non-BMP characters (with code points above 65535) you might want to consider decoding the UTF-16 (or decoding the UTF-8 yourself) to avoid having to deal with two-word characters.
You can also write your own UTF-8 decoder, if you prefer. It's not complicated, and just requires some bit-juggling to extract the proper bits from the input bytes and assemble them into the final unicode value.
HINT: Windows also has a NormalizeString() function, which you can use to make sure the characters from the file are what you expect. This can be used to transform characters that have several representations in Unicode into their "canonical" representation.
EDIT: if you read up on UTF-8 encoding, you can easily see that you can read the first byte, figure out how many more bytes you need, read these as well, and pass the whole thing to MultiByteToWideChar or your own decoder (although your own decoder could just read from the file, of course). That way you could really do a "read one char at a time".
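As a sketch of that one-character-at-a-time idea (a minimal UTF-8 reader of my own, not production-hardened; it assumes a well-formed file and does not validate continuation bytes):
#include <cstdio>   // for EOF
#include <istream>
#include <string>

// Read one UTF-8 encoded character (1 to 4 bytes) from the stream.
// Returns an empty string at end of file or on a malformed lead byte.
std::string read_utf8_char(std::istream& in)
{
    int c = in.get();
    if (c == EOF) return "";
    std::string ch(1, static_cast<char>(c));
    int extra = 0;
    if      ((c & 0x80) == 0x00) extra = 0; // 0xxxxxxx: plain ASCII
    else if ((c & 0xE0) == 0xC0) extra = 1; // 110xxxxx: 2-byte sequence
    else if ((c & 0xF0) == 0xE0) extra = 2; // 1110xxxx: 3-byte sequence
    else if ((c & 0xF8) == 0xF0) extra = 3; // 11110xxx: 4-byte sequence
    else return "";                         // stray continuation byte
    for (int i = 0; i < extra; ++i)
    {
        c = in.get();
        if (c == EOF) return "";
        ch += static_cast<char>(c);
    }
    return ch; // pass this to MultiByteToWideChar(CP_UTF8, ...) when a wide character is needed
}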
'w' classes do not read and write UTF-8. They read and write UTF-16. If your file is in UTF-8, reading it with this code will produce gibberish.
You will need to read it as bytes and then convert it, or write it in UTF-16 in the first place.

Can't read Unicode (Japanese) from a file

Hi, I have a file containing Japanese text, saved as a Unicode file.
I need to read from the file and display the information to the standard output.
I am using Visual Studio 2008.
int main()
{
wstring line;
wifstream myfile("D:\sample.txt"); // file containing Japanese characters, saved as a Unicode file
//myfile.imbue(locale("Japanese_Japan"));
if(!myfile)
cout<<"While opening a file an error is encountered"<<endl;
else
cout << "File is successfully opened" << endl;
//wcout.imbue (locale("Japanese_Japan"));
while ( myfile.good() )
{
getline(myfile,line);
wcout << line << endl;
}
myfile.close();
system("PAUSE");
return 0;
}
This program generates some random output and I don't see any japanese text on the screen.
Oh boy. Welcome to the Fun, Fun world of character encodings.
The first thing you need to know is that your console is not Unicode on Windows. The only way you'll ever see Japanese characters in a console application is if you set your non-Unicode (ANSI) locale to Japanese. This will also make backslashes look like yen symbols and break paths containing European accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but which people still use to this day...).
So first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise to the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is it UTF-8? UTF-16 (and if so, little- or big-endian)? Shift-JIS? EUC-JP? You can only use a wstream to read directly if the file is in little-endian UTF-16. And even then you need to futz with its internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And this is all only the case on Windows as well! Other OSes may have a different wstream representation. It's best not to use wstreams at all really.
So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream - not using a wstream. You must then convert this character string into UTF-16 (assuming you're using windows! Other OSes tend to use UTF-8 char*s). On windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value, and CP_ACP or CP_OEMCP are almost always the wrong answer.
Now, you may be wondering how to determine which code page (ie, character encoding) is correct. The short answer is you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints - eg, if you see a byte order mark, chances are it's whatever variant of unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess, relying on the user to correct you if you're wrong, or you have to select a fixed character set and don't attempt to support any others.
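For the byte order mark hint specifically, here is a small sketch of sniffing the first bytes of the file (the function name is my own; it covers only the common Unicode signatures, everything else stays a guess):
#include <fstream>
#include <string>

// Return a rough guess at the encoding based on a BOM, or "unknown".
std::string sniff_bom(const char* path)
{
    std::ifstream f(path, std::ios::binary);
    unsigned char b[3] = { 0, 0, 0 };
    f.read(reinterpret_cast<char*>(b), 3);
    if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
    if (b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
    if (b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
    return "unknown"; // no BOM: could be Shift-JIS, EUC-JP, BOM-less UTF-8, ...
}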
Someone here had the same problem with Russian characters (he's using basic_ifstream<wchar_t>, which should be the same as wifstream according to this page). In the comments of that question they also link to this, which should help you further.
If I understood everything correctly, it seems that wifstream reads the characters correctly but your program tries to convert them to whatever locale your program is running in.
Two errors. First, the backslash in the path must be escaped:
std::wifstream myfile(L"D:\\sample.txt");
Second, do not mix cout and wcout.
Also check that your file is encoded in UTF-16, Little-Endian. If not so, you will be in trouble reading it.
wfstream uses wfilebuf for the actual reading and writing of the data. wfilebuf defaults to using a char buffer internally which means that the text in the file is assumed narrow, and converted to wide before you see it. Since the text was actually wide, you get a mess.
The solution is to replace the wfilebuf buffer with a wide one.
You probably also need to open the file as binary.
const size_t bufsize = 128;
wchar_t buffer[bufsize];
wifstream myfile("D:\\sample.txt", ios::binary);
myfile.rdbuf()->pubsetbuf(buffer, bufsize);
Make sure the buffer outlives the stream object!
See details here: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx

UCS-2LE text file parsing

I have a text file which was created using some Microsoft reporting tool. The text file includes the BOM 0xFFFE at the beginning, followed by ASCII characters with nulls between them (i.e. "F.i.e.l.d.1."). I can use iconv to convert this to UTF-8 using UCS-2LE as the input format and UTF-8 as the output format... it works great.
My problem is that I want to read lines from the UCS-2LE file into strings, parse out the field values, and then write them out to an ASCII text file (i.e. Field1 Field2). I have tried the string- and wstring-based versions of getline – while they read the string from the file, functions like substr(start, length) interpret the string as 8-bit values, so the start and length values are off.
How do I read the UCS-2LE data into a C++ String and extract the data values? I have looked at boost and icu as well as numerous google searches but have not found anything that works. What am I missing here? Please help!
My example code looks like this:
wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring srcBuf;
..
..
while( getline(srcFile, srcBuf) )
{
wstring field1;
field1 = srcBuf.substr(12, 12);
...
...
}
So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".
What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using boost (or something else) to read these strings from the file and convert them to a fixed width representation for internal use?
BTW, I am on a Mac using Eclipse and gcc. Is it possible my STL does not understand wide character strings?
Thanks!
Having spent some good hours tackling this question, here are my conclusions:
Reading a UTF-16 (or UCS-2LE) file is apparently manageable in C++11; see How do I write a UTF-8 encoded string to a file in Windows, in C++
Functionality of this kind, once only available through libraries such as boost::locale, is now part of C++11: one can just use std::codecvt_utf16 (see the sketch after this answer for a code sample)
However, in older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode
Alternatively, one can also try this method of reading, though it did not work in my case: the output was missing lines, which were replaced by garbage chars.
I wasn't able to get this done with my pre-C++11 compiler and had to resort to scripting the task in Ruby and spawning a process (it's just in a test, so I think that kind of complication is OK there).
Hope this spares others some time, happy to help.
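For the C++11 route mentioned above, a minimal sketch (the file name "report.txt" is a placeholder, and the substr indexes echo the question's example; std::codecvt_utf16 lives in <codecvt> and was deprecated much later, in C++17, but is fine here):
#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::wifstream src("report.txt", std::ios::binary);
    // Interpret the bytes as little-endian UTF-16 and skip the BOM.
    src.imbue(std::locale(src.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF,
            std::codecvt_mode(std::little_endian | std::consume_header)>));
    std::wstring line;
    while (std::getline(src, line))
    {
        std::wstring field1 = line.substr(12, 12); // now indexes characters, not bytes
        std::wcout << field1 << L"\n";
    }
}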
substr works fine for me on Linux with g++ 4.3.3. The program
#include <string>
#include <iostream>
using namespace std;
int main()
{
wstring s1 = L"Hello, world";
wstring s2 = s1.substr(3,5);
wcout << s2 << endl;
}
prints "lo, w" as it should.
However, the file reading probably does something different from what you expect. It converts the file from the locale's narrow encoding to wchar_t, which causes each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.