How to read the … character and French accents from a text file - C++

I am given a text file that contains a couple of characters per line. I have to read it line by line and run a lexical analyzer on each character, then write the analysis to another file.
With the following code, I have no problem reading French accents, but I realized that the character '…' (this is one character, not three dots) is turned into a '&'.
Note: My lexical analyzer must use strings; that's why I converted the wstring back to a string.
wfstream SourceFile;
ofstream ResultFile (ResultFileName);
locale utf8_locale(std::locale(), new codecvt_utf8<wchar_t>);
SourceFile.imbue(utf8_locale);
SourceFile.open(SourceFileName);
while(getline(SourceFile, wLineBuffer))
{
string LineBuffer( wLineBuffer.begin(), wLineBuffer.end() );
...
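Edit: Raymond Chen figured out that the character is lost because of my conversion from wstring to string. Copying each wchar_t into a char keeps only the low byte, so '…' (U+2026) becomes 0x26, which is '&', while 'é' (U+00E9) still fits in one byte.
So the new question is: how do I convert from a wstring to a string without transforming the characters?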
Edit: file sample
"stringééé"
"ccccccccccccccccccccccccccccccccccccccccccccccccccccccccc"
Identificateur1
Identificateur2
// Commentaire22
/**/
/*
Autre commentaire
…
*/

You need a proper Unicode support library. Forget using the broken Standard functions. They were not designed to support Unicode, don't support Unicode, and cannot be extended to support it properly. Look into using ICU or Boost.Locale or something like that.
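If the lexical analyzer only needs narrow strings and can pass non-ASCII bytes through untouched, one option is to skip the wide-character round trip altogether. This is a different technique from the library route suggested above, and only a minimal sketch: it assumes the input file is UTF-8, and the name read_utf8_lines is hypothetical. Each line is read as raw bytes, so '…' stays as its three UTF-8 bytes (E2 80 A6) instead of being truncated.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Reads a text file line by line into narrow strings without decoding.
// Assumption: the file is UTF-8, so every std::string holds raw UTF-8 bytes.
std::vector<std::string> read_utf8_lines(const std::string& path)
{
    std::vector<std::string> lines;
    std::ifstream source(path, std::ios::binary);
    std::string line;
    while (std::getline(source, line))
    {
        // Drop a trailing CR left over from CR+LF line endings.
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        lines.push_back(line);
    }
    return lines;
}
```

A lexer working on these strings sees multi-byte characters as several char values, so it has to treat bytes of 0x80 and above as part of a longer character rather than as individual symbols.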

Related

Inputting a string containing greek characters in linux

I have a function which returns a string.
I have to define that string with greek characters in the function itself and should return that string.
I am working on Linux platform and my code is in C++.
My function is as follows:
string gen_string()
{
string str = "αγρω";
return str;
}
But I am not able to enter the string.
When I try to copy and paste the Greek characters I want, they appear as garbage characters.
Can someone please help me with this?
Thanks in advance.
EDIT:
Thanks for all your responses.
It's not about using wstring or string.
When I paste the string into vim to use it as input, it appears as something like this:
▒~^▒~T▒~A▒~A201604¸▒~B▒žMDF_F▒~S123▒~T▒~B▒▒~B▒
I also tried putting the text in a file and opening that file in vim.
But the result is still the same.
string is only for ASCII characters, I believe.
You have international, likely Unicode characters. Consider using std::wstring for a multibyte "wide" string.
If you mean copy from some text to the terminal input then how to do this depends on the terminal. If it's a gnome terminal you need to specify UTF-8 in the locale settings though I'm not sure if that would get you the Greek alphabet.
locale command will list the current locale setting in locale.conf. You likely want to change the LANG setting. A way to do this system wide is
localectl set-locale LANG=en_country_code.UTF-8
Change country_code. It's US for the United States but I don't know what the Greek code is. You may need to be root. To change it just for yourself modify
~/.config/locale.conf
(or $XDG_CONFIG_HOME/locale.conf or $HOME/.config/locale.conf).
whichever gets you to the locale.conf file. On most systems all of them do.
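If the editor or terminal encoding cannot be fixed, a possible workaround is to keep the source file pure ASCII and spell out the bytes with escapes. The sketch below assumes the consumer of the string expects UTF-8; the byte values are the UTF-8 encodings of α, γ, ρ and ω.

```cpp
#include <string>

// Returns "αγρω" encoded as UTF-8. The literal uses hex escapes, so the
// source file itself contains only ASCII characters.
std::string gen_string()
{
    // UTF-8 encodings: α = CE B1, γ = CE B3, ρ = CF 81, ω = CF 89
    std::string str = "\xCE\xB1\xCE\xB3\xCF\x81\xCF\x89";
    return str;
}
```

The string holds eight bytes for four characters; it only displays as Greek when the terminal or file viewer also interprets the bytes as UTF-8.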

Decoding %E6%B0%94%E6%97%8B%E5%93%88%E5%88%A9.txt to a valid string

I am trying to decode a filename*= field of content disposition header. I get a string something like:
%E6%B0%94%E6%97%8B%E5%93%88%E5%88%A9.txt
What I have figured out is that replacing % with \x works fine and I get the correct file name:
气旋哈利.txt
Is there a standard way of doing this in C++? Is there any library available to decode this?
I tried
boost::replace_all(name, "%x","\\x");
std::locale::generator gen;
std::locale locl = gen.generate("en_US.utf-8");
decoded_data = boost::locale::conv::from_utf( encoded_data, locl);
But it prints the replaced string instead of the Chinese characters:
\xE6\xB0\x94\xE6\x97\x8B\xE5\x93\x88\xE5\x88\xA9.txt
Any idea where I am going wrong?
Escape codes like "\xE6" only work in string and character literals, not in strings built at run time. That's because they are handled by the compiler when it compiles the program.
However, it's not very hard to do yourself, using a simple loop that checks for the '%' character, takes the next two characters, converts them to a number, and uses that number as a "character".
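The loop described above could look roughly like this. It is a minimal sketch, not a complete RFC-style decoder: the names percent_decode and hex_value are hypothetical, and a '%' that is not followed by two hex digits is copied through unchanged.

```cpp
#include <cctype>
#include <string>

// Converts one hexadecimal digit to its value.
// The caller must have checked the character with std::isxdigit first.
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return c - 'A' + 10;
}

// Replaces every "%XX" sequence with the byte whose value is hex XX.
std::string percent_decode(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        if (in[i] == '%' && i + 2 < in.size()
            && std::isxdigit(static_cast<unsigned char>(in[i + 1]))
            && std::isxdigit(static_cast<unsigned char>(in[i + 2])))
        {
            out.push_back(static_cast<char>(hex_value(in[i + 1]) * 16
                                            + hex_value(in[i + 2])));
            i += 2; // skip the two hex digits just consumed
        }
        else
        {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

The result is a sequence of UTF-8 bytes; whether it displays as Chinese characters depends on the console or API that receives it.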

Convert path to \\

Okay, after two days of searching the web and MSDN, I haven't found any real solution to this problem, so I'm going to ask here in the hope that I've overlooked something.
I have an open-file dialog, and after I get the location of the selected file, it gives the string in the following form: C:\file.exe. For the next part of my program I need C:\\file.exe. Is there any Microsoft function that can solve this problem, or some workaround?
ofn.lpstrFile = fileName;
char fileNameStr[sizeof(fileName)+1] = "";
if (GetOpenFileName(&ofn))
strcpy(fileNameStr, fileName);
DeleteFile(fileName); // doesn't work, invalid path
I've posted only this part of the code, because everything else works fine and isn't relevant to this problem. Any assistance is greatly appreciated, as I've been going mad for the last two days.
You are confusing the requirement in C and C++ to escape backslash characters in string literals with what Windows requires.
Windows allows double backslashes in paths in only two circumstances:
Paths that begin with "\\?\"
Paths that refer to share names such as "\\myserver\foo"
Therefore, "C:\\file.exe" is never a valid path.
The problem here is that Microsoft made the (disastrous) decision decades ago to use backslashes as path separators rather than forward slashes like UNIX uses. That decision has been haunting Windows programmers since the early 1980s because C and C++ use the backslash as an escape character in string literals (and only in literals).
So in C or C++ if you type something like DeleteFile("c:\file.exe") what DeleteFile will see is "c:ile.exe" with an unprintable 0x0C character inserted between the colon and "ile.exe". That's because the compiler sees the backslash and interprets it to mean the next character isn't what it appears to be. In this case, the next character is an f, and "\f" is the escape sequence for the form feed control character. Therefore, the compiler converts "\f" into the character 0x0C, which isn't valid in a file name.
So how do you create the path "c:\file.exe" in a C/C++ program? You have two choices:
"c:/file.exe"
"c:\\file.exe"
The first choice works because in the Win32 API (and only the API, not the command line), forward slashes in paths are accepted as path separators. The second choice works because the first backslash tells the compiler to treat the next character specially. If the next character is one of the recognized escape letters (such as n, t, or f), you get the corresponding control character. If the next character is another backslash, it will be interpreted as exactly that and your string will be correct.
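The difference between these literals can be checked directly. The sketch below uses hypothetical function names and adds one option not mentioned above, the C++11 raw string literal, which turns off escape processing entirely.

```cpp
#include <string>

// Escaped backslash: the resulting string contains one real backslash.
std::string escaped_path() { return "c:\\file.exe"; }

// Forward slash: accepted by the Win32 API as a path separator.
std::string forward_slash_path() { return "c:/file.exe"; }

// Raw string literal (C++11): no escape processing happens at all.
std::string raw_path() { return R"(c:\file.exe)"; }

// Unescaped backslash: "\f" becomes a form feed (0x0C), so the string is
// "c:" followed by a form feed and "ile.exe".
std::string broken_path() { return "c:\file.exe"; }
```

Here escaped_path() and raw_path() produce the same 11-character string, while broken_path() is only 10 characters long because "\f" collapsed into a single control character. A path returned at run time by GetOpenFileName already contains real backslashes and needs no escaping.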
The library Boost.Filesystem "provides portable facilities to query and manipulate paths, files, and directories".
In short, you should not use strings as file or path names. Use boost::filesystem::path instead. You can still init it from a string or char* and you can convert it back to std::string, but all manipulations and decorations will be done correctly by the class.
I'm guessing you mean converting "C:\file.exe" to "C:\\file.exe":
std::string output_string;
for (auto character : input_string)
{
if (character == '\\')
{
output_string.push_back(character);
}
output_string.push_back(character);
}
Please note that the code is actually looking for a single backslash; the double backslash in the character literal is there to escape it.

How to use extended Unicode characters in C++ in Visual Studio?

We are using a Korean font and the FreeType library and trying to display a Korean character, but it displays some other characters instead of the intended one.
Code:
std::wstring text3 = L"놈";
Are there any tricks for typing Korean characters?
For maximum portability, I'd suggest avoiding encoding Unicode characters directly in your source code and using \u escape sequences instead. The character 놈 is Unicode code point U+B188, so you could write this as:
std::wstring text3 = L"\uB188";
The question is what is the encoding of the source code.
It is likely UTF-8, which is one of the reasons not to use wstring. Use regular string. For more information on my way of handling characters, see http://utf8everywhere.org.
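Both suggestions can be written so that the source file stays ASCII-only and its encoding no longer matters. This is only a sketch with hypothetical function names; EB 86 88 is the UTF-8 encoding of U+B188.

```cpp
#include <string>

// U+B188 as a wide string, written with a universal character name.
std::wstring korean_wide()
{
    return L"\uB188";
}

// The same character as UTF-8 bytes in a narrow string (EB 86 88).
std::string korean_utf8()
{
    return "\xEB\x86\x88";
}
```

Which form is more convenient depends on what the text rendering code expects: a sequence of code points, or UTF-8 bytes that still have to be decoded before glyph lookup.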

C++ change newline from CR+LF to LF

I am writing code that runs in Windows and outputs a text file that later becomes the input to a program in Linux. This program behaves incorrectly when given files that have newlines that are CR+LF rather than just LF.
I know that I can use tools like dos2unix, but I'd like to skip the extra step. Is it possible to get a C++ program in Windows to use the Linux newline instead of the Windows one?
Yes, you have to open the file in "binary" mode to stop the newline translation.
How you do it depends on how you are opening the file.
Using fopen:
FILE* outfile = fopen( "filename", "wb" );
Using ofstream:
std::ofstream outfile( "filename", std::ios_base::binary | std::ios_base::out );
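Put together, a small helper along these lines writes LF-only files on any platform. It is a minimal sketch: the function name write_lf_lines is hypothetical, and it assumes the lines themselves contain no line-ending characters.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Writes each line followed by a single LF. Binary mode prevents the
// runtime from translating '\n' into CR+LF on Windows.
bool write_lf_lines(const std::string& path,
                    const std::vector<std::string>& lines)
{
    std::ofstream out(path, std::ios_base::binary | std::ios_base::out);
    if (!out)
        return false; // the file could not be opened
    for (const std::string& line : lines)
        out << line << '\n';
    return static_cast<bool>(out); // false if any write failed
}
```

Note that std::endl would also write a single '\n' here; it is the stream's open mode, not the way the newline is spelled in the source, that decides whether translation happens.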
OK, so this is probably not what you want to hear, but here's my $0.02 based on my experience with this:
If you need to pass data between different platforms, in the long run you're probably better off using a format that doesn't care what line breaks look like. If it's text files, users will sometimes mess with them. If by messing the line endings up they cause your application to fail, this is going to be a support intensive application.
Been there, done that, switched to XML. Made the support guys a lot happier.
A much cleaner solution is to use the ASCII escape sequence for the LF character (decimal 10): '\012' or '\x0A' represents an explicit single line feed regardless of platform.
Note that this does not work on at least some compilers; for example, on MSVC 2019 16.11.6, both '\012' and '\x0A' get translated to carriage return and line feed. It also does not matter there whether a string literal ("\012") or a char literal ('\012') is used.
This method also avoids string length surprises, as '\n' can expand to two characters. But so can multibyte Unicode characters, in UTF-8, when written directly into a string literal in the source code.
Note also that '\r' is the platform-independent code for a single carriage return (decimal 13). The '\f' character is not the line feed, but rather the form feed (decimal 12), which is not a newline on any platform I am aware of. C does not offer a single-character backslash escape for the line feed, thus the need for the longer octal or hexadecimal escapes.