read txt file in c++ (chinese)

I'm trying to develop a function that checks whether a Chinese word entered by the user is in a txt file or not. The following is the code, but it is not working. I want to know what the problem is. Please help me.
setlocale(LC_ALL, "Chinese-simplified");
locale::global(locale("Chinese_China"));
SetConsoleOutputCP(936);
SetConsoleCP(936);

bool exist = false;
cout << "\n\n <Find the keyword whether it is in that image or not> \n ";
cout << "Enter word to search for: ";

wstring search;
wcin >> search; // There is a problem entering Chinese here.

wfstream file_text("./a.txt");
wstring line;
wstring::size_type pos;
while (getline(file_text, line))
{
    pos = line.find(search);
    if (pos != wstring::npos) // wstring::npos is returned if the string is not found
    {
        cout << "Found!" << endl;
        exist = true;
        break;
    }
}
When I use this code, the result is as follows (screenshot not reproduced here): entering the Chinese word does not work.

If you're interested in more details, please see stod-does-not-work-correctly-with-boostlocale for a more detailed description of how locale works.
In a nutshell, the more interesting part for you:
std::stream objects (stringstream, fstream, cin, cout) have an inner locale object, which matches the value of the global C++ locale at the moment the stream object is created. As std::cin is created long before your code in main is called, it most probably has the classic C locale, no matter what you do afterwards.
You can make sure that a std::stream object has the desired locale by invoking std::stream::imbue(std::locale(your_favorite_locale)).
I would like to add the following:
It is almost never a good idea to set the global locale: it might break other parts of the program or third-party libraries; you never know.
std::setlocale and locale::global do slightly different things, but locale::global resets not only the global C++ locale but also the C locale (which is also set by std::setlocale, not to be confused with the classic "C" locale), so you should call them in the other order if you want the C++ locale set to Chinese_China and the C locale to chinese-simplified.
First
locale::global(locale("Chinese_China"));
and then
setlocale(LC_ALL, "Chinese-simplified");

Try locale::global(locale("Chinese_China.936")); or locale::global(locale(""));
And for LC_ALL "chinese-simplified" or "chs"
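Putting the two answers together, here is a minimal sketch of the suggested fix. It assumes the Windows locale name "Chinese_China.936" is installed and has not been tested against the asker's exact setup:

#include <clocale>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>
using namespace std;

int main() {
    // Global C++ locale first, then the C locale, in the order described above.
    locale::global(locale("Chinese_China.936"));
    setlocale(LC_ALL, "chinese-simplified");

    locale chinese("Chinese_China.936");
    wcin.imbue(chinese);  // make sure the already-created streams use the
    wcout.imbue(chinese); // desired locale, not the classic "C" one

    wcout << L"Enter word to search for: ";
    wstring search;
    wcin >> search;

    wfstream file_text("./a.txt");
    file_text.imbue(chinese); // streams created now inherit the global locale
                              // anyway, but imbue makes it explicit
    wstring line;
    while (getline(file_text, line)) {
        if (line.find(search) != wstring::npos) {
            wcout << L"Found!" << endl;
            return 0;
        }
    }
    wcout << L"Not found." << endl;
}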

If using Vladislav's answer does not solve this, take a look at the answer to "Shift-JIS decoding fails using wifstrem in Visual C++ 2013" on Stack Overflow:
const int oldMbcp = _getmbcp();
_setmbcp(936); // temporarily switch the CRT multibyte code page to 936 (Simplified Chinese)
const std::locale locale("Chinese_China.936");
_setmbcp(oldMbcp); // restore the previous code page
There appears to be a bug in Visual Studio's implementation of locales. See also c++ - double byte character sequence conversion issue in Visual Studio 2015 - Stack Overflow.

Related

Display large UTF-8-encoded strings for standard output decently, despite Windows or MinGW bugs

2nd Update: I found a very simple solution to this actually not-that-hard problem, only one day after asking. But people seem to be small-minded, so there are three close votes already:
Duplicate of "How to use unicode characters in Windows command line?" (1x):
Obviously not, which has been clarified in the comments. This is not about the Windows command line tool, which I do not use.
Unclear what you're asking (1x):
Then you must suffer from functional illiteracy. I cannot be any more concrete when I ask, for example, "Is there an easy way to determine whether a char in a std::string is a non-ending part of a UTF-8 symbol?" (marked bold for better visibility, indeed) and state that this would be sufficient to answer the question (and even explain why). Seriously, there are even pictures to show the problem. Furthermore, my own existing answer should clarify it even more. Your own deficiencies are not sufficient to declare something as too hard to understand.
Too broad (1x) ("Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer [...]"):
This must be another case of functional illiteracy. I stated clearly that a single way to solve the problem (which I have already found) is sufficient. You can identify an adequate answer as follows: take a look at my own accepted answer. Alternatively, interpret my well-defined words if you are able to, which several people on this platform unfortunately seem not to be.
There is, however, an actual reason to close this question: it has already been solved. But there is no such option for a close vote. So, clearly, Stack Exchange accepts that alternative solutions may still be found. Since I am a curious person, I am also interested in alternative ways to solve this. If you cannot cope with understanding what the problem is, and that it is quite relevant in certain environments (e.g. ones that use Windows, C++ in Eclipse CDT, UTF-8, but no Visual Studio and no Windows Console), then you can just leave without standing in the way of other people satisfying their curiosity. Thanks!
1st Update: I used app.exe > out.txt 2>&1, which generates a file without these formatting issues. So the problem is that usually std::cout does this splitting, but the underlying control (which receives the char sequence) has to handle the correct reassembling? (Unfortunately nothing seems to handle it on Windows, except file streams. So I still need to circumvent this. Preferably without writing to files first and displaying their content, which of course works.)
On the system that I use (Windows 7; MinGW-w64 (GCC 8.1 for Windows)), there is a bug with std::cout such that UTF-8 encoded strings are printed out before they are reassembled, even if they were disassembled internally by std::cout after being passed as one large string. The following code illustrates how the bug seems to behave. Note, however, that the faulty displays appear to be random, i.e. the way std::cout slices up (equal) std::string objects is not the same for every execution of the program. But the problems appear consistently at indices that are multiples of 1024, which is how I inferred that behavior.
#include <iostream>
#include <sstream>

void myFaultyOutput();
void simulatedFaultyBehavior();

int main()
{
    myFaultyOutput();
    //simulatedFaultyBehavior();
}

void myFaultyOutput() {
    std::stringstream ss; // Note that ss is built correctly (which could be shown by saving ss.str() to a file).
    ss << "...";
    for (int i = 0; i < 20; i++) {
        for (int j = 0; j < 341; j++)
            ss << u8"\u301A";
        ss << "\n..";
    }
    std::cout << ss.str() << std::endl; // Problem occurs here, with cout.
    // Note that converting ss.str() to UTF-16 std::wstring and using std::wcout results in std::wcout not
    // displaying anything, not even ASCII characters in the future (until restarting the application).
}

// To display the problem on well-behaved systems; just imagine the output would not contain newlines,
// while the faulty formatted characters remain.
void simulatedFaultyBehavior() {
    std::stringstream ss;
    int amount = 2000;
    for (int j = 0; j < amount; j++)
        ss << u8"\u301A";
    std::string s = ss.str();
    std::cout << "s.length(): " << s.length() << std::endl; // amount * 3
    while (s.length() > 1024) {
        std::cout << s.substr(0, 1024) << std::endl;
        s = s.substr(1024);
    }
    std::cout << s << std::endl;
}
To circumvent this behavior, I would like to split up large strings (which I receive as such from an API) manually into parts of length less than 1024 chars (and then call std::cout separately on each of them). But I don't know which chars are actually just a non-ending part of a UTF-8 symbol, and the built-in Unicode converters also seem to be unreliable (possibly also system-dependent?). Is there an easy way to determine whether a char in a std::string is a non-ending part of a UTF-8 symbol? The following quote explains why answering this question would be sufficient.
A UTF-8 character can, for example, consist of three chars. So if one splits a string into two parts, it should keep those three chars together. Otherwise, one has to do what the existing GUI controls are clearly unable to do consistently: reassemble UTF-8 characters that have been split into pieces.
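(For reference, the check itself is mechanical: in UTF-8, all non-initial bytes of a multi-byte sequence match the bit pattern 10xxxxxx. A minimal sketch of a boundary-safe chunked printer built on that fact; the function names are illustrative, not from the post:)

#include <algorithm>
#include <iostream>
#include <string>

// True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx),
// i.e. any byte of a multi-byte character except the first.
bool isUtf8Continuation(char b) {
    return (static_cast<unsigned char>(b) & 0xC0) == 0x80;
}

// Print s in chunks of at most maxLen bytes, never cutting inside
// a multi-byte UTF-8 character: if the byte right after the cut is
// a continuation byte, move the cut backwards.
void printInUtf8Chunks(const std::string& s, std::size_t maxLen = 1023) {
    std::size_t pos = 0;
    while (pos < s.size()) {
        std::size_t len = std::min(maxLen, s.size() - pos);
        while (len > 1 && pos + len < s.size() && isUtf8Continuation(s[pos + len]))
            --len; // back up to a character boundary
        std::cout << s.substr(pos, len) << std::flush;
        pos += len;
    }
}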
Better ideas to circumvent the problem (others than "Don't use Windows" / "Don't use UTF-8" / "Don't use cout", of course) are also welcome.
Note that this question is unrelated to the Windows Console (I do not use it; things are displayed in Eclipse and optionally on wxWidgets UI elements, which display UTF-8 correctly). It is also unrelated to MSVC (I use the MinGW compiler, as I have mentioned). The code comments also mention that using std::wcout with UTF-16 does not work at all (due to another MinGW and Eclipse bug). The bug results from UI controls being unable to handle what std::cout does (which may be intentional or not). Furthermore, everything usually works fine, except for those UTF-8 symbols that were split up into different chars (e.g. \u301A into \u0003 + \u001A) at indices that are multiples of 1024 (and only randomly). This behavior already implies that most assumptions of commenters are false. Please consider the code, especially its comments, carefully rather than rushing to conclusions.
To clarify the display issue when calling myFaultyOutput(), the original post includes screenshots of the garbled output in Eclipse CDT and in Scintilla (implemented in wxWidgets as wxStyledTextCtrl).
I worked out a fairly simple workaround by experimenting, and I am surprised that nobody knew it (I found nothing like it online).
N.m.'s attempted answer gave a good hint by mentioning the platform-specific function _setmode. What it does "by design" (according to this answer and this article) is to set the file translation mode, which is how the process's input and output streams are handled. But at the same time, it invalidates using std::ostream / std::istream and instead dictates using std::wostream / std::wistream for decently formatted input and output.
For instance, using _setmode(_fileno(stdout), _O_U8TEXT) leads to std::wcout working well with outputting std::wstring as UTF-8, but std::cout printing garbage characters, even for ASCII arguments. But I want to be able to mainly use std::string, especially std::cout, for output. As I have mentioned, it is a rare case that the formatting for std::cout fails, so only in cases where I print strings that may lead to this issue (potential multi-char-encoded characters at indices of at least 1024) do I want to use a special output function, say coutUtf8String(string s).
The default (untranslated) mode of _setmode is _O_BINARY. We can temporarily switch modes. So why not just switch to _O_U8TEXT, convert the UTF-8 encoded std::string object to a std::wstring, use std::wcout on it, and then switch back to _O_BINARY? To stay platform-independent, one can simply issue the usual std::cout call when not on Windows. Here is the code:
#include <iostream>
#include <string>
#include <cstdint>
using namespace std;

#if defined(_WIN32) || defined(WIN32) || defined(__CYGWIN__)
#include <fcntl.h> // Also includes the non-standard file <io.h>
                   // (POSIX compatibility layer) to use _setmode on Windows NT.
#include <io.h>    // In case the <fcntl.h> of your distribution does not pull it in.
#include <cstdio>  // _fileno
#ifndef _O_U8TEXT  // Some GCC distributions such as TDM-GCC 9.2.0 require this explicit
                   // definition since, depending on __MSVCRT_VERSION__, they might
                   // not define it.
#define _O_U8TEXT 0x40000
#endif
#endif

wstring utf8toWide(const char* in); // defined below

void coutUtf8String(string s) {
#if defined(_WIN32) || defined(WIN32) || defined(__CYGWIN__)
    if (s.length() > 1024) {
        // Set translation mode of wcout to UTF-8, which renders cout unusable "by design"
        // (see https://developercommunity.visualstudio.com/t/_setmode_filenostdout-_O_U8TEXT;--/394790#T-N411680).
        if (_setmode(_fileno(stdout), _O_U8TEXT) != -1) {
            wcout << utf8toWide(s.c_str()) << flush; // We must flush before resetting the mode.
            // Set translation mode of wcout to untranslated, which renders cout usable again.
            _setmode(_fileno(stdout), _O_BINARY);
        } else {
            // Let's use wcout anyway: no sink (such as Eclipse's console
            // window) is attached when _setmode fails, and such sinks seem to be
            // the cause for wcout to fail in default mode. The UI console view
            // is filled properly like this, regardless of translation modes.
            wcout << utf8toWide(s.c_str()) << flush;
        }
    } else {
        cout << s << flush;
    }
#else
    cout << s << flush;
#endif
}

wstring utf8toWide(const char* in) {
    wstring out;
    if (in == nullptr)
        return out;
    uint32_t codepoint = 0;
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)      // single-byte (ASCII) character
            codepoint = ch;
        else if (ch <= 0xbf) // continuation byte: shift in its six payload bits
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf) // lead byte of a two-byte sequence
            codepoint = ch & 0x1f;
        else if (ch <= 0xef) // lead byte of a three-byte sequence
            codepoint = ch & 0x0f;
        else                 // lead byte of a four-byte sequence
            codepoint = ch & 0x07;
        ++in;
        // If the next byte is not a continuation byte, the code point is complete.
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            if (codepoint > 0xffff) {
                // Encode as a UTF-16 surrogate pair (0xd7c0 == 0xd800 - (0x10000 >> 10)).
                out.append(1, static_cast<wchar_t>(0xd7c0 + (codepoint >> 10)));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
            } else if (codepoint < 0xd800 || codepoint >= 0xe000)
                out.append(1, static_cast<wchar_t>(codepoint));
        }
    }
    return out;
}
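A hypothetical usage example (not part of the original post) that exercises both paths of coutUtf8String:

int main() {
    coutUtf8String(u8"short string: \u301A"); // <= 1024 bytes: plain cout path

    string big;
    for (int i = 0; i < 2000; i++)
        big += u8"\u301A";                    // 6000 bytes: _setmode + wcout path
    coutUtf8String(big);
    return 0;
}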
This solution is especially convenient since it does not effectively abandon UTF-8, std::string or std::cout, which are mainly used for good reasons, but simply keeps using std::string itself and retains platform independence. I rather agree with this answer that adding wchar_t (and all the redundant baggage that comes with it, such as std::wstring, std::wstringstream, std::wostream, std::wistream, std::wstreambuf) to C++ was a mistake. Just because Microsoft makes bad design decisions, one should not adopt their mistakes, but rather work around them.
Visual confirmation: the screenshots in the original post show the output displaying correctly with this workaround.

my c++ file is not opened, why?

cout << "enter name of file : " << endl;
char nof[30];
for (int i = 0; i < 20; ++i) {
    cin >> nof[i];
    if (nof[i-1] == 'x') {
        if (nof[i] == 't') {
            break;
        }
    }
}
fstream file1;
file1.open(nof);
if (file1.is_open()) cout << "file is open" << endl;
That is code which should take the name of a file from the user and open it, but I checked whether it is open and it is not. What should I do?
Try using this:
#include <string>
#include <iostream>
#include <errno.h>
#include <fstream>
#include <cstring>
using namespace std;

int main() {
    cout << "Enter the name of the file : ";
    string file_name;
    getline(cin, file_name);
    fstream file_stream;
    file_stream.open(file_name);
    if (file_stream.is_open()) {
        // File stuff goes here...
        cout << "The file is open" << endl;
    } else {
        // The file may not exist or may be locked by some other process.
        cout << strerror(errno) << endl; // Edited this line.
    }
}
The way you handle user input makes the variable nof an invalid file path on your running OS. That's why fstream::is_open() returns false.
for (int i = 0; i < 20; ++i) {
    cin >> nof[i];
    if (nof[i-1] == 'x') {
        if (nof[i] == 't') {
            break;
        }
    }
}
This code takes user input until it gets "xt". But in C/C++, a valid string of char* or char[] type has to end with a '\0' character. So if you still love the way you handle input, append '\0' to the end of nof before you break the loop.
for (int i = 0; i < 20; ++i) {
    cin >> nof[i];
    if (nof[i-1] == 'x') {
        if (nof[i] == 't') {
            nof[i+1] = 0; // or nof[i+1] = '\0' or nof[i+1] = NULL;
            break;
        }
    }
}
But I suggest you use std::string and getline instead; the above way is quite awkward.
std::string nof;
std::getline(std::cin, nof);
std::fstream file;
file.open(nof.c_str(), std::fstream::in | std::fstream::out);
Mohit's answer tells you how to detect failure of std::fstream::open.
That function would usually use some operating system service to open a file, generally some open system call like open(2) on Linux (which can fail for many reasons).
Your program is buggy because your nof probably does not contain a valid file path. I would recommend clearing it with memset(nof, 0, sizeof(nof)) before reading it, and using your debugger, e.g. gdb, to find your bug (if you enter a filename of only three characters, or one of forty letters, your program won't work).
You could ask your operating system for the reason for that failure. On Linux you would use errno(3) (e.g. through perror(3)).
As far as I know, the C++ standard does not specify how to query the reason for a failure of std::fstream::open (and probably does not require any relation between fstream and errno).
Pedantically, the C++ standard does not require std::fstream to use operating system files. Of course, in practice, fstream-s always use them. But in principle you might have a C++14 implementation on something without files or even without an OS (but I cannot name any).
The notion of file is in practice tightly related to operating systems and file systems. You can have OSes without files (in the past, OS/400, PalmOS, GrassHopper and other academic OSes) even if that is highly unusual today.
And the notion of file is specific to an OS: A file on Windows is not the same as a file on Unix or on z/OS.
Languages standard specifications (like C++11 n3337, C11 n1570, Scheme R5RS) are written in English and they are purposely vague on "files" or "file streams" (precisely because different OSes have different notions of them).

Is it possible to print UTF-8 string with Boost and STL in windows console?

I'm trying to output a UTF-8 encoded string with cout, with no success. I'd like to use Boost.Locale in my program. I've found some info regarding Windows console specifics. For example, this article http://www.boost.org/doc/libs/1_60_0/libs/locale/doc/html/running_examples_under_windows.html says that I should set the output console code page to 65001 and save all my sources in UTF-8 encoding with BOM. So, here is my simple example:
#include <windows.h>
#include <boost/locale.hpp>
using namespace std;
using namespace boost::locale;
int wmain(int argc, const wchar_t* argv[])
{
    //system("chcp 65001 > nul"); // It's the same as SetConsoleOutputCP(CP_UTF8)
    SetConsoleOutputCP(CP_UTF8);
    locale::global(generator().generate(""));
    static const char* utf8_string = u8"♣☻▼►♀♂☼";
    cout << "cout: " << utf8_string << endl;
    printf("printf: %s\n", utf8_string);
    return 0;
}
I compile it with Visual Studio 2015 and it produces the following output in console:
cout: ���������������������
printf: ♣☻▼►♀♂☼
Why does printf do it well while cout doesn't? Can the locale generator of Boost help with it? Or should I use something else to print UTF-8 text to the console in a stream mode (a cout-like approach)?
It looks like std::cout is much too clever here: it tries to interpret your UTF-8 encoded string as an ASCII one and finds 21 non-ASCII characters that it outputs as the unmapped character �. AFAIK the Windows C++ console driver insists on each character from a narrow char string being mapped to a position on screen, and does not support multi-byte character sets.
Here what happens under the hood:
utf8_string is the following char array (just look at a Unicode table and do the UTF-8 conversion):
utf8_string = { 0xe2, 0x99, 0xa3, 0xe2, 0x98, 0xbb, 0xe2, 0x96, 0xbc, 0xe2, 0x96, 0xba,
                0xe2, 0x99, 0x80, 0xe2, 0x99, 0x82, 0xe2, 0x98, 0xbc, '\0' };
That is 21 characters, none of which is in the ASCII range 0-0x7f.
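(A trivial way to verify this byte sequence yourself; this snippet is illustrative, not part of the original answer:)

#include <cstdio>

int main() {
    static const char* utf8_string = u8"♣☻▼►♀♂☼";
    for (const char* p = utf8_string; *p != '\0'; ++p)
        std::printf("0x%02x ", static_cast<unsigned char>(*p)); // prints the 21 bytes listed above
    std::printf("\n");
    return 0;
}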
On the other side, printf just outputs the bytes without any conversion, giving the correct output.
I'm sorry, but even after many searches I could not find an easy way to correctly display UTF-8 output on a Windows console using a narrow stream such as std::cout.
But you should notice that your code fails to imbue the Boost locale into cout.
The key problem is that the implementation of cout << "some string", after long and painful adventures, calls WriteFile for every character.
If you'd like to debug it, set a breakpoint inside the _write function in the write.c file of the CRT sources, write something to cout, and you'll see the whole story.
So we can rewrite your code
static const char* utf8_string = u8"♣☻▼►♀♂☼";
cout << utf8_string << endl;
with an equivalent (and faster!) one:
static const char* utf8_string = u8"♣☻▼►♀♂☼";
const size_t utf8_string_len = strlen(utf8_string);
DWORD written = 0;
for (size_t i = 0; i < utf8_string_len; ++i)
    WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8_string + i, 1, &written, NULL);
output: ���������������������
Replace the loop with a single call to WriteFile and the UTF-8 console becomes brilliant:
static const char* utf8_string = u8"♣☻▼►♀♂☼";
const size_t utf8_string_len = strlen(utf8_string);
DWORD written = 0;
WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8_string, utf8_string_len, &written, NULL);
output: ♣☻▼►♀♂☼
I tested it on msvc.2013 and msvc.net (2003); both behave identically.
Obviously the Windows console implementation wants whole characters in a call to WriteFile/WriteConsole and cannot accept UTF-8 characters byte by byte. :)
What can we do here?
My first idea is to make the output buffered, like for files. It's easy:
static char cout_buff[128];
cout.rdbuf()->pubsetbuf(cout_buff, sizeof(cout_buff));
cout << utf8_string << endl; // works
cout << utf8_string << endl; // does nothing
output: ♣☻▼►♀♂☼ (only once; I explain why later)
The first issue is that console output becomes delayed: it waits until the end of a line or a buffer overflow.
The second issue: it doesn't work.
Why? After the first buffer flush (at the first << endl), cout switches to a bad state (badbit set). That's because WriteFile normally returns the number of written bytes in *lpNumberOfBytesWritten, but for a UTF-8 console it returns the number of written characters (the problem is described here). The CRT detects that the number of bytes requested and the number written differ, and stops writing to the 'failed' stream.
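(A speculative mitigation, untested here: since the bad state stems from the CRT misinterpreting the return value rather than from a real write failure, one could try clearing the error flags after each flush and check whether subsequent writes still go through:)

#include <iostream>
using namespace std;

int main() {
    static char cout_buff[128];
    cout.rdbuf()->pubsetbuf(cout_buff, sizeof(cout_buff));
    static const char* utf8_string = u8"♣☻▼►♀♂☼";
    cout << utf8_string << endl; // works
    if (!cout)
        cout.clear(); // speculative: reset the badbit caused by the byte-vs-character mismatch
    cout << utf8_string << endl; // check whether this now prints instead of doing nothing
    return 0;
}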
What more can we do?
Well, I suppose we could implement our own std::basic_streambuf to write to the console the correct way, but it's not easy and I have no time for it. If anyone wants to, I'll be glad (see the sketch below).
Other options are (a) using std::wcout and strings of wchar_t characters, or (b) using WriteFile/WriteConsole directly. Sometimes those solutions can be acceptable.
Working with UTF-8 console in Microsoft versions of C++ is really horrible.
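Following up on the basic_streambuf idea: below is a minimal, untested sketch; the class name and the buffer-everything-and-flush-once policy are mine, not from the answer. It hands the whole payload to a single WriteFile call on flush, so UTF-8 sequences are never split across calls:

#include <windows.h>
#include <iostream>
#include <streambuf>
#include <string>

class Utf8ConsoleBuf : public std::streambuf {
    std::string buffer_;
protected:
    int_type overflow(int_type ch) override {
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            buffer_.push_back(traits_type::to_char_type(ch));
        return traits_type::not_eof(ch);
    }
    std::streamsize xsputn(const char* s, std::streamsize n) override {
        buffer_.append(s, static_cast<std::size_t>(n));
        return n;
    }
    int sync() override { // called by flush / endl
        if (!buffer_.empty()) {
            DWORD written = 0;
            WriteFile(GetStdHandle(STD_OUTPUT_HANDLE),
                      buffer_.data(), static_cast<DWORD>(buffer_.size()),
                      &written, NULL);
            buffer_.clear();
        }
        return 0;
    }
};

int main() {
    SetConsoleOutputCP(CP_UTF8);
    Utf8ConsoleBuf buf;
    std::streambuf* old = std::cout.rdbuf(&buf); // redirect cout through our buffer
    std::cout << u8"♣☻▼►♀♂☼" << std::endl;       // endl flushes: exactly one WriteFile
    std::cout.rdbuf(old);                        // restore before cout is destroyed
    return 0;
}

The trade-off from above remains: nothing reaches the console until the stream is flushed.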

Read a string with ncurses in C++

I'm writing a text-based game in C++. At some point, I ask the user to input user names corresponding to the different players playing.
I'm currently reading a single char with ncurses like so:
move(y,x);
printw("Enter a char");
int ch = getch();
However, I'm not sure how to read a string. I'm looking for something like:
move(y,x);
printw("Enter a name: ");
std::string name = getstring();
I've seen many different guides for using ncurses, each using a set of functions the others don't. As far as I can tell, the line between deprecated and non-deprecated functions is not very well defined.
How about this?
std::string getstring()
{
    std::string input;

    // let the terminal do the line editing
    nocbreak();
    echo();

    // this reads from buffer after <ENTER>, not "raw"
    // so any backspacing etc. has already been taken care of
    int ch = getch();
    while (ch != '\n')
    {
        input.push_back(ch);
        ch = getch();
    }

    // restore your cbreak / echo settings here

    return input;
}
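(A hypothetical complete usage, not from the original answer; it assumes the usual cbreak/noecho setup and restores it after the call, as the comment in getstring() suggests:)

#include <ncurses.h>
#include <string>

std::string getstring(); // as defined above

int main()
{
    initscr();
    cbreak();
    noecho();

    move(1, 0);
    printw("Enter a name: ");
    refresh();

    std::string name = getstring(); // temporarily re-enables echo / line editing
    cbreak();                       // restore the settings getstring() changed
    noecho();

    printw("Hello, %s\n", name.c_str());
    refresh();
    getch(); // wait for a key press before closing
    endwin();
    return 0;
}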
I would discourage using the alternative *scanw() function family. You would be juggling a temporary char[] buffer and the underlying *scanf() functionality with all its problems; moreover, the specification of *scanw() states that it returns ERR or OK instead of the number of items scanned, further reducing its usefulness.
While getstr() (suggested by user indiv) looks better than *scanw() and does special handling of function keys, it would still require a temporary char[], and I try to avoid those in C++ code, if for nothing else than avoiding an arbitrary buffer size.

C++ - string.compare issues when output to a text file differs from console output?

I'm trying to find out if two strings I have are the same, for the purpose of unit testing. The first is a predefined string, hard-coded into the program. The second is read in from a text file with an ifstream using std::getline(), and then taken as a substring. Both values are stored as C++ strings.
When I output both of the strings to the console using cout for testing, they both appear to be identical:
ThisIsATestStringOutputtedToAFile
ThisIsATestStringOutputtedToAFile
However, string.compare states that they are not equal. When outputting to a text file, the two strings appear as follows:
ThisIsATestStringOutputtedToAFile
T^@h^@i^@s^@I^@s^@A^@T^@e^@s^@t^@S^@t^@r^@i^@n^@g^@O^@u^@t^@p^@u^@t^@
t^@e^@d^@T^@o^@A^@F^@i^@l^@e
I'm guessing this is some kind of encoding problem, and if I were in my native language (good old C#), I wouldn't have too many problems. As it is, I'm working with C/C++ and Vi, and frankly don't really know where to go from here! I've tried looking at converting to/from ANSI/Unicode, and also at removing the odd characters, but I'm not even sure whether they really exist or not.
Thanks in advance for any suggestions.
EDIT
Apologies, this is my first time posting here. The code below is how I'm going through the process:
ifstream myInput;
ofstream myOutput;
myInput.open(fileLocation.c_str());
myOutput.open("test.txt");
TEST_ASSERT(myInput.is_open() == 1);
string compare1 = "ThisIsATestStringOutputtedToAFile";
string fileBuffer;
std::getline(myInput, fileBuffer);
string compare2 = fileBuffer.substr(400,100);
cout << compare1 + "\n";
cout << compare2 + "\n";
myOutput << compare1 + "\n";
myOutput << compare2 + "\n";
cin.get();
myInput.close();
myOutput.close();
TEST_ASSERT(compare1.compare(compare2) == 0);
How did you create the content of myInput? I would guess that this file was created in a two-byte encoding. You can use a hex dump to verify this theory, or use a different editor to create the file.
The simplest way would be to launch cmd.exe and type
echo "ThisIsATestStringOutputtedToAFile" > test.txt
UPDATE:
If you cannot change the encoding of the myInput file, you can try to use wide chars in your program, i.e. use wstring instead of string, wifstream instead of ifstream, wofstream, wcout, etc.
The following works for me and writes the text pasted below into the file. Note the '\0' character embedded into the string.
#include <iostream>
#include <fstream>
#include <sstream>
int main()
{
    std::istringstream myInput("0123456789ThisIsATestStringOutputtedToAFile\x0 12ou 9 21 3r8f8 reohb jfbhv jshdbv coerbgf vibdfjchbv jdfhbv jdfhbvg jhbdfejh vbfjdsb vjdfvb jfvfdhjs jfhbsd jkefhsv gjhvbdfsjh jdsfhb vjhdfbs vjhdsfg kbhjsadlj bckslASB VBAK VKLFB VLHBFDSL VHBDFSLHVGFDJSHBVG LFS1BDV LH1BJDFLV HBDSH VBLDFSHB VGLDFKHB KAPBLKFBSV LFHBV YBlkjb dflkvb sfvbsljbv sldb fvlfs1hbd vljkh1ykcvb skdfbv nkldsbf vsgdb lkjhbsgd lkdcfb vlkbsdc xlkvbxkclbklxcbv");
    std::ofstream myOutput("test.txt");
    //std::ostringstream myOutput;
    std::string str1 = "ThisIsATestStringOutputtedToAFile";
    std::string fileBuffer;
    std::getline(myInput, fileBuffer);
    std::string str2 = fileBuffer.substr(10,100);
    std::cout << str1 + "\n";
    std::cout << str2 + "\n";
    myOutput << str1 + "\n";
    myOutput << str2 + "\n";
    std::cout << str1.compare(str2) << '\n';
    //std::cout << myOutput.str() << '\n';
    return 0;
}
Output:
ThisIsATestStringOutputtedToAFile
ThisIsATestStringOutputtedToAFile
It turns out that the problem was that the file encoding of myInput was UTF-16, whereas the comparison string was UTF-8. The way to convert them, within the OS limitations I had for this project (Linux, C/C++ code), was to use the iconv() functions. To keep the compatibility of the C++ strings I'd been using, I ended up saving the string to a new text file, then running iconv through the system() command.
system("iconv -f UTF-16 -t UTF-8 subStr.txt -o convertedSubStr.txt");
Reading the outputted string back in then gave me the string in the format I needed for the comparison to work properly.
NOTE
I'm aware that this is not the most efficient way to do this. If I'd had the luxury of a Windows environment and the windows.h libraries, things would have been a lot easier. In this case, though, the code was in some rarely used unit tests, and as such didn't need to be highly optimized, hence the creation, destruction and I/O operations on some text files weren't an issue.
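(For reference, the same conversion can be done in memory with the iconv() C API, avoiding the temporary files. A minimal sketch, not the code used above; it assumes the input buffer carries a BOM or is big-endian, since plain "UTF-16" is passed to iconv_open:)

#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert a UTF-16 encoded byte buffer to a UTF-8 std::string.
std::string utf16ToUtf8(std::string in)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16"); // (tocode, fromcode)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    char* inPtr = &in[0];
    size_t inLeft = in.size();
    // A UTF-16 unit (2 bytes) expands to at most 3 UTF-8 bytes,
    // so twice the input size is a safe upper bound.
    std::string out(in.size() * 2, '\0');
    char* outPtr = &out[0];
    size_t outLeft = out.size();

    size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - outLeft); // trim unused buffer space
    return out;
}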