I'm trying to find out if two strings I have are the same, for the purpose of unit testing. The first is a predefined string, hard-coded into the program. The second is read in from a text file with an ifstream using std::getline(), and then taken as a substring. Both values are stored as C++ strings.
When I output both of the strings to the console using cout for testing, they both appear to be identical:
However, the string.compare returns stating they are not equal. When outputting to a text file, the two strings appear as follows:
I'm guessing this is some kind of encoding problem, and if I were in my native language (good old C#), I wouldn't have too many problems. As it is, I'm with C/C++ and Vi, and frankly don't really know where to go from here! I've tried looking at converting to/from ANSI/Unicode, and also at removing the odd characters, but I'm not even sure whether they really exist.
Thanks in advance for any suggestions.
Apologies, this is my first time posting here. The code below is how I'm going through the process:
ifstream myInput;
ofstream myOutput;
TEST_ASSERT(myInput.is_open() == 1);
string compare1 = "ThisIsATestStringOutputtedToAFile";
string fileBuffer;
std::getline(myInput, fileBuffer);
string compare2 = fileBuffer.substr(400,100);
cout << compare1 + "\n";
cout << compare2 + "\n";
myOutput << compare1 + "\n";
myOutput << compare2 + "\n";
TEST_ASSERT(compare1.compare(compare2) == 0);

How did you create the content of myInput? I would guess that this file is created in two-byte encoding. You can use hex-dump to verify this theory, or use a different editor to create this file.
The simplest way would be to launch cmd.exe and type
echo "ThisIsATestStringOutputtedToAFile" > test.txt
If you cannot change the encoding of the myInput file, you can try to use wide-chars in your program. I.e. use wstring instead of string, wifstream instead of ifstream, wofstream, wcout, etc.
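The hex-dump check suggested above can also be done in code by looking at the first bytes of the file for a byte-order mark (BOM). The following is only a minimal sketch: detect_bom is a hypothetical helper name, and it assumes the file actually starts with a BOM, which UTF-16 files usually do but UTF-8 files often do not. A UTF-32LE BOM also begins with FF FE, so this sketch would report such a file as UTF-16LE.

```cpp
#include <string>

// Hypothetical helper: classifies a buffer by its leading byte-order mark.
// Returns "UTF-8", "UTF-16LE", "UTF-16BE", or "unknown" when no BOM is found.
std::string detect_bom(const std::string& bytes) {
    if (bytes.size() >= 3 && bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) return "UTF-8";
    if (bytes.size() >= 2 && bytes.compare(0, 2, "\xFF\xFE") == 0) return "UTF-16LE";
    if (bytes.size() >= 2 && bytes.compare(0, 2, "\xFE\xFF") == 0) return "UTF-16BE";
    return "unknown";
}
```

Passing the line returned by std::getline() to this function before calling substr() would show whether the file is two-byte encoded.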

The following works for me and writes the text pasted below into the file. Note the '\0' character embedded into the string.
#include <iostream>
#include <fstream>
#include <sstream>
int main()
{
    std::istringstream myInput("0123456789ThisIsATestStringOutputtedToAFile\x0 12ou 9 21 3r8f8 reohb jfbhv jshdbv coerbgf vibdfjchbv jdfhbv jdfhbvg jhbdfejh vbfjdsb vjdfvb jfvfdhjs jfhbsd jkefhsv gjhvbdfsjh jdsfhb vjhdfbs vjhdsfg kbhjsadlj bckslASB VBAK VKLFB VLHBFDSL VHBDFSLHVGFDJSHBVG LFS1BDV LH1BJDFLV HBDSH VBLDFSHB VGLDFKHB KAPBLKFBSV LFHBV YBlkjb dflkvb sfvbsljbv sldb fvlfs1hbd vljkh1ykcvb skdfbv nkldsbf vsgdb lkjhbsgd lkdcfb vlkbsdc xlkvbxkclbklxcbv");
    std::ofstream myOutput("test.txt");
    //std::ostringstream myOutput;
    std::string str1 = "ThisIsATestStringOutputtedToAFile";
    std::string fileBuffer;
    std::getline(myInput, fileBuffer);
    std::string str2 = fileBuffer.substr(10,100);
    std::cout << str1 + "\n";
    std::cout << str2 + "\n";
    myOutput << str1 + "\n";
    myOutput << str2 + "\n";
    std::cout << str1.compare(str2) << '\n';
    //std::cout << myOutput.str() << '\n';
    return 0;
}

It turns out that the problem was that the file encoding of myInput was UTF-16, whereas the comparison string was UTF-8. The way to convert them with the OS limitations I had for this project (Linux, C/C++ code), was to use the iconv() functions. To keep the compatibility of the C++ strings I'd been using, I ended up saving the string to a new text file, then running iconv through the system() command.
system("iconv -f UTF-16 -t UTF-8 subStr.txt -o convertedSubStr.txt");
Reading the outputted string back in then gave me the string in the format I needed for the comparison to work properly.
I'm aware that this is not the most efficient way to do this. If I'd had the luxury of a Windows environment and the windows.h libraries, things would have been a lot easier. In this case, though, the code was in some rarely used unit tests and didn't need to be highly optimized, so creating and destroying a few text files and the extra I/O wasn't an issue.
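For readers who want to avoid the temporary files, the same conversion can be sketched in-process. This is not what I used above; it is a hand-written alternative under stated assumptions: the input is little-endian UTF-16 (an optional FF FE BOM is skipped), and utf16le_to_utf8 and append_utf8 are hypothetical helper names. Unpaired surrogates are replaced with U+FFFD and a trailing odd byte is ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out` as UTF-8.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts little-endian UTF-16 bytes to UTF-8.
std::string utf16le_to_utf8(const std::string& in) {
    // Read the 16-bit code unit stored at byte offset `pos`.
    auto unit = [&](std::size_t pos) -> std::uint32_t {
        return static_cast<unsigned char>(in[pos]) |
               (static_cast<unsigned char>(in[pos + 1]) << 8);
    };
    std::string out;
    std::size_t i = 0;
    if (in.size() >= 2 && unit(0) == 0xFEFF) i = 2;  // skip the BOM
    while (i + 1 < in.size()) {
        std::uint32_t cp = unit(i);
        i += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF) {           // high surrogate
            if (i + 1 < in.size() && unit(i) >= 0xDC00 && unit(i) <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (unit(i) - 0xDC00);
                i += 2;
            } else {
                cp = 0xFFFD;                          // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;                              // unpaired low surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The converted result can then be compared with the hard-coded string using compare() as before. For anything beyond a unit test, the iconv library would probably be the safer choice.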


How to apply <cctype> functions on text files with different encoding in c++

I would like to Split some files (around 1000) into words and remove numbers and punctuation. I will then process these tokenized words accordingly... However, the files are mostly in German language and are encoded in different types:
ISO Latin-1
The problem that I am facing is that I cannot find a correct way to apply character conversion functions such as tolower(), and I also get some weird icons in the terminal when I use std::cout on Ubuntu Linux.
For example, in non-UTF-8 files, the word französische is shown as franz�sische, für as f�r, etc. Also, words like Örebro or Österreich are ignored by tolower(). From what I know, the "Unicode replacement character" � (U+FFFD) is inserted for any character that the program cannot decode correctly when trying to handle Unicode.
When I open UTF-8 files I don't get any weird characters, but I still cannot convert upper-case special characters such as Ö to lower case... I used std::setlocale(LC_ALL, "de_DE.iso88591"); and some other options that I found on Stack Overflow, but I still don't get the desired output.
My guess on how I should solve this is:
Check encoding of file that is about to be opened
open file according to its specific encoding
Convert file input to UTF-8
Process file and apply tolower() etc
Is the above algorithm feasible, or will the complexity skyrocket?
What is the correct approach for this problem? How can I open the files with some sort of encoding options?
1. Should my OS have the corresponding locale enabled globally in order to process text (regardless of how the console displays it)? (On Linux, for example, de_DE is not listed when I run locale -a.)
2. Is this problem only visible because of the terminal's default encoding? Do I need to take any further steps before I process the extracted string normally in C++?
My linux locale:
Here is some sample code that I wrote; it doesn't work the way I want at the moment.
void processFiles() {
    std::string filename = "17454-8.txt";
    std::ifstream inFile;
    inFile.open(filename);
    if (!inFile) {
        std::cerr << "Failed to open file" << std::endl;
        return;
    }
    //calculate file size (filesize() is a helper defined elsewhere)
    std::string s = "";
    s.reserve(filesize(filename) + std::ifstream::pos_type(1));
    std::string line;
    while( (inFile.good()) && std::getline(inFile, line) ) {
        s.append(line + "\n");
    }
    std::cout << s << std::endl;
    //remove punctuation, numbers, tolower,
    //TODO encoding detection and specific transformation (cannot catch Ö, Ä etc) will add too much complexity...
    std::setlocale(LC_ALL, "de_DE.iso88591");
    for (unsigned int i = 0; i < s.length(); ++i) {
        // the <cctype> functions need an unsigned char value; a negative char is undefined behaviour
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (std::ispunct(c) || std::isdigit(c))
            s[i] = ' ';
        if (std::isupper(c))
            s[i] = static_cast<char>(std::tolower(c));
    }
    //std::cout << s << std::endl;
    //tokenize string
    std::istringstream iss(s);
    std::vector<std::string> tokens;
    tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
    for (auto & i : tokens)
        std::cout << i << std::endl;
}
Unicode defines "code points" for characters. A code point is a number in the range 0 to 0x10FFFF, so it fits in a 32-bit value, and there are several encodings for storing it. ASCII uses only 7 bits, which gives 128 different characters. The 8th bit was used by Microsoft (and other vendors) to define another 128 characters, depending on the locale; these sets are called "code pages". Single-byte standards such as ISO-8859-1 ("Latin-1") work the same way. Nowadays MS uses UTF-16, with 2-byte code units. Because 16 bits are not enough for the whole Unicode set, UTF-16 encodes characters above U+FFFF as a pair of code units (a surrogate pair); unlike the code pages, it does not depend on the locale.
Most used in Linux (typically for files) is UTF-8, which uses a variable number of bytes for each character. The first 128 characters are exactly the same as the ASCII characters, with just one byte per character. UTF-8 can use up to 4 bytes to represent a character. More info on Wikipedia.
While MS uses UTF-16 both in files and in memory (wchar_t is 16 bits there), on Linux wchar_t is 32 bits wide and typically holds UTF-32.
In order to read a file you need to know its encoding. Trying to detect it is a real nightmare which may not succeed. The use of std::basic_ios::imbue allows you to set the desired locale for your stream, like in this SO answer
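Full detection is indeed hard, but for the narrower case of telling UTF-8 apart from a single-byte encoding such as Latin-1, a well-formedness check is a common heuristic: Latin-1 text with umlauts is almost never valid UTF-8. The following is only a sketch under that assumption, and is_valid_utf8 is a hypothetical helper name. It cannot tell Latin-1 from other single-byte code pages, and pure ASCII passes as valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const unsigned long min_cp[] = {0x0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;   // number of continuation bytes expected
        unsigned long cp;    // code point being decoded
        if (c < 0x80)                { extra = 0; cp = c; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else return false;           // stray continuation byte or invalid lead byte
        if (i + extra >= s.size()) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}
```

A file whose content fails this check could then be treated as Latin-1 and converted, while a file that passes can be processed as UTF-8 directly.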
tolower and such functions can work with a locale, e.g.
#include <iostream>
#include <locale>
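int main() {
    wchar_t s = L'\u00D6'; //Latin capital letter O with diaeresis, decimal 214
    wchar_t sL = std::tolower(s, std::locale("en_US.UTF-8")); //hex= 00F6, dec= 246
    //cast to int: streaming a wchar_t into a narrow stream is not allowed since C++20
    std::cout << "s = " << static_cast<int>(s) << std::endl;
    std::cout << "sL= " << static_cast<int>(sL) << std::endl;
    return 0;
}
Output:
s = 214
sL= 246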
In this other SO answer you can find good solutions, such as using the iconv library (on Linux) or its Win32 port.
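For the specific case of ISO-8859-1 input, both steps from the question (lowercasing and conversion to UTF-8) can be sketched without any locale installed, because Latin-1 bytes are identical to the first 256 Unicode code points and its upper-case letters À..Þ (except ×, 0xD7) sit exactly 0x20 below their lower-case forms. This is only an illustration under that assumption; latin1_tolower and latin1_to_lower_utf8 are hypothetical helper names, and other encodings would need iconv or a similar library.

```cpp
#include <string>

// Lowercase one ISO-8859-1 byte without relying on an installed locale.
unsigned char latin1_tolower(unsigned char c) {
    if (c >= 'A' && c <= 'Z') return static_cast<unsigned char>(c + 0x20);
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)      // À..Þ, skipping the × sign
        return static_cast<unsigned char>(c + 0x20);
    return c;
}

// Hypothetical helper: lowercases ISO-8859-1 text and re-encodes it as UTF-8.
std::string latin1_to_lower_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char ch : in) {
        const unsigned char c = latin1_tolower(static_cast<unsigned char>(ch));
        if (c < 0x80) {
            out += static_cast<char>(c);          // ASCII is unchanged in UTF-8
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // two-byte sequence
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

With this, the Latin-1 bytes of "Österreich" come out as the UTF-8 bytes of "österreich", which a UTF-8 terminal displays correctly.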
In Linux the terminal can be set to use a locale with the help of LC_ALL, LANG and LANGUAGE, e.g.:

Is it possible to print UTF-8 string with Boost and STL in windows console?

I'm trying to output UTF-8 encoded string with cout with no success. I'd like to use Boost.Locale in my program. I've found some info regarding windows console specific. For example, this article http://www.boost.org/doc/libs/1_60_0/libs/locale/doc/html/running_examples_under_windows.html says that I should set output console code page to 65001 and save all my sources in UTF-8 encoding with BOM. So, here is my simple example:
#include <windows.h>
#include <cstdio>
#include <iostream>
#include <boost/locale.hpp>
using namespace std;
using namespace boost::locale;
int wmain(int argc, const wchar_t* argv[])
{
    SetConsoleOutputCP(CP_UTF8);
    //system("chcp 65001 > nul"); // It's the same as SetConsoleOutputCP(CP_UTF8)
    static const char* utf8_string = u8"♣☻▼►♀♂☼";
    cout << "cout: " << utf8_string << endl;
    printf("printf: %s\n", utf8_string);
    return 0;
}
I compile it with Visual Studio 2015 and it produces the following output in console:
cout: ���������������������
printf: ♣☻▼►♀♂☼
Why does printf handle it well while cout doesn't? Can the locale generator of Boost help with it? Or should I use something else to print UTF-8 text to the console in stream mode (a cout-like approach)?
It looks like std::cout is much too clever here: it tries to interpret your UTF-8 encoded string as an ASCII one, finds 21 non-ASCII characters, and outputs each of them as the unmapped character �. AFAIK the Windows C++ console driver insists on each character from a narrow char string being mapped to a position on screen and does not support multi-byte character sets.
Here is what happens under the hood:
utf8_string is the following char array (just look at a Unicode table and do the utf8 conversion):
const char utf8_string[] = { '\xe2', '\x99', '\xa3', '\xe2', '\x98', '\xbb', '\xe2', '\x96',
    '\xbc', '\xe2', '\x96', '\xba', '\xe2', '\x99', '\x80', '\xe2', '\x99',
    '\x82', '\xe2', '\x98', '\xbc', '\0' };
that is, 21 characters, none of which is in the ASCII range 0-0x7f.
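To make the byte/character distinction concrete: those 21 bytes encode only 7 code points, each taking three bytes. A small sketch that counts code points by skipping UTF-8 continuation bytes (bytes of the form 10xxxxxx) is shown below; utf8_code_points is a hypothetical helper name, and the function assumes well-formed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 text by
// counting every byte that is not a continuation byte (10xxxxxx).
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}
```

For the array above, the string length is 21 while utf8_code_points reports 7, which is why a byte-by-byte interpretation produces 21 replacement characters.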
By contrast, printf just outputs the bytes without any conversion, giving the correct output.
I'm sorry but even after many searches I could not find an easy way to correctly display UTF8 output on a windows console using a narrow stream such as std::cout.
But you should notice that your code fails to imbue the Boost locale into cout.
The key problem is that the implementation of cout << "some string", after long and painful adventures, calls WriteFile for every character.
If you'd like to debug it, set a breakpoint inside the _write function in the write.c file of the CRT sources, write something to cout, and you'll see the whole story.
So we can rewrite your code
static const char* utf8_string = u8"♣☻▼►♀♂☼";
cout << utf8_string << endl;
with an equivalent (and faster!) one:
static const char* utf8_string = u8"♣☻▼►♀♂☼";
const size_t utf8_string_len = strlen(utf8_string);
DWORD written = 0;
for(size_t i = 0; i < utf8_string_len; ++i)
WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8_string + i, 1, &written, NULL);
output: ���������������������
Replace the loop with a single call to WriteFile and the UTF-8 console works brilliantly:
static const char* utf8_string = u8"♣☻▼►♀♂☼";
const size_t utf8_string_len = strlen(utf8_string);
DWORD written = 0;
WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), utf8_string, utf8_string_len, &written, NULL);
output: ♣☻▼►♀♂☼
I tested it on msvc.2013 and msvc.net (2003), both of them behave identically.
Obviously the Windows implementation of the console wants whole characters in each call to WriteFile/WriteConsole and cannot take UTF-8 characters one byte at a time. :)
What can we do here?
My first idea is to make output buffered, like in files. It's easy:
static char cout_buff[128];
cout.rdbuf()->pubsetbuf(cout_buff, sizeof(cout_buff));
cout << utf8_string << endl; // works
cout << utf8_string << endl; // does nothing
output: ♣☻▼►♀♂☼ (only once, I'll explain why below)
The first issue is that console output becomes delayed: it waits until the end of the line or a buffer overflow.
The second issue: it doesn't work.
Why? After the first buffer flush (at the first << endl) cout switches to a bad state (badbit set). That's because WriteFile normally returns the number of bytes written in *lpNumberOfBytesWritten, but for a UTF-8 console it returns the number of characters written (problem described here). The CRT detects that the number of bytes requested and the number written differ, and stops writing to the 'failed' stream.
What else can we do?
Well, I suppose that we can implement our own std::basic_streambuf to write to the console the correct way, but it's not easy and I have no time for it. If anyone wants to, I'll be glad.
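A rough idea of what such a stream buffer could look like is sketched below. It is portable and untested on a real Windows console: ChunkedBuf is a hypothetical class that collects all bytes and passes them to a user-supplied sink in one call per flush, so a multi-byte UTF-8 sequence is never split. On Windows the sink would presumably make a single WriteFile or WriteConsole call and ignore the misleading byte count; that part is an assumption and is left to the caller.

```cpp
#include <cstddef>
#include <functional>
#include <ostream>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical sketch: buffers output and hands it to `sink` in whole chunks.
class ChunkedBuf : public std::streambuf {
public:
    explicit ChunkedBuf(std::function<void(const std::string&)> sink)
        : sink_(std::move(sink)) {}
    ~ChunkedBuf() override { sync(); }  // deliver whatever is still pending

protected:
    // Called for single characters, e.g. the '\n' written by std::endl.
    int_type overflow(int_type ch) override {
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            pending_ += traits_type::to_char_type(ch);
        return traits_type::not_eof(ch);
    }
    // Called for whole strings written with operator<<.
    std::streamsize xsputn(const char* s, std::streamsize n) override {
        pending_.append(s, static_cast<std::size_t>(n));
        return n;
    }
    // Called on flush: pass everything collected so far to the sink at once.
    int sync() override {
        if (!pending_.empty()) {
            sink_(pending_);
            pending_.clear();
        }
        return 0;
    }

private:
    std::function<void(const std::string&)> sink_;
    std::string pending_;
};
```

One possible use is to construct a std::ostream on top of a ChunkedBuf, or to install it temporarily with cout.rdbuf(&buf) and restore the original buffer before the ChunkedBuf goes out of scope. Output is only delivered on flush (for example at std::endl), so the delay mentioned above still applies.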
Other options are (a) use std::wcout and strings of wchar_t characters, or (b) use WriteFile/WriteConsole directly. Sometimes those solutions are acceptable.
Working with a UTF-8 console in Microsoft versions of C++ is really horrible.

Creating the same text file over and over

I need to create a program that writes a text file in the current folder, the text file always contains the same information, for example:
This is an example of how the text file may look
some information over here
and here
and so on
So I was thinking in doing something like this:
#include <iostream>
#include <fstream>
using namespace std;
int main(){
    ofstream myfile("myfile.txt");
    myfile << "Hello," << endl;
    myfile << "This is an example of how the text file may look" << endl;
    myfile << "some information over here" << endl;
    myfile << "and here" << endl;
    myfile << "and so on";
    return 0;
}
This works if the number of lines in my text file is small. The problem is that my text file has over 2000 lines, and I'm not willing to write every line in the myfile << TEXT << endl; format.
Is there a more effective way to create this text file?
If your problem is writing to the same file repeatedly, you need to use append mode, i.e. your file must be opened like this:
ofstream myfile("ABC.txt", ios::app);
You may use a raw string literal in C++11:
#include <fstream>
const char* my_text =
R"(Hello,
This is an example of how the text file may look
some information over here
and here
and so on)";
int main()
{
    std::ofstream myfile("myfile.txt");
    myfile << my_text;
    return 0;
}
Alternatively, you may use a tool such as xxd -i to create the array for you.
If you don't care about the subtle differences between '\n' and std::endl, then you can create a static string with your text outside of your function, and then it's just:
myfile << str; // Maybe << std::endl; too
If your text is really big, you can write a small script to format it, e.g. replacing every newline with "\n", etc.
It sounds like you should really be using resource files. I won't copy and paste all of the information here, but there's a very good Q&A already on this website, over here: Embed Text File in a Resource in a native Windows Application
Alternatively, you could even stick the string in a header file then include that header file where it's needed:
(assuming no C++11; if you have it, you could simply use a raw string literal to make things a little easier, but an answer for that has already been posted, so there's no need to repeat it).
#pragma once
#include <string>
std::string fileData =
    "data line 1\r\n"
    "data line 2\r\n";
Use std::wstring and prepend the strings with L if you need more complex characters.
All you need to do is to write a little script (or even just use Notepad++ if it's a one off) to replace backslashes with double backslash, replace double quotation marks with backslash double quotation marks, and replace line breaks with \r\n"{line break}{tab}". Tidy up the beginning and end, and you're done. Then just write the string to a file.
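The escaping script described above can itself be sketched in C++. This is only an illustration: to_cpp_literal is a hypothetical helper name, it emits "\n" rather than "\r\n" at the end of each line, it assumes the input uses LF line endings, and it always appends a line break to the last line even if the input had none.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: turns raw text into the body of a C++ string literal,
// producing one quoted source line per input line.
std::string to_cpp_literal(const std::string& text) {
    std::istringstream in(text);
    std::string line, out;
    while (std::getline(in, line)) {
        out += "    \"";
        for (char c : line) {
            if (c == '\\' || c == '"') out += '\\';  // escape backslashes and quotes
            out += c;
        }
        out += "\\n\"\n";  // close the line with an escaped newline and a quote
    }
    return out;
}
```

The result can be pasted between `std::string fileData =` and a closing semicolon in the header.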

opening output filestreams with string names

Hi I have some C++ code that uses user defined input to generate file-names for some output files:
std::string outputName = fileName;
for(int i = 0; i < 4; i++)
std::string outputName1 = outputName;
std::string outputName2 = outputName;
Where fileName could be any word the user can define with .csv after it e.g. '~/Desktop/mytest.csv'
The code chomps .csv off and makes three filenames / paths for 3 output streams.
It then creates them and attempts to open them:
std::ofstream outputFile;
std::ofstream outputFile1;
std::ofstream outputFile2;
I made sure to pass the names to open as const char* with the c_str method; however, I tested my code by adding the following line:
std::cout << outputFile.is_open() << " " << outputFile1.is_open() << " " << outputFile2.is_open() << std::endl;
and compiling and setting fileName to "test.csv". The code compiles and runs successfully; however,
three zeros are printed to the screen, showing that the three output filestreams are not in fact open. Why are they not opening? I know passing strings as filenames does not work, which is why I thought conversion with c_str() would be sufficient.
Ben W.
Your issue is likely to be due to the path beginning with ~, which isn't expanded to /{home,Users}/${LOGNAME}.
ifstream open file C++
This answer to How to create a folder in the home directory? may be of use to you.
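If the tilde is the culprit, it can be expanded by hand before the name is passed to open(). The following is only a minimal sketch for POSIX-like systems: expand_tilde is a hypothetical helper name, it relies on the HOME environment variable being set, and it leaves "~user" forms untouched.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: replaces a leading "~" or "~/" with the value of $HOME.
// The path is returned unchanged for "~user" forms or when HOME is not set.
std::string expand_tilde(const std::string& path) {
    if (path.empty() || path[0] != '~') return path;
    if (path.size() > 1 && path[1] != '/') return path;
    const char* home = std::getenv("HOME");
    if (home == nullptr) return path;
    return std::string(home) + path.substr(1);
}
```

Calling expand_tilde(fileName) before building the three output names would turn '~/Desktop/mytest.csv' into a path under the user's home directory.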
Unfortunately, there is no standard, portable way of finding out exactly why open() failed:
Detecting reason for failure to open an ofstream when fail() is true
I know passing strings as filenames does not work which is why I thought conversion with c_str() would be sufficient.
std::basic_ofstream::open() does accept a const std::string & (since C++11)!

Reading a string from a file in C++

I'm trying to store strings directly into a file to be read later in C++ (basically, for the full scope, I'm trying to store an object array with string variables in a file, and those string variables will be read through something like object[0].string). However, every time I try to read the string variables, the system gives me a jumbled-up error. The following code is a basic part of what I'm trying.
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
//this is run first to create the file and store the string
int main(){
    string reed;
    reed = "sees";
    ofstream ofs("filrsee.txt", ios::out|ios::binary);
    ofs.write(reinterpret_cast<char*>(&reed), sizeof(reed));
}
//this is run after that to open the file and read the string
int main(){
    string ghhh;
    ifstream ifs("filrsee.txt", ios::in|ios::binary);
    ifs.read(reinterpret_cast<char*>(&ghhh), sizeof(ghhh));
    return 0;
}
The second part is where things go haywire when I try to read it.
Sorry if it's been asked before, I've taken a look around for similar questions but most of them are a bit different from what I'm trying to do or I don't really understand what they're trying to do (still quite new to this).
What am I doing wrong?
You are reading from a file and trying to put the data in the string structure itself, overwriting it, which is plain wrong.
As it can be verified at http://www.cplusplus.com/reference/iostream/istream/read/ , the types you used were wrong, and you know it because you had to force the std::string into a char * using a reinterpret_cast.
C++ Hint: using a reinterpret_cast in C++ is (almost) always a sign you did something wrong.
Why is it so complicated to read a file?
A long time ago, reading a file was easy. In some Basic-like language, you used the function LOAD, and voilà!, you had your file.
So why can't we do it now?
Because you don't know what's in a file.
It could be a string.
It could be a serialized array of structs with raw data dumped from memory.
It could even be a live stream, that is, a file which is appended continuously (a log file, the stdin, whatever).
You could want to read the data word by word
... or line by line...
Or the file is so large it doesn't fit in a string, so you want to read it by parts.
The more generic solution is to read the file (thus, in C++, a fstream) byte by byte using the function get (see http://www.cplusplus.com/reference/iostream/istream/get/), transform the bytes yourself into the type you expect, and stop at EOF.
The std::istream interface has all the functions you need to read the file in different ways (see http://www.cplusplus.com/reference/iostream/istream/), and even then, there is an additional non-member function for std::string that reads a file until a delimiter is found (usually "\n", but it could be anything, see http://www.cplusplus.com/reference/string/getline/)
But I want a "load" function for a std::string!!!
Ok, I get it.
We assume that what you put in the file is the content of a std::string, but keeping it compatible with a C-style string, that is, the \0 character marks the end of the string (if not, we would need to load the file until reaching the EOF).
And we assume you want the whole file content fully loaded once the function loadFile returns.
So, here's the loadFile function:
#include <iostream>
#include <fstream>
#include <string>
bool loadFile(const std::string & p_name, std::string & p_content)
{
    // We create the file object, saying I want to read it
    std::fstream file(p_name.c_str(), std::fstream::in) ;
    // We verify if the file was successfully opened
    if(file.is_open())
    {
        // We use the standard getline function to read the file into
        // a std::string, stopping only at "\0"
        std::getline(file, p_content, '\0') ;
        // We return the success of the operation
        return ! file.bad() ;
    }
    // The file was not successfully opened, so returning false
    return false ;
}
If you are using a C++11 enabled compiler, you can add this overloaded function, which will cost you nothing (while in C++03, barring optimizations, it could have cost you a temporary object):
std::string loadFile(const std::string & p_name)
{
    std::string content ;
    loadFile(p_name, content) ;
    return content ;
}
Now, for completeness' sake, I wrote the corresponding saveFile function:
bool saveFile(const std::string & p_name, const std::string & p_content)
{
    std::fstream file(p_name.c_str(), std::fstream::out) ;
    if(file.is_open())
    {
        file.write(p_content.c_str(), p_content.length()) ;
        return ! file.bad() ;
    }
    return false ;
}
And here, the "main" I used to test those functions:
int main()
{
    const std::string name(".//myFile.txt") ;
    const std::string content("AAA BBB CCC\nDDD EEE FFF\n\n") ;
    {
        const bool success = saveFile(name, content) ;
        std::cout << "saveFile(\"" << name << "\", \"" << content << "\")\n\n"
                  << "result is: " << success << "\n" ;
    }
    {
        std::string myContent ;
        const bool success = loadFile(name, myContent) ;
        std::cout << "loadFile(\"" << name << "\", \"" << content << "\")\n\n"
                  << "result is: " << success << "\n"
                  << "content is: [" << myContent << "]\n"
                  << "content ok is: " << (myContent == content)<< "\n" ;
    }
    return 0 ;
}
If you want to do more than that, then you will need to explore the C++ IOStreams library API, at http://www.cplusplus.com/reference/iostream/
You can't use std::istream::read() to read into a std::string object. What you could do is to determine the size of the file, create a string of suitable size, and read the data into the string's character array:
std::ifstream file("whatever");
std::string::size_type size = determine_size_of(file);
std::string str(size, '\0'); // the string must already have room for the data
file.read(&str[0], size);
The tricky bit is determining the size the string should have. Given that the character sequence may get translated while reading, e.g., because line end sequences are transformed, this pretty much amounts to reading the string in the general case. Thus, I would recommend against doing it this way. Instead, I would read the string using something like this:
std::string str;
std::ifstream file("whatever");
if (std::getline(file, str, '\0')) {
    // str now holds the file content up to the first '\0' (or the end of the file)
}
This works OK for text strings and is about as fast as it gets on most systems. If the file can contain null characters, e.g., because it contains binary data, this doesn't quite work. If this is the case, I'd use an intermediate std::ostringstream:
std::ostringstream out;
std::ifstream file("whatever");
out << file.rdbuf();
std::string str = out.str();
A string object is not a mere char array; the line
ifs.read(reinterpret_cast<char*>(&ghhh), sizeof(ghhh));
is probably the root of your problems.
Try applying the following changes:
char ghhh[BUFF_LEN];
ifs.read(ghhh, BUFF_LEN);