I am trying to read and process multiple files that use different encodings. I am supposed to use only the STL for this.
Suppose that we have iso-8859-15 and UTF-8 files.
In this SO answer it states:
In a nutshell the more interesting part for you:
std::stream (stringstream, fstream, cin, cout) has an inner
locale-object, which matches the value of the global C++ locale at
the moment of the creation of the stream object. As std::cin is
created long before your code in main is called, it has most
probably the classical C locale, no matter what you do afterwards.
You can make sure, that a std::stream object has the desirable
locale by invoking
std::stream::imbue(std::locale(your_favorite_locale)).
The problem is that, of the two types, only the files that match the locale that was created first are processed correctly. For example, if locale_DE_ISO885915 precedes locale_DE_UTF8, then files that are in UTF-8 are not appended correctly to the string s, and when I print them with cout I only see a couple of lines from the file.
void processFiles() {
//setup locales for file decoding
std::locale locale_DE_ISO885915("de_DE.iso885915@euro");
std::locale locale_DE_UTF8("de_DE.UTF-8");
//std::locale::global(locale_DE_ISO885915);
//std::cout.imbue(std::locale());
const std::ctype<wchar_t>& facet_DE_ISO885915 = std::use_facet<std::ctype<wchar_t>>(locale_DE_ISO885915);
//std::locale::global(locale_DE_UTF8);
//std::cout.imbue(std::locale());
const std::ctype<wchar_t>& facet_DE_UTF8 = std::use_facet<std::ctype<wchar_t>>(locale_DE_UTF8);
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::string currFile, fileStr;
std::wifstream inFile;
std::wstring s;
for (std::vector<std::string>::const_iterator fci = files.begin(); fci != files.end(); ++fci) {
currFile = *fci;
//check file and set locale
if (currFile.find("-8.txt") != std::string::npos) {
std::locale::global(locale_DE_ISO885915);
std::cout.imbue(locale_DE_ISO885915);
}
else {
std::locale::global(locale_DE_UTF8);
std::cout.imbue(locale_DE_UTF8);
}
inFile.open(path + currFile, std::ios_base::binary);
if (!inFile) {
//TODO specific file report
std::cerr << "Failed to open file " << *fci << std::endl;
exit(1);
}
s.clear();
//read file content
std::wstring line;
while( (inFile.good()) && std::getline(inFile, line) ) {
s.append(line + L"\n");
}
inFile.close();
//remove punctuation, numbers, tolower...
for (unsigned int i = 0; i < s.length(); ++i) {
if (ispunct(s[i]) || isdigit(s[i]))
s[i] = L' ';
}
if (currFile.find("-8.txt") != std::string::npos) {
facet_DE_ISO885915.tolower(&s[0], &s[0] + s.size());
}
else {
facet_DE_UTF8.tolower(&s[0], &s[0] + s.size());
}
fileStr = converter.to_bytes(s);
std::cout << fileStr << std::endl;
std::cout << currFile << std::endl;
std::cout << fileStr.size() << std::endl;
std::cout << std::setlocale(LC_ALL, NULL) << std::endl;
std::cout << "========================================================================================" << std::endl;
// Process...
}
return;
}
As you can see in the code, I have tried both the global locale and local locale objects, but to no avail.
In addition, in How can I use std::imbue to set the locale for std::wcout? SO answer it states:
So it really looks like there was an underlying C library mechanism
that should be first enabled with setlocale to allow imbue conversion
to work correctly.
Is this "obscure" mechanism the problem here?
Is it possible to alternate between the two locales while processing the files? What should I imbue (cout, ifstream, getline ?) and how?
Any suggestions?
PS: Why is everything related with locale so chaotic? :|
This works for me as expected on my Linux machine, but not on my Windows machine under Cygwin (the set of available locales is apparently the same on both machines, but std::locale::locale just fails with every imaginable locale string).
#include <iostream>
#include <fstream>
#include <locale>
#include <string>
void printFile(const char* name, const char* loc)
{
try {
std::wifstream inFile;
inFile.imbue(std::locale(loc));
inFile.open(name);
std::wstring line;
while (getline(inFile, line))
std::wcout << line << '\n';
} catch (std::exception& e) {
std::cerr << e.what() << std::endl;
}
}
int main()
{
std::locale::global(std::locale("en_US.utf8"));
printFile ("gtext-u8.txt", "de_DE.utf8"); // utf-8 text: grüßen
printFile ("gtext-legacy.txt", "de_DE@euro"); // iso8859-15 text: grüßen
}
Output:
grüßen
grüßen
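Where named locales are not available (as on the Cygwin setup mentioned above), one locale-independent fallback is to read the ISO-8859-15 file as raw bytes and transcode it to UTF-8 by hand. The following is only a sketch under that assumption and is not part of the tested code above; latin9ToUtf8 and appendUtf8 are hypothetical names introduced for illustration.

```cpp
#include <string>

// Appends the UTF-8 encoding of a code point below U+10000 to `out`.
static void appendUtf8(std::string& out, unsigned int cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts ISO-8859-15 bytes to a UTF-8 string.
std::string latin9ToUtf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned int cp = static_cast<unsigned char>(in[i]);
        // The eight positions where ISO-8859-15 differs from ISO-8859-1.
        switch (cp) {
            case 0xA4: cp = 0x20AC; break;  // euro sign
            case 0xA6: cp = 0x0160; break;  // S with caron
            case 0xA8: cp = 0x0161; break;  // s with caron
            case 0xB4: cp = 0x017D; break;  // Z with caron
            case 0xB8: cp = 0x017E; break;  // z with caron
            case 0xBC: cp = 0x0152; break;  // OE ligature
            case 0xBD: cp = 0x0153; break;  // oe ligature
            case 0xBE: cp = 0x0178; break;  // Y with diaeresis
            default: break;                 // all other bytes match Latin-1
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

The file would be read with a plain std::ifstream opened in binary mode, and each line passed through latin9ToUtf8 before further processing, so that both kinds of files end up as UTF-8 in memory.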
I'm writing a program that takes the lines of text to work with from a file whose name the user passes as an argument, e.g. program <name of the file>. If no name is provided, the input is read from std::cin instead. What I've tried:
Redirecting the buffer (for some reason this causes a segfault)
if (argc == 2) {
std::ifstream ifs(argv[1]);
if (!ifs)
std::cerr << "couldn't open " << argv[1] << " for reading" << '\n';
std::cin.rdbuf(ifs.rdbuf());
}
for (;;) {
std::string line;
if (!std::getline(std::cin, line)) // Here the segfault happens
break;
Creating a variable, in which the input source is stored
std::ifstream ifs;
if (argc == 2) {
ifs.open(argv[1]);
if (!ifs)
std::cerr << "couldn't open " << argv[1] << " for reading" << '\n';
} else
ifs = std::cin; // Doesn't work because of the different types
for (;;) {
std::string line;
if (!std::getline(ifs, line))
break;
Now I'm thinking of doing something with file structures/descriptors. What to do?
UPD: I would like to have the possibility to update the input source in the main loop of the program (see below).
The seg fault in your first example is due to a dangling pointer; right after you call std::cin.rdbuf(ifs.rdbuf()), ifs is destroyed.
You should do what @NathanOliver suggests and write a function which takes an istream&:
#include <iostream>
#include <fstream>
#include <string>
void foo(std::istream& stream) {
std::string line;
while (std::getline(stream, line)) {
// do work
}
}
int main(int argc, char* argv[]) {
if (argc == 2) {
std::ifstream file(argv[1]);
foo(file);
} else {
foo(std::cin);
}
}
I opened an mp3 file by mistake with Notepad++ (Open with), and it showed the entire file as text inside the editor, which was pretty cool.
Since I am learning C++ again, I decided to write a program that opens any file and displays its contents on the console, so I began with this:
int readAndWrite() {
string filename(R"(path\to\a\file)");
ifstream file(filename);
string line;
if (!file.is_open()) {
cerr << "Could not open the file - '"
<< filename << "'" << endl;
return EXIT_FAILURE;
}
while (getline(file, line)){
cout << line;
}
return EXIT_SUCCESS;
}
But it only shows 3 or 4 lines of the file and then exits. I checked Notepad++ again and found that there are about 700,000 lines in the file.
I thought maybe there is a special character inside the file, so I changed the code above to write to a text file instead of displaying the content:
int readAndWrite() {
string filename(R"(path\to\a\file)");
string filename2(R"(path\to\a\file\copy)");
ifstream file(filename);
ofstream copy(filename2);
string line;
if (!file.is_open()) {
cerr << "Could not open the file - '"
<< filename << "'" << endl;
return EXIT_FAILURE;
}
while (getline(file, line)){
copy << line;
}
return EXIT_SUCCESS;
}
And again the same result. Next I gave up on reading the file line by line and tried copying with this function:
void copyStringNewFile(ifstream& file, ofstream& copy)
{
copy << file.rdbuf();
}
The result did not change a bit.
At this point, I suspect the problem comes from the file itself, and it kind of does, because with a simple text file all of the code above works.
Like all other non-text files, mp3 files don't contain lines, so you shouldn't use std::getline. Use istream::read and ostream::write. You can use istream::gcount to check how many characters were actually read.
Since you are dealing with non-text files, also open the files in binary mode.
You should also test if opening both files works - that is, both the input and the output file.
Example:
#include <cerrno>
#include <cstdlib>   // EXIT_FAILURE, EXIT_SUCCESS
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>
int readAndWrite() {
std::string filename(R"(path\to\a\file)");
std::string filename2(R"(path\to\a\file_copy)");
std::ifstream file(filename, std::ios::binary);
if(!file) {
std::cerr << '\'' << filename << "': " << std::strerror(errno) << '\n';
return EXIT_FAILURE;
}
std::ofstream copy(filename2, std::ios::binary);
if(!copy) {
std::cerr << '\'' << filename2 << "': " << std::strerror(errno) << '\n';
return EXIT_FAILURE;
}
char buf[1024];
while(file) {
file.read(buf, sizeof(buf));
// write as many characters as was read above
if(!copy.write(buf, file.gcount())) {
// write failed, perhaps filesystem is full?
std::cerr << '\'' << filename2 << "': " << std::strerror(errno) << '\n';
return EXIT_FAILURE;
}
}
return EXIT_SUCCESS;
}
int main() {
return readAndWrite();
}
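The rdbuf() one-liner from the question can also work once both streams are opened in binary mode. On Windows, text mode translates line endings and can treat a 0x1A byte as end of file, which would likely explain why only a few "lines" of the mp3 came through. A minimal sketch, with copyBinary as a hypothetical helper name:

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: copies `from` to `to` byte for byte.
// Returns false if either file cannot be opened or the write fails.
bool copyBinary(const std::string& from, const std::string& to) {
    std::ifstream in(from.c_str(), std::ios::binary);
    if (!in) return false;
    std::ofstream out(to.c_str(), std::ios::binary);
    if (!out) return false;
    // operator<< sets failbit when the source is empty, so handle that case
    // separately: an empty source simply yields an empty copy.
    if (in.peek() == std::ifstream::traits_type::eof()) return true;
    out << in.rdbuf();
    out.flush();
    return static_cast<bool>(out);
}
```

This trades the explicit buffer loop for brevity; the read/write loop above gives finer control over error reporting.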
For my course, an exercise asks us to create a program similar to the Linux 'cat' command.
To read the file I use an ifstream, and everything works fine for regular files.
But not when I try to open /dev/ files like /dev/stdin: pressing Enter is not detected, so getline only returns when the fd is closed (with CTRL-D).
The problem seems to lie in how ifstream or getline handles reading, because with the regular 'read' function from libc the problem does not occur.
Here is my code:
#include <iostream>
#include <string>
#include <fstream>
#include <errno.h>
#include <string.h>  // strerror
#ifndef PROGRAM_NAME
# define PROGRAM_NAME "cato9tails"
#endif
int g_exitCode = 0;
void
displayErrno(std::string &file)
{
if (errno)
{
g_exitCode = 1;
std::cerr << PROGRAM_NAME << ": " << file << ": " << strerror(errno) << std::endl;
}
}
void
handleStream(std::string file, std::istream &stream)
{
std::string read;
stream.peek(); /* try to read: will set fail bit if it is a folder. */
if (!stream.good())
displayErrno(file);
while (stream.good())
{
std::getline(stream, read);
std::cout << read;
if (stream.eof())
break;
std::cout << std::endl;
}
}
int
main(int argc, char **argv)
{
if (argc == 1)
handleStream("", std::cin);
else
{
for (int index = 1; index < argc; index++)
{
errno = 0;
std::string file = std::string(argv[index]);
std::ifstream stream(file, std::ifstream::in);
if (stream.is_open())
{
handleStream(file, stream);
stream.close();
}
else
displayErrno(file);
}
}
return (g_exitCode);
}
We can only use functions from libcpp.
I have searched for a solution to this problem for a long time, and I only found this post, where they seem to have a very similar problem:
https://github.com/bigartm/bigartm/pull/258#issuecomment-128131871
But I found no really usable solution there.
I tried a very ugly workaround, but... well...:
bool
isUnixStdFile(std::string file)
{
return (file == "/dev/stdin" || file == "/dev/stdout" || file == "/dev/stderr"
|| file == "/dev/fd/0" || file == "/dev/fd/1" || file == "/dev/fd/2");
}
...
if (isUnixStdFile(file))
handleStream(file, std::cin);
else
{
std::ifstream stream(file, std::ifstream::in);
...
As you can see, many files are not covered, so this can only be called a temporary solution.
Any help would be appreciated!
The following code worked for me when dealing with /dev/fd files or when using shell process substitution syntax:
std::ifstream stream(file_name);
std::cout << "Opening file '" << file_name << "'" << std::endl;
if (stream.fail() || !stream.good())
{
std::cout << "Error: Failed to open file '" << file_name << "'" << std::endl;
return false;
}
while (!stream.eof() && stream.good() && stream.peek() != EOF)
{
std::getline(stream, buffer);
std::cout << buffer << std::endl;
}
stream.close();
Basically std::getline() fails when content from the special file is not ready yet.
I'm trying to read a file that is encoded as UTF-16LE with a BOM.
I tried this code:
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
int main() {
std::wifstream fin("/home/asutp/test");
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
if (!fin) {
std::cout << "!fin" << std::endl;
return 1;
}
if (fin.eof()) {
std::cout << "fin.eof()" << std::endl;
return 1;
}
std::wstring wstr;
getline(fin, wstr);
std::wcout << wstr << std::endl;
if (wstr.find(L"Test") != std::string::npos) {
std::cout << "Found" << std::endl;
} else {
std::cout << "Not found" << std::endl;
}
return 0;
}
The file can contain Latin and Cyrillic. I created the file with a string "Test тест". And this code returns me
/home/asutp/CLionProjects/untitled/cmake-build-debug/untitled
Not found
Process finished with exit code 0
I'm on Linux Mint 18.3 x64, Clion 2018.1
Tried
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)
clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
clang version 5.0.0-3~16.04.1 (tags/RELEASE_500/final)
Ideally you should save files in UTF-8, because Windows has much better UTF-8 support (aside from displaying Unicode in the console window), while POSIX has limited UTF-16 support. Even Microsoft products favor UTF-8 for saving files on Windows.
As an alternative, you can read the UTF-16 file into a buffer and convert that to UTF-8 (std::codecvt_utf8_utf16):
std::ifstream fin("utf16.txt", std::ios::binary);
fin.seekg(0, std::ios::end);
size_t size = (size_t)fin.tellg();
//skip BOM
fin.seekg(2, std::ios::beg);
size -= 2;
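std::u16string u16(size / 2, u'\0'); // one char16_t per two bytes, no extra trailing NUL
fin.read((char*)&u16[0], u16.size() * sizeof(char16_t)); // assumes a little-endian host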
std::string utf8 = std::wstring_convert<
std::codecvt_utf8_utf16<char16_t>, char16_t>{}.to_bytes(u16);
Or
std::ifstream fin("utf16.txt", std::ios::binary);
//skip BOM
fin.seekg(2);
//read as raw bytes
std::stringstream ss;
ss << fin.rdbuf();
std::string bytes = ss.str();
//make sure len is divisible by 2
int len = bytes.size();
if(len % 2) len--;
std::wstring sw;
for(size_t i = 0; i < len;)
{
//little-endian
int lo = bytes[i++] & 0xFF;
int hi = bytes[i++] & 0xFF;
sw.push_back(hi << 8 | lo);
}
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::string utf8 = convert.to_bytes(sw);
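Since std::wstring_convert and the <codecvt> facets are deprecated as of C++17, a hand-written decoder is another option. The sketch below is an illustration rather than a drop-in replacement for the snippets above: it decodes UTF-16LE bytes into UTF-8, skips a BOM if present, and handles surrogate pairs. The names utf16leToUtf8, appendCodePoint and unitAt are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Appends the UTF-8 encoding of code point `cp` (up to U+10FFFF) to `out`.
static void appendCodePoint(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Reads one little-endian 16-bit code unit starting at byte offset `i`.
static unsigned long unitAt(const std::string& b, std::size_t i) {
    return static_cast<unsigned long>(static_cast<unsigned char>(b[i])) |
           (static_cast<unsigned long>(static_cast<unsigned char>(b[i + 1])) << 8);
}

// Hypothetical helper: decodes UTF-16LE bytes (with optional BOM) into UTF-8.
// Throws std::runtime_error on an odd byte count or an unpaired surrogate.
std::string utf16leToUtf8(const std::string& bytes) {
    if (bytes.size() % 2 != 0)
        throw std::runtime_error("odd byte count in UTF-16 input");
    std::string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && unitAt(bytes, 0) == 0xFEFF)
        i = 2;  // skip the byte order mark
    while (i < bytes.size()) {
        unsigned long cp = unitAt(bytes, i);
        i += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate needs a low one
            if (i >= bytes.size())
                throw std::runtime_error("truncated surrogate pair");
            const unsigned long lo = unitAt(bytes, i);
            if (lo < 0xDC00 || lo > 0xDFFF)
                throw std::runtime_error("unpaired high surrogate");
            i += 2;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        }
        appendCodePoint(out, cp);
    }
    return out;
}
```

The input string would come from reading the whole file in binary mode, for example through `ss << fin.rdbuf()` as in the second snippet above, but without seeking past the BOM first.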
Compare against std::wstring::npos (not std::string::npos), since wstr is a std::wstring:
...
//std::wcout << wstr << std::endl;
if (wstr.find(L"Test") == std::wstring::npos) {
std::cout << "Not Found" << std::endl;
} else {
std::cout << "found" << std::endl;
}
I am trying to open another process's stdin and write to it using C++, without dup() and other C tricks. Unfortunately, the fstream constructor taking a std::string seems to be available only from C++11 (source: http://www.cplusplus.com/reference/fstream/fstream/fstream/).
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <unistd.h>  // fork(), pid_t
//using namespace std;
int main()
{
pid_t pid = fork();
if (pid == 0)
{
// std::cout << "child:" << i << std::endl;
std::string line;
std::cin.sync();
std::cout << "child got message:" << std::endl;
while ( std::getline(std::cin, line) )
{
std::cout << line << std::endl;
}
std::cout << "child done receiving" << std::endl;
}
else{
//std::cout << "parent"<<std::endl;
std::ifstream file("stdin", std::ios::in);
if(file.is_open())
{
std::string tmp, str;
std::stringstream pidstr;
pidstr << "/proc/" << pid << "/fd/0";
while ( std::getline(file, tmp) )
str += tmp + "\n";
std::fstream other( (pidstr.str().c_str()), std::ios::out);
other << str ;
}
}
return 0;
}
I have four questions:
1. Is "pidstr.str().c_str()" the most C++ way to do it? It looks ugly to me.
2. Is that the proper C++ way, or are there better alternatives?
3. Since, as far as I know, getline() is blocking, can I omit cin.sync() and expect the child to wait for input in the while loop?
4. Why do I need to press Enter twice (regardless of whether I have getchar in the child commented out or not)?
Thanks for understanding.
EDIT: This compiles and produces the desired output, using Code::Blocks on Linux 13.04.
EDIT2: No, apparently this does NOT produce the desired output. I've made some changes: the parent now waits for the child to exit (not important here).
I also changed this (in the child):
while ( std::getline(std::cin, line) )
{
std::cout << line << std::endl;
}
std::cout << "child done" << std::endl;
to this :
std::getline(std::cin, line);
int lines; // this is single line sent as first from parent, containing int
std::istringstream toint(line);
toint >> lines;
std::cout << "how much lines should it read:" << lines << std::endl;
And it appears that the child waits for anything on stdin and then prints out what was sent from the parent. Really, sometimes I wonder why I should consider C++ better than plain C :/