I'm trying to read a file that is encoded as UTF-16LE with a BOM.
I tried this code:
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
int main() {
std::wifstream fin("/home/asutp/test");
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
if (!fin) {
std::cout << "!fin" << std::endl;
return 1;
}
if (fin.eof()) {
std::cout << "fin.eof()" << std::endl;
return 1;
}
std::wstring wstr;
getline(fin, wstr);
std::wcout << wstr << std::endl;
if (wstr.find(L"Test") != std::string::npos) {
std::cout << "Found" << std::endl;
} else {
std::cout << "Not found" << std::endl;
}
return 0;
}
The file can contain Latin and Cyrillic characters. I created the file with the string "Test тест", and this code prints:
/home/asutp/CLionProjects/untitled/cmake-build-debug/untitled
Not found
Process finished with exit code 0
I'm on Linux Mint 18.3 x64, CLion 2018.1.
Compilers tried:
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)
clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
clang version 5.0.0-3~16.04.1 (tags/RELEASE_500/final)
Ideally you should save files in UTF-8, because Windows has much better UTF-8 support (aside from displaying Unicode in the console window), while POSIX has limited UTF-16 support. Even Microsoft products favor UTF-8 for saving files on Windows.
As an alternative, you can read the UTF-16 file into a buffer and convert that to UTF-8 (std::codecvt_utf8_utf16):
std::ifstream fin("utf16.txt", std::ios::binary);
fin.seekg(0, std::ios::end);
size_t size = (size_t)fin.tellg();
//skip BOM
fin.seekg(2, std::ios::beg);
size -= 2;
size -= size % 2; //ignore a trailing odd byte, if any
std::u16string u16(size / 2, u'\0');
fin.read((char*)&u16[0], size);
std::string utf8 = std::wstring_convert<
std::codecvt_utf8_utf16<char16_t>, char16_t>{}.to_bytes(u16);
Or
std::ifstream fin("utf16.txt", std::ios::binary);
//skip BOM
fin.seekg(2);
//read as raw bytes
std::stringstream ss;
ss << fin.rdbuf();
std::string bytes = ss.str();
//make sure len is divisible by 2
size_t len = bytes.size();
if(len % 2) len--;
std::wstring sw;
for(size_t i = 0; i < len;)
{
//little-endian
int lo = bytes[i++] & 0xFF;
int hi = bytes[i++] & 0xFF;
sw.push_back(hi << 8 | lo);
}
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
std::string utf8 = convert.to_bytes(sw);
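Both snippets above rely on std::wstring_convert and <codecvt>, which are deprecated since C++17. A minimal sketch of the same conversion done by hand is below. The names append_utf8 and utf16le_to_utf8 are hypothetical, and the choice to replace unpaired surrogates with U+FFFD is an assumption of this sketch, not something from the original answer.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to a UTF-8 encoded string.
inline void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Decode raw UTF-16LE bytes (optionally starting with the FF FE BOM) to UTF-8.
// Assumption of this sketch: unpaired surrogates become U+FFFD and a trailing
// odd byte is ignored.
inline std::string utf16le_to_utf8(const std::string& bytes)
{
    // Read the little-endian 16-bit code unit that starts at byte offset i.
    auto unit = [&bytes](std::size_t i) -> std::uint32_t {
        return static_cast<unsigned char>(bytes[i]) |
               (static_cast<unsigned char>(bytes[i + 1]) << 8);
    };

    const std::size_t n = bytes.size() - bytes.size() % 2;
    std::size_t i = 0;
    if (n >= 2 && unit(0) == 0xFEFF)
        i = 2; // skip the BOM

    std::string out;
    while (i < n) {
        std::uint32_t cp = unit(i);
        i += 2;
        if (cp >= 0xD800 && cp <= 0xDBFF) { // high surrogate
            std::uint32_t lo = (i < n) ? unit(i) : 0;
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else {
                cp = 0xFFFD; // high surrogate without a low surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; // low surrogate without a preceding high surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The input would be the raw file contents, for example read with std::ifstream in std::ios::binary mode into a std::string; the resulting UTF-8 string can then be searched with std::string::find.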
Compare against std::wstring::npos rather than std::string::npos, since it is the matching constant for a std::wstring:
...
//std::wcout << wstr << std::endl;
if (wstr.find(L"Test") == std::wstring::npos) {
std::cout << "Not Found" << std::endl;
} else {
std::cout << "found" << std::endl;
}
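As a side note, for the default allocator both npos constants are the largest value of std::size_t, so they compare equal. Using std::wstring::npos is mainly a matter of consistency. A small sketch that checks this at compile time and wraps the search in a helper (contains is a hypothetical name, not part of the original code):

```cpp
#include <string>

// Both constants are size_t(-1) when the default allocator is used.
static_assert(std::string::npos == std::wstring::npos,
              "npos values are expected to match");

// Hypothetical helper: true when `needle` occurs somewhere in `haystack`.
inline bool contains(const std::wstring& haystack, const std::wstring& needle)
{
    return haystack.find(needle) != std::wstring::npos;
}
```

If the search still reports "Not found", the decoded contents of wstr are the more likely place to look.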
I have the following C++ code in VS2019 on Windows 10:
char const* const fileName = "random_StringArray_10000000";
FILE* infile;
long fileSize;
char* buffer;
size_t readBytes;
infile = fopen(fileName, "rb");
if (infile == NULL)
{
fputs("File error", stderr); exit(1);
}
fseek(infile, 0, SEEK_END);
fileSize = ftell(infile);
rewind(infile);
buffer = (char*)malloc(sizeof(char) * fileSize);
if (buffer == NULL)
{
fputs("Memory error", stderr); exit(2);
}
auto start = chrono::steady_clock::now();
readBytes = fread(buffer, 1, fileSize, infile);
auto end = chrono::steady_clock::now();
if (readBytes != fileSize)
{
fputs("Reading error", stderr); exit(3);
}
fclose(infile);
free(buffer);
auto elapsed_ms = chrono::duration_cast<chrono::milliseconds>(end - start);
cout << "Elapsed ms: " << elapsed_ms.count() << endl;
cout << "String count: " << stringCount << endl;
system("pause");
return 0;
I use this method because it is the fastest way I have found to read a file from disk under VS2019.
Now I need to convert the char array to an array of strings.
random_StringArray_10000000 is a UTF-8 text file.
The strings are 8 to 120 characters long.
In a hex view of the file, the strings are separated by 0x0D 0x0A (CR LF).
What is the fastest way to convert the char array (buffer) to a C++ string array?
There seems to be a regularity to your data: all strings are eight characters long and separated by the same two characters. With that in mind, the following seems fairly fast.
size_t arraySize = readBytes/10;
std::string* array = new std::string[arraySize];
for (size_t i = 0; i < arraySize; ++i)
array[i].assign(buffer + 10*i, 8);
Of course timing is necessary to be sure what is fastest.
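The fixed-width approach above only holds if every string is exactly eight characters long. For the variable-length case stated in the question (8 to 120 characters, separated by 0x0D 0x0A), a minimal sketch could scan for line feeds with std::memchr instead. The name split_lines is hypothetical, and whether this is the fastest option would still need to be timed.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Split a buffer of `size` bytes into lines. Lines end with "\r\n" or "\n";
// the terminators are not included in the result.
inline std::vector<std::string> split_lines(const char* buffer, std::size_t size)
{
    std::vector<std::string> lines;
    const char* pos = buffer;
    const char* const end = buffer + size;

    while (pos < end) {
        // Find the next line feed in the remaining part of the buffer.
        const char* nl = static_cast<const char*>(
            std::memchr(pos, '\n', static_cast<std::size_t>(end - pos)));
        const char* line_end = nl ? nl : end;
        const char* next = nl ? nl + 1 : end;

        // Drop the carriage return of a CR LF pair.
        if (line_end > pos && line_end[-1] == '\r')
            --line_end;

        lines.emplace_back(pos, line_end);
        pos = next;
    }
    return lines;
}
```

With the code from the question this would be called as split_lines(buffer, readBytes) before free(buffer), because the strings are copied out of the buffer.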
Reading lines of text from a file is much simpler if you use the classes from the C++ standard library.
This should be all of the code you need:
#include <fstream>
#include <vector>
#include <string>
#include <iostream>
int main()
{
char const* const fileName = "random_StringArray_10000000";
std::ifstream in(fileName);
if (!in)
{
std::cout << "File error\n";
return 1;
}
std::vector<std::string> lines;
std::string line;
while (std::getline(in, line))
{
lines.push_back(std::move(line));
}
return 0;
}
char as[] = "a char array"; // a char array
char const* const fileName = "random_StringArray_10000000"; // a C string
This is also a C string:
char* cs = const_cast<char*>( fileName );
If you want std::string use:
std::string s(as);
I wasn't sure which string conversion you wanted so I just added what I could off the top of my head. But here's a compilable example too.
For my course, an exercise asks us to write a program similar to the Linux 'cat' command.
To read the file I use an ifstream, and everything works fine for regular files.
But it does not work when I try to open /dev/ files such as /dev/stdin: pressing Enter is not detected, so getline only returns when the fd is closed (with Ctrl-D).
The problem seems to lie in how ifstream or getline handles reading, because it does not occur with the regular 'read' function from libc.
Here is my code:
#include <iostream>
#include <string>
#include <fstream>
#include <errno.h>
#include <string.h> /* for strerror */
#ifndef PROGRAM_NAME
# define PROGRAM_NAME "cato9tails"
#endif
int g_exitCode = 0;
void
displayErrno(std::string &file)
{
if (errno)
{
g_exitCode = 1;
std::cerr << PROGRAM_NAME << ": " << file << ": " << strerror(errno) << std::endl;
}
}
void
handleStream(std::string file, std::istream &stream)
{
std::string read;
stream.peek(); /* try to read: will set fail bit if it is a folder. */
if (!stream.good())
displayErrno(file);
while (stream.good())
{
std::getline(stream, read);
std::cout << read;
if (stream.eof())
break;
std::cout << std::endl;
}
}
int
main(int argc, char **argv)
{
if (argc == 1)
handleStream("", std::cin);
else
{
for (int index = 1; index < argc; index++)
{
errno = 0;
std::string file = std::string(argv[index]);
std::ifstream stream(file, std::ifstream::in);
if (stream.is_open())
{
handleStream(file, stream);
stream.close();
}
else
displayErrno(file);
}
}
return (g_exitCode);
}
We can only use functions from libcpp (the C++ standard library).
I have searched for this problem for a long time, and I only found this post, where they seem to have a very similar problem:
https://github.com/bigartm/bigartm/pull/258#issuecomment-128131871
But I found no really usable solution there.
I tried a very ugly workaround, but... well:
bool
isUnixStdFile(std::string file)
{
return (file == "/dev/stdin" || file == "/dev/stdout" || file == "/dev/stderr"
|| file == "/dev/fd/0" || file == "/dev/fd/1" || file == "/dev/fd/2");
}
...
if (isUnixStdFile(file))
handleStream(file, std::cin);
else
{
std::ifstream stream(file, std::ifstream::in);
...
As you can see, a lot of files are missing from this list, so it can only be called a temporary solution.
Any help would be appreciated!
The following code worked for me when dealing with /dev/fd files or when using the shell's process substitution syntax:
std::ifstream stream(file_name);
std::cout << "Opening file '" << file_name << "'" << std::endl;
if (stream.fail() || !stream.good())
{
std::cout << "Error: Failed to open file '" << file_name << "'" << std::endl;
return false;
}
while (!stream.eof() && stream.good() && stream.peek() != EOF)
{
std::getline(stream, buffer);
std::cout << buffer << std::endl;
}
stream.close();
Basically std::getline() fails when content from the special file is not ready yet.
I want to rename some files.
Some of the file names are in Russian, Chinese, or German.
The program can only rename files whose names are in English.
What is the problem? Please guide me.
std::wstring ToUtf16(std::string str)
{
std::wstring ret;
int len = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), str.length(), NULL, 0);
if (len > 0)
{
ret.resize(len);
MultiByteToWideChar(CP_UTF8, 0, str.c_str(), str.length(), &ret[0], len);
}
return ret;
}
int main()
{
const std::filesystem::directory_options options = (
std::filesystem::directory_options::follow_directory_symlink |
std::filesystem::directory_options::skip_permission_denied
);
try
{
for (const auto& dirEntry :
std::filesystem::recursive_directory_iterator("C:\\folder",
std::filesystem::directory_options(options)))
{
filesystem::path myfile(dirEntry.path().u8string());
string uft8path1 = dirEntry.path().u8string();
string uft8path3 = myfile.parent_path().u8string() + "/" + myfile.filename().u8string();
_wrename(
ToUtf16(uft8path1).c_str()
,
ToUtf16(uft8path3).c_str()
);
std::cout << dirEntry.path().u8string() << std::endl;
}
}
catch (std::filesystem::filesystem_error & fse)
{
std::cout << fse.what() << std::endl;
}
system("pause");
}
filesystem::path myfile(dirEntry.path().u8string());
Windows APIs support UTF-16 and ANSI; there is no UTF-8 support in the APIs (not as standard, anyway). When you supply a UTF-8 string, it is treated as ANSI input. Use wstring() to get UTF-16:
filesystem::path myfile(dirEntry.path().wstring());
or just put:
filesystem::path myfile(dirEntry);
Likewise, use wstring() for other objects.
wstring path1 = dirEntry.path();
wstring path3 = myfile.parent_path().wstring() + L"/" + myfile.filename().wstring();
_wrename(path1.c_str(), path3.c_str());
Renaming the files will work fine when you have UTF-16 input. But there is another problem: the console's limited Unicode support. You can't print some Asian characters without changing the font. Use the debugger or MessageBoxW to view Asian characters.
Use _setmode and wcout to print UTF-16.
Also note that std::filesystem supports the / operator for appending to a path. Example:
#include <io.h> //for _setmode
#include <fcntl.h>
...
int main()
{
_setmode(_fileno(stdout), _O_U16TEXT);
const std::filesystem::directory_options options = (
std::filesystem::directory_options::follow_directory_symlink |
std::filesystem::directory_options::skip_permission_denied
);
try
{
for(const auto& dirEntry :
std::filesystem::recursive_directory_iterator(L"C:\\folder",
std::filesystem::directory_options(options)))
{
filesystem::path myfile(dirEntry);
auto path1 = dirEntry.path();
auto path3 = myfile.parent_path() / myfile.filename();
std::wcout << path1.wstring() << L", " << path3.wstring() << endl;
//filesystem::rename(path1, path3);
}
}
...
}
I am trying to read and process multiple files that are in different encodings. I am supposed to use only the STL for this.
Suppose that we have ISO-8859-15 and UTF-8 files.
In this SO answer it states:
In a nutshell the more interesting part for you:
std::stream (stringstream, fstream, cin, cout) has an inner
locale-object, which matches the value of the global C++ locale at
the moment of the creation of the stream object. As std::cin is
created long before your code in main is called, it has most
probably the classical C locale, no matter what you do afterwards.
You can make sure, that a std::stream object has the desirable
locale by invoking
std::stream::imbue(std::locale(your_favorite_locale)).
The problem is that, of the two types, only the files that match the locale that was created first are processed correctly. For example, if locale_DE_ISO885915 precedes locale_DE_UTF8, then the UTF-8 files are not appended correctly to the string s, and when I print them with cout I only see a couple of lines from each file.
void processFiles() {
//setup locales for file decoding
std::locale locale_DE_ISO885915("de_DE.iso885915@euro");
std::locale locale_DE_UTF8("de_DE.UTF-8");
//std::locale::global(locale_DE_ISO885915);
//std::cout.imbue(std::locale());
const std::ctype<wchar_t>& facet_DE_ISO885915 = std::use_facet<std::ctype<wchar_t>>(locale_DE_ISO885915);
//std::locale::global(locale_DE_UTF8);
//std::cout.imbue(std::locale());
const std::ctype<wchar_t>& facet_DE_UTF8 = std::use_facet<std::ctype<wchar_t>>(locale_DE_UTF8);
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::string currFile, fileStr;
std::wifstream inFile;
std::wstring s;
for (std::vector<std::string>::const_iterator fci = files.begin(); fci != files.end(); ++fci) {
currFile = *fci;
//check file and set locale
if (currFile.find("-8.txt") != std::string::npos) {
std::locale::global(locale_DE_ISO885915);
std::cout.imbue(locale_DE_ISO885915);
}
else {
std::locale::global(locale_DE_UTF8);
std::cout.imbue(locale_DE_UTF8);
}
inFile.open(path + currFile, std::ios_base::binary);
if (!inFile) {
//TODO specific file report
std::cerr << "Failed to open file " << *fci << std::endl;
exit(1);
}
s.clear();
//read file content
std::wstring line;
while( (inFile.good()) && std::getline(inFile, line) ) {
s.append(line + L"\n");
}
inFile.close();
//remove punctuation, numbers, tolower...
for (unsigned int i = 0; i < s.length(); ++i) {
if (ispunct(s[i]) || isdigit(s[i]))
s[i] = L' ';
}
if (currFile.find("-8.txt") != std::string::npos) {
facet_DE_ISO885915.tolower(&s[0], &s[0] + s.size());
}
else {
facet_DE_UTF8.tolower(&s[0], &s[0] + s.size());
}
fileStr = converter.to_bytes(s);
std::cout << fileStr << std::endl;
std::cout << currFile << std::endl;
std::cout << fileStr.size() << std::endl;
std::cout << std::setlocale(LC_ALL, NULL) << std::endl;
std::cout << "========================================================================================" << std::endl;
// Process...
}
return;
}
As you can see in the code, I have tried with both global and local locale variables, but to no avail.
In addition, the SO answer to "How can I use std::imbue to set the locale for std::wcout?" states:
So it really looks like there was an underlying C library mechanism
that should be first enabled with setlocale to allow imbue conversion
to work correctly.
Is this "obscure" mechanism the problem here?
Is it possible to alternate between the two locales while processing the files? What should I imbue (cout, ifstream, getline ?) and how?
Any suggestions?
PS: Why is everything related with locale so chaotic? :|
This works for me as expected on my Linux machine, but not on my Windows machine under Cygwin (the set of available locales is apparently the same on both machines, but std::locale::locale just fails with every imaginable locale string).
#include <iostream>
#include <fstream>
#include <locale>
#include <string>
void printFile(const char* name, const char* loc)
{
try {
std::wifstream inFile;
inFile.imbue(std::locale(loc));
inFile.open(name);
std::wstring line;
while (getline(inFile, line))
std::wcout << line << '\n';
} catch (std::exception& e) {
std::cerr << e.what() << std::endl;
}
}
int main()
{
std::locale::global(std::locale("en_US.utf8"));
printFile ("gtext-u8.txt", "de_DE.utf8"); // utf-8 text: grüßen
printFile ("gtext-legacy.txt", "de_DE@euro"); // iso8859-15 text: grüßen
}
Output:
grüßen
grüßen
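If the named locales are not installed (as in the Cygwin case above), one locale-free alternative is to convert the ISO-8859-15 bytes to UTF-8 by hand and treat every file as UTF-8 afterwards. This is a different technique from imbue, shown only as a sketch. ISO-8859-15 matches Unicode code points 0x00 to 0xFF except for eight positions, which are mapped explicitly below; latin9_to_utf8 is a hypothetical name.

```cpp
#include <cstdint>
#include <string>

// Convert ISO-8859-15 (Latin-9) bytes to a UTF-8 encoded string.
inline std::string latin9_to_utf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        std::uint32_t cp = c;
        switch (c) { // the eight positions that differ from ISO-8859-1
            case 0xA4: cp = 0x20AC; break; // euro sign
            case 0xA6: cp = 0x0160; break; // S with caron
            case 0xA8: cp = 0x0161; break; // s with caron
            case 0xB4: cp = 0x017D; break; // Z with caron
            case 0xB8: cp = 0x017E; break; // z with caron
            case 0xBC: cp = 0x0152; break; // OE ligature
            case 0xBD: cp = 0x0153; break; // oe ligature
            case 0xBE: cp = 0x0178; break; // Y with diaeresis
            default: break;
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The input would be the raw bytes of the legacy file, read with a plain std::ifstream; no locale has to be imbued for this step.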
I'm having an issue when running the code below. Every time I let the while loop run until .eof(), it throws a std::bad_alloc:
inFile.open(fileName, std::ios::in | std::ios::binary);
if (inFile.is_open())
{
while (!inFile.eof())
{
read(inFile, readIn);
vecMenu.push_back(readIn);
menu.push_back(readIn);
//count++;
}
std::cout << "File was loaded succesfully..." << std::endl;
inFile.close();
}
It runs fine if I set a predetermined number of iterations, but fails when I use the eof() function. Here's the code for the read function:
void read(std::fstream& file, std::string& str)
{
if (file.is_open())
{
unsigned len;
char *buf = nullptr;
file.read(reinterpret_cast<char *>(&len), sizeof(unsigned));
buf = new char[len + 1];
file.read(buf, len);
buf[len] = '\0';
str = buf;
std::cout << "Test: " << str << std::endl;
delete[] buf;
}
else
{
std::cout << "File was not accessible" << std::endl;
}
}
Any help you can provide is greatly appreciated.
NOTE: I failed to mention that vecMenu is of type std::vector and menu is of type std::list.
The main problems I see are:
You are using while (!inFile.eof()) to end the loop. See Why is iostream::eof inside a loop condition considered wrong?.
You are not checking whether calls to ifstream::read succeeded before using the variables that were read into.
I suggest:
Changing your version of read to return a reference to the stream it takes as input. That makes it possible to use the call to read in the conditional of a loop.
Checking whether calls to ifstream::read succeed before using them.
Putting the call to read in the conditional of the while statement.
std::fstream& read(std::fstream& file, std::string& str)
{
if (file.is_open())
{
unsigned len;
char *buf = nullptr;
if (!file.read(reinterpret_cast<char *>(&len), sizeof(unsigned)))
{
return file;
}
buf = new char[len + 1];
if ( !file.read(buf, len) )
{
delete [] buf;
return file;
}
buf[len] = '\0';
str = buf;
std::cout << "Test: " << str << std::endl;
delete[] buf;
}
else
{
std::cout << "File was not accessible" << std::endl;
}
return file;
}
and
inFile.open(fileName, std::ios::in | std::ios::binary);
if (inFile.is_open())
{
std::cout << "File was loaded succesfully..." << std::endl;
while (read(inFile, readIn))
{
vecMenu.push_back(readIn);
menu.push_back(readIn);
//count++;
}
inFile.close();
}