I'm using this piece of code to read a file into a string, and it's working perfectly with files made manually in Notepad, Notepad++ or other text editors:
std::string utils::readFile(std::string file)
{
    std::ifstream t(file);
    std::string str((std::istreambuf_iterator<char>(t)),
                    std::istreambuf_iterator<char>());
    return str;
}
When I create a file via Notepad (or any other editor) and save some text to it, I get this result in my program:
But when I create a file via CMD (example command below), and run my program, I receive an unexpected result:
cmd /C "hostname">"C:\Users\Admin\Desktop\lel.txt" & exit
Result:
When I open this file generated by CMD (lel.txt), these are the file contents:
If I edit the generated file (lel.txt) with Notepad (adding a space to the end of the file) and run my program again, I get the same weird three-character result.
What might cause this? How can I correctly read a file created via CMD?
EDIT
I changed my command (now using PowerShell) and added a function I found, named SkipBOM, and now it works:
powershell -command "hostname | Out-File "C:\Users\Admin\Desktop\lel.txt" -encoding "UTF8""
SkipBOM:
void SkipBOM(std::ifstream &in)
{
    char test[3] = { 0 };
    in.read(test, 3);
    if ((unsigned char)test[0] == 0xEF &&
        (unsigned char)test[1] == 0xBB &&
        (unsigned char)test[2] == 0xBF)
    {
        return;
    }
    in.seekg(0);
}
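For reference, this is roughly how SkipBOM plugs into the original readFile; just a sketch (the combined helper name is only for illustration, and it assumes the file is at least 3 bytes long, since SkipBOM's read would otherwise leave the stream in a failed state):
std::string utils::readFileSkipBOM(std::string file)
{
    std::ifstream t(file);
    SkipBOM(t); // leaves the stream just past the BOM, or back at the start if there is none
    std::string str((std::istreambuf_iterator<char>(t)),
                    std::istreambuf_iterator<char>());
    return str;
}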
This is almost certainly a BOM (Byte Order Mark): see here, which means that your file is saved as Unicode with a BOM.
There is a way to use C++ streams to read files with BOM (you have to use converters) - let me know if you need help with that.
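For example, a minimal sketch for the UTF-8-with-BOM case (std::codecvt_utf8 lives in <codecvt> and is deprecated since C++17 but still available; std::consume_header makes the facet skip the BOM for you):
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Sketch: read a UTF-8 file with a BOM into a wide string.
std::wstring readUtf8FileWithBom(const std::string &file)
{
    std::wifstream in(file, std::ios::binary);
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>));
    return std::wstring((std::istreambuf_iterator<wchar_t>(in)),
                         std::istreambuf_iterator<wchar_t>());
}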
That is how Unicode looks when treated as an ANSI string. In Notepad, use File > Save As to see what the current format of a file is.
CMD uses the OEM code page, which is the same as ANSI for English characters, so any Unicode output will be converted to OEM by CMD. Perhaps you are grabbing the data yourself.
In VB you would use StrConv to convert it.
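In C++ the rough equivalent is the Win32 conversion APIs; here is a sketch, assuming you already have the text as a UTF-16 std::wstring and want the console's OEM code page:
#include <windows.h>
#include <string>

// Sketch: convert a UTF-16 string to the OEM code page used by the console.
std::string toOem(const std::wstring &wide)
{
    int len = WideCharToMultiByte(CP_OEMCP, 0, wide.c_str(), -1,
                                  nullptr, 0, nullptr, nullptr);
    std::string oem(len, '\0');
    WideCharToMultiByte(CP_OEMCP, 0, wide.c_str(), -1,
                        &oem[0], len, nullptr, nullptr);
    oem.resize(len - 1); // drop the trailing null written by the API
    return oem;
}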
Related
I have a problem: in a Windows application using SDL2 & SDL2_Image, I open image files in order to later save them with modifications to the image data.
When it opens an image without special characters (like áéíóúñ; say, "buenos aires.jpg") it works as intended. But if there is any such special character (say, "córdoba.jpg"), SDL_Image generates an error saying "Couldn't open file". However, if I use a std::ifstream with the exact file name that I got from the CSV file (e.g. "córdoba.jpg" or "misiónes.jpg"), the ifstream works fine... Is it an error to use the special characters? Do Unicode or UTF encodings have something to do with it?
A little information about the environment: Windows 10 (Spanish, Latin American), SDL2 & SDL2_Image (up-to-date versions), GCC compiler using MinGW-w64 7.1.0.
About the software I'm trying to make: it uses a CSV file with the names of various states of Argentina (I already tried changing the encoding of the .CSV). It loads images based on the names found in the CSV, changes them, and saves them.
I know I am probably missing something basic, but I have already exhausted my resources.
IMG_Load() forwards its file argument directly to SDL_RWFromFile():
// http://hg.libsdl.org/SDL_image/file/8fee51506499/IMG.c#l125
SDL_Surface *IMG_Load(const char *file)
{
    SDL_RWops *src = SDL_RWFromFile(file, "rb");
    const char *ext = SDL_strrchr(file, '.');
    if (ext) {
        ext++;
    }
    if (!src) {
        /* The error message has been set in SDL_RWFromFile */
        return NULL;
    }
    return IMG_LoadTyped_RW(src, 1, ext);
}
And SDL_RWFromFile()'s file argument should be a UTF-8 string:
SDL_RWops* SDL_RWFromFile(const char* file,
const char* mode)
Function Parameters:
file: a UTF-8 string representing the filename to open
mode: an ASCII string representing the mode to be used for opening the file; see Remarks for details
So pass UTF-8 paths into IMG_Load().
C++11 has UTF-8 string literal support built-in via the u8 prefix:
IMG_Load( u8"córdoba.jpg" );
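If the path comes from somewhere else at runtime (your CSV, for example) rather than from a literal, you have to make sure that string is UTF-8 encoded before handing it to IMG_Load(). A rough sketch for Windows, assuming the CSV was read in the ANSI code page (the helper name ansiToUtf8 is just for illustration):
#include <windows.h>
#include <string>

// Sketch: re-encode an ANSI (CP_ACP) string as UTF-8 so SDL accepts it.
std::string ansiToUtf8(const std::string &ansi)
{
    int wlen = MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), -1, nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), -1, &wide[0], wlen);

    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                                   nullptr, 0, nullptr, nullptr);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), -1,
                        &utf8[0], ulen, nullptr, nullptr);
    utf8.resize(ulen - 1); // drop the terminating null
    return utf8;
}

// Usage: SDL_Surface *img = IMG_Load(ansiToUtf8(nameFromCsv).c_str());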
When I run a command-line program (which internally deletes a log file) from a CMD prompt, it works as expected.
But when the same command is run from a PowerShell prompt, it does not delete the log file. The command runs successfully except for the log file deletion, and there is no error or exception thrown from the PowerShell prompt.
How does PowerShell differ from a CMD prompt in the Windows environment with respect to file handling, in this case deleting a file?
Note: Both the CMD prompt and PowerShell are run as Administrator.
The source code of the program looks like this:
WIN32_FIND_DATA fd;
LPCWSTR search_path_wstr = ws.c_str();
HANDLE hFind = ::FindFirstFile(search_path_wstr, &fd);
wstring wsFilename(fd.cFileName);
string cFileName(wsFilename.begin(), wsFilename.end());
string absoluteFilename = strPath + "\\" + cFileName;
const char *filename = absoluteFilename.c_str();
remove(filename);
remove() is the function which deletes the file.
Update: I have tried changing remove() to DeleteFile(); the behavior is still the same.
Update 2: I have found the root cause. PowerShell is returning an absolute path whereas the CMD prompt is returning a relative path. This is not part of the above code snippet.
Now I need to determine whether the path is relative or not. There is a Windows function, PathIsRelative(), but it takes an LPCWSTR as input, so again some conversion is required.
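For reference, if the path is kept as a wide string, no conversion is needed for that check either; a sketch (PathIsRelativeW is declared in Shlwapi.h and needs shlwapi.lib):
#include <windows.h>
#include <shlwapi.h> // link with shlwapi.lib
#include <string>

// Sketch: test whether a wide path is relative, with no narrow conversion.
bool isRelativePath(const std::wstring &path)
{
    return PathIsRelativeW(path.c_str()) != FALSE;
}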
My psychic powers tell me that the file name has non-ASCII characters in it and the error in the failing case is "file not found."
In the code, you copy wide characters into regular chars. For anything outside of ASCII, this won't do what you want.
Your code sample doesn't show how you get the source string or strPath.
It's possible that, when you enter the search string in the CMD case, it's got some non-ASCII characters that are representable in the current code page, and those values are copied to wide characters and then back without harm, and the remove works.
When you enter it in PowerShell, you probably get UTF-16 encoded text. When you copy those values back to regular chars, you're not getting the same string, so the remove probably fails with "file not found."
Don't do this:
string cFileName(wsFilename.begin(), wsFilename.end());
Work with wide strings consistently, without any conversions. If you must convert between wide and narrow strings, you must know the encoding and actually transcode the data, not just copy it.
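A minimal sketch of that approach using only the Win32 wide-character APIs (removeMatchingFiles, searchPattern and directory are hypothetical names; error handling is kept to a minimum):
#include <windows.h>
#include <string>

// Sketch: find and delete matching files using only wide-character APIs,
// so non-ASCII file names survive intact.
void removeMatchingFiles(const std::wstring &searchPattern,
                         const std::wstring &directory)
{
    WIN32_FIND_DATAW fd;
    HANDLE hFind = ::FindFirstFileW(searchPattern.c_str(), &fd);
    if (hFind == INVALID_HANDLE_VALUE)
        return;

    do {
        if (!(fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)) {
            std::wstring absolute = directory + L"\\" + fd.cFileName;
            ::DeleteFileW(absolute.c_str()); // no narrow conversion anywhere
        }
    } while (::FindNextFileW(hFind, &fd));

    ::FindClose(hFind);
}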
I can't read a file correctly using CStdioFile.
I open notepad.exe, type àèìòùáéíóú, and save the file twice: once with the encoding set to ANSI (which is really CP-1252) and once as UTF-8.
Then I try to read it from MFC with the following block of code:
BOOL ReadAllFileContent(const CString &FilePath, CString *fileContent)
{
    CString sLine;
    BOOL isSuccess = false;
    CStdioFile input;

    isSuccess = input.Open(FilePath, CFile::modeRead);
    if (isSuccess) {
        while (input.ReadString(sLine)) {
            fileContent->Append(sLine);
        }
        input.Close();
    }
    return isSuccess;
}
When I call it with the ANSI file, I get the expected result àèìòùáéíóú,
but when I try to read the UTF8 encoded file I've got à èìòùáéÃóú
I would like my function to work with all files regardless of the encoding.
Why do I need to implement this myself?
EDIT
Unfortunately, in the real app the files come from an external app, so changing the file encoding isn't an option. I must be able to read both UTF-8 and CP-1252 files.
Any file is valid ANSI; what Notepad calls ANSI is really the Windows-1252 encoding.
I've figured out a way to read both UTF-8 and CP-1252 correctly based on the example provided here. Although it works, I need to pass the file encoding, which I don't know in advance.
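One possible way around that is to sniff the BOM first and only then decide how to decode; a sketch of the idea (HasUtf8Bom is a hypothetical helper, and treating everything without a BOM as CP-1252 is only a heuristic, since UTF-8 files saved without a BOM won't be detected):
// Sketch: return TRUE when the file starts with the UTF-8 BOM (EF BB BF).
BOOL HasUtf8Bom(const CString &filePath)
{
    CFile f;
    if (!f.Open(filePath, CFile::modeRead))
        return FALSE;

    BYTE bom[3] = { 0 };
    UINT bytesRead = f.Read(bom, 3);
    f.Close();

    return bytesRead == 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
}
The result of that check can then be used to pick CP_UTF8 or 1252 when calling the reading routine.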
Thanks!
I personally use the class as advertised here:
https://www.codeproject.com/Articles/7958/CTextFileDocument
It has excellent support for reading and writing text files of various encodings, including Unicode in its various flavours.
I have not had a problem with it.
The function below is something I have created in a unit test for a Qt project I'm working on.
It creates a file (empty or filled) that is then opened in various use cases, processed, and the outcome evaluated. One special use case I have identified is that the encoding actually does affect my application, so I decided to cover non-UTF-8 files too (as far as this is possible).
void TestCsvParserOperators::createCsvFile(QString& path, CsvType type, bool utf8)
{
    path = "test_data.txt";
    QFile csv(path);

    // Make sure both reading and writing access is possible. Also turn on truncation to replace any existing files
    QVERIFY(csv.open(QIODevice::ReadWrite | QIODevice::Truncate | QIODevice::Text) == true);

    QTextStream csvStream(&csv);

    // Set encoding
    if (utf8)
    {
        csvStream.setCodec("UTF-8");
    }
    else
    {
        csvStream.setCodec("ISO 8859-15");
        csvStream.setGenerateByteOrderMark(false);
    }

    switch (type)
    {
    case EMPTY: // File doesn't contain any data
        break;
    case INVALID: // File contains data that is not supported
        csvStream << "abc" << '\n';
        break;
    case VALID:
    {
        // ...
        break;
    }
    }

    csv.close();
}
While the project runs on Linux, the data is exported as a plain text file on Windows (and possibly edited with Notepad) and used by my application as-is. I discovered that it is encoded not as UTF-8 but as ISO 8859-15, which led to a bunch of problems, including incorrectly processed characters.
The actual part in my application that is tested is
// ...
QTextStream in(&csvFile);
if (in.codec() != QTextCodec::codecForName("UTF-8"))
{
LOG(WARNING) << this->sTag << "Expecting CSV file with UTF-8 encoding. Found " << QString(in.codec()->name()) << ". Will attempt to convert to supported encoding";
// Handle encoding
// ...
}
// ...
Regardless of the combination of values for type and utf8, I always get my test text file. However, the file's encoding remains UTF-8 regardless of the utf8 flag.
Calling file on the CSV file with the actual data (shipped by the client) returns
../trunk/resources/data.txt: ISO-8859 text, with CRLF line terminators
while doing the same on test_data.txt gives me
../../build/test-bin/test_data.txt: UTF-8 Unicode text
I've read somewhere that if I want to use some encoding other than UTF-8 I have to work with QByteArray. However, I am unable to verify this in the Qt documentation. I've also read that setting the BOM should do the trick, but I tried both enabling and disabling its generation without any luck.
I've already written a small bash script which converts the encoding to UTF-8 (given that the input file is ISO 8859), but I'd like to
- have this integrated in my actual application,
- not be forced to take care of this every single time, and
- have at least some basic test coverage for the encoding that the client uses.
Any ideas how to achieve this?
UPDATE: I replaced the content I'm writing to the text file with
csvStream << QString("...").toLatin1() << ...;
and now I get
../../build/test-bin/test_data.txt: ASCII text
which is still not what I'm looking for.
Usually this is what I do:
QTextCodec *codec1 = QTextCodec::codecForName("ISO 8859-15");
QByteArray csvStreambyteArray = " .... "; // from your file
QString csvStreamString = codec1->toUnicode(csvStreambyteArray);
csvStream << csvStreamString;
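Going the other direction (making sure the test file really ends up as ISO 8859-15 on disk) works the same way in reverse: encode the QString to a QByteArray yourself and write the raw bytes. Note that a pure-ASCII payload like "abc" is byte-for-byte identical in UTF-8 and ISO 8859-15, so file will keep reporting UTF-8/ASCII unless the content contains a non-ASCII character. A sketch (csv is the already opened QFile from your test; the sample text is arbitrary and assumes the source file itself is saved as UTF-8):
QTextCodec *latin9 = QTextCodec::codecForName("ISO 8859-15");
QByteArray encoded = latin9->fromUnicode(QString::fromUtf8("región;año;1\n"));
csv.write(encoded); // write the raw bytes directly, bypassing QTextStream entirely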
So I have a Qt text editor that I have started creating. I started with this http://doc.qt.io/archives/qt-5.7/gettingstartedqt.html and I have added on to it. So far I have added a proper Save/Save As function (the version in the link only really has a Save As function), a "find" function, and an "open new window" function. Very soon, I will add a find-and-replace function.
I am mainly doing this for the learning experience, but I am also going to eventually add a few more functions that will specifically help me create PLC configuration files at work. These configuration files could be in many different encodings, but most of them seem to be UTF-16LE (according to Emacs, anyway). My text editor originally had no problem reading UTF-16LE, but it wrote plain text, and I needed to change that.
Here is the snippet from the Emacs description of the encoding system of one of these UTF-16LE files:
U -- utf-16le-with-signature-dos (alias: utf-16-le-dos)
UTF-16 (little endian, with signature (BOM)).
Type: utf-16
EOL type: CRLF
This coding system encodes the following charsets:
unicode
And here is an example of the code that I am using to encode the text in my Qt text editor.
First... This is similar to the link that I gave earlier. The only difference here is that "saveFile" is a global variable that I created to perform a simple "Save" function instead of a "Save As" function. This saves the text as plain text and works like a charm.
void findreplace::on_actionSave_triggered()
{
    if (!saveFile.isEmpty())
    {
        QFile file(saveFile);
        if (!file.open(QIODevice::WriteOnly))
        {
            // error message
        }
        else
        {
            QTextStream stream(&file);
            stream << ui->textEdit->toPlainText();
            stream.flush();
            file.close();
        }
    }
}
Below is my newer version, which attempts to save the text as UTF-16LE. My text editor can read the text just fine after saving it with this, but Emacs will not read it at all. This tells me that the configuration file will probably not be readable by the programs that consume it. Something changed, but I'm not sure what.
void findreplace::on_actionSave_triggered()
{
    if (!saveFile.isEmpty())
    {
        QFile file(saveFile);
        if (!file.open(QIODevice::WriteOnly))
        {
            // error message
        }
        else
        {
            QTextStream stream(&file);
            stream << ui->textEdit->toPlainText();
            stream.setCodec("UTF-16LE");
            QString stream3 = stream.readAll();
            //QString stream2 = stream3.setUnicode();
            //QTextCodec *codec = QTextCodec::codecForName("UTF-16LE");
            //QByteArray stream2 = codec->fromUnicode(stream3);
            //file.write(stream3);
            stream.flush();
            file.close();
        }
    }
}
I also tried the parts that are commented out, but they ended up writing the file as Asian (Chinese or Japanese) characters. Like I said, my text editor (and Notepad in Wine) can read the file just fine, but after saving, Emacs now describes the encoding as follows:
= -- no-conversion (alias: binary)
Do no conversion.
When you visit a file with this coding, the file is read into a
unibyte buffer as is, thus each byte of a file is treated as a
character.
Type: raw-text (text with random binary characters)
EOL type: LF
This indicates to me that something is not right in the file. Eventually this text editor will be used to create multiple text files at once and modify their contents via user input. It would be great if I could get this encoding right.
Thanks to the kind fellows who commented on my post here, I was able to answer my own question. This code solved my problem:
void findreplace::on_actionSave_triggered()
{
    if (!saveFile.isEmpty())
    {
        QFile file(saveFile);
        if (!file.open(QIODevice::WriteOnly))
        {
            // error message
        }
        else
        {
            QTextStream stream(&file);
            stream.setCodec("UTF-16LE");
            stream.setGenerateByteOrderMark(true);
            stream << ui->textEdit->toPlainText();
            stream.flush();
            file.close();
        }
    }
}
I set the codec of the stream and then set the generate-BOM flag to true. I guess I have more to learn about encodings; I thought the byte order mark had to be set to a specific value or something. I wasn't aware that I just had to set this flag to true and that it would take care of itself. Emacs can now read the files generated by saving a document with this code, and it reports the same encoding description as for the original files. I will eventually add options for the user to pick which encoding they need while saving. Glad I was able to learn something here.
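For the encoding option mentioned above, something along these lines should work; a sketch where encodingName is a hypothetical QString holding the user's choice (e.g. "UTF-8" or "UTF-16LE"):
// Sketch: map the user's selection to the stream codec before writing.
QTextStream stream(&file);
stream.setCodec(QTextCodec::codecForName(encodingName.toLatin1()));
stream.setGenerateByteOrderMark(true); // keep the BOM so other editors can detect the encoding
stream << ui->textEdit->toPlainText();
stream.flush();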