How do I read a Windows-1252 file using Rcpp?

I want to force the input encoding to Windows-1252 when reading a file with Rcpp. I need this because I switch between Linux and Windows environments, while the files are consistently in Windows-1252 encoding.
How do I adapt this to work:
String readFile(std::string path) {
std::ifstream t(path.c_str());
if (!t.good()){
std::string error_msg = "Failed to open file ";
error_msg += "'" + path + "'";
::Rf_error(error_msg.c_str());
}
const std::locale& locale = std::locale("sv_SE.1252");
t.imbue(locale);
std::stringstream ss;
ss << t.rdbuf();
return ss.str();
}
The above fails with:
Error in eval(expr, envir, enclos) :
locale::facet::_S_create_c_locale name not valid
I've also tried with "Swedish_Sweden.1252" that is the default for my system to no avail. I've tried #include <boost/locale.hpp> but that seems to be unavailable in Rcpp (v 0.12.0)/BH boost (v. 1.58.0-1).
Update:
After digging a little deeper into this, I'm not sure whether the gcc (v. 4.6.3) in RTools (v. 3.3) is built with locale support; this SO question points to that possibility. If any argument other than "" or "C" works with std::locale(), it would be interesting to know. I've tried a few more alternatives, but nothing seems to work.
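One way to probe which names the toolchain accepts is to try constructing each candidate and catch the exception. This is only a sketch: locale_available is a hypothetical helper, and which names succeed depends entirely on the platform and the C++ runtime.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if std::locale accepts the given name.
// The std::locale constructor throws std::runtime_error for names it cannot resolve.
bool locale_available(const std::string& name) {
    try {
        std::locale loc(name.c_str());
        (void)loc;  // only the construction matters here
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Calling it with "sv_SE.1252", "Swedish_Sweden.1252" and "C" should show quickly whether anything beyond the "C" locale is available in a given build.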
Fallback solution
I'm not entirely satisfied, but it seems that using base::iconv() fixes any issues with characters regardless of the original format, largely thanks to the from="WINDOWS-1252" argument forcing the chars to be interpreted correctly. If we want to stay in Rcpp, we can simply do:
String readFile(std::string path) {
std::ifstream t(path.c_str());
if (!t.good()){
std::string error_msg = "Failed to open file ";
error_msg += "'" + path + "'";
::Rf_error(error_msg.c_str());
}
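// No locale is imbued here: std::locale("sv_SE.1252") throws where that name is unavailable (see the error above).
// The bytes are read unchanged and converted by iconv() below.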
std::stringstream ss;
ss << t.rdbuf();
Rcpp::StringVector ret = ss.str();
Environment base("package:base");
Function iconv = base["iconv"];
ret = iconv(ret, Named("from","WINDOWS-1252"),Named("to","UTF8"));
return ret;
}
Note that it is preferable to wrap the function in R rather than fetching iconv from C++ and calling it from there: it takes less code and is about twice as fast (checked with microbenchmark):
readFileWrapper <- function(path){
ret <- readFile(path)
iconv(ret, from = "WINDOWS-1252", to = "UTF8")
}
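If calling back into R is undesirable, the conversion itself can also be sketched in plain C++, since Windows-1252 is a single-byte encoding. The following is a minimal sketch, not the approach used above: cp1252_to_utf8 and append_utf8_bmp are hypothetical names, bytes 0x80 to 0x9F are mapped through the Windows-1252 table, and the five bytes that Windows-1252 leaves undefined are assumed to map to the C1 control code point of the same value.

```cpp
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding of a code point below U+10000 to out.
static void append_utf8_bmp(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts Windows-1252 bytes to a UTF-8 string.
std::string cp1252_to_utf8(const std::string& in) {
    // Code points for bytes 0x80..0x9F; all other bytes equal their code point.
    // Undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to themselves (assumption).
    static const std::uint16_t high[32] = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};
    std::string out;
    out.reserve(in.size());
    for (unsigned char b : in) {
        std::uint32_t cp = (b >= 0x80 && b <= 0x9F) ? high[b - 0x80] : b;
        append_utf8_bmp(out, cp);
    }
    return out;
}
```

When returned to R, the string would still need its encoding declared as UTF-8 on the R side (for example with Encoding<-), which this sketch does not cover.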

Related

How to print the content of a string as "string literal source"

Suppose s is
a
b
c
const std::string s =
std::cout << R"( s )" << std::endl;
How can I std::cout the content of the string as an escaped literal? I mean that cout should print the value in this format: "a\nb\nc".
I need to transform a very large text into a std::string.
I can't use a file read, as I need to define its value inside the source.
What you would need to do is to scan the string, replace all occurrences of the characters you are interested in (such as carriage return, tab, etc.) with a printable escape sequence, and then print this new text.
Here is somewhat crude proof of concept:
std::string escape(std::string_view src) {
std::string ret;
ret.reserve(src.size() * 2); // at worst, the string consists solely of escapable symbols
static constexpr std::array escapable = {std::make_pair('\t', 't'),
std::make_pair('\n', 'n')}; // add more chars as needed, note that the array is sorted
for (const char ch: src) {
std::pair search_pair{ch, ' '};
auto esc_char = std::equal_range(escapable.begin(), escapable.end(), search_pair, [](auto& a, auto& b) { return a.first < b.first; });
if (esc_char.first != esc_char.second) { // a non-empty range means ch is in the table
ret.push_back('\\');
ret.push_back(esc_char.first->second);
} else {
ret.push_back(ch);
}
}
return ret;
}
Now, you can use it:
const std::string str = "A\nbub\tfuf\n";
std::cout << escape(str) << "\n";
Above snippet prints A\nbub\tfuf\n
You might be interested in the JSON specification.
You could consider serializing your data in JSON format using open source C++ libraries like jsoncpp.
You could also consider using some YAML format with the yaml-cpp library.
You might be interested in the SWIG tool, which generates C++ glue code.
You could consider using binary data formats like XDR.
You should specify (on paper, with a pencil) your data format in EBNF notation and use ANTLR or GNU bison to generate the parser (the printer is easier to code).
The RefPerSys project (an open source symbolic artificial intelligence system, GPLv3+ licensed) persists data in textual format. You may borrow some code and re-use it in your application, if you comply with that GPL license.
Look also into Qt or POCO frameworks, but notice that DWORD64 is not a standard C++ type. See this C++ reference and read a recent C++ standard (like n3337 or better).
Consider generating your C++ serializing code
With tools like GNU m4 or GPP (or your own one).
Pitrat's book Artificial Beings: the Conscience of a Conscious Machine (ISBN-13: 978-1848211018) should give you valuable insight and intuitions.
You can load this text file into a std::string like this:
Store the text in your file, e.g. mystring.txt, as a raw string literal in the format R"(raw_characters)":
R"(Run.M128A XmmRegisters[16];
BYTE Reserved4[96];", Run.CONTEXT64 := " DWORD64 P1Home;
DWORD64 P2Home;
...
)"
#include the file into a string:
namespace
{
const std::string mystring =
#include "mystring.txt"
;
}
Your IDE might flag this up as a syntax error, but it isn't. What you're doing is loading the contents of the file directly into the string at compile time.
Finally print the string:
std::cout << mystring << std::endl;
Why not just save the escaped version of the string in the file?
Anyway, here's a function to 'escape' characters:
#include <iostream>
#include <string>
#include <unordered_map>
std::string replace_all(const std::string &mystring)
{
const std::unordered_map<char, std::string> lookup =
{ {'\n', "\\n"}, {'\t', "\\t"}, {'"', "\\\""} };
std::string new_string;
new_string.reserve(mystring.length() * 2);
for (auto c : mystring)
{
auto it = lookup.find(c);
if (it != lookup.end())
new_string += it->second;
else
new_string += c;
}
return new_string;
}
int main() {
std::string mystring = R"(Run.M128A XmmRegisters[16];
BYTE Reserved4[96];", Run.CONTEXT64 := " DWORD64 P1Home;
DWORD64 P2Home;
DWORD64 P3Home;
DWORD64 P4Home;
DWORD64 P5Home;
DWORD64 P6Home;)";
auto new_string = replace_all(mystring);
std::cout << new_string << std::endl;
return 0;
}
Here's a demo.

I want to create a text file in cpp using ofstream

I want to create a file qbc.txt. I know how to create it, but I want the program to name the new file qbc(1).txt if a file with the same name already exists.
In C++17, boost's filesystem library was standardized as std::filesystem
It comes with a convenient std::filesystem::exists function.
It accepts a std::filesystem::path object, but fortunately those can be constructed with a std::string, making our program trivially easy:
std::string prefix = "qbc";
std::string extension = ".txt";
std::filesystem::path filename{prefix + extension};
int i = 0;
while (std::filesystem::exists(filename)){
filename = prefix + "(" + std::to_string(++i) + ")" + extension;
}
// now filename is "qbc.txt", "qbc(1).txt", "qbc(2).txt", etc.
Unfortunately no compiler has full support for it at the time of this writing!
Here is a simple solution. The file_exists() function comes from @Raviprakash's answer. I've added how to change the filename and try again until it succeeds. I've used a similar approach before in Python.
If you know that your program is the only one that will create or remove these files, then you can cache the last created one and simply create the next one instead of looping over all of the created ones every time. But this kind of optimization would only make sense if you plan to make hundreds of thousands of files this way.
#include <fstream>
#include <string>
bool file_exists(const std::string &filename) {
std::ifstream in(filename);
return in.good();
}
std::ofstream& open_new(std::ofstream &out, std::string prefix,
std::string suffix)
{
std::string filename = prefix + suffix;
unsigned int index = 0;
while (file_exists(filename)) {
index++;
filename = prefix + "(" + std::to_string(index) + ")" + suffix;
}
out.rdbuf()->open(filename, std::ios_base::out);
return out;
}
int main() {
std::string prefix = "qbc";
std::string suffix = ".txt";
std::ofstream out;
open_new(out, prefix, suffix);
out << "hello world!\n";
return 0;
}
I know the program needs some improvements but the general idea is here:
#include <fstream>
#include <string>
using namespace std;
inline bool file_exists(const std::string& name)
{
ifstream f(name.c_str());
return f.good();
}
int main()
{
string filename, name;
name = "qbc";
filename = name;
int counter = 1;
while (file_exists(filename+".txt")) {
string str = to_string(counter);
filename = name+ "(" + str + ")";
counter++;
}
filename += ".txt";
ofstream out(filename.c_str());
return 0;
}
I don't think this can be entirely solved using just the standard libraries. You can certainly keep picking a new file name until you find one that's unused and then create the new file (as the other answers have shown).
But there's an inherent race condition in that approach. What if another process creates a file between the time your program decides the name is available and the time it actually creates the file? Imagine two copies of your program both trying to write out files.
What you need is an atomic way to check for the file's existence and also to create the file. The normal way to do that is to first just try to create the file and then see if you succeeded or not. Unfortunately, I don't think the standard C++ or C libraries give you enough tools to do that. (I'd be happy to be proven wrong about that.)
Operating systems often provide APIs for doing just that. For example, Windows has GetTempFileName, which just keeps trying to create a new file until it succeeds. The key is that, once it succeeds, it keeps the file open so that it knows no other process can steal the name that's been selected.
If you tell us which OS you're using, we might be able to provide a more detailed answer.
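For what it's worth, newer standards do offer a way to close that gap: C11 added an exclusive "x" mode to fopen, which C++17 inherits through std::fopen, and C++23 adds std::ios_base::noreplace for file streams. Below is a minimal sketch of the "try to create, then check" idea using std::fopen. create_unique is a hypothetical helper, and the sketch assumes that the C library implements "wx" with an atomic exclusive create and sets errno to EEXIST when the name is taken, as POSIX systems do.

```cpp
#include <cerrno>
#include <cstdio>
#include <string>

// Hypothetical helper: creates prefix+suffix, or prefix(1)+suffix, and so on,
// without a separate existence check. On success it returns the open FILE*
// and stores the chosen name in `chosen`; on failure it returns nullptr.
std::FILE* create_unique(const std::string& prefix, const std::string& suffix,
                         std::string& chosen, unsigned max_tries = 1000) {
    for (unsigned i = 0; i < max_tries; ++i) {
        chosen = (i == 0) ? prefix + suffix
                          : prefix + "(" + std::to_string(i) + ")" + suffix;
        // "wx": create for writing, fail if the file already exists.
        if (std::FILE* f = std::fopen(chosen.c_str(), "wx")) {
            return f;
        }
        // Assumption: errno is EEXIST when the name is taken (true on POSIX).
        if (errno != EEXIST) {
            break;  // some other error, e.g. permissions; retrying will not help
        }
    }
    chosen.clear();
    return nullptr;
}
```

Because the file stays open from the moment the name is chosen, another process cannot claim the same name in between, which is the property the check-then-create loops lack.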

How to check if a file already exists before creating it? (C++, Unicode, cross platform)

I have the following code which works on Windows:
bool fileExists(const wstring& src)
{
#ifdef PLATFORM_WINDOWS
return (_waccess(src.c_str(), 0) == 0);
#else
// ???? how to make C access() function to accept the wstring on Unix/Linux/MacOS ?
#endif
}
How do I make the code work on *nix platforms the same way as it does on Windows, considering that src is a Unicode string and might contain a file path with Unicode characters?
I have seen various StackOverflow answers which partly answer my question, but I have trouble putting it all together. My system relies on wide strings, especially on Windows where file names might contain non-ASCII characters. I know that generally it's better to write to the file and check for errors, but my case is the opposite: I need to skip the file if it already exists. I just want to check whether the file exists, no matter whether I can read/write it or not.
On many filesystems other than FAT and NTFS, filenames aren't exactly well defined as strings. They're technically byte sequences. What those byte sequences mean is a matter of interpretation. A common interpretation is UTF-8-like. Not exact UTF-8, because Unicode specifies string equality regardless of encoding. Most systems use byte equality instead. (Again, FAT and NTFS are exceptions, using case-insensitive comparisons)
A good portable solution I use is to use the following:
ifstream my_file(myFilenameHere);
if (my_file.good())
{
// file exists and do what you need to do when it exists
}
else
{
// the file doesn't exist do what you need to do to create it etc.
}
For example a small file existence checker function could be (this one works in windows, linux and unix):
inline bool doesMyFileExist (const std::string& myFilename)
{
#if defined(__unix__) || defined(__posix__) || defined(__linux__ )
// all UNIXes, POSIX (including OS X I think (cant remember been a while)) and
// all the various flavours of Linus Torvalds digital offspring:)
struct stat buffer;
return (stat (myFilename.c_str(), &buffer) == 0);
#elif defined(__APPLE__)|| defined(_WIN32)
// this includes IOS AND OSX and Windows (x64 and x86)
// note the underscore in the windows define, without it can cause problems
if (FILE *file = fopen(myFilename.c_str(), "r"))
{
fclose(file);
return true;
}
else
{
return false;
}
#else // a catch-all fallback, this is the slowest method, but works on them all:)
ifstream myFile(myFilename.c_str());
if (myFile.good())
{
myFile.close();
return true;
}
else
{
myFile.close();
return false;
}
#endif
}
The function above uses what is typically the fastest method to check the file on each OS variant, and has a fallback in case you are on an OS other than the ones explicitly listed (the original Amiga OS, for example). This has been used with GCC 4.8.x and VS 2010/2012.
The good method will check that everything is as it should be, and this way you actually have the file open.
The only caveat is to pay close attention to how the file name is represented in the OS (as mentioned in another answer).
So far this has worked cross platform for me just fine:)
I spent some hours experimenting on my Ubuntu machine. It took many trials and errors but finally I got it working. I'm not sure if it will work on MacOS or even on other *nixes.
As many suspected, direct casting to char* did not work - then I got only the first slash of my test path /home/progmars/абвгдāēī . The trick was to use wcstombs() combined with setlocale(). Although I could not get the text to display in the console after this conversion, the access() function still got it right.
Here is the code which worked for me:
bool fileExists(const wstring& src)
{
#ifdef PLATFORM_WINDOWS
return (_waccess(src.c_str(), 0) == 0);
#else
// hopefully this will work on most *nixes...
size_t outSize = src.size() * sizeof(wchar_t) + 1; // max possible bytes plus \0 char
std::vector<char> conv(outSize, 0); // needs <vector>; freed automatically, unlike new char[]
// MacOS claims to have wcstombs_l which has locale argument,
// but I could not find something similar on Ubuntu
// thus I had to use setlocale();
// copy the name: the pointer returned by setlocale() may be invalidated by the next call
std::string oldLocale = setlocale(LC_ALL, NULL);
setlocale(LC_ALL, "en_US.UTF-8"); // let's hope, most machines will have "en_US.UTF-8" available
// "Works on my machine", that is, Ubuntu 12.04
size_t wcsSize = wcstombs(conv.data(), src.c_str(), outSize);
// now be good, restore the locale
setlocale(LC_ALL, oldLocale.c_str());
if (wcsSize == (size_t)-1) return false; // conversion failed, so the path cannot be checked
return (access(conv.data(), 0) == 0);
#endif
}
And here is some experimental code which led me to the solution:
// this is crucial to output correct unicode characters in console and for wcstombs to work!
// empty string also works instead of en_US.UTF-8
// setlocale(LC_ALL, "en_US.UTF-8");
wstring unicoded = wstring(L"/home/progmars/абвгдāēī");
int outSize = unicoded.size() * sizeof(wchar_t) + 1;// max possible bytes plus \0 char
char* conv = new char[outSize];
memset(conv, 0, outSize);
size_t szt = wcstombs(conv, unicoded.c_str(), outSize); // this needs setlocale - only then it returns 31. else it returns some big number - most likely, an error message
wcout << "wcstombs result " << szt << endl;
int resDirect = access("/home/progmars/абвгдāēī", 0); // works fine always
int resCast = access((char*)unicoded.c_str(), 0);
int resConv = access(conv, 0);
wcout << "Raw " << unicoded.c_str() << endl; // output /home/progmars/абвгдāēī but only if setlocale has been called; else output is /home/progmars/????????
wcout << "Casted " << (char*)unicoded.c_str() << endl; // output /
wcout << "Converted " << conv << endl; // output /home/progmars/ - for some reason, Unicode chars are cut away in the console, but still they are there because access() picks them up correctly
wcout << "resDirect " << resDirect << endl; // gives correct result depending on the file existence
wcout << "resCast " << resCast << endl; // wrong result - always 0 because it looks for / and it's the filesystem root which always exists
wcout << "resConv " << resConv << endl;
// gives correct result but only if setlocale() is present
Of course, I could avoid all that hassle with ifdefs to define my own version of string which would be wstring on Windows and string on *nix because *nix seems to be more liberal about UTF8 symbols and doesn't mind using them in plain strings. Still, I wanted to keep my function declarations consistent for all platforms and also I wanted to learn how Unicode filenames work in Linux.
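With a C++17 compiler, much of the conversion work above can probably be delegated to std::filesystem, whose path type accepts wide strings on every platform. This is a sketch rather than a drop-in replacement: fileExistsFs is a hypothetical name, and on POSIX systems the wide-to-narrow conversion performed by path is left to the implementation (commonly UTF-8), so it is worth testing with the non-ASCII paths you care about.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper: existence check that takes a wide string on all platforms.
bool fileExistsFs(const std::wstring& src) {
    // The error_code overload reports filesystem errors without throwing.
    // Constructing the path may still throw if the string cannot be converted.
    std::error_code ec;
    return std::filesystem::exists(std::filesystem::path(src), ec);
}
```

This keeps the same wstring-based declaration on all platforms without touching the global locale.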

Stumped with Unicode, Boost, C++, codecvts

In C++, I want to use Unicode to do things. So after falling down the rabbit hole of Unicode, I've managed to end up in a train wreck of confusion, headaches and locales.
But in Boost I've had the unfortunate problem of trying to use Unicode file paths and trying to use the Boost program options library with Unicode input. I've read whatever I could find on the subjects of locales, codecvts, Unicode encodings and Boost.
My current attempt to get things to work is to have a codecvt that takes a UTF-8 string and converts it to the platform's encoding (UTF-8 on POSIX, UTF-16 on Windows), I've been trying to avoid wchar_t.
The closest I've actually gotten is trying to do this with Boost.Locale, to convert from a UTF-8 string to a UTF-32 string on output.
#include <string>
#include <boost/locale.hpp>
#include <locale>
int main(void)
{
std::string data("Testing, 㤹");
std::locale fromLoc = boost::locale::generator().generate("en_US.UTF-8");
std::locale toLoc = boost::locale::generator().generate("en_US.UTF-32");
typedef std::codecvt<wchar_t, char, mbstate_t> cvtType;
cvtType const* toCvt = &std::use_facet<cvtType>(toLoc);
std::locale convLoc = std::locale(fromLoc, toCvt);
std::cout.imbue(convLoc);
std::cout << data << std::endl;
// Output is unconverted -- what?
return 0;
}
I think I had some other kind of conversion working using wide characters, but I really don't know what I'm even doing. I don't know what the right tool for the job is at this point. Help?
Okay, after a long few months I've figured it out, and I'd like to help people in the future.
First of all, the codecvt thing was the wrong way of doing it. Boost.Locale provides a simple way of converting between character sets in its boost::locale::conv namespace. Here's one example (there are others not based on locales).
#include <boost/locale.hpp>
namespace loc = boost::locale;
int main(void)
{
loc::generator gen;
std::locale blah = gen.generate("en_US.utf-32");
std::string UTF8String = "Tésting!";
// from_utf will also work with wide strings as it uses the character size
// to detect the encoding.
std::string converted = loc::conv::from_utf(UTF8String, blah);
// Outputs a UTF-32 string.
std::cout << converted << std::endl;
return 0;
}
As you can see, if you replace the "en_US.utf-32" with "" it'll output in the user's locale.
I still don't know how to make std::cout do this all the time, but the translate() function of Boost.Locale outputs in the user's locale.
As for the filesystem using UTF-8 strings cross platform, it seems that that's possible, here's a link to how to do it.
std::cout.imbue(convLoc);
std::cout << data << std::endl;
This does no conversion, since it uses codecvt<char, char, mbstate_t> which is a no-op. The only standard streams that use codecvt are file-streams. std::cout is not required to perform any conversion at all.
To force Boost.Filesystem to interpret narrow-strings as UTF-8 on windows, use boost::filesystem::imbue with a locale with a UTF-8 ↔ UTF-16 codecvt facet. Boost.Locale has an implementation of the latter.
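To illustrate the point that file streams are where codecvt facets actually take effect, here is a small sketch without Boost. It imbues a std::wofstream with std::codecvt_utf8, so wide characters are written as UTF-8 bytes. The function names are hypothetical, and note that <codecvt> has been deprecated since C++17, so some compilers will warn about it.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: writes a wide string to a file as UTF-8.
void write_utf8_file(const std::string& path, const std::wstring& text) {
    std::wofstream out;
    // Imbue before opening; the locale takes ownership of the facet.
    out.imbue(std::locale(std::locale::classic(), new std::codecvt_utf8<wchar_t>));
    out.open(path, std::ios::binary);
    out << text;
}

// Hypothetical helper: reads the raw bytes of a file, without any conversion.
std::string read_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Writing L"T\u00e9sting" this way and reading the file back as bytes should show the two-byte UTF-8 sequence for é, which is the conversion that std::cout was never going to perform.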
The Boost filesystem iostream replacement classes work fine with UTF-16 when used with Visual C++.
However, they do not work (in the sense of supporting arbitrary filenames) when used with g++ in Windows - at least as of Boost version 1.47. There is a code comment explaining that; essentially, the Visual C++ standard library provides non-standard wchar_t based constructors that Boost filesystem classes make use of, but g++ does not support these extensions.
A workaround is to use 8.3 short filenames, but this solution is a bit brittle since with old Windows versions the user can turn off automatic generation of short filenames.
Example code for using Boost filesystem in Windows:
#include "CmdLineArgs.h" // CmdLineArgs
#include "throwx.h" // throwX, hopefully
#include "string_conversions.h" // ansiOrFillerFrom( wstring )
#include <boost/filesystem/fstream.hpp> // boost::filesystem::ifstream
#include <iostream> // std::cout, std::cerr, std::endl
#include <stdexcept> // std::runtime_error, std::exception
#include <string> // std::string
#include <stdlib.h> // EXIT_SUCCESS, EXIT_FAILURE
using namespace std;
namespace bfs = boost::filesystem;
inline string ansi( wstring const& ws ) { return ansiWithFillersFrom( ws ); }
int main()
{
try
{
CmdLineArgs const args;
wstring const programPath = args.at( 0 );
hopefully( args.nArgs() == 2 )
|| throwX( "Usage: " + ansi( programPath ) + " FILENAME" );
wstring const filePath = args.at( 1 );
bfs::ifstream stream( filePath ); // Nice Boost ifstream subclass.
hopefully( !stream.fail() )
|| throwX( "Failed to open file '" + ansi( filePath ) + "'" );
string line;
while( getline( stream, line ) )
{
cout << line << endl;
}
hopefully( stream.eof() )
|| throwX( "Failed to list contents of file '" + ansi( filePath ) + "'" );
return EXIT_SUCCESS;
}
catch( exception const& x )
{
cerr << "!" << x.what() << endl;
}
return EXIT_FAILURE;
}

C++ - string.compare issues when output to text file is different to console output?

I'm trying to find out if two strings I have are the same, for the purpose of unit testing. The first is a predefined string, hard-coded into the program. The second is a read in from a text file with an ifstream using std::getline(), and then taken as a substring. Both values are stored as C++ strings.
When I output both of the strings to the console using cout for testing, they both appear to be identical:
ThisIsATestStringOutputtedToAFile
ThisIsATestStringOutputtedToAFile
However, the string.compare returns stating they are not equal. When outputting to a text file, the two strings appear as follows:
ThisIsATestStringOutputtedToAFile
T^#h^#i^#s^#I^#s^#A^#T^#e^#s^#t^#S^#t^#r^#i^#n^#g^#O^#u^#t^#p^#u^#t^#
t^#e^#d^#T^#o^#A^#F^#i^#l^#e
I'm guessing this is some kind of encoding problem, and if I was in my native language (good old C#), I wouldn't have too many problems. As it is I'm with C/C++ and Vi, and frankly don't really know where to go from here! I've tried looking at maybe converting to/from ansi/unicode, and also removing the odd characters, but I'm not even sure if they really exist or not..
Thanks in advance for any suggestions.
EDIT
Apologies, this is my first time posting here. The code below is how I'm going through the process:
ifstream myInput;
ofstream myOutput;
myInput.open(fileLocation.c_str());
myOutput.open("test.txt");
TEST_ASSERT(myInput.is_open() == 1);
string compare1 = "ThisIsATestStringOutputtedToAFile";
string fileBuffer;
std::getline(myInput, fileBuffer);
string compare2 = fileBuffer.substr(400,100);
cout << compare1 + "\n";
cout << compare2 + "\n";
myOutput << compare1 + "\n";
myOutput << compare2 + "\n";
cin.get();
myInput.close();
myOutput.close();
TEST_ASSERT(compare1.compare(compare2) == 0);
How did you create the content of myInput? I would guess that this file is created in two-byte encoding. You can use hex-dump to verify this theory, or use a different editor to create this file.
The simplest way would be to launch cmd.exe and type
echo "ThisIsATestStringOutputtedToAFile" > test.txt
UPDATE:
If you cannot change the encoding of the myInput file, you can try to use wide-chars in your program. I.e. use wstring instead of string, wifstream instead of ifstream, wofstream, wcout, etc.
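The hex-dump check can also be done in code by looking at the first bytes of the file for a byte order mark. This is a rough sketch with a hypothetical helper name; files without a BOM are not detected, and a UTF-32LE file would be reported as UTF-16LE because its BOM starts with the same two bytes.

```cpp
#include <string>

// Hypothetical helper: names the encoding suggested by a leading byte order
// mark, or returns "" when the data does not start with one.
std::string detect_bom(const std::string& data) {
    auto starts_with = [&data](const std::string& bom) {
        return data.compare(0, bom.size(), bom) == 0;
    };
    if (starts_with("\xEF\xBB\xBF")) return "UTF-8";
    if (starts_with("\xFF\xFE")) return "UTF-16LE";  // also matches a UTF-32LE BOM
    if (starts_with("\xFE\xFF")) return "UTF-16BE";
    return "";
}
```

Passing it the first few bytes read from the input file in binary mode would be enough to decide whether narrow or wide handling is needed.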
The following works for me and writes the text pasted below into the file. Note the '\0' character embedded into the string.
#include <iostream>
#include <fstream>
#include <sstream>
int main()
{
std::istringstream myInput("0123456789ThisIsATestStringOutputtedToAFile\x0 12ou 9 21 3r8f8 reohb jfbhv jshdbv coerbgf vibdfjchbv jdfhbv jdfhbvg jhbdfejh vbfjdsb vjdfvb jfvfdhjs jfhbsd jkefhsv gjhvbdfsjh jdsfhb vjhdfbs vjhdsfg kbhjsadlj bckslASB VBAK VKLFB VLHBFDSL VHBDFSLHVGFDJSHBVG LFS1BDV LH1BJDFLV HBDSH VBLDFSHB VGLDFKHB KAPBLKFBSV LFHBV YBlkjb dflkvb sfvbsljbv sldb fvlfs1hbd vljkh1ykcvb skdfbv nkldsbf vsgdb lkjhbsgd lkdcfb vlkbsdc xlkvbxkclbklxcbv");
std::ofstream myOutput("test.txt");
//std::ostringstream myOutput;
std::string str1 = "ThisIsATestStringOutputtedToAFile";
std::string fileBuffer;
std::getline(myInput, fileBuffer);
std::string str2 = fileBuffer.substr(10,100);
std::cout << str1 + "\n";
std::cout << str2 + "\n";
myOutput << str1 + "\n";
myOutput << str2 + "\n";
std::cout << str1.compare(str2) << '\n';
//std::cout << myOutput.str() << '\n';
return 0;
}
Output:
ThisIsATestStringOutputtedToAFile
ThisIsATestStringOutputtedToAFile
It turns out that the problem was that the file encoding of myInput was UTF-16, whereas the comparison string was UTF-8. The way to convert them with the OS limitations I had for this project (Linux, C/C++ code), was to use the iconv() functions. To keep the compatibility of the C++ strings I'd been using, I ended up saving the string to a new text file, then running iconv through the system() command.
system("iconv -f UTF-16 -t UTF-8 subStr.txt -o convertedSubStr.txt");
Reading the outputted string back in then gave me the string in the format I needed for the comparison to work properly.
NOTE
I'm aware that this is not the most efficient way to do this. If I'd had the luxury of a Windows environment and the windows.h libraries, things would have been a lot easier. In this case though, the code was in some rarely used unit tests, and as such didn't need to be highly optimized, hence the creation, destruction and I/O operations of some text files weren't an issue.
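For comparison, the same conversion can be done in-process without temporary files or system(). The sketch below is not the iconv approach described above: it is a hand-written decoder that assumes the input is little-endian UTF-16 (with an optional BOM), and the function names are hypothetical. Unpaired surrogates are replaced with U+FFFD.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding of one code point to out.
static void append_code_point(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts raw UTF-16LE bytes to a UTF-8 string.
std::string utf16le_to_utf8(const std::string& bytes) {
    auto unit_at = [&bytes](std::size_t i) -> std::uint32_t {
        return static_cast<unsigned char>(bytes[i]) |
               (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i + 1])) << 8);
    };
    std::string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && unit_at(0) == 0xFEFF) i = 2;  // skip the BOM
    for (; i + 1 < bytes.size(); i += 2) {
        std::uint32_t cp = unit_at(i);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 3 < bytes.size()) {
            std::uint32_t low = unit_at(i + 2);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                // Combine a surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                i += 2;
            }
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;  // unpaired surrogate
        append_code_point(out, cp);
    }
    return out;
}
```

The bytes would need to be read with the stream opened in binary mode, and the substring offsets from the original test would have to be recalculated, since each ASCII character occupies two bytes in the UTF-16 input but one byte after conversion.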