Read till end of a boost memory mapped file in VC++ - c++

I am writing a program in C++ using VS2010 to read a text file and extract certain information from it. I completed the code using file streams and it worked well. However, I am now asked to map the file into memory and use that instead of the file operations.
I am an absolute newbie when it comes to memory mapping. A part of the code I have written is as follows.
boost::iostreams::mapped_file_source apifile;
apifile.open(LogFileName,LogFileSize);
if(!apifile.is_open())
return FILE_OPEN_ERROR;
// Get pointer to the data.
PBYTE Buffer = (PBYTE)apifile.data();
while(//read till end of the file)
{
// read a line and check if it contains a specific word
}
With file streams I would have used eof, getline, and string::find to perform these operations, but I have no idea how to do the same with a memory mapped file.
EDIT 1:
int ProcessLogFile(string file_name)
{
LogFileName = file_name;
apifile.open(LogFileName);//boost::iostreams::mapped_file_source apifile(declared globally)
streamReader.open(apifile, std::ios::binary);//boost::iostreams::stream <boost::iostreams::mapped_file_source> streamReader(declared globally)
streamoff Curr_Offset = 0;
string read_line;
int session_id = 0;
int device_id = 0;
while(!streamReader.eof())
{
// COLLECT OFFSETS OF DIFFERENT SESSIONS
}
streamReader.close();
}
This function worked and I got the offsets to the required structure.
Now after calling this function, I call yet another function as follows:
int GetSystemDetails()
{
streamReader.open(apifile, std::ios::binary);
string read_line;
getline(streamReader,read_line);
cout << "LINE : " << read_line;
streamReader.close();
}
I don't get any data in read_line. Is the memory mapping only valid within a single function? How can I use the same memory mapped file across different functions?

I agree with the people questioning the use of mmap if you just read through the file sequentially.
boost::mapped_file_source models a Device. There are two approaches to using such a Device:
use it raw (using data() as you try)
using a stream wrapper
1. Using the raw Device source
You can use the mapped_file_source directly: data() points at the start of the mapped region, and the valid range extends to data() + size().
Let's take a sample to count lines:
#include <boost/iostreams/device/mapped_file.hpp> // for mmap
#include <cstdint>  // for uintmax_t
#include <cstring>  // for memchr
#include <iostream> // for std::cout
int main()
{
boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
auto f = mmap.const_data();
auto l = f + mmap.size();
uintmax_t m_numLines = 0;
while (f && f!=l)
if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
m_numLines++, f++;
std::cout << "m_numLines = " << m_numLines << "\n";
}
You could possibly adapt this. I have several more complicated parsing examples based on memory mapped files:
Fast textfile reading in c++
Note how the updates there show that open()+read() was actually faster than the memory map, due to the sequential access pattern.
How to parse space-separated floats in C++ quickly?
2. Wrapping the source device in an istream
This gives you all the usual operations of C++ standard streams, so you can detect the end of the file just as you always would:
#include <boost/iostreams/device/mapped_file.hpp> // for mmap
#include <boost/iostreams/stream.hpp> // for stream
#include <cstdint>  // for uintmax_t
#include <iostream> // for std::cout
#include <string>   // for std::string, std::getline
int main()
{
using boost::iostreams::mapped_file_source;
using boost::iostreams::stream;
mapped_file_source mmap("test.cpp");
stream<mapped_file_source> is(mmap, std::ios::binary);
std::string line;
uintmax_t m_numLines = 0;
while (std::getline(is, line))
{
m_numLines++;
}
std::cout << "m_numLines = " << m_numLines << "\n";
}

Related

can write file with zlib but cannot read it back

I want to be able to store data in a data.bin.gz file using zstr (a library that uses zlib). I succeed in writing to the file, but I cannot read it back. Here is a short example.
std::auto_ptr<std::ostream> ofs = std::auto_ptr<std::ostream>(new zstr::ofstream(fileName));
std::string str("hello world");
ofs.get()->write(str.c_str(), 11);
std::cout << "data sent: " << str << std::endl;
std::auto_ptr<std::istream> ifs = std::auto_ptr<std::istream>(new zstr::ifstream(fileName));
std::streamsize buffSize = 11;
char* buff = new char [11];
// fill buff to see if its content change
for (int i = 0; i < 11; i++) {
buff[i] = 'A';
}
ifs.get()->read(buff, buffSize);
std::cout << std::string(buff, buff+11) << std::endl;
delete [] buff;
I fill buff with some specific content to see if it changes when reading the stream, but it does not change.
Here is a version that does approximately what you're asking for, but using standard file streams, not the non-standard zstr library which doesn't seem essential here:
#include <iostream>
#include <fstream>
#include <memory>
#include <string>
#include <vector>
using namespace std::string_literals;
int main()
{
constexpr auto fileName = "test.bin";
{
const auto str = "hello world"s;
auto ofs = std::ofstream( fileName, std::ios::binary );
ofs.write( str.data(), str.size() );
} // ofs is closed here by RAII
auto buff = std::vector<char>(100, 'A');
auto ifs = std::ifstream( fileName, std::ios::binary );
ifs.read(buff.data(), buff.size());
std::cout << std::string(buff.data(), buff.data()+11) << '\n';
}
It outputs hello world as expected. See it live on Coliru.
Notes:
I removed the auto_ptr and added the proper scoping (a sketch applying the same scoping fix to the zstr streams follows these notes).
I do not manage memory manually (new/delete), which is bad form. Instead I use std::vector and std::string.
I added the std::ios::binary flag to the fstream constructors to open in binary mode, since that is what it seems you ultimately want to do. This may not be needed with the zstr library you're using.
I made the buffer larger, as if I don't know what's in the file. Then I read from it as much space as I've allocated. When printing the result, I use the "insider knowledge" that there are 11 valid bytes. An alternative would be to initialize the vector to all zeros (the default) and just print it as a string:
auto buff = std::vector<char>( 100 );
auto ifs = std::ifstream( fileName, std::ios::binary );
ifs.read(buff.data(), buff.size() - 1); // Keep one zero for null terminator
std::cout << buff.data() << '\n';
which you can also see live on Coliru.
I also modernized in a few other ways just for fun and educational purposes:
I use constexpr on a constant known at compile-time.
I use the string literal suffix s on str to create a std::string with greater concision.
I use 'almost always auto' style for declaring objects.
Use \n instead of std::endl because you don't need the extra flush (good habit to be in).
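For reference, here is how the same scoping fix might look with the zstr streams from the question. This is only a sketch: it assumes the header is named zstr.hpp (as in the library's examples) and that zstr's output stream finalizes the compressed data when it is destroyed.
#include <zstr.hpp> // assumed header name for the zstr library
#include <iostream>
#include <string>
#include <vector>
int main()
{
    const std::string fileName = "data.bin.gz";
    {
        zstr::ofstream ofs(fileName);      // compressing output stream
        const std::string str = "hello world";
        ofs.write(str.data(), str.size());
        std::cout << "data sent: " << str << '\n';
    } // ofs destroyed here, so the compressed data is flushed to disk
    zstr::ifstream ifs(fileName);          // decompressing input stream
    auto buff = std::vector<char>(11, 'A');
    ifs.read(buff.data(), buff.size());
    std::cout << std::string(buff.begin(), buff.end()) << '\n';
}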

How to create directory c++ (using _mkdir)

Today I did a lot of research online about how to create a directory in C++ and found a lot of ways to do it, some easier than others.
I tried the _mkdir function, using _mkdir("C:/Users/..."); to create a folder. Note that the argument of the function has to be converted into a const char*.
So far, so good, but when I want to change the path it does not work (see the code below). I have a default string path "E:/test/new", and I want to create 10 sub-folders: new1, new2, ..., new10.
To do that, I concatenate the string with a number (the counter of the for-loop) converted into a char using static_cast, then I transform the string using c_str() and assign it to a const char* variable.
The compiler has no problem compiling it, but it doesn't work: it prints "Impossible create folder n" 10 times. What's wrong?
Did I perhaps make a mistake when transforming the string with c_str() to get a const char*?
Also, is there a way to create a folder using something else? I looked at CreateDirectory(); (the Windows API), but it uses types like DWORD and HANDLE that are a little bit difficult to understand at a non-advanced level (I don't know what they mean).
#include <iostream>
#include <Windows.h>
#include <direct.h>
using namespace std;
int main()
{
int stat;
string path_s = "E:/test/new";
for (int i = 1; i <= 10; i++)
{
const char* path_c = (path_s + static_cast<char>(i + '0')).c_str();
stat = _mkdir(path_c);
if (!stat)
cout << "Folder created " << i << endl;
else
cout << "Impossible create folder " << i << endl;
Sleep(10);
}
return 0;
}
If your compiler supports c++17, you can use filesystem library to do what you want.
#include <filesystem>
#include <string>
#include <iostream>
namespace fs = std::filesystem;
int main(){
const std::string path = "E:/test/new";
for(int i = 1; i <= 10; ++i){
try{
if(fs::create_directory(path + std::to_string(i)))
std::cout << "Created a directory\n";
else
std::cerr << "Failed to create a directory\n";\
}catch(const std::exception& e){
std::cerr << e.what() << '\n';
}
}
return 0;
}
The problem is that (path_s + static_cast<char>(i + '0')) creates a temporary object whose lifetime ends (it is destructed) just after c_str() has been called.
That leaves you with a pointer into a string that no longer exists, and using it in almost any way leads to undefined behavior. (As an aside, static_cast<char>(i + '0') only produces a valid digit for i < 10, which is another reason to prefer std::to_string.)
Instead save the std::string object, and call c_str() just when needed:
std::string path = path_s + std::to_string(i);
_mkdir(path.c_str());
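Putting it together, the loop from the question could look roughly like this (a sketch keeping the original headers and messages):
#include <iostream>
#include <string>
#include <direct.h>
using namespace std;
int main()
{
    const string path_s = "E:/test/new";
    for (int i = 1; i <= 10; i++)
    {
        string path = path_s + to_string(i); // keep the std::string alive while its c_str() is used
        if (_mkdir(path.c_str()) == 0)
            cout << "Folder created " << i << endl;
        else
            cout << "Impossible create folder " << i << endl;
    }
    return 0;
}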
Note that under Linux you can use the mkdir function as follows:
#include <sys/stat.h>
...
const int dir_err = mkdir("foo", S_IRWXU | S_IRWXG | S_IROTH | S_IXOTH);
if (-1 == dir_err){
printf("Error creating directory!n");
exit(1);
}
More information on it can be gleaned from reading man 2 mkdir.

How can I easily and quickly store a big word database?

I am currently working on a school project: developing a spelling checker in C++. For the part that consists of checking whether a word exists, I currently do the following:
I found online a .txt file with all existing English words.
My script starts by going through this text file and placing each of its entries in a map object, for easy access.
The problem with this approach is that when the program starts, step 2) takes approximately 20 seconds. This is not a big deal in itself, but I was wondering if any of you had an idea of another approach to make my database of words quickly available. For instance, would there be a way to store the map object in a file, so that I don't need to rebuild it from the text file every time?
If your file with all English words is not dynamic, you can just store it in a static map. To do so, you need to parse your .txt file, something like:
alpha
beta
gamma
...
to convert it to something like this:
static std::map<std::string,int> wordDictionary = {
{ "alpha", 0 },
{ "beta", 0 },
{ "gamma", 0 }
... };
You can do that programmatically (a sketch follows below) or simply with find and replace in your favourite text editor.
Your .exe is going to be much heavier than before, but it will also start much faster than reading this information from a file.
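If you prefer to generate that header programmatically, a small one-off tool along these lines could do it (a sketch; words.txt and words_generated.h are placeholder names):
#include <fstream>
#include <string>
int main()
{
    std::ifstream in("words.txt");
    std::ofstream out("words_generated.h");
    out << "static std::map<std::string,int> wordDictionary = {\n";
    std::string word;
    while (std::getline(in, word))
        out << "    { \"" << word << "\", 0 },\n"; // one entry per word in the list
    out << "};\n";
}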
I'm a little bit surprised that nobody came up with the idea of serialization yet. Boost provides great support for such a solution. If I understood you correctly, the problem is that it takes too long to read in your list of words (and put them into a data structure that hopefully provides fast look-up operations), whenever you use your application. Building up such a structure, then saving it into a binary file for later reuse should improve the performance of your application (based on the results presented below).
Here is a piece of code (and a minimal working example, at the same time) that might help you out on this.
#include <chrono>
#include <fstream>
#include <iostream>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/set.hpp>
#include "prettyprint.hpp"
class Dictionary {
public:
Dictionary() = default;
Dictionary(std::string const& file_)
: _file(file_)
{}
inline size_t size() const { return _words.size(); }
void build_wordset()
{
if (!_file.size()) { throw std::runtime_error("No file to read!"); }
std::ifstream infile(_file);
std::string line;
while (std::getline(infile, line)) {
_words.insert(line);
}
}
friend std::ostream& operator<<(std::ostream& os, Dictionary const& d)
{
os << d._words; // cxx-prettyprint used here
return os;
}
int save(std::string const& out_file)
{
std::ofstream ofs(out_file.c_str(), std::ios::binary);
if (ofs.fail()) { return -1; }
boost::archive::binary_oarchive oa(ofs);
oa << _words;
return 0;
}
int load(std::string const& in_file)
{
_words.clear();
std::ifstream ifs(in_file);
if (ifs.fail()) { return -1; }
boost::archive::binary_iarchive ia(ifs);
ia >> _words;
return 0;
}
private:
friend class boost::serialization::access;
template <typename Archive>
void serialize(Archive& ar, const unsigned int version)
{
ar & _words;
}
private:
std::string _file;
std::set<std::string> _words;
};
void create_new_dict()
{
std::string const in_file("words.txt");
std::string const ser_dict("words.set");
Dictionary d(in_file);
auto start = std::chrono::system_clock::now();
d.build_wordset();
auto end = std::chrono::system_clock::now();
auto elapsed =
std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Building up the dictionary took: " << elapsed.count()
<< " (ms)" << std::endl
<< "Size of the dictionary: " << d.size() << std::endl;
d.save(ser_dict);
}
void use_existing_dict()
{
std::string const ser_dict("words.set");
Dictionary d;
auto start = std::chrono::system_clock::now();
d.load(ser_dict);
auto end = std::chrono::system_clock::now();
auto elapsed =
std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Loading in the dictionary took: " << elapsed.count()
<< " (ms)" << std::endl
<< "Size of the dictionary: " << d.size() << std::endl;
}
int main()
{
create_new_dict();
use_existing_dict();
return 0;
}
Sorry for not putting the code into separate files and for the poor design; for demonstration purposes it should be just enough.
Note that I didn't use a map: I just don't see the point of storing a lot of zeros or anything else unnecessarily. AFAIK, a std::set is backed by the same red-black tree as std::map.
For the data set available here (it contains around 466k words), I got the following results:
Building up the dictionary took: 810 (ms)
Size of the dictionary: 466544
Loading in the dictionary took: 271 (ms)
Size of the dictionary: 466544
Dependencies:
Boost's Serialization component (however, I used version 1.58).
louisdx/cxx-prettyprint.
Hope this helps. :) Cheers.
First things first: do not use a map (or a set) for storing a word list. Use a vector of strings, make sure its contents are sorted (I would expect your word list is already sorted), and then use std::binary_search from the <algorithm> header to check whether a word is already in the dictionary.
Although this may still be suboptimal (depending on whether your compiler does the small string optimisation), your load times will improve by at least an order of magnitude. Do a benchmark, and if you want to make it faster still, post another question about the vector of strings.
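A sketch of that approach, assuming the word list is one word per line as in the question:
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
int main()
{
    std::vector<std::string> words;
    std::ifstream in("words.txt");
    std::string w;
    while (std::getline(in, w))
        words.push_back(w);
    std::sort(words.begin(), words.end()); // ensure sorted order for binary search
    bool found = std::binary_search(words.begin(), words.end(), std::string("hello"));
    std::cout << (found ? "in dictionary" : "not found") << '\n';
}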

Copy specific number of characters from std::basic_istream to std::string

What is a good, safe way to extract a specific number of characters from a std::basic_istream and store it in a std::string?
In the following program I use a char[] to eventually obtain the result, but I would like to avoid POD types and use something safer and more maintainable:
#include <sstream>
#include <string>
#include <iostream>
#include <exception>
int main()
{
std::stringstream inss{std::string{R"(some/path/to/a/file/is/stored/in/50/chars Other data starts here.)"}};
char arr[50]{};
if (!inss.read(arr,50))
throw std::runtime_error("Could not read enough characters.\n");
//std::string result{arr}; // Will probably copy past the end of arr
std::string result{arr,arr+50};
std::cout << "Path is: " << result << '\n';
std::cout << "stringstream still has: " << inss.str() << '\n';
return 0;
}
Alternatives:
Convert the entire stream to a string up front: std::string{inss.str()}
This seems wasteful as it would make a copy of the entire stream.
Write a template function to accept the char[]
This would still use an intermediate POD array.
Use std::basic_istream::get in a loop to read the required number of characters together with std::basic_string::push_back
The loop seems a bit unwieldy, but it does avoid the array.
Just read it directly into the result string.
#include <sstream>
#include <string>
#include <iostream>
#include <exception>
int main()
{
std::stringstream inss{std::string{R"(some/path/to/a/file/is/stored/in/50/chars Other data starts here.)"}};
std::string result(50, '\0');
if (!inss.read(&result[0], result.size()))
throw std::runtime_error("Could not read enough characters.\n");
std::cout << "Path is: " << result << '\n';
std::cout << "stringstream still has: " << inss.str() << '\n';
return 0;
}
Since C++11, the following guarantee holds about the memory layout of std::string (from cppreference):
The elements of a basic_string are stored contiguously, that is, for a basic_string s, &*(s.begin() + n) == &*s.begin() + n for any n in [0, s.size()), or, equivalently, a pointer to s[0] can be passed to functions that expect a pointer to the first element of a CharT[] array.
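If the stream might contain fewer characters than requested, you can check gcount() after the read and shrink the string to what was actually extracted; a minimal variation of the code above:
#include <sstream>
#include <string>
#include <iostream>
int main()
{
    std::stringstream inss{std::string{"short input"}};
    std::string result(50, '\0');
    inss.read(&result[0], result.size());
    result.resize(static_cast<std::string::size_type>(inss.gcount())); // keep only the characters actually read
    std::cout << "Read " << result.size() << " characters: " << result << '\n';
}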

Run std::function obtained by binary read

I'm developing an application and my idea is to store "apps" in files, like executables. Right now I have this:
AppWriter.c
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <time.h>
#include <functional>
struct PROGRAM
{
std::vector<int> RandomStuff;
std::vector<std::function<void()>> Functions;
std::function<void()> MAIN;
} CODED;
void RANDOMFUNC()
{
srand(time(NULL));
for(int i = 0; i < 40; i++)
CODED.RandomStuff.push_back(rand() % 254);
}
void LOGARRAY()
{
for(int i = 0; i < CODED.RandomStuff.size(); i++)
std::cout << "["<< i + 1 <<"]: "<< CODED.RandomStuff[i] << std::endl;
}
void PROGRAMMAIN()
{
std::cout << "Hello i call random function!" << std::endl;
CODED.Functions[0]();
CODED.Functions[1]();
}
void main()
{
CODED.MAIN = PROGRAMMAIN;
CODED.Functions.push_back(RANDOMFUNC);
CODED.Functions.push_back(LOGARRAY);
std::cout << "Testing MAIN" << std::endl;
CODED.MAIN();
FILE *file = fopen("TEST_PROGRAM.TRI","wb+");
fwrite(&CODED,sizeof(CODED),1,file);
fclose(file);
std::cout << "Program writted correctly!" << std::endl;
_sleep(10000);
}
AppReader.c
#include <iostream>
#include <cstdio>
#include <vector>
#include <time.h>
#include <functional>
struct PROGRAM
{
std::vector<int> RandomStuff;
std::vector<std::function<void()>> Functions;
std::function<void()> MAIN;
} DUMPED;
void main()
{
FILE *file = fopen("TEST_PROGRAM.TRI","rb+");
fseek(file,0,SEEK_END);
int program_len = ftell(file);
rewind(file);
fread(&DUMPED,sizeof(PROGRAM),1,file);
std::cout
<< "Function array size: " << DUMPED.Functions.size() << std::endl
<< "Random Stuff Array size: " << DUMPED.RandomStuff.size() << std::endl;
DUMPED.MAIN();
}
When I run AppReader the functions don't work (maybe because std::function is something like a void pointer?). For the arrays, or if I add plain variables, I can see with the debugger that the data is stored correctly (that's why I tried the vector of functions), but whatever I do, calling the functions throws an error inside the functional header. Any ideas how I can do this?
This is never going to work. At all. Ever. std::function is a complex type. Binary reads and writes don't work for complex types. They never can. You would have to ask for functions in a pre-defined serializable format, like LLVM IR.
Your problem is that you're storing information about functions that exist in one executable, then trying to run them in a separate executable. Other than that, your code does work, but as DeadMG says, you shouldn't be storing complex types in a file. Here's how I modified your code to prove that your code works if run within a single executable:
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <time.h>
#include <functional>
struct PROGRAM
{
std::vector<int> RandomStuff;
std::vector<std::function<void()>> Functions;
std::function<void()> MAIN;
} CODED;
void RANDOMFUNC()
{
srand(time(NULL));
for(int i = 0; i < 40; i++)
CODED.RandomStuff.push_back(rand() % 254);
}
void LOGARRAY()
{
for(int i = 0; i < CODED.RandomStuff.size(); i++)
std::cout << "["<< i + 1 <<"]: "<< CODED.RandomStuff[i] << std::endl;
}
void PROGRAMMAIN()
{
std::cout << "Hello i call random function!" << std::endl;
CODED.Functions[0]();
CODED.Functions[1]();
}
int main()
{
CODED.MAIN = PROGRAMMAIN;
CODED.Functions.push_back(RANDOMFUNC);
CODED.Functions.push_back(LOGARRAY);
std::cout << "Testing MAIN" << std::endl;
CODED.MAIN();
FILE *file = fopen("TEST_PROGRAM.TRI","wb+");
fwrite(&CODED,sizeof(CODED),1,file);
fclose(file);
std::cout << "Program writted correctly!" << std::endl;
// _sleep(10000);
std::cout << "---------------------\n";
file = fopen("TEST_PROGRAM.TRI","rb+");
fseek(file,0,SEEK_END);
int program_len = ftell(file);
rewind(file);
fread(&CODED,sizeof(PROGRAM),1,file);
std::cout
<< "Function array size: " << CODED.Functions.size() << std::endl
<< "Random Stuff Array size: " << CODED.RandomStuff.size() << std::endl;
CODED.MAIN();
}
The problem is not that you're storing complex types via binary read/write, per se. (Although that is a problem, it's not the cause of the problem you posted this question about.) Your problem is that your data structures are storing information about the functions that exist in your 'writer' executable. Those same functions don't even exist in your 'reader' executable, but even if they did, they likely wouldn't be at the same address. Your data structures are storing, via std::function, pointers to the addresses where the functions exist in your 'writer' executable. When you try to call these non-existent functions in your 'reader' executable, your code happily tries to call them, but you get a segfault (or whatever error your OS gives) because that address is not the start of a valid function in your 'reader' executable.
Now with regard to writing complex types (e.g. std::vector) directly to a file in binary format: doing so "works" in the sample code above only because the binary copies of the std::vectors contain pointers that, once read back in, still point to valid data from the original std::vectors which you wrote out. Note that you didn't write the std::vectors' actual data; you only wrote their metadata, which probably includes things like the length of the vector, the amount of memory currently allocated for the vector, and a pointer to the vector's data. When you read that back, the metadata is correct except for one thing: any pointers in it refer to addresses that were valid when you wrote the data, but which may not be valid now. In the case of the sample code above, the pointers end up pointing to the same (still valid) data from the original vectors. But there's still a problem: you now have more than one std::vector that thinks it owns that memory. When one of them is deleted, it will delete the memory that the other vector expects to still exist, and when the other vector is deleted, it will cause a double delete. That opens the door to all kinds of UB: the memory could have been allocated for another purpose by that time, and the second delete will then free that other purpose's memory; or the memory has NOT been reallocated and the second delete may corrupt the heap. To fix this, you'd have to serialize out the essence of each vector, rather than its binary representation, and when reading it back in, you'd have to reconstruct an equivalent copy, rather than simply reconstitute a copy from the binary image of the original.
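To illustrate that last point, serializing the essence of a std::vector<int> (its length and elements) rather than its binary image could look roughly like this sketch, using the same C stdio calls as the question; the std::function members cannot be meaningfully saved this way at all, since they refer to code addresses:
#include <cstdio>
#include <cstdint>
#include <vector>
void save_ints(FILE* f, const std::vector<int>& v)
{
    std::uint64_t n = v.size();
    fwrite(&n, sizeof n, 1, f);                 // write the element count first...
    fwrite(v.data(), sizeof(int), v.size(), f); // ...then the elements themselves
}
std::vector<int> load_ints(FILE* f)
{
    std::uint64_t n = 0;
    fread(&n, sizeof n, 1, f);                  // read the count back
    std::vector<int> v(n);
    fread(v.data(), sizeof(int), n, f);         // reconstruct an equivalent vector
    return v;
}
int main()
{
    FILE* f = fopen("ints.bin", "wb");
    save_ints(f, {1, 2, 3});
    fclose(f);
    f = fopen("ints.bin", "rb");
    std::vector<int> v = load_ints(f);
    fclose(f);
    std::printf("read back %zu ints\n", v.size());
}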
Now with regard to writing complex types (e.g. std::vector) directly to a file in binary format: Doing so "works" in the sample code above only because the binary copies of the std::vectors have pointers that, once read back in, still point to valid data from the original std::vectors which you wrote out. Note that you didn't write the std::vector's actual data, you only wrote their metadata, which probably includes things like the length of the vector, the amount of memory currently allocated for the vector, and a pointer to the vector's data. When you read that back, the metadata is correct except for one thing: Any pointers in it are pointing to addresses that were valid when you wrote the data, but which may not be valid now. In the case of the sample code above, the pointers end up pointing to the same (still valid) data from the original vectors. But there's still a problem here: You now have more than one std::vector that thinks they own that memory. When one of them is deleted, it will delete the memory that the other vector expects to still exist. And when the other vector is deleted, it will cause a double-delete. That opens the door to all kinds of UB. E.g. that memory could have been allocated for another purpose by that time, and now the 2nd delete will delete that other purpose's memory, or else the memory has NOT been allocated for another purpose and the 2nd delete may corrupt the heap. To fix this, you'd have to serialize out the essence of each vector, rather than their binary representation, and when reading it back in, you'd have to reconstruct an equivalent copy, rather than simply reconstitute a copy from the binary image of the original.