ofstream overwriting stack variables - c++

I'm working on a "threadpool" program for an operating systems class. Essentially, files are extracted from a tar file and written to disk using a pool of 5 threads. Here's my thread code:
#include <iostream>
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>
#include <pthread.h>

using namespace std;

// Header is populated elsewhere from the tar metadata.
struct Header
{
    string fileName;
    size_t fileSize;
    int userId;
    int groupId;
    int fileMode;
};

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
vector<Header*> headers;
vector<string> fileBlocks;

void* writeExtractedFiles(void* args)
{
    bool hasFilesLeft = true;
    ofstream outputFile;
    while(hasFilesLeft)
    {
        pthread_mutex_lock(&mutex);
        if(headers.size() != 0)
        {
            Header* hdr = headers.back();
            headers.pop_back();
            string fileBytes = fileBlocks.back();
            fileBlocks.pop_back();
            pthread_mutex_unlock(&mutex);

            outputFile.open(hdr->fileName.c_str(), ios::app);
            outputFile.rdbuf()->pubsetbuf(0,0);
            fileBytes = fileBytes.substr(0, hdr->fileSize);
            outputFile.put('0');
            outputFile.close();

            // This is a dummy object to check if the values are corrupted
            Header* test0 = headers.back();
            cout << "GRAWWR!";
            //chown(hdr->fileName.c_str(), hdr->userId, hdr->groupId);
            //chmod(hdr->fileName.c_str(), hdr->fileMode);
        }
        else
        {
            // We're done!
            hasFilesLeft = false;
            pthread_mutex_unlock(&mutex);
        }
    }
    return NULL;
}
Note: As of right now, I'm only testing it with a single thread. Obviously, accessing the headers vector outside of my mutex would be counterproductive with multiple threads.
Problem is, the values for test0 are all messed up: super-high numbers and nonsense for fileName. It seems like I'm overwriting my stack variables for some reason. When I comment out outputFile.close();, my variable values aren't changed, but when I keep it, whether I actually write things to the file or not, things get wonky. I know there must be something I'm missing. I've tried getting rid of the buffer altogether, writing the file in a different place, anything I could think of. Any suggestions?
(I'm testing it on a Windows machine but it's being made for Linux)

Related

Speed-up a single task using multi-threading in c++

I'm sorry if this is a repeat question, but I already tried to search for an answer and came up empty-handed.
I have code to transfer data (189156 numbers) from one text file (input.txt) to another file (test.txt). After executing the program, the process takes about 23 seconds to transfer all the data from input.txt to test.txt.
I wanted to speed up the process, so I divided it across multiple threads (4 threads), each thread processing 1/4 of the data. After executing the program, there was no difference in the time it took to transfer all the data.
Here is my code:
// This program reads data from a file into an array.
#include <iostream>
#include <fstream>  // To use ifstream
#include <unistd.h> // To use _exit
#include <vector>
#include <thread>

using namespace std;

void test(int start, int end)
{
    std::vector<int> numbers;
    ifstream inputFile("input.txt"); // Input file stream object

    // Check if exists and then open the file.
    if (inputFile.good()) {
        // Push items into a vector
        int current_number = 0;
        while (inputFile >> current_number) {
            numbers.push_back(current_number);
        }

        // Close the file.
        inputFile.close();

        // Display the numbers read:
        cout << "The numbers are: ";
        for (int count = start; count < end; count++) {
            cout << numbers[count] << " ";
            std::ofstream ofs;
            ofs.open("test.txt", std::ofstream::out | std::ofstream::app);
            ofs << numbers[count] << endl;
            ofs.close();
        }
        cout << endl;
    }
    else {
        cout << "Error!";
        _exit(0);
    }
}

int main() {
    std::thread worker1(test, 0, 50000);
    std::thread worker2(test, 50000, 100000);
    std::thread worker3(test, 100000, 150000);
    std::thread worker4(test, 150000, 189156);

    worker1.join();
    worker2.join();
    worker3.join();
    worker4.join();
    return 0;
}
I am a beginner, so I do not know if it is correct to use multiple threads in such a case. If so, where is my mistake, and if not, what is the correct way to speed up the process?
There is a big race condition in the code that not only prevents the code from being fast, but should also produce wrong results (possibly non-deterministically). Indeed, all threads can write to the same file "test.txt" simultaneously. While this operation may be thread-safe on the target system, the order in which the threads append data to the target file is undefined, so the result can be shuffled. The appends have to be serialized, and even where they are thread-safe, they are typically protected with a lock that prevents any parallel execution.
Additionally, the open+write+close sequence should be extremely slow since it results in 3 system calls per line, and system calls are generally slow, especially IO ones.
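For illustration, simply opening the stream once outside the loop removes two of those system calls per number; a minimal sketch using the question's own variables:

// Open once, write many times, close once (when ofs goes out of scope),
// instead of paying open+write+close for every single number.
std::ofstream ofs("test.txt");
for (int count = start; count < end; count++)
    ofs << numbers[count] << '\n'; // '\n' also avoids endl's per-line flush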
That being said, you cannot use one ofstream object from multiple threads without protection, since doing so would cause an even bigger undefined behaviour. Indeed, here is what the C++ standard explicitly states:
Concurrent access to a stream object [string.streams, file.streams], stream buffer object [stream.buffers], or C Library stream [c.files] by multiple threads may result in a data race [intro.multithread] unless otherwise specified [iostream.objects]. [Note: Data races result in undefined behavior [intro.multithread]. --end note]
An efficient solution is to do a per-thread reduction: each thread appends its data to a thread-local ostringstream, so the integer-to-string serialization happens in a big in-memory buffer, and the buffers are then written out in a serialized way (so the order is the same as in the sequential program). The serialization is sped up by the use of multiple threads while the IO part stays sequential. In practice, the serialization part is pretty slow, so using multiple threads for it should significantly reduce the execution time.
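Here is a minimal sketch of that idea; the helper names serialize_chunk and write_all are mine, not from the question:

#include <cstddef>
#include <fstream>
#include <functional>
#include <sstream>
#include <thread>
#include <vector>

// Each worker serializes its slice of `numbers` into its own buffer;
// no stream is shared between threads, so no lock is needed here.
void serialize_chunk(const std::vector<int>& numbers, std::size_t start,
                     std::size_t end, std::ostringstream& out)
{
    for (std::size_t i = start; i < end; ++i)
        out << numbers[i] << '\n';
}

void write_all(const std::vector<int>& numbers, unsigned num_threads)
{
    std::vector<std::ostringstream> buffers(num_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = numbers.size() / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t start = t * chunk;
        std::size_t end = (t == num_threads - 1) ? numbers.size() : start + chunk;
        workers.emplace_back(serialize_chunk, std::cref(numbers), start, end,
                             std::ref(buffers[t]));
    }
    for (auto& w : workers)
        w.join();

    // Sequential IO: the buffers are flushed in thread order, so the output
    // matches what the single-threaded program would produce.
    std::ofstream ofs("test.txt");
    for (auto& b : buffers)
        ofs << b.str();
}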
There is another big issue: the input file is entirely read by each thread! This means the 4 threads overall do 4 times more work than a single thread, which completely defeats the benefit of using multiple threads. You need to split the input file into relatively equal parts and then perform the computation. This is not so easy, since the line delimiter has to be taken into account.
One solution to this problem is to first retrieve the size of the file and then divide the 0..size range into N parts, where N is the number of workers. The split ranges then need to be corrected so that they reference the beginning of a line. You can do this correction by reading a line at the starting location of each range and then adjusting the start/end locations accordingly (you just need to add the size of the line read). Once corrected, each worker can operate on a completely independent part of the file and read it in parallel (using a different ifstream object, like you did).
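A rough sketch of that correction; the helper name align_to_line is made up for illustration:

#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Move `offset` forward to the start of the next line, unless it already
// sits at the very beginning of the file.
std::streampos align_to_line(std::ifstream& in, std::streampos offset)
{
    if (offset == std::streampos(0))
        return offset;
    in.clear();
    in.seekg(offset);
    std::string partial;
    std::getline(in, partial); // skip the (possibly partial) current line
    return in.tellg();
}

// Divide [0, file_size) into n ranges whose boundaries fall on line starts.
std::vector<std::pair<std::streampos, std::streampos>>
split_ranges(const std::string& path, int n)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(0, std::ios::end);
    std::streampos size = in.tellg();

    std::vector<std::pair<std::streampos, std::streampos>> ranges;
    std::streampos prev = 0;
    for (int i = 1; i <= n; ++i) {
        std::streampos end = (i == n)
            ? size
            : align_to_line(in, std::streampos(std::streamoff(size) * i / n));
        ranges.emplace_back(prev, end);
        prev = end;
    }
    return ranges;
}

Each worker then seeks to its range start with its own ifstream and stops once it reaches its range end.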

Data not written with ofstream, even though success is returned

I'm writing a program which fetches a large number of email files using libcurl, writes each file to disk, and then generates a receipt.
My problem is that, whilst most of the receipts seem to get written, the majority of the emails aren't written to disk. Worse, even though the file doesn't get written, ofstream returns success - so the receipt gets written even if the file write didn't complete successfully.
My guess is that, because ofstream is asynchronous, if a write doesn't complete in time then it'll get dropped on the floor - only a certain number of writes being possible concurrently. I am just guessing here.
Perhaps I need to refactor my code to write synchronously - but I can't believe that that's necessary. Does anyone have any idea how I can make this work?
The email sizes range from a few KBytes to a couple of MBytes.
int write_file(string filename, string mail_item) {
    ofstream out(filename.c_str());
    out << mail_item;
    out.close();
    out.flush();
    if (!out) {
        return FUNCTION_FAILED;
    }
    return FUNCTION_SUCCESS;
}
This is part of another function, and has been cut out so that only the salient code for this question is shown.
vector<string> directory = curl_listroot(curl);
for (int i=0; i<directory.size(); i++) {
    vector<int> mail_list = curl_search(curl, directory[i], make_vector<string>() << "SEEN" << "RECENT" << "NEW" << "ANSWERED" << "FLAGGED");
    for (int j=0; j<mail_list.size(); j++) {
        curl_reset(curl, imap.username, imap.password);
        string mail_item = curl_fetch(curl, directory[i], mail_list[j]);
        if (mail_item.compare("") != 0) {
            string m_id = getMessageID(mail_item);
            string filename = save_path+"/"+RECEIPTNAME+"/"+clean_filename(m_id) + ".eml";
            if (!file_exists(filename)) {
                string real_filename;
                real_filename = save_path+"/"+INBOXNAME+"/"+clean_filename(m_id) + ".eml";
                int success = write_file(real_filename, mail_item);
                if (success == FUNCTION_SUCCESS) {
                    write_file(filename, ""); // write empty receipt
                }
            }
        }
    }
}
All suggestions gratefully received! Thank you!
Okay. I've found an answer - there may be better answers - but this one works for me. The problem seems to be in the OS (Linux, in this case): ofstream completes, having handed responsibility for writing the file to the OS, but the file hasn't actually been written yet (so whilst ofstream may be synchronous, the end-to-end write of the file, from data to file safely written to disk, isn't). Given that I'm banging away with a huge number of writes in quick succession (potentially thousands), this won't necessarily work. The OS may throw its hands in the air and drop a significant number of the file writes on the floor (hence my original request for a synchronous way of writing the files, end to end).
My solution is to pause after each write to give the OS time to catch up. It's inelegant though, and not as performant as it should be - it doesn't take half a second to write an empty file. Additionally, on slow storage, half a second might not be enough time. I'd welcome any clever suggestions for how to improve my code.
#include <unistd.h> // usleep() lives here

int write_file(string filename, string mail_item) {
    ofstream out(filename.c_str());
    if (!out) {
        return FUNCTION_FAILED;
    }
    out << mail_item << endl;
    out.flush();
    usleep(500000); // wait for half a second to give the OS time to output the file
    if (!out) {
        return FUNCTION_FAILED;
    }
    out.close();
    if (!out) {
        return FUNCTION_FAILED;
    }
    return FUNCTION_SUCCESS;
}
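For what it's worth, a common alternative to a fixed sleep is to ask the OS explicitly to push the data to stable storage. Standard ofstream exposes no such call, so this sketch (an assumption on my part, not part of the answer above) drops down to POSIX file descriptors and fsync():

#include <fcntl.h>  // open()
#include <string>
#include <unistd.h> // write(), fsync(), close()

// Sketch: write the whole buffer, then fsync() to block until the kernel
// has pushed the data to disk - no fixed sleep needed.
int write_file_sync(const std::string& filename, const std::string& mail_item) {
    int fd = ::open(filename.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        return FUNCTION_FAILED;
    }
    const char* p = mail_item.data();
    size_t remaining = mail_item.size();
    while (remaining > 0) {
        ssize_t n = ::write(fd, p, remaining);
        if (n < 0) {
            ::close(fd);
            return FUNCTION_FAILED;
        }
        p += n;
        remaining -= static_cast<size_t>(n);
    }
    int rc = ::fsync(fd); // blocks until the data actually reaches the disk
    ::close(fd);
    return rc == 0 ? FUNCTION_SUCCESS : FUNCTION_FAILED;
}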

C++ Keylog Not Working Properly

I'm trying to make a simple keylogger in C++ (for learning only) and it's not quite working how I would like it to. My goal is to have it write to a txt. Here's the code I have so far:
#include <iostream>
#include <fstream>
#include <conio.h>

#define LOG(x) logger << x;

int main()
{
    using std::ofstream;
    using std::fstream;

    ofstream logger("logger.txt", fstream::app);
    char ascii;
    bool typing;

    for(;;)
    {
        if(_kbhit())
        {
            typing = true;
            ascii = getch();
            while(typing == true) // tried 'if', doesn't work
            {
                LOG(ascii);
                std::cout << ascii << std::endl;
                //typing = false;
                //break
                // tried using the above two and didn't work
            }
        }
        else typing = false;
    }
    logger.close();
}
When I make while(typing == true) continuous, the key that is pressed continuously gets printed, but at least it actually gets saved to the txt. When I try to make the loop stop after one keyboard click, nothing gets saved to the txt.
So what am I doing wrong? Thanks for any help!
The variable typing is never set to false, so it stays true and your loop continues. The following code works:
#include <fstream>
#include <conio.h>

int main()
{
    std::ofstream logger("logger.txt", std::fstream::app);
    for (char ascii = 0; ascii != 3;) // initialize ascii so the first test isn't undefined
    {
        ascii = getche();
        logger << ascii;
    }
    return 0;
}
getche() prints the character typed, and 3 is the ASCII code for Ctrl+C. This logs all characters, even non-printable ones.
A few comments on your code:
Don't use macros (#define) unless you are substituting a large amount of code and using it often, or plan on changing what something does.
You use loops and variables where you don't need to. getch and related functions wait for input.
logger.close() is automatically done when logger goes out of scope and is destructed.
return 0 should be at the end of main. It's not strictly necessary, since the compiler inserts it automatically, but writing it out makes the return to the OS explicit and aids clarity.
I personally don't use using statements. Just write out the namespace; it helps avoid collisions, which is why the names are in a namespace in the first place.

C++ iostream binary read and write issues

Right, please bear with me as I have two separate attempts I'll cover below.
I first started off reading the guide here (http://www.cplusplus.com/doc/tutorial/files/). However whilst it contains what appears to be a good example of how to use read(), it does not contain an example of how to use write().
I first attempted to store a simple char array in binary using write(). My original idea (and hope) was that I could append to this file with new entries using ios::app. Originally this appeared to work, but I was getting junk output as well. A post on another forum for help suggested I lacked a null terminator on the end of my char array. I applied this (or at least attempted to based on how I was shown) as can be seen in the example below. Unfortunately, this meant that read() no longer functioned properly because it won't read past the null terminator.
I was also told that doing char *memoryBlock is 'abuse' of C++ standard or something, and is unsafe, and that I should instead define an array of an exact size, ie char memoryBlock[5], however what if I wish to write char data to a file that could be of any size? How do I proceed then? The code below includes various commented out lines of code indicating various attempts I have made and different variations, including some of the suggestions I mentioned above. I do wish to try and use good-practice code, so if char *memoryBlock is unsafe, or any other lines of code, I wish to amend this.
I would also like to clarify that I am trying to write chars here for testing purposes only, so please do not suggest that I should write in text mode rather than binary mode instead. I'll elaborate further in the second part of this question under the code below.
First code:
#include <cstdlib>
#include <iostream>
#include <fstream>
//#include <string>

int main()
{
    //char memoryBlock[5];
    char *memoryBlock;
    char *memoryBlockTwo;
    std::ifstream::pos_type size; // The number of characters to be read or written from/to the memory block.
    std::ofstream myFile;
    myFile.open("Example", std::ios::out | /*std::ios::app |*/ std::ios::binary);
    if(myFile.is_open() && myFile.good())
    {
        //myFile.seekp(0,std::ios::end);
        std::cout<<"File opening successfully completed."<<std::endl;
        memoryBlock = "THEN";
        //myFile.write(memoryBlock, (sizeof(char)*4));
        //memoryBlock = "NOW THIS";
        //strcpy_s(memoryBlock, (sizeof(char)*5),"THIS");
        //memoryBlock = "THEN";
        //strcpy(memoryBlock, "THIS");
        //memoryBlock[5] = NULL;
        myFile.write(memoryBlock, (sizeof(char)*5));
    }
    else
    {
        std::cout<<"File opening NOT successfully completed."<<std::endl;
    }
    myFile.close();

    std::ifstream myFileInput;
    myFileInput.open("Example", std::ios::in | std::ios::binary | std::ios::ate);
    if(myFileInput.is_open() && myFileInput.good())
    {
        std::cout<<"File opening successfully completed. Again."<<std::endl;
        std::cout<<"READ:"<<std::endl;
        size = myFileInput.tellg();
        memoryBlockTwo = new char[size];
        myFileInput.seekg(0, std::ios::beg); // Get a pointer to the beginning of the file.
        myFileInput.read(memoryBlockTwo, size);
        std::cout<<memoryBlockTwo<<std::endl;
        delete[] memoryBlockTwo;
        std::cout<<std::endl<<"END."<<std::endl;
    }
    else
    {
        std::cout<<"Something has gone disasterously wrong."<<std::endl;
    }
    myFileInput.close();
    return 0;
}
My next attempt works on the basis that using ios::app with ios::binary simply won't work, and that to amend a file I must read the entire thing in, make my alterations, then write back and replace the entire contents of the file, although this does seem somewhat inefficient.
However, I don't read in and amend contents in my code below. What I am actually trying to do is write an object of a custom class to a file, then read it back out again intact.
This seems to work (although if I'm doing anything bad code-wise here, please point it out), HOWEVER, I am seemingly unable to store variables of type std::string and std::vector, because I get access violations when I reach myFileInput.close(). With those member variables commented out, the access violation does not occur. My best guess as to why this happens is that they use pointers to other pieces of memory to store their data, and I am not writing that data itself to my file but the pointers to it, which happen to still be valid when I read my data out.
Is it possible at all to store the contents of these more complex datatypes in a file? Or must I break everything down in to more basic variables such as chars, ints and floats?
Second code:
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>

class testClass
{
public:
    testClass()
    {
        testInt = 5;
        testChar = 't';
        //testString = "Test string.";
        //testVector.push_back(3.142f);
        //testVector.push_back(0.001f);
    }
    testClass(int intInput, char charInput, std::string stringInput, float floatInput01, float floatInput02)
    {
        testInt = intInput;
        testChar = charInput;
        testArray[0] = 't';
        testArray[1] = 'e';
        testArray[2] = 's';
        testArray[3] = 't';
        testArray[4] = '\0';
        //testString = stringInput;
        //testVector = vectorInput;
        //testVector.push_back(floatInput01);
        //testVector.push_back(floatInput02);
    }
    ~testClass()
    {}
private:
    int testInt;
    char testChar;
    char testArray[5];
    //std::string testString;
    //std::vector<float> testVector;
};

int main()
{
    testClass testObject(3, 'x', "Hello there!", 9.14f, 6.662f);
    testClass testReceivedObject;
    //char memoryBlock[5];
    //char *memoryBlock;
    //char *memoryBlockTwo;
    std::ifstream::pos_type size; // The number of characters to be read or written from/to the memory block.
    std::ofstream myFile;
    myFile.open("Example", std::ios::out | /*std::ios::app |*/ std::ios::binary);
    if(myFile.is_open() && myFile.good())
    {
        //myFile.seekp(0,std::ios::end);
        std::cout<<"File opening successfully completed."<<std::endl;
        //memoryBlock = "THEN";
        //myFile.write(memoryBlock, (sizeof(char)*4));
        //memoryBlock = "NOW THIS";
        //strcpy_s(memoryBlock, (sizeof(char)*5),"THIS");
        //memoryBlock = "THEN AND NOW";
        //strcpy(memoryBlock, "THIS");
        //memoryBlock[5] = NULL;
        myFile.write(reinterpret_cast<char*>(&testObject), (sizeof(testClass))); //(sizeof(char)*5));
    }
    else
    {
        std::cout<<"File opening NOT successfully completed."<<std::endl;
    }
    myFile.close();

    std::ifstream myFileInput;
    myFileInput.open("Example", std::ios::in | std::ios::binary | std::ios::ate);
    if(myFileInput.is_open() && myFileInput.good())
    {
        std::cout<<"File opening successfully completed. Again."<<std::endl;
        std::cout<<"READ:"<<std::endl;
        size = myFileInput.tellg();
        //memoryBlockTwo = new char[size];
        myFileInput.seekg(0, std::ios::beg); // Get a pointer to the beginning of the file.
        myFileInput.read(reinterpret_cast<char *>(&testReceivedObject), size);
        //std::cout<<memoryBlockTwo<<std::endl;
        //delete[] memoryBlockTwo;
        std::cout<<std::endl<<"END."<<std::endl;
    }
    else
    {
        std::cout<<"Something has gone disasterously wrong."<<std::endl;
    }
    myFileInput.close();
    return 0;
}
I apologise for the long-windedness of this question, but I am hoping that my thoroughness in providing as much information as I can will hasten the appearance of answers, even though this may turn out to be a simple issue to fix (I have searched for hours trying to find solutions), as time is a factor here. I will be monitoring this question throughout the day to provide clarifications in aid of an answer.
In the first example, I'm not sure what you are writing out, as the initialization of memoryBlock is commented out and it is never set to anything. When you read the data back in, since you are using std::cout to display it on the console, it MUST be null-terminated or you will print beyond the end of the memory buffer allocated for memoryBlockTwo.
Either write the terminating null to the file:
memoryBlock = "THEN"; // 4 chars + implicit null terminator
myFile.write(memoryBlock, (sizeof(char)*5));
And/or, ensure the buffer is terminated after it is read:
myFileInput.read(memoryBlockTwo, size);
memoryBlockTwo[size - 1] = '\0';
In your second example, don't do that with C++ objects. You are circumventing necessary constructor calls, and if you try it with the vectors you have commented out, it certainly won't work the way you expect. If the class is plain old data (no virtual functions, no pointers to other data) you will likely be OK, but it's still really bad practice. When persisting C++ objects, consider overloading the << and >> operators instead.
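A minimal sketch of that approach, using a stripped-down version of the testClass above (the text representation and reduced field set are my own illustration, not the only way to do it):

#include <fstream>
#include <iostream>
#include <string>

class testClass
{
public:
    testClass() : testInt(0), testChar(' ') {}
    testClass(int i, char c, std::string s)
        : testInt(i), testChar(c), testString(s) {}

    // Serialize field by field; std::string is written via its characters,
    // never via the raw bytes of the object.
    friend std::ostream& operator<<(std::ostream& os, const testClass& t)
    {
        return os << t.testInt << '\n' << t.testChar << '\n'
                  << t.testString << '\n';
    }

    friend std::istream& operator>>(std::istream& is, testClass& t)
    {
        is >> t.testInt >> t.testChar;
        is.ignore(); // skip the newline before the string
        std::getline(is, t.testString);
        return is;
    }

private:
    int testInt;
    char testChar;
    std::string testString;
};

int main()
{
    {
        std::ofstream out("Example");
        out << testClass(3, 'x', "Hello there!");
    } // scope closes the file before it is read back
    testClass restored;
    std::ifstream in("Example");
    in >> restored;
    std::cout << restored;
    return 0;
}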

searching for hundreds of patterns in huge Logfiles

I have to get lots of filenames from inside a webserver's htdocs directory and then take this list of filenames to search a huge amount of archived logfiles for last access on these files.
I plan to do this in C++ with Boost. I would take newest log first and read it backwards checking every single line for all of the filenames I got.
If a filename matches, I read the time from the log string and save its last access. Now I don't need to look for this file any more, as I only want to know the last access.
The vector of filenames to search for should rapidly decrease.
I wonder how I can handle this kind of problem most effectively with multiple threads.
Do I partition the logfiles and let every thread search a part of the logs from memory, removing a filename from the filenames vector when a thread has a match, or is there a more effective way to do this?
Try using mmap; it will save you considerable hair loss. I was feeling expeditious and in some odd mood to recall my mmap knowledge, so I wrote a simple thing to get you started. Hope this helps!
The beauty of mmap is that it can be easily parallelized with OpenMP. It's also a really good way to prevent an I/O bottleneck. Let me first define the Logfile class and then I'll go over implementation.
Here's the header file (logfile.h)
#ifndef _LOGFILE_H_
#define _LOGFILE_H_

#include <iostream>
#include <fcntl.h>
#include <stdio.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

using std::string;

class Logfile {
public:
    Logfile(string title);
    char* open();
    unsigned int get_size() const;
    string get_name() const;
    bool close();
private:
    string name;
    char* start;
    unsigned int size;
    int file_descriptor;
};

#endif
And here's the .cpp file.
#include <iostream>
#include "logfile.h"

using namespace std;

Logfile::Logfile(string title){
    name = title;
    start = NULL;
    size = 0;
    file_descriptor = -1;
}

char* Logfile::open(){
    // get file size
    struct stat st;
    stat(name.c_str(), &st);
    size = st.st_size;

    // get file descriptor (the :: is needed so we call the POSIX open(),
    // not this member function)
    file_descriptor = ::open(name.c_str(), O_RDONLY);
    if(file_descriptor < 0){
        cerr << "Error obtaining file descriptor for: " << name.c_str() << endl;
        return NULL;
    }

    // memory map part
    start = (char*) mmap(NULL, size, PROT_READ, MAP_SHARED, file_descriptor, 0);
    if(start == MAP_FAILED){
        cerr << "Error memory-mapping the file\n";
        ::close(file_descriptor);
        start = NULL;
        return NULL;
    }
    return start;
}

unsigned int Logfile::get_size() const {
    return size;
}

string Logfile::get_name() const {
    return name;
}

bool Logfile::close(){
    if(start == NULL){
        cerr << "Error closing file. Was close() called without a matching open()?\n";
        return false;
    }
    // unmap memory and close file
    bool ret = munmap(start, size) != -1 && ::close(file_descriptor) != -1;
    start = NULL;
    return ret;
}
Now, using this code, you can use OpenMP to work-share the parsing of these logfiles, i.e.
Logfile lf ("yourfile");
char * log = lf.open();
int size = (int) lf.get_size();
#pragma omp parallel shared(log, size) private(i)
{
#pragma omp for
for (i = 0 ; i < size ; i++) {
// do your routine
}
#pragma omp critical
// some methods that combine the thread results
}
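As an illustration only (not part of the original answer), the "do your routine" placeholder could scan the mapped buffer for one filename; "index.html" below is a made-up example pattern:

#include <cstring> // std::memcmp
#include <string>

const std::string needle = "index.html";
const int limit = size - (int) needle.size();
int total_hits = 0;

#pragma omp parallel
{
    int local_hits = 0; // private partial result; no locking in the hot loop

    #pragma omp for
    for (int i = 0; i <= limit; i++) {
        if (std::memcmp(log + i, needle.c_str(), needle.size()) == 0)
            local_hits++;
    }

    #pragma omp critical
    {
        total_hits += local_hits; // merge the thread-local counts
    }
}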
Parsing the logfile into a database table (SQLite ftw). One of the fields will be the path.
In another table, add the files you are looking for.
Now it is a simple join on a derived table. Something like this.
SELECT l.file, l.last_access FROM toFind f
LEFT JOIN (
SELECT file, max(last_access) as last_access from logs group by file
) as l ON f.file = l.file
All the files in toFind will be there, and will have last_access NULL for those not found in the logs.
Ok, this was some days ago already, but I've since spent some time writing code and working with SQLite in other projects.
I still wanted to compare the DB approach with the mmap solution, just for the performance aspect.
Of course it saves you a lot of work if you can use SQL queries to handle all the data you parsed. But I really didn't care about the amount of work, because I'm still learning a lot, and what I learned from this is:
The mmap approach - if you implement it correctly - is absolutely superior in performance. It's unbelievably fast, which you will notice if you implement the "word count" example, which can be seen as the "hello world" of MapReduce algorithms.
Now, if you further want to benefit from the SQL language, the correct approach would be implementing your own SQL wrapper that uses a kind of map-reduce too, by sharing queries amongst threads.
You could perhaps share objects by ID amongst threads, where every thread handles its own DB connection. It then queries objects in its own part of the dataset.
This would be much faster than just writing things to the SQLite DB the usual way.
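A bare-bones sketch of that per-thread-connection pattern (the table and column names, and the ID ranges, are invented here purely for illustration):

#include <sqlite3.h>
#include <thread>
#include <vector>

// Each worker opens its own SQLite connection and queries only its own
// slice of the dataset, so no connection is ever shared between threads.
void query_range(const char* db_path, int min_id, int max_id)
{
    sqlite3* db = nullptr;
    if (sqlite3_open(db_path, &db) != SQLITE_OK)
        return;

    sqlite3_stmt* stmt = nullptr;
    const char* sql =
        "SELECT file, max(last_access) FROM logs "
        "WHERE id >= ? AND id < ? GROUP BY file";
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) == SQLITE_OK) {
        sqlite3_bind_int(stmt, 1, min_id);
        sqlite3_bind_int(stmt, 2, max_id);
        while (sqlite3_step(stmt) == SQLITE_ROW) {
            // combine this thread's partial results here
        }
        sqlite3_finalize(stmt);
    }
    sqlite3_close(db);
}

int main()
{
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) // 4 threads, 4 disjoint ID ranges
        workers.emplace_back(query_range, "logs.db", t * 100000, (t + 1) * 100000);
    for (auto& w : workers)
        w.join();
}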
After all this, you can say:
mmap is the fastest way to handle this kind of string processing.
SQL provides great functionality for parser applications, but it slows things down if you don't implement a wrapper for processing the SQL queries.