How to count words in a huge file by splitting data? - c++

I am trying to count the words in a huge file. To make full use of the CPU, I split the input data and count the words in several threads. The problem is that a split can land in the middle of a word, so I end up with the wrong answer. How can I split the data from the file without splitting words? Can somebody help me?
#include <iostream>
#include <fstream>
#include <set>
#include <string>
#include <thread>
#include <mutex>
#include <sstream>
#include <vector>
#include <algorithm>
#define BUFER_SIZE 1024
using namespace std;
std::mutex mtx;
void worker(int n, set<std::string> &mySet, std::string path)
{
mtx.lock();
ifstream file (path, ios::in);
if (file.is_open())
{
char *memblock = new char [BUFER_SIZE];
file.seekg (n * (BUFER_SIZE - 1), ios::beg);
file.read(memblock, BUFER_SIZE - 1);
// Terminate the block at the number of bytes actually read, before building the string
memblock[file.gcount()] = '\0';
std::string blockString(memblock);
std::string buf;
stringstream stream(blockString);
while(stream >> buf) mySet.insert(buf);
file.close();
delete[] memblock;
}
else
cout << "Unable to open file";
mtx.unlock();
}
int main(int argc, char *argv[])
{
set<std::string> uniqWords;
int threadCount = 0;
ifstream file(argv[1], ios::in);
if(!file){
std::cout << "Bad path.\n";
return 1;
}
file.seekg(0, ios::end);
int fileSize = file.tellg();
file.close();
std::cout << "Size of the file is" << " " << fileSize << " " << "bytes\n";
threadCount = fileSize/BUFER_SIZE + 1;
std::cout << "Thread count: " << threadCount << std::endl;
std::vector<std::thread> vec;
for(int i=0; i < threadCount; i++)
{
vec.push_back(std::thread(worker, i, std::ref(uniqWords), argv[1]));
}
std::for_each(vec.begin(), vec.end(), [](std::thread& th)
{
th.join();
});
std::cout << "Count: " << uniqWords.size() << std::endl;
return 0;
}

The problem right now is that you're reading in a fixed-size chunk and processing it. If that chunk stops mid-word, you're going to count the word twice, once in each buffer it was placed into.
One obvious solution is to only break between buffers at word boundaries--e.g., when you read in one chunk, look backwards from the end to find a word boundary, then have the thread only process up to that boundary.
Another solution that's a bit less obvious (but may have the potential for better performance) is to look at the last character in a chunk to see if it's a word character (e.g., a letter) or a boundary character (e.g., a space). Then when you create and process the next chunk, you tell it whether the previous buffer ended with a boundary or within a word. If it ended within a word, the counting thread knows to ignore the first partial word in the buffer.
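The first approach (only breaking at word boundaries) could be sketched roughly as follows. This is a minimal illustration, not the asker's code: it assumes the file content is already in memory as a std::string, that words are separated by whitespace as in the question, and the helper names alignToBoundary and countUniqueWords are made up for this example. Each thread fills its own set, so no mutex is needed until the final merge.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Move a tentative split position forward until it no longer sits inside a word.
// (Hypothetical helper; a "word" here is any run of non-whitespace characters.)
std::size_t alignToBoundary(const std::string& text, std::size_t pos)
{
    while (pos < text.size() && !std::isspace(static_cast<unsigned char>(text[pos])))
        ++pos;
    return pos;
}

// Count the unique words in `text` using `parts` worker threads (hypothetical helper).
std::size_t countUniqueWords(const std::string& text, std::size_t parts)
{
    parts = std::max<std::size_t>(parts, 1);

    // Chunk i covers [bounds[i], bounds[i + 1]); every bound is on whitespace or at an end.
    std::vector<std::size_t> bounds{0};
    for (std::size_t i = 1; i < parts; ++i)
        bounds.push_back(std::max(bounds.back(),
                                  alignToBoundary(text, text.size() * i / parts)));
    bounds.push_back(text.size());

    // One set per thread, so the workers never touch shared data.
    std::vector<std::set<std::string>> partial(parts);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < parts; ++i)
        workers.emplace_back([&text, &bounds, &partial, i] {
            std::istringstream in(text.substr(bounds[i], bounds[i + 1] - bounds[i]));
            for (std::string word; in >> word; )
                partial[i].insert(word);
        });
    for (std::thread& t : workers)
        t.join();

    // Merge the per-thread results once all workers are done.
    std::set<std::string> merged;
    for (const auto& s : partial)
        merged.insert(s.begin(), s.end());
    return merged.size();
}
```

For a file too large to hold in memory, the same boundary adjustment could presumably be applied to file offsets instead of string positions; a handful of threads (for example std::thread::hardware_concurrency()) is likely a better fit than one thread per 1 KB block.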

Related

How to read large files in segments?

I'm using small files currently for testing and will scale up once it works.
I made a file bigFile.txt that has:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
I'm running this to segment the data that is being read from the file:
#include <iostream>
#include <fstream>
#include <memory>
using namespace std;
int main()
{
ifstream file("bigfile.txt", ios::binary | ios::ate);
cout << file.tellg() << " Bytes" << '\n';
ifstream bigFile("bigfile.txt");
constexpr size_t bufferSize = 4;
unique_ptr<char[]> buffer(new char[bufferSize]);
while (bigFile)
{
bigFile.read(buffer.get(), bufferSize);
// print the buffer data
cout << buffer.get() << endl;
}
}
This gives me the following result:
26 Bytes
ABCD
EFGH
IJKL
MNOP
QRST
UVWX
YZWX
Notice how, in the last line, the characters 'WX' are repeated after 'Z'?
How do I get rid of it so that it stops after reaching the end?
cout << buffer.get() uses the const char* overload, which prints a null-terminated C string.
But your buffer isn't null-terminated, and istream::read() can read fewer characters than the buffer size. So when you print the buffer, you end up printing old characters that were already there, until the next null character is encountered.
Use istream::gcount() to determine how many characters were read, and print exactly that many characters. For example, using std::string_view:
#include <iostream>
#include <fstream>
#include <memory>
#include <string_view>
using namespace std;
int main()
{
ifstream file("bigfile.txt", ios::binary | ios::ate);
cout << file.tellg() << " Bytes" << "\n";
file.seekg(0, std::ios::beg); // rewind to the beginning
constexpr size_t bufferSize = 4;
unique_ptr<char[]> buffer = std::make_unique<char[]>(bufferSize);
while (file)
{
file.read(buffer.get(), bufferSize);
auto bytesRead = file.gcount();
if (bytesRead == 0) {
// EOF
break;
}
// print the buffer data
cout << std::string_view(buffer.get(), bytesRead) << endl;
}
}
Note also that there's no need to open the file again - you can rewind the original one to the beginning and read it.
The problem is that you don't overwrite the buffer's content. Here's what your code does:
It reads the beginning of the file
When it reaches 'YZ', it reads it and only overwrites the buffer's first two characters ('U' and 'V'), because it has reached the end of the file.
One easy fix is to clear the buffer before each file read:
#include <iostream>
#include <fstream>
#include <array>
int main()
{
std::ifstream bigFile("bigfile.txt", std::ios::binary | std::ios::ate);
int fileSize = bigFile.tellg();
std::cout << bigFile.tellg() << " Bytes" << '\n';
bigFile.seekg(0);
constexpr size_t bufferSize = 4;
std::array<char, bufferSize> buffer;
while (bigFile)
{
for (int i(0); i < bufferSize; ++i)
buffer[i] = '\0';
bigFile.read(buffer.data(), bufferSize);
// Print the buffer data
std::cout.write(buffer.data(), bufferSize) << '\n';
}
}
I also changed:
The std::unique_ptr<char[]> to a std::array, since we don't need dynamic allocation here and std::arrays are safer than C-style arrays
The printing instruction to std::cout.write, because the original caused undefined behavior (see @paddy's comment). std::cout << prints a null-terminated string (a sequence of characters terminated by a '\0' character), whereas std::cout.write prints a fixed number of characters
The second file opening to a call to the std::istream::seekg method (see @rustyx's answer).
Another (and most likely more efficient) way of doing this is to read the file character by character, put each character in the buffer, and print the buffer when it's full. After the main for loop, we print the buffer one last time if it hasn't been printed already.
#include <iostream>
#include <fstream>
#include <array>
int main()
{
std::ifstream bigFile("bigfile.txt", std::ios::binary | std::ios::ate);
int fileSize = bigFile.tellg();
std::cout << bigFile.tellg() << " Bytes" << '\n';
bigFile.seekg(0);
constexpr size_t bufferSize = 4;
std::array<char, bufferSize> buffer;
int bufferIndex;
for (int i(0); i < fileSize; ++i)
{
// Add one character to the buffer
bufferIndex = i % bufferSize;
buffer[bufferIndex] = bigFile.get();
// Print the buffer data
if (bufferIndex == bufferSize - 1)
std::cout.write(buffer.data(), bufferSize) << '\n';
}
// Override the characters which haven't been already (in this case 'W' and 'X')
for (++bufferIndex; bufferIndex < bufferSize; ++bufferIndex)
buffer[bufferIndex] = '\0';
// Print the buffer for the last time if it hasn't been already
if (fileSize % bufferSize /* != 0 */)
std::cout.write(buffer.data(), bufferSize) << '\n';
}

Why does getline() cut off CSV Input?

I'm trying to read and parse my CSV files in C++ and ran into an error.
The CSV has 1-1000 rows and always 8 columns.
Generally, what I would like to do is read the CSV and output only the lines that match a filter criterion. For example, column 2 is a timestamp, and I only want rows in a specific time range.
My problem is that my program cuts off some lines.
While the data is in the string record variable, it's not cut off. As soon as I push it into the map of int/vector, it is cut off. Am I doing something wrong here?
Could someone help me identify what the problem really is, or maybe even give me a better way to do this?
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <sstream>
#include <iostream>
#include <map>
#include "csv.h"
using std::cout; using std::cerr;
using std::endl; using std::string;
using std::ifstream; using std::ostringstream;
using std::istringstream;
string readFileIntoString(const string& path) {
auto ss = ostringstream{};
ifstream input_file(path);
if (!input_file.is_open()) {
cerr << "Could not open the file - '"
<< path << "'" << endl;
exit(EXIT_FAILURE);
}
ss << input_file.rdbuf();
return ss.str();
}
int main()
{
int filterID = 3;
int filterIDIndex = filterID;
string filter = "System";
/*Filter ID's:
0 Record ID
1 TimeStamp
2 UTC
3 UserID
4 ObjectID
5 Description
6 Comment
7 Checksum
*/
string filename("C:/Storage Card SD/Audit.csv");
string file_contents;
std::map<int, std::vector<string>> csv_contents;
char delimiter = ',';
file_contents = readFileIntoString(filename);
istringstream sstream(file_contents);
std::vector<string> items;
string record;
int counter = 0;
while (std::getline(sstream, record)) {
istringstream line(record);
while (std::getline(line, record, delimiter)) {
items.push_back(record);
cout << record << endl;
}
csv_contents[counter] = items;
//cout << csv_contents[counter][0] << endl;
items.clear();
counter += 1;
}
return 0;
}
I can't see a reason why your data is being cropped, but I have refactored your code slightly. With this version it might be easier for you to debug the problem, if it doesn't just disappear on its own.
int main()
{
string path("D:/Audit.csv");
ifstream input_file(path);
if (!input_file.is_open())
{
cerr << "Could not open the file - '" << path << "'" << endl;
exit(EXIT_FAILURE);
}
std::map<int, std::vector<string>> csv_contents;
std::vector<string> items;
string record;
char delimiter = ';';
int counter = 0;
while (std::getline(input_file, record))
{
istringstream line(record);
while (std::getline(line, record, delimiter))
{
items.push_back(record);
cout << record << endl;
}
csv_contents[counter] = items;
items.clear();
++counter;
}
return counter;
}
I have tried your code and (after fixing the delimiter) had no problems, but I only had three lines of data, so if it is a memory issue it would have been unlikely to show.

How do I read words from a file, assign them to an array and analyze its content?

I (a student whose professor encourages online research to complete projects) have an assignment where I have to analyze the contents of a file (frequency of certain words, total word count, largest and smallest word), and I'm getting stuck on even opening the file so the program can get words out. I've tried to just count the words that it reads, and I get nothing. As I understand it, the program should be opening the selected .txt file, going through its contents word by word, and outputting each word right now.
Here's code:
#include <iostream>
#include <string>
#include <cctype>
#include <fstream>
#include <sstream>
string selected[100];
//open selected file.
ifstream file;
file.open(story.c_str());
string line;
if (!file.good())
{
cout << "Problem with file!" << endl;
return 1;
}
while (!file.eof())
{
getline(file, line);
if (line.empty())
continue;
istringstream iss(line);
for (string word; iss >> word;)
cout << word << endl;
}
Because the attached code is simple, I will not give detailed explanations here. With the standard library algorithms (the <algorithm> header), each task can be performed in a single statement.
We will read the complete source file into one std::string. Then, we define a std::vector and fill it with all words. The words are defined by an ultra simple regex.
The frequency is counted with a standard approach using std::map.
#include <fstream>
#include <string>
#include <iterator>
#include <vector>
#include <regex>
#include <iostream>
#include <algorithm>
#include <map>
// A word is something consisting of 1 or more word characters (\w: letters, digits or underscore)
std::regex patternForWord{R"((\w+))"};
int main() {
// Open file and check, if it could be opened
if (std::ifstream sampleFile{ "r:\\sample.txt" }; sampleFile) {
// Read the complete File into a std::string
std::string wholeFile(std::istreambuf_iterator<char>(sampleFile), {});
// Put all words from the whole file into a vector
std::vector<std::string> words(std::sregex_token_iterator(wholeFile.begin(), wholeFile.end(), patternForWord, 1), {});
// Get the longest and shortest word
const auto [min, max] = std::minmax_element(words.begin(), words.end(),
[](const std::string & s1, const std::string & s2) { return s1.size() < s2.size(); });
// Count the frequency of words
std::map<std::string, size_t> wordFrequency{};
for (const std::string& word : words) wordFrequency[word]++;
// Show the result to the user
std::cout << "\nNumber of words: " << words.size()
<< "\nLongest word: " << *max << " (" << max->size() << ")"
<< "\nShortest word: " << *min << " (" << min->size() << ")"
<< "\nWord frequencies:\n";
for (const auto& [word, count] : wordFrequency) std::cout << word << " --> " << count << "\n";
}
else {
std::cerr << "*** Error: Problem with input file\n\n";
}
return 0;
}

Parse a large text file and store it in a tree (Binary or AVL) using C++

I am doing an assignment in which I have a large text file (1 GB). I am supposed to parse this text file and store it in a tree for some operations. The problem I am facing is the time it takes to parse the whole file: about 40 minutes. Can anyone please show me how to do it efficiently, in a few minutes?
My code is
int main()
{
FILE * file=fopen("data.txt","r");
char line[1000];
char *token;
while(fgets(line,1000,file)!=NULL)
{
token=strtok(line," ");
while(token!=NULL)
{
cout<<token<<endl;
token=strtok(NULL," ");
}
}
fclose(file);
return 0;
}
Personally, I would guess that printing the tokens is the biggest time sink. Try this instead and see if it runs faster:
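#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
int main() {
std::ios_base::sync_with_stdio(false);
std::ifstream in("data.txt", std::ios_base::binary);
std::size_t count = 0; // number of tokens read so far
for (std::string token; in >> token; ) {
// Report progress once every 100000 tokens instead of printing each token
if (++count % 100000 == 0) {
std::cout << "read " << count << " tokens\n";
}
}
std::cout << "read " << count << " tokens\n";
}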

print out the last 10 lines of a file

I want to have the option to print out the last 10 lines of a text file. With this program I've been able to read the whole text file, but I can't figure out how to manipulate the array in which the text file is saved. Any help?
// Textfile output
#include<fstream>
#include<iostream>
#include<iomanip>
using namespace std;
int main() {
int i=1;
char zeile[250], file[50];
cout << "filename:" << flush;
cin.get(file,50); ///// (1)
ifstream eingabe(file , ios::in); /////(2)
if (eingabe.good() ) { /////(3)
eingabe.seekg(0L,ios::end); ////(4)
cout << "file:"<< file << "\t"
<< eingabe.tellg() << " Bytes" ////(5)
<< endl;
for (int j=0; j<80;j++)
cout << "_";
cout << endl;
eingabe.seekg(0L, ios::beg); ////(6)
while (!eingabe.eof() ){ ///(7)
eingabe.getline(zeile,250); ///(8)
cout << setw(2) << i++
<< ":" << zeile << endl;
}
}
else
cout <<"File error or file not found!"
<< endl;
return 0;
}
Try this:
#include <list>
#include <string>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <iterator>
// A class that knows how to read a line using operator >>
struct Line
{
std::string theLine;
operator std::string const& () const { return theLine; }
friend std::istream& operator>>(std::istream& stream, Line& l)
{
return std::getline(stream, l.theLine);
}
};
// A circular buffer that only saves the last n lines.
class Buffer
{
public:
Buffer(size_t lc)
: lineCount(lc)
{}
void push_back(std::string const& line)
{
buffer.insert(buffer.end(),line);
if (buffer.size() > lineCount)
{
buffer.erase(buffer.begin());
}
}
typedef std::list<std::string> Cont;
typedef Cont::value_type value_type; // needed by std::back_inserter since C++11
typedef Cont::const_iterator const_iterator;
typedef Cont::const_reference const_reference;
const_iterator begin() const { return buffer.begin(); }
const_iterator end() const { return buffer.end();}
private:
size_t lineCount;
std::list<std::string> buffer;
};
// Main
int main()
{
std::ifstream file("Plop");
Buffer buffer(10);
// Copy the file into the special buffer.
std::copy(std::istream_iterator<Line>(file), std::istream_iterator<Line>(),
std::back_inserter(buffer));
// Copy the buffer (which only has the last 10 lines)
// to std::cout
std::copy(buffer.begin(), buffer.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
}
Basically, you are not saving the file contents to any array. The following sample will give you a head start:
#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <vector>
#include <string>
int main ( int, char ** )
{
// Ask user for path to file.
std::string path;
std::cout << "filename:";
std::getline(std::cin, path);
// Open selected file.
std::ifstream file(path.c_str());
if ( !file.is_open() )
{
std::cerr << "Failed to open '" << path << "'." << std::endl;
return EXIT_FAILURE;
}
// Read lines (note: stores all of it in memory, might not be your best option).
std::vector<std::string> lines;
for ( std::string line; std::getline(file,line); )
{
lines.push_back(line);
}
// Print out (up to) last ten lines.
for ( std::size_t i = lines.size() - std::min(lines.size(), std::size_t(10)); i < lines.size(); ++i )
{
std::cout << lines[i] << std::endl;
}
}
It would probably be wiser to avoid storing the whole file into memory, so you could re-write the last 2 segments this way:
// Read up to 10 lines, accumulating.
std::deque<std::string> lines;
for ( std::string line; lines.size() < 10 && getline(file,line); )
{
lines.push_back(line);
}
// Read the rest of the file, adding one, dumping one.
for ( std::string line; getline(file,line); )
{
lines.pop_front();
lines.push_back(line);
}
// Print out whatever is left (up to 10 lines).
for ( std::size_t i = 0; i < lines.size(); ++i )
{
std::cout << lines[i] << std::endl;
}
The eof() function does not do what you, and it seems a million other C++ newbies, think it does. It does NOT predict whether the next read will work. In C++, as in any other language, you must check the status of each read operation, not the state of the input stream before the read. So the canonical C++ read-line loop is:
while ( eingabe.getline(zeile,250) ) {
// do something with zeile
}
Also, you should be reading into a std::string, and get rid of that 250 value.
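As a rough sketch of that advice, the loop below reads into a std::string with std::getline and uses the result of the read itself as the loop condition. The function name readLines is made up for this example, and collecting the lines in a vector is only one possible thing to do with each line.

```cpp
#include <istream>
#include <sstream>   // only needed for trying the function on an in-memory stream
#include <string>
#include <vector>

// Read every line of `in`. The loop ends as soon as a read fails,
// so no extra or empty line is processed at end of file.
std::vector<std::string> readLines(std::istream& in)
{
    std::vector<std::string> lines;
    for (std::string zeile; std::getline(in, zeile); )  // test the read, not eof()
        lines.push_back(zeile);                          // no fixed 250-character limit
    return lines;
}
```

The same function works with the std::ifstream from the question (for example readLines(eingabe)) or with a std::istringstream when experimenting.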
Make a circular buffer with 10 slots and, while reading the file's lines, put them into this buffer. When you finish the file, do a position++ to go to the oldest element and print them all.
Watch out for empty slots if the file has fewer than 10 lines.
1. Have an array of strings with size 10.
2. Read the first line and store it in the array.
3. Continue reading until the array is full.
4. Once the array is full, delete the first entry so that you can enter a new line.
5. Repeat steps 3 and 4 until the whole file has been read.
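The steps above might be sketched like this. It is only an illustration under a few assumptions: the function name lastLines and its parameter n are invented here, and instead of physically deleting the first entry, the oldest slot is simply overwritten by indexing modulo n, which has the same effect.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>   // only needed for trying the function on an in-memory stream
#include <string>
#include <vector>

// Return the last (up to) n lines of `in`, oldest first.
std::vector<std::string> lastLines(std::istream& in, std::size_t n = 10)
{
    std::vector<std::string> result;
    if (n == 0)
        return result;

    std::vector<std::string> slots(n);  // the ring of n slots
    std::size_t total = 0;              // number of lines read so far
    for (std::string line; std::getline(in, line); ++total)
        slots[total % n] = line;        // once the ring is full, this replaces the oldest line

    // Only `kept` slots hold valid lines if the file was shorter than n lines.
    const std::size_t kept = total < n ? total : n;
    for (std::size_t i = total - kept; i < total; ++i)
        result.push_back(slots[i % n]);
    return result;
}
```

Memory use stays bounded by n lines regardless of the file size, which is the main point of the circular buffer.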
I investigated the approaches proposed here and described them all in my blog post. There is a better solution, but you have to jump to the end of the file and keep all the lines you need:
std::ifstream hndl(filename, std::ios::in | std::ios::ate);
// then pass this handle to a function that iterates backward
void print_last_lines_using_circular_buffer(std::ifstream& stream, int lines)
{
circular_buffer<std::string> buffer(lines);
std::copy(std::istream_iterator<line>(stream),
std::istream_iterator<line>(),
std::back_inserter(buffer));
std::copy(buffer.begin(), buffer.end(),
std::ostream_iterator<std::string>(std::cout));
}
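The "jump to the end and iterate backward" idea is not shown in the snippet above, so here is one possible sketch of it. This is an assumption about what such a function could look like, not the blog post's code: the name tailBySeeking is invented, the stream is assumed to be seekable and opened in binary mode (so that offsets are plain byte positions), and it reads one byte per seek for clarity, where a real implementation would probably read larger blocks.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>   // only needed for trying the function on an in-memory stream
#include <string>

// Return the text of the last `n` lines of a seekable stream by scanning
// backwards from the end for newline characters.
std::string tailBySeeking(std::istream& in, std::size_t n)
{
    in.clear();
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (n == 0 || size <= 0)
        return {};

    std::streamoff pos = size;   // candidate start of the tail
    std::size_t newlines = 0;
    while (pos > 0) {
        in.seekg(pos - 1);
        const char c = static_cast<char>(in.get());
        // A newline that ends the file does not start another line, so skip it.
        if (c == '\n' && pos != size && ++newlines == n)
            break;               // pos is now the start of the n-th line from the end
        --pos;
    }

    // Read everything from pos to the end of the stream in one go.
    std::string tail(static_cast<std::size_t>(size - pos), '\0');
    in.seekg(pos);
    in.read(&tail[0], static_cast<std::streamsize>(tail.size()));
    return tail;
}
```

Unlike the circular-buffer versions, this only touches the end of the file, which matters when the file is large and only the last few lines are wanted.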