C++ and reading large txt files - c++

I have a lot of txt files, around 10GB. What should I use in my program to merge them into one file without duplicates? I want to make sure each line in my output file will be unique.
I was thinking about making some kind of hash tree and use MPI. I want it to be effective.

1. Build a table of files, so you can give every filename simply a number (a std::vector<std::string> works just fine for that).
2. For each file in the table: open it and do the following:
3. Read a line. Hash the line.
4. Have a std::multimap that maps line hashes (step 3) to std::pair<uint32_t filenumber, size_t byte_start_of_line>. If your new line hash is already in the hash table, open the specified file, seek to the specified position, and check whether your new line and the old line are identical or just share the same hash.
5. If identical, skip; if different or not yet present: add a new entry to the map and write the line to the output file.
6. Read the next line (i.e., go to step 3).
This only takes the RAM needed for the longest line, plus enough RAM for the filenames + file numbers plus overhead, plus the space for the map, which should be far less than the actual lines. Since 10GB isn't really much text, it's relatively unlikely you'll have hash collisions, so you might as well skip the "check with the existing file" part if you're not after certainty, but a sufficiently high probability that all lines are in your output.

If you don't have requirements to keep the memory usage low, you could just read all the lines from all the files into a std::set or std::unordered_set. An unordered_set is as the name implies not ordered in any particular way while a set is (lexicographical sort order). I've chosen a std::set here, but you can try with a std::unordered_set to see if that speeds things up a little.
Example:
#include <cerrno>
#include <cstring>
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <string_view>
#include <vector>
int cppmain(std::string_view program, std::vector<std::string_view> args) {
    if(args.empty()) {
        std::cerr << "USAGE: " << program << " files...\n";
        return 1;
    }
    std::set<std::string> result; // to store all the unique lines
    // loop over all the filenames the user supplied
    for(auto& filename : args) {
        // try to open the file
        if(std::ifstream ifs(filename.data()); ifs) {
            std::string line;
            // read all lines and put them in the set:
            while(std::getline(ifs, line)) result.insert(line);
        } else {
            std::cerr << filename << ": " << std::strerror(errno) << '\n';
            return 1;
        }
    }
    for(const auto& line : result) {
        // ... manipulate the unique line here ...
        std::cout << line << '\n'; // and print the result
    }
    return 0;
}
int main(int argc, char* argv[]) {
    return cppmain(argv[0], {argv + 1, argv + argc});
}

Related

Read a file that's constantly updated (C++)

Let me start with saying that I'm around 3 days old in C++.
Ok to the main question, I have a file that spans multiple lines, and I'm trying to print one specific line repeatedly, which is subject to change arbitrarily by some other process.
Example file :
line0
line1
somevar: someval
line3
line4
I'm trying to print the middle line (one that starts with somevar). My first naive attempt was the following where I open the file, loop through the contents and print the exact line, then move to the beginning of the file.
#include <iostream>
#include <fstream>
#include <string>
int main (int argc, char *argv[])
{
    std::string file = "input.txt";
    std::ifstream io {file};
    if (!io){
        std::cerr << "Error opening file" <<std::endl;
        return EXIT_FAILURE;
    }
    std::string line;
    std::size_t pos;
    while (getline (io, line))
    {
        pos = line.find_first_of(' ');
        if (line.substr (0, pos) == "somevar:")
        {
            // someval is expected to be an integer
            std::cout << std::stoi( line.substr (pos) ) << std::endl;
            io.seekg (0, std::ios::beg);
        }
    }
    io.close();
    return EXIT_SUCCESS;
}
Result : Whenever the file's updated, the program exits.
I came to suspect that the I/O I'm performing is buffered, so an update to the file wouldn't be reflected in the existing buffer just like that (this isn't shell scripting). So I thought I'd open and close the file on each iteration, which should effectively refresh the buffer every time. I know it's not the best solution, but I wanted to test the theory. Here's the new source:
#include <iostream>
#include <fstream>
#include <string>
int main (int argc, char *argv[])
{
    std::string proc_file = "input.txt";
    std::ifstream io;
    if (!io){
        std::cerr << "Error opening file" <<std::endl;
        return EXIT_FAILURE;
    }
    std::string line;
    std::size_t pos;
    while (io.open(proc_file, std::ios::in), io)
    {
        io.sync();
        getline (io, line);
        pos = line.find_first_of(' ');
        // The line starting with "somevar:" is always going to be there.
        if (line.substr (0, pos) == "somevar:")
        {
            std::cout << std::stoi( line.substr (pos) ) << std::endl;
            io.close();
        }
    }
    io.close();
    return EXIT_SUCCESS;
}
Result : Same as before.
What would be the ideal way of achieving what I'm trying to? Also, why's the program exiting whenever the file in question is being updated? Thanks (:
EDIT: The file I'm trying to read is "/proc/" + std::to_string( getpid() ) + "/io", and the line is the bytes read one (starts with read_bytes:).
As discovered in the comments, you are not reading a "real" file on disk, but rather /proc/PID/io, which is a virtual file whose contents may only be determined when it is opened, thanks to VFS. Your statement that it can "change arbitrarily by some other process" is misleading: the file never changes; it simply has different content each time it is opened.
So now we know that no amount of seeking will help. We simply need to open the file afresh each time we want to read it. That can be done fairly simply:
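char content[1000]; // choose a suitable value
const char key[] = "read_bytes:";
while (true)
{
    std::ifstream io(io_filename);
    // A short read sets failbit, so check gcount() rather than the stream state.
    io.read(content, sizeof(content) - 1);
    const std::streamsize n = io.gcount();
    if (n <= 0)
        break;
    content[n] = '\0'; // terminate so atoi() stops inside the array
    auto it = std::search(content, content + n, key, key + strlen(key));
    if (it == content + n)
        break; // key not found
    std::cout << atoi(it + strlen(key)) << std::endl;
}
The buffer is null-terminated here so that atoi() stops inside the array. You should still do something more careful than atoi(), which reports no errors and only returns an int, but I assume your real application will do something else there so I elided handling that.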
The file I'm trying to read is some /proc/1234/io
That is the most important information.
Files in proc(5) are small pseudo-files (a bit like pipe(7)-s) which can only be read in sequence.
That pseudo file is not updated, but entirely regenerated (by the Linux kernel whose source code you can study) at every open(2)
So you just read all the file quickly in memory, and process that content in memory once you have read it.
See this answer to a very related question.... Adapt it to C++
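In C++, the "read everything at once, then parse in memory" idea could look something like the sketch below. The names slurp and find_counter are hypothetical, and the parsing assumes the "key: number" line layout shown in the question.

```cpp
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <iterator>
#include <string>

// Read an entire (small) file into memory in one go.
std::string slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Look for `key` at the start of a line and parse the number that follows.
// Returns false if the key is missing or no digits follow it.
bool find_counter(const std::string& text, const std::string& key,
                  unsigned long long& value) {
    std::size_t pos = 0;
    while ((pos = text.find(key, pos)) != std::string::npos) {
        if (pos == 0 || text[pos - 1] == '\n') {
            const char* begin = text.c_str() + pos + key.size();
            char* end = nullptr;
            value = std::strtoull(begin, &end, 10);
            return end != begin;  // true only if at least one digit was parsed
        }
        pos += key.size();  // matched in the middle of a line; keep looking
    }
    return false;
}
```

To poll the value, call slurp() on the /proc path inside your loop each time you want a fresh reading, then pass the result to find_counter() with "read_bytes:" as the key.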

Using std::find to find chars read from binary file and cast to a std::string in a std::vector<string> creates this unpredictable behaviour?

Sorry for the long headline; I couldn't work out how to describe it more briefly.
Would you care to recreate the problem I am running into?
You can use any wav file to read.
I am trying to query the chunks in a wav file. This is a simplified version of the code, but I think it should be enough to reproduce the problem.
I use a Mac and compile with g++ -std=c++11.
When I run this code without the line std::cout << query << std::endl; then std::find(chunk_types.begin(), chunk_types.end(), query) != chunk_types.end() returns 0 in all iterations, even though I know the binary file contains some of these chunks. If I include the line, it works properly, but not predictably: it only works some of the time.
I am a bit perplexed. Am I doing anything wrong here?
#include <fstream>
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
int main(){
    std::vector<std::string> chunk_types{
        "RIFF","WAVE","JUNK","fmt ","data","bext",
        "cue ","LIST","minf","elm1",
        "slnt","fact","plst","labl","note",
        "adtl","ltxt","file"};
    std::streampos fileSize;
    std::ifstream file(/* file path here */, std::ios::binary);
    file.seekg(0, std::ios::beg);
    char fileData[4];
    for(int i{0};i<100;i+=4){ //100 is an arbitrary number
        file.seekg(i);
        file.read((char*) &fileData[0], 4);
        std::string query(fileData);
        std::cout << query << std::endl;
        /* if i put this std::cout here, it works or else std::find always returns 0 */
        if( std::find(chunk_types.begin(), chunk_types.end(), query) != chunk_types.end() ){
            std::cout << "found " + query << std::endl;
        }
    }
    return 0;
}
std::string query(fileData) uses strlen on fileData to find its terminating 0, but fileData is not zero-terminated, so the search runs past the end of the array and up the stack until it either finds a 0 or hits inaccessible memory past the end of the stack and causes SIGSEGV.
Also, file.read can read fewer characters than requested; gcount must be used to get the number of characters actually read.
A fix:
file.read(fileData, sizeof fileData);
auto len = file.gcount();
std::string query(fileData, len);
A slightly more efficient solution is to read directly into std::string and keep reusing it to avoid a memory allocation (if no short string optimisation) and copying:
std::string query;
// ...
constexpr int LENGTH = 4;
query.resize(LENGTH);
file.read(&query[0], LENGTH);
query.resize(file.gcount());

Trying to open file using fstream but it is not working

I am trying to run this but the file is constantly failing to load. What I am trying to do is load a dictionary into an array, with each element of the array holding one word.
#include <iostream>
#include <string>
#include <fstream>
#include <time.h>
#include <stdlib.h>
using namespace std;
int Rand;
void SetDictionary(){
    srand(time(NULL));
    Rand = rand() % 235674;
    fstream file("Hangman.txt");
    if(file.is_open()){
        string Array[235675];
        for(int X = 0; X < 235673; X++){
            file >> Array[X];
        }
        cout << Array[Rand];
    }else{
        cout << "Unable To Open File\n";
    }
}
int main(){
    SetDictionary();
}
vector<string> words;
{
    ifstream file("Hangman.txt");
    string word;
    while (file >> word)
    {
        words.push_back(word);
    }
}
string randword = words[rand() % words.size()];
First, you do not reuse Array after cout << Array[Rand], so you do not need the array at all in this case. Read the file line by line into a temporary variable, print that variable when X == Rand, then break.
Second, the implementation could be improved, assuming you are trying to print a random word from the file. It would be much faster to generate Rand in the range 0..file size and seek to that offset. You are then "inside" the desired word, and the task is to read backward and forward to find where the word begins and ends. Note that this algorithm gives a somewhat different probability distribution, because longer words are more likely to be hit.
Third, if you plan to reuse the file data, it would be much faster to read the whole file into memory and then split it into words, storing the word offsets in an array of integers.
Finally, with really huge dictionaries (or if the program runs with limited memory) it is possible to store only the word offsets and re-read the dictionary contents on the fly.

write data at desired line in already existing file

I have a text file which already has 40 lines of data. I want to write data just before the last two lines of the file. I am a newbie to C++. I searched online and found a few functions like fseek and seekp, but I don't understand how to use those functions to change the lines. Can you please give some pointers for this?
Thank you in advance.
Open your file using a std::ifstream
Read the whole file into a std::vector<std::string> with an entry for each line in the file (you can use std::getline() and std::vector<std::string>::push_back() methods to realize this).
Close the std::ifstream
Change the vector entry at the line index you want to change, or alternatively insert additional entries to the vector using std::vector<std::string>::insert()
Open your file using a std::ofstream
Write the vector's content back to the file (just iterate over the vector and output each entry to the file).
You shouldn't mess around with seek functions in this case, particularly not if the replacement's size changes dynamically.
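The steps above can be sketched as a single function. The name insert_before_last and its parameters are invented for this example; for the question as asked, from_end would be 2.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Insert `text` as a new line `from_end` lines before the end of the file.
// Returns false if the file cannot be opened or has fewer lines than that.
bool insert_before_last(const std::string& path, const std::string& text,
                        std::size_t from_end) {
    std::vector<std::string> lines;
    {
        std::ifstream in(path);
        if (!in) return false;
        std::string line;
        while (std::getline(in, line)) lines.push_back(line);
    }  // the ifstream is closed here, before the file is reopened for writing
    if (lines.size() < from_end) return false;

    lines.insert(lines.end() - static_cast<std::ptrdiff_t>(from_end), text);

    std::ofstream out(path, std::ios::trunc);
    if (!out) return false;
    for (const std::string& l : lines) out << l << '\n';
    out.flush();
    return out.good();
}
```

Note that this rewrites the whole file, which is fine for 40 lines. For very large files, writing to a temporary file and renaming it afterwards is usually safer.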
You say C++, so I assume you mean that and not C. A FIFO comes to mind for this purpose. The listing below is the program's output when it is fed its own source, so it shows both the code and the inserted line:
$ cat last_two_lines.c | ./a.out
#include <iostream>
#include <string>
#include <deque>
int main ()
{
    std::deque<std::string> fifo;
    std::string buffer;
    // loop on getline itself; testing eof() first would process a phantom empty line
    while (std::getline(std::cin, buffer)) {
        fifo.push_back(buffer);
        if (fifo.size() > 2) {
            std::cout << fifo.front() << "\n";
            fifo.pop_front();
        }
    }
    std::cout << " // LINE INSERTED" << "\n";
    while (fifo.size() > 0) {
        std::cout << fifo.front() << "\n";
        fifo.pop_front();
    }
 // LINE INSERTED
    return 0;
}

Calculating the info-hash of a torrent file

I'm using C++ to parse the info hash of a torrent file, and I am having trouble getting a "correct" hash value in comparison to this site:
http://i-tools.org/torrent
I have constructed a very simple toy example just to make sure I have the basics right.
I opened a .torrent file in sublime and stripped off everything except for the info dictionary, so I have a file that looks like this:
d6:lengthi729067520e4:name31:ubuntu-12.04.1-desktop-i386.iso12:piece lengthi524288e6:pieces27820:¡´E¶ˆØËš3í ..............(more unreadable stuff.....)..........
I read this file in and parse it with this code:
#include <string>
#include <sstream>
#include <iomanip>
#include <fstream>
#include <iostream>
#include <openssl/sha.h>
void printHexRep(const unsigned char * test_sha) {
    std::cout << "CALLED HEX REP...PREPPING TO PRINT!\n";
    std::ostringstream os;
    os.fill('0');
    os << std::hex;
    for (const unsigned char * ptr = test_sha; ptr < test_sha + 20; ptr++) {
        os << std::setw(2) << (unsigned int) *ptr;
    }
    std::cout << os.str() << std::endl << std::endl;
}
int main() {
    using namespace std;
    ifstream myFile ("INFO_HASH__ubuntu-12.04.1-desktop-i386.torrent", ifstream::binary);
    //Get file length
    myFile.seekg(0, myFile.end);
    int fileLength = myFile.tellg();
    myFile.seekg(0, myFile.beg);
    char buffer[fileLength];
    myFile.read(buffer, fileLength);
    cout << "File length == " << fileLength << endl;
    cout << buffer << endl << endl;
    unsigned char datSha[20];
    SHA1((unsigned char *) buffer, fileLength, datSha);
    printHexRep(datSha);
    myFile.close();
    return 0;
}
Compile it like so:
g++ -o hashes info_hasher.cpp -lssl -lcrypto
And I am met with this output:
4d0ca7e1599fbb658d886bddf3436e6543f58a8b
When I am expecting this output:
14FFE5DD23188FD5CB53A1D47F1289DB70ABF31E
Does anybody know what I might be doing wrong here? Could the problem lie with the un-readability of the end of the file? Do I need to parse this as hex first or something?
Make sure you don't have a newline at the end of the file. You may also want to make sure it ends with an 'e'.
The info-hash of a torrent file is the SHA-1 hash of the info-section (in bencoded form) from the .torrent file. Essentially you need to decode the file (it's bencoded) and remember the byte offsets where the content of the value associated with the "info" key begins and ends. That's the range of bytes you need to hash.
For example, if this is the torrent file:
d4:infod6:pieces20:....................4:name4:test12:piece lengthi1024ee8:announce27:http://tracker.com/announcee
You want to hash just this section:
d6:pieces20:....................4:name4:test12:piece lengthi1024ee
For more information on bencoding, see BEP3.
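Locating that byte range does not require a full decoder; it is enough to skip over values while tracking the position. The following is a rough sketch only: parse_string, skip_value and info_range are hypothetical names, the whole .torrent file is assumed to be in a std::string already, and validation is minimal.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Parse a bencoded byte string "<len>:<bytes>" at pos; return it and advance pos.
std::string parse_string(const std::string& d, std::size_t& pos) {
    const std::size_t colon = d.find(':', pos);
    if (colon == std::string::npos) throw std::runtime_error("bad string");
    const std::size_t len = std::stoul(d.substr(pos, colon - pos));
    if (colon + 1 + len > d.size()) throw std::runtime_error("truncated string");
    std::string s = d.substr(colon + 1, len);
    pos = colon + 1 + len;
    return s;
}

// Advance pos past one complete bencoded value (integer, string, list or dict).
void skip_value(const std::string& d, std::size_t& pos) {
    if (pos >= d.size()) throw std::runtime_error("unexpected end");
    const char c = d[pos];
    if (c == 'i') {                       // integer: i<digits>e
        const std::size_t e = d.find('e', pos);
        if (e == std::string::npos) throw std::runtime_error("bad integer");
        pos = e + 1;
    } else if (c == 'l' || c == 'd') {    // list or dict: a run of values, then 'e'
        ++pos;
        while (pos < d.size() && d[pos] != 'e') skip_value(d, pos);
        if (pos >= d.size()) throw std::runtime_error("unterminated container");
        ++pos;
    } else if (std::isdigit(static_cast<unsigned char>(c))) {
        parse_string(d, pos);
    } else {
        throw std::runtime_error("unknown type marker");
    }
}

// Return [begin, end) of the value stored under "info" in the top-level dict.
std::pair<std::size_t, std::size_t> info_range(const std::string& d) {
    if (d.empty() || d[0] != 'd') throw std::runtime_error("not a dictionary");
    std::size_t pos = 1;
    while (pos < d.size() && d[pos] != 'e') {
        const std::string key = parse_string(d, pos);
        const std::size_t begin = pos;
        skip_value(d, pos);
        if (key == "info") return {begin, pos};
    }
    throw std::runtime_error("no info key");
}
```

The bytes from begin up to (but not including) end are what you would pass to SHA1(), unmodified.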
SHA1 calculation is just as simple as what you've written, more or less. The error is probably in the data you're feeding it, if you get the wrong answer from the library function.
I can't speak to the torrent file prep work you've done, but I do see a few problems. If you'll revisit the SHA1 docs, notice the SHA1 function never requires its own digest length as a parameter. Next, you'll want to be quite certain the technique you're using to read the file's contents is faithfully sucking up the exact bytes, no translation.
A less critical style suggestion: make use of the third parameter to SHA1. General rule, static storage in the library is best avoided. Always prefer to supply your own buffer. Also, where you have a hard-coded 20 in your print function, that's a marvelous place for that digest length constant you've been flirting with.
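As an illustration of that last point, the print helper could take the digest length as a parameter instead of hard-coding 20. This is just a sketch with a made-up name (to_hex), and it returns the string rather than printing it.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Format `len` bytes as lowercase hex, two digits per byte.
std::string to_hex(const unsigned char* bytes, std::size_t len) {
    std::ostringstream os;
    os << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < len; ++i)
        os << std::setw(2) << static_cast<unsigned int>(bytes[i]);
    return os.str();
}
```

With OpenSSL you would call it as to_hex(datSha, SHA_DIGEST_LENGTH). Keep in mind that the output is lowercase while the expected value in the question is uppercase, so compare case-insensitively.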