How to read large files in segments?

I'm using small files currently for testing and will scale up once it works.
I made a file bigFile.txt that has:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
I'm running this to segment the data that is being read from the file:
#include <iostream>
#include <fstream>
#include <memory>
using namespace std;
int main()
{
ifstream file("bigfile.txt", ios::binary | ios::ate);
cout << file.tellg() << " Bytes" << '\n';
ifstream bigFile("bigfile.txt");
constexpr size_t bufferSize = 4;
unique_ptr<char[]> buffer(new char[bufferSize]);
while (bigFile)
{
bigFile.read(buffer.get(), bufferSize);
// print the buffer data
cout << buffer.get() << endl;
}
}
This gives me the following result:
26 Bytes
ABCD
EFGH
IJKL
MNOP
QRST
UVWX
YZWX
Notice how in the last line after 'Z' the character 'WX' is repeated again?
How do I get rid of it so that it stops after reaching the end?

cout << buffer.get() uses the const char* overload, which prints a null-terminated C string.
But your buffer isn't null-terminated, and istream::read() can read fewer characters than the buffer size. So when you print buffer, you end up printing stale characters left over from the previous read, and the output keeps going until a null character happens to be encountered, possibly past the end of the buffer.
Use istream::gcount() to determine how many characters were read, and print exactly that many characters. For example, using std::string_view:
#include <iostream>
#include <fstream>
#include <memory>
#include <string_view>
using namespace std;
int main()
{
ifstream file("bigfile.txt", ios::binary | ios::ate);
cout << file.tellg() << " Bytes" << "\n";
file.seekg(0, std::ios::beg); // rewind to the beginning
constexpr size_t bufferSize = 4;
unique_ptr<char[]> buffer = std::make_unique<char[]>(bufferSize);
while (file)
{
file.read(buffer.get(), bufferSize);
auto bytesRead = file.gcount();
if (bytesRead == 0) {
// EOF
break;
}
// print the buffer data
cout << std::string_view(buffer.get(), bytesRead) << endl;
}
}
Note also that there's no need to open the file again - you can rewind the original one to the beginning and read it.

The problem is that you don't overwrite the whole buffer on the last read. Here's what your code does:
It reads the file four characters at a time.
When it reaches 'YZ', it reads them and overwrites only the buffer's first two characters ('U' and 'V') because it has reached the end of the file; 'W' and 'X' are left over from the previous read.
One easy fix is to clear the buffer before each file read:
#include <iostream>
#include <fstream>
#include <array>
int main()
{
std::ifstream bigFile("bigfile.txt", std::ios::binary | std::ios::ate);
int fileSize = bigFile.tellg();
std::cout << bigFile.tellg() << " Bytes" << '\n';
bigFile.seekg(0);
constexpr size_t bufferSize = 4;
std::array<char, bufferSize> buffer;
while (bigFile)
{
for (int i(0); i < bufferSize; ++i)
buffer[i] = '\0';
bigFile.read(buffer.data(), bufferSize);
// Print the buffer data
std::cout.write(buffer.data(), bufferSize) << '\n';
}
}
I also changed:
The std::unique_ptr<char[]> to a std::array, since we don't need dynamic allocation here and std::array is safer than a C-style array
The printing instruction to std::cout.write, because the original caused undefined behavior (see @paddy's comment). std::cout << prints a null-terminated string (a sequence of characters terminated by a '\0' character), whereas std::cout.write prints a fixed number of characters
The second file opening to a call to the std::istream::seekg method (see @rustyx's answer).
Another way of doing this (which may or may not be more efficient; measure if it matters) is to read the file character by character, put each character in the buffer, and print the buffer whenever it is full. After the main for loop, we print the buffer one more time if it holds a partial chunk that has not been printed yet.
#include <iostream>
#include <fstream>
#include <array>
int main()
{
std::ifstream bigFile("bigfile.txt", std::ios::binary | std::ios::ate);
int fileSize = bigFile.tellg();
std::cout << bigFile.tellg() << " Bytes" << '\n';
bigFile.seekg(0);
constexpr size_t bufferSize = 4;
std::array<char, bufferSize> buffer;
int bufferIndex;
for (int i(0); i < fileSize; ++i)
{
// Add one character to the buffer
bufferIndex = i % bufferSize;
buffer[bufferIndex] = bigFile.get();
// Print the buffer data
if (bufferIndex == bufferSize - 1)
std::cout.write(buffer.data(), bufferSize) << '\n';
}
// Overwrite the characters which haven't been already (in this case 'W' and 'X')
for (++bufferIndex; bufferIndex < bufferSize; ++bufferIndex)
buffer[bufferIndex] = '\0';
// Print the buffer for the last time if it hasn't been already
if (fileSize % bufferSize /* != 0 */)
std::cout.write(buffer.data(), bufferSize) << '\n';
}

Related

How to count words in huge file by spliting data?

I am trying to count the words in a huge file. I want to use as much of the CPU as possible, so I split the input data and count words in threads. But I have a problem: when I split the data, a split can land in the middle of a word, and in the end I get the wrong answer. How can I split the data from the file so that words are not split? Can somebody help me?
#include <iostream>
#include <fstream>
#include <set>
#include <string>
#include <thread>
#include <mutex>
#include <sstream>
#include <vector>
#include <algorithm>
#define BUFER_SIZE 1024
using namespace std;
std::mutex mtx;
void worker(int n, set<std::string> &mySet, std::string path)
{
mtx.lock();
ifstream file (path, ios::in);
if (file.is_open())
{
char *memblock = new char [BUFER_SIZE];
file.seekg (n * (BUFER_SIZE - 1), ios::beg);
file.read(memblock, BUFER_SIZE - 1);
std::string blockString(memblock);
std::string buf;
stringstream stream(blockString);
while(stream >> buf) mySet.insert(buf);
memblock[BUFER_SIZE] = '\0';
file.close();
delete[] memblock;
}
else
cout << "Unable to open file";
mtx.unlock();
}
int main(int argc, char *argv[])
{
set<std::string> uniqWords;
int threadCount = 0;
ifstream file(argv[1], ios::in);
if(!file){
std::cout << "Bad path.\n";
return 1;
}
file.seekg(0, ios::end);
int fileSize = file.tellg();
file.close();
std::cout << "Size of the file is" << " " << fileSize << " " << "bytes\n";
threadCount = fileSize/BUFER_SIZE + 1;
std::cout << "Thread count: " << threadCount << std::endl;
std::vector<std::thread> vec;
for(int i=0; i < threadCount; i++)
{
vec.push_back(std::thread(worker, i, std::ref(uniqWords), argv[1]));
}
std::for_each(vec.begin(), vec.end(), [](std::thread& th)
{
th.join();
});
std::cout << "Count: " << uniqWords.size() << std::endl;
return 0;
}

The problem right now is that you're reading in a fixed-size chunk and processing it. If that chunk stops mid-word, you're going to count the word twice, once in each buffer it was placed into.
One obvious solution is to only break between buffers at word boundaries--e.g., when you read in one chunk, look backwards from the end to find a word boundary, then have the thread only process up to that boundary.
Another solution that's a bit less obvious (but may have the potential for better performance) is to look at the last character in a chunk to see if it's a word character (e.g., a letter) or a boundary character (e.g., a space). Then when you create and process the next chunk, you tell it whether the previous buffer ended with a boundary or within a word. If it ended within a word, the counting thread knows to ignore the first partial word in the buffer.
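A minimal sketch of the first idea, cutting only at word boundaries. It works on text already held in memory and moves each cut forward to the next whitespace rather than backward (a variation on the suggestion above, chosen so that a chunk with no whitespace in it still terminates). The helper name splitAtWordBoundaries and the offset-pair return type are assumptions for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns [begin, end) offsets that cover `text` in
// pieces of roughly `chunkSize` bytes. Each cut is moved forward to the next
// whitespace character, so no word straddles two pieces.
std::vector<std::pair<std::size_t, std::size_t>>
splitAtWordBoundaries(const std::string& text, std::size_t chunkSize)
{
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (chunkSize == 0)
        return ranges;

    std::size_t begin = 0;
    while (begin < text.size())
    {
        std::size_t end = std::min(begin + chunkSize, text.size());
        // Extend the piece until it ends just before whitespace (or at the end of the text).
        while (end < text.size() && !std::isspace(static_cast<unsigned char>(text[end])))
            ++end;
        ranges.emplace_back(begin, end);
        begin = end;
    }
    return ranges;
}
```

Each worker thread could then be handed one range and count the words in text.substr(begin, end - begin) independently, merging the per-thread results at the end.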

Convert bytes from a file to an integer c++

I am trying to parse a .dat file, reading it byte by byte with this code (the name of the file is in arv[1]).
std::ifstream is (arv[1], std::ifstream::binary);
if (is) {
is.seekg (0, is.end);
int length = is.tellg();
is.seekg (0, is.beg);
char * buffer = new char [length];
is.read (buffer,length);
if (is)
std::cout << "all characters read successfully.";
else
std::cout << "error: only " << is.gcount() << " could be read";
is.close();
}
Now the whole file is in the buffer variable. The file contains numbers represented in 32 bits. How can I iterate over the buffer, reading 4 bytes at a time, and convert them to integers?

First of all, you have a memory leak: you dynamically allocate a character array but never delete[] it.
Use std::string instead:
std::string buffer(length,0);
is.read (&buffer[0],length);
Now, assuming the integers were written correctly (with the same size and byte order as int on your machine) and have been read correctly into buffer, you can use this character array as a pointer to integer:
int myInt = *(int*)&buffer[0];
(do you understand why?)
If you have more than one integer stored:
std::vector<int> integers;
for (std::size_t i = 0; i + sizeof(int) <= buffer.size(); i += sizeof(int)) { // stop before a trailing partial integer
integers.push_back(*(int*)&buffer[i]);
}
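Casting a char pointer to int* relies on the bytes being suitably aligned and technically runs afoul of the aliasing rules, so a more conservative variant copies each group of bytes with std::memcpy. This is a sketch under the assumption that the file stores 32-bit integers in the machine's native byte order; bytesToInts is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: interprets `bytes` as consecutive 32-bit integers in
// native byte order. A trailing partial integer (fewer than 4 bytes) is ignored.
std::vector<std::int32_t> bytesToInts(const std::string& bytes)
{
    std::vector<std::int32_t> values;
    values.reserve(bytes.size() / sizeof(std::int32_t));

    for (std::size_t i = 0; i + sizeof(std::int32_t) <= bytes.size(); i += sizeof(std::int32_t))
    {
        std::int32_t value;
        // memcpy has no alignment or aliasing requirements on the source bytes.
        std::memcpy(&value, bytes.data() + i, sizeof(value));
        values.push_back(value);
    }
    return values;
}
```

If the file was produced on a machine with a different byte order, the bytes of each value would additionally need to be swapped.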

Instead of:
char * buffer = new char [length];
is.read (buffer,length);
You can use:
int numIntegers = length/sizeof(int);
int* buffer = new int[numIntegers];
is.read(reinterpret_cast<char*>(buffer), numIntegers*sizeof(int));
Update, in response to OP's comment
I am not seeing any problems with the approach I suggested. Here's a sample program and the output I see using g++ 4.9.2.
#include <iostream>
#include <fstream>
#include <cstdlib>
void writeData(char const* filename, int n)
{
std::ofstream out(filename, std::ios::binary);
for ( int i = 0; i < n; ++i )
{
int num = std::rand();
out.write(reinterpret_cast<char*>(&num), sizeof(int));
}
}
void readData(char const* filename)
{
std::ifstream is(filename, std::ifstream::binary);
if (is)
{
is.seekg (0, is.end);
int length = is.tellg();
is.seekg (0, is.beg);
int numIntegers = length/sizeof(int);
int* buffer = new int [numIntegers];
std::cout << "Number of integers: " << numIntegers << std::endl;
is.read(reinterpret_cast<char*>(buffer), numIntegers*sizeof(int));
if (is)
std::cout << "all characters read successfully." << std::endl;
else
std::cout << "error: only " << is.gcount() << " could be read" << std::endl;
for (int i = 0; i < numIntegers; ++i )
{
std::cout << buffer[i] << std::endl;
}
delete[] buffer; // release the array allocated above
}
}
int main()
{
writeData("test.bin", 10);
readData("test.bin");
}
Output
Number of integers: 10
all characters read successfully.
1481765933
1085377743
1270216262
1191391529
812669700
553475508
445349752
1344887256
730417256
1812158119

how to put different C strings in a pretty format?

I am writing a simple program that builds a directory index of the current directory.
Each file has two char* objects for file name and last-modified time, and one integer for the file size.
I want to put all these in one big string or char*.
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <stdio.h>
#include <string>
#include <vector>
#include <iostream>
#include <sstream>
using namespace std;
char* file_info(char*);
int main(void)
{
DIR *d;
struct dirent *dir;
d = opendir(".");
if (d)
{
while ((dir = readdir(d)) != NULL)
{
file_info(dir->d_name);
}
closedir(d);
}
return(0);
}
char* file_info(char* file) {
if(file[0] != '.') {
struct stat sb;
if (stat(file, &sb) == -1) {
perror("stat");
exit(EXIT_FAILURE);
}
char* lm = ctime(&sb.st_mtime);
*lm = '\0';
stringstream ss;
ss << file << " " << lm << " " << sb.st_size;
cout << ss.str() << endl;
}
return lm;
}
I want the returned char* to be an object that has content in this format:
homework-1.pdf    12-Sep-2013 10:57    123K
homework-2.pdf    03-Oct-2013 13:58    189K
hw1_soln.pdf      24-Sep-2013 10:36    178K
hw2_soln.pdf      14-Oct-2013 09:37    655K
The spacing is the major issue here.
How can I correct it easily?
My attempt so far was
const char* file_info(char* file) {
if(file[0] != '.') {
struct stat sb;
if (stat(file, &sb) == -1) {
perror("stat");
exit(EXIT_FAILURE);
}
char* lm = ctime(&sb.st_mtime);
string lastmod(lm);
lastmod.at(lastmod.size()-1) = '\0';
stringstream ss;
string spacing = " ";
ss << file << spacing.substr(0, spacing.size() - sizeof(file)) << lastmod << spacing.substr(0, spacing.size() - lastmod.size()) << sb.st_size;
cout << ss.str() << endl;
return ss.str().c_str();
}
else {
return NULL;
}
}
but it did not work, and I am not comfortable working with strings.

Here is the problem:
// ...
stringstream ss;
// ...
return ss.str().c_str(); // whoops! the temporary string returned by ss.str() is destroyed here, so the pointer dangles
This can be easily solved by making your function return std::string instead of char const* and doing this:
return ss.str();
There is no reason to return char const* here. It complicates everything, requires manual memory management, will be exception-unsafe at some point, confuses people who call your function and makes your code utterly unmaintainable.

To answer your iostream formatting question, you want std::setw (declared in <iomanip>):
std::cout << "'" << std::setw(16) << "Hello" << "'" << std::endl;
http://faculty.cs.niu.edu/~mcmahon/CS241/c241man/node83.html

There are two different problems. First, you obviously cannot return a const char* that points to data owned by an object local to the function. So you would have to allocate it on the heap, and that raises a problem of ownership: where do you delete this string? It can easily be solved by using std::string.
The second problem is your question: how to get this well aligned. With your method you cannot print filenames longer than the preallocated string. There is a simple solution. The header <iomanip> defines the function
/*unspecified*/ std::setw( int n );
which says "Hey, the next thing you print has to be at least n characters wide". And this is what you want. When the thing you print is longer than n, it is printed in full, with no cropping.
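Since the goal is a string object rather than direct output, std::setw can be applied to a std::ostringstream and the result returned as a std::string. The sketch below assumes hypothetical column widths (20, 18 and 6) and a hypothetical helper name formatRow; the size is taken as an already-computed number of kilobytes.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: builds one aligned row. The widths are arbitrary
// choices for illustration; setw applies only to the next item written.
std::string formatRow(const std::string& name, const std::string& modified, long long sizeKiB)
{
    std::ostringstream row;
    row << std::left << std::setw(20) << name      // name, left-aligned, padded to 20
        << std::setw(18) << modified               // timestamp, padded to 18
        << std::right << std::setw(6) << sizeKiB   // size, right-aligned in 6
        << 'K';
    return row.str();
}
```

Returning std::string by value avoids the dangling-pointer problem entirely; the caller can still obtain a const char* with c_str() for as long as it keeps the string alive.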

If you absolutely have to use null-terminated C strings, then use sprintf rather than std::stringstream. Mixing C and C++ like this is considered bad practice (as already pointed out, you have to manage memory manually). There are also some other issues with your code: the sizeof operator doesn't calculate the length of a string but the size of its operand in bytes (here, the size of a char* pointer). Returning a pointer to ctime's internal buffer isn't safe either:
The function also accesses and modifies a shared internal buffer,
which may cause data races on concurrent calls to asctime or ctime
Instead, pass the output buffer as a parameter and don't return anything. Like this:
void file_info(char* file, char* buffer) {
if(file[0] != '.') {
struct stat sb;
if (stat(file, &sb) == -1) {
perror("stat");
exit(EXIT_FAILURE);
}
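char* lm = ctime(&sb.st_mtime);
lm[strlen(lm) - 1] = '\0'; // strip the trailing newline that ctime() appends
sprintf(buffer, "%10s%10s%ld", file, lm, (long)sb.st_size); // st_size is an off_t, not an int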
}
}
To fix your formatting problem, you could also use strlen() (but not sizeof()) and add whitespace depending on the lengths of lm and file. But sprintf provides a minimum field width with %<number>s.
See also: printf reference
Minimum number of characters to be printed. If the value to be printed
is shorter than this number, the result is padded with blank spaces.
The value is not truncated even if the result is larger.
But you have to allocate memory for char* buffer before you call this function, and you have to make sure it's large enough for the string sprintf produces(!).
For example:
char buffer[256];
file_info(file, buffer);
printf("%s\n", buffer);

Thank you all for your answers.
However, none of them worked as I intended (in particular, my goal is not to print the output but to build a string object).
I ended up achieving what I wanted, but it's by no means good.
I attach my program, though, below. Feel free to comment.
Thank you.
void file_info(char*, stringstream&);
int main(void)
{
DIR *d;
struct dirent *dir;
d = opendir(".");
stringstream ss;
if (d)
{
while ((dir = readdir(d)) != NULL)
{
file_info(dir->d_name, ss);
}
closedir(d);
}
cout << ss.str() << endl;
return(0);
}
void file_info(char* file, stringstream& ss) {
if(file[0] != '.') {
struct stat sb;
if (stat(file, &sb) == -1) {
perror("stat");
exit(EXIT_FAILURE);
}
char* lm = ctime(&sb.st_mtime);
string lastmod(lm);
lastmod.at(lastmod.size()-1) = '\0';
string spacing = " ";
ss << file << spacing.substr(0, spacing.size() - strlen(file)) << lastmod << spacing.substr(0, spacing.size() - lastmod.size()) << sb.st_size << '\n';
}
return;
}

Missing data when reading/writing to stream

Here is a complete example: it compiles and runs, writes the contents of a map to a file, and reads it back right after:
#include <map>
#include <fstream>
#include <iostream>
using namespace std;
int main(int argc, char* argv[])
{
std::string fname("test.bin");
std::map<unsigned,unsigned> testMap;
testMap[0]=103;
testMap[1]=2;
testMap[5]=26;
testMap[22]=4;
std::ofstream output(fname.c_str(),std::ios_base::binary|std::ios_base::trunc);
for(std::map<unsigned,unsigned>::iterator iter = testMap.begin();iter != testMap.end();++iter)
{
unsigned temp = iter->first;
output.write((const char*)&temp,sizeof(temp));
unsigned temp1 = iter->second;
output.write((const char*)&temp1,sizeof(temp1));
std::cerr << temp <<" "<<temp1<<" "<<std::endl;
}
std::cerr << "wrote bytes.........."<<output.tellp()<<", map size "<<testMap.size()<<std::endl;
output.flush();
output.close();
std::ifstream input(fname.c_str());
// retrieve length of file:
input.seekg (0, input.end);
unsigned streamSize = input.tellg();
input.seekg (0, input.beg);
char* buff = new char[streamSize];
input.read(buff,streamSize);
cerr << "sizeof of input......"<<streamSize << endl;
cerr << "read bytes..........."<<input.gcount() << endl;
::getchar();
return 0;
}
It gives the following output:
0 103
1 2
5 26
22 4
wrote bytes..........32, map size 4
sizeof of input......32
read bytes...........20
The question is why the number of bytes read does not match the number of bytes written, and how to read/write the whole map.
P.S. An online compiler gives me the expected output of 32 bytes read; I'm getting the wrong output when compiling with Visual Studio 2010 Professional.

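Make sure you're opening the file as a binary file: pass std::ios::binary when constructing the ifstream, just as the ofstream already does. In text mode on Windows, the byte 0x1A (Ctrl-Z) is treated as end-of-file, and the value 26 stored for key 5 is exactly that byte, located at offset 20, which is why the read stops after 20 bytes.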
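As a rough illustration, both directions can open the file with std::ios::binary and read the pairs back one at a time. The helper names saveMap and loadMap are assumptions, and the file layout (raw native-endian key/value pairs) simply mirrors what the question's code writes.

```cpp
#include <fstream>
#include <map>
#include <string>

// Hypothetical helper: writes each key/value pair as raw bytes.
bool saveMap(const std::string& path, const std::map<unsigned, unsigned>& m)
{
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    for (const auto& entry : m)
    {
        out.write(reinterpret_cast<const char*>(&entry.first), sizeof(entry.first));
        out.write(reinterpret_cast<const char*>(&entry.second), sizeof(entry.second));
    }
    out.flush();
    return static_cast<bool>(out);
}

// Hypothetical helper: reads pairs until the file ends or a pair is incomplete.
std::map<unsigned, unsigned> loadMap(const std::string& path)
{
    std::map<unsigned, unsigned> m;
    std::ifstream in(path, std::ios::binary); // binary: no Ctrl-Z or newline translation
    unsigned key = 0, value = 0;
    while (in.read(reinterpret_cast<char*>(&key), sizeof(key)) &&
           in.read(reinterpret_cast<char*>(&value), sizeof(value)))
    {
        m[key] = value;
    }
    return m;
}
```

With the map from the question, loadMap should return the same four entries that saveMap wrote, including the pair (5, 26).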

print out the last 10 lines of a file

I want to have the option to print out the last 10 lines of a text file. With this program I've been able to read the whole text file, but I can't figure out how to manipulate the array in which the text file is saved. Any help?
// Textfile output
#include<fstream>
#include<iostream>
#include<iomanip>
using namespace std;
int main() {
int i=1;
char zeile[250], file[50];
cout << "filename:" << flush;
cin.get(file,50); ///// (1)
ifstream eingabe(file , ios::in); /////(2)
if (eingabe.good() ) { /////(3)
eingabe.seekg(0L,ios::end); ////(4)
cout << "file:"<< file << "\t"
<< eingabe.tellg() << " Bytes" ////(5)
<< endl;
for (int j=0; j<80;j++)
cout << "_";
cout << endl;
eingabe.seekg(0L, ios::beg); ////(6)
while (!eingabe.eof() ){ ///(7)
eingabe.getline(zeile,250); ///(8)
cout << setw(2) << i++
<< ":" << zeile << endl;
}
}
else
cout <<"file error or file not found!"
<< endl;
return 0;
}

Try this:
#include <list>
#include <string>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <iterator>
// A class that knows how to read a line using operator >>
struct Line
{
std::string theLine;
operator std::string const& () const { return theLine; }
friend std::istream& operator>>(std::istream& stream, Line& l)
{
return std::getline(stream, l.theLine);
}
};
// A circular buffer that only saves the last n lines.
class Buffer
{
public:
Buffer(size_t lc)
: lineCount(lc)
{}
void push_back(std::string const& line)
{
buffer.insert(buffer.end(),line);
if (buffer.size() > lineCount)
{
buffer.erase(buffer.begin());
}
}
typedef std::list<std::string> Cont;
typedef Cont::const_iterator const_iterator;
typedef Cont::const_reference const_reference;
const_iterator begin() const { return buffer.begin(); }
const_iterator end() const { return buffer.end();}
private:
size_t lineCount;
std::list<std::string> buffer;
};
// Main
int main()
{
std::ifstream file("Plop");
Buffer buffer(10);
// Copy the file into the special buffer.
std::copy(std::istream_iterator<Line>(file), std::istream_iterator<Line>(),
std::back_inserter(buffer));
// Copy the buffer (which only has the last 10 lines)
// to std::cout
std::copy(buffer.begin(), buffer.end(),
std::ostream_iterator<std::string>(std::cout, "\n"));
}

Basically, you are not saving the file contents to any array. The following sample will give you a head start:
#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <vector>
#include <string>
int main ( int, char ** )
{
// Ask user for path to file.
std::string path;
std::cout << "filename:";
std::getline(std::cin, path);
// Open selected file.
std::ifstream file(path.c_str());
if ( !file.is_open() )
{
std::cerr << "Failed to open '" << path << "'." << std::endl;
return EXIT_FAILURE;
}
// Read lines (note: stores all of it in memory, might not be your best option).
std::vector<std::string> lines;
for ( std::string line; std::getline(file,line); )
{
lines.push_back(line);
}
// Print out (up to) last ten lines.
for ( std::size_t i = lines.size() - std::min(lines.size(), std::size_t(10)); i < lines.size(); ++i )
{
std::cout << lines[i] << std::endl;
}
}
It would probably be wiser to avoid storing the whole file into memory, so you could re-write the last 2 segments this way:
// Read up to 10 lines, accumulating (requires #include <deque>).
std::deque<std::string> lines;
for ( std::string line; lines.size() < 10 && getline(file,line); )
{
lines.push_back(line);
}
// Read the rest of the file, adding one, dumping one.
for ( std::string line; getline(file,line); )
{
lines.pop_front();
lines.push_back(line);
}
// Print out whatever is left (up to 10 lines).
for ( std::size_t i = 0; i < lines.size(); ++i )
{
std::cout << lines[i] << std::endl;
}

The eof() function does not do what you, and seemingly a million other C++ newbies, think it does. It does NOT predict whether the next read will work. In C++, as in any other language, you must check the status of each read operation, not the state of the input stream before the read. So the canonical C++ read-line loop is:
while ( eingabe.getline(zeile,250) ) {
// do something with zeile
}
Also, you should be reading into a std::string, and get rid of that 250 value.

Make a circular buffer with 10 slots and, while reading the file's lines, put them into this buffer. When you finish the file, advance the position by one to get to the oldest element and print them all.
Watch out for empty slots if the file has fewer than 10 lines.

1. Have an array of strings with size 10.
2. Read the first line and store it into the array.
3. Continue reading until the array is full.
4. Once the array is full, delete the first entry so that you can enter a new line.
5. Repeat steps 3 and 4 until the file has been read completely.

I investigated the approaches proposed here and described them all in my blog post. There is a better solution, but you have to jump to the end and persist all needed lines:
std::ifstream hndl(filename, std::ios::in | std::ios::ate);
// and use handler in function which iterate backward
void print_last_lines_using_circular_buffer(std::ifstream& stream, int lines)
{
circular_buffer<std::string> buffer(lines);
std::copy(std::istream_iterator<line>(stream),
std::istream_iterator<line>(),
std::back_inserter(buffer));
std::copy(buffer.begin(), buffer.end(),
std::ostream_iterator<std::string>(std::cout));
}
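The snippet above still reads the stream forward. A sketch of what jumping to the end and scanning backward might look like is given below; it is not the blog post's code, the helper name tailByScanningBackwards is hypothetical, and it seeks one byte at a time for clarity where a practical version would read larger blocks.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: returns the text of the last `n` lines of the file at
// `path` by scanning backward from the end and counting newline characters.
std::string tailByScanningBackwards(const std::string& path, std::size_t n)
{
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in || n == 0)
        return std::string();

    const std::streamoff size = in.tellg();
    std::streamoff pos = size; // start of the tail found so far
    std::size_t newlines = 0;
    while (pos > 0)
    {
        in.seekg(pos - 1);
        const char c = static_cast<char>(in.get());
        // The newline that terminates the very last line does not start a new one.
        if (c == '\n' && pos != size && ++newlines == n)
            break;
        --pos;
    }

    in.clear();
    in.seekg(pos);
    std::string tail(static_cast<std::size_t>(size - pos), '\0');
    in.read(&tail[0], static_cast<std::streamsize>(tail.size()));
    return tail;
}
```

Only the end of the file is touched, so the cost depends on the length of the last lines rather than on the size of the whole file.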
