I'm trying to write a function that compares the contents of two files.
I want it to return 1 if the files are the same, and 0 if they differ.
ch1 and ch2 work as buffers, and I used fgets to get the contents of my files.
I think there is something wrong with the EOF handling, but I'm not sure. The FILE variables are given on the command line.
P.S. It works with small files under 64KB, but it doesn't work with larger files (700MB movies, for example, or 5MB .mp3 files).
Any ideas how to work it out?
int compareFile(FILE* file_compared, FILE* file_checked)
{
    bool diff = 0;
    int N = 65536;
    char* b1 = (char*) calloc(1, N+1);
    char* b2 = (char*) calloc(1, N+1);
    size_t s1, s2;

    do {
        s1 = fread(b1, 1, N, file_compared);
        s2 = fread(b2, 1, N, file_checked);

        if (s1 != s2 || memcmp(b1, b2, s1)) {
            diff = 1;
            break;
        }
    } while (!feof(file_compared) || !feof(file_checked));

    free(b1);
    free(b2);

    if (diff) return 0;
    else return 1;
}
EDIT: I've improved this function based on your answers, but now it only compares the first buffer, with one exception: I figured out that it stops reading the file once it reaches a 0x1A character (attached file). How can we make it work?
EDIT2: Task solved (working code attached). Thanks to everyone for the help!
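(For reference: the 0x1A behaviour described in the first edit is what happens on Windows when a file is opened in text mode, where 0x1A (Ctrl-Z) is treated as end-of-file. A likely fix, assuming the files are opened from argv as the question says, is to open them in binary mode:)

// open both files in binary mode so 0x1A is read as ordinary data
FILE* file_compared = fopen(argv[1], "rb");
FILE* file_checked  = fopen(argv[2], "rb");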
If you can give up a little speed, here is a C++ way that requires little code:
#include <fstream>
#include <iterator>
#include <string>
#include <algorithm>

bool compareFiles(const std::string& p1, const std::string& p2) {
    std::ifstream f1(p1, std::ifstream::binary|std::ifstream::ate);
    std::ifstream f2(p2, std::ifstream::binary|std::ifstream::ate);

    if (f1.fail() || f2.fail()) {
        return false; //file problem
    }

    if (f1.tellg() != f2.tellg()) {
        return false; //size mismatch
    }

    //seek back to beginning and use std::equal to compare contents
    f1.seekg(0, std::ifstream::beg);
    f2.seekg(0, std::ifstream::beg);
    return std::equal(std::istreambuf_iterator<char>(f1.rdbuf()),
                      std::istreambuf_iterator<char>(),
                      std::istreambuf_iterator<char>(f2.rdbuf()));
}
By using istreambuf_iterators you push the buffer size choice, actual reading, and tracking of eof into the standard library implementation. std::equal returns when it hits the first mismatch, so this should not run any longer than it needs to.
This is slower than Linux's cmp, but it's very easy to read.
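For what it's worth, a hypothetical usage (the file names are placeholders):

#include <iostream>

int main() {
    std::cout << (compareFiles("a.bin", "b.bin") ? "equal" : "different")
              << '\n';
}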
Here's a C++ solution. It seems appropriate since your question is tagged C++. The program uses ifstreams rather than FILE*s. It also shows you how to seek on a file stream to determine a file's size. Finally, it reads blocks of 4096 bytes at a time, so large files will be processed as expected.
// g++ -Wall -Wextra equifile.cpp -o equifile.exe
#include <iostream>
using std::cout;
using std::cerr;
using std::endl;
#include <fstream>
using std::ios;
using std::ifstream;
#include <exception>
using std::exception;
#include <cstring>
#include <cstdlib>
#include <algorithm> // for std::min

using std::exit;
using std::memcmp;
bool equalFiles(ifstream& in1, ifstream& in2);
int main(int argc, char* argv[])
{
    if(argc != 3)
    {
        cerr << "Usage: equifile.exe <file1> <file2>" << endl;
        exit(-1);
    }

    try {
        ifstream in1(argv[1], ios::binary);
        ifstream in2(argv[2], ios::binary);

        if(equalFiles(in1, in2)) {
            cout << "Files are equal" << endl;
            exit(0);
        }
        else
        {
            cout << "Files are not equal" << endl;
            exit(1);
        }
    } catch (const exception& ex) {
        cerr << ex.what() << endl;
        exit(-2);
    }

    return -3;
}
bool equalFiles(ifstream& in1, ifstream& in2)
{
    ifstream::pos_type size1, size2;

    size1 = in1.seekg(0, ifstream::end).tellg();
    in1.seekg(0, ifstream::beg);

    size2 = in2.seekg(0, ifstream::end).tellg();
    in2.seekg(0, ifstream::beg);

    if(size1 != size2)
        return false;

    static const size_t BLOCKSIZE = 4096;
    size_t remaining = size1;

    while(remaining)
    {
        char buffer1[BLOCKSIZE], buffer2[BLOCKSIZE];
        size_t size = std::min(BLOCKSIZE, remaining);

        in1.read(buffer1, size);
        in2.read(buffer2, size);

        if(0 != memcmp(buffer1, buffer2, size))
            return false;

        remaining -= size;
    }

    return true;
}
When the files are binary, use memcmp not strcmp as \0 might appear as data.
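A quick illustration of why (the buffer contents are made up for the example):

#include <cstdio>
#include <cstring>

int main() {
    char a[] = {'x', '\0', '1'};
    char b[] = {'x', '\0', '2'};
    std::printf("strcmp: %d\n", std::strcmp(a, b));    // 0: stops at the '\0'
    std::printf("memcmp: %d\n", std::memcmp(a, b, 3)); // nonzero: sees byte 3
}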
Since you've allocated your arrays on the stack, they are filled with random values; they aren't zeroed out.
Secondly, strcmp will only compare up to the first NUL ('\0') byte, which, in a binary file, won't necessarily be at the end of the file, so you should really be using memcmp on your buffers. But even that will give unpredictable results here, because the buffers are uninitialized stack memory: even if you compare two identical files, the bytes past each file's final read may differ, so memcmp will most likely report that the files differ when they don't.
To get around this, first measure the length of each file in bytes, then use malloc or calloc to allocate buffers of that size and fill them with the files' actual contents. Then you can make a valid comparison of the binary contents of the two files. You'll also be able to handle files larger than 64K at that point, since you're allocating the buffers dynamically at run time.
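A minimal sketch of that approach (error handling trimmed; it assumes the files were opened in binary mode):

#include <cstdio>
#include <cstdlib>
#include <cstring>

long fileLength(FILE* f) {
    fseek(f, 0, SEEK_END); // measure by seeking to the end
    long len = ftell(f);
    rewind(f);
    return len;
}

int sameContents(FILE* f1, FILE* f2) {
    long len = fileLength(f1);
    if (len != fileLength(f2)) return 0;  // different sizes: not equal
    char* b1 = (char*) calloc(1, len);
    char* b2 = (char*) calloc(1, len);
    fread(b1, 1, len, f1);                // fill with the actual contents
    fread(b2, 1, len, f2);
    int same = memcmp(b1, b2, len) == 0;  // now the whole range is valid
    free(b1);
    free(b2);
    return same;
}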
Switch's code looks good to me, but if you want an exact comparison, the while condition and the return need to be altered:
int compareFile(FILE* f1, FILE* f2) {
    const int N = 10000;
    char buf1[N];
    char buf2[N];

    do {
        size_t r1 = fread(buf1, 1, N, f1);
        size_t r2 = fread(buf2, 1, N, f2);

        if (r1 != r2 || memcmp(buf1, buf2, r1)) {
            return 0; // Files are not equal
        }
    } while (!feof(f1) && !feof(f2));

    return feof(f1) && feof(f2);
}
Better to use fread and memcmp to avoid \0 character issues. Also, the !feof checks really should be || instead of && since there's a small chance that one file is bigger than the other and the smaller file's size is divisible by the buffer size.
int compareFile(FILE* f1, FILE* f2) {
    const int N = 10000;
    char buf1[N];
    char buf2[N];

    do {
        size_t r1 = fread(buf1, 1, N, f1);
        size_t r2 = fread(buf2, 1, N, f2);

        if (r1 != r2 || memcmp(buf1, buf2, r1)) {
            return 0;
        }
    } while (!feof(f1) || !feof(f2));

    return 1;
}
I'm looking to read from std::cin with a syntax as below (it is always int, int, int, char[]/str). What would be the fastest way to parse the data into an int array[3] and either a string or char array?
#NumberOfLines(i.e.10000000)
1,2,2,'abc'
2,2,2,'abcd'
1,2,3,'ab'
...1M+ to 10M+ more lines, always in the form of (int,int,int,str)
At the moment, I'm doing something along the lines of:
//unsync stdio
std::ios_base::sync_with_stdio(false);
std::cin.tie(NULL);

//read from cin
for(i in amount of lines in stdin){
    getline(cin, str);
    if(i < 3){
        int commaindex = str.find(',');
        string substring = str.substr(0, commaindex);
        array[i] = atoi(substring.c_str());
        str.erase(0, commaindex + 1);
    } else {
        label = str;
    }
    //assign array and label to other stuff and do other stuff, repeat
}
I'm quite new to C++ and recently learned profiling with Visual Studio, but I'm not the best at interpreting the results. IO takes up 68.2% and the kernel takes 15.8% of CPU usage. getline() accounts for 35.66% of the elapsed inclusive time.
Is there any way I can read large chunks at once to avoid calling getline() so much? I've been told fgets() is much faster; however, I'm unsure how to use it when I can't predict the number of characters to specify.
I've attempted to use scanf as follows; however, it was slower than the getline method. I've also used stringstreams, but that was incredibly slow.
scanf("%i,%i,%i,%s",&array[0],&array[1],&array[2],str);
Also, if it matters, it is run on a server with little memory available, so I think reading the entire input into a buffer would not be viable?
Thanks!
Update: Using @ted-lyngmo's approach, I gathered the results below.
time wc datafile
real 4m53.506s
user 4m14.219s
sys 0m36.781s
time ./a.out < datafile
real 2m50.657s
user 1m55.469s
sys 0m54.422s
time ./a.out datafile
real 2m40.367s
user 1m53.523s
sys 0m53.234s
You could use std::from_chars (and reserve() the approximate amount of lines you have in the file, if you store the values in a vector for example). I also suggest adding support for reading directly from the file. Reading from a file opened by the program is (at least for me) faster than reading from std::cin (even with sync_with_stdio(false)).
Example:
#include <algorithm> // std::for_each
#include <cctype>    // std::isspace
#include <charconv>  // std::from_chars
#include <cstdint>   // std::uintmax_t
#include <cstdio>    // std::perror
#include <fstream>
#include <iostream>
#include <iterator>  // std::istream_iterator
#include <limits>    // std::numeric_limits
#include <string>

struct foo {
    int a[3];
    std::string s;
};

std::istream& operator>>(std::istream& is, foo& f) {
    if(std::getline(is, f.s)) {
        std::from_chars_result fcr{f.s.data(), {}};
        const char* end = f.s.data() + f.s.size();

        // extract the numbers
        for(unsigned i = 0; i < 3 && fcr.ptr < end; ++i) {
            fcr = std::from_chars(fcr.ptr, end, f.a[i]);
            if(fcr.ec != std::errc{}) {
                is.setstate(std::ios::failbit);
                return is;
            }
            // find next non-whitespace
            do ++fcr.ptr;
            while(fcr.ptr < end &&
                  std::isspace(static_cast<unsigned char>(*fcr.ptr)));
        }

        // extract the string
        if(++fcr.ptr < end)
            f.s = std::string(fcr.ptr, end - 1);
        else
            is.setstate(std::ios::failbit);
    }
    return is;
}

std::ostream& operator<<(std::ostream& os, const foo& f) {
    for(int i = 0; i < 3; ++i) {
        os << f.a[i] << ',';
    }
    return os << '\'' << f.s << "'\n";
}

int main(int argc, char* argv[]) {
    std::ifstream ifs;
    if(argc >= 2) {
        ifs.open(argv[1]); // if a filename is given as argument
        if(!ifs) {
            std::perror(argv[1]);
            return 1;
        }
    } else {
        std::ios_base::sync_with_stdio(false);
        std::cin.tie(nullptr);
    }

    std::istream& is = argc >= 2 ? ifs : std::cin;

    // ignore the first line - it's of no use in this demo
    is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');

    // read all `foo`s from the stream
    std::uintmax_t co = 0;
    std::for_each(std::istream_iterator<foo>(is), std::istream_iterator<foo>(),
                  [&co](const foo& f) {
                      // Process each foo here
                      // Just counting them for demo purposes:
                      ++co;
                  });
    std::cout << co << '\n';
}
My test runs on a file with 1'000'000'000 lines with content looking like below:
2,2,2,'abcd'
2, 2,2,'abcd'
2, 2, 2,'abcd'
2, 2, 2, 'abcd'
Unix time wc datafile
1000000000 2500000000 14500000000 datafile
real 1m53.440s
user 1m48.001s
sys 0m3.215s
time ./my_from_chars_prog datafile
1000000000
real 1m43.471s
user 1m28.247s
sys 0m5.622s
From this comparison I think one can see that my_from_chars_prog is able to parse all the entries pretty fast. It was consistently faster than wc, a standard Unix tool whose only purpose is to count lines, words and characters.
I've got an issue with an sprintf buffer.
As you can see in the code below, I'm using sprintf to write a file name into the buffer so pFile can check whether there's a file named like that in the folder. If it's found, the buffer value is assigned to timecycles[numCycles], and numCycles is increased. Example: timecycles[0] = "timecyc1.dat". It works well, and as you can see in the console output it recognizes that only timecyc1.dat and timecyc5.dat exist in the folder. But when I read timecycles back with a for loop, both indexes have the value "timecyc9.dat", even though it should be "timecyc1.dat" for timecycles[0] and "timecyc5.dat" for timecycles[1]. Second thing: how can I write the code so that readTimecycles() returns char* timecycles[], so I could initialize it in the main function with something like char* timecycles[9] = readTimecycles()?
#include <iostream>
#include <cstdio>

char* timecycles[9];

void readTimecycles()
{
    char buffer[256];
    int numCycles = 0;
    FILE* pFile = NULL;

    for (int i = 1; i < 10; i++)
    {
        sprintf(buffer, "timecyc%d.dat", i);
        pFile = fopen(buffer, "r");
        if (pFile != NULL)
        {
            timecycles[numCycles] = buffer;
            numCycles++;
            std::cout << buffer << std::endl; //to see if the buffer is correct
        }
    }

    for (int i = 0; i < numCycles; i++)
    {
        std::cout << timecycles[i] << std::endl; //here's the issue with timecyc9.dat
    }
}

int main()
{
    readTimecycles();
    return 0;
}
With the assignment
timecycles[numCycles] = buffer;
you make all pointers point to the same buffer, since you only have a single buffer.
Since you're programming in C++ you could easily solve your problem by using std::string instead.
If I were to remake your code into something a little more C++-ish and less C-ish, it could look something like this:
#include <array>
#include <fstream>
#include <string>

std::array<std::string, 9> readTimeCycles()
{
    std::array<std::string, 9> timecycles;

    for (size_t i = 0; i < timecycles.size(); ++i)
    {
        // Format the file-name
        std::string filename = "timecyc" + std::to_string(i + 1) + ".dat";

        std::ifstream file(filename);
        if (file)
        {
            // File was opened okay
            timecycles[i] = filename;
        }
    }

    return timecycles;
}
References:
std::array
std::string
std::to_string
std::ifstream
The fundamental problem is that your notion of a string doesn't match what a 'char array' is in C++. In particular, you think that because you assign timecycles[numCycles] = buffer; the chars of the char array are somehow copied. But in C++ all that is copied is a pointer, so timecycles ends up with multiple pointers to the same buffer. And that's not to mention the problem you will have when you exit the readTimecycles function: at that point you will have multiple pointers to a buffer which no longer exists, as it is destroyed when the function returns.
The way to fix this is to use C++ code that does match your expectations. In particular a std::string will copy in the way you expect it to. Here's how you can change your code to use std::string
#include <string>
std::string timecycles[9];
timecycles[numCycles] = buffer; // now this really does copy a string
I'm trying to read a txt file and put it into a char array. But can I read files that contain different numbers of characters and put them into an array? Can I create a dynamic array to hold an unknown number of characters?
You can read a file of unknown size into a dynamic data structure like std::vector.
Alternatively, you can use new to allocate dynamic memory. However, vectors are more convenient, at least to me :).
#include <vector>
#include <iostream>
#include <fstream>
#include <string>

int main(int argc, char **argv)
{
    std::vector<std::string> content;

    if (argc != 2)
    {
        std::cout << "bad argument" << std::endl;
        return 0;
    }

    std::string file_name(argv[1]);
    std::ifstream file(file_name);
    if (!file)
    {
        std::cout << "can't open file" << std::endl;
        return 0;
    }

    std::string line = "";
    while (std::getline(file, line))
    {
        content.push_back(line);
        line = "";
    }

    for (std::vector<std::string>::iterator it = content.begin(); it != content.end(); ++it)
        std::cout << *it << std::endl;
}
Here is a solution using std::vector and std::string.
The program takes a file name as its first parameter, opens it, and reads it line by line.
Each line is written into the vector.
Then you can display your vector as I did at the end of the function.
EDIT: Because C++11 is the new standard, the program uses C++11, so you have to compile it with C++11 support (g++ -std=c++11 if you use g++).
I just tested it and it works perfectly.
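Since the answer also mentions new as an alternative, here is a minimal sketch of that variant (identifiers are illustrative):

#include <fstream>
#include <iostream>

int main(int argc, char** argv) {
    if (argc != 2) return 1;
    std::ifstream file(argv[1], std::ios::binary | std::ios::ate);
    if (!file) return 1;
    const std::streamsize size = file.tellg(); // opened at the end: tellg() is the size
    file.seekg(0, std::ios::beg);
    char* content = new char[size + 1];
    file.read(content, size);
    content[size] = '\0'; // terminate for C-style use
    std::cout << content;
    delete[] content;
}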
There may be library routines available which give you the size of the file without reading its contents. In that case you could get the size, allocate a full-sized buffer, and suck in the whole file at once [if your buffer is a simple char array, don't forget to add one to the size and put in the trailing null character].
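One such routine is stat() (POSIX; an assumption, since the answer doesn't name one). It reports the size without reading the contents:

#include <sys/stat.h>

long sizeOf(const char* path) {
    struct stat st;
    return stat(path, &st) == 0 ? (long) st.st_size : -1; // -1 on failure
}

Then allocate size + 1 bytes, read the whole file into the buffer, and store the trailing null character at the end, as described above.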
The best way is to use malloc(), realloc(), and free(), just as if it were an old C program. If you try to use a std::vector you will choke when approaching maximum RAM, since realloc() can grow and shrink in place (growing is contingent on the heap, while shrinking is guaranteed to work) while std::vector cannot do so.
In particular:
#include <iostream>
#include <tuple>
#include <cstdlib> // malloc, realloc, free
#include <new>     // std::bad_alloc

// TODO perhaps you want an auto-management class for this tuple.
// I'm not about to type up the whole auto_ptr, shared_ptr, etc.
// Mostly you don't do this enough to have to worry too hard.
std::tuple<char *, size_t> getarray(std::istream &input)
{
    size_t fsize = 0;                     // bytes read so far
    size_t asize = 0;                     // bytes allocated
    const size_t terminator = (size_t)-1; // largest representable size

    char *buf = static_cast<char *>(malloc(asize = 8192));
    if (!buf) throw std::bad_alloc();
    char *btmp;
    bool shift = true;                    // true while doubling is still possible

    do {
        try {
            input.read(buf + fsize, asize - fsize);
        } catch (...) {
            free(buf);
            throw;
        }
        const size_t got = static_cast<size_t>(input.gcount());
        if (got == 0) {
            // EOF: shrink the buffer to fit and hand it back
            btmp = static_cast<char *>(realloc(buf, fsize ? fsize : 1));
            if (btmp) buf = btmp;
            return std::tuple<char *, size_t>(buf, fsize);
        }
        fsize += got;
        if (fsize == asize) {
            // Buffer is full; grow it. Double while we can...
            if (shift) {
                if ((asize << 1) == 0)    // doubling would overflow size_t
                    shift = false;
                else {
                    btmp = static_cast<char *>(realloc(buf, asize << 1));
                    if (!btmp)
                        shift = false;    // fall back to linear growth
                    else {
                        asize <<= 1;
                        buf = btmp;
                    }
                }
            }
            // ...otherwise grow linearly.
            if (!shift) {
                btmp = static_cast<char *>(realloc(buf, asize += 8192));
                if (!btmp) {
                    free(buf);
                    throw std::bad_alloc();
                }
            }
        }
    } while (terminator - 8192 > fsize);

    free(buf);
    // Or perhaps something suitable.
    throw "File too big to fit in size_t";
}
I'm trying to split a massive QByteArray which contains UTF-8 encoded plain text (using whitespace as the delimiter) with the best performance possible. I found that I can achieve much better results if I convert the array to QString first. I tried using QString::split with a regexp, but the performance was horrendous. This code turned out to be way faster:
QMutex mutex;

QSet<QString> split(QByteArray body)
{
    QSet<QString> slova;
    QString s_body = QTextCodec::codecForMib(106)->toUnicode(body);
    QString current;

    for(int i = 0; i < body.size(); i++){
        if(s_body[i] == '\r' || s_body[i] == '\n' || s_body[i] == '\t' || s_body[i] == ' '){
            mutex.lock();
            slova.insert(current);
            mutex.unlock();
            current.clear();
            current.reserve(40);
        } else {
            current.push_back(s_body[i]);
        }
    }
    return slova;
}
"Slova" is a QSet<QString> currently, but I could use a std::set or any other format. This code is supposed to find how many unique words there are in the array, with the best performance possible.
Unfortunately, this code runs far from fast enough. I'm looking to squeeze the absolute maximum out of this.
Using callgrind, I found that the most gluttonous internal functions were:
QString::reallocData (18% absolute cost)
QString::append (10% absolute cost)
QString::operator= (8 % absolute cost)
QTextCodec::toUnicode (8% absolute cost)
Obviously, this has to do with memory allocation stemming from the push_back function. What is the most optimal way to solve this? Doesn't necessarily have to be a Qt solution - pure C or C++ are also acceptable.
Minimise the amount of copying you need to do. Keep the input buffer in UTF-8, and don't store std::string or QString in your set; instead, create a small class to reference the existing UTF-8 data:
#include <QString>

class stringref {
    const char *start;
    size_t length;
public:
    stringref(const char *start, const char *end);
    operator QString() const;
    bool operator<(const stringref& other) const;
};
This can encapsulate a substring of the UTF-8 input. You'll need to ensure that it doesn't outlive the input string; you could do this by clever use of std::shared_ptr, but if the code is reasonably self-contained, then it should be tractable enough to reason about the lifetime.
We can construct it from a pair of pointers into our UTF-8 data, and convert it to QString when we want to actually use it:
stringref::stringref(const char *start, const char *end)
    : start(start), length(end-start)
{}

stringref::operator QString() const
{
    return QString::fromUtf8(start, length);
}
You need to define operator< so you can use it in a std::set.
#include <cstring>

bool stringref::operator<(const stringref& other) const
{
    return length == other.length
        ? std::strncmp(start, other.start, length) < 0
        : length < other.length;
}
Note that we sort by length before dereferencing pointers, to reduce cache impact.
Now we can write the split method:
#include <set>
#include <QByteArray>

std::set<stringref> split(const QByteArray& a)
{
    std::set<stringref> words;

    // start and end
    const auto s = a.data(), e = s + a.length();
    // current word
    auto w = s;

    for (auto p = s; p <= e; ++p) {
        switch (*p) {
        default: break;
        case ' ': case '\r': case '\n': case '\t': case '\0':
            if (w != p)
                words.insert({w, p});
            w = p+1;
        }
    }
    return words;
}
The algorithm is pretty much yours, with the addition of the w!=p test so that runs of whitespace don't get counted.
Let's test it, and time the important bit:
#include <QDebug>
#include <chrono>

int main()
{
    QByteArray body{"foo bar baz\n foo again\nbar again "};
    // make it a million times longer
    for (int i = 0; i < 20; ++i)
        body.append(body);

    using namespace std::chrono;
    const auto start = high_resolution_clock::now();
    auto words = split(body);
    const auto end = high_resolution_clock::now();

    qDebug() << "Split"
             << body.length()
             << "bytes in"
             << duration_cast<duration<double>>(end - start).count()
             << "seconds";

    for (auto&& word: words)
        qDebug() << word;
}
I get:
Split 35651584 bytes in 1.99142 seconds
"bar"
"baz"
"foo"
"again"
Compiling with -O3 reduced that time to 0.6188 seconds, so don't forget to beg the compiler for help!
If that's still not fast enough, it's probably time to start looking at parallelising the task. You'll want to split the string into roughly equal lengths, but advance each split point to the next whitespace so that no word straddles two threads' worth of work. Each thread should create its own set of results, and the reduction step is then to merge the result sets. I won't provide a full solution for this, as that's another question in its own right; a rough sketch of the boundary adjustment follows.
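For illustration only, a sketch of just that boundary-adjustment step (the names are mine; the threads and the merge are deliberately left out):

#include <cctype>
#include <cstddef>
#include <vector>

std::vector<const char*> chunkBounds(const char* s, const char* e, int parts) {
    std::vector<const char*> bounds{s};
    const std::ptrdiff_t step = (e - s) / parts;
    for (int i = 1; i < parts; ++i) {
        const char* p = s + i * step;
        // advance to the next whitespace so no word straddles two chunks
        while (p < e && !std::isspace((unsigned char) *p)) ++p;
        bounds.push_back(p);
    }
    bounds.push_back(e);
    return bounds;
}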
Your largest cost, as suspected, is in push_back causing frequent reallocations as you append one character at a time. Why not search ahead, then append all of the data at once using QString::mid():
slova.insert(s_body.mid(beginPos, i - beginPos));
where beginPos holds the index of the start of the current substring and i is the index of the separator that ends it. Instead of appending each character to current before it is inserted into slova, the copy happens all at once. After copying a substring, search ahead for the next valid (non-separator) character and set beginPos equal to that index.
In (rough) code:
QString s_body = ...

//beginPos tells us the index of the current substring we are working
//with. -1 means the previous character was a separator
int beginPos = -1;

for (...) {
    //basically your if statement provided in the question as a function
    if (isSeparator(s_body[i])) {
        //ignore double white spaces, etc.
        if (beginPos != -1) {
            mutex.lock();
            slova.insert(s_body.mid(beginPos, i - beginPos));
            mutex.unlock();
            beginPos = -1; //back to "previous character was a separator"
        }
    } else if (beginPos == -1)
        //if beginPos is not valid and we are not on a separator, we
        //are at the start of a new substring.
        beginPos = i;
}
This approach will drastically reduce your overhead in heap allocations and eliminate QString::push_back() calls.
One final note: QByteArray also provides a mid() function. You can skip the conversion to QString entirely and work directly with the byte array.
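A sketch of that variant (using the question's whitespace set; this is safe on UTF-8 because ASCII bytes never occur inside multi-byte sequences):

#include <QByteArray>
#include <QSet>

QSet<QByteArray> splitBytes(const QByteArray& body)
{
    QSet<QByteArray> slova;
    int begin = -1;
    for (int i = 0; i < body.size(); ++i) {
        const char c = body.at(i);
        if (c == ' ' || c == '\r' || c == '\n' || c == '\t') {
            if (begin != -1)
                slova.insert(body.mid(begin, i - begin)); // one copy per word
            begin = -1;
        } else if (begin == -1) {
            begin = i; // start of a new word
        }
    }
    if (begin != -1)
        slova.insert(body.mid(begin)); // trailing word, if any
    return slova;
}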
The first thing I'd do if I were you is modify your code so it isn't locking and unlocking a QMutex for every word it inserts into the QSet -- that's pure overhead. Either lock the QMutex only once, at the beginning of the loop, and unlock it again after the loop terminates; or better yet, insert into a QSet that isn't accessible from any other thread, so that you don't need to lock any QMutexes at all.
With that out of the way, the second thing to do is eliminate as many heap allocations as possible. Ideally you'd execute the entire parse without ever allocating or freeing any dynamic memory at all; my implementation below does that (well, almost -- the unordered_set might do some internal allocations, but it probably won't). On my computer (a 2.7GHz Mac Mini) I measure a processing speed of around 11 million words per second, using the Gutenberg ASCII text of Moby Dick as my test input.
Note that due to the backward-compatible encoding that UTF-8 uses, this program will work equally well with either UTF-8 or ASCII input.
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unordered_set>
// Loads in a text file from disk into an in-memory array
// Expected contents of the file are ASCII or UTF8 (doesn't matter which).
// Note that this function appends a space to the end of the returned array
// That way the parsing function doesn't have to include a special case
// since it is guaranteed that every word in the array ends with whitespace
static char * LoadFile(const char * fileName, unsigned long * retArraySizeBytes)
{
    char * ret = NULL;
    *retArraySizeBytes = 0;

    FILE * fpIn = fopen(fileName, "r");
    if (fpIn)
    {
        if (fseek(fpIn, 0L, SEEK_END) == 0)
        {
            const unsigned long fileSizeBytes = ftell(fpIn);
            const unsigned long arraySizeBytes = *retArraySizeBytes = fileSizeBytes+1; // +1 because I'm going to append a space to the end
            rewind(fpIn);

            ret = new char[arraySizeBytes];
            if (fread(ret, 1, fileSizeBytes, fpIn) == fileSizeBytes)
            {
                ret[fileSizeBytes] = ' '; // appending a space allows me to simplify the parsing step
            }
            else
            {
                perror("fread");
                delete [] ret;
                ret = NULL;
            }
        }
        else perror("fseek");

        fclose(fpIn);
    }
    return ret;
}
// Gotta provide our own equality-testing function otherwise unordered_set will just compare pointer values
struct CharPointersEqualityFunction
{
    bool operator() (char * s1, char * s2) const {return strcmp(s1, s2) == 0;}
};

// Gotta provide our own hashing function otherwise unordered_set will just hash the pointer values
struct CharPointerHashFunction
{
    size_t operator() (char * str) const
    {
        // djb2 by Dan Bernstein -- fast enough and simple enough
        unsigned long hash = 5381;
        int c; while((c = *str++) != 0) hash = ((hash << 5) + hash) + c;
        return (size_t) hash;
    }
};

typedef std::unordered_set<char *, CharPointerHashFunction, CharPointersEqualityFunction > CharPointerUnorderedSet;
int main(int argc, char ** argv)
{
    if (argc < 2)
    {
        printf("Usage: ./split_words filename\n");
        return 10;
    }

    unsigned long arraySizeBytes;
    char * buf = LoadFile(argv[1], &arraySizeBytes);
    if (buf == NULL)
    {
        printf("Unable to load input file [%s]\n", argv[1]);
        return 10;
    }

    CharPointerUnorderedSet set;
    set.reserve(100000); // trying to size (set) big enough that no reallocations will be necessary during the parse

    struct timeval startTime;
    gettimeofday(&startTime, NULL);

    // The actual parsing of the text is done here
    int wordCount = 0;
    char * wordStart = buf;
    char * wordEnd = buf;
    char * bufEnd = &buf[arraySizeBytes];
    while(wordEnd < bufEnd)
    {
        if (isspace((unsigned char) *wordEnd)) // cast avoids UB on negative chars
        {
            if (wordEnd > wordStart)
            {
                *wordEnd = '\0';
                set.insert(wordStart);
                wordCount++;
            }
            wordStart = wordEnd+1;
        }
        wordEnd++;
    }

    struct timeval endTime;
    gettimeofday(&endTime, NULL);

    unsigned long long startTimeMicros = (((unsigned long long)startTime.tv_sec)*1000000) + startTime.tv_usec;
    unsigned long long endTimeMicros   = (((unsigned long long)  endTime.tv_sec)*1000000) +   endTime.tv_usec;
    double secondsElapsed = ((double)(endTimeMicros-startTimeMicros))/1000000.0;

    printf("Parsed %i words (%zu unique words) in %f seconds, aka %.0f words/second\n", wordCount, set.size(), secondsElapsed, wordCount/secondsElapsed);
    //for (const auto& elem: set) printf("word=[%s]\n", elem);

    delete [] buf;
    return 0;
}
I have seen many posts but didn't find something like I want.
I am getting the wrong output:
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ...... // maybe this is the EOF character
and the program goes into an infinite loop.
My algorithm:
1. Go to the end of the file.
2. Decrease the position of the pointer by 1 and read character by character.
3. Exit if we found our 10 lines or we reach the beginning of the file.
4. Now I will scan the full file till EOF and print the lines. //not implemented in code
code:
#include<iostream>
#include<stdio.h>
#include<conio.h>
#include<stdlib.h>
#include<string.h>
using namespace std;

int main()
{
    FILE *f1=fopen("input.txt","r");
    FILE *f2=fopen("output.txt","w");
    int i,j,pos;
    int count=0;
    char ch;

    int begin=ftell(f1);
    // GO TO END OF FILE
    fseek(f1,0,SEEK_END);
    int end = ftell(f1);
    pos=ftell(f1);

    while(count<10)
    {
        pos=ftell(f1);
        // FILE IS LESS THAN 10 LINES
        if(pos<begin)
            break;
        ch=fgetc(f1);
        if(ch=='\n')
            count++;
        fputc(ch,f2);
        fseek(f1,pos-1,end);
    }
    return 0;
}
UPD 1:
Changed code: it now has just one error - if the input has lines like
3enil
2enil
1enil
it prints only 10 lines:
line1
line2
line3ÿine1
line2
line3ÿine1
line2
line3ÿine1
line2
line3ÿine1
line2
PS:
1. I am working on Windows in Notepad++.
2. This is not homework.
3. I want to do it without using any extra memory or the STL.
4. I am practicing to improve my basic knowledge, so please don't post about ready-made utilities (like tail -5 etc.).
Please help me improve my code.
Comments are in the code:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *in, *out;
    int count = 0;
    long int pos;
    char s[100];

    in = fopen("input.txt", "r");
    /* always check return of fopen */
    if (in == NULL) {
        perror("fopen");
        exit(EXIT_FAILURE);
    }
    out = fopen("output.txt", "w");
    if (out == NULL) {
        perror("fopen");
        exit(EXIT_FAILURE);
    }

    fseek(in, 0, SEEK_END);
    pos = ftell(in);
    /* Don't write each char on output.txt, just search for '\n' */
    while (pos) {
        fseek(in, --pos, SEEK_SET); /* seek from begin */
        if (fgetc(in) == '\n') {
            if (count++ == 10) break;
        }
    }
    /* Write line by line, is faster than fputc for each char */
    while (fgets(s, sizeof(s), in) != NULL) {
        fprintf(out, "%s", s);
    }

    fclose(in);
    fclose(out);
    return 0;
}
There are a number of problems with your code. The most important one is that you never check that any of the functions succeeded. Saving the result of ftell in an int isn't a very good idea either. Then there's the test pos < begin; this can only occur if there was an error. And you're putting the result of fgetc in a char, which loses information. The first read you do is at the end of file, so it will fail (and once a stream enters an error state, it stays there). Finally, you can't reliably do arithmetic on the values returned by ftell (except under Unix) if the file was opened in text mode.
Oh, and there is no "EOF character"; 'ÿ' is a perfectly valid character (0xFF in Latin-1). Once you assign the return value of fgetc to a char, you've lost any possibility to test for end of file.
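The portable idiom keeps fgetc's result in an int, so that EOF (an out-of-band value) stays distinguishable from a legitimate 0xFF byte:

#include <cstdio>

long countBytes(FILE* f) {
    long n = 0;
    int c; // int, not char
    while ((c = std::fgetc(f)) != EOF)
        ++n;
    return n;
}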
I might add that reading backwards one character at a time is
extremely inefficient. The usual solution would be to allocate
a sufficiently large buffer, then count the '\n' in it.
EDIT:
Just a quick bit of code to give the idea:
std::string
getLastLines( std::string const& filename, int lineCount )
{
    size_t const granularity = 100 * lineCount;
    std::ifstream source( filename.c_str(), std::ios_base::binary );
    source.seekg( 0, std::ios_base::end );
    size_t size = static_cast<size_t>( source.tellg() );
    std::vector<char> buffer;
    int newlineCount = 0;
    while ( source
            && buffer.size() != size
            && newlineCount < lineCount ) {
        buffer.resize( std::min( buffer.size() + granularity, size ) );
        source.seekg( -static_cast<std::streamoff>( buffer.size() ),
                      std::ios_base::end );
        source.read( buffer.data(), buffer.size() );
        newlineCount = std::count( buffer.begin(), buffer.end(), '\n' );
    }
    std::vector<char>::iterator start = buffer.begin();
    while ( newlineCount > lineCount ) {
        start = std::find( start, buffer.end(), '\n' ) + 1;
        -- newlineCount;
    }
    std::vector<char>::iterator end = std::remove( start, buffer.end(), '\r' );
    return std::string( start, end );
}
This is a bit weak in the error handling; in particular, you probably want to distinguish between the inability to open a file and any other errors. (No other errors should occur, but you never know.)
Also, this is purely Windows, and it supposes that the actual file contains pure text and doesn't contain any '\r' that isn't part of a CRLF. (For Unix, just drop the next-to-last line.)
This can be done very efficiently using a circular array. No additional buffer is required.
void printlast_n_lines(const char* fileName, int n){

    const int k = n;
    ifstream file(fileName);
    vector<string> l(k);
    int size = 0;

    while(getline(file, l[size%k])){ //this is just a circular array
        size++;
    }

    //start of circular array & size of it
    int start = size > k ? (size%k) : 0; //this gets the start of the last k lines
    int count = min(k, size); // no of lines to print

    for(int i = 0; i < count; i++){
        cout << l[(start+i)%k] << '\n'; // start from in between and wrap around due to the remainder till all counts are covered
    }
}
Please provide feedback.
int end = ftell(f1);
pos=ftell(f1);
This tells you the last position in the file, i.e. EOF.
When you read there, you get the EOF error, and the pointer wants to move one space forward...
So, I recommend decreasing the current position by one.
Or put fseek(f1, -2, SEEK_CUR) at the beginning of the while loop to make up for the fgetc and go one position back...
I believe you are using fseek wrong; check man fseek.
Try this:
fseek(f1, -2, SEEK_CUR);
//1 to neutralize the change from fgetc
//and 1 to move backward
Also you should set the position at the beginning to the last element:
fseek(f1, -1, SEEK_END).
You don't need the end variable.
You should check the return values of all functions (fgetc, fseek and ftell); it is good practice. I don't know if this code will work with empty files or similar edge cases.
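For example, both fseek and ftell report failure through their return values:

#include <cstdio>

bool stepBack(FILE* f) {
    if (std::fseek(f, -2, SEEK_CUR) != 0) // nonzero means the seek failed
        return false;
    if (std::ftell(f) == -1L) {           // -1L means ftell failed
        std::perror("ftell");
        return false;
    }
    return true;
}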
Use fseek(f1, -2, SEEK_CUR); to go back.
I wrote this code; it works, you can try it:
#include "stdio.h"
int main()
{
int count = 0;
char * fileName = "count.c";
char * outFileName = "out11.txt";
FILE * fpIn;
FILE * fpOut;
if((fpIn = fopen(fileName,"r")) == NULL )
printf(" file %s open error\n",fileName);
if((fpOut = fopen(outFileName,"w")) == NULL )
printf(" file %s open error\n",outFileName);
fseek(fpIn,0,SEEK_END);
while(count < 10)
{
fseek(fpIn,-2,SEEK_CUR);
if(ftell(fpIn)<0L)
break;
char now = fgetc(fpIn);
printf("%c",now);
fputc(now,fpOut);
if(now == '\n')
++count;
}
fclose(fpIn);
fclose(fpOut);
}
I would use two streams to print the last n lines of a file.
This runs in O(N) time and O(1) extra space, where N is the total number of lines:
#include<bits/stdc++.h>
using namespace std;

int main(){
    // read the last n lines of a file
    ifstream f("file.in");
    ifstream g("file.in");

    // move the f stream n lines ahead.
    int n;
    cin >> n;
    string line;
    for(int i = 0; i < n; ++i) getline(f, line);

    // move f and g at the same pace.
    for(; getline(f, line); ){
        getline(g, line);
    }

    // g now has only the last n lines left to read.
    for(; getline(g, line); )
        cout << line << endl;
}
An alternative with O(N) runtime and O(n) space, where n is the number of lines to keep, is to use a queue:
ifstream fin("file.in");
int k;
cin >> k;
queue<string> Q;
string line;
for(; getline(fin, line); ){
if(Q.size() == k){
Q.pop();
}
Q.push(line);
}
while(!Q.empty()){
cout << Q.front() << endl;
Q.pop();
}
Here is a solution in C++.
#include <iostream>
#include <string>
#include <exception>
#include <cstdlib>

int main(int argc, char *argv[])
{
    auto& file = std::cin;
    int n = 5;
    if (argc > 1) {
        try {
            n = std::stoi(argv[1]);
        } catch (std::exception& e) {
            std::cout << "Error: argument must be an int" << std::endl;
            std::exit(EXIT_FAILURE);
        }
    }

    file.seekg(0, file.end);
    n = n + 1; // Add one so the loop stops at the newline above
    while (file.tellg() != 0 && n) {
        file.seekg(-1, file.cur);
        if (file.peek() == '\n')
            n--;
    }

    if (file.peek() == '\n') // If we stop in the middle we will be at a newline
        file.seekg(1, file.cur);

    std::string line;
    while (std::getline(file, line))
        std::cout << line << std::endl;

    std::exit(EXIT_SUCCESS);
}
Build:
$ g++ <SOURCE_NAME> -o last_n_lines
Run:
$ ./last_n_lines 10 < <SOME_FILE>