Delete a line in an ofstream in C++ - c++

I want to erase lines within a file. I know you can store the content of the file (in a vector for example), erase the line and write again. However, it feels very cumbersome, and not very efficient if the file gets bigger.
Does anyone know of a better, more efficient, more elegant way of doing it?

On most file-systems, this is the only option you have, short of switching to an actual database.
However, if you find yourself in this situation (i.e. very large files, with inserts/deletes in the middle), consider whether you can do something like maintaining a bitmap at the top of the file, where each bit represents one line of your file. To "delete" a line, simply flip the corresponding bit value.
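The bitmap idea above could be sketched roughly as follows. The layout (a fixed 16-byte bitmap followed by the lines) and the names create_file, mark_deleted and live_lines are assumptions made for illustration, not part of any library; a real format would also need a way to grow the bitmap.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical layout: kHeaderBytes of bitmap, then the lines.
// Bit i set means "line i (0-based) is deleted".
constexpr std::size_t kHeaderBytes = 16;             // room for 128 lines
constexpr std::size_t kMaxLines = kHeaderBytes * 8;

// Writes an all-zero bitmap followed by the lines.
void create_file(const std::string& path, const std::vector<std::string>& lines) {
    assert(lines.size() <= kMaxLines);
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out << std::string(kHeaderBytes, '\0');           // every line starts out live
    for (const std::string& line : lines) out << line << '\n';
}

// "Deletes" a line by setting one bit; no other byte of the file is touched.
bool mark_deleted(const std::string& path, std::size_t line) {
    if (line >= kMaxLines) return false;
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!f) return false;
    const std::streamoff pos = static_cast<std::streamoff>(line / 8);
    char byte = 0;
    f.seekg(pos);
    if (!f.get(byte)) return false;
    const unsigned char updated = static_cast<unsigned char>(
        static_cast<unsigned char>(byte) | (1u << (line % 8)));
    f.seekp(pos);
    f.put(static_cast<char>(updated));
    return static_cast<bool>(f);
}

// Returns the lines whose bit is still clear.
std::vector<std::string> live_lines(const std::string& path) {
    std::vector<std::string> result;
    std::ifstream in(path, std::ios::binary);
    std::string header(kHeaderBytes, '\0');
    if (!in.read(&header[0], static_cast<std::streamsize>(kHeaderBytes))) return result;
    std::string line;
    for (std::size_t i = 0; std::getline(in, line); ++i) {
        const bool deleted = i < kMaxLines &&
            ((static_cast<unsigned char>(header[i / 8]) >> (i % 8)) & 1u) != 0;
        if (!deleted) result.push_back(line);
    }
    return result;
}
```

The deleted lines still occupy space on disk; an occasional full rewrite would be needed to reclaim it.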

There's nothing particularly magical about disk files. They still like to store their data in contiguous areas (typically called something like "blocks"). They don't have ways of leaving data-free holes in the middle of these areas. So if you want to "remove" three bytes from the middle of one of these areas, something somewhere is going to have to accomplish this by moving everything else in that area back by three bytes. No, it is not efficient.
This is why text editors (which have to do this kind of thing a lot) tend to load as much of the file as possible (if not all of it) into RAM, where moving data around is much faster. They typically only write changes back to disk when requested (or periodically). If you are going to have to make lots of changes like this, I'd suggest taking a page from their book and doing something similar.

The BerkeleyDB (dbopen(3)) has an access method called DB_RECNO. This allows one to manipulate files with arbitrary lengths using any sort of record delimiter. The default uses variable-length records with unix newlines as delimiters. You then access each "record" using an integer index. Using this, you can delete arbitrary lines from your text file. This isn't specific to C++, but if you are on most Unix/Linux systems, this API is already available to you.

Related

Why must files be entirely rewritten when there are arrays in C++?

From what I've read on the internet and in books, when editing a text file, or any file for that matter, the file must be entirely rewritten; you can't just insert data into a file and save it.
If this is so, how can there be arrays in programming languages? In C++ I can explicitly modify certain values in the middle of arrays. To me, this seems like a demonstration of the modification of one or two bytes in the middle of a group of bytes.
The only two possible solutions I have thought of are
There is some funky stuff going on behind-the-scenes in C++, so it seems like only one or two bytes are being modified, but the array is actually entirely rewritten.
Then, after thinking about it, and especially after typing it out, I realized that the aforementioned solution seems really, really dumb and totally not true, because there are things like addresses and pointers, and the performance sounds awful. So I thought that maybe files are not entirely rewritten; only everything after the first point where new data was inserted is rewritten. This seems much more plausible to me, and makes sense.
What is the difference between writing data to a file and writing data to an array?
You can change the values of particular locations in an array without needing to re-write the whole thing. However, you cannot insert new values into the middle of an array without shifting everything following the new values in order to make room.
Similarly, you can overwrite sections of a file without copying it (although the underlying storage technology may need to re-write an entire storage block in order to change a single byte within that block), but you can't insert new data without somehow making room for it. Text editors (and editors for more complicated file formats) are designed for random-access modifications that do not preserve length, so they will typically re-write the entire file regardless of what changed rather than trying to optimize for length-preserving edits.
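A minimal sketch of that difference, using std::fstream. The function names overwrite_at and insert_at are made up for this example. Overwriting touches only the bytes written, while inserting has to read and rewrite everything after the insertion point.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Overwrites bytes in place. The file length is unchanged as long as the
// write stays inside the existing contents.
bool overwrite_at(const std::string& path, std::streamoff offset, const std::string& bytes) {
    assert(offset >= 0);
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!f) return false;
    f.seekp(offset);
    f.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(f);
}

// Inserting has no in-place form: the tail after the insertion point is read
// into memory and written back after the new data.
bool insert_at(const std::string& path, std::streamoff offset, const std::string& bytes) {
    assert(offset >= 0);
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!f) return false;
    f.seekg(offset);
    const std::string tail((std::istreambuf_iterator<char>(f)),
                           std::istreambuf_iterator<char>());
    f.clear();                 // reset any state left by reading to the end
    f.seekp(offset);
    f << bytes << tail;
    return static_cast<bool>(f);
}
```

For a file containing "hello world", overwrite_at(path, 6, "WORLD") changes five bytes, whereas insert_at(path, 5, ",") rewrites the whole remainder of the file.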

Getting information from a file without traversing its contents

This question made me search for what else I can get from a file without traversing its contents (that is, without reading the contents using ifstream, getc, etc.).
Other than file size and number of characters, what other information can I gather? I searched fseek, I found I can use SEEK_SET, SEEK_CUR and SEEK_END, which only allow me to find the end of the file, start of the file and current pointer.
In order to make it a question, I specifically want to ask:
Can occurrences of some character or type of character (newline etc) be counted?
Can its contents be matched with a certain template?
Is using these methods faster than reading the file multiple times?
And I am asking about Microsoft Windows, not Linux.
1) No, because searching for something under unpredictable conditions requires thoroughly examining the contents, and examining means reading. Of course, you may collect some statistics beforehand, but you need to traverse your data at least once. You can use other applications to do this implicitly, but they will also traverse your file from the very beginning to the end. You may organize your file in some way to obtain the necessary info with a minimal number of read operations, but that all depends on your task, and there is no general approach (because any general approach comes down to examining the whole source structure).
2) Also no (see above).
3) Yes. Store as much as possible (or required by task) in memory (that's called caching). For example, use mapping (See MapViewOfFile for Windows and mmap(2) on *nix systems), this uses some in-system caching mechanism.
1) No
2) No
3) Depends on whether there's an actual need to read the file multiple times.
There are no miracles here. The former question had a "shortcut" because the number of characters in the file equals its size in bytes (strictly speaking, an ANSI text file is considered a sequence of characters, each represented by a single byte).
The stat structure contains information about the file, including permissions, ownership, size, access and creation date info. As for metadata, maybe there's an API to tie into a Windows search database that might allow searching on other criteria, like content attributes (I'm a Linux guy, usually, so I don't know what Windows offers in this respect).

Search for multiple words in very large text files (10 GB) using C++ the fastest way

I have this program where I have to search for specific values and their line numbers in a very large text file, and there might be multiple occurrences of the same value.
I've tried a simple C++ program which reads the text file line by line and searches for the value using strstr, but it's taking a very long time.
I also tried to use a system command using grep, but still it's taking a lot of time; not as long as before, but it's still too much time.
I was searching for a library I can use to speed up the search.
Any help and suggestions? Thank you :)
There are two issues concerning the speed: the time it takes to actually
read the data, and the time it takes to search.
Generally speaking, the fastest way to read a file is to mmap it (or
the equivalent under Windows). This can get complicated if the entire
file won't fit into the address space, but you mention 10GB in the
header; if searching is all you do in the program, this shouldn't create
any problems.
More generally, if speed is a problem, avoid using getline on a
string. Reading large blocks, and picking the lines up (as char[])
out of them, without copying, is significantly faster. (As a simple
compromise, you may want to copy when a line crosses a block boundary.
If you're dealing with blocks of a MB or more, this shouldn't be too
often; I've used this technique on older, 16 bit machines, with blocks
of 32KB, and still gotten a significant performance improvement.)
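A rough sketch of that block-reading technique, under the assumption that lines end in '\n'. The function name find_lines and the 1 MB default block size are choices made for this example. Lines that lie entirely inside a block are searched in place; only a line that crosses a block boundary is copied.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Returns the 1-based numbers of the lines that contain `needle`.
std::vector<std::size_t> find_lines(const std::string& path, const std::string& needle,
                                    std::size_t block_size = 1 << 20) {
    assert(!needle.empty() && block_size > 0);
    std::vector<std::size_t> hits;
    std::ifstream in(path, std::ios::binary);
    std::vector<char> block(block_size);
    std::string carry;                    // partial line left over from earlier blocks
    std::size_t line_no = 1;
    while (in) {
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        std::size_t start = 0;            // start of the current line within the block
        for (std::size_t i = 0; i < got; ++i) {
            if (block[i] != '\n') continue;
            if (carry.empty()) {
                // The whole line is inside this block: search without copying.
                const auto first = block.begin() + static_cast<std::ptrdiff_t>(start);
                const auto last = block.begin() + static_cast<std::ptrdiff_t>(i);
                if (std::search(first, last, needle.begin(), needle.end()) != last)
                    hits.push_back(line_no);
            } else {
                // The line began in an earlier block: finish the copy, then search it.
                carry.append(block.data() + start, i - start);
                if (carry.find(needle) != std::string::npos) hits.push_back(line_no);
                carry.clear();
            }
            ++line_no;
            start = i + 1;
        }
        carry.append(block.data() + start, got - start);   // keep the unfinished line
    }
    if (!carry.empty() && carry.find(needle) != std::string::npos)
        hits.push_back(line_no);          // last line without a trailing newline
    return hits;
}
```

With a memory-mapped file the same loop would run over the mapping directly and the carry buffer would not be needed.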
With regards to searching, if you're searching for a single, fixed
string (not a regular expression or other pattern matching), you might
want to try a Boyer-Moore (BM) search. If the string you're searching for is
reasonably long, this can make a significant difference over other
search algorithms. (I think that some implementations of grep will
use this if the search pattern is in fact a fixed string, and is
sufficiently long for it to make a difference.)
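Since C++17 the standard library ships a Boyer-Moore searcher, so a fixed-string search over a buffer already in memory might look like the sketch below. The function name find_all is made up for this example, and the buffer is a std::string only for simplicity; a mapped region or a block read from disk could be searched the same way through pointers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Returns the byte offsets of every (possibly overlapping) occurrence of
// `pattern` in `text`, using std::boyer_moore_searcher.
std::vector<std::size_t> find_all(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());
    std::vector<std::size_t> offsets;
    // The searcher preprocesses the pattern once and is reused for every search.
    const std::boyer_moore_searcher<std::string::const_iterator> searcher(
        pattern.begin(), pattern.end());
    auto it = text.begin();
    while ((it = std::search(it, text.end(), searcher)) != text.end()) {
        offsets.push_back(static_cast<std::size_t>(it - text.begin()));
        ++it;                              // continue one byte past this match
    }
    return offsets;
}
```

The preprocessing cost pays off mostly for longer patterns; for very short ones a plain search may be just as fast.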
Use multiple threads. Each thread can be responsible for searching through a portion of the file. For example, on a 4-core machine, spawn 12 threads. The first thread looks through the first 8% of the file, the second thread the second 8% of the file, etc. You will want to tune the number of threads per core to keep the CPU fully utilized. Since this is an I/O-bound operation, you may never reach 100% CPU utilization.
Feeding data to the threads will be a bottleneck using this design. Memory mapping the file might help somewhat, but at the end of the day the disk can only read one sector at a time. This will be a bottleneck that you will be hard pressed to resolve. You might consider starting one thread that does nothing but read all the data into memory and kick off search threads as the data loads up.
Since files are sequential beasts, searching from start to end is something that you may not get around; however, there are a couple of things you could do.
If the data is static, you could generate a smaller lookup file (alternatively, with offsets into the main file). This works well if the same string is repeated multiple times, making the index file much smaller. If the file is dynamic, you may need to regenerate the index file occasionally (offline).
Instead of reading line by line, read larger chunks from the file, like several MB, to speed up I/O.
If you'd like to use a library, you could use Xapian.
You may also want to try tokenizing your text before doing the search. I'd also suggest trying regex, but it will take a long time if you don't have an index on that text, so I'd definitely suggest trying Xapian or some other search engine.
If your big text file does not change often then create a database (for example SQLite) with a table:
create table word_line_numbers
(word varchar(100), line_number integer);
Read your file and insert a record in database for every word with something like this:
insert into word_line_numbers(word, line_number) values ('foo', 13452);
insert into word_line_numbers(word, line_number) values ('foo', 13421);
insert into word_line_numbers(word, line_number) values ('bar', 1421);
Create an index of words:
create index word_line_numbers_idx on word_line_numbers(word);
And then you can find line numbers for words fast using this index:
select line_number from word_line_numbers where word='foo';
For added speed (because of smaller database size) and complexity you can use 2 tables: words(word_id integer primary key, word not null) and word_lines(word_id integer not null references words, line_number integer not null).
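If the index is small enough to keep in memory, the same word-to-line-number mapping can be sketched in C++ without a database. The alias WordIndex and the function build_index are hypothetical names, and splitting words on whitespace is an assumption; unlike the table above, this version records each line at most once per word.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Maps each word to the 1-based numbers of the lines it appears on.
using WordIndex = std::unordered_map<std::string, std::vector<std::size_t>>;

WordIndex build_index(std::istream& in) {
    WordIndex index;
    std::string line;
    for (std::size_t line_no = 1; std::getline(in, line); ++line_no) {
        std::istringstream words(line);
        std::string word;
        while (words >> word) {                       // whitespace-separated tokens
            std::vector<std::size_t>& lines = index[word];
            // Line numbers arrive in increasing order, so checking the last
            // entry is enough to avoid duplicates.
            if (lines.empty() || lines.back() != line_no) lines.push_back(line_no);
        }
    }
    return index;
}
```

Building the index still reads the whole file once; the gain comes from answering many later queries without rereading it.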
I'd try first loading as much of the file into the RAM as possible (memory mapping of the file is a good option) and then search concurrently in parts of it on multiple processors. You'll need to take special care near the buffer boundaries to make sure you aren't missing any words. Also, you may want to try something more efficient than the typical strstr(), see these:
Boyer–Moore string search algorithm
Knuth–Morris–Pratt algorithm
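The boundary care mentioned above can be illustrated with a small sketch. Each chunk owns the matches that start inside it, and is allowed to read pattern.size() - 1 bytes past its end so that a match straddling two chunks is counted exactly once. The names count_in_chunk and count_chunked are made up here, and the chunks are processed in a plain loop; in a real program each call to count_in_chunk could be handed to its own thread, since the calls share only read-only data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Counts the matches that START in [begin, end). The search may read up to
// pattern.size() - 1 bytes beyond `end` to finish a straddling match.
std::size_t count_in_chunk(const std::string& text, const std::string& pattern,
                           std::size_t begin, std::size_t end) {
    assert(!pattern.empty() && begin <= end && end <= text.size());
    const std::size_t limit = std::min(text.size(), end + pattern.size() - 1);
    auto first = text.begin() + static_cast<std::ptrdiff_t>(begin);
    const auto last = text.begin() + static_cast<std::ptrdiff_t>(limit);
    std::size_t count = 0;
    while ((first = std::search(first, last, pattern.begin(), pattern.end())) != last) {
        ++count;
        ++first;
    }
    return count;
}

// Splits the text into `parts` roughly equal chunks and sums the per-chunk counts.
std::size_t count_chunked(const std::string& text, const std::string& pattern,
                          std::size_t parts) {
    assert(parts > 0);
    const std::size_t chunk = (text.size() + parts - 1) / parts;
    std::size_t total = 0;
    for (std::size_t begin = 0; begin < text.size(); begin += chunk) {
        const std::size_t end = std::min(text.size(), begin + chunk);
        total += count_in_chunk(text, pattern, begin, end);
    }
    return total;
}
```

Because ownership is decided by where a match starts, the result does not depend on how many chunks are used.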

C++ Qt WordCount and large data sets

I want to count word occurrences in a set of plain text files, just like here: http://doc.trolltech.com/4.5/qtconcurrent-wordcount-main-cpp.html
The problem is that I need to process a very large number of plain text files, so my result stored in a QMap could not fit into memory.
I googled the external memory (file-based) merge sort algorithm, but I'm too lazy to implement it myself. So I want to divide the result set into portions that each fit into memory, then store these portions in files on disk, then call a magic function mergeSort(QList, result_file) and have the final result in result_file.
Does anyone know a Qt-compatible implementation of this algorithm?
In short, I'm looking for an analog of Python's heapq.merge (http://docs.python.org/library/heapq.html#heapq.merge), but for Qt containers.
You might wanna check out this one:
http://stxxl.sourceforge.net/
It's not exactly what you are looking for (close enough though), but I guess you will not find exactly what you want working with Qt lists. Since you are implementing the algorithm that creates this list, changing its type shouldn't be a problem. As far as I remember, you can use standard STL sorting algorithms on those lists. The only remaining problem is performance.
I presume that the map contains the association between the word and the number of occurrences. In this case, why do you say you have such a significant memory consumption? How many distinct words and forms could you have, and what is the average memory consumption for one word?
Considering 1,000,000 words, with 1K memory consumption per word (that includes the word text and the QMap-specific storage), that would lead to approximately 1GB of memory, which... doesn't seem so much to me.
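For reference, the merge step the question asks about (the role heapq.merge plays in Python) can be sketched with std::priority_queue. This is a plain standard-library illustration, not a Qt or STXXL API; the names Entry, Run and merge_runs are hypothetical, and the sorted runs are in-memory vectors standing in for the partial result files on disk.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::size_t>;   // (word, count)
using Run = std::vector<Entry>;                       // sorted by word

// K-way merge of sorted runs; counts of equal words are summed.
Run merge_runs(const std::vector<Run>& runs) {
    using Cursor = std::pair<std::size_t, std::size_t>;   // (run index, position)
    // Orders the heap so that the cursor pointing at the smallest word is on top.
    auto later = [&runs](const Cursor& a, const Cursor& b) {
        return runs[a.first][a.second].first > runs[b.first][b.second].first;
    };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(later)> heap(later);
    for (std::size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.push(Cursor(r, 0));

    Run merged;
    while (!heap.empty()) {
        const Cursor c = heap.top();
        heap.pop();
        const Entry& e = runs[c.first][c.second];
        if (!merged.empty() && merged.back().first == e.first)
            merged.back().second += e.second;         // same word seen in another run
        else
            merged.push_back(e);
        if (c.second + 1 < runs[c.first].size())
            heap.push(Cursor(c.first, c.second + 1)); // advance within that run
    }
    return merged;
}
```

In an external-memory version each cursor would read the next entry from its file and the merged entries would be written straight to the result file, so only one entry per run needs to be in memory at a time.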

C++ fstream Erase the file contents from a selected Point

I need to erase the file contents from a selected point (C++ fstream). Which function should I use?
I have written objects to the file, and I need to delete these objects in the middle of the file.
Before C++17, C++ had no standard mechanism to truncate a file at a given point. You either had to recreate the file (open with ios::trunc and write the contents you want to keep) or use OS-specific API calls (SetEndOfFile on Windows, truncate or ftruncate on Unix). Since C++17, std::filesystem::resize_file can also truncate a file identified by its path.
EDIT: Deleting stuff in the middle of a file is an exceedingly precarious business. Long before considering any other alternatives, I would try to use a server-less database engine like SQLite to store serialised objects. Better still, I would use SQLite as intended by storing the data needed by those objects in a proper schema.
EDIT 2: If the problem statement requires raw file access...
As a general rule, you don't delete data from the middle of a file. If the objects can be serialised to a fixed size on disk, you can work with them as records, and rather than trying to delete data, you use a table that indexes records within the file. E.g., if you write four records in sequence, the table will hold [0, 1, 2, 3]. In order to delete the second record, you simply remove its entry from the table: [0, 2, 3]. There are at least two ways to reuse the holes left behind by the table:
On each insertion, scan for the first unused index and write the object out at the corresponding record location. This will become more expensive, though, as the file grows.
Maintain a free list. Store, as a separate variable, the index of the most recently freed record. In the space occupied by that record, encode the index of the record freed before it, and so on. This maintains a handy linked list of free records while only requiring space for one additional number. It is more complicated to work with, however, and requires an extra disk I/O when deleting and inserting.
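The free-list option can be sketched as follows. To keep the example short, a std::vector of fixed-size slots stands in for the record area of the file, and each record is a single 64-bit integer; the class name SlotStore is hypothetical. A real implementation would seek to slot * record_size and read or write there instead of indexing the vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size record store with an intrusive free list: a freed slot stores
// the index of the slot that was freed before it.
class SlotStore {
public:
    // Writes the value into a reused slot if one exists, else appends.
    std::size_t insert(std::int64_t value) {
        if (free_head_ != kNone) {
            const std::size_t slot = static_cast<std::size_t>(free_head_);
            free_head_ = slots_[slot];        // next free index was kept in the slot
            slots_[slot] = value;
            return slot;
        }
        slots_.push_back(value);
        return slots_.size() - 1;
    }

    // Frees a slot. The caller must not free the same slot twice; tracking
    // which slots are live is the job of the index table described above.
    void erase(std::size_t slot) {
        assert(slot < slots_.size());
        slots_[slot] = free_head_;            // link to the previously freed slot
        free_head_ = static_cast<std::int64_t>(slot);
    }

    std::int64_t at(std::size_t slot) const { return slots_.at(slot); }
    std::size_t slot_count() const { return slots_.size(); }

private:
    static constexpr std::int64_t kNone = -1; // marks the end of the free list
    std::vector<std::int64_t> slots_;
    std::int64_t free_head_ = kNone;          // most recently freed slot
};
```

Slots are reused in last-freed, first-reused order, so the store only grows when the free list is empty.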
If the objects can't be serialised to a fixed-length, then this approach becomes much, much harder. Variable-length record management code is very complex.
Finally, if the problem statement requires keeping records in order on disk, then it's a stupid problem statement, because insertion/removal in the middle of a file is ridiculously expensive; no sane design would require this.
The general method is to open the file for read access, open a new file for write access, read the content of the first file and write the data you want retained to the second file. When complete, you delete the first file and rename the second to that of the first.
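A minimal sketch of that copy-and-rename method, applied to removing one line from a text file. The function name remove_line and the ".tmp" suffix for the second file are assumptions made for this example. Note that the rewritten file always ends with a newline, even if the original did not.

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>

// Removes the 1-based line `line_to_remove` by copying every other line to a
// second file, then replacing the original with it.
bool remove_line(const std::string& path, std::size_t line_to_remove) {
    const std::string tmp = path + ".tmp";        // hypothetical temporary name
    {
        std::ifstream in(path);
        std::ofstream out(tmp, std::ios::trunc);
        if (!in || !out) return false;
        std::string line;
        for (std::size_t n = 1; std::getline(in, line); ++n)
            if (n != line_to_remove) out << line << '\n';
        out.flush();
        if (!out) return false;                   // the copy failed; keep the original
    }                                             // both streams are closed here
    if (std::remove(path.c_str()) != 0) return false;
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}
```

The original is deleted only after the copy has been written successfully, so a failure during copying leaves the original file intact.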