Getting information from a file without traversing its contents - c++

This question made me wonder what else I can get from a file without traversing its contents (that is, without reading the contents with ifstream or getc, etc.).
Other than the file size and number of characters, what other information can I gather? I looked into fseek and found that I can use SEEK_SET, SEEK_CUR and SEEK_END, which only let me seek relative to the start of the file, the current position, or the end of the file.
In order to make it a question, I specifically want to ask:
Can occurrences of some character or type of character (newline etc) be counted?
Can its contents be matched with a certain template?
Is using these methods faster than reading the file multiple times?
And I am asking about Microsoft Windows, not Linux.

1) No, because searching for something under unpredictable conditions requires a thorough examination of the contents, and examining is reading. Of course, you may collect some statistics beforehand, but you still need to traverse your data at least once. You can use other applications to do this implicitly, but they will also traverse your file from beginning to end. You may organize your file in some way so that the necessary information can be obtained with a minimal number of read operations, but that is entirely up to your task; there is no general approach (any generality comes down to examining the whole source structure).
2) Also No (see above)
3) Yes. Store as much as possible (or as much as the task requires) in memory (that is called caching). For example, use file mapping (see MapViewOfFile on Windows and mmap(2) on *nix systems); this uses the system's caching mechanism.
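For illustration, a minimal sketch of the Windows mapping route (my own example, not part of the answer above; "input.txt" is a placeholder name and error handling is reduced to early returns):

#include <windows.h>
#include <cstdio>

int main()
{
    // Open the file, create a mapping object, and map a read-only view of it.
    HANDLE file = CreateFileA("input.txt", GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;

    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!mapping) { CloseHandle(file); return 1; }

    const char* data = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
    LARGE_INTEGER size;
    GetFileSizeEx(file, &size);

    // The whole file is now addressable as an in-memory array; counting
    // newlines is still a traversal, but the OS does the buffering.
    long long newlines = 0;
    for (long long i = 0; i < size.QuadPart; ++i)
        if (data[i] == '\n') ++newlines;
    std::printf("%lld newlines\n", newlines);

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
}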

No
No
Depends on whether there's an actual need to read the file multiple times.
There are no miracles here. The former question had a "shortcut" because the number of characters in the file equals its size in bytes (more precisely, an ANSI text file is treated as a sequence of characters, each represented by a single byte).

The stat structure contains information about the file, including permissions, ownership, size, access and creation date info. As for metadata, maybe there's an API to tie into a Windows search database that might allow searching on other criteria, like content attributes (I'm a Linux guy, usually, so I don't know what Windows offers in this respect).
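As a minimal illustration of reading that metadata without touching the contents (a POSIX-flavoured sketch; on Windows the Microsoft CRT exposes the same idea as _stat / struct _stat, and "input.txt" is just a placeholder):

#include <sys/stat.h>
#include <cstdio>
#include <ctime>

int main()
{
    struct stat st;
    if (stat("input.txt", &st) != 0) return 1;   // fills st without reading the contents

    std::printf("size: %lld bytes\n", (long long)st.st_size);
    std::printf("permissions: %o\n", st.st_mode & 0777);
    std::printf("last modified: %s", std::ctime(&st.st_mtime));
}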

Related

How can I delete data from a sequential file while it is being appended by another process

I'm reading data from a file sequentially while the same file is being appended to by another process. So far so good. My trouble comes when I need to delete from the file the data that I have already retrieved, which I have to do in order to prevent the file from getting too large because of the writing process. I don't need to delete exactly the data that I have just retrieved, but I need to do some removal periodically without losing any data that has not yet been read. How can I do this with C++?
I understand that there may be different valuable approaches. I'd accept as valid any answer that proves useful in developing the code.
This is not just a matter of C++; any language you use will at some point (in its runtime, standard library implementation, interpreter or whatever its architecture is) use the system calls that the system provides for file handling (e.g. open(), read(), write()).
I'm not aware of any system call that will delete parts of a file or replace parts with something else (you can position yourself somewhere in the middle of the file and start overwriting its contents, but this will be a byte-for-byte change; you can't replace a piece of it with another piece of a different size). There are all sorts of workarounds for simulating deleting or changing parts of a file, but nothing that does it directly. For example: read from the original file, write only what you want to keep to a temporary file, remove the original and rename the temporary. But this will not work in your situation if the writing process keeps the file open.
Another approach would be something inspired by logrotate: when the file gets to a certain maximum size it gets switched with a new one, and you can process the previous one as you want. This approach does require changes in the writing process also.
You could specify the file length at the beginning, then start writing into it, and when you reach your end of file, you just start writing at the beginning of the file again. But you should make sure that the read pointer doesn't pass the write pointer.
It seems like you're trying to emulate the behavior of a named pipe using a regular file. This would require special support from your operating system, which probably doesn't exist, because you should be using named pipes instead. A named pipe is a special kind of file used for communication between two processes. Like a regular file, it has a path and a filename and exists on disk. However, whereas a regular file's contents are stored on disk, the contents of a named pipe exist only in memory, and only for data that has been written but not yet read. This is exactly what you're trying to do.
Assuming you're using a Unix-based OS, you can run mkfifo outputfile and then use outputfile for reading and appending. No C++ code is required, though if you want you can also call mkfifo() from your C++ code.
If you're using Windows, it all becomes a bit more complicated. You have to specify something like \\.\pipe\outputfile as the filename for reading and appending.
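For the Unix route, a minimal sketch of the writer side (my own example; the reader would open the same path with O_RDONLY, and /tmp/outputfile is a placeholder):

#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

int main()
{
    // Create the named pipe (ignore the error if it already exists).
    mkfifo("/tmp/outputfile", 0666);

    // Writer side: open blocks until a reader opens the other end.
    int fd = open("/tmp/outputfile", O_WRONLY);
    const char* msg = "one record\n";
    write(fd, msg, std::strlen(msg));
    close(fd);
}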

For file reading, when to use filebuf

I'm going to be doing random-access reading from a read-only binary file. The interface to ifstream seems simpler than filebuf; but is there any use-case where filebuf would give better performance?
More details: I have a file of fixed-length (48-byte) records, and will be doing random-access reads in sequence -- read 1 record, process, read 1 record (from elsewhere), process, .... (Traversing a tree.) The file never changes. Since the records are fixed-length, I may later use a "character-type" that is the 48-byte record, but I don't imagine that has any performance effect.
Maybe, if you are on Linux, using mmap would get around the whole problem of reading the file bit by bit.
Or boost memory mapped files?
http://www.boost.org/doc/libs/1_52_0/libs/iostreams/doc/classes/mapped_file.html
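A minimal sketch of that with Boost's mapped_file_source, using the 48-byte records from the question ("tree.dat" and the record index are placeholders of mine; link against boost_iostreams):

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstring>
#include <cstdio>

struct Record { char bytes[48]; };   // fixed-length record from the question

int main()
{
    boost::iostreams::mapped_file_source file("tree.dat");
    const char* base = file.data();
    std::size_t count = file.size() / sizeof(Record);

    // Random access: jump straight to record i, no seek/read calls needed.
    std::size_t i = 12345 % count;            // placeholder index
    Record r;
    std::memcpy(&r, base + i * sizeof(Record), sizeof(Record));
    std::printf("read record %zu of %zu\n", i, count);
}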

Search for multiple words in very large text files (10 GB) using C++ the fastest way

I have this program where I have to search for specific values and their line numbers in a very large text file, and there might be multiple occurrences of the same value.
I've tried a simple C++ program which reads the text file line by line and searches for the value using strstr, but it's taking a very long time.
I also tried a system command using grep, but it's still taking a lot of time; not as long as before, but still too much.
I was searching for a library I could use to speed up the search.
Any help and suggestions? Thank you :)
There are two issues concerning the speed: the time it takes to actually read the data, and the time it takes to search.
Generally speaking, the fastest way to read a file is to mmap it (or the equivalent under Windows). This can get complicated if the entire file won't fit into the address space, but you mention 10GB in the header; if searching is all you do in the program, this shouldn't create any problems.
More generally, if speed is a problem, avoid using getline on a string. Reading large blocks, and picking the lines up (as char[]) out of them, without copying, is significantly faster. (As a simple compromise, you may want to copy when a line crosses a block boundary. If you're dealing with blocks of a MB or more, this shouldn't happen too often; I've used this technique on older, 16-bit machines, with blocks of 32KB, and still gotten a significant performance improvement.)
With regards to searching, if you're searching for a single, fixed string (not a regular expression or other pattern matching), you might want to try a BM (Boyer-Moore) search. If the string you're searching for is reasonably long, this can make a significant difference over other search algorithms. (I think that some implementations of grep will use this if the search pattern is in fact a fixed string, and is sufficiently long for it to make a difference.)
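A rough sketch of the block-reading idea, under my own assumptions (1 MB blocks, a single fixed needle, "big.txt" as a placeholder file name, and lines that cross a block boundary copied into a small carry buffer):

#include <algorithm>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::ifstream in("big.txt", std::ios::binary);
    std::vector<char> block(1 << 20);        // 1 MB blocks
    std::string carry;                       // partial line carried across block boundaries
    const char* needle = "needle";
    long long line_no = 0;

    for (;;) {
        in.read(block.data(), block.size());
        std::streamsize got = in.gcount();
        if (got <= 0) break;

        const char* p = block.data();
        const char* end = p + got;
        while (p < end) {
            const char* nl = static_cast<const char*>(std::memchr(p, '\n', end - p));
            if (!nl) {                       // line crosses the block boundary: copy the tail
                carry.append(p, end);
                break;
            }
            ++line_no;
            if (!carry.empty()) {            // finish a line started in the previous block
                carry.append(p, nl);
                if (carry.find(needle) != std::string::npos)
                    std::printf("line %lld\n", line_no);
                carry.clear();
            } else if (std::search(p, nl, needle, needle + std::strlen(needle)) != nl) {
                std::printf("line %lld\n", line_no);   // match inside the block, no copy made
            }
            p = nl + 1;
        }
    }

    // Last line of the file may have no trailing newline.
    if (!carry.empty() && carry.find(needle) != std::string::npos)
        std::printf("line %lld\n", line_no + 1);
}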
Use multiple threads. Each thread can be responsible for searching through a portion of the file. For example, on a 4-core machine, spawn 12 threads. The first thread looks through the first 8% of the file, the second thread the second 8% of the file, etc. You will want to tune the number of threads per core to keep the CPUs maximally utilized. Since this is an I/O-bound operation you may never reach 100% CPU utilization.
Feeding data to the threads will be a bottleneck using this design. Memory mapping the file might help somewhat but at the end of the day the disk can only read one sector at a time. This will be a bottleneck that you will be hard pressed to resolve. You might consider starting one thread that does nothing but read all the data in to memory and kick off search threads as the data loads up.
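A rough sketch of splitting the search across threads, under my own assumptions (the whole file is first read into one std::string, "big.txt", the needle and the thread count are placeholders):

#include <algorithm>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <thread>
#include <vector>

// Each thread records every match that starts in its [begin, end) slice, even if
// the match extends into the next slice, so nothing is lost at chunk boundaries.
void search_chunk(const std::string& text, const std::string& needle,
                  std::size_t begin, std::size_t end, std::vector<std::size_t>& hits)
{
    std::size_t pos = begin;
    while ((pos = text.find(needle, pos)) != std::string::npos && pos < end) {
        hits.push_back(pos);
        ++pos;
    }
}

int main()
{
    std::ifstream in("big.txt", std::ios::binary);
    std::string text((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
    std::string needle = "needle";

    unsigned n = 4;                               // number of worker threads, tune as needed
    std::size_t chunk = text.size() / n + 1;
    std::vector<std::vector<std::size_t>> results(n);
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < n; ++i) {
        std::size_t begin = i * chunk;
        std::size_t end = std::min(text.size(), begin + chunk);
        workers.emplace_back([&, begin, end, i] {
            search_chunk(text, needle, begin, end, results[i]);
        });
    }
    for (auto& t : workers) t.join();

    for (auto& hits : results)
        for (std::size_t pos : hits)
            std::printf("match at offset %zu\n", pos);
}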
Since files are sequential beasts, searching from start to end is something you may not get around; however, there are a couple of things you could do.
If the data is static, you could generate a smaller lookup file (alternatively, with offsets into the main file); this works well if the same string is repeated multiple times, making the index file much smaller. If the file is dynamic, you may need to regenerate the index file occasionally (offline).
Instead of reading line by line, read larger chunks from the file, like several MB, to speed up I/O.
If you'd like to use a library, you could use Xapian.
You may also want to try tokenizing your text before doing the search, and I'd suggest trying regex too, but it will take a long time if you don't have an index on that text, so I'd definitely suggest trying Xapian or some other search engine.
If your big text file does not change often then create a database (for example SQLite) with a table:
create table word_line_numbers
(word varchar(100), line_number integer);
Read your file and insert a record in database for every word with something like this:
insert into word_line_numbers(word, line_number) values ('foo', 13452);
insert into word_line_numbers(word, line_number) values ('foo', 13421);
insert into word_line_numbers(word, line_number) values ('bar', 1421);
Create an index of words:
create index word_line_numbers_idx on word_line_numbers(word);
And then you can find line numbers for words fast using this index:
select line_number from word_line_numbers where word='foo';
For added speed (because of smaller database size) and complexity you can use 2 tables: words(word_id integer primary key, word not null) and word_lines(word_id integer not null references words, line_number integer not null).
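If you went the SQLite route from C++, a rough sketch of building the index with the SQLite C API might look like this (my own example; "index.db", "big.txt" and the naive whitespace tokenizing are placeholders, error checking is stripped out, and you'd link against sqlite3):

#include <sqlite3.h>
#include <fstream>
#include <sstream>
#include <string>

int main()
{
    sqlite3* db;
    sqlite3_open("index.db", &db);
    sqlite3_exec(db, "create table if not exists word_line_numbers"
                     "(word varchar(100), line_number integer);", 0, 0, 0);
    sqlite3_exec(db, "begin;", 0, 0, 0);   // one transaction makes the bulk insert far faster

    sqlite3_stmt* ins;
    sqlite3_prepare_v2(db,
        "insert into word_line_numbers(word, line_number) values (?, ?);", -1, &ins, 0);

    std::ifstream in("big.txt");
    std::string line, word;
    for (long long line_no = 1; std::getline(in, line); ++line_no) {
        std::istringstream words(line);              // naive whitespace tokenizing
        while (words >> word) {
            sqlite3_bind_text(ins, 1, word.c_str(), -1, SQLITE_TRANSIENT);
            sqlite3_bind_int64(ins, 2, line_no);
            sqlite3_step(ins);
            sqlite3_reset(ins);
        }
    }

    sqlite3_exec(db, "create index word_line_numbers_idx on word_line_numbers(word);", 0, 0, 0);
    sqlite3_exec(db, "commit;", 0, 0, 0);
    sqlite3_finalize(ins);
    sqlite3_close(db);
}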
I'd try first loading as much of the file into the RAM as possible (memory mapping of the file is a good option) and then search concurrently in parts of it on multiple processors. You'll need to take special care near the buffer boundaries to make sure you aren't missing any words. Also, you may want to try something more efficient than the typical strstr(), see these:
Boyer–Moore string search algorithm
Knuth–Morris–Pratt algorithm
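For the fixed-string case, note that C++17 already ships a Boyer-Moore searcher that can be dropped into std::search (a minimal sketch; the haystack and needle below are placeholders, and in practice the haystack would be the mapped or pre-loaded file contents):

#include <algorithm>
#include <functional>
#include <string>
#include <cstdio>

int main()
{
    std::string haystack = "the mapped or pre-loaded file contents, needle included, would go here";
    std::string needle = "needle";

    // Build the skip tables once, then reuse the searcher over the whole buffer.
    std::boyer_moore_searcher searcher(needle.begin(), needle.end());
    auto it = std::search(haystack.begin(), haystack.end(), searcher);

    if (it != haystack.end())
        std::printf("found at offset %zu\n", static_cast<std::size_t>(it - haystack.begin()));
}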

What is an easy way to save a set of arrays to a file and back in C++?

I've been searching for an easy way with C++ to save a set of variables (in this case, a set of double arrays) to a file, and later load said file and get the set of variables.
Most of the ways that I've read about involve lumping all of the variables into a single array, or (even worse) writing a custom class that reads and writes character by character, manually, all of which is impractical since the arrays will have variable length.
Is there a class or library that I could use? (The fact that I am asking this question means that, yes, it's one of the first times I have to deal with files in C++.)
Protocol Buffers may help:
protobuf
The boost serialization library is one option.
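A minimal sketch with Boost.Serialization, assuming the "set of double arrays" is held as a std::vector<std::vector<double>> ("arrays.dat" is a placeholder; link against boost_serialization):

#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>
#include <boost/serialization/vector.hpp>
#include <fstream>
#include <vector>

int main()
{
    std::vector<std::vector<double>> arrays = {{1.0, 2.0}, {3.0, 4.0, 5.0}};

    {   // save: the archive records the sizes too, so variable lengths are no problem
        std::ofstream out("arrays.dat");
        boost::archive::text_oarchive oa(out);
        oa << arrays;
    }

    std::vector<std::vector<double>> loaded;
    {   // load them back
        std::ifstream in("arrays.dat");
        boost::archive::text_iarchive ia(in);
        ia >> loaded;
    }
}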

Delete a line in an ofstream in C++

I want to erase lines within a file. I know you can store the content of the file (in a vector for example), erase the line and write again. However, it feels very cumbersome, and not very efficient if the file gets bigger.
Anyone knows of a better, more efficient, more elegant way of doing it?
On most file-systems, this is the only option you have, short of switching to an actual database.
However, if you find yourself in this situation (i.e. very large files, with inserts/deletes in the middle), consider whether you can do something like maintaining a bitmap at the top of the file, where each bit represents one line of your file. To "delete" a line, simply flip the corresponding bit value.
There's nothing particularly magical about disk files. They still like to store their data in contiguous areas (typically called something like "blocks"). They don't have ways of leaving data-free holes in the middle of these areas. So if you want to "remove" three bytes from the middle of one of these areas, something somewhere is going to have to accomplish this by moving everything else in that area back by three bytes. No, it is not efficient.
This is why text editors (which have to do this kind of thing a lot) tend to load as much of the file as possible (if not all of it) into RAM, where moving data around is much faster. They typically only write changes back to disk when requested (or periodically). If you are going to have to make lots of changes like this, I'd suggest taking a page from their book and doing something similar.
The BerkeleyDB (dbopen(3)) has an access method called DB_RECNO. This allows one to manipulate files with arbitrary lengths using any sort of record delimiter. The default uses variable-length records with unix newlines as delimiters. You then access each "record" using an integer index. Using this, you can delete arbitrary lines from your text file. This isn't specific to C++, but if you are on most Unix/Linux systems, this API is already available to you.