Compress many versions of a text with fast access to each

Let's say I store many versions of a source code file in a source code repository - maybe 500 historic versions of a 50k source file. So storing the versions directly would take about 12.5 MB (assuming the file grew linearly over time). Naturally though, there is ample room for compression as there will only be slight differences between most successive versions.
What I want is compact storage as well as reasonably quick extraction of any of the versions at any time.
So we would probably store a list of oft-occurring text chunks, and each version would just contain pointers to the chunks it is made of. To make this really compact, text chunks would also be able to be defined as concatenations of other chunks.
Is there a well-established compression algorithm that produces this kind of structure? I was not sure what term to search for.
(Bonus points if adding a new version is faster than recompressing the whole set of versions.)

What you want is called "git". In fact, that is exactly what you want. Including bonus points.

Seeing as there were no usable answers, I came up with my own format today to demonstrate what I mean. I am storing 850 versions of a source file about 20k in size. Usually from one version to the next just one line was added (but there were other changes as well).
If I store these 850 versions in a .zip, it is 4.2 MB big. I want less than that, way less.
My format is line-based. Basically each file version is stored as a list of pointers into a table. Each table entry is either:
a literal line,
or a pair of pointers into the table.
In the second case, decompression follows the two pointers in order and concatenates their (recursive) expansions.
Not sure if this description makes sense to you right away, but the thing works.
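To make the structure concrete, here is a minimal sketch (hypothetical names, just illustrating the table and the recursive expansion used during decompression, not the actual compressor):

    #include <string>
    #include <vector>

    // Each table entry is either a literal line or a pair of indices into the table.
    struct Entry {
        bool is_literal;
        std::string line;      // valid when is_literal
        size_t first, second;  // valid when !is_literal
    };

    // Recursively expand one entry into the lines it represents.
    void expand(const std::vector<Entry>& table, size_t idx, std::vector<std::string>& out) {
        const Entry& e = table[idx];
        if (e.is_literal) {
            out.push_back(e.line);
        } else {
            expand(table, e.first, out);   // follow both pointers in order
            expand(table, e.second, out);
        }
    }

    // A version is just a list of table indices; extraction is a series of expansions.
    std::vector<std::string> extract(const std::vector<Entry>& table,
                                     const std::vector<size_t>& version) {
        std::vector<std::string> lines;
        for (size_t idx : version) expand(table, idx, lines);
        return lines;
    }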
The compressor generates a single text file from which each of the 850 versions can be extracted instantly. This text file has a size of 45k.
Finally we can simply gzip this file which gets us down to 18.5k. Quite an improvement from 4.2 MB!
The compressor uses a very simple but effective way to find repeating combinations of lines.
So the answer to the initial question is that there is an algorithm that combines inter-file compression (like .tar.gz) with instant extraction of any contained file (like .zip).
I still don't know what this class of compression algorithms is called.


Can we move the pointer of ofstream back and forth for output to file? [closed]

I need to output results to a file that has a predefined format. The columns look like this:
TIME COL1 COL2 COL3 COL4 ...
I am using ofstream. I have to output results line by line. However, it can be the case that results for certain columns are not available at a certain time. The order of the results may also not be sorted.
I can control the spacing between the columns while initially specifying the headers.
I guess my question is: Is it possible to move the ofstream pointer back and forth horizontally per line?
What I tried so far:
1) Find the current position of the ofstream pointer:
long pos = fout.tellp();
2) Calculate the position to shift to, based on the column spacing:
long spacing = column_spacing * column_number;
long newpos = pos + spacing;
3) Use seekp() to move the pointer:
fout.seekp(newpos);
4) Write the output:
fout << "output";
This does not work. Basically, the pointer does not move. The idea is to make my ofstream fout move back and forth if possible. I would appreciate any suggestions on how to control it.
Some information about the output: I am computing the elevation angle of GPS satellites in the sky over time. Hence, there are 32 columns, corresponding to the total number of GPS satellites. At any point in time not all satellites are visible, hence the need to skip some satellites/columns. Also, the list of satellite elevations may not be arranged in ascending order due to the limitations of the observation file. Hope that helps in picturing the situation.
An example of desired output. The header (TIME, SAT1, ... SAT32) is defined prior to the output of results and is not part of the question here. The spacing between the columns is controlled during definition of the headers (let's say 15 spaces between each column). The output can be truncated to 1 decimal place. A new line occurs once all results at the current time t are written. Then I process the observations for time t+1 and write the outputs again, and so on. Hence the writing occurs in an epoch-wise manner. Satellite elevations are stored in a vector<double> and satellite numbers in a vector<int>. Both vectors are of the same length. I just need to write them to a file. For the example below, time is in seconds and satellite elevation is in degrees:
TIME SAT1 SAT2 SAT3 ... SAT12 SAT13 ... SAT32
1 34.3 23.2 12.2 78.2
2 34.2 23.1 12.3 78.2
3 34.1 11.3 23.0 78.3
And so on... As you may notice, satellite elevations may or may not be available; it all depends on the observations. Let's also assume that the size of the output and efficiency are not a priority here. Based on 24 hours of observations, the output file size can reach up to 5-10 MB.
Thanks for your time in advance!
Can we move the pointer of ofstream back and forth for output to file?
No, you probably don't want to do that (even if it is doable in principle, it would be inefficient, very brittle to code, and nearly impossible to debug), in particular for a textual output whose width is variable (I am guessing that your COLi can have variable width, as is usual in most textual formats). It looks like your approach is wrong.
The general way is to build in memory, as some graph of "objects" or "data structure", the entire representation of your output file. This is generally enough, unless you really need to output something huge.
If your typical textual output is of reasonable size (a few gigabytes at most) then representing the data as some internal data structure is worthwhile and it is very common practice.
If your textual output is huge (dozens of gigabytes or terabytes, which is really unlikely), then you won't be able to represent it in memory (unless you have a costly computer with a terabyte of RAM). However, you could use some database (perhaps sqlite) to serve as internal representation.
In practice, textual formats are always output in sequence (from some internal representation), and textual files in those formats have a reasonable size (it is uncommon to have a textual file of many gigabytes today; in such cases, databases, or splitting the output file into several pieces in some directory, are better).
Without specifying precisely your textual format (e.g. using EBNF notation) and giving an example - and some estimation of the output size-, your question is too broad, and you can only get hints like above.
the output file size can reach up to 5-10 MB
This is really tiny on current computers (even a cheap smartphone has a gigabyte of RAM). So build the data structure in memory, and output it at once when it is completed.
What data structures you should use depends upon your actual problem (and the inputs your program gets, and the precise output you want it to produce). Since you don't specify your program in your question, we cannot help much. Probably C++ standard containers and smart pointers could be useful (but this is just a guess).
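For instance, a minimal sketch along those lines (file name, column width and sample data are assumptions): collect each epoch in a std::map keyed by satellite number, then write fixed-width columns with std::setw, leaving blank columns for satellites that were not observed:

    #include <fstream>
    #include <iomanip>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        const int kWidth = 15;                 // assumed column spacing
        std::ofstream fout("elevations.txt");  // hypothetical output file

        // Header: TIME SAT1 ... SAT32
        fout << std::setw(kWidth) << "TIME";
        for (int s = 1; s <= 32; ++s)
            fout << std::setw(kWidth) << ("SAT" + std::to_string(s));
        fout << '\n';

        // One epoch of (satellite number, elevation) pairs, possibly unsorted and incomplete.
        std::vector<int> sats = {3, 1, 12};
        std::vector<double> elev = {12.2, 34.3, 78.2};

        std::map<int, double> epoch;           // the map sorts by satellite number
        for (size_t i = 0; i < sats.size(); ++i)
            epoch[sats[i]] = elev[i];

        int t = 1;                             // time in seconds
        fout << std::setw(kWidth) << t << std::fixed << std::setprecision(1);
        for (int s = 1; s <= 32; ++s) {
            auto it = epoch.find(s);
            if (it != epoch.end())
                fout << std::setw(kWidth) << it->second;
            else
                fout << std::setw(kWidth) << "";   // blank column for a missing satellite
        }
        fout << '\n';
    }

Repeat the epoch block for every time step; each line is written once, sequentially, and no seeking is ever needed.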
You should read some introduction to programming (like SICP), then some good C++ programming book and read some good Introduction to Algorithms. You probably need to read something about compilation techniques (since they include parsing and outputting structured data), like the Dragon Book. Learning to program takes a lot of time.
C++ is really a very difficult programming language, and I believe it is not the best way to learn programming. Once you have learned a bit how to program, invest your time in learning C++. Your issue is not with std::ostream or C++ but with designing your program and its architecture correctly.
BTW, if the output of your program is feeding some other program (and is not only or mostly for human consumption) you might use some established textual format, perhaps JSON, YAML, CSV, XML (see also this example), ....
2 34.2 23.1 12.3 78.2
How significant are the spaces in the above line (what would happen if a space were inserted after the first 2 and another space removed after 12.3)? Can a wide number like 3.14159265358979323846264 appear in your output? Or how many digits do you want? That should be documented precisely somewhere! Are you allowed to improve the output format above (you might perhaps use some sign like ? for missing numbers; that would make the output less ambiguous, more readable for humans, and easier to parse by some other program)?
You need to define precisely (in English) the behavior of your program, including its input and output formats. An example of input and output is not a specification (it is just an example).
BTW, you may also want to code your program to provide several different output formats. For example, you could decide to provide CSV format for usage in spreadsheets, JSON format for other data processing, gnuplot output to get nice figures, LaTeX output to be able to insert your output in some technical report, HTML output to be usable through a browser, etc. Once you have a good internal representation (as convenient data structures) of your computed data, outputting it in various formats is easy and very convenient.
Probably your domain (satellite processing) has defined some widely used data formats. Study them in detail (at least for inspiration on specifying your own output format). I am not at all an expert on satellite data, but with Google I quickly found examples like the GEOSCIENCE AUSTRALIA (CCRS) LANDSAT THEMATIC MAPPER DIGITAL DATA FORMAT DESCRIPTION (more than a hundred pages). You should specify your output format as precisely as they do (perhaps several dozen pages in English, with a few pages of EBNF); EBNF is a convenient notation for that (with a lot of additional explanation in English).
Look also for inspiration into other output data format descriptions.
If you invent your own output format, you probably should publish its specification (in English) so that other people can write programs that take your output as input to their code.
In many domains, data is much more valuable (i.e. costs much more, in € or US$) than the code processing it. This is why its format should be precisely documented. You need to specify that format so that a future programmer in 2030 could easily write a parser for it. So details matter a great deal. Specify your output format unambiguously and in great detail (in some English document).
Once you have specified that output format, coding the output routines from some good enough internal data representation is easy work (and does not require insane tricks like moving the file offset of the output). And a good enough specification of the output format is also a guideline in designing your internal data representations.
Is it possible to move the ofstream pointer back and forth horizontally per line?
It might be doable, but it is so inefficient and error-prone (and nearly impossible to debug) that in practice you should never do that (but instead specify your output in detail and code simple sequential output routines, as all textual-format-related software does).
BTW, today we use UTF-8 everywhere in textual files, and a single UTF-8 encoded Unicode character might span one byte (e.g. for a digit like 0 or a Latin letter like E) or several bytes (e.g. for accented letters like é, cyrillic letters like я, or symbols like ∀, etc.), so replacing a single UTF-8 character with another one could mean some byte insertion or deletion.
Notice that current file systems do not allow inserting or deleting a span of characters or bytes in the middle of a file (for example, on Linux there are no syscalls(2) allowing this), and they do not really know about lines (the end of line is just a convention, e.g. the \n byte on Linux). Programs that do this (like your favorite source code editor) always represent the data in memory. Today, a file is a sequence of bytes; from the operating system's point of view you can only append bytes at its end or replace bytes in the middle, but inserting or deleting a span of bytes in the middle of a file is not possible. That is why a textual file is, in practice, always written sequentially, from start to end, without moving the current file offset (other than appending bytes at the end).
(if this is homework for some CS college or undergraduate course, I guess that your teacher is expecting you to define and document your output format)

concatenating numpy memmap'd files into single memmap

I have a very large number (>1000) of files, each about 20MB, which represent continuous time-series data saved in a simple binary format such that if I concatenate them all directly, I recover my full time series.
I would like to do this virtually in python, by using memmap to address each file and then concatenate them all on the fly into one big memmap.
Searching around SO suggests that np.concatenate will load them into memory, which I can't do. The question here seems to answer it in part, but the answer there assumes that I know how big my files are before concatenation, which is not necessarily true.
So, is there a general way to concatenate memmaps without knowing beforehand how big they are?
EDIT: it was pointed out that the linked question actually creates a concatenated file on disk. This is not something I want.

Read a specific portion of a text file in C++ (in between two delimiters)

I have a text file in the following format:
UserIP-Address-1
UserInfo-1
UserInfo-2
UserInfo-3
UserIP-Address-1_ENDS
UserIP-Address-2
UserInfo-1
UserInfo-2
UserInfo-3
UserIP-Address-2-ENDS
I need to collect information as per the client request and send the data between the UserIP-Address-1 and UserIP-Address-1_ENDS delimiters. I can find one of the delimiters using find or a vector::iterator, but how do I find the matching end delimiter and the data in between? Please guide me, thank you all.
First, you should define (at least on paper or in comments) precisely the format of your file, perhaps through some EBNF notation. An example is never enough (better to give, in addition to the format documentation, some real concrete examples, not abstract ones). If the file is produced by some other software, that software should document the format.
You need to read the file line by line (e.g. using std::getline), and probably entirely (or at least until you have gotten all the wanted information). You might use standard lexing and parsing techniques (probably on every line, perhaps on the entire file as a whole). You could (at least if the file is not very large) fill some data structure in memory.
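For example, a minimal sketch (hypothetical file name and helper) that collects the lines between a start delimiter and its end marker using std::getline:

    #include <fstream>
    #include <string>
    #include <vector>

    // Collect the lines strictly between `start` and `end`; returns an empty
    // vector if the block is not found. Hypothetical helper, names are assumptions.
    std::vector<std::string> read_block(const std::string& path,
                                        const std::string& start,
                                        const std::string& end) {
        std::ifstream in(path);
        std::vector<std::string> block;
        std::string line;
        bool inside = false;
        while (std::getline(in, line)) {
            if (line == start) { inside = true; continue; }
            if (inside && line == end) break;
            if (inside) block.push_back(line);
        }
        return block;
    }

    // Usage, following the example in the question:
    // auto info = read_block("users.txt", "UserIP-Address-1", "UserIP-Address-1_ENDS");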
If the file is really big (e.g. gigabytes that do not fit in RAM), you could read it twice: the first time to compute offsets (using tellg) of relevant lines (or data chunks), e.g. into some std::map, and the second time to use seekg appropriately to read portions of that file.
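A sketch of that two-pass idea (hypothetical file and block names, no error handling), in case the file really is too big for memory:

    #include <fstream>
    #include <map>
    #include <string>

    int main() {
        std::ifstream in("users.txt");             // hypothetical file name
        std::map<std::string, std::streampos> index;

        // First pass: remember the offset of the line following each block header.
        std::string line;
        while (std::getline(in, line)) {
            if (line.rfind("UserIP-Address-", 0) == 0 && line.find("ENDS") == std::string::npos)
                index[line] = in.tellg();          // offset of the first data line of that block
        }

        // Second pass: jump straight to the requested block and read until its end marker.
        in.clear();                                // clear the EOF state before seeking
        in.seekg(index["UserIP-Address-2"]);
        while (std::getline(in, line) && line.find("ENDS") == std::string::npos) {
            // process `line` (the user info for that IP)
        }
    }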
If you can change the format of the file, you could consider using standard textual serialization formats like JSON (which has several C++ libraries handling it, e.g. JSONCPP) or YAML (I don't recommend XML, unless it is an external requirement, since XML is too complex and too verbose). You might also consider some database approach, perhaps as simple as SQLite.

Strings vs binary for storing variables inside the file format

We aim at using HDF5 for our data format. HDF5 has been selected because it is a hierarchical filesystem-like cross-platform data format and it supports large amounts of data.
The file will contain arrays and some parameters. The question is about how to store the parameters (which do not consist of large amounts of data), considering also file versioning issues and the effort to build the library. Parameters inside the HDF5 file could be stored either as (A) human-readable attribute/value pairs or (B) binary data in the form of HDF5 compound data types.
Just as an example, let's consider as a parameter a polygon with three vertices. Under case A we could have, for instance, a variable named Polygon with the string representation of the series of vertices, e.g. (1, 2); (3, 4); (4, 1). Under case B, we could instead have a variable named Polygon made up of a [2 x 3] matrix.
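For illustration, here is a minimal sketch of both cases using the HDF5 C API (usable from C++; the names, the attribute-vs-dataset choice and the row/column layout are assumptions, and error checking is omitted):

    #include "hdf5.h"

    int main(void) {
        hid_t file = H5Fcreate("params.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* Case A: the polygon as a human-readable string attribute. */
        const char *poly_text = "(1, 2); (3, 4); (4, 1)";
        hid_t str_type = H5Tcopy(H5T_C_S1);
        H5Tset_size(str_type, H5T_VARIABLE);
        hid_t scalar = H5Screate(H5S_SCALAR);
        hid_t attr = H5Acreate2(file, "PolygonText", str_type, scalar, H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, str_type, &poly_text);

        /* Case B: the polygon as a 2 x 3 numeric dataset (row 0 = x, row 1 = y). */
        double verts[2][3] = { {1, 3, 4}, {2, 4, 1} };
        hsize_t dims[2] = {2, 3};
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset = H5Dcreate2(file, "Polygon", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, verts);

        H5Dclose(dset); H5Sclose(space); H5Aclose(attr);
        H5Sclose(scalar); H5Tclose(str_type); H5Fclose(file);
        return 0;
    }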
We have some idea, but it would be great to have input from people who have already worked with something similar. More precisely, could you please list the pros/cons of A and B, and also say under what circumstances each would be preferable?
Speaking as someone who's had to do exactly what you're talking about a number of times: rr got it basically right, but I would change the emphasis a little.
For file versioning, text is basically the winner.
Since you're using an HDF5 library, I assume serializing and parsing require equivalent human effort either way.
text files are more portable. You can transfer the files across generations of hardware with minimal risk.
text files are easier for humans to work with. If you want to extract a subset of the data and manipulate it, you can do that with many programs on many computers. If you are working with binary data, you will need a program that allows you to do so. Depending on how you see people working with your data, this can make a huge difference to the accessibility of the data and maintenance costs. You'll be able to sed, grep, and even edit the data in Excel.
input and output of binary data (for large data sets) will be vastly faster than text.
working with those binary files in a new environment (e.g. a 128-bit little-endian computer in some sci-fi future) will require some engineering.
similarly, if you write applications in other languages, you'll need to handle the encoding identically between applications. This will either mean engineering effort, or having the same libraries available on all platforms. With plain text this is easier...
If you want others to write applications that work with your data, plain text is simpler. If you're providing binary files, you'll have to provide a file specification which they can follow. With plain text, anyone can just look at the file and figure out how to parse it.
you can archive the text files with compression, so space concerns are primarily an issue for the data you are actively working with.
debugging binary data storage is significantly more work than debugging plain-text storage.
So in the end it depends a little on your use case. Is it meaningful to look at the data in the myriad tools that handle plain-text? Is it only meaningful to look at it with big-data hdf5 viewers? Will writing plain text be onerous to you in terms of time and space?
In general, when I'm faced with this issue, I basically always do the same thing: I store the data in plain text until I realize the speed problems are more irritating than working with binary would be, and then I switch. If you don't know in advance whether you're crossing that threshold, start with plain text, and write your interface to your persistence layer in such a way that it will be easy to switch later. This is a tiny bit of additional work, which you will probably get back thanks to plain text being easier to debug.
If you expect to edit the file by hand often (like XMLs or JSONs), then go with human readable format.
Otherwise go with binary - it's much easier to create a parser for it and it will run faster than any grammar parser.
Also note how there's nothing that prevents you from creating a converter between binary and human-readable form later.
Versioning files might sound nice, but are you really going to inspect the diffs for files "containing large arrays"?

Binary parser or serialization?

I want to store a graph of different objects for a game, their classes may or may not be related, they may or may not contain vectors of simple structures.
I want the parsing operation to be fast; the data can be pretty big.
Adding new things should not be hard, and it should not break backward compatibility.
Smaller file size is kind of important
Readability counts
By serialization I mean making objects serialize themselves, which is effective, but I will need to write different serialization methods for different objects for that.
By binary parsing/composing I mean, creating a new tree of parsers/composers that holds and reads data for these objects, and passing this around to have my objects push/pull their data.
I can also use JSON, but it can be pretty slow for reading, and it is not very size-efficient when it comes to pretty big sets of matrices and numbers.
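To make the first alternative concrete, here is a minimal sketch (hypothetical type, no endianness or versioning handling) of an object serializing itself to a binary stream:

    #include <cstdint>
    #include <istream>
    #include <ostream>
    #include <vector>

    // Hypothetical game object that knows how to write and read itself in binary.
    struct Monster {
        int32_t hp = 0;
        std::vector<float> position;   // e.g. x, y, z

        void serialize(std::ostream& out) const {
            out.write(reinterpret_cast<const char*>(&hp), sizeof hp);
            uint64_t n = position.size();
            out.write(reinterpret_cast<const char*>(&n), sizeof n);
            out.write(reinterpret_cast<const char*>(position.data()), n * sizeof(float));
        }

        void deserialize(std::istream& in) {
            in.read(reinterpret_cast<char*>(&hp), sizeof hp);
            uint64_t n = 0;
            in.read(reinterpret_cast<char*>(&n), sizeof n);
            position.resize(n);
            in.read(reinterpret_cast<char*>(position.data()), n * sizeof(float));
        }
    };

Every class needs its own pair of methods like these, which is the per-object effort mentioned above.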
Point by point:
Fast Parsing: binary (since you don't necessarily have to "parse", you can just deserialize)
Adding New Things: text
Smaller: text (even if gzipped text is larger than binary, it won't be much larger).
Readability: text
So that's three votes for text, one vote for binary. Personally, I'd go with text for everything except images (and other data which is "naturally" binary). Then, store everything in a big zip file (I can think of several games that do this or something close to it).
Good reads: The Importance of Being Textual and Power Of Plain Text.
Check out protocol buffers from Google or thrift from Apache. Although billed as a way to write wire protocols easily, it's basically an object serialization mechanism that can create bindings in a dozen languages, has efficient binary representation, easy versioning, fast performance, and is well-supported.
We're using Boost.Serialization. Don't know how it performs next to those offered by samkass.