concatenating numpy memmap'd files into single memmap - python-2.7

I have a very large number (>1000) of files, each about 20MB, which represent continuous time-series data saved in a simple binary format such that if I concatenate them all directly, I recover my full time series.
I would like to do this virtually in python, by using memmap to address each file and then concatenate them all on the fly into one big memmap.
Searching around SO suggests that np.concatenate will load them into memory, which I can't do. The question here seems to answer it in part, but the answer there assumes that I know how big my files are before concatenation, which is not necessarily true.
So, is there a general way to concatenate memmaps without knowing beforehand how big they are?
EDIT: it was pointed out that the linked question actually creates a concatenated file on disk. This is not something I want.

Related

Is the main purpose of object serialization in C++ for faster object loading?

I am reading code for a project written by others. The main task of the project is to read the contents of a large structured text file (.txt) with 8 columns into a KnowledgeBase object, which has a number of methods and variables. The KnowledgeBase object is then written out to a binary file. For example, the KnowledgeBase class has at least these two variables:
map<string, pair<string, string>> key_info
vector<ObjectInfo> objects
...
These variables are easy to understand when I trace the code with gdb. Then it seems the code converts such vectors and maps into binary form, and the two variables above have corresponding binary forms:
BinaryKeyInfo *bkeys
BinaryObjectInfo *bObjects
Later on when outputting to binary file, it has such code:
// write the record count first, then wcount KeyInfo_t records in one block
fwrite((char*)(&wcount), sizeof(int32_t), 1, output);
fwrite((char*)bkeys, sizeof(KeyInfo_t), wcount, output);
The code that converts the original KnowledgeBase to binary is complicated. My question is: what is the main purpose of this conversion? Is it so the binary file loads into memory faster than the plain text file would? The plain text file is large. I learnt that object serialization is primarily for transmitting objects over the net, but I don't think that is the purpose here; it seems more about speeding up data loading and saving memory. Could that be part of object serialization in C++?
Is the main purpose of object serialization in C++ for faster object loading?
No. The most important purpose of serialisation is to transform the state of the program into a format that can be stored on the filesystem, or that can be communicated across a network, and that can be de-serialised back. Often, the purpose of either is for another program to do the de-serialisation. Sometimes the de-serialiser is another instance of the same program.
The speed of de-serialisation is one metric that can be used to gauge whether one particular serialisation format is a good one. The ability to quickly undo what you have done is not the reason why you do it in the first place.
what's the benefit of converting them into binary vectors or maps?
As I mention above, the benefit of serialisation is the ability to store the serialised data on the filesystem, or to send it over a network.
what's the benefit of plain text files vs binary files?
Pros of text serialisation format:
Humans are able to read and write plain text. Humans generally are not able to read or write binary files.
It's generally easier to implement a plain text format de-/serialiser in a way that works across differing computers than it is to implement a binary format de-/serialiser that achieves the same.
Pros of binary serialisation format:
Typically faster and uses less storage and bandwidth.
Can be easier to implement if there is no need for communication between differing systems. This is typically only true in very simple cases. (Furthermore, there usually is a need for cross-system compatibility, even if that need hasn't been realised yet.)
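To make the difference concrete, here is a minimal sketch (not code from the project in question; the struct and file names are invented for illustration) that writes the same record once as text and once as raw binary:

#include <cstdint>
#include <cstdio>
#include <fstream>

struct KeyRecord {            // hypothetical record, standing in for KeyInfo_t
    int32_t id;
    double  weight;
};

int main() {
    KeyRecord r{42, 3.14};

    // Text serialisation: human-readable and portable, but larger and slower to parse.
    std::ofstream txt("record.txt");
    txt << r.id << ' ' << r.weight << '\n';

    // Binary serialisation: compact and fast, but tied to this machine's
    // endianness, padding and sizeof() choices.
    std::FILE* bin = std::fopen("record.bin", "wb");
    std::fwrite(&r, sizeof(KeyRecord), 1, bin);
    std::fclose(bin);
}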

Compress many versions of a text with fast access to each

Let's say I store many versions of a source code file in a source code repository - maybe 500 historic versions of a 50k source file. So storing the versions directly would take about 12.5 MB (assuming the file grew linearly over time). Naturally though, there is ample room for compression as there will only be slight differences between most successive versions.
What I want is compact storage as well as reasonably quick extraction of any of the versions at any time.
So we would probably store a list of oft-occurring text chunks, and each version would just contain pointers to the chunks it is made of. To make this really compact, chunks could themselves be defined as concatenations of other chunks.
Is there a well-established compression algorithm that produces this kind of structure? I was not sure what term to search for.
(Bonus points if adding a new version is faster than recompressing the whole set of versions.)
What you want is called "git". In fact, that is exactly what you want. Including bonus points.
Seeing as there were no usable answers, I came up with my own format today to demonstrate what I mean. I am storing 850 versions of a source file about 20k in size. Usually from one version to the next just one line was added (but there were other changes as well).
If I store these 850 versions in a .zip, it is 4.2 MB big. I want less than that, way less.
My format is line-based. Basically each file version is stored as a list of pointers into a table. Each table entry is either:
a literal line,
or a pair of pointers into the table.
In the second case, in decompression, the two pointers have to be followed successively.
Not sure if this description makes sense to you right away, but the thing works.
The compressor generates a single text file from which each of the 850 versions can be extracted instantly. This text file has a size of 45k.
Finally we can simply gzip this file which gets us down to 18.5k. Quite an improvement from 4.2 MB!
The compressor uses a very simple but effective way to find repeating combinations of lines.
So the answer to the initial question is that there is an algorithm that combines inter-file compression (like .tar.gz) with instant extraction of any contained file (like .zip).
I still don't know how you would call this class of compression algorithms.
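For what it's worth, the decompression side of such a format can be sketched in a few lines (the structures below are hypothetical, not the poster's actual on-disk encoding): each table entry is either a literal line or a pair of indices, and expanding an entry just recurses into both halves.

#include <iostream>
#include <string>
#include <vector>

// Hypothetical table entry: either a literal line, or two indices into the table.
struct Entry {
    bool        is_literal;
    std::string line;          // used if is_literal
    size_t      left, right;   // used if !is_literal
};

// Recursively expand entry i into output lines.
void expand(const std::vector<Entry>& table, size_t i, std::vector<std::string>& out) {
    const Entry& e = table[i];
    if (e.is_literal) {
        out.push_back(e.line);
    } else {
        expand(table, e.left, out);
        expand(table, e.right, out);
    }
}

int main() {
    // Tiny example: entry 2 expands to the literal lines stored in entries 0 and 1.
    std::vector<Entry> table = {
        {true,  "int main() {", 0, 0},
        {true,  "}",            0, 0},
        {false, "",             0, 1},
    };
    std::vector<std::string> lines;
    expand(table, 2, lines);   // a stored file version is then just one entry index
    for (const std::string& l : lines) std::cout << l << '\n';
}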

What is the best way to process large arrays without hitting memory requirements in C++?

I've got two arrays of strings, both with 250k+ items. When I tried to hardcode these into my C++ program, it got stuck in the compilation phase. I currently have both arrays as CSV .txt files, e.g. {..."fksdfjsa", "fsdajhfisa", "wgferwjhgo"...}.
Should I save these as arrays in a different C++ program and try to import them, or should I somehow stream them as I iterate through the values? If so, how would I do that? For what it's worth, I intend to compare each element of the first array to each element of the second.
Just read the data from your CSV files at runtime. Learn about the <fstream> standard library header.
Read the data using std::fstream and use a std::vector instead of creating a raw array.
I would certainly read the input from files. Among other things: you can start debugging/testing on smaller input files.
If you google for C++ CSV readers you can probably save yourself the time of debugging your own.
Finally: the goal of this exercise isn't described, but "...compare each element..." done naively is going to take a long time. If you are looking for matches, consider sorting both inputs first and running through the lists in parallel, as sketched below; after the O(n log n) sort, the comparison pass is linear rather than quadratic.
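A minimal sketch of that approach (the file names and the one-item-per-line layout are assumptions, so adapt the parsing to your actual CSV format): read both lists at runtime, sort them, and let std::set_intersection do the single linear pass.

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

// Read one item per line; adjust this if your files are genuinely comma-separated.
std::vector<std::string> read_list(const std::string& path) {
    std::vector<std::string> items;
    std::ifstream in(path);
    for (std::string line; std::getline(in, line); )
        items.push_back(line);
    return items;
}

int main() {
    std::vector<std::string> a = read_list("first.txt");    // hypothetical file names
    std::vector<std::string> b = read_list("second.txt");

    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    // One pass over both sorted lists instead of 250k x 250k comparisons.
    std::vector<std::string> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));

    std::cout << common.size() << " items appear in both lists\n";
}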

Strings vs binary for storing variables inside the file format

We aim at using HDF5 for our data format. HDF5 has been selected because it is a hierarchical filesystem-like cross-platform data format and it supports large amounts of data.
The file will contain arrays and some parameters. The question is how to store the parameters (which do not amount to much data), considering also file versioning issues and the effort needed to build the library. Parameters inside the HDF5 could be stored as either (A) human-readable attribute/value pairs or (B) binary data in the form of HDF5 compound data types.
Just as an example, let's consider as a parameter a polygon with three vertices. Under case A we could have, for instance, a variable named Polygon holding the string representation of the series of vertices, e.g. (1, 2); (3, 4); (4, 1). Under case B, we could instead have a variable named Polygon made up of a [2 x 3] matrix.
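For concreteness, case B could look roughly like this with the plain HDF5 C API (a sketch only: the file name is made up, the numbers follow the example above, and error handling is omitted):

#include "hdf5.h"

int main(void) {
    // Case B: the polygon as a 2 x 3 matrix, one column per vertex.
    double polygon[2][3] = { {1, 3, 4},
                             {2, 4, 1} };
    hsize_t dims[2] = {2, 3};

    hid_t file  = H5Fcreate("params.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "Polygon", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, polygon);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}

Case A, by contrast, would store the same information as a single string such as "(1, 2); (3, 4); (4, 1)".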
We have some idea, but it would be great to have inputs from people who have already worked with something similar. More precisely, could you please list pro/cons of A and B and also say under what circumstances which would be preferable?
Speaking as someone who's had to do exactly what you're talking about a number of times, rr got it basically right, but I would change the emphasis a little.
For file versioning, text is basically the winner.
Since you're using an hdf5 library, I assume serialising and parsing take roughly equivalent human effort either way.
text files are more portable. You can transfer the files across generations of hardware with minimal risk.
text files are easier for humans to work with. If you want to extract a subset of the data and manipulate it, you can do that with many programs on many computers. If you are working with binary data, you will need a program that allows you to do so. Depending on how you see people working with your data, this can make a huge difference to the accessibility of the data and maintenance costs. You'll be able to sed, grep, and even edit the data in excel.
input and output of binary data (for large data sets) will be vastly faster than text.
working with those binary files in a new environment (e.g. a 128-bit little-endian computer in some sci-fi future) will require some engineering.
similarly, if you write applications in other languages, you'll need to handle the encoding identically between applications. This means either engineering effort or having the same libraries available on all platforms. With plain text this is easier...
If you want others to write applications that work with your data, plain text is simpler. If you're providing binary files, you'll have to provide a file specification which they can follow. With plain text, anyone can just look at the file and figure out how to parse it.
you can archive the text files with compression, so space concerns are primarily an issue for the data you are actively working with.
debugging binary data storage is significantly more work than debugging plain-text storage.
So in the end it depends a little on your use case. Is it meaningful to look at the data in the myriad tools that handle plain-text? Is it only meaningful to look at it with big-data hdf5 viewers? Will writing plain text be onerous to you in terms of time and space?
In general, when I'm faced with this issue, I basically always do the same thing: I store the data in plain text until I realize the speed problems are more irritating than working with binary would be, and then I switch. If you don't know in advance whether you'll cross that threshold, start with plain text, and write the interface to your persistence layer in such a way that it will be easy to switch later. This is a tiny bit of additional work, which you will probably get back thanks to plain text being easier to debug.
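One way to keep that later switch cheap (a sketch only; the class and function names here are invented for illustration) is to hide the storage format behind a small interface and choose the implementation in exactly one place:

#include <fstream>
#include <memory>
#include <string>
#include <vector>

// Minimal persistence interface: callers never know whether the bytes
// on disk are text or binary.
struct ParameterStore {
    virtual ~ParameterStore() = default;
    virtual void save(const std::vector<double>& values, const std::string& path) = 0;
};

struct TextStore : ParameterStore {
    void save(const std::vector<double>& values, const std::string& path) override {
        std::ofstream out(path);
        for (double v : values) out << v << '\n';   // human-readable, grep/diff friendly
    }
};

struct BinaryStore : ParameterStore {
    void save(const std::vector<double>& values, const std::string& path) override {
        std::ofstream out(path, std::ios::binary);
        out.write(reinterpret_cast<const char*>(values.data()),
                  values.size() * sizeof(double)); // compact and fast, not portable
    }
};

int main() {
    std::unique_ptr<ParameterStore> store = std::make_unique<TextStore>();
    // Later, when speed becomes the bottleneck, only this one line changes:
    // store = std::make_unique<BinaryStore>();
    store->save({1, 2, 3, 4, 4, 1}, "polygon.dat");
}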
If you expect to edit the file by hand often (like XML or JSON), then go with a human-readable format.
Otherwise go with binary - it's much easier to create a parser for it and it will run faster than any grammar parser.
Also note that there's nothing preventing you from creating a converter between binary and human-readable form later.
Versioning files might sound nice, but are you really going to inspect the diffs for files "containing large arrays"?

How to compress an in memory array to a *.zip archive?

I have many multidimensional arrays like the one below. They are filled with specific values of course.
uint8_t data[32][64][32];
How can I compress such arrays in memory and then store them as *.zip archives to hard drive?
libzip is a ZIP compression and decompression library that allows one to operate on memory buffers as well as files.
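A rough sketch with libzip's C API (check the documentation of your installed version; note in particular that the buffer handed to zip_source_buffer is not copied, so it must stay alive until zip_close returns):

#include <zip.h>
#include <cstdint>

int main() {
    static uint8_t data[32][64][32] = {};   // the in-memory array to archive

    int err = 0;
    zip_t* archive = zip_open("data.zip", ZIP_CREATE | ZIP_TRUNCATE, &err);
    if (archive == nullptr) return 1;

    // Wrap the raw buffer as a zip "source"; libzip reads from it lazily,
    // so `data` must remain valid until zip_close().
    zip_source_t* src = zip_source_buffer(archive, data, sizeof(data), 0);
    if (src == nullptr) { zip_close(archive); return 1; }

    // On success the archive takes ownership of src; free it only on failure.
    if (zip_file_add(archive, "data.bin", src, ZIP_FL_OVERWRITE) < 0) {
        zip_source_free(src);
        zip_close(archive);
        return 1;
    }
    return zip_close(archive) == 0 ? 0 : 1;
}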
You should tell us a bit more about your problem to allow for a specific answer. Do you need to move these files to a different machine? Are the dimensions of the arrays always the same?
In the simplest case where you just want to store the data - always with the same format - and only use it on your machine with the same code, all you need is a compression library. I have used http://www.quicklz.com/ before and it is very easy to integrate. In this case, the fact that it is a multidimensional array does not matter, neither does the type of the data. Just hand a pointer to the first element and the length to the compressor and save the result to a file (that you can name .zip if you like). Then when you want to reload it, read the file, hand it to the decompressor and cast the result into an array with the same dimensions.
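Sketched concretely, that simplest case might look like this (using zlib's compress2/uncompress here in place of QuickLZ, simply because those calls are widely available; swap in whichever library you prefer):

#include <zlib.h>
#include <cstdint>
#include <vector>

int main() {
    static uint8_t data[32][64][32] = {};   // fill with real values in practice

    // Compress: hand the library the first element and the total byte count.
    uLongf compressed_len = compressBound(sizeof(data));
    std::vector<Bytef> compressed(compressed_len);
    if (compress2(compressed.data(), &compressed_len,
                  reinterpret_cast<const Bytef*>(data), sizeof(data),
                  Z_BEST_COMPRESSION) != Z_OK)
        return 1;
    compressed.resize(compressed_len);      // this buffer is what you write to a file

    // Decompress back into an array with the same dimensions.
    static uint8_t restored[32][64][32];
    uLongf restored_len = sizeof(restored);
    if (uncompress(reinterpret_cast<Bytef*>(restored), &restored_len,
                   compressed.data(), compressed_len) != Z_OK)
        return 1;
    return 0;
}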
For the more likely case where you have different types of arrays, and want them to be readable on other devices (and with other software) I suggest you use something like http://www.hdfgroup.org/HDF5/ that will allow you to store the data in a structured way including the datatype (with endianness etc.).
Take a look at libarchive (BSD license). It is very powerful and you have tight control over your data: you can combine different blocks of memory as "files" inside the compressed file (.tar.gz for example). There are several nice tutorials inside.