Extremely big integer multiplication and addition


Greetings,
I need to multiply two extremely long integer values stored in a text file (exported via GMP (MPIR, to be exact), so they can be in any base). Normally I would just import these integers via the mpz_inp_str() function and perform the multiplication in RAM; however, these values are so long that I can't really load them (about 1 GB of data each). What would be the fastest way to do this? Perhaps there are some external libraries that do this sort of thing already? Are there any easily implementable methods for this (performance is not incredibly important as this operation would only be performed once or twice)?
tl;dr: I need to multiply values so large they don't fit into process memory limits (Windows).
Thank you for your time.

I don't know if there is a library that supports this, but you could use GMP/MPIR on parts of each really big number (RBN). That is, start by breaking each RBN into manageable, uniformly sized chunks (e.g. 10M digit chunks, expect an undersized chunk for most significant digits, also see below).
RBN1 --> A B C D E
RBN2 --> F G H I J
The chunking can be done in base 10, so just read <chunk_size> characters from the file for each piece. Then multiply chunks from each number one at a time.
  AxF BxF CxF DxF ExF
+     AxG BxG CxG DxG ExG
+         AxH BxH CxH DxH ExH
+             AxI BxI CxI DxI ExI
+                 AxJ BxJ CxJ DxJ ExJ
Perform each column of the final sum in memory. Then, keeping the carry in memory, write the column out to disk, repeat for next column... For carries, convert each column sum result to a string with GMP, write out the bottom <chunk size> portion of the result and read the top portion back in as a GMP int for the carry.
I'd suggest selecting a chunk size dynamically for each multiplication in order to keep each column addition in memory; the larger the numbers, the more column additions will need to be done, the smaller the chunk size will need to be.
For both reading and writing, I'd suggest using memory mapped files; Boost has a nice interface for this (note that this does not load the entire file, it basically just buffers the IO on virtual memory). Open one mapped file for each input RBN, and one output file with size = size(RBN1) + size(RBN2) + 1. With memory mapped files, file access is treated as a raw char*, so you can read/write chunks directly using the GMP C-string IO methods. You will probably need to read into an intermediate buffer in order to NUL-terminate the strings for GMP (unless you want to temporarily alter the memory mapped file).
This isn't very easy to implement correctly, but then again this isn't a particularly easy problem (maybe just tedious). This approach has the advantage that it mirrors the schoolbook long multiplication that GMP itself uses for small operands, so the algorithm is well known.
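A minimal sketch of the column scheme described above, under loud assumptions: std::uint64_t chunks stand in for the GMP/MPIR values, std::string stands in for the memory mapped files, and the names split_chunks and multiply_chunked are made up for illustration. Chunks are stored least significant first; each column is summed, its low <chunk_size> digits are written out, and the rest is kept as the carry.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Split a decimal string into chunks of `digits` characters,
// least significant chunk first (the top chunk may be undersized).
std::vector<std::uint64_t> split_chunks(const std::string& s, std::size_t digits) {
    std::vector<std::uint64_t> chunks;
    for (std::size_t end = s.size(); end > 0;) {
        const std::size_t begin = end > digits ? end - digits : 0;
        chunks.push_back(std::stoull(s.substr(begin, end - begin)));
        end = begin;
    }
    return chunks;
}

// Multiply two non-negative decimal strings column by column.
// Assumption: 1 <= digits <= 4 and modest input lengths, so a column sum
// fits in 64 bits. A real version would hold each chunk in an mpz_t.
std::string multiply_chunked(const std::string& a, const std::string& b,
                             std::size_t digits = 4) {
    const std::vector<std::uint64_t> x = split_chunks(a, digits);
    const std::vector<std::uint64_t> y = split_chunks(b, digits);
    std::uint64_t base = 1;
    for (std::size_t i = 0; i < digits; ++i) base *= 10;

    std::string result;       // stands in for the output file
    std::uint64_t carry = 0;  // the only state kept between columns
    for (std::size_t col = 0; col + 1 < x.size() + y.size(); ++col) {
        std::uint64_t sum = carry;
        for (std::size_t i = 0; i < x.size() && i <= col; ++i) {
            if (col - i < y.size()) sum += x[i] * y[col - i];
        }
        std::string low = std::to_string(sum % base);
        low.insert(0, digits - low.size(), '0');  // pad to a full chunk
        result.insert(0, low);                    // "write the column out"
        carry = sum / base;                       // top portion becomes the carry
    }
    if (carry != 0) result.insert(0, std::to_string(carry));
    const std::size_t first = result.find_first_not_of('0');
    return first == std::string::npos ? "0" : result.substr(first);
}
```

Calling multiply_chunked("123456789", "987654321") walks five columns with 4-digit chunks. In the on-disk version the inner loop would read the two chunks it needs from the mapped input files instead of from x and y.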

Related

Reading large (~1GB) data file with C++ sometimes throws bad_alloc, even if I have more than 10GB of RAM available

I'm trying to read the data contained in a .dat file with size ~1.1GB.
Because I'm doing this on a 16GB RAM machine, I thought it would not be a problem to read the whole file into memory at once and only process it afterwards.
To do this, I employed the slurp function found in this SO answer.
The problem is that the code sometimes, but not always, throws a bad_alloc exception.
Looking at the task manager I see that there are always at least 10GB of free memory available, so I don't see how memory would be an issue.
Here is the code that reproduces this error
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
using namespace std;
int main()
{
    ifstream file;
    file.open("big_file.dat");
    if(!file.is_open())
    {
        cerr << "The file was not found\n";
        return 1; // nothing to read, so stop here
    }
    // Copy the whole file into the stringstream's internal buffer...
    stringstream sstr;
    sstr << file.rdbuf();
    // ...and then copy it again into a string.
    string text = sstr.str();
    cout << "Successfully read file!\n";
    return 0;
}
What could be causing this problem?
And what are the best practices to avoid it?

The fact that your system has 16GB doesn't mean any program at any time can allocate a given amount of memory. In fact, this might work on a machine that has only 512MB of physical RAM, if enough swap is available, or it might fail on an HPC node with 128GB of RAM – it's totally up to your Operating System to decide how much memory is available to you, here.
I'd also argue that std::string is never the data type of choice if actually dealing with a file, possibly binary, that large.
The point here is that there is absolutely no knowing how much memory stringstream tries to allocate. A pretty reasonable algorithm would double the amount of memory allocated every time the allocated internal buffer becomes too small to contain the incoming bytes. Also, libc++/libc will probably also have their own allocators that will have some allocation overhead, here.
Note that stringstream::str() returns a copy of the data contained in the stringstream's internal state, again leaving you with at least 2.2 GB of heap used up for this task.
Really, if you need to deal with data from a large binary file as something that you can access with the index operator [], look into memory mapping your file; that way, you get a pointer to the beginning of the file, and might work with it as if it was a plain array in memory, letting your OS take care of handling the underlying memory/buffer management. It's what OSes are for!
If you didn't know Boost before, it's kind of "the extended standard library for C++" by now, and of course, it has a class abstracting memory mapping a file: mapped_file.
The file I'm reading contains a series of data in ASCII tabular form, i.e. float1,float2\nfloat3,float4\n....
I'm browsing through the various possible solutions proposed on SO to deal with this kind of problem, but I was left wondering on this (to me) peculiar behaviour. What would you recommend in these kinds of circumstances?
Depends; I actually think the fastest way of dealing with this (since file IO is much, much slower than in-memory parsing of ASCII) is to parse the file incrementally, directly into an in-memory array of float variables. Your OS's read-ahead already overlaps disk access with your parsing, so you wouldn't even gain much speed by spawning separate threads for file reading and float conversion. std::copy, used to read from a std::ifstream into a std::vector<float>, should work fine here.
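A hedged sketch of that incremental parse for the float1,float2\n layout described in the question. The function name parse_pairs is hypothetical, and it reads line by line with std::getline rather than std::copy, because a plain std::istream_iterator<float> would stop at the first comma. Only one line of text is held in memory at a time.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Parse comma-separated floats, one record per line, without ever holding
// the whole file as text. std::stof throws on a malformed field.
std::vector<float> parse_pairs(std::istream& in) {
    std::vector<float> values;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string field;
        while (std::getline(fields, field, ',')) {
            if (!field.empty()) values.push_back(std::stof(field));
        }
    }
    return values;
}
```

Passing a std::ifstream opened on big_file.dat works the same way as the std::istringstream used for testing, since both are std::istream objects.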
I'm still not getting something: you say that file IO is much slower than in-memory parsing, and this I understand (and is the reason why I wanted to read the whole file at once). Then you say that the best way is to parse the whole file incrementally into an in-memory array of float. What exactly do you mean by this? Doesn't this mean to read the file line-by-line, resulting in a large number of file IO operations?
Yes, and no: First, of course, you will have more context switches than you'd have if you just ordered the whole file to be read at once. But those aren't that expensive -- at least, they're going to be much less expensive when you realize that most OSes and libc's know quite well how to optimize reads, and thus will fetch a whole lot of the file at once if you don't use extremely randomized read lengths. Also, you don't incur the penalty of trying to allocate a block of RAM at least 1.1GB in size -- that calls for some serious page table lookups, which aren't that fast, either.
Now, the idea is that your occasional context switch and the fact that, if you're staying single-threaded, there will be times when you don't read the file because you're still busy converting text to float will still mean less of a performance hit, because most of the time, your read will pretty much immediately return, as your OS/runtime has already prefetched a significant part of your file.
Generally, to me, you seem to be worried about all the wrong kinds of things: Performance seems to be important to you (is it really that important, here? You're using a brain-dead file format for interchanging floats, which is bloated, loses information, and on top of that is slow to parse), but you'd rather first read the whole file in at once and then start converting it to numbers. Frankly, if performance was of any criticality to your application, you would start to multi-thread/-process it, so that string parsing could already happen while data is still being read. Using buffers of a few kilo- to Megabytes to be read up to \n boundaries and exchanged with a thread that creates the in-memory table of floats sounds like it would basically reduce your read+parse time down to read+non-measurable without sacrificing read performance, and without the need for Gigabytes of RAM just to parse a sequential file.
By the way, to give you an impression of how bad storing floats in ASCII is:
The typical 32bit single-precision IEEE 754 floating point number has about 6-9 significant decimal digits. Hence, you will need at least 6 characters to represent these in ASCII, one ., typically one exponential divider, e.g. E, and on average 2.5 digits of decimal exponent, plus on average half a sign character (- or not), if your numbers are uniformly chosen from all possible IEEE 754 32bit floats:
-1.23456E-10
That's an average of 11 characters.
Add one , or \n after every number.
Now, your character is 1B, meaning that you blow up your 4B of actual data by a factor of 3, still losing precision.
Now, people always come around telling me that plaintext is more usable, because if in doubt, the user can read it… I've yet to see one user that can skim through 1.1GB (according to my calculations above, that's around 90 million floating point numbers, or 45 million floating point pairs) and not go insane.

In a 32 bit executable, the total memory address space is 4 GB. Of that, sometimes 1-2 GB is reserved for system use.
To allocate 1 GB, you need 1 GB of contiguous address space. To copy it, you need two 1 GB blocks. This can easily fail, unpredictably.
There are two approaches. First, switch to a 64 bit executable. This will not run on a 32 bit system.
Second, stop allocating 1 GB contiguous blocks. Once you start dealing with that much data, segmenting it and/or streaming it starts making a lot of sense. Done right, you'll also be able to start processing it before you finish reading it.
There are many file io datastructures, from stxxl to boost, or you can roll your own.

The size of the heap (a pool of memory used for dynamic allocations) is limited independently of the amount of RAM your machine has. You should use some other memory allocation technique for such large allocations, which will probably force you to change the way you read from the file.
If you are running on a UNIX based system you can look at the mmap function, or at the VirtualAlloc function if you are running on a Windows platform.

Algorithm for ordering strings to and from disk efficiently using minimal internal memory resources

I have a very large amount (multiple terabytes) of strings stored on disk that I need to sort alphabetically and store in another file as quickly as possible (preferably in C/C++) and using as little internal memory as possible. It is not an option to pre-index the strings beforehand, so I need to sort the strings whenever needed in a close to real-time fashion.
What would be the best algorithm to use in my case? I would prefer a suggestion for a linear algorithm rather than just a link to an existing software library like Lucene.

You usually sort huge external data by chunking it into smaller pieces, operating on them and eventually merging them back. When choosing the sorting algorithm you usually take a look at your requirements:
If you need a time-complexity guarantee that is also stable you can go for a mergesort (O(nlogn) guaranteed) although it requires an additional O(n) space.
If severely memory-bound you might want to try Smoothsort (constant memory, time O(nlogn))
Otherwise you might want to take a look at the research stuff in the gpgpu accelerators field like GPUTeraSort.
Google's servers routinely deal with this sort of problem.

Simply construct a digital tree (trie).
It will need much less memory than the input data, because many words share a common prefix. While adding data to the tree, you mark the last child of each word as an end of word by incrementing a counter on it. Once all words are added, you do a DFS (visiting children in the order you want to sort by, e.g. a->z) and write the data to the output file. The time complexity is proportional to the size of the trie in memory. It is hard to state the exact complexity because it depends on the strings (many short strings give better complexity), but it is still better than the size of the input data, O(n*k), where n is the count of strings and k is the average string length. I'm sorry for my English.
PS. To solve the problem with memory size, you can split the file into smaller parts and sort each with my method. If you end up with, for example, 1000 files, you keep the first word of each in memory (like queues), then repeatedly output the smallest word and read in the next one from the same file, which takes very little time.
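The trie approach above could look roughly like this. It is a sketch only: TrieNode, trie_insert, trie_collect and trie_sort are illustrative names, the output goes to a vector instead of a file, and ordering follows raw char values because std::map keeps its keys sorted.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::size_t count = 0;  // how many inserted words end at this node
    std::map<char, std::unique_ptr<TrieNode>> children;  // kept in char order
};

// Walk down from the root, creating nodes as needed, and bump the end counter.
void trie_insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    ++node->count;
}

// Depth-first walk in child order; a word is emitted once per insertion.
void trie_collect(const TrieNode& node, std::string& prefix,
                  std::vector<std::string>& out) {
    for (std::size_t i = 0; i < node.count; ++i) out.push_back(prefix);
    for (const auto& entry : node.children) {
        prefix.push_back(entry.first);
        trie_collect(*entry.second, prefix, out);
        prefix.pop_back();
    }
}

std::vector<std::string> trie_sort(const std::vector<std::string>& words) {
    TrieNode root;
    for (const std::string& w : words) trie_insert(root, w);
    std::vector<std::string> out;
    std::string prefix;
    trie_collect(root, prefix, out);
    return out;
}
```

Duplicates cost no extra nodes, only a counter increment, which is where the memory saving over the raw input comes from.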

I suggest you use the Unix "sort" command that can easily handle such files.
See How could the UNIX sort command sort a very large file?
Before disk drives even existed, people wrote programs to sort lists that were far too large to hold in main memory.
Such programs are known as external sorting algorithms.
My understanding is that the Unix "sort" command uses the merge sort algorithm.
Perhaps the simplest version of the external sorting merge sort algorithm works like this (quoting from Wikipedia: merge sort):
Name four tape drives as A, B, C, D, with the original data on A:
1: Merge pairs of records from A; writing two-record sublists alternately to C and D.
2: Merge two-record sublists from C and D into four-record sublists; writing these alternately to A and B.
3: Merge four-record sublists from A and B into eight-record sublists; writing these alternately to C and D.
4: Repeat until you have one list containing all the data, sorted --- in log2(n) passes.
Practical implementations typically have many tweaks:
Almost every practical implementation takes advantage of available RAM by reading many items into RAM at once, using some in-RAM sorting algorithm, rather than reading only one item at a time.
Some implementations are able to sort lists even when some or every item in the list is too large to hold in the available RAM.
Some implementations use a polyphase merge sort.
As suggested by Kaslai, rather than only 4 intermediate files, it is usually quicker to use 26 or more intermediate files. However, as the external sorting article points out, if you divide up the data into too many intermediate files, the program spends a lot of time waiting for disk seeks; too many intermediate files make it run slower.
As Kaslai commented, using larger RAM buffers for each intermediate file can significantly decrease the sort time -- doubling the size of each buffer halves the number of seeks. Ideally each buffer should be sized so the seek time is a relatively small part of the total time to fill that buffer. Then the number of intermediate files should be picked so the total size of all those RAM buffers put together comes close to but does not exceed available RAM. (If you have very short seek times, as with a SSD, the optimal arrangement ends up with many small buffers and many intermediate files. If you have very long seek times, as with tape drives, the optimal arrangement ends up with a few large buffers and few intermediate files. Rotating disk drives are intermediate).
etc. -- See the Knuth book "The Art of Computer Programming, Vol. 3: Sorting and Searching" for details.
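To make the "sort runs in RAM, then merge many runs at once" tweak concrete, here is a small sketch. In-memory vectors stand in for the intermediate files, and make_sorted_runs and merge_runs are names invented for this example, so treat it as an outline rather than how the Unix sort command is written.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Phase 1: cut the input into runs of at most `run_size` items and sort
// each run in memory. Each run stands in for one intermediate file.
std::vector<std::vector<std::string>> make_sorted_runs(
        const std::vector<std::string>& input, std::size_t run_size) {
    if (run_size == 0) run_size = 1;
    std::vector<std::vector<std::string>> runs;
    for (std::size_t i = 0; i < input.size(); i += run_size) {
        const std::size_t end = std::min(i + run_size, input.size());
        runs.emplace_back(input.begin() + static_cast<std::ptrdiff_t>(i),
                          input.begin() + static_cast<std::ptrdiff_t>(end));
        std::sort(runs.back().begin(), runs.back().end());
    }
    return runs;
}

// Phase 2: k-way merge. The heap holds one candidate per run, so memory
// use depends on the number of runs, not on the total data size.
std::vector<std::string> merge_runs(
        const std::vector<std::vector<std::string>>& runs) {
    using Entry = std::pair<std::string, std::size_t>;  // (value, run index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> next(runs.size(), 0);
    for (std::size_t r = 0; r < runs.size(); ++r) {
        if (!runs[r].empty()) heap.emplace(runs[r][next[r]++], r);
    }
    std::vector<std::string> out;
    while (!heap.empty()) {
        const Entry top = heap.top();
        heap.pop();
        out.push_back(top.first);
        const std::size_t r = top.second;
        if (next[r] < runs[r].size()) heap.emplace(runs[r][next[r]++], r);
    }
    return out;
}
```

With real files, each run would be read through its own RAM buffer, which is where the buffer-size versus number-of-files trade-off described above comes in.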

Use as much memory as you can and chunk your data. Read one chunk at a time into memory.
Step 1) Sort entries inside chunks
For each chunk:
Use IntroSort to sort your chunk. But to avoid copying your strings around and having to deal with variable sized strings and memory allocations (at this point it will be interesting and relevant if you actually have fixed or max size strings or not), preallocate a standard std array or other fitting container with pointers to your strings that point to a memory region inside the current data chunk. => So your IntroSort swaps the pointers to your strings, instead of swapping actual strings.
Loop over each entry in your sort-array and write the resulting (ordered) strings back to a corresponding sorted strings file for this chunk
Step 2) Merge all strings from sorted chunks into resulting sorted strings file
Allocate a "sliding" window memory region for all sorted strings files at once. To give an example: If you have 4 sorted strings files, allocate 4 * 256MB (or whatever fits, the larger the less (sequential) disk IO reads required).
Fill each window by reading the strings into it (so, read as many strings at once as your window can store).
Use MergeSort to compare any of your chunks, using a comparator to your window (e.g. stringInsideChunkA = getStringFromWindow(1, pointerToCurrentWindow1String), where pointerToCurrentWindow1String is a reference that the function advances to the next string). Note that if the string pointer to your window is beyond the window size (or the last record didn't fit into the window), read the next memory region of that chunk into the window.
Use mapped IO (or a buffered writer) and write the resulting strings into a giant final sorted strings file.
I think this could be an IO efficient way. But I've never implemented such thing.
However, in regards to your file size and yet unknown to me "non-functional" requirements, I suggest you to also consider benchmarking a batch-import using LevelDB [1]. It's actually very fast, minimizes disk IO, and even compresses your resulting strings file to about half the size without impact on speed.
[1] http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html
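Step 1 of the answer above (sort references to the strings, not the strings themselves) might be sketched as follows. Two assumptions: records are newline-separated, and (offset, length) pairs are used in place of raw pointers. std::sort stands in for IntroSort, which is how it is often implemented; Span, sort_spans and write_sorted are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Span = std::pair<std::size_t, std::size_t>;  // (offset, length) in the chunk

// Index every newline-separated record in `chunk`, then sort the index.
// The chunk's bytes are never moved; only the small Span entries are swapped.
std::vector<Span> sort_spans(const std::string& chunk) {
    std::vector<Span> spans;
    for (std::size_t start = 0; start < chunk.size();) {
        std::size_t end = chunk.find('\n', start);
        if (end == std::string::npos) end = chunk.size();
        spans.emplace_back(start, end - start);
        start = end + 1;
    }
    std::sort(spans.begin(), spans.end(), [&chunk](const Span& a, const Span& b) {
        return chunk.compare(a.first, a.second, chunk, b.first, b.second) < 0;
    });
    return spans;
}

// Write the records out in sorted order (a string stands in for the
// per-chunk sorted strings file).
std::string write_sorted(const std::string& chunk, const std::vector<Span>& spans) {
    std::string out;
    for (const Span& s : spans) {
        out.append(chunk, s.first, s.second);
        out.push_back('\n');
    }
    return out;
}
```

Each swap during the sort moves two size_t values regardless of how long the strings are, which is the point of the indirection.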

Here is a general algorithm that will be able to do what you want with just a few gigs of memory. You could get away with much less, but the more you have, the less disk overhead you have to deal with. This assumes that all of the strings are in a single file, but it could also be applied to a multiple file setup.
1: Create some files to store loosely sorted strings in. For terabytes of data, you'd probably want 676 of them. One for strings starting in "aa", one for "ab", and so on until you get to "zy" and "zz".
2: For each file you created, create a corresponding buffer in memory. A std::vector<std::string> perhaps.
3: Determine a buffer size that you want to work with. This should not go much beyond 1/2 of your available physical memory.
4: Load as many strings as you can into this buffer.
5: Truncate the file so that the strings in your buffer are no longer on disk. This step can be delayed for later or omitted entirely if you have the disk space to work with or the data is too sensitive to lose in the case of process failure. If truncating, make sure you load your strings from the end of the file, so that the truncation is almost a NOP.
6: Iterate over the strings and store them in their corresponding buffer.
7: Flush all of the buffers to their corresponding files. Clear all the buffers.
8: Go to step 4 and repeat until you have exhausted your source of strings.
9: Read each file to memory and sort it with whatever algorithm you fancy. On the off chance you end up with a file that is larger than your available physical memory, use a similar process from above to split it into smaller files.
10: Overwrite the unsorted file with this new sorted file, or append it to a monolithic file.
If you keep the individual files rather than a monolithic file, you can make insertions and deletions relatively quickly. You would only have to load in, insert, and sort the value into a single file that can be read entirely into memory. Now and then you might have to split a file into smaller files, however this merely amounts to looking around the middle of the file for a good place to split it and then just moving everything after that point to another file.
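A compact sketch of the bucket idea, with a std::map of in-memory vectors standing in for the 676 files and bucket_sort as a made-up name. It keys on the raw first two characters (fewer for shorter strings) rather than only "aa" to "zz", which is an assumption made so the example handles arbitrary input.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Distribute strings into buckets by their first two characters, sort each
// bucket on its own, and concatenate the buckets in key order.
std::vector<std::string> bucket_sort(const std::vector<std::string>& input) {
    std::map<std::string, std::vector<std::string>> buckets;  // key -> "file"
    for (const std::string& s : input) {
        buckets[s.substr(0, 2)].push_back(s);  // steps 6-7: route to its bucket
    }
    std::vector<std::string> out;
    for (auto& entry : buckets) {
        // Step 9: each bucket is small enough to sort in memory.
        std::sort(entry.second.begin(), entry.second.end());
        // Step 10: append to the monolithic result.
        out.insert(out.end(), entry.second.begin(), entry.second.end());
    }
    return out;
}
```

Because every string in bucket "ab" compares less than every string in bucket "b", no merge pass is needed after the per-bucket sorts.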
Good luck with your project.

Optimizing for 3D imaging processes in C++

I am working with 3D volumetric images, possibly large (256x256x256). I have 3 such volumes that I want to read in and operate on. Presently, each volume is stored as a text file of numbers which I read in using ifstream. I save it as a matrix (this is a class I have written that dynamically allocates a 3D array). Then I perform operations on these 3 matrices: addition, multiplication and even Fourier transform. So far, everything works well, but it takes a hell of a lot of time, especially the Fourier transform since it has 6 nested loops.
I want to know how I can speed this up. Also, whether the fact that I have stored the images in text files makes a difference. Should I save them as binary or in some other easier/faster to read in format? Is fstream the fastest way I can read in? I use the same 3 matrices each time without changing them. Does that make a difference? Also, is pointer to pointer to pointer the best way to store a 3D volume? If not what else can I do?

Also, is pointer to pointer to pointer best way to store a 3d volume?
Nope, that's usually very inefficient.
If not what else can I do?
It's likely that you will get better performance if you store it in a contiguous block, and use computed offsets into the block.
I'd usually use a structure like this:
#include <vector>

class DataBlock {
    unsigned int nx;
    unsigned int ny;
    unsigned int nz;
    std::vector<double> data;
public:
    DataBlock(unsigned int in_nx, unsigned int in_ny, unsigned int in_nz) :
        nx(in_nx), ny(in_ny), nz(in_nz), data(in_nx*in_ny*in_nz, 0)
    {}
    // You may want to make this check bounds in debug builds.
    // x varies fastest in memory, so keep x in the innermost loop.
    double& at(unsigned int x, unsigned int y, unsigned int z) {
        return data[ x + y*nx + z*nx*ny ];
    }
    const double& at(unsigned int x, unsigned int y, unsigned int z) const {
        return data[ x + y*nx + z*nx*ny ];
    }
private:
    // Don't want this class copied, so the copy constructor and assignment
    // are declared private and left undefined.
    DataBlock(const DataBlock&);
    DataBlock& operator=(const DataBlock&);
};

Storing a large (256^3 elements) 3D image file as plain text is a waste of resources.
Without loss of generality, if you have a plain text file for your image and each line of your file consists of one value, you will have to read several characters until you find the end of the line (for a 3-digit number, these will be 4 bytes; 3 bytes for the digits, 1 byte for newline). Afterwards you will have to convert these single digits to a number. When using binary, you directly read a fixed amount of bytes and you will have your number. You could and should write and read it as a binary image.
There are several formats for doing so, the one I would recommend is the meta image file format of VTK. In this format, you have a plain text header file and a binary file with the actual image data. With the information from the header file you will know how large your image is and what datatype you will be using. In your program, you then directly read the binary data and save it to a 3D array.
If you really want to speed things up, use CUDA or OpenCL which will be pretty fast for your applications.
There are several C++ libraries that can help you with writing, saving and manipulating image data, including the before-mentioned VTK and ITK.

256^3 is a rather large number. Parsing 256^3 text strings will take a considerable amount of time. Using binary will make the reading/writing process much faster because it doesn't require converting a number to/from a string, and it uses much less space. For example, to read the number 123 from a text file the program will need to read it as a string and convert it from decimal to binary using lots of multiplies by 10. Whereas if you had written it directly as the binary value 0b01111011, you only need to read that byte back into memory, with no conversion at all.
Using hexadecimal numbers may also increase reading speed since each hex digit maps directly to a binary value, but if you need more speed, a binary file is the way to go. Just a single fread call is enough to load the whole 256^3 bytes = 16MB file into memory in less than 1 sec. And when you're done, just fwrite it back to the file. To speed up the processing you can use SIMD (SSE/AVX), CUDA or another parallel processing technique. You can improve the speed even further by multithreading or by only saving the non-zero values, because in many cases most values will be 0's.
Another possible reason is that your array is large and each dimension is a power of 2. This has been discussed in many questions on SO:
Why is there huge performance hit in 2048x2048 versus 2047x2047 array multiplication?
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
You may consider changing the last dimension to 257 and trying again. Or, better, use another algorithm, like divide and conquer, that's more cache friendly.

You should add timers around the load and the process so you know which is taking the most time, and focus your optimization efforts on it. If you control the file format, make one that is more efficient to read. If it is the processing, I'll echo what previous folks have said, investigate efficient memory layout as well as GPGPU computing. Good luck.
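One way to add such timers is a small wrapper around std::chrono::steady_clock. The name elapsed_ms is made up for this sketch.

```cpp
#include <chrono>
#include <utility>

// Run `fn` once and return how long it took, in milliseconds.
template <typename Fn>
double elapsed_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the load step and the processing step in separate elapsed_ms calls shows which one dominates before any optimization work starts.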

How to store bits to a huge char array for file input/output

I want to store lots of information to a block by bits, and save it into a file.
To keep my file from getting too big, I want to use a small number of bits to save specified information instead of an int.
For example, I want to store Day, Hour, Minute to a file.
I only want 5 bit(day) + 5 bit(Hour) + 6 bit(Minute) = 16 bit of memory for data storage.
I cannot find an efficient way to store it in a block to put in a file.
There are some big problems that concern me:
The data length I want to store each time is not constant. It depends on the incoming information. So I cannot use a structure to store it.
There must not be any unused bits in my block. I searched some topics that mentioned that if I store 30 bits in an int (4 byte variable), then the next 3 bits I save will automatically go into the next int, but I do not want that to happen!!
I know I can use shift right and shift left to put a number into a char, and put the char into a block, but it is inefficient.
I want a char array that I can keep putting specified bits into, and use write to put it into a file.

I think I'd just use the number of bits necessary to store the largest value you might ever need for any given piece of information. Then, Huffman encode the data as you write it (and obviously Huffman decode it as you read it). Most other approaches are likely to be less efficient, and many are likely to be more complex as well.

I haven't seen such a library, so I'm afraid you'll have to write one yourself. It won't be difficult, anyway.
As for efficiency: this kind of operation always needs bit shifting and masking, because few CPUs support operating directly on bits, especially across two machine words. The only difference is whether you or your compiler does the translation.
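A minimal sketch of what "write one yourself" could look like: a writer that appends an arbitrary number of bits with no padding between fields, plus a matching reader. BitWriter and BitReader are illustrative names, bits are packed most significant first, and the one-bit-at-a-time loop favors clarity over speed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class BitWriter {
public:
    // Append the low `bits` bits of `value`, most significant bit first.
    // Assumes bits <= 32.
    void write(std::uint32_t value, unsigned bits) {
        for (unsigned i = bits; i-- > 0;) {
            if (used_ == 0) bytes_.push_back(0);  // start a fresh byte
            if ((value >> i) & 1u) {
                bytes_.back() |= static_cast<std::uint8_t>(0x80u >> used_);
            }
            used_ = (used_ + 1) % 8;
        }
    }
    // The packed bytes, ready to be passed to write() on a file.
    const std::vector<std::uint8_t>& bytes() const { return bytes_; }

private:
    std::vector<std::uint8_t> bytes_;
    unsigned used_ = 0;  // bits already occupied in the last byte
};

class BitReader {
public:
    explicit BitReader(const std::vector<std::uint8_t>& bytes) : bytes_(bytes) {}
    // Read the next `bits` bits; the caller must not read past the end.
    std::uint32_t read(unsigned bits) {
        std::uint32_t value = 0;
        for (unsigned i = 0; i < bits; ++i, ++pos_) {
            const unsigned bit = (bytes_[pos_ / 8] >> (7 - pos_ % 8)) & 1u;
            value = (value << 1) | bit;
        }
        return value;
    }

private:
    const std::vector<std::uint8_t>& bytes_;
    std::size_t pos_ = 0;  // absolute bit position
};
```

Writing day 17 in 5 bits, hour 13 in 5 bits and minute 45 in 6 bits fills exactly two bytes; a following 3-bit field would start at bit 0 of the third byte with nothing wasted in between.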

Optimal datafile format loading on a game console

I need to load large models and other structured binary data on an older CD-based game console as efficiently as possible. What's the best way to do it? The data will be exported from a Python application. This is a pretty elaborate hobby project.
Requirements:
no reliance on a fully standard compliant STL - I might use uSTL though.
as little overhead as possible. Aim for a solution so good that it could be used on the original Playstation, and yet as modern and elegant as possible.
no backward/forward compatibility necessary.
no copying of large chunks around - preferably files get loaded into RAM in background, and all large chunks accessed directly from there later.
should not rely on the target having the same endianness and alignment, i.e. a C plugin in Python which dumps its structs to disc would not be a very good idea.
should allow to move the loaded data around, as with individual files 1/3 the RAM size, fragmentation might be an issue. No MMU to abuse.
robustness is a great bonus, as my attention span is very short, i.e. i'd change saving part of the code and forget the loading one or vice versa, so at least a dumb safeguard would be nice.
exchangeability between loaded data and runtime-generated data without runtime overhead and without severe memory management issues would be a nice bonus.
I kind of have a semi-plan of parsing in Python trivial, limited-syntax C headers which would use structs with offsets instead of pointers, and convenience wrapper structs/classes in the main app with getters which would convert offsets to properly typed pointers/references, but i'd like to hear your suggestions.
Clarification: the request is primarily about data loading framework and memory management issues.

On platforms like the Nintendo GameCube and DS, 3D models are usually stored in a very simple custom format:
A brief header, containing a magic number identifying the file, the number of vertices, normals, etc., and optionally a checksum of the data following the header (Adler-32, CRC-16, etc).
A possibly compressed list of 32-bit floating-point 3-tuples for each vertex and normal.
A possibly compressed list of edges or faces.
All of the data is in the native endian format of the target platform.
The compression format is often trivial (Huffman), simple (Arithmetic), or standard (gzip). All of these require very little memory or computational power.
You could take formats like that as a cue: it's quite a compact representation.
My suggestion is to use a format most similar to your in-memory data structures, to minimize post-processing and copying. If that means you create the format yourself, so be it. You have extreme needs, so extreme measures are needed.

This is a common game development pattern.
The usual approach is to cook the data in an offline pre-process step. The resulting blobs can be streamed in with minimal overhead. The blobs are platform dependent and should contain the proper alignment & endian-ness of the target platform.
At runtime, you can simply cast a pointer to the in-memory blob file. You can deal with nested structures as well. If you keep a table of contents with offsets to all the pointer values within the blob, you can then fix-up the pointers to point to the proper address. This is similar to how dll loading works.
I've been working on a Ruby library, bbq, that I use to cook data for my iPhone game.
Here's the memory layout I use for the blob header:
// Memory layout
//
// p beginning of file in memory.
// p + 0 : num_pointers
// p + 4 : offset 0
// p + 8 : offset 1
// ...
// p + (num_pointers * 4) : offset n-1
// p + ((num_pointers + 1) * 4) : num_pointers // again so we can figure out
// what memory to free.
// p + ((num_pointers + 2) * 4) : start of cooked data
//
Here's how I load binary blob file and fix up pointers:
// NOTE: pointers are stored in unsigned int fields, so this assumes a
// target with 32-bit pointers.
void* bbq_load(const char* filename)
{
    unsigned char* p;
    int size = LoadFileToMemory(filename, &p);
    if(size <= 0)
        return 0;
    // get the start of the pointer table
    unsigned int* ptr_table = (unsigned int*)p;
    unsigned int num_ptrs = *ptr_table;
    ptr_table++;
    // get the start of the actual data
    // the 2 is to skip past both num_pointer values
    unsigned char* base = p + ((num_ptrs + 2) * sizeof(unsigned int));
    // fix up the pointers: each table entry is the offset (from base) of a
    // pointer field, which itself holds an offset relative to its own address
    while ((ptr_table + 1) < (unsigned int*)base)
    {
        unsigned int* ptr = (unsigned int*)(base + *ptr_table);
        *ptr = (unsigned int)((unsigned char*)ptr + *ptr);
        ptr_table++;
    }
    return base;
}
My bbq library isn't quite ready for prime time, but it could give you some ideas on how to write one yourself in Python.
Good Luck!

I note that nowhere in your description do you ask for "ease of programming". :-)
Thus, here's what comes to mind for me as a way of creating this:
The data should be in the same on-disk format as it would be in the target's memory, such that it can simply pull blobs from disk into memory with no reformatting it. Depending on how much freedom you want in putting things into memory, the "blobs" could be the whole file, or could be smaller bits within it; I don't understand your data well enough to suggest how to subdivide it but presumably you can. Because we can't rely on the same endianness and alignment on the host, you'll need to be somewhat clever about translating things when writing the files on the host-side, but at least this way you only need the cleverness on one side of the transfer rather than on both.
In order to provide a bit of assurance that the target-side and host-side code matches, you should write this in a form where you provide a single data description and have some generation code that will generate both the target-side C code and the host-side Python code from it. You could even have your generator generate a small random "version" number in the process, and have the host-side code write this into the file header and the target-side check it, and give you an error if they don't match. (The point of using a random value is that the only information bit you care about is whether they match, and you don't want to have to increment it manually.)

Consider storing your data as BLOBs in a SQLite DB. SQLite is extremely portable and lightweight, is written in ANSI C, and has both C++ and Python interfaces. This will take care of large files, no fragmentation, variable-length records with fast access, and so on. The rest is just serialization of structs to these BLOBs.
