I have an application that manages a large number of strings. Strings are in a path-like format and have many common parts, but without a clear rule. They are not paths on the file-system but can be considered like so.
I clearly need to optimize memory consumption but without a big performance sacrifice.
I am considering 2 options:
- implement a compressed_string class that stores data zipped, but i need a fixed dictionary and i cant find a library for this right now. I don't want a Huffman on bytes, I want it on words.
- implement some kind of flyweight pattern on string parts.
The problem looks like a common one and I'm wonder what is the best solution to it or if someone knows a library that targets this issue.
thanks
Although it might be tempting to tune a specific algorithm for your problem, it is likely to require an unreasonable amount of time and effort, while standard compression techniques will immediately provide you a great boost to solve your memory consumption problem.
The "standard" way to handle this issue is to chunk source data into small blocks (such as 256KB), and compress them individually. When accessing data into the block, you need to decode it first. Therefore, the optimal block size really depends on your application, i.e. the more the application streams, the larger the blocks; on the other hand, the more random access pattern, the smaller the block size.
If you are worried by the compression/decompression speed, use a high-speed algorithm. If decompression speed is the most important metric (for access time), something like LZ4 will provide you about 1GB/s decoding performance per core, so this gives you an idea of how many blocks per second you can decode.
If only decompression speed matters, you may use the high-compression variant LZ4-HC, which will boost compression ratio even more by about 30%, while also improving decompression speed.
Strings are in a path-like format and have many common parts, but without a clear rule.
In the sense that they are locators in a hierarchy of the form name, (separator, name)*? If so, you can use interning: store the name parts as char const * elements that point into a pool of strings. That way, you effectively compress a name that is used n times to just over n * sizeof(char const *) + strlen(name) bytes. The full path would become a sequence of interned names, e.g. an std::vector.
It might seem that sizeof(char const *) is big on 64-bit hardware, but you also save some of the allocation overhead. Or, if you know for some reason that you'll never need more than, say, 65536 strings, you might store them as
class interned_name
{
uint16_t tab_idx;
public:
char const *c_str() const
{
return NAME_TABLE[tab_idx];
}
};
where NAME_TABLE is an static std::unordered_map<uint16_t, char const *>.
Related
I'm modernizing the code that reads and writes to a custom binary file format now.
I'm allowed to use C++17 and have already modernized large parts of the code base.
There are mainly two problems at hand.
binary selectors (my own name)
cased selectors (my own name as well)
For #1 it is as follows:
Given that 1 bit is set in a binary string. You either (read/write) two completely different structs.
For example, if bit 17 is set to true, it means bits 18+ should be streamed with Struct1.
But if bit 17 is false, bits 18+ should be streamed with Struct2.
Struct1 and Struct2 are completely different with minimal overlap.
For #2 it is basically the same but as follows:
Given that x bits in the bit stream are set. You have a potential pool of completely different structs. The number of structs is allowed to be between [0, 2**x] (inclusive range).
For instance, in one case you might have 3 bits and 5 structs.
But in another case, you might have 3 bits and 8 structs.
Again the overlap between the structs is minimal.
I'm currently using std::variant for this.
For case #1, it would be just two structs std::variant<Struct1, Struct2>
For case #2, it would be just a flat list of the structs again using std::variant.
The selector I use is naturally the index in the variant, but it needs to be remapped for a different bit pattern that actually needs to be written/read to/from the format.
Have any of you used or encountered some better strategies for these cases?
Is there a generally known approach to solve this in a much better way?
Is there a generally known approach to solve this in a much better way?
Nope, it's highly specific.
Have any of you used or encountered some better strategies for these cases?
The bit patterns should be encoded in the type, somehow. Almost all the (de)serialization can be generic so long as the required information is stored somewhere.
For example,
template <uint8_t BitPattern, typename T>
struct IdentifiedVariant;
// ...
using Field1 = std::variant< IdentifiedVariant<0x01, Field1a>,
IdentifiedVariant<0x02, Field1b> >;
I've absolutely used types like this in the past to automate all the boilerplate, but the details are extremely specific to the format and rarely reusable.
Note that even though you can't overlay your variant type on a buffer, there's no need for (de)serialization to proceed bit-by-bit. There's hardly any speed penalty so long as the data is already read into a buffer - if you really need to go full zero-copy, you can just have your FieldNx types keep a pointer into the buffer and decode fields on demand.
If you want your data to be bit-continuous you can't use std::variant You will need to use std::bitset or managing the memory completely manually to do that. But it isn't practical to do so because your structs will not be byte-aligned so you will need to do every read/write manually bit by bit. This will reduce speed greatly, so I only recommend this way if you want to save every bit of memory even at the cost of speed. And at this storage it will be hard to find the nth element you will need to iterate.
std::variant<T1,T2> will waste a bit of space because 1) it will always use enough space for storing the the bigger data, but using that over bit-manipulation will increase the speed and the readability of the code. (And will be easier to write)
I'm developing a tiny search engine using TF-IDF and cosine similarity. When pages are added, I build an inverted index to keep words frequency in the different pages. I remove stopwords and more common words, and plural/verb/etc. are stemmed.
My inverted index looks like:
map< string, map<int, float> > index
[
word_a => [ id_doc=>frequency, id_doc2=>frequency2, ... ],
word_b => [ id_doc->frequency, id_doc2=>frequency2, ... ],
...
]
With this data structure, I can get the idf weight with word_a.size(). Given a query, the program loops over the keywords and scores the documents.
I don't know well data structures and my questions are:
How to store a 500 Mo inverted index in order to load it at search time? Currently, I use boost to serialize the index:
ofstream ofs_index("index.sr", ios::binary);
boost::archive::bynary_oarchive oa(ofs_index);
oa << index;
And then I load it at search time:
ifstream ifs_index("index.sr", ios::binary);
boost::archive::bynary_iarchive ia(ifs_index);
ia >> index;
But it is very slow, it takes sometines 10 seconds to load.
I don't know if map are efficient enough for inverted index.
In order to cluster documents, I get all keywords from each document and I loop over these keywords to score similar documents, but I would like to avoid reading again each document and use only this inverted index. But I think this data structure would be costly.
Thank you in advance for any help!
The answer will pretty much depend on whether you need to support data comparable to or larger than your machine's RAM and whether in your typical use case you are likely to access all of the indexed data or rather only a small fraction of it.
If you are certain that your data will fit into your machine's memory, you can try to optimize the map-based structure you are using now. Storing your data in a map should give pretty fast access, but there will always be some initial overhead when you load the data from disk into memory. Also, if you only use a small fraction of the index, this approach may be quite wasteful as you always read and write the whole index, and keep all of it in memory.
Below I list some suggestions you could try out, but before you commit too much time to any of them, make sure that you actually measure what improves the load and run times and what does not. Without profiling the actual working code on actual data you use, these are just guesses which may be completely wrong.
map is implemented as a tree (usually black-red tree). In many cases, a hash_map may give you better performance as well as better memory usage (fewer allocations and less fragmentation for example).
Try reducing the size of the data - less data means it will be faster to read it from disk, potentially less memory allocation, and sometimes better in-memory performance due to better locality. You may for example consider that you use float to store the frequency, but perhaps you could store only the number of occurrences as an unsigned short in the map values and in a separate map store the number of all words for each document (also as a short). Using the two numbers, you can re-calculate the frequency, but use less disk space when you save the data to disk, which could result in faster load times.
Your map has quite a few entries, and sometimes using custom memory allocators helps improve performance in such a case.
If your data could potentially grow beyond the size of your machine's RAM, I would suggest you use memory-mapped files for storing the data. Such an approach may require re-modelling your data structures and either using custom STL allocators or using completely custom data structures instead of std::map but it may improve your performance an order of magnitude if done well. In particular, this approach frees your from having to load the whole structure into memory at once, so your startup times will improve dramatically at the cost of slight delays related to disk accesses distributed over time as you touch different parts of the structure for the first time. The subject is quite broad, and requires much deeper changes to your code than just tuning the map, but if you plan handling huge data, you should certainly have a look at mmap and friends.
I'm currently evaluating whether I should utilize a single large bitset or many 64-bit unsigned longs (uint_64) to store a large amount of bitmap information. In this case, the bitmap represents the current status of a few GB of memory pages (dirty / not dirty), and has thousands of entries.
The work which I am performing requires that I be able to query and update the dirty pages, including performing OR operations between two dirty page bitmaps.
To be clear, I will be performing the following:
Importing a bitmap from a file, and performing a bitwise OR operation with the existing bitmap
Computing the hamming weight (counting the number of bits set to 1, which represents the number of dirty pages)
Resetting / clearing a bit, to mark it as updated / clean
Checking the current status of a bit, to determine if it is clean
It looks like it is easy to perform bitwise operations on a C++ bitset, and easily compute the hamming weight. However, I imagine there is no magic here -- the CPU can only perform bitwise operations on as many bytes as it can store in a register -- so the routine utilized by the bitset is likely the same I would implement myself. This is probably also true for the hamming weight.
In addition, importing the bitmap data from the file to the bitset looks ugly -- I need to perform bitshifts multiple times, as shown here. I imagine given the size of the bitsets I would be working with, this would have a negative performance impact. Of course, I imagine I could just use many small bitsets instead, but there may be no advantage to this (other then perhaps ease of implementation).
Any advice is appriciated, as always. Thanks!
Sounds like you have a very specific single-use application. Personally, I've never used a bitset, but from what I can tell its advantages are in being accessible as if it was an array of bools as well as being able to grow dynamically like a vector.
From what I can gather, you don't really have a need for either of those. If that's the case and if populating the bitset is a drama, I would tend towards doing it myself, given that it really is quite simple to allocate a whole bunch of integers and do bit operations on them.
Given that have very specific requirements, you will probably benefit from making your own optimizations. Having access to the raw bit data is kinda crucial for this (for example, using pre-calculated tables of hamming weights for a single byte, or even two bytes if you have memory to spare).
I don't generally advocate reinventing the wheel... But if you have special optimization requirements, it might be best to tailor your solution towards those. In this case, the functionality you are implementing is pretty simple.
Thousands bits does not sound as a lot. But maybe you have millions.
I suggest you write your code as-if you had the ideal implementation by abstracting it (to begin with use whatever implementation is easier to code, ignoring any performance and memory requirement problems) then try several alternative specific implementations to verify (by measuring them) which performs best.
One solution that you did not even consider is to use Judy arrays (specifically Judy1 arrays).
I think if I were you I would probably just save myself the hassle of any DIY and use boost::dynamic_bitset. They've got all the bases covered in terms of functionality, including stream operator overloads which you could use for file IO (or just read your data in as unsigned ints and use their conversions, see their examples) and a count method for your Hamming weight. Boost is very highly regarded a least by Sutter & Alexandrescu, and they do everything in the header file--no linking, just #include the appropriate files. In addition, unlike the Standard Library bitset, you can wait until runtime to specify the size of the bitset.
Edit: Boost does seem to allow for the fast input reading that you need. dynamic_bitset supplies the following constructor:
template <typename BlockInputIterator>
dynamic_bitset(BlockInputIterator first, BlockInputIterator last,
const Allocator& alloc = Allocator());
The underlying storage is a std::vector (or something almost identical to it) of Blocks, e.g. uint64s. So if you read in your bitmap as a std::vector of uint64s, this constructor will write them directly into memory without any bitshifting.
Let's say that you were to have a structure similar to the following:
struct Person {
int gender; // betwwen 0-1
int age; // between 0-200
int birthmonth; // between 0-11
int birthday; // between 1-31
int birthdayofweek; // between 0-6
}
In terms of performance, which would be the best data type to store each of the fields? (e.g. bitfield, int, char, etc.)
It will be used on an x86 processor and stored entirely in RAM. A fairly large number will need to be stored (50,000+), so processor caches and such will need to be taken into account.
Edit: OK, let me rephrase the question. If memory usage is not important, and the entire dataset will not fit into the cache no matter which datatypes are used, is it generally better to use smaller datatypes to fit more of the data into the CPU cache, or is it better to use larger datatypes to allow the CPU to perform faster operations? I am asking this for reference only, so code readability and such should not be considered.
Don't worry about it; use what is semantically correct, and most readable
There is no answer that is correct all the time. It depends on the platform and compiler. If you really care, then you have to test.
In general, I would stay stick with ints... except for gender which should probably be an enum.
int_fast#_t from <stdint.h> or <boost/cstdint.hpp>.
That said, you'll give up simplicity and consistency (these types may be a character type, for example, which are integer types in C/C++, and that might lead to surprising function resolutions) instead of just using an int.
You'll see much more significant performance benefits by concentrating on other areas, like algorithmic complexity and access patterns.
It will be used on an x86 processor and stored entirely in RAM. A fairly large number will need to be stored (50,000+), so processor caches and such will need to be taken into account.
You still have to worry about cache (after you're at that level of optimization), even if the whole data won't be cached. For example, do you access every item in sequence? unpredictable? or just one field from every item in sequence? Compare struct { int a, b; } data[N]; to int data_a[N], data_b[N];. (Imagine you need all the 'a' at once, but can ignore the other, which way is more cache friendly?) Again, this doesn't sound like the main area on which you should focus.
Amount of bits used:
gender; 1/2 (2 if you want to include intersexuality :))
age; 8 (0-255)
birthmonth; 4 (16)
birthday; 5 (32)
birthdayofweek; 3 (8)
Bits at all: less than 22.
Knowing that it is running on x86 we have the int datatype with 32 bits.
So build your own routines which can accept an int
read(gender, int* pValue);
write(gender, int* pValue);
by using shift and bit mask operators to store and
retrieve the information. For the fields you can
use typesafe enums.
That is very fast and has an extremely low memory footprint.
it depends. Are you running out of memory? Then memory efficiency becomes paramount. Is it taking too long? Then CPU time, or at least perceived user response time, becomes paramount.
It depends. Many processors can directly access words, but require additional instructions to access octets or bits. Depending on the the compiler and processor, int might be the same as the addressable word. But is speed really going to matter? Readability and maintainability is likely to be more important.
In General
Each type has advantages and disadvantages, and specifically there are scenarios where each one will have the highest performance.
Addressable types (byte, char, short, int, and on x86-64 "long int") can all be loaded from memory in a single operation and so they have the least CPU overhead on a per-operation basis.
But, bit fields or flags packed into one or more bits might result in an overall faster program because:
they use the cache more efficiently, and this is a huge win, easily paying for a few extra cpu ops needed to unpack each item
they require fewer I/O operations to read in from disk, and this additional huge win easily pays for more CPU ops, even tho once again the cpu ops must be paid per item
Processor speeds have been advancing faster than disk and network speeds for decades, and now individual CPU ops are rarely a concern, particularly in your C/C++ case. You are already using the fastest code generator in the arsenal.
The in-RAM/not-in-cache scenario you mentioned
As it happens there is a still a cache factor to consider. Because the CPU is so fast, it is likely that execution time will be dominated by DRAM access on cache loads. If this is true, there is still an advantage to packing the data but it is dimished somewhat for a linear scan through the table. As it happens, modern DRAM is far more efficiently read in order, so you can fill an entire cache block in not much more time than is required to randomly read a single address. If execution time is dominated by an in-order traversal of the data structure, this works in your favor and would tend to flatten the performance difference between using addressable units and packed data structures.
Worry about important things
Finally, it's probably obvious but I will say it anyway: the data structure in terms of maps like hashes and trees, and the choice of algorithm typically has much more influence than machine ops tuning, which gives only an essentially linear optimization.
Worrying about memory bloat does matter, and it matters a lot if there is any possibility that your app won't fit in memory. Virtual storage turned out to be really important for protection and OS-kernel-level memory management, but one thing it never managed to do was allow programs to grow bigger than available RAM without bogging everything down.
Int is the fastest.
If you're using it in an array, you'll waste more memory however, so you may want to stick with byte in that case.
What Chris said. If this is a hypothetical program you're designing, trying to pick int versus uint8 at this stage isn't going to help you one bit in the long run. Focus your effort elsewhere.
If, when it comes down to it, you have a giant complex system you've made several rounds of optimizations on, and you're curious what the effect is, switching int to uint8 is probably (should be anyway) pretty easily anyway. At that stage you can make a statistically valid comparison in a real-world use case - not before.
enum Gender { Female, Male };
struct Date {...};
class Person
{
Gender gender;
Date date_of_birth;
//no age field - can be computed from date_of_birth
};
There are many reasons why either version could be faster.
Packing the structure by making each member into a minimal number of bits means less memory traffic and cache usage, which will cause a speedup. It also means more work to extract and access/modify individual fields, which will cost you performance. It will probably mean marginally more code as well, which will take up more of the instruction cache and so cost a bit of performance.
It's really impossible to say for sure which factor will be dominant. In most cases, it will be most efficient to use ints and bytes as needed. Don't bother with bitfields.
But in some cases, that won't be true. So measure, benchmark, profile. Find out how each implementation performs in your code.
In most cases you should stick to the default word size supported by the CPU. In many situations most CPUs will pad or require the compiler to pad values that are smaller than the default word size (so that structures can be word aligned), so any memory usage gain from a smaller data type can be rendered moot.
The main exception to this where you may have a large array, for example loading a file into an array of byte[] is preferable to int[].
In general model the problem cleanly and do the simple obvious thing first, then profile to see if you have memory issues. Looking at the case supplied, it is worth noting a couple of points:
The field age is a de-normalisation of the model and is a function of birthmonth, birthday, birthdayofweek, which could be represented as a single long birthtimestamp. So, just by looking at the model you can trim 8 bytes easily.
4 (bytes per int) * 5 (number of fields) * 50 000 (number of instances) = ~1MB. My laptop has 4MB L2 cache, you're unlikely to have memory issues.
This depends on exactly one thing:
What is the ratio of blindly copying the data and actually reading the data?
If you treat the data mostly as cookies, and seldom have to access an actual record, use bit fields which consume fewer space and are more IO-friendly.
If you access it a lot, and seldom copy it (as a part of container reallocation) or serialize it to disk, use larger integers that are CPU-friendly.
The processors word size is the smallest unit the computer can read/write to/from memory. If you read or write a bit, char, short or int the CPU, northbridge overhead is the same in all cases given reasonable system/compiler behavior. There is obviously a sram cache benefit to smaller field sizes. The one gotcha for x86 is to ensure proper alignment at word sized multiples. Other architectures such as sparc are much less forgiving and will actually crash if memory access is unaligned.
I tend to shy away from data types smaller than the computers word size in performant multi-threaded apps as changes to one field within a word shared with other fields in the same word can have unintended dependancies if access to all fields within the word are not externally synchronized.
struct Person
{
uint8_t gender; // betwwen 0-1
uint8_t age; // between 0-200
uint8_t birthmonth; // between 0-11
uint8_t birthday; // between 1-31
uint8_t birthdayofweek; // between 0-6
}
If you need to access them sequentially, cache optimisation might win. Thus, reducing the data structure to the strict minimum of information may be better.
One might argue that the CPU will have to fetch a 32 bits block in memory anyway, then need to extract the byte thus requiring more time. However that is an arbitrary choice of the CPU and it should not influence your data structures. Imagine if tomorrow CPUs started working with 64 bits blocks, your "optimisation" would immediately become futile because there would be an extraction anyway. However, it's almost certain that memory cache optimisation will always be relevant. This is why I like to privilege making the application as light as it can be, memory-wise, even when concerned with performance.
Furthermore, if you are very concerned with memory you can obviously go below bytes, by packing some of those variables together. But then you'll need to use bit operations, thus additional instructions, and your code could quickly become unreadable / unmaintainable unless you formalize the storage into some sort of generic container / structure.
Thing is, I don't see large fields of Person being evaluated. Therefore, speed is not the issue here.
What IS an issue is consistence. You have a structure with public members that one could easily set to the wrong data, being it out of range or off by an index.
I mean, you are not consistent yourself here, are you?
int birthmonth; // between 0-11
int birthday; // between 1-31
Why are months starting at zero but days are not?
And even if you had this consistent, maybe one guy who uses your structure goes with "months start by zero", one guy with "months start by one". This invites error. Why has age a restriction? Sure, we will not find a 200 years old guy, but why not say max of type?
Rather use things like an enum class, a special type that restricts your range like template<int from, int to> class RangedInt or an already existing type, like Date (especially since such a type can check how many days february has in that year et cetera).
Some people commented on the gender with gender theory stuff here, but in any case, zero is not a gender. You mean male and female, but while we have some intuition when you say that day is in range 0..31, how would we guess if zero is male or female? Shouldn't be governed by comments. (And if you actually want to address gender theory stuff, I'd say that the right type is string.)
Also, in most cases, you would not change anything about a person once it is created. Changes that happen in real life (like getting a different name through marriage) would happen at database level or the like, not in class level. Therefore, I'd made this a class in which you can set only by constructor. Although, of course, this is dependent on the usage obviously. But if you can do it immutable, do it immutable.
Edit: And it's easy to make age inconsistent with the date of birth. That's no good. Remove the redundant data, make it a method or the like. Although the year is missing? Strange design, I must say.
I need to load large models and other structured binary data on an older CD-based game console as efficiently as possible. What's the best way to do it? The data will be exported from a Python application. This is a pretty elaborate hobby project.
Requierements:
no reliance on fully standard compliant STL - i might use uSTL though.
as little overhead as possible. Aim for a solution so good. that it could be used on the original Playstation, and yet as modern and elegant as possible.
no backward/forward compatibility necessary.
no copying of large chunks around - preferably files get loaded into RAM in background, and all large chunks accessed directly from there later.
should not rely on the target having the same endianness and alignment, i.e. a C plugin in Python which dumps its structs to disc would not be a very good idea.
should allow to move the loaded data around, as with individual files 1/3 the RAM size, fragmentation might be an issue. No MMU to abuse.
robustness is a great bonus, as my attention span is very short, i.e. i'd change saving part of the code and forget the loading one or vice versa, so at least a dumb safeguard would be nice.
exchangeability between loaded data and runtime-generated data without runtime overhead and without severe memory management issues would be a nice bonus.
I kind of have a semi-plan of parsing in Python trivial, limited-syntax C headers which would use structs with offsets instead of pointers, and convenience wrapper structs/classes in the main app with getters which would convert offsets to properly typed pointers/references, but i'd like to hear your suggestions.
Clarification: the request is primarily about data loading framework and memory management issues.
On platforms like the Nintendo GameCube and DS, 3D models are usually stored in a very simple custom format:
A brief header, containing a magic number identifying the file, the number of vertices, normals, etc., and optionally a checksum of the data following the header (Adler-32, CRC-16, etc).
A possibly compressed list of 32-bit floating-point 3-tuples for each vector and normal.
A possibly compressed list of edges or faces.
All of the data is in the native endian format of the target platform.
The compression format is often trivial (Huffman), simple (Arithmetic), or standard (gzip). All of these require very little memory or computational power.
You could take formats like that as a cue: it's quite a compact representation.
My suggestion is to use a format most similar to your in-memory data structures, to minimize post-processing and copying. If that means you create the format yourself, so be it. You have extreme needs, so extreme measures are needed.
This is a common game development pattern.
The usual approach is to cook the data in an offline pre-process step. The resulting blobs can be streamed in with minimal overhead. The blobs are platform dependent and should contain the proper alignment & endian-ness of the target platform.
At runtime, you can simply cast a pointer to the in-memory blob file. You can deal with nested structures as well. If you keep a table of contents with offsets to all the pointer values within the blob, you can then fix-up the pointers to point to the proper address. This is similar to how dll loading works.
I've been working on a ruby library, bbq, that I use to cook data for my iphone game.
Here's the memory layout I use for the blob header:
// Memory layout
//
// p begining of file in memory.
// p + 0 : num_pointers
// p + 4 : offset 0
// p + 8 : offset 1
// ...
// p + ((num_pointers - 1) * 4) : offset n-1
// p + (num_pointers * 4) : num_pointers // again so we can figure out
// what memory to free.
// p + ((num_pointers + 1) * 4) : start of cooked data
//
Here's how I load binary blob file and fix up pointers:
void* bbq_load(const char* filename)
{
unsigned char* p;
int size = LoadFileToMemory(filename, &p);
if(size <= 0)
return 0;
// get the start of the pointer table
unsigned int* ptr_table = (unsigned int*)p;
unsigned int num_ptrs = *ptr_table;
ptr_table++;
// get the start of the actual data
// the 2 is to skip past both num_pointer values
unsigned char* base = p + ((num_ptrs + 2) * sizeof(unsigned int));
// fix up the pointers
while ((ptr_table + 1) < (unsigned int*)base)
{
unsigned int* ptr = (unsigned int*)(base + *ptr_table);
*ptr = (unsigned int)((unsigned char*)ptr + *ptr);
ptr_table++;
}
return base;
}
My bbq library isn't quite ready for prime time, but it could give you some ideas on how to write one yourself in python.
Good Luck!
I note that nowhere in your description do you ask for "ease of programming". :-)
Thus, here's what comes to mind for me as a way of creating this:
The data should be in the same on-disk format as it would be in the target's memory, such that it can simply pull blobs from disk into memory with no reformatting it. Depending on how much freedom you want in putting things into memory, the "blobs" could be the whole file, or could be smaller bits within it; I don't understand your data well enough to suggest how to subdivide it but presumably you can. Because we can't rely on the same endianness and alignment on the host, you'll need to be somewhat clever about translating things when writing the files on the host-side, but at least this way you only need the cleverness on one side of the transfer rather than on both.
In order to provide a bit of assurance that the target-side and host-side code matches, you should write this in a form where you provide a single data description and have some generation code that will generate both the target-side C code and the host-side Python code from it. You could even have your generator generate a small random "version" number in the process, and have the host-side code write this into the file header and the target-side check it, and give you an error if they don't match. (The point of using a random value is that the only information bit you care about is whether they match, and you don't want to have to increment it manually.)
Consider storing your data as BLOBs in a SQLite DB. SQLite is extremely portable and lighweight, ANSI C, has both C++ and Python interfaces. This will take care of large files, no fragmentation, variable-length records with fast access, and so on. The rest is just serialization of structs to these BLOBs.