Create vector from CSV at compile time in C++

I'm trying to create a lookup table for my Xilinx Zynq SoC (the ARM Cortex).
I have a CSV file with 1330 entries which I cannot read or parse at runtime; the latest point at which I can do that is compile time. I have read that it is possible to embed a file into an executable so it can be used at runtime.
In other words, I want the parsed CSV data to be available at runtime without the original file actually being on any filesystem, since it's an embedded device. So I would need to somehow embed the CSV file into the executable. How would I achieve something like this?
The CSV file looks like this (full file is here):
0,0,48,112,160,208,272,320,368,....,65440,65488

You asked for a vector, but I'm not sure why you'd necessarily want one. The data will unavoidably occupy space in the application's read-only section (".text", ".rodata", or something like that). While you can convert it into a vector if necessary (which will consume heap space and require runtime construction and initialization from the data in that read-only section), you might as well just use it as a POD array, since I doubt you'll be changing the data at runtime. So to create a const POD array of the data you could just do something like this:
const int myArray[] =
{
#include "myCsvFile.csv"
};
If the number of elements is not fixed, your program can determine the number with sizeof(myArray)/sizeof(myArray[0]). Even if the size is fixed, this technique is probably best anyway. And of course, if all of your entries are unsigned and fit within 16 bits (a cursory examination suggests they do), you can define the array as unsigned short or uint16_t instead of int to save space.
I should also mention that the const keyword is important here. Without it, your array will occupy twice as much memory: first, it will occupy space in the read-only section (.text or .rodata or whatever), and during application initialization the runtime will make a read/writable copy of that read-only data in the read/write data section (probably .data), where myArray is allocated. To avoid that, define it as const; then the address of myArray will be in the read-only section, and the data won't be copied to the read/write data section.
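Putting those pieces together, here is a minimal sketch (using the myCsvFile.csv name from the snippet above, and uint16_t on the assumption that all entries fit in 16 bits):

#include <cstdint>
#include <cstddef>

// The preprocessor pastes the comma-separated values from the CSV file
// between the braces, so the table is built entirely at compile time.
const std::uint16_t myArray[] =
{
#include "myCsvFile.csv"
};

// Number of entries, derived from the size of the array itself.
const std::size_t myArrayCount = sizeof(myArray) / sizeof(myArray[0]);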

As your data is a plain array of unsigned integers, you can use the preprocessor.
Assuming your CSV data is in a file named data.csv, then in your .cpp file you can simply do the following:
const unsigned int k_Data[] = {
#include "data.csv" // << Preprocessor will replace this line with the contents of data.csv
};
#include <iostream>

int main()
{
    std::cout << k_Data[3];
}
Output:
112

For the specific type of CSV data you have, e.g.
0,0,48,112,160,208,272,320,368,432,480,512,576,640,704,752,800,848,896,......
which is basically just a bunch of numbers, you should be able to include them using an actual #include directive, like so:
const unsigned short myCSV[] ={
#include "./theCSV.data"
};
I'm using unsigned short since the largest number looks like it fits within 64K, and it would save you some space -- but you may want to use int instead if you believe the numbers can be larger than 64K.

Related

C++ Tweetnacl hash a file without read whole file to memory

I'm using TweetNaCl to generate SHA-512 hashes of strings and files. For strings it works quite well, but I have no idea how to do it with files.
The signature of the function is:
extern "C" int crypto_hash(u8 *out, const u8 *m, u64 n);
where u8 is unsigned char and u64 is unsigned long long.
For strings I can use it like this:
string s("Hello");
unsigned char h[64];
crypto_hash(h, (unsigned char *)s.c_str(), s.size());
This works great for strings and small files, but if I want to create a hash for a big file it is not viable and uses too much memory. I'm searching for a solution that reads the file byte by byte and passes it as an unsigned char pointer to that function. Does anyone have an idea how to achieve that?
P.S. Sorry for the poor English.
P.P.S. I use TweetNaCl because of its small size, and I only need the hashing function.
Probably the easiest way is to use a memory-mapped file. This lets you open a file and map it into virtual memory, then you can treat the file on disk as if it is in memory, and the OS will load pages as required.
So in your case, open your file and use mmap() to map it into memory. Then you can pass the pointer into your crypto_hash() function and let the OS do the work.
Note that there are caveats concerning how large the file is relative to the available virtual address space.
For various platforms:
Boost Interprocess
macOS and mmap
Linux and mmap
Windows .NET MemoryMappedFile
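A minimal POSIX sketch of that approach (untested, and assuming crypto_hash has exactly the signature shown in the question) might look like this:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

extern "C" int crypto_hash(unsigned char *out, const unsigned char *m, unsigned long long n);

int hash_file(const char *path, unsigned char out[64])
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return -1; }
    size_t len = static_cast<size_t>(st.st_size);

    // Map the whole file read-only; the OS pages it in on demand.
    void *data = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (data == MAP_FAILED) return -1;

    int rc = crypto_hash(out, static_cast<const unsigned char *>(data), len);
    munmap(data, len);
    return rc;
}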
I'd suggest you use a different implementation, one which you can feed incrementally, in chunks.
This one, for example. As the license is BSD and the code is C with no dependencies, you can copy/paste only the three functions that you need without bringing a whole library (no matter how small) into your project.
The life-cycle goes like this:
sha256_init(&ctx)
repeatedly read blocks from file and feed them into sha256_update(&ctx, buff, buffLen)
when EOF, get your digest using sha256_final(&ctx, digestHere)
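As a rough sketch of that life-cycle (assuming the typical sha256.h header of such single-file implementations, with a SHA256_CTX context type; the exact names may differ in the library you pick):

#include <cstdio>
#include "sha256.h"   // assumed header of the single-file SHA-256 implementation

int hash_file_sha256(const char *path, unsigned char digest[32])
{
    SHA256_CTX ctx;
    sha256_init(&ctx);

    FILE *fp = std::fopen(path, "rb");
    if (!fp) return -1;

    unsigned char buff[4096];
    size_t n;
    // Read the file in fixed-size blocks and feed each one to the hash.
    while ((n = std::fread(buff, 1, sizeof buff, fp)) > 0)
        sha256_update(&ctx, buff, n);

    std::fclose(fp);
    sha256_final(&ctx, digest);
    return 0;
}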

Compressing/decompressing data on the go

I'm running a physics simulation and I'd like to improve the way it handles its data. I'm saving and reading files that contain one float, then two ints, followed by 512*512 = 262144 values of +1 or -1, which comes to 595 kB per data file. All these numbers are separated by a single space.
I'm saving hundreds of thousands of these files, so it quickly adds up to gigabytes of storage. I'd like to know if there is a quick (hopefully light, CPU-effort-wise) way of compressing and decompressing this kind of data on the go (I mean not tarring/untarring before/after use).
How much could I expect to save in the end?
If you want relatively fast read/write, you would probably want to store and read them in "binary" format, i.e. native to the way they are internally stored in bytes. A float uses 4 bytes of data, and you do not need any kind of "separator" when storing a large sequence of them.
To do this you might consider Boost's Serialization library.
Note that using data compression methods (zlib etc) will save you on bytes stored but will be relatively slow to zip and unzip them for usage.
Storing in binary format will not only use less disk storage (than storing in text format) but should also be more performant, not just because there is less file I/O but also because there is no string writing/parsing going on.
Note that when you input/output to binary_iarchive or binary_oarchive you pass in an underlying istream or ostream and if this is a file, you need to open it with ios::binary flag because of the issue of line-endings being potentially converted.
Even if you do decide that data-compression (zlib or some other library) is the way to go, it is still worth using boost::serialize to get your data into a "blob" to compress. In that case you would probably use std::ostringstream as your output stream to create the blob.
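For illustration, a minimal sketch of that idea with Boost.Serialization (the struct and field names here are just placeholders for the data described in the question):

#include <boost/archive/binary_oarchive.hpp>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/serialization/vector.hpp>
#include <fstream>
#include <vector>

struct Frame {
    float t;
    int a, b;
    std::vector<signed char> weights;   // the 512*512 values of +1 / -1

    template <class Archive>
    void serialize(Archive &ar, const unsigned int /*version*/) {
        ar & t;
        ar & a;
        ar & b;
        ar & weights;
    }
};

void save_frame(const Frame &f, const char *path) {
    // ios::binary matters, otherwise line endings may be translated
    std::ofstream os(path, std::ios::binary);
    boost::archive::binary_oarchive oa(os);
    oa << f;
}

void load_frame(Frame &f, const char *path) {
    std::ifstream is(path, std::ios::binary);
    boost::archive::binary_iarchive ia(is);
    ia >> f;
}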
Incidentally, if you have 2^18 "boolean" values that can only be 1 or -1, you only need 1 bit for each one (they would be physically stored as 1 or 0, but you would logically translate that). That would come to 2^15 bytes, which is 32K, not 595K.
Given the extra info about the valid data, define your class like this:-
class Data
{
    float m_float_value;
    int m_int_value_1, m_int_value_2;
    unsigned m_weights [8192];   // 512*512 one-bit weights packed into 32-bit words
};
Then use binary file IO to stream this class to and from a file, don't convert to text!
The weights are stored as Boolean values, packed into unsigned integers.
To get the weight, add an accessor:-
int Data::GetWeight (size_t index)
{
    return m_weights [index >> 5] & (1 << (index & 31)) ? 1 : -1;
}
This gives you a data file size of 32780 bytes (5.4%) if there's no packing in the class data.
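If you also need to modify weights in memory (the question only reads and writes whole files, so this is optional), a matching setter under the same packing scheme might look like this sketch:

void Data::SetWeight (size_t index, int weight)
{
    if (weight > 0)
        m_weights [index >> 5] |=  1u << (index & 31);    // +1 -> set the bit
    else
        m_weights [index >> 5] &= ~(1u << (index & 31));  // -1 -> clear the bit
}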
I would suggest that if you are concerned about size a binary format would be the most useful way to "compress" your data. It sounds like you are dealing with something like the following:
#include <fstream>

struct data {
    float a;
    int b, c;
    signed char d[512][512];
};

void someFunc() {
    data* someData = new data;
    std::ifstream inFile("inputData.bin", std::ifstream::binary);
    std::ofstream outFile("outputData.bin", std::ofstream::binary);
    // Read from file (raw bytes straight into the struct)
    inFile.read(reinterpret_cast<char*>(someData), sizeof(data));
    inFile.close();
    // Write to file
    outFile.write(reinterpret_cast<const char*>(someData), sizeof(data));
    outFile.close();
    delete someData;
}
I should also mention that if you encode your +1/-1 values as bits you should get a lot of space savings (another factor of 8 on top of what I'm showing here).
For that amount of data, anything homemade isn't going to perform anywhere near as well as good-quality open-source binary-storage libraries. Try Boost serialize or, for this type of storage requirement, HDF5. I've used HDF5 successfully on a few projects with very large amounts of double, float, long and int data. I found it useful that you can control the compression rate vs. CPU effort on the fly per "file". Also useful is storing millions of "files" in a hierarchically structured single "disk" file. NASA (probably ripping my style ;)) also uses it.
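As a rough sketch of what writing the 512x512 weight array of one record with the HDF5 C API might look like (file and dataset names are made up, and the deflate level is just one example of the compression/CPU trade-off mentioned above):

#include <hdf5.h>

void write_weights(const signed char weights[512][512])
{
    hid_t file = H5Fcreate("frame.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[2]  = {512, 512};
    hsize_t chunk[2] = {512, 512};
    hid_t space = H5Screate_simple(2, dims, NULL);

    // Chunking is required for compression; the deflate level trades size for CPU.
    hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(plist, 2, chunk);
    H5Pset_deflate(plist, 6);

    hid_t dset = H5Dcreate2(file, "/weights", H5T_NATIVE_SCHAR, space,
                            H5P_DEFAULT, plist, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_SCHAR, H5S_ALL, H5S_ALL, H5P_DEFAULT, weights);

    H5Dclose(dset);
    H5Pclose(plist);
    H5Sclose(space);
    H5Fclose(file);
}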

write data structure into a file using binary mode

My code looks like this:
#include <cstdio>
#include <string>
using std::string;

struct Dog {
    string name;
    unsigned int age;
};

int main()
{
    Dog d = {.name = "Lion", .age = 3};
    FILE *fp = fopen("dog.txt", "wb");
    fwrite(&d, sizeof(d), 1, fp); // write d into dog.txt
    fclose(fp);
}
My problem is: what's the point of writing a data object or structure into a binary file? I assume it is for making the data generated by a running program persistent, right? If yes, then how can I get the data back? Using fread?
This makes me think of database-like stuff; do databases write data to disk the same way?
You can do it, but you will have a lot of issues to take care of:
structure types: all your data really needs to be inside the struct, otherwise you end up writing a pointer to some other place.
structure changes: if you ever need to change your structure, you will need to write a converter that reads the old struct and writes the new one.
language interoperability: it will be hard to access the data from another language.
It was a common practice in the early days, before relational databases became popular. You can make index files pointing to a record number.
However, nowadays I would advise you to use serialization and write strings instead of raw binary.
NOTE:
If string is something like char[40], your code may survive... but if your question is about C++ and string is std::string, this approach breaks badly: the string object's characters are not stored inside your struct but on the heap.
Writing data in binary is extremely useful and much faster than reading/writing text. Take video games, for instance (although not every video game does this): when the game is saved, all of the necessary structures/classes and other data are written into a save file in binary.
It is just one use for binary, but the major reason for doing this is speed.
To read the data back, you need to know the format you saved it in. As a simple example, if I saved an integer, a char array of n size, and a boolean, I would need to read the binary file back as an integer, a char array of n size, and a boolean. Otherwise the data is read improperly and will not be very useful at all.
Be careful. The type of the field 'name' in your structure is 'string'. This class holds dynamically allocated data, so writing the 'string' to a file this way will only write its internal pointers, not the character data itself.
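One common way around that (a sketch, with the field sizes chosen arbitrarily) is to keep the record as plain data, e.g. a fixed-size char array instead of std::string, so the whole struct can round-trip through fwrite and fread:

#include <cstdio>
#include <cstring>
#include <cstdint>

struct DogRecord {
    char name[40];       // fixed-size buffer instead of std::string
    std::uint32_t age;   // fixed-width integer for a predictable layout
};

int main()
{
    DogRecord d = {};
    std::strncpy(d.name, "Lion", sizeof(d.name) - 1);
    d.age = 3;

    // Write the record.
    FILE *fp = std::fopen("dog.bin", "wb");
    std::fwrite(&d, sizeof d, 1, fp);
    std::fclose(fp);

    // Read it back with fread into a fresh struct.
    DogRecord e = {};
    fp = std::fopen("dog.bin", "rb");
    std::fread(&e, sizeof e, 1, fp);
    std::fclose(fp);

    std::printf("%s is %u\n", e.name, e.age);
}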
The C++ Middleware Writer supports binary serialization to/from files.
From a marshalling perspective the "unsigned int age" member of your struct is a potential problem. I'd consider changing the type to uint32_t.

A good way to output array values from Python and then take them in through C++?

Due to annoying overflow problems with C++, I want to use Python to precompute some values instead. I have a function f(a,b) that spits out a value. I want to output all the values I need, based on ranges of a and b, into a file, and then read that file in C++ and populate a vector or array or whatever's better.
What is a good format to output f(a,b) in?
What's the best way to read this back into C++?
Vector or multidimensional array?
You can use Python to write out a .h file that is compatible with C++ source syntax.
h_file.write('{')
for a in range(a_size):
    h_file.write('{' + ','.join(str(f(a, b)) for b in range(b_size)) + '},\n')
h_file.write('}')
You will probably want to modify that code to throw some extra newlines in, and in fact I have such code that I can show later (don't have access to it now).
You can use Python to write out C++ source code that contains your data. E.g:
def f(a, b):
    # Your function here, e.g:
    return pow(a, b, 65537)

num_a_values = 50
num_b_values = 50

# Write source file
with open('data.cpp', 'wt') as cpp_file:
    cpp_file.write('/* Automatically generated file, do not hand edit */\n\n')
    cpp_file.write('#include "data.hpp"\n')
    cpp_file.write('const int f_data[%d][%d] =\n'
                   % (num_a_values, num_b_values))
    cpp_file.write('{\n')
    for a in range(num_a_values):
        values = [f(a, b) for b in range(num_b_values)]
        cpp_file.write(' {' + ','.join(map(str, values)) + '},\n')
    cpp_file.write('};\n')

# Write corresponding header file
with open('data.hpp', 'wt') as hpp_file:
    hpp_file.write('/* Automatically generated file, do not hand edit */\n\n')
    hpp_file.write('#ifndef DATA_HPP_INCLUDED\n')
    hpp_file.write('#define DATA_HPP_INCLUDED\n')
    hpp_file.write('#define NUM_A_VALUES %d\n' % num_a_values)
    hpp_file.write('#define NUM_B_VALUES %d\n' % num_b_values)
    hpp_file.write('extern const int f_data[%d][%d];\n'
                   % (num_a_values, num_b_values))
    hpp_file.write('#endif\n')
You then compile the generated source code as part of your project. You can then use it by #including the header and accessing the f_data[] array directly.
This works really well for small to medium size data tables, e.g. icons. For larger data tables (millions of entries) some C compilers will fail, and you may find that the compile/link is unacceptably slow.
If your data is more complicated, you can use this same method to define structures.
[Based on Mark Ransom's answer, but with some style differences and more explanation].
If there is megabytes of data, then I would read the data in by memory mapping the data file, read-only. I would arrange things so I can use the data file directly, without having to read it all in at startup.
The reason for doing it this way is that you don't want to read megabytes of data at startup if you're only going to use some of the values. By using memory mapping, your OS will automatically read just the parts of the file that you need. And if you run low on RAM, your OS can reuse the memory allocated for that file without having to waste time writing it to the swap file.
If the output of your function is a single number, you probably just want an array of ints. You'll probably want a 2D array, e.g.:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

/* Size of the mapping in bytes: 50 * 25 ints (adjust to your table). */
#define DATA_SIZE (50 * 25 * sizeof(int))
typedef const int (*data_table_type)[50];

int fd = open("my_data_file.dat", O_RDONLY);
data_table_type data_table = (data_table_type)mmap(0, DATA_SIZE,
                                                   PROT_READ, MAP_SHARED, fd, 0);
printf("f(5, 11) = %d\n", data_table[5][11]);
For more info on memory mapped files, see Wikipedia, or the UNIX mmap() function, or the Windows CreateFileMapping() function.
If you need more complicated data structures, you can put C/C++ structures and arrays into the file. But you can't embed pointers or any C++ class that has a virtual anything.
Once you've decided on how you want to read the data, the next question is how to generate it. struct.pack() is very useful for this - it will allow you to convert Python values into a properly formatted byte string, which you can then write to a file opened in binary mode.

Reading Superblock into a C Structure

I have a disk image containing a simple filesystem which I am accessing using FUSE. The superblock has the following layout, and I have a function read_superblock(*buf) that returns the raw data:
Bytes 0-3: Magic Number (0xC0000112)
4-7: Block Size (1024)
8-11: Total file system size (in blocks)
12-15: FAT length (in blocks)
16-19: Root Directory (block number)
20-1023: NOT USED
I am very new to C, and to get me started on this project I am curious what a simple way would be to read this into a structure or some variables and simply print them out to the screen using printf for debugging.
I was initially thinking of doing something like the following, thinking I could see the raw data, but I think this is not the case. There is also no structure for me to grab data out of, and I am trying to read it in as a string, which also seems terribly wrong. Is there a way for me to specify the structure and define the number of bytes in each variable?
char *buf;
read_superblock(*buf);
printf("%s", buf);
Yes, I think you'd be better off reading this into a structure. The fields containing useful data are all 32-bit integers, so you could define a structure that looks like this (using the types defined in the standard header file stdint.h):
typedef struct SuperBlock_Struct {
    uint32_t magic_number;
    uint32_t block_size;
    uint32_t fs_size;
    uint32_t fat_length;
    uint32_t root_dir;
} SuperBlock_t;
You can pass the address of the structure, cast to char*, when calling read_superblock, like this:
SuperBlock_t sb;
read_superblock((char*) &sb);
Now to print out your data, you can make a call like the following:
printf("%d %d %d %d\n",
sb.magic_number,
sb.block_size,
sb.fs_size,
sb.fat_length,
sb.root_dir);
Note that you need to be aware of your platform's endianness when using a technique like this, since you're reading integer data (i.e., you may need to swap bytes when reading your data). You should be able to determine that quickly using the magic number in the first field.
Note that it's usually preferable to pass a structure like this without casting it; this allows you to take advantage of the compiler's type-checking and eliminates potential problems that casting may hide. However, that would entail changing your implementation of read_superblock to read data directly into a structure. This is not difficult and can be done using the standard C runtime function fread (assuming your data is in a file, as hinted at in your question), like so:
fread(&sb.magic_number, sizeof(sb.magic_number), 1, fp);
fread(&sb.block_size, sizeof(sb.block_size), 1, fp);
...
Two things to add here:
It's a good idea, when pulling raw data into a struct, to give the struct no padding, even if it's entirely composed of 32-bit unsigned integers. In gcc you do this with #pragma pack(1) before the struct definition and #pragma pack() after it.
For dealing with potential endianness issues, two calls to look at are ntohs() and ntohl(), for 16- and 32-bit values respectively. Note that these swap from network byte order to host byte order; if these are the same (which they aren't on x86-based platforms), they do nothing. You go from host to network byte order with htons() and htonl(). However, since this data is coming from your filesystem and not the network, I don't know if endianness is an issue. It should be easy enough to figure out by comparing the values you expect (e.g. the block size) with the values you get, in hex.
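For illustration, a sketch combining both points (a packed layout plus byte-order fix-ups; the struct name is illustrative, and the swap assumes the on-disk data was stored in big-endian/network order):

#include <stdint.h>
#include <arpa/inet.h>   /* ntohl() */

#pragma pack(1)
struct SuperBlockRaw {
    uint32_t magic_number;
    uint32_t block_size;
    uint32_t fs_size;
    uint32_t fat_length;
    uint32_t root_dir;
};
#pragma pack()

void fix_endianness(struct SuperBlockRaw *sb)
{
    /* If the magic number already matches the documented value, the data is
     * already in host byte order; otherwise swap each 32-bit field. */
    if (sb->magic_number == 0xC0000112)
        return;
    sb->magic_number = ntohl(sb->magic_number);
    sb->block_size   = ntohl(sb->block_size);
    sb->fs_size      = ntohl(sb->fs_size);
    sb->fat_length   = ntohl(sb->fat_length);
    sb->root_dir     = ntohl(sb->root_dir);
}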
It's not difficult to print the data once you have successfully copied it into the structure Emerick proposed. Suppose the instance of the structure you use to hold the data is named SuperBlock_t_Instance.
Then you can print its fields like this:
printf("Magic Number:\t%u\nBlock Size:\t%u\n etc",
SuperBlock_t_Instance.magic_number,
SuperBlock_t_Instance.block_size);