C++ TweetNaCl: hash a file without reading the whole file into memory

I'm using TweetNaCl to generate SHA-512 hashes of strings and files. For strings it works quite well, but I have no idea how to do it with files.
The signature of the function is
extern "C" int crypto_hash(u8 *out, const u8 *m, u64 n);
where u8 is unsigned char and u64 is unsigned long long.
For strings I can use it like this:
string s("Hello");
unsigned char h[64];
crypto_hash(h, (unsigned char *)s.c_str(), s.size());
This works great for strings and small files, but if I want to create a hash for a big file it is not viable and uses too much memory. I'm looking for a solution to read the file byte by byte and pass it as an unsigned char pointer to that function. Does anyone have an idea how to achieve that?
P.S. Sorry for the poor English.
P.P.S. I use TweetNaCl because of its small size, and I only need the hashing function.

Probably the easiest way is to use a memory-mapped file. This lets you open a file and map it into virtual memory, then you can treat the file on disk as if it is in memory, and the OS will load pages as required.
So in your case, open your file and use mmap() to map it into memory. Then you can pass the pointer into your crypto_hash() function and let the OS do the work.
Note that there are caveats regarding how large the file is relative to the available virtual address space.
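A minimal POSIX sketch of that approach (Linux/macOS), assuming TweetNaCl's crypto_hash is linked in; error handling is kept short, and an empty file would need a separate case since mmap of length 0 fails:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

extern "C" int crypto_hash(unsigned char *out, const unsigned char *m,
                           unsigned long long n);

int hash_file(const char *path, unsigned char out[64])
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }

    void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       // the mapping stays valid after close
    if (p == MAP_FAILED) return -1;

    int rc = crypto_hash(out, static_cast<const unsigned char *>(p),
                         static_cast<unsigned long long>(st.st_size));
    munmap(p, st.st_size);
    return rc;
}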
For various platforms:
Boost Interprocess
macOS and mmap
Linux and mmap
Windows .NET MemoryMappedFile

I'd suggest using a different implementation, one that you can feed incrementally in chunks.
This one, for example. As the license is BSD and the code is C with no dependencies, you can copy/paste only the three functions that you need without bringing a whole library (no matter how small) into your project.
The life cycle goes like this (a sketch follows the list):
sha256_init(&ctx)
repeatedly read blocks from file and feed them into sha256_update(&ctx, buff, buffLen)
when EOF, get your digest using sha256_final(&ctx, digestHere)
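A rough sketch of that loop, assuming the sha256_init/sha256_update/sha256_final API named above (the context type name and the 32-byte digest size are assumptions that may differ in the actual implementation):
#include <cstdio>

void hash_file(const char *path, unsigned char digest[32])
{
    SHA256_CTX ctx;                        // context type name is an assumption
    sha256_init(&ctx);

    std::FILE *f = std::fopen(path, "rb");
    if (!f) return;

    unsigned char buff[4096];
    std::size_t n;
    while ((n = std::fread(buff, 1, sizeof buff, f)) > 0)
        sha256_update(&ctx, buff, n);      // feed each block as it is read

    std::fclose(f);
    sha256_final(&ctx, digest);            // digest now holds the finished hash
}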


How to create a new binary file and fill it with a constant value?

I'm using cstdio functions to create an empty binary file and need it to be initialized to a specific byte value (can be zero, but not necessarily).
FILE* file = std::fopen("path/to/file", "wb+");
Is there a way to fill the whole file with a value, or is creating and filling a buffer and then using std::fwrite to continuously fill the file my only option? Something like
std::ffill(byteValue, sizeof(byteValue), fileSize, file);
It would be okay to have platform specific solutions (I'm targeting windows and linux).
Using C++ iostreams, it's pretty trivial:
std::ofstream out("path/to/file", std::ios::binary);
char byteValue = '\0'; // or whatever
std::fill_n(std::ostreambuf_iterator<char>(out), fileSize, byteValue);
If fileSize is really large, however, you may prefer to use std::ofstream::write instead--it can be substantially faster.
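A rough sketch of that variant, reusing the fileSize and byteValue names from above; the 64 KiB buffer size is an arbitrary choice:
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <vector>

void fill_with_write(const char *path, std::size_t fileSize, char byteValue)
{
    std::ofstream out(path, std::ios::binary);
    std::vector<char> buffer(1 << 16, byteValue);       // 64 KiB of the fill byte

    std::size_t remaining = fileSize;
    while (remaining > 0) {
        std::size_t chunk = std::min(remaining, buffer.size());
        out.write(buffer.data(), static_cast<std::streamsize>(chunk));
        remaining -= chunk;
    }
}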
One option would be to mmap the file into memory and use memset on that memory. But filling a buffer and writing it to the file multiple times would be an easier and platform-independent solution; mmap and CreateFileMapping are platform-specific.
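For reference, a POSIX-only sketch of the mmap/memset idea; the file has to be extended to its final size with ftruncate before it can be mapped:
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>

bool fill_with_mmap(const char *path, std::size_t fileSize, char byteValue)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    if (ftruncate(fd, static_cast<off_t>(fileSize)) != 0) { close(fd); return false; }

    void *p = mmap(nullptr, fileSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return false; }

    std::memset(p, byteValue, fileSize);   // fill the mapped pages
    munmap(p, fileSize);                   // changes are written back to the file
    close(fd);
    return true;
}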

Create vector from CSV at compile time in C++

I'm trying to create a lookup table for my Xilinx Zynq SoC (the ARM Cortex).
I have a CSV file with 1330 entries which I cannot read or parse at runtime; the latest I can do that is at compile time. I have read that it is possible to embed a file into the executable so it can be used at runtime.
In other words, I want the parsed CSV data available at runtime without the original file actually being on any filesystem, since it's an embedded device. So I would need to somehow embed the CSV file into the executable. How would I achieve something like this?
The CSV file looks like this (full file is here):
0,0,48,112,160,208,272,320,368,....,65440,65488
You asked for a vector but I'm not sure why you'd necessarily want that. The data will unavoidably occupy space in the application's read-only (".text" or ".rodata" or something like that) section, and while you can convert it into a vector if necessary (which will consume heap space and require runtime construction and initialization from the data in the read-only .text/.rodata section), you might as well just use it as a POD array since I doubt you'll be changing the data at runtime. So to create a const POD array of the data you could just do something like this....
const int myArray[] =
{
#include "myCsvFile.csv"
};
If the number of elements is not fixed, your program can determine the number with sizeof(myArray)/sizeof(myArray[0]). Even if it is a fixed size, this technique is probably best anyway. And of course if all of your entries are unsigned and can fit within 16 bits (a cursory examination suggested so), instead of defining it as an array of int, you can define it as an array of unsigned short or uint16_t to save space.
I should also mention that the const keyword is important here: Without that, your array will occupy twice as much memory: first, it will occupy space in the read-only section (.text or .rodata or whatever), and during application initialization, the runtime will make a read/writable copy of the read-only data in the read/write data section (.data probably), where myArray is allocated. To avoid that, define it as const and then the address of myArray will be in the read-only section, and it won't be copied to the read/write data section.
As your data is a plain array of unsigned integers, you can use the preprocessor.
Assuming your CSV data is in the file data.csv,
then in your .cpp file you can simply do the following:
const unsigned int k_Data[] = {
#include "data.csv" // << Preprocessor will replace this line with the contents of data.csv
};
#include <iostream>
int main()
{
std::cout << k_Data[3];
}
Output: 112
For the specific type of CSV data you have, e.g.
0,0,48,112,160,208,272,320,368,432,480,512,576,640,704,752,800,848,896,......
which is basically just a bunch of numbers, you should be able to include them using an actual #include statement, like so:
const unsigned short myCSV[] ={
#include "./theCSV.data"
};
I'm using unsigned short since the largest number looks like it stays under 64K (65,535), which saves you some space -- but you may want to use int instead if you believe the numbers can be larger than that.

Check if there is sufficient disk space to save a file; reserve it

I'm writing a C++ program which will be printing out large (2-4GB) files.
I'd like to make sure that there's sufficient space on the drive to save the files before I start writing them. If possible, I'd like to reserve this space.
This is taking place on a Linux-based system.
Any thoughts on a good way to do this?
Take a look at posix_fallocate():
NAME
posix_fallocate - allocate file space
SYNOPSIS
int posix_fallocate(int fd, off_t offset, off_t len);
DESCRIPTION
The function posix_fallocate() ensures that disk space is allocated for
the file referred to by the descriptor fd for the bytes in the range
starting at offset and continuing for len bytes. After a successful
call to posix_fallocate(), subsequent writes to bytes in the specified
range are guaranteed not to fail because of lack of disk space.
Edit: In the comments you indicate that you use C++ streams to write to the file. As far as I know, there's no standard way to get the file descriptor (fd) from a std::fstream.
With this in mind, I would make disk space pre-allocation a separate step in the process. It would:
open() the file;
use posix_fallocate();
close() the file.
This can be turned into a short function to be called before you even open the fstream.
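A short sketch of that helper on Linux; the function name is just for illustration:
#include <fcntl.h>
#include <unistd.h>

bool preallocate(const char *path, off_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return false;
    int err = posix_fallocate(fd, 0, size);   // 0 on success, an error number otherwise
    close(fd);
    return err == 0;
}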
Use aix's answer (posix_fallocate()), but since you're using C++ streams, you'll need a bit of a hack to get the stream's file descriptor.
For that, use the code here: http://www.ginac.de/~kreckel/fileno/.
If you are using C++17, you can do it with std::filesystem::resize_file,
as shown in this post.
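A minimal sketch of that C++17 approach; note that the file must exist before resize_file is called, and on some filesystems this may produce a sparse file rather than actually reserving the blocks:
#include <filesystem>
#include <fstream>

int main()
{
    const char *path = "path/to/file";
    { std::ofstream create(path, std::ios::binary); }               // create the file if needed
    std::filesystem::resize_file(path, 2ull * 1024 * 1024 * 1024);  // then grow it to 2 GB
}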

Marshal multiple protobufs to a file

Background:
I'm using Google's protobuf, and I would like to read/write several gigabytes of protobuf marshalled data to a file using C++. As it's recommended to keep the size of each protobuf object under 1MB, I figured a binary stream (illustrated below) written to a file would work. Each offset contains the number of bytes to the next offset until the end of the file is reached. This way, each protobuf can stay under 1MB, and I can glob them together to my heart's content.
[int32 offset]
[protobuf blob 1]
[int32 offset]
[protobuf blob 2]
...
[eof]
I have an implementation that works on GitHub:
src/glob.hpp
src/glob.cpp
test/readglob.cpp
test/writeglob.cpp
But I feel I have written some poor code, and would appreciate some advice on how to improve it. Thus,
Questions:
I'm using reinterpret_cast<char*> to read/write the 32 bit integers to and from the binary fstream. Since I'm using protobuf, I'm making the assumption that all machines are little-endian. I also assert that an int is indeed 4 bytes. Is there a better way to read/write a 32 bit integer to a binary fstream given these two limiting assumptions?
In reading from fstream, I create a temporary fixed-length char buffer, so that I can then pass this fixed-length buffer to the protobuf library to decode using ParseFromArray, as ParseFromIstream will consume the entire stream. I'd really prefer just to tell the library to read at most the next N bytes from fstream, but there doesn't seem to be that functionality in protobuf. What would be the most idiomatic way to pass a function at most N bytes of an fstream? Or is my design sufficiently upside down that I should consider a different approach entirely?
Edit:
#codymanix: I'm casting to char since istream::read requires a char array if I'm not mistaken. I'm also not using the extraction operator >> since I read it was poor form to use with binary streams. Or is this last piece of advice bogus?
#Martin York: Removed new/delete in favor of std::vector<char>. glob.cpp is now updated. Thanks!
Don't use new []/delete[].
Instead use a std::vector, as deallocation is guaranteed in the event of exceptions.
Don't assume that reading will return all the bytes you requested.
Check with gcount() to make sure that you got what you asked for.
Rather than having Glob implement the code for both input and output depending on a switch in the constructor, implement two specialized classes like ifstream/ofstream. This will simplify both the interface and the usage.
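Putting those points together, a rough sketch of reading one length-prefixed record; MyMessage stands in for whatever generated protobuf type is actually used, and the little-endian/4-byte-int assumptions from the question still apply:
#include <cstdint>
#include <istream>
#include <vector>

bool read_record(std::istream &in, MyMessage &msg)      // MyMessage is hypothetical
{
    std::int32_t len = 0;
    in.read(reinterpret_cast<char *>(&len), sizeof len);
    if (in.gcount() != static_cast<std::streamsize>(sizeof len))   // check, don't assume
        return false;

    std::vector<char> buf(static_cast<std::size_t>(len));
    in.read(buf.data(), len);
    if (in.gcount() != len)
        return false;

    return msg.ParseFromArray(buf.data(), static_cast<int>(len));
}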

Reading Superblock into a C Structure

I have a disk image which contains a standard image using FUSE. The superblock is laid out as follows, and I have a function read_superblock(*buf) that returns the raw data:
Bytes 0-3: Magic Number (0xC0000112)
4-7: Block Size (1024)
8-11: Total file system size (in blocks)
12-15: FAT length (in blocks)
16-19: Root Directory (block number)
20-1023: NOT USED
I am very new to C, and to get started on this project I am curious what a simple way would be to read this into a structure or some variables and print them out using printf for debugging.
I was initially thinking of doing something like the following, thinking I could see the raw data, but I think this is not the case. There is also no structure for me to grab data out of, and reading it in as a string seems terribly wrong. Is there a way for me to specify the structure and define the number of bytes in each variable?
char *buf;
read_superblock(*buf);
printf("%s", buf);
Yes, I think you'd be better off reading this into a structure. The fields containing useful data are all 32-bit integers, so you could define a structure that looks like this (using the types defined in the standard header file stdint.h):
typedef struct SuperBlock_Struct {
uint32_t magic_number;
uint32_t block_size;
uint32_t fs_size;
uint32_t fat_length;
uint32_t root_dir;
} SuperBlock_t;
You can cast the structure to a char* when calling read_superblock, like this:
SuperBlock_t sb;
read_superblock((char*) &sb);
Now to print out your data, you can make a call like the following:
printf("%d %d %d %d\n",
sb.magic_number,
sb.block_size,
sb.fs_size,
sb.fat_length,
sb.root_dir);
Note that you need to be aware of your platform's endianness when using a technique like this, since you're reading integer data (i.e., you may need to swap bytes when reading your data). You should be able to determine that quickly using the magic number in the first field.
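For example, a quick check along those lines using the magic number from the layout above (the byte-swap helper here is a GCC/Clang builtin, used only for illustration):
if (sb.magic_number == 0xC0000112u) {
    /* data is already in host byte order */
} else if (__builtin_bswap32(sb.magic_number) == 0xC0000112u) {
    /* opposite byte order: byte-swap every field before using it */
}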
Note that it's usually preferable to pass a structure like this without casting it; this allows you to take advantage of the compiler's type-checking and eliminates potential problems that casting may hide. However, that would entail changing your implementation of read_superblock to read data directly into a structure. This is not difficult and can be done using the standard C runtime function fread (assuming your data is in a file, as hinted at in your question), like so:
fread(&sb.magic_number, sizeof(sb.magic_number), 1, fp);
fread(&sb.block_size, sizeof(sb.block_size), 1, fp);
...
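A rough sketch of a read_superblock variant along those lines, assuming the image is already open as FILE *fp and positioned at the superblock; reading field by field also sidesteps any struct-padding concerns:
#include <stdio.h>

int read_superblock(FILE *fp, SuperBlock_t *sb)
{
    if (fread(&sb->magic_number, sizeof sb->magic_number, 1, fp) != 1) return -1;
    if (fread(&sb->block_size,   sizeof sb->block_size,   1, fp) != 1) return -1;
    if (fread(&sb->fs_size,      sizeof sb->fs_size,      1, fp) != 1) return -1;
    if (fread(&sb->fat_length,   sizeof sb->fat_length,   1, fp) != 1) return -1;
    if (fread(&sb->root_dir,     sizeof sb->root_dir,     1, fp) != 1) return -1;
    return 0;   /* caller may still need to byte-swap the fields */
}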
Two things to add here:
It's a good idea, when pulling raw data into a struct, to set the struct to have zero padding, even if it's entirely composed of 32-bit unsigned integers. In gcc you do this with #pragma pack(1) before the struct definition and #pragma pack() after it.
For dealing with potential endianness issues, two calls to look at are ntohs() and ntohl(), for 16- and 32-bit values respectively. Note that these swap from network byte order to host byte order; if these are the same (which they aren't on x86-based platforms), they do nothing. You go from host to network byte order with htons() and htonl(). However, since this data is coming from your filesystem and not the network, I don't know if endianness is an issue. It should be easy enough to figure out by comparing the values you expect (e.g. the block size) with the values you get, in hex.
It's not difficult to print the data after you've successfully copied it into the structure Emerick proposed. Suppose the instance of the structure you use to hold the data is named SuperBlock_t_Instance.
Then you can print its fields like this:
printf("Magic Number:\t%u\nBlock Size:\t%u\n etc",
SuperBlock_t_Instance.magic_number,
SuperBlock_t_Instance.block_size);