Writing to redis from C++ using the smallest possible bandwidth - c++

I want to write an ASCII string file to a remote Redis server (the Redis part isn't really important; the key point is that Redis stores strings) using the least amount of bandwidth possible. What is the best form of compression for this? The file has rows like:
"Employee One - blah - blah" and I want to somehow compress each one before a write, since I have a severe bandwidth limitation.
A large file sample:
EMP1,500000,125670205,21.50,20.50
EMP2,215500,125670205,21.50,20.50
EMP3,251200,125670205,21.50,20.50
EMP4,234600,125670205,21.50,20.50
EMP5,735000,125670205,21.50,20.50
EMP6,134600,125670205,21.50,20.50
It's a CSV with strings and numbers, and I want to write it either as a whole file or row by row.
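If the rows really are as repetitive as the sample, a general-purpose compressor such as zlib already shrinks them well, and compressing the whole file (or a batch of rows) gives a much better ratio than compressing one short row at a time. A minimal sketch, assuming zlib is available and that the compressed blob is simply stored as an opaque Redis string value:

#include <zlib.h>
#include <string>
#include <vector>

// Compress a batch of rows (or the whole file) before sending it to Redis.
// Returns an empty vector on failure.
std::vector<unsigned char> compressPayload(const std::string& payload)
{
    uLongf destLen = compressBound(payload.size());
    std::vector<unsigned char> out(destLen);
    // Z_BEST_COMPRESSION trades CPU time for the smallest payload on the wire.
    int rc = compress2(out.data(), &destLen,
                       reinterpret_cast<const Bytef*>(payload.data()),
                       payload.size(), Z_BEST_COMPRESSION);
    if (rc != Z_OK)
        return {};
    out.resize(destLen);   // destLen now holds the compressed size
    return out;
}

The receiving side would call uncompress() on the value read back from Redis; for repetitive CSV like the sample, batching a few hundred rows per key usually compresses far better than compressing row by row.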

Related

Most efficient file type for random data access

I am writing a password generation program. I have collected a list of around 30,000 English words and plan on picking from them at random by index.
Currently, I have all the words in a .txt file each separated by a newline character and organized by length.
My current plan is to write the program in C++, because that is the language I am most comfortable in, so I could just load the entire file into memory, but that seems incredibly sloppy.
What would be a more efficient way (or file type, like JSON if necessary) to do this? Thanks
30,000 words sounds like an insignificant amount of data to load. Even if it's ~50-500MB just load it in and forget about it.
On a modern system this will take a fraction of a second the first time (any SSD can do ~600 MB/s or more), and even less once the file is in the OS disk buffer.
You'd only concern yourself with not loading it if you've got a file too big to fit in memory.
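For what it's worth, a minimal sketch of that load-it-all approach, assuming the list lives in a newline-separated file named words.txt (a hypothetical name):

#include <fstream>
#include <iostream>
#include <random>
#include <string>
#include <vector>

int main()
{
    // Load the whole word list once; 30,000 short words is a trivial amount of data.
    std::ifstream in("words.txt");
    std::vector<std::string> words;
    for (std::string line; std::getline(in, line); )
        if (!line.empty())
            words.push_back(line);
    if (words.empty())
        return 1;

    // Pick words uniformly at random by index.
    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> pick(0, words.size() - 1);
    for (int i = 0; i < 4; ++i)
        std::cout << words[pick(rng)] << ' ';
    std::cout << '\n';
}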

What's the smallest possible file size on disk?

I'm trying to find a way to store a binary file at its smallest possible size on disk. I'm reading vehicles' VIN and plate number from a database; each record is 30 bytes, and when I put it in a .txt file and save it, its size is 30 B but its size on disk is 4 KB, which means that if I save 100,000 files or more it will kill my storage space.
So my question is: how can I write these 30 bytes to an individual binary file at its smallest size on disk, and what is the smallest possible on-disk size for 30 bytes, including other info such as the file name and permissions?
Note: I do not want to save those text in database, just I want to make separate binary files.
The smallest size of a file is always the cluster size of your disk, which is typically 4 KB. For data like this, keeping many records in a single file is really the only reasonable solution.
Another possibility would be to store those files in an archive, a zip file for example. Under Windows you can even access the zip contents much like ordinary files in Explorer.
Another creative possibility: store all the data in the filename only. A zero-byte file takes only 1024 bytes in the MFT (assuming NTFS).
Edit: reading up on resident files, I found that on the newer 4K-sector drives the MFT entry is actually 4 KB too, so it doesn't get smaller than that, whether the data size is zero or not.
Another edit: huge directories, with tens or hundreds of thousands of entries, become quite unwieldy. Don't try to open one in Explorer, or be prepared to go drink a coffee while it loads.
Most file systems allocate disk space to files in chunks. It is not possible to take less than one chunk, except for possibly a zero-length file.
Google 'Cluster size'
You should consider using an indexed-file library like gdbm: it associates arbitrary data with an arbitrary key, and you won't spend a file on each association (only a single file for all of them).
You should also reconsider your opposition to "databases". SQLite is a library that gives you SQL and database abilities, and there are NoSQL databases like MongoDB.
Of course, all of this is horribly operating-system and file-system specific (but gdbm and SQLite should work on many systems).
AFAIU, you can configure and use both gdbm and SQLite to store millions of entries of a few dozen bytes each quite efficiently.
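A minimal sketch of the SQLite option, using the C API with made-up example VIN/plate values; all the records end up packed into pages inside a single vehicles.db file instead of one 4 KB cluster each:

#include <sqlite3.h>

int main()
{
    sqlite3* db = nullptr;
    if (sqlite3_open("vehicles.db", &db) != SQLITE_OK)
        return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS vehicle (vin TEXT PRIMARY KEY, plate TEXT);",
        nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO vehicle (vin, plate) VALUES (?1, ?2);",
        -1, &stmt, nullptr);

    // Example values only; in practice these come from the source database.
    sqlite3_bind_text(stmt, 1, "1HGCM82633A004352", -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 2, "ABC-1234", -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    sqlite3_close(db);
    return 0;
}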
On filesystems you have the same problem: the smallest allocation is one data block, plus an inode. For example, in IBM JFS2 the smallest block size is 4 KB, and you still have to allocate an inode. The second problem is that you will be writing many files in a short time, and writing that many inodes in a short time causes performance problems, because every write operation has to be journaled and committed (unless you use an old, non-journaled filesystem).
One idea is to group many of your data records, put a separator between them, and write 200-1000 of them into one file, for example:
0102030400506070809101112131415;;0102030400506070809101112131415;;...
You can index them via the file name, using sequence numbers or so.
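A minimal sketch of that packing idea, assuming the records are plain strings and the ";;" separator never appears inside a record:

#include <fstream>
#include <string>
#include <vector>

// Write a batch of small records into a single file, separated by ";;".
void writeBatch(const std::string& path, const std::vector<std::string>& records)
{
    std::ofstream out(path, std::ios::binary);
    for (std::size_t i = 0; i < records.size(); ++i) {
        if (i != 0)
            out << ";;";
        out << records[i];
    }
}

With 200-1000 records per file, the per-file cluster and inode overhead is amortised over the whole batch.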

storing urls to a file so they can be reachable quickly

I have a file and plenty of URLs. These URLs are all written to the file with the same structure, plus a URL checksum of type int. stackoverflow.com is written as:
12534214214 http://stackoverflow.com
Now every time I want to put a URL into the file, I need to check that the URL doesn't already exist; only then can I put it.
But it takes too much time to do this with 1,000,000 URLs:
// list of URLs
std::list<std::string> urls;
size_t hashUrl(const std::string& url); // this function hashes the URL and returns an int

std::fstream file;
file.open("anchors");
// search for the int 12534214214; if it isn't found, write "12534214214 http://stackoverflow.com"
file.close();
Question 1: How can I search in the file using the checksum so that the search takes only a few milliseconds?
Question 2: Is there another way of storing these URLs so that they can be reached quickly?
Thanks, and sorry for the bad English.
There is (likely [1]) no way to search a million URLs in a plain text file in "a few milliseconds". You need to either load the entire file into memory (and when you do, you may as well load it into some reasonable data structure, for example a std::map or std::unordered_map), or use some sort of indexing for the file, e.g. a smaller file with just the checksum and the position in the big file where each entry is stored.
The problem with a plain text file is that there is no way to know where anything is. One line can be 10 bytes, another 10,000 bytes. This means that you literally have to read every byte up to the point you are interested in.
Of course, the other option is to use a database library, SQLite etc. (or a proper database server, such as MySQL), that allows the data to be stored and retrieved based on a "query". This hides all the index generation and other such problems, and is already optimised both in its search algorithms and in its caching and its code for reading/writing data to disk.
[1] If all the URLs are short, it's perhaps possible that the file is small enough to cache well, and code can be written to be fast enough to linearly scan through the entire file in a few milliseconds. But a file with, say, an average of 50 bytes per URL will be 50 MB. If each byte takes 10 clock cycles to process, we're already at roughly 130 ms to process the file, even if it's directly available in memory.
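A minimal sketch of the load-into-memory option, assuming the anchors file keeps the "checksum url" layout shown in the question:

#include <fstream>
#include <string>
#include <unordered_set>

int main()
{
    // Load every checksum once; after that a lookup is O(1) on average.
    std::unordered_set<unsigned long long> seen;
    {
        std::ifstream in("anchors");
        unsigned long long checksum;
        std::string url;
        while (in >> checksum >> url)
            seen.insert(checksum);
    }

    // Append the URL only if its checksum has not been seen yet.
    unsigned long long candidate = 12534214214ULL;
    if (seen.insert(candidate).second) {
        std::ofstream out("anchors", std::ios::app);
        out << candidate << " http://stackoverflow.com\n";
    }
    return 0;
}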

Fast CSV parser in C++

I am trying to read a .csv file with 20k+ lines, and each line has ~300 fields.
I am using my own code to read it line by line; then I split each line into fields and convert the fields to the corresponding data types (such as integer, double, etc.). These data are then transferred to class objects via their constructors.
However, I found that this is not very efficient: it took about 1 minute to read the 20k+ lines and create the 20k+ objects.
I've googled about fast csv parser, and found there are many options. I've tried some of them, but not very satisfied with the time performance.
Does anyone have a better method to read large .csv files? Many thanks in advance.
An efficient method for parsing, or for that matter any processing of a file, is to read as much of the file into memory as possible before you start parsing.
File I/O has been, since the dawn of computers, one of the slower parts of a computer system. For example, parsing your data may take 1 microsecond. Reading the data from a hard drive may take 1 millisecond == 1000 microseconds.
I've made programs faster by allocating a large array for the data then reading the data into the array. Next I process the data in the array and repeat until the entire file is processed.
Another technique is called memory mapping, where the OS handles reading the file into memory as needed.
Please edit your post to show the code where the bottleneck is.
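A minimal sketch of that read-everything-first approach: one bulk read into a string, then the split into lines and fields happens entirely in memory (the field-to-number conversion is left out for brevity):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::vector<std::string>> parseCsv(const std::string& path)
{
    // Read the entire file with a single bulk I/O operation.
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    std::istringstream lines(buffer.str());

    // Split the in-memory buffer into lines, and each line into fields.
    std::vector<std::vector<std::string>> rows;
    for (std::string line; std::getline(lines, line); ) {
        std::vector<std::string> fields;
        std::istringstream fieldStream(line);
        for (std::string field; std::getline(fieldStream, field, ','); )
            fields.push_back(field);
        rows.push_back(std::move(fields));
    }
    return rows;
}

Reserving the vectors up front and reusing buffers would cut allocations further, but the main win is avoiding per-line disk reads.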

What is cheaper: cast to int or Trim the strings in C++?

I am reading several files from the Linux /proc filesystem, and I will have to insert those values into a database. I should be as optimal as possible. So which is cheaper:
i) casting them to int while I store them in memory, then casting back to string while I build my INSERT statement, or
ii) keeping them as strings and just sanitizing the values (removing ':', spaces, etc.)?
iii) What should I take into account to make this decision?
I am already splitting the lines, because the order they come in is not good enough for me.
Thanks,
Pedro
Edit - Clarification
Sorry guys, my scenario is the following: I am measuring CPU, memory, network, disk, etc. every 10 seconds. We are developing our own database system, so I cannot count on anything more than plain INSERT statements.
I got interested in this optimization because of the frequency of parsing the data. It's going to be write once; there will be no updates to the data after it is written.
You seem to be performing an archiving activity [write once, read probably at most once] (storing in the DB for later, rare/non-frequent use); if not, you should put the optimization emphasis on how the data will be read (not written).
If this is the archiving case, maybe inserting BLOBs (binary large objects, or similar concepts) into the DB will be more efficient.
Addition:
Apparently it will depend on how you read the data. Are you just listing the data for browsing purposes later on, or will there be more complex queries over the benchmark values?
For example, if you later perform something like SELECT * FROM db.Log WHERE log.time > time1 AND Max(Memory) < 5000, then it is best to keep each value in its original format (int as integer, string as string, etc.) so that the main data processing is left to the DB server.
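For completeness, a minimal sketch of the two options being weighed, assuming the raw /proc values look something like " 1234:"; at one sample every 10 seconds, either path is very unlikely to be the bottleneck compared with the INSERT itself:

#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Option i): parse to a number now, convert back to text when building the INSERT.
std::string viaInt(const std::string& raw)
{
    long value = std::strtol(raw.c_str(), nullptr, 10);  // skips leading spaces, stops at ':'
    return std::to_string(value);
}

// Option ii): keep the string and just strip the unwanted characters.
std::string viaTrim(std::string raw)
{
    raw.erase(std::remove_if(raw.begin(), raw.end(),
                             [](unsigned char c) { return c == ':' || std::isspace(c); }),
              raw.end());
    return raw;
}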