Does the data file have a compression mechanism? - compression

I simulated 50 million rows of data to see how much disk space the data files occupy. value2 and str2 are written only on every 10th row. The simulation code is as follows:
while (j <= 50000000) {
    sender.metric("xush").tag("tagName", "tag1").field("value1", 100).field("str1", "hello");
    if (j % 10 == 0) {
        sender.field("value2", 100).field("str2", "hello");
    }
    sender.$(beginTs);
    sender.flush();
    j++;
}
The file disk usage is as follows:
[root@idb23 2021-11-01T00]# du -hs ./*
382M ./timestamp.d
191M ./tagName.d
668M ./str1.d
382M ./str1.i
239M ./str2.d
382M ./str2.i
191M ./value1.d
191M ./value2.d
I have the following questions:
From the official docs I know that the .d file is a column file and the .k file is an index file, so what is the .i file used for?
It seems that null values are also appended to the column file, and they take up as much space as the integer 100?
value1 is always 100 and never changes, but the column file still stores a copy for every row? Doesn't this design waste space, or am I using it incorrectly?
QuestDB seems to take up much more disk space than other TSDBs such as IoTDB. Do the data files have a compression mechanism?

QuestDB does not compress data. All values in a column take equal space, including NULL values. Compression is in theory possible by using a compressed file system, but it is not documented. The only way to reduce the footprint would be to use smaller data types and the SYMBOL type for repeated strings.
The .i file contains the offsets of a variable-size column, of STRING type in your case.
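To make that concrete, here is a minimal conceptual sketch (illustrative only, not QuestDB's actual storage code) of why a variable-size STRING column needs two files: one holding the string bytes back to back (like .d) and one holding a fixed-width offset per row (like .i). It also shows why every row costs space even when the value is empty or repeated.

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Conceptual model of a variable-size column: the offsets make row lookup O(1)
// without scanning the data bytes, at the cost of a fixed 8 bytes per row.
struct VarColumn {
    std::vector<char>          d;  // like the .d file: concatenated string payloads
    std::vector<std::uint64_t> i;  // like the .i file: byte offset of each row

    void append(const std::string& value) {
        i.push_back(d.size());                 // 8 bytes per row, even for ""
        d.insert(d.end(), value.begin(), value.end());
    }

    std::string row(std::size_t n) const {
        std::ptrdiff_t begin = static_cast<std::ptrdiff_t>(i[n]);
        std::ptrdiff_t end   = static_cast<std::ptrdiff_t>(
            (n + 1 < i.size()) ? i[n + 1] : d.size());
        return std::string(d.begin() + begin, d.begin() + end);
    }
};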

Related

Writing a single large data file, or multiple smaller files: Which is faster?

I am developing a C++ program that writes a large amount of data to disk. The following function gzips the data and writes it out to a file. The compressed data is on the order of 100GB. The function to compress and write out the data is as follows:
void constructSNVFastqData(string const& fname) {
    ofstream fastq_gz(fname.c_str());
    stringstream ss;
    for (int64_t i = 0; i < snvId->size(); i++) {
        consensus_pair &cns_pair = snvId->getPair(i);
        string qual(cns_pair.non_mutated.size(), '!');
        ss << "#" + cns_pair.mutated + "[" + to_string(cns_pair.left_ohang) +
              ";" + to_string(cns_pair.right_ohang) + "]\n"
              + cns_pair.non_mutated + "\n+\n" + qual + "\n";
    }
    boost::iostreams::filtering_streambuf<boost::iostreams::input> out;
    out.push(boost::iostreams::gzip_compressor());
    out.push(ss);
    boost::iostreams::copy(out, fastq_gz);
    fastq_gz.close();
}
The function writes data to a string stream, which I then write out to a file (fastq_gz) using boost's filtering_streambuf. The file is not a log file. After the file has been written it will be read by a child process. The file does not need to be viewed by humans.
Currently, I am writing the data out to a single, large file (fastq_gz). This is taking a while, and the file system - according to our system manager - is very busy. I wonder if, instead of writing out a single large file, I should instead write out a number of smaller files? Would this approach be faster, or reduce the load on the file system?
Please note that it is not the compression that is slow - I have benchmarked.
I am running on a linux system and do not need to consider generalising the implementation to a windows filesystem.
So what your code is probably doing is (a) generating your file into memory swap space, (b) loading from swap space and compressing on the fly, (c) writing compressed data as you get it to the outfile.
(b) and (c) are great; (a) is going to kill you. It is two round trips of the uncompressed data, one of which competes with your output file generation.
I cannot find one in Boost.Iostreams, but you need an istream (source) or a device that pulls data from you on demand. Someone must have written one (it seems so useful), but I don't see it after 5 minutes of looking at the Boost.Iostreams docs.
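If it helps, here is a rough sketch of what such a pull-based device could look like: a class modelling the Boost.Iostreams Source concept, so the gzip_compressor asks it for bytes and each record is built on demand instead of being materialised in a stringstream first. RecordSource and makeNextRecord() are hypothetical stand-ins for the per-pair formatting in the question.

#include <boost/iostreams/categories.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <algorithm>
#include <cstring>
#include <fstream>
#include <string>

// A Source device: Boost.Iostreams pulls bytes via read(), and we only build
// as many records as are needed to satisfy each request.
class RecordSource {
public:
    typedef char char_type;
    typedef boost::iostreams::source_tag category;

    explicit RecordSource(std::size_t recordCount) : remaining_(recordCount) {}

    std::streamsize read(char* s, std::streamsize n) {
        while (buffer_.size() < static_cast<std::size_t>(n) && remaining_ > 0) {
            buffer_ += makeNextRecord();       // build records lazily
            --remaining_;
        }
        if (buffer_.empty())
            return -1;                         // EOF for Boost.Iostreams
        std::streamsize count =
            std::min(n, static_cast<std::streamsize>(buffer_.size()));
        std::memcpy(s, buffer_.data(), static_cast<std::size_t>(count));
        buffer_.erase(0, static_cast<std::size_t>(count));
        return count;
    }

private:
    std::string makeNextRecord() {             // placeholder for the real formatting
        return "#read[0;0]\nACGT\n+\n!!!!\n";
    }

    std::string buffer_;
    std::size_t remaining_;
};

int main() {
    std::ofstream out("out.fastq.gz", std::ios::binary);
    boost::iostreams::filtering_streambuf<boost::iostreams::input> in;
    in.push(boost::iostreams::gzip_compressor());
    in.push(RecordSource(1000));               // the source goes last in an input chain
    boost::iostreams::copy(in, out);           // streams compressed data to disk
}

This keeps only about one request's worth of uncompressed text in memory at a time instead of the whole 100GB.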
0.) Devise an algorithm to divide the data into multiple files so that it can be recombined later.
1.) Write data to multiple files on separate threads in the background. Maybe shared threads (start n = 10 threads at a time or so); a rough sketch follows after this list.
2.) Query the futures of the shared objects to check whether writing is done (size > 1 GB).
3.) Once that is the case, recombine the data when it is queried by the child process.
4.) I would recommend starting a new file after every 1 GB.
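A rough sketch of steps 1.) and 2.) along those lines (the chunking from step 0.) is left out, and the names are illustrative, not from the question):

#include <fstream>
#include <future>
#include <string>
#include <vector>

// Hand each pre-chunked block of data to std::async; the returned futures let
// the caller poll whether each part file has been written yet.
// NOTE: the caller must keep `chunks` alive until all futures complete.
std::vector<std::future<bool>> writeChunks(const std::vector<std::string>& chunks,
                                           const std::string& basename) {
    std::vector<std::future<bool>> pending;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        pending.push_back(std::async(std::launch::async, [&chunks, i, basename] {
            std::ofstream part(basename + ".part" + std::to_string(i),
                               std::ios::binary);
            part.write(chunks[i].data(),
                       static_cast<std::streamsize>(chunks[i].size()));
            part.close();
            return !part.fail();               // true once this part is on disk
        }));
    }
    return pending;                            // poll with wait_for() or get()
}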

Append to a JSON array in a JSON file on disk, every second using C++

This is my first post here, so please bear with me.
I have searched high and low on the internet for an answer, but I've not been able to resolve my issue, so I have decided to write a post here.
I am trying to write (append) to a JSON array in a file using C++ and Jzon, at intervals of 1 write each second. The JSON file is initially written by a “Prepare” function. Another function is then called each second to add an array to the JSON file and append a new object to that array.
I have tried many things, most of which resulted in all sorts of issues. My latest attempt gave me the best results and this is the code that I have included below. However, the approach I took is very inefficient as I am writing an entire array every second. This is having a massive hit on CPU utilisation as the array grows, but not so much on memory as I had first anticipated.
What I really would like to be able to do is to append to an existing array contained in a JSON file on disk, line by line, rather than having to clear the entire array from the JSON object and rewriting the entire file, each and every second.
I am hoping that some of the geniuses on this website will be able to point me in the right direction.
Thank you very much in advance.
Here is my code:
//Create some objects somewhere at the top of the cpp file
Jzon::Object jsonFlight;
Jzon::Array jsonFlightPath;
Jzon::Object jsonCoordinates;

int PrepareFlight(const char* jsonfilename) {
    //...SOME PREPARE FUNCTION STUFF GOES HERE...
    //Add the flight information to the jsonFlight root JSON object
    jsonFlight.Add("Flight Number", flightnum);
    jsonFlight.Add("Origin", originicao);
    jsonFlight.Add("Destination", desticao);
    jsonFlight.Add("Pilot in Command", pic);
    //Write the jsonFlight object to a .json file on disk. The filename is passed in as a parameter.
    Jzon::FileWriter::WriteFile(jsonfilename, jsonFlight, Jzon::NoFormat);
    return 0;
}

int UpdateJSON_FlightPath(ACFT_PARAM* pS, const char* jsonfilename) {
    //Add the current returned coordinates to the jsonCoordinates Jzon object
    jsonCoordinates.Add("altitude", pS->altitude);
    jsonCoordinates.Add("latitude", pS->latitude);
    jsonCoordinates.Add("longitude", pS->longitude);
    //Add the coordinates to the flight path, then clear the coordinates.
    jsonFlightPath.Add(jsonCoordinates);
    jsonCoordinates.Clear();
    //Now add the entire flight path array to the jsonFlight object.
    jsonFlight.Add("Flightpath", jsonFlightPath);
    //Write the jsonFlight object to a JSON file on disk.
    Jzon::FileWriter::WriteFile(jsonfilename, jsonFlight, Jzon::NoFormat);
    //Remove the entire jsonFlightPath array from the jsonFlight object to avoid duplication next time the function executes.
    jsonFlight.Remove("Flightpath");
    return 0;
}
For sure you can do "flat file" storage yourself... but this is a symptom of needing a database. Something very light like SQLite, or mid-weight & open-source like MySQL, Firebird, or PostgreSQL.
But as to your question:
1) Leave the closing ] bracket off, and just keep the file open & appending (a minimal sketch of this follows after this answer) -- but if you don't close the file correctly, it will be damaged & need repair to be readable.
2) Your current option -- writing a complete file each time -- isn't safe from data loss either, as the moment you "open to overwrite" you lose all data previously stored in the file. The workaround here, is to rename the old file as a backup before you start writing.
You should also make backup copies of your file with the first option (say, at daily intervals). Otherwise data loss is likely to occur eventually -- on Ctrl-C, power loss, program error or system crash.
Of course, if you use any of SQLite, MySQL, Firebird or PostgreSQL, all the data-integrity problems will be handled for you.
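For what it's worth, a minimal sketch of option 1) without Jzon might look like the following. The function names are made up for illustration, and the file remains technically incomplete JSON until finishFlight() writes the closing brackets (that is the repair-after-crash caveat mentioned above).

#include <fstream>

// Keep one append-only stream open for the whole flight so each update is a
// tiny write instead of a rewrite of the whole array.
std::ofstream flightFile;

void prepareFlight(const char* jsonfilename) {
    flightFile.open(jsonfilename, std::ios::out | std::ios::trunc);
    flightFile << "{\"Flightpath\":[";          // array deliberately left open
}

void appendCoordinates(double altitude, double latitude, double longitude,
                       bool first) {
    if (!first)
        flightFile << ",";                      // separator between objects
    flightFile << "{\"altitude\":" << altitude
               << ",\"latitude\":" << latitude
               << ",\"longitude\":" << longitude << "}";
    flightFile.flush();                         // one small write per second
}

void finishFlight() {
    flightFile << "]}";                         // close the array and the object
    flightFile.close();
}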

Fastest and efficient way of parsing raw data from file

I'm working on a project and I'm wondering which way is the most efficient to read a huge amount of data from a file (I'm speaking of files from roughly 100 lines up to 3 billion lines, possibly more). Once read, the data will be stored in a structured data set (vector<entry>, where "entry" defines a structured line).
A structured line of this file may look like :
string int int int string string
which also ends with the appropriate platform EOL and is TAB delimited
What I wish to accomplish is :
Read file into memory (string) or vector<char>
Read raw data from my buffer and format it into my data set.
I need to consider memory footprint and have a fast parsing rate.
I'm already avoiding stringstream, as it seems too slow.
I'm also avoiding multiple I/O calls to my file by using:
// open the stream
std::ifstream is(filename);
// determine the file length
is.seekg(0, std::ios_base::end);
std::size_t size = is.tellg();
is.seekg(0, std::ios_base::beg);
// "out" can be a std::string or vector<char>
out.reserve(size / sizeof (char));
out.resize(size / sizeof (char), 0);
// load the data
is.read((char *) &out[0], size);
// close the file
is.close();
I've thought of taking this huge std::string and then looping over it line by line, extracting the line information (string and integer parts) into my data set rows. Is there a better way of doing this?
EDIT: This application may run on a 32-bit or 64-bit computer, or on a supercomputer for bigger files.
Any suggestions are very welcome.
Thank you
Some random thoughts:
Use vector::resize() at the beginning (you did that)
Read large blocks of file data at a time, at least 4 KB, better still 256 KB. Read them into a memory buffer and parse that buffer into your vector (a rough sketch follows after this list).
Don't read the whole file at once, this might needlessly lead to swapping.
sizeof(char) is always 1 :)
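A sketch of the block-reading suggestion above, assuming each line really is string int int int string string separated by tabs (the entry fields and the function name are illustrative):

#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

struct entry {
    std::string a;          // first string field
    int x, y, z;            // the three integer fields
    std::string b, c;       // trailing string fields
};

void parseFile(const char* filename, std::vector<entry>& out) {
    std::ifstream is(filename, std::ios::binary);
    std::vector<char> block(256 * 1024);        // ~256 KB per read call
    std::string carry;                          // incomplete last line of a block
    while (is) {
        is.read(block.data(), static_cast<std::streamsize>(block.size()));
        std::streamsize got = is.gcount();
        if (got <= 0)
            break;
        carry.append(block.data(), static_cast<std::size_t>(got));
        std::size_t start = 0, nl;
        while ((nl = carry.find('\n', start)) != std::string::npos) {
            std::string line = carry.substr(start, nl - start);
            start = nl + 1;
            // split the TAB-delimited line into its six fields
            std::vector<std::string> f;
            std::size_t p = 0, t;
            while ((t = line.find('\t', p)) != std::string::npos) {
                f.push_back(line.substr(p, t - p));
                p = t + 1;
            }
            f.push_back(line.substr(p));
            if (f.size() == 6)
                out.push_back(entry{f[0], std::atoi(f[1].c_str()),
                                    std::atoi(f[2].c_str()),
                                    std::atoi(f[3].c_str()), f[4], f[5]});
        }
        carry.erase(0, start);                  // keep the unfinished tail
    }
    // a final line without a trailing newline would still be sitting in `carry`
}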
While I cannot speak for supercomputers, with 3 billion lines you will go nowhere holding everything in memory on a desktop machine.
I think you should first try to figure out all the operations on that data. Try to design all algorithms to operate sequentially; if you need random access you will be swapping all the time. This algorithm design will have a big impact on your data model.
So do not start with reading all the data just because that is the easy part; design the whole system with a clear view of what data is in memory during the whole processing.
Update
When you do all processing in a single run over the stream and separate the data processing into stages (read - preprocess - ... - write), you can use multithreading effectively.
Finally
Whatever you want to do in a loop over the data, try to keep the number of loops to a minimum. Averaging, for example, you can certainly do in the read loop.
Immediately make up a test file of the worst-case size you expect, and time the two different approaches:
time
loop
    read line from disk
time
loop
    process line (counting words per line)
time
loop
    write data (word count) from line to disk
time

versus:

time
loop
    read line from disk
    process line (counting words per line)
    write data (word count) from line to disk
time
If you already have the algorithms, use yours; otherwise make one up (like counting words per line). If the write stage does not apply to your problem, skip it. This test takes less than an hour to write but can save you a lot.
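If it helps, a tiny timing helper along those lines might look like this; the lambda bodies are placeholders for your own read/process/write stages:

#include <chrono>
#include <iostream>

// Run a stage, print how long it took, and return the elapsed seconds.
template <class F>
double timeStage(const char* label, F&& stage) {
    auto t0 = std::chrono::steady_clock::now();
    stage();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    std::cout << label << ": " << dt.count() << " s\n";
    return dt.count();
}

// Usage sketch:
//   timeStage("read",     [&]{ /* read all lines from disk */ });
//   timeStage("process",  [&]{ /* count words per line */ });
//   timeStage("write",    [&]{ /* write counts to disk */ });
//   timeStage("combined", [&]{ /* single pass doing all three */ });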

Delaying file data for several minutes

On my machine, I have a file which is regenerated by an application every second - it contains different data each time, as it's based on some realtime data.
I would like to have a copy of this file that contains what the original file contained 5 minutes ago. Is this easily achievable? I would be happy to do this using some Bash scripting magic, but adding some wise, memory-efficient code to the original application (written in C++) would also satisfy me :)
You tagged your question with both linux and unix. This answer only applies to Linux.
You may be able to use inotify-tools (inotifywait man page) or incron (incrontab(5) man page) to watch the directory and make copies of the files as they are closed.
If disk space isn't an issue, you could make the program create a new file every second instead of writing to the same file. You would need a total of 300 files (5 min * 60 sec/min). The file name to write to would be $somename + timestamp() % 300. That way, to get the file 5 minutes ago, you would just access the file $somename + (timestamp()+1) % 300.
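A small sketch of that naming scheme (the helper names are made up for illustration): with 300 slots, the slot about to be overwritten is the one written roughly 299 seconds ago.

#include <ctime>
#include <string>

// Map a timestamp to one of 300 rotating file names.
std::string slotName(const std::string& base, std::time_t t) {
    return base + "." + std::to_string(t % 300);
}

// File being written right now.
std::string currentFile(const std::string& base) {
    return slotName(base, std::time(nullptr));
}

// The "(timestamp()+1) % 300" slot mentioned above: written about 5 minutes ago.
std::string fiveMinutesAgoFile(const std::string& base) {
    return slotName(base, std::time(nullptr) + 1);
}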
In order to achieve that, you need the space to hold each of the 300 (5*60) files. Since you indicate that the files are only about 50K in size, this is doable in 15MB of memory (if you don't want to clutter your filesystem).
It should be as simple as: (something like)
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

struct { char* buf; size_t size; } hist[300]; /* static storage: starts out all nulls */

int main(void) {
    int n = 0;
    struct stat st;
    for (;; sleep(1)) {
        int ifd = open("file", O_RDONLY);
        int ofd = open("file-lag", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        fstat(ifd, &st);                          /* size of the current snapshot */
        hist[n].size = st.st_size;
        if (hist[n].buf)
            free(hist[n].buf);                    /* drop the copy from 300 seconds ago */
        hist[n].buf = malloc(hist[n].size);
        read(ifd, hist[n].buf, hist[n].size);     /* remember the current contents */
        n = (n + 1) % 300;
        if (hist[n].buf)                          /* slot written ~5 minutes ago */
            write(ofd, hist[n].buf, hist[n].size);
        close(ofd);
        close(ifd);
    }
}

Calculate NTFS and FAT file system size in Windows

Does anybody know how to calculate the amount of space occupied by the file system alone?
I am trying to calculate how much space files and directories occupy on a disk without iterating through the entire disk.
This is a sample in C++:
ULARGE_INTEGER freeBytesAvailable, totalNumberOfBytes, totalNumberOfFreeBytes;
GetDiskFreeSpaceEx(NULL, &freeBytesAvailable, &totalNumberOfBytes, &totalNumberOfFreeBytes);
mCurrentProgress = 0;
mTotalProgress = totalNumberOfBytes.QuadPart - totalNumberOfFreeBytes.QuadPart;
But the problem is that I need to exclude the size of the file system itself, and I have no idea whether that is possible or whether there is an API to get this information.
That doesn't really make sense. On NTFS, small files are stored inside the file's MFT record -- I mean literally, they're inlined. The same record that holds the file name also holds the file contents. Therefore, you can't count that space as either "used for files" or "used for file system overhead".