I have a couple of questions.
First, I am doing HTTP byte-range requests and writing the received
data to a file.
Sometimes I have to fetch a block of 1K and then read that block back
from the file fetched through HTTP. The problem is that the next request
after the 1K request may start at byte 100, and in that case I want to
write into the 1K file, overwriting from byte 100 onwards. How can I overwrite from a
specific offset in the file?
Secondly, how do I create a file with some data already in it? For example, I want
to put data in the file starting at, say, the 500th byte. I do not care about the first
500 bytes (they could be any garbage data), but it is important to have the correct file
size for the code to work.
Thanks
There's some reference material and sample code on ofstream's seekp at http://www.cplusplus.com/reference/iostream/ostream/seekp/
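As a minimal sketch of both parts of the question, assuming standard C++ file streams: the helper names `write_at` and `create_sized` are made up for illustration. The key detail is opening the file with both `in` and `out`, because a plain `std::ofstream` truncates the existing contents. Seeking past the end and writing one byte to set the size works on common platforms, but what the skipped bytes contain is up to the OS.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: overwrite `data` at byte `offset` of an existing file.
// Opening with in | out | binary keeps the current contents intact.
bool write_at(const std::string& path, std::streamoff offset, const std::string& data) {
    assert(offset >= 0);
    std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!f) return false;                       // file must already exist
    f.seekp(offset, std::ios::beg);             // move the put pointer
    f.write(data.data(), static_cast<std::streamsize>(data.size()));
    return static_cast<bool>(f);
}

// Hypothetical helper: create (or truncate) a file that is exactly `size` bytes
// long by seeking to the last byte and writing a single zero there.
bool create_sized(const std::string& path, std::streamoff size) {
    assert(size > 0);
    std::ofstream f(path, std::ios::binary | std::ios::trunc);
    if (!f) return false;
    f.seekp(size - 1, std::ios::beg);
    f.put('\0');
    return static_cast<bool>(f);
}
```

For the case in the question, one would call `create_sized(path, 1024)` once and then `write_at(path, 100, payload)` for the request that starts at byte 100.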
I have a text file which looks like below:
0.001 ETH Rx 1 1 0 B45678810000000000000000AF0000 555
0.002 ETH Rx 1 1 0 B45678810000000000000000AF 23
0.003 ETH Rx 1 1 0 B45678810000000000000000AF156500
0.004 ETH Rx 1 1 0 B45678810000000000000000AF00000000635254
I need a way to read this file, form a structure from each line, and send it to a client application.
Currently, I can do this with the help of a circular queue from Boost.
The need here is to access different data at different times.
For example, if I want to access the data at 0.03 sec while I am currently at 100 sec, what is the best way to do this without tracking a file pointer or loading the whole file into memory, which causes a performance bottleneck? (Consider a file of 2 GB with the above kind of data.)
Usually the best practice for handling large files depends on the platform architecture (x86/x64) and OS (Windows/Linux etc.)
Since you mentioned boost, have you considered using boost memory mapped file?
Boost Memory Mapped File
It all depends on
a. how frequent the data access is
b. what the data access pattern is
Splitting the file
If you only need to access the data once in a while, then this 2 GB log
design is fine. If not, the logger can be tuned to start a new log at a
periodic interval, or a later step can split the 2 GB file into
smaller files as needed. Fetching the log file for the required range, reading
it, and then picking out the needed lines becomes easier, since fewer
bytes have to be read from the file.
Cache
For very frequent data access, maintaining a cache is one good way to get a faster response, although, as you said, it has its own bottleneck. The size and layout of the cache depend on point b, the data access pattern. Also, the larger the cache, the slower the response, so the size should be tuned to an optimum.
Database
If the search pattern is unordered or grows dynamically with usage, then a database will work. Again, it will not respond as fast as a small cache.
A mix of a database with well-organized tables that support the type of query, plus a small cache layer, will give the best result.
Here is the solution I found:
I used circular buffers (Boost lock-free buffers) for parsing the file and for holding each line in structured form.
I used separate threads:
One continuously parses the file and pushes to a lock-free queue.
One continuously reads from the buffer, processes the line, forms a structure, and pushes it to another queue.
Whenever the user needs random data, based on time, I move the file pointer to the particular line and read only that line.
Both threads use a mutex wait mechanism to stop parsing once the predefined buffer limit is reached.
The user gets data at any time, and there is no need to store the complete file contents. As each frame is read, I delete it from the queue, so the file size doesn't matter. The parallel threads that fill the buffers avoid spending time on reading the file every time.
If I want to move to another line, I move the file pointer, wipe the existing data, and start the threads again.
Note:
The only issue now is moving the file pointer to a particular line.
I need to parse line by line until I reach that point.
If there is any way to move the file pointer directly to the required line, it would be helpful. A binary search or any other efficient search algorithm could be used to get what I want.
I would appreciate it if anybody could give a solution for this new issue!
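One possible approach to the remaining issue, sketched here under the assumption that every line begins with a timestamp and that the lines are sorted by it (as in the sample above), is a binary search over byte offsets instead of line numbers. The function names are hypothetical. Each probe seeks into the file, skips to the next line start, and compares that line's timestamp, so only O(log n) short reads are needed.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Offset of the first line that starts at or after `pos`, or `size` if none.
std::streamoff line_start_at_or_after(std::ifstream& in, std::streamoff pos,
                                      std::streamoff size) {
    if (pos <= 0) return 0;
    in.clear();
    in.seekg(pos - 1, std::ios::beg);   // start one byte early so that an exact
    std::string skipped;                // line start is not skipped over
    if (!std::getline(in, skipped) || in.eof()) return size;
    return in.tellg();
}

// Timestamp (first whitespace-separated field) of the line starting at `pos`.
double timestamp_at(std::ifstream& in, std::streamoff pos) {
    in.clear();
    in.seekg(pos, std::ios::beg);
    double ts = 0.0;
    in >> ts;
    return ts;
}

// Byte offset of the first line whose timestamp is >= target, or -1 if none.
// Assumes timestamps are ascending and lines end with '\n'.
std::streamoff find_line_by_time(const std::string& path, double target) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return -1;
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    std::streamoff lo = 0, hi = size;
    while (lo < hi) {
        const std::streamoff mid = lo + (hi - lo) / 2;
        const std::streamoff start = line_start_at_or_after(in, mid, size);
        if (start >= size || timestamp_at(in, start) >= target) hi = mid;
        else lo = mid + 1;
    }
    const std::streamoff start = line_start_at_or_after(in, lo, size);
    return start < size ? start : -1;
}
```

The returned offset can be passed to `seekg` before restarting the parser thread, so no line-by-line scan from the beginning of the file is required.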
I have a file and plenty of URLs. The URLs are written to the file, all with the same structure: a URL checksum of type int followed by the URL. stackoverflow.com is written as:
12534214214 http://stackoverflow.com
Now, every time I want to put a URL into the file, I need to check that the URL doesn't already exist;
only then can I write it.
But this takes too much time with 1 000 000 URLs:
//list of urls
list<string> urls;
size_t hashUrl(string argUrl); //this function will hash the url and return an int
file.open("anchors");
//search for the int 12534214214 if it isn't found then write 12534214214 http://stackoverflow.com
file.close();
Question 1: How can I search in a file using the checksum so that the search takes only a few ms?
Question 2: Is there another way of storing these URLs so that they can be looked up quickly?
Thanks, and sorry for my bad English.
There is (likely [1]) no way you can search a million URLs in a plain text file in "a few milliseconds". You need to either load the entire file into memory (and when you do, you may just as well load it into some reasonable data structure, for example a std::map or std::unordered_map), or use some sort of index for the file, e.g. a smaller file with just the checksums and the positions in the main file where the entries are stored.
The problem with a plain text file is that there is no way to know where anything is. One line can be 10 bytes, another 10000 bytes. This means that you literally have to read every byte up to the point you are interested in.
Of course, the other option is to use a database library such as SQLite (or a proper database server, such as MySQL) that allows the data to be stored and retrieved based on a "query". This hides all the index generation and other such problems, and is already optimised when it comes to search algorithms, clever caching, and code for reading and writing data to disk.
[1] If all the URLs are short, it's perhaps possible that the file is small enough to cache well, and code can be written to be fast enough to linearly scan through the entire file in a few milliseconds. But a file with a million URLs at, say, an average of 50 bytes each will be 50 MB. If each byte takes 10 clock cycles to process, that is 500 million cycles, so we're already at about 130 ms on a CPU running at a little under 4 GHz, even if the file is directly available in memory.
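To illustrate the in-memory option, here is a rough sketch that loads the file once into a `std::unordered_set` and then answers "is this URL already stored?" in roughly constant time. The `UrlStore` class is hypothetical, the file layout (`checksum url` per line) is taken from the question, and `std::hash` merely stands in for the asker's `hashUrl`. The set stores the URLs themselves rather than the checksums, so two URLs with the same checksum are not confused.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical wrapper around the "checksum url" text file.
class UrlStore {
public:
    // Reads the whole file once; a missing file simply yields an empty store.
    explicit UrlStore(std::string path) : path_(std::move(path)) {
        std::ifstream in(path_);
        std::string checksum, url;
        while (in >> checksum >> url) urls_.insert(url);
    }

    // Appends "checksum url" to the file if the URL is new.
    // Returns true if the URL was added, false if it was already present.
    bool add(const std::string& url) {
        if (!urls_.insert(url).second) return false;    // duplicate
        std::ofstream out(path_, std::ios::app);
        out << std::hash<std::string>{}(url) << ' ' << url << '\n';
        return static_cast<bool>(out);
    }

    std::size_t size() const { return urls_.size(); }

private:
    std::string path_;
    std::unordered_set<std::string> urls_;
};
```

The one-time load still reads every byte of the file, but each later `add` call costs a hash lookup plus one append instead of a full scan.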
I am trying to read a .csv file with 20k+ lines, and each line has ~300 fields.
I am using my own code to read it line by line, then I split the lines into fields and convert the fields to the corresponding data types (such as integer, double, etc.). These data are then transferred to class objects via their constructors.
However, I found it is not very efficient. It took about 1 min to read these 20k+ lines and create 20k+ objects.
I've googled for fast CSV parsers and found there are many options. I've tried some of them, but I am not very satisfied with their performance.
Does anyone have a better method to read large .csv files? Many thanks in advance.
An efficient method for parsing, or for that matter any processing of files, is to read as much of the file as possible into memory before you start parsing.
File I/O has been, since the dawn of computers, one of the slower parts of a computer system. For example, parsing your data may take 1 microsecond. Reading the data from a hard drive may take 1 millisecond == 1000 microseconds.
I've made programs faster by allocating a large array for the data then reading the data into the array. Next I process the data in the array and repeat until the entire file is processed.
Another technique is called memory mapping, where the OS handles reading the file into memory as needed.
Please edit your post to show the code where the bottleneck is.
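The read-first, parse-later idea could look roughly like the sketch below. It is only an illustration: the function names are invented, and the splitter assumes a simple CSV dialect with no quoted fields and no commas or newlines inside a field. Type conversion and object construction would then run over the in-memory rows.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Read the entire file into one string with a single bulk copy.
std::string read_whole_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

// Split the in-memory text into rows of fields.
// Assumes no quoted fields; handles both "\n" and "\r\n" line endings.
std::vector<std::vector<std::string>> parse_csv(const std::string& text) {
    std::vector<std::vector<std::string>> rows;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t eol = text.find('\n', pos);
        if (eol == std::string::npos) eol = text.size();
        std::size_t end = eol;
        if (end > pos && text[end - 1] == '\r') --end;   // strip CR
        std::vector<std::string> fields;
        std::size_t start = pos;
        while (true) {
            const std::size_t comma = text.find(',', start);
            if (comma == std::string::npos || comma >= end) {
                fields.emplace_back(text, start, end - start);  // last field
                break;
            }
            fields.emplace_back(text, start, comma - start);
            start = comma + 1;
        }
        rows.push_back(std::move(fields));
        pos = eol + 1;
    }
    return rows;
}
```

Whether this helps depends on where the time actually goes; if the per-field conversions or the object constructors dominate, profiling those is the next step.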
For a database class, we are implementing our own database, and I am having trouble working out how to implement block storage in C++ (where each block is 1024 bytes).
We are to store each database table as a randomly accessible collection of blocks on the hard disk, where the first block is a file header, dedicated to meta data (block 0), and each subsequent block is dedicated to storing the rows of the table. The blocks are to be written to the hard disk as files. We are also to have one block as an "in-memory" buffer; we can read and edit the data in the buffer, and when we are ready, we write the in-memory buffer back to disk.
I think I am OK conceptualizing the in-memory buffer, but I am having trouble working out how to write the blocks of memory to files. I have two ideas, each with its own difficulties:
Idea 1
Create a class MemoryBlock that is exactly 1024 bytes. Each MemoryBlock can store arbitrary data (file header or rows of the table). Store each table as a single file by writing the array of MemoryBlocks to the file.
Difficulty:
Can I update a single block in the middle of the file? It is my understanding that files must be overwritten or appended to. If I have a file of 3 MemoryBlocks (blocks 0-2), and I want to update a row that is in block 1, can I just pull the block 1 into my buffer, edit it, and write it back to the middle of the file, or would I have to pull the entire file into memory, edit what I want to, and then overwrite the original file?
Idea 2
Store each block as a separate file on disk. This would allow me to randomly access any block and write it back to disk without having to worry about the rest of the table
Difficulty: I'm not sure if this is really enforcing the 1024 byte block size. Is there any way to require that each file does not exceed 1024 bytes?
I am not married to either idea, but I am appreciative of any input that helps me better understand block storage in database management systems.
Edit: As @zaufi points out, 1024-byte block sizes are very atypical. I meant to type 4096-byte blocks when writing this.
Oh man, you definitely need to read something about database internals...
Here are my 5 cents: both ideas are bad! Why did you decide to use 1024-byte blocks? The physical sector size of modern HDDs is 4096 bytes, and disk controllers have caches of 4M, 8M, 16M and more, so writing 1K at a time just wastes resources.
And by the way, updating something in the middle of a file is always a bad idea... but if performance is not your concern, you can definitely do it.
Before reinventing the wheel, try to research the typical approaches used in various DBMSs.
One more good (simple) source to read: google leveldb and friends; this will definitely give you ideas!
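On the narrower question of whether a single block in the middle of a file can be rewritten without touching the rest: with standard C++ streams it can, as long as the file is opened for both reading and writing. The sketch below assumes 4096-byte blocks as in the edit to the question; `kBlockSize`, `Block`, `read_block` and `write_block` are illustrative names, not part of any library.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <fstream>

constexpr std::size_t kBlockSize = 4096;       // assumed block size
using Block = std::array<char, kBlockSize>;    // one in-memory buffer

// Read block number `index` into `out`. Returns false on a short read,
// for example when the block lies beyond the end of the file.
bool read_block(std::fstream& f, std::size_t index, Block& out) {
    f.clear();
    f.seekg(static_cast<std::streamoff>(index * kBlockSize), std::ios::beg);
    f.read(out.data(), static_cast<std::streamsize>(out.size()));
    return f.gcount() == static_cast<std::streamsize>(out.size());
}

// Overwrite block number `index` in place; other blocks are untouched.
bool write_block(std::fstream& f, std::size_t index, const Block& in) {
    f.clear();
    f.seekp(static_cast<std::streamoff>(index * kBlockSize), std::ios::beg);
    f.write(in.data(), static_cast<std::streamsize>(in.size()));
    f.flush();
    return static_cast<bool>(f);
}
```

The stream has to be opened as `std::fstream f(path, std::ios::in | std::ios::out | std::ios::binary)`; opening with `out` alone would truncate the table file.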
I have the following case. I have a big file, say 1 KB. I want to read the first 100 bytes, then delete those 100 bytes from the file, and then read the next 100 bytes. Reading 100 bytes is OK, but how do I delete 100 bytes from the file?
This is commonly done as a multiple-step process:
1. Rename the original file.
2. Write the data you want into a new file with the original file name.
3. Delete the old file with the temporary name that contains the data you no longer want.
That way, if something were to go wrong, you could simply restore the original file that you renamed. Moving a file from one place to another is implemented this way, as well.
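The rename, rewrite, delete sequence above could be sketched with C++17's `std::filesystem` as follows. The function name `drop_prefix` and the `.bak` suffix are arbitrary choices for this example, and the restore-on-failure path is kept deliberately simple.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <system_error>

namespace fs = std::filesystem;

// Remove the first `count` bytes of `path` by copying the remainder into a
// new file that takes over the original name.
bool drop_prefix(const fs::path& path, std::streamoff count) {
    assert(count >= 0);
    fs::path backup = path;
    backup += ".bak";                          // temporary name (assumption)

    std::error_code ec;
    fs::rename(path, backup, ec);              // rename the original file
    if (ec) return false;

    const std::uintmax_t size = fs::file_size(backup, ec);
    bool ok = !ec;
    if (ok) {                                  // write the data to keep
        std::ifstream in(backup, std::ios::binary);
        std::ofstream out(path, std::ios::binary | std::ios::trunc);
        ok = in && out;
        if (ok && static_cast<std::uintmax_t>(count) < size) {
            in.seekg(count, std::ios::beg);
            out << in.rdbuf();                 // copy everything after `count`
            ok = static_cast<bool>(out);
        }
    }                                          // both streams are closed here

    if (!ok) {                                 // restore the original on error
        fs::rename(backup, path, ec);
        return false;
    }
    fs::remove(backup, ec);                    // delete the old file
    return !ec;
}
```

Every call rewrites the whole remainder of the file, so doing this after each 100-byte read costs time proportional to the file size each time.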
However, if you don't want to do this, the SetEndOfFile function is another viable option to truncate the contents of a file in-place. From the documentation:
Sets the physical file size for the specified file to the current position of the file pointer.
The physical file size is also referred to as the end of the file. The SetEndOfFile function can be used to truncate or extend a file.
That wouldn't be called truncating; that term refers to removing data from the end, not the beginning. I'm not aware of any operating system where this is possible, other than by copying the contents of the file to a new file, starting at the 100th byte.
Deleting data that has been processed in a file is time-consuming and in most cases not necessary.
Deleting data near the top or middle of the file requires writing a new file, which takes time and disk space. Most applications will read and process the entire file and then rename the file (with a backup extension). This is useful for debugging purposes. Deleting an entire file is often a faster operation than writing a new file without the processed data.
Deletions should only take place when necessary. For files, one can store an offset of where the valid data begins, thus reducing the need to delete data from a file. For security purposes, overwriting data in the file is often faster than creating a new file without the processed data.
First try writing your program so that it does not delete data in the file. Only delete as necessary, after the program is robust and working correctly. Many people would suggest deleting files only when there is no more space on the drive.