I am given a config file that looks like this for example:
Start Simulator Configuration File
Version/Phase: 2.0
File Path: Test_2e.mdf
CPU Scheduling Code: SJF
Processor cycle time (msec): 10
Monitor display time (msec): 20
Hard drive cycle time (msec): 15
Printer cycle time (msec): 25
Keyboard cycle time (msec): 50
Mouse cycle time (msec): 10
Speaker cycle time (msec): 15
Log: Log to Both
Log File Path: logfile_1.lgf
End Simulator Configuration File
I am supposed to be able to take this file and output the cycles and cycle times to a log and/or the monitor. I am then supposed to pull data from a meta-data file that tells me how many cycles each of these runs (among other things), and then I am supposed to calculate and log the total time. For example, 5 hard drive cycles would be 75 msec. The config and meta-data files can come in any order.
I am thinking I will put each item in an array and then cycle through, waiting for true when the strings match (this will also help detect file errors). The config file should always be the same size despite a possibly different order. The metadata file can be any size, so I figured I would do a similar thing but with a vector.
Then I will multiply the cycle times from the config file by the number of cycles in the matching metadata file string. I think the best way to read the data from the vector is through a queue.
Does this sound like a good idea?
I understand most of the concepts, but my grasp of data structures is shaky when it comes to actually coding this. For example, when reading from the files, should I read line by line, or would it be best to separate the ints from the strings so I can calculate with them later? I've never had to do this with a file whose contents can change before.
If I separate them, would I have to use separate arrays/vectors?
I'm using C++, by the way.
Your logic should be:
1. Create two std::map variables: one that maps a string to a string, and another that maps a string to a float.
2. Read each line of the file.
3. If the line contains a ':', split the string into two parts:
3a. Part A is the portion of the line from index zero up to (but not including) the index of the ':'.
3b. Part B is the portion of the line starting from the index of the ':' plus 1.
4. Use these two parts as the key and value to store in your std::map variables, choosing the map based on the value type.
Now you have read the file properly. When you read the meta file, you simply look up the key from the meta-data file, use it to look up the corresponding key in your configuration-file data (to get the value), then do whatever mathematical operation is required.
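A minimal sketch of that logic, based on the config format shown in the question; the file name and the "(msec)" heuristic used to decide which map a value goes into are just illustrative assumptions:

#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main() {
    std::map<std::string, std::string> textSettings;   // e.g. "CPU Scheduling Code" -> "SJF"
    std::map<std::string, float>       cycleTimes;     // e.g. "Hard drive cycle time (msec)" -> 15

    std::ifstream config("config.conf");                // hypothetical file name
    std::string line;
    while (std::getline(config, line)) {
        std::size_t colon = line.find(':');
        if (colon == std::string::npos)
            continue;                                   // the Start/End lines have no ':'
        std::string key   = line.substr(0, colon);      // part A: everything before the ':'
        std::string value = line.substr(colon + 1);     // part B: everything after the ':'
        if (!value.empty() && value.front() == ' ')
            value.erase(0, 1);                          // trim the space left after the ':'

        if (key.find("(msec)") != std::string::npos)
            cycleTimes[key] = std::stof(value);         // numeric setting
        else
            textSettings[key] = value;                  // string setting
    }

    // Later, when a meta-data entry says e.g. "Hard drive, 5 cycles":
    float total = cycleTimes["Hard drive cycle time (msec)"] * 5;   // 15 * 5 = 75 msec
    std::cout << "Hard drive, 5 cycles: " << total << " msec\n";
}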
I have a text file which looks like below:
0.001 ETH Rx 1 1 0 B45678810000000000000000AF0000 555
0.002 ETH Rx 1 1 0 B45678810000000000000000AF 23
0.003 ETH Rx 1 1 0 B45678810000000000000000AF156500
0.004 ETH Rx 1 1 0 B45678810000000000000000AF00000000635254
I need a way to read this file and form a structure and send it to client application.
Currently, I can do this with the help of a circular queue from Boost.
What I need is to access different data at different times.
For example, if I want to access the data at 0.03 sec while I am currently at 100 sec, what is the best way to do this, instead of tracking the file pointer or loading the whole file into memory (which causes a performance bottleneck)? (Consider that the file is about 2 GB of the above kind of data.)
Usually, the best practice for handling large files depends on the platform architecture (x86/x64) and the OS (Windows/Linux, etc.).
Since you mentioned boost, have you considered using boost memory mapped file?
Boost Memory Mapped File
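As an illustration, a minimal sketch with boost::iostreams::mapped_file_source (placeholder file name); the OS pages the data in on demand, so the whole 2 GB is never copied into your own buffers:

#include <boost/iostreams/device/mapped_file.hpp>
#include <iostream>

int main() {
    boost::iostreams::mapped_file_source file("big_log.txt");   // maps the file read-only
    const char* data = file.data();
    std::size_t size = file.size();

    // Example: count lines without allocating buffers for the file contents.
    std::size_t lines = 0;
    for (std::size_t i = 0; i < size; ++i)
        if (data[i] == '\n')
            ++lines;

    std::cout << "mapped " << size << " bytes, " << lines << " lines\n";
}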
It all depends on:
a. how frequently the data is accessed
b. what the data access pattern is
Splitting the file
If you only need to access the data once in a while, then the single 2 GB log design is fine. If not, the logger can be tuned to generate logs at periodic intervals, or a later step can split the 2 GB file into smaller files in whatever fashion is needed. Fetching the log file for the relevant range, reading it, and sorting out the needed lines then becomes easier, since far fewer file bytes have to be read.
Cache
For very frequent data access, maintaining a cache is a nice solution for faster responses, although, as you said, it has its own bottlenecks. The size of the cache and the choice of what to keep in it depend entirely on (b), the data access pattern. Also, the larger the cache, the slower the response, so the size should be kept optimal.
Database
If the search pattern is unordered or grows dynamically with usage, then a database will work. Again, it will not give as fast a response as a small cache.
A mix of a database, with the tables organized to support the type of query, plus a smaller cache layer will give the optimal result.
Here is the solution I found:
Used circular buffers (Boost lock-free buffers) for parsing the file and for saving each line in structured form
Used separate threads:
One continuously parses the file and pushes lines to a lock-free queue
One continuously reads from that buffer, processes the line, forms a structure, and pushes it to another queue
Whenever the user needs random data based on time, I move the file pointer to the particular line and read only that line.
Both threads have mutex/wait mechanisms to stop parsing once the predefined buffer limit is reached.
The user can get data at any time, and there is no need to store the complete file contents. As soon as a frame is read, I delete it from the queue, so the file size doesn't matter. The parallel threads that fill the buffers mean we do not spend time reading the file on every request.
If I want to move to another line, I move the file pointer, wipe the existing data, and start the threads again.
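For illustration, a rough sketch of that arrangement using boost::lockfree::spsc_queue, collapsed to a single parsing stage plus a consumer for brevity; the Frame type, queue capacity, and file name are assumptions:

#include <boost/lockfree/spsc_queue.hpp>
#include <atomic>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

struct Frame {                       // parsed form of one log line (simplified)
    double      timestamp = 0.0;
    std::string payload;
};

boost::lockfree::spsc_queue<Frame, boost::lockfree::capacity<1024>> frames;
std::atomic<bool> doneParsing{false};

void producer(const std::string& path) {
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        Frame f;
        f.timestamp = std::stod(line.substr(0, line.find(' ')));  // assumes a leading numeric timestamp
        f.payload   = line;
        while (!frames.push(f))          // buffer full: back off until the consumer catches up
            std::this_thread::yield();
    }
    doneParsing = true;
}

int main() {
    std::thread t(producer, "big_log.txt");   // hypothetical file name

    // Consumer: drain frames as they arrive. Read the flag *before* draining so
    // nothing pushed before the flag was set is missed.
    std::size_t count = 0;
    bool finished = false;
    while (!finished) {
        finished = doneParsing.load();
        Frame f;
        while (frames.pop(f))
            ++count;                          // a real client would hand the frame out here
        if (!finished)
            std::this_thread::yield();
    }
    t.join();
    std::cout << "processed " << count << " frames\n";
}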
Note:
The only issue now is moving the file pointer to a particular line.
Currently I need to parse line by line until I reach that point.
If there is any way to move the file pointer directly to the required line, it would be helpful; binary search or any other efficient search algorithm that gets me there would give me what I want.
I would appreciate it if anybody could offer a solution to this new issue!
I have a file as follows:
The file consists of 2 parts: header and data.
The data part is separated into equally sized pages. Each page holds data for a specific metric, and multiple pages (which need not be consecutive) might be needed to hold the data for a single metric. Each page consists of a page header and a page body. The page header has a field called "Next page", which is the index of the next page holding data for the same metric. The page body holds the real data. All pages have the same fixed size (20 bytes for the header and 800 bytes for the body; if the data amount is less than 800 bytes, the remainder is zero-filled).
The header part consists of 20,000 elements, and each element has information about a specific metric (point 1 -> point 20000). An element has a field called "first page", which is the index of the first page holding data for that metric.
The file can be up to 10 GB.
Requirement: re-order the data in the file in the shortest time possible, so that the pages holding data for a single metric are consecutive, and the metrics run from metric 1 to metric 20000 in alphabetical order (the header part must be updated accordingly).
An obvious approach: for each metric, read all of its data (page by page) and write it to a new file. But this takes a lot of time, especially the reading of data from the file.
Is there a more efficient way?
One possible solution is to create an index from the file, containing the page number and the page metric that you need to sort on. Create this index as an array, so that the first entry (index 0) corresponds to the first page, the second entry (index 1) the second page, etc.
Then you sort the index using the metric specified.
Once sorted, you end up with a new array whose entries give the new first, second, etc. pages, and you read the input file and write to the output file in the order given by the sorted index.
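A minimal sketch of that idea, assuming the 20-byte-header / 800-byte-body page layout from the question; how the index is built (by walking the header's "first page" / "Next page" chains) and where the page area starts are left as inputs, since they depend on the exact header format:

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

constexpr std::size_t kPageSize = 20 + 800;   // page header + body, from the question

struct IndexEntry {
    std::string   metric;   // sort key: the metric this page belongs to
    std::uint64_t page;     // page number in the input file
    std::uint64_t seq;      // position of the page within its metric's chain
};

// `index` is assumed to have been built by walking the header's page chains;
// `dataStart` is the byte offset at which the page area begins.
void reorderPages(const std::string& inPath, const std::string& outPath,
                  std::vector<IndexEntry> index, std::uint64_t dataStart) {
    // Sort by metric name, keeping each metric's pages in their original chain order.
    std::sort(index.begin(), index.end(), [](const IndexEntry& a, const IndexEntry& b) {
        return a.metric != b.metric ? a.metric < b.metric : a.seq < b.seq;
    });

    std::ifstream in(inPath, std::ios::binary);
    std::ofstream out(outPath, std::ios::binary);
    // (The rewritten header part, with updated "first page" fields, would be
    // written to `out` before this loop.)
    std::vector<char> page(kPageSize);
    for (const IndexEntry& e : index) {
        in.seekg(static_cast<std::streamoff>(dataStart + e.page * kPageSize));
        in.read(page.data(), kPageSize);
        out.write(page.data(), kPageSize);
    }
}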
An obvious approach: for each metric, read all of its data (page by page) and write it to a new file. But this takes a lot of time, especially the reading of data from the file.
Is there a more efficient way?
Yes. After you get a working solution, measure its efficiency, then decide which parts you wish to optimize. What and how you optimize will depend greatly on the results you get there (that is, on where your bottlenecks are).
A few generic things to consider:
if you have one set of steps that reads the data for a single metric and moves it to the output, you should be able to parallelize that (have 20 sets of steps instead of one).
a 10 GB file will take a while to process regardless of what hardware you run your code on (conceivably, you could run it on a supercomputer, but I am ignoring that case). You / your client may accept a slower solution if it displays its progress / shows a progress bar.
do not use string comparisons for sorting.
Edit (addressing comment)
Consider performing the read as follows:
create a list of block offsets for the blocks you want to read
create a list of worker threads of a fixed size (for example, 10 workers)
each idle worker receives the file name and a block offset, creates a std::ifstream instance on the file, reads the block, and returns it to a receiving object (and then requests another block number, if any are left).
pages that have been read should be passed to a central structure that manages/stores them.
Also consider managing the memory for the blocks separately (for example, allocate chunks of multiple blocks preemptively, when you know the number of blocks to be read).
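A sketch of that worker-pool read, with placeholder block size and worker count; each worker owns its own std::ifstream and claims offsets from a shared list under a mutex:

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

constexpr std::size_t kBlockSize  = 800;   // placeholder: e.g. one page body
constexpr int         kNumWorkers = 10;

struct Block {
    std::uint64_t     offset;
    std::vector<char> data;
};

void readBlocks(const std::string& path,
                std::vector<std::uint64_t> offsets,   // blocks we want, in any order
                std::vector<Block>& out)              // central structure collecting the blocks
{
    std::mutex workMutex;    // protects `offsets`
    std::mutex outMutex;     // protects `out`

    auto worker = [&]() {
        std::ifstream in(path, std::ios::binary);     // each worker has its own stream
        for (;;) {
            std::uint64_t off;
            {
                std::lock_guard<std::mutex> lock(workMutex);
                if (offsets.empty()) return;          // no blocks left to claim
                off = offsets.back();
                offsets.pop_back();
            }
            Block b{off, std::vector<char>(kBlockSize)};
            in.seekg(static_cast<std::streamoff>(off));
            in.read(b.data.data(), kBlockSize);
            std::lock_guard<std::mutex> lock(outMutex);
            out.push_back(std::move(b));
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < kNumWorkers; ++i)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();
}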
I first read the header part, then sort the metrics in alphabetical order. For each metric in the sorted list I read all of its data from the input file and write it to the output file. To remove the bottleneck at the data-reading step, I used memory mapping. The results showed that with memory mapping the execution time for a 5 GB input file was reduced 5 to 6 times compared with not using memory mapping. This temporarily solves my problem. However, I will also consider the suggestions of #utnapistim.
I have a primary file which has millions of lines. While reading each line from this file, I need to find the corresponding line in another file that has far fewer lines (only several thousand) in order to make some decision. Currently I read the second file into a vector at the start, and then for each line in the primary file I iterate over the vector to look for the line. The problem is that the running time is quite long. Is there an efficient way to perform this task and limit the running time to something reasonable?
You should read the second file into a std::map<std::string,int>. The map key would be the line, and the value the number of times that line was encountered in the second file.
This way, checking whether a given line from the first file can be found in the second is a fast (logarithmic) lookup, and the overall running time should be limited mainly by the speed of the disk drive reading the contents of the first, huge file.
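A minimal sketch of that lookup, with placeholder file names:

#include <fstream>
#include <map>
#include <string>

int main() {
    // Load the small secondary file once: line -> number of occurrences.
    std::map<std::string, int> secondary;
    {
        std::ifstream small("secondary.txt");
        std::string line;
        while (std::getline(small, line))
            ++secondary[line];
    }

    // Stream the huge primary file and look each line up in the map.
    std::ifstream big("primary.txt");
    std::string line;
    while (std::getline(big, line)) {
        auto it = secondary.find(line);
        if (it != secondary.end()) {
            // The line occurs it->second times in the secondary file:
            // make your decision here.
        }
    }
}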
You can try replacing the second (smaller) vector with a std::set.
You have an inner loop, which compares the current line of the primary file to lines in the secondary file.
If you take some stack samples, you're probably going to find it somewhere in that inner loop most of the time.
You might consider this technique, where you preprocess your secondary file into a special-purpose procedure that you then compile and link in with your main program.
The time it takes will be the time to read the secondary file, and then on the order of a second or two to write the special-purpose procedure, and then to compile and link the whole thing.
Then the running of your main program should be I/O bound reading the primary file, since the inner loop will be much faster.
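As a hedged illustration of that idea, a tiny generator could read the secondary file and emit a C++ source file containing a lookup function, which you would then compile and link with the main program (file and function names here are made up):

#include <fstream>
#include <string>

int main() {
    std::ifstream in("secondary.txt");
    std::ofstream out("generated_lookup.cpp");

    out << "#include <string>\n"
           "#include <unordered_set>\n"
           "bool in_secondary(const std::string& line) {\n"
           "    static const std::unordered_set<std::string> lines = {\n";

    std::string line;
    while (std::getline(in, line)) {
        // NOTE: a real generator must escape backslashes and quotes in `line`.
        out << "        \"" << line << "\",\n";
    }

    out << "    };\n"
           "    return lines.count(line) != 0;\n"
           "}\n";
}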
I have a requirement to read a text file, but it is too large, so I decided to read only some lines from it. Can I use a seek method to jump to a given line? Then I could read only that line, because the file is so large that reading the whole thing wastes a lot of time. If that is not possible, can anyone give a better solution for this (seek to a given line and read it)? (I know binary files are read byte by byte.)
Example of my file:
event1 0
subevent 1
subevent 2
event2 3
(In my file, each event is followed by a number of lines; I want to be able to seek back to a previous event.)
Yes, you can seek to a point in the file then read from there. One possible problem is that if the lines are all different lengths, a random location in the file will have a higher probability of being in a longer line: you're not getting evenly distributed probabilities of different lines. If you really really must have identical probabilities then you need to make at least one pass over the file to find the start of each line - then you can store those offsets in a vector and randomly select a vector element to guide seeking to the line data in the file. If you only care a little bit, then you can perhaps advance a small but random number of lines past the one you initially seek to... that will even the odds a bit, avoids the initial pass, but isn't perfect. hansmaad's comment adds a neat approach too - perfect results with pretty-good performance - but requires that you have all the lines numbered in the file itself.
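A quick sketch of the seek-then-resync idea (placeholder file name, non-empty file assumed); note the bias toward longer lines discussed above:

#include <cstdlib>
#include <fstream>
#include <string>

std::string randomLine(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(0, std::ios::end);
    std::streamoff size = in.tellg();

    in.seekg(std::rand() % size);          // jump somewhere in the file
    std::string line;
    std::getline(in, line);                // discard the (probably partial) line we landed in
    if (!std::getline(in, line)) {         // hit EOF: wrap around to the first line
        in.clear();
        in.seekg(0);
        std::getline(in, line);
    }
    return line;
}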
Unless each line has exactly the same length, you're going to have to scan through it.
If you want to jump around in it, you can scan through it, saving the offset of each line in a container of your choice, and then use that to seek to a specific line.
Assuming that the lines are variable / random length, I don't believe there is any built-in way to jump directly to the start of a particular line. You can seek to an arbitrary byte position in the file. However, this might land anywhere in the beginning / middle / end of a line.
My best suggestion would be to attack the problem in two steps:
First, make a complete pass through the file, byte by byte, searching for the start of each line. Record the byte position of each line and store it into an array, vector, etc. (Basically, you are creating an index that maps from line number to starting position.) Then, when you have this index built up, you can easily jump to a particular line, by looking up the position in your index.
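A minimal sketch of that two-step approach (the file name is a placeholder):

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Pass 1: record the starting offset of every line.
std::vector<std::streampos> indexLines(std::ifstream& in) {
    std::vector<std::streampos> offsets;
    in.seekg(0);
    std::string line;
    while (true) {
        std::streampos pos = in.tellg();
        if (!std::getline(in, line)) break;
        offsets.push_back(pos);             // offsets[i] = start of line i
    }
    in.clear();                             // clear EOF so the stream can be reused
    return offsets;
}

// Pass 2 (repeatable): jump straight to line `n` using the index.
std::string readLine(std::ifstream& in, const std::vector<std::streampos>& offsets, std::size_t n) {
    in.seekg(offsets.at(n));
    std::string line;
    std::getline(in, line);
    return line;
}

int main() {
    std::ifstream in("big.txt");            // placeholder file name
    auto offsets = indexLines(in);
    if (offsets.size() > 42)
        std::cout << readLine(in, offsets, 42) << '\n';   // line 43 (0-based index 42)
}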
As far as I know, there is no built-in way to seek to a new line without already knowing where the lines are. I can't tell you the best way to achieve your goal, because most of your question details how you're trying to accomplish it, not what it is you're actually trying to accomplish. Therefore, I might go one of two ways with this:
1) If you actually need every last bit of data from the file (there is no metadata or other information that can be discarded):
Someone mentioned scanning through the file, tracking the lines as you go and building an index with it so you can read in one line at a time. This might work, and it would be the way to go if you actually need each line in its entirety, or if you only need the line number and plan on reading in small pieces at a time from there. However, without knowing details about your constraints or requirements, I would not recommend reading in entire lines using this method for one main reason: I have no way of knowing that one line will not itself be too large to load (what if there is only one line in the file?).
Instead, I would simply allocate a buffer of a size that is an appropriate amount to process at a time, and process the file in chunks of that size until you reach the end. You can stream more data in as you go. Without additional details, I can't tell you what that magic number should be, but the size of the largest chunk of information you might need to process is a good starting point as a minimum.
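As a sketch of that chunked approach, assuming the caller supplies the per-chunk processing as a callback and a placeholder chunk size:

#include <fstream>
#include <functional>
#include <vector>

void processInChunks(const char* path,
                     const std::function<void(const char*, std::size_t)>& processChunk,
                     std::size_t chunkSize = 1 << 20 /* 1 MiB, placeholder */)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(chunkSize);
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;
        processChunk(buf.data(), got);       // handle whatever landed in this chunk
    }
}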
2) If you don't need every last bit of data from the file (you can discard some of the information in it), then you only need some of it. If you only need select pieces of data, then they are easier to find if they are tagged (which is what XML is for). There are lots of free XML parsers, or you can write your own. Then you'd search for tags instead of arbitrary line numbers, and changes to the file that result in the data being in a different location won't affect your ability to find it if it's tagged, as it would if you're just going by line numbers.
I have two ~59 GB text files in ".fastq" format. fastq files are genomic read files from a sequencer. Every 4 lines is a new read, but the lines are of variable size.
The file size is roughly 59 GB, and there are about 211M reads, which means, give or take, approximately 211M * 4 = 844M lines. The program I'm using, Bowtie, currently supports the following options:
"--skip 105M --qupto 105M"
which essentially means "skip the first 105M reads and only process up to the next 105M reads." In this way you can break up processing of the file. The problem is, the way that it does the skipping is incredibly slow. It just reads the first 105M reads as it normally would, but doesn't process them. Then it starts comparisons once it gets to the read value it was given.
I am wondering if I can use something like C/C++'s fsetpos to set the position to the middle of the file [or wherever], which I realize will probably put me somewhere in the middle of a line, and then from there find the beginning of the first full read and start processing, rather than waiting for it to read approximately 422M lines until it gets where it needs to go. Does anybody have experience using fsetpos on such a large file, and know whether or not the performance is any better than how it's currently being done?
Thanks--
Nick
Yes, you can position to the middle of a file using C++.
For huge files, the performance is usually better than reading the data.
In general, the process for positioning within a file is:
A request is made to read the directory entry for the file.
The directory is searched to find the track and sector for the requested file position. (Note: some filesystems have directory extensions for large files, so more data may need to be read here.)
On the next read, the hard drive is told to go to the given track and sector, then reads in the data.
You save the time it would otherwise take for all of the preceding data to pass through the communications port into memory, only to be ignored.
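A rough sketch of that kind of positioning with C++ streams, using a placeholder file name; resynchronising exactly on a 4-line FASTQ record boundary needs more care (quality lines can also begin with '@'), so that part is only hinted at:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream in("reads.fastq", std::ios::binary);   // placeholder file name
    in.seekg(0, std::ios::end);
    std::streamoff size = in.tellg();

    in.seekg(size / 2);                 // jump to (roughly) the middle of the file
    std::string line;
    std::getline(in, line);             // discard the partial line we landed in

    // From here, scan forward to the start of the next full read, then process.
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '@') {
            std::cout << "resynchronised at header: " << line << '\n';
            break;                      // caution: quality lines can also start with '@'
        }
    }
}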