Byte offset notation for a 900 MB XML file - C++

I am building a search engine in C++ around a ~900 MB XML file that contains pages from wikiBooks. My objective is to parse the XML document with rapidXML so that the user can enter one word in the search bar and receive the actual XML documents that contain that word.
I need to figure out how to store the index of each token (i.e., each word within each document) so that when the user wants to see the pages where a certain word occurs, I can jump to that specific page.
I have been told to use a "file I/O offset" approach (where you store where in the file a word is so that you can jump back to it), and I am having a hard time understanding what to do.
Questions:
Do I use seekg and tellg from the istream library (to find the byte location at which each document page is stored)? If so, how?
How do I return the actual document (which contains many occurrences of the searched word) back to the user?
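For what it's worth, here is a minimal sketch of the "file I/O offset" idea with tellg/seekg, assuming documents are wrapped in <page> elements; the file name, the element name and the word index are illustrative placeholders, not the actual assignment. The idea: while scanning the file once, remember the byte offset of each page with tellg(), and later seekg() back to that offset to re-read and display the whole page.

#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    std::ifstream xml("wikibooks.xml", std::ios::binary);      // hypothetical file name
    std::vector<std::streampos> pageOffsets;                   // byte offset of every <page>
    std::map<std::string, std::vector<std::size_t>> index;     // word -> page ids (filled while tokenising)

    std::string line;
    while (true) {
        std::streampos pos = xml.tellg();                      // offset BEFORE reading the line
        if (!std::getline(xml, line)) break;
        if (line.find("<page>") != std::string::npos)
            pageOffsets.push_back(pos);
        // tokenise the line here and record, e.g.:
        // index[word].push_back(pageOffsets.size() - 1);
    }

    // Jump straight back to, say, the first page and print it until </page>.
    if (!pageOffsets.empty()) {
        xml.clear();                                           // clear the EOF flag before seeking
        xml.seekg(pageOffsets[0]);
        while (std::getline(xml, line)) {
            std::cout << line << '\n';
            if (line.find("</page>") != std::string::npos) break;
        }
    }
}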

Related

Best way to read this file to manipulate later?

I am given a config file that looks like this for example:
Start Simulator Configuration File
Version/Phase: 2.0
File Path: Test_2e.mdf
CPU Scheduling Code: SJF
Processor cycle time (msec): 10
Monitor display time (msec): 20
Hard drive cycle time (msec): 15
Printer cycle time (msec): 25
Keyboard cycle time (msec): 50
Mouse cycle time (msec): 10
Speaker cycle time (msec): 15
Log: Log to Both
Log File Path: logfile_1.lgf
End Simulator Configuration File
I am supposed to be able to take this file and output the cycles and cycle times to a log and/or the monitor. I am then supposed to pull data from a meta-data file that tells me how many cycles each of these runs (among other things), and then I'm supposed to calculate and log the total time. For example, 5 hard drive cycles would be 75 msec. The config and meta-data files can come in any order.
I am thinking I will put each item in an array and then cycle through, waiting for the strings to match (this will also help detect file errors). The config file should always be the same size despite a different order. The metadata file can be any size, so I figured I would do a similar thing but with a vector.
Then I will multiply the cycle times from the config file by the number of cycles in the matching metadata string. I think the best way to read the data out of the vector is through a queue.
Does this sound like a good idea?
I understand most of the concepts, but my data structures knowledge is shaky when it comes to actually coding it. For example, when reading from the files, should I read them line by line, or would it be best to separate the ints from the strings so I can calculate with them later? I've never had to do this from a file that can change before.
If I separate them, would I have to use separate arrays/vectors?
I'm using C++, by the way.
Your logic should be:
Create two std::map variables, one that maps a string to a string, and another that maps a string to a float.
Read each line of the file
If the line contains a :, split the string into two parts:
3a. Part A is the substring from index zero up to (but not including) the index of the :
3b. Part B is the substring starting one character past the index of the :
Store these two parts in the appropriate std::map, based on the value type.
Now you have read the file properly. When you read the meta-data file, you will simply take each key from it, use it to look up the corresponding key in your configuration-file data (to get the value), then do whatever mathematical operation is required.
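A minimal sketch of the above, assuming a hypothetical file name and a simple whitespace rule; numeric values land in the string-to-float map, everything else in the string-to-string map:

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main() {
    std::map<std::string, std::string> textSettings;
    std::map<std::string, float> numericSettings;

    std::ifstream config("config.conf");                 // hypothetical file name
    std::string line;
    while (std::getline(config, line)) {
        std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;        // skip the Start/End marker lines

        std::string key   = line.substr(0, colon);       // part A
        std::string value = line.substr(colon + 1);      // part B
        if (!value.empty() && value.front() == ' ')      // trim the space after the colon
            value.erase(0, 1);

        // Try to interpret the value as a number; fall back to text.
        std::istringstream num(value);
        float f;
        if (num >> f && (num >> std::ws).eof())
            numericSettings[key] = f;
        else
            textSettings[key] = value;
    }

    // e.g. 5 hard drive cycles * 15 msec = 75 msec, as in the example above
    std::cout << 5 * numericSettings["Hard drive cycle time (msec)"] << " msec\n";
}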

Efficiently read data from a structured file in C/C++

I have a file as follows:
The file consists of 2 parts: header and data.
The data part is separated into equally sized pages. Each page holds data for a specific metric. Multiple pages (which need not be consecutive) might be needed to hold the data for a single metric. Each page consists of a page header and a page body. The page header has a field called "Next page", which is the index of the next page that holds data for the same metric. The page body holds the real data. All pages have the same fixed size (20 bytes for the header and 800 bytes for the body; if the amount of data is less than 800 bytes, the remainder is zero-filled).
The header part consists of 20,000 elements; each element has information about a specific metric (point 1 -> point 20000). An element has a field called "first page", which is the index of the first page holding data for that metric.
The file can be up to 10 GB.
Requirement: re-order the data in the file in the shortest time possible, i.e., the pages holding data for a single metric must be consecutive, ordered from metric 1 to metric 20000 alphabetically (the header part must be updated accordingly).
An obvious approach: for each metric, read all of its data (page by page) and write it to a new file. But this takes a lot of time, especially the reading.
Are there any more efficient ways?
One possible solution is to create an index from the file, containing the page number and the page's metric that you need to sort on. Create this index as an array, so that the first entry (index 0) corresponds to the first page, the second entry (index 1) to the second page, etc.
Then you sort the index using the metric specified.
When sorted, you end up with a new array which contains a new first, second etc. entries, and you read the input file writing to the output file in the order of the sorted index.
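A rough sketch of that index-and-sort approach, assuming the 20-byte-header/800-byte-body page layout from the question; where the metric id sits in the page header, the file names, and the size of the file header are placeholders you would replace with the real layout:

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <vector>

struct IndexEntry {
    std::uint64_t pageNumber;   // position of the page in the input file
    std::uint32_t metricId;     // metric this page belongs to
};

int main() {
    const std::size_t kPageSize       = 20 + 800;     // from the question
    const std::size_t kFileHeaderSize = 20000 * 24;   // hypothetical header element size

    std::ifstream in("metrics.dat", std::ios::binary);
    std::ofstream out("sorted.dat", std::ios::binary);

    // 1. Build the index: one small entry per page, in file order.
    std::vector<IndexEntry> index;
    char page[kPageSize];
    in.seekg(kFileHeaderSize);
    for (std::uint64_t n = 0; in.read(page, kPageSize); ++n) {
        std::uint32_t metric;
        std::memcpy(&metric, page, sizeof metric);     // assumed position of the metric id
        index.push_back({n, metric});
    }

    // 2. Sort the small index, not the 10 GB of data.
    std::sort(index.begin(), index.end(),
              [](const IndexEntry& a, const IndexEntry& b) {
                  return a.metricId < b.metricId;
              });

    // 3. Copy pages to the output in sorted order (updating the file header is omitted here).
    in.clear();
    for (const IndexEntry& e : index) {
        in.seekg(kFileHeaderSize + e.pageNumber * kPageSize);
        in.read(page, kPageSize);
        out.write(page, kPageSize);
    }
}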
An obvious approach: for each metric, read all of its data (page by page) and write it to a new file. But this takes a lot of time, especially the reading.
Are there any more efficient ways?
Yes. After you get a working solution, measure its efficiency, then decide which parts you wish to optimize. What and how you optimize will depend greatly on the results you get here (i.e., where your bottlenecks are).
A few generic things to consider:
if you have one set of steps that read data for a single metric and move it to the output, you should be able to parallelize that (have 20 sets of steps instead of one).
a 10 GB file will take a while to process regardless of what hardware you run your code on (conceivably you could run it on a supercomputer, but I am ignoring that case). You/your client may accept a slower solution if it displays its progress / shows a progress bar.
do not use string comparisons for sorting;
Edit (addressing comment)
Consider performing the read as follows:
create a list of block offsets for the blocks you want to read
create a list of worker threads, of fixed size (for example, 10 workers)
each idle worker will receive the file name and a block offset, then create a std::ifstream instance on the file, read the block, and return it to a receiving object (and then, request another block number, if any are left).
read pages should be passed to a central structure that manages/stores pages.
Also consider managing the memory for the blocks separately (for example, allocate chunks of multiple blocks preemptively, when you know the number of blocks to be read).
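A bare-bones sketch of that worker-pool idea, assuming a fixed pool of 10 threads, each with its own std::ifstream, pulling block offsets from a shared counter and handing the read blocks to a central store; the file name, block size and the store itself are placeholders:

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    const std::size_t kBlockSize = 20 + 800;
    std::vector<std::uint64_t> blockOffsets;   // fill with the offsets of the blocks to read

    std::atomic<std::size_t> next{0};
    std::mutex storeMutex;
    std::vector<std::vector<char>> store(blockOffsets.size());   // stand-in for the central structure

    auto worker = [&]() {
        std::ifstream in("metrics.dat", std::ios::binary);       // one stream per thread
        for (;;) {
            std::size_t i = next.fetch_add(1);                   // grab the next block number
            if (i >= blockOffsets.size()) return;                // nothing left to do
            std::vector<char> block(kBlockSize);
            in.seekg(static_cast<std::streamoff>(blockOffsets[i]));
            in.read(block.data(), kBlockSize);
            std::lock_guard<std::mutex> lock(storeMutex);        // hand off to the central store
            store[i] = std::move(block);
        }
    };

    std::vector<std::thread> workers;
    for (int w = 0; w < 10; ++w) workers.emplace_back(worker);
    for (auto& t : workers) t.join();
}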
I first read the header part, then sort the metrics in alphabetical order. For each metric in the sorted list I read all of its data from the input file and write it to the output file. To remove the bottleneck in the reading step, I used memory mapping. The results showed that with memory mapping, the execution time for a 5 GB input file was reduced 5-6 times compared with not using memory mapping. This temporarily solves my problems. However, I will also consider the suggestions of #utnapistim.

storing urls to a file so they can be reachable quickly

I have a file and plenty of URLs; the URLs are all written to the file with the same structure, plus a URL checksum of type int. stackoverflow.com is written as:
12534214214 http://stackoverflow.com
Now, every time I want to put a URL into the file, I need to check that the URL doesn't already exist; only then can I add it.
But this takes too much time to do with 1,000,000 URLs:
// list of URLs
std::list<std::string> urls;
std::size_t hashUrl(std::string argUrl); // this function hashes the URL and returns an integer
std::fstream file;
file.open("anchors");
// search for the checksum 12534214214; if it isn't found, write "12534214214 http://stackoverflow.com"
file.close();
Question 1: how can I search in the file using the checksum so that the search takes only a few ms?
Question 2: is there another way of storing these URLs so that they can be reached quickly?
Thanks, and sorry for my bad English.
There is (likely [1]) no way you can search a million URLs in a plain text file in "a few milliseconds". You need to either load the entire file into memory (and when you do, you may just as well load it into some reasonable data structure, for example a std::map or std::unordered_map), or use some sort of indexing for the file - e.g. have a smaller file with just the checksum and the position in the file at which each URL is stored.
The problem with a plain text file is that there is no way to know where anything is. One line can be 10 bytes, another 10,000 bytes. This means that you literally have to read every byte up to the point you are interested in.
Of course, the other option is to use a database library such as SQLite (or a proper database server, such as MySQL) that allows the data to be stored and retrieved based on a "query". This hides all the index generation and other such problems, is already optimised when it comes to search algorithms, and has clever caching and optimised code for reading/writing data to disk, etc.
[1] If all the URLs are short, it's perhaps possible that the file is small enough to cache well, and the code can be written to scan linearly through the entire file fast enough. But a file with, say, an average of 50 bytes per URL will be 50 MB. If each byte takes 10 clock cycles to process, that's 500 million cycles, which already puts us at roughly 130 ms on a typical CPU, even if the file is directly available in memory.
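To illustrate the first option (load everything into memory once), here is a sketch that keeps the checksums from "anchors" in a std::unordered_set so that each lookup is O(1) instead of a full file scan; hashUrl() below is only a std::hash placeholder standing in for your own checksum function:

#include <fstream>
#include <functional>
#include <string>
#include <unordered_set>

std::size_t hashUrl(const std::string& url) {
    return std::hash<std::string>{}(url);           // placeholder for your real checksum
}

bool addUrlIfMissing(const std::string& url) {
    // Load the existing checksums exactly once (on the first call).
    static std::unordered_set<std::size_t> known = [] {
        std::unordered_set<std::size_t> s;
        std::ifstream in("anchors");
        std::size_t checksum;
        std::string storedUrl;
        while (in >> checksum >> storedUrl) s.insert(checksum);
        return s;
    }();

    std::size_t checksum = hashUrl(url);
    if (known.count(checksum)) return false;        // already stored, nothing to do

    std::ofstream out("anchors", std::ios::app);    // append the new entry
    out << checksum << ' ' << url << '\n';
    known.insert(checksum);
    return true;
}

int main() {
    addUrlIfMissing("http://stackoverflow.com");
}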

finding a keyframe in mdat

The quicktime documentation recommends the following approach to finding a keyframe:
Finding a Key Frame
Finding a key frame for a specified time in a movie is slightly more complicated than finding a sample for a specified time. The media handler must use the sync sample atom and the time-to-sample atom together in order to find a key frame.
The media handler performs the following steps:
Examines the time-to-sample atom to determine the sample number that contains the data for the specified time.
Scans the sync sample atom to find the key frame that precedes the sample number chosen in step 1.
Scans the sample-to-chunk atom to discover which chunk contains the key frame.
Extracts the offset to the chunk from the chunk offset atom.
Finds the offset within the chunk and the sample’s size by using the sample size atom.
source: https://developer.apple.com/library/mac/documentation/QuickTime/qtff/QTFFChap2/qtff2.html
This is quite confusing, since multiple tracks ("trak" atoms) will yield different offsets. For example, the keyframe-sample-chunk-offset value for the video trak will be one value, and the one for the audio trak will be another.
How does one translate the instructions above into a location in the file (or mdat atom)?
That's not restricted to key frames. You can't in general guarantee that samples for different tracks are close to each other in the file. You hope that audio and video will be interleaved so you can play back a movie without excessive seeking but that's up to the software that created the file. Each track has its own sample table and chunk atoms that tell you where the samples are in the file and they could be anywhere. (They could even be in a different file, though reference movies are deprecated nowadays so you can probably ignore them.)
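To make steps 2-5 of the quoted documentation concrete for a single track (which, as noted, has its own sample table), here is a sketch that assumes you have already parsed that track's sync sample ('stss'), sample-to-chunk ('stsc'), chunk offset ('stco'/'co64') and sample size ('stsz') tables into plain vectors; it returns the absolute file offset of the key frame at or before a given sample number:

#include <cstdint>
#include <vector>

struct StscEntry {                 // one sample-to-chunk ('stsc') table entry
    std::uint32_t firstChunk;
    std::uint32_t samplesPerChunk;
};

// Sample and chunk numbers are 1-based, as in the QuickTime spec.
// Assumes the tables are non-empty and sorted, and that sample 1 is a key frame.
std::uint64_t keyFrameFileOffset(
    std::uint32_t targetSample,                     // from the time-to-sample atom (step 1)
    const std::vector<std::uint32_t>& syncSamples,  // 'stss': sample numbers of key frames
    const std::vector<StscEntry>& sampleToChunk,    // 'stsc'
    const std::vector<std::uint64_t>& chunkOffsets, // 'stco'/'co64': file offset of each chunk
    const std::vector<std::uint32_t>& sampleSizes)  // 'stsz': size of each sample
{
    // Step 2: the key frame at or before the target sample.
    std::uint32_t keySample = syncSamples.front();
    for (std::uint32_t s : syncSamples)
        if (s <= targetSample) keySample = s; else break;

    // Step 3: which chunk holds that sample, and which sample starts that chunk.
    std::uint32_t chunk = 1, firstSampleOfChunk = 1;
    bool located = false;
    for (std::size_t i = 0; i < sampleToChunk.size() && !located; ++i) {
        std::uint32_t runEnd = (i + 1 < sampleToChunk.size())
                                   ? sampleToChunk[i + 1].firstChunk
                                   : static_cast<std::uint32_t>(chunkOffsets.size()) + 1;
        for (std::uint32_t c = sampleToChunk[i].firstChunk; c < runEnd; ++c) {
            if (firstSampleOfChunk + sampleToChunk[i].samplesPerChunk > keySample) {
                chunk = c;
                located = true;
                break;
            }
            firstSampleOfChunk += sampleToChunk[i].samplesPerChunk;
        }
    }

    // Steps 4-5: chunk offset plus the sizes of the samples preceding ours in that chunk.
    std::uint64_t offset = chunkOffsets[chunk - 1];
    for (std::uint32_t s = firstSampleOfChunk; s < keySample; ++s)
        offset += sampleSizes[s - 1];
    return offset;
}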

distinguish pages in a pdf stream via regular expression

I need to extract some information from a PDF stream.
It's quite simple to extract the relevant text, since it is something like:
BT /Fo0 7.20 Tf 67.81 569.38 Td 0.000 Tc (TOTAL AMOUNT) Tj ET
I can consider the y position fixed, while the x position is variable due to justification.
But my problem is recognizing the beginning and the end of a page.
You shouldn't be sure that all the PDFs you encounter with your 'information extractor' are behaving so nicely. Or can you be, because you know they are?
Otherwise, it can very well happen that the PDF code which you encounter looks like:
BT
/Fo0 7.20 Tf
67.81 569.38 Td
0.000 Tc
(TO)12(T)13(AL A)11(M)14(OUNT) TJ
ET
That is, ...
...using TJ instead of Tj, to allow individual glyph positioning,
...having more line breaks,
...and maybe many more modifications.
In order to reliably get to the page's text content, you have to parse the structure of the PDF, in short:
find all objects of /Type /Page;
go to each of these page objects and retrieve the info about what its respective /Contents is;
the /Contents may point to a single stream, or
the /Contents may point to an array of streams;
go to this content object and extract its stream(s).
In practical terms, the first of the above steps can turn out a bit more complicated:
find and go to the trailer <<...>> section
in the trailer locate the info about the document's /Root object
go to the root object
extract the info about the /Pages from the /Root object
go to the /Pages object (which is an intermediate page tree node with kids and a parent);
find all descendants of this page tree node by inspecting the /Kids array;
go to each respective object listed by /Kids;
it could be of /Type /Pages (in which case it is another page tree node, not a tree leaf, and you have to follow down the tree further on);
it could be of /Type /Page (in which case you arrived at a page tree leaf, which means you really arrived at a page).
At this point I should note that the first page you find following this journey is page 1. The next is page 2, etc. Note that no page has any metadata saying "I'm page number N" -- it all depends on the order in which you parse the page tree, starting from the root object.
Now that you really found content streams, you are facing two more problems:
The content streams you are looking for may not be in clear text at all (as in your example). Content streams are very frequently compressed by one of the allowed compression schemes, and you'll have to expand them before you can parse for text content.
To see if a stream is compressed, watch out for the respective *Decode keyword (very frequently appearing as /Filter /FlateDecode).
Once you have successfully uncompressed the page's content stream, you may encounter totally unintuitive character codes describing your text. It may not at all be the same type of well-behaved ASCII as you imagine and showed in your example code.
You'll have to look up fonts (even multi-byte fonts like CID), their encodings, CMaps and what-not.
Unless, as I questioned in my initial sentence, you know that's not happening in your specific use case...
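As a rough illustration of the decompression step mentioned above: the body of a /Filter /FlateDecode stream is ordinary zlib data, so once you have extracted the raw stream bytes (and assuming no /DecodeParms predictor is involved), zlib's inflate() can expand it. A sketch:

#include <string>
#include <vector>
#include <zlib.h>

// Expand a /FlateDecode stream body; returns an empty string on error.
std::string flateDecode(const std::vector<unsigned char>& compressed) {
    z_stream zs{};
    if (inflateInit(&zs) != Z_OK) return {};

    zs.next_in  = const_cast<unsigned char*>(compressed.data());
    zs.avail_in = static_cast<uInt>(compressed.size());

    std::string out;
    unsigned char buffer[16384];
    int ret;
    do {
        zs.next_out  = buffer;
        zs.avail_out = sizeof buffer;
        ret = inflate(&zs, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) break;   // corrupt or truncated data
        out.append(reinterpret_cast<char*>(buffer), sizeof buffer - zs.avail_out);
    } while (ret != Z_STREAM_END);

    inflateEnd(&zs);
    return ret == Z_STREAM_END ? out : std::string{};
}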