Distinguish pages in a PDF stream via regular expression (regex)

I need to extract some information from a PDF stream.
It's quite simple to extract the relevant text, since it is something like:
BT /Fo0 7.20 Tf 67.81 569.38 Td 0.000 Tc (TOTAL AMOUNT) Tj ET
I can consider the y position fixed, while the x position varies due to justification.
But my problem is recognizing where a page begins and where it ends.

You shouldn't assume that all the PDFs your 'information extractor' encounters behave so nicely. Or can you, because you know they do?
Otherwise, it can very well happen that the PDF code which you encounter looks like:
BT
/Fo0 7.20 Tf
67.81 569.38 Td
0.000 Tc
(TO)12(T)13(AL A)11(M)14(OUNT) TJ
ET
That is, ...
...using TJ instead of Tj, to allow individual glyph positioning,
...having more line breaks,
...and maybe many more modifications.
In order to reliably get to the page's text content, you have to parse the structure of the PDF, in short:
find all objects of /Type /Page;
go to each of these page objects and retrieve the info about which its respective /Contents is;
the /Contents may point to a single stream, or
the /Contents may point to an array of streams;
go to this content object and extract its stream(s).
In practical terms, the first of the above steps can turn out to be a bit more complicated:
find and go to the trailer <<...>> section
in the trailer locate the info about the document's /Root object
go to the root object
extract the info about the /Pages from the /Root object
go to the /Pages object (which is an intermediate page tree node with kids and a parent);
find all descendants of this page tree node by inspecting the /Kids object;
go to each respective object listed by /Kids;
it could be of /Type /Pages (in which case it is another page tree node, not a tree leaf, and you have to follow down the tree further on);
it could be of /Type /Page (in which case you arrived at a page tree leaf, which means you really arrived at a page).
At this point I should note that the first page you found following this journey is page 1. The next is page 2, etc. Note that no page has any metadata saying "I'm page number N" -- it all depends on the order in which you parse the page tree starting from the root object.
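To make the tree walk concrete, here is a minimal C++ sketch. It assumes a hypothetical low-level object parser that already resolves indirect references and hands back parsed dictionaries -- the PdfDict type and the resolve()/dict_get_name()/dict_get_refs() helpers are made up for this illustration; the recursion over /Kids is the point.

#include <string>
#include <vector>

// Hypothetical minimal object model -- a real PDF object parser would provide this.
struct PdfDict;                                                       // a parsed << ... >> dictionary
PdfDict* resolve(int object_number);                                  // follow an indirect reference "n 0 R"
std::string dict_get_name(PdfDict* d, const std::string& key);        // e.g. "/Type" -> "/Pages"
std::vector<int> dict_get_refs(PdfDict* d, const std::string& key);   // e.g. "/Kids" -> referenced object numbers

// Walk the page tree depth-first; the order in which the leaves are found *is* the page order.
void collect_pages(PdfDict* node, std::vector<PdfDict*>& pages) {
    const std::string type = dict_get_name(node, "/Type");
    if (type == "/Page") {                       // leaf: an actual page
        pages.push_back(node);
    } else if (type == "/Pages") {               // intermediate node: recurse into /Kids
        for (int kid : dict_get_refs(node, "/Kids"))
            collect_pages(resolve(kid), pages);
    }
}

// Usage: trailer -> /Root -> /Pages gives the root node, then
//   std::vector<PdfDict*> pages;
//   collect_pages(root_pages_node, pages);      // pages[0] is page 1, pages[1] is page 2, ...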
Now that you really found content streams, you are facing two more problems:
The content streams you are looking for may not be in clear text at all (like your code showed). Content streams are very frequently compressed by one of the allowed compression schemes, and you'll have to expand them before you can parse for text content.
To see if a stream is compressed, watch out for the respective *Decode keyword (very frequently appearing as /Filter /FlateDecode).
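For the common /Filter /FlateDecode case, the bytes between the stream and endstream keywords are zlib-compressed. Here is a sketch using zlib's uncompress(); the initial buffer size is just a guess, so the buffer is grown on Z_BUF_ERROR -- a streaming inflate() loop is the more robust approach.

#include <zlib.h>
#include <vector>

// Expand a /FlateDecode content stream. `raw` holds everything between
// "stream" and "endstream". The uncompressed size is not known in advance,
// so the output buffer is grown until uncompress() stops reporting Z_BUF_ERROR.
std::vector<unsigned char> flate_decode(const std::vector<unsigned char>& raw) {
    std::vector<unsigned char> out(raw.size() * 4 + 1024);   // initial guess
    for (;;) {
        uLongf out_len = out.size();
        int rc = uncompress(out.data(), &out_len, raw.data(), raw.size());
        if (rc == Z_OK)        { out.resize(out_len); return out; }
        if (rc == Z_BUF_ERROR) { out.resize(out.size() * 2); continue; }  // buffer too small, retry
        return {};                                             // corrupt data or a different filter
    }
}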
Once you have successfully uncompressed the page's content stream, you may encounter totally unintuitive character codes describing your text. It may not at all be the same kind of well-behaved ASCII that you imagined and showed in your example code.
You'll have to look up fonts (even multi-byte fonts like CID), their encodings, CMaps and what-not.
Unless, as I questioned in my initial sentence, you know that's not happening in your specific use case...

Related

Byte offset notation for a 900 mb XML file

I am building a search engine in C++ (over a ~900 MB XML file that contains pages from WikiBooks), and my objective is to parse the ~900 MB XML document using rapidXML so that the user can just enter one word in the search bar and receive the ACTUAL XML DOCUMENTS that contain that word (link).
I need to figure out how to store the index of each token (i.e., each word within each document) so that when the user wants to see the pages on which a certain word occurs, I can jump to that specific page.
I have been told to do the "file io offset" (where you store where in the file a word is so that you can jump to it) and I am having a hard time understanding what to do.
Questions:
Do I use "seekg" and "tellg" from the istream library (to find the byte location at which each document PAGE is stored)? And if so, how?
How do I return the actual document (which contains many occurrences of the searched word) back to the user?
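A minimal sketch of the "file io offset" idea with tellg/seekg (the file name, the <page> marker and the naive whitespace tokenisation are placeholder assumptions, not rapidXML specifics): record the byte offset of the document you are currently inside while scanning the file once, keep a word -> offsets map, and later seekg back to a stored offset to stream that document to the user.

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::ifstream xml("wikibooks.xml", std::ios::binary);          // placeholder file name

    std::map<std::string, std::vector<std::streampos>> index;      // word -> offsets of containing <page>s
    std::streampos page_start = 0;
    std::string line;

    for (;;) {
        std::streampos here = xml.tellg();                         // offset of the line about to be read
        if (!std::getline(xml, line)) break;
        if (line.find("<page>") != std::string::npos)
            page_start = here;                                     // the current document starts here
        std::istringstream words(line);                            // naive whitespace tokenisation
        for (std::string w; words >> w; )
            index[w].push_back(page_start);
    }

    // Later, to return a document containing "foo":
    auto hit = index.find("foo");
    if (hit != index.end()) {
        xml.clear();                                               // clear the EOF flag before seeking
        xml.seekg(hit->second.front());                            // jump straight to the stored offset
        while (std::getline(xml, line)) {
            std::cout << line << '\n';                             // stream the document back to the user
            if (line.find("</page>") != std::string::npos) break;
        }
    }
}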

Convert text to elements in XML using XSLT

I'm currently migrating XML from one CMS to another and need to convert some text to elements. Because of how the system works, some editors can only enter escaped text. The challenge is to replace some of these escaped elements and convert them into valid XML elements.
Source file:
<p>Press the <button-name>Select key </button-name>to show more information.</p>
<p>Press the <button-name>Back key</button-name> to save the
values.</p>
<p>When the storage is completed, the <product-name/> machine
displays:</p>
<p><attention>
<display-text translate="no">STORAGE COMPLETED
Press BACK to exit</display-text>
</attention></p>
What I want to do
Replace <button-name> with <gui>
Replace <product-name/> with <kt.in name="custom-name"/>
Keeping other escaped elements.
XML I want
<p>Press the <gui>Select key</gui>to
show more information.</p>
<p>Press the <gui>Back key</gui>
to save the calibrations values.</p>
<p>When the storage is completed, the <kt.in name="custom-name"/> machine
displays:</p>
<p><attention> <display-text translate="no">STORAGE COMPLETED
Press BACK to exit</display-text>
</attention></p>
I tried using a string-based search-and-replace, but as I want a proper XML element as output, this wouldn't do it.
This is probably only going to work with string-based search-and-replace, depending on the number of text "tags" you want to switch to XML. The bigger problem I see is actually keeping it all in a proper XML element.
I don't think you can do this without writing a small tool that reads the strings between the elements of the text, e.g.
<button-name>
and copies them into the right variables of an object, which you then write back as an XML-conformant element.
It doesn't really matter which language you prefer, since there should be plenty of object-XML parsers available.
For just changing the tags, you could also switch the encoding of the text, so that
&lt; would turn into -> <
and then filter any content in between the <> to exchange the ones you want, e.g. button-name to gui.
Hope I could give you an idea.
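As a rough illustration of that idea -- assuming the escaped pseudo-tags really appear as entities such as &lt;button-name&gt; inside the text -- here is a string-level renaming sketch in C++ (the tag names come from the question; everything else is illustrative):

#include <string>

// Replace every occurrence of `from` with `to` inside `text`.
void replace_all(std::string& text, const std::string& from, const std::string& to) {
    for (std::string::size_type pos = 0;
         (pos = text.find(from, pos)) != std::string::npos;
         pos += to.size())
        text.replace(pos, from.size(), to);
}

// Rename only the escaped pseudo-tags we care about; other escaped content stays untouched.
void rename_escaped_tags(std::string& text) {
    replace_all(text, "&lt;button-name&gt;",   "<gui>");
    replace_all(text, "&lt;/button-name&gt;",  "</gui>");
    replace_all(text, "&lt;product-name/&gt;", "<kt.in name=\"custom-name\"/>");
}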

yahoo pipes trimming all item titles

After a lot of hard work, I have created two Yahoo Pipes I will be using.
One of them has a minor problem however... I am trimming the title length down to leave enough room for a ... and a link to fit within a tweet.
It trims the first post correctly... however it trims all of the posts after that to 0 length (before adding a bit of extra text to the end).
The problem is that I'm not using a loop for all items after a certain point, but the reason for that is that the output of a loop is always items, and I need the output to be a number at a certain point so that I can feed that number in as a variable to trim the length by. The pipe can be found here: http://pipes.yahoo.com/pipes/pipe.info?_id=3e6c3c6b2d23d8ce0cf66cb3efc5fb56
Typically, I am inserting any RSS feed in the top box, something like "new blog post:" in the middle and "#business #hashtags" in the last box.
If you can see any way I can have this Yahoo pipe work for all posts rather than just the top one, please let me know. It's not a big deal, as for the moment I'm only ever posting the top post to Twitter... however, there may come a point where I need all of them looking the same.

Create and use HTML full text search index (C++)

I need to create a search index for a collection of HTML pages.
I have no experience in implementing a search index at all, so any general information on how to build one, what information to store, how to implement advanced searches such as "entire phrase", ranking of results, etc. would be appreciated.
I'm not afraid to build it myself, though I'd be happy to reuse an existing component (or use one to get started with a prototype). I am looking for a solution accessible from C++, preferably without requiring additional installations at runtime. The content is static (so it makes sense to aggregate search information), but a search might have to accumulate results from multiple such repositories.
I can make a few educated guesses, though: create a map word ==> pages for all (relevant) words; a rank can be assigned to the mapping by prominence (h1 > h2 > ... > <p>) and proximity to the top. Advanced searches could be built on top of that: searching for the phrase "homo sapiens" could list all pages that contain "homo" and "sapiens", then scan all pages returned for locations where they occur together. However, there are a lot of problematic scenarios and unanswered questions, so I am looking for references to what should be a huge amount of existing work that somehow escapes my google-fu.
[edit for bounty]
The best resource I found until now is this and the links from there.
I do have an implementation roadmap for an experimental system; however, I am still looking for:
Reference material regarding index creation and individual steps
available implementations of individual steps
reusable implementations (with above environment restrictions)
This process is generally known as information retrieval. You'll probably find this online book helpful.
Existing libraries
Here are two existing solutions that can be fully integrated into an application without requiring a separate process (I believe both will compile with VC++).
Xapian is mature and may do much of what you need, from indexing to ranked retrieval. Separate HTML parsing would be required because, AFAIK, it does not parse html (it has a companion program Omega, which is a front end for indexing web sites).
Lucene is an indexing/search Apache library in Java, with an official pre-release C port, Lucy, and an unofficial C++ port, CLucene.
Implementing information retrieval
If the above options are not viable for some reason, here's some info on the individual steps of building and using an index. Custom solutions can go from simple to very sophisticated, depending on what you need for your application. I've broken the process into 5 steps:
HTML processing
Text processing
Indexing
Retrieval
Ranking
HTML Processing
There are two approaches here
Stripping The page you referred to discusses a technique generally known as stripping, which involves removing all the HTML elements that won't be displayed and translating others to their display form. Personally, I'd preprocess using Perl and index the resulting text files. But for an integrated solution, particularly one where you want to record significance tags (e.g. <h1>, <h2>), you probably want to roll your own. Here is a partial implementation of a C++ stripping routine (it appears in Thinking in C++; the final version of the book is here) that you could build from.
Parsing A level up in complexity from stripping is html parsing, which would help in your case for recording significance tags. However, a good C++ HTML parser is hard to find. Some options might be htmlcxx (never used it, but active and looks promising) or hubbub (C library, part of NetSurf, but claims to be portable).
If you are dealing with XHTML or are willing to use an HTML-to-XML converter, you can use one of the many available XML parsers. But again, HTML-to-XML converters are hard to find; the only one I know of is HTML Tidy. In addition to conversion to XHTML, its primary purpose is to fix missing/broken tags, and it has an API that could possibly be used to integrate it into an application. Given XHTML documents, there are many good XML parsers, e.g. Xerces-C++ and tinyXML.
Text Processing
For English at least, processing text into words is pretty straightforward. There are a couple of complications when search is involved, though.
Stop words are words known a priori not to provide a useful distinction between documents in the set, such as articles and prepositions. Often these words are not indexed and are filtered out of query streams. There are many stop word lists available on the web, such as this one.
Stemming involves preprocessing documents and queries to identify the root of each word to better generalize a search. E.g. searching for "foobarred" should yield "foobarred", "foobarring", and "foobar". The index can be built and searched on roots alone. The two general approaches to stemming are dictionary based (lookups from word ==> root) and algorithm based. The Porter algorithm is very common and several implementations are available, e.g. C++ here or C here. The Snowball C library supports stemming in several languages.
Soundex encoding One method to make search more robust to spelling errors is to encode words with a phonetic encoding. Then when queries have phonetic errors, they will still map directly to indexed words. There are a lot of implementations around, here's one.
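To make the text-processing stage concrete, here is a small C++ tokeniser sketch with lowercasing and stop-word removal (the stop-word list is a tiny placeholder, and a Porter/Snowball stemmer call would slot in where the comment indicates):

#include <cctype>
#include <set>
#include <string>
#include <vector>

// Turn raw text into lowercase tokens, dropping stop words.
std::vector<std::string> tokenize(const std::string& text) {
    static const std::set<std::string> stop_words = {"a", "an", "and", "of", "the", "to"};
    std::vector<std::string> tokens;
    std::string word;
    auto flush = [&]() {
        if (!word.empty() && !stop_words.count(word))
            tokens.push_back(word);                  // a stemmer (e.g. Porter) would be applied here
        word.clear();
    };
    for (char c : text) {
        if (std::isalnum(static_cast<unsigned char>(c)))
            word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        else
            flush();
    }
    flush();
    return tokens;
}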
Indexing
The map word ==> page data structure is known as an inverted index. It's inverted because it's often generated from a forward index of page ==> words. Inverted indexes generally come in two flavors: the inverted file index, which maps words to each document they occur in, and the full inverted index, which maps words to each position in each document they occur in.
The important decision is what backend to use for the index. Some possibilities are, in order of ease of implementation:
SQLite or Berkeley DB - both of these are database engines with C/C++ APIs that integrate into a project without requiring a separate server process. Persistent databases are essentially files, so multiple index sets can be searched by just changing the associated file. Using a DBMS as a backend simplifies index creation, updating and searching.
In-memory data structure - if you're using an inverted file index that is not prohibitively large (in memory consumption and time to load), this could be implemented as a std::map<std::string,word_data_class>, using boost::serialization for persistence.
On-disk data structure - I've heard of blazingly fast results using memory-mapped files for this sort of thing; YMMV. Having an inverted file index would involve having two index files, one representing words with something like struct {char word[n]; unsigned int offset; unsigned int count; };, and the second representing (word, document) tuples with just unsigned ints (words implicit in the file offset). The offset is the file offset of the first document id for the word in the second file, and count is the number of document ids associated with that word (the number of ids to read from the second file). Searching would then reduce to a binary search through the first file with a pointer into a memory-mapped file. The downside is the need to pad/truncate words to get a constant record size.
The procedure for indexing depends on which backend you use. The classic algorithm for generating an inverted file index (detailed here) begins with reading through each document and extending a list of (page id, word) tuples, ignoring duplicate words within each document. After all documents are processed, sort the list by word, then collapse it into (word, (page id1, page id2, ...)).
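A sketch of that classic algorithm with standard containers (word_data_class from the in-memory option above is reduced to a plain vector of page ids here):

#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Build an inverted file index (word -> sorted page ids) by the tuple-sort-collapse
// algorithm: documents[i] holds the tokens of page id i after text processing.
std::map<std::string, std::vector<unsigned>> build_inverted_index(
        const std::vector<std::vector<std::string>>& documents) {
    std::vector<std::pair<std::string, unsigned>> tuples;    // (word, page id)
    for (unsigned page = 0; page < documents.size(); ++page) {
        std::set<std::string> seen;                           // ignore duplicate words per page
        for (const std::string& w : documents[page])
            if (seen.insert(w).second)
                tuples.emplace_back(w, page);
    }
    std::sort(tuples.begin(), tuples.end());                  // sort by word, then page id

    std::map<std::string, std::vector<unsigned>> index;       // collapse into postings lists
    for (const auto& t : tuples)
        index[t.first].push_back(t.second);                   // page ids stay sorted per word
    return index;
}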
The mifluz gnu library implements inverted indexes w/ storage, but without document or query parsing. GPL, so may not be a viable option, but will give you an idea of the complexities involved for an inverted index that supports a large number of documents.
Retrieval
A very common method is boolean retrieval, which is simply the union/intersection of documents indexed for each of the query words that are joined with or/and, respectively. These operations are efficient if the document ids are stored in sorted order for each term, so that algorithms like std::set_union or std::set_intersection can be applied directly.
There are variations on retrieval; Wikipedia has an overview, but standard boolean is good for many/most applications.
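For example, an AND-query over two postings lists from the index above reduces to one call (an OR-query has the same shape with std::set_union):

#include <algorithm>
#include <iterator>
#include <vector>

// Intersect two postings lists; both hold page ids in ascending order,
// so the AND-query is a single linear pass.
std::vector<unsigned> boolean_and(const std::vector<unsigned>& a,
                                  const std::vector<unsigned>& b) {
    std::vector<unsigned> result;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(result));
    return result;    // phrase search would then re-scan only these pages
}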
Ranking
There are many methods for ranking the documents returned by boolean retrieval. Common methods are based on the bag of words model, which just means that the relative position of words is ignored. The general approach is to score each retrieved document relative to the query, and rank documents based on their calculated score. There are many scoring methods, but a good starting place is the term frequency-inverse document frequency formula.
The idea behind this formula is that if a query word occurs frequently in a document, that document should score higher, but a word that occurs in many documents is less informative so this word should be down weighted. The formula is, over query terms i=1..N and document j
score[j] = sum_over_i(word_freq[i,j] * inv_doc_freq[i])
where the word_freq[i,j] is the number of occurrences of word i in document j, and
inv_doc_freq[i] = log(M/doc_freq[i])
where M is the number of documents and doc_freq[i] is the number of documents containing word i. Notice that words that occur in all documents will not contribute to the score. A more complex scoring model that is widely used is BM25, which is included in both Lucene and Xapian.
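For concreteness, a direct transcription of that formula into C++ (the count maps are assumptions about how you might store the statistics, not a prescribed layout):

#include <cmath>
#include <map>
#include <string>
#include <vector>

// Score one retrieved document j against the query using tf-idf.
// word_freq: term counts in this document; doc_freq: number of the M documents
// in the collection that contain each term.
double tfidf_score(const std::vector<std::string>& query_terms,
                   const std::map<std::string, unsigned>& word_freq,
                   const std::map<std::string, unsigned>& doc_freq,
                   unsigned M) {
    double score = 0.0;
    for (const std::string& term : query_terms) {
        auto wf = word_freq.find(term);
        auto df = doc_freq.find(term);
        if (wf == word_freq.end() || df == doc_freq.end() || df->second == 0)
            continue;                                    // absent term contributes nothing
        score += wf->second * std::log(double(M) / df->second);
    }
    return score;
}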
Often, effective ranking for a particular domain is obtained by adjusting by trial and error. A starting place for adjusting rankings by heading/paragraph context could be inflating word_freq for a word based on heading/paragraph context, e.g. 1 for a paragraph, 10 for a top level heading. For some other ideas, you might find this paper interesting, where the authors adjusted BM25 ranking for positional scoring (the idea being that words closer to the beginning of the document are more relevant than words toward the end).
Objective quantification of ranking performance is obtained by precision-recall curves or mean average precision, detailed here. Evaluation requires an ideal set of queries paired with all the relevant documents in the set.
Depending on the size and number of the static pages, you might want to look at an already existent search solution.
"How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles."
I would choose the Sphinx engine for full text searching. The license is GPL, but they also have a commercial version available. It is meant to be run stand-alone [2], but it can also be embedded into applications by extracting the needed functionality (be it indexing [1], searching [3], stemming, etc.).
The data should be obtained by parsing the input HTML files and transforming them to plain-text by using a parser like libxml2's HTMLparser (I haven't used it, but they say it can parse even malformed HTML). If you aren't bound to C/C++ you could take a look at Beautiful Soup.
After obtaining the plain text, you could store it in a database like MySQL or PostgreSQL. If you want to keep everything embedded, you should go with SQLite.
Note that Sphinx doesn't work out-of-the-box with sqlite, but there is an attempt to add support (sphinx-sqlite3).
I would attack this with a little sqlite database. You could have tables for 'page', 'term' and 'page term'. 'Page' would have columns like id, text, title and url. 'Term' would have a column containing a word, as well as the primary ID. 'Page term' would have foreign keys to a page ID and a term ID, and could also store the weight, calculated from the distance from the top and the number of occurrences (or whatever you want).
Perhaps a more efficient way would be to only have two tables - 'page' as before, and 'page term' which would have the page ID, the weight, and a hash of the term word.
An example query - you want to search for "foo". You hash "foo", then query all page term rows that have that term hash. Sort by descending weight and show the top ten results.
I think this should query reasonably quickly, though it obviously depends on the number and size of the pages in question. Sqlite isn't difficult to bundle and shouldn't need an additional installation.
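A rough sketch of the two-table variant against the SQLite C API (the schema follows the answer; the std::hash-based term hash and the weights are placeholder choices, and error checking is omitted):

#include <sqlite3.h>
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("search.db", &db);

    // Two-table variant: pages, plus (page id, term hash, weight) rows.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS page (id INTEGER PRIMARY KEY, url TEXT, title TEXT, text TEXT);"
        "CREATE TABLE IF NOT EXISTS page_term (page_id INTEGER, term_hash INTEGER, weight REAL);"
        "CREATE INDEX IF NOT EXISTS idx_term ON page_term(term_hash);",
        nullptr, nullptr, nullptr);

    // Query: hash "foo", fetch the ten best-weighted pages containing it.
    const std::int64_t h = static_cast<std::int64_t>(std::hash<std::string>{}("foo"));
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT p.title, pt.weight FROM page_term pt JOIN page p ON p.id = pt.page_id "
        "WHERE pt.term_hash = ? ORDER BY pt.weight DESC LIMIT 10;",
        -1, &stmt, nullptr);
    sqlite3_bind_int64(stmt, 1, h);
    while (sqlite3_step(stmt) == SQLITE_ROW)
        std::cout << reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0))
                  << "  (weight " << sqlite3_column_double(stmt, 1) << ")\n";
    sqlite3_finalize(stmt);
    sqlite3_close(db);
}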
Ranking pages is the really tricky bit here. With a large sample of pages you can use links quite a lot in working out ranks. Otherwise you need to check how words seem to be placed, and also make sure your engine doesn't get fooled by 'dictionary' pages.
Good luck!

How to traverse a file backwards in Python and C++? And also store data backwards (bottom to top)?

Suppose I want to store 3 lines in a file, both in Python and C++.
I want to store it like this
aaa
bbb
ccc ..
But I am giving ccc as input first, then bbb, then aaa. How do I traverse the file from bottom to top, and also store it from bottom to top?
It isn't obvious from the title and question whether you want to store to a file, load from a file, or both, so I'll cover both cases:
Reading
If it's OK to load it all into memory at once (in Python):
list(reversed(list(open('foo.txt'))))
Otherwise, it gets a lot more difficult. Processing a file backwards requires that you read blocks of data at a time from the end, scanning backwards through each block for newline markers, and stitching things back together at block boundaries.
Writing
If all the data fits in memory at once, put the lines into a list (in Python):
open('foo.txt', 'w').writelines(reversed(data))
If data is an iterable, replace it with list(data).
If the data doesn't fit in memory (e.g., you have some generator that spits out a ton of data), the problem will be much harder. The simplest solution that comes to mind is to just push the data into a sqlite database and then copy it into the file. Or you might just find it easier to use the data directly from sqlite.
You might want to use a collections.deque. AFAIK those things are optimised for insertion at one of their endpoints, so you could read your file as it is and fill the lines into a deque object with its appendleft method... just a thought. No idea how efficient that would be. :)
Insert the lines to be generated at the beginning of your linear structure (list, vector<string>) each time, then iterate your structure from beginning to end.
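Since the question asks for C++ as well, here is a minimal sketch for the small-file case: collect the lines as they arrive, then write or print them in reverse order (inserting each new line at the front of the vector, as the last answer suggests, works equally well).

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Writing: lines arrive as ccc, bbb, aaa; emit them in reverse so the file reads aaa, bbb, ccc.
    std::vector<std::string> lines = {"ccc", "bbb", "aaa"};
    std::ofstream out("foo.txt");
    for (auto it = lines.rbegin(); it != lines.rend(); ++it)
        out << *it << '\n';
    out.close();

    // Reading backwards (fits-in-memory case): load all lines, then iterate in reverse.
    std::ifstream in("foo.txt");
    std::vector<std::string> read_back;
    for (std::string line; std::getline(in, line); )
        read_back.push_back(line);
    for (auto it = read_back.rbegin(); it != read_back.rend(); ++it)
        std::cout << *it << '\n';
}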