How to count number of citations/references in wikipedia raw text? - regex

I'm building a model to classify raw Wikipedia text by article quality (Wikipedia has a dataset of ~30,000 hand-graded articles and their corresponding quality grades). As part of this, I am trying to figure out a way to algorithmically count the number of citations that appear on a page.
As a quick example: here is an excerpt from a raw Wiki page:
'[[Image:GD-FR-Paris-Louvre-Sculptures034.JPG|320px|thumb|Tomb of Philippe Pot, governor of [[Burgundy (region)|Burgundy]] under [[Louis XI]]|alt=A large sculpture of six life-sized black-cloaked men, their faces obscured by their hoods, carrying a slab upon which lies the supine effigy of a knight, with hands folded together in prayer. His head rests on a pillow, and his feet on a small reclining lion.]]\n[[File:Sejong tomb 1.jpg|thumb|320px|Korean tomb mound of King [[Sejong the Great]], d. 1450]]\n[[Image:Istanbul - Süleymaniye camii - Türbe di Roxellana - Foto G. Dall\'Orto 28-5-2006.jpg|thumb|320px|[[Türbe]] of [[Roxelana]] (d. 1558), [[Süleymaniye Mosque]], [[Istanbul]]]]\n\'\'\'Funerary art\'\'\' is any work of [[art]] forming, or placed in, a repository for the remains of the [[death|dead]]. [[Tomb]] is a general term for the repository, while [[grave goods]] are objects—other than the primary human remains—which have been placed inside.<ref>Hammond, 58–9 characterizes [[Dismemberment|disarticulated]] human skeletal remains packed in body bags and incorporated into [[Formative stage|Pre-Classic]] [[Mesoamerica]]n [[mass burial]]s (along with a set of primary remains) at Cuello, [[Belize]] as "human grave goods".</ref>
So far, I've concluded that I can find the number of images by counting the number of [[Image: occurrences. I was hoping I could do something similar for references. After comparing raw Wiki pages with their corresponding live pages, I think I was able to determine that </ref> corresponds to the end notation of a reference on a Wiki page. For example, in the excerpt above you can see that the author makes a statement at the end of the paragraph and references Hammond, 58–9 within <ref> {text} </ref>.
If somebody is familiar with Wiki's raw data and can shed some light on this, please let me know! Also, if you know a better way to do this, please tell me that, too!
Many thanks in advance!

A ref does not always contain a link to a source; sometimes it contains an explanatory note or similar material instead.
You must count not only <ref>...</ref> tags, but also footnote templates.
If you need the count of unique references, you must also exclude grouped references (refs with a name="xxx" parameter, or automatically grouped footnote templates with identical content).
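As a rough illustration of that idea on the raw wikitext (the regexes here are my own simplification and will miss edge cases), something like this could give both a total and a unique count:
import re

wikitext = 'Claim one.<ref name="ham">Hammond, 58-9</ref> Claim two.<ref name="ham" /> Claim three.<ref>Smith, 12</ref>'

# Full <ref ...>...</ref> bodies (non-greedy so adjacent refs are not merged)
full_refs = re.findall(r'<ref[^>/]*>(.*?)</ref>', wikitext, flags=re.DOTALL | re.IGNORECASE)

# Named refs, so that <ref name="x" /> reuses can be collapsed into one citation
names = set(re.findall(r'<ref[^>]*\bname\s*=\s*"([^"]+)"', wikitext, flags=re.IGNORECASE))

# Refs that carry no name attribute are each their own citation
unnamed = len(re.findall(r'<ref\s*>', wikitext, flags=re.IGNORECASE))

print(len(full_refs))        # full <ref>...</ref> pairs in the text: 2
print(unnamed + len(names))  # unique citations after collapsing named reuses: 2
Footnote templates (e.g. {{sfn}}) would still need separate handling, as noted above.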

Counting reference tags in the wiki markup isn't necessarily accurate, because references can be reused, so two </ref> occurrences may show up as only one reference in the list at the end of the article. There is an API that should give the list of references, but for some reason it's deactivated; however, BeautifulSoup makes this pretty simple on the rendered page. I haven't tested that this counts every reference correctly, but it works:
from bs4 import BeautifulSoup
import requests

# Fetch the rendered article and parse the HTML
page = requests.get('https://en.wikipedia.org/wiki/Stack_Overflow')
soup = BeautifulSoup(page.content, 'html.parser')

# Each rendered reference is wrapped in <span class="reference-text">
count = 0
for eachref in soup.find_all('span', attrs={'class': 'reference-text'}):
    count = count + 1
print(count)

Related

calculate nearest document using fasttext or word2vec

I have a small system of about 1000 documents.
For each document I would like to show links to the X "most similar" documents.
However, the documents are not labeled in any way, so this would be some kind of unsupervised method.
It feels like fasttext would be a good candidate, but I can't wrap my head around how to do it when it's not labeled data.
I can calculate the word vectors, although what I really need is a vector for the whole document.
The Paragraph Vector algorithm, known as Doc2Vec in libraries like Python gensim, can train a model that will give a single vector for a run-of-text, and so could be useful for your need. Note, though, that typical published work uses tens-of-thousands to millions of documents. (Just 1,000 would be a very small training set.)
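A minimal gensim sketch of that approach (gensim 4.x naming; the two documents, the tags, and the parameter values are illustrative stand-ins for your ~1000 texts):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Stand-in corpus; in practice this would be your ~1000 document texts
docs = ["the cat sat on the mat", "dogs and cats are common pets"]
tagged = [TaggedDocument(words=d.lower().split(), tags=[str(i)]) for i, d in enumerate(docs)]

# Small vectors and many epochs, since this corpus is far below typical Doc2Vec sizes
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# The X most similar documents to document "0", by cosine similarity of doc-vectors
print(model.dv.most_similar("0", topn=5))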
You can also simply average all the word-vectors of a text together (perhaps in some weighted fashion) to get a simple, crude vector for the full text, which will often somewhat work for this purpose. (You could use word-vectors from classic word2vec or FastText for this.)
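A crude sketch of that averaging idea, assuming kv is a gensim KeyedVectors object you have already loaded (the commented-out load call and file name are only placeholders):
from gensim.models import KeyedVectors
import numpy as np

# Placeholder: load pre-trained vectors (word2vec or FastText format)
# kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

def average_vector(text, kv):
    # Keep only tokens the word-vector model knows about
    words = [w for w in text.lower().split() if w in kv]
    if not words:
        return np.zeros(kv.vector_size)
    # Unweighted mean; TF-IDF weighting of each word is a common refinement
    return np.mean([kv[w] for w in words], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))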
Similarly, if you have word-vectors but not full doc-vectors, there's a technique called "Word Mover's Distance" that calculates a word-vector-adjusted "distance" between two texts. It often does well in highlighting near-paraphrases, though it's somewhat expensive to calculate (especially for longer texts).
In some cases, just converting all docs to their "bag of words" representation – a giant vector containing counts of words used – then ranking docs by how many words they share is a good enough similarity measure.
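For the bag-of-words route, a small sketch with scikit-learn (the documents are stand-ins):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "dogs and cats are pets", "stock markets fell sharply"]

# Rows are documents, columns are word counts
X = CountVectorizer().fit_transform(docs)

# Pairwise cosine similarity; row i gives similarities of doc i to all docs
sims = cosine_similarity(X)

# Most similar documents to doc 0, excluding itself
ranking = sims[0].argsort()[::-1][1:]
print(ranking)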
Also, full-text index/search frameworks, like SOLR or ElasticSearch, can sometimes take full documents as queries, giving nicely ranked results. (This often works by picking the example document's most significant words, and using those words as fuzzy full-text queries against the full document set.)

Clear approach for assigning semantic tags to each sentence (or short documents) in python

I am looking for a good approach using python libraries to tackle the following problem:
I have a dataset with a column that contains product descriptions. The values in this column can be very messy and contain a lot of other words that are not related to the product. I want to know which rows are about the same product, so I would need to tag each description sentence with its main topics. For example, if I have the following:
"500 units shoe green sport tennis import oversea plastic", I would like the tags to be something like: "shoe", "sport". So I am looking to build an approach for semantic tagging of sentences, not part of speech tagging. Assume I don't have labeled (tagged) data for training.
Any help would be appreciated.
Lack of labeled data means you cannot apply any semantic classification method using word vectors, which would be the optimal solution to your problem. An alternative however could be to construct the document frequencies of your token n-grams and assume importance based on some smoothed variant of idf (i.e. words that tend to appear often in descriptions probably carry some semantic weight). You can then inspect your sorted-by-idf list of words and handpick(/erase) words that you deem important(/unimportant). The results won't be perfect, but it's a clean and simple solution given your lack of training data.
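A minimal sketch of that document-frequency idea (the descriptions and the smoothing are just placeholders):
import math
from collections import Counter

descriptions = [
    "500 units shoe green sport tennis import oversea plastic",
    "shoe sport running blue import 200 units",
    "plastic chair garden import oversea",
]

# Document frequency: in how many descriptions each token appears
df = Counter()
for d in descriptions:
    df.update(set(d.lower().split()))

n_docs = len(descriptions)
# Smoothed idf; a low idf means the word appears in many descriptions
idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}

# Inspect tokens from most to least common and hand-pick (or erase) tag candidates
for w, c in df.most_common():
    print(w, c, round(idf[w], 2))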

word2vec guessing word embeddings

Can word2vec be used for guessing words from just context?
Having trained the model with a large data set, e.g. Google News, how can I use word2vec to predict a missing word given only its context, e.g. with the input ", who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri."? The output should be Kasparov or maybe Carlsen.
I've seen only the similarity APIs, but I can't make sense of how to use them for this. Is this not how word2vec was intended to be used?
It is not the intended use of word2vec. The word2vec algorithm internally tries to predict exact words, using surrounding words, as a roundabout way to learn useful vectors for those surrounding words.
But even so, it's not forming exact predictions during training. It's just looking at a single narrow training example – context words and target word – and performing a very simple comparison and internal nudge to make its conformance to that one example slightly better. Over time, that self-adjusts towards useful vectors – even if the predictions remain of wildly-varying quality.
Most word2vec libraries don't offer a direct interface for showing ranked predictions, given context words. The Python gensim library, for the last few versions (as of current version 2.2.0 in July 2017), has offered a predict_output_word() method that roughly shows what the model would predict, given context-words, for some training modes. See:
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.predict_output_word
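A rough example of calling that method (gensim 4.x parameter names; the corpus below is a tiny stand-in, so the actual predictions will be meaningless):
from gensim.models import Word2Vec

# Stand-in corpus; in practice you would train on something Google-News sized
sentences = [["kasparov", "dominated", "chess", "for", "years"],
             ["top", "players", "compete", "in", "st", "louis"]] * 50

# predict_output_word() requires negative-sampling training (negative > 0, the default)
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, negative=5)

# Ranked guesses for the missing word, given a bag of context words
print(model.predict_output_word(
    ["dominated", "chess", "compete", "top", "players"], topn=10))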
However, considering your fill-in-the-blank query (also called a 'cloze deletion' in related education or machine-learning contexts):
_____, who dominated chess for more than 15 years, will compete against nine top players in St Louis, Missouri
A vanilla word2vec model is unlikely to get that right. It has little sense of the relative importance of words (except when some words are more narrowly predictive of others). It has no sense of grammar/ordering, or of the compositional meaning of connected phrases (like 'dominated chess' as opposed to the separate words 'dominated' and 'chess'). Even though words describing the same sorts of things are usually near each other, it doesn't know categories to be able to determine that the blank must be a 'person' and a 'chess player', and the fuzzy similarities of word2vec don't guarantee that words of a class will necessarily all be nearer each other than other words.
There has been a bunch of work to train word/concept vectors (aka 'dense embeddings') to be better at helping at such question-answering tasks. A random example might be "Creating Causal Embeddings for Question Answering with Minimal Supervision" but queries like [word2vec question answering] or [embeddings for question answering] will find lots more. I don't know of easy out-of-the-box libraries for doing this, with or without a core of word2vec, though.

Distinguishing between terms of different domains

What I am trying to do:
I am trying to take a list of terms and distinguish which domain they are coming from. For example "intestine" would be from the anatomical domain while the term "cancer" would be from the disease domain. I am getting these terms from different ontologies such as DOID and FMA (they can be found at bioportal.bioontology.org)
The problem:
I am having a hard time working out the best way to implement this. Currently I am naively taking the terms from the DOID and FMA ontologies and removing from the DOID list any term that also appears in the FMA list, which we know is anatomical (the DOID list contains terms that may be partly anatomical, such as colon carcinoma: colon is anatomical and carcinoma is a disease).
Thoughts:
I was thinking that I could collect root words, prefixes, and suffixes for the different term domains and try to match them to the terms in the list. Another idea is to take more information from their ontology, such as metadata, and use this to distinguish between the terms.
Any ideas are welcome.
As a first run, you'll probably have the best luck with bigrams. As an initial hypothesis, diseases are usually noun phrases, and usually have a very English-specific structure where NP -> N N, like "liver cancer", which means roughly the same thing as "cancer of the liver." Doctors tend not to use the latter, while the former should be caught with bigrams quite well.
Use the two ontologies you have there as starting points to train some kind of bigram model. Like Rcynic suggested, you can count them up and derive probabilities. A Naive Bayes classifier would work nicely here. The features are the bigrams; classes are anatomy or disease. sklearn has Naive Bayes built in. The "naive" part means, in this case, that all your bigrams are independent of each other. This assumption is fundamentally false, but it works well in a lot of circumstances, so we pretend it's true.
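A sketch of that setup with scikit-learn (the training terms and labels are invented stand-ins for what you would pull from FMA and DOID):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-in training data; in practice, terms from FMA (anatomy) and DOID (disease)
terms  = ["liver", "small intestine", "colon", "liver cancer", "colon carcinoma", "lung disease"]
labels = ["anatomy", "anatomy", "anatomy", "disease", "disease", "disease"]

# Unigrams plus bigrams as features (so single-word terms still get features),
# fed to a multinomial Naive Bayes classifier
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    MultinomialNB(),
)
clf.fit(terms, labels)

print(clf.predict(["intestine cancer", "large intestine"]))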
This won't work perfectly. As it's your first pass, you should be prepared to probe the output to understand how it derived the answers it came up with and to find the cases it failed on. When you find trends of errors, tweak your model and try again.
I wouldn't recommend WordNet here. It wasn't written by doctors, and since what you're doing relies on precise medical terminology, it's probably going to add bizarre meanings. Consider, from nltk.corpus.wordnet:
>>> from nltk.corpus import wordnet
>>> from pprint import pprint
>>> livers = wordnet.synsets("liver")
>>> pprint([l.definition() for l in livers])
[u'large and complicated reddish-brown glandular organ located in the upper right portion of the abdominal cavity; secretes bile and functions in metabolism of protein and carbohydrate and fat; synthesizes substances involved in the clotting of the blood; synthesizes vitamin A; detoxifies poisonous substances and breaks down worn-out erythrocytes',
u'liver of an animal used as meat',
u'a person who has a special life style',
u'someone who lives in a place',
u'having a reddish-brown color']
Only one of these is really of interest to you. As a null hypothesis, there's an 80% chance WordNet will add noise, not knowledge.
The naive approach - what precision and recall is it getting you? If you set up a test case now, then you can track your progress as you apply more sophisticated methods.
I don't know what initial set you are dealing with, but one thing to try is to get your hands on annotated documents (maybe use Mechanical Turk). The documents need to be tagged with the domains you're looking for - anatomical or disease.
Then count-and-divide will tell you how likely a word you encounter is to belong to a domain. From there, the next step would be to tweak some weights.
Another approach (going in a whole other direction) is using WordNet. I don't know if it will be useful for exactly your purposes, but it's a massive ontology - so it might help.
Python has bindings to use Wordnet via nltk.
from nltk.corpus import wordnet as wn
wn.synsets('cancer')
which gives: [Synset('cancer.n.01'), Synset('cancer.n.02'), Synset('cancer.n.03'), Synset('cancer.n.04'), Synset('cancer.n.05')]
http://wordnetweb.princeton.edu/perl/webwn
Let us know how it works out.

Create and use HTML full text search index (C++)

I need to create a search index for a collection of HTML pages.
I have no experience in implementing a search index at all, so any general information on how to build one, what information to store, how to implement advanced searches such as "entire phrase", ranking of results, etc. would be welcome.
I'm not afraid to build it myself, though I'd be happy to reuse an existing component (or use one to get started with a prototype). I am looking for a solution accessible from C++, preferably without requiring additional installations at runtime. The content is static (so it makes sense to aggregate search information), but a search might have to accumulate results from multiple such repositories.
I can make a few educated guesses, though: create a map word ==> pages for all (relevant) words; a rank can be assigned to the mapping by prominence (h1 > h2 > ... > <p>) and proximity to the top. Advanced searches could be built on top of that: searching for the phrase "homo sapiens" could list all pages that contain "homo" and "sapiens", then scan all pages returned for locations where they occur together. However, there are a lot of problematic scenarios and unanswered questions, so I am looking for references to what should be a huge amount of existing work that somehow escapes my google-fu.
[edit for bounty]
The best resource I found until now is this and the links from there.
I do have an implementation roadmap for an experimental system; however, I am still looking for:
Reference material regarding index creation and individual steps
available implementations of individual steps
reusable implementations (with above environment restrictions)
This process is generally known as information retrieval. You'll probably find this online book helpful.
Existing libraries
Here are two existing solutions that can be fully integrated into an application without requiring a separate process (I believe both will compile with VC++).
Xapian is mature and may do much of what you need, from indexing to ranked retrieval. Separate HTML parsing would be required because, AFAIK, it does not parse html (it has a companion program Omega, which is a front end for indexing web sites).
Lucene is an indexing/search Apache library in Java, with an official pre-release C version, Lucy, and an unofficial C++ version, CLucene.
Implementing information retrieval
If the above options are not viable for some reason, here's some info on the individual steps of building and using an index. Custom solutions can go from simple to very sophisticated, depending on what you need for your application. I've broken the process into 5 steps:
HTML processing
Text processing
Indexing
Retrieval
Ranking
HTML Processing
There are two approaches here
Stripping The page you referred to discusses a technique generally known as stripping, which involves removing all the html elements that won't be displayed and translating others to their display form. Personally, I'd preprocess using perl and index the resulting text files. But for an integrated solution, particularly one where you want to record significance tags (e.g. <h1>, <h2>), you probably want to roll your own. Here is a partial implementation of a C++ stripping routine (appears in Thinking in C++, final version of the book here) that you could build from.
Parsing A level up in complexity from stripping is html parsing, which would help in your case for recording significance tags. However, a good C++ HTML parser is hard to find. Some options might be htmlcxx (never used it, but active and looks promising) or hubbub (C library, part of NetSurf, but claims to be portable).
If you are dealing with XHTML or are willing to use an HTML-to-XML converter, you can use one of the many available XML parsers. But again, HTML-to-XML converters are hard to find, the only one I know of is HTML Tidy. In addition to conversion to XHTML, its primary purpose is to fix missing/broken tags, and it has an API that could possibly be used to integrate it into an application. Given XHTML documents, there are many good XML parsers, e.g. Xerces-C++ and tinyXML.
Text Processing
For English at least, processing text into words is pretty straightforward. There are a couple of complications when search is involved, though.
Stop words are words known a priori not to provide a useful distinction between documents in the set, such as articles and prepositions. Often these words are not indexed and are filtered from query streams. There are many stop word lists available on the web, such as this one.
Stemming involves preprocessing documents and queries to identify the root of each word to better generalize a search. E.g. searching for "foobarred" should yield "foobarred", "foobarring", and "foobar". The index can be built and searched on roots alone. The two general approaches to stemming are dictionary based (lookups from word ==> root) and algorithm based. The Porter algorithm is very common and several implementations are available, e.g. C++ here or C here. Stemming in the Snowball C library supports several languages.
Soundex encoding One method to make search more robust to spelling errors is to encode words with a phonetic encoding. Then when queries have phonetic errors, they will still map directly to indexed words. There are a lot of implementations around, here's one.
Indexing
The map word ==> page data structure is known as an inverted index. It's inverted because it's often generated from a forward index of page ==> words. Inverted indexes generally come in two flavors: an inverted file index, which maps words to each document they occur in, and a full inverted index, which maps words to each position in each document where they occur.
The important decision is what backend to use for the index, some possibilities are, in order of ease of implementation:
SQLite or Berkeley DB - both of these are database engines with C++ APIs that integrate into a project without requiring a separate server process. Persistent databases are essentially files, so multiple index sets can be searched by just changing the associated file. Using a DBMS as a backend simplifies index creation, updating and searching.
In-memory data structure - if you're using an inverted file index that is not prohibitively large (memory consumption and time to load), this could be implemented as a std::map<std::string,word_data_class>, using boost::serialization for persistence.
On-disk data structure - I've heard of blazingly fast results using memory mapped files for this sort of thing, YMMV. Having an inverted file index would involve having two index files, one representing words with something like struct {char word[n]; unsigned int offset; unsigned int count; };, and the second representing (word, document) tuples with just unsigned ints (words implicit in the file offset). The offset is the file offset of the first document id for the word in the second file, and count is the number of document ids associated with that word (the number of ids to read from the second file). Searching would then reduce to a binary search through the first file with a pointer into a memory mapped file. The downside is the need to pad/truncate words to get a constant record size.
The procedure for indexing depends on which backend you use. The classic algorithm for generating an inverted file index (detailed here) begins with reading through each document and extending a list of (page id, word) tuples, ignoring duplicate words in each document. After all documents are processed, sort the list by word, then collapse it into (word, (page id1, page id2, ...)).
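An illustrative sketch of that classic build (in Python rather than the C++ you would ultimately want, just to make the algorithm concrete; the page texts are stand-ins):
from collections import defaultdict

# Stand-in corpus of already-stripped page texts, keyed by page id
pages = {1: "homo sapiens evolved in africa",
         2: "the sapiens genus includes homo",
         3: "africa has many climates"}

# Forward pass: (word, page id) tuples, ignoring duplicate words within a page
pairs = []
for page_id, text in pages.items():
    for word in set(text.split()):
        pairs.append((word, page_id))

# Sort by word, then collapse into word -> sorted list of page ids
pairs.sort()
inverted = defaultdict(list)
for word, page_id in pairs:
    inverted[word].append(page_id)

print(inverted["homo"])     # [1, 2]
print(inverted["africa"])   # [1, 3]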
The mifluz GNU library implements inverted indexes with storage, but without document or query parsing. It's GPL, so it may not be a viable option, but it will give you an idea of the complexities involved for an inverted index that supports a large number of documents.
Retrieval
A very common method is boolean retrieval, which is simply the union/intersection of documents indexed for each of the query words that are joined with or/and, respectively. These operations are efficient if the document ids are stored in sorted order for each term, so that algorithms like std::set_union or std::set_intersection can be applied directly.
There are variations on retrieval, wikipedia has an overview, but standard boolean is good for many/most applications.
Ranking
There are many methods for ranking the documents returned by boolean retrieval. Common methods are based on the bag of words model, which just means that the relative position of words is ignored. The general approach is to score each retrieved document relative to the query, and rank documents based on their calculated score. There are many scoring methods, but a good starting place is the term frequency-inverse document frequency formula.
The idea behind this formula is that if a query word occurs frequently in a document, that document should score higher, but a word that occurs in many documents is less informative so this word should be down weighted. The formula is, over query terms i=1..N and document j
score[j] = sum_over_i(word_freq[i,j] * inv_doc_freq[i])
where the word_freq[i,j] is the number of occurrences of word i in document j, and
inv_doc_freq[i] = log(M/doc_freq[i])
where M is the number of documents and doc_freq[i] is the number of documents containing word i. Notice that words that occur in all documents will not contribute to the score. A more complex scoring model that is widely used is BM25, which is included in both Lucene and Xapian.
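To make the arithmetic concrete, here is a small illustrative sketch (in Python rather than C++; the documents and query are stand-ins):
import math
from collections import Counter

docs = {1: "homo sapiens evolved in africa".split(),
        2: "the sapiens genus includes homo homo".split(),
        3: "africa has many climates".split()}
query = ["homo", "sapiens"]

M = len(docs)
# doc_freq[i]: number of documents containing query word i
doc_freq = {w: sum(1 for d in docs.values() if w in d) for w in query}

scores = {}
for j, words in docs.items():
    word_freq = Counter(words)
    # score[j] = sum_over_i(word_freq[i,j] * log(M / doc_freq[i]))
    scores[j] = sum(word_freq[w] * math.log(M / doc_freq[w]) for w in query if doc_freq[w])

# Rank documents by descending score
print(sorted(scores, key=scores.get, reverse=True))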
Often, effective ranking for a particular domain is obtained by adjusting by trial and error. A starting place for adjusting rankings by heading/paragraph context could be inflating word_freq for a word based on heading/paragraph context, e.g. 1 for a paragraph, 10 for a top level heading. For some other ideas, you might find this paper interesting, where the authors adjusted BM25 ranking for positional scoring (the idea being that words closer to the beginning of the document are more relevant than words toward the end).
Objective quantification of ranking performance is obtained by precision-recall curves or mean average precision, detailed here. Evaluation requires an ideal set of queries paired with all the relevant documents in the set.
Depending on the size and number of the static pages, you might want to look at an already existing search solution.
"How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles."
I would choose the Sphinx engine for full text searching. The licence is GPL, but they also have a commercial version available. It is meant to be run stand-alone [2], but it can also be embedded into applications by extracting the needed functionality (be it indexing [1], searching [3], stemming, etc).
The data should be obtained by parsing the input HTML files and transforming them to plain-text by using a parser like libxml2's HTMLparser (I haven't used it, but they say it can parse even malformed HTML). If you aren't bound to C/C++ you could take a look at Beautiful Soup.
After obtaining the plain-texts, you could store them in a database like MySQL or PostgreSQL. If you want to keep everything embedded you should go with sqlite.
Note that Sphinx doesn't work out-of-the-box with sqlite, but there is an attempt to add support (sphinx-sqlite3).
I would attack this with a little sqlite database. You could have tables for 'page', 'term' and 'page term'. 'Page' would have columns like id, text, title and url. 'Term' would have a column containing a word, as well as the primary ID. 'Page term' would have foreign keys to a page ID and a term ID, and could also store the weight, calculated from the distance from the top and the number of occurrences (or whatever you want).
Perhaps a more efficient way would be to only have two tables - 'page' as before, and 'page term' which would have the page ID, the weight, and a hash of the term word.
An example query - you want to search for "foo". You hash "foo", then query all page term rows that have that term hash. Sort by descending weight and show the top ten results.
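As an illustrative sketch of that schema and query (using Python's built-in sqlite3 module rather than the SQLite C API, with table and column names adapted from the proposal above):
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE page (id INTEGER PRIMARY KEY, url TEXT, title TEXT);
CREATE TABLE page_term (page_id INTEGER, term_hash INTEGER, weight REAL);
CREATE INDEX idx_term ON page_term (term_hash);
""")

# Toy data: page 1 mentions "foo" prominently, page 2 less so.
# Note: Python's hash() varies between runs; a stable hash (e.g. hashlib) would be
# needed for an on-disk index.
con.execute("INSERT INTO page VALUES (1, 'a.html', 'A'), (2, 'b.html', 'B')")
con.execute("INSERT INTO page_term VALUES (1, ?, 9.0), (2, ?, 2.5)",
            (hash("foo"), hash("foo")))

# Search: hash the query word, rank matching pages by descending weight
rows = con.execute("""
SELECT page.title, page_term.weight
FROM page_term JOIN page ON page.id = page_term.page_id
WHERE page_term.term_hash = ?
ORDER BY page_term.weight DESC LIMIT 10
""", (hash("foo"),)).fetchall()
print(rows)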
I think this should query reasonably quickly, though it obviously depends on the number and size of the pages in question. Sqlite isn't difficult to bundle and shouldn't need an additional installation.
Ranking pages is the really tricky bit here. With a large sample of pages you can use links quite a lot in working out ranks. Otherwise you need to check how words seem to be placed, and also make sure your engine doesn't get fooled by 'dictionary' pages.
Good luck!