creating a corpus of Wikipedia articles of specific field - data-mining

I want to create a corpus of biology-related articles from Wikipedia, so I could analyze it later using NLP approaches. I have downloaded a Wikipedia dump, and saved it in JSON format.
I am struggling with the task of extracting the biology-related articles. While I was able to find all the articles that are listed under the category "Biology" using the method described here, it turned out that only about 20 articles are listed directly under this category. I believe I would have more luck if I extracted all the articles that belong to the biology portal, but I don't know how to do that. Is there a method for extracting the articles that belong to a certain portal?

Categories are nested. For example, "Animals" is probably a subcategory of "Biology".
You need to find all (transitive) subcategories first, then collect the documents.

Wikipedia's categories are organized as a DAG, so you can traverse the graph with the Biology category node as the root and collect the associated Wiki articles. I did a similar thing before (with a different intent) and am sharing the GitHub repo here; it may help you.
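The traversal described in the answers above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked repo: it assumes you have already extracted two mappings from the dump, here called `subcats` (category to its direct subcategories) and `articles_in` (category to its directly listed articles). Both names and the toy data are hypothetical. A visited set is kept because the real category graph can contain cycles, and the optional `max_depth` is an added assumption for cases where deep traversal drifts into unrelated topics.

```python
from collections import deque


def collect_subcategories(root, subcats, max_depth=None):
    """Return root plus every category reachable from it via subcategory links."""
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        current, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand categories at the depth limit
        for child in subcats.get(current, ()):
            if child not in seen:  # guards against cycles and repeated paths
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


def collect_articles(categories, articles_in):
    """Union of the articles listed directly under any of the given categories."""
    found = set()
    for category in categories:
        found.update(articles_in.get(category, ()))
    return found


# Toy data standing in for mappings extracted from the dump (hypothetical).
subcats = {
    "Biology": ["Animals", "Botany"],
    "Animals": ["Mammals"],
    "Botany": ["Plants"],
    "Mammals": ["Animals"],  # deliberate cycle
}
articles_in = {
    "Biology": ["Cell (biology)"],
    "Animals": ["Animal"],
    "Mammals": ["Dog", "Cat"],
    "Plants": ["Fern"],
}

biology_cats = collect_subcategories("Biology", subcats)
biology_articles = collect_articles(biology_cats, articles_in)
print(sorted(biology_articles))
```

Breadth-first order makes the depth limit easy to apply; a depth-first traversal would collect the same set when no limit is given.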

Master/Detail Dilemma: Wildcard items vs Sitecore Pipeline for Virtual Items or any better idea?

I used to implement listing/detail scenarios using wildcard items, meaning that, for the sake of URL, I create a regular item to display the list and then under that node, I create a wildcard item to represent all possible detail pages, like:
/news/*
(I generate a friendly name in code to replace the wildcard and produce the full URL, such as mywebsite.com/news/the-meeting-press-release)
Then I create a folder or a bucket of content items somewhere else as my repository, and assign the same datasource to the listing node and the wildcard node so that both use the same repository of content items.
Main reason I want to do this is to use datasources and make navigational nodes (which generate actual pages and URLs) to be separate from Content folder structure. In other words, separation of concerns: navigational items as presentation nodes and content items as my data repository.
This is an easy way to work around master/detail requirements, but I always feel guilty about it: the technique seems to break integrity (the Sitecore links table in the database) and the design patterns of the Sitecore back end.
For example, when I look at Analytics, I get * as the item name; these pages clearly look alien to the back-end system.
I know this is not a new topic. I have seen threads like this or ideas like Sitecore Pipeline Processor for Virtual Items to implement such requirements.
Is there any best practice for this? Does anyone have a good example of the most Sitecore-friendly way to implement such a pipeline processor? How do you address this issue with wildcards in Analytics?
I'm going to go a different way to Martin here. I have successfully used Wildcards many times for the exact purpose you are suggesting (For an example have a look at http://www.atpworldtour.com/news - all news articles are items in a bucket with a wildcard to resolve the url).
There are 2 options for enabling the Page Editor:
Option 1: The news article item becomes the page. For this you need a new processor in the httpRequestBegin pipeline that resolves the URL to the item and then sets Sitecore.Context.Item to that item. IIRC you do this by setting one of the pipeline argument properties. This works fine in the Page Editor because the context item (the one being edited) is the news article, and other renderings on the page can just use datasources as needed.
Option 2: The news article resolves to a datasource. I have also tried this method. To do this, you need a custom datasource resolver. I still used a processor in the httpRequestBegin pipeline so that I didn't have to resolve the URL again for each rendering that needed the datasource. Then, in the RenderRendering pipeline, I had a processor that detected whether I wanted a wildcard datasource and used the item that had been resolved in the httpRequestBegin processor.
There are pros & cons for each method.
Option 1 is nice and simple. You can use a single wildcard to resolve different "types" of page item, because the presentation is on the page item and not on the wildcard item. Each item can also have its own custom presentation, so datasources set in the Page Editor are unique to an article. That is also a disadvantage in some ways: A/B testing of the main article text and similar content becomes more difficult, because you are limited to testing article versions.
Option 2 is more flexible in the testing area - you can easily test/personalize parts of the article by changing the Datasource. But you are more limited as the presentation must be set on the wildcard. So renderings that are not part of the main article will have the same content/settings across all news articles.
I was previously in the same boat as you are. There are a few issues with wildcard items, such as resolving datasources, the inability to open a page in the Page (Experience) Editor, and nested wildcards. Regardless, I have used wildcards a few times and they do their job.
I've managed to resolve datasources properly based on the URL (see the blog post "Automatically resolving correct Datasources for wildcard items based on URL"), but I have not sorted out the other issues.
Update: Richard's answer describes a way of enabling the Page Editor; you may find it helpful.
Thus, my answer would be:
I would recommend keeping the classical approach of having a page item for each news item, rather than using wildcards. Content authors can then use their habitual approach (and the Page Editor) rather than editing datasources somewhere in the content tree in the Content Editor. If you configure that properly with templates and standard values, there is minimal hassle in creating a new news article.
If you worry about a potential rise in the number of news articles, use Buckets along with it (or suggest a manual strategy for grouping them into folders).

Countries: A list of their state/province and cities list

I want to get all the countries in the world, along with their states/provinces and cities.
Where can I find this information?
Somebody please help me. I have searched many times, but I couldn't find it anywhere.
Check out django-cities which provides a list of countries and cities of the world. You will need to use http://www.geonames.org/ to import their database. Alternatively, you can directly download their database (in text format) and extract the values you need.
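If you go the direct-download route, extracting values from the text dump might look something like the sketch below. It assumes the dump is tab-separated with 19 columns per row, with the name, feature class, country code and first-level admin code at the positions shown, and that feature class "P" marks populated places. Please verify these against the readme that ships with the download. The sample rows, the `make_row` helper and the function name `cities_by_region` are made up for illustration and are not real GeoNames data.

```python
import csv
from collections import defaultdict

# Assumed 0-based column positions in the tab-separated GeoNames dump
# (check them against the readme shipped with the download).
NAME, FEATURE_CLASS, COUNTRY_CODE, ADMIN1_CODE = 1, 6, 8, 10
EXPECTED_COLUMNS = 19


def cities_by_region(lines):
    """Group populated-place names by (country code, admin1 code)."""
    grouped = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < EXPECTED_COLUMNS:
            continue  # skip malformed or truncated lines
        if row[FEATURE_CLASS] == "P":  # assumed: "P" = city, town, village
            grouped[(row[COUNTRY_CODE], row[ADMIN1_CODE])].append(row[NAME])
    return dict(grouped)


def make_row(geoname_id, name, feature_class, country, admin1):
    """Build one made-up dump line; unused columns are left empty."""
    cols = [""] * EXPECTED_COLUMNS
    cols[0] = str(geoname_id)
    cols[NAME] = name
    cols[FEATURE_CLASS] = feature_class
    cols[COUNTRY_CODE] = country
    cols[ADMIN1_CODE] = admin1
    return "\t".join(cols)


# Illustrative rows only, not real GeoNames records.
sample = [
    make_row(1, "Exampleville", "P", "XX", "01"),
    make_row(2, "Sampleton", "P", "XX", "01"),
    make_row(3, "Testburg", "P", "XX", "02"),
    make_row(4, "Example Province", "A", "XX", "01"),
]

regions = cities_by_region(sample)
for (country, admin1), cities in sorted(regions.items()):
    print(country, admin1, cities)
```

For a real file you would pass an open file object (opened with `encoding="utf-8"` and `newline=""`) instead of the `sample` list.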
You can try using Yahoo's GeoPlanet API.
http://developer.yahoo.com/geo/geoplanet/guide/api-reference.html#api-countries
The more specifically relevant portions are on
/countries/
/counties/{state}
The page even includes examples of how to retrieve the entire list of countries. Given a state, you can also retrieve its counties.
Additional:
If you are more specifically looking to do geocoding, you can try Google Maps Geocoding API located below:
https://developers.google.com/maps/documentation/geocoding/?hl=fr
This will allow you to get region specific information on your location of interest.

Different types of authors in WordPress?

I want to make a website about illustrated books. There are two different kinds of authors for a book: writers and illustrators.
For each writer I want to make a page that lists the books for that writer. The path would be:
http://mysite.com/writers/EdgarAllanPoe
http://mysite.com/writers/OscarWilde
etc
The same for each illustrator: a page for each illustrator listing the books illustrated by her or him.
Paths in this case would be:
http://mysite.com/illustrators/DiegoRivera
http://mysite.com/illustrators/FridaKahlo
etc
and then, each book will have a single page (like a post):
http://mysite.com/books/OneHundredYearsOfSolitude
http://mysite.com/books/WinnieThePooh
etc
Is it possible to do this in WordPress? Thanks.
Absolutely, there are definitely ways to do this. The way I'd recommend is to use one custom post type for books and two custom taxonomies, one for illustrators and one for authors.
That would give you the URL structures you want right out of the box and would make it easy to associate any book with an author and illustrator (or multiple authors and illustrators, if it's a collaborative book). It would take only about 30 - 40 lines of code to set up. There'd be more involved in getting the templating to act the way you want, but not much.

Is there a good WikiField for django models?

Is there a simple way I can add a "WikiField" to a model I have in my application?
I think the most important requirements are:
A text field that can be added to any model.
Simple wiki markup or an editor widget that enables text formatting and easy insertion of links and images.
Saved revision history with author information, and an easy way to revert to any previous version.
Just to explain what I'm trying to do: Imagine you have a bookstore app. Most of the Book model's data come from the store's catalog. Now we would like to add a block of text that is a community wiki, so that users can write the plot summary for example.
How about a combination of django-reversion and django-tinymce, or Markdown if you prefer writing markdown?
I've not come across any field types specifically for Wikis, but with those components writing one really shouldn't take too long.
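To make the revision-history requirement concrete, here is a plain-Python sketch of the idea. It is not django-reversion's API and not a Django field; the `Revision` and `WikiText` classes are hypothetical names used only to show how each edit can be stored with its author and how a revert can be recorded as a new revision so that no history is lost.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Revision:
    author: str
    text: str


@dataclass
class WikiText:
    """Hypothetical container: the current text is always the last revision."""
    revisions: List[Revision] = field(default_factory=list)

    @property
    def current(self) -> str:
        return self.revisions[-1].text if self.revisions else ""

    def edit(self, author: str, text: str) -> None:
        """Record a new revision together with its author."""
        self.revisions.append(Revision(author, text))

    def revert_to(self, index: int, author: str) -> None:
        """Restore an earlier revision by appending a copy of it."""
        self.edit(author, self.revisions[index].text)


summary = WikiText()
summary.edit("alice", "A detective investigates a theft.")
summary.edit("bob", "SPOILERS EVERYWHERE")
summary.revert_to(0, "carol")  # carol restores alice's version
print(summary.current)
```

In a Django project the same shape would typically become a separate revision model with a foreign key to the owning object, which is roughly the job a versioning package takes off your hands.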

Limit category transclusion when using dynamically-generated categories in MediaWiki

At first I wasn't sure if a question on how to do something advanced in MediaWiki belonged here, but upon reading the FAQ and thinking about it, I decided that wiki markup is as much its own language as HTML and CSS, and if those questions are welcome here, then hopefully this is too! If I am wrong, feel free to flag this question. Update: Well, as evidenced by the 3 views this question got, I suppose that while it is perhaps within the rules of Stack Overflow, there might not be much expertise on the subject! I suppose I will need to take this question to the official forums (shudder).
The problem
On a wiki I am setting up, powered by MediaWiki, I have a template that outputs, among other things, dynamically created categories. This means that the page that invokes the template will be categorized based on some of the variables passed to the template. The dynamically generated categories are inside <includeonly> blocks to prevent the template page itself from getting the categories.
The problem is that I then transclude that page on to other pages, which causes the categories to be transcluded as well, and now that third page has all of the categories of all of the pages it transcluded.
I want to somehow format the template such that the page that invokes the template will make use of the categories but any pages that transclude the invoking page will not inherit the categories.
Example
Here's my best shot at an example of the setup. If this is inadequate I can provide links to my real-world example.
Template:Food
A page that takes a couple variables and outputs a highly formatted block that explains the food, including outputting a category based on the "type" variable.
Banana
This page invokes the Template:Food template with a few variables, including type set to "fruit". The result is when the user views the "Banana" page they get a nicely formatted page with some basic information about the fruit. Furthermore, if the user goes to the Category:Fruit page, they will see the Banana page listed.
Banana Nut Bread Recipe
This is the problem page. On this recipe page, the author wants to transclude all of the pages for ingredients so that each ingredient is listed in its nicely formatted block. However, when he transcludes the Banana page using {{:Banana}}, the Fruit category is transcluded along with it and now the Banana Nut Bread Recipe page is listed as being in the Fruit category which is wrong.
If I understand correctly, you want to limit the includeonly info (the category) to only depth 1 transclusion. I don't think it's possible.
Possible solutions:
1- Don't put category info into Template:Food. Put it directly on each ingredient page or, if you really must, create a Template:Food_category or similar. Each page could then have any number of {{Food}}s, and the {{food category}}s would need to be added explicitly.
The Labeled Section Transclusion extension lets you tag parts of a source article with labels, and transclude based on those tags. The tags can overlap, so that you have very granular control over what gets pulled through.
https://www.mediawiki.org/wiki/Extension:Labeled_Section_Transclusion
I would think that with Labeled Section Transclusion, you could transclude both the Type:fruit and the Banana description in separate transclusion statements on the Banana page, but only pull the description through to the Recipe page.