How to read xml data as shown below in pandas dataframe? - python-2.7

I want to read the bold words as the column names in the dataframe and the string following the bold letters as the value for that particular row.
<posts>
<**row Id**="5" PostTypeId="1" **CreationDate**="2014-05-13T23:58:30.457" **Score**="7" ViewCount="315" **Body**="<p>I've always been interested in machine learning, but I can't figure out one thing about starting out with a simple "Hello World" example - how can I avoid hard-coding behavior?</p><p>For example, if I wanted to "teach" a bot how to avoid randomly placed obstacles, I couldn't just use relative motion, because the obstacles move around, but I don't want to hard code, say, distance, because that ruins the whole point of machine learning.</p><p>Obviously, randomly generating code would be impractical, so how could I do this?</p>" **OwnerUserId**="5" LastActivityDate="2014-05-14T00:36:31.077" Title="How can I do simple machine learning without hard-coding behavior?" Tags="<machine-learning>" AnswerCount="1" CommentCount="1" FavoriteCount="1" ClosedDate="2014-05-14T14:40:25.950"/>
<**row Id**="7" **PostTypeId**="1" **AcceptedAnswerId**="10" CreationDate="2014-05-14T00:11:06.457" Score="2" ViewCount="297" Body="<p>As a researcher and instructor, I'm looking for open-source books (or similar materials) that provide a relatively thorough overview of data science from an applied perspective. To be clear, I'm especially interested in a thorough overview that provides material suitable for a college-level course, not particular pieces or papers.</p>" OwnerUserId="36" LastEditorUserId="97" LastEditDate="2014-05-16T13:45:00.237"LastActivityDate="2014-05-16T13:45:00.237" Title="What open-source books (or other materials) provide a relatively thorough overview of data science?" Tags="<education><open-source>" AnswerCount="3" CommentCount="4" FavoriteCount="1" **ClosedDate**="2014-05-14T08:40:54.950"/>
</posts>

Related

aligning sentences to corpus and finding mismatches

The ideal goal is to correct the output from a speech2text model according to a reference corpus (the actual text). I don't mind using any off the selves tool either in NLP space or ElasticSearch
I have a reference corpus like the following:
It is a reliance that has led to a cycle of addiction that has
destroyed lives it is a cycle that makes you sick when you try to stop
and potentially takes your life when you don't and beyond its physical
effects this cycle of addiction also includes constant contact with
the criminal justice system and not just a cycle of arrests release
and violation.
In fact its much longer ...
On the other hand, I have a set of sentences that are recognized from a speech-2-text model in a CSV files
1, is a cycle that makes you dick when
2, try two stops and essentially hates your
3, posses activated
4, lives when who don't and beyond
As you can see there because the speech2text model is not perfect there are errors, for example
1) With references to the corpus these subsentences are misspelled (e.g. dick instead of sick in number the sentence number 1
2) there are sentences that do not match to the corpus at all - e.g. number 3
3) putting the sentences together does not cover the whole paragraph.
So basically I wonder what is this task called in the NLP topic, then I can do a better googling, and I appreciate if you name specific functions or examples that I can leverage, e.g. in Space or NLTK or any other tool.
edit : * I already have experience with nlp (coursera certificate) - therefore, looking for a concrete answer and/or example rather a scientific paper. This is not a general error correction task or the next work recommendation based on sequential models.
The most suited NLP technique for this is probably language models.
They predict the likelihood of a word given the previous words (or surrounding words).
They can be used for error correction .
You may find following useful:
article
page
Why do you think this is "not a general error correction task"? I think it is. You cool look into 'grammar correction' or 'sentence validity'.
Sentence validity is discussed at How to check whether a sentence is correct (simple grammar check in Python)?. The listed tools also provide suggestions, and could therefore be useful for you.

Identify keywords and commands in natural text

I am trying to build a system which identifies various commands and inputs based on a written human-entered text. I'll start with an example, to make things cleaner. Suppose the user inputs the following text:
My name is John Doe, my age is 28 years old, my address is Barkley Street no. 7 Havana. I like chocolate cake with strawberries and vanilla.
Based on a set of predefined markers (e.g. "name is", "age is", "address is", "I like"), I would like to detect their corresponding value (e.g. "John Doe", "28", "Barkley Street... Havana", "chocolate cake ... vanilla").
My current attempt was to tackle this via some regex patterns: for each marker I built a regex saying something along the lines of "if you find marker X, take all the text between it and any of the X, Y, Z markers you could find". That was extracting text between markers, but building everything based on regexes is going to be very cumbersome, especially if I start taking flexing and small variations into account.
I don't have much experience with NLP, so I'm not really sure where I should start for a proper solution. What are some appropriate approaches/solutions/libraries for tackling this problem?
What you are actually trying to do is "information extraction", particularly named entity recognition (NER) to detect the mentions of interest. For an overview, see:
https://en.wikipedia.org/wiki/Information_extraction
To actually start to solve your problem with something approaching state of the art I would suggest looking into the Stanford NLP Toolkit (http://nlp.stanford.edu/software/) for your basic NLP tasks (tokenization, POS tagging) but their NER toolkit won't take you very far with your specific requirements. You could tried their SPIED to help you, but I haven't used it and can't vouch for it. Ultimately if you are serious about this task (which on the face of it sounds quite hard) you will have to write your own NER system for all the entities you want to extract. You may want to incorporate some of your regular expressions as machine learning features to help you with your task (start with a simple ML library like LibSVM or Mallet) but regardless it will be a lot of work.
Good luck!
If the requirement is to identify named entities such as person, place, organisation then one could use StanfordNER library in Python. Additionally, there is solution to training one's own custom entity recognition model using CRF algorithm in Python. Here is an article explaining the same.

Compare two strings and find how closely they are related by meaning

Problem:
I have two strings, say, "Billie Jean" and "Thriller". I need to programmatically compare them and find how closely they are related. Those are both songs of the same artist, hence, they should give a higher score (probability, percentage etc) than say, "Brad Pitt" and "Jamaican Farewell".
One way of doing this is an open source Java tool named WikipediaMiner which compares using the Wikipedia data dump, checking links, descriptions etc.
Question:
Please suggest a better alternative, that uses any or all of Wikipepdia, DBpedia, Freebase and their cousins, or combines a different approach. I would really prefer open source software that can be downloaded and set up on a server (eg. Apache Mahout), rather than a paid web service.
It's not so much a matter of programming, but of data.
So it's not really a question for StackOverflow.
What you really want is to use WordNet I guess. That is really meant as a database for reasoning about the meaning of words. So for example, the data explicitely states that data mining is a form of data processing. And which is a physical entity...
You see, the reasoning will be only as good as your data is.
DBPedia may also include a mapping from WordNet to Wikipedia maybe?
You can't tell that "Thriller" is a song, not a music video or film genre or Lambchop album without additional context.
After you've identified what your items are, it's "simply" a matter of traversing the graph of connections in Freebase, MusicBrainz, or whatever other information sources you are using.
You'll need to decide how you're going to weight things for scoring though. Are two Michael Jackson songs more closely related because they share the same type or are they more closely related to the artist Michael Jackson because they're directly connect to him?

Interesting Console Program for C++ beginners

I’m teaching an entry-level C++ programming class. We only use iostream in the class (No GUI). It seems like students are not so excited to printout strings and numbers to their console. (Most students even never used the console before.) It is hard to motivate or convey the excitement of programming by showing strings in their console.
What would be a good and exciting console program that can be written by C++ beginners? I’m looking for something doable with basic C++ skill + a bit challenging + very exciting, which can motivate students to learn programming languages.
Any comments will be appreciated.
When I taught an undergrad intro course, we did the Game of Fifteen in straight C as the third homework project. It's pretty well scoped, and it's a game, so there's some inherent motivation there.
Back when I taught, I made an early project be an ATM machine.
Text-only interface, with basic operations like withdraw, deposit, query balance, transfer between accounts, etc.
It was something that everyone was already familiar with, it didn't take huge amounts of programming time, but it did help students feel like it was a practical and realistic program.
Other similar ideas would be a cash-register (handle refunds, coupons, items priced by the pound, sales-tax, store specials, etc, etc), or a cell-phone billing program (separate daytime, night, and weekend minutes, bill text messages, picture messages separately, etc).
How about a system that generates a set of poker hands from a deck? While clearly defined, the intricacies of ensuring no duplicate cards etc, make it a good entry level challenge.
As an extension, you could have the system take an input as to whether you want to bet or fold, and effectively play a poker game.
Finally, a good design would allow them to switch the console for a gui front-end later on (e.g. intermediate class).
I always enjoyed problems where there's a real world purpose for doing it. Something like calculating a mathematical equation, or a range of prime numbers. A lot of stuff on ProjectEuler would be good, I would think. Not everybody likes math (but then again, it's kind of a necessary thing for computer science!).
Instead of just printing to the screen you could make ascii animations.
Introduce your students to pipes and filters. Create a useful utility that takes data from stdin and directs its output to stdout. Create another utility that does something else using that same protocol. Create a third utility. Demonstrate how robustly the utilities can work together.
For example, create a clone of the GNU head and tee utilities, and perhaps add a new utility called cap which capitalizes letters. Then demonstrate how you can get the first 3 lines of a text file capitalized and tee'd to a file and stdout. Next, demonstrate how you use the same utilities, without changing a single line of code, to take the first 5 lines of a file and output to the screen the capitalized letters and to a file the original letters.
When I took C++ we had to replicate a Theseus and the Minotaur game. It lends well to outputting multiple lines to the console to form something "graphical" and it's easily based on a set of implemented rules.
I've had to program a Tower of Hanoi game in the console before, I found that quite fun. It requires use of basic data structures, user input, checking for game end conditions so it would be good for a beginner I believe.
New programming students usually find graphical programs to be the most exciting.
It doesn't have to be anything really advanced, just being able to manipulate pixels and stuff should be enough to keep them interested. Making a simple graphic class around SDL should be ok. Maybe something like this:
int main()
{
GraphicWindow graphic;
graphic.setPixel(10,20,GraphicWindow::Red);
graphic.idle();
}
Then you give out assignments like "implement a drawRect function" etc.
Perhaps a text-only version of the Lunar Lander game. You could do full ASCII art and animations (with ncurses, perhaps) as an advanced exercise, but even in a pure text form it can be interesting.
I recall playing the version that ran on the HP 67 calculator, and it was entertaining with only a seven segment display to work with.
I vaguely recall a version that probably ran on an ALTAIR 8800 written in MITS/Microsoft BASIC that used the leading part of the line to show height above ground as ASCII art, with your prompt for the next tick's burn at the right.
Another traditional choice would be to implement Hunt the Wumpus, or for the ambitious, Battleship.
One of my first programming class had a long homework about implementing a (reduced) Monopoly game.
You can use chained lists for the board.
You can use Inheritance for board tiles.
You need some logic to process players turns.
It was probably the first project I've done in CS that I could talk about to my non-tech friends and generate some interest.

Looking for Ideas: How would you start to write a geo-coder?

Because the open source geo-coders cannot begin to compare to Google's or even Yahoo's, I would like to start a project to create a good open source geo-coder. Just to clarify, a geo-coder takes some text (usually with some constraints) and returns one or more lat/lon pairs.
I realize that this is a difficult and garguntuan task, so I am wondering how you might get started. What would you read? What algorithms would you familiarize yourself with? What code would you review?
And also, assuming you were going to develop this very agilely, what would you want the first prototype to be able to do?
EDIT: Let's set aside the data question for now. I am going to use OpenStreetMap data, along with a database of waypoints that I have. I would later plan to include other data sets as well, and I realize the geo-coder would be inherently limited by the quality of the original data.
The first (and probably blocking) problem would be: where do you get your data from? (unless you are willing to pay thousands of dollars for proprietary sets).
You could build a geocoding-api on top of OpenStreetMap (they publish their data in dumps on a regular basis) I guess, but that one was still very incomplete last time I checked.
Algorithms are easy. Good mapping data, however, is expensive. Very expensive.
Google drove their cars all over the world, collecting this data among other things.
From a .NET point of view these articles might be interesting for you:
Writing Your Own GPS Applications: Part I
Writing Your Own GPS Applications: Part 2
Writing GIS and Mapping Software for .NET
I've only glanced at the articles but they've been on CodeProject's 'Most Popular' list for a long time.
And maybe this CodePlex project which the author of the articles above made available.
I would start at the absolute beginning by figuring out how you're going to get the data that matches a street address with a geocode. Either Google had people going around with GPS units, OR they got the information from some existing source. That existing source may have been... (all guesses)
The Postal Service
Some existing maps(printed)
A bunch of enthusiastic users that were early adopters of GPS technology who ere more than willing to enter in street addresses and GPS coordinates
Some government entity (or entities)
Their own satellites
etc
I guess what I'm getting at is the information was either imported from somewhere or was input by someone via some interface. As my starting point I would look at how to get that information. In an open source situation, you may be able to get a bunch of enthusiastic people to enter information.
So for my first prototype, boring as it would be, I would create a form for entering information.
Then you need to know the math for figuring out the closest distance (as the crow flies). From there, try to figure out how to include roads. (My guess is you would have to have data point for each and every curve, where you hold the geocode location of the curve, and the angle of the road on a north/south and east/west vector. You'd probably need to take incline into account, too to get accurate road measurements.)
That's just where I'd start.
But in all honesty, I wouldn't even start on this. Other programmers have done it already, I'm more interested in what hasn't already been done.
get my free raw data from somewhere like http://ipinfodb.com/ip_database.php
load it into a database, denormalizing for fast lookups
design my API
build it out as a RESTful web service
return results in varying formats: JSON, XML, CSV, raw text
The first prototype should accept a ZIP code and return lat/lon in raw text.