Identify keywords and commands in natural text - regex

I am trying to build a system which identifies various commands and inputs based on a written human-entered text. I'll start with an example, to make things cleaner. Suppose the user inputs the following text:
My name is John Doe, my age is 28 years old, my address is Barkley Street no. 7 Havana. I like chocolate cake with strawberries and vanilla.
Based on a set of predefined markers (e.g. "name is", "age is", "address is", "I like"), I would like to detect their corresponding value (e.g. "John Doe", "28", "Barkley Street... Havana", "chocolate cake ... vanilla").
My current attempt was to tackle this via some regex patterns: for each marker I built a regex saying something along the lines of "if you find marker X, take all the text between it and any of the X, Y, Z markers you could find". That was extracting text between markers, but building everything based on regexes is going to be very cumbersome, especially if I start taking flexing and small variations into account.
I don't have much experience with NLP, so I'm not really sure where I should start for a proper solution. What are some appropriate approaches/solutions/libraries for tackling this problem?

What you are actually trying to do is "information extraction", particularly named entity recognition (NER) to detect the mentions of interest. For an overview, see:
https://en.wikipedia.org/wiki/Information_extraction
To actually start to solve your problem with something approaching state of the art I would suggest looking into the Stanford NLP Toolkit (http://nlp.stanford.edu/software/) for your basic NLP tasks (tokenization, POS tagging) but their NER toolkit won't take you very far with your specific requirements. You could tried their SPIED to help you, but I haven't used it and can't vouch for it. Ultimately if you are serious about this task (which on the face of it sounds quite hard) you will have to write your own NER system for all the entities you want to extract. You may want to incorporate some of your regular expressions as machine learning features to help you with your task (start with a simple ML library like LibSVM or Mallet) but regardless it will be a lot of work.
Good luck!

If the requirement is to identify named entities such as person, place, organisation then one could use StanfordNER library in Python. Additionally, there is solution to training one's own custom entity recognition model using CRF algorithm in Python. Here is an article explaining the same.

Related

Scratch project: To check if any word in a list is contained in an answer

I have the following Scratch project which has a "kind list" of words like: "good", "kind", "love", "come" etc.
A user should be able to enter any sentence containing any of these words, and the happy face would show.
Currently if the user types "kind" the happy face shows and if it types anything else like "you are kind", the sad face shows.
How do I change this, in scratch, such that if the user types in:
"you are kind" or
"how kind you are" or
"come here"
(any sentence containing any word in the "kindlist") the face is happy,else not.
I can only find a block that allows me to select the LIST and then the ANSWER and no other alternatives. What I want is the Python equivalent of > in list
answer=input("Say something")
If any word in the input answer (sentence) in in the list.
Then do - - -
For teaching purposes, I am trying to simplify what is on https://machinelearningforkids.co.uk/#!/newproject (creating of the training set). Can this be done directly in scratch or not? Or is this why the site allows you generate blocks on their site first and import them.
Surely Scratch should have the capability to enter data into lists and then test them directly.
I've also tried using a loop (which doesn't quite work correctly either) but was hoping there was a far simpler way.
I guess Scratch deliberately offers a minimal set of functions,
on the one hand not to overwhelm beginners,
on the other hand to encourage students to piece together simple blocks into more complex systems.
Yes, a simple (sentence) contains (word) is all you get out-of-the-box;
you do need a loop to match a multi-word sentence against a multi-word whitelist.
Seems to me like you would be better off with some development environment
that will at least give you some mature text parsing capabilities.
I'm not saying it's impossible to teach student about machine learning using Scratch, but I doubt it's the best tool for the job.
It feels like somebody wants to give music lessons, but students first have to go through the process of building a piano.
As for your code, it looks like a good start.
Some suggestions:
Replace the 'forever' loop with a loop bounded by the length of list 'kindthings'.
Include a leading and a trailing space in the 'contains' check, to make sure only whole words match. Wouldn't want 'unhappy' in a sentence to match 'happy' in the whitelist.

AWS Lex matches the wrong intent despite entering exact utterance

I have been having this problem in a variety of different cases.
I'll share an example of one.
I have a few FAQ intents.
One answers "What is Named Entity Recognition"
These are it's utterances :
Tell me about Named Entity Recognition
Tell me about NER
What is NER
What do you mean by Named Entity Recognition
What is Named Entity Recognition
and the other answers "What is Optical Character Recognition?"
These are it's utterances :
OCR
What do you mean by OCR
Can you tell me what OCR is
Tell about OCR
What is optical character recognition
What is OCR
When I enter, "What is ocr?" it works as expected and shows the answer for OCR.
But when I instead enter OCR in all caps, with the same exact question (What is OCR?). It switches to the NER intent and shows me the answer for "What is NER?"
Can any one answer why it is doing so? and more important than that, What do I do to make it work as expected.
Do keep in mind that this is just one example. I have encountered this in many other similar scenarios too.
There was also a case where the intent utterances didn't seem to match even remotely. But it still switched to the wrong intent.
As per the Lex and Alexa documentation, acronyms in custom slot types should be written as either a single word in all caps (OCR) or lowercase letters separated by periods and spaces (o. c. r.).
Along the bottom of the table, you can see the examples for Fire HD7, Fire h. d., Fire HD, and Fire HD 7 that demonstrate this -- both of the valid options will resolve to the same Slot Value Output.
Assuming utterances are set up in accordance with best practice, if you're providing vocal input, it's important to note that utterances are sensitive to things such as inflection in your voice, pacing/space between words, accents, and more.
As for immediate steps to improve accuracy, you can always try breaking up your intents further, where instead of having two intents, you have one for each permutation of custom slot value (NER, Named Entity Recognition, OCR, and Optical Character Recognition). It's easy for humans to understand that the first letter of a phrase maps to the letters in an acronym, but when it comes to teaching a chatbot to understand that these phrases are synonymous, that's a bit harder.
In the end I didn't find a proper solution but used some really inelegant workarounds, but hey as long as it works :D
The workaround I used was to make a "what" intent, a "how" intent etc. Keeping the sentence structure intact:
For example :
IntentName => "Bot_HowTo"
Utterances =>
"What is {slotName}"
"What are {slotName}"
"Meaning of {slotName}"
Slots =>
name : "slotName"
values (using synonyms) :
{OCR => "ocr", Optical Character recognition"}
{NER=> "ner", Named Entity Recognition"}
This makes the amount of intents needed much less and also eliminates a lot of the ambiguity. All questions that have "what" or similar formats go straight to that intent.
And then in my codehook I see which synonym was matched and provide the answer accordingly.

How to read xml data as shown below in pandas dataframe?

I want to read the bold words as the column names in the dataframe and the string following the bold letters as the value for that particular row.
<posts>
<**row Id**="5" PostTypeId="1" **CreationDate**="2014-05-13T23:58:30.457" **Score**="7" ViewCount="315" **Body**="<p>I've always been interested in machine learning, but I can't figure out one thing about starting out with a simple "Hello World" example - how can I avoid hard-coding behavior?</p><p>For example, if I wanted to "teach" a bot how to avoid randomly placed obstacles, I couldn't just use relative motion, because the obstacles move around, but I don't want to hard code, say, distance, because that ruins the whole point of machine learning.</p><p>Obviously, randomly generating code would be impractical, so how could I do this?</p>" **OwnerUserId**="5" LastActivityDate="2014-05-14T00:36:31.077" Title="How can I do simple machine learning without hard-coding behavior?" Tags="<machine-learning>" AnswerCount="1" CommentCount="1" FavoriteCount="1" ClosedDate="2014-05-14T14:40:25.950"/>
<**row Id**="7" **PostTypeId**="1" **AcceptedAnswerId**="10" CreationDate="2014-05-14T00:11:06.457" Score="2" ViewCount="297" Body="<p>As a researcher and instructor, I'm looking for open-source books (or similar materials) that provide a relatively thorough overview of data science from an applied perspective. To be clear, I'm especially interested in a thorough overview that provides material suitable for a college-level course, not particular pieces or papers.</p>" OwnerUserId="36" LastEditorUserId="97" LastEditDate="2014-05-16T13:45:00.237"LastActivityDate="2014-05-16T13:45:00.237" Title="What open-source books (or other materials) provide a relatively thorough overview of data science?" Tags="<education><open-source>" AnswerCount="3" CommentCount="4" FavoriteCount="1" **ClosedDate**="2014-05-14T08:40:54.950"/>
</posts>

NER model to recognize Indian names

I am planning to use Named Entity Recognition (NER) technique to identify person names (most of which are Indian names) from a given text. I have already explored the CRF-based NER model from Stanford NLP, however it is not quite accurate in recognizing Indian names. Hence I decided to create my own custom NER model via supervised training. I have a fair idea of how to create own NER model using the Stanford NER CRF, but creating a large training corpus with manual annotation is something I would like to avoid, as it is a humongous effort for an individual and secondly obtaining diverse people names from different states of India is also a challenge. Could anybody suggest any automation/programmatic way to prepare a labelled training corpus with at least 100k Indian names?
I have already looked into Facebook and LinkedIn API, but did not find a way to extract 100k number of user's full name from a given location (e.g. India).
I ended up doing the following to create NER model to identify Indian names. This may be useful for anybody looking for creating a custom NER model to recognize non-English person names, since most of the publicly available NER models such as the ones from Stanford NLP were trained with English names and hence are more accurate in identifying English (British/American) names.
Find an Indian celebrity with Twitter account and having a huge number of followers in Twitter (for my case, I chose Sachin Tendulkar).
Create a program in the language of your choice to call the Twitter REST API (GET followers/list) to get the names of all the followers of the celebrity and save to a file. We can safely assume most of the followers would be Indians. Note that there is an API Rate Limit in place (30 requests per 15 minute window), so the program should be built in to handle that. For our case, we developed the program as a Windows Service which runs every 15 minutes.
Since some Twitter users' names may not be valid person names, it is advisable to add some rule-based logic (like RegEx) to filter seemingly real names and add only those to the file.
Once the file with real names is generated, create another program to create the training data file containing these names labelled/annotated as PERSON as well as non-entity names annotated as OTHER. If you are using Stanford NER CRF Classifier, the program should generate a training (TSV) file having two columns - one containing the word (token) and the second column mentioning the label.
Once the training corpus is generated programmatically, you can follow the below link to create your custom NER model to recognize Indian names:
http://nlp.stanford.edu/software/crf-faq.shtml#a
This website has done this for us!It provides with the solution for these problems:
Challenges in Indian Language NER
Indian languages belong to several language families, the major ones being the Indo-European languages, Indo-Aryan and the Dravidian languages.
The challenges in NER arise due to several factors. Some of the main factors are listed below
Morphologically rich - identification of root is difficult, require use of morphological analysers
No Capitalization feature - In English, capitalization is one of the main features, whereas that is not there in Indian languages
Ambiguity - ambiguity between common and proper nouns. Eg: common words such as "Roja" meaning Rose flower is a name of a person
Spell variations - In the web data is that we find different people spell the same entity differently - for example : In Tamil person name -Roja is spelt as "rosa", "roja".
The whole corpus is provided.
Named Entity Recognition for Indian Languages and English
Best of luck for getting passwords for the zip files!
cheers!
A proposition: you could try to exploite the India version of Wikipedia for training or to create automatically gazetteer.
I don't know if it is the efficient/quick solution but a lot of research exploits Wikipedia and his semi-structured content (for example, each page is annotated with several categories).
You can have a look at these articles to find an interesting idea for you:
https://scholar.google.fr/scholar?q=named+entity+recognition+using+wikipedia&btnG=&hl=fr&as_sdt=0%2C5

Simple Phrases recognition

I am looking to recognize simple phrases like the ones what happens in google calendar
but rather than parsing Calendar Entries I have to parse Sentence related to finance, accounting and to do lists. So For example I have to parse sentences like
I spent 50 dollars on food yesterday
I need to mark an separate the info as Reason : 'food' , Cost : 50 and Time: <Yesterday's Date>
My question is do I go in for a full fledged Natural Language Processing like
given in these Questions and use Something like GATE
Machine Learning and Natural Language Processing
Natural Language Processing in Ruby
Ideas for Natural Language Processing project?
https://stackoverflow.com/a/3058063/492561
Or is it better to Write simple grammars using Something like AntLR and try to recognize it .
Or should I go really low and just Define a syntax and use Regular expressions .
Time is a Constraint , I have about 45 - 50 Days , And I don't know how to use AntLR or NLP libraries like GATE.
Preferred languages : Python , Java , Ruby (Not in any particular order)
PS : This is not home-work , So please Don't tag it as so.
PPS : Please try to give an answer with Facts on why using a particular method is better.
even if a particular method may not fit inside the time constraint please feel free to share it because It might benefit someone else .
You could look at named entity recognition indeed. From your question I understand your domain is pretty well defined, so you can identify the (few?) entities (dates, currencies, money amount, time expressions, etc.) that are relevant for you. If the phrases are very simple, you could go with a rule-based approach, otherwise it's likely to get too complex too soon.
Just to get yourself up and running in a few sec, http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code is an extremely nice example of what you could do. Of course I would not expect an high accuracy from just 6 lines of python, but it should give you an idea of how it works:
1>>> import nltk
2>>> def extract_entities(text):
3... for sent in nltk.sent_tokenize(text):
4... for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
5... if hasattr(chunk, 'node'):
6... print chunk.node, ' '.join(c[0] for c in chunk.leaves())
The core idea is on line 3 and 4: on line 3 it split text in sentences and iterates them.
On line 4, it splits the sentence in tokens, it runs "part of speech" tagging on the sentence, and then it feeds the pos-tagged sentence to the named entity recognition algorithm. That's the very basic pipeline.
In general, nltk is an extremely beautiful piece of software, and very well documented: I would look at it. Other answers contain very useful links.
Your task is a type of Information Extraction task, specifically relation/fact extraction, preceded by Named Entity Recognition.
Take a look at the following frameworks for Java/Python:
GExp
GATE
NLTK. Python. Book chapter on Information Extraction.
UIMA. (used for IBM's Watson.)