Text document classification with AWS Machine Learning - amazon-web-services

I've been looking at using AWS Machine Learning to implement a categorizer for my project. I have something on the order of 40,000 documents that have several text-only features. For example: Name (< 200 chars) and Description (potentially hundreds / thousands of words).
In a nutshell, I'm looking to assign categories (0 or more) to each document based on its content.
I've read through the AWS ML tutorial and checked out a few other sources but the available material seems to deal with feature fields that are numeric, boolean, datetime, or otherwise non-textual.
Is AWS Machine Learning capable of performing multi-class categorization on documents based primarily (or possibly only) on text fields? And if so, is there any reference material available for this particular avenue?

Mainly, you don't need "text fields". First you have to create a vector space model (VSM) from your corpus (texts); then you can weight your VSM with tf-idf, which gives you numeric fields you can use.
Are you sure you want to use AWS ML to train on a corpus of only 40,000 documents?
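To make the vector space model idea concrete, here is a minimal sketch using only the Python standard library. The function name `tfidf_vectors`, the whitespace tokenizer, and the plain `tf * log(N / df)` weighting are assumptions chosen for illustration; in practice a library such as scikit-learn's `TfidfVectorizer` would likely be a better fit.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, one TF-IDF vector per document).

    Assumption: naive lowercase/whitespace tokenization and the
    unsmoothed weighting tf * log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


if __name__ == "__main__":
    vocab, vectors = tfidf_vectors(["red apple", "red car"])
    print(vocab)
    print(vectors)
```

Each document becomes a fixed-length numeric vector, which is the kind of input a generic numeric classifier can consume. Terms that occur in every document get weight zero under this weighting.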

Related

Is it possible to use the AWS multiclass classifier to recognize the place described in a text?

I'm using AWS SageMaker, and I want to create something that, given a text description, recognizes the place being described. Is it possible?
If there are no other classes besides the text that you would like your model to identify, you may not need a multiclass classifier.
You could train your own text detection model using Amazon SageMaker, and train using a dataset with labelled examples using the Object Detection Algorithm, but this becomes rather involved for a problem that has existing solutions available.
If the appearance of the text you're trying to detect is identical each time, your problem space gets reduced from trying to interpret variable text to simply having to gather enough examples and perform object detection for the "pattern" your text forms visually. Note that if the text were to appear in different fonts or styles, the generic object detection method would not interpret it dynamically, and an OCR-based solution would likely be necessary.
More broadly, for text identification in images on AWS, you have quite a few options:
Amazon Rekognition has a DetectText method that will enable you to easily find text within an image. If it's a small or simple phrase, with alphanumeric characters, this should work very well for your use case.
Amazon Textract will help you perform OCR (optical character recognition) while retaining the structure of the source. This is great for documents and tables, but doesn't sound like it may be applicable to your use case.
The AWS marketplace will also have hosted options available from third party vendors. One example of this for text region identification is this one from RocketML.
There are also some great open source tools I'd recommend looking into: OpenCV for ascertaining the text bounding boxes, and Tesseract for OCR and text extraction. This blog post does a good job walking through the process of using them together.
Any of these will help to solve your problem of performing OCR/text identification on AWS, but the best choice comes down to what your current and future needs are, and how quickly you're looking to implement the feature.
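As a rough illustration of the Rekognition option, the sketch below filters a `DetectText` response down to the detected lines. The helper names `detected_lines` and `detect_text_in_file` and the 80% confidence threshold are assumptions for this example; the boto3 call is only made inside `detect_text_in_file`, which needs boto3 installed and AWS credentials configured.

```python
def detected_lines(response, min_confidence=80.0):
    """Return the text of LINE-level detections above a confidence threshold.

    `response` is expected to have the shape returned by Rekognition's
    DetectText operation: a dict with a "TextDetections" list.
    """
    return [
        det["DetectedText"]
        for det in response.get("TextDetections", [])
        if det.get("Type") == "LINE" and det.get("Confidence", 0.0) >= min_confidence
    ]


def detect_text_in_file(image_path):
    """Call Rekognition DetectText on a local image (requires AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("rekognition")
    with open(image_path, "rb") as fh:
        return client.detect_text(Image={"Bytes": fh.read()})


if __name__ == "__main__":
    # Hand-written sample in the response shape, not real service output.
    sample = {
        "TextDetections": [
            {"DetectedText": "HELLO WORLD", "Type": "LINE", "Confidence": 99.1},
            {"DetectedText": "HELLO", "Type": "WORD", "Confidence": 99.3},
            {"DetectedText": "blurry", "Type": "LINE", "Confidence": 41.0},
        ]
    }
    print(detected_lines(sample))
```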
Your question is not clear regarding the data that you have or the problem that you want to solve.
If you have a text that includes a place name in it (for example, "I visited Seattle and enjoyed the fish market"), you can use Amazon Comprehend's named entity recognition to extract entities, including places ("Seattle" in the above example):
{
    "Entities": [
        {
            "Score": 0.9857407212257385,
            "Type": "LOCATION",
            "Text": "Seattle",
            "BeginOffset": 10,
            "EndOffset": 17
        }
    ]
}
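Pulling the place names out of a response shaped like the one above could be sketched as follows. The helper names `locations` and `detect_places` and the 0.5 score threshold are assumptions for illustration; `detect_places` calls the Comprehend `DetectEntities` operation through boto3 and therefore needs AWS credentials.

```python
def locations(response, min_score=0.5):
    """Return the text of every LOCATION entity scoring at least min_score."""
    return [
        entity["Text"]
        for entity in response.get("Entities", [])
        if entity["Type"] == "LOCATION" and entity["Score"] >= min_score
    ]


def detect_places(text, language_code="en"):
    """Run Comprehend entity detection and keep only the places."""
    import boto3  # imported lazily so `locations` works without boto3

    client = boto3.client("comprehend")
    response = client.detect_entities(Text=text, LanguageCode=language_code)
    return locations(response)


if __name__ == "__main__":
    text = "I visited Seattle and enjoyed the fish market"
    # The response from the answer above, reused as sample input.
    response = {
        "Entities": [
            {
                "Score": 0.9857407212257385,
                "Type": "LOCATION",
                "Text": "Seattle",
                "BeginOffset": 10,
                "EndOffset": 17,
            }
        ]
    }
    print(locations(response))
    # The offsets index into the original text.
    print(text[10:17])
```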
If the description is more general and you want to classify if the description is of a hotel, a restaurant, a theme park, a concert/show, or similar types of places, you can either use the Custom classification in Comprehend or the Neural Topic Model in SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/ntm.html). You will need some examples of the classes and documents/sentences that are used for the model training.

NER model to recognize Indian names

I am planning to use Named Entity Recognition (NER) to identify person names (most of which are Indian names) in a given text. I have already explored the CRF-based NER model from Stanford NLP, but it is not very accurate at recognizing Indian names. Hence I decided to create my own custom NER model via supervised training. I have a fair idea of how to create my own NER model using the Stanford NER CRF, but I would like to avoid creating a large training corpus by manual annotation: it is a huge effort for one individual, and obtaining diverse names of people from different states of India is also a challenge. Could anybody suggest an automated/programmatic way to prepare a labelled training corpus with at least 100k Indian names?
I have already looked into the Facebook and LinkedIn APIs, but did not find a way to extract 100k users' full names for a given location (e.g. India).
I ended up doing the following to create an NER model that identifies Indian names. This may be useful for anybody looking to create a custom NER model that recognizes non-English person names, since most of the publicly available NER models, such as the ones from Stanford NLP, were trained with English names and hence are more accurate at identifying English (British/American) names.
Find an Indian celebrity with a Twitter account and a huge number of followers (in my case, I chose Sachin Tendulkar).
Create a program in the language of your choice to call the Twitter REST API (GET followers/list) to get the names of all the followers of the celebrity and save them to a file. We can safely assume most of the followers would be Indians. Note that there is an API rate limit in place (30 requests per 15 minute window), so the program should be built to handle that. In our case, we developed the program as a Windows Service which runs every 15 minutes.
Since some Twitter users' names may not be valid person names, it is advisable to add some rule-based logic (like RegEx) to filter seemingly real names and add only those to the file.
Once the file with real names is generated, create another program to create the training data file containing these names labelled/annotated as PERSON as well as non-entity names annotated as OTHER. If you are using Stanford NER CRF Classifier, the program should generate a training (TSV) file having two columns - one containing the word (token) and the second column mentioning the label.
Once the training corpus is generated programmatically, you can follow the below link to create your custom NER model to recognize Indian names:
http://nlp.stanford.edu/software/crf-faq.shtml#a
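The filtering and labelling steps above could be sketched roughly like this. The regular expression (two to four capitalized alphabetic words), the helper names, and the whitespace tokenization are all assumptions for illustration, not the program actually used. The background label is left configurable: the answer above uses OTHER, while the Stanford CRF FAQ examples use O.

```python
import re

# Assumed heuristic: 2-4 alphabetic words, each starting with a capital letter.
NAME_RE = re.compile(r"^[A-Z][a-z]+(?: [A-Z][a-z]+){1,3}$")


def looks_like_name(display_name):
    """Rule-based filter for display names that look like real person names."""
    return bool(NAME_RE.match(display_name.strip()))


def to_tsv_rows(tokens, names, background="O"):
    """Produce 'token<TAB>label' rows, labelling tokens of known names as PERSON."""
    name_tokens = {tok for name in names for tok in name.split()}
    return [
        f"{tok}\t{'PERSON' if tok in name_tokens else background}"
        for tok in tokens
    ]


if __name__ == "__main__":
    raw = ["Sachin Tendulkar", "c00l_guy_99", "Priya Sharma", "NEWS UPDATES 24x7"]
    names = [n for n in raw if looks_like_name(n)]
    sentence = "Yesterday Priya Sharma met Sachin Tendulkar in Mumbai".split()
    print("\n".join(to_tsv_rows(sentence, names)))
```

Matching single tokens against the name list is crude (a word that is both a name and a common noun will always be labelled PERSON), so some manual spot-checking of the generated file is probably still worthwhile.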
This website has done this for us! It provides a solution for these problems:
Challenges in Indian Language NER
Indian languages belong to several language families, the major ones being the Indo-European languages, Indo-Aryan and the Dravidian languages.
The challenges in NER arise due to several factors. Some of the main factors are listed below
Morphologically rich - identification of root is difficult, require use of morphological analysers
No Capitalization feature - In English, capitalization is one of the main features, whereas that is not there in Indian languages
Ambiguity - ambiguity between common and proper nouns. E.g., a common word such as "Roja" (meaning the rose flower) is also the name of a person
Spell variations - In web data, different people spell the same entity differently. For example, the Tamil person name Roja is spelt as "rosa" or "roja".
The whole corpus is provided.
Named Entity Recognition for Indian Languages and English
Best of luck for getting passwords for the zip files!
cheers!
A suggestion: you could try to exploit the Indian version of Wikipedia for training, or to create a gazetteer automatically.
I don't know if it is an efficient/quick solution, but a lot of research exploits Wikipedia and its semi-structured content (for example, each page is annotated with several categories).
You can have a look at these articles to find an interesting idea for you:
https://scholar.google.fr/scholar?q=named+entity+recognition+using+wikipedia&btnG=&hl=fr&as_sdt=0%2C5

NLTK ontology/metonymy instruments

Is there any tool for building an ontology from a text?
The same question about metonymy: is there any way to detect metonymy in a sentence/text?
Thanks in advance.
Ontology
There is nothing in NLTK specifically designed for creating an ontology from text. You will have to get concepts out of the text (you can start with Named Entity Recognition, Multi-Word Expressions or Information Extraction). The rest is about somehow linking to existing ontologies (e.g. you can start with topics related to your text).
Metonymy
You can use WordNet to identify metonymic relationships between the words or concepts in the text you process. This is doable via the NLTK interface to WordNet. You would have to identify a synset containing your concept / word and traverse that synset's relations to other synsets. Your question could lead to wildly varying implementations depending on your requirements, so I'll leave it at this hint.
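As a hedged illustration of the traversal described above: WordNet does not label metonymy as such, so this sketch assumes the part-whole relations it does expose (meronyms and holonyms) are an acceptable starting point. The helper names `related_synsets` and `part_whole_relations` are made up for this example; the second one needs NLTK and its `wordnet` corpus installed.

```python
# Part-whole relation methods available on NLTK WordNet synsets.
RELATIONS = (
    "part_meronyms", "member_meronyms", "substance_meronyms",
    "part_holonyms", "member_holonyms", "substance_holonyms",
)


def related_synsets(synset, relations=RELATIONS):
    """Map each non-empty relation name to the names of the synsets it reaches."""
    found = {}
    for relation in relations:
        targets = getattr(synset, relation)()
        if targets:
            found[relation] = [target.name() for target in targets]
    return found


def part_whole_relations(word):
    """Collect part-whole relations for every noun synset of `word`."""
    from nltk.corpus import wordnet as wn  # needs nltk plus the 'wordnet' corpus

    return {s.name(): related_synsets(s) for s in wn.synsets(word, pos=wn.NOUN)}


if __name__ == "__main__":
    for synset_name, relations in part_whole_relations("car").items():
        print(synset_name, relations)
```

Deciding whether a given part-whole link is actually used metonymically in a sentence still takes context, so this only narrows down the candidates.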

How to parse freebase quad dump using Amazon mapreduce

I'm trying to extract movie information from Freebase. I just need the name of the movie, and the name and ID of the director and of the actors.
I found it hard to do so using Freebase's topic dumps, because there is no reference to the director's ID, just the director's name.
What is the right approach for this task? Do I need to somehow parse the whole quad dump using Amazon's cloud? Or is there some easy way?
You do need to use the quad dump, but it is under 4 GB and shouldn't require Hadoop, MapReduce, or any cloud processing to do. A decent laptop should be fine. On a couple year old laptop, this simple-minded command:
time bzgrep '/film/' freebase-datadump-quadruples.tsv.bz2 | wc -l
10394545
real 18m56.968s
user 19m30.101s
sys 0m56.804s
extracts and counts everything referencing the film domain in under 20 minutes. Even if you have to make multiple passes through the file (which is likely), you'll be able to complete your whole task in less than an hour, which should mean there's no need for beefy computing resources.
You'll need to traverse an intermediary node (CVT in Freebase-speak) to get the actors, but the rest of your information should be connected directly to the subject film node.
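If you would rather do the same single pass in a script than with bzgrep, a minimal Python sketch might look like this. It assumes the quad dump is tab-separated with four columns per line (source, property, destination, value); the function name `film_quads` and the property name in the usage comment are illustrative assumptions.

```python
import bz2


def film_quads(path, needle="/film/"):
    """Stream a bz2-compressed quad dump and yield the rows mentioning `needle`.

    Each yielded row is the list of tab-separated columns of one line.
    The file is read line by line, so memory use stays small.
    """
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if needle in line:
                yield line.rstrip("\n").split("\t")


if __name__ == "__main__":
    # Same file name as in the bzgrep command above.
    count = 0
    for columns in film_quads("freebase-datadump-quadruples.tsv.bz2"):
        # e.g. keep only rows whose property column is "/film/film/directed_by"
        count += 1
    print(count)
```

A second pass (or an in-memory dictionary built during the first) can then resolve the intermediary CVT nodes to the actor IDs.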
Tom
First of all, I completely share Tom's point of view and his suggestion. I often use UNIX command line tools to take 'interesting' slices of data out of Freebase data dump.
However, an alternative would be to load Freebase data into a 'graph' storage system locally and use APIs and/or the query language available from that system to interact with the data for further processing.
I use RDF, since the data model is quite similar and it is very easy to convert the Freebase data dump into RDF (see: https://github.com/castagna/freebase2rdf). I then load it into Apache Jena's TDB store (http://incubator.apache.org/jena/documentation/tdb/) and use the Jena APIs or SPARQL for further processing.
Another reasonable and scalable approach would be to implement what you need in MapReduce, but this makes sense only if your processing touches a large fraction of the Freebase data and is not as trivial as counting lines. This is more expensive than using your own machine: you need a Hadoop cluster, or you need to use Amazon EMR. (I should probably write a MapReduce version of freebase2rdf ;-))
My 2 cents.

KORMARC to MARC21 converter

Does anyone know if there is a free open-source solution to convert KORMARC (Korean MARC) into MARC21 (aka USMARC)?
While I'm not certain it has KORMARC support, you may want to try USEMARCON if you can find a mapping. From the USEMARCON page:
USEMARCON facilitates the conversion of catalogue records from one MARC format to another e.g. from UKMARC to UNIMARC. The software was designed as a toolbox-style application, allowing users with detailed knowledge of the source and target MARC formats to develop rules governing the behaviour of the conversion. Rules files may be supplemented by additional tables for more accurate conversion of MARC-specific character sets or coded information. The tables and rules files are simple ASCII text files and can be created using any standard text editor such as MS Windows Notepad.
Also, this thread from the Ask a Korean Studies Librarian Google Group might be useful, particularly the following message:
Library of Congress once tried to download records from the National
Library of Korea (NLK) to use as order records. LC wrote a
specification and developed a in-house program to convert KORMARC to
USMARC. Since NLK records only provide script, LC used a
transliterator to provide romanization for Voyager system developed by
non-LC programmer. The feedback of this method is not very positive
by LC staff. ... In stead of converting KORMARC to USMARC, a few research libraries
including LC is currently using MarcEdit with Excel spreadsheets which
are provided by Korean vendors based on contract. Vendors provide
both Korean script and romanization for several elements of MARC
fields (ISBN, title, author, publisher, place, series, etc.) in
different columns of spreadsheet for your order items. It sounds a
lot simpler to set up initially. And once MarcEdit is set up
properly, it creates MARC records.