Error while filtering English-language tweets only - python-2.7

I am extracting tweets written only in English, and I used the following filter:
stream.filter(stall_warnings=True, track=['#brain'], languages=['en'])
Unfortunately, this filter returns tweets that are a combination of English and some other language.
Please see the tweet here
How can I extract only tweets that are written entirely in English?
Note: I am sorry if it is wrong to link to someone else's tweet.

Tweets are classified by Twitter into one language or another. This classification isn't always correct, and if a tweet uses multiple languages, Twitter simply assigns it to one of them.
So you will need to filter them in your app, either against a dictionary or using a language-detection library, to be 100% sure that only English is used in the tweets you receive.
Source: https://blog.twitter.com/2013/introducing-new-metadata-for-tweets
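As an illustration of the second suggestion, here is a minimal sketch using the langdetect library (one possible choice of language-detection library, not something the Twitter API provides; the 0.9 threshold is an arbitrary assumption you would tune):

from langdetect import detect_langs
from langdetect.lang_detect_exception import LangDetectException

def is_mostly_english(text, threshold=0.9):
    try:
        # detect_langs returns candidates with probabilities, e.g. [en:0.83, ta:0.17]
        best = detect_langs(text)[0]
        return best.lang == 'en' and best.prob >= threshold
    except LangDetectException:
        # raised when the text has no detectable features (e.g. only URLs or emoji)
        return False

Inside your stream listener, you would then drop statuses where is_mostly_english(status.text) is False before processing them.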

Related

Is there a way to explicitly set start and end of sentences on Google Natural Language API sentiment analysis?

I am using the Google Natural Language API for sentiment analysis. I have an array of text strings that I concatenate and send to Google to get sentiment values for each sentence, but Google has its own idea of where sentences start and end, so I am getting scrambled sentiment results and a different count of sentiment scores than sentences sent.
If you could only set a flag like <sentence> </sentence> around each phrase you would like treated as a separate sentence, that would be great, but the docs have no info about it.
P.S.
I am using it this way (sending one chunk rather than a separate API request for each sentence) because I have thousands of sentences and latency is important.
With the Google Natural Language API, a way to mark separate sentences that seems to work (although it is not officially documented) is to set the document type to HTML and end each sentence with .<br> (a full stop plus an HTML line-break tag) to hint an end-of-sentence to the parser.
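A minimal sketch of that workaround using the google-cloud-language client (v2-style calls; the exact client usage is an assumption about your setup, and the trick itself is undocumented behaviour):

from google.cloud import language_v1

def analyze_batch(sentences):
    client = language_v1.LanguageServiceClient()
    # Join the fragments with ".<br>" so each one hints a sentence boundary.
    html = "<br>".join(s.rstrip(".") + "." for s in sentences)
    document = language_v1.Document(
        content=html,
        type_=language_v1.Document.Type.HTML,
    )
    response = client.analyze_sentiment(request={"document": document})
    # One (text, score) pair per sentence the parser recognized.
    return [(s.text.content, s.sentiment.score) for s in response.sentences]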

Google Vision API: both English and Arabic on the image

We are trying to read text from images that contain English and/or Arabic text. We need to extract the detected text in both languages.
When passing the hints as en and ar, the English text is sometimes misinterpreted as Arabic. If we pass English alone as the preferred language to the Vision service call, the English text is returned correctly.
But since we need both, I suppose we have to pass both en and ar.
Is this correct? Is there anything we can do about this?
As per the OCR Languages Support documentation, it seems English doesn't have to be specified, since it uses the Latin alphabet. Did you try specifying only Arabic in the hints?
If that doesn't work, could you share an example image as an attachment to this post, along with the code showing how you issue the API call? You could also post the meaningful part of the response.
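A minimal sketch of that suggestion with the google-cloud-vision Python client (v2-style calls; the client usage is an assumption, since the question doesn't show how the call is issued):

import io
from google.cloud import vision

def detect_text(path):
    client = vision.ImageAnnotatorClient()
    with io.open(path, "rb") as f:
        image = vision.Image(content=f.read())
    # Hint only "ar"; Latin-script (English) text is detected automatically.
    response = client.text_detection(
        image=image,
        image_context={"language_hints": ["ar"]},
    )
    annotations = response.text_annotations
    # The first annotation holds the full detected text block.
    return annotations[0].description if annotations else ""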

NER model to recognize Indian names

I am planning to use the Named Entity Recognition (NER) technique to identify person names (most of which are Indian names) in a given text. I have already explored the CRF-based NER model from Stanford NLP; however, it is not very accurate at recognizing Indian names. Hence I decided to create my own custom NER model via supervised training. I have a fair idea of how to create an NER model using the Stanford NER CRF, but creating a large training corpus with manual annotation is something I would like to avoid, as it is a huge effort for an individual, and obtaining diverse people's names from different states of India is also a challenge. Could anybody suggest an automated/programmatic way to prepare a labelled training corpus with at least 100k Indian names?
I have already looked into the Facebook and LinkedIn APIs, but did not find a way to extract 100k users' full names for a given location (e.g. India).
I ended up doing the following to create an NER model that identifies Indian names. This may be useful to anybody looking to create a custom NER model to recognize non-English person names, since most publicly available NER models, such as the ones from Stanford NLP, were trained on English names and hence are more accurate at identifying English (British/American) names.
Find an Indian celebrity with a Twitter account and a huge number of followers (in my case, I chose Sachin Tendulkar).
Create a program in the language of your choice to call the Twitter REST API (GET followers/list) to get the names of all the followers of the celebrity and save them to a file. We can safely assume most of the followers are Indian. Note that there is an API rate limit (30 requests per 15-minute window), so the program should handle that. In our case, we built the program as a Windows service that runs every 15 minutes.
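A minimal sketch of this step using tweepy (the library choice and the screen name are assumptions; this is tweepy v3-style code, where wait_on_rate_limit sleeps through the 15-minute windows instead of raising an error):

import tweepy

# Placeholder credentials - substitute your own app's keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

with open("follower_names.txt", "w") as out:
    # Cursor pages through GET followers/list for the chosen account.
    for follower in tweepy.Cursor(api.followers, screen_name="sachin_rt").items():
        out.write(follower.name + "\n")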
Since some Twitter users' names may not be valid person names, it is advisable to add some rule-based logic (such as a regex) to keep only seemingly real names and add only those to the file.
Once the file with real names is generated, create another program to build the training data file containing these names labelled/annotated as PERSON, as well as non-entity words annotated as OTHER. If you are using the Stanford NER CRF classifier, the program should generate a training (TSV) file with two columns: the first containing the word (token) and the second the label.
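A minimal sketch of these two steps together (the "real name" heuristic of 2-4 capitalized alphabetic words is an assumption you would refine):

import re

# Assumption: a plausible person name is 2-4 capitalized alphabetic words.
NAME_RE = re.compile(r"^[A-Z][a-z]+( [A-Z][a-z]+){1,3}$")

with open("follower_names.txt") as src, open("train.tsv", "w") as out:
    for line in src:
        name = line.strip()
        if not NAME_RE.match(name):
            continue
        # Stanford NER CRF format: one token per line, tab, label,
        # blank line between entries. You would also interleave tokens
        # from ordinary text labelled OTHER as negative examples.
        for token in name.split():
            out.write(token + "\tPERSON\n")
        out.write("\n")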
Once the training corpus is generated programmatically, you can follow the link below to create your custom NER model to recognize Indian names:
http://nlp.stanford.edu/software/crf-faq.shtml#a
This website has done the work for us! It provides a solution to these problems:
Challenges in Indian Language NER
Indian languages belong to several language families, the major ones being the Indo-European (Indo-Aryan) languages and the Dravidian languages.
The challenges in NER arise from several factors. Some of the main ones are listed below:
Morphologically rich - identification of the root is difficult and requires the use of morphological analysers.
No capitalization feature - in English, capitalization is one of the main features, whereas Indian languages have no capitalization.
Ambiguity - ambiguity between common and proper nouns, e.g. the common word "Roja", meaning rose flower, is also a person's name.
Spelling variations - in web data, different people spell the same entity differently; for example, the Tamil person name "Roja" is spelt "rosa" or "roja".
The whole corpus is provided.
Named Entity Recognition for Indian Languages and English
Good luck getting the passwords for the zip files!
Cheers!
A suggestion: you could try to exploit the Indian-language versions of Wikipedia for training, or to automatically create a gazetteer.
I don't know if it is the most efficient/quickest solution, but a lot of research exploits Wikipedia and its semi-structured content (for example, each page is annotated with several categories).
You can have a look at these articles to find an idea that interests you:
https://scholar.google.fr/scholar?q=named+entity+recognition+using+wikipedia&btnG=&hl=fr&as_sdt=0%2C5
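As a starting point, a minimal sketch that pulls page titles from a Wikipedia category through the public MediaWiki API to seed a gazetteer (the category name is an assumption; any category of Indian people would do):

import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category, limit=500):
    names = []
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": limit,
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        names += [m["title"] for m in data["query"]["categorymembers"]]
        if "continue" not in data:       # no more pages to fetch
            return names
        params.update(data["continue"])  # standard MediaWiki continuation

print(category_members("Indian male film actors")[:10])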

NLTK ontology/metonymy tools

Is there any tool to build an ontology from a text?
The same question about metonymy - is there any way to detect metonymy in a sentence/text?
Thanks in advance.
Ontology
There is nothing in particular in NLTK that you can use to create an ontology based on text. You will have to extract concepts from the text (you can start with named entity recognition, multi-word expressions, or information extraction). The rest is about somehow linking them to existing ontologies (e.g. you can start with topics related to your text).
Metonymy
You can use WordNet to identify metonymic relationships between the words or concepts from the text you process. This is doable via the NLTK interface to WordNet. You would have to identify a synset containing your concept/word and traverse the metonymic relationship between that synset and another. Your question could lead to wildly varying implementations depending on your requirements, so let me leave you with a hint snippet here.
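A minimal sketch of that hint (the word "crown" and the choice of holonym/meronym links, which are the closest WordNet comes to metonymy, are assumptions):

from nltk.corpus import wordnet as wn

# Pick one sense of a word - in practice you would disambiguate properly.
synset = wn.synsets("crown")[0]
print(synset.definition())

# Traverse part/member relations of this synset; links such as
# part_holonyms ("X is part of Y") often capture metonymic usage.
print(synset.part_holonyms())
print(synset.member_holonyms())
print(synset.part_meronyms())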

How to generate random words from real languages

How can I generate a random word from a real language?
Does anybody know of any internet API with this functionality?
For example, I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=en' and I get the response 'Town'. Or 'Fast'. Or 'Received'... I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=ru' and I get the response 'Ходить' ("to walk"). Or 'Шапка' ("hat"). Or 'Отправлено' ("sent")... Any form (noun, adjective, verb, etc.) of any word in any language.
I found the resource 'http://www.randomlists.com/random-words', but it is not in JSON format, covers only English, and there is no guarantee it will keep working long-term.
Any ideas, please?
See this answer: https://stackoverflow.com/questions/824422/can-i-get-an-english-dictionary-word-list-somewhere Download a word dictionary, stick it in a database and fetch a random record, or read a random line from the file each time. This way you don't depend on a third-party API, and you can extend it to all the languages you can find word lists for.
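A minimal sketch of the read-a-random-line approach (the words_<lang>.txt naming is an assumption; use one plain word list per language code):

import random

def random_word(lang="en"):
    # One word per line, e.g. words_en.txt, words_ru.txt, ...
    with open("words_%s.txt" % lang) as f:
        return random.choice(f.read().splitlines())

print(random_word("en"))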
You can download the OpenOffice dictionaries. They come as extensions (.oxt), which are nothing more than ZIP files; you can open them with 7-Zip or similar. Inside you will find lots of files; the interesting ones for you are the *.dic files. Note that they also contain entries such as number words.
When you encounter something like abandon/LdS, get rid of the /LdS part - these are affix flags used by hunspell.
Take these *.dic files, use their names as keys, put them into a database, and pick a random word from there for a given language code.
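A minimal sketch of parsing such a *.dic file (the file name and the latin-1 encoding are assumptions; the real encoding is declared in the matching .aff file):

import codecs
import random

def load_dic(path):
    with codecs.open(path, encoding="latin-1") as f:
        lines = f.read().splitlines()
    # The first line of a hunspell .dic file is the approximate entry count;
    # each remaining entry may carry affix flags after a slash, e.g. abandon/LdS.
    return [line.split("/")[0] for line in lines[1:] if line.strip()]

words = load_dic("en_US.dic")
print(random.choice(words))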
Update
Older, but easier to access, are the archived hunspell dictionaries from OpenOffice.
This question can be viewed in two ways, and therefore I give two answers:
To collect words, I would run a spider on websites in a known language (Wikipedia is a good starting point) and strip the HTML tags.
To generate words from a real language is trickier. Using statistics from the collected words, it is possible to use Markov chains to produce statistically plausible words. I have tried letter-by-letter generation, and that works poorly; it is probably a better approach to use syllable construction instead.
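A minimal sketch of the Markov-chain idea, using two-letter contexts as a crude middle ground between letter-by-letter generation (which, as noted, works poorly) and full syllable construction (the seed list is a placeholder; you would feed in the collected words):

import random
from collections import defaultdict

def build_chain(words, order=2):
    chain = defaultdict(list)
    for w in words:
        w = "^" * order + w.lower() + "$"   # ^ pads the start, $ marks the end
        for i in range(len(w) - order):
            chain[w[i:i + order]].append(w[i + order])
    return chain

def generate(chain, order=2, max_len=12):
    ctx, out = "^" * order, []
    while len(out) < max_len:
        nxt = random.choice(chain.get(ctx, "$"))
        if nxt == "$":                      # hit an end-of-word marker
            break
        out.append(nxt)
        ctx = ctx[1:] + nxt                 # slide the context window
    return "".join(out)

seed = ["town", "fast", "received", "random", "language", "generate"]
chain = build_chain(seed)
print(generate(chain))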