Is there any tool to build an ontology from text?
The same question about metonymy - is there any way to detect metonymy in a sentence/text?
Thanks in advance.
Ontology
There is nothing in particular in NLTK that you can use to create an ontology from text. You will have to extract concepts from the text (you can start with Named Entity Recognition, Multi-Word Expressions or Information Extraction). The rest is about somehow linking them to existing ontologies (e.g. you can start with topics related to your text).
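For example, a minimal NLTK sketch of pulling candidate concepts out of raw text with the built-in named-entity chunker (the sample sentence is made up, and the listed NLTK data packages have to be downloaded first):

import nltk

# Requires nltk.download() of: 'punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'
text = "Barack Obama visited the headquarters of Google in Mountain View."  # made-up example

tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

# Collect named entities as candidate concepts for linking to an ontology.
concepts = [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() in ("PERSON", "ORGANIZATION", "GPE")]
print(concepts)  # e.g. ['Barack Obama', 'Google', 'Mountain View']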
Metonymy
You can use WordNet to identify metonymic relationships between the words or concepts from the text you process. This is doable via the NLTK interface to WordNet. You would have to identify a synset containing your concept/word and traverse the metonymic relationships of that synset with others. Your question could lead to wildly varying implementations depending on your requirements, so let me leave you with a hint snippet here.
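As a hint, a minimal sketch follows; 'tree' is only a stand-in word, and since WordNet has no relation labelled "metonymy", the part/whole and member relations are used here as a rough proxy:

from nltk.corpus import wordnet as wn

# Pick a synset containing the concept/word you extracted from your text.
synset = wn.synsets("tree")[0]        # the word "tree" is only an example

# Traverse relations that often underlie metonymic usage (part/whole, member, hypernym).
print("part meronyms:", synset.part_meronyms())
print("member holonyms:", synset.member_holonyms())
print("hypernyms:", synset.hypernyms())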
Related
I am currently working on my thesis and I am trying to analyze the results of Illumina NGS sequencing. I am not really familiar with bioinformatics, and in this part of my project I am trying to compare two VCF files corresponding to the results from healthy tissue and tumor tissue. I want to compare these VCF files and remove their similarities. More specifically, I want to remove the information of the healthy tissue from the tumor one. Do you have any suggestions on which tool I should use or any way I can do this analysis? If you can help me I would be more than thankful. Thank you in advance!
I understand your problem. The first thing I would recommend is the Unix tool (I don't know which OS you're running) called VCFtools. It's pretty simple to use. But if you want to do all the processing with, for example, Python, you can use the pandas library, which helps to process data in column format, or the PyVCF library, which is a parser for VCF files. I can help you more if you can provide some example data you're processing.
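If you just want to subtract the healthy variants from the tumor file, here is a minimal sketch using only the standard library; the file names are placeholders, and comparing only CHROM/POS/REF/ALT may be too coarse for a real somatic-variant analysis:

def load_variants(path):
    # Return the set of (CHROM, POS, REF, ALT) tuples from a VCF file.
    variants = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):                 # skip meta-info and header lines
                continue
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
            variants.add((chrom, pos, ref, alt))
    return variants

healthy = load_variants("healthy.vcf")               # placeholder file names
tumor = load_variants("tumor.vcf")

somatic_candidates = tumor - healthy                 # keep only tumor-specific variants
print(len(somatic_candidates), "variants found only in the tumor sample")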
I have built a gensim Doc2vec model. Let's call it doc2vec. Now I want to find the most relevant words to a given document according to my doc2vec model.
For example, I have a document about "java" with the tag "doc_about_java". When I ask for similar documents, I get documents about other programming languages and topics related to java. So my document model works well.
Now I want to find the most relevant words to "doc_about_java".
I followed the solution from the closed question How to find most similar terms/words of a document in doc2vec?, but it gives me seemingly random words; the word "java" is not even among the first 100 similar words:
docvec = doc2vec.docvecs['doc_about_java']
print doc2vec.most_similar(positive=[docvec], topn=100)
I also tried like this:
print doc2vec.wv.similar_by_vector(doc2vec["doc_about_java"])
but it didn't change anything. How can I find the most similar words to a given document?
Not all Doc2Vec modes even train word-vectors. In particular, the PV-DBOW mode (dm=0), which often works very well for doc-vector comparisons, leaves word-vectors at randomly-assigned (and unused) positions.
So that may explain why the results of your initial attempt to get a list-of-related-words seem random.
To get word-vectors, you'd need to use PV-DM mode (dm=1), or add optional concurrent word-vector training to PV-DBOW (dm=0, dbow_words=1).
(If this isn't the issue, there may be other problems in your training setup, so you should show more detail about your data source, size, and code.)
(Separately, your alternate attempt, doc2vec["doc_about_java"], is retrieving a word-vector for the key "doc_about_java" (which may not be present at all). To get the doc-vector, use doc2vec.docvecs["doc_about_java"], as in your first code block.)
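As a rough sketch (not your exact setup - the `texts` variable and all parameter values are placeholders, and keyword names such as vector_size vs. size differ between gensim versions), training with word-vectors enabled and then querying them could look like this:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Placeholder: `texts` is a list of (list_of_tokens, tag) pairs you already have.
documents = [TaggedDocument(words=tokens, tags=[tag]) for tokens, tag in texts]

# PV-DBOW plus concurrent word-vector training; alternatively use dm=1 (PV-DM).
model = Doc2Vec(documents, dm=0, dbow_words=1, vector_size=100,
                window=5, min_count=2, epochs=20)

docvec = model.docvecs["doc_about_java"]                    # the doc-vector, as in your first block
print(model.wv.most_similar(positive=[docvec], topn=20))    # words closest to that doc-vector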
I've been looking at using AWS Machine Learning to implement a categorizer for my project. I have something on the order of 40,000 documents that have several text-only features. For example: Name (< 200 chars) and Description (potentially hundreds / thousands of words).
In a nutshell, I'm looking to assign categories (0 or more) to each document based on its content.
I've read through the AWS ML tutorial and checked out a few other sources but the available material seems to deal with feature fields that are numeric, boolean, datetime, or otherwise non-textual.
Is AWS Machine Learning capable of performing multi-class categorization on documents based primarily (or possibly only) on text fields? And if so, is there any reference material available for this particular avenue?
Mainly, you don't need "text fields": first you have to create a vector space model (VSM) from your corpus (texts), then you can weight the VSM with tf-idf, and you end up with numeric fields.
Are you sure you want to use AWS ML to train on a corpus of only 40,000 documents?
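As an illustration (outside AWS, using scikit-learn; the documents are made up), turning text fields into tf-idf weighted numeric features takes only a few lines:

from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents standing in for your Name + Description fields.
docs = [
    "Intro to machine learning with Python",
    "Annual financial report for 2016",
    "Deep learning for image classification",
]

vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X = vectorizer.fit_transform(docs)    # sparse documents-by-terms matrix of tf-idf weights

print(X.shape)                        # numeric features you can feed to any classifier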
How can I generate a random word from a real language?
Does anybody know of any internet API with this functionality?
For example, I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=en' and I get the response 'Town'. Or 'Fast'. Or 'Received'... For example, I send an HTTP request to 'ht_tp://www.any...api.com/getword?lang=ru' and I get the response 'Ходить'. Or 'Шапка'. Or 'Отправлено'... Any form (noun, adjective, verb etc...) of any word of any language.
I found the resource 'http://www.randomlists.com/random-words', but it is not in JSON format, it is English only, and there is no guarantee it will keep working in the long term.
Any ideas, please.
See this answer: https://stackoverflow.com/questions/824422/can-i-get-an-english-dictionary-word-list-somewhere Download a word dictionary, stick it in a database and fetch a random record, or read a random line from the file each time. This way you don't depend on a 3rd-party API and you can extend it to all the languages you can find word lists for.
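For example, assuming you have already downloaded a plain-text word list (one word per line) and saved it as words_en.txt (the file names here are just placeholders):

import random

def random_word(path):
    # Pick a random word from a plain-text word list, one word per line.
    with open(path, encoding="utf-8") as handle:
        words = [line.strip() for line in handle if line.strip()]
    return random.choice(words)

print(random_word("words_en.txt"))    # placeholder: keep one file per language
print(random_word("words_ru.txt"))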
You can download the OpenOffice dictionaries. They come as extensions (.oxt), which are nothing other than ZIP files; you can open them with 7-Zip or similar. Inside you will find lots of files; the interesting ones for you are the *.dic files. They will also contain resolutions or number words.
When you encounter something like abandon/LdS, get rid of the /LdS - these flags are used by hunspell.
Take these *.dic files, use their names as keys, put them into a database, and pick a random word from there for a given language code.
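A rough sketch of that last step (en_US.dic is a placeholder name; note that the first line of a .dic file is just a word count, and some dictionaries use encodings other than UTF-8, as declared in their .aff file):

import random

def words_from_dic(path):
    # Extract plain words from a hunspell .dic file, dropping the /flag suffixes.
    words = []
    with open(path, encoding="utf-8") as handle:
        next(handle)                              # first line is the word count
        for line in handle:
            word = line.strip().split("/", 1)[0]  # e.g. "abandon/LdS" -> "abandon"
            if word:
                words.append(word)
    return words

dictionaries = {"en": words_from_dic("en_US.dic")}   # key the lists by language code
print(random.choice(dictionaries["en"]))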
Update
Older, but easier to access: the archived hunspell dictionaries from OpenOffice.
This question can be viewed in two ways and therefore I give two answers:
To collect words, I would run a spider on websites with known language (Wikipedia is a good starting point) and strip HTML tags.
To generate words from a real language is trickier. Using statistics from the collected words, it is possible to use Markov chains that produce statistically plausible words. I have tried letter-by-letter generation, and that works poorly; it is probably better to use syllable construction instead.
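A toy letter-by-letter sketch (which, as said, gives mediocre words; a syllable-based version would follow the same pattern), with a made-up seed list:

import random
from collections import defaultdict

corpus = ["hello", "help", "held", "hollow", "yellow", "fellow"]   # made-up training words

# Letter-bigram transition table; '^' marks the start of a word, '$' the end.
transitions = defaultdict(list)
for word in corpus:
    chars = ["^"] + list(word) + ["$"]
    for a, b in zip(chars, chars[1:]):
        transitions[a].append(b)

def generate_word():
    # Walk the chain from the start marker until the end marker is drawn.
    out, current = [], "^"
    while True:
        current = random.choice(transitions[current])
        if current == "$":
            return "".join(out)
        out.append(current)

print([generate_word() for _ in range(5)])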
I have a corpus of a language that has not been POS annotated before, that is, it has no existing tagset.
Apart from manually tagging it in a text editor like Notepad, is there any automatic approach to start tagging a new, untagged corpus like mine?
Thanks.
It depends how detailed the tagset should be. 10-12 basic POS (Noun, Adjective, ..., foreign, punctuation) or more detailed (distinguishing verb forms, types of pronouns, gender, number, tense, ...).
The former is pretty much universal (see the categories of the Multext-East tagset or Google's universal tagset).
The latter is much more complicated, we have a paper about it. In short, we have a template for tagsets, then we modify it (dropping/adding categories and values) to suit a particular language.
Regarding annotation: again, it depends. If you have a small tagset you can just manually assign a tag to each word, say in Notepad or some simple GUI (we use this one, but there are probably better ones). If you have a tagset with hundreds or thousands of tags, then you probably want better tool support. The best thing is to use a (possibly overgenerating) morphological analyzer and a GUI that lets you choose among the options the analyzer suggests.
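If part of the corpus has already been hand-tagged, one common shortcut is to pre-annotate the rest with a simple tagger trained on that seed and let the annotator only correct the output. A minimal NLTK sketch (the seed sentences and the tags are made up):

import nltk

# Made-up seed data: a few hand-tagged sentences as (word, tag) pairs.
seed = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

# Unigram tagger with a default-tag fallback for unseen words.
tagger = nltk.UnigramTagger(train=seed, backoff=nltk.DefaultTagger("NOUN"))

# Pre-annotate an untagged sentence; a human then only fixes the mistakes.
print(tagger.tag(["the", "cat", "barks"]))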
Brat has a very nice GUI for manual annotation.