NLTK synset with other languages - python-2.7

Right now I'm trying to compare words from two different files, one English and one Chinese. I have to identify whether any of the English words are related to the Chinese words and, if they are, whether they are equal or one is a hypernym of the other. I can use synsets for English, but what can I do about the Chinese words?

It looks like there is a Chinese (cmn) WordNet available from a university in Taiwan: http://casta-net.jp/~kuribayashi/multi/ . If this WordNet has the same format as the English WordNet, then you can probably use the WordNetCorpusReader (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet-pysrc.html#WordNetCorpusReader) in NLTK to import the Mandarin data. I don't know how you're doing your alignments or translations between the two datasets, but assuming you can map English to Chinese, this should help you figure out how the relation between two English words compares to the relation between two Mandarin words. Note that if your data uses the simplified script, you may also need to convert to the traditional script before using this cmn WordNet.
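If you do get the Mandarin data loaded into NLTK (recent NLTK versions ship Mandarin lemmas through the Open Multilingual Wordnet data, installable with nltk.download('omw-1.4')), the comparison could look roughly like the sketch below. 'dog' and 犬 are just placeholder words, and the lang='cmn' lookup assumes that data is present:

from nltk.corpus import wordnet as wn

def relation(word_en, word_cmn):
    # Return 'equal', 'en is hypernym', 'en is hyponym', or None for the
    # first related synset pair found.
    for s_en in wn.synsets(word_en):
        for s_cmn in wn.synsets(word_cmn, lang='cmn'):
            if s_en == s_cmn:
                return 'equal'
            if s_en in s_cmn.closure(lambda s: s.hypernyms()):
                return 'en is hypernym'
            if s_cmn in s_en.closure(lambda s: s.hypernyms()):
                return 'en is hyponym'
    return None

print(relation('dog', u'犬'))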

Related

Regex - How can you identify strings which are not words?

Got an interesting one, and can't come up with any solid ideas, so thought maybe someone else may have done something similar.
I want to be able to identify strings of letters in a longer sentence that are not words and remove them. Essentially things like kuashdixbkjshakd
Everything annoyingly is in lowercase, which makes it more difficult, but since I only care about English, I'm essentially looking for consonant clusters, groups of consonants that don't make phonetically pronounceable sounds.
Has anyone heard of/done something like this before?
EDIT: this is what ChatGPT tells me:
It is difficult to provide a comprehensive list of combinations of consonants that have never appeared in a word in the English language. The English language is a dynamic and evolving language, and new words are being created all the time. Additionally, there are many regional and dialectal variations of the language, which can result in different sets of words being used in different parts of the world.
It is also worth noting that the frequency of use of a particular combination of consonants in the English language is difficult to quantify, as the existing literature on the subject is limited. The best way to determine the frequency of use of a particular combination of consonants would be to analyze a large corpus of written or spoken English.
In general, most combinations of consonants are used in some words in the English language, but some combinations of consonants may be relatively rare. Some examples of relatively rare combinations of consonants in English include "xh", "xw", "ckq", and "cqu". However, it is still possible that some words with these combinations of consonants exist.
You could pass every single word in the sentence to a function that checks whether the word is listed in a dictionary. There are a good number of dictionary text files on GitHub. To speed up the process, use a hash map :)
You could also use an auto-correction API or a library.
Algorithm to combine both methods:
Run sentence through auto correction
Run every word through dictionary
Delete words that aren't listed in the dictionary
This could remove typos and words that are non-existent.
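A minimal sketch of the dictionary-lookup idea in Python, assuming a plain word list with one word per line (the path below is just an example; use whichever list you downloaded):

def load_words(path='/usr/share/dict/words'):
    # Load the word list into a set (hash-based) for fast lookups.
    with open(path) as f:
        return {line.strip().lower() for line in f}

def strip_nonwords(sentence, words):
    # Keep only tokens that appear in the dictionary.
    return ' '.join(w for w in sentence.split() if w.lower() in words)

words = load_words()
print(strip_nonwords('this is kuashdixbkjshakd a test', words))
# -> 'this is a test'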
You could train a simple model on sequences of characters which are permitted in the language(s) you want to support, and then flag any which contain sequences which are not in the training data.
The LangId language detector in SpamAssassin implements the Cavnar & Trenkle language-identification algorithm which basically uses a sliding window over the text and examines the adjacent 1 to 5 characters at each position. So from the training data "abracadabra" you would get
a 5
ab 2
abr 2
abra 2
abrac 1
b 2
br 2
bra 2
brac 1
braca 1
:
With enough data, you could build a model which identifies unusual patterns (my suggestion would be to try a window size of 3 or smaller for a start, and train it on several human languages from, say, Wikipedia), but it's hard to predict exactly how precise this will be.
SpamAssassin is written in Perl and it should not be hard to extract the language identification module.
As an alternative, there is a library called libtextcat which you can run standalone from C code if you like. The language identification in LibreOffice uses a fork which they adapted to use Unicode specifically, I believe (though it's been a while since I last looked at that).
Following Cavnar & Trenkle, all of these truncate the collected data to a few hundred patterns; you would probably want to extend this to cover at least all the 3-grams you find in your training data.
Perhaps see also Gertjan van Noord's link collection: https://www.let.rug.nl/vannoord/TextCat/
Depending on your test data, you could still get false positives e.g. on peculiar Internet domain names and long abbreviations. Tweak the limits for what you want to flag - I would think that GmbH should be okay even if you didn't train on German, but something like 7 or more letters long should probably be flagged and manually inspected.
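Here is a rough Python sketch of the sliding-window idea with a window size of 3; the training file name and the 0.5 cut-off are arbitrary assumptions, and in practice you would train on far more text and tune the threshold:

from collections import Counter

def char_ngrams(word, n=3):
    # Pad with spaces so word boundaries become patterns too.
    padded = ' ' + word.lower() + ' '
    return [padded[i:i+n] for i in range(len(padded) - n + 1)]

def train(corpus_text, n=3):
    model = Counter()
    for word in corpus_text.split():
        model.update(char_ngrams(word, n))
    return model

def looks_unusual(word, model, n=3, cutoff=0.5):
    # Flag words where many character n-grams never occurred in training.
    grams = char_ngrams(word, n)
    unseen = sum(1 for g in grams if g not in model)
    return float(unseen) / len(grams) > cutoff

model = train(open('training_text.txt').read())   # e.g. a Wikipedia dump
print(looks_unusual('kuashdixbkjshakd', model))    # likely True
print(looks_unusual('sentence', model))            # likely False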
This will match words with more than 5 consonants (you probably want "y" to not be considered a consonant, but it's up to you):
\b[a-z]*[b-z&&[^aeiouy]]{6}[a-z]*\b
5 was chosen because I believe witchcraft has the longest chain of consonants of any English word. You could dial back "6" in the regex to say 5 or even 4 if you don't mind matching some outliers.
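Note that the character-class intersection ([b-z&&[^aeiouy]]) is Java/ICU regex syntax; in Python's re module the same idea could be spelled out explicitly, roughly like this:

import re

# Six or more consecutive consonants (y not counted as a consonant here).
pattern = re.compile(r'\b[a-z]*[b-df-hj-np-tv-xz]{6}[a-z]*\b')

print(pattern.findall('this is kuashdixbkjshakd witchcraft'))
# -> ['kuashdixbkjshakd']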

regex: I want to check whether any English dictionary words are present in my text file

I have a text file with many Sanskrit words, but in between there are some English sentences that got in by mistake. It's a very big file and difficult to scroll through and check. So is there a way, using regex, that I can find any matching English dictionary words in that file?
duñkaraà me babhüvätra
tvädåçaà mäna-bhaïjanam
ato 'tra muralé tyaktä
lajjayaiva mayä priyä
aho bata mayä tatra
kåtaà yädåk sthitaà yathä
tad astu kila düre 'tra
nirvaktuà ca na çakyate
The situation there cannot even be described here.
ekaù sa me tad vraja-loka-vat priyas
tädåë mahä-prema-bhara-prabhävataù
vakñyaty adaù kiïcana bädaräyanir
maj-jévite çiñya-vare sva-sannibhe
çré-parékñid uväca
etädåçaà tad vraja-bhägya-vaibhavaà
samrambhataù kértayato mahä-prabhoù
punas tathä bhäva-niveça-çaìkayä
In the above text, the line "The situation there cannot even be described here." is in English. So is there any easy way to search out whether there are any English dictionary words?
I am using Linux, so any command is fine, but I'd prefer to use regex.
If each 'Sanskrit' word always has a special character like 'ù', then you could check against a regex of a word (\w+).
Since this is not the case for words like 'priyas', you would have to check each word in a data store that holds all the English words. Unfortunately, you cannot check for a valid English word any other way.
A faster search could be done using a trie.
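For example, a small Python sketch along those lines; the word-list path, the input file name, and the threshold of 3 hits per line are assumptions you would adjust:

def load_words(path='/usr/share/dict/words'):
    with open(path) as f:
        # Ignore very short entries to cut down on accidental matches.
        return {w.strip().lower() for w in f if len(w.strip()) > 2}

english = load_words()
with open('verses.txt') as f:
    for lineno, line in enumerate(f, 1):
        tokens = [t.strip(".,;:'").lower() for t in line.split()]
        hits = [t for t in tokens if t in english]
        if len(hits) >= 3:   # several dictionary hits: probably an English sentence
            print('%d: %s' % (lineno, line.rstrip()))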
I am not very familiar with Unicode handling on Linux, but I can give you some directions.
According to Wikipedia, Sanskrit written in Devanagari script uses the Devanagari Unicode block.
The main Devanagari block is U+0900–U+097F (Devanagari Extended is U+A8E0–U+A8FF).
You may need to convert the file to UTF-8 first, for example with iconv.
Then set up a regex condition that excludes the Devanagari range:
\S+[^\s\x{0900}-\x{097F}.]+.*
(The exact escape syntax for the Unicode range depends on your regex flavor.)
This will make it easier to find the English sentences.

How to get the same result in C or C++ as toLowerCase in Java or string.lower() in Python?

I need a C or C++ function (or library) which works like String.toLowerCase() in Java.
I know that I can use tolower() for English, but what I need is a function (or library) that covers languages globally. (Actually, it needs to cover the 9 languages listed below.)
language list
Dutch
English
French
German
Italian
Portuguese
Russian
Spanish
Ukrainian
Also, the characters in the first line below are the input and the second line is the expected result:
LINE 1:
AÁÀÂÄĂĀÃÅÆBCĆČÇDEÉÈÊËĚĘFGHℏIÍÌÎÏJKLŁMNŃŇÑOÓÒÔÖÕØŒPQRŘSŚŠŞTŢUÚÙÛÜŪVWXYÝZŹŽΑΔΕΘΛΜΝΠΡΣΣΦΩАБВГҐДЕЁЄЖЗИЙІЇКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
LINE 2:
aáàâäăāãåæbcćčçdeéèêëěęfghℏiíìîïjklłmnńňñoóòôöõøœpqrřsśšştţuúùûüūvwxyýzźžαδεθλμνπρσςφωабвгґдеёєжзийіїклмнопрстуфхцчшщъыьэюя
I verified the results with Java's toLowerCase() and Python's string.lower(),
and both are correct.
Is there any way to convert letters to lowercase like this in C or C++?
One important thing: the letters are read from a file encoded in UTF-8!
Please help me. My English is not very good, so please use simple English as much as you can.
I think you will find what you need in the Boost libraries - see http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/conversions.html
Quoting from their website:
Boost.Locale gives powerful tools for development of cross platform
localized software - the software that talks to user in its language.
Provided Features:
Correct case conversion, case folding and normalization.
Collation (sorting), including support for 4 Unicode collation levels.
....
You get the idea, I hope. The function you need is
boost::locale::to_lower(yourUTF8String)
(note the lowercase namespace; for correct results you also need a suitable std::locale, created with boost::locale::generator, either passed as the second argument or installed as the global locale).

Remove accents from lists of foreign words

Do you know if there are any Linux programs out there to remove accents from lists of foreign words (in UTF-8)? Like Spanish, Czech, French. For instance:
administrátoři (czech) administratori
français (french) francais
niñez (spanish) ninez etc.
I know I could do it manually with sed, but it's relatively time-consuming considering that I'm working on a lot of languages. I thought a program that could do just that might exist already.
What you want is called Unicode decomposition -- the reverse process of Unicode composition (where you combine a base character with a diacritic). There are a number of related SO questions using:
JavaScript
ActionScript
Python
which you can use as a starting point.
The Python standard library has unicodedata.decomposition, which returns a character's decomposition mapping.
Your system probably also has iconv, and with suitable normalization it may get you there too!
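For example, a small Python sketch of the decomposition approach: normalize to NFKD, then drop the combining marks.

# -*- coding: utf-8 -*-
import unicodedata

def strip_accents(text):
    # Decompose each character, then keep everything except combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents(u'administrátoři français niñez'))
# -> administratori francais ninez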
Did you try using recode (at https://github.com/pinard/Recode/)? It removes accents while trying hard to preserve information and also can produce xlat tables expressed in C.
$ cat testfile
administrátoři (czech) administratori
français (french) francais
niñez (spanish) ninez etc.
$ LANG= recode -f UTF-8..texte <testfile
administrtori (czech) administratori
franc,ais (french) francais
niez (spanish) ninez etc.

word to syllable converter

I am writing a piece of code in C++ for which I need a word-to-syllable converter. Is there any open source standard algorithm available, or any other links which can help me build one?
For a word like invisible, the syllables would be in-viz-uh-ble.
It should ideally be able to parse even complex words like "invisible".
I already found links to algorithms in Perl and Python, but I want to know if any library is available in C++.
Thanks a lot.
Your example shows a phonetic representation of the word, not simply a split into syllables. This is a complex NLP issue.
Take a look at Soundex and Metaphone. There are C/C++ implementations of both.
Also, many dictionaries provide the IPA notation of words. Take a look at the Wiktionary API.
For detecting syllables in words, you could adapt a project of mine to your needs.
It's called tinyhyphenator.
It gives you an integer list of all possible hyphenation indices within a word. For German it works quite accurately. You would have to obtain the index list and insert the hyphens yourself.
By "adapt" I mean adding a specification of English syllables. Take a look at the source code; it is meant to be quite self-explanatory.