fuzzy matching japanese strings in python? - python-2.7

This problem has had me stumped for the whole day.
I have two Japanese strings that I want to fuzzy match in Python 2.7. Currently I'm using fuzzywuzzy:
from fuzzywuzzy import process

jpnStr = "日本語".encode('utf-8')
jpnList = ["日本語1".encode('utf-8'), "日本語2".encode('utf-8'), "日本語3".encode('utf-8')]
bestmatch = process.extractOne(jpnStr, jpnList)
but the resulting bestmatch is always
("日本語1",0)
How would I go about resolving this issue, or is there a best practice that I'm totally missing here? Sorry if I sound frustrated, it's been a roadblock for a while. Thanks in advance.

OK, I'm not sure how helpful this is, but I've found a workaround that lets me fuzzy match Japanese strings with fuzzywuzzy.
First, take the Unicode Japanese string, e.g. "日本語です".
Then write it out as ASCII escape text to a text file. The output will look something like "\u65e5\u672c\u8a9e\u3067\u3059".
Then read the text file back and compare the ASCII representations of the Japanese strings against each other. This gives a workable fuzzywuzzy match rating.
It's probably not an ideal method, but it works and is fairly accurate. Hope this helps somebody.
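The escape-based workaround can be sketched without going through a text file. This is only a minimal illustration, not the poster's actual script: it uses the standard library's difflib.SequenceMatcher as a stand-in scorer (an assumption; fuzzywuzzy's own scores may differ), and the helper names similarity and to_ascii_escapes are hypothetical. It also compares the Unicode strings directly, which gives a nonzero score with this scorer.

```python
# -*- coding: utf-8 -*-
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in scorer, not fuzzywuzzy)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def to_ascii_escapes(text):
    """Turn a Unicode string into its ASCII \\uXXXX escape representation."""
    return text.encode("unicode_escape").decode("ascii")


query = u"日本語"
choices = [u"日本語1", u"日本語2", u"日本語3"]

# The ASCII form described in the workaround.
escaped = to_ascii_escapes(u"日本語です")  # \u65e5\u672c\u8a9e\u3067\u3059

# Scores computed on the Unicode strings themselves.
direct_scores = [(c, similarity(query, c)) for c in choices]

# Scores computed on the escaped ASCII forms, as in the workaround.
escaped_scores = [
    (c, similarity(to_ascii_escapes(query), to_ascii_escapes(c)))
    for c in choices
]
```

The escaped strings are six times longer than the originals, so a one-character difference weighs less in escaped_scores than in direct_scores; the two ratings are not interchangeable.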

VSCode Snippets: Format File Name from my_file_name to MyFileName

I am creating custom snippets for Flutter/Dart. My goal is to take the file name (TM_FILENAME_BASE), remove all of the underscores, and convert it to PascalCase (or camelCase).
Here is a link to what I have learned so far regarding regex and vscode's snippets.
https://code.visualstudio.com/docs/editor/userdefinedsnippets
I have been able to remove the underscores nicely with the following code
${TM_FILENAME_BASE/[\\_]/ /}
I can even make it all caps
${TM_FILENAME_BASE/(.*)/${1:/upcase}/}
However, it seems that I cannot do both steps at once. I am not familiar with regex; this is just me fiddling around with it for the last couple of days.
If anyone could help out a fellow programmer just trying to make coding simpler, it would be really appreciated!
I expect "my_file_name" to become "MyFileName".
It's as easy as that: ${TM_FILENAME_BASE/(.*)/${1:/pascalcase}/}
For the camelCase version you mentioned, you can use:
${TM_FILENAME_BASE/(.*)/${1:/camelcase}/}
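To make the expected result concrete, here is a small Python sketch of the same conversion. It is not how VS Code implements its transforms; the function names to_pascal_case and to_camel_case are hypothetical and the sketch only handles underscore-separated names like the one in the question.

```python
def to_pascal_case(name):
    """Convert an underscore-separated name to PascalCase."""
    parts = [part for part in name.split("_") if part]
    # Upper-case only the first letter of each part, keep the rest as-is.
    return "".join(part[:1].upper() + part[1:] for part in parts)


def to_camel_case(name):
    """Convert an underscore-separated name to camelCase."""
    pascal = to_pascal_case(name)
    return pascal[:1].lower() + pascal[1:]


pascal_name = to_pascal_case("my_file_name")  # "MyFileName"
camel_name = to_camel_case("my_file_name")    # "myFileName"
```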

How to find and replace box character in text file?

I have a large text file that I'm going to be working with programmatically, but I have run into problems with a special character strewn throughout the file. The file is far too large to scan by eye for specific characters. I've been able to get rid of most of the other unwanted special characters with regex patterns, but there is a box character, similar to "□". When I tried to copy the character from the actual text file and paste it here I got "�", so the example box above is from the Windows Character Map, which lists the code 'U+25A1'. I'm not sure how to interpret that code or whether it's something I could use in a regex search.
Would anyone know how I could search for the box symbol similar to "□" in a UTF-8 encoded file?
EDIT:
Here is an example from the text file:
"� Prune palms when flower spathes show, or delay pruning until after the palm has finished flowering, to prevent infestation of palm flower caterpillars. Leave the top five rows."
The only problem is that, as mentioned in the original post, the square gets converted into a diamond question mark.
It's unclear where and how you are searching, although you could use the hex equivalent:
\x{25A1}
Example:
https://regex101.com/r/b84oBs/1
The black diamond with a question mark is not a character, per se. It is what a browser spits out at you when you give it unrecognizable bytes.
Find out where that data is coming from.
Determine its encoding. (Usually UTF-8, but might be something else.)
Be sure the browser is configured to display that encoding. Putting <meta charset=UTF-8> in the header of the page is likely to suffice.
I found a workaround using Notepad++ and this website. It's still not clear what encoding system the square is originally from, but when I post it into the query field in the website above or into the Notepad++ Conversion Table (Plugins > Converter > Conversion Table) it gives the hex-character code for the "Replacement Character" which is the diamond with the question mark.
Using this code in a regular expression, \x{FFFD}, in the Notepad++ search found all the squares, although it recognizes them as the Replacement Character.
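The same search can be sketched in Python for anyone processing the file programmatically. This is an illustration under assumptions: the byte 0x95 below is a hypothetical stand-in for whatever undecodable byte the real file contains, and strip_boxes is a made-up helper name. Decoding with errors="replace" turns bytes that are not valid UTF-8 into U+FFFD, which a regex can then match alongside U+25A1.

```python
# -*- coding: utf-8 -*-
import re

# Matches WHITE SQUARE (U+25A1) and REPLACEMENT CHARACTER (U+FFFD).
BOX_OR_REPLACEMENT = re.compile(u"[\u25a1\ufffd]")


def strip_boxes(text, replacement=u""):
    """Replace every box or replacement character in text."""
    return BOX_OR_REPLACEMENT.sub(replacement, text)


# Hypothetical raw bytes: 0x95 on its own is not valid UTF-8.
raw = b"\x95 Prune palms when flower spathes show"

# Invalid bytes become U+FFFD instead of raising UnicodeDecodeError.
decoded = raw.decode("utf-8", errors="replace")

cleaned = strip_boxes(decoded).strip()
```

Replacing U+FFFD discards whatever the original byte meant, so it is worth inspecting the raw bytes first if the character carries meaning (a bullet, for example).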

Convert \xc3\xd8\xe8\xa7\xc3\xb4\xd to human readable format

I am having trouble converting '\xc3\xd8\xe8\xa7\xc3\xb4\xd' (which is Thai text) to a readable format. I get this value from a smart card; it works on Windows but not on Linux.
If I print in my Python console, I get:
����ô
I tried to follow some google hints but I am unable to accomplish my goal.
Any suggestion is appreciated.
Your text does not seem to be Unicode text. Instead, it looks like it is in one of the Thai encodings, so you must know the encoding before printing the text.
For example, if we assume your data is encoded in TIS-620 (and the last byte is \xd2 instead of \xd), then it decodes to "รุ่งรดา".
To work with non-Unicode byte strings in Python 2, decode them explicitly: myString.decode("tis-620"). Calling sys.setdefaultencoding("tis-620") is another option, but changing the interpreter-wide default is generally discouraged.
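A minimal sketch of the explicit decode, under the same two assumptions as the answer: the data is TIS-620 and the truncated last byte is really \xd2.

```python
# -*- coding: utf-8 -*-

# Bytes from the question, with the last byte assumed to be \xd2.
raw = b"\xc3\xd8\xe8\xa7\xc3\xb4\xd2"

# Decode using the assumed TIS-620 encoding to get a Unicode string.
text = raw.decode("tis-620")

# Code points of the decoded characters, for inspection.
code_points = [hex(ord(ch)) for ch in text]
```

If the card actually uses a different Thai encoding, the decoded text will differ, so it is worth checking the result against a known value.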

trying to stem a string in natural language using python-2.7

I am importing from nltk.stem.snowball import SnowballStemmer
and I have a string as follows:
text_string="Hi Everyone If you can read this message youre properly using parseOutText Please proceed to the next part of the project"
I run this code on it:
words = " ".join(stemmer.stem(word) for word in text_string.split(" "))
and I get the following, which has a couple of 'e's missing. I can't figure out what is causing it. Any suggestions? Thanks for the feedback.
"hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project"
You're using it correctly; it's the stemmer's output that looks odd. Snowball is a rule-based suffix stripper, not a dictionary lookup, so the stems it produces are not guaranteed to be real words. One of its rules drops a final "e" in many words, which is why you get "everyon", "messag" and "pleas"; likewise it stems "everything" to "everyth", as if it were a verb ending in "-ing". We can't expect perfection, but it's annoying when it happens with common words.
The stemmer accepts the option ignore_stopwords=True, which suppresses stemming of words in the stopword list (common words, often irregular, that are returned unchanged). Unfortunately it doesn't help with the particular examples you ask about.
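As a rough illustration of how a suffix-stripping rule can drop a final "e", here is a toy sketch. It is deliberately simplified and is not the Snowball algorithm: toy_stem is a hypothetical function with a single made-up rule (lowercase the word, then remove a trailing "e" from words longer than four letters).

```python
def toy_stem(word):
    """Toy suffix stripper with one rule; NOT the Snowball algorithm."""
    word = word.lower()
    # Made-up rule: drop a trailing "e" from words longer than four letters.
    if len(word) > 4 and word.endswith("e"):
        return word[:-1]
    return word


text_string = "Hi Everyone Please read this message"
stemmed = " ".join(toy_stem(word) for word in text_string.split())
```

Stems like these work as lookup keys for grouping related word forms, so they are usually compared with each other rather than shown to readers.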

Emacs regexp - replacing text strings, query replace regexp

It seems simple enough but I can't get it done.
My text file looks like this :
Johnson Cary, 2009, This important article, 109 pages.
Smith Tom, 2003, Much ado about nothing: a study, 89 pages.
I need this :
Johnson Cary%2009%This important article%109 pages.
Any special character unlikely to appear in text will do. The end goal is to end up with a .csv then a .xls file.
I am using
^\([^,]+\)\([,]\)
to find the first occurring comma but when I try to replace with
\1 %
it does not work, nor any kind of close combination of that sort for that matter.
Any help will be dearly welcome!
Thank you much in advance.
Replace this:
^\([^,]*\), \([^,]*\), \([^,]*\), \(.*\)$
with this:
\1%\2%\3%\4
to get the correct result.
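Outside Emacs, the same substitution can be sketched in Python. The pattern is the one from the answer with Emacs's \( \) groups written as plain ( ) for Python's re module; the function name to_percent_delimited is hypothetical.

```python
import re

# Three comma-separated fields, then the rest of the line.
LINE = re.compile(r"^([^,]*), ([^,]*), ([^,]*), (.*)$")


def to_percent_delimited(line):
    """Replace the first three ', ' separators in a line with '%'."""
    return LINE.sub(r"\1%\2%\3%\4", line)


lines = [
    "Johnson Cary, 2009, This important article, 109 pages.",
    "Smith Tom, 2003, Much ado about nothing: a study, 89 pages.",
]
converted = [to_percent_delimited(line) for line in lines]
```

Lines that do not have at least four comma-separated fields do not match the pattern and are returned unchanged.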