Is there any way to replace all accented Greek letters with their unaccented counterparts? (the need is to handle an NLU problem with Greek utterances) - diacritics

I am currently building a chatbot for Greek users, and I need to avoid adding duplicate utterances for the accented and unaccented cases so that the model does not become heavy.
Do you know if there is a way to replace the accented Greek letters with their unaccented counterparts?
I don't care about capital letters; the model treats lowercase and uppercase the same.
In addition, I would also like to use this approach for handling Greeklish (i.e. converting the Greek letters to Latin).
RASA is used for training the model, in case that matters.
Thanks in advance.
PS: I'm a newbie Python user, so please bear with me.
I am trying to figure out how to use the translate function but haven't managed it yet.
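One possible approach, sketched here with only the standard library: decompose each character with unicodedata.normalize("NFD", ...), drop the combining marks, and then use str.translate for the Greeklish step. The function names and the Greek-to-Latin table below are illustrative assumptions (Greeklish has no single standard spelling), and the accent stripping removes combining marks from any script, not only Greek.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks (tonos, dialytika, ...) from text."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", without_marks)


# Hypothetical Greeklish table; adjust it to the spelling your users type.
GREEK_TO_LATIN = str.maketrans({
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "i", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "y", "φ": "f", "χ": "ch", "ψ": "ps",
    "ω": "o",
})


def to_greeklish(text: str) -> str:
    """Lowercase, strip accents, then map Greek letters to Latin ones."""
    return strip_accents(text.lower()).translate(GREEK_TO_LATIN)


print(strip_accents("Καλημέρα"))
print(to_greeklish("Καλημέρα"))
```

Applying strip_accents to both the training utterances and the incoming user messages would keep the two sides consistent, so only one spelling of each utterance needs to be stored.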

Related

Python3 unicode regex

I'm not a native English speaker, yet I have somehow never written a regex for non-ASCII text in my life, so I'm confused by a seemingly trivial case.
I have a large dictionary scraped from a website by a robot. All HTML tags are removed. My goal is to remove most carry-over hyphens. The idea is that >90% of the problematic hyphens have the form lowercase-lowercase, so they could be caught by a regex like '\p{Ll}-\p{Ll}'. This should be able to capture Russian lowercase chars, при-мер for example.
However, it seems that \p isn't supported by Python's re engine. I'm not sure which alternative regex engine I'm supposed to choose, because googling doesn't show any information relevant to Python 3. I thought Python 3 was much more advanced when it comes to i18n and Unicode and was supposed to have Unicode character class support.
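A minimal sketch of one workaround that stays within the standard re module: match every hyphen that sits between two word characters, and let a replacement callback decide with str.islower() whether both neighbours are lowercase letters. The function name is hypothetical. The third-party regex package on PyPI does support \p{Ll} directly, if installing a dependency is acceptable.

```python
import re

# A hyphen with a word character on each side; lookarounds keep the
# neighbours out of the match so chains like a-b-c are handled too.
HYPHEN_BETWEEN_WORD_CHARS = re.compile(r"(?<=\w)-(?=\w)")


def join_lowercase_hyphens(text: str) -> str:
    """Drop hyphens that sit between two lowercase letters."""

    def replace(match: re.Match) -> str:
        before = text[match.start() - 1]
        after = text[match.end()]
        # str.islower() is Unicode-aware, so Cyrillic letters work as well.
        if before.islower() and after.islower():
            return ""
        return "-"

    return HYPHEN_BETWEEN_WORD_CHARS.sub(replace, text)


print(join_lowercase_hyphens("при-мер"))
print(join_lowercase_hyphens("Нью-Йорк"))
```

Hyphens next to an uppercase letter or a digit are left alone, which matches the lowercase-lowercase heuristic described in the question.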

How to remove emoticons from tweets in C++?

I'm working on a Twitter sentiment analysis tool in C++. So far I get the tweets from Twitter and process them a bit (lowercase, remove RT, remove # and URLs).
The next step is to remove emoticons and all those special characters. How does one do that? Before you jump on me: I already looked at other similar questions, but none of them deals with C++. They are mostly about R, Python and PHP.
I was thinking of using regex, but I can't get it to work. I tried it for removing hashtags and URLs and gave up. I ended up using the normal string::find and find_first_of.
Is there any library or method available to get rid of those emoticons and special characters?
Thanks
I would recommend using regular expressions for this. You have two options: you can either extract only the characters you are interested in (if you are working with English tweets this would probably be A-Z, a-z, numbers and maybe some symbols, depending on your needs), or you can select the invalid characters (emoticons) and replace them with an empty string.
I only have experience with Qt's regular expression engine, but the C++ standard library has regex support (although I'm not sure how good it is with Unicode), and ICU provides a regex library too.
*I'd provide more links but I don't have enough reputation yet :/
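The first option (keep only the characters you care about) can be sketched as follows. The sketch is in Python rather than C++, purely to show the shape of the pattern; the allowed character set and the function name are assumptions to adapt, and the same negated character class should carry over to a C++ regex engine.

```python
import re

# Anything that is NOT an ASCII letter, digit, whitespace or one of a few
# punctuation marks gets removed. The allowed set is an assumption.
NOT_ALLOWED = re.compile(r"[^A-Za-z0-9\s.,!?'@_-]")


def keep_allowed(tweet: str) -> str:
    """Delete every character outside the allowed set."""
    return NOT_ALLOWED.sub("", tweet)


print(keep_allowed("great day 😀!"))
```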

Matching every Unicode letter only in HTML5 Input form

I have this regex: https://regex101.com/r/bM5sQ0/2
<input type="text" pattern="[\p{L}]+\s[\p{L}]+">
With it I am trying to match the following text patterns:
Firstname Surname
Fírstnámé Súrnámé
And not match these lines:
Firstname Surname000
Fírstnámé Súrnámé000
I want to solve this globally by matching every Unicode letter (what if someone is not Hungarian, but Polish, German, French, or Spanish instead and I didn't include their special characters?). However, my solution does not work.
If you're using a browser that does support \p{}, and doesn't require the u switch to enable it, your code works, but you should remove the brackets because they're unnecessary:
<input type="text" pattern="\p{L}+\s\p{L}+">
It worked when I tested it in Chrome.
Older JavaScript versions (before ES2018) do not support \p{} at all, and some versions may need the u switch to enable it, which won't work here. If you really need it, I suggest that you try the solutions here: How can I use Unicode-aware regular expressions in JavaScript?.
If you just don't like digits, then you can use \D as tamas rev said in the comments. Or maybe [^\d\s] to enforce that your input isn't just spaces.
Note that only matching letters is a bad way to validate names, since it excludes names like "O'Henry". Note that forcing exactly one space to be present excludes languages where the names are not separated with a space (like in the name "蔡英文"), people who only have one name, and people whose names have more than one space ("Mary Jane", "van der Waals"). And some names do have numbers. See Falsehoods Programmers Believe About Names.
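If the same "letters, one whitespace character, letters" rule also has to be checked on the server, a rough Python counterpart of \p{L}+\s\p{L}+ might look like the sketch below. It relies on str.isalpha(), which is true only for characters in the Unicode letter categories; the function name is hypothetical, and the caveats about real-world names above apply to it just as much.

```python
import re
import unicodedata


def is_two_letter_words(value: str) -> bool:
    """True for exactly two all-letter words separated by one whitespace."""
    # NFC composes base letter + combining accent into a single letter,
    # so decomposed input is treated the same as precomposed input.
    value = unicodedata.normalize("NFC", value)
    parts = re.split(r"\s", value)
    # "".isalpha() is False, so doubled whitespace is rejected as well.
    return len(parts) == 2 and all(part.isalpha() for part in parts)


print(is_two_letter_words("Fírstnámé Súrnámé"))
print(is_two_letter_words("Fírstnámé Súrnámé000"))
```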

Lucene filter to replace native language letters to normal letter

How can I replace native-language letters with plain letters via a Lucene analyzer?
In Polish we have 'ą', 'ę', 'ć' and I need to replace these with 'a', 'e', 'c'.
I tried with
new TrimFilter(new PatternReplaceFilter(source,
Pattern.compile("[^a-zA-Z0-9]"), , true), true);
But this filter works the wrong way: it replaces all the characters that do not belong to the pattern.
Use ASCIIFoldingFilter, which is designed for exactly this purpose. Here's an example of how to use it.

HTML5 Input Pattern vs. Non-Latin Letters

I want to pre-validate an input form with the new HTML5 pattern attribute. My dataset is "Domain Name", so the <input type="url"> regex preset doesn't apply.
But there is a problem: I won't use A-Za-z, because of the damned IDNs (Internationalized Domain Names).
So the question is: is there any way to use <input pattern=""> to validate arbitrary non-English letters?
I tried \w of course, but it works only for Latin...
Maybe someone has a set of \xNN-\xNN ranges that is guaranteed to accept ALL Unicode alphabetic characters, or some other way?
edit: "This question may already have an answer here:" - no, there is no answer.
Based on my testing, the HTML5 pattern attribute supports Unicode code points in exactly the same way that JavaScript does and does not:
It only supports \u notation for unicode code points so \u00a1 will match '¡'.
Because these define characters, you can use them in character ranges like [\u00a1-\uffff]
. will match Unicode characters as well.
You don't really specify how you want to pre-validate so I can't really help you more than that, but by looking up the unicode character values, you should be able to work out what you need in your regex.
Keep in mind that the pattern regex execution is rather dumb overall and isn't universally supported. I recommend progressive enhancement with some JavaScript on top of the pattern value (you can even re-use the regex more or less).
As always, never trust user input: it doesn't take a genius to make a request to your form endpoint and pass more or less whatever data they like. Your server-side validation should necessarily be more explicit. Your client-side validation can be more generous, depending upon whether false positives or false negatives are more problematic to your use case.
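As a rough illustration of what a more explicit server-side check could look like for domain names, here is a Python sketch that uses the standard library's "idna" codec (which implements the older IDNA 2003 rules) to convert the name to its ASCII form and then checks each label. The function name and the two-label minimum are assumptions, not requirements from the question.

```python
import re

# One DNS label in ASCII form: letters, digits and hyphens, 1-63 chars,
# not starting or ending with a hyphen.
LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def is_plausible_domain(name: str) -> bool:
    """Loose check: IDNA-encodable and every label looks like a DNS label."""
    try:
        ascii_name = name.strip().lower().encode("idna").decode("ascii")
    except UnicodeError:
        # Raised for empty or overlong labels and unencodable characters.
        return False
    labels = ascii_name.split(".")
    # Requiring at least two labels is an assumption of this sketch.
    return len(labels) >= 2 and all(LABEL.fullmatch(label) for label in labels)


print(is_plausible_domain("bücher.example"))
print(is_plausible_domain("exa mple.com"))
```

This only checks the shape of the name; whether the domain actually exists is a separate question.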
I know this isn't what you want to hear, but...
The HTML5 pattern attribute isn't really for the programmer so much as it's for the user. So, considering the unfortunate limitations of pattern, you are best off providing a "loose" pattern--one that doesn't give false negatives but allows for a few false positives. When I've run into this problem, I found that the best thing to do was a pattern consisting of a blacklist + a couple minimum requirements. Hopefully, that can be done in your case.