RegEx for all letters (including Chinese, Greek, etc.)

I need a regex that also matches Chinese, Greek, Russian, ... letters.
What I basically want to do is remove punctuation and numbers.
Until now I removed punctuation and numbers "manually" but that does not seem to be very consistent.
Another thing I have tried is
/[\p{L}]/
but that is not supported by Mozilla (I use this in a Firefox extension).

Have you given XRegExp and the Unicode plugin a try/look?
<script src="xregexp.js"></script>
<script src="xregexp-unicode.js"></script>
<script>
var unicodeWord = XRegExp("^\\p{L}+$");
alert(unicodeWord.test("Ниндзя")); // -> true
</script>

You can find a lot of complaints about the current ECMA specs on regular expressions not handling Unicode characters the way they should, e.g. a blog entry by Scott Hanselman that links back to a SO question ;-)
There's no "real" solution to this problem yet, but take a look at the answers to "Javascript + Unicode regexes" (your question is more or less a duplicate of it). (Edit: I take that back; the Unicode plugin Jonathan Lonowski suggests looks pretty nice.)
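In engines that implement the ES2018 Unicode property escapes (they require the u flag), the original goal of dropping punctuation and numbers can be sketched without a plugin. This is only a minimal sketch: stripNonLetters is a made-up name, and keeping combining marks (\p{M}) and whitespace alongside letters is my assumption about what should survive.

```javascript
// Assumes an engine with ES2018 Unicode property escapes; the `u` flag is required.
// Keeps letters (\p{L}), combining marks (\p{M}) and whitespace, removes everything else.
function stripNonLetters(text) {
  return text.replace(/[^\p{L}\p{M}\s]+/gu, "");
}

// Punctuation and digits are removed; Cyrillic, Chinese and Greek letters stay.
console.log(stripNonLetters("Ниндзя, 123 忍者! αβγ?"));
```

On engines without \p{} support the regex literal is a syntax error (with the u flag), so the XRegExp approach above remains the fallback there.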

Related

Python3 unicode regex

I'm not a native English speaker, yet it so happens that I've never written a regex for non-ASCII text in my life, so I'm confused by a seemingly trivial case.
I have a large dictionary scraped from a website by a robot. All HTML tags are removed. My goal is to remove most carried-over hyphens. The idea is that >90% of the problematic punctuation has the form lowercase-lowercase, so it could be caught by a regex like '\p{Ll}-\p{Ll}'. This should be able to capture Russian lowercase chars, при-мер for example.
However, it seems that \p isn't supported by Python's re engine. I'm not sure which alternative regex engine I'm supposed to choose, because googling doesn't turn up any information relevant to Python 3. I thought Python 3 was much more advanced when it comes to i18n and Unicode, and that it was supposed to have Unicode character class support.

How to remove emoticons from tweets in C++?

I'm working on a Twitter sentiment analysis tool in C++. So far I get the tweets from Twitter and process them a bit (lowercase, remove RT, remove # and URLs).
The next step is to remove emoticons and all those special characters. How does one do that? Before you jump on me: I already looked at other similar questions, but none of them deals with C++, mostly R, Python and PHP.
I was thinking of using regex, but I can't get it to work. I tried it for the removal of hashtags and URLs and gave up. I ended up using plain string::find and find_first_of.
Is there any library or method available to get rid of those emoticons and special stuff?
Thanks
I would recommend using regular expressions for this. You have two options: either extract only the characters you are interested in (if you are working with English tweets this would probably be A-Z, a-z, numbers and maybe some symbols, depending on your needs), or select the invalid characters (emoticons) and replace them with an empty string.
I only have experience with Qt's regular expression engine, but the C++ standard library has regex support (although I'm not sure how good it is with Unicode), and ICU provides a regex library too.
*I'd provide more links but I don't have enough reputation yet :/

Matching every Unicode letter only in HTML5 Input form

I have this regex: https://regex101.com/r/bM5sQ0/2
<input type="text" pattern="[\p{L}]+\s[\p{L}]+">
With it I am trying to match the following text patterns:
Firstname Surname
Fírstnámé Súrnámé
And not match these lines:
Firstname Surname000
Fírstnámé Súrnámé000
I want to solve this globally by matching every Unicode letter (what if someone is not Hungarian, but Polish, German, French, or Spanish instead and I didn't include their special characters?). However, my solution does not work.
If you're using a browser that does support \p{}, and doesn't require the u switch to enable it, your code works, but you should remove the brackets because they're unnecessary:
<input type="text" pattern="\p{L}+\s\p{L}+">
It worked when I tested it in Chrome.
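JavaScript versions before ES2018 do not support \p{} at all, and in JavaScript code it only works when the regex has the u flag. You cannot set flags in a pattern attribute, so whether \p{} works there depends on whether the browser compiles the pattern in Unicode mode itself. If you need to support browsers where it doesn't, I suggest that you try the solutions here: How can I use Unicode-aware regular expressions in JavaScript?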
If you just don't like digits, then you can use \D as tamas rev said in the comments. Or maybe [^\d\s] to enforce that your input isn't just spaces.
Note that only matching letters is a bad way to validate names, since it excludes names like "O'Henry". Note that forcing exactly one space to be present excludes languages where the names are not separated with a space (like in the name "蔡英文"), people who only have one name, and people whose names have more than one space ("Mary Jane", "van der Waals"). And some names do have numbers. See Falsehoods Programmers Believe About Names.
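To try the name pattern from this answer outside a form, it can be approximated in plain JavaScript. As far as I understand the HTML specification, the browser anchors the pattern to the whole value and compiles it in Unicode mode on its own; the sketch below relies on that assumption. matchesPattern is a hypothetical helper, not a browser API.

```javascript
// Hypothetical helper approximating how a browser evaluates <input pattern="...">:
// the pattern must match the entire value, and it is compiled with the `u` flag
// (assumption; requires ES2018 support for \p{...}).
function matchesPattern(pattern, value) {
  return new RegExp("^(?:" + pattern + ")$", "u").test(value);
}

// The pattern from the answer, with backslashes doubled for a JS string literal.
const namePattern = "\\p{L}+\\s\\p{L}+";

const samples = [
  "Firstname Surname",
  "Fírstnámé Súrnámé",
  "Firstname Surname000",
  "Fírstnámé Súrnámé000",
];
for (const value of samples) {
  console.log(value, matchesPattern(namePattern, value));
}
```

Note that accented letters typed as a base letter plus a combining mark are not covered by \p{L} alone; normalizing the value with String.prototype.normalize("NFC") first is one way around that.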

HTML5 Input Pattern vs. Non-Latin Letters

I want to pre-validate an input form with the new HTML5 pattern attribute. My data is a domain name, so the <input type="url"> regex preset doesn't apply.
But there is a problem: I can't use A-Za-z because of those damned IDNs (internationalized domain names).
So the question is: is there any way to use <input pattern=""> to validate arbitrary non-English letters?
I tried \w of course, but it works only for Latin...
Maybe someone has a set of \xNN-\xNN ranges that is guaranteed to cover ALL Unicode alphabetic characters, or some other way?
edit: "This question may already have an answer here:" - no, there is no answer.
Based on my testing, the HTML5 pattern attribute supports Unicode code points in exactly the same way that JavaScript does (and does not):
It only supports \u notation for Unicode code points, so \u00a1 will match '¡'.
Because these define characters, you can use them in character ranges like [\u00a1-\uffff].
. will match Unicode characters as well.
You don't really specify how you want to pre-validate so I can't really help you more than that, but by looking up the unicode character values, you should be able to work out what you need in your regex.
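As a rough illustration of the \u range idea, here is a deliberately loose check for a single domain label. The name looksLikeLabel and the choice of allowed ASCII characters are my own assumptions, and the range \u00a1-\uffff also admits non-letter symbols, so this errs on the side of false positives.

```javascript
// Hypothetical loose check for one domain label: ASCII letters, digits, hyphen,
// plus any code unit in \u00a1-\uffff. Characters outside the BMP are not
// handled as single characters here, because the regex has no `u` flag.
const looseLabel = /^[a-z0-9\u00a1-\uffff\-]+$/i;

function looksLikeLabel(label) {
  return looseLabel.test(label);
}

for (const label of ["example", "münchen", "例え", "foo bar", "foo_bar"]) {
  console.log(label, looksLikeLabel(label));
}
```

The same character class could presumably be placed in a pattern attribute; the server still needs its own, stricter validation.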
Keep in mind that the pattern regex execution is rather dumb overall and isn't universally supported. I recommend progressive enhancement with some JavaScript on top of the pattern value (you can even reuse the regex, more or less).
As always, never trust user input - It doesn't take a genius to make a request to your form endpoint and pass more or less whatever data they like. Your server-side validation should necessarily be more explicit. Your client-side validation can be more generous, depending upon whether false positives or false negatives are more problematic to your use case.
I know this isn't what you want to hear, but...
The HTML5 pattern attribute isn't really for the programmer so much as it's for the user. So, considering the unfortunate limitations of pattern, you are best off providing a "loose" pattern--one that doesn't give false negatives but allows for a few false positives. When I've run into this problem, I found that the best thing to do was a pattern consisting of a blacklist + a couple minimum requirements. Hopefully, that can be done in your case.

Recommended built-in WinXP language support for UTF-8 regex

It's my first foray into UTF-8 land. I'm an IIS Admin, so I've never gotten to touch this professionally. I'm trying to help a missionary who's translated the bible into an African language and now needs to do some global matching against large UTF-8 files. We're specifically matching for accented characters.
We're using older XP computers here, so I cobbled together a quick script in VBS knowing the language would be installed on their boxes already. After playing around for a few minutes, it appears VBS regexes handle UTF-8 by breaking each character up into 2 characters. To match a single â, my pattern is \u00c3\u00a2. Shouldn't this be \u00e2?
Since I'm out of my depth I thought I'd seek a little guidance. It almost looks like UTF-8 simply requires this kind of double matching (and UTF-8 is required.) Can someone tell me into which box canyon I'm coding? :-)
Downloading and installing Perl or Java is probably outside this project's bandwidth and technical know-how. The tool should be built in. MS Office is installed, so VBA is an option if there's some library that offers specific support. JavaScript is installed as well, though I don't know what versions.
Thanks
Unless you need to match two or more consecutive dots (e.g. you have .. or ... in your regex but not .*) you can use any ASCII regex library on UTF-8 and expect it to work correctly.
The trick is to know what you are looking for. UTF-8 does that kind of byte breakup, so write your regex in whatever you are familiar with and convert it to UTF-8 and it will work unless it contains "..".
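The two-character pattern from the question is what you would expect if the script reads the UTF-8 file one byte per character, which I assume is what happens here. A small JavaScript sketch of the effect (the variable names are mine; TextEncoder and TextDecoder are assumed to be available, as they are in current browsers and Node.js):

```javascript
// U+00E2 (â) is encoded in UTF-8 as the two bytes 0xC3 0xA2.
const utf8Bytes = new TextEncoder().encode("\u00e2");
console.log(Array.from(utf8Bytes, (b) => b.toString(16)));

// Treating every byte as its own character gives two characters, U+00C3 U+00A2,
// so only the two-character pattern matches.
const misread = String.fromCharCode(...utf8Bytes);
console.log(/\u00c3\u00a2/.test(misread), /\u00e2/.test(misread));

// Decoding the bytes as UTF-8 first yields the single character again,
// and the one-character pattern matches.
const decoded = new TextDecoder("utf-8").decode(utf8Bytes);
console.log(/\u00e2/.test(decoded));
```

So \u00e2 is the right pattern once the text has been decoded as UTF-8; \u00c3\u00a2 is its byte-wise equivalent for undecoded input.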
What about PowerShell? It uses the .NET regular expressions library, and that is one of the best libraries available, especially for Unicode support.