Remove punctuation but not all of them - regex

I am working on remove punctuation from a text variable that can be phase, etc.
Example: Hola, me llamo Juan! Hoy es día camión.
Code I am using is:
REGEXP_REPLACE(text, '[^0-9A-Za-z ]+', '')
This generally works well. Issue is that in some languages we have punctuation over some words. Example: día camión. When running the above code, the out for these words are "da" "camin". It removes the letter associate with the punctuation.
Is there a way to avoid this to happen?
Thanks!

There are two options:
Use one of the many Unicode properties. For example, \p{L} matches any Unicode letter from any language - in this case, you could make it work with [^0-9p\{L} ]+. There are many different Unicode properties, and also differences between Regex flavors, so I'd recommend investigating this link for reference.
If the solution above doesn't work for you, list specific Unicode codes that you want to match. For example, í can be matched with \u00ED, ó can be matched with \u00F3, so for this example [^\w\u00ED\u00F3 ]+ would do. There are many Unicode references out there, such as this one that you can use.
Besides that, \w has the same meaning as [0-9a-z_A-Z], and \W returns all characters not matched by \w, so you can replace that part of the expression, i.e. [\W ]+ instead of what you originally wrote. \W doesn't mitigate the Unicode issue, though - it's a matter of readability and simplicity.

Related

Generic regex umlaut solution?

Is there a generic (non-)word regex that covers all mutations of characters on this globe? I am developing an application that should handle all languages.
Technically I want to split sentences by words. Splitting them by nonword characters (\W) splits by 'ä' too. A static workaround is not an option since and explicitely covering all mutations on this world (éçḮñ and thousands more) is impossible.
I can't give you something that will work on all languages because I don't know enough languages to judge whether there will be edge cases.
My suggestion:
Split on whitespace (\s+).
Trim punctuation characters from start/end of each "word" you got in step 1 (replace ^\p{P}+|\p{P}+$ with nothing - the QRegularExpression docs say that it supports Unicode fully, so there's hope this will work)
Unless you care about preserving punctuation in examples like This is Charles' car, this should go a long way without removing punctuation within words like it's or Marne-sur-Seine.

Regex to match Hebrew and English characters except numbers

I have a question:
I want to do a validation for the first and the last name with RegEx.
I want to do it with only Hebrew and English without numbers.
Someone can help me to do that code?
Seemingly Hebrew has the range \u0590-\u05fe (according to this nice JavaScript Unicode Regex generator`.
/^[a-z\u0590-\u05fe]+$/i
While the selected answer is correct about "Hebrew" the OP wanted to limit validation to only Hebrew and English letters. The Hebrew Unicode adds a lot of punctuation and symbols (as you can see in the table here) irrelevant for such validation. If you want only Hebrew letters (along with English letters) the regex would be:
/^[a-z\u05D0-\u05EA]+$/i
I would consider adding ' (single quote) as well, for foreign consonants that are missing in Hebrew (such as G in George and Ch in Charlie) make use of it along with a letter:
/^[a-z\u05D0-\u05EA']+$/i
English & Hebrew FULL regex
I'm using the above regex on my application. My users are just fine with it:
RegExp(r'^[a-zA-Z\u0590-\u05FF\u200f\u200e ]+$');
The regex supports:
English letters (includes Capital letters). a-zA-Z
Hebrew (includes special end-letters). \u0590-\u05FF
change direction unicodes (RLM, LRM). \u200f\u200e
White space.
Enjoy!
Try this. Not sure if it will work. If not, these references should help.
[A-Za-z\u0590-\u05FF]*
Hebrew Unicode
Unicode in Regular Expressions
Hebrew letters only:
/^[\u0590-\u05ea]+$/i
You can also use \p{Hebrew} in your regex to detect any Hebrew unicode characters (if you're regex engine supports it).
Well, the RegEx pattern is between two /'s. The i at the end is a flag that says to be indifferent to the cases. ^ means the start of a line, and $ means the end of a line. Brackets ([ and ]) means either of the characters inside the brackets. - means a range. Note that the characters are ordinal, so a-z or א-ת make sense; a-z means all letters from and include a to and include z. The same goes for א-ת. + means one or more of the preceding. So this pattern matches every sequence of letters from English or Hebrew.
P.S.: Also, note that the flavor of the RegEx is different in different languages and platforms. for example in Sublime Text the pattern would be: (?i)^[א-תa-z]+$.
/^[א-תa-z]+$/i

Regex for allowing particular Special Characters

I need a regex for allowing list of special characters((_-.$#?,:'/!) and letters supporting utf-8 languages.
I tried
/^[\_\-\.\$#\?\,\:\'\/\!]*$/
but typing letters in English and Tamil shows invalid.
You need to escape the hyphen for it to be valid. You also don't need to escape most of the other characters - inside of brackets, almost everything is literal.
/[_\-.$#?,:'/!]*/
I have no idea if your regex engine supports \p{L}. You can try this:
^[_\-.\$#\?\,\:\'/!\p{L}]*$
or this one:
^[_\-.\$#\?\,\:\'/!\w]*$
The last one also matches digits.

Is there a regex way to detect whether a character can be part of a word or not?

The "tricky" part of this question is that what I mean by alphabeth is not just the 26 characters. It should also include anything alphabeth like, including accented characters and hebrew's alibeth, etc.etc.
Why I need them?
I want to split texts into words.
Alphabeths like latin alphabeth, hebrew's alibeth, arab abjads, are separated by space.
Chinese characters are separated by nothing.
So I think I should separate texts by anything that's not alphabeth.
In other word, a, b, c, d, é is fine.
駅,南,口,第,自,転,車.,3,5,6 is not and all such separator should be it's own words. Or stuff like that.
In short I want to detect whether a character may be a word by itself, or can be part of a word.
What have I tried?
Well you can check the question here I asked a long time ago:
How can we separate utf-8 characters into words if some of the characters are chinese?
I implement the only answer there but then I found out that the chinese characters aren't split. Why not split based on nothing? Well, that means the alphabeths are splitted too.
If all those alphabeths "stick" together that I can separate them based on UTF, that would be fine too.
I will just use the answer at How can we separate utf-8 characters into words if some of the characters are chinese?
and "pull out" all non alphabeth characters.
Not a perfect solution, but good enough for me because western characters and chinese characters rarely show up on the same text anyway.
Maybe you shouldn't do this with regular expressions but with good old string index scanning instead.
The Hebrew, Chinese, Korean etc. alphabets are all in consecutive ranges of unicode code-points. So you could easily detect the alphabet by reading the unicode value of the character and then checking which unicode block it belongs to.
Jan Goyvaerts (of PowerGrep fame) once showed me this very useful syntax to do just this:
(?<![\p{M}\p{L}])word(?![\p{M}\p{L}])
This expression uses a regex lookbehind and a regex lookahead to ensure that the boundaries of the word are such that there is no letter or diacritic mark on either side.
Why is this regex better than simply using "\b"? The strength of this regex is the incorporation of \p{M} to include diacritics. When the normal word boundary marker (\b) is used, regex engines will find word breaks at the places of many diacritics, even though the diacritics are actually part of the word (this is the case, for instance, with Hebrew diacritics. For an example, take the Hebrew word גְּבוּלוֹת, and run a regex of "\b." on it - you'll see how it actually breaks the word into word different parts, at each diacritic point). The regex above fixes this by using a Unicode Character Class to ensure that diacritics are always considered part of the word and not breaks within the word.

Regex word-break with unicode diacritics

I am working on an application that searches text using regular expressions based on input from a user. One option the user has is to include a "Match 0 or more characters" wildcard using the asterisk. I need this to only match between word boundaries. My first attempt was to convert all asterisks to (?:(?=\B).)*, which works fine for most cases. Where it fails is that apparently .Net considers the position between a unicode character with a diacritic and another character a word-break. I consider this a bug, and have submitted it to the Microsoft feedback site.
In the meantime, however, I need to get the functionality implemented and product shipped. I am considering using [\p{L}\p{M}\p{N}\p{Pc}]* as the replacement text, but, frankly, am in "I don't really understand what this is going to do" land. I mean, I can read the specifications, but am not confident that I could sufficiently test this to make sure it is doing what I expect. I simply wouldn't know all the boundary conditions to test. The application is used by cross-cultural workers, many of whom are in tribal locations, so any and all writing systems need to be supported, including some that use zero-width word breaks.
Does anyone have a more elegant solution, or could confirm/correct the code above, or offer some pointers?
Thanks for your help.
The equivalent of /(?:(?=\B).)*/ in a unicode context would be:
/
(?:
(?: (?<=[\p{L}\p{M}\p{N}\p{Pc}]) (?=[\p{L}\p{M}\p{N}\p{Pc}])
| (?<![\p{L}\p{M}\p{N}\p{Pc}]) (?![\p{L}\p{M}\p{N}\p{Pc}])
)
.
)*
/
...or somewhat simplified:
/(?:[\p{L}\p{M}\p{N}\p{Pc}]+|[^\p{L}\p{M}\p{N}\p{Pc}]+)?/
This would match either a word or a non-word (spacing, punctuation etc.) sequence, possibly an empty one.
A normal or negated word-boundary (\b or \B) is basically a double look-around. One looking behind, making sure of the type of character that precedes the current position. Similarly one looking ahead.
In the second regex, I removed the look-arounds and used simple character classes instead.