I have a question:
I want to validate the first and last name with a RegEx.
It should allow only Hebrew and English letters, with no numbers.
Can someone help me write that code?
Seemingly Hebrew has the range \u0590-\u05fe (according to this nice JavaScript Unicode Regex generator).
/^[a-z\u0590-\u05fe]+$/i
While the selected answer is correct about "Hebrew", the OP wanted to limit validation to only Hebrew and English letters. The Hebrew Unicode block adds a lot of punctuation and symbols (as you can see in the table here) that are irrelevant for such validation. If you want only Hebrew letters (along with English letters), the regex would be:
/^[a-z\u05D0-\u05EA]+$/i
I would also consider adding ' (single quote), since foreign consonants that are missing in Hebrew (such as the G in George and the Ch in Charlie) are written with it alongside a letter:
/^[a-z\u05D0-\u05EA']+$/i
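As a quick sketch of how this letters-only pattern behaves (plain JavaScript shown here as an assumption; the character class itself is the same in any flavor):

```javascript
// Hebrew letters (U+05D0 to U+05EA) plus English letters and an
// apostrophe for transliterated consonants such as the "G" in George.
const namePattern = /^[a-z\u05D0-\u05EA']+$/i;

console.log(namePattern.test("David"));    // true
console.log(namePattern.test("דוד"));      // true
console.log(namePattern.test("ג'ורג'"));   // true, apostrophe allowed
console.log(namePattern.test("דוד123"));   // false, digits rejected
```

Note that a user may also type the Hebrew geresh character (U+05F3) rather than the ASCII apostrophe; adding \u05F3 to the class covers that case.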
English & Hebrew FULL regex
I'm using the regex below in my application. My users are just fine with it:
RegExp(r'^[a-zA-Z\u0590-\u05FF\u200f\u200e ]+$');
The regex supports:
English letters (including capital letters): a-zA-Z
Hebrew (including the final letter forms): \u0590-\u05FF
Direction-change marks (RLM, LRM): \u200f\u200e
White space.
Enjoy!
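The RegExp(r'…') syntax above looks like Dart; for reference, here is the same character set written as a JavaScript regex literal (an equivalent sketch, assuming a JS environment):

```javascript
// Same character set as the pattern above: English letters, the full
// Hebrew block (including final letters and points), the RLM/LRM
// direction marks, and the space character.
const fullPattern = /^[a-zA-Z\u0590-\u05FF\u200f\u200e ]+$/;

console.log(fullPattern.test("שלום world")); // true
console.log(fullPattern.test("hello123"));   // false
```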
Try this. Not sure if it will work. If not, these references should help.
[A-Za-z\u0590-\u05FF]*
Hebrew Unicode
Unicode in Regular Expressions
Hebrew letters only:
/^[\u05d0-\u05ea]+$/i
You can also use \p{Hebrew} in your regex to detect any Hebrew unicode characters (if your regex engine supports it).
Well, the RegEx pattern sits between the two /'s. The i at the end is a flag that makes matching case-insensitive. ^ means the start of a line, and $ means the end of a line. Brackets ([ and ]) mean any one of the characters inside them. - means a range. Note that the characters are ordinal, so a-z or א-ת make sense; a-z means all letters from and including a to and including z, and the same goes for א-ת. + means one or more of the preceding. So this pattern matches every sequence of letters from English or Hebrew.
P.S.: Also, note that the flavor of RegEx differs across languages and platforms. For example, in Sublime Text the pattern would be: (?i)^[א-תa-z]+$.
/^[א-תa-z]+$/i
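A quick JavaScript check of the pattern just described, to see the pieces in action:

```javascript
// ^ anchors the start, [א-תa-z] matches one Hebrew or English letter,
// + repeats it, $ anchors the end, and the i flag makes the Latin
// range case-insensitive.
const lettersOnly = /^[א-תa-z]+$/i;

console.log(lettersOnly.test("Hello")); // true, 'i' covers A-Z
console.log(lettersOnly.test("שלום"));  // true
console.log(lettersOnly.test("שלום!")); // false, '!' is not a letter
```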
Related
I am working on removing punctuation from a text variable that can be a phrase, etc.
Example: Hola, me llamo Juan! Hoy es día camión.
Code I am using is:
REGEXP_REPLACE(text, '[^0-9A-Za-z ]+', '')
This generally works well. The issue is that some languages have diacritics over some letters, for example: día camión. When running the above code, the output for these words is "da" and "camin". It removes the letter the diacritic sits on.
Is there a way to avoid this?
Thanks!
There are two options:
Use one of the many Unicode properties. For example, \p{L} matches any Unicode letter from any language; in this case, you could make it work with [^0-9\p{L} ]+. There are many different Unicode properties, and also differences between regex flavors, so I'd recommend investigating this link for reference.
If the solution above doesn't work for you, list specific Unicode codes that you want to match. For example, í can be matched with \u00ED, ó can be matched with \u00F3, so for this example [^\w\u00ED\u00F3 ]+ would do. There are many Unicode references out there, such as this one that you can use.
Besides that, \w has the same meaning as [0-9a-z_A-Z], and \W returns all characters not matched by \w, so you can replace that part of the expression, i.e. [\W ]+ instead of what you originally wrote. \W doesn't mitigate the Unicode issue, though - it's a matter of readability and simplicity.
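The question uses SQL's REGEXP_REPLACE, where the flavor differs, but the \p{L} idea carries over and is easy to sanity-check in JavaScript; a sketch of the first option:

```javascript
// Strip everything that is not a digit, a Unicode letter, or a space.
// \p{L} keeps accented letters such as í and ó intact.
const clean = s => s.replace(/[^0-9\p{L} ]+/gu, "");

console.log(clean("Hola, me llamo Juan! Hoy es día camión."));
// "Hola me llamo Juan Hoy es día camión"
```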
I have a text box that a user can input any text in any language in and I need to split that text into words so that I could pass those words into hunspell spell check. For splitting I use a regexp that matches word delimiters.
At first I used \W as a word delimiter to split text into words, but that works only with Latin letters, such as in English. If I use a non-Latin language, it treats every letter as \W. That's because \W is defined as any character that is [^a-zA-Z0-9_].
So far, (?![-'])[\pP|\pZ|\pC] seems to tokenize English, Spanish and Russian correctly. It basically says to treat all punctuation characters (except for the hyphen and the apostrophe), all separator characters and all "other" characters (control, private use, etc) as word delimiters. I have excluded hyphen and apostrophe because those usually shouldn't be treated as word delimiters.
I haven't tested it much, just came up with it today, so I thought it would be wise to ask if someone knew of any regex that is more suited for matching word delimiters in a multilingual text.
Note that I'm not concerned with languages that can't be tokenized, such as Japanese, Chinese, Thai, etc.
Update: Since people were asking what language I'm using (though it probably shouldn't matter much), I'm using C++ and Qt5's QRegularExpression class.
With Java (for example), you can emulate word boundaries like this (don't forget to double-escape backslashes inside string literals):
(?<![\p{L}\p{N}_])[\p{L}\p{N}_]+(?![\p{L}\p{N}_])
Where \p{L} matches any letters and \p{N} any digits.
Thus, you can easily split a string into "words" with: [^\p{L}\p{N}_]+
(I don't know the regex flavor you use, but you can probably remove the curly brackets).
In PHP this should work:
[\pL]*
In JavaScript you can use (add the "u" flag for Unicode after the closing delimiter):
/[\p{L}]*/u
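To see the splitting idea end to end, a small JavaScript sketch (the same property classes work in Java and PHP, modulo syntax):

```javascript
// Split on runs of characters that are neither letters, digits, nor
// underscore: the [^\p{L}\p{N}_]+ delimiter from the answer above.
const words = s => s.split(/[^\p{L}\p{N}_]+/u).filter(w => w.length > 0);

console.log(words("Привет, мир! Hello world."));
// [ 'Привет', 'мир', 'Hello', 'world' ]
```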
I am a Java web developer from China, and I have run into a problem: detecting whether a string contains the special German characters ßüöä in JavaScript.
I am using jquery.validate.js to make sure the string contains only letters, numbers and underscores; my regex is /^[a-zA-Z_]+\w*$/i. Now I want to change the regex so that it also allows the special German characters ßüöä. Could anyone show me how to change the regex to achieve this?
Thanks in advance!
Now I want to change my regex so that it will allow the string to contain the special German characters ßüöä
Simply include those characters in the character class.
/^[a-zßüÜöÖäÄ_][\wßüÜöÖäÄ]*$/i
This expression matches one letter (German ones included) or an underscore, followed by any number of letters (German ones also included), digits or underscores.
Notice this does not include spaces or hyphens. Feel free to include them as well:
/^[a-zßüÜöÖäÄ_][-\wßüÜöÖäÄ ]*$/i
ideone demo
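A minimal JavaScript check of the suggested pattern:

```javascript
// First char: a letter (German ones included) or an underscore; the
// rest may also include digits and underscores via \w.
const germanId = /^[a-zßüÜöÖäÄ_][\wßüÜöÖäÄ]*$/i;

console.log(germanId.test("Straße"));  // true
console.log(germanId.test("über_1"));  // true
console.log(germanId.test("1über"));   // false, can't start with a digit
```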
I need a regex allowing a list of special characters (_-.$#?,:'/!) and letters, supporting UTF-8 languages.
I tried
/^[\_\-\.\$#\?\,\:\'\/\!]*$/
but typing letters in English or Tamil shows as invalid.
You need to escape the hyphen for it to be valid. You also don't need to escape most of the other characters - inside of brackets, almost everything is literal.
/[_\-.$#?,:'/!]*/
I have no idea if your regex engine supports \p{L}. You can try this:
^[_\-.\$#\?\,\:\'/!\p{L}]*$
or this one:
^[_\-.\$#\?\,\:\'/!\w]*$
The last one also matches digits.
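If the engine does support \p{L} (modern JavaScript does, with the u flag), a sketch of the first pattern:

```javascript
// The listed special characters plus any Unicode letter. In u-flag
// mode, unnecessary escapes are errors, so only - and / are escaped.
const allowed = /^[_\-.$#?,:'\/!\p{L}]*$/u;

console.log(allowed.test("naïve!"));  // true, ï is a Unicode letter
console.log(allowed.test("abc 123")); // false, space and digits
```

One caveat: scripts that use combining vowel signs, Tamil among them, encode those signs as marks (\p{M}) rather than letters, so a class with only \p{L} may still reject some Tamil words; adding \p{M} to the class covers them.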
The "tricky" part of this question is that what I mean by "alphabet" is not just the 26 characters. It should also include anything alphabet-like, including accented characters, the Hebrew alephbet, etc.
Why do I need them?
I want to split texts into words.
Alphabets like the Latin alphabet, the Hebrew alephbet, and the Arabic abjad separate words with spaces.
Chinese characters are separated by nothing.
So I think I should split texts on anything that's not an alphabet character.
In other words, a, b, c, d, é is fine.
駅,南,口,第,自,転,車,3,5,6 is not, and each such separator should be its own word. Or something like that.
In short, I want to detect whether a character may be a word by itself, or can be part of a word.
What have I tried?
Well you can check the question here I asked a long time ago:
How can we separate utf-8 characters into words if some of the characters are chinese?
I implemented the only answer there, but then I found out that the Chinese characters aren't split. Why not split on nothing? Well, that means the alphabets get split too.
If all those alphabets "stuck" together in a way that let me separate them based on Unicode ranges, that would be fine too.
I will just use the answer at How can we separate utf-8 characters into words if some of the characters are chinese?
and "pull out" all non-alphabet characters.
Not a perfect solution, but good enough for me, because Western characters and Chinese characters rarely show up in the same text anyway.
Maybe you shouldn't do this with regular expressions but with good old string index scanning instead.
The Hebrew, Chinese, Korean, etc. scripts all occupy consecutive ranges of Unicode code points. So you could easily detect the script by reading the Unicode value of the character and then checking which Unicode block it belongs to.
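A rough sketch of that block-range check in JavaScript; the ranges shown are a small, assumed subset, and a real implementation would cover many more blocks:

```javascript
// Classify a character by comparing its code point against a few
// well-known Unicode block ranges.
function scriptOf(ch) {
  const cp = ch.codePointAt(0);
  if (cp >= 0x0590 && cp <= 0x05FF) return "Hebrew";
  if (cp >= 0x4E00 && cp <= 0x9FFF) return "CJK";
  if (cp >= 0xAC00 && cp <= 0xD7AF) return "Hangul";
  return "Other";
}

console.log(scriptOf("ש")); // "Hebrew"
console.log(scriptOf("駅")); // "CJK"
console.log(scriptOf("a")); // "Other"
```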
Jan Goyvaerts (of PowerGrep fame) once showed me this very useful syntax to do just this:
(?<![\p{M}\p{L}])word(?![\p{M}\p{L}])
This expression uses a regex lookbehind and a regex lookahead to ensure that the boundaries of the word are such that there is no letter or diacritic mark on either side.
Why is this regex better than simply using "\b"? The strength of this regex is the incorporation of \p{M} to include diacritics. When the normal word-boundary marker (\b) is used, regex engines will find word breaks at many diacritics, even though the diacritics are actually part of the word (this is the case, for instance, with Hebrew diacritics: take the Hebrew word גְּבוּלוֹת and run the regex "\b." on it, and you'll see how it breaks the word into different parts at each diacritic point). The regex above fixes this by using a Unicode character class to ensure that diacritics are always considered part of the word and not breaks within it.
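A JavaScript sketch of the lookaround technique (property escapes need the u flag, and the backslashes are doubled because the pattern is built from a string):

```javascript
// Match the word only when no letter (\p{L}) or combining mark (\p{M})
// sits immediately before or after it, unlike \b, which breaks at
// Hebrew diacritics.
const word = "שלום";
const re = new RegExp(
  `(?<![\\p{M}\\p{L}])${word}(?![\\p{M}\\p{L}])`,
  "u"
);

console.log(re.test("אמר שלום לכולם")); // true, standalone word
console.log(re.test("שלוםם"));          // false, embedded in a longer word
```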