Regex to match word delimiters in multilingual text - regex

I have a text box that a user can input any text in any language in and I need to split that text into words so that I could pass those words into hunspell spell check. For splitting I use a regexp that matches word delimiters.
At first I used \W as a word delimiter to split a text into wrods, but that works only with Latin letters, such as in English language. If I use non-Latin language, it treats every letter of it as \W. That's because \W is defined as any character that is [^a-zA-Z0-9_].
So far, (?![-'])[\pP|\pZ|\pC] seems to tokenize English, Spanish and Russian correctly. It basically says to treat all punctuation characters (except for the hyphen and the apostrophe), all separator characters and all "other" characters (control, private use, etc) as word delimiters. I have excluded hyphen and apostrophe because those usually shouldn't be treated as word delimiters.
I haven't tested it much, just came up with it today, so I thought it would be wise to ask if someone knew of any regex that is more suited for matching word delimiters in a multilingual text.
Note that I'm not concerned with languages that can't be tokenized, such as Japanese, Chinese, Thai, etc.
Update: Since people were asking what language I'm using (though it probably shouldn't matter much), I'm using C++ and Qt5's QRegularExpression class.

With Java (for example), you can emulate word boundaries like that (don't forget to double escape):
(?<![\p{L}\p{N}_])[\p{L}\p{N}_]+(?![\p{L}\p{N}_])
Where \p{L} matches any letters and \p{N} any digits.
Thus, you can easily split a string into "words" with: [^\p{L}\p{N}_]+
(I don't know the regex flavor you use, but you can probably remove the curly brackets).

In PHP this should work:
[\pL]*
In Javascript you can use (set "u" for unicode after delimiter):
/[\p{L}]*/u

Related

RegExp space character

I have this regular expression: ^[a-zA-Z]\s{3,16}$
What I want is to match any name with any spaces, for example, John Smith and that contains 3 to 16 characters long..
What am I doing wrong?
Background
There are a couple of things to note here. First, a quantifier (in this case, {3,16}) only applies to the last regex token. So what your current regex really is saying is to "Match any string that has a single alphabetical character (case-insensitive) followed by 3 to 16 whitespace characters (e.g. spaces, tabs, etc.)."
Second, a name can have more than 2 parts (a middle name, certain ethnic names like "De La Cruz") or include special characters such as accented vowels. You should consider if this is something you need to account for in your program. These things are important and should be considered for any real application.
Assumptions and Answer
Now, let's just assume you only want a certain format for names that consists of a first name, a last name, and a space. Let's also assume you only want simple ASCII characters (i.e. no special characters or accented characters). Furthermore, both the first and last names should start with a capital character followed by only lower-case characters. Other than that, there are no restrictions on the length of the individual parts of the name. In this case, the following regex would do the trick:
^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$
Notes
The first token after the ^ character is what is called a positive lookahead. Basically a positive look ahead will match the regex between the opening (?= and closing ) without actually moving the position of the cursor that is matching the string.
Notice I removed the \s token, since you usually want only a (space). The space can be replaced with the \s token, if tabs and other whitespace is desired there.
I also added a restriction that a name must start with a capital letter followed by only lower-case letters.
Crude English Translation
To help your understanding, here is a simple English translation of what the regex is really doing. The part in italics is just copied from the first part of the English translation of the regex.
"Match any string that has 3-16 characters and starts with a capital alphabetical character followed by one or more (+) alphabetical characters followed by a single space followed by a capital alphabetical character followed by one or more (+) alphabetical characters and ends with any lowercase letter."
Tools
There are a couple of tools I like to use when I am trying to tackle a challenging regex. They are listed below in no particular order:
https://regex101.com/ - Allows you to test regex expressions in real time. It also has a nifty little library to help you along.
http://www.regular-expressions.info/ - Basically a repository of knowledge on regex.
Edit/Update
You mentioned in your comments that you are using your regex in JavaScript. JavaScript uses a forward slash surrounding the regex to determine what is a regex. For this simple case, there are 2 options for using a regex to match a string.
First, use String's match method as follows
"John Smith".match(/^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$/);
Second, create a regex and use its test() method. For example,
/^(?=.{3,16}$)[A-Z][a-z]+ [A-Z][a-z]+$/.test("John Smith");
The latter is probably what you want as it simply returns true or false depending on whether the regex actually matches the string or not.

Regex to match Hebrew and English characters except numbers

I have a question:
I want to do a validation for the first and the last name with RegEx.
I want to do it with only Hebrew and English without numbers.
Someone can help me to do that code?
Seemingly Hebrew has the range \u0590-\u05fe (according to this nice JavaScript Unicode Regex generator`.
/^[a-z\u0590-\u05fe]+$/i
While the selected answer is correct about "Hebrew" the OP wanted to limit validation to only Hebrew and English letters. The Hebrew Unicode adds a lot of punctuation and symbols (as you can see in the table here) irrelevant for such validation. If you want only Hebrew letters (along with English letters) the regex would be:
/^[a-z\u05D0-\u05EA]+$/i
I would consider adding ' (single quote) as well, for foreign consonants that are missing in Hebrew (such as G in George and Ch in Charlie) make use of it along with a letter:
/^[a-z\u05D0-\u05EA']+$/i
English & Hebrew FULL regex
I'm using the above regex on my application. My users are just fine with it:
RegExp(r'^[a-zA-Z\u0590-\u05FF\u200f\u200e ]+$');
The regex supports:
English letters (includes Capital letters). a-zA-Z
Hebrew (includes special end-letters). \u0590-\u05FF
change direction unicodes (RLM, LRM). \u200f\u200e
White space.
Enjoy!
Try this. Not sure if it will work. If not, these references should help.
[A-Za-z\u0590-\u05FF]*
Hebrew Unicode
Unicode in Regular Expressions
Hebrew letters only:
/^[\u0590-\u05ea]+$/i
You can also use \p{Hebrew} in your regex to detect any Hebrew unicode characters (if you're regex engine supports it).
Well, the RegEx pattern is between two /'s. The i at the end is a flag that says to be indifferent to the cases. ^ means the start of a line, and $ means the end of a line. Brackets ([ and ]) means either of the characters inside the brackets. - means a range. Note that the characters are ordinal, so a-z or א-ת make sense; a-z means all letters from and include a to and include z. The same goes for א-ת. + means one or more of the preceding. So this pattern matches every sequence of letters from English or Hebrew.
P.S.: Also, note that the flavor of the RegEx is different in different languages and platforms. for example in Sublime Text the pattern would be: (?i)^[א-תa-z]+$.
/^[א-תa-z]+$/i

How to using regex to check if a string contains German characters in javascript?

I am java web developer from China, now I met a problem to detect if the string contains special German characters ßüöä in javascript.
I am using jquery.validate.js to make the string only contains characters,numbers and underscore, my regex is /^[a-zA-Z_]+\w*$/i,now I want to change my regex so that it will allow the string contains the special German characters ßüöä, could anyone how to change the regex in order to archive my goal?
Thanks in advance!
Now I want to change my regex so that it will allow the string contains the special German characters ßüöä
Simply include those characters in the character class.
/^[a-zßüÜöÖäÄ_][\wßüÜöÖäÄ]*$/i
This expression matches one letter (German included) or an undersocre, followed by any number of letters (German also included), digits or underscores.
Notice this does not include spaces or hyphens. Feel free to include them as well:
/^[a-zßüÜöÖäÄ_][-\wßüÜöÖäÄ ]*$/i
ideone demo

Regex for allowing particular Special Characters

I need a regex for allowing list of special characters((_-.$#?,:'/!) and letters supporting utf-8 languages.
I tried
/^[\_\-\.\$#\?\,\:\'\/\!]*$/
but typing letters in English and Tamil shows invalid.
You need to escape the hyphen for it to be valid. You also don't need to escape most of the other characters - inside of brackets, almost everything is literal.
/[_\-.$#?,:'/!]*/
I have no idea if your regex engine supports \p{L}. You can try this:
^[_\-.\$#\?\,\:\'/!\p{L}]*$
or this one:
^[_\-.\$#\?\,\:\'/!\w]*$
The last one also matches digits.

Is there a regex way to detect whether a character can be part of a word or not?

The "tricky" part of this question is that what I mean by alphabeth is not just the 26 characters. It should also include anything alphabeth like, including accented characters and hebrew's alibeth, etc.etc.
Why I need them?
I want to split texts into words.
Alphabeths like latin alphabeth, hebrew's alibeth, arab abjads, are separated by space.
Chinese characters are separated by nothing.
So I think I should separate texts by anything that's not alphabeth.
In other word, a, b, c, d, é is fine.
駅,南,口,第,自,転,車.,3,5,6 is not and all such separator should be it's own words. Or stuff like that.
In short I want to detect whether a character may be a word by itself, or can be part of a word.
What have I tried?
Well you can check the question here I asked a long time ago:
How can we separate utf-8 characters into words if some of the characters are chinese?
I implement the only answer there but then I found out that the chinese characters aren't split. Why not split based on nothing? Well, that means the alphabeths are splitted too.
If all those alphabeths "stick" together that I can separate them based on UTF, that would be fine too.
I will just use the answer at How can we separate utf-8 characters into words if some of the characters are chinese?
and "pull out" all non alphabeth characters.
Not a perfect solution, but good enough for me because western characters and chinese characters rarely show up on the same text anyway.
Maybe you shouldn't do this with regular expressions but with good old string index scanning instead.
The Hebrew, Chinese, Korean etc. alphabets are all in consecutive ranges of unicode code-points. So you could easily detect the alphabet by reading the unicode value of the character and then checking which unicode block it belongs to.
Jan Goyvaerts (of PowerGrep fame) once showed me this very useful syntax to do just this:
(?<![\p{M}\p{L}])word(?![\p{M}\p{L}])
This expression uses a regex lookbehind and a regex lookahead to ensure that the boundaries of the word are such that there is no letter or diacritic mark on either side.
Why is this regex better than simply using "\b"? The strength of this regex is the incorporation of \p{M} to include diacritics. When the normal word boundary marker (\b) is used, regex engines will find word breaks at the places of many diacritics, even though the diacritics are actually part of the word (this is the case, for instance, with Hebrew diacritics. For an example, take the Hebrew word גְּבוּלוֹת, and run a regex of "\b." on it - you'll see how it actually breaks the word into word different parts, at each diacritic point). The regex above fixes this by using a Unicode Character Class to ensure that diacritics are always considered part of the word and not breaks within the word.