How to match Asian characters using regex? - regex

I have to extract quoted phrases (strings) from response data using Dart, and it mostly works with this regex:
\B"[^"]*"\B
It matches phrases well, but it excludes Asian characters such as kanji (and other non-Latin scripts: Japanese, Chinese, Korean, Russian, etc.).
var regex = RegExp(r'\B"[^"]*"\B');
Iterable<Match> matches = regex.allMatches(returnString);
matches.forEach((match) {
  t.add(match.group(0));
});
How can I make it match these kanji alongside Western characters too? Or, if I need a new regex, can you help me rewrite it? Thank you, and sorry for my lack of knowledge and bad English.

To match all non-ascii chars you can use RegExp(r'[^\x00-\x7F]')

The RegExp \B"[^"]*"\B relies on the \B escape, a "non-word-boundary" zero-width assertion that matches only where \b does not: the characters on both sides are either both "word characters" (ASCII a-z, A-Z, 0-9 or _) or both not (the start and end of the string count as non-word). Since " is not a word character, \B" matches only when the opening quote is preceded by a non-word character or the start of the string, and "\B matches only when the closing quote is followed by a non-word character or the end of the string. Between those two quotes, [^"]* matches any non-quote character, no matter what script it is in. The word-character test behind \B is ASCII-only, though, so every non-ASCII letter next to a quote counts as a non-word character; I'm guessing that is what is causing you issues.
It's not clear from this alone exactly what it is you want to achieve.
Can you describe the strings that you want to match, and some examples of strings that you don't want to match?
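To see what \B does with this pattern, here is a minimal sketch in JavaScript (Dart's RegExp uses the same syntax and semantics as JavaScript regular expressions). The sample string is invented for illustration and is not the asker's real data:

```javascript
// Same pattern as in the question. \B is evaluated with the ASCII-only
// definition of "word character" ([A-Za-z0-9_]).
const quoted = /\B"[^"]*"\B/g;

// Hypothetical response data, made up for this example.
const sample = '{"title": "日本語のテキスト", "name": "Привет"}';

// Every quote here has punctuation, a space, or a string edge on its outer
// side, so \B succeeds and the script inside the quotes does not matter.
const found = sample.match(quoted);
console.log(found);

// An ASCII letter directly outside the quotes makes \B fail.
const glued = 'a"日本"b'.match(quoted);
console.log(glued); // null
```

If the quotes in the real data can sit directly next to letters or digits, dropping both \B assertions and using "[^"]*" on its own may be closer to what is wanted.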

Related

Regex pattern that matches a string with English characters only with no spaces and special characters in flutter?

I tried many pattern examples but none worked so far.
I want to use a regex pattern that only allows English characters with no spaces and no special characters in flutter.
RegExp('[a-zA-Z]');
This is what I used before, but it allows spaces and other characters.
Also after using
r'/^[a-zA-Z]+$/'
the string entered as the username always returns false, with or without spaces and special characters.
I'm not looking for an answer in JavaScript; also, the other question doesn't have the answer I'm looking for.
Here's a regular expression pattern to match only letters from the alphabet (both uppercase and lowercase):
/^[a-zA-Z]+$/
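This pattern will match one or more characters that are in the range a-z or A-Z. The ^ and $ symbols indicate the start and end of the string, respectively, ensuring that only letters are present and no additional characters (such as spaces or numbers) are allowed. The surrounding / characters are JavaScript literal delimiters, not part of the pattern; inside a Dart string they are matched literally, so in Dart write RegExp(r'^[a-zA-Z]+$').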
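A small sketch of why the two attempts in the question misbehave, written in JavaScript because Dart's RegExp follows the same syntax. Building the regex from a string, as Dart does, shows the effect of the stray slashes. The helper name isValidUsername is hypothetical:

```javascript
// Unanchored: succeeds as soon as ONE letter appears anywhere in the input.
const unanchored = new RegExp('[a-zA-Z]');

// Slashes inside a pattern string are literal characters. The pattern now
// demands a "/" before the start-of-string anchor, which can never happen.
const withSlashes = new RegExp('/^[a-zA-Z]+$/');

// Anchored at both ends, no slashes: the whole input must be letters.
const lettersOnly = new RegExp('^[a-zA-Z]+$');

// Hypothetical helper name, used only for this illustration.
function isValidUsername(name) {
  return lettersOnly.test(name);
}

console.log(unanchored.test('user name!'));  // true
console.log(withSlashes.test('username'));   // false
console.log(isValidUsername('username'));    // true
console.log(isValidUsername('user name'));   // false
```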

Regex not extracting all matching words

I am trying to extract words that have at least one character from a special character set. It picks up some words and not others. Here is a link to regex101 to test it. This is the regex: \b(\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*)\b, and this is the sample sentence I am using:
His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn Mūsa ibn
Al-Daḥāk Al-Sulamī Al-Tirmidhī.
It should match the following words:
ʿĪsa Muḥammad ʿĪsa Mūsa Al-Daḥāk Al-Sulamī Al-Tirmidhī
I am not too experienced with regex, so I have no idea what I am doing wrong. If someone knows any tool to find out why a specific word doesn't match a regex pattern, please let me know as well.
You can use
[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ][\wāīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]*
After matching the one required special character, use another character set to match more occurrences of those characters or normal word characters.
https://regex101.com/r/ovJoLt/2
You can make this work by enabling the Unicode flag /u (so that the word boundary \b assertions support Unicode characters) and adding hyphens to the surrounding character groups:
/\b[\w-]*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+[\w-]*\b/gu
Plus, you don't need the capturing group, since the only characters being matched form the desired output anyway (\b is a zero-width assertion).
Demo
You are not doing anything wrong, except that to match Unicode word boundaries you have to enable the u modifier, or use whitespace lookarounds instead: (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ]+\w*(?!\S)
If you want to match hyphens too, add the hyphen to your character class: (?<!\S)\w*[āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ-]+\w*(?!\S)
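How much the u flag helps depends on the regex flavor. In JavaScript specifically, \w and \b stay ASCII-only even with u, so one way to get the same effect there is to spell the word class out with Unicode property escapes. The following is a sketch under that assumption, not one of the patterns from the answers above. The strings are normalized to NFC so that precomposed and decomposed spellings of the accented letters compare equal:

```javascript
// The required "special" characters from the question, in NFC form.
const special = 'āīūẓḍḥṣṭĀĪŪẒḌḤṢṬʿʾ'.normalize('NFC');

// Unicode-aware stand-in for [\w-]: any letter, any digit, underscore, hyphen.
const wordChar = '[\\p{L}\\p{N}_-]';

// word chars, then one required special character, then more word chars.
const pattern = new RegExp(`${wordChar}*[${special}]${wordChar}*`, 'gu');

const text = ('His full name is Abu ʿĪsa Muḥammad ibn ʿĪsa ibn Sawrah ibn ' +
  'Mūsa ibn Al-Daḥāk Al-Sulamī Al-Tirmidhī.').normalize('NFC');

const words = text.match(pattern);
console.log(words);
```

With the sample sentence this should list the seven words the question asks for; words without any of the special characters (Abu, ibn, Sawrah) are skipped.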

RegEx exact repetition is not working

I am a noob in RegEx and I am trying to write a RegEx pattern that has a minimum of 6 and maximum of 9 total characters, where the first 3 characters are letters (case-insensitive, alpha only) and the rest are digits.
I have the following pattern: ^\w{3}\d{3,6}$
But for some reason, that pattern returns true when I enter the following: aa12345 or Ap4587 and so on. I need that the first 3 characters are only letters (exact).
I hope someone will be able to help me on this.
Thanks!!!
\w is equivalent to [a-zA-Z0-9_]. You should change the regex to:
^[a-zA-Z]{3}\d{3,6}$
Use [a-zA-Z] to match letters only. I prefer [0-9] for consistency, even though it is the same as \d here:
/^[a-zA-Z]{3}[0-9]{3,6}$/
\w matches a-z, A-Z, 0-9 and _, so it should only be used where any alphanumeric character (or an underscore) is acceptable.
If you want to allow a broader range of unicode values, I'd recommend:
[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}]{3}
This will allow lowercase, uppercase, titlecase, "other" and modifier letters as your first three characters.
For example, [a-zA-Z]{3} would reject the start of the word "Résumé" because of the accented é. The pattern above would allow it.
I recommend you check out the documentation for regular expression character classes:
Character Classes or Character Sets
The MSDN documentation is also very good and most of it is compatible with standard regex libraries:
Character Classes in Regular Expressions
Try this:
^[a-zA-Z]{3}\d{3,6}$
because \w matches digits and the underscore as well as letters (a-z, A-Z, 0-9, _).
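To make the difference concrete, here is a short JavaScript sketch comparing the original pattern with the corrected one, using the inputs from the question. The third pattern is a simplified variant of the Unicode suggestion above (plain \p{L} instead of listing the letter subcategories) and needs the u flag in JavaScript:

```javascript
const loose = /^\w{3}\d{3,6}$/;             // original: \w also accepts digits and _
const strict = /^[a-zA-Z]{3}\d{3,6}$/;      // exactly three ASCII letters first
const unicodeLetters = /^\p{L}{3}\d{3,6}$/u; // three letters from any script

console.log(loose.test('aa12345'));   // true: "aa1" satisfies \w{3}
console.log(loose.test('Ap4587'));    // true: "Ap4" satisfies \w{3}
console.log(strict.test('aa12345'));  // false
console.log(strict.test('Ap4587'));   // false
console.log(strict.test('abc123'));   // true

// \u00E9 is a precomposed "é".
console.log(strict.test('R\u00E9s123'));         // false
console.log(unicodeLetters.test('R\u00E9s123')); // true
```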

Regular expression to allow spaces between words

I want a regular expression that prevents symbols and only allows letters and numbers. The regex below works great, but it doesn't allow for spaces between words.
^[a-zA-Z0-9_]*$
For example, when using this regular expression "HelloWorld" is fine, but "Hello World" does not match.
How can I tweak it to allow spaces?
tl;dr
Just add a space in your character class.
^[a-zA-Z0-9_ ]*$
Now, if you want to be strict...
The above isn't exactly correct. Due to the fact that * means zero or more, it would match all of the following cases that one would not usually mean to match:
An empty string, "".
A string comprised entirely of spaces, " ".
A string that leads and / or trails with spaces, " Hello World ".
A string that contains multiple spaces in between words, "Hello   World".
Originally I didn't think such details were worth going into, as OP was asking such a basic question that it seemed strictness wasn't a concern. Now that the question's gained some popularity however, I want to say...
...use @stema's answer.
Which, in my flavor (without using \w) translates to:
^[a-zA-Z0-9_]+( [a-zA-Z0-9_]+)*$
(Please upvote @stema regardless.)
Some things to note about this (and @stema's) answer:
If you want to allow multiple spaces between words (say, if you'd like to allow accidental double-spaces, or if you're working with copy-pasted text from a PDF), then add a + after the space:
^\w+( +\w+)*$
If you want to allow tabs and newlines (whitespace characters), then replace the space with a \s+:
^\w+(\s+\w+)*$
Here I suggest the + by default because, for example, Windows linebreaks consist of two whitespace characters in sequence, \r\n, so you'll need the + to catch both.
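A quick JavaScript sketch of the three variants side by side, assuming the whole input is tested at once (no multiline flag):

```javascript
const singleSpace = /^\w+( \w+)*$/;     // exactly one space between words
const manySpaces = /^\w+( +\w+)*$/;     // one or more spaces between words
const anyWhitespace = /^\w+(\s+\w+)*$/; // any run of whitespace between words

console.log(singleSpace.test('Hello  World'));  // false: double space
console.log(manySpaces.test('Hello  World'));   // true
console.log(manySpaces.test('Hello\tWorld'));   // false: a tab is not a space

// A Windows line break is two whitespace characters, so \s alone is not enough.
console.log(/^\w+(\s\w+)*$/.test('Hello\r\nWorld')); // false
console.log(anyWhitespace.test('Hello\r\nWorld'));   // true
```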
Still not working?
Check what dialect of regular expressions you're using.* In languages like Java you'll have to escape your backslashes, i.e. \\w and \\s. In older or more basic languages and utilities, like sed, \w and \s aren't defined, so write them out with character classes, e.g. [a-zA-Z0-9_] and [ \f\n\r\t\v], respectively.
* I know this question is tagged vb.net, but based on 25,000+ views, I'm guessing it's not only those folks who are coming across this question. Currently it's the first hit on google for the search phrase, regular expression space word.
One possibility would be to just add the space to your character class, as acheong87 suggested. Whether that is enough depends on how strict your pattern needs to be, because it would also allow a string starting with 5 spaces, or strings consisting only of spaces.
The other possibility is to define a pattern:
I will use \w, which in most regex flavours is the same as [a-zA-Z0-9_] (in some it is Unicode-based):
^\w+( \w+)*$
This will allow a series of at least one word and the words are divided by spaces.
^ Match the start of the string
\w+ Match a series of at least one word character
( \w+)* is a group that is repeated 0 or more times. In the group it expects a space followed by a series of at least one word character
$ matches the end of the string
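The difference between the two approaches shows up on edge cases. Here is a minimal JavaScript sketch that runs both patterns over a handful of made-up inputs:

```javascript
const loose = /^[a-zA-Z0-9_ ]*$/; // space simply added to the character class
const strict = /^\w+( \w+)*$/;    // words separated by exactly one space

const cases = ['Hello World', '', '   ', ' Hello World ', 'Hello  World', 'Hello, World'];

// One row per input, recording what each pattern says about it.
const results = cases.map((input) => ({
  input,
  loose: loose.test(input),
  strict: strict.test(input),
}));

console.table(results);
```

Only "Hello World" passes both. The empty string, the all-spaces string, the padded string and the double-spaced string pass the loose pattern but not the strict one, and "Hello, World" fails both because of the comma.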
This one worked for me
([\w ]+)
Try with:
^(\w+ ?)*$
Explanation:
\w - alias for [a-zA-Z_0-9]
" ?" - allows a single optional space after each word (this also permits one trailing space)
I assume you don't want leading/trailing space. This means you have to split the regex into "first character", "stuff in the middle" and "last character":
^[a-zA-Z0-9_][a-zA-Z0-9_ ]*[a-zA-Z0-9_]$
or if you use a perl-like syntax:
^\w[\w ]*\w$
Also: If you intentionally worded your regex that it also allows empty Strings, you have to make the entire thing optional:
^(\w[\w ]*\w)?$
If you want to only allow single space chars, it looks a bit different:
^((\w+ )*\w+)?$
This matches 0..n words followed by a single space, plus one word without space. And makes the entire thing optional to allow empty strings.
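A short JavaScript sketch of how these three patterns behave on a few made-up inputs. One thing it makes visible is that the first pattern needs at least two characters, since it asks for a separate first and last character:

```javascript
const atLeastTwo = /^\w[\w ]*\w$/;       // first char, middle, last char
const optional = /^(\w[\w ]*\w)?$/;      // same, but the empty string is allowed
const singleSpaced = /^((\w+ )*\w+)?$/;  // words separated by single spaces

console.log(atLeastTwo.test('a'));     // false: needs a first AND a last character
console.log(atLeastTwo.test('ab'));    // true
console.log(atLeastTwo.test('a  b'));  // true: the middle part allows any run of spaces
console.log(atLeastTwo.test(' ab'));   // false: leading space

console.log(optional.test(''));        // true

console.log(singleSpaced.test('a'));     // true
console.log(singleSpaced.test('a b c')); // true
console.log(singleSpaced.test('a  b'));  // false: double space
```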
This regular expression
^\w+(\s\w+)*$
will only allow a single whitespace character between words (\s also matches tabs and line breaks, not just spaces) and no leading or trailing spaces.
Below is the explanation of the regular expression:
^ Assert position at start of the string
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
1st Capturing group (\s\w+)*
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\s Match any white space character [\r\n\t\f ]
\w+ Match any word character [a-zA-Z0-9_]
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
$ Assert position at end of the string
Just add a space to the end of your character class as follows:
[a-zA-Z0-9_ ]
When matched from the start of the input (for example with a leading ^), this does not allow a space at the beginning, but it allows spaces between words. Because of the .*, it also allows special characters after the first word. A good regex for FirstName and LastName fields.
\w+.*$
For alphabets only:
^([a-zA-Z])+(\s)+[a-zA-Z]+$
For alphanumeric value and _:
^(\w)+(\s)+\w+$
If you are using JavaScript then you can use this regex:
/^[a-z0-9_.-\s]+$/i
For example:
/^[a-z0-9_.-\s]+$/i.test("") //false
/^[a-z0-9_.-\s]+$/i.test("helloworld") //true
/^[a-z0-9_.-\s]+$/i.test("hello world") //true
/^[a-z0-9_.-\s]+$/i.test("none alpha: ɹqɯ") //false
The only drawback with this regex is a string comprised entirely of spaces. "       " will also show as true.
It was my regex: @"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)*$"
I just added ([\w ]+) at the end of my regex, before the *:
@"^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)([\w ]+)*$"
Now string is allowed to have spaces.
This regex allows only letters and spaces:
^[a-zA-Z ]*$
Try with this one:
result = re.search(r"\w+( )\w+", text)

How can I create an alphanumeric Regex for all languages?

I had this problem today:
This regex matches only English: [a-zA-Z0-9].
If I need support for any language in this world, what regex should I write?
If you use character class shorthands and a Unicode aware regex engine you can do that. The \w class matches "word characters" (letters, digits, and underscores).
Beware of some regex flavors that don't do this so well: JavaScript uses ASCII for \d (digits) and \w, but Unicode for \s (whitespace). XML does it the other way around.
Alphabet/Letter: \p{L}
Number: \p{N}
So to match alphanumerics in all languages, you can use: [\p{L}\p{N}]+
I was looking for a way to replace all non-alphanum chars for all languages with a space in JS and ended up using the following way to do it:
const regexForNonAlphaNum = new RegExp(/[^\p{L}\p{N}]+/ug);
someText.replace(regexForNonAlphaNum, " ");
Since this is JS, we need to add the u flag to make the regex Unicode-aware; g stands for global, as I wanted to match all instances and not just the first one.
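The same idea as a self-contained sketch. The function name keepAlphaNum and the sample text are made up for illustration:

```javascript
// Any run of characters that are neither letters (\p{L}) nor numbers (\p{N}).
const nonAlphaNum = /[^\p{L}\p{N}]+/gu;

// Hypothetical helper: collapse every such run into one space, then trim.
function keepAlphaNum(text) {
  return text.replace(nonAlphaNum, ' ').trim();
}

const cleaned = keepAlphaNum('Hello, 世界! Привет... 123');
console.log(cleaned); // "Hello 世界 Привет 123"
```

One caveat: combining marks belong to category \p{M}, not \p{L}, so text with decomposed accents may need text.normalize('NFC') first, or \p{M} added to the class.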
References:
https://www.linkedin.com/pulse/regex-one-pattern-rule-them-all-find-bring-darkness-bind-carranza/?trackingId=U6tRte%2BzTAG6O4AA3CrFmA%3D%3D
https://www.regular-expressions.info/unicode.html
Regex supporting most Western European (Latin-script) languages; note that the A-z range also lets through [ \ ] ^ _ and `:
^[A-zÀ-Ÿ\d-]*$
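A quick JavaScript check of what this range-based pattern accepts, compared with a property-escape version. The range endpoints are written as \u escapes (À is U+00C0, Ÿ is U+0178) so the example does not depend on how the source file is encoded. The second pattern is an alternative sketch, not part of the answer above:

```javascript
// Same character class as above, with the accented range spelled as escapes.
const latinRanges = /^[A-z\u00C0-\u0178\d-]*$/;

// Alternative sketch: letters and numbers from any script, plus the hyphen.
const anyScript = /^[\p{L}\p{N}-]*$/u;

console.log(latinRanges.test('Caf\u00E9'));  // true
console.log(latinRanges.test('[_^]'));       // true: A-z spans the punctuation between Z and a
console.log(latinRanges.test('Привет'));     // false: Cyrillic is outside both ranges
console.log(latinRanges.test('日本語'));      // false

console.log(anyScript.test('Привет'));       // true
console.log(anyScript.test('日本語'));        // true
console.log(anyScript.test('[_^]'));         // false
```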
The regex below is the only one that worked for me:
"\\p{LD}+" ==> LD means any letter or digit.
If you want to clean your text from any non alphanumeric characters you can use the following:
text.replaceAll("\\P{LD}+", ""); // Note: the P is capital.