I had this problem today:
This regex matches only English: [a-zA-Z0-9].
If I need support for any language in this world, what regex should I write?
You can do that with character class shorthands and a Unicode-aware regex engine. The \w class matches "word characters" (letters, digits, and underscores).
Beware of some regex flavors that don't do this so well: JavaScript uses ASCII for \d (digits) and \w, but Unicode for \s (whitespace). XML does it the other way around.
Alphabet/Letter: \p{L}
Number: \p{N}
So, to match alphanumeric characters in all languages, you can use: [\p{L}\p{N}]+
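To see the difference between the ASCII shorthands and the Unicode categories in practice, here is a small JavaScript sketch. The sample characters are my own picks for illustration, not something from the question.

```javascript
// Without the u flag, \w and \d are ASCII-only in JavaScript.
const wordMatchesAccent = /\w/.test("\u00e9");   // "é" (precomposed)
const digitMatchesArabic = /\d/.test("\u0663");  // Arabic-Indic digit three
// \s follows the Unicode whitespace definition, so a no-break space matches.
const spaceMatchesNbsp = /\s/.test("\u00a0");

// Unicode property escapes need the u flag and work across scripts.
const letterMatchesAccent = /\p{L}/u.test("\u00e9");
const numberMatchesArabic = /\p{N}/u.test("\u0663");
const alnumRun = "abc\u00e9\u0663!".match(/[\p{L}\p{N}]+/u)[0];

console.log(wordMatchesAccent, digitMatchesArabic, spaceMatchesNbsp); // false false true
console.log(letterMatchesAccent, numberMatchesArabic);                // true true
console.log(alnumRun.length);                                         // 5 (everything before "!")
```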
I was looking for a way to replace all non-alphanumeric characters, in any language, with a space in JS and ended up with the following:
const regexForNonAlphaNum = /[^\p{L}\p{N}]+/gu;
const cleanedText = someText.replace(regexForNonAlphaNum, " ");
Since this is JS, we need the u flag to make the regex Unicode-aware; g stands for global, as I wanted to replace all instances and not just the first one. Note that replace returns a new string and leaves someText unchanged.
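For a quick check of that pattern on mixed-script input, something like the following should do. The helper name replaceNonAlphaNum and the sample string are made up for illustration.

```javascript
// Hypothetical helper; the pattern is the one from the answer above.
function replaceNonAlphaNum(text) {
  // u enables \p{...} property escapes; g replaces every run, not just the first.
  return text.replace(/[^\p{L}\p{N}]+/gu, " ");
}

// Accented letters are written as escapes so they are precomposed (NFC).
console.log(replaceNonAlphaNum("H\u00e9llo, w\u00f6rld! 你好，世界 123"));
// → "Héllo wörld 你好 世界 123"
```

One thing to watch: combining marks belong to \p{M}, not \p{L}, so accents stored in decomposed form would be replaced as well. Calling text.normalize("NFC") first avoids that for characters that have a precomposed form.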
References:
https://www.linkedin.com/pulse/regex-one-pattern-rule-them-all-find-bring-darkness-bind-carranza/?trackingId=U6tRte%2BzTAG6O4AA3CrFmA%3D%3D
https://www.regular-expressions.info/unicode.html
Regex supporting most Latin-script languages (ASCII letters plus the accented letters between À and Ÿ):
^[A-Za-zÀ-Ÿ\d-]*$
The regex below is the only one that worked for me:
"\\p{LD}+" ==> LD means any letter or digit.
If you want to strip all non-alphanumeric characters from your text, you can use the following:
text = text.replaceAll("\\P{LD}+", ""); // Note: the P is capital, which negates the class.
I have to extract quoted phrases from response data using Dart, and I'm doing it with this regex:
\B"[^"]*"\B
It matches phrases well, but it excludes non-Latin characters (Japanese, Chinese, Korean, Russian, etc.).
var regex = RegExp(r'\B"[^"]*"\B');
Iterable<Match> matches = regex.allMatches(returnString);
matches.forEach((match) {
t.add(match.group(0));
});
How can I make it match these characters alongside Western characters too? Or, if I need a new regex, can you help me rewrite it? Thank you, and sorry for my lack of knowledge and bad English.
To match all non-ascii chars you can use RegExp(r'[^\x00-\x7F]')
The RegExp \B"[^"]*"\B relies on the \B escape, a zero-width "non-word-boundary" assertion. It matches only where the characters on both sides are word characters (ASCII a-z, A-Z, 0-9 or _) or both are not; the start and end of the string count as non-word. Since " is not a word character, the pattern matches only when the opening quote is preceded by a non-word character (or the start of the string) and the closing quote is followed by a non-word character (or the end of the string). Between those two quotes, [^"]* matches any non-quote character, no matter what script it is in. The \B assertions use the ASCII-only definition of a word character, though, so I'm guessing those are the ones causing you issues.
It's not clear from this alone exactly what it is you want to achieve.
Can you describe the strings that you want to match, and some examples of strings that you don't want to match?
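To make the \B behaviour concrete, here is a sketch in JavaScript; Dart documents its RegExp as following JavaScript's syntax and semantics, so I am assuming the results carry over. The helper findQuoted, the sample strings and the \B-free alternative pattern are my own illustrations, not something from the question.

```javascript
// Pattern from the question, plus an assumed alternative without \B.
const quotedWithB = /\B"[^"]*"\B/g;
const quotedPlain = /"[^"]*"/g;

// Hypothetical helper: all matches as an array, or [] when nothing matches.
function findQuoted(text, pattern) {
  return text.match(pattern) || [];
}

const spaced = 'say "hello" now';           // quotes next to spaces
const glued = 'key"value"end';              // quotes next to ASCII letters
const japanese = '彼は"こんにちは"と言った';  // quotes next to non-ASCII letters

console.log(findQuoted(spaced, quotedWithB));   // [ '"hello"' ]
console.log(findQuoted(glued, quotedWithB));    // []
console.log(findQuoted(glued, quotedPlain));    // [ '"value"' ]
console.log(findQuoted(japanese, quotedWithB)); // [ '"こんにちは"' ]
```

In this sketch the pattern only fails when a quote touches an ASCII letter, digit or underscore. Non-ASCII letters count as non-word characters, so \B is satisfied next to them.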
I am a noob in RegEx and I am trying to write a RegEx pattern that has a minimum of 6 and maximum of 9 total characters, where the first 3 characters are letters (case-insensitive, alpha only) and the rest are digits.
I have the following pattern: ^\w{3}\d{3,6}$
But for some reason, that pattern returns true when I enter values such as aa12345 or Ap4587. I need the first 3 characters to be letters only (exactly 3).
I hope someone will be able to help me on this.
Thanks!!!
\w is equivalent to [a-zA-Z0-9_]. You should change the regex to:
^[a-zA-Z]{3}\d{3,6}$
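Here is a short JavaScript sketch of why the original pattern accepts those inputs: \w also matches digits, so it can take the first digit as the "third letter". The extra test values abc123 and abc1234567 are my own additions.

```javascript
const loose = /^\w{3}\d{3,6}$/;         // pattern from the question
const strict = /^[a-zA-Z]{3}\d{3,6}$/;  // corrected pattern

for (const input of ["aa12345", "Ap4587", "abc123", "abc1234567"]) {
  console.log(input, loose.test(input), strict.test(input));
}
// aa12345    true  false  (\w{3} consumed "aa1")
// Ap4587     true  false  (\w{3} consumed "Ap4")
// abc123     true  true
// abc1234567 false false  (seven digits is over the maximum of six)
```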
Use [a-zA-Z] to match letters only. For consistency I prefer [0-9], even though it is the same as \d here.
/^[a-zA-Z]{3}[0-9]{3,6}$/
\w matches a-z, A-Z, 0-9 and _, so it should only be used when any alphanumeric character (or an underscore) is acceptable.
If you want to allow a broader range of Unicode letters, I'd recommend:
[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Lm}]{3}
This allows lowercase, uppercase, titlecase, "other" and modifier letters as your first three characters. Together these five categories make up \p{L}, so \p{L}{3} is an equivalent shorthand.
For example, [a-zA-Z]{3} would reject the first three characters of "Résumé" because of the accented é. The pattern above would allow them.
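As a rough illustration in JavaScript (which needs the u flag for \p{...}), the Unicode letter class can be combined with the digit part of the original pattern like this. The name unicodeCode and the sample inputs are hypothetical. The five categories listed above are the sub-categories of \p{L}, which is why the shorter form is used here.

```javascript
// Three letters from any script, then 3 to 6 ASCII digits.
// In JavaScript \d stays [0-9] even with the u flag.
const unicodeCode = /^\p{L}{3}\d{3,6}$/u;
const asciiCode = /^[a-zA-Z]{3}\d{3,6}$/;

console.log(unicodeCode.test("R\u00e9s123")); // true  ("Rés123")
console.log(asciiCode.test("R\u00e9s123"));   // false (é is outside a-z)
console.log(unicodeCode.test("R2s123"));      // false (digit in the letter part)
```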
I recommend you check out the documentation for regular expression character classes:
Character Classes or Character Sets
The MSDN documentation is also very good and most of it is compatible with standard regex libraries:
Character Classes in Regular Expressions
Try this:
^[a-zA-Z]{3}\d{3,6}$
since \w matches a-z, A-Z, 0-9 and _, not just letters.
Eg: "_V9DXkFMCEeGrv54B-L8--A"
\w+ alone will not work
You can use:
[\w-]+
Use this pattern: [-\w]+ where \w is an alphanumeric character or an underscore. The exact pattern depends on your language. For example, in Java you have to write [-\\w]+, and there may also be languages where - is a special character that you have to escape too. So please edit your question and add the language you use.
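A minimal JavaScript check with the sample ID from above shows where \w+ stops and what adding the hyphen changes:

```javascript
const id = "_V9DXkFMCEeGrv54B-L8--A";

// \w+ stops at the first hyphen.
const wordOnly = id.match(/\w+/)[0];
// A hyphen placed first or last in a character class is a literal hyphen.
const withHyphen = id.match(/[\w-]+/)[0];

console.log(wordOnly);          // "_V9DXkFMCEeGrv54B"
console.log(withHyphen === id); // true
```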
I need a regex to find all chars that are NOT a-z or 0-9
I don't know the syntax for the NOT operator in regex.
I want the regex to be NOT [a-z, A-Z, 0-9].
Thanks in advance!
It's ^. Your regex should use [^a-zA-Z0-9]. Beware: this character class may have unexpected behavior with non-ascii locales. For instance, this would match é.
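To illustrate that caveat, here is a small JavaScript sketch comparing the ASCII negated class with a Unicode-aware one. The input string is made up, and the \p{...} variant is an assumption about what you might want instead; it needs the u flag.

```javascript
const input = "caf\u00e9 42!"; // "café 42!"

// The ASCII-only negated class treats é like punctuation and removes it.
const asciiOnly = input.replace(/[^a-zA-Z0-9]/g, "");
// Negating Unicode letters and numbers keeps é.
const unicodeAware = input.replace(/[^\p{L}\p{N}]/gu, "");

console.log(asciiOnly);    // "caf42"
console.log(unicodeAware); // "café42"
```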
Edited
If the regexes are perl-compatible (PCRE), you can use \s to match all whitespace. This expands to include spaces and other whitespace characters. If they're posix-compatible, use [:space:] character class (like so: [^a-zA-Z0-9[:space:]]). I would recommend using [:alnum:] instead of a-zA-Z0-9.
If you want to match the end of a line, you should include a $ at the end. Multiline mode is only needed when ^ and $ should match at the start and end of every line rather than of the whole input; in some tools it can also reduce performance on larger files, since more of the input must be held in memory.
Why don't you include a copy of sample input, the text you want to match, and the program you are using to do so?
It's pretty simple; you just add ^ at the beginning of a character set to negate that character set.
For example, the following pattern will match every character that's not in that character set, i.e., anything that is not a lowercase ASCII letter or a digit:
[^a-z0-9]
As a side note, some of the more helpful Regular Expression resources I've found have been this site and this cheat sheet (C# specific).
Put a ^ at the beginning of your character class expression: [^a-z0-9]
Put ^ at the start: [^a-zA-Z0-9]
Try it with one of these PHP functions:
preg_match();
preg_replace();
eregi(); (removed in PHP 7)
You can also use \W; it's a shorthand for a non-word character (equal to [^a-zA-Z0-9_]).
How do I match French and Russian Cyrillic alphabet characters with a regular expression? I only want to do the alpha characters, no numbers or special characters. Right now I have
[A-Za-z]
If your regex flavor supports Unicode blocks ([\p{IsCyrillic}]), you can match Cyrillic characters with:
[\p{IsCyrillic}] or [\p{Cyrillic}]
Otherwise try using the code point range of the Cyrillic block (U+0400–U+04FF):
[\u0400-\u04FF]
For PHP use:
[\x{0400}-\x{04FF}]
Explanation:
[\p{IsCyrillic}]
Match a character from the Unicode block "Cyrillic" (U+0400–U+04FF).
Note:
Unicode Characters list and Numeric HTML Entities of [U+0400–U+04FF] .
It depends on your regex flavor. If it supports Unicode character classes (like .NET, for instance), \p{L} matches a letter character (in any character set).
To match only Russian Cyrillic characters use:
[\u0401\u0451\u0410-\u044f]
which is the equivalent of:
[ЁёА-я]
where А is Cyrillic, not Latin. (Despite looking the same they have different codes)
\p{IsCyrillic}, \p{Cyrillic} and [\u0400-\u04FF], which others suggested, will match all variants of Cyrillic, not only Russian.
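The difference between the Russian-only range and "all of Cyrillic" can be sketched in JavaScript as follows. Note that JavaScript accepts neither \p{IsCyrillic} nor \p{Cyrillic}; it spells the script property \p{Script=Cyrillic} and requires the u flag. The sample words are my own.

```javascript
const russianOnly = /^[\u0401\u0451\u0410-\u044f]+$/; // Ё, ё and А-я
const anyCyrillic = /^\p{Script=Cyrillic}+$/u;        // the whole Cyrillic script

const russian = "Привет";
const ukrainian = "\u0457\u0436\u0430\u043a"; // "їжак"; ї (U+0457) is not a Russian letter
const latin = "Privet";

console.log(russianOnly.test(russian), anyCyrillic.test(russian));     // true true
console.log(russianOnly.test(ukrainian), anyCyrillic.test(ukrainian)); // false true
console.log(russianOnly.test(latin), anyCyrillic.test(latin));         // false false
```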
If you use a modern PHP version, just:
preg_match('/^\p{L}+$/u', $string);
Don't forget the u flag for Unicode support!
Regex to match Cyrillic letters together with basic Latin (English) letters:
^[A-Za-z.!#?#"$%&:;() *\+,\/;\-=[\\\]\^_{|}<>\u0400-\u04FF]*$
It matches special characters, Cyrillic letters and English letters.
Various regex dialects use [:alpha:] for any alphabetic character in the current locale ([:alnum:] is the alphanumeric counterpart). You may need to put that inside a bracket expression, e.g. [[:alpha:]].
This worked for me:
[a-z\u0400-\u04FF]
If you use Elixir:
String.match?(string, ~r/^\p{Cyrillic}*$/u)
You need to add the u flag for unicode support.
You can use the first and the last letter. For example in Bulgarian:
[А-я]+
For modern PHP (source):
$string = 'тест тест Тест Обязателльно Stackoverflow >!<';
var_dump(preg_replace('/[\x{0410}-\x{042F}]+.*[\x{0410}-\x{042F}]+/iu', '', $string));
In Java, to match Cyrillic letters and whitespace, use the following pattern:
^[\p{InCyrillic}\s]+$