How to match Cyrillic characters with a regular expression

How do I match French and Russian Cyrillic alphabet characters with a regular expression? I only want to do the alpha characters, no numbers or special characters. Right now I have
[A-Za-z]

If your regex flavor supports Unicode blocks ([\p{IsCyrillic}]), you can match Cyrillic characters with:
[\p{IsCyrillic}] or [\p{Cyrillic}]
Otherwise try using:
[\u0400-\u04FF]
For PHP use:
[\x{0400}-\x{04FF}]
Explanation:
[\p{IsCyrillic}]
Match a character from the Unicode block "Cyrillic" (U+0400–U+04FF) «[\p{IsCyrillic}]»
Note:
See the Unicode character list and numeric HTML entities for U+0400–U+04FF.

It depends on your regex flavor. If it supports Unicode character classes (like .NET, for instance), \p{L} matches a letter character (in any character set).

To match only Russian Cyrillic characters use:
[\u0401\u0451\u0410-\u044f]
which is the equivalent of:
[ЁёА-я]
where А is Cyrillic, not Latin. (Despite looking the same, they have different code points.)
\p{IsCyrillic}, \p{Cyrillic}, and [\u0400-\u04FF], which others suggested, will match all variants of Cyrillic, not only Russian.
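
The difference between the Russian-only class and the whole Cyrillic block can be sketched in Python. This is only an illustration: Python's built-in re module has no \p{...} support, so both classes are written as explicit ranges, and the helper names is_russian and is_cyrillic are made up for this example.

```python
import re

# Russian alphabet only: Ё, ё, and the contiguous run А (U+0410) through я (U+044F).
RUSSIAN_ONLY = re.compile(r"[\u0401\u0451\u0410-\u044f]+")
# The whole Cyrillic block (U+0400-U+04FF), which also covers Ukrainian, Serbian, etc.
CYRILLIC_BLOCK = re.compile(r"[\u0400-\u04FF]+")


def is_russian(text: str) -> bool:
    """True if text consists only of letters of the Russian alphabet."""
    return RUSSIAN_ONLY.fullmatch(text) is not None


def is_cyrillic(text: str) -> bool:
    """True if text consists only of characters from the Cyrillic block."""
    return CYRILLIC_BLOCK.fullmatch(text) is not None


print(is_russian("Привет"), is_cyrillic("Привет"))  # True True
print(is_russian("Київ"), is_cyrillic("Київ"))      # False True (ї is U+0457)
```

The Ukrainian letter ї sits outside the А-я run, so only the block-wide range accepts it.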

If you use a modern PHP version, just use (where $string is the text to test):
preg_match('/^\p{L}+$/u', $string);
Don't forget the u flag for Unicode support!

Regex to match Cyrillic letters together with Latin (English) letters:
^[A-Za-z.!#?#"$%&:;() *\+,\/;\-=[\\\]\^_{|}<>\u0400-\u04FF]*$
It matches special characters, Cyrillic letters, and English letters.

Various regex dialects use [:alpha:] for any alphabetic character in the current locale. (You may need to put that in a character class, e.g. [[:alpha:]].)

This worked for me:
[a-z\u0400-\u04FF]

If you use Elixir:
String.match?(string, ~r/^\p{Cyrillic}*$/u)
You need to add the u flag for unicode support.

You can use the first and the last letter. For example in Bulgarian:
[А-я]+

For modern PHP (source):
$string = 'тест тест Тест Обязателльно Stackoverflow >!<';
var_dump(preg_replace('/[\x{0410}-\x{042F}]+.*[\x{0410}-\x{042F}]+/iu', '', $string));

In Java to match Cyrillic letters and space use the following pattern
^[\p{InCyrillic}\s]+$

Related

Alternative for VBA regex unicode characters groups support

VBA regular expressions do not support Unicode character groups (e.g. \p{L}). Also, \w matches only Latin alphanumerics. So my problem was how to replace non-alphanumeric characters in my Unicode string without typing the whole character list into the pattern field.
For example, replacing every non-word character with an underscore in "abc (for αβψ̌) and de (for δε)" using the pattern \W results in "abc__for_______and_de__for____" instead of "abc__for_αβψ___and_de__for_δε_".
Finally, I think there is at least one quick solution.
One approach is to find the first and last Unicode characters of the range you need and use them as a character range. The pattern [^\w,\u0370-\u03FF\u1F00-\u1FFF] matches any character that is not a Latin alphanumeric, an underscore, a comma, or a Greek character, so replacing its matches gets rid of everything else.
This pattern can also be used in the Excel function RegExReplace.
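
The range approach can be tried outside VBA as well. The following Python sketch makes two assumptions: the re.ASCII flag is used to imitate an engine whose \w is Latin-only, and the comma from the pattern above is left out. The function name underscore_non_word is made up for this example.

```python
import re

# ASCII word characters plus the Greek and Coptic block (U+0370-U+03FF)
# and the Greek Extended block (U+1F00-U+1FFF); everything else is matched.
# re.ASCII restricts \w to [a-zA-Z0-9_], imitating a Latin-only engine.
NOT_LATIN_OR_GREEK = re.compile(r"[^\w\u0370-\u03FF\u1F00-\u1FFF]", re.ASCII)


def underscore_non_word(text: str) -> str:
    """Replace every character outside the allowed ranges with an underscore."""
    return NOT_LATIN_OR_GREEK.sub("_", text)


# \u030c is the combining caron that follows ψ in the example string.
sample = "abc (for αβψ\u030c) and de (for δε)"
print(underscore_non_word(sample))  # abc__for_αβψ___and_de__for_δε_
```

The combining caron falls outside both Greek ranges, which is why three underscores follow αβψ.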

Regular expressions for making uppercase accented letters

I need to make a replacement like this
from arouzière to AROUZIÈRE.
I use notepad++ 6.6.7 for this in the following manner:
search: (\p{L}*?)
replace: \U\1\E
Problem:
The result is AROUZIèRE.
As you can see the accented letter is not made UPPERCASE.
Do you know a workaround or even if this is possible via RegEx with notepad++?
Thanks a lot for any help.
Try the pattern
(\p{Ll})
which matches lowercase letters.
Note that some unicode characters do not have an uppercase variant.
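
If the text can be processed outside the editor, the same replacement can be sketched in Python, whose str.upper() applies Unicode case mapping. This is not a Notepad++ feature, only an alternative under that assumption; the function name upper_letters is made up for this example.

```python
import re


def upper_letters(text: str) -> str:
    """Uppercase every run of letters, leaving digits and punctuation alone."""
    # [^\W\d_]+ matches runs of Unicode letters; str.upper() handles accents.
    return re.sub(r"[^\W\d_]+", lambda m: m.group(0).upper(), text)


print(upper_letters("arouzière"))  # AROUZIÈRE
```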

How can I create an alphanumeric Regex for all languages?

I had this problem today:
This regex matches only English: [a-zA-Z0-9].
If I need support for any language in this world, what regex should I write?
If you use character class shorthands and a Unicode aware regex engine you can do that. The \w class matches "word characters" (letters, digits, and underscores).
Beware of some regex flavors that don't do this so well: JavaScript uses ASCII for \d (digits) and \w, but Unicode for \s (whitespace). XML does it the other way around.
Alphabet/Letter: \p{L}
Number: \p{N}
So for an alphanumeric match in all languages, you can use: [\p{L}\p{N}]+
I was looking for a way to replace all non-alphanum chars for all languages with a space in JS and ended up using the following way to do it:
const regexForNonAlphaNum = new RegExp(/[^\p{L}\p{N}]+/ug);
someText.replace(regexForNonAlphaNum, " ");
Because this is JS, we need to add the u flag to make the regex Unicode-aware; g stands for global, as I wanted to match all instances and not just a single one.
References:
https://www.linkedin.com/pulse/regex-one-pattern-rule-them-all-find-bring-darkness-bind-carranza/?trackingId=U6tRte%2BzTAG6O4AA3CrFmA%3D%3D
https://www.regular-expressions.info/unicode.html
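
For comparison, here is a rough Python counterpart of the replacement above. Python's built-in re module does not support \p{L} or \p{N}, so this sketch relies on the Unicode-aware \w instead and treats [\W_] as an approximation of [^\p{L}\p{N}]; the two are not guaranteed to agree on every code point (combining marks, for instance). The function name replace_non_alnum is made up for this example.

```python
import re

# \W is "not a word character"; adding _ also removes the underscore,
# which \w would otherwise keep.
NON_ALNUM = re.compile(r"[\W_]+")


def replace_non_alnum(text: str) -> str:
    """Replace each run of non-alphanumeric characters with a single space."""
    return NON_ALNUM.sub(" ", text)


print(replace_non_alnum("héllo, wörld! 42 привет_мир"))  # héllo wörld 42 привет мир
```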
A regex supporting most Latin-script languages (note that the range A-z also admits the characters [ \ ] ^ _ and `):
^[A-zÀ-Ÿ\d-]*$
The regex below is the only one that worked for me:
"\\p{LD}+" ==> LD means any letter or digit.
If you want to clean your text from any non alphanumeric characters you can use the following:
text.replaceAll("\\P{LD}+", "");//Note P is capital.

Regex help NOT a-z or 0-9

I need a regex to find all chars that are NOT a-z or 0-9
I don't know the syntax for the NOT operator in regex.
I want the regex to be NOT [a-z, A-Z, 0-9].
Thanks in advance!
It's ^. Your regex should use [^a-zA-Z0-9]. Beware: this character class may have unexpected behavior with non-ascii locales. For instance, this would match é.
Edited
If the regexes are perl-compatible (PCRE), you can use \s to match all whitespace. This expands to include spaces and other whitespace characters. If they're posix-compatible, use [:space:] character class (like so: [^a-zA-Z0-9[:space:]]). I would recommend using [:alnum:] instead of a-zA-Z0-9.
If you want to match the end of a line, you should include a $ at the end. Multiline mode makes ^ and $ match at the start and end of every line instead of only at the start and end of the whole input, so turn it on only when you need per-line anchoring.
Why don't you include a copy of sample input, the text you want to match, and the program you are using to do so?
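
As a small illustration of what multiline mode does to ^ and $, here is a Python sketch (the variable names are made up for this example; other engines spell the flag differently):

```python
import re

text = "one\ntwo"

# Without the flag, ^ and $ anchor to the whole string, so no single
# word spans from the very start to the very end.
without_flag = re.findall(r"^\w+$", text)

# With re.MULTILINE, ^ and $ also match at every line boundary.
with_flag = re.findall(r"^\w+$", text, re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['one', 'two']
```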
It's pretty simple; you just add ^ at the beginning of a character set to negate that character set.
For example, the following pattern will match everything that's not in that character set, i.e., not a lowercase ASCII letter or a digit:
[^a-z0-9]
As a side note, some of the more helpful Regular Expression resources I've found have been this site and this cheat sheet (C# specific).
Put ^ at the beginning of your character class expression: [^a-z0-9]
Put ^ at the start: [^a-zA-Z0-9]
In PHP you can use it in a condition with:
preg_match();
preg_replace();
eregi(); (removed in PHP 7)
Try this.
You can also use \W, which is shorthand for a non-word character (equal to [^a-zA-Z0-9_]).
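
The warning about é can be checked with a short Python sketch. It compares the explicit ASCII negation with a Unicode-aware alternative; [\W_] is used here as an assumption about what "not a letter or digit" should mean for non-ASCII text, and the variable names are made up for this example.

```python
import re

sample = "café 42!"

# Explicit ASCII ranges: the accented letter counts as "not a-z".
ascii_negated = re.findall(r"[^a-zA-Z0-9]", sample)

# Unicode-aware: \W already excludes é, and _ is added back explicitly.
unicode_negated = re.findall(r"[\W_]", sample)

print(ascii_negated)    # ['é', ' ', '!']
print(unicode_negated)  # [' ', '!']
```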

Regex to match only letters

How can I write a regex that matches only letters?
Use a character set: [a-zA-Z] matches one letter from A–Z in lowercase and uppercase. [a-zA-Z]+ matches one or more letters and ^[a-zA-Z]+$ matches only strings that consist of one or more letters only (^ and $ mark the begin and end of a string respectively).
If you want to match other letters than A–Z, you can either add them to the character set: [a-zA-ZäöüßÄÖÜ]. Or you use predefined character classes like the Unicode character property class \p{L} that describes the Unicode characters that are letters.
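
The difference between finding some letters and requiring only letters can be sketched in Python (the helper names first_letters and only_letters are made up for this example):

```python
import re

LETTERS = re.compile(r"[a-zA-Z]+")


def first_letters(text: str):
    """Return the first run of ASCII letters found anywhere in text, or None."""
    m = LETTERS.search(text)
    return m.group(0) if m else None


def only_letters(text: str) -> bool:
    """True only if the whole string is one or more ASCII letters."""
    return LETTERS.fullmatch(text) is not None


print(first_letters("abc123"))  # abc
print(only_letters("abc123"))   # False
print(only_letters("abc"))      # True
```

In Python, fullmatch is slightly stricter than wrapping the pattern in ^ and $, because $ also matches just before a trailing newline.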
\p{L} matches anything that is a Unicode letter if you're interested in alphabets beyond the Latin one
Depending on your meaning of "character":
[A-Za-z] - all letters (uppercase and lowercase)
[^0-9] - all non-digit characters
The closest option available is
[\u\l]+
which matches a sequence of uppercase and lowercase letters. However, it is not supported by all editors/languages, so it is probably safer to use
[a-zA-Z]+
as other users suggest
You would use
/[a-z]/gi
[]--checks for any characters between given inputs
a-z---covers the entire alphabet
g-----globally throughout the whole string
i-----getting upper and lowercase
Java:
String s = "abcdef";
if (s.matches("[a-zA-Z]+")) {
    System.out.println("string only contains letters");
}
In python, I have found the following to work:
[^\W\d_]
This works because we are creating a new character class (the []) which excludes (^) any character from the class \W (everything NOT in [a-zA-Z0-9_]), also excludes any digit (\d) and also excludes the underscore (_).
That is, we have taken the character class [a-zA-Z0-9_] and removed the 0-9 and _ bits. You might ask, wouldn't it just be easier to write [a-zA-Z] then, instead of [^\W\d_]? You would be correct if dealing only with ASCII text, but when dealing with unicode text:
\W
Matches any character which is not a word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_].
(from the Python re module documentation)
That is, we are taking everything considered to be a word character in unicode, removing everything considered to be a digit character in unicode, and also removing the underscore.
For example, the following code snippet
import re
regex = r"[^\W\d_]"  # raw string, so the backslashes reach the regex engine unchanged
test_string = "A;,./>>?()*)&^*&^%&^#Bsfa1 203974"
re.findall(regex, test_string)
Returns
['A', 'B', 's', 'f', 'a']
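
To see the Unicode behaviour described above, here is a further sketch with non-ASCII input, comparing the default mode with the re.ASCII flag (the sample text and constant names are made up for this example):

```python
import re

LETTER_RUNS = re.compile(r"[^\W\d_]+")
ASCII_LETTER_RUNS = re.compile(r"[^\W\d_]+", re.ASCII)

text = "Привет, мир 123 café_x"

# Default (Unicode) mode: Cyrillic and accented letters count as letters.
unicode_result = LETTER_RUNS.findall(text)
# ASCII mode: only a-z and A-Z survive, so é splits the word.
ascii_result = ASCII_LETTER_RUNS.findall(text)

print(unicode_result)  # ['Привет', 'мир', 'café', 'x']
print(ascii_result)    # ['caf', 'x']
```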
The regular expression that a few people have written as /^[a-zA-Z]$/i is not correct for this: without a quantifier it matches only a single-letter string, and the i flag (case-insensitive) is redundant because the class already lists both cases. Use the g flag (global) to find all matches instead of stopping at the first one, and drop ^ and $ if you want to find letters anywhere in the string.
/[a-zA-Z]+/g
[a-zA-Z]+ matches a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed
a-z a single character in the range between a and z (case sensitive)
A-Z a single character in the range between A and Z (case sensitive)
g modifier: global. All matches (don't return on first match)
/[a-zA-Z]+/
Super simple example. Regular expressions are extremely easy to find online.
http://www.regular-expressions.info/reference.html
For PHP, following will work fine
'/^[a-zA-Z]+$/'
Use character groups
\D
Matches any character except digits 0-9 (so spaces and punctuation match too)
^\D+$
Just use \w or [:alpha:]. Note that \w is a shorthand class for any character that can appear in a word, so it also matches digits and the underscore; [:alpha:] matches letters only.
So, I've been reading a lot of the answers, and most of them don't take exceptions into account, like letters with accents or diaeresis (á, à, ä, etc.).
I made a function in typescript that should be pretty much extrapolable to any language that can use RegExp. This is my personal implementation for my use case in TypeScript. What I basically did is add ranges of letters with each kind of symbol that I wanted to add. I also converted the char to upper case before applying the RegExp, which saves me some work.
function isLetter(char: string): boolean {
    return char.toUpperCase().match('[A-ZÀ-ÚÄ-Ü]+') !== null;
}
If you want to add another range of letters with another kind of accent, just add it to the regex. Same goes for special symbols.
I implemented this function with TDD and I can confirm this works with, at least, the following cases:
character | isLetter
${'A'} | ${true}
${'e'} | ${true}
${'Á'} | ${true}
${'ü'} | ${true}
${'ù'} | ${true}
${'û'} | ${true}
${'('} | ${false}
${'^'} | ${false}
${"'"} | ${false}
${'`'} | ${false}
${' '} | ${false}
If you mean any letters in any character encoding, then a good approach might be to delete non-letters like spaces \s, digits \d, and other special characters like:
[!##\$%\^&\*\(\)\[\]:;'",\. ...more special chars... ]
Or use negation of above negation to directly describe any letters:
\S \D and [^ ..special chars..]
Pros:
Works with all regex flavors.
Easy to write, sometimes save lots of time.
Cons:
Long, sometimes not perfect, but character encoding can be broken as well.
You can try this regular expression : [^\W\d_] or [a-zA-Z].
Lately I have used this pattern in my forms to check names of people, containing letters, blanks and special characters like accent marks.
pattern="[A-zÀ-ú\s]+"
JavaScript
If you want to return matched letters:
('Example 123').match(/[A-Z]/gi) // Result: ["E", "x", "a", "m", "p", "l", "e"]
If you want to replace matched letters with stars ('*') for example:
('Example 123').replace(/[A-Z]/gi, '*') // Result: "******* 123"
/^[A-z]+$/.test('asd')
// true
/^[A-z]+$/.test('asd0')
// false
/^[A-z]+$/.test('0asd')
// false
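
One caveat with the A-z range used in several patterns here: in ASCII, six non-letter characters sit between Z and a, so [A-z] accepts them too. A short Python sketch of the effect (the variable names are made up for this example):

```python
import re

# The ASCII characters that lie strictly between 'Z' (90) and 'a' (97).
between = [chr(code) for code in range(ord("Z") + 1, ord("a"))]
print(between)  # ['[', '\\', ']', '^', '_', '`']

# [A-z] lets those characters through; [A-Za-z] does not.
loose = re.fullmatch(r"[A-z]+", "a_b^c") is not None
strict = re.fullmatch(r"[A-Za-z]+", "a_b^c") is not None
print(loose, strict)  # True False
```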
Ruby:
pattern = /[a-zA-Z]/
puts "[a-zA-Z]: #{pattern.match('mine blossom')}"  # OK
puts "[a-zA-Z]: #{pattern.match('456')}"
puts "[a-zA-Z]: #{pattern.match('')}"
puts "[a-zA-Z]: #{pattern.match('#$%^&*')}"
puts "[a-zA-Z]: #{pattern.match('#$%^&*A')}"  # OK
Pattern pattern = Pattern.compile("^[a-zA-Z]+$");
if (pattern.matcher("a").find()) {
    // ...do something...
}