Regular Expression Arabic characters and numbers only - regex

I want a regular expression that accepts only Arabic characters, spaces, and numbers.
The numbers are not required to be Arabic numerals.
I found the following expression:
^[\u0621-\u064A]+$
which accepts only Arabic characters, while I need Arabic characters, spaces, and numbers.

Just add 0-9 (the ASCII digits) and a space to your character class:
^[\u0621-\u064A0-9 ]+$
Or add \u0660-\u0669, which is the range of Arabic-Indic digits, to the character class:
^[\u0621-\u064A\u0660-\u0669 ]+$
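As a rough sketch of how this might look in JavaScript, the two digit ranges can be combined into one class so that both ASCII and Arabic-Indic digits pass. The helper name isArabicWithDigits is made up for illustration:

```javascript
// Sketch: accept only Arabic letters, spaces, ASCII digits, and Arabic-Indic digits.
// U+0621-U+064A: basic Arabic letters; U+0660-U+0669: Arabic-Indic digits.
const arabicWithDigits = /^[\u0621-\u064A\u0660-\u06690-9 ]+$/;

// Hypothetical helper name, used only for this example.
function isArabicWithDigits(text) {
  return arabicWithDigits.test(text);
}

console.log(isArabicWithDigits("مرحبا 123")); // true
console.log(isArabicWithDigits("مرحبا ١٢٣")); // true
console.log(isArabicWithDigits("hello 123")); // false
```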

You can use:
^[\u0621-\u064A\s\p{N}]+$
\p{N} matches any Unicode numeric character (use \p{Nd} to restrict it to decimal digits).
To match only ASCII digits, use:
^[\u0621-\u064A\s0-9]+$
EDIT: It is better to use this regex, in flavors that support Unicode script properties:
^[\p{Arabic}\s\p{N}]+$
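In JavaScript the property syntax is slightly different: the regex needs the u flag, and a script is written as \p{Script=Arabic} rather than \p{Arabic}. A minimal sketch under those assumptions (the variable name is illustrative, and \p{Nd} is used to keep the match to decimal digits):

```javascript
// Sketch: Unicode property escapes in JavaScript require the "u" flag.
// \p{Script=Arabic} matches characters of the Arabic script,
// \p{Nd} matches decimal digits of any script (ASCII and Arabic-Indic included).
const arabicSpacesDigits = /^[\p{Script=Arabic}\s\p{Nd}]+$/u;

console.log(arabicSpacesDigits.test("مرحبا 123")); // true
console.log(arabicSpacesDigits.test("مرحبا ١٢٣")); // true
console.log(arabicSpacesDigits.test("hello 123")); // false
```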

You can use
[ء-ي]
It worked for me in JavaScript, in jQuery form.validate rules.
In my example, I want to force the user to enter 3 characters:
[a-zA-Zء-ي]

Use this:
[\u0600-\u06FF]
It worked for me in Visual Studio.

After a lot of trial and error, I got this for Persian names:
[گچپژیلفقهمو ء-ي]+$

^[\u0621-\u064Aa-zA-Z\d\-_\s]+$
This regex accepts Arabic letters, English letters, digits, hyphens, underscores, and whitespace.

Simple, use this code:
^[؀-ۿ]+$
This works for Arabic and Persian text, including digits.

function HasArabicCharacters(text)
{
// Returns true if the text contains at least one character from the Arabic blocks
var regex = new RegExp(
"[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufc3f]|[\ufe70-\ufefc]");
return regex.test(text);
}

To allow Arabic and English letters with a minimum and maximum number of characters in a field, try this (tested):
^[\u0621-\u064A\u0660-\u0669a-zA-Z\-_\s]{4,35}$
A- Arabic and English letters are allowed, as are hyphens, underscores, and whitespace.
B- ASCII digits are not allowed (the Arabic-Indic digits \u0660-\u0669 are).
C- {4,35} sets the minimum and maximum number of characters allowed.
Update: On submit, English words with spaces were accepted, but Arabic words with spaces could not be submitted!
All cases tested
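To see how the {4,35} bound behaves, here is a small JavaScript sketch using the same character class. The variable name boundedName is only for illustration, and the behaviour inside a particular form library may differ:

```javascript
// Sketch: the quantifier {4,35} limits the total length of the whole input,
// because the class is anchored with ^ and $.
const boundedName = /^[\u0621-\u064A\u0660-\u0669a-zA-Z\-_\s]{4,35}$/;

console.log(boundedName.test("abc"));            // false: only 3 characters
console.log(boundedName.test("abcd"));           // true: 4 characters
console.log(boundedName.test("محمد علي"));       // true: Arabic letters and a space
console.log(boundedName.test("abc1"));           // false: ASCII digits are not in the class
console.log(boundedName.test("a".repeat(36)));   // false: longer than 35 characters
```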

Regex for Arabic letters with English and Arabic numbers only
function HasArabicEnglishNumbers(text)
{
// A single character class, so that ^ and $ apply to the whole pattern
var regex = new RegExp(
"^[\u0621-\u064A0-9\u0660-\u0669]+$");
return regex.test(text);
}

@Pattern(regexp = "^[\\p{InArabic}\\s]+$")
Accepts Arabic digits and characters (and whitespace).

This one allows Arabic letters, Arabic numbers, and English numbers:
var arabic = RegExp("^[\u0621-\u064A\u0660-\u0669 0-9]+\$");

In PHP, use this:
preg_replace("/\p{Arabic}/u", 'x', 'abc123ابت'); // replaces each Arabic letter with "x"
Note: for \p{Arabic} to match Arabic letters, you need to pass the u modifier (for Unicode) at the end.

The answers above match much more than Arabic (MSA) characters: they also include Persian, Urdu, Quranic symbols, and some other symbols. The Arabic MSA characters are only the following (see Arabic Unicode):
[\u0621-\u063A\u0641-\u0652]
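A quick way to see the difference is to test a letter that is in the Arabic block but not in this narrower set, such as the Persian letter peh (U+067E). A small JavaScript sketch, with illustrative variable names:

```javascript
// Sketch: compare the whole Arabic block with the narrower MSA-only class.
const arabicBlock = /^[\u0600-\u06FF]+$/;
const msaOnly = /^[\u0621-\u063A\u0641-\u0652]+$/;

const peh = "\u067E"; // Persian letter peh, not used in MSA

console.log(arabicBlock.test(peh));  // true: U+067E is inside U+0600-U+06FF
console.log(msaOnly.test(peh));      // false: outside both MSA ranges
console.log(msaOnly.test("عربي"));   // true: all four letters are in the MSA ranges
```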

I always use these to control user input in my apps:
public static Regex IntegerString => new(@"^[\s\da-zA-Zء-ي]+[^\.]*$");
public static Regex String => new(@"^[\sa-zA-Zء-ي]*$");
public static Regex Email => new(@"^[\d\@\.a-z]*$");
public static Regex Phone => new(@"^[\d\s\(\)\-\+]+[^\.]*$");
public static Regex Address => new(@"^[\s\d\.\,\،\-a-zA-Zء-ي]*$");
public static Regex Integer => new(@"^[\d]+[^\.]*$");
public static Regex Double => new(@"^[\d\.]*$");

This is a useful example:
public class Test {
public static void main(String[] args) {
String thai = "1ประเทศไทย1ประเทศไทย";
String arabic = "1عربي1عربي";
//correct inputs
System.out.println(thai.matches("[[0-9]*\\p{In" + Character.UnicodeBlock.THAI.toString() + "}*]*"));
System.out.println(arabic.matches("[[0-9]*\\p{In" + Character.UnicodeBlock.ARABIC.toString() + "}*]*"));
//incorrect inputs
System.out.println(arabic.matches("[[0-9]*\\p{In" + Character.UnicodeBlock.THAI.toString() + "}*]*"));
System.out.println(thai.matches("[[0-9]*\\p{In" + Character.UnicodeBlock.ARABIC.toString() + "}*]*"));
}
}

[\p{IsArabic}-[\D]]
This uses .NET character class subtraction: an Arabic character that is not a non-digit, i.e. an Arabic digit.

Related

I want to remove symbols from a string in dart

I want to remove all symbols, keeping only the characters (Japanese hiragana, kanji, and Roman alphabet) that match this regex.
var reg = RegExp(
r'([\u3040-\u309F]|\u3000|[\u30A1-\u30FC]|[\u4E00-\u9FFF]|[a-zA-Z]|[々〇〻])');
I don't know what to put in place of the "?".
text=text.replaceAll(?,"");
a="「私は、アメリカに行きました。」、'I went to the United States.'"
b="私はアメリカに行きましたI went to the United States"
I want to make a into b.
You can use
String a = "「私は、アメリカに行きました。」、'I went to the United States.'";
a = a.replaceAll(RegExp(r'[^\p{L}\p{M}\p{N}\s]+', unicode: true), '');
Also, if you just want to remove any punctuation or math symbols, you can use
.replaceAll(RegExp(r'[\p{P}\p{S}]+', unicode: true), '')
Output:
私はアメリカに行きましたI went to the United States
The [^\p{L}\p{M}\p{N}\s]+ regex matches one or more chars other than letters (\p{L}), diacritics (\p{M}), digits (\p{N}) and whitespace chars (\s).
The [\p{P}\p{S}]+ regex matches one or more punctuation (\p{P}) or symbol (\p{S}) chars.
The unicode: true argument enables Unicode property class support in the regex.
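The same two patterns carry over to JavaScript nearly unchanged, where the u flag plays the role of unicode: true. A minimal sketch for comparison, using the sample string from the question (variable names are illustrative):

```javascript
// Sketch: strip everything except letters, marks, numbers, and whitespace.
const a = "「私は、アメリカに行きました。」、'I went to the United States.'";

// Negated class: remove any run of characters that are NOT L, M, N, or whitespace.
const lettersOnly = a.replace(/[^\p{L}\p{M}\p{N}\s]+/gu, "");

// Alternative: remove only punctuation (P) and symbols (S).
const noPunctuation = a.replace(/[\p{P}\p{S}]+/gu, "");

console.log(lettersOnly);   // 私はアメリカに行きましたI went to the United States
console.log(noPunctuation); // 私はアメリカに行きましたI went to the United States
```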
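You need to specify the pattern (RegEx) you want to apply in your replaceAll method. Because replaceAll removes whatever the pattern matches, negate the character class so that it matches everything you do not want to keep:
// Creating the regEx/Pattern: any character outside the allowed set (the ASCII space is kept too)
var reg = RegExp(r'[^\u3040-\u309F\u3000\u30A1-\u30FC\u4E00-\u9FFFa-zA-Z々〇〻 ]');
// Applying it to your text.
text=text.replaceAll(reg,"");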
You can learn more about it here:
https://api.flutter.dev/flutter/dart-core/String/replaceAll.html

Powershell find non printing characters [duplicate]

I would appreciate your help with this, since I do not know which range of characters to use, or whether there is a character class like [[:cntrl:]], which I found in Ruby.
By non-printable, I mean all characters that are not shown in the output when one prints the input string; I want to delete them. Please note that I am looking for a C# regex; I do not have a problem with my code.
You may remove all control and other non-printable characters with
s = Regex.Replace(s, @"\p{C}+", string.Empty);
The \p{C} Unicode category class matches all control characters, even those outside the ASCII table because in .NET, Unicode category classes are Unicode-aware by default.
Breaking it down into subcategories
To only match basic control characters you may use \p{Cc}+, see 65 chars in the Other, Control Unicode category. It is equal to a [\u0000-\u0008\u000E-\u001F\u007F-\u0084\u0086-\u009F \u0009-\u000D \u0085]+ regex.
To only match 161 other format chars including the well-known soft hyphen (\u00AD), zero-width space (\u200B), zero-width non-joiner (\u200C), zero-width joiner (\u200D), left-to-right mark (\u200E) and right-to-left mark (\u200F) use \p{Cf}+. The equivalent including astral plane code points is a (?:[\xAD\u0600-\u0605\u061C\u06DD\u070F\u08E2\u180E\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]|\uD804[\uDCBD\uDCCD]|\uD80D[\uDC30-\uDC38]|\uD82F[\uDCA0-\uDCA3]|\uD834[\uDD73-\uDD7A]|\uDB40[\uDC01\uDC20-\uDC7F])+ regex.
To match 137,468 Other, Private Use control code points you may use \p{Co}+, or its equivalent including astral plane code points, (?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])+.
To match 2,048 Other, Surrogate code points that include some emojis, you may use \p{Cs}+, or [\uD800-\uDFFF]+ regex.
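To make the difference between the whole \p{C} category and the \p{Cc} subcategory concrete, here is a small sketch in JavaScript rather than C# (the property names are the same; JavaScript needs the u flag). The sample string and variable names are made up for illustration:

```javascript
// Sketch: \p{C} removes every "Other" character, \p{Cc} only control characters.
// The sample contains a zero-width space (U+200B, category Cf)
// and a BEL control character (U+0007, category Cc).
const sample = "a\u200Bb\u0007c";

const withoutOther = sample.replace(/\p{C}+/gu, "");
const withoutControl = sample.replace(/\p{Cc}+/gu, "");

console.log(JSON.stringify(withoutOther)); // "abc"
console.log(withoutControl.length);        // 4: the zero-width space is still there
```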
You can try:
string s = "Täkörgåsmrgås";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
Updated answer after comments:
Documentation about non-printable character:
https://en.wikipedia.org/wiki/Control_character
Char.IsControl Method:
https://msdn.microsoft.com/en-us/library/system.char.iscontrol.aspx
Maybe you can try:
string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());
To remove all control and other non-printable characters:
Regex.Replace(s, @"\p{C}+", String.Empty);
To remove the control characters only (if you don't want to remove the emojis 😎):
Regex.Replace(s, @"\p{Cc}+", String.Empty);
you can try this:
public static string TrimNonAscii(this string value)
{
string pattern = "[^ -~]*";
Regex reg_exp = new Regex(pattern);
return reg_exp.Replace(value, "");
}

Using regex to match certain text

I have been looking for an answer to this for a while, with no luck (sorry if I couldn't describe it well). I am still a newbie with regex. I am trying to match a string containing only numbers and a certain delimiter. For example, the pattern would be 8/16/32/64/...: the numbers are separated by '/', and there can be an arbitrary number of them. I couldn't find a way to match them.
My attempt is \d+/\d+? but I couldn't get it to work.
You could remove the '/' delimiter and then test for the existence of a number
Here is some C# as an example:
static void Main(string[] args)
{
string text = "8/16/32/64/";
Console.WriteLine(text);
TestForNum(text);
text = "8/16/32/64/b";
Console.WriteLine(text);
TestForNum(text);
Console.ReadKey();
}
private static void TestForNum(string text)
{
string tmp = Regex.Replace(text, @"/", "");
Match m = Regex.Match(tmp, @"^\d+$");
if(m.Success)
{
Console.WriteLine("\t" + m.Groups[0]);
}
else Console.WriteLine("\tno match");
}
A naive approach would be
[\d/]+
However, this does match //// as well as just 12345. To match only "proper" strings:
\d+(/\d+)+
Reads digits followed by delimiter+digits repeated at least once. If trailing/leading delimiters are allowed, then
/?(\d+/)+\d*
If you're using a flavor that uses slashes to quote the regex (like javascript), you'll need to escape them:
/\d+(\/\d+)+/
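For validating a whole string rather than finding a match somewhere inside it, the patterns also need ^ and $ anchors. A minimal JavaScript sketch, where the names strict and lenient are only for illustration:

```javascript
// Sketch: anchored versions of the two patterns above.
// strict: digits separated by single slashes, no leading or trailing slash.
const strict = /^\d+(\/\d+)+$/;
// lenient: the same, but an optional leading and trailing slash is allowed.
const lenient = /^\/?(\d+\/)+\d*$/;

console.log(strict.test("8/16/32/64"));     // true
console.log(strict.test("8/16/32/64/"));    // false: trailing slash
console.log(strict.test("////"));           // false
console.log(strict.test("12345"));          // false: no delimiter at all
console.log(lenient.test("8/16/32/64/"));   // true
console.log(lenient.test("8/16/32/64/b"));  // false
```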
You can do:
(\d+)(\D|$)
That will split a list of digits delimited by any non-digit, so 1?2!3.4 would match.
If you want a specific delimiter, such as /:
(\d+)(?:/|$)
As simple as possible:
(\d+\/?)+
Every digit followed by [a] slash, as many as possible. You may use g flag for all matches.

regex to count english words as single char inside char count of asian words

need some help from a regex jedi master:
If I have a string of mb chars (specifically, Japanese, Korean or Chinese) with English words sprinkled throughout, I would like to count:
asian characters as 1 per single char
english "words" (no dictionary check needed - just a string of consecutive english letters) as a single char.
English only is fine - don't worry about special spanish, swedish, etc. chars.
I am searching for a regex pattern I can use to count these strings, that will function in php and js.
Example:
これは猫です、けどKittyも大丈夫。
should count as 13 chars.
thanks for your help!
jeff
Whatever you are trying to achieve, this will help you:
To count only Hiragana+Katakana+Kanji (Japanese) Chars (excluding punctuation marks):
var x = "これは猫です、けどKittyも大丈夫。";
x.match(/[ぁ-ゖァ-ヺー一-龯々]/g).length; //Result: 12 : これは猫ですけども大丈夫
Updated:
To count only words in Alphabet:
x.match(/\w+/g).length; //Result: 1 : "Kitty"
All in one line (as function):
function myCount(str) {
return str.match(/[ぁ-ゖァ-ヺー一-龯々]|\w+/g).length;
}
alert(myCount("これは猫です、けどKittyも大丈夫。")); //13
alert(myCount("これは犬です。DogとPuppyもOKですね!")); //14
These are the arrays resulting from match:
["こ", "れ", "は", "猫", "で", "す", "け", "ど", "Kitty", "も", "大", "丈", "夫"]
["こ", "れ", "は", "犬", "で", "す", "Dog", "と", "Puppy", "も", "OK", "で", "す", "ね"]
Updated (JAP, KOR, CH):
function myCount(str) {
return str.match(/[ぁ-ㆌㇰ-䶵一-鿃々가-힣-豈ヲ-ン]|\w+/g).length;
}
These will cover around 99% of the Japanese, Chinese and Korean. You may need to manually add extra characters that are not included such as "〶".
A very good reference is:
http://www.tamasoft.co.jp/en/general-info/unicode.html
This should solve your question.
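In newer JavaScript engines, the hand-written ranges could probably be replaced with Unicode script properties and the u flag. This is only a sketch of that idea, not the approach above: the function name countUnits is hypothetical, [A-Za-z]+ is used instead of \w+ so that digits and underscores are not counted as words, and a few characters such as the long vowel mark "ー" have Script=Common and would need to be added by hand.

```javascript
// Sketch: count each CJK character as 1 and each run of English letters as 1.
function countUnits(str) {
  const pattern =
    /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]|[A-Za-z]+/gu;
  const units = str.match(pattern);
  // match() returns null when nothing matches, so guard against that.
  return units === null ? 0 : units.length;
}

console.log(countUnits("これは猫です、けどKittyも大丈夫。"));    // 13
console.log(countUnits("これは犬です。DogとPuppyもOKですね!")); // 14
```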
OK, so I would do two runs: First count the occurrences of the English words and then of the Asian ones. This is a JS example, it might be different in PHP. In JS, only ASCII chars match \w.
string = "これは猫です、けどKittyも大丈夫";
var m = string.match(/\w+/gm);
var e_count = m.length; // is 1
Next count the Asian chars.
m = string.match(/([^\w\s\d])/gm); // any non-whitespace, non-word, non-digit chars
var a_count = m.length; // is 13
You might have to tweak it a bit. But in JS, you can add up e_count and a_count, and you should be good to go.
Also check out Rubular: http://www.rubular.com
Johannes
Something like /[[:ascii:]]+|./ will match one non-ASCII character or one or more ASCII characters. The problem is that it will give 15. So it seems that you want to ignore punctuation. So possibly: /[A-Za-z]+|[^[:punct:]]/
$ perl -E 'use utf8; $f = "これは猫です、けどKittyも大丈夫。"; ++$c while $f =~ /[A-Za-z]+|[^[:punct:]]/g; say $c'
13
So, that works in Perl at least. Probably in JS and PHP as well, provided their [[:punct:]] understands Unicode.
The alternative approach is to filter out stuff instead.

Regular expression for alphanumeric and underscores

Is there a regular expression which checks if a string contains only upper and lowercase letters, numbers, and underscores?
To match a string that contains only those characters (or an empty string), try
"^[a-zA-Z0-9_]*$"
This works for .NET regular expressions, and probably a lot of other languages as well.
Breaking it down:
^ : start of string
[ : beginning of character group
a-z : any lowercase letter
A-Z : any uppercase letter
0-9 : any digit
_ : underscore
] : end of character group
* : zero or more of the given characters
$ : end of string
If you don't want to allow empty strings, use + instead of *.
As others have pointed out, some regex languages have a shorthand form for [a-zA-Z0-9_]. In the .NET regex language, you can turn on ECMAScript behavior and use \w as a shorthand (yielding ^\w*$ or ^\w+$). Note that in other languages, and by default in .NET, \w is somewhat broader, and will match other sorts of Unicode characters as well (thanks to Jan for pointing this out). So if you're really intending to match only those characters, using the explicit (longer) form is probably best.
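The difference between * and + is easiest to see on the empty string. A short JavaScript sketch of the explicit form (variable names are illustrative):

```javascript
// Sketch: the same character class with "zero or more" and "one or more".
const allowEmpty = /^[a-zA-Z0-9_]*$/;
const requireOne = /^[a-zA-Z0-9_]+$/;

console.log(allowEmpty.test(""));               // true: * accepts the empty string
console.log(requireOne.test(""));               // false: + needs at least one character
console.log(requireOne.test("nut_cracker_12")); // true
console.log(requireOne.test("nut-cracker"));    // false: "-" is not in the class
```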
There's a lot of verbosity in here, and I'm deeply against it, so, my conclusive answer would be:
/^\w+$/
\w is equivalent to [A-Za-z0-9_], which is pretty much what you want (unless we introduce Unicode to the mix).
Using the + quantifier you'll match one or more characters. If you want to accept an empty string too, use * instead.
You want to check that each character matches your requirements, which is why we use:
[A-Za-z0-9_]
And you can even use the shorthand version:
\w
Which is equivalent (in some regex flavors, so make sure you check before you use it). Then to indicate that the entire string must match, you use:
^
To indicate the string must start with that character, then use
$
To indicate the string must end with that character. Then use
\w+ or \w*
To indicate "1 or more", or "0 or more". Putting it all together, we have:
^\w*$
Although it's more verbose than \w, I personally appreciate the readability of the full POSIX character class names ( http://www.zytrax.com/tech/web/regex.htm#special ), so I'd say:
^[[:alnum:]_]+$
However, while the documentation at the above links states that \w will "Match any character in the range 0 - 9, A - Z and a - z (equivalent of POSIX [:alnum:])", I have not found this to be true. Not with grep -P anyway. You need to explicitly include the underscore if you use [:alnum:] but not if you use \w. You can't beat the following for short and sweet:
^\w+$
Along with readability, using the POSIX character classes (http://www.regular-expressions.info/posixbrackets.html) means that your regex can work on non ASCII strings, which the range based regexes won't do since they rely on the underlying ordering of the ASCII characters which may be different from other character sets and will therefore exclude some non-ASCII characters (letters such as œ) which you might want to capture.
Um...question: Does it need to have at least one character or no? Can it be an empty string?
^[A-Za-z0-9_]+$
Will do at least one upper or lower case alphanumeric or underscore. If it can be zero length, then just substitute the + for *:
^[A-Za-z0-9_]*$
If diacritics need to be included (such as cedilla - ç) then you would need to use the word character which does the same as the above, but includes the diacritic characters:
^\w+$
Or
^\w*$
Use
^([A-Za-z]|[0-9]|_)+$
...if you want to be explicit, or:
^\w+$
...if you prefer concise (Perl syntax).
In computer science, an alphanumeric identifier often may not start with a digit: the first character must be a letter or an underscore. After that, each character can be 0-9, A-Z, a-z, or underscore (_).
Here is how you would do that:
Tested under PHP:
$regex = '/^[A-Za-z_][A-Za-z\d_]*$/'
Or take
^[A-Za-z_][A-Za-z\d_]*$
and place it in your development language.
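For example, placed in JavaScript, the same pattern might be used like this (a sketch; the variable name identifier is just for illustration):

```javascript
// Sketch: first character must be a letter or underscore,
// the rest may be letters, digits, or underscores.
const identifier = /^[A-Za-z_][A-Za-z\d_]*$/;

console.log(identifier.test("_tmp1")); // true
console.log(identifier.test("abc1"));  // true
console.log(identifier.test("1abc"));  // false: starts with a digit
console.log(identifier.test(""));      // false: at least one character is required
```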
Use lookaheads to do the "at least one" stuff. Trust me, it's much easier.
Here's an example that would require 1-10 characters, containing at least one digit and one letter:
^(?=.*\d)(?=.*[A-Za-z])[A-Za-z0-9]{1,10}$
Note: I could have used \w, but then ECMA/Unicode considerations come into play, increasing the character coverage of the \w "word character".
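A quick JavaScript sketch of how the two lookaheads behave (the variable name is illustrative). Note that although the quantifier says {1,10}, the lookaheads together mean the shortest accepted string has two characters:

```javascript
// Sketch: each lookahead checks a requirement without consuming input;
// the final class then validates the allowed characters and the length.
const letterAndDigit = /^(?=.*\d)(?=.*[A-Za-z])[A-Za-z0-9]{1,10}$/;

console.log(letterAndDigit.test("abc123"));      // true
console.log(letterAndDigit.test("a1"));          // true
console.log(letterAndDigit.test("abcdef"));      // false: no digit
console.log(letterAndDigit.test("123456"));      // false: no letter
console.log(letterAndDigit.test("abcdefghij1")); // false: 11 characters
```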
This works for me. I found this in O'Reilly's "Mastering Regular Expressions":
/^\w+$/
Explanation:
^ asserts position at start of the string
\w+ matches any word character (equal to [a-zA-Z0-9_])
"+" Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string
Verify yourself:
const regex = /^\w+$/;
const str = `nut_cracker_12`;
let m;
if ((m = regex.exec(str)) !== null) {
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Try these multi-lingual extensions I have made for string.
IsAlphaNumeric - The string must contain at least one alpha (letter in Unicode range, specified in charSet) and at least one number (specified in numSet). Also, the string should consist only of alpha and numbers.
IsAlpha - The string should contain at least one alpha (in the language charSet specified) and consist only of alpha.
IsNumeric - The string should contain at least one number (in the language numSet specified) and consist only of numbers.
The charSet/numSet range for the desired language can be specified. The Unicode ranges are available on Unicode Chart.
API:
public static bool IsAlphaNumeric(this string stringToTest)
{
// English
const string charSet = "a-zA-Z";
const string numSet = @"0-9";
// Greek
//const string charSet = @"\u0388-\u03EF";
//const string numSet = @"0-9";
// Bengali
//const string charSet = @"\u0985-\u09E3";
//const string numSet = @"\u09E6-\u09EF";
// Hindi
//const string charSet = @"\u0905-\u0963";
//const string numSet = @"\u0966-\u096F";
return Regex.Match(stringToTest, @"^(?=[" + numSet + @"]*?[" + charSet + @"]+)(?=[" + charSet + @"]*?[" + numSet + @"]+)[" + charSet + numSet + @"]+$").Success;
}
public static bool IsNumeric(this string stringToTest)
{
//English
const string numSet = @"0-9";
//Hindi
//const string numSet = @"\u0966-\u096F";
return Regex.Match(stringToTest, @"^[" + numSet + @"]+$").Success;
}
public static bool IsAlpha(this string stringToTest)
{
//English
const string charSet = "a-zA-Z";
return Regex.Match(stringToTest, @"^[" + charSet + @"]+$").Success;
}
Usage:
// English
string test = "AASD121asf";
// Greek
//string test = "Ϡϛβ123";
// Bengali
//string test = "শর৩৮";
// Hindi
//string test = @"क़लम३७ख़";
bool isAlphaNum = test.IsAlphaNumeric();
The following regex matches alphanumeric characters and underscore:
^[a-zA-Z0-9_]+$
For example, in Perl:
#!/usr/bin/perl -w
my $arg1 = $ARGV[0];
# Check that the string contains *only* one or more alphanumeric chars or underscores
if ($arg1 !~ /^[a-zA-Z0-9_]+$/) {
print "Failed.\n";
} else {
print "Success.\n";
}
This should work in most of the cases.
/^[\d]*[a-z_][a-z\d_]*$/gi
And by most I mean,
abcd True
abcd12 True
ab12cd True
12abcd True
1234 False
Explanation
^ ... $ - anchor the pattern to the start and end of the string
[\d]* - match zero or more digits
[a-z_] - match a letter or underscore
[a-z\d_]* - match zero or more letters, digits, or underscores
/gi - g matches globally across the string, and i makes the match case-insensitive
For those of you looking for unicode alphanumeric matching, you might want to do something like:
^[\p{L} \p{Nd}_]+$
Further reading is at Unicode Regular Expressions (Unicode Consortium) and at Unicode Regular Expressions (Regular-Expressions.info).
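In JavaScript, this pattern needs the u flag for \p{L} and \p{Nd} to be recognised. A minimal sketch, with test strings written as escapes so the code points are unambiguous (the variable names are illustrative):

```javascript
// Sketch: Unicode-aware "alphanumeric, space, or underscore" check.
const unicodeAlnum = /^[\p{L} \p{Nd}_]+$/u;
// For comparison: \w in JavaScript is ASCII-only.
const asciiWord = /^\w+$/;

const word = "cr\u00E8me_\u0663"; // "crème_" followed by ARABIC-INDIC DIGIT THREE

console.log(unicodeAlnum.test(word));  // true
console.log(asciiWord.test(word));     // false: è and the Arabic-Indic digit are not ASCII
console.log(unicodeAlnum.test("a-b")); // false: "-" is not a letter, digit, space, or underscore
```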
For me there was an issue in that I want to distinguish between alpha, numeric and alpha numeric, so to ensure an alphanumeric string contains at least one alpha and at least one numeric, I used :
^([a-zA-Z_]{1,}\d{1,})+|(\d{1,}[a-zA-Z_]{1,})+$
Here is the regex for what you want, with a quantifier to specify at least 1 character and no more than 255 characters:
^[a-zA-Z0-9_]{1,255}$
I believe you are not taking accented Latin and other Unicode characters into account in your matches.
For example, if you need to accept "ã" or "ü" chars, "\w" won't work in flavors where it is ASCII-only.
You can, alternatively, use this approach:
^[A-ZÀ-Ýa-zà-ý0-9_]+$
^\w*$ will work for the below combinations:
1
123
1av
pRo
av1
For Java, only case-insensitive alphanumeric characters and underscores are allowed.
^ Anchors the match at the start of the string.
[a-zA-Z0-9_]+ Matches one or more alphanumeric characters or underscores.
$ Anchors the match at the end of the string.
public class RegExTest {
public static void main(String[] args) {
System.out.println("_C#".matches("^[a-zA-Z0-9_]+$"));
}
}
To check the entire string and not allow empty strings, try
^[A-Za-z0-9_]+$
This works for me. You can try:
[\\p{Alnum}_]
Required Format
Allow these three:
0142171547295
014-2171547295
123abc
Don't allow other formats:
validatePnrAndTicketNumber(){
let alphaNumericRegex=/^[a-zA-Z0-9]*$/;
let numericRegex=/^[0-9]*$/;
let numericdashRegex=/^(([0-9]{3})\-?([0-9]{10}))$/;
this.currBookingRefValue = this.requestForm.controls["bookingReference"].value;
if(this.currBookingRefValue.length == 14 && this.currBookingRefValue.match(numericdashRegex)){
this.requestForm.controls["bookingReference"].setErrors({'pattern': false});
}else if(this.currBookingRefValue.length ==6 && this.currBookingRefValue.match(alphaNumericRegex)){
this.requestForm.controls["bookingReference"].setErrors({'pattern': false});
}else if(this.currBookingRefValue.length ==13 && this.currBookingRefValue.match(numericRegex) ){
this.requestForm.controls["bookingReference"].setErrors({'pattern': false});
}else{
this.requestForm.controls["bookingReference"].setErrors({'pattern': true});
}
}
<input name="booking_reference" type="text" [class.input-not-empty]="bookingRef.value"
class="glyph-input form-control floating-label-input" id="bookings_bookingReference"
value="" maxlength="14" aria-required="true" role="textbox" #bookingRef
formControlName="bookingReference" (focus)="resetMessageField()" (blur)="validatePnrAndTicketNumber()"/>
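The three allowed formats could also be checked with one anchored pattern per format, which removes the need for separate length checks. A minimal sketch, assuming the formats are exactly 13 digits, 3 digits plus a dash plus 10 digits, or 6 alphanumeric characters; isValidBookingReference is a hypothetical helper name and is not part of the component above:

```javascript
// Sketch: one anchored regex per allowed format, with the length built in.
function isValidBookingReference(value) {
  const thirteenDigits = /^[0-9]{13}$/;          // e.g. 0142171547295
  const dashedTicket = /^[0-9]{3}-[0-9]{10}$/;   // e.g. 014-2171547295
  const sixAlphanumeric = /^[a-zA-Z0-9]{6}$/;    // e.g. 123abc
  return (
    thirteenDigits.test(value) ||
    dashedTicket.test(value) ||
    sixAlphanumeric.test(value)
  );
}

console.log(isValidBookingReference("0142171547295"));  // true
console.log(isValidBookingReference("014-2171547295")); // true
console.log(isValidBookingReference("123abc"));         // true
console.log(isValidBookingReference("12-34"));          // false
```

The boolean result could then be mapped onto the form control's error state in whatever way the form library expects.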