Dart Regex does not match whole word for Arabic text - regex

This pattern works fine in Java and JavaScript but does not seem to work in Dart. Any help is appreciated.
void main() {
String englishText = "The new nature will not find rest";
String englishFind = "Nature";
RegExp englishExp = new RegExp("\\b$englishFind\\b", unicode:true, caseSensitive:false);
bool englishResult = englishExp.hasMatch(englishText);//matches
print(englishResult); //true
String arabicText = "لن تجد الطبيعة الجديدة راحتها";
String arabicFind="الطبيعة";
RegExp arabicExp = new RegExp("\\b$arabicFind\\b", unicode:true);
bool arabicResult = arabicExp.hasMatch(arabicText);//does not match
print(arabicResult);//false
}

The \b word boundary still only recognizes ASCII word characters, even when you set unicode:true. The main point of that flag is to make sure "UTF-16 surrogate pairs in the original string will be treated as a single code point and will not match separately".
You may "decompose" the word boundary and add Arabic letter and digit ranges to the class:
String arabicText = "لن تجد الطبيعة الجديدة راحتها";
String arabicFind="الطبيعة";
RegExp arabicExp = new RegExp("(?:^|[^a-zA-Z0-9_\\u06F0-\\u06F9\\u0622\\u0627\\u0628\\u067E\\u062A-\\u062C\\u0686\\u062D-\\u0632\\u0698\\u0633-\\u063A\\u0641\\u0642\\u06A9\\u06AF\\u0644-\\u0648\\u06CC\\u202C\\u064B\\u064C\\u064E-\\u0652])$arabicFind(?![a-zA-Z0-9_\\u06F0-\\u06F9\\u0622\\u0627\\u0628\\u067E\\u062A-\\u062C\\u0686\\u062D-\\u0632\\u0698\\u0633-\\u063A\\u0641\\u0642\\u06A9\\u06AF\\u0644-\\u0648\\u06CC\\u202C\\u064B\\u064C\\u064E-\\u0652])", unicode:true);
bool arabicResult = arabicExp.hasMatch(arabicText); // now matches
print(arabicResult); // => true
The regex will match an $arabicFind word when it is
(?:^|[^a-zA-Z0-9_\u06F0-\u06F9\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC\u202C\u064B\u064C\u064E-\u0652]) - preceded with start of string (^) or (|) any char but ASCII letter, digit or _ and Farsi letters or digits
(?![a-zA-Z0-9_\u06F0-\u06F9\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC\u202C\u064B\u064C\u064E-\u0652]) - not followed with an ASCII letter, digit or _ and Farsi letters or digits.
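Where the engine supports Unicode property escapes and lookbehind, the explicit ranges can be swapped for \p{L}, \p{M} and \p{N}. Below is a minimal sketch in JavaScript, whose regex syntax Dart's RegExp is documented to share. The helper names hasWholeWord and escapeRegExp are hypothetical, and treating letters, combining marks, digits and _ as word characters is an assumption, not part of the answer above:

```javascript
// Escape regex syntax characters so the search word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Assumption: a "word character" is any Unicode letter, combining mark,
// digit, or underscore. The two lookarounds stand in for \b on each side.
function hasWholeWord(text, word) {
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  const pattern = `(?<!${wordChar})${escapeRegExp(word)}(?!${wordChar})`;
  return new RegExp(pattern, "u").test(text);
}

const arabicText = "لن تجد الطبيعة الجديدة راحتها";
console.log(hasWholeWord(arabicText, "الطبيعة")); // true: whole word
console.log(hasWholeWord(arabicText, "الطبيع")); // false: only a prefix of a word
```

The search word is escaped before interpolation so that characters such as . or ( in user input are not interpreted as regex syntax.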

Related

Regex: Last Occurrence of a Repeating Character

So, I am looking for a Regex that is able to match with every maximal non-empty substring of consonants followed by a maximal non-empty substring of vowels in a String
e.g. In the following strings, you can see all expected matches:
"zcdbadaerfe" = {"zcdba", "dae", "rfe"}
"foubsyudba" = {"fou", "bsyu", "dba"}
I am very close! This is the regex I have managed to come up with so far:
([^aeiou].*?[aeiou])+
It returns the expected matches except for it only returns the first of any repeating lengths of vowels, for example:
String: "cccaaabbee"
Expected Matches: {"cccaaa", "bbee"}
Actual Matches: {"ccca", "bbe"}
I want to figure out how I can include the last vowel character that comes before (a) a consonant or (b) the end of the string.
Thanks! :-)
Your pattern is slightly off. I suggest using this version:
[b-df-hj-np-tv-z]+[aeiou]+
This pattern says to match:
[b-df-hj-np-tv-z]+ a lowercase non vowel, one or more times
[aeiou]+ followed by a lowercase vowel, one or more times
Here is a working demo.
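A quick way to check this pattern against the strings from the question is sketched below. The helper name consonantVowelRuns is hypothetical, and the pattern only handles lowercase ASCII letters:

```javascript
// Hypothetical helper: returns every maximal run of consonants that is
// followed by a maximal run of vowels (lowercase ASCII letters only).
function consonantVowelRuns(str) {
  return str.match(/[b-df-hj-np-tv-z]+[aeiou]+/g) || [];
}

console.log(consonantVowelRuns("cccaaabbee"));  // [ 'cccaaa', 'bbee' ]
console.log(consonantVowelRuns("zcdbadaerfe")); // [ 'zcdba', 'dae', 'rfe' ]
console.log(consonantVowelRuns("foubsyudba"));  // [ 'fou', 'bsyu', 'dba' ]
```

Because both quantifiers are greedy, each run extends as far as it can, which is what keeps the repeated vowels in the match.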
const rgx = /[^aeiou]+[aeiou]+(?=[^aeiou])|.*[aeiou](?=\b)/g;
Segment - Description
[^aeiou]+ - one or more characters that are not vowels
[aeiou]+ - one or more vowels
(?=[^aeiou]) - matches only if the next character is not a vowel
| - OR
.*[aeiou](?=\b) - zero or more of any character, then a vowel that must be followed by a word boundary (such as the end of the string)
function lastVowel(str) {
const rgx = /[^aeiou]+[aeiou]+(?=[^aeiou])|.*[aeiou](?=\b)/g;
return [...str.matchAll(rgx)].flat();
}
const str1 = "cccaaabbee";
const str2 = "zcdbadaerfe";
const str3 = "foubsyudba";
console.log(lastVowel(str1));
console.log(lastVowel(str2));
console.log(lastVowel(str3));

Why does the regex [a-zA-Z]{5} return true for non-matching string?

I defined a regular expression to check if the string only contains alphabetic characters and with length 5:
use regex::Regex;
fn main() {
let re = Regex::new("[a-zA-Z]{5}").unwrap();
println!("{}", re.is_match("this-shouldn't-return-true#"));
}
The text I use contains many illegal characters and is longer than 5 characters, so why does this return true?
You have to put it inside ^...$ to match the whole string and not just parts:
use regex::Regex;
fn main() {
let re = Regex::new("^[a-zA-Z]{5}$").unwrap();
println!("{}", re.is_match("this-shouldn't-return-true#"));
}
Playground.
As explained in the docs:
Notice the use of the ^ and $ anchors. In this crate, every expression is executed with an implicit .*? at the beginning and end, which allows it to match anywhere in the text. Anchors can be used to ensure that the full text matches an expression.
Your pattern returns true because it matches any 5 consecutive alphabetic characters anywhere in the string; in your case it can match inside both "shouldn't" and "return".
Change your regex to: ^[a-zA-Z]{5}$
^ start of string
[a-zA-Z]{5} matches 5 alpha chars
$ end of string
This will match a string only if the string has a length of 5 chars and all of the chars from start to end fall in range a-z and A-Z.
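The same search-versus-full-match distinction can be seen in other engines. Here is a small sketch in JavaScript rather than Rust (a choice made only to keep the example short); for these two patterns the result should mirror what is_match reports:

```javascript
// Unanchored: succeeds if ANY 5 consecutive letters occur in the input.
const unanchored = /[a-zA-Z]{5}/;
// Anchored: succeeds only if the WHOLE input is exactly 5 letters.
const anchored = /^[a-zA-Z]{5}$/;

const input = "this-shouldn't-return-true#";
console.log(unanchored.test(input));   // true
console.log(anchored.test(input));     // false
console.log(anchored.test("hello"));   // true
```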

Regex catch word at the start and end of a UITextView

I'm trying to catch when a word is used in a UITextView. I've got it working for words in the interior of the view.
The problem is when the word is first or last in the view.
My code so far:
private func filteredTermFor(_ word: String) -> String {
let punctuationFilter = "([\\A|\\W|\\d|\\z| ])"
let wordInParens = "(\(word))"
return punctuationFilter + wordInParens + punctuationFilter
}
I checked and found I should use ^ for the start of input and $ for the end of input. When I add either of these, for example:
"([^|\\A|\\W|\\d|\\z| ])"
they don't seem to have any effect when the word in question is the first or last in the view.
*For the sake of being verbose with my question, the return value from the function above is being used as searchTerm in this:
func highlightedTextInString(with searchTerm: String, targetString: String) -> NSAttributedString? {
let attributedString = NSMutableAttributedString(string: targetString)
do {
let regex = try NSRegularExpression(pattern: searchTerm, options: .caseInsensitive)
let range = NSRange(location: 0, length: targetString.utf16.count)
for match in regex.matches(in: targetString, options: .withTransparentBounds, range: range) {
let fontColor = UIColor.red
attributedString.addAttribute(NSForegroundColorAttributeName, value: fontColor, range: match.range)
}
return attributedString
} catch _ {
print("Error creating regular expression")
return nil
}
}
** Edit **
Since this was marked as a duplicate
The question this was reported a duplicate of does not cover edge cases when the word is typed next to a punctuation mark or digit without spaces.
For example:
.word , word9 , ?word?
Note that ([^|\\A|\\W|\\d|\\z| ]) is a capturing group ((...)) containing a character class, which matches a single char from the set defined inside it. The ^ right after [ negates the class, so it matches any char except those in the set. So, [^|\\A|\\W|\\d|\\z| ] matches a single char other than | (it is no longer an alternation operator inside a character class), A (the \ in front of it is ignored), a non-word char, a digit, z and a space. It effectively matches _ and any letter other than A and z.
You state that the words you need to match may occur within word boundaries or digits.
You may use
return "(?<![^\\W\\d])(\(word))(?![^\\W\\d])"
See the regex demo.
Here, "(?<![^\\W\\d])" is a negative lookbehind that matches a location that is NOT immediately preceded by a character other than a non-word char or a digit. This sounds cumbersome, but the main point is that [^\W\d] matches the same text as \w excluding digits (\w matches letters, digits, and _). So, "(?<![^\\W\\d])" makes sure there is a start of string or a non-letter, non-_ char right before the word. If you want a word to match after _ as well, just use (?<!\\p{L}) (where \p{L} matches any Unicode letter).
The "(?![^\\W\\d])" is a negative lookahead that makes sure there is an end of string or a non-letter and non-_ (there can be punctuation, symbols and digits) immediately to the right of the word. Again, if you want to match a word if it is followed with _, you may replace this lookahead with "(?!\\p{L})" (just no letter after the word is allowed).
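The behaviour of the two lookarounds on the edge cases from the question can be checked outside of Swift. This is a rough sketch in JavaScript, where findWord is a hypothetical helper. It assumes the word contains no regex metacharacters, and JavaScript's \W and \d are ASCII-only, so results for non-ASCII text may differ from NSRegularExpression:

```javascript
// Hypothetical helper: returns the start index of every occurrence of
// `word` that is not glued to a letter or _ on either side.
// Digits and punctuation are allowed as neighbours.
function findWord(text, word) {
  const re = new RegExp(`(?<![^\\W\\d])(${word})(?![^\\W\\d])`, "gi");
  return [...text.matchAll(re)].map((m) => m.index);
}

// ".word", "word9" and "?word?" match; "swordfish" and "words" do not.
console.log(findWord(".word , word9 , ?word? swordfish words", "word")); // [ 1, 8, 17 ]
// Start and end of input need no special handling.
console.log(findWord("word", "word")); // [ 0 ]
```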

Regex for string *11F23H3*: Start and end with *, 7 Uppercase literals or numbers in between

I need to check strings like *11F23H3* that start and end with a * and have 7 uppercase letters or digits in between. So far I have:
if (!barcode.match('[*A-Z0-9*]')) {
console.error(`ERROR: Barcode not valid`);
process.exitCode = 1;
}
But this does not rule out strings like *11111111111*. What would the correct regex look like?
You can use this regex:
/\*[A-Z0-9]{7}\*/
* is a regex metacharacter that needs to be escaped outside a character class
[A-Z0-9]{7} will match exactly 7 characters, each an uppercase letter or a digit
Code:
var re = /\*[A-Z0-9]{7}\*/;
if (!re.test(barcode)) {
console.error(`ERROR: Barcode ${barcode} in row ${row} is not valid`);
process.exitCode = 1;
}
Note that if barcode is only going to have this string then you should also use anchors like this to avoid matching any other text on either side of *:
var re = /^\*[A-Z0-9]{7}\*$/;
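As a quick sanity check of the anchored version, the sketch below wraps it in a hypothetical helper named isValidBarcode and runs it on the strings from the question:

```javascript
// Hypothetical helper: true only for "*", 7 uppercase letters/digits, "*".
function isValidBarcode(barcode) {
  return /^\*[A-Z0-9]{7}\*$/.test(barcode);
}

console.log(isValidBarcode("*11F23H3*"));     // true
console.log(isValidBarcode("*11111111111*")); // false: 11 characters between the stars
console.log(isValidBarcode("x*11F23H3*"));    // false: extra text before the first star
```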

How to detect if a string contains hindi (devnagri) in it with character and word count

Below is a example string -
$string = "abcde वायरस abcde"
I need to check whether this string contains any Hindi (Devanagari) content and, if so, the count of characters and words. I guess a regex with a Unicode character class can work http://www.regular-expressions.info/unicode.html. But I am not able to figure out the correct regex statement.
To find out whether a string contains a Hindi (Devanagari) character, you need to have a full list of all Hindi characters. According to this website, the Hindi characters are the code points between 0x0900 and 0x097F (decimal 2304 to 2431).
The regular expression pattern needs to match, if any of those characters are in the set. Therefore, you can use a pattern (actually a set of characters) to match the string, which looks like this:
[\u0900\u0901\u0902 ... \u097D\u097E\u097F]
Because it is rather cumbersome to manually write this list of characters down, you can generate this string by iterating over the decimal characters from 2304 to 2431 or over the hexadecimal characters.
To count all words made up of Hindi characters, you can use the following pattern. It requires whitespace (\s), the beginning (^), or the end ($) of the string on each side of the word, and uses the global flag (/g) to match every occurrence:
/(?:^|\s)[\u0900\u0901\u0902 ... \u097D\u097E\u097F]+?(?:\s|$)/g
Here is a live implementation in JavaScript:
var numberOfHindiCharacters = 128;
var unicodeShift = 0x0900;
var hindiAlphabet = [];
for(var i = 0; i < numberOfHindiCharacters; i++) {
hindiAlphabet.push("\\u0" + (unicodeShift + i).toString(16));
}
var regex = new RegExp("(?:^|\\s)["+hindiAlphabet.join("")+"]+?(?:\\s|$)", "g");
var string1 = "abcde वायरस abcde";
var string2 = "abcde abcde";
[ string1.match(regex), string2.match(regex) ].forEach(function(match) {
if(match) {
console.log("String contains " + match.length + " words with Hindi characters only.");
} else {
console.log("String does NOT contain any words with Hindi characters only.");
}
});
It should be a range. The list of all characters is not required.
The following will detect a Devanagari word
[\u0900-\u097F]+
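Using that range, both counts asked for in the question can be obtained with two global matches. This is a minimal sketch in which countDevanagari is a hypothetical helper. It assumes a "word" is a maximal run of characters from the Devanagari block, and that vowel signs and other combining marks count as characters of their own:

```javascript
// Hypothetical helper: counts Devanagari characters and words in a string.
// Assumption: a "word" is a maximal run of code points in U+0900..U+097F.
function countDevanagari(text) {
  const chars = text.match(/[\u0900-\u097F]/g) || [];
  const words = text.match(/[\u0900-\u097F]+/g) || [];
  return { characters: chars.length, words: words.length };
}

console.log(countDevanagari("abcde वायरस abcde")); // { characters: 5, words: 1 }
console.log(countDevanagari("abcde abcde"));       // { characters: 0, words: 0 }
```

A string contains Devanagari content exactly when the characters count is greater than zero.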