Dart RegExp whitespace is not recognized - regex

I'm trying to implement a regex pattern for username that allows English letters, Arabic letters, digits, dash and space.
The following pattern never matches when the input string contains a space, even though \s is included in the pattern:
Pattern _usernamePattern = r'^[a-zA-Z0-9\u0621-\u064A\-\s]{3,30}$';
I also tried replacing \s with " " and \\s but the regex always returns no matches for any input that has a space in it.
Edit: It turns out that Flutter adds a Unicode "Right-To-Left Mark" or "Left-To-Right Mark" character when a text field contains a mix of LTR and RTL languages. This mark is an invisible Unicode character that gets added to the text, and the regex above was failing because of it. To resolve the issue, simply do a replaceAll for these characters. Read more here: https://github.com/flutter/flutter/issues/56514.

This is a fairly nasty problem and worth documenting in an answer here.
As documented in the source:
/// When LTR text is entered into an RTL field, or RTL text is entered into an
/// LTR field, [LRM](https://en.wikipedia.org/wiki/Left-to-right_mark) or
/// [RLM](https://en.wikipedia.org/wiki/Right-to-left_mark) characters will be
/// inserted alongside whitespace characters, respectively. This is to
/// eliminate ambiguous directionality in whitespace and ensure proper caret
/// placement. These characters will affect the length of the string and may
/// need to be parsed out when doing things like string comparison with other
/// text.
While this is well intended, it can cause problems when you work with mixed LTR/RTL text (as is the case here) and have to ensure an exact field length, etc.
The suggested solution is to remove all left-to-right marks (the right-to-left mark, \u{200f}, can be removed the same way):
void main() {
  final String lrm = 'aaaa \u{200e}bbbb';
  print('lrm: "$lrm" with length ${lrm.length}');
  final String lrmFree = lrm.replaceAll(RegExp(r'\u{200e}', unicode: true), '');
  print('lrmFree: "$lrmFree" with length ${lrmFree.length}');
}
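For readers who want to see the whole validation flow in one place, here is a minimal Python sketch of the same idea: strip both direction marks first, then apply the username pattern from the question. The helper names (strip_direction_marks, is_valid_username) are made up for illustration, and Python's re module stands in for Dart's RegExp, so treat it as a sketch rather than a drop-in fix.

```python
import re

# Same character set as the question's pattern: ASCII letters, digits,
# Arabic letters U+0621-U+064A, dash and whitespace, 3 to 30 characters.
USERNAME_RE = re.compile(r'^[a-zA-Z0-9\u0621-\u064A\-\s]{3,30}$')

# U+200E is the left-to-right mark, U+200F the right-to-left mark.
DIRECTION_MARKS_RE = re.compile(r'[\u200e\u200f]')


def strip_direction_marks(text):
    """Remove LRM/RLM characters that a text field may have inserted."""
    return DIRECTION_MARKS_RE.sub('', text)


def is_valid_username(text):
    """Validate the username after stripping direction marks (hypothetical helper)."""
    return USERNAME_RE.match(strip_direction_marks(text)) is not None


raw = 'aaaa \u200ebbbb'  # what the text field might hand back
print(len(raw), USERNAME_RE.match(raw) is not None)
print(len(strip_direction_marks(raw)), is_valid_username(raw))
```

The raw string fails the pattern only because of the invisible mark; once it is stripped, the space is accepted by \s as expected.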
Related: right-to-left (RTL) in flutter

Powershell find non printing characters [duplicate]

I would appreciate your help on this, since I do not know which range of characters to use, or whether there is a character class like the [[:cntrl:]] that I have found in Ruby.
By non-printable, I mean: delete all characters that are not shown in the output when one prints the input string. Please note that I am looking for a C# regex; I do not have a problem with my code.
You may remove all control and other non-printable characters with
s = Regex.Replace(s, @"\p{C}+", string.Empty);
The \p{C} Unicode category class matches all control characters, even those outside the ASCII table because in .NET, Unicode category classes are Unicode-aware by default.
Breaking it down into subcategories
To only match basic control characters you may use \p{Cc}+, see 65 chars in the Other, Control Unicode category. It is equal to a [\u0000-\u0008\u000E-\u001F\u007F-\u0084\u0086-\u009F \u0009-\u000D \u0085]+ regex.
To only match 161 other format chars including the well-known soft hyphen (\u00AD), zero-width space (\u200B), zero-width non-joiner (\u200C), zero-width joiner (\u200D), left-to-right mark (\u200E) and right-to-left mark (\u200F), use \p{Cf}+. The equivalent including astral plane code points is a (?:[\xAD\u0600-\u0605\u061C\u06DD\u070F\u08E2\u180E\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]|\uD804[\uDCBD\uDCCD]|\uD80D[\uDC30-\uDC38]|\uD82F[\uDCA0-\uDCA3]|\uD834[\uDD73-\uDD7A]|\uDB40[\uDC01\uDC20-\uDC7F])+ regex.
To match 137,468 Other, Private Use control code points you may use \p{Co}+, or its equivalent including astral plane code points, (?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])+.
To match the 2,048 Other, Surrogate code points (the UTF-16 halves that make up astral characters, including many emojis), you may use \p{Cs}+, or the [\uD800-\uDFFF]+ regex.
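To get a feel for which subcategory a given character falls into, the breakdown above can be sketched in Python with the standard unicodedata module. This is only an illustration: remove_categories is a hypothetical helper, Python's re module has no \p{...} syntax, and Python's Unicode tables may differ in version from the ones .NET uses.

```python
import unicodedata


def remove_categories(text, prefixes=('C',)):
    """Drop every character whose Unicode general category starts with one
    of the given prefixes. The default 'C' covers Cc, Cf, Co, Cs and Cn."""
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(prefixes)
    )


# NUL (Cc), left-to-right mark (Cf), soft hyphen (Cf), private use (Co)
sample = 'a\u0000b\u200ec\u00add\ue000e'

for ch in sample:
    print(repr(ch), unicodedata.category(ch))

print(repr(remove_categories(sample)))           # everything in category C removed
print(repr(remove_categories(sample, ('Cc',))))  # basic control characters only
```

Passing ('Cc',) mirrors the \p{Cc}+ variant, while the default mirrors \p{C}+.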
You can try:
string s = "Täkörgåsmrgås";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
Updated answer after comments:
Documentation about non-printable character:
https://en.wikipedia.org/wiki/Control_character
Char.IsControl Method:
https://msdn.microsoft.com/en-us/library/system.char.iscontrol.aspx
Maybe you can try:
string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());
To remove all control and other non-printable characters:
Regex.Replace(s, @"\p{C}+", String.Empty);
To remove only the control characters (if you don't want to remove the emojis 😎):
Regex.Replace(s, @"\p{Cc}+", String.Empty);
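You can try this:
// Extension method: must be declared inside a static class.
public static string TrimNonAscii(this string value)
{
    // Remove everything outside the printable ASCII range (space through ~).
    string pattern = "[^ -~]+";
    Regex reg_exp = new Regex(pattern);
    return reg_exp.Replace(value, "");
}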

Regex in Excel VBA Special Characters and Embedded Spaces

I have to parse a huge file, but one of the values is causing me a lot of grief.
It is a fixed length field of six characters. The description of the allowable values is:
Left justified; space filled. Cannot contain special characters or embedded spaces. If data is unavailable, space filled.
What I have attempted so far is to check:
If Code = " " Then
MsgBox "Code is Space Filled."
This will check if it is all space filled, which is ok.
Next I check whether there are any special characters using the following function:
With ObjRegex
.Global = True
.Pattern = "[^a-zA-Z0-9\s]+"
StripNonAlpha = .Replace(Replace(TextToReplace, "-", Chr(32)),
End With
I can compare two strings, the original code and the stripped of special characters one. If they don't match then it contains a special character and is not valid.
It is the spaces that are causing me issues. I have to check that the value is left aligned (no leading spaces followed by characters) and has no embedded spaces; trailing spaces are OK.
I have tried a few variations of the above function but to no avail.
e.g. (wrong):
(^\sa-zA-Z0-9\sa-zA-Z0-9)+
I would appreciate any pointers. If there is a more 'all in one' regex that makes more sense, that would be great, and if regex is the wrong way to go I'm more than happy to abandon it.
Partial answer:
Regex: (?=[a-zA-Z0-9\s]{6})[a-zA-Z0-9]*\s*
Drawback: it will also match more than 6 characters (but not fewer than 6).
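One possible way around that drawback is to anchor the pattern so the lookahead pins the length to exactly six. The Python sketch below tests that idea against the rules in the question. The function name is_valid_code is hypothetical, a literal space is used instead of \s on the assumption that tabs should not count as fill characters, and it has not been tried in VBA's regex engine.

```python
import re

# (?=.{6}\Z)    the whole field is exactly six characters long
# [a-zA-Z0-9]*  left-justified alphanumerics (possibly none)
# " *"          followed only by trailing space fill
CODE_RE = re.compile(r'(?=.{6}\Z)[a-zA-Z0-9]* *')


def is_valid_code(code):
    """Return True for a six-character, left-justified, space-filled field."""
    return CODE_RE.fullmatch(code) is not None


for candidate in ['ABC   ', '      ', 'ABCDEF', ' ABC  ', 'AB C  ', 'AB-C  ', 'ABCDEFG']:
    print(repr(candidate), is_valid_code(candidate))
```

Because alphanumerics must come first and only spaces may follow, leading and embedded spaces are rejected without a separate check.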

Regex: match whole line except first string and #comment lines

I tried (\s|\t).*[\b\w*\s\b]. This one is almost OK, but I also want to exclude lines starting with #.
#Name Type Allowable values
#========================== ========= ========================================
_absolute-path-base-uri String -
add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0
As @anubhava said in his answer, it looks like you just need to check for # at the beginning of the line. The regex for that is simple, but the mechanics of applying the regex vary wildly, so it would help if we knew which regex flavor/tool you're using (e.g. PHP, .NET, Notepad++, EditPad Pro, etc.). Here's a JavaScript version:
/^[^#].*$/mg
Notice the modifiers: m ("multiline") allows ^ and $ to match at line boundaries, and g ("global") allows you to find all the matches, not just the first one.
Now let's look at your regex. [\b\w*\s\b] is a character class that matches a word character (\w), a whitespace character (\s), an asterisk (*), or a backspace (\b). In other words, both * and \b lose their special meanings when they appear in a character class.
\s matches any whitespace character including \t, so (\s|\t) is needlessly redundant, and may not be needed at all. What it's actually doing in your case is matching the newline before each matched line. There's no need for that when you can use ^ in multiline mode. If you want to allow for horizontal whitespace (i.e., spaces and tabs) before the #, you can do this:
/^(?![ \t]*#).*$/mg
(?![ \t]*#) is a negative lookahead; it means "from this position, it is impossible to match zero or more tabs or spaces followed by #". Coming right after the ^ line anchor as it does, "this position" means the beginning of a line.
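Since the regex flavor in use is unknown, here is the same negative-lookahead pattern as a small Python sketch, with re.MULTILINE playing the role of the m modifier and finditer that of g. The indented comment line in the sample data is a hypothetical addition to show the [ \t]* part at work.

```python
import re

DATA = "\n".join([
    "#Name Type Allowable values",
    "#========================== ========= ========================================",
    "_absolute-path-base-uri String -",
    "  # indented comment (hypothetical line, not in the question's data)",
    "add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0",
])

# ^ and $ match at line boundaries because of re.MULTILINE; the lookahead
# rejects any line whose first non-blank character is '#'.
NOT_COMMENT = re.compile(r'^(?![ \t]*#).*$', re.MULTILINE)

lines = [m.group(0) for m in NOT_COMMENT.finditer(DATA)]
for line in lines:
    print(line)
```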
Try this:
^[A-Za-z0-9_-]+\s+(.+)$
Assuming your first string will consist of only letters, numbers, underscores or hyphens, the first part will match that. Then we match whitespace, and then capture the rest. However, this is all dependent on the regular expression engine being used. Is this using language support for regexes, a specific editor, or a certain library? Which one? There isn't a standard: each regex engine works slightly differently.
Try this:
^[^#].*?(\s|\t)(?<Group>.*)$
After a match is found, the Group group will contain your string.
I would use this regex. In English, this says: "First character is not a pound sign (#), then non-whitespace to match the first 'word', then whitespace, then capture the rest of the line."
^[^#]\S*\s+(.+)$
Can I suggest another approach though? It looks like there are tabs between each field in the text, so why not just read the text line-by-line and split by tab into an array?
Here is an example in C# (untested):
// requires: using System; using System.IO;
using (StreamReader sr = new StreamReader("C:\\Path\\to\\file.txt"))
{
    string line;
    // ReadLine returns null at end of file, so the last line is processed too
    while ((line = sr.ReadLine()) != null)
    {
        // skip the comment lines (the next line is read by the loop condition)
        if (line.StartsWith("#"))
            continue;
        string[] fields = line.Split(new string[] {"\t"}, StringSplitOptions.RemoveEmptyEntries);
        // now fields[0] contains the Name field
        // fields[1] contains the Type field
        // fields[2] contains the Allowable Values field
    }
}
Try this code in php:
<?php
$s="#Name Type Allowable values
#========================== ========= ========================================
_absolute-path-base-uri String -
add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0 ";
$a = explode("\n", $s);
foreach($a as $str) {
preg_match('~^[^#].*$~', $str, $m);
var_dump($m);
}
?>
OUTPUT
array(0) {
}
array(0) {
}
array(1) {
[0]=>
string(79) "_absolute-path-base-uri String - "
}
array(1) {
[0]=>
string(77) "add-xml-decl Boolean y/n, yes/no, t/f, true/false, 1/0 "
}
The code is pretty simple: the pattern refuses to match a # at the start of a line, so those lines are ignored completely.

XSL - Remove non breaking space

In my XSL implementation (2.0), I tried using the below statement to remove all the spaces & non breaking spaces within a text node. It works for spaces only but not for non breaking spaces whose ASCII codes are,                            ​  etc. I am using SAXON processor for execution.
Current XSL code:
translate(normalize-space($text-nodes[1]), ' ' , '' ))
How can I have them removed? Please share your thoughts.
Those codes are Unicode, not ASCII (for the most part), so you should probably use the replace function with a regex containing the Unicode separator character class:
replace($text-nodes[1], '\p{Z}+', '')
In more detail:
The regex \p{Z}+ matches one or more characters that are in the "separator" category in Unicode. \p{} is the category escape sequence, which matches a single character in the category specified within the curly braces. Z specifies the "separator" category (which includes various kinds of whitespace). + means "match the preceding regex one or more times". The replace function returns a version of its first argument with all non-overlapping substrings matching its second argument replaced with its third argument. So this returns a version of $text-nodes[1] with all sequences of separator characters replaced with the empty string, i.e. removed.
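To check what the "separator" category does and does not cover, the replace call can be imitated in Python with the unicodedata module. This is a rough sketch: strip_separators is a hypothetical name, and Python's re module does not support \p{Z}, so the category is tested per character instead.

```python
import unicodedata


def strip_separators(text):
    """Remove every character in a Unicode separator category (Zs, Zl, Zp)."""
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith('Z')
    )


# no-break space (U+00A0), em space (U+2003) and a plain space are all Zs
print(repr(strip_separators('a\u00a0b\u2003c d')))

# tab is Cc and zero-width space (U+200B) is Cf, so neither is a separator
print(repr(strip_separators('a\tb\u200bc')))
```

If characters such as tabs or zero-width spaces also need to go, they have to be added to the pattern separately, because category Z does not include them.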

Help with specific Regex: need to match multiple instances of multiple formats in a single string

I apologize for the terrible title...it can be hard to try to summarize an entire situation into a single sentence.
Let me start by saying that I'm asking because I'm just not a Regex expert. I've used it a bit here and there, but I just come up short with the correct way to meet the following requirements.
The Regex that I'm attempting to write is for use in an XML schema for input validation, and used elsewhere in Javascript for the same purpose.
There are two different possible formats that are supported. There is a literal string, which must be surrounded by quotation marks, and a Hex-value string which must be surrounded by braces.
Some test cases:
"this is a literal string" <-- Valid string, enclosed properly in "s
"this should " still be correct" <-- Valid string, "s are allowed within (if possible, this requirement could be forgiven if necessary)
"{00 11 22}" <-- Valid string, {}'s allow in strings. Another one that can be forgiven if necessary
I am bad output <-- Invalid string, no "s
"Some more problemss"you know <-- Invalid string, must be fully contained in "s
{0A 68 4F 89 AC D2} <-- Valid string, hex characters enclosed in {}s
{DDFF1234} <-- Valid string, spaces are ignored for Hex strings
DEADBEEF <-- Invalid string, must be contained in either "s or {}s
{0A 12 ZZ} <-- Invalid string, 'Z' is not a valid Hex character
To satisfy these general requirements, I came up with the following regex, which seems to work well enough. I'm still fairly new to regex, so there could be a huge hole here that I'm missing:
".+"|\{([0-9]|[a-f]|[A-F]| )+\}
If I recall correctly, the XML Schema regex automatically assumes beginning and end of line (^ and $ respectively). So, essentially, this regex accepts any string that starts and ends with a ", or starts and ends with {}s and contains only valid hexadecimal characters. This has worked well for me so far, except that I had forgotten about another (less common, and thus forgotten) input option that completely breaks my regex.
Where I made my mistake:
Valid input should also allow a user to separate valid strings (of either type, literal/hex) by a comma. This means that a single string should be able to contain more than one of the above valid strings, separated by commas. Luckily, however, a comma is not a supported character within a literal string (although I see that my existing regex does not care about commas).
Example test cases:
"some string",{0A F1} <-- Valid
{1122},{face},"peanut butter" <-- Valid
{0D 0A FF FE},"string",{FF FFAC19 85} <-- Valid (Spaces don't matter in Hex values)
"Validation is allowed to break, if a comma is found not separating values",{0d 0a} <-- Invalid, comma is a delimiter, but "Validation is allowed to break" and "if a comma..." are not marked as separate strings with "s
hi mom,"hello" <-- Invalid, String1 was not enclosed properly in "s or {}s
My thinking is that it should be possible to use commas as a delimiter and check that each "section" of the string matches a regex similar to the original, but I am just not advanced enough in regex yet to come up with a solution on my own. Any help would be appreciated, but ultimately a final solution with an explanation would be just stellar.
Thanks for reading this huge wall of text!
According to http://www.regular-expressions.info/xml.html the regex language to be used in XSD is less expressive than the one used in Java, but expressive enough for your task.
Now for a construction, take your own regex. Replace the dot with a negated character class [^,] to match everything except the comma, and (for increased clarity) merge the hexadecimal character classes into one. You get the following regex:
"[^,]+"|\{[0-9a-fA-F ]+\}
If we name this regex <S> (for "single string"), the additional feature is validated by the regex matching any number of <S>,, followed by a single <S>:
(<S>,)*<S>
Expanded, this yields the desired regex:
(("[^,]+"|\{[0-9a-fA-F ]+\}),)*("[^,]+"|\{[0-9a-fA-F ]+\})
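The construction can be checked against the test cases from the question with a short Python sketch. Here re.fullmatch stands in for the implicit anchoring of XML Schema patterns, and the names SINGLE, LIST_RE and is_valid are made up for the example; a real XSD validator or a JavaScript engine may differ in details.

```python
import re

# <S>: a quoted literal without commas, or a braced run of hex digits and spaces
SINGLE = r'(?:"[^,]+"|\{[0-9a-fA-F ]+\})'

# (<S>,)*<S>: any number of "<S>," groups followed by one final <S>
LIST_RE = re.compile(rf'(?:{SINGLE},)*{SINGLE}')


def is_valid(value):
    """Return True if the whole value matches (<S>,)*<S>."""
    return LIST_RE.fullmatch(value) is not None


cases = [
    '"some string",{0A F1}',
    '{1122},{face},"peanut butter"',
    '{0D 0A FF FE},"string",{FF FFAC19 85}',
    '"Validation is allowed to break, if a comma is found not separating values",{0d 0a}',
    'hi mom,"hello"',
    '{0A 12 ZZ}',
    'DEADBEEF',
]
for case in cases:
    print(is_valid(case), case)
```

Note that the pattern is deliberately permissive inside the delimiters: a quoted string may contain further quotation marks, and a braced value may consist of spaces only.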
Maybe something along the lines of
(?:(?:"[^,]+?"|\{(?:[0-9]|[a-f]|[A-F]| )+?\}),)*(?:(?:"[^,]+?"|\{(?:[0-9]|[a-f]|[A-F]| )+?\}))