Regex in Excel VBA: Special Characters and Embedded Spaces

I have to parse a huge file, but one of the values is causing me a lot of grief.
It is a fixed length field of six characters. The description of the allowable values is:
Left justified; space filled. Cannot contain special characters or embedded spaces. If data is unavailable, space filled.
What I have attempted so far is to check:
If Code = Space(6) Then
    MsgBox "Code is Space Filled."
End If
This will check if it is all space filled, which is ok.
Next I check whether there are any special characters using the following function:
With ObjRegex
    .Global = True
    .Pattern = "[^a-zA-Z0-9\s]+"
    ' The original line was cut off here; an empty replacement string is assumed.
    StripNonAlpha = .Replace(Replace(TextToReplace, "-", Chr(32)), vbNullString)
End With
I can compare two strings, the original code and the stripped of special characters one. If they don't match then it contains a special character and is not valid.
It is the spaces that are causing me issues. I have to check that the value is left-aligned (no leading spaces before the characters) and has no embedded spaces; trailing spaces are OK.
I have tried a few variations of the above function but to no avail.
e.g. (wrong):
(^\sa-zA-Z0-9\sa-zA-Z0-9)+
I would appreciate any pointers. If there is a more 'all in one' regex that makes more sense, that would be great, and if regex is the wrong way to go I'm more than happy to abandon it.

Partial answer:
Regex: (?=[a-zA-Z0-9\s]{6})[a-zA-Z0-9]*\s*
Drawback: because the pattern is not anchored, it will also match input longer than 6 characters (but not input shorter than 6).
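One way to close that gap might be to anchor the pattern and move the length check into the lookahead. The sketch below is in JavaScript so it is easy to try out; the same pattern syntax should also be accepted by the VBScript RegExp object used from VBA, but that is an assumption worth verifying. The names CODE_PATTERN and isValidCode are hypothetical, and the pattern deliberately accepts an all-space field, since the spec allows space fill when data is unavailable.

```javascript
// Exactly six characters: zero or more letters/digits, then only spaces.
// Leading spaces, embedded spaces and special characters all fail.
const CODE_PATTERN = /^(?=.{6}$)[A-Za-z0-9]* *$/;

// Hypothetical helper name, not part of the original question.
function isValidCode(code) {
  return CODE_PATTERN.test(code);
}

console.log(isValidCode("ABC   ")); // true: left justified, space filled
console.log(isValidCode(" ABC  ")); // false: leading space
console.log(isValidCode("AB CD ")); // false: embedded space
console.log(isValidCode("ABC-D ")); // false: special character
```

Because the lookahead pins the total length to six, no separate length check is needed.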

Related

Dart RegExp white spaces is not recognized

I'm trying to implement a regex pattern for username that allows English letters, Arabic letters, digits, dash and space.
The following pattern always returns no match if the input string has a space even though \s is included in the pattern
Pattern _usernamePattern = r'^[a-zA-Z0-9\u0621-\u064A\-\s]{3,30}$';
I also tried replacing \s with " " and \\s but the regex always returns no matches for any input that has a space in it.
Edit: It turns out that Flutter adds a Unicode character for "Right-To-Left Mark" or "Left-To-Right Mark" when using a text field with a mix of languages that go LTR or RTL. This additional mark is a Unicode character that gets added to the text. The regex above was failing because of this additional character. To resolve the issue, simply do a replaceAll for these characters. Read more here: https://github.com/flutter/flutter/issues/56514.
This is a fairly nasty problem and worth documenting in an answer here.
As documented in the source:
/// When LTR text is entered into an RTL field, or RTL text is entered into an
/// LTR field, [LRM](https://en.wikipedia.org/wiki/Left-to-right_mark) or
/// [RLM](https://en.wikipedia.org/wiki/Right-to-left_mark) characters will be
/// inserted alongside whitespace characters, respectively. This is to
/// eliminate ambiguous directionality in whitespace and ensure proper caret
/// placement. These characters will affect the length of the string and may
/// need to be parsed out when doing things like string comparison with other
/// text.
While this is well-intended, it can cause problems when you work with mixed LTR/RTL text patterns (as is the case here) and have to ensure exact field length, etc.
The suggested solution is to remove all left-right-marks:
void main() {
  final String lrm = 'aaaa \u{200e}bbbb';
  print('lrm: "$lrm" with length ${lrm.length}');
  final String lrmFree = lrm.replaceAll(RegExp(r'\u{200e}', unicode: true), '');
  print('lrmFree: "$lrmFree" with length ${lrmFree.length}');
}
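To see why the invisible mark defeats the username pattern, here is a rough JavaScript sketch (Dart's RegExp follows the same ECMAScript syntax, so the idea should carry over). The helper name stripDirectionMarks is made up for illustration, and stripping U+200F as well as U+200E is an assumption based on the RLM case mentioned in the source comment.

```javascript
// Username pattern from the question: English/Arabic letters, digits, dash, whitespace.
const USERNAME_PATTERN = /^[a-zA-Z0-9\u0621-\u064A\-\s]{3,30}$/;

// Hypothetical helper: remove left-to-right (U+200E) and right-to-left (U+200F) marks.
function stripDirectionMarks(text) {
  return text.replace(/[\u200E\u200F]/g, "");
}

const raw = "abc \u200Edef"; // a space followed by an invisible LRM
console.log(USERNAME_PATTERN.test(raw));                      // false
console.log(USERNAME_PATTERN.test(stripDirectionMarks(raw))); // true
```

The mark is not covered by \s, which is why the unstripped input is rejected even though it looks like it only contains a space.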
Related: right-to-left (RTL) in flutter

Regex: not all BLANKS but allow certain characters, with limit

Trying to come up with a regex, or combination of regexes, that returns false if the user has a) entered only BLANK(s), or b) entered "non-legal" characters. Lastly, the number of characters has a set limit.
The closest I have thus far is below. Where it fails is that it does not count any leading spaces; only the non-BLANKs are counted, and so it fails. Using js.
const reg = /^([ ]*[!-~\u2018-\u201d\u2013\u2014]){1,10}$/;
EDIT: I think the above is incorrect, and I meant to post this:
const re4 = /^(?!\s*$)[!-~\u2018-\u201d\u2013\u2014]{1,10}$/;
EDIT 2: this has less clutter; allow space and all other 'standard' keyboard chars:
const re5 = /^(?!\s*$)[!-~]{1,10}$/;
So, this says you can enter a bunch of spaces, and must include at least 1 other character from the list following; but the {1,10} only counts the non-spaces and so I can end up with too many in total.
EDIT:
So, using re5 above --
s = ' '; // should fail
s = ' blah blah'; // should pass (10 characters)
s = '  blah blah'; // should fail, as there are 11 characters
Try ^(?:\s*\S){1,10}\s*$
This allows 1-10 non-whitespace characters; change \S to a character class of the allowed characters.
Update 2: After learning that you cannot invert the match result in code, here's one last suggestion using negative lookahead (like you already tried yourself).
This regex matches only strings of 1-10 non-banned characters that are not all whitespace:
const re4 = /^(?!\s+$)[^\!-\~\u2018-\u201d\u2013\u2014]{1,10}$/
Update 1: Use this regex to match all-whitespace string OR strings longer than 10 chars OR strings containing bad characters:
const re4 = /(^\s+$|^.{11,}$|[\!-\~\u2018-\u201d\u2013\u2014])/
I understand that you want to impose a length restriction via regex. I would suggest against that and recommend using str.length instead.
This regex will match whitespace-only strings and strings containing one or more bad characters:
const re4 = /(^\s+$|[\!-\~\u2018-\u201d\u2013\u2014])/;
Regarding prohibition of all-whitespace strings: Instead of packing it into a regex, you might consider using something more explicit like if (s.trim().length == 0). IMO this makes your intention clearer and your code probably more readable, leaving you with this easy-to-read regex:
// matches any string containing a *bad* character
const re4 = /[\!-\~\u2018-\u201d\u2013\u2014]/;
If you use trim for the all-whitespace check, you might convert your regex into a positive assertion, even with length restriction:
// matches any string consisting of 1-10 characters not considered *bad*
const re4 = /^[^\!-\~\u2018-\u201d\u2013\u2014]{1,10}$/;
To match the input when it's from 1 to 10 chars long and can't be all blanks, use a negative lookahead to assert that it is not all blanks, and anchor the end so the length limit is enforced:
^(?! *$).{1,10}$
If you want to restrict allowable chars, change the dot to a suitable character class of allowable chars.
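As a quick check of that idea against the three test strings from the question, here is a small sketch. The character class [ -~] (space through tilde, i.e. printable ASCII) is an assumption standing in for whatever set of allowed characters is actually needed, and INPUT_PATTERN is just an illustrative name.

```javascript
// Not all blanks, 1 to 10 characters in total, printable ASCII only (space through ~).
const INPUT_PATTERN = /^(?! *$)[ -~]{1,10}$/;

console.log(INPUT_PATTERN.test("   "));         // false: blanks only
console.log(INPUT_PATTERN.test(" blah blah"));  // true: 10 characters
console.log(INPUT_PATTERN.test("  blah blah")); // false: 11 characters
```

Because the space is part of the counted class, leading blanks now count toward the limit of 10.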

Regex words with letters, numbers, optional special characters in any order

I've been using some help on here for a while now but cannot find anything specific to my requirement. I need to pick out whole words which contain at least 6 letters and/or numbers (combined, not each), with optional 'special' characters. All in any order, so A12345, 12345A, 1-2-345-A, 12A45B and so-on.
I've done a fiddle here. I'm almost there (but it could be done better) - I can't work out why it needs to be at least 6 numbers to get a match. Is it because the letters are all optional with *?
This is VBA so no access to look behinds. The special characters will only ever be 'within' the match, not start or end (will never be -1234-A- for example).
I think this is what you are looking for:
[a-z0-9/-]{6,}
That will match in any order a to z or 0 to 9 or - or / of at least 6. Note the - is at the end of the character class. You can have it in the middle but then need to escape it. Also, / will need to be escaped if your delimiters are also /
update
As Wiktor noted this would also capture ------ which may not be what you want. I would suggest simply cleaning out all optional characters, and then running the above regex. I would delete my answer since I'm not providing exactly what was being asked, but it would be a workable solution so it may have value.
You could do a regex replacement to remove all non letters/numbers, and then check that the length of the resulting string is 6 or more:
Dim input As String = "A-1234-B"
Dim pattern As String = "[^A-Za-z0-9]+"
Dim replacement As String = ""
Dim rgx As New Regex(pattern)
Dim result As String = rgx.Replace(input, replacement)
Console.WriteLine(result.Length) ' 6
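If a single pattern is preferred over the strip-and-count approach, one possible sketch is shown below in JavaScript. It uses no lookbehind, so it may also work with the VBScript engine used from VBA, though that has not been verified here. It assumes at most one "-" or "/" between consecutive letters/digits, and the names WORD_PATTERN and findCodes are hypothetical.

```javascript
// A letter/digit, then at least five more, each optionally preceded by one "-" or "/".
const WORD_PATTERN = /\b[a-z0-9](?:[\/-]?[a-z0-9]){5,}\b/gi;

// Hypothetical helper: return every matching word in the text.
function findCodes(text) {
  return text.match(WORD_PATTERN) || [];
}

console.log(findCodes("ref A12345 and 1-2-345-A but not ------ or A-1"));
// [ 'A12345', '1-2-345-A' ]
```

Since every repetition must contain a letter or digit, a run such as ------ can never satisfy the count.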

Excel VBA Regex Check For Repeated Strings

I have some user input that I want to validate for correctness. The user should input 1 or more sets of characters, separated by commas.
So these are valid input
COM1
COM1,COM2,1234
these are invalid
COM -- only 3 characters
COM1,123 -- one set is only 3 characters
COM1.1234,abcd -- a dot separator not comma
I googled for a regex pattern to this and found a possible pattern that tested for a recurring instance of any 3 characters, and I modified like so
/^(.{4,}).*\1$/
but this is not finding matches.
I can manage the last comma that may or may not be there before passing to the test so that it is always there.
Preferably, I would like to test for letters (any case) and numbers only, but I can live with any characters.
I know I could easily do this in straight VBA by splitting the input on a comma delimiter and looping through each character of each array element, but regex seems more efficient, and I will have more cases that have slightly different patterns, so parameterising the regex for that would be better design.
TIA
I believe this does what you want:
^([A-Za-z0-9]{4},)*[A-Za-z0-9]{4}$
It's a line beginning followed by zero or more groups of four letters or numbers, each ending with a comma, followed by one group of four letters or numbers followed by an end-of-line.
You can play around with it here: https://regex101.com/r/Hdv65h/1
The regular expression
"^[\w]{4}(,[\w]{4})*$"
should work.
You can try this to see whether it works for all your cases using the following function. Assuming your test strings are in cells A1 thru A5 on the spreadsheet:
Sub findPattern()
    ' Requires a reference to "Microsoft VBScript Regular Expressions 5.5".
    Dim regEx As New RegExp
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.Pattern = "^[\w]{4}(,[\w]{4})*$"
    Dim i As Integer
    Dim cellText As String
    Dim mat As MatchCollection
    For i = 1 To 5
        ' Test the trimmed contents of cells A1 through A5.
        cellText = Trim(Cells(i, 1).Value)
        Set mat = regEx.Execute(cellText)
        If mat.Count = 0 Then
            MsgBox "No match found for " & cellText
        Else
            MsgBox "Match found for " & cellText
        End If
    Next
End Sub
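Since the question mentions wanting to parameterise the pattern for similar cases, here is a minimal sketch of that idea in JavaScript. The function name buildListPattern is hypothetical, and it assumes each group must be exactly the given number of letters or digits; in VBA the same pattern string could presumably be built by concatenation and assigned to regEx.Pattern.

```javascript
// Build a pattern for comma-separated groups of exactly `groupLength` letters/digits.
function buildListPattern(groupLength) {
  const group = `[A-Za-z0-9]{${groupLength}}`;
  return new RegExp(`^${group}(?:,${group})*$`);
}

const fourCharList = buildListPattern(4);
console.log(fourCharList.test("COM1,COM2,1234")); // true
console.log(fourCharList.test("COM1,123"));       // false: one group has 3 characters
console.log(fourCharList.test("COM1.1234,abcd")); // false: dot instead of comma
```

Unlike \w, the explicit class does not accept underscores, which fits the stated preference for letters and numbers only.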

Why doesn't my Regular Expression for an empty string work?

I've tested this on Regexr and it works, but it doesn't seem to work in AS3:
var emptyResult:* = new RegExp("^\s*$", "gi").exec(myField.text);
or
var emptyResult:* = /^\s*$/gi.exec(myField.text);
No matter whether I have text in the field or not, whitespace or non-whitespace, emptyResult is always null. I've tried with and without the g and i tags, but nothing seems to work.
Anyone know why this might be?
You don't need the 'i' flag - it stands for 'ignore case', which only affects letters, and your pattern doesn't contain any.
In the first example, you need to escape the backslash, otherwise it's treated as if it was meant to escape the following letter 's'.
You don't need the 'g' flag either, since you are trying to test the entire string once (in your case the line and the string are the same thing; \s* will consume any line terminators before $ is tried).
When using your second regexp, however, with the 'i' flag removed, it gives me the results I'd expect, i.e. if the entire text of the tested string consists of white space, tabulation, carriage return or line feed, then that entire string is returned.
For instance:
trace(/^\s*$/.exec(" \t\r\n")[0].length); // 4
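The backslash point can be illustrated with a short JavaScript sketch. JavaScript is used here as a stand-in on the assumption that AS3 string literals treat the unknown escape "\s" in a comparable way, as the answer describes; the variable names are illustrative only.

```javascript
// In a string literal "\s" is just "s", so this pattern is really /^s*$/.
const unescaped = new RegExp("^\s*$");
// Doubling the backslash keeps the \s whitespace class intact.
const escaped = new RegExp("^\\s*$");

console.log(unescaped.source);          // ^s*$
console.log(unescaped.test(" \t\r\n")); // false
console.log(escaped.test(" \t\r\n"));   // true
```

Using a regex literal such as /^\s*$/ avoids the issue entirely, because no string escaping is involved.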