^[A-Za-z](\W|\w)* regular expression? - regex

I want the regular expression ^[A-Za-z](\W|\w)* to match even when the user types white space as the first character. The first real character must not be a digit, and the remaining characters may be alphanumeric. When the user gives white space as the first character, it should automatically be trimmed. How can I do that?

^\s*([A-Za-z]\w*)
Should do it. Just get group 1.
I'm not sure which language you are using, so I'll assume C#. Here is a C# sample:
string testString = " myMatch123 not in the match";
Regex regexObj = new Regex("^\\s*([A-Za-z]\\w*)",
RegexOptions.IgnoreCase | RegexOptions.Multiline);
string result = regexObj.Match(testString).Groups[1].Value;
Console.WriteLine("-" + result + "-");
This will print
-myMatch123-
to the console window.
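For readers not using C#, here is a minimal Python sketch of the same idea. The helper name leading_identifier is made up for illustration. It skips any leading whitespace and returns group 1, or None when the first non-space character is not a letter.

```python
import re

# Optional leading whitespace, then one letter, then any number of word characters.
PATTERN = re.compile(r"^\s*([A-Za-z]\w*)")


def leading_identifier(text):
    """Return the captured name without leading whitespace, or None if no match."""
    match = PATTERN.match(text)
    return match.group(1) if match else None


print("-" + leading_identifier(" myMatch123 not in the match") + "-")  # -myMatch123-
```

Because the whitespace sits outside the capturing group, reading group 1 is what does the trimming.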

Is it possible to Trim() your input before giving it to your regex?
If you're looking for alpha-numerical, starting with non-numeric, you probably want:
\s*([A-Za-z][A-Za-z0-9]+)
If you allow one-character user names, change that plus to a star.
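A small Python sketch of the trim-first approach, assuming the whole input should be a single name. The function name clean_name is hypothetical, and the pattern uses the star variant so one-character names are accepted.

```python
import re

# A letter followed by zero or more letters or digits (the "star" variant).
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def clean_name(raw):
    """Trim the input, then accept it only if the whole string is a valid name."""
    trimmed = raw.strip()
    return trimmed if NAME_RE.fullmatch(trimmed) else None
```

Using fullmatch here is a design choice of this sketch: it rejects inputs such as "ab cd", which the unanchored pattern above would partially match.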

Related

How to extract words entirely written in uppercase with accents (Diacritics) with a Google Sheet REGEXEXTRACT formula?

OK,
it looks simple, but when the words start or end with an accented letter it becomes a mess.
I've looked on Stack Overflow and elsewhere and haven't really found a way to solve this problem.
I would like to use a Google Sheets formula to extract, from a cell, the words built only from the following characters:
A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,À,Á,Â,Ã,Ä,Å,Æ,Ç,È,É,Ê,Ë,Ì,Í,Î,Ï,Ð,Ñ,Ò,Ó,Ô,Õ,Ö,Ø,Ù,Ú,Û,Ü,Ý
For example with "Éléonorä-Camilliâ ÀLËMMNIÖ DE SANTORINÕ" or "ÀLËMMNIÖ DE SANTORINÕ Éléonorä Camilliâ" the result has to be the same "ÀLËMMNIÖ DE SANTORINÕ"
This formula works when there are no accents at all:
=REGEXEXTRACT(A2;"\b[A-Z]+(?:\s+[A-Z]+)*\b")
These formulas work sometimes, when the names are easy:
=REGEXEXTRACT(A2;"\b[A-Ý]+(?:\s+[A-Ý]+)*\b")
=REGEXEXTRACT(A2;"\B[A-Ý]+(?:\S+[A-Ý]+)*\B")
Can anybody help me or give me some hint?
It seems your expected matches are simply between whitespace or start/end of string. If you add a space before and after the cell value, you may simply extract all the chunks of whitespace-separated uppercase letter words between whitespaces, and the formula will boil down to
=REGEXEXTRACT(" " & A2 & " "; "\s([A-ZÀ-ÖØ-Ý]+(?:\s+[A-ZÀ-ÖØ-Ý]+)*)\s")
Regex details:
\s - a whitespace
([A-ZÀ-ÖØ-Ý]+(?:\s+[A-ZÀ-ÖØ-Ý]+)*) - Group 1 (the actual value returned by REGEXEXTRACT): one or more uppercase letters from the specified ranges followed with zero or more repetitions of one or more whitespace and then one or more uppercase letters
\s - a whitespace.
You may use an ARRAYFORMULA, as well:
=ARRAYFORMULA(IFERROR(REGEXEXTRACT(" " & A:A & " ", "\s([A-ZÀ-ÖØ-Ý]+(?:\s+[A-ZÀ-ÖØ-Ý]+)*)\s"),""))
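The padding trick is not specific to Google Sheets. Below is a rough Python sketch of the same pattern, assuming the accented letters are stored as single precomposed characters. The function name extract_uppercase_words is invented for this example.

```python
import re

# Uppercase ASCII letters plus the accented uppercase ranges used in the formula.
UPPER = "A-ZÀ-ÖØ-Ý"
PATTERN = re.compile(rf"\s([{UPPER}]+(?:\s+[{UPPER}]+)*)\s")


def extract_uppercase_words(cell):
    """Pad the value with spaces, then return the first run of all-uppercase words."""
    match = PATTERN.search(f" {cell} ")
    return match.group(1) if match else ""


print(extract_uppercase_words("Éléonorä-Camilliâ ÀLËMMNIÖ DE SANTORINÕ"))
```

A word such as "Éléonorä" is skipped because its uppercase first letter is followed by a lowercase letter rather than whitespace.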
Supposing your sample name were in A2, this should work:
=TRIM(REGEXEXTRACT(A2&" ","([A-ZÀ-Ý ]+)\s"))
By appending a space to the end of the string first, we can then look for any run of characters from the [uppercase letter set or space] that ends with a space. This rules out strings like "Éléonorä" and "Camilliâ" because those uppercase letters are not followed by a space.
Put a different way, the rule here says, "Grab as many uppercase letters or spaces in this set as possible, as long as you still have a space left over at the end." And since we appended a space to the end of the entire string, we can catch such groupings anywhere in the modified string.
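To see the rule in action outside Sheets, here is a hedged Python sketch of the same pattern. The function name extract_with_trailing_space is made up, and str.strip() stands in for TRIM.

```python
import re

# Uppercase letters (ASCII and À-Ý) or spaces, as many as possible,
# with one whitespace character left over at the end.
PATTERN = re.compile(r"([A-ZÀ-Ý ]+)\s")


def extract_with_trailing_space(cell):
    """Append a space, grab the first uppercase-or-space run, then trim it."""
    match = PATTERN.search(cell + " ")
    return match.group(1).strip() if match else ""
```

One thing to be aware of: the range À-Ý also covers the multiplication sign × (U+00D7), which the split ranges À-Ö and Ø-Ý avoid.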
Try this: backslash-escape the non A-Z characters.
[A-Z\À\Á\Â\Ã\Ä\Å\Æ\Ç\È\É\Ê\Ë\Ì\Í\Î\Ï\Ð\Ñ\Ò\Ó\Ô\Õ\Ö\Ø\Ù\Ú\Û\Ü\Ý]
If that fails, you can encode each of those letters as shown below.
To look up the characters: https://www.w3schools.com/charsets/ref_utf_latin1_supplement.asp
[A-Z\u00C0\u00C1... and so on...]
use:
=ARRAYFORMULA(TRIM(TRANSPOSE(QUERY(TRANSPOSE(IF(""<>
IFERROR(REGEXEXTRACT(SPLIT(A1:A, " "), "["&TEXTJOIN("", 1,
UNIQUE(QUERY({UPPER(CHAR(ROW(65:1500))), LOWER(CHAR(ROW(65:1500)))},
"select Col2 where Col1<>Col2")))&"]+")),,IFERROR(SPLIT(A1:A, " ")))),,9^9))))
or 10 characters shorter:
=INDEX(TRIM(TRANSPOSE(QUERY(TRANSPOSE(IF(""<>
IFERROR(REGEXEXTRACT(SPLIT(A:A; " "); "["&JOIN(;
UNIQUE(LOWER(QUERY(CHAR(ROUNDUP(SEQUENCE(1500; 2; 65)/2));
"select Col1 where lower(Col1)<>upper(Col2)"))))&"]+"));;
IFERROR(SPLIT(A:A; " "))));;9^9))))
This works with all Europe-based alphabets and captures all diacritics out there. It can differentiate between LOWER and UPPER.

Excel VBA Regex Check For Repeated Strings

I have some user input that I want to validate for correctness. The user should input 1 or more sets of characters, separated by commas.
So these are valid input
COM1
COM1,COM2,1234
these are invalid
COM -- only 3 characters
COM1,123 -- one set is only 3 characters
COM1.1234,abcd -- a dot separator not comma
I googled for a regex pattern for this and found a possible pattern that tests for a recurring instance of any 3 characters, which I modified like so:
/^(.{4,}).*\1$/
but this is not finding matches.
I can manage the last comma that may or may not be there before passing to the test so that it is always there.
Preferably, I would like to test for letters (any case) and numbers only, but I can live with any characters.
I know I could easily do this in straight VBA by splitting the input on a comma delimiter and looping through each character of each array element, but regex seems more efficient, and I will have more cases that have slightly different patterns, so parameterising the regex for that would be a better design.
TIA
I believe this does what you want:
^([A-Za-z0-9]{4},)*[A-Za-z0-9]{4}$
It's a line beginning, followed by zero or more groups of four letters or digits each ending with a comma, followed by one group of four letters or digits, followed by an end of line. (Inside a character class, | is a literal pipe and not an alternation, so it is left out here.)
You can play around with it here: https://regex101.com/r/Hdv65h/1
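Since the question mentions parameterising the pattern, here is a minimal Python sketch of that idea, written in Python rather than VBA. The names build_group_pattern and is_valid, and the default arguments, are assumptions for this example.

```python
import re


def build_group_pattern(group_len=4, chars="A-Za-z0-9"):
    """Build a pattern for comma-separated groups of exactly group_len allowed characters."""
    group = rf"[{chars}]{{{group_len}}}"
    return re.compile(rf"{group}(?:,{group})*")


def is_valid(text, pattern=build_group_pattern()):
    """True when the whole input consists of valid groups separated by commas."""
    return pattern.fullmatch(text) is not None


for sample in ["COM1", "COM1,COM2,1234", "COM", "COM1,123", "COM1.1234,abcd"]:
    print(sample, is_valid(sample))
```

Using fullmatch plays the role of the ^ and $ anchors, and a different group length or character set only needs different arguments to build_group_pattern.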
The regular expression
"^[\w]{4}(,[\w]{4})*$"
should work.
You can check whether it works for all your cases using the following Sub. It assumes your test strings are in cells A1 through A5 on the spreadsheet, and that the project references "Microsoft VBScript Regular Expressions 5.5":
Sub findPattern()
    ' Requires a reference to "Microsoft VBScript Regular Expressions 5.5"
    Dim regEx As New RegExp
    Dim mat As MatchCollection
    Dim i As Integer
    Dim cellText As String

    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.Pattern = "^[\w]{4}(,[\w]{4})*$"

    ' Test each of the cells A1..A5 against the pattern
    For i = 1 To 5
        cellText = Trim(Cells(i, 1).Value)
        Set mat = regEx.Execute(cellText)
        If mat.Count = 0 Then
            MsgBox "No match found for " & cellText
        Else
            MsgBox "Match found for " & cellText
        End If
    Next i
End Sub
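One difference between the \w version and an explicit letters-and-digits class is worth checking: \w also accepts the underscore. A short Python sketch (the helper name accepts is hypothetical) makes this visible.

```python
import re

# \w matches letters, digits and the underscore.
WORD_GROUPS = re.compile(r"\w{4}(,\w{4})*")
# An explicit class restricts the groups to ASCII letters and digits.
ALNUM_GROUPS = re.compile(r"[A-Za-z0-9]{4}(,[A-Za-z0-9]{4})*")


def accepts(pattern, text):
    """True when the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None


print(accepts(WORD_GROUPS, "AB_1"), accepts(ALNUM_GROUPS, "AB_1"))
```

If underscores should be rejected, the explicit class is the safer choice.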

How to better this regex?

I have a list of strings like this:
/soccer/poland/ekstraklasa-2008-2009/results/
/soccer/poland/orange-ekstraklasa-2007-2008/results/
/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/
From each string I want to take a middle part resulting in respectively:
ekstraklasa
orange ekstraklasa
orange ekstraklasa youth
My code here does the job but it feels like it can be done in fewer steps and probably with regex alone.
import re

# take the middle part
name = re.search(r'/([-a-z\d]+)/results/', string).group(1)
# trim numbers
name = re.search(r'[-a-z]+', name).group()
if name.endswith('-'):
    name = name[:-1]  # trim trailing `-` if needed
name = name.replace('-', ' ')
Can anyone see how to make it better?
This regex should do the work:
/(?:\/\w+){2}\/([\w\-]+)(?:-\d+){2}/
Explanation:
(?:\/\w+){2} - eat the first two words delimited by /
\/ - eat the next /
([\w\-]+) - match the word characters or hyphens (this is what we're looking for)
(?:-\d+){2} - eat the hyphens and the numbers after the part we're looking for
The result is in the first match group
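In Python the surrounding slashes are not needed, so the pattern can be used roughly as follows. This is only a sketch: the function name league_name is invented, and it assumes every path ends in two hyphen-separated numbers as in the samples.

```python
import re

# Two leading path segments, then the name, then two "-digits" suffixes.
PATTERN = re.compile(r"(?:/\w+){2}/([\w-]+)(?:-\d+){2}")


def league_name(path):
    """Return the middle part of the path with hyphens replaced by spaces."""
    match = PATTERN.search(path)
    return match.group(1).replace("-", " ") if match else None


for path in [
    "/soccer/poland/ekstraklasa-2008-2009/results/",
    "/soccer/poland/orange-ekstraklasa-2007-2008/results/",
    "/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/",
]:
    print(league_name(path))
```

The greedy name group first swallows the years and then backtracks until the two numeric suffixes can match, which is why no separate trimming step is needed.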
I can't test it because I'm not using Python, but I would use an expression like
^(/soccer/poland/)([a-z\-]*)(.*)$
or
^(/[a-z]*/[a-z]*/)([a-z\-]*)(.*)$
This expression matches "/soccer/poland/" at the beginning, then any run of lowercase a to z or "-", and then the rest of the string.
Then take the 2nd group.
The groups should hold these strings:
/soccer/poland/
orange-ekstraklasa-youth-
2010-2011/results/
Then simply replace "-" with " " and trim the spaces.
PS: If you're using regex101.com, for example, you need to escape / and test against a single line of input.
Expression:
^(\/soccer\/poland\/)([a-z\-]*)(.*)$
And one line of your input:
/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/
If you prefer to use the expression for more than just soccer and poland, use
^(\/[a-z]*\/[a-z]*\/)([a-z\-]*)(.*)$
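Since the answer above was untested, here is a hedged Python check of the three-group approach. The function name league_name_from_groups is invented for this sketch, and str.strip() stands in for the trim step.

```python
import re

# Group 1: two leading path segments; group 2: lowercase letters and hyphens;
# group 3: the rest of the string.
PATTERN = re.compile(r"^(/[a-z]*/[a-z]*/)([a-z-]*)(.*)$")


def league_name_from_groups(path):
    """Take group 2, turn hyphens into spaces and trim the trailing space."""
    match = PATTERN.match(path)
    if match is None:
        return None
    return match.group(2).replace("-", " ").strip()


print(PATTERN.match("/soccer/poland/orange-ekstraklasa-youth-2010-2011/results/").groups())
```

Group 2 stops at the first digit, which leaves a trailing hyphen, so the final trim is required.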

Regular expression to find position of the last alpha character that is followed by a space?

I am using ColdFusion 10. I rarely need to use regular expression and really need some help.
I have some lengthy content (up to 8,000 characters) and want to create a teaser. After a certain length (which I will define elsewhere), I want to find the last alpha character that is followed by a space. I will remove everything after that character. I will then add the ellipsis (...)
MyString = "The lazy brown fox is not a dog."
In this case, I would delete everything after the "a" that precedes "dog".
MyString = "There are 123 boxes on up the hill, says that 612 guy."
In this case, I would delete everything after the "that" that precedes "612 ".
MyString = "I fell down the stairs on June 30th, 1962."
In this case, I would delete everything after the "June" that precedes "30th".
What regular expression would I use to find the position of the last alpha [a-zA-Z] character that is followed by a space?
MyReg = "";
LastPosition = reFindNoCase(MyReg, MyString);
I'm not sure about REFindNoCase, but I think you can try REReplaceNoCase. I hope that CF supports backreferences like most regex engines do:
REReplaceNoCase(MyString, "(.*\b[a-zA-Z]+\b)\s.*", "$1", "ALL");
EDIT: for the backreference, it appears that you use the backslash instead of the dollar sign:
REReplaceNoCase(MyString, "(.*\b[a-zA-Z]+\b)\s.*", "\1", "ALL");
And if it goes well, you should have something like this.
.* matches anything besides a newline character, \b matches word boundaries, [a-zA-Z]+ matches alphabetic characters and \s matches the space just after them.
The greediness of the first .* is exploited here to capture as much as possible up to the last word that is followed by a space.
And I guess you can add the ellipsis after the \1 like so:
REReplaceNoCase(MyString, "(.*\b[a-zA-Z]+\b)\s.*", "\1 (...)", "ALL")
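The greedy-match idea can be tried out in Python against the three sample sentences from the question. This is a sketch in a different language than ColdFusion. The function name teaser and the plain "..." ellipsis are assumptions of this example.

```python
import re

# Greedy prefix ending in a purely alphabetic word, then one whitespace, then the rest.
PATTERN = re.compile(r"(.*\b[a-zA-Z]+\b)\s.*")


def teaser(text, ellipsis="..."):
    """Cut after the last alphabetic word that is followed by a space."""
    match = PATTERN.match(text)
    return match.group(1) + ellipsis if match else text


print(teaser("The lazy brown fox is not a dog."))
print(teaser("There are 123 boxes on up the hill, says that 612 guy."))
print(teaser("I fell down the stairs on June 30th, 1962."))
```

Because the leading .* is greedy, the engine tries the rightmost candidate first and only backs up when the word before the space contains digits or punctuation.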
If you only want to use REFind(), you could maybe use this:
REFindNoCase("[A-Za-z](?:\s\d+|\w+,)*\s[^\s]+\.$", MyString);
Note that I haven't tested this against every possible scenario. I tried a few inputs that don't work with the pattern above but do work with this one:
REFindNoCase("[A-Za-z](?:\s\d+|\s?\w+[,.-]+)*\s[^\s]+[.\s]*$", MyString);
And those are the few test subjects: link.
REFind will give you the position of the last alpha character. You can add 1 to get the position of the space in the original string.
If you're dealing with long strings, a regex would need to scan the whole string to get to the end, and it's likely more efficient to instead start at the end and work backwards.
Like this:
LastPos = len(String);
while( LastPos > 1 )
{
    LastPos = String.lastIndexOf(' ',LastPos-1);
    if ( mid(String,LastPos,1).matches('[a-zA-Z]') )
        break;
}
NewString = left(String,LastPos);
The idea is to keep stepping backwards finding spaces, and break the loop when the previous character is a letter (or the start of the string is reached).
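The same backwards scan can be sketched in Python without any regex. The function name cut_before_last_letter_space is made up, and returning an empty string when no suitable space exists is a choice of this sketch, not something the original code specifies.

```python
import string


def cut_before_last_letter_space(text):
    """Scan spaces from the end and cut at the first one preceded by an ASCII letter."""
    end = len(text)
    while True:
        space = text.rfind(" ", 0, end)  # last space strictly before position end
        if space <= 0:
            return ""  # no space with a character in front of it
        if text[space - 1] in string.ascii_letters:
            return text[:space]  # keep everything up to and including the letter
        end = space  # keep looking further left


print(cut_before_last_letter_space("I fell down the stairs on June 30th, 1962."))
```

Only the tail of the string is examined, so the cost depends on how far back the first qualifying space is and not on the total length.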
If you really want a regex solution, just do:
NewString = rematch('.*[a-zA-Z] ',MyString)[1];
To get the position, you do len(NewString).
(If newlines are involved, you'd need to put (?s) at the start of the expression so that the dot matches them.)

Capturing a repeated group

I am attempting to parse a string like the following using a .NET regular expression:
H3Y5NC8E-TGA5B6SB-2NVAQ4E0
and return the following using Split:
H3Y5NC8E
TGA5B6SB
2NVAQ4E0
I validate each character against a specific character set (note that the letters 'I', 'O', 'U' & 'W' are absent), so using string.Split is not an option. The number of characters in each group can vary and the number of groups can also vary. I am using the following expression:
([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8}-?){3}
This will match exactly 3 groups of 8 characters each. Any more or less will fail the match.
This works insofar as it correctly matches the input. However, when I use the Split method to extract each character group, I just get the final group. RegexBuddy complains that I have repeated the capturing group itself and that I should put a capture group around the repeated group. However, none of my attempts to do this achieve the desired result. I have been trying expressions like this:
(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){4}
But this does not work.
Since I generate the regex in code, I could just expand it out by the number of groups, but I was hoping for a more elegant solution.
Please note that the character set does not include the entire alphabet. It is part of a product activation system. As such, any characters that can be accidentally interpreted as numbers or other characters are removed. e.g. The letters 'I', 'O', 'U' & 'W' are not in the character set.
The hyphens are optional since a user does not need to type them in, but they can be there if the user has done a copy & paste.
BTW, you can replace the [ABCDEFGHJKLMNPQRSTVXYZ0123456789] character class with a more readable one that uses .NET character class subtraction.
[A-Z\d-[IOUW]]
If you just want to match 3 groups like that, why don't you use this pattern 3 times in your regex and just use the captured subgroups 1, 2 and 3 to form the new string?
([A-Z\d-[IOUW]]{8})-([A-Z\d-[IOUW]]{8})-([A-Z\d-[IOUW]]{8})
In PHP I would return (I don't know .NET)
return "$1 $2 $3";
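For comparison, Python's standard re module has no class subtraction syntax, but a similar effect can be had by building the character set in code. This is a sketch in Python, not .NET. The names EXCLUDED, CHARSET and BLOCK are invented for the example.

```python
import re
import string

# Start from A-Z plus 0-9 and drop the letters that are easy to misread.
EXCLUDED = set("IOUW")
CHARSET = "".join(
    c for c in string.ascii_uppercase + string.digits if c not in EXCLUDED
)

# One block of exactly eight allowed characters.
BLOCK = re.compile(rf"[{CHARSET}]{{8}}")

print(CHARSET)
```

Since the regex is generated in code anyway, deriving the class from an exclusion list keeps the intent visible.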
I have discovered the answer I was after. Here is my working code:
static void Main(string[] args)
{
    string pattern = @"^\s*((?<group>[ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?){3}\s*$";
    string input = "H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
    Regex re = new Regex(pattern);
    Match m = re.Match(input);
    if (m.Success)
        // Captures holds every value the repeated group matched, not just the last one
        foreach (Capture c in m.Groups["group"].Captures)
            Console.WriteLine(c.Value);
}
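The Captures collection is a .NET feature. As a point of comparison, Python's standard re module keeps only the last value of a repeated group, so a rough equivalent validates the whole key first and then collects the blocks with findall. The names KEY_RE, BLOCK_RE and split_key are assumptions of this sketch.

```python
import re

CHARSET = "ABCDEFGHJKLMNPQRSTVXYZ0123456789"
# Whole key: three blocks of eight allowed characters, each optionally followed by a hyphen.
KEY_RE = re.compile(rf"\s*(?:([{CHARSET}]{{8}})-?){{3}}\s*")
# A single block, used to collect the pieces once the key is known to be valid.
BLOCK_RE = re.compile(rf"[{CHARSET}]{{8}}")


def split_key(key):
    """Return the list of blocks for a valid key, or None for an invalid one."""
    match = KEY_RE.fullmatch(key)
    if match is None:
        return None
    # match.group(1) would contain only the last block here.
    return BLOCK_RE.findall(key)


print(split_key("H3Y5NC8E-TGA5B6SB-2NVAQ4E0"))
```

Validating first matters because findall on its own would happily return blocks from a key that has the wrong number of groups.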
After reviewing your question and the answers given, I came up with this:
RegexOptions options = RegexOptions.None;
Regex regex = new Regex(@"([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})", options);
string input = @"H3Y5NC8E-TGA5B6SB-2NVAQ4E0";
MatchCollection matches = regex.Matches(input);
for (int i = 0; i != matches.Count; ++i)
{
    string match = matches[i].Value;
}
Since the "-" is optional, you don't need to include it. I am not sure what you were using the {4} at the end for. This finds the matches you want; then, using the MatchCollection, you can access each match to rebuild the string.
Why use Regex? If the groups are always split by a -, can't you use Split()?
Sorry if this isn't what you intended, but if your string always has the hyphen separating the groups, then instead of using regex couldn't you use the String.Split() method?
Dim stringArray As Array = someString.Split("-")
What are the defining characteristics of a valid block? We'd need to know that in order to really be helpful.
My generic suggestion: validate the charset in a first step, then split and parse in a separate method based on what you expect. If this is in a web site/app, you can use the ASP Regex validation on the front end and then break it up on the back end.
If you're just checking the value of the group, with group(i).value, then you will only get the last one. However, if you want to enumerate over all the times that group was captured, use group(2).captures(i).value, as shown below.
system.text.RegularExpressions.Regex.Match("H3Y5NC8E-TGA5B6SB-2NVAQ4E0","(([ABCDEFGHJKLMNPQRSTVXYZ0123456789]+)-?)*").Groups(2).Captures(i).Value
Mike,
You can use the character set of your choice inside the character group. All you need is to add the "+" modifier to capture all groups. See my previous answer, just change [A-Z0-9] to whatever you need (i.e. [ABCDEFGHJKLMNPQRSTVXYZ0123456789])
You can use this pattern (with a fixed {8} length, no extra "+" is needed):
Regex.Split("H3Y5NC8E-TGA5B6SB-2NVAQ4E0", "([ABCDEFGHJKLMNPQRSTVXYZ0123456789]{8})-?")
But you will need to filter out empty strings from resulting array.
Quote from MSDN:
If multiple matches are adjacent to one another, an empty string is inserted into the array.
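Python's re.split behaves in a comparable way when the pattern contains a capturing group: the captured blocks are returned, interleaved with empty strings for the gaps between adjacent matches. A minimal sketch, using the hypothetical helper name split_blocks:

```python
import re

CHARSET = "ABCDEFGHJKLMNPQRSTVXYZ0123456789"
SPLITTER = re.compile(rf"([{CHARSET}]{{8}})-?")


def split_blocks(key):
    """Split on the block pattern and drop the empty strings between matches."""
    return [part for part in SPLITTER.split(key) if part]


print(split_blocks("H3Y5NC8E-TGA5B6SB-2NVAQ4E0"))
```

Note that any text the pattern does not match also survives the filter, so this approach does not validate the key on its own.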