Can i understand why the str.startswith() is not dealing with Regex :
col1
0 country
1 Country
i.e : df.col1.str.startswith('(C|c)ountry')
it returns all the values False :
col1
0 False
1 False
Series.str.startswith does not accept regex because it is intended to behave similarly to str.startswith in vanilla Python, which does not accept regex. The alternative is to use a regex match (as explained in the docs):
df.col1.str.contains('^[Cc]ountry')
The character class [Cc] is probably a better way to match C or c than (C|c), unless of course you need to capture which letter is used. In this case you can do ([Cc]).
Series.str.startswith does not accept regexes. Use Series.str.match instead:
df.col1.str.match(r'(C|c)ountry', as_indexer=True)
Output:
0 True
1 True
Name: col1, dtype: bool
Related
I need to extract IDs from a string : 1;#ChapitreA;#2;#ChapitreB;
Here, IDs are 1 and 2
I tried (^|;#)([0-9]?)(;) but it returns too many results => Here
Or behind :
0-2 1;
0-0 null
0-1 1
1-2 ;
12-16 ;#2;
12-14 ;#
14-15 2
15-16 ;
Constraint : I cannot use groups as this RegEx will be use in Nintex Workflow which doesn't support it. By the way I can't use any programming language neither to manage the result.. I need the RegEx to return the exact result I need (IDs) and nothing more.
Solution : (?:(?<=^)|(?<=;#))\d+(?=;)
Using groups : (?:^|;#)(\d+);
Without using groups : (?:(?<=^)|(?<=;#))\d+(?=;)
Thx #wiktor-stribiżew
I have a large list (a table with one field) of non-standardized strings, imported from a poorly managed legacy database. I need to extract the single-digit number (surrounded by spaces) that occurs exactly once in each of those strings (though the strings have other multi-digit numbers sometimes too). For example, from the following string:
"Quality Assurance File System And Records Retention Johnson, R.M. 004 4 2999 ss/ds/free ReviewMo = Aug Effective 1/31/2012 FileOpen-?"
I would want to pull the number 4 (or 4's position in the string, i.e. 71)
I can use
WHERE rsLegacyList.F1 LIKE "* # *"
inside a select statement to find if each string has a lone digit, and thereby filter my list. But it doesn't tell me where the digit is so I can extract the digit itself (with mid() function) and start sorting the list. The goal is to create a second field with that digit by itself as a method of sorting the larger strings in the first field.
Is there a way to use Instr() along with regular expressions to find where a regular expression occurs within a larger string? Something like
intMarkerLocation = instr(rsLegacyList.F1, Like "* # *")
but that actually works?
I appreciate any suggestions, or workarounds that avoid the problem entirely.
#Lee Mac, I made a function RegExFindStringIndex as shown here:
Public Function RegExFindStringIndex(strToSearch As String, strPatternToMatch As String) As Integer
Dim regex As RegExp
Dim Matching As Match
Set regex = New RegExp
With regex
.MultiLine = False
.Global = True
.IgnoreCase = False
.Pattern = strPatternToMatch
Matching = .Execute(strToSearch)
RegExFindStringIndex = Matching.FirstIndex
End With
Set regex = Nothing
Set Matching = Nothing
End Function
But it gives me an error Invalid use of property at line Matching = .Execute(strToSearch)
Using Regular Expressions
If you were to use Regular Expressions, you would need to define a VBA function to instantiate a RegExp object, set the pattern property to something like \s\d\s (whitespace-digit-whitespace) and then invoke the Execute method to obtain a match (or matches), each of which will provide an index of the pattern within the string. If you want to pursue this route, here are some existing examples for Excel, but the RegExp manipulation will be identical in MS Access.
Here is an example function demonstrating how to use the first result returned by the Execute method:
Public Function RegexInStr(strStr As String, strPat As String) As Integer
With New RegExp
.Multiline = False
.Global = True
.IgnoreCase = False
.Pattern = strPat
With .Execute(strStr)
If .Count > 0 Then RegexInStr = .Item(0).FirstIndex + 1
End With
End With
End Function
Note that the above uses early binding and so you will need to add a reference to the Microsoft VBScript Regular Expressions 5.5 library to your project.
Example Immediate Window evaluation:
?InStr("abc 1 123", " 1 ")
4
?RegexInStr("abc 1 123", "\s\w\s")
4
Using InStr
An alternative using the in-built instr function within a query might be the following inelegant (and probably very slow) query:
select
switch
(
instr(rsLegacyList.F1," 0 ")>0,instr(rsLegacyList.F1," 0 ")+1,
instr(rsLegacyList.F1," 1 ")>0,instr(rsLegacyList.F1," 1 ")+1,
instr(rsLegacyList.F1," 2 ")>0,instr(rsLegacyList.F1," 2 ")+1,
instr(rsLegacyList.F1," 3 ")>0,instr(rsLegacyList.F1," 3 ")+1,
instr(rsLegacyList.F1," 4 ")>0,instr(rsLegacyList.F1," 4 ")+1,
instr(rsLegacyList.F1," 5 ")>0,instr(rsLegacyList.F1," 5 ")+1,
instr(rsLegacyList.F1," 6 ")>0,instr(rsLegacyList.F1," 6 ")+1,
instr(rsLegacyList.F1," 7 ")>0,instr(rsLegacyList.F1," 7 ")+1,
instr(rsLegacyList.F1," 8 ")>0,instr(rsLegacyList.F1," 8 ")+1,
instr(rsLegacyList.F1," 9 ")>0,instr(rsLegacyList.F1," 9 ")+1,
true, null
) as intMarkerLocation
from
rsLegacyList
where
rsLegacyList.F1 like "* # *"
How about:
select
instr(rsLegacyList.F1, " # ") + 1 as position
from rsLegacyList.F1
where rsLegacyList.F1 LIKE "* # *"
Right now, Regex say is valid a date if I have 200011 - Which is Jan 1st 2000
but i want to restrict that to have the format YYYYMMDD so it will accept only 20000101 as a valid date. How can I achieve this?
My code:
^(?:(?:(?:(?:(?:[1-9]\d)(?:0[48]|[2468][048]|[13579][26])|(?:(?:[2468][048]|[13579][26])00))([-\/.]?)(?:0?2\1(?:29)))|(?:(?:[1-9]\d{3})([-\/.]?)(?:(?:(?:0?[13578]|1[02])\2(?:31))|(?:(?:0?[13-9]|1[0-2])\2(?:29|30))|(?:(?:0?[1-9])|(?:1[0-2]))\2(?:0?[1-9]|1\d|2[0-8])))))$
You need to remove ? after all 0s:
^(?:(?:(?:(?:(?:[1-9]\d)(?:0[48]|[2468][048]|[13579][26])|(?:(?:[2468][048]|[13579][26])00))([-\/.]?)(?:02\1(?:29)))|(?:(?:[1-9]\d{3})([-\/.]?)(?:(?:(?:0[13578]|1[02])\2(?:31))|(?:(?:0[13-9]|1[0-2])\2(?:29|30))|(?:(?:0[1-9])|(?:1[0-2]))\2(?:0[1-9]|1\d|2[0-8])))))$
See the regex demo
For example, the last 0?[1-9] would match 0 one or zero times, and then a non-zero digit. When you remove ? quantifier, the 0 will become required.
I have the following line :
3EAM7A 1 3 EI AMANDINE MRV SHP 70 W 0 SH3-A1 1 SHP 70W OVOIDE AI E27 SON PIA PLUS
I'd like to get the string : EI AMANDINE MRV SHP 70 W. So I decided to select the strings between 1 (can also be 2, 3 or 99) and 0 (can also be 1, 2, 3, 4 or 5).
I tried :
(0|1|2|3|99)(.*)(0|1|2|3|4|5)
But I have this result :
EAM7A 1 3 EI AMANDINE MRV SHP 70 W 0 SH3-A1 1 SHP 70W OVOIDE AI E
that is not what I want to obtain.
Do you have an idea in regex to make that selection work ?
Thanks !
You were pretty close! Try this:
\b(?:0|1|2|3|99) ([^0|1|2|3|99].*?) (?:0|1|2|3|4|5)\b
Regex101
I think that you want to match "word" 4 to 9?
Your desired match will be in group 1
^(\S+\s){3}((\S+\s){6})
Enable the multiline option if you have a whole file of subject strings.
You can try with:
\s(?:[0-3]|99)\s([A-Z].*?)\b(?:[0-5])\b
DEMO
and get string by group $1. Or if your language support look around, try:
(?<=\s[0-3]\s|99)[A-Z].+?(?=\s[0-5]\s)
DEMO
to get match directly.
Another solution that is based on matching all initial space + digit sequences:
\b(?:(?:[0-3]|99)\b\s*)+(.*?)\s*\b(?:[0-5])\b
See demo
The result is in Group 1.
With \b(?:(?:[0-3]|99)\b\s*)+ the rightmost number from the allowed leading set is picked.
You can use following regex :
(?:(?:[0-3]|99)\s)+(.*?)\s(?:[0-5])\s
See demo https://regex101.com/r/iX6oE1/6
Also note that for matching a range of number you can use a character class instead of multiple OR.
I've written a regex whose job is to return all matches to its three alternative capture groups. My goal is to learn which capture group produced each match. PCRE seems able to produce that information. But I haven't yet been able coerce the TRegEx class in Delphi XE8 to yield meaningful capture group info for matches. I can't claim to be at the head of regex class, and TRegEx is new to me, so who knows what errors I'm making.
The regex (regex101.com workpad) is:
(?'word'\b[a-zA-Z]{3,}\b)|(?'id'\b\d{1,3}\b)|(?'course'\b[BL]\d{3}\b)
This test text:
externship L763 clinic 207 B706 b512
gives five matches in test environments. But a simple test program that walks the TGroupCollection of each TMatch in the TMatchCollection shows bizarre results about groups: all matches have more than one group (2, 3 or 4) with each group's Success true, and often the matched text is duplicated in several groups or is empty. So this data structure (below) isn't what I'd expect:
Using TRegEx
Regex: (?'word'\b[a-zA-Z]{3,}\b)|(?'id'\b\d{1,3}\b)|(?'course'\b[BL]\d{3}\b)
Text: externship L763 clinic 207 B706 b512
5 matches
'externship' with 2 groups:
length 10 at 1 value 'externship' (Sucess? True)
length 10 at 1 value 'externship' (Sucess? True)
'L763' with 4 groups:
length 4 at 12 value 'L763' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 4 at 12 value 'L763' (Sucess? True)
'clinic' with 2 groups:
length 6 at 17 value 'clinic' (Sucess? True)
length 6 at 17 value 'clinic' (Sucess? True)
'207' with 3 groups:
length 3 at 24 value '207' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 3 at 24 value '207' (Sucess? True)
'B706' with 4 groups:
length 4 at 28 value 'B706' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 4 at 28 value 'B706' (Sucess? True)
My simple test runner is this:
program regex_tester;
{$APPTYPE CONSOLE}
{$R *.res}
uses
System.SysUtils,
System.RegularExpressions,
System.RegularExpressionsCore;
var
Matched : Boolean;
J : integer;
Group : TGroup;
Match : TMatch;
Matches : TMatchCollection;
RegexText,
TestText : String;
RX : TRegEx;
RXPerl : TPerlRegEx;
begin
try
RegexText:='(?''word''\b[a-zA-Z]{3,}\b)|(?''id''\b\d{1,3}\b)|(?''course''\b[BL]\d{3}\b)';
TestText:='externship L763 clinic 207 B706 b512';
RX:=TRegex.Create(RegexText);
Matches:=RX.Matches(TestText);
Writeln(Format(#10#13#10#13'Using TRegEx'#10#13'Regex: %s'#10#13'Text: %s'#10#13,[RegexText, TestText]));
Writeln(Format('%d matches', [Matches.Count]));
for Match in Matches do
begin
Writeln(Format(' ''%s'' with %d groups:', [Match.Value,Match.Groups.Count]));
for Group in Match.Groups do
Writeln(Format(#9'length %d at %d value ''%s'' (Sucess? %s)', [Group.Length,Group.Index,Group.Value,BoolToStr(Group.Success, True)]));
end;
RXPerl:=TPerlRegEx.Create;
RXPerl.Subject:=TestText;
RXPerl.RegEx:=RegexText;
Writeln(Format(#10#13#10#13'Using TPerlRegEx'#10#13'Regex: %s'#10#13'Text: %s'#10#13,[RXPerl.Regex, RXPerl.Subject]));
Matched:=RXPerl.Match;
if Matched then
repeat
begin
Writeln(Format(' ''%s'' with %d groups:', [RXPerl.MatchedText,RXPerl.GroupCount]));
for J:=1 to RXPerl.GroupCount do
Writeln(Format(#9'length %d at %d, value ''%s''',[RXPerl.GroupLengths[J],RXPerl.GroupOffsets[J],RXPerl.Groups[J]]));
Matched:=RXPerl.MatchAgain;
end;
until Matched=false;
except
on E: Exception do
Writeln(E.ClassName, ': ', E.Message);
end;
end.
I'd certainly appreciate a nudge in the right direction. If TRegEx is broken, I can of course use an alternative -- or I can let go of the perceived elegance of the solution, instead using three simpler tests to find the bits of info I need.
Added Information and Interpretation
As #andrei-galatyn notes, TRegEx uses TPerlRegEx for its work. So I added a section to my testing program (output below) where I experiment with that, too. It isn't as convenient to use as TRegEx, but its result is what it should be -- and without the problems of TRegEx's broken TGroup data structures. Whichever class I use, the last group's index (less 1 for TRegEx) is the capturing group I want.
Along the way I was reminded that Pascal arrays are often based on 1 rather than 0.
Using TPerlRegEx
Regex: (?'word'\b[a-zA-Z]{3,}\b)|(?'id'\b\d{1,3}\b)|(?'course'\b[BL]\d{3}\b)
Text: externship L763 clinic 207 B706 b512
'externship' with 1 groups:
length 10 at 1, value 'externship'
'L763' with 3 groups:
length 0 at 1, value ''
length 0 at 1, value ''
length 4 at 12, value 'L763'
'clinic' with 1 groups:
length 6 at 17, value 'clinic'
'207' with 2 groups:
length 0 at 1, value ''
length 3 at 24, value '207'
'B706' with 3 groups:
length 0 at 1, value ''
length 0 at 1, value ''
length 4 at 28, value 'B706'
Internally Delphi uses class TPerlRegEx and it has such description for GroupCount property:
Number of matched groups stored in the Groups array. This number is the number of the highest-numbered capturing group in your regular expression that actually participated in the last match. It may be less than the number of capturing groups in your regular expression.
E.g. when the regex "(a)|(b)" matches "a", GroupCount will be 1. When the same regex matches "b", GroupCount will be 2.
TRegEx class always adds one more group (for whole expression i guess).
In your case it should be enough to check every match like this:
case Match.Groups.Count-1 of
1: ; // "word" found
2: ; // "id" found
3: ; // "course" found
end;
It doesn't answer why Groups are filled with strange data, indeed it seems to be enough to answer your question. :)
I realize this is 4 years too late, but for others who may search for this, here's what I found (in Delphi RIO at least, I have not tested earlier versions, however the source URL below says as of XE), however:
From https://www.regular-expressions.info/delphi.html (important bit is bold italic)
The TMatch record provides several properties with details about the
match. Success indicates if a match was found. If this is False, all
other properties and methods are invalid. Value returns the matched
string. Index and Length indicate the position in the input string and
the length of the match. Groups returns a TGroupCollection record that
stores a TGroup record in its default Item[] property for each
capturing group. You can use a numeric index to Item[] for numbered
capturing groups, and a string index for named capturing groups.
So, if you NAME your matching group like this for example:
(?'MatchName'[^\/\?]*)
Which would match a string up until a / or ? character, you can then reference that capture group by name like this:
Matches := TRegEx.Matches(StrToMatch, RegexString, [ roExplicitCapture ] );
for Match in Matches do begin
// ... maybe some logic here to determine if this is the match you want ...
try
whatGotMatched := Match.Groups['MatchName'].Value;
except
whatGotMatched := '';
end;
end;
How exactly you get to your match groups in the Matches is dependent on how you structure your RegEx etc, how many matches you create, etc. I surround the Match.Groups['MatchName'].Value in a try-except block because if the match wasn't found, you will generate an index out of bounds error when accessing .Value since it will still say the match would have been at character X, for 0 characters. You could also check Match.Groups['MatchName'].Length for 0 before trying to access .Value to probably better results...but if you reference it by name, you don't have to decipher specifically what groups matched what, etc ... just ask for the named match you wanted, and if it matched, you will get it.
Edit: they reference to .Length to help verify before throwing an error about index out of bounds. However, if your search is constructed where a named search is not matched, and therefore not returned at all, you will get an error thrown for an invalid match name, so you may still may need to capture for that.