Pandas replace any integer with string using regex

I'm struggling with something that is probably super basic: I'm trying to replace some integers with a string (using pandas and regex).
import pandas as pd
test = pd.DataFrame([14,5,3,2345])
test2 = test.replace('\d', 'TRUE', regex=True)
test2
When I run that, I expect to see: TRUE TRUE TRUE TRUE, but instead I see exactly the same list:
test2
Out[93]:
0
0 14
1 5
2 3
3 2345
Am I missing something? I thought '\d' is any numerical character?

You need to cast the data to string, because regex replacement only applies to string values and this column holds integers. Then use the regex ^\d+$ to check that the whole string is composed of digits:
>>> test2 = test.astype(str).replace(r'^\d+$', 'TRUE', regex=True)
>>> test2
0
0 TRUE
1 TRUE
2 TRUE
3 TRUE
>>>
The ^ matches the start of string, \d+ matches 1 or more digits and $ matches the end of string.
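To see why the anchors matter once the values are strings, here is a minimal sketch using only Python's standard re module (no pandas). The names values, as_text, anchored and per_digit are made up for this illustration, and str() stands in for what astype(str) does in the answer above.

```python
import re

values = [14, 5, 3, 2345]

# Regex replacement works on text, so convert each integer first
# (this mirrors astype(str) in the pandas answer; it is not pandas itself).
as_text = [str(v) for v in values]

# Anchored pattern: the whole string must be digits, so it is replaced once.
anchored = [re.sub(r'^\d+$', 'TRUE', s) for s in as_text]

# Unanchored \d: every single digit is replaced separately.
per_digit = [re.sub(r'\d', 'TRUE', s) for s in as_text]

print(anchored)   # ['TRUE', 'TRUE', 'TRUE', 'TRUE']
print(per_digit)  # ['TRUETRUE', 'TRUE', 'TRUE', 'TRUETRUETRUETRUE']
```

So a bare \d would turn 14 into TRUETRUE, which is why the answer uses ^\d+$.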


Regex match between n and m numbers but as much as possible

I have a set of strings that contain some letters, an occasional single digit, and then somewhere a run of 2 or 3 digits. I need to match that run of 2 or 3 digits.
I have this:
\w*(\d{2,3})\w*
but then for strings like
AAA1AAA12A
AAA2AA123A
it matches '12' and '23' respectively, i.e. it fails to pick the three digits in the second case.
How do I get those 3 digits?
Here is how you would do it in Java.
The regex simply matches a group of 2 or 3 digits that has a non-digit on each side.
The while loop uses find() to keep finding matches and prints each captured group. The 1 and the 1223 are ignored.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "AAA1AAA12Aksk2ksksk21sksksk123ksk1223sk";
String regex = "\\D(\\d{2,3})\\D";
Matcher m = Pattern.compile(regex).matcher(s);
while (m.find()) {
    System.out.println(m.group(1));
}
prints
12
21
123
Looks like the correct answer would be:
\w*?(\d{2,3})\w*
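Basically, making the preceding \w* lazy does the job: it then stops at the first place where 2 or 3 digits can start, instead of swallowing the leading digits.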
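The difference between the greedy and the lazy prefix can be checked with a short Python sketch on the two strings from the question. The variable names are illustrative only.

```python
import re

samples = ["AAA1AAA12A", "AAA2AA123A"]

# Greedy prefix: \w* first consumes everything, then backtracks only as far
# as needed for \d{2,3} to match, so it can leave just two digits.
greedy = re.compile(r'\w*(\d{2,3})\w*')

# Lazy prefix: \w*? grows one character at a time, so \d{2,3} starts at the
# first position where a run of at least two digits begins.
lazy = re.compile(r'\w*?(\d{2,3})\w*')

greedy_hits = [greedy.search(s).group(1) for s in samples]
lazy_hits = [lazy.search(s).group(1) for s in samples]

print(greedy_hits)  # ['12', '23']
print(lazy_hits)    # ['12', '123']
```

Note that the lazy version still captures at most three digits, so a longer run such as 1223 would yield its first three digits rather than being skipped.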

Regex substitution: getting half of the string

I have another question about regex. The requirement is quite simple:
Given strings of even length:
12
1234
123456
12345678
abcdef
Write a substitution regex to get the first half of each string.
After substitution:
1
12
123
1234
abc
I'm using PCRE, which supports recursion and control verbs.
I tried something like this but it's not working :(
s/^(?=(.))(?:((?1))(?1))+$/$2/mg
Here's the test subject on regex101
Is it possible? How can I achieve this?
I'm pretty sure this is not the most elegant solution, but it does work:
>>> import re
>>> def half(string):
...     # Build a pattern that captures exactly len(string) // 2 characters.
...     regex = re.compile(r"(.{%d})" % (len(string) // 2))
...     return regex.search(string).group(1)
...
>>> half("12")
'1'
>>> half("1234")
'12'
>>> half("123456")
'123'
>>> half("12345678")
'1234'
>>> half("abcdef")
'abc'

Regex Regular Expression for Number Range Check between -2 and 5

I've got the following regular expression shown below which checks the range between 0 and 5.
^(([0-4])+(.\d)?)|((5)+(.0)?)$
I will need to change it so it checks the range between -2 to 5 instead but am not familiar with this code.
Can I get someone to make a quick change so the expression will then check between -2 and 5 please.
Thank you very much!
I understand that we are to match integers and floats with one decimal digit that fall between -2 and 5.
You can use the following regular expression with virtually all languages (some engines spell the \A and \z anchors differently, so adjust those as needed).
r = /\A(?:-2(?:\.0)?|-1(?:\.\d)?|-0\.[1-9]|[0-4](?:\.\d)?|5(?:\.0)?)\z/
Here I test it with Ruby.
'-2'.match? r #=> true
'-2.0'.match? r #=> true
'-1.8'.match? r #=> true
'-1.0'.match? r #=> true
'-1'.match? r #=> true
'-0.3'.match? r #=> true
'0'.match? r #=> true
'0.0'.match? r #=> true
'0.3'.match? r #=> true
'3'.match? r #=> true
'3.3'.match? r #=> true
'5'.match? r #=> true
'5.0'.match? r #=> true
'-2.1'.match? r #=> false
'-1.85'.match? r #=> false
'-0'.match? r #=> false
'-0.0'.match? r #=> false
'0.34'.match? r #=> false
'5.1'.match? r #=> false
This should work for all numbers from -2.0 to 5.0 inclusive which have 0 or 1 digit after the decimal place:
^((-[0-1]|[0-4])(\.\d)?|(-2|5)(\.0)?)$
Test cases: https://regex101.com/r/bInayE/2/
If you want to allow any number of digits after the decimal place (ex. 2.157):
^((-[0-1]|[0-4])(\.\d+)?|(-2|5)(\.0+)?)$
Test cases: https://regex101.com/r/zp01M6/3/
In case you want to be able to update these in the future:
[0-1] refers to the range of digits which can be preceded by a minus sign and have digits after a decimal point
[0-4] is the range of digits which can have digits after a decimal point without being preceded by a minus sign
(-2|5) are the numbers which can only be followed by .0
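As a quick check of the one-decimal pattern above, here is a small Python sketch. The pattern is copied from this answer, while the helper name matches_range and the sample lists are assumptions made for the illustration.

```python
import re

# Pattern copied from the answer above (at most one digit after the point).
RANGE_RE = re.compile(r'^((-[0-1]|[0-4])(\.\d)?|(-2|5)(\.0)?)$')

def matches_range(text):
    """Return True if text is accepted by the -2..5 pattern."""
    return RANGE_RE.match(text) is not None

accepted = ['-2', '-2.0', '-1.8', '-0.3', '0', '3.3', '5', '5.0']
rejected = ['-2.1', '-3', '5.1', '6', '0.34', '5f0']

print([matches_range(t) for t in accepted])  # all True
print([matches_range(t) for t in rejected])  # all False
```

Unlike the Ruby expression earlier in this thread, this pattern also accepts -0 and -0.0, because -[0-1] allows a minus sign in front of 0.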
Update:
Shortened both expressions by combining the -2 and 5 cases as suggested by @Robert McKee.
Added case for negative numbers starting with 0 as pointed out by @Cary Swoveland.
Regexes are an awkward way to check for ranges, so I would suggest you try validating ranges in a more traditional way. I remember doing something similar to validate IP addresses, and it got a lot more complicated than I expected.
With that said, the regex you pasted doesn't actually check for a 0-5 range. Because the . is not escaped, it matches any character, so you could get any char in there after a "5" and before a "0" (something like 5f0, for instance).
With a regex like the one below, you could validate that the input is in the set {-2,-1,0,1,2,3,4,5}. You may add ^ at the beginning and $ at the end so that the regex matches the whole line.
(-[1-2]|[0-5])
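The "more traditional" validation suggested in this answer could look roughly like the following Python sketch: parse the text as a number and compare it with the bounds. The function name in_numeric_range and its defaults are hypothetical; note also that float() accepts more spellings than the regexes in this thread (for example surrounding whitespace or exponent notation), so this is a looser check.

```python
def in_numeric_range(text, low=-2.0, high=5.0):
    """Return True if text parses as a number within [low, high]."""
    try:
        value = float(text)
    except ValueError:
        # Not a number at all, e.g. '5f0' or an empty string.
        return False
    # NaN compares False against everything, so it is rejected here too.
    return low <= value <= high

print(in_numeric_range('-2'))    # True
print(in_numeric_range('4.75'))  # True
print(in_numeric_range('5.1'))   # False
print(in_numeric_range('5f0'))   # False
```

Changing the range later then means changing two numbers rather than rewriting a pattern.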

pandas 'match' on non-ascii characters

I'm sure this is a very simple issue with regexps, but I'm trying to use str.match in pandas to match a non-ASCII character (the times sign). I expect the first match call to match the first row of the DataFrame, the second match call to match the last row, and the third to match the first and last rows. However, the first call does match but the second and third calls do not. Where am I going wrong?
Dataframe looks like (with x replacing the times sign, it actually prints as a ?):
Column
0 2x 32
1 42
2 64 x2
Pandas 0.20.3, python 2.7.13, OS X.
#!/usr/bin/env python
import pandas as pd
import re
html = '<table><thead><tr><th>Column</th></tr></thead><tbody><tr><td>2× 32</td></tr><tr><td>42</td></tr><tr><td>64 ×2</td></tr></tbody><table>'
df = pd.read_html(html)[0]
print df
print df[df['Column'].str.match(ur'^[2-9]\u00d7', re.UNICODE, na=False)]
print df[df['Column'].str.match(ur'\u00d7[2-9]$', re.UNICODE, na=False)]
print df[df['Column'].str.match(ur'\u00d7', re.UNICODE, na=False)]
Output I see (again with ? replaced with x):
Column
0 2x 32
Empty DataFrame
Columns: [Column]
Index: []
Empty DataFrame
Columns: [Column]
Index: []
Use contains():
df.Column.str.contains(r'^[2-9]\u00d7')
0 True
1 False
2 False
Name: Column, dtype: bool
df.Column.str.contains(r'\u00d7[2-9]$')
0 False
1 False
2 True
Name: Column, dtype: bool
df.Column.str.contains(r'\u00d7')
0 True
1 False
2 True
Name: Column, dtype: bool
Explanation: contains() uses re.search(), and match() uses re.match(). Since re.match() only matches at the beginning of a string, only your first case, which is anchored at the start with ^, works. In fact, in that case you don't need both match and ^:
df.Column.str.match(r'[2-9]\u00d7')
0 True
1 False
2 False
Name: Column, dtype: bool
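The match/search distinction described above can be reproduced without pandas. The sketch below uses plain re in Python 3 (where strings are Unicode by default, unlike the Python 2.7 setup in the question); the rows list simply mirrors the three cells of the example table.

```python
import re

# The three cell values from the question; \u00d7 is the times sign.
rows = ['2\u00d7 32', '42', '64 \u00d72']

# Pattern for "times sign followed by a digit 2-9 at the end of the string".
pattern = re.compile('\u00d7[2-9]$')

# re.match only tries the pattern at position 0 of each string.
match_hits = [bool(pattern.match(r)) for r in rows]

# re.search scans the string for the first position where the pattern fits.
search_hits = [bool(pattern.search(r)) for r in rows]

print(match_hits)   # [False, False, False]
print(search_hits)  # [False, False, True]
```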

Get which capture group matched a result using Delphi's TRegex

I've written a regex whose job is to return all matches to its three alternative capture groups. My goal is to learn which capture group produced each match. PCRE seems able to produce that information. But I haven't yet been able to coerce the TRegEx class in Delphi XE8 to yield meaningful capture group info for matches. I can't claim to be at the head of regex class, and TRegEx is new to me, so who knows what errors I'm making.
The regex (regex101.com workpad) is:
(?'word'\b[a-zA-Z]{3,}\b)|(?'id'\b\d{1,3}\b)|(?'course'\b[BL]\d{3}\b)
This test text:
externship L763 clinic 207 B706 b512
gives five matches in test environments. But a simple test program that walks the TGroupCollection of each TMatch in the TMatchCollection shows bizarre results about groups: all matches have more than one group (2, 3 or 4) with each group's Success true, and often the matched text is duplicated in several groups or is empty. So this data structure (below) isn't what I'd expect:
Using TRegEx
Regex: (?'word'\b[a-zA-Z]{3,}\b)|(?'id'\b\d{1,3}\b)|(?'course'\b[BL]\d{3}\b)
Text: externship L763 clinic 207 B706 b512
5 matches
'externship' with 2 groups:
length 10 at 1 value 'externship' (Sucess? True)
length 10 at 1 value 'externship' (Sucess? True)
'L763' with 4 groups:
length 4 at 12 value 'L763' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 4 at 12 value 'L763' (Sucess? True)
'clinic' with 2 groups:
length 6 at 17 value 'clinic' (Sucess? True)
length 6 at 17 value 'clinic' (Sucess? True)
'207' with 3 groups:
length 3 at 24 value '207' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 3 at 24 value '207' (Sucess? True)
'B706' with 4 groups:
length 4 at 28 value 'B706' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 4 at 28 value 'B706' (Sucess? True)
My simple test runner is this:
program regex_tester;
{$APPTYPE CONSOLE}
{$R *.res}
uses
  System.SysUtils,
  System.RegularExpressions,
  System.RegularExpressionsCore;
var
  Matched : Boolean;
  J : integer;
  Group : TGroup;
  Match : TMatch;
  Matches : TMatchCollection;
  RegexText,
  TestText : String;
  RX : TRegEx;
  RXPerl : TPerlRegEx;
begin
  try
    RegexText:='(?''word''\b[a-zA-Z]{3,}\b)|(?''id''\b\d{1,3}\b)|(?''course''\b[BL]\d{3}\b)';
    TestText:='externship L763 clinic 207 B706 b512';
    RX:=TRegex.Create(RegexText);
    Matches:=RX.Matches(TestText);
    Writeln(Format(#10#13#10#13'Using TRegEx'#10#13'Regex: %s'#10#13'Text: %s'#10#13,[RegexText, TestText]));
    Writeln(Format('%d matches', [Matches.Count]));
    for Match in Matches do
    begin
      Writeln(Format(' ''%s'' with %d groups:', [Match.Value,Match.Groups.Count]));
      for Group in Match.Groups do
        Writeln(Format(#9'length %d at %d value ''%s'' (Sucess? %s)', [Group.Length,Group.Index,Group.Value,BoolToStr(Group.Success, True)]));
    end;
    RXPerl:=TPerlRegEx.Create;
    RXPerl.Subject:=TestText;
    RXPerl.RegEx:=RegexText;
    Writeln(Format(#10#13#10#13'Using TPerlRegEx'#10#13'Regex: %s'#10#13'Text: %s'#10#13,[RXPerl.Regex, RXPerl.Subject]));
    Matched:=RXPerl.Match;
    if Matched then
      repeat
        begin
          Writeln(Format(' ''%s'' with %d groups:', [RXPerl.MatchedText,RXPerl.GroupCount]));
          for J:=1 to RXPerl.GroupCount do
            Writeln(Format(#9'length %d at %d, value ''%s''',[RXPerl.GroupLengths[J],RXPerl.GroupOffsets[J],RXPerl.Groups[J]]));
          Matched:=RXPerl.MatchAgain;
        end;
      until Matched=false;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.
I'd certainly appreciate a nudge in the right direction. If TRegEx is broken, I can of course use an alternative -- or I can let go of the perceived elegance of the solution, instead using three simpler tests to find the bits of info I need.
Added Information and Interpretation
As @andrei-galatyn notes, TRegEx uses TPerlRegEx for its work. So I added a section to my testing program (output below) where I experiment with that, too. It isn't as convenient to use as TRegEx, but its result is what it should be -- and without the problems of TRegEx's broken TGroup data structures. Whichever class I use, the last group's index (less 1 for TRegEx) is the capturing group I want.
Along the way I was reminded that Pascal arrays are often based on 1 rather than 0.
Using TPerlRegEx
Regex: (?'word'\b[a-zA-Z]{3,}\b)|(?'id'\b\d{1,3}\b)|(?'course'\b[BL]\d{3}\b)
Text: externship L763 clinic 207 B706 b512
'externship' with 1 groups:
length 10 at 1, value 'externship'
'L763' with 3 groups:
length 0 at 1, value ''
length 0 at 1, value ''
length 4 at 12, value 'L763'
'clinic' with 1 groups:
length 6 at 17, value 'clinic'
'207' with 2 groups:
length 0 at 1, value ''
length 3 at 24, value '207'
'B706' with 3 groups:
length 0 at 1, value ''
length 0 at 1, value ''
length 4 at 28, value 'B706'
Internally, Delphi's TRegEx uses the TPerlRegEx class, whose GroupCount property is described as follows:
Number of matched groups stored in the Groups array. This number is the number of the highest-numbered capturing group in your regular expression that actually participated in the last match. It may be less than the number of capturing groups in your regular expression.
E.g. when the regex "(a)|(b)" matches "a", GroupCount will be 1. When the same regex matches "b", GroupCount will be 2.
The TRegEx class always adds one more group (for the whole expression, I guess).
In your case it should be enough to check every match like this:
case Match.Groups.Count-1 of
  1: ; // "word" found
  2: ; // "id" found
  3: ; // "course" found
end;
This doesn't explain why Groups is filled with strange data, but it should be enough to answer your question. :)
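For comparison only, the same "which alternative matched" idea can be sketched in Python, which is not Delphi and does not use TRegEx: Python's match objects expose lastindex (the number of the last group that participated) and lastgroup (its name). The pattern is the one from the question rewritten with Python's (?P<name>...) syntax; treat this as an analogy to GroupCount, not as a description of the Delphi API.

```python
import re

# The question's three alternatives, using Python's named-group syntax.
PATTERN = re.compile(
    r'(?P<word>\b[a-zA-Z]{3,}\b)|(?P<id>\b\d{1,3}\b)|(?P<course>\b[BL]\d{3}\b)'
)
text = 'externship L763 clinic 207 B706 b512'

# For each match, record the text, the number of the last participating
# group, and that group's name.
results = [(m.group(0), m.lastindex, m.lastgroup) for m in PATTERN.finditer(text)]

for value, index, name in results:
    print(value, index, name)
# externship 1 word
# L763 3 course
# clinic 1 word
# 207 2 id
# B706 3 course
```

Because the three groups are top-level alternatives, only one of them participates in any given match, so the last participating group is the one that produced it.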
I realize this is 4 years too late, but for others who may search for this, here's what I found (in Delphi Rio at least; I have not tested earlier versions, though the source below says this applies as of XE).
From https://www.regular-expressions.info/delphi.html (the important bit is the last sentence, about string indexes for named groups):
The TMatch record provides several properties with details about the
match. Success indicates if a match was found. If this is False, all
other properties and methods are invalid. Value returns the matched
string. Index and Length indicate the position in the input string and
the length of the match. Groups returns a TGroupCollection record that
stores a TGroup record in its default Item[] property for each
capturing group. You can use a numeric index to Item[] for numbered
capturing groups, and a string index for named capturing groups.
So, if you NAME your matching group like this for example:
(?'MatchName'[^\/\?]*)
Which would match a string up until a / or ? character, you can then reference that capture group by name like this:
Matches := TRegEx.Matches(StrToMatch, RegexString, [ roExplicitCapture ] );
for Match in Matches do begin
  // ... maybe some logic here to determine if this is the match you want ...
  try
    whatGotMatched := Match.Groups['MatchName'].Value;
  except
    whatGotMatched := '';
  end;
end;
How exactly you get to your match groups depends on how you structure your regex, how many matches it produces, and so on. I wrap Match.Groups['MatchName'].Value in a try-except block because, if the group wasn't matched, accessing .Value raises an index out of bounds error, since the group is still reported as matching at character X for 0 characters. You could also check Match.Groups['MatchName'].Length for 0 before accessing .Value, which probably gives better results. Either way, if you reference the group by name, you don't have to work out which numbered group matched what: just ask for the named group you want, and if it matched, you will get it.
Edit: checking .Length first helps avoid the index out of bounds error. However, if your regex is constructed so that a named group is not matched, and is therefore not returned at all, you will get an error for an invalid match name, so you may still need to catch that.