pandas 'match' on non-ascii characters - regex

I'm sure this is a very simple issue with regexps, but I'm trying to use str.match in pandas to match a non-ASCII character (the times sign). I expect the first match call to match the first row of the DataFrame, the second to match the last row, and the third to match the first and last rows. However, the first call does match, but the second and third calls do not. Where am I going wrong?
The DataFrame looks like this (with x standing in for the times sign, which actually prints as a ?):
Column
0 2x 32
1 42
2 64 x2
Pandas 0.20.3, python 2.7.13, OS X.
#!/usr/bin/env python
import pandas as pd
import re
html = '<table><thead><tr><th>Column</th></tr></thead><tbody><tr><td>2× 32</td></tr><tr><td>42</td></tr><tr><td>64 ×2</td></tr></tbody></table>'
df = pd.read_html(html)[0]
print df
print df[df['Column'].str.match(ur'^[2-9]\u00d7', re.UNICODE, na=False)]
print df[df['Column'].str.match(ur'\u00d7[2-9]$', re.UNICODE, na=False)]
print df[df['Column'].str.match(ur'\u00d7', re.UNICODE, na=False)]
Output I see (again with x in place of the times sign):
Column
0 2x 32
Empty DataFrame
Columns: [Column]
Index: []
Empty DataFrame
Columns: [Column]
Index: []

Use contains():
df.Column.str.contains(r'^[2-9]\u00d7')
0 True
1 False
2 False
Name: Column, dtype: bool
df.Column.str.contains(r'\u00d7[2-9]$')
0 False
1 False
2 True
Name: Column, dtype: bool
df.Column.str.contains(r'\u00d7')
0 True
1 False
2 True
Name: Column, dtype: bool
Explanation: contains() uses re.search(), and match() uses re.match() (docs). Since re.match() only matches from the beginning of a string (docs), only your first case, which is anchored at the start (with ^), will work. In that case you don't even need both match and ^:
df.Column.str.match(r'[2-9]\u00d7')
0 True
1 False
2 False
Name: Column, dtype: bool
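For reference, here is a minimal Python 3 sketch of the same checks (the ur'' prefix is Python 2 only; in Python 3 every string literal is unicode, and the frame is built directly instead of via read_html):
import pandas as pd

df = pd.DataFrame({'Column': ['2\u00d7 32', '42', '64 \u00d72']})
# contains() uses re.search(), so the pattern may match anywhere in the string
print(df[df['Column'].str.contains('^[2-9]\u00d7', na=False)])  # row 0
print(df[df['Column'].str.contains('\u00d7[2-9]$', na=False)])  # row 2
print(df[df['Column'].str.contains('\u00d7', na=False)])        # rows 0 and 2
# match() uses re.match(), so only a pattern that matches at the start can succeed
print(df[df['Column'].str.match('[2-9]\u00d7', na=False)])      # row 0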

Related

Can I use regular expressions, the Like operator, and/or Instr() to find the index of a pattern within a larger string?

I have a large list (a table with one field) of non-standardized strings, imported from a poorly managed legacy database. I need to extract the single-digit number (surrounded by spaces) that occurs exactly once in each of those strings (though the strings have other multi-digit numbers sometimes too). For example, from the following string:
"Quality Assurance File System And Records Retention Johnson, R.M. 004 4 2999 ss/ds/free ReviewMo = Aug Effective 1/31/2012 FileOpen-?"
I would want to pull the number 4 (or 4's position in the string, i.e. 71)
I can use
WHERE rsLegacyList.F1 LIKE "* # *"
inside a select statement to find if each string has a lone digit, and thereby filter my list. But it doesn't tell me where the digit is so I can extract the digit itself (with mid() function) and start sorting the list. The goal is to create a second field with that digit by itself as a method of sorting the larger strings in the first field.
Is there a way to use Instr() along with regular expressions to find where a regular expression occurs within a larger string? Something like
intMarkerLocation = instr(rsLegacyList.F1, Like "* # *")
but that actually works?
I appreciate any suggestions, or workarounds that avoid the problem entirely.
@Lee Mac, I made a function RegExFindStringIndex as shown here:
Public Function RegExFindStringIndex(strToSearch As String, strPatternToMatch As String) As Integer
    Dim regex As RegExp
    Dim Matching As Match
    Set regex = New RegExp
    With regex
        .MultiLine = False
        .Global = True
        .IgnoreCase = False
        .Pattern = strPatternToMatch
        Matching = .Execute(strToSearch)
        RegExFindStringIndex = Matching.FirstIndex
    End With
    Set regex = Nothing
    Set Matching = Nothing
End Function
But it gives me the error "Invalid use of property" at the line Matching = .Execute(strToSearch).
Using Regular Expressions
If you were to use Regular Expressions, you would need to define a VBA function to instantiate a RegExp object, set the pattern property to something like \s\d\s (whitespace-digit-whitespace) and then invoke the Execute method to obtain a match (or matches), each of which will provide an index of the pattern within the string. If you want to pursue this route, here are some existing examples for Excel, but the RegExp manipulation will be identical in MS Access.
Here is an example function demonstrating how to use the first result returned by the Execute method:
Public Function RegexInStr(strStr As String, strPat As String) As Integer
    With New RegExp
        .Multiline = False
        .Global = True
        .IgnoreCase = False
        .Pattern = strPat
        With .Execute(strStr)
            If .Count > 0 Then RegexInStr = .Item(0).FirstIndex + 1
        End With
    End With
End Function
Note that the above uses early binding and so you will need to add a reference to the Microsoft VBScript Regular Expressions 5.5 library to your project.
Example Immediate Window evaluation:
?InStr("abc 1 123", " 1 ")
4
?RegexInStr("abc 1 123", "\s\w\s")
4
Using InStr
An alternative using the built-in InStr function within a query might be the following inelegant (and probably very slow) query:
select
switch
(
instr(rsLegacyList.F1," 0 ")>0,instr(rsLegacyList.F1," 0 ")+1,
instr(rsLegacyList.F1," 1 ")>0,instr(rsLegacyList.F1," 1 ")+1,
instr(rsLegacyList.F1," 2 ")>0,instr(rsLegacyList.F1," 2 ")+1,
instr(rsLegacyList.F1," 3 ")>0,instr(rsLegacyList.F1," 3 ")+1,
instr(rsLegacyList.F1," 4 ")>0,instr(rsLegacyList.F1," 4 ")+1,
instr(rsLegacyList.F1," 5 ")>0,instr(rsLegacyList.F1," 5 ")+1,
instr(rsLegacyList.F1," 6 ")>0,instr(rsLegacyList.F1," 6 ")+1,
instr(rsLegacyList.F1," 7 ")>0,instr(rsLegacyList.F1," 7 ")+1,
instr(rsLegacyList.F1," 8 ")>0,instr(rsLegacyList.F1," 8 ")+1,
instr(rsLegacyList.F1," 9 ")>0,instr(rsLegacyList.F1," 9 ")+1,
true, null
) as intMarkerLocation
from
rsLegacyList
where
rsLegacyList.F1 like "* # *"
How about:
select
instr(rsLegacyList.F1, " # ") + 1 as position
from rsLegacyList
where rsLegacyList.F1 LIKE "* # *"

Pandas replace any integer with string using regex

Struggling with something that is probably super basic, but I'm trying to replace some integers with a string (using pandas & regex).
test = pd.DataFrame([14,5,3,2345])
test2 = test.replace('\d', 'TRUE', regex=True)
test2
When I run that, I expect to see: TRUE TRUE TRUE TRUE, but instead I see exactly the same list:
test2
Out[93]:
0
0 14
1 5
2 3
3 2345
Am I missing something? I thought '\d' matched any numerical character.
The replacement does nothing here because the cells are integers, and regex replacement only operates on string values. You need to cast the data to string and use a ^\d+$ regex to check whether the whole string is composed of digits:
>>> test2 = test.astype(str).replace(r'^\d+$', 'TRUE', regex=True)
>>> test2
0
0 TRUE
1 TRUE
2 TRUE
3 TRUE
>>>
The ^ matches the start of string, \d+ matches 1 or more digits and $ matches the end of string.
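To see why the original call came back unchanged: the column holds integers, and regex replacement is only applied to string cells, which is why the cast to str is needed. A minimal sketch of both calls:
import pandas as pd

test = pd.DataFrame([14, 5, 3, 2345])
# integer cells are not run through the regex, so the frame is returned unchanged
print(test.replace(r'\d', 'TRUE', regex=True))
# after casting to str, every cell is compared against the pattern
print(test.astype(str).replace(r'^\d+$', 'TRUE', regex=True))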

pandas convert numbers to words in every rows and one specific column

UPDATE
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df.iloc[:,3].replace(r'(?<!\S)\d+(?!\S)', lambda x: p.number_to_words(x.group()), regex=True, inplace=True)
df.iloc[:,3].head(2)
0 15
1 89
Name: D, dtype: int64
df = df.astype(str)
df.iloc[:,3].replace(r'(?<!\S)\d+(?!\S)', lambda x: p.number_to_words(x.group()), regex=True, inplace=True)
df.iloc[:,3].head(2)
0 <function <lambda> at 0x7fd8a6b4db18>
1 <function <lambda> at 0x7fd8a6b4db18>
Name: D, dtype: object
I have a pandas data frame and some of the rows contain numbers in some columns. I want to use the inflect library to replace only the numbers with their corresponding word representation.
I think df.replace is a good fit, but how can I specify that only numbers (all the numbers that are separated by white space) should be replaced, and pass them as an argument to inflect?
p = inflect.engine()
df.replace(r' (\d+) ', p.number_to_words($1), regex=True, inplace=True)
Similarly, I have a second dataframe where I want to do this for one specific column (the column with index 4). That column contains only 4-digit numbers (years). How can I do it?
Import the re library, make sure your column is of string type, and use (?<!\S)\d+(?!\S) to match runs of digits that are delimited by whitespace or the start/end of the string. If you only want to match entries that consist entirely of digits, you may use the ^\d+$ regex instead.
df.iloc[:,3].astype(str).apply(lambda row: re.sub(r'(?<!\S)\d+(?!\S)', lambda x: p.number_to_words(x.group()), row))
First, the column is cast to string with .astype(str). Then, (?<!\S)\d+(?!\S) is matched in each row and every matched number is passed to the .number_to_words() method.
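Putting it together with inflect, here is a self-contained sketch (the column letters come from the sample frame in the update; number_to_words() is the inflect call you already use):
import re

import inflect
import numpy as np
import pandas as pd

p = inflect.engine()
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
# cast the target column to string, then rewrite every whitespace-delimited run of digits
df['D'] = df['D'].astype(str).apply(
    lambda row: re.sub(r'(?<!\S)\d+(?!\S)',
                       lambda m: p.number_to_words(m.group()), row))
print(df['D'].head(2))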

Pandas regex split on characters and group

I've never got around to learning regex till now, but I'm trying to figure out how to use it in pandas with Series.str.match(expression) in order to split one column into two new columns. (I know I can do this without regex.)
examples of the column data are:
True Grit {'Rooster Cogburn'}
The King's Speech {'King George VI'}
Biutiful {'Uxbal'}
There can be any number of words (more than one) in each of the two groupings. How can I extract two groups to get, for example, True Grit and Rooster Cogburn?
Given this dataframe
col
0 True Grit {Rooster Cogburn}
1 The King's Speech {King George VI}
2 Biutiful {Uxbal}
df = df.col.str.extract(r'(.*?)\s*\{(.*)\}', expand=True)
will return
0 1
0 True Grit Rooster Cogburn
1 The King's Speech King George VI
2 Biutiful Uxbal
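If you prefer named columns over 0 and 1, the same pattern works with named groups; a small sketch (the names title and character are just illustrative):
import pandas as pd

df = pd.DataFrame({'col': ["True Grit {Rooster Cogburn}",
                           "The King's Speech {King George VI}",
                           "Biutiful {Uxbal}"]})
# extract() names the result columns after the named groups
print(df.col.str.extract(r'(?P<title>.*?)\s*\{(?P<character>.*)\}', expand=True))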

str.startswith using Regex

Can I understand why str.startswith() does not deal with regex?
col1
0 country
1 Country
i.e.: df.col1.str.startswith('(C|c)ountry')
returns False for all the values:
col1
0 False
1 False
Series.str.startswith does not accept regex because it is intended to behave similarly to str.startswith in vanilla Python, which does not accept regex. The alternative is to use a regex match (as explained in the docs):
df.col1.str.contains('^[Cc]ountry')
The character class [Cc] is probably a better way to match C or c than (C|c), unless of course you need to capture which letter is used; in that case you can use ([Cc]).
Series.str.startswith does not accept regexes. Use Series.str.match instead:
df.col1.str.match(r'(C|c)ountry', as_indexer=True)
Output:
0 True
1 True
Name: col1, dtype: bool