Trying to apply regex to a column in a dataframe

I have the following df:
   concatenar             buy_sell
1  BBVA2018-03-2020       sell
5  santander2018-03-2020  buy
I would like to apply a regex to the concatenar column, keeping only the letters ([A-Z][a-z]*) from each value.
This is what I have tried:
re.findall(r'[A-Z][a-z]*',df['concatenar'])
But this raises:
TypeError: expected string or bytes-like object
My desired output would be :
   concatenar  buy_sell
1  BBVA        sell
5  santander   buy
How could I correctly apply the regex to the concatenar column?

replace with dict
df.concatenar.replace({r'\d+':'','-':''},regex=True)
Out[354]:
1 BBVA
5 santander
Name: concatenar, dtype: object
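Another option is str.extract, which runs the regex on each cell; re.findall expects a single string, which is why passing the whole Series raised the TypeError. A small runnable sketch (sample data recreated here; note the question's [A-Z][a-z]* would miss the lowercase "santander", so this assumes the goal is "all leading letters"):

```python
import pandas as pd

df = pd.DataFrame({"concatenar": ["BBVA2018-03-2020", "santander2018-03-2020"],
                   "buy_sell": ["sell", "buy"]}, index=[1, 5])

# str.extract applies the pattern per cell and returns the captured group
df["concatenar"] = df["concatenar"].str.extract(r"^([A-Za-z]+)", expand=False)
print(df)
```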

Related

Merging two Pandas Dataframes using Regular Expressions

I'm new to Python and Pandas, but I'm trying to use Pandas DataFrames to merge two dataframes based on regular expressions.
I have one dataframe with some 2 million rows. This table contains data about cars, but the model name is often specified in - let's say - a creative way, e.g. 'Audi A100', 'Audi 100', 'Audit 100 Quadro', or just 'A 100'. And the same for other brands. This is stored in a column called "Model". A second column holds the manufacturer.
Index  Model        Manufacturer
0      A 100        Audi
1      A100 Quadro  Audi
2      Audi A 100   Audi
...    ...          ...
To clean up the data I created about 1000 regular expressions to search for key words and stored them in a dataframe called 'regex'. A second column of this table holds the manufacturer; this value is used in a second step to validate the result.
Index  RegEx        Manufacturer
0      .* A100 .*   Audi
1      .* A 100 .*  Audi
2      .* C240 .*   Mercedes
3      .* ID3 .*    Volkswagen
I hope you get the idea.
As far as I understood, the Pandas function "merge()" does not work with regular expressions. Therefore I loop over the list of regular expressions, use the "match" function to locate matching rows in the cars DataFrame, and assign the matched RegEx and the suggested manufacturer.
I added two additional columns to the cars table 'RegEx' and 'Manufacturer'.
for index, row in regex.iterrows():
    cars.loc[cars['Model'].str.match(row['RegEx']), 'RegEx'] = row['RegEx']
    cars.loc[cars['Model'].str.match(row['RegEx']), 'Manufacturer'] = row['Manufacturer']
I learned 'iterrows' should not be used for performance reasons. It takes 8 minutes to finish the loop, which isn't too bad. However, is there a better way to get it done?
Kind regards
Jiriki
I have no idea if it would be faster (I'll be glad, if you would test it), but it doesn't use iterrows():
regex.groupby(["RegEx", "Manufacturer"])["RegEx"]\
    .apply(lambda x: cars.loc[cars['Model'].str.match(x.iloc[0])])
EDIT: Code for reproduction:
cars = pd.DataFrame({"Model": ["A 100", "A100 Quatro", "Audi A 100", "Passat V", "Passat Gruz"],
                     "Manufacturer": ["Audi", "Audi", "Audi", "VW", "VW"]})
regex = pd.DataFrame({"RegEx": [".*A100.*", ".*A 100.*", ".*Passat.*"],
                      "Manufacturer": ["Audi", "Audi", "VW"]})
#Output:
# Model Manufacturer
#RegEx Manufacturer
#.*A 100.* Audi 0 A 100 Audi
# 2 Audi A 100 Audi
#.*A100.* Audi 1 A100 Quatro Audi
#.*Passat.* VW 3 Passat V VW
# 4 Passat Gruz VW
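If the loop is kept, one easy win is computing each boolean mask once instead of twice per pattern, and iterating with itertuples rather than iterrows. A sketch using the reproduction data above (itertuples is assumed to be a drop-in here):

```python
import pandas as pd

cars = pd.DataFrame({"Model": ["A 100", "A100 Quatro", "Audi A 100",
                               "Passat V", "Passat Gruz"]})
regex = pd.DataFrame({"RegEx": [".*A100.*", ".*A 100.*", ".*Passat.*"],
                      "Manufacturer": ["Audi", "Audi", "VW"]})

# Each pattern is now matched against the Model column only once
for row in regex.itertuples(index=False):
    mask = cars["Model"].str.match(row.RegEx)
    cars.loc[mask, "RegEx"] = row.RegEx
    cars.loc[mask, "Manufacturer"] = row.Manufacturer
print(cars)
```

Note that later patterns overwrite earlier matches, exactly as in the original loop.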

Count of occurrence of a string group by day in Google spreadsheet

I have table data in Google Spreadsheet something like this:
Date|Diet
4-Jan-2020|Coffee
4-Jan-2020|Snacks
4-Jan-2020|xyz
4-Jan-2020|Coffee
5-Jan-2020|Snacks
5-Jan-2020|abc
6-Jan-2020|Coffee
6-Jan-2020|Snacks
This table is a list of food items I had on a daily basis. I would like to get the number of times I had coffee on a daily basis. So I would like to get the output like this:
Date | No of times I had Coffee
4-Jan-2020| 2
5-Jan-2020| 0
6-Jan-2020| 1
I used this query to get the output.
=query(A1:B1425,"select A, COUNT(B) where B='Coffee' group by A")
With this query, I get the output below. Note that the days when I didn't have coffee are missing:
4-Jan-2020| 2
6-Jan-2020| 1
So count for 5-Jan-2020 is missing because there is no string "Coffee" for that day.
How do I get the desired output including the count 0? Thank you.
try:
=ARRAYFORMULA({UNIQUE(FILTER(A1:A, A1:A<>"")),
IFNA(VLOOKUP(UNIQUE(FILTER(A1:A, A1:A<>"")),
QUERY(A1:B,
"select A,count(B)
where B='Coffee'
group by A
label count(B)''"), 2, 0))*1})
or try:
=ARRAYFORMULA(QUERY({A1:B, IF(B1:B="coffee", 1, 0)},
"select Col1,sum(Col3)
where Col1 is not null
group by Col1
label sum(Col3)''"))
You might want to change the counter into an IF statement, something like IF(COUNT(B) > 0, COUNT(B), 0) with the same where B='Coffee' group by A clause.
That will force the counter to have an actual value (0), even when nothing is found.
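For comparison, the same zero-inclusive count can be sketched in pandas (column names assumed from the sample data): grouping a boolean mask keeps every day, because days without coffee simply sum to 0.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["4-Jan-2020", "4-Jan-2020", "4-Jan-2020", "4-Jan-2020",
             "5-Jan-2020", "5-Jan-2020", "6-Jan-2020", "6-Jan-2020"],
    "Diet": ["Coffee", "Snacks", "xyz", "Coffee", "Snacks", "abc",
             "Coffee", "Snacks"],
})

# The mask is True only on Coffee rows, so each date's sum is its coffee count
coffee_per_day = df["Diet"].eq("Coffee").groupby(df["Date"]).sum()
print(coffee_per_day)
```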

How to extract where the reg expression doesn't match in data frame column?

I have a two dataframes:
OrderedDict([('page1',
    name   dob
0   John   07-20200
1   Lilly  05-1999
2   James  02-2002
), ('page2',
    name    dob
0   Chris   07-2020
1   Robert  05-1999
2   barb    02-20022
)])
I want to run my regular expression against each date in both dataframes. If they all match I want to continue with my program; if there is a non-match I want to print a message that shows the df name, index, and the date that's wrong, like this:
INVALID DATE: page1: index 0: dob: 07-20200
INVALID DATE: page2: index 2: dob: 02-20022
I got to this point
date_pattern = r'(?<!\d)((?:0?[1-9]|1[0-2])-(?:19|20)\d{2})(?!\d)'

for df_name, df in employee_dict.items():
    x = df[df.dob.str.contains(date_pattern, regex=True)]
    print(x)
That prints the rows that do match, in a table format, but I want to print the rows that don't match, in individual print statements.
Any ideas?
You may iterate over all the rows of the dataframes and if the entry does not match your pattern, you may generate the message of your choice:
for df_name, df in employee_dict.items():  # Iterate over your DFs
    for index, row in df.iterrows():  # Iterate over DF rows
        if not re.search(date_pattern, row['dob']):  # The dob value has no match
            print("INVALID DATE: {}: index {}: dob: {}".format(df_name, index, row['dob']))  # Print error message
If your df is pd.DataFrame({'dob': ['05-2020','4-2020','07-1999','2-2001','1-20202020','112-2020']}), the results will be
INVALID DATE: page1: index 4: dob: 1-20202020
INVALID DATE: page1: index 5: dob: 112-2020
You're looking for Series.str.match.
Essentially, you need to extract the dob series, which I assume is what you're doing with df['dob'], and do result = df['dob'].str.match(date_pattern). The result will be a series of True and False values, corresponding to their respective df['dob'] values.
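Building on that, the non-matching rows can also be pulled out without iterrows() by negating the boolean mask with ~. A sketch on a small made-up page1 frame, reusing the question's pattern:

```python
import pandas as pd

date_pattern = r'(?<!\d)((?:0?[1-9]|1[0-2])-(?:19|20)\d{2})(?!\d)'
df = pd.DataFrame({'dob': ['07-20200', '05-1999', '02-2002']})

# ~ inverts the mask, selecting the rows with no valid date
bad = df.loc[~df['dob'].str.contains(date_pattern, regex=True), 'dob']
for index, dob in bad.items():
    print(f"INVALID DATE: page1: index {index}: dob: {dob}")
```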

Remove square brackets from cells using pandas

I have a Pandas Dataframe with data as below
id, name, date
[101],[test_name],[2019-06-13T13:45:00.000Z]
[103],[test_name3],[2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00.000Z]
[104],[],[]
I am trying to convert it to a format as below with no square brackets
Expected output:
id, name, date
101,test_name,2019-06-13T13:45:00.000Z
103,test_name3,2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00.000Z
104,,
I tried using regex as below but it gave me an error TypeError: expected string or bytes-like object
re.search(r"\[([A-Za-z0-9_]+)\]", df['id'])
Figured I am able to extract the data using the below:
df['id'].str.get(0)
Loop through the data frame to access each string, then use:
newstring = oldstring[1:-1]
to replace the cell in the dataframe.
Try looping through columns:
for col in df.columns:
    df[col] = df[col].str[1:-1]
Or use apply if your duplication of your data is not a problem:
df = df.apply(lambda x: x.str[1:-1])
Output:
id name date
0 101 test_name 2019-06-13T13:45:00.000Z
1 103 test_name3 2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00....
2 104
Or if you want to use regex, you need str accessor, and extract:
df.apply(lambda x: x.str.extract(r'\[([A-Za-z0-9_]+)\]', expand=False))
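Put together as a runnable sketch (a small frame with string cells recreated from the question; str[1:-1] drops the first and last character of every cell, leaving an empty string for "[]"):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["[101]", "[103]", "[104]"],
    "name": ["[test_name]", "[test_name3]", "[]"],
})

# Strip one leading and one trailing bracket from every cell
df = df.apply(lambda col: col.str[1:-1])
print(df)
```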

Python Pandas replace sub-string in column with sub-string in another column

I have a Python Pandas data frame like following:
Id Title URL PosterPath
Id-1 Bruce Almighty https://www.youtube.com/embed/5VGyTOGxyVA https://i.ytimg.com/vi/XXXRRR/hqdefault.jpg
Id-2 Superhero Movie https://www.youtube.com/embed/3BnXz-7-y-o https://i.ytimg.com/vi/XXXRRR/hqdefault.jpg
Id-3 Taken https://www.youtube.com/embed/vjbfiOERDYs https://i.ytimg.com/vi/XXXRRR/hqdefault.jpg
I want to replace the sub-string "XXXRRR" in the PosterPath column with the sub-string that comes after "embed/" in the URL column.
Output data frame would look like following:
Id Title URL PosterPath
Id-1 Bruce Almighty https://www.youtube.com/embed/5VGyTOGxyVA https://i.ytimg.com/vi/5VGyTOGxyVA/hqdefault.jpg
Id-2 Superhero Movie https://www.youtube.com/embed/3BnXz-7-y-o https://i.ytimg.com/vi/3BnXz-7-y-o/hqdefault.jpg
Id-3 Taken https://www.youtube.com/embed/vjbfiOERDYs https://i.ytimg.com/vi/vjbfiOERDYs/hqdefault.jpg
Use str.extract with Series.replace:
a = df['URL'].str.extract('embed/(.*)$', expand=False)
print (a)
0 5VGyTOGxyVA
1 3BnXz-7-y-o
2 vjbfiOERDYs
Name: URL, dtype: object
df['PosterPath'] = df['PosterPath'].replace('XXXRRR', a, regex=True)
print (df)
Id Title URL \
0 Id-1 Bruce Almighty https://www.youtube.com/embed/5VGyTOGxyVA
1 Id-2 Superhero Movie https://www.youtube.com/embed/3BnXz-7-y-o
2 Id-3 Taken https://www.youtube.com/embed/vjbfiOERDYs
PosterPath
0 https://i.ytimg.com/vi/5VGyTOGxyVA/hqdefault.jpg
1 https://i.ytimg.com/vi/3BnXz-7-y-o/hqdefault.jpg
2 https://i.ytimg.com/vi/vjbfiOERDYs/hqdefault.jpg
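Whether Series.replace accepts a Series as the replacement value may depend on the pandas version; a plain row-wise splice sidesteps that. A sketch with the relevant columns recreated from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "URL": ["https://www.youtube.com/embed/5VGyTOGxyVA",
            "https://www.youtube.com/embed/3BnXz-7-y-o"],
    "PosterPath": ["https://i.ytimg.com/vi/XXXRRR/hqdefault.jpg"] * 2,
})

# Grab the video id after "embed/", then splice it in with plain str.replace
ids = df["URL"].str.extract(r"embed/(.*)$", expand=False)
df["PosterPath"] = [p.replace("XXXRRR", i) for p, i in zip(df["PosterPath"], ids)]
print(df["PosterPath"].tolist())
```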