Remove square brackets from cells using pandas - regex

I have a pandas DataFrame with data as below:
id, name, date
[101],[test_name],[2019-06-13T13:45:00.000Z]
[103],[test_name3],[2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00.000Z]
[104],[],[]
I am trying to convert it to the format below, with no square brackets.
Expected output:
id, name, date
101,test_name,2019-06-13T13:45:00.000Z
103,test_name3,2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00.000Z
104,,
I tried using regex as below, but it gave me the error TypeError: expected string or bytes-like object:
re.search(r"\[([A-Za-z0-9_]+)\]", df['id'])

I figured out I am able to extract the data using:
df['id'].str.get(0)

Then loop through the dataframe to access each string and use:
newstring = oldstring[1:len(oldstring)-1]
to replace each cell in the dataframe (a minimal sketch of this is below).
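A minimal sketch of that manual approach, assuming every cell is a string wrapped in brackets (the sample data here is illustrative):

import pandas as pd

df = pd.DataFrame({'id': ['[101]', '[103]', '[104]'],
                   'name': ['[test_name]', '[test_name3]', '[]']})
for col in df.columns:
    for i in df.index:
        oldstring = df.at[i, col]
        df.at[i, col] = oldstring[1:len(oldstring)-1]  # drop the first and last character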

Try looping through the columns:
for col in df.columns:
    df[col] = df[col].str[1:-1]
Or use apply, if duplicating your data is not a problem:
df = df.apply(lambda x: x.str[1:-1])
Output:
id name date
0 101 test_name 2019-06-13T13:45:00.000Z
1 103 test_name3 2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00....
2 104
Or if you want to use regex, you need the str accessor and extract:
df.apply(lambda x: x.str.extract(r'\[([A-Za-z0-9_]+)\]', expand=False))
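Note that [A-Za-z0-9_]+ will not match the date values here (they contain -, : and .), so those cells would come back as NaN. A broader sketch, not from the original answer, that captures whatever sits between the brackets:
df = df.apply(lambda x: x.str.extract(r'\[(.*)\]', expand=False))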

Related

How to extract rows where the regular expression doesn't match in a dataframe column?

I have two dataframes:
OrderedDict([('page1',
     name       dob
0    John   07-20200
1   Lilly    05-1999
2   James    02-2002), ('page2',
     name       dob
0   Chris    07-2020
1  Robert    05-1999
2    barb   02-20022)])
I want to run my regex against each date in both dataframes. If they all match, I want to continue with my program; if there is a non-match, I want to print a message that shows the df name, index, and the date that's wrong, like this:
INVALID DATE: Page1: index 0: dob: 07-20200
INVALID DATE: Page2: index 2: dob: 02-20022
I got to this point
date_pattern = r'(?<!\d)((?:0?[1-9]|1[0-2])-(?:19|20)\d{2})(?!\d)'
for df_name, df in employee_dict.items():
    x = df[df.dob.str.contains(date_pattern, regex=True)]
    print(x)
That prints the rows that do match, in table format, but I want to print the rows that don't match, as individual print statements. Any ideas?
You may iterate over all the rows of the dataframes and, if an entry does not match your pattern, generate the message of your choice:
import re

for df_name, df in employee_dict.items():  # iterate over your DFs
    for index, row in df.iterrows():  # iterate over DF rows
        if not re.search(date_pattern, row['dob']):  # the dob value has no match
            print("INVALID DATE: {}: index {}: dob: {}".format(df_name, index, row['dob']))  # print error message
If your df is pd.DataFrame({'dob': ['05-2020','4-2020','07-1999','2-2001','1-20202020','112-2020']}), the results will be
INVALID DATE: page1: index 4: dob: 1-20202020
INVALID DATE: page1: index 5: dob: 112-2020
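One caveat, not from the original answer: if dob can contain missing values, re.search will raise the same TypeError: expected string or bytes-like object seen in the question. A hedged guard, with illustrative data:

import re
import pandas as pd

date_pattern = r'(?<!\d)((?:0?[1-9]|1[0-2])-(?:19|20)\d{2})(?!\d)'
df = pd.DataFrame({'dob': ['05-2020', None, '112-2020']})
for index, row in df.iterrows():
    dob = row['dob']
    if not isinstance(dob, str) or not re.search(date_pattern, dob):  # guard against NaN/None
        print("INVALID DATE: index {}: dob: {}".format(index, dob))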
You're looking for Series.str.match.
Essentially, you need to extract the dob series, which I assume is what you're doing with df['dob'], and do result = df['dob'].str.match(date_pattern). The result will be a series of True and False values, corresponding to their respective df['dob'] values.
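A short sketch of that idea with illustrative data, using the inverted boolean mask to select the non-matching rows:

import pandas as pd

date_pattern = r'(?<!\d)((?:0?[1-9]|1[0-2])-(?:19|20)\d{2})(?!\d)'
df = pd.DataFrame({'dob': ['07-20200', '05-1999', '02-2002']})
mask = df['dob'].str.match(date_pattern)  # True where dob matches the pattern
for index, dob in df.loc[~mask, 'dob'].items():
    print("INVALID DATE: index {}: dob: {}".format(index, dob))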

How to extract date from String column using regex in Spark

I have a dataframe which consists of filename, email and other details. I need to get the dates out of the file name column.
Ex: File name: Test_04_21_2019_34600.csv
Need to extract the date: 04_21_2019
Dataframe
val df1 = Seq(
  ("Test_04_21_2018_1200.csv", "abc#gmail.com", 200),
  ("home/server2_04_15_2020_34610.csv", "abc1#gmail.com", 300),
  ("/server1/Test3_01_2_2019_54680.csv", "abc2#gmail.com", 800)
).toDF("file_name", "email", "points")
Expected output:
date email points
04_21_2018 abc#gmail.com 200
04_15_2020 abc1#gmail.com 300
01_2_2019 abc2#gmail.com 800
Can we use regex on a Spark dataframe to achieve this, or is there any other way? Any help will be appreciated.
You can use the regexp_extract function to extract the date, as below:
val resultDF = df1.withColumn("date",
  regexp_extract($"file_name", "\\d{1,2}_\\d{1,2}_\\d{4}", 0)
)
Output:
+--------------------+--------------+------+----------+
| file_name| email|points| date|
+--------------------+--------------+------+----------+
|Test_04_21_2018_1...| abc#gmail.com| 200|04_21_2018|
|home/server2_04_1...|abc1#gmail.com| 300|04_15_2020|
|/server1/Test3_01...|abc2#gmail.com| 800| 01_2_2019|
+--------------------+--------------+------+----------+
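For reference, a PySpark sketch of the same idea, assuming an active SparkSession named spark (the pattern is the one from the answer above):

from pyspark.sql import functions as F

df1 = spark.createDataFrame(
    [("Test_04_21_2018_1200.csv", "abc#gmail.com", 200),
     ("home/server2_04_15_2020_34610.csv", "abc1#gmail.com", 300),
     ("/server1/Test3_01_2_2019_54680.csv", "abc2#gmail.com", 800)],
    ["file_name", "email", "points"])

resultDF = df1.withColumn("date", F.regexp_extract("file_name", r"\d{1,2}_\d{1,2}_\d{4}", 0))
resultDF.show()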

Trying to apply regex to a column in a dataframe

I have the following df:
concatenar buy_sell
1 BBVA2018-03-2020 sell
5 santander2018-03-2020 buy
I would like to apply a regex to the concatenar column so that only the letters ([A-Za-z]+) in each value are kept.
This is what I have tried:
re.findall(r'[A-Z][a-z]*',df['concatenar'])
But it outputs:
TypeError: expected string or bytes-like object
My desired output would be :
concatenar buy_sell
1 BBVA sell
5 santander buy
How could I correctly apply the regex to the concatenar column?
Use replace with a dict:
df.concatenar.replace({r'\d+': '', '-': ''}, regex=True)
Out[354]:
1 BBVA
5 santander
Name: concatenar, dtype: object
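An alternative sketch, not from the original answer: extract the leading letters with str.extract and assign the result back (the replace above returns a new Series rather than modifying df):

import pandas as pd

df = pd.DataFrame({'concatenar': ['BBVA2018-03-2020', 'santander2018-03-2020'],
                   'buy_sell': ['sell', 'buy']}, index=[1, 5])
df['concatenar'] = df['concatenar'].str.extract(r'([A-Za-z]+)', expand=False)
print(df)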

python replace string function throws asterix wildcard error

When I use * I receive the error:
raise error, v # invalid expression
error: nothing to repeat
Other special characters such as ^ work fine.
The line of code:
df.columns = df.columns.str.replace('*agriculture', 'agri')
I am using pandas and Python.
edit:
when I try using / to escape, the wildcard does not work as I intend:
In [44]: df = pd.DataFrame(columns=['agriculture', 'dfad agriculture df'])
In [45]: df
Out[45]:
Empty DataFrame
Columns: [agriculture, dfad agriculture df]
Index: []
In [46]: df.columns.str.replace('/*agriculture*', 'agri')
Out[46]: Index([u'agri', u'dfad agri df'], dtype='object')
I thought the wildcard should output Index([u'agri', u'agri'], dtype='object')
edit:
I am currently using hierarchical columns and would like to replace 'agriculture' with 'agri' only at that specific level (level = 2).
original:
df.columns[0] = ('grand total', '2005', 'agriculture')
df.columns[1] = ('grand total', '2005', 'other')
desired:
df.columns[0] = ('grand total', '2005', 'agri')
df.columns[1] = ('grand total', '2005', 'other')
I'm looking at this link right now: Changing columns names in Pandas with hierarchical columns
and that author says it will get easier in 0.15.0, so I am hoping there are more recent, updated solutions.
You need the asterisk * at the end so that it has a preceding character to repeat; * matches that character 0 or more times, see the docs:
In [287]:
df = pd.DataFrame(columns=['agriculture'])
df
Out[287]:
Empty DataFrame
Columns: [agriculture]
Index: []
In [289]:
df.columns.str.replace('agriculture*', 'agri')
Out[289]:
Index(['agri'], dtype='object')
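If the intent was shell-style wildcard matching, so that 'dfad agriculture df' also collapses to 'agri', note that the regex equivalent of the shell's * is .*; a sketch of that, not from the original answer:

import pandas as pd

df = pd.DataFrame(columns=['agriculture', 'dfad agriculture df'])
df.columns = df.columns.str.replace(r'.*agriculture.*', 'agri', regex=True)
print(df.columns)  # Index(['agri', 'agri'], dtype='object')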
EDIT
Based on your new and actual requirements, you can use str.contains to find the matching columns, build a dict mapping the old names to the new ones, and then call rename:
In [307]:
matching_cols = df.columns[df.columns.str.contains('agriculture')]
df.rename(columns = dict(zip(matching_cols, ['agri'] * len(matching_cols))))
Out[307]:
Empty DataFrame
Columns: [agri, agri]
Index: []
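The rename above does not address the hierarchical (MultiIndex) columns from the question. A hedged sketch for that case, assuming the three-level tuples shown there; DataFrame.rename accepts a level argument for MultiIndex labels:

import pandas as pd

cols = pd.MultiIndex.from_tuples([('grand total', '2005', 'agriculture'),
                                  ('grand total', '2005', 'other')])
df = pd.DataFrame(columns=cols)
df = df.rename(columns={'agriculture': 'agri'}, level=2)
print(df.columns[0])  # ('grand total', '2005', 'agri')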

AttributeError: 'DataFrame' object has no attribute 'Height'

I am able to convert a csv file to a pandas DataFrame and to print out the table, as seen below. However, when I try to print out the Height column I get an error. How can I fix this?
import pandas as pd
df = pd.read_csv('/path../NavieBayes.csv')
print df  # this prints out as seen below
print df.Height  # this gives me "AttributeError: 'DataFrame' object has no attribute 'Height'"
Height Weight Classifer
0 70.0 180 Adult
1 58.0 109 Adult
2 59.0 111 Adult
3 60.0 113 Adult
4 61.0 115 Adult
I have run into a similar issue before when reading from csv. Assuming it is the same:
col_name = df.columns[0]
df = df.rename(columns={col_name: 'new_name'})
The error in my case was caused (I think) by a byte order mark in the csv, or some other non-printing character added to the first column label. df.columns returns an array of the column names; df.columns[0] gets the first one. Try printing it and see if something is odd with the result.
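A quick way to check for and handle a byte order mark, sketched under the assumption that the file is UTF-8 (the utf-8-sig codec strips a leading BOM if present):

import pandas as pd

df = pd.read_csv('/path../NavieBayes.csv')
print(repr(df.columns[0]))  # a BOM shows up as '\ufeffHeight'

# re-reading with utf-8-sig strips the BOM
df = pd.read_csv('/path../NavieBayes.csv', encoding='utf-8-sig')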
PS On the above answer by JAB: if there are clearly spaces in your column names, use skipinitialspace=True in read_csv, e.g.
df = pd.read_csv('/path../NavieBayes.csv',skipinitialspace=True)
Use a raw string, or forward slashes, for the file path:
df = pd.read_csv(r'path_of_file\csv_file_name.csv')
OR
df = pd.read_csv('path_of_file/csv_file_name.csv')
Example:
data = pd.read_csv(r'F:\Desktop\datasets\hackathon+data+set.csv')
Try it, it will work.