Any ideas on iterating over a dataframe and applying regex?

This may be a rudimentary problem but I am new to pandas.
I have a CSV loaded into a dataframe, and I want to iterate over each row to extract all the string information in a specific column with regex. (The reason I am using regex is that I eventually want to make a separate dataframe out of that column.)
I tried iterating with a for loop but got a ton of errors. So far, it looks like the for loop reads each row as a list or Series rather than a string (correct me if I'm wrong). My main functions are iteritems() and findall(), but no good results so far. How can I approach this problem?
My dataframe looks like this:
df = pd.read_csv('foobar.csv')
df[['column1', 'column2', 'TEXT']]
My approach looks like this:
for Individual_row in df['TEXT'].iteritems():
    parsed = re.findall('(.*?)\:\s*?\[(.*?)\]', Individual_row)
    res = {g[0].strip(): g[1].strip() for g in parsed}
Many thanks in advance

You can try the following instead of a loop (note that na_action is available on Series.map, not Series.apply, and the comprehension must be wrapped in brackets):
df['new_TEXT'] = df['TEXT'].map(lambda x: [(g[0].strip(), g[1].strip()) for g in re.findall(r'(.*?)\:\s*?\[(.*?)\]', x)], na_action='ignore')
This will create a new column with your resultant data.
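If the end goal is a separate dataframe built from that column, here is a minimal sketch of one way to do it (the key: [value] sample text is hypothetical, standing in for whatever foobar.csv actually contains):

import re
import pandas as pd

# Hypothetical sample standing in for the TEXT column of foobar.csv.
df = pd.DataFrame({'TEXT': ['name: [Alice], age: [30]',
                            'name: [Bob], city: [Paris]']})

def parse_text(text):
    # Each match is a (key, value) pair, e.g. ('name', 'Alice').
    pairs = re.findall(r'(\w+):\s*\[(.*?)\]', text)
    return {k.strip(): v.strip() for k, v in pairs}

# One parsed dict per row, expanded into its own dataframe.
parsed_df = pd.DataFrame(df['TEXT'].map(parse_text).tolist())
print(parsed_df)
#     name  age   city
# 0  Alice   30    NaN
# 1    Bob  NaN  Paris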

Make new rows in Pandas dataframe based on df.str.findall matches?

I have a dataframe current_df. I want to create a new row for each regex match that occurs in each entry of column_1. I currently have this:
current_df['new_column']=current_df['column_1'].str.findall('(?<=ABC).*?(?=XYZ)')
This appends a list of the matches for the regex in each row. How do I create a new row for each match? I'm guessing something with list comprehension, but I'm not sure what it'd be exactly.
The output df would be something like:
column_1 column2 new_column
ABC_stuff_to_match_XYZ_ABC_more_stuff_to_match_XYZ... data _stuff_to_match_
ABC_stuff_to_match_XYZ_ABC_more_stuff_to_match_XYZ... data _more_stuff_to_match_
ABC_a_different_but_important_piece_of_data_XYZ_ABC_find_me_too_XYZ... different_stuff _a_different_but_important_piece_of_data_
ABC_a_different_but_important_piece_of_data_XYZ_ABC_find_me_too_XYZ... different_stuff _find_me_too_
You can use extractall, and merge:
df.merge(df.column_1.str.extractall('(?<=ABC)(.*?)(?=XYZ)')
           .reset_index(level=-1, drop=True),
         left_index=True,
         right_index=True)
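A runnable sketch of the same idea, with hypothetical rows shaped like the expected output above:

import pandas as pd

# Hypothetical data shaped like the question's example.
df = pd.DataFrame({
    'column_1': ['ABC_stuff_to_match_XYZ_ABC_more_stuff_to_match_XYZ',
                 'ABC_a_different_but_important_piece_of_data_XYZ_ABC_find_me_too_XYZ'],
    'column2': ['data', 'different_stuff'],
})

# extractall returns one row per match with a (row, match) MultiIndex;
# dropping the match level lets merge fan each original row out per match.
matches = (df.column_1.str.extractall('(?<=ABC)(.*?)(?=XYZ)')
             .reset_index(level=-1, drop=True)
             .rename(columns={0: 'new_column'}))
out = df.merge(matches, left_index=True, right_index=True)
print(out[['column2', 'new_column']])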
Alternatively, use the extract function, which returns only the first match per row (note that extract requires a capturing group in the pattern):
df['new_column'] = df['column_1'].str.extract('(?<=ABC)(.*?)(?=XYZ)', expand=True)

"unexpected character after line continuation character" error; also, how to keep the rows after floating-point rows in a pandas dataframe

I have a dataset in which I want to keep the row just after a floating-point value row and remove the other rows.
For example, a column of the dataframe looks like this:
17.3
Hi Hello
Pranjal
17.1
[aasd]How are you
I am fine[:"]
Live Free
So in this I want to preserve:
Hi Hello
[aasd]How are you
and remove the rest. I tried it with the following code, but an error showed up saying "unexpected character after line continuation character". Also, I don't know whether this code will achieve my purpose:
# Dropping extra rows
for ind in data.index:
    if re.search((([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?, ind):
        ind += 1
    else:
        data.drop(ind)
Your regex has to be a string; you can't just write it like that:
re.search(r'(([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?', ind):
Edit: actually, I think the rest of your code is wrong too.
What you really want is something more like this:
import pandas as pd

l = ['17.3',
     'Hi Hello',
     'Pranjal',
     '17.1',
     '[aasd]How are you',
     'I am fine[:"]',
     'Live Free']
data = pd.DataFrame(l, columns=['col'])
data[data.col.str.match(r'\d+\.\d*').shift(1) == True]
Logic:
If you have a dataframe with a column that is all string type (this won't work for a mix of decimal and string types), you can find the decimal/int entries with the regex '\d+\.?\d*'. If you shift this mask by one, it marks the entries after the matches; use that to select the rows you want in your dataframe.
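To see the shift at work, a quick sketch reusing the data frame built above:

# The mask is True on numeric-looking rows; shifting it by one marks the
# row *after* each numeric row, which is exactly what we want to keep.
mask = data.col.str.match(r'\d+\.\d*')
print(mask.tolist())           # [True, False, False, True, False, False, False]
print(mask.shift(1).tolist())  # [nan, True, False, False, True, False, False]
print(data[mask.shift(1) == True].col.tolist())
# ['Hi Hello', '[aasd]How are you']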

Search pandas column and return all elements (rows) that contain any (one or more) non-digit character

Seems pretty straightforward. The column contains numbers in general, but for some reason some of them have non-digit characters. I want to find all of them. I am using this code:
df_other_values.total_count.str.contains('[^0-9]')
but I get the following error:
AttributeError: Can only use .str accessor with string values, which use
np.object_ dtype in pandas
So I tried this:
df_other_values = df_other.total_countvalues
df_other_values.total_count.str.contains('[^0-9]')
but get the following error:
AttributeError: 'DataFrame' object has no attribute 'total_countvalues'
So instead of going down the rabbit hole further, I was thinking there must be a way to do this without having to change my dataframe into a np.object. Please advise.
Thanks.
I believe you need to cast to strings first with astype and then filter by boolean indexing:
df1 = df[df_other_values.total_count.astype(str).str.contains('[^0-9]')]
Alternative solution with isnumeric:
df1 = df[~df_other_values.total_count.astype(str).str.isnumeric()]
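A minimal runnable sketch of both options (the sample column is hypothetical):

import pandas as pd

# Hypothetical column that is mostly numeric but has stray text.
df = pd.DataFrame({'total_count': [123, 456, '78a', '90', 'n/a']})

# Cast everything to str so the .str accessor works, then keep rows
# containing at least one non-digit character.
print(df[df['total_count'].astype(str).str.contains('[^0-9]')])
#   total_count
# 2         78a
# 4         n/a

# Equivalent with isnumeric: keep rows that are not purely numeric.
print(df[~df['total_count'].astype(str).str.isnumeric()])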

Using Regex in Pig in hadoop

I have a CSV file containing user data (tweetid, tweet, userid).
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
Now I need to write a Pig Query that returns all the tweets that include the word 'favorite', ordered by tweet id.
For this I have the following code:
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;
How does the regular expression here split the output in the desired way?
I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL, as I find it much simpler, but couldn't get the code working except for extracting just the tweets:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,”:-](.*)[“,:-]',1)) AS (msg:chararray);
The above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweet_id, I do not get the desired output:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,”:-]',1)) AS (tweetid:long);
(396124554353197056,"Just saw #samantha0wen and #DakotaFears at the drake concert #waddup")
(396124554172432384,"#Yutika_Diwadkar I'm just so bright 😁")
(396124554609033216,"#TB23GMODE i don't know, i'm just saying, why you in GA though? that's where you from?")
(396124554805776385,"#MichaelThe_Lion me too 😒")
(396124552540852226,"Happy Halloween from us 2 #maddow & #Rev_AlSharpton :) http://t.co/uC35lDFQYn")
Please help.
Can't comment, but from looking at this and testing it out, it looks like the quotes in your regex are different from those in the CSV:
" in the CSV
” in the regex code
To get the tweetid, capture the digits before the first comma instead, for example:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line, '^(\\d+),', 1)) AS (tweetid:long);
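To see the quote mismatch concretely, here is a quick Python sketch (outside Pig) using a sample line from the CSV above:

import re

# A sample line from the CSV: note the plain ASCII quotes (").
line = '396124436845178880,"When\'s 12.4k gonna roll around",Matty_T_03'

# The question's character class uses a curly quote (”), which never
# appears in the line, so those intended split points are missed.
print(re.findall(r'[,”:-]', line))  # only the plain commas match here

# Capturing the leading digits before the first comma yields the tweetid.
m = re.match(r'^(\d+),', line)
print(m.group(1))  # 396124436845178880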

Python Pandas: How to get column values to be a list?

I have a dataframe with one column and 20 rows. I want to use
dataframe[column].apply(lambda x : some_func(x))
to get a second column. The function returns a list. Pandas is not giving me what I want: it is filling the second column with NaN instead of the list items that some_func() is returning.
Is there a clever or simple way to fix this?
It seems that the error was caused because I forgot to include:
axis = 1
Note that axis=1 belongs to DataFrame.apply, not Series.apply, so the full line should have been:
dataframe.apply(lambda row: some_func(row[column]), axis=1)
You can just assign it like a dictionary:
dataframe['column2'] = dataframe['column1'].apply(lambda x : some_func(x))
Simple as that.
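For reference, a minimal runnable sketch of that assignment (some_func here is a hypothetical stand-in that returns a list):

import pandas as pd

df = pd.DataFrame({'column1': ['a b', 'c d e', 'f']})

def some_func(x):
    # Hypothetical stand-in: split each string into a list of tokens.
    return x.split()

# Each cell of column2 holds an actual Python list object.
df['column2'] = df['column1'].apply(some_func)
print(df)
#   column1          column2
# 0     a b       ['a', 'b']
# 1   c d e  ['c', 'd', 'e']
# 2       f            ['f']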