Make new rows in Pandas dataframe based on df.str.findall matches?

I have a dataframe current_df and I want to create a new row for each regex match that occurs in each entry of column_1. I currently have this:
current_df['new_column']=current_df['column_1'].str.findall('(?<=ABC).*?(?=XYZ)')
This appends a list of the matches for the regex in each row. How do I create a new row for each match? I'm guessing something with list comprehension, but I'm not sure what it'd be exactly.
The output df would be something like:
column_1 column2 new_column
ABC_stuff_to_match_XYZ_ABC_more_stuff_to_match_XYZ... data _stuff_to_match_
ABC_stuff_to_match_XYZ_ABC_more_stuff_to_match_XYZ... data _more_stuff_to_match_
ABC_a_different_but_important_piece_of_data_XYZ_ABC_find_me_too_XYZ... different_stuff _a_different_but_important_piece_of_data_
ABC_a_different_but_important_piece_of_data_XYZ_ABC_find_me_too_XYZ... different_stuff _find_me_too_

You can use extractall together with merge:
df.merge(
    df.column_1.str.extractall('(?<=ABC)(.*?)(?=XYZ)')   # one row per match, indexed by (row, match)
      .reset_index(level=-1, drop=True)                  # drop the match level so the index lines up with df
      .rename(columns={0: 'new_column'}),                # extractall names the unnamed capture group 0
    left_index=True,
    right_index=True
)
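Alternatively, on pandas 0.25 or later you can keep the str.findall result from the question and give each match its own row with explode. A minimal sketch, reusing current_df:
current_df['new_column'] = current_df['column_1'].str.findall('(?<=ABC).*?(?=XYZ)')
current_df = current_df.explode('new_column')  # one row per match; the other columns are repeated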

If you only need the first match in each row, use the extract function. Note that str.extract requires at least one capture group and returns only the first match:
df['new_column'] = df['column_1'].str.extract('(?<=ABC)(.*?)(?=XYZ)', expand=False)

Any ideas on Iterating over dataframe and applying regex?

This may be a rudimentary problem but I am new to pandas.
I have a csv dataframe and I want to iterate over each row to extract all the string information in a specific column through regex. (The reason why I am using regex is that eventually I want to make a separate dataframe of that column.)
I tried iterating with a for loop but got a ton of errors. So far, it looks like the for loop reads each input row as a list or series rather than a string (correct me if I'm wrong). My main functions are iteritems() and findall(), but no good results so far. How can I approach this problem?
My dataframe looks like this:
df = pd.read_csv('foobar.csv')
df[['column1', 'column2', 'TEXT']]
My approach looks like this:
for Individual_row in df['TEXT'].iteritems():
    parsed = re.findall(r'(.*?)\:\s*?\[(.*?)\]', Individual_row)
    res = {g[0].strip(): g[1].strip() for g in parsed}
Many thanks in advance
You can try the following instead of a loop (note that na_action is an argument of map(), not apply(), and the comprehension needs its brackets):
import re
df['new_TEXT'] = df['TEXT'].map(lambda x: [[g[0].strip(), g[1].strip()] for g in re.findall(r'(.*?)\:\s*?\[(.*?)\]', x)], na_action='ignore')
This will create a new column with your resultant data.
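If the eventual goal is a separate dataframe built from that column (as mentioned in the question), one possible sketch, with made-up sample TEXT values for illustration, is to parse each row into a dict and expand:
import re
import pandas as pd

df = pd.DataFrame({'TEXT': ['name: [Alice] age: [30]', 'name: [Bob] age: [25]']})  # hypothetical data
parsed = df['TEXT'].map(lambda x: {g[0].strip(): g[1].strip() for g in re.findall(r'(.*?)\:\s*?\[(.*?)\]', x)})
extracted = pd.DataFrame(parsed.tolist())  # one column per extracted key: name, age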

How do I conditionally remove text from a string in a column in a Scala dataframe?

I'm currently exploring Azure Databricks for a POC (Scala and Databricks are both completely new to me). I'm using this (Cars - Corgis) sample dataset to show off the manipulation characteristics of Databricks.
My problem is that I have a dataframe column called 'model' that contains data like '2009 Audi A3' and '2005 Mercedes E550'. What I would like to be able to do is alter that column so instead of the aforementioned, it reads as 'Audi A3' or 'Mercedes E550'. I have a separate model year column, so I'm trying to reduce the size of the columns where possible.
From what I have seen, replaceAllIn doesn't seem to work with strings in Scala.
This is my code so far:
//Use the dataframe from the previous cell and trim the model year from the model column so for example it reads as 'Audi A3' instead of '2009 Audi A3'
import scala.util.matching.Regex
val modelPrefixPatternMatch = "[0-9 ]".r
val newModel = modelPrefixPatternMatch.replaceAllIn((specificColumnsDf.select("model")),"")
However, when I run this code, I get the following error message:
command-1778339999318469:5: error: overloaded method value replaceAllIn with alternatives:
(target: CharSequence,replacer: scala.util.matching.Regex.Match => String)String <and>
(target: CharSequence,replacement: String)String
cannot be applied to (org.apache.spark.sql.DataFrame, String)
val newModel = modelPrefixPatternMatch.replaceAllIn((specificColumnsDf.select("model")),"")
I have also tried completing the SparkSQL but didn't have any luck there either.
Thanks!
In Spark you would normally add additional columns using withColumn and then select only the columns you want. In this simple example, I use the regexp_replace function to trim out the years, something like this:
%scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
df
  .withColumn("cleanColumn", regexp_replace($"`Identification.Model Year`", "20[0-2][0-9] ", ""))
  .select($"`Identification.Model Year`", $"cleanColumn")
  .distinct
  .show(false)
We could probably make the regular expression tighter, e.g. tie it to the start of the column, or open it up for years 1980, 1990, etc. This is just an example.
If the year is always at the start then you could just use substring and start at position 5. The regex approach at least protects from the year not being there for some records.
HTH

unexpected character after line continuation character. Also to keep rows after floating point rows in pandas dataframe

I have a dataset in which I want to keep the row just after a floating-point row and remove the other rows.
For example, a column of the dataframe looks like this:
17.3
Hi Hello
Pranjal
17.1
[aasd]How are you
I am fine[:"]
Live Free
So in this I want to preserve:
Hi Hello
[aasd]How are you
and remove the rest. I tried it with the following code, but an error showed up saying "unexpected character after line continuation character". Also, I don't know if this code will solve my purpose.
Dropping extra rows:
for ind in data.index:
    if re.search((([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?, ind):
        ind += 1
    else:
        data.drop(ind)
Your regex has to be a string; you can't just write it like that:
if re.search(r'(([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?', ind):
Edit: but actually I think the rest of your code is wrong too.
What you really want is something more like this:
import pandas as pd
l = ['17.3',
'Hi Hello',
'Pranjal',
'17.1',
'[aasd]How are you',
'I am fine[:"]',
'Live Free']
data = pd.DataFrame(l, columns=['col'])
data[data.col.str.match(r'\d+\.\d*').shift(1) == True]
Logic:
If you have a dataframe with a column that is all string type (this won't work for a mix of decimal and string types), you can find the decimal/int entries with the regex r'\d+\.?\d*'. If you shift this mask by one, it gives you the entries after the matches; use that to select the rows you want in your dataframe.
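For reference, running that selection on the sample frame above keeps exactly the two rows that follow the floats:
print(data[data.col.str.match(r'\d+\.\d*').shift(1) == True])
#                  col
# 1           Hi Hello
# 4  [aasd]How are you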

Modify values across all columns pyspark

I have a pyspark data frame and I'd like to have a conditional replacement of a string across multiple columns, not just one.
To be more concrete: I'd like to replace the string 'HIGH' with 1, and everything else in the column with 0. [Or at least replace every 'HIGH' with 1.] In pandas I would do:
df[df == 'HIGH'] = 1
Is there a way to do something similar? Or can I do a loop?
I'm new to pyspark so I don't know how to generate example code.
You can use the replace method for this:
>>> df.replace("HIGH", "1")
Keep in mind that you'll need to replace like for like datatypes, so attempting to replace "HIGH" with the integer 1 will throw an exception.
Edit: You could also use regexp_replace to address both parts of your question, but you'd need to apply it to all columns:
>>> from pyspark.sql.functions import regexp_replace
>>> df = df.withColumn("col1", regexp_replace("col1", "^(?!HIGH).*$", "0"))
>>> df = df.withColumn("col1", regexp_replace("col1", "^HIGH$", "1"))
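To cover every column rather than just col1, a minimal sketch (assuming every column is string-typed; adjust the column list if not) is to loop over df.columns:
from pyspark.sql.functions import regexp_replace

for c in df.columns:
    # first zero out everything that is not exactly HIGH, then map HIGH to 1
    df = df.withColumn(c, regexp_replace(c, "^(?!HIGH).*$", "0"))
    df = df.withColumn(c, regexp_replace(c, "^HIGH$", "1"))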

QueryBuilder: Search a value in a column containing comma-separated integers

I have a column tags containing ids in a comma-separated list.
I want to search all rows where a given value is in that column.
Say I have two rows where the column tags looks like this:
Row1: 1,2,3,4
Row2: 2,5,3,12
and I want to search for a row where the column contains a 1. I try to do it this way:
$qb = $this->createQueryBuilder('p')
->where(':value IN (p.tags)')
->setParameter('value', 1);
I expect it to do something like
SELECT p.* FROM mytable AS p WHERE 1 IN (p.tags)
Executing this in MySQL directly works perfectly. In Doctrine it does not work:
Error: Expected Literal, got 'p'
It works the other way around, though, but this is not what I need:
->where("p.tags IN :value")
I've tried a lot to make this work, but it just won't... Any ideas?
I think you should use the LIKE function for each scenario, for example:
$q = "1";
$qb = $this->createQueryBuilder('p')
->andWhere(
$this->expr()->orX(
$this->expr()->like('p.tags', $this->expr()->literal($q.',%')), // Start with...
$this->expr()->like('p.tags', $this->expr()->literal('%,'.$q.',%')), // In the middle...
$this->expr()->like('p.tags', $this->expr()->literal('%,'.$q)), // End with...
),
);
See the SQL statement result in this fiddle
Hope this helps.