Modify values across all columns in pyspark - replace

I have a pyspark data frame and I'd like to have a conditional replacement of a string across multiple columns, not just one.
To be more concrete: I'd like to replace the string 'HIGH' with 1, and everything else in the column with 0. [Or at least replace every 'HIGH' with 1.] In pandas I would do:
df[df == 'HIGH'] = 1
Is there a way to do something similar? Or can I do a loop?
I'm new to pyspark so I don't know how to generate example code.

You can use the replace method for this:
>>> df.replace("HIGH", "1")
Keep in mind that you'll need to replace like for like datatypes, so attempting to replace "HIGH" with the integer 1 will throw an exception.
Edit: You could also use regexp_replace to address both parts of your question, but you'd need to apply it to all columns:
>>> from pyspark.sql.functions import regexp_replace
>>> df = df.withColumn("col1", regexp_replace("col1", "^(?!HIGH).*$", "0"))
>>> df = df.withColumn("col1", regexp_replace("col1", "^HIGH$", "1"))
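To hit every column at once, here is a minimal sketch, assuming all of the columns are string-typed; when/otherwise covers both halves of the question (every 'HIGH' becomes 1, everything else 0) in a single pass:
from pyspark.sql import functions as F

# Rebuild each column: "1" where the value is 'HIGH', "0" everywhere else
for c in df.columns:
    df = df.withColumn(c, F.when(F.col(c) == "HIGH", "1").otherwise("0"))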

Related

How do I replace the values of a single column, which consists of different values, with only one value?

I have a dataframe with a column which is having values like:
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tMon\n\t\t\t\t\t\t\t\t\t\t\t\t
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tWed\n\t\t\t\t\t\t\t\t\t\t\t\t
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tTue\n\t\t\t\t\t\t\t\t\t\t\t\t
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tSat\n\t\t\t\t\t\t\t\t\t\t\t\t
Jan\n\t\t\t\t\t\t\t\t\t\t\t\tSun\n\t\t\t\t\t\t\t\t\t\t\t\t
I want to replace these values with only the month name, Jan, and have tried the following code:
df['month'] = df['month'].apply(str)
df['month'] = df['month'].str.replace('Jan\n\t\t\t\t\t\t\t\t\t\t\t\tMon\n\t\t\t\t\t\t\t\t\t\t\t\t ', '1-stop')
I have also tried to rename the values using this command:
df.rename({'Jan\n\t\t\t\t\t\t\t\t\t\t\t\tMon\n\t\t\t\t\t\t\t\t\t\t\t\t': 'Jan'}, axis=1, inplace=True)
Also, I used:
import re
month = df['month']
result = str(month)
#result = re.sub(r'\s+', '', str(month))
But nothing seems to work. Also, I want to replace all the row values with just the name 'Jan'; can anyone help me with this?
Thanks in advance.
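A minimal sketch of one way to do this in pandas, assuming (as the sample values suggest) that every entry begins with the month name:
import pandas as pd

# Hypothetical frame mirroring the values in the question
df = pd.DataFrame({'month': ['Jan\n\t\tMon\n\t\t', 'Jan\n\t\tWed\n\t\t']})
# Keep only the leading run of letters (the month name)
df['month'] = df['month'].astype(str).str.extract(r'^([A-Za-z]+)', expand=False)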

How do I conditionally remove text from a string in a column in a Scala dataframe?

I'm currently exploring Azure Databricks for a POC (Scala and Databricks are both completely new to me). I'm using this (Cars - Corgis) sample dataset to show off the manipulation capabilities of Databricks.
My problem is that I have a dataframe column called 'model' that contains data like '2009 Audi A3' and '2005 Mercedes E550'. What I would like to be able to do is alter that column so instead of the aforementioned it reads as 'Audi A3' or 'Mercedes E550'. I have a separate model year column, so I'm trying to reduce the size of the columns where possible.
From what I have seen, replaceAllIn doesn't seem to work on Spark DataFrame columns in Scala.
This is my code so far:
//Use the dataframe from the previous cell and trim the model year from the model column so for example it reads as 'Audi A3' instead of '2009 Audi A3'
import scala.util.matching.Regex
val modelPrefixPatternMatch = "[0-9 ]".r
val newModel = modelPrefixPatternMatch.replaceAllIn((specificColumnsDf.select("model")),"")
However, when I run this code, I get the following error message:
command-1778339999318469:5: error: overloaded method value replaceAllIn with alternatives:
(target: CharSequence,replacer: scala.util.matching.Regex.Match => String)String <and>
(target: CharSequence,replacement: String)String
cannot be applied to (org.apache.spark.sql.DataFrame, String)
val newModel = modelPrefixPatternMatch.replaceAllIn((specificColumnsDf.select("model")),"")
I have also tried completing the SparkSQL but didn't have any luck there either.
Thanks!
In Spark you would normally add additional columns using withColumn and then select only the columns you want. In this simple example, I use the regexp_replace function to trim out the years, something like this:
%scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
df
.withColumn("cleanColumn", regexp_replace($"`Identification.Model Year`", "20[0-2][0-9] ","") )
.select($"`Identification.Model Year`", $"cleanColumn").distinct
.show(false)
My results:
We could probably make the regular expression tighter, e.g. tie it to the start of the column or open it up for years 1980, 1990, etc. - this is just an example.
If the year is always at the start then you could just use substring and start at position 5. The regex approach at least protects against the year not being there for some records.
HTH

Make new rows in Pandas dataframe based on df.str.findall matches?

I have a dataframe current_df. I want to create a new row for each regex match that occurs in each entry of column_1. I currently have this below:
current_df['new_column']=current_df['column_1'].str.findall('(?<=ABC).*?(?=XYZ)')
This appends a list of the matches for the regex in each row. How do I create a new row for each match? I'm guessing something with list comprehension, but I'm not sure what it'd be exactly.
The output df would be something like:
column_1 column2 new_column
ABC_stuff_to_match_XYZ_ABC_more_stuff_to_match_XYZ... data _stuff_to_match_
ABC_stuff_to_match_XYZ_ABC_more_stuff_to_match_XYZ... data _more_stuff_to_match_
ABC_a_different_but_important_piece_of_data_XYZ_ABC_find_me_too_XYZ... different_stuff _a_different_but_important_piece_of_data_
ABC_a_different_but_important_piece_of_data_XYZ_ABC_find_me_too_XYZ... different_stuff _find_me_too_
You can use extractall combined with merge:
df.merge(df.column_1.str.extractall('(?<=ABC)(.*?)(?=XYZ)')
.reset_index(level=-1, drop=True),
left_index=True,
right_index=True
)
Alternatively, use the extract function; note that extract requires a capture group and returns only the first match per row:
df['new_column'] = df['column_1'].str.extract(r'(?<=ABC)(.*?)(?=XYZ)', expand=False)
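On pandas 0.25 or later, another route (a sketch reusing the findall call from the question) is to explode the list of matches so that each match gets its own row:
df['new_column'] = df['column_1'].str.findall('(?<=ABC).*?(?=XYZ)')
# explode repeats the other columns once per matched substring
df = df.explode('new_column')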

"unexpected character after line continuation character"; also, keeping rows that come after floating-point rows in a pandas dataframe

I have a dataset in which I want to keep only the row just after a floating-point row and remove the other rows.
For example, a column of the dataframe looks like this:
17.3
Hi Hello
Pranjal
17.1
[aasd]How are you
I am fine[:"]
Live Free
So in this I want to preserve:
Hi Hello
[aasd]How are you
and remove the rest. I tried it with the following code, but an error showed up saying "unexpected character after line continuation character". Also, I don't know whether this code will solve my purpose.
# Dropping extra rows
for ind in data.index:
    if re.search((([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?, ind):
        ind += 1
    else:
        data.drop(ind)
Your regex has to be a string; you can't just write it like that:
if re.search(r'(([1-9][0-9]*\.?[0-9]*)|(\.[0-9]+))([Ee][+-]?[0-9]+)?', ind):
Edit - but actually I think the rest of your code is wrong too.
What you really want is something more like this:
import pandas as pd
l = ['17.3',
'Hi Hello',
'Pranjal',
'17.1',
'[aasd]How are you',
'I am fine[:"]',
'Live Free']
data = pd.DataFrame(l, columns=['col'])
data[data.col.str.match(r'\d+\.\d*').shift(1) == True]
Logic:
If you have a dataframe with a column that is all string type (this won't work for a mix of decimal and string types), you can find the decimal/int entries with the regex r'\d+\.\d*'. If you shift this mask by one, it gives you the entries just after the matches; use that to select the rows you want in your dataframe.
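To handle that mixed-type caveat, a hedged alternative is to let pandas decide what is numeric rather than using a regex (a sketch against the same data frame as above):
# Rows that parse as numbers become non-NaN under to_numeric(errors='coerce');
# shifting the resulting mask down one row selects the row after each number.
numeric_mask = pd.to_numeric(data['col'], errors='coerce').notna()
result = data[numeric_mask.shift(1, fill_value=False)]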

Why is a conditional clause comparing dataframe columns causing an error even though the conditional clause is boolean in nature?

I am trying to build logic that compares two columns in a dataframe after some logic has been verified.
Here is my code; it pulls historical rates for cryptocurrency data from gdax.com. The test condition I am applying is 'if df.column4 is greater than the sum of df.column4 and df.column3, then buy 10% of the account'.
import GDAX
import pandas as pd
import numpy as np
public_client = GDAX.PublicClient()
ticker_call = public_client.getProductHistoricRates(product='LTC-USD')
df = pd.DataFrame(ticker_call)
df['account']=100
if df[4] > df[3] + df[4]:
    df['account_new'] = df['account'] - df['account'] * .10
    print df
I am getting the error 'ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().' for the if statement. What I don't understand is why: when I run each of the lines in the if statement individually, they work. How can I fix this issue?
Why and how would I use a.bool() instead of an if statement?
Thank you in advance.
df[4] > df[3] + df[4]
(which is actually equivalent to df[3] < 0) is a pandas.Series of boolean values. When do you want to enter the if statement? When all values are True? Then you should use all(). When any of them is True? Then you should use any().
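For example, a sketch of the scalar check (pick any() or all() to match the intent):
condition = df[4] > df[3] + df[4]
if condition.any():  # or condition.all()
    df['account_new'] = df['account'] - df['account'] * .10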
If instead you want to apply that calculation to every row where the condition is True, you should do something like:
condition = df[4] > df[3] + df[4]
true_df = df[condition]
true_df["account_new"] = true_df['account']-true_df['account'] *.10
but now the column "account_new" exists only in true_df, and not in df.
With something like
df["account_new"] = true_df["account_new"]
now df also has the column "account_new", but in the rows where the condition is False you have NaNs...
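A more idiomatic route (a sketch, reusing the numpy import from the question) is to build the column in one vectorized step, so no if statement is needed and the False rows keep their original value instead of NaN:
condition = df[4] > df[3] + df[4]
# np.where picks the reduced balance where the condition holds, else the original
df['account_new'] = np.where(condition, df['account'] - df['account'] * .10, df['account'])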