This question already has answers here:
Pyspark dataframe operator "IS NOT IN"
(7 answers)
Closed 3 years ago.
I have a pyspark sc initialized.
instance = (data
    .filter(lambda x: len(x) != 0)
    .filter(lambda x: '%auth%login%' not in x)  # my attempt; SQL-style % wildcards don't work here
    .map(lambda x: function(x))
    .reduceByKey(lambda x, y: x + y))
My goal is to filter out any url that has both the auth and login keywords in it, but they could appear at any position in the string.
In SQL I could use the pattern %auth%login%, where % matches any string of any length.
How do I do this easily in pyspark syntax?
Forgot to mention: there are 'auth' pages I do not want to filter out; I only want to filter out a url with auth when login is also in the string.
I am not sure why this is flagged as a duplicate; this is an RDD, not a dataframe.
Using the PySpark RDD filter method, you just need to make sure at least one of login or auth is NOT in the string. In Python code:
data.filter(lambda x: any(e not in x for e in ['login', 'auth']) ).collect()
In case you are using a dataframe, you are looking for contains:
# url is the column name
df = df.filter(~(df.url.contains('auth') & df.url.contains('login')))
This keeps a row unless the url contains both keywords. When you are working with an RDD, please have a look at the answer of jxc.
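The heart of both answers is the same boolean predicate, so as a quick sketch it can be checked with plain Python strings before wiring it into an RDD pipeline (the sample URLs below are made up for illustration):

```python
# Hypothetical sample data: keep a url unless it contains BOTH 'auth' and 'login'.
urls = [
    "/auth/profile",          # has 'auth' only  -> keep
    "/auth/login?next=home",  # has both         -> drop
    "/home",                  # has neither      -> keep
]

def keep(url):
    # Equivalent to: any(e not in url for e in ['login', 'auth'])
    return not ("auth" in url and "login" in url)

kept = [u for u in urls if keep(u)]
print(kept)  # ['/auth/profile', '/home']
```

The same `keep` lambda can then be passed straight to `data.filter(...)`.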
This question already has answers here:
How to convert a timezone aware string to datetime in Python without dateutil?
(7 answers)
Closed 3 years ago.
I'm using Django and Python 3.7. I'm parsing some strings into dates using this logic ...
created_on = datetime.strptime(created_on_txt, '%Y-%m-%dT%H:%M:%S+00:00')
print(created_on.tzinfo)
How do I tell the parser that the time zone of the string should be interpreted as UTC? When I print ".tzinfo", it reads "None". An example of something I'm parsing is
2019-04-08T17:03:00+00:00
Python 3.7, lucky you:
created_on = datetime.fromisoformat('2019-04-08T17:03:00+00:00')
created_on.tzinfo
>>> datetime.timezone.utc
If you insist on using strptime():
created_on = datetime.strptime('2019-04-08T17:03:00+00:00', '%Y-%m-%dT%H:%M:%S%z')
created_on.tzinfo
>>> datetime.timezone.utc
This uses the %z directive.
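The two approaches from the answer can be checked side by side; this is a small standard-library-only sketch using the example string from the question:

```python
from datetime import datetime, timezone

s = '2019-04-08T17:03:00+00:00'

# Python 3.7+: parse the ISO 8601 string directly.
a = datetime.fromisoformat(s)

# strptime with the %z directive parses the +00:00 offset into a tzinfo.
b = datetime.strptime(s, '%Y-%m-%dT%H:%M:%S%z')

print(a.tzinfo)  # no longer None: it is UTC
print(a == b)    # True: both are the same aware datetime
```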
This question already has answers here:
How do I convert a Pandas series or index to a NumPy array? [duplicate]
(8 answers)
Closed 4 years ago.
I am new to pandas and python. My input data is like
category text
1 hello iam fine. how are you
1 iam good. how are you doing.
inputData = pd.read_csv('Input', sep='\t', names=['category','text'])
X = inputData["text"]
Y = inputData["category"]
here Y is the pandas Series object, which I want to convert into a numpy array, so I tried .as_matrix:
YArray= Y.as_matrix(columns=None)
print YArray
But I got the output as [1, 1] (which looks wrong to me, since I have only one column, category, and two rows). I want the result as a 2x1 matrix.
To get numpy array, you need
Y.values
Try this after applying .as_matrix() on your Series object:
Y.as_matrix().reshape((2, 1))
since .as_matrix() only returns a numpy array, NOT a numpy matrix.
If df is your dataframe, then a column of the dataframe is a Series, and to convert it into an array:
df = pd.DataFrame()
x = df.values
print(type(x))
The following prints,
<class 'numpy.ndarray'>
successfully converting it to an array.
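Note that .as_matrix() was deprecated and later removed from pandas; a hedged sketch of the current equivalents, reproducing the two-row example from the question:

```python
import pandas as pd

# Rebuild the question's two-row input inline instead of reading a file.
inputData = pd.DataFrame({
    "category": [1, 1],
    "text": ["hello iam fine. how are you",
             "iam good. how are you doing."],
})
Y = inputData["category"]

YArray = Y.to_numpy()            # preferred since pandas 0.24; same values as Y.values
print(YArray)                    # [1 1], a 1-D array of shape (2,)

column = YArray.reshape((2, 1))  # the 2x1 shape the question asks for
print(column.shape)              # (2, 1)
```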
I have 10 dataframes and I'm trying to merge the data in them on the variable names. The purpose is to get one file which would contain all the data from the relevant variables
I'm using the code mentioned below:
pd.merge(df,df1,df2,df3,df4,df5,df6,df7,df8,df9,df10, on = ['RSSD9999', 'RCFD0010','RCFD0071','RCFD0081','RCFD1400','RCFD1773','RCFD2123','RCFD2145','RCFD2160','RCFD3123','RCFD3210','RCFD3300','RCFD3360','RCFD3368','RCFD3792','RCFD6631','RCFD6636','RCFD8274','RCFD8275','RCFDB530','RIAD4000','RIAD4073','RIAD4074','RIAD4079','RCFD1403','RCON3646','RIAD4230','RIAD4300','RIAD4301','RIAD4302','RIAD4340','RIAD4475','RCFD1406','RCFD3230','RCFD2950','RCFD3632','RCFD3839','RCFDB529','RCFDB556','RCON0071','RCON0081','RCON0426','RCON2145','RCON2148','RCON2168','RCON2938','RCON3210','RCON3230','RCON3300','RCON3839','RCONB528','RCONB529','RCONB530','RCONB556','RCONB696','RCONB697','RCONB698','RCONB699','RCONB700','RCONB701','RCONB702','RCONB703','RCONB704','RCON1410','RCON6835','RCFD2210','RCONA223','RCONA224','RCON5311','RCON5320','RCON5327','RCON5334','RCON5339','RCON5340','RCON7204','RCON7205','RCON7206','RCON3360','RCON3368','RCON3385','RIAD3217','RCFDA222','RCFDA223','RCFDA224','RCON3792','RCON0391','RCFD7204','RCFD7206','RCFD7205','RCONB639','RIADG104','RCFDG105','RSSD9017','RSSD9010','RSSD9042','RSSD9050'],how='outer')
But I'm getting an error "merge() got multiple values for keyword argument 'on'". I think the code is correct; can anyone help me understand what's wrong here?
Firstly, you are merging 10 dataframes. That is possible, but all the dataframes must share at least one common column, and pd.merge() only combines two dataframes per call, so you have to chain the merges:
import pandas as pd
df = pd.DataFrame(data, columns=[your_columns], index=[index_names])
df = df.set_index('common_col')
# do this for all ten dataframes
answer = pd.merge(df, df1, on='column_name', how='outer')  # then merge the result with df2, and so on
pd.merge() only takes two dataframes at a time, so chain the calls:
Ray.merge(Bec, on='Key', how='outer').merge(Dan, on='Key', how='outer')
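Since pd.merge() only accepts two frames at a time (which is exactly why passing eleven of them raises the "multiple values for keyword argument" error), one common sketch is to fold the list of dataframes with functools.reduce; the column names below are made up for illustration, not the RCFD/RCON codes from the question:

```python
from functools import reduce
import pandas as pd

# Hypothetical small frames sharing the key column 'RSSD9999'.
df1 = pd.DataFrame({"RSSD9999": [1, 2], "a": [10, 20]})
df2 = pd.DataFrame({"RSSD9999": [2, 3], "b": [30, 40]})
df3 = pd.DataFrame({"RSSD9999": [1, 3], "c": [50, 60]})

frames = [df1, df2, df3]

# Merge two at a time: ((df1 ⋈ df2) ⋈ df3), keeping all keys with an outer join.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="RSSD9999", how="outer"),
    frames,
)
print(merged.shape)  # (3, 4): three keys, one key column plus a, b, c
```

The same pattern scales to the ten dataframes in the question; only the `frames` list and the `on=` key list change.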
This question already has answers here:
Python - printing multiple shortest and longest words from a list
(6 answers)
Closed 6 years ago.
I have an original list with three words: 'hello', 'hi' and 'Good day'.
I want to create a new_list containing the item with the shortest length. How do I write that code? I use Python 2.7.
original_list=['hello','hi','Good day']
Word        Length
'hello'     5
'hi'        2
'Good day'  8
Expected output (since I only want the item with the shortest length):
new_list=['hi']
min() takes a key function as an argument, use len as a key function:
>>> original_list = ['hello','hi','Good day']
>>> new_list = [min(original_list, key=len)]
>>> new_list
['hi']
This though does not handle multiple items with the same shortest length - e.g. ['hello','hi','Good day', 'me'] input list.
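To also cover ties, one sketch is to compute the minimum length first and then keep every word of that length:

```python
original_list = ['hello', 'hi', 'Good day', 'me']

# First find the shortest length, then keep all words matching it.
shortest = min(len(w) for w in original_list)
new_list = [w for w in original_list if len(w) == shortest]
print(new_list)  # ['hi', 'me']
```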