PySpark dataframe with list having null values

I have seen PySpark dataframes that contain lists of values like [2,,3,,,4]. The values between the commas look like nulls, but they are not shown as 'null' in the list. Could someone explain how this kind of list is generated?
Thanks,
J

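They are empty strings, not nulls. Splitting a comma-delimited string on ',' leaves an empty string wherever two commas are adjacent:
import pyspark.sql.functions as F
......
data = [
    ('2,,3,,,4',)
]
df = spark.createDataFrame(data, ['col'])
df = df.withColumn('col', F.split('col', ','))
df.printSchema()
df.show(truncate=False)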
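
As a rough plain-Python analogy (not PySpark itself), Python's own str.split behaves the same way on this input, which may make the effect easier to see without a Spark session. The clean-up step that maps empty strings to None is an assumption about what one might want downstream, not something from the answer above.

```python
# Plain-Python analogy for F.split(col, ','): consecutive commas leave
# empty strings between them, not None/null values.
raw = '2,,3,,,4'
parts = raw.split(',')
print(parts)  # ['2', '', '3', '', '', '4']

# Hypothetical clean-up step: map the empty strings to None if real
# nulls are wanted downstream.
with_nulls = [p if p != '' else None for p in parts]
print(with_nulls)  # ['2', None, '3', None, None, '4']
```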

Related

Django ValueError Can only compare identically-labeled Series objects

I am getting this error. One dataframe, df, is read from a JSON API and the second, df2, is read from a CSV. I want to compare one column of the CSV against the API data and then save the matched values into a new CSV. Can anyone help me?
df2=pd.read_csv(file_path)
r = requests.get('https://data.ct.gov/resource/6tja-6vdt.json')
df = pd.DataFrame(r.json())
df['verified'] = np.where(df['salespersoncredential'] == df2['salespersoncredential'],'True', 'False')
print(df)

Probably just replace
df['verified'] = np.where(df['salespersoncredential'] == df2['salespersoncredential'],'True', 'False')
with this:
df['verified'] = df['salespersoncredential'] == df2['salespersoncredential']
assuming the dtypes are correct.
If the indexes are different on the two dataframes, you might need to .reset_index().
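
If the two frames have different lengths, an element-wise == will likely still raise the same error, because the two Series are not identically labeled. A minimal sketch of a membership-based alternative, assuming the goal is to flag API rows whose credential appears anywhere in the CSV; the small frames and their values below are hypothetical stand-ins for the real data:

```python
import pandas as pd

# Stand-in frames (hypothetical values); in the question, df comes from the
# JSON API and df2 from the CSV, and they can have different lengths.
df = pd.DataFrame({'salespersoncredential': ['A1', 'B2', 'C3', 'D4']})
df2 = pd.DataFrame({'salespersoncredential': ['B2', 'D4']})

# isin() tests membership, so the two frames do not need matching indexes.
df['verified'] = df['salespersoncredential'].isin(df2['salespersoncredential'])

# Keep only the matched rows; these could then be written with to_csv().
matched = df[df['verified']]
print(matched)
```

This yields a boolean column rather than the 'True'/'False' strings used in the question, which is usually easier to filter on.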

Convert array of rows into array of strings in pyspark

I have a dataframe with 2 columns, and I got the array below by calling df.collect().
array = [Row(name=u'Alice', age=10), Row(name=u'Bob', age=15)]
Now I want to get an output array like the one below.
new_array = ['Alice', 'Bob']
Could anyone please let me know how to extract this output using PySpark? Any help would be appreciated.
Thanks

# Creating the base dataframe.
values = [('Alice',10),('Bob',15)]
df = sqlContext.createDataFrame(values,['name','age'])
df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 10|
|  Bob| 15|
+-----+---+
df.collect()
[Row(name='Alice', age=10), Row(name='Bob', age=15)]
# Use list comprehensions to create a list.
new_list = [row.name for row in df.collect()]
print(new_list)
['Alice', 'Bob']

I see two columns, name and age, in the df. If you only want the name column displayed, you can select it like this:
df.select("name").show()
This will show only the names.
Tip: you can also use df.show() instead of df.collect(). That displays the data in tabular form instead of as Row(...) objects.

data frame with pandas not outputing tabular

I have been working on extracting data from a large number of files. I want to form a table of the data, with the file base name as the left most column and the numerical data in the next. So far, I have been testing on a folder containing 8 files, but am hoping to be able to read hundreds.
I have tried adding an index, but that seemed to cause more problems. I am attaching the closest working code I have come up with, alongside the output.
In:
import re, glob
import pandas as pd
pattern = re.compile('-\d+\D\d+\skcal/mol', flags=re.S)
for file in glob.glob('*rank_*.pdb'):
    with open(file) as fp:
        for result in pattern.findall(fp.read()):
            Dock_energy = {file:[],result:[]}
            df = pd.DataFrame(Dock_energy)
            df.append(df)
            df = df.append(df)
            print(df)
This seems to work for extracting the data, but it is not in the form I am looking for.
Out:
Empty DataFrame
Columns: [-10.02 kcal/mol, MII_rank_8.pdb]
Index: []
Empty DataFrame
Columns: [-12.51 kcal/mol, MII_rank_5.pdb]
Index: []
Empty DataFrame
Columns: [-13.47 kcal/mol, MII_rank_4.pdb]
Index: []
Empty DataFrame
Columns: [-14.67 kcal/mol, MII_rank_2.pdb]
Index: []
Empty DataFrame
Columns: [-13.67 kcal/mol, MII_rank_3.pdb]
Index: []
Empty DataFrame
Columns: [-14.80 kcal/mol, MII_rank_1.pdb]
Index: []
Empty DataFrame
Columns: [-11.45 kcal/mol, MII_rank_7.pdb]
Index: []
Empty DataFrame
Columns: [-12.47 kcal/mol, MII_rank_6.pdb]
Index: []
What is causing the fractured table, and why are my columns in the reverse order from what I was hoping for? Any help is greatly appreciated.

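This should be closer to what you intend:
# Reuses the imports and `pattern` from the question.
all_data = []
for file in glob.glob('*rank_*.pdb'):
    with open(file) as fp:
        file_data = []
        # One [file, result] row per match found in this file.
        for result in pattern.findall(fp.read()):
            file_data.append([file, result])
        all_data.extend(file_data)
# Build the table once, after all files have been read.
df = pd.DataFrame(all_data, columns=['file', 'result'])
print(df)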
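
To see why the question's version printed a series of empty frames, here is a small sketch using one hypothetical file name and match taken from the output above. In the question's dict, the file name and the match are used as keys, so they become column names, and the empty lists supply no rows. Separately, DataFrame.append returned a new frame rather than modifying in place, and it has been removed in pandas 2.0, so collecting rows in a list first is the safer pattern.

```python
import pandas as pd

# Stand-in values for one file name and one regex match (hypothetical).
file, result = 'MII_rank_1.pdb', '-14.80 kcal/mol'

# The question's construction: both strings become column names,
# and the empty lists mean there are no rows at all.
as_columns = pd.DataFrame({file: [], result: []})
print(as_columns.empty, len(as_columns.columns))  # True 2

# Putting the strings in a row under fixed column names gives a table.
as_rows = pd.DataFrame([[file, result]], columns=['file', 'result'])
print(as_rows)
```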

RDD to DataFrame in pyspark (columns from rdd's first element)

I have created an RDD from a CSV file, and the first row of that file is the header line. Now I want to create a dataframe from that RDD and take the column names from the first element of the RDD.
The problem is that I can create the dataframe with column names from rdd.first(), but the resulting dataframe also has the header as its first row. How do I remove that?
lines = sc.textFile('/path/data.csv')
rdd = lines.map(lambda x: x.split('#####')) ###multiple char sep can be there #### or ### , so can't directly read csv to a dataframe
#rdd: [[u'mailid', u'age', u'address'], [u'satya', u'23', u'Mumbai'], [u'abc', u'27', u'Goa']] ###first element is the header
df = rdd.toDF(rdd.first()) ###retaining the columns from rdd.first()
df.show()
#mailid age address
mailid age address ####I don't want this as dataframe data
satya 23 Mumbai
abc 27 Goa
How can I keep that first element from ending up in the dataframe data? Is there an option I can pass to rdd.toDF(rdd.first()) to get that done?
Note: I can't collect the RDD into a list, remove the first item from that list, parallelize the list back into an RDD, and then call toDF().
Please suggest! Thanks

You will have to remove the header from your RDD. One way to do it, using your rdd variable, is the following:
>>> header = rdd.first()
>>> header
# ['mailid', 'age', 'address']
>>> data = rdd.filter(lambda row : row != header).toDF(header)
>>> data.show()
# +------+---+-------+
# |mailid|age|address|
# +------+---+-------+
# | satya| 23| Mumbai|
# |   abc| 27|    Goa|
# +------+---+-------+
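
One caveat with filtering by value is that it drops every row equal to the header, not only the first one. The sketch below uses a plain Python list as a stand-in for the RDD contents shown in the question, purely to illustrate the difference between filtering by value and filtering by position. In PySpark itself, a position-based filter could presumably be built on rdd.zipWithIndex(), but that variant is not shown or tested here.

```python
# Plain-Python stand-in for the RDD contents shown in the question.
rows = [['mailid', 'age', 'address'],
        ['satya', '23', 'Mumbai'],
        ['abc', '27', 'Goa']]

header = rows[0]  # plays the role of rdd.first()

# Same idea as rdd.filter(lambda row: row != header): note this drops every
# row equal to the header, not only the first one.
by_value = [row for row in rows if row != header]

# Position-based alternative: drop only index 0, whatever it contains.
by_position = [row for i, row in enumerate(rows) if i != 0]

print(by_value)
print(by_position)
```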

Python - Create An Empty Pandas DataFrame and Populate From Another DataFrame Using a For Loop

Using: Python 2.7 and Pandas 0.11.0 on Mac OSX Lion
I'm trying to create an empty DataFrame and then populate it from another dataframe, based on a for loop.
I have found that when I construct the DataFrame and then use the for loop as follows:
data = pd.DataFrame()
for item in cols_to_keep:
    if item not in dummies:
        data = data.join(df[item])
the result is an empty DataFrame, but with the headers of the appropriate columns to be added from the other DataFrame.

That's because you are using join incorrectly.
You can use a list comprehension to restrict the DataFrame to the columns you want:
df[[col for col in cols_to_keep if col not in dummies]]
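
A small sketch of what is probably going on with join here: DataFrame.join defaults to how='left', so the result keeps only the index labels of the left frame, and an empty frame has none. The toy frame below is hypothetical; the how='outer' call is shown only to contrast the behavior, not as a recommendation over the column selection above.

```python
import pandas as pd

# Hypothetical source frame with three rows.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
empty = pd.DataFrame()

# Default left join: the empty left index contributes no rows,
# so only the column header 'a' carries over.
left = empty.join(df['a'])
print(left.empty, list(left.columns))

# An outer join keeps the row labels from df as well.
outer = empty.join(df['a'], how='outer')
print(len(outer))
```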

What about just creating a new frame based off of the columns you know you want to keep, instead of creating an empty one first?
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.random.randn(5),
                   'b': np.random.randn(5),
                   'c': np.random.randn(5),
                   'd': np.random.randn(5)})
cols_to_keep = ['a', 'c', 'd']
dummies = ['d']
not_dummies = [x for x in cols_to_keep if x not in dummies]
data = df[not_dummies]
data
          a         c
0  2.288460  0.698057
1  0.097110 -0.110896
2  1.075598 -0.632659
3 -0.120013 -2.185709
4 -0.099343  1.627839
