I have a dataframe with 2 columns, and I got the array below by doing df.collect().
array = [Row(name=u'Alice', age=10), Row(name=u'Bob', age=15)]
Now I want to get an output array like below.
new_array = ['Alice', 'Bob']
Could anyone please let me know how to extract the above output using pyspark? Any help would be appreciated.
Thanks
# Creating the base dataframe.
values = [('Alice',10),('Bob',15)]
df = sqlContext.createDataFrame(values,['name','age'])
df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 10|
|  Bob| 15|
+-----+---+
df.collect()
[Row(name='Alice', age=10), Row(name='Bob', age=15)]
# Use list comprehensions to create a list.
new_list = [row.name for row in df.collect()]
print(new_list)
['Alice', 'Bob']
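If you would rather not collect the full Row objects first, a small alternative sketch that selects the column before collecting (same result):
# Select only the name column, then flatten each one-field Row into a plain value.
new_list = df.select('name').rdd.flatMap(lambda row: row).collect()
print(new_list)
['Alice', 'Bob']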
I see two columns name and age in the df. Now, you want only the name column to be displayed.
You can select it like:
df.select("name").show()
This will show you only the names.
Tip: also, use df.show() instead of df.collect(). That will show the data in tabular form instead of as Row(...) objects.
Related
I see that some PySpark dataframes have lists of values like [2,,3,,,4]. The values between the commas look null, but they are not 'null' in the list. Could someone suggest how this kind of list is generated?
Thanks,
J
They are empty strings.
import pyspark.sql.functions as F
......
data = [
('2,,3,,,4',)
]
df = spark.createDataFrame(data, ['col'])
df = df.withColumn('col', F.split('col', ','))
df.printSchema()
df.show(truncate=False)
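Running df.show(truncate=False) on that should print roughly the following; the gaps between the commas are empty strings rather than nulls:
+---------------+
|col            |
+---------------+
|[2, , 3, , , 4]|
+---------------+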
I have two columns in a pandas DataFrame, both of which also contain a lot of null values. Some values in column B exist partially in a field (or multiple fields) of column A. I want to check whether a value of B exists in A, and if so, separate that value out and add it as a new row in column A.
Example:
Column A | Column B
black bear | null
black box | null
red fox | null
red fire | null
green tree | null
null | red
null | yellow
null | black
null | red
null | green
And I want the following:
Column A
black
bear
box
red
fire
fox
yellow
green
Does anyone have any tips on how to get this result? I have tried using regex (re.match), but I am struggling with the fact that I do not have a fixed pattern but a variable (namely, any value in column B). This is my effort:
import re

list_A = df['Column A'].values.tolist()
list_B = df['Column B'].values.tolist()

for i in list_A:
    for j in list_B:
        if i is not None:
            if re.match('{}.+'.format(j), i):
                ...
Note: the columns are over 2500 rows long.
If I understand your question correctly (you want to split the value of b off from the value in a whenever b is found in a, and then store the separated values on their own), then how about trying the following?
import re

list_A = df['Column A'].values.tolist()
list_B = df['Column B'].values.tolist()
list_of_separated_values = []
for a in list_A:
    for b in list_B:
        # Skip null entries so the membership test does not fail.
        if a is None or b is None:
            continue
        if b in a:
            # re.split with a capturing group keeps the delimiter (b) itself;
            # keep only the non-empty pieces, stripped of whitespace.
            list_of_separated_values.extend(
                [val.strip() for val in re.split('({})'.format(b), a) if val.strip()])
This is not a regex question. You have your data in a dataframe; use the dataframe functionality to fix it.
Assuming data_frame is your pandas DataFrame.
# build a boolean mask of the rows where Column A is null
mask = data_frame["Column A"].isnull()
# in those rows, assign Column B to Column A (use .loc so the original frame actually gets updated)
data_frame.loc[mask, "Column A"] = data_frame.loc[mask, "Column B"]
# set Column B to null/None (I'm assuming you want this, or this step can be skipped)
data_frame.loc[mask, "Column B"] = None
print(data_frame)
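With the example columns from the question, the frame after those assignments should look roughly like this (the exact rendering of the remaining nulls depends on the dtypes):
     Column A Column B
0  black bear     None
1   black box     None
2     red fox     None
3    red fire     None
4  green tree     None
5         red     None
6      yellow     None
7       black     None
8         red     None
9       green     None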
I have a DataFrame with 2 columns. Column 1 is "code", which can repeat more than once, and column 2 is "Values". For example, column 1 is 1,1,1,5,5 and column 2 is 15,18,24,38,41. What I want to do is first sort by the two columns ( df.sort("code","Values") ), and then do a groupBy on "Code" and aggregate the "Values". However, I want to apply a UDF to the values, so I need to pass the "Values" of each code as a list to the UDF. I am not sure how many "Values" each code will have; as you can see in this example, "Code" 1 has 3 values and "Code" 5 has 2 values. So for each "Code" I need to pass all the "Values" of that "Code" as a list to the UDF.
You can do a groupBy and then use the collect_set or collect_list function in pyspark. Below is an example dataframe for your use case (I hope this is what you are referring to):
from pyspark import SparkContext
from pyspark.sql import HiveContext
import pyspark.sql.functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
("code1", "val1"),
("code1", "val2"),
("code1", "val3"),
("code2", "val1"),
("code2", "val2"),
], ["code", "val"])
df.show()
+-----+-----+
| code| val |
+-----+-----+
|code1|val1 |
|code1|val2 |
|code1|val3 |
|code2|val1 |
|code2|val2 |
+-----+-----+
Now the groupBy and collect_list command:
(df
.groupby("code")
.agg(F.collect_list("val"))
.show())
Output:
+------+------------------+
|code  |collect_list(val) |
+------+------------------+
|code1 |[val1, val2, val3]|
|code2 |[val1, val2]      |
+------+------------------+
Above, you get the list of aggregated values in the second column.
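Since the goal is to apply a UDF to each code's list of values, here is a minimal sketch of that last step; the UDF name and its logic (just counting the values) are placeholders for your own function:
from pyspark.sql.types import IntegerType

# Placeholder UDF: replace the lambda with whatever you need to do with the list.
process_values = F.udf(lambda vals: len(vals), IntegerType())

(df
 .groupby("code")
 .agg(F.collect_list("val").alias("vals"))
 .withColumn("result", process_values("vals"))
 .show())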
I have created an RDD from a CSV file, and the first row of that CSV file is the header line. Now I want to create a dataframe from that RDD and retain the column names from the first element of the RDD.
The problem is that I am able to create the dataframe with the columns from rdd.first(), but the created dataframe has its first row as the headers themselves. How do I remove that?
lines = sc.textFile('/path/data.csv')
rdd = lines.map(lambda x: x.split('#####')) ### a multi-char separator like #### or ### can be present, so I can't read the csv directly into a dataframe
#rdd: [[u'mailid', u'age', u'address'], [u'satya', u'23', u'Mumbai'], [u'abc', u'27', u'Goa']] ###first element is the header
df = rdd.toDF(rdd.first()) ### retaining the columns from rdd.first()
df.show()
#mailid age address
mailid age address ####I don't want this as dataframe data
satya 23 Mumbai
abc 27 Goa
How can I avoid that first element ending up in the dataframe data? Is there any option I can give in rdd.toDF(rdd.first()) to get that done?
Note: I can't collect the rdd into a list, remove the first item from that list, then parallelize that list back into an rdd again and call toDF()...
Please suggest! Thanks
You will have to remove the header from your RDD. One way to do it is the following, considering your rdd variable:
>>> header = rdd.first()
>>> header
# ['mailid', 'age', 'address']
>>> data = rdd.filter(lambda row : row != header).toDF(header)
>>> data.show()
# +------+---+-------+
# |mailid|age|address|
# +------+---+-------+
# | satya| 23| Mumbai|
# | abc| 27| Goa|
# +------+---+-------+
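If comparing every row against the header feels wasteful, an alternative sketch is to drop the first element by its index using zipWithIndex (still with header = rdd.first() as above); note that zipWithIndex itself triggers a Spark job when the RDD has more than one partition:
>>> no_header = rdd.zipWithIndex().filter(lambda pair: pair[1] > 0).map(lambda pair: pair[0])
>>> no_header.toDF(header).show()
# same output as above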
For example: I have four pandas DataFrames, df1, df2, df3, df4, and my work process for these 4 dataframes is the same.
How can I define i = (1, 2, 3, 4) and link it with "df", so that I don't have to change "df1" -> "df2/3/4" so many times?
Whenever you have numbered variable names, think about using a list instead. For example:
dfs = [df1, df2, df3, df4]
for df in dfs:
....
Moreover, it might behoove you to refactor the code defining df1, df2, df3 and df4 so as to eliminate those variables and define the list dfs alone. Then, instead of df2, for instance, you would just refer to dfs[1]. Instead of
df1 = ...
df2 = ...
df3 = ...
df4 = ...
you would use something like
dfs = []
for i in range(4):
dfs.append(...)
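For instance, if each dataframe happens to come from a CSV file (the file names below are just placeholders), the loop could look like this:
import pandas as pd

# Placeholder file names; substitute whatever actually produces df1..df4.
paths = ['data1.csv', 'data2.csv', 'data3.csv', 'data4.csv']

dfs = []
for path in paths:
    dfs.append(pd.read_csv(path))

# The same processing now applies to every dataframe in one place.
for df in dfs:
    print(df.shape)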