pyspark dataframe change column with two arrays into columns - python-2.7

I've been searching around and haven't figured out a way to dynamically restructure a dataframe column into new columns based on the array contents. I'm new to Python, so I might be searching with the wrong terms, which would explain why I haven't found a clear example yet. Please let me know if this is a duplicate and link to it. I think I just need to be pointed in the right direction.
Ok, the details.
The environment is pyspark 2.3.2 and python 2.7
The sample column contains two arrays that correlate to each other one to one. I would like to create a column for each value in the titles array and put the corresponding name (from the person array) in that column.
I cobbled together an example to focus on my problem of reshaping the dataframe.
import json
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
from pyspark.sql import functions as f
input = ({ "sample": { "titles": ["Engineer", "Designer", "Manager"], "person": ["Mary", "Charlie", "Mac"] }, "location": "loc a"},
         { "sample": { "titles": ["Engineer", "Owner"], "person": ["Tom", "Sue"] }, "location": "loc b"},
         { "sample": { "titles": ["Engineer", "Designer"], "person": ["Jane", "Bill"] }, "location": "loc a"})
a = [json.dumps(input)]
jsonRDD = sc.parallelize(a)
df = spark.read.json(jsonRDD)
This is the schema of my dataframe:
In [4]: df.printSchema()
root
 |-- location: string (nullable = true)
 |-- sample: struct (nullable = true)
 |    |-- person: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- titles: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
My dataframe data:
In [5]: df.show(truncate=False)
+--------+-----------------------------------------------------+
|location|sample                                               |
+--------+-----------------------------------------------------+
|loc a   |[[Mary, Charlie, Mac], [Engineer, Designer, Manager]]|
|loc b   |[[Sue, Tom], [Owner, Engineer]]                      |
|loc a   |[[Jane, Bill], [Engineer, Designer]]                 |
+--------+-----------------------------------------------------+
And what I would like my dataframe to look like:
+--------+-----------------------------------------------------+------------+-----------+---------+---------+
|location|sample                                               |Engineer    |Designer   |Manager  |Owner    |
+--------+-----------------------------------------------------+------------+-----------+---------+---------+
|loc a   |[[Mary, Charlie, Mac], [Engineer, Designer, Manager]]|Mary        |Charlie    |Mac      |         |
|loc b   |[[Sue, Tom], [Owner, Engineer]]                      |Tom         |           |         |Sue      |
|loc a   |[[Jane, Bill], [Engineer, Designer]]                 |Jane        |Bill       |         |         |
+--------+-----------------------------------------------------+------------+-----------+---------+---------+
I've tried to use the explode function, but that only gives me more records, each still containing the array field. There are some examples on Stack Overflow, but they use static column names. This dataset can have the titles in any order, and new titles can be added later.

Without explode
First convert each struct to a map:
from pyspark.sql.functions import udf

@udf("map<string,string>")
def as_dict(x):
    return dict(zip(*x)) if x else None

dfmap = df.withColumn("sample", as_dict("sample"))
Then use the method shown in PySpark converting a column of type 'map' to multiple columns in a dataframe to split the map into columns (a sketch of that step follows).
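Roughly, that linked method boils down to the sketch below; it is only an outline, and the accepted solution at the bottom of this post does the same thing with the full session output:
from pyspark.sql import functions as f

# collect the distinct map keys to use as the new column names
keys = (dfmap
        .select(f.explode("sample"))
        .select("key")
        .distinct()
        .rdd.flatMap(lambda x: x)
        .collect())

# look each key up in the map and expose it as its own column
exprs = [f.col("sample").getItem(k).alias(k) for k in keys]
dfmap.select("location", *exprs).show()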
With explode
Add a unique id using monotonically_increasing_id.
Use one of the methods shown in Pyspark: Split multiple array columns into rows to explode both arrays together, or explode the map created with the first method.
Pivot the result: group by the added id and any other fields you want to preserve, pivot by title, and take first(person). A sketch of this approach follows.
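A rough sketch of those steps, assuming dfmap holds a map column with the titles as keys (swap the zip arguments in as_dict if needed, as the full solution below does):
from pyspark.sql import functions as f

# a stable id keeps rows with the same location from collapsing in the groupBy
dfid = dfmap.withColumn("id", f.monotonically_increasing_id())

# explode the map into one (key, value) row per entry, then pivot the keys
# (the titles) back into columns, taking the first person for each
result = (dfid
          .select("id", "location", f.explode("sample"))
          .groupBy("id", "location")
          .pivot("key")
          .agg(f.first("value"))
          .drop("id"))

result.show()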

@user10601094 helped me get this question answered. I'm posting the full solution below to help anyone else that might have a similar question.
I'm not very fluent in Python, so please feel free to suggest better approaches.
In [1]: import json
...: from pyspark.sql import functions as f
...:
In [2]: # define a sample data set
...: input = { "sample": { "titles": ["Engineer", "Designer", "Manager"], "person": ["Mary", "Charlie", "Mac"] }, "location": "loc a"},{ "sample": { "titles": ["Engineer", "Owner"],
...: "person": ["Tom", "Sue"] }, "location": "loc b"},{ "sample": { "titles": ["Engineer", "Designer"], "person": ["Jane", "Bill"] }, "location": "loc a"}
In [3]: # create a dataframe with the sample json data
...: a = [json.dumps(input)]
...: jsonRDD = sc.parallelize(a)
...: df = spark.read.json(jsonRDD)
...:
In [4]: # Change the array in the sample column to a dictionary
...: # swap the columns so the titles are the key
...:
...: # UDF to convert 2 arrays into a map
...: @f.udf("map<string,string>")
...: def as_dict(x):
...:     return dict(zip(x[1], x[0])) if x else None
...:
In [5]: # create a new dataframe based on the original dataframe
...: dfmap = df.withColumn("sample", as_dict("sample"))
In [6]: # Convert sample column to be title columns based on the map
...:
...: # get the column names, stored in the keys
...: keys = (dfmap
...:         .select(f.explode("sample"))
...:         .select("key")
...:         .distinct()
...:         .rdd.flatMap(lambda x: x)
...:         .collect())
In [7]: # create a list of column expressions, one per key
...: exprs = [f.col("sample").getItem(k).alias(k) for k in keys]
...:
In [8]: dfmap.select(dfmap.location, *exprs).show()
+--------+--------+--------+-------+-----+
|location|Designer|Engineer|Manager|Owner|
+--------+--------+--------+-------+-----+
|   loc a| Charlie|    Mary|    Mac| null|
|   loc b|    null|     Tom|   null|  Sue|
|   loc a|    Bill|    Jane|   null| null|
+--------+--------+--------+-------+-----+
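Side note: the desired output above also keeps the sample column. Since sample is now a map rather than the original struct, it will display a bit differently, but you can simply select it alongside the generated columns if you still want it:
dfmap.select("location", "sample", *exprs).show(truncate=False)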

Related

How to extract values from a column and have it as float in pyspark?

I have a pyspark dataframe that visually looks like the following. I want the column to hold float values only. Please note that currently the values have square brackets around them.
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, IntegerType, ArrayType

data = [
    ("Smith", "OH", "[55.5]"),
    ("Anna", "NY", "[33.3]"),
    ("Williams", "OH", "[939.3]"),
]
schema = StructType([
    StructField('name', StringType(), True),
    StructField('state', StringType(), True),
    StructField('salary', StringType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
Input:
+--------+-----+-------+
|name    |state|salary |
+--------+-----+-------+
|Smith   |OH   |[55.5] |
|Anna    |NY   |[33.3] |
|Williams|OH   |[939.3]|
+--------+-----+-------+
And the output should look like,
+--------+-----+------------------+
|name    |state|float_value_salary|
+--------+-----+------------------+
|Smith   |OH   |55.5              |
|Anna    |NY   |33.3              |
|Williams|OH   |939.3             |
+--------+-----+------------------+
Thank you for any help.
You can trim the square brackets and cast to float:
import pyspark.sql.functions as F
df2 = df.withColumn('salary', F.expr("float(trim('[]', salary))"))
df2.show()
+--------+-----+------+
|    name|state|salary|
+--------+-----+------+
|   Smith|   OH|  55.5|
|    Anna|   NY|  33.3|
|Williams|   OH| 939.3|
+--------+-----+------+
Or you can use from_json to parse it as an array of float, and get the first array element:
df2 = df.withColumn('salary', F.from_json('salary', 'array<float>')[0])
You can use regex:
import pyspark.sql.functions as F
df.select(
    F.regexp_extract('salary', '([\d\.]+)', 1).cast('float').alias('salary')
).show()
Output:
+------+
|salary|
+------+
|  55.5|
|  33.3|
| 939.3|
+------+
You need to parse the string into a float array using a UDF, and then you can explode the array to get the single value inside it.
The program would be as follows:
import json
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

def parse_value_from_string(x):
    res = json.loads(x)
    return res

parse_float_array = F.udf(parse_value_from_string, ArrayType(FloatType()))
df = df.withColumn('float_value_salary', F.explode(parse_float_array(F.col('salary'))))
df_output = df.select('name', 'state', 'float_value_salary')
The output dataframe would look like the following:
+--------+-----+------------------+
|    name|state|float_value_salary|
+--------+-----+------------------+
|   Smith|   OH|              55.5|
|    Anna|   NY|              33.3|
|Williams|   OH|             939.3|
+--------+-----+------------------+

unzip list of tuples in pyspark dataframe

I want to unzip a list of tuples in a column of a pyspark dataframe.
Let's say a column contains [(blue, 0.5), (red, 0.1), (green, 0.7)]; I want to split it into two columns, with the first column as [blue, red, green] and the second column as [0.5, 0.1, 0.7].
+-----+-------------------------------------------+
|Topic|Tokens                                     |
+-----+-------------------------------------------+
|    1| ('blue', 0.5),('red', 0.1),('green', 0.7) |
|    2| ('red', 0.9),('cyan', 0.5),('white', 0.4) |
+-----+-------------------------------------------+
which can be created with this code:
df = sqlCtx.createDataFrame(
    [
        (1, [('blue', 0.5), ('red', 0.1), ('green', 0.7)]),
        (2, [('red', 0.9), ('cyan', 0.5), ('white', 0.4)])
    ],
    ('Topic', 'Tokens')
)
And, the output should look like:
+-----+--------------------------+-----------------+
|Topic| Tokens                   | Weights         |
+-----+--------------------------+-----------------+
|    1| ['blue', 'red', 'green'] | [0.5, 0.1, 0.7] |
|    2| ['red', 'cyan', 'white'] | [0.9, 0.5, 0.4] |
+-----+--------------------------+-----------------+
If the schema of your DataFrame looks like this:
root
 |-- Topic: long (nullable = true)
 |-- Tokens: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: double (nullable = true)
then you can select:
from pyspark.sql.functions import col

df.select(
    col("Topic"),
    col("Tokens._1").alias("Tokens"),
    col("Tokens._2").alias("weights")
).show()
# +-----+------------------+---------------+
# |Topic|            Tokens|        weights|
# +-----+------------------+---------------+
# |    1|[blue, red, green]|[0.5, 0.1, 0.7]|
# |    2|[red, cyan, white]|[0.9, 0.5, 0.4]|
# +-----+------------------+---------------+
And generalized:
cols = [
    col("Tokens.{}".format(n))
    for n in df.schema["Tokens"].dataType.elementType.names
]
df.select("Topic", *cols)
Reference: Querying Spark SQL DataFrame with complex types
You can achieve this with simple indexing using udf():
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType, FloatType

# create the dataframe
df = sqlCtx.createDataFrame(
    [
        (1, [('blue', 0.5), ('red', 0.1), ('green', 0.7)]),
        (2, [('red', 0.9), ('cyan', 0.5), ('white', 0.4)])
    ],
    ('Topic', 'Tokens')
)

def get_colors(l):
    return [x[0] for x in l]

def get_weights(l):
    return [x[1] for x in l]

# make udfs from the above functions - note the return types
get_colors_udf = udf(get_colors, ArrayType(StringType()))
get_weights_udf = udf(get_weights, ArrayType(FloatType()))

# use withColumn and apply the udfs
df.withColumn('Weights', get_weights_udf(col('Tokens')))\
    .withColumn('Tokens', get_colors_udf(col('Tokens')))\
    .select(['Topic', 'Tokens', 'Weights'])\
    .show()
Output:
+-----+------------------+---------------+
|Topic|            Tokens|        Weights|
+-----+------------------+---------------+
|    1|[blue, red, green]|[0.5, 0.1, 0.7]|
|    2|[red, cyan, white]|[0.9, 0.5, 0.4]|
+-----+------------------+---------------+

How to create an UDF with two inputs in pyspark

I am new to pyspark and I am trying to create a simple udf that must take two input columns, check if the second column is blank and, if so, split the first one into two values and overwrite the original columns. This is what I have done:
def split(x, y):
    if x == "EXDRA" and y == "":
        return ("EXT", "DCHA")
    if x == "EXIZQ" and y == "":
        return ("EXT", "IZDA")

udf_split = udf(split, ArrayType())

df = df \
    .withColumn("x", udf_split(df['x'], df['y'])[1]) \
    .withColumn("y", udf_split(df['x'], df['y'])[0])
But when I run this code I get the following error:
File "<stdin>", line 1, in <module>
TypeError: __init__() takes at least 2 arguments (1 given)
What am I doing wrong?
Thank you,
Álvaro
I'm not sure what you are trying to do, but this is how I would do it from what I understood:
from pyspark.sql.types import *
from pyspark.sql.functions import udf, col

def split(x, y):
    if x == "EXDRA" and y == "":
        return ("EXT", "DCHA")
    if x == "EXIZQ" and y == "":
        return ("EXT", "IZDA")

schema = StructType([
    StructField("x1", StringType(), False),
    StructField("y1", StringType(), False)
])
udf_split = udf(split, schema)

df = spark.createDataFrame([("EXDRA", ""), ("EXIZQ", ""), ("", "foo")], ("x", "y"))
df.show()
# +-----+---+
# |    x|  y|
# +-----+---+
# |EXDRA|   |
# |EXIZQ|   |
# |     |foo|
# +-----+---+
df = df \
    .withColumn("split", udf_split(df['x'], df['y'])) \
    .withColumn("x", col("split.x1")) \
    .withColumn("y", col("split.y1"))
df.printSchema()
# root
#  |-- x: string (nullable = true)
#  |-- y: string (nullable = true)
#  |-- split: struct (nullable = true)
#  |    |-- x1: string (nullable = false)
#  |    |-- y1: string (nullable = false)
df.show()
# +----+----+----------+
# |   x|   y|     split|
# +----+----+----------+
# | EXT|DCHA|[EXT,DCHA]|
# | EXT|IZDA|[EXT,IZDA]|
# |null|null|      null|
# +----+----+----------+
I guess you have to define your udf as:
udf_split = udf(split, ArrayType(StringType()))
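A minimal sketch of that variant, assuming the same split function and df as above; the array elements are then picked out by index, much like the original attempt:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType

udf_split = udf(split, ArrayType(StringType()))

df = df \
    .withColumn("split", udf_split(col("x"), col("y"))) \
    .withColumn("x", col("split")[0]) \
    .withColumn("y", col("split")[1]) \
    .drop("split")
Note that rows that match neither condition return None, so x and y become null for them, the same caveat as with the struct version above.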

How to iterate dictionaries and save them into the database in python 2.7?

This is my dictionary
{u'krishna': [u'vijayawada', u'gudivada', u'avanigada']}
I want to iterate over the items and save them in the database. My model is:
class Example(models.Model):
    district = models.CharField(max_length=50, **optional)
    taluk = models.CharField(max_length=20, **optional)
It should save as:
+----------+----------------+
| district | taluk          |
+----------+----------------+
| krishna  | vijayawada     |
| krishna  | gudivada       |
| krishna  | avanigada      |
+----------+----------------+
You can do something like this:
from models import Example

places = {u'krishna': [u'vijayawada', u'gudivada', u'avanigada']}
for district in places:
    for taluk in places[district]:
        e = Example(district=district, taluk=taluk)
        e.save()
for key in dict:
    for value in dict[key]:
        example = Example()
        example.district = key
        example.taluk = value
        example.save()
This will work for you:
for districtName in places.keys():
    for talukName in places[districtName]:
        print districtName, talukName  # try to print it
        addData = Example.objects.create(district=districtName, taluk=talukName)
        addData.save()
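If the dictionary is large, a single insert via bulk_create is a common alternative to saving row by row. A minimal sketch, assuming the same Example model and places dict as above:
from models import Example

places = {u'krishna': [u'vijayawada', u'gudivada', u'avanigada']}

# build unsaved model instances, then insert them in one query
rows = [Example(district=district, taluk=taluk)
        for district, taluks in places.items()
        for taluk in taluks]
Example.objects.bulk_create(rows)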

Q object parameters

from django.db.models import Q
MODULES_USERS_PERMS = {
    MODULE_METHOD: [],
    MODULE_NEWS: [],
    MODULE_PROJECT: ['created_by', 'leader'],
    MODULE_TASK: [],
    MODULE_TICKET: [],
    MODULE_TODO: []
}
filter_fields = MODULES_USERS_PERMS[MODULE_PROJECT]
perm_q = map(lambda x: Q(x=user), filter_fields)
if perm_q:  # sum(perm_q)
    if len(perm_q) == 1:
        return perm_q[0]
    elif len(perm_q) == 2:
        return perm_q[0] | perm_q[1]
    elif len(perm_q) == 3:
        return perm_q[0] | perm_q[1] | perm_q[2]
I do not know how to describe in words what the code is supposed to do; I hope it speaks for itself.
I need to make a filter from the list of objects.
Needless to say, the code is not working.
UPDATE:
Code that looks better, but still not working:
filters = ['created_by', 'leader']
filter_params = Q()
for filter_obj in filters:
    filter_params = filter_params | Q(filter_obj=user)
FieldError at /projects/
Cannot resolve keyword 'filter_obj' into field. Choices are:
begin_time, comment, created_at, created_by, created_by_id, end_time,
id, leader, leader_id, name, project_task, status, ticket_project
If you're looking to combine an unknown number of Q objects:
import operator
perm_q = reduce(operator.or_, perm_q)
Or:
summed_q = perm_q[0]
for new_term in perm_q[1:]:
    summed_q = summed_q | new_term
Which does the same thing, just more explicitly.
Based on your edit - you need to turn the string contained in your filter_obj variable into a keyword argument. You can do this by creating a dictionary to use as the keyword arguments for the Q constructor:
filters = ['created_by', 'leader']
filter_params = Q()
for filter_obj in filters:
    kwargs = {filter_obj: user}
    filter_params = filter_params | Q(**kwargs)
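Putting the two ideas together, a minimal sketch that builds one OR-ed Q expression straight from the list of field names (Project and user are assumed from the error message and the surrounding code, not defined here):
import operator
from django.db.models import Q

filter_fields = ['created_by', 'leader']

# one Q(field=user) per field name, OR-ed together;
# Q() acts as the identity, so an empty list yields an empty filter
perm_q = reduce(operator.or_,
                (Q(**{field: user}) for field in filter_fields),
                Q())

projects = Project.objects.filter(perm_q)  # Project and user assumed from context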