Pyspark converting dataframe to rdd and split - list

I have a DataFrame and converted it to an RDD, but when I apply the split function I get an error message.
Here is my DataFrame:
df = spark.createDataFrame([(1, '2013-07-25 00:00:00.0',100,'CLOSED'),
(2, '2013-07-25 12:23:00.0',200,'PENDING PAYMENT'),
(3, '2013-07-25 03:30:00.0',400,'PENDING PAYMENT'),
(4, '2013-07-25 12:23:00.0',50,'COMPLETE'),
(5, '2013-07-25 12:23:00.0',50,'CLOSED'),
(6, '2013-07-26 02:00:00.0',300,'CLOSED'),
(7, '2013-07-26 6:23:00.0',10,'PENDING PAYMENT'),
(8, '2013-07-26 03:30:00.0',5,'PENDING PAYMENT'),
(9, '2013-07-26 2:23:00.0',20,'COMPLETE'),
(10,'2013-07-26 1:23:00.0',30,'CLOSED')],
['Id', 'Date', 'Total', 'Transaction'])
I converted it to a list of lists and then back to an RDD:
rdd = df.rdd.map(list).collect()
rdd_df=sc.parallelize(rdd)
Then I apply:
rdd_df.map(lambda z: z.split(","))
"AttributeError: 'list' object has no attribute 'split'"
But rdd_df is not a list; let's check:
type(rdd_df)
pyspark.rdd.RDD
What might be the problem? I would like to map over column 3 (Transaction) and add up the counts. The desired output would be:
(PENDING PAYMENT,4),(COMPLETE,2),(CLOSED,4)
Thank you.
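Note: each element produced by df.rdd.map(list) is already a Python list (and each element of df.rdd is a Row), so there is no comma-separated string to split. A minimal sketch of one way to get the desired counts, assuming the DataFrame above:
counts = (df.rdd
          .map(lambda row: (row[3], 1))      # column 3 is 'Transaction'
          .reduceByKey(lambda a, b: a + b))
print(counts.collect())
# e.g. [('PENDING PAYMENT', 4), ('COMPLETE', 2), ('CLOSED', 4)]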

Related

Get top n records for each group with Django queryset

I have a model like the following table:
create table `mytable`
(
    `person` varchar(10),
    `groupname` int,
    `age` int
);
I want to get the 2 oldest people from each group. The original SQL question and answers are on StackOverflow, and one of the solutions that works is:
SELECT
    person,
    groupname,
    age
FROM
(
    SELECT
        person,
        groupname,
        age,
        @rn := IF(@prev = groupname, @rn + 1, 1) AS rn,
        @prev := groupname
    FROM mytable
    JOIN (SELECT @prev := NULL, @rn := 0) AS vars
    ORDER BY groupname, age DESC, person
) AS T1
WHERE rn <= 2
You can check the SQL output on SQLFiddle as well.
I just want to know how I can implement this query as a queryset in Django's views.
Equivalent SQL with similar output would use a window function that annotates each row with its row number within a particular group name; you would then filter in an outer query to keep rows with row number less than or equal to 2.
At the time of writing, Django does not support filtering based on a window function result, so you need to calculate the row numbers in a first query and filter Person rows in a second query.
The following code is based on a similar question, but it limits the number of rows returned per group_name.
from django.db.models import F, Window
from django.db.models.functions import RowNumber

person_ids = {
    pk
    for pk, row_no_in_group in Person.objects.annotate(
        row_no_in_group=Window(
            expression=RowNumber(),
            partition_by=[F('group_name')],
            order_by=['group_name', F('age').desc(), 'person']
        )
    ).values_list('id', 'row_no_in_group')
    if row_no_in_group <= 2
}
filtered_persons = Person.objects.filter(id__in=person_ids)
For the following state of the Person table:
>>> Person.objects.order_by('group_name', '-age', 'person').values_list('group_name', 'age', 'person')
<QuerySet [(1, 19, 'Brian'), (1, 17, 'Brett'), (1, 14, 'Teresa'), (1, 13, 'Sydney'), (2, 20, 'Daniel'), (2, 18, 'Maureen'), (2, 14, 'Vincent'), (2, 12, 'Carlos'), (2, 11, 'Kathleen'), (2, 11, 'Sandra')]>
the queries above return:
>>> filtered_persons.order_by('group_name', '-age', 'person').values_list('group_name', 'age', 'person')
<QuerySet [(1, 19, 'Brian'), (1, 17, 'Brett'), (2, 20, 'Daniel'), (2, 18, 'Maureen')]>

Replace Null values with median in pyspark

How can I replace null values with the median in the columns Age and Height of the data set df below?
df = spark.createDataFrame([(1, 'John', 1.79, 28,'M', 'Doctor'),
(2, 'Steve', 1.78, 45,'M', None),
(3, 'Emma', 1.75, None, None, None),
(4, 'Ashley',1.6, 33,'F', 'Analyst'),
(5, 'Olivia', 1.8, 54,'F', 'Teacher'),
(6, 'Hannah', 1.82, None, 'F', None),
(7, 'William',None, 42,'M', 'Engineer'),
(None,None,None,None,None,None),
(8,'Ethan',1.55,38,'M','Doctor'),
(9,'Hannah',1.65,None,'F','Doctor'),
(10,'Xavier',1.64,43,None,'Doctor')]
, ['Id', 'Name', 'Height', 'Age', 'Gender', 'Profession'])
In the post Replace missing values with mean - Spark Dataframe I used the function given there:
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns])
imputer.fit(df).transform(df)
It throws an error:
IllegalArgumentException: 'requirement failed: Column Id must be of type equal to one of the following types: [DoubleType, FloatType] but was actually of type LongType.'
So please help.
Thank you
It's likely an initial casting error (I had some strings I needed to be floats). To convert all columns to floats, do:
from pyspark.sql.functions import col
df = df.select(*(col(c).cast("float").alias(c) for c in df.columns))
Then you should be fine to impute. Note: I set my strategy to median rather than mean.
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=df.columns,
    outputCols=["{}_imputed".format(c) for c in df.columns]
).setStrategy("median")

# Add imputation cols to df
df = imputer.fit(df).transform(df)
I'd be interested in a more elegant solution, but I imputed the categoricals separately from the numerics. To impute the categoricals, I got the most common value and filled the blanks with it using the when and otherwise functions:
import pyspark.sql.functions as F

for col_name in ['Name', 'Gender', 'Profession']:
    common = df.dropna().groupBy(col_name).agg(F.count("*")).orderBy('count(1)', ascending=False).first()[col_name]
    df = df.withColumn(col_name, F.when(F.isnull(col_name), common).otherwise(df[col_name]))
To impute the numerics, before running the imputer I simply cast the Age and Id columns to doubles, which circumvents the issue for the numeric fields, and restrict the imputer to the numerical columns:
from pyspark.ml.feature import Imputer

df = df.withColumn("Age", df['Age'].cast('double')).withColumn('Id', df['Id'].cast('double'))
imputer = Imputer(
    inputCols=['Id', 'Height', 'Age'],
    outputCols=['Id', 'Height', 'Age'])
imputer.fit(df).transform(df)
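Note that this last Imputer uses its default strategy (mean) unless .setStrategy("median") is added, as in the earlier snippet. An alternative sketch that avoids the Imputer entirely, assuming Age has been cast to double as above (Height already is): compute an approximate median per column with approxQuantile and fill the nulls with fillna.
# Approximate medians; nulls are ignored by approxQuantile in recent Spark versions
medians = {c: df.approxQuantile(c, [0.5], 0.01)[0] for c in ['Height', 'Age']}
df = df.fillna(medians)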

generating a multi dict with random numbers

I have a data structure defined as follows:
reqList[i] = [multidict({
                  1: ['type1', randint(1, 5), randint(1, 5), randint(1, 5)],
                  2: ['type2', randint(1, 5), randint(1, 5), randint(1, 5)],
                  3: ['type3', randint(1, 5), randint(1, 5), randint(1, 5)],
                  4: ['type4', randint(1, 5), randint(1, 5), randint(1, 5)]
              }),
              multidict({
                  (1, 2): randint(500, 1000),
                  (2, 3): randint(500, 1000),
                  (3, 4): randint(500, 1000)
              })]
I want to automate the creation of this data structure, in a for loop for example. I did this:
nodes = {}
for j in range(1, randint(2, 5)):
    nodes[j] = ['type%d' % j, randint(1, 5), randint(1, 5), randint(1, 5)]

edges = {}
for kk in range(1, len(nodes)):
    edges[(kk, kk + 1)] = randint(500, 1000)

print "EDGES", edges

reqList[i] = [multidict(nodes),
              multidict(edges)]
del (nodes, edges)
When I look at the outputted edges, the order of the keys is not kept! For example, I am getting this:
EDGES {(1, 2): 583, (3, 4): 504, (2, 3): 993}
I want it to be :
EDGES {(1, 2): 583, (2, 3): 993, (3, 4): 504}
Is the way I am coding it correct? If not, could you suggest a better way, knowing that I need to get the same result as in the first example?
A dictionary in Python 2.7 is unordered and does not keep insertion order, unless you manually record in a separate list which key was inserted when. The collections module contains a class called OrderedDict that acts like a dictionary but keeps the inserts in order, which is what you could use (it also keeps track of the inserted keys, using a doubly linked list to speed up deletion of keys).
There is no way other than these two methods.
from collections import OrderedDict
from random import randint

nodes = {}
for j in range(1, randint(2, 5)):
    nodes[j] = ['type%d' % j, randint(1, 5), randint(1, 5), randint(1, 5)]

edges = OrderedDict()
for kk in range(1, len(nodes)):
    edges[(kk, kk + 1)] = randint(500, 1000)

print "EDGES", edges  # EDGES OrderedDict([((1, 2), 898), ((2, 3), 814)])
print edges[(1, 2)]   # still yields the correct number
You can read more about OrderedDict in the collections module documentation.

sorting of python dict, first by value then by key

I want to sort a dict, first by the value and then by the key, the keys being strings.
Example:
>>> words = {'fall':4, 'down':4, 'was':3, 'a':3, 'zebra':2, 'bitter':1, 'betty':1}
>>> sorted(words.items(), key=itemgetter(1,0), reverse=True)
[('fall', 4), ('down', 4), ('was', 3), ('a', 3), ('zebra', 2), ('bitter', 1), ('betty', 1)]
As you can see above, the dict is getting sorted by value, but the keys come out in descending rather than ascending order.
Thanks.
Edit: I forgot to point out that I want the values sorted in descending order and the keys in ascending order.
This should do the trick; it takes advantage of the fact that the values are numbers (so they can be negated to reverse their sort order).
from operator import itemgetter
words = {'fall':4, 'down':4, 'was':3, 'a':3, 'zebra':2, 'bitter':1, 'betty':1}
sorted_words = sorted(words.iteritems(), key=lambda (k, v): (-v, k))
print(sorted_words)
Output:
[('down', 4), ('fall', 4), ('a', 3), ('was', 3), ('zebra', 2), ('betty', 1), ('bitter', 1)]
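If the values were not numbers (so negating them were not an option), a two-pass sort is a common alternative; a minimal sketch with the same data, relying on the fact that Python's sort is stable:
from operator import itemgetter
words = {'fall': 4, 'down': 4, 'was': 3, 'a': 3, 'zebra': 2, 'bitter': 1, 'betty': 1}
# Sort by the secondary key first (word, ascending), then by the primary key
# (count, descending); stability preserves the word order within equal counts.
pairs = sorted(words.iteritems(), key=itemgetter(0))
pairs = sorted(pairs, key=itemgetter(1), reverse=True)
print(pairs)
# [('down', 4), ('fall', 4), ('a', 3), ('was', 3), ('zebra', 2), ('betty', 1), ('bitter', 1)]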

Two-Dimensional structured array

I am trying to construct a structured array in Python that can be accessed by the names of the columns and rows. Is this possible with the structured array method of numpy?
Example:
My array should have roughly this form:
My_array =    A  B  C
           E  1  2  3
           F  4  5  6
           G  7  8  9
And I want to have the possibility of doing the following:
My_array["A"]["E"] = 1
My_array["C"]["F"] = 6
Is it possible to do this in Python using structured arrays, or is there another type of structure which is more suitable for such a task?
A basic structured array gives you something that can be indexed with one name:
In [276]: dt=np.dtype([('A',int),('B',int),('C',int)])
In [277]: x=np.arange(9).reshape(3,3).view(dtype=dt)
In [278]: x
Out[278]:
array([[(0, 1, 2)],
[(3, 4, 5)],
[(6, 7, 8)]],
dtype=[('A', '<i4'), ('B', '<i4'), ('C', '<i4')])
In [279]: x['B'] # index by field name
Out[279]:
array([[1],
[4],
[7]])
In [280]: x[1] # index by row (array element)
Out[280]:
array([(3, 4, 5)],
dtype=[('A', '<i4'), ('B', '<i4'), ('C', '<i4')])
In [281]: x['B'][1]
Out[281]: array([4])
In [282]: x.shape # could be reshaped to (3,)
Out[282]: (3, 1)
The view approach produced a 2d array, but with just one column. The usual columns are replaced by dtype fields. It's 2d but with a twist. By using view the data buffer is unchanged; the dtype just provides a different way of accessing those 'columns'. dtype fields are, technically, not a dimension. They don't register in either the .shape or .ndim of the array. Also you can't use x[0,'A'].
recarray does the same thing, but adds the option of accessing fields as attributes, e.g. x.B is the same as x['B'].
Rows still have to be accessed by index number.
Another way of constructing a structured array is to define the values as a list of tuples:
In [283]: x1 = np.arange(9).reshape(3,3)
In [284]: x2=np.array([tuple(i) for i in x1],dtype=dt)
In [285]: x2
Out[285]:
array([(0, 1, 2), (3, 4, 5), (6, 7, 8)],
dtype=[('A', '<i4'), ('B', '<i4'), ('C', '<i4')])
In [286]: x2.shape
Out[286]: (3,)
ones, zeros, and empty also construct basic structured arrays:
In [287]: np.ones((3,),dtype=dt)
Out[287]:
array([(1, 1, 1), (1, 1, 1), (1, 1, 1)],
dtype=[('A', '<i4'), ('B', '<i4'), ('C', '<i4')])
I can construct an array that is indexed with 2 field names, by nesting dtypes:
In [294]: dt1=np.dtype([('D',int),('E',int),('F',int)])
In [295]: dt2=np.dtype([('A',dt1),('B',dt1),('C',dt1)])
In [296]: y=np.ones((),dtype=dt2)
In [297]: y
Out[297]:
array(((1, 1, 1), (1, 1, 1), (1, 1, 1)),
dtype=[('A', [('D', '<i4'), ('E', '<i4'), ('F', '<i4')]), ('B', [('D', '<i4'), ('E', '<i4'), ('F', '<i4')]), ('C', [('D', '<i4'), ('E', '<i4'), ('F', '<i4')])])
In [298]: y['A']['F']
Out[298]: array(1)
But frankly this is rather convoluted. I haven't even figured out how to set the elements to arange(9) (without iterating over field names).
Structured arrays are most commonly produced by reading csv files with np.genfromtxt (or loadtxt). The result is a named field for each labeled column, and a numbered 'row' for each line in the file.
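For example, a small sketch with hypothetical CSV contents (the header row becomes the field names when names=True, and dtype=None lets genfromtxt infer the column types):
import numpy as np
from io import BytesIO
data = b"A,B,C\n0,1,2\n3,4,5\n6,7,8\n"
arr = np.genfromtxt(BytesIO(data), delimiter=",", names=True, dtype=None)
print(arr['A'])   # column by field name -> [0 3 6]
print(arr[1])     # 'row' by line number -> (3, 4, 5)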
With a recarray, you can access columns with dot notation or with a specific reference to the column name. Rows are accessed by row number; I haven't seen them accessed via a row name. For example:
>>> import numpy as np
>>> a = np.arange(1,10,1).reshape(3,3)
>>> dt = np.dtype([('A','int'),('B','int'),('C','int')])
>>> a.dtype = dt
>>> r = a.view(type=np.recarray)
>>> r
rec.array([[(1, 2, 3)],
[(4, 5, 6)],
[(7, 8, 9)]],
dtype=[('A', '<i4'), ('B', '<i4'), ('C', '<i4')])
>>> r.A
array([[1],
[4],
[7]])
>>> r['A']
array([[1],
[4],
[7]])
>>> r.A[0]
array([1])
>>> a['A'][0]
array([1])
>>> # now for the row
>>> r[0]
rec.array([(1, 2, 3)],
dtype=[('A', '<i4'), ('B', '<i4'), ('C', '<i4')])
>>>
You can specify the dtype and the type at the same time:
>>> a = np.ones((3,3))
>>> b = a.view(dtype= [('A','<f8'), ('B','<f8'),('C', '<f8')], type = np.recarray)
>>> b
rec.array([[(1.0, 1.0, 1.0)],
[(1.0, 1.0, 1.0)],
[(1.0, 1.0, 1.0)]],
dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
>>> b.A
array([[ 1.],
[ 1.],
[ 1.]])
>>> b.A[0]
array([ 1.])