Join list column with string column in PySpark

I have two data frames like df_emp and df_dept:
df_emp:
id  Name
1   aaa
2   bbb
3   ccc
4   ddd
df_dept:
dept_id  dept_name  employees
1        DE         [1, 2]
2        DA         [3, 4]
The expected result after joining:
dept_name  employees  employee_names
DE         [1, 2]     [aaa, bbb]
DA         [3, 4]     [ccc, ddd]
Any idea how to do it using simple joins or UDFs?

This can be done without a UDF: first explode the array, then join and group. Note that collect_list does not guarantee element order after a shuffle, so the two collected arrays may come back in a different order than shown.
Input data:
from pyspark.sql import functions as F

df_emp = spark.createDataFrame(
    [(1, 'aaa'),
     (2, 'bbb'),
     (3, 'ccc'),
     (4, 'ddd')],
    ['id', 'Name']
)
df_dept = spark.createDataFrame(
    [(1, 'DE', [1, 2]),
     (2, 'DA', [3, 4])],
    ['dept_id', 'dept_name', 'employees']
)
Script:
df_dept_exploded = df_dept.withColumn('id', F.explode('employees'))
df_joined = df_dept_exploded.join(df_emp, 'id', 'left')
df = (
    df_joined
    .groupBy('dept_name')
    .agg(
        F.collect_list('id').alias('employees'),
        F.collect_list('Name').alias('employee_names')
    )
)
df.show()
# +---------+---------+--------------+
# |dept_name|employees|employee_names|
# +---------+---------+--------------+
# | DE| [1, 2]| [aaa, bbb]|
# | DA| [3, 4]| [ccc, ddd]|
# +---------+---------+--------------+
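For intuition, the same explode → join → group logic can be sketched in plain Python, with no Spark required (toy data mirroring the frames above):

```python
# Plain-Python sketch of the join: look up each employee id in a name map
emp = {1: 'aaa', 2: 'bbb', 3: 'ccc', 4: 'ddd'}   # id -> Name
dept = [('DE', [1, 2]), ('DA', [3, 4])]          # (dept_name, employees)

result = [(name, ids, [emp[i] for i in ids]) for name, ids in dept]
print(result)
# [('DE', [1, 2], ['aaa', 'bbb']), ('DA', [3, 4], ['ccc', 'ddd'])]
```

Spark needs the explode/join/group dance because the lookup has to be expressed as a distributed join rather than a dict access.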


Dictionary from Pandas dataframe

I read two columns of a large file (10 million lines) using pandas read_csv (the first line is the header), and now I want to convert the dataframe to a dictionary where the first column supplies the keys and the second column supplies the values.
import numpy as np
import pandas as pd

col_name = ['A', 'B']
df = pd.read_csv(f_loc, usecols=col_name, sep="\s+", dtype={'B': np.float16})
Create an index from the first column with set_index, then convert with Series.to_dict:
df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
print(df)
   A  B
0  1  2
1  3  4
d = df.set_index('A')['B'].to_dict()
print(d)
{1: 2, 3: 4}
Another idea with zip:
d = dict(zip(df['A'], df['B']))
print(d)
{1: 2, 3: 4}
Or:
d = dict(df.values)
print(d)
{1: 2, 3: 4}
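One caveat that applies to all three approaches: dict keys must be unique, so if the key column contains duplicates, only the last pair for each key survives. A minimal pure-Python illustration:

```python
# When the same key appears more than once, the later pair silently wins
pairs = [(1, 2), (1, 3), (4, 5)]
d = dict(pairs)
print(d)
# {1: 3, 4: 5}
```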

Python 2.7 current row index on 2d array iteration

When iterating on a 2d array, how can I get the current row index? For example:
x = [[ 1. 2. 3. 4.]
[ 5. 6. 7. 8.]
[ 9. 0. 3. 6.]]
Something like:
for rows in x:
    # print the current index (for example, when iterating on [5. 6. 7. 8.], return 1)
enumerate is a built-in Python function that lets us loop over an iterable while keeping an automatic counter. Its usefulness is hard to overstate, yet many newcomers, and even some advanced programmers, are unaware of it. Here is an example:
for counter, value in enumerate(some_list):
    print(counter, value)
And there is more! enumerate also accepts an optional argument which makes it even more useful.
my_list = ['apple', 'banana', 'grapes', 'pear']
for c, value in enumerate(my_list, 1):
    print(c, value)
# Output:
# 1 apple
# 2 banana
# 3 grapes
# 4 pear
The optional argument tells enumerate where to start the counter. You can also create tuples containing the index and list item by wrapping the enumerate object in list(). Here is an example:
my_list = ['apple', 'banana', 'grapes', 'pear']
counter_list = list(enumerate(my_list, 1))
print(counter_list)
# Output: [(1, 'apple'), (2, 'banana'), (3, 'grapes'), (4, 'pear')]
enumerate:
In [42]: x = [[1, 2, 3, 4],
    ...:      [5, 6, 7, 8],
    ...:      [9, 0, 3, 6]]

In [43]: for index, rows in enumerate(x):
    ...:     print('current index {}'.format(index))
    ...:     print('current row {}'.format(rows))
    ...:
current index 0
current row [1, 2, 3, 4]
current index 1
current row [5, 6, 7, 8]
current index 2
current row [9, 0, 3, 6]
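The sample data in the question looks like NumPy output; enumerate works the same way on a NumPy array, yielding one row at a time (a sketch, assuming NumPy is installed):

```python
import numpy as np

x = np.array([[1., 2., 3., 4.],
              [5., 6., 7., 8.],
              [9., 0., 3., 6.]])

# find the index of the row that starts with 5.
for index, row in enumerate(x):
    if row[0] == 5.:
        print('current index', index)
```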

python itertools groupby find the max value

Use:
from itertools import groupby
from operator import itemgetter
like this:
input:
test = {('a','b'):1,('a','c'):2,('a','d'):3,('x','b'):4,('x','c'):5}
find the max value per group, grouping by key[0]
output:
output_test = {('a','d'):3,('x','c'):5}
To do this using itertools.groupby, assuming you do not care which entry is returned when multiple entries share the same max value:
import itertools

test = {('a', 'b'): 1, ('a', 'c'): 2, ('a', 'd'): 3, ('x', 'b'): 4, ('x', 'c'): 5}
output_test = {('a', 'd'): 3, ('x', 'c'): 5}
grouped = itertools.groupby(sorted(test.items()), lambda x: x[0][0])
maxEntries = dict(max(v, key=lambda q: q[1]) for k, v in grouped)
print(maxEntries)
print(maxEntries == output_test)
Outputs:
{('a', 'd'): 3, ('x', 'c'): 5}
True
from itertools import groupby
# Note: this computes the length of the longest run of equal consecutive
# elements, which is a different problem from the per-group max above.
max(sum(1 for _ in g) for k, g in groupby(input))
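groupby requires pre-sorting the items; a single pass with a plain dictionary avoids the sort and may read more clearly (a sketch, using the same data as above):

```python
test = {('a', 'b'): 1, ('a', 'c'): 2, ('a', 'd'): 3, ('x', 'b'): 4, ('x', 'c'): 5}

best = {}  # group key -> (full key, max value) seen so far
for key, value in test.items():
    group = key[0]
    if group not in best or value > best[group][1]:
        best[group] = (key, value)

result = dict(best.values())
print(result)
# {('a', 'd'): 3, ('x', 'c'): 5}
```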

Two-Dimensional structured array

I am trying to construct a structured array in Python that can be accessed by the names of the columns and rows. Is this possible with the structured array method of numpy?
Example:
My array should have roughly this form:
My_array = A B C
E 1 2 3
F 4 5 6
G 7 8 9
And I want to be able to do the following:
My_array["A"]["E"] = 1
My_array["C"]["F"] = 6
Is it possible to do this in Python using structured arrays, or is there another type of structure which is more suitable for such a task?
A basic structured array gives you something that can be indexed with one name:
In [276]: dt=np.dtype([('A',int),('B',int),('C',int)])
In [277]: x=np.arange(9).reshape(3,3).view(dtype=dt)
In [278]: x
Out[278]:
array([[(0, 1, 2)],
[(3, 4, 5)],
[(6, 7, 8)]],
dtype=[('A', '<i4'), ('B', '<i4'), ('C', '<i4')])
In [279]: x['B'] # index by field name
Out[279]:
array([[1],
[4],
[7]])
In [280]: x[1] # index by row (array element)
Out[280]:
array([(3, 4, 5)],
dtype=[('A', '<i4'), ('B', '<i4'), ('C', '<i4')])
In [281]: x['B'][1]
Out[281]: array([4])
In [282]: x.shape # could be reshaped to (3,)
Out[282]: (3, 1)
The view approach produced a 2d array, but with just one column. The usual columns are replaced by dtype fields. It's 2d but with a twist. By using view the data buffer is unchanged; the dtype just provides a different way of accessing those 'columns'. dtype fields are, technically, not a dimension. They don't register in either the .shape or .ndim of the array. Also you can't use x[0,'A'].
recarray does the same thing, but adds the option of accessing fields as attributes, e.g. x.B is the same as x['B'].
rows still have to be accessed by index number.
Another way of constructing a structured array is to define the values as a list of tuples.
In [283]: x1 = np.arange(9).reshape(3,3)
In [284]: x2=np.array([tuple(i) for i in x1],dtype=dt)
In [285]: x2
Out[285]:
array([(0, 1, 2), (3, 4, 5), (6, 7, 8)],
dtype=[('A', '<i4'), ('B', '<i4'), ('C', '<i4')])
In [286]: x2.shape
Out[286]: (3,)
np.ones, np.zeros and np.empty also construct basic structured arrays:
In [287]: np.ones((3,),dtype=dt)
Out[287]:
array([(1, 1, 1), (1, 1, 1), (1, 1, 1)],
dtype=[('A', '<i4'), ('B', '<i4'), ('C', '<i4')])
I can construct an array that is indexed with 2 field names, by nesting dtypes:
In [294]: dt1=np.dtype([('D',int),('E',int),('F',int)])
In [295]: dt2=np.dtype([('A',dt1),('B',dt1),('C',dt1)])
In [296]: y=np.ones((),dtype=dt2)
In [297]: y
Out[297]:
array(((1, 1, 1), (1, 1, 1), (1, 1, 1)),
dtype=[('A', [('D', '<i4'), ('E', '<i4'), ('F', '<i4')]), ('B', [('D', '<i4'), ('E', '<i4'), ('F', '<i4')]), ('C', [('D', '<i4'), ('E', '<i4'), ('F', '<i4')])])
In [298]: y['A']['F']
Out[298]: array(1)
But frankly this is rather convoluted. I haven't even figured out how to set the elements to arange(9) (without iterating over field names).
Structured arrays are most commonly produced by reading csv files with np.genfromtxt (or loadtxt). The result is a named field for each labeled column, and a numbered 'row' for each line in the file.
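If named rows and named columns are the real requirement, a pandas DataFrame (assuming pandas is available) supports both directly, which sidesteps the nested-dtype gymnastics:

```python
import numpy as np
import pandas as pd

# 3x3 values with row labels E/F/G and column labels A/B/C
df = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                  index=['E', 'F', 'G'],
                  columns=['A', 'B', 'C'])

print(df.loc['E', 'A'])   # row E, column A -> 1
print(df['C']['F'])       # column C, then row F -> 6, matching My_array["C"]["F"]
```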
With a recarray, you can access columns with dot notation or by explicit reference to the column name. Rows are accessed by row number; I haven't seen them accessed via a row name. For example:
>>> import numpy as np
>>> a = np.arange(1,10,1).reshape(3,3)
>>> dt = np.dtype([('A','int'),('B','int'),('C','int')])
>>> a.dtype = dt
>>> r = a.view(type=np.recarray)
>>> r
rec.array([[(1, 2, 3)],
[(4, 5, 6)],
[(7, 8, 9)]],
dtype=[('A', '<i4'), ('B', '<i4'), ('C', '<i4')])
>>> r.A
array([[1],
[4],
[7]])
>>> r['A']
array([[1],
[4],
[7]])
>>> r.A[0]
array([1])
>>> a['A'][0]
array([1])
>>> # now for the row
>>> r[0]
rec.array([(1, 2, 3)],
dtype=[('A', '<i4'), ('B', '<i4'), ('C', '<i4')])
>>>
You can specify the dtype and the type at the same time
>>> a = np.ones((3,3))
>>> b = a.view(dtype= [('A','<f8'), ('B','<f8'),('C', '<f8')], type = np.recarray)
>>> b
rec.array([[(1.0, 1.0, 1.0)],
[(1.0, 1.0, 1.0)],
[(1.0, 1.0, 1.0)]],
dtype=[('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
>>> b.A
array([[ 1.],
[ 1.],
[ 1.]])
>>> b.A[0]
array([ 1.])

writing a list with multiple data to a csv file in separate columns in python

import csv
from itertools import izip

if l > 0:
    for i in range(0, l):
        combined.append(str(questionList[i]).encode('utf-8') + str(viewList[i]).encode('utf-8'))
        # viewcsv.append(str(viewList[i]).encode('utf-8'))
        # quescsv.append(str(questionList[i]).encode('utf-8'))
    with open('collect.csv', 'a') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter='\n')
        spamwriter.writerow(combined)
        # spamwriter.writerows(izip(quescsv, viewcsv))
    return 1
else:
    return 0
I need to generate a csv file and fill it with data from two or more lists in separate columns, not a single column. Currently I'm trying to combine the two lists into one list (combined) and use that as input for writing, but I haven't gotten the desired output. I have tried many things, including the fieldnames approach and izip, but in vain.
Eg:
questionList  viewList
4             3 views
5             0 views
The numbers used are just for example.
Probably, you need something like this:
import csv

X = [1, 2, 3, 4, 5]
Y = [2, 3, 5, 7, 11]
Z = ['two', 'three', 'five', 'seven', 'eleven']
with open('collect.csv', 'w') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for row in zip(X, Y, Z):
        writer.writerow(row)
Or, if you want each list written out as one row (the transpose of the layout above):
import csv

X = [1, 2, 3, 4, 5]
Y = [2, 3, 5, 7, 11]
Z = ['two', 'three', 'five', 'seven', 'eleven']
with open('collect.csv', 'w') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    writer.writerow(X)
    writer.writerow(Y)
    writer.writerow(Z)
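writer.writerows with zip collapses the column-writing loop into one call. The sketch below writes to an in-memory buffer so the result is easy to inspect; when writing a real file on Python 3, open it with newline='' so the csv module controls the line endings:

```python
import csv
import io

X = [1, 2, 3]
Y = ['two', 'three', 'five']

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(zip(X, Y))  # each zipped tuple becomes one row, values in separate columns
print(buf.getvalue())
# 1,two
# 2,three
# 3,five
```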