I have a data frame that is production data for a factory. The factory is organised into lines. The structure of the data is such that one of the columns contains repeating values that, properly thought of, are headers. I need to reshape the data. So in the following DataFrame the 'Quality' column contains 4 measures, each of which is then measured for each hour. Clearly this gives us four observations per line.
The goal here is to reshape this data so that some of the columns are single-index and some are multi-index. The row index should remain ['Date', 'ID']. The single-index columns should be 'line_no', 'floor' and 'buyer', and the multi-index columns should be the hourly measures for each of the quality measures.
I know that this is possible because I accidentally stumbled across the way to do it. Basically, as my code will show, I put everything in the index except the hourly data and then unstacked the 'Quality' column from the index. Then, by chance, I tried to reset the index and it created this amazing DataFrame where some columns were single-index and some multi-index. Of course it's highly impractical to have loads of columns in the index, because we might want to do stuff with them, like change them. My question is how to achieve this type of thing without having to go through (what I feel is) a workaround.
import random
import pandas as pd

d = {'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] * 2,
     'Date': ['2013-05-04' for x in range(12)] +
             ['2013-05-06' for x in range(12)],
     'line_no': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3] * 2,
     'floor': [5, 5, 5, 5, 6, 6, 6, 6, 5, 5, 5, 5] * 2,
     'buyer': ['buyer1', 'buyer1', 'buyer1', 'buyer1',
               'buyer2', 'buyer2', 'buyer2', 'buyer2',
               'buyer1', 'buyer1', 'buyer1', 'buyer1'] * 2,
     'Quality': ['no_checked', 'good', 'alter', 'rejected'] * 6,
     'Hour1': [random.randint(1000, 15000) for x in range(24)],
     'Hour2': [random.randint(1000, 15000) for x in range(24)],
     'Hour3': [random.randint(1000, 15000) for x in range(24)],
     'Hour4': [random.randint(1000, 15000) for x in range(24)],
     'Hour5': [random.randint(1000, 15000) for x in range(24)],
     'Hour6': [random.randint(1000, 15000) for x in range(24)]}
DF = pd.DataFrame(d, columns=['ID', 'Date', 'line_no', 'floor', 'buyer',
                              'Quality', 'Hour1', 'Hour2', 'Hour3', 'Hour4',
                              'Hour5', 'Hour6'])
DF = DF.set_index(['Date', 'ID'])
So this is how I achieved what I wanted, but there must be a way to do this without having to go through all these steps. Help please...
# Reset the index
DF.reset_index(inplace=True)
# Put everything except the hourly data in the index
DF.set_index(['Date', 'ID', 'line_no', 'floor', 'buyer', 'Quality'], inplace=True)
# Unstack Quality
DFS = DF.unstack('Quality')
# Now this was the accidental workaround - gives exactly the result I want
DFS.reset_index(inplace=True)
DFS.set_index(['Date', 'ID'], inplace=True)
All help appreciated. Sorry for the long question, but at least there is some data riiiight!
In general, inplace operations are not faster and are IMHO less readable.
In [18]: df.set_index(['Date', 'ID', 'line_no', 'floor', 'buyer', 'Quality']).unstack('Quality').reset_index(['line_no', 'floor', 'buyer'])
Out[18]:
line_no floor buyer Hour1 Hour2 Hour3 Hour4 Hour5 Hour6
Quality alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected alter good no_checked rejected
Date ID
2013-05-04 1 1 5 buyer1 6920 8681 9317 14631 5739 2112 4211 12026 13577 1855 13884 12710 7250 2540 1948 7116 9874 7302 10961 8251 3070 2793 14293 10895
2 2 6 buyer2 7943 7501 13725 1648 7178 9670 6278 6888 9969 11766 9968 4722 7242 4049 6704 2225 6546 8688 11513 14550 2140 11941 1142 6683
3 3 5 buyer1 5155 2449 13648 2183 14184 7309 1185 10454 11742 14102 2242 14297 6185 5554 12505 13312 3062 7426 4421 5693 12342 11622 10431 13375
2013-05-06 1 1 5 buyer1 14563 1343 14419 3350 8526 1185 5244 14777 2238 3640 6717 1109 7777 13136 1732 8681 14454 1059 10606 6942 9349 4524 13931 11799
2 2 6 buyer2 14837 9524 8453 6074 11516 12356 9651 10650 15000 11374 4690 10914 1857 3231 14627 6590 6503 9268 13108 8581 8448 12013 14175 10783
3 3 5 buyer1 9032 12959 4613 6793 7918 2827 6027 13002 11771 13370 12767 11080 12624 13269 11740 10543 8609 14709 11921 12484 8670 12706 8001 8991
[6 rows x 27 columns]
is quite a reasonable idiom for what you are doing.
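For completeness, here is the same idiom as a self-contained, method-chained sketch (no inplace), assuming the DF built above with its ['Date', 'ID'] index:
DFS = (DF.reset_index()
         .set_index(['Date', 'ID', 'line_no', 'floor', 'buyer', 'Quality'])
         .unstack('Quality')
         .reset_index(['line_no', 'floor', 'buyer']))
# Row index stays ['Date', 'ID']; 'line_no', 'floor', 'buyer' come back as
# single-level columns, and each HourN expands into one sub-column per Quality.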
Does anybody know how to perform a window function in Apache Beam (Dataflow)?
Example:
ID Sector Country Income
1 Liam US 16133
2 Noah BR 10184
3 Oliver ITA 11119
4 Elijah FRA 13256
5 William GER 7722
6 James AUS 9786
7 Benjamin ARG 1451
8 Lucas FRA 4541
9 Henry US 9111
10 Alexander ITA 13002
11 Olivia ENG 5143
12 Emma US 18076
13 Ava MEX 15930
14 Charlotte ENG 18247
15 Sophia BR 9578
16 Amelia FRA 10813
17 Isabella FRA 7575
18 Mia GER 14875
19 Evelyn AUS 19749
20 Harper ITA 19642
Questions:
How to create another column with the running sum of the Income, ordered by ID?
How to create another column with the rank of the people who earn the most?
Thank You
Bruno
Consider the approach below. I have tried my best to make sure that the ParDo fns are associative and commutative, which means this should not break when run in parallel on multiple workers. Let me know if you find this breaking on the DataflowRunner.
import apache_beam as beam
from apache_beam.transforms.core import DoFn

class cum_sum(DoFn):
    def process(self, element, lkp_data, accum_sum):
        # Sum the Income of every row whose ID is <= this row's ID.
        for lkp_id_income in lkp_data:
            if element['ID'] >= lkp_id_income[0]:
                accum_sum += lkp_id_income[1]
        element.update({'cumulative_sum': accum_sum})
        yield element

class rank_it(DoFn):
    def process(self, element, lkp_data, counter):
        # Count how many rows have a smaller cumulative_sum than this one.
        for lkp_id_cumsum in lkp_data:
            if lkp_id_cumsum['cumulative_sum'] < element['cumulative_sum']:
                counter += 1
        element.update({'rank': counter})
        yield element

with beam.Pipeline() as p:
    data = (
        p
        | 'create' >> beam.Create(
            [
                {'ID': 4, 'Sector': 'Liam', 'Country': 'US', 'Income': 1400},
                {'ID': 2, 'Sector': 'piam', 'Country': 'IS', 'Income': 1200},
                {'ID': 1, 'Sector': 'Oiam', 'Country': 'PS', 'Income': 1300},
                {'ID': 3, 'Sector': 'Uiam', 'Country': 'OS', 'Income': 1800},
            ]
        )
    )
    ids_income = (
        data
        | 'get_ids_income' >> beam.Map(lambda element: (element['ID'], element['Income']))
    )
    with_cumulative_sum = (
        data
        | 'cumulative_sum' >> beam.ParDo(cum_sum(), lkp_data=beam.pvalue.AsIter(ids_income), accum_sum=0)
    )
    with_ranking = (
        with_cumulative_sum
        | 'ranking' >> beam.ParDo(rank_it(), lkp_data=beam.pvalue.AsIter(with_cumulative_sum), counter=1)
        | 'print' >> beam.Map(print)
    )
Output
{'ID': 4, 'Sector': 'Liam', 'Country': 'US', 'Income': 1400, 'cumulative_sum': 5700, 'rank': 4}
{'ID': 2, 'Sector': 'piam', 'Country': 'IS', 'Income': 1200, 'cumulative_sum': 2500, 'rank': 2}
{'ID': 1, 'Sector': 'Oiam', 'Country': 'PS', 'Income': 1300, 'cumulative_sum': 1300, 'rank': 1}
{'ID': 3, 'Sector': 'Uiam', 'Country': 'OS', 'Income': 1800, 'cumulative_sum': 4300, 'rank': 3}
Windowing in Apache Beam subdivides your unbounded PCollection into smaller bounded chunks so that some computation (group by, sum, avg, ...) can be applied.
Unbounded PCollections come from streaming processing, and windows are based on timestamps (you can create a sliding window of 5 minutes, for instance). In your example you don't have timestamps, and it sounds like a bounded PCollection (a batch).
Technically you can simulate timestamps by preprocessing the elements and adding a dummy time indicator. But in your case, a simple group-by or a sort is enough to achieve what you want.
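To make that concrete, here is a minimal sketch of the "collect and sort" idea for a bounded PCollection. The helper name add_running_sum_and_rank is my own, and I rank by Income (rank 1 = highest earner), which is one reading of the question:
import apache_beam as beam

def add_running_sum_and_rank(rows):
    # Running sum of Income in ID order.
    rows = sorted(rows, key=lambda r: r['ID'])
    total = 0
    for row in rows:
        total += row['Income']
        row['cumulative_sum'] = total
    # Rank 1 goes to the highest Income.
    for rank, row in enumerate(sorted(rows, key=lambda r: r['Income'], reverse=True), start=1):
        row['rank'] = rank
    return rows

with beam.Pipeline() as p:
    _ = (
        p
        | 'create' >> beam.Create([
            {'ID': 2, 'Income': 1200},
            {'ID': 1, 'Income': 1300},
            {'ID': 3, 'Income': 1800},
        ])
        | 'collect' >> beam.combiners.ToList()  # fine for a bounded, modest-sized input
        | 'annotate' >> beam.FlatMap(add_running_sum_and_rank)
        | 'print' >> beam.Map(print)
    )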
I'm new to pandas and am trying to create a pivot table from a numpy array.
variable npArray is just that, a numpy array:
>>> npArray
array([(1, 3), (4, 3), (1, 3), ..., (1, 4), (1, 12), (1, 12)],
dtype=[('MATERIAL', '<i4'), ('DIVISION', '<i4')])
I'd like to count occurrences of each material by division, with divisions as rows and materials as columns.
What I have:
#numpy array to pandas data frame
pandaDf = pandas.DataFrame(npArray)
#pivot table - guessing here
pandas.pivot_table(pandaDf, index="DIVISION",
                   columns="MATERIAL",
                   aggfunc=numpy.sum)  # <--- want count, not sum
Results:
Empty DataFrame
Columns: []
Index: []
Sample of pandaDf:
>>> print pandaDf
MATERIAL DIVISION
0 1 3
1 4 3
2 1 3
3 1 3
4 1 3
5 1 3
6 1 3
7 1 3
8 1 3
9 1 3
10 1 3
11 1 3
12 4 3
... ... ...
3845291 1 4
3845292 1 4
3845293 1 4
3845294 1 12
3845295 1 12
[3845296 rows x 2 columns]
Any help would be appreciated.
Something similar has already been asked: https://stackoverflow.com/a/12862196/9754169
Bottom line, just do aggfunc=lambda x: len(x)
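As an aside (my own addition, not from the linked answer): since this is a pure co-occurrence count of two columns, pd.crosstab produces the same contingency table without needing a values column at all:
import pandas as pd

# Counts of MATERIAL per DIVISION; rows are divisions, columns are materials.
counts = pd.crosstab(pandaDf['DIVISION'], pandaDf['MATERIAL'])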
#GerardoFlores is correct. Another solution I found was adding a column for frequency.
#numpy array to pandas data frame
pandaDf = pandas.DataFrame(npArray)
print "adding frequency column"
pandaDf["FREQ"] = 1
#pivot table
pivot = pandas.pivot_table(pandaDf, values="FREQ",
                           index="DIVISION", columns="MATERIAL",
                           aggfunc="count")
A dataframe stores some values in columns; passing those values to a function, I get another dataframe. I'd like to concatenate the returned dataframe's columns to the original dataframe.
I tried to do something like
i = pd.concat([i, i[['cid', 'id']].apply(lambda x: xy(*x), axis=1)], axis=1)
but it did not work, failing with this error:
ValueError: cannot copy sequence with size 2 to array axis with dimension 1
So I did like this:
def xy(x, y):
    return pd.DataFrame({'x': [x*2], 'y': [y*2]})

df1 = pd.DataFrame({'cid': [4, 4], 'id': [6, 10]})
print('df1:\n{}'.format(df1))

df2 = pd.DataFrame()
for _, row in df1.iterrows():
    nr = xy(row['cid'], row['id'])
    nr['cid'] = row['cid']
    nr['id'] = row['id']
    df2 = df2.append(nr, ignore_index=True)
print('df2:\n{}'.format(df2))
Output:
df1:
cid id
0 4 6
1 4 10
df2:
x y cid id
0 8 12 4 6
1 8 20 4 10
The code does not look nice and is probably slow.
Is there a pandas/pythonic way to do this properly and have it perform well?
python 2.7
Option 0
Most directly with pd.DataFrame.assign. Not very generalizable.
df1.assign(x=df1.cid * 2, y=df1.id * 2)
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 1
Use pd.DataFrame.join to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.join(df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 2
Use pd.DataFrame.assign to add new columns
This shows how to adjoin new columns after having used apply with a lambda
df1.assign(**df1.apply(lambda x: pd.Series(x.values * 2, ['x', 'y']), 1))
cid id x y
0 4 6 8 12
1 4 10 8 20
Option 3
However, if your function really is just multiplying by 2
df1.join(df1.mul(2).rename(columns=dict(cid='x', id='y')))
Or
df1.assign(**df1.mul(2).rename(columns=dict(cid='x', id='y')))
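And if the real function is a black box that returns a one-row DataFrame per input row (like the xy in the question), a minimal sketch of a list-comprehension alternative to the iterrows/append loop:
# Build one small frame per row, stack them, and align with df1's index.
parts = pd.concat([xy(c, i) for c, i in zip(df1['cid'], df1['id'])],
                  ignore_index=True)
result = df1.join(parts)  # same content as df2 in the question, column order aside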
How do I print a line from a 2-dimensional mixed array (containing integer and floating-point data types) without the square brackets, the extra spacing in the floats, and the trailing decimal point on the integers?
I want to use:
for line in xd:
    print line,
to get the final output
The code I tried is as follows:
import numpy
x = [[1.456, 2, 3],
     [4, 5.231, 6],
     [7, 8, 9.145]]
x = numpy.array(x)
xd = numpy.array2string(x, separator='\t')
for line in xd:
    print line,
This is the output from the code
[ [ 1 . 4 5 6 2 . 3 . ]
[ 4 . 5 . 2 3 1 6 . ]
[ 7 . 8 . 9 . 1 4 5 ] ]
Your x is a list of lists:
In [14]: x =[[1.456, 2, 3],
...: [4, 5.231, 6],
...: [7, 8, 9.145]]
In [15]: x
Out[15]: [[1.456, 2, 3], [4, 5.231, 6], [7, 8, 9.145]]
In [16]: for row in x:
...: print(row)
...:
[1.456, 2, 3]
[4, 5.231, 6]
[7, 8, 9.145]
We could set up a Python formatted print expression for each row.
But first look at what happens when you make an array:
In [17]: arr=np.array(x)
In [18]: arr
Out[18]:
array([[ 1.456, 2. , 3. ],
[ 4. , 5.231, 6. ],
[ 7. , 8. , 9.145]])
This is a 2d array of float dtype, the dtype that can hold both the ints and the floats. Formally, you've lost the distinction between floats and ints.
Trying to display this array without decimals and without [] is a lot harder than for the original list. The array formatting puts a lot of effort into lining up the columns, in other words making a pretty, table-like display. If the array is large, lines will wrap, and it will start to add an ellipsis. So from a display standpoint, you actually lose control when making an array.
np.savetxt may help; it can be used to write the rows without [].
Drawing inspiration from savetxt (and how it formats rows):
In [21]: for row in x:
...: print('%6s %6s %6s'%tuple(row))
...:
1.456 2 3
4 5.231 6
7 8 9.145
In [22]: for row in arr:
...: print('%6s %6s %6s'%tuple(row))
...:
1.456 2.0 3.0
4.0 5.231 6.0
7.0 8.0 9.145
So if you are picky about the format of the numbers, stick with the list of lists, and study up on the Python formatting system (whether the % or .format version). For example, %g works with the array as well as with the list:
In [29]: for row in arr:
...: print('%6.2g %6.3g %6g'%tuple(row))
...:
1.5 2 3
4 5.23 6
7 8 9.145
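For completeness, a sketch of the same idea in the .format style mentioned above (my own mapping from the % specs, equivalent output):
# '{:6.2g}' right-aligns in 6 characters with 2 significant digits, like %6.2g;
# 'g' trims trailing zeros, so integer-valued floats print without '.0'.
for row in arr:
    print('{:6.2g} {:6.3g} {:6g}'.format(*row))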
How would you create column(s) in the below pandas DataFrame where the new columns are the expanding mean/median of 'val' for each 'Mod_ID_x'? Imagine this as if it were time series data, where 'ID' 1-2 was on Day 1 and 'ID' 3-4 was on Day 2.
I have tried every way I could think of but just can't seem to get it right.
left4 = pd.DataFrame({'ID': [1,2,3,4],'val': [10000, 25000, 20000, 40000],
'Mod_ID': [15, 35, 15, 42],'car': ['ford','honda', 'ford', 'lexus']})
right4 = pd.DataFrame({'ID': [3,1,2,4],'color': ['red', 'green', 'blue', 'grey'], 'wheel': ['4wheel','4wheel', '2wheel', '2wheel'],
'Mod_ID': [15, 15, 35, 42]})
df1 = pd.merge(left4, right4, on='ID').drop('Mod_ID_y', axis=1)
Hard to test properly on your DataFrame, but you can use something like this:
>>> df1["exp_mean"] = df1[["Mod_ID_x","val"]].groupby("Mod_ID_x").transform(pd.expanding_mean)
>>> df1
ID Mod_ID_x car val color wheel exp_mean
0 1 15 ford 10000 green 4wheel 10000
1 2 35 honda 25000 blue 2wheel 25000
2 3 15 ford 20000 red 4wheel 15000
3 4 42 lexus 40000 grey 2wheel 40000
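A side note, not from the original answer: pd.expanding_mean was deprecated in later pandas releases in favour of the .expanding() method. A sketch of the equivalent under the newer API (assuming pandas >= 0.18), which also covers the median the question asks about:
# Per-group expanding statistics, aligned back to the original rows.
g = df1.groupby('Mod_ID_x')['val']
df1['exp_mean'] = g.transform(lambda s: s.expanding().mean())
df1['exp_median'] = g.transform(lambda s: s.expanding().median())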