I am new to Python and have been struggling with this problem for quite a while. I have a dict like this:
dict1 = {(a,a) : 5, (a,b) :10, (a,c) : 11, (b,a): 4, (b,b) : 8, (b,c) : 3....}
What I would like to do is convert this into a pandas dataframe that looks like this:
    a   b   c
a   5  10  11
b   4   8   3
c  ..  ..  ..
After that I would like to create a grouped (multiple) bar plot in a Jupyter notebook. I know you can turn the data into a pandas Series, which displays like this:
dataset = pd.Series(dict1)
print(dataset)
a  a     5
   b    10
   c    11
b  a     4
   b     8
   c     3
c  a    ..
   b    ..
   c    ..
However, I was not able to create a multiple bar plot from that.
You're almost there; you just need to unstack, which moves the inner index level into the columns:
dataset.unstack()
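To make that concrete, here is a minimal sketch of the round trip. It assumes the keys are the strings 'a', 'b' and 'c', and it uses only the two rows whose values the question actually gives:

```python
import pandas as pd

# Keys are (row, column) tuples; only the values given in the question are used.
dict1 = {("a", "a"): 5, ("a", "b"): 10, ("a", "c"): 11,
         ("b", "a"): 4, ("b", "b"): 8, ("b", "c"): 3}

dataset = pd.Series(dict1)   # tuple keys become a two-level MultiIndex
table = dataset.unstack()    # the inner level moves to the columns
print(table)
```

Calling table.plot(kind="bar") on the result should then draw one group of bars per row label, with one bar per column (this needs matplotlib installed).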
I prefer to use this page for reference, rather than the official documentation.
I have a dataframe, outlet_result,
that looks something like this:
index  Calendar Year/Week  Material  Sellthru Qty
0      37.2013             ABC       2
1      38.2013             ABC       7
2      37.2013             BCG       22
3      39.2013             XYZ       5
Now, I wanted a separate list for the Materials and week for further coding.
I used this code for the material list
mat_outlet = list(set(outlet_result['Material']))
It works perfectly and gives me 3 values (ABC, BCG, XYZ)
However, the week list shows a faulty output even though the code is the same.
week_outlet_list = list(set(outlet_result['Calendar Year/Week']))
I am getting a list with 4 values
['38.2013', '37.2013', 'Calendar Year/Week', '39.2013']
Why is the string (header) included in the list? Please help me understand this concept.
I am using Python 2.7.... has it got something to do with it?
In my dataset I have a column, Topics, whose values are strings of topics separated by commas.
df = pd.DataFrame({'Stats': [3377, 1843, 15234], 'Topics': ["A, B, C, D", "A, B", "C, D"]})
What I need is to plot the average Stats per Topic (A, B, C, D).
Could anyone suggest a smart way of doing it?
I'm not sure what your desired output is, but this should hopefully get you going in the right direction. The key point is to split out the topics so that each (Stats, Topic) pair gets its own row; after that you can do whatever analytics you want.
df2 = pd.DataFrame([(row.Stats, topic.strip())
                    for _, row in df.iterrows()
                    for topic in row.Topics.split(',')],
                   columns=['Stats', 'Topic'])
>>> df2.groupby('Topic').Stats.mean()
Topic
A    2610.0
B    2610.0
C    9305.5
D    9305.5
Name: Stats, dtype: float64
>>> df2.head()
   Stats Topic
0   3377     A
1   3377     B
2   3377     C
3   3377     D
4   1843     A
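As an alternative sketch of the same split-then-group idea, newer pandas can do the reshaping without an explicit loop. This assumes pandas 1.1 or later (for explode with ignore_index); the names long_df and topic_means are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Stats": [3377, 1843, 15234],
                   "Topics": ["A, B, C, D", "A, B", "C, D"]})

# Split each comma-separated string into a list, then give every topic its own row.
long_df = (df.assign(Topic=df["Topics"].str.split(","))
             .explode("Topic", ignore_index=True))
long_df["Topic"] = long_df["Topic"].str.strip()  # drop the space left after each comma

topic_means = long_df.groupby("Topic")["Stats"].mean()
print(topic_means)
```

The result should match the groupby output above, and topic_means.plot(kind="bar") would give the bar chart of average Stats per Topic (matplotlib required).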
I want to create a small test data set with some specific values:
x
-
1
3
4
5
7
I can do this the hard way:
. set obs 5
. generate x = .
. replace x = 1 in 1
. replace x = 3 in 2
. replace x = 4 in 3
. replace x = 5 in 4
. replace x = 7 in 5
I can also use the data editor, but I'd like to create a .do file which can recreate this data set.
So how do I set the values of a variable from a list of numbers?
This can be done using a (to my mind) poorly documented feature of input:
clear
input x
1
3
4
5
7
end
I say poorly documented because the title of the input help page is
[D] Input -- Enter data from keyboard
which is clearly only a subset of what this command can do.
Here is another way:
clear
mat x = (1,3,4,5,7)
set obs `=colsof(x)'
generate x = x[1, _n]
And another:
clear
mata : x = (1,3,4,5,7)'
getmata x=x
My data set looks like this:
1
2
3
4
5
...
I have an intermediate step which should produce the following:
1
1,2
1,2,3
1,2,3,4
1,2,3,4,5
....
And finally calculate the mean of each row:
1
1.5
2
2.5
3
...
Questions
a) Is there a way to implement this in Python / PySpark?
b) Is there a method/API which does this out of the box?
c) I googled around for this kind of solution; the closest I got was moving mean / rolling average / moving group. Is there a term for this operation?
In pandas, this is called an expanding mean:
import pandas as pd
s = pd.Series(range(1, 6))
s.expanding().mean()  # spelled pd.expanding_mean(s) in old pandas releases
Out[128]:
0 1.0
1 1.5
2 2.0
3 2.5
4 3.0
dtype: float64
I'm not sure how you'd do this in Spark. That said, I'm also not sure whether this is a parallelizable task: since each step relies on the previous one, I'm not sure how you'd break it up into independent pieces.
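One observation that may help here: the expanding mean at position i is just the running sum up to i divided by i, so each step only needs a running total rather than the full prefix list. A minimal plain-Python sketch of that idea follows (the function name expanding_mean is only for illustration, it is not a library call):

```python
from itertools import accumulate

def expanding_mean(values):
    """Return the mean of values[:1], values[:2], ... as a list."""
    running_sums = accumulate(values)  # 1, 3, 6, 10, 15 for the input 1..5
    return [total / count for count, total in enumerate(running_sums, start=1)]

means = expanding_mean([1, 2, 3, 4, 5])
print(means)  # [1.0, 1.5, 2.0, 2.5, 3.0]
```

Framing it as a running (prefix) sum may also be a useful search term, since prefix sums are a well-studied operation with parallel algorithms.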
I was thinking about some code that I wrote a few years ago in Python. At some point it had to pick out just some elements, by index, from a list of lists.
I remember I did something like this:
def getRows(m, row_indices):
    tmp = []
    for i in row_indices:
        tmp.append(m[i])
    return tmp
Now that I've learnt a little bit more since then, I'd use a list comprehension like this:
[m[i] for i in row_indices]
But I'm still wondering if there's an even more pythonic way to do it. Any ideas?
I would also like to know about alternatives using NumPy or any other array library.
It's worth looking at NumPy for its slicing syntax. Scroll down in the linked page until you get to "Indexing, Slicing and Iterating".
The list comprehension is the clean and obvious way. So, I'd say it doesn't get more Pythonic than that.
As Curt said, it seems that NumPy is a good tool for this. Here's an example:
import numpy as np
a = np.arange(16).reshape((4, 4))
b = a[:, [1, 2]]  # columns 1 and 2
c = a[[1, 2], :]  # rows 1 and 2
print(a)
print(b)
print(c)
gives
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
[[ 1  2]
 [ 5  6]
 [ 9 10]
 [13 14]]
[[ 4  5  6  7]
 [ 8  9 10 11]]
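Tying this back to the original getRows problem, here is a small sketch of two alternatives to the list comprehension. The sample data m and row_indices are made up for illustration:

```python
from operator import itemgetter
import numpy as np

m = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
row_indices = [0, 2]

# Plain Python: itemgetter with several indices returns a tuple of rows.
picked = list(itemgetter(*row_indices)(m))

# NumPy: indexing with a list of integers selects those rows in one step.
picked_np = np.array(m)[row_indices]

print(picked)
print(picked_np.tolist())
```

One caveat with itemgetter: given a single index it returns that one row rather than a tuple of rows, so the list comprehension is probably the safer choice when row_indices can have length one.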