how to draw a multiline chart using python pandas? - python-2.7

Dataframe:
Dept,Date,Que
ece,2015-06-25,96
ece,2015-06-24,89
ece,2015-06-26,88
ece,2015-06-19,87
ece,2015-06-23,82
ece,2015-06-30,82
eee,2015-06-24,73
eee,2015-06-23,71
eee,2015-06-25,70
eee,2015-06-19,66
eee,2015-06-27,60
eee,2015-06-22,56
mech,2015-06-27,10
mech,2015-06-22,8
mech,2015-06-25,8
mech,2015-06-19,7
I need a multiline chart with a grid, based on the Dept column: each Dept should be one line.
For example, for ece the line should go through 96, 89, 88, 87, 82, 82..., and likewise for the other Depts.

I think you need pivot and plot:
import matplotlib.pyplot as plt
df = df.pivot(index='Dept', columns='Date', values='Que')
print df
Date 2015-06-19 2015-06-22 2015-06-23 2015-06-24 2015-06-25 2015-06-26 \
Dept
ece 87.0 NaN 82.0 89.0 96.0 88.0
eee 66.0 56.0 71.0 73.0 70.0 NaN
mech 7.0 8.0 NaN NaN 8.0 NaN
Date 2015-06-27 2015-06-30
Dept
ece NaN 82.0
eee 60.0 NaN
mech 10.0 NaN
# after the pivot, Dept is on the rows, so transpose to plot one line per Dept
# (grid=True adds the requested grid)
df.T.plot(grid=True)
plt.show()
You can check the pivot docs.
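If you'd rather skip the pivot, an equivalent sketch is to loop over the groups and draw each Dept onto one Axes (this assumes df is still the original long-format frame with Dept/Date/Que columns, and that pandas is imported as pd):
fig, ax = plt.subplots()
for dept, grp in df.groupby('Dept'):
    grp = grp.sort_values('Date')                              # plot each line in date order
    ax.plot(pd.to_datetime(grp['Date']), grp['Que'], marker='o', label=dept)
ax.grid(True)
ax.legend()
plt.show()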


PC1 and PC2 values: original values

I just ran a PCA in R on the iris data set. This has been discussed several times in the past, but I am a little confused by the output.
I used prcomp, and this is the output for the loadings:
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
Here are the first 6 rows of the scores:
            PC1        PC2         PC3          PC4
[1,] -2.257141 -0.4784238  0.12727962  0.024087508
[2,] -2.074013  0.6718827  0.23382552  0.102662845
[3,] -2.356335  0.3407664 -0.04405390  0.028282305
[4,] -2.291707  0.5953999 -0.09098530 -0.065735340
[5,] -2.381863 -0.6446757 -0.01568565 -0.035802870
[6,] -2.068701 -1.4842053 -0.02687825  0.006586116
Here are the first 6 rows of the original values:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
Could someone explain how we get the PC1 score of -2.25 for row 1?
Thanks.
As per the documentation (?prcomp), the PC scores are the data, centred and scaled if requested, multiplied by the rotation matrix. So, let's do that calculation for row 1 and PC1 to check. In this example, I use a PCA object imaginatively called pca.
First, we centre the first row of data, iris[1, 1:4], using pca$center, and then scale it using pca$scale. Finally, we multiply by the loadings for PC1, pca$rotation[, 1], and sum the result.
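In symbols, the score of row 1 on PC1 is

$\mathrm{score}_{1,1} = \sum_{j=1}^{4} \frac{x_{1j} - \mu_j}{\sigma_j}\, r_{j1}$

where $x_{1j}$ are the raw values of row 1, $\mu_j$ and $\sigma_j$ are pca$center and pca$scale, and $r_{j1}$ is the PC1 column of pca$rotation.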
# Perform PCA
pca <- prcomp(iris[, 1:4], center = TRUE, scale = TRUE)
# Calculate PC1 score for first row of 'iris'
sum(pca$rotation[,1] * (iris[1, 1:4] - pca$center) / pca$scale)
#> [1] -2.257141
As expected, we get -2.257141.
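For anyone who wants to reproduce the same check in Python, here is a sketch using numpy and scikit-learn's copy of the iris data (the sign of each PC is arbitrary, so the score may come out as +2.257141):
import numpy as np
from sklearn.datasets import load_iris  # assumes scikit-learn is installed

X = load_iris().data                                # same 150 x 4 data as R's iris
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # centre and scale, like prcomp(..., center=TRUE, scale=TRUE)
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)   # rows of Vt are the loadings
scores = Xs.dot(Vt.T)                               # PC scores for every row
print(scores[0, 0])                                 # ~ -2.257141, up to sign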

How to only add common index pandas data frame? [duplicate]

Suppose I have two data frames. I would like to add the values where an index appears in both, and otherwise keep the value that is present. Let me illustrate this with an example:
import numpy as np
import pandas as pd
In [118]: df1 = pd.DataFrame([1, 2, 3, 4], index=pd.date_range('2018-01-01', periods=4))
In [119]: df2 = pd.DataFrame(10*np.ones_like(df1.values[1:3]), index=df1.index[1:3])
In [120]: df1.add(df2)
Out[120]:
0
2018-01-01 NaN
2018-01-02 12.0
2018-01-03 13.0
2018-01-04 NaN
However, I wanted to get
0
2018-01-01 1.0
2018-01-02 12.0
2018-01-03 13.0
2018-01-04 4.0
How can I achieve this? Moreover, is it even possible if df2.index is not a proper subset of df1.index, i.e. if
df2 = pd.DataFrame(10*np.ones_like(df1.values[1:3]), index=pd.DatetimeIndex([df1.index[1], pd.Timestamp('2019-01-01')]))
In [131]: df2
Out[131]:
0
2018-01-02 10
2019-01-01 10
In [132]: df1.add(df2)
Out[132]:
0
2018-01-01 NaN
2018-01-02 12.0
2018-01-03 NaN
2018-01-04 NaN
2019-01-01 NaN
But what I wanted is
0
2018-01-01 1.0
2018-01-02 12.0
2018-01-03 3.0
2018-01-04 4.0
2019-01-01 10.0
Combine with fillna, which fills the remaining NaNs from df1 (this works for the first case, where df2.index is a subset of df1.index):
df1.add(df2).fillna(df1)
Out[581]:
0
2018-01-01 1.0
2018-01-02 12.0
2018-01-03 13.0
2018-01-04 4.0
For the second case, where df2.index is not a subset of df1.index, concatenate and sum by index level:
pd.concat([df1,df2]).sum(level=0)
Out[591]:
0
2018-01-01 1
2018-01-02 12
2018-01-03 3
2018-01-04 4
2019-01-01 10
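A one-step alternative that covers both cases is add with fill_value, which treats a value missing on one side as 0 before summing:
df1.add(df2, fill_value=0)
Note also that sum(level=0) has been removed in recent pandas versions; the equivalent spelling is pd.concat([df1, df2]).groupby(level=0).sum().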

rolling mean with python 2.7

I want to write rolling-mean code for m_tax using Python 2.7 pandas, to analyse the time series data from this web page (http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm).
datum m_ta m_tax m_taxd m_tan m_tand
------- ----- ----- ---------- ----- ----------
1901-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10
1901-02 -2.1 3.5 1901-02-06 -7.9 1901-02-15
1901-03 5.8 13.5 1901-03-20 0.6 1901-03-01
1901-04 11.6 18.2 1901-04-10 7.4 1901-04-23
1901-05 16.8 22.5 1901-05-31 12.2 1901-05-05
1901-06 21.0 24.8 1901-06-03 14.6 1901-06-17
1901-07 22.4 27.4 1901-07-30 16.9 1901-07-04
1901-08 20.7 25.9 1901-08-01 14.7 1901-08-29
....
Here is what I tried:
pd.rolling_mean(df.resample("1M", fill_method="ffill"), window=60, min_periods=1, center=True).mean()
and I got this result:
m_ta 11.029173
m_tax 17.104283
m_tan 4.848637
month 6.499500
monthly_mean 11.030405
monthly_std 1.836159
m_tax% 0.083348
m_tan% 0.023627
dtype: float64
Another approach I tried:
s = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/1900', periods=1000))
s = s.cumsum()
r = s.rolling(window=60)
r.mean()
and I got this result:
1900-01-01 NaN
1900-01-02 NaN
1900-01-03 NaN
1900-01-04 NaN
1900-01-05 NaN
1900-01-06 NaN
1900-01-07 NaN
1900-01-08 NaN
...
So I am confused here. Which one should I use? Could someone please give me an idea? Thanks!
Starting with version 0.18.0, rolling() and resample() are methods that behave similarly to groupby(), and the corresponding top-level functions (such as pd.rolling_mean()) are deprecated.
What's new in pandas version 0.18.0
rolling()/expanding() in pandas version 0.18.0
resample() in pandas version 0.18.0
I can't tell exactly what your desired results are, but maybe something like this is what you want? (And you can see the warning message below, although I'm not sure what triggers it here.)
>>> df
m_ta m_tax m_taxd m_tan m_tand
datum
1901-01-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10
1901-02-01 -2.1 3.5 1901-02-06 -7.9 1901-02-15
1901-03-01 5.8 13.5 1901-03-20 0.6 1901-03-01
1901-04-01 11.6 18.2 1901-04-10 7.4 1901-04-23
1901-05-01 16.8 22.5 1901-05-31 12.2 1901-05-05
1901-06-01 21.0 24.8 1901-06-03 14.6 1901-06-17
1901-07-01 22.4 27.4 1901-07-30 16.9 1901-07-04
1901-08-01 20.7 25.9 1901-08-01 14.7 1901-08-29
>>> df.resample("1M").rolling(3,center=True,min_periods=1).mean()
/Users/john/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:1: FutureWarning: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)
if __name__ == '__main__':
m_ta m_tax m_tan
datum
1901-01-31 -3.400000 4.250000 -10.050000
1901-02-28 -0.333333 7.333333 -6.500000
1901-03-31 5.100000 11.733333 0.033333
1901-04-30 11.400000 18.066667 6.733333
1901-05-31 16.466667 21.833333 11.400000
1901-06-30 20.066667 24.900000 14.566667
1901-07-31 21.366667 26.033333 15.400000
1901-08-31 21.550000 26.650000 15.800000
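In later pandas versions the deferred-resample fallback shown above is gone, so an explicit aggregation step is needed between resample and rolling. A sketch, assuming a monthly mean is the intended downsampling:
>>> df.resample("1M").mean().rolling(3, center=True, min_periods=1).mean()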

Filter Pandas DataFrame by group with tag values

I want to filter a DataFrame by group: the NaNs that follow 'a' are supposed to be 'a' (the value works like a tag), and the NaNs that follow 'b' are likewise 'b'.
I have a short example:
In [1]: import pandas as pd
In [2]: from numpy import nan  # nan is not a builtin, so it has to be imported
   ...: df = pd.DataFrame({'group1': ['a',nan,nan,nan,nan,'b',nan,nan,nan,nan],
   ...:                    'value1': [0.4,1.1,2,3,4,5,6,7,8,8.8],
   ...:                    'value2': [6.4, 6.9,7.1,8,9,10,11,12,13,14]
   ...:                    })
My desired output would be:
In [3]: df[df.group1 == 'a']
Out[3]:
group1 value1 value2
0 a 0.4 6.4
1 NaN 1.1 6.9
2 NaN 2.0 7.1
3 NaN 3.0 8.0
4 NaN 4.0 9.0
I'd appreciate any hint!
You can use ffill to forward-fill the column:
>>> df[df['group1'].fillna(method='ffill') == 'a']
group1 value1 value2
0 a 0.4 6.4
1 NaN 1.1 6.9
2 NaN 2.0 7.1
3 NaN 3.0 8.0
4 NaN 4.0 9.0
but perhaps the better solution would be to forward-fill the column on the original data frame:
>>> df['group1'].fillna(method='ffill', inplace=True)
>>> df[df['group1'] == 'a']
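In recent pandas versions fillna(method='ffill') is deprecated in favour of the ffill() method, so the same filter would be spelled:
>>> df[df['group1'].ffill() == 'a']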

Merging Pandas rows based on index (time series)

I used Pandas .append() to add columns from a number of Pandas time series by their index (date). However, instead of combining all data from common dates into one row, the data looks like this:
sve2_all.sort(inplace=True)
print sve2_all['20000101':'20000104']
Hgtot ng/l Q l/s DOC_mg/L Flow_mm/day MeHg ng/l Site \
2000-01-01 NaN NaN NaN 0.18 NaN NaN
2000-01-01 NaN 0.613234 NaN NaN NaN SVE
2000-01-02 NaN NaN NaN 0.18 NaN NaN
2000-01-02 NaN 0.614410 NaN NaN NaN SVE
2000-01-03 NaN NaN NaN NaN NaN 2
2000-01-03 NaN 0.617371 NaN NaN NaN SVE
2000-01-03 NaN NaN NaN NaN NaN NaN
2000-01-03 NaN NaN NaN 0.18 NaN NaN
2000-01-04 NaN 0.627733 NaN NaN NaN SVE
2000-01-04 NaN NaN NaN 0.18 NaN NaN
TOC_filt.TOC TOC_unfilt.TOC Temp oC pH
2000-01-01 NaN NaN NaN NaN
2000-01-01 NaN NaN -12.6117 NaN
2000-01-02 NaN NaN NaN NaN
2000-01-02 NaN NaN -2.3901 NaN
2000-01-03 NaN 8.224648 NaN NaN
2000-01-03 NaN NaN -5.0064 NaN
2000-01-03 NaN NaN NaN NaN
2000-01-03 NaN NaN NaN NaN
2000-01-04 NaN NaN -1.5868 NaN
2000-01-04 NaN NaN NaN NaN
[10 rows x 10 columns]
I've tried to resample this data by day using:
sve2_all.resample('D', how='mean')
And also to group by day using:
sve2_all.groupby(sve2_all.index.map(lambda t: t.day))
However, the DataFrame remains unchanged. How can I collapse the rows for the same date into one row? Thanks.
Additional information: I tried using pd.concat() as suggested by Joris (I had to pass 0 as the axis argument, as 1 resulted in ValueError: cannot reindex from a duplicate axis) instead of .append(), but the resulting DataFrame is the same as with .append(): a non-uniform, non-monotonic time series. I think the index is the problem, but I'm not sure how to fix it. I thought that some timestamps might contain hour information while others do not, so I also tried using .resample('D', how='mean') on each DataFrame before using .concat(), but it didn't make a difference.
Solution: Joris' answer was correct; I didn't realise that .resample() isn't in-place. Once the result of .resample() was assigned to a new DataFrame, Joris' suggestion produced the desired result.
The append method does 'append' the rows to the other dataframe; it does not merge with it based on the index labels. For that you can use concat:
Using a toy example:
In [14]: df1 = pd.DataFrame(np.random.randn(3,2), columns=list('AB'), index=pd.date_range('2000-01-01', periods=3))
In [15]: df1
Out[15]:
A B
2000-01-01 1.532085 -1.338895
2000-01-02 -0.016784 -0.270698
2000-01-03 -1.680379 0.838287
In [16]: df2 = pd.DataFrame(np.random.randn(3,2), columns=list('CD'), index=pd.date_range('2000-01-01', periods=3))
In [17]: df2
Out[17]:
C D
2000-01-01 0.375214 -0.812558
2000-01-02 -1.099848 -0.889941
2000-01-03 1.556383 0.870608
.append will append the rows (and columns of df2 that are not in df1 will be added, which is the case here):
In [18]: df1.append(df2)
Out[18]:
A B C D
2000-01-01 1.532085 -1.338895 NaN NaN
2000-01-02 -0.016784 -0.270698 NaN NaN
2000-01-03 -1.680379 0.838287 NaN NaN
2000-01-01 NaN NaN 0.375214 -0.812558
2000-01-02 NaN NaN -1.099848 -0.889941
2000-01-03 NaN NaN 1.556383 0.870608
pd.concat() concatenates both dataframes along one of the axes:
In [19]: pd.concat([df1, df2], axis=1)
Out[19]:
A B C D
2000-01-01 1.532085 -1.338895 0.375214 -0.812558
2000-01-02 -0.016784 -0.270698 -1.099848 -0.889941
2000-01-03 -1.680379 0.838287 1.556383 0.870608
Apart from that, the resample should normally work.
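As the asker's own solution notes, .resample() is not in-place, so the daily aggregation has to be assigned back. A minimal sketch for the asker's frame (in older pandas the mean silently drops non-numeric columns such as Site; newer versions need them excluded first):
sve2_all = sve2_all.resample('D').mean()  # returns a new DataFrame; does not modify sve2_all in place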