Python Side-by-side box plots on same figure - python-2.7

I am trying to generate a box plot in Python 2.7 for each categorical value in column E from the Pandas dataframe below
A B C D E
0 0.647366 0.317832 0.875353 0.993592 1
1 0.504790 0.041806 0.113889 0.445370 2
2 0.769335 0.120647 0.749565 0.935732 3
3 0.215003 0.497402 0.795033 0.246890 1
4 0.841577 0.211128 0.248779 0.250432 1
5 0.045797 0.710889 0.257784 0.207661 4
6 0.229536 0.094308 0.464018 0.402725 3
7 0.067887 0.591637 0.949509 0.858394 2
8 0.827660 0.348025 0.507488 0.343006 3
9 0.559795 0.820231 0.461300 0.921024 1
I would be willing to do this with Matplotlib or any other plotting library. So far, the code below plots all the categories combined on one plot. Here is the code to generate the above data and produce the plot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# Data
df = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD'))
df['E'] = [1,2,3,1,1,4,3,2,3,1]
# Boxplot
bp = ax.boxplot(df.iloc[:,:-1].values, widths=0.2)
plt.show()
In this example, the categories are 1,2,3,4. I would like to plot separate boxplots side-by-side on the same figure, for only categories 1 and 2 and show the category names in the legend.
Is there a way to do this?
Additional Information:
The output should look similar to the 3rd figure from here - replace "Yes","No" by "1","2".

Starting with this:
import numpy
import pandas
from matplotlib import pyplot
import seaborn
seaborn.set(style="ticks")
# Data
df = pandas.DataFrame(numpy.random.rand(10,4), columns=list('ABCD'))
df['E'] = [1, 2, 3, 1, 1, 4, 3, 2, 3, 1]
You've got a couple of options. If separate axes are ok,
fig, axes = pyplot.subplots(ncols=4, figsize=(12, 5), sharey=True)
df.query("E in [1, 2]").boxplot(by='E', return_type='axes', ax=axes)
If you want a single axes, I think seaborn will be easier. You just need to reshape your data into long form first.
ax = (
    df.set_index('E', append=True)  # set E as part of the index
      .stack()  # pull A - D into rows
      .to_frame()  # convert to a dataframe
      .reset_index()  # make the index into regular columns
      .rename(columns={'level_2': 'quantity', 0: 'value'})  # rename columns
      .drop('level_0', axis='columns')  # drop junk columns
      .pipe((seaborn.boxplot, 'data'), x='E', y='value', hue='quantity', order=[1, 2])
)
seaborn.despine(trim=True)
The cool thing about seaborn is that tweaking the parameters slightly can achieve a lot in terms of the plot's layout. If we switch our hue and x variables, we get:
ax = (
    df.set_index('E', append=True)  # set E as part of the index
      .stack()  # pull A - D into rows
      .to_frame()  # convert to a dataframe
      .reset_index()  # make the index into regular columns
      .rename(columns={'level_2': 'quantity', 0: 'value'})  # rename columns
      .drop('level_0', axis='columns')  # drop junk columns
      .pipe((seaborn.boxplot, 'data'), x='quantity', y='value', hue='E', hue_order=[1, 2])
)
seaborn.despine(trim=True)
If you're curious, the resulting dataframe looks something like this:
E quantity value
0 1 A 0.935433
1 1 B 0.862290
2 1 C 0.197243
3 1 D 0.977969
4 2 A 0.675037
5 2 B 0.494440
6 2 C 0.492762
7 2 D 0.531296
8 3 A 0.119273
9 3 B 0.303639
10 3 C 0.911700
11 3 D 0.807861
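A possibly simpler route to the same long-form table is DataFrame.melt. This is only a sketch, not the approach used above: the seeded random data and the names long_df and subset are my own, and the rows come out grouped by column (all of A, then all of B, ...) rather than in the interleaved order shown above.

```python
import numpy as np
import pandas as pd

# Same layout as the question's data; the seed is an arbitrary choice
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(10, 4), columns=list("ABCD"))
df["E"] = [1, 2, 3, 1, 1, 4, 3, 2, 3, 1]

# Keep E as an identifier and turn columns A-D into (quantity, value) rows
long_df = df.melt(id_vars="E", var_name="quantity", value_name="value")

# Optionally keep only the categories of interest before plotting
subset = long_df[long_df["E"].isin([1, 2])]
print(subset.head())
```

long_df should then work as the data argument to seaborn.boxplot with the same x, y and hue arguments as above.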

An addition to @Paul_H's answer.
Side-by-side boxplots on a single matplotlib.axes.Axes, no seaborn:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.rand(10,4), columns=list('ABCD'))
df['E'] = [1, 2, 1, 1, 1, 2, 1, 2, 2, 1]
mask_e = df['E'] == 1
# prepare data
data_to_plot = [df[mask_e]['A'], df[~mask_e]['A'],
df[mask_e]['B'], df[~mask_e]['B'],
df[mask_e]['C'], df[~mask_e]['C'],
df[mask_e]['D'], df[~mask_e]['D']]
# positions defaults to range(1, N+1), where N is the number of boxplots to be drawn.
# We shift them a little to group the boxes visually.
plt.figure(figsize=(10, 6))
box = plt.boxplot(data_to_plot,
positions=[1, 1.6, 2.5, 3.1, 4, 4.6, 5.5, 6.1],
labels=['A1','A0','B1','B0','C1','C0','D1','D0'])
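
The positions do not have to be typed by hand. Below is a rough sketch that computes them in a loop for categories 1 and 2 of the original question and adds the legend that was asked for. The colours, the box width and the variable names are my own assumptions, and the data is seeded random data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)  # arbitrary seed
df = pd.DataFrame(rng.rand(10, 4), columns=list("ABCD"))
df["E"] = [1, 2, 3, 1, 1, 4, 3, 2, 3, 1]

categories = [1, 2]                  # categories of E to compare
columns = list("ABCD")               # one group of boxes per column
colors = ["tab:blue", "tab:orange"]  # arbitrary colour choice
width = 0.3                          # horizontal spacing between the boxes of a group

fig, ax = plt.subplots(figsize=(10, 6))
handles = []
for i, (cat, color) in enumerate(zip(categories, colors)):
    subset = df[df["E"] == cat]
    # Shift this category left or right of the integer group centre
    offset = (i - (len(categories) - 1) / 2.0) * width
    positions = [j + 1 + offset for j in range(len(columns))]
    bp = ax.boxplot([subset[c].values for c in columns],
                    positions=positions, widths=width * 0.9,
                    patch_artist=True)
    for patch in bp["boxes"]:
        patch.set_facecolor(color)
    handles.append(bp["boxes"][0])  # one legend entry per category

# One tick per column, centred under its group of boxes
ax.set_xticks(range(1, len(columns) + 1))
ax.set_xticklabels(columns)
ax.legend(handles, ["E = %d" % c for c in categories], title="Category")
```

Call plt.show() (or fig.savefig) afterwards to see the figure.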

Related

List format error using matplotlib LineCollection

I have a list (coordpairs) that I am trying to use as the basis for plotting using LineCollection. The list is derived from a Pandas data frame. I am having trouble getting the list in the right format, despite what is admittedly a clear error code. Trimmed data frame contents, code, and error are below. Thank you for any help.
Part of the Data Frame
RUP_ID Vert_ID Longitude Latitude
1 1 -116.316961 34.750178
1 2 -116.316819 34.750006
2 1 -116.316752 34.749938
2 2 -116.31662 34.749787
10 1 -116.317165 34.754078
10 2 -116.317277 34.751492
10 3 -116.317206 34.751273
10 4 -116.317009 34.75074
10 5 -116.316799 34.750489
11 1 -116.316044 34.760377
11 2 -116.317105 34.755674
11 3 -116.317165 34.754078
Code
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
fig = plt.figure()
ax1 = plt.subplot2grid((2, 2), (0, 0), rowspan=2, colspan=1)
for ii in range(1,len(mydf)):
    temp = mydf.loc[mydf.RUP_ID == ii]
    df_line = temp.sort_values(by='Vert_ID', ascending=True)
    del temp
    lat = df_line.Latitude
    lon = df_line.Longitude
    lat = lat.tolist()
    lon = lon.tolist()
    coordpairs = zip(lat, lon)
    lc = LineCollection(coordpairs, colors='r') # this is line 112 in the error
    ax1.add_collection(lc)
# note I also tried:
# import numpy as np
# coordpairs2 = np.vstack([np.array(u) for u in set([tuple(p) for p in coordpairs])])
# lc = LineCollection(coordpairs2, colors='r')
# and received the same plotting error
Error/Outputs
C:\apath\python.exe C:/mypath/myscript.py
Traceback (most recent call last):
File "C:/mypath/myscript.py", line 112, in <module>
lc = LineCollection(coordpairs, colors='r') # this is line 112 in the error
File "C:\apath\lib\site-packages\matplotlib\collections.py", line 1149, in __init__
self.set_segments(segments)
File "C:\apath\lib\site-packages\matplotlib\collections.py", line 1164, in set_segments
self._paths = [mpath.Path(_seg) for _seg in _segments]
File "C:\apath\lib\site-packages\matplotlib\path.py", line 141, in __init__
raise ValueError(msg)
ValueError: 'vertices' must be a 2D list or array with shape Nx2
Process finished with exit code 1
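The error arises because LineCollection expects a sequence of lines, each of which is itself an N x 2 sequence of (x, y) vertices; a flat list of coordinate pairs does not have that shape. You would want to create one single LineCollection with several lines, one per RUP_ID value from the first dataframe column. That means it is best to loop over the unique values of that column (not over every row!) and append each line's coordinates to a list. Use that list as the input to LineCollection. Note also that x comes first, so the pairs are (lon, lat).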
u = """RUP_ID Vert_ID Longitude Latitude
1 1 -116.316961 34.750178
1 2 -116.316819 34.750006
2 1 -116.316752 34.749938
2 2 -116.31662 34.749787
10 1 -116.317165 34.754078
10 2 -116.317277 34.751492
10 3 -116.317206 34.751273
10 4 -116.317009 34.75074
10 5 -116.316799 34.750489
11 1 -116.316044 34.760377
11 2 -116.317105 34.755674
11 3 -116.317165 34.754078"""
import io
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
df = pd.read_csv(io.StringIO(u), sep=r"\s+")
verts = []
for (RUP_ID, grp) in df.groupby("RUP_ID"):
    df_line = grp.sort_values(by='Vert_ID', ascending=True)
    lat = df_line.Latitude
    lon = df_line.Longitude
    verts.append(list(zip(lon, lat)))
lc = LineCollection(verts, color='r')
fig, ax = plt.subplots()
ax.add_collection(lc)
ax.autoscale()
plt.show()
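
To make the expected shape concrete, here is a minimal sketch using two vertices from the data above. The names one_line and segments are only illustrative. The point is that a single line still has to be wrapped in an outer list, so that LineCollection receives a list of lines rather than a list of points.

```python
import numpy as np
from matplotlib.collections import LineCollection

# One line made of two (lon, lat) vertices taken from the data above
one_line = [(-116.316961, 34.750178), (-116.316819, 34.750006)]

# LineCollection wants a sequence of lines, so wrap the single line in a list
segments = [one_line]
lc = LineCollection(segments, colors="r")

# Each stored segment is an N x 2 array of vertices
print(np.asarray(lc.get_segments()[0]).shape)
```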

Null independent column wise mean calculation in Python

I am trying to calculate the mean of three columns in Python. Here is the catch:
If none of the three values in a row is null, the mean should be (x+y+z)/3.
If one of the values in a row is null (say z), the mean should be (x+y)/2.
I'm storing these mean values in a separate column of the same pandas dataframe.
I'm looking for the best approach as my dataset has over 2 million rows.
My data is below.
Thanks in advance.
A B C
0 1 2 3 # = (1+2+3)/3 = 2
1 4 NaN 6 # = (4+6)/2 = 5
2 NaN 8 9 # = (8+9)/2 = 8.5
Just apply the numpy.nanmean function, which ignores NaN values. With axis = 0 (the default, so it can be omitted) you get one mean per column. For the row-wise means asked about in the question, use axis = 1:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a': [2.3, 4.5, 2.1, np.nan, 6.7],
'b': [2.4, 5.6, np.nan, np.nan, 7.1],
'c': [np.nan, np.nan, np.nan, np.nan, 0.9]
})
colmeans = df.apply(np.nanmean, axis = 0)
# colmeans
# a 3.900000
# b 5.033333
# c 0.900000
# dtype: float64
rowmeans = df.apply(np.nanmean, axis = 1)
# 0 2.35
# 1 5.05
# 2 2.10
# 3 NaN
# 4 4.90
# dtype: float64
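
With over 2 million rows, calling a function once per row through apply may be slow. As a sketch of an alternative, DataFrame.mean skips NaN by default (skipna=True) and is vectorised. The example uses the three rows from the question, and the result column name "mean" is my own choice.

```python
import numpy as np
import pandas as pd

# The three rows from the question
df = pd.DataFrame({
    "A": [1, 4, np.nan],
    "B": [2, np.nan, 8],
    "C": [3, 6, 9],
})

# Row-wise mean over the three columns; NaN entries are skipped by default
df["mean"] = df[["A", "B", "C"]].mean(axis=1)
print(df)
```

Selecting the three columns explicitly keeps the new column out of the calculation if the line is run again.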

Overlay histograms in one plot

I have two dataframes that I'm trying to make histograms of. I would like to overlay one histogram over the other and show them in the same cell, so I can easily compare the distributions. Can anyone suggest how to do that? I have example code and data below. This will plot the histograms separately one above the other.
Data:
print(df[1:5])
bob
1 1
2 3
3 5
4 1
print(df2[1:5])
bob
1 3
2 3
3 2
4 1
Code:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df[df['bob']>=1]['bob'].hist(bins=25, range=[0, 25])
plt.show()
df2[df2['bob']>=1]['bob'].hist(bins=25, range=[0, 25])
plt.show()
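Create a single Axes and pass both datasets to one ax.hist call, so that both histograms are drawn in the same plot with the same bins: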
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
fig = plt.figure()
ax = fig.add_subplot(111)
df = pd.DataFrame([1, 3, 5, 1], columns=["bob"], index=[1, 2, 3, 4])
df2 = pd.DataFrame([3, 3, 2, 1], columns=["bob"], index=[1, 2, 3, 4])
ax.hist([df["bob"], df2["bob"]], label=("df", "df2"), bins=25, range=[0, 25])
ax.legend()
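
Passing a list of datasets to a single hist call draws the bars next to each other within each bin. If the histograms should literally sit on top of one another, one option is to call hist once per dataframe on the same Axes with some transparency. This is only a sketch; the alpha value is an arbitrary choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"bob": [1, 3, 5, 1]}, index=[1, 2, 3, 4])
df2 = pd.DataFrame({"bob": [3, 3, 2, 1]}, index=[1, 2, 3, 4])

fig, ax = plt.subplots()
for frame, name in [(df, "df"), (df2, "df2")]:
    # Same filter as in the question: keep values >= 1
    values = frame.loc[frame["bob"] >= 1, "bob"]
    # Identical bins and range so the two histograms line up
    ax.hist(values, bins=25, range=(0, 25), alpha=0.5, label=name)
ax.legend()
```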

Assigning new column name and creating new column conditionally in pandas not working?

I have a simple pandas dataframe whose columns I then rename to 'a' and 'b'.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df.columns = ['a', 'b']
print df
df['color'] = np.where(df['b']=='Z', 'green', 'red')
print df
a b
0 Z A
1 Z B
2 X B
3 Y C
a b color
0 Z A red
1 Z B red
2 X B red
3 Y C red
Without the renaming line df.columns, I get
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
#df.columns = ['a', 'b']
#print df
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print df
Set Type color
0 Z A green
1 Z B green
2 X B red
3 Y C red
I want, and would expect, the first snippet to produce "green green red red", but it doesn't, and I don't know why.
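As pointed out in the comments, the problem comes from how you are renaming the columns. Your second output shows that the columns are actually ordered Set, Type (not Type, Set), so assigning df.columns = ['a', 'b'] makes 'a' the old Set column and 'b' the old Type column, and df['b']=='Z' is never true. You are better off renaming by name, like so:
df = df.rename(columns={'Type': 'a', 'Set': 'b'})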
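
For completeness, here is a small end-to-end sketch of renaming by name. It assumes, as the expected output suggests, that 'b' is meant to be the old Set column. An explicit mapping does not depend on the order in which pandas happens to store the columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Type": list("ABBC"), "Set": list("ZZXY")})

# Map old names to new names explicitly instead of relying on column order
df = df.rename(columns={"Type": "a", "Set": "b"})

df["color"] = np.where(df["b"] == "Z", "green", "red")
print(df)
```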

Pandas .loc setting with copy warning

I have a question on the usage of .loc. I couldn't find an explicit answer in the documentation.
Say I have a df like:
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": np.random.random(1000), "B": np.random.random(1000)})
I want to set a new column to 1 wherever the value in column A is > .1. Using some boolean logic:
crit = df['A'] > .1
Now, is using .loc this way:
df['New Column'] = 0
df['New Column'].loc[crit] = 1
Any different than:
df['New Column'] = 0
df.loc[crit, 'New Column'] = 1
Using the first way, I continually get a SettingWithCopyWarning; however, the values do appear to be changing in the df.
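
Not an authoritative answer, but as far as I understand the pandas indexing documentation, the first form is chained indexing: df['New Column'] is evaluated first and .loc then assigns into whatever that returned, which pandas cannot guarantee is the original data. The second form is a single .loc call on df itself. A sketch of that form follows, with seeded data and an extra column name ("New Column 2") that are my own additions.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # arbitrary seed
df = pd.DataFrame({"A": rng.random_sample(1000), "B": rng.random_sample(1000)})
crit = df["A"] > 0.1

# Single .loc call on df: row selector and column label together
df["New Column"] = 0
df.loc[crit, "New Column"] = 1

# The same result without .loc: convert the boolean mask to integers
df["New Column 2"] = crit.astype(int)
```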