How to make a boxplot where each row in my dataframe object is a box in the plot?
I have some stock data that I want to plot with a box plot. My data is from yahoo finance and includes Open, High, Low, Close, Adjusted Close and Volume data for each trading day. I want to plot a box plot where each box is 1 day of OHLC price action.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.io.data import DataReader
# get daily stock price data from yahoo finance for S&P500
SP = DataReader("^GSPC", "yahoo")
SP.head()
Open High Low Close Volume Adj Close
Date
2010-01-04 1116.56 1133.87 1116.56 1132.99 3991400000 1132.99
2010-01-05 1132.66 1136.63 1129.66 1136.52 2491020000 1136.52
2010-01-06 1135.71 1139.19 1133.95 1137.14 4972660000 1137.14
2010-01-07 1136.27 1142.46 1131.32 1141.69 5270680000 1141.69
2010-01-08 1140.52 1145.39 1136.22 1144.98 4389590000 1144.98
plt.figure()
bp = SP.boxplot()
But when I plot this data frame as a boxplot, I only get one box with the Open, High, Low, and Close values of the entire Volume column.
Likewise, I try re-sampling my Adjusted Close daily price data to get weekly OHLC:
close = SP['Adj Close']
wk = close.resample('W', how='ohlc')
wk.head()
open high low close
Date
2010-01-10 1132.99 1144.98 1132.99 1144.98
2010-01-17 1146.98 1148.46 1136.03 1136.03
2010-01-24 1150.23 1150.23 1091.76 1091.76
2010-01-31 1096.78 1097.50 1073.87 1073.87
2010-02-07 1089.19 1103.32 1063.11 1066.19
This yields a Box Plot with 4 Boxes. Each box is the range of each column, not row. So for example, the first Box, 'open', shows the Open, Close, High and Low of the entire 'open' Column.
But what I actually want is 1 box for each 'Date' (index or row of my DataFrame). So the first Box will show the OHLC of the first row, '2010-01-10'. Second box will be the second row ('2010-01-17').
What I really want though is each row in my original Daily data (SP DataFrame) is its own OHLC Box. Essentially I want daily candlesticks, generated as a boxplot().
Open High Low Close
Date
2010-01-04 1116.56 1133.87 1116.56 1132.99
How do I do this using the Pandas DataFrame and Matplotlib boxplot()? I just want a basic box plot where each row from the DataFrame is an OHLC box in the plot. Nothing fancy at this point. Thanks!
As I said in the comments, you don't really want boxplots. Instead you should be making a candlestick chart. Here's some code to get you started.
import numpy as np
import pandas
import matplotlib.pyplot as plt
from matplotlib.finance import candlestick, candlestick2
import matplotlib.dates as mdates
from pandas.io.data import DataReader
# get daily stock price data from yahoo finance for S&P500
SP = DataReader("^GSPC", "yahoo")
SP.reset_index(inplace=True)
print(SP.columns)
SP['Date2'] = SP['Date'].apply(lambda date: mdates.date2num(date.to_pydatetime()))
fig, ax = plt.subplots()
csticks = candlestick(ax, SP[['Date2', 'Open', 'Close', 'High', 'Low']].values)
plt.show()
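Note that matplotlib.finance was later removed from matplotlib, so the import above may fail on a recent install. If you still want something boxplot-based, here is a minimal sketch (not the method used above) that draws one box per row with Axes.bxp, which accepts precomputed box statistics. The rows are copied from the table in the question; ohlc_to_bxp_stats is a hypothetical helper, and using the close as the "median" line is purely an assumption, since a candlestick has no median.

```python
import matplotlib.pyplot as plt

# Sample rows copied from the question: (date, open, high, low, close)
rows = [
    ("2010-01-04", 1116.56, 1133.87, 1116.56, 1132.99),
    ("2010-01-05", 1132.66, 1136.63, 1129.66, 1136.52),
    ("2010-01-06", 1135.71, 1139.19, 1133.95, 1137.14),
]


def ohlc_to_bxp_stats(rows):
    """Hypothetical helper: build one Axes.bxp stats dict per OHLC row."""
    stats = []
    for label, open_, high, low, close in rows:
        stats.append({
            "label": label,
            "whislo": low,               # lower whisker = day's low
            "q1": min(open_, close),     # bottom of the box
            "med": close,                # assumption: draw the close as the "median" line
            "q3": max(open_, close),     # top of the box
            "whishi": high,              # upper whisker = day's high
            "fliers": [],                # no outliers for OHLC data
        })
    return stats


stats = ohlc_to_bxp_stats(rows)
fig, ax = plt.subplots()
artists = ax.bxp(stats, showfliers=False)  # one box per row
ax.set_ylabel("Price")
```

Call plt.show() afterwards to display the figure. The box spans open to close and the whiskers reach the low and high, so each row reads roughly like a candlestick, without the up/down colouring.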
I have a csv file and I am trying to plot the average of some values per month. My csv file is structured as shown below, so I believe that I should group my data daily, then monthly in order to calculate the mean value.
timestamp,heure,lat,lon,impact,type
2007-01-01 00:00:00,13:58:43,33.837,-9.205,10.3,1
2007-01-02 00:00:00,00:07:28,34.5293,-10.2384,17.7,1
2007-01-02 00:00:00,23:01:03,35.0617,-1.435,-17.1,2
2007-01-03 00:00:00,01:14:29,36.5685,0.9043,36.8,1
2007-01-03 00:00:00,05:03:51,34.1919,-12.5061,-48.9,1
I am using this code:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
df= pd.read_csv("ave.txt", sep=',', names =["timestamp","heure","lat","lon","impact","type"])
daily = df.set_index('timestamp').groupby(pd.TimeGrouper(key='timestamp', freq='D', axis=1), axis=1)['impact'].count()
monthly = daily.groupby(pd.TimeGrouper(freq='M')).mean()
ax = monthly.plot(kind='bar')
plt.show()
But, I keep getting errors like this:
KeyError: 'The grouper name timestamp is not found'
Any ideas?
You're getting this error because you've set the timestamp column as the index, so it no longer exists as a column for key='timestamp' to find. Either remove key='timestamp' from TimeGrouper() or drop the set_index call, and it should group as you expect:
daily = df.set_index('timestamp').groupby(pd.TimeGrouper(freq='D', axis=1), axis=1)['impact'].count()
or
daily = df.groupby(pd.TimeGrouper(key='timestamp', freq='D', axis=1), axis=1)['impact'].count()
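A small sketch of both options, using the five sample rows from the question. It uses pd.Grouper (pd.TimeGrouper was deprecated in later pandas releases and then removed) and leaves out axis=1; both are substitutions made for this sketch rather than the code above. The monthly step via to_period is likewise just one possible way to do it.

```python
import io

import pandas as pd

# The five sample rows from the question
csv_text = """timestamp,heure,lat,lon,impact,type
2007-01-01 00:00:00,13:58:43,33.837,-9.205,10.3,1
2007-01-02 00:00:00,00:07:28,34.5293,-10.2384,17.7,1
2007-01-02 00:00:00,23:01:03,35.0617,-1.435,-17.1,2
2007-01-03 00:00:00,01:14:29,36.5685,0.9043,36.8,1
2007-01-03 00:00:00,05:03:51,34.1919,-12.5061,-48.9,1
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

# Option 1: timestamp stays a column, so the grouper finds it by key
by_key = df.groupby(pd.Grouper(key="timestamp", freq="D"))["impact"].count()

# Option 2: timestamp becomes the index, so no key is given
by_index = df.set_index("timestamp").groupby(pd.Grouper(freq="D"))["impact"].count()

# Mean number of impacts per day, for each calendar month
monthly = by_key.groupby(by_key.index.to_period("M")).mean()
print(monthly)
```

Mixing the two (set_index plus key='timestamp') is what raises the KeyError, because the grouper then looks for a column that has been moved into the index.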
I believe you need DataFrame.resample.
You also need to convert timestamp to a DatetimeIndex, using the parse_dates and index_col parameters of read_csv.
names =["timestamp","heure","lat","lon","impact","type"]
data = pd.read_csv('fou.txt',names=names, parse_dates=['timestamp'],index_col=['timestamp'])
print (data.head())
#your code
daily = data.groupby(pd.TimeGrouper(freq='D'))['impact'].count()
monthly = daily.groupby(pd.TimeGrouper(freq='M')).mean()
ax = monthly.plot(kind='bar')
plt.show()
# simpler, with resample
daily = data.resample('D')['impact'].count()
monthly = daily.resample('M').mean()
ax = monthly.plot(kind='bar')
plt.show()
Also check whether you really need count rather than size.
What is the difference between size and count in pandas?
daily = data.resample('D')['impact'].size()
monthly = daily.resample('M').mean()
ax = monthly.plot(kind='bar')
plt.show()
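To illustrate the difference: count skips missing values, while size counts every row in the bin. A minimal sketch on made-up data (the values and the NaN are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Made-up data: two rows on 2007-01-02, one of them with a missing impact
idx = pd.to_datetime(["2007-01-01", "2007-01-02", "2007-01-02", "2007-01-03"])
data = pd.DataFrame({"impact": [10.3, 17.7, np.nan, 36.8]}, index=idx)

daily_count = data.resample("D")["impact"].count()  # non-missing values per day
daily_size = data.resample("D")["impact"].size()    # all rows per day, NaN included

print(daily_count.tolist())  # [1, 1, 1]
print(daily_size.tolist())   # [1, 2, 1]
```

If the impact column never contains missing values, the two give the same result.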
I am trying to create a scatter plot of measurements where the x labels are WIFI channels. By default matplotlib is spacing the labels in proportion to their numerical value. However, I would like them to be spaced uniformly over the scatter plot. Is that possible?
This is basically what my plot code currently looks like:
- where chanPoints is a list of frequencies and measurements is a list of measurements.
plt.scatter(chanPoints,measurements)
plt.xlabel('Frequency (MHz)')
plt.ylabel('EVM (dB)')
plt.xticks(Tchan,rotation = 90)
plt.title('EVM for 5G Channels by Site')
plt.show()
Numpy
You may use numpy to create an array which maps the unique items within chanPoints to numbers 0,1,2.... You can then give each of those numbers the corresponding label.
import matplotlib.pyplot as plt
import numpy as np
chanPoints = [4980, 4920,4920,5500,4980,5500,4980, 5500, 4920]
measurements = [5,6,4,3,5,8,4,6,3]
unique, index = np.unique(chanPoints, return_inverse=True)
plt.scatter(index, measurements)
plt.xlabel('Frequency (MHz)')
plt.ylabel('EVM (dB)')
plt.xticks(range(len(unique)), unique)
plt.title('EVM for 5G Channels by Site')
plt.show()
Seaborn
If you're happy to use seaborn, this can save a lot of manual work. Seaborn is specialized for plotting categorical data. With a swarmplot, for example, the chanPoints are interpreted as categories on the x axis and are spaced equally. Points that would otherwise overlap are plotted next to each other, which may be an advantage because it lets you see the number of measurements for each channel.
import matplotlib.pyplot as plt
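import seaborn as sns  # seaborn.apionly was removed in later seaborn releases
chanPoints = [4980, 4920,4920,5500,4980,5500,4980, 5500, 4920]
measurements = [5,6,4,3,5,8,4,6,3]
# pass x and y by keyword; newer seaborn versions no longer accept them positionally
sns.swarmplot(x=chanPoints, y=measurements)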
plt.xlabel('Frequency (MHz)')
plt.ylabel('EVM (dB)')
plt.title('EVM for 5G Channels by Site')
plt.show()
Replace chanPoints with an index.
index = numpy.searchsorted(Tchan, chanPoints)
plt.scatter(index, measurements)
Then build your xticks with the corresponding labels.
ticks = range(len(Tchan))
plt.xticks(ticks, labels=Tchan, rotation = 90)
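One caveat, shown as a small sketch: searchsorted only gives the right positions if Tchan is sorted in ascending order and contains every value in chanPoints. The Tchan list here is an assumed example, since the question does not show it. With a sorted Tchan the result matches the np.unique approach; for an unsorted Tchan, a plain dict lookup is one possible alternative.

```python
import numpy as np

Tchan = [4920, 4980, 5500]  # assumed channel list, sorted ascending
chanPoints = [4980, 4920, 4920, 5500, 4980, 5500, 4980, 5500, 4920]

# Position of each measurement's channel within Tchan
index = np.searchsorted(Tchan, chanPoints)

# Same positions derived from the data itself, without Tchan
_, inverse = np.unique(chanPoints, return_inverse=True)

# Order-independent alternative: explicit channel -> position lookup
position = {chan: i for i, chan in enumerate(Tchan)}
index_from_dict = [position[chan] for chan in chanPoints]

print(index.tolist())  # [1, 0, 0, 2, 1, 2, 1, 2, 0]
```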
I am a bit new to Python and am implementing a project that converts an Excel sheet to a pivot-table Excel sheet. Once the pivot-table sheet is created, I am trying to create a separate Excel sheet for each row of the pivot data.
I am successful in getting as far as the pivot data in Excel (Workbook3.xlsx), but when I try to continue with a "for loop", I get the error below:
"raise NotImplementedError("Writing as Excel with a MultiIndex is "
NotImplementedError: Writing as Excel with a MultiIndex is not yet implemented."
Can someone help me out to implement this last step ?
Below is my code :
Author : Abhishek
Version : December 23, 2016
import pandas as pd
import xlrd
import xlsxwriter
import numpy as np
names=['Business Type','Trip ID','Status','Name of Customer','Driver Name','Trip Type','One way/two way','Channel of Lead','Payment Rate','Remark for Payment/Discount/Fixed Rate/Monthly Rate','Trip Date','Trip ID','Entry Date and Time','CSE','Trip Time(when customer required)','Begin Trip Time','End Trip Time','Hours','Minutes','Basic','Basics','Transit','Transits','Discount ','Discounts','Tax','Total','Wallet','Wallet Type','Cash with Driver','Adjustment','Remark','Blank','Basic','Zuver (20%)','Basic Earning','Transit','Total Earning','Cash Colleted','Balance','Inventive','Total Earning','Total Cash Collected','Total Balance','Total Incentive','Final Invoice']
df=pd.read_excel(r'path/28 Nov - 4 Dec _ Payment IC _ Mumbai.xlsx',sheetname='calc',header=None,names=names)
df = df[df.Status != 'Cancelled']
df = df[df.Status != 'Unfulfilled']
#print df['Begin Trip Time'].values
# Defining variables for the output Report
custname=df['Name of Customer'].values
drivername=df['Driver Name'].values
drivercontact=df['Trip Date'].values
status=df['Status'].values
tripstatus=df['Begin Trip Time'].values
triptype=df['End Trip Time'].values
starttime=df['Hours'].values
endtime=df['Minutes'].values
transit=df['Transits'].values
totalbill=df['Basic Earning'].values
totalearning=df['Transits'].values+df['Basic Earning'].values
cashcollected=df['Cash Colleted'].values
balance=df['Balance'].values
incentive= df['Inventive'].values
df1 = pd.DataFrame(zip(totalbill,drivername,custname,drivercontact,status,tripstatus,triptype,starttime,endtime,totalbill,transit,totalearning,cashcollected,balance,incentive))
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('path/Workbook2.xlsx', engine='xlsxwriter')
df1.to_excel(writer, sheet_name='Sheet1',index=False,header=False)
# Close the Pandas Excel writer and output the Excel file.
writer.save()
#Pivot table
df2=pd.read_excel(r'path/Workbook2.xlsx',sheetname='Sheet1')
#table=pd.pivot_table(df2, index=['Driver Name','Name of Customer'])
df2['Inventive'].fillna(0, inplace=True)
df2 = pd.pivot_table(df2, values=['Hours', 'Minutes'],index=['Driver Name','Name of Customer','Inventive','Trip Date','Status','Begin Trip Time','End Trip Time','Basic Earning','Cash Colleted','Balance'],aggfunc=np.sum, fill_value='0', margins=True)
pivoted = pd.ExcelWriter('path/Workbook3.xlsx', engine='xlsxwriter')
df2.to_excel(pivoted, sheet_name='Sheet1')
pivoted.save()
df3=pd.read_excel(r'path/Workbook3.xlsx',sheetname='Sheet1')
for n in range(0, len(df3)):
    tempdata=df3.iloc[n]
    df4= pd.DataFrame(tempdata)
    writer=pd.ExcelWriter('path/Final%s.xlsx' % n, engine='xlsxwriter')
    df4.to_excel(writer, sheet_name='Sheet 1')
    writer.save()
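One possible route, assuming the error comes from a MultiIndex surviving into the frame being written: flatten the pivot table with reset_index() so the index levels become ordinary columns, then take each row as its own one-row DataFrame. The sketch below uses a tiny made-up trips table (names and values are hypothetical) and stops short of writing files.

```python
import pandas as pd

# Made-up stand-in for the real trip data
trips = pd.DataFrame({
    "Driver Name": ["A", "A", "B"],
    "Name of Customer": ["X", "Y", "X"],
    "Hours": [2, 3, 1],
    "Minutes": [30, 15, 45],
})

# Pivoting on several index columns produces a MultiIndex
pivot = pd.pivot_table(
    trips,
    values=["Hours", "Minutes"],
    index=["Driver Name", "Name of Customer"],
    aggfunc="sum",
)

# reset_index() turns the index levels back into plain columns
flat = pivot.reset_index()

# One single-row DataFrame per pivot row (double brackets keep it a DataFrame)
row_frames = [flat.iloc[[n]] for n in range(len(flat))]
print(row_frames[0])
```

Each one-row frame could then be written with something like frame.to_excel('path/Final%s.xlsx' % n, index=False); with index=False there is no index left to trigger the MultiIndex check.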
I've been having some difficulty with Matplotlib's finance charting. It seems like their candlestick charts work best with daily data, and I am having a hard time making them work with intraday (every 5 minutes, between 9:30 and 4 pm) data.
I have pasted sample data in pastebin. The top is what I get from the database, and the bottom is tupled with the date formatted into an ordinal float for use in Matplotlib.
Link to sample data
When I draw my charts there are huge gaps in it, the axes suck, and the zoom is equally horrible. http://imgur.com/y7O8A
How do I make a nice readable graph out of this data? My ultimate goal is to get a chart that looks remotely like this:
http://i.imgur.com/EnrTW.jpg
The data points can be in various increments from 5 minutes to 30 minutes.
I have also made a Pandas dataframe of the data, but I am not sure if pandas has candlestick functionality.
If I understand correctly, one of your major concerns is the gaps between the daily data.
To get rid of them, one method is to artificially 'evenly space' your data (but of course you will lose any intra-day temporal indication).
Done this way, you will be able to obtain a chart that looks like the one you gave as an example.
The commented code and the resulting graph are below.
import numpy as np
import matplotlib.pyplot as plt
import datetime
from matplotlib.finance import candlestick
from matplotlib.dates import num2date
# data in a text file, 5 columns: time, opening, close, high, low
# note that I'm using the time you formatted into an ordinal float
data = np.loadtxt('finance-data.txt', delimiter=',')
# determine number of days and create a list of those days
ndays = np.unique(np.trunc(data[:,0]), return_index=True)
xdays = []
for n in np.arange(len(ndays[0])):
    xdays.append(datetime.date.isoformat(num2date(data[ndays[1],0][n])))
# creation of new data by replacing the time array with equally spaced values.
# this will allow to remove the gap between the days, when plotting the data
data2 = np.hstack([np.arange(data[:,0].size)[:, np.newaxis], data[:,1:]])
# plot the data
fig = plt.figure(figsize=(10, 5))
ax = fig.add_axes([0.1, 0.2, 0.85, 0.7])
# customization of the axis
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.tick_params(axis='both', direction='out', width=2, length=8,
               labelsize=12, pad=8)
ax.spines['left'].set_linewidth(2)
ax.spines['bottom'].set_linewidth(2)
# set the ticks of the x axis only when starting a new day
ax.set_xticks(data2[ndays[1],0])
ax.set_xticklabels(xdays, rotation=45, horizontalalignment='right')
ax.set_ylabel('Quote ($)', size=20)
ax.set_ylim([177, 196])
candlestick(ax, data2, width=0.5, colorup='g', colordown='r')
plt.show()
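The even-spacing step can be checked in isolation, without any plotting. A minimal sketch on made-up quotes (the timestamps and prices are assumptions): the time column is replaced by 0, 1, 2, ... and the tick positions are the first sample of each day.

```python
import numpy as np

# Made-up quotes: time as ordinal float (day + fraction), open, close, high, low
data = np.array([
    [734000.40, 10.0, 10.5, 10.6, 9.9],
    [734000.41, 10.5, 10.4, 10.7, 10.3],
    [734001.40, 10.4, 10.8, 10.9, 10.2],
    [734001.41, 10.8, 10.7, 11.0, 10.6],
    [734004.40, 10.7, 11.1, 11.2, 10.5],   # after a weekend gap
])

# First row index of each calendar day
days, first_idx = np.unique(np.trunc(data[:, 0]), return_index=True)

# Replace the time column with evenly spaced integers
data2 = np.hstack([np.arange(len(data))[:, np.newaxis], data[:, 1:]])

# x positions where a new day starts; these become the labelled ticks
tick_positions = data2[first_idx, 0]
print(tick_positions.tolist())  # [0.0, 2.0, 4.0]
```

The weekend gap disappears because the x axis now counts samples rather than time; the day labels on the ticks are the only temporal information left.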
I got tired of matplotlib's (and plotly's) bad performance and lack of the features you are asking for, so I implemented a library of my own. Here's how it works:
import finplot as fplt
import yfinance
df = yfinance.download('AAPL')
fplt.candlestick_ochl(df[['Open', 'Close', 'High', 'Low']])
fplt.show()
Not only are days on which the exchange is closed left out automatically, it also has better performance and a nicer API. For something that more closely resembles what you're ultimately looking for:
import finplot as fplt
import yfinance
symbol = 'AAPL'
df = yfinance.download(symbol)
ax = fplt.create_plot(symbol)
fplt.candlestick_ochl(df[['Open', 'Close', 'High', 'Low']], ax=ax)
fplt.plot(df['Close'].rolling(200).mean(), ax=ax, legend='SMA 200')
fplt.plot(df['Close'].rolling(50).mean(), ax=ax, legend='SMA 50')
fplt.plot(df['Close'].rolling(20).mean(), ax=ax, legend='SMA 20')
fplt.volume_ocv(df[['Open', 'Close', 'Volume']], ax=ax.overlay())
fplt.show()
Dear all python users,
I'm new to Python and I want to use it to compute an FFT spectrum of observation data and extract the harmonics that retain the diurnal cycle. My data has dimensions (time: 262992 hours; site: 46 stations): hourly air_temperature at 46 stations from 1983 to 2012. I have been able to plot the time series of a selected station (see the code below). Now I want to compute the FFT spectrum of a selected station (data[0], for instance) and extract the harmonics that retain the diurnal cycle. How do I do it?
Python code:
import iris
from iris.pandas import as_data_frame
from netCDF4 import Dataset
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#Load the data
t_air = iris.load_cube('/home/amadou/anaconda/RainCell_python_mw_link_training/all_new.nc', 'air_temperature')
#Have a quick look
print(t_air)
#create a pandas data frame with each column representing a site
data=as_data_frame(t_air[:,0])
for i in range(0,t_air.coord('latitude').shape[0]):
    data[str(i)]=as_data_frame(t_air[:,i])
data.head()
#create a metadata list with lat/lon (each index represents the corresponding data frame column)
metadata=[]
for i in range(0,t_air.coord('latitude').shape[0]):
    lat = t_air[:,i].coord('latitude').points[0]
    lon = t_air[:,i].coord('longitude').points[0]
    metadata.append([lat,lon])
# now you do the pandas stuff (plotting,resampling,…)
# Example: daily resampling of the first site, April to May 2012
ax = data[0].resample('D')['2012-04':'2012-05'].plot(figsize=(10,5))
ax.legend([metadata[0]])
Can anybody help me, from here, to compute an FFT spectrum of this data and extract the harmonics that retain the diurnal cycle?
Best
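A minimal sketch of one way to approach this with numpy.fft, on a synthetic hourly series rather than the real station data (the signal, its amplitudes and the choice of three harmonics are all assumptions). The idea is to take the real FFT, keep only the bins at 1, 2 and 3 cycles per day, and invert the transform to get the diurnal cycle.

```python
import numpy as np

# Synthetic hourly temperatures for 60 days (stand-in for one station)
hours = np.arange(24 * 60)
rng = np.random.default_rng(0)
temp = (25
        + 5.0 * np.cos(2 * np.pi * hours / 24)   # diurnal cycle
        + 1.5 * np.cos(2 * np.pi * hours / 12)   # semi-diurnal harmonic
        + rng.normal(0, 0.5, hours.size))        # noise

# Real FFT of the anomalies; d=1/24 day per sample gives cycles per day
spectrum = np.fft.rfft(temp - temp.mean())
freqs = np.fft.rfftfreq(hours.size, d=1 / 24)

# Frequency with the largest amplitude
peak = freqs[np.argmax(np.abs(spectrum))]

# Keep only the diurnal frequency and its first two harmonics
keep = np.isclose(freqs[:, None], [1.0, 2.0, 3.0]).any(axis=1)
filtered = np.where(keep, spectrum, 0)

# Back to the time domain: the reconstructed diurnal cycle
diurnal = np.fft.irfft(filtered, n=hours.size) + temp.mean()
print(peak)
```

This relies on the record covering a whole number of days, so that 1, 2 and 3 cycles per day fall exactly on FFT bins. The 262992 hours in the question are exactly 10958 days, so that condition holds for the full series, provided there are no missing values in it.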