I've got a pandas Series and want to plot a stacked histogram, using a filter to create two new (smaller) Series. This is some dummy data, but my actual Series has a (non-unique) DatetimeIndex.
d =
0 2520
1 0
2 1083
3 0
4 0
5 1260
6 960
7 13
8 300
9 433
10 1860
11 1920
12 13
13 0
14 2460
15 2472
16 12
17 60
18 2832
19 12
d1 = d[0:19:2]
d2 = d[1:17:3]
d1.hist(color = 'r', label = 'foo')
d2.hist(label = 'bar')
However, the labels don't show up. I've looked at the pandas docs, which show everything working when plotting from different columns of a dataframe, but in my case I can't combine these into a single dataframe since they have different indices (and lengths). Any suggestions?
Set the Series names (these are Series, so they have a name rather than columns) and plot them one by one onto a single subplot, turning the legend on:
import matplotlib.pyplot as plt
d1.name = 'foo'
d2.name = 'bar'
f = plt.figure()
_ax = f.add_subplot(111)
d1.plot(color='r', kind='hist', stacked=True, ax=_ax, legend=True)
d2.plot(kind='hist', stacked=True, ax=_ax, legend=True)
Following @iayork's comments above, the following works:
import pandas as pd
df = pd.concat([d1, d2], axis=1)
df.plot(kind='hist')
Note: df.hist() plots individual histograms for each column within the dataframe
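If you want a genuinely stacked (rather than overlaid) histogram, matplotlib's plt.hist accepts a list of arrays directly; here is a minimal sketch, assuming d1 and d2 from above:
import matplotlib.pyplot as plt
# Stack the two filtered Series into one histogram; label feeds the legend
plt.hist([d1.values, d2.values], stacked=True, color=['r', 'b'], label=['foo', 'bar'])
plt.legend()
plt.show()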
Here is a link to my data: https://docs.google.com/document/d/1oIiwiucRkXBkxkdbrgFyPt6fwWtX4DJG4nbRM309M20/edit?usp=sharing
My problem is that when I run this in a Jupyter Notebook, I get just the USA map with the colour bar and the lakes in blue. No data appears on the map, neither the labels nor the actual z data.
Here is my header:
import plotly.graph_objs as go
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
%matplotlib inline
init_notebook_mode(connected=True) # For Plotly For Notebooks
cf.go_offline() # For Cufflinks For offline use
Here is my data and layout:
data = dict(type='choropleth',
            locations=gb_state['state'],
            locationmode='USA-states',
            colorscale='Portland',
            text=gb_state['state'],
            z=gb_state['beer'],
            colorbar={'title': "Styles of beer"})
data
layout = dict(title='Styles of beer by state',
              geo=dict(scope='usa',
                       showlakes=True,
                       lakecolor='rgb(85,173,240)'))
layout
and here is how I fire off the command:
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap)
Any help, guidelines, or pointers would be appreciated.
Here is a minified working example which will give you the desired output.
import pandas as pd
import io
import plotly.graph_objs as go
from plotly.offline import plot
txt = """ state abv ibu id beer style ounces brewery city
0 AK 25 17 25 25.0 25.0 25 25 25
1 AL 10 9 10 10.0 10.0 10 10 10
2 AR 5 1 5 5.0 5.0 5 5 5
3 AZ 44 24 47 47.0 46.0 47 47 47
4 CA 182 135 183 183.0 183.0 183 183 183
5 CO 250 146 265 265.0 263.0 265 265 265
6 CT 27 6 27 27.0 27.0 27 27 27
7 DC 8 4 8 8.0 8.0 8 8 8
8 DE 1 1 2 2.0 2.0 2 2 2
9 FL 56 37 58 58.0 58.0 58 58 58
10 GA 16 7 16 16.0 16.0 16 16 16
"""
gb_state = pd.read_csv(io.StringIO(txt), delim_whitespace=True)
data = dict(type='choropleth',
            locations=gb_state['state'],
            locationmode='USA-states',
            text=gb_state['state'],
            z=gb_state['beer'])
layout = dict(geo=dict(scope='usa',
                       showlakes=False))
choromap = go.Figure(data=[data], layout=layout)
plot(choromap)
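If you want the figure rendered inline in the notebook, as in the question, initialise offline mode and use iplot instead of plot (this assumes the same choromap built above):
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)  # once per notebook session
iplot(choromap)                     # renders the figure inline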
I have to compare a column with all the other columns in the dataframe. The column to compare against is at position 4, so I write df.iloc[x, 4] to take its values. Then I have to multiply those values by the values in the next column (for example df.iloc[x, 5]), create a new column in the dataframe, and save the results. Then I have to repeat this procedure up to the last column (the original dataframe has 43 columns, so the end is df.iloc[x, 43]).
How can I do this in Python?
If possible, can you give some examples? I tried to put my code in the post, but I'm not good with my new phone.
I think you can use eq to compare the filtered DataFrame with column E at position 4:
import pandas as pd

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,8,9],
                   'G':[1,3,5],
                   'H':[5,3,6],
                   'I':[7,4,3]})
print (df)
A B C D E F G H I
0 1 4 7 1 5 7 1 5 7
1 2 5 8 3 3 8 3 3 4
2 3 6 9 5 6 9 5 6 3
print (df.iloc[:,5:].eq(df.iloc[:,4], axis=0))
F G H I
0 False False True False
1 False True True False
2 False False True False
If you need to multiply by the column at position 4, use mul:
print (df.iloc[:,5:].mul(df.iloc[:,4], axis=0))
F G H I
0 35 5 25 35
1 24 9 9 12
2 54 30 36 18
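To save the products back into the dataframe as new columns, as the question asks, one option is a sketch like this (the _x_E suffix is my own naming choice):
# Multiply every column after position 4 by column E, then join the
# products back onto df under suffixed column names
prod = df.iloc[:, 5:].mul(df.iloc[:, 4], axis=0)
df = df.join(prod.add_suffix('_x_E'))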
Or multiply the overlapping column ranges (note that mul aligns on column labels, so here each column from F onward is multiplied by itself, while E is kept unchanged via fill_value=1):
print (df.iloc[:,4:].mul(df.iloc[:,5:], axis=0, fill_value=1))
E F G H I
0 5.0 49 1 25 49
1 3.0 64 9 9 16
2 6.0 81 25 36 9
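If the goal is really each column times the next one (column 4 × column 5, column 5 × column 6, and so on), label alignment gets in the way; a sketch that bypasses it with .values:
# Columns at positions 4..n-2 times the columns at positions 5..n-1;
# .values strips the labels so the multiplication is purely positional
shifted = df.iloc[:, 4:-1].mul(df.iloc[:, 5:].values)
print (shifted)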
I have two columns in my pandas dataframe.
I want to fill the missing values of the Credit_History column (dtype: int64) with the values of the Loan_Status column (dtype: int64).
You can try fillna or combine_first:
df.Credit_History = df.Credit_History.fillna(df.Loan_Status)
Or:
df.Credit_History = df.Credit_History.combine_first(df.Loan_Status)
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Credit_History':[1,2,np.nan, np.nan],
                   'Loan_Status':[4,5,6,8]})
print (df)
Credit_History Loan_Status
0 1.0 4
1 2.0 5
2 NaN 6
3 NaN 8
df.Credit_History = df.Credit_History.combine_first(df.Loan_Status)
print (df)
Credit_History Loan_Status
0 1.0 4
1 2.0 5
2 6.0 6
3 8.0 8
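Note that the NaN values force Credit_History to float64; if you need int64 back after filling, a small sketch:
# Fill the gaps, then cast back to integer once no NaN remains
df.Credit_History = df.Credit_History.fillna(df.Loan_Status).astype('int64')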
I have a .csv file named fileOne.csv that contains many unnecessary strings and records. I want to delete the unnecessary records/rows and strings based on multiple conditions/criteria using a Python or R script, and save the remaining records into a new .csv file named resultFile.csv.
What I want to do is as follows:
Delete the first column.
Split column BB into two columns named a_id and b_id: split the value at the _ (underscore), with the left side going to a_id and the right side to b_id.
Keep only records that have the .csv file extension in the BB column but do not contain No Bi in the CC column.
Assign a new name to each of the columns.
Delete the records that contain strings like less in the CC column.
Trim all other unnecessary strings from the records.
Delete the remaining fields of each row after the "Mi" value in that row.
My fileOne.csv is as follows:
AA BB CC DD EE FF GG
1 1_1.csv (=0 =10" 27" =57 "Mi"
0.97 0.9 0.8 NaN 0.9 od 0.2
2 1_3.csv (=0 =10" 27" "Mi" 0.5
0.97 0.5 0.8 NaN 0.9 od 0.4
3 1_6.csv (=0 =10" "Mi" =53 cnt
0.97 0.9 0.8 NaN 0.9 od 0.6
4 2_6.csv No Bi 000 000 000 000
5 2_8.csv No Bi 000 000 000 000
6 6_9.csv less 000 000 000 000
7 7_9.csv s(=0 =26" =46" "Mi" 121
My 1st expected results files would be as follows:
a_id b_id CC DD EE FF GG
1 1 0 10 27 57 Mi
1 3 0 10 27 Mi 0.5
1 6 0 10 Mi 53 cnt
7 9 0 26 46 Mi 121
My final expected results files would be as follows:
a_id b_id CC DD EE FF GG
1 1 0 10 27 57
1 3 0 10 27
1 6 0 10
7 9 0 26 46
This can be achieved with the following Python script:
import csv
import re
import string

output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']

# Build Python 2 translation tables for stripping everything except digits
sanitise_table = string.maketrans("", "")
nodigits_table = sanitise_table.translate(sanitise_table, string.digits)

def sanitise_cell(cell):
    return cell.translate(sanitise_table, nodigits_table)    # Keep digits

with open('fileOne.csv') as f_input, open('resultFile.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    input_header = next(f_input)    # Skip the original header
    csv_output.writerow(output_header)

    for row in csv_input:
        # Match file names such as 1_1.csv, capturing the two numbers
        bb = re.match(r'(\d+)_(\d+)\.csv', row[1])
        if bb and row[2] not in ['No Bi', 'less']:
            # Blank out the 'Mi' cell and every cell after it, if present
            try:
                mi = row.index('Mi')
                row[:] = row[:mi] + [''] * (len(row) - mi)
            except ValueError:
                pass
            row[:] = [sanitise_cell(col) for col in row]
            row[0] = bb.group(1)
            row[1] = bb.group(2)
            csv_output.writerow(row)
To simply blank out the Mi cell and the cells after it in an existing file, the following can be used:
import csv

with open('input.csv') as f_input, open('output.csv', 'wb') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    for row in csv_input:
        try:
            mi = row.index('Mi')
            row[:] = row[:mi] + [''] * (len(row) - mi)
        except ValueError:
            pass
        csv_output.writerow(row)
Tested using Python 2.7.9
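If you are on Python 3, string.maketrans no longer exists and the output file should be opened in text mode with newline='' instead of 'wb'; here is a sketch of an equivalent digit-keeping sanitiser under those assumptions:
import string

# Delete every 8-bit character that is not a digit
_delete = ''.join(c for c in map(chr, range(256)) if c not in string.digits)
nodigits_table = str.maketrans('', '', _delete)

def sanitise_cell(cell):
    return cell.translate(nodigits_table)    # Keep digits only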
I have a huge GPS dataset in a csv file. It is something like this:
12,1999-09-08 12:12:12, 116.3426, 32.5678
12,1999-09-08 12:12:17, 116.34234, 32.5678
.
.
.
where each column is in the form of
id, timestamp, longitude, latitude
Now, I am using pandas and importing the file into a dataframe, I have so far written this code.
import pandas as pd
import numpy as np
# Import the columns, making the timestamp values the row index
df = pd.read_csv('/home/abc/Downloads/..../366.txt', delimiter=',',
                 index_col=1, names=['id','longitude','latitude'])
# Remove repeated entries caused by GPS errors
df = df.groupby(df.index).first()
Sometimes there will be 2 or 3 entries for the same timestamp, and those should be removed.
I get something like this
id longitude latitude
1999-09-08 12:12:12 12 116.3426 32.5678
1999-09-08 12:12:17 12 116.34234 32.5678
# and so on with redundant entries removed
Now I want rows which have the same latitude and longitude to be indexed serially,
i.e. the output I am picturing is:
id longitude latitude
0 1999-09-08 12:12:12 12 116.3426 32.5678
1 1999-09-08 12:12:17 12 116.34234 32.5678
2 1999-09-08 12:12:22 12 116.342341 32.5678
1999-09-08 12:12:27 12 116.342341 32.5678
1999-09-08 12:12:32 12 116.342341 32.5678
....
1999-09-08 12:19:37 12 116.342341 32.5678
3 1999-09-08 12:19:42 12 116.34234 32.56123
and so on..
i.e., rows with the same latitude and longitude values are to be indexed serially. How can I achieve that? I am a beginner in pandas, so I don't know much about it. Please help!
You can leverage DataFrame.duplicated and do some arithmetic with it:
idx = df.duplicated(['longitude', 'latitude'])
idx *= -1
idx += 1
idx.iloc[0] = 0
df = df.set_index(idx.cumsum(), append=True).swaplevel(0, 1)
How the code works
Starting with the df you get:
In [215]: df
Out[215]:
id longitude latitude
stamp
1999-09-08T12:12:12 12 116.342600 32.56780
1999-09-08T12:12:17 12 116.342340 32.56780
1999-09-08T12:12:22 12 116.342341 32.56780
1999-09-08T12:12:27 12 116.342341 32.56780
1999-09-08T12:12:32 12 116.342341 32.56780
1999-09-08T12:19:37 12 116.342341 32.56780
1999-09-08T12:19:42 12 116.342340 32.56123
First, flag the rows whose (longitude, latitude) pair duplicates an earlier row (note that duplicated flags any repeat, not only consecutive ones; in this data the repeats happen to be consecutive):
In [216]: idx = df.duplicated(['longitude', 'latitude'])
In [217]: idx
Out[217]:
stamp
1999-09-08T12:12:12 False
1999-09-08T12:12:17 False
1999-09-08T12:12:22 False
1999-09-08T12:12:27 True
1999-09-08T12:12:32 True
1999-09-08T12:19:37 True
1999-09-08T12:19:42 False
We will later use cumsum to create a zero-based index that does not increment on duplicates. First, a little arithmetic turns the boolean mask into zeros on duplicated rows and ones for the others:
In [218]: idx *= -1
In [219]: idx += 1
In [220]: idx
Out[220]:
stamp
1999-09-08T12:12:12 1
1999-09-08T12:12:17 1
1999-09-08T12:12:22 1
1999-09-08T12:12:27 0
1999-09-08T12:12:32 0
1999-09-08T12:19:37 0
1999-09-08T12:19:42 1
As we want a zero-based index, we set the first cell to 0, and we append the cumulative sum of that column to the index of df to create the MultiIndex:
In [221]: idx.iloc[0] = 0
In [222]: df = df.set_index(idx.cumsum(), append=True)
By default, set_index adds the new index at a lower (inner) level than the existing one. We finish by swapping the levels so that our counter comes before the timestamps:
In [223]: df = df.swaplevel(0,1)
In [224]: df
Out[224]:
id longitude latitude
stamp
0 1999-09-08T12:12:12 12 116.342600 32.56780
1 1999-09-08T12:12:17 12 116.342340 32.56780
2 1999-09-08T12:12:22 12 116.342341 32.56780
1999-09-08T12:12:27 12 116.342341 32.56780
1999-09-08T12:12:32 12 116.342341 32.56780
1999-09-08T12:19:37 12 116.342341 32.56780
3 1999-09-08T12:19:42 12 116.342340 32.56123
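One caveat: duplicated flags a repeat of any earlier (longitude, latitude) pair, even after a gap, so a pair that reappears much later would be folded into its first group rather than starting a new one. If only consecutive runs should share an index, here is a shift-based sketch (my own variant, starting again from the deduplicated df before the MultiIndex is added):
# A row starts a new group when its coordinates differ from the previous row
coords = df[['longitude', 'latitude']]
new_group = (coords != coords.shift()).any(axis=1)
idx = new_group.cumsum() - 1    # zero-based run counter
df = df.set_index(idx, append=True).swaplevel(0, 1)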