Am I making a mistake, or is this a fault in pandas or perhaps Quandl?
I'm pretty sure the problem is in the following line:
quandl_gold_fridays['Round'] = quandl_gold['Close'].apply(lambda x: int(float(x)/23))
Notice that you used quandl_gold on the right-hand side instead of quandl_gold_fridays. The date corresponding to your NaN is 2014-04-18, which was Good Friday (i.e. markets were closed). quandl_gold has no row for that date, so when pandas aligns the computed values on the index during assignment, that row of quandl_gold_fridays is left as NaN.
To illustrate, try adding a cell with the following code:
import pandas as pd
x = pd.merge(left=quandl_gold.loc[:, ['Close']],
right=quandl_gold_fridays.loc[:, ['Close','Round']],
left_index=True,
right_index=True,
how='right')
x.tail(10)
You'll notice the NaN in the "Close_x" column.
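The same effect can be reproduced on a small scale. The sketch below uses two made-up frames (daily and fridays are hypothetical stand-ins for quandl_gold and quandl_gold_fridays, and the prices are invented) to show how a column computed from a different frame is aligned on the index and ends up as NaN where a date is missing:

```python
import pandas as pd

# Hypothetical daily data: no row for the 2014-04-18 market holiday.
daily = pd.DataFrame(
    {"Close": [1300.0, 1295.0, 1290.0]},
    index=pd.to_datetime(["2014-04-16", "2014-04-17", "2014-04-21"]),
)

# Hypothetical weekly frame that does contain a row for that Friday.
fridays = pd.DataFrame({"Close": [1295.0]}, index=pd.to_datetime(["2014-04-18"]))

# Computed from the other frame: pandas aligns on the index, finds no
# 2014-04-18 entry in the result, and fills the new column with NaN.
fridays["Round_wrong"] = daily["Close"].apply(lambda x: int(float(x) / 23))

# Computed from the frame being assigned to: every row has a value.
fridays["Round"] = fridays["Close"].apply(lambda x: int(float(x) / 23))

print(fridays)
```

Using quandl_gold_fridays['Close'] on the right-hand side of the original line should avoid the NaN in the same way.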
How do I set the data within a dataframe to be left-aligned?
I'm using python 2.7.13.
This question has been asked before but the accepted answer didn't even work.
The answer given was:
df.style.set_properties(**{'text-align': 'left'})
It doesn't work; my data is still right-aligned.
Does anyone know how? Do I have to import any modules other than pandas?
Case 1: Styling to print as html
df.style.set_properties returns an object of type pandas.io.formats.style.Styler
type(df.style.set_properties(**{'text-align': 'left'}))
Out[37]: pandas.io.formats.style.Styler
This object is meant to be rendered as an HTML string, as follows:
s = df.style.set_properties(**{'text-align': 'left'})
s.render()
Then you can use the result of s.render() in your HTML file. (In recent pandas releases, Styler.render() has been replaced by Styler.to_html().)
Case 2: Align data left as a dataframe
If you are looking for a way to strip leading whitespace from the values in your DataFrame while keeping the data in a DataFrame, here's an example of how to do that:
df = pd.DataFrame([[' a',' b'],[' c', ' d']], columns=list('AB'))
df = df.stack().str.lstrip().unstack()
output:
A B
0 a b
1 c d
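As an alternative sketch, assuming every column holds strings, applying str.lstrip column by column should give the same result without the stack/unstack round trip:

```python
import pandas as pd

df = pd.DataFrame([[" a", " b"], [" c", " d"]], columns=list("AB"))

# Strip leading whitespace from each column in turn; this assumes all
# columns contain strings, since .str only works on string data.
stripped = df.apply(lambda col: col.str.lstrip())

print(stripped)
```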
I have the below-mentioned dataset.
https://docs.google.com/spreadsheets/d/13GCAXHp5BU4vYU6PdX40wM-Jhp--LeRd9C5oUurbVY4/edit#gid=0
I want to find the cumulative sales for different stores in one column. For example, for store 2106 the cumulative sales figure should be 176,849.
I'm using the following code:
df = df.groupby('storenumber')['sales'].cumsum()
but I am not getting the correct result.
Can someone help?
Here's what I did to solve this problem.
import pandas as pd
df = pd.read_csv('data.csv')  # load the data frame from the CSV file
You won't be able to run numerical operations on your data as it is, because the Sale (Dollars) column in df is not stored as a numerical type. The following code removes the dollar signs and thousands separators from the Sale (Dollars) and Suggested answer columns and converts them to float.
df[df.columns[2:]] = df[df.columns[2:]].replace(r'[\$,]', '', regex=True).astype(float)
Then I used the following code to get the total (cumulative) sales for each unique Store Number.
cum_sales_by_store_number = df.groupby('Store Number')['Sale (Dollars)'].sum()
cum_sales_by_store_number = pd.DataFrame(cum_sales_by_store_number)
Output for cum_sales_by_store_number:
Sale (Dollars)
Store Number
2106 176849.97
I hope this answers your question. Happy coding!
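For reference, here is a minimal sketch of the difference between a per-store total and a per-row running total. The column names follow the answer above, but the rows are invented for illustration and the "Running Total" column name is an assumption:

```python
import pandas as pd

# Invented sample rows; the real data lives in the linked spreadsheet.
df = pd.DataFrame({
    "Store Number": [2106, 2106, 2130, 2106],
    "Sale (Dollars)": ["$1,000.50", "$200.00", "$50.25", "$10.00"],
})

# Strip "$" and "," so the column can be converted to float.
df["Sale (Dollars)"] = (
    df["Sale (Dollars)"].replace(r"[\$,]", "", regex=True).astype(float)
)

# One total per store (what the answer above computes).
totals = df.groupby("Store Number")["Sale (Dollars)"].sum()

# A running total per store, one value per original row
# (what cumsum() in the question returns).
df["Running Total"] = df.groupby("Store Number")["Sale (Dollars)"].cumsum()

print(totals)
print(df)
```

If a single figure per store is wanted, sum() is the one to use; cumsum() only reaches that figure on each store's last row.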
I have a file like so that I am reading from excel:
Year Month Day
1 2 1
2 1 2
I want to specify the column width that Excel recognizes. I would like to do it in pandas, but I don't see an option. I have tried to do it with the StyleFrame module.
This is my code:
from StyleFrame import StyleFrame
import pandas as pd
df=pd.read_excel(r'P:\File.xlsx')
excel_writer = StyleFrame.ExcelWriter(r'P:\File.xlsx')
sf=StyleFrame(df)
sf=sf.set_column_width(columns=['Year', 'Month'], width=4.0)
sf=sf.set_column_width(columns=['Day'], width=6.00)
sf=sf.to_excel(excel_writer=excel_writer)
excel_writer.save()
but the formatting isn't saved when I open the new file.
Is there a way to do it in pandas? I would even take a pure python solution to this, pretty much anything that works.
As for your question on how to remove the headers, you can simply pass header=False to to_excel:
sf.to_excel(excel_writer=excel_writer, header=False)
Note that this will still result in the first line of the table being bold.
If you don't want that behavior, you can update to version 0.1.6, which I just released.
I have a script that processes an Excel file. The department that sends it switched to a new system to generate it, and my script stopped working.
I suddenly got the error Can only use .str accessor with string values, which use np.object_ dtype in pandas for the following line of code:
df['DATE'] = df['Date'].str.replace(r'[^a-zA-Z0-9\._/-]', '')
I checked the type of the date columns in the file from the old system (dtype: object) vs the file from the new system (dtype: datetime64[ns]).
How do I change the date format to something my script will understand?
I saw this answer but my knowledge about date formats isn't this granular.
You can use the apply function on the DataFrame column to convert it to strings. For example:
df['DATE'] = df['Date'].apply(lambda x: x.strftime('%Y-%m-%d'))
No extra import is needed for this: each value in a datetime64 column is a pandas Timestamp, which already has a strftime method.
apply() evaluates one cell at a time and applies the formatting specified in the lambda function.
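A vectorised variant of the same idea is sketched below. The helper name dates_as_text and the sample dates are hypothetical; the assumption is that the old system delivers text and the new one delivers datetime64 values, as described in the question:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def dates_as_text(col, fmt="%Y-%m-%d"):
    """Return col as strings, formatting it first if it holds datetimes."""
    if is_datetime64_any_dtype(col):
        # .dt.strftime formats the whole column at once.
        return col.dt.strftime(fmt)
    return col.astype(str)


# Hypothetical examples of the two file variants.
new_file = pd.DataFrame({"Date": pd.to_datetime(["2019-01-05", "2019-02-10"])})
old_file = pd.DataFrame({"Date": ["2019-01-05", "2019-02-10"]})

new_text = dates_as_text(new_file["Date"])
old_text = dates_as_text(old_file["Date"])

print(new_text.tolist())
print(old_text.tolist())
```

After this step the column holds strings in both cases, so the existing .str.replace call should work on either file.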
pd.to_datetime returns a Series of datetime64 dtype, as described here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
df['DATE'] = df['Date'].dt.date
or this:
df['Date'].map(datetime.datetime.date)
You can use pd.to_datetime
df['DATE'] = pd.to_datetime(df['DATE'])
I'm trying to do basic interpolation of position data at 60 Hz (~16 ms) intervals. When I try to use pandas 0.14 interpolation over the DataFrame, it tells me I only have NaNs in my data set (not true). When I try to run it over individual series pulled from the DataFrame, it returns the same series without the NaNs filled in. I've tried setting the indices to integers, using different methods, and fiddling with the axis and limit parameters of the interpolation function; no dice. What am I doing wrong?
df.head(5) :
x y ms
0 20.5815 14.1821 333.3333
1 NaN NaN 350
2 20.6112 14.2013 366.6667
3 NaN NaN 383.3333
4 20.5349 14.2232 400
df = df.set_index(df.ms) # set indices to milliseconds
When I try running
df.interpolate(method='values')
I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-462-cb0f1f01eb84> in <module>()
12
13
---> 14 df.interpolate(method='values')
15
16
/Users/jsb/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in interpolate(self, method, axis, limit, inplace, downcast, **kwargs)
2511
2512 if self._data.get_dtype_counts().get('object') == len(self.T):
-> 2513 raise TypeError("Cannot interpolate with all NaNs.")
2514
2515 # create/use the index
TypeError: Cannot interpolate with all NaNs.
I've also tried running over individual series, which only return what I put in:
temp = df.x
temp.interpolate(method='values')
333.333333 20.5815
350.000000 NaN
366.666667 20.6112
383.333333 NaN
400.000000 20.5349
Name: x, dtype: object
EDIT :
Props to Jeff for inspiring the solution.
Adding
df[['x','y','ms']] = df[['x','y','ms']].astype(float)
before calling
df.interpolate(method='values')
did the trick. The columns had object dtype, which is why interpolate refused to work on them.
Following your edit (props to Jeff for inspiring the solution), adding
df = df.astype(float)
before calling
df.interpolate(method='values')
did the trick for me as well. Unless you're sub-selecting a set of columns, you don't need to specify the columns.
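A minimal sketch of that fix, using the five rows shown in the question. The object dtype is forced here on purpose to mimic the problem; how the original data ended up as object is not known:

```python
import numpy as np
import pandas as pd

# Rebuild the sample with object dtype, as in the question's output.
raw = pd.DataFrame(
    {
        "x": [20.5815, np.nan, 20.6112, np.nan, 20.5349],
        "y": [14.1821, np.nan, 14.2013, np.nan, 14.2232],
        "ms": [333.3333, 350, 366.6667, 383.3333, 400],
    },
    dtype=object,
)

# Convert everything to float first, then index by milliseconds.
numeric = raw.astype(float).set_index("ms")

# method='values' interpolates using the numeric index values.
filled = numeric.interpolate(method="values")

print(filled)
```

Checking df.dtypes before interpolating is a quick way to spot this kind of problem.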
I'm not able to reproduce the error (see below for a copy/paste-able example). Can you make sure the data you show is actually representative of your data?
In [137]: from StringIO import StringIO
In [138]: df = pd.read_csv(StringIO(""" x y ms
...: 0 20.5815 14.1821 333.3333
...: 1 NaN NaN 350
...: 2 20.6112 14.2013 366.6667
...: 3 NaN NaN 383.3333
...: 4 20.5349 14.2232 400"""), delim_whitespace=True)
In [140]: df = df.set_index(df.ms)
In [142]: df.interpolate(method='values')
Out[142]:
x y ms
ms
333.3333 20.58150 14.18210 333.3333
350.0000 20.59635 14.19170 350.0000
366.6667 20.61120 14.20130 366.6667
383.3333 20.57305 14.21225 383.3333
400.0000 20.53490 14.22320 400.0000