pandas.DataFrame.from_csv(filename) seems to be converting my integer index into a date.
This is undesirable. How do I prevent this?
The code shown here is a toy version of a larger problem. In the larger problem, I am estimating and writing the parameters of statistical models for each zone for later use. I thought that by using a pandas DataFrame indexed by zone, I could easily read back the parameters. While pickle or some other format like JSON might solve this problem, I'd like to see a pandas solution... except pandas is converting the zone number to a date.
#!/usr/bin/python
import numpy as np
import pandas as pd

cache_file = "./mydata.csv"
zones = [1, 2, 3, 8, 9, 10]

def create():
    data = []
    for z in zones:
        info = {'m': int(10*np.random.rand()), 'n': int(10*np.random.rand())}
        info.update({'zone': z})
        data.append(info)
    df = pd.DataFrame(data, index=zones)
    print "about to write this data:"
    print df
    df.to_csv(cache_file)

def read():
    df = pd.DataFrame.from_csv(cache_file)
    print "read this data:"
    print df

create()
read()
Sample output:
about to write this data:
    m  n  zone
1   0  3     1
2   5  8     2
3   6  4     3
8   1  8     8
9   6  2     9
10  7  2    10
read this data:
            m  n  zone
2013-12-01  0  3     1
2013-12-02  5  8     2
2013-12-03  6  4     3
2013-12-08  1  8     8
2013-12-09  6  2     9
2013-12-10  7  2    10
The CSV file looks OK, so the problem seems to be in reading not creating.
mydata.csv
,m,n,zone
1,0,3,1
2,5,8,2
3,6,4,3
8,1,8,8
9,6,2,9
10,7,2,10
I suppose this might be useful:
pd.__version__
0.12.0
The Python version is 2.7.5+.
I want to record the zone as an index so I can easily pull out the corresponding
parameters later. How do I keep pandas.DataFrame.from_csv() from turning it into a date?
Reading the docs for pandas.DataFrame.from_csv: the parse_dates argument defaults to True. Set it to False.
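A minimal sketch of the corrected reader (same toy script and Python 2 print syntax as above):

def read():
    # parse_dates=False stops from_csv from coercing the integer index to dates
    df = pd.DataFrame.from_csv(cache_file, parse_dates=False)
    print "read this data:"
    print df

Alternatively, pd.read_csv(cache_file, index_col=0) does not parse dates unless asked to; from_csv was later deprecated in its favor.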
Related
I did some research but I had difficulty finding an answer.
I am using Python 2.7 and pandas so far, but I am still learning.
I have two CSVs; let's say the alphabet A-Z in one and the digits 0-100 in the second one.
I want to merge the two files to have A0 through A100, up through Z.
For information, the two files contain DNA sequences, so I believe they are strings.
I tried to create arrays with numpy and build a matrix, but to no avail.
here is a preview of the files:
barcode
0 GGAAGAA
1 CCAAGAA
2 GAGAGAA
3 AGGAGAA
4 TCGAGAA
5 CTGAGAA
6 CACAGAA
7 TGCAGAA
8 ACCAGAA
9 GTCAGAA
10 CGTAGAA
11 GCTAGAA
12 GAAGGAA
13 AGAGGAA
14 TCAGGAA
659
barcode
0 CGGAAGAA
1 GCGAAGAA
2 GGCAAGAA
3 GGAGAGAA
4 CCAGAGAA
5 GAGGAGAA
6 ACGGAGAA
7 CTGGAGAA
8 CACGAGAA
9 AGCGAGAA
10 TCCGAGAA
11 GTCGAGAA
12 CGTGAGAA
13 GCTGAGAA
14 CGACAGAA
1995
Here is the way I found to do it; there might be a sexier way (one is sketched after the code):
index = pd.MultiIndex.from_product([df8.barcode, df7.barcode], names=["df8", "df7"])
df = pd.DataFrame(index=index).reset_index()

def concat_BC(x):
    # concatenate the two sequences into one new column
    return str(x["df8"]) + str(x["df7"])

df["BC"] = df.apply(concat_BC, axis=1)
– Stephane Chiron
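For what it's worth, the row-wise apply can be replaced by vectorized string concatenation, which is typically much faster on long barcode lists (a sketch, assuming the df built above):

# vectorized alternative to the apply-based concatenation
df["BC"] = df["df8"].astype(str) + df["df7"].astype(str)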
I have a dataframe as follows; the index is a datetime (every Friday of the week).
            begin  close
date
2014-1-10     1.0    2.5
2014-1-17     2.6    2.6
...
2016-12-30    3.5    3.8
2017-6-16     4.5    4.7
I want to extract the previous 2 years of data, counting back from 2017-6-16. My code follows.
import datetime
from dateutil.relativedelta import relativedelta
df_index = df.index
df_index_test = df_index[-1] - relativedelta(years=2)
df_test = df[df_index_test:-1]
But it seems wrong, since the date df_index_test may not be in the dataframe.
Thanks!
You need boolean indexing; instead of relativedelta it is possible to use DateOffset:
df_test = df[df.index >= df_index_test]
Sample:
rng = pd.date_range('2001-04-03', periods=10, freq='15M')
df = pd.DataFrame({'a': range(10)}, index=rng)
print (df)
a
2001-04-30 0
2002-07-31 1
2003-10-31 2
2005-01-31 3
2006-04-30 4
2007-07-31 5
2008-10-31 6
2010-01-31 7
2011-04-30 8
2012-07-31 9
df_test = df[df.index >= df.index[-1] - pd.offsets.DateOffset(years=2)]
print (df_test)
a
2011-04-30 8
2012-07-31 9
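If the index is sorted, a near-equivalent shortcut is DataFrame.last, which selects the tail of a time series by offset; note it keeps rows strictly after the cutoff, and it was deprecated in later pandas versions (a sketch on the same df):

# shortcut for a sorted DatetimeIndex; accepts a DateOffset directly
df_test = df.last(pd.DateOffset(years=2))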
I am trying to count occurrences of 'Stage' per project. I used np.size as the aggfunc, but it returns the number of occurrences including the Project column, so my count is doubled: if the expected count is 3, it returns 6.
I used the code below:
df = pd.pivot_table(data_frame, index=['Project'], columns=['Stage'], aggfunc=np.size, fill_value=0)
You need the aggregate function len:
print (data_frame)
Project Stage
0 an ip
1 cfc pe
2 an ip
3 ap pe
4 cfc pe
5 an ip
6 cfc ip
df = pd.pivot_table(data_frame,
                    index='Project',
                    columns='Stage',
                    aggfunc=len,
                    fill_value=0)
print (df)
Stage ip pe
Project
an 3 0
ap 0 1
cfc 1 2
Another solution with size:
df = pd.pivot_table(data_frame,
                    index='Project',
                    columns='Stage',
                    aggfunc='size',
                    fill_value=0)
print (df)
Stage ip pe
Project
an 3 0
ap 0 1
cfc 1 2
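For reference, the same table of counts can be built without pivot_table at all; pd.crosstab is made for exactly this kind of frequency table (a sketch on the data_frame above):

# frequency table of Project/Stage pairs, zero-filled by default
df = pd.crosstab(data_frame['Project'], data_frame['Stage'])

A groupby works too: data_frame.groupby(['Project', 'Stage']).size().unstack(fill_value=0).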
EDIT by comment:
import matplotlib.pyplot as plt
# ...all of the code from the answer above...
df.plot.bar()
plt.show()
Well hello everyone!
I want to create a pandas DataFrame called df. It must contain "Id" and "Feature" columns. Any idea on how to do it?
I have tried the following code, but... dictionaries are unordered and put the two columns in random order. I want "Id" as the first column and "Feature" as the second one.
Thank you in advance! Have a loooong weekend!
df = DataFrame({'Feature': X["Feature"],'Id': X["Id"] })
From the pandas docs: "If no columns are passed, the columns will be the sorted list of dict keys." I use this simple trick to arrange the columns: just add "1", "2", etc. to the beginning of your column names. For example:
>>> df1 = pd.DataFrame({"Id":[1,2,3],"Feature":[5,6,7]})
>>> df1
   Feature  Id
0        5   1
1        6   2
2        7   3
>>> df2 = pd.DataFrame({"1Id":[1,2,3],"2Feature":[5,6,7]})
>>> df2
   1Id  2Feature
0    1         5
1    2         6
2    3         7
>>> df2.columns = ["Id","Feature"]
>>> df2
   Id  Feature
0   1        5
1   2        6
2   3        7
Now you have the order you wanted for printing or saving the DataFrame.
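A less hacky route, for what it's worth: pd.DataFrame also accepts a columns argument that pins the column order directly, so no renaming is needed (a minimal sketch):

import pandas as pd

# columns= fixes the order regardless of dict key sorting
df = pd.DataFrame({"Id": [1, 2, 3], "Feature": [5, 6, 7]}, columns=["Id", "Feature"])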
Is this what you wanted?
import numpy as np
import pandas as pd

index = [1, 2]
df = pd.DataFrame(np.random.randn(2, 2), index=index, columns=('id', 'features'))
The resulting data frame:
>>> df['id']
1 0.254105
2 -0.132025
Name: id, dtype: float64
>>> df['features']
1 0.189972
2 2.262103
Name: features, dtype: float64
I have a CSV file; one of the columns' values is timestamps, but when I use numpy.genfromtxt it reads them as strings. My goal is to create a graph with a normal time format; I'd prefer seconds only.
This is the array that I get from the code below:
array([('0:00:00',), ('0:00:00.001000',), ('0:00:00.002000',),
       ('0:00:00.081000',), ('0:00:00.095000',), ('0:00:00.195000',),
       ('0:00:00.294000',), ...
This is my code:
col1 = numpy.genfromtxt("mycsv.csv", usecols=(1), delimiter=',', dtype=None, names=True)
The problem I am having is that the values are strings, but I need them in seconds (the microseconds can be ignored or not). How can I achieve that?
If you can, the best way of working with CSV files in Python is to use pandas; it takes care of this for you. I will assume the name of the time column is time; change it to whatever you use:
>>> import numpy as np
>>> import pandas as pd
>>>
>>> df = pd.read_csv('test.csv', parse_dates=[1]) # read time as date
>>> print(df)
test1 time test2 test3
0 5 2015-08-20 00:00:00.000 10 11.7
1 5 2015-08-20 00:00:00.001 11 11.6
2 5 2015-08-20 00:00:00.002 12 11.5
3 5 2015-08-20 00:00:00.081 13 11.4
4 5 2015-08-20 00:00:00.095 14 11.3
5 5 2015-08-20 00:00:00.195 15 11.2
6 5 2015-08-20 00:00:00.294 16 11.1
>>> df['time'] -= pd.datetime.now().date() # convert to timedelta
>>> print(df)
test1 time test2 test3
0 5 00:00:00 10 11.7
1 5 00:00:00.001000 11 11.6
2 5 00:00:00.002000 12 11.5
3 5 00:00:00.081000 13 11.4
4 5 00:00:00.095000 14 11.3
5 5 00:00:00.195000 15 11.2
6 5 00:00:00.294000 16 11.1
>>> df['time'] /= np.timedelta64(1,'s') # convert to seconds
>>> print(df)
test1 time test2 test3
0 5 0.000 10 11.7
1 5 0.001 11 11.6
2 5 0.002 12 11.5
3 5 0.081 13 11.4
4 5 0.095 14 11.3
5 5 0.195 15 11.2
6 5 0.294 16 11.1
You can work with pandas dataframes (what you have here) and series (what you would from getting a single column, such as df['time']) in most of the same ways as numpy arrays, including plotting. However, if you really, really need to convert it to a numpy array, it is as easy as arr = df['time'].values.
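For instance, plotting straight from the frame needs no conversion at all (a sketch, assuming the time and test2 columns from the sample above):

import matplotlib.pyplot as plt

# plot test2 against the time-in-seconds column built above
df.plot(x='time', y='test2')
plt.show()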
Use the datetime library:
import datetime

for x in array:
    # each element of the array is a one-tuple holding the timestamp string;
    # note '%H:%M:%S.%f' will not match entries without a fractional part (e.g. '0:00:00')
    timestamp = datetime.datetime.strptime(x[0], '%H:%M:%S.%f')
You can use a converter for the timestamp field.
For example, suppose times.dat contains:
time
0:00:00
0:00:00.001000
0:00:00.002000
0:00:00.081000
0:00:00.095000
0:00:00.195000
0:00:00.294000
Define a converter that converts a timestamp string into the number of seconds as a floating point value:
In [5]: def convert_timestamp(s):
...: h, m, s = [float(w) for w in s.split(':')]
...: return h*3600 + m*60 + s
...:
Then use the converter in genfromtxt:
In [21]: genfromtxt('times.dat', skiprows=1, converters={0: convert_timestamp})
Out[21]: array([ 0. , 0.001, 0.002, 0.081, 0.095, 0.195, 0.294])
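One caveat if you are on a newer numpy: the skiprows argument was removed from genfromtxt in favor of skip_header, so the call becomes genfromtxt('times.dat', skip_header=1, converters={0: convert_timestamp}).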