I have a CSV file in which one of the columns contains timestamps, but when I use numpy.genfromtxt it reads them in as strings. My goal is to create a graph, and for a normal time format I would prefer plain seconds.
This is the array that I get from the code below:
array([('0:00:00',), ('0:00:00.001000',), ('0:00:00.002000',),
('0:00:00.081000',), ('0:00:00.095000',), ('0:00:00.195000',),
('0:00:00.294000',), ...
This is my code:
col1 = numpy.genfromtxt("mycsv.csv",usecols=(1),delimiter=',',dtype=None, names=True)
The problem is that the values come back as strings, but I need them in seconds (the microseconds can be kept or ignored). How can I achieve that?
If you can, the best way to work with CSV files in Python is to use pandas; it takes care of this for you. I will assume the name of the time column is time; change it to whatever you use:
>>> import numpy as np
>>> import pandas as pd
>>>
>>> df = pd.read_csv('test.csv', parse_dates=[1]) # read time as date
>>> print(df)
test1 time test2 test3
0 5 2015-08-20 00:00:00.000 10 11.7
1 5 2015-08-20 00:00:00.001 11 11.6
2 5 2015-08-20 00:00:00.002 12 11.5
3 5 2015-08-20 00:00:00.081 13 11.4
4 5 2015-08-20 00:00:00.095 14 11.3
5 5 2015-08-20 00:00:00.195 15 11.2
6 5 2015-08-20 00:00:00.294 16 11.1
>>> df['time'] -= pd.datetime.now().date() # convert to timedelta
>>> print(df)
test1 time test2 test3
0 5 00:00:00 10 11.7
1 5 00:00:00.001000 11 11.6
2 5 00:00:00.002000 12 11.5
3 5 00:00:00.081000 13 11.4
4 5 00:00:00.095000 14 11.3
5 5 00:00:00.195000 15 11.2
6 5 00:00:00.294000 16 11.1
>>> df['time'] /= np.timedelta64(1,'s') # convert to seconds
>>> print(df)
test1 time test2 test3
0 5 0.000 10 11.7
1 5 0.001 11 11.6
2 5 0.002 12 11.5
3 5 0.081 13 11.4
4 5 0.095 14 11.3
5 5 0.195 15 11.2
6 5 0.294 16 11.1
You can work with pandas dataframes (what you have here) and series (what you would get from a single column, such as df['time']) in most of the same ways as numpy arrays, including plotting. However, if you really, really need to convert it to a numpy array, it is as easy as arr = df['time'].values.
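Since the original goal was a graph, here is a minimal plotting sketch, assuming matplotlib is installed and using the example column names (time, test2) from the sample above:
import matplotlib.pyplot as plt

# df['time'] now holds seconds as floats (after the conversions above)
plt.plot(df['time'], df['test2'])  # test2 is just one of the example columns
plt.xlabel('time (s)')
plt.show()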
Use the datetime library:
import datetime

for x in array:
    for y in x:  # it's not really obvious what the nesting of your array is here...
        timestamp = datetime.datetime.strptime(y, '%H:%M:%S.%f')
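To get plain seconds out of the parsed values (the original goal), one option is to subtract the default date that strptime fills in. This is just a sketch, and it assumes the timestamps stay below 24 hours:
import datetime

def to_seconds(ts):
    # strptime with no explicit date fills in 1900-01-01, so subtracting that
    # date leaves only the time-of-day part as a timedelta
    fmt = '%H:%M:%S.%f' if '.' in ts else '%H:%M:%S'
    t = datetime.datetime.strptime(ts, fmt)
    return (t - datetime.datetime(1900, 1, 1)).total_seconds()

print(to_seconds('0:00:00.294000'))  # 0.294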
You can use a converter for the timestamp field.
For example, suppose times.dat contains:
time
0:00:00
0:00:00.001000
0:00:00.002000
0:00:00.081000
0:00:00.095000
0:00:00.195000
0:00:00.294000
Define a converter that converts a timestamp string into the number of seconds as a floating point value:
In [5]: def convert_timestamp(s):
   ...:     h, m, s = [float(w) for w in s.split(':')]
   ...:     return h*3600 + m*60 + s
   ...:
Then use the converter in genfromtxt:
In [21]: genfromtxt('times.dat', skip_header=1, converters={0: convert_timestamp})
Out[21]: array([ 0. , 0.001, 0.002, 0.081, 0.095, 0.195, 0.294])
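Applied to the file from the question (a header row, with the timestamp in the second column), the call would look something like this; mycsv.csv and the column position are taken from the question:
import numpy as np

col1 = np.genfromtxt('mycsv.csv', delimiter=',', skip_header=1,
                     usecols=(1,), converters={1: convert_timestamp})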
I have a dataframe like the following; the index is a datetime (every Friday of the week).
begin close
date
2014-1-10 1.0 2.5
2014-1-17 2.6 2.6
........................
2016-12-30 3.5 3.8
2017-6-16 4.5 4.7
I want to extract the previous 2 years of data counting back from 2017-6-16. My code is as follows.
import datetime
from dateutil.relativedelta import relativedelta
df_index = df.index
df_index_test = df_index[-1] - relativedelta(years=2)
df_test = df[df_index_test:-1]
But it seems to be wrong, since the date df_index_test may not be present in the dataframe.
Thanks!
You need boolean indexing; instead of relativedelta it is possible to use DateOffset:
df_test = df[df.index >= df_index_test]
Sample:
rng = pd.date_range('2001-04-03', periods=10, freq='15M')
df = pd.DataFrame({'a': range(10)}, index=rng)
print (df)
a
2001-04-30 0
2002-07-31 1
2003-10-31 2
2005-01-31 3
2006-04-30 4
2007-07-31 5
2008-10-31 6
2010-01-31 7
2011-04-30 8
2012-07-31 9
df_test = df[df.index >= df.index[-1] - pd.offsets.DateOffset(years=2)]
print (df_test)
a
2011-04-30 8
2012-07-31 9
I have a data set of daily temperatures for which I want to calculate 20 year means. The data look like this:
1974 1 1 5.3 4.6 7.3 3.4
1974 1 2 3.3 7.2 4.5 6.5
...
2005 12 364 4.2 5.2 3.3 4.6
2005 12 365 3.1 5.5 2.6 6.8
There is no header in the file but the first column contains the year, the second column the month, and the third column the day of the year. The rest of the columns are temperature data.
I want to calculate the average temperature for each day over a period of 20 years. I thought the best way to do that would be to group the data by day and calculate the mean of each day for a specific range of years. Here is my code:
import pandas as pd
hist_fn = 'tmean_daily_1974_2005.txt'
twenty_year_fn = '20_yr_mean_1974_1993.txt'
start = 1974
end = 1993
hist_mean = pd.read_csv(hist_fn, sep='\s+', header=None)
# Limit dataframe to only the 20 years for which I want the mean calculated
interval_mean = hist_mean[(hist_mean[0]>=start) & (hist_mean[0]<=end)]
# Rename the first column to reflect what mean this file is displaying
interval_mean.iloc[:, 0] = ("%s-%s" % (start, end))
# Generate mean for each day spread across all the years in the dataframe
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
# Write multiyear mean to txt
interval_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)
The data set spans more than 20 years, and the method I used works for the first 20-year interval but gives me a (mostly) empty text file for any other range of years.
So when I use these inputs it works:
start = 1974
end = 1993
and it produces a file that looks like this:
1974-1993 1 1 4.33 5.25 6.84 3.67
1974-1993 1 2 7.23 6.22 5.65 6.23
...
1974-1993 12 364 5.12 4.34 5.21 2.16
1974-1993 12 365 4.81 5.95 3.56 6.78
but when I change the inputs to this:
start = 1975
end = 1994
it produces a .txt file with no temperatures:
1975-1994 1 1
1975-1994 1 2
...
1975-1994 12 364
1975-1994 12 365
I don't understand why this method works for the first 20 year interval but none of the subsequent intervals. Is it something to do with the way the data is organized or how it is being sliced?
Now to the problem you presented:
The strange behavior is due to the fact that pandas matches indices on assignment, and slicing preserves the original indices. That means that when setting
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
Note that interval_mean.groupby(2, as_index=False).mean() has indices 0, ..., ~364 (one row per day of the year, since as_index=False makes the groupby create a fresh integer index; otherwise the index would have been the day number). On the other hand, interval_mean keeps the original indices from hist_mean, meaning that for the first interval (the first 20 years) they run 0, ..., ~20*365, while for any later interval they start at 365 or higher and therefore no longer overlap the 0, ..., ~364 of the groupby result.
This is a bit confusing at first, but pandas offers great documentation about it, and people quickly discover why it is so useful.
I'll explain what happens with an example:
Assume we have the following DataFrame:
df = pd.DataFrame(np.reshape(np.random.randint(5, size=30), [-1,3]))
df
0 1 2
0 1 1 2
1 2 1 1
2 0 1 2
3 0 2 0
4 2 1 0
5 0 1 2
6 2 2 1
7 1 0 2
8 0 1 0
9 1 2 0
Note that the column names are 0,1,2 and the row names (the index) are 0, ..., 9.
When we perform the groupby we obtain
df.groupby(0, as_index=False).mean()
0 1 2
0 0 1.250000 1.000000
1 1 1.000000 1.333333
2 2 1.333333 0.666667
(The index happens to equal the column we grouped by simply because we drew numbers between 0 and 2.) Now, when we assign to df.loc, pandas replaces every cell with the corresponding cell of the right-hand side, if such a cell exists. Otherwise, it leaves NA.
df.loc[:,:] = df.groupby(0, as_index=False).mean()
df
0 1 2
0 0.0 1.250000 1.000000
1 1.0 1.000000 1.333333
2 2.0 1.333333 0.666667
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
And when you write NA to csv, it leaves the cell blank.
The last piece of the puzzle is why interval_mean kept the original indices: slicing (including boolean indexing) preserves them:
df[df[1] > 1]
0 1 2
3 0 2 0
6 2 2 1
9 1 2 0
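As for a fix, one straightforward option (a sketch only; the column numbers and variable names follow the question) is to build the output frame directly from the groupby result instead of assigning it back into the original, index-mismatched frame:
# group by month (column 1) and day of year (column 2) so both survive into the output
subset = hist_mean[(hist_mean[0] >= start) & (hist_mean[0] <= end)]
daily_mean = subset.groupby([1, 2], as_index=False).mean()
daily_mean = daily_mean.drop(0, axis=1)                   # drop the (meaningless) averaged year column
daily_mean.insert(0, 'period', '%s-%s' % (start, end))    # label the interval, e.g. 1975-1994
daily_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)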
Suppose I have a very simple dataframe:
>>> a
Out[158]:
monthE yearE dayE
0 10 2014 15
1 2 2012 15
2 2 2014 15
3 12 2015 15
4 2 2012 15
Suppose I want to create a date column for each row, using the three integer columns.
When I have plain numbers it is enough to do:
>>> datetime.date(1983,11,8)
Out[159]: datetime.date(1983, 11, 8)
If instead I have to create a whole column of dates (theoretically a very basic request):
a.apply(lambda x: datetime.date(x['yearE'],x['monthE'],x['dayE']))
I obtain the following error:
KeyError: ('yearE', u'occurred at index monthE')
I think you can first strip the last character E from the column names and then use to_datetime, but then you get pandas Timestamps, not python dates:
df.columns = df.columns.str[:-1]
df['date'] = pd.to_datetime(df)
#if multiple columns filter by subset
#df['date'] = pd.to_datetime(df[['year','month','day']])
print (df)
month year day date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
print (df.date.dtypes)
datetime64[ns]
print (df.date.iloc[0])
2014-10-15 00:00:00
print (type(df.date.iloc[0]))
<class 'pandas.tslib.Timestamp'>
Thank you MaxU for the solution:
df['date'] = pd.to_datetime(df.rename(columns = lambda x: x[:-1]))
#if another columns in df
#df['date'] = pd.to_datetime(df[['yearE','monthE','dayE']].rename(columns=lambda x: x[:-1]))
print (df)
monthE yearE dayE date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
But if you really need python dates, add axis=1 to apply; then, however, it is impossible to use some pandas functions on the resulting column:
df['date'] =df.apply(lambda x: datetime.date(x['yearE'],x['monthE'],x['dayE']), axis=1)
print (df)
monthE yearE dayE date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
print (df.date.dtypes)
object
print (df.date.iloc[0])
2014-10-15
print (type(df.date.iloc[0]))
<class 'datetime.date'>
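If python date objects are really what's needed, a possible middle ground (a sketch, assuming a pandas version with the .dt accessor) is to build the datetimes vectorised and only then take the date component:
df['date'] = pd.to_datetime(df[['yearE','monthE','dayE']].rename(columns=lambda x: x[:-1])).dt.date
# the column now holds datetime.date objects (object dtype), so the same caveat about pandas functions applies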
This question already has answers here: Pandas Merging 101 (8 answers). Closed 4 years ago.
I have a folder with numerous csv files which look like this:
csv1
2006 Percent Land_Use
0 13 5.379564 Developed
1 8 25.781580 Grass/Pasture
2 4 54.265050 Crop
3 15 0.363983 Water
4 16 6.244104 Wetlands
5 6 4.691764 Forest
6 1 3.031494 Alfalfa
7 11 0.137424 Shrubland
8 5 0.003671 Vetch
9 3 0.055412 Barren
10 7 0.009531 Grass
11 12 0.036423 Tree
csv2
2007 Percent Land_Use
0 13 2.742430 Developed
1 4 56.007242 Crop
2 8 24.227963 Grass/Pasture
3 16 8.839979 Wetlands
4 6 6.181062 Forest
5 1 1.446668 Alfalfa
6 15 0.366116 Water
7 3 0.127760 Barren
8 11 0.034426 Shrubland
9 7 0.000827 Grass
10 12 0.025528 Tree
csv3
2008 Percent Land_Use
0 13 1.863809 Developed
1 8 31.455578 Grass/Pasture
2 4 57.896856 Crop
3 16 2.693929 Wetlands
4 6 4.417966 Forest
5 1 1.239176 Alfalfa
6 7 0.130849 Grass
7 15 0.266536 Water
8 11 0.004571 Shrubland
9 3 0.030731 Barren
and I want to merge them all together into one DataFrame on Land_Use
I am reading in the files like this:
import os
import pandas as pd

pth = 'G:\\'
for f in os.listdir(pth):
    df = pd.read_csv(os.path.join(pth, f))
but I can't figure out how to merge all the individual dataframes after that. I figured out how to concat them but that isn't what I want. The type of merge I want is outer.
If I were to set the path to each csv file individually I would merge them like this, but I do NOT want to set a path to each file as there are many of them:
one=pd.read_csv(r'G:\one.csv')
two=pd.read_csv(r'G:\two.csv')
three=pd.read_csv(r'G:\three.csv')
merge=pd.merge(one,two, on=['Land_Use'], how='outer')
mergetwo=pd.merge(merge,three,on=['Land_Use'], how='outer')
I think in python 3 you can use:
import functools
dfs = [df1,df2,df3]
df = functools.reduce(lambda left,right: pd.merge(left,right,on='Land_Use',how='outer'),dfs)
print (df)
2006 Percent_x Land_Use 2007 Percent_y 2008 Percent
0 13 5.379564 Developed 13.0 2.742430 13.0 1.863809
1 8 25.781580 Grass/Pasture 8.0 24.227963 8.0 31.455578
2 4 54.265050 Crop 4.0 56.007242 4.0 57.896856
3 15 0.363983 Water 15.0 0.366116 15.0 0.266536
4 16 6.244104 Wetlands 16.0 8.839979 16.0 2.693929
5 6 4.691764 Forest 6.0 6.181062 6.0 4.417966
6 1 3.031494 Alfalfa 1.0 1.446668 1.0 1.239176
7 11 0.137424 Shrubland 11.0 0.034426 11.0 0.004571
8 5 0.003671 Vetch NaN NaN NaN NaN
9 3 0.055412 Barren 3.0 0.127760 3.0 0.030731
10 7 0.009531 Grass 7.0 0.000827 7.0 0.130849
11 12 0.036423 Tree 12.0 0.025528 NaN NaN
In python 2:
df = reduce(lambda left,right: pd.merge(left,right,on='Land_Use',how='outer'),dfs)
Working solution with glob:
import pandas as pd
import functools
import glob
pth = 'a/*.csv'
files = glob.glob(pth)
dfs = [pd.read_csv(f, sep=';') for f in files]
df = functools.reduce(lambda left,right: pd.merge(left,right,on='Land_Use', how='outer'),dfs)
print (df)
2006 Percent_x Land_Use 2008 Percent_y 2007 Percent
0 13 5.379564 Developed 13.0 1.863809 13.0 2.742430
1 8 25.781580 Grass/Pasture 8.0 31.455578 8.0 24.227963
2 4 54.265050 Crop 4.0 57.896856 4.0 56.007242
3 15 0.363983 Water 15.0 0.266536 15.0 0.366116
4 16 6.244104 Wetlands 16.0 2.693929 16.0 8.839979
5 6 4.691764 Forest 6.0 4.417966 6.0 6.181062
6 1 3.031494 Alfalfa 1.0 1.239176 1.0 1.446668
7 11 0.137424 Shrubland 11.0 0.004571 11.0 0.034426
8 5 0.003671 Vetch NaN NaN NaN NaN
9 3 0.055412 Barren 3.0 0.030731 3.0 0.127760
10 7 0.009531 Grass 7.0 0.130849 7.0 0.000827
11 12 0.036423 Tree NaN NaN 12.0 0.025528
I am not allowed to comment, so I am unsure of exactly what you want.
You can try using one.merge(two, on=['Land_Use'], how='outer').merge(three,on=['Land_Use'], how='outer'). Let me know if you wanted something else.
If you have many dataframes, you can try using the reduce function. First create a list containing all the dataframes, e.g. dataframes = [one, two, three, four, ..., twenty]. You can build that list with a list comprehension or by appending to it in your loop.
Then if you want to combine them based on Land_Use, you can use df_final = reduce(lambda left,right: pd.merge(left,right,on=['Land_Use'], how='outer'), dataframes)
Note: In python 3+, the reduce function lives in the functools module.
pandas.DataFrame.from_csv(filename) seems to be converting my integer index into a date.
This is undesirable. How do I prevent this?
The code shown here is a toy version of a larger problem. In the larger problem, I am estimating and writing the parameters of statistical models for each zone for later use. I thought that by using a pandas dataframe indexed by zone, I could easily read the parameters back. While pickle or some other format like JSON might solve this problem, I'd like to see a pandas solution... except pandas is converting the zone number to a date.
#!/usr/bin/python
cache_file="./mydata.csv"
import numpy as np
import pandas as pd
zones = [1,2,3,8,9,10]
def create():
    data = []
    for z in zones:
        info = {'m': int(10*np.random.rand()), 'n': int(10*np.random.rand())}
        info.update({'zone': z})
        data.append(info)
    df = pd.DataFrame(data, index=zones)
    print "about to write this data:"
    print df
    df.to_csv(cache_file)

def read():
    df = pd.DataFrame.from_csv(cache_file)
    print "read this data:"
    print df

create()
read()
Sample output:
about to write this data:
m n zone
1 0 3 1
2 5 8 2
3 6 4 3
8 1 8 8
9 6 2 9
10 7 2 10
read this data:
m n zone
2013-12-01 0 3 1
2013-12-02 5 8 2
2013-12-03 6 4 3
2013-12-08 1 8 8
2013-12-09 6 2 9
2013-12-10 7 2 10
The CSV file looks OK, so the problem seems to be in reading not creating.
mydata.csv
,m,n,zone
1,0,3,1
2,5,8,2
3,6,4,3
8,1,8,8
9,6,2,9
10,7,2,10
I suppose this might be useful:
pd.__version__
0.12.0
Python version is python 2.7.5+
I want to record the zone as an index so I can easily pull out the corresponding
parameters later. How do I keep pandas.DataFrame.from_csv() from turning it into a date?
Reading the documentation for pandas.DataFrame.from_csv, the parse_dates argument defaults to True. Set it to False.
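A minimal sketch of both options: the original from_csv call with date parsing switched off, and the read_csv equivalent (from_csv has since been deprecated and removed in newer pandas):
df = pd.DataFrame.from_csv(cache_file, parse_dates=False)  # keeps 1, 2, 3, ... as a plain index
# or, with current pandas:
df = pd.read_csv(cache_file, index_col=0)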