I am reading a CSV in Python with multiple columns.
The first column is the date, and I have to delete the rows that correspond to years prior to 2018.
time high low Volume Plot Rango
0 2017-12-22 25.17984 24.280560 970 0.329943 0.899280
1 2017-12-26 25.17984 23.381280 2579 1.057921 1.798560
2 2017-12-27 25.17984 23.381280 2499 0.998083 1.798560
3 2017-12-28 25.17984 24.280560 1991 0.919885 0.899280
4 2017-12-29 25.17984 24.100704 2703 1.237694 1.079136
.. ... ... ... ... ... ...
580 2020-04-16 5.45000 4.450000 117884 3.168380 1.000000
581 2020-04-17 5.35000 4.255200 58531 1.370538 1.094800
582 2020-04-20 4.66500 4.100100 25770 0.582999 0.564900
583 2020-04-21 4.42000 3.800000 20914 0.476605 0.620000
584 2020-04-22 4.22000 3.710100 23212 0.519275 0.509900
I want to delete the rows corresponding to years prior to 2018, so 2017, 2016, 2015, ... should be deleted.
I am trying this, but it does not work:
if 2017 in datos['time']: datos['time'].remove() # check whether 2017 appears in each item of the column 'time'
The dates are recognized as numbers, not as datetime, but I think I do not need to declare them as datetime.
In pandas, given your data, use Boolean indexing. The time column must be in datetime64[ns] format; df.info() will show the dtypes. Convert the column first, then filter:
df['time'] = pd.to_datetime(df['time'])
df = df[df['time'].dt.year >= 2018]
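If you are reading the file yourself, you can also have pandas parse the dates while loading. A minimal sketch, assuming the file is named datos.csv (a hypothetical name):

import pandas as pd

# parse_dates converts the 'time' column to datetime64[ns] on read
datos = pd.read_csv('datos.csv', parse_dates=['time'])

# keep only the rows from 2018 onward
datos = datos[datos['time'].dt.year >= 2018]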
I have a data set of daily temperatures for which I want to calculate 20 year means. The data look like this:
1974 1 1 5.3 4.6 7.3 3.4
1974 1 2 3.3 7.2 4.5 6.5
...
2005 12 364 4.2 5.2 3.3 4.6
2005 12 365 3.1 5.5 2.6 6.8
There is no header in the file but the first column contains the year, the second column the month, and the third column the day of the year. The rest of the columns are temperature data.
I want to calculate the average temperature for each day over a period of 20 years. I thought the best way to do that would be to group the data by day and calculate the mean of each day for a specific range of years. Here is my code:
import pandas as pd
hist_fn = 'tmean_daily_1974_2005.txt'
twenty_year_fn = '20_yr_mean_1974_1993.txt'
start = 1974
end = 1993
hist_mean = pd.read_csv(hist_fn, sep=r'\s+', header=None)  # raw string for the regex separator
# Limit dataframe to only the 20 years for which I want the mean calculated
interval_mean = hist_mean[(hist_mean[0]>=start) & (hist_mean[0]<=end)]
# Rename the first column to reflect what mean this file is displaying
interval_mean.iloc[:, 0] = ("%s-%s" % (start, end))
# Generate mean for each day spread across all the years in the dataframe
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
# Write multiyear mean to txt
interval_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)
The data set spans more than 20 years, and this method works for the first 20-year interval but gives me a (mostly) empty text file for any other range of years.
So when I use these inputs it works:
start = 1974
end = 1993
and it produces a file that looks like this:
1974-1993 1 1 4.33 5.25 6.84 3.67
1974-1993 1 2 7.23 6.22 5.65 6.23
...
1974-1993 12 364 5.12 4.34 5.21 2.16
1974-1993 12 365 4.81 5.95 3.56 6.78
but when I change the inputs to this:
start = 1975
end = 1994
it produces a .txt file with no temperatures:
1975-1994 1 1
1975-1994 1 2
...
1975-1994 12 364
1975-1994 12 365
I don't understand why this method works for the first 20 year interval but none of the subsequent intervals. Is it something to do with the way the data is organized or how it is being sliced?
Now, with that out of the way, we can talk about the problem you presented:
The strange behavior is due to the fact that pandas matches indices on assignment, and slicing preserves the original indices. That means that when setting
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
note that interval_mean.groupby(2, as_index=False).mean() has indices 0, ..., 364 (since as_index=False makes the groupby operation create a fresh index; otherwise, the index would have been the day number). On the other hand, interval_mean keeps the original indices from hist_mean, meaning that for the first interval (the first 20 years) it has indices 0, ..., ~20*365, while for any later interval its indices start around 20*365 and count up. On the second run, then, almost no index of the groupby result matches an index of interval_mean, so the assignment fills NaN everywhere.
This is a bit confusing at first, but pandas offers great documentation about it, and people quickly discover why index alignment is so useful.
I'll try to explain what happens with an example:
Assume we have the following DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.reshape(np.random.randint(5, size=30), [-1, 3]))
df
0 1 2
0 1 1 2
1 2 1 1
2 0 1 2
3 0 2 0
4 2 1 0
5 0 1 2
6 2 2 1
7 1 0 2
8 0 1 0
9 1 2 0
Note that the column names are 0,1,2 and the row names (the index) are 0, ..., 9.
When we perform a groupby, we obtain
df.groupby(0, as_index=False).mean()
0 1 2
0 0 1.250000 1.000000
1 1 1.000000 1.333333
2 2 1.333333 0.666667
(The index happens to equal the values of the grouped column just because we drew small integers, so column 0 contains exactly 0, 1, 2.) Now, when we assign to df.loc, it will replace every cell with the corresponding cell of the assignee, if such a cell exists. Otherwise, it will leave NaN.
df.loc[:,:] = df.groupby(0, as_index=False).mean()
df
0 1 2
0 0.0 1.250000 1.000000
1 1.0 1.000000 1.333333
2 2.0 1.333333 0.666667
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
And when you write NaN to CSV, it leaves the cell blank.
The last piece of the puzzle is that interval_mean keeps the original indices, because Boolean slicing preserves the index of the original frame:
df[df[1] > 1]
0 1 2
3 0 2 0
6 2 2 1
9 1 2 0
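Given all that, one way to fix the original code is to let the groupby build the final frame itself, with a fresh index, instead of assigning its result back into the larger, differently-indexed interval_mean. A minimal sketch, reusing the file and variable names from the question:

import pandas as pd

start, end = 1975, 1994
hist_fn = 'tmean_daily_1974_2005.txt'
hist_mean = pd.read_csv(hist_fn, sep=r'\s+', header=None)

# Limit to the requested 20-year interval
interval = hist_mean[(hist_mean[0] >= start) & (hist_mean[0] <= end)]

# Group by month (column 1) and day of year (column 2); as_index=False
# produces a fresh 0..364 index, so there is nothing left to misalign
interval_mean = interval.groupby([1, 2], as_index=False).mean()

# The year column now holds the mean of the years; replace it with the label
interval_mean[0] = '%s-%s' % (start, end)

# Restore the original column order and write out
interval_mean = interval_mean[hist_mean.columns]
interval_mean.to_csv('20_yr_mean_%s_%s.txt' % (start, end), sep='\t', header=False, index=False)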
My goal is to compare each value from the column "year" against the appropriate year column (i.e. 1999, 2000) and return the corresponding value. For example, for Afghanistan (first row), year 2004, I want to find the column named "2004" and return the value from the row that contains Afghanistan.
Here is the table. For reference, this table is the result of a SQL join between educational attainment in a single defined year and a table of GDP per country for the years 1999-2010. My ultimate goal is to return the GDP from the year that the educational data is from.
country year men_ed_yrs women_ed_yrs total_ed_yrs 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
0 Afghanistan 2004 11 5 8 NaN NaN 2461666315 4128818042 4583648922 5285461999 6.275076e+09 7.057598e+09 9.843842e+09 1.019053e+10 1.248694e+10 1.593680e+10
1 Albania 2004 11 11 11 3414760915 3632043908 4060758804 4435078648 5746945913 7314865176 8.158549e+09 8.992642e+09 1.070101e+10 1.288135e+10 1.204421e+10 1.192695e+10
2 Algeria 2005 13 13 13 48640611686 54790060513 54744714110 56760288396 67863829705 85324998959 1.030000e+11 1.170000e+11 1.350000e+11 1.710000e+11 1.370000e+11 1.610000e+11
3 Andorra 2008 11 12 11 1239840270 1401694156 1484004617 1717563533 2373836214 2916913449 3.248135e+09 3.536452e+09 4.010785e+09 4.001349e+09 3.649863e+09 3.346317e+09
4 Anguilla 2008 11 11 11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
My approach so far is:
for value in df_combined_column_named['year']:  # loop through each year in the 'year' column
    if value in df_combined_column_named.columns:
        ...
Any thoughts?
Use df.loc:
In [62]: df.loc[df['country']=='Afghanistan', '2004'].item()
Out[62]: 5285461999.0
df.loc[rows, columns] can accept a boolean Series (such as df['country']=='Afghanistan') for rows and a column label (such as '2004') for columns. It will return the values for rows where the boolean Series is True and in the specified column.
In general this can be more than one value, so a Series is returned. However, in this case, there is only one value in the Series. So to obtain just the value, call the item method.
Note it is unclear from the posted string representation of df whether the numeric column labels are strings or integers. If the numeric column labels are integers, then you would need to use
df.loc[df['country']=='Afghanistan', 2004].item()
(with no quotation marks around 2004).
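If you want this lookup for every row at once, which is what the loop in the question is heading toward, one option is a row-wise apply. A sketch, assuming the year columns are string labels and using a hypothetical gdp column to hold the result:

# For each row, use that row's 'year' value as a column label
df['gdp'] = df.apply(lambda row: row[str(row['year'])], axis=1)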
If you are going to make a lot of "queries" of this form, you may wish to set the country column as the index:
df = df.set_index('country')
Then you could access the value in the cell whose row label is 'Afghanistan' and whose column label is '2004' using get_value:
In [65]: df.get_value('Afghanistan', '2004')
Out[65]: 5285461999.0
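Note that get_value was deprecated and later removed from pandas; in current versions the equivalent scalar lookup is the at accessor:

In [66]: df.at['Afghanistan', '2004']
Out[66]: 5285461999.0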
Let's say I have a Pandas DataFrame with two columns: 1) user_id, 2) steps (which contains the number of steps on the given date). Now I want to calculate the difference between the number of steps and the number of steps in the preceding measurement (measurements are guaranteed to be in order within my DataFrame).
So basically this comes down to appending an extra column to my DataFrame, where each row's value equals the 'steps' value of that row minus the 'steps' value of the row above (or 0 if it is the first row). To complicate things further, I want to calculate these differences per user_id, so I want to make sure that I do not subtract the steps values of two rows with different user_ids.
Does anyone have an idea how to get this done with Python 2.7 and pandas?
Here is an example to illustrate this.
Example input:
user_id steps
1015 48
1015 23
1015 79
1016 10
1016 20
Desired output:
user_id steps d_steps
1015 48 0
1015 23 -25
1015 79 56
2023 10 0
2023 20 10
Your desired output shows user ids that are not in your original data, but the following does what you want; you will have to replace/fill the NaN values with 0:
In [16]:
df['d_steps'] = df.groupby('user_id').transform('diff')
df.fillna(0, inplace=True)
df
Out[16]:
user_id steps d_steps
0 1015 48 0
1 1015 23 -25
2 1015 79 56
3 1016 10 0
4 1016 20 10
Here we generate the desired column by calling transform on the groupby object, passing a string that maps to the diff method, which subtracts the previous row's value. transform applies a function and returns a result whose index is aligned to the df.
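In more recent pandas versions it is more idiomatic (and avoids assigning a one-column DataFrame to a column) to select the column before calling diff:

df['d_steps'] = df.groupby('user_id')['steps'].diff().fillna(0)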
I have a dataframe with the names of newborn babies per year.
"name","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013"
"Aicha",0,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0
"Aida",15,20,16,0,10,0,10,14,13,11,12,11,13,14,13,18
"Aina",0,0,0,0,0,0,16,12,15,13,12,14,10,11,0,12
"Aisha",14,0,10,12,15,13,28,33,26,26,52,44,43,54,68,80
"Ajla",15,10,0,0,22,18,28,27,26,26,19,16,19,22,17,27
"Alba",0,0,14,14,22,14,17,19,23,15,28,32,25,33,33,33
I want to plot this in a line chart, where each line is a different name and the x axis is the years. In order to do that, I imagine I need to reshape the data into something like this:
"name","year","value"
"Aicha","1998",0
"Aicha","1999",0
"Aicha","2000",0
...
How do I reshape the data in that fashion? I've tried pivot_table but I can't seem to get it to work.
You could use pd.melt:
>>> df_melted = pd.melt(df, id_vars="name", var_name="year")
>>> df_melted.head()
name year value
0 Aicha 1998 0
1 Aida 1998 15
2 Aina 1998 0
3 Aisha 1998 14
4 Ajla 1998 15
and then sort using
>>> df_melted = df_melted.sort_values("name")
if you like.
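From the melted frame you can then draw the requested line chart (one line per name, years on the x axis) by pivoting back so that the years form the index. A sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# Years on the x axis, one column (and therefore one line) per name
df_melted['year'] = df_melted['year'].astype(int)
df_melted.pivot(index='year', columns='name', values='value').plot()
plt.show()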