multiply pandas dataframe column with a constant - python-2.7

I have two dataframes:
df:
Conference Year SampleCitations Percent
0 CIKM 1995 373 0.027153
1 CIKM 1996 242 0.017617
2 CIKM 1997 314 0.022858
3 CIKM 1998 427 0.031084
And another dataframe which returns to me the total number of citations:
allcitations= pd.read_sql("Select Sum(Citations) as ActualCitations from publications "
I want to simply multiply the Percent column in dataframe df with the constant value ActualCitations.
I tried the following:
df['ActualCitations']=df['Percent'].multiply(allcitations['ActualCitations'])
and
df['ActualCitations']=df['Percent']* allcitations['ActualCitations']
But both only perform the multiplication for the first row; the rest is NaN, as shown below:
Conference Year SampleCitations Percent ActualCitations
0 CIKM 1995 373 0.027153 1485.374682
1 CIKM 1996 242 0.017617 NaN
2 CIKM 1997 314 0.022858 NaN
3 CIKM 1998 427 0.031084 NaN

The problem in this case is pandas's automatic index alignment (usually a good thing). Because your 'constant' actually lives in a dataframe, pandas tries to build row 0 from each of the row 0s, row 1 from each of the row 1s, and so on; but there is no row 1 in the second dataset, so you get NaN from there forward.
So what you need to do is intentionally break the dataframe aspect of the second dataframe so that pandas will instead 'broadcast' the constant to ALL rows. One way to do this is with .values, which in this case essentially just drops the index, leaving a numpy array with one element (really a scalar, but technically wrapped in a numpy array), and that broadcasts. to_list() will also accomplish the same thing.
allcitations=pd.DataFrame({ 'ActualCitations':[54703.888410120424] })
df['Percent'] * allcitations['ActualCitations'].values
0 1485.374682
1 963.718402
2 1250.421481
3 1700.415667
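An equally workable alternative (just a sketch, not part of the original answer) is to pull the scalar out of the one-row frame yourself before multiplying, so there is nothing left for pandas to align:
# Grab the single value as a plain scalar, then multiply; broadcasting a
# true scalar never triggers index alignment.
actual = allcitations['ActualCitations'].iloc[0]
df['ActualCitations'] = df['Percent'] * actual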

Related

How to filter csv data to remove data after a specified year?

I am reading a csv in python with multiple columns.
The first column is the date and I have to delete the rows that correspond to years previous to 2017.
time high low Volume Plot Rango
0 2017-12-22 25.17984 24.280560 970 0.329943 0.899280
1 2017-12-26 25.17984 23.381280 2579 1.057921 1.798560
2 2017-12-27 25.17984 23.381280 2499 0.998083 1.798560
3 2017-12-28 25.17984 24.280560 1991 0.919885 0.899280
4 2017-12-29 25.17984 24.100704 2703 1.237694 1.079136
.. ... ... ... ... ... ...
580 2020-04-16 5.45000 4.450000 117884 3.168380 1.000000
581 2020-04-17 5.35000 4.255200 58531 1.370538 1.094800
582 2020-04-20 4.66500 4.100100 25770 0.582999 0.564900
583 2020-04-21 4.42000 3.800000 20914 0.476605 0.620000
584 2020-04-22 4.22000 3.710100 23212 0.519275 0.509900
I want to delete the rows corresponding to years prior to 2018, so 2017,2016,2015... should be deleted
I am trying with this but does not work
if 2017 in datos['time']: datos['time'].remove() #check if number 2017 is in each of the items of the column 'time'
The dates are recognized as numbers, not as datetime, but I think I do not need to declare the column as datetime.
In pandas, given your data, use boolean indexing. The time column must have the datetime64[ns] dtype for the .dt accessor to work (df.info() will show the dtypes), so convert it first:
df['time'] = pd.to_datetime(df['time'])
df[df['time'].dt.year >= 2018]
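If you would rather not convert to datetime at all, a hedged alternative sketch (it assumes the 'time' column really holds plain 'YYYY-MM-DD' strings as shown in the question, and uses the asker's frame name datos) is to compare the year prefix directly:
# Take the first four characters of each 'time' value as the year and
# keep only rows from 2018 onwards.
datos = datos[datos['time'].str[:4].astype(int) >= 2018]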

Issue Calculating Mean of Grouped Data for entire range of dataset using Pandas

I have a data set of daily temperatures for which I want to calculate 20 year means. The data look like this:
1974 1 1 5.3 4.6 7.3 3.4
1974 1 2 3.3 7.2 4.5 6.5
...
2005 12 364 4.2 5.2 3.3 4.6
2005 12 365 3.1 5.5 2.6 6.8
There is no header in the file but the first column contains the year, the second column the month, and the third column the day of the year. The rest of the columns are temperature data.
I want to calculate the average temperature for each day over a period of 20 years. I thought the best way to do that would be to group the data by day and calculate the mean of each day for a specific range of years. Here is my code:
import pandas as pd
hist_fn = 'tmean_daily_1974_2005.txt'
twenty_year_fn = '20_yr_mean_1974_1993.txt'
start = 1974
end = 1993
hist_mean = pd.read_csv(hist_fn, sep='\s+', header=None)
# Limit dataframe to only the 20 years for which I want the mean calculated
interval_mean = hist_mean[(hist_mean[0]>=start) & (hist_mean[0]<=end)]
# Rename the first column to reflect what mean this file is displaying
interval_mean.iloc[:, 0] = ("%s-%s" % (start, end))
# Generate mean for each day spread across all the years in the dataframe
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
# Write multiyear mean to txt
interval_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)
The data set spans longer than 20 years and the method I used has worked for the first 20 year interval but gives me a (mostly) empty text file for any other set of years entered.
So when I use these inputs it works:
start = 1974
end = 1993
and it produces a file that looks like this:
1974-1993 1 1 4.33 5.25 6.84 3.67
1974-1993 1 2 7.23 6.22 5.65 6.23
...
1974-1993 12 364 5.12 4.34 5.21 2.16
1974-1993 12 365 4.81 5.95 3.56 6.78
but when I change the inputs to this:
start = 1975
end = 1994
it produces a .txt file with no temperatures:
1975-1994 1 1
1975-1994 1 2
...
1975-1994 12 364
1975-1994 12 365
I don't understand why this method works for the first 20 year interval but none of the subsequent intervals. Is it something to do with the way the data is organized or how it is being sliced?
The strange behavior is due to the fact that pandas matches indices on assignment, and slicing preserves the original index. That means that when setting
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
Note that interval_mean.groupby(2, as_index=False).mean() has indices 0, 1, 2, ... (since as_index=False makes the groupby create a fresh default index; otherwise the index would have been the day number). On the other hand, interval_mean keeps the original indices from hist_mean, meaning the first time (the first 20 years) it has indices 0, ..., ~20*365, and the second time it has indices starting from around 20*365 and counting up.
This is a bit confusing at first, but pandas offers good documentation about it, and once you see what alignment buys you it becomes clear why it is so useful.
I'll try to explain what happens with an example:
Assume we have the following DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.reshape(np.random.randint(3, size=30), [-1, 3]))
df
0 1 2
0 1 1 2
1 2 1 1
2 0 1 2
3 0 2 0
4 2 1 0
5 0 1 2
6 2 2 1
7 1 0 2
8 0 1 0
9 1 2 0
Note that the column names are 0,1,2 and the row names (the index) are 0, ..., 9.
When we perform the groupby we obtain
df.groupby(0, as_index=False).mean()
0 1 2
0 0 1.250000 1.000000
1 1 1.000000 1.333333
2 2 1.333333 0.666667
(The index happens to equal the column we grouped by only because the values drawn are between 0 and 2.) Now, when we assign to df.loc, pandas will replace every cell by the corresponding cell in the assignee, if such a cell exists. Otherwise, it leaves NaN.
df.loc[:,:] = df.groupby(0, as_index=False).mean()
df
0 1 2
0 0.0 1.250000 1.000000
1 1.0 1.000000 1.333333
2 2.0 1.333333 0.666667
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
And when NaN is written to csv, the cell is left blank.
The last piece of the puzzle is that interval_mean keeps the original indices from hist_mean, because slicing preserves the index:
df[df[1] > 1]
0 1 2
3 0 2 0
6 2 2 1
9 1 2 0
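If the goal is simply one row of means per day of year, an easier route is to write the groupby result out directly instead of assigning it back into interval_mean, so index alignment never comes into play. A minimal sketch under the column layout described in the question (column 0 = year, 1 = month, 2 = day of year); note that the column order of the groupby output may differ between pandas versions:
import pandas as pd

start, end = 1975, 1994
hist_mean = pd.read_csv('tmean_daily_1974_2005.txt', sep='\s+', header=None)

# Restrict to the requested 20-year window.
interval = hist_mean[(hist_mean[0] >= start) & (hist_mean[0] <= end)]

# Average each day of year (column 2) across the 20 years, dropping the
# year column since its mean is meaningless here.
daily_means = interval.drop(0, axis=1).groupby(2, as_index=False).mean()

# Label the interval and write the 365 rows of means to disk.
daily_means.insert(0, 'interval', '%s-%s' % (start, end))
daily_means.to_csv('20_yr_mean_%s_%s.txt' % (start, end),
                   sep='\t', header=False, index=False)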

Return value in dataframe based on row index, column reference

My goal is to compare each value from the column "year" against the appropriate year column (e.g. 1999, 2000). I then want to return the corresponding value from that column. For example, for Afghanistan (first row), year 2004, I want to find the column named "2004" and return the value from the row that contains Afghanistan.
Here is the table. For reference this table is the result of a sql join between educational attainment in a single defined year and a table for gdp per country for years 1999 - 2010. My ultimate goal is to return the gdp from the year that the educational data is from.
country year men_ed_yrs women_ed_yrs total_ed_yrs 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
0 Afghanistan 2004 11 5 8 NaN NaN 2461666315 4128818042 4583648922 5285461999 6.275076e+09 7.057598e+09 9.843842e+09 1.019053e+10 1.248694e+10 1.593680e+10
1 Albania 2004 11 11 11 3414760915 3632043908 4060758804 4435078648 5746945913 7314865176 8.158549e+09 8.992642e+09 1.070101e+10 1.288135e+10 1.204421e+10 1.192695e+10
2 Algeria 2005 13 13 13 48640611686 54790060513 54744714110 56760288396 67863829705 85324998959 1.030000e+11 1.170000e+11 1.350000e+11 1.710000e+11 1.370000e+11 1.610000e+11
3 Andorra 2008 11 12 11 1239840270 1401694156 1484004617 1717563533 2373836214 2916913449 3.248135e+09 3.536452e+09 4.010785e+09 4.001349e+09 3.649863e+09 3.346317e+09
4 Anguilla 2008 11 11 11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
My approach so far is:
for value in df_combined_column_named['year']: #loops through each year in year column
if value in df_combined_column_named.columns
any thoughts?
Use df.loc:
In [62]: df.loc[df['country']=='Afghanistan', '2004'].item()
Out[62]: 5285461999.0
df.loc[rows, columns] can accept a boolean Series (such as df['country']=='Afghanistan') for rows and a column label (such as '2004') for columns. It will return the values for rows where the boolean Series is True and in the specified column.
In general this can be more than one value, so a Series is returned. However, in this case, there is only one value in the Series. So to obtain just the value, call the item method.
Note it is unclear from the posted string representation of df whether the numeric column labels are strings or integers. If they are integers, then you would need to use
df.loc[df['country']=='Afghanistan', 2004].item()
(with no quotation marks around 2004).
If you are going to make a lot of "queries" of this form, you may wish to set the country column as the index:
df = df.set_index('country')
Then you could access the value in the cell whose row label is 'Afghanistan' and whose column label is '2004' using get_value:
In [65]: df.get_value('Afghanistan', '2004')
Out[65]: 5285461999.0
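For the broader goal in the question (pulling, for every row, the GDP from the column that matches that row's year), one hedged sketch is DataFrame.lookup, which pairs row labels with column labels element-wise. The new column name here is made up, and lookup is deprecated in recent pandas, but it is available in the era this question targets:
# For each row, fetch the value from the column whose name equals that
# row's 'year'. Assumes the year columns are labeled with strings.
df['gdp_in_ed_year'] = df.lookup(df.index, df['year'].astype(str))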

How to append a new column to my Pandas DataFrame based on a row-based calculation?

Let's say I have a Pandas DataFrame with two columns: 1) user_id, 2) steps (which contains the number of steps on the given date). Now I want to calculate the difference between the number of steps and the number of steps in the preceding measurement (measurements are guaranteed to be in order within my DataFrame).
So basically this comes down to appending an extra column to my DataFrame, where each row's value is the 'steps' value of that row minus the 'steps' value of the row above (or 0 if it is the first row). To complicate things further, I want to calculate these differences per user_id, so I want to make sure that I never subtract the steps values of two rows with different user_id's.
Does anyone have an idea how to get this done with Python 2.7 and pandas?
So an example to illustrate this.
Example input:
user_id steps
1015 48
1015 23
1015 79
1016 10
1016 20
Desired output:
user_id steps d_steps
1015 48 0
1015 23 -25
1015 79 56
2023 10 0
2023 20 10
Your desired output shows user ids that are not in your original data, but the following does what you want; you will just have to fill the NaN values with 0:
In [16]:
df['d_steps'] = df.groupby('user_id').transform('diff')
df.fillna(0, inplace=True)
df
Out[16]:
user_id steps d_steps
0 1015 48 0
1 1015 23 -25
2 1015 79 56
3 1016 10 0
4 1016 20 10
Here we generate the desired column by calling transform on the groupby object and passing a string that maps to the diff method, which subtracts the previous row's value. transform applies a function and returns a result whose index is aligned to the df.
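An equivalent, slightly more explicit sketch (my variant, not part of the original answer) calls diff directly on the 'steps' column within each group and fills the leading NaN of each user with 0 in one go:
# diff() within each user_id group, then replace the per-group leading
# NaN with 0 as the desired output requests.
df['d_steps'] = df.groupby('user_id')['steps'].diff().fillna(0)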

Turn columns to rows in pandas

I have a dataframe with the names of newborn babies per year.
"name","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013"
"Aicha",0,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0
"Aida",15,20,16,0,10,0,10,14,13,11,12,11,13,14,13,18
"Aina",0,0,0,0,0,0,16,12,15,13,12,14,10,11,0,12
"Aisha",14,0,10,12,15,13,28,33,26,26,52,44,43,54,68,80
"Ajla",15,10,0,0,22,18,28,27,26,26,19,16,19,22,17,27
"Alba",0,0,14,14,22,14,17,19,23,15,28,32,25,33,33,33
I want to plot this in a line chart, where each line is a different name and the x axis is the years. In order to do that, I imagine I need to reshape the data into something like this:
"name","year","value"
"Aicha","1998",0
"Aicha","1999",0
"Aicha","2000",0
...
How do I reshape the data in that fashion? I've tried pivot_table but I can't seem to get it to work.
You could use pd.melt:
>>> df_melted = pd.melt(df, id_vars="name", var_name="year")
>>> df_melted.head()
name year value
0 Aicha 1998 0
1 Aida 1998 15
2 Aina 1998 0
3 Aisha 1998 14
4 Ajla 1998 15
and then sort, if you like, using
>>> df_melted = df_melted.sort("name")
(in newer pandas versions the method is sort_values("name")).
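From the melted frame, the line chart itself is one pivot away. A sketch, assuming matplotlib is available and df_melted is the frame built above:
import matplotlib.pyplot as plt

# One column per name, one row per year, then one line per column.
ax = df_melted.pivot(index="year", columns="name", values="value").plot()
ax.set_xlabel("year")
ax.set_ylabel("number of newborns")
plt.show()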