Turn columns to rows in pandas - python-2.7
I have a dataframe with the names of newborn babies per year.
"name","1998","1999","2000","2001","2002","2003","2004","2005","2006","2007","2008","2009","2010","2011","2012","2013"
"Aicha",0,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0
"Aida",15,20,16,0,10,0,10,14,13,11,12,11,13,14,13,18
"Aina",0,0,0,0,0,0,16,12,15,13,12,14,10,11,0,12
"Aisha",14,0,10,12,15,13,28,33,26,26,52,44,43,54,68,80
"Ajla",15,10,0,0,22,18,28,27,26,26,19,16,19,22,17,27
"Alba",0,0,14,14,22,14,17,19,23,15,28,32,25,33,33,33
I want to plot this in a line chart, where each line is a different name and the x axis is the years. In order to do that, I imagine I need to reshape the data into something like this:
"name","year","value"
"Aicha","1998",0
"Aicha","1999",0
"Aicha","2000",0
...
How do I reshape the data in that fashion? I've tried pivot_table but I can't seem to get it to work.
You could use pd.melt:
>>> df_melted = pd.melt(df, id_vars="name", var_name="year")
>>> df_melted.head()
name year value
0 Aicha 1998 0
1 Aida 1998 15
2 Aina 1998 0
3 Aisha 1998 14
4 Ajla 1998 15
and then sort using
>>> df_melted = df_melted.sort_values("name")
if you like (very old pandas versions spelled this df_melted.sort("name")).
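If the end goal is the line chart itself, here is a minimal sketch (assuming the wide table above has already been read into df, e.g. with pd.read_csv, and that matplotlib is installed):

import pandas as pd
import matplotlib.pyplot as plt

# long form: one row per (name, year) pair
df_melted = pd.melt(df, id_vars="name", var_name="year")
df_melted["year"] = df_melted["year"].astype(int)  # the year headers are read in as strings

# pivot back so each name becomes a column indexed by year, then draw one line per name
df_melted.pivot(index="year", columns="name", values="value").plot()
plt.show()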
Related
Plotting categorical variables using a bar diagram/bar chart
I am trying to plot a bar graph for both the September and October waves. As you can see in the data below, id identifies the individuals who are surveyed across time. On one graph I need to plot sept in-house, oct in-house, sept out-house and oct out-house, showing only the proportion of people who said yes in each of those four categories. Not all the categories have to be taken into account. I also have to show whiskers for 95% confidence intervals for each of the respective categories.
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id sept_outhouse sept_inhouse oct_outhouse oct_inhouse)
 1 1 1 1 1
 2 2 2 2 2
 3 3 3 3 3
 4 4 3 3 3
 5 4 4 3 3
 6 4 4 3 3
 7 4 4 4 1
 8 1 1 1 1
 9 1 1 1 1
10 1 1 1 1
end
label values sept_outhouse codes
label values sept_inhouse codes
label values oct_outhouse codes
label values oct_inhouse codes
label def codes 1 "yes", modify
label def codes 2 "no", modify
label def codes 3 "don't know", modify
label def codes 4 "refused", modify
save tokenexample, replace

This has to be solved backwards. The means and confidence intervals have to be plotted using twoway, as graph bar is a dead end here: it does not allow whiskers on top of the bars. The confidence limits have to be put in variables before the graphics. Some graph commands, notably graph bar, will calculate means for you, but as said that is a dead end, so we need to calculate the means too. To do that you need an indicator variable for yes. The best way I know to get the results is then to reshape to a different structure and apply ci proportion under statsby. As a detail, the option jeffreys is given explicitly as a signal that there are different methods of confidence interval calculation; you should choose one knowingly.

rename (*house) (house*)
reshape long house, i(id) j(which) string
replace which = subinstr(proper(which), "_", " ", .)
gen yes = house == 1
label def WHICH 1 "Sept Out" 2 "Sept In" 3 "Oct Out" 4 "Oct In"
encode which, gen(WHICH) label(WHICH)
statsby, by(WHICH) clear: ci proportion yes, jeffreys
set scheme s1color
twoway scatter mean WHICH ///
|| rspike ub lb WHICH, xla(1/4, noticks valuelabel) xsc(r(0.9 4.1)) ///
xtitle("") legend(off) subtitle(Proportion Yes with 95% confidence interval)
Issue Calculating Mean of Grouped Data for entire range of dataset using Pandas
I have a data set of daily temperatures for which I want to calculate 20 year means. The data look like this:

1974 1 1 5.3 4.6 7.3 3.4
1974 1 2 3.3 7.2 4.5 6.5
...
2005 12 364 4.2 5.2 3.3 4.6
2005 12 365 3.1 5.5 2.6 6.8

There is no header in the file, but the first column contains the year, the second column the month, and the third column the day of the year. The rest of the columns are temperature data. I want to calculate the average temperature for each day over a period of 20 years. I thought the best way to do that would be to group the data by day and calculate the mean of each day for a specific range of years. Here is my code:

import pandas as pd

hist_fn = 'tmean_daily_1974_2005.txt'
twenty_year_fn = '20_yr_mean_1974_1993.txt'
start = 1974
end = 1993

hist_mean = pd.read_csv(hist_fn, sep='\s+', header=None)

# Limit dataframe to only the 20 years for which I want the mean calculated
interval_mean = hist_mean[(hist_mean[0]>=start) & (hist_mean[0]<=end)]

# Rename the first column to reflect what mean this file is displaying
interval_mean.iloc[:, 0] = ("%s-%s" % (start, end))

# Generate mean for each day spread across all the years in the dataframe
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]

# Write multiyear mean to txt
interval_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)

The data set spans longer than 20 years, and the method I used has worked for the first 20 year interval but gives me a (mostly) empty text file for any other set of years entered. So when I use these inputs it works:

start = 1974
end = 1993

and it produces a file that looks like this:

1974-1993 1 1 4.33 5.25 6.84 3.67
1974-1993 1 2 7.23 6.22 5.65 6.23
...
1974-1993 12 364 5.12 4.34 5.21 2.16
1974-1993 12 365 4.81 5.95 3.56 6.78

but when I change the inputs to this:

start = 1975
end = 1994

it produces a .txt file with no temperatures:

1975-1994 1 1
1975-1994 1 2
...
1975-1994 12 364
1975-1994 12 365

I don't understand why this method works for the first 20 year interval but none of the subsequent intervals. Is it something to do with the way the data is organized or how it is being sliced?
The strange behavior is due to the fact that pandas matches indices on assignment, and slicing preserves the original indices. That means that when setting

interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]

note that interval_mean.groupby(2, as_index=False).mean() has indices 0, ..., ~364, one per day-of-year group (as_index=False makes the groupby operation create a fresh index; otherwise the index would have been the day number). On the other hand, interval_mean keeps the original indices from hist_mean: for the first 20-year interval those are 0, ..., ~20*365, but for the second interval they start from around 20*365 and count up. This is a bit confusing at first, but pandas offers great documentation about it, and people quickly discover why it is so useful.

I'll try to explain what happens with an example. Assume we have the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.reshape(np.random.randint(5, size=30), [-1, 3]))
df

   0  1  2
0  1  1  2
1  2  1  1
2  0  1  2
3  0  2  0
4  2  1  0
5  0  1  2
6  2  2  1
7  1  0  2
8  0  1  0
9  1  2  0

Note that the column names are 0, 1, 2 and the row names (the index) are 0, ..., 9. When we perform a groupby we obtain

df.groupby(0, as_index=False).mean()

   0         1         2
0  0  1.250000  1.000000
1  1  1.000000  1.333333
2  2  1.333333  0.666667

(The index happens to equal the values of the grouped column only because we drew numbers between 0 and 2.)

Now, when we assign to df.loc, pandas replaces every cell by the corresponding cell in the assignee, if such a cell exists; otherwise, it leaves NaN:

df.loc[:, :] = df.groupby(0, as_index=False).mean()
df

     0         1         2
0  0.0  1.250000  1.000000
1  1.0  1.000000  1.333333
2  2.0  1.333333  0.666667
3  NaN       NaN       NaN
4  NaN       NaN       NaN
5  NaN       NaN       NaN
6  NaN       NaN       NaN
7  NaN       NaN       NaN
8  NaN       NaN       NaN
9  NaN       NaN       NaN

And when you write NaN to csv, the cell is left blank. The last piece of the puzzle is how interval_mean preserved the original indices, and that is because slicing preserves the original indices:

df[df[1] > 1]

   0  1  2
3  0  2  0
6  2  2  1
9  1  2  0
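As a sketch of one way around the alignment problem (not necessarily the only fix; the file names and the interval label follow the question), you can build the per-day means in their own DataFrame and write that out directly, so the mismatched indices never meet:

import pandas as pd

start, end = 1975, 1994
hist_mean = pd.read_csv('tmean_daily_1974_2005.txt', sep='\s+', header=None)

# keep only the requested years
interval = hist_mean[(hist_mean[0] >= start) & (hist_mean[0] <= end)]

# average every temperature column for each (month, day-of-year) pair;
# the year column is dropped first so it is not averaged as well
daily_mean = interval.drop(0, axis=1).groupby([1, 2], as_index=False).mean()

# put the interval label in front, matching the desired output layout
daily_mean.insert(0, 'interval', '%s-%s' % (start, end))

daily_mean.to_csv('20_yr_mean_%s_%s.txt' % (start, end), sep='\t', header=False, index=False)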
Return value in dataframe based on row index, column reference
My goal is to compare each value from the column "year" against the appropriate year column (i.e. 1999, 2000, ...) and then return the corresponding value from that column. For example, for Afghanistan (first row), year 2004, I want to find the column named "2004" and return the value from the row that contains Afghanistan. Here is the table. For reference, this table is the result of a SQL join between educational attainment in a single defined year and a table of GDP per country for the years 1999 - 2010. My ultimate goal is to return the GDP from the year that the educational data is from.

       country  year  men_ed_yrs  women_ed_yrs  total_ed_yrs         1999         2000         2001         2002         2003         2004          2005          2006          2007          2008          2009          2010
0  Afghanistan  2004          11             5             8          NaN          NaN   2461666315   4128818042   4583648922   5285461999  6.275076e+09  7.057598e+09  9.843842e+09  1.019053e+10  1.248694e+10  1.593680e+10
1      Albania  2004          11            11            11   3414760915   3632043908   4060758804   4435078648   5746945913   7314865176  8.158549e+09  8.992642e+09  1.070101e+10  1.288135e+10  1.204421e+10  1.192695e+10
2      Algeria  2005          13            13            13  48640611686  54790060513  54744714110  56760288396  67863829705  85324998959  1.030000e+11  1.170000e+11  1.350000e+11  1.710000e+11  1.370000e+11  1.610000e+11
3      Andorra  2008          11            12            11   1239840270   1401694156   1484004617   1717563533   2373836214   2916913449  3.248135e+09  3.536452e+09  4.010785e+09  4.001349e+09  3.649863e+09  3.346317e+09
4     Anguilla  2008          11            11            11          NaN          NaN          NaN          NaN          NaN          NaN           NaN           NaN           NaN           NaN           NaN           NaN

My approach so far is:

for value in df_combined_column_named['year']:  # loops through each year in the year column
    if value in df_combined_column_named.columns:

Any thoughts?
Use df.loc:

In [62]: df.loc[df['country']=='Afghanistan', '2004'].item()
Out[62]: 5285461999.0

df.loc[rows, columns] can accept a boolean Series (such as df['country']=='Afghanistan') for rows and a column label (such as '2004') for columns. It returns the values for the rows where the boolean Series is True, in the specified column. In general this can be more than one value, so a Series is returned; in this case, however, the Series contains only one value, so to obtain just the value, call the item method.

Note that it is unclear from the posted string representation of df whether the numeric column labels are strings or integers. If they are integers, you would need to use

df.loc[df['country']=='Afghanistan', 2004].item()

(with no quotation marks around 2004).

If you are going to make a lot of "queries" of this form, you may wish to set the country column as the index:

df = df.set_index('country')

Then you could access the value in the cell whose row label is 'Afghanistan' and whose column label is '2004' using get_value:

In [65]: df.get_value('Afghanistan', '2004')
Out[65]: 5285461999.0

(In more recent pandas versions get_value has been removed; df.at['Afghanistan', '2004'] is the equivalent.)
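If the ultimate goal is to pull the matching-year GDP for every row rather than for a single country, here is one possible sketch (df_combined_column_named comes from the question; the gdp_in_ed_year column name is just illustrative, and string column labels are assumed, as above):

# for each row, look up the column whose name matches that row's year
df_combined_column_named['gdp_in_ed_year'] = df_combined_column_named.apply(
    lambda row: row[str(row['year'])], axis=1)

If the year column labels are integers instead, drop the str() call.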
Lag in Stata generates only missing
I am having trouble using the L1. operator in Stata 14 to create lag variables. The resulting lag variable is 100% missing values!

gen d = L1.equity

Thanks in advance.
There is hardly enough information given in the question to know for certain, but as @Dimitriy V. Masterov suggested by questioning how your data is tsset, you likely have an issue there.

As a quick example, imagine a panel with two countries, country 1 and country 3, with gdp by country measured over five years:

clear
input float(id year gdp)
1 1 5
1 2 2
1 3 7
1 4 9
1 5 6
3 1 3
3 2 4
3 3 5
3 4 3
3 5 4
end

Now, if you improperly tsset this data, you can easily generate the missing values you describe:

tsset year id
gen lag_gdp = L1.gdp

Notice how you now have 10 missing values generated. In this example it happens because the panel and time variables are out of order and the (incorrectly specified) time variable has gaps (period 1 and period 3, but no period 2).

Something else I have witnessed is someone trying to tsset by their time variable and their analysis variable, which is also incorrect:

clear
input float(year gdp)
1 5
2 3
3 2
4 4
5 7
end

tsset year gdp
gen d = L1.gdp

I suspect you are having a similar issue. Without knowing what your data looks like or how it is tsset there is no way to diagnose this, but it is very likely an issue with how the data is tsset.
multiply pandas dataframe column with a constant
I have two dataframes. df:

  Conference  Year  SampleCitations   Percent
0       CIKM  1995              373  0.027153
1       CIKM  1996              242  0.017617
2       CIKM  1997              314  0.022858
3       CIKM  1998              427  0.031084

And another dataframe which returns the total number of citations:

allcitations = pd.read_sql("Select Sum(Citations) as ActualCitations from publications "

I want to simply multiply the Percent column in dataframe df by the constant value ActualCitations. I tried the following:

df['ActualCitations'] = df['Percent'].multiply(allcitations['ActualCitations'])

and

df['ActualCitations'] = df['Percent'] * allcitations['ActualCitations']

But both only perform it for the first row and the rest is NaN, as shown below:

  Conference  Year  SampleCitations   Percent  ActualCitations
0       CIKM  1995              373  0.027153      1485.374682
1       CIKM  1996              242  0.017617              NaN
2       CIKM  1997              314  0.022858              NaN
3       CIKM  1998              427  0.031084              NaN
The problem in this case is pandas's auto alignment (usually a good thing). Because your 'constant' is actually in a dataframe, pandas will try to create row 0 from each of the row 0s and then row 1 from each of the row 1s, but there is no row 1 in the second dataset, so you get NaN from there forward.

So what you need to do is intentionally break the dataframe aspect of the second dataframe so that pandas will 'broadcast' the constant to ALL rows. One way to do this is with values, which in this case essentially just drops the index so that you are left with a numpy array containing one element (really a scalar, but technically wrapped in a numpy array). tolist() will accomplish the same thing.

allcitations = pd.DataFrame({'ActualCitations': [54703.888410120424]})

df['Percent'] * allcitations['ActualCitations'].values

0    1485.374682
1     963.718402
2    1250.421481
3    1700.415667
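An alternative sketch that sidesteps alignment entirely is to pull the single value out as a plain scalar before multiplying:

# take the one value out of the one-row Series so nothing is left to align
actual = allcitations['ActualCitations'].iloc[0]
df['ActualCitations'] = df['Percent'] * actual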