I have the following data in Stata:
input drug halflife hl_weight
3 2.95 0.0066
2 6.00 0.0004
5 13.60 0.0006
1 2.82 0.0331
4 8.80 0.0001
4 1.24 0.0075
2 6.25 0.1123
4 17.20 0.0002
5 14.50 0.0020
4 5.50 0.0016
5 13.30 0.0003
4 8.26 0.0201
4 16.50 0.0103
4 11.40 0.0016
4 5.90 0.0005
4 3.99 0.0100
4 2.80 0.0073
4 3.00 0.0133
4 3.17 0.0061
4 4.95 0.1404
end
I am trying to create boxplots of drug halflives using the command below:
graph box halflife [aweight=hl_weight], over(drug)
When I include the weight option, some of the resulting box plots consist of multiple dots instead of the typical interquartile range and median:
Why does this happen and how can I fix it?
Obviously, this happens because of the weighting. The weights give more emphasis to values that are well outside the interquartile range.
I do not think there is anything to fix here. You could try to use the nooutsides option of the graph box command to hide the dots but i would not recommend it.
Related
data
I am trying to plot a bar graph for both sept and oct waves. As in the image you can see the id are the individuals who are surveyed across time. So on the one graph I need to plot sept in-house, oct in-house, sept out-house, oct out-house and just have to show the proportion of people who said yes in sept in-house, oct in-house, sept out-house, oct out-house. Not all the categories have to be taken into account.
Also I have to show whiskers for 95% confidence intervals for each of the respective categories.
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id sept_outhouse sept_inhouse oct_outhouse oct_inhouse)
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 3 3 3
5 4 4 3 3
6 4 4 3 3
7 4 4 4 1
8 1 1 1 1
9 1 1 1 1
10 1 1 1 1
end
label values sept_outhouse codes
label values sept_inhouse codes
label values oct_outhouse codes
label values oct_inhouse codes
label def codes 1 "yes", modify
label def codes 2 "no", modify
label def codes 3 "don't know", modify
label def codes 4 "refused", modify
save tokenexample, replace
rename (*house) (house*)
reshape long house, i(id) j(which) string
replace which = subinstr(proper(which), "_", " ", .)
gen yes = house == 1
label def WHICH 1 "Sept Out" 2 "Sept In" 3 "Oct Out" 4 "Oct In"
encode which, gen(WHICH) label(WHICH)
statsby, by(WHICH) clear: ci proportion yes, jeffreys
set scheme s1color
twoway scatter mean WHICH ///
|| rspike ub lb WHICH, xla(1/4, noticks valuelabel) xsc(r(0.9 4.1)) ///
xtitle("") legend(off) subtitle(Proportion Yes with 95% confidence interval)
This has to be solved backwards.
The means and confidence intervals have to be plotted using twoway as graph bar is a dead-end here, because it does not allow whiskers too.
The confidence limits have to be put in variables before the graphics. Some graph commands, notably graph bar, will calculate means for you, but as said that is a dead end. So, we need to calculate the means too.
To do that you need an indicator variable for Yes.
The best way I know to get the results then is to reshape to a different structure and then apply ci proportion under statsby.
As a detail, the option jeffreys is explicit as a signal that there are different methods for the confidence interval calculation. You should choose one knowingly.
I have below dataset.
Math Literature Biology date student
4 2 5 2019-08-25 A
4 5 4 2019-08-08 A
5 4 5 2019-08-23 A
5 5 5 2019-08-15 A
5 5 5 2019-07-19 A
5 5 5 2019-07-15 A
5 5 5 2019-07-03 A
5 5 5 2019-06-26 A
1 1 2 2019-06-18 A
2 3 3 2019-06-14 A
5 5 5 2019-05-01 A
2 1 3 2019-04-26 A
I need to develop a solution in powerbi so in output I have cumulative average per subject per month
For example
April May June July August
Math | 2 3.5 3 3.75 4
Literature | 1 3 3 3.75 3.83
Biology | 3 4 3.6 4.125 4.33
Can you help?
You can use a matrix visualization for this.
Create a month-year variable and use it in the columns.
Use Average of Math,Literature and Biology in values
Under the format pane --> Values --> Show on rows --> Select this
This should give the view you are looking for. You can edit the value headers to your requirement.
I have a data set of daily temperatures for which I want to calculate 20 year means. The data look like this:
1974 1 1 5.3 4.6 7.3 3.4
1974 1 2 3.3 7.2 4.5 6.5
...
2005 12 364 4.2 5.2 3.3 4.6
2005 12 365 3.1 5.5 2.6 6.8
There is no header in the file but the first column contains the year, the second column the month, and the third column the day of the year. The rest of the columns are temperature data.
I want to calculate the average temperature for each day over a period of 20 years. I thought the best way to do that would be to group the data by day and calculate the mean of each day for a specific range of years. Here is my code:
import pandas as pd
hist_fn = 'tmean_daily_1974_2005.txt'
twenty_year_fn = '20_yr_mean_1974_1993.txt'
start = 1974
end = 1993
hist_mean = pd.read_csv(hist_fn, sep='\s+', header=None)
# Limit dataframe to only the 20 years for which I want the mean calculated
interval_mean = hist_mean[(hist_mean[0]>=start) & (hist_mean[0]<=end)]
# Rename the first column to reflect what mean this file is displaying
interval_mean.iloc[:, 0] = ("%s-%s" % (start, end))
# Generate mean for each day spread across all the years in the dataframe
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
# Write multiyear mean to txt
interval_mean.to_csv(twenty_year_fn, sep='\t', header=False, index=False)
The data set spans longer than 20 years and the method I used has worked for the first 20 year interval but gives me a (mostly) empty text file for any other set of years entered.
So when I use these inputs it works:
start = 1974
end = 1993
and it produces a file that looks like this:
1974-1993 1 1 4.33 5.25 6.84 3.67
1974-1993 1 2 7.23 6.22 5.65 6.23
...
1974-1993 12 364 5.12 4.34 5.21 2.16
1974-1993 12 365 4.81 5.95 3.56 6.78
but when I change the inputs to this:
start = 1975
end = 1994
it produces a .txt file with no temperatures:
1975-1994 1 1
1975-1994 1 2
...
1975-1994 12 364
1975-1994 12 365
I don't understand why this method works for the first 20 year interval but none of the subsequent intervals. Is it something to do with the way the data is organized or how it is being sliced?
Now when that's out of the way, we can talk about the problem you presented:
The strange behavior is due to the fact that pandas matches indices on assignment, and slicing preserves the original indices. That means that when setting
interval_mean.iloc[:, 3:] = interval_mean.groupby(2, as_index=False).mean().iloc[:, 2:]
Note that interval_mean.groupby(2, as_index=False).mean() has indices 0, ... , 30 (since as_index=False makes the groupby operation create new indices. Otherwise, it would have been the day number).On the other had, interval_mean has the original indices from hist_mean, meaning the first time (first 20 years) it has the indices 0, ..., ~20*365 and the second time is has indices starting from arround 20*365 and counting up.
This is a bit confusing at first, but pandas offer great documentation about it, and people quickly discover why it is so useful.
I'll to explain what happens with an example:
Assume we have the following DataFrame:
df = pd.DataFrame(np.reshape(np.random.randint(5, size=30), [-1,3]))
df
0 1 2
0 1 1 2
1 2 1 1
2 0 1 2
3 0 2 0
4 2 1 0
5 0 1 2
6 2 2 1
7 1 0 2
8 0 1 0
9 1 2 0
Note that the column names are 0,1,2 and the row names (the index) are 0, ..., 9.
When we preform groupby we obtain
df.groupby(0, as_index=False).mean()
0 1 2
0 0 1.250000 1.000000
1 1 1.000000 1.333333
2 2 1.333333 0.666667
(The index equals to the columns grouped by just because draw numbers between 0 to 2). Now, when will do assignments to df.loc, it will replace every cell by the corresponding cell in the assignee, if such cell exists. Otherwise, it will leave NA.
df.loc[:,:] = df.groupby(0, as_index=False).mean()
df
0 1 2
0 0.0 1.250000 1.000000
1 1.0 1.000000 1.333333
2 2.0 1.333333 0.666667
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
And when you write NA to csv, it leaves the cell blank.
The last piece of the puzzle is how interval_mean preserved the original indices, but this is because slicing preserves the original indices:
df[df[1] > 1]
0 1 2
3 0 2 0
6 2 2 1
9 1 2 0
I just started using Python.
I want to draw firstly an histogram and then a plot from a text file
this file presents two columns and about 12 millions lines.
this is an example of the file data
0.69359481 1
0.64862049 1
-1.00000000 5
-0.07840061 1
-0.13079703 1
0.54300547 1
-0.55758357 1
1.00000000 9
0.48491240 1
-0.32764208 1
-0.98689210 1
So as you see, the x value is a float ranged from -1.00000000 to 1.00000000
and the y axis is an integer (minimum is 1 maximum can be more than 100000)
I need your help to do so
Thanks
I have a dataframe that originated from an excel file. It has the usual headers above the columns but some of the columns have % signs in them which I want to remove.
Searching stackoverflow gives some nice code for removing percentages from matrices, Any way to edit values in a matrix in R?, which did not work when I tried to apply it to my dataframe
as.numeric(gsub("%", "", my.dataframe))
instead it just returns a string of "NA"s with a warning message explaining that they were introduced by coercion. When I applied,
gsub("%", "", my.dataframe))
I got the values in "c(...)" form, where the ... represent numbers followed by commas which was reproduced for every column that I had. No % was in evidence; if I could just put this back together ... I'd be cooking.
Any help greatfully received, thanks.
Based on #Arun's comment and imaging how your data.frame looks like:
> DF <- data.frame(X = paste0(1:5,'%'),
Y = paste0(2*(1:5),'%'),
Z = 3*(1:5), stringsAsFactors=FALSE )
> DF # this is how I imagine your data.frame looks like
X Y Z
1 1% 2% 3
2 2% 4% 6
3 3% 6% 9
4 4% 8% 12
5 5% 10% 15
> # Using #Arun's suggestion
> (DF2 <- data.frame(sapply(DF, function(x) as.numeric(gsub("%", "", x)))))
X Y Z
1 1 2 3
2 2 4 6
3 3 6 9
4 4 8 12
5 5 10 15
I added as.numeric in sapply call for the resulting cols to be numeric, if I don't use as.numeric the result will be factor. Check it out using sapply(DF2, class)