I have two columns in my dataframe. One includes vaccination rates for the flu from 2000 to 2004 and the other contains vaccination rates for the flu from 2005 to 2006. I am trying to merge them into a single column to perform a t-test. So far, I tried
gen amin_vaccination = amin_number + number_amin
which yielded a blank column. Any recommendations on how I can do this? For reference, the dataframe essentially looks like this:
Year  flu_vaccine  flu_vaccination_rate
2001  .12          .
2002  .14          .
2003  .15          .
2004  .13          .
2005  .            .145
2006  .            .125
I am trying to get it to look like this:
Year vaccination_rate
2001 .12
2002 .14
2003 .15
2004 .13
2005 .145
2006 .125
Stata treats the empty cells as missing values, and any arithmetic that involves a missing value returns missing. In every observation at least one of the two variables is missing, so gen amin_vaccination = amin_number + number_amin returns missing everywhere.
Instead do:
egen amin_vaccination = rowtotal(amin_number number_amin)
rowtotal() treats missing values as 0, so each row's total is simply the one non-missing value. See help egen for many more functions other than rowtotal() that will help you avoid this in the future.
Note: Since you seem to come from R or Python, what is called a column in those languages is called a variable in Stata. I do not want to come across as picky about terminology, but knowing this will help you read the Stata help files.
gen wanted = min(flu_vaccine, flu_vaccination_rate)
gen wanted = max(flu_vaccine, flu_vaccination_rate)
replace flu_vaccine = flu_vaccination_rate if missing(flu_vaccine)
are some other solutions. It may seem puzzling or even paradoxical that min() and max() are both solutions, but what does the work in either case is that missing values are ignored to the extent possible, so whenever one argument is not missing and the other is missing, the non-missing value is always returned as result.
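As a rough check, the approaches above can be compared side by side. This is only a sketch: it assumes the 2001-2004 rates sit in flu_vaccine and the 2005-2006 rates in flu_vaccination_rate, and the names rate_rowtotal, rate_max and rate_fill are made up for illustration.

```stata
clear
input int year float flu_vaccine float flu_vaccination_rate
2001 .12  .
2002 .14  .
2003 .15  .
2004 .13  .
2005 .    .145
2006 .    .125
end

* rowtotal() counts missing as 0; the missing option returns missing
* (rather than 0) if both variables are missing in an observation
egen rate_rowtotal = rowtotal(flu_vaccine flu_vaccination_rate), missing

* max() ignores missing arguments when at least one is non-missing
gen rate_max = max(flu_vaccine, flu_vaccination_rate)

* fill a copy of the first variable from the second where it is missing
gen rate_fill = flu_vaccine
replace rate_fill = flu_vaccination_rate if missing(rate_fill)

* the last two approaches should agree exactly
assert rate_max == rate_fill
list year rate_rowtotal rate_max rate_fill
```

Working on a copy (rate_fill) rather than replacing flu_vaccine directly keeps the original variables intact in case the merge needs to be checked later.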
I am constructing a panel dataset based on the survey data for the years 2010-2013 (four consecutive years). As is usually the case with household survey data, there is an issue of attrition, i.e. some households drop out from the survey from year to year. I need to figure out whether these households are missing at random.
My idea is to come up with a dummy equal to 1 in 2011 if a household present in 2010 is missing in 2011 (and 0 otherwise), and so on for the years 2012, 2013. Then I want to run the logit/probit regression on this dummy with a set of covariates that I would like to control for in my study. The variable for household id is "hhid" and I have of course the time dimension variable "year".
Does anyone have a precise idea how this should be properly coded in Stata? I know it is not complicated, but I just cannot wrap my head around it and figure this out....
Here is an example of how to create dummies in panel data and then collapse them to the parent unit of observation, so that each dummy is 1 if the parent unit was observed with a 1 in any time period. The parent-level data is then merged back to the panel data.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte hhid int year
1 2010
1 2011
1 2012
1 2013
2 2010
2 2011
2 2013
3 2010
3 2011
end
*Create one dummy per year at the year-hh level
local year_dummies ""
forvalues year = 2010/2013 {
    gen dummy`year' = (year==`year')
    local year_dummies "`year_dummies' dummy`year'"
}
*Collapse the data set to hh level, where each dummy is 1 if any year-hh observation was 1
preserve
collapse (max) `year_dummies' , by(hhid)
tempfile year_dummy_hhlevel
save `year_dummy_hhlevel'
restore
*Rename the year-hh level dummies so that the merge does not overwrite them
rename dummy???? org_dummy????
*Merge the hh level dummies back to each year-hh observation
merge m:1 hhid using `year_dummy_hhlevel', nogen
Your question is whether the households you do not observe in year X differ from those you do observe in year X. There is no perfect way to answer this question as you, by definition, did not observe those households.
You did, however, observe all households in your study in year 0 (2010 in your case). As you imply yourself, you can use observations in year 0 as a proxy to answer whether those households are different in year X. I can show you how to code this, but StackOverflow is not the appropriate forum to discuss whether this is statistically valid given your data, how it was collected and what analysis you intend to use.
One way to code this is to use iebaltab in the package called ietoolkit available from SSC (disclosure, I wrote that command).
You can create a dummy indicating attrition and use iebaltab like this: iebaltab balancevars, grpvar(attrition) where balancevars is a list of household characteristics that you want to verify were similar in year 0. You can use the option ftest to include a test across all balance variables, the way you are suggesting.
Note that this command generates statistics, but it is up to you to decide whether they are valid, and the validity of balance tests is hotly debated. Those debates are not about coding, which is what StackOverflow is about.
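One possible way to build the attrition dummy itself, sketched on the small hhid/year example above. This is an illustration under assumptions, not part of iebaltab: the variable name dropped is made up here, and fillin is used only to make the unobserved household-years explicit.

```stata
clear
input byte hhid int year
1 2010
1 2011
1 2012
1 2013
2 2010
2 2011
2 2013
3 2010
3 2011
end

* fillin adds the hhid-year combinations that are absent from the data
* and marks the added observations with _fillin == 1
fillin hhid year

* dropped is 1 in the first year a household is missing after having been
* observed the year before, 0 otherwise, and missing in the first year
bysort hhid (year): gen byte dropped = _fillin == 1 & _fillin[_n-1] == 0 if _n > 1

list hhid year _fillin dropped, sepby(hhid)
```

The filled-in observations have no covariate values of their own, so any regression on dropped would have to use baseline (2010) or previous-year characteristics.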
My backend table has the data at a week level. It contains the current ISO year and current ISO week, as well as the previous year's ISO year and week number that the current year's data should be compared with.
For each signup_iso_year-signup_iso_week combination, there exists only one iso_prev_year-iso_prev_yearweek combination.
The iso_prev_year, iso_prev_yearweek columns account for the offset that might occur due to certain years having 53 weeks instead of 52.
data table
(I can't embed images, so I have added a table here as well, although it has much less information than the image in 'data table').
Number_of_signups  signup_iso_year  signup_iso_week  iso_prev_year  iso_prev_yearweek  Country  grade_level
5                  2020             18               2019           18                 IN       middle school
7                  2020             18               2019           18                 US       high school
6                  2021             17               2018           18                 IN       middle school
8                  2021             17               2018           18                 US       high school
I want to calculate the year-over-year (YoY) change in Number_of_signups using the signup_iso_year, signup_iso_week, iso_prev_year, iso_prev_yearweek columns.
I have already tried to create a calculated column that contains the sum of number_of_signups from previous year, but since every combination of country, grade_level, subject, email_type might not exist in previous or current year, some of the values are getting lost and hence giving incorrect results.
The answer I am looking for, is a Power BI measure that can give me the YOY change based on signup_iso_year and signup_iso_week.
Edit: I should have mentioned this before, but I forgot. The table contains data from 2018 to current day. So, the data size is quite large. Also, I need this YoY measure for a time series visual, which means that I can't assign ISOyear/ISOweek values for previous year using simple MAX/MIN functions. It needs to pick values from the iso_prev_year, iso_prev_yearweek columns but since EARLIER function can't be used in a measure, I am not able to figure out how to do that.
Which is why I had tried to create a calculated column, and use the EARLIER function to compute previous year's number_of_signups. But because of the other columns present in the data, i.e., country, grade_level, subject, email_type, there were discrepancies occurring in the actual number_of_signups and the calculated previous_year_number_of_signups. These discrepancies were due to the fact that not every combination of these columns exists for each week, so we might miss out on some data when calculating previous_year_number_of_signups.
Edit 2: Was asked to include examples of what the expected result would look like, so adding some pictures.
YoY at overall level, country level, grade level
YOY at country+grade level
If I understand your requirement correctly, you need a measure like the one below. It may not be exactly the one you need, but it should help you reach your required output.
prev_signup =
VAR iso_prev_year = MIN(your_table_name[iso_prev_year])
VAR iso_prev_year_week = MIN(your_table_name[iso_prev_yearweek])
RETURN
    CALCULATE(
        SUM(your_table_name[Number_of_signups]),
        FILTER(
            ALL(your_table_name),
            your_table_name[signup_iso_year] = iso_prev_year
                && your_table_name[signup_iso_week] = iso_prev_year_week
        )
    ) + 0
You can also do some transformation in Power Query Editor and join the table to itself on your key columns. In that case, you can bring the previous year's value into the same row. The rest is just comparing the two columns to calculate YoY.
I have a longitudinal dataset which contains variables on individuals from 2 waves, Feb and June, which measure economic activity across these individuals. The variables from the Feb and May waves are categorical and I am running the proportion command in Stata to get the individual change in economic activity. For example, I am looking for changes in hours worked across the 2 waves. I run proportion but am not able to figure out the if condition, as I only want individuals who responded in both Feb and June. I want to drop all those who responded in Feb but not in May, or vice versa.
Let's suppose you have an identifier variable id and a time-like variable, wave that takes values 1 and 2. If so, you are looking for individuals that satisfy
bysort id (wave) : gen wanted = wave[1] == 1 & wave[2] == 2
So wanted is an indicator that is 1 for individuals present in both waves and 0 otherwise, and if wanted is an if condition that selects exactly those individuals.
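A small sketch of how the indicator might be used with proportion. The toy data and the variable name hours_cat are invented for illustration; id and wave are as assumed above.

```stata
clear
input byte id byte wave byte hours_cat
1 1 2
1 2 3
2 1 1
3 1 2
3 2 2
end

* 1 for individuals observed in both waves, 0 otherwise
* (for id 2, wave[2] is missing, so the comparison is false)
bysort id (wave) : gen byte wanted = wave[1] == 1 & wave[2] == 2

* proportions of each category by wave, restricted to people in both waves
proportion hours_cat if wanted, over(wave)
```

Here id 2 responded only in the first wave, so that observation is excluded from the proportions.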
There are many variations on this, depending on: your variable names; your data layout; how the information on waves is held (could be also, say, a string variable containing values like "Feb", "May" or "June", or a numeric variable holding dates).
You gave a broad-brush description sketching the problem but almost no precise information on the data. The stata tag wiki gives much detailed advice on how to post a question and flags the importance of giving a concrete data example.
I am having trouble generating a new variable that is computed for every month when there are multiple entries for each month.
date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
Code would look like this gen newvar = sum(x*b) but I want to create the variable for each month.
What I tried so far was to create an index for the date1 variable with
sort date1
gen n=_n
and after that to create a binary marker for when the date changes with
gen byte new=date1!=date[[_n-1]
After that I received a value for every other month, but I am not sure whether this is correct, which is why I would like someone to have a look at it and confirm. As there are a lot of values, it is hard to check manually whether the numbers are correct. I hope it is clear what I want to do.
Two comments on your code
There's a typo: date[[_n-1] should be date1[_n-1]
In your posted code there's no need for gen n = _n.
Maybe something along the lines of:
clear
set more off
*-----example data -----
input ///
str10 date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
end
gen date2 = monthly(date1, "YM")
format %tm date2
*----- what you want -----
gen month = month(dofm(date2))
bysort month: gen newvar = sum(x*b)
list, sepby(month)
will help.
But, notice that the series of the cumulative sum can be different for each run due to the way in which Stata sorts and because month does not uniquely identify observations. That is, the last observation will always be the same, but the way in which you arrive at the sum, observation-by-observation, won't be. If you want the total, then use egen, total() instead of sum().
If you want to group by month/year, then you want: bysort date2: ...
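If a single total per month/year is what is wanted, a minimal sketch along those lines, reusing the example data above (newvar_total is a name chosen here for illustration):

```stata
clear
input str10 date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
end

gen date2 = monthly(date1, "YM")
format %tm date2

* one total of x*b per month/year, repeated on every observation of that month
bysort date2: egen newvar_total = total(x*b)

list, sepby(date2)
```

Unlike the running sum, the total does not depend on the order of observations within a month.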
The key here is the by: prefix. See, for example, Speaking Stata: How to move step by: step by Nick Cox, and of course, help by.
A major error is touched on in this thread which deserves its own answer.
As used with generate the function sum() returns cumulative or running sums.
As used with egen the function name sum() is an out-of-date but still legal and functioning name for the egen function total().
The word "function" is over-loaded here even within Stata. egen functions are those documented under egen and cannot be used in any other command or context. In contrast, Stata functions can be used in many places, although the most common uses are within calls to generate or display (and examples can be found even of uses within egen calls).
This use of the same name for different things is undoubtedly the source of confusion. In Stata 9, the egen function name sum() went undocumented in favour of total(), but difficulties are still possible through people guessing wrong or not studying the documentation really carefully.
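The difference is easy to see on a toy example (the variable names running and overall are made up here):

```stata
clear
input x
1
2
3
end

* generate's sum() is a running (cumulative) sum: 1, 3, 6
gen running = sum(x)

* egen's total() is the overall total: 6 in every observation
egen overall = total(x)

list

* the two agree only in the last observation
assert running[_N] == overall[_N]
```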
I am pretty new to Stata and I am having difficulty doing something which I would guess is not that unusual of a thing to try to do. I am working with a panel data set (countries and times). Each observation consists of a country, a year, and a variable, call it x. The data is sorted by country year (i.e. all observations corresponding to a given country are consecutive and sorted by year).
Each country has 54 years of data corresponding to 1960 to 2013 inclusive. I would like to run a t-test something like in the following way:
by country: ttest x = x[54] if year != 2013
But I get an error ("weights not allowed") which I don't know how to interpret. I could do it by hardcoding it in and using the usual syntax
by country: ttest x = # if year != 2013
but I want to avoid hard-coding since there are >100 countries and I want to be able to flexibly add / remove countries (and this is just poor form in general).
My first thought was to define a macro using something like
levelsof country, local(levels)
foreach c of local levels {
local y x if year == 2013
ttest x = y if year != 2013
// some code to store the value that I haven't figured out yet
}
but you can't use "if" when declaring a local macro. I am pretty lost and would appreciate any help you all can give. Thank you!
Student's t tests here make little sense without adjustment for time and space dependence structure, unless you have grounds for treating your data as equivalent to independent draws from the same distribution. You can do the tests, but standard errors and P-values are dubious if not bogus. That is, your individual tests on time series face one problem; and collectively your tests face another problem. For a good account, see either edition of Box, Hunter, Hunter, Statistics for experimenters. John Wiley.
That large point aside, Stata is choking on the [] which are being misread as an attempt to specify weights. My guess is that
by country: ttest x = `=x[54]' if year != 2013
would be acceptable syntax to Stata, although still dubious statistics. The detail here is the macro-like syntax
`= '
which has the effect that the expression given will be evaluated by Stata before the line is passed to ttest. So the result, a numeric value, will be what the ttest command sees.
This is naturally similar in spirit to what you were imagining, although your code is some way from being legal and correct.
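One caveat worth checking: as far as I can tell, `= ' is expanded once, before by: runs the command for each group, so x[54] there would refer to observation 54 of the whole dataset rather than to the 2013 value of each country. A loop sidesteps that. The sketch below runs on simulated data and assumes country is a numeric identifier; the local macro name x2013 is made up, and the statistical reservations above apply just the same.

```stata
clear
set seed 12345

* toy panel: 2 countries, years 1960-2013, x drawn at random
set obs 108
gen country = ceil(_n / 54)
bysort country: gen year = 1959 + _n
gen x = rnormal()

levelsof country, local(levels)
foreach c of local levels {
    * pick up this country's 2013 value of x
    summarize x if country == `c' & year == 2013, meanonly
    local x2013 = r(mean)

    * one-sample t test of the other years against that value
    ttest x == `x2013' if country == `c' & year != 2013
    display "country `c': t = " r(t) ", p = " r(p)
}
```

If country is a string variable, the comparisons would need quotes around the macro reference, for example country == "`c'".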
UPDATE This calculation may also be helpful:
egen mean = mean(x / (year != 2013)), by(country)
egen sd = sd(x / (year != 2013)), by(country)
gen z = (x - mean) / sd if year == 2013
list country x z if year == 2013