I have a dataset for a survey that has several multiple-choice questions (check all that apply, check 3, etc.). Each option is coded as a binary variable:
Location Popn1 Popn2 Popn3 Popn4 .... Popn20
Location1 0 1 1 1
Location2 1 1 0 0
Location3 0 0 0 0
Here is my code:
proc tabulate data=cath.binarydata;
class location sectorcollapsed;
var popn1-popn20;
table (location='Location'),
(popn1-popn20)*(Sum='Count'*f=best5. mean='Percent'*f=percent8.1 N='Total responses received per question')
/box="Populations Served by Location";
run;
I'm using PROC TABULATE to get a sum (count) and mean (percent) of each option of the multiple-choice question by location. However, when I check against my original dataset, the numbers don't make sense.
Here is a sample output:
This is the kind of output I want and have right now
            Popn1               Popn2               ...  Popn20
            Count  Percent  N   Count  Percent  N
Location1   13     50%      26  11     42%      26
Location2
However, when I check back and calculate manually, what I think it's doing doesn't make sense. For example, the N of 26 makes sense for Location1, because there are 26 people in Location1 and they all answered the question, so the sum being out of 26 makes sense.
For some of the others, though, the N doesn't add up. I thought the N would be all of the people who answered the question, but that is not what I get. As an example, one of the locations had 149 people in total and 19 did not provide an answer at all, so the N there should be 130, but the output gives 134.
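(For context, the kind of cross-check I am doing can be written as a quick PROC MEANS - a sketch using the dataset and variable names above - which reports N, Sum, and Mean for each option within each location:)
/* cross-check of the tabulate output: N, Sum and Mean of each option
   by location; N counts the non-missing values of each popn variable */
proc means data=cath.binarydata n sum mean;
class location;
var popn1-popn20;
run;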
Does anyone have any thoughts, or can someone help me understand how to use SAS to tabulate the multiple variables together in one column, while giving me the total answers for each option and the percentage out of the number of people who answered the question?
Any help is much appreciated,
Related
I have a typical union-find problem where I have to group records, but it involves multiple files with hundreds of billions of records.
Can I somehow use a ClickHouse database to solve it?
Edit - minimal reproducible example:
I have three columns (item_id, from, to), where each row is a graph edge between nodes from and to.
I want to produce a result (id, group_id, item_id) that assigns each item to a group, one group per disjoint set.
[Data]
item_id from to
0 101 102
1 102 103
2 104 105
[Result]
id group_id item_id
0 0 0
1 0 1
2 1 2
There are only two groups #0 (101->102->103) and #1 (104->105).
The problem with an in-memory implementation is that there are too many records, and I want ClickHouse (or some other solution) to take care of the disk access and filesystem caching for me.
Without knowing more about your specific data and queries, it is tricky to provide a definitive answer. In general, this is a moderate data size for ClickHouse. UNION is fully supported. Your best bet is simply to try it: loading or generating data is straightforward, and SQL queries can usually be translated from PostgreSQL/MySQL easily.
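For example, getting the sample data from the question into a table takes only a couple of statements (a sketch; the table name edges is mine, and from/to are backquoted because they clash with SQL keywords):
-- minimal sketch: load the example edges from the question
CREATE TABLE edges
(
    item_id UInt64,
    `from`  UInt64,
    `to`    UInt64
)
ENGINE = MergeTree
ORDER BY item_id;

INSERT INTO edges VALUES (0, 101, 102), (1, 102, 103), (2, 104, 105);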
In Stata I am analyzing a study looking at pre-existing conditions that participants may have had that affect whether they experience side effects after vaccination.
For each participant, there are three binary variables that denote whether the participant had that condition (0: does not have, 1: does have), namely hypertension: 0/1, asthma: 0/1, diabetes: 0/1.
However, these categories are not mutually exclusive as the participant can have any combination of conditions: (no pre-existing conditions, only hypertension, only asthma, only diabetes, hypertension and asthma, hypertension and diabetes, asthma and diabetes, hypertension and asthma and diabetes).
I would like to perform a regression analysis to determine the risk of developing side effects given exposure to pre-existing conditions, and to do that I would like to create a variable denoting the different combinations.
I would like to get the risk ratios for the following table:
Type of pre-existing condition   With side effects   No side effects   Risk ratio
None                             455                 316               ref
Hypertension                     51                  28
Asthma                           42                  26
Diabetes                         17                  7
Does anyone have code that would help in creating a new categorical variable for this regression analysis?
I've tried the following code, but because the categories are not mutually exclusive, the assigned values overwrite each other. Here new_var is the new variable denoting the pre-existing conditions.
generate new_var = 0
replace new_var = 1 if hypertension == 1
replace new_var = 2 if asthma == 1
replace new_var = 3 if diabetes == 1
This is as much statistical as Stata-oriented, but there is a Stata dimension, so here goes.
#Stuart has indicated some ways of getting composite variables in Stata, but as no doubt he would emphasise too, watch out that the numeric coding is arbitrary and not to be taken literally.
Other methods of creating composite variables were discussed in this paper and that advice remains valid.
That said, I suspect most researchers would not use a composite variable here at all, but would use as predictors the three indicators you already have and their interactions. That is the only serious and supported method to get estimates of effect size together with appropriate tests.
There are 8 possible combinations of pre-existing conditions, and one approach is to combine the indicators into a single code, treating them as binary digits, and then label the values manually:
generate new_var = hypertension * 4 + asthma * 2 + diabetes
label define preexisting 0 none 1 diabetes 2 asthma 3 "asthma and diabetes" 4 hypertension 5 "hypertension and diabetes" 6 "hypertension and asthma" 7 "hypertension, asthma and diabetes"
label values new_var preexisting
If you have additional preexisting condition variables, multiply them by 8, 16, 32 and so on to get unique values for every combination.
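For instance, with one more 0/1 indicator (here a hypothetical variable named obesity, just for illustration), you would give it the next power of two:
* obesity is a hypothetical fourth indicator; the existing codes 0-7 are unchanged
generate new_var = hypertension*4 + asthma*2 + diabetes + obesity*8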
Another approach is to use interactions in the regression.
regress outcome hypertension##asthma##diabetes
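If the risk ratios in your table are the end goal, one common route (a sketch only; it assumes the 0/1 outcome is named side_effects, which is not a name from your post) is a modified Poisson model with robust standard errors, which reports risk ratios with None as the reference category:
* side_effects is an assumed 0/1 outcome name; ib0. makes code 0 (none) the base level
glm side_effects ib0.new_var, family(poisson) link(log) vce(robust) eform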
I have the following data structure:
186 unique firm acquisitions
Observations for 5 years per firm; 2 years before acquisition year, acquisition year, and 2 years after
Total number of observations is thus 186 * 5 = 930
Two dependent variables, which I would like to use in separate analyses - one is binary (1/0), the other is a ratio of two variables, which ranges from 0 to 5.
Acquisition years range from 2008 to 2019
Acquisitions took place in 20 different industries
Goal: test whether there are significant differences in target characteristics (the two DVs mentioned above) after acquisition vs before acquisition.
I expect the following unobserved factors to exist that can bias results:
Deal-specific: some deals involve characteristics that others do not
Target-specific: some targets might be more difficult to change, for example. Also, some targets get acquired twice in the period I am examining, so without controlling for that fact, the results will be biased.
Acquirer-specific: some acquirers are more likely to implement change than others. Also, some acquirers engage in multiple acquisitions during the period I am examining (max is 9)
Industry-specific: there might have been some unobserved industry-trends going on, which caused targets in certain industries to be more likely to change than targets in other industries.
Year-specific: since the acquisitions took place in different years between 2008 and 2019, observations might be biased by unobserved year-specific factors. For example, 2020 and 2021 observations will likely be affected by the COVID-19 pandemic.
I have constructed a dummy variable, post, which is coded 1 for year 1 and year 2 after acquisition, and 0 for year 1 and year 2 before acquisition.
I have been struggling with using the right models and commands in Stata. The code I have been using:
BINARY DV
First, I ran an OLS regression so that I could remove outliers after the regression:
reg Y1 i.post##c.X1 $controls i.industry i.year
Then, I removed outliers (not sure if this is the right method though):
predict estu if e(sample), rstudent
drop if abs(estu)>3.5
Then, ran the xtprobit regression below:
egen id = group(target_id acquiror_id)
xtset deal_id year
xtprobit Y1 i.post##c.X1 $controls i.industry i.year, vce(cluster id)
OTHER DV
Same as above, but replacing xtprobit with xtreg and Y1 with Y2
Although I get results which theoretically make sense, I feel like I am messing things up.
Any thoughts on how to improve my code?
You could try reghdfe for the different fixed effects you're running, though I don't really understand the question. http://scorreia.com/software/reghdfe/
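For what it's worth, a sketch of what that could look like for your second DV, reusing the variable names from your post (for the binary DV this would amount to a linear probability model rather than a probit):
* reghdfe is user-written: ssc install reghdfe
* absorbs deal, industry and year fixed effects; clusters on the
* target-acquirer pair id constructed in the question
reghdfe Y2 i.post##c.X1 $controls, absorb(deal_id industry year) vce(cluster id)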
I have a dataset where I calculated the total time it took to process a request, in hours. A request should be completed within 72 hours, or, if an extension is requested, within 408 hours (14 days plus 72 hours).
I need to flag values with a Y or N depending if they meet these criteria.
My problem is that it only recognizes negative HH:MM values as below the threshold, not a value like 29:15, which represents 29 hours 15 minutes. That is less than 72 hours and should be marked "Y" (timely), but it is being marked "N".
This is what I tried so far:
data work.tbl_6;
set work.tbl_6;
if was_a_timeframe_extension_taken_ = "N" and time_to_notification <= 72 then notification_timely="Y";
else if was_a_timeframe_extension_taken_ = "Y" and time_to_notification <= 408 then notification_timely="Y";
else notification_timely="N";
run;
Can someone advise what could be going wrong here?
This assumes the value is stored as a SAS time. If it is not, you will need INPUT() to do some conversion first.
You need to write the threshold as a time constant: a quoted value followed by t.
if was_a_timeframe_extension_taken_ = "N" and time_to_notification <= '72:00't then notification_timely="Y";
SAS stores times as a number of seconds, so you could also convert 72 hours to seconds and use that value.
if was_a_timeframe_extension_taken_ = "N" and time_to_notification <= (72*60*60) then notification_timely="Y";
EDIT:
In general, this style of programming (reading and overwriting the same data set) is dangerous and makes your code harder to debug. If you can, avoid it: give each data set a unique name, or keep adding to the previous data step instead.
data work.tbl_6;
set work.tbl_6;
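Putting the pieces together, a version with a new data set name might look like this (a sketch; it still assumes time_to_notification is stored as a SAS time value):
data work.tbl_6_flagged;   /* new name instead of overwriting work.tbl_6 */
set work.tbl_6;
length notification_timely $1;
if was_a_timeframe_extension_taken_ = "N" and time_to_notification <= '72:00't then notification_timely = "Y";
else if was_a_timeframe_extension_taken_ = "Y" and time_to_notification <= (408*60*60) then notification_timely = "Y";
else notification_timely = "N";
run;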
(Stata/MP 13.1)
I am working with a set of massive data sets that take an extremely long time to load. I am currently looping through all the data sets to load each of them.
Is it possible to tell Stata to load just the first 5 observations of each dataset (or, in general, the first n observations in each use command) without actually having to load the entire data set? Loading the entire data set and then keeping only the first 5 observations takes an extremely long time.
Here are two work-arounds I have already tried
use in 1/5 using mydata : I think this is more efficient than just loading the data and then keeping the observations you want in a different line, but I think it still reads in the entire data set.
First load all the data sets, then save copies of all the data sets to just be the first 5 observations, and then just use the copies: This is cumbersome as I have a lot of different files; I would very much prefer just a direct way to read in the first 5 observations without having to resort to this method and without having to read the entire data set.
I'd say using in is the natural way to do this in Stata, but testing shows you are correct: it makes no "big" difference, given the size of the data set. An example (with 148,000,000 observations):
sysuse auto, clear
expand 2000000
tempfile bigfile
save "`bigfile'", replace
clear
timer on 1
use "`bigfile'"
timer off 1
clear
timer on 2
use "`bigfile'" in 1/5
timer off 2
timer list
timer clear
Resulting in
. timer list
1: 6.44 / 1 = 6.4400
2: 4.85 / 1 = 4.8480
I find that surprising, since in seems really efficient in other contexts.
I would contact Stata tech support (and/or search around, including www.statalist.com), if only to ask why in isn't much faster
(independently of you finding some other strategy to handle this problem).
It's worth using, of course, but it's not fast enough for many applications.
In terms of workflow, your second option might be the best. Leave the computer running while the smaller datasets are created (use a loop over the files), and get back to your regular coding/debugging once that's finished. This really depends on what you're doing, so it may or may not work for you.
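Something like this one-off pass would do it (a sketch only; the file names are placeholders for your actual .dta files):
* save a 5-observation copy of each large file once, then work with the small_* copies
local files "file1 file2 file3"
foreach f of local files {
    use in 1/5 using "`f'.dta", clear
    save "small_`f'.dta", replace
}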
Actually, I found the solution. If you run
use mybigdata if runiform() <= 0.0001
Stata will take a random sample of 0.0001 of the data set without reading the entire data set.
Thanks!
Vincent
Edit: 4/28/2015 (1:58 PM EST)
My apologies. It turns out the above was actually not a solution to the original question. It seems that on my system there is high variability in the speed of running
use mybigdata if runiform() <= 0.0001
from one run to the next. When I posted it as a solution, I think the run I happened to time was just a fast one. However, now that I am repeatedly running
use mybigdata if runiform() <= 0.0001
vs.
use in 1/5 using mydata
I am actually finding that
use in 1/5 using mydata
is on average faster.
In general, my question is simply how to read in a portion of a Stata data set, for computational purposes, without having to read in the entire data set, especially when the data set is really large.
Edit: 4/28/2015 (2:50 PM EST)
In total, I have 20 datasets, each with between 5 and 15 million observations. I only need to keep 8 of the variables (there are 58-65 variables in each data set). Below is the output from the first four "describe, short" statements.
2004 action1
Contains data from 2004action1.dta
obs: 15,039,576
vars: 64 30 Oct 2014 17:09
size: 2,827,440,288
Sorted by:
2004 action2578
Contains data from 2004action2578.dta
obs: 13,449,087
vars: 59 30 Oct 2014 17:16
size: 2,098,057,572
Sorted by:
2005 action1
Contains data from 2005action1.dta
obs: 15,638,296
vars: 65 30 Oct 2014 16:47
size: 3,143,297,496
Sorted by:
2005 action2578
Contains data from 2005action2578.dta
obs: 14,951,428
vars: 59 30 Oct 2014 17:03
size: 2,362,325,624
Sorted by:
Thanks!
Vincent