How to derive AVISIT in SAS using different conditions mentioned in specs

Hello, I am new to SAS and have to derive AVISIT. The specs are as follows:
1. If there are 2 or more assessments falling in the same visit window, the non-missing assessment closest to the midpoint will be used.
2. If there are 2 assessments equidistant from the midpoint, then create a new observation using the average of the AVAL values of the 2 observations.
The midpoints are like this:
If AVISIT = 'Month 1' then the midpoint is 28;
if AVISIT = 'Month 3' then the midpoint is 84, and so on.
So I am looking for help with deriving the 2 conditions mentioned above using the midpoints.
Thank you in advance.
I do not have any idea how to program it.
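A minimal sketch of one way to program this, assuming an input dataset named ADLB with variables USUBJID, AVISIT, ADY (analysis relative day), and AVAL; all of these names are assumptions, since the question does not show the input data:

/* assign each visit window's midpoint and measure the distance to it */
data windowed;
   set adlb;
   if avisit = 'Month 1' then midpoint = 28;
   else if avisit = 'Month 3' then midpoint = 84;
   /* ...add the remaining visits from the specs... */
   if missing(aval) then delete;  /* rule 1: only non-missing assessments compete */
   dist = abs(ady - midpoint);
run;

/* sort so the assessment closest to the midpoint comes first in each window */
proc sort data=windowed;
   by usubjid avisit dist;
run;

/* keep the closest record per window; if the two closest are equidistant, average AVAL */
data derived;
   set windowed;
   by usubjid avisit;
   retain best_dist best_aval tie;
   if first.avisit then do;
      best_dist = dist;
      best_aval = aval;
      tie = 0;
   end;
   else if dist = best_dist and not tie then do;
      best_aval = mean(best_aval, aval);  /* rule 2: average the two AVALs */
      tie = 1;
   end;
   if last.avisit then do;
      aval = best_aval;
      output;
   end;
   drop dist best_dist best_aval tie;
run;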

Related

Trying to analyze panel data but feel like I am mixing up commands - could anybody review and check?

I have the following data structure:
186 unique firm acquisitions
Observations for 5 years per firm; 2 years before acquisition year, acquisition year, and 2 years after
Total number of observations is thus 186 * 5 = 930
Two dependent variables, which I would like to use in different analyses - one is binary (1/0), the other is one variable divided by another, which ranges from 0 to 5.
Acquisition years range from 2008 to 2019
Acquisitions took place in 20 different industries
Goal: test whether there are significant differences in target characteristics (the two DVs mentioned above) after acquisition vs before acquisition.
I expect the following unobserved factors to exist that can bias results:
Deal-specific: some deals involve characteristics that others do not
Target-specific: some targets might be more difficult to change, for example. Also, some targets get acquired twice in the period I am examining, so without controlling for that fact, the results will be biased.
Acquirer-specific: some acquirers are more likely to implement change than others. Also, some acquirers engage in multiple acquisitions during the period I am examining (max is 9).
Industry-specific: there might have been some unobserved industry-trends going on, which caused targets in certain industries to be more likely to change than targets in other industries.
Year-specific: since the acquisitions took place in different years between 2008 and 2019, observations might be biased by unobserved year-specific factors. For example, 2020 and 2021 observations will likely be affected by the COVID-19 pandemic.
I have constructed a dummy variable, post, which is coded 1 for year 1 and year 2 after acquisition, and 0 for year 1 and year 2 before acquisition.
I have been struggling with using the right models and commands in Stata. The code I have been using:
BINARY DV
First, I ran an OLS regression so that I could remove outliers after the regression:
reg Y1 post X1 c.post#c.X1 $controls i.industry i.year
Then, I removed outliers (not sure if this is the right method though):
predict estu if e(sample), rstudent
drop if abs(estu)>3.5
Then, ran the xtprobit regression below:
egen id = group(target_id acquiror_id)
xtset deal_id year
xtprobit Y1 post X1 c.post#c.X1 $controls i.industry i.year, vce(cluster id)
OTHER DV
Same as above, but replacing xtprobit with xtreg and Y1 with Y2
Although I get results which theoretically make sense, I feel like I am messing things up.
Any thoughts on how to improve my code?
You could try checking reghdfe for the different fixed effects you're running. I don't really understand the question, though. http://scorreia.com/software/reghdfe/

In SAS: Analysis of multiple choice variables

I have a dataset for a survey that has several multiple choice questions (check all, check 3, etc.). Each option is coded as a binary variable:
Location    Popn1  Popn2  Popn3  Popn4  ...  Popn20
Location1     0      1      1      1
Location2     1      1      0      0
Location3     0      0      0      0
Here is my code:
proc tabulate data=cath.binarydata;
   class location sectorcollapsed;
   var popn1-popn20;
   table (location='Location'),
         (popn1-popn20)*(sum='Count'*f=best5. mean='Percent'*f=percent8.1
         n='Total responses received per question')
         / box="Populations Served by Location";
run;
I'm using PROC TABULATE to get a sum (count) and mean (percent) of each option in the multiple choice question by location. However, when I do a check against my original dataset, the numbers don't make sense.
Here is a sample output:
This is the kind of output I want and have right now:

            Popn1               Popn2             ...  Popn20
            Count  Percent  N   Count  Percent  N
Location1    13     50%     26   11     42%     26
Location2   ...
However, when I check back and calculate manually, what I think it's doing doesn't make sense. For example, the N of 26 makes sense for Location1, because there are 26 people in Location1 and they all answered the question, so the sum being out of 26 makes sense.
However, for some of them, the N doesn't make sense. I thought the N would be all of the people who answered the question, but it doesn't quite add up that way. As an example, in one of the locations, there were 149 total people, and 19 did not provide an answer at all - so the N there should be 130, but the output gives me a value of 134.
Does anyone have any thoughts, or can anyone help me understand how to use SAS to tabulate the multiple variables together in one column, while giving me the total answers for each option and the percentage (out of the number of people who answered the question)?
Any help is much appreciated.
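For what it's worth, PROC TABULATE's N statistic is computed per analysis variable: it counts the observations where that particular popn variable is non-missing, not the people who answered the question as a whole, so respondents with partially missing answers can give each column a different N. A quick sketch for checking this, reusing the dataset and variable names from the question (the ANSWERED flag is an illustrative name, not from the original):

/* count nonmissing and missing values per popn variable by location; */
/* these per-variable Ns are what PROC TABULATE reports */
proc means data=cath.binarydata n nmiss;
   class location;
   var popn1-popn20;
run;

/* one way to get a single per-respondent denominator: flag anyone who */
/* answered at least one option; n() counts its non-missing arguments */
data check;
   set cath.binarydata;
   answered = (n(of popn1-popn20) > 0);
run;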

Interpreting propensity score results in the stratification method in SAS

I am using the propensity score stratification method. I got some output but can't interpret it, and I am looking for a source on how to interpret these results.
I divided the PS scores into 5 groups and got this output at the end after running some code:
obs = 1
type = 0
freq = 10
sum_wt = 1010988.4
sum_diff = 0.0015572
mean_diff = 0.0015572
SE_diff = 0.0000994551
I know that the frequency column stands for 2*5 (the number of groups), that mean_diff is equal to sum_diff, and that SE_diff is the square root of 1 - sum of weights.
Does this say that ranking the PS scores into 5 groups is an appropriate approach? Which of the above criteria should I use for the final decision?
I believe your output is just stating the distribution within the groups. You evaluate whether or not propensity score matching, in your case stratified matching, works by looking at the absolute standardized differences of the variables pre- vs post-matching.
Here is a peer-reviewed paper my colleagues and I published that incorporates propensity score matching. There are some details in the methodology section I wrote which should answer your question on how to evaluate whether your approach is working.
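As a rough illustration of that balance check: the absolute standardized difference compares the group means of a covariate in units of the pooled standard deviation. A minimal sketch in SAS, assuming a dataset PSDATA with a 0/1 treatment flag TREAT, a stratum variable PS_GROUP, and a covariate AGE (all of these names are assumptions for illustration):

/* group means and variances per stratum and treatment arm */
proc means data=psdata noprint;
   class ps_group treat;
   var age;
   output out=stats mean=mu var=s2;
run;

/* d = |mu1 - mu0| / sqrt((s2_0 + s2_1)/2), computed within each stratum; */
/* _type_=3 keeps only the cells where both class variables are present  */
data balance;
   merge stats(where=(treat=0 and _type_=3) rename=(mu=mu0 s2=s2_0))
         stats(where=(treat=1 and _type_=3) rename=(mu=mu1 s2=s2_1));
   by ps_group;
   std_diff = abs(mu1 - mu0) / sqrt((s2_0 + s2_1)/2);
run;

A common rule of thumb is that absolute standardized differences below about 0.1 after stratification indicate acceptable balance.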

Calculating p-value by hand from Stata table

I want to ask how to compute the p-value without a t-stat table, just by looking at the regression output, like the one on the first page of the PDF at the following link: http://faculty.arts.ubc.ca/dwhistler/326UBC/stataHILL.pdf . If I didn't already know the value 0.062, how could I work it out from the other information in the table?
You need to use the ttail() function, which returns the reverse cumulative Student's t distribution, aka the probability T > t:
display ttail(38,abs(_b[_cons]/_se[_cons]))*2
The first argument, 38, is the degrees of freedom (sample size less the number of parameters), while the second, which evaluates to 1.92 here, is the absolute value of the coefficient of interest divided by its standard error, i.e., the t-stat. The factor of two comes from the fact that Stata is doing a two-tailed test. You can also use the stored degrees of freedom with
display ttail(e(df_r),abs(_b[_cons]/_se[_cons]))*2
You can also do the integration of the t density by "hand" using Adrian Mander's integrate:
ssc install integrate
integrate, f(tden(38,x)) l(-1.92) u(1.92)
This gives you 0.93761229, but you want Pr(T>|t|), which is 1-0.93761229=0.06238771.
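In symbols, the two-tailed p-value being computed here is

$$p = \Pr(|T_{38}| > 1.92) = 2\,\Pr(T_{38} > 1.92) = 1 - \int_{-1.92}^{1.92} f_{38}(x)\,dx \approx 0.0624,$$

where $f_{38}$ is the density of Student's t distribution with 38 degrees of freedom.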
If you look at many statistics textbooks, you will find a table called the Z-table, which gives you the probability that Z is beyond your test statistic. The table is actually a cumulative distribution function of the normal curve.
When people went to school with four-function calculators, one or more of the questions on the statistics test would include a copy of this Z-table, and the dear students would have to interpolate between columns of numbers to find the p-value. In your example, you would see the tail probability for your test statistic fall between .06 and .07, and those fingers would tap out that it was closer to .06 and do a linear interpolation to come up with .062.
Today, the p-value is something that Stata or SAS will calculate for you.
Here is another SO question that may be of interest: How do I calculate a p-value if I have the t-statistic and d.f. (in Perl)?
Here is a basic page on how to determine p-value "by hand": http://www.dummies.com/how-to/content/how-to-determine-a-pvalue-when-testing-a-null-hypo.html
Here is how you can determine p-value using Excel: http://ms-office.wonderhowto.com/how-to/find-p-value-with-excel-346366/
===EDIT===
My Stata text ("Microeconometrics using Stata", Revised Ed, Cameron & Trivedi) says the following on p. 402.
* p-values for t(30), F(1,30), Z, and chi(1) at y=2
. scalar y=2
. scalar p_t30 = 2 * ttail(30,y)
. scalar p_f1and30 = Ftail(1,30,y^2)
. scalar p_z = 2 * (1 - normal(y))
. scalar p_chi1 = chi2tail(1,y^2)
. display "p-values" " t(30)=" %7.4f p_t30
p-values t(30) = 0.0546

Stata command to add all choices, those made and those not made

UPDATE:
I solved the first part of the problem. I created unique ids for each observation:
gen id=_n
Then, I used
fillin id categ
which essentially created what I was looking for.
However, for the rest of the variables (except id and categ), almost all observations are missing. Now I need your help to duplicate the rest of the variables instead of having them missing.
Just as an example, each observation is associated with a particular week, and most of those values are now missing. Another dummy variable indicates whether a purchase was made at a drug or grocery store; most of those values are missing too.
Thanks!
ORIGINAL MESSAGE:
Need your help in Stata!
Each observation in my database is a 1-unit purchase of a beer product made by a customer. These product purchases are categorized into 8 general categories, such that the variable "categ" has values from 1 to 8 (1=import, 2=craft, 3=premium, 4=light, etc.).
For my multinomial logit model, I need to observe all categories purchased or not purchased by the customer in each observation.
Assume, this is my initial dataset:
customer id    beer category    units purchased
     1               1                 1
     2               3                 1
     3               2                 1
This is what I am looking for:
customer id    beer category    units purchased
     1               1                 1
     1               2                 0
     1               3                 0
     2               1                 0
     2               2                 0
     2               3                 1
     3               1                 0
     3               2                 1
     3               3                 0
Currently, my dataset is 600,000 obs. After this procedure, I should have 600,000*8=4,800,000 obs.
When constructing this code, it is necessary that all other variables in the dataset are duplicated according to the associated category of beer.
I assume that "fillin", and less likely "expand", might work.
Your help will be tremendously appreciated.
Thanks!
This is an old question, but I'll post a possible answer in case someone else is having this problem.
In this case, you could generate a dummy variable for every option of your "choice variable", and after that, apply the reshape long command:

* create one 0/1 dummy per category: b1-b8
tab beercategory, gen(b)
* reshape to long form; i() must uniquely identify observations
reshape long b, i(customerid) j(categ)
Greetings