I simulated 500 replications and planned to analyze each in NLMIXED using BY processing. My NLMIXED code is below:
PROC NLMIXED DATA=MELS GCONV=1E-12 QPOINTS=11;
BY Rep;
PARMS LMFI=&LMFI.
SMFI=&SMFI.
LMRIvar=&LMRIvar.
SMRIvar=0 TO 0.15 BY 0.005;
mu = LMFI + b0i;
evar = EXP(SMFI + t0i);
MODEL Y ~ NORMAL(mu,evar);
RANDOM b0i t0i ~ NORMAL([0,0],[LMRIvar,0,SMRIvar]) SUBJECT=PersonID;
ODS OUTPUT FitStatistics=Fit2 ConvergenceStatus=Conv2 ParameterEstimates=Parm2;
RUN;
For some of these replications, the variance components were sampled to be small, so a non-zero number of convergence failures is expected (note the ConvergenceStatus request on the ODS OUTPUT statement). However, when I get the warning below, NLMIXED quits processing regardless of how many replications remain to be analyzed.
WARNING: The final Hessian matrix is full rank but has at least one negative eigenvalue. Second-order optimality condition violated.
ERROR: QUANEW Optimization cannot be completed.
Am I missing something? I would think that NLMIXED could acknowledge the error for that replication, but continue with the remaining replications. Thoughts are appreciated!
Best,
Ryan
Here is what I believe is occurring. The requirement that variances must be non-negative and the fact that the distribution of variance estimates is long-tailed make variances troublesome to estimate. Variance component estimate updates may result in a negative value for one or more of the estimates. The NLMIXED procedure attempts to compute eigenvalues of the model variance components. At that point, NLMIXED crashes.
But note that
V[Y] = (sd[Y])^2
V[Y] = exp(ln(V[Y]))
V[Y] = exp(2*ln(sd[Y]))
V[Y] = exp(2*ln_sd_Y)
Now, suppose that we make ln_sd_Y the parameter. References to V[Y] would then be written as the function shown in the last statement above. Because the domain of ln_sd_Y is (-infinity, infinity), no lower bound on ln_sd_Y is needed. The function exp(2*ln_sd_Y) will always produce a non-negative variance estimate. In fact, given the limitations of digital computers (negative infinity cannot be represented, only finite values heading toward it), exp(2*ln_sd_Y) will always produce a strictly positive estimate. The estimate may be very, very close to 0, but it always approaches 0 from above. This should prevent SAS from ever trying to compute eigenvalues involving a negative variance.
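For example, ln_sd_Y = -10 corresponds to V[Y] = exp(-20), roughly 2.1e-9: tiny, but still strictly positive.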
A slight alteration of your code writes LMRIvar and SMRIvar as functions of ln_sd_LMRIvar and ln_sd_SMRIvar.
PROC NLMIXED DATA=MELS GCONV=1E-12 QPOINTS=11;
BY Rep;
PARMS LMFI=&LMFI.
SMFI=&SMFI.
ln_sd_LMRIvar=%sysfunc(log(%sysfunc(sqrt(&LMRIvar.))))
ln_sd_SMRIvar=-5 to -1 by 0.1;
mu = LMFI + b0i;
evar = EXP(SMFI + t0i);
MODEL Y ~ NORMAL(mu,evar);
RANDOM b0i t0i ~ NORMAL([0,0],
[exp(2*ln_sd_LMRIvar), 0,
exp(2*ln_sd_SMRIvar)]) SUBJECT=PersonID;
ODS OUTPUT FitStatistics=Fit2 ConvergenceStatus=Conv2 ParameterEstimates=Parm2;
RUN;
Alternatively, you could employ a bounds statement in an attempt to prevent updates of LMRIvar and/or SMRIvar from going negative. You could keep your original code, inserting the statement
bounds LMRIvar SMRIvar > 0;
This is simpler than writing the model in terms of parameters which are allowed to go negative. However, my experience has been that employing parameters which have domain (-infinity, infinity) is actually the better approach.
I'm trying to write code for a fairly large dataset (3 million observations) which has been split into smaller groups (ID). For each observation (see the data below), I want to create a cumulative sum of the variable "Value" over all lower-ranked observations whose Condition equals that of the current observation.
I want to write this code without using loops, if there is a way to do so.
Could someone help me?
Thank you!
UPDATE:
I have pasted the equation for the output variable below.
UPDATE 2:
The CSV format of the above table is:
ID,Rank,Condition,Value,Expected output
1,1,30,10,0
1,2,40,20,0
1,3,20,30,0
1,4,30,40,10
1,5,40,50,20
1,6,20,60,30
1,7,30,70,80
2,1,40,80,0
2,2,20,90,0
2,3,30,100,0
2,4,40,110,80
2,5,20,120,90
2,6,30,130,100
2,7,40,140,190
2,8,20,150,210
2,9,30,160,230
Equation:
output_i = sum of Value_j over all observations j with ID_j = ID_i, Condition_j = Condition_i, and Rank_j < Rank_i
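For example, for ID 2, Rank 8 (Condition 20), the lower-ranked observations with Condition 20 are Ranks 2 and 5, so the expected output is 90 + 120 = 210.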
If I understand correctly, for each combination of ID and Condition, you want to calculate a running sum, ordered by Rank, of the variable Value, excluding the current observation. If that is indeed your goal, the following untested code might set you on the path to a solution
sort ID Condition Rank
// be sure there is a single observation for each combination
isid ID Condition Rank
// generate the running sum
by ID Condition (Rank): generate output = sum(Value)
// subtract out the current observation
replace output = output - Value
// return to the original order
sort ID Rank
As I said, this is untested, because my copy of Stata cannot read pictures of data. If your testing shows that it is imperfect and you cannot resolve the problem yourself, providing your sample data in a usable format will increase the likelihood someone will be able to help.
Added in edit: Corrected the isid command.
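As a quick hand check against the CSV above: for ID 2 and Condition 30, the running sum of Value over Ranks 3, 6, and 9 is 100, 230, 390; subtracting each observation's own Value leaves 0, 100, 230, which matches the Expected output column.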
I have been struggling to write efficient code to estimate monthly weighted mean portfolio returns.
I have following variables:
firm stock returns (ret)
month1, year1 and date
portfolio (port1): this defines portfolio of the firm stock returns
market capitalisation (mcap): to estimate weights (by month1 year1 port1)
I want to calculate weighted returns for each month and portfolio, weighted by the market capitalisation (mcap) of each firm.
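In other words, for each port1 x year1 x month1 cell I want something like
weighted return = sum(mcap_i * ret_i) / sum(mcap_i)
where the sum runs over the firms i in that cell.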
I have written the following code, which works without fail but takes ages and is highly inefficient:
foreach x in 11 12 13 21 22 23 {
    display `x'
    forvalues y = 1980/2010 {
        display `y'
        forvalues m = 1/12 {
            display `m'
            tempvar tmp_wt tmp_tm tmp_p
            egen `tmp_tm' = total(mcap) if month1==`m' & year1==`y' & port1 ==`x'
            gen `tmp_wt' = mcap/`tmp_tm' if month1==`m' & year1==`y' & port1 ==`x'
            gen `tmp_p' = ret*`tmp_wt' if month1==`m' & year1==`y' & port1 ==`x'
            gen port_ret_`m'_`y'_`x' = `tmp_p'
        }
    }
}
The data look as shown in the image: [Data for value weighted portfolio return]
This does appear to be a casebook example of how to do things as slowly as possible, except that naturally you are not doing that on purpose. All it lacks is a loop over observations to calculate totals. So, the good news is that you should indeed be able to speed this up.
It seems to boil down to
gen double wanted = .
bysort port1 year1 month1 : replace wanted = sum(mcap)
by port1 year1 month1 : replace wanted = (mcap * ret) / wanted[_N]
Principle. To get a sum as a single scalar, use summarize, meanonly rather than egen, total(), which puts that scalar into a variable repeatedly. But when you need group sums in a variable, as here, use sum() under by:. sum() returns cumulative (running) sums, so you want the last value of the cumulative sum within each group.
Principle. Loops (here using foreach) are not needed when a groupwise calculation can be done under the aegis of by:. That is a powerful construct which Stata programmers need to learn.
Principle. Creating lots of temporary variables, here 6 * 31 * 12 * 3 = 6696 of them, is going to slow things down and use more memory than is needed. Each time you execute tempvar and follow with generate commands, there are three more temporary variables, all the size of a column in a dataset (that's what a variable is in Stata), but once they are used they are just left in memory and never looked at again. It's a subtlety with temporary variables that a tempvar assigns a new name every time, but it should be clear that generate creates a new variable every time; generate will never overwrite an existing variable. The temporary variables would all be dropped at the end of a program, but by the end of that program, you are holding a lot of stuff unnecessarily, possibly the size of the dataset multiplied by about one thousand. If that temporarily expanded dataset could not all fit in memory, you flip Stata into a crawl.
Principle. Using if obliges Stata to check each observation in turn; in this case most are irrelevant to the particular intersection of loops being executed and you make Stata check almost all of the data set (a fraction of 2231/2232, almost 1) irrelevantly while doing each particular calculation for 1/2232 of the dataset. If you have more years, or more portfolios, the fraction looked at irrelevantly is even higher.
In essence, Stata will obey your instructions (and not try any kind of optimization -- your code is interpreted utterly literally) but by: would give the cross-combinations much more rapidly.
Note. I don't know how big or how close to zero these numbers will get, so I gave you a double. For all I know, a float would work fine for you.
Comment. I guess you are being influenced by coding experience in other languages where creating variables means something akin to x = 42 to hold a constant. You could do that in Stata too, with scalars or local or global macros, not to mention Mata. Remember that a new variable in Stata is an entire new column in the dataset, regardless of whether it is holding a constant or different values in each observation. You will get what you ask for, but it is more like getting an array every time. Again, it seems that you want as an end result just one new variable, and you do not in fact need to create any others temporarily at all.
I have built a model which basically does the following:
run regressions on single time period
organise stocks into quantiles based on coefficient from linear regression
statsby to calculate portfolio returns for stocks based on quantile (averaging all quantile x returns)
store the quantile 1 and quantile 10 portfolio returns for the last period
This pair of variables is just the final entries in the timeframe. However, I intend to extend the single time period to a rolling window over a large timeframe, in essence:
for i in timeperiod {
organise stocks into quantiles based on coefficient from linear regression
statsby to calculate portfolio returns for stocks based on quantile (averaging all quantile x returns)
store the quantile 1 and quantile 10 portfolio returns for the last period
}
The data I'm after is the portfolio 1 and portfolio 10 returns for the final day of each timeframe (built using the previous 3 years of data). This should result in a time series of returns (60 years of total data minus 3 years to build the first result, so 57 years) which I can then regress against each other.
regress portfolio 1 against portfolio 10
I am coming from an R background, where storing a variable in a vector is very simple, but I'm not quite sure how to go about this in Stata.
In the end I want a 2xn matrix (a separate dataset) of numbers, each pair being results of one run of a rolling regression. Sorry for the very vague description, but it's better than explaining what my model is about. Any pointers (even if it's to the right manual entry) will be much appreciated. Thank you.
EDIT: The actual data I want to store is just a variable. I made it confusing by adding regressions. I've changed the code to more represent what I want.
Sounds like a case for either rolling or statsby, depending on exactly what you want to do. These are prefix commands that you put in front of your regression command. rolling or statsby will take care of both the looping and the storing of results for you.
If you want maximum control, you can do the loop yourself with forvalues or foreach and store the results in a separate file using post. In fact, if you look inside rolling and statsby (using viewsource) you will see that this is what these commands do internally.
Unlike R, Stata operates with only one major rectangular object in memory, called (ta-da!) the data set. (It has a multitude of other stuff, of course, but that stuff can rarely be addressed as easily as the data set that was brought into memory with use). Since your ultimate goal is to run a regression, you will either need to create an additional data set, or awkwardly add the data to the existing data set. Given that your problem is sufficiently custom, you seem to need a custom solution.
Solution 1: create a separate data set using post (see help).
use my_data, clear
postfile topost int(time_period) str40(portfolio) double(return_q1 return_q10) ///
using my_derived_data, replace
* 1. topost is a placeholder name
* 2. I have no clue what you mean by "storing the portfolio", so you'd have to fill in
* 3. This will create the file my_derived_data.dta,
* which of course you can name as you wish
* 4. The triple slash is a continuation comment: the code is continued on the next line
levelsof time_period, local( allyears )
* 5. This will create a local macro allyears
* that contains all the values of time_period
foreach t of local allyears {
regress outcome x1 x2 x3 if time_period == `t', robust
* 6. the opening and closing single quotes are references to Stata local macros
* Here, I am referring to the cycle index t
organise_stocks_into_quantiles_based_on_coefficient_from_linear_regression
* this isn't making huge sense for me, so you'll have to put your code here
* don't forget inserting if time_period == `t' as needed
* something like this:
predict yhat`t' if time_period == `t', xb
xtile decile`t' = yhat`t' if time_period == `t', n(10)
calculate_portfolio_returns_for_stocks_based_on_quantile
forvalues q=1/10 {
* do whatever if time_period == `t' & decile`t' == `q'
}
* store quantile 1 portfolio and quantile 10 return for the last period
* again I am not sure what you mean and how to do that exactly
* so I'll pretend it is something like
ratio change / price if time_period == `t' , over( decile`t' )
post topost (`t') ("whatever text describes the time `t' portfolio") ///
(_b[_ratio_1:1]) (_b[_ratio_1:10])
* the last two sets of parentheses may contain whatever numeric answer you are producing
}
postclose topost
* 7. close the file you are creating
use my_derived_data, clear
tsset time_period, year
newey return_q10 return_q1, lag(3)
* 8. just in case the business cycles have about 3 years of effect
exit
* 9. you always end your do-files with exit
Solution 2: keep things within your current data set. If the above code looks awkward, you can instead create a weird centaur of a data set with both your original stocks and the summaries in it.
use my_data, clear
gen int collapsed_time_period = .
gen double collapsed_return_q1 = .
gen double collapsed_return_q10 = .
* 1. set up placeholders for your results
levelsof time_period, local( allyears )
* 2. This will create a local macro allyears
* that contains all the values of time_period
local T : word count `allyears'
* 3. I now use the local macro allyears as is
* and count how many distinct values there are of time_period variable
forvalues n=1/`T' {
* 4. my cycle now only runs for the numbers from 1 to `T'
local t : word `n' of `allyears'
* 5. I pull the `n'-th value of time_period
** computations as in the previous solution
replace collapsed_time_period = `t' in `n'
replace collapsed_return_q1 = (compute) in `n'
replace collapsed_return_q10 = (compute) in `n'
* 6. I am filling the pre-arranged variables with the relevant values
}
tsset collapsed_time_period, year
* 7. this will likely complain about missing values, so you may have to fix it
newey collapsed_return_q10 collapsed_return_q1, lag(3)
* 8. just in case the business cycles have about 3 years of effect
exit
* 9. you always end your do-files with exit
I avoided statsby as it overwrites the data set in memory. Remember that unlike R, Stata can only remember one data set at a time, so my preference is to avoid excessive I/O operations as they may well be the slowest part of the whole thing if you have a data set of 50+ Mbytes.
I think you're looking for the estout command to store the results of the regressions.
I'm not a specialist in signal processing. I'm doing simple processing on a 1D signal using C++. I really want to know how I can determine the part that has the highest zero-crossing rate (i.e. the highest frequency). Is there a simple way or method to find the beginning and the end of this part?
This image illustrates the form of my signal, and this image shows what I need to produce (two indices marking the beginning and the end).
Edited:
Actually, I have no prior idea about the width of this part or where it begins and ends; it is quite variable.
I can calculate the number of zero crossings, but I have no idea how to define its range:
// count how many times the signal changes sign between consecutive samples
int calculateZC(const vector<double>& signals){
    int ZC_counter = 0;
    int size = static_cast<int>(signals.size());
    for (int i = 0; i < size - 1; i++){
        // a crossing occurs when consecutive samples have opposite signs
        if ((signals[i] >= 0 && signals[i+1] < 0) || (signals[i] < 0 && signals[i+1] >= 0)){
            ZC_counter++;
        }
    }
    return ZC_counter;
}
Here is a fairly simple strategy which might give you a point to start from. The outline of the algorithm is as follows:
Input: Vector of your data points {y0,y1,...}
Parameters:
Window size sigma.
A threshold 0<p<1 defining when to start looking for a region.
Output: The start and end points {t0,t1} of the region with the most zero crossings
I won't give polished C++ code, but the method should be easy to implement. As an example, let us use the following function.
What we want is the region between about 480 and 600, where the zero-crossing density is higher than at the front. The first step in the algorithm is to calculate the positions of the zeros. You can do this with what you already have, but instead of counting, you store the values of i at which you found a zero crossing.
This will give you a list of zero positions
From this list (you can do this directly in the above for-loop!) you create a list having the same size as your input data which looks like {0,0,0,...,1,0,..,1,0,..}. Every zero-crossing position in your input data is marked with a 1.
The next step is to smooth this list with a smoothing filter of size sigma. Here you can use whatever you like; in the simplest case, a moving average or a Gaussian filter. The larger you choose sigma, the bigger your look-around window becomes, which measures how many zero crossings lie around a certain point. Let me give the output of this filter together with the original zero positions; note that I used a Gaussian filter of size 10 here.
In the next step, you go through the filtered data and find the maximum value. In this case it is about 0.15. Now you choose your second parameter, which is some percentage of this maximum; let's say p = 0.6.
The final step is to go through the filtered data: when the value rises above p times the maximum, you start a new region, and as soon as the value drops below that threshold, you end the region and remember its start and end points. Once you have finished walking through the data, you are left with a list of regions, each defined by a start and an end point. Now you choose the region with the biggest extent and you are done.
(Optionally, you could add the filter size to each end of the final region)
For the above example, I get 11 regions as follows
{{164,173},{196,205},{220,230},{241,252},{259,271},{278,290},
{297,309},{318,327},{341,350},{458,468},{476,590}}
where the one with the biggest extent is the last one, {476,590}. The final result (with 1/2 filter-size padding at each end) is shown in the image.
Conclusion
Please don't be discouraged by the length of my answer. I tried to explain everything in detail. The implementation is really just some loops (sketched after this list):
one loop to create the zero-crossings list {0,0,..,1,0,...}
one nested loop for the moving average filter (or you use some library Gaussian filter). Here you can at the same time extract the maximum value
one loop to extract all regions
one loop to extract the largest region if you haven't already extracted it in the above step
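To make those loops concrete, here is a rough, untested C++ sketch under a few assumptions: the signal is already in a std::vector<double>, a plain moving average stands in for the Gaussian filter, and the name densestZeroCrossingRegion is just a placeholder.

#include <algorithm>
#include <utility>
#include <vector>

// Rough sketch: find the region with the densest zero crossings.
//   y     : the input signal
//   sigma : size of the smoothing window (moving average here; a Gaussian would also work)
//   p     : fraction (0 < p < 1) of the maximum smoothed density used as the cut-off
std::pair<int, int> densestZeroCrossingRegion(const std::vector<double>& y,
                                              int sigma, double p)
{
    const int n = static_cast<int>(y.size());
    if (n == 0) return std::make_pair(0, 0);

    // Loop 1: mark every zero-crossing position with a 1.
    std::vector<double> marks(n, 0.0);
    for (int i = 0; i + 1 < n; ++i)
        if ((y[i] >= 0 && y[i + 1] < 0) || (y[i] < 0 && y[i + 1] >= 0))
            marks[i] = 1.0;

    // Loop 2 (nested): smooth the marks with a moving average of half-width sigma/2,
    // tracking the maximum smoothed value at the same time.
    std::vector<double> density(n, 0.0);
    double maxDensity = 0.0;
    for (int i = 0; i < n; ++i) {
        int lo = std::max(0, i - sigma / 2);
        int hi = std::min(n - 1, i + sigma / 2);
        double sum = 0.0;
        for (int j = lo; j <= hi; ++j) sum += marks[j];
        density[i] = sum / (hi - lo + 1);
        maxDensity = std::max(maxDensity, density[i]);
    }
    const double cut = p * maxDensity;

    // Loops 3 and 4 combined: open a region when the density rises above the cut-off,
    // close it when the density drops below, and keep the widest region seen so far.
    int bestStart = 0, bestEnd = 0;
    int start = -1;                                  // -1 means no region currently open
    for (int i = 0; i < n; ++i) {
        bool above = density[i] > cut;
        if (above && start < 0) start = i;           // a region begins
        if (start >= 0 && (!above || i == n - 1)) {  // the open region ends
            int end = above ? i : i - 1;
            if (end - start > bestEnd - bestStart) { bestStart = start; bestEnd = end; }
            start = -1;
        }
    }

    // Optional: pad each end by half the filter size, clamped to the signal.
    bestStart = std::max(0, bestStart - sigma / 2);
    bestEnd   = std::min(n - 1, bestEnd + sigma / 2);
    return std::make_pair(bestStart, bestEnd);
}

The exact end points will depend on the filter and on p, so expect small differences from the numbers above.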