Logistic Regression using Accord.net (http://accord-framework.net/docs/html/T_Accord_Statistics_Analysis_LogisticRegressionAnalysis.htm) takes about 5 minutes to compute. SAS does it in a few seconds (also using a single CPU core).
The dataset is about 40,000 rows and 30 inputs.
Why is there such a difference? Does SAS use an algorithm with much better complexity? As far as I know, logistic regression is a fairly simple algorithm.
Is there any other library that will do better (preferably free)?
The solution is to comment out this line:
https://github.com/accord-net/framework/blob/development/Sources/Accord.Statistics/Analysis/LogisticRegressionAnalysis.cs#L504
It computes some very expensive statistics that I don't need.
Here is a class that can be used with the standard Accord package: https://gist.github.com/eugenem/e1dd2ef2149e8c21c37d
I had the same experience with multinomial logistic regression. I made a comparison between Accord, R, SPSS and Python's scikit-learn. I have 30 inputs, 10 outputs and 1600+ training examples. Accord took 8 minutes, and the rest took 2-8 seconds. Accord looks beautiful, but for multinomial logistic regression it's way too slow. My solution was to build a small Python web service that calculates the regression and saves the result in the database.
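For a sense of what the non-Accord side of such a comparison looks like, here is a minimal R sketch using nnet::multinom on simulated data of roughly the size mentioned above (30 inputs, 10 outcome classes, 1600 examples); the data are made up, so only the timing behaviour is meaningful:

library(nnet)   # multinom() fits multinomial logistic regression

set.seed(1)
n <- 1600; p <- 30; k <- 10            # roughly the sizes from the comparison
x <- matrix(rnorm(n * p), n, p)
y <- factor(sample(k, n, replace = TRUE))
dat <- data.frame(y = y, x)

system.time(
  fit <- multinom(y ~ ., data = dat, maxit = 500, trace = FALSE)
)
head(predict(fit, dat, type = "probs"))   # per-class probabilities

On data of this size, multinom should finish in seconds, in line with the 2-8 second figures above.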
I have conducted a competing risk analysis using the Fine and Gray method in Stata, with commands similar to these:
stcrreg ifp tumsize pelnode, compete(failtype==2)
stcurve, cif at1(ifp=5 pelnode=0) at2(ifp=20 pelnode=0)
I could not get the 95% confidence interval for the estimates. Can someone help me get the CI?
Thank you
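For what it's worth, here is a sketch of the same kind of Fine and Gray model in R's cmprsk package, where summary() reports 95% confidence intervals for the subdistribution hazard ratio estimates; the covariate names are copied from the stcrreg call above, ftime and fstatus stand in for your time and failure-type variables, and the predicted cumulative incidence curves are point estimates only:

library(cmprsk)   # crr() fits the Fine and Gray competing-risks model

# ftime: follow-up time; fstatus: 0 = censored, 1 = event of interest, 2 = competing event
covs <- cbind(ifp, tumsize, pelnode)
fit  <- crr(ftime = ftime, fstatus = fstatus, cov1 = covs,
            failcode = 1, cencode = 0)

summary(fit)   # coefficients, subdistribution hazard ratios, and their 95% CIs

# Predicted cumulative incidence at ifp = 5 vs ifp = 20, pelnode = 0
# (tumsize held at its mean, as stcurve would do; no confidence bands here):
newdat <- rbind(c(5,  mean(tumsize), 0),
                c(20, mean(tumsize), 0))
cif <- predict(fit, cov1 = newdat)
plot(cif, lty = 1:2)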
I am trying to do a likelihood ratio test to compare nested models in SAS. I am very new to SAS and am only familiar with using PROC REG to conduct a regression analysis. Do you have any ideas on how I can get the likelihood ratio test, or how I would start?
I know how to do an LR test with logistic regression, but there it seems to come up automatically with PROC LOGISTIC.
Any help would be appreciated!
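Not a SAS answer, but it may help to see the mechanics: the LR statistic is twice the difference in log-likelihoods between the full and the reduced model, compared against a chi-square with degrees of freedom equal to the number of parameters you dropped. A minimal sketch in R, with made-up variable names:

# Hypothetical data frame `dat` with outcome y and predictors x1, x2, x3
full    <- lm(y ~ x1 + x2 + x3, data = dat)
reduced <- lm(y ~ x1,           data = dat)   # nested: drops x2 and x3

lr <- 2 * (as.numeric(logLik(full)) - as.numeric(logLik(reduced)))
df <- attr(logLik(full), "df") - attr(logLik(reduced), "df")
pchisq(lr, df = df, lower.tail = FALSE)       # LR test p-value

# Equivalent one-liner: lmtest::lrtest(reduced, full)

The same subtraction works with the log-likelihood (or -2 Log L) values that SAS reports for models fit by maximum likelihood.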
I'm trying to calculate some odds ratios and significance for something that can be put into a 2x2 table. The problem is that the Fisher test in SAS is taking a long time.
I already have the cell counts. I could calculate a chi-square if not for the fact that some of the cell counts are extremely small. And yet some are extremely large, with cell sizes in the hundreds of thousands.
When I try to compute these in R, I have no problem. However, when I try to compute them in SAS, it either takes way too long, or simply errors out with the message "Fisher's exact test cannot be computed with sufficient precision for this sample size."
When I create a toy example (pull one instance from the dataset and calculate it), it does compute, but takes a long time.
Data Bob;
Input targ $ status $ wt;
Cards;
A c 4083
A d 111
B c 376494
B d 114231
;
Run;
Proc freq data = Bob;
Weight wt;
Tables targ*status;
Exact Fisher;
Run;
What is going wrong here?
That's funny. SAS calculates the Fisher's exact test p-value the exact way, by enumerating the hypergeometric probability of every single table in which the odds ratio is at least as large in favor of the alternative hypothesis. There's probably a way for me to calculate how many tables that is, but knowing that it's big enough to slow SAS down is enough.
R does not struggle here. For a 2x2 table, fisher.test still computes an exact p-value, but it only has to sum hypergeometric probabilities over the possible values of a single cell, so it stays fast whether the counts are small or in the hundreds of thousands (Monte Carlo simulation, via simulate.p.value, is only offered for tables larger than 2x2).
tab <- matrix(c(4083, 111, 376494, 114231), 2, 2)
pc <- proc.time()
fisher.test(tab)
proc.time()-pc
gives us
> tab <- matrix(c(4083, 111, 376494, 114231), 2, 2)
> pc <- proc.time()
> fisher.test(tab)
Fisher's Exact Test for Count Data
data: tab
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
9.240311 13.606906
sample estimates:
odds ratio
11.16046
> proc.time()-pc
user system elapsed
0.08 0.00 0.08
>
A fraction of a second.
That said, the smart statistician would realize that, in tables such as yours, the normal approximation to the log odds ratio is fairly good, and as such the Pearson chi-square test should give very similar results.
People claim two rather different advantages for Fisher's exact test: some say it's good for small sample sizes, others that it's good when cell counts are very small in specific margins of the table. The way I've come to understand it is that Fisher's exact test is a nice alternative to the chi-square test when bootstrapped datasets are somewhat likely to generate tables with infinite odds ratios; visually, that is when the normal approximation to the log odds ratio is breaking down.
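To see how close the two tests get on a table like this one, you can put them side by side in R; the last lines compute the Wald (normal-approximation) interval for the odds ratio, which can be compared with the interval fisher.test printed above:

tab <- matrix(c(4083, 111, 376494, 114231), 2, 2)

fisher.test(tab)$p.value                   # exact p-value
chisq.test(tab)$p.value                    # Pearson chi-square (Yates-corrected)
chisq.test(tab, correct = FALSE)$p.value   # Pearson chi-square, no correction

# Normal approximation to the log odds ratio (Wald 95% CI):
log_or    <- log((tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1]))
se_log_or <- sqrt(sum(1 / tab))
exp(log_or + c(-1.96, 1.96) * se_log_or)   # compare with fisher.test(tab)$conf.int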
I am currently developing a sentiment index using Google search frequencies taken from Google Trends.
I am using Stata 12 on Windows.
My approach is as following:
I downloaded approximately 150 business-related search queries from Google Trends, from Jan 2004 to Dec 2013
I now want to construct an index using the 30 queries that are, at each point in time, most relevant to the market I observe
To achieve that, I want to use monthly expanding backward rolling regressions of each query on the market
Thus I need to regress the 150 items one by one on the market 120 times (12 months x 10 years), using different time windows, and then extract the 30 queries with the most negative t-statistics.
To exemplify the procedure: if I wanted to construct the sentiment for January 2010, I would regress the query terms on the market over the period from Jan 2004 to December 2009 and then extract the 30 queries with the most negative t-statistics.
Now I am looking for a way to make this as automated as possible. I guess I should be able to run the 150 items at once, and I can specify the time window using the time stamps. Using Excel commands and creating a do-file with all the regression commands in it (which would be quite large), I could probably create the regressions relatively efficiently (although it depends on how much Stata can handle - any experience with that?).
What would make the data extraction much easier is a command I can use to rank the results of the regressions according to their t-statistics. Does anyone have an efficient approach to this, or general advice?
If you are using Stata, once you run a ttest you can type return list and you will see the scalars that Stata stores; after an estimation command such as regress, the stored results are in e() instead (type ereturn list), and you can pull out coefficients and standard errors with _b[varname] and _se[varname]. Once you run a loop, you can save these values in a number of different ways; check out the post command.
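The thread is about Stata, but since the bookkeeping is the hard part, here is a rough sketch of the expanding-window, rank-by-t-statistic logic in R (the data frame and column names are hypothetical); the same structure translates to a Stata loop that posts the t-statistics and then sorts them:

# Hypothetical monthly data frame `dat`: a Date column `month`, a `market`
# column, and one column per search query. For a given cutoff month, regress
# each query on the market over the expanding window and keep the queries
# with the most negative t-statistics.
rank_queries <- function(dat, end_month, n_top = 30) {
  window  <- dat[dat$month < end_month, ]
  queries <- setdiff(names(window), c("month", "market"))

  tstats <- sapply(queries, function(q) {
    fit <- lm(window[[q]] ~ window$market)
    summary(fit)$coefficients["window$market", "t value"]
  })

  sort(tstats)[seq_len(n_top)]   # most negative first
}

# e.g. the 30 components for January 2010 (window: Jan 2004 - Dec 2009):
# rank_queries(dat, end_month = as.Date("2010-01-01"))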
I'm going to study the relationship between illiquidity and returns in stock markets, using the Amihud model proposed in the paper "Illiquidity and stock returns: cross-section and time-series effects" (2002). I would like to know if it is possible to automate the regression analysis. I have more than 2000 stocks in the sample, and I'd like to avoid running each regression one by one, to speed the process up.
Do you know if it is possible to automate this process in Stata, or to do it with some other statistical software (R, SAS, Matlab, Gretl, ...)? If so, how could I do that?
You should look at foreach and forval as ways of looping.
forval i = 1/3 {
    regress Ystock`i' Xstock`i'
}
would be an example if and only if there are variables with names like those you indicated. If you have other names, or a different data structure, a loop would still be possible.
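Since the question also mentions R: the same loop-over-stocks idea is straightforward there too. A minimal sketch, assuming (hypothetically) a data frame stocks with return columns ret1, ret2, ... and illiquidity columns illiq1, illiq2, ...; adapt the names and the regression to your actual Amihud specification:

# Run the same regression for every stock and collect estimates and t-statistics.
n_stocks <- 2000
results <- lapply(seq_len(n_stocks), function(i) {
  fit   <- lm(reformulate(paste0("illiq", i), response = paste0("ret", i)),
              data = stocks)
  coefs <- summary(fit)$coefficients
  data.frame(stock    = i,
             estimate = coefs[2, "Estimate"],
             t_value  = coefs[2, "t value"],
             p_value  = coefs[2, "Pr(>|t|)"])
})
results <- do.call(rbind, results)   # one row per stock
head(results)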