We were working on large telecom datasets. When we standardized the data, we got very large z-scores, ranging from about -0.xxx up to 300 or 400!
These attributes have, for example, min = 0 and max around 4,000,000.
Yes, some variables have outliers. Will clustering give good results without dealing with the outliers first?
Running PROC FASTCLUS with 8 clusters leads to very unbalanced clusters: the seventh has 1,600,000 observations, and there is also one with a single observation.
What’s our problem?
Your variables are likely very skewed.
Using z-standardization on such variables is questionable. You should probably look into Box-Cox transformations, too.
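As a rough sketch (not tested; rev and mins are placeholder variable names, and a plain log transform stands in for the more general Box-Cox family), you could transform the skewed inputs before standardizing and clustering:

data telecom_t;
    set telecom;
    log_rev  = log(rev + 1);    /* +1 guards against zero values */
    log_mins = log(mins + 1);
run;

proc stdize data=telecom_t out=telecom_std method=std;
    var log_rev log_mins;
run;

proc fastclus data=telecom_std maxclusters=8 out=clusters;
    var log_rev log_mins;
run;

After a transform like this, the z-scores are far less dominated by the extreme tail, and a one-observation cluster becomes much less likely.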
This is mainly a question about efficiency, as I'm unfamiliar with how SAS processes datasets. A lot of code that I run reads from multiple datasets with consecutive dates (whether this is consecutive months/quarters/years depends on the datasets).
At the moment, the code requires manual updates each time it's run to make sure it picks up the correct dates, so I would have something such as:
Data Quarters;
Set XYZ_201803
XYZ_201806
...
...
XYZ_202006;
Run;
To help tidy up the code and make it a bit less tedious, I've considered a few different ideas (and had a few sent my way). One of the big ideas is to store all of the XYZ_YYYYMM datasets as a single, appended dataset, so it can be read with a simple filter on the date, as below:
Data Quarters;
Set AppendedData;
Where Date > 201812;
Run;
Which of these two options is more efficient as far as computation goes? On datasets that are typically a couple of GB in size, which would you recommend? What other pros and cons come with each idea?
Thanks for any input. :)
Most likely a single dataset and several separate datasets will be similar from a performance standpoint; there is some small overhead opening new datasets, but as long as it's not thousands of them you probably won't notice a difference.
There will be a performance hit with a single dataset, both in creating it and in using it, if you typically only use small sections of it. Separate datasets are common where people usually analyze individual quarters and rarely combine them.
Finally, if the datasets can vary from quarter to quarter in their contents (if the formats or the fields can change), then keeping them separate is in some ways easier than managing those changes between the different periods within one table.
That said, there's a huge organizational benefit to a single dataset, and all of the above issues can be dealt with. Think of SAS datasets as large SQL tables - they are effectively the same, and the same things that help SQL tables can help SAS. Proper sizing of columns, proper sorting of the stored data, and appropriate indexing are all important. If you have a database team at your place of work, they may be able to help construct an ideal table plan. Files of several GB can definitely benefit from indexing and proper sorting, so users can easily get at the bits they need.
If you were to stay with separate datasets, you can use the macro language to make sure you're reading in the right datasets, assuming they're named in a consistent fashion. That might be the ideal solution if there are other reasons to stay separate - then no changes are needed each quarter.
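For example, a small macro along the following lines could generate the SET statement from a start and end period; this is only a sketch (the macro name and parameters are made up, and it assumes the XYZ_YYYYMM quarterly naming from the question):

%macro stack_quarters(start=201803, end=202006);
    %local dt;
    %let dt = %sysfunc(inputn(&start.01, yymmdd8.));
    data Quarters;
        set
        /* emit one dataset name per quarter between start and end */
        %do %while (%sysfunc(putn(&dt, yymmn6.)) <= &end);
            XYZ_%sysfunc(putn(&dt, yymmn6.))
            %let dt = %sysfunc(intnx(month, &dt, 3, b));
        %end;
        ;
    run;
%mend stack_quarters;

%stack_quarters(start=201803, end=202006)

Each quarter you would then only change the macro parameters (or derive them from today's date) instead of editing the SET statement by hand.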
Points of interest:
From a coding standpoint
Dealing with a single, stacked data set created by appending the quarterly data sets is more efficient.
From a resource standpoint
You have to make sure you have a large enough disk to hold the single large table.
Have additional offline storage to hold the original pieces -- no need to clutter up the primary data disk with all the pieces.
A 2TB SSD is very fast, remarkably cheap, and low power and can contain a table comprised of quite a few "couple GB" pieces.
Spinning disk has lower $/TB and more capacity. I/O will be slower and consume more power.
To further improve query performance you will want to index the variables most commonly used in BY, CLASS, and WHERE statements (see the sketch after this list).
"... simple filter ..." is part of "Keep it Simple S****" (KISS)
I have a continuous dependent variable (volume of chemical) with lots of values at 0, and a bunch of continuous and categorical predictors. I want to examine the relationship between the predictors and the volume of chemical. I was thinking of using multiple linear regression, but many values in the outcome variable are 0, so I am not sure how I should proceed.
I am using SAS.
I am simulating PGA tournaments using Stata. My simulation results table consists of:
column 1: the names of the 30 players in the tournament
columns 2 - 30,001: the 4-round results of my Monte Carlo simulations.
What I am trying to do is create a 30 x 30 matrix with the golfers' names down column 1 and across the column headers, where each cell represents the percentage of times Golfer A beat Golfer B outright over the 30,000 simulations. Is this possible to do in Stata? Thanks
I tend to say that everything is always possible in all programming languages, but some things are much more difficult to do in some languages than in others. I do not think that Stata is a great tool for what you intend to do.
You need to provide some code examples for us to be able to help you with your task, but here is one thing I can say. Stata has two programming languages. One is often called Stata (but is called ado on StataCorp's website) and the other is Mata. If you for some reason need to use the software Stata, you should do this in the language Mata, which has more matrix operators than ado. In ado you can't store text in a matrix, so if you want to store the golfers' names you need to use Mata; alternatively, you can use row and column indexes to keep track of the golfers.
With that said, Stata is primarily a tool for operating on and analyzing a single dataset loaded into memory (support for multiple datasets has been added recently). So to answer your question: yes, this can be done in Stata, but you are probably much better off doing it in a language with more support for multidimensional arrays/vectors, for example R or Python.
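Still, a rough Mata sketch along these lines could work, assuming the results are in memory as a wide dataset with a string variable player holding the 30 names as the first variable and the remaining numeric variables holding the simulated 4-round totals (all names here are placeholders, and a lower total score is treated as the outright win):

mata:
    names = st_sdata(., "player")          // 30 x 1 vector of golfer names
    S     = st_data(., (2..st_nvar()))     // 30 x 30,000 matrix of simulated totals
    n     = rows(S)
    W     = J(n, n, .)                     // W[i,j] = share of sims where i beats j
    for (i = 1; i <= n; i++) {
        for (j = 1; j <= n; j++) {
            if (i != j) {
                // strict inequality = beat outright
                W[i, j] = mean((S[i, .] :< S[j, .])')
            }
        }
    }
    names, strofreal(W)                    // crude listing of names beside the matrix
end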
I'm hoping to add one star for p < .05 and two stars for p < .001 in a correlation matrix in Stata. This is the code I'm currently using. The code still generates a correlation matrix, but no stars appear in the places where they should. Thanks for your help!
asdoc corr RELATIONSHIP anxiety BEH_SIM SIM_VALUES sptconf NEG_EFFICACY spteffort SPTEFFORT_OTHER COOP_MOTIV COMP_MOTIV, star(0.5), replace
First, you need to use pwcorr rather than corr to be able to add stars to your correlation matrix. Second, you should not have the second comma right after the star option.
For example, the code below will output a correlation matrix with one star if significant at the 10% level, two stars if significant at the 5% level, and three stars if significant at the 1% level.
asdoc pwcorr var1 var2 var3, star(all) replace
I do not believe you can specify star counts and significance levels the way you would like using asdoc. You can specify a custom significance level by using star(.05) rather than star(all) as I do above, but this will put one star by every correlation coefficient significant at the 5% level, and I do not think you can specify more than one level at a time.
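Applied to your variable list, that would look like the line below (one star at the 5% level only):

asdoc pwcorr RELATIONSHIP anxiety BEH_SIM SIM_VALUES sptconf NEG_EFFICACY spteffort SPTEFFORT_OTHER COOP_MOTIV COMP_MOTIV, star(.05) replace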
The author of asdoc is Professor Attaullah Shah. He is very helpful and responsive so you might ask him. If not currently possible, if you ask he may add your suggestion to a future asdoc update. Here is a link to his website: https://fintechprofessor.com/2019/06/01/export-correlation-table-to-word-with-stars-and-significance-level-using-asdoc/
I want to select my sample in Stata 13 based on three stratum variables with 12 strata in total (size - two strata; sector - three strata; intangible intensity - two strata). The selection should be proportional without replacement.
However, I can only find disproportionate selection commands that select for instance x% of each stratum.
Can anyone help me out with this problem?
Thank you for this discussion. I think I know where my problem was.
The command "gsample" can select strata based on different variables. Therefore, I thought I had to define three different stratum variables. But the solution should be more simple.
There are 12 strata in total (the large firms with high intensity in sector 1, the small firms with high intensity in sector 1, and so on) with each firm in the sample falling in to one of the strata.
All I have to do is creating a variable "strataident" with values from 1 to 12 identifying the different strata. I do this for the population dataset, so the number of firms falling into each stratum is representative for the population. The following code will provide me a stratified random sample that is representative for the population.
gsample 10, percent strata(strataident) wor
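A minimal sketch of how that identifier could be built (the variable names size, sector, and intensity are assumptions):

* build the 12-cell stratum identifier from the three stratum variables
egen strataident = group(size sector intensity)
* draw a 10% proportional sample without replacement within each stratum
gsample 10, percent strata(strataident) wor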
This command works as well and is much easier; see the example in 1:
gsample 10, percent wor strata(size sector intensity)
The problem is that strata may "overlap", so you probably have to rebalance the sample after the initial draw.
Now the question is how this can be implemented. The final sample should represent the proportions of the population as well as possible.