SAS select random samples from a dataset - sas

I understand that to select a random sample, I can use
proc surveyselect data = raw_data method = srs n=200000 out=sample_data;
run;
However, sometimes my raw_data has the number of records < 200000. If the raw_data is small, I would like to just keep the raw_data; if it's larger than a million records, I would like to randomly select a 200k of records out of it. How should I do this?
Thank you!

Just create a macro variable for n. You can do this below, or you can use dictionary.tables or proc contents to get the count without actually counting all of the rows if you don't have reason to disbelieve those values.
proc sql;
select
case when count(1) < 1000000 then count(1) else 200000 end
into :sampcount
from yourdataset
;
quit;
proc surveyselect n=&sampcount. .... ;
run;

Related

SAS: print the name related to the most little value

I'm a beginner in SAS and i have difficulties with this exercice:
I have a very simple table with 2 columns and three lines
I try to find the request that will return me the name of the most little people (so it must return titi)
All what I found is to return the most little size (157) but i don't want this, I want the name related to the most little value!
Could you help me please?
Larapa
A SQL having clause is a good one for this. SAS will automatically summarize the data and merge it back to the original table, giving you one a one-line table with the name of the smallest value of taille.
proc sql noprint;
create table want as
select nom
from have
having taille = min(taille)
;
quit;
Here are some other ways you can do it:
Using PROC MEANS:
proc means data=have noprint;
id nom;
output out=want
min(taille) = min_taille;
run;
Using sort and a data step to keep only the first observation:
proc sort data=have;
by taille;
run;
data want;
set have;
if(_N_ = 1);
run;

How to perform addition of column values in SAS for large dataset

I have a large data set having millions of records with more than 200 columns. Is there any way to perform the sum of columns in SAS.Below is the sample data.
I want to display the data as below.Like Need to SUM all the values of columns in col1 .100+100+100+100+100+100=600
A simple proc summary will suffice.
proc summary data=want nway ;
var col1-col7 ;
output out=sum (drop=_:) sum= ;
run ;

SAS equivalent to R’s is.element()

It’s the first time that I’ve opened sas today and I’m looking at some code a colleague wrote.
So let’s say I have some data (import) where duplicates occur but I want only those which have a unique number named VTNR.
First she looks for unique numbers:
data M.import;
set M.import;
by VTNR;
if first.VTNR=1 then unique=1;
run;
Then she creates a table with the duplicated numbers:
data M.import_dup1;
set M.import;
where unique^=1;
run;
And finally a table with all duplicates.
But here she is really hardcoding the numbers, so for example:
data M.import_dup2;
set M.import;
where VTNR in (130001292951,130100975613,130107546425,130108026864,130131307133,130134696722,130136267001,130137413257,130137839451,130138291041);
run;
I’m sure there must be a better way.
Since I’m only familiar with R I would write something like:
import_dup2 <- subset(import, is.element(import$VTNR, import_dup1$VTNR))
I guess there must be something like the $ also for sas?
To me it looks like the most direct translation of the R code
import_dup2 <- subset(import, is.element(import$VTNR, import_dup1$VTNR))
Would be to use SQL code
proc sql;
create table import_dup2 as
select * from import
where VTNR in (select VTNR from import_dup1)
;
quit;
But if your intent is to find the observations in IMPORT that have more than one observation per VTNR value there is no need to first create some other table.
data import_dup2 ;
set import;
by VTNR ;
if not (first.VTNR and last.VTNR);
run;
I would use the options in PROC SORT.
Make sure to specify an OUT= dataset otherwise you'll overwrite your original data.
/*Generate fake data with dups*/
data class;
set sashelp.class sashelp.class(obs=5);
run;
/*Create unique and dup dataset*/
proc sort data=class nouniquekey uniqueout=uniquerecs out=dups;
by name;
run;
/*Display results - for demo*/
proc print data=uniquerecs;
title 'Unique Records';
run;
proc print data=dups;
title 'Duplicate Records';
run;
Above solution can give you duplicates but not unique values. There are many possible ways to do both in SAS. Very easy to understand would be a SQL solution.
proc sql;
create table no_duplicates as
select *
from import
group by VTNR
having count(*) = 1
;
create table all_duplicates as
select *
from import
group by VTNR
having count(*) > 1
;
quit;
I would use Reeza's or Tom's solution, but for completeness, the solution most similar to R (and your preexisting code) would be three steps. Again, I wouldn't use this here, it's excess work for something you can do more easily, but the concept is helpful in other situations.
First, get the dataset of duplicates - either her method, or proc sort.
proc sort nodupkey data=have out=nodups dupout=dups;
by byvar;
run;
Then pull those into a macro list:
proc sql;
select byvar
into :duplist separated by ','
from dups;
quit;
Then you have them in &duplist. and can use them like so:
data want;
set have;
if not (byvar in &duplist.);
run;
data want;
set import;
where VTNR in import_dup1;
run;

How to obtain the number of records of a dataset in SAS

I want to count the number of records in a dataset in SAS. There is a function the make this thing in a simple way? I used R ed for obtain this information there was the length() function. Morover I need the number of record to compute some percetages so I need this value not in a table but in a value that can be used for other data step. How can I fix?
Thanks in advance
Here is another solution, using SAS dictionaries,
proc sql;
select nobs into: num_obs
from dictionary.tables
where libname = "WORK" and memname = "A"
;
quit;
It is easy to get the size of many datasets by modifying the above code,
proc sql;
create table test as
select memname, nobs
from dictionary.tables
where libname = "WORK" and memname like "A%"
;
quit;
data _null_;
set test;
call symput(memname, nobs);
run;
The above code will give you the sizes of all data sets with name starting with "a" in the temporary/work library.
Assuming this is a basic SAS table that you've created, and not modified or appended to, the best way is to use the meta data held in a dataset (the Number of tries is held in a piece of meta data called "nobs"), without reading through the dataset its self and place it in a macro variable. You can do this in the following way:
Data _null_;
i=1;
If i = 0 then set DATASETTOCOUNT nobs= mycount;
Call symput('mycount', mycount);
Run;
%put &mycount.;
You will now have a macro variable that contains the number of rows in your dataset, that you can call on in other data steps using &mycount.

sas create a variable that is equal to obs column

I have a file with 10 obs. and different parameters. I need to add to my data a new variable of 'ID' for each observation- i.e a column of numbers 1-10.
How can I add a variable that is simply equal to the obs column?
I thought about doing it with a loop, define an empty vat, run over the var and each time add '1' to previous observation, however, it seems kind of complicated. Is there a better way to do it?
You can use the Data Step automatic variable _n_. This is the iteration count of the Data Step loop.
Data want;
set have;
ID = _n_;
run;
If you opt for a Proc SQL solution, there are two ways:
1. Undocumented:
proc sql;
create table want as
select monotonic() as row, *
from sashelp.class
;
quit;
Documented:
ods listing close;
ods output sql_results=want;
proc sql number;
select * from sashelp.class;
quit;
ods listing;
#DomPazz answer would definitely work! Just in case you would like return the number of observations according to attributes, Try this:
proc sort data= dataset out= sort_data;
by * your attribute(s) *;
data sort_data;
set sort_data;
by * your attribute(s) that is listed in above proc sort statement *;
if first.attribute then i=1; <=== first by group observation, number =1
i + 1; <==== sum statement (retaining)
if last.attribute and .... then ....; <=== whatever you want to do . Not necessary
run;
first / Last is very helpful in doing row operation.