I created a dataset to calculate a threshold:
Data black;
Set blue;
Lower=p20-2;
Upper=p20+2;
Run;
I would like to use this value the output is something like:
Variables n lower upper
Val 123 -0.2 0.1
I would like to use upper and lower as thresholds:
Proc sql;
Create table one as
Select * from two
Where (Val < upper and Val > lower)
;quit;
Upper and lower should come from black, while Val should come from two.
two looks like
ID Val
42 1471
32 1742
74 4819
...
How can I include the threshold in my dataset in order to filter values from two?
A possible solution could be adding lower value and upper value to two columns, but I do know how to assign the values to these columns.
If your bounds are static for all rows, you can read them into macro variables and reference them in your SQL query.
data black;
set blue;
Lower=p20-2;
Upper=p20+2;
/* Save the value of lower/upper to macro variables &lower and &upper */
call symputx('lower', lower);
call symputx('upper', upper);
run;
proc sql;
create table one as
select * from two
where &lower. < Val < &upper.
;
quit;
If each ID has a specific threshold value, you can use a hash table to look up each value by its key. Hash tables are very efficient in SAS and are a great way of doing lookups of small tables in a big table.
data two;
set one;
if(_N_ = 1) then do;
dcl hash h(dataset: 'blue');
h.defineKey('id');
h.defineData('lower', 'upper');
h.defineDone();
/* Initialize lower/upper to missing */
call missing(lower, upper);
end;
/* Take the ID for the current row in the and look it up in hash table.
If there is a match, return the value of lower/upper for that ID */
rc = h.Find();
/* Only output if the ID is between its threshold */
if(&lower. < Val < &upper.);
drop rc;
run;
If you prefer to use SQL, you can manipulate the SQL optimizer to force a hash join with the undocumented magic=103 option. This sometimes can be more efficient with joins on smaller tables.
proc sql magic=103;
create table two as
select t1.*
from one as t1
LEFT JOIN
black as t2
ON t1.id = t2.id
where t2.lower < t1.val < t2.upper
;
quit;
Related
I have a SAS dataset with 700 columns (variables). For all 700 of them, I want to cap all values below the 1st percentile to the 1st percentile and all values above the 99th percentile to the 99th percentile. I want to do this iteratively for all 700 variables without having to specify their names explicitly.
How can I do this?
Perhaps slightly easier than the hash table - and somewhat faster, I believe - is using the horizontal output of proc means, and then using an array.
proc means data=sashelp.prdsale;
var _numeric_;
output out=quantiles p1= p99= /autoname;
run;
proc sql;
select name
into :numlist separated by ' '
from dictionary.columns
where libname='SASHELP' and memname='PRDSALE' and type='num';
quit;
data prdsale_capped;
set sashelp.prdsale;
if _n_ eq 1 then set quantiles;
array vars &numlist.;
array p1 actual_p1--month_p1;
array p99 actual_p99--month_p99;
do _i = 1 to dim(vars);
vars[_i] = max(min(vars[_i],p99[_i]),p1[_i]);
end;
run;
Basically it's just setting up three arrays - vars, p1, p99 - and then you have all 3 values for every numeric variable on the PDV and can just compare during a single array traversal.
For a production process I'd probably not use the -- but instead make 3 lists from proc sql and make 100% sure they're in the same order by using an order by.
You can do this with proc means and a hash table lookup. Let's create some test data with 100 variables pulled from a normal distribution. For testing, we'll change all the variables in the first and second rows to really big and really small numbers.
Our approach: create a lookup table where we can find the variable's name, pull its percentiles, and compare its value against those percentiles.
data have;
array var[100];
do i = 1 to 100;
do j = 1 to dim(var);
var[j] = rand('normal');
/* Test values */
if(i = 1) then var[j] = 99999;
if(i = 2) then var[j] = -99999;
end;
output;
end;
drop i j;
run;
Data:
var1 var2 var3 ...
99999 99999 99999 ...
-99999 -99999 -99999 ...
-0.149875111 0.4455504523 -0.783127138 ...
-0.731432437 -0.572508065 -1.044928486 ...
0.0108184539 1.0605591996 1.9132874927 ...
... ... ... ...
Let's get all the percentiles with proc means. You might be tempted to use output out=, but it does not create the data in a vertical lookup table that's easy for us to use in this manner; however, the stackODSOutput option on proc means does. More info on this from Rick Wicklin.
We'll use ods select none so we don't render a large table but still produce the dataset that drives the table.
/* Get a dataset of all 1st and 99th percentiles for each variable */
ods select none;
proc means data=have stackODSOutput p1 p99;
var var1-var100;
ods output summary = percentiles;
run;
ods select all;
Note that all the percentiles will be the same in this case. This is expected. We set all the variables in the first and second rows to the same big and small numbers for easy testing.
Data:
Variable P1 P99 ...
var1 -50001 50001 ...
var2 -50001 50001 ...
var3 -50001 50001 ...
var4 -50001 50001 ...
... ... ... ...
Now we'll use our lookup approach. We know our variable names and we can store them in an array. We can loop through that array, look up the variable in the hash table by name with vname(), and get its percentile.
data want;
set have;
array var[*] var1-var100;
/* Load a table of these values into memory and search for each percentile.
Think of this like a simple lookup table that floats out in memory.
*/
if(_N_ = 1) then do;
length variable $32.;
dcl hash pctiles(dataset: 'percentiles');
pctiles.defineKey('variable');
pctiles.defineData('p1', 'p99');
pctiles.defineDone();
call missing(p1, p99);
end;
/* Get the 1st and 99th percentile of each variable.
If the variable's name matches the variable name
in the hash table, check the variable's value
against the lookup percentile.
Cap it if it's above or below the percentile.
*/
do i = 1 to dim(var);
if(pctiles.Find(key:vname(var[i]) ) = 0) then do;
if(var[i] < p1) then var[i] = p1;
else if(var[i] > p99) then var[i] = p99;
end;
end;
drop i variable p1 p99;
run;
Output:
var1 var2 var3 ...
50000.532908 50000.721522 99999 ...
-50000.61447 -50000.92196 -50001.19549 ...
-0.149875111 0.4455504523 -0.783127138 ...
-0.731432437 -0.572508065 -1.044928486 ...
0.0108184539 1.0605591996 1.9132874927 ...
... ... ... ...
If your variables do not follow an easy sequential name, you can use the -- shortcut. For example, varA varB varC varD can be selected by varA--varD.
I am trying to sum one variable as long as another remains constant. I want to cumulative sum dur as long as a is constant. when a changes the sum restarts. when a new id, the sum restarts.
enter image description here
and I would like to do this:
enter image description here
Thanks
You can use a BY statement to specify the variables whose different value combinations organize data rows into groups. You are resetting an accumulated value at the start of each group and adding to the accumulator at each row in the group. Use retain to maintain a new variables value between the DATA step implicit loop iterations. The SUM statement is a unique SAS feature for accumulating and retaining.
Example:
data want;
set have;
by id a;
if first.a then mysum = 0;
mysum + dur;
run;
The SUM statement is different than the SUM function.
<variable> + <expression>; * SUM statement, unique to SAS (not found in other languages);
can be thought of as
retain <variable>;
<variable> = sum (<variable>, <expression>);
As far as I am concerned, you need to self-join your table with a ranked column.
It should be ranked by id and a columns.
FROM WORK.QUERY_FOR_STCKOVRFLW t1; is the table you provided in the screenshot
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_STCKOVRFLW_0001 AS
SELECT t1.id,
t1.a,
t1.dur,
/* mono */
(monotonic()) AS mono
FROM WORK.QUERY_FOR_STCKOVRFLW t1;
QUIT;
PROC SORT
DATA=WORK.QUERY_FOR_STCKOVRFLW_0001
OUT=WORK.SORTTempTableSorted
;
BY id a;
RUN;
PROC RANK DATA = WORK.SORTTempTableSorted
TIES=MEAN
OUT=WORK.RANKRanked(LABEL="Rank Analysis for WORK.QUERY_FOR_STCKOVRFLW_0001");
BY id a;
VAR mono;
RANKS rank_mono ;
RUN; QUIT;
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_RANKRANKED AS
SELECT t1.id,
t1.a,
t1.dur,
/* SUM_of_dur */
(SUM(t2.dur)) FORMAT=BEST12. AS SUM_of_dur
FROM WORK.RANKRANKED t1
LEFT JOIN WORK.RANKRANKED t2 ON (t1.id = t2.id) AND (t1.a = t2.a AND (t1.rank_mono >= t2.rank_mono ))
GROUP BY t1.id,
t1.a,
t1.dur;
QUIT;
I'm trying to reference values that are in a stats table, as such:
/* Calculate Median and IQR */
PROC UNIVARIATE DATA = kddcup98(drop=TARGET_B) OUTTABLE= boxStats(keep=_VAR_ _Q1_ _Q3_ _QRANGE_) NOPRINT;
RUN;
/* Calculate upper and lower bounds */
DATA boxStats;
SET boxStats;
upper_bound = _Q3_ + 1.5*_QRANGE_;
lower_bound = _Q3_ - 1.5*_QRANGE_;
RUN;
DATA kddcup98_continuous;
SET kddcup98_continuous;
ARRAY Num_Col[*] _NUMERIC_;
DO i = 1 to dim(Num_Col);
IF Num_Col[i] > boxStats[i, "upper_bound"] OR Num_Col[i] < boxStats[i, "lower_bound"] THEN Num_Col[i] = .;
END;
RUN;
I have the main data table and a table of stats from which I computed upper and lower bounds. I need to reference those values from the boxStats table. How I can I reference those values?
Use OUTTABLE PROC statement option.
OUTTABLE=SAS-data-set
creates an output data set that contains univariate statistics arranged in tabular form, with one observation per analysis variable. See the section OUTTABLE= Output Data Set for details.
Im a new SAS User and I have a small problem
I have one large empty table A with lets say 100 columns that I have created with a simple proc sql; create table
I have another table B with lets say 40 columns and table C with 55 columns.
I want to add these two tables into table A, basically I want a table with 100 columns containing the data from table B & C and I'm doing this with a Union command.
Since I dont have values for all 100 variables I have to set default values.
Lets say I have a column named nourishment in table A, food in table B and has no equivalent in table C. I have rules like "If the data comes from table B then value =xxx if its from table C then Value="DefaultValue"
I'd do this easily with R or python but Im struggling with sas.
I'm using SAS sql commands (a Union command)
How do you set default values ? (for all data types : character, numeric or dateI'm using SAS sql commands )
Dates in SAS are actually just numeric values. Often they have a date format applied to make them readable.
So you could just assign a missing value by default like so:
. as ColumnName
or any default date like so
'17NOV2017'd as ColumnName
. as MyColumnName
SAS can deal with missing values.
Using a specially coded value, such as 'NA', to represent a missing value condition can work but may lead to headaches and extra coding. Recommended read in SAS help: "Working with Missing Values"
The default SAS missing value for numerics (which also includes dates) is period.
. as MyColumnName
SAS also has 27 special missing values for numerics that are expressed as . < character >
.A as MyColumnName
...
.Z as MyColumnName
._ as MyColumnName
The missing value for character variables is a single space
' '
'' empty quote string also works
' ' as does a longer empty string
Rule of thumb: be consistent when coding your missing values.
You can use OPTIONS MISSING to specify what character is shown when a missing value is printed.
OPTIONS MISSING = '*'; * My special representation of missing for this report;
Proc PRINT data=myData;
run;
OPTIONS MISSING = '.'; * Restore to the default;
SAS custom formats can also be used to customize what is printed for missing values.
Proc FORMAT;
value MissingN
. = 'N/A'
.N = 'Special N/A different than regular N/A' /* for .N */
;
value $MissingC
' ' = 'N/A'
;
value SillyChristmasStocking
.C = 'Bad'
.O = 'children'
.A = 'get'
.L = 'No toys'
;
The token after the value keyword can be any new valid SAS name that you want to used for your format name.
Proc PRINT data=myData;
format myColumnName MissingN.;
format name $MissingC.;
format behaviour SillyChristmasStocking.;
run;
As for your character missing value conditions, I would continue to use " " or ' '
You mention UNION which is a SQL feature. In SQL, JOIN also occur, perhaps more often then UNION. When JOINing and values from two source columns collide, you will want to use either COALESCE() function or CASE
statements to select the non-missing value.
I would not recommend using UNION in PROC SQL at any point in your SAS usage. UNION is almost always inferior to a simple data step, or a data step view.
That's because the data step seamlessly handles issues like differing variables on different tables. SAS is quite comfortable with vertically combining datasets; SQL is always a bit trickier when they're not identical.
data c;
set a b;
run;
That runs whether or not a and b are identical, so long as a and b don't have conflicting variable names (that aren't intended to be in the same column); and if they do, just use the rename dataset option to resolve it.
If you do as the above, and don't use union, you'll get a missing value automatically for those dates.
NFN:
DATA Step
A DATA Step approach for stacking data is the simplest. Use SET to stack the data and array processing to apply your defaults. For example:
data stacked_data;
set
TARGET_TEMPLATE (obs = 0)
ONE
TWO
;
array allchar _character_;
array allnum _numeric_;
array dates d1-d5;
do over allchar; if missing(allchar) then allchar = '*UNKNOWN*'; end;
do over allnum; if missing(allnum) then allnum = -995; end;
do over dates; if missing(dates) then dates='01NOV1971'd; end;
run;
A subtle issue is that any missing values in ONE or TWO will be replaced with the default value.
Proc SQL
In Proc SQL you will want to create a single row table containing the default values for A. That table can be joined to the union of B and C. The join select will involve coalesce() in order to choose the predefined default value when a column is not from B or C.
For example, suppose you have an empty (zero rows), richly columned, target table (your A) acting as a template:
data TARGET_TEMPLATE;
length _n_ 8;
length a1-a5 $25 d1-d5 4 x1-x20 y1-y20 z1-z20 p1-p20 q1-q20 r1-r20 8;
call missing (of _all_);
format d1-d5 yymmdd10.;
stop;
run;
Because Proc SQL does not provide syntax for a default constraint you need to create a table of your own defaults. This is probably easiest with DATA Step:
data TARGET_DEFAULTS;
if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv to match TARGET;
array allchar _character_ (1000 * '*UNKNOWN*');
array allnum _numeric_ (1000 * -995);
array d d1-d5 (5 * '01NOV1971'd); * override the allnum array initialization;
output;
stop;
run;
Here is some generated demo data, ONE and TWO, that correspond to your B and C:
data ONE;
if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv of demo data to match TARGET;
do _n_ = 1 to 100;
array a a1 a3 a5;
array num x: y: z:;
array d d1 d2;
do over a; a = catx (' ', 'ONE', _n_, _i_); end;
do over num; num = 1000 + _n_ + _i_; end;
retain foodate '01jan1975'd;
do over d; d=foodate; foodate+1; end;
output;
end;
keep a1 a3 a5 x: y: z: d1 d2; * keep the disparate columns that were populated;
run;
data TWO;
if 0 then set TARGET_TEMPLATE (obs=0); * prep pdv of demo data to match TARGET;
do _n_ = 1 to 200;
array a a1 a2 a3;
array num x5 y5 z5 p: q: r:;
array d d1 d2;
do over a; a = catx (' ', 'TWO', _n_, _i_); end;
do over num; num = 20000 + _n_*10 + _i_; end;
retain foodate '01jan1985'd;
do over d; d=foodate; foodate+1; end;
output;
end;
keep a1 a2 a3 x5 y5 z5 p: q: r:; * keep the disparate columns that were populated;
run;
A stacking of A, B and C is simple SQL but does not introduce target specific default values:
proc sql noprint;
* generic UNION stack with SAS missing values (space and dot) for cells
* where ONE and TWO did not contribute any data;
create table stacked_data as
select * from have_data_TEMPLATE %*** empty template first ensures template column order and formats are honored in output data;
outer union corresponding %*** align by column name, do not remove duplicates;
select * from ONE
outer union corresponding
select * from TWO
;
When the stacking is put in a sub-query, it can be joined with the defaults. The choosing of the target default value for each column involves examining DICTIONARY.COLUMNS and generating the SQL source for selecting the coalescence of stack and default.
proc sql noprint;
* codegen select items ;
select cat('coalesce(STACK.',trim(name),',DEFAULT.',trim(name),') as ',trim(name))
into :coalesces separated by ','
from DICTIONARY.COLUMNS
where libname = 'WORK' and memname = 'HAVE_DATA_TEMPLATE' %* dictionary lib and mem name values are always uppercase;
order by npos
;
create table stacked_data_with_defaults as
select * from TARGET_TEMPLATE %*** output honors template;
outer union corresponding
select
source
, &coalesces %*** apply codegen;
from
(
select * from WORK.have_data_TEMPLATE %*** ensure fully columned sub-select that will align with coalesces;
outer union corresponding
select 'one' as source, * from ONE
outer union corresponding
select 'two' as source, * from TWO
) as STACK
join
TARGET_DEFAULTS as DEFAULT
on 1=1
;
quit;
Why would you create an empty dataset? What is it going to be used for? Perhaps you want to use it as a default structure definition? If so and you want to stack B and C and get them in the structure defined by A you could code this way.
data want ;
set a(obs=0) b c ;
run;
Not sure what the purpose would be to have default values. Couldn't you use formats if you want missing values to display in special ways?
Or you could create code to default values and perhaps just %include it or wrap the logic into a macro. So it you had a code file name 'defaults.sas' with lines like this.
startdate=coalesce(startdate,'01JAN2013'd);
gender=coalescec(gender,'UNKNOWN');
Then your little program to make a new dataset that looks like A and uses the data from B and C would look like this.
data want ;
set a(obs=0) b c ;
%include 'defaults.sas';
run;
If you really did want to aggregate the records into some large dataset then perhaps you want to use PROC APPEND to add the records once they are created in the right structure.
proc append data=want base=a ;
run;
I have table1 that contains one column (city), I have a second table (table2) that has two columns (city, distance),
I am trying to create a third table, table 3, this table contains two columns (city, distance), the city in table 3 will come from the city column in table1 and the distance will be the corresponding distance in table2.
I tried doing this using Proc IML based on Joe's suggestion and this is what I have.
proc iml;
use Table1;
read all var _CHAR_ into Var2 ;
use Table2;
read all var _NUM_ into Var4;
read all var _CHAR_ into Var5;
do i=1 to nrow(Var2);
do j=1 to nrow(Var5);
if Var2[i,1] = Var5[j,1] then
x[i] = Var4[i];
end;
create Table3 from x;
append from x;
close Table3 ;
quit;
I am getting an error, matrix x has not been set to a value. Can somebody please help me here. Thanks in advance.
The technique you want to use is called the "unique-loc technique". It enables you to loop over unique values of a categorical variable (in this case, unique cities) and do something for each value (in this case, copy the distance into another array).
So that others can reprodce the idea, I've imbedded the data directly into the program:
T1_City = {"Gould","Boise City","Felt","Gould","Gould"};
T2_City = {"Gould","Boise City","Felt"};
T2_Dist = {10, 15, 12};
T1_Dist = j(nrow(T1_City),1,.); /* allocate vector for results */
do i = 1 to nrow(T2_City);
idx = loc(T1_City = T2_City[i]);
if ncol(idx)>0 then
T1_Dist[idx] = T2_Dist[i];
end;
print T1_City T1_Dist;
The IF-THEN statement is to prevent in case there are cities in Table2 that are not in Table1. You can read about why it is important to use that IF-THEN statement. The IF-THEN statement is not needed if Table2 contains all unique elements of Table1 cities.
This technique is discussed and used extensively in my book Statistical Programming with SAS/IML Software.
You need a nested loop, or to use a function that finds a value in another matrix.
IE:
do i = 1 to nrow(table1);
do j = 1 to nrow(table2);
...
end;
end;