Create Index (ID) and increment (SAS)

I have a very simple table with sales information.
Name | Value | Index
AAC | 1000 | 1
BTR | 500 | 2
GRS | 250 | 3
AAC | 100 | 4
I add a new column named Index and run this the first time:
DATA BSP;
Index = _N_;
SET BSP;
RUN;
This works fine the first time.
But now I keep adding new sales items, and each new line should get a new index number: the highest existing index + 1. The old sales should keep their index numbers. But when I run the code below, all new lines get index = 1. What is wrong with the code?
proc sql noprint;
select max(Index) into :max_ID from WORK.BSP;
quit;
DATA work.BSP;
SET work.BSP;
RETAIN new_Id &max_ID;
IF Index = . THEN DO;
new_ID + 1;
index = new_id;
END;
RUN;

You defined the value of the Index column in the first step. Where is the new data that you want to index? This code works fine with data like yours. Can you share your base dataset and the final dataset you want to end up with? Maybe your data is wrong?
(BTW, Index is not a lucky choice for a variable name :-))
data BSP;
Name="AAC";Value=1000;Index=1;output;
Name="BTR";Value=500;Index=2;output;
Name="GRS";Value=250;Index=3;output;
Name="AAC";Value=100;Index=4;output;
run;
/* the row where Index is not defined */
data BSPNew;
Name="XXX";Value=1000;output;
run;
proc sql noprint;
select max(Index) into :max_ID from WORK.BSP;
quit;
%put &max_Id.;
proc append base=BSP data=BSPNew force;
run;
DATA work.BSP;
SET work.BSP;
RETAIN new_Id &max_ID;
IF Index = . THEN DO;
new_ID + 1;
index = new_id;
END;
RUN;
data _null_;
set BSP;
put Name Value Index;
run ;
/* the result is:
AAC 1000 1
BTR 500 2
GRS 250 3
AAC 100 4
XXX 1000 5
*/

You need to show more of your code, something that will demonstrate the problem. The following example is the same as yours, but does not 'fail' to assign the desired index.
Example:
data master;
do name = 'A','B','C'; OUTPUT; end;
run;
data master;
set master;
index = _n_;
run;
data new;
do name = 'E','F','G'; OUTPUT; end;
run;
proc sql noprint;
insert into master(name) select name from new; * append new rows;
select max(index) into :next_index from master; * compute highest index known;
quit;
data master;
set master;
retain next_index &next_index; * utilize highest index;
if index = . then do;
next_index + 1; * increment highest index before applying;
index = next_index;
end;
drop next_index; * discard 'worker' variable;
run;
You may have inserted a 1 by accident if the insert statement looked like this:
insert into master select name, 1 from new;
or if the new data already has index set to 1:
insert into master select name, index from new;
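One quick way to narrow it down (my addition, not part of the original answer) is to profile index right after the append: NMISS > 0 means the new rows arrived without an index, while a pile-up at index = 1 points at an accidentally inserted constant.
proc means data=master n nmiss min max;
var index; /* check the example's master table after the insert */
run;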

Related

How do I add in rows with specific values missing in a single DATA step?

Here is a simple example I came up with. There are 3 players here (id is 1,2,3) and each player gets 3 attempts at the game (attempt is 1,2,3).
data have;
infile datalines delimiter=",";
input id attempt score;
datalines;
1,1,100
1,2,200
2,1,150
3,1,60
;
run;
I would like to add in rows where the score is missing if they did not play attempt 2 or attempt 3.
data want;
set have;
by id attempt;
* ??? ;
run;
proc print data=have;
run;
The output would look something like this.
1 1 100
1 2 200
1 3 .
2 1 150
2 2 .
2 3 .
3 1 60
3 2 .
3 3 .
How do I go about doing this?
You could solve this by first creating a table with the structure you want to see: for each ID, three attempts. This structure can then be joined with a 'left join' to your 'have' table to get the actual scores where they exist and a missing value where they don't.
/* Create table with all ids for which the structure needs to be created */
proc sql;
create table ids as
select distinct id from have;
quit;
/* Create table structure with 3 attempts per ID */
data ids (drop = i);
set ids;
do i = 1 to 3;
attempt = i;
output;
end;
run;
/* Join the table structure to the actual scores in the have table */
proc sql;
create table want as
select a.*,
b.score
from ids a left join have b on a.id = b.id and a.attempt = b.attempt;
quit;
A table of possible attempts, cross joined with the distinct ids and then left joined to the data, will produce the desired result set.
Example:
data have;
infile datalines delimiter=",";
input id attempt score;
datalines;
1,1,100
1,2,200
2,1,150
3,1,60
;
data attempts;
do attempt = 1 to 3; output; end;
run;
proc sql;
create table want as
select
each_id.id,
each_attempt.attempt,
have.score
from
(select distinct id from have) each_id
cross join
attempts each_attempt
left join
have
on
each_id.id = have.id
and each_attempt.attempt = have.attempt
order by
id, attempt
;
quit;
Update: I figured it out.
proc sort data=have;
by id attempt;
data want;
set have (rename=(attempt=orig_attempt score=orig_score));
by id;
** Previous attempt number **;
retain prev;
if first.id then prev = 0;
** If there is a gap between previous attempt and current attempt, output a blank record for each intervening attempt **;
if orig_attempt > prev + 1 then do attempt = prev + 1 to orig_attempt - 1;
score = .;
output;
end;
** Output current attempt **;
attempt = orig_attempt;
score = orig_score;
output;
** If this is the last record and there are more attempts that should be included, output dummy records for them **;
** (Assumes that you know the maximum number of attempts) **;
if last.id & attempt < 3 then do attempt = attempt + 1 to 3;
score = .;
output;
end;
** Update last attempt used in this iteration **;
prev = attempt;
run;
Here is an alternative DATA step, the DOW-loop way: the DO UNTIL loop reads and outputs one entire BY group per DATA step iteration, so the padding loop after it runs exactly once per id:
data want;
do until (last.id);
set have;
by id;
output;
end;
call missing(score);
do attempt = attempt+1 to 3;
output;
end;
run;
If the absent observations are only at the end, then you can just use a couple of OUTPUT statements and a DO loop: write each observation as it is read, and if the last one is NOT attempt 3, add more observations until you get to attempt 3.
data want1;
set have ;
by id;
output;
score=.;
if last.id then do attempt=attempt+1 to 3;
output;
end;
run;
If the absent attempts can appear anywhere, then you need to "look ahead" to see whether the next observation skips any attempts.
data want2;
set have end=eof;
by id ;
if not eof then set have (firstobs=2 keep=attempt rename=(attempt=next));
if last.id then next=3+1; * one past the maximum attempt, so the fill loop below runs out to 3;
output;
score=.;
do attempt=attempt+1 to next-1;
output;
end;
drop next;
run;

Designing new RK number for unique record

I am a SAS developer. I am starting a project that requires me to assign an RK number to each unique record. Every extraction will get some data that already exists in the target table and some that may not.
For example.
Source Data:
Name
A
B
C
D
E
Target Table:
Name RK
A 1
B 2
C 3
When I load, I want it to insert D and E into the target table with RK 4 and 5 respectively. Currently, I can think of doing a hash lookup from the source against the target table. For data that is not matched by the hash object, the RK field will be blank. I will then take the max RK number from the target table and increment it by 1 for each new row as I append D and E.
I am not sure if this is the most efficient way of doing so. Is there another more efficient way?
You could use a hash to determine whether some name (I'll call it value) already exists in the target table. However, new keys would have to be tracked, output at the end of the step, and then PROC APPEND'd to the target table (I'll call it master).
For the case of just updating the master table with new RK values, a traditional SAS approach is to use a DATA step to MODIFY a unique keyed master table. The coding pattern is:
SET <source>
MODIFY <master> KEY=<value> / UNIQUE;
... _IORC_ logic ...
Example:
%* Create some source data and the master table;
data have1 have2 have3 have4 have5;
call streaminit(123);
value = 2020; output; output; output;
do _n_ = 1 to 2500;
value = ceil(rand('uniform', 5000));
select;
when (rand('uniform') < 0.20) output have1;
when (rand('uniform') < 0.20) output have2;
when (rand('uniform') < 0.20) output have3;
when (rand('uniform') < 0.20) output have4;
otherwise output have5;
end;
end;
run;
data have6;
do _n_ = 1 to 20;
value = 2020;
output;
end;
run;
* Create the unique keyed master table;
* Typically done once and stored in a permanent library.;
proc sql;
create table keys (value integer, RK integer);
create distinct index value on work.keys;
quit;
%* A macro for adding new RK values as needed;
%macro RK_ASSIGN(master, data);
%local last;
proc sql noprint;
select max(RK) into :last trimmed from &master;
quit;
data &master;
retain newkey %sysevalf(0&last+0); %* trickery for 1st use case when max(RK) is .;
set &data;
modify &master key=value / unique;
if _iorc_ eq %sysrc(_DSENOM);
newkey + 1;
RK = newkey;
output;
_error_ = 0;
run;
%mend;
%* Use the macro to process source data;
%RK_ASSIGN(keys,have1)
%RK_ASSIGN(keys,have2)
%RK_ASSIGN(keys,have3)
%RK_ASSIGN(keys,have4)
%RK_ASSIGN(keys,have5)
%RK_ASSIGN(keys,have6)
You can see the forced repeats of the 2020 value in the source data are only RK'd once in the master table, and there are no errors during processing.
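A quick duplicate check confirms this (an optional verification, not in the original answer):
proc sql;
select value, count(*) as n
from keys
group by value
having count(*) > 1; /* zero rows returned means each value was RK'd exactly once */
quit;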
If you want to backfill the source data with the found or assigned RK value, there would be additional steps. You could update a custom format, or do a traditional left join (sketched below). If you want to do the backfill during a single read over the source data, the HASH step + APPEND-new-RKs approach might be preferable.
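A minimal sketch of that left-join backfill, assuming one of the source tables above (have1) and the keys master:
proc sql;
create table have1_rk as
select s.value, k.RK
from have1 s
left join keys k
on s.value = k.value; /* RK comes back missing only if the value was never keyed */
quit;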
Example 2: the master table is named values.
HASH version, with RK assignment added to the source data. New RKs are output and appended.
proc sql;
create table values (value integer, RK integer);
create distinct index value on work.values;
quit;
%macro RK_HASH_ASSIGN(master,data);
%local last;
proc sql noprint;
select max(RK) into :last trimmed from &master;
quit;
data &data(drop=next_RK);
set &data end=end;
if _n_ = 1 then do;
declare hash lookup (dataset:"&master");
lookup.defineKey("value");
lookup.defineData("value", "RK");
lookup.defineDone();
declare hash newlookup (dataset:"&master(obs=0)");
newlookup.defineKey("value");
newlookup.defineData("value", "RK");
newlookup.defineDone();
end;
retain next_RK %sysevalf(0&last+0); %* trick;
* either load existing RK from hash, or compute and apply next RK value;
if lookup.find() ne 0 then do;
next_RK + 1;
RK = next_RK;
lookup.add();
newlookup.add();
end;
if end then do;
newlookup.output(dataset:'work.newmasters');
end;
run;
proc append base=&master data=work.newmasters;
proc delete data=work.newmasters;
run;
%mend;
%RK_HASH_ASSIGN(values,have1)
%RK_HASH_ASSIGN(values,have2)
%RK_HASH_ASSIGN(values,have3)
%RK_HASH_ASSIGN(values,have4)
%RK_HASH_ASSIGN(values,have5)
%RK_HASH_ASSIGN(values,have6)
%* Compare the two assignment strategies, no differences!;
proc sort force data=values(index=(value)); /* FORCE is required to sort a dataset that has an index */
by RK;
run;
proc compare noprint base=keys compare=values out=diffs outnoequal;
by RK;
run;
----- LOG -----
2525 proc compare noprint base=keys compare=values out=diffs
outnoequal <------------- do not output when data is identical ;
;
2526 by RK;
2527 run;
NOTE: There were 215971 observations read from the data set WORK.KEYS.
NOTE: There were 215971 observations read from the data set WORK.VALUES.
NOTE: The data set WORK.DIFFS has 0 observations and 4 variables. <--- all the same ---
NOTE: PROCEDURE COMPARE used (Total process time):
real time 0.25 seconds
cpu time 0.26 seconds

Left join a bucket value based on a greater than clause

I am looking to create an optimal bucketing macro. My first obstacle is to create equidistant buckets. I am using the sashelp.baseball dataset as an example.
I take the range of logsalary and divide it by 100 to create the distance between each bucket. Then I would like to assign the logsalary column a bucket value where the logsalary is smaller than the bucket's upper limit.
The code I have tried is attached. I am hoping to be able to join or merge on the bucket limit values and use a greater-than or smaller-than clause to append a bucket value.
/*Sort the baseball dataset by smallest to largest, removing any missing data*/
PROC SORT
DATA = sashelp.baseball
(KEEP = logsalary
WHERE = (NOT MISSING(logsalary)))
OUT = baseball;
BY logsalary;
RUN;
/*Identify the size of each bucket by splitting the range into 100 equidistant buckets*/
DATA _NULL_;
RETAIN bin_size;
SET baseball END = EOF;
IF _N_ = 1 THEN DO;
bin_size = logsalary;
CALL SYMPUT("min_bin",logsalary);
END;
IF EOF THEN DO;
bin_size = ((logsalary - bin_size) / 100);
CALL SYMPUT("bin_size",bin_size);
END;
RUN;
/*Create a vector to identify each bucket range*/
DATA bin_levels;
DO bin = 1 TO 100;
IF bin = 1 THEN DO;
bin_level = &min_bin.;
OUTPUT;
END;
ELSE DO;
bin_level = &min_bin. + &bin_size. * bin;
OUTPUT;
END;
END;
RUN;
/*Append a bucket number based on the logsalary being smaller than the next bucket value*/
PROC SQL;
CREATE TABLE binned_data AS
SELECT
a.*
, b.bin
, b.bin_level
FROM
baseball a
LEFT JOIN
bin_levels b ON b.bin_level > a.logsalary
;
QUIT;
I would like to see the first ten rows look like this
logSalary bin
4.2121275979 1
4.2195077052 1
4.248495242 1
4.248495242 1
4.248495242 1
4.248495242 1
4.248495242 1
4.3174881135 2
4.3174881135 2
4.3174881135 2
...
Thanks in advance
EDIT: for now, I am going to go with this solution
DATA bucketed_data;
RETAIN bin bin_limit;
SET baseball;
IF _n_ = 1 THEN DO;
bin_limit = logsalary;
bin = 1;
END;
DO WHILE (logsalary > bin_limit); /* WHILE rather than IF, so a jump across several empty buckets is handled */
bin_limit + &bin_size.;
bin + 1;
END;
RUN;
No need for macro variables: put the values into a dataset and combine that dataset with the one you want to bin. Let's use 10 bins instead of 100 to make it easier to examine the results.
First find the minimum and range:
proc means n min max data=sashelp.baseball;
var logsalary;
output out=stats(keep=min range) min=min range=range;
run;
Then use those to bin the data:
DATA bucketed_data;
SET sashelp.baseball (keep=logsalary);
if _n_=1 then set stats;
if not missing(logsalary) then do bin=1 to 10 while(logsalary > min+bin*(range/10));
* nothing to do here ;
end;
run;
Let's use PROC MEANS to see how it worked.
proc means data=bucketed_data n min max;
class bin / missing;
var logsalary;
run;
Results: (output listing not reproduced here)

Counting categorical variables on row in SAS

Sample Data
I was wondering if it is possible to use a DATA step instead of a PROC to count the number of categorical variables on a row, as shown in the 'count' example above. This would allow me to further use the data, e.g. COUNT=1 or COUNT > 1, to check morbidity.
Also, will it be possible to then count the number of each diagnosis in the entire data set per patient while accounting for duplicates, if there are any? For example, there are 3 CB's and 2 AA's in this data set, but CB should be 2 because patient 2 had it recorded twice.
Thank you for your time and have a lovely new year.
Your question is not clear, but you could manage your diagnoses using UNION ALL and COUNT DISTINCT:
select patient, count(distinct diag)
from (
select patient, diag1 as diag
from my_table
union all
select patient, diag2
from my_table
union all
select patient, diag3
from my_table
union all
select patient, diag4
from my_table
) t
group by patient
or simply UNION and COUNT:
select patient, count(diag)
from (
select patient, diag1 as diag
from my_table
union
select patient, diag2
from my_table
union
select patient, diag3
from my_table
union
select patient, diag4
from my_table
) t
group by patient
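In SAS, either query would run inside PROC SQL. A minimal runnable sketch of the first one, assuming a table my_table with columns patient and diag1-diag4:
proc sql;
create table diag_counts as
select patient, count(distinct diag) as n_diag
from (
select patient, diag1 as diag from my_table
union all
select patient, diag2 from my_table
union all
select patient, diag3 from my_table
union all
select patient, diag4 from my_table
) as t
group by patient;
quit;
COUNT(DISTINCT diag) ignores missing values, so a patient whose diagnosis columns are all blank gets n_diag = 0.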
The image indicates that for each row you want a count of the number of columns with non-missing values. Additionally, you apparently have some way to do this using a PROC step, but would like to know how to do it in a DATA step.
In a DATA step you can count the number of non-missing values indirectly using CMISS, or directly using COUNTC against a constructed value (the SENTINEL token guarantees that CATX emits exactly one '~' separator per non-missing value):
data have;
attrib pid length=8 diag1-diag4 length=$5;
input pid & diag1-diag4;
datalines;
1 AA J9 HH6 .
2 CB . . CB
3 J10 AA CB J10
4 B B . F90 .
5 J10 . . .
6 . . . .
;
run;
data have_with_count;
set have;
count = 4 - cmiss (of diag1-diag4);
count_way2 = countc(catx('~', of diag1-diag4, 'SENTINEL'), '~');
run;
In order to work against a MySQL data source you will also need a libref that connects you to that remote data server.
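For example, a libref through the SAS/ACCESS Interface to MySQL might look like the sketch below; the server, database, table, and credentials are placeholders, and SAS/ACCESS to MySQL must be licensed:
libname mydb mysql server='dbserver.example.com' port=3306
database=clinic user=sasuser password='XXXX';
data have;
set mydb.diagnoses; /* read the remote table through the libref */
run;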
Added
Counting distinct values across a row can be accomplished using a hash object or CALL SORTC. Consider this example, which sorts a copy of the row data (as an array) and counts the unique values within (a hash sketch follows after it):
data want;
set have;
array diag diag1-diag4;
array v(4) $5 _temporary_;
do _n_ = 1 to dim(diag);
v(_n_) = diag(_n_);
end;
call sortc(of v(*));
uniq = 0;
do _n_ = 1 to dim(v);
if missing(v(_n_)) then continue;
if uniq = 0 then
uniq + 1;
else
uniq + ( v(_n_) ne v(_n_-1) );
end;
run;
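For the hash variant mentioned above, here is a sketch of my own construction (the original answer demonstrated only SORTC): a fresh hash instance per row rejects duplicate keys, so NUM_ITEMS ends up holding the distinct count.
data want_hash;
set have;
array diag diag1-diag4;
length _v $5;
declare hash h(); /* a new instance on every iteration */
h.defineKey('_v');
h.defineDone();
do _n_ = 1 to dim(diag);
_v = diag(_n_);
if not missing(_v) then rc = h.add(); /* ADD returns nonzero for a duplicate key */
end;
uniq = h.num_items;
h.delete(); /* free the instance before the next row */
drop _v rc;
run;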
Using Richard's dummy data, to count the number of diagnoses and the number of unique diagnoses:
data want;
set have;
array var diag:;
length temp $30;
call missing(diag_num);
do over var;
if not missing(var) then do;
diag_num+1;
temp=ifc(findw(temp, strip(var)) > 0, temp, catx(' ', temp, var)); /* append VAR only when it is not already a word in TEMP */
end;
end;
unique_diag=countw(temp);
drop temp;
run;

Crosstable displaying frequency combination of N variables in SAS

What I've got:
a table of 20 rows in SAS (originally 100k)
various binary attributes (columns)
What I'm looking to get:
A crosstable displaying the frequency of the attribute combinations
like this:
Attribute1 Attribute2 Attribute3 Attribute4
Attribute1 5 0 1 2
Attribute2 0 3 0 3
Attribute3 2 0 5 4
Attribute4 1 2 0 10
*The actual sum of combinations is made up and probably not 100% logical
The code I currently have:
/*create dummy data*/
data monthly_sales (drop=i);
do i=1 to 20;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
output;
end;
run;
I guess this can be done smarter, but this seems to work. First I created a table that should hold all the frequencies:
data crosstable;
Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;output;output;output;output;
run;
Then I loop through all the combinations, inserting the count into the crosstable:
%macro lup();
%do i=1 %to 4;
%do j=&i %to 4;
proc sql noprint;
select count(*) into :Antall&i&j
from monthly_sales (where=(Attribute&i and Attribute&j));
quit;
data crosstable;
set crosstable;
if _n_=&j then Attribute&i=&&Antall&i&j;
if _n_=&i then Attribute&j=&&Antall&i&j;
run;
%end;
%end;
%mend;
%lup;
Note that since the frequency count for (i,j) equals the count for (j,i), you do not need to compute both.
I'd recommend using the built-in SAS tools for this sort of thing, and probably displaying your data slightly differently as well, unless you really want a diagonal table. e.g.
data monthly_sales (drop=i);
do i=1 to 20;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
count = 1;
output;
end;
run;
proc freq data = monthly_sales noprint;
table attribute1 * attribute2 * attribute3 * attribute4 / out = frequency_table;
run;
proc summary nway data = monthly_sales;
class attribute1 attribute2 attribute3 attribute4;
var count;
output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;
Either of these gives you a table with 1 row for each combination of attributes in your data, which is slightly different from what you requested, but conveys the same information. You can force proc summary to include rows for combinations of class variables that don't exist in your data by using the completetypes option in the proc summary statement.
It's definitely worth taking the time to get familiar with proc summary if you're doing statistical analysis in SAS - you can include additional output statistics and process multiple variables with minimal additional code and processing overhead, as sketched below.
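For instance, a sketch of that kind of one-statement extension (the statistic names are my own choice):
proc summary nway data = monthly_sales;
class attribute1 attribute2;
var count;
output out = stats(drop = _TYPE_ _FREQ_)
sum(count)=total mean(count)=avg_share n(count)=n_rows;
run;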
Update: it is possible to produce the desired table without resorting to macro logic, albeit via a rather complex process:
proc summary data = monthly_sales completetypes;
ways 1 2; /*Calculate only 1 and 2-way summaries*/
class attribute1 attribute2 attribute3 attribute4;
var count;
output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;
/*Eliminate unnecessary output rows*/
data summary_table;
set summary_table;
array a{*} attribute:;
sum = sum(of a[*]);
missing = 0;
do i = 1 to dim(a);
missing + missing(a[i]);
a[i] = a[i] * count;
end;
/*We want rows where two attributes are both 1 (sum = 2),
or one attribute is 1 and the others are all missing*/
if sum = 2 or (sum = 1 and missing = dim(a) - 1);
drop i missing sum;
edge = _n_;
run;
/*Transpose into long format - 1 row per combination of vars*/
proc transpose data = summary_table out = tr_table(where = (not(missing(col1))));
by edge;
var attribute:;
run;
/*Use cartesian join to produce table containing desired frequencies (still not in the right shape)*/
option linesize = 150;
proc sql noprint _method _tree;
create table diagonal as
select a._name_ as aname,
b._name_ as bname,
a.col1 as count
from tr_table a, tr_table b
where a.edge = b.edge
group by a.edge
having (count(a.edge) = 4 and aname ne bname) or count(a.edge) = 1
order by aname, bname
;
quit;
/*Transpose the table into the right shape*/
proc transpose data = diagonal out = want(drop = _name_);
by aname;
id bname;
var count;
run;
/*Re-order variables and set missing values to zero*/
data want;
informat aname attribute1-attribute4;
set want;
array a{*} attribute:;
do i = 1 to dim(a);
a[i] = sum(a[i],0);
end;
drop i;
run;
Yeah, user667489 was right, I just added some extra code to get the cross-frequency table looking good. First, I created a table with 10 million rows and 10 variables:
data monthly_sales (drop=i);
do i=1 to 10000000;
Attribute1=rand("Normal")>0.5;
Attribute2=rand("Normal")>0.5;
Attribute3=rand("Normal")>0.5;
Attribute4=rand("Normal")>0.5;
Attribute5=rand("Normal")>0.5;
Attribute6=rand("Normal")>0.5;
Attribute7=rand("Normal")>0.5;
Attribute8=rand("Normal")>0.5;
Attribute9=rand("Normal")>0.5;
Attribute10=rand("Normal")>0.5;
output;
end;
run;
Create an empty 10x10 crosstable:
data crosstable;
Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;Attribute5=.;Attribute6=.;Attribute7=.;Attribute8=.;Attribute9=.;Attribute10=.;
output;output;output;output;output;output;output;output;output;output;
run;
Create a frequency table using proc freq:
proc freq data = monthly_sales noprint;
table attribute1 * attribute2 * attribute3 * attribute4 * attribute5 * attribute6 * attribute7 * attribute8 * attribute9 * attribute10
/ out = frequency_table;
run;
Loop through all the combinations of Attributes and sum the "count" variable. Insert it into the crosstable:
%macro lup();
%do i=1 %to 10;
%do j=&i %to 10;
proc sql noprint;
select sum(count) into :Antall&i&j
from frequency_table (where=(Attribute&i and Attribute&j));
quit;
data crosstable;
set crosstable;
if _n_=&j then Attribute&i=&&Antall&i&j;
if _n_=&i then Attribute&j=&&Antall&i&j;
run;
%end;
%end;
%mend;
%lup;
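One cosmetic addition of my own (not in the original): the finished crosstable has no row labels, so a short extra step can tag each row with the attribute it represents:
data crosstable;
length rowname $12;
set crosstable;
rowname = cats('Attribute', _n_); /* row order matches the attribute number */
run;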