I have a data set with variables col1-col5 that contain a numeric identifier. There are 2000 possible values of the identifier. I also have 2000 variables named dX, where X is one of the 2000 different identifier values.
I want to loop over the col variables and then set the corresponding d variable that is indexed by the identifier to equal 1.
For example, suppose I have the observation:
col1   col2   col3  col4  col5  d10007  d10010  d10031  ...  d10057  ...
10031  10057  .     .     .     .       .       .            .
I would want to set d10031 and d10057 to both equal 1.
Is this possible? If the numbers were sequential I see how to use an array, but given that they're not I can't see how to do it.
It can be done in an array. I'll explain, after the mandatory polemic about data structure.
This looks like it should be a vertical data structure, i.e., single COL and D variables with multiple rows (and some ID tying them together).
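For illustration, a minimal sketch of that vertical layout (the id variable here is hypothetical - whatever ties the rows together):

data long;
  set have;
  array cols col1-col5;
  do _i = 1 to dim(cols);
    if not missing(cols[_i]) then do;
      identifier = cols[_i];
      output;
    end;
  end;
  keep id identifier;
run;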
Now, to do this in the structure you have:
You need to use the VNAME function. This allows you to get at the name of the array variable as a string. You can't take col1=10531 and create a statement d10531=1, but you can look at d10531 and compare its value to col1.
This is slow, because you need to loop twice over your variables, unless you have a reliable ordering. Your data above does respect the ordering (i.e., COL1-COLn are in ascending order, and the D variables are in ascending order), so you can move left to right and avoid the second loop. If that's not the case, you may want to use CALL SORTN on the COL array, if that's acceptable (see the one-liner below). The D array should be definable in the proper order; if the variables aren't in the proper order on the dataset, you can construct the array statement in a macro variable, ordering the variables there. The order on the ARRAY statement matters; the order in the dataset does not, unless you're using the d: shortcut.
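For instance, a hedged one-liner inside the data step, after the ARRAY statement, assuming the COL variables are numeric:

  call sortn(of cols[*]); /* sorts col1-col2000 ascending in place; note missing values sort to the front, so start the left-to-right walk past them */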
Here's an example of the left to right structure.
data want;
  set have;
  array cols col1-col2000;
  array ds d1:; *or whatever defines d;
  _citer = 1;
  do _diter = 1 to dim(ds) while (_citer le dim(cols)); *safety check;
    if input(compress(vname(ds[_diter]),,'kd'), best32.) = cols[_citer] then do;
      ds[_diter] = 1;
      _citer + 1;
    end;
  end;
run;
It iterates over DS, checks each element against the current COL, and when it finds the match, sets that variable to 1 and moves on. This should be flexible; it would work with any structure of DS, even one with very many values. It will not, however, work if COLS is not sorted in ascending order of value. If it's not, you would need an inner loop to check each COLS variable, meaning [dim(ds)*dim(cols)] loop iterations instead of at most [dim(ds)], as in the sketch below.
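If COLS can't be sorted, a minimal sketch of that brute-force variant:

data want;
  set have;
  array cols col1-col2000;
  array ds d1:; *or whatever defines d;
  do _diter = 1 to dim(ds);
    do _citer = 1 to dim(cols);
      if input(compress(vname(ds[_diter]),,'kd'), best32.) = cols[_citer]
        then ds[_diter] = 1;
    end;
  end;
  drop _citer _diter;
run;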
Another alternative is to just create the entire sequential D array, then drop the 'fake' d variables at the end, like so:
data have;
  Col1 = 10;
  Col2 = 35;
  Col3 = .;
  array Dvars {*} d1 d10 d25 d35;
run;
/* Get a list of all actual D variables */
proc sql noprint;
  select name into :dColumnsToKeep separated by ' '
    from SASHELP.VCOLUMN
    where libname = "WORK" and memname = "HAVE"
      and name like 'd%';
quit;
%put &dColumnsToKeep;
data want (keep=Col: &dColumnsToKeep);
  set have;
  array AllDVars {*} d1-d9999; *set d9999 as big as needed;
  array ColVars {*} Col:;
  do i = 1 to dim(ColVars);
    if ColVars(i) ne . then AllDVars(ColVars(i)) = 1;
  end;
run;
This may be quicker processing, since it avoids looping over the full D array (it indexes directly by value). Though I don't know what the tradeoff is memory-wise when SAS creates 10K or 100K variables in the data step.
I'm calculating a hash (MD5) row by row of an entire table, one hash for each row.
The table is selected by the user with a prompt (&sel_lib for the selected library and &sel_tab for the table).
First, I get my columns from sashelp.vcolumn and concatenate them all into a variable called lista:
data contents;
  do until (last.memname);
    set sashelp.vcolumn;
    where upcase(libname) = "&sel_livraria"
      and upcase(memname) = "&sel_tabela";
    by libname memname varnum;
    length lista $&max_len; /* sum(len col) + number of col */
    lista = catx(',', lista, name);
  end;
  keep libname memname varnum lista;
run;
Second, I write that value into another table, because I need it for other operations not related to this issue:
proc sql;
  update md5_table
    set nom_colunas = (
      select lista
      from contents)
    where libname = "&sel_lib"
      and memname = "&sel_tab";
quit;
Third, I pass those concatenated columns to the macro variable scolunas:
proc sql;
  select nom_colunas
    into :scolunas
    from md5_table
    where libname = "&sel_lib"
      and memname = "&sel_tab";
quit;
Then I use it to run my hash in my data step:
data want;
  livraria = "&sel_lib";
  tabela = "&sel_tab";
  length check $32;
  format check $hex32.;
  set &sel_livraria..&sel_tabela.;
  check = md5(cats(&scolunas));
  hash = put(check, $hex32.);
  keep livraria tabela hash;
  put hash;
  put _all_;
run;
My problem is that I need to compare the output with the output of the same table on another server (platform migration), so I need a reference to compare the two.
I'd prefer to do that by adding a row number from the source table (&sel_lib &sel_tab) to my data set. Is there any way to do that?
A more complex option would be adding the concatenated PK values to it.
Thanks in advance.
STRIP will remove leading and trailing spaces of only one expression (i.e. variable) and thus accepts only a single argument.
You want your comma separated variable list concatenated as input to MD5.
CATS() will concatenate variables and implicitly strip.
...
check = md5(cats(&scolunas));
...
If you think that, in some edge cases, data values may produce the same concatenation in different contexts, use CATX(<sep>, ...) to explicitly separate the fields when concatenating.
Example:
A   B    C   CATS()  CATX('|', ...)
--  ---  --  ------  --------------
PP  QQ   RR  PPQQRR  PP|QQ|RR
P   PQQ  RR  PPQQRR  P|PQQ|RR
I have the following data:
data df;
  input id $ d1 d2 d3;
  datalines;
a . 2 3
b . . .
c 1 . 3
d . . .
;
run;
I want to apply some transformation/operation across a subset of columns. In this case, that means dropping all rows where columns prefixed with d are all missing/null.
Here's one way I accomplished this, taking heavy influence from this SO post.
First, sum all numeric columns, row-wise.
data df_total;
  set df;
  total = sum(of _numeric_);
run;
Next, drop all rows where total is missing/null.
data df_final;
  set df_total;
  where total is not missing;
run;
Which gives me the output I wanted:
a . 2 3
c 1 . 3
My issue, however, is that this approach assumes there's only one "primary-key" column (id, in this case) and that everything else is numeric and should be part of this sum(of _numeric_) is not missing logic.
In reality, I have a diverse array of other columns in the original dataset, df, and it's not feasible to simply write them all out. I do know that the columns I want to run this "test" on are all prefixed with d (more specifically, they match the pattern d<mm><dd>).
How can I extend this approach to a particular subset of columns?
Use a different shortcut variable list, since you know they all start with D:

total = sum(of D:);
if n(of D:) = 0 then delete;

This will pick up the variables that are numeric and start with D. If you have variables you want to exclude that also start with D, that's problematic.
Since the variables are numeric, you can also use the N() function, as above, which counts the non-missing values in the row. In general, though, SAS will handle this automatically for most PROCs, such as REG/GLM (not in a data step, obviously).
If that doesn't work for some reason you can query the list of variables from the sashelp table.
proc sql noprint;
  select name into :var_list separated by ", "
    from sashelp.vcolumn
    where libname = 'WORK' and memname = 'DF'
      and upcase(name) like 'D%';
quit;
data df_final;
  set df;
  if n(&var_list.) = 0 then delete;
run;
I have created a SAS program that generates many SAS datasets. Now I want to append all of them to a single Excel file. So first I want to convert the column headers of each SAS dataset into a first observation, then leave space between the datasets (adding a blank observation). How can I do that?
One way to do this would be to use dictionary.columns:
proc sql;
  create table Attribute as
    select * from dictionary.columns;
quit;
Read through the table and check which attributes you are interested in. For your case, the column NAME contains the names of all columns.
Narrow the result by adding a WHERE clause to the PROC SQL step based on the identity of the columns (which library / what type of file / which file), e.g. where upcase(libname)="WORK".
data attribute;
  array column [n] $ length; /* hardcode n (the number of columns) and length, or supply them via macro variables */
  do i = 1 to n;
    set attribute (keep=name);
    column[i] = name;
  end;
run;
Then I would proceed with the data step above. You could store the column names in a macro variable with SELECT ... INTO :, but you would still need to hardcode the array size n, or use some other method that stores the count in one place (see the sketch below). Also remember to define the length and type of the array accordingly. You can name the variables in the resulting dataset Attribute by adding var1-varn after the length in the ARRAY statement.
For simplicity I use a SET statement to read the observations one by one and store the value of column NAME (the official column name as recorded by dictionary.columns) into the array.
Note that creating a non-temporary array creates the corresponding variable(s).
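For example, a hedged way to avoid hardcoding n and the length - run this against the Attribute table before the array step overwrites it:

proc sql noprint;
  select count(*), max(length(name))
    into :n trimmed, :len trimmed
    from attribute;
quit;

The array statement can then be written as column [&n] $ &len.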
If you want to add the blank row:
data younameit;
  merge attribute attribute(firstobs=2 keep=name rename=(name=_name));
  output;
  if name ne _name then do;
    call missing(of _all_);
    output;
  end;
run;
Because the two input streams are offset by one observation and column names do not repeat within one dataset, name ne _name is true on the last column of each dataset, so the row following that valid observation (produced by the first OUTPUT statement) comes out empty thanks to call missing(of _all_); output;.
Sounds like you just want to combine the datasets and write the results to the Excel file. Do you really need the extra empty row?
libname out xlsx 'myfile.xlsx';

data out.report;
  set ds1 ds2 ...;
run;
Ensure that all your columns are character (or all numeric; substitute _numeric_ accordingly), then in your data step use:
array names{*} _character_;
do i = 1 to dim(names);
  call label(names{i}, names{i}); /* replace each variable's value with its own label (or name, if no label) */
end;
output;
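Put together, a minimal sketch of the whole pattern, assuming a dataset HAVE whose columns are all character (the dataset names here are hypothetical):

data header;
  set have(obs=1);
  array names{*} _character_;
  do i = 1 to dim(names);
    call label(names{i}, names{i}); /* replace each value with the variable's label (or name) */
  end;
  output;
  drop i;
run;

data combined;
  set header have;
run;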
I'm trying to compare multiple sets of data by putting them into separate groups between two numbers. Originally I had statements like

if COLUMN1 gt 0 and COLUMN1 le 1000 then PRICE_GROUP = 1000;

going up by 1000 to 100,000. The only problem is that some price groups had no values, so when I counted how many were in each group, those groups simply did not appear (for example, nothing fell in the 57,000 group, so count(PRICE_GROUP) showed no row for it). The solution, I think, is to make a table with the bounds for each group and then compare the actual value against the upper and lower bounds.
proc iml;
  mat = j(100, 2, 0);
  total = 100000;
  mat[1,1] = 0;
  mat[1,2] = mat[1,1] + (total/100);
  do i = 2 to nrow(mat);
    mat[i,1] = mat[i-1,1] + (total/100);
    mat[i,2] = mat[i,1] + (total/100);
  end;
  create dataset from mat;
  append from mat;
quit;
This creates the table against which I can compare the values, but is there an easier way than PROC IML? My next step was going to be a loop comparing each value with the two bound columns and adding a column with the count for each bucket, but that still seems like an intensive, inefficient process.
IML isn't a terrible solution, but there are a few others depending on what exactly you're doing.
The most common is PROC FORMAT. Create a format that defines each bucket, like so:
proc format;
  value buckets
    0-1000 = '1000'
    1000<-2000 = '2000'
    ...
    other = 'NA';
run;
Then you can either use the format (or informat) to create a new variable with the bucketed value, or even better, use the format on the fly (i.e., in PROC MEANS or whatnot; see the sketch below). Not only does that mean you don't have to rewrite the dataset, but you can also swap formats on and off depending on how many buckets you want (say, a buckets100 format versus buckets20).
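For example, a hedged sketch of the on-the-fly use (dataset and variable names assumed):

proc means data=have n mean;
  class column1;
  format column1 buckets.; /* groups by the bucketed value without rewriting the data */
run;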
Second, your specific question looks like it's solvable using just math:
data want;
  set have;
  bucket = &total/100 * ceil(column1 / (&total/100)); /* &total assumed defined, e.g. %let total = 100000; */
run;
although obviously that doesn't work for every example.
Third, you could use a hash lookup table if you are unable to use formats (such as when two or more elements determine the bucket). Hash tables are very commonly used for lookups in SAS, and this is the closest equivalent of the IML solution inside a regular data step; a sketch follows.
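A minimal sketch of that, assuming a lookup dataset BUCKET_MAP with numeric columns BUCKET_KEY and PRICE_GROUP (both hypothetical), where BUCKET_KEY identifies the bucket a value falls into:

data want;
  if _n_ = 1 then do;
    declare hash h(dataset: 'bucket_map');
    h.defineKey('bucket_key');
    h.defineData('price_group');
    h.defineDone();
    call missing(bucket_key, price_group); /* define the host variables */
  end;
  set have;
  bucket_key = ceil(column1 / 1000); /* derive the key from the value */
  if h.find() ne 0 then price_group = .; /* no match found */
run;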
Create another table with groups:
data group_table;
  do price_group = 1000 to 100000 by 1000;
    output;
  end;
run;
Then left join the grouping/comparison table with this new table, using price_group as the key:
proc sql;
  create table price_group_compare as
    select L.price_group, R.group1_count, R.group2_count
    from group_table as L
      left join group_counts as R
        on L.price_group = R.price_group;
quit;
I am trying to create categorical variables in SAS. I have written the following macro, but I get the error "Invalid symbolic variable name xxx" when I run it. I am not sure this is even the correct way to accomplish my goal.
Here is my code:
%macro addvars;
  proc sql noprint;
    select distinct coverageid
      into :coverageid1 - :coverageid9999999
      from save.test;
  %do i=1 %to &sqlobs;
    %let n=coverageid&i;
    %let v=%superq(&n);
    %let f=coverageid_&v;
    %put &f;
    data save.test;
      set save.test;
      %if coverageid eq %superq(&v)
        %then &f=1;
      %else &f=0;
    run;
  %end;
%mend addvars;
%addvars;
%addvars;
You're combining macro code with data step code in a way that isn't correct. %if is macro language, meaning you are actually evaluating whether the text "coverageid" is equal to the text that %superq(&v) resolves to, not whether the contents of the coverageid variable equal the value in &v. You could just convert %if to if (see the sketch below), but even if you got that to work properly it would be hideously inefficient: you're rewriting the dataset N times, so if you have 1500 values for coverageid you rewrite the entire (say, 500MB) dataset 1500 times instead of just once.
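For reference, a hedged sketch of that direct conversion (assuming coverageid is numeric), replacing the %do loop in the macro above - still not recommended, for the efficiency reasons just given:

%do i=1 %to &sqlobs;
  data save.test;
    set save.test;
    if coverageid = &&coverageid&i then coverageid_&&coverageid&i = 1;
    else coverageid_&&coverageid&i = 0;
  run;
%end;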
If what you want to do is take the variable coverageid and convert it to a set of 1/0 binary variables, one for each possible value of coverageid, there are a number of ways to do it. I'm fairly sure the ETS module has a procedure that does exactly this, but I don't recall it off the top of my head; if you were to post this to the SAS mailing list, someone there would undoubtedly have it quickly.
The simple way, for me, is to do this entirely with data step code: first determine how many potential values there are for coverageid, then map each value to a sequential index, then assign the value to the correct variable.
If the coverageid values are consecutive (i.e., 1 to some number, no skips, or you don't mind skipping), this is easy: set up an array and iterate over it, as in the sketch below. I will assume they are NOT consecutive.
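For the consecutive case, a minimal sketch, assuming the values run from 1 to &maxid (a hypothetical macro variable holding the largest value):

data want;
  set save.test;
  array coverageids coverageid1-coverageid&maxid.;
  do _t = 1 to &maxid.;
    coverageids[_t] = (coverageid = _t); /* 1 if this is the observed value, else 0 */
  end;
  drop _t;
run;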
*First, get the distinct values of coverageid - there are a dozen ways to do this, and this works as well as any;
proc freq data=save.test;
  tables coverageid / out=coverage_values(keep=coverageid);
run;
*Then save them into an informat - each value maps to a consecutive number (the lowest value becomes 1, the next lowest 2, etc.), which is also useful later for converting back;
data coverage_values_fmt;
  set coverage_values;
  start = coverageid;
  label = _n_;
  fmtname = 'COVERAGEF';
  type = 'i';
  call symputx('CoverageCount', _n_);
run;
*Import the created informat;
proc format cntlin=coverage_values_fmt;
run;
*Now use the created informat. If you had already-consecutive values, you could skip to this step and skip the INPUT statement - just use the value itself;
data save.test_fin;
  set save.test;
  array coverageids coverageid1-coverageid&coveragecount.;
  do _t = 1 to &coveragecount.;
    if input(coverageid, COVERAGEF.) = _t then coverageids[_t] = 1;
    else coverageids[_t] = 0;
  end;
  drop _t;
run;
Here's another way that doesn't use formats, and may be easier to follow.
First, just make some test data:
data test;
  input coverageid @@;
  cards;
3 27 99 105
;
run;
Next, create a one-observation data set with one variable for each level of coverageid. Note that this approach allows arbitrary values here.
proc transpose data=test out=wide(drop=_name_);
  id coverageid;
run;
Finally, create a new data set that combines the initial data set and the wide one. Then, for each level of coverageid, look at each categorical variable and decide whether to turn it "on".
data want;
  set test;
  if _n_ = 1 then set wide; /* bring in the one-row wide dataset */
  array vars{*} _:;
  do i = 1 to dim(vars);
    vars{i} = (coverageid = substr(vname(vars{i}), 2));
  end;
  drop i;
run;
The line
vars{i} = (coverageid = substr(vname(vars{i}),2));
may require more explanation. vname returns the name of the variable, and since we didn't specify a prefix in proc transpose, the variables are named _3, _27, and so on. So we take the substring of the variable name starting at the second position and compare it to coverageid; if they match, the expression evaluates to 1, otherwise 0.