aggregate by column value and paste row values together in SAS


I have a data set that looks like:
Have:
data have;
input a b c d e f g h ;
datalines;
1 0 0 0 0 0 1 0
0 0 1 0 1 0 0 0
0 0 0 1 0 1 0 0
0 1 0 0 0 0 0 1
;
run;
Columns a, b, c and d are the four options for question 1, which is scored on a 4-point scale. The value 1 in column a of the first observation means the respondent chose option a, which corresponds to 4 on that scale:
a = 4, b = 3, c = 2 and d = 1.
The next question's options are e, f, g and h. In the first observation the respondent chose option g, which is 2 on the 4-point scale: e = 4, f = 3, g = 2 and h = 1.
The data set contains hundreds of columns like this. My idea is to collapse each group of 4 columns into one, giving values like "1000", "0100", "0010", "0001", and then convert 1000 = 4, 0100 = 3, 0010 = 2 and 0001 = 1.
I want it to be like :
block col1 col2 col3 col4
1 1000 0100 0010 0001
2 0100 0010 1000 0001
3 1000 0100 1000 0010
I've gotten this far:
proc transpose data = have out = have_t;
run;
data have_t_block;
set have_t;
retain block;
if _n_ = 1 then block = 1;
if mod(_n_/4,1) = 0.25 and _n_ gt 1 then block +1;
run;
Is there a way to concatenate the row values while aggregating by block in SAS? I do this in R, like this:
#Create data
data <- data.frame(a = c(1, 0, 0), b = c(0, 1, 0), c = c(0, 0, 1), d = c(0, 0, 0), e = c(0, 1, 0), f = c(1, 0, 0), g = c(0, 0, 1), h = c(0, 0, 0), i = c(0, 0, 1), j = c(1, 0, 0), k = c(0, 0, 0), l = c(0, 1, 0))
#transpose
data <- data.frame(t(data))
#create a key for each group of 4
data$block <- rep(1:(nrow(data)/4), each = 4)
#convert data to long format and group by key (block) and use paste to concatenate
require(reshape2)
data_melt <- melt(data, id = c("block"))
trial <- data.frame(t(dcast(data_melt, block ~ variable, paste, collapse = "")))

First off, unless I've misunderstood your data, the transpose doesn't help much here: there's no particular reason to keep one column per respondent. Let's just have one column, period. Here's a better way to do this.
data have_t;
set have;
array cols a--h;
do _i = 1 to dim(cols);
value = cols[_i];
output;
end;
keep value; *and an ID I hope?;
run;
Making a dataset 'vertical' (one column) is very easy. Just loop over an array of all of your columns; for each one, set a common variable to that value and output. Normally I'd keep track of the name of the variable I was outputting as well, but perhaps that's not necessary.
For your main problem, what you'll want to do is use retain, most likely, not dissimilar to how you handle block. Here I just calculate score directly:
data want;
set have_t;
retain score;
counter = mod(_n_-1,4)+1; *position within the block: 1, 2, 3, 4;
if counter=1 then do;
block+1; *slightly easier version of what you wrote;
score=.; *clear the retained score at the start of each block;
end;
if value=1 then score = 5-counter; *first=4, second=3, third=2, fourth=1;
if counter=4 then output;
run;
If you want the intermediate '0010' or whatever, you can include that as well.
data want;
set have_t;
length int_value $4;
retain score int_value;
counter = mod(_n_-1,4)+1; *position within the block: 1, 2, 3, 4;
if counter=1 then block+1; *slightly easier version of what you wrote;
if value=1 then score = 5-counter; *first=4, second=3, third=2, fourth=1;
int_value = cats(int_value,value);
if counter=4 then do;
output;
int_value=' '; *have to clear this every 4;
score=.; *here we might as well clear it;
end;
run;
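If the vertical data set isn't needed for anything else, the same scoring can probably be done in a single pass over have, with one array loop per respondent. This is only a sketch: the names q1-q2, pat1-pat2 and want_wide are made up for illustration, and it assumes every question has exactly four adjacent option columns (in the real data the two small arrays would need dim(cols)/4 elements).

```sas
data have;
  input a b c d e f g h;
  datalines;
1 0 0 0 0 0 1 0
0 0 1 0 1 0 0 0
0 0 0 1 0 1 0 0
0 1 0 0 0 0 0 1
;
run;

data want_wide;
  set have;
  array cols {*} a--h;            /* all option columns, in data set order */
  array q    {2} q1-q2;           /* hypothetical: one score per question */
  array pat  {2} $ 4 pat1-pat2;   /* hypothetical: '1000'-style pattern */
  do _i = 1 to dim(cols);
    _block = ceil(_i / 4);        /* which question this column belongs to */
    _pos   = mod(_i - 1, 4) + 1;  /* position 1..4 within the question */
    pat{_block} = cats(pat{_block}, cols{_i});
    if cols{_i} = 1 then q{_block} = 5 - _pos;  /* first=4 ... fourth=1 */
  end;
  drop _:;                        /* drop the helper variables */
run;
```

For the first respondent in have this gives pat1 = '1000', q1 = 4, pat2 = '0010' and q2 = 2.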

If I understood your question correctly, try this:
data want;
do i=1 by 1 until(last.block);
set have_t_block;
array var $4. var1-var4;
array col col1-col4;
length var1-var4 $4.;
by block notsorted;
do over var;
var=cats(var,col);
end;
if last.block then output;
end;
keep var: block;
run;

Related

In the Data step of SAS, how can I get value of a Column with Column's name represented as a String?

In the Data Step of SAS, you get value of a Column by directly using its name, for example, like this,
name = col1;
But for some reason, I want to get value of a column where column is represented by a string. For example, like this,
name = get_value_of_column(cats("col", i))
Is this possible? And if so, how?

The DATA step functions VVALUE and VVALUEX return the formatted value of a variable.
VVALUE(<variable-name>): static; the variable is fixed when the step is compiled
VVALUEX(<expression>): dynamic; a run-time expression that resolves to a variable name
The actual (unformatted) value of the variable can be obtained dynamically by scanning an array of all variables of the same type (_numeric_ or _character_) and comparing names.
Array Scan
data have;
input name $ x y z (s t u) ($) date: yymmdd10.;
format s t u $upcase. date yymmdd10.;
datalines;
x 1 2 3 a b c 2020-10-01
y 2 3 4 b c d 2020-10-02
z 3 4 5 c d e 2020-10-03
s 4 5 6 hi ho silver 2020-10-04
t 5 6 7 aa bb cc 2020-10-05
u 6 7 8 -- ** !! 2020-10-06
date 7 8 9 ppp qqq rrr 2020-10-07
;
data want;
set have;
length u_vvalue name_vvaluex $20.;
u_vvalue = vvalue(u);
name_vvaluex = vvaluex(name);
array nums _numeric_;
array chars _character_;
/* NOTE:
* variable based arrays cause automatic variable _i_ to be in the PDV
* and _i_ will be automatically dropped from output data sets
*/
do _i_ = 1 to dim(nums);
if upcase(name) = upcase(vname(nums(_i_))) then do;
name_numeric_raw = nums(_i_);
leave;
end;
end;
do _i_ = 1 to dim(chars);
if upcase(name) = upcase(vname(chars(_i_))) then do;
name_character_raw = chars(_i_);
leave;
end;
end;
run;
If you find yourself doing an 'excessive' amount of dynamic value lookup in a DATA step, transposing the data first could lead to simpler processing.
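For the exact case in the question, where the name is built with cats("col", i), a minimal sketch might look like this. The data set demo and the variables picked and picked_num are invented for the example. Since VVALUEX returns the formatted value as text, numeric variables come back as character strings and have to be converted with INPUT if a number is needed; that round trip goes through the variable's format, so precision can be lost.

```sas
data demo;
  col1 = 10; col2 = 20; col3 = 30;
  length picked $32;
  do i = 1 to 3;
    /* build the variable name at run time and fetch its formatted value */
    picked = vvaluex(cats('col', i));
    /* convert the text back to a number; ?? suppresses invalid-data notes */
    picked_num = input(picked, ?? 32.);
    output;
  end;
run;
```

Each output row should then carry the value of col1, col2 or col3 in picked_num, depending on i.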

Create matrices based on a reference table and separate data table sas iml

I have a parametrization table that mentions whether the (i,j) th element of "matrix 1" is zero, residual of the row sum or has to be read from the data table. I also have a data table with all the values for different segments. How do I construct the matrix?
For example, let's say "param_table" is the parametrization table:
data param_table;
infile datalines dsd;
length FieldName $20 FieldSourceTable $20;
input Matrix_Id Column_Order Row_Order IsZero IsRowResidual IsColumnResidual FieldName FieldSourceTable;
datalines;
1, 1, 1, 0, 1, 0, ., .
1, 1, 2, 0, 0, 0, xyz, table1
1, 1, 3, 0, 0, 0, abc, table1
1, 2, 1, 1, 0, 0, ., .
1, 2, 2, 0, 0, 0, pqr, table1
1, 2, 3, 0, 0, 0, mno, table1
1, 3, 1, 0, 0, 0, ab, table1
1, 3, 2, 0, 0, 0, pq, table1
1, 3, 3, 0, 1, 0, ., .
;
"table 1" is the actual data containing the values and references from earlier table:
data table1;
input Year (country method Segment) ( : $12.)
ABC XYZ PQR MNO AB PQ;
datalines;
2017 France ABC Retail 0.2 0.5 0.4 0.3 0.6 0.1
2017 France XYZ Corporate 0.1 0.5 0.4 0.2 0.6 0.2
;
run;
How do I create matrices with these rules for each row (each key set) in table 1? For example, matrix for row 1 of "table1" would be:
(1-ab) 0 ab
xyz pqr pq
abc mno (1-abc-mno)
(1,1)th and (3,3) th elements are row residuals, therefore are (1 - sum of rest of the row), whereas (1,2)th element is 0:
0.4 0 0.6
0.5 0.4 0.1
0.1 0.3 0.6
I have added the data steps for file "param_table" which contains the references (the column names) and if it is zero or row residual. Also added the "table1" file which contains the actual values. For each row of "table1" we should have a matrix based on the rules mentioned in param_table.
Thanks!

Each matrix defined in param_table will correspond to a 2-D array associated with each row in table1. Suppose you have a macro, matrixfier, that generates the source code statements needed to map the table1 data into a specified array (i.e. matrix).
%macro matrixfier (matrix_id=1, arrayName=x, out=);
%local rowCount colCount source z i;
%local s1 s2 s3 s4 s addr;
The macro will have to examine the parameter data to determine whether the settings it contains are sensible for code generation.
proc sql noprint;
select *
from PARAM_TABLE where matrix_id = &matrix_id
and ( iszero not in (0,1) or
isrowresidual not in (0,1) or
iscolumnresidual not in (0,1) or
sum(iszero,isrowresidual,iscolumnresidual) not in (0,1)
);
%if &sqlobs %then %do;
%put ERROR: Parameters for matrix_id=&matrix_id. rejected for is* settings.;
%abort cancel;
%end;
select max(z) as z into :z from
( select column_order, sum(iscolumnresidual) as z
from PARAM_TABLE where matrix_id = &matrix_id
and iscolumnresidual
group by column_order
);
%if &z > 1 %then %do;
%put ERROR: Parameters for matrix_id=&matrix_id. rejected for iscolumnresidual settings.;
%abort cancel;
%end;
Determine how large the target array needs to be. Also, presume there is only one source table per matrix defined.
select max(column_order), max(row_order), max(fieldsourcetable)
into :colCount, :rowCount, :source
from PARAM_TABLE where matrix_id = &matrix_id
;
Generate DATA step statements for assigning a value directly.
select cats("&arrayName.(",row_order,',',column_order,')=', fieldname)
into :s1 separated by ';'
from PARAM_TABLE where matrix_id = &matrix_id
and iszero=0 and isrowresidual=0 and iscolumnresidual=0
order by row_order, column_order
;
Generate DATA step statements for assigning a zero value.
select cats("&arrayName.(",row_order,',',column_order,')=0')
into :s2 separated by ';'
from PARAM_TABLE where matrix_id = &matrix_id
and iszero
order by row_order, column_order
;
Generate DATA step statements for computing row residuals.
%do i = 1 %to &rowCount;
select
cats(B.row_order,',',B.column_order),
'-' || A.fieldname
into
:addr,
:s separated by ','
from PARAM_TABLE A
join PARAM_TABLE B
on A.matrix_id = B.matrix_id
and A.row_order = B.row_order
where
A.matrix_id = &matrix_id and A.row_order=&i
and A.isrowresidual=0 and A.iszero=0 and A.iscolumnresidual=0
and B.isrowresidual=1
;
%if &sqlobs > 0 %then %let s3=&s3&arrayName.(&addr)=sum(1,&s)%str(;);
%end;
Generate DATA step statements for computing column residuals.
%do i = 1 %to &colCount;
select
cats(B.row_order,',',B.column_order),
'-' || A.fieldname
into
:addr,
:s separated by ','
from PARAM_TABLE A
join PARAM_TABLE B
on A.matrix_id = B.matrix_id
and A.column_order = B.column_order
where
A.matrix_id = &matrix_id and A.column_order=&i
and A.isrowresidual=0 and A.iszero=0 and A.iscolumnresidual=0
and B.iscolumnresidual=1
;
%if &sqlobs > 0 %then %let s4=&s4&arrayName.(&addr)=sum(1,&s)%str(;);
%end;
quit;
Assemble the statements in a DATA Step.
data &out;
set &source;
array &arrayName.(&rowCount,&colCount);
call missing (of &arrayName.(*));
* assignments;
&s1;
* zeroes;
&s2;
* row residuals;
&s3;
* column residuals;
&s4;
* log the matrix for this row;
do _i = 1 to dim(&arrayName.,1);
do _j = 1 to dim(&arrayName.,2);
putlog &arrayName(_i,_j) 6.2 @;
end;
putlog;
end;
putlog;
run;
%mend;
Resolve the parameters as applied to data
options mprint;
%matrixfier(matrix_id=1, arrayName=x, out=each);
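To make the code generation less abstract, here is roughly what the generated step should amount to for matrix_id=1 with arrayName=x, written out by hand from the rules in param_table. It is not actual macro output (statement order and spacing will differ), and the output name each_by_hand is made up.

```sas
data table1;
  input Year (country method Segment) (: $12.)
        ABC XYZ PQR MNO AB PQ;
  datalines;
2017 France ABC Retail 0.2 0.5 0.4 0.3 0.6 0.1
2017 France XYZ Corporate 0.1 0.5 0.4 0.2 0.6 0.2
;
run;

data each_by_hand;                /* hypothetical output name */
  set table1;
  array x(3,3);                   /* creates x1-x9, stored row by row */
  call missing(of x(*));
  /* direct assignments: x(row_order, column_order) = fieldname */
  x(1,3) = ab;
  x(2,1) = xyz;  x(2,2) = pqr;  x(2,3) = pq;
  x(3,1) = abc;  x(3,2) = mno;
  /* zeroes */
  x(1,2) = 0;
  /* row residuals: 1 minus the other fields in the same row */
  x(1,1) = sum(1, -ab);
  x(3,3) = sum(1, -abc, -mno);
run;
```

Each observation of each_by_hand then holds one 3x3 matrix in x1-x9 alongside the original key columns.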

SAS for following scenario (most frequent observation)

Assume I have a data-set D1 as follows:
ID ATR1 ATR2 ATR3
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
I want to create a data-set D2 from this as follows
ID ATR1 ATR2 ATR3
1 A R W
2 C T X
3 D U I
In other words, Data-set D2 consists of unique IDs from D1. For each ID in D2, the values of ATR1-ATR3 are selected as the most frequent (of the respective variable) among the records in D1 with the same ID. For example ID = 1 in D2 has ATR1 = A (most frequent).
I have one solution which is very clumsy. I simply sort copies of the data set D1 three times (e.g. by ID and ATR1) and remove duplicates. I later merge the three data sets to get what I want. However, I think there might be a more elegant way to do this. I have about 20 such variables in the original data set.
Thanks

/*
read and restructure so we end up with:
id attr_id value
1 1 A
1 2 R
1 3 W
etc.
*/
data a(keep=id attr_id value);
length value $1;
array attrs_{*} $ 1 attr_1 - attr_3;
infile cards;
input id attr_1 - attr_3;
do attr_id=1 to dim(attrs_);
value = attrs_{attr_id};
output;
end;
cards;
1 A R W
2 B T X
1 A S Y
2 C T E
3 D U I
1 T R W
2 C X X
;
run;
/* calculate frequencies of values per id and attr_id */
proc freq data=a noprint;
tables id*attr_id*value / out=freqs(keep=id attr_id value count);
run;
/* sort so the most frequent value per id and attr_id ends up at the bottom of the group.
if there are ties then it's a matter of luck which value we get */
proc sort data = freqs;
by id attr_id count;
run;
/* read and recreate the original structure. */
data b(keep=id attr_1 - attr_3);
retain attr_1 - attr_3;
array attrs_{*} $ 1 attr_1 - attr_3;
set freqs;
by id attr_id;
if first.id then do;
do i=1 to dim(attrs_);
attrs_{i} = ' ';
end;
end;
if last.attr_id then do;
attrs_{attr_id} = value;
end;
if last.id then do;
output;
end;
run;

Counting in sas

Maybe a stupid question...
I have the following dataset:
id count
x 1
y 2
z 3
a 1
b 2
c 3
etc.
And I want this:
id count group
x 1 1
y 2 1
z 3 1
a 1 2
b 2 2
c 3 2
etc.
Here is what I tried:
data macro_1; set vix.macro_spy; where macro=1;
count+1;
if count>3 then do;
count=1;
end;
group=0;
if count=1 then group+1;
run;
But it is not working. How can I increment 'group' by one each time I reach 'count=1'?
Thanks.

Even simpler:
data want;
set vix.macro_spy;
group+(count=1);
run;

I'm not sure I understand what you need. So you have this dataset ordered so that the values of variable count always go 1, 2, 3, 1, 2, 3, 1, 2, 3...
Now, you want to generate variable group so that its value increments every time count starts over at 1?
If so, you could do something like this:
data group;
set vix.macro_spy;
retain group;
if _N_ = 1 then group = 0;
if count = 1 then group + 1;
run;
This is the general pattern that I use.
The if _N_ = 1 part is executed only once; this is where you initialize your variables.
The retain statement ensures that the variable keeps its value from one iteration of the DATA step to the next.
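Both answers rely on the sum statement (variable + expression;). As far as I know, a variable used that way is retained automatically and starts at 0, so the separate retain and _N_ = 1 initialization should not be strictly required. A small self-contained sketch, with a made-up input data set have standing in for vix.macro_spy:

```sas
data have;                 /* hypothetical stand-in for vix.macro_spy */
  input id $ count;
  datalines;
x 1
y 2
z 3
a 1
b 2
c 3
;
run;

data want;
  set have;
  /* (count = 1) evaluates to 1 when true and 0 when false,
     so group goes up by one on every row where count restarts */
  group + (count = 1);
run;
```

For the six rows above, group comes out as 1, 1, 1, 2, 2, 2.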

Read previous and next observations

I have a dataset like this (sp is an indicator):
datetime sp
ddmmyy:10:30:00 N
ddmmyy:10:31:00 N
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
ddmmyy:10:34:00 N
And I would like to extract observations with "Y" and also the previous and next one:
ID sp
ddmmyy:10:31:00 N
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
I tried to use "lag" and successfully extracted the observations with "Y" and the next one, but I still have no idea how to extract the previous one.
Here is my try:
data surprise_6_step3; set surprise_6_step2;
length lag_sp $1;
lag_sp=lag(sp);
if sp='N' and lag(sp)='N' then delete;
run;
and the result is:
ID sp
ddmmyy:10:32:00 Y
ddmmyy:10:33:00 N
Any methods to extract the previous observation also?
Thx for any help.

Try using the point= option of the set statement in a DATA step.
Like this:
data extract;
set surprise_6_step2 nobs=nobs;
if sp = 'Y' then do;
current = _N_;
prev = current - 1;
next = current + 1;
if prev > 0 then do;
set surprise_6_step2 point = prev;
output;
end;
set surprise_6_step2 point = current;
output;
if next <= nobs then do;
set surprise_6_step2 point = next;
output;
end;
end;
drop current prev next;
run;
There is an implicit loop through the dataset when you use it in a set statement.
_N_ is an automatic variable that counts the iterations of that implicit loop (starting from 1), which here matches the observation number. When you find your value, you store _N_ in the variable current so you know on which row you found it. nobs is the total number of observations in the dataset.
Checking that prev is greater than 0 and that next is not greater than nobs avoids an error if your row is the first in the dataset (there is no previous row) or the last (there is no next row).

/* generate test data */
data test;
do dt = 1 to 100;
sp = ifc( rand("uniform") > 0.75, "Y", "N" );
output;
end;
run;
proc sql;
create table test2 as
select *,
monotonic() as _n
from test
;
create table test3 ( drop= _n ) as
select a.*
from test2 as a
left join test2 as b
on a._n = b._n + 1
left join test2 as c
on a._n = c._n - 1
where a.sp = "Y"
or b.sp = "Y"
or c.sp = "Y"
;
quit;
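Another common idiom, sketched here with a made-up data set have and helper variables prev_sp and next_sp, is to merge the data set with itself shifted by one observation to look ahead, and to use lag to look back. It reads the data sequentially and does not output an observation twice when two Y rows are adjacent.

```sas
data have;                       /* hypothetical sample data */
  input datetime :$8. sp $;
  datalines;
10:30:00 N
10:31:00 N
10:32:00 Y
10:33:00 N
10:34:00 N
;
run;

data extract;
  /* second copy starts at observation 2, so next_sp is the following row's sp */
  merge have
        have(firstobs=2 keep=sp rename=(sp=next_sp));
  /* lag is called unconditionally on every row, so it is the previous row's sp */
  prev_sp = lag(sp);
  /* keep the Y rows plus their immediate neighbours */
  if sp = 'Y' or prev_sp = 'Y' or next_sp = 'Y';
  drop prev_sp next_sp;
run;
```

With the five sample rows this keeps 10:31:00, 10:32:00 and 10:33:00.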
