My dataset has 3 SNPs and looks like this:
Id SNP1 SNP2 SNP3
1 AA AA AA
2 AG AC AG
3 GG CC GG
4
5
6 So on
In SNP1, I would like to recode the values as AA = 2, AG = 1, GG = 0, and likewise in SNP2 and SNP3.
How can I do this?
I would put the new values in a PROC FORMAT format. You can then either keep the existing values and display them through the format, or convert the stored values using the format. Here are both ways to do this.
/* create format */
proc format;
value $snpfmt 'AA' = '2'
'AG' = '1'
'GG' = '0'
;
run;
/* create initial dataset */
data have;
input Id SNP1 $ SNP2 $ SNP3 $;
datalines;
1 AA AA AA
2 AG AC AG
3 GG CC GG
;
/* option1 - format the values */
/* note: $snpfmt2. is the $snpfmt format applied with a width of 2 */
proc datasets lib=work nodetails nolist;
modify have;
format snp1 snp2 snp3 $snpfmt2. ;
quit;
/* option2 - change the values using the format */
data want;
set have;
snp1 = put(snp1,$snpfmt2.);
snp2 = put(snp2,$snpfmt2.);
snp3 = put(snp3,$snpfmt2.);
run;
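If numeric codes are needed rather than the character strings '2', '1' and '0', one possible variation is a numeric informat applied with INPUT(). This is only a sketch and not part of the answer above: the informat name snpnum, the dataset name want_num and the snp1_n-snp3_n variable names are made up for illustration, and genotypes that are not listed (such as AC or CC in SNP2) are assumed to become missing.

```sas
/* hypothetical numeric informat: genotype text -> numeric code */
proc format;
  invalue snpnum
    'AA' = 2
    'AG' = 1
    'GG' = 0
    other = .;   /* assumption: unlisted genotypes become missing */
run;

data want_num;
  set have;
  array snp[3] $ snp1-snp3;                 /* existing character variables */
  array snp_n[3] snp1_n snp2_n snp3_n;      /* new numeric variables */
  do i = 1 to 3;
    snp_n[i] = input(snp[i], snpnum.);
  end;
  drop i;
run;
```

Keeping the original character columns next to the numeric ones makes it easy to check the recoding before dropping anything.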
I have a dataset like the one below. Using SAS, I need to assign an order variable based on descending count. When the category is missing, that row should always come last, whatever its count. All other categories should be ordered by descending count above it.
Category Count
aa 10
bb 9
cc 8
6
ab 3
Desired output:
Category Count Order
aa 10 1
bb 9 2
cc 8 3
ab 3 4
6 5
You can use Proc DS2 to compute a sequence number for a result set.
Example:
data have;
input s $ f;
datalines;
aa 10
bb 9
cc 8
. 6
ab 3
;
proc ds2;
data want(overwrite=yes);
declare int sequence ;
method run();
set
{
select s,f from have
order by case when s is not null then f else -1e9-f end desc
};
sequence + 1;
end;
run;
quit;
Split the dataset into rows with a non-missing category and rows with a missing category, sort each part by descending value, then stack the two parts with the non-missing rows first.
/* Sort the non-missing values */
proc sort data=have out=have_notmissing;
by descending value;
where NOT missing(category);
run;
/* Sort the missing values */
proc sort data=have out=have_missing;
by descending value;
where missing(category);
run;
/* Stack them on top of each other */
data want;
set have_notmissing
have_missing
;
rank+1;
run;
Output:
category value rank
aa 10 1
bb 9 2
cc 8 3
ab 3 4
6 5
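The same ordering can probably be had in a single sort by first ordering on a missing-category flag. The following is a minimal sketch under assumptions: it rebuilds the sample data with the variable names category and value used in the answer above, and the names is_missing and sorted are made up for illustration.

```sas
/* sample data; a lone period is read as a missing character value */
data have;
  input category $ value;
  datalines;
aa 10
bb 9
cc 8
. 6
ab 3
;

proc sql;
  create table sorted as
  select category, value,
         missing(category) as is_missing   /* 0 = present, 1 = missing */
  from have
  order by is_missing, value desc;
quit;

/* number the rows in their sorted order */
data want(drop=is_missing);
  set sorted;
  rank + 1;
run;
```

Because the flag is 0 for present categories and 1 for missing ones, the missing-category rows sort to the bottom regardless of their value.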
Perhaps what you really want is a NOTSORTED format and a procedure that supports the PRELOADFMT option and ORDER=DATA.
data test;
input category $ count;
cards;
aa 10
bb 9
cc 8
. 6
ab 3
;;;;
run;
proc print;
run;
proc format;
value $cat(notsorted)
'bb'='Bee Bee'
'aa'='Aha'
'cc'='CC'
'dd'='D D'
'ab'='AB'
;
quit;
proc summary data=test nway missing;
class category / order=data preloadfmt;
format category $cat.;
freq count;
output out=summary(drop=_type_) / levels;
run;
proc print;
run;
I have to merge two very large files, and I want to avoid doing this in a DATA step because that would mean sorting the data. I need all observations from the left file, excluding IDs that appear in the second file.
data leftdata;
input id $ y;
datalines;
AA 10
AA 20
BB 30
BB 40
CC 50
CC 80
DD 60
;run;
data rightdata;
input id $ ;
datalines;
AA
BB
;
run;
*Using datastep;
PROC SORT DATA=leftdata; BY id;
PROC SORT DATA=rightdata; BY id; RUN;
DATA datastep;
MERGE leftdata(IN=a) rightdata(IN=b);
BY id; IF a and b=0;
RUN;
How can the same be achieved using PROC SQL?
Final output must include the following observations:
CC 50
CC 80
DD 60
Here are two ways:
WHERE NOT IN (SELECT ...) filtering, or
LEFT JOIN where missing the right id.
Example:
data have;
input id $ y;
datalines;
AA 10
AA 20
BB 30
BB 40
CC 50
CC 80
DD 60
;
data excluded_ids;
input id $ ;
datalines;
AA
BB
;
proc sql;
create table want as
select * from have
where id not in (select id from excluded_ids)
;
create table want as
select have.* from have
left join excluded_ids as remove
on have.id = remove.id
where remove.id is null
;
quit;
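For the second way, a repeated id in the exclusion list does not duplicate output rows: repeats only multiply the matched rows, and those are removed by the WHERE clause, so SELECT DISTINCT is not needed.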
Data step
Use a hash object to hold the exclusion list and its CHECK method to test each id for removal.
Example:
data want;
set have;
if _n_ = 1 then do;
declare hash exclude(dataset:'excluded_ids');
exclude.defineKey('id');
exclude.defineDone();
end;
if exclude.check() = 0 then delete;
run;
I'd like to reshape the dataset into the layout below. My program works, but it is long.
I think it could be simpler. If you have a good idea, please give me some advice.
This is the data.
data test;
input ID $ NO DAT1 $ TIM1 $ DAT2 $ TIM2 $;
cards;
1 1 2020/8/4 8:30 2020/8/5 8:30
1 2 2020/8/18 8:30 2020/8/19 8:30
1 3 2020/9/1 8:30 2020/9/2 8:30
1 4 2020/9/15 8:30 2020/9/16 8:30
2 1 2020/8/4 8:34 2020/8/5 8:34
2 2 2020/8/18 8:34 2020/8/19 8:34
2 3 2020/9/1 8:34 2020/9/2 8:34
2 4 2020/9/15 8:34 2020/9/16 8:34
3 1 2020/8/4 8:46 2020/8/5 8:46
3 2 2020/8/18 8:46 2020/8/19 8:46
3 3 2020/9/1 8:46 2020/9/2 8:46
3 4 2020/9/15 8:46 2020/9/16 8:46
;
run;
This is my program.
data
t1(keep = ID A1 A2 A3 A4)
t2(keep = ID B1 B2 B3 B4)
t3(keep = ID C1 C2 C3 C4)
t4(keep = ID D1 D2 D3 D4);
set test;
if NO = 1 then do;
A1 = DAT1;
A2 = TIM1;
A3 = DAT2;
A4 = TIM2;
end;
*--- cut (NO = 2, 3, 4 are same as NO = 1)--- ;
end;
if NO = 1 then output t1;
if NO = 2 then output t2;
if NO = 3 then output t3;
if NO = 4 then output t4;
run;
proc sort data = t1;by ID; run;
proc sort data = t2;by ID; run;
proc sort data = t3;by ID; run;
proc sort data = t4;by ID; run;
data test2;
merge t1 t2 t3 t4;
by ID;
run;
Since the result looks like a report, use a reporting tool.
proc report data=test ;
column id no,(dat1 tim1 dat2 tim2 n) ;
define id / group width=5;
define no / across ' ' ;
define n / noprint;
run;
Tall to very wide data transformations are typically one of two things:
sketchy, because you put data into metadata (column names or labels) or lose referential context, or
a reporting layout for human consumption.
Presuming your target layout is accurate and you really do want to pivot your data that way, here are several ways.
Way 1 - self merging subsets with renaming
The NO field is a sequence number, so each sequence value can be read as its own subset and the subsets merged back together by ID.
Consider this example code as a template that a macro could generate.
NO is renamed to seq for clarity.
data want;
merge
have (where=(seq=1) rename=(dat1=A1 tim1=B1 dat2=C1 tim2=D1))
have (where=(seq=2) rename=(dat1=A2 tim1=B2 dat2=C2 tim2=D2))
have (where=(seq=3) rename=(dat1=A3 tim1=B3 dat2=C3 tim2=D3))
have (where=(seq=4) rename=(dat1=A4 tim1=B4 dat2=C4 tim2=D4))
;
by id;
run;
For other data sets organized in the same pattern, the code generation needs two things: the maximum seq, and the names of the variables to pivot (passed as macro parameters, with the macro looping over the names).
Way 2 - multiple transposes
Caution, all pivoted columns will be character type and contain the formatted result of original values.
proc transpose data=have(rename=(dat1=A tim1=B dat2=C tim2=D)) out=stage1;
by id seq;
var a b c d;
run;
proc transpose data=stage1 out=want;
by id;
var col1;
id _name_ seq;
run;
Way 3 - Use array and DOW loop
* presume SEQ is indeed a unit monotonic sequence value;
* if dat1-tim2 are character, add $ after the array name;
data want (keep=id a1--d4);
do until (last.id);
set have;
by id;
array wide A1-A4 B1-B4 C1-C4 D1-D4;
wide [ (seq-1)*4 + 1 ] = dat1;
wide [ (seq-1)*4 + 2 ] = tim1;
wide [ (seq-1)*4 + 3 ] = dat2;
wide [ (seq-1)*4 + 4 ] = tim2;
end;
* format A1 A3 B1 B3 C1 C3 D1 D3 your-date-format;
* format A2 A4 ................. your-time-format;
run;
Way 4 - change your data values to datetime
I'll leave this to esteemed others
I want to summarize a dataset by creating a vector that gives information on what departments the id is found in. For example,
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
I want
id dept_vect
1 1111
2 0010
3 0001
4 1000
5 1001
The positions in dept_vect are organized alphabetically. So a '1' in the first position means that the id is found in department A, and a '1' in the second position means that the id is found in department B. A '0' means the id is not found in that department.
I can solve this problem using a brute force approach
proc transpose data = test out = test1(drop = _NAME_);
by id;
var dept;
run;
data test2;
set test1;
array x[4] $ col1-col4;
array d[4] $ d1-d4;
do i = 1 to 4;
if not missing(x[i]) then do;
if x[i] = 'A' then d[1] = 1;
else if x[i] = 'B' then d[2] = 1;
else if x[i] = 'C' then d[3] = 1;
else if x[i] = 'D' then d[4] = 1;
end;
else leave;
end;
do i = 1 to 4;
if missing(d[i]) then d[i] = 0;
end;
dept_id = compress(d1) || compress(d2) || compress(d3) || compress(d4);
keep id dept_id;
run;
This works, but there are a couple of problems. For col4 to appear, at least one id must be found in all departments, although that could be fixed by creating a dummy id that is found in all departments. The main problem is that this code is not robust. Is there a way to code this so that it would work for any number of departments?
Add a 1 to get a count variable
Transpose using PROC TRANSPOSE
Replace missing with 0
Use CATT() to create desired results.
data have;
input id dept $;
count = 1;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transpose data=have out=wide prefix=dept;
by id;
id dept;
var count;
run;
data want;
set wide;
array _d(*) dept:;
do i=1 to dim(_d);
if missing(_d(i)) then _d(i) = 0;
end;
want = catt(of _d(*));
run;
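One thing to watch with the transpose approach is that PROC TRANSPOSE creates the dept columns in order of first appearance in the data (A, D, B, C for the sample), so the concatenated digits are not necessarily in alphabetical order. The following is a sketch of one way to pin the positions to the sorted department list. It assumes the have dataset above is sorted by id and that department values contain no blanks. The macro variable names deptlist and ndept and the variable pos are made up for illustration.

```sas
/* build an alphabetically sorted, blank-separated department list */
proc sql noprint;
  select distinct dept into :deptlist separated by ' '
  from have
  order by dept;
  %let ndept = &sqlobs;   /* number of distinct departments */
quit;

data want(keep=id dept_vect);
  length dept_vect $ &ndept;
  dept_vect = repeat('0', &ndept - 1);    /* start with all zeros */
  do until (last.id);
    set have;
    by id;
    /* 'e' modifier: FINDW returns the word number, not the character position */
    pos = findw("&deptlist", strip(dept), ' ', 'e');
    substr(dept_vect, pos, 1) = '1';
  end;
run;
```

Because the list is derived from the data itself, the vector grows automatically when new departments appear.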
Maybe TRANSREG can help with this.
data test;
input id dept $;
datalines;
1 A
1 D
1 B
1 C
2 C
3 D
4 A
5 C
5 D
;
run;
proc transreg;
id id;
model class(dept / zero=none);
output design out=dummy(drop=dept);
run;
proc print;
run;
proc summary nway;
class id;
output out=want(drop=_type_) max(dept:)=;
run;
proc print;
run;
I would like to combine different observations from different variables in one variable. The observations are all related to each other and are repeated measurements.
Example:
Current database layout.
**Var1 Dat1 Var2 Dat2 Var3 Dat3**
Obs1a Dat1a Dat2a Dat2a Obs3a Dat3a
Obs1b Dat1b Dat2b Dat2b Obs3b Dat3b
Obs1c Dat1c Dat2c Dat2c Obs3c Dat3c
Obs1d Dat1d Dat2d Dat2d Obs3d Dat3d
I want to create a new variable with combined observations:
Var Dat
Obs 1a Dat 1a
Obs 1b Dat 1b
.... ...
Obs 2a Dat 2a
Obs 2b Dat 2b
.... ...
Obs 3c Dat 3c
Obs 3d Dat 3d
Could somebody explain to me how to do this in SAS?
Create 2 arrays of the "var" and "dat" values. Loop through them and use the OUTPUT statement to create 1 observation for each value in the arrays.
data test;
input var1 $ dat1 $ var2 $ dat2 $ var3 $ dat3 $;
datalines;
Obs1a Dat1a Dat2a Dat2a Obs3a Dat3a
Obs1b Dat1b Dat2b Dat2b Obs3b Dat3b
Obs1c Dat1c Dat2c Dat2c Obs3c Dat3c
Obs1d Dat1d Dat2d Dat2d Obs3d Dat3d
;
run;
data test2(keep=var dat);
set test;
array v[3] var1-var3 ;
array d[3] dat1-dat3;
format var dat $8.;
do i=1 to 3;
var = v[i];
dat = d[i];
output;
end;
run;
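The step above writes the rows in the order they are read (the first source row's three pairs, then the second row's, and so on). If the stacked order shown in the question is wanted, with all Var1 values first, then Var2, then Var3, one option is to keep the array index and the source row number and sort on them. This is a sketch: the names long, stacked, grp and row are made up for illustration.

```sas
data long(keep=grp row var dat);
  set test;
  array v[3] $ var1-var3;
  array d[3] $ dat1-dat3;
  length var dat $8;
  row = _n_;              /* source row number */
  do grp = 1 to 3;        /* which var/dat pair the value came from */
    var = v[grp];
    dat = d[grp];
    output;
  end;
run;

/* all of pair 1 first, then pair 2, then pair 3, each in source-row order */
proc sort data=long out=stacked;
  by grp row;
run;
```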