Subset repeated elements based on a list

Subset repeated elements based on a list - sas

is there a way to subset from a data set some IDs based on an external list?
In other words, I have a data set:
ID Value
0001 0.3
0001 0.6
0002 0.7
0002 0.71
0002 0.43
0003 0.01
0003 0.2
0005 12
0005 11
and a list:
ID
0001
0002
0005
The desired output would be:
ID Value
0001 0.3
0001 0.6
0002 0.7
0002 0.71
0002 0.43
0005 12
0005 11
My point is that in the data set, IDs are repeated.
Thank you in advance

SQL is the easiest for sure. Assuming ID list is in a data set, you can do something like this:
HAVE: Input data set with all records
ID_LIST: input data set with ids to be selected
proc sql;
create table want as
select * from have
where ID in
(select id from id_list);
quit;
There are several other ways to solve this including a data step merge (requires pre-sorting) and/or a hash table.

One way of doing it is by saving your IDs from external list into macro variable, and then use select statement using that variable.
data have;
input ID Value;
format ID z4.;
datalines;
0001 0.3
0001 0.6
0002 0.7
0002 0.71
0002 0.43
0003 0.01
0003 0.2
0005 12
0005 11
;
run;
This list can be imported from an external file using proc import (in this example I used external xlsx file, located at &path.).
proc import
datafile="&path."
out=list
dbms=xlsx
replace;
run;
Or can be a separate existing table.
data list;
input ID;
datalines;
0001
0002
0005
;
run;
Here we save IDs from your list into a macro variable x.
proc sql noprint;
select ID into :x separated by ','
from list
;
quit;
When you have your selected IDs in a macro variable you can use that for a simple select statement.
proc sql;
create table want as
select *
from have
where ID in (&x.)
;
quit;

Related

Conditional replacement of labels based on another data set

suppose to have the following data set:
ID Label
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0002 0002_1
0002 0002_1
0002 0002_2
0002 0002_2
0002 0002_3
0002 0002_3
and another one:
ID Label
0001 0001_1
0001 0001_1
0001 0001_2
0001 0001_2
0001 0001_3
0001 0001_3
0002 0002_1
0002 0002_1
0002 0002_2
0002 0002_2
0002 0002_3
0002 0002_3
You want the following:
if in the first dataset there is only one type of Label (i.e., 0001_1), the second dataset should have that type. Otherwise if there are multiple labels nothing must be done. The desired output should be:
ID Label
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0001 0001_1
0002 0002_1
0002 0002_1
0002 0002_2
0002 0002_2
0002 0002_3
0002 0002_3
Thank you in advance
Best

You will want to compute the groups in the first table that have a single label in aggregate and apply that label to the groups in the second table.
Example:
Computation with PROC FREQ and application via MERGE.
data have1;
call streaminit(20231);
do id = 1 to 10;
do seq = 1 to rand('integer', 10) + 2;
if mod(id,2) = 0
then label = 'AAA';
else label = repeat(byte(64+rand('integer', 26)),2);
output;
end;
end;
run;
data have2;
call streaminit(20232);
do id = 1 to 10;
do seq = 1 to rand('integer', 12) + 2;
label = repeat(byte(64+rand('integer', 26)),2);
output;
end;
end;
run;
proc freq noprint data=have1;
by id;
table label / out=one_label(where=(percent=100));
run;
data want2;
merge
have2
one_label(keep=id label rename=(label=have1label) in=reassign)
;
by id;
if reassign then label = have1label;
drop have1label;
run;
Same result achieved with SQL code, performing computation in a sub-select and using COALESCE for application.
proc sql;
create table want2 as
select
have2.id
, coalesce(singular.onelabel, have2.label) as label
from
have2
left join
( select unique id, label as onelabel
from have1
group by id
having count(distinct label) = 1
) as singular
on
have2.id = singular.id
;

Sort variables based on another data set and append data

is there a way in SAS to order columns (variables) of a data set based on the order of another data set? The names are perfectly equal.
And is there also a way to append them (vertically) based on the same column names?
Thank you in advance
ID YEAR DAYS WORK DATASET
0001 2020 32 234 1
0002 2019 31 232 1
0003 2015 3 22 1
0004 2003 15 60 1
0005 2021 32 98 1
0006 2000 31 56 1
DATASET DAYS WORK ID YEAR
2 56 23 0001 2010
2 34 123 0002 2011
2 432 3 0003 2013
2 45 543 0004 2022
2 76 765 0005 2000
2 43 8 0006 1999
I just need to sort the second data set based on the first and append the second to the first.
Can anyone help me please?

This should work:
data have1;
input ID YEAR DAYS WORK DATASET;
format ID z4.;
datalines;
0001 2020 32 234 1
0002 2019 31 232 1
0003 2015 3 22 1
0004 2003 15 60 1
0005 2021 32 98 1
0006 2000 31 56 1
;
run;
data have2;
input DATASET DAYS WORK ID YEAR;
format ID z4.;
datalines;
2 56 23 0001 2010
2 34 123 0002 2011
2 432 3 0003 2013
2 45 543 0004 2022
2 76 765 0005 2000
2 43 8 0006 1999
;
run;
First we create a new table by copying our first table. Then we just insert into it variables from the second table. No need to change the column order of the original second table.
proc sql;
create table want as
select *
from have1
;
insert into want(ID, YEAR, DAYS, WORK, DATASET)
select ID, YEAR, DAYS, WORK, DATASET
from have2
;
quit;

I have no idea how you could sort based on something that is not there.
But appending is trivial. You can just set them together.
data want;
set one two;
run;
And if both dataset are already sorted by some key variables (year perhaps in your example?) then you could interleave the observations instead. Just add a BY statement.
data want;
set one two;
by year;
run;
And if you want to make a new version of the second dataset with the variable order modified to match the variable order in the first dataset (something that really has nothing to with sorting the data) you could use the OBS= dataset option. So code like this will order the variables based on the order they have in ONE but not actually use any of the data from that dataset.
data want;
set one(obs=0) two;
run;

Get duplicated by date

suppose to have the following:
ID Start
0001 31JAN2015
0001 31JAN2015
0003 16FEB2016
0006 01FEB2018
0004 31DEC2016
0004 31DEC2016
is there a way to retrieve which ID has identical Start date, i.e. is duplicated?
Desired output:
ID Start
0001 31JAN2015
0001 31JAN2015
0004 31DEC2016
0004 31DEC2016
Thank you in advance

Use proc sort to remove duplicates and create a list of IDs with duplicates.
proc sort data=have nodupkey dupout=dupes;
by date;
run;
In your example, ID 4 does not have a duplicate start date although the start day itself is the same (31st).

Conditionally set flags based on dates and a character variable

suppose to have the following:
ID Start End Place
0001 13JAN2015 20JAN2015 HospA
0001 21JAN2015 31DEC2015 HospA
0001 01JAN2018 31DEC2018 HospB
0001 01JAN2019 31DEC2019 HospA
0002 01JAN2015 31DEC2015 HospA
0002 01JAN2019 31OCT2019 HospA
0002 01NOV2019 31DEC2020 HospA
..... ........ ......... .....
I would like to set a flag for the start and a flag for the end as follows:
for each ID, for consecutive periods and same Place, put "1" in StartFlag column relative to the first date in Start column (first two rows of desired output); fill the remaining StartFlag and EndFlag with 0;
there is a jump of years for the same ID and the Place changes put "1" in StartFlag column referred to Start and 0 to the remaining (EndFlag column). This refers to rows 3 and 4 of desired output. The idea is to record the change in the Place;
if there are jumps of years but the Place does not change, then put: "1" in the column StarFlag for the first date in Start column, "9" to the EndFlag before the jump and "9" in the column StartFlag relative to the starting of the next non-consecutive period.
I tried with if statement. I don't know how to "call" the first date of each consecutive/non-consecutive period of each Place.
Thank you in advance
Desired output:
ID Start End Place StartFlag EndFlag
0001 13JAN2015 20JAN2015 HospA 1 0
0001 21JAN2015 31DEC2015 HospA 0 0
0001 01JAN2018 31DEC2018 HospB 1 0
0001 01JAN2019 31DEC2019 HospA 1 0
0002 01JAN2015 30SEP2015 HospA 1 0
0002 01OCT2015 31DEC2015 HospA 0 9
0002 01JAN2019 31OCT2019 HospA 9 0
0002 01NOV2019 31DEC2020 HospA 0 0
..... ........ ......... .....

You can use BY group processing to detect the first or last observation for an ID. And you can extend it to include PLACE by using the NOTSORTED option. But to compare the dates you need to look back and look ahead. Looking back is easy with the LAG() function. Looking ahead takes a little work, here is a simple method using dataset options to read starting from the second observation.
First make sure the data is sorted by ID and START date.
data have;
input ID $ (Start End) (:date.) Place $;
format start end date9.;
cards;
0001 13JAN2015 20JAN2015 HospA
0001 21JAN2015 31DEC2015 HospA
0001 01JAN2018 31DEC2018 HospB
0001 01JAN2019 31DEC2019 HospA
0002 01JAN2015 31DEC2015 HospA
0002 01JAN2019 31OCT2019 HospA
0002 01NOV2019 31DEC2020 HospA
;
proc sort data=have;
by id start ;
run;
If you tell the data step the data is grouped by ID and PLACE you can use the FIRST.PLACE and LAST.PLACE flags. You just need to add some logic to test the date intervals.
data want;
set have ;
by id place notsorted;
lag_end=lag(end);
format lag_end date9.;
set have(firstobs=2 keep=start rename=(start=next_start))
have(obs=1 drop=_all_)
;
if first.place then startflag = 1;
else if lag_end+1 < start then startflag=1;
else startflag=0;
if last.place then endflag=1;
else if (end+1 < next_start) then endflag=1;
else endflag=0;
run;
Result:

The lag function is quite powerful, but here it is sufficient to use the retain statement to get the results. Normally, you have access to one row at a time in SAS due to the data vector, but by using retain you can keep values to the next row.
At first, we generate the sample data and sort it by id and start:
data have;
input ID $ (Start End) (:date.) Place $;
format start end date9.;
cards;
0001 13JAN2015 20JAN2015 HospA
0001 21JAN2015 31DEC2015 HospA
0001 01JAN2018 31DEC2018 HospB
0001 01JAN2019 31DEC2019 HospA
0002 01JAN2015 30SEP2015 HospA
0002 01OCT2015 31DEC2015 HospA
0002 01JAN2019 31OCT2019 HospA
0002 01NOV2019 31DEC2020 HospA
run;
proc sort data=have;
by id start ;
run;
Now we use the retain statement to keep the values of id, start, end and place of the previous row.
(Actually, we don't need the by statement here, because first and last statements are not used.)
In this way, we can put the right values into start_flag:
data want (drop=prev:);
set have;
by id start;
format prev_start prev_end date9.;
retain prev_id prev_start prev_end prev_place;
if prev_id eq id AND prev_place eq place
AND prev_end+1 eq start
then start_flag=0;
else start_flag=1;
if prev_id eq id AND prev_place eq place
AND prev_end+1 ne start
then start_flag=9;
output;
prev_id=id; prev_start=start; prev_end=end; prev_place=place;
run;
End_flag is a little more tricky, because we need the next row, not the previous one.
Hence, we sort our data in the reverse order and use another retain (if the data amount is huge, consider using an index or two...).
proc sort data=want;
by descending id descending start;
quit;
data want_final (drop=next_startflag);
set want;
by descending id descending start;
retain next_startflag;
if next_startflag eq 9 then end_flag=9;
else end_flag=0;
output;
next_startflag=start_flag;
run;
Finally we sort the data back in the original order:
proc sort data=want_final;
by id start;
quit;

Drop all observations by ID where conditions are not met

I have a dataset with ~4 million transactional records, grouped by Customer_No (consisting of 1 or more transactions per Customer_No, denoted by a sequential counter). Each transaction has a Type code and I am only interested in customers where a particular combination of transaction Types were used. Neither joining the table on itself or using EXISTS in Proc Sql is allowing me to efficiently evaluate the transaction Type criteria. I suspect a data step using retain and do-loops would process the dataset faster
The dataset:
Customer_No Tran_Seq Tran_Type
0001 1 05
0001 2 12
0002 1 07
0002 2 86
0002 3 04
0003 1 07
0003 2 84
0003 3 84
0003 4 84
The criteria I am trying to apply:
All Customer_No's Tran_Type's must only be in ('04','05','07','84','86'),
drop all transactions for that Customer_No if any other Tran_Type was used
Customer_No's Tran_Type's must include ('84' or '86') AND '04', drop all transactions for the Customer_No if this condition is not met
The output I want:
Customer_No Tran_Seq Tran_Type
0002 1 07
0002 2 86
0002 3 04

The DoW loop solution should be the most efficient if the data is sorted. If it's not sorted, it will either be the most efficient or similar in scale but slightly less efficient depending on the circumstances of the dataset.
I compared to Dom's solution with a 3e7 ID dataset, and got for the DoW a similar (slightly less) total length with less CPU for unsorted dataset, and about 50% faster for sorted. It is guaranteed to run in about the length of time the dataset takes to write out (maybe a bit more, but it shouldn't be much), plus sorting time if needed.
data want;
do _n_=1 by 1 until (last.customer_no);
set have;
by customer_no;
if tran_type in ('84','86')
then has_8486 = 1;
else if tran_type in ('04')
then has_04 = 1;
else if not (tran_type in ('04','05','07','84','86'))
then has_other = 1;
end;
do _n_= 1 by 1 until (last.customer_no);
set have;
by customer_no;
if has_8486 and has_04 and not has_other then output;
end;
run;

I don't think it's that complicated. Join to a subquery, group by Customer_No, and put your conditions in a having clause. A condition in a min function must be true for all rows, whereas a condition in a max function must be true for any one row:
proc sql;
create table want as
select
h.*
from
have h
inner join (
select
Customer_No
from
have
group by
Customer_No
having
min(Tran_Type in('04','05','07','84','86')) and
max(Tran_Type in('84','86')) and
max(Tran_Type eq '04')) h2
on h.Customer_No = h2.Customer_No
;
quit;

I must have made a join error. On re-writing, Proc Sql completed in less than 30 seconds (on the original 4.9 million record dataset). It's not particularly elegant code though, so I'd still appreciate any improvements or alternative methods.
data Have;
input Customer_No $ Tran_Seq $ Tran_Type:$2.;
cards;
0001 1 05
0001 2 12
0002 1 07
0002 2 86
0002 3 04
0003 1 07
0003 2 84
0003 3 84
0003 4 84
;
run;
Proc sql;
Create table Want as
select t1.* from Have t1
LEFT JOIN (select DISTINCT Customer_No from Have
where Tran_Type not in ('04','05','07','84','86')
) t2
ON(t1.Customer_No=t2.Customer_No)
INNER JOIN (select DISTINCT Customer_No from Have
where Tran_Type in ('84','86')
) t3
ON(t1.Customer_No=t3.Customer_No)
INNER JOIN (select DISTINCT Customer_No from Have
where Tran_Type in ('04')
) t4
ON(t1.Customer_No=t4.Customer_No)
Where t2.Customer_No is null
;Quit;

I would offer a slightly less complex SQL solution than #naed555 using the INTERSECT operator.
proc sql noprint;
create table to_keep as
(
select distinct customer_no
from have
where tran_type in ('84','86')
INTERSECT
select distinct customer_no
from have
where tran_type in ('04')
)
EXCEPT
select distinct customer_no
from have
where tran_type not in ('04','05','07','84','86')
;
create table want as
select a.*
from have as a
inner join
to_keep as b
on a.customer_no = b.customer_no;
quit;

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js