Sort variables based on another data set and append data - sas

is there a way in SAS to order columns (variables) of a data set based on the order of another data set? The names are perfectly equal.
And is there also a way to append them (vertically) based on the same column names?
Thank you in advance
ID YEAR DAYS WORK DATASET
0001 2020 32 234 1
0002 2019 31 232 1
0003 2015 3 22 1
0004 2003 15 60 1
0005 2021 32 98 1
0006 2000 31 56 1
DATASET DAYS WORK ID YEAR
2 56 23 0001 2010
2 34 123 0002 2011
2 432 3 0003 2013
2 45 543 0004 2022
2 76 765 0005 2000
2 43 8 0006 1999
I just need to sort the second data set based on the first and append the second to the first.
Can anyone help me please?

This should work:
data have1;
input ID YEAR DAYS WORK DATASET;
format ID z4.;
datalines;
0001 2020 32 234 1
0002 2019 31 232 1
0003 2015 3 22 1
0004 2003 15 60 1
0005 2021 32 98 1
0006 2000 31 56 1
;
run;
data have2;
input DATASET DAYS WORK ID YEAR;
format ID z4.;
datalines;
2 56 23 0001 2010
2 34 123 0002 2011
2 432 3 0003 2013
2 45 543 0004 2022
2 76 765 0005 2000
2 43 8 0006 1999
;
run;
First we create a new table by copying our first table. Then we just insert into it variables from the second table. No need to change the column order of the original second table.
proc sql;
create table want as
select *
from have1
;
insert into want(ID, YEAR, DAYS, WORK, DATASET)
select ID, YEAR, DAYS, WORK, DATASET
from have2
;
quit;

I have no idea how you could sort based on something that is not there.
But appending is trivial. You can just set them together.
data want;
set one two;
run;
And if both dataset are already sorted by some key variables (year perhaps in your example?) then you could interleave the observations instead. Just add a BY statement.
data want;
set one two;
by year;
run;
And if you want to make a new version of the second dataset with the variable order modified to match the variable order in the first dataset (something that really has nothing to with sorting the data) you could use the OBS= dataset option. So code like this will order the variables based on the order they have in ONE but not actually use any of the data from that dataset.
data want;
set one(obs=0) two;
run;

Related

Data step - unexpected performance behaviour in where-clause

As the title suggests, I experience some unexpected performance behaviour while working with a datastep.
A. The following code executes in 0.01 sec. So far so good.
data policen_roh;
set dwhprod.tbwh_kdu_detail_hi(
keep=
kdu_dt_id police_nr record_typ kdnr bag betrag_akt ursp_beginn_dt beginn_dt ablauf_dt storno_dt
where=(
police_nr=406045267
and record_typ='P'
)
)
;
run;
B. Additionally I have to filter a date, which is stored in a date-id, starting at 1 for 01/01/1850. Since I created formats to convert the date-id to a year (integer), I added the line input(put(kdu_dt_id, tag_id2jahr.),best.) ge 2017.
Works as expected. No problem here. I get my 15 expected records, and execution time increases marginally to 0.02 sec:
data policen_roh;
set dwhprod.tbwh_kdu_detail_hi(
keep=
kdu_dt_id police_nr record_typ kdnr bag betrag_akt ursp_beginn_dt beginn_dt ablauf_dt storno_dt
where=(
police_nr=406045267
and input(put(kdu_dt_id, tag_id2jahr.),best.) ge 2017
and record_typ='P'
)
)
;
run;
C. Now here is the problem: In an effort to speed up my code for larger datasets, I replaced
input(put(kdu_dt_id, tag_id2jahr.),best.) ge 2017
with
kdu_dt_id gt 60997 - the equivalent of 01/01/2017.
To my understanding, this should be way faster, since there is no put/input calculation required. However, while this returns the same result as B., execution time increases to roughly 30.00 seconds.
What is did I miss?
Appendix: Log for further reference
1 The SAS System 13:56 Wednesday, February 7, 2018
1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Programm';
4 %LET _CLIENTPROJECTPATH='R:\Projekte\20180125 Erneuerungsprovisionen\Erneuerungsprovisionen.egp';
5 %LET _CLIENTPROJECTNAME='Erneuerungsprovisionen.egp';
6 %LET _SASPROGRAMFILE=;
7
8 ODS _ALL_ CLOSE;
9 OPTIONS DEV=ACTIVEX;
10 GOPTIONS XPIXELS=0 YPIXELS=0;
11 FILENAME EGHTML TEMP;
12 ODS HTML(ID=EGHTML) FILE=EGHTML
13 ENCODING='utf-8'
14 STYLE=HtmlBlue
15 STYLESHEET=(URL="file:///C:/Program%20Files%20(x86)/SASHOME/x86/SASEnterpriseGuide/7.1/Styles/HtmlBlue.css")
16 ATTRIBUTES=("CODEBASE"="http://www2.sas.com/codebase/graph/v94/sasgraph.exe#version=9,4")
17 NOGTITLE
18 NOGFOOTNOTE
19 GPATH=&sasworklocation
20 ;
NOTE: Writing HTML(EGHTML) Body file: EGHTML
21
22 GOPTIONS ACCESSIBLE;
23 data policen_roh;
24 set dwhprod.tbwh_kdu_detail_hi(
25 keep=
26 kdu_dt_id police_nr record_typ kdnr bag betrag_akt ursp_beginn_dt beginn_dt ablauf_dt storno_dt
27 where=(
28 police_nr=406045267
29 and kdu_dt_id gt 60997
30 and record_typ='P'
31 )
32 )
33 ;
34 run;
NOTE: There were 14 observations read from the data set DWHPROD.TBWH_KDU_DETAIL_HI.
WHERE (police_nr=406045267) and (kdu_dt_id>60997) and (record_typ='P');
NOTE: The data set WORK.POLICEN_ROH has 14 observations and 10 variables.
NOTE: Compressing data set WORK.POLICEN_ROH increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: DATA statement used (Total process time):
real time 1:10.44
cpu time 0.03 seconds
35
36 GOPTIONS NOACCESSIBLE;
37 %LET _CLIENTTASKLABEL=;
38 %LET _CLIENTPROJECTPATH=;
39 %LET _CLIENTPROJECTNAME=;
40 %LET _SASPROGRAMFILE=;
41
42 ;*';*";*/;quit;run;
43 ODS _ALL_ CLOSE;
44
45
46 QUIT; RUN;
2 The SAS System 13:56 Wednesday, February 7, 2018
47
My guess is that your underlying table is on a database, and the removal of the input / put functions changed the query execution such that it is no longer making use of available indexes.
A bit counterintuitive, but try removing kdu_dt_id gt 60997 from your where clause and put it in a gating if statement instead.
data policen_roh;
set dwhprod.tbwh_kdu_detail_hi(
keep=
kdu_dt_id police_nr record_typ kdnr bag betrag_akt ursp_beginn_dt beginn_dt ablauf_dt storno_dt
where=(
police_nr=406045267
and record_typ='P'
)
)
;
if kdu_dt_id gt 60997;
run;
Alternatively speak to your dba about tuning your database if this is a query you will run often.
For more information, you could re-write your query using proc sql, with the _tree option (to view the execution plan). You could then use the _method option to play around / tune that plan.
Also, check out options sastrace=',,,d' sastraceloc=saslog; to show your dba more info on what is being sent to the database in terms of the underlying SQL query.

All the records within an hour of each other

I have a data set that has ID, datetime + bunch of value fields.
The idea is that the records are within one hour of each other are one session. There can only be one session every 24 hours. (Time is measured from the start of the first record)
The day() approach does not work as one record can be 23:55 PM and the next one could be 12:01 AM the next day and it would be the same session.
I've added rowid and ran the following:
data testing;
set testing;
by subscriber_no;
prev_dt = lag(record_ts);
prev_row = lag(rowid);
time_from_last = intck("Second",record_ts,prev_dt);
if intck("Second",record_ts,prev_dt) > -60*60 and intck("Second",record_ts,prev_dt) < 0 then
same_session = 'yes';
else same_session = 'no';
if intck("Second",record_ts,prev_dt) > -60*60 and intck("Second",record_ts,prev_dt) < 0 then
rowid = prev_row;
else rowid = rowid;
format prev_dt datetime19.;
output;
run;
Input
ID record_TS rowid
52 17MAY2017:06:24:28 4
52 17MAY2017:07:16:12 5
91 05APR2017:07:04:55 6
91 05APR2017:07:23:37 7
91 05APR2017:08:04:52 8
91 05MAY2017:08:56:23 9
input file is sorted by ID and record TS.
The output was
ID record_TS rowid prev_dt prev_row time_from_last same_session
52 17MAY2017:06:24:28 4 28APR2017:08:51:25 3 -1632783 no
52 17MAY2017:07:16:12 4 17MAY2017:06:24:28 4 -3104 yes
91 05APR2017:07:04:55 6 17MAY2017:07:16:12 5 3629477 no
91 05APR2017:07:23:37 6 05APR2017:07:04:55 6 -1122 yes
91 05APR2017:08:04:52 7 05APR2017:07:23:37 7 -2475 yes This needs to be 6
91 05MAY2017:08:56:23 9 05APR2017:08:04:52 8 -2595091 no
Second row from the bottom - rowid comes out 7, while I need it to come be 6.
Basically I need to change to the current rowid saved before the script moves to assess the next one.
Thank you
Ben
I've achieved what I needed with
proc sql;
create table testing2 as
select distinct t1.*, min(t2.record_TS) format datetime19. as from_time, max(t2.record_TS) format datetime19. as to_time
from testing t1
join testing t2 on t1.id_val= t2.id_val
and intck("Second",t1.record_ts,t2.record_ts) between -3600 and 3600
group by t1.id_val, t1.record_ts
order by t1.id_val, t1.record_ts
;
quit;
But I'm still wondering if there is a way to commit changes to current row before moving to assess the next row.
I think your logic is just:
Grab record_TS datetime of the first record for each ID
For subsequent records, if their record_TS is within an hour of the first record's, recode it to be the same rowID as first record.
If that's the case, you can use RETAIN to keep track of the first record_TS and rowID for each ID. This should be easier than lag(), and allows there to be multiple records in a single session. Below seems to work:
data have;
input ID record_TS datetime. rowid;
format record_TS datetime.;
cards;
52 17MAY2017:06:24:28 4
52 17MAY2017:07:16:12 5
91 05APR2017:07:04:55 6
91 05APR2017:07:23:37 7
91 05APR2017:08:04:52 8
91 05MAY2017:08:56:23 9
;
run;
data want;
set have;
by ID Record_TS;
retain SessionStart SessionRowID;
if first.ID then do;
SessionStart=Record_TS;
SessionRowID=RowID;
end;
else if (record_TS-SessionStart)<(60*60) then RowID=SessionRowID;
drop SessionStart SessionRowID;
run;
Outputs:
ID record_TS rowid
52 17MAY17:06:24:28 4
52 17MAY17:07:16:12 4
91 05APR17:07:04:55 6
91 05APR17:07:23:37 6
91 05APR17:08:04:52 6
91 05MAY17:08:56:23 9

How to subset automatically in SAS?

I am new to SAS, so this might be a silly type of question.
Assume there are several datasets with similar structure but different column names. I want to get new datasets with the same number of rows but only a subset of columns.
In the following example, Data_A and Data_B are original datasets and SubA and SubBare what I want. What is the efficient way of deriving SubA and SubB?
DATA A_auto;
LENGTH A_make $ 20;
INPUT A_make $ 1-17 A_price A_mpg A_rep78 A_hdroom A_trunk A_weight A_length A_turn A_displ A_gratio A_foreign;
CARDS;
AMC Concord 4099 22 3 2.5 11 2930 186 40 121 3.58 0
AMC Pacer 4749 17 3 3.0 11 3350 173 40 258 2.53 0
Audi Fox 6295 23 3 2.5 11 2070 174 36 97 3.70 1
;
RUN;
DATA B_auto;
LENGTH make $ 20;
INPUT B_make $ 1-17 B_price B_mpg B_rep78 B_hdroom B_trunk B_weight B_length B_turn B_displ B_gratio B_foreign;
CARDS;
Toyota Celica 5899 18 5 2.5 14 2410 174 36 134 3.06 1
Toyota Corolla 3748 31 5 3.0 9 2200 165 35 97 3.21 1
VW Scirocco 6850 25 4 2.0 16 1990 156 36 97 3.78 1
;
RUN;
DATA SubA;
set A_auto;
keep A_make A_price;
RUN;
DATA SubB;
set B_auto;
keep B_make B_price;
RUN;
Here's my new answer. This introduces quite a few concepts, but all are necessary to complete this task.
First of all I would store the required part variable names (the suffixes that are common to all datasets) in a new dataset. This keeps them all in one place and makes it easier to change if required.
The next step is to create a regular expression (regex) search string that combines all the names, separated by a pipe (|), which is the regex symbol for or. I've also added a $ symbol to end of the names, this ensures only variables ending with the part names will be selected.
select into :[macroname] is the method to create macro variables within proc sql
Then I set up a macro to extract the specific variable names for the current dataset and use those names to create a view (like my original answer)
The dictionary library referenced in the proc sql is a metadata library that contains information on all active libraries, tables, columns etc, so is a good source of identifying what the actual variable names are called (based on the regex search string created earlier).
You won't need the proc print in your code, I just put it in to show everything is working as expected.
Let me know if this works for you
/* create intial datasets */
DATA A_auto;
LENGTH A_make $ 20;
INPUT A_make $ 1-17 A_price A_mpg A_rep78 A_hdroom A_trunk A_weight A_length A_turn A_displ A_gratio A_foreign;
CARDS;
AMC Concord 4099 22 3 2.5 11 2930 186 40 121 3.58 0
AMC Pacer 4749 17 3 3.0 11 3350 173 40 258 2.53 0
Audi Fox 6295 23 3 2.5 11 2070 174 36 97 3.70 1
;
RUN;
DATA B_auto;
LENGTH B_make $ 20;
INPUT B_make $ 1-17 B_price B_mpg B_rep78 B_hdroom B_trunk B_weight B_length B_turn B_displ B_gratio B_foreign;
CARDS;
Toyota Celica 5899 18 5 2.5 14 2410 174 36 134 3.06 1
Toyota Corolla 3748 31 5 3.0 9 2200 165 35 97 3.21 1
VW Scirocco 6850 25 4 2.0 16 1990 156 36 97 3.78 1
;
RUN;
/* create dataset containing partial name of variables to keep */
data keepvars;
input part_name $ :20.;
datalines;
_make
_price
;
run;
/* create regular expression search string from partial names */
proc sql noprint;
select
cats(part_name,'$') /* '$' matches end of string */
into
:name_str separated by '|' /* '|' is an 'or' search operator in regular expressions */
from
keepvars;
quit;
%put &name_str.; /* print search string to log */
/* macro to create views from datasets */
%macro create_views (dsname, vwname); /* inputs are dataset name being read in and view name being created */
/* extract specific variable names to be kept, based on search string */
proc sql noprint;
select
name
into
:vars separated by ' '
from
dictionary.columns
where
libname = 'WORK'
and memname = upper("&dsname.")
and prxmatch("/&name_str./",strip(name))>0; /* prxmatch is regular expression search function */
quit;
%put &vars.; /* print variables to keep to log */
/* create views */
data &vwname. / view=&vwname.;
set &dsname. (keep=&vars.);
run;
/* test view by printing */
proc print data=&vwname.;;
run;
%mend create_views;
/* run macro for each dataset */
%create_views(A_auto, SubA);
%create_views(B_auto, SubB);

Drop all observations by ID where conditions are not met

I have a dataset with ~4 million transactional records, grouped by Customer_No (consisting of 1 or more transactions per Customer_No, denoted by a sequential counter). Each transaction has a Type code and I am only interested in customers where a particular combination of transaction Types were used. Neither joining the table on itself or using EXISTS in Proc Sql is allowing me to efficiently evaluate the transaction Type criteria. I suspect a data step using retain and do-loops would process the dataset faster
The dataset:
Customer_No Tran_Seq Tran_Type
0001 1 05
0001 2 12
0002 1 07
0002 2 86
0002 3 04
0003 1 07
0003 2 84
0003 3 84
0003 4 84
The criteria I am trying to apply:
All Customer_No's Tran_Type's must only be in ('04','05','07','84','86'),
drop all transactions for that Customer_No if any other Tran_Type was used
Customer_No's Tran_Type's must include ('84' or '86') AND '04', drop all transactions for the Customer_No if this condition is not met
The output I want:
Customer_No Tran_Seq Tran_Type
0002 1 07
0002 2 86
0002 3 04
The DoW loop solution should be the most efficient if the data is sorted. If it's not sorted, it will either be the most efficient or similar in scale but slightly less efficient depending on the circumstances of the dataset.
I compared to Dom's solution with a 3e7 ID dataset, and got for the DoW a similar (slightly less) total length with less CPU for unsorted dataset, and about 50% faster for sorted. It is guaranteed to run in about the length of time the dataset takes to write out (maybe a bit more, but it shouldn't be much), plus sorting time if needed.
data want;
do _n_=1 by 1 until (last.customer_no);
set have;
by customer_no;
if tran_type in ('84','86')
then has_8486 = 1;
else if tran_type in ('04')
then has_04 = 1;
else if not (tran_type in ('04','05','07','84','86'))
then has_other = 1;
end;
do _n_= 1 by 1 until (last.customer_no);
set have;
by customer_no;
if has_8486 and has_04 and not has_other then output;
end;
run;
I don't think it's that complicated. Join to a subquery, group by Customer_No, and put your conditions in a having clause. A condition in a min function must be true for all rows, whereas a condition in a max function must be true for any one row:
proc sql;
create table want as
select
h.*
from
have h
inner join (
select
Customer_No
from
have
group by
Customer_No
having
min(Tran_Type in('04','05','07','84','86')) and
max(Tran_Type in('84','86')) and
max(Tran_Type eq '04')) h2
on h.Customer_No = h2.Customer_No
;
quit;
I must have made a join error. On re-writing, Proc Sql completed in less than 30 seconds (on the original 4.9 million record dataset). It's not particularly elegant code though, so I'd still appreciate any improvements or alternative methods.
data Have;
input Customer_No $ Tran_Seq $ Tran_Type:$2.;
cards;
0001 1 05
0001 2 12
0002 1 07
0002 2 86
0002 3 04
0003 1 07
0003 2 84
0003 3 84
0003 4 84
;
run;
Proc sql;
Create table Want as
select t1.* from Have t1
LEFT JOIN (select DISTINCT Customer_No from Have
where Tran_Type not in ('04','05','07','84','86')
) t2
ON(t1.Customer_No=t2.Customer_No)
INNER JOIN (select DISTINCT Customer_No from Have
where Tran_Type in ('84','86')
) t3
ON(t1.Customer_No=t3.Customer_No)
INNER JOIN (select DISTINCT Customer_No from Have
where Tran_Type in ('04')
) t4
ON(t1.Customer_No=t4.Customer_No)
Where t2.Customer_No is null
;Quit;
I would offer a slightly less complex SQL solution than #naed555 using the INTERSECT operator.
proc sql noprint;
create table to_keep as
(
select distinct customer_no
from have
where tran_type in ('84','86')
INTERSECT
select distinct customer_no
from have
where tran_type in ('04')
)
EXCEPT
select distinct customer_no
from have
where tran_type not in ('04','05','07','84','86')
;
create table want as
select a.*
from have as a
inner join
to_keep as b
on a.customer_no = b.customer_no;
quit;

Check if a column exists and then sum in SAS

This is my input dataset:
Ref Col_A0 Col_01 Col_02 Col_aa Col_03 Col_04 Col_bb
NYC 10 0 44 55 66 34 44
CHG 90 55 4 33 22 34 23
TAR 10 8 0 25 65 88 22
I need to calculate the % of Col_A0 for a specific reference.
For example % col_A0 would be calculated as
10/(10+0+44+55+66+34+44)=.0395 i.e. 3.95%
So my output should be
Ref %Col_A0 %Rest
NYC 3.95% 96.05%
CHG 34.48% 65.52%
TAR 4.58% 95.42%
I can do this part but the issue is column variables.
Col_A0 and Ref are fixed columns so they will be there in the input every time. But the other columns won't be there. And there can be some additional columns too like Col_10, col_11 till col_30 and col_cc till col_zz.
For example the input data set in some scenarios can be just:
Ref Col_A0 Col_01 Col_02 Col_aa Col_03
NYC 10 0 44 55 66
CHG 90 55 4 33 22
TAR 10 8 0 25 65
So is there a way I can write a SAS code which checks to see if the column exists or not. Or if there is any other better way to do it.
This is my current SAS code written in Enterprise Guide.
PROC SQL;
CREATE TABLE output123 AS
select
ref,
(col_A0/(Sum(Col_A0,Col_01,Col_02,Col_aa,Col_03,Col_04,Col_bb)) FORMAT=PERCENT8.2 AS PERCNT_ColA0,
(1-(col_A0/(Sum(Col_A0,Col_01,Col_02,Col_aa,Col_03,Col_04,Col_bb))) FORMAT=PERCENT8.2 AS PERCNT_Rest
From Input123;
quit;
Scenarios where all the columns are not there I get an error. And if there are additional columns then I miss those. Please advice.
Thanks
I would not use SQL, but would use regular datastep.
data want;
set have;
a0_prop = col_a0/sum(of _numeric_);
run;
If you wanted to do this in SQL, the easiest way is to keep (or transform) the dataset in vertical format, ie, each variable a separate row per ID. Then you don't need to know how many variables there are to figure it out.
If you always want to sum all the numeric columns then just do :
col_A0 / sum(of _numeric_)