Is there a way to count how many days pass from the end of one Start-End period to the start of the next one? Let's say:
ID Start End
0001 22JAN2022 23JAN2022
0001 26JAN2022 30JAN2022
0001 03FEB2022 08MAR2022
0001 09MAR2022 15MAR2022
0001 17MAR2022 30MAR2022
desired output:
ID Start End days
0001 22JAN2022 23JAN2022 3
0001 26JAN2022 30JAN2022 4
0001 03FEB2022 08MAR2022 1
0001 09MAR2022 15MAR2022 2
0001 17MAR2022 30MAR2022 .......
I believe I demonstrated this in another thread, but here you go:
data have;
input ID $ (Start End)(:date9.);
format Start End date9.;
datalines;
0001 22JAN2022 23JAN2022
0001 26JAN2022 30JAN2022
0001 03FEB2022 08MAR2022
0001 09MAR2022 15MAR2022
0001 17MAR2022 30MAR2022
;
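data want;
set have;
by ID;
/* Look ahead: read the next observation's Start into s */
set have(firstobs = 2 keep = Start rename = (Start = s))
have(obs = 1 drop = _all_); /* dummy read so the last row is not lost */
if last.ID then s = .; /* no following period within this ID */
days = s - End;
drop s;
run;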
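For comparison, here is a minimal sketch of the same look-ahead written as a one-to-one MERGE (no BY statement). It assumes have is already sorted by ID and Start, as in the example; want_merge, next_id and next_start are names made up for this illustration.

```sas
/* Sketch: pair each row with the Start of the row that follows it.   */
/* Assumes HAVE is sorted by ID and Start. next_id and next_start are */
/* hypothetical helper names, not part of the original data.          */
data want_merge;
   merge have
         have(firstobs = 2 keep = ID Start
              rename = (ID = next_id Start = next_start));
   /* Only compute the gap when the next row belongs to the same ID */
   if ID = next_id then days = next_start - End;
   drop next_id next_start;
run;
```

With the five sample rows read in the data step above, days should come out as 3, 4, 1, 2 and missing for the last row.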
Suppose I have the following data set.
ID Hired Start_date End_date Flag_Start Flag_End
0001 1-1900 01JAN2018 21DEC2018 1 2
0001 1-1900 01JAN2019 01DEC2020 2 2
0002 10-2020 26MAR2020 03MAY2020 1 2
0003 03-2021 18DEC2020 31DEC2020 1 2
..... ....... ......... ......... ........... ...........
I would like to get the desired output below. Sorry to ask, but I'm a newbie and this seems to be a very difficult task in SAS. I'm familiar with R.
Desired output:
ID Hired Start_date End_date Flag_Start Flag_End
0001 1-1900 01JAN2018 21DEC2018 1 2
0001 1-1900 01JAN2019 01DEC2020 2 3
0002 03-2020 26MAR2020 03MAY2020 1 0
0003 03-2021 18DEC2020 31DEC2020 1 3
..... ....... ......... ......... ........... ...........
So, for each ID, after sorting, look at the last record: if Hired is 1-1900, set Flag_End to 3; otherwise, if Hired < End_date, set Flag_End to 0; otherwise, if Hired > End_date (and is not 1-1900), set Flag_End to 3.
Thank you in advance
I think this is what you want.
The Hired Date does not match between your two posted data sets. I chose the second one (03-2020).
data have;
input ID $ Hired :anydtdte. (Start_date End_date)(:date9.) Flag_Start Flag_End;
format Hired Start_date End_date date9.;
datalines;
0001 1-1900 01JAN2018 21DEC2018 1 2
0001 1-1900 01JAN2019 01DEC2020 2 2
0002 03-2020 26MAR2020 03MAY2020 1 2
0003 03-2021 18DEC2020 31DEC2020 1 2
;
data want;
set have;
by ID;
if last.ID then do;
if Hired = '01jan1900'd then flag_end = 3;
else if Hired < End_date then flag_end = 0;
else if Hired >= End_date then flag_end = 3;
end;
run;
Suppose I have the following:
data have;
input ID :$20. Label :$20. Hours :$20. Days :$20.;
cards;
0001 w 3144 3
0001 w 23 54
0001 p 12 1
0002 m 456 34
0002 w 2 1
0002 s 231 45
0002 w 98 23
0003 w 12 6
0003 w 98 76
;
Is there a way, for each ID, to sum the Hours to get a total and then split that total across the rows in proportion to Days, but only for rows where the label is w? If the label is not w, Hours should be missing.
Desired output:
data have;
input ID :$20. Label :$20. Hours :$20. Days :$20.;
cards;
0001 w 167.3158 3
0001 w 3011.684 54
0001 p . 1
0002 m . 34
0002 w 32.79167 1
0002 s . 45
0002 w 754.2084 23
0003 w 8.048778 6
0003 w 101.9512 76
;
In other words, for ID 0001 in the desired output I added the hours: 3144 + 23 + 12 = 3179. Then I added the days where the label is "w": 54 + 3 = 57. Finally, I divided 3179 by 57 and multiplied the result by 3 and by 54 respectively, but not by 1 (the row where the label is "p").
Thank you in advance
Same idea as @Stu Sztukowski's answer, but using the DOW-loop technique:
data want;
do until(last.id);
set have;
by id notsorted;
sum_of_hours=sum(sum_of_hours,input(hours,best.));
sum_of_days_w=sum(sum_of_days_w,(label='w')*input(days,best.));
end;
do until(last.id);
set have;
by id notsorted;
if label='w' then hours=cats(sum_of_hours*(input(days,best.)/sum_of_days_w));
else hours='';
output;
end;
run;
The calculation you need to do looks like this in code form:
if label = 'w' then hours = sum_of_hours / sum_of_days_w * days;
else hours = .;
All we need to do is get, for each id, the sum of all hours and the sum of days where label = 'w', then merge them back with our original table by id. The table to do this calculation would look like this:
id label hours days sum_of_hours sum_of_days_w
0001 w 3144 3 3179 57
0001 w 23 54 3179 57
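As a rough sketch, the intermediate table above could be built with PROC SQL by letting SAS remerge the group sums back onto the detail rows. The table name with_sums is made up for this example, and because Hours and Days are read as character in the question's data step, the sketch converts them with INPUT first.

```sas
/* Sketch: attach the per-id sums to every row of HAVE.             */
/* with_sums is a hypothetical table name. Hours and Days are       */
/* character in the posted HAVE, hence the INPUT() conversions.     */
proc sql;
   create table with_sums as
   select id
        , label
        , hours
        , days
        , sum(input(hours, best32.))                as sum_of_hours
        , sum((label = 'w') * input(days, best32.)) as sum_of_days_w
   from have
   group by id;   /* non-grouped columns make SAS remerge the sums */
quit;
```

SAS writes a note to the log that the query requires remerging summary statistics; that is expected here. The row order within each id is not guaranteed.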
You can accomplish this all in a single SQL step.
proc sql;
create table want as
select t1.id
, t1.label
, CASE(t1.label)
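when('w') then t2.sum_hours/t2.sum_days_w * input(t1.days, best32.)
else .
END as hours
, t1.days
from have as t1
/* Get the sum of all hours, and the sum of days where label = 'w' */
/* Hours and Days are character in HAVE, so convert them with INPUT */
LEFT JOIN
(select id
, sum( (label = 'w')*input(days, best32.) ) as sum_days_w
, sum(input(hours, best32.)) as sum_hours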
from have
group by id
) as t2
ON t1.id = t2.id
;
quit;
Is there an elegant way to check in SAS Base if a numeric value is made of only one kind of digit?
Example:
1 -> Yes
11 -> Yes
111 -> Yes
1111 -> Yes
1121 -> No
9999999 -> Yes
9999990 -> No
I would go with something like this.
Realize that SAS does not store leading 0s in numbers, so a value entered as 09999999 will pass: the leading 0 will not show up.
This converts the numbers to strings and then compares the individual characters in the string. Alter the format in the put statement as needed.
Also note that a decimal will fail because the . will be compared to the digits. If you need decimals to pass, remove the . from the string.
data have;
input x;
datalines;
1
11
12
111
1111
1121
99999999
09999999
1.11
;
run;
data test;
set have;
pass = 1;
format temp $32.;
temp = strip(put(x,best32.));
do i=1 to length(temp)-1;
pass = pass and (substr(temp,i,1) = substr(temp,i+1,1));
if ^pass then leave;
end;
drop temp i;
run;
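If decimals should pass, one possible sketch is to strip the decimal point with COMPRESS before comparing. It also swaps the loop for a single COMPRESS test: if removing every copy of the first character leaves nothing, all characters are the same. The dataset name test_decimal is made up, and the sketch assumes x is non-missing and non-negative.

```sas
/* Sketch: treat 1.11 as "111" by removing blanks and the decimal point. */
/* test_decimal is a hypothetical name; assumes x is non-missing and     */
/* non-negative (a minus sign would count as a different character).     */
data test_decimal;
   set have;
   length temp $32;
   temp = compress(put(x, best32.), ' .');
   /* Remove every occurrence of the first character; a blank result */
   /* means the string contained only that one character.            */
   pass = (compress(temp, substr(temp, 1, 1)) = ' ');
   drop temp;
run;
```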
Just want to share an additional solution with regex:
data have;
input x;
datalines;
1
11
12
111
1111
1121
99999999
9999990
;
run;
data want;
set have;
if PRXMATCH("/\b1+\b|\b2+\b|\b3+\b|\b4+\b|\b5+\b|\b6+\b|\b7+\b|\b8+\b|\b9+\b|\b0+\b/",x);
run;
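The alternation over all ten digits can probably be shortened with a backreference: capture one digit and require that only the same digit follows. This is a sketch rather than a drop-in replacement; want_backref is a made-up name, and the explicit PUT with best32. is an assumption to avoid relying on automatic numeric-to-character conversion.

```sas
/* Sketch: (\d) captures the first digit, \1* allows only repeats of it. */
/* \s* tolerates the leading blanks that PUT produces.                   */
data want_backref;
   set have;
   if prxmatch('/^\s*(\d)\1*\s*$/', put(x, best32.));
run;
```

As with the pattern above, this is a subsetting IF, so only the values made of a single repeated digit are kept.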
I have two datasets: base (the master table) and updateX (the updated table, which might contain new observations).
data base;
input Field1 $ Field2 $ Field3 $ Field4 $;
datalines;
F 0001 20160501 ABC
NF 0001 20160502 CDF
NF 0002 20160601 ABC
NF 0002 20160602 CDF
;
run;
data updateX;
input Field1 $ Field2 $ Field3 $ Field4 $;
datalines;
F 0001 20160502 CDF
F 0002 20160602 CDF
F 0003 20160603 CDF
;
run;
My desired output
F 0001 20160501 ABC
F 0001 20160502 CDF
NF 0002 20160601 ABC
F 0002 20160602 CDF
F 0003 20160603 CDF
My effort:
data base;
modify base updateX;
by Field2 Field3;
run;
With MODIFY you need to tell SAS to REPLACE or OUTPUT depending on if the records are matched or not.
data base;
modify base updatex;
by field2 field3;
if _iorc_ eq 0 then replace;
else do;
output;
_error_=0;
end;
run;
It is easier if you can create a new data set using UPDATE. With update the matching records are updated and then output (replaced) and new records from the transaction file are output.
data ubase;
update base updatex;
by field2 field3;
run;
The dataset looks like this:
Code Type Rating
0001 NULL 1
0002 NULL 1
0003 NULL 1
0003 PA 1 3
0004 NULL 1
0004 PB 1 2
0005 AC 1 3
0005 NULL 6
0006 AC 1 2
I want the output dataset to look like this:
Code Type Rating
0001 NULL 1
0002 NULL 1
0003 PA 1 4
0004 PB 1 3
0005 AC 1 9
0006 AC 1 2
For each Code, Type has at most two values. I want to get one row per Code, summing Rating. The problem is Type: if it has only one value, that value should pass to the output dataset. If it has two values (one of which has to be NULL), then the one not equal to NULL should pass to the output dataset.
The total number of observations is N > 100,000,000. So is there any clever way to achieve this?
If the data is sorted as per your example, then you can achieve this in a single data step. I've assumed that the NULL values are actually missing, however if not then change [if missing(type)] to [if type='NULL']. All this does is sum the Rating values for each Code, then output the last record, keeping the non-null Type. If your data isn't sorted or indexed on Code then you'll need to do a sort first, which will obviously add quite a bit to the execution time.
/* create input file */
data have;
infile datalines dsd;
input Code Type $ Rating;
datalines;
0001,,1
0002,,1
0003,,1
0003,PA 1,3
0004,,1
0004,PB 1,2
0005,AC 1,3
0005,,6
0006,AC 1,2
;
run;
/* create summarised dataset */
data want;
set have;
by code;
retain _type; /* temporary variable */
if first.code then do;
_type = type;
_rating_sum = 0; /* reset sum */
end;
_rating_sum + rating; /* sum rating per Code */
if last.code then do;
if missing(type) then type = _type; /* pick non-null value */
rating = _rating_sum; /* insert sum */
output;
end;
drop _type _rating_sum; /* remove temporary variables */
run;
Given the comments, another possibility presents itself: the hash solution. This is memory-constrained, so it may or may not be able to work with the actual data (each hash entry isn't very big, but 100M rows might imply 60 or 70M entries in the hash table, which at 40 or 50 bytes each would still be pretty big).
This is almost certainly inferior to the plain data step method if the dataset is sorted by code, so this should only be used on unsorted data.
The concepts:
Create hash table keyed on code
If incoming record is new, add to hash table
If incoming record is not a new code, take the retrieved value and sum the rating. Check to see if type needs to be replaced.
Output to dataset.
Code:
data _null_;
if _n_=1 then do;
if 0 then set have;
declare hash h(ordered:'a');
h.defineKey('code');
h.defineData('code','type','rating');
h.defineDone();
end;
set have(rename=(type=type_in rating=rating_in)) end=eof;
rc_1 = h.find();
if rc_1 eq 0 then do;
if type ne type_in and type='NULL' then type=type_in;
rating=sum(rating,rating_in);
h.replace();
end;
else do;
type=type_in;
rating=rating_in;
h.add();
end;
if eof then do;
h.output(dataset:'want');
end;
run;
It's pretty easy to do in one SQL step as well. Just use a CASE...WHEN...END to remove the NULLs and a MAX to then get the non-null value.
data have;
input @1 Code 4.
@6 Type $4.
@11 Rating 1.;
datalines;
0001 NULL 1
0002 NULL 1
0003 NULL 1
0003 PA 1 3
0004 NULL 1
0004 PB 1 2
0005 AC 1 3
0005 NULL 6
0006 AC 1 2
;;;;
run;
proc sql;
create table want as
select code,
max(case type when 'NULL' then '' else type end) as type,
sum(Rating) as rating
from have
group by code;
quit;
If you want the NULLs back, then you need to wrap the select in an outer query of the form select code, case type when ' ' then 'NULL' else type end as type, rating from ( ... ); though I would suggest leaving them blank.
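Spelled out, that wrapping might look like the sketch below. It reuses the query above as an inline view; the table name want_null is made up for this example.

```sas
/* Sketch: the inner query is the summarising select from above;     */
/* the outer query turns blank Type values back into 'NULL'.         */
/* want_null is a hypothetical table name.                           */
proc sql;
   create table want_null as
   select code,
          case type when ' ' then 'NULL' else type end as type,
          rating
   from (select code,
                max(case type when 'NULL' then '' else type end) as type,
                sum(Rating) as rating
         from have
         group by code);
quit;
```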