I have a dataset like the one below, and I am wondering how I can fill in the missing dates in SAS.
Missing FROM_NEW (line 2)
First, find the TO_NEW (14NOV2016) in the last row of TOS="FAC_IP" under the same ADM.
Then, look through the PRO_IP rows for the smallest date greater than 14NOV2016 and use it to replace the missing FROM_NEW; here 20NOV2016 should be filled in.
Missing TO_NEW (line 25)
First, find the FROM_NEW (14DEC2016) in the last row of TOS="FAC_IP".
Then, look through the PRO_IP rows for the largest date less than 14DEC2016 and use it to replace the missing TO_NEW; here 08DEC2016 should be filled in.
Obs MEMNUM ADM LINE FROM TO FROM_NEW TO_NEW TOS order
1 9840964190000001 237696870 X 23OCT2016 23NOV2016 23OCT2016 30OCT2016 FAC_IP 1
2 9840964190000001 237696870 R 23OCT2016 23NOV2016 . 23NOV2016 FAC_IP 2
3 9840964190000001 237696870 X 23OCT2016 23NOV2016 02NOV2016 03NOV2016 FAC_IP 3
4 9840964190000001 237696870 X 23OCT2016 23NOV2016 05NOV2016 09NOV2016 FAC_IP 4
5 9840964190000001 237696870 X 23OCT2016 23NOV2016 11NOV2016 14NOV2016 FAC_IP 5
6 9840964190000001 237696870 23OCT2016 23NOV2016 25OCT2016 25OCT2016 PRO_IP .
7 9840964190000001 237696870 23OCT2016 23NOV2016 26OCT2016 26OCT2016 PRO_IP .
8 9840964190000001 237696870 23OCT2016 23NOV2016 27OCT2016 27OCT2016 PRO_IP .
9 9840964190000001 237696870 23OCT2016 23NOV2016 28OCT2016 28OCT2016 PRO_IP .
10 9840964190000001 237696870 23OCT2016 23NOV2016 29OCT2016 29OCT2016 PRO_IP .
11 9840964190000001 237696870 23OCT2016 23NOV2016 30OCT2016 30OCT2016 PRO_IP .
12 9840964190000001 237696870 23OCT2016 23NOV2016 02NOV2016 02NOV2016 PRO_IP .
13 9840964190000001 237696870 23OCT2016 23NOV2016 03NOV2016 03NOV2016 PRO_IP .
14 9840964190000001 237696870 23OCT2016 23NOV2016 06NOV2016 06NOV2016 PRO_IP .
15 9840964190000001 237696870 23OCT2016 23NOV2016 07NOV2016 07NOV2016 PRO_IP .
16 9840964190000001 237696870 23OCT2016 23NOV2016 08NOV2016 08NOV2016 PRO_IP .
17 9840964190000001 237696870 23OCT2016 23NOV2016 11NOV2016 11NOV2016 PRO_IP .
18 9840964190000001 237696870 23OCT2016 23NOV2016 12NOV2016 12NOV2016 PRO_IP .
19 9840964190000001 237696870 23OCT2016 23NOV2016 13NOV2016 13NOV2016 PRO_IP .
20 9840964190000001 237696870 23OCT2016 23NOV2016 14NOV2016 14NOV2016 PRO_IP .
21 9840964190000001 237696870 23OCT2016 23NOV2016 20NOV2016 20NOV2016 PRO_IP .
22 9840964190000001 237696870 23OCT2016 23NOV2016 21NOV2016 21NOV2016 PRO_IP .
23 9840964190000001 237696870 23OCT2016 23NOV2016 22NOV2016 22NOV2016 PRO_IP .
24 9840964190000001 237696870 23OCT2016 23NOV2016 23NOV2016 23NOV2016 PRO_IP .
25 9840964190000001 244243815 R 04DEC2016 17DEC2016 04DEC2016 . FAC_IP 1
26 9840964190000001 244243815 X 04DEC2016 17DEC2016 14DEC2016 17DEC2016 FAC_IP 2
27 9840964190000001 244243815 04DEC2016 17DEC2016 04DEC2016 04DEC2016 PRO_IP .
28 9840964190000001 244243815 04DEC2016 17DEC2016 05DEC2016 05DEC2016 PRO_IP .
29 9840964190000001 244243815 04DEC2016 17DEC2016 06DEC2016 06DEC2016 PRO_IP .
30 9840964190000001 244243815 04DEC2016 17DEC2016 07DEC2016 07DEC2016 PRO_IP .
31 9840964190000001 244243815 04DEC2016 17DEC2016 08DEC2016 08DEC2016 PRO_IP .
32 9840964190000001 244243815 04DEC2016 17DEC2016 14DEC2016 14DEC2016 PRO_IP .
33 9840964190000001 244243815 04DEC2016 17DEC2016 15DEC2016 15DEC2016 PRO_IP .
34 9840964190000001 244243815 04DEC2016 17DEC2016 16DEC2016 16DEC2016 PRO_IP .
35 9840964190000001 244243815 04DEC2016 17DEC2016 17DEC2016 17DEC2016 PRO_IP .
So, here's how you can update FROM_NEW with a SAS macro. It's fairly explicit, so you should be able to adapt it into your own macro for TO_NEW (a sketch of the TO_NEW variant follows the macro). If you need additional help, come back with what you've tried and we can go from there. Make sure all your dates are actual SAS dates and not literals in your dataset.
*create file of all needed from_new dates;
proc sql noprint;
create table needed_FN as
select distinct adm, order as row from test where from_new = .;
quit;
* find out how many we have;
proc sql noprint;
select count(*) into :cnt from needed_fn;
quit;
*create macro to loop through;
%macro test;
%do i = 1 %to &cnt;
*for each needed from_new date place the adm and order number into macro vars;
proc sql noprint;
select ADM, row into :adm, :row from needed_FN where monotonic() = &i;
quit;
* create temp table for specific adm;
proc sql noprint;
create table work as
select * from test where adm = &adm;
quit;
* find the highest ORDER value, i.e. the last FAC_IP row;
proc sql noprint;
select max(order) into :maxOrder from work;
quit;
* get the checkdate;
proc sql noprint;
select TO_NEW format=date9. into :checkDate from work where ORDER = &maxOrder;
quit;
*create temp table of all PRO_IP dates > checkDate;
proc sql noprint;
create table proDates as
select * from work where TOS='PRO_IP' and FROM_NEW > "&checkDate"d;
quit;
* get the smallest one;
proc sql noprint;
select min(From_new) format=date9. into :replaceDate from prodates;
quit;
*update the table;
proc sql;
update test
set from_new = "&replaceDate"d
where adm = &adm
and order = &row;
quit;
%end;
%mend test;
%test;
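As a bonus, here is a hedged, untested sketch of the TO_NEW variant (my addition, not part of the original answer): it mirrors the macro above, but anchors on the FROM_NEW of the last FAC_IP row and takes the largest PRO_IP date before it, per the question's second rule. The names needed_TN, cnt2 and fix_to_new are mine.
*create file of all needed to_new dates;
proc sql noprint;
create table needed_TN as
select distinct adm, order as row from test where to_new = .;
quit;
* find out how many we have;
proc sql noprint;
select count(*) into :cnt2 from needed_TN;
quit;
%macro fix_to_new;
%do i = 1 %to &cnt2;
proc sql noprint;
select ADM, row into :adm, :row from needed_TN where monotonic() = &i;
quit;
proc sql noprint;
create table work as select * from test where adm = &adm;
quit;
proc sql noprint;
select max(order) into :maxOrder from work;
quit;
* anchor on the FROM_NEW of the last FAC_IP row;
proc sql noprint;
select FROM_NEW format=date9. into :checkDate from work where order = &maxOrder;
quit;
* largest PRO_IP date before the anchor;
proc sql noprint;
select max(from_new) format=date9. into :replaceDate
from work where TOS = 'PRO_IP' and FROM_NEW < "&checkDate"d;
quit;
proc sql;
update test
set to_new = "&replaceDate"d
where adm = &adm and order = &row;
quit;
%end;
%mend fix_to_new;
%fix_to_new;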
You will have to use either nested sub-queries or the SAS RETAIN functionality within a DATA step to accomplish this.
The solution below uses sub-queries; see my inline comments (a DATA-step sketch follows at the end):
data have;
/* infile datalines dlm=',' dsd; */
length Obs 8 MEMNUM $22 ADM $9 LINE $1 FROM 8 TO 8 FROM_NEW 8 TO_NEW 8 TOS $6 order 8;
format FROM date9. TO date9. FROM_NEW date9. TO_NEW date9. ;
informat FROM date9. TO date9. FROM_NEW date9. TO_NEW date9. ;
input
Obs MEMNUM $ ADM $ LINE $ FROM TO FROM_NEW TO_NEW TOS $ order ;
datalines;
1 9840964190000001 237696870 X 23OCT2016 23NOV2016 23OCT2016 30OCT2016 FAC_IP 1
2 9840964190000001 237696870 R 23OCT2016 23NOV2016 . 23NOV2016 FAC_IP 2
3 9840964190000001 237696870 X 23OCT2016 23NOV2016 02NOV2016 03NOV2016 FAC_IP 3
4 9840964190000001 237696870 X 23OCT2016 23NOV2016 05NOV2016 09NOV2016 FAC_IP 4
5 9840964190000001 237696870 X 23OCT2016 23NOV2016 11NOV2016 14NOV2016 FAC_IP 5
6 9840964190000001 237696870 . 23OCT2016 23NOV2016 25OCT2016 25OCT2016 PRO_IP .
7 9840964190000001 237696870 . 23OCT2016 23NOV2016 26OCT2016 26OCT2016 PRO_IP .
8 9840964190000001 237696870 . 23OCT2016 23NOV2016 27OCT2016 27OCT2016 PRO_IP .
9 9840964190000001 237696870 . 23OCT2016 23NOV2016 28OCT2016 28OCT2016 PRO_IP .
10 9840964190000001 237696870 . 23OCT2016 23NOV2016 29OCT2016 29OCT2016 PRO_IP .
11 9840964190000001 237696870 . 23OCT2016 23NOV2016 30OCT2016 30OCT2016 PRO_IP .
12 9840964190000001 237696870 . 23OCT2016 23NOV2016 02NOV2016 02NOV2016 PRO_IP .
13 9840964190000001 237696870 . 23OCT2016 23NOV2016 03NOV2016 03NOV2016 PRO_IP .
14 9840964190000001 237696870 . 23OCT2016 23NOV2016 06NOV2016 06NOV2016 PRO_IP .
15 9840964190000001 237696870 . 23OCT2016 23NOV2016 07NOV2016 07NOV2016 PRO_IP .
16 9840964190000001 237696870 . 23OCT2016 23NOV2016 08NOV2016 08NOV2016 PRO_IP .
17 9840964190000001 237696870 . 23OCT2016 23NOV2016 11NOV2016 11NOV2016 PRO_IP .
18 9840964190000001 237696870 . 23OCT2016 23NOV2016 12NOV2016 12NOV2016 PRO_IP .
19 9840964190000001 237696870 . 23OCT2016 23NOV2016 13NOV2016 13NOV2016 PRO_IP .
20 9840964190000001 237696870 . 23OCT2016 23NOV2016 14NOV2016 14NOV2016 PRO_IP .
21 9840964190000001 237696870 . 23OCT2016 23NOV2016 20NOV2016 20NOV2016 PRO_IP .
22 9840964190000001 237696870 . 23OCT2016 23NOV2016 21NOV2016 21NOV2016 PRO_IP .
23 9840964190000001 237696870 . 23OCT2016 23NOV2016 22NOV2016 22NOV2016 PRO_IP .
24 9840964190000001 237696870 . 23OCT2016 23NOV2016 23NOV2016 23NOV2016 PRO_IP .
25 9840964190000001 244243815 R 04DEC2016 17DEC2016 04DEC2016 . FAC_IP 1
26 9840964190000001 244243815 X 04DEC2016 17DEC2016 14DEC2016 17DEC2016 FAC_IP 2
27 9840964190000001 244243815 . 04DEC2016 17DEC2016 04DEC2016 04DEC2016 PRO_IP .
28 9840964190000001 244243815 . 04DEC2016 17DEC2016 05DEC2016 05DEC2016 PRO_IP .
29 9840964190000001 244243815 . 04DEC2016 17DEC2016 06DEC2016 06DEC2016 PRO_IP .
30 9840964190000001 244243815 . 04DEC2016 17DEC2016 07DEC2016 07DEC2016 PRO_IP .
31 9840964190000001 244243815 . 04DEC2016 17DEC2016 08DEC2016 08DEC2016 PRO_IP .
32 9840964190000001 244243815 . 04DEC2016 17DEC2016 14DEC2016 14DEC2016 PRO_IP .
33 9840964190000001 244243815 . 04DEC2016 17DEC2016 15DEC2016 15DEC2016 PRO_IP .
34 9840964190000001 244243815 . 04DEC2016 17DEC2016 16DEC2016 16DEC2016 PRO_IP .
35 9840964190000001 244243815 . 04DEC2016 17DEC2016 17DEC2016 17DEC2016 PRO_IP .
;
run;
proc sql;
create table missing_dates as
select MEMNUM , ADM , FROM_NEW , TO_NEW , MAX_FROM_NEW , Max_TO_NEW ,
max(FNL) as FNL format=date9., min(TNL) as TNL format=date9.
from (
select t1.*, t2.Max_Order, t3.FROM_NEW as MAX_FROM_NEW , t3.TO_NEW as Max_TO_NEW ,
t4.FROM_NEW as FNL, t4.TO_NEW as TNL
from
/* Identify Missing dates*/
(select * from have where FROM_NEW = . or TO_NEW=.) as t1
left join
/* Identify Max order */
(select MEMNUM, ADM, max(order) as Max_Order
from have where TOS='FAC_IP' group by MEMNUM, ADM ) as t2 on t1.MEMNUM=t2.MEMNUM and t1.ADM = t2.ADM
/* Join Max Order with the rest of the Data */
inner join have as t3 on t3.TOS='FAC_IP' and t3.MEMNUM=t2.MEMNUM and t3.ADM = t2.ADM and t3.order=t2.Max_Order
/* Join Max Order with PRO_IP */
left join
(select MEMNUM, ADM, FROM_NEW , TO_NEW from have where TOS='PRO_IP' group by MEMNUM, ADM ) as t4 on t1.MEMNUM=t4.MEMNUM and t1.ADM = t4.ADM
) as want
where case when (FROM_NEW=. and TNL> Max_TO_NEW) then 1
when (TO_NEW=. and FNL < MAX_FROM_NEW) then 1
else 0
end
group by MEMNUM , ADM , FROM_NEW , TO_NEW , MAX_FROM_NEW , Max_TO_NEW
;
quit;
proc sql;
create table update_missing as select
A.Obs, A.MEMNUM, A.ADM , A.LINE , A.FROM , A.TO ,
coalesce(A.FROM_NEW, B.TNL) as FROM_NEW format=date9. ,
coalesce(A.TO_NEW, B.FNL) as TO_NEW format=date9. ,
A.TOS , A.order
from have A left join missing_dates B
on (A.FROM_NEW = . or A.TO_NEW=.) and A.MEMNUM=B.MEMNUM and A.ADM = B.ADM
order by A.OBS
;
quit;
Final table: the same rows as HAVE, with FROM_NEW on obs 2 filled in as 20NOV2016 and TO_NEW on obs 25 filled in as 08DEC2016.
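For completeness, here is a hedged sketch (my addition) of the DATA-step alternative mentioned at the top. It uses a "double DoW" loop rather than an explicit RETAIN (the two loops stay inside one data-step iteration, so nothing needs to be retained). It assumes HAVE is sorted by ADM with the FAC_IP rows before the PRO_IP rows, as in the sample, and that the dates are true SAS dates; WANT2 is an illustrative name.
data want2;
min_gt = .; max_lt = .;
/* pass 1: scan one ADM group; keep the last FAC_IP row's dates
   and the best candidate PRO_IP dates */
do _n_ = 1 by 1 until (last.adm);
   set have;
   by adm;
   if tos = 'FAC_IP' then do;
      fac_from = from_new; /* ends up holding the last FAC_IP row */
      fac_to   = to_new;
   end;
   else if tos = 'PRO_IP' then do;
      if from_new > fac_to and (min_gt = . or from_new < min_gt)
         then min_gt = from_new; /* smallest PRO_IP date after the anchor */
      if . < from_new < fac_from and from_new > max_lt
         then max_lt = from_new; /* largest PRO_IP date before the anchor */
   end;
end;
/* pass 2: re-read the same group and fill the gaps */
do _n_ = 1 to _n_;
   set have;
   if tos = 'FAC_IP' then do;
      if from_new = . then from_new = min_gt;
      if to_new   = . then to_new   = max_lt;
   end;
   output;
end;
drop min_gt max_lt fac_from fac_to;
run;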
I'm asking for help with something I can manage in R but not in SAS, and I must do it in SAS.
Suppose we are dealing with the following dataset:
ID Start End Label
subjectA 01/01/2020 15/01/2020 holidays
subjectA 16/01/2020 20/01/2020 holidays
subjectB 01/05/2020 30/05/2020 holidays
subjectB 01/06/2020 07/06/2020 holidays
subjectC 01/02/2020 01/02/2020 work_permit
subjectD 01/03/2020 01/09/2020 maternity
subjectE 03/01/2020 09/01/2020 disease
subjectE 11/01/2020 13/01/2020 disease
subjectF 12/02/2020 12/02/2020 work_permit
subjectG 11/09/2020 20/09/2020 course
....... ........ ........ ..........
I need the following:
For repeated entries (after sorting by Start and End, so that each Start-End period comes before the next one):
if the gap between one period's End and the next period's Start is 1 day for the same repeated entry (ID), then sum the number of days; otherwise (gap > 1 day) count without summing. This applies to holidays, maternity and disease.
if the gap is 0 days (consecutive) for the same repeated entry (ID), then sum the number of days; otherwise, if the gap is > 1 day, sum without counting. This applies to holidays, maternity and disease.
For non-repeated entries:
count the days;
count 1 day if Start and End are the same.
Desired output:
ID Days Label Flag
subjectA 20 holidays summarised
subjectB 37 holidays summarised
subjectC 1 work_permit single_day
subjectD 184 maternity consecutive_not_summarized
subjectE 19 disease summarised
subjectF 1 work_permit single_day
subjectG 10 course consecutive_not_summarized
....... ........ ........ ..........
For ranges involving February, note that February 2020 had 29 days. Moreover, there might be more than two periods per repeated ID.
Sorry, it seems complex; I have no idea how to start writing this in SAS, so I need support and guidance.
Thank you in advance
To calculate the days in a period, just subtract Start from End and add one. To calculate the gap between periods, use the LAG() of End. Make sure to reset the calculated gap when starting a new ID.
data have;
input ID :$20. Start :ddmmyy. End :ddmmyy. Label :$20.;
format start end yymmdd10.;
cards;
subjectA 01/01/2020 15/01/2020 holidays
subjectA 16/01/2020 20/01/2020 holidays
subjectB 01/05/2020 30/05/2020 holidays
subjectB 01/06/2020 07/06/2020 holidays
subjectC 01/02/2020 01/02/2020 work_permit
subjectD 01/03/2020 01/09/2020 maternity
subjectE 03/01/2020 09/01/2020 disease
subjectE 11/01/2020 13/01/2020 disease
subjectF 12/02/2020 12/02/2020 work_permit
subjectG 11/09/2020 20/09/2020 course
;
data days;
set have;
by id;
days = end - start + 1 ;
gap = start - lag(end);
if first.id then gap=. ;
run;
Result
Obs ID Start End Label days gap
1 subjectA 2020-01-01 2020-01-15 holidays 15 .
2 subjectA 2020-01-16 2020-01-20 holidays 5 1
3 subjectB 2020-05-01 2020-05-30 holidays 30 .
4 subjectB 2020-06-01 2020-06-07 holidays 7 2
5 subjectC 2020-02-01 2020-02-01 work_permit 1 .
6 subjectD 2020-03-01 2020-09-01 maternity 185 .
7 subjectE 2020-01-03 2020-01-09 disease 7 .
8 subjectE 2020-01-11 2020-01-13 disease 3 2
9 subjectF 2020-02-12 2020-02-12 work_permit 1 .
10 subjectG 2020-09-11 2020-09-20 course 10 .
But I cannot figure out what you want to do when the gap is larger than 1, and your example data and results do not really provide any guidance. For most cases you seem to just want the sum of the days, whether or not there is a large gap.
proc summary data=days;
by id label;
var days;
output out=want sum= ;
run;
Result:
Obs ID Label _TYPE_ _FREQ_ days
1 subjectA holidays 0 2 20
2 subjectB holidays 0 2 37
3 subjectC work_permit 0 1 1
4 subjectD maternity 0 1 185
5 subjectE disease 0 2 10
6 subjectF work_permit 0 1 1
7 subjectG course 0 1 10
If you want to exclude periods that are more than 1 day after the previous period you could just add a WHERE clause.
proc summary data=days;
where gap < 2;
by id label;
var days;
output out=want sum= ;
run;
Results:
Obs ID Label _TYPE_ _FREQ_ days
1 subjectA holidays 0 2 20
2 subjectB holidays 0 1 30
3 subjectC work_permit 0 1 1
4 subjectD maternity 0 1 185
5 subjectE disease 0 1 7
6 subjectF work_permit 0 1 1
7 subjectG course 0 1 10
If the goal is to collapse the intervals into periods without gaps, then make a new variable to indicate when a new period starts.
data days;
set have;
by id;
days = end - start + 1 ;
gap = start - lag(end);
period + (gap > 1);
if first.id then do;
gap=. ;
period=1;
end;
run;
proc summary data=days ;
by id period label ;
var days;
output out=want sum=;
run;
Now subjects B and E have two periods and the other examples only one.
Results
Obs ID period Label _TYPE_ _FREQ_ days
1 subjectA 1 holidays 0 2 20
2 subjectB 1 holidays 0 1 30
3 subjectB 2 holidays 0 1 7
4 subjectC 1 work_permit 0 1 1
5 subjectD 1 maternity 0 1 185
6 subjectE 1 disease 0 1 7
7 subjectE 2 disease 0 1 3
8 subjectF 1 work_permit 0 1 1
9 subjectG 1 course 0 1 10
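To reproduce the Flag column of the desired output, here is a hedged sketch (my addition, based on the expected results in the question): an ID with more than one interval is 'summarised', a one-day total is 'single_day', and anything else is 'consecutive_not_summarized'. It re-runs the first summary under a new name, since WANT was re-used by the later steps; _FREQ_ is the interval count per ID.
proc summary data=days;
by id label;
var days;
output out=totals sum= ;
run;
data flagged;
set totals;
length flag $26;
if _freq_ > 1 then flag = 'summarised';
else if days = 1 then flag = 'single_day';
else flag = 'consecutive_not_summarized';
run;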
Strong suspicion this won't scale, because you're not accounting for weekends or days off allowed at work (i.e. statutory holidays and office closures), and the consecutive-days rules don't seem to be applied consistently.
But here's a start. You can modify it as needed or update your question with more details. Please include data as a data step in future questions (like the first step of my answer below).
data have;
input ID : $14. Start : ddmmyy10. End : ddmmyy10. Label : $20.;
format start end date9.;
cards;
subjectA 01/01/2020 15/01/2020 holidays
subjectA 16/01/2020 20/01/2020 holidays
subjectB 01/05/2020 30/05/2020 holidays
subjectB 01/06/2020 07/06/2020 holidays
subjectC 01/02/2020 01/02/2020 work_permit
subjectD 01/03/2020 01/09/2020 maternity
subjectE 03/01/2020 09/01/2020 disease
subjectE 11/01/2020 13/01/2020 disease
subjectF 12/02/2020 12/02/2020 work_permit
subjectG 11/09/2020 20/09/2020 course
;
;
;;
run;
data groups;
set have;
by id label notsorted;
prev_end=lag(end);
if first.id then
do;
group=0;
call missing(prev_end);
end;
if first.label or prev_end+1 ne start then
group+1;
duration=end-start+1;
run;
proc means data=groups noprint nway;
class id label group;
var duration;
output out=summarized N=Number_Events Sum=Total_Duration;
run;
data want;
set summarized;
length flag $26;
if Number_Events > 1 and label not in ('work_permit', 'maternity') then flag = 'summarised';
else if Total_Duration = 1 then flag = 'single_day';
else flag = 'consecutive_not_summarized';
run;
I have a Stata dataset with six pairs of variables holding consumed food ingredient codes and their weights in grams. I have another dataset with food ingredient codes and the corresponding calories per 100 g. I need to match the codes to calories so that I can calculate total calorie consumption.
How can I do that (by replacing values or by generating new variables)?
My first (master) dataset is
clear
input double hhid int(Ingredient_1_code Ingredient_1_weight Ingredient_2_code incredient_2_weight Ingredient_3_code ingredient_3_weight Ingredient_4_code Ingredient_4_weight Ingredient_5_code Ingredient_5_weight Ingredient_6_code Ingredient_6_weight)
1 269 8 266 46 . . . . . . . .
1 315 19 . . . . . . . . . .
1 316 9 . . . . . . . . . .
1 2522 3 . . . . . . . . . .
1 1 1570 . . . . . . . . . .
1 1 530 . . . . . . . . . .
1 61 262 64 23 57 17 31 8 2522 5 . .
1 130 78 64 23 57 17 2521 2 31 15 248 1
1 228 578 64 138 57 37 248 3 2521 14 31 35
2 142 328 . . . . . . . . . .
2 272 78 . . . . . . . . . .
2 1 602 . . . . . . . . . .
2 51 344 61 212 246 2 64 50 65 11 2522 10
2 176 402 44 348 61 163 57 17 248 2 64 71
3.1 1 1219 . . . . . . . . . .
3.1 1 410 . . . . . . . . . .
3.1 54 130 52 60 61 32 51 23 21 17 57 4
3.1 44 78 130 44 57 3 248 4 31 49 2522 6
3.1 231 116 904 119 61 220 57 22 248 3 254 6
3.2 156 396 . . . . . . . . . .
3.2 272 78 . . . . . . . . . .
end
My second dataset with food ingredient codes and calorie per 100 g is
clear
input str39 Ingredient int(Ingredient_codes Calorie_per_100gm)
"Parboiled rice (coarse)" 1 344
"Non-parboiled rice (coarse)" 2 344
"Fine rice" 3 344
"Rice flour" 4 366
"Suji (cream of wheat/barley)" 5 364
"Wheat" 6 347
"Atta" 7 334
"Maida (wheat flour w/o bran)" 8 346
"Semai/noodles" 9 347
"Chaatu" 10 324
"Chira (flattened rice)" 11 356
"Muri/Khoi (puffed rice)" 12 361
"Barley" 13 324
"Sagu" 14 346
"Corn" 15 355
"Cerelac" 16 418
"Lentil" 21 317
"Chick pea" 22 327
"Anchor daal" 23 375
"Black gram" 24 317
"Khesari" 25 352
"Mung" 26 161
end
I want to get the calories per 100 g into the master dataset according to the ingredient codes.
I agree with the comment Nick made that it is better to first make this data long. Read why that is a much better practice here: https://worldbank.github.io/dime-data-handbook/processing.html#making-data-tidy
However, it can be done in the current un-tidy wide format if you for some reason must keep your data like that. The code below shows how that can be done; a sketch of the long-format route follows at the end.
clear
input str39 Ingredient int(Ingredient_codes Calorie_per_100gm)
"Parboiled rice (coarse)" 1 344
"Non-parboiled rice (coarse)" 2 344
"Fine rice" 3 344
"Rice flour" 4 366
"Suji (cream of wheat/barley)" 5 364
"Wheat" 6 347
"Atta" 7 334
"Maida (wheat flour w/o bran)" 8 346
"Semai/noodles" 9 347
"Chaatu" 10 324
"Chira (flattened rice)" 11 356
"Muri/Khoi (puffed rice)" 12 361
"Barley" 13 324
"Sagu" 14 346
"Corn" 15 355
"Cerelac" 16 418
"Lentil" 21 317
"Chick pea" 22 327
"Anchor daal" 23 375
"Black gram" 24 317
"Khesari" 25 352
"Mung" 26 161
end
drop Ingredient
tempfile code_calories
save `code_calories'
clear
input double hhid int(Ingredient_1_code Ingredient_1_weight Ingredient_2_code incredient_2_weight Ingredient_3_code ingredient_3_weight Ingredient_4_code Ingredient_4_weight Ingredient_5_code Ingredient_5_weight Ingredient_6_code Ingredient_6_weight)
1 269 8 266 46 . . . . . . . .
1 315 19 . . . . . . . . . .
1 316 9 . . . . . . . . . .
1 2522 3 . . . . . . . . . .
1 1 1570 . . . . . . . . . .
1 1 530 . . . . . . . . . .
1 61 262 64 23 57 17 31 8 2522 5 . .
1 130 78 64 23 57 17 2521 2 31 15 248 1
1 228 578 64 138 57 37 248 3 2521 14 31 35
2 142 328 . . . . . . . . . .
2 272 78 . . . . . . . . . .
2 1 602 . . . . . . . . . .
2 51 344 61 212 246 2 64 50 65 11 2522 10
2 176 402 44 348 61 163 57 17 248 2 64 71
3.1 1 1219 . . . . . . . . . .
3.1 1 410 . . . . . . . . . .
3.1 54 130 52 60 61 32 51 23 21 17 57 4
3.1 44 78 130 44 57 3 248 4 31 49 2522 6
3.1 231 116 904 119 61 220 57 22 248 3 254 6
3.2 156 396 . . . . . . . . . .
3.2 272 78 . . . . . . . . . .
end
*Standardize varname
rename incredient_2_weight Ingredient_2_weight
rename ingredient_3_weight Ingredient_3_weight
*Loop over all variables
forvalues var_num = 1/6 {
*Rename to match name in code_calories dataset
rename Ingredient_`var_num'_code Ingredient_codes
*Merge calories for this ingredient
merge m:1 Ingredient_codes using `code_calories', keep(master matched) nogen
*Calculate calories for this ingredient: grams consumed / 100 * calories per 100 g
gen Calories_`var_num' = Calorie_per_100gm * Ingredient_`var_num'_weight / 100
*Order new variables and restore names of variables
order Calorie_per_100gm Calories_`var_num', after(Ingredient_`var_num'_weight)
rename Ingredient_codes Ingredient_`var_num'_code
rename Calorie_per_100gm Calorie_per_100gm_`var_num'
}
*Summarize calories across all ingredients
egen total_calories = rowtotal(Calories_?)
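For reference, here is a hedged sketch (my addition) of the long-format route recommended at the top. It assumes the two rename fixes above have been applied and that the code_calories tempfile is still saved; row and item are illustrative names.
*Reshape to one row per ingredient slot
gen row = _n
reshape long Ingredient_@_code Ingredient_@_weight, i(row) j(item)
*Drop the empty ingredient slots
drop if missing(Ingredient__code)
*Merge calories once instead of six times
rename Ingredient__code Ingredient_codes
merge m:1 Ingredient_codes using `code_calories', keep(master matched) nogen
*Calories = grams consumed / 100 * calories per 100 g
gen calories = Ingredient__weight / 100 * Calorie_per_100gm
*Household totals
collapse (sum) total_calories = calories, by(hhid)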
I have a daily transactional dataset with gaps. I want to see whether a product with a reference price websiteprice of $X on date T had an actual sold price actualsoldprice of $Y >= $X for at least 10% of the previous 90 days (T - 90 to T). In other words, for each transaction where sale_at_or_above_refprice == 1, I need to count how many prior transactions (for a given product) within the previous 90 days had an actual sold price that met or exceeded that transaction's reference price.
I have included the first-step results that I am looking for in the wanted variable.
My data is as follows,
* Example generated by -dataex-. For more info, type help dataex
clear
input str9 orderdate str16 productcode str10 productcategory byte(websiteprice actualsoldprice sale_at_or_above_refprice var7 wanted)
"3-Jan-20" "MZZ32819-564-282" "Mens Jeans" 40 25 . . .
"8-Jan-20" "MZZ32819-564-282" "Mens Jeans" 40 40 1 . .
"12-Jan-20" "MZZ32819-564-282" "Mens Jeans" 40 40 1 . 1
"12-Sep-20" "MZZ32819-564-282" "Mens Jeans" 40 28 . . .
"18-Sep-20" "MZZ32819-564-282" "Mens Jeans" 40 24 . . .
"20-Sep-20" "MZZ32819-564-282" "Mens Jeans" 50 30 . . .
"27-Sep-20" "MZZ32819-564-282" "Mens Jeans" 50 25 . . .
"11-Oct-20" "MZZ32819-564-282" "Mens Jeans" 40 20 . . .
"19-Oct-20" "MZZ32819-564-282" "Mens Jeans" 35 24 . . .
"2-Nov-20" "MZZ32819-564-282" "Mens Jeans" 20 20 1 . 6
"2-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 . 7
"4-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 . 8
"7-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 . 9
"7-Nov-20" "MZZ32819-564-282" "Mens Jeans" 20 20 1 . 7
"9-Nov-20" "MZZ32819-564-282" "Mens Jeans" 20 20 1 . 8
"11-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 . 12
"12-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 . 13
"14-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 . 14
"15-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 . 15
"18-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 . 16
"24-Nov-20" "MZZ32819-564-282" "Mens Jeans" 20 20 1 . 9
end
EDIT: I have updated the wanted variable and included a new_wanted. The difference is that new_wanted takes repeated dates with multiple prices into account. I am also including two products so the process can be run by product.
* Example generated by -dataex-. For more info, type help dataex
clear
input str9 orderdate str16 productcode str10 productcategory byte(websiteprice actualsoldprice sale_at_or_above_refprice wanted new_wanted)
"3-Jan-20" "MZZ32819-564-282" "Mens Jeans" 40 25 . . .
"8-Jan-20" "MZZ32819-564-282" "Mens Jeans" 40 40 1 0 0
"12-Jan-20" "MZZ32819-564-282" "Mens Jeans" 40 40 1 1 1
"12-Sep-20" "MZZ32819-564-282" "Mens Jeans" 40 28 . . .
"18-Sep-20" "MZZ32819-564-282" "Mens Jeans" 40 24 . . .
"20-Sep-20" "MZZ32819-564-282" "Mens Jeans" 50 30 . . .
"27-Sep-20" "MZZ32819-564-282" "Mens Jeans" 50 25 . . .
"11-Oct-20" "MZZ32819-564-282" "Mens Jeans" 40 20 . . .
"19-Oct-20" "MZZ32819-564-282" "Mens Jeans" 35 24 . . .
"2-Nov-20" "MZZ32819-564-282" "Mens Jeans" 20 20 1 6 6
"2-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 6 6
"4-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 8 7
"7-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 9 8
"7-Nov-20" "MZZ32819-564-282" "Mens Jeans" 20 20 1 7 7
"9-Nov-20" "MZZ32819-564-282" "Mens Jeans" 20 20 1 8 9
"11-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 12 10
"12-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 13 11
"14-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 14 12
"15-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 15 13
"18-Nov-20" "MZZ32819-564-282" "Mens Jeans" 14 14 1 16 14
"24-Nov-20" "MZZ32819-564-282" "Mens Jeans" 20 20 1 9 9
"6-Jan-20" "ADDZ4449-524-645" "Mens Bags" 60 50 . . .
"11-Jan-20" "ADDZ4449-524-645" "Mens Bags" 70 60 . . .
"12-Feb-20" "ADDZ4449-524-645" "Mens Bags" 60 60 1 . 1
"12-Jul-20" "ADDZ4449-524-645" "Mens Bags" 60 50 . . .
"18-Sep-20" "ADDZ4449-524-645" "Mens Bags" 50 55 1 . 1
"20-Sep-20" "ADDZ4449-524-645" "Mens Bags" 50 45 . . .
"20-Sep-20" "ADDZ4449-524-645" "Mens Bags" 66 45 . . .
"12-Oct-20" "ADDZ4449-524-645" "Mens Bags" 55 60 1 . 1
"19-Oct-20" "ADDZ4449-524-645" "Mens Bags" 60 60 1 . 1
"2-Nov-20" "ADDZ4449-524-645" "Mens Bags" 70 73 1 . 0
"2-Nov-20" "ADDZ4449-524-645" "Mens Bags" 60 56 . . .
"4-Nov-20" "ADDZ4449-524-645" "Mens Bags" 60 60 1 . 3
"7-Nov-20" "ADDZ4449-524-645" "Mens Bags" 50 45 . . .
"7-Nov-20" "ADDZ4449-524-645" "Mens Bags" 66 66 1 . 1
"9-Nov-20" "ADDZ4449-524-645" "Mens Bags" 60 56 . . .
"11-Nov-20" "ADDZ4449-524-645" "Mens Bags" 60 76 1 . 5
"12-Nov-20" "ADDZ4449-524-645" "Mens Bags" 60 71 1 . 6
"13-Nov-20" "ADDZ4449-524-645" "Mens Bags" 60 26 . . .
"15-Nov-20" "ADDZ4449-524-645" "Mens Bags" 65 70 1 . 4
"15-Nov-20" "ADDZ4449-524-645" "Mens Bags" 67 70 1 . 3
"22-Nov-20" "ADDZ4449-524-645" "Mens Bags" 56 70 1 . 9
end
Below is the code that I am trying to adapt for this task. Credit to Ken Chui from STATALIST.
gen date1 = date(orderdate, "DMY", 2020)
format date1 %td
local max = _N
gen wanted2 = .
foreach x of numlist 1/`max'{
capture drop get get_sum
gen get = actualsoldprice >= actualsoldprice[`x']
rangestat (sum) get, interval(date1 -90 -1)
replace wanted2 = get_sum if _n == `x'
}
replace wanted2 = . if sale_at_or_above_refprice == .
*Start by converting date to Stata date
gen stata_date = date(orderdate,"DM20Y")
format stata_date %td
*Sort data and product code as stop conditions in while loop expect them to be sorted
sort productcode stata_date
*Create variable to store result
gen count_less = .
*Loop over all rows
count
forvalue row = 1/`r(N)' {
*Only applicable to rows flagged as sale at or above the reference price
if sale_at_or_above_refprice[`row'] == 1 {
*Set result variable to 0 for this row
replace count_less = 0 if _n == `row'
*Initiate locals used in while loop
local true = 1
local row_skip = 1
local count = 0
local last_date = stata_date[`row']
*Loop until any stop condition sets local true to 0
while `true' == 1 {
*Test if row_skip hits the top of the data set (i.e. row 0)
if `row'-`row_skip' == 0 local true = 0
*Test that product is same in compare row
else if productcode[`row'] != productcode[`row'-`row_skip'] local true = 0
*Test that previous order is within 90 days
else if stata_date[`row'] - stata_date[`row'-`row_skip'] > 90 local true = 0
*Test if the prior actualsoldprice meets or exceeds this row's websiteprice
else if websiteprice[`row'] <= actualsoldprice[`row'-`row_skip'] {
* Each date can only be counted once, so test if date is last date counted
if `last_date' != stata_date[`row'-`row_skip'] {
*Compare row fits condition, add 1 to counter
local count = `count' + 1
*Update last counted date
local last_date = stata_date[`row'-`row_skip']
}
}
*Skip one more previous row
local row_skip = `row_skip' + 1
}
*Add the count result to the result variable for this row
replace count_less = `count' if _n == `row'
}
}
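For reference, here is a hedged, untested tweak (my addition) of the rangestat attempt quoted in the question: it restricts comparisons to the same product via by() and compares prior sold prices against the current row's reference price. rangestat is a user-written command from SSC (ssc install rangestat). Unlike the loop above, this does not enforce the count-each-date-once rule, and it will be slow on large datasets since rangestat runs once per row.
gen date1 = date(orderdate, "DMY", 2020)
format date1 %td
gen wanted2 = .
local max = _N
forvalues x = 1/`max' {
    capture drop get get_sum
    *Compare prior sold prices against the current row's reference price
    gen get = actualsoldprice >= websiteprice[`x']
    rangestat (sum) get, interval(date1 -90 -1) by(productcode)
    replace wanted2 = get_sum if _n == `x'
}
replace wanted2 = . if sale_at_or_above_refprice == .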
I have a text file in which some lines aren't formatted the same way as the others. To be more precise: in this type of line, fields are written on 9 characters, whereas on the other lines fields are 7 characters long. The format of such a (misformatted) line is:
1st field on 40 characters, then 4 spaces, then 1 character (a field named note, for example the letter 'a', a space ' ' or the letter 'b'), then 4 spaces. The sequence 4 spaces + 1 character + 4 spaces (example below) is repeated a number of times:
a
Such a line would for example be:
a a a a a c b a a a a a
Because of these lines, I get an offset between lines:
26 26 26 26 26 26 26 26 26 26 26
a a a a a c b a a a a
I would like to get rid of the 2 unwanted spaces, so that on such a line each field becomes 7 characters wide, as on the others:
1st field on 40 characters, then 3 spaces, then 1 character (the note field, for example the letter 'a'), then 3 spaces.
I can "find" such a line using the regex find in Notepad++:
Find what : ^ {40}(( {4})(?<note>.)( {4}))+
But how do I replace all occurrences of 4 spaces + 1 character + 4 spaces with 3 spaces + 1 character + 3 spaces, so that the fields on every line have the same length?
The desired format, after substitution, would for example be:
26 26 26 26 26 26 26 26 26 26 26
a a a a a c b a a a a
Here is a larger extract of the offset file:
20/03/2018
H0917 26_LAV 0 Semaine 2 En service le 04 Septembre 2017 - TAD Partiel 11/07/2017 16:09 H0917
Vertical Notes
Montferrier-sur-Lez - Cirad de Baillargu Montferrier-sur-Lez - Cirad de Baillarguet
Montpellier Occitanie Montpellier Occitanie
26 Occitanie - Montferrier-sur-Lez 26 Montferrier-sur-Lez - Cirad de BaillarguMontpellier Occitanie Montferrier-sur-Lez - Cirad de Baillarguet
########## VOYAGES
0 Semaine Montferrier-sur-Lez - Cirad de Baillarguet
26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26
a a a a a c b a a a a a a a a a a
Occitanie 6:50 7:20 7:50 8:20 8:50 9:30 10:10 10:40 11:10 11:40 12:10 12:10 12:40 13:10 13:40 14:10 14:40 15:10 15:40 16:10 16:40 17:10 17:40 18:10 18:40 19:10 19:40 20:10 20:40
Lycée Frédéric Bazille 6:58 7:28 7:58 8:28 8:58 9:37 10:17 10:47 11:17 11:47 12:17 12:17 12:47 13:17 13:47 14:18 14:48 15:18 15:48 16:18 16:49 17:19 17:49 18:19 18:49 19:18 19:47 20:17 20:47
La Lironde 7:02 7:32 8:02 8:35 9:05 9:42 10:22 10:52 11:22 11:52 12:23 12:23 12:52 13:22 13:52 14:23 14:53 15:23 15:53 16:23 16:56 17:26 17:56 18:26 18:56 19:23 19:52 20:22 20:52
Chemin Neuf 7:07 7:37 8:07 8:40 9:10 9:47 10:27 10:57 11:27 11:57 12:28 12:28 12:57 13:27 13:57 14:28 14:58 15:28 15:58 16:28 17:01 17:31 18:01 18:31 19:01 19:28 19:57 20:27 20:57
La Grand Font 7:08 7:38 8:08 8:41 9:11 9:48 10:28 10:58 11:28 11:58 12:29 12:29 12:58 13:28 13:58 14:29 14:59 15:29 15:59 16:29 17:02 17:32 18:02 18:32 19:02 19:29 19:58 20:28 20:58
Picadou 7:09 7:39 8:09 8:42 9:12 9:49 10:29 10:59 11:29 11:59 12:30 12:30 12:59 13:29 13:59 14:30 15:00 15:30 16:00 16:30 17:03 17:33 18:03 18:33 19:03 19:30 19:59 20:29 20:59
Distillerie 7:11 7:41 8:11 8:44 9:14 9:51 10:31 11:01 11:31 12:01 12:32 12:32 13:01 13:31 14:01 14:32 15:02 15:32 16:02 16:32 17:05 17:35 18:05 18:35 19:05 19:32 20:01 20:31 21:01
Cirad de Baillarguet 7:14 7:44 8:14 8:47 9:17 9:54 10:34 11:04 11:34 12:04 12:35 12:35 13:04 13:34 14:04 14:35 15:05 15:35 16:05 16:35 17:08 17:38 18:08 18:38 19:08 19:35 20:04 20:34 21:04
Correct file should be:
20/03/2018
H0917 26_LAV 0 Semaine 2 En service le 04 Septembre 2017 - TAD Partiel 11/07/2017 16:09 H0917
Vertical Notes
Montferrier-sur-Lez - Cirad de Baillargu Montferrier-sur-Lez - Cirad de Baillarguet
Montpellier Occitanie Montpellier Occitanie
26 Occitanie - Montferrier-sur-Lez 26 Montferrier-sur-Lez - Cirad de BaillarguMontpellier Occitanie Montferrier-sur-Lez - Cirad de Baillarguet
########## VOYAGES
0 Semaine Montferrier-sur-Lez - Cirad de Baillarguet
26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26 26
a a a a a c b a a a a a a a a a a
Occitanie 6:50 7:20 7:50 8:20 8:50 9:30 10:10 10:40 11:10 11:40 12:10 12:10 12:40 13:10 13:40 14:10 14:40 15:10 15:40 16:10 16:40 17:10 17:40 18:10 18:40 19:10 19:40 20:10 20:40
Lycée Frédéric Bazille 6:58 7:28 7:58 8:28 8:58 9:37 10:17 10:47 11:17 11:47 12:17 12:17 12:47 13:17 13:47 14:18 14:48 15:18 15:48 16:18 16:49 17:19 17:49 18:19 18:49 19:18 19:47 20:17 20:47
La Lironde 7:02 7:32 8:02 8:35 9:05 9:42 10:22 10:52 11:22 11:52 12:23 12:23 12:52 13:22 13:52 14:23 14:53 15:23 15:53 16:23 16:56 17:26 17:56 18:26 18:56 19:23 19:52 20:22 20:52
Chemin Neuf 7:07 7:37 8:07 8:40 9:10 9:47 10:27 10:57 11:27 11:57 12:28 12:28 12:57 13:27 13:57 14:28 14:58 15:28 15:58 16:28 17:01 17:31 18:01 18:31 19:01 19:28 19:57 20:27 20:57
La Grand Font 7:08 7:38 8:08 8:41 9:11 9:48 10:28 10:58 11:28 11:58 12:29 12:29 12:58 13:28 13:58 14:29 14:59 15:29 15:59 16:29 17:02 17:32 18:02 18:32 19:02 19:29 19:58 20:28 20:58
Picadou 7:09 7:39 8:09 8:42 9:12 9:49 10:29 10:59 11:29 11:59 12:30 12:30 12:59 13:29 13:59 14:30 15:00 15:30 16:00 16:30 17:03 17:33 18:03 18:33 19:03 19:30 19:59 20:29 20:59
Distillerie 7:11 7:41 8:11 8:44 9:14 9:51 10:31 11:01 11:31 12:01 12:32 12:32 13:01 13:31 14:01 14:32 15:02 15:32 16:02 16:32 17:05 17:35 18:05 18:35 19:05 19:32 20:01 20:31 21:01
Cirad de Baillarguet 7:14 7:44 8:14 8:47 9:17 9:54 10:34 11:04 11:34 12:04 12:35 12:35 13:04 13:34 14:04 14:35 15:05 15:35 16:05 16:35 17:08 17:38 18:08 18:38 19:08 19:35 20:04 20:34 21:04
Thanks for helping,
Julien
Your question is a bit unclear, but this got things aligned for me:
Find what: ([^ ]) (with one literal space before the opening parenthesis and one after the closing one)
Replace with: \1
This removes one space before and one after each single non-space character. It ignores the 26s, since those are two non-space characters, so Replace All works.
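If you prefer to anchor the fix to the 40-character first field, as in your own Find pattern, here is a hedged, untested alternative using Boost's \G anchor (which Notepad++'s regex engine supports); it also handles a space as the note character:
Find what: (?:^(.{40})|\G) {4}(.) {4}
Replace with: $1   $2    (that is: $1, three spaces, $2, three spaces)
The first match on a line must follow the 40-character field; every later match continues exactly where the previous one ended, so each 4+1+4 group on the line is narrowed to 3+1+3.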
I have these two files, and I would like to find the matching patterns in the second file based on the entries from the first file, then print out the matched results.
Here is the first file:
switch1 Ethernet2 4 4 5
switch1 Ethernet4 4 4 5
switch1 Ethernet7 4 4 5
switch1 Ethernet9 4 4 5
switch1 Ethernet10 4 4 5
switch1 Ethernet13 1 4 5
switch1 Ethernet14 1 4 5
switch1 Ethernet15 1 4 5
switch2 Ethernet5 4 4 5
switch2 Ethernet6 1 4 5
switch2 Ethernet7 4 5
switch2 Ethernet8 1 4 5
switch1 Ethernet1 2002 2697 2523
switch1 Ethernet17 2002 2576 2515
switch1 Ethernet11 2002 2621 2617
switch2 Ethernet1 2001 2512 2577
And here is the second file:
switch1 Ethernet1 1 4 5 40 2002 2523 2590 2621 2656 2661 2665 2684 2697 2999
switch1 Ethernet2 1 4 5 40 2002 2504 2505 2508 2514 2516 2517 2531 2533 2535 2544 2545 2547 2549 2555 2566 2571 2575 2589 2590 2597 2604 2611 2626 2629 2649 2666 2672 2684 2691 2695
switch1 Ethernet3 40
switch1 Ethernet4 1 4 5 40 2002 2504 2516 2517 2535 2547 2549 2571 2579 2590 2597 2604 2605 2620 2624 2629 2649 2684 2695
switch1 Ethernet5 1 4 5 40 55 56 2000 2002 2010 2037 2128 2401 2409 2504 2526 2531 2540 2541 2548 2562 2575 2578 2579 2588 2590 2597 2604 2606 2608 2612 2615 2616 2621 2638 2640 2645 2650 2666 2667 2669 2670 2674 2678 2684 2690 2696
switch1 Ethernet6 40
switch1 Ethernet7 1 4 5 40 2037 2128 2174 2401 2409 2463 2526 2535 2540 2541 2544 2562 2578 2579 2590 2616 2621 2625 2631 2645 2659 2667 2670 2674 2678 2682 2684 2690 2696
switch1 Ethernet8 1 4 5 40 2037 2128 2396 2401 2409 2420 2531 2619 2640 2653 2658 2669 2677 2683 2684
switch1 Ethernet9 1 4 5 40 2128 2169 2396 2401 2409 2420 2504 2515 2531 2553 2578 2597 2619 2621 2640 2658 2669 2677 2683 2684 2694
switch1 Ethernet10 1 4 5 40 2079 2128 2169 2378 2396 2453 2509 2578 2591 2597 2621 2634 2641 2657
switch1 Ethernet11 1 4 5 40 2002 2128 2169 2396 2453 2509 2512 2520 2526 2549 2552 2564 2571 2575 2589 2591 2597 2611 2617 2621 2634 2641 2657 2671 2676 2686 2694
switch1 Ethernet12 1 4 5 40 2079 2378 2396 2453 2515 2531 2553 2597 2619 2621 2640 2657 2669 2677 2684 2694
switch1 Ethernet13 1 4 5 40 2174 2396 2453 2463 2508 2524 2531 2536 2546 2567 2597 2629 2640 2657 2669 2674 2684
switch1 Ethernet14 1 4 5 40 2524 2536 2544 2567 2575 2582 2628 2640 2659 2674 2681 2689
switch1 Ethernet15 1 4 5 40 2515 2553 2575 2582 2621 2628 2640 2681 2689 2694
switch1 Ethernet16 1 4 5 40 2513 2539 2544 2553 2561 2573 2575 2582 2619 2640 2670 2681
switch1 Ethernet17 1 4 5 40 2002 2508 2513 2515 2531 2538 2539 2544 2547 2553 2561 2570 2573 2575 2576 2582 2586 2601 2608 2619 2621 2640 2658 2670 2681
The results should display:
switch1 Ethernet13 1 4 5
switch1 Ethernet14 1 4 5
switch1 Ethernet15 1 4 5
switch1 Ethernet8 1 4 5
switch1 Ethernet16 1 4 5
switch1 Ethernet1 2002 2697 2523
switch1 Ethernet17 2002 2576 2515
The challenging part for me is that I don't know how to compare a pattern from one line of the first file to all the lines of the second file. In this case there are 5 fields to match per line, and the second file doesn't have consistent fields to work with: some lines are shorter than others, and the matching fields may not be in the same positions as on other lines.
Any idea or suggestion would be greatly appreciated.
Thanks much in advance.
Something like this should work, but there's probably a better way:
#!/bin/bash
patternfile='file1'
otherfile='file2'
exec 4<"$patternfile"
while read -u4 pattern; do
    # turn each whitespace gap into " \+.* " so other values may appear in between
    grep ".*$(echo "$pattern" | sed 's/\s\+/ \\+.* /g').*" "$otherfile"
done
P.S. This doesn't output all of the lines you said you wanted:
Ethernet8 is switch2 in one file and switch1 in the other. Fixing that in regex would be really ugly, but you could use awk to get only the records you want (see the sketch below).
Ethernet16 isn't even in the first file; I think that line is a mistake.
Ethernet1 and Ethernet17 have the values out of order, which my solution can't deal with.
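Here's a hedged awk sketch of that idea (my addition, assuming the same file1/file2 names): it requires the switch and interface to match exactly, then checks that every value from the file1 line appears somewhere on the file2 line, so the out-of-order values on Ethernet1 and Ethernet17 are handled too.
awk '
    # first pass: store each file1 line
    NR == FNR { pats[NR] = $0; n = NR; next }
    # second pass: compare each file2 line against every stored pattern
    {
        for (i = 1; i <= n; i++) {
            m = split(pats[i], f, " ")
            if (f[1] != $1 || f[2] != $2) continue   # switch and port must match
            ok = 1
            for (j = 3; j <= m; j++) {               # every value must appear
                found = 0
                for (k = 3; k <= NF; k++) if ($k == f[j]) { found = 1; break }
                if (!found) { ok = 0; break }
            }
            if (ok) print pats[i]
        }
    }' file1 file2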