Use of Lag /Lead function - sas

Please refer to the sample data. I have Month, Region and Values in my data set. I need an Output column as shown below. Basically, I need Values shifted ahead by one Month. Kindly help.
Month Region Values Output
1 R1 2 3
1 R2 4 5
2 R1 3 4
2 R2 5 7
3 R1 4 6
3 R2 7 5
4 R1 6
4 R2 5
Thanks,
Gauraw

If I understood correctly, you want OUTCOME to hold the value from the next month within each region. If so, you can use two SET statements, the second of which reads the same dataset shifted by one record (FIRSTOBS=2).
proc sort data=yourdata; by region month; run;
data result;
set yourdata;
by region;
/* look-ahead read: skipped once the shifted copy has been exhausted */
if not eof then set yourdata(firstobs=2 keep=values rename=(values=outcome)) end=eof;
if LAST.region then call missing(outcome);
run;
The second SET has to be conditional (IF NOT EOF), because otherwise we would lose the last record of the dataset: the shifted copy reaches its end one record earlier, and the DATA step stops as soon as any SET statement tries to read past the end of its input.
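Since the title asks about LAG/LEAD: SAS has a LAG function but no LEAD function in the DATA step, so another way to sketch the same idea is to reverse the month order within each region and use LAG. The dataset names DESC and LEAD below are made up for illustration, and the sample data is copied from the question.

```sas
/* Sample data copied from the question */
data yourdata;
  input month region $ values;
  datalines;
1 R1 2
1 R2 4
2 R1 3
2 R2 5
3 R1 4
3 R2 7
4 R1 6
4 R2 5
;
run;

/* Reverse the month order within each region (DESC is a hypothetical name) */
proc sort data=yourdata out=desc;
  by region descending month;
run;

data lead;
  set desc;
  by region;
  /* LAG runs on every row so that its queue stays in step with the data */
  outcome = lag(values);
  /* the latest month of a region has no following month */
  if first.region then outcome = .;
run;

/* Restore the original order */
proc sort data=lead;
  by month region;
run;
```

This costs two extra sorts, so the double-SET look-ahead is probably the better choice on large data.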

Related

Generate new variable in one dataset using observation from another dataset in SAS

I have two datasets, one with one observation and two variables. Other dataset with 10 observations, four variables.
Dataset 1
Final Result
X Fail
Dataset 2
A B C D Output
1 1 2 Pass
2 1 2 Pass
3 1 2 Pass
4 1 2 Fail
5 1 2 Pass
6 1 2 Fail
7 1 2 Pass
8 1 2 Fail
9 1 2 Pass
10 1 2 Pass
I would like to generate a fifth variable (output) in the second dataset depending on the value of the second variable in the first dataset.
If Result in first dataset equal to fail, generate a new variable output in the second dataset as fail. If Result in first dataset equal to pass, then generate a new variable output in the second dataset which will be equal to the value in column D of the second dataset.
Just use some simple IF/THEN logic. Since you know DATASET1 only has one observation then only read one observation from it.
data want;
if _n_=1 then set dataset1 ;
set dataset2 ;
length OUTPUT $4 ;
if upcase(RESULT)='FAIL' then OUTPUT=RESULT; /* comparison is case-sensitive and the data holds 'Fail' */
else OUTPUT=D ;
run;

SAS group by counters per variable - primary key creation

I have some data that needs to be split into 12 or so different groups. There is no key, and the order of the data is important.
The data contains a number of groups, and those groups contain single and/or nested groups. The data is in a hierarchical format, so each group will be split out; each "GROUP" has its own layout, and the pieces then need to be joined back together on one row (or many rows).
Sample data file:
"TRANS","23115168","","","OTVST","","23115168","","COMLT","","",20180216,"OAMI","501928",,
"MTPNT","UPDTE",2415799999,"","","17","","",,20180216,
"ASSET","","REPRT","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","REMVE","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","INSTL","METER","","CR","E6VG470","LPG",2017,"E6S06769699999","","","LI"
"METER","","U","S1",6.0000,"","",20180216,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00000"
"APPNT","",20180216,,"","123900",""
The hierarchy that should exist when data is input. I am thinking there could be several tables that can be joined together later. (numbers for illustration of parent child levels)
1. Transaction [TRANS]
1.1. Meter Point [MTPNT]
1.1.1. Asset [ASSET]
1.1.1.1. Meter [METER]
1.1.1.2. Converter [CONVE]
1.1.1.3. Register Details [REGST]
1.1.1.3.1. Reading [READG]
1.1.1.4. Market Participant [MKPRT]
1.1.1.5. Name [NAME]
1.1.1.5.1. Address [ADDRS]
1.1.1.5.2. Contact Mechanism [CONTM]
1.2. Appointment [APPNT]
1.3. Name [NAME]
1.3.1. Address [ADDRS]
1.3.2. Contact Mechanism [CONTM]
1.4. Market Participant [MKPRT]
This is gas industry data, so in this flow you can have many ASSET records per MTPNT, and each ASSET can have many REGST records, because that is where the meter reading (READG) is kept.
I have tried using BY groups and iterative FIRST. processing, but I have not worked with this type of data before. I need a way to create a key per grouping so that, once the groups are split up and the fields are defined, they can be joined back together.
I have tried manipulating the infile so that all the data appears on one line per TRANS, but then I still have the issue of applying the fields, and ordering is paramount.
I have managed to get a few keys for some of the groups, but after splitting they don't quite join back together.
data TRANS;
set mpancreate_a;
by DataItmGrp NOTSORTED;
if first.DataItmGrp then
do;
if DataItmGrp = "TRANS" then
TRANSKey+1;
end;
run;
data TRANS;
set TRANS;
TRANSKey2 + 1;
by DataItmGrp NOTSORTED;
if first.DataItmGrp then
do;
if DataItmGrp = "TRANS" then
TRANSKEY2=1;
end;
run;
data MTPNT;
set TRANS;
by DataItmGrp NOTSORTED;
if first.DataItmGrp then
do;
if DataItmGrp = "MTPNT" then
MTPNTKEY+1;
end;
run;
data MTPNT;
set MTPNT;
by MTPNTKEY NOTSORTED;
if first.MTPNTKEY and DataItmGrp = "MTPNT" then
MTPNTKEY2=0;
MTPNTKEY2+1;
run;
data ASSET;
set MTPNT;
IF MTPNTKEY = 0 THEN
MTPNTKEY2=0;
by DataItmGrp NOTSORTED;
if first.DataItmGrp then
do;
if DataItmGrp = "ASSET" then
ASSETKEY+1;
end;
run;
data ASSET;
set ASSET;
by ASSETKEY NOTSORTED;
if first.ASSETKEY and DataItmGrp = "ASSET" then
ASSETKEY2=0;
ASSETKEY2+1;
IF ASSETKEY =0 THEN
ASSETKEY2=0;
run;
I want a counter for each group found, and a retained counter for that particular group, but I cannot work out how to get in and out of the groupings based on the hierarchy above.
I'm hoping that once I have these keys, I can split the data by group and then left join it back together.
_n_ TRANS TRANS2 MTPNT MTPNT2
TRANS 1 1 0 0 0
MTPNT 2 2 1 1 1
ASSET 3 3 1 2 1
METER 4 4 1 3 1
READG 5 5 1 4 1
MTPNT 6 6 1 1 2
ASSET 7 7 1 2 2
METER 8 8 1 3 2
READG 9 9 1 4 2
APPNT 10 10 1 5 2
TRANS 11 1 2 6 2
MTPNT 12 2 2 1 3
ASSET 13 3 2 2 3
METER 14 4 2 3 3
READG 15 5 2 4 3
MTPNT 16 6 2 1 4
ASSET 17 7 2 2 4
METER 18 8 2 3 4
READG 19 9 2 4 4
APPNT 20 10 2 5 4
Reading hierarchical data from a file that has no definitive markers is problematic. The best suggestion I have is to work out which salient values you want to extract and in what context you want to know them. For this problem, the simplest first approach would be a single monolithic table with categorical variables that capture the path descending to the salient value (the meter reading).
A more complex approach would let the first token in each line drive both the input for that line and the output table it belongs to. Since there are no landmarks giving absolute or relative position in the hierarchy (as with NAME and MKPRT, which appear at more than one level), there is no 100% reliable way to place them, and that can also affect the placement of items read from subsequent data lines.
Depending on the true complexity of the real-world data and how well it adheres to the rules, you may or may not miss some values.
Suppose there is the simpler goal of just getting the meter readings.
data want;
length tier level1-level6 $8 path $64 meterReadingString $8 dummy $1;
retain level1-level5 path;
attrib readingdate informat=yymmdd10. format=yymmdd10.;
infile cards dsd missover;
input @1 tier @; * held input - do not advance to the next line yet;
if tier="TRANS" then do;
level1 = tier;
call missing (of level2-level6);
path = catx("/", of level:);
end;
/* =: compares only the leading characters, so a tier still matches when a */
/* deeper path has been retained from an earlier branch (e.g. a second ASSET) */
if tier="MTPNT" and path=:"TRANS" then do;
level2 = tier;
call missing (of level3-level6);
path = catx("/", of level:);
end;
if tier="ASSET" and path=:"TRANS/MTPNT" then do;
level3 = tier;
call missing (of level4-level6);
path = catx("/", of level:);
end;
if tier="METER" and path=:"TRANS/MTPNT/ASSET" then do;
level4 = tier;
call missing (of level5-level6);
path = catx("/", of level:);
end;
if tier="REGST" and path=:"TRANS/MTPNT/ASSET/METER" then do;
level5 = tier;
call missing (of level6-level6);
path = catx("/", of level:);
end;
if tier="READG" and path=:"TRANS/MTPNT/ASSET/METER/REGST" then do;
level6 = tier;
path = catx("/", of level:);
input @1 tier readingdate dummy meterReadingString @; * reread line according to tier;
meterReading = input(meterReadingString, best12.);
if path = "TRANS/MTPNT/ASSET/METER/REGST/READG" then OUTPUT;
end;
datalines;
"TRANS","23115168","","","OTVST","","23115168","","COMLT","","",20180216,"OAMI","501928",,
"MTPNT","UPDTE",2415799999,"","","17","","",,20180216,
"ASSET","","REPRT","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","REMVE","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","INSTL","METER","","CR","E6VG470","LPG",2017,"E6S06769699999","","","LI"
"METER","","U","S1",6.0000,"","",20180216,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00000"
"APPNT","",20180216,,"","123900",""
run;
You can use this as the basis of a more complicated reader that has a different output <tier> data set for each tier or path to tier encountered. You would need a different input statement per tier, similar to how READG is read.
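On the question of keys, here is a minimal sketch of the counter idea, assuming that TRANS, MTPNT and ASSET are the only tiers that need their own counters. The names KEYED, TRANS_ID, MTPNT_ID and ASSET_ID are made up for illustration, and the data lines are shortened versions of the sample (first few fields only). Each counter is incremented when its tier is seen and reset to zero when a parent tier starts.

```sas
data keyed;
  length tier $8;
  infile datalines dsd truncover;
  input tier;
  /* variables in a sum statement (x + 1) are retained and start at 0 */
  select (tier);
    when ("TRANS") do; trans_id + 1; mtpnt_id = 0; asset_id = 0; end;
    when ("MTPNT") do; mtpnt_id + 1; asset_id = 0; end;
    when ("ASSET") do; asset_id + 1; end;
    otherwise;  /* child rows simply inherit the current keys */
  end;
  datalines;
"TRANS","23115168"
"MTPNT","UPDTE"
"ASSET","","REPRT"
"METER","","U"
"REGST","","METER"
"READG",20180216
"ASSET","","REMVE"
"METER","","U"
"REGST","","METER"
"READG",20180216
"APPNT",""
;
run;
```

Every row then carries the composite key (TRANS_ID, MTPNT_ID, ASSET_ID) of the branch it sits in, so the tiers can be split into separate tables and joined back on those columns. Note that a row such as APPNT, which hangs directly off TRANS, still inherits the last MTPNT_ID and ASSET_ID; only TRANS_ID is meaningful for it.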

use data step generate next observation

Case 1
Suppose the data are sorted by year then by month (always have 3 observations in data).
Year Month Index
2014 11 1.1
2014 12 1.5
2015 1 1.2
I need to copy the Index of last month to new observation
Year Month Index
2014 11 1.1
2014 12 1.5
2015 1 1.2
2015 2 1.2
Case 2
Year is removed from data. So we only have Month and Index.
Month Index
1 1.2
11 1.1
12 1.5
Data is always collected from consecutive 3 months. So 1 is the last month.
Still, ideal output is
Month Index
1 1.2
2 1.2
11 1.1
12 1.5
I solved it by creating another dataset that contains only Month (1, 2, ..., 12) and then right joining the original dataset to it twice, but I think there is a more elegant way to deal with this.
Case 1 can be a straight-forward data step. Add end=eof to the set statement to initialize a variable eof that returns value 1 when the data step is reading the last row of the data set. An output statement in the data step outputs a row during each iteration. If eof=1, a do block runs that increments the month by 1 and outputs another row.
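data want;
set have end=eof;
output;
if eof then do;
/* advance one month, wrapping December to January of the following year */
month=mod(month,12)+1;
if month=1 then year=year+1;
output;
end;
run;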
For case 2, I would switch to an sql solution. Self join the table to itself on month, incremented by 1 in the second table. Use the coalesce function to keep the values from the existing table if it exists. If not, use the values from the second table. Since a case crossing December-January will produce 5 months, limit the output to four rows using the outobs= option in proc sql to exclude the unwanted second January.
proc sql outobs=4;
create table want as
select
coalesce(t1.month,mod(t2.month,12)+1) as month,
coalesce(t1.index,t2.index) as index
from
have t1
full outer join have t2
on t1.month = t2.month+1
order by
coalesce(t1.month,t2.month+1)
;
quit;
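An alternative sketch for case 2 that avoids relying on OUTOBS=: the last month is the only one whose successor is not already in the data, so we can select just that row, shift it by one month, and append it. The table name EXTRA is made up for illustration, and HAVE is rebuilt here from the sample in the question.

```sas
data have;
  input month index;
  datalines;
1 1.2
11 1.1
12 1.5
;
run;

proc sql;
  /* EXTRA (hypothetical name): the one month whose successor is absent */
  create table extra as
  select mod(a.month, 12) + 1 as month, a.index
  from have as a
  where mod(a.month, 12) + 1 not in (select month from have);

  /* original rows plus the generated one, in month order */
  create table want as
  select month, index from have
  union all
  select month, index from extra
  order by month;
quit;
```

This assumes, as the question states, that the data always covers consecutive months, so exactly one row qualifies for EXTRA.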

Modifying data in SAS: copying part of the value of a cell, adding missing data and labeling it

I have three different questions about modifying a dataset in SAS. My data contains the day and the specific number belonging to the tag that was registered by an antenna on that day.
I have three separate questions:
1) The tag numbers are continuous and range from 1 to 560. Can I easily add the numbers within this range that were not registered on a specific day? So, if 160-280 were not registered on 23-May and 40-190 on 24-May, I would add these non-registered numbers only for that specific day. (In reality the non-registered numbers are much more scattered, and for a dataset spanning a few weeks that is too much to do by hand.)
2) Furthermore, I want to make a new variable saying whether a tag has been registered (1) or not (0). Would it work to create this variable and set it to 1, then add the missing numbers and (assuming the new variable is not set for the new numbers) set the missing values to 0?
3) The last question concerns the format of the registered numbers, which is along the lines of 528 000000000400 and 000 000000000054. I am only interested in the last three digits of the number and want to remove the others. If I could add the missing numbers, I could make a new variable after the data has been sorted by date and the original transponder code, but otherwise what would you suggest?
I would love some suggestions and thank you in advance.
I am inventing some data here, I hope I got your questions right.
data chickens;
do tag=1 to 560;
output;
end;
run;
data registered;
input date mmddyy8. antenna tag;
format date date7.;
datalines;
01012014 1 1
01012014 1 2
01012014 1 6
01012014 1 8
01022014 1 1
01022014 1 2
01022014 1 7
01022014 1 9
01012014 2 2
01012014 2 3
01012014 2 4
01012014 2 7
01022014 2 4
01022014 2 5
01022014 2 8
01022014 2 9
;
run;
proc sql;
create table dates as
select distinct date, antenna
from registered;
create table DatesChickens as
select date, antenna, tag
from dates, chickens
order by date, antenna, tag;
quit;
proc sort data=registered;
by date antenna tag;
run;
data registered;
merge registered(in=INR) DatesChickens;
by date antenna tag;
Registered=INR;
run;
data registeredNumbers;
input Numbers $16.;
datalines;
528 000000000400
000 000000000054
;
run;
data registeredNumbers;
set registeredNumbers;
NewNumbers=substr(Numbers,14);
run;
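If the three-digit code should end up as a numeric tag (so it can be merged with the numeric TAG variable used above), a small sketch along the same lines follows. The dataset name TAGS_NUMERIC is made up, and it assumes every code is left-aligned with the tag in its last three characters.

```sas
data tags_numeric;
  input Numbers $16.;
  /* take the last three characters and convert them to a number */
  tag = input(substr(Numbers, length(Numbers) - 2), 3.);
  datalines;
528 000000000400
000 000000000054
;
run;
```

With the two sample codes, TAG should come out as 400 and 54.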
I do not know SAS, but here is how I would do it in SQL - may give you an idea of how to start.
1 - Birds that have not registered through pophole that day
SELECT b.BirdId
FROM Birds b
WHERE NOT EXISTS
(SELECT 1 FROM Pophole_Visits p WHERE b.BirdId = p.BirdId AND p.date = ????)
2 - Birds registered through pophole
If you have a dataset with pophole data, you can query that to find out whether a bird has been through. What would your flag be doing: finding a bird that has never been through any popholes? Looking for dodgy sensor tags or dead birds?
3 - Data code
You might have more joy with the SUBSTRING function
Good luck

how to solve the problem of selecting multiple rows

I have the data in this format (it is just an example), with n=2:
X Y info
2 1 good
2 4 bad
3 2 good
4 1 bad
4 4 good
6 2 good
6 3 good
Now, the above data is sorted (7 rows in total). I need to make groups of 2, 3 or 4 rows separately and generate a graph. In the above data, I made groups of 2 rows. The third row is left alone, as there is no other row with the same X value to form a group with. A group can be formed only from rows with the same X value, NOT with other rows.
Next, I check whether both rows have “good” in the info column. If both rows have “good”, the group is good; otherwise it is bad. In the above example, the third (last) group is a “good” group; all the rest are bad groups. Once I am done with all the rows, I calculate the total number of good groups divided by the total number of groups.
In the above example, the output will be: Total no. of good groups/Total no. of groups => 1/3.
This is the case of n=2(size of group)
Now, for n=3 we make groups of 3 rows, and for n=4 we make groups of 4 rows, and find the good/bad groups in a similar way. If all the rows in a group are “good”, the group is good; otherwise it is bad.
Example: n= 3
2 1 good
2 4 bad
2 6 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
In the above case, I left the 4th row and last 2 rows as I can’t make group of 3 rows with them. The first group result is “bad” and last group result is “good”.
Output: 1/ 2
For n= 4:
2 1 good
2 4 good
2 6 good
2 7 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
6 4 good
6 5 good
In this case, I make groups of 4 and find the result. The 5th, 6th, 7th and 8th rows are left behind (ignored). I made 2 groups of 4 rows, and both are “good” groups.
Output: 2/2
So, after getting the 3 output values for n=2, n=3 and n=4, I will plot a graph of these values.
Below is code that I think does what you are looking for. It assumes that the data you described is stored separately in three datasets named data_2, data_3, and data_4. Each of these datasets is processed by the %FIND_GOOD_GROUPS macro, which determines which groups of X have all "GOOD" values in INFO; this summary information is then appended as a new row to the BASE dataset. I didn't add the code, but you could calculate the ratio of GOOD_COUNT to _FREQ_ in a separate data step, then use a procedure to plot the N value against the ratio. Hope this gets close to what you're trying to accomplish.
%******************************************************************************;
%macro main;
%find_good_groups(dsn=data_2, n=2);
%find_good_groups(dsn=data_3, n=3);
%find_good_groups(dsn=data_4, n=4);
proc print data=base uniform noobs;
run;
%mend main;
%******************************************************************************;
%******************************************************************************;
%macro find_good_groups(dsn=,n=);
%***************************************************************************;
%* Sort data by X and Y so that you can use FIRST.X variable in Data step. *;
%***************************************************************************;
proc sort data=&dsn;
by x y;
run;
%***************************************************************************;
%* TEMP dataset uses the FIRST.X variable to reset COUNT and GOOD_COUNT to *;
%* initial values for each row where X changes. Each row in the X groups *;
%* adds 1 to COUNT and sets GOOD_COUNT to 0 (zero) if INFO is ever "BAD". *;
%* A record is output if COUNT is equal to the macro parameter &N. *;
%***************************************************************************;
data temp;
keep good_count n;
retain count 0 good_count 1 n &n;
set &dsn;
by x y;
if first.x then do;
count = 0;
good_count = 1;
end;
count = count + 1;
if good_count eq 1 then do;
if trim(left(upcase(info))) eq "BAD" then do;
good_count = 0;
end;
end;
if count eq &n then output;
run;
%***************************************************************************;
%* Summarize the TEMP data to find the number of times that all of the *;
%* rows had "GOOD" in the INFO column for each value of X. *;
%***************************************************************************;
proc summary data=temp;
id n;
var good_count;
output out=n_&n (drop=_type_) sum=;
run;
%***************************************************************************;
%* Append to BASE dataset to retain the sums and frequencies from all of *;
%* the datasets. BASE can be used to plot the N / number of Good records. *;
%***************************************************************************;
proc append data=n_&n base=base force; run;
%mend find_good_groups;
%******************************************************************************;
%main
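The ratio and plot step that was left out above could look roughly like this. It is only a sketch: BASE_DEMO is a made-up stand-in for BASE, filled with the counts the question gives for its three examples (1 of 3, 1 of 2, 2 of 2), and it assumes the column layout that PROC SUMMARY produces (N, GOOD_COUNT, _FREQ_).

```sas
/* Stand-in for BASE, using the counts from the examples in the question */
data base_demo;
  input n good_count _freq_;
  /* share of groups in which every row was "good" */
  ratio = good_count / _freq_;
  datalines;
2 1 3
3 1 2
4 2 2
;
run;

proc sgplot data=base_demo;
  series x=n y=ratio / markers;
  xaxis values=(2 3 4);
run;
```

With the real BASE dataset you would replace the DATALINES step by a SET BASE statement and keep the RATIO assignment.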