Selecting one obs from multiple ones based on some conditions - sas

I have a dataset in which some accounts have multiple rows.
I would like to keep one row per account based on some filtering conditions.
The dataset is
Account Model Class Value
1 A 0 1.0
1 B 0 1.0
2 B 0 0.5
3 A 1 0.5
3 A 0 1.0
I would like to keep one row per account based on these conditions:
If an account has only one row, keep it, regardless of Class and Model.
If an account has multiple rows, keep the row with Model A and Class = 1 if there is one; otherwise keep the row with Model A, whatever its Class.
The expected output would be
Account Model Class Value
1 A 0 1.0
2 B 0 0.5
3 A 1 0.5
The code is implemented in SAS.
My idea is to flag the accounts that have multiple rows and keep only the rows that satisfy the above conditions.
However, I do not know how to select accounts having only one row.
I think I could look only at the maximum Class, as that would select a row regardless of Model (Class is generally highest for Model A).

Assuming Class has a maximum value of 1, you can just sort your data according to your conditions and output the first one.
data have;
infile datalines;
input Account Model $ Class Value;
datalines4;;;;
1 A 0 1.0
1 B 0 1.0
2 B 0 0.5
3 A 1 0.5
3 A 0 1.0
;;;;
run;
proc sort out=stage1 data=have; by account model descending class;run;
data want;
set stage1;
by account model descending class;
if first.account then output;
run;
"However I do not know how to select accounts having only one row."
You can use the FIRST. and LAST. data step variables to select accounts having only one occurrence: if first.account and last.account are both true, there is only one observation in the BY group.
FIRST.variable is set to 1 for the first observation in a BY group and to 0 for all other observations in the group; LAST.variable does the same for the last observation. Keep in mind that in your specific case the data need to be sorted by account before the BY statement can create these variables.
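As a minimal sketch of that test, the step below splits the question's sample data into accounts with exactly one row and accounts with several. The dataset names sorted, singles and multiples are made up for the example.

```sas
/* Sample data from the question */
data have;
  input Account Model $ Class Value;
  datalines;
1 A 0 1.0
1 B 0 1.0
2 B 0 0.5
3 A 1 0.5
3 A 0 1.0
;
run;

/* BY-group processing needs the data sorted by the BY variable */
proc sort data=have out=sorted;
  by Account;
run;

data singles multiples;
  set sorted;
  by Account;
  /* First and last at once: the BY group holds exactly one row */
  if first.Account and last.Account then output singles;
  else output multiples;
run;
```

With the sample data, only account 2 lands in singles; the rows for accounts 1 and 3 go to multiples, where the Model and Class rules can then be applied.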


Informatica : Count the number of rows based on sequence of number

I have a table whose source has one column, as shown below. For example, the column name is A and the source contains this set of records.
A
1
1
1
2
2
3
I want to populate two columns in the target, say A and B.
Column A in the target has the same values as in the source, and column B holds a running count within each value of A:
A B
1 1
1 2
1 3
2 1
2 2
3 1
Can someone please explain how I can achieve this?
Thanks in advance.
If the source is a DBMS like Oracle, you can use a Source Qualifier SQL override like the one below. Use row_number() with partition by to generate a sequence for every value of A.
Select
A, row_number() over(partition by A order by A) as B
From mytable
If you're looking for an Informatica-only solution, this is how you can do it.
Sort the data by column A.
Use an Expression transformation with one input/output port, two variable ports, and one output port.
We compare the current value with the previous value: if they are the same, add 1 to the sequence; otherwise start from 1 again.
A in/out
v_B = iif (A=prev_A, v_B +1, 1)
prev_A=A
o_B =v_B
Link A and o_B to the target.
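For readers more at home in SAS, the same compare-with-previous-row counter can be sketched in a data step. This is only an analogue of the logic, not Informatica code, and the dataset names source and target are made up for the example.

```sas
/* Source column from the question */
data source;
  input A;
  datalines;
1
1
1
2
2
3
;
run;

data target;
  set source;              /* must already be sorted by A */
  by A;
  if first.A then B = 0;   /* new value of A: restart the counter */
  B + 1;                   /* sum statement: B is retained across rows */
run;
```

For the six source rows this gives B = 1, 2, 3, 1, 2, 1, matching the target shown in the question.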

Merging Tables Correctly in SAS

Hi, I am trying to merge two tables: the FormA scores table that I built, now called CalculatingScores, and DomainsFormA, which holds the domain number. I need to merge them by QuestionNum. Here is my code.
proc sql;
create table combined as
select *
from CalculatingScores inner join DomainsFormA
on CalculatingScores.Scores=DomainsFormA.QuestionNum;
quit;
proc print data=combined (obs=15);
run;
This is what I want the merged table to look like, but for 15 observations:
Form Student QuestionNum Scores DomainNum
A 1 1 0 5
A 1 2 1 4
A 1 3 0 5
But my table looks more like this:
Form Student QuestionNum Scores DomainNum
A 1 2 1 5
A 1 4 1 5
A 1 5 1 5
The entire Scores column for these 15 observations has a value of 1, and the DomainNum column only has values of 5. The Student and Form columns are correct, but I need varied scores and varied domain numbers. Any ideas for how to solve my problem? Maybe I need an ORDER BY clause?
You appear to be joining on the wrong columns.
You coded
on CalculatingScores.Scores=DomainsFormA.QuestionNum
which joins a score to a question number.
Perhaps you should be coding
on CalculatingScores.QuestionNum=DomainsFormA.QuestionNum
                     ^^^^^^^^^^^              ^^^^^^^^^^^
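Put together, the corrected join might look like the sketch below. The real tables are not shown in the question, so the two small data steps are hypothetical stand-ins built from the three expected rows, and the assumption is that Form, Student, QuestionNum and Scores live in CalculatingScores while DomainNum lives in DomainsFormA.

```sas
/* Hypothetical stand-ins for the real tables */
data CalculatingScores;
  input Form $ Student QuestionNum Scores;
  datalines;
A 1 1 0
A 1 2 1
A 1 3 0
;
run;

data DomainsFormA;
  input QuestionNum DomainNum;
  datalines;
1 5
2 4
3 5
;
run;

proc sql;
  create table combined as
  select s.Form, s.Student, s.QuestionNum, s.Scores, d.DomainNum
  from CalculatingScores as s
       inner join DomainsFormA as d
       on s.QuestionNum = d.QuestionNum   /* question number on both sides */
  order by s.Student, s.QuestionNum;
quit;

proc print data=combined (obs=15);
run;
```

Listing the columns explicitly instead of using select * also avoids the warning SAS issues when QuestionNum arrives from both tables. The ORDER BY only fixes the row order; it is the join condition that fixes the values.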

What is this program doing exactly? (SAS)

I was confused by the following SAS code. The SAS data set named WORK.SALARY contains 10 observations for each department and is currently ordered by Department. The following SAS program is submitted:
data WORK.TOTAL;
set WORK.SALARY(keep=Department MonthlyWageRate);
by Department;
if First.Department=1 then Payroll=0;
Payroll+(MonthlyWageRate*12);
if Last.Department=1;
run;
So, what exactly is First.Department and Last.Department? Many thanks for your time and attention.
Your data step calculates the total PAYROLL for each DEPARTMENT.
The FIRST. and LAST. variables are generated automatically when you use a BY statement. They are true when the current observation is the first (or last) observation in the BY group. See "How the DATA Step Identifies BY Groups" in the SAS documentation.
The sum statement (Syntax: var+expression;) for PAYROLL means that the value of PAYROLL is retained (or carried over) to the next observation.
The IF/THEN statement initializes the value to zero when a new group starts.
The subsetting IF statement will make sure that only the final observation for each department is output.
As explained, it is calculating payroll for each department.
First.department is set to 1 when the first record for a particular department is read. Last.department is set to 1 when the last record for the department is read.
So if you have :
Department Wage
1 100
1 200
1 300
2 1000
2 2000
2 3000
With the first. and last. values assigned, it will look like this:
Department Wage first.department last.department
1 100 1 0
1 200 0 0
1 300 0 1
2 1000 1 0
2 2000 0 0
2 3000 0 1
Now you can follow your logic as to what happens when first.department = 1.
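By the way, note that if Last.Department=1; has no THEN clause. It is a subsetting IF: only the last observation of each department, which by then holds the department's full total, is written to WORK.TOTAL.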
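To see the flags for yourself, here is a small sketch using the six sample rows above. FIRST. and LAST. are temporary variables and are not written to the output data set, so the step copies them into ordinary variables; the names salary, flags, first_dept and last_dept are made up for the example. The subsetting IF is left out on purpose so that every row stays visible.

```sas
data salary;
  input Department MonthlyWageRate;
  datalines;
1 100
1 200
1 300
2 1000
2 2000
2 3000
;
run;

data flags;
  set salary;
  by Department;
  first_dept = first.Department;         /* copy the temporary flags */
  last_dept  = last.Department;
  if first.Department then Payroll = 0;  /* reset at the start of each group */
  Payroll + (MonthlyWageRate * 12);      /* running annual total */
run;

proc print data=flags noobs;
run;
```

Payroll runs 1200, 3600, 7200 for department 1 and 12000, 36000, 72000 for department 2. Adding if last.Department; back in would keep only the 7200 and 72000 rows.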

Modifying data in SAS: copying part of the value of a cell, adding missing data and labeling it

I have three different questions about modifying a dataset in SAS. My data contain the day and the specific number belonging to the tag that was registered by an antenna on that day.
I have three separate questions:
1) The tag numbers are continuous and range from 1 to 560. Can I easily add the numbers within this range that were not registered on a specific day? For example, if 160-280 are not registered on 23-May and 40-190 on 24-May, I want to add these non-registered numbers only for that specific day. (The non-registered numbers are much more scattered, and for a dataset encompassing a few weeks there are too many to do by hand.)
2) Furthermore, I want to make a new variable saying whether a tag has been registered (1) or not (0). Would it work to create this variable and set it to 1, then add the missing tag numbers and (assuming the new variable is not set for the new numbers) set the missing values to 0?
3) The last question concerns the format of the registered numbers, which is along the lines of 528 000000000400 and 000 000000000054. I am only interested in the last three digits of the number and want to remove the others. If I could add the missing numbers, I could make a new variable after the data have been sorted by date and the original transponder code, but otherwise what would you suggest?
I would love some suggestions and thank you in advance.
I am inventing some data here; I hope I got your questions right.
data chickens;
do tag=1 to 560;
output;
end;
run;
data registered;
input date mmddyy8. antenna tag;
format date date7.;
datalines;
01012014 1 1
01012014 1 2
01012014 1 6
01012014 1 8
01022014 1 1
01022014 1 2
01022014 1 7
01022014 1 9
01012014 2 2
01012014 2 3
01012014 2 4
01012014 2 7
01022014 2 4
01022014 2 5
01022014 2 8
01022014 2 9
;
run;
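proc sql;
/* every date/antenna combination that appears in the data */
create table dates as
select distinct date, antenna
from registered;
/* cross join: all 560 tags for each date/antenna combination */
create table DatesChickens as
select date, antenna, tag
from dates, chickens
order by date, antenna, tag;
quit;
proc sort data=registered;
by date antenna tag;
run;
/* Registered = 1 if the tag was actually read, 0 if the row was added */
data registered;
merge registered(in=INR) DatesChickens;
by date antenna tag;
Registered=INR;
run;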
data registeredNumbers;
input Numbers $16.;
datalines;
528 000000000400
000 000000000054
;
run;
data registeredNumbers;
set registeredNumbers;
NewNumbers=substr(Numbers,14);
run;
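NewNumbers above is still a character value such as "054". If the three digits need to line up with a numeric tag variable running from 1 to 560, a possible variant converts them while reading. This is a sketch that assumes every code is exactly 16 characters wide, as in the two examples; the variable name tag is chosen here to match the other datasets.

```sas
data registeredNumbers;
  input Numbers $16.;
  /* characters 14-16 are the last three digits; the 3. informat makes them numeric */
  tag = input(substr(Numbers, 14, 3), 3.);
  datalines;
528 000000000400
000 000000000054
;
run;
```

The two sample codes give tag = 400 and tag = 54.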
I do not know SAS, but here is how I would do it in SQL; it may give you an idea of how to start.
1 - Birds that have not registered through pophole that day
SELECT b.BirdId
FROM Birds b
WHERE NOT EXISTS
(SELECT 1 FROM Pophole_Visits p WHERE b.BirdId = p.BirdId AND p.date = ????)
2 - Birds registered through pophole
If you have a dataset with pophole data, you can query that to find whether a bird has been through. What would your flag be doing: finding a bird that has never been through any popholes? Looking for dodgy sensor tags or dead birds?
3 - Data code
You might have more joy with the SUBSTRING function
Good luck

Dynamic Rolling Window in SAS for correlation calculation

Problem: I have a data set as below -
Comp date time returns
1 12-Aug-97 10:23:38 0.919292648
1 12-Aug-97 10:59:43 0.204139521
1 13-Aug-97 11:03:12 0.31909242
1 14-Aug-97 11:10:02 0.989339371
1 14-Aug-97 11:19:27 0.08394389
1 15-Aug-97 11:56:17 0.481199854
1 16-Aug-97 13:53:45 0.140404929
1 17-Aug-97 10:09:03 0.538569786
2 14-Aug-97 11:43:49 0.427344962
2 14-Aug-97 11:48:32 0.154836294
2 15-Aug-97 14:03:47 0.445415114
2 15-Aug-97 9:38:59 0.696953041
2 15-Aug-97 13:59:23 0.577391987
2 15-Aug-97 9:10:12 0.750949097
2 15-Aug-97 10:22:38 0.077787596
2 15-Aug-97 11:07:57 0.515822161
2 16-Aug-97 11:37:26 0.862673945
2 17-Aug-97 11:42:33 0.400670247
2 19-Aug-97 11:59:34 0.109279307
These are nothing but share price returns for every company at a date and time level.
I need to calculate the autocorrelation (degree 1) of returns over a period of 10 days for each Comp and date combination. As you can see, my time series is not continuous: it has breaks for weekends and public holidays. So if I need a 10-day range, I can't use the INTNX function, because adding 10 days to the date column might include a Saturday or Sunday for which I have no data, and my autocorrelation value would be compromised. How do I make this range dynamic?
I found the question "Calculating rolling correlations in SAS", which I thought might help, but it has the same INTNX problem.
You can use the INTERVALDS system option to define a custom interval that fits your needs. See this article for more details.
The basic concept is that you create a dataset containing all of your possible dates (or datetimes) and define an interval value for each one, then tell SAS via the system option to use that dataset when you use a particular interval name. Then use INTNX as normal.
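A rough sketch of that setup follows. The sample rows are a few of company 2's dates from the question (18-Aug is absent), the interval name ObsDay and the dataset names are made up, and a 2-day step stands in for the real 10-day window so the effect is visible on five rows. It assumes the custom interval dataset needs a formatted variable named begin holding the start of each interval.

```sas
/* A few rows from the question; 18AUG1997 has no data */
data returns;
  input Comp date :date9. returns;
  format date date9.;
  datalines;
2 14AUG1997 0.427344962
2 15AUG1997 0.445415114
2 16AUG1997 0.862673945
2 17AUG1997 0.400670247
2 19AUG1997 0.109279307
;
run;

/* One interval per day that actually occurs in the data */
proc sql;
  create table work.obsdays as
  select distinct date as begin format=date9.
  from returns
  order by begin;
quit;

/* Register the dataset under a custom interval name */
options intervalds=(ObsDay=work.obsdays);

data windows;
  set returns;
  /* step back 2 observed days instead of 2 calendar days */
  fromDate = intnx('ObsDay', date, -2);
  format fromDate date9.;
run;
```

Rows too close to the start of the data have no interval to step back to, so their fromDate needs checking before it is used as a window boundary.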
Otherwise, you could run PROC FREQ on your data to get the unique days and use that to create a day counter; then, instead of creating your fromDate with INTNX, you can use SQL to grab the rows whose day counter is 10 less than the current one.
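The day-counter route could be sketched like this. The sample rows are a few of company 2's dates from the question, the names days, dayno and windows are made up, and a lookback of 2 observed days stands in for the real 10 so the result is visible on five rows.

```sas
/* A few rows from the question; 18AUG1997 has no data */
data returns;
  input Comp date :date9. returns;
  format date date9.;
  datalines;
2 14AUG1997 0.427344962
2 15AUG1997 0.445415114
2 16AUG1997 0.862673945
2 17AUG1997 0.400670247
2 19AUG1997 0.109279307
;
run;

/* Unique days, in ascending date order */
proc freq data=returns noprint;
  tables date / out=days(keep=date);
run;

data days;
  set days;
  dayno = _n_;   /* 1, 2, 3, ... over the days that actually occur */
run;

proc sql;
  create table windows as
  select r.Comp, r.date, r.returns,
         w.date as fromDate format=date9.
  from returns as r
       inner join days as d on r.date = d.date
       left join days as w on w.dayno = d.dayno - 2
  order by r.Comp, r.date;
quit;
```

For 19AUG1997 the join returns fromDate = 16AUG1997, skipping the missing 18th. The first two days have no earlier counter to match, so their fromDate is missing.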