Average a variable by two others - sas

Suppose I have the following database:
DATA have;
INPUT id date gain;
CARDS;
1 201405 100
2 201504 20
2 201504 30
2 201505 30
2 201505 50
3 201508 200
3 201509 200
3 201509 300
;
RUN;
I want to create a new table want where the average of the variable gain is grouped by id and by date. The final database should look like this:
DATA want;
INPUT id date average_gain;
CARDS;
1 201405 100
2 201504 25
2 201505 40
3 201508 200
3 201509 250
;
RUN;
I tried to obtain the desired result using the code below but it didn't work:
PROC sql;
CREATE TABLE want as
SELECT *,
mean(gain) as average_gain
FROM have
GROUP BY id, date
ORDER BY id, date
;
QUIT;

The asterisk is causing the issue. It resolves to id, date, gain, so gain is selected alongside the summary and every original row is kept. ANSI SQL would not allow this, so it is one way in which SAS differs from other SQL implementations.
There should be a note in the log about remerging summary statistics with the original data, which is essentially what is happening: the summary values are remerged onto every row.
To avoid this, list only your GROUP BY fields (plus the aggregate) in the SELECT clause and it will work as expected.
PROC sql;
CREATE TABLE want as
SELECT id, date,
mean(gain) as average_gain
FROM have
GROUP BY id, date
ORDER BY id, date
;
QUIT;
I will say that, in general, PROC MEANS is usually a better option because:
it can calculate multiple statistics for multiple variables without listing each one out repeatedly
it can produce results at multiple levels, for example at the grand total, id, and group level
not all statistics can be calculated within PROC SQL
it supports variable lists, so you can reference long lists of variables with a shortcut
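As a rough sketch of the PROC MEANS route for the data in the question (the output name average_gain comes from the question; the options chosen here are just one reasonable way to write it):

```sas
/* Average gain per id/date combination with PROC MEANS */
proc means data=have noprint nway;
    class id date;                          /* grouping variables */
    var gain;                               /* variable to summarise */
    output out=want (drop=_type_ _freq_)    /* drop the automatic variables */
           mean=average_gain;
run;
```

NWAY keeps only the rows for the full id by date combination, and NOPRINT suppresses the printed report so that only the output table is created.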

SAS join on a flag to indicate percentile

I am looking to join two tables together
Table 1 - The baseball dataset
DATA baseball;
SET sashelp.baseball
(KEEP = crhits);
RUN;
Table 2 - A table containing the percentiles of CRhits
PROC STDIZE
DATA = baseball
OUT=_NULL_
PCTLMTD=ORD_STAT
PCTLDEF=5
OUTSTAT=STDLONGPCTLS
(WHERE = (SUBSTR(_TYPE_,1,1) = "P"))
pctlpts = 1 TO 99 BY 1;
RUN;
I would like to join these tables together to create a table that contains the values for crhits and then a column identifying which percentile that value belongs to like below
crhits percentile percentile_value
54 p3 54
66 p5 66
825 p63 825
1134 p76 1133
The last column indicates the percentile value given by stdlongpctls
I currently use the following code to calculate the percentiles and a loop to count the number of "Events" per percentile, per factor
I have tried a cross-join but I am having trouble visualising how to join these two tables without an explicit key
PROC SQL;
CREATE TABLE cross_join_table AS
SELECT
a.crhits
, b._TYPE_
, CASE WHEN
a.crhits < b.type THEN b._TYPE_ END AS percentile
FROM
baseball a
CROSS JOIN
stdlongpctls b;
QUIT;
If there is an easier or more efficient way to find the number of observations and the number of events per percentile group, I would appreciate it (in my actual dataset I am modelling a default flag, so the event count is the sum of 1s per percentile group).
Use PROC RANK instead to group it into the percentiles.
proc rank data=sashelp.baseball out=baseball_ranks groups=100;
var crhits;
ranks rank_crhits;
run;
You can then summarize it using PROC MEANS.
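A minimal sketch of that summary step, assuming the baseball_ranks table and rank_crhits variable created above (the output names rank_summary, n_obs, min_crhits and max_crhits are made up for illustration):

```sas
/* One row per percentile group with its size and value range */
proc means data=baseball_ranks noprint nway;
    class rank_crhits;                      /* group number from PROC RANK, starting at 0 */
    var crhits;
    output out=rank_summary (drop=_type_)
           n=n_obs min=min_crhits max=max_crhits;
run;
```

With a 0/1 event flag in the data (there is none in sashelp.baseball, so this is hypothetical), adding that flag to the VAR statement and requesting SUM= would give the number of events per group.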

Can i standardize/convert the data in datetime to have the same "minutes"?

I am a SAS developer using SAS Data Integration Studio.
Currently I have a scenario where the data in many Excel tabs have different time values (minutes). For example:
Tab A will have datetime
2010-Jan-01 01:00:00
2010-Jan-02 01:00:00
2010-Jan-03 01:00:00
Tab B will have datetime below:
2010-Jan-01 01:25:00
2010-Jan-02 01:25:00
2010-Jan-03 01:25:00
Note that table A and table B have different columns apart from the datetime, which is the same in both except for the minutes.
When I merge these two tables, I want to ignore the minutes, or rather standardize them to 01:00:00.
Is there a way to do so?
Currently, if I merge these two tables, I get "duplicates" (I mean the same day but a different record because the minutes differ).
Imagine table A has a column called Age
and table B has a column called Net Worth.
When I merge these two tables, I expect a single record for 2010-Jan-01 01:00:00 with both Age and Net Worth.
However, when the datetimes do not match, I get separate records: one with Age populated but a blank Net Worth, and vice versa.
I would appreciate your help on how to achieve that.
You can truncate to the start of the hour using the intnx function before matching:
data _null_;
mydatetime = datetime();
mydatehour = intnx('dthour',mydatetime, 0, 'b');
put mydatetime= is8601dt. / mydatehour= is8601dt.;
run;
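There is also the DATEPART(datetime) function, which you can combine with DHMS to rebuild the datetime with the minutes and seconds set to zero.
Example:
data _null_;
conn='01feb94:8:45'dt;
servdate=datepart(conn);
dhms = dhms(datepart(conn),hour(conn),0,0);
put dhms datetime.;
run;
Result in the log: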
01FEB94:08:00:00
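To show how the truncation fits into the merge itself, here is a small sketch. The table names tab_a and tab_b, the key name dt, and the sample values are assumptions standing in for the two Excel tabs:

```sas
/* Hypothetical sample data standing in for the two tabs */
data tab_a;
    input dt :datetime18. age;
    format dt datetime19.;
    datalines;
01JAN2010:01:00:00 30
02JAN2010:01:00:00 31
;
run;

data tab_b;
    input dt :datetime18. net_worth;
    format dt datetime19.;
    datalines;
01JAN2010:01:25:00 1000
02JAN2010:01:25:00 2000
;
run;

/* Truncate the key to the start of the hour so 01:25:00 becomes 01:00:00 */
data tab_b;
    set tab_b;
    dt = intnx('dthour', dt, 0, 'b');
run;

proc sort data=tab_a; by dt; run;
proc sort data=tab_b; by dt; run;

/* With matching keys, each datetime yields one row holding both columns */
data merged;
    merge tab_a tab_b;
    by dt;
run;
```

If tab_a could also contain non-zero minutes, the same truncation step would need to be applied to it before sorting.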

SAS for the following scenario [duplicate]

Assume I have a data-set D1 as follows:
ID ATR1 ATR2 ATR3
1 23 10 11
2 22 11 14
1 19 14 15
2 34 6 17
3 10 11 5
I want to create a data-set D2 from this as follows
ID ATR1 ATR2 ATR3
1 23 14 15
2 34 11 17
3 10 11 5
In other words, Data-set D2 consists of unique IDs from D1. For each ID in D2, the values of ATR1-ATR3 are selected as the maximum (of the respective variable) among the records in D1 with the same ID. For example ID = 1 in D2 has ATR1 = max(23,19) = 23.
I have one solution, which is very clumsy: I sort copies of the data set D1 three times (e.g. by ID and ATR1) and remove duplicates, then merge the three data sets to get what I want. However, I think there might be a more elegant way to do this. I have about 20 such variables in the original data set.
Thanks
PROC SQL METHOD
PROC SQL;
CREATE TABLE D2 AS
SELECT ID,
MAX(ATR1) as ATR1,
MAX(ATR2) as ATR2,
MAX(ATR3) as ATR3
FROM D1
GROUP BY ID;
QUIT;
The GROUP BY clause can also be written GROUP BY 1, omitting ID, as this refers to the 1st column in the SELECT clause.
PROC SUMMARY METHOD
PROC SUMMARY DATA=D1 NWAY;
CLASS ID;
VAR ATR1 ATR2 ATR3;
OUTPUT OUT=D2 (DROP=_TYPE_ _FREQ_) MAX()=;
RUN;
Here's an explanation of some of the options:
NWAY - gives only the maximum level of summarisation, here it's not as important because you have only one CLASS variable, meaning there is only one level of summarisation. However, without NWAY you get an extra row showing the max value of ATR1-ATR3 across the whole dataset, which is not something you asked for in your question.
DROP=_TYPE_ _FREQ_ - This removes the automatic variables:
_TYPE_ - which shows the level of summarisation (see comment above), which would just be a column containing the value 1.
_FREQ_ - gives a frequency count of the ID values, which although useful, isn't something you wanted in your question.
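Since the question mentions about 20 such variables, a variable list saves typing them all. This is a sketch that assumes every variable to be maximised starts with the prefix ATR and that no other variable in D1 does:

```sas
/* Maximum of every ATR* variable per ID, without listing each one */
proc summary data=D1 nway;
    class ID;
    var ATR:;                               /* name-prefix list: all variables starting with ATR */
    output out=D2 (drop=_type_ _freq_) max=;
run;
```

A numbered range such as ATR1-ATR20 would work in the VAR statement as well, provided the variables really are numbered consecutively.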

Modifying data in SAS: copying part of the value of a cell, adding missing data and labeling it

I have three different questions about modifying a dataset in SAS. My data contains the day and the specific number belonging to the tag that was registered by an antenna on that day.
The three questions are:
1) The tag numbers are continuous and range from 1 to 560. Can I easily add the numbers within this range that were not registered on a specific day? So, if 160-280 were not registered on 23-May and 40-190 were not registered on 24-May, I want to add these non-registered numbers only for that specific day. (In reality the non-registered numbers are much more scattered, and for a dataset covering a few weeks there are too many to do by hand.)
2) Furthermore, I want to make a new variable saying whether a tag has been registered (1) or not (0). Would it work to create this variable and set it to 1, then add the missing tag numbers and (assuming the new variable is not set for the added rows) set the missing values to 0?
3) The last question concerns the format of the registered numbers, which is along the lines of 528 000000000400 and 000 000000000054. I am only interested in the last three digits of the number and want to remove the others. If I could add the missing numbers, I could make a new variable after the data has been sorted by date and the original transponder code, but otherwise what would you suggest?
I would love some suggestions and thank you in advance.
I am inventing some data here, I hope I got your questions right.
data chickens;
do tag=1 to 560;
output;
end;
run;
data registered;
input date mmddyy8. antenna tag;
format date date7.;
datalines;
01012014 1 1
01012014 1 2
01012014 1 6
01012014 1 8
01022014 1 1
01022014 1 2
01022014 1 7
01022014 1 9
01012014 2 2
01012014 2 3
01012014 2 4
01012014 2 7
01022014 2 4
01022014 2 5
01022014 2 8
01022014 2 9
;
run;
proc sql;
create table dates as
select distinct date, antenna
from registered;
create table DatesChickens as
select date, antenna, tag
from dates, chickens
order by date, antenna, tag;
quit;
proc sort data=registered;
by date antenna tag;
run;
data registered;
merge registered(in=INR) DatesChickens;
by date antenna tag;
Registered=INR;
run;
data registeredNumbers;
input Numbers $16.;
datalines;
528 000000000400
000 000000000054
;
run;
data registeredNumbers;
set registeredNumbers;
NewNumbers=substr(Numbers,14);
run;
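If the last three digits need to line up with the numeric tag variable used earlier, one possible follow-up step is to read the substring as a number. This sketch assumes every code has the fixed 16-character layout shown above; the variable name tag is chosen to match the chickens table:

```sas
/* Convert the last three characters of the code into a numeric tag */
data registeredNumbers;
    set registeredNumbers;
    tag = input(substr(Numbers, 14, 3), 3.);   /* characters 14-16 read as a number */
run;
```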
I do not know SAS, but here is how I would do it in SQL - may give you an idea of how to start.
1 - Birds that have not registered through pophole that day
SELECT b.BirdId
FROM Birds b
WHERE NOT EXISTS
(SELECT 1 FROM Pophole_Visits p WHERE b.BirdId = p.BirdId AND p.date = ????)
2 - Birds registered through pophole
If you have a dataset with pophole data, you can query that to find out whether a bird has been through. What would your flag be doing: finding a bird that has never been through any popholes? Looking for dodgy sensor tags or dead birds?
3 - Data code
You might have more joy with the SUBSTRING function
Good luck

Dynamic Where in SAS - Conditional Operation based on another set

To my disappointment, the following code, which sums up 'value' by week from 'master' for weeks which appear in 'transaction' does not work -
data master;
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input change_week ;
datalines;
1
3
;
run;
data _null_;
set transaction;
do until(done);
set master end=done;
where week=change_week;
sum = sum(value, sum);
end;
file print;
put week= sum=;
run;
SAS complains, rightly, because it doesn't see 'change_week' in master and does not know how to operate on it.
Surely there must be a way of doing some operation on a subset of a master set (of course, suitably indexed), given a transaction dataset... Does any one know?
I believe this is the closest answer to what the asker has requested.
This method uses an index on week on the large dataset, allowing for the possibility of invalid week values in the transaction dataset, and without requiring either dataset to be sorted in any particular order. Performance will probably be better if the master dataset is in week order.
For small transaction datasets, this should perform quite a lot better than the other solutions as it only retrieves the required observations from the master dataset. If you're dealing with > ~30% of the records in the master dataset in a single transaction dataset, Quentin's method may sometimes perform better due to the overhead of using the index.
data master(index = (week));
input week value;
datalines;
1 10
1 20
1 30
2 40
2 40
2 50
3 15
3 25
3 35
;
run;
data transaction;
input week ;
datalines;
1
3
4
;
run;
data _null_;
set transaction;
file print;
do until(done);
set master key = week end=done;
/*Prevent implicit retain from previous row if the key isn't found,
or we've read past the last record for the current key*/
if _IORC_ ne 0 then do;
_ERROR_ = 0;
call missing(value);
end;
else sum = sum(value, sum);
end;
put week= sum=;
run;
N.B. for this to work, the indexed variable in the master dataset must have exactly the same name and type as the variable in the transaction dataset. Also, the index must be of the non-unique variety in order to accommodate multiple rows with the same key value.
Also, it is possible to replace the set master... statement with an equivalent modify master... statement if you want to apply transactional changes directly, i.e. without SAS making a massive temp file and replacing the original.
You are correct, there are many ways to do this in SAS. Your example is inefficient because (once we got it working) it would still require a full read of "master" for every line of "transaction".
(You got the error because you used where instead of if. In a data step, a subsetting where can only reference columns that already exist in the data set being read, whereas if is evaluated after the row has been loaded and can use any variable. SAS keeps both because where is faster when it's usable.)
An alternative solution would be use proc sql. Hopefully this example is self-explanatory:
proc sql;
select
a.change_week,
sum(b.value) as value
from
transaction as a,
master as b
where a.change_week = b.week
group by change_week;
quit;
I don't suggest the solution below (I would prefer @Jeff's SQL solution, or even a hash). But just to play with data step logic, I think the approach below would work if you trust that every key in transaction exists in master. It relies on both datasets being sorted, so it makes only one pass through each.
On first iteration of the DATA step, it reads the first record from the transaction dataset, then keeps reading through the master dataset until it finds all the matching records for that key, then the DATA step loop iterates and it does it again for the next transaction record.
data _null_;
set transaction;
by change_week;

do until(last.week and _found);
set master;
by week;

if week=change_week then do;
sum = sum(value, sum);
_found=1;
end;
end;

*file print;
put week= sum= ;
run;
week=1 sum=60
week=3 sum=75