This question already has answers here:
SAS- Condensing Multiple Rows, Keeping highest Value
(2 answers)
Closed 6 years ago.
Assume I have a data-set D1 as follows:
ID ATR1 ATR2 ATR3
1 23 10 11
2 22 11 14
1 19 14 15
2 34 6 17
3 10 11 5
I want to create a data-set D2 from this as follows
ID ATR1 ATR2 ATR3
1 23 14 15
2 34 11 17
3 10 11 5
In other words, Data-set D2 consists of unique IDs from D1. For each ID in D2, the values of ATR1-ATR3 are selected as the maximum (of the respective variable) among the records in D1 with the same ID. For example ID = 1 in D2 has ATR1 = max(23,19) = 23.
I have one solution which is very clumsy. I simply sort copies of the data set `D1' three times (by ID and ATR1 e.g) and remove duplicates. I later merge the three data-sets to get what I want. However, I think there might be an elegant way to do this. I have about 20 such variables in the original data-set.
Thanks
PROC SQL METHOD
PROC SQL;
CREATE TABLE D2 AS
SELECT ID,
MAX(ATR1) as ATR1,
MAX(ATR2) as ATR2,
MAX(ATR3) as ATR3,
FROM D1
GROUP BY ID;
QUIT;
The GROUP BY clause can also be written GROUP BY 1, omitting ID, as this refers to the 1st column in the SELECT clause.
PROC SUMMARY METHOD
PROC SUMMARY DATA=D1 NWAY;
CLASS ID;
VAR ATR1 ATR2 ATR3;
OUTPUT OUT=D2 (DROP=_TYPE_ _FREQ_) MAX()=;
RUN;
Here's an explanation of some of the options:
NWAY - gives only the maximum level of summarisation, here it's not as important because you have only one CLASS variable, meaning there is only one level of summarisation. However, without NWAY you get an extra row showing the max value of ATR1-ATR3 across the whole dataset, which is not something you asked for in your question.
DROP=_TYPE_ _FREQ_ - This removes the automatic variables:
_TYPE_ - which shows the level of summarisation (see comment above), which would just be a column containing the value 1.
_FREQ_ - gives a frequency count of the ID values, which although useful, isn't something you wanted in your question.
Related
Suppose I have the following database:
DATA have;
INPUT id date gain;
CARDS;
1 201405 100
2 201504 20
2 201504 30
2 201505 30
2 201505 50
3 201508 200
3 201509 200
3 201509 300
;
RUN;
I want to create a new table want where the average of the variable gain is grouped by id and by date. The final database should look like this:
DATA want;
INPUT id date average_gain;
CARDS;
1 201405 100
2 201504 25
2 201505 40
3 201508 200
3 201509 250
I tried to obtain the desired result using the code below but it didn't work:
PROC sql;
CREATE TABLE want as
SELECT *,
mean(gain) as average_gain
FROM have
GROUP BY id, date
ORDER BY id, date
;
QUIT;
It's the asterisk that's causing the issue. That will resolve to id, date, gain, which is not what you want. ANSI SQL would not allow this type of functionality so it's one way in which SAS differs from other SQL implementation.
There should be a note in the log about remerging with the original data, which is essentially what's happening. The summary values are remerged to every line.
To avoid this, list your group by fields in your query and it will work as expected.
PROC sql;
CREATE TABLE want as
SELECT id, date,
mean(gain) as average_gain
FROM have
GROUP BY id, date
ORDER BY id, date
;
QUIT;
I will say, in general, PROC MEANS is usually a better option because:
calculate for multiple variables & statistics without need to list them all out multiple times
can get results at multiple levels, for example totals at grand total, id and group level
not all statistics can be calculated within PROC MEANS
supports variable lists so you can shortcut reference long lists without any issues
The google search has been difficult for this. I have two categorical variables, age and months, with 7 levels each. for a few levels, say age =7 and month = 7 there is no value and when I use proc sql the intersections that do not have entries do not show, eg:
age month value
1 1 4
2 1 12
3 1 5
....
7 1 6
...
1 7 8
....
5 7 44
6 7 5
THIS LINE DOESNT SHOW
what i want
age month value
1 1 4
2 1 12
3 1 5
....
7 1 6
...
1 7 8
....
5 7 44
6 7 5
7 7 0
this happens a few times in the data, where tha last groups dont have value so they dont show, but I'd like them to for later purposes
You have a few options available, both seem to work on the premise of creating the master data and then merging it in.
Another is to use a PRELOADFMT and FORMATs or CLASSDATA option.
And the last - but possibly the easiest, if you have all months in the data set and all ages, then use the SPARSE option within PROC FREQ. It creates all possible combinations.
proc freq data=have;
table age*month /out = want SPARSE;
weight value;
run;
First some sample data:
data test;
do age=1 to 7;
do month=1 to 12;
value = ceil(10*ranuni(1));
if ranuni(1) < .9 then
output;
end;
end;
run;
This leaves a few holes, notably, (1,1).
I would use a series of SQL statements to get the levels, cross join those, and then left join the values on, doing a coalesce to put 0 when missing.
proc sql;
create table ages as
select distinct age from test;
create table months as
select distinct month from test;
create table want as
select a.age,
a.month,
coalesce(b.value,0) as value
from (
select age, month from ages, months
) as a
left join
test as b
on a.age = b.age
and a.month = b.month;
quit;
The group independent crossing of the classification variables requires a distinct selection of each level variable be crossed joined with the others -- this forms a hull that can be left joined to the original data. For the case of age*month having more than one item you need to determine if you want
rows with repeated age and month and original value
rows with distinct age and month with either
aggregate function to summarize the values, or
an indication of too many values
data have;
input age month value;
datalines;
1 1 4
2 1 12
3 1 5
7 1 6
1 7 8
5 7 44
6 7 5
8 8 1
8 8 11
run;
proc sql;
create table want1(label="Original class combos including duplicates and zeros for absent cross joins")
as
select
allAges.age
, allMonths.month
, coalesce(have.value,0) as value
from
(select distinct age from have) as allAges
cross join
(select distinct month from have) as allMonths
left join
have
on
have.age = allAges.age and have.month = allMonths.month
order by
allMonths.month, allAges.age
;
quit;
And a slight variation that marks duplicated class crossings
proc format;
value S_V_V .t = 'Too many source values'; /* single valued value */
quit;
proc sql;
create table want2(label="Distinct class combos allowing only one contributor to value, or defaulting to zero when none")
as
select distinct
allAges.age
, allMonths.month
, case
when count(*) = 1 then coalesce(have.value,0)
else .t
end as value format=S_V_V.
, count(*) as dup_check
from
(select distinct age from have) as allAges
cross join
(select distinct month from have) as allMonths
left join
have
on
have.age = allAges.age and have.month = allMonths.month
group by
allMonths.month, allAges.age
order by
allMonths.month, allAges.age
;
quit;
This type of processing can also be done in Proc TABULATE using the CLASSDATA= option.
Sorry for the confusing title.
Background
data looks like this
Area Date Ind LB UB
A 1mar 14 1 20
A 2mar 3 1 20
B 1mar 11 7 22
B 2mar 0 7 22
Area has several distinct values. For each area, LB and UB are fixed across multiple dates, while Ind varies. Date always starts from month start to certain day of the month.
Target
My target is to run a control chart for each area to see if Ind exceeds the range (LB,UB).
But if I just plot the raw data for each area, the xaxis by default not ends at the last day of the month (In the previous example, the plot will be from 1-Mar to 2-Mar instead of 31-Mar. I do know the by specifying the xmax option in xaxis the plot will extends to 31-Mar. But this only extends the xaxis, LB and UB still display from 1-Mar to 2-Mar, leaving the right side of the graph empty.
Thus I use modify to add in some date records.
What I have done
data have;
modify have;
do i = 0 to intck('day',today(),intnx('month',today(),0,'E'));
Date = today()+i;
call missing(Ind);
output;
end;
stop;
run;
proc sgplot data=have missing;
series ... Ind ...;
series ... LB ...;
series ... UB ...;
run;
Question
But this only works for one area. I need to modify each area first then plot them one by one. How can I relatively efficient to get below data
Area Date Ind LB UB
A 1mar 14 1 20
A 2mar 3 1 20
A 3mar . 1 20
....
A 31mar. 1 20
B 1mar 11 7 22
B 2mar 0 7 22
B 3mar . 7 22
....
B 31mar. 7 22
Or there's other options in proc sgplot to solve this?
You can use proc timeseries with the by-group area to get it into the form that you need. The end= option will let you specify an ending date for your data. It looks like you're using the current month, so we'll take your intnx function and plop it into a set of macro functions that resolve to a date literal (most ETS procs require a date literal for some reason).
We'll use two var statements: one for ind where we fill in unobserved values with ., and another for LB & UB to set their unobserved values with the previous valid value.
Note that we are assuming you've already put date into a SAS date. Make sure you do this first before running the below code.
proc timeseries data=have
out=want;
by area;
id Date interval=day notsorted
accumulate=none
end="%sysfunc(intnx(month, %sysfunc(today() ), 0, E), date9.)"d;
var Ind / setmissing=missing;
var LB UB / setmissing=previous;
run;
Your final dataset will look exactly as you'd like.
I have three different questions about modifying a dataset in SAS. My data contains: the day and the specific number belonging to the tag which was registred by an antenna on a specific day.
I have three separate questions:
1) The tag numbers are continuous and range from 1 to 560. Can I easily add numbers within this range which have not been registred on a specific day. So, if 160-280 is not registered for 23-May and 40-190 for 24-May to add these non-registered numbers only for that specific day? (The non registered numbers are much more scattered and for a dataset encompassing a few weeks to much to do by hand).
2) Furthermore, I want to make a new variable saying a tag has been registered (1) or not (0). Would it work to make this variable and set it to 1, then add the missing variables and (assuming the new variable is not set for the new number) set the missing values to 0.
3) the last question would be in regard to the format of the registered numbers which is along the line of 528 000000000400 and 000 000000000054. I am only interested in the last three digits of the number and want to remove the others. If I could add the missing numbers I could make a new variable after the data has been sorted by date and the original transponder code but otherwise what would you suggest?
I would love some suggestions and thank you in advance.
I am inventing some data here, I hope I got your questions right.
data chickens;
do tag=1 to 560;
output;
end;
run;
data registered;
input date mmddyy8. antenna tag;
format date date7.;
datalines;
01012014 1 1
01012014 1 2
01012014 1 6
01012014 1 8
01022014 1 1
01022014 1 2
01022014 1 7
01022014 1 9
01012014 2 2
01012014 2 3
01012014 2 4
01012014 2 7
01022014 2 4
01022014 2 5
01022014 2 8
01022014 2 9
;
run;
proc sql;
create table dates as
select distinct date, antenna
from registered;
create table DatesChickens as
select date, antenna, tag
from dates, chickens
order by date, antenna, tag;
quit;
proc sort data=registered;
by date antenna tag;
run;
data registered;
merge registered(in=INR) DatesChickens;
by date antenna tag;
Registered=INR;
run;
data registeredNumbers;
input Numbers $16.;
datalines;
528 000000000400
000 000000000054
;
run;
data registeredNumbers;
set registeredNumbers;
NewNumbers=substr(Numbers,14);
run;
I do not know SAS, but here is how I would do it in SQL - may give you an idea of how to start.
1 - Birds that have not registered through pophole that day
SELECT b.BirdId
FROM Birds b
WHERE NOT EXISTS
(SELECT 1 FROM Pophole_Visits p WHERE b.BirdId = p.BirdId AND p.date = ????)
2 - Birds registered through pophole
If you have a dataset with pophole data you can query that to find if a bird has been through. What would you flag be doing - finding a bird that has never been through any popholes? Looking for dodgy sensor tags or dead birds?
3 - Data code
You might have more joy with the SUBSTRING function
Good luck
I have an imported excel file, DATASET looks like:
Family Weight
1 150
1 210
1 99
2 230
2 100
2 172
I need to find the sum of ranks for each family.
I know that I can do this easily using PROC RANK but this is a HW problem and the only PROC statement I can use is PROC Means. I cannot even use Proc Sort.
The ranking would be as follows (lowest weight receives rank = 1, etc)
99 - Rank = 1
100 - Rank = 2
150 - Rank = 3
172 - Rank = 4
210 - Rank = 5
230 - Rank = 6
Resulting Dataset:
Family Sum_Ranking
1 9
2 12
Family 1 Sum_Ranking was calculated by (3+5+1)
Family 2 Sum_Ranking was calculated by (6+2+4)
Thank you for assistance.
I'm not going to give you code, but some tips.
Specifically, the most interesting part about the instructions is the explicit "not even PROC SORT".
PROC MEANS has a useful side effect, in that it sorts data by the class variables (in the class variable order). So,
PROC SORT data=blah out=blah_w;
by x y;
run;
and
PROC MEANS data=blah;
class x y;
var y;
output out=blah_w n=;
run;
Have almost the identical results. Both produce a dataset sorted by x y, even though PROC MEANS didn't require a sort.
So in this case, you can use PROC MEANS' class statement to produce a dataset that is sorted by weight and family (you must carry over family here even though you don't need it). Then you must use a data step to produce a RANK variable, which is the rank of the current line (use the _FREQ_ column to figure that out in case there are more than one with the same rank in the same family, and think about what to do in case of ties), then another PROC MEANS to summarize by family this time.