I have the following data:
data have;
input username $ betdate : datetime. customerCode;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 1
player1 04NOV2008:09:03:44 10
player2 07NOV2008:07:03:33 1
player2 05NOV2008:09:00:00 0.5
player3 05NOV2008:09:05:00 1
player2 07NOV2008:14:03:33 1
player1 05NOV2008:09:00:05 20
player2 07NOV2008:16:03:33 1
player2 07NOV2008:18:03:33 1
player2 09NOV2008:10:05:10 0.7
player3 15NOV2008:15:05:33 10
player3 15NOV2008:15:05:33 1
player2 15NOV2008:15:05:33 0.1
run;
PROC PRINT; RUN;
When I run the following, I don't get distinct, collapsed entries for customerCode when I group by it, presumably because it is numeric.
proc sql;
select username, customerCode from have group by 1,2;
quit;
How can I do this? I want to get a history of all the customer codes that have been assigned to a customer (i.e. as they change), rather than an entry for each numeric value of customerCode. I haven't been able to convert the variable to a char value so that the grouping works:
proc sql;
create table want as
select * from have, customerCode FORMAT $10. as code;
quit;
Thanks for any help on this.
You're not getting distinct entries because your GROUP BY is being ignored: you didn't ask for any summary functions. When a query has GROUP BY but no summary function (i.e., sum(something), count(something), or similar), SAS converts it to ORDER BY. Being numeric has nothing to do with it; numeric variables group just fine.
This is noted in the log with a NOTE, by the way.
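For comparison, here is a minimal sketch of the same query with a summary function added, so that GROUP BY actually collapses the rows. The column name n_bets is made up for illustration and is not part of the original question.

```sas
proc sql;
  /* count(*) is a summary function, so GROUP BY is honoured
     and each username/customerCode pair appears once */
  select username, customerCode, count(*) as n_bets
    from have
    group by username, customerCode;
quit;
```

This returns one row per username/customerCode combination along with how many bets carried that code.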
You can use distinct, as you suggested in the comments:
proc sql;
select distinct username, customercode from have;
quit;
That will give you a list of all username/customercode combinations.
If you want to format it, you have to remove the $. The $ in a format name does not mean "make this a character value" (all formats produce character output); it means "the original, pre-format value was a character value".
proc sql;
create table want as
select distinct username, customercode format=10. from have;
quit;
This won't quite work as expected, because the format is applied after the distinct is processed (the post-decimal portion still exists under the hood, so values such as 0.5 and 0.7 remain distinct). However, you can do:
proc sql;
create table want as
select distinct username, put(customercode,10.) as code from have;
quit;
Or you could use ROUND or something else to keep it numeric.
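A minimal sketch of the ROUND variant follows. It assumes that rounding to the nearest whole number is the collapsing you want, which the question does not state; the table name want_num and the column name code are illustrative.

```sas
proc sql;
  /* round(x, 1) rounds to the nearest multiple of 1, so the value
     stays numeric and DISTINCT sees the rounded value */
  create table want_num as
    select distinct username, round(customercode, 1) as code
      from have;
quit;
```

Unlike the format= approach, the rounding happens before DISTINCT is evaluated, so codes that round to the same number collapse into one row.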
I have a dataset as follows:
ID status
101 Checked
101 Checked
101 NotChecked
101 Checked
101 NotChecked
I want to count the number of obs based on the status variable, like this:
ID status Count
101 Checked 2
101 Checked 2
101 NotChecked 1
101 Checked 1
101 NotChecked 1
I don't want to use proc sql, because when I say group by it sorts the dataset before giving the result, whereas here the status variable is not sorted.
Aggregating by groups will always require sorting unless you want to use some complex data step logic.
If you have a particular sort order that you want to keep, the easiest way is to create a key column that holds your desired order. You can then re-sort the data back to that order after grouping.
data have2;
set have;
varorder = _N_;
run;
proc sql;
create table want as
select id, status, count(*) as count
from have2
group by id, status
order by varorder
;
quit;
This works for me. It is a slightly longer solution, but the idea is to add row and group identifiers to control the count. The NOTSORTED option on the BY statement lets each consecutive run of values be identified as its own group.
data have;
input ID status $12.;
cards;
101 Checked
101 Checked
101 NotChecked
101 Checked
101 NotChecked
;;;;
run;
data grouped;
set have;
by id status notsorted;
retain MyGroups count;
if first.id then count=1;
else count+1;
if first.status then MyGroups+1;
run;
proc sql;
create table want as
select *, count(*) as numberFound
from grouped
group by MyGroups
order by ID, count;
quit;
Here is my data:
data example;
input id sports_name :$16.;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
This is just a sample. The variable sports_name is categorical with 56 types.
I am trying to transpose the data to wide form where each row would have a user_id and the names of sports as the variables with values being 1/0 indicating Presence or absence.
So far, I have used proc freq to get the cross-tabulated frequency table, put that in a different data set, and then transposed that data. Now I have missing values in some cases and the count of the sport in the rest.
Is there any better way to do this?
Thanks!!
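You need a way to create something from nothing: the id/sport combinations that never occur in the data. The COMPLETETYPES option in PROC SUMMARY does that below; you could also have used the SPARSE option in PROC FREQ. Keep in mind that SAS names cannot be longer than 32 characters, which matters because the sport names become variable names.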
data example;
input id sports_name :$16.;
retain y 1;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;;;;
run;
proc print;
run;
proc summary data=example nway completetypes;
class id sports_name;
output out=freq(drop=_type_);
run;
proc print;
run;
proc transpose data=freq out=wide(drop=_name_);
by id;
var _freq_;
id sports_name;
run;
proc print;
run;
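The SPARSE alternative mentioned in this answer could look roughly like the sketch below. The data set names freq_sparse and wide_sparse are made up for illustration, and the 1/0 result assumes each id/sport pair appears at most once in the input; otherwise the cells hold counts.

```sas
proc freq data=example noprint;
  /* SPARSE writes every id*sports_name combination to the OUT= data set,
     including those with a zero count */
  tables id*sports_name / sparse out=freq_sparse(drop=percent);
run;

proc transpose data=freq_sparse out=wide_sparse(drop=_name_ _label_);
  by id;
  id sports_name;
  var count;
run;
```

PROC FREQ writes its output ordered by id, so the BY statement in PROC TRANSPOSE needs no extra sort here.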
Same theory here, generate a list of all possible combinations using SQL instead of Proc Summary and then transposing the results.
data example;
informat sports_name $20.;
input id sports_name $;
datalines;
1 baseball
1 basketball
1 cricket
1 soccer
2 golf
2 fencing
;
run;
proc sql;
create table complete as
select a.id, a_x.sports_name, case when not missing(e.sports_name) then 1 else 0 end as Present
from (select distinct ID from example) a
cross join (select distinct sports_name from example) a_x
full join example as e
on e.id=a.id
and e.sports_name=a_x.sports_name;
quit;
proc transpose data=complete out=want;
by id;
id sports_name;
var Present;
run;
proc sort data=sas.mincome;
by F3 F4;
run;
Proc sort doesn't sort the dataset by formatted values, only internal values. I need to sort by two variables prior to a merge. Is there any way to do this with proc sort?
I don't think you can sort by formatted values in proc sort, but you can use proc sql to do it: its ORDER BY clause accepts expressions such as put(variable, format.), so the sort key can be the formatted value rather than the internal one.
The general syntax of proc sql for sorting by formatted values will be:
proc sql;
create table NewDataSet as
select variable(s)
from OriginalDataSet
order by put(variable1, format1.), put(variable2, format2.);
quit;
For example, we have a sample data set containing the names, sex and ages of some people and we want to sort them:
proc format;
value gender 1='Male'
2='Female';
value age 10-15='Young'
16-24='Old';
run;
data work.original;
input name $ sex age;
datalines;
John 1 12
Zack 1 15
Mary 2 18
Peter 1 11
Angela 2 24
Jack 1 16
Lucy 2 17
Sharon 2 12
Isaac 1 22
;
run;
proc sql;
create table work.new as
select name, sex format=gender., age format=age.
from work.original
order by put(sex, gender.), put(age, age.);
quit;
Output of work.new will be:
Obs name sex age
1 Mary Female Old
2 Angela Female Old
3 Lucy Female Old
4 Sharon Female Young
5 Jack Male Old
6 Isaac Male Old
7 John Male Young
8 Zack Male Young
9 Peter Male Young
If we had used proc sort by sex, then Males would have been ranked first because we had used 1 to represent Males and 2 to represent Females which is not what we want. So, we can clearly see that proc sql did in fact sort them according to the formatted values (Females first, Males second).
Hope this helps.
Because of the nature of formats, SAS only uses the underlying values for the sort. To my knowledge, you cannot change that (unless you want to build your own translation table via PROC TRANTAB).
What you can do is create a new column that contains the formatted value. Then you can sort on that column.
proc format library=work;
value $test 'z' = 'a'
'y' = 'b'
'x' = 'c';
run;
data test;
format val $test.;
informat val $1.;
input val $;
val_fmt = put(val,$test.);
datalines;
x
y
z
;
run;
proc print data=test(drop=val_fmt);
run;
proc sort data=test;
by val_fmt;
run;
proc print data=test(drop=val_fmt);
run;
Produces
Obs val
1 c
2 b
3 a
Obs val
1 a
2 b
3 c
I would like to replicate the output of PROC MEANS using PROC TABULATE. The reason is that I would like to have a profit percentage (or margin) as one of the variables in the PROC MEANS output, but would like to suppress the calculation for one or more of the statistics, i.e. there will be a '-' or similar in the 'margin' row under 'N' and 'SUM'.
Here is the sample data:
data have;
input username $ betdate : datetime. stake winnings;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90
player1 04NOV2008:09:03:44 100 40
player2 07NOV2008:14:03:33 120 -120
player1 05NOV2008:09:00:00 50 15
player1 05NOV2008:09:05:00 30 5
player1 05NOV2008:09:00:05 20 10
player2 09NOV2008:10:05:10 10 -10
player2 15NOV2008:15:05:33 35 -35
player1 15NOV2008:15:05:33 35 15
player1 15NOV2008:15:05:33 35 15
run;
data want;
set have;
retain margin;
margin = (winnings) / stake;
run;
PROC PRINT; RUN;
I have been calculating statistics with PROC MEANS (like below), but the value for the SUM statistics for the 'margin' variable means nothing: I would like to suppress this value. I have therefore been attempting to replicate this table using PROC TABULATE to have more control of the output, but have been unsuccessful so far.
proc means data=want N sum mean median stddev min max maxdec=2 order=freq STACKODS;
var stake winnings margin;
run;
proc tabulate data=want;
var stake winnings margin;
table stake * (N Sum mean Median StdDev Min Max);
run;
I would appreciate any help on this.
In principle, you can't create this type of output with a single TABLE statement in PROC TABULATE; in essence, you are asking for two different table definitions. Anything you do with the SAS syntax will basically amount to adding more dimensions to the table, but it won't fix your core problem.
You can use this code to get the tables you want, but they're still different tables:
PROC TABULATE DATA=want NOSEPS;
VAR stake winnings margin;
TABLE (stake winnings),(N SUM MEAN MEDIAN STDDEV MIN MAX);
TABLE (margin),(N MEAN MEDIAN STDDEV MIN MAX);
RUN;
There are some guides out there on hacking ODS to do what you want (namely, to create "stacked tables", where several child tables are assembled into a single table). If you Google "SAS stack tables" you'll find examples.
I've done this in HTML by creating a new tagset - basically, a special ODS destination that removes spaces between tables, etc. I don't have the code that I used anymore, unfortunately; I moved to R to do automated reporting.
I have some data (created using the code below) that ranks observations according to two variables. In this case, it ranks each player's first bet and second bet and creates two 'rank' variables. What I want to do instead is rank the observations according to a function of the two variables (like their average), and I'd like to do this in PROC RANK itself rather than in a preliminary data step, as the ranking will get fairly involved once I replicate this on all the variables I need. Can I put operators into the PROC RANK statement? Rather than doing this:
Proc rank data=want ties=mean out=ranked groups=2;
var bet1stake bet2stake;
ranks bet1stakeRank bet2stakeRank;
run;
I would like to do this:
Proc rank data=want ties=mean out=ranked groups=2;
var avg(bet1stake, bet2stake);
ranks firstTwoBetsRank;
run;
Is this possible?
This is how the full example data can be created.
data have;
input username $ betdate : datetime. stake winnings;
dateOnly = datepart(betdate) ;
format betdate DATETIME.;
format dateOnly ddmmyy8.;
datalines;
player1 12NOV2008:12:04:01 90 -90
player1 04NOV2008:09:03:44 100 40
player2 07NOV2008:14:03:33 120 -120
player1 05NOV2008:09:00:00 50 15
player1 05NOV2008:09:05:00 30 5
player1 05NOV2008:09:00:05 20 10
player2 09NOV2008:10:05:10 10 -10
player2 15NOV2008:15:05:33 35 -35
player1 15NOV2008:15:05:33 35 15
player1 15NOV2008:15:05:33 35 15
run;
proc sort data=have;
by username betdate;
run;
data have;
set have;
by username betdate;
retain eventTime;
if first.username then eventTime = 0;
if first.betdate then eventTime + 1;
run;
proc sql;
create table want as
select
distinct username,
(select distinct stake from have where username = main.username and eventTime = 1) as bet1Stake,
(select distinct stake from have where username = main.username and eventTime = 2) as bet2Stake
from have main;
quit;
Proc rank data=want ties=mean out=want groups=2;
var bet1stake bet2stake;
ranks bet1stakeRank bet2stakeRank;
run;
Thanks for any help on this.
I'm afraid you cannot apply operators to the variables you'd like to rank your observations by.
You have two choices. Either use a DATA step to both apply the operators and calculate the ranking,
or
use a DATA step view or SQL view to apply the operator as an intermediate step, in case you are concerned about disk space.
If you are pulling the data from a SQL database (assuming it supports windowing functions), you should be able to do exactly what you are trying to do with SQL code that is passed through to the database.
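The view-based option could be sketched as follows, using the want table and the bet1Stake and bet2Stake columns from the question. The view name want_avg and the column name avgStake are made up for illustration.

```sas
proc sql;
  /* A view stores only the query, not the rows, so no extra copy of
     the data is written to disk. With two arguments, mean() is the
     row-wise SAS function rather than a column summary. */
  create view want_avg as
    select *, mean(bet1Stake, bet2Stake) as avgStake
      from want;
quit;

proc rank data=want_avg ties=mean out=ranked groups=2;
  var avgStake;
  ranks firstTwoBetsRank;
run;
```

PROC RANK reads the view as if it were a data set, so the average is computed on the fly while the ranking runs.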