I am looking to join two tables together
Table 1 - The baseball dataset
DATA baseball;
SET sashelp.baseball
(KEEP = crhits);
RUN;
Table 2 - A table containing the percentiles of CRhits
PROC STDIZE
DATA = baseball
OUT=_NULL_
PCTLMTD=ORD_STAT
PCTLDEF=5
OUTSTAT=STDLONGPCTLS
(WHERE = (SUBSTR(_TYPE_,1,1) = "P"))
pctlpts = 1 TO 99 BY 1;
RUN;
I would like to join these tables to create a table that contains the values of crhits plus a column identifying which percentile each value belongs to, like below:
crhits percentile percentile_value
54 p3 54
66 p5 66
825 p63 825
1134 p76 1133
The last column indicates the percentile value given by stdlongpctls
I currently use the following code to calculate the percentiles and a loop to count the number of "Events" per percentile, per factor
I have tried a cross-join but I am having trouble visualising how to join these two tables without an explicit key
PROC SQL;
CREATE TABLE cross_join_table AS
SELECT
a.crhits
, b._TYPE_
, CASE WHEN
a.crhits < b.type THEN b._TYPE_ END AS percentile
FROM
baseball a
CROSS JOIN
stdlongpctls b;
QUIT;
If there is an easier or more efficient way to find the number of observations and the number of events per percentile group (in my actual dataset I am modelling a default flag, so I need the sum of 1's per percentile group), I would appreciate it.
Use PROC RANK instead to group it into the percentiles.
proc rank data=sashelp.baseball out=baseball_ranks groups=100;
var crhits;
ranks rank_crhits;
run;
You can then summarize it using PROC MEANS.
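As a rough sketch of that summary step (the output dataset name rank_summary and the statistic names are illustrative, not from the original answer), counting observations per percentile group might look like this. In the real dataset, a 0/1 default flag could be added to the VAR statement with a SUM= statistic to get the number of events per group.

```sas
/* One row per percentile group; with groups=100 the ranks run from 0 to 99 */
proc means data=baseball_ranks noprint nway;
    class rank_crhits;
    var crhits;
    output out=rank_summary(drop=_type_ _freq_)
        n=n_obs min=min_crhits max=max_crhits;
run;
```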
I'm trying to merge a dataset to another table (hist_dataset) by applying one condition.
The dataset that I'm trying to merge looks like this:
Label week_start date Value1 Value2
Ac 09Jan2023 13Jan2023 45 43
The logic I'm using is as follows:
If the value of the "week_start" column in the first record is equal to today's week + 14, then merge the dataset with the dataset that I want to append to.
If the value of the "week_start" column in the first record is not equal to today's week + 14, then do nothing and don't merge the data.
The code I'm using is as follows:
libname out /"path"
data dataset;
set dataset;
by week_start;
if first.week_start = intnx('week.2', today() + 14, 0, 'b') then do;
data dataset;
merge out.hist_dataset dataset;
by label, week_start, date;
end;
run;
But I'm getting 2 Errors:
117 - 185: There was 1 unclosed DO block.
161 - 185: No matching DO/SELECT statement.
Do you know how I can make the program run correctly, or do you know another way to do it?
Thanks,
I cannot make heads or tails of what you are asking. So let me take a guess at what you are trying to do and answer based on my guesses.
Let's first make up some dataset and variable names. So you have an existing dataset named OLD that has key variables LABEL, WEEK_START and DATE.
Now you have received a NEW dataset that has those same variables.
You want to first subset the NEW dataset to just those observations where the value of DATE is within 14 days of the first value of WEEK_START in the NEW dataset.
data subset ;
set new;
if _n_=1 then first_week=week_start;
retain first_week;
if date <= first_week+14 ;
run;
You then want to merge that into the OLD dataset.
data want;
merge old subset;
by label week_start date ;
run;
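If the intent is instead the literal logic from the question (merge only when week_start on the first record of the new data equals the start of the week containing today + 14 days), one possible sketch uses a macro to run the merge conditionally. A DATA step cannot be nested inside an IF ... THEN DO block: the second DATA statement ends the first step, which is what triggers the two errors. The dataset names OLD and NEW follow the guess above; the macro name merge_if_due and the macro variables first_week and target are made up for illustration. The sketch assumes week_start is a numeric SAS date and that both datasets are already sorted by the BY variables.

```sas
/* Store week_start from the first observation of NEW in a macro variable */
data _null_;
    set new(obs=1);
    call symputx('first_week', week_start);
run;

%macro merge_if_due;
    /* Start of the Monday-based week that contains today + 14 days */
    %let target = %sysfunc(intnx(week.2, %eval(%sysfunc(today()) + 14), 0, b));
    %if &first_week = &target %then %do;
        data want;
            merge old new;
            by label week_start date;
        run;
    %end;
%mend merge_if_due;

%merge_if_due
```

When the condition is false, the macro generates no code, so nothing is merged.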
I'm very new to SAS and trying to learn it. I have a problem statement where I need to extract two files from a location and then perform joins. Below is a detailed explanation of what I'm trying to achieve in a single proc sql statement:
There are two tables, table a (columns - account#, sales, transaction, store#) and table b (columns - account#, account zipcode) and an excel file (columns - store# and store zipcode). I need to first join these two tables on column account#.
Next step is to join their resulting values with the excel file on column store# and also add a column called as 'distance', which calculates the distance between account zipcode and store zipcode with the help of zipcitydistance(account zipcode, store zipcode) function. Let the resulting table be called "F".
Next I want to use case statement to create a column of distance bucket based on the distance from above query, for e.g.,
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from F
group by 1
So far, below is the code that I have written:
data table_a;
set xyzstore.filea;
run;
data table_b;
set xyzstore.fileb;
run;
proc import datafile="/location/file.xlsx"
out=filec dbms=xlsx replace;
run;
proc sql;
create table d as
select a.store_number, b.account_number, sum(a.sales) as sales, sum(a.transactions) as transactions, b.account_zipcode
from table_a as a left join table_b as b
on a.account_number=b.account_number
group by a.store_number, b.account_number, b.account_zipcode;
quit;
proc sql;
create table e as
select d.*, c.store_zipcode, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
from d inner join filec as c
on d.store_number=c.store_number;
quit;
proc sql;
create table final as
select
case
when distance<=5 then "<=5"
when distance between 5 and 10 then "5-10"
when distance between 10 and 15 then "10-15"
else ">=15"
end as distance_bucket,
sum(transactions) as total_txn,
sum(sales) as total_sales
from e
group by 1;
quit;
How can I write the above lines of code in a single proc sql statement?
The way you are currently doing it is more readable and the preferred way to do it. Turning it into a single SQL statement will not yield any significant performance gains and will make it harder to troubleshoot in the future.
To do a little cleanup, you can remove the two data step set statements and join directly on those files themselves:
create table d as
...
from xyzstore.filea as a left join xyzstore.fileb as b
...
quit;
You could also use a format instead to clean up the CASE statement.
proc format;
value storedistance
low - 5 = '<=5'
5< - 10 = '5-10'
10< - 15 = '10-15'
15 - high = '>=15'
other = ' '
;
run;
...
proc sql;
create table final as
select put(distance, storedistance.) as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
from e
group by calculated distance_bucket
;
quit;
If you did want to turn your existing code into one big SQL statement, it would look like this:
proc sql;
create table final as
select CASE
when(distance <= 5) then '<=5'
when(distance between 5 and 10) then '5-10'
when(distance between 10 and 15) then '10-15'
else '>=15'
END as distance_bucket
, sum(transactions) as total_txn
, sum(sales) as total_sales
/* Join 'table d' with c */
from (select d.*
, c.store_zipcode
, zipcitydistance(d.account_zipcode, c.store_zipcode) as distance
/* Create 'table d' */
from (select a.store_number
, b.account_number
, sum(a.sales) as sales
, sum(a.transactions) as transactions
, b.account_zipcode
from xyzstore.filea as a
LEFT JOIN
xyzstore.fileb as b
ON a.account_number = b.account_number
group by a.store_number
, b.account_number
, b.account_zipcode
) as d
INNER JOIN
filec as c
ON d.store_number = c.store_number
)
group by calculated distance_bucket
;
quit;
While more compact, it is more difficult to troubleshoot. You lose those in-between steps that can identify if there's an issue with the data. Suppose the store distances look incorrect one day: you'd need to unpack all of those SQL statements, put them into individual PROC SQL blocks and run them. Every time you run into a problem you will need to do this. If you have them separated out, you'll use a negligible amount of temporary disk space and have a much easier time troubleshooting.
When dealing with raw data, especially data that updates regularly, assume something will go wrong one day and you'll need to review it in-depth. Sometimes the wrong file gets sent. Sometimes an upstream issue occurs that sends corrupted data. Any time that happens, you'll need to dig in and find out if it's a problem with your process or their process. Making easy-to-troubleshoot code will speed up the solution for everyone.
Suppose I have the following database:
DATA have;
INPUT id date gain;
CARDS;
1 201405 100
2 201504 20
2 201504 30
2 201505 30
2 201505 50
3 201508 200
3 201509 200
3 201509 300
;
RUN;
I want to create a new table want where the average of the variable gain is grouped by id and by date. The final database should look like this:
DATA want;
INPUT id date average_gain;
CARDS;
1 201405 100
2 201504 25
2 201505 40
3 201508 200
3 201509 250
;
RUN;
I tried to obtain the desired result using the code below but it didn't work:
PROC sql;
CREATE TABLE want as
SELECT *,
mean(gain) as average_gain
FROM have
GROUP BY id, date
ORDER BY id, date
;
QUIT;
It's the asterisk that's causing the issue. It resolves to id, date, gain, which is not what you want. ANSI SQL would not allow this type of functionality, so it's one way in which SAS differs from other SQL implementations.
There should be a note in the log about remerging with the original data, which is essentially what's happening. The summary values are remerged to every line.
To avoid this, list your group by fields in your query and it will work as expected.
PROC sql;
CREATE TABLE want as
SELECT id, date,
mean(gain) as average_gain
FROM have
GROUP BY id, date
ORDER BY id, date
;
QUIT;
I will say that, in general, PROC MEANS is usually a better option because it:
can calculate multiple variables and statistics without listing them all out multiple times
can produce results at multiple levels, for example grand total, id and group level
offers statistics that cannot all be calculated within PROC SQL
supports variable lists, so you can reference long lists of variables with a shortcut
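For comparison, a PROC MEANS version of the same summary might look like this sketch. NWAY keeps only the id-by-date level, and the DROP= option removes the automatic _TYPE_ and _FREQ_ columns.

```sas
proc means data=have noprint nway;
    class id date;
    var gain;
    /* One output row per id/date combination */
    output out=want(drop=_type_ _freq_) mean=average_gain;
run;
```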
I have monthly datasets in a SAS library for customers from Jan 2013 onwards, named CUST_JAN2013, CUST_FEB2013, ..., CUST_OCT2017. Each monthly dataset is large, with about 2 million members per month, and has two columns (customer number and customer monthly expenses).
I have one input dataset, Cust_Expense, with customer number and month as columns. It has only 250,000 members, and I want to pull the expense data for each member from the SPECIFIC monthly SAS dataset by joining on customer number.
Cust_Expense
------------
Customer_Number Month
111 FEB2014
987 APR2017
784 FEB2014
768 APR2017
.....
145 AUG2017
345 AUG2014
I have tried using call execute, but it loops through each of the 250,000 records of the input dataset (Cust_Expense) and joins each one with the corresponding monthly SAS customer table, which takes too much time.
Is there a way to read the input table (Cust_Expense) by month, so that we read all customers for a specific month and then read the matching monthly table ONCE to pull all the records for that month, instead of looping 250,000 times?
Depending on what you want the result to be, you can create one output per month by filtering cust_expense by month and joining with the corresponding monthly dataset:
%macro want;
proc sql noprint;
select distinct month
into :months separated by ' '
from cust_expense
;
quit;
proc sql;
%do i=1 %to %sysfunc(countw(&months));
%let month=%scan(&months,&i,%str( ));
create table want_&month. as
select *
from cust_expense(where=(month="&month.")) t1
inner join cust_&month. t2
on t1.customer_number=t2.customer_number
;
%end;
quit;
%mend;
%want;
Or you could have one output using one join by 'unioning' all those monthly datasets into one and dynamically adding a month column.
%macro want;
proc sql noprint;
select distinct month
into :months separated by ' '
from cust_expense
;
quit;
proc sql;
create table want as
select *
from cust_expense t1
inner join (
%do i=1 %to %sysfunc(countw(&months));
%let month=%scan(&months,&i,%str( ));
%if &i>1 %then union;
select *, "&month." as month
from cust_&month
%end;
) t2
on t1.customer_number=t2.customer_number
and t1.month=t2.month
;
quit;
%mend;
%want;
In either case, I don't really see the point in joining those monthly datasets with the cust_expense dataset. The latter does not seem to hold any information that isn't already present in the monthly datasets.
Your first, best answer is to get rid of these monthly separate tables and make them into one large table with ID and month as key. Then you can simply join on this and go on your way. Having many separate tables like this where a data element determines what table they're in is never a good idea. Then index on month to make it faster.
If you can't do that, then try creating a view that is all of those tables unioned. It may be faster to do that; SAS might decide to materialize the view but maybe not (but if it's extremely slow, then look in your temp table space to see if that's what's happening).
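A minimal sketch of such a view, written here as a DATA step view rather than SQL, could look like this. Only two of the monthly tables are listed, the view name all_months is made up, and the month column is derived from the table name through the INDSNAME= option.

```sas
data all_months / view=all_months;
    length month $7;
    /* source holds the name of the table each row came from, e.g. WORK.CUST_JAN2013 */
    set cust_jan2013 cust_feb2013 indsname=source;
    month = scan(source, -1, '_');
run;
```

The view can then be joined to cust_expense on customer number and month in a single step.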
Third option then is probably to make use of SAS formats. Turn the smaller table into a format, using the CNTLIN option. Then a single large datastep will allow you to perform the join.
data want;
set jan feb mar apr ... ;
where put(id,CUSTEXPF1.) = '1';
run;
That only makes one pass through the 250k table and one pass through the monthly tables, plus the very very fast format lookup which is undoubtedly zero cost in this data step (as the disk i/o will be slower).
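To make the CNTLIN step concrete, here is one way it might be sketched. This variant is an assumption on my part rather than the exact format used above: it builds a character format keyed on customer number and month together, so each customer is only kept from the matching monthly table. The names keys, cntlin and $CUSTEXPF, the '|' separator and the column name customer_number are illustrative.

```sas
/* Keep one row per customer/month pair; duplicate START values make PROC FORMAT fail */
proc sort data=cust_expense out=keys nodupkey;
    by customer_number month;
run;

/* CNTLIN dataset: each wanted pair maps to '1', everything else to '0' */
data cntlin(keep=fmtname type start label hlo);
    set keys end=last;
    length fmtname $8 type $1 start $40 label $1 hlo $1;
    fmtname = 'CUSTEXPF';
    type    = 'C';                 /* character format */
    start   = catx('|', customer_number, month);
    label   = '1';
    output;
    if last then do;
        hlo   = 'O';               /* the OTHER range */
        start = ' ';
        label = '0';
        output;
    end;
run;

proc format cntlin=cntlin;
run;

data want;
    set cust_jan2013 cust_feb2013 /* plus the remaining monthly tables */
        indsname=source;
    length month $7;
    month = scan(source, -1, '_');  /* e.g. JAN2013 from WORK.CUST_JAN2013 */
    if put(catx('|', customer_number, month), $custexpf.) = '1';
run;
```

A subsetting IF is used instead of WHERE because month is created inside the step.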
You could split your data into one dataset per month, as in this example:
data test;
infile datalines dsd;
input ID : $2. MONTH : $3. ;
datalines;
1,JAN
2,JAN
3,JAN
4,FEB
5,FEB
6,MAR
7,MAR
8,MAR
9,MAR
;
run;
data JAN FEB MAR;
set test;
if MONTH = "JAN" then output JAN;
if MONTH = "FEB" then output FEB;
if MONTH = "MAR" then output MAR;
run;
This avoids looping through all 250,000 IDs
and relies on the DATA step OUTPUT statement instead.
In the end you will get 12 datasets, each containing the related IDs.
In your case, where the month looks like FEB2014, you can use the SUBSTR function, and the condition in the DATA step becomes:
...
set test;
...
if SUBSTR(MONTH,1,3)="FEB" then output FEB;
...
Regards
I calculate a ratio for 40 stocks. I need to sort those into three groups high, medium and low based on the value of the ratio. The ratios are fractions of one and there aren't many repetitions. What I need is to create three groups of about 13 stocks each, in group 1 to have the high ratios, in group 2 medium ratios and group 3 low ratios. I have the below code but it just assigns rank 1 to all my stocks.
How can I correct this?
data sourceh.combinedfreq2;
merge sourceh.nonnfreq2 sourceh.nofreq2 sourcet.caps;
by symbol;
ratio=(freqnn/freq);
run;
proc rank data=sourceh.combinedFreq2 out=sourceh.ranked groups=3;
by symbol notsorted;
var ratio;
ranks rank;
run;
If you want to automatically partition into three relatively even groups, you can use PROC RANK (See example using sashelp.stocks):
data have;
set sashelp.stocks;
ratio=high/low;
run;
proc rank data=have out=want groups=3;
by stock notsorted;
var ratio;
ranks rank;
run;
That partitions them into three groups. As long as you have 40 different values (ie, not a lot of repeats of one value), it will make 3 evenly split groups (with ~13 in each).
In your case, do not use by anything - by will create separate sets of ranks (here I'm ranking dates by stock, but you want to rank stocks.)
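Applied to the dataset from the question, that would presumably look like the sketch below. The DESCENDING option is an addition here so that the highest ratios land in the first group. Note that PROC RANK numbers the groups from 0, so the groups come out as 0, 1 and 2 rather than 1, 2 and 3.

```sas
/* No BY statement: all stocks are ranked against each other */
proc rank data=sourceh.combinedFreq2 out=sourceh.ranked groups=3 descending;
    var ratio;
    ranks rank;
run;
```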
I think people are making this more complicated than it needs to be. Let's do this on easy mode.
First, we'll create the dataset and create our ratios.
Second, we'll sort the data by ratio.
Lastly, we'll assign a group based on observation number.
WARNING! UNTESTED CODE!
/*Make the dataset. I stole this from your code above*/
data sourceh.combinedfreq2;
merge sourceh.nonnfreq2 sourceh.nofreq2 sourcet.caps;
by symbol;
ratio=(freqnn/freq);
run;
/*sort the data so that its ordered by ratio*/
PROC SORT DATA=sourceh.combinedfreq2 OUT=sourceh.combinedfreq2 ;
BY DESCENDING ratio ;
RUN ;
/*Assign a value based on observation number*/
Data sourceh.combinedfreq2;
Set sourceh.combinedfreq2;
length Group $6.;
if _N_ <=13 Then Group = "High";
if _N_ > 13 and _N_ <= 26 Then Group = "Medium";
if _N_ > 26 Then Group = "Low";
run;