What is the difference in using the ID statement Vs the BY Statement in proc compare.
I understand the ID statement -- that when added observations are compared according to ID..
but what exactly BY statement does..
I did read the SAS documentation and searched net I couldt understand , can anybody elaborate on it .
The way I understand it, the "by" statement makes proc compare do a separate comparison for each by group in the comparison data sets. It's basically like running a separate "proc compare" for each "by" group.
The "id" statement on the other hand correlates records by key between the two data sets to be compared and reports on the number of common elements and how many are in one data sets but not in the other. You would use this if your data sets have a common primary key i.e. a combination of variables that uniquely identify each record, and you want "prooc compare" to take each matching pair and compare them.
Related
I was using the following code to analyze data:
set taq.cq_&yyyymmdd:;
by symbol date time NOTSORTED ex;
There are are thousands of datasets I am running the code on in the unit of days. When &yyyymmdd only specifies one dataset (for one day. for example, 20130102), it works. However, when I try to run it for multiple datasets (for example, 201301:), SAS returns the following errors:
BY NOTSORTED/NOBYSORTED cannot be used with SET statement when
more than one data set is specified.
If I cannot use NOTSORTED here, what is an equivalent statement that I could use?
My understanding of the keyword NOTSORTED is that you use it when the data is not sorted yet. Therefore, do I need to sort it first? How to do it?
I am also confused by the number of variables that NOTSORTED is referencing. Does it only have an effect on "time", or it has effect on "symbol, data, time"?
Many thanks!
UPDATE#2:
The rest of the process immediately following the set statement is: (pseudo code as i don't have the permission to post the original code)
Data _quotes;
SET STATEMENT HERE
Change the name of a variable in the dataset (Variable name is EXN).
last.EXN in a if statement. If the condition is satisfied, label EXN.
Drop some variables.
Run;
DATA NEWDATASET (sortedby= SYMBOL DATE TIME index=(SYMBOL)
label="WRDS-TAQ NBBO Data");
SET _quotes;
by symbol date time;
....
Run;
NOTSORTED means that SAS can assume the sort order in the data is correct, so it may not have explicitly gone through a PROC SORT but it is in logical order as listed in the BY statement.
All variables in the BY statement are included in the NOTSORTED option. Given that I suspect you fully don't understand BY group processing.
It's usually a bit dangerous to use, especially if you don't understand BY group processing. If your data is in the same group but not adjacent it won't work properly and will not produce an error. The correct workaround depends on your processes to be honest.
I would suggest reviewing the documentation regarding BY group processing. It's quite in depth and has lots of samples to illustrate the different type of calculations.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n138da4gme3zb7n1nifpfhqv7clq.htm
NOTSORTED is often used in example posts to either avoid a sort or when using a custom sort that's difficult to implement in other ways. Explicitly sorting will remove this issue but you may also be misunderstanding how SAS processes data when you have a SET statement with a BY statement. I believe this is called interleaving.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n1tgk0uanvisvon1r26lc036k0w7.htm
I suspect that the NOTSORTED keyword is being using to find groups for observations with the same value for the EX variable within the same symbol,date,time. If you only need to find the FIRST then you can use the LAG() function to calculate the FIRST.EX flag.
data want;
set taq.cq_&yyyymmdd:;
by symbol date time;
first_ex = first.time or ex ne lag(ex);
Otherwise then perhaps you want to convert the process to data step views and then set the views together.
data work.view_cq_20130102 / view=work.view_cq_20130102;
set taq.cq_20130102;
by symbol date time ex NOTSORTED;
...
run;
...
data want ;
set work.view_cq_201301: ;
by symbol date time;
...
i need to use a append object after a series of join that have a conditional run... So the join step may be not execute if the condition is not verified and his work physical dataset will not be created.
The problem is that the append step take an error if one o more input physical dataset are not created.
Is there a smart way to create a physical empty table from a metadata structure of the works table of the joins or to use the append with some non-created datasets?
The create table with the list of all field is not a real solution because i've to replicate it per 8 different joins and then replicate the job 10 times...
Thanks to all
Roberto
Thank you for your comments.
What you should do:
Amend your conditional node so that it would on positive condition to create a global macro variable with value of MAX. On negative condition to create the same variable with value of 0.
Replace offending SQL step with "CREATE TABLE" node
In the options for "CREATE TABLE", specify macro variable for "MAXIMUM OUTPUT ROWS (OUTOBS)". See the picture below for example of those options.
So now when your condition is not met, you will always end up with an empty table. When condition is met, the step executes normally.
I must say my version of DI Studio is a bit old. In my version SQL node doens't allow passing macro variables to SQL options, only integers can be typed in. Check if your version allows it because if it does, then you can amend existing SQL step and avoid replacing it with another node.
One more thing, you will get a warning when OUTOBS options is less then the resulting would be dataset.
Let me know if you have any questions.
See the picture for create table options:
At the end i've created another step that extract 0 row from the source table by the condition 1=0 in the where tab. In this way i have a empty table that i can use with a data/set in the post sql of the conditional run if the work table of the join does not exist.
This is not a solution but a valid work around.
I am trying to run codes on SAS for Concatenate, Appending and Merge and unable to understand the difference between the same. Looking for some one to help me understand the same with examples.
Concatenate and Append are similar, but not used the same way. In SAS, Append is used most commonly to mean concatenation in place. In other words, adding rows to a dataset without reading in the original dataset. This is very efficient, as you skip reading one of the datasets, but it has limitations (largely, you can't interleave or do other data step type things while appending). Append is most often done by PROC APPEND.
Concatenate, on the other hand, while it can mean appending, is usually used when combining the rows from two datasets into a new dataset with all of the rows from each source dataset as separate rows, but not in-place. This would be done with a set statement in a data step, most commonly. This reads both datasets in and writes a new dataset (that could replace one of the original input datasets, or have a new name). Concatenate also is often used to mean combine two string values into a variable; that's actually the most common usage I've heard it in.
Merge is not the same at all, though; it is side-by-side in some fashion, placing the data from one dataset in new variables on the same rows as the data from the other dataset. New rows can be created as part of merge, when one dataset has different key identifier values from the other, but that's usually not the point of the merge (usually!). Merge is done most often in the data step, either with the merge or the update statement.
Concatenate and Merge can also be done in other ways, of course, including SQL.
In a nutshell:
Concatenate: add a dataset on top (or to the bottom) of another one. Look into the SET statement of the DATA Step or the UNION clause of PROC SQL.
Append: Just another word for concatenate. Look into PROC DATASETS / APPEND, but it accomplishes the same task with different means.
Merge: add a dataset to the side (right, generally) of another one. Look into the MERGE statement of the DATA Step and/or the various JOIN's allowed by PROC SQL.
SAS Documentation will show you plenty of examples!
concatenate :it is used to append observations from one data set to another data set,so you specify a list of data set names in the set statement,to concatenate two data set the SAS system must process all observations from both data sets to create a new one.
APPEND:bypasses the observations from original data set and add the new observations directly at the end of the original data set when you have different variables you use the force=option with the append procedure.It function the same way as the append statement in the datasets statement.
and you can only append one data set at a time while in concatenation you can add as many data set in the set statement.
MERGE:you should have a common variable or several variables which taken together uniquely identify each observations,it sequentaly checks observation for each by-value(you have to sort your data sets before you can merge them),then write the combined observation to the new data set
I have been working on a two step process in SAS EG that creates a temporary table (connected to Netezza) then uses that table to build the final summary table. When creating the final table, I am trying to have two calculated columns that represent averages of all sales quotes and only customers that accepted the quote (i.e. the expectation is that the customers who accept have a lower average quote than all inquiries combined). In order to segment customers that accepted the quoted offer versus those that did not, I have attempted to use Sum(Case When...) and Sum with a Boolean operator.
I have the following code with the final three sum statements trying to create the same column in different ways (just looking for one that works). The first two attempts return an error saying that it was unable to identify a function that satisfies the given argument types. The final attempt (which is where I began things) does not recognize the syntax, which I feel is correct. The error occurs after the closed parentheses around "END." Any help would be greatly appreciated:
EXECUTE (
CREATE TABLE SAMPLE1 AS
SELECT DATE
,STATE
,SUM(INQRY_CN) AS INQ_COUNT
,(SUM(BOUND_IN)/INQ_COUNT) AS CLOSURE
,SUM(QUOTE_AMOUNT) AS AVG_QTD_AM
**,SUM(QUOTE_AMOUNT*(DAY_OF_QUOTE <> '01JAN1901')) AS AVG_BND_AM
,SUM(QUOTE_AMOUNT*(BOUND_IN=1)) AS AVG_BND_AM
,SUM(CASE WHEN BOUND_IN=1 THEN QUOTE_AMOUNT
ELSE . END) AS AVG_BND_AM**
FROM TEMP_TABLE
GROUP BY 1,2
ORDER BY 1,2
) BY NETEZZA;
DISCONNECT FROM NETEZZA;
QUIT;
I am attempting to set up SAS to do something I am able to easily do in Excel, but am unable to find a way to do effectively. Given the first two tables shown here (dubbed TREE and LEVEL, respectively), I am trying to end up with the third table (FINAL_TREE).
Adding in the Level column to TREE, so that it becomes FINAL_TREE works as follows: any given tree must have a number Apple which is greater than or equal to Apple_Req for a given Level, as well as Orange greater than or equal to Orange_Req. So a Tree is given a Level to which it meets all given requirements.
So in the example tables, Tree3 is given Level1, despite the fact that it would easily be Level3 if not for its low Orange count.
In Excel, this can be done using INDEX and finding the MIN of two MATCH functions, but I don't think that can be directly translated into SAS. I imagine there is a way to set this up using explicilty defined nested IF statements, but I am hoping there is a solution which can handle a LEVEL table with any number of levels (so long as the requirements are set up correctly).
In fact, this is quite a bit easier in SAS - in part because there are a lot of different ways to do this.
The most straightforward is probably using SQL, if you're familiar with it. The most similar to what you're doing in Excel, though, is Format, and perhaps the fastest as well.
proc format;
value appleF
1-<4 = '1'
5-<15 = '2'
15-high='3'
other='0';
value orangeF
5-<15 = '1'
16-<30 = '2'
30-high= '3'
other='0';
quit;
Now, you can convert the values using put and then use min just like you would in Excel. Basically this replaces your index.
data want;
set have;
level = min(put(apple,applef1.),put(orange,orangef1.));
run;
You can also produce a format from a dataset directly - see this paper for example for using CNTLIN option on PROC FORMAT.