What is the difference between concatenate, appending and merge on SAS?

What is the difference between concatenate, appending and merge on SAS? - sas

I am trying to run codes on SAS for Concatenate, Appending and Merge and unable to understand the difference between the same. Looking for some one to help me understand the same with examples.

Concatenate and Append are similar, but not used the same way. In SAS, Append is used most commonly to mean concatenation in place. In other words, adding rows to a dataset without reading in the original dataset. This is very efficient, as you skip reading one of the datasets, but it has limitations (largely, you can't interleave or do other data step type things while appending). Append is most often done by PROC APPEND.
Concatenate, on the other hand, while it can mean appending, is usually used when combining the rows from two datasets into a new dataset with all of the rows from each source dataset as separate rows, but not in-place. This would be done with a set statement in a data step, most commonly. This reads both datasets in and writes a new dataset (that could replace one of the original input datasets, or have a new name). Concatenate also is often used to mean combine two string values into a variable; that's actually the most common usage I've heard it in.
Merge is not the same at all, though; it is side-by-side in some fashion, placing the data from one dataset in new variables on the same rows as the data from the other dataset. New rows can be created as part of merge, when one dataset has different key identifier values from the other, but that's usually not the point of the merge (usually!). Merge is done most often in the data step, either with the merge or the update statement.
Concatenate and Merge can also be done in other ways, of course, including SQL.

In a nutshell:
Concatenate: add a dataset on top (or to the bottom) of another one. Look into the SET statement of the DATA Step or the UNION clause of PROC SQL.
Append: Just another word for concatenate. Look into PROC DATASETS / APPEND, but it accomplishes the same task with different means.
Merge: add a dataset to the side (right, generally) of another one. Look into the MERGE statement of the DATA Step and/or the various JOIN's allowed by PROC SQL.
SAS Documentation will show you plenty of examples!

concatenate :it is used to append observations from one data set to another data set,so you specify a list of data set names in the set statement,to concatenate two data set the SAS system must process all observations from both data sets to create a new one.
APPEND:bypasses the observations from original data set and add the new observations directly at the end of the original data set when you have different variables you use the force=option with the append procedure.It function the same way as the append statement in the datasets statement.
and you can only append one data set at a time while in concatenation you can add as many data set in the set statement.
MERGE:you should have a common variable or several variables which taken together uniquely identify each observations,it sequentaly checks observation for each by-value(you have to sort your data sets before you can merge them),then write the combined observation to the new data set

Related

Extract 2 Columns and Attach as Rows in SAS

I have two datasets in SAS. The first looks like this (let's say it is called data 1 (I'm only concerned with two columns of it)
...and the second dataset (let's say it is called data 2) looks like this:
...and I am trying to extract the second column of the first dataset and insert it into the second dataset, to achieve something that looks like this:
Basic Problem Description:
I am trying to extract two columns from a dataset in SAS and add them as rows to a second dataset. The variable names in the first dataset are in a column of their own (entitled 'variable name') and in the second dataset each variable is a column header (a variable in itself) with corresponding data. The images I provided are overly simplistic, as the actual data itself is very long.
Basically, I am trying to find functions in SAS which allow me to do this.
What I have tried
-I have tried to extract the first two columns as a table using proc sql, converted them to a data frame using a data step, sorted them, then used proc transpose to try to convert them from long to wide, then tried to use some sort of append function to tack them on to the second dataset, but append did not work.
-I have tried to merge the two sets, but the merge does not seem to work after using proc transpose.
-I have also tried transposing the second dataset and then merging them, which worked (for some reason) but then I was not able to transpose the data back (so that I can analyze it, which is my purpose in doing all of this).
What functions would I use to go about this process?
Apologies for not providing replicable data, I am more searching for recommendations for functions rather than a detailed hard solution.

To force PROC TRANSPOSE to use a variable as the source for the new variable names use the ID statement. So if you have this first dataset:
data tall;
input fruit $ count ##;
cards;
APPLE 1 PEACH 2 PEAR 2
;
You can use this code to convert it.
proc transpose data=tall out=wide;
id fruit;
var count;
run;
Then if you have another dataset that already has the variables APPLE, PEACH, PEAR etc then just set the two together.
data want;
set wide have ;
run;

using "set" of two datasets, merges or appends the datasets?

I am new to SAS and wanted to ask you for the interpretation of the code for the two datasets that I have (sid fullsid):
data sid; set sid fullsid; run;
Can you please help me to understand what the code of line above does? Does it append the two datasets?

It is spelled out in the documentation.
Combining SAS Data Sets
Use a single SET statement with multiple data sets to concatenate the specified data sets. That is, the number of observations in the new data set is the sum of the number of observations in the original data sets, and the order of the observations is all the observations from the first data set followed by all the observations from the second data set, and so on.

Common Values in Two Tables in SAS (without Proc SQL)

I am learning SAS at the moment and I wanted to know how do you join two tables without using any SQL where I need to get only the common values in two tables.
Both tables have a common unique id. Also the tables don't have common variables.
Please don't give any documentation links as I already have and I know merge. I am trying it with an IN operator.
Table 1 : Screenshot
Table 2 : Screenshot
Description: The first table has 157 records and the other has 161 records.
I tried searching a solution but didn't get any. Please refer a solution.
Thanks !

In DATA Step you will want to use the MERGE statement and the IN= option which sets up flags indicating 'contribution' to the current state of the program data vector (PDV)
data want;
merge
have1 (in=_from1)
have2 (in=_from2)
;
by uniqueid; * variable of same name, type and length should be in have1 and have2;
if _from1 and _from2; * subsetting if;
run;
DATA Step is an implicit loop. The MERGE automatically advances reads through the contributing data, synchronizing about the BY variables.
When a DATA Step has no explicit OUTPUT statement, there will be an implicit OUPUT of the values in the PDV when control reaches the bottom of the step. Thus, the if without a then is called subsetting because control only goes past the if (and reaches the bottom for implicit output) when both flags are true (or when data is coming from both tables at a common key value)

SAS NOTSORTED Equivalent

I was using the following code to analyze data:
set taq.cq_&yyyymmdd:;
by symbol date time NOTSORTED ex;
There are are thousands of datasets I am running the code on in the unit of days. When &yyyymmdd only specifies one dataset (for one day. for example, 20130102), it works. However, when I try to run it for multiple datasets (for example, 201301:), SAS returns the following errors:
BY NOTSORTED/NOBYSORTED cannot be used with SET statement when
more than one data set is specified.
If I cannot use NOTSORTED here, what is an equivalent statement that I could use?
My understanding of the keyword NOTSORTED is that you use it when the data is not sorted yet. Therefore, do I need to sort it first? How to do it?
I am also confused by the number of variables that NOTSORTED is referencing. Does it only have an effect on "time", or it has effect on "symbol, data, time"?
Many thanks!
UPDATE#2:
The rest of the process immediately following the set statement is: (pseudo code as i don't have the permission to post the original code)
Data _quotes;
SET STATEMENT HERE
Change the name of a variable in the dataset (Variable name is EXN).
last.EXN in a if statement. If the condition is satisfied, label EXN.
Drop some variables.
Run;
DATA NEWDATASET (sortedby= SYMBOL DATE TIME index=(SYMBOL)
label="WRDS-TAQ NBBO Data");
SET _quotes;
by symbol date time;
....
Run;

NOTSORTED means that SAS can assume the sort order in the data is correct, so it may not have explicitly gone through a PROC SORT but it is in logical order as listed in the BY statement.
All variables in the BY statement are included in the NOTSORTED option. Given that I suspect you fully don't understand BY group processing.
It's usually a bit dangerous to use, especially if you don't understand BY group processing. If your data is in the same group but not adjacent it won't work properly and will not produce an error. The correct workaround depends on your processes to be honest.
I would suggest reviewing the documentation regarding BY group processing. It's quite in depth and has lots of samples to illustrate the different type of calculations.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n138da4gme3zb7n1nifpfhqv7clq.htm
NOTSORTED is often used in example posts to either avoid a sort or when using a custom sort that's difficult to implement in other ways. Explicitly sorting will remove this issue but you may also be misunderstanding how SAS processes data when you have a SET statement with a BY statement. I believe this is called interleaving.
http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n1tgk0uanvisvon1r26lc036k0w7.htm

I suspect that the NOTSORTED keyword is being using to find groups for observations with the same value for the EX variable within the same symbol,date,time. If you only need to find the FIRST then you can use the LAG() function to calculate the FIRST.EX flag.
data want;
set taq.cq_&yyyymmdd:;
by symbol date time;
first_ex = first.time or ex ne lag(ex);
Otherwise then perhaps you want to convert the process to data step views and then set the views together.
data work.view_cq_20130102 / view=work.view_cq_20130102;
set taq.cq_20130102;
by symbol date time ex NOTSORTED;
...
run;
...
data want ;
set work.view_cq_201301: ;
by symbol date time;
...

creating two data sets in one data step in sas

This should be very simple, but somehow I confuse myself.
data in_both
missing_name (drop = name);
merge employee (in=in_employee)
hours (in = in_hours);
by ID;
if in_employee and in_hours then output in_both;
else if in_employee and not in_hours then output missing_name;
run;
I have two questions:
(1): For the first statement "missing_name(drop = name)", I understand that, it means keep all the data except the column whose head is name. But keep which data here? What is the input?
(2): I know we can create two datasets within one data step, but that means we should use "data in_both missing_name", instead of "data in_both", right?
Many thanks for your time and attention. I appreciate your help.

(1) The DROP= option refers to dropping variables from the dataset MISSING_NAME. With no drop= or keep= option, all variables that exist in EMPLOYEE or HOURS would be written to MISSING_NAME. You can run PROC CONTENTS on the four datasets to see which variables are included in each.
(2) As written, your code will output two datasets IN_BOTH and MISSING_NAME. As #Tom just commented, your current DATA statement already lists both datasets, because the semicolon ends the statement, not the white space/carriage return.

The DATA statement is determining which datasets will be created by the data step. The dataset options, like the DROP= option in your example, can we used to control which of the variables are written into those datasets.
It is the OUTPUT statement that is deciding which observations will be written. So in your example your IF/THEN/ELSE logic is determining which output statements to execute.

Using your posted code:
data in_both
missing_name (drop = name);
merge employee (in=in_employee)
hours (in = in_hours);
by ID;
run;
Inputs - merge_employee & hours
Outputs - in_both & missing_name
In this example the output missing_name has the column NAME dropped.
The best way to view what's going on if the line breaks are confusing is to look for the semi-colon. At first glance I got a little confused too!

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js