I'm very new to SAS. I spent some time looking for a solution to my problem but couldn't find any, so I'd very much appreciate your help. The problem is actually quite simple.
I have two datasets of different lengths (let's call them dat1 and dat2). Both have a variable X that I'm interested in. I'm looking for a way to find the common values of these two columns (let's call them dat1_X and dat2_X). The datasets are quite big, though: approximately 10 million observations.
There are many ways!
If one dataset was small, and efficiency was an issue, you could consider using hash tables (or formats) to perform a lookup whilst passing through the larger one.
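As a rough illustration of the hash approach (assuming dat2 is the smaller dataset and both contain a variable named x):

```sas
/* Minimal hash-lookup sketch: load dat2's x values into a hash
   object once, then keep each row of dat1 whose x is found in it. */
data common;
    if _n_ = 1 then do;
        declare hash h(dataset: "dat2");
        h.defineKey("x");
        h.defineDone();
    end;
    set dat1;
    if h.find() = 0;   /* find() returns 0 when x exists in the hash */
run;
```

Note this keeps duplicate x values from dat1; follow it with a PROC SORT NODUPKEY if you only want the distinct common values.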
Otherwise, the following SQL approaches will do the trick (try testing to see which is most efficient):
/* IN-subquery (generally slow) */
proc sql;
create table want1 as
select distinct x
from dat1_x
where x in (select x from dat2_x);
quit;
/* inner join */
proc sql;
create table want2 as
select distinct a.x
from dat1_x a, dat2_x b
where a.x=b.x;
quit;
/* intersect */
proc sql;
create table want3 as
select x
from dat1_x
intersect
select x
from dat2_x;
quit;
I am working in SAS Enterprise guide and have a one column SAS table that contains unique identifiers (id_list).
I want to filter another SAS table to contain only observations that can be found in id_list.
My code so far is:
proc sql noprint;
CREATE TABLE test AS
SELECT *
FROM data_sample
WHERE id IN id_list
quit;
This code gives me the following errors:
ERROR 22-322: Syntax error, expecting one of the following: (, SELECT.
What am I doing wrong?
Thanks up front for the help.
You can't just give IN the table name. You need to write a subquery that says which variable you want it to read from ID_LIST.
CREATE TABLE test AS
SELECT *
FROM data_sample
WHERE id IN (select id from id_list)
;
You could use a join in PROC SQL, but it's probably simpler to use a merge in a data step with the in= dataset option.
data want;
merge oneColData(in = A) otherData(in = B);
by id;
if A;
run;
You merge the two datasets together, and then with if A; you keep only the IDs that appear in the single-column dataset. For this to work you have to merge on id, which must be in both datasets, and both datasets must be sorted by id.
The problem with using a data step instead of PROC SQL is that, for the merge, the datasets must be sorted on the merge variable. If that is not yet the case, the complete dataset must be sorted first.
If I have a very large dataset that is not sorted on the merge variable, I have to sort it first (which can take quite some time). If I use the subquery in PROC SQL, I can read the dataset selectively, so no sort is needed.
My bet is that PROC SQL is much faster for large datasets from which you want only a small subset.
My first SAS data set, ds1 contains dates, firms and share prices. My second data set, ds2 contains a subset of the firms in ds1. I'd like to create ds3 which contains all of the observations in ds1 provided a firm in ds1 is also in ds2. I try doing this as follows:
DATA ds3;
set ds1;
IF firm IN (d2);
run;
The above does not work as planned: ds3 ends up containing no observations. I believe the problem is the IF ... IN statement. I could manually type all the firms in the parentheses instead of putting d2 there, but that would be very inefficient for me.
You have several options here; the correct one depends largely on your particular needs.
What you're effectively doing is joining two tables together. So, a MERGE or a SQL JOIN would be a simple solution.
data ds3;
merge ds1(in=_ds1) ds2(in=_ds2 keep=firm);
by firm;
if _ds1 and _ds2;
run;
That joins ds1 and ds2, keeping only the firm variable from ds2 and only the firms that are in both. Both ds1 and ds2 need to be sorted by firm, and ds2 should have only unique values of firm - no duplicates.
SQL is also pretty easy.
proc sql;
create table ds3 as
select * from ds1 where exists (
select 1 from ds2
where ds1.firm=ds2.firm
);
quit;
That's a bit closer to your terminology (unsurprising as SQL attempts to be close to natural language for many simple queries). That doesn't require either sorting or uniqueness, although this won't be particularly fast.
You could also store the DS2 firms in a format, or use a hash table to allow you to open DS2 alongside DS1. You could also use a keyed set statement. All of these are slightly more complicated to implement, but are generally faster as they don't require sorting and don't repeatedly reference the same data as the SQL does.
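As a rough sketch of the format approach (variable and format names are assumed, and ds2 is assumed to have no duplicate firms, since CNTLIN= rejects duplicate start values):

```sas
/* Build a format from ds2's firms, then filter ds1 with PUT(). */
data firmfmt;
    set ds2(keep=firm);
    retain fmtname '$firmf' label 'Y';   /* every firm maps to 'Y' */
    rename firm = start;
run;

proc format cntlin=firmfmt;
run;

data ds3;
    set ds1;
    if put(firm, $firmf.) = 'Y';
run;
```

One caveat: with no OTHER range defined, PUT returns the original value for unmatched firms, so the test only misfires if a firm is literally named "Y".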
This might be a weird question. I have a data set that contains values like agree, neutral, disagree... for many questions. There are not many observations, so for some questions one or more options has a frequency of 0 - say, neutral. When I run PROC FREQ, since neutral never shows up in that variable, the table does not contain a row for neutral, and I end up with tables with different numbers of rows. I would like to know if there is an option to show these 0-frequency rows. I will also need to run PROC GCHART for the same data set, and I will run into the same problem of having different numbers of bars. Please help me with this. Thank you!
This depends on how exactly you are running your PROC FREQ. The TABLES statement has the SPARSE option, which tells it to create a value for every logical cell of the table when creating an output dataset. Normally, while a crosstab would show a cell with a missing (or zero) value, when it is output to a dataset (which is vertical, i.e. each combination of row and column values is placed on one row) those rows are left off. SPARSE makes sure that doesn't happen; in a larger (n-dimensional) crosstab, it creates rows for every possible combination of every variable, even ones that don't occur in the data.
However, if you're just doing
proc freq data=mydata;
tables myvar;
run;
That won't help you, as SAS doesn't really have anything to go on to figure out what should be there.
For that, you have to use a procedure that supports class variables. PROC TABULATE is one such procedure, and is (sort of) similar to PROC FREQ in its syntax. You need to either use CLASSDATA= on the PROC statement, or PRINTMISS on the TABLE statement. In the former case, I don't believe you need a format. In the latter case (PRINTMISS), you need to create a format for your variable (if you don't already have one) that contains all levels of the data that you want to display (even if it's just an identity format, e.g. formatting character strings to identical character strings), and specify PRELOADFMT on the CLASS statement. See the PROC TABULATE documentation for more details.
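For instance, a PRINTMISS/PRELOADFMT sketch might look like this (the variable and format names are made up):

```sas
/* Identity format listing every level we want displayed,
   including levels with zero observations. */
proc format;
    value $agreef
        'Agree'    = 'Agree'
        'Neutral'  = 'Neutral'
        'Disagree' = 'Disagree';
run;

proc tabulate data=mydata;
    class q1 / preloadfmt;     /* PRELOADFMT goes on the CLASS statement */
    format q1 $agreef.;
    table q1, n / printmiss;   /* Neutral appears even with a count of 0 */
run;
```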
This is a follow-up question to this: Cleaner way of handling addition of summarizing rows to table?
However, this time we've got something slightly different I'm afraid. We have: a dataset with four independent variables, and two dependent variables. We want:
a) a set distinct by three independent variables, with a count of distinct variable #4 and sums of variables #5 & #6. This is easy.
b) TOTAL entries created for each combination of the three independent variables, with a count of distinct variable #4 and sums of variables #5 and #6. This is not easy (to me).
So the idea would be to modify this:
proc means data=have;
class ind1 ind2 ind3;
var dependent5 dependent6;
output out=want sum=;
run;
Such that it additionally could count the number of distinct values of variable #4, for each combination of variables #1,2, and 3, including ALL.
Ideas I've had:
1) Abandon hope all ye who enter here; go back to trying to do this in PROC SQL with a bunch of macro code, which lets you use the handy COUNT(DISTINCT ...).
2) Something with nlevels?
3) Using convoluted, terrible macro code to generate however many hashes would be necessary to handle the uniques.
4) ??
Creating sample data here is kind of tricky; let me know if this makes no sense and I'll do my best to come up with some.
--
Edit: to be clear, the reason SQL queries require macro code (and would be slow) is because it would need to be something like the following (but expanded to many more levels)
select
"ALL" as ind1,
ind2
...
group by ind1, ind2;
select
ind1,
"ALL" as ind2
...
group by ind1, ind2;
select
"ALL" as ind1,
"ALL" as ind2
...
group by ind1, ind2;
This will get very unwieldy as we add more and more independent variables.
Did you try this:
proc sql;
create table want as
select ind1, ind2, ind3,
       count(distinct var4) as var_4,
       sum(var5) as var_5,
       sum(var6) as var_6
from have
group by ind1, ind2, ind3
order by ind1, ind2, ind3;
quit;
I don't think there's a clean way to use PROC MEANS to do this directly. What I might do is a two step process. First PROC MEANS to generate the total rows and the two sums, and then join that back to the original table in PROC SQL to get the distinct - when you do the join, it will neatly give you the right ones including the total row.
You might be able to use PROC REPORT to do this as well, but I'm not sure it would be easier than the two step solution.
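A sketch of that two-step idea (all names are assumed; _TYPE_ distinguishes the class-variable combinations PROC MEANS produces, with missing class values standing in for ALL):

```sas
/* Step 1: sums for every combination of the class variables,
   including the ALL (total) rows. */
proc means data=have noprint;
    class ind1 ind2 ind3;
    var dep5 dep6;
    output out=sums sum=;
run;

/* Step 2: join each summary row back to the matching detail rows
   to pick up the distinct count of var4.  A missing class value in
   the summary matches every detail row (the ALL case). */
proc sql;
    create table want as
    select s.ind1, s.ind2, s.ind3, s._type_,
           s.dep5, s.dep6,
           count(distinct h.var4) as n_var4
    from sums as s
    left join have as h
      on (missing(s.ind1) or s.ind1 = h.ind1)
     and (missing(s.ind2) or s.ind2 = h.ind2)
     and (missing(s.ind3) or s.ind3 = h.ind3)
    group by s.ind1, s.ind2, s.ind3, s._type_, s.dep5, s.dep6;
quit;
```

One caveat: if the real data has genuinely missing values in ind1-ind3, this conflates them with the ALL rows, so you'd need MISSING handling on the PROC MEANS side first.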
I have tried the following options with unacceptable response times - creating an index on 'key' did not help either (note: there are duplicate 'keys' in both datasets):
data a;
merge b(in=inb)
      c;
by key;
if inb;
run;
=== OR ===
proc sql;
create table a as
select *
from b
left outer join c
on b.key = c.key;
quit;
You should first sort the two datasets before merging them; that is what will give the performance. Using an index when you have to scan the whole table to get a result is usually slower than presorting the datasets and merging them.
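Concretely (key and dataset names taken from the question), that would just be:

```sas
/* Presort both inputs on the merge key, then do the data-step merge. */
proc sort data=b; by key; run;
proc sort data=c; by key; run;

data a;
    merge b(in=inb) c;
    by key;
    if inb;   /* keep only rows present in b */
run;
```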
Be sure to trim your datasets as much as possible, and sort them before the data step or PROC SQL. Also, I'm not 100% sure if it matters, but ANSI SQL would be proc sql; create table a as select * from b left outer join c on b.key = c.key; quit;
Creating a key is the fastest way to get the datasets ready for joining. Sorting them can take just as long as merging them, if not longer, but is still a good idea.
AFHood's suggestion of trimming them is a good one. Can you possibly run them through a PROC SUMMARY? Are there any columns you can drop? Any of these will reduce the size of your datasets and make the merging faster.
None of these methods may work however. I routinely merge files of several million rows and it can take a while.
You might try the SQL merge. I don't know if it would be faster for your needs but I've found SQL to be much more efficient than a regular SAS merge. Plus, once you realize what you can do with SQL, manipulating datasets becomes easier!
Don't use a data step merge to do this.
With duplicate keys in both datasets the result will be wrong.
The only way to do this is with a
proc sql;
create table newdata as
select firsttable.*, secondtable.*
from table1 as firsttable
inner join table2 as secondtable
on (firsttable.keyfield = secondtable.keyfield);
quit;
If you have more than one key field, the join order should run from the field with the fewest matches first to the one with the most matches last. SAS has a bad habit of creating a temporary dataset containing all possible matches and then sieving it down from there, which can blow out your temporary space allocation and slow everything down.
If you still wish to use a DATA step, then you need to get rid of the duplicate keys in one of the datasets.