I have a dataset with >7K observations. I have a column RFLP that can have a numerous amount of different responses. I want to get a table that tells me how many occurrences of each unique response are in the column. Since there are so many responses, I am not able to specify them in the statement.
Thanks!
Denise
I would use proc sql. This is an aggregation query:
proc sql;
select rflp, count(*)
from <dataset>
group by rflp;
Related
I have two datasets. The first, big_dataset, has around 3000 columns, most of which are never used. The second, column_list, contains a single column called column_name with around 100 values. Each value is the name of a column I want to keep.
I want to filter big_dataset so that only columns in column_list are kept, and the rest are discarded.
If I were using Pandas dataframes in Python, this would be a trivial task:
cols = column_list['column_name'].tolist()
smaller_dataset = big_dataset[cols]
However, I can't figure out the SAS equivalent. Proc Transpose doesn't let me turn the rows into headers. I can't figure out a statement in the data step that would let this work, and as far as I'm aware this isn't something that Proc SQL could handle. I've read through the docs on Proc Datasets and that doesn't seem to have what I need either.
To obtain a list of columns from column_list to use against big_dataset, you can query the column_list table and put the result into a macro variable. This can be achieved with PROC SQL and the SEPARATED BY clause:
proc sql noprint;
select column_name
into :cols separated by ','
from column_list;
create table SMALLER_DATASET AS
select &cols.
from WORK.BIG_DATASET;
quit;
Alternatively you may use SEPARATED BY ' ' and then use the resulting list in a KEEP statement or dataset option:
proc sql noprint;
select column_name
into :cols separated by ' '
from column_list;
quit;
data small_dataset;
set big_dataset (keep=&cols.);
/* or keep=&cols.; */
run;
Thanks for the feedback guys, but I have to rewrite the question to make it more clear.
Say, we have a table:
Table
What I am trying to get from this table is a list of numbers which have matching FP_NDT dates to my condition, for example, I want to get a list of numbers, which only have FP_NDT not null for 2014 and 2015 and missing values for 2011, 2012 and 2013 (irrelevant of the months). So with this condition I should get only Number 4. Is it possible to do it from this table ?
PS: If I write a simple sql select statement and put a condition like
where year(FP_NDT) in (2014,2015)
it would also give me numbers 2 and 3...
Why not first summarize the data?
proc sql;
create table XX as
select number
, max(year(fp_ndt)=2011) as yr2011
, max(year(fp_ndt)=2012) as yr2012
, max(year(fp_ndt)=2013) as yr2013
, max(year(fp_ndt)=2014) as yr2014
, max(year(fp_ndt)=2015) as yr2015
from table1
group by number
;
Now it is easy to make your tests.
select * from XX
where yr2014+yr2015=2 and yr2011+yr2012+yr2013=0
;
You could use the first query as a sub-query instead of creating a physical table.
So you want names associate with both 1 and 2 and 3, but in different rows.
You can group rows by names and count the associated numbers as this:
PROC SQL;
CREATE TABLE xxx AS SELECT
name,
SUM(number=1) AS count1,
SUM(number=2) AS count2,
SUM(number=3) AS count3
FROM test GROUP BY name;
QUIT;
Then you can filter the results based on count1-count3, i.e. (count1>0 AND count2>0 AND count3>0).
Try this:
proc sql;
select *
from work.test
group by name having nmiss(number)=0;
quit;
I have found one work around which is to actually create separate data sets for each year and then inner join them with a where condition for missing and not null for needed years. However, it becomes a bit cumbersome when it comes to 60 months, for instance...
I have a WORK dataset with more than 30 columns but only 2 columns out of them are date fields. (Start date and End date). I want the date format in the permanent dataset to be in date. and not in yymmdd10. which is the current format in work dataset. When I used the below code, the two date fields are taking first two positions. I dont want to reorder the positions and at the same time dont want to mention the format with all 30+ columns. Could someone please help me if there is any way for this?
data DLR.DEALER;
set work.dealer_invoices; * this dataset contains more than 30 columns;
format start_dt end_dt date.;
run;
I could not find any solution for this on our site. Any help is highly appreciated than just asking me to mention all the columns in the format statement :) Thanks in advance.
Certainly the format statement shouldn't have any impact on ordering given its location.
A workaround would be to use PROC DATASETS to change the format instead of in the data step.
You also could "mention all columns" fairly easily.
proc sql;
select name into :namelist separated by ' '
from dictionary.columns
where libname='WORK' and memname='DEALER_INVOICES'
order by varnum;
quit;
then
data DLR.DEALER;
retain &namelist;
set work.dealer_invoices;
format...;
run;
Suppose I'm subsetting a table and summarizing it in proc sql. The code uses a where ... in clause and a subquery to do the subsetting. I know that some SQL engines would set some limit on the number of arguments to the where ... in clause. Does SAS has have limit on this? This question would apply to a program like this:
proc sql;
create table want as
select
ID,
sum(var1) as var1,
sum(var2) as var2,
sum(var3) as var3
from largetable
where ID in (select ID from longlist)
group by ID;
quit;
What if longlist returns 10,000 IDs? How about 10,000,000?
I'm not aware of any explicit limit on this. SAS's SQL parser seems to convert these often to JOINs, when they're not explicitly coded in the table; that means there are some limitations, but not particularly small ones.
I do believe there is a limit to the length of a SQL statement in total, so if you were trying to include an extremely long list in text you might run into problems, but in the example above I don't see a problem with 10,000,000 IDs. I just tested it with 250,000,000 IDs in the longlist table, and SAS had no problem with it:
data largetable;
do id=1 to 1e8;
if mod(id,7)=0 then output;
end;
run;
data ids;
do id = 1 to 1e9;
if mod(id,4)=0 then output;
end;
run;
proc sql _method;
create table want as
select
ID
from largetable
where ID in (select ID from IDs)
group by ID;
quit;
Interestingly, adding _method indicates it does not do this as a join, but as a subquery. I'm not sure why, at least in this case; everything I've been told says that it should convert this to a join implicitly.
As Joe has said, there should probably be no problems with any reasonable number of rows in the longlist table. However, although this may be readable, a join may perform better.
Do you have a strong preference for running the query as written rather than doing a left join, e.g.
proc sql;
create table want as
select
b.ID,
sum(b.var1) as var1,
sum(b.var2) as var2,
sum(b.var3) as var3
from longlist a left join largetable b
on a.ID = b.ID
group by b.ID;
quit;
Elaborating a bit on entering a long list as text - I'm not aware of any limit on the length of any one statement in SAS, but there are various limits on the length of individual lines of code, depending on your version and how you're submitting it. I suspect it's possible to split a long statement over several lines each approaching the maximum allowed length.
I have 2 tables, lets name them table1 and table2. Both of them have credit_id, loan_id and Date field. For some reason credit_id field needs to be updated with corresponding values from table2, linking data by Date and loan_id fields. To do so, I made a query like:
proc sql;
UPDATE a
SET a.credit_id = b.credit_id
FROM table1 a, table2 b
WHERE (a.Date = b.Date) AND (a.loan_id = b.loan_id);
quit;
According to googling, this query should work in many sql environments, but it seems that SAS is an exception here, because it seems that from part is ignored.
How to update needed field then?
I can't comment on the SQL, but you can do the same thing using a data step:
data table1;
update table1 table2(keep = date loan_id credit_id);
by date loan_id;
run;
This requires that:
No two rows in the same table have the same date and loan_id, and
Both tables are sorted/indexed by date and loan id
You need the keep on the transaction dataset in order to prevent it from updating/creating any other variables on the master dataset. There are also several other ways you could do this, e.g. using the modify or merge statements.