Is there a way to select rows in SAS that contain strings based on values from another dataset? - sas

So I have two datasets. Dataset1 has a column "tags" that contains numbers separated by +. It looks like this:
id complete tags
1 true 11+27+5672+26+108+14
2 false 27+5672+26+2675+25+16083945+18567+1022+58157+23+16+11
3 true 13+27+5672+2675+26+238895+23+16
4 true 10+27+5672+163333+26+25+43869595+234149+18567+19390192+13675649+23+16
5 true 10+27+5672+26+2675+25+481056+106602+54874998+54875001+110+23+16
.
.
My second dataset has a list of "tags" like so:
id name count
+968+ TagName968 39421
+2675+ TagName2675 7450
.
.
What I would like to do is to go through and select all of the rows from Dataset1 where any of the tags in the tags column match any of the id rows in Dataset2. So basically any rows in Dataset1 that contain "+968+" or "+2675+" or any of the other ~3000 ids (so I can't hardcode them all in). So if I ran it based on the example above, rows 2, 3, and 5 would all be selected because they contain "+2675+".
I have tried different versions of DATA steps and PROC SQL, but I'm not sure I'm doing them right. The problem with PROC SQL is that I'm not really looking for a join, I don't think? Or at least it would have to be one-to-many.
I was thinking I might need to separate out the "tags" column so there's one tag per column, but the problem is that each row can have a wildly different number of tags (many of them have a lot more than given in the example) so I'm not sure that would be the best choice.

Consider switching from wide to long format using a simple do loop and the scan function.
data have;
input id complete $ tags :$2000.;
cards;
1 true 11+27+5672+26+108+14
2 false 27+5672+26+2675+25+16083945+18567+1022+58157+23+16+11
3 true 13+27+5672+2675+26+238895+23+16
4 true 10+27+5672+163333+26+25+43869595+234149+18567+19390192+13675649+23+16
5 true 10+27+5672+26+2675+25+481056+106602+54874998+54875001+110+23+16
;
data mapping;
input id $ name $ count;
cards;
+968+ TagName968 39421
+2675+ TagName2675 7450
;
data tags;
set have;
do i=1 to countw(tags, '+');
tag=input(scan(tags, i, '+'), 8.);
output;
end;
drop i;
run;
proc sql;
create table want as select distinct id, complete, tags
from tags
where tag in (select input(compress(id, '+'), 8.) as id from mapping);
quit;
id complete tags
2 false 27+5672+26+2675+25+16083945+18567+1022+58157+23+16+11
3 true 13+27+5672+2675+26+238895+23+16
5 true 10+27+5672+26+2675+25+481056+106602+54874998+54875001+110+23+16

Related

Collapsing a large dataset while conditionally preserving some missing values

Dataset HAVE includes id values and a character variable of names. Values in names are usually missing. If names is missing for all values of an id EXCEPT one, the obs for IDs with missing values in names can be deleted. If names is completely missing for all id of a certain value (like id = 2 or 5 below), one record for this id value must be preserved.
In other words, I need to turn HAVE:
id names
1
1
1 Matt, Lisa, Dan
1
2
2
2
3
3
3 Emily, Nate
3
4
4
4 Bob
5
into WANT:
id names
1 Matt, Lisa, Dan
2
3 Emily, Nate
4 Bob
5
I currently do this by deleting all records where names is missing, then merging the results onto a new dataset KEY with one variable id that contains all original values (1, 2, 3, 4, and 5):
data WANT_pre;
set HAVE;
if names = " " then delete;
run;
data WANT;
merge KEY
WANT_pre;
by id;
run;
This is ideal for HAVE because I know that id is a set of numeric values ranging from 1 to 5. But I am less sure how I could do this efficiently (A) on a much larger file, and (B) if if I couldn't simply create an id KEY dataset by counting from 1 to n. If your HAVE had a few million observations and your id values were more complex (e.g., hexadecimal values like XR4GN), how would you produce WANT?
You can use SQL here easily, MAX() applies to character variables within SQL.
proc sql;
create table want as
select id, max(names) as names
from have
group by ID;
quit;
Another option is to use an UPDATE statement instead.
data want;
update have (obs=0) have;
by ID;
run;
This seems like a good candidate for a DOW-loop, assuming that your dataset is sorted by id:
data want;
do until(last.id);
set have;
by id;
length t_names $50; /*Set this to at least the same length as names unless you want the default length of 200 from coalescec*/
t_names = coalescec(t_names,names);
end;
names = t_names;
drop t_names;
run;
proc summary data=have nway missing;
class id;
output out=want(drop=_:) idgroup(max(names) out(names)=);
run;
Use the UPDATE statement. That will ignore the missing values and keep the last non-missing value. It normally requires a master and transaction dataset, but you can use your single dataset for both.
data want;
update have(obs=0) have ;
by id;
run;

SAS where condition exact matching

Thanks for the feedback guys, but I have to rewrite the question to make it more clear.
Say, we have a table:
Table
What I am trying to get from this table is a list of numbers which have matching FP_NDT dates to my condition, for example, I want to get a list of numbers, which only have FP_NDT not null for 2014 and 2015 and missing values for 2011, 2012 and 2013 (irrelevant of the months). So with this condition I should get only Number 4. Is it possible to do it from this table ?
PS: If I write a simple sql select statement and put a condition like
where year(FP_NDT) in (2014,2015)
it would also give me numbers 2 and 3...
Why not first summarize the data?
proc sql;
create table XX as
select number
, max(year(fp_ndt)=2011) as yr2011
, max(year(fp_ndt)=2012) as yr2012
, max(year(fp_ndt)=2013) as yr2013
, max(year(fp_ndt)=2014) as yr2014
, max(year(fp_ndt)=2015) as yr2015
from table1
group by number
;
Now it is easy to make your tests.
select * from XX
where yr2014+yr2015=2 and yr2011+yr2012+yr2013=0
;
You could use the first query as a sub-query instead of creating a physical table.
So you want names associate with both 1 and 2 and 3, but in different rows.
You can group rows by names and count the associated numbers as this:
PROC SQL;
CREATE TABLE xxx AS SELECT
name,
SUM(number=1) AS count1,
SUM(number=2) AS count2,
SUM(number=3) AS count3
FROM test GROUP BY name;
QUIT;
Then you can filter the results based on count1-count3, i.e. (count1>0 AND count2>0 AND count3>0).
Try this:
proc sql;
select *
from work.test
group by name having nmiss(number)=0;
quit;
I have found one work around which is to actually create separate data sets for each year and then inner join them with a where condition for missing and not null for needed years. However, it becomes a bit cumbersome when it comes to 60 months, for instance...

Check all values in a column

Is there a way to select all values in a column, and then check whether the entire column includes only certain parameters? Here is the code that I have tried, but I can see why it is not doing what I want:
IF City = "8" or
City = "12" or
City = "15" or
City = "24" or
City = "35"
THEN put "All cities are within New York";
I am trying to select the entire column on 'City', and check to see if that column includes ONLY those 5 values. If it includes ONLY those 5 values, then I want it to print to the log saying that. But, I can see that my method checks each row if it includes one of those, and if even only just 1 row contains one of them, it will print to the log. So I am getting a print to the log for every instance of this.
What I am trying to do is:
want-- IF (all of)City includes 8 & 12 & 15 & 24 & 35
THEN put "All cities are within New York."
You just need test if any observation is NOT in your list of values.
data _null_;
if eof then put 'All cities are within New York';
set have end=eof;
where not (city in ('8','12','15','24','35') );
put 'Found city NOT within New York. ' CITY= ;
stop;
run;
A SQL alternative could be:
proc sql;
select case when sum(not city in ('8','12','15','24','35'))>0 then 'Found city NOT within New York' else 'All cities are within New York' end
from have;
quit;
Not the same as being shown in the log, and not as efficient as the one #Tom offered.

PROC SQL - Counting distinct values across variables

Looking for ways of counting distinct entries across multiple columns / variables with PROC SQL, all I am coming across is how to count combinations of values.
However, I would like to search through 2 (character) columns (within rows that meet a certain condition) and count the number of distinct values that appear in any of the two.
Consider a dataset that looks like this:
DATA have;
INPUT A_ID C C_ID1 $ C_ID2 $;
DATALINES;
1 1 abc .
2 0 . .
3 1 efg abc
4 0 . .
5 1 abc kli
6 1 hij .
;
RUN;
I now want to have a table containing the count of the nr. of unique values within C_ID1 and C_ID2 in rows where C = 1.
The result should be 4 (abc, efg, hij, kli):
nr_distinct_C_IDs
4
So far, I only have been able to process one column (C_ID1):
PROC SQL;
CREATE TABLE try AS
SELECT
COUNT (DISTINCT
(CASE WHEN C=1 THEN C_ID1 ELSE ' ' END)) AS nr_distinct_C_IDs
FROM have;
QUIT;
(Note that I use CASE processing instead of a WHERE clause since my actual PROC SQL also processes other cases within the same query).
This gives me:
nr_distinct_C_IDs
3
How can I extend this to two variables (C_ID1 and C_ID2 in my example)?
It is hard to extend this to two or more variables with your method. Try to stack variables first, then count distinct value. Like this:
proc sql;
create table want as
select count(ID) as nr_distinct_C_IDs from
(select C_ID1 as ID from have
union
select C_ID2 as ID from have)
where not missing(ID);
quit;
I think in this case a data step may be a better fit if your priority is to come up with something that extends easily to a large number of variables. E.g.
data _null_;
length ID $3;
declare hash h();
rc = h.definekey('ID');
rc = h.definedone();
array IDs $ C_ID1-C_ID2;
do until(eof);
set have(where = (C = 1)) end = eof;
do i = 1 to dim(IDs);
if not(missing(IDs[i])) then do;
ID = IDs[i];
rc = h.add();
if rc = 0 then COUNT + 1;
end;
end;
end;
put "Total distinct values found: " COUNT;
run;
All that needs to be done here to accommodate a further variable is to add it to the array.
N.B. as this uses a hash object, you will need sufficient memory to hold all of the distinct values you expect to find. On the other hand, it only reads the input dataset once, with no sorting required, so it might be faster than SQL approaches that require multiple internal reads and sorts.

Remove all instances of duplicates in SAS

I am merging two SAS datasets by ID number and would like to remove all instances of duplicate IDs, i.e. if an ID number occurs twice in the merged dataset then both observations with that ID will be deleted.
Web searches have suggested some sql methods and nodupkey, but these are not working because they are for typical duplicate cleansing where one instance is kept and then the multiples are deleted.
Assuming you are using a DATA step with a BY id; statement, then adding:
if NOT (first.id and last.id) then delete;
should do it. If that doesn't work, please show your code.
I'm actually a fan of writing dropped records to a separate dataset so you can track how many records were dropped at different points. So I would code this something like:
data want
drop_dups
;
merge a b ;
by id ;
if first.id and last.id then output want ;
else output drop_dups ;
run ;
Here is an SQL way to do it. You can use left/right/inner join best suitable for your needs. Note that this works on a single dataset just as well.
proc sql;
create table singles as
select * from dataset1 a inner join dataset2 b
on a.ID = b.ID
group by a.ID
having count(*) = 1;
quit;
For example from
ID x
5 2
5 4
1 6
2 7
3 6
You will select
ID x
1 6
2 7
3 6