Is there a way to select all values in a column, and then check whether the entire column includes only certain parameters? Here is the code that I have tried, but I can see why it is not doing what I want:
IF City = "8" or
City = "12" or
City = "15" or
City = "24" or
City = "35"
THEN put "All cities are within New York";
I am trying to select the entire column on 'City', and check to see if that column includes ONLY those 5 values. If it includes ONLY those 5 values, then I want it to print to the log saying that. But, I can see that my method checks each row if it includes one of those, and if even only just 1 row contains one of them, it will print to the log. So I am getting a print to the log for every instance of this.
What I am trying to do is:
want-- IF (all of)City includes 8 & 12 & 15 & 24 & 35
THEN put "All cities are within New York."
You just need test if any observation is NOT in your list of values.
data _null_;
if eof then put 'All cities are within New York';
set have end=eof;
where not (city in ('8','12','15','24','35') );
put 'Found city NOT within New York. ' CITY= ;
stop;
run;
A SQL alternative could be:
proc sql;
select case when sum(not city in ('8','12','15','24','35'))>0 then 'Found city NOT within New York' else 'All cities are within New York' end
from have;
quit;
Not the same as being shown in the log, and not as efficient as the one #Tom offered.
Related
So I have two datasets. Dataset1 has a column "tags" that contains numbers separated by +. It looks like this:
id complete tags
1 true 11+27+5672+26+108+14
2 false 27+5672+26+2675+25+16083945+18567+1022+58157+23+16+11
3 true 13+27+5672+2675+26+238895+23+16
4 true 10+27+5672+163333+26+25+43869595+234149+18567+19390192+13675649+23+16
5 true 10+27+5672+26+2675+25+481056+106602+54874998+54875001+110+23+16
.
.
My second dataset has a list of "tags" like so:
id name count
+968+ TagName968 39421
+2675+ TagName2675 7450
.
.
What I would like to do is to go through and select all of the rows from Dataset1 where any of the tags in the tags column match any of the id rows in Dataset2. So basically any rows in Dataset1 that contain "+968+" or "+2675+" or any of the other ~3000 ids (so I can't hardcode them all in). So if I ran it based on the example above, rows 2, 3, and 5 would all be selected because they contain "+2675+".
I have tried different versions of DATA steps and PROC SQL, but I'm not sure I'm doing them right. The problem with PROC SQL is that I'm not really looking for a join, I don't think? Or at least it would have to be one-to-many.
I was thinking I might need to separate out the "tags" column so there's one tag per column, but the problem is that each row can have a wildly different number of tags (many of them have a lot more than given in the example) so I'm not sure that would be the best choice.
Consider switching from wide to long format using a simple do loop and the scan function.
data have;
input id complete $ tags :$2000.;
cards;
1 true 11+27+5672+26+108+14
2 false 27+5672+26+2675+25+16083945+18567+1022+58157+23+16+11
3 true 13+27+5672+2675+26+238895+23+16
4 true 10+27+5672+163333+26+25+43869595+234149+18567+19390192+13675649+23+16
5 true 10+27+5672+26+2675+25+481056+106602+54874998+54875001+110+23+16
;
data mapping;
input id $ name $ count;
cards;
+968+ TagName968 39421
+2675+ TagName2675 7450
;
data tags;
set have;
do i=1 to countw(tags, '+');
tag=input(scan(tags, i, '+'), 8.);
output;
end;
drop i;
run;
proc sql;
create table want as select distinct id, complete, tags
from tags
where tag in (select input(compress(id, '+'), 8.) as id from mapping);
quit;
id complete tags
2 false 27+5672+26+2675+25+16083945+18567+1022+58157+23+16+11
3 true 13+27+5672+2675+26+238895+23+16
5 true 10+27+5672+26+2675+25+481056+106602+54874998+54875001+110+23+16
I need help reading the code below. I am not sure what specific parts in this code are doing. For example, what does ( firstobs = 2 keep = column3 rename = (column3 = column4) ) do?
Also, what does ( obs = 1 drop = _all_ ); do?
I have also not used column5 = ifn( first.column1, (.), lag(column3) ); before. What does this do?
I am reading someone else's code. I wish I could provide more detail. If I find a solution, I will post it. Thanks for your help.
data out.dataset1;
set out.dataset2;
by column1;
WHERE column2 = 'N';
set out.dataset1 ( firstobs = 2 keep = column3 rename = (column3 = column4) )
out.dataset1 ( obs = 1 drop = _all_ );
FORMAT column5 DATETIME20.;
FORMAT column4 DATETIME20.;
column5 = ifn( first.column1, (.), lag(column3) );
column4 = ifn( last.column1, (.), column4 );
IF first.column1 then DIF=intck('dtday',column4,column3);
ELSE DIF= intck('dtday',column5,column3);
format column6 $6.;
IF first.column1
OR intck('dtday',column5,column3) GT 20 THEN column6= 'HARM';
ELSE column6= 'REPEAT';
run;
Seems like you need to learn about SAS datastep languange!
This series of things happening in the parenthesis are datastep options
You can use those options when ever you are referencing a table, even in a proc sql
The options you have:
firstobs : This starts the datafeed on the record enter in your case 2 it means SAS will start on the table on the 2nd record.
keep : This will only use the fields in list rather than using all the field in the table
rename = rename will rename field, so it works like an alias in SQL
OBS = will limit the amount of record you pull out of a table like top or limit in SQL
DROP = would remove the fields selected from the table in your case all is used wich means he's dropping all the fields.
as for the functions:
LAG is keeping the value from the previous record for the field you put in parenthesis so DPD_CLOSE_OF_BUSINESS_DT
INF = Works like a case or if. Basically you create a condition in the 1st argument and then the 2nd argument is applied when your condition in the 1st argument is true, the 3rd argument get done in the event that your condition on the 1st argument is false.
So to answer that question if it's the first record for the variable SOR_LEASE_NBR then the field Prev_COB_DT will be . otherwise it will be a previous value of DPD_CLOSE_OF_BUSINESS_DT.
The best advise I can give you is to start googling SAS and the function name you are wondering what it does, then it's a matter of encapsulation!
Hope this helps!
Basically your data step is using the LAG() function to look back one observation and the extra SET statement to look ahead one observation.
The IFN() function calls are then being used to make sure that missing values are assigned when at the boundary of a group.
You then use these calculated PREV and NEXT dates to calculate the DIF variable.
Note for this to work you need to be referencing the same input dataset in the two different SET statements (the dataset used in the last with the obs=1 and drop=_all_ dataset options
doesn't really need to be the same since it is not reading any of the actual data, it just has to have at least one observation).
( firstobs = 2 keep = DPD_CLOSE_OF_BUSINESS_DT rename =
(DPD_CLOSE_OF_BUSINESS_DT = Next_COB_DT) ) do?
Here the code firstobs=2 says SAS to read the data from 2nd observation in the dataset.
and also by using rename option trying to change the name of the variable.
(obs = 1 drop = _all_);
obs=1 is reading only the 1st obs in the dataset. If you specify obs=2 then up to 2nd obs will be read.
drop = _all_ , is dropping all of your variables.
Firstobs:
Can read part of the data. If you specify Firstobs= 10, it starts reading the data from 10th observation.
Obs :
If specify obs=15, up to 15th obs the data will be readed.
If you run the below table, it gives you 3 observations ( from 2nd to 4th ) in the output result.
Example;
DATA INSURANCE;
INFILE CARDS FIRSTOBS=2 OBS=4;
INPUT NAME$ GENDER$ AGE INSURANCE $;
CARDS;
SOWMYA FEMALE 20 MEDICAL
SUNDAR MALE 25 MEDICAL
DIANA FEMALE 67 MEDICARE
NINA FEMALE 56 MEDICAL
RUN;
I have two SAS datasets with (assume for simplicity) one char variable in each. The first dataset has a variable with company description (sometimes including city, sometimes not; a messy field) and a second dataset has a variable, where all cities are listed. I need to create variable in a first dataset saying, if any of the cities from 2nd dataset was found or not and the outcome should not contain just 0 or 1 answers, but the city itself.
Is there an easy way to do it without looping INDEXW (or similar) functions?
What's wrong with indexw? Using proc sql and indexw allows a pretty straightforward solution.
Sample data:
data have_messy;
length messy $100;
messy = 'this is a city name: brisbane' ; output;
messy = 'this is a city name: sydney' ; output;
messy = 'this is a city name: melbourne'; output;
run;
data have_city;
length city $20;
city = 'sydney' ; output;
city = 'brisbane'; output;
run;
Example query:
proc sql noprint;
create table want as
select a.*,
b.city
from have_messy a
left join have_city b on indexw(a.messy, b.city)
;
quit;
Results:
messy city
=============================== =========
this is a city name: sydney sydney
this is a city name: brisbane brisbane
this is a city name: melbourne
Be careful - the above query can return multiple results per row in table a if multiple city names are found. I suggest you run a follow up step to handle any duplicate rows depending on your requirements.
Suppose i have a table:
Name Age
Bob 4
Pop 5
Yoy 6
Bob 5
I want to delete all names, which are not unique in the table:
Name Age
Pop 5
Yoy 6
ATM, my solution is to make a new table with counts of unique names:
Name Count
Bob 2
Pop 1
Yoy 1
And then, leave all, which's Count > 1
I believe there are much more beautiful solutions.
If I understand you correctly there are two ways to do it:
The SQL Procedure
In SAS you may not need to use a summarisation function such as MIN() as I have here, but when there is only one of name then min(age) = age anyway, and when migrating this to another RDBMS (e.g. Oracle, SQL Server) it may be required:
proc sql;
create table want as
select name, min(age) as age
from have
group by name
having count(*) = 1;
quit;
Data Step
Requires the data to be pre-sorted:
proc sort data=have out=have_stg;
by name;
run;
When doing SAS data-step by group processing, the first. (first-dot) and last. (last-dot) variables are generated which denote whether the current observation is the first and/or last in the by-group. Using SAS conditional logic one can simply test if first.name = 1 and last.name = 1. Reducing this using logical shorthand becomes:
data want;
set have_stg;
by name;
if first.name and last.name;
/* Equivalent to:*/
*if first.name = 1 and last.name = 1;
run;
I left both versions in the code above, use whichever version you find more readable.
You can use proc sort with the nouniquekey option. Then use uniqueout= to output the unique values and out= to output the duplicates (the out= statement is necessary if you don't wan't to overwrite your original dataset).
proc sort data = have nouniquekey uniqueout = unique out = dups;
by name;
run;
I recently combined two datasets with a pretty straightforward Merge statement. I was using an ACS dataset and a Census population dataset. I needed a flag from the latter to be in the former. When I merged, the place variable (town/county, state) was not de-duplicated because one dataset used state abbreviations while the other used the full spelling:
Obs GeoID GeoName
1 . Abbeville County, SC
2 45001 Abbeville County, South Carolina
I need to change the GeoName for Obs1 so that it equals Obs2
Would an index function work? Or do I need the TRANWRD function? Thanks.
Solved:
data _null_;
length geoName $100;
GeoName_C = scan(GeoName,1,',');
GeoName_S = scan(GeoName,-1,','); *-1 scans from the right in case you could have commas in the city - check for this and adjust GeoName_C to include them if it is possible;
GeoName_S_F = stnamel(strip(GeoName_S));
GeoName = catx(',',GeoName_C,GeoName_S_F);
put _all_;
run;
What I would do is separate the city from the state and use SAS's inbuilt function stnamel to convert the abbreviation to the full name.
data _null_;
length geoName $100;
GeoName='Abbeville Road, SC';
GeoName_C = scan(GeoName,1,',');
GeoName_S = scan(GeoName,-1,','); *-1 scans from the right in case you could have commas in the city - check for this and adjust GeoName_C to include them if it is possible;
GeoName_S_F = stnamel(strip(GeoName_S));
GeoName = catx(',',GeoName_C,GeoName_S_F);
put _all_;
run;