I am currently using SAS 9.4. I have the following sentence in a column:
"BELINDA S LEE IS A PARTNER IN THE SAN FRANCISCO OFFICE OF LATHAM & WATKINS. SHE IS A MEMBER OF THE FIRM'S LITIGATION & TRIAL DEPARTMENT. HER PRACTICE FOCUSES ON ANTITRUST AND COMPLEX LITIGATION."
I want to scan the text for the keyword "DEPARTMENT" and, when I find it, take the whole of the sentence leading up to that keyword, stopping at the space or full stop that follows it. From this I will create a new column containing "SHE IS A MEMBER OF THE FIRM'S LITIGATION & TRIAL DEPARTMENT".
Is it possible to scan a text column with a keyword like this to obtain the rest of the sentence before the keyword?
Thanks
Chris
You want to break the string into sentences first. Then test each sentence to see if it contains the word you are looking for. If it does, output that record.
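data have;
  input = "BELINDA S LEE IS A PARTNER IN THE SAN FRANCISCO OFFICE OF LATHAM & WATKINS. SHE IS A MEMBER OF THE FIRM'S LITIGATION & TRIAL DEPARTMENT. HER PRACTICE FOCUSES ON ANTITRUST AND COMPLEX LITIGATION.";
run;
data want;
  set have;
  format out $2000.;
  /* number of sentences, treating '.' as the only delimiter */
  n = countw(input,".");
  do i=1 to n;
    /* i-th sentence; strip() removes the blank that follows each full stop */
    out = strip(scan(input,i,"."));
    /* write a row for every sentence that contains the keyword */
    if index(out,"DEPARTMENT") then
      output;
  end;
  drop i n;
run;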
So here I use the COUNTW() function to count the number of sentences delimited by a '.'. Then I loop over those, getting each one with the SCAN() function. I test whether "DEPARTMENT" is in that sentence and, if so, output it.
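If only the first matching sentence is wanted, and one row per input record, the same idea can be sketched as below. This is a variant under assumptions, not part of the answer above: the names want_first and found are made up for illustration, and FINDW is swapped in for INDEX so that only the whole word DEPARTMENT matches.

```sas
data want_first;                      /* hypothetical output name */
  set have;
  length out $2000;
  found = 0;
  /* stop looping as soon as one sentence contains the keyword */
  do i = 1 to countw(input, ".") while (not found);
    out = strip(scan(input, i, "."));
    /* FINDW matches DEPARTMENT only as a whole word */
    if findw(out, "DEPARTMENT") > 0 then found = 1;
  end;
  /* leave OUT blank when no sentence matched */
  if not found then out = " ";
  drop i found;
run;
```

Because there is no explicit OUTPUT statement, the implied OUTPUT at the end of the step writes exactly one row per input record.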
Related
I have a list of people who may have a cellphone number, a home phone number, or both. The dataset often has the same number listed for both the cellphone number and the home phone number. This is what I am trying to do for each record:
if cellphonenumber = (homephonenumber) then keep (cellphonenumber) and drop (homephonenumber)
I've tried different combinations and cannot get it to work. I am competent in writing SQL and VBA for Access and have branched into SAS. I know the syntax is different and that Access does not have the full library (i.e. does not recognize "distinct").
Here are two ways. SQL is supported in SAS, but you should familiarize yourself with the data step since it's one of SAS's most powerful tools.
Data Step
Let's assume your data looks like this:
id home cell
1 111-111-0123
2 222-222-0123 222-222-0123
3 333-333-0123 444-444-0123
If you want to remove the home phone number, then simple if-then logic will work fine. In SAS, ' ' is missing for character columns, and . for numeric. You can optionally use the call missing() subroutine to automatically set it for you.
data want;
set have;
if(home = cell) then home = ' ';
run;
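The CALL MISSING routine mentioned above can replace the explicit blank. A minimal sketch, assuming the same have dataset with character variables home and cell:

```sas
data want;
  set have;
  /* CALL MISSING picks the right missing value for the variable's type */
  if home = cell then call missing(home);
run;
```

This form keeps working unchanged if the phone columns are later stored as numeric, since the routine sets ' ' or . as appropriate.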
SQL
You can also do this in SQL:
proc sql;
  create table want as
  select id
       , CASE
           when(home = cell) then ' '
           else home
         END as home
       , cell
  from have
  ;
quit;
Other info
If your data is not clean and has leading or trailing blanks, you can loop through all of your character columns to ensure that all leading/trailing blanks are removed. If you need to standardize your home/cell numbers, you'll need to do some additional standardization logic (note that if you have access to SAS Data Quality Server, all of that can be done for you automatically).
The below will loop through every character variable and run the strip() function to remove leading and trailing blanks for every row.
data want;
set have;
array charvars[*] _CHARACTER_;
do i = 1 to dim(charvars);
charvars[i] = strip(charvars[i]);
end;
if(home = cell) then home = ' ';
drop i;
run;
Take a look at SAS's free e-learning for training on SAS programming concepts.
I am looking for a specific employer in a SAS data set. The data set has not been reviewed for spelling so if I am looking for Univ it could be entered as Unversity, University, Univercity ...
I've tried scanning, counting the matching letters, and 'contains'. These work, but I am still missing some.
proc sql;
create table SpecificEmployers as
select *
, case when employer contains 'Univ' then 'Y'
else 'N' end as Emp
from AllEmployers
;quit;
In this case, rather than searching for a substring, I would suggest searching for the individual characters that occur most reliably, such as U, N and V, and keeping only the values that contain all of them. For example, I have used the FINDC function to find the strings that contain U, N and V:
data have;
input string $15.;
datalines;
uNiverstY
UNVERSTy
college
univercity
school
schools
UNIVERSITY
Uversity
unvarcity
school123
;
run;
proc sql;
  /* save the matches so that PROC PRINT below has a dataset to read */
  create table want as
  select string from have
  where findc(upcase(string),'U')>=1
    and findc(upcase(string),'N')>=1
    and findc(upcase(string),'V')>=1;
quit;
proc print data=want; run;
Using UPCASE means you don't have to worry about case. You can add as many conditions as you need, depending on the values you are matching.
You should investigate some of the edit distance functions:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206137.htm
One approach would be to loop through each word in the employer name and see if any of the individual words has an edit distance below a certain threshold when compared to the string university.
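The word-by-word comparison could look roughly like the sketch below. It assumes the AllEmployers table and employer column from the question, uses COMPLEV (Levenshtein distance) as one possible edit distance function, and picks a threshold of 3 purely for illustration; the threshold would need tuning against real data.

```sas
data SpecificEmployers;
  set AllEmployers;
  length word $50 Emp $1;
  Emp = 'N';
  /* compare each blank-delimited word of the employer name */
  do i = 1 to countw(employer, ' ');
    word = upcase(scan(employer, i, ' '));
    /* COMPLEV returns the Levenshtein edit distance between two strings */
    if complev(strip(word), 'UNIVERSITY') <= 3 then Emp = 'Y';
  end;
  drop i word;
run;
```

Abbreviations such as "Univ" are far from the full word in edit distance, so they would still need a separate substring check like the CONTAINS test in the question.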
I have a SAS table that I imported from Oracle with two fields. SYSTEMID and T_BLOB.
Inside the T_BLOB field there is data:
2203 Mountain Meadow===========OSCAR ST===========Zephyrhill Road
(why they are delimiting with equal signs I do not know nor do I know who to ask).
I'm new to SAS and I'm being asked to split T_BLOB field into multiple rows in a table called rick.split_blob. I tried Google but I can't find the exact example. I'm trying to get the output to look like:
SYSTEM_ID T_BLOB
GID_1 2203 Mountain Ave
GID_1 OSCAR ST
GID_1 Zephyrhill Road
Can anyone help me with how to code this?
If none of the values ever contain = then you can just use the scan() function.
data want;
set have ;
length T_BLOB_VALUE $200 ;
do i=1 by 1 until(t_blob_value=' ');
t_blob_value=scan(t_blob,i,'=') ;
if i=1 or t_blob_value ne ' ' then output;
end;
run;
You could try this:
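data rick.split_blob (keep=SYSTEM_ID T_BLOB_SUB rename=(T_BLOB_SUB=T_BLOB));
  set orig_dataset;
  /* replace each run of equals signs with a single pipe */
  T_BLOB_TRANS = tranwrd(T_BLOB,"===========","|");
  do i = 1 to countw(T_BLOB_TRANS,"|");
    /* scan the translated string, not the original, for the i-th piece */
    T_BLOB_SUB = scan(T_BLOB_TRANS,i,"|");
    output;
  end;
run;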
What I'm trying to do is first translate the odd string of equals signs to a simple pipe to avoid counting them as consecutive delimiters. Then we determine how many "words" (really - delimited strings) there are in T_BLOB_TRANS so we know how many times to run the DO loop. Finally we read everything between each delimiter and output it to a new T_BLOB variable for each new word.
It looks like you'll want to use a combination of the "scan" function and the "output" statement (with countw to get you the number of words if it is variable). Scan returns the nth word where you can specify the delimiter. Output outputs a record. So, for example, you can say
do i=1 to countw(line);
newvar = scan(line,i);
output;
end;
A raw data file is listed below:
RANCH,1250,2,1,Sheppard Avenue, "$64,000"
SPLIT,1190,1,1,Rand Street, "$65,850"
CONDO, 1400,2,1,Market Street, "80,050"
TWOSTORY, 1810,4,3,Garris Street, "$107,250"
RANCH, 1500,3,3,Kemble Avenue, "$86,650"
SPLIT, 1615, 4,3, West Drive, "94,450"
SPLIT, 1305, 3,1.5,Graham Avenue, "$73,650"
The following is the code:
data work.condo_ranch;
  infile "file_specification" dsd;
  input style $ @;
  if style = 'CONDO' or style = 'RANCH' then
    input sqfeet bedrooms baths street $ price : dollar10.;
run;
So, I think the output dataset contains 3 observations, while the correct answer is that the output contains 7 observations. Can anyone tell me why? Many thanks for your time and attention.
Why would you expect the output dataset to have only 3 observations? There is an implied OUTPUT statement at the bottom of the DATA step, so every iteration writes an observation whether or not the second INPUT runs. If you want to output only those records where STYLE IN ("CONDO","RANCH") you could add a conditional OUTPUT, e.g.:
if style = 'CONDO' or style = 'RANCH' then do;
input sqfeet bedrooms baths street $ price: dollar10.;
output;
end;
If you only want to output the records where style is CONDO or RANCH you could just change your THEN to a semi-colon. That would make your IF statement a subsetting IF. So the data step would return at that point and never run the second INPUT or the implied OUTPUT at the end of the step.
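For reference, the subsetting IF version described above would look something like this, reusing the placeholder file name and variable names from the question:

```sas
data work.condo_ranch;
  infile "file_specification" dsd;   /* placeholder path from the question */
  input style $ @;                   /* trailing @ holds the line for the next INPUT */
  /* subsetting IF: other styles end the iteration here, before any output */
  if style = 'CONDO' or style = 'RANCH';
  input sqfeet bedrooms baths street $ price : dollar10.;
run;
```

When the condition is false, SAS returns to the top of the step, releases the held line, and reads the next record without writing an observation.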
What's wrong with the below SAS code? The single date column cannot be read correctly.
DATA test;
INPUT mydate MMDDYY8.;
FORMAT mydate YYMMDD10.;
DATALINES;
01-22-98
03-03-97
;
PROC PRINT DATA = test;
RUN;
Edit: Thanks for the answer. Another follow-up question is, when I try to read CSV format where datetime is quoted, it always fails to read correctly. How to read CSV format with quoted datetime values correctly? DSD option doesn't help much in my case.
Try left-aligning the datalines.
SAS is a free-format language: a statement can start in any column, one statement can span multiple lines, and multiple statements can share one line.
The data lines that follow a DATALINES statement are different. The formatted input MMDDYY8. reads exactly eight columns starting at column 1, so each value has to start in column 1. If the lines are indented, the leading blanks use up part of that field and the date is cut off or read as missing.
Hence the mistake in your code is starting the data too far to the right.
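If the data lines have to stay indented, another option is the colon format modifier, which makes SAS skip leading blanks and read up to the next blank before applying the informat. A minimal sketch based on the code in the question:

```sas
data test;
  /* the colon switches to modified list input, so indentation is ignored */
  input mydate : mmddyy8.;
  format mydate yymmdd10.;
  datalines;
    01-22-98
    03-03-97
;
run;

proc print data=test;
run;
```

How the two-digit years are resolved depends on the YEARCUTOFF= system option in the session.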