SAS: create new column based on regex logical statement

I have a large dataset where one column contains free text. I wish to create a new column based on whether this free text matches a regular expression.
For example, I want to know whether this column contains the text GnRH, or those letters in any case, and create a new column with a flag to indicate whether this is true or not.

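FIND works as well and is slightly easier to understand. The 'i' modifier ignores case and 't' trims trailing blanks. INDEX would also work, but it is case sensitive, so you would have to UPCASE the text first.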
DUMMY = find(text, "gnrh", 'it') > 0;

Try this
data have;
input text $20.;
datalines;
Not in this line
In GnRH this line
Not here either
This one GNRH too
;
data want;
set have;
dummy = prxmatch('/gnrh/i', text) > 0;
run;
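As a quick check that the approaches agree, here is a minimal sketch that computes the flag three ways on the same sample rows. The variable names flag_find, flag_index and flag_prx are made up for illustration.

```sas
/* Sample rows; the text values are illustrative */
data have;
    input text $20.;
    datalines;
Not in this line
In GnRH this line
Not here either
This one GNRH too
;
run;

data want;
    set have;
    flag_find  = find(text, 'gnrh', 'it') > 0;     /* i = ignore case, t = trim trailing blanks */
    flag_index = index(upcase(text), 'GNRH') > 0;  /* INDEX is case sensitive, so upcase first */
    flag_prx   = prxmatch('/gnrh/i', text) > 0;    /* regex with the i (ignore case) modifier */
run;

proc print data=want;
run;
```

All three flags should be 1 on the second and fourth rows and 0 on the others. For a plain substring like this, FIND is probably the simplest choice; PRXMATCH earns its keep once the pattern is a real regular expression.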

Related

Parsing a long column to multiple rows using RegEx?

I have a table where the history for a file's edits is stored off, one row per file. Most of them are pipe delimited so I transform the table into a 'one row one edit' style by parsing out the fields with this sort of thing:
LATERAL FLATTEN (INPUT => SPLIT(x.User,'|')) a
However, annoyingly, one of the fields doesn't have pipes and instead has a timestamp between edits (so that people in the file can see the edit history). In our SAS world, I have a job (below) that parses it out using a regex, looping to do the parse/transpose. Is such a thing doable in Snowflake?
data notesdata_parsed;
rx_date = prxparse('/[ ]\d+[\/]\d+[\/](2020|2021)[ ]\d+[:]\d+[:]\d+( AM -| PM -)/');
set notesdata;
where textfield ne '';
do while(1);
rx_pos = prxmatch(rx_date,textfield);
if rx_pos = 0 then
do;
textfield_new=textfield;
output;
leave;
end;
textfield_new = substr(textfield,1,rx_pos-1);
textfield=substr(textfield,rx_pos+1);
output;
end;
drop rx_date textfield rx_pos;
run;
I'm not sure of the exact regex you'd need in Snowflake, but you could use the REGEXP_REPLACE function in Snowflake to turn each date into a pipe and then do your existing LATERAL FLATTEN type of thing.
Something along the lines of:
LATERAL FLATTEN (INPUT => SPLIT(REGEXP_REPLACE(x.User,'{regex expression}','|'),'|')) a
The regex syntax in Snowflake might be a little different, so I just used a placeholder there. I'm not an expert in Regex.

SAS set statement using colon and creating a filename variable

So using SAS, I have a number of SAS monthend datasets named as follows:
mydata_201501
mydata_201602
mydata_201603
mydata_201604
mydata_201605
...
mydata_201612
Each has account information at particular monthend. I want to stack the datasets all into one dataset using colon rather than writing out the full set statement as follows:
data mynewdata;
set mydata_:;
run;
However, there is no datestamp variable within the datasets, so when I stack them I will lose the monthend information for each account. I want to know which line refers to which monthend for each account. Is there a way I can automatically create a variable that names the table the row comes from? For example, the long-winded way would be this:
data mynewdata;
set mydata_201501 (in=a) mydata_201502 (in=b) mydata_201503 (in=c)...;
if a then tablename = 'mydata_201501';
if b then tablename = 'mydata_201502';
if c...
run;
but is there a quicker way using colon along these lines?
data mynewdata;
set mydata_:;
tablename = _tablelabel_;
run;
thanks
I always find clicking on comment links annoying, so hopefully here's the answer in your context. Use the INDSNAME= SET statement option to assign the dataset name to a variable:
data mynewdata;
set mydata_: indsname=_tablelabel_;
tablename = _tablelabel_;
run;
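N.B. you can call _tablelabel_ whatever you want, and you may wish to change it so it doesn't look like a SAS-generated variable name. The INDSNAME= variable itself is not written to the output dataset, which is why its value is copied into tablename.
INDSNAME= only became a SAS SET statement option in version 9.2.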
Just to be clear, with my particular code, where the datasets were named mydata_yyyymm and I wanted a monthend variable with datestamp, I was able to produce this using the solution provided by mjsqu as follows (obs and keep statement provided if required):
data mynewdata;
set mydata_: (obs=100 keep=xxx xxx) indsname=_tablelabel_;
format monthend yymmdd10.;
monthend = input(scan(_tablelabel_,-1,'_'),yymmn6.);
run;
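Note that the yymmn6. informat returns the first day of the month. If a true month-end date is wanted, one option is to wrap the result in INTNX with 'end' alignment. A minimal sketch, using two made-up datasets demo_201501 and demo_201502 as stand-ins for the real mydata_ tables:

```sas
/* Hypothetical stand-ins for the monthly datasets */
data demo_201501; account = 1; run;
data demo_201502; account = 1; run;

data stacked;
    set demo_: indsname=_source;   /* _source holds e.g. WORK.DEMO_201501 */
    /* The text after the last underscore is yyyymm; move it to the last day of that month */
    monthend = intnx('month', input(scan(_source, -1, '_'), yymmn6.), 0, 'end');
    format monthend yymmdd10.;
run;
```

Here the two rows should get 2015-01-31 and 2015-02-28. Bear in mind that the colon picks up every dataset in the library whose name starts with the prefix, so the prefix needs to be specific enough.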

Assigning index to two concatenated tables in SAS?

I have two tables with exactly the same column headers and one row each. I have the code to concatenate them, which works fine.
data concatenation;
set CURR_CURR CURR_30;
run;
However, there is no index in the output to say which row corresponds to which table.
I've tried using 'create index' and 'index create' already, but they don't work syntactically. I'd simply like to add a column of strings and move it to the front of all the other columns in the data set.
Use the INDSNAME option on the SET statement, plus a variable to store the information.
If you put the LENGTH statement ahead of your SET statement, the new variable will be created as the first column.
Just a note that this isn't the same as an 'index'. An index in SAS has a different meaning, which isn't what you're trying to create here.
data concatenation;
length dset source $50.;
set CURR_CURR CURR_30 indsname=source;
dset=source;
run;
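One detail worth knowing is that the INDSNAME= variable holds the two-level name in upper case (for example WORK.CURR_CURR). If only the table name is wanted, SCAN can take the part after the dot. A minimal sketch, with two one-row tables made up for illustration:

```sas
/* Hypothetical one-row tables with identical columns */
data curr_curr; x = 1; run;
data curr_30;   x = 2; run;

data concatenation;
    length dset $32;                /* declared before SET so it becomes the first column */
    set curr_curr curr_30 indsname=source;
    dset = scan(source, 2, '.');    /* WORK.CURR_CURR -> CURR_CURR */
run;
```

The variable named in INDSNAME= is temporary and is not written to the output, so only dset and x end up in concatenation.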
Reeza's answer is very similar to something I figured out that worked as well. Here's my version as an alternative.
data concatenation;
length id $ 10;
set CURR_CURR (in=a) CURR_30 (in=b);
if a then id = 'curr_curr';
else if b then id = 'curr_30';
run;

Search through the data with a loop or nested loops in SAS

I am rather a beginner in SAS. I have the following problem. Given is a big data set (my_time) which I imported into SAS looking as follows
I want to implement the following algorithm
For every account, look for a status equal to na; then look for the same contract one year later (one year after it got the status na) and put the information "my_date", "status" and "money" in three new columns "new_my_date", "new_status" and "new_money" like in
I need something like countifs in excel. I found loops in SAS like DO but not for the purpose to look through all rows.
I do not even know for which key word I have to look.
I would be grateful for any hint.
A simple method is to sort the data, then use the special first. variable prefix and the retain statement to get the desired result.
Step 1: Sort by account, date, and status
proc sort data=have;
by account my_date status;
run;
This guarantees that your data is in the order that you need. Since we are only looking for year+1 after status = 'na', anything that happens in between doesn't matter.
Step 2: Use a data step to remember the first year when na happens for that account
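data want;
set have;
by account my_date status;
/* first_na_month must be retained too, or it is reset to missing on every row */
retain first_na_year first_na_month first_na_account;
if first.account then call missing(first_na_year, first_na_month, first_na_account);
/* remember when this account most recently had status na */
if status in ('na', 'tna') then do;
first_na_year = year;
first_na_month = month;
first_na_account = account;
end;
/* same account, same month, one year later, and no longer na */
if year = first_na_year + 1
and month = first_na_month
and account = first_na_account
and status not in ('na', 'tna')
then do;
new_status = status;
new_my_date = my_date;
new_money = money;
end;
/* keep only the rows where all three new variables were filled in */
if cmiss(new_status, new_my_date, new_money) = 0;
drop first:;
run;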
For each row, we compare four things:
Is the status not 'na' (or 'tna')?
Is the year one more than the last time the status was 'na'?
Is the month the same as it was then?
Is this the same account we're comparing?
If all are true, then we create the three new variables.
What's happening:
SAS is inherently a looping language, so we do not need a do loop here: the data step runs once per row of the input. At the start of each iteration, SAS resets the variables created in the data step (here the new_ variables) to missing before processing the next row.
Since the SAS data step only goes forwards, we want it to remember the most recent time that na occurred for that account. retain tells SAS not to reset the value of a variable when it moves to a new row.
When we have moved on to the next account, we reset these variables to missing. by-group processing lets SAS know exactly where the first and last occurrence of each account is in the dataset.
At the end, we output only if all 3 of the new variables are not missing. cmiss counts how many of its arguments are missing, so a result of 0 means none are. Note that when a data step has no explicit output statement, an output is implied at the end of the step, so we simply need a subsetting if (an "if without then") here.
The final statement, drop first:;, is a shortcut that drops every variable whose name starts with first. This keeps them out of the final dataset.
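To see retain and first. at work in isolation, here is a minimal sketch on made-up data (the dataset demo and the variables amount and total are illustrative and not part of the question). It keeps a running total per account and resets it whenever a new account starts.

```sas
/* Illustrative data, already sorted by account */
data demo;
    input account $ amount;
    datalines;
A 10
A 20
B 5
B 7
;
run;

data running;
    set demo;
    by account;                        /* makes first.account and last.account available */
    retain total;                      /* keep total from one row to the next */
    if first.account then total = 0;   /* reset at the first row of each account */
    total = total + amount;
run;
```

The totals should come out as 10 and 30 for account A, then 5 and 12 for account B. Without the retain statement, total would be reset to missing on every row and the sum would never accumulate.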

Retrieve Column Based On Data Step Variable

I'm writing a SAS job. For this SAS job, I need to do the following --
Retrieve the value of the field ActiveColumn. This value will be the name of another column in the table.
Set ActiveValue equal to the value of the field named by ActiveColumn.
Basically, I'm trying to write a version of this where I don't have to write out every column name beforehand --
Select(ActiveColumn);
when ('CITY') ActiveValue = City;
when ('STATE') ActiveValue = State;
when ('ZIP') ActiveValue = Zip;
otherwise;
end;
What is the simplest way to do this?
Thank you very much!
This sounds like a vertical transpose. That would be done something like this, if all fields are character:
data want;
set have;
array fields city state zip;
do _t = 1 to dim(fields);
/* vname returns the name in its stored case, so lowercase both sides */
if lowcase(activeColumn)=lowcase(vname(fields[_t])) then activeValue=fields[_t];
*may want an OUTPUT here.;
end;
run;
If they are mixed type you would need two arrays and loops. You might not need ActiveColumn if you are intending to just loop over all fields anyway; you can just set ActiveColumn to vname(fields[_t]) in the loop.
If you intend this to be more flexible, you can use array fields _character_; which will use all character variables (so you don't have to list them explicitly).
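A different technique from the array loop above, if the goal is only a lookup by name, is the VVALUEX function, which returns the formatted value of the variable whose name its argument evaluates to. A minimal sketch, with sample rows and lengths made up for illustration:

```sas
/* Illustrative data: ActiveColumn names one of the other columns */
data have;
    length ActiveColumn $8 City $12 State $2 Zip $5;
    input ActiveColumn $ City $ State $ Zip $;
    datalines;
CITY Boston MA 02101
ZIP Denver CO 80202
;
run;

data want;
    set have;
    length ActiveValue $50;                /* assumed wide enough for any column's value */
    ActiveValue = vvaluex(ActiveColumn);   /* value of the variable named in ActiveColumn */
run;
```

This should give Boston for the first row and 80202 for the second. VVALUEX always returns character text (numeric variables come back as their formatted value), so it sidesteps the mixed-type problem, though it is evaluated row by row and may be slower than an array on large data.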