Wildcard usage for string variables - stata

Pretty straightforward question. Can you use wildcard functions for strings in Stata? I haven't been able to find a suitable workaround.
Here's the code I am trying to use:
gen newvar= "output" if reg_id == "input*"
I have different values of input, i.e. input12, input18, input28292, etc. The wildcard selection does not appear to be working.

This won't work as you want. So far as Stata is concerned here, "*" is a literal character you are looking for and won't find.
Wildcard syntax like this applies when a variable list is expected, i.e. it can apply to variable names, but to use it with string values you need a dedicated function.
In your example, all cases begin with the string input, so this would work:
gen newvar = "output" if substr(reg_id, 1, 5) == "input"
Stata also supports pattern matching and regular expressions.
gen newvar = "output" if strmatch(reg_id, "input*")
is in fact the simplest way to get what you ask.
All documented:
help string functions
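The difference between literal comparison, wildcard matching, and prefix comparison can be sketched outside Stata too. The following is only a Python illustration of the same semantics, not Stata code: fnmatchcase stands in for strmatch, slicing stands in for substr, and the value "other7" is a made-up non-matching case.

```python
from fnmatch import fnmatchcase

# Sample values from the question, plus one hypothetical non-matching value.
reg_ids = ["input12", "input18", "input28292", "other7"]

# Plain equality treats "*" as a literal character, so nothing matches.
literal_hits = [r for r in reg_ids if r == "input*"]

# A glob-style match treats "*" as "any run of characters".
wildcard_hits = [r for r in reg_ids if fnmatchcase(r, "input*")]

# Prefix comparison, analogous to the substr() approach.
prefix_hits = [r for r in reg_ids if r[:5] == "input"]

print(literal_hits)
print(wildcard_hits)
print(prefix_hits)
```

The equality test finds nothing for the same reason the original gen ... if reg_id == "input*" does: no value contains an actual asterisk.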

One simple solution:
gen newvar = "output" if strmatch(reg_id, "input*")
see help strmatch for usage.
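Note also that you can use regexm in place of strmatch, although it takes a regular expression rather than a wildcard pattern, e.g. regexm(reg_id, "^input").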


Specify the number of characters that should match a LIKE REGEX in T-SQL

I've done a ton of Googling on this and can't find the answer. Or, at least, not the answer I am hoping to find. I am attempting to convert a REGEXP_SUBSTR search from Teradata into T-SQL on SQL Server 2016.
This is the way it is written in Teradata:
REGEXP_SUBSTR(cn.CONTRACT_PD_AOR,'\b([a-zA-Z]{2})-([[:digit:]]{2})-([[:digit:]]{3})(-([a-zA-Z]{2}))?\b')
The numbers in the curly brackets specify the number of characters that must match each part of the REGEXP. So, this is looking for a contract number that looks like this: XX-99-999-XX
Is this not possible in T-SQL? Specifying the amount of characters to look at? So I would have to write something like this:
where CONTRACT_PD_AOR like '[a-zA-Z][a-zA-Z]-[0-9][0-9]-[0-9][0-9][0-9]-[a-zA-Z][a-zA-Z]%'
Is there not a simpler way to go about it?
While not an answer, this method makes things a little less painful. It is a way to set a format once and reuse it wherever you need it in your code while keeping the code clean and readable.
Set a format variable at the top, then do the needed replaces to build it. Then use the format name in the code. Saves a little typing, makes your code less fugly, and has the benefit of making that format variable reusable should you need it in multiple queries without all that typing.
-- Readable mask: X = letter, 9 = digit
Declare @fmt_CONTRACT_PD_AOR nvarchar(max) = 'XX-99-999-XX';
-- Expand each mask character into a LIKE character class
Set @fmt_CONTRACT_PD_AOR = REPLACE(@fmt_CONTRACT_PD_AOR, '9', '[0-9]');
Set @fmt_CONTRACT_PD_AOR = REPLACE(@fmt_CONTRACT_PD_AOR, 'X', '[a-zA-Z]');
with tbl(str) as (
select 'AA-23-234-ZZ' union all
select 'db-32-123-dd' union all
select 'ab-123-88-kk' -- wrong digit grouping, so it is filtered out
)
select str from tbl
where str like @fmt_CONTRACT_PD_AOR;
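To see what the expanded pattern looks like before sending it to the server, the mask expansion can be sketched in Python. This is only an illustration under assumptions: the names MASK_CLASSES and expand_mask are hypothetical, and the local check relies on the fact that this particular pattern contains no % or _ wildcards, so it happens to be a valid regular expression as well.

```python
import re

# Hypothetical mask alphabet taken from the answer: X = letter, 9 = digit.
MASK_CLASSES = {"X": "[a-zA-Z]", "9": "[0-9]"}

def expand_mask(mask):
    """Expand each mask character into a bracket class; other characters pass through."""
    return "".join(MASK_CLASSES.get(ch, ch) for ch in mask)

like_pattern = expand_mask("XX-99-999-XX")

# The expanded pattern has no % or _ wildcards, so it is also a valid regex
# here and can be checked locally with re.fullmatch (a whole-string match).
samples = ["AA-23-234-ZZ", "db-32-123-dd", "ab-123-88-kk"]
matches = [s for s in samples if re.fullmatch(like_pattern, s)]

print(like_pattern)
print(matches)
```

Expanding character by character, rather than with successive replaces, keeps the result independent of the order in which the mask characters are handled.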

Advanced Lua Pattern Matching

I would like to know if either/both of these two scenarios are possible in Lua:
I have a string that looks like such: some_value=averylongintegervalue
Say I know there are exactly 21 characters after the = sign in the string, is there a short way to replace the string averylongintegervalue with my own? (i.e. a simpler way than typing out: string.gsub("some_value=averylongintegervalue", "some_value=.....................", "some_value=anewintegervalue"))
Say we edit the original string to look like such: some_value=averylongintegervalue&
Assuming we do not know how many characters are after the = sign, is there a way to replace the string in between the some_value= and the &?
I know this is an oddly specific question but I often find myself needing to perform similar tasks using regex and would like to know how it would be done in Lua using pattern-matching.
Yes, you can use something like the following (%1 refers to the first capture in the pattern, which in this case captures some_value=):
local str = ("some_value=averylongintegervalue"):gsub("(some_value=)[^&]+", "%1replaced")
This should assign some_value=replaced.
Do you know if it is also possible to replace every character between the = and & with a single character repeated (such as a * symbol repeated 21 times instead of a constant string like replaced)?
Yes, but you need to use a function:
local str = ("some_value=averylongintegervalue")
:gsub("(some_value=)([^&]+)", function(a,b) return a..("#"):rep(#b) end)
This will assign some_value=#####################. If you need to limit this to just one replacement, then add ,1 as the last parameter to gsub (as Wiktor suggested in the comment).
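For comparison, the same capture-plus-callback idea can be sketched in Python, where re.sub accepts a function as the replacement much as gsub does in Lua. This is an illustration only; mask_value and its key and fill parameters are hypothetical names, and the trailing "&x=1" in the usage line is made-up input.

```python
import re

def mask_value(text, key="some_value", fill="#"):
    """Replace everything between 'key=' and the next '&' with fill characters."""
    # Group 1 keeps the 'key=' prefix, group 2 is the value to be masked.
    pattern = re.compile("(" + re.escape(key) + "=)([^&]+)")
    return pattern.sub(lambda m: m.group(1) + fill * len(m.group(2)), text)

masked = mask_value("some_value=averylongintegervalue&x=1")
print(masked)
```

As in the Lua version, the value's length is taken from the second capture, so the number of characters after the = sign does not need to be known in advance.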

I need to rework this code so it includes everything that could read "inpatient" - including when misspelt?

I either need to find a way of recoding this to include misspelt versions of inpatient or 'flagging' those that haven't been affected by this?
df1$Admission_Type <- as.character(df1$Admission_Type)
df1$Admission_Type[df1$Admission_Type == "Inpatient"]<-"ip"
df1$Admission_Type[df1$Admission_Type == "inpatient"]<-"ip"
df1$Admission_Type[df1$Admission_Type == "INPATIENT"]<-"ip"
It repeats like this.
To deal with case issues, convert everything to lower case:
df1 <- data.frame(Admission_Type = c("Inpatient", "inpatient", "INPATIENT", "inp", "impatient"), stringsAsFactors = FALSE)
df1$Admission_Type <- tolower(df1$Admission_Type)
Then you can use regular expressions to deal with misspellings. While impossible to get all, you can use intuition to get close. In my example, I made the (intentional) misspelling of "impatient". You can set up a regular expression to detect this possibly common mistake as such
grep("^i[nm]pat[ie][ei]nt", df1$Admission_Type, ignore.case = TRUE)
where I allowed the second position to be either an 'n' or 'm', or the 'ie' to be switched at positions 6-7. This returns
[1] 1 2 3 5
You can add likely possible misspelled letters to each position. Plenty of tips on how to make this regex more complicated to allow for missing/extra letters if you search.
Note you can use gsub to do the replacement automatically.
df1$Admission_Type[grepl("inpatient", df1$Admission_Type, ignore.case=TRUE)] = "ip" will cover the cases you listed. @JohnSG's answer shows how to include potential misspellings into the regular expression as well. (You'll probably want to create a new column to store your recodings (at least while you're testing out different options) rather than overwriting the original column of data.)
As @alistaire mentioned, you can use agrep for approximate matching. For example:
x = c("inpatient","Inpatient","Impatient","inpateint")
agrep("inpatient", x, max.dist=2, ignore.case=TRUE)
So, in your case, you could do:
df1$Admission_Type[agrep("inpatient", df1$Admission_Type, max.dist=2, ignore.case=TRUE)] = "ip"
agrep returns the indices of the matching values. max.dist controls how different the actual values can be from the target value and still be considered a match. You'll probably need to test and tweak this to capture misspellings while avoiding incorrect matches.
grepl covers the cases you listed in your question, but for future reference, if you ever do need to match on a number of separate values, you can reduce the amount of code needed by using the %in% operator. In your case, that would be:
df1$Admission_Type[df1$Admission_Type %in% c("Inpatient","inpatient","INPATIENT")]<-"ip"
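The idea behind a distance threshold such as max.dist can be sketched with a plain Levenshtein edit distance. The Python below is an illustration under assumptions, not a reimplementation of agrep: it compares whole strings only (agrep also matches approximate substrings), and the function names and the sample value "outpatient" are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # Cheapest of: deletion, insertion, substitution (or match).
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def recode(values, target="inpatient", code="ip", max_dist=2):
    """Replace values within max_dist edits of target (ignoring case) with code."""
    return [code if edit_distance(v.lower(), target) <= max_dist else v
            for v in values]

recoded = recode(["Inpatient", "impatient", "inpateint", "outpatient", "inp"])
print(recoded)
```

Returning a new list instead of overwriting the input follows the earlier advice to keep the original column intact while the threshold is being tuned.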

Wildcard character

I've a dataframe, and I'm trying to select columns with certain properties in the name.
One example (of many) is I want to select columns called "t*_b**" where * would be a wildcard character. This would select columns with names t1_b2, t2_b2, t3_b2 and t4_b2 (as well as several others like t1_b13, t2_b13 etc.).
If there is such a wildcard character I could use, I know that I could just use the following command:
grep("t*_b", names(df))
As opposed to doing:
c(grep("t1_b", names(df)), grep("t2_b", names(df)), grep("t3_b", names(df)), grep("t4_b", names(df)))
which is messier and harder to read.
Update: the first comment has resolved my issue. I don't have any real need for any further input, thanks for the help!
The wildcard 'character' in regular expressions is a period (.), which matches any single character. As such, you could do
grep("t._b", names(df))
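The reason "t*_b" does not behave like a shell wildcard is that, in a regular expression, * means "zero or more of the preceding item". A small Python sketch shows the difference; it assumes only that Python's re module and R's grep agree on these basic constructs, and the column names "x_b1" and "total" are made up for illustration.

```python
import re

# Two names from the question plus two hypothetical ones.
names = ["t1_b2", "t2_b13", "x_b1", "total"]

# "." matches exactly one arbitrary character between "t" and "_b".
dot_hits = [n for n in names if re.search(r"t._b", n)]

# "t*" means zero or more "t", so any name containing "_b" matches.
star_hits = [n for n in names if re.search(r"t*_b", n)]

print(dot_hits)
print(star_hits)
```

If names should only match from the start, anchoring the pattern as "^t._b" narrows it further.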

Extract left part of the string in SAS?

Is there a function in SAS proc SQL which I can use to extract the left part of a string? It should be something similar to the LEFT function in SQL Server. In SQL I have left(11111111, 4) * 9 = 9999, and I would like to do something similar in SAS proc SQL. Any help will be appreciated.
I had the impression you wanted to repeat the substring rather than multiply, so I'm adding the REPEAT function just out of curiosity.
proc sql;
select
INPUT(SUBSTR('11111111', 1, 4), 4.) * 9 /* if source is char */
, INPUT(SUBSTR(PUT(11111111, 16. -L), 1, 4), 4.) * 9 /* if source is number */
, REPEAT(SUBSTR(PUT(11111111, 16. -L), 1, 4), 9) /* repeat instead of multiply; REPEAT(x, n) returns n+1 copies */
FROM SASHELP.CLASS (obs=1)
;
quit;
substr("some text",1,4) will give you "some". This function works the same way in a lot of SQL implementations.
Also, note that this is a string function, but in your example you're applying it to a number. SAS will let you do this, but in general it's wise to control your conversion between strings and numbers with put() and input() functions to keep your log clean and be sure that you're only converting where you actually intend to.
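The explicit number-to-string-to-number round trip can be sketched in a few lines. This is a Python illustration of the same steps, not SAS code; left_number is a hypothetical helper, with str() playing the role of put() and int() the role of input().

```python
def left_number(value, n):
    """Take the leftmost n digits of a number via explicit conversions."""
    text = str(value)      # explicit number -> string, like put()
    return int(text[:n])   # explicit string -> number, like input()

result = left_number(11111111, 4) * 9
print(result)
```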
You might be looking for the SUBSTRN function.
SUBSTRN(string, position <, length>)
Arguments
string specifies a character or numeric constant, variable, or expression. If string is numeric, then it is converted to a character value that uses the BEST32. format. Leading and trailing blanks are removed, and no message is sent to the SAS log.
position is an integer that specifies the position of the first character in the substring.
length is an integer that specifies the length of the substring. If you do not specify length, the SUBSTRN function returns the substring that extends from the position that you specify to the end of the string.
As others have pointed out, substr() is the function you are looking for, although I feel that a more useful answer would also 'teach you how to fish'.
A great way to find out about SAS functions is to google sas functions by category which at the time of writing this post will direct you here:
SAS Functions and CALL Routines by Category
It's worth scanning through this list at least once just to get an idea of all of the functions available.
If you're after a specific version, you may want to include the SAS version number in your search. Note that the link above is for 9.2.
If you have scanned through all the functions, and still can't find what you are looking for, then your next option may be to write your own SAS function using proc fcmp. If you ever need assistance with doing this, then I suggest posting a new question.