Specify the number of characters that should match a LIKE REGEX in T-SQL - regex

I've done a ton of Googling on this and can't find the answer. Or, at least, not the answer I am hoping to find. I am attempting to convert a REGEXP_SUBSTR search from Teradata into T-SQL on SQL Server 2016.
This is the way it is written in Teradata:
REGEXP_SUBSTR(cn.CONTRACT_PD_AOR,'\b([a-zA-Z]{2})-([[:digit:]]{2})-([[:digit:]]{3})(-([a-zA-Z]{2}))?\b')
The numbers in the curly brackets specify how many characters must match each part of the REGEXP. So, this is looking for a contract number in this format: XX-99-999-XX
Is this not possible in T-SQL? Specifying the number of characters to match? So I would have to write something like this:
where CONTRACT_PD_AOR like '[a-zA-Z][a-zA-Z]-[0-9][0-9]-[0-9][0-9][0-9]-[a-zA-Z][a-zA-Z]%'
Is there not a simpler way to go about it?

While not an answer, this method makes things a little less painful. It's a way to set a format once and reuse it if you need it multiple times in your code, while keeping the code clean and readable.
Set a format variable at the top, then do the needed replaces to build the LIKE pattern from it. Then use the variable in the code. It saves a little typing, makes your code less fugly, and makes the format reusable across multiple queries.
-- Human-readable template: X = one letter, 9 = one digit
Declare @fmt_CONTRACT_PD_AOR nvarchar(max) = 'XX-99-999-XX';
-- Expand the template into a LIKE pattern
Set @fmt_CONTRACT_PD_AOR = REPLACE(@fmt_CONTRACT_PD_AOR, '9', '[0-9]');
Set @fmt_CONTRACT_PD_AOR = REPLACE(@fmt_CONTRACT_PD_AOR, 'X', '[a-zA-Z]');
with tbl(str) as (
select 'AA-23-234-ZZ' union all
select 'db-32-123-dd' union all
select 'ab-123-88-kk'
)
select str from tbl
where str like @fmt_CONTRACT_PD_AOR;
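For comparison, here is the same template idea as a minimal Python sketch. It is not T-SQL; the helper name template_to_regex is made up for illustration, and it assumes 'X' means one ASCII letter and '9' means one digit, as in the answer above.

```python
import re

# Hypothetical helper: expand a readable template into a compiled regex.
# 'X' -> one ASCII letter, '9' -> one digit, anything else is a literal.
def template_to_regex(template: str) -> re.Pattern:
    parts = []
    for ch in template:
        if ch == "X":
            parts.append("[a-zA-Z]")
        elif ch == "9":
            parts.append("[0-9]")
        else:
            parts.append(re.escape(ch))  # literal characters such as '-'
    return re.compile("".join(parts))

contract_format = template_to_regex("XX-99-999-XX")

rows = ["AA-23-234-ZZ", "db-32-123-dd", "ab-123-88-kk"]
# fullmatch mirrors LIKE without a trailing %: the whole string must fit.
matches = [s for s in rows if contract_format.fullmatch(s)]
```

Only the first two sample rows fit the template; the third has its digit groups in the wrong order.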

Possible combination (variations) of words in a string variable in stata

I have a string variable containing school names and I need to find all the possible combinations of each word in this string variable in Stata:
For example, variations of the word "Academy" would be:
Academy,
Academy,
acdamey,
aacdemy,
dmcaamy,
aacedmy,
and so on.
I need this to standardize the raw data of school names, which has many typos of each word due to data entry issues, like the ones given above for "academy".
Depending whether your data is already in the Excel sheets or a file, you can either use regex trying to match all possible combinations (and probably fix them when found) or parse the strings first before bringing them into Excel. In either case you could make a file (or Excel list/table/area/etc.) that includes all the common typos and pick each typo as regex match to use when comparing to your actual input.
Writing a regexp that actually finds all possible cases is next to impossible, especially if very similar (but correct) school names exist. In any case, direct regexps would be very messy and complex, so I would advise you to parse the data by first finding the correct form, excluding it, and then using a (greedy) search/regex to find the typoed versions. You can then save the typos to use them as a filter/match/pattern.
To get some starting ideas, check these links:
Regex: Search for verb roots
Read text file and extract string into Excel sheet using regex
P.S. You should keep a count of all strings/school names and, at the end, get a list of all names that did not match the correct form or any of your regexp filters, so you can manually insert/correct them.
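The workflow described above (match the correct form first, then known typos, then collect whatever is left for manual review) can be sketched roughly like this. It is Python rather than Stata or Excel, and the word lists are invented sample data, not a real typo table.

```python
# Assumed sample data: one canonical spelling and a hand-built typo list.
CANONICAL = {"academy"}
KNOWN_TYPOS = {
    "acdamey": "academy",
    "aacdemy": "academy",
    "aacedmy": "academy",
}

def standardize(words):
    """Return (cleaned words, words that matched nothing)."""
    cleaned, unmatched = [], []
    for word in words:
        key = word.lower()
        if key in CANONICAL:          # already the correct form
            cleaned.append(key)
        elif key in KNOWN_TYPOS:      # a typo we have seen before
            cleaned.append(KNOWN_TYPOS[key])
        else:                         # keep for manual correction
            unmatched.append(word)
    return cleaned, unmatched

cleaned, unmatched = standardize(["Academy", "acdamey", "aacedmy", "Acadmy"])
```

Anything that ends up in unmatched is a candidate for a new entry in the typo table.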

Wildcard character

I've a dataframe, and I'm trying to select columns with certain properties in the name.
One example (of many) is I want to select columns called "t*_b**" where * would be a wildcard character. This would select columns with names t1_b2, t2_b2, t3_b2 and t4_b2 (as well as several others like t1_b13, t2_b13 etc.).
If there is such a wildcard character I could use, I know that I could just use the following command:
grep("t*_b", names(df))
As opposed to doing:
c(grep("t1_b", names(df)), grep("t2_b", names(df)), grep("t3_b", names(df)), grep("t4_b", names(df)))
which is messier and harder to read.
Update: the first comment has resolved my issue. I don't have any real need for any further input, thanks for the help!
The wildcard 'character' in regular expressions is a period (.), which matches any single character. As such, you could do
grep("t._b", names(df))
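To see why the original "t*_b" attempt behaves differently, here is a small sketch using Python's re module instead of R's grep (the two agree on these basic patterns). The column names are made-up examples.

```python
import re

# Assumed example column names.
names = ["t1_b2", "t2_b2", "t3_b2", "t4_b13", "x_b2", "total"]

# '.' matches exactly one character, so "t._b" needs a 't', one more
# character, and then "_b".
dot_matches = [n for n in names if re.search(r"t._b", n)]

# '*' is a quantifier, not a wildcard: "t*_b" means zero or more 't'
# followed by "_b", so it also matches names with no 't' before "_b".
star_matches = [n for n in names if re.search(r"t*_b", n)]
```

Here star_matches picks up x_b2 as well, which is usually not what was intended.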

How do I use numeric functions to correct date typos?

I know it's easy enough to do manual corrections on date typos, but I want to automate such corrections using one or more SAS functions, given that my dataset is large and typos are frequent.
For instance, it seems that whoever created the dataset I am cleaning often transposed digits in the year of someone's birthdate (e.g., '2102' rather than '2012', '2110' instead of '2010', etc). I'm aware of string functions such as INDEX() that find certain character values or strings and then allow for the replacement of said characters in the same position (i.e., replace "ABCD" with "ABBB", regardless of the string's location in a value). Can the same process be replicated with numeric (and specifically date) values?
I don't think SAS has any functions that would check numeric values for digit patterns. I often do data cleaning and address this issue by making a character variable out of the numeric date variable, then using character functions and Perl regex to clean the character values, and then storing the cleaned values as numeric date.
For specifically date values, you could try using SAS date functions (e.g. DAY(), MONTH(), YEAR(), MDY(), etc.) to extract parts of the date value, error-check them, and put them all back together into a date value. This could be a good quick solution if you expect a limited set of typos and you roughly know what they are. For a more thorough error check, converting the numeric values to character and using char or regex functions would give you more options.
The only really concise suggestion I can imagine is using mdy (assuming these are date variables, not datetime variables).
For example:
data want;
set have;
if year(datevar) > 2100 then
datevar = mdy(month(datevar),day(datevar),year(datevar)-90);
run;
would correct any '2104' to '2014'. That's a very simple correction (and may well do as much harm as good, since '2114' is also a possible typo), but things along those lines - break the date up into its pieces, verify the pieces, reconstruct using mdy.
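The same "break it up, verify, reconstruct" idea, as a rough Python sketch rather than SAS. The function name and the 2100 cutoff are assumptions carried over from the example above, and the 90-year shift has the same caveat as the SAS version.

```python
from datetime import date

def shift_late_year(d: date, cutoff: int = 2100) -> date:
    """Move dates with a year after `cutoff` back by 90 years (2104 -> 2014)."""
    if d.year <= cutoff:
        return d  # plausible year, leave untouched
    try:
        # Rebuild the date from its pieces with the corrected year.
        return date(d.year - 90, d.month, d.day)
    except ValueError:
        # The rebuilt date does not exist (e.g. Feb 29 in a non-leap year);
        # leave the value unchanged so it can be reviewed by hand.
        return d

fixed = shift_late_year(date(2104, 5, 17))
untouched = shift_late_year(date(2012, 1, 1))
leap_case = shift_late_year(date(2104, 2, 29))
```

The leap-day case is the kind of thing the "verify the pieces" step is for: 2104 has a Feb 29 but 2014 does not.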

Using a RegEx in a SQL Query

Here's the situation I'm in: We have a field in our database that contains a 3 digit number, surrounded by some text. This number is actually a PK in another table, and I need to extract this out so I can implement a proper FK relationship. Here's an example of what would currently reside in the column:
Some Text Goes Here - (305) Followed By Some More Text
So, what I'm looking to do is extract the '305' from the column, and hopefully end up with a result that looks something like this (pseudo code)
SELECT
<My Extracted Value>,
Original Column Text,
Id
FROM dbo.MyTable
It seems to me that using a Regex match in my query is the most effective way to do this. Can anybody point me in the right direction?
EDIT: We're using SQL Server 2005
RegExp in SQL is defined by the SQL standard, but most databases implement their own syntax. You should tell us the product name of your RDBMS ;)
This is based on Pranay's first answer that has since been changed.
DECLARE @NumStr varchar(1000)
SET @NumStr = 'Some Text Goes Here - (305) Followed By Some More Text';
-- PATINDEX finds the first run of three digits; SUBSTRING pulls them out
SELECT SUBSTRING(@NumStr,PATINDEX('%[0-9][0-9][0-9]%',@NumStr),3)
Returns 305
Microsoft seems to suggest using a CLR assembly to do Regex pattern matching in SQL Server 2005.
http://msdn.microsoft.com/en-us/magazine/cc163473.aspx
Apart from LIKE (which is not going to solve your problem) I don't know of a built-in pattern matching functionality in SQL Server 2005 (that is, more advanced than simple string searches).
Just after I implemented a solution in Postgres, I see you are using SQL Server... Just for the record, then, here it is with a regex that extracts the data in parentheses.
PostgreSQL solution:
create table main(id text not null);
insert into main values('some text (44) other text');
insert into main values('and more text (78) and even more');
select substring(id from E'\\(([^\\(]+)\\)') from main;
The only way to access RegEx-type functions in SQL 2005 (and probably 2008) is by writing (or downloading) and using CLR functions.
If all the strings are always formatted in such a way that you can identify the specific numbers you want, you can do something like the following. This is based on the (big) assumption that the first set of parentheses found in the string contains the number that you want.
/*
CREATE TABLE MyTable
(
MyText varchar(500) not null
)
INSERT MyTable values ('Some Text Goes Here - (305) Followed By Some More Text')
*/
SELECT
MyText -- String
,charindex('(', MyText) -- Where's the open parenthesis
,charindex(')', MyText) -- Where's the closed parenthesis
,substring(MyText
,charindex('(', MyText) + 1
,charindex(')', MyText) - charindex('(', MyText) - 1) -- Glom it all together
from MyTable
Awkward as heck (because SQL has a pathetically limited set of string manipulation functions), but it works.
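For contrast, this is roughly what the extraction looks like when a real regex engine is available (which is what a CLR function would give you). The sketch below uses Python's re module purely as an illustration; it is not something SQL Server 2005 can run directly.

```python
import re

text = "Some Text Goes Here - (305) Followed By Some More Text"

# Look for exactly three digits wrapped in parentheses and capture the digits.
match = re.search(r"\((\d{3})\)", text)
extracted = match.group(1) if match else None
```

Unlike the CHARINDEX version, this only matches when the parentheses actually contain a 3-digit number.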

Use cases for regular expression find/replace

I recently discussed editors with a co-worker. He uses one of the less popular editors and I use another (I won't say which ones since it's not relevant and I want to avoid an editor flame war). I was saying that I didn't like his editor as much because it doesn't let you do find/replace with regular expressions.
He said he's never wanted to do that, which was surprising since it's something I find myself doing all the time. However, off the top of my head I wasn't able to come up with more than one or two examples. Can anyone here offer some examples of times when they've found regex find/replace useful in their editor? Here's what I've been able to come up with since then as examples of things that I've actually had to do:
Strip the beginning of a line off of every line in a file that looks like:
Line 25634 :
Line 632157 :
Taking a few dozen files with a standard header which is slightly different for each file and stripping the first 19 lines from all of them all at once.
Piping the result of a MySQL select statement into a text file, then removing all of the formatting junk and reformatting it as a Python dictionary for use in a simple script.
In a CSV file with no escaped commas, replace the first character of the 8th column of each row with a capital A.
Given a bunch of GDB stack traces with lines like
#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) at ../../mvl/src/mvl_serv.c:850
strip out everything from each line except the function names.
Does anyone else have any real-life examples? The next time this comes up, I'd like to be more prepared to list good examples of why this feature is useful.
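Two of the examples above, written out as a small sketch. Python's re.sub stands in for an editor's find/replace; only the first stack-trace line comes from the question, the rest of the sample input is made up.

```python
import re

# 1) Strip a "Line 25634 : " style prefix (the text after it is invented).
log_line = "Line 25634 : some text"
stripped = re.sub(r"^Line \d+ :\s*", "", log_line)

# 2) Reduce GDB stack-trace lines to just the function name.
trace = [
    "#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) "
    "at ../../mvl/src/mvl_serv.c:850",
    "#4 0x080a1234 in main () at main.c:10",  # hypothetical second frame
]
frame = re.compile(r"^#\d+\s+(?:0x[0-9a-f]+ in )?(\w+) \(.*$")
function_names = [frame.sub(r"\1", line) for line in trace]
```

In an editor the second one is a single replace-all: search for the pattern, replace with the first captured group.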
Just last week, I used regex find/replace to convert a CSV file to an XML file.
Simple enough to do really, just chop up each field (luckily it didn't have any escaped commas) and push it back out with the appropriate tags in place of the commas.
Regex make it easy to replace whole words using word boundaries.
(\b\w+\b)
So you can replace unwanted words in your file without disturbing words like Scunthorpe
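A quick sketch of the difference, using Python's re.sub in place of an editor and a harmless made-up sentence:

```python
import re

text = "thorpe is an old word for village; Scunthorpe is a town."

# With word boundaries only the standalone word is replaced.
whole_word = re.sub(r"\bthorpe\b", "hamlet", text)

# A plain substring replace also mangles the longer word.
naive = text.replace("thorpe", "hamlet")
```

whole_word leaves Scunthorpe alone, while naive turns it into Scunhamlet.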
Yesterday I took a create table statement I made for an Oracle table and converted the fields to setString() method calls using JDBC and PreparedStatements. The table's field names were mapped to my class properties, so regex search and replace was the perfect fit.
Create Table text:
...
field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,
....
My Regex Search:
/([a-z0-9_]+) .*/
My Replacement:
pstmt.setString(1, \1);
The result:
...
pstmt.setString(1, field_1);
pstmt.setString(1, field_2);
pstmt.setString(1, field_3);
pstmt.setString(1, field_4);
....
I then went through and manually set the position int for each call and changed the method to setInt() (and others) where necessary, but that worked handy for me. I actually used it three or four times for similar field to method call conversions.
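If the replace is scripted instead of done in an editor, the position number can be filled in at the same time. A minimal sketch in Python, assuming the column list has already been cut out of the CREATE TABLE statement; it still emits setString() for every column, so the setInt() fix-ups remain manual.

```python
import re
from itertools import count

ddl = """field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,"""

position = count(1)  # 1, 2, 3, ... for the PreparedStatement index

def to_setter(match):
    # match.group(1) is the column name at the start of the line.
    return f"pstmt.setString({next(position)}, {match.group(1)});"

# MULTILINE makes ^ and $ apply to each line of the column list.
java_lines = re.sub(r"^(\w+) .*$", to_setter, ddl, flags=re.MULTILINE)
```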
I like to use regexps to reformat lists of items like this:
int item1
double item2
to
public void item1(int item1){
}
public void item2(double item2){
}
This can be a big time saver.
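As a sketch, the search pattern captures the type and the name, and the replacement reuses both. Shown here with Python's re.sub; in most editors the group references are written \1 and \2 or $1 and $2.

```python
import re

fields = "int item1\ndouble item2"

# Group 1 is the type, group 2 is the name; MULTILINE handles one field per line.
methods = re.sub(
    r"^(\w+) (\w+)$",
    r"public void \2(\1 \2){\n}",
    fields,
    flags=re.MULTILINE,
)
```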
I use it all the time when someone sends me a list of patient visit numbers in a column (say 100-200) and I need them in a '0000000444','000000004445' format. Works wonders for me!
I also use it to pull out email addresses in an email. I send out group emails often and all the bounced returns come back in one email. So, I regex to pull them all out and then drop them into a string var to remove from the database.
I even wrote a little dialog prog to apply regex to my clipboard. It grabs the contents applies the regex and then loads it back into the clipboard.
One thing I use it for in web development all the time is stripping some text of its HTML tags. This might need to be done to sanitize user input for security, or for displaying a preview of a news article. For example, if you have an article with lots of HTML tags for formatting, you can't just do LEFT(article_text,100) + '...' (plus a "read more" link) and render that on a page at the risk of breaking the page by splitting apart an HTML tag.
Also, I've had to strip img tags in database records that link to images that no longer exist. And let's not forget web form validation. If you want to make sure a user has entered a correct email address (syntactically speaking) into a web form, this is about the only way of checking it thoroughly.
I've just pasted a long character sequence into a string literal, and now I want to break it up into a concatenation of shorter string literals so it doesn't wrap. I also want it to be readable, so I want to break only after spaces. I select the whole string (minus the quotation marks) and do an in-selection-only replace-all with this regex:
/.{20,60} /
...and this replacement:
/$0"¶ + "/
...where the pilcrow is an actual newline, and the number of spaces varies from one incident to the next. Result:
String s = "I recently discussed editors with a co-worker. He uses one "
+ "of the less popular editors and I use another (I won't say "
+ "which ones since it's not relevant and I want to avoid an "
+ "editor flame war). I was saying that I didn't like his "
+ "editor as much because it doesn't let you do find/replace "
+ "with regular expressions.";
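The same replace can be reproduced outside an editor. A rough sketch with Python's re.sub, where the newline and a four-space indent stand in for the pilcrow and the variable spacing; the sample text is shortened from the question.

```python
import re

text = (
    "I recently discussed editors with a co-worker. He uses one of the less "
    "popular editors and I use another (I won't say which ones since it's not "
    "relevant and I want to avoid an editor flame war)."
)

SEPARATOR = '"\n    + "'  # close the literal, start a new line, reopen it

# Greedily take 20 to 60 characters ending in a space, then break there.
pieces = re.sub(r".{20,60} ", lambda m: m.group(0) + SEPARATOR, text)
java_source = 'String s = "' + pieces + '";'
```

Because every break comes right after a space, joining the pieces back together gives the original text unchanged.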
The first thing I do with any editor is try to figure out its Regex oddities. I use it all the time. Nothing really crazy, but it's handy when you've got to copy/paste stuff between different types of text - SQL <-> PHP is the one I do most often - and you don't want to fart around making the same change 500 times.
Regex is very handy any time I am trying to replace a value that spans multiple lines. Or when I want to replace a value with something that contains a line break.
I also like that you can match things in a regular expression and not replace the full match using the $# syntax to output the portion of the match you want to maintain.
I agree with you on points 3, 4, and 5 but not necessarily points 1 and 2.
In some cases 1 and 2 are easier to achieve using an anonymous keyboard macro.
By this I mean doing the following:
Position the cursor on the first line
Start a keyboard macro recording
Modify the first line
Position the cursor on the next line
Stop recording.
Now all that is needed to modify the next line is to repeat the macro.
I could live without support for regex but could not live without anonymous keyboard macros.