How to compare text patterns in SAS Studio?

I am doing code migration from Teradata SAS to Snowflake SAS Studio.
The goal is to compare whether two text sentences are similar. For example:
column_1: 'I want to eat 100 apples!'
column_2: 'I want to eat 20 apples!'
          'I want to eat 100 bananas!'
In this case the number of fruits doesn't matter; I want to find the cell that matches 'I want to eat apples', so 'I want to eat 20 apples!' should be my result.
Originally, in Teradata SQL we combined RegExp_Replace and OREPLACE to do the trick, but it seems the RegExp_Replace function has changed in SAS Studio.
Any suggestions?
Many thanks!
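Not an answer from the thread, just a sketch of the normalize-then-compare idea described above: strip the digits (and the space after them) from both columns before comparing, so the fruit counts are ignored. The snippet is Snowflake SQL with made-up table and column names (sentences_a / sentences_b):
-- sketch only: join rows whose sentences match once the numbers are removed
select t1.column_1, t2.column_2
from sentences_a t1
join sentences_b t2
  on regexp_replace(t1.column_1, '[0-9]+ ?', '')
   = regexp_replace(t2.column_2, '[0-9]+ ?', '');
-- 'I want to eat 100 apples!' pairs with 'I want to eat 20 apples!'
-- but not with 'I want to eat 100 bananas!'
If the comparison has to happen on the SAS side rather than in pass-through SQL, PRXCHANGE('s/[0-9]+ ?//', -1, column_1) does the same normalization in a DATA step.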

Related

Specify the number of characters that should match a LIKE REGEX in T-SQL

I've done a ton of Googling on this and can't find the answer. Or, at least, not the answer I am hoping to find. I am attempting to convert a REGEXP_SUBSTR search from Teradata into T-SQL on SQL Server 2016.
This is the way it is written in Teradata:
REGEXP_SUBSTR(cn.CONTRACT_PD_AOR,'\b([a-zA-Z]{2})-([[:digit:]]{2})-([[:digit:]]{3})(-([a-zA-Z]{2}))?\b')
The numbers in the curly brackets specify the number of characters that can match the specific REGEXP. So, this is looking for a contract number that looks like this format: XX-99-999-XX
Is this not possible in T-SQL? Specifying the number of characters to look at? Otherwise I would have to write something like this:
where CONTRACT_PD_AOR like '[a-zA-Z][a-zA-Z]-[0-9][0-9]-[0-9][0-9][0-9]-[a-zA-Z][a-zA-Z]%'
Is there not a simpler way to go about it?
While not an answer, this method makes things a little less painful. It is a way to set up a format once and reuse it wherever you need it, while keeping your code clean and readable.
Set a format variable at the top, then do the needed replaces to build it, and use the variable in your query. It saves a little typing, makes your code less fugly, and the format variable stays reusable should you need it in multiple queries.
Declare @fmt_CONTRACT_PD_AOR nvarchar(max) = 'XX-99-999-XX';
Set @fmt_CONTRACT_PD_AOR = REPLACE(@fmt_CONTRACT_PD_AOR, '9', '[0-9]');
Set @fmt_CONTRACT_PD_AOR = REPLACE(@fmt_CONTRACT_PD_AOR, 'X', '[a-zA-Z]');
with tbl(str) as (
select 'AA-23-234-ZZ' union all
select 'db-32-123-dd' union all
select 'ab-123-88-kk'
)
select str from tbl
where str like @fmt_CONTRACT_PD_AOR;
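Not part of the original answer, but a further (untested) variation on the same idea: LIKE has no {n} quantifier, so REPLICATE can expand the repeat counts for you, which keeps the Teradata-style counts visible in the T-SQL:
Declare @fmt_CONTRACT_PD_AOR nvarchar(max) =
    REPLICATE('[a-zA-Z]', 2) + '-' + REPLICATE('[0-9]', 2) + '-' +
    REPLICATE('[0-9]', 3) + '-' + REPLICATE('[a-zA-Z]', 2);
with tbl(str) as (
select 'AA-23-234-ZZ' union all
select 'db-32-123-dd' union all
select 'ab-123-88-kk'
)
select str from tbl
where str like @fmt_CONTRACT_PD_AOR;
This builds the same expanded pattern as above; like the original answer, it treats the trailing -XX as required rather than optional, unlike the Teradata regex.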

Filter condition in Redshift

I am fairly new to Redshift and I have the following postcodes in my table:
B13 7GB
BA43 87F
BR8 H4D
B4H HFT
I would like to extract only the rows where there is a number from 0-9 after the first letter.
Expected output
B13 7GB
B4H HFT
Thank you
Perhaps someone can provide a "simpler" answer, but I like using REGEX functions to be as precise as possible.
select code
from tbl
where regexp_count(code,'^[A-Z]{1}[0-9]{1}')>0
^ -> Only check the start of the string (our code).
[A-Z]{1} -> Search for ONE Capital Letter.
[0-9]{1} -> Search for ONE number from 0-9.
All together:
At the start of the string, search for ONE capital letter that is followed by ONE number from 0-9.
Regexp is very flexible but in Redshift it is not very fast. If your dataset is very large you will likely be better off for this simple case with SIMILAR TO.
select *
from tbl
where code SIMILAR TO '_[0-9]%';
LIKE and SIMILAR TO are less flexible than regexp but compile and run faster. In general, the more flexible the syntax, the harder it is to execute. This (and backward compatibility) is why LIKE and SIMILAR TO still exist.

Toad: Count characters in SQL statement

Toad for Oracle 12:
I'm working in a system that has an unfortunate limitation where an SQL query's FROM clause can only have up to 1000 characters (including spaces). This becomes a problem when the FROM clause has a lengthy subquery in it (> 1000 chars).
So, when writing SQL in Toad, I need a way to highlight the SQL in the FROM clause and count the characters in the highlighted text, including spaces.
Currently, I copy the text into MS Word and do a character count there. That works, but it would be better if I could do the character count right in Toad.
Question:
Does Toad 12 have SQL character count functionality?
As @SamM mentioned, the bottom panel/status bar shows the character count when I highlight text in the SQL editor window.

R dividing texts in tm package - recognizing speakers

I am trying to identify the most frequently used words in congress speeches, and have to separate them by congressperson. I am just starting to learn about R and the tm package. I have code that can find the most frequent words, but what kind of code can I use to automatically identify and store the speaker of each speech?
Text looks like this:
OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN
The Chairman. Good afternoon to everybody, and thank you
very much for coming to this hearing this afternoon.
In today's tough economic climate, millions of seniors have
lost a big part of their retirement and investments in only a
matter of months. Unlike younger Americans, they do not have
time to wait for the markets to rebound in order to recoup a
lifetime of savings.
[....]
STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER
[....]
I would like to be able to get these names, or separate the text by speaker. Hope you can help me. Thanks a lot.
Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speaker's names.
If so, you might say x is your text, then use strsplit(x, "STATEMENT OF") to split on the words STATEMENT OF, then grep() or str_extract() to return the 2 or 3 words after SENATOR (do they always have only two names as in your example?).
Have a look here for more on the use of these functions, and text manipulation in general in R: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
UPDATE: Here's a more complete answer...
#create object containing all text
x <- c("OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN
The Chairman. Good afternoon to everybody, and thank you
very much for coming to this hearing this afternoon.
In today's tough economic climate, millions of seniors have
lost a big part of their retirement and investments in only a
matter of months. Unlike younger Americans, they do not have
time to wait for the markets to rebound in order to recoup a
lifetime of savings.
STATEMENT OF SENATOR BIG APPLE KOHL, CHAIRMAN
I am trying to identify the most frequently used words in the
congress speeches, and have to separate them by the congressperson.
I am just starting to learn about R and the tm package. I have a code
that can find the most frequent words, but what kind of a code can I
use to automatically identify and store the speaker of the speech
STATEMENT OF SENATOR LITTLE ORANGE, CHAIRMAN
Would it be correct to say that you want
to split the file so you have one text object
per speaker? And then use a regular expression
to grab the speaker's name for each object? Then
you can write a function to collect word frequencies,
etc. on each object and put them in a table where the
row or column names are the speaker's names.")
# split object on first two words
y <- unlist(strsplit(x, "STATEMENT OF"))
#load library containing handy function
library(stringr)
# use word() to return words in positions 3 to 4 of each string, which is where the first and last names are
z <- word(y[2:4], 3, 4) # note that the first element of the character vector y has only one word, and this function gives an error if there are not enough words in the line
z # have a look at the result...
[1] "HERB KOHL," "BIG APPLE" "LITTLE ORANGE,"
No doubt a regular expression wizard could come up with something to do it quicker and neater!
Anyway, from here you can run a function to calculate word frequencies on each element of the vector y (i.e. each speaker's speech) and then make another object that combines the word frequency results with the names for further analysis.
This is how I'd approach it using Ben's example (use qdap to parse and create a dataframe and then convert to a Corpus with 3 documents; note that qdap was designed for transcript data like this and a Corpus may not be the best data format):
library(qdap)
# split the raw text into individual lines
dat <- unlist(strsplit(x, "\\n"))
# find the "STATEMENT OF ..." header lines
locs <- grep("STATEMENT OF ", dat)
# speaker name = the text between "STATEMENT OF " and the comma
nms <- sapply(strsplit(dat[locs], "STATEMENT OF |,"), "[", 2)
# replace each header line with a marker to split the speeches on
dat[locs] <- "SPLIT_HERE"
# rebuild the text, split it on the marker (dat[-1] drops the leading marker),
# and build a tm corpus with one document per speaker
corp <- with(data.frame(person=nms, dialogue =
Trim(unlist(strsplit(paste(dat[-1], collapse=" "), "SPLIT_HERE")))),
df2tm_corpus(dialogue, person))
tm::inspect(corp)
## A corpus with 3 text documents
##
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
## create_date creator
## Available variables in the data frame are:
## MetaID
##
## $`SENATOR BIG APPLE KOHL`
## I am trying to identify the most frequently used words in the congress speeches, and have to separate them by the congressperson. I am just starting to learn about R and the tm package. I have a code that can find the most frequent words, but what kind of a code can I use to automatically identify and store the speaker of the speech
##
## $`SENATOR HERB KOHL`
## The Chairman. Good afternoon to everybody, and thank you very much for coming to this hearing this afternoon. In today's tough economic climate, millions of seniors have lost a big part of their retirement and investments in only a matter of months. Unlike younger Americans, they do not have time to wait for the markets to rebound in order to recoup a lifetime of savings.
##
## $`SENATOR LITTLE ORANGE`
## Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speaker's names.

Using a RegEx in a SQL Query

Here's the situation I'm in: We have a field in our database that contains a 3 digit number, surrounded by some text. This number is actually a PK in another table, and I need to extract this out so I can implement a proper FK relationship. Here's an example of what would currently reside in the column:
Some Text Goes Here - (305) Followed By Some More Text
So, what I'm looking to do is extract the '305' from the column, and hopefully end up with a result that looks something like this (pseudo code)
SELECT
<My Extracted Value>,
Original Column Text,
Id
FROM dbo.MyTable
It seems to me that using a Regex match in my query is the most effective way to do this. Can anybody point me in the right direction?
EDIT: We're using SQL Server 2005
RegExp in SQL is defined by the SQL standard, but most databases implement their own syntax; you should tell us the product name of your RDBMS ;)
This is based on Pranay's first answer that has since been changed.
DECLARE @NumStr varchar(1000)
SET @NumStr = 'Some Text Goes Here - (305) Followed By Some More Text';
SELECT SUBSTRING(@NumStr,PATINDEX('%[0-9][0-9][0-9]%',@NumStr),3)
Returns 305
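Not from the thread: tying that back to the shape of the query sketched in the question, and assuming the text lives in a column called MyText next to an Id column in dbo.MyTable (as in the sample table further down), the PATINDEX approach drops straight into the SELECT:
SELECT
    CAST(SUBSTRING(MyText, PATINDEX('%[0-9][0-9][0-9]%', MyText), 3) AS int) AS ExtractedKey, -- adjust the cast to the key's actual type
    MyText,
    Id
FROM dbo.MyTable
-- assumes every row really contains a 3-digit number; rows without one would make the CAST fail and need separate handling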
Microsoft seems to suggest using a CLR assembly to do Regex pattern matching in SQL Server 2005.
http://msdn.microsoft.com/en-us/magazine/cc163473.aspx
Apart from LIKE (which is not going to solve your problem), I don't know of any built-in pattern-matching functionality in SQL Server 2005 (that is, anything more advanced than simple string searches).
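For what it's worth (not from the thread): once such a CLR assembly is deployed, the call site is just an ordinary scalar function. The function name and signature below are purely illustrative, not the linked article's actual API, and the MyText/Id columns are the same assumed ones as above:
-- hypothetical CLR scalar UDF dbo.RegexMatch(@input, @pattern) returning 1/0
SELECT Id, MyText
FROM dbo.MyTable
WHERE dbo.RegexMatch(MyText, '\([0-9]{3}\)') = 1;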
Just after I implemented a solution in Postgres, I saw you are using SQL Server... Just for the record, then, here is a regex that extracts the data in parentheses.
Postgresql solution:
create table main(id text not null);
insert into main values('some text (44) other text');
insert into main values('and more text (78) and even more');
select substring(id from '\\(([^\\(]+)\\)') from main
The only way to access RegEx-type functions in SQL 2005 (and probably 2008) is by writing (or downloading) and using CLR functions.
If all the strings are always formatted in such a way that you can identify the specific numbers you want, you can do something like the following. This is based on the (big) assumption that the first set of parentheses found in the string contains the number that you want.
/*
CREATE TABLE MyTable
(
MyText varchar(500) not null
)
INSERT MyTable values ('Some Text Goes Here - (305) Followed By Some More Text')
*/
SELECT
MyText -- String
,charindex('(', MyText) -- Where's the open parenthesis
,charindex(')', MyText) -- Where's the closed parenthesis
,substring(MyText
    ,charindex('(', MyText) + 1
    ,charindex(')', MyText) - charindex('(', MyText) - 1) -- Glom it all together
from MyTable
Awkward as heck (because SQL has a pathetically limited set of string manipulation functions), but it works.