I'm trying to use Snowflake's match_recognize tool to match a particular pattern across rows. The pattern consists of any sequence of a's and b's, provided that long runs of b's are excluded. In my test case, I want to allow runs of up to 4 b's to be included in the match.
Using the handy https://regexr.com/?2tp0k website, I was able to build the desired regexp:
((ab{0,4})+a)|a+
Applying it to this string:
baabbbaaaaaaaababbabbabbabbbabbbab
I get this one match (shown in square brackets), which I am happy with:
b[aabbbaaaaaaaababbabbabbabbbabbba]b
As desired, this is absorbing into the match any run of b's that is 4 or shorter. (It doesn't pick up b at the beginning of the string or the b at the end, but that is expected.) Also note that while it doesn't contain any long runs of b's, there are a bunch of b's spread throughout that match.
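As a sanity check outside Snowflake, the expected match can be reproduced with an ordinary regex engine. This is only a sketch that assumes Python's re module; regexr.com uses JavaScript's engine, which should behave the same way for this pattern. The variable names are illustrative.

```python
import re

# Same test string and pattern as in the question.
text = "baabbbaaaaaaaababbabbabbabbbabbbab"
pattern = re.compile(r"((ab{0,4})+a)|a+")

# search() returns the leftmost match; the greedy quantifiers extend it
# as far as the pattern allows.
m = pattern.search(text)

print(m.span())    # (start, end) offsets, end exclusive
print(m.group(0))  # the matched text
```

The match starts at the first a and ends at the last a, leaving out only the leading and trailing b, which is the behaviour I expected from match_recognize as well.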
For some reason, when I use this regular expression with Snowflake's match_recognize pattern, it doesn't take up all of the short runs of b's.
Hence, instead of matching the entire sequence matched above, it matches these:
baabbbaaaaaaaababbabbabbabbbabbbab
Any suggestions?
Here's the query that illustrates the result:
WITH data AS (
    SELECT * FROM VALUES
    ( 0,'b'),( 1,'a'),( 2,'a'),( 3,'b'),( 4,'b'),( 5,'b'),( 6,'a'),( 7,'a'),( 8,'a'),( 9,'a'),
    (10,'a'),(11,'a'),(12,'a'),(13,'a'),(14,'b'),(15,'a'),(16,'b'),(17,'b'),(18,'a'),(19,'b'),
    (20,'b'),(21,'a'),(22,'b'),(23,'b'),(24,'a'),(25,'b'),(26,'b'),(27,'b'),(28,'a'),(29,'b'),
    (30,'b'),(31,'b'),(32,'a'),(33,'b')
)
SELECT * FROM data
match_recognize(
    order by column1
    measures
        match_number() as "MATCH_NUMBER",
        match_sequence_number() as msq,
        classifier() as cl
    all rows per match with unmatched rows
    PATTERN ( ((a b{0,4})+ a) | a+ )
    DEFINE
        a as column2 = 'a',
        b as column2 = 'b'
)
ORDER BY 1;
This produces the result below. Rows 25-27 are not included in the match, and a new match is started at row 28.
[Image of results]
Interestingly enough, here is what happens when the pattern is changed from ((ab{0,4})+a)|a+ to ( ((a | ab | abb | abbb | abbbb)+ a) | a+ ):
WITH data AS (
    SELECT * FROM VALUES
    ( 0,'b'),( 1,'a'),( 2,'a'),( 3,'b'),( 4,'b'),( 5,'b'),( 6,'a'),( 7,'a'),( 8,'a'),( 9,'a'),
    (10,'a'),(11,'a'),(12,'a'),(13,'a'),(14,'b'),(15,'a'),(16,'b'),(17,'b'),(18,'a'),(19,'b'),
    (20,'b'),(21,'a'),(22,'b'),(23,'b'),(24,'a'),(25,'b'),(26,'b'),(27,'b'),(28,'a'),(29,'b'),
    (30,'b'),(31,'b'),(32,'a'),(33,'b')
)
SELECT * FROM data
match_recognize(
    order by column1
    measures
        match_number() as "MATCH_NUMBER",
        match_sequence_number() as msq,
        classifier() as cl
    all rows per match with unmatched rows
    PATTERN ( ((a | ab | abb | abbb | abbbb)+ a) | a+ )
    DEFINE
        a as column2 = 'a',
        b as column2 = 'b'
)
ORDER BY 1;
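A variant of the query, with the pattern ((a+ b{1,4})+ a) | a+ and an explicit AFTER MATCH SKIP PAST LAST ROW: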
WITH data AS (
    SELECT * FROM VALUES
    ( 0,'b'),( 1,'a'),( 2,'a'),( 3,'b'),( 4,'b'),( 5,'b'),( 6,'a'),( 7,'a'),( 8,'a'),( 9,'a'),
    (10,'a'),(11,'a'),(12,'a'),(13,'a'),(14,'b'),(15,'a'),(16,'b'),(17,'b'),(18,'a'),(19,'b'),
    (20,'b'),(21,'a'),(22,'b'),(23,'b'),(24,'a'),(25,'b'),(26,'b'),(27,'b'),(28,'a'),(29,'b'),
    (30,'b'),(31,'b'),(32,'a'),(33,'b')
)
SELECT * FROM data
match_recognize(
    order by column1
    measures
        match_number() as "MATCH_NUMBER",
        match_sequence_number() as msq,
        classifier() as cl
    all rows per match with unmatched rows
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN ( ((a+ b{1,4})+ a) | a+ )
    DEFINE
        a as column2 = 'a',
        b as column2 = 'b'
) ORDER BY 1;
Basically, I have a very long text containing multiple spaces, special characters, etc. in one cell of an Excel file, and I need to extract only specific words from it, each one to a separate cell in another column.
What I'm looking for:
symbols that are always 9 characters in length and always contain at least one digit (up to 9).
So, for example, in A1 I have:
euhe: djj33 dkdakofja. kaowdk ---------- jffjbrjjjj j jrjj 08/01/2222 999ABC123
fjfjfj 321XXX888 .... ........ 123456789AA
And in the end I want to have:
999ABC123 in B1
and
321XXX888 in B2.
Right now I'm doing this with the Text to Columns feature and then looking for the specific words manually, but sometimes the volume is so big that it takes too much time, and it would be cool to automate this.
Can anyone help with this? Thank you!
EDIT:
More examples:
INPUT: '10/01/2016 1,060X 8.999%!!! 1.33 0.666 928888XE0'
OUTPUT: '928888XE0'
INPUT: 'ABCDEBATX ..... ,,00,001% 20///^^ addcA7 7777a 123456789 djaoij8888888 0.000001 12#'
OUTPUT: '123456789'
INPUT: 'FAR687465 B22222222 __ djj^66 20/20/20/20 1:'
OUTPUT: 'FAR687465' in B1 'B22222222' in B2
INPUT: 'fil476 .00 20/.. BUT AAAAAAAAA k98776 000.0001'
OUTPUT: 'blank'
To clarify: the 9-character strings can be anywhere. There is no rule about what comes before or after them; they can be next to each other, or at the beginning and end of this wall of text. The text is random, taken out of some system, and can contain dates or anything else. The symbols are always 9 characters long, and they are not the only 9-character strings in the text. I call them symbols, but they consist only of digits and letters: they can be all digits, but never all letters. The A1 cell can contain multiple spaces/tabs between words/symbols.
Also, if possible, I'd like to do this not only for A1 but for the whole of column A, until the first blank cell is found.
Try this code:
Sub Test()
    Dim r As Range
    Dim matches As Object
    Dim i As Long
    Dim m As Long

    With CreateObject("VBScript.RegExp")
        .Global = True
        ' Any standalone run of exactly 9 letters/digits
        .Pattern = "\b[a-zA-Z\d]{9}\b"
        For Each r In Range("A1", Range("A" & Rows.Count).End(xlUp))
            ' Run the regex once per cell and reuse the result
            Set matches = .Execute(r.Value)
            For i = 0 To matches.Count - 1
                ' Keep only candidates that contain at least one digit
                If matches(i).Value Like "*[0-9]*" Then
                    ' Next free row in column B (row 1 if B is still empty)
                    m = IIf(Cells(1, 2).Value = "", 1, Cells(Rows.Count, 2).End(xlUp).Row + 1)
                    Cells(m, 2).Value = matches(i).Value
                End If
            Next i
        Next r
    End With
End Sub
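For checking the rule against the examples from the question without opening Excel, the same two-step idea (find standalone 9-character alphanumeric tokens, then keep those with at least one digit) can be sketched in Python. This is an illustration only: the function name extract_symbols is made up here, and Python's re module stands in for VBScript.RegExp.

```python
import re

# Standalone runs of exactly 9 ASCII letters/digits.
CANDIDATE = re.compile(r"\b[A-Za-z0-9]{9}\b")

def extract_symbols(text):
    """Return the 9-character alphanumeric tokens that contain a digit."""
    return [tok for tok in CANDIDATE.findall(text)
            if any(ch.isdigit() for ch in tok)]

samples = [
    "10/01/2016 1,060X 8.999%!!! 1.33 0.666 928888XE0",
    "FAR687465 B22222222 __ djj^66 20/20/20/20 1:",
    "fil476 .00 20/.. BUT AAAAAAAAA k98776 000.0001",
]
for s in samples:
    print(extract_symbols(s))
```

Tokens such as AAAAAAAAA pass the length test but are dropped by the digit test, which matches the "never only letters" rule.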
This bit of code is almost there. It just needs to check the strings, but Excel crashes on the Str line of code:
Sub Test()
    Dim Outputs, i As Integer, LastRow As Long, Prueba, Prueba2
    ' Split the cell text into space-separated tokens
    Outputs = Split(Range("A1"), " ")
    For i = 0 To UBound(Outputs)
        If Len(Outputs(i)) = 9 Then
            Prueba = 0
            Prueba2 = 0
            On Error Resume Next
            Prueba = Val(Outputs(i))
            Prueba2 = Str(Outputs(i))   ' this is the line that causes trouble
            On Error GoTo 0
            If Prueba <> 0 And Prueba2 <> 0 Then
                ' Append to the first free row in column B
                LastRow = Range("B10000").End(xlUp).Row + 1
                Cells(LastRow, 2) = Outputs(i)
            End If
        End If
    Next i
End Sub
If someone could help set up the string check, that would do the trick, I guess.
I have list of databases and tables obtained like this:
SELECT TRIM(DatabaseName) || '.' || TRIM(TableName) AS DatabaseTable
FROM DBC.TablesV t1
WHERE TableKind IN ('T','O','V')
I now want to match them to dbc.TablesV.RequestText to build a hierarchy of my database views.
At first I did it with a simple join like the one below:
JOIN DBC.TablesV t2
ON t2.RequestText LIKE '%' || DatabaseTable || '%'
Unfortunately, we have tables like T1010_User and T1010_User_Hist, and databases like DB_STAGE and Q_DB_STAGE, so I decided to add spaces next to the % signs in the LIKE clause, making it LIKE '% ' || DatabaseTable || ' %'. That fails to return proper results, because sometimes the table name is at the end of the RequestText, like this: (...) DB_STAGE.TableName; and sometimes it looks like this:
(...)
FROM
DB_STAGE.TableName t1
(...)
I decided to use REGEXP_SIMILAR to match them, with WHEN REGEXP_SIMILAR() = 1, but my regex-fu is weak, so I cannot build a regex that will do something like:
((anything other than a letter/number) or nothing) DatabaseTable ((anything other than a letter/number) or nothing)
This is to build hierarchy of views to help with migrating data to a different database.
This is very simplified case:
CREATE VOLATILE TABLE test1
(
c0 SMALLINT,
c1 varchar(100)
)ON COMMIT PRESERVE ROWS;
INSERT INTO test1 VALUES(1,'aaa
Q_abcdef.abcdef');
INSERT INTO test1 VALUES(2,' Q_abcdef.abcdef ');
INSERT INTO test1 VALUES(3,'aaa
DQ_abcdef.abcdef ');
INSERT INTO test1 VALUES(4,' S_abcdef.abcdef');
INSERT INTO test1 VALUES(5,'Q_abcdef.abcdefg');
INSERT INTO test1 VALUES(6,' sdfs
Q_abcdef.abcdefg');
INSERT INTO test1 VALUES(7,' DQ_abcdef.abcdefg');
INSERT INTO test1 VALUES(8,' S_abcdef.abcdefg');
INSERT INTO test1 VALUES(9,'Q_abcdef.abcdef;');
INSERT INTO test1 VALUES(10,' Q_abcdef.abcdef;');
INSERT INTO test1 VALUES(11,'DQ_abcdef.abcdef;');
INSERT INTO test1 VALUES(12,' S_abcdef.abcdef;');
I need to match rows 1, 2, 9 and 10: the ones that contain exactly the string Q_abcdef.abcdef.
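You can use \b to match word boundaries:
WHERE REGEXP_SIMILAR (c1, '.*\bQ_abcdef\.abcdef\b.*', 'in') = 1
The dot in the name is escaped so that it matches only a literal dot, and the 'n' match argument lets .* also match the line break in row 1. This will not return 5 & 6, because the final g means there is no word boundary after abcdef.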
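To see which of the test rows a word-boundary pattern picks out, here is a small sketch in Python rather than Teradata. It assumes Python's \b behaves like Teradata's for these inputs, and it uses re.search, so no surrounding .* is needed. The rows dictionary mirrors the test1 table, with row 7 taken to be ' DQ_abcdef.abcdefg'.

```python
import re

rows = {
    1: "aaa\nQ_abcdef.abcdef",
    2: " Q_abcdef.abcdef ",
    3: "aaa\nDQ_abcdef.abcdef ",
    4: " S_abcdef.abcdef",
    5: "Q_abcdef.abcdefg",
    6: " sdfs\nQ_abcdef.abcdefg",
    7: " DQ_abcdef.abcdefg",
    8: " S_abcdef.abcdefg",
    9: "Q_abcdef.abcdef;",
    10: " Q_abcdef.abcdef;",
    11: "DQ_abcdef.abcdef;",
    12: " S_abcdef.abcdef;",
}

# re.escape makes the dot literal; \b forbids a letter, digit or
# underscore directly before or after the name.
name = "Q_abcdef.abcdef"
pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)

matched = [c0 for c0, c1 in rows.items() if pattern.search(c1)]
print(matched)
```

Because the underscore counts as a word character, a name like DB_STAGE inside Q_DB_STAGE is not matched either, which covers the database-prefix case from the question.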
How can I find the index of the first digit encountered when scanning a string in reverse?
For example: 'CUSTOMC23VBA' and 'CUSTOMC245BA'.
So the function should return '2' or '3' counting from the end, or the index value '9' or '10'.
I could get the value by hard-coding SUBSTR('CUSTOMC23VBA', -3), but I want a generic solution, for example one based on regular expressions.
You can try:
select regexp_instr(reverse('CUSTOM123XYZ'), '[[:digit:]]',1,1) from dual
Output: 4
The zero-based index would be:
select regexp_instr(reverse('CUSTOM123XYZ'), '[[:digit:]]',1,1)-1 from dual
Output: 3
If you want the rest of the string from the last number, you can use substr and take advantage of the negative position to count from end of string:
select substr('CUSTOM123XYZ', -1 * (regexp_instr(reverse('CUSTOM123XYZ'), '[[:digit:]]',1,1)-1)) from dual;
Output: XYZ
An example testing multiple input strings:
with d as (
    select 'CUSTOM123XYZ' as input_str from dual
    union
    select 'CUSTOM123XZ' as input_str from dual
    union
    select 'CUSTOM 1 X 3YZ' as input_str from dual
)
select input_str,
       substr(input_str, -1 * (regexp_instr(reverse(input_str), '[[:digit:]]',1,1)-1)) as result
from d
Output:
INPUT_STR       RESULT
CUSTOM 1 X 3YZ  YZ
CUSTOM123XYZ    XYZ
CUSTOM123XZ     XZ
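The reverse-then-search idea is not specific to Oracle. As a rough sketch, here it is in Python; the function name last_digit_position and its return shape are made up for illustration.

```python
import re

def last_digit_position(s):
    """Locate the last digit of s by searching the reversed string.

    Returns (offset_from_end, index_from_start, tail), where
    offset_from_end is zero-based, index_from_start is 1-based (as in SQL),
    and tail is everything after the last digit.
    Returns None if s contains no digit.
    """
    m = re.search(r"[0-9]", s[::-1])
    if m is None:
        return None
    offset = m.start()       # letters to skip when counting from the end
    index = len(s) - offset  # 1-based position of the last digit
    return offset, index, s[index:]

for value in ("CUSTOMC23VBA", "CUSTOMC245BA", "CUSTOM 1 X 3YZ"):
    print(value, last_digit_position(value))
```

One difference worth knowing: when the string ends in a digit the offset is 0, and the slice returns an empty tail, whereas Oracle's SUBSTR treats position 0 as 1 and would return the whole string.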
Is there only one number in the string? If so, you could go for something like this:
select REGEXP_REPLACE('CUSTOMC23VBA', '[[:alpha:]]','') from dual
But this will fail when there are multiple numbers in the string, because all the remaining digits end up in one result.
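A short Python sketch of the same strip-the-letters approach shows where it stops being useful. The helper name strip_letters is hypothetical, and [A-Za-z] is used as an ASCII-only stand-in for [[:alpha:]].

```python
import re

def strip_letters(s):
    """Remove ASCII letters, keeping everything else (digits included)."""
    return re.sub(r"[A-Za-z]", "", s)

print(strip_letters("CUSTOMC23VBA"))  # one group of digits survives intact
print(strip_letters("AB12CD34"))      # two separate groups get merged
```

With a single run of digits the result is that run; with several runs they are concatenated, so the position of any individual number is lost.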