Snowflake regexp_like breaks when string column contains end of line characters - regex

We've been pulling our hair out debugging Snowflake's regexp_like/rlike function on a string column. It started failing silently, always returning False when the string contained a line break (\n).
Example:
select regexp_like('looking for a keyword', '.*keyword.*', 'i')
-> True
select regexp_like('looking for a
keyword', '.*keyword.*', 'i')
-> False
Both expressions, however, should evaluate to True.
Some observations:
The ilike function doesn't seem to suffer from this issue.
select 'looking for a \n keyword' ilike '%keyword%'
-> True
ilike and regexp_like are implemented differently, which would explain why one works but not the other.
The (programmatic) workaround is to strip all line break characters.
select regexp_like(replace('looking for a
keyword','\n',''), '.*keyword.*', 'i')
-> True
We obtained the string entries in our table from websites parsed with Beautiful Soup. While Beautiful Soup removes HTML <br> line breaks, it doesn't seem to remove line breaks encoded as \n and \r. In retrospect, I can see how a regexp could be tripped up by a newline character.
Is this expected behavior and specific to Snowflake?

You can pass the s parameter:
select regexp_like('looking for a
keyword', '.*keyword.*', 'is')
-- TRUE
Specifying the Parameters for the Regular Expression:
s - Enables the POSIX wildcard character . to match \n. By default, wildcard character matching is disabled.
The default string is simply c, which specifies:
POSIX wildcard character . does not match \n newline characters.
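The same dot-versus-newline behaviour can be reproduced outside Snowflake. The sketch below is a Python analogue, not Snowflake code: it assumes re.fullmatch is a fair stand-in for regexp_like (which implicitly anchors the pattern at both ends) and that re.DOTALL plays the role of the s parameter.

```python
import re

text = "looking for a\nkeyword"
pattern = r".*keyword.*"

# Without DOTALL, "." stops at the newline, so the whole string cannot match.
without_s = re.fullmatch(pattern, text, flags=re.IGNORECASE) is not None

# With DOTALL (the analogue of Snowflake's 's'), "." also matches "\n".
with_s = re.fullmatch(pattern, text, flags=re.IGNORECASE | re.DOTALL) is not None

print(without_s, with_s)
```

So the behaviour is not specific to Snowflake: in most regex engines the dot excludes newlines unless a flag says otherwise.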

Regex Find/Replace char on a line before a specific word

I hope this is the right place to ask this question.
I am preparing a script to import into a database using Notepad++.
I have a huge file that has rows like this:
(10496, '69055-230', 'Rua', '5', 'Manaus', 'Parque 10 de Novembro',
'AM'),
INSERT INTO dne id, cep, tp_logradouro, logradouro, cidade,
bairro, uf VALUES
Is there a way, using Find/Replace, to replace the ',' with ';' at the end of every line that comes right before an INSERT statement?
I am not sure how to match the end of the line before a specific word.
The result would be
(10496, '69055-230', 'Rua', '5', 'Manaus', 'Parque 10 de Novembro',
'AM');
INSERT INTO dne id, cep, tp_logradouro, logradouro, cidade,
bairro, uf VALUES
Find what: ,(?=\s*INSERT)
Replace with: ;
Description
, matches a literal comma
(?=\s*INSERT) is a lookahead that asserts (but doesn't consume)
\s* any number of whitespace characters (including newlines)
INSERT as literal
If you also want to replace any commas before the end of the file, use
,(?=\h*\R\h*INSERT|\s*\z)
Note both expressions would fail if you have another instance of a comma followed by INSERT that shouldn't be replaced, but in that case you should specify it in the question.
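To check the lookahead before running it on the real file, here is a minimal Python sketch of the first expression. It assumes Python's re module behaves like Notepad++'s engine for this pattern; the second expression is left out because Python's re does not support \h or \R. The sample text is copied from the question.

```python
import re

sql = (
    "(10496, '69055-230', 'Rua', '5', 'Manaus', 'Parque 10 de Novembro',\n"
    "'AM'),\n"
    "INSERT INTO dne id, cep, tp_logradouro, logradouro, cidade,\n"
    "bairro, uf VALUES\n"
)

# Replace only a comma that is followed by optional whitespace and INSERT.
# The lookahead does not consume the newline or the keyword.
fixed = re.sub(r",(?=\s*INSERT)", ";", sql)
print(fixed)
```

Only the comma after 'AM') changes. The commas inside the column list are left alone because the text after them is not INSERT.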
You don't even need a regular expression for that.
Select Extended in Search Mode
Replace ,\nINSERT INTO with ;\nINSERT INTO
This matches , at the end of a line just before INSERT INTO at the beginning of the next line. Keep in mind that \n will match only in a Linux/Unix/Mac OS X file. For Windows use \r\n, for Mac OS Classic \r (reference).
Using Sublime Text or Notepad++, press Ctrl+H and replace all ")INSERT," by ");INSERT"
I expect that the INSERT statements will all have the form:
INSERT INTO table col1, col2, col3, ...
VALUES (val1, val2, val3, ...),
^^ what you want to replace
Assuming that the only place ), appears is at the end of the VALUES line, you can just do the following replacement:
Find: \),$
Replace: );
You can do this replacement with the regex option enabled.
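A rough Python equivalent of the end-of-line approach, with the parenthesis escaped and no $ in the replacement. The table name t and the values are made up for illustration, and re.MULTILINE is assumed to mirror the editor's per-line treatment of $.

```python
import re

# Hypothetical input in the shape described above.
sql = (
    "INSERT INTO t col1, col2\n"
    "VALUES (1, 'a'),\n"
    "INSERT INTO t col1, col2\n"
    "VALUES (2, 'b'),\n"
)

# With MULTILINE, "$" matches at the end of every line, not just the last one.
fixed = re.sub(r"\),$", ");", sql, flags=re.MULTILINE)
print(fixed)
```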

How do I remove all characters that aren't alphabetic from a string in PL/SQL?

I have a PL/SQL procedure and I need to take a string and remove all characters that aren't alphabetic. I've seen some examples and read documentation about the REGEXP_REPLACE function but can't understand how it functions.
This is not a duplicate because I need to remove punctuation, not numbers.
Either:
select regexp_replace('1A23B$%C_z1123d', '[^A-Za-z]') from dual;
or:
select regexp_replace('1A23B$%C_z1123d', '[^[:alpha:]]') from dual;
The second one takes into account possible other letters like:
select regexp_replace('123żźć', '[^[:alpha:]]') from dual;
Result:
żźć
Also, to answer your question about how the function works: the first parameter is the source string and the second is a regular expression. Everything that matches it is replaced by the third argument (optional, NULL by default, meaning all matched characters are simply removed).
Read more about regular expressions:
http://docs.oracle.com/cd/B19306_01/appdev.102/b14251/adfns_regexp.htm
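The difference between the two character classes can be seen in a small Python sketch. This is an analogue rather than Oracle code: str.isalpha() stands in for [[:alpha:]], and both function names are made up for illustration.

```python
import re


def strip_non_ascii_letters(s: str) -> str:
    """Keep only A-Z and a-z, like the [^A-Za-z] pattern."""
    return re.sub(r"[^A-Za-z]", "", s)


def strip_non_letters(s: str) -> str:
    """Keep any alphabetic character, roughly like [^[:alpha:]]."""
    return "".join(ch for ch in s if ch.isalpha())


print(strip_non_ascii_letters("1A23B$%C_z1123d"))
print(strip_non_letters("123żźć"))
```

The ASCII-only version would drop żźć entirely, which is why the second form is the safer choice when the data may contain accented letters.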
You can use a regexp like this:
SELECT REGEXP_REPLACE(UPPER('xYztu-123-hello'), '[^A-Z]+', '') FROM DUAL;
also answered here for non-numeric chars
Try this:
SELECT REGEXP_REPLACE('AB$%c','[^a-zA-Z]', '') FROM DUAL;
Or
SELECT REGEXP_REPLACE( your_column, '[^a-zA-Z]', '' ) FROM your_table;
Read here for more information

How to Compare strings by Regex inside a LINQ Query?

I have two DataTables, dtRosterList and falsefields.
I want to list every dtRosterList row that matches a falsefields row, where the list column value is a sentence such as "Hello i am a list of field" and the falsefields column value is a single string such as "field". I am matching with Regex.IsMatch.
Regex.IsMatch(r.Field<string>("ListName"),@"\bFacebook Link\b")
returns true in the Visual Studio Immediate window, but
Regex.IsMatch(r.Field<string>("ListName"),@"\b"+fn+"\b") returns false in the LINQ query itself, and I get no rows. The query is below:
var listTobeDeleted = dtRosterList.AsEnumerable().
Where(r => falsefields.AsEnumerable()
.Select(f => f.Field<string>("FieldName")).Any(fn => Regex.Match(r.Field<string>("ListName"),@"\b"+fn+"\b",RegexOptions.IgnoreCase))).CopyToDataTable();
Two problems with your code:
You've used Regex.Match instead of Regex.IsMatch;
You've missed the @ verbatim string literal prefix on the second "\b" string.
With the prefix, the string contains two characters: \ (ASCII code 92), and b (ASCII code 98). The Regex engine interprets this to mean "the match must occur on a boundary between an alphanumeric and a non-alphanumeric character".
Without the prefix, the string contains a single character: backspace (ASCII code 8). The Regex engine interprets this as a literal character, and so will only match a string containing the backspace character.
I'd also be inclined to add a Regex.Escape call around the word to find, in case it contains any special characters.
var listTobeDeleted = dtRosterList.AsEnumerable()
.Where(r => falsefields.AsEnumerable()
.Select(f => f.Field<string>("FieldName"))
.Any(fn => Regex.IsMatch(r.Field<string>("ListName"), @"\b" + Regex.Escape(fn) + @"\b", RegexOptions.IgnoreCase)))
.CopyToDataTable();
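The backspace-versus-boundary trap is not unique to C#. As an illustration only, the Python sketch below shows the same effect with raw strings, which play the role of C# verbatim strings here. The word and sentence are sample values taken from the question.

```python
import re

word = "field"  # sample search term
sentence = "Hello i am a list of field"

# "\b" in a normal string is ONE character: backspace (code 8).
plain = "\b" + word + "\b"

# r"\b" in a raw string is TWO characters: backslash and b (a word boundary).
raw = r"\b" + re.escape(word) + r"\b"

print(len("\b"), len(r"\b"))
print(re.search(plain, sentence, re.IGNORECASE) is not None)
print(re.search(raw, sentence, re.IGNORECASE) is not None)
```

The first search looks for literal backspace characters around the word and finds nothing. The second finds the word on its boundaries.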

Wordnet how to know if string is valid query string

I'm having trouble calling functions from Wordnet::SenseRelate because some of the "words" in the text are not valid queries. I've tried wrapping the call in try and catch so that the program skips the word instead of quitting, but no luck. I wanted to check whether a word was valid by using Wordnet::QueryData, but it quits when I use an invalid word like:
$wn->querySense("#44");
I get:
(querySense) Bad query string: #44
The regex which is used can be found in the statement:
my ($word, $pos, $sense) = $string =~ /^([^\#]+)(?:\#([^\#]+)(?:\#(\d+))?)?$/;
If in doubt whether a token will be accepted, test it against this regex.
Commenting on the specific question, there cannot be any leading or trailing # characters (the problem experienced). If # characters are present, there can be 1 or 2 but not more than 2 in the query string. The # characters, if present, act as delimiters to determine what is word, what is pos and what is sense.
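To see which tokens the pattern accepts, here is a small Python port of the regex quoted above. It is a sketch for experimenting, not part of the Perl module, and the helper name parse_query is hypothetical. In Perl the same check would be done with the original regex before calling querySense.

```python
import re

# Same pattern as the Perl statement: word, optional #pos, optional #sense.
QUERY_RE = re.compile(r"^([^#]+)(?:#([^#]+)(?:#(\d+))?)?$")


def parse_query(s):
    """Return (word, pos, sense) if s looks like a valid query, else None."""
    m = QUERY_RE.match(s)
    return m.groups() if m else None


for token in ["dog", "dog#n", "dog#n#1", "#44", "dog#", "a#b#c#d"]:
    print(repr(token), parse_query(token))
```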

XSL - Remove non breaking space

In my XSL implementation (2.0), I tried using the statement below to remove all the spaces and non-breaking spaces within a text node. It works for ordinary spaces only, not for non-breaking spaces (U+00A0) and similar space characters given by their numeric codes. I am using the SAXON processor for execution.
Current XSL code:
translate(normalize-space($text-nodes[1]), ' ' , '' )
How can I remove them? Please share your thoughts.
Those codes are Unicode, not ASCII (for the most part), so you should probably use the replace function with a regex containing the Unicode separator character class:
replace($text-nodes[1], '\p{Z}+', '')
In more detail:
The regex \p{Z}+ matches one or more characters that are in the "separator" category in Unicode. \p{} is the category escape sequence, which matches a single character in the category specified within the curly braces. Z specifies the "separator" category (which includes various kinds of whitespace). + means "match the preceding regex one or more times". The replace function returns a version of its first argument with all non-overlapping substrings matching its second argument replaced with its third argument. So this returns a version of $text-nodes[1] with all sequences of separator characters replaced with the empty string, i.e. removed.
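The effect of the separator category can be checked with a short Python sketch. Python's built-in re module has no \p{Z}, so this version tests the Unicode general category of each character directly. The function name is hypothetical, and this is an analogue of the XPath replace call, not XSLT code.

```python
import unicodedata


def strip_separators(s: str) -> str:
    """Drop every character whose Unicode category starts with Z (Zs, Zl, Zp)."""
    return "".join(ch for ch in s if not unicodedata.category(ch).startswith("Z"))


sample = "a\u00a0b\u2003c d"  # no-break space, em space, ordinary space
print(strip_separators(sample))

# Zero width space (U+200B) is a format character (Cf), not a separator,
# so it would need to be listed explicitly if it must go as well.
print(unicodedata.category("\u200b"))
```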