How to split a string in Snowflake by a single backslash - replace

I have a problem I can't solve directly using the Snowflake docs:
I have strings like 'abc\def'
and need to split them into 'abc', 'def'.
I tried:
split_to_table('abc\def', '\\') - error
strtok_to_array('abc\def', '\\') ==> [
"abcdef"
]
I also tried replacing the backslash with a better delimiter prior to the split:
replace('abc\cde','\\','_another_symbol_'); ==> abccde
REGEXP_REPLACE('abc\cde','$$\$$','_another_symbol_') ==> abccde_another_symbol
but that doesn't work either.
Any idea how to solve this?

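If you just use SPLIT, it will split the values into an array, which you can then process however you want. Note that the backslash has to be escaped as '\\' inside a single-quoted string literal, both in the data and in the delimiter, e.g.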
with dataset as (
select $1 as col1 from
(values
('abc\\def'),
('ghi\\jkl'))
)
select col1, split(col1,'\\')
from dataset
| COL1    | SPLIT(COL1,'\')  |
|---------|------------------|
| abc\def | [ "abc", "def" ] |
| ghi\jkl | [ "ghi", "jkl" ] |
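For comparison outside Snowflake, the same escaping rule can be seen in Python, where '\\' in a regular string literal also denotes a single backslash. This is only an illustrative sketch: the variable names are made up, and nothing here is Snowflake API.

```python
# A raw string keeps the backslash exactly as typed: abc\def (7 characters).
value = r"abc\def"

# "\\" in a regular literal is ONE backslash character, so this splits
# on the single backslash, much like split(col1, '\\') in the answer above.
parts = value.split("\\")

print(len(value))  # 7
print(parts)       # ['abc', 'def']
```

The question's attempts most likely failed because the unescaped '\d' in 'abc\def' was consumed as an escape sequence before the split ever ran, which would explain the "abcdef" result.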

Extract all substrings bounded by the same characters

Given a name_loc column of text like the following:
{"Charlie – White Plains, NY","Wrigley – Minneapolis, MN","Ana – Decatur, GA"}
I'm trying to extract the names, ideally separated by commas:
Charlie, Wrigley, Ana
I've gotten this far:
SELECT SUBSTRING(CAST(name_loc AS VARCHAR) from '"([^ –]+)')
FROM table;
which returns
Charlie
How can I extend this query to extract all names?
You can do this with a combination of regexp_matches (to extract the names), array_agg (to regroup all matches in a row) and array_to_string (to format the array as you'd like, e.g. with a comma separator):
WITH input(name_loc) AS (
VALUES ('{"Charlie – White Plains, NY","Wrigley – Minneapolis, MN","Ana – Decatur, GA"}')
, ('{"Other - somewhere"}') -- added this to show multiple rows do not get merged
)
SELECT array_to_string(names, ', ')
FROM input
CROSS JOIN LATERAL (
SELECT array_agg(name)
FROM regexp_matches(name_loc, '"(\w+)', 'g') AS f(name)
) AS f(names);
array_to_string
Charlie, Wrigley, Ana
Other
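The extract-then-regroup idea can be checked outside the database as well. The following is a minimal Python sketch, not PostgreSQL: re.findall stands in for regexp_matches with the 'g' flag, and the per-row join stands in for array_agg plus array_to_string. The rows list is sample data copied from the question.

```python
import re

rows = [
    '{"Charlie – White Plains, NY","Wrigley – Minneapolis, MN","Ana – Decatur, GA"}',
    '{"Other - somewhere"}',
]

# re.findall returns every capture of group 1 (the word right after a quote);
# joining per row keeps each input row separate, so rows are not merged.
names_per_row = [", ".join(re.findall(r'"(\w+)', row)) for row in rows]

for names in names_per_row:
    print(names)
```

Because \w+ stops at the first non-word character, a name containing a space would be cut short here, just as in the SQL version.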
My two cents, though I'm rather new to PostgreSQL and I had to copy the first piece from @Marth's answer:
WITH input(name_loc) AS (
VALUES ('{"Charlie – White Plains, NY","Wrigley – Minneapolis, MN","Ana – Decatur, GA"}')
, ('{"Other - somewhere"}')
)
SELECT REGEXP_REPLACE(name_loc, '{?(,)?"(\w+)[^"]+"}?','\1\2', 'g') FROM input;
regexp_replace
Charlie,Wrigley,Ana
Other
Your string literal happens to be a valid array literal.
(Maybe not by coincidence? And the column should be type text[] to begin with?)
If that's the reliable format, there is a safe and simple solution:
SELECT t.id, x.names
FROM tbl t
CROSS JOIN LATERAL (
SELECT string_agg(split_part(elem, ' – ', 1), ', ') AS names
FROM unnest(t.name_loc::text[]) elem
) x;
Or:
SELECT id, string_agg(split_part(elem, ' – ', 1), ', ') AS names
FROM (SELECT id, unnest(name_loc::text[]) AS elem FROM tbl) t
GROUP BY id;
Steps
Unnest the array with unnest() in a CROSS JOIN LATERAL, or directly in the SELECT list. See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Take the first part with split_part(). I chose ' – ' as delimiter, not just ' ', to allow for names with nested space like "Anne Nicole". See:
Split comma separated column data into additional columns
Aggregate results with string_agg(). I added no particular order as you didn't specify one. See:
Concatenate multiple result rows of one column into one, group by another column
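The three steps above (unnest, take the first part, aggregate) can be sketched in Python to see why splitting on ' – ' keeps names with spaces intact. This is an approximation under stated assumptions: the helper name first_parts is hypothetical, and the csv module is used as a simplified stand-in for PostgreSQL's array-literal parsing, which has more rules than shown here.

```python
import csv

def first_parts(array_literal: str, sep: str = " – ") -> str:
    # "Unnest": strip the outer braces, then let csv split the double-quoted,
    # comma-containing elements (a simplification of Postgres array parsing).
    inner = array_literal.strip("{}")
    elements = next(csv.reader([inner]))
    # split(sep, 1)[0] plays the role of split_part(elem, ' – ', 1);
    # the join plays the role of string_agg(..., ', ').
    return ", ".join(elem.split(sep, 1)[0] for elem in elements)

result = first_parts(
    '{"Charlie – White Plains, NY","Anne Nicole – Minneapolis, MN","Ana – Decatur, GA"}'
)
print(result)  # Charlie, Anne Nicole, Ana
```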

Snowflake External Stage file name pattern match not working

I am trying to run a Select query on multiple files placed on External Stage(s3). The file names are
SAMPLE_FILE_NAME_20201014_155022.csv
SAMPLE_FILE_NAME_20201016_092711.csv
SAMPLE_FILE2_NAME2_20201014_155022.csv
SAMPLE_FILE2_NAME2_20201016_092711.csv
I want to select only the files named SAMPLE_FILE_NAME.*. If I query as below, it does not find any files or data:
select $1 as col1, $2 as col2, $3 as col3, $4 as col4
FROM @ROLE1_ID1.LOCATION_ESTG (file_format => 'SEMICOLON', PATTERN => '.*SAMPLE_FILE_NAME.csv.*') t;
If I query as below, both SAMPLE_FILE_NAME and SAMPLE_FILE2_NAME2 are selected and I get incorrect data:
select $1 as col1, $2 as col2, $3 as col3, $4 as col4
FROM @ROLE1_ID1.LOCATION_ESTG (file_format => 'SEMICOLON', PATTERN => '.*_20201014_155022.csv') t;
I tried several regular expressions, but they don't seem to work when reading the external stage. For example, I tried 'SAMPLE_FILE_NAME.*\.csv', which didn't work. What is the correct expression to filter SAMPLE_FILE_NAME* and make the select work?
The .*SAMPLE_FILE_NAME.csv.* pattern does not work because . before csv only matches any single char, while there are many more chars between NAME and csv.
You can use
'SAMPLE_FILE_NAME_.*\\.csv'
If the path includes directory name, you might need to add .* at the start and use
'.*SAMPLE_FILE_NAME_.*\\.csv'
Note that to match a literal dot, you need to escape it with a backslash; since a backslash also starts escape sequences in string literals, it needs doubling.
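The three patterns discussed here can be compared against the file names from the question with a small Python sketch. It assumes that the stage PATTERN has to match the whole file path, which re.fullmatch emulates; the select helper is hypothetical. Python raw strings need only a single backslash before the dot, whereas the SQL literal needs '\\.'.

```python
import re

files = [
    "SAMPLE_FILE_NAME_20201014_155022.csv",
    "SAMPLE_FILE_NAME_20201016_092711.csv",
    "SAMPLE_FILE2_NAME2_20201014_155022.csv",
    "SAMPLE_FILE2_NAME2_20201016_092711.csv",
]

def select(pattern: str) -> list[str]:
    # Keep only the names that the pattern matches in full.
    return [name for name in files if re.fullmatch(pattern, name)]

no_hits = select(r".*SAMPLE_FILE_NAME.csv.*")      # "." allows just one char before "csv"
by_timestamp = select(r".*_20201014_155022.csv")   # matches both file families
wanted = select(r".*SAMPLE_FILE_NAME_.*\.csv")     # the suggested pattern

print(no_hits)
print(by_timestamp)
print(wanted)
```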

Regex (All after first match (without the first match))

I am struggling with a simple regex. Basically I want everything after the first match of "_" without the "_".
My current expression is like this: _(.*)
When I give input: AAA_BBB_CCC
The output is: _BBB_CCC
My ideal output would be: BBB_CCC
I am using a Snowflake database with its built-in regex functions.
Unfortunately, I cannot use (?<=_).* because Snowflake does not support the lookbehind syntax "?<=". Is there some other way to modify _(.*) to get the right output?
Thank you.
You can use a regular expression to achieve this. Something like this in JavaScript, for example, will do the job:
"AAA_BBB_CCC".replace(/[^_]+./, '')
Use REGEXP_REPLACE with Snowflake
regexp_replace('AAA_BBB_CCC','^[^_]+_','')
https://docs.snowflake.net/manuals/sql-reference/functions/regexp_replace.html
But you can also find the first index of _ and use substring, available in all languages
const text = "AAA_BBB_CCC";
const index = text.indexOf('_');
// Everything after the first "_"; keep the whole string if there is none
const result = index !== -1 ? text.substring(index + 1) : text;
In Snowflake SQL, you may use REGEXP_SUBSTR, its syntax is
REGEXP_SUBSTR( <string> , <pattern> [ , <position> [ , <occurrence> [ , <regex_parameters> [ , <group_num> ] ] ] ] ).
The function allows you to return captured substrings:
By default, REGEXP_SUBSTR returns the entire matching part of the subject. However, if the e (for “extract”) parameter is specified, REGEXP_SUBSTR returns the part of the subject that matches the first group in the pattern. If e is specified but a group_num is not also specified, then the group_num defaults to 1 (the first group). If there is no sub-expression in the pattern, REGEXP_SUBSTR behaves as if e was not set.
So, you need to set the regex_parameters to e and - optionally - group_num argument to 1:
Select REGEXP_SUBSTR('AAA_BBB_CCC', '_(.*)', 1, 1, 'e', 1)
Select REGEXP_SUBSTR('AAA_BBB_CCC', '_(.*)', 1, 1, 'e')
Use a capture group:
\_(?<data>.*)
Which returns the capture group data containing BBB_CCC
Example:
https://regex101.com/r/xZaXKR/1
To get this actually working you need to use:
SELECT REGEXP_SUBSTR('AAA_BBB_CCC', '_(.*)', 1, 1, 'e', 1);
which gives:
REGEXP_SUBSTR('AAA_BBB_CCC', '_(.*)', 1, 1, 'E', 1)
BBB_CCC
You need to pass e (extract sub-matches) as the <regex_parameters> argument of REGEXP_SUBSTR. Thus Wiktor's answer is 95% correct.
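The approaches from the answers above (capture group, anchored replace, and index plus substring) all give the same result, which a short Python sketch can confirm. Python's re module is used here only to check the logic; it is not Snowflake's regex engine, and the variable names are illustrative.

```python
import re

text = "AAA_BBB_CCC"

# Capture group: group(1) is what REGEXP_SUBSTR(..., 'e', 1) extracts,
# while group(0) is the whole match including the underscore.
match = re.search(r"_(.*)", text)
whole_match = match.group(0)
captured = match.group(1)

# Anchored replace: remove everything up to and including the first "_".
replaced = re.sub(r"^[^_]+_", "", text)

# No regex at all: split once on the first "_" and keep the tail.
tail = text.partition("_")[2]

print(whole_match, captured, replaced, tail)
```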

Match at least 3 words in any order from some 5 words

I have a group of words:
"dog", "car", "house", "work", "cat"
I need to be able to match at least 3 of them in a text, for example:
"I always let my cat and dog at the animal nursery when I go to work by car"
Here I want to match the regex because it matches at least 3 words (4 words here):
"cat", "dog", "car" and "work"
EDIT 1
I want to use it with Oracle's regexp_like function
EDIT 2
I also need it to work with consecutive words
Since Oracle's regexp_like doesn't support non-capturing groups and word boundaries, the following expression can be used:
^((.*? )?(dog|car|house|work|cat)( |$)){3}.*$
Alternatively, a larger but arguably cleaner solution is:
^(.*? )?(dog|car|house|work|cat) .*?(dog|car|house|work|cat) .*?(dog|car|house|work|cat)( .*)?$
NOTE: These will both match the same word used multiple times, e.g. "dog dog dog".
EDIT: To address the concerns over punctuation, a small modification can be made. It isn't perfect, but should match 99% of situations involving punctuation (but won't match e.g. !dog):
^((.*? )?(dog|car|house|work|cat)([ ,.!?]|$)){3}.*$
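The first expression can be tried against a few inputs with Python's re module. This only checks the logic of the pattern: Oracle's regexp_like uses a different (POSIX-style) engine, so treat this as a sketch and not as proof of Oracle's behaviour. The has_three helper is hypothetical.

```python
import re

# Same pattern as above: three repetitions of
# (optional prefix ending in a space)(one of the words)(space or end of string).
pattern = re.compile(r"^((.*? )?(dog|car|house|work|cat)( |$)){3}.*$")

def has_three(text: str) -> bool:
    return pattern.search(text) is not None

sentence = "I always let my cat and dog at the animal nursery when I go to work by car"

print(has_three(sentence))                  # four listed words
print(has_three("dog car cat"))             # consecutive words
print(has_three("dog dog dog"))             # repeated word also counts
print(has_three("the dog sat on the car"))  # only two listed words
```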
This is a solution that doesn't use regular expressions, will exclude repeated words and the words to match can be passed in as a bind parameter in a collection:
Oracle 11g R2 Schema Setup:
Create a collection type to store a list of words:
CREATE TYPE StringList IS TABLE OF VARCHAR2(50)
/
Create a PL/SQL function to split a delimited string into the collection:
CREATE OR REPLACE FUNCTION split_String(
i_str IN VARCHAR2,
i_delim IN VARCHAR2 DEFAULT ','
) RETURN StringList DETERMINISTIC
AS
p_result StringList := StringList();
p_start NUMBER(5) := 1;
p_end NUMBER(5);
c_len CONSTANT NUMBER(5) := LENGTH( i_str );
c_ld CONSTANT NUMBER(5) := LENGTH( i_delim );
BEGIN
IF c_len > 0 THEN
p_end := INSTR( i_str, i_delim, p_start );
WHILE p_end > 0 LOOP
p_result.EXTEND;
p_result( p_result.COUNT ) := SUBSTR( i_str, p_start, p_end - p_start );
p_start := p_end + c_ld;
p_end := INSTR( i_str, i_delim, p_start );
END LOOP;
IF p_start <= c_len + 1 THEN
p_result.EXTEND;
p_result( p_result.COUNT ) := SUBSTR( i_str, p_start, c_len - p_start + 1 );
END IF;
END IF;
RETURN p_result;
END;
/
Create some test data:
CREATE TABLE test_data ( value ) AS
SELECT 'I always let my cat and dog at the animal nursery when I go to work by car' FROM DUAL UNION ALL
SELECT 'dog dog foo bar dog' FROM DUAL
/
Query 1:
SELECT *
FROM test_data
WHERE CARDINALITY(
split_string( value, ' ' ) -- Split the string into a collection
MULTISET INTERSECT -- Intersect it with the input words
StringList( 'dog', 'car', 'house', 'work', 'cat' )
) >= 3 -- Check that the size of the intersection
-- is at least 3 items.
Results:
| VALUE |
|----------------------------------------------------------------------------|
| I always let my cat and dog at the animal nursery when I go to work by car |
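The split-then-intersect idea behind Query 1 maps directly onto set operations. Below is a minimal Python sketch of the same logic; the function name and the minimum parameter are illustrative and not part of the Oracle solution.

```python
WORDS = {"dog", "car", "house", "work", "cat"}

def matches_enough(text: str, minimum: int = 3) -> bool:
    # Split on single spaces (like split_string(value, ' ')), intersect
    # with the word list, and count the distinct words that remain.
    return len(set(text.split(" ")) & WORDS) >= minimum

test_data = [
    "I always let my cat and dog at the animal nursery when I go to work by car",
    "dog dog foo bar dog",
]

selected = [row for row in test_data if matches_enough(row)]
print(selected)
```

Because the comparison is on distinct words, the second row contributes only "dog" and is excluded, matching the result shown above.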
Ignoring the questions I asked in a Comment under the original post, here is one easy way to solve the problem, with a join and aggregation (using a HAVING condition). Note that a word like doghouse in the input will match both dog and house, etc. (Do read my comment under the original post!)
In the query below, both the input phrase and the words to match are hardcoded in factored subqueries (the WITH clause). In a serious environment, both should be in base tables, or be provided as input variables, etc.
I show how to use the standard string comparison operator LIKE. This can be changed to REGEXP_LIKE, but that is generally unneeded (and indeed a bad idea). But if you need to differentiate between 'dog' and 'dogs' (and 'dogwood'), or need case insensitive comparison, etc., you can use REGEXP_LIKE. The point of this solution is that you don't need to worry about matching THREE different words; if you know how to match ONE (whether full word match is needed, capitalization does or does not matter, etc.), then you can also, easily, match THREE words under the same rules.
with
inputs ( input_phrase ) as (
select
'I always let my cat and dog at the animal nursery when I go to work by car'
from dual
),
words ( word_to_match) as (
select 'dog' from dual union all
select 'car' from dual union all
select 'house' from dual union all
select 'work' from dual union all
select 'cat' from dual
)
select input_phrase
from inputs inner join words
on input_phrase like '%' || word_to_match || '%'
group by input_phrase
having count(*) >= 3
;
INPUT_PHRASE
--------------------------------------------------------------------------
I always let my cat and dog at the animal nursery when I go to work by car
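The join with LIKE '%word%' counts each listed word at most once per phrase, but it matches substrings, not whole words. A small Python sketch (the function name is hypothetical) makes the doghouse caveat mentioned above concrete:

```python
words = ["dog", "car", "house", "work", "cat"]

def like_match_count(phrase: str) -> int:
    # "w in phrase" is a substring test, the same thing that
    # phrase LIKE '%' || w || '%' checks in the query above.
    return sum(1 for w in words if w in phrase)

sentence = "I always let my cat and dog at the animal nursery when I go to work by car"

sentence_count = like_match_count(sentence)
doghouse_count = like_match_count("doghouse at work")

print(sentence_count)  # dog, car, work, cat
print(doghouse_count)  # dog, house, work, although only "work" is a whole word
```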
The following solution will exclude repeated matches, doesn't use regular expressions (though you can if you like), and doesn't use PL/SQL.
WITH match_list ( match_word ) AS (
SELECT 'dog' AS match_word FROM dual
UNION ALL
SELECT 'work' FROM dual
UNION ALL
SELECT 'car' FROM dual
UNION ALL
SELECT 'house' FROM dual
UNION ALL
SELECT 'cat' FROM dual
)
SELECT phrase, COUNT(*) AS unique_match_cnt, SUM(match_cnt) AS total_match_cnt
, LISTAGG(match_word, ',') WITHIN GROUP ( ORDER BY match_word ) AS unique_matches
FROM (
SELECT pt.phrase, ml.match_word, COUNT(*) AS match_cnt
FROM phrase_table pt INNER JOIN match_list ml
ON ' ' || LOWER(pt.phrase) || ' ' LIKE '%' || ml.match_word || '%'
GROUP BY pt.phrase, ml.match_word
) GROUP BY phrase
HAVING COUNT(*) >= 3;
The key is putting the words you want to match into a table or common table expression/subquery. If you like you can use REGEXP_LIKE() in place of LIKE though I think that would be more expensive. Skip LISTAGG() if you're not using Oracle 11g or higher, or if you don't actually need to know which words were matched, and skip LOWER() if you want a case-sensitive match.
If you don't need the matched words to be different:
(?:\b(?:dog|car|house|work|cat)\b.*?){3}
I don't know if this works in your environment.
EDIT: I didn't see there is another answer almost like this one.

Vertica new-line CR LF replace

I have a column in Vertica which I wish to export to .csv.
The problem is that this column has CRLF in the middle, meaning that the export reads each line as two lines. Example of input (the EOF delimiter was copied and pasted from Vertica):
First part
Second part
I tried the REPLACE option but it does not replace the sequence.
select TABLE, REPLACE(column_name, '\r\n', 'FUFU') from DB;
The command does replace random letters.
Hence I start to question if there is a CRLF (Notepad++ found it) or if there is some other character hidden there which I fail to replace...
Any help on what are other possible causes for the new line (I tried \n, \c, \r and any possible combinations...) or how to see it other than in Notepad (directly in Vertica?) will be greatly appreciated...
Alternatively, I found no way to explicitly define in Vertica the EOF characters on export - does something like this exist?
Thanks
You might want to check how to use Extended String Literals in Vertica's SQL Reference Manual.
Example:
create table a ( id integer , txt varchar(20) ) ;
insert into a values ( 1 , 'abc' ) ;
insert into a values ( 2 , e'def\r\nrghi' ) ;
insert into a values ( 3 , e'ij\r\nklm' ) ;
insert into a values ( 4 , 'poq' ) ;
Then, to replace \r\n sequences - for example - with a space:
SQL=> select id, replace(txt, e'\r\n', ' ' ) from a order by id ;
id | replace
----+----------
1 | abc
2 | def rghi
3 | ij klm
4 | poq
(4 rows)
To cover all newline variants (\r\n, \n and \r), you can also use REGEXP_REPLACE:
REGEXP_REPLACE(text, '(?>\r\n|\n|\r)', ' ')
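The difference between '\r\n' and e'\r\n' can be illustrated in Python, where a raw string behaves like a literal in which backslashes are not interpreted. This is a sketch under the assumption that Vertica treats backslashes in ordinary string literals as plain characters, which would explain why the original REPLACE found nothing to replace; the variable names are illustrative.

```python
import re

# Four characters (backslash, r, backslash, n): what a literal holds
# when backslash escapes are NOT interpreted.
uninterpreted = r"\r\n"

# Two characters (carriage return, line feed): what e'\r\n' denotes.
crlf = "\r\n"

value = "First part\r\nSecond part"

found_uninterpreted = uninterpreted in value   # the data holds no backslashes
replaced = value.replace(crlf, " ")

# Regex variant covering \r\n, \n and \r; \r\n is listed first so that
# a CRLF pair becomes one space, not two.
normalized = re.sub(r"\r\n|\n|\r", " ", "a\r\nb\nc\rd")

print(len(uninterpreted), len(crlf))
print(found_uninterpreted)
print(replaced)
print(normalized)
```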