Azure Synapse: turn an n-length, delimited list column into n distinct columns

In Azure Synapse, I'd like to convert this table
id,list
0,'a:b'
1,'d:e'
2,'g:h'
into this one
id,col1,col2
0,a,b
1,d,e
2,g,h
I'm sure STRING_SPLIT comes into play, but its return format confuses me.

If your data is as simple as shown then something like this will work:
;WITH cte AS (
SELECT *, CHARINDEX( ':', list ) AS xpos
FROM dbo.rawData
)
SELECT id, LEFT( list, xpos - 1 ) AS col1, SUBSTRING ( list, xpos + 1, 50 ) AS col2
FROM cte
If your data contains the single quotes, use the REPLACE function to strip them. If this does not work for you, please provide some more realistic sample data.
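To check the splitting logic outside the database, here is a minimal Python sketch of the same idea: locate the first delimiter, take everything to its left as col1 and everything to its right as col2. The helper name split_pair and the sample rows are illustrative assumptions, not part of the answer; the quote stripping plays the role of REPLACE.

```python
def split_pair(value, delimiter=":"):
    """Mirror the CHARINDEX / LEFT / SUBSTRING approach for a two-part list."""
    cleaned = value.replace("'", "")   # same role as REPLACE in the SQL version
    pos = cleaned.find(delimiter)      # 0-based index, -1 when the delimiter is absent
    if pos == -1:
        return cleaned, None           # no delimiter: nothing to put in col2
    return cleaned[:pos], cleaned[pos + 1:]


# Sample rows taken from the question (id, list)
rows = [(0, "'a:b'"), (1, "'d:e'"), (2, "'g:h'")]
result = [(row_id, *split_pair(text)) for row_id, text in rows]
print(result)
```

Unlike the SQL version, this sketch handles a missing delimiter explicitly; in T-SQL, LEFT( list, xpos - 1 ) would raise an error when xpos is 0.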

Related

Count number of WHERE filters in SQL query using regex

Update: I've updated the test string to cover a case that I've missed.
I'm trying to count the number of WHERE filters in a query using regex.
So the general idea is to count the number of WHERE and AND keywords occurring in the query, while excluding any AND that appears after a JOIN and before a WHERE, and also excluding any AND that appears in a CASE WHEN clause.
For example, this query:
WITH cte AS (\nSELECT a,b\nFROM something\nWHERE a>10\n AND b<5)\n, cte2 AS (\n SELECT c,\nd FROM another\nWHERE c>10\nAND d<5)\n SELECT CASE WHEN c1.a=1\nAND c2.c=1 THEN 'yes' ELSE 'no' \nEND,c1.a,c1.b,c2.c,c2.d\nFROM cte c1\nINNER JOIN cte2 c2 ON c1.a = c2.c\nAND c1.b = c2.d\nWHERE c1.a<4 AND DATE(c1)>'2022-01-01'\nAND c2.c>6
-- FORMATTED FOR EASE OF READ. PLEASE USE LINE ABOVE AS REGEX TEST STRING
WITH cte AS (
SELECT a,b
FROM something
WHERE a>10
AND b<5
)
, cte2 AS (
SELECT c,d
FROM another
WHERE c>10
AND d<5
)
SELECT
CASE
WHEN c1.a=1 AND c2.c=1 THEN 'yes'
WHEN c1.a=1 AND c2.c=1 THEN 'maybe'
ELSE 'no'
END,
c1.a,
c1.b,
c2.c,
c2.d
FROM cte c1
INNER JOIN cte2 c2
ON c1.a = c2.c
AND c1.b = c2.d
WHERE c1.a<4
AND DATE(c1)>'2022-01-01'
AND c2.c>6
should return 7, which are:
WHERE a>10
AND b<5
WHERE c>10
AND d<5
WHERE c1.a<4
AND DATE(c1)>'2022-01-01'
AND c2.c>6
The portion AND c1.b = c2.d is not counted because it happens after JOIN, before WHERE.
The portion AND c2.c=1 is not counted because it is in a CASE WHEN clause.
I eventually plan to use this on PostgreSQL queries to count the number of filters that appear in all queries over a certain period.
I've tried searching for an answer and attempting it myself, but to no avail, hence looking for help here. Thank you in advance!

I try to stay away from lookarounds as they could be messy and too painful to use, especially with the fixed-width limitation of lookbehind assertion.
My proposed solution is to capture all scenarios in different groups, and then select only the group of interest. The undesired scenarios will still be matched, but will not be selected.
Group 1 - Starts with JOIN (undesired)
Group 2 - Starts with WHERE (desired)
Group 3 - Starts with CASE (undesired)
(JOIN.*?(?=$|WHERE|JOIN|CASE|END))|(WHERE.*?(?=$|WHERE|JOIN|CASE|END))|(CASE.*?(?=$|WHERE|JOIN|CASE|END))
Note: Feel free to replace WHERE|JOIN|CASE|END with any keywords you want to act as the 'stopper' words.
All scenarios, including the undesired ones, will be matched, but you need to select only Group 2.
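A minimal Python sketch of this grouping idea, run against the test string from the question. The pattern is the one given above; the counting step (one for the WHERE plus one per AND inside each Group 2 match) and the function name count_where_filters are assumptions added for illustration, since the answer only says to select Group 2.

```python
import re

# Lookahead listing the "stopper" keywords, as in the pattern above.
STOP = r"(?=$|WHERE|JOIN|CASE|END)"

# Group 1: JOIN ... (undesired), Group 2: WHERE ... (desired), Group 3: CASE ... (undesired).
# DOTALL lets "." cross line breaks; without MULTILINE, "$" only matches at the end.
PATTERN = re.compile(
    rf"(JOIN.*?{STOP})|(WHERE.*?{STOP})|(CASE.*?{STOP})",
    re.DOTALL,
)


def count_where_filters(sql):
    """Count WHERE filters: 1 per WHERE plus 1 per AND inside each Group 2 match."""
    total = 0
    for match in PATTERN.finditer(sql):
        where_part = match.group(2)
        if where_part is None:
            continue  # a JOIN or CASE segment: matched, but not selected
        total += 1 + len(re.findall(r"\bAND\b", where_part))
    return total


query = (
    "WITH cte AS (\nSELECT a,b\nFROM something\nWHERE a>10\n AND b<5)\n"
    ", cte2 AS (\n SELECT c,\nd FROM another\nWHERE c>10\nAND d<5)\n"
    " SELECT CASE WHEN c1.a=1\nAND c2.c=1 THEN 'yes' ELSE 'no' \nEND,"
    "c1.a,c1.b,c2.c,c2.d\nFROM cte c1\nINNER JOIN cte2 c2 ON c1.a = c2.c\n"
    "AND c1.b = c2.d\nWHERE c1.a<4 AND DATE(c1)>'2022-01-01'\nAND c2.c>6"
)

print(count_where_filters(query))  # 7
```

This remains a heuristic: a stopper keyword that appears inside a string literal or an identifier would cut a segment short, and the match is case-sensitive as written.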

You can try something like this:
WITH DataSource (parts) AS
(
SELECT REGEXP_MATCHES(
'WITH cte AS (SELECT a,b FROM something WHERE a>10 AND b<5)\n, cte2 AS (SELECT c,d FROM another WHERE c>10 AND d<5)\n SELECT c1.a,c1.b,c2.c,c2.d FROM cte c1 INNER JOIN cte2 c2 ON c1.a = c2.c AND c1.b = c2.d WHERE c1.a<4 AND c2.c>6',
E'(?= WHERE)[^)|;]+'
,'gmi'
)
)
SELECT SUM
(
(length(parts[1]) - length(REPLACE(parts[1], 'AND', ''))) / 3 -- counting ANDs
+ 1 -- for the where
)
FROM DataSource
The idea is to match the text after each WHERE clause, then simply count the ANDs and add one for the matched WHERE.
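The length-difference trick used to count the ANDs can be sketched in Python as follows. The helper name count_occurrences and the sample clause are assumptions for illustration only.

```python
def count_occurrences(text, token):
    """Count non-overlapping occurrences of token via the length difference,
    mirroring (length(x) - length(REPLACE(x, 'AND', ''))) / 3 in the query."""
    return (len(text) - len(text.replace(token, ""))) // len(token)


where_clause = " WHERE c1.a<4 AND c2.c>6"
filters = count_occurrences(where_clause, "AND") + 1  # +1 for the WHERE itself
print(filters)  # 2

# Caveat: this counts substrings, so a name such as BRAND is counted as well.
print(count_occurrences("BRAND = 'x' AND y = 1", "AND"))  # 2
```

Because the count is substring-based and case-sensitive, column names containing AND or lower-case keywords would skew the result.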

How to use listagg function in select query? [duplicate]

Would it be possible to construct SQL to concatenate column values from
multiple rows?
The following is an example:
Table A
PID
A
B
C
Table B
PID SEQ Desc
A 1 Have
A 2 a nice
A 3 day.
B 1 Nice Work.
C 1 Yes
C 2 we can
C 3 do
C 4 this work!
Output of the SQL should be -
PID Desc
A Have a nice day.
B Nice Work.
C Yes we can do this work!
So basically the Desc column of the output table is a concatenation of the Desc values from Table B, ordered by SEQ?
Any help with the SQL?

There are a few ways depending on what version you have - see the Oracle documentation on string aggregation techniques. A very common one is to use LISTAGG:
SELECT pid, LISTAGG(Desc, ' ') WITHIN GROUP (ORDER BY seq) AS description
FROM B GROUP BY pid;
Then join to A to pick out the pids you want.
Note: Out of the box, LISTAGG only works correctly with VARCHAR2 columns.
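To make explicit what the LISTAGG query computes, here is a minimal Python sketch that groups the Table B rows from the question by PID, orders each group by SEQ, and joins the descriptions with a space. The function name listagg and the in-memory table are illustrative assumptions; this shows the semantics only, not how Oracle implements it.

```python
from collections import defaultdict

# Table B from the question: (PID, SEQ, Desc)
table_b = [
    ("A", 1, "Have"), ("A", 2, "a nice"), ("A", 3, "day."),
    ("B", 1, "Nice Work."),
    ("C", 1, "Yes"), ("C", 2, "we can"), ("C", 3, "do"), ("C", 4, "this work!"),
]


def listagg(rows, separator=" "):
    """GROUP BY pid, then join each group's descriptions ORDER BY seq."""
    groups = defaultdict(list)
    for pid, seq, desc in rows:
        groups[pid].append((seq, desc))
    return {
        pid: separator.join(desc for _, desc in sorted(parts))
        for pid, parts in sorted(groups.items())
    }


aggregated = listagg(table_b)
for pid, description in aggregated.items():
    print(pid, description)
```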

There's also an XMLAGG function, which works on versions prior to 11.2. Because WM_CONCAT is undocumented and unsupported by Oracle, it's recommended not to use it in production systems.
With XMLAGG you can do the following:
SELECT XMLAGG(XMLELEMENT(E,ename||',')).EXTRACT('//text()') "Result"
FROM employee_names
What this does is
put the values of the ename column (concatenated with a comma) from the employee_names table in an xml element (with tag E)
extract the text of this
aggregate the xml (concatenate it)
call the resulting column "Result"

With SQL model clause:
select pid
     , ltrim(sentence) sentence
  from ( select pid
              , seq
              , sentence
           from b
          model
                partition by (pid)
                dimension by (seq)
                measures (descr,cast(null as varchar2(100)) as sentence)
                ( sentence[any] order by seq desc
                  = descr[cv()] || ' ' || sentence[cv()+1]
                )
       )
 where seq = 1
/
P SENTENCE
- ---------------------------------------------------------------------------
A Have a nice day
B Nice Work.
C Yes we can do this work!
3 rows selected.
I wrote about this here. And if you follow the link to the OTN-thread you will find some more, including a performance comparison.

The LISTAGG analytic function was introduced in Oracle 11g Release 2, making it very easy to aggregate strings.
If you are using 11g Release 2 you should use this function for string aggregation.
Please refer to the URL below for more information about string concatenation.
http://www.oracle-base.com/articles/misc/StringAggregationTechniques.php

As most of the answers suggest, LISTAGG is the obvious option. However, one annoying aspect of LISTAGG is that if the total length of the concatenated string exceeds 4000 characters (the limit for VARCHAR2 in SQL), the error below is thrown, which is difficult to manage in Oracle versions up to 12.1:
ORA-01489: result of string concatenation is too long
A new feature added in 12cR2 is the ON OVERFLOW clause of LISTAGG.
The query including this clause would look like:
SELECT pid, LISTAGG(Desc, ' ' ON OVERFLOW TRUNCATE) WITHIN GROUP (ORDER BY seq) AS description
FROM B GROUP BY pid;
The above will restrict the output to 4000 characters but will not throw the ORA-01489 error.
These are some of the additional options of ON OVERFLOW clause:
ON OVERFLOW TRUNCATE 'Contd..' : This will display 'Contd..' at the end of the string (the default is '...').
ON OVERFLOW TRUNCATE '' : This will display the 4000 characters without any terminating string.
ON OVERFLOW TRUNCATE WITH COUNT : This will display the number of truncated values at the end, after the terminating characters, e.g. '...(5512)'.
ON OVERFLOW ERROR : Use this if you expect the LISTAGG to fail with the ORA-01489 error (which is the default anyway).

For those who must solve this problem using Oracle 9i (or earlier), you will probably need to use SYS_CONNECT_BY_PATH, since LISTAGG is not available.
To answer the OP, the following query will display the PID from Table A and concatenate all the DESC columns from Table B:
SELECT pid, SUBSTR (MAX (SYS_CONNECT_BY_PATH (description, ', ')), 3) all_descriptions
FROM (
SELECT ROW_NUMBER () OVER (PARTITION BY pid ORDER BY pid, seq) rnum, pid, description
FROM (
SELECT a.pid, seq, description
FROM table_a a, table_b b
WHERE a.pid = b.pid(+)
)
)
START WITH rnum = 1
CONNECT BY PRIOR rnum = rnum - 1 AND PRIOR pid = pid
GROUP BY pid
ORDER BY pid;
There may also be instances where keys and values are all contained in one table. The following query can be used where there is no Table A, and only Table B exists:
SELECT pid, SUBSTR (MAX (SYS_CONNECT_BY_PATH (description, ', ')), 3) all_descriptions
FROM (
SELECT ROW_NUMBER () OVER (PARTITION BY pid ORDER BY pid, seq) rnum, pid, description
FROM (
SELECT pid, seq, description
FROM table_b
)
)
START WITH rnum = 1
CONNECT BY PRIOR rnum = rnum - 1 AND PRIOR pid = pid
GROUP BY pid
ORDER BY pid;
All values can be reordered as desired. Individual concatenated descriptions can be reordered by changing the ORDER BY inside the ROW_NUMBER () OVER (...) clause, and the list of PIDs can be reordered in the final ORDER BY clause.
Alternately: there may be times when you want to concatenate all the values from an entire table into one row.
The key idea here is using an artificial value for the group of descriptions to be concatenated.
In the following query, the constant string '1' is used, but any value will work:
SELECT SUBSTR (MAX (SYS_CONNECT_BY_PATH (description, ', ')), 3) all_descriptions
FROM (
SELECT ROW_NUMBER () OVER (PARTITION BY unique_id ORDER BY pid, seq) rnum, description
FROM (
SELECT '1' unique_id, b.pid, b.seq, b.description
FROM table_b b
)
)
START WITH rnum = 1
CONNECT BY PRIOR rnum = rnum - 1;
Individual concatenated descriptions can be reordered by changing the ORDER BY inside the ROW_NUMBER () OVER (...) clause.
Several other answers on this page have also mentioned this extremely helpful reference:
https://oracle-base.com/articles/misc/string-aggregation-techniques

LISTAGG delivers the best performance if sorting is a must (00:00:05.85):
SELECT pid, LISTAGG(Desc, ' ') WITHIN GROUP (ORDER BY seq) AS description
FROM B GROUP BY pid;
COLLECT delivers the best performance if sorting is not needed (00:00:02.90):
SELECT pid, TO_STRING(CAST(COLLECT(Desc) AS varchar2_ntt)) AS Vals FROM B GROUP BY pid;
COLLECT with ordering is a bit slower (00:00:07.08):
SELECT pid, TO_STRING(CAST(COLLECT(Desc ORDER BY Desc) AS varchar2_ntt)) AS Vals FROM B GROUP BY pid;
All other techniques were slower.

Before you run a select query, run this:
SET SERVEROUT ON SIZE 6000
SELECT XMLAGG(XMLELEMENT(E,SUPLR_SUPLR_ID||',')).EXTRACT('//text()') "SUPPLIER"
FROM SUPPLIERS;

Try this code:
SELECT XMLAGG(XMLELEMENT(E,fieldname||',')).EXTRACT('//text()') "FieldNames"
FROM FIELD_MASTER
WHERE FIELD_ID > 10 AND FIELD_AREA != 'NEBRASKA';

In the select where you want your concatenation, call a SQL function.
For example:
select PID, dbo.MyConcat(PID)
from TableA;
Then for the SQL function:
create function dbo.MyConcat(@PID varchar(10))
returns varchar(1000)
as
begin
declare @x varchar(1000);
-- append each description, adding a comma only once @x already holds a value
select @x = isnull(@x + ',', '') + [Desc]
from TableB
where PID = @PID;
return @x;
end
The function above uses SQL Server (T-SQL) syntax; the header may need adjusting for your database, but the principle does work.

How to build a regex in Hive to get string until Nth occurrence of a delimiter

I have some sample data in Hive as
select "abc:def:ghi:jkl" as data
union all
select "jkl:mno:23ar:stu:abc:def:ghi:7345" as data
I want to extract the strings until 3rd colon so that I get the output as
abc:def:ghi
jkl:mno:23ar
I want to keep N as variable so that I can shrink the output text as needed. How do I do this in Hive?

SELECT regexp_replace(`data`, '^([^:]+:[^:]+:[^:]+).*$', "$1")
FROM
( SELECT "abc:def:ghi:jkl" AS `data`
UNION ALL SELECT "jkl:mno:23ar:stu:abc:def:ghi:7345" AS `data`) AS tmp

Using the split and posexplode functions, you can split the string, keep only the first N positions, and concatenate the pieces again:
select t.dataId, concat_ws(":", collect_list(t.cell)) as firstN from (
SELECT x.dataId, pos as pos, cell
FROM (
select 1 as dataId, "jkl:mno:23ar:stu:abc:def:ghi:7345" as data
union all
select 2 as dataId, "abc:def:ghi:7345" as data
) x
LATERAL VIEW posexplode(split(x.data,':')) dataTable AS pos, cell
) t
where t.pos<3
group by t.dataId

With variable:
set hivevar:n=3; --variable, you can pass it to the script
with your_table as(
select stack(2,"abc:def:ghi:jkl", "jkl:mno:23ar:stu:abc:def:ghi:7345")as data
)
select regexp_replace(regexp_extract(data,'([^:]*:){1,${hivevar:n}}',0),':$','') from your_table;
Result:
OK
abc:def:ghi
jkl:mno:23ar
Time taken: 0.105 seconds, Fetched: 2 row(s)
After variable substitution, the quantifier {1,${hivevar:n}} becomes {1,3}, which means 1 to 3 times; this allows extracting values with fewer than 3 elements. If you do not want shorter values extracted, use the {${hivevar:n}} quantifier instead; in that case, if there are fewer than N elements, an empty string is extracted.
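The effect of the two quantifiers can be explored outside Hive with a small Python sketch. The function name first_n_fields and the strict flag are assumptions for illustration; Python's re module stands in for Hive's Java regex engine, which treats this simple pattern the same way.

```python
import re


def first_n_fields(data, n, strict=False):
    """Extract up to n colon-terminated fields, then drop the trailing colon.

    strict=False uses the {1,n} quantifier, strict=True uses {n}.
    """
    quantifier = f"{{{n}}}" if strict else f"{{1,{n}}}"
    match = re.search(rf"([^:]*:){quantifier}", data)
    prefix = match.group(0) if match else ""  # empty string when nothing matches
    return re.sub(r":$", "", prefix)


print(first_n_fields("abc:def:ghi:jkl", 3))                    # abc:def:ghi
print(first_n_fields("jkl:mno:23ar:stu:abc:def:ghi:7345", 3))  # jkl:mno:23ar
print(repr(first_n_fields("abc:def", 3, strict=True)))         # ''
```

Note that each repetition must end in a colon, so with the {1,n} form an input shorter than n fields loses its last field: first_n_fields("abc:def", 3) gives "abc" in this sketch.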

Select substring_index('abc:def:ghi:jkl',':',3) as data
Union all
Select substring_index('jkl:mno:23ar:stu:abc:def:ghi:7345',':',3) as data;
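For comparison, the behaviour of substring_index with a positive count can be sketched in Python as a split followed by a join. The helper below is an illustrative stand-in, not Hive's implementation, and it deliberately ignores negative counts.

```python
def substring_index(text, delimiter, count):
    """Return everything before the count-th delimiter (positive count only)."""
    return delimiter.join(text.split(delimiter)[:count])


print(substring_index("abc:def:ghi:jkl", ":", 3))                    # abc:def:ghi
print(substring_index("jkl:mno:23ar:stu:abc:def:ghi:7345", ":", 3))  # jkl:mno:23ar
print(substring_index("abc:def", ":", 3))                            # abc:def
```

When the input has fewer than count fields, the whole string comes back unchanged.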

Looking for the proper way to format the text in a column and compare that with the value of a cell?

I am trying to format the information from a column that I am querying and compare that to information in a cell. I have tried to hack together various ways to do this, but I am not a proficient SQL/spreadsheet user.
In COLUMN I there is nothing.
In COLUMN K there is a match on A2.
In COLUMN N there is Information formatted like 31'-40' and 41'+.
I would prefer to use = instead of contains.
The REPLACE Function seems to work when I substitute N for a String and run it on the W3 School Website.
The REGEXREPLACE seems to work on D2. I would expect them to match, but they do not.
COUNT( QUERY( '2019'!A2:P, "select D where I='' and upper(K) contains '" & UPPER(A2) & "' and REPLACE(REPLACE(REPLACE(N, '-', ''), '''', ''), '+','') contains '"& Regexreplace(D2,"[[:punct:]]","") &"' ")
I get 0 matches.

You almost had it, but try it like this:
=COUNTA(FILTER(2019!D2:D, I2:I="",
REGEXMATCH(UPPER(K2:K), UPPER(A2)),
REGEXMATCH(UPPER(N2:N), UPPER(D2))))

How to remove carriage returns and new lines on all the columns in a table using Postgresql?

I am trying to see if there is any way to remove carriage returns and newlines from all the varchar columns in a table using one statement.
I know that we can do this for a single column using something like the following:
select regexp_replace(field, E'[\\n\\r]+', ' ', 'g' )
But then I would need one such call for every column, which I don't want to do unless there is no easier way.
Appreciate your help!

You can do this either by creating a plpgsql function that executes dynamic SQL, or by running it directly via DO, as in the following example (replace <mytable> with the name of your table):
do $$declare _q text; _table text = '<mytable>';
begin
select 'update '||attrelid::regclass::text||E' set\n'||
string_agg(' '||quote_ident(attname)||$q$ = regexp_replace($q$||quote_ident(attname)||$q$, '[\n\r]+', ' ', 'g')$q$, E',\n' order by attnum)
into _q
from pg_attribute
where attnum > 0 and atttypid::regtype::text in ('text', 'character varying')
group by attrelid
having attrelid = _table::regclass;
raise notice E'Executing:\n\n%', _q;
-- uncomment this line when happy with the query:
-- execute _q;
end;$$;
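The statement that the DO block assembles can be previewed with a small Python sketch that builds the same kind of UPDATE from a list of column names. The function names, the table name my_table, and the columns name and notes are hypothetical; quote_ident here always adds double quotes, which is a simplification of PostgreSQL's function of the same name.

```python
# SQL string literal holding the regex; the regex engine interprets \n and \r.
PATTERN_LITERAL = r"'[\n\r]+'"


def quote_ident(name):
    """Simplified identifier quoting: always wrap in double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_cleanup_update(table, columns):
    """Build one UPDATE that strips CR/LF runs from every listed column."""
    assignments = ",\n".join(
        f"  {quote_ident(col)} = regexp_replace({quote_ident(col)}, "
        f"{PATTERN_LITERAL}, ' ', 'g')"
        for col in columns
    )
    return f"update {quote_ident(table)} set\n{assignments}"


statement = build_cleanup_update("my_table", ["name", "notes"])
print(statement)
```

As with the DO block, it is worth inspecting the generated statement before executing it against real data.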
