How to write a regex for a list of items sharing the same roots

So I have a list of keywords:
['xxxxl','xxxl','xxl','xl','xxxxt','xxxt','xxt','xt']
In BigQuery, I want to write a regex to use inside the following SQL:
SELECT my_column
FROM table
WHERE REGEXP_CONTAINS(LOWER(my_column), regex)
so that my output table contains only the values that don't match any of the items in the keywords list.
Thanks

Below is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.lookup_table` AS (
SELECT ['xxxxl','xxxl','xxl','xl','xxxxt','xxxt','xxt','xt'] keywords
)
SELECT my_column
FROM `project.dataset.table`,
(SELECT STRING_AGG(LOWER(keyword), '|') exclude_pattern
FROM `project.dataset.lookup_table`,
UNNEST(keywords) keyword)
WHERE NOT REGEXP_CONTAINS(LOWER(my_column), exclude_pattern)
You can test and experiment with the above using the simplified example below
#standardSQL
WITH `project.dataset.lookup_table` AS (
SELECT ['xxxxl','xxxl','xxl','xl','xxxxt','xxxt','xxt','xt'] keywords
), `project.dataset.table` AS (
SELECT 'xxxxl' my_column UNION ALL
SELECT 'abc'
)
SELECT my_column
FROM `project.dataset.table`,
(SELECT STRING_AGG(LOWER(keyword), '|') exclude_pattern
FROM `project.dataset.lookup_table`,
UNNEST(keywords) keyword)
WHERE NOT REGEXP_CONTAINS(LOWER(my_column), exclude_pattern)
with output
Row my_column
1 abc
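The same logic can be sketched outside BigQuery to see what the generated pattern does. The Python below is only an illustration: it assumes Python's `re.search` behaves like `REGEXP_CONTAINS` for this simple alternation (a substring match), and the function name `keep_non_matching`, the sample rows, and the word-boundary variant are made up for the example rather than taken from the answer.

```python
import re

keywords = ['xxxxl', 'xxxl', 'xxl', 'xl', 'xxxxt', 'xxxt', 'xxt', 'xt']

# Counterpart of STRING_AGG(LOWER(keyword), '|'): one alternation pattern.
exclude_pattern = '|'.join(k.lower() for k in keywords)


def keep_non_matching(values, pattern):
    """Keep values with no match for pattern (like WHERE NOT REGEXP_CONTAINS)."""
    compiled = re.compile(pattern)
    return [v for v in values if not compiled.search(v.lower())]


# Hypothetical sample rows, not from the question.
rows = ['xxxxl', 'abc', 'Size XL', 'axle']
print(keep_non_matching(rows, exclude_pattern))

# Assumed stricter variant: treat the keywords as whole words only.
whole_word_pattern = r'\b(?:' + exclude_pattern + r')\b'
print(keep_non_matching(rows, whole_word_pattern))
```

Because the match is a substring match, the short alternatives `xl` and `xt` already cover every longer keyword, and a value such as `axle` is excluded as well. If that is not wanted, a pattern with word boundaries like the second one is one possible adjustment.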

Related

what is wrong with my query in oracle apex?

The following query fails, but it works once I rewrite it as the second query further down. Why?
select * ,count(*) over(partition by colour) as counts_by_colour
from bricks;
output:
ORA-00923: FROM keyword not found where expected
modified:
select b.* ,count(*) over(partition by colour) as counts_by_colour
from bricks b;
This is just the way SQL works - nothing to do with APEX.
select * means select all columns from what follows. So...
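select * from emp join dept on emp.deptno = dept.deptno;
...returns all columns from emp and all columns from dept.
In Oracle, an unqualified * must be the only item in the select list, so you are not allowed to select anything else alongside select * - e.g. ...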
select *, 'abc' from emp;
raises the same error you got.
However, you can use select alias.* to select all columns from one table/view in the query, and then you are also allowed to select other things:
select e.*, 'abc' from emp e;
or
select e.*, d.loc from emp e join dept d on e.deptno = d.deptno;
The "implicit" alias also works:
select emp.*, 'abc' from emp;

Getting table names and row counts for all tables in an athena database

I have an AWS database with multiple tables that I am trying to get the row counts for in a single query.
The ideal query output would be:
table_name row_count
table2_name row_count
etc...
So far I've been able to either get all the table names from the database or all the rowcounts of the tables (in random order), but not both in the same query.
This query returns a column of all the table names that exist in the database:
SELECT table_name FROM information_schema.tables WHERE table_schema = '<database_name>';
This query returns all the row counts for the tables:
SELECT COUNT(*) FROM table_name
UNION ALL
SELECT COUNT(*) FROM table2_name
UNION ALL
etc..for the rest of the tables
The issue with this query is that it displays the row counts in an order that doesn't correspond to the order of the tables in the query, so I don't know which row count goes with which table. That is why I need both the table names and the row counts.
Simply add the names of the tables as literals in your queries:
SELECT 'table_name' AS table_name, COUNT(*) AS row_count FROM table_name
UNION ALL
SELECT 'table_name2' AS table_name, COUNT(*) AS row_count FROM table_name2
UNION ALL
…
The following query generates the UNION ALL query that produces the record counts for all tables.
One problem to work around is that (as of December 2022) INFORMATION_SCHEMA.TABLES reports every table and view as a BASE TABLE, so you will need some logic of your own to eliminate the views.
In data warehousing it is common practice to record snapshots of the record counts of landing tables at frequent intervals. Any unexpected deviations from the expected counts can then be used for reporting/alerting.
WITH Table_List AS (
SELECT table_schema,table_name, CONCAT('SELECT CURRENT_DATE AS run_date, ''',table_name, ''' AS table_name, COUNT(*) AS Records FROM "',table_schema,'"."', table_name, '"') AS BaseSQL
FROM INFORMATION_SCHEMA.TABLES
WHERE
table_schema = 'YOUR_DB_NAME' -- Change this
AND table_name LIKE 'YOUR TABLE PATTERN%' -- Change or remove this line
)
, Total_Records AS (
SELECT COUNT(*) AS Table_Count
FROM Table_List
)
SELECT
CASE WHEN ROW_NUMBER() OVER (ORDER BY table_name) = Table_Count
THEN BaseSQL
ELSE CONCAT(BaseSQL, ' UNION ALL') END AS All_Table_Record_count_SQL
FROM Table_List CROSS JOIN Total_Records
ORDER BY table_name;
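If the list of table names is already available on the client side (for example from the information_schema query in the question), the same UNION ALL statement can also be assembled in a few lines of script. The sketch below is a hedged illustration, not part of either answer: `build_count_sql` is a hypothetical helper, the schema and table names are placeholders, and it assumes the names are trusted, since nothing is escaped.

```python
def build_count_sql(schema, table_names):
    """Build one UNION ALL query returning (table_name, row_count) per table.

    Assumes schema and table names are trusted identifiers (no escaping is done).
    """
    selects = [
        "SELECT '{name}' AS table_name, COUNT(*) AS row_count "
        'FROM "{schema}"."{name}"'.format(name=name, schema=schema)
        for name in sorted(table_names)
    ]
    # Joining with UNION ALL avoids the "is this the last row?" check
    # that the pure-SQL generator needs.
    return "\nUNION ALL\n".join(selects)


# Placeholder names for illustration only.
print(build_count_sql("my_db", ["orders", "customers"]))
```

The generated text can then be pasted into the query editor and run as a single query.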

Why does Calcite change GROUP_CONCAT to LISTAGG?

I built a RelNode using the following SQL:
SELECT GROUP_CONCAT(ename ORDER BY ename DESC SEPARATOR 'a') FROM emp
and I used RelToSqlConverter to convert it back to SQL. I get this SQL:
SELECT LISTAGG(`ename`, 'a') WITHIN GROUP (ORDER BY `ename` IS NULL DESC, `ename` DESC) FROM `emp`
But I want to get GROUP_CONCAT not LISTAGG.
Check https://issues.apache.org/jira/browse/CALCITE-4349
GROUP_CONCAT is analogous to LISTAGG (see CALCITE-2754) (and also to BigQuery and PostgreSQL's STRING_AGG, see CALCITE-4335). For example, the query
SELECT deptno, GROUP_CONCAT(ename ORDER BY empno SEPARATOR ';')
FROM Emp
GROUP BY deptno
is equivalent to (and in Calcite's algebra would be desugared to)
SELECT deptno, LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno)
FROM Emp
GROUP BY deptno

Bigquery Standard Dialect REGEXP_REPLACE input type

I am exploring the power of Google BigQuery with the GDELT database using this tutorial. However, the tutorial's SQL is written in the legacy dialect, and I would like to use the standard dialect.
In legacy dialect:
SELECT
theme,
COUNT(*) AS count
FROM (
SELECT
REGEXP_REPLACE(SPLIT(V2Themes,';'), r',.*',"") theme
from [gdelt-bq:gdeltv2.gkg]
where DATE>20150302000000 and DATE < 20150304000000 and V2Persons like '%Netanyahu%'
)
group by theme
ORDER BY 2 DESC
LIMIT 300
and when I try to translate into standard dialect:
SELECT
theme,
COUNT(*) AS count
FROM (
SELECT
REGEXP_REPLACE(SPLIT(V2Themes,';') , r',.*', " ") AS theme
FROM
`gdelt-bq.gdeltv2.gkg`
WHERE
DATE>20150302000000
AND DATE < 20150304000000
AND V2Persons LIKE '%Netanyahu%' )
GROUP BY
theme
ORDER BY
2 DESC
LIMIT
300
it throws the following error:
No matching signature for function REGEXP_REPLACE for argument types: ARRAY<STRING>, STRING, STRING. Supported signatures: REGEXP_REPLACE(STRING, STRING, STRING); REGEXP_REPLACE(BYTES, BYTES, BYTES) at [6:5]
It seems like I have to cast the result of the SPLIT() operation to a string. How do I do this?
UPDATE: I found a talk explaining the unnest operation:
SELECT
COUNT(*),
REGEXP_REPLACE(themes,",.*","") AS theme
FROM
`gdelt-bq.gdeltv2.gkg_partitioned`,
UNNEST( SPLIT(V2Themes,";") ) AS themes
WHERE
_PARTITIONTIME >= "2018-08-09 00:00:00"
AND _PARTITIONTIME < "2018-08-10 00:00:00"
AND V2Persons LIKE '%Netanyahu%'
GROUP BY
theme
ORDER BY
1 DESC
LIMIT
100
Flatten the array first:
SELECT
REGEXP_REPLACE(raw_theme, r',.*', " ") AS theme,
COUNT(*) AS count
FROM
`gdelt-bq.gdeltv2.gkg`,
UNNEST(SPLIT(V2Themes,';')) AS raw_theme
WHERE
DATE>20150302000000
AND DATE < 20150304000000
AND V2Persons LIKE '%Netanyahu%'
GROUP BY
theme
ORDER BY
2 DESC
LIMIT
300
The legacy SQL equivalent in your question actually has the effect of flattening the array as well, although it's implicit in the GROUP BY on the theme.
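The reason the flattening step is needed can be seen in a small Python sketch: splitting produces a list, and the regex replacement has to be applied to each element rather than to the list itself. This is only an illustration under assumptions: `count_themes` is a hypothetical helper, the sample strings are invented (they only imitate a "THEME,offset" layout), and the replacement uses the empty string as in the legacy query.

```python
import re
from collections import Counter


def count_themes(v2themes_values):
    """Count themes after splitting on ';' and stripping everything from the first ','."""
    counts = Counter()
    for value in v2themes_values:
        # SPLIT(V2Themes, ';') yields a list (ARRAY<STRING>), not a string.
        for raw_theme in value.split(';'):
            # REGEXP_REPLACE works on one string, so it runs per element,
            # which is what UNNEST makes possible in the SQL version.
            theme = re.sub(r',.*', '', raw_theme)
            counts[theme] += 1
    return counts


# Invented sample values for illustration only.
sample = ['TAX_FNCACT,12;LEADER,40', 'LEADER,7']
print(count_themes(sample).most_common())
```

`most_common()` plays the role of ORDER BY count DESC here.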

RegEx in BigQuery

I need to split the following field: LP1234354_CD12346
and get two separate columns with the following values: 1234354 and 12346.
I tried regex and RIGHT/LEFT, but without success. Thank you in advance!
Dummy data:
SELECT 'LP1234354_CD12346' AS word UNION ALL
SELECT 'LP1234456_CD12345'
Below is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 AS id, 'LP1234354_CD12346' AS word UNION ALL
SELECT 2, 'LP1234456_CD12345'
)
SELECT id,
REGEXP_EXTRACT_ALL(word, r'(\d+)')[SAFE_OFFSET(0)] AS val1,
REGEXP_EXTRACT_ALL(word, r'(\d+)')[SAFE_OFFSET(1)] AS val2
FROM `project.dataset.table`
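To check the pattern outside BigQuery, the same extraction can be sketched in Python. The function name `extract_numbers` is hypothetical, and the explicit length checks stand in for SAFE_OFFSET, which returns NULL instead of raising an error when the index is out of range.

```python
import re


def extract_numbers(word):
    """Return the first two digit runs in word, or None where one is missing."""
    # Same idea as REGEXP_EXTRACT_ALL(word, r'(\d+)'): all runs of digits, in order.
    digits = re.findall(r'(\d+)', word)
    val1 = digits[0] if len(digits) > 0 else None  # like [SAFE_OFFSET(0)]
    val2 = digits[1] if len(digits) > 1 else None  # like [SAFE_OFFSET(1)]
    return val1, val2


for word in ['LP1234354_CD12346', 'LP1234456_CD12345']:
    print(word, extract_numbers(word))
```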