Using another table in BigQuery Regex - regex

I would like to map a string column to a category based on a regular expression match.
Is it possible to use another bigquery table containing the regular expressions and corresponding category for this? This would make it easier for me to update only a table when adding new categories/updating the regex, instead of having to update all queries that would use this lookup.
Query:
CASE
-- Use the entries from another table here
WHEN REGEXP_MATCH(string_to_check, cat1regex) THEN cat1
WHEN REGEXP_MATCH(string_to_check, cat2regex) THEN cat2
etc.
END
Mapping table:
Regex category
pagex|pagey xy
pagez|page1 z1
It's also possible there is another simple way to do something similar that I'm not thinking of, answers pointing those out are welcome too.
Any help would be appreciated.

Below is for BigQuery Standard SQL
#standardSQL
SELECT
string_to_check,
MAX(IF(REGEXP_CONTAINS(string_to_check, reg), category, NULL)) AS category
FROM yourTable
CROSS JOIN mappingTable
GROUP BY string_to_check
You can test / play with it using below dummy date from your question
#standardSQL
WITH `mappingTable` AS (
SELECT r'pagex|pagey' AS reg, 'xy' AS category UNION ALL
SELECT r'pagez|page1', 'z1'
),
`yourTable` AS (
SELECT string_to_check
FROM UNNEST(["pagex.com", "pagez#example.org", "page.example.net"]) AS string_to_check
)
SELECT
string_to_check,
MAX(IF(REGEXP_CONTAINS(string_to_check, reg), category, NULL)) AS category
FROM yourTable
CROSS JOIN mappingTable
GROUP BY string_to_check

Related

Using cte to swap two columns of a table

I want to swap 2nd and 3rd column of one table using CTE.
I'm working with below query, which keeps throwing an error,
no such column: cte.comm1
Table - [SalComm] column: ID, Sal, Comm
with CTE as
(
SELECT ID as id1, sal as sal1, comm as comm1 from SalComm
) UPDATE SalComm SET sal=cte.comm1, comm=cte.sal1 where ID= cte.id1*
Could you please suggest to me the right query?
This answer assumes you are using SQL Server, or some other database, which supports directly updating common table expressions. I don't see the point at all of the aliases inside your CTE. If you want to swap columns values, just use the direct columns names:
WITH cte AS (
SELECT ID, sal, comm
FROM SalComm
)
UPDATE cte
SET sal = comm, comm = sal;
-- no WHERE clause needed, if you really want to cover the entire table
That being said, you could just as easily do the above update on the original table. Updatable CTEs are more useful when they generate some complex derived results which you intend to use as part of a later update. That does not appear to be the case here.

Remove duplicate substring from a string in oracle

I have strings like below in my table
2001,2452,2452,2421,2421,2495
2001,2483,2421,2421,2482
2001,2420,2421,2421,2425
2001,2420,2421,2421,2422
2001,2452,2452,2421,2421,2464
I want to remove the repeated numbers like 2452 and 2421 and show them only once in the data like
2001,2452,2421,2495
2001,2483,2421,2482
2001,2420,2421,2425
2001,2420,2421,2422
2001,2452,2421,2464
Has anyone done something like this? please let me know how to solve this
Thanks!
In Oracle SQL, You can use the hierarchy query and listagg as follows:
select str, listagg(str_distinct, ',') within group (order by 1) as distinct_str from
(select distinct str, regexp_substr(str,'[^,]+',1,column_value) str_distinct from cte
cross join table(
cast(multiset(
select level lvl
from dual
connect by level <= regexp_count(str, '[^,]+'))
as sys.odcivarchar2list)
) lvls)
group by str;
db<>fiddle for one of the input string.

REGEX help needed in Oracle

How to get all the table names from the below Sql? My sql returns only the last table name.
with t as
(select 'select col1,
(select max(col3) from dd3) max_timestamp
from dd1,
dd2
where dd1.col1 = dd2.col1
and dd1.col1 in(select col1 from dd4)' sql_text from dual)
select regexp_substr(regexp_substr(upper(sql_text), '\sFROM\s*(\w|\.|_)*'), '(\w|_|\.)+', 1,2)
from t
Thanks,
DD.
This is a more of a regex question than an Oracle question.
If you can run the sql through REPLACE(REPLACE(sql,CHR(13),' '),CHR(10),NULL) to replace all newlines with a space, so that the query fits on a single line, here is regex that will return all the tables in group 1 (for the ones after FROM) and group 3 for subsequent items in a list:
/FROM ([A-Z0-9$#_]+)(,[\s]*([A-Z0-9$#_]+))*/gi
Having multiple groups is not ideal, so I would look at the full match instead, see https://regex101.com/r/OZUalH/1/ for an example (see full match on the right, where every match has from followed by one or more tables).
But let me warn you this is not going to be robust, as these valid FROM clause expressions are not handled:
"my_table"
MY_TABLE AS A
MY_TABLE AS "a"
etc...
If it were me, I would write a function to run the query through explain plan (execute immediate 'explain plan for ...') and extract the tables from the plan tables (or possibly using SYS.DBMS_XPLAN)

Analyzing tweeter with hive, regex extract

I am trying to analyze what are the most popular hashtags of July. So far I am able to select tweets from July, or display the most popular tweets, but I didn't sucess in putting them together. I am thinking about creating a intermediate table with july tweets, then display the popular hashtags, but I don't know how, can you help me? What about a 2 level select (select a from select b from table) ?
SELECT hashtags.text, count(*) as total FROM tweets
WHERE regexp_extract(created_at, "(Tue) (Jul)*", 2) = "Jul"
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text), created_at
ORDER BY total_count DESC
LIMIT 200
Regards, K.
So far, I did this, which is pretty much what I want, but is there any mean to achieve this differently ?
Working nested query:
SELECT
LOWER(hashtags.text),
COUNT(*) AS total_count
FROM (
SELECT * FROM tweets WHERE regexp_extract(created_at,"(Tue Jul)*",1) = "Tue Jul"
) tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15
EDIT:
Ok, so if you want you can also do it by a temporary table:
CREATE TABLE tmpdb (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweet_count INT,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
Then you update it:
INSERT OVERWRITE TABLE tmpdb
SELECT * FROM tweets WHERE regexp_extract(created_at,"(Tue Jul)*",1) = "Tue Jul"
And the request become as simple as this:
SELECT
LOWER(hashtags.text),
COUNT(*) AS total_count
FROM tmpdb
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15
The pro/cons about the second method:
You need to update the table if you want accurate requests, so it is not suited for one-shot request, but if you need to do multiple requests on the current state of the database, then this method is better.
Don't forget that, copying a database is a costly operation ! So know when to use it :)

Birt Report Grouping

I have 2 columns: fltrte_P1_Plt_Per_Id_Fk (Pilot) and fltrte_P2_Plt_Per_Id_Fk (Co-Pilot).
When displaying data in the report I need to group based pilot name. He may be pilot or co-pilot.
It should come in same group. How can this grouping be achieved in a Birt report?
I suggest amending your query from:
select fltrte_P1_Plt_Per_Id_Fk, fltrte_P2_Plt_Per_Id_Fk, ... from flight_Log_Table
to:
select fltrte_P1_Plt_Per_Id_Fk as group_By, ... from flight_Log_Table
union all
select fltrte_P2_Plt_Per_Id_Fk as group_By, ... from flight_Log_Table
then amend your report to group on the new group_By field in the query.
Create a new column that combines the two in your SQL Query.
ISNULL ( fltrte_P1_Plt_Per_Id_Fk, fltrte_P2_Plt_Per_Id_Fk ) as 'Pilot'
If there is a value for P1 (Pilot) it will be in the new field 'Pilot', otherwise P2 (Co-Pilot) will the new field 'Pilot'
This solution works in BIRT 4.2 using a 2008 R2 database.
Add the GROUP BY clause Group by Pilot, Co-Pilot to the end of your SQL query.