Regex for SQL slow query log analyzing - regex

I am currently struggling with the following input:
# Time: 2022-06-01T20:00:00.000000Z
# User#Host: database[database] # [10.10.10.10] Id: 8888888
# Query_time: 0.000450 Lock_time: 0.000160 Rows_sent: 1 Rows_examined: 2
SET timestamp=1654715324;
SELECT id
FROM table_name
WHERE field = 'some-data' AND another_field != 'random-stuff'
ORDER BY field_2;
All my input data will look similar to this. Basically I want to check how many times a certain query shows up. Right now I am a little stuck because my regex cannot filter out the parameters between the single quotes.
I would like to match the following:
SELECT id
FROM table_name
WHERE field = '' AND another_field != ''
ORDER BY field_2;
I've managed to get the query from the input above with the following regExp, but right now this will only match the exact sql.
/(?<=\d;\n).+?(?=;)/gmi
I want to expand this regex so it will ignore anything between single quotes.
Help would be very much appreciated!

Related

String matching in URL using Hive / Spark SQL

I have two tables, one containing list of URL and other having a list of words. My requirement is to filter out the URLs containing the words.
For eg:
URL
https://www.techhive.com/article/3409153/65-inch-oled-4k-tv-from-lg-at-a-1300-dollar-discount.html
https://www.techradar.com/in/news/lg-c9-oled-65-inch-4ktv-price-drop
https://www.t3.com/news/cheap-oled-tv-deals-currys-august
https://indianexpress.com/article/technology/gadgets/lg-bets-big-on-oled-tvs-in-india-to-roll-out-rollable-tv-by-year-end-5823635/
https://www.sony.co.in/electronics/televisions/a1-series
https://www.amazon.in/Sony-138-8-inches-Bravia-KD-55A8F/dp/B07BWKVBYW
https://www.91mobiles.com/list-of-tvs/sony-oled-tv
Words
Sony
Samsung
Deal
Bravia
Now I want to filter any URL that has any of the words. Normally i would do a
Select url from url_table where url not like '%Sony%' or url not like '%Samsung%' or url not like '%Deal%' or not like '%Bravia%';
But that's a cumbersome and not scalable way to do it. What is the best way to achieve this? How do I use a not like function to the words table?
Using regex:
where url not rlike '(?i)Sony|Samsung|Deal|Bravia'
(?i) means case insesitive.
And now let's build the same regexp from the table with words.
You can aggregate list of words from the table and pass it to the rlike. See this example:
with
initial_data as (--replace with your table
select stack(7,
'https://www.techhive.com/article/3409153/65-inch-oled-4k-tv-from-lg-at-a-1300-dollar-discount.html',
'https://www.techradar.com/in/news/lg-c9-oled-65-inch-4ktv-price-drop',
'https://www.t3.com/news/cheap-oled-tv-deals-currys-august',
'https://indianexpress.com/article/technology/gadgets/lg-bets-big-on-oled-tvs-in-india-to-roll-out-rollable-tv-by-year-end-5823635/',
'https://www.sony.co.in/electronics/televisions/a1-series',
'https://www.amazon.in/Sony-138-8-inches-Bravia-KD-55A8F/dp/B07BWKVBYW',
'https://www.91mobiles.com/list-of-tvs/sony-oled-tv'
) as url ) ,
words as (-- replace with your words table
select stack (4, 'Sony','Samsung','Deal','Bravia') as word
),
sub as (--aggregate list of words for rlike
select concat('''','(?i)',concat_ws('|',collect_set(word)),'''') words_regex from words
)
select s.url
from initial_data s cross join sub --cross join with words_regex
where url not rlike sub.words_regex --rlike works fine
Result:
OK
url
https://www.techhive.com/article/3409153/65-inch-oled-4k-tv-from-lg-at-a-1300-dollar-discount.html
https://www.techradar.com/in/news/lg-c9-oled-65-inch-4ktv-price-drop
https://indianexpress.com/article/technology/gadgets/lg-bets-big-on-oled-tvs-in-india-to-roll-out-rollable-tv-by-year-end-5823635/
Time taken: 10.145 seconds, Fetched: 3 row(s)
Also you can calculate sub subquery separately and pass it's result as a variable instead of cross join in my example. Hope you got the idea.

REGEX help needed in Oracle

How to get all the table names from the below Sql? My sql returns only the last table name.
with t as
(select 'select col1,
(select max(col3) from dd3) max_timestamp
from dd1,
dd2
where dd1.col1 = dd2.col1
and dd1.col1 in(select col1 from dd4)' sql_text from dual)
select regexp_substr(regexp_substr(upper(sql_text), '\sFROM\s*(\w|\.|_)*'), '(\w|_|\.)+', 1,2)
from t
Thanks,
DD.
This is a more of a regex question than an Oracle question.
If you can run the sql through REPLACE(REPLACE(sql,CHR(13),' '),CHR(10),NULL) to replace all newlines with a space, so that the query fits on a single line, here is regex that will return all the tables in group 1 (for the ones after FROM) and group 3 for subsequent items in a list:
/FROM ([A-Z0-9$#_]+)(,[\s]*([A-Z0-9$#_]+))*/gi
Having multiple groups is not ideal, so I would look at the full match instead, see https://regex101.com/r/OZUalH/1/ for an example (see full match on the right, where every match has from followed by one or more tables).
But let me warn you this is not going to be robust, as these valid FROM clause expressions are not handled:
"my_table"
MY_TABLE AS A
MY_TABLE AS "a"
etc...
If it were me, I would write a function to run the query through explain plan (execute immediate 'explain plan for ...') and extract the tables from the plan tables (or possibly using SYS.DBMS_XPLAN)

PL/SQL regexp_like filters

I want to delete some tables and wrote this procedure:
set serveroutput on
declare
type namearray is table of varchar2(50);
total integer;
name namearray;
begin
--select statement here ..., please see below
total :=name.count;
dbms_output_line(total);
for i in 1 .. total loop
dbms_output.put_line(name(i));
-- execute immediate 'drop table ' || name(i) || ' purge';
End loop;
end;
/
The idea is to drop all tables with table name having pattern like this:
ERROR_REPORT[2 digit][3 Capital characters][10 digits]
example: ERROR_REPORT16MAY2014122748
However, I am not able to come up with the correct regexp. Below are my select statements and results:
select table_name bulk collect into name from user_tables where regexp_like(table_name, '^ERROR_REPORT[0-9{2}A-Z{3}0-9{10}]');
The results included all the table names I needed plus ERROR_REPORT311AUG20111111111. This should not be showing up in the result.
The follow select statement showed the same result, which meant the A-Z{3} had no effect on the regexp.
select table_name bulk collect into name from user_tables where regexp_like(table_name, '^ERROR_REPORT[0-9{2}0-9{10}]');
My question is what would be the correct regexp, and what's wrong with mine?
Thanks,
Alex
Correct regex is
'^ERROR_REPORT[0-9]{2}[A-Z]{3}[0-9]{10}'
I think this regex should work:
^ERROR_REPORT[0-9]{2}[A-Z]{3}[0-9]{10}
However, please check the regex101 link. I've assumed that you need 2 digits after ERROR_REPORT but your example name shows 3.

Regex QueryString Parsing for a specific in BigQuery

So last week I was able to begin to stream my Appengine logs into BigQuery and am now attempting to pull some data out of the log entries into a table.
The data in protoPayload.resource is the page requested with the querystring paramters included.
The contents of protoPayload.resource looks like the following examples:
/service.html?device_ID=123456
/service.html?v=2&device_ID=78ec9b4a56
I am getting close, but when there is another entry before device_ID, I am not getting it. As you can see I am not great with Regex, but it is the only way I think I can parse the data in the query. To get just the device ID from the first example, I was able to use the following example. Works great. My next challenge is to the data when the second parameter exists. The device IDs can vary in length from about 10 to 26 characters.
SELECT
RIGHT(Regexp_extract(protoPayload.resource,r'[\?&]([^&]+)'),
length(Regexp_extract(protoPayload.resource,r'[\?&]([^&]+)'))-10) as Device_ID
FROM logs
What I would like is just the values from the querystring device_ID such as:
123456
78ec9b4a56
Assuming you have just 1 query string per record then you can do this:
SELECT REGEXP_EXTRACT(protoPayload.resource, r'device_ID=(.*)$') as device_id FROM mytable
The part within the parentheses will be captured and returned in the result.
If device_ID isn't guaranteed to be the last parameter in the string, then use something like this:
SELECT REGEXP_EXTRACT(protoPayload.resource, r'device_ID=([^\&]*)') as device_id FROM mytable
One approach is to split protoPayload.resource into multiple service entries, and then apply regexp - this way it will support arbitrary number of device_id, i.e.
select regexp_extract(service_entry, r'device_ID=(.*$)') from
(select split(protoPayload.resource, ' ') service_entry from
(select
'/service.html?device_ID=123456 /service.html?v=2&device_ID=78ec9b4a56'
as protoPayload.resource))

Extracting id number from a varchar in PostgreSQL

I have a field that contains mixed data with an id number that I want extract to another column. The column I wish to extract from has some records that match the format 'lastname, firstname-ID'. I only want to strip the 'ID' part, and from those columns who have a '-' and numbers following it.
So what I was trying to do was...
update data.xml_customerqueryrs
set new_id = regexp_replace(name, '[a-z]A-Z]', '')
where name like '%-%';
I know there is something minor that I need to fix, but I am not sure as the postgresql documentation for pattern matching doesn't really do a good job covering searching for only numerics.
If you actually only want to strip the 'ID' part:
new_id = regexp_replace(name, '-.*?$', '')
Or, if you, in fact, want to extract the ID part:
new_id = substring(name, '-(.*?)$')
I use the *? quantifier, so that only the last part is extracted, where a name has a - in it. Like:
Skeet-Gravell,John-1234
String functions in current manual
Or you can also do:
new_id = substr(name, strpos(name, '-'), length(name))
Ref: Strings