Oracle's REGEXP_LIKE not working correctly? - regex

This is probably something simple I'm missing, but anyway:
I'm trying to match strings in a column that end with '01'. My current expression is '[A-Z0-9\-]+01$', which matches the kinds of strings I want when I check it on regex101. The strings it should match look like this:
1-WA01-0009-01
which works in the linked site.
This is the SQL I'm using to test this:
SELECT *
FROM V_Translog_MDATE_V1
WHERE
REGEXP_LIKE (ITEMNO, '[a-zA-Z0-9\-]+01$')
ORDER BY ARINVT_ID
Why isn't my regex working in the SQL? That is, why does this return nothing when I know there are strings that match the pattern?
Closed but not working
Nothing seemed out of place with what I was doing, so I'm hoping it has something to do with the column's values themselves. Thanks for the help from those who tried.

I'm trying to match strings in a column that end with '01'.
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE V_Translog_MDATE_V1 ( ARINVT_ID, itemno ) AS
SELECT 1, '1-WA01-0009-01' FROM DUAL UNION ALL
SELECT 2, '1-WA01-0009-02' FROM DUAL UNION ALL
SELECT 3, '%&*^$%"£*&%-01' FROM DUAL UNION ALL
SELECT 4, '%&*^$%"£*&%-02' FROM DUAL;
Query 1:
You could just use LIKE:
SELECT *
FROM V_Translog_MDATE_V1
WHERE ITEMNO LIKE '%01'
ORDER BY ARINVT_ID
Results:
| ARINVT_ID | ITEMNO |
|-----------|----------------|
| 1 | 1-WA01-0009-01 |
| 3 | %&*^$%"£*&%-01 |
Query 2:
If you want to use regular expressions then you do not need to check the preceding characters:
SELECT *
FROM V_Translog_MDATE_V1
WHERE REGEXP_LIKE( ITEMNO, '01$' )
ORDER BY ARINVT_ID
Results:
| ARINVT_ID | ITEMNO |
|-----------|----------------|
| 1 | 1-WA01-0009-01 |
| 3 | %&*^$%"£*&%-01 |
Query 3:
If you want the entire string to be matched by your acceptable preceding characters regular expression then you want to prefix it with ^ (start-of-string):
SELECT *
FROM V_Translog_MDATE_V1
WHERE REGEXP_LIKE( ITEMNO, '^[A-Z0-9\-]+01$' )
ORDER BY ARINVT_ID
Results:
| ARINVT_ID | ITEMNO |
|-----------|----------------|
| 1 | 1-WA01-0009-01 |
(otherwise [A-Z0-9\-]+01$ would match £$£^*&$^"%-01 since there is at least one matched character preceding the final 01 characters.)
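To see the effect of the ^ anchor outside the database, here is a minimal sketch using Python's re module. It is only an analogy: Python's regex engine is not Oracle's, so treat it as an illustration of anchoring rather than a reproduction of REGEXP_LIKE. The sample values are the four ITEMNO strings from the schema setup above.

```python
import re

# Sample ITEMNO values copied from the schema setup above.
items = ['1-WA01-0009-01', '1-WA01-0009-02', '%&*^$%"£*&%-01', '%&*^$%"£*&%-02']

# Without ^ the class only has to match somewhere before the final 01.
unanchored = re.compile(r'[A-Z0-9\-]+01$')
# With ^ the whole string must consist of allowed characters.
anchored = re.compile(r'^[A-Z0-9\-]+01$')

unanchored_hits = [s for s in items if unanchored.search(s)]
anchored_hits = [s for s in items if anchored.search(s)]

print(unanchored_hits)
print(anchored_hits)
```

The unanchored pattern accepts the punctuation-heavy value ending in -01 because the single '-' before '01' already satisfies the character class; the anchored one rejects it.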

Related

How to extract words from a string that end with substrings listed in an array? BigQuery

I have a table of rows with cells containing multiple strings. Like this:
K1111=V1111;K1=V1;kv13_key4=--xxxxxsomething;id5=true;impid=23123123;location=domain_co_uk
I need to extract the substring that begins with kv13_key4= and runs up to the next semicolon. The lengths all vary, and the substrings are all separated by a semicolon (;). I tried
REGEXP_EXTRACT(customtargeting,'%in2w_key4%;') As contains_key_Value
but it didn't work. I need something like this:
| Original Cell | Extracted |
| key88=1811111;id89=9990string;K1=V1;23234234234tttttttt13_key4=--x;id5=true;impid=23123;url=domain_co_uk | kv13_key4=--x |
| K1111=V1111;K1=V1;kv13_key4=--xsomething;id5=true;impid=23123123;location=domain_co_uk | kv13_key4=--xsomething |
| ;id5=true;T6791=V1111;K1=V1;kv13_key4=--xxxxxsomething123;impid=23123 | kv13_key4=--xxxxxsomething123 |
Consider below
select *, regexp_extract(customtargeting, r'kv13_key4=[^;]+') as Extracted
from your_table
Applied to the sample data in your question, this extracts the kv13_key4=... pair up to (but not including) the next semicolon.
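Does this regex work?
(?<=kv13_key4=)[^;]+(?=;)
It matches everything between 'kv13_key4=' and the nearest ';' (the key itself is not included, and a trailing ';' is required).
Your REGEXP_EXTRACT would look like:
REGEXP_EXTRACT(customtargeting,r'(?<=kv13_key4=)[^;]+(?=;)')
Note that BigQuery's regular expressions are based on RE2, which does not support lookbehind or lookahead, so this form may be rejected there.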
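For anyone who wants to check the pattern kv13_key4=[^;]+ before running it in BigQuery, here is a small Python sketch. The helper name extract_pair and the third row (which lacks the key) are made up for illustration; the first two rows come from the question.

```python
import re

rows = [
    "K1111=V1111;K1=V1;kv13_key4=--xsomething;id5=true;impid=23123123;location=domain_co_uk",
    ";id5=true;T6791=V1111;K1=V1;kv13_key4=--xxxxxsomething123;impid=23123",
    "K1=V1;id5=true",  # hypothetical row without the key
]

def extract_pair(cell, key="kv13_key4"):
    """Return 'key=value' up to the next semicolon, or None if the key is absent."""
    match = re.search(re.escape(key) + r"=[^;]+", cell)
    return match.group(0) if match else None

extracted = [extract_pair(row) for row in rows]
for value in extracted:
    print(value)
```

Because [^;]+ stops at the first semicolon or at the end of the string, no lookahead is needed and the value may also be the last item in the cell.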

How to extract domain name using dynamic regex in Redshift?

I need to extract the domain name from a URL using Redshift (PostgreSQL). Example: extract 'google.com' from 'www.google.com'. Each URL in my dataset has a different top-level domain (TLD). My approach was to first join the matching TLD to the dataset and then use a regex to extract 'first_string.TLD'. In Redshift, I'm getting the error 'The pattern must be a valid UTF-8 literal character expression'. Is there a way around this?
A sample of my dataset:
+---+------------------------+--------------+
| id| trimmed_domain | tld |
+---+------------------------+--------------+
| 1 | sample.co.uk | co.uk |
| 2 | www.sample.co.uk | co.uk |
| 3 | www3.sample.co.uk | co.uk |
| 4 | biz.sample.co.uk | co.uk |
| 5 | digital.testing.sam.co | co |
| 6 | sam.co | co |
| 7 | www.google.com | com |
| 8 | 1.11.220 | |
+---+------------------------+--------------+
My code:
SELECT t1.extracted_domain, COUNT(DISTINCT(t1.id))
FROM (
    SELECT
        d.id,
        d.trimmed_domain,
        CASE
            WHEN d.tld IS NULL THEN d.trimmed_domain
            ELSE regexp_replace(d.trimmed_domain,'(.*\.)((.[a-z]*).*'||replace(tld,'.','\.')||')','\2')
        END AS "extracted_domain"
    FROM dataset d
) t1
GROUP BY 1
ORDER BY 2;
Expected output:
+------------------------+--------------+
| extracted_domain | count |
+------------------------+--------------+
| sample.co.uk | 4 |
| sam.co | 2 |
| google.com | 1 |
| 1.11.220 | 1 |
+------------------------+--------------+
I'm not so sure about the query. However, you can use this tool to design any expression you wish and adapt your query.
My guess is that maybe this would help:
^(?!d|b|www3).*
You can list any domain that you wish to exclude in the negative lookahead, separated by OR (|), as in (?!d|b|www3).
RegEx Circuit
You can visualize your expressions in this link:
You may want to add your desired URLs to an expression similar to:
^(sam|www.google|1.11|www.sample|www3.sample).*
So, I've found a solution. Redshift does not support column-based regex patterns, so the alternative is to use a Python UDF:
1. Change the tld column to a regex pattern.
2. Go row by row and extract the domain name using the regex pattern column.
3. Group by the extracted_domain and count the users.
The SQL query is as below:
CREATE OR REPLACE FUNCTION extractor(col_domain varchar)
RETURNS varchar
IMMUTABLE AS $$
    import re
    # Build a pattern matching "<label>.<tld>".
    # re.escape keeps the dots in the TLD literal (e.g. "co.uk").
    if col_domain is None:
        return ''
    return r'([^/.]+\.({}))'.format(re.escape(col_domain))
$$ LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION regex_match(in_pattern varchar, input_str varchar)
RETURNS varchar
IMMUTABLE AS $$
    import re
    # An empty pattern means there is no TLD: return the input unchanged.
    if in_pattern == '':
        return str(input_str)
    match = re.search(in_pattern, input_str)
    # Fall back to the input when nothing matches, instead of
    # calling .group() on None and raising an error.
    return str(match.group()) if match else str(input_str)
$$ LANGUAGE plpythonu;
SELECT
    t2.extracted_domain,
    COUNT(DISTINCT(t2.id)) AS "Unique Users"
FROM (
    SELECT
        t1.id,
        t1.trimmed_domain,
        regex_match(t1.regex_pattern, t1.trimmed_domain) AS "extracted_domain"
    FROM (
        SELECT
            id,
            trimmed_domain,
            CASE WHEN tld IS NULL THEN '' ELSE extractor(tld) END AS "regex_pattern"
        FROM dataset
    ) t1
) t2
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
Python UDF seems to be slow on a large dataset. So, I'm open to suggestions on improving the query.
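The per-row logic can be tried outside Redshift with plain Python. This is only a sketch of the idea, not the UDFs themselves: the function name extract_domain is made up, the rows are copied from the sample dataset (None stands for the missing TLD), and the pattern is additionally anchored with $ so that the TLD must sit at the end of the domain, which is an assumption on my part.

```python
import re
from collections import Counter

# (trimmed_domain, tld) pairs from the sample dataset; None = no TLD found.
rows = [
    ("sample.co.uk", "co.uk"),
    ("www.sample.co.uk", "co.uk"),
    ("www3.sample.co.uk", "co.uk"),
    ("biz.sample.co.uk", "co.uk"),
    ("digital.testing.sam.co", "co"),
    ("sam.co", "co"),
    ("www.google.com", "com"),
    ("1.11.220", None),
]

def extract_domain(trimmed_domain, tld):
    """Return '<label>.<tld>', or the input unchanged when there is no TLD or no match."""
    if not tld:
        return trimmed_domain
    # One dot-free label, a dot, then the escaped TLD at the end of the string.
    pattern = r"([^/.]+\." + re.escape(tld) + r")$"
    match = re.search(pattern, trimmed_domain)
    return match.group(1) if match else trimmed_domain

counts = Counter(extract_domain(domain, tld) for domain, tld in rows)
for domain, count in counts.most_common():
    print(domain, count)
```

On the sample rows this reproduces the counts in the expected output from the question.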
If you know the prefixes you would like to remove from the domains, then why not just exclude these? The following query simply removes the known www/http/etc. prefixes from the domain names and counts the normalized domain names.
SELECT normalized_domain, COUNT(*)
FROM (SELECT REGEXP_REPLACE(domain, '^(https|http|www|biz)') AS normalized_domain FROM domains) AS t
GROUP BY normalized_domain;

Regular Expression to parse Query Output

I have a need to execute a metadata query which will dump a list of tables into a file. However, I need a way to eliminate all formatting besides the tableId itself. Can this be done through a regex? Appreciate all help in advance.
+-------------------------------------+-------+
| tableId | Type |
+-------------------------------------+-------+
| t_margins | TABLE |
| t_rev_test | TABLE |
| t_rev_share | TABLE |
You have some options, but I would suggest something like this:
^\| (\S+)
It matches, from the start of the line, a pipe, a space, and then a run of non-space characters, which is captured. The captured non-spaces will be your tableId. Here is a little example in Python:
import re
my_string = '''| t_margins | TABLE |
| t_rev_test | TABLE |
| t_rev_share | TABLE |'''
for line in my_string.split('\n'):
    # Raw string so the backslashes reach the regex engine unchanged
    match = re.search(r"^\| (\S+)", line)
    if match:
        print(match.group(1))
This will give you:
t_margins
t_rev_test
t_rev_share
The following regexp captures just the column values of the first column:
^\| (\w+)
https://regex101.com/r/gODhra/3
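One thing to watch for: if the dump still contains the header row and the +---+ border lines, the header cell (tableId) also matches these patterns. A possible sketch for that case follows. The raw_output text is a made-up sample shaped like the table in the question, \s* is used instead of a single space to tolerate padding, and dropping the first captured cell assumes the header is always the first row.

```python
import re

# Hypothetical raw output, shaped like the table in the question.
raw_output = """\
+-------------+-------+
|   tableId   | Type  |
+-------------+-------+
| t_margins   | TABLE |
| t_rev_test  | TABLE |
| t_rev_share | TABLE |
+-------------+-------+
"""

# Border lines start with '+', so only the rows starting with '|' match.
cells = re.findall(r"^\|\s*(\S+)", raw_output, flags=re.MULTILINE)

# The first captured cell is the header, so skip it.
table_ids = cells[1:]
print("\n".join(table_ids))
```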

Regex to exclude values between braces and apply pattern on the rest

I had the following values in a field
Aa11
BBB-
BBB+
A- /*-
A3
Ca
I was using the regex
(([A-Z](([abc]+\d?)|\d))|([A-Z]+[+-]?))
which worked fine. However, I now have a new set of data:
(p)A3
(q)A- /*-
How do I make sure I ignore the brackets and the values between them, and then apply my regex above to the rest?
I am doing this using REGEXP_SUBSTR in Oracle.
The regular expression \(.*?\) will match opening and closing brackets with as few characters as possible between them and [^(]*? will match zero-or-as-few-as-possible non-open bracket characters. You can combine these to give the regular expression ^([^(]*?\(.*?\))*?[^(]*? which will match as few as possible bracket groups (provided you do not have nested brackets) until your required pattern is found.
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE data ( value ) AS
SELECT 'Aa11' FROM DUAL UNION ALL
SELECT 'BBB-' FROM DUAL UNION ALL
SELECT 'BBB+' FROM DUAL UNION ALL
SELECT 'A- /*-' FROM DUAL UNION ALL
SELECT 'A3' FROM DUAL UNION ALL
SELECT 'Ca' FROM DUAL UNION ALL
SELECT '(p)A3' FROM DUAL UNION ALL
SELECT '(q)A- /*-' FROM DUAL UNION ALL
SELECT '(Ca)Cb(Cc)' FROM DUAL UNION ALL
SELECT '--(Ca)--(Cb)--Cc(--Ca)' FROM DUAL;
Query 1:
SELECT value,
REGEXP_SUBSTR(
value,
'^([^(]*?\(.*?\))*?[^(]*?([A-Z]([abc]+\d?|\d|[A-Z]*[+-]?))',
1, -- Start at 1st character
1, -- Find the 1st occurrence
NULL, -- No flags
2 -- Return 2nd capturing group
) AS regex_output
FROM data
Results:
| VALUE | REGEX_OUTPUT |
|------------------------|--------------|
| Aa11 | Aa1 |
| BBB- | BBB- |
| BBB+ | BBB+ |
| A- /*- | A- |
| A3 | A3 |
| Ca | Ca |
| (p)A3 | A3 |
| (q)A- /*- | A- |
| (Ca)Cb(Cc) | Cb |
| --(Ca)--(Cb)--Cc(--Ca) | Cc |
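The same pattern can be checked in Python, which also supports lazy quantifiers. This is a sketch for experimenting only: Python's re engine is not Oracle's, the function name first_code is made up, and group(2) plays the role of the last REGEXP_SUBSTR argument (return the 2nd capturing group).

```python
import re

# Same pattern as in the Oracle query above.
PATTERN = re.compile(
    r'^([^(]*?\(.*?\))*?[^(]*?([A-Z]([abc]+\d?|\d|[A-Z]*[+-]?))'
)

def first_code(value):
    """Return the first code found outside bracketed groups, or None."""
    match = PATTERN.search(value)
    return match.group(2) if match else None

values = [
    'Aa11', 'BBB-', 'BBB+', 'A- /*-', 'A3', 'Ca',
    '(p)A3', '(q)A- /*-', '(Ca)Cb(Cc)', '--(Ca)--(Cb)--Cc(--Ca)',
]
results = [first_code(v) for v in values]
for value, result in zip(values, results):
    print(value, '->', result)
```

For these sample values the Python results agree with the REGEX_OUTPUT column above.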

Regex Match and Replace path to file

I am trying to do a regex match and replace for hotfile.com links to mp3 files I have in my database (WordPress).
I used to use hotfile for streaming mp3 files on my site, but I have now switched to a CDN. Could someone kindly help me out with this:
Replace: http//hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3
With: http//p.music.cdndomain.com/vod/music.folder/2010/mp3_file_name.mp3
I have been trying a simple search and replace like this: http//hotfile.com/dl/%/%/, but it's not working.
It would have been easier to perform a search and replace if hotfile.com didn't have different folders for all files. Below are two examples of the problem:
http//hotfile.com/dl/155490069/c7932d4/
http//hotfile.com/dl/165490070/c8745e7/
I have over 500 files to replace.
Thanks
Since you must be using MySQL for your WordPress database, you can do this replacement by regex, as you asked:
Regex pattern : #http://(www.)?hotfile.com/\w+/\w+/\w+/#
Replacement pattern: http//p.music.cdndomain.com/vod/music.folder/2010/
A simpler alternative is to extract the mp3 file name using MySQL's string functions:
Use SUBSTRING or SUBSTRING_INDEX to extract the file name of your mp3 file, i.e. the string after the last occurrence of "/" in your hotfile URL.
Use CONCAT to append the retrieved file name to the new URL prefix and update it in the database.
Here is an example; you can adapt it to your database:
mysql> select * from test_songs;
+---------------------------------------------------------------+
| song_url |
+---------------------------------------------------------------+
| http://hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3 |
| http://www.hotfile.com/dl/123412312/dd732d4/mp3_song_name.mp3 |
+---------------------------------------------------------------+
Taking substrings:
mysql> select SUBSTRING_INDEX(song_url,"/",-1) from test_songs;
+----------------------------------+
| SUBSTRING_INDEX(song_url,"/",-1) |
+----------------------------------+
| mp3_file_name.mp3 |
| mp3_song_name.mp3 |
+----------------------------------+
2 rows in set (0.03 sec)
Creating final update query:
mysql> Update test_songs set song_url =
CONCAT("http//p.music.cdndomain.com/vod/music.folder/2010/",
SUBSTRING_INDEX(song_url,"/",-1)) ;
Query OK, 2 rows affected (0.00 sec)
Rows matched: 2 Changed: 2 Warnings: 0
Checking the results :
mysql> select * from test_songs;
+---------------------------------------------------------------------+
| song_url |
+---------------------------------------------------------------------+
| http//p.music.cdndomain.com/vod/music.folder/2010/mp3_file_name.mp3 |
| http//p.music.cdndomain.com/vod/music.folder/2010/mp3_song_name.mp3 |
+---------------------------------------------------------------------+
2 rows in set (0.00 sec)
Done !
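If you would rather preview the rewritten URLs before touching the database, the same idea (keep what follows the last "/", prepend the new prefix) can be sketched in Python. The function names are made up for illustration, the regex variant is my own assumption of an equivalent pattern, and the prefix is copied as written in the question.

```python
import re

# New URL prefix, copied as written in the question.
NEW_PREFIX = "http//p.music.cdndomain.com/vod/music.folder/2010/"

def by_last_segment(url):
    """Mirror SUBSTRING_INDEX(url, '/', -1) + CONCAT: keep the text after the last '/'."""
    return NEW_PREFIX + url.rsplit("/", 1)[-1]

def by_regex(url):
    """Regex variant: greedily replace everything up to the last '/' with the prefix."""
    return re.sub(r"^.+/", NEW_PREFIX, url)

urls = [
    "http://hotfile.com/dl/157490069/c8732d4/mp3_file_name.mp3",
    "http://www.hotfile.com/dl/123412312/dd732d4/mp3_song_name.mp3",
]
rewritten = [by_last_segment(u) for u in urls]
for new_url in rewritten:
    print(new_url)
```

Both helpers return the same result for these URLs, since the file name never contains a slash.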
Something as simple as http://regex101.com/r/lK9wH4 should work:
/^.+\/(.+)$/ and replace with <your_new_url>\1.
Good luck.
You can use Notepad++ to search and replace across all your files.
For this particular sample, search and replace in regex mode:
search "http//hotfile.com/(.*)/(.*.mp3) "
replace "http//p.music.cdndomain.com/vod/music.folder/2010/\2 "
Remove the quote marks but keep the space at the end.