Regex non-capturing parenthesis issue - regex

I have a database query which looks like this
select * from students join (select * from teachers) join (select * from workers
I had a requirement to tokenize this string based on 'select'.
I am trying regex (select)(.*?)((?:select)|$), ut it is matching only 2 times.
Request some pointers on how to achieve this.
I need the 3 output tokens as below
select * from students join (
select * from teachers) join (
select * from workers

I think this regex will work:
select.*?(?=select|$)
The regex matches the word select, then any text (not including new lines) up until right before the next select or the end of the string.
Demonstration here: http://regex101.com/r/sR3gV1

If you are trying to parse the select queries from the string then you can use this regex. Assuming you are not doing select from multiple tables(i.e. not doing select * from x,y,z)
(select.*?from\\s+\\w+)

Related

Oracle regex and replace

I have varchar field in the database that contains text. I need to replace every occurrence of a any 2 letter + 8 digits string to a link, such as VA12345678 will return /cs/page.asp?id=VA12345678
I have a regex that replaces the string but how can I replace it with a string where part of it is the string itself?
SELECT REGEXP_REPLACE ('test PI20099742', '[A-Z]{2}[0-9]{8}$', 'link to replace with')
FROM dual;
I can have more than one of these strings in one varchar field and ideally I would like to have them replaced in one statement instead of a loop.
As mathguy had said, you can use backreferences for your use case. Try a query like this one.
SELECT REGEXP_REPLACE ('test PI20099742', '([A-Z]{2}[0-9]{8})', '/cs/page.asp?id=\1')
FROM DUAL;
For such cases, you may want to keep the "text to add" somewhere at the top of the query, so that if you ever need to change it, you don't have to hunt for it.
You can do that with a with clause, as shown below. I also put some input data for testing in the with clause, but you should remove that and reference your actual table in your query.
I used the [:alpha:] character class, to match all letters - upper or lower case, accented or not, etc. [A-Z] will work until it doesn't.
with
text_to_add (link) as (
select '/cs/page.asp?id=' from dual
)
, sample_strings (str) as (
select 'test VA12398403 and PI83048203 to PT3904' from dual
)
select regexp_replace(str, '([[:alpha:]]{2}\d{8})', link || '\1')
as str_with_links
from sample_strings cross join text_to_add
;
STR_WITH_LINKS
------------------------------------------------------------------------
test /cs/page.asp?id=VA12398403 and /cs/page.asp?id=PI83048203 to PT3904

How to define regexp for text in Postgres

Please help to define Postgres regexp for this case:
I have string field:
union all select 'AbC-345776-2345' /*comment*/ union all select 'Fgr-sdf344-111a' /*BN34*/ some text union all select 'sss-sdf34-123' /*some text*/ some text
Here is the same text in select statement for convinience:
select 'union all select ''AbC-345776-2345'' /*comment*/ union all select ''Fgr-sdf344-111a'' /*BN34*/ some text union all select ''sss-sdf34-123'' /*some text*/ some text' as str
I need to get from this mess text only values in '...' and select it into separated rows like this:
AbC-345776-2345
Fgr-sdf344-111a
sss-sdf34-123
Pattern: 'first 2-3 letters - several letters and numbers - several letters and numbers'
I created this select but it contains all comments and "sometext" as well:
select regexp_split_to_table(trim(replace(replace(replace(replace(t1.str,'union all select',''),'from DUAL',''),chr(10),''),'''','') ), E'\\s+')
from (select 'union all select ''AbC-345776-2345'' /*comment*/ union all select ''Fgr-sdf344-111a'' /*BN34*/ some text union all select ''sss-sdf34-123'' /*some text*/ some text' as str) t1;
The following should do it:
select (regexp_matches(str, $$'([a-zA-Z]{2,3}-[a-zA-Z0-9]+-[a-zA-Z0-9]+)'$$, 'g'))[1]
from the_table;
Given your sample data it returns:
regexp_matches
---------------
AbC-345776-2345
Fgr-sdf344-111a
sss-sdf34-123
The regex checks for the pattern you specified inside single quotes. By using a group (...) I excluded the single quotes from the result.
regexp_matches() returns one row for each match, containing an array of matches. But as the regex only contains a single group, the first element of the array is what we are interested in.
I used dollar quoting to avoid escaping the single quotes in the regex
Online example

How can I use regular expressions to select text between commas?

I am using BigQuery on Google Cloud Platform to extract data from GDELT. This uses an SQL syntax and regular expressions.
I have a column of data (called V2Tone), in which each cell looks like this:
1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299
To select only the first number (i.e., the number before the first comma) using regular expressions, we use this:
regexp_replace(V2Tone, r',.*', '')
How can we select only the second number (i.e., the number between the first and second commas)?
How about the third number (i.e., the number between the second and third commas)?
I understand that re2 syntax (https://github.com/google/re2/wiki/Syntax) is used here, but my understanding of how to put that all together is limited.
If anything is unclear, please let me know. Thank you for your help as I learn to use regular expressions.
Below example is for BigQuery Standard SQL using super simple SPLIT approach
#standardSQL
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number
FROM `project.dataset.table`
If for some reason you need/want to use regexp here - use below
#standardSQL
SELECT
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number
FROM `project.dataset.table`
Note use of REGEXP_EXTRACT instead of REGEXP_REPLACE
You can play, test above options with dummy string from your question as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT '1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299' V2Tone
)
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number,
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number_re
FROM `project.dataset.table`
with output :
first_number second_number third_number first_number_re second_number_re third_number_re fifth_number_re
1.55763239875389 2.80373831775701 1.24610591900312 1.55763239875389 2.80373831775701 1.24610591900312 26.4797507788162
I don't know of a single regex replace which could be used to isolate a single number in your CSV string, because we need to remove things on both sides of the match, in general. But, we can chain together two calls to regex_replace. For example, if you wanted to target the third number in the CSV string, we could try this:
regexp_replace(regexp_replace(V2Tone, r'^(?:(?:\d+(?:\.\d+)?),){2}', ''),
r',.*', ''))
The pattern I am using to strip of the first n numbers is this:
^(?:(?:\d+(?:\.\d+)?),){n}
This just removes a number, followed by a comma, n times, from the beginning of the string.
Demo
Here is a solution with a single regex replace:
^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$
Demo
\n is added to the negated character class in the demo to avoid matching accross lines in m|multiline mode.
Usage:
regexp_replace(V2Tone, r'^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$', '$1')
Explanation:
([^,]+(?:,|$){n} captures everything to the next comma or the end of the string n times
([^,]+(?:,|$))* captures the rest 0 or more times
^.*$ capture everything if we cannot match n times
And then, finally, we can reinsert the nth match using $1.

Hive - regexp_replace function for multiple strings

I am using hive 0.13! I want to find multiple tokens like "hip hop" and "rock music" in my data and replace them with "hiphop" and "rockmusic" - basically replace them without white space. I have used the regexp_replace function in hive. Below is my query and it works great for above 2 examples.
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1
from vp_nlp_protext_males
;
But I have 100 such bigrams/ngrams and want to be able to do replace efficiently where I just remove the whitespace. I can pattern match the phrase - hip hop and rock music but in the replace I want to simply trim the white spaces. Below is what I tried. I also tried using trim with regexp_replace but it wants the third argument in the regexp_replace function.
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1
from vp_nlp_protext_males
;
You can strip all occurrences of a substring from a string using the TRANSLATE function to replace the substring with the empty string. For your query it would become this:
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
translate(ntext, ' ', '') as ntext1
from vp_nlp_protext_males
;

Simple regular expression in Oracle

I have a table with the following values:
ID NAME ADDRESS
1 Bob Super stree1 here goes
2 Alice stree100 here goes
3 Clark Fast left stree1005
4 Magie Right stree1580 here goes
I need to make a query using LIKE and get only the row having stree1 (in this case only get the one with ID=1) and I use the following query:
select * from table t1 WHERE t1.ADDRESS LIKE '%stree1%';
But the problem is that I get all rows as each of them contains stree1 plus some char/number after.
I have found out that I can use REGEXP_LIKE as I am using oracle, what would be the proper regex to use in:
select * from table t1 WHERE regexp_like(t1.ADDRESS ,'stree1');
I would think that this would be the reg-ex you are seeking:
select * from table t1 WHERE regexp_like(t1.ADDRESS ,'stree1(?:[^[:word:]]|$)');
If you want to, you can further simplify this to:
select * from table t1 WHERE regexp_like(t1.ADDRESS ,'stree1(?:\W|$)');
That is, 'stree1' is not followed by a word character (i.e., is followed by space/punctuation/etc...) or 'stree1' appears at the end of the string. Of course there are many other ways to do the same thing, including word boundaries 'stree1\b', expecting particular characters after the 1 in stree1 (e.g., a white-space with 'stree1\s'), etc...
This may help:
stree1\b
The first '\W' is tells it it a non-word character since you need noting after 'stree1' but space
and '$' tells take it as a valid string if it ends with stree1
select *
from table1
where regexp_like(address,'stree1(\W|$)')