Use REGEXP_SUBSTR like a Split function - regex

I need to extract a text value from data in a VARCHAR2 column. Sample:
EDKES^Visit: ^PRIMARY INSURANCE COMMENTS: ^SECONDARY INSURANCE COMMENTS: ^TERTIARY INSURANCE COMMENTS: ^NO PRIMARY INSURANCE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NO SECONDARY INSURANCE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NO TERTIARY INS*
I need to get the text that proceeds the 6th occurrence of the '^' (excluding the '^'). In this example, the text would be NO PRIMARY INSURANCE.
([\w\s\:\*]+(\^?)) mostly works, but doesn't exclude the '^'.
When I try to use this expression REGEXP_SUBSTR(VARCHAR_COL, '([\w\s\:\*]+(\^?))', 1, 6), I get a single character ('s'), rather than the expected match NO PRIMARY INSURANCE^.
What am I missing?

This should work pretty well:
REPLACE(REGEXP_SUBSTR(VARCHAR_COL, '[^^]+\^?', 1, 6), '^', '')

You might be able to account for blank columns as well. And if the engine only returns
the capture groups, it will trim the delimiter.
([^^]*).?
This of course means that the last column found is always invalid.

Related

How can I use regular expressions to select text between commas?

I am using BigQuery on Google Cloud Platform to extract data from GDELT. This uses an SQL syntax and regular expressions.
I have a column of data (called V2Tone), in which each cell looks like this:
1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299
To select only the first number (i.e., the number before the first comma) using regular expressions, we use this:
regexp_replace(V2Tone, r',.*', '')
How can we select only the second number (i.e., the number between the first and second commas)?
How about the third number (i.e., the number between the second and third commas)?
I understand that re2 syntax (https://github.com/google/re2/wiki/Syntax) is used here, but my understanding of how to put that all together is limited.
If anything is unclear, please let me know. Thank you for your help as I learn to use regular expressions.
Below example is for BigQuery Standard SQL using super simple SPLIT approach
#standardSQL
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number
FROM `project.dataset.table`
If for some reason you need/want to use regexp here - use below
#standardSQL
SELECT
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number
FROM `project.dataset.table`
Note use of REGEXP_EXTRACT instead of REGEXP_REPLACE
You can play, test above options with dummy string from your question as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT '1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299' V2Tone
)
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number,
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number_re
FROM `project.dataset.table`
with output :
first_number second_number third_number first_number_re second_number_re third_number_re fifth_number_re
1.55763239875389 2.80373831775701 1.24610591900312 1.55763239875389 2.80373831775701 1.24610591900312 26.4797507788162
I don't know of a single regex replace which could be used to isolate a single number in your CSV string, because we need to remove things on both sides of the match, in general. But, we can chain together two calls to regex_replace. For example, if you wanted to target the third number in the CSV string, we could try this:
regexp_replace(regexp_replace(V2Tone, r'^(?:(?:\d+(?:\.\d+)?),){2}', ''),
r',.*', ''))
The pattern I am using to strip of the first n numbers is this:
^(?:(?:\d+(?:\.\d+)?),){n}
This just removes a number, followed by a comma, n times, from the beginning of the string.
Demo
Here is a solution with a single regex replace:
^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$
Demo
\n is added to the negated character class in the demo to avoid matching accross lines in m|multiline mode.
Usage:
regexp_replace(V2Tone, r'^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$', '$1')
Explanation:
([^,]+(?:,|$){n} captures everything to the next comma or the end of the string n times
([^,]+(?:,|$))* captures the rest 0 or more times
^.*$ capture everything if we cannot match n times
And then, finally, we can reinsert the nth match using $1.

string replace method to be replaced by regular expression

I am using string replace method to clean-up column names.
df.columns=df.columns.str.replace("#$%./- ","").str.replace(' ', '_').str.replace('.', '_').str.replace('(','').str.replace(')','').str.replace('.','').str.lower()
Though it works, certainly does not look pythonic. Any suggestion?
I need only A-Za-z and underscore _ if required as column names.
Update:
I tried using Regular expression in the first replace method, but I still need to chain the string like this...
terms.columns=terms.columns.str.replace(r"^[^a-zA-Z1-9]*", '').str.replace(' ', '_').str.replace('(','').str.replace(')','').str.replace('.', '').str.replace(',', '')
Update showing test data:
Original string (Tab separated):
[Sr.No. Course Terms Besic of Education Degree Course Course Approving Authority (i.e Medical Council, etc.) Full form of Course 1 year Duration 2nd year 3rd year Duration 4 th year Duration]
Change column names:
terms.columns=terms.columns.str.replace(r"^[^a-zA-Z1-9]*", '').str.replace(' ', '_').str.replace('(','').str.replace(')','').str.replace('.', '').str.replace(',', '').str.lower()
Output:
['srno', 'course', 'terms', 'besic_of_education', 'degree_course',
'course_approving_authority_ie_medical_council_etc',
'full_form_of_course', '1_year_duration', '2nd_year_',
'3rd_year_duration', '4_th_year_duration']
Above output is correct. The question: Is there any way to achive the same other than the way I have used?
You can use a smaller number of .replace operations by replacing non-word strings with an empty string and subsequently removing the whitespace characters with an underscore.
df.columns.str.replace("[^\w\s]+","").str.replace("\s+","_")‌​.str.lower()
I hope this helps.

Postgresql - How do I extract the first occurence of a substring in a string using a regular expression pattern?

I am trying to extract a substring from a text column using a regular expression, but in some cases, there are multiple instances of that substring in the string.
In those cases, I am finding that the query does not return the first occurrence of the substring. Does anyone know what I am doing wrong?
For example:
If I have this data:
create table data1
(full_text text, name text);
insert into data1 (full_text)
values ('I 56, donkey, moon, I 92')
I am using
UPDATE data1
SET name = substring(full_text from '%#"I ([0-9]{1,3})#"%' for '#')
and I want to get 'I 56' not 'I 92'
You can use regexp_matches() instead:
update data1
set full_text = (regexp_matches(full_text, 'I [0-9]{1,3}'))[1];
As no additional flag is passed, regexp_matches() only returns the first match - but it returns an array so you need to pick the first (and only) element from the result (that's the [1] part)
It is probably a good idea to limit the update to only rows that would match the regex in the first place:
update data1
set full_text = (regexp_matches(full_text, 'I [0-9]{1,3}'))[1]
where full_text ~ 'I [0-9]{1,3}'
Try the following expression. It will return the first occurrence:
SUBSTRING(full_text, 'I [0-9]{1,3}')
You can use regexp_match() In PostgreSQL 10+
select regexp_match('I 56, donkey, moon, I 92', 'I [0-9]{1,3}');
Quote from documentation:
In most cases regexp_matches() should be used with the g flag, since
if you only want the first match, it's easier and more efficient to
use regexp_match(). However, regexp_match() only exists in PostgreSQL
version 10 and up. When working in older versions, a common trick is
to place a regexp_matches() call in a sub-select...

Oracle: Extract number from String

I've reviewed this question and I'm wondering my output seems to be a little skewed.
From my understanding the REGEXP_REPLACE method, takes a string that you want to replace content with, followed by a pattern to match, then anything that does not match that pattern is replaced with the substitution param.
I've written the following function to extract distance from a text field, in which a spatial query will be performed on the result.
CREATE OR REPLACE FUNCTION extract_distance
(
p_search_string VARCHAR2
)
RETURN VARCHAR2
IS
l_distance VARCHAR2(25);
BEGIN
SELECT REGEXP_REPLACE(UPPER(p_search_string), '(([0-9]{0,4}) ?MILES)', '')
INTO l_distance FROM SYS.DUAL;
RETURN l_distance;
END extract_distance;
When I run this in a block to test:
DECLARE
l_output VARCHAR2(25);
BEGIN
l_output := extract_distance('Stores selling COD4 in 400 Miles');
DBMS_OUTPUT.PUT_LINE(l_output);
END;
I'd expect the output 400 miles but in-fact I get Stores selling COD4 in. Where have I gone wrong?
"REGEXP_REPLACE extends the functionality of the REPLACE function by letting you search a string for a regular expression pattern. By default, the function returns source_char with every occurrence of the regular expression pattern replaced with replace_string." from Oracle docu
You could use, e.g.,
SELECT REGEXP_REPLACE('Stores selling COD4 in 400 Miles', '^.*?(\d+ ?MILES).*$', '\1', 1, 0, 'i') FROM DUAL;
or alternatively
SELECT REGEXP_SUBSTR('Stores selling COD4 in 400 Miles', '(\d+ ?MILES)', 1, 1, 'i') FROM DUAL;
You'll want to use, regexp_substr which returns a substring that matches the regular expression.
REGEX_SUBSTR

variable number of capturing groups

I have a xpath expression which I want to use to extract City and date from a td which contains a string of this kind:
City(may contain spaces and may be missing, but the following space is always present) on 2013/07/20
So far, I got to the following solution for extracting the date, which works partially:
//path/to/my/td/text()/replace(.,'(.*) on (.*)','$3')
This works when City is present, but when City is missing I get "on 2013/07/20" as a result.
I think this is because the first capturing group fails and so the number of groups is different.
How can I get this expression to work?
I did not fully check your regex, but it looks fine at first sight. Anyway, you can also go an easier way if you only want to get the date by extracting the text after "on ":
//path/to/my/td/text()/substring-after(.,'on ')
edit: or you may go the substring-way and select the last 10 characters of the content:
//path/to/my/td/text()/substring(., string-length(.) - 9)