PostgreSQL - How do I extract the first occurrence of a substring in a string using a regular expression pattern? - regex

I am trying to extract a substring from a text column using a regular expression, but in some cases, there are multiple instances of that substring in the string.
In those cases, I am finding that the query does not return the first occurrence of the substring. Does anyone know what I am doing wrong?
For example:
If I have this data:
create table data1
(full_text text, name text);
insert into data1 (full_text)
values ('I 56, donkey, moon, I 92')
I am using
UPDATE data1
SET name = substring(full_text from '%#"I ([0-9]{1,3})#"%' for '#')
and I want to get 'I 56' not 'I 92'

You can use regexp_matches() instead:
update data1
set name = (regexp_matches(full_text, 'I [0-9]{1,3}'))[1];
As no additional flag is passed, regexp_matches() only returns the first match - but it returns an array, so you need to pick the first (and only) element from the result (that's the [1] part).
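For instance, a quick check against the sample string from the question (just a sketch run as a standalone select):
select (regexp_matches('I 56, donkey, moon, I 92', 'I [0-9]{1,3}'))[1];
-- returns: I 56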
It is probably a good idea to limit the update to only rows that would match the regex in the first place:
update data1
set name = (regexp_matches(full_text, 'I [0-9]{1,3}'))[1]
where full_text ~ 'I [0-9]{1,3}';

Try the following expression. It will return the first occurrence:
SUBSTRING(full_text, 'I [0-9]{1,3}')
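For example, a quick check with the sample string (a sketch; the two-argument substring(string, pattern) form returns the first match of the POSIX regex):
select substring('I 56, donkey, moon, I 92', 'I [0-9]{1,3}');
-- returns: I 56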

You can use regexp_match() in PostgreSQL 10+:
select regexp_match('I 56, donkey, moon, I 92', 'I [0-9]{1,3}');
Quote from documentation:
In most cases regexp_matches() should be used with the g flag, since
if you only want the first match, it's easier and more efficient to
use regexp_match(). However, regexp_match() only exists in PostgreSQL
version 10 and up. When working in older versions, a common trick is
to place a regexp_matches() call in a sub-select...
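For older versions, a hedged sketch of that sub-select trick against the question's table (the scalar sub-select simply yields NULL when there is no match instead of dropping the row):
select full_text,
       (select (regexp_matches(full_text, 'I [0-9]{1,3}'))[1]) as name
from data1;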

Related

How can I separate a string by underscore (_) in google spreadsheets using regex?

I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
And the output has to be in different columns (one per field), with no "_" and excluding the final number, as follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
help!
Thanks in advance (and sorry for my english)
=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace all the _ with parentheses to create capture groups, while also excluding the digit string at the end, then surround the whole string with parentheses.
We then use regex extract to actually pull the pieces out; the groups automatically push them to their own cells/columns.
To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 we use SPLIT(text, delimiter, [split_by_each]); the text in this case is first formatted with =REGEXREPLACE(A1,"_+\d*$","") to remove 123421, which then gives you a column for each word delimited by "_".
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = "=REGEXREPLACE(A1,"_+\d*$","")" //This gives you : *campaign1_attribute1_whatever_yes*
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.
I finally figured it out yesterday on Stack Overflow in Spanish: https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked for a regex-only solution for Google Sheets was because I need to use it in Google Data Studio (same regex functions as Spreadsheets).
To get each column just use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to be changed in the formula to switch columns is the number inside {} (column number - 1).
If you do not have the final number, just don't put the last "_".
Lastly, remember to do all the calculated fields again, because (for example) it gets an error with CPC, CTR and other Adwords metrics that are calculated automatically.
Hope it helps!

filtering columns by regex in dataframe

I have a large dataframe (3000+ columns) and I am trying to get a list of all column names that follow this pattern:
"stat.mineBlock.minecraft.123456stone"
"stat.mineBlock.minecraft.DFHFFBSBstone2"
"stat.mineBlock.minecraft.AAAstoneAAAA"
My code:
stoneCombined<-grep("^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?", colnames(ingame), ignore.case =T)
where ingame is the dataframe I am searching. However, my code returns a list of numbers instead of the dataframe columns (like those above) that I was expecting. Can someone tell me why?
After adding value=TRUE (Thanks to user227710):
I now get column names, but I get every column in my dataset, not just those that contain stat.mineBlock.minecraft. and stone like I was trying to get.
To return the column names you need to set value=TRUE as an additional argument of grep. The default option in grep is to set value=FALSE and so it will give you indices of the matched colnames.
help("grep")
value
if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.
grep("your regex pattern", colnames(ingame),value=TRUE, ignore.case =T)
Here is a solution in dplyr:
library(dplyr)
your_df %>%
select(starts_with("stat.mineBlock.minecraft"))
The more general way to match a column name to a regex is with matches() inside select(). See ?select for more information.
My answer is based on this SO post. As per the regex, you were very close.
It is just that [] creates a character class matching a single character from the defined set, and that is the main reason it was not working. Also, perl=T is always safer to use with regex in R.
So, here is my sample code:
df <- data.frame(
"stat.mineBlock.minecraft.123456stone" = 1,
"stat.mineBlock.minecraft.DFHFFBSBwater2" = 2,
"stat.mineBlock.minecraft.DFHFFBSBwater3" = 3,
"stat.mineBlock.minecraft.DFHFFBSBstone4" = 4
)
grep("^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*?stone[a-zA-Z0-9]*?", colnames(df), value=TRUE, ignore.case=T, perl=T)
See IDEONE demo

Oracle: Extract number from String

I've reviewed this question, and I'm wondering why my output seems to be a little skewed.
From my understanding, the REGEXP_REPLACE method takes a string in which you want to replace content, followed by a pattern to match; anything that does not match that pattern is then replaced with the substitution param.
I've written the following function to extract distance from a text field, in which a spatial query will be performed on the result.
CREATE OR REPLACE FUNCTION extract_distance
(
    p_search_string VARCHAR2
)
RETURN VARCHAR2
IS
    l_distance VARCHAR2(25);
BEGIN
    SELECT REGEXP_REPLACE(UPPER(p_search_string), '(([0-9]{0,4}) ?MILES)', '')
    INTO l_distance FROM SYS.DUAL;
    RETURN l_distance;
END extract_distance;
When I run this in a block to test:
DECLARE
    l_output VARCHAR2(25);
BEGIN
    l_output := extract_distance('Stores selling COD4 in 400 Miles');
    DBMS_OUTPUT.PUT_LINE(l_output);
END;
I'd expect the output 400 Miles, but in fact I get Stores selling COD4 in. Where have I gone wrong?
"REGEXP_REPLACE extends the functionality of the REPLACE function by letting you search a string for a regular expression pattern. By default, the function returns source_char with every occurrence of the regular expression pattern replaced with replace_string." from Oracle docu
You could use, e.g.,
SELECT REGEXP_REPLACE('Stores selling COD4 in 400 Miles', '^.*?(\d+ ?MILES).*$', '\1', 1, 0, 'i') FROM DUAL;
or alternatively
SELECT REGEXP_SUBSTR('Stores selling COD4 in 400 Miles', '(\d+ ?MILES)', 1, 1, 'i') FROM DUAL;
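Both of these should return 400 Miles for the sample input from the question.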
You'll want to use REGEXP_SUBSTR, which returns a substring that matches the regular expression:
REGEXP_SUBSTR
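As a rough sketch (untested, reusing the question's own pattern), the function could simply return the first match instead of replacing around it:
CREATE OR REPLACE FUNCTION extract_distance
(
    p_search_string VARCHAR2
)
RETURN VARCHAR2
IS
BEGIN
    -- REGEXP_SUBSTR returns the first substring matching the pattern, e.g. '400 MILES'
    RETURN REGEXP_SUBSTR(UPPER(p_search_string), '(([0-9]{0,4}) ?MILES)');
END extract_distance;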

Use REGEXP_SUBSTR like a Split function

I need to extract a text value from data in a VARCHAR2 column. Sample:
EDKES^Visit: ^PRIMARY INSURANCE COMMENTS: ^SECONDARY INSURANCE COMMENTS: ^TERTIARY INSURANCE COMMENTS: ^NO PRIMARY INSURANCE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NO SECONDARY INSURANCE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NO TERTIARY INS*
I need to get the text that precedes the 6th occurrence of the '^' (excluding the '^'). In this example, the text would be NO PRIMARY INSURANCE.
([\w\s\:\*]+(\^?)) mostly works, but doesn't exclude the '^'.
When I try to use this expression REGEXP_SUBSTR(VARCHAR_COL, '([\w\s\:\*]+(\^?))', 1, 6), I get a single character ('s'), rather than the expected match NO PRIMARY INSURANCE^.
What am I missing?
This should work pretty well:
REPLACE(REGEXP_SUBSTR(VARCHAR_COL, '[^^]+\^?', 1, 6), '^', '')
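A quick sanity check against a shortened version of the sample string (just a sketch):
SELECT REPLACE(REGEXP_SUBSTR('EDKES^Visit: ^PRIMARY INSURANCE COMMENTS: ^SECONDARY INSURANCE COMMENTS: ^TERTIARY INSURANCE COMMENTS: ^NO PRIMARY INSURANCE^NONE', '[^^]+\^?', 1, 6), '^', '') FROM DUAL;
-- NO PRIMARY INSURANCE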
You might be able to account for blank columns as well. And if the engine only returns
the capture groups, it will trim the delimiter.
([^^]*).?
This of course means that the last column found is always invalid.

RegEx : Replace parts of dynamic strings

I have a string
IsNull(VSK1_DVal.RuntimeSUM,0),
I need to remove IsNull part, so the result would be
VSK1_DVal.RuntimeSUM,
I'm absolutely new to RegEx, and it wouldn't be a problem if not for one thing:
VSK1 is a dynamic part; it can be any combination of A-Z, 0-9 and any length. How do I replace strings with RegEx? I use MSSQL 2k5; I think it uses the general set of RegEx rules.
EDIT: I forgot to say that I'm doing the replacement in the SSMS Query window's Replace box (^H), not building a RegEx query.
br
marius
Here's a regex that should work:
[^(]+\(([^,]+),[^)]\)
Then use $1 capture group to extract the part that you need.
I did a sanity check in ruby:
orig = "IsNull(VSK1_DVal.RuntimeSUM,0),"
regex = /[^(]*\(([^,]+),[^)]\)/
result = orig.sub(regex){$1} # result => VSK1_DVal.RuntimeSUM,
It gets trickier if you have a prefix that you want to retain. Like if you have this:
"somestuff = IsNull(VSK1_DVal.RuntimeSUM,0),"
In this case, you need someway to identify the start of the pattern. Maybe you can use '=' to identify the start of the pattern? If so, this should work:
orig = "somestuff = IsNull(VSK1_DVal.RuntimeSUM,0),"
regex = /=\s*\w+\(([^,]+),[^)]\)/
result = orig.sub(regex){$1} # result => somestuff = VSK1_DVal.RuntimeSUM,
But then the case where you don't have an equals sign will fail. Maybe you can use 'IsNull' to identify the start of the pattern? If so, try this (note the '/i' representing case insensitive matching):
orig = "somestuff = isnull(VSK1_DVal.RuntimeSUM,0),"
regex = /IsNull\(([^,]+),[^)]\)/i
result = orig.sub(regex){$1} # result => somestuff = VSK1_DVal.RuntimeSUM,
/IsNull\(([A-Za-z0-9_.]+),0\)/
Then pick group match number 1.
Here's a very useful site: http://www.regexlib.com/RETester.aspx
They have a tester and a cheatsheet that are very useful for quick testing of this sort.
I tested the solution by Dave and it works fine except it also removes the trailing comma you wanted retained. Minor thing to fix.
Try this:
IsNULL\((.*,)0\)
You say in your question
I use MSSQL 2k5; I think it uses the general set of RegEx rules.
This is not true unless you enable CLR and compile and install an assembly. You can use its native pattern matching syntax and LIKE for this as below.
WITH T(C) AS
(
SELECT 'IsNull(VSK1_DVal.RuntimeSUM,0),' UNION ALL
SELECT 'IsNull(VSK1_DVal.RuntimeSUM,123465),' UNION ALL
SELECT 'No Match'
)
SELECT SUBSTRING(C,8,1+LEN(C)-8-CHARINDEX(',',REVERSE(C),2))
FROM T
WHERE C LIKE 'IsNull(%,_%),'
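-- For the two IsNull rows above, SUBSTRING starts at position 8 (just after 'IsNull(') and the
-- computed length stops before the last comma inside the parentheses (found via CHARINDEX on the
-- REVERSEd string), so both matching rows come back as VSK1_DVal.RuntimeSUM.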