Use REPLACE and LIKE together in postgres - regex

I am trying to replace all the occurrences of '-' in a column of a table.
What I also need is to replace the string that follows the dash, which is a random number.
To be more specific this is one of my values:
"ANDRIU 5-9, CHAL 152 34, SOMETHING"
What I want is to replace this part:
-9
with an empty space.
The problem is that: 9 can be any number and not necessarily one digit.
So I need something like finding the position of the first comma in the whole string. And the position of the dash and then replacing this based on the index values.
Is this possible?

Postgres provides the function regexp_replace(), which does what you want directly. Note that it replaces only the first match unless you pass 'g' as a fourth argument:
select regexp_replace(col, '-[0-9]+', ' ')
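For readers who want to try the pattern outside the database, here is a minimal sketch in Python. The function name strip_dash_number is made up for illustration. Python's re.sub replaces every match by default, whereas Postgres needs the 'g' flag for that.

```python
import re


def strip_dash_number(value: str) -> str:
    """Replace each dash followed by one or more digits with a single space."""
    # Same pattern as in the SQL answer: a literal '-' then at least one digit.
    return re.sub(r"-[0-9]+", " ", value)


if __name__ == "__main__":
    print(strip_dash_number("ANDRIU 5-9, CHAL 152 34, SOMETHING"))
```

Because the match is replaced by a space, the space ends up directly before the comma in the sample value.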

Splitting name/value pairs with regex to ignore special characters based on surrounding characters

I have this regex, which has worked well so far, that splits 'name=value' pairs separated by a given character.
(?s)([^\s=]+)=(.*?)(?=\s+[^\s=]+=|\Z)
I know the separator, but the problem is in the example below (tab separated):
usrName=Wilma sev=4 cat=Detection CommandLine="C:\powershell.exe" -Enc 0ATQBpAG0AAcABDAHIAZQBkAHMAIgA= IOCValue= ProcessEndTime=2023-01-18 15:51:05
https://regex101.com/r/1wgVxs/5
Some names can have no value, as in the case of 'IOCValue', which works as expected. However, some values, like the CommandLine, get split: I get everything up to -Enc as one match, and the remainder up to the next pair as another.
What I'm hoping to get out from the above is:
usrName=Wilma
sev=4
cat=Detection
CommandLine="C:\powershell.exe" -Enc 0ATQBpAG0AAcABDAHIAZQBkAHMAIgA=
IOCValue=
ProcessEndTime=2023-01-18 15:51:05
But I'm getting:
usrName=Wilma
sev=4
cat=Detection
CommandLine="C:\powershell.exe" -Enc
0ATQBpAG0AAcABDAHIAZQBkAHMAIgA=
IOCValue=
ProcessEndTime=2023-01-18 15:51:05
Given that I know the separator is a tab, I think what I need is to only look for name=value pairs when they are at the start of the line or preceded by the separator (tab). Is this possible?
Note: I can expect a space separator too, but I have a less performant and messy non-regex version I can send those to, so assume a tab.
You may use this simplified regex:
(?s)([^\s=]+)=(.*?)(?=\t|\Z)
Updated RegEx Demo
Here, lookahead (?=\t|\Z) will make sure that value part is followed by either a tab character or end position.
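As a quick check of the simplified regex, here is a small sketch in Python, whose re module accepts the same (?s), lookahead and \Z syntax. The helper name split_pairs and the way the sample line is assembled are assumptions for illustration only.

```python
import re

# The regex from the answer: name, '=', then a lazy value that must be
# followed by a tab or by the end of the input.
PAIR_RE = re.compile(r"(?s)([^\s=]+)=(.*?)(?=\t|\Z)")


def split_pairs(line: str) -> list[tuple[str, str]]:
    """Return (name, value) tuples for a tab-separated line of name=value pairs."""
    return PAIR_RE.findall(line)


# Sample fields from the question, joined with real tab characters.
fields = [
    "usrName=Wilma",
    "sev=4",
    "cat=Detection",
    r'CommandLine="C:\powershell.exe" -Enc 0ATQBpAG0AAcABDAHIAZQBkAHMAIgA=',
    "IOCValue=",
    "ProcessEndTime=2023-01-18 15:51:05",
]
line = "\t".join(fields)

if __name__ == "__main__":
    for name, value in split_pairs(line):
        print(f"{name}={value}")
```

The value group stops only at a tab, so spaces and the trailing '=' inside the CommandLine value stay in one match.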

Regex to detect string is x.x.x where x is a digit from 1-3 digits

I have 1000+ rows with values entered in varying formats, as below:
5.99
5.188.0
v5.33
v.440.0
In Google Sheets, I want to use another column to perform the following operations:
Remove the 'v' character from the values
If the 2nd '.' is missing, append ".0", so that 5.88 --> 5.88.0
Could you please help with the regex and replace logic? I tried the following, but I am new to writing regexes. Thanks for any help.
=regexmatch(<cellvalue>,"^[0-9]{1}\.[0-9]{1,3}\.[0-9]{1,3}$")
I have got as far as detecting the format: 5.88.0 returns TRUE and 5.99 returns FALSE. I now need to append ".0" so that 5.99 --> 5.99.0, and remove 'v' if found.
You can use a combination of functions. It may not be pretty, but it does the work.
Replace any instance of v with an empty string using SUBSTITUTE. SUBSTITUTE is case-sensitive, so convert the cell content to upper case first with UPPER(CELL); otherwise either the upper-case V or the lower-case v would be missed, depending on which one you search for.
SUBSTITUTE(text_to_search, search_for, replace_with, [occurrence_number])
=SUBSTITUTE(UPPER(A1),"V","")
Next, look for cells that are missing the last block (.xxx). You need to adjust your regex slightly to specify that the last group is not present:
^([0-9]{1}\.[0-9]{1,3}(\.[0-9]{1,3}){0})$
Using REGEXMATCH and IF, we can then CONCATENATE the missing last group as .0
REGEXMATCH(text, regular_expression)
CONCATENATE(string1, [string2, ...])
=IF(REGEXMATCH(substitute(upper(A2),"V",""),"^([0-9]{1}\.[0-9]{1,3}(\.[0-9]{1,3}){0})$"),concatenate(A2,".0"), A2)
The last A2 will be replaced with an expression similar to what we have so far. Before that, we need a small change in the regex: this time we look for values where the last group is present, which is your original regex. If the value matches, it is put in the cell; otherwise the cell shows INVALID, which you can change to anything you want.
^([0-9]{1}\.[0-9]{1,3}\.[0-9]{1,3})$
This is the piece we put in place of the last A2:
IF(REGEXMATCH(substitute(upper(A2),"V",""),"^([0-9]{1}\.[0-9]{1,3}\.[0-9]{1,3})$"),substitute(upper(A2),"V",""),"INVALID")
With this the final code to put in your cell will be:
=IF(REGEXMATCH(substitute(upper(A2),"V",""),"^([0-9]{1}\.[0-9]{1,3}(\.[0-9]{1,3}){0})$"),concatenate(SUBSTITUTE(UPPER(A2),"V",""),".0"),IF(REGEXMATCH(substitute(upper(A2),"V",""),"^([0-9]{1}\.[0-9]{1,3}\.[0-9]{1,3})$"),substitute(upper(A2),"V",""),"INVALID"))
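The logic of the final formula can be sketched outside the spreadsheet as well. The Python below is an illustration only: the function name normalize_version is hypothetical, and fullmatch stands in for the ^...$ anchors of the Sheets regex.

```python
import re

# Two-block form such as 5.99 (last block missing).
MISSING_LAST = re.compile(r"[0-9]\.[0-9]{1,3}")
# Three-block form such as 5.188.0 (the original regex from the question).
COMPLETE = re.compile(r"[0-9]\.[0-9]{1,3}\.[0-9]{1,3}")


def normalize_version(raw: str) -> str:
    """Strip any 'v'/'V', append '.0' when the last block is missing."""
    # Mirrors SUBSTITUTE(UPPER(A2), "V", "").
    cleaned = raw.upper().replace("V", "")
    if MISSING_LAST.fullmatch(cleaned):
        return cleaned + ".0"
    if COMPLETE.fullmatch(cleaned):
        return cleaned
    return "INVALID"


if __name__ == "__main__":
    for sample in ["5.99", "5.188.0", "v5.33", "v.440.0"]:
        print(sample, "->", normalize_version(sample))
```

Note that a value such as v.440.0 has no leading digit once the v is removed, so it falls through to INVALID, just as it would in the formula.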

Remove character from string in informatica

I have the following string values.
string 1 = test123
string 2 = stri567
Now I need to remove 123 and 567 from the strings, which means I need only the first four characters of each string (test, stri).
Have you tried REG_REPLACE to remove all digits from the string?
REG_REPLACE( inp_col, '[0-9]','')
If you need the first four characters of the column, use SUBSTR:
SUBSTR(column_name, 0, 4)
Please check whether you need the first four characters or just the non-numeric part of the values.
If you need the non-numeric part of the values, please use Koushik Roy's solution.
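The two approaches only agree when the digits come after the fourth character. A small Python sketch makes the difference visible; the function names are made up, and this is not Informatica syntax.

```python
import re


def remove_digits(value: str) -> str:
    """Drop every digit, wherever it occurs (the REG_REPLACE approach)."""
    return re.sub(r"[0-9]", "", value)


def first_four(value: str) -> str:
    """Keep only the first four characters (the SUBSTR approach)."""
    return value[:4]


if __name__ == "__main__":
    for sample in ["test123", "stri567", "ab12cd"]:
        print(sample, remove_digits(sample), first_four(sample))
```

For test123 and stri567 both functions give the same result, but for a value like ab12cd they do not, which is why it matters which behaviour is actually required.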

Extract text up to the Nth character in a string

How can I extract the text up to the 4th instance of a character in a column?
I'm selecting text out of a column called filter_type up to the fourth > character.
To accomplish this, I've been trying to find the position of the fourth > character, but it's not working:
select substring(filter_type from 1 for position('>' in filter_type))
You can use the pattern matching function in Postgres.
First figure out a pattern to capture everything up to the fourth > character.
To start your pattern you should create a sub-group that captures non > characters, and one > character:
([^>]*>)
Then capture that four times to get to the fourth instance of >
([^>]*>){4}
Then, you will need to wrap that in a group so that the match brings back all four instances:
(([^>]*>){4})
and put a start of string symbol for good measure to make sure it only matches from the beginning of the String (not in the middle):
^(([^>]*>){4})
Here's a working regex101 example of that!
Once you have a pattern that returns what you want in the first group (the online regex tester shows the groups in its right-hand panel), you need to select that group in the SQL.
In Postgres, the substring function can extract text from the input with a regex pattern, using a FROM clause inside the substring call.
To finish, put it all together!
select substring(filter_type from '^(([^>]*>){4})')
from filter_table
See a working sqlfiddle here
If you want to match the entire string whenever there are fewer than four instances of >, use this regular expression:
^(([^>]*>){4}|.*)
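To experiment with the four-times pattern without a database, here is a rough Python equivalent. The helper name up_to_fourth_gt is hypothetical. Only the first pattern is shown, because alternation can be resolved differently by Python's re module and by the Postgres regex engine.

```python
import re

# Anchored at the start: four repetitions of "non-'>' characters then '>'".
UP_TO_FOURTH = re.compile(r"^(([^>]*>){4})")


def up_to_fourth_gt(text: str):
    """Return the text up to and including the fourth '>', or None if absent."""
    match = UP_TO_FOURTH.match(text)
    # group(1) is the outer group, i.e. all four repetitions together.
    return match.group(1) if match else None


if __name__ == "__main__":
    print(up_to_fourth_gt("a>b>c>d>e>f"))
    print(up_to_fourth_gt("a>b>c"))
```

As with the SQL version, the result keeps the fourth > at the end, and a string with fewer than four > characters yields no match.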
You can also use a simple, non-regex solution:
SELECT array_to_string((string_to_array(filter_type, '>'))[1:4], '>')
The above query:
splits your string into an array, using '>' as delimiter
selects only the first 4 elements
transforms the array back to a string
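The same three steps (split, slice, join) can be sketched in Python; first_fields is an illustrative name, not part of any library.

```python
def first_fields(text: str, delimiter: str = ">", count: int = 4) -> str:
    """Split on the delimiter, keep the first `count` pieces, and re-join them."""
    pieces = text.split(delimiter)      # split the string into a list
    kept = pieces[:count]               # keep only the first `count` elements
    return delimiter.join(kept)         # turn the list back into a string


if __name__ == "__main__":
    print(first_fields("a>b>c>d>e>f"))
    print(first_fields("a>b"))
```

Unlike the regex with the {4} quantifier, this variant does not keep the fourth delimiter, and it returns short inputs unchanged.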
substring(filter_type from '^(([^>]*>){4})')
This form of substring lets you extract the portion of a string that matches a regex pattern.
You can also split the string, then choose the N'th element inside the result list. For example:
SELECT SPLIT_PART('aa,bb,cc', ',', 2)
will return: bb.
This function is defined as:
SPLIT_PART(string, delimiter, position)
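For comparison, here is a minimal Python sketch of the same idea. The function below is a hypothetical helper, not the Postgres implementation; it handles only positive, 1-based positions and returns an empty string when the position is past the last field.

```python
def split_part(string: str, delimiter: str, position: int) -> str:
    """Return the field at the 1-based `position`, or '' if it does not exist."""
    parts = string.split(delimiter)
    if 1 <= position <= len(parts):
        return parts[position - 1]
    return ""


if __name__ == "__main__":
    print(split_part("aa,bb,cc", ",", 2))
```

Note that this picks a single field, so getting everything up to the Nth delimiter would still need the fields to be joined again.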
In order to look at this problem, I did the following (all of the code below is available on the fiddle here):
CREATE TABLE s
(
a TEXT
);
I then created an SQL function to generate random strings as follows.
CREATE FUNCTION f() RETURNS TEXT LANGUAGE SQL AS
$$
SELECT STRING_AGG(SUBSTR('abcdef>', CEIL(RANDOM() * 7)::INTEGER, 1), '')
FROM GENERATE_SERIES(1, 40)
$$;
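A comparable generator for local experiments could look like the Python sketch below. The name random_test_string and its default parameters are assumptions that mirror the SQL function: 40 characters drawn from 'abcdef>'.

```python
import random


def random_test_string(length: int = 40, alphabet: str = "abcdef>") -> str:
    """Build a random string in which roughly one character in seven is '>'."""
    return "".join(random.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    for _ in range(3):
        print(random_test_string())
```

Including > in the alphabet means most generated strings contain several delimiters, which is what the test needs.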
I got the code from here and modified it so that it would produce strings with lots of > characters for testing purposes.
I then manually inserted a few strings at the beginning so that a quick look would tell me if the code was working as anticipated.
INSERT INTO s VALUES
('afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>'),
('23433>433453>4>4559>455>3433>'),
('adfd>adafs>afadsf>'), -- only 3 '>'s!
('babedacfab>feaefbf>fedabbcbbcdcfefefcfcd'),
('e>>>>>'), -- edge case - multiple terminal '>'s
('aaaaaaa'); -- edge case - no '>'s whatsoever
The reason I put in the records with fewer than 4 >s is because the accepted answer (see discussion at the end of this answer) puts forward a solution which should return the entire string if this is the case!
On the fiddle, I then added 50,000 records as follows:
INSERT INTO s
SELECT f() FROM GENERATE_SERIES(1, 50000);
I also created a table s on a home laptop (16GB RAM, 500GB NVMe SSD) and populated it with 40,000,000 (40M) records - times also shown.
Now, my reading of the question is that we need to extract the string up to but not including the 4th > character.
The first solution (from treecon) was this one (I also show them running on the fiddle, but to save space here, I've only included the partial output of EXPLAIN (ANALYZE, BUFFERS, VERBOSE)) - the times shown are typical over a few runs:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
ARRAY_TO_STRING((STRING_TO_ARRAY(a, '>'))[1:4], '>'),
a
FROM s;
Result (only key parts included):
Seq Scan on public.s
Execution Time: 81.807 ms
40M Time: 46 seconds
A regex solution which works (significantly faster):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
SUBSTRING(a FROM '^(?:[^>]*>){0,3}[^>]*'),
a
FROM s;
Result:
Seq Scan on public.s
Execution Time: 74.757 ms
40M Time: 32 seconds
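To check that the regex and the split/join approach agree on the hand-written edge cases, here is a small Python sketch. It says nothing about the Postgres timings; the function names are illustrative.

```python
import re

# Up to three "field>" groups, then the next run of non-'>' characters.
BEFORE_FOURTH = re.compile(r"^(?:[^>]*>){0,3}[^>]*")


def regex_version(text: str) -> str:
    """Text up to, but not including, the fourth '>' (regex approach)."""
    # The pattern can match the empty string, so match() never returns None.
    return BEFORE_FOURTH.match(text).group(0)


def split_version(text: str) -> str:
    """Same result via split, slice and join."""
    return ">".join(text.split(">")[:4])


samples = [
    "afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>",
    "23433>433453>4>4559>455>3433>",
    "adfd>adafs>afadsf>",
    "babedacfab>feaefbf>fedabbcbbcdcfefefcfcd",
    "e>>>>>",
    "aaaaaaa",
]

if __name__ == "__main__":
    for s in samples:
        print(repr(regex_version(s)), regex_version(s) == split_version(s))
```

Strings with fewer than four > characters come back whole from both functions, including a trailing > when the string ends with one.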
The accepted answer fails on many levels (see the fiddle). It leaves a > at the end and fails on various strings even when modified. Also, the solution proposed to include strings with fewer than 4 >s (i.e. ^(([^>]*>){4}|.*)) merely returns the original string (see end of fiddle).

Remove substrings that vary in value in Oracle

I have a column in Oracle which can contain up to 5 separate values, each separated by a '|'. Any of the values can be present or missing. Here are some examples of how the data might look:
100-1
10-3|25-1|120/240
15-1|15-3|15-2|120/208
15-1|15-3|15-2|120/208|STA-2
112-123|120/208|STA-3
The values are arbitrary except for the order. The numerical values separated by dashes always come first. There can be 1 to 3 of these values present. The numerical value separated by a slash, if present, comes next. The string 'STA' followed by a dash and a numerical value is always last, if it is present.
What I would like to do is reformat this column to only ever include the first three possible values, those being the three numerical values separated by dashes. Afterwards, I want to replace the second number in each value (the number after the dash) using the following pattern:
1 = A
2 = B
3 = C
I would also like to remove the dash afterwards, but not the '|' that separates the values unless there is a trailing '|'.
To give you an idea, here's how the values at the beginning of the post would look after the reformatting:
100A
10C|25A
15A|15C|15B
15A|15C|15B
112ABC
I'm thinking this can be done with regular expressions, but it has me a little confused. Does anyone have a solution?
If I had to solve this problem, I would do it as follows.
SELECT
REGEXP_REPLACE(column,'\|\d+\/\d+(\|STA-\d+)?',''),
REGEXP_REPLACE(column,'(\d+)-(1)([^\d])','\1A\3'),
REGEXP_REPLACE(column,'(\d+)-(2)([^\d])','\1B\3'),
REGEXP_REPLACE(column,'(\d+)-(3)([^\d])','\1C\3'),
REGEXP_REPLACE(column,'(\d+)-(123)([^\d])','\1ABC\3')
FROM table;
Explanation: Let us break down each REGEXP_REPLACE statement one by one.
REGEXP_REPLACE(column,'\|\d+\/\d+(\|STA-\d+)?','')
This replaces the trailing part, such as |120/208|STA-2, with an empty string so that further processing is easier.
Finding the match was easy, but replacing 1 with A, 2 with B and 3 with C in a single pass was not possible (as far as I know), so I did those matches and replacements separately.
In each regex from the second statement onwards, (\d+)-(yourNumber)([^\d]), the first group is the number before the dash, yourNumber is one of 1, 2, 3 or 123, and the last group is a non-digit character such as |.
The replacement then depends on yourNumber.
All demos here from version 1 to 5.
Note: I have only handled the combinations of yourNumber that are present in the question. You can handle other combinations in the same way.
You can do this in a single expression, although writing a simple function may be cleaner:
SELECT str, REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?','') cut
, REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4') rep3toC
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4') rep2toB
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4'), '(\-)([B,C]*)(1)([B,C]*)', '\1\2A\4') rep1toA
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4'), '(\-)([B,C]*)(1)([B,C]*)', '\1\2A\4'), '-', '') "rep-"
FROM (
SELECT '100-1' str FROM dual UNION
SELECT '10-3|25-1|120/240' str FROM dual UNION
SELECT '15-1|15-3|15-2|120/208' str FROM dual UNION
SELECT '15-1|15-3|15-2|120/208|STA-2' str FROM dual UNION
SELECT '112-123|120/208|STA-3' FROM dual
) tab
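The whole transformation can also be sketched procedurally, which may help when checking the regex-based queries against the expected output. The Python below is an illustration under the question's stated assumptions (dash-separated values come first, at most three are kept); the name reformat is made up.

```python
import re

# A dash-separated value such as 15-3 or 112-123.
DASH_PAIR = re.compile(r"(\d+)-(\d+)")
# 1 -> A, 2 -> B, 3 -> C; other digits are left unchanged.
DIGIT_TO_LETTER = str.maketrans("123", "ABC")


def reformat(value: str) -> str:
    """Keep the leading dash-separated values and letter-code their suffixes."""
    kept = []
    for part in value.split("|"):
        match = DASH_PAIR.fullmatch(part)
        if match is None:
            # The first part that is not number-number ends the leading run
            # (for example 120/208 or STA-2).
            break
        kept.append(match.group(1) + match.group(2).translate(DIGIT_TO_LETTER))
    # At most three dash-separated values are expected.
    return "|".join(kept[:3])


if __name__ == "__main__":
    for sample in [
        "100-1",
        "10-3|25-1|120/240",
        "15-1|15-3|15-2|120/208",
        "15-1|15-3|15-2|120/208|STA-2",
        "112-123|120/208|STA-3",
    ]:
        print(sample, "->", reformat(sample))
```

Joining only the kept parts means there is never a trailing '|' to clean up afterwards.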