Replacing regex matched text with uppercase version in Postgresql - regex

Given a string with certain words surrounded by stars, e.g.
The *quick* *brown* fox jumped over the *lazy* dog
can you transform the words surrounded by stars into uppercase versions, i.e.
The QUICK BROWN fox jumped over the LAZY dog
Given text in a column 'sentence' in table 'sentences', I can mark/extract the words as follows:
SELECT regexp_replace(sentence,'\*(.*?)\*','STARTUPPER\1ENDUPPER','g') FROM sentences;
but my first attempt at uppercase transforms doesn't work:
select regexp_replace(sentence,'\*(.*?)\*','' || upper('\1'),'g') from sentences;
I thought of using substring() to split the parts up after replacing the stars with start and end markers, but that would fail if there was more than one word starred.

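You can create a PL/pgSQL function like:
CREATE FUNCTION upper_asterisk(inp_str varchar)
RETURNS varchar AS $$
DECLARE
    t_match text[];  -- regexp_matches() returns one text[] per match
BEGIN
    -- non-greedy capture, so each starred word is matched on its own
    FOR t_match IN SELECT regexp_matches(inp_str, '\*(.*?)\*', 'g')
    LOOP
        -- t_match[1] is the captured word; replace it together with its stars
        inp_str := replace(inp_str, '*' || t_match[1] || '*', upper(t_match[1]));
    END LOOP;
    RETURN inp_str;
END;
$$ LANGUAGE plpgsql;
(Not tested; may have bugs.)
Alternatively, write such a function in any other procedural language available inside the database.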

Answer from the Postgresql mailing list:
Yeah, you cannot embed a function-call result in the "replace with" section;
it has to be a literal (with the group insertion meta-sequences allowed of
course).
I see two possible approaches.
1) Use pl/perl (or some variant thereof) which has facilities to do just
this.
2) Use regexp_matches(,,'g') to explode the input string into its components
parts. You can explode it so every character of the original string is in
the output with the different columns containing the "raw" and "to modify"
parts of each match. This would be done in a sub-query and then in the
parent query you would "string_agg(...)" the matches back together while
manipulating the columns needed "i.e., string_agg(c1 || upper(c3))"
HTH
David J.
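For comparison, here is what the first approach (a language whose regex replace accepts a callback) looks like outside the database. This is a minimal Python sketch, not PL/Perl code; the function name upper_starred is made up for illustration.

```python
import re

def upper_starred(sentence: str) -> str:
    """Uppercase every starred word and drop the stars."""
    # The replacement is a function, so upper() runs on the captured text
    # of each match instead of on the literal two-character string '\1'.
    return re.sub(r'\*(.*?)\*', lambda m: m.group(1).upper(), sentence)

print(upper_starred('The *quick* *brown* fox jumped over the *lazy* dog'))
```

This is exactly what the plain SQL regexp_replace cannot do: its replacement argument is evaluated once, before any matching happens.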

SQL version, something like this:
SELECT string_agg(m, '' ORDER BY n)
FROM (
SELECT n, CASE WHEN n % 2 = 0 THEN upper(m) ELSE m END m
FROM (
SELECT m, row_number() OVER () n
FROM regexp_split_to_table( 'The *quick* *brown* fox j*ump*ed over the *lazy* dog' , '\*') m
) a
) b;
How it works:
Split the string into a table on '*' and keep track of the row number.
Even rows should be uppercased; if the first character is a '*', an empty first row is produced, so the numbering still holds.
Glue all rows back together with string_agg, ordered by row number.
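The same split-and-rejoin idea can be sketched in a few lines of Python to make the even/odd logic easier to see. This is only an illustration of the technique; the function name upper_even_pieces is hypothetical.

```python
def upper_even_pieces(sentence: str) -> str:
    """Split on '*', uppercase every second piece, and glue the pieces back."""
    pieces = sentence.split('*')
    # With 0-based indexing the starred pieces sit at odd positions,
    # which corresponds to the even row numbers in the SQL version.
    return ''.join(
        piece.upper() if index % 2 == 1 else piece
        for index, piece in enumerate(pieces)
    )

print(upper_even_pieces('The *quick* *brown* fox j*ump*ed over the *lazy* dog'))
```

Like the SQL version, this assumes the stars are balanced; an unpaired star would uppercase everything after it up to the next star or the end of the string.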

Related

Oracle regex and replace

I have a varchar field in the database that contains text. I need to replace every occurrence of any string of 2 letters followed by 8 digits with a link, so that VA12345678 becomes /cs/page.asp?id=VA12345678
I have a regex that replaces the string, but how can I replace it with a string that contains the matched string itself?
SELECT REGEXP_REPLACE ('test PI20099742', '[A-Z]{2}[0-9]{8}$', 'link to replace with')
FROM dual;
I can have more than one of these strings in one varchar field and ideally I would like to have them replaced in one statement instead of a loop.
As mathguy had said, you can use backreferences for your use case. Try a query like this one.
SELECT REGEXP_REPLACE ('test PI20099742', '([A-Z]{2}[0-9]{8})', '/cs/page.asp?id=\1')
FROM DUAL;
For such cases, you may want to keep the "text to add" somewhere at the top of the query, so that if you ever need to change it, you don't have to hunt for it.
You can do that with a with clause, as shown below. I also put some input data for testing in the with clause, but you should remove that and reference your actual table in your query.
I used the [:alpha:] character class, to match all letters - upper or lower case, accented or not, etc. [A-Z] will work until it doesn't.
with
text_to_add (link) as (
select '/cs/page.asp?id=' from dual
)
, sample_strings (str) as (
select 'test VA12398403 and PI83048203 to PT3904' from dual
)
select regexp_replace(str, '([[:alpha:]]{2}\d{8})', link || '\1')
as str_with_links
from sample_strings cross join text_to_add
;
STR_WITH_LINKS
------------------------------------------------------------------------
test /cs/page.asp?id=VA12398403 and /cs/page.asp?id=PI83048203 to PT3904
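The backreference technique is not specific to Oracle. As a rough cross-check, here is the same replacement sketched in Python; the names LINK_PREFIX and add_links are made up for this example, and [A-Za-z] is used as an ASCII-only stand-in for [[:alpha:]].

```python
import re

# Kept in one place, in the spirit of the text_to_add subquery above.
# Assumption: the prefix contains no backslashes, so it is safe to use
# directly inside a replacement template.
LINK_PREFIX = '/cs/page.asp?id='

def add_links(text: str) -> str:
    """Prefix every 2-letter + 8-digit code with the link path."""
    # \1 in the replacement re-inserts the text captured by group 1.
    return re.sub(r'([A-Za-z]{2}\d{8})', LINK_PREFIX + r'\1', text)

print(add_links('test VA12398403 and PI83048203 to PT3904'))
```

All occurrences are replaced in one call, and PT3904 is left alone because it has only four digits.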

Vertica REGEXP_SUBSTR use /g flag

I am trying to extract all occurrences of a word before '=' in a string. I tried the regex '/\w+(?=\=)/g', but it returns null. When I remove the leading '/' and the trailing '/g', it returns only one occurrence, which is why I need the global flag. Any suggestions?
As Wiktor pointed out, by default, you only get the first string in a REGEXP_SUBSTR() call. But you can get the second, third, fourth, etc.
Embedded into SQL, you need to treat regular expressions differently from the way you would treat them in perl, for example. The pattern is just the pattern, modifiers go elsewhere, you can't use $n to get the n-th captured sub-expression, and you need to proceed in a specific way to get the n-th match of a pattern, etc.
The trick is to CROSS JOIN your queried table with an in-line created index table, consisting of as many consecutive integers as you expect occurrences of your pattern - and a few more for safety. And Vertica's REGEXP_SUBSTR() call allows for additional parameters to do that. See this example:
WITH
-- one exemplary input row; concatenating substrings for
-- readability
input(s) AS (
SELECT 'DRIVER={Vertica};COLUMNSASCHAR=1;CONNECTIONLOADBALANCE=True;'
||'CONNSETTINGS=set+search_path+to+public;DATABASE=sbx;'
||'LABEL=dbman;PORT=5433;PWD=;SERVERNAME=127.0.0.1;UID=dbadmin;'
)
,
-- an index table to CROSS JOIN with ... maybe you need more integers ...
loop_idx(i) AS (
SELECT 1
UNION SELECT 2
UNION SELECT 3
UNION SELECT 4
UNION SELECT 5
UNION SELECT 6
UNION SELECT 7
UNION SELECT 8
UNION SELECT 9
UNION SELECT 10
)
,
-- the query containing the REGEXP_SUBSTR() call
find_token AS (
SELECT
i -- the index from the in-line index table, needed
-- for ordering the outermost SELECT
, REGEXP_SUBSTR (
s -- the input string
, '(\w+)=' -- the pattern - a word followed by an equal sign; capture the word
, 1 -- start from pos 1
, i -- the i-th occurrence of the match
, '' -- no modifiers to regexp
, 1 -- the first and only sub-pattern captured
) AS token
FROM input CROSS JOIN loop_idx -- the CROSS JOIN with the in-line index table
)
-- the outermost query filtering the non-matches - the empty strings - away...
SELECT
token
FROM find_token
WHERE token <> ''
ORDER BY i
;
The result will be one row per found pattern:
token
DRIVER
COLUMNSASCHAR
CONNECTIONLOADBALANCE
CONNSETTINGS
DATABASE
LABEL
PORT
PWD
SERVERNAME
UID
You can do all sorts of things in modern SQL - but you need to stick to the SQL and to the relational paradigm - that's all ...
Happy playing ...
Marco
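To see what the "global" behaviour the question asks for amounts to, here is a small Python sketch that returns every captured word in one call. It is only meant to make the expected result concrete; it is not Vertica code, and find_tokens is a hypothetical helper name.

```python
import re

conn_str = (
    'DRIVER={Vertica};COLUMNSASCHAR=1;CONNECTIONLOADBALANCE=True;'
    'CONNSETTINGS=set+search_path+to+public;DATABASE=sbx;'
    'LABEL=dbman;PORT=5433;PWD=;SERVERNAME=127.0.0.1;UID=dbadmin;'
)

def find_tokens(s: str) -> list:
    """Return every word that is immediately followed by an equal sign."""
    # With one capturing group, findall returns the captured text of each
    # match, in order: the equivalent of asking for occurrence 1, 2, 3, ...
    return re.findall(r'(\w+)=', s)

tokens = find_tokens(conn_str)
for token in tokens:
    print(token)
```

The CROSS JOIN with the index table plays the role of this loop over matches: each integer i asks REGEXP_SUBSTR for the i-th occurrence.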

Notepad++ complex conditional search regex

I have a database SQL statement followed by a bunch of statements to collect statistics. I'd like to search the SQL for a specific join, find all corresponding collect statistics statements, and then strip the extraneous characters from them to end up with a useful set of statements. Input:
select tbd.cola , tba.a, tbx.b,
tbc.r,
tbx.c ,
case when yada ya then tbx.c + xyz else 'daddy' end as nicecol
, tbx.g
from
tbd join tba on tbd.cola = tba.colb
left join
tbx on tbx.colp= tba.colp left join
tbc on tbc.colfff=tbx.colm join......
/*this is followed by a bunch of statements in format */
---- "collect stats column (cola,colbxx)
on tbd ( medium strong )"
---- "collect stats column (colfff) on tbc ( not
strong )"
---- "collect stats column ( colddsdsd) on tbc ( very strong )"
----"collect stats col (yada,secretxxx,xxx) on tbx ( strong ) "
Note that the spacing inside the trailing parentheses follows this logic:
(\s*medium|not|very\s*strong\s*)
The same goes for
---- "collect stats column
In other words, there is variable spacing between all the words. There is no consistent spacing pattern, and the statements arbitrarily span multiple lines or squeeze into a single line.
What I'd like to do is:
Search for the column names being joined, e.g. tbd.cola = tba.colb
Then look for these column names in the collect statistics statements. In our case,
cola colp colm colfff are the join column names that come from
tbd join tba on tbd.cola = tba.colb
left join
tbx on tbx.colp= tba.colp left join
tbc on tbc.colfff=tbx.colm
We search for these in the collect stats statements, and the following qualify:
---- "collect stats column (cola,colbxx) on tbd ( medium strong )"
---- "collect stats column (colfff) on tbc ( not strong )"
Next, the statements have to be "purified" so that the extraneous characters and text around them are removed. The desired output format is below:
collect stats column (cola,colbxx) on tbd;
collect stats column (colfff) on tbc ;
That is, remove the ---- " prefix (pattern [-]+?") and
replace ( <string with or without space and with variable spaces around it> )" of the form ( not strong )" with ;
What I did was a multistep process. I could manage the third part using
"\s*([^"]+ strong\s*)\)
so that part is done, but I am looking for a conditional select approach here. I need help with the first two.
There is no need to use boundaries to select the collect stats statements; I could select that part with my mouse and then run a regex on the selection only.
The logic would be to:
Search for the join\s*tablename.column\s*\=\s*tablename.column pattern (the = is escaped as \=).
Collect all matching column names into a buffer.
Then create boundaries for, or physically select, the part where the collect statistics statements begin.
Run the collected column list through the collect stats statements to see which qualify.
If there is a column combination like collect stats column (cola,colbxx) and only cola is a join column, that statement is also selected, since one of its columns is a join column.
Finally, we have a shortlist of collect statistics statements on which we run the last regex ("\s*([^"]+ strong\s*)\)) to rid them of extraneous characters.
We can break this operation into two components. The first part is the conditional search: search for joined column names in the collect statistics area. The search results get copied and pasted into another work area (a new file), and then we run the last part above on that file.
OK, I found something! It works for the example you gave, but I can't have anticipated all possibilities, so tell me if it works for you.
It uses 2 substitutions. Make sure you have checked "Regular expression" and the box next to it (labelled something like ". matches newline").
First substitution :
Replace this :
join\s+\w+\s+on\s+\w+\.(\w+)\b\s*=\s*\w+\.(\w+)\b(?=.*-+\s+"([^"]+(?:\1|\2)[^"]+)(\s)+\([^)]+\)")|.
By this :
\3\4
Second substitution :
Replace this :
(collect.*?)\s+(on\s\w+)\s
By this :
\1 \2;\n
Demo
First substitution : Regex101
Second substitution : Regex101
Explanations
The regex is based on an alternation. The first part is
join\s+\w+\s+on\s+\w+\.(\w+)\b\s*=\s*\w+\.(\w+)\b(?=.*-+\s+"([^"]+(?:\1|\2)[^"]+)(\s)+\([^)]+\)")
join\s+\w+\s+on\s+\w+\.(\w+)\b\s*=\s*\w+\.(\w+)\b matches a string of the form join tbname on tbname.cola = tbname.colb. Note that the spaces around the = are optional and that the names cola and colb are captured for later use.
(?=.*-+\s+"([^"]+(?:\1|\2)[^"]+)(\s)+\([^)]+\)") allows the preceding match only if a string like ---- "[...] [cola OR colb] [...] ([...])" appears later in the file. In other words, it requires a string beginning with one or more -, then one or more spaces and a ", ending with a pair of () and a ", and containing either cola or colb (or both).
The regex is tried at each position in the file. Wherever the first part does not match, it falls back to the second part of the alternation, which is . (anything). So in the end it matches the whole file, but where it matched some joined columns, the capturing groups contain something, which is then written to the file through the replacement \3\4
The second substitution is just a reformatting of the lines kept.
Notes
I could do it with a single substitution, but it would be much uglier.
It might look strange that I had to erase the text that needs to be kept and then rewrite it at the end. The reason is that Notepad++ does not allow variable-length lookbehinds.
Depending on the size of your file, the first substitution might take much longer than it does for the example. I don't know how Notepad++ reacts when it takes too long, but it might crash... If that is the case, we will have to split the process into multiple smaller substitutions.
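If leaving the editor is an option, the two-stage logic from the question (collect the join columns, then keep and clean the statements that mention one of them) is easier to express in a script. The following is a rough Python sketch under the assumption that every statement has the shape ---- "collect stats <word> (<columns>) on <table> (<strength>)"; the names sql, stats and qualifying_statements are made up for illustration.

```python
import re

sql = """
from
tbd join tba on tbd.cola = tba.colb
left join
tbx on tbx.colp= tba.colp left join
tbc on tbc.colfff=tbx.colm
"""

stats = '''
---- "collect stats column (cola,colbxx)
on tbd ( medium strong )"
---- "collect stats column (colfff) on tbc ( not
strong )"
---- "collect stats column ( colddsdsd) on tbc ( very strong )"
----"collect stats col (yada,secretxxx,xxx) on tbx ( strong ) "
'''

# One quoted statement: dashes, quote, head, (columns), on table, (strength), quote.
STATEMENT = re.compile(
    r'-+\s*"\s*(collect\s+stats\s+\w+)\s*\(([^)]*)\)\s*on\s+(\w+)\s*\([^)]*\)\s*"'
)

def qualifying_statements(sql_text: str, stats_text: str) -> list:
    # Step 1: gather both column names of every "tab.col = tab.col" condition.
    join_cols = set()
    for left, right in re.findall(r'\w+\.(\w+)\s*=\s*\w+\.(\w+)', sql_text):
        join_cols.update((left, right))

    # Step 2: keep statements whose column list shares a name with the join
    # columns, and rebuild them without dashes, quotes and the strength part.
    kept = []
    for head, cols, table in STATEMENT.findall(stats_text):
        names = [c.strip() for c in cols.split(',')]
        if join_cols.intersection(names):
            clean_head = ' '.join(head.split())  # collapse variable whitespace
            kept.append(clean_head + ' (' + ','.join(names) + ') on ' + table + ';')
    return kept

for line in qualifying_statements(sql, stats):
    print(line)
```

Column names are compared as whole names here, so colbxx does not count as a match for colb; the first statement qualifies only because of cola.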

Extract numbers from a field in PostgreSQL

I have a table with a column po_number of type varchar in Postgres 8.4. It stores alphanumeric values with some special characters. I want to ignore the characters [/alpha/?/$/encoding/.] and check if the column contains a number or not. If it's a number, it needs to be cast to a number; otherwise pass null, as my output field po_number_new is a number field.
Below is the example:
SQL Fiddle.
I tried this statement:
select
(case when regexp_replace(po_number,'[^\w],.-+\?/','') then po_number::numeric
else null
end) as po_number_new from test
But I got an error for explicit cast:
Simply:
SELECT NULLIF(regexp_replace(po_number, '\D','','g'), '')::numeric AS result
FROM tbl;
\D being the class shorthand for "not a digit".
And you need the 4th parameter 'g' (for "globally") to replace all occurrences.
Details in the manual.
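The same strip-then-cast logic, including the NULLIF step, can be sketched in Python to show what happens for each kind of input. The function name digits_or_none is hypothetical, and None stands in for SQL NULL.

```python
import re
from typing import Optional

def digits_or_none(po_number: Optional[str]) -> Optional[int]:
    """Remove every non-digit; return the number, or None if nothing is left."""
    if po_number is None:
        return None
    # Equivalent of regexp_replace(po_number, '\D', '', 'g')
    digits = re.sub(r'\D', '', po_number)
    # Equivalent of NULLIF(..., ''): an empty string cannot be cast to a number.
    return int(digits) if digits else None

print(digits_or_none('PO-123/45'))
print(digits_or_none('n/a'))
```

Note that digits from separate groups are concatenated, so 'PO-123/45' yields 12345 rather than two numbers.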
For a known, limited set of characters to replace, plain string manipulation functions like replace() or translate() are substantially cheaper. Regular expressions are just more versatile, and we want to eliminate everything but digits in this case. Related:
Regex remove all occurrences of multiple characters in a string
PostgreSQL SELECT only alpha characters on a row
Is there a regexp_replace equivalent for postgresql 7.4?
But why Postgres 8.4? Consider upgrading to a modern version.
Consider pitfalls for outdated versions:
Order varchar string as numeric
WARNING: nonstandard use of escape in a string literal
I think you want something like this:
select (case when regexp_replace(po_number, '[^\w],.-+\?/', '') ~ '^[0-9]+$'
then regexp_replace(po_number, '[^\w],.-+\?/', '')::numeric
end) as po_number_new
from test;
That is, you need to do the conversion on the string after replacement.
Note: This assumes that the "number" is just a string of digits.
The logic I would use to determine if the po_number field contains numeric digits is that its length should decrease when attempting to remove numeric digits.
If so, then all non numeric digits ([^\d]) should be removed from the po_number column. Otherwise, NULL should be returned.
select case when char_length(regexp_replace(po_number, '\d', '', 'g')) < char_length(po_number)
then regexp_replace(po_number, '[^0-9]', '', 'g')
else null
end as po_number_new
from test
If you want to extract floating numbers try to use this:
SELECT NULLIF(regexp_replace(po_number, '[^\.\d]','','g'), '')::numeric AS result FROM tbl;
It's the same as Erwin Brandstetter's answer but with a different expression:
[^...] - match any character except those in a list of excluded characters; put the excluded characters in place of ...
\. - point character (you can also change it to the , character)
\d - digit character
Since version 12 (released 2 years and 4 months before the time of writing, but after the last edit that I can see on the accepted answer), you can use a generated column to do this once, when the row is written, rather than having to calculate it each time you wish to SELECT a new po_number.
Furthermore, you can use the TRANSLATE function to extract your digits, which is less expensive than the REGEXP_REPLACE solution proposed by @ErwinBrandstetter!
I would do this as follows (all of the code below is available on the fiddle here):
CREATE TABLE s
(
num TEXT,
new_num INTEGER GENERATED ALWAYS AS
(NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER) STORED
);
You can add to the 'ABCDEFG... string in the TRANSLATE function as appropriate - I have decimal point (.) and a space ( ) at the end - you may wish to have more characters there depending on your input!
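To make the behaviour of the TRANSLATE expression concrete, here is a minimal Python sketch of the same delete-listed-characters-then-cast logic. The names DROP_CHARS and translate_to_int are made up for this example, and None stands in for SQL NULL.

```python
import string
from typing import Optional

# Characters to delete: the same list as in the generated column above.
DROP_CHARS = str.maketrans('', '', string.ascii_uppercase + '. ')

def translate_to_int(num: Optional[str]) -> Optional[int]:
    """Delete the listed characters, then cast what is left to an integer."""
    if num is None:
        return None
    cleaned = num.translate(DROP_CHARS)
    # Same role as NULLIF(..., ''): nothing left means no number.
    return int(cleaned) if cleaned else None

print(translate_to_int('AB12.3'))
```

As with the SQL version, any character that is not in the list (a lowercase letter, for instance) survives the translation and makes the cast fail, which is why the list has to match the actual input.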
And checking:
INSERT INTO s VALUES ('2'), (''), (NULL), (' ');
INSERT INTO t VALUES ('2'), (''), (NULL), (' ');
SELECT * FROM s;
SELECT * FROM t;
Result (same for both):
num new_num
2 2
NULL
NULL
NULL
So, I wanted to check how efficient my solution was, so I ran the following test inserting 10,000 records into both tables s and t as follows (from here):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
INSERT INTO t
with symbols(characters) as
(
VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
)
select string_agg(substr(characters, (random() * length(characters) + 1) :: INTEGER, 1), '')
from symbols
join generate_series(1,10) as word(chr_idx) on 1 = 1 -- word length
join generate_series(1,10000) as words(idx) on 1 = 1 -- # of words
group by idx;
The differences weren't that huge but the regex solution was consistently slower by about 25% - even changing the order of the tables undergoing the INSERTs.
However, where the TRANSLATE solution really shines is when doing a "raw" SELECT as follows:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER
FROM s;
and the same for the REGEXP_REPLACE solution.
The differences were very marked, the TRANSLATE taking approx. 25% of the time of the other function. Finally, in the interests of fairness, I also did this for both tables:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
num, new_num
FROM t;
Both extremely quick and identical!

Remove substrings that vary in value in Oracle

I have a column in Oracle which can contain up to 5 separate values, each separated by a '|'. Any of the values can be present or missing. Here are some examples of how the data might look:
100-1
10-3|25-1|120/240
15-1|15-3|15-2|120/208
15-1|15-3|15-2|120/208|STA-2
112-123|120/208|STA-3
The values are arbitrary except for the order. The numerical values separated by dashes always come first. There can be 1 to 3 of these values present. The numerical values separated by a slash (if it is present) is next. The string, 'STA', and a numerical value separated by a dash is always last, if it is present.
What I would like to do is reformat this column to only ever include the first three possible values, those being the three numerical values separated by dashes. Afterwards, I want to replace 2nd numeric in each value (the numeric after the dash) using the following pattern:
1 = A
2 = B
3 = C
I would also like to remove the dash afterwards, but not the '|' that separates the values unless there is a trailing '|'.
To give you an idea, here's how the values at the beginning of the post would look after the reformatting:
100A
10C|25A
15A|15C|15B
15A|15C|15B
112ABC
I'm thinking this can be done with regular expressions, but it's got me a little confused. Does anyone have a solution?
I would solve this problem in the following way.
SELECT
REGEXP_REPLACE(column,'\|\d+\/\d+(\|STA-\d+)?',''),
REGEXP_REPLACE(column,'(\d+)-(1)([^\d])','\1A\3'),
REGEXP_REPLACE(column,'(\d+)-(2)([^\d])','\1B\3'),
REGEXP_REPLACE(column,'(\d+)-(3)([^\d])','\1C\3'),
REGEXP_REPLACE(column,'(\d+)-(123)([^\d])','\1ABC')
FROM table;
Explanation: Let us break down each REGEXP_REPLACE call one by one.
REGEXP_REPLACE(column,'\|\d+\/\d+(\|STA-\d+)?','')
This replaces the trailing part, such as 120/208|STA-2, with an empty string so that further processing is easier.
Finding the match was easy, but replacing 1 with A, 2 with B and 3 with C in a single call was not possible (as far as I know), so I did those matches and replacements separately.
In each regex from the second call on, (\d+)-(yourNumber)([^\d]), the first group is the number before the -, then yourNumber is either 1, 2, 3 or 123, followed by a non-digit such as |.
So the replacement depends on yourNumber.
All demos here from version 1 to 5.
Note: I have only done the replacement for the combinations of yourNumber present in the question. You can do likewise for other combinations.
You can do this in one statement, although you could also write a simple function to do it.
SELECT str, REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?','') cut
, REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4') rep3toC
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4') rep2toB
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4'), '(\-)([B,C]*)(1)([B,C]*)', '\1\2A\4') rep1toA
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4'), '(\-)([B,C]*)(1)([B,C]*)', '\1\2A\4'), '-', '') "rep-"
FROM (
SELECT '100-1' str FROM dual UNION
SELECT '10-3|25-1|120/240' str FROM dual UNION
SELECT '15-1|15-3|15-2|120/208' str FROM dual UNION
SELECT '15-1|15-3|15-2|120/208|STA-2' str FROM dual UNION
SELECT '112-123|120/208|STA-3' FROM dual
) tab
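As a reference for what the nested REGEXP_REPLACE calls are meant to produce, here is a short Python sketch of the same transformation: keep only the number-dash-number values and map the digits after the dash to letters. The names DIGIT_TO_LETTER and reformat are hypothetical, and digits other than 1, 2 and 3 after the dash are left unchanged because the question does not say what to do with them.

```python
import re

# 1 -> A, 2 -> B, 3 -> C, applied only to the part after the dash.
DIGIT_TO_LETTER = str.maketrans('123', 'ABC')

def reformat(value: str) -> str:
    """Keep the 'number-number' parts and rewrite them as e.g. 15-3 -> 15C."""
    kept = []
    for part in value.split('|'):
        match = re.fullmatch(r'(\d+)-(\d+)', part)
        if match:  # skips parts such as '120/208' and 'STA-2'
            kept.append(match.group(1) + match.group(2).translate(DIGIT_TO_LETTER))
    # Joining only the kept parts means no trailing '|' can be left over.
    return '|'.join(kept)

for sample in ['100-1', '10-3|25-1|120/240', '15-1|15-3|15-2|120/208|STA-2', '112-123|120/208|STA-3']:
    print(sample, '->', reformat(sample))
```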