This requirement arose while answering this question here, which has only been partially answered, at least from my point of view.
For all tables and data SQL + test harness, see the fiddle here.
I have a table and data as follows:
CREATE TABLE payment (id INT, amount TEXT NOT NULL);
and
INSERT INTO payment VALUES -- Desired result
(1, 'KES 0.80__asdfa .80..98..00sadf '), -- 0.80 or .8 or .80
(2, '00 a 0..0...00asafd 0000013..0.85...0000'), -- 0.85 or .85
(3, '0sf00..0...0 00.000 X013....12.851...0000'), -- 12.85
(4, '000..0...007600.00 0013..12.....0000'), -- 7600
(5, '0afs0 sdff 00...00000.56.....000343..343.0'), -- 0.56 or .56
(6, 'fs0 sdff 00...0.000710x00.56..asfd0003..3.0'); -- 0.56 or .56, NOT 0.00071 or 710
OK, so I want to pull out valid numbers of the form xxxx.yy where the number of x's is arbitrary and the number of y's can be at most 2 (but being able to specify #x would be the icing on the cake).
Explanation of requirement:
1 - 0.80 - should be self-evident - it's the first valid number with 2 digits after the decimal.
2 - 0.85 and not 13 because 13.. isn't a valid number - if it was 13.00 (or two other digits), then OK, but not 13.0. for example.
3 - 12.85 and not 12.851 because of the 2 y's rule.
4 - 7600 fairly obvious. However 7,600 would not be valid - no separators.
5 - 0.56 - fairly obvious.
6 - Now, this one will separate the wheat from the chaff... it has to be 0.56 and not 0.00071 (2 y's), but it also can't be 710 because 710x is not valid. A valid number is a run of digits, a decimal point, and then two or more digits, of which only the first two are kept (the rest are truncated).
The truncation rule is not absolute - I can use PostgreSQL to round!
Just to help you on your way :-), I've included the fiddle from my answer - my regex is:
SELECT
REGEXP_REPLACE(amount, '[^0-9\.]+|\. +|\.0|\.{2,}', '', 'g')
FROM payment;
The sample data for this question is at the bottom and also on a 2nd fiddle here. The problem with my regex, I feel, is that it's too "tailored" - with too many pipes (|) for specifics. I'd like something more generic - hence this question.
A few points to note:
I've left a "test harness" in the fiddle so that solutions can be checked. I've done this because in a previous regex question, I sent poor Wiktor Stribiżew (very helpful guy!) all round the houses before we realised that there are "quirks" with PostgreSQL's implementation of regexes that rendered an answer impossible, notably that: Lookahead and lookbehind constraints cannot contain back references (see Section 9.7.3.3), and all parentheses within them are considered non-capturing.
I don't think that applies in this case, but what I don't want is to be going from one regex site to the other with working code and then finding that it doesn't work on PostgreSQL. To save wasting anybody's time, I set this up so that you can just put your code into the pattern area specified and run a PostgreSQL fiddle.
A note about PostgreSQL's REGEXP_REPLACE() function:
REGEXP_REPLACE(source, pattern, replacement [, flags ]).
The replacement I've used is '' - i.e. empty string and I've used the 'g' flag (global). If that flag isn't present, then it only does the first occurrence. This might be of use to anybody considering doing two passes or some other solution - can't hurt to know it.
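To make the effect of the global flag concrete, here is a small JavaScript analogue (not PostgreSQL, whose regex engine differs in places). The pattern is a simplified, hypothetical one that strips runs of characters that are neither digits nor dots, and the function names are made up for the illustration.

```javascript
// JavaScript analogue only; PostgreSQL's REGEXP_REPLACE is not being called here.
// Hypothetical simplified pattern: one or more characters that are not digits or dots.
function stripFirst(amount) {
  // Without the g flag, only the first matching run is replaced.
  return amount.replace(/[^0-9.]+/, "");
}

function stripAll(amount) {
  // With the g flag, every matching run is replaced.
  return amount.replace(/[^0-9.]+/g, "");
}

const sample = "KES 0.80__asdfa ..";
console.log(stripFirst(sample)); // "0.80__asdfa .."
console.log(stripAll(sample));   // "0.80.."
```

A two-pass solution would chain two such calls, each with its own pattern.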
I'm actually interested in learning how to use regular expressions. Check out the last snippet in this fiddle to see what contortions I got up to before I started to go down the regex route - 30 lines down to 3 with regexes. So, please, if you could comment your regex and explain step by step what it does, that would be great, thanks.
And finally, any small improvements to, or comments on, my own original fiddle's regex would be appreciated, if you have time! Anything else you think helpful would be great!
I've endeavoured to be thorough with this question and I hope that the issue is crystal clear! However, it's tricky enough stuff, so should you require any further input on my part (esp. on the PostgreSQL side), I'd be delighted to provide any necessary clarification!
===========================================
Sample (simpler) data for working regex (fiddle here):
INSERT INTO payment VALUES
(1, 'KES 0.80'),
(2, 'KES 0.80'),
(3, 'KES .80'),
(4, 'xyzKES 0.80'),
(5, 'KES . 0.80'),
(6, 'KES 0.80afasf'),
(7, 'KES 0.80__asdfa ..'),
(8, 'KES 0.80..asdfasdf'),
(9, 'KES 0.8.0..asdfasdf'),
(10, 'KES 0.8.0 ..asdfasdf...');
This isn't a purely regex based solution, but it does produce the desired output:
Query #1
select distinct on (id)                                -- 5
    id
  , amount
  , mat
  , cast(mat as numeric (10, 2)) mat_num               -- 6
from payment
join lateral (
    select
        array_to_string(_mat, '') mat                  -- 2
      , rn
    from regexp_matches(amount, '\d+\.\d{1,2}', 'g')   -- 1
         with ordinality as x(_mat, rn)                -- 3
) x on true
where mat !~ '^0+\.0+$'                                -- 4
order by id, rn;                                       -- 5
id | amount | mat | mat_num
---|--------|-----|--------
1 | KES 0.80__asdfa .80..98..00sadf | 0.80 | 0.80
2 | 00 a 0..0...00asafd 0000013..0.85...0000 | 0.85 | 0.85
3 | 0sf00..0...0 00.000 X013....12.851...0000 | 12.85 | 12.85
4 | 000..0...007600.00 0013..12.....0000 | 007600.00 | 7600.00
5 | 0afs0 sdff 00...00000.56.....000343..343.0 | 00000.56 | 0.56
6 | fs0 sdff 00...0.000710x00.56..asfd0003..3.0 | 00.56 | 0.56
What is happening here:
1. Use the function regexp_matches with the flag 'g' to extract all substrings that match <digits>.<digit>{1,2}. Note that this will not match a substring that has no decimal digits. I'm not sure if that is important for you, but all the matches in your example seem to have decimal points.
2. regexp_matches returns one row per match, where each row is an array of the matched groups. Since the pattern has no capturing groups, each array holds a single element: the whole match. The array has to be converted back to a string for further processing.
3. It is also important to assign a sequence number to each match, because we want to keep the first valid match and the ordering would otherwise be undefined. Steps 1, 2 and 3 are all done inside the lateral joined subquery.
4. Since our regex pattern will also match 00.00 and the like, it is important to get rid of these via the where clause.
5. To pick the first valid match, distinct on is used with the rows ordered by the match sequence (rn).
6. Casting the matched string to numeric gets rid of any leading or trailing 0s.
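For anyone who wants to try the same logic outside the database, the steps above can be sketched in JavaScript. This is only an approximation of the query, not a replacement for it: the helper name firstValidAmount is hypothetical, and JavaScript's regex engine is not PostgreSQL's, although both scan left to right for non-overlapping matches with this pattern.

```javascript
// Hypothetical helper mirroring the SQL steps in plain JavaScript.
function firstValidAmount(text) {
  // Step 1: every non-overlapping match of digits, a dot, then one or two digits.
  const matches = text.match(/\d+\.\d{1,2}/g) || [];
  // Step 4: drop all-zero candidates such as "00.00" or "0.00".
  const valid = matches.filter((m) => !/^0+\.0+$/.test(m));
  // Step 5: keep the first remaining match, if any.
  // Step 6: converting to a number discards the padding zeros.
  return valid.length > 0 ? Number(valid[0]) : null;
}

console.log(firstValidAmount("0sf00..0...0 00.000 X013....12.851...0000")); // 12.85
console.log(firstValidAmount("fs0 sdff 00...0.000710x00.56..asfd0003..3.0")); // 0.56
```

Returning null when nothing matches is a choice made for this sketch; the SQL version simply produces no row for that id.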
I want to extract the letters with numbers on the left.
For example, 3F 4G XY should output F and G.
I tried to do this using this regular expression (?=\d)[A-Z], but failed.
Look at the following code:
s = "1A 2B 3C IJK X1 Y2 Z3"
s.match(/(?=\d)[A-Z]/g) // null
s.match(/[A-Z](?=\d)/g) // [ 'X', 'Y', 'Z' ]
I was surprised that the first pattern found nothing while the second one matched correctly.
To my mind, the two expressions are the same, just with left and right swapped.
Did I get it wrong?
And how do I get the output ['A', 'B', 'C']?
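A look-ahead is zero-width: (?=\d)[A-Z] asks for the next character to be a digit and then requires that same character to be a letter, which can never match. The text is processed from left to right, so by the time the engine reaches the letter, the digit is already behind it. You need a look-behind:
(?<=\d)[A-Z]
Here the regex engine steps back into the already processed text to look for the digit, and then matches the letter.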
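Applied to the string from the question, a quick check could look like this. The second variant, using a capture group instead of a look-behind, is an alternative I'm adding for engines without look-behind support; the variable names are only illustrative.

```javascript
const s = "1A 2B 3C IJK X1 Y2 Z3";

// Look-behind: match a capital letter only if a digit sits immediately before it.
const viaLookbehind = s.match(/(?<=\d)[A-Z]/g);

// Alternative without look-behind: consume the digit and capture the letter.
const viaGroup = Array.from(s.matchAll(/\d([A-Z])/g), (m) => m[1]);

console.log(viaLookbehind); // [ 'A', 'B', 'C' ]
console.log(viaGroup);      // [ 'A', 'B', 'C' ]
```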
Suppose I have the following data in the "people" table
id | name | phone_number
---|------|-------------
1 | Pete | +651234-5678
2 | John | 1234 56 78 Main number
If I had a search string of "123456789" how would I extract the rows that match the phone number - I assume some regex is required but what is the best approach to do this in Postgres?
Looks like this will work
x="SELECT id FROM people WHERE regexp_replace(phone_number, '[^0-9]', '', 'g') = '12345678';"
or perhaps
x="SELECT id FROM people WHERE regexp_replace(phone_number, '[^0-9]', '', 'g') LIKE '12345678';"
Is this the best way?
Your original search string would not be found of course but I think mine is the complete answer - only votes will tell??
select *, regexp_replace(phone_number, '[^0-9]', '', 'g')
from people
where position ( '12345678' in regexp_replace(phone_number, '[^0-9]', '', 'g')) > 0 ;
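The strip-then-search idea can be tried out away from the database with a few lines of JavaScript. This is a rough sketch under my own assumptions: the people array just copies the two sample rows, and findByDigits is a hypothetical helper name.

```javascript
// The two sample rows from the question, as plain objects.
const people = [
  { id: 1, name: "Pete", phone_number: "+651234-5678" },
  { id: 2, name: "John", phone_number: "1234 56 78 Main number" },
];

// Hypothetical helper: strip every non-digit, then look for the search digits
// anywhere inside what is left (the same idea as position(...) > 0 in SQL).
function findByDigits(rows, search) {
  const wanted = search.replace(/[^0-9]/g, "");
  return rows
    .filter((row) => row.phone_number.replace(/[^0-9]/g, "").includes(wanted))
    .map((row) => row.id);
}

console.log(findByDigits(people, "12345678"));  // [ 1, 2 ]
console.log(findByDigits(people, "123456789")); // []
```

Note that an empty search string would match every row in this sketch, so a real query would want to guard against that.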
You could use the following query:
SELECT * FROM people WHERE phone_number ~ '^1234\s*56\s*78.*'
The pattern \s* will match zero or more spaces. The regex will match any of the following:
1234 56 78 Main number
123456 78 some text
12345678
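The behaviour of that pattern can be checked quickly in JavaScript, where \s and ^ behave the same way for these inputs (this is an analogue, not PostgreSQL itself). The last sample is the first row from the question, added here to show that the ^ anchor excludes numbers with a prefix.

```javascript
// Same pattern as in the query, written as a JavaScript regex literal.
const pattern = /^1234\s*56\s*78.*/;

const samples = [
  "1234 56 78 Main number",
  "123456 78 some text",
  "12345678",
  "+651234-5678", // has a prefix and a dash, so the anchored pattern rejects it
];

const results = samples.map((text) => pattern.test(text));
console.log(results); // [ true, true, true, false ]
```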
I'm merging 2 fields into one value. I'm also replacing dots with commas with the below query.
REPLACE(CAST(testmin as varchar)+'-'+ CAST(testmax as varchar), '.', ',') AS Testvalue
This works like a charm as long as testmin isn't zero. But when it is zero, zero is all I get.
As an example, I have:
testmin 0,00
testmax 100
With the above query, the result is just 0.
If I change testmin to 1 then the query returns the correct value 1-100.
Any ideas on why it is like this?
Have you tried using CONCAT instead of '+' ?
I have values in a column like "07960/WR", "27163/WR", etc. I need to select the number from each of them. So I wrote this SQL:
select CAST (regexp_replace(object_index, '\D', '', 'g') as integer) as number from ...
That works, but it breaks when somebody enters an extra number and a slash in front, for example "99/27163/WR".
How can I use regexp_replace on ONLY the last 5 digits in the value?
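I don't know PostgreSQL, but with a little help from RegexBuddy, I pieced something together that hopefully works:
select CAST (REGEXP_REPLACE(object_index, '(?p)^.*(\d{5})\D*$', '\1', 'g') as integer) as number from ...
The pattern is written as an ordinary quoted string because a $ anchor directly before a closing $$ would end a dollar-quoted string one character early; this assumes standard_conforming_strings is on, so the backslashes are taken literally.
The idea of this regex is to match and capture the last five digits \d{5} in the string (i.e. those that are followed only by non-digits: \D*$) and remove everything around them.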
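The same capture-and-replace idea can be tested in JavaScript before trying it in the database. This is only a sketch: lastFiveDigits is a hypothetical helper, the (?p) option is PostgreSQL-specific and left out, and JavaScript uses $1 rather than \1 in the replacement.

```javascript
// Hypothetical helper: keep only the last run of five digits, then parse it.
function lastFiveDigits(value) {
  // ^.* is greedy, so the capture lands on the last five digits that are
  // followed only by non-digits up to the end of the string.
  const digits = value.replace(/^.*(\d{5})\D*$/, "$1");
  return parseInt(digits, 10);
}

console.log(lastFiveDigits("99/27163/WR")); // 27163
console.log(lastFiveDigits("07960/WR"));    // 7960
```

If the value contains no run of five digits, the replace leaves the string unchanged, so a real query would need to decide how to handle such rows.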