Postgresql regexp_replace negative lookahead not working - regex

I am trying to replace street with st only if street isn't followed by a letter. Replacement is allowed if street is followed EITHER by a non-alphabetic character OR by the end of the string.
I am trying to achieve this with the regexp_replace function in PostgreSQL 9.5. The sample query I wrote:
select regexp_replace('super streetcom','street(?!=[a-z])','st');
Here street shouldn't have been replaced by st, since street is followed by 'c'. So the expected output is 'super streetcom', but the output I am getting is 'super stcom'.
Why am I getting the unexpected output, and what is the right way to achieve the intended result?

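A negative lookahead looks like (?!...): everything after ?! is the lookahead pattern, and if the engine can match it at the current position, the overall match fails there. In (?!=[a-z]) the = is therefore part of the pattern, so the lookahead only rejects street when it is followed by a literal = and then a letter, which is why streetcom was still replaced.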
It seems you need to match a whole word street. Use \y, a word boundary:
select regexp_replace('super streetcom street','\ystreet\y','st');
From the docs:
\y matches only at the beginning or end of a word
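The whole-word idea can be tried outside the database. This is a minimal Python sketch, not PostgreSQL: it assumes Python's \b is a close enough stand-in for \y on plain ASCII input, and abbreviate_street is a hypothetical helper name used only for illustration.

```python
import re


def abbreviate_street(text):
    # \b is Python's word boundary; it stands in for PostgreSQL's \y here (assumption).
    return re.sub(r"\bstreet\b", "st", text)


print(abbreviate_street("super streetcom street"))
```

Note that a word boundary treats digits and underscores as word characters, so street1 is left alone, which is stricter than "not followed by a letter".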

This looks like a syntax issue. Try: ?! instead of ?!= .
e.g.
select regexp_replace('super street','street(?![a-z])','st');
will return
super st
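The difference between the two lookaheads can be checked quickly in Python, whose re module treats (?!...) the same way for these patterns. This is only an illustrative sketch; the variable names are made up for the example.

```python
import re

text = "super streetcom"

# Original pattern: the lookahead rejects only "=" followed by a letter,
# so "street" before "com" is still replaced.
with_equals = re.sub(r"street(?!=[a-z])", "st", text)

# Corrected pattern: the lookahead rejects any lowercase letter after "street".
without_equals = re.sub(r"street(?![a-z])", "st", text)

print(with_equals)
print(without_equals)
print(re.sub(r"street(?![a-z])", "st", "super street"))
```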

Related

Need regex help for matching names

Let's say I have these three names
John Doe (p45643)
Le'anne Frank
Molly-Mae Edwards
I want to match
1) John Doe
2) Le'anne Frank
3) Molly-Mae Edwards
The regex I have tried is
(^[a-zA-Z-'^\d]$)+
but it isn't working as I am expecting.
I would like help creating a regex pattern that:
Matches a name from start to finish, and cannot contain a number. The only permitted values each "name" can contain is, [a-zA-Z'-], so if a name was
J0hn then it shouldn't match
If I understood your question correctly, you have a couple of minor errors in your regex:
(^[a-zA-Z-'^\d]$)+
^-------^------Here
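The - marked above should be escaped or moved to the end of the character class, since inside a class it acts as the range operator. The + after the group repeats the whole anchored group rather than the character class, so only a single character can match between ^ and $.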
You can use this regex instead (following your previous pattern):
(^[a-zA-Z'^\d -]+$)
Update, following your comment: if you want to match each part of the name separately, you can use:
(\b[a-zA-Z'^\d-]+\b)
And if you only want to match letters (no digits), you can use:
(\b[a-zA-Z'-]+\b)
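You are using the anchors incorrectly: ^ and $ sit inside the repeated group, so the character class can match only one character. Depending on the multiline modifier, the anchors match at the boundaries of the whole string or of each line.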
Try
/^[a-zA-Z'-]+$/
Thanks to @Djory Krache.
The query I was looking for was
(\b[a-zA-Z'-]+\b)
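To see how the accepted pattern behaves on the three sample names, here is a small Python sketch. The helper names name_parts and is_valid_name_part are hypothetical, and the second helper is an assumed way to apply the anchored variant to a single word.

```python
import re

# Accepted pattern, without the outer capture group so findall returns whole matches.
NAME_PART = re.compile(r"\b[a-zA-Z'-]+\b")


def name_parts(text):
    """Return each run of letters, apostrophes and hyphens bounded by word boundaries."""
    return NAME_PART.findall(text)


def is_valid_name_part(word):
    """True if the whole word consists only of letters, apostrophes and hyphens."""
    return re.fullmatch(r"[a-zA-Z'-]+", word) is not None


for sample in ["John Doe (p45643)", "Le'anne Frank", "Molly-Mae Edwards", "J0hn"]:
    print(sample, "->", name_parts(sample))
```

The id p45643 and the name J0hn yield no parts, because a letter directly followed by a digit is not at a word boundary.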

Regex to MATCH number string (with optional text) in a sentence

I am trying to write a regex that matches only strings like this:
89-72
10-123
109-12
122-311(a)
22-311(a)(1)(d)(4)
These strings are embedded in sentences and sometimes there are 2 potential matches in the sentence like this:
In section 10-123 which references section 122-311(a) there is a phone number 456-234-2222
I do not want to match the phone. Here is my current working regex
\d{2,3}\-\d{2,3}(\([a-zA-Z0-9]\))*
I've been looking on Stack and have not found anything yet. Any help would be appreciated. Will be using this in a google sheet and potentially postgres.
Based on the regex suggested by @Wiktor Stribiżew:
=REGEXEXTRACT(A1,REPT("\b(\d{2,3}-\d{2,3}\b(?:\([A-Za-z0-9]\))*)(?:[^-]|$)(?:.*)",LEN(REGEXREPLACE(REGEXREPLACE(A1,"\b(\d{2,3}-\d{2,3}\b(?:\([A-Za-z0-9]\))*)(?:[^-]|$)", char (9)),"[^"&char(9)&"]",""))))
The formula will return all matches.
String:
A
In 22-311(a)(1)(d)(4) section 10-123 which ... 122-311(a) ... number 456-234-2222
Output:
B C D
22-311(a)(1)(d)(4) 10-123 122-311(a)
Solution
To extract all matches from a string, use this pattern:
=REGEXEXTRACT(A1,
REPT(basic_regex & "(?:.*)",
LEN(REGEXREPLACE(REGEXREPLACE(A1,basic_regex, char (9)),"[^"&char(9)&"]",""))))
The tail of the formula:
LEN(REGEXREPLACE(REGEXREPLACE(A1,basic_regex, char (9)),"[^"&char(9)&"]","")))
only counts how many times the pattern occurs in the string (3 in this example); that count is the number of repetitions passed to REPT.
To avoid matching the phone number, you have to state that the match must be neither preceded nor followed by \d or -. Google Sheets uses RE2, which does not support lookaround assertions (see its list of supported features), so as far as I can tell the only solution is to match one extra character before and after the reference, or a string boundary:
(?:^|[^-\d])\d{2,3}\-\d{2,3}(\([a-zA-Z0-9]\))*(?:$|[^-\d])
(?:^|[^-\d]) means either the start of a line (^) or a character that is not - or \d (you might want to change that and forbid all letters as well). $ is the end of a line. Note that ^ and $ match at line boundaries only when the multiline (m) flag is set; without it they match only at the start and end of the whole string.
This finds the correct strings, but with the extra surrounding characters (such as spaces) included in some of the matches.
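One way to drop the extra surrounding characters is to wrap the reference itself in a capture group and read only that group. The sketch below uses Python's re instead of RE2 (an assumption made so it can be run anywhere), and SECTION_REF and section_refs are hypothetical names. It deliberately avoids lookarounds so that it stays close to what RE2 accepts.

```python
import re

SECTION_REF = re.compile(
    r"(?:^|[^-\d])"                           # string start, or a char that is not '-' or a digit
    r"(\d{2,3}-\d{2,3}(?:\([a-zA-Z0-9]\))*)"  # the section reference itself (captured)
    r"(?:$|[^-\d])"                           # string end, or a char that is not '-' or a digit
)


def section_refs(sentence):
    """Return the captured references only, without the boundary characters."""
    return SECTION_REF.findall(sentence)


sentence = ("In section 10-123 which references section 122-311(a) "
            "there is a phone number 456-234-2222")
print(section_refs(sentence))
```

Because the trailing boundary character is consumed, two references separated by a single character would not both be found; that is a limitation of this lookaround-free approach.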

regex to select only the zipcode

,Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,
I want to select only 13732 from the line. I came up with this regex:
(\d)(\s*\d+)*(\,y,,)
But it also selects the ,y,, part. If I remove that part from the regex, the regex also matches the date. Please help me with this.
Generally, if you want to require some text next to the match without including it in the match, use a zero-length lookaround (lookahead or lookbehind). In your case, you can use a lookahead:
(\d)(\s*\d+)*(?=\,y,,)
The syntax (?=<stuff>) means "followed by <stuff>, without matching it".
More information on lookarounds can be found in this tutorial.
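As a quick check of the lookahead version, here is a minimal Python sketch that runs the same pattern on the sample line. The comma is written unescaped, which is equivalent, and the variable names are illustrative only.

```python
import re

line = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"

# (?=,y,,) requires ",y,," right after the digits but leaves it out of the match.
match = re.search(r"(\d)(\s*\d+)*(?=,y,,)", line)
zipcode = match.group(0) if match else None

print(zipcode)
```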
Regex: \D*(\d{5})\D*
Explanation: match 5 digits surrounded by zero or more non-digits on both sides. Then you can extract group containing the match.
Here's the code in Python 3:
import re
line = ",Ray Balwierczak,4/11/2017,,895 Forest Hill Rd,Apalachin,NY,13732,y,,"
# Raw string, so \D and \d reach the regex engine unchanged
search = re.search(r"\D*(\d{5})\D*", line)
print(search.group(1))
Output:
13732

How can I match the last two words in a sentence in PostgreSQL?

I have been trying for a while to match the last word of a sentence:
select regexp_matches('My name is Harry Potter', '[^ ]+$');
returned {Potter}
to try to match the last two words:
select regexp_matches('My name is Harry Potter', '[^ ]\s+[^ ]+$');
failed.
select regexp_matches('My name is Harry Potter', '(.*?)\s+(.*?)$');
Did not work as intended either.
Any insights?
Instead of using REGEXP_MATCHES which returns an array of matches, you may be better off using SUBSTRING which will give you the match as TEXT directly.
Using the correct pattern, as @Abelisto shared, you can do this:
SELECT SUBSTRING('My name is Harry Potter' FROM '\w+\W+\w+$')
This returns Harry Potter as opposed to {"Harry Potter"}.
Per @Hambone's comment, if either of the words at the end contains punctuation, like an apostrophe, you would want to consider using the following pattern:
SELECT SUBSTRING('My name is Danny O''neal' FROM '\S+\s+\S+$')
The above would correctly return Danny O'neal, whereas the \w-based pattern would return just O'neal.
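The difference between the two patterns is easy to reproduce in Python, where \w, \W, \s and \S have the same meaning for this input. This is a sketch only; last_two_words is a hypothetical helper, not a PostgreSQL function.

```python
import re


def last_two_words(sentence, pattern=r"\S+\s+\S+$"):
    """Return the text matched at the end of the sentence, or None if nothing matches."""
    match = re.search(pattern, sentence)
    return match.group(0) if match else None


print(last_two_words("My name is Harry Potter"))
print(last_two_words("My name is Danny O'neal"))
# With \w and \W the apostrophe counts as the separator between the two "words".
print(last_two_words("My name is Danny O'neal", r"\w+\W+\w+$"))
```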
You should use double escaping in the pattern since it seems the standard_conforming_strings parameter of your PostgreSQL instance is turned off. See PostgreSQL 9.5.3 Documentation:
standard_conforming_strings (boolean)
This controls whether ordinary string literals ('...') treat backslashes literally, as specified in the SQL standard. Beginning in PostgreSQL 9.1, the default is on (prior releases defaulted to off).
Thus, you need to use
'[^ ]+\\s+[^ ]+$'
^^
or
'\\S+\\s+\\S+$'
Here,
[^ ]+ - 1 or more characters other than a space (any non-whitespace if \\S is used)
\\s+ - 1 or more whitespaces
[^ ]+ - 1 or more characters other than a space (any non-whitespace if \\S is used)
$ - end of string anchor.
I don't know how regex works in Postgres, but online regex testers tell me that .*\s(.+)\s+(.*?)$ might do the trick.
I'm not 100% clear on what you're trying to do, but this regex matches the last two words of a sentence, and it's similar to your initial regex: "[^ ]+\s+[^ ]+$" (I just added a '+'.)
For further testing, I suggest going to https://regex101.com/. It's one of the best online regex helpers I've found, and it even breaks down the regex for you. (I'm not involved with the site in any way; it's a recommendation, not a plug.)

Regex to extract date with negative lookahead

I am using this pattern to extract confirmation dates from a text file and converting them to a date object (see my post here Extract/convert date from string in MS Access).
The current pattern matches all strings that look like a date, but may not be the confirmation date (which is always preceded by Confirmed by), and moreover, may not have complete date information (e.g. no AM or PM).
Pattern: (\d+/\d+/\d+\s+\d+:\d+:\d+\s+\w+|\d+-\w+-\d+\s+\d+:\d+:\d+)
Sample text:
WHEN COMPARED WITH RESULT OF 7/13/12 09:06:42 NO SIGNIFICANT
CHANGE; Confirmed by SMITH, MD, JOHN (2242) on 7/14/2012 3:46:21 PM;
The above pattern matches the following:
WHEN COMPARED WITH RESULT OF 7/13/12 09:06:42 NO SIGNIFICANT
^^^^^^^^^^^^^^^^^^^^
CHANGE; Confirmed by SMITH, MD, JOHN (2242) on 7/14/2012 3:46:21 PM;
^^^^^^^^^^^^^^^^^^^^
I want the pattern to look for the date in the segment of the text file that begins with Confirmed by and ends with a semi-colon. Also, in order to properly convert the time, the pattern should match only AM or PM at the end. How can I restrict the pattern to this segment and add the additional AM or PM criteria?
Can anyone help?
In order to match the end of the string, use $ at the end of your regex. To match the entire phrase "Confirmed by <someone> on <date>", use plain text (remember that plain text can be used in a regex as well: if you aren't using special characters, the matcher will match your query verbatim). You need to use a negative look-ahead to exclude entire words. So maybe something like this:
Confirmed by (?!\ on\ )(\d+/\d+/\d+\s+\d+:\d+:\d+\s+\w+|\d+-\w+-\d+\s+\d+:\d+:\d+)$
Which will allow you to match a string that starts with "Confirmed by", followed by anything except for " on ", followed by the date that you capture, and the end of the string.
Edit: the negative look-ahead part is tricky, look at the answer below for more reference:
A regular expression to exclude a word/string
I don't see any need for a lookahead here, positive or negative. This works correctly on your sample string:
Confirmed by [^;]*(\d+/\d+/\d+\s+\d+:\d+:\d+(?:\s+(?:AM|PM))?|\d+-\w+-\d+\s+\d+:\d+:\d+);
The [^;]* effectively corrals the match between a Confirmed by sequence and its closing semicolon. (I'm assuming the semicolon will always be present.)
(?:\s+(?:AM|PM))? makes the AM/PM optional, along with its leading whitespace.
The actual date will be stored in capturing group #1.
Try this:
(\d+/\d+/\d+\s+\d+:\d+:\d+\s+(?:AM|PM));
The simplest answer is, more often than not, a good enough solution. Turning off the default greedy behavior (with the question mark, as in .*?) makes the regular expression look for the shortest text that satisfies the pattern. Because matches run left to right and do not overlap, each Confirmed by is paired with exactly one date, which in this case is the next AM/PM-stamped date followed by a semicolon.
Confirmed by.*?(\d+/\d+/\d+\s+\d+:\d+:\d+\s+(?:AM|PM));
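The lazy pattern can be tried on the sample text with a short Python sketch. Python stands in for the MS Access/VBA environment mentioned in the question, and confirmation_date and the strptime format string are assumptions added for illustration, not part of the original answers.

```python
import re
from datetime import datetime

CONFIRMED = re.compile(
    r"Confirmed by.*?(\d+/\d+/\d+\s+\d+:\d+:\d+\s+(?:AM|PM));"
)

sample = (
    "WHEN COMPARED WITH RESULT OF 7/13/12 09:06:42 NO SIGNIFICANT\n"
    "CHANGE; Confirmed by SMITH, MD, JOHN (2242) on 7/14/2012 3:46:21 PM;"
)


def confirmation_date(report):
    """Return the confirmation timestamp as a datetime, or None if none is found."""
    match = CONFIRMED.search(report)
    if match is None:
        return None
    # Assumed format: month/day/4-digit year with a 12-hour clock and AM/PM.
    return datetime.strptime(match.group(1), "%m/%d/%Y %I:%M:%S %p")


print(confirmation_date(sample))
```

The earlier date on the first line is skipped because it is not preceded by Confirmed by and has no AM/PM suffix.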