Regex in Postgres: replacing part of found pattern - regex

I'm looking to do some simple partial redaction for addresses. Basically I'd like to replace the street name before the street suffix with ### while keeping the street suffix.
Examples:
Cherry Street -> ### Street
America Lane -> ### Lane
The full list of suffixes will be known at some point soon, and I could end up replacing the entirety of the address (getting rid of Street and Lane as well as the street name) with something like
regexp_replace(col1, '(\w) (Street|Lane)', '###', 'g')
but I can't figure out how to just replace just the word before the street suffix.

What you need is a positive lookahead, (?= ).
Try the regex
(\w)* (?=Street|Lane)
See how the regex works: http://regex101.com/r/pS9oV3/2
Explanation
(\w)* matches zero or more word characters, followed by a space
(?=Street|Lane) asserts that Street or Lane follows at this position without consuming it; the match succeeds only if it does
Because the match includes the space before the suffix, the replacement needs a trailing space:
regexp_replace(col1, '(\w)* (?=Street|Lane)', '### ', 'g')
would produce
### Street
### Lane

I'm not familiar with PostgreSQL's exact syntax, but you need to reference the capture group you want to keep. Something like:
regexp_replace(col1, '(\w+) (Street|Lane)', '### $2', 'g')
or
regexp_replace(col1, '(\w+) (Street|Lane)', '### \2', 'g')
Where $2 or \2 references the second capture group, so whatever was in the 2nd pair of parentheses will be placed there. PostgreSQL's regexp_replace uses the \2 form. The \w+ is needed so that the whole word before the suffix is consumed, not just its last character.
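For comparison, here is a minimal sketch of the same backreference idea in Python, where the replacement string also uses \2. It only illustrates the technique outside PostgreSQL; the helper name redact_street is made up for the example.

```python
import re

# Hypothetical helper: keep the suffix (group 2) and replace the word before it.
def redact_street(address: str) -> str:
    # \2 in the replacement re-inserts whatever (Street|Lane) matched.
    return re.sub(r"(\w+) (Street|Lane)", r"### \2", address)

print(redact_street("Cherry Street"))
print(redact_street("America Lane"))
```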

I wouldn't expect all street names to be that simple. Therefore:
SELECT col1
, regexp_replace(col1, '.+(?=\m(Street|Lane)\M)', '### ', 'i')
FROM (
VALUES
('Cherry Street'::text)
, ('America Lane')
, ('Weird name Lane')
, ('Lanex Lane') -- 'Lane' included in street name
, ('Buster-Keaton-Lane')
, ('White Space street') -- with tab
) t(col1);
SQL Fiddle.
Explain
.+ ... any string of one or more characters (including white space)
(?=\m(Street|Lane)\M) ... that is followed by 'Street' or 'Lane'
- (?=) ... positive look-ahead
- \m\M ... begin and end of word
The additional parameter 'i' switches to case-insensitive matching. No point in adding 'g' (replace "globally"), since only a single replacement should happen.
Details in the manual.
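The behaviour of the greedy .+ plus lookahead can be tried outside the database as well. The following is a rough Python sketch, not the PostgreSQL query itself: Python has no \m and \M, so \b is used as a stand-in, and the function name redact is an assumption for the example.

```python
import re

# \b stands in for PostgreSQL's \m and \M word-boundary escapes.
PATTERN = re.compile(r".+(?=\b(?:Street|Lane)\b)", re.IGNORECASE)

def redact(address: str) -> str:
    # count=1 mirrors the absence of the 'g' flag: a single replacement.
    return PATTERN.sub("### ", address, count=1)

for address in ["Cherry Street", "Weird name Lane", "Lanex Lane",
                "Buster-Keaton-Lane", "White Space street"]:
    print(redact(address))
```

Because .+ is greedy, it runs up to the last suffix that stands as a whole word, which is why 'Lanex Lane' keeps only its final Lane.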

Related

Parsing between characters using Perl in SAS

I am sure this is a simple thing to do but I cannot seem to find any examples or make it past the numerous documentation sources I have been using.
I have a variable in a table (called location) such as: OH_DRT HOME_G4-T7 77 Cafe Entrance
I want to be able to parse this into several columns based on some delimiters. There is variability in my data set so I thought using perl expressions for pattern matching would be the way to go. I am trying to take that string and break it up into something like this:
State | Building | Name | Desc
OH | DRT HOME | G4 | T7 Cafe Entrance
FL | Cleveland | RG | 03 Back Entry
I am able to split the first part out
Data Mydata;
Set Int_Data;
retain re;
if _N_ = 1 Then re = prxparse("/(\D{2})/");
if prxmatch(re, location) Then Do;
State= prxposn(re,1,location);
end;
Run;
It is parsing out the other sections that I am at a loss for. The only one I have been able to get to work correctly is the State. I assume I should be able to pull anything between two characters.
In my head I should be able to split something like this:
Anything before the first _, anything between the first _ and second _, anything second _ to first -, and then finally anything after the -
Are all records exactly the same? If so:
use warnings;
use strict;
my $data = 'OH_DRT HOME_G4-T7 77 Cafe entrance';
my ($state, $building, $name, $desc);
if ($data =~ /^([A-Z]{2})_(.*)_(\w{2})-\w{2}\s+(.*)$/) {
$state = $1;
$building = $2;
$name = $3;
$desc = $4;
}
print "$state, $building, $name, $desc\n";
The regex works as follows:
Capture two upper-cased letters at the start of the string and put it into $1
Skip an underscore and capture everything until the next underscore and put it into $2
Capture the following two word characters and put them into $3
Skip a hyphen and the following two word characters along with any amount of whitespace, and put the remaining portion of the string into $4
Assign the numbered matches into the more descriptive named variables
Note that if any of the matches/captures fail, all of the named variables will be undefined.
The output of the above is:
OH, DRT HOME, G4, 77 Cafe entrance
You can use a pattern with 4 capture groups. Note that if you follow this remark from the question literally, the last group will contain T7 77 Cafe entrance:
and then finally anything after the -
To match anything between the underscores and the -, you can use negated character classes, which match any character except the ones you list.
To not cross newlines, you can add a newline and a carriage return to the class: [^_\r\n]+
^([^_]+)_([^_]+)_([^-]+)-(.*)
Explanation
^ Start of string
([^_]+)_ Capture 1+ chars other than _ in group 1 and then match the _
([^_]+)_ Capture 1+ chars other than _ in group 2 and then match the _
([^-]+)- Capture 1+ chars other than - in group 3 and then match the -
(.*) Capture everything after the hyphen in group 4
Regex demo
If you want 77 Cafe entrance in group 4:
^([^_]+)_([^_]+)_([^-]+)-[^\s-]*\s*(.*)
Regex demo
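As a quick check of the second pattern, here is a small Python sketch (Python rather than SAS or Perl, purely for illustration; the variable names are assumptions). The same pattern text should carry over to prxparse, but that is not tested here.

```python
import re

# Four groups: state, building, name, and the description after the
# token that directly follows the hyphen.
LOCATION = re.compile(r"^([^_]+)_([^_]+)_([^-]+)-[^\s-]*\s*(.*)")

m = LOCATION.match("OH_DRT HOME_G4-T7 77 Cafe Entrance")
if m:
    state, building, name, desc = m.groups()
    print(state, building, name, desc, sep=" | ")
```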
I'm sure the regex solution works fine, but if you want a SCAN solution:
Data WANT(Keep STATE BUILDING NAME DESC);
Length State $2 Building $50 Name $2 Desc $100;
TEST="OH_DRT HOME_G4-T7 77 Cafe Entrance";
State=scan(test,1,"_");
Building=scan(test,2,"_");
temp=scan(test,3,"_");
Name=scan(temp,1,"-");
Desc=scan(temp,2,"-");
Run;

How to extract parameter names and values using regular expressions

I would like to know how to extract values of all this parameters.
My regular expression:
([\w]+)(\s*=\s*)(['|"|\w])(.+)['|"|\w]
Parameter names and values that should match:
name='John Doe'
name=John Doe
organization=Acme Widgets Inc.
server=192.0.2.62
port='143'
file="payroll.dat"
DOS=HIGH,UMB
DEVICE=C:\DOS\HIMEM.SYS
DEVICE=C:\DOS\EMM386.EXE RAM
DEVICEHIGH=C:\DOS\ANSI.SYS
FILES=30
SHELL=C:\DOS\COMMAND.COM C:\DOS /E:512 /P
When I run my expression in regex101.com it only finds the first parameter that matches, in this case name='John Doe'.
Desired output is name John Doe
I am having extra trouble understanding how to find and extract parameter names and values without parentheses and equals signs.
Try this:
(\w+)\s*=\s*['"]?([^'"\n]+)
The keyword will be in capture group 1 (there's no need for [] around \w).
There's no need for a capture group around the equal sign.
['"]? allows an optional quote after the equal sign. There's no need to put it in a capture group.
([^'"\n]+) then matches everything that isn't another quote or newline. So it will capture everything until either a quote or the end of the line. This value will be put into group 2.
DEMO
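To see the two capture groups in action, here is a minimal Python sketch using a few of the sample lines. The names PAIR, text and pairs are made up for the example; findall returns one (name, value) tuple per match.

```python
import re

# Group 1 is the parameter name, group 2 the value without quotes.
PAIR = re.compile(r"""(\w+)\s*=\s*['"]?([^'"\n]+)""")

# Raw string so the backslashes in the DOS path stay intact.
text = r"""name='John Doe'
port='143'
file="payroll.dat"
DEVICE=C:\DOS\EMM386.EXE RAM
FILES=30"""

pairs = PAIR.findall(text)
for name, value in pairs:
    print(name, value)
```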
I hope this will be useful for you:
/^.*?=.*?$/gm
I tested the pattern at https://regexr.com/
// String.raw keeps the backslashes in the DOS paths intact
var str = String.raw`
name='John Doe'
name=John Doe
organization=Acme Widgets Inc.
server=192.0.2.62
port='143'
file="payroll.dat"
DOS=HIGH,UMB
DEVICE=C:\DOS\HIMEM.SYS
DEVICE=C:\DOS\EMM386.EXE RAM
DEVICEHIGH=C:\DOS\ANSI.SYS
FILES=30
SHELL=C:\DOS\COMMAND.COM C:\DOS /E:512 /P`;
console.log( str.match(/^.*?=.*?$/gm).map(str => str.replace(/("|')/g, '').replace(/=/g, ' ') ) )

Regular Expression To Extract Names

I have strings in this form:
"""00.000000 00.000000; X-XX000-0000-0; France; Paris; Street 12a;
00.000000 00.000000; X-XX000-0000-0; Spain; Barcelona; Street 123;"""
I want to extract specific data, the towns, from the string above. How do I get this data?
If you just want to get the city for your given example, you could use a positive lookahead:
\b[^;]+(?=;[^;]+;$)
Explanation
\b # Word boundary
[^;]+ # Match NOT ; one or more times
(?= # Positive lookahead that asserts what follows is
; # Match semicolon
[^;]+ # Match NOT ; one or more times
; # Match ;
$ # Match end of the string
) # Close lookahead
Assuming Python (three quotes-string):
string = """00.000000 00.000000; X-XX000-0000-0; France; Paris; Street 12a;
00.000000 00.000000; X-XX000-0000-0; Spain; Barcelona; Street 123;"""
towns = [part[3] for line in string.split("\n") for part in [line.split("; ")]]
print(towns)
Which yields
['Paris', 'Barcelona']
No regex needed, really.
If you have the city in the 4th field, you can match it using this pattern:
/(?:[^;]*;){3}([^;]*);/
See the demo
[^;]*; matches a field consisting of non-semicolons and ending with a semicolon
(?:...){3} matches it 3 times, without capturing it
([^;]*); then captures the content of the 4th column (not the semicolon)
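A short Python sketch of this pattern applied line by line to the sample data (the names FOURTH_FIELD, data and towns are assumptions for the example). The captured field keeps the space that follows the previous semicolon, so it is stripped here.

```python
import re

# Skip three semicolon-terminated fields, then capture the fourth.
FOURTH_FIELD = re.compile(r"(?:[^;]*;){3}([^;]*);")

data = """00.000000 00.000000; X-XX000-0000-0; France; Paris; Street 12a;
00.000000 00.000000; X-XX000-0000-0; Spain; Barcelona; Street 123;"""

towns = []
for line in data.splitlines():
    m = FOURTH_FIELD.match(line)
    if m:
        # Remove the leading space left over from the "; " separator.
        towns.append(m.group(1).strip())

print(towns)
```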

How to Capture Only Surnames from a Regex Pattern?

Team
I have written a Perl program to validate the accuracy of formatting (punctuation and the like) of surnames, forenames, and years.
If a particular entry doesn't follow a specified pattern, that entry is highlighted to be fixed.
For example, my input file has lines of similar text:
<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My program works just fine; that is, if any entry doesn't follow the pattern, the script generates an error. The above input text doesn't generate any error. But the one below is an example of an error because Rose A. J. is missing a comma after Rose:
NOT FOUND: <bibliomixed id="bkrmbib120">Asher, S. R., & Rose A. J. (1997). Promoting children’s social-emotional adjustment with peers. In P. Salovey & D. Sluyter, (Eds). <emphasis>Emotional development and emotional intelligence: Educational implications.</emphasis> New York: Basic Books.</bibliomixed>
From my regex search pattern, is it possible to capture all the surnames and the year, so I can generate a text prefixed to each line as shown below?
<BIB>Abdo, Afif-Abdo, Otani, Machado, 2008</BIB><bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My regex search script is as follows:
while(<$INPUT_REF_XML_FH>){
$line_count += 1;
chomp;
if(/
# bibliomixed XML ID tag and attribute----<START>
<bibliomixed
\s+
id=".*?">
# bibliomixed XML ID tag and attribute----<END>
# --------2 OR MORE AUTHOR GROUP--------<START>
(?:
(?:
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
,\s)+
#---------------FINAL AUTHOR GROUP SEPARATOR----<START>
&\s
#---------------FINAL AUTHOR GROUP SEPARATOR----<END>
# --------2 OR MORE AUTHOR GROUP--------<END>
)?
# --------LAST AUTHOR GROUP--------<START>
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
(?: # pattern for editor notation----<START>
\s\(Ed(?:s)?\.\)\.
)? # pattern for editor notation----<END>
# --------LAST AUTHOR GROUP--------<END>
\s
\(
# pattern for a year----<START>
(?:[A-Za-z]+,\s)? # July, 1999
(?:[A-Za-z]+\s)? # July 1999
(?:[0-9]{4}\/)? # 1999\/2000
(?:\w+\s\d+,\s)?# August 18, 2003
(?:[0-9]{4}|in\spress|manuscript\sin\spreparation) # (1999) (in press) (manuscript in preparation)----REQUIRED
(?:[A-Za-z])? # 1999a
(?:,\s[A-Za-z]+\s[0-9]+)? # 1999, July 2
(?:,\s[A-Za-z]+\s[0-9]+\x{2013}[0-9]+)? # 2002, June 19–25
(?:,\s[A-Za-z]+)? # 1999, Spring
(?:,\s[A-Za-z]+\/[A-Za-z]+)? # 1999, Spring\/Winter
(?:,\s[A-Za-z]+-[A-Za-z]+)? # 2003, Mid-Winter
(?:,\s[A-Za-z]+\s[A-Za-z]+)? # 2007, Anniversary Issue
# pattern for a year----<END>
\)\.
/six){
print $FOUND_REPORT_FH "$line_count\tFOUND: $&\n";
$found_count += 1;
} else{
print $ERROR_REPORT_FH "$line_count\tNOT FOUND: $_\n";
$not_found_count += 1;
}
Thanks for your help,
Prem
Alter this bit
# pattern for surname----<END>
,?\s
This now means an optional , followed by white space. If the person's surname is "Bunga Bunga" it won't work.
All of your subpatterns are non-capturing groups, starting with (?:. That saves the regex engine some work, but it also means that nothing the subpattern matches is captured.
To capture a pattern you merely need to place parentheses around the part you want to capture. So you could remove the non-capturing marker ?: or place parens () where you need them. http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
I'm not sure, but from your code I think you may be expecting the optional groups to behave like lookahead assertions: for example, you test for surnames with spaces and, if there are none, test for surnames with hyphens. The groups will not all start from the same position. Each one either matches or not, and the regex then moves forward and tests the next subpattern at the new position. Whether the regex will then test the second name against the first subpattern is what I am unsure of. http://perldoc.perl.org/perlretut.html#Looking-ahead-and-looking-behind
#!/usr/bin/perl
use warnings;
use strict;
my $line = '123 456 7antelope89';
$line =~ /^(\d+\s\d+\s)?(\d+\w+\d+)?/;
my ($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7bealzelope89';
$line =~ /(?:\d+\s\d+\s)?(?:\d+\w+\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7canteloupe89';
$line =~ /((?:\d+\s\d+\s))?(?:\d+(\w+)\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
exit 0;
In the third example, wrapping the first subpattern as ((?:...)) is redundant for capturing the whole group: it marks the group as non-capturing and then captures it anyway. Where this is useful is the second subpattern, which is a fine-grained capture: the captured part sits inside a non-capturing group. Note that \w also matches digits, so the greedy (\w+) keeps the 8 and leaves only the final 9 for \d+. The output is:
a: 123 456 b: 7antelope89
a: nocapture b: nocapture
a: 123 456 b: canteloupe8
One little nitpick:
id=".*?"
may be better as
id="\w*?"
since id names are required to be alphanumeric (plus _), IIRC.
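To illustrate the point about adding capturing parentheses, here is a deliberately simplified Python sketch (not the full Perl pattern from the question). It assumes a surname is a run of word characters, apostrophes, or hyphens followed by a comma and initials; AUTHOR, YEAR and bib_prefix are hypothetical names, and surnames with spaces or titles such as Jr. are not handled.

```python
import re

# Only the surname is captured; the initials sit in a non-capturing group.
AUTHOR = re.compile(r"([\w’'-]+),\s(?:[A-Z]\.[\s-]?)+")
# Four-digit year in parentheses, with an optional letter as in (1999a).
YEAR = re.compile(r"\((\d{4})[a-z]?\)")

def bib_prefix(entry: str) -> str:
    year = YEAR.search(entry)
    if year is None:
        return ""
    # Look for authors only in the text before the year.
    surnames = AUTHOR.findall(entry[:year.start()])
    return "<BIB>" + ", ".join(surnames + [year.group(1)]) + "</BIB>"

entry = ("Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). "
         "Sexual satisfaction among patients.")
print(bib_prefix(entry))
```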

Regex with lookahead

I can't seem to make this regex work.
The input is as follows. It's really on one row, but I have inserted line breaks after each \r\n so that it's easier to see, so no check for whitespace characters is needed.
01-03\r\n
01-04\r\n
TEXTONE\r\n
STOCKHOLM\r\n
350,00\r\n ---- 350,00 should be the last value in the first match
12-29\r\n
01-03\r\n
TEXTTWO\r\n
COPENHAGEN\r\n
10,80\r\n
This could go on with another 01-31 and 02-01, marking another new match (these are dates).
I would like to have a total of 2 matches for this input.
My problem is that I can't figure out how to look ahead and match the start of a new match (two consecutive dates) without including those dates in the first match. They should belong to the second match.
It's hard to explain, but I hope someone will get me.
This is what I've got so far, but it's not even close:
(.*?)((?<=\\d{2}-\\d{2}))
The matches I want are:
1: 01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n
2: 12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n
After that I can easily separate the columns with \r\n.
Would this more explicit pattern work for you?
(\d{2}-\d{2})\r\n(\d{2}-\d{2})\r\n(.*)\r\n(.*)\r\n(\d+(?:,?\d+))
Here's another option for you to try:
(.+?)(?=\d{2}-\d{2}\\r\\n\d{2}-\d{2}|$)
Rubular
/
\G
(
(?:
[0-9]{2}-[0-9]{2}\r\n
){2}
(?:
(?! [0-9]{2}-[0-9]{2}\r\n ) [^\n]*\n
)*
)
/xg
Why do so much work?
$string = q(01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n);
for (split /(?=(?:\d{2}-\d{2}\\r\\n){2})/, $string) {
print join( "\t", split /\\r\\n/), "\n"
}
Output:
01-03 01-04 TEXTONE STOCKHOLM 350,00
12-29 01-03 TEXTTWO COPENHAGEN 10,80
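For readers working in Python, the "two dates start a record" idea from the answers above can be sketched as follows. This assumes the input contains real CR LF characters (unlike the Perl q() string, which holds literal backslashes), and the names RECORD, data, records and rows are made up for the example.

```python
import re

# Two date lines, then any number of lines that do not start with a date.
RECORD = re.compile(
    r"(?:\d{2}-\d{2}\r\n){2}"
    r"(?:(?!\d{2}-\d{2}\r\n)[^\n]*\n)*"
)

data = ("01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n"
        "12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n")

# With no capturing groups, findall returns the full text of each match.
records = RECORD.findall(data)

# Split every record into its columns.
rows = [record.rstrip("\r\n").split("\r\n") for record in records]
for row in rows:
    print("\t".join(row))
```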