Extract words preceding and following search terms - regex

Suppose I have a text like the following.
The City of New York often called New York City or simply New York is
the most populous city in the United States. With an estimated
population of 8537673 distributed over a land area of about 3026
square miles (784 km2) New York City is also the most densely
populated major city in the United States.
I want to locate the n words preceding and following each occurrence of a search term. For example, with n=3 and search term="New York":
1st occurrence:
words preceding = {The, City, of}
words following = {often, called, New}
2nd occurrence:
words preceding = {York, often, called}
words following = {City, or, simply}
3rd occurrence:
words preceding = {City, or, simply}
words following = {is, the, most}
4th occurrence:
words preceding = {miles, 784, km2}
words following = {City, is, also}
How can I do this using regex? I found a similar question, "Extract words surrounding a search word", but it does not handle multiple occurrences of the search term.
Attempts:
import re

def search(text, n):
    word = r"\W*(\w+)"
    wordn = word * n  # n captured words on each side of the search word
    groups = re.search(r'{}\W*{}{}'.format(wordn, 'place', wordn), text).groups()
    return groups[:n], groups[n:]

You need to use a positive lookahead assertion in order to handle overlapping matches:
re.findall(r"((?:\w+\W+){3})(?=New York((?:\W+\w+){3}))", t)
Result:
[('The City of ', ' often called New'),
('York often called ', ' City or simply'),
('City or simply ', ' is the most'),
('miles (784 km2) ', ' City is also')]
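The same idea can be generalized to any search term and any n. The following is only a sketch of one way to do it; the helper name surrounding_words is made up for illustration and is not part of the answer above. It escapes the search term with re.escape, builds the lookahead pattern, and splits each captured chunk into words:

```python
import re

def surrounding_words(text, term, n):
    """Return (words_before, words_after) for each occurrence of term.

    Hypothetical helper: occurrences with fewer than n words on either
    side are skipped, because the pattern requires exactly n words.
    """
    pattern = r"((?:\w+\W+){%d})(?=%s((?:\W+\w+){%d}))" % (n, re.escape(term), n)
    return [
        (re.findall(r"\w+", before), re.findall(r"\w+", after))
        for before, after in re.findall(pattern, text)
    ]

text = ("The City of New York often called New York City "
        "or simply New York is the most populous city")
for before, after in surrounding_words(text, "New York", 3):
    print(before, after)
```

Because the search term sits inside the lookahead, it is not consumed, so the words following one occurrence can be reused as the words preceding the next one.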

You may try the following:
((?:\w+\W+){3})(?=New York((?:\W+\w+){3}))
The values you want are in groups 1 and 2.
Sample source:
import re
regex = r"((?:\w+\W+){3})(?=New York((?:\W+\w+){3}))"
test_str = "The City of New York often called New York City or simply New York is the most populous city in the United States. With an estimated 2016 population of 8537673 distributed over a land area of about 3026 square miles (784 km2) New York City is also the most densely populated major city in the United States."
matches = re.finditer(regex, test_str)
for match in matches:
    print(re.sub(r'\W+', ' ', match.group(1)) + " <------>" + re.sub(r'\W+', ' ', match.group(2)))

Related

Convert regex to Rust regex. Replace text between nth comma

The goal is to use a regex to remove the text between the nth comma and the next one in Rust.
For example, outside of Rust I would use
^((?:.*?,){4})[^,]*,(.*)$
on London, City of Westminster, Greater London, England, SW1A 2DX, United Kingdom
to get a desired result like:
London, City of Westminster, Greater London, England, United Kingdom
Unfortunately, I don't have a strong understanding of regex in general, so I would like to learn more about the mechanics and be able to use this in the program I'm writing to learn Rust.
Just copy-pasting it, à la
let string = "London, City of Westminster, Greater London, England, United Kingdom";
let re = Regex::new(r"^((?:.*?,){4})[^,]*,(.*)$").unwrap();
re.replace(string, "");
obviously does not work.
The value you want to remove is the fifth comma-delimited value, not the fourth, and you need to replace the match with two backreferences, $1 and $2, which refer to the Group 1 and Group 2 values.
Note that a [^,] negated character class is more precise than a .*? lazy dot in the quantified part, since you are running the pattern against a comma-delimited string.
See the Rust demo:
use regex::Regex;

fn main() {
    let string = "London, City of Westminster, Greater London, England, SW1A 2DX, United Kingdom";
    let re = Regex::new(r"^((?:[^,]*,){4})[^,]*,(.*)").unwrap();
    println!("{}", re.replace(string, "$1$2"));
    // => London, City of Westminster, Greater London, England, United Kingdom
}
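For comparison, the same replacement can be sketched in Python, where group references in the replacement string are written \1 and \2 rather than $1 and $2. The function name drop_field and its index parameter are illustrative assumptions, not something from the answer:

```python
import re

def drop_field(text, index):
    """Remove the comma-delimited field at zero-based position index.

    Hypothetical helper: the pattern requires a comma after the removed
    field, so the last field is never removed. If the pattern does not
    match, the text is returned unchanged.
    """
    pattern = r"^((?:[^,]*,){%d})[^,]*,(.*)" % index
    return re.sub(pattern, r"\1\2", text)

address = "London, City of Westminster, Greater London, England, SW1A 2DX, United Kingdom"
print(drop_field(address, 4))
```

Group 1 keeps the first four fields with their commas, the unparenthesized [^,]*, swallows the fifth field, and Group 2 keeps whatever follows.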

Return certain elements of a phrase after special characters

I'm a newbie to regular expressions and need to parse the following phrases in BigQuery.
phrase
custom3==10-25% sale&+brand==xxx
custom2==30-50% sale&+brand==yyy
to return
output
10-25% sale and xxx
30-50% sale and yyy
Below is the furthest point that I've been able to reach so far:
REGEXP_CONTAINS(phrase, r"\==") then REGEXP_EXTRACT(phrase, r"\==(.*)")
which obviously does not do the job given its output
10-25% sale&+brand==xxx
30-50% sale&+brand==yyy
Any thoughts on this are appreciated.
With REGEXP_REPLACE and capturing groups:
WITH data AS (
SELECT * FROM UNNEST(
['custom3==10-25% sale&+brand==xxx','custom2==30-50% sale&+brand==yyy']
) phrase
)
SELECT REGEXP_REPLACE(phrase
, r'.*==([^ ]* sale).*(...)'
, r'\1 and \2')
FROM data
10-25% sale and xxx
30-50% sale and yyy
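The pattern above relies on the first value ending in " sale" and on the brand being exactly three characters (the trailing (...) group). As a hedged alternative, the capture can be anchored on the literal &+brand== separator instead. The sketch below uses Python's re module rather than BigQuery, and it assumes every phrase has the shape key==value&+brand==value, which is a guess based on the two sample rows; the name summarize is made up for illustration:

```python
import re

# Assumed input shape: <key>==<value>&+brand==<brand>
PATTERN = re.compile(r"^[^=]*==(.*?)&\+brand==(.*)$")

def summarize(phrase):
    """Return '<value> and <brand>', or None if the phrase has another shape."""
    m = PATTERN.match(phrase)
    return f"{m.group(1)} and {m.group(2)}" if m else None

for phrase in ["custom3==10-25% sale&+brand==xxx",
               "custom2==30-50% sale&+brand==yyy"]:
    print(summarize(phrase))
```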

Extract different formats street address from a string using RE - Python

I have street address strings in different formats. I tried the approach from an older post, but it did not help much. My string formats are as follows:
format 1:
string_1 = ', landlord and tenant entered into a an agreement with respect to approximately 5,569 square feet of space in the building known as "the company" located at 788 e.7th street, st. louis, missouri 55605 ( capitalized terms used herein and not otherwise defined herein shall have the respective meanings given to them in the agreement); whereas, the term of the agreement expires on may 30, 2015;'
desired output:
788 e.7th street, st. louis, missouri 55605
format 2:
string_2 = 'first floor 824 6th avenue, chicago, il where the office is located'
desired output:
824 6th avenue, chicago, il
format 3:
string_3 = 'whose address is 90 south seventh street, suite 5400, dubuque, iowa, 55402.'
desired output:
90 south seventh street, suite 5400, dubuque, iowa, 55402
So far, I have tried this for string_1:
address_match_1 = re.findall(r'((\d*)\s+(\d{1,2})(th|nd|rd).*\s([a-z]))', string_1)
I get an empty list.
For the 2nd string I tried the same and also get an empty list:
address_match_2 = re.findall(r'((\d*)\s+(\d{1,2})(th|nd|rd).*\s([a-z]))', string_2)
How can I match these using re? They are all in different formats. How can I also capture the suite in string_3? Any help would be appreciated.
Solution
This regex matches all addresses in the question:
(?i)\d+ ((?! \d+ ).)*(missouri|il|iowa)(, \d{5}| \d{5}|\b)
You would need to add all of the states and their abbreviations, as well as a better match for the zip code, which you can find if you google it. Also, this will only work for US addresses.
Here is the output for each of the given strings:
>>> m = re.findall(r"(?i)(\d+ ((?! \d+ ).)*(missouri|il|iowa)(, \d{5}| \d{5}|\b))", string_1)
>>> print(m)
[('788 e.7th street, st. louis, missouri 55605', ' ', 'missouri', ' 55605')]
>>> m = re.findall(r"(?i)(\d+ ((?! \d+ ).)*(missouri|il|iowa)(, \d{5}| \d{5}|\b))", string_2)
>>> print(m)
[('824 6th avenue, chicago, il', ' ', 'il', '')]
>>> m = re.findall(r"(?i)(\d+ ((?! \d+ ).)*(missouri|il|iowa)(, \d{5}| \d{5}|\b))", string_3)
>>> print(m)
[('90 south seventh street, suite 5400, dubuque, iowa, 55402', ' ', 'iowa', ', 55402')]
The first value of each tuple has the correct address. However, this may not be exactly what you need (see Weakness below).
Detail
Assumptions:
Address starts with a number followed by a space
Address ends with a state, or its abbreviation, optionally followed by a 5 digit zip code
The rest of the address is in between the two parts above. This part doesn't contain any numbers surrounded by spaces (i.e. with no " \d+ ").
regex string:
r"(?i)(\d+ ((?! \d+ ).)*(missouri|il|iowa)(, \d{5}| \d{5}|\b))"
r"" makes the string a raw string, so backslashes need no extra escaping
(?i) makes the regex case insensitive
\d+ the address starts with a number followed by a space
(missouri|il|iowa)(, \d{5}| \d{5}|\b) the address ends with a state, optionally followed by a zip code. The \b is a word boundary; as the last alternative, it makes the zip code optional.
((?! \d+ ).)* any run of characters that does not contain a number surrounded by spaces. Refer to this article for an explanation on how this works.
Weakness
Regular expressions are used to match patterns, but the addresses presented don't have much of a pattern compared with the rest of the string they may be in. Here is the pattern that I identified and that I based the solution on:
Address starts with a number followed by a space
Address ends with a state, or its abbreviation, optionally followed by a 5 digit zip code
The rest of the address is in between the two parts above. This part doesn't contain any numbers surrounded by spaces (i.e. with no " \d+ ").
Any address that violates these assumptions won't be matched correctly. For example:
Addresses starting with a number with letters, such as: 102A or 3B.
Addresses with numbers in between initial number and the state, such as one containing ' 7 street' instead of ' 7th street.'
Some of these weaknesses may be fixed with simple changes to the regex, but some may be more difficult to fix.
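To get just the address strings rather than tuples of groups, the same pattern can be written with non-capturing groups and a state list that is joined into the alternation. This is a sketch under the assumptions listed above, not a general address parser; the names STATES and find_addresses are made up for illustration, and the state list only contains the three values from the question:

```python
import re

# Illustrative list only: a real one would need every state and abbreviation.
STATES = ["missouri", "il", "iowa"]

ADDRESS_RE = re.compile(
    r"\d+ (?:(?! \d+ ).)*(?:" + "|".join(STATES) + r")(?:, \d{5}| \d{5}|\b)",
    re.IGNORECASE,
)

def find_addresses(text):
    """Return every substring that fits the assumed address pattern."""
    return [m.group(0) for m in ADDRESS_RE.finditer(text)]

print(find_addresses("first floor 824 6th avenue, chicago, il where the office is located"))
print(find_addresses("whose address is 90 south seventh street, suite 5400, dubuque, iowa, 55402."))
# An address starting with "102A" violates the first assumption and is not found.
print(find_addresses("lives at 102A main street, chicago, il"))
```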

How to lookup an array of strings to match a value in a column?

I have a master table holding the list of possible street types:
CREATE TABLE land.street_type (
str_type character varying(300)
);
insert into land.street_type values
('STREET'),
('DRIVE'),
('ROAD');
I have a table in which address is loaded and I need to parse the string to do a lookup on the master street type to fetch the suburb following the street.
CREATE TABLE land.bank_application (
mailing_address character varying(300)
);
insert into land.bank_application values
('8 115 MACKIE STREET VICTORIA PARK WA 6100 AU'),
('69 79 CABBAGE TREE ROAD BAYVIEW NSW 2104 AU'),
('17 COWPER DRIVE CAMDEN SOUTH NSW 2570 AU');
Expected output:
VICTORIA PARK
BAYVIEW
CAMDEN SOUTH
Is there a PostgreSQL technique to look up an array of values against a table column and fetch the data following the matching word?
If I'm able to fetch the data present after the street type, then I can remove the last 3 fields state, postal code and country code from that to identify the suburb.
This query does what you ask for using regular expressions:
SELECT substring(b.mailing_address, ' ' || s.str_type || ' (.*) \D+ \d+ \D+$') AS suburb
FROM land.bank_application b
JOIN land.street_type s ON b.mailing_address ~ (' ' || s.str_type || ' ');
The regexp ' (.*) \D+ \d+ \D+$' explained step by step:
' ' .. leading space (the assumed delimiter, else something like 'BROAD' would match 'ROAD')
(.*) .. capturing parentheses with 0-n arbitrary characters: .*
\D+ .. 1-n non-digits
\d+ .. 1-n digits
$ .. end of string
The manual on POSIX Regular Expressions.
But it relies on the given format of mailing_address. Is the format of your strings that reliable?
And suburbs can have words like 'STREET' etc. as part of their name, so the approach seems unreliable in principle.
BTW, there is no array involved, you seem to be confusing arrays and sets.
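To see how the pattern behaves on the sample rows outside the database, the matching logic can be sketched in Python. This only mirrors the regular expression, not the SQL join, and it carries the same format assumption (suburb, then state, postal code and country code at the end). The names STREET_TYPES and suburb are made up for illustration:

```python
import re

STREET_TYPES = ["STREET", "DRIVE", "ROAD"]  # stands in for the street_type table

def suburb(address, street_types=STREET_TYPES):
    """Return the text between a street type and the trailing
    '<state> <postcode> <country>' part, or None if nothing matches."""
    for street_type in street_types:
        pattern = " " + re.escape(street_type) + r" (.*) \D+ \d+ \D+$"
        m = re.search(pattern, address)
        if m:
            return m.group(1)
    return None

for row in ["8 115 MACKIE STREET VICTORIA PARK WA 6100 AU",
            "69 79 CABBAGE TREE ROAD BAYVIEW NSW 2104 AU",
            "17 COWPER DRIVE CAMDEN SOUTH NSW 2570 AU"]:
    print(suburb(row))
```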

scala regex to limit with double space

I have data like the following:
135 stjosephhrsecschool  london  DunAve
175865 stbele_higher_secondary sch  New York
11 st marys high school for women  Paris  Louis Avenue
I want to extract id, school name, city and area.
The pattern is: an id (digits), followed by a single space, then the school name. The name can have multiple words separated by single spaces, and may contain special characters. Then come two or more spaces, then the city, which may also have multiple words separated by single spaces or contain special characters. Then come two or more spaces, then the area, which follows the same rules as the school name and city. The area may or may not be present in the line; if it is not, I want a null value for area.
Here is the regex I have tried:
([\d]+) ([\w\s\S]+)\s\s+([\w\s\S]+)\s\s+([\w\s\S]*)
But this regex does not stop when it sees two or more spaces. I am not sure how to modify it to fit my data.
Any help is appreciated.
Thanks
If I understand your issue correctly - the issue is that the resulting groups contain trailing spaces (e.g. "Louis Avenue "). If so - you can fix this by using the non-greedy modifiers like +? and *?:
([\d]+) ([\w\s\S]+?)\s\s+([\w\s\S]+?)\s\s+([\w\s\S]*?)?\s*
Which results in what seems to be the desired output:
val s1 = "135 stjosephhrsecschool  london  DunAve"
val s2 = "175865 stbele_higher_secondary sch  New York  "
val s3 = "11 st marys high school for women  Paris  Louis Avenue "
val r = """([\d]+) ([\w\s\S]+?)\s\s+([\w\s\S]+?)\s\s+([\w\s\S]*?)?\s*""".r
def matching(s: String) = s match {
  case r(a,b,c,d) => println((a,b,c,d))
  case _ => println("no match")
}
matching(s1) // (135,stjosephhrsecschool,london,DunAve)
matching(s2) // (175865,stbele_higher_secondary sch,New York,)
matching(s3) // (11,st marys high school for women,Paris,Louis Avenue)
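Since the fields are delimited by runs of two or more spaces, another option is to split on that delimiter instead of matching the whole line. The sketch below shows the idea in Python rather than Scala; the function name parse_line is made up for illustration, and it assumes the id is always separated from the school name by a single space, as described in the question. A missing area comes back as None:

```python
import re

def parse_line(line):
    """Split a line into (id, school name, city, area).

    Fields are separated by two or more whitespace characters; the id is
    split off the first field at its first single space. Missing trailing
    fields are returned as None.
    """
    fields = re.split(r"\s{2,}", line.strip())
    school_id, _, name = fields[0].partition(" ")
    city = fields[1] if len(fields) > 1 else None
    area = fields[2] if len(fields) > 2 else None
    return school_id, name, city, area

print(parse_line("135 stjosephhrsecschool  london  DunAve"))
print(parse_line("175865 stbele_higher_secondary sch  New York  "))
print(parse_line("11 st marys high school for women  Paris  Louis Avenue "))
```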