Can I use Regex to return a value based on a specific match? - regex

If I have a list of addresses related to a specific city for a given company/entity, can I use regex to return that address to me vs doing it in code (db query)? For instance if I have:
Manhattan|600 Broadway, 10023|
San Francisco|100 Taylor Street, 94133|
Can I have some regex with namegroups that returns that address something along the lines of
(?<LOCATION_CITY>Manhattan)(?<LOCATION_BUILDING>600 Broadway)?|(?<LOCATION_CITY2>San Francisco)(?<LOCATION_BUILDING2>100 Taylor)?
I get that that is not the correct regex but as a starting point I am wondering if I can use one matched namegroup in the scanned text block to return a known value that is not in the text block. The result I'd be looking for is:
LOCATION_CITY: Manhattan
LOCATION_BUILDING: 600 Broadway
or
LOCATION_CITY: San Francisco
LOCATION_BUILDING: 100 Taylor Street
Where LOCATION_CITY is returned because it is found in the text and LOCATION_BUILDING is returned because the associated city was found.

How about:
^(?<LOCATION_CITY>[^|]+)\|(?<LOCATION_BUILDING>[^|,]+)
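A minimal sketch of how that pattern could be applied, assuming Python (which spells named groups (?P<name>...) instead of (?<name>...)) and assuming the addresses are present in the scanned text as in the question's sample. The function name parse_locations is made up for illustration. Note that a regex can only return text it actually matched, so the building has to appear in the input.

```python
import re

# Same pattern as above, written with Python's (?P<name>...) syntax.
PATTERN = re.compile(r"^(?P<LOCATION_CITY>[^|]+)\|(?P<LOCATION_BUILDING>[^|,]+)")

SAMPLE = "Manhattan|600 Broadway, 10023|\nSan Francisco|100 Taylor Street, 94133|"


def parse_locations(text):
    """Return one dict of named groups per line that matches the pattern."""
    results = []
    for line in text.splitlines():
        m = PATTERN.match(line)
        if m:
            results.append(m.groupdict())
    return results


for row in parse_locations(SAMPLE):
    print(row["LOCATION_CITY"], "->", row["LOCATION_BUILDING"])
```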

Related

How can a regex catch all parts before a keyword from a finite set, but sometimes separated only by a single space

This question relates to PCRE regular expressions.
Part of my big dataset are address data like this:
12H MARKET ST. Canada
123 SW 4TH Street USA
ONE HOUSE USA
1234 Quantity Dr USA
123 Quality Court Canada
1234 W HWY 56A USA
12345 BERNARDO CNTR DRIVE Canada
12 VILLAGE PLAZA USA
1234 WEST SAND LAKE RD ?567 USA
1234 TELEGRAM BLVD SUITE D USA
1234-A SOUTHWEST FRWY USA
123 CHURCH STREET USA
123 S WASHINGTON USA
123 NW-SE BLVD USA
# USA
1234 E MAIN STREET USA
I would like to extract the street names including house numbers and additional information from these records. (Of course there are other things in those records and I already know how to extract them).
For the purpose of this question I just manually clipped the interesting part from the data for this example.
The number of words in the address parts is not known before. The only criterion I have found so far is to find the occurrence of country names belonging to some finite set, which of course is bigger than (USA|Canada). For brevity I limit my example just to those two countries.
This regular expression
([a-zA-Z0-9?\-#.]+\s)
already isolates the words making up what I am after, including one space after each of them. Unfortunately there are cases where the to-be-extracted street information is separated from the country by only a single space, e.g. in the first and in the last example.
Since I want to capture the matching parts glued together, I place a + sign behind my regular expression:
([a-zA-Z0-9?\-#.]+\s)+
but then in the two nasty cases with only one separating space before the country, the country is also caught!
Since I know the possible countries from looking at the data, I could try to exclude them by a look ahead-condition like this:
([a-zA-Z0-9?\-#.]+\s)(?!USA|Canada)
which excludes ST. from the match in the first line and STREET from the match in the last line. Of course the single capture groups are not yet glued together by this.
So I would add a plus sign to the group on the left:
([a-zA-Z0-9?\-#.]+\s)+(?!USA|Canada)
But then ST. and STREET are caught again, together with the country that follows them after only a single space, and the country is exactly what I want to exclude from my result!
How would you proceed in such a case?
If it were possible, by properly using regular expressions, to replace each country name by the same one preceded by an additional space (or even to do this only for cases where there is only a single space in front of one of the country names), my problem would be solved. But I want to avoid such a substitution for the whole database in a separate run, because a country name might appear in some other column too.
I am quite new to regular expressions and I have no idea how to apply two processing steps to the same input in sequence. But maybe someone has a better idea how to cope with this problem.
If I understand correctly, you want all content before the country (excluding spaces before the country). The country will always be present at the end of the line and comes from a list.
So you should be able to set the 'global' and 'multiline' options and then use the following regex:
^(.*?)(?=\s+(USA|Canada)\s*$)
Explanation:
^(.*?) match as few characters as possible from the start of the line
(?=\s+(USA|Canada)\s*$) look ahead for one or more spaces, followed by one of the country names, followed by zero or more spaces and end of line.
That should give you a list with all addresses.
Edit:
I have changed the first part to: (.*?), making it non-greedy. That way the match will stop at the last letter before country instead of including some spaces.
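For what it's worth, here is a small sketch of the same idea in Python rather than PCRE, assuming the country always sits at the end of the line. The names COUNTRIES and street_parts are only illustrative, and re.MULTILINE plays the role of the 'multiline' option (finditer covers 'global').

```python
import re

# Finite set of countries; extend as needed (assumption: they end the line).
COUNTRIES = ("USA", "Canada")

# Lazy capture from line start, stopped by a lookahead for whitespace + country.
PATTERN = re.compile(
    r"^(.*?)(?=\s+(?:" + "|".join(map(re.escape, COUNTRIES)) + r")\s*$)",
    re.MULTILINE,
)

SAMPLE = "12H MARKET ST. Canada\n123 SW 4TH Street   USA\n# USA\nONE HOUSE USA"


def street_parts(text):
    """Return the text before the trailing country for every matching line."""
    return [m.group(1) for m in PATTERN.finditer(text)]


for part in street_parts(SAMPLE):
    print(repr(part))
```

Because the capture is lazy and the lookahead consumes nothing, it makes no difference whether one space or several separate the street from the country.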

Using regex and vba, extracting parts of data

I have an excel spreadsheet and within its contents it is formatted like -
Street Name, Street Number Street Direction (may not be present; represented by one of N, S, W, E)
So it could look like John Doe Ave, 900 E or Jane Doe DR, 100
However, the people who used this spreadsheet put in business names or other information that shouldn't be present.
The part I'm stuck at is the regex pattern. I'm not familiar with regex and it confuses me.
I have this variable
Dim strPattern As String: strPattern = "^(.+),\s(\d+)\s([NWSEnwse])"
This works only partially. What changes could I make so that the street direction (NWSEnwse) is optional? Right now it detects the address only when the street direction is present.
You may use this regex pattern to match it.
^(.+),\s+(\d+)(\s+[NWSEnwse])?
The ? at the end signifies that that part is optional.
I also replaced \s with \s+ to account for any extra spaces that might have crept in.
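To illustrate how the optional group behaves, here is a quick sketch in Python rather than VBA (the pattern itself is unchanged; the function name parse_cell is invented for this example). When the direction is absent, the third group simply comes back empty.

```python
import re

# Same pattern as above: street name, comma, number, optional direction.
PATTERN = re.compile(r"^(.+),\s+(\d+)(\s+[NWSEnwse])?")


def parse_cell(cell):
    """Return (street, number, direction) or None if the cell does not match."""
    m = PATTERN.match(cell)
    if m is None:
        return None
    street, number, direction = m.groups()
    # The optional group keeps its leading whitespace, so strip it.
    direction = direction.strip() if direction else None
    return (street, number, direction)


print(parse_cell("John Doe Ave, 900 E"))
print(parse_cell("Jane Doe DR, 100"))
print(parse_cell("Acme Plumbing Inc"))
```

One thing to watch: since the pattern is not anchored at the end, trailing text such as "Main St, 100 Elm" would still match, with "E" taken as the direction. Appending \s*$ to the pattern might be worth considering if such rows should be rejected.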

Regex parse with alteryx

One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried regex (\<\w+\>)\s\<\w+\> but only get a few records to the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA
This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
See demo
The negative look ahead includes a word boundary to allow suburbs that start with a state abbreviation (see demo).
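A rough way to check the pattern outside Alteryx, here in Python as an assumption (the helper name suburb is made up). It runs the regex against the sample inputs from the question plus one suburb that starts with a state abbreviation.

```python
import re

# Letter-words up to, but not including, a state abbreviation.
SUBURB = re.compile(r"( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+")


def suburb(value):
    """Return the leading suburb name, or None if the value starts otherwise."""
    m = SUBURB.match(value)
    return m.group(0) if m else None


for raw in ["CABRAMATTA", "ST JOHNS PARK", "Parramatta NSW 2150",
            "Claymore 2559", "WARWICK QLD 4370"]:
    print(raw, "->", suburb(raw))
```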
Expanding on Bohemian's answer, you can use groupings to do a REGEX_Replace in Alteryx. So:
REGEX_Replace([Field1], "(.*)(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This will grab anything that matches in the first group (so just the suburb). The second and third groups match the state and the postcode. Not a perfect regex, but should get you most of the way there.

Avoid literal repetition

Suppose I have this string:
Address XXXXX city XXXXX
And this regEX:
Address (.*?) city (.*?)
What will happen if the Address is "The city of London" ?
It depends on whether your regex engine is in greedy mode or not.
If it's in greedy mode, it will work as expected since it will look for the longest match.
Whether your particular regex engine runs in greedy mode by default, or whether it even has a greedy mode, is not something we can tell you based on the information provided in the question.
If you're using .NET, this page has a description on greedy versus lazy matching.
Basically, given the string XYZZY, the regex X.*Y will match XYZZY (greedy) while X.*?Y will match XY (lazy).
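A small sketch of the difference, assuming Python's re module (the .NET behaviour for these quantifiers should be comparable, but that is not verified here). The sample string with "The city of London" as the address and "London" as the city is an assumption built from the question.

```python
import re

# Greedy vs lazy on the string from the paragraph above.
greedy = re.search(r"X.*Y", "XYZZY").group()
lazy = re.search(r"X.*?Y", "XYZZY").group()
print(greedy, lazy)

# Assumed input: the address itself contains the delimiter word "city".
text = "Address The city of London city London"

# Lazy groups stop at the first " city " and the last group matches nothing.
lazy_match = re.search(r"Address (.*?) city (.*?)", text)
print(lazy_match.groups())

# Greedy, anchored groups split at the last " city " instead.
greedy_match = re.search(r"^Address (.*) city (.*)$", text)
print(greedy_match.groups())
```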
What you need is a way to ensure you can differentiate between the delimiters and the elements of your string, otherwise you'll be in trouble no matter what, such as with:
Address The city baths city Manchester city, England
Perhaps you could look into something like:
Address "put address here" city "put city here"
and try to make sure you never get a city name with quotes in it. However, be careful. I once worked on a project where we managed to get some decent compression on city names (it was embedded so every byte counted) by only having to store alpha characters.
Shortly thereafter, we rolled out nationally and the residents of A1 mining settlement were rather miffed at our short-sightedness :-) One town in the whole of Oz with a digit in the name, who'd have thought?
Alternatively, put the address and city on separate lines thus:
Address: The city baths
City: Manchester city, England
Then you can look for things like:
^Address:\s*(.*)$
^City:\s*(.*)$
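As a sketch of the line-per-field layout, assuming Python and a record laid out exactly as above (the name parse_record is invented for illustration): with multiline matching, each pattern only ever sees its own line, so the word "city" inside the address no longer matters.

```python
import re

RECORD = "Address: The city baths\nCity: Manchester city, England"


def parse_record(record):
    """Pull the address and city out of a two-line labelled record."""
    address = re.search(r"^Address:\s*(.*)$", record, re.MULTILINE)
    city = re.search(r"^City:\s*(.*)$", record, re.MULTILINE)
    return (
        address.group(1) if address else None,
        city.group(1) if city else None,
    )


print(parse_record(RECORD))
```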

Extract a portion of text using RegEx

I would like to extract portion of a text using a regular expression. So for example, I have an address and want to return just the number and streets and exclude the rest:
2222 Main at King Edward Vancouver BC CA
But the addresses vary in format most of the time. I tried using a lookahead regex and came up with this expression:
.*?(?=\w* \w* \w{2}$)
The above expression handles the above example nicely, but it gets way too messy as soon as commas come into the text, or postal codes, which can be a 6-character string or two 3-character strings with a space in the middle, etc.
Is there any more elegant way of extracting a portion of text other than a lookahead regex?
Any suggestion or a point in another direction is greatly appreciated.
Thanks!
Regular expressions are for data that is REGULAR, that follows a pattern. So if your data is completely random, no, there's no elegant way to do this with regex.
On the other hand, if you know what values you want, you can probably write a few simple regexes, and then just test them all on each string.
Ex.
regex1= address # grabber, regex2 = street type grabber, regex3 = name grabber.
Attempt a match on string1 with regex1, regex2, and finally regex3. Move on to the next string.
well i thought i'd throw my hat into the ring:
.*(?=,? ([a-zA-Z]+,?\s){3}([\d-]*\s)?)
and you might want ^ or \d+ at the front for good measure
and i didn't bother specifying lengths for the postal codes... just any number of digits and hyphens in this one.
it works for these inputs so far and variations on commas within the City/state/country area:
2222 Main at King Edward Vancouver, BC, CA, 333-333
555 road and street place CA US 95000
2222 Main at King Edward Vancouver BC CA 333
555 road and street place CA US
it is counting on there being three words at the end for the city, state and country but other than that it's like ryansstack said, if it's random it won't work. if the city is two words like New York it won't work. yeah... regex isn't the tool for this one.
btw: tested on regexhero.net
i can think of 2 ways you can do this
1) if you know that "the rest" of your data after the address is exactly 2 fields, i.e. BC and CA, you can split your string using space as the delimiter and remove the last 2 items.
2) split on the delimiter /[A-Z][A-Z]/ and store the result in an array, then print out the array (this is provided that the address doesn't contain 2 or more consecutive capital letters)
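Option 1 could look roughly like this in Python (an assumption, since the question names no language; the function name strip_trailing_fields is made up). It only holds up if every record really ends in exactly the same number of extra fields.

```python
def strip_trailing_fields(address, n_fields=2):
    """Split on whitespace and drop the last n_fields items."""
    parts = address.split()
    if n_fields <= 0:
        # Nothing to drop; avoid parts[:-0], which would be an empty list.
        return " ".join(parts)
    return " ".join(parts[:-n_fields])


print(strip_trailing_fields("2222 Main at King Edward Vancouver BC CA"))
```

With only the two trailing fields removed, the city name is still in the result, so this covers the "exactly 2 fields" case as stated and nothing more.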