Regex (Posix) to get first word only, not including numbers - regex

New to Regex (which was recently added to SQL in DB2 for i). I don't know anything about the different engines but research indicates that it is "based on POSIX extended regular expressions".
I would like to get the street name (first non-numeric word) from an address.
e.g.
101 Main Street = Main
2/b Pleasant Ave = Pleasant
5H Unpleasant Crescent = Unpleasant
I'm sorry I don't have a string that isn't working, as suggested by the forum software. I don't even know where to start. I tried a few things I found in search but they either yielded nothing or the first "word" - i.e. the number (101, 2/b, 5H).
Thanks
Edit: Although it's looking as if IBM's implementation of regex on the DB2 family of databases may be too alien for many of the resident experts, I'll press ahead with some more detail in case it helps.
A plain English statement of the requirement would be:
Basic/acceptable: Find the first word/unbroken string that contains no numbers or special characters
Advanced/ideal: Find the first word that contains three or more characters, being only letters and zero or one embedded dash/hyphen, but no numbers or other characters.
Additional examples (original ones at top are still valid)
190 - 192 Tweety-bird avenue = Tweety-bird
190-192 Tweety-bird avenue = Tweety-bird
Charles Bronson Place = Charles
190H Charles-Bronson Place = Charles-Bronson
190 to 192 Charles Bronson Place = Charles
Second Edit:
Mooching around on the internet and trying every vaguely connected expression that I could find, I stumbled on this one:
[a-zA-Z]+(?:[\s-][a-zA-Z]+)*
which actually works pretty well - it gives the street name and street type, which on reflection would actually suit my purpose as well as the street name alone (I can easily expand common abbreviations - e.g. RD to ROAD - on the fly).
Sample SQL:
select HAD1,
regexp_substr(HAD1, '[a-zA-Z]+(?:[\s-][a-zA-Z]+)*')
from ECH
where HEDTE > 20190601
Sample output
Ship To Address Line 1         REGEXP_SUBSTR
32 CHRISTOPHER STREET          CHRISTOPHER STREET
250 - 270 FEATHERSTON STREET   FEATHERSTON STREET
118 MONTREAL STREET            MONTREAL STREET
7 BIRMINGHAM STREET            BIRMINGHAM STREET
59 MORRISON DRIVE              MORRISON DRIVE
118 MONTREAL STREET            MONTREAL STREET
MASON ROAD                     MASON ROAD
I know this wasn't exactly the question I asked, so apologies to anyone who could have done this but was following the original request faithfully.

Not sure if this is Posix compliant, but something like this could work: ^[\w\/]+?\s((\w+\s)+?)\s*\w+?$, example here.
The script assumes that the first chunk is the number of the building, the second chunk, is the name of the street, and the last chunk is Road/Ave/Blvd/etc.
This should also cater for street names which have white spaces in them.

Using the following regex matches your examples :
(?<=[^ ]+ )[^ ]*[ ]
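If the regex dialect gets in the way, the "advanced" requirement from the question (first word of three or more characters, letters only, with at most one embedded hyphen) can also be checked token by token outside the database. The following is only a minimal Python sketch, not DB2 syntax; the function name first_street_word and the min_len parameter are made up for illustration, and it assumes the hyphen counts towards the length.

```python
import re

# A token qualifies if it is letters only, optionally with ONE embedded hyphen.
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)?")

def first_street_word(address, min_len=3):
    """Return the first whitespace-separated token that looks like a street name.

    Hypothetical helper: min_len=3 follows the 'advanced' requirement in the
    question; the hyphen is counted as one of the characters (an assumption).
    """
    for token in address.split():
        if len(token) >= min_len and WORD.fullmatch(token):
            return token
    return None  # no qualifying word found
```

For example, first_street_word("190 to 192 Charles Bronson Place") skips "190", "to" and "192" and returns "Charles".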

Related

How can a regex catch all parts before a keyword from a finite set, but sometimes separated only by a single space

This question relates to PCRE regular expressions.
Part of my big dataset are address data like this:
12H MARKET ST. Canada
123 SW 4TH Street USA
ONE HOUSE USA
1234 Quantity Dr USA
123 Quality Court Canada
1234 W HWY 56A USA
12345 BERNARDO CNTR DRIVE Canada
12 VILLAGE PLAZA USA
1234 WEST SAND LAKE RD ?567 USA
1234 TELEGRAM BLVD SUITE D USA
1234-A SOUTHWEST FRWY USA
123 CHURCH STREET USA
123 S WASHINGTON USA
123 NW-SE BLVD USA
# USA
1234 E MAIN STREET USA
I would like to extract the street names including house numbers and additional information from these records. (Of course there are other things in those records and I already know how to extract them).
For the purpose of this question I just manually clipped the interesting part from the data for this example.
The number of words in the address parts is not known before. The only criterion I have found so far is to find the occurrence of country names belonging to some finite set, which of course is bigger than (USA|Canada). For brevity I limit my example just to those two countries.
This regular expression
([a-zA-Z0-9?\-#.]+\s)
already isolates the words making up what I am after, including one space after them. Unfortunately, there are cases where the to-be-extracted street information is separated from the country by only a single space, e.g. in the first and the last example.
Since I want to capture the matching parts glued together, I place a + sign behind my regular expression:
([a-zA-Z0-9?\-#.]+\s)+
but then in the two nasty cases with only one separating space before the country, the country is also caught!
Since I know the possible countries from looking at the data, I could try to exclude them by a look ahead-condition like this:
([a-zA-Z0-9?\-#.]+\s)(?!USA|Canada)
which excludes ST. from the match in the first line and STREET from the match in the last line. Of course the single capture groups are not yet glued together by this.
So I would add a plus sign to the group on the left:
([a-zA-Z0-9?\-#.]+\s)+(?!USA|Canada)
But then ST. and STREET and the Country, separated by only a single space, are caught again together with the country, which I want to exclude from my result!
How would you proceed in such a case?
If it would be possible by properly using regular expressions to replace each country name by the same one preceded by an additional space (or even to do this only for cases, where there is only a single space in front of one of the country-names), my problem would be solved. But I want to avoid such a substitution for the whole database in a separate run because a country name might appear in some other column too.
I am quite new to regular expressions and I have no idea how to do two processing steps onto the same input in sequence. - But maybe, someone has a better idea how to cope with this problem.
If I understand correctly, you want all content before the country (excluding spaces before the country). The country will always be present at the end of the line and comes from a list.
So you should be able to set the 'global' and 'multiline' options and then use the following regex:
^(.*?)(?=\s+(USA|Canada)\s*$)
Explanation:
^(.*?) match as few characters as possible from the start of the line
(?=\s+(USA|Canada)\s*$) look ahead for one or more spaces, followed by one of the country names, followed by zero or more spaces and end of line.
That should give you a list with all addresses.
Edit:
I have changed the first part to: (.*?), making it non-greedy. That way the match will stop at the last letter before country instead of including some spaces.
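To see the lazy match plus lookahead in action, here is a small Python sketch. Python's re module is not PCRE, but it supports the constructs used here. The names COUNTRIES and street_part are hypothetical, and the country list is just the two-entry subset from the question. Each line is processed on its own, so no multiline flag is needed.

```python
import re

# Illustrative subset; the real data would need the full country list.
COUNTRIES = ("USA", "Canada")

# Lazy capture, then a lookahead for whitespace + country + optional
# trailing whitespace at the end of the line.
_PATTERN = re.compile(
    r"(.*?)(?=\s+(?:" + "|".join(map(re.escape, COUNTRIES)) + r")\s*$)"
)

def street_part(line):
    """Return everything before the trailing country name, or None."""
    m = _PATTERN.match(line)
    return m.group(1) if m else None
```

Because the capture is lazy and the lookahead consumes nothing, the result stops at the last non-space character before the country, whether one space or several precede it.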

Regular expression for address field validation

I am trying to write a regular expression that accepts an address, for example 21-big walk way or 21 St.Elizabeth's drive. I came up with the following regular expression, but I am not sure how to incorporate all the characters (alphanumeric, space, dash, full stop, apostrophe).
"regexp=^[A-Za-z-0-99999999'
See the answer to this question on address validating with regex:
regex street address match
The problem is, street addresses vary so much in formatting that it's hard to code against them. If you are trying to validate addresses, finding if one isn't valid based on its format is mighty hard to do.
This would return the following address (253 N. Cherry St. ), anything with its same format:
\d{1,5}\s\w\.\s(\b\w*\b\s){1,2}\w*\.
This allows 1-5 digits for the house number, a space, a character followed by a period (for N. or S.), 1-2 words for the street name, finished with an abbreviation (like st. or rd.).
Because regex is used to see if things meet a standard or protocol (which you define), you probably wouldn't want to allow for the addresses provided above, especially the first one with the dash, since they aren't very standard. You can modify my pattern above to allow for them if you wish; for example, you could add
(-?)
to allow for a dash but not require one.
In addition, http://rubular.com/ is a quick and interactive way to learn regex. Try it out with the addresses above.
If you don't have a fixed format for the address as mentioned above, I would use a regex just to reject the symbols which are not used in addresses (special symbols such as &(%#$^). The result would be:
[A-Za-z0-9'\.\-\s\,]
Just to add to Serzas' answer (since I don't have enough rep to comment):
Letters and digits can effectively be replaced by \w (note that \w also matches the underscore).
Additionally, the apostrophe, comma, period and hyphen don't necessarily need a backslash inside a character class, as long as the hyphen is placed first or last so that it isn't read as a range.
My requirement also involved forward and back slashes, so I added those, and finally whitespace with \s. The working regex for me, as such, was:
pattern: "[\w',\\/.\s-]"
Regular expression for simple address validation
^[#.0-9a-zA-Z\s,-]+$
E.g. for Address match case
#1, North Street, Chennai - 11
E.g. for Address not match case
$1, North Street, Chennai # 11
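A quick way to try the pattern above against the two sample strings is a sketch like the following. It uses Python's re module purely for illustration; the helper name looks_like_simple_address is made up.

```python
import re

# Same character whitelist as the pattern above; the hyphen is last in the
# class, so it is a literal hyphen rather than a range.
SIMPLE_ADDRESS = re.compile(r"^[#.0-9a-zA-Z\s,-]+$")

def looks_like_simple_address(text):
    """True if every character is in the whitelist (format is not checked)."""
    return SIMPLE_ADDRESS.match(text) is not None
```

Note that this only rejects unexpected characters such as "$"; it says nothing about whether the text is a real address.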
I have successfully used:
Dim regexString = New StringBuilder
With regexString
    .Append("(?<h>^[\d]+[ ])(?<s>.+$)|") 'find the 2013 1st ambonstreet
    .Append("(?<s>^.*?)(?<h>[ ][\d]+[ ])(?<e>[\D]+$)|") 'find the 1-7-4 Dual Ampstreet 130 A
    .Append("(?<s>^[\D]+[ ])(?<h>[\d]+)(?<e>.*?$)|") 'find the Terheydenlaan 320 B3
    .Append("(?<s>^.*?)(?<h>\d*?$)") 'find the 245e oosterkade 9
End With
Dim Address As Match = Regex.Match(DataRow("customerAddressLine1"), regexString.ToString(), RegexOptions.Multiline)
If Not String.IsNullOrEmpty(Address.Groups("s").Value) Then StreetName = Address.Groups("s").Value
If Not String.IsNullOrEmpty(Address.Groups("h").Value) Then HouseNumber = Address.Groups("h").Value
If Not String.IsNullOrEmpty(Address.Groups("e").Value) Then Extension = Address.Groups("e").Value
The regex tries each alternative in turn: if one finds no result, it moves on to the next. If no result is found, none of the 4 formats were present.
This one worked for me:
\d+[ ](?:[A-Za-z0-9.-]+[ ]?)+(?:Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)\.?
The source: https://www.codeproject.com/Tips/989012/Validate-and-Find-Addresses-with-RegEx
Regex is a very bad choice for this kind of task. Try to find a web service or an address database or a product which can clean address data instead.
Related:
Address validation using Google Maps API
As a simple one-line expression, I recommend this:
^([a-zA-Z0-9/\\''(),\s-]{2,255})$
I needed
STREET # | STREET | CITY | STATE | ZIP
So I wrote the following regex
[0-9]{1,5}( [a-zA-Z.]*){1,4},?( [a-zA-Z]*){1,3},? [a-zA-Z]{2},? [0-9]{5}
This allows
a street number of 1-5 digits
1-4 street description words
1-3 city words
a 2-character state
a 5-digit zip code
I also made the comma optional between street, city, state and zip
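As a rough check of the street/city/state/zip pattern above, the following Python sketch applies it with fullmatch so that the whole string has to fit the layout. The sample address and the helper name is_us_style_address are invented for illustration.

```python
import re

# The pattern from the answer, unchanged.
US_ADDRESS = re.compile(
    r"[0-9]{1,5}( [a-zA-Z.]*){1,4},?( [a-zA-Z]*){1,3},? [a-zA-Z]{2},? [0-9]{5}"
)

def is_us_style_address(text):
    """True if the whole string fits 'number street, city, ST 12345'."""
    return US_ADDRESS.fullmatch(text) is not None
```

Using fullmatch instead of search is a design choice of this sketch: it avoids accepting a valid-looking fragment buried in a longer string.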
Here is the approach I have taken to finding addresses using regular expressions:
A set of patterns is useful to find many forms that we might expect from an address starting with simply a number followed by set of strings (ex. 1 Basic Road) and then getting more specific such as looking for "P.O. Box", "c/o", "attn:", etc.
Below is a simple test in python. The test will find all the addresses but not the last 4 items which are company names. This example is not comprehensive, but can be altered to suit your needs and catch examples you find in your data.
import re
strings = [
'701 FIFTH AVE',
'2157 Henderson Highway',
'Attn: Patent Docketing',
'HOLLYWOOD, FL 33022-2480',
'1940 DUKE STREET',
'111 MONUMENT CIRCLE, SUITE 3700',
'c/o Armstrong Teasdale LLP',
'1 Almaden Boulevard',
'999 Peachtree Street NE',
'P.O. BOX 2903',
'2040 MAIN STREET',
'300 North Meridian Street',
'465 Columbus Avenue',
'1441 SEAMIST DR.',
'2000 PENNSYLVANIA AVENUE, N.W.',
'465 Columbus Avenue',
'28 STATE STREET',
'P.O, Drawer 800889.',
'2200 CLARENDON BLVD.',
'840 NORTH PLANKINTON AVENUE',
'1025 Connecticut Avenue, NW',
'340 Commercial Street',
'799 Ninth Street, NW',
'11318 Lazarro Ln',
'P.O, Box 65745',
'c/o Ballard Spahr LLP',
'8210 SOUTHPARK TERRACE',
'1130 Connecticut Ave., NW, Suite 420',
'465 Columbus Avenue',
"BANNER & WITCOFF , LTD",
"CHIP LAW GROUP",
"HAMMER & ASSOCIATES, P.C.",
"MH2 TECHNOLOGY LAW GROUP, LLP",
]
# Raw strings keep the backslashes intact and avoid invalid-escape warnings.
patterns = [
r"c/o [\w ]{2,}",
r"C/O [\w ]{2,}",
r"P\.O\. [\w ]{2,}",
r"P\.O, [\w ]{2,}",
r"[\w.]{2,5} BOX \d{2,8}",
r"^[#\d]{1,7} [\w ]{2,}",
r"[A-Z]{2} \d{5}",
r"Attn: \w{2,}",
r"ATTN: \w{2,}",
r"Attention: \w{2,}",
r"ATTENTION: \w{2,}"
]
contact_list = []
total_count = len(strings)
found_count = 0
for string in strings:
    # Try each pattern in order and stop at the first one that matches,
    # so that each input line is counted at most once.
    for pat_no, pattern in enumerate(patterns, start=1):
        match = re.search(pattern, string.strip())
        if match:
            print("Item found: " + match.group(0) + " | Pattern no: " + str(pat_no))
            found_count += 1
            break
print("-- Total: " + str(total_count) + " Found: " + str(found_count))
UiPath Academy training video lists this RegEx for US addresses (and it works fine for me):
\b\d{1,8}(-)?[a-z]?\W[a-z|\W|\.]{1,}\W(road|drive|avenue|boulevard|circle|street|lane|way|rd\.|st\.|dr\.|ave\.|blvd\.|cir\.|ln\.|rd|dr|ave|blvd|cir|ln)
I had a different use case - find any addresses in logs and scold application developers (favourite part of a devops job). I had the advantage of having the word "address" in the pattern but should work without that if you have specific field to scan
\baddress.[0-9\\\/# ,a-zA-Z]+[ ,]+[0-9\\\/#, a-zA-Z]{1,}
Look for the word "address" - skip this if not applicable
Look for first part numbers, letters, #, space - Unit Number / street number/suite number/door number
Separated by a space or comma
Look for one or more of rest of address numbers, letters, #, space
Tested against :
1 Sleepy Boulevard PO, Box 65745
Suite #100 /98,North St,Snoozepura
Ave., New Jersey,
Suite 420 1130 Connect Ave., NW,
Suite 420 19 / 21 Old Avenue,
Suite 12, Springfield, VIC 3001
Suite#100/98 North St Snoozepura
This worked for me when there were street addresses with unit/suite numbers, zip codes, only street. It also didn't match IP addresses or mac addresses. Worked with extra spaces.
This assumes users are normal people who separate elements of a street address with a comma, hash sign, or space, and not psychopaths who use characters like "|" or ":"!
I use this for French addresses, and for some international addresses too.
[\\D+ || \\d]+\\d+[ ||,||[A-Za-z0-9.-]]+(?:[Rue|Avenue|Lane|... etcd|Ln|St]+[ ]?)+(?:[A-Za-z0-9.-](.*)]?)
Inspired by the responses given here, I came up with these 2 solutions:
support optional uppercase
support french also
regex structure
numbers (required)
letters, chars and spaces
at least one common address keyword (required)
as many chars you want before the line break
definitions:
accuracy
ability to detect real addresses while rejecting text that merely looks like an address.
range
ability to detect uncommon addresses.
Regex 1:
high accuracy
low range
/[0-9]+[ |[a-zà-ú.,-]* ((highway)|(autoroute)|(north)|(nord)|(south)|(sud)|(east)|(est)|(west)|(ouest)|(avenue)|(lane)|(voie)|(ruelle)|(road)|(rue)|(route)|(drive)|(boulevard)|(circle)|(cercle)|(street)|(cer\.)|(cir\.)|(blvd\.)|(hway\.)|(st\.)|(aut\.)|(ave\.)|(ln\.)|(rd\.)|(hw\.)|(dr\.)|(a\.))([ .,-]*[a-zà-ú0-9]*)*/i
regex 2:
low accuracy
high range
/[0-9]*[ |[a-zà-ú.,-]* ((highway)|(autoroute)|(north)|(nord)|(south)|(sud)|(east)|(est)|(west)|(ouest)|(avenue)|(lane)|(voie)|(ruelle)|(road)|(rue)|(route)|(drive)|(boulevard)|(circle)|(cercle)|(street)|(cer\.?)|(cir\.?)|(blvd\.?)|(hway\.?)|(st\.?)|(aut\.?)|(ave\.?)|(ln\.?)|(rd\.?)|(hw\.?)|(dr\.?)|(a\.))([ .,-]*[a-zà-ú0-9]*)*/i
This one works well for me
^(\d+) ?([A-Za-z](?= ))? (.*?) ([^ ]+?) ?((?<= )APT)? ?((?<= )\d*)?$
Source : https://community.alteryx.com/t5/Alteryx-Designer-Discussions/RegEx-Addresses-different-formats-and-headaches/td-p/360147
Here is my RegEx for address, city & postal validation rules
validation rules:
address -
1 - 40 characters length.
Letters, numbers, space and . , : ' #
city -
1 - 19 characters length
Only Alpha characters are allowed
Spaces are allowed
postalCode -
The USA zip must meet the following criteria and is required:
Minimum of 5 digits (9 digits if zip + 4 is provided)
Numeric only
A Canadian postal code is a six-character string.
in the format A1A 1A1, where A is a letter and 1 is a digit.
a space separates the third and fourth characters.
do not include the letters D, F, I, O, Q or U.
the first position does not make use of the letters W or Z.
address: ^[a-zA-Z0-9 .,#;:'-]{1,40}$
city: ^[a-zA-Z ]{1,19}$
usaPostal: ^([0-9]{5})(?:[-]?([0-9]{4}))?$
canadaPostal : ^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$
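The four rules above can be exercised with a short Python sketch like this one. The dictionary name RULES and the function validate are hypothetical; the patterns are copied unchanged. Note that the address pattern, as written, also admits ";" and "-" in addition to the characters listed in the rules.

```python
import re

# Patterns copied from the answer; keys mirror its field names.
RULES = {
    "address": re.compile(r"^[a-zA-Z0-9 .,#;:'-]{1,40}$"),
    "city": re.compile(r"^[a-zA-Z ]{1,19}$"),
    "usaPostal": re.compile(r"^([0-9]{5})(?:[-]?([0-9]{4}))?$"),
    "canadaPostal": re.compile(
        r"^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$"
    ),
}

def validate(field, value):
    """True if value satisfies the rule registered for field."""
    return RULES[field].match(value) is not None
```

In the Canadian pattern, the negative lookahead rejects any of D, F, I, O, Q, U anywhere in the code, and the first class [A-VXY] additionally rules out W and Z in the first position.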
\b(\d{1,8}[a-z]?[0-9\/#- ,a-zA-Z]+[ ,]+[.0-9\/#, a-zA-Z]{1,})\n
A more dynamic approach to #micah would be the following:
(?'Address'(?'Street'[0-9][a-zA-Z\s]),?\s*(?'City'[A-Za-z\s]),?\s(?'Country'[A-Za-z])\s(?'Zipcode'[0-9]-?[0-9]))
It won't care about individual lengths of segments of code.
https://regex101.com/r/nuy7hB/1

Regular Expression for England only Postcode

I have an Asp.Net website and I want to use a RegularExpressionValidator to check if a UK postcode is English (i.e. it's not Scottish, Welsh or N.Irish).
It should be possible to see if the postcode is English by using just the letters from the first segment (called the Postcode Area). In total there are 124 postcode areas and this is a list of them.
From that list, the following postcode areas are not in England.
ZE,KW,IV,HS,PH,AB,DD,PA,FK,G,KY,KA,DG,TD,EH,ML (Scotland)
LL,SY,LD,HR,NP,CF,SA (Wales)
BT (N.Ireland)
The input to the regex may be the whole postcode, or it might just be the postcode area.
Can anyone help me create a regular expression that will match only if a given postcode is English?
EDIT - Solution
With help from several posters I was able to create the following regex, which I've tested successfully against over 1500 test cases.
^(AL|B|B[ABDHLNRS]|C[ABHMORTVW]|D[AEHLNTY]|E|E[CNX]|FY|G[LUY]|H[ADGPUX]|I[GMP]|JE|KT|L|L[AENSU]|M|ME|N|N[EGNRW]|O[LX]|P[ELOR]|R[GHM]|S|S[EGKLMNOPRSTW]|T[AFNQRSW]|UB|W|W[ACDFNRSV]|YO)\d{1,2}\s?(\d[\w]{2})?
I've already answered once, making the point that it's not possible to come up with a 100% correct England-only regex (since the postcode areas don't lie along political boundaries).
However I've dug a bit deeper into this, and ... well it is possible, but it's a lot of work.
To verify an England-only postcode, you need to exclude the non-English postcodes. The easy ones are:
BT (Northern Ireland)
IM (Isle of Man)
JE (Jersey)
GG (Guernsey)
BF (British Forces)
BX (non-geographic UK postcodes)
GIR (Girobank, which is also non-geographic)
(I'm not going to mention UK-style postcodes for territories outside the UK, like St Helena, Gibraltar etc. Technically speaking, the Isle of Man and Channel Islands aren't part of the UK either, but they're much nearer by, and more closely tied into the Royal Mail system in the UK.)
The purely Scottish postcode areas are (as you mentioned):
ZE,KW,IV,HS,PH,AB,DD,PA,FK,G,KY,KA,EH,ML
DG and TD are nominally Scottish, and are for the most part in Scotland. However some areas extend over the Scotland-England border as follows:
DG16 - a tiny bit in England
TD9 - a tiny bit in England
TD12 - half in England
TD15 - mostly in England
The breakdown is as follows:
DG16 is in Scotland except for the following English postcodes:
DG16 5H[TUZ]
DG16 5J[AB]
TD9 is in Scotland except for TD9 0T[JPRSTUW]
TD12 has only one sector (TD12 4), which is spread roughly half and half across England and Scotland:
TD12 4[ABDEHJLN] are in Scotland
TD12 4[QRSTUWX] are in England
TD15 is the most complicated. There are 3 sectors, of which TD15 2 and TD15 9 are entirely in England.
TD15 1 is split across England and Scotland.
Postcodes beginning as follows are in Scotland:
TD15 1T
TD15 1X
... except for these English postcodes:
TD15 1T[ABQUX]
TD15 1XX
All other postcodes in TD15 1 are in England. The only exceptions fall under the following prefixes:
TD15 1B
TD15 1S (i.e. TD15 1S[ABEJLNPWXY])
TD15 1U (i.e. TD15 1U[BDENPQRTUXY])
... which are also in England, apart from the following postcodes, which are in Scotland:
TD15 1BT
TD15 1S[UZ]
TD15 1U[FGHJLSZ]
The English postcode areas CA and NE lie on the other side of the England-Scotland border, however they never extend into Scotland.
In fact, the last two letters of a UK postcode are based on how the postman actually delivers post (as far as I'm aware), so it can't be taken for granted that a postcode will fall inside a political boundary. Thus if there's a group of houses which straddle the border, then it's possible that the entire postcode (i.e. at the most fine-grained level) does not lie entirely within either England or Scotland. E.g. TD9 0TJ and TD15 1UZ are very close to the border, and I don't really know for sure if they're entirely on one side or not.
The England-Wales border is also complicated, however I'll leave this as an exercise for the reader.
There are 124 Postcode Areas in the UK.
-- PAF® statistics August 2012, via List of postcodes in the United Kingdom (Wikipedia).
I recommend breaking your problem down into two parts (think functions):
Is the postcode valid?
UK Postcode Regex (Comprehensive)
Is the postcode English?
This can be broken down further:
Not Scottish:
! /^(ZE|KW|IV|HS|PH|AB|DD|PA|FK|G|KY|KA|DG|TD|EH|ML)[0-9]/
Not Welsh:
! /^(LL|SY|LD|HR|NP|CF|SA)[0-9]/
Not Northern Irish, Manx, from the Channel Islands, ...
et cetera...
or you could just check that the Postcode Area is among the hundred or so English ones, depending on how you want to optimise ☻
Note that the syntax will vary according to your programming language.
Doing all this in one regular expression would soon become unmanageable.
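As a rough illustration of the "is it English?" half, here is a Python sketch that extracts the postcode area and checks it against an exclusion set. The area lists are the ones quoted in this thread (Scotland, Wales, plus BT, IM, JE and GG); the names NON_ENGLISH_AREAS, postcode_area and is_probably_english are made up. It deliberately ignores the border cases discussed elsewhere in the thread (DG, TD, CH and so on), so treat the result as an approximation.

```python
import re

# Areas listed in this thread as not English (approximate at the borders).
NON_ENGLISH_AREAS = {
    # Scotland
    "ZE", "KW", "IV", "HS", "PH", "AB", "DD", "PA", "FK", "G",
    "KY", "KA", "DG", "TD", "EH", "ML",
    # Wales
    "LL", "SY", "LD", "HR", "NP", "CF", "SA",
    # Northern Ireland, Isle of Man, Jersey, Guernsey
    "BT", "IM", "JE", "GG",
}

def postcode_area(postcode):
    """Return the leading 1-2 letters, or None if the input has no area.

    Accepts either a whole postcode ('EH1 1YZ') or just the area ('EH').
    """
    m = re.match(r"\s*([A-Za-z]{1,2})(?=\d|\s*$)", postcode)
    return m.group(1).upper() if m else None

def is_probably_english(postcode):
    """True if the area is recognised and not in the exclusion set."""
    area = postcode_area(postcode)
    return area is not None and area not in NON_ENGLISH_AREAS
```

This does not validate the postcode itself; as suggested above, that would be a separate function.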
It's not possible to come up with an England-only regex, because the postcode areas don't lie along political boundaries, at least not at the postcode area or district level.
For example, CH1 is in England, and CH5 is in Wales.
At the postcode district level there are still problems, for example TD12 is half in England, half in Scotland.
The only area which you can rely on is BT (Northern Ireland)
Use ^(AB|AL|B| ... )$, where the ... is where you fill the rest of the valid ones in, separated by pipes (|).
EDIT: There's a boatload of information here: http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom
If you were to include the in/out codes, it would be something like ^(AB|AL|B| ... )([\d\w]{3})\s([\d\w]{3})$, which would get the rest of the code.
EDIT
^(A[BL]|B[ABDHLNRST]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[CNX]?|F[KY]|G[LUY]|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTWY]?|T[AFNQRSW]|UB|W[ACDFNRSV]?|YO|ZE)([\w\d]{1,2})\s?([\w\d]{3})$
Part of this regex is taken from another one of the answers. It matches the valid postcodes, then 1 to 2 {1,2} letters \w or numbers \d, an optional space \s?, then 3 letters or numbers. Hope that helps.
These are the regexes I put together, which follow the Royal Mail defined standards for all UK postcode types:
Standard UK PostCodes:
/^([A-PR-UWYZ](?:[0-9]{1,2}|[0-9][A-HJKMNPR-Y]|[A-HK-Y][0-9]{1,2}|[A-HK-Y][0-9][ABEHMNPRVWXY]))\s*([0-9][ABD-HJLNP-UW-Z]{2})$/i
GiroBank PostCodes:
/^(GIR)\s*(0AA)$/i
UK Overseas Territories:
/^([A-Z]{4})\s*(1ZZ)$/i
British Forces Post Office:
/^(BFPO)\s*(?:(c\/o)\s*)?((?(2)[0-9]{1,3}|[0-9]{1,4}))$/i
And this is the function I wrote which validates a postcode against these four types and allows type detection:
public function UKPostCode(&$strPostCode, &$strError = null, &$strType = null, $ReturnFormatted = true) {
    $strStrippedPostCode = preg_replace("/[\s\-]/i", "", $strPostCode);
    if (empty($strStrippedPostCode)) {
        $strError = $this->__getErrorMessage("Post", "EMPTY_POST");
        return false;
    }
    $arrRegExp = array(
        "STD" => "/^([A-PR-UWYZ](?:[0-9]{1,2}|[0-9][A-HJKMNPR-Y]|[A-HK-Y][0-9]{1,2}|[A-HK-Y][0-9][ABEHMNPRVWXY]))\s*([0-9][ABD-HJLNP-UW-Z]{2})$/i",
        "GIR" => "/^(GIR)\s*(0AA)$/i",
        "OST" => "/^([A-Z]{4})\s*(1ZZ)$/i",
        "BFPO" => "/^(BFPO)\s*(?:(c\/o)\s*)?((?(2)[0-9]{1,3}|[0-9]{1,4}))$/i"
    );
    foreach ($arrRegExp as $strPostCodeType => $strExpression) {
        if (preg_match($strExpression, $strPostCode, $arrMatches)) {
            if ($ReturnFormatted !== null) {
                array_shift($arrMatches);
                $strPostCode = implode(" ", array_filter($arrMatches));
                $strPostCode = ((bool)$ReturnFormatted === true) ? strtoupper($strPostCode) : strtolower($strPostCode);
            }
            $strType = $strPostCodeType;
            return true;
        }
    }
    $strError = $this->__getErrorMessage("Post", "INVALID_POST");
    return false;
}
Hope this helps
'A[BL]|B[ABDHLNRST]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[CNX]?|F[KY]|G[LUY]|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTWY]?|T[AFNQRSW]|UB|W[ACDFNRSV]?|YO|ZE'

Regular expression for dividing country calling codes

I have a list of calling codes for all countries (the phone number prefixes). I would like to split them into the country name and the actual code so I can put them into an XML file.
I have tried back and forth but cannot get a regexp going that takes all cases into account.
I think it is fairly simple for someone with a bit of experience.
The codes have these formats:
Afghanistan 93
Anguilla 1 264
Antarctica 6721
Antigua and Barbuda 1 268
Bosnia and Herzegovina 387
Canada 1
Congo, Republic of the 242
Cote d'Ivoire 225
Ireland (Eire) 353
United States of America 1
There are around 235 of them in total, but these are the regulars and the exceptions.
^[a-zA-Z\s,'()] for between 1 and X words and then it is [0-9\s]{1,5}$ for the numbers:
X
XX
XXX
XXXX
X XXX
So if I should express it as a sentence it would be: "from beginning of a line, take all characters (1) including space,'() until you encounter digits, then take all of these including space(2) until you encounter a line break."
I am using TextMate, and the docs says:
TextMate uses the Oniguruma regular expression library by K. Kosako.
I would appreciate any help given:)
Thank you.
This posix regex should be sufficient: ^[a-zA-Z ]+[0-9 ]+$
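Note that the character class in that pattern has no comma, apostrophe or parentheses, so entries such as "Congo, Republic of the 242", "Cote d'Ivoire 225" and "Ireland (Eire) 353" would not match it, and it has no capture groups to separate the two parts. One possible variant is sketched below in Python rather than TextMate; the names ENTRY and split_calling_code are made up. It takes "everything that is not a digit" as the name and "digits and spaces" as the code.

```python
import re

# Group 1: lazily match non-digits (the country name).
# Group 2: a digit followed by any mix of digits and spaces (the code).
ENTRY = re.compile(r"^(\D+?)\s+(\d[\d ]*)$")

def split_calling_code(line):
    """Return (country_name, code) or None if the line does not fit."""
    m = ENTRY.match(line.strip())
    if not m:
        return None
    return m.group(1), m.group(2)
```

The same two capture groups should carry over to a search-and-replace in an editor, though the replacement syntax depends on the tool.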

Extract a portion of text using RegEx

I would like to extract portion of a text using a regular expression. So for example, I have an address and want to return just the number and streets and exclude the rest:
2222 Main at King Edward Vancouver BC CA
But the addresses vary in format most of the time. I tried using a lookahead regex and came up with this expression:
.*?(?=\w* \w* \w{2}$)
The above expression handles the above example nicely, but it gets way too messy as soon as commas come into the text, or postal codes, which can be a 6-character string or two 3-character strings with a space in the middle, etc.
Is there a more elegant way of extracting a portion of text than a lookahead regex?
Any suggestion or a point in another direction is greatly appreciated.
Thanks!
Regular expressions are for data that is REGULAR, that follows a pattern. So if your data is completely random, no, there's no elegant way to do this with regex.
On the other hand, if you know what values you want, you can probably write a few simple regexes, and then just test them all on each string.
Ex.
regex1= address # grabber, regex2 = street type grabber, regex3 = name grabber.
Attempt a match on string1 with regex1, regex2, and finally regex3. Move on to the next string.
Well, I thought I'd throw my hat into the ring:
.*(?=,? ([a-zA-Z]+,?\s){3}([\d-]*\s)?)
You might want ^ or \d+ at the front for good measure.
I didn't bother specifying lengths for the postal codes; this one accepts any number of digits and hyphens.
It works for these inputs so far, and for variations on commas within the city/state/country area:
2222 Main at King Edward Vancouver, BC, CA, 333-333
555 road and street place CA US 95000
2222 Main at King Edward Vancouver BC CA 333
555 road and street place CA US
It counts on there being three words at the end for the city, state and country, but other than that it's like ryansstack said: if it's random it won't work. If the city is two words, like New York, it won't work. Yeah... regex isn't the tool for this one.
btw: tested on regexhero.net
i can think of 2 ways you can do this
1) if you know that "the rest" of your data after the address is exactly 2 fields, ie BC and CA, you can do split on your string using space as delimiter, remove the last 2 items.
2) do a split on delimiter /[A-Z][A-Z]/ and store the result in array. then print out the array ( this is provided that the address doesn't contain 2 or more capital letters)
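Option 1 needs no regex at all. A minimal Python sketch, assuming the number of trailing fields is known in advance (the function name drop_trailing_fields and its n_fields parameter are made up for illustration):

```python
def drop_trailing_fields(address, n_fields=2):
    """Split on whitespace and discard the last n_fields items."""
    parts = address.split()
    if n_fields <= 0:
        return " ".join(parts)
    return " ".join(parts[:-n_fields])
```

With the example from the question, n_fields=2 removes "BC CA", and n_fields=3 also removes the city, leaving "2222 Main at King Edward". A two-word city would need a different count, which is the same limitation the other answers mention.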