Regex for extracting names of colleges, universities, and institutes? - regex

I have a bunch of strings like this in a file:
M.S., Arizona University, Tucson, Az., 1957
B.A., American International College, Springfield, Mass., 1978
B.A., American University, Washington, D.C., 1985
and I'd like to extract Tufts University, American International College, American University, University of Massachusetts, etc, but not the high schools (it's probably safe to assume that if it contains "Academy" or "High School" that it's a high school). Any ideas?

Tested with preg_match_all in PHP, will work for the sample text you provided:
/(?<=,)[\w\s]*(College|University|Institute)[^,\d]*(?=,|\d)/
Will need to be modified somewhat if your regex engine does not support lookaheads/lookbehinds.
Update: I looked at your linked sample text & updated the regex accordingly
/([A-Z][^\s,.]+[.]?\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\d]*(?=,|\d)/
The first part will match a word starting with a capital letter, optionally followed by a period (.), then a whitespace character, then optionally an opening parenthesis "(". This group is matched zero or more times.
This should get all relevant words preceding the keywords.
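A minimal Python sketch of the first pattern, run against the three sample lines from the question. Python's re module stands in for PHP's preg_match_all here, the helper name extract_schools is made up for illustration, and the keyword group is made non-capturing so that the whole match is returned:

```python
import re

# First pattern from the answer; the keyword group is non-capturing here
# so that group(0) is the full institution name.
SCHOOL = re.compile(r"(?<=,)[\w\s]*(?:College|University|Institute)[^,\d]*(?=,|\d)")

def extract_schools(lines):
    """Return the first institution name found on each line (hypothetical helper)."""
    names = []
    for line in lines:
        match = SCHOOL.search(line)
        if match:
            # The match starts right after the comma, so trim the leading space.
            names.append(match.group(0).strip())
    return names

samples = [
    "M.S., Arizona University, Tucson, Az., 1957",
    "B.A., American International College, Springfield, Mass., 1978",
    "B.A., American University, Washington, D.C., 1985",
]
print(extract_schools(samples))
```

Because "Academy" and "High School" are not in this keyword list, lines that mention only those are skipped without any extra filtering.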

Related

How can a regex catch all parts before a keyword from a finite set, but sometimes separated only by a single space

This question relates to PCRE regular expressions.
Part of my big dataset are address data like this:
12H MARKET ST. Canada
123 SW 4TH Street USA
ONE HOUSE USA
1234 Quantity Dr USA
123 Quality Court Canada
1234 W HWY 56A USA
12345 BERNARDO CNTR DRIVE Canada
12 VILLAGE PLAZA USA
1234 WEST SAND LAKE RD ?567 USA
1234 TELEGRAM BLVD SUITE D USA
1234-A SOUTHWEST FRWY USA
123 CHURCH STREET USA
123 S WASHINGTON USA
123 NW-SE BLVD USA
# USA
1234 E MAIN STREET USA
I would like to extract the street names including house numbers and additional information from these records. (Of course there are other things in those records and I already know how to extract them).
For the purpose of this question I just manually clipped the interesting part from the data for this example.
The number of words in the address parts is not known before. The only criterion I have found so far is to find the occurrence of country names belonging to some finite set, which of course is bigger than (USA|Canada). For brevity I limit my example just to those two countries.
This regular expression
([a-zA-Z0-9?\-#.]+\s)
already isolates the words making up what I am after, including one space after each of them. Unfortunately there are cases where the street information to be extracted is separated from the country by only a single space, e.g. in the first and in the last example.
Since I want to capture the matching parts glued together, I place a + sign behind my regular expression:
([a-zA-Z0-9?\-#.]+\s)+
but then in the two nasty cases with only one separating space before the country, the country is also caught!
Since I know the possible countries from looking at the data, I could try to exclude them with a negative lookahead like this:
([a-zA-Z0-9?\-#.]+\s)(?!USA|Canada)
which excludes ST. from the match in the first line and STREET from the match in the last line. Of course the single capture groups are not yet glued together by this.
So I would add a plus sign to the group on the left:
([a-zA-Z0-9?\-#.]+\s)+(?!USA|Canada)
But then ST. and STREET, which are separated from the country by only a single space, are caught again together with the country, which I want to exclude from my result!
How would you proceed in such a case?
If it would be possible by properly using regular expressions to replace each country name by the same one preceded by an additional space (or even to do this only for cases, where there is only a single space in front of one of the country-names), my problem would be solved. But I want to avoid such a substitution for the whole database in a separate run because a country name might appear in some other column too.
I am quite new to regular expressions and I have no idea how to do two processing steps onto the same input in sequence. - But maybe, someone has a better idea how to cope with this problem.
If I understand correctly, you want all content before the country (excluding spaces before the country). The country will always be present at the end of the line and comes from a list.
So you should be able to set the 'global' and 'multiline' options and then use the following regex:
^(.*?)(?=\s+(USA|Canada)\s*$)
Explanation:
^(.*?) match as few characters as possible from the start of the line
(?=\s+(USA|Canada)\s*$) look ahead for one or more spaces, followed by one of the country names, followed by zero or more spaces and end of line.
That should give you a list with all addresses.
Edit:
I have changed the first part to: (.*?), making it non-greedy. That way the match will stop at the last letter before country instead of including some spaces.
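As a rough illustration, here is the same pattern in Python's re module rather than PCRE. Each record is matched on its own instead of using the multiline option, the helper name street_part is invented for this sketch, and the country list is limited to the two names used in the question:

```python
import re

# Lazy capture of everything before the trailing country name.
# The inner group is non-capturing; only two countries are listed, as in the question.
ADDRESS = re.compile(r"^(.*?)(?=\s+(?:USA|Canada)\s*$)")

def street_part(record):
    """Return the text before the country, or None if no listed country ends the record."""
    match = ADDRESS.match(record)
    return match.group(1) if match else None

records = [
    "12H MARKET ST. Canada",            # single space before the country
    "123 SW 4TH Street    USA",         # several spaces before the country
    "1234 WEST SAND LAKE RD ?567 USA",
    "# USA",
]
for record in records:
    print(repr(street_part(record)))
```

Since the lookahead insists that the country is followed only by optional whitespace and the end of the record, a country name appearing earlier in the street information should not cut the match short.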

Regex (Posix) to get first word only, not including numbers

New to Regex (which was recently added to SQL in DB2 for i). I don't know anything about the different engines but research indicates that it is "based on POSIX extended regular expressions".
I would like to get the street name (first non-numeric word) from an address.
e.g.
101 Main Street = Main
2/b Pleasant Ave = Pleasant
5H Unpleasant Crescent = Unpleasant
I'm sorry I don't have a string that isn't working, as suggested by the forum software. I don't even know where to start. I tried a few things I found in search but they either yielded nothing or the first "word" - i.e. the number (101, 2/b, 5H).
Thanks
Edit: Although it's looking as if IBM's implementation of regex on the DB2 family of databases may be too alien for many of the resident experts, I'll press ahead with some more detail in case it helps.
A plain English statement of the requirement would be:
Basic/acceptable: Find the first word/unbroken string that contains no numbers or special characters
Advanced/ideal: Find the first word that contains three or more characters, being only letters and zero or one embedded dash/hyphen, but no numbers or other characters.
Additional examples (original ones at top are still valid)
190 - 192 Tweety-bird avenue = Tweety-bird
190-192 Tweety-bird avenue = Tweety-bird
Charles Bronson Place = Charles
190H Charles-Bronson Place = Charles-Bronson
190 to 192 Charles Bronson Place = Charles
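The "advanced/ideal" requirement can be sketched outside the database as a token filter. This is a Python illustration only, not DB2 SQL; the function name street_name is hypothetical, and it assumes that words are separated by whitespace and that the three-character minimum counts the hyphen:

```python
import re

# A token qualifies if it consists of letters only, with at most one embedded hyphen.
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)?")

def street_name(address, min_length=3):
    """Return the first qualifying whitespace-separated token, or None."""
    for token in address.split():
        if len(token) >= min_length and WORD.fullmatch(token):
            return token
    return None

examples = [
    "101 Main Street",
    "2/b Pleasant Ave",
    "5H Unpleasant Crescent",
    "190 - 192 Tweety-bird avenue",
    "190-192 Tweety-bird avenue",
    "Charles Bronson Place",
    "190H Charles-Bronson Place",
    "190 to 192 Charles Bronson Place",
]
for example in examples:
    print(example, "=", street_name(example))
```

The length check is what skips "to" in the last example; dropping it would give the "basic/acceptable" behaviour instead.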
Second Edit:
Mooching around on the internet and trying every vaguely connected expression that I could find, I stumbled on this one:
[a-zA-Z]+(?:[\s-][a-zA-Z]+)*
which actually works pretty well - it gives the street name and street type, which on reflection would actually suit my purpose as well as the street name alone (I can easily expand common abbreviations - e.g. RD to ROAD - on the fly).
Sample SQL:
select HAD1,
regexp_substr(HAD1, '[a-zA-Z]+(?:[\s-][a-zA-Z]+)*')
from ECH
where HEDTE > 20190601
Sample output
Ship To Address Line 1          REGEXP_SUBSTR
32 CHRISTOPHER STREET           CHRISTOPHER STREET
250 - 270 FEATHERSTON STREET    FEATHERSTON STREET
118 MONTREAL STREET             MONTREAL STREET
7 BIRMINGHAM STREET             BIRMINGHAM STREET
59 MORRISON DRIVE               MORRISON DRIVE
118 MONTREAL STREET             MONTREAL STREET
MASON ROAD                      MASON ROAD
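For comparison, this is how the expression found in the second edit behaves in Python's re module (DB2's REGEXP_SUBSTR may differ in details, so treat this only as a sketch; first_letter_run is a made-up helper name). It reproduces the sample output, but it also shows a caveat with the earlier examples: a letter attached to the house number, or a short word such as "to", starts the match.

```python
import re

# The expression from the second edit: a run of letters, optionally continued
# by further runs of letters joined with whitespace or a hyphen.
FOUND = re.compile(r"[a-zA-Z]+(?:[\s-][a-zA-Z]+)*")

def first_letter_run(address):
    """Return the first match of the expression, or None."""
    match = FOUND.search(address)
    return match.group(0) if match else None

for address in [
    "250 - 270 FEATHERSTON STREET",
    "MASON ROAD",
    "5H Unpleasant Crescent",             # the H of 5H starts the match
    "190 to 192 Charles Bronson Place",   # "to" is matched first
]:
    print(repr(first_letter_run(address)))
```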
I know this wasn't exactly the question I asked, so apologies to anyone who could have done this but was following the original request faithfully.
Not sure if this is POSIX compliant, but something like this could work: ^[\w\/]+?\s((\w+\s)+?)\s*\w+?$
The expression assumes that the first chunk is the number of the building, the second chunk is the name of the street, and the last chunk is Road/Ave/Blvd/etc.
This should also cater for street names which have white spaces in them.
The following regex matches your examples:
(?<=[^ ]+ )[^ ]*[ ]

Using regex and vba, extracting parts of data

I have an Excel spreadsheet whose contents are formatted like this:
Street Name, Street Number Street Direction (may not be present; represented by one of N, S, W, E)
So it could look like John Doe Ave, 900 E or Jane Doe DR, 100
However, the people who used this spreadsheet put in business names or other information that shouldn't be present.
The part I'm stuck on is the regex pattern; I'm not familiar with regex and it confuses me.
I have this variable:
Dim strPattern As String: strPattern = "^(.+),\s(\d+)\s([NWSEnwse])"
This works only partly. What changes could I make so that it matches whether or not the direction (NWSEnwse) is present? Right now it detects the address only when the street direction is present.
You may use this regex pattern to match it.
^(.+),\s+(\d+)(\s+[NWSEnwse])?
The ? at the end signifies that that part is optional.
I also replaced \s with \s+ to account for any extra spaces that might have crept in.
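A small Python sketch of the same idea, for readers who want to try the optional group outside VBA. Two details are my own additions and not part of the answer above: the direction is captured without its leading whitespace by using a non-capturing wrapper, and a trailing \s*$ rejects cells with extra text after the address. The helper name parse_address is hypothetical.

```python
import re

# name, number, then an optional direction letter; nothing else may follow.
ADDRESS = re.compile(r"^(.+),\s+(\d+)(?:\s+([NWSEnwse]))?\s*$")

def parse_address(cell):
    """Return (street name, number, direction or None), or None if the cell does not fit."""
    match = ADDRESS.match(cell)
    return match.groups() if match else None

for cell in ["John Doe Ave, 900 E", "Jane Doe DR, 100", "Some Business Name"]:
    print(cell, "->", parse_address(cell))
```

Cells that return None are the ones holding business names or other stray information.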

Regex find Proper Nouns or Phrases that are NOT first word in a sentence

I've found several questions that touch on this, but none that seem to answer it. I am trying to build a Regex that will allow me to identify Proper Nouns in a group of text.
I am defining a Proper Noun as follows: A word or group of words that begin with a capital letter, are longer than one character (to exclude things like I, A, etc.), and are NOT the first word of a new sentence.
So, in the following text
"Susan Dow stayed at the Holiday Inn on Thursday. She met Tom and Shirley Temple at the bar where they ordered Green Eggs and Ham"
I would want the following returned
Holiday Inn
Thursday
Tom
Shirley Temple
Green Eggs
Ham
Right now, [A-Z]{1,1}[a-z]*([\s][A-Z]{1,1}[a-z]*)* is what I have, but it's returning Susan Dow and She in addition to the ones listed above. How can I get the look-up for a preceding period (.) to work?
You can use:
(?<!^|\. |\.  )[A-Z][a-z]+
per this rubular
Update: Integrated the two negative lookbehinds using alternation. Also added a check for two spaces between sentences. Note that repetition operators cannot be used in negative lookbehinds per notes in http://www.regular-expressions.info/lookaround.html
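A Python sketch of an alternative route to the list the question asks for. Python's re module requires lookbehinds to be fixed-width, so instead of a lookbehind this version matches whole capitalized phrases first and then discards any phrase that begins a sentence. The function name proper_nouns and the set of sentence-ending characters (. ! ?) are assumptions made for the example.

```python
import re

# One capitalized word of two or more letters, optionally followed by more of them.
PHRASE = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*")

def proper_nouns(text):
    """Return capitalized phrases that do not start the text or a sentence."""
    results = []
    for match in PHRASE.finditer(text):
        before = text[:match.start()].rstrip()
        # Skip phrases at the very start or right after sentence-ending punctuation.
        if before == "" or before[-1] in ".!?":
            continue
        results.append(match.group(0))
    return results

text = ("Susan Dow stayed at the Holiday Inn on Thursday. She met Tom and "
        "Shirley Temple at the bar where they ordered Green Eggs and Ham")
print(proper_nouns(text))
```

Filtering whole phrases also drops "Dow", which a word-by-word pattern would still return because only "Susan" sits at the start of the sentence.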

Regex that finds consecutive words with first letter capitalized

I am looking for a regex that can identify consecutive words in a sentence that start with capital letters.
If we take the text below as an example:
The A-Z Group is a long-established
market leader in the provision of
information for the global air cargo
community, and also for the defence
and security sectors through BDEC
Limited, publishers of the British
Defence Equipment Catalogue and
British Defence Industry Directory.
I want to be able to retrieve the following:
The A-Z Group
BDEC Limited
British Defence Equipment Catalogue
British Defence Industry Directory
Is this even possible with a regex?
If so, can anyone suggest one?
(Update: I misunderstood your question at first.)
A simple case is
/([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/
It may need to be modified if there are special cases of different language construct.
ruby-1.9.2-p0 > %Q{The A-Z Group is a long-established market leader in the provision of information for the global air cargo community, and also for the defence and security sectors through BDEC Limited, publishers of the British Defence Equipment Catalogue and British Defence Industry Directory.}.scan(/([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/).map{|i| i.first}
=> ["The A-Z Group", "BDEC Limited", "British Defence Equipment Catalogue", "British Defence Industry Directory"]
Hopefully this will do what you want, but apologies if I've misunderstood:
([A-Z][a-zA-Z0-9-]*[\s]{0,1}){2,}
The regex searches for two or more consecutive occurrences of the following sequence: a capital letter followed by any number of lowercase/uppercase/numerical/hyphen characters (alter this to any range of non-whitespace characters to suit your needs of course), optionally followed by a whitespace character.
Edit: I know it's common sense, but just make sure that you set the regex search to be case sensitive, caught me out when I tested it :p
Edit: The above regex will, as 動靜能量 points out, match the single word THE because it doesn't enforce that at least the first two items must have a space between them. Corrected version:
([A-Z][a-zA-Z0-9-]*)([\s][A-Z][a-zA-Z0-9-]*)+
Start off by thinking in non-technical terms. What do you want? A "word" followed by one or more groups of "a word separator followed by a word"
Now you just need to define the pattern for a "word" and a "word separator", and then combine those into a complete pattern.
When you break it down like that, a complex regex is nothing more than a few very simple pattern groups.
$mystring = "the United States of America has many big cities like New York and Los Angeles, and others like Atlanta";
# In list context, a /g match returns every capitalized word or run of capitalized words.
@phrases = $mystring =~ /[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*/g;
print "\n" . join(", ", @phrases) . "\n\n# phrases = " . scalar(@phrases) . "\n\n";
OUTPUT:
$ ./try_me.pl
United States, America, New York, Los Angeles, Atlanta
# phrases = 5
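The "word plus word separator" decomposition described in this answer can also be sketched in Python. The names WORD, SEPARATOR and PHRASE are chosen for this illustration, and the word definition is the one from the Ruby answer earlier in the thread; the pattern requires at least two words, as the question asks.

```python
import re

WORD = r"[A-Z][\w-]*"   # a capital letter, then word characters or hyphens
SEPARATOR = r"\s+"      # one or more whitespace characters

# A word followed by one or more groups of "separator, then word".
PHRASE = re.compile(f"{WORD}(?:{SEPARATOR}{WORD})+")

text = ("The A-Z Group is a long-established market leader in the provision of "
        "information for the global air cargo community, and also for the defence "
        "and security sectors through BDEC Limited, publishers of the British "
        "Defence Equipment Catalogue and British Defence Industry Directory.")

phrases = PHRASE.findall(text)
print(phrases)
```

Changing the trailing + to * would also return single capitalized words, which is the behaviour of the Perl example above.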