New to Regex (which was recently added to SQL in DB2 for i). I don't know anything about the different engines but research indicates that it is "based on POSIX extended regular expressions".
I would like to get the street name (first non-numeric word) from an address.
e.g.
101 Main Street = Main
2/b Pleasant Ave = Pleasant
5H Unpleasant Crescent = Unpleasant
I'm sorry I don't have a string that isn't working, as suggested by the forum software. I don't even know where to start. I tried a few things I found in search but they either yielded nothing or the first "word" - i.e. the number (101, 2/b, 5H).
Thanks
Edit: Although it's looking as if IBM's implementation of regex on the DB2 family of databases may be too alien for many of the resident experts, I'll press ahead with some more detail in case it helps.
A plain English statement of the requirement would be:
Basic/acceptable: Find the first word/unbroken string that contains no numbers or special characters
Advanced/ideal: Find the first word that contains three or more characters, being only letters and zero or one embedded dash/hyphen, but no numbers or other characters.
Additional examples (original ones at top are still valid)
190 - 192 Tweety-bird avenue = Tweety-bird
190-192 Tweety-bird avenue = Tweety-bird
Charles Bronson Place = Charles
190H Charles-Bronson Place = Charles-Bronson
190 to 192 Charles Bronson Place = Charles
Second Edit:
Mooching around on the internet and trying every vaguely connected expression that I could find, I stumbled on this one:
[a-zA-Z]+(?:[\s-][a-zA-Z]+)*
which actually works pretty well - it gives the street name and street type, which on reflection would actually suit my purpose as well as the street name alone (I can easily expand common abbreviations - e.g. RD to ROAD - on the fly).
Sample SQL:
select HAD1,
regexp_substr(HAD1, '[a-zA-Z]+(?:[\s-][a-zA-Z]+)*')
from ECH
where HEDTE > 20190601
Sample output
Ship To Address Line 1          REGEXP_SUBSTR
32 CHRISTOPHER STREET           CHRISTOPHER STREET
250 - 270 FEATHERSTON STREET    FEATHERSTON STREET
118 MONTREAL STREET             MONTREAL STREET
7 BIRMINGHAM STREET             BIRMINGHAM STREET
59 MORRISON DRIVE               MORRISON DRIVE
118 MONTREAL STREET             MONTREAL STREET
MASON ROAD                      MASON ROAD
I know this wasn't exactly the question I asked, so apologies to anyone who could have done this but was following the original request faithfully.
Not sure if this is Posix compliant, but something like this could work: ^[\w\/]+?\s((\w+\s)+?)\s*\w+?$, example here.
The script assumes that the first chunk is the number of the building, the second chunk is the name of the street, and the last chunk is Road/Ave/Blvd/etc.
This should also cater for street names which have white spaces in them.
The following regex matches your examples:
(?<=[^ ]+ )[^ ]*[ ]
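For comparison, the "advanced" requirement from the question (first word of three or more characters, letters only, with at most one embedded hyphen) can be sketched outside the database. This is a minimal Python sketch, not DB2 SQL; the function name street_name is hypothetical, and it assumes words are separated by whitespace.

```python
import re

# One run of letters, optionally followed by a single hyphen and more letters.
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)?")

def street_name(address):
    """Return the first whitespace-separated token that qualifies, or None."""
    for token in address.split():
        # Tokens such as "101", "2/b", "5H", "-" and "to" are skipped here.
        if len(token) >= 3 and WORD.fullmatch(token):
            return token
    return None

if __name__ == "__main__":
    for address in ("101 Main Street", "190 to 192 Charles Bronson Place"):
        print(address, "=", street_name(address))
```

Splitting into tokens first keeps the pattern itself small, at the cost of not being a single expression that could be pasted into REGEXP_SUBSTR.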
I have a file (ratings.lst) downloaded from IMDB Interfaces. The content appears to be in the following format:
Distribution Votes Rating Title
0000001222 297339 8.4 Reservoir Dogs (1992)
0000001223 64504 8.4 The Third Man (1949)
0000000115 48173 8.4 Jodaeiye Nader az Simin (2011)
0000001232 324564 8.4 The Prestige (2006)
0000001222 301527 8.4 The Green Mile (1999)
My aim is to convert this file into a CSV file (comma separated) with the following desired result (example for 1 line) :
Distribution Votes Rating Title
0000001222, 301527, 8.4, The Green Mile (1999)
I am using textpad and it supports regex based search and replace. I'm not sure what type of regex is needed to achieve the above desired results. Can somebody please help me on this. Thanks in advance.
The other regular expressions are somewhat overcomplicated. Because whitespace is guaranteed not to appear in the first three columns, you don't have to do a fancy match - "three columns of anything separated by whitespace" will do.
Try replacing ^(.+?)\s+(.+?)\s+(.+?)\s+(.+?)$ with \1,\2,\3,"\4" giving the following output (using Notepad++)
Distribution,Votes,Rating,"Title"
0000001222,297339,8.4,"Reservoir Dogs (1992)"
0000001223,64504,8.4,"The Third Man (1949)"
0000000115,48173,8.4,"Jodaeiye Nader az Simin (2011)"
0000001232,324564,8.4,"The Prestige (2006)"
0000001222,301527,8.4,"The Green Mile (1999)"
Note the use of a non-greedy quantifier, .+?, to prevent accidentally matching more than we should. Also note that I've enclosed the fourth column with quote marks "" in case a comma appears in the movie title - otherwise the software you use to read the file would interpret Avatar, the Last Airbender as two columns.
The nice tabular alignment is gone - but if you open the file in Excel it will look fine.
Alternately, just do the entire thing in Excel.
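If a script is an option, the same "three whitespace-free columns, then the rest" idea can be sketched in Python. This is only an illustration under assumptions: the function name to_csv is hypothetical, and the csv module quotes the title only when it needs to (for example when it contains a comma) rather than always, as the replacement above does.

```python
import csv
import io
import re

# Three columns without whitespace, then everything else as the title.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\S+)\s+(.+?)\s*$")

def to_csv(text):
    """Convert whitespace-aligned ratings lines into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # lines with fewer than four fields are skipped
            writer.writerow(match.groups())
    return out.getvalue()

if __name__ == "__main__":
    print(to_csv("0000001222  301527   8.4  The Green Mile (1999)"), end="")
```

Letting the csv module do the quoting also takes care of doubling any quote marks inside a title.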
First replace all " with "" then do this:
Find: ^\([0-9]+\)[ \t]+\([0-9]+\)[ \t]+\([^ \t]+\)[ \t]+\(.*\)
Replace with: \1,\2,\3,"\4"
Press F8 to open Replace dialog
Make sure Regular Expression is selected
In Find what: put: ^([[:digit:]]{10})[[:space:]]+([[:digit:]]+)[[:space:]]+([[:digit:]]{1,2}\.[[:digit:]])[[:space:]]+(.*)$
In Replace with: put \1,\2,\3,"\4"
Click Replace All
Note: This uses 1 or more spaces between fields from ratings.lst - you might be better off specifying the exact number of spaces if you know it.
Also Note: I didn't put spaces between the comma separated items, as generally you don't, but feel free to add those in.
Final Note: I put the movie title in quotes, so that if it contains a comma it doesn't break the CSV format. You may want to handle this differently.
MY BAD This is a C# program. I will leave it up for an alternate solution.
The IgnorePatternWhitespace option is what allows the comments inside the pattern.
This will create data which can be placed into a CSV file. Note that CSV files do not normally have whitespace after the commas, unlike your example.
string data = @"Distribution Votes Rating Title
0000001222 297339 8.4 Reservoir Dogs (1992)
0000001223 64504 8.4 The Third Man (1949)
0000000115 48173 8.4 Jodaeiye Nader az Simin (2011)
0000001232 324564 8.4 The Prestige (2006)
0000001222 301527 8.4 The Green Mile (1999)
";
string pattern = @"
^ # Always start at the Beginning of line
( # Grouping
(?<Value>[^\s]+) # Place all text into Value named capture
(?:\s+) # Match but don't capture 1 to many spaces
){3} # 3 groups of data
(?<Value>[^\n\r]+) # Append final to value named capture group of the match
";
var result = Regex.Matches(data, pattern, RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace)
.OfType<Match>()
.Select (mt => string.Join(",", mt.Groups["Value"].Captures
.OfType<Capture>()
.Select (c => c.Value))
);
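// Print one CSV row per line; writing the sequence itself would only print its type name.
Console.WriteLine (string.Join(Environment.NewLine, result));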
/* output
Distribution,Votes,Rating,Title
0000001222,297339,8.4,Reservoir Dogs (1992)
0000001223,64504,8.4,The Third Man (1949)
0000000115,48173,8.4,Jodaeiye Nader az Simin (2011)
0000001232,324564,8.4,The Prestige (2006)
0000001222,301527,8.4,The Green Mile (1999)
*/
How would you design a regular expression to capture a legal citation? Here is a paragraph that shows two typical legal citations:
We have insisted on strict scrutiny in every context, even for
so-called “benign” racial classifications, such as race-conscious
university admissions policies, see Grutter v. Bollinger, 539 U.S.
306, 326 (2003), race-based preferences in government contracts, see
Adarand, supra, at 226, and race-based districting intended to improve
minority representation, see Shaw v. Reno, 509 U.S. 630, 650 (1993).
A citation will either be preceded by a comma and whitespace, a period and whitespace, or a "signal" such as "see" or "see, e.g.," and whitespace. I'm having trouble figuring out how to accurately specify the start of the citation.
I am most familiar with Perl regular expressions but can understand examples from other languages as well.
In your example, you've preceded the citations with what the BlueBook deems a 'signal' (Rule 1.2 on page 54 of the nineteenth edition). Other signals include but are not limited to: e.g., accord, also, cf., compare, and, with, contra, and but. These can be combined in surprising and unexpected ways . . . See also, e.g. Watts v. United States, 394 U.S. 705 (1969) (per curiam). Of course, there are also citations that are not preceded by signals.
Then you'll also want to handle case citations with unexpected case names :
See v. Seattle, 387 U.S. 541 (1967)
Others have attacked this particular problem by first identifying the reporter reference (i.e. 387 U.S. 541) with a regular expression like (\d+)\s(.+?)\s(\d+) and then trying to expand the range from there. Case citations can be arbitrarily complex so this path is not without its own pitfalls. Reporter references can also take on some interesting forms as per BlueBook rules:
Jones v. Smith, _ F.3d _ (2011)
This form is used, for instance, for decisions which are not yet published. Of course, authors will use variations of the above, including (but not limited to) --- F.3d ---
This certainly isn't perfect, but without more examples to test against it's the best I can think of. Thanks to @Paul H. for extra signal words to add.
#!/usr/bin/perl
$search_text = <<EOD;
"We have insisted on strict scrutiny in every context, even for so-called “benign” racial classifications, such as race-conscious university admissions policies, see Grutter v. Bollinger, 539 U.S. 306, 326 (2003), race-based preferences in government contracts, see Adarand, supra, at 226, and race-based districting intended to improve minority representation, see Shaw v. Reno, 509 U.S. 630, 650 (1993)."
In your example, you've preceded the citations with what the BlueBook deems a 'signal' (Rule 1.2 on page 54 of the nineteenth edition). Other signals include but are not limited to : e.g., accord, also, cf., compare, and, with, contra, and but. These can be combined in surprising and unexpected ways . . . See also, e.g. Watts v. United States, 394 U.S. 705 (1969) (per curiam). Of course, there are also citations that are not preceded by signals
Then you'll also want to handle case citations with unexpected case names :
See v. Seattle, 387 U.S. 541 (1967)
Others have attacked this particular problem by first identifying the reporter reference (i.e. 387 U.S. 541) with a regular expression like (\d+)\s(.+?)\s(\d+) and then trying to expand the range from there. Case citations can be arbitrarily complex so this path is not without its own pitfalls. Reporter references can also take on some interesting forms as per BlueBook rules:
EOD
while ($search_text =~ m/(\, |\. |\; )?(see(\,|\.|\;)? |e\.g\.(\,|\.|\;)? |accord(\,|\.|\;)? |also(\,|\.|\;)? |cf\.(\,|\.|\;)? |compare(\,|\.|\;)? |with(\,|\.|\;)? |contra(\,|\.|\;)? |but(\,|\.|\;)? )+(.{0,100}\d+ \(\d{4}\))/g) {
print "$12\n";
}
while ($search_text =~ m/[\n\t]+(.{0,100}\d+ \(\d{4}\))/ig) {
print "$1\n";
}
Output is:
Grutter v. Bollinger, 539 U.S. 306, 326 (2003)
Shaw v. Reno, 509 U.S. 630, 650 (1993)
Watts v. United States, 394 U.S. 705 (1969)
See v. Seattle, 387 U.S. 541 (1967)
Well you can use the following at the start. You will need more patterns for other starts.
/(, )|(see )/
The end will prove to be the bigger problem. For example in "see Adarand, supra, at 226, and race-based..." there is no clear end indicator. I suspect pure regular expressions won't be sufficient for this task; you need a higher form of language analysis. Otherwise, be content with matching only a subset of all citations, or with matching too much sometimes.
I'm using http://gskinner.com/RegExr/ to test this syntax
(?<=see )\w+ v. \w+, \d{3} U\.S\. \d{3}, \d{3} \(\d{4}\)
As you can see, I'm using a positive lookbehind.
I've written a pattern (created for JavaScript since you didn't specify a language) which can be used to match the two citations you mention:
var regex = /(\w+\sv.\s\w+,\s\d*\s?[\w.]*[\d,\s]*\(\d{4}\))/ig;
You can see it in action here.
It will match others as long as they follow the same format of:
Name v. Name, 999 A.A.A... 999, 999 (1999)
Although presence of some parts is made optional. Please provide more information on citations that may not fit this pattern if it doesn't appear to meet your needs.
For these kinds of potentially complex regexes, I tend to break the pattern down into simple pieces that can be individually unit-tested and evolved.
I use REL, a DSL (in Scala) that allows you to reassemble and reuse your regex pieces. This way, you can define your regex like this:
val NAME = α+
val VS = """v\.?""" ~ """s\.?""".?
val CASE = NAME("name1") ~ " " ~ VS ~ " " ~ NAME("name2")
val NUM = ß ~ (δ+) ~ ß
val US = """U\.? ?S\.? """
val YEAR = ( ("1[7-9]" | "20") ~ δ{2} )("year")
val REF = CASE ~ ", " ~ // "Doe v. Smith, "
(NUM ~ " ").? ~ // "123 " (optional)
US ~ NUM ~ // "U.S. 456"
(", " ~ NUM).* ~ // ", 678" 0 to N times
" \\(" ~ YEAR ~ "\\)" // "(1999)"
Then test every bit like this:
"NUM" should {
"Match 1+ digits" in {
"1" must be matching(NUM)
"12" must be matching(NUM)
"123" must be matching(NUM)
"1234" must be matching(NUM)
"12345" must be matching(NUM)
"123456" must be matching(NUM)
}
"Match only standalone digits" in {
NUM.findFirstIn(" 123 ") must beSome("123")
NUM.findFirstIn(" n123 ") must beNone
}
}
Also, your unit/spec tests can double as your doc for this bit of regex, indicating what is matched and what is not (which tends to be important with regexes).
I made a gist for this example with a first naive implementation.
In the upcoming version of REL (0.3), you will be directly able to export the Regex in, say, PCRE flavor to use it independently… For now only JavaScript and .NET translations are implemented, so you can just run the sample using SBT and it will output a Java-flavored regex (although pretty simple, I think it can be copy/pasted to Perl).
I'm not overly familiar with Perl, but if I wanted to do this, I would use some web lookups. First, I would find a good set of patterns.
I went with this regex:
(\d{3})\sU\.S\.\s(\d{3})
Breakdown of regex:
(\d{3}) --> looks for 3 numbers, places them into $1
\sU\.S\.\s --> looks for a whitespace followed by U.S. followed by another whitespace
(\d{3}) --> looks for 3 numbers again, placing them into $2
This looks for the pattern 539 U.S. 306 and places the two numbers into capture groups. This puts the following values into variables:
$1 = 539
$2 = 306
I would loop through, and find each instance of the pattern, then I would use something to grab this site from the web:
http://supreme.justia.com/cases/federal/us/$1/$2/case.html
Which in this case would become:
http://supreme.justia.com/cases/federal/us/539/306/case.html
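The lookup URL construction described above can be sketched as follows. This is a Python illustration rather than Perl, it only builds the URLs and does not fetch anything, and the names REPORTER_REF, URL_TEMPLATE and justia_urls are hypothetical.

```python
import re

# Same pattern as above: volume, "U.S.", first page.
REPORTER_REF = re.compile(r"(\d{3})\sU\.S\.\s(\d{3})")
URL_TEMPLATE = "http://supreme.justia.com/cases/federal/us/{volume}/{page}/case.html"

def justia_urls(text):
    """Build one lookup URL per 'NNN U.S. NNN' reference found in text."""
    return [
        URL_TEMPLATE.format(volume=m.group(1), page=m.group(2))
        for m in REPORTER_REF.finditer(text)
    ]

if __name__ == "__main__":
    for url in justia_urls("see Grutter v. Bollinger, 539 U.S. 306, 326 (2003)"):
        print(url)
```

Because the pattern requires exactly three digits on each side, volumes or pages with fewer digits would need a looser quantifier.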
Once I had this, I could go through the site tree for the following (I put the whole tree here, since depending on language the way you do this might be changed):
<body>
<div id="main">
<div id="container">
<div id="maincontent">
<h1> HERE IS THE TITLE OF THE CASE </h1>
The xpath of this is //*[@id="maincontent"]/h1.
From here you now have the full reference:
Grutter v. Bollinger - 539 U.S. 306 (2003)
I'm not a legal expert, so I don't know whether citations can be written in other ways (one of the other answers mentioned something like F.3d); if they can, another approach would be needed to capture those. If I get some time later, I might write this out in PowerShell just to see how I would do it.
I manage the CourtListener.com project, which extracts millions of citations from opinions. The task at hand is a hard one. I wouldn't rely on signals like "see" and "also".
What we've done instead is create a JSON object containing all of the possible reporter abbreviations (it's about 10k lines long), which we store in our reporters-db repo:
https://pypi.org/project/reporters-db/1.0.12/
That's in Python, but the JSON object is in the source repo as well (there's a CSV if you like too).
Using the JSON object, we build up a regex that knows what all the reporter abbreviations are.
For your example, that'd be U.S., but we have hundreds more including misspellings and typos, and the regex is essentially:
\d{1,3} (REPORTER|REPORTER2|REPORTER3|...) \d{1,3}
It gets complicated because some page numbers use roman numerals just to mess with you and there's a half dozen other corner cases to think about.
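As a rough illustration of how such a regex can be assembled from a list of abbreviations (this is not CourtListener's actual code; the three-entry REPORTERS list is a hypothetical stand-in for the real JSON data, and the digit counts are loosened to \d+ as an assumption):

```python
import re

# Hypothetical stand-in for the abbreviations loaded from the JSON object.
REPORTERS = ["U.S.", "F.3d", "S. Ct."]

def build_citation_regex(reporters):
    """Compile 'volume REPORTER page' with every abbreviation as an alternative."""
    # Longest first, so a short abbreviation cannot shadow a longer one.
    ordered = sorted(reporters, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(\d+)\s+(" + alternatives + r")\s+(\d+)\b")

if __name__ == "__main__":
    text = ("see Grutter v. Bollinger, 539 U.S. 306, 326 (2003), "
            "see Shaw v. Reno, 509 U.S. 630, 650 (1993).")
    for volume, reporter, page in build_citation_regex(REPORTERS).findall(text):
        print(volume, reporter, page)
```

Escaping each abbreviation matters because most of them contain periods, which would otherwise match any character.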
So...you've kind of asked a really big question, if you want to do it well. The good news is that in addition to the reporters-db Python project and JSON object, we also open sourced our citation finder. It's not a standalone project unfortunately, but you can see how it works and use it as a starting place:
https://github.com/freelawproject/courtlistener
Look in cl/citations for the related code.
I hope this helps! We're a non-profit dedicated to these kinds of problems, so definitely reach out if you have questions or ideas.
I have a bunch of strings like this in a file:
M.S., Arizona University, Tucson, Az., 1957
B.A., American International College, Springfield, Mass., 1978
B.A., American University, Washington, D.C., 1985
and I'd like to extract Tufts University, American International College, American University, University of Massachusetts, etc, but not the high schools (it's probably safe to assume that if it contains "Academy" or "High School" that it's a high school). Any ideas?
Tested with preg_match_all in PHP, will work for the sample text you provided:
/(?<=,)[\w\s]*(College|University|Institute)[^,\d]*(?=,|\d)/
Will need to be modified somewhat if your regex engine does not support lookaheads/lookbehinds.
Update: I looked at your linked sample text & updated the regex accordingly
/([A-Z][^\s,.]+[.]?\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\d]*(?=,|\d)/
The first part will match a string starting with a capital letter, optionally followed by a period, then a space, then optionally an opening parenthesis. This pattern is matched zero or more times.
This should get all relevant words preceding the keywords.
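For readers not using PHP, the first pattern above carries over to Python unchanged. The sketch below adds the high-school filter the question suggests ("Academy" or "High School"); the names SCHOOL, HIGH_SCHOOL and colleges are hypothetical, and it has only been checked against the sample lines in the question.

```python
import re

# First pattern from the answer, with the keyword group made non-capturing.
SCHOOL = re.compile(r"(?<=,)[\w\s]*(?:College|University|Institute)[^,\d]*(?=,|\d)")
# The question's rule of thumb for spotting high schools.
HIGH_SCHOOL = re.compile(r"Academy|High School")

def colleges(line):
    """Return institution names found in one line, skipping high schools."""
    names = (m.group(0).strip() for m in SCHOOL.finditer(line))
    return [name for name in names if not HIGH_SCHOOL.search(name)]

if __name__ == "__main__":
    print(colleges("B.A., American International College, Springfield, Mass., 1978"))
```

Filtering after the match keeps the main pattern simple, at the cost of a second pass over each candidate.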
I would like to extract portion of a text using a regular expression. So for example, I have an address and want to return just the number and streets and exclude the rest:
2222 Main at King Edward Vancouver BC CA
But the addresses vary in format most of the time. I tried using a lookahead regex and came up with this expression:
.*?(?=\w* \w* \w{2}$)
The above expression handles the above example nicely, but it gets way too messy as soon as commas come into the text, or postal codes, which can be a 6 character string or two 3 character strings with a space in the middle, etc...
Is there any more elegant way of extracting a portion of text other than a lookahead regex?
Any suggestion or a point in another direction is greatly appreciated.
Thanks!
Regular expressions are for data that is REGULAR, that follows a pattern. So if your data is completely random, no, there's no elegant way to do this with regex.
On the other hand, if you know what values you want, you can probably write a few simple regexes, and then just test them all on each string.
Ex.
regex1= address # grabber, regex2 = street type grabber, regex3 = name grabber.
Attempt a match on string1 with regex1, regex2, and finally regex3. Move on to the next string.
well i thought i'd throw my hat into the ring:
.*(?=,? ([a-zA-Z]+,?\s){3}([\d-]*\s)?)
and you might want ^ or \d+ at the front for good measure
and i didn't bother specifying lengths for the postal codes... just any number of digits or hyphens in this one.
it works for these inputs so far and variations on commas within the City/state/country area:
2222 Main at King Edward Vancouver, BC, CA, 333-333
555 road and street place CA US 95000
2222 Main at King Edward Vancouver BC CA 333
555 road and street place CA US
it is counting on there being three words at the end for the city, state and country but other than that it's like ryansstack said, if it's random it won't work. if the city is two words like New York it won't work. yeah... regex isn't the tool for this one.
btw: tested on regexhero.net
i can think of 2 ways you can do this
1) if you know that "the rest" of your data after the address is exactly 2 fields, ie BC and CA, you can do split on your string using space as delimiter, remove the last 2 items.
2) do a split on delimiter /[A-Z][A-Z]/ and store the result in array. then print out the array ( this is provided that the address doesn't contain 2 or more capital letters)
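The first option can be sketched in a few lines of Python. The function name strip_trailing_fields and its count parameter are hypothetical, and the sketch assumes the caller already knows how many trailing words to drop (two for "BC CA", three if the city is a single word that should go as well).

```python
def strip_trailing_fields(address, count):
    """Drop the last `count` whitespace-separated words from an address."""
    parts = address.split()
    if count <= 0:
        return " ".join(parts)
    return " ".join(parts[:-count])

if __name__ == "__main__":
    address = "2222 Main at King Edward Vancouver BC CA"
    print(strip_trailing_fields(address, 2))
    print(strip_trailing_fields(address, 3))
```

As with the regex attempts, this breaks down as soon as the number of trailing words varies, for example with a two-word city.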