Can you limit the words between two capturing groups in Regex - regex

I have been trying to create a parser for law texts.
I need a way to find "external links" such as: art. 45 alin. (1) din Lege nr. 54/2000
The problem is that the law-writing style in my country badly lacks uniformity, so sometimes the links look like this: articolul 45 alineatul (1) din Legeea nr. 30/2000
On top of that, my language has many inflected forms for each word (articol, articolului, articolelor, ...).
That means I need to generalize the first part (art.) to catch as many forms as possible, and hope that the last part is a law number and year (54/2000).
Now here comes the hard part: every section that starts with Articol N starts a match, and the regex runs on and on until it finds a law number and year that have absolutely no relation to that article.
This is how it looks \b(((A|a)rt.*?) \(?\d*?\)??)( \w*? )*?nr\.? (\d+\/\d\d\d\d|\d+\/\d\d\d\d)\b
My question: is there a way to limit the number of words between the two capturing groups?
Link to a Docs to determine what should pass and what not:
https://docs.google.com/document/d/1vn2HwYaCq8UB1felY1GvfmbTI2w8o5RgW4efD9fsvQM/edit?usp=sharing

As Cary and James answered in comments above, I used (?:\S+\s*){0,15}. I used \S instead of \w to include punctuation and thus, abbreviated forms of the names of the Law (e.g. Const. for Constitution). That was the reason why my original regex wasn't working even when using {m,n}.
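A minimal Python sketch of the bounded-gap idea from this answer. The pattern is a simplified stand-in for the original regex, not the asker's actual expression; the name LINK and the sample strings are made up for illustration, and the limit of 15 tokens is the one quoted in the answer.

```python
import re

# Simplified stand-in for the original pattern: an "art..." word, an article
# number, then at most 15 whitespace-separated tokens before "nr. N/YYYY".
LINK = re.compile(
    r"\b[Aa]rt\S*\s+\d+\s+"      # art. / articolul + article number
    r"(?:\S+\s+){0,15}?"         # bounded gap: 0 to 15 tokens of any kind
    r"nr\.?\s*(\d+/\d{4})\b"     # law number and year
)

near = "art. 45 alin. (1) din Lege nr. 54/2000"
# Hypothetical section heading followed by 30 filler words before a law number.
far = "Articolul 45 " + "cuvant " * 30 + "Legea nr. 30/2000"

print(LINK.search(near).group(1))  # law number found within the gap limit
print(LINK.search(far))            # too many tokens in between, so no match
```

Using \S+ rather than \w+ for the gap tokens is what lets abbreviated forms with punctuation (such as "alin." or "(1)") count as tokens.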

Related

how to match a sentence having particular word in different patterns

We have a problem here...
We have a text containing sentences in different patterns.
We want to get the sentences that contain a particular word.
Eg:
One further point, by way of providing another model. The analysis in
the second paragraph could lead in the following direction. 'The
Destructors' deals with, obviously, destruction, whilst the book of
Genesis deals with creation. The vocabulary is similar: Blackie
notices that 'chaos had advanced', an ironic reversal of God's
imposing of form on a void. Furthermore, the phrase 'streaks of light
came in through the closed shutters where they worked with the
seriousness of creators', used in the context of destruction, also
parodies the creation of light and darkness in the early passages of
the Biblical book. Greene's ironic use of the vocabulary of the Bible
might be making the point that, for him, the Second World War
signalled the end of a particular Christian era. Now, it is perfectly
arguable that the rise of fascism is linked to this, or that it is the
cause. The cult of personality and secular leadership has, for Greene,
taken over from the key role of the church in Western societies. In
this way the two main themes identified above - the tension between
individual and community, and religion - are linked. In terms of essay
writing this link could well be made after the discussion of the theme
of the individual and the community, and its links with the theme of
leadership. This might be the general conclusion to the essay. After
thoughtful consideration and interpretation a student may well decide
that this is what the (destructors.)' boils down to: Greene is making a
clear link between the rise of fascism and the decline of the Church's
influence. Despite the fact that fascism has been recently defeated,
Greene sees the lack of any contemporary values which could provide
social cohesion as providing the potential for its reappearance.
In the above text, the word (destructors) is in bold. We want to get the sentences that contain the word "destructors".
The word "destructors" can appear in different formats, e.g.: (destructors), (DesTrucTors), (Des.tructors), DESTRUCTORS, destructors, des-tructors.
When we tried writing a regex to match the sentences, it failed under some conditions (for example, we got half sentences).
Could you please help us with this?
If this information isn't enough to solve it, please let us know and we will update it.
Thank you...
I'm not too sure about Python, but I believe this might work:
import re
# subject holds the text to search
for match in re.finditer(r"[^.]*destructors[^.]*\.[^\w\s]*", subject, re.IGNORECASE):
    print(match.start())  # match start
    print(match.end())    # match end (exclusive)
    print(match.group())  # matched text
In any case, I think the regex you want is:
[^.]*destructors[^.]*\.[^\w\s]*
with the case insensitive and global flags set.
It would be helpful if you could provide the regex pattern you have tried so far. The best I can come up with is:
import re
str_text = 'your text here containing DESTRUCTORS'
match = re.search(r'pass all the destructors combinations here', str_text, flags=re.IGNORECASE)
More regex pattern syntax is documented here: https://docs.python.org/3/library/re.html
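Neither answer covers the spelled-out variants from the question (Des.tructors, des-tructors), which a literal destructors will not match. One possible way to handle them, sketched in Python: build the word pattern with an optional "." or "-" between letters. The helper name variant_pattern and the choice of allowed separators are assumptions for illustration; finding the sentence boundaries around the hit is a separate problem.

```python
import re

def variant_pattern(word):
    # Allow one optional "." or "-" between consecutive letters, so that
    # "Des.tructors" and "des-tructors" still count as the word.
    body = r"[.\-]?".join(re.escape(ch) for ch in word)
    return re.compile(body, re.IGNORECASE)

WORD = variant_pattern("destructors")

samples = ["(destructors)", "(DesTrucTors)", "(Des.tructors)",
           "DESTRUCTORS", "des-tructors", "destruction"]
for s in samples:
    print(s, bool(WORD.search(s)))
```

The last sample is a deliberate non-match, to check that the pattern does not accept arbitrary words with the same prefix.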

Regex lookbehind - excluding words from searches

I need to search my corpus for words such as game or shame, but I would like the search to exclude these strings: a game/a shame, A game/A shame, and a/an/A/An WORD game or a/an/A/An WORD shame, where WORD is a modifier, e.g., a great game or a great shame.
If someone could help me out, that would be great, thanks!
In my corpus, the optional WORD between the indefinite article a/an and game or a/an and shame is most commonly great and real. So even excluding these two, would already help me a lot.
The lookbehind below works perfectly to exclude a/A
(?<!a\s|A\s)\bshame\b
To exclude the modifying WORD, I tried using ?\w in the lookbehind, but it just wouldn't work. The version below, without the ?, runs and still excludes examples such as a shame, but it also returns undesired examples such as a great shame or a crying shame (see concordance lines (3) and (4) in the sample text below):
(?<!a\s|A\s|a\b\w\b|A\b\w\b)\bshame\b
The tool I'm using to implement regex is AntConc, which supports Perl regular expressions.
Sample text with two irrelevant examples (3 & 4) after using the search string below
(?<!a\s|A\s)\bshame\b
1 (match shame)
, people ogling from the sidelines. If you want a closer look, you have to ring for entry and wait to be admitted. I guess me and Saul just have no shame (or just know the benefits of our bank accounts being in hard currencies), because we wandered into plenty. Lots and lots of little boutiques and edgily designed fashion stores with music blaring.& abbutterflie.txt 47 1
2 (match shame)
last twenty years and I've experienced all sorts of biggotry but I seriously thought that anti black nazism in football wass a thing of the past. You should all hang your heads in shame, bunch of [badword]s. adamdphillips.txt 57 1
3 (don't match shame)
me monetarily as I wasn't that close to her, but she was really good friends with the other girl and it's messed that up for them a bit, which is a great shame. Anyway, Holly and I have since found somewhere to move in just the two of us. It's going to cost an absolute fortune and I'm going to be eating basics beans on aderyn.txt 60 1
4 (don't match shame)
are loads of amazingly good bands out there, gigging up and down the country who will never get signed because no-one can figure out how to market them, and this is a crying shame. There are artists out there like Thea Gilmore and <a href="http://blog.amandapalmer.net/" rel="nofollow"> Amanda Palmer& aderyn.txt 60 2
5 (match shame)
/><br />"There is no better time to show these terrorists that we have no fear of them. Instead we are forced, through the cowardly acts of our superiors, to hide in shame."<br /><br />But Herb Wiseman, high school consultant for Lee County, Florida, pointed to the July 7 London bombings.<br /><br />"What happens if kids get on aggy91.txt 64 1
Because variable length negative lookbehinds are not allowed, the approach in your previous question's answer won't transfer to this one.
I've gone with a (*SKIP)(*FAIL) pattern. This will match and discard the disqualified matches, and only retain qualifying matches:
/[Aa]n?( \w+)? shame(*SKIP)(*FAIL)|shame/ 3844 steps (Demo)
Or if you wish to include word boundary metacharacters:
/\b[Aa]n?( \w+)? shame\b(*SKIP)(*FAIL)|\bshame\b/ 4762 steps (Demo)
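For readers outside PCRE: Python's built-in re module does not support the (*SKIP)(*FAIL) verbs, but a similar effect can be sketched with a plain alternation where only the wanted branch has a capturing group. This is an emulation of the idea, not the pattern from the answer; the function name keep_hits and the shortened sample lines are made up for illustration.

```python
import re

# First alternative consumes the unwanted phrases; only the second
# alternative fills group 1, so group 1 marks the hits worth keeping.
PATTERN = re.compile(r"\b[Aa]n?(?: \w+)? shame\b|\b(shame)\b")

def keep_hits(text):
    """Return start offsets of 'shame' not preceded by a/an (+ one word)."""
    return [m.start(1) for m in PATTERN.finditer(text) if m.group(1)]

lines = [
    "me and Saul just have no shame",
    "which is a great shame.",
    "this is a crying shame.",
    "to hide in shame.",
]
for line in lines:
    print(bool(keep_hits(line)), line)
```

The trade-off is that the filtering happens in code after matching, so this only helps where the regex is driven from a script rather than typed into a search box.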

Replace ONLY first occurrence of word from a paragraph using RegEx

I have two paragraphs. I want to replace ONLY first occurrence of a specific word 'acetaminophen' by '{yootooltip :: It is a widely used over-the-counter analgesic (pain reliever) and antipyretic (fever reducer). Excessive use of paracetamol can damage multiple organs, especially the liver and kidney.}acetaminophen{/yootooltip}'
The paragraph is:
Percocet is a painkiller which is partly made from oxycodone and partly made from acetaminophen. It will usually be prescribed for a patient who is suffering from acute severe pain. Because it has oxycodone in it, this substance can create an addiction and is also a dangerous prescription drug to abuse. It is illegal to either sell or use Percocet that has not been prescribed by a licensed professional.
In 2008, drugs like Percocet (which have both oxycodone and acetaminophen as their main ingredients) were the prescription drugs most sold in all of Ontario. The records also show that the rates of death by oxycodone (this includes brand name like Percocet) doubled. That is why it is imperative that the people who are addicted go to an Ontario drug rehab center. Most of the drug rehabs can take care of Percocet addiction.
I am trying to write a regular expression for this. I have tried
\bacetaminophen\b
But it is replacing both occurrences.
Any help would be appreciated.
Thanks
Use the optional $limit parameter in PHP's preg_replace function http://us2.php.net/manual/en/function.preg-replace.php
$text = preg_replace('/acetaminophen/i', 'da-da daa', $text, 1);
will replace only the first occurrence.
For the tool you're using, just use this:
(.*?)\bacetaminophen\b
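The same "limit the number of replacements" idea exists in other regex libraries. A minimal Python sketch, with a shortened paragraph and placeholder tooltip text standing in for the real content:

```python
import re

text = ("Percocet is partly made from oxycodone and partly from acetaminophen. "
        "Drugs like Percocet have both oxycodone and acetaminophen.")

# count=1 stops after the first substitution; \g<0> re-inserts the match.
wrapped = re.sub(r"\bacetaminophen\b",
                 r"{yootooltip :: tooltip text}\g<0>{/yootooltip}",
                 text, count=1, flags=re.IGNORECASE)
print(wrapped)
```

Only the first acetaminophen is wrapped; the second one is left untouched.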

How to get sentence number from input?

It seems hard to detect a sentence boundary in a text. Punctuation marks like .!? may be used to delimit sentences, but this is not very accurate, as there may be ambiguous words and abbreviations such as U.S.A or Prof. or Dr. I am studying the TPerlRegEx library and the Regular Expression Cookbook by Jan Goyvaerts, but I do not know how to write an expression that detects sentences.
What would be a comparatively accurate expression using TPerlRegEx in Delphi?
Thanks
First, you probably need to arrive at your own definition of what a "sentence" is, then implement that definition. For example, how about:
He said: "It's OK!"
Is it one sentence or two? A general answer is irrelevant. Decide whether you want it to interpret it as one or two sentences, and proceed accordingly.
Second, I don't think I'd be using regular expressions for this. Instead, I would scan each character and try to detect sequences. A period by itself may not be enough to delimit a sentence, but a period followed by whitespace or carriage return (or end of string) probably does. This immediately lets you weed out U.S.A (periods not followed by whitespace).
For common abbreviations like Prof. and Dr. it may be a good idea to create a dictionary - perhaps editable by your users, since each language will have its own set of common abbreviations.
Each language will have its own set of punctuation rules too, which may affect how you interpret punctuation characters. For example, English tends to put a period inside the parentheses (like this.) while Polish does the opposite (like this). The same difference will apply to double quotes, single quotes (some languages don't use them at all, sometimes they are indistinguishable from apostrophes etc.). Your rules may well have to be language-specific, at least in part.
In the end, you may approximate the human way of delimiting sentences, but there will always be cases that can throw the analysis off. For example, assuming that you have a dictionary that recognizes "Prof." as an abbreviation, what are you going to do about
Most people called him Professor Jones, but to me he was simply The Prof.
Even if you have another sentence that follows and starts with a capital letter, that still won't help you know where the sentence ends, because it might as well be
Most people called him Professor Jones, but to me he was simply Prof. Bill.
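The scanning approach described in this answer could be sketched roughly as follows, in Python rather than Delphi. The abbreviation set is a hypothetical stand-in for the user-editable dictionary, and the boundary rule (terminator followed by whitespace or end of text) is the one suggested above; treat it as an approximation, not a complete splitter.

```python
import re

# Hypothetical, user-editable abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"prof.", "dr.", "mr.", "mrs.", "jr."}

def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundary: ., ! or ? followed by whitespace or end of text.
    for m in re.finditer(r"[.!?](?=\s|$)", text):
        words = text[start:m.end()].split()
        last_word = words[-1].lower() if words else ""
        if last_word in ABBREVIATIONS:
            continue  # e.g. "Prof." followed by a name: keep scanning
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences(
    "Most people called him Prof. Jones. He said it was OK! Dr. Bill agreed."))
```

As the answer warns, a sentence that really does end in "The Prof." would be merged with the next one by this rule, since the dictionary cannot tell the two uses apart.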
Check my tutorial here http://code.google.com/p/graph-expression/wiki/SentenceSplitting. This concrete example can be easily rewritten to regular expressions and some imperative code.
It would be wise to use an NLP processor with a pre-trained model. EnglishSD.nbin is one such model that is available for OpenNLP, and it can be used in Visual Studio with SharpNLP.
The advantages of using this method are numerous. For example, consider the input
Prof. Jessica is a wonderful woman. She is a native of U.S.A. She is married to Mr. Jacob Jr.
If you are using a regex split, for example
string[] sentences = Regex.Split(text, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");
Then the above input will be split as
Prof.
Jessica is a wonderful woman.
She is a native of U.S.A.
She is married to Mr.
Jacob Jr.
However the desired output is
Prof. Jessica is a wonderful woman.
She is a native of U.S.A. She is married to Mr. Jacob Jr.
This kind of logical sentence split can be achieved only using trained models from OpenNLP project. The method is as simple as this.
// Directory that contains EnglishSD.nbin
private string mModelPath = @"C:\Users\ATS\Documents\Visual Studio 2012\Projects\Google_page_speed_json\Google_page_speed_json\bin\Release\";
private OpenNLP.Tools.SentenceDetect.MaximumEntropySentenceDetector mSentenceDetector;

private string[] SplitSentences(string paragraph)
{
    // Load the model lazily, on first use
    if (mSentenceDetector == null)
    {
        mSentenceDetector = new OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");
    }
    return mSentenceDetector.SentenceDetect(paragraph);
}
where mModelPath is the path of the directory containing the nbin file.
The mSentenceDetector is derived from the OpenNLP dll.
You can get the desired output by
string[] sentences = SplitSentences(text);
Kindly read through this article I have written for integrating SharpNLP with your Application in Visual Studio to make use of the NLP tools
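To see exactly where the lookbehind/lookahead split pattern from this answer cuts the sample input, it can be tried outside .NET. Below is a Python transcription of the same pattern (the character class and lookarounds carry over unchanged; only the string quoting differs). The name SPLIT is made up for this sketch.

```python
import re

text = ("Prof. Jessica is a wonderful woman. She is a native of U.S.A. "
        "She is married to Mr. Jacob Jr.")

# Same idea as the C# pattern: split on whitespace that follows
# "<letter, digit or quote><. ! ?>" and precedes a capital letter.
SPLIT = re.compile(r"""(?<=['"A-Za-z0-9][.!?])\s+(?=[A-Z])""")

parts = SPLIT.split(text)
for part in parts:
    print(part)
```

The pattern only splits at whitespace, so titles such as Prof. and Mr. are cut off from the names that follow them, which is the weakness a trained model is meant to avoid.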

I'm going to be teaching a few developers regular expressions - what are some good homework problems? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I'm thinking of presenting questions in the form of "here is your input: [foo], here are the capture groups/results: [bar]" (and maybe writing a small script to test their answers for my results).
What are some good regex questions to ask? I need everything from beginner questions like "validate a 4 digit number" to "extract postal codes from addresses".
A few that I can think of off the top of my head:
Phone numbers in any format e.g. 555-5555, 555 55 55 55, (555) 555-555 etc.
Remove all html tags from text.
Match social security number (Finnish one is easy;)
All IP addresses
IP addresses with shorthand netmask (xx.xx.xx.xx/yy)
There's a bunch of examples of various regular expression techniques over at www.regular-expressions.info - everything from simple literal matching to backreferences and lookahead.
To keep things a bit more interesting than the usual email/phone/url stuff, try looking for more original exercises. Avoid boredom.
For example, have a look at the Forsyth-Edwards Notation, which is used for describing a particular board position of a chess game.
Have your students validate and extract all the bits of information from a string like this:
rnbqkbnr/pp1ppppp/8/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2
Additionally, have a look at algebraic chess notation, used to describe moves. Extract chess moves out of a piece of text (and make them bold).
1. e4 e5 2. Nf3 Black now defends his pawn 2...Nc6 3. Bb5 Black threatens c4
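As a possible reference solution for the Forsyth-Edwards exercise, here is a hedged Python sketch that validates the overall shape of the string and extracts its six fields with named groups. The group names are my own choice, and the pattern does not check that each rank adds up to eight squares, which could be left as a follow-up exercise.

```python
import re

FEN = re.compile(
    r"^(?P<placement>[pnbrqkPNBRQK1-8]+(?:/[pnbrqkPNBRQK1-8]+){7})"  # 8 ranks
    r" (?P<active>[wb])"                 # side to move
    r" (?P<castling>-|[KQkq]{1,4})"      # castling availability
    r" (?P<en_passant>-|[a-h][36])"      # en passant target square
    r" (?P<halfmove>\d+)"                # halfmove clock
    r" (?P<fullmove>\d+)$"               # fullmove number
)

m = FEN.match("rnbqkbnr/pp1ppppp/8/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R b KQkq - 1 2")
print(m.groupdict())
```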
Validate phone numbers (extract area code + rest of number with grouping). (This assumes US phone numbers; otherwise generalize for your style.)
Play around with validating email addresses (you probably want to tell the students that a fully correct regular expression is hugely complicated, but for simple cases it is pretty straightforward).
regexplib.com has a good library you can search through for examples.
How about extracting first name, middle name, last name, and personal suffix (Jr., III, etc.) from a format like:
Smith III, John Paul
How about a regex to remove line breaks and tabs from the input?
I would start with the common ones:
validate email
validate phone number
separate the parts of a URL
Be cruel. Tell them to parse HTML.
RegEx match open tags except XHTML self-contained tags
Are you teaching them theory of finite automata as well?
Here is a good one: parse the addresses of churches correctly from this badly structured format (copy and paste it as text first)
http://www.churchangel.com/WEBNY/newhart.htm
I'm a fan of parsing date strings. Define a few common date formats, as well as time and date-time formats. These are often good exercises because some dates are simple mixes of digits and punctuation. There's a limited degree of freedom in parsing dates.
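A small Python sketch of what such a date exercise might look like for one format (YYYY-MM-DD). The pattern name and the sample sentence are made up; the regex only checks the ranges of the individual fields, so impossible dates such as 2009-02-31 still pass and would need a check in code.

```python
import re

ISO_DATE = re.compile(
    r"\b(?P<year>\d{4})-(?P<month>0[1-9]|1[0-2])-(?P<day>0[1-9]|[12]\d|3[01])\b"
)

m = ISO_DATE.search("Released on 2009-07-15 at noon.")
print(m.group("year"), m.group("month"), m.group("day"))
```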
Just to throw them for a loop, why not reword a question or two to suggest that they write a regular expression to generate data fitting a specific pattern like email addresses, phone numbers, etc.? It's the same thing as validating, but can help them get out of the mindset that regex is just for validation (whereas the data generation tool in Visual Studio uses regex to randomly generate data).
Rather than teaching examples based on the data set, I would do examples from the perspective of the rule set to get the basics across. Give them simple examples to solve that lead them to use ONE of several basic groupings in each solution. Then have a couple of "compound" regexes at the end.
Simple:
s/abc/def/
Spinners and special characters:
s/a\s*b/abc/
Character classes:
s/[abc]/def/
Backreference:
s/ab(c)/def$1/
Anchors:
s/^fred/wilma/
s/rubble$/and betty/
Modifiers:
s/Abcd/def/gi
After this, I would give a few examples illustrating the pitfalls of trying to match HTML tags or other strings that shouldn't be handled with regexes, to show the limitations.
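If the class is not taught in Perl, the substitution drills above translate fairly directly. A Python sketch with made-up input strings: count=1 mirrors a Perl s/// without the g modifier, and leaving count at its default mirrors /g.

```python
import re

results = [
    re.sub(r"abc", "def", "xabcx", count=1),                  # simple literal
    re.sub(r"a\s*b", "abc", "a   b", count=1),                # \s with a quantifier
    re.sub(r"[abc]", "def", "cat", count=1),                  # character class
    re.sub(r"ab(c)", r"def\1", "abc", count=1),               # backreference ($1 in Perl)
    re.sub(r"^fred", "wilma", "fred and barney", count=1),    # start anchor
    re.sub(r"rubble$", "and betty", "barney rubble", count=1),  # end anchor
    re.sub(r"abcd", "def", "ABCD abcd", flags=re.IGNORECASE),   # like /gi
]
for r in results:
    print(r)
```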
Try to think of some tests whose solutions can't simply be found with Google.
An email validator, for example, should be no trouble to find.
Try something like a "5 proof" test:
Input a 5-digit number. The sum of its digits must be divisible by five: 12345 = 1+2+3+4+5 = 15 / 5 = 3(.0)
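A checker for that last exercise is easy to write if the arithmetic is done outside the regex, which is also a useful lesson about where regex stops being the right tool. A minimal Python sketch; the function name is hypothetical.

```python
import re

def passes_five_test(s):
    # The regex checks the shape (exactly five ASCII digits); the digit sum
    # is computed in ordinary code.
    if not re.fullmatch(r"[0-9]{5}", s):
        return False
    return sum(int(ch) for ch in s) % 5 == 0

print(passes_five_test("12345"))  # digits sum to 15
print(passes_five_test("12346"))  # digits sum to 16
```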