Replace ONLY first occurrence of word from a paragraph using RegEx

I have two paragraphs. I want to replace ONLY the first occurrence of the specific word 'acetaminophen' with '{yootooltip :: It is a widely used over-the-counter analgesic (pain reliever) and antipyretic (fever reducer). Excessive use of paracetamol can damage multiple organs, especially the liver and kidney.}acetaminophen{/yootooltip}'.
The paragraph is:
Percocet is a painkiller which is partly made from oxycodone and partly made from acetaminophen. It will usually be prescribed for a patient who is suffering from acute severe pain. Because it has oxycodone in it, this substance can create an addiction and is also a dangerous prescription drug to abuse. It is illegal to either sell or use Percocet that has not been prescribed by a licensed professional.
In 2008, drugs like Percocet (which have both oxycodone and acetaminophen as their main ingredients) were the prescription drugs most sold in all of Ontario. The records also show that the rates of death by oxycodone (this includes brand name like Percocet) doubled. That is why it is imperative that the people who are addicted go to an Ontario drug rehab center. Most of the drug rehabs can take care of Percocet addiction.
I am trying to write a regular expression for this. I have tried
\bacetaminophen\b
But it is replacing both occurrences.
Any help would be appreciated.
Thanks

Use the optional $limit parameter of PHP's preg_replace function (http://us2.php.net/manual/en/function.preg-replace.php):
$text = preg_replace('/acetaminophen/i', 'da-da daa', $text, 1);
This will replace only the first occurrence.
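For comparison, here is a minimal Python sketch of the same "limit the number of replacements" idea. The sample text and the shortened tooltip markup are assumptions made for illustration, not the asker's real data.

```python
import re

# Shortened sample text with two occurrences of the target word.
text = (
    "Percocet is partly made from oxycodone and partly made from acetaminophen. "
    "Drugs like Percocet have both oxycodone and acetaminophen as ingredients."
)

# Hypothetical, shortened tooltip markup standing in for the full replacement.
replacement = "{yootooltip :: analgesic}acetaminophen{/yootooltip}"

# count=1 stops after the first match, much like $limit in preg_replace.
result = re.sub(r"\bacetaminophen\b", replacement, text, count=1, flags=re.IGNORECASE)
print(result)
```

Only the first occurrence gets wrapped; the second is left untouched.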

For the tool you're using, just use this:
(.*?)\bacetaminophen\b

Related

Can you limit the words between two capturing groups in Regex

I have been trying to create a parser for law texts.
I need a way to find "external links" like: art. 45 alin. (1) din Lege nr. 54/2000
The problem is that my country's law-writing style badly lacks uniformity, which means the links might sometimes look like this: articolul 45 alineatul (1) din Legeea nr. 30/2000
On top of that, my language has many inflected forms for each word (articol, articolului, articolelor...).
That means that I need to generalize the first part (art.) to catch as many forms as possible, and hope that the last part is a law number and year (54/2000).
Now here comes the hard part. Every section that starts with Articol N triggers the regex, which then runs on and on until it finds a law number and year that has absolutely no relation to that article.
This is how it looks: \b(((A|a)rt.*?) \(?\d*?\)??)( \w*? )*?nr\.? (\d+\/\d\d\d\d|\d+\/\d\d\d\d)\b
My question: is there a way to limit the number of words between the two capturing groups?
Link to a doc showing what should pass and what should not:
https://docs.google.com/document/d/1vn2HwYaCq8UB1felY1GvfmbTI2w8o5RgW4efD9fsvQM/edit?usp=sharing
As Cary and James answered in comments above, I used (?:\S+\s*){0,15}. I used \S instead of \w to include punctuation and thus, abbreviated forms of the names of the Law (e.g. Const. for Constitution). That was the reason why my original regex wasn't working even when using {m,n}.
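A minimal Python sketch of that bounded-gap idea follows. The pattern is a simplified, hypothetical stand-in for the original regex, and it uses \s+ rather than \s* inside the repetition so that each repetition consumes one whole token; both are choices of this sketch, not part of the accepted fix.

```python
import re

# Simplified, hypothetical pattern: an "art..." word plus a number, then at
# most 15 whitespace-separated tokens, then "nr. N/YYYY".
LINK = re.compile(
    r"\b([Aa]rt\S*\s+\d+)"      # article reference, e.g. "art. 45"
    r"\s+(?:\S+\s+){0,15}?"     # bounded gap: up to 15 tokens of any kind
    r"nr\.?\s*(\d+/\d{4})\b"    # law number and year, e.g. "54/2000"
)

def find_links(text):
    """Return (article, law number/year) pairs found in text."""
    return [(m.group(1), m.group(2)) for m in LINK.finditer(text)]

print(find_links("art. 45 alin. (1) din Lege nr. 54/2000"))
```

Because the gap is capped, an article heading followed by a long unrelated passage no longer gets paired with a law number far away.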

Regex - Company name

I have a plain text and need to extract company names. It's a huge document including company names, financial reports and lots of text. Here are examples of company names:
Big laundry, a.s.
AVERA, s.r.o.
Airoflot Airlines, a.s.
Is it even possible to write a regex for this? I'm a complete beginner with regex and have no idea how to create this one. Thanks for any help.
Example of text:
`There are many competitors of AVERA, s.r.o. the main one is Airflot Airlines, a.s. and Big laundry, s.r.o. These organisations hold main share of market.
Another companies:
a. Big Company, a.s.
b. Smaller company, s.r.o.
c. Huge company, a.s.`
As the question currently stands, no, it is not possible to create a regex for company names.
It would be possible if you were able to define a PATTERN.
That means, for example, that a company name always:
starts with an uppercase letter
has a comma
after the comma there is always one of "a.s." or "s.r.o."
So, the difficulties that I see here are:
How many words before the comma belong to the name?
Is there always a comma with following abbreviation?
Names are always difficult to match because a name can be nearly everything, especially company names.
The examples you give follow this pattern : ([A-Z][A-Za-z]+ ?)+, (\w\.)+
The matching operation will depend on the tool you use.
For example in JavaScript :
var line = "some name is Airoflot Airlines, a.s. in this line";
var m = line.match(/([A-Z][A-Za-z]+ ?)+, (\w\.)+/);
if (m) console.log(m[0]); // match() returns null when nothing matches
This logs
"Airoflot Airlines, a.s."
But this isn't a very reliable solution: many real company names wouldn't fit and, more importantly perhaps, it would match text that isn't a company name. So it can only be used as a help in a solution which also incorporates some kind of validation (human or dictionary based).
I use this
(?:\s*[a-zA-Z0-9,_\.\077\0100\*\+\&\#\'\~\;\-\!\#\;]{2,}\s*)*
It matches a-z, A-Z, 0-9 and the special characters that QuickBooks supports:
https://community.intuit.com/articles/1146006-acceptable-characters-in-the-company-name-in-quickbooks-online
With your given examples, this regexp would match:
Big laundry, a\.s\.|AVERA, s\.r\.o\.|Airoflot Airlines, a\.s\.
The trick is to use the alternation operator | on a set of known strings.
You may wish to allow for missing punctuation and white space in the company names too.
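A minimal Python sketch of building such an alternation from a list of known names is shown here. The list holds only the three examples from the question; a real list would have to come from somewhere else, which is an assumption of this approach.

```python
import re

# Known names, taken from the examples in the question.
COMPANIES = ["Big laundry, a.s.", "AVERA, s.r.o.", "Airoflot Airlines, a.s."]

# re.escape makes the dots literal. Longer names go first so that a name
# which is a prefix of another one cannot shadow it.
pattern = re.compile(
    "|".join(re.escape(name) for name in sorted(COMPANIES, key=len, reverse=True))
)

def find_companies(text):
    """Return every known company name that occurs in text, in order."""
    return pattern.findall(text)

print(find_companies("Competitors of AVERA, s.r.o. include Big laundry, a.s."))
```

This only finds names that are already on the list, spelled exactly as listed.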

How to get sentence number from input?

It seems hard to detect a sentence boundary in a text. Punctuation marks like . ! ? may be used to delimit sentences, but not very accurately, as there may be ambiguous words and abbreviations such as U.S.A or Prof. or Dr. I am studying the TPerlRegEx library and the Regular Expressions Cookbook by Jan Goyvaerts, but I do not know how to write an expression that detects a sentence.
What would be a comparatively accurate expression using TPerlRegEx in Delphi?
Thanks
First, you probably need to arrive at your own definition of what a "sentence" is, then implement that definition. For example, how about:
He said: "It's OK!"
Is it one sentence or two? A general answer is irrelevant. Decide whether you want it to interpret it as one or two sentences, and proceed accordingly.
Second, I don't think I'd be using regular expressions for this. Instead, I would scan each character and try to detect sequences. A period by itself may not be enough to delimit a sentence, but a period followed by whitespace or carriage return (or end of string) probably does. This immediately lets you weed out U.S.A (periods not followed by whitespace).
For common abbreviations like Prof. and Dr. it may be a good idea to create a dictionary - perhaps editable by your users, since each language will have its own set of common abbreviations.
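The scanning idea described so far can be sketched in Python roughly as follows. The abbreviation set and the function name are hypothetical, and the rules are deliberately minimal: a terminator counts only when followed by whitespace or the end of the text, and only when the token it ends is not a listed abbreviation.

```python
# Hypothetical abbreviation list; a real one would be language-specific
# and probably user-editable.
ABBREVIATIONS = {"prof.", "dr.", "mr.", "mrs."}

def split_sentences(text):
    """Split on . ! ? when followed by whitespace or end of text,
    unless the token ending there is a known abbreviation."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch not in ".!?":
            continue
        at_end = i + 1 == len(text)
        if not at_end and not text[i + 1].isspace():
            continue  # e.g. the inner periods of U.S.A
        words = text[start:i + 1].split()
        if words and words[-1].lower() in ABBREVIATIONS:
            continue  # "Prof." or "Dr." does not end the sentence
        sentence = text[start:i + 1].strip()
        if sentence:
            sentences.append(sentence)
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # whatever is left without a terminator
    return sentences

print(split_sentences("Dr. Smith lives in the U.S.A and works a lot. He said hi!"))
```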
Each language will have its own set of punctuation rules too, which may affect how you interpret punctuation characters. For example, English tends to put a period inside the parentheses (like this.) while Polish does the opposite (like this). The same difference will apply to double quotes, single quotes (some languages don't use them at all, sometimes they are indistinguishable from apostrophes etc.). Your rules may well have to be language-specific, at least in part.
In the end, you may approximate the human way of delimiting sentences, but there will always be cases that can throw the analysis off. For example, assuming that you have a dictionary that recognizes "Prof." as an abbreviation, what are you going to do about
Most people called him Professor Jones, but to me he was simply The Prof.
Even if you have another sentence that follows and starts with a capital letter, that still won't help you know where the sentence ends, because it might as well be
Most people called him Professor Jones, but to me he was simply Prof. Bill.
Check my tutorial here http://code.google.com/p/graph-expression/wiki/SentenceSplitting. This concrete example can be easily rewritten to regular expressions and some imperative code.
It would be wise to use an NLP library with a pre-trained model. EnglishSD.nbin is one such model that is available for OpenNLP, and it can be used in Visual Studio with SharpNLP.
The advantages of this method are numerous. For example, consider the input
Prof. Jessica is a wonderful woman. She is a native of U.S.A. She is married to Mr. Jacob Jr.
If you are using a regex split, for example
string[] sentences = Regex.Split(text, @"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");
Then the above input will be split as
Prof.
Jessica is a wonderful woman.
She is a native of U.S.A.
She is married to Mr.
Jacob Jr.
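To try the lookaround split outside .NET, here is a rough Python port of the pattern above. It is a sketch for experimenting, assuming Python's re module treats these lookarounds the same way for this input; the variable names are made up for the example.

```python
import re

text = ("Prof. Jessica is a wonderful woman. She is a native of U.S.A. "
        "She is married to Mr. Jacob Jr.")

# Same idea as the Regex.Split call: break on whitespace that follows a
# letter, digit or quote plus . ! or ?, and that precedes a capital letter.
SPLIT = re.compile(r"""(?<=['"A-Za-z0-9][.!?])\s+(?=[A-Z])""")

pieces = SPLIT.split(text)
for piece in pieces:
    print(piece)
```

The printed pieces show titles such as Prof. and Mr. being cut off from the names that follow them, which is the weakness of a purely pattern-based split.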
However the desired output is
Prof. Jessica is a wonderful woman.
She is a native of U.S.A.
She is married to Mr. Jacob Jr.
This kind of logical sentence split can be achieved only using trained models from OpenNLP project. The method is as simple as this.
private string mModelPath = @"C:\Users\ATS\Documents\Visual Studio 2012\Projects\Google_page_speed_json\Google_page_speed_json\bin\Release\";
private OpenNLP.Tools.SentenceDetect.MaximumEntropySentenceDetector mSentenceDetector;
// Lazily loads the pre-trained English model, then splits the paragraph.
private string[] SplitSentences(string paragraph)
{
    if (mSentenceDetector == null)
    {
        mSentenceDetector = new OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");
    }
    return mSentenceDetector.SentenceDetect(paragraph);
}
where mModelPath is the path of the directory containing the nbin file.
The mSentenceDetector is derived from the OpenNLP dll.
You can get the desired output by
string[] sentences = SplitSentences(text);
Kindly read through this article I have written for integrating SharpNLP with your Application in Visual Studio to make use of the NLP tools

Weighted disjunction in Perl Regular Expressions?

I am fairly experienced with regular expressions, but I am having some difficulty with a current application involving disjunction.
My situation is this: I need to separate an address into its component parts based on a regular expression match on the "Identifier elements" of the address -- A comparable English example would be words like "state", "road", or "boulevard"--IF, for example, we wrote these out in our addresses. Imagine we have an address like the following, where (and this would never happen in English), we specified the identifier type after each name
United States COUNTRY California STATE San Francisco CITY Mission STREET 345 NUMBER
(Where the words in CAPS are what I have called "identifiers").
We want to parse it into:
United States COUNTRY
California STATE
San Francisco CITY
Mission STREET
345 NUMBER
OK, this is certainly contrived for English, but here's the catch: I am working with Chinese data, where in fact this style of identifier specification happens all the time. An example below:
云南-省 ; 丽江-市 ; 古城-区 ; 西安-街 ; 杨春-巷 ;
Yunnan-Province ; LiJiang-City ; GuCheng-District ; Xi'An-Street ; Yangchun-Alley
This is easy enough--a lazy match on the potential candidate identifier names, combined into a disjunctive list.
For China, the following are the "province-level" entities:
省 (Province) ,
自治区 (Autonomous Region) ,
市 (Municipality)
So my regex so far looks like this:
(.+?(?:(?:省)|(?:自治区)|(?:市)))
I have a series of these, in order to account for different portions of the address. The next level, corresponding to cities, for instance, is:
(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
So to match a province entity followed by a city entity:
(.+?(?:(?:省)|(?:自治区)|(?:市)))(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
With named capture groups:
(?<Province>.+?(?:(?:省)|(?:自治区)|(?:市)))(?<City>.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
For the above, this yields:
$+{Province} = 云南省
$+{City} = 丽江市
This is all good and well, and gets me pretty far. The problem, however, is when I try to account for identifiers that can be a substring of other identifiers. A common street-level entity, for instance, is "村委会", which means village organizing committee. In the set of addresses I wish to separate, not every address has this written out in full. In fact, I find "村委" and just plain "村" as well.
The problem? If I have a pure disjunction of these elements, we have the following:
(?<Street>.+?(?:(?:村委会)|(?:村委)|(?:村)))
What happens, though, is that if you have an entity 保定-村委会 (Baoding Village organizing committee), this lazy regex stops at 村 and calls it a day, orphaning our poor 委会 because 村 is one of the potential disjunctive elements.
Imagine an English equivalent like the following:
(?<Animal>.+?(?:(?:Cat)|(?:Elephant)|(?:CatElephant)|(?:City)))
We have two input strings:
1. "crap catelephant crap city", where we wanted "Crap catelephant" and "crap city"
2. "crap catelephant city" , where we wanted "crap cat" "elephant city"
Ah, the solution, you say, is to make the pre-identifier capture greedy. But! There are entities that have the same identifier but are not at the same level.
Take 市 for example. It means simply "city". But in China, there are county-level, province-level, and municipality-level cities. If this character occurred twice in the string, especially in two adjacent entities, the greedy search would incorrectly tag the greedy match as the first entity. As in the following:
广东-省 ; 江门-市 ; 开平-市 ; 三埠-区 ; 石海管-区
Guangdong-province ; Jiangmen-City ; Kaiping-City ; Sanbu-District ; Shihaiguan-District
(Note, as above, this has been hand-segmented. The raw data would simply have a string of concatenated characters)
The match for a greedy search would be
江门市开平市
This is wrong, as the two adjacent entities should be separated into their constituent parts. One is at the level of provincial city, the other is a county-level city.
Back to the original point, and I thank you for reading this far, is there a way to put a weighting on disjunctive entities? I would want the regex to find the highest "weighted" identifier first. 村委会 instead of simple 村 for example, "catelephant" instead of just "cat". In preliminary experiments, the regex parser apparently proceeds left to right in finding disjunctive matches. Is this a valid assumption to make? Should I put the most frequently-occurring identifiers first in the disjunctive list?
If I have lost anyone with Chinese-related details, I apologize, and can further clarify if needed. The example really doesn't have to be Chinese--I think more generally it is a question about the mechanics of the regex disjunctive match -- in what order does it preference the disjunctive entities, and how does it decide when to "call it a day" in the context of a lazy search?
In a way, is there some sort of middle ground between lazy and greedy searches? Find the smallest bit you can find before the longest / highest weighted disjunctive entity? Be lazy, but put in that little bit of extra effort if you can for the sake of thoroughness?
(Incidentally, my work philosophy in college?)
How alternations are handled depends on the particular regular expression engine. For almost all engines (including Perl's regular expression engine) the alternation matches eagerly - that is, it matches the left-most choice first and only tries another alternative if this fails. For example, if you have /(cat|catelephant)/ it will never match catelephant. The solution is to reorder the choices so that the most specific comes first.
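A small Python sketch of that reordering is given below. Python's re module also tries alternatives left to right, so it stands in for Perl here. Sorting by length is just one way to express the "weight"; the helper name lazy_entity is hypothetical.

```python
import re

# Leftmost alternative wins: "cat" matches, so "catelephant" is never tried.
first = re.search(r"(cat|catelephant)", "catelephant").group(1)

def lazy_entity(identifiers):
    """Build 'shortest name, then an identifier', trying longer
    identifiers before the shorter ones they start with."""
    ordered = sorted(identifiers, key=len, reverse=True)
    alternation = "|".join(re.escape(i) for i in ordered)
    return re.compile(r"(?P<Street>.+?(?:" + alternation + r"))")

# Short-first order reproduces the problem; the sorted order avoids it.
short_first = re.compile(r"(?P<Street>.+?(?:村|村委|村委会))")
street = lazy_entity(["村", "村委", "村委会"])

print(first)
print(short_first.match("保定村委会").group("Street"))
print(street.match("保定村委会").group("Street"))
```

Any explicit priority order works the same way, because the engine simply tries the alternatives in the order they are written.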

Regexp to parse out a person's name?

This might be a hard one (if not impossible), but can anyone think of a regular expression that will find a person's name, in say, a resume? I know this won't be 100% accurate, but I can't come up with something.
Let's assume the name only shows up once in the document.
No, you can't use regular expressions for this. The only chance you have is if the document is always in the same format and you can find the name based on the context surrounding it. But this probably isn't the case for you.
If you are asking your applicants to submit their résumé online you could provide a separate field for them to enter their name and any other information you need instead of trying to automatically parse résumés.
Forget it - seriously.
Or expect to get a lot of applications from a Mr C Vitae
In my experience, having written something very similar (but a very long time ago), about 95% of resumes have the person's name as the very first line. You could probably have a pretty loose regex checking for alpha, hyphens, periods, and assume that's the name.
Obviously there's no way to do this 100% accurately, as you said, but this would be close.
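A loose first-line check along those lines might look like this in Python. The word rule (letters, hyphens, apostrophes, periods, two to four words) is an assumption of this sketch, and it will both miss real names and accept lines that are not names.

```python
import re

LETTER = r"[^\W\d_]"  # roughly "any letter", including accented ones
WORD = LETTER + r"(?:" + LETTER + r"|[.'-])*"
# Hypothetical rule: two to four such words and nothing else on the line.
NAME_LINE = re.compile(r"^" + WORD + r"(?:\s+" + WORD + r"){1,3}$")

def guess_name(resume_text):
    """Return the first non-empty line if it looks like a name, else None."""
    for line in resume_text.splitlines():
        line = line.strip()
        if line:
            return line if NAME_LINE.match(line) else None
    return None

print(guess_name("Mary-Jane O. Smith\nSoftware Engineer\n"))
```

Note that a resume starting with the line "Curriculum Vitae" would pass this check too, so some human review is still needed.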
Unless you wanted to build an expression that contained every possible name, or-ed together, the expression you are referring to is not "Regular," with a capital R. A good guess might be to go looking for the largest-font words in the document. If they follow a pattern that looks like firstname-lastname, name-initial-name, etc., you could call it a good guess...
That's a really hairy problem to tackle. The regex has to match two words that could be someone's name. The problem with that is that some people, of Hispanic origin, for example, might have a name that's more than 2 words. Also, how would you define two words to match for a name? Would you use a database of common first and last name fields? That might work unless someone has an uncommon name.
I'm reminded of a story a COBOL teacher in college told me about an individual of Asian origin whose name would break every rule the programmers defined for a bank's internal system. His first name was "O.", just the letter O.
The only remotely dependable way to nail down the regex would be if you had something to set off your search with; maybe if a line of text in the resume began with "Name: " then you'd know where to start looking.
tl;dr: People's names and individual resumes are too heavily varied for a regular expression to pick apart.
You could do something like Amazon does for book overviews: SIPs. This would require some after-the-fact double checking by humans but you might find the person's name(s) in there.