Regex to add a ? at the end of non-punctuated sentences

I'm looking for a regex to add a ? at the end of every non-punctuated sentence in a text document. I want to do this in EditPad Pro or PowerGREP.
For example:
Lichen planus occurs most frequently on the
A. buccal mucosa.
B. tongue.
C. floor of the mouth.
D. gingiva.
In the absence of “Hanks balanced salt solution”, what is the most appropriate media to transport an avulsed
A. Saliva.
B. Milk.
C. Saline.
D. Tap water.
Which of the following is the most likely cause of osteoporosis, glaucoma, hypertension and peptic ulcers in a 65 year old with Crohn’s disease
A. Uncontrolled diabetes.
B. Systemic corticosteroid therapy.
C. Chronic renal failure.
D. Prolonged NSAID therapy.
E. Malabsorption syndrome.
DESIRED RESULT
Lichen planus occurs most frequently on the?
A. buccal mucosa.
B. tongue.
C. floor of the mouth.
D. gingiva.
In the absence of “Hanks balanced salt solution”, what is the most appropriate media to transport an avulsed?
A. Saliva.
B. Milk.
C. Saline.
D. Tap water.
Which of the following is the most likely cause of osteoporosis, glaucoma, hypertension and peptic ulcers in a 65 year old with Crohn’s disease?
A. Uncontrolled diabetes.
B. Systemic corticosteroid therapy.
C. Chronic renal failure.
D. Prolonged NSAID therapy.
E. Malabsorption syndrome.

You really don't need code of any kind for that - just a wildcard Find/Replace, where:
Find = ([!.])( [A-Z].)
Replace = \1?\2
You could, of course, implement the above as a macro or some other script, but it hardly seems worth the effort.
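If you would rather script it than use Find/Replace, here is a minimal Python sketch. It assumes a "non-punctuated sentence" means a line whose last non-space character is not `.`, `?` or `!`; the function name `add_question_marks` is made up for illustration.

```python
import re

def add_question_marks(text: str) -> str:
    """Append '?' to every line that does not already end in . ? or !"""
    # ([^\s.?!]) : last visible character of the line, not sentence punctuation
    # [ \t]*$    : optional trailing spaces up to the end of the line (re.M)
    return re.sub(r"([^\s.?!])[ \t]*$", r"\1?", text, flags=re.M)

sample = (
    "Lichen planus occurs most frequently on the\n"
    "A. buccal mucosa.\n"
    "B. tongue.\n"
)
print(add_question_marks(sample))
```

Blank lines are left alone because the pattern needs at least one visible character on the line.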

Word find and replace with first character from previous line

•
A. Review the Stakeholder Register.
•
B. Review the RACI Chart.
•
C. Review the work breakdown structure.
•
D. Examine the Change Management Plan.
(Correct)
•
A. Customer, Sponsor.
•
B. Suppliers, Regulatory bodies.
(Correct)
•
C. Project Manager, Team members.
•
D. Competitors, PMO.
A. Forming.
•
B. Storming.
•
C. Norming.
•
D. Performing.
(Correct)
I want to find the answer key "(Correct)".
For example, in the first block:
"• D. Examine the Change Management Plan.
(Correct)"
the answer is D, so I just want to print "D", which is the first character of the previous line.
The same goes for the second block, where "(Correct)" follows "B. Suppliers, Regulatory bodies.", so I just want to get the first character, "B", as the answer.
The sample output would look like this:
D.
B.
D.
Any suggestions on how I can do this in Word, a text editor, or anywhere else?
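One possible approach outside Word is a short Python script. This is only a sketch, assuming each option line starts with a single letter and a period, and that "(Correct)" sits on the line directly after the right option; `extract_answers` is a hypothetical helper name.

```python
import re

def extract_answers(text: str) -> list[str]:
    """Return the option letter of every line that is followed by '(Correct)'."""
    # ^(?:•\s*)?   : optional bullet, on the same line or the line before
    # ([A-Z])\..*  : option letter, a period, then the rest of the option line
    # \r?\n\(Correct\) : the very next line must start with "(Correct)"
    pattern = re.compile(r"^(?:•\s*)?([A-Z])\..*\r?\n\(Correct\)", re.M | re.I)
    return [letter + "." for letter in pattern.findall(text)]

sample = (
    "•\nC. Review the work breakdown structure.\n"
    "•\nD. Examine the Change Management Plan.\n(Correct)\n"
    "•\nB. Suppliers, Regulatory bodies.\n(Correct)\n"
    "•\nC. Norming.\n•\nD. Performing.\n(Correct)\n"
)
print("\n".join(extract_answers(sample)))
```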

I would like to use regex to format text

I would very much appreciate any help getting the following done in Notepad++.
I have more than 40,000 lines like the ones below. All of them are English language tests. They look like this:
Q1 "I know that you don't like seafood but our friends make the best seafood Fettuccini Alfredo I have ever had.
Will you agree to keep an ......... mind and try it before deciding you don't like it?" Jill asked her son.
(a) open (b) airy (c) indifferent (d) ignorant
Q2 The boss was a scary guy. When he called you into his office, you could bet that you would receive the worse
insults you have ever had to endure and there was something about him that would stop anyone from talking
back to him. People immediately froze in their ......... and meekly walked into his office when he called them.
(a) paths (b) tracks (c) cars (d) shoes
Q3 In ......... to change the sink, he would have to turn off the water that runs to the facet. He failed to do so and got
a surprise when water started liberally spraying down on the kitchen floor.
(a) ability (b) possibility (c) plausibility (d) order
Q4 Since the company put a set of sexual harassment rules in ......... incidents of sexual harassment were virtually
non-existent.
(a) ordering (b) place (c) storage (d) foundation
Q5 "Your shed is in pretty poor ......... . The back of the foundation is sinking and there is water getting into it from
the roof. I can't help you with the foundation but we can look for ways to seal it," Rob said to Christian.
(a) mass (b) density (c) support (d) shape
As you can see, the questions are not on one line; they are broken up by a line break, an empty line, and some leading space characters.
I would like to achieve something like this:
Q1 "I know that you don't like seafood but our friends make the best seafood Fettuccini Alfredo I have ever had. Will you agree to keep an ......... mind and try it before deciding you don't like it?" Jill asked her son.
(a) open (b) airy (c) indifferent (d) ignorant
Q2 The boss was a scary guy. When he called you into his office, you could bet that you would receive the worse insults you have ever had to endure and there was something about him that would stop anyone from talking back to him. People immediately froze in their ......... and meekly walked into his office when he called them.
(a) paths (b) tracks (c) cars (d) shoes
Q3 In ......... to change the sink, he would have to turn off the water that runs to the facet. He failed to do so and got a surprise when water started liberally spraying down on the kitchen floor.
(a) ability (b) possibility (c) plausibility (d) order
Q4 Since the company put a set of sexual harassment rules in ......... incidents of sexual harassment were virtually non-existent.
(a) ordering (b) place (c) storage (d) foundation
Q5 "Your shed is in pretty poor ......... . The back of the foundation is sinking and there is water getting into it from the roof. I can't help you with the foundation but we can look for ways to seal it," Rob said to Christian.
(a) mass (b) density (c) support (d) shape
So I need each question on a single line, however long the question is. The question options must stay on their own line below it, as they are; I only need to change the questions, not the options (a, b, c, d).
Manually, I would have to go line by line and delete characters until each question is on one line. With tens of thousands of questions, that would be hard to do. Is there a way to do it in Notepad++ with regex?
If it helps, every question starts with Q1, Q2, Q3 and so on up to Q10. All the lines that start with (a) are question options.
Two approaches:
Based on the fact that the start of each line you want to attach is always indented, you can use
\R++\h++([^(])
and replace with $1.
Or based on the fact that you don't want to merge lines starting with an opening bracket or Q number, you can use
\R++\h*+((?!Q\d)[^(])
and again replace with $1.
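For a file this large, a script may be more convenient than the editor. Below is a rough Python equivalent of the first approach, not a drop-in translation: Python's `re` has no `\R` or `\h`, so `\r?\n` and `[ \t]` stand in for them, and a lookahead replaces the captured character. It assumes, as above, that continuation lines are indented and option lines start with `(`. The name `join_wrapped_lines` is made up for illustration.

```python
import re

def join_wrapped_lines(text: str) -> str:
    """Join indented continuation lines onto the previous line."""
    # [ \t]*(?:\r?\n)+ : trailing spaces plus one or more line breaks
    # [ \t]+           : the indentation of the continuation line
    # (?=[^(\s])       : next character is visible and is not "("
    pattern = re.compile(r"[ \t]*(?:\r?\n)+[ \t]+(?=[^(\s])")
    return pattern.sub(" ", text)

sample = (
    "Q4 Since the company put rules in ......... incidents were virtually\n"
    "\n"
    "   non-existent.\n"
    "(a) ordering (b) place (c) storage (d) foundation\n"
)
print(join_wrapped_lines(sample))
```

Unlike the Find/Replace version, this inserts a single space at each join, so words on either side of the break do not run together.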
You can use the following regex:
Q(.*)(\r?\n)+\h+(\w)
and replacement:
Q\1 \3
or
Q(.+)\v+\h+(\w)
and replacement:
Q\1 \2
Click on Replace All a couple of times and it will be done.
EXPLANATIONS:
Q(.+)\v+\h+(\w) matches a line starting with Q followed by one or more characters, then one or more line-break characters, then one or more horizontal whitespace characters, then a word character. Requiring a word character keeps the answer lines, which start with (, out of the match.
You then replace the whole match with Q\1 \2: the Q, the first line of the question, a space, and the first character of the next line of the question (using backreferences).
Each pass typically joins only one line break per question, so you need to click Replace All repeatedly until no more occurrences are replaced.
Let me know if anything is unclear.
TESTED:
Try this with Find+Replace:
(\n\s+)(\s\w)
Replace with:
$2

Selecting sentences surrounding a keyword

I am a Python beginner. I tried to figure this out, but I failed. I need to find a keyword in a text file. If there is the keyword in any part of the whole text, then I need to select sentences surrounding the keyword, including the keyword. The number of sentences is arbitrary so it could be 5 or 10. There could be a blank line between sentences so I need to include the blank line as well.
For example:
Let the keyword be: compensation
Let the input text be:
"The costs incidental to our solicitation and obtaining of proxies, including the cost of reimbursing banks and brokers for forwarding proxy materials to their principals, will be borne by us. Proxies may be solicited, without extra compensation, by our officers and employees, both in person and by mail, telephone and other methods of communication."
The output I want for example: "The costs incidental... compensation... communication."
I tried this:
p = re.compile(r'[^.]compensation[^.]+.')
p.findall(text)
With the above code, I can select only the sentence that contains the keyword. What I need is to select the sentences surrounding the keyword, and to control the number of sentences before and after the sentence containing the keyword. So, for example, if I want to select the two sentences before the sentence containing the keyword, that sentence itself, and the two sentences after it, what should I do?
Assuming your input is structured as: <sentence> <period> <sentence> <period>
You first need to match the full sentence containing your keyword; the keyword may sit at the start of the sentence, at the end, or (although unlikely) both. Then you match the desired number of <sentence> <period> units before it, and likewise after it.
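import re
# Read the whole file; [^\.] also matches newlines, so sentences may span lines.
with open('text.txt', 'r') as f:
    s = f.read()
# Two sentences before, the keyword sentence, then three sentences after.
p = re.compile(r'(([^\.]*\.){2}[^\.]*compensation[^\.]*\.([^\.]*\.){3})')
for i in p.findall(s):
    # i[0] is the outermost group, i.e. the full match.
    print("match='" + i[0] + "'")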
Because we are using the group metacharacters '(' and ')', findall() will return a list of tuples of those groups, which is not what we want. So we add a group around the whole regex (which will necessarily be the first group, as it is the outermost one).
EDIT: Another possibility is to use non-capturing groups (?:...). findall() will then return only the full matches.
Allowing the number of sentences matched before (2) and after (3) to vary is left as an exercise (this should be easy to do using the string formatting facilities of Python).
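One way the exercise might be approached, combining the non-capturing groups from the EDIT with an f-string. This is a sketch under the same assumption as above, namely that every sentence ends with a period; the function name and its `before`/`after` parameters are made up for illustration.

```python
import re

def surrounding_sentences(text, keyword, before=2, after=3):
    """Return each keyword sentence with `before`/`after` neighbouring sentences."""
    sentence = r"(?:[^.]*\.)"  # non-capturing: any run of non-periods, then a period
    pattern = re.compile(
        rf"{sentence}{{{before}}}"               # N sentences before
        rf"[^.]*{re.escape(keyword)}[^.]*\."     # the sentence holding the keyword
        rf"{sentence}{{{after}}}"                # M sentences after
    )
    # With no capturing groups, findall() returns the full matches as strings.
    return pattern.findall(text)

text = "A. B. C. D. My compensation is your compensation. E. F. G. Hi. Ijjk."
for match in surrounding_sentences(text, "compensation"):
    print("match='" + match + "'")
```

If fewer than `before` or `after` sentences are available around the keyword, the pattern simply does not match there, just like the fixed {2}/{3} version.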
Output
match=' Holy bacon. The costs incidental to our solicitation and
obtaining of proxies, including the cost of reimbursing banks and
brokers for forwarding proxy materials to their principals, will be
borne by us. Proxies may be solicited, without extra compensation, by
our officers and employees, both in person and by mail, telephone and
other methods of communication. My. Oh My. God.'
match=' C. D. My compensation is your compensation. E. F. G.'
Text used
Golly. Jeez. Holy bacon. The costs incidental to our solicitation and
obtaining of proxies, including the cost of reimbursing banks and
brokers for forwarding proxy materials to their principals, will be
borne by us. Proxies may be solicited, without extra compensation, by
our officers and employees, both in person and by mail, telephone and
other methods of communication. My. Oh My. God. Feels. Good. To be.
King of the Jungle.
A. B. C. D. My compensation is your compensation. E. F. G. Hi. Ijjk.
Lllme.

Replace ONLY first occurrence of word from a paragraph using RegEx

I have two paragraphs. I want to replace ONLY first occurrence of a specific word 'acetaminophen' by '{yootooltip :: It is a widely used over-the-counter analgesic (pain reliever) and antipyretic (fever reducer). Excessive use of paracetamol can damage multiple organs, especially the liver and kidney.}acetaminophen{/yootooltip}'
The paragraph is:
Percocet is a painkiller which is partly made from oxycodone and partly made from acetaminophen. It will usually be prescribed for a patient who is suffering from acute severe pain. Because it has oxycodone in it, this substance can create an addiction and is also a dangerous prescription drug to abuse. It is illegal to either sell or use Percocet that has not been prescribed by a licensed professional.
In 2008, drugs like Percocet (which have both oxycodone and acetaminophen as their main ingredients) were the prescription drugs most sold in all of Ontario. The records also show that the rates of death by oxycodone (this includes brand name like Percocet) doubled. That is why it is imperative that the people who are addicted go to an Ontario drug rehab center. Most of the drug rehabs can take care of Percocet addiction.
I am trying to write a regular expression for this. I have tried
\bacetaminophen\b
But it is replacing both occurrences.
Any help would be appreciated.
Thanks
Use the optional $limit parameter of PHP's preg_replace function: http://us2.php.net/manual/en/function.preg-replace.php
$text = preg_replace('/acetaminophen/i', 'da-da daa', $text, 1);
This will replace only the first occurrence.
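The same idea exists in other languages. As an illustration only (the question is about PHP), Python's `re.sub` takes a `count` argument that plays the role of `$limit`. The helper name `replace_first` and the shortened tooltip text are made up for this sketch.

```python
import re

def replace_first(text: str, word: str, replacement: str) -> str:
    """Replace only the first whole-word, case-insensitive occurrence of `word`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    # A function replacement keeps backslashes in `replacement` literal.
    return pattern.sub(lambda m: replacement, text, count=1)

text = ("partly made from acetaminophen. "
        "Both oxycodone and acetaminophen are main ingredients.")
tooltip = "{yootooltip :: pain reliever}acetaminophen{/yootooltip}"
print(replace_first(text, "acetaminophen", tooltip))
```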
For the tool you're using, just use this:
(.*?)\bacetaminophen\b

Weighted disjunction in Perl Regular Expressions?

I am fairly experienced with regular expressions, but I am having some difficulty with a current application involving disjunction.
My situation is this: I need to separate an address into its component parts based on a regular expression match on the "Identifier elements" of the address -- A comparable English example would be words like "state", "road", or "boulevard"--IF, for example, we wrote these out in our addresses. Imagine we have an address like the following, where (and this would never happen in English), we specified the identifier type after each name
United States COUNTRY California STATE San Francisco CITY Mission STREET 345 NUMBER
(Where the words in CAPS are what I have called "identifiers").
We want to parse it into:
United States COUNTRY
California STATE
San Francisco CITY
Mission STREET
345 NUMBER
OK, this is certainly contrived for English, but here's the catch: I am working with Chinese data, where in fact this style of identifier specification happens all the time. An example below:
云南-省 ; 丽江-市 ; 古城-区 ; 西安-街 ; 杨春-巷 ;
Yunnan-Province ; LiJiang-City ; GuCheng-District ; Xi'An-Street ; Yangchun-Alley
This is easy enough: a lazy match up to one of the candidate identifier names, which are combined into a disjunctive list.
For China, the following are the "province-level" entities:
省 (Province) ,
自治区 (Autonomous Region) ,
市 (Municipality)
So my regex so far looks like this:
(.+?(?:(?:省)|(?:自治区)|(?:市)))
I have a series of these, in order to account for different portions of the address. The next level, corresponding to cities, for instance, is:
(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
So to match a province entity followed by a city entity:
(.+?(?:(?:省)|(?:自治区)|(?:市)))(.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
With named capture groups:
(?<Province>.+?(?:(?:省)|(?:自治区)|(?:市)))(?<City>.+?(?:(?:地区)|(?:自治州)|(?:市)|(?:盟)))
For the above, this yields:
$+{Province} = 云南省
$+{City} = 丽江市
This is all good and well, and gets me pretty far. The problem, however, is when I try to account for identifiers that can be a substring of other identifiers. A common street-level entity, for instance, is "村委会", which means village organizing committee. In the set of addresses I wish to separate, not every address has this written out in full. In fact, I find "村委" and just plain "村" as well.
The problem? If I have a pure disjunction of these elements, we have the following:
(?<Street>.+?(?:(?:村委会)|(?:村委)|(?:村)))
What happens, though, is that if you have an entity 保定-村委会 (Baoding Village organizing committee), this lazy regex stops at 村 and calls it a day, orphaning our poor 委会 because 村 is one of the potential disjunctive elements.
Imagine an English equivalent like the following:
(?<Animal>.+?(?:(?:Cat)|(?:Elephant)|(?:CatElephant)|(?:City)))
We have two input strings:
1. "crap catelephant crap city", where we wanted "Crap catelephant" and "crap city"
2. "crap catelephant city" , where we wanted "crap cat" "elephant city"
Ah, the solution, you say, is to make the pre-identifier capture greedy. But! There are entities that have the same identifier but are not at the same level.
Take 市 for example. It means simply "city". But in China, there are county-level, province-level, and municipality-level cities. If this character occurred twice in the string, especially in two adjacent entities, the greedy search would incorrectly tag the greedy match as the first entity. As in the following:
广东-省 ; 江门-市 ; 开平-市 ; 三埠-区 石海管-区
Guangdong-province ; Jiangmen-City ; Kaiping-City ; Sanbu-District ; Shihaiguan-District
(Note, as above, this has been hand-segmented. The raw data would simply have a string of concatenated characters)
The match for a greedy search would be
江门市开平市
This is wrong, as the two adjacent entities should be separated into their constituent parts. One is at the level of provincial city, the other is a county-level city.
Back to the original point, and I thank you for reading this far, is there a way to put a weighting on disjunctive entities? I would want the regex to find the highest "weighted" identifier first. 村委会 instead of simple 村 for example, "catelephant" instead of just "cat". In preliminary experiments, the regex parser apparently proceeds left to right in finding disjunctive matches. Is this a valid assumption to make? Should I put the most frequently-occurring identifiers first in the disjunctive list?
If I have lost anyone with Chinese-related details, I apologize, and can further clarify if needed. The example really doesn't have to be Chinese--I think more generally it is a question about the mechanics of the regex disjunctive match -- in what order does it preference the disjunctive entities, and how does it decide when to "call it a day" in the context of a lazy search?
In a way, is there some sort of middle ground between lazy and greedy searches? Find the smallest bit you can find before the longest / highest weighted disjunctive entity? Be lazy, but put in that little bit of extra effort if you can for the sake of thoroughness?
(Incidentally, my work philosophy in college?)
How alternations are handled depends on the particular regular expression engine. For most engines (including Perl's regular expression engine), alternation is ordered: the engine tries the left-most choice first and only tries another alternative if that one fails. For example, /(cat|catelephant)/ on its own will never match catelephant, because cat always succeeds first at the same position. The solution is to reorder the choices so that the most specific (longest) comes first.
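A quick way to see the effect of ordering is the sketch below. It uses Python's `re` rather than Perl, since Python also tries alternatives left to right. `build_identifier_pattern` is a hypothetical helper that sorts the identifiers longest-first before joining them, so you do not have to order the list by hand.

```python
import re

def build_identifier_pattern(identifiers):
    """Lazy name followed by one identifier, longest identifiers tried first."""
    ordered = sorted(identifiers, key=len, reverse=True)
    return re.compile(r".+?(?:" + "|".join(re.escape(i) for i in ordered) + r")")

# Ordered alternation: the left-most alternative that matches wins.
print(re.search(r"cat|catelephant", "catelephant").group())   # cat
print(re.search(r"catelephant|cat", "catelephant").group())   # catelephant

# With the longest identifier first, 委会 is no longer orphaned.
street = build_identifier_pattern(["村", "村委", "村委会"])
print(street.match("保定村委会").group())                      # 保定村委会
```

Sorting by length only addresses identifiers that are prefixes of one another. It does not by itself resolve the case where the same identifier (such as 市) appears at two different levels.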