I would like to use regex to format text - regex

I would very much appreciate any help to get the following thing done in notepad ++.
I have more than 40.000 lines like the ones under. All of them are English language tests. They look like this:
Q1 "I know that you don't like seafood but our friends make the best seafood Fettuccini Alfredo I have ever had.
Will you agree to keep an ......... mind and try it before deciding you don't like it?" Jill asked her son.
(a) open (b) airy (c) indifferent (d) ignorant
Q2 The boss was a scary guy. When he called you into his office, you could bet that you would receive the worse
insults you have ever had to endure and there was something about him that would stop anyone from talking
back to him. People immediately froze in their ......... and meekly walked into his office when he called them.
(a) paths (b) tracks (c) cars (d) shoes
Q3 In ......... to change the sink, he would have to turn off the water that runs to the facet. He failed to do so and got
a surprise when water started liberally spraying down on the kitchen floor.
(a) ability (b) possibility (c) plausibility (d) order
Q4 Since the company put a set of sexual harassment rules in ......... incidents of sexual harassment were virtually
non-existent.
(a) ordering (b) place (c) storage (d) foundation
Q5 "Your shed is in pretty poor ......... . The back of the foundation is sinking and there is water getting into it from
the roof. I can't help you with the foundation but we can look for ways to seal it," Rob said to Christian.
(a) mass (b) density (c) support (d) shape
As you can see the questions are not in one line but they are broken with an enter plus an empty line plus and some empty space characters.
I would like to achieve something like this:
Q1 "I know that you don't like seafood but our friends make the best seafood Fettuccini Alfredo I have ever had. Will you agree to keep an ......... mind and try it before deciding you don't like it?" Jill asked her son.
(a) open (b) airy (c) indifferent (d) ignorant
Q2 The boss was a scary guy. When he called you into his office, you could bet that you would receive the worse insults you have ever had to endure and there was something about him that would stop anyone from talking back to him. People immediately froze in their ......... and meekly walked into his office when he called them.
(a) paths (b) tracks (c) cars (d) shoes
Q3 In ......... to change the sink, he would have to turn off the water that runs to the facet. He failed to do so and got a surprise when water started liberally spraying down on the kitchen floor.
(a) ability (b) possibility (c) plausibility (d) order
Q4 Since the company put a set of sexual harassment rules in ......... incidents of sexual harassment were virtually non-existent.
(a) ordering (b) place (c) storage (d) foundation
Q5 "Your shed is in pretty poor ......... . The back of the foundation is sinking and there is water getting into it from the roof. I can't help you with the foundation but we can look for ways to seal it," Rob said to Christian.
(a) mass (b) density (c) support (d) shape
So I need all the questions in one line, one long line as long as the question until it ends. The question options must be under them as they are, I think I do not need to change the question options (a, b, c, d) only the questions.
Manually, I would have to go line by line and delete the characters until the questions are one line each. With tens of thousands of questions, it would be a difficult thing to do. Is there a way that it could be done in Notepad ++ with regex?
If it helps, each and every question starts with Q1, Q2, Q3 and so on up until Q10. All the lines that start with (a) are question options.

Two approaches:
Based on the fact, that the start of the lines you want attach is always indented, you can use
\R++\h++([^(])
and replace with $1.
Or based on the fact that you don't want to merge lines starting with an opening bracket or Q number, you can use
\R++\h*+((?!Q\d)[^(])
and again replace with $1.

You can use the following regex:
Q(.*)(\r?\n)+\h+(\w)
and replacement:
Q\1 \3
or
Q(.+)\v+\h+(\w)
and replacement:
Q\1 \2
Click on Replace All a couple of times and it will be done.
EXPLANATIONS:
Q(.+)\v+\h+(\w) will select all lines starting with Q followed by one or several characters and ending by one or several EOL/Carriage return char themselves followed by several horizontal space characters then followed by a word char to avoid taking the answers into account.
Then you replace the whole thing by Q\1 \2: the Q the first line of the question a space and the second line of the question (by using backreferences)
You need to click several times on replace all until no occurence is replaced.
Let me know if anything is unclear.
TESTED:

Try this with Find+Replace:
(\n\s+)(\s\w)
Replace with:
$2

Related

Can you limit the words between two capturing groups in Regex

I have been trying to create a parser for Law texts.
I need to find a way to find "external links" like : art. 45 alin. (1) din Lege nr. 54/2000
But the problem is that my country law writing style is so, soooo lacking uniformity and that means sometimes the links might look like this : articolul 45 alineatul (1) din Legeea nr. 30/2000
The fact that my language has forms for words for days. (articol, articolului, articolelor....)
That means that i need to generalize that first thing... (art.) as to catch as many forms as possible and pray that the last thing is a law number & year (54/2000).
Now here comes the hard part... The problem is that every section that starts with Articol N starts the regex and it goes on and on until it finds a law number & year that have absolutely no relation between them.
This is how it looks \b(((A|a)rt.*?) \(?\d*?\)??)( \w*? )*?nr\.? (\d+\/\d\d\d\d|\d+\/\d\d\d\d)\b
My question is there a way to limit the words between the two capturing groups?
Link to a Docs to determine what should pass and what not:
https://docs.google.com/document/d/1vn2HwYaCq8UB1felY1GvfmbTI2w8o5RgW4efD9fsvQM/edit?usp=sharing
As Cary and James answered in comments above, I used (?:\S+\s*){0,15}. I used \S instead of \w to include punctuation and thus, abbreviated forms of the names of the Law (e.g. Const. for Constitution). That was the reason why my original regex wasn't working even when using {m,n}.

Regex to capture transcript speaker names before colon

From text transcripts, I want to capture all names of speakers.
The target names start at the start of a line and should end at a ": " (ie. colon and space).
Optionally, for even finer control, it may be safe to assume the first colon and two spaces.
Example text:
Julian Z.: What's really exciting is the opportunity to be more intelligent about how you approach trying to reach your consumer. In a world where digital and the use of digital has exploded, to be able to have one-on-one conversations in the digital world, and to be able to eventually translate that into the TV space, whether that be addressable or data-driven, is really fantastic. Because at the end of the day, you want your brand, in our case, our networks, to be able to have a relationship with the consumer. Data is a proxy to allow for that to occur.
From an advertiser perspective, obviously now the ability to go to the broadcast networks and have a data-driven buy has absolutely blown up and proliferated. That's with us. That's with some of our competitors. Obviously, we think we're the best at it, but neither here nor there. I think it's a really wonderful foundational approach for advertisers to take. I think it's a great advancement in the market.
As a spender of money, and as somebody who is trying to get people to engage with our brands, the ability to use data to really have, again, these really one-on-one, unique conversations, and to be able to deliver creative content that's relevant for individual consumers, that's driven by what we know about the consumer, now, ultimately, where we can reach them effectively and in environments where we know they're engaged, is really a great, tremendous advancement. You'll see by our ratings numbers, which are on the upswing, that approach has really had a direct impact on what our linear ratings have resulted in.
Speaker 2: Great. Tell us a little bit about Viacom. It's a lot of fans, a lot of passion in people. How do you define the audience in broad strokes? How do they respond to advertising and what are some of the concerns that consumers have around ads?
Julian Z.: Well, I think, again, when you're talking about how we're reaching fans, it is using intelligence, and information, and data, not only to profile who our fans are, but ultimately where they're best reached. Our job is to deliver great, compelling content, which we believe we're really, really good at.
In order to do that, there's the linear side of the equation, but of course we want to make sure that we're reaching our fans in digital as well, and that there's a 360 kind of fan experience. We believe holistically that our fans are really the base of what we're trying to do. We're trying to please and create value for our fans. The more we engage with them, and the more we know about them, the better we're able to deliver customized content that fits their need.
Ultimately, as a content creator, what's more exciting than to delivery really great content to people that they really, really engage with and they build relationships with? That's all you can really hope for is, somebody that creates content, is to be able to develop compelling content and content that your audience really wants to engage with.
Speaker 2: When you look at targeting, is that a cross-platform? Where does that targeting happen?
Julian Z.: It absolutely is cross-platform. Of course, there is natural addressability in the digital market, because it is much more of a one-to-one. But now you see a lot of the MVPDs have obviously opened up addressable inventory. A lot of the MVPDs now have matured their addressable footprint, which allows you now to have a digital-like, not exactly the same obviously, but a digital-like experience in the linear space, to deliver content to the consumer or advertising to the consumer when it's relevant and when it's going to have the most impact for your message.
Ultimately, it's absolutely cross-platform because addressability is all about having that conversation, having that direct one-to-one with your audience. Our partners on the MVPD side have really matured over the last several years as of regard to addressable, and now you can have that 360 experience of having a conversation in linear and in digital that really is addressable.
Example strings to be captured are: Julian Z. and Speaker 2. Names will vary from text to text. I need all/multiple names present. As you see, names may include a mixture of alpha case, punctuation characters and numbers.
I will want to deduplicate names, which are repeated in the text, but believe I should shelve that for now, focusing this question on the capture.
I have tried plenty, for the last day or two.
eg. ^[^:]+\s* with /g comes close, but only captures the first, single Julian Z., whereas I want everything. For now, I am out of ideas and need to learn how to do this.
Regex to match any characters up until the first colon:
/^.*?(?=:)/gm
https://regex101.com/r/3uyXMM/3
^: match from beginning of line
.: match anything
*?: non-greedy search, so it stops at first colon (see next line)
(?=:): positive lookahead meaning next character should be colon but it doesn't capture
g: don't return after first match, returns all matches
m: run regex for each line
You can use this regex based on a negated character class:
/^\w[^:\n]*/mg
RegEx Demo 1
RegEx Demo 2
RegEx Breakup:
^\w: Match a word character at the start
[^:\n]*: Match zero or more of any character that is not a colon and not a newline.
Code:
var names = inputData.transcript.match(/^\w[^:\n]*/mg) || [];

Regex lookbehind - excluding words from searches

I need to search my corpus for words such as game or shame but I would like to specify the search to exclude three strings a game/a shame or , A game/A shame and a/an/A/An WORD game or a/an/A/An WORD shame , where WORD is a modifier, e.g., a great game or a great shame.
If someone could help me out, that would be great, thanks!
In my corpus, the optional WORD between the indefinite article a/an and game or a/an and shame is most commonly great and real. So even excluding these two, would already help me a lot.
The lookbehind below works perfectly to exclude a/A
(?<!a\s|A\s)\bshame\b
To exclude the modifying WORD, I was trying to use ?\w in the lookbehind grep, but it just wouldn't work - the grep below without ? runs and it still excludes examples such as a shame, but it still returns the undesired examples such as a great shame or a crying shame - see concordance lines (3) and (4) in the sample text below:
(?<!a\s|A\s|a\b\w\b|A\b\w\b)\bshame\b
The tool I'm using to implement regex is AntConc, which supports Perl regular expressions.
Sample text with two irrelevant examples (3 & 4) after using the search string below
(?<!a\s|A\s)\bshame\b
1 (match shame)
, people ogling from the sidelines. If you want a closer look, you have to ring for entry and wait to be admitted. I guess me and Saul just have no shame (or just know the benefits of our bank accounts being in hard currencies), because we wandered into plenty. Lots and lots of little boutiques and edgily designed fashion stores with music blaring.& abbutterflie.txt 47 1
2 (match shame)
last twenty years and I've experienced all sorts of biggotry but I seriously thought that anti black nazism in football wass a thing of the past. You should all hang your heads in shame, bunch of [badword]s. adamdphillips.txt 57 1
3 (don't match shame)
me monetarily as I wasn't that close to her, but she was really good friends with the other girl and it's messed that up for them a bit, which is a great shame. Anyway, Holly and I have since found somewhere to move in just the two of us. It's going to cost an absolute fortune and I'm going to be eating basics beans on aderyn.txt 60 1
4 (don't match shame)
are loads of amazingly good bands out there, gigging up and down the country who will never get signed because no-one can figure out how to market them, and this is a crying shame. There are artists out there like Thea Gilmore and <a href="http://blog.amandapalmer.net/" rel="nofollow"> Amanda Palmer& aderyn.txt 60 2
5 (match shame)
/><br />"There is no better time to show these terrorists that we have no fear of them. Instead we are forced, through the cowardly acts of our superiors, to hide in shame."<br /><br />But Herb Wiseman, high school consultant for Lee County, Florida, pointed to the July 7 London bombings.<br /><br />"What happens if kids get on aggy91.txt 64 1
Because variable length negative lookbehinds are not allowed, the approach in your previous question's answer won't transfer to this one.
I've gone with a (*SKIP)(*FAIL) pattern. This will match and discard the disqualified matches, and only retain qualifying matches:
/[Aa]n?( \w+)? shame(*SKIP)(*FAIL)|shame/ 3844 steps (Demo)
Or if you wish to include word boundary metacharacters:
/\b[Aa]n?( \w+)? shame\b(*SKIP)(*FAIL)|\bshame\b/ 4762 steps (Demo)

stop short of multiple strings and characters using '^'

I'm doing a regex operation that to stop short of either character sets { or \t\t{.
the first is ok, but the second cannot be achieved using the ^ symbol the way I have been.
My current regex is [\t+]?{\d+}[^\{]*
As you can see, I've used ^ effectively with a single character, but I cannot apply it to a string of characters like \t\t\{
How can the current regex be applied to consider both of these possibilities?
Example text:
{1} The words of the blessing of Enoch, wherewith he blessed the elect and righteous, who will be living in the day of tribulation, when all the wicked and godless are to be removed. {2} And he took up his parable and said--Enoch a righteous man, whose eyes were opened by God, saw the vision of the Holy One in the heavens, which the angels showed me, and from them I heard everything, and from them I understood as I saw, but not for this generation, but for a remote one which is for to come. {3} Concerning the elect I said, and took up my parable concerning them:
The Holy Great One will come forth from His dwelling,
{4} And the eternal God will tread upon the earth, [even] on Mount Sinai,
And appear from His camp
And appear in the strength of His might from the heaven of heavens.
{5} And all shall be smitten with fear
And the Watchers shall quake,
And great fear and trembling shall seize them unto the ends of the earth.
{6} And the high mountains shall be shaken,
And the high hills shall be made low,
And shall melt like wax before the flame
When I do this as a multi-line extract, the indendantation does not maintain for the first line of each block. Ideally the extract should stop short of the \t\t{ allowing it to be picked up properly in the next extract, creating perfectly indented blocks. The reason for this is when they are taken from the database, the \t\t should be detected at the first line to allow dynamic formatting.
[\t+]?{\d+}[\s\S]*?(?=\s*{|$)
You can use this.See demo.
https://regex101.com/r/nNUHJ8/1

Replace ONLY first occurrence of word from a paragraph using RegEx

I have two paragraphs. I want to replace ONLY first occurrence of a specific word 'acetaminophen' by '{yootooltip :: It is a widely used over-the-counter analgesic (pain reliever) and antipyretic (fever reducer). Excessive use of paracetamol can damage multiple organs, especially the liver and kidney.}acetaminophen{/yootooltip}'
The paragraph is:
Percocet is a painkiller which is partly made from oxycodone and partly made from acetaminophen. It will usually be prescribed for a patient who is suffering from acute severe pain. Because it has oxycodone in it, this substance can create an addiction and is also a dangerous prescription drug to abuse. It is illegal to either sell or use Percocet that has not been prescribed by a licensed professional.
In 2008, drugs like Percocet (which have both oxycodone and acetaminophen as their main ingredients) were the prescription drugs most sold in all of Ontario. The records also show that the rates of death by oxycodone (this includes brand name like Percocet) doubled. That is why it is imperative that the people who are addicted go to an Ontario drug rehab center. Most of the drug rehabs can take care of Percocet addiction.
I am trying to write a regular expression for this. I have tried
\bacetaminophen\b
But it is replacing both occurrences.
Any help would be appreciated.
Thanks
Use the optional $limit parameter in PHP's preg_replace function http://us2.php.net/manual/en/function.preg-replace.php
$text = preg_replace('/acetaminophen/i', 'da-da daa', $text, 1);
will replace only the first occurance.
For the tool you're using, just use this:
(.*?)\bacetaminophen\b