Notepad++ remove linebreak in between two specific strings - regex

I have something like this
\text
This is a sentence.
This is another sentence.
\endtext
I want to remove the line break between the two sentences in every \text ... \endtext block, so that it looks like this:
\text
This is a sentence. This is another sentence.
\endtext
But of course where it gets complicated is that there are also line breaks after \text and before \endtext. These I want to keep. So, in plain English, what I want to do is something like
(after "\text\n") (remove all instances of \n) (before "\n\endtext")
but since I'm not very good at regex, I'm not sure how that would be written out in that language. Could someone help?

Notepad++ uses the Boost regex engine, which supports the PCRE-style constructs used here (\K, \G, \R).
You can use this regex to search:
(?:\\text\R|(?<!\A)\G).*\K\R(?!\\endtext)
Replace this with a space.
RegEx Breakdown:
(?:: Start non-capture group
\\text: Match \text
\R: Match a line break
|: OR
(?<!\A)\G: Start from the end of the previous match. (?<!\A) makes sure that we are not at the start position of the input
): End non-capture group
.*: Match 0 or more any character except line break
\K: Reset match info
\R: Match a line break
(?!\\endtext): Negative lookahead to make sure that \endtext does not come right after the current position
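For anyone doing the same cleanup outside Notepad++, here is a minimal Python sketch. Python's built-in re module does not support \G, \K, or \R, so this swaps in a different technique: find the body of each \text ... \endtext block and join its lines in a callback. The name join_text_blocks and the sample string are made up for illustration, and the sketch assumes plain \n line endings and non-empty blocks.

```python
import re

# Body of a block: everything after "\text\n" and before "\n\endtext".
# The lookarounds keep the marker lines (and their line breaks) untouched.
BLOCK_BODY = re.compile(r"(?<=\\text\n)(.*?)(?=\n\\endtext)", re.DOTALL)


def join_text_blocks(source: str) -> str:
    """Replace line breaks inside each block body with a single space."""
    return BLOCK_BODY.sub(lambda m: m.group(1).replace("\n", " "), source)


if __name__ == "__main__":
    sample = "\\text\nThis is a sentence.\nThis is another sentence.\n\\endtext\n"
    print(join_text_blocks(sample))
```

Text outside the markers is left alone, because only the captured body is rewritten.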

Related

Regex - how do I match this?

I've been trying hard to get this Regex to work, but am simply not good enough at this stuff apparently :(
Regex - Trying to extract sources
I thought this would work... I'm trying to get all of the content where:
It starts with ds://
Ends with either carriage return or line feed
That's it! Essentially I'm going to then do a negative lookahead such that I can remove all content that is NOT conforming to above (in Notepad++) which allows for Regex search/replace.
Search for lines that contain the pattern, and mark them
Search menu > Mark
Find what: ds://.*\R
check Regular expression
Check Mark the lines
Find all
Remove the non marked lines
Search menu > Bookmark
Remove unmarked lines
You don't need to add the \w specifier to look for a word after the ds:// in the look ahead. Removing that and altering the final specification from "zero or one carriage return, then zero or one newline" to "either a carriage return or a newline" in capture group should do it for you:
(?=ds:\/\/).*(?:\r|\n)
Update: Carriage return or Line feed group does not need to be captured.
Update 2: The following regex will actually work for your proposed use case in the comments, matching everything but the pattern you described in the question.
^(?:(?!ds:\/\/.*(?:\r|\n)).)*$
Your regex (?=ds:\w+).*\r?\n? does not match because the content contains ds:// and \w does not match a forward slash. To make your regex work you could change it to:
(?=ds://\w+).*\r?\n? which can be shortened to ds://.*\R?
Note that you don't have to escape the forward slash.
If you want to do a find and replace to keep the lines that contain ds:// you could use a negative lookahead:
Find what
^(?!.*ds://).*\R?
Replace with
Leave empty
Explanation
^ Start of the string
(?!.*ds://) Negative lookahead to assert the string does not contain ds://
.* Match any character 0+ times
\R? An optional unicode newline sequence to also match the last line if it is not followed by a newline
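The same find-and-replace can be sketched in Python for use outside Notepad++. This is only an illustration: Python's re module has no \R, so the sketch uses \n? instead and therefore assumes \n line endings. The function name drop_non_ds_lines is hypothetical.

```python
import re

# Same idea as ^(?!.*ds://).*\R? but with \n? because Python's re lacks \R.
# re.MULTILINE makes ^ match at the start of every line.
NON_DS_LINE = re.compile(r"^(?!.*ds://).*\n?", re.MULTILINE)


def drop_non_ds_lines(text: str) -> str:
    """Delete every line that does not contain ds://, keeping the rest as-is."""
    return NON_DS_LINE.sub("", text)


if __name__ == "__main__":
    sample = "header\nds://first/source\nnoise\nprefix ds://second/source\n"
    print(drop_non_ds_lines(sample))
```

Lines are kept whole, so a line with text before ds:// survives unchanged.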
Here you go, Andrew:
Regex: ds:\/\/.*
Link: https://regex101.com/r/ulO9GO/2
Let me know if you have any questions.

How to replace within a capture group

I am modifying an existing HTML doc. I'm doing things like adding a table of contents etc.
I have a heading with this ID: id="transcending intellectual limitations" (for real!)
I want to be able to find the whole ID, and then replace the spaces with hyphens.
It would be simple if I had just the IDs but I don't want to remove all the spaces in the whole document.
I'm reasonably new to regex, I'm using Sublime's find and replace to do this.
You can use
(?:\bid="|(?!^)\G)[^\s"]*\K\s+ 
And replace with anything you need to replace spaces with.
The (?:\bid="|(?!^)\G) pattern sets the initial boundary: either id=" or the end of the last successful match. This pattern presents an alternation list with two alternatives. \b matches a word boundary so that id=" is matched as a whole word. The \G operator matches at the start of the string and after each successful match. To exclude the start position, a negative (?!^) lookahead is added (not followed by a string start position).
See more about \G in "Where You Left Off: The \G Assertion".
The [^\s"]* matches zero or more characters other than whitespace and a quote.
The \K operator makes the regex engine omit all the text matched so far from the match buffer.
The \s+ finally matches one or more whitespaces that will be replaced.
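If the document is being processed in a script rather than in an editor, the same result can be sketched in Python. Python's re module does not support \G or \K, so this uses a different technique: match each id="..." attribute as a whole and rewrite its value in a callback. The function name hyphenate_ids is an assumption made for the example.

```python
import re

# Capture the value of every id="..." attribute; [^"]* stops at the closing quote.
ID_ATTR = re.compile(r'\bid="([^"]*)"')


def hyphenate_ids(html: str) -> str:
    """Replace runs of whitespace with a hyphen, but only inside id values."""
    def fix(match: re.Match) -> str:
        return 'id="' + re.sub(r"\s+", "-", match.group(1)) + '"'
    return ID_ATTR.sub(fix, html)


if __name__ == "__main__":
    print(hyphenate_ids('<h2 id="transcending intellectual limitations">A heading</h2>'))
```

Spaces elsewhere in the document, including the heading text itself, are not touched.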
Here's a 2 pass solution using Ruby as the regex parser:
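#!/usr/bin/env ruby
line = 'yadayadayadaid="transcending intellectual limitations"yadayadayada'
# Pass 1: capture the id value; [^"]* stops at the closing quote
line =~ /id="([^"]*)"/
# Pass 2: replace each run of whitespace in the captured value with a hyphen
part = $1.gsub(/\s+/, '-')
print part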
yields:
transcending-intellectual-limitations
Note that this will replace all whitespace between the words on the 2nd pass.

What regular expression will select all lines that have more than one punctuation mark?

I have this regular expression:
\..*?\.
But it only selects between two periods, not every punctuation mark, and it also selects across multiple lines.
Would modifying this expression to only take in one line at a time work somehow, if there's also a way to group punctuation into where we have a period?
Just to make things simpler, at this time I only need the expression to recognize periods, exclamation points, and question marks. I don't need it to register commas.
Thanks to Nathan and Agumander below, I know to substitute [.!?] in place of \. now, but I'm still having trouble with the other half of my question.
Just to make sure I'm being more clear, using [.!?].*?[.!?]\s will highlight text between punctuation marks, but across multiple lines. So I can't use it to bookmark only the lines that have multiple punctuation marks.
Placing characters inside a pair of square brackets will match to any of the enclosed characters. In your case you'd want [.?!]
If you want to match any sentence that has two of these, then you'll be looking for a pair of [.!?] separated by zero or more of any character.
The regex that matches strings with more than one of the set [.?!] would then be [.!?].*[.!?]
To make . match newlines, you'd add the s modifier to your regex.
...so the full regex would be /[.!?].*[.!?]/s
Ok I figured it out. Thanks to Agumander and Nathan above I substituted [.!?] in for the two \. in my original regex:
\..*?\. became [.!?].*[.!?]
Putting \s at the end of the regex made it select the entire document in Notepad++.
The last issue I had was remembering to turn off "matches newline."
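The effect of the "matches newline" option can be reproduced in Python, where the equivalent switch is the re.DOTALL flag. This is just a small illustration with made-up sample text: without the flag the pair of punctuation marks has to be on one line, with it the match runs across lines.

```python
import re

# Two punctuation marks with anything in between.
PAIR = r"[.!?].*[.!?]"

# Each line below has at most one punctuation mark.
text = "One mark here.\nNo marks on this line\nAnother mark here!\n"

# "." does not cross line breaks by default, so nothing matches.
single_line = re.search(PAIR, text)

# With DOTALL (like ". matches newline"), the match spans three lines.
across_lines = re.search(PAIR, text, re.DOTALL)

if __name__ == "__main__":
    print(single_line)
    print(repr(across_lines.group(0)))
```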
Agumander, I think you're asking for a regex that basically finds multiple punctuation marks on a single line. So here's one way to do it.
Here's the text I'm going to match. The regex will match the first line in its entirety, but will not match the second.
Here's a line with multiple punctuation. The entire line will match the regex!
This line does not have multiple punctuation.
Regex
^.*(?:[\.?!].*){2,}$
Explanation
^ -- Start matching at the beginning of a line
.* -- match any character 0 or more times
(?: -- start a new non-capturing group
[.?!] -- find a character matching a period, question mark, or exclamation point.
.* -- match any character 0 or more times
)
{2,} -- repeat the previous group 2 or more times. This is how we ensure there's at least two punctuation marks before considering it a match.
$ -- end of line anchor, basically stop matching at the end of a line
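As a quick check of the breakdown above, here is a short Python sketch that runs the same pattern over the two sample lines. The re.MULTILINE flag is needed so that ^ and $ anchor at each line rather than at the whole string; the function name lines_with_multiple_marks is made up for the example.

```python
import re

# Whole-line match requiring at least two of . ? ! on the same line.
MULTI_PUNCT = re.compile(r"^.*(?:[.?!].*){2,}$", re.MULTILINE)


def lines_with_multiple_marks(text: str) -> list[str]:
    """Return every line that contains two or more punctuation marks."""
    return MULTI_PUNCT.findall(text)


if __name__ == "__main__":
    sample = (
        "Here's a line with multiple punctuation. The entire line will match the regex!\n"
        "This line does not have multiple punctuation.\n"
    )
    for line in lines_with_multiple_marks(sample):
        print(line)
```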

Ignore specific lines when matching with a regex

I'm trying to make a regex that matches a specific pattern, but I want to ignore lines starting with a #. How do I do it?
Let's say I have the pattern (?i)(^|\W)[a-z]($|\W)
It matches all lines with a single occurrence of a letter. It matches these lines for instance:
asdf e asdf
j
kke o
Now I want to override this so that it does not match lines starting with a #
EDIT:
I was not specific enough. My real pattern is more complicated. It looks a bit like this: (?i)(^|\W)([a-hj-z]|lala|bwaaa|foo)($|\W)
I want to use it to block offensive language, except when a line starts with a hash, in which case the line should be let through.
This is what you are looking for
^(?!#).+$
^ marks the beginning of line and $ marks the end of line(in multiline mode)
.+ would match 1 to many characters
(?!#) is a lookahead which would match further only if the line doesn't start with #
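To combine the lookahead with the pattern from the question, one option is to put (?!#) right after ^ and let the rest of the pattern search further into the line. The Python sketch below is one possible way to do that, not the only one; the combined pattern and the name matching_lines are assumptions made for illustration.

```python
import re

# ^(?!#)       the line must not start with #
# (?:.*\W)?    optionally skip ahead to just after a non-word character
# [a-z]        a single letter...
# (?:$|\W)     ...followed by the end of the line or a non-word character
SINGLE_LETTER = re.compile(r"^(?!#)(?:.*\W)?[a-z](?:$|\W)", re.IGNORECASE)


def matching_lines(text: str) -> list[str]:
    """Return lines with a stand-alone letter, skipping lines that start with #."""
    return [line for line in text.splitlines() if SINGLE_LETTER.search(line)]


if __name__ == "__main__":
    sample = "asdf e asdf\nj\nkke o\n# e is fine here\nhello\n"
    print(matching_lines(sample))
```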
This regex will match any word character \w not preceded by a #:
^(?<!#)\w+$
It performs a negative lookbehind at the start of the string and then follows it with 1 or more word characters.

Regular expression to match string starting with a specific word

How do I create a regular expression to match a word at the beginning of a string?
We are looking to match stop at the beginning of a string and anything can follow it.
For example, the expression should match:
stop
stop random
stopping
If you wish to match only lines beginning with stop, use
^stop
If you wish to match lines beginning with the word stop followed by a space:
^stop\s
Or, if you wish to match lines beginning with the word stop, but followed by either a space or any other non-word character you can use (your regex flavor permitting)
^stop\W
On the other hand, what follows matches a word at the beginning of a string on most regex flavors (in these flavors \w matches the opposite of \W)
^\w+
If your flavor does not have the \w shortcut, you can use
^[a-zA-Z0-9]+
Be wary that this second idiom will only match letters and numbers, no symbol whatsoever.
Check your regex flavor manual to know what shortcuts are allowed and what exactly they match (and how they deal with Unicode).
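The difference between these variants is easiest to see side by side. Here is a small Python sketch that runs three of them against a handful of made-up sample strings; the helper name matches is hypothetical.

```python
import re

samples = ["stop", "stop random", "stopping", "stop!", "unstoppable"]


def matches(pattern: str) -> list[str]:
    """Return the samples in which the pattern is found."""
    return [s for s in samples if re.search(pattern, s)]


# Starts with the letters "stop", whatever follows.
starts_with_stop = matches(r"^stop")

# "stop" followed by a whitespace character.
stop_then_space = matches(r"^stop\s")

# "stop" followed by any non-word character (space, punctuation, ...).
stop_then_nonword = matches(r"^stop\W")

if __name__ == "__main__":
    print(starts_with_stop)
    print(stop_then_space)
    print(stop_then_nonword)
```

Note that neither ^stop\s nor ^stop\W matches a string that is exactly stop, because both require one more character after the word.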
Try this:
/^stop.*$/
Explanation:
/ characters delimit the regular expression (i.e. they are not part of the Regex per se)
^ means match at the beginning of the line
. followed by * means match any character (.), any number of times (*)
$ means to the end of the line
If you would like to enforce that stop be followed by a whitespace, you could modify the RegEx like so:
/^stop\s+.*$/
\s means any whitespace character
+ following the \s means there has to be at least one whitespace character following after the stop word
Note: Also keep in mind that the RegEx above requires that the stop word be followed by a space! So it wouldn't match a line that only contains: stop
If you want to match anything after a word, stop, and not only at the start of the line, you may use: \bstop.*\b - word followed by a line.
Or if you want to match the word in the string, use \bstop[a-zA-Z]* - only the words starting with stop.
Or the start of lines with stop - ^stop[a-zA-Z]* for the word only - first word only.
The whole line ^stop.* - first line of the string only.
And if you want to match every string starting with stop, including newlines, use: /^stop.*/s - multiline string starting with stop.
As @SharadHolani said, this won't match every word beginning with "stop", only one at the beginning of a line, like "stop going".
@Waxo gave the right answer:
This one is slightly better, if you want to match any word beginning with "stop" and containing nothing but letters from A to Z.
\bstop[a-zA-Z]*\b
This would match all
stop (1)
stop random (2)
stopping (3)
want to stop (4)
please stop (5)
But
/^stop[a-zA-Z]*/
would only match (1) through (3), but not (4) and (5)
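A short Python sketch makes the comparison concrete. It runs both patterns over the five example strings listed above; the variable names are made up for illustration.

```python
import re

samples = ["stop", "stop random", "stopping", "want to stop", "please stop"]

# A word beginning with "stop" anywhere in the string.
ANYWHERE = re.compile(r"\bstop[a-zA-Z]*\b")

# A word beginning with "stop" only at the very start of the string.
AT_START = re.compile(r"^stop[a-zA-Z]*")

found_anywhere = [s for s in samples if ANYWHERE.search(s)]
found_at_start = [s for s in samples if AT_START.search(s)]

if __name__ == "__main__":
    print(found_anywhere)
    print(found_at_start)
```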
If you want to match anything that starts with "stop" including "stop going", "stop" and "stopping" use:
^stop
If you want to match the word stop followed by anything as in "stop going", "stop this", but not "stopped" and not "stopping" use:
^stop\W
/stop[a-zA-Z]*/
Will match any stop word (stop, stopped, stopping, etc). The * is needed so that a bare "stop" with no letters after it also matches.
However, if you just want to match "stop" at the start of a string
/^stop/
will do :D
If you want the word to start with "stop", you can use the following pattern.
"^stop.*"
This will match words starting with stop followed by anything.
/^stop.*$/i
i - makes the match case-insensitive.
I'd advise against a simple regular expression approach to this problem. There are too many words that are substrings of other unrelated words, and you'll probably drive yourself crazy trying to overadapt the simpler solutions already provided.
You'll want at least a naive stemming algorithm (try the Porter stemmer; there's available, free code in most languages) to process text first. Keep this processed text and the preprocessed text in two separate space-split arrays. Make sure each non-alphabetical character also gets its own index in this array. Whatever list of words you're filtering, stem them also.
The next step would be to find the array indices which match to your list of stemmed 'stop' words. Remove those from the unprocessed array, and then rejoin on spaces.
This is only slightly more complicated, but will be much more reliable an approach. If you've got any doubts on the value of a more NLP-oriented approach, you might want to do some research into clbuttic mistakes.
can you try this:
https://regex101.com/r/P3qfKG/1
reg = /stop(\w+| [^ ]+|$)/gm
It will match stop on its own at the end of a line, words that start with stop, and stop followed by the next word.