Regex for matching text between two regex-patterns - regex

I am looking for a way to capture text and its paragraph title from a text document.
Text File:
paraTitle-1
--------
Lines and words
empty....
more lines
still part of paraTitle-1
paraTitle-2
--------
Lines and words
empty....
more lines
still part of paraTitle-2
I want to capture both the titles and the text below them.
array = [paraTitle-1: <text below paraTitle-1>,
paraTitle-2: <text below paraTitle-2>]
I made a few attempts with pattern (?<=(.*))\n----*\n(?=(.*)) to no avail. Any guidance would be awesome.

The following regex will do:
(?!--------\R)(.*)\R--------\R((?:\R?(?!.*\R--------\R).*)+)
See regex101.
The title separator line (--------) can also be specified as -{8}, which is easier to adjust to variable length if needed, e.g. instead of exactly 8 dashes, it could be 6 or more: -{6,}
Explanation:
Capture a line of text (paragraph title):
(.*)\R
The . doesn't match line break characters
\R matches line breaks, including the Windows CRLF pair. If your regex engine doesn't support \R, use \r?\n as a simple alternative.
Make sure the captured text is not the title separator line:
(?!--------\R)
Skip the mandatory title separator line:
--------\R
Capture the paragraph text, as a repeating group of lines:
((?:xxx)+)
A line has an optional leading line break (first line doesn't have one):
\R?.*
But make sure the line is not the title of the next paragraph, i.e. it's not a line followed by the title separator line.
(?!.*\R--------\R)
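As a rough illustration, the same pattern could be applied in Python along these lines. Python's re module does not support \R, so this sketch substitutes \r?\n as suggested above. The names SECTION and parse_sections are made up for the example, and the results are stripped so that a stray \r or trailing line break does not end up in the output.

```python
import re

# Hypothetical helper for illustration. \r?\n replaces \R because Python's re lacks \R.
SECTION = re.compile(
    r"(?!-{8}\r?\n)(.*)\r?\n"                    # title line, which must not be the separator
    r"-{8}\r?\n"                                 # mandatory separator line
    r"((?:(?:\r?\n)?(?!.*\r?\n-{8}\r?\n).*)+)"   # body lines, stopping before the next title
)


def parse_sections(text):
    """Return {title: body} for every 'title / -------- / body' block."""
    return {title.strip(): body.strip() for title, body in SECTION.findall(text)}


sample = "\n".join([
    "paraTitle-1",
    "--------",
    "Lines and words",
    "empty....",
    "more lines",
    "still part of paraTitle-1",
    "paraTitle-2",
    "--------",
    "Lines and words",
    "empty....",
    "more lines",
    "still part of paraTitle-2",
])

sections = parse_sections(sample)
for title, body in sections.items():
    print(title, "->", repr(body))
```

A dict is used here only because the question asks for title/text pairs; a list of tuples from findall would work just as well.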

Remove duplicate lines in Notepad++ [duplicate]

I use the following expression in Notepad++ to delete duplicate lines:
^(.*)(\r?\n\1)+$
The problems are:
It is only for single word lines, if there is space in a line it won't work.
It is only for consecutive duplicate lines.
Is there a solution (preferably regular expression or macro) to delete duplicate lines in a text that contains space, and that are nonconsecutive?
Since no one is interested, I will post what I think you need.
delete duplicate lines in a text that contains space, and that are nonconsecutive
I assume your text contains duplicate lines such as My Line One and some text and My Line Two and more text:
My Line One and some text
My Line One and some text
My Line Two and more text
My Line One and some text
My Line Two and more text
These duplicate lines are not all consecutive (only the first two).
So, you can remove duplicate lines by running this search and replace:
^(.+)\r?\n(?=[\s\S]*?^\1$)
Replace with empty string.
Regex note: in Notepad++, ^ and $ are treated as line start/end anchors by default, so ^(.+) matches and captures exactly one line. Then \r?\n matches the line break that follows it (LF or CRLF). The look-ahead (?=...) checks whether a line with the same contents appears anywhere later in the text: [\s\S]*? skips over any characters, and ^\1$ matches a whole line equal to the captured text (\1 is a backreference to the capture).
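For readers outside Notepad++, the same search and replace can be tried in Python. This is only a sketch: re.MULTILINE is needed there to give ^ and $ their line-anchor meaning, and the name DUPLICATE is made up for the example.

```python
import re

# Same pattern as above. re.MULTILINE makes ^ and $ match at line boundaries.
DUPLICATE = re.compile(r"^(.+)\r?\n(?=[\s\S]*?^\1$)", re.MULTILINE)

text = (
    "My Line One and some text\n"
    "My Line One and some text\n"
    "My Line Two and more text\n"
    "My Line One and some text\n"
    "My Line Two and more text\n"
)

# A line is removed only when an identical line appears later in the text.
deduped = DUPLICATE.sub("", text)
print(deduped)
```

Because a line is dropped only when a later copy exists, the last occurrence of each line is the one that survives, so the order of the remaining lines can differ from the order of first appearance.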

Match Regex to a new line that has no characters preceding it

example text
example text
I was wondering if there was a way to match the line break in the middle of these two bits of text.
I was using \n, but it would match both at the end of "example text" and on the blank line.
I am using this in a text-to-speech program called Voicedream to say out loud that it has progressed to a new line.
I suggest that you only match a newline that is preceded with another newline.
Use a positive lookbehind (?<=\n):
(?<=\n)\n
^^^^^^^
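A small Python sketch shows the difference between the two patterns. It assumes the two bits of text are separated by exactly one blank line; the variable names are made up for the example.

```python
import re

# Assumption: one blank line separates the two bits of text.
text = "example text\n\nexample text"

# Every line break, including the one that ends the first "example text".
all_newlines = [m.start() for m in re.finditer(r"\n", text)]

# Only a line break that directly follows another line break (the blank line).
blank_line_only = [m.start() for m in re.finditer(r"(?<=\n)\n", text)]

print(all_newlines)
print(blank_line_only)
```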

REGEX - How can I select/mark 3 words delimited by tabs on consecutive lines?

Happy New Year !
I have a problem. I don't know how to mark/select some words delimited by tabs on consecutive lines: Recent, Coments and Tags
Please see this screenshot:
I can easily use the | sign, like Recent|Comments|Tags, but this will select every occurrence of those words in the file, and I want only those 3 on those lines.
What I want is a regex to remove all text before those 3 words, and another regex to remove everything after them.
I tried something like ((?s)((^.*)^.*Recente.*$|^.*Coments.*$|^.*Tags.*^))(.*$) but it does not work well. I also have to be careful, because those words can be repeated in the text files, so I have to select/mark exactly those 3, on those 3 consecutive lines (which don't have any other words on them).
Since you mentioned in a comment that you want to do this in Notepad++ (a fact that should have been mentioned in the question text), and since the screenshot shows a single space after the first two words, you might try this regular expression:
.*\n([ \t]+Recente\s+Coments\s+Tags).*
It will select everything, but capture the 3 words including whitespace between them and whitespace preceding first word on same line.
If you then replace with $1, everything not in the capture group will be removed.
Actually, the spaces after the first two words don't matter to this regex.
Could you please try this in Perl:
perl -0777 -ne 'while(m/((\s|\t)+)Recent\n\1Comments\n\1Tags/g){print "$&\n";}' /path/to/file
To break it down:
Start with 1 or more whitespace characters (first capture group, \1)
Then "Recent" followed by a new line
The same whitespace (\1), "Comments" and a new line
The same whitespace (\1), "Tags"
By the way, is "tab" really a tab or multiple consecutive whitespace characters (\s+)?
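The backreference idea from the Perl one-liner can be sketched in Python as well. Two assumptions here are not from the answers above: the indentation is limited to spaces and tabs ([ \t]+ instead of \s, so the capture cannot contain a line break), and the sample text is invented. BLOCK and kept are made-up names.

```python
import re

# Assumption: the indentation is spaces or tabs and is identical on all three lines.
BLOCK = re.compile(r"([ \t]+)Recent\r?\n\1Comments\r?\n\1Tags")

# Invented sample: the words also occur elsewhere, but only once as an indented block.
text = (
    "Recent posts are listed below\n"
    "\tRecent\n"
    "\tComments\n"
    "\tTags\n"
    "Tags also appear here\n"
)

match = BLOCK.search(text)
# Keeping only the match drops everything before and after the three lines.
kept = match.group(0) if match else ""
print(repr(kept))
```

Requiring the same captured indentation on each line is what keeps the pattern from matching the other occurrences of the words.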

Regex to replace multiple blank lines

Why does the following pattern not match only two or more consecutive blank lines?
(Including regex flag : Multiline)
/(^\s*$){2,}/m
Using Regex101, I see that it matches (for example) the first single blank line of example below (note, I did use ALT-255 for the first character in the block quote below just to represent a starting blank line, remove it if you copy the example text):
 
some text after the first blank line
more text
// comment after a space
// comment after 2 blank lines
text
// comment
// comment
How can I tweak this to match 2 or more blank lines only?
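The regex you should be using is ^\n{1,}$ (with the multiline flag).
This will look for 2 or more consecutive blank lines: after ^ and at least one \n, the $ can only match at the end of another empty line or at the end of the text.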
Regex101 Demo
^(\n{2,})
Here is the working DEMO
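A likely reason the original pattern misfires is that ^\s*$ can succeed without consuming any characters, so {2,} can be satisfied by repeating it at the same position. The Python sketch below uses a different pattern from the ones above, chosen so that every repetition must consume one whole blank line including its line break. BLANK_RUN and the sample text are made up for the example.

```python
import re

# A swapped-in pattern (not from the answers above): each repetition consumes
# one blank line, optional spaces or tabs included, plus its line break.
BLANK_RUN = re.compile(r"(?:^[ \t]*\r?\n){2,}", re.MULTILINE)

text = "a\n\nb\n\n\nc\n"   # one blank line after "a", two blank lines after "b"

runs = BLANK_RUN.findall(text)          # only the run of two blank lines
collapsed = BLANK_RUN.sub("\n", text)   # collapse each such run to one blank line

print(runs)
print(repr(collapsed))
```

The single blank line after "a" is left alone, because the group cannot match twice there.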

Notepad++ Regex Issue - Remove Number in Line Replace with HTML

I'm a regex newbie, so this has been a lot of trial and error, but for some reason I can only get this to work sometimes and I'm not sure why. Let me lay out what I'm doing. I have a text file that looks like this:
1.Some Text Here
A paragraph of words here.
2.Some More Text Here
A paragraph of words here.
I use this code to find the lines with a number at the beginning:
^[0-9]+.([^.]*)$
Then I replace it with this:
<h2>$1</h2>\r\r
The problem I'm running into is that it usually grabs only the line starting with the number, but sometimes it grabs that line and the paragraph below it. So instead of putting the </h2> at the end of the line, it puts it at the end of the paragraph below.
I displayed all symbols to see if it had something to do with carriage/line returns but everything looks identical from line to line. The paragraph is on its own line and I see CRLF at the end of each line.
The expression [^.] (ie not a literal dot) matches newlines.
Don't match newlines in your capture:
^[0-9]+\.([^.\r\n]*)
Note that I also escaped the dot following the numbers, making it match a literal dot (a naked dot matches any character).
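The difference between the two patterns can be reproduced outside Notepad++. The Python sketch below is an approximation under stated assumptions: re.MULTILINE stands in for Notepad++'s line anchors, the replacement uses Python's \1 instead of $1, and the sample text has a first paragraph without a trailing period so that the problem shows up.

```python
import re

text = (
    "1.Some Text Here\r\n"
    "A paragraph of words here\r\n"      # no period at the end of this paragraph
    "2.Some More Text Here\r\n"
    "A paragraph of words here.\r\n"
)

# Original pattern: [^.]* also matches \r and \n, so the capture can span lines.
original = re.search(r"^[0-9]+.([^.]*)$", text, re.MULTILINE)
print(repr(original.group(1)))

# Corrected pattern: the dot is escaped and line breaks are excluded from the capture.
fixed = re.sub(r"^[0-9]+\.([^.\r\n]*)", r"<h2>\1</h2>", text, flags=re.MULTILINE)
print(fixed)
```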
Use \2 instead of $2 and check "Wrap around". Tested on Notepad++ 5.9.3 (UNICODE).
Not sure what version of Notepad++ you're using, but your version of the regex works fine for the example that you have. I use 6.7.9.2.
I can reproduce with the following text. Notice the paragraph for line 1 doesn't end in a period.
1.Some Text Here[CR][LF]
A paragraph of words here[CR][LF]
2.Some Text Here[CR][LF]
A paragraph of words here.[CR][LF]
Your regex matches a run of text that begins with a set of digits and contains no further period, so it can span more than one line. I would recommend this regex instead:
^[0-9]+\.([^\r\n]*)\r\n