Remove duplicate lines in Notepad++ [duplicate] - regex

I use the following expression in Notepad++ to delete duplicate lines:
^(.*)(\r?\n\1)+$
The problems are:
It only works for single-word lines; if a line contains a space, it won't work.
It is only for consecutive duplicate lines.
Is there a solution (preferably a regular expression or macro) to delete duplicate lines that contain spaces and that are nonconsecutive?

Since no one is interested, I will post what I think you need.
delete duplicate lines in a text that contains space, and that are nonconsecutive
I assume your text has duplicate lines, say "My Line One and some text" and "My Line Two and more text":
My Line One and some text
My Line One and some text
My Line Two and more text
My Line One and some text
My Line Two and more text
These duplicate lines are not all consecutive (only the first two).
So, you can remove duplicate lines by running this search and replace:
^(.+)\r?\n(?=[\s\S]*?^\1$)
Replace with empty string.
Regex note: ^ and $ are treated as line start/end anchors by default, so ^(.+) matches and captures exactly one non-empty line. Then we match the line break (LF or CRLF) with \r?\n. The look-ahead (?=...) checks whether a line with the same contents (^\1$, where \1 is a backreference to the captured line text) occurs anywhere later in the text (the text in between is skipped with [\s\S]*?). Because only lines that recur later are deleted, the last occurrence of each duplicate line is the one that remains.
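For anyone who wants to try the pattern outside Notepad++, here is a minimal sketch using Python's re module. The function name remove_earlier_duplicates is made up for illustration, and the sample uses LF line endings. Python does not treat ^ and $ as line anchors by default, so the sketch passes re.MULTILINE.

```python
import re

# Same pattern as in the answer; re.MULTILINE makes ^ and $ match at
# line boundaries, which Notepad++ does by default.
DUPLICATE_LINE = re.compile(r"^(.+)\r?\n(?=[\s\S]*?^\1$)", re.MULTILINE)


def remove_earlier_duplicates(text: str) -> str:
    """Delete every line that occurs again later in the text."""
    return DUPLICATE_LINE.sub("", text)


sample = "\n".join([
    "My Line One and some text",
    "My Line One and some text",
    "My Line Two and more text",
    "My Line One and some text",
    "My Line Two and more text",
])

print(remove_earlier_duplicates(sample))
```

Since the last copy of each line is the one that survives, the relative order of the remaining lines can differ from the order of first appearance.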

Related

Matching lines containing Unicode line break chars with a dot pattern in Notepad++ regex

I'm using the following Regex to search for a string in each line of a document. Every line is encapsulated with þ.
^þ.*(SEARCHSTRING).*þ$
But I came across a discrepancy in my count. Running the regex over the below two example lines of data will only get one hit when I'd like to capture both. This is because of the Line Separator Character. My regex believes this to be a new line when in fact it is simply a new line indicator. Is there any way around this?
þ
SEARCHSTRINGþ
þ#SEARCHSTRINGþ
In Notepad++, . matches any char that is not a Unicode line break char.
If you need to match a line that is a chunk of chars other than LF and CR, use
^þ[^\r\n]*(SEARCHSTRING)[^\r\n]*þ$
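A small Python sketch of the corrected pattern follows. It assumes the character in the first record is U+2028 (LINE SEPARATOR), which the question does not state explicitly. One caveat: in Python's re module, . excludes only \n, so the miscount described in the question does not reproduce there. The sketch only checks that the negated character class counts both records.

```python
import re

# Assumption: the "Line Separator Character" in the question is U+2028.
LINE_SEPARATOR = "\u2028"

# Two records, each wrapped in þ; the first one contains the separator.
records = "þ" + LINE_SEPARATOR + "SEARCHSTRINGþ\nþ#SEARCHSTRINGþ"

# [^\r\n] matches anything except CR and LF, so U+2028 is allowed
# inside a record.
pattern = re.compile(r"^þ[^\r\n]*(SEARCHSTRING)[^\r\n]*þ$", re.MULTILINE)

hits = pattern.findall(records)
print(len(hits))
```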

Regex for matching text between two regex patterns

I am looking for a way to capture text and its paragraph title from a text document.
Text File:
paraTitle-1
--------
Lines and words
empty....
more lines
still part of paraTitle-1
paraTitle-2
--------
Lines and words
empty....
more lines
still part of paraTitle-2
I want to capture both the titles and the text below them.
array = [paraTitle-1: <text below paraTitle-1>,
paraTitle-2: <text below paraTitle-2>]
I made a few attempts with pattern (?<=(.*))\n----*\n(?=(.*)) to no avail. Any guidance would be awesome.
The following regex will do:
(?!--------\R)(.*)\R--------\R((?:\R?(?!.*\R--------\R).*)+)
See regex101.
The title separator line (--------) can also be specified as -{8}, which is easier to adjust to variable length if needed, e.g. instead of exactly 8 dashes, it could be 6 or more: -{6,}
Explanation:
Capture a line of text (paragraph title):
(.*)\R
The . doesn't match line break characters
\R matches line breaks, including the Windows CRLF pair. If your regex engine doesn't support \R, use \r?\n as a simple alternative.
Make sure the captured text is not the title separator line:
(?!--------\R)
Skip the mandatory title separator line:
--------\R
Capture the paragraph text, as a repeating group of lines:
((?:xxx)+)
A line has an optional leading line break (first line doesn't have one):
\R?.*
But make sure the line is not the title of the next paragraph, i.e. it's not a line followed by the title separator line.
(?!.*\R--------\R)
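The pieces above can be put together in Python as a rough sketch. Python's re module does not support \R, so this version uses a plain \n and assumes the text has LF line endings (normalize CRLF first if needed). The function name parse_paragraphs is invented for this example.

```python
import re

# The answer's regex with \R replaced by \n (Python's re has no \R).
PARAGRAPH = re.compile(
    r"(?!--------\n)(.*)\n"                # title line, not the separator itself
    r"--------\n"                          # mandatory separator line
    r"((?:\n?(?!.*\n--------\n).*)+)"      # body lines up to the next title
)


def parse_paragraphs(text: str) -> dict:
    """Map each paragraph title to the text below it."""
    return dict(PARAGRAPH.findall(text))


sample = "\n".join([
    "paraTitle-1",
    "--------",
    "Lines and words",
    "empty....",
    "more lines",
    "still part of paraTitle-1",
    "paraTitle-2",
    "--------",
    "Lines and words",
    "empty....",
    "more lines",
    "still part of paraTitle-2",
])

for title, body in parse_paragraphs(sample).items():
    print(title, "->", body.splitlines())
```

A dict is used here because the question asks for a title-to-text mapping; if titles can repeat, keeping the list returned by findall avoids overwriting earlier entries.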

finding pattern between two "CERRADO}" strings using negative look-ahead

I have a text file containing lines like these:
CERRADO}165856}TICKET}DESCRIPTION}some random text here\r\n
other random text here}158277747\r\n
CERRADO}165856}TICKET}FR2CODE}more random text also here}1587269339\r\n
My ultimate goal is to concatenate the lines that do not begin with the string "CERRADO}" with their preceding line. There might be an arbitrary number of lines in the file that do not begin with that string. This is the desired end result:
CERRADO}165856}TICKET}DESCRIPTION}some random text here other random text here}158277747\r\n
CERRADO}165856}TICKET}FR2CODE}more random text also here}1587269339\r\n
My first attempt was to create a simple regex to match those lines. Once that regex is right, I plan to add a capturing group and use a replacement to get rid of the \r\n sequences. Here is what I have so far:
CERRADO\}.+\r\n(?!CERRADO\})(.+\r\n)+
However, this regex matches all the lines in the file and not just the wanted ones.
Any ideas would be appreciated.
You may use
\R(?!CERRADO\})
and replace with a space.
The regex matches:
\R - a line break sequence that is...
(?!CERRADO\}) - not followed with CERRADO}.
Or,
^(CERRADO\}.*)\R(?!CERRADO\})
and replace with \1 followed by a space. This regex matches:
^ - start of a line
(CERRADO\}.*) - Capturing group 1 (later referred to with \1 backreference from the replacement pattern): CERRADO} substring and then the rest of the line
\R - a line break sequence
(?!CERRADO\}) - not followed with CERRADO}.
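Each pass of this second regex joins only one continuation line per CERRADO} line, because the search resumes inside the line that was just appended. If a record spans several continuation lines, you will need to hit Replace All several times.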
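Here is a hedged Python sketch of the first approach. Python's re has no \R, so \r?\n stands in for it. The sketch also adds |\Z inside the look-ahead, which is not part of the answer: without it, the line break at the very end of the file is also "not followed by CERRADO}" and would be turned into a space. The function name join_continuation_lines is made up for illustration.

```python
import re

# \r?\n replaces \R; the extra |\Z keeps the file's final line break
# (an adjustment for this sketch, not part of the original answer).
BREAK_BEFORE_CONTINUATION = re.compile(r"\r?\n(?!CERRADO\}|\Z)")


def join_continuation_lines(text: str) -> str:
    """Join every line that does not start with CERRADO} onto the previous one."""
    return BREAK_BEFORE_CONTINUATION.sub(" ", text)


sample = (
    "CERRADO}165856}TICKET}DESCRIPTION}some random text here\r\n"
    "other random text here}158277747\r\n"
    "CERRADO}165856}TICKET}FR2CODE}more random text also here}1587269339\r\n"
)

print(join_continuation_lines(sample))
```

Because only the line break is matched, a single pass handles any number of continuation lines.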

Find/replace character+line feed

How to search for character+line feed with regex?
For example to turn this:
line one
line two
line (three)
line four
line five
into this:
line one
line two
line (three)=line four
line five
i.e. to search for ) followed by \n and, only in lines ending with ), replace the \n with something else.
Search for \)\r?\n, replace with \)=.
You need to escape special regex characters (like brackets) when using them as literal portions of your pattern. Here is a good read on that: http://www.regular-expressions.info/characters.html
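The same search and replace can be tried in Python as a quick sketch. The function name join_after_paren is hypothetical. Note one difference from the editor syntax above: in a Python replacement string a closing parenthesis has no special meaning, so it is written without a backslash.

```python
import re


def join_after_paren(text: str) -> str:
    """Replace a line break that directly follows ')' with '='."""
    # \) in the pattern is a literal parenthesis; \r?\n covers LF and CRLF.
    return re.sub(r"\)\r?\n", ")=", text)


sample = "line one\nline two\nline (three)\nline four\nline five"

print(join_after_paren(sample))
```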

Regular expression to modify a line that contains only a single word at the start

I have a text file in which any line that starts with a single word and has no other characters after that should be enclosed inside caret characters.
For example, a line that contains only the following 6 characters (plus the newline):
France
should be replaced with a line that consists of only the following 8 characters (plus the newline):
^France^
Is there a Regular Expression I could use in the Find/Replace feature of my text editor (Jedit) to make these modifications to the file?
Regex to find lines with a single word:
^(\w+)$
replace with:
^$1^
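To check the behavior outside jEdit, here is a minimal Python sketch of the same find and replace. The function name wrap_single_word_lines is invented for this example. Python writes the group reference as \g<1> (or \1) rather than $1, and needs re.MULTILINE so that ^ and $ apply to each line. Keep in mind that \w also matches digits and underscores.

```python
import re


def wrap_single_word_lines(text: str) -> str:
    """Enclose lines consisting of a single word in caret characters."""
    # In the replacement string ^ is a literal caret; \g<1> is the word.
    return re.sub(r"^(\w+)$", r"^\g<1>^", text, flags=re.MULTILINE)


sample = "France\nUnited States\nSpain\n"

print(wrap_single_word_lines(sample))
```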