Remove duplicate lines based on a search in notepad++ - regex

I have a text file that contains thousands of lines of text as below.
aaaa "test "
aa "test "(version 2)
bbbb "test "(version 4)
bbbbb "test1 "(with heads)
abs "test1 "
absc "test3"
I would like to remove all duplicates based on a search and keep only the first line (in my case, all lines with the same value between the quotation marks).
EDIT: More details about how I detect that a line is a duplicate of another:
I check the value between the quotation marks. The first 3 lines have the value "test " between the quotation marks, so I want to keep the first line with this value and remove the others. For lines 4 and 5 the value is "test1 ", so I keep only line 4 and remove the other.
So after cleaning, my text file would have this form:
aaaa "test "
bbbbb "test1 "(with heads)
absc "test3"
I tried this regular expression search in Notepad++:
(.\".*?")
But I don't know how to use it to find duplicates and remove the other lines with the same value. I already checked other users' cases but I can't find a solution.

I would solve it in several steps.
1. append line numbers
2. put the quoted text in front
3. sort; lines with the same quoted text now follow each other, and within each group they keep their original sequence thanks to the line numbers from step 1
4. remove "duplicates"
5. remove the inserted quoted text from step 2
6. sort by the line number from step 1
7. remove the line numbers from step 1
Now the detailed explanation:
append line numbers: use Edit -> Column Editor in the first column two times:
first insert text (some delimiter that does not occur in the file, e.g. | or :)
then insert numbers (start with 1, increment by 1, use leading zeros)
Now each line should start with a line number and a delimiter
prepend the quoted text: use regexp replace
Find what: ^([^"]*)("[^"]+")(.*)$
Replace: \2\1\2\3
Now your lines should start with the text.
Sort: by using Edit -> Line Operations -> Sort ...
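Remove Duplicates: with a regexp replace:
Find What: ("[^"]+")(.*)\n\1.*
Replace: \1\2
Use Replace All, and repeat it until no more replacements are made: a single pass does not remove every duplicate when three or more lines share the same quoted text.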
Remove the texts from step 2: using regex replace
Find What: ^"[^"]+"
Replace with: Nothing i.e. leave empty
Sort by the original line numbers: by using Edit -> Line Operations -> Sort ...
Remove the line numbers from step 1: using a regexp replace:
Find What: ^(.*\|) (use \| or whatever you used in step 1 as delimiter)
Replace with: Nothing i.e. leave empty
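If running a script outside Notepad++ is an option, the same result can be sketched in a few lines of Python. This is only an illustration, not part of the Notepad++ procedure above: the function name dedupe_by_quoted_value is made up here, and it assumes that the key is the first double-quoted value on each line and that lines without any quoted value should be kept.

```python
import re

def dedupe_by_quoted_value(lines):
    """Keep only the first line for each quoted value (hypothetical helper)."""
    seen = set()
    kept = []
    for line in lines:
        match = re.search(r'"([^"]*)"', line)
        if match is None:
            # Assumption: lines without a quoted value are never duplicates.
            kept.append(line)
            continue
        key = match.group(1)
        if key in seen:
            continue  # a line with this quoted value was already kept
        seen.add(key)
        kept.append(line)
    return kept

sample = [
    'aaaa "test "',
    'aa "test "(version 2)',
    'bbbb "test "(version 4)',
    'bbbbb "test1 "(with heads)',
    'abs "test1 "',
    'absc "test3"',
]
for line in dedupe_by_quoted_value(sample):
    print(line)
```

Because the lines are processed in order and a set remembers the values already seen, no sorting or line numbering is needed to preserve the original sequence.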


Keep all lines of a list with identical beginning (Notepad++)

From a list, how to keep all occurrences of those lines only whose "first part or beginning" (defined from the beginning of the line to the ^ character) is present in other lines? (The pattern of lines in the list: beginning-of-line^rest_of_line_012345)
The type of characters, length, etc. after the ^ is irrelevant (but needs to be kept). Every line has only one (1) ^ character. The "beginning" string that determines identity must be present in the same (analogous) position in other lines (i.e., from the beginning of the line to ^, and must be exact match). (Lines contain characters that trouble regex, such as \/()*., so these need to be summarily escaped.)
For example: Original list:
abc^123
0xyz^xxx
aaa-123^123
aaa-12^0xyz
0xyz^098
00xyz^098
0xyz^x111xx
Keep all occurrences of lines with identical first part:
0xyz^xxx
0xyz^098
0xyz^x111xx
This elegant script by @Lars Fischer ((.*)\R(\2\R?)+)*\K.* (after pre-sorting) keeps all occurrences of duplicate lines, but it considers the entire line (it was designed to do so).
In this Q, I am looking for a solution that considers only the "beginning" of the line to see if it occurs more than once, and if yes, then keep the entire line. Any guidance?
Note: in this solution the characters # and % are used based on the assumption that these characters do not show up ANYWHERE in the file to begin with. If that's not the case for you, just use different patterns that you know don't show up anywhere in the file, such as ##### and %%%%%.
Start by sorting the file Lexicographically with Notepad++ by going to Edit -> Line Operations -> Sort Lines Lexicographically Ascending
Do a regex Find-and-Replace (UNcheck the box for ". matches newline"):
Find what:
^(.*?)\^[^\r\n]+[\r\n]+(\1\^.*?[\r\n]+)*\1\^.*?$
Replace with:
#$&%
Now do another regex Find-and-Replace (CHECK the box for ". matches newline"):
Find what:
%.*?#
Replace with:
\r\n
Finally, do one last regex Find-and-Replace (CHECK the box for ". matches newline"):
Find what:
^.*?#|%.*
Replace with nothing.
You said in comments that a perl script is OK for you.
#!/usr/bin/perl
use Modern::Perl;
my %values;
my $file = 'path/to/file';
open my $fh, '<', $file or die "unable to open '$file': $!";
while(<$fh>) {
    chomp;
    # get the prefix value (everything before the ^)
    my ($prefix) = split('\^', $_);
    # push the whole line onto the array stored in the hash under the prefix key
    push @{$values{$prefix}}, $_;
}
foreach (keys %values) {
    # skip the prefixes that have only one line
    next if scalar @{$values{$_}} == 1;
    local $" = "\n";
    say "@{$values{$_}}";
}
Output:
0xyz^xxx
0xyz^098
0xyz^x111xx
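For comparison, here is a rough Python sketch of the same idea. It is an illustration under assumptions rather than a translation of the script above: the function name keep_repeated_prefixes is invented, each line is assumed to contain the ^ separator, and the surviving lines are printed in their original file order instead of grouped by prefix.

```python
from collections import Counter

def keep_repeated_prefixes(lines):
    """Keep every line whose text before the first ^ occurs on more than one line."""
    def prefix(line):
        # Everything up to (not including) the first ^ character.
        return line.split("^", 1)[0]

    counts = Counter(prefix(line) for line in lines)
    return [line for line in lines if counts[prefix(line)] > 1]

sample = [
    "abc^123",
    "0xyz^xxx",
    "aaa-123^123",
    "aaa-12^0xyz",
    "0xyz^098",
    "00xyz^098",
    "0xyz^x111xx",
]
for line in keep_repeated_prefixes(sample):
    print(line)
```

Since the prefix is compared as a plain string, characters such as \/()*. need no escaping here.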

Using a regex to extract a set of numbers and/or blank lines

I'm constructing a regex using PCRE to process text to extract a set of numbers from a set of text lines (the lines are produced by parsing HTML with XPATH but the question doesn't depend on that). If the number required isn't present, I need to return a blank line.
I'm using a module in Drupal called Feeds Tamper that provides a limited set of options to modify the content -- including a Regex find and replace based on PCRE (not PCRE2). I have options to do a sequence of Regex Find and Replace and/or simple Find and Replace.
The input takes the format:
Text A Location1 More text q=1,2)" Even more text
Text B
Text C Location1 More text q=3,4)" Even more text
Text D
There can be any number of lines including and not including the digits I want to extract; the last line may or may not have a digit in it; I need to process all the lines and end up with one result per line and no extras. The results are then replaced with a capturing group.
My search Regex currently looks like
.*?Location1.*?q=(.*?),(.*?)".*?(\r|$)|.*?(\r|$)
and my replacement like
\1|
but (see regex101.com) this gives results such as
1||
||
3||
||
||
where the expected output is:
1|
|
3|
|
i.e. there is an extra line at the end that doesn't correspond to an input line, and an extra pipe character at the end of each line.
If I use
.*?Location1.*?q=(.*?),(.*?)".*?\r|.*?\r
the last line is omitted so I get:
1|
|
3|
If I don't add a pipe | to the end of the substitution I get the right number of lines with the expected content (digit or blank), but as soon as I add something at the end of the substitution I get an extra line and the substituted character is doubled.
What do I need to change in my Regex and why?
Something like this:
^(?:.*Location1.*?q=(\d+),(\d+))?.*$
First it matches start of line, optionally followed by the "required" Location and q= parts and captures the numbers. Finally it matches anything up to the end.
Here at regex101.
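To see how that pattern behaves, it can be tried outside Feeds Tamper, for example with Python's re module. This is only a sketch: Python's engine is not PCRE, the function name first_numbers is made up, and the sample assumes the lines are separated by \n with no trailing line break.

```python
import re

# Same pattern as above; MULTILINE lets ^ and $ match at every line boundary.
PATTERN = re.compile(r"^(?:.*Location1.*?q=(\d+),(\d+))?.*$", re.MULTILINE)

def first_numbers(text):
    """Replace each line with its first captured number (or nothing) plus a pipe."""
    # group(1) is None on lines where the optional part did not match.
    return PATTERN.sub(lambda m: (m.group(1) or "") + "|", text)

sample = "\n".join([
    'Text A Location1 More text q=1,2)" Even more text',
    "Text B",
    'Text C Location1 More text q=3,4)" Even more text',
    "Text D",
])
print(first_numbers(sample))
```

Each line is consumed by exactly one match, because the trailing .* always eats the rest of the line and ^ cannot match again until the next line starts. That is why no extra pipe or extra line appears.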

How to join all lines till next condition?

I can't find out how to join all lines until the next condition occurs (a line with only 1 or more numbers), e.g.
input:
1
text text text text (with numbers)
text text text text (with numbers)
2
this text
text text text text (with numbers)
text text text
3
text text text text (with numbers)
4
etc
desired output:
1 text text text text (with numbers) text text text text (with numbers)
2 this text text text text text (with numbers) text text text
3 text text text text (with numbers)
4
etc
I normally use global/^/,+2 join but the number of lines to join is not always 3, as in my example above.
Instead of the static +2 end of the range for the :join command, just specify a search range for the next line that only contains a number (/^\d\+$/), and then join until the line before (-1):
:global/^/,/^\d\+$/-1 join
v/^\d\+/-j will do the trick.
v: execute the command on each line not matching the condition.
^\d\+: your condition, i.e. a line starting with a number.
-j: go one line back and join, or, if you prefer, join the current line with the previous line.
So basically we join every line not matching your condition with the previous line.
Just because of the comment by Tim that it couldn't be done with only a regular expression search and replace using Vim, I present this: how to do it with only a regular expression search and replace, using Vim:
:%s#\s*\n\(\d\+\s*\n\)\@!# #
If you're not fond of backslashes, it can be simplified using "very magic" \v:
:%s#\v\s*\n(\d+\s*\n)@!# #
This is adapted from Tim's Perl-style regular expression given in the same comment, improved to make sure the "stop line" only has numbers (and maybe trailing whitespace).
See :help perl-patterns if you're comfortable with Perl and find yourself having trouble with the Vim regular expression dialect.
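Outside Vim, the same joining logic can be sketched in Python. This is an illustrative sketch only: the function name join_until_number is invented, and it assumes a "stop line" is a line holding only digits, optionally followed by whitespace.

```python
import re

def join_until_number(lines):
    """Append every non-number line to the most recent number-only line."""
    joined = []
    for line in lines:
        is_number_line = re.fullmatch(r"\d+\s*", line) is not None
        if is_number_line or not joined:
            # Start a new output line at each number-only line
            # (or at the very first line, whatever it contains).
            joined.append(line.rstrip())
        else:
            joined[-1] += " " + line.strip()
    return joined

sample = [
    "1",
    "text text text text (with numbers)",
    "text text text text (with numbers)",
    "2",
    "this text",
    "text text text",
    "3",
]
for line in join_until_number(sample):
    print(line)
```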

Removing lines with less than 10 characters

I work under Windows and I'm trying to clean a text I'd like to study. What's the right regex in Notepad++ to remove lines that are 10 characters long or shorter?
Search:
^.{0,9}((\r?\n)|$)
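Replace with a blank (i.e. nothing). This removes lines of fewer than 10 characters; use {0,10} instead to also remove lines of exactly 10 characters.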
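For a quick check of what such a filter keeps, here is a minimal Python sketch. The function name drop_short_lines and its min_length parameter are hypothetical; with the default of 10 it mirrors the {0,9} search, and min_length=11 would mirror {0,10}.

```python
def drop_short_lines(text, min_length=10):
    """Return the text without lines shorter than min_length characters."""
    # splitlines() handles \n, \r\n and \r; the result is rejoined with \n.
    kept = [line for line in text.splitlines() if len(line) >= min_length]
    return "\n".join(kept)

sample = "short\n0123456789\nthis line is long enough\n\n"
print(drop_short_lines(sample))
```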
Step 1) Replace all lines containing 10 characters or fewer with an empty string:
^.{0,10}$
Step 2) Now you have lots of empty lines, so remove them:
"Edit" > "Line Operations" > "Remove Empty Lines"

Seeking regex in Notepad++ to search and replace CRLF between two quotation marks ["] only

I've got a CSV file with some 600 records where I need to replace some [CRLF] with a [space] but only when the [CRLF] is positioned between two ["] (quotation marks). When the second ["] is encountered then it should skip the rest of the line and go to the next line in the text.
I don't really have a starting point. Hope someone comes up with a suggestion.
Example:
John und Carol,,Smith,,,J.S.,,,,,,,,,,,,,+11 22 333 4444,,,,,"streetx 21[CRLF]
New York City[CRLF]
USA",streetx 21,,,,New York City,,,USA,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Normal,,My Contacts,[CRLF]
In this case the two [CRLF] after the first ["] need to be replaced with a space [ ]. When the second ["] is encountered, skip the end of the line and go to next line.
Then again, now on the next line, after the first ["] is encountered replace all [CRLF] until the second ["] is encountered. The [CRLF]s vary in numbers.
In the CSV-file the amount of commas [,] before (23) and after (65) the 2 quotation marks ["] is constant.
So maybe a comma counter could be used. I don't know.
Thanks for feedback.
This will work using one regex only (tested in Notepad++):
Enter this regex in the Find what field:
((?:^|\r\n)[^"]*+"[^\r\n"]*+)\r\n([^"]*+")
Enter this string in the Replace with field:
$1 $2
Make sure the Wrap around check box (and Regular expression radio button) are selected.
Do a Replace All as many times as required (until the "0 occurrences were replaced" dialog pops up).
Explanation:
(
  (?:^|\r\n)     Begin at start of file or before the CRLF before the start of a record
  [^"]*+         Consume all chars up to the opening "
  "              Consume the opening "
  [^\r\n"]*+     Consume all chars up to either the first CRLF or the closing "
)                Save as capturing group 1 (= everything in record before the target CRLF)
\r\n             Consume the target CRLF without capturing it
(
  [^"]*+         Consume all chars up to the closing "
  "               Consume the closing "
)                Save as capturing group 2 (= the rest of the string after the target CRLF)
Note: The *+ is a possessive quantifier. Use them appropriately to speed up execution.
Update:
This more general version of the regex will work with any line break sequence (\r\n, \r or \n):
((?:^|[\r\n]+)[^"]*+"[^\r\n"]*+)[\r\n]+([^"]*+")
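As an alternative to repeating Replace All, the same effect can be sketched with a small Python function that tracks whether the current position is inside a pair of quotation marks. This swaps the regex for a simple character scan and is only an illustration: the name flatten_quoted_newlines is made up, and it assumes quotation marks are always balanced within a record.

```python
def flatten_quoted_newlines(text):
    """Replace each line break that sits between two quotation marks with a space."""
    out = []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '"':
            # Every quotation mark toggles the state; a doubled "" toggles twice.
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes and ch in "\r\n":
            # Treat a CR LF pair as a single line break.
            if ch == "\r" and text[i + 1:i + 2] == "\n":
                i += 1
            out.append(" ")
        else:
            out.append(ch)
        i += 1
    return "".join(out)

sample = 'a,"x\r\ny\r\nz",b\r\nc,d\r\n'
print(repr(flatten_quoted_newlines(sample)))
```

Line breaks outside quotation marks are copied unchanged, so the record boundaries stay intact in a single pass.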
Maybe do it in three steps (assuming you have 88 fields in the CSV, because you said there are 23 commas before, and 65 after each second ")
Step 1: replace all CR/LF with some character not anywhere in the file, like ~
Search: \r\n Replace: ~
Step 2: replace all ~ after every 88th 'comma group' (or however many fields in CSV) with \r\n -- to reinsert the required CSV linebreaks:
Search: ((?:[^,]*?,){88})~ Replace: $1\r\n
Step 3: replace all remaining ~ with space
Search ~ Replace: <space>
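The three steps can also be tried out in Python before touching the real file. This is a sketch under assumptions: the function name repair_records and its parameters are hypothetical, every record is assumed to end with a comma directly before its line break (as in the example), and no field is assumed to contain a comma of its own. The demo uses 3 commas per record instead of 88 to stay readable.

```python
import re

def repair_records(text, commas_per_record, placeholder="~"):
    """Placeholder approach: flatten all CRLFs, restore record breaks, space out the rest."""
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text")
    # Step 1: replace every CRLF with the placeholder.
    flat = text.replace("\r\n", placeholder)
    # Step 2: a placeholder right after the N-th comma group is a real record break.
    pattern = re.compile(
        r"((?:[^,]*?,){%d})%s" % (commas_per_record, re.escape(placeholder))
    )
    restored = pattern.sub(lambda m: m.group(1) + "\r\n", flat)
    # Step 3: any placeholder that is left was inside a field; make it a space.
    return restored.replace(placeholder, " ")

sample = 'a,"x\r\ny",b,\r\nc,"z",d,\r\n'
print(repr(repair_records(sample, commas_per_record=3)))
```

A comma inside a quoted field would throw the count off, which is the main limitation of counting commas.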
In this case the source data is generated by the export function in GMail for your contacts.
After the modification outlined below (without RegEx) the result can be used to tidy up your contacts database and re-import it to GMail or to MS Outlook.
Yes, I am standing on the shoulders of @alan and @robinCTS. Thank you both.
Instructions in 5 steps:
use Notepad++ / find replace / extended search mode / wrap around = on
-1- replace all [CRLF] with a unique set of characters or a string (I used [~~])
find: \r\n and replace with: ~~
The file contents are now on one line only.
-2- Now we need to separate the header line. To do this, move to where the first record starts, exactly before the 88th comma (including the word after the 87th comma [,]), and enter the [CRLF] manually by pressing the Return key. There are two lines now: header and records.
-3- now find all [,~~] and replace with [,\r\n] The result is one record per line.
-4- remove the remaining [~~] find: ~~ and replace with: [ ] a space.
The file is now clean of unwanted [CRLF]s.
-5- Save the file and use it as intended.