From a list, how can I keep only those lines whose "first part" (everything from the start of the line up to the ^ character) also occurs as the first part of at least one other line? (Each line has the form: beginning-of-line^rest_of_line_012345)
The type of characters, length, etc. after the ^ is irrelevant (but that part needs to be kept). Every line has exactly one (1) ^ character. The "beginning" string that determines identity must be present in the same position in other lines (i.e., from the beginning of the line to ^, as an exact match). (Lines contain characters that are special in regex, such as \/()*., so these need to be escaped.)
For example: Original list:
abc^123
0xyz^xxx
aaa-123^123
aaa-12^0xyz
0xyz^098
00xyz^098
0xyz^x111xx
Keep all occurrences of lines with identical first part:
0xyz^xxx
0xyz^098
0xyz^x111xx
This elegant script by @Lars Fischer ((.*)\R(\2\R?)+)*\K.* (after pre-sorting) keeps all occurrences of duplicate lines, but it considers the entire line (it was designed to do so).
In this Q, I am looking for a solution that considers only the "beginning" of the line to see if it occurs more than once, and if yes, then keep the entire line. Any guidance?
Note: in this solution the characters # and % are used based on the assumption that these characters do not show up ANYWHERE in the file to begin with. If that's not the case for you, just use different patterns that you know don't show up anywhere in the file, such as ##### and %%%%%.
Start by sorting the file Lexicographically with Notepad++ by going to Edit -> Line Operations -> Sort Lines Lexicographically Ascending
Do a regex Find-and-Replace (UNcheck the box for ". matches newline"):
Find what:
^(.*?)\^[^\r\n]+[\r\n]+(\1\^.*?[\r\n]+)*\1\^.*?$
Replace with:
#$&%
Now do another regex Find-and-Replace (CHECK the box for ". matches newline"):
Find what:
%.*?#
Replace with:
\r\n
Finally, do one last regex Find-and-Replace (CHECK the box for ". matches newline"):
Find what:
^.*?#|%.*
Replace with nothing.
You said in comments that a perl script is OK for you.
#!/usr/bin/perl
use Modern::Perl;
my %values;
my $file = 'path/to/file';
open my $fh, '<', $file or die "unable to open '$file': $!";
while(<$fh>) {
    chomp;
    # get the prefix value (everything before the ^)
    my ($prefix) = split('\^', $_);
    # push the whole line onto the array stored in the hash under its prefix
    push @{$values{$prefix}}, $_;
}
foreach (keys %values) {
    # skip the prefixes that have only one line
    next if scalar @{$values{$_}} == 1;
    local $" = "\n";
    say "@{$values{$_}}";
}
Output:
0xyz^xxx
0xyz^098
0xyz^x111xx
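For comparison, here is a minimal Python sketch of the same grouping idea. It is not part of the original answer, and the function name keep_shared_prefixes is made up for illustration. It counts each prefix first and then filters, so the kept lines stay in their original input order (the order of keys %values in Perl is not guaranteed when several prefixes repeat).

```python
from collections import Counter


def keep_shared_prefixes(lines, sep="^"):
    """Keep every line whose text before `sep` occurs on more than one line."""
    # Everything before the first separator is the prefix.
    prefixes = [line.split(sep, 1)[0] for line in lines]
    counts = Counter(prefixes)
    # Filter in input order, so no pre-sorting is needed.
    return [line for line, prefix in zip(lines, prefixes) if counts[prefix] > 1]


sample = [
    "abc^123",
    "0xyz^xxx",
    "aaa-123^123",
    "aaa-12^0xyz",
    "0xyz^098",
    "00xyz^098",
    "0xyz^x111xx",
]
print("\n".join(keep_shared_prefixes(sample)))
```

Because the prefix is compared as a plain string, characters such as \/()*. need no escaping here.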
I'm constructing a regex using PCRE to process text to extract a set of numbers from a set of text lines (the lines are produced by parsing HTML with XPATH but the question doesn't depend on that). If the number required isn't present, I need to return a blank line.
I'm using a module in Drupal called Feeds Tamper that provides a limited set of options to modify the content -- including a Regex find and replace based on PCRE (not PCRE2). I have options to do a sequence of Regex Find and Replace and/or simple Find and Replace.
The input takes the format:
Text A Location1 More text q=1,2)" Even more text
Text B
Text C Location1 More text q=3,4)" Even more text
Text D
There can be any number of lines including and not including the digits I want to extract; the last line may or may not have a digit in it; I need to process all the lines and end up with one result per line and no extras. The results are then replaced with a capturing group.
My search Regex currently looks like
.*?Location1.*?q=(.*?),(.*?)".*?(\r|$)|.*?(\r|$)
and my replacement like
\1|
but (see regex101.com) this gives results such as
1||
||
3||
||
||
where the expected output is:
1|
|
3|
|
i.e., there is an extra line at the end that doesn't correspond to an input line, and an extra pipe character at the end of each line.
If I use
.*?Location1.*?q=(.*?),(.*?)".*?\r|.*?\r
the last line is omitted so I get:
1|
|
3|
If I don't add a pipe | to the end of the substitution, I get the right number of lines with the expected content (digit or blank), but as soon as I add something at the end of the substitution I get an extra line and the substituted character is doubled.
What do I need to change in my Regex and why?
Something like this:
^(?:.*Location1.*?q=(\d+),(\d+))?.*$
First it matches start of line, optionally followed by the "required" Location and q= parts and captures the numbers. Finally it matches anything up to the end.
Here at regex101.
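To see the effect of that pattern outside Feeds Tamper, here is a small Python sketch. It is only an illustration: Feeds Tamper uses PCRE, not Python's re module, and the function name extract_first_numbers is invented. It assumes Python 3.5 or later, where an unmatched group in the replacement expands to an empty string, and it applies the question's replacement \1| in multiline mode.

```python
import re

# The pattern from the answer; MULTILINE makes ^ and $ match at every line.
PATTERN = re.compile(r'^(?:.*Location1.*?q=(\d+),(\d+))?.*$', re.MULTILINE)


def extract_first_numbers(text):
    """Replace each line with its first captured number followed by a pipe."""
    # Lines without Location1/q= leave group 1 unmatched, which yields just "|".
    return PATTERN.sub(r'\1|', text)


sample = (
    'Text A Location1 More text q=1,2)" Even more text\n'
    'Text B\n'
    'Text C Location1 More text q=3,4)" Even more text\n'
    'Text D'
)
print(extract_first_numbers(sample))
```

Because the pattern is anchored with ^, it cannot produce an extra empty match at the end of a line, which is what caused the doubled pipe and the extra line in the original attempt.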
I can't find out how to join all lines until the next condition occurs (a line containing only one or more digits), e.g.
input:
1
text text text text (with numbers)
text text text text (with numbers)
2
this text
text text text text (with numbers)
text text text
3
text text text text (with numbers)
4
etc
desired output:
1 text text text text (with numbers) text text text text (with numbers)
2 this text text text text text (with numbers) text text text
3 text text text text (with numbers)
4
etc
I normally use global/^/,+2 join but the number of lines to join are not always 3 in my example above.
Instead of the static +2 end of the range for the :join command, just specify a search range for the next line that only contains a number (/^\d\+$/), and then join until the line before (-1):
:global/^/,/^\d\+$/-1 join
v/^\d\+/-j will do the trick.
v executes the command on each line that does not match the pattern.
^\d\+ is your condition: a line starting with a number.
-j goes one line back and joins. Or, if you prefer, it joins the current line with the previous line.
So basically we join every line that does not match your condition with the previous line.
Just because of the comment by Tim that it couldn't be done with only a regular expression search and replace using Vim, I present this: how to do it with only a regular expression search and replace, using Vim:
:%s#\s*\n\(\d\+\s*\n\)\@!# #
If you're not fond of backslashes, it can be simplified using "very magic" \v:
:%s#\v\s*\n(\d+\s*\n)@!# #
This is adapted from Tim's Perl-style regular expression given in the same comment, improved to make sure the "stop line" only has numbers (and maybe trailing whitespace).
See :help perl-patterns if you're comfortable with Perl and find yourself having trouble with the Vim regular expression dialect.
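For readers who want the same joining logic outside Vim, here is a minimal Python sketch. The function name join_until_number is made up, and the sketch assumes the stricter condition from the question: a new group starts only at a line that consists of digits alone (plus optional trailing whitespace).

```python
import re


def join_until_number(lines):
    """Append every non-numeric line to the preceding number-only line."""
    joined = []
    for line in lines:
        # A digits-only line starts a new group; so does the very first line.
        if re.fullmatch(r"\d+\s*", line) or not joined:
            joined.append(line.rstrip())
        else:
            joined[-1] += " " + line.strip()
    return joined


sample = [
    "1",
    "text one (with 2 numbers)",
    "text two",
    "2",
    "this text",
    "3",
    "4",
]
print("\n".join(join_until_number(sample)))
```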
I work under Windows and am trying to clean up a text I'd like to study. What is the right regex in Notepad++ to remove lines that are 10 characters long or shorter?
Search:
^.{0,10}((\r?\n)|$)
Replace with a blank (i.e., nothing)
Step.1) Replace all the lines containing less than or equal to 10 chars with empty string.
^.{0,10}$
Step.2) Now you have lots of empty lines. So, remove empty lines:
"Edit" > "Line Operations" > "Remove Empty Lines"
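If the text can be processed outside Notepad++, the same cleanup can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name drop_short_lines and the max_len parameter are invented, and the result uses \n line endings regardless of the input.

```python
def drop_short_lines(text, max_len=10):
    """Remove every line that is max_len characters long or shorter."""
    # splitlines() strips the line terminators, so len() counts only content;
    # empty lines have length 0 and are dropped as well.
    kept = [line for line in text.splitlines() if len(line) > max_len]
    return "\n".join(kept)


sample = (
    "short\n"
    "exactly10!\n"
    "this line is long enough\n"
    "\n"
    "another long line here"
)
print(drop_short_lines(sample))
```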
I've got a CSV file with some 600 records where I need to replace some [CRLF] with a [space] but only when the [CRLF] is positioned between two ["] (quotation marks). When the second ["] is encountered then it should skip the rest of the line and go to the next line in the text.
I don't really have a starting point. Hope someone comes up with a suggestion.
Example:
John und Carol,,Smith,,,J.S.,,,,,,,,,,,,,+11 22 333 4444,,,,,"streetx 21[CRLF]
New York City[CRLF]
USA",streetx 21,,,,New York City,,,USA,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Normal,,My Contacts,[CRLF]
In this case the two [CRLF] after the first ["] need to be replaced with a space [ ]. When the second ["] is encountered, skip the end of the line and go to next line.
Then again, now on the next line, after the first ["] is encountered replace all [CRLF] until the second ["] is encountered. The [CRLF]s vary in numbers.
In the CSV-file the amount of commas [,] before (23) and after (65) the 2 quotation marks ["] is constant.
So maybe a comma counter could be used. I don't know.
Thanks for feedback.
This will work using one regex only (tested in Notepad++):
Enter this regex in the Find what field:
((?:^|\r\n)[^"]*+"[^\r\n"]*+)\r\n([^"]*+")
Enter this string in the Replace with field:
$1 $2
Make sure the Wrap around check box (and Regular expression radio button) are selected.
Do a Replace All as many times as required (until the "0 occurrences were replaced" dialog pops up).
Explanation:
(
(?:^|\r\n) Begin at start of file or before the CRLF before the start of a record
[^"]*+ Consume all chars up to the opening "
" Consume the opening "
[^\r\n"]*+ Consume all chars up to either the first CRLF or the closing "
) Save as capturing group 1 (= everything in record before the target CRLF)
\r\n Consume the target CRLF without capturing it
(
[^"]*+ Consume all chars up to the closing "
" Consume the closing "
) Save as capturing group 2 (= the rest of the string after the target CRLF)
Note: The *+ is a possessive quantifier. Use them appropriately to speed up execution.
Update:
This more general version of the regex will work with any line break sequence (\r\n, \r or \n):
((?:^|[\r\n]+)[^"]*+"[^\r\n"]*+)[\r\n]+([^"]*+")
Maybe do it in three steps (assuming each record contains 88 commas, because you said there are 23 commas before the quoted field and 65 after it)
Step 1: replace all CR/LF with some character not anywhere in the file, like ~
Search: \r\n Replace: ~
Step 2: replace all ~ after every 88th 'comma group' (or however many fields in CSV) with \r\n -- to reinsert the required CSV linebreaks:
Search: ((?:[^,]*?,){88})~ Replace: $1\r\n
Step 3: replace all remaining ~ with space
Search ~ Replace: <space>
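If a script is an option, a CSV parser avoids counting commas altogether, because it already knows which line breaks sit inside a quoted field. The following is a minimal sketch using Python's standard csv module, not something from the answers above. The function name flatten_quoted_newlines is made up, and it assumes the file follows normal CSV quoting rules. Note that the writer only re-quotes fields that still need it, so quotes around a flattened field may disappear.

```python
import csv
import io
import re


def flatten_quoted_newlines(csv_text):
    """Replace line breaks inside quoted CSV fields with a single space."""
    # newline="" keeps \r\n untouched so the csv module sees the raw breaks.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\r\n")
    for row in reader:
        # Any break still inside a parsed field was enclosed in quotes.
        writer.writerow([re.sub(r"\r\n|\r|\n", " ", field) for field in row])
    return out.getvalue()


sample = (
    'John,,"streetx 21\r\nNew York City\r\nUSA",streetx 21,\r\n'
    'Jane,,"a",b,\r\n'
)
print(flatten_quoted_newlines(sample))
```

For a real file, open it with newline="" as the csv module documentation recommends, and pass the file contents to the function.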
In this case the source data is generated by the export function in GMail for your contacts.
After the modification outlined below (without RegEx) the result can be used to tidy up your contacts database and re-import it to GMail or to MS Outlook.
Yes, I am standing on the shoulders of @alan and @robinCTS. Thank you both.
Instructions in 5 steps:
use Notepad++ / find replace / extended search mode / wrap around = on
-1- replace all [CRLF] with a unique set of characters or a string (I used [~~])
find: \r\n and replace with: ~~
The file contents are now on one line only.
-2- Now we need to separate the header line. For this, move to where the first record starts, exactly before the 88th comma (including the word after the 87th comma [,]), and enter the [CRLF] manually by hitting the Return key. There are two lines now: header and records.
-3- now find all [,~~] and replace with [,\r\n] The result is one record per line.
-4- remove the remaining [~~] find: ~~ and replace with: [ ] a space.
The file is now clean of unwanted [CRLF]s.
-5- Save the file and use it as intended.