Regex that only matches when no duplicate lines are found - regex

I have a multiline string like this:
SA21 abcdef
BKxyz
SA21 abcdef
I need a regex that only matches if the line ^SA21 abcdef$ is present once. So it should not match for the first example but it should match for this one:
BK udsia
SA21 abcdef
BKxyz
I tried to capture the line and make sure it matches only when the same line is not found later: /(^SA21 abcdef$)(?!\1)/m regex101 but that does not work as it will probably always match the last line...

The regex you want should only match a line if the line is not present before or after the single occurrence of the line. This is achieved with a tempered greedy token:
/\A(?:(?!^SA21 abcdef$).)*(^SA21 abcdef$)(?:(?!^SA21 abcdef$).)*\z/ms
See the regex demo
The (?:(?!^SA21 abcdef$).)* is the token matching any text but the beginning of the SA21 abcdef line. The /s modifier is required so that a . could match a newline.
However, the construct is resource consuming, and it is a good idea to unroll it:
/\A(?:\n+(?!SA21 abcdef$).*)*\n*^(SA21 abcdef)$(?:\n+(?!SA21 abcdef$).*)*\z/m
See another demo
Note that \A and \z are unambiguous start/end string anchors, the /m modifier does not affect them.
Pattern explanation:
\A - start of string
(?:\n+(?!SA21 abcdef$).*)* - zero or more sequences of:
\n+ - 1 or more newlines ...
(?!SA21 abcdef$) - not followed with SA21 abcdef that is the whole line
.* - zero or more chars other than a newline
\n* - zero or more newlines
^ - start of a line
(SA21 abcdef) - the line that must be single
$ - end of line
(?:\n+(?!SA21 abcdef$).*)* - see above
\z - end of string.

Related

How to clear lines after the last regex match

I got an huge log of records I need to turn into a table.
Each line has a record, preceded by date and time, something like this:
27/11/2019 16:35 - i don't need this
28/11/2019 17:25 - don't need this either
30/11/2019 11:33 - stuff i'm looking for
01/12/2019 08:11 - stuff that i'm also looking for
03/11/2019 09:39 - don't need this
I want to completely clear the file from all the lines that I don't need.
I'm able to clear most of the lines that I don't want if I use the following regex and substitution patterns (in notepad++, using the flag in which dot matches newline):
.+?(?<datetime>[\d\/]+\s[\d:]+)\s-\s(?<mystuff>stuff[^\n]+)
'${datetime};${mystuff}
However, I can't clear the lines after the last match. How could I do so?
You may use
Find What: ^(?:.+?([\d/]+\h[\d:]+)\h-\h(stuff.*)|.*\R?)
Replace With: (?{1}$1;$2)
Details
^ - start of a line
(?:.+?([\d/]+\h[\d:]+)\h-\h(stuff.*)|.*\R?) - match either
.+? - any 1+ chars, as few as possible
([\d/]+\h[\d:]+) - Group 1: one or more digits or /, a horizontal whitespace, one or more digits or :
\h-\h - a horizontal whitespace, - and a hor. whitespace
(stuff.*) - Group 2: stuff and the rest of the line
| - or
.* - any 0+ chars other than linebreak chars
\R? - an optional line break sequence.
The (?{1}$1;$2) replacement pattern only replaces with $1;$2 if Group 1 matches.
See the Notepad++ demo:

Python Regex - Capture match and previous two lines

In reference to a previous question
Python data extract from text file - script stops before expected data match
How can I capture a match and the previous two lines?
I tried this but get:
unterminated subpattern at position 0 (line 1, column 1)
output = re.findall('(.*\r\n{2}random data.',f.read(), re.DOTALL)
You may use
re.findall(r'(?:.*\r?\n){2}.*random data.*', s)
Note you can't use re.DOTALL or .* will match up to the end of the input and you will only get the last occurrence.
See the Python demo
Pattern details
(?:.*\r?\n){2} - 2 occurrences of a sequence of
.* - any 0+ chars other than line break chars, as many as possible (a line)
\r?\n - a line ending (CRLF or LF)
.*random data.* - a line containing random data substring.
See the regex demo.

Looking for regex to match before and after a number

Given the string
170905-CBM-238.pdf
I'm trying to match 170905-CBM and .pdf so that I can replace/remove them and be left with 238.
I've searched and found pieces that work but can't put it all together.
This-> (.*-) will match the first section and
This-> (.[^/.]+$) will match the last section
But I can't figure out how to tie them together so that it matches everything before, including the second dash and everything after, including the period (or the extension) but does not match the numbers between.
help :) and thank you for your kind consideration.
There are several options to achieve what you need in Nintex.
If you use Extract operation, use (?<=^.*-)\d+(?=\.[^.]*$) as Pattern.
See the regex demo.
Details
(?<=^.*-) - a positive lookbehind requiring, immediately to the left of the current location, the start of string (^), then any 0+ chars other than LF as many as possible up to the last occurrence of - and the subsequent subpatterns
\d+ - 1 or more digits
(?=\.[^.]*$) - a positive lookahead requiring, immediately to the right of the current location, the presence of a . and 0+ chars other than . up to the end of the string.
If you use Replace text operation, use
Pattern: ^.*-([0-9]+)\.[^.]+$
Replacement text: $1
See another regex demo (the Context tab shows the result of the replacement).
Details
^ - a start of string anchor
.* - any 0+ chars other than LF up to the last occurrence of the subsequent subpatterns...
- - a hyphen
([0-9]+) - Group 1: one or more ASCII digits
\. - a literal .
[^.]+ - 1 or more chars other than .
$ - end of string.
The replacement $1 references the value stored in Group 1.
I don't know ninetex regex, but a sed type regex:
$ echo "170905-CBM-238.pdf" | sed -E 's/^.*-([0-9]*)\.[^.]*$/\1/'
238
Same works in Perl:
$ echo "170905-CBM-238.pdf" | perl -pe 's/^.*-([0-9]*)\.[^.]*$/$1/'
238

Get the first ocurrence of a string in a variable REGEX

I have the following variable in a database: PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT and I want to split it into two variables, the first will be PSC-CAMPO-GRANDE-I08 and the second V00-C09-H09-IPRMKT.
I'm trying the regex .*(\-I).*(\-V), this doesn't work. Then I tried .*(\-I), but it gets the last -IPRMKT string.
Then my question is: There a way of split the string PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT considering the first occurrence of -I?
This should do the trick:
regex = "(.*?-I[\d]{2})-(.*)"
Here is test script in Python
import re
regex = "(.*?-I[\d]{2})-(.*)"
match = re.search(regex, "PSC-CAMPO-GRANDE-I08-V00-C09-H09-IPRMKT")
if match:
print ("yep")
print (match.group(1))
print (match.group(2))
else:
print ("nope")
In the regex, I'm grabbing everything up to the first -I then 2 numbers. Then match but don't capture a -. Then capture the rest. I can help tweak it if you have more logic that you are trying to do.
You may use
^(.*?-I[^-]*)-(.*)
See the regex demo
Details:
^ - start of a string
(.*?-I[^-]*) - Group 1:
.*? - any 0+ 0+ chars other than line break chars up to the first (because *? is a lazy quantifier that matches up to the first occurrence)
-I - a literal substring -I
[^-]* - any 0+ chars other than a hyphen (your pattern was missing it)
- - a hyphen
(.*) - Group 2: any 0+ chars other than line break chars up to the end of a line.

Notepad++ regex to insert character every nth character from a starting position

How do you use regex to insert | every two characters from a starting position to the end of the line?
Using regex on the following sample (tshark output of packet data), the regex inserts | after the first two characters and the next two characters, but does not apply the pattern to the rest of the line. I think the issue is with a repeated pattern on the 2nd grouping (or lackthereof).
Sample:
1478646603.255173000 10.10.10.1 0000000000000000000000
^(.{34})(..) replace with \1|\2| OR ^(.{34})(.*?(..)) replace with \1|\2
Produces this:
1478646603.255173000 10.10.10.1 00|00|000000000000000000
What I want is:
1478646603.255173000 10.10.10.1 00|00|00|00|00|00|00|00|00|00|00
You may use
(?:\G(?!^)|^.{36})\K..(?!$)
and replace with $&|.
Details:
(?:\G(?!^)|^.{36}) - matches the location at the end of the previous successful match (with \G(?!^)) or (|) the start of a line (^) and the first 36 characters other than linebreak chars (.{36})
\K - the match reset operator that discards the whole text matched so far
.. - any 2 chars other than linebreak chars
(?!$) - that are not at the end of the string.
The replacement pattern only contains the backreference to the whole match ($&) and a | pipe symbol (a literal symbol in the replacement pattern).