Regex to find strings not containing a specified value - regex

I'm using Notepad++'s regular expression search function to find all strings in a .txt document that do not contain a specific value (HIJ in the example below), where all strings begin with the same value (ABC in the example below).
How would I go about doing this?
Example
Every String starts with ABC
ABC is never used in a string other than at the beginning,
ABCABC123 would be two strings --"ABC" and "ABC123"
HIJ may appear multiple times in a string
I need to find the strings that do not contain HIJ
Input is one long file with no line breaks, but does contain special characters (*, ^, #, ~, :) and spaces
Example Input:
ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J
Example Input would be viewed as the following strings
ABC1234HIJ56
ABC7#HIJ
ABC89
ABCHIJ0ABE:HIJ
ABC12~34HI456J
The Third and Fifth strings both lack "HIJ" and therefore are included in the output, all others are not included in the output.
Example desired output:
ABC89
ABC12~34HI456J
I am 99% new to RegEx and will be looking more into it in the future, as my job description suddenly changed earlier this week when someone else in the company left suddenly, and therefore I have been doing this manually by searching (ABC|HIJ) and going through the search function's results looking for "ABC" appearing twice in a row. Supposedly the former employee was able to do this in an automated way, but left no documentation.
Any help would be appreciated!
This question is a repeat of a prior question I asked, but I was very very bad at formatting a question and it seems to have sunk beyond noticeable levels.

You can find the items you want with:
ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+(?=ABC|$)
Note: in this first pattern, you can replace (?=ABC|$) with (?!HIJ)
pattern details:
ABC
(?: # non-capturing group
[^HA]+ # all that is not a H or an A
| # OR
H(?!IJ) # an H not followed by IJ
|
A(?!BC) # an A not followed by BC
)*+ # repeat the group
(?=ABC|$) # followed by "ABC" or the end of the string
Note: if you want to remove all that is not the items you want you can make this search replace:
search: (?:ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+HIJ.*?(?=ABC|$))+|(?=ABC)
replace: \r\n

you could use this pattern
(ABC(?:(?!HIJ).)*?)(?=ABC|\R)
Demo
( # Capturing Group (1)
ABC # "ABC"
(?: # Non Capturing Group
(?! # Negative Look-Ahead
HIJ # "HIJ"
) # End of Negative Look-Ahead
. # Any character except line break
) # End of Non Capturing Group
*? # (zero or more)(lazy)
) # End of Capturing Group (1)
(?= # Look-Ahead
ABC # "ABC"
| # OR
\R # <line break>
) # End of Look-Ahead

You can use the following expression to match your criterion:
(^ABC(?:(?!HIJ).)*$)
This matches a line that starts with ABC and uses a negative lookahead to make sure HIJ does not appear anywhere after it. The pattern works when each string is on its own line.
For a single line pattern (as provided in your question), a slight modification of this works (as follows):
(ABC(?:(?!HIJ).)*?)(?=ABC|$)
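For anyone who wants to check the single-line pattern outside Notepad++, here is a minimal sketch in Python, assuming the standard re module and the example input from the question. The variable names are only illustrative, and the second half is a plain-string alternative that is not taken from any of the answers.

```python
import re

# Example input from the question: one long line, every string starts with ABC.
data = "ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J"

# ABC, then lazily any characters that do not start "HIJ",
# up to the next ABC or the end of the input.
pattern = re.compile(r"(ABC(?:(?!HIJ).)*?)(?=ABC|$)")

without_hij = pattern.findall(data)
print(without_hij)

# Alternative without lookarounds: cut the input at every ABC,
# put the prefix back, and keep the pieces that do not contain HIJ.
pieces = ["ABC" + rest for rest in data.split("ABC")[1:]]
filtered = [piece for piece in pieces if "HIJ" not in piece]
print(filtered)
```

Both lists should hold the two strings from the desired output, ABC89 and ABC12~34HI456J.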

Related

Regex for text file

I have a text file with the following text:
andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar
Is there any way I can split the name and the version using regex? I want to use the results to make two columns 'Name' and 'Version' in Excel.
So i want the results from regex to look like
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
So far I have used ^(?:.*-(?=\d)|\D+) to get the Version and -\d.*$ to get the Name separately. The problem with this is that when I do it for a large text file, the results from the two regexes are not in the same order. So is there any way to get the results in the way I have mentioned above?
Ctrl+H
Find what: ^(.+?)[-_](\d.*)$
Replace with: $1\t$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(.+?) # group 1, 1 or more any character but newline, not greedy
[-_] # a dash or underscore
(\d.*) # group 2, a digit then 0 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
\t # a tabulation, you may replace with what you want
$2 # content of group 2
Result for given example:
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
I'm not quite sure what you mean by the problem with large files, and I believe the two regexes you showed do the opposite of what you said: the first one should get you the name and the second one should give you the version.
Anyway, here are the assumptions I had to make about what may make sense to you:
The "Name" may be followed by - or _, followed by the version string.
The "Version" string is preceded by - or _ and consists of some digits, followed by a dot or underscore, followed by some digits, and then any string.
If these assumptions make sense, you may use
^(.+?)(?:[-_](\d+[._]\d+.*))?$
as your regex. Group 1 will be the name, Group 2 will be the version.
Demo in regex101: https://regex101.com/r/RnwMaw/3
Explanation of regex
^ start of line
(.+?) "Name" part, using reluctant match of
at least 1 character
(?: )? Optional group of "Version String", which
consists of:
[-_] - or _
( ) Followed by the "Version" , which is
\d+ at least 1 digit,
[._] then 1 dot or underscore,
\d+ then at least 1 digit,
.* then any string
$ end of line
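To keep name and version in the same row (the ordering problem mentioned in the question), the pattern can be applied line by line. This is a minimal sketch in Python, assuming the standard re module; split_name_version is a hypothetical helper name, and the tab-separated output is only one way to get two columns into Excel.

```python
import re

lines = [
    "andal-4.1.0.jar", "besc_2.1.0-beta", "prov-3.0.jar", "add4lib-1.0.jar",
    "com_lab_2.0.jar", "astrix", "lis-2_0_1.jar",
]

# Same pattern as in the answer: lazy name, optional -/_ plus version.
pattern = re.compile(r"^(.+?)(?:[-_](\d+[._]\d+.*))?$")

def split_name_version(line):
    """Return (name, version); version is '' when the line has none."""
    match = pattern.match(line)
    if match is None:          # only an empty line fails to match
        return line, ""
    return match.group(1), match.group(2) or ""

rows = [split_name_version(line) for line in lines]
for name, version in rows:
    print(f"{name}\t{version}")  # tab-separated, one row per input line
```

Because each line yields exactly one (name, version) pair, the two columns cannot drift out of order, even for lines such as astrix that have no version.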

If pattern repeats two times (nonconsecutive) match both patterns, regex

I have 3 values that I'm trying to match. foo, bar and 123. However I would like to match them only if they can be matched twice.
In the following line:
foo;bar;123;foo;123;
Since bar is not present twice, it would only match foo and 123:
foo;bar;123;foo;123;
I understand how to specify to match exactly two matches, (foo|bar|123){2} however I need to use backreferences in order to make it work in my example.
I'm struggling putting the two concepts together and making a working solution for this.
You could use
(?<=^|;)([^\n;]+)(?=.*(?:(?<=^|;)\1(?=;|$)))
Broken down, this is
(?<=^|;) # pos. lookbehind, either start of string or ;
([^\n;]+) # not ; nor newline 1+ times
(?=.* # pos. lookahead
(?:
(?<=^|;) # same pattern as above
\1 # group 1
(?=;|$) # end or ;
)
)
A simpler variant based on word boundaries, \b([^;]+)\b(?=.*\1), breaks down as:
\b # word boundary
([^;]+) # anything not ; 1+ times
\b # another word boundary
(?=.*\1) # pos. lookahead, making sure the pattern is found again
See a demo on regex101.com.
Otherwise - as said in the comments - split on the ; programmatically and use some programming logic afterwards.
Find a demo in Python for example (can be adjusted for other languages as well):
from collections import Counter
string = """
foo;bar;123;foo;123;
foo;bar;foo;bar;
foo;foo;foo;bar;bar;
"""
twins = [element
         for line in string.split("\n")
         for element, times in Counter(line.split(";")).most_common()
         if times == 2]
print(twins)
If you allow room for text that may occur in between matches with a ".*", this should match any of your values that occur at least twice:
(foo|bar|123).*\1
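One thing to watch with this pattern is that the match consumes everything up to the repeat, so a second repeated value inside that span is not reported. The following is a minimal sketch in Python, assuming the standard re module; the lookahead variant is my own adaptation, not part of the answer above.

```python
import re

line = "foo;bar;123;foo;123;"

# Consuming version from the answer above: the first match runs from the
# first "foo" to the second "foo", swallowing the first 123 on the way.
consuming = re.findall(r"(foo|bar|123).*\1", line)

# Lookahead variant (an assumption, not from the answer): the repeat is
# only checked, not consumed, so each value that occurs again is reported.
lookahead = re.findall(r"(foo|bar|123)(?=.*\1)", line)

print(consuming)
print(lookahead)
```

With the sample line, the consuming form reports only foo, while the lookahead form reports foo and 123. Note that the lookahead form reports a value once for every occurrence that has a later repeat, so a value present three times shows up twice.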

Regex for unique user count

I'm trying to create a regex to check the number of unique users.
In this case, 3 different users in 1 string means it's valid.
Let's say we have the following string
lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven
It contains the domain for each user (lab) and their first name.
Each user is separated by ;
The goal is to have 3 unique users in a string.
In this case, the string is valid because we have the following unique users
simon, lieven, tim, davy = valid
If we take this string
lab\simon;lab\lieven;lab\simon
It's invalid because we only have 2 unique users
simon, lieven = invalid
So far, I've only come up with the following regex but I don't know how to continue
/(lab)\\(?:[a-zA-Z]*)/g
Could you help me with this regex?
Please let me know if you need more information if it's not clear.
What you are after cannot be achieved through regular expressions on their own. Regular expressions are meant for parsing information, not for processing it.
There is no particular pattern you are after, which is what regular expressions excel at. You will need to split by ; and use a data structure such as a set to store your string values.
Is this what you want:
1) Using regular expression:
import re
s = r'lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven'
# [A-Za-z] rather than [A-z], which would also match [, \, ], ^, _ and `
pattern = re.compile(r'lab\\([A-Za-z]+)')
users = pattern.findall(s)
# valid when at least 3 distinct user names are present
if len(set(users)) >= 3:
    print('Valid')
else:
    print('Invalid')
2) Without using regular expression:
s = r'lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven'
# keep only the part after the last backslash of each entry
users = [i.split('\\')[-1] for i in s.split(';')]
# valid when at least 3 distinct user names are present
if len(set(users)) >= 3:
    print('Valid')
else:
    print('Invalid')
In order to have a successful match, we need at least 3 sets of lab\user, i.e:
(?:\\?lab\\[\w]+(?:;|$)){3}
You didn't specify your engine, but with Python you can use:
import re
string = r'lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven'
if re.search(r"(?:\\?lab\\[\w]+(?:;|$)){3}", string):
    print('Successful match')
else:
    print('Match attempt failed')
Regex Demo
Regex Explanation
(?:\\?lab\\[\w]+(?:;|$)){3}
Match the regular expression «(?:\\?lab\\[\w]+(?:;|$)){3}»
Exactly 3 times «{3}»
Match the backslash character «\\?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the character string “lab” literally «lab»
Match the backslash character «\\»
Match a single character that is a “word character” «[\w]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regular expression below «(?:;|$)»
Match this alternative «;»
Match the character “;” literally «;»
Or match this alternative «$»
Assert position at the end of a line «$»
Here is a beginner-friendly way to solve your problem:
You should .split() the string per each "lab" section and declare the result as the array variable, like splitted_string.
Declare a second empty array to save each unique name, like unique_names.
Use a for loop to iterate through the splitted_string array. Check for unique names: if it isn't in your array of unique_names, add the name to unique_names.
Find the length of your array of unique_names to see if it is equal to 3. If yes, print that it is. If not, then print a fail message.
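The steps above can be sketched as follows in Python. This is only an illustration under a few assumptions: the string is split on ";" rather than on "lab", the user name is taken to be the text after the last backslash, and "3 unique users" is read as "at least 3", since the question calls its four-user example valid. The function names are hypothetical.

```python
def count_unique_users(line):
    """Split the line, collect names not seen before, and count them."""
    splitted_string = line.split(";")
    unique_names = []
    for entry in splitted_string:
        name = entry.split("\\")[-1]   # text after the last backslash
        if name and name not in unique_names:
            unique_names.append(name)
    return len(unique_names)

def is_valid(line, required=3):
    # Assumption: valid means at least `required` distinct user names.
    return count_unique_users(line) >= required

print(is_valid(r"lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven"))
print(is_valid(r"lab\simon;lab\lieven;lab\simon"))
```

The first call should report the string as valid (simon, lieven, tim, davy) and the second as invalid (only simon and lieven).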
You seem like a practical person that is relatively new to string manipulation. Maybe you would enjoy some practical background reading on string manipulation at beginner sites like Automate The Boring Stuff With Python:
https://automatetheboringstuff.com/chapter6/
Or Codecademy, etc.
Another pure regex answer for the sport. As others said, you should probably not be doing this.
^([^;]+)(;\1)*;((?!\1)[^;]+)(;(\1|\3))*;((?!\1|\3)[^;]+)
Explanation :
^ from the start of the string
([^;]+) we catch everything that isn't a ';'.
that's our first user, and our first capturing group
(;\1)* it could be repeated
;((?!\1)[^;]+) but at some point, we want to capture everything that is neither
our first user nor a ';'. That's our second user,
and our third capturing group
(;(\1|\3))* both the first and second user can be repeated now
;((?!\1|\3)[^;]+) but at some point, we want to capture yada yada,
our third user and sixth capturing group
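To see which groups end up holding the three users, here is a minimal sketch in Python, assuming the standard re module; first_three_users is a hypothetical helper name. The three distinct entries land in groups 1, 3 and 6, because the repeated (;\1) and (;(\1|\3)) parts take up groups 2, 4 and 5.

```python
import re

three_users = re.compile(
    r"^([^;]+)(;\1)*;((?!\1)[^;]+)(;(\1|\3))*;((?!\1|\3)[^;]+)"
)

def first_three_users(line):
    """Return the first three distinct entries, or None if there are fewer."""
    match = three_users.match(line)
    if match is None:
        return None
    return match.group(1), match.group(3), match.group(6)

print(first_three_users(r"lab\simon;lab\lieven;lab\tim;\lab\davy;lab\lieven"))
print(first_three_users(r"lab\simon;lab\lieven;lab\simon"))
```

One caveat of the lookaheads: (?!\1) only checks that the next entry does not start with the first one, so an entry such as lab\simonx would be rejected as a "repeat" of lab\simon.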
This can be done with a simple regex.
Uses a conditional for each user name slot so that the required
three names are obtained.
Note that since the three slots are in a loop, the conditional guarantees the
capture group is not overwritten (which would invalidate the assertion test
(?! \1 | \2 | \3 ) mentioned below).
There is a complication. Each user name uses the same regex [a-zA-Z]+
so to accommodate that, a function is defined to check that the slot
has not been matched before.
This uses the Boost engine, which cosmetically requires that a group be
defined before it is back-referenced.
The workaround is to define a function at the bottom, after the groups are defined.
In Perl (and some other engines) it is not required to define a group ahead
of time before it is back-referenced, so you could do away with the function
and put
(?! \1 | \2 | \3 ) # Cannot have seen this user
[a-zA-Z]+
in the capture groups on top.
At a minimum, this requires conditionals.
Formatted and tested:
# (?:(?:.*?\blab\\(?:((?(1)(?!))(?&GetUser))|((?(2)(?!))(?&GetUser))|((?(3)(?!))(?&GetUser))))){3}(?(DEFINE)(?<GetUser>(?!\1|\2|\3)[a-zA-Z]+))
# Look for 3 unique users
(?:
(?:
.*?
\b lab \\
(?:
( # (1), User 1
(?(1) (?!) )
(?&GetUser)
)
| ( # (2), User 2
(?(2) (?!) )
(?&GetUser)
)
| ( # (3), User 3
(?(3) (?!) )
(?&GetUser)
)
)
)
){3}
(?(DEFINE)
(?<GetUser> # (4)
(?! \1 | \2 | \3 ) # Cannot have seen this user
[a-zA-Z]+
)
)

Can a Regex Return the Number of the Line where the Match is Found?

In a text editor, I want to replace a given word with the number of the line on which this word is found. Is this possible with Regex?
Recursion, Self-Referencing Group (Qtax trick), Reverse Qtax or Balancing Groups
Introduction
The idea of adding a list of integers to the bottom of the input is similar to a famous database hack (nothing to do with regex) where one joins to a table of integers. My original answer used the Qtax trick. The current answers use either Recursion, the Qtax trick (straight or in a reversed variation), or Balancing Groups.
Yes, it is possible... With some caveats and regex trickery.
The solutions in this answer are meant as a vehicle to demonstrate some regex syntax more than practical answers to be implemented.
At the end of your file, we will paste a list of numbers preceded with a unique delimiter. For this experiment, the appended string is :1:2:3:4:5:6:7. This is a similar technique to a famous database hack that uses a table of integers.
For the first two solutions, we need an editor that uses a regex flavor that allows recursion (solution 1) or self-referencing capture groups (solution 2 and its reversed variation). Two come to mind: Notepad++ and EditPad Pro. For the third solution, we need an editor that supports balancing groups. That probably limits us to EditPad Pro or Visual Studio 2013+.
Input file:
Let's say we are searching for pig and want to replace it with the line number.
We'll use this as input:
my cat
dog
my pig
my cow
my mouse
:1:2:3:4:5:6:7
First Solution: Recursion
Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested).
The recursive structure lives in a lookahead, and is optional. Its job is to balance lines that don't contain pig, on the left, with numbers, on the right: think of it as balancing a nested construct like {{{ }}}... Except that on the left we have the no-match lines, and on the right we have the numbers. The point is that when we exit the lookahead, we know how many lines were skipped.
Search:
(?sm)(?=.*?pig)(?=((?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?:(?1)|[^:]+)(:\d+))?).*?\Kpig(?=.*?(?(2)\2):(\d+))
Free-Spacing Version with Comments:
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # fail right away if pig isn't there
(?= # The Recursive Structure Lives In This Lookahead
( # Group 1
(?: # skip one line
^
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
(?:(?1)|[^:]+) # recurse Group 1 OR match all chars that are not a :
(:\d+) # match digits
)? # End Group
) # End lookahead.
.*?\Kpig # get to pig
(?=.*?(?(2)\2):(\d+)) # Lookahead: capture the next digits
Replace: \3
In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.
Second Solution: Group that Refers to Itself ("Qtax Trick")
Supported languages: Apart from the text editors mentioned above (Notepad++ and EditPad Pro), this solution should work in languages that use PCRE (PHP, R, Delphi), in Perl, and in Python using Matthew Barnett's regex module (untested). The solution is easy to adapt to .NET by converting the \K to a lookahead and the possessive quantifier to an atomic group (see the .NET Version a few lines below.)
Search:
(?sm)(?=.*?pig)(?:(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*+.*?\Kpig(?=[^:]+(?(1)\1):(\d+))
.NET version: Back to the Future
.NET does not have \K. In its place, we use a "back to the future" lookbehind (a lookbehind that contains a lookahead that skips ahead of the match). Also, we need to use an atomic group instead of a possessive quantifier.
(?sm)(?<=(?=.*?pig)(?=(?>(?:^(?:(?!pig)[^\r\n])*(?:\r?\n))(?=[^:]+((?(1)\1):\d+)))*).*)pig(?=[^:]+(?(1)\1):(\d+))
Free-Spacing Version with Comments (Perl / PCRE Version):
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # lookahead: if pig is not there, fail right away to save the effort
(?: # start counter-line-skipper (lines that don't include pig)
(?: # skip one line
^ #
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
# for each line skipped, let Group 1 match an ever increasing portion of the numbers string at the bottom
(?= # lookahead
[^:]+ # skip all chars that are not colons
( # start Group 1
(?(1)\1) # match Group 1 if set
:\d+ # match a colon and some digits
) # end Group 1
) # end lookahead
)*+ # end counter-line-skipper: zero or more times
.*? # match
\K # drop everything we've matched so far
pig # match pig (this is the match!)
(?=[^:]+(?(1)\1):(\d+)) # capture the next number to Group 2
Replace:
\2
Output:
my cat
dog
my 3
my cow
my mouse
:1:2:3:4:5:6:7
In the demo, see the substitutions at the bottom. You can play with the letters on the first two lines (delete a space to make pig) to move the first occurrence of pig to a different line, and see how that affects the results.
Choice of Delimiter for Digits
In our example, the delimiter : for the string of digits is rather common, and could happen elsewhere. We can invent a UNIQUE_DELIMITER and tweak the expression slightly. But the following optimization is even more efficient and lets us keep the :
Optimization on Second Solution: Reverse String of Digits
Instead of pasting our digits in order, it may be to our benefit to use them in the reverse order: :7:6:5:4:3:2:1
In our lookaheads, this allows us to get down to the bottom of the input with a simple .*, and to start backtracking from there. Since we know we're at the end of the string, we don't have to worry about the :digits being part of another section of the string. Here's how to do it.
Input:
my cat pi g
dog p ig
my pig
my cow
my mouse
:7:6:5:4:3:2:1
Search:
(?xsm) # free-spacing mode, multi-line
(?=.*?pig) # lookahead: if pig is not there, fail right away to save the effort
(?: # start counter-line-skipper (lines that don't include pig)
(?: # skip one line that doesn't have pig
^ #
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
)
# Group 1 matches increasing portion of the numbers string at the bottom
(?= # lookahead
.* # get to the end of the input
( # start Group 1
:\d+ # match a colon and some digits
(?(1)\1) # match Group 1 if set
) # end Group 1
) # end lookahead
)*+ # end counter-line-skipper: zero or more times
.*? # match
\K # drop match so far
pig # match pig (this is the match!)
(?=.*(\d+)(?(1)\1)) # capture the next number to Group 2
Replace: \2
See the substitutions in the demo.
Third Solution: Balancing Groups
This solution is specific to .NET.
Search:
(?m)(?<=\A(?<c>^(?:(?!pig)[^\r\n])*(?:\r?\n))*.*?)pig(?=[^:]+(?(c)(?<-c>:\d+)*):(\d+))
Free-Spacing Version with Comments:
(?xm) # free-spacing, multi-line
(?<= # lookbehind
\A #
(?<c> # skip one line that doesn't have pig
# The length of Group c Captures will serve as a counter
^ # beginning of line
(?:(?!pig)[^\r\n])* # zero or more chars not followed by pig
(?:\r?\n) # newline chars
) # end skipper
* # repeat skipper
.*? # we're on the pig line: lazily match chars before pig
) # end lookbehind
pig # match pig: this is the match
(?= # lookahead
[^:]+ # get to the digits
(?(c) # if Group c has been set
(?<-c>:\d+) # decrement c while we match a group of digits
* # repeat: this will only repeat as long as the length of Group c captures > 0
) # end if Group c has been set
:(\d+) # Match the next digit group, capture the digits
) # end lookahead
Replace: $1
Reference
Qtax trick
On Which Line Number Was the Regex Match Found?
Because you didn't specify which text editor, in vim it would be:
:%s/searched_word/\=printf('%-4d', line('.'))/g (read more)
But as somebody mentioned it's not a question for SO but rather Super User ;)
I don't know of an editor that does that short of extending an editor that allows arbitrary extensions.
You could easily use perl to do the task, though.
perl -i.bak -pe"s/word/$./eg" file
Or if you want to use wildcards,
perl -MFile::DosGlob=glob -i.bak -pe"BEGIN { @ARGV = map glob($_), @ARGV } s/word/$./eg" *.txt
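If Perl is not at hand, the same idea (let the program count lines, let the regex find the word) can be sketched in Python. This is a minimal illustration, assuming the standard re module; replace_with_line_number is a hypothetical helper name, and the sample text is the input from the answer above without the appended number list.

```python
import re

def replace_with_line_number(text, word):
    """Replace every occurrence of `word` with its 1-based line number."""
    out = []
    for number, line in enumerate(text.split("\n"), start=1):
        # re.escape keeps any special characters in `word` literal
        out.append(re.sub(re.escape(word), str(number), line))
    return "\n".join(out)

sample = "my cat\ndog\nmy pig\nmy cow\nmy mouse"
result = replace_with_line_number(sample, "pig")
print(result)
```

For the sample, the third line becomes "my 3" and the other lines are left unchanged.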

How to find 2 or more consecutive words which are in caps whereas remaining words are non-caps

I am trying to solve this problem using RegExp. I am sure this could be easily solved in Java and many other languages. But I want to use this example to learn more about RegExp.
For below 4 input sentences:
1. Abc Abcabc 123,00 test ABCDTEST XYZTEST XY
2. aBC Abcabc 24DD test ABCDTEST XYZTEST XY test is test
3. ABC Abcabc test ABCDTEST XYZTEST
4. ABC ABCABC TEST ABCDTEST XYZTEST
I want matching term as:
1. ABCDTEST XYZTEST XY
2. ABCDTEST XYZTEST XY
3. ABCDTEST XYZTEST (only two in end satisfies condition)
4. (no match, because all of them are in caps)
It would be helpful to get the start offset and end offset of the matching term.
For simplicity, let's assume only one match will be present,
i.e. there won't be any input like this
5. Abc Abcabc 123,00 test ABCDTEST XYZTEST XY agab WXYZ ABCDE
But, extra credit if you can solve this too.
Here is how my initial regex looks like (which is wrong)
(([A-Z]+){2}){2}
If there won't be two matching patterns in one line:
^(?=.*[a-z]).*?(\b[A-Z]+(?:\h+[A-Z]+\b)+)
will store the result in the first captured group. If your string is multiline and you want to consider it line by line, use the g (don't stop at first match) and m (multiline) flags.
Demo: http://regex101.com/r/qC2wF9
Explanation
^(?=.*[a-z]): checks, from the beginning of the line, that there is at least one lowercase letter.
(\b[A-Z]+(?:\h+[A-Z]+\b)+):
\b[A-Z]+: checks that there is an all-caps word...
\h+[A-Z]+\b: ...separated by at least a space (\h is short for horizontal space, ie whitespaces, tabs... but no newline) from another all-caps word...
(?:\h+[A-Z]+\b)+: ...possibly followed by other all-caps words ((?: ) is a non-capturing group)
Warning
The \b will allow stuff like abc-ABD ABD. If there is a risk of that happening, you can replace the regex with:
^(?=.*[a-z]).*?((?:^|\h+)[A-Z]+(?:\h+[A-Z]+(?=\h+|$))+)
Improvement
This is by no means pretty, nor does it solve the "two matches in one line" problem. Feel free to comment!
\b([A-Z]+(?:\s+[A-Z]+)+)\b
will match 2 or more all caps words
So, it would handle your cases 1-3, but not 4
I can't think how to do (4) yet - but I'll think about it some more
Does this work for you?
[a-z]+[^A-Z]*\s([A-Z]+\s[A-Z\s]+)
http://regex101.com/r/lR3nN5
This may work, but it might be a lot to grasp if you are just learning.
Regex:
(?!^[^\S\n]*(?:[A-Z]+[^\S\n]*)*$)^.*?(?:^|(?<=[^\S\n]))([A-Z]+(?:[^\S\n]+[A-Z]+){1,})(?=[^\S\n]|$)
Explained:
# Modifier: multi-line mode '(?m)'
(?! # Ensure this is not a line of all caps (via assertion)
^ # Beginning of line
[^\S\n]*
(?: [A-Z]+ [^\S\n]* )*
$ # End of line
)
^ # Beginning of line. Ok, this is a good candidate, check it
.*? # Slowly, creep up on it
(?: # Here, the candidate must be qualified (via assertion)
^ # Either start of the line
| # or
(?<= [^\S\n] ) # A non-newline whitespace separator before us
)
( # (1 start), Capture our candidate
[A-Z]+ # First of all caps
(?: [^\S\n]+ [A-Z]+ ){1,} # Second to more all caps
) # (1 end)
(?= [^\S\n] | $ ) # Found them, but have to qualify (via assertion)
# there is a valid separator after us,
# either non-newline whitespace or End of line
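Since the question also asks for start and end offsets, here is a minimal sketch in Python that runs the regex above through the standard re module and reads the offsets from the match object. find_caps_run is a hypothetical helper name, and the sentences are the question's inputs with the leading "1." style numbering removed, which is an assumption about how the input really looks.

```python
import re

caps_run = re.compile(
    r"(?!^[^\S\n]*(?:[A-Z]+[^\S\n]*)*$)"   # not a line made only of caps words
    r"^.*?(?:^|(?<=[^\S\n]))"              # creep up to the start of a word
    r"([A-Z]+(?:[^\S\n]+[A-Z]+){1,})"      # two or more all-caps words
    r"(?=[^\S\n]|$)",                      # followed by whitespace or line end
    re.MULTILINE,
)

def find_caps_run(sentence):
    """Return (text, start, end) for the first run, or None if there is none."""
    match = caps_run.search(sentence)
    if match is None:
        return None
    return match.group(1), match.start(1), match.end(1)

sentences = [
    "Abc Abcabc 123,00 test ABCDTEST XYZTEST XY",
    "aBC Abcabc 24DD test ABCDTEST XYZTEST XY test is test",
    "ABC Abcabc test ABCDTEST XYZTEST",
    "ABC ABCABC TEST ABCDTEST XYZTEST",
]
results = [find_caps_run(sentence) for sentence in sentences]
for sentence, found in zip(sentences, results):
    print(sentence, "->", found)
```

The offsets follow Python's slicing convention, so sentence[start:end] gives back the matched term; the all-caps fourth sentence yields None.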