How to exclude a specific string with REGEX? (Perl) - regex

For example, I have these strings
APPLEJUCE1A
APPLETREE2B
APPLECAKE3C
APPLETEA1B
APPLEWINE3B
APPLEWINE1C
I want all of these strings except those that have TEA or WINE1C in them.
APPLEJUCE1A
APPLETREE2B
APPLECAKE3C
APPLEWINE3B
I've already tried the following, but it didn't work:
^APPLE(?!.*(?:TEA|WINE1C)).*$
Any help is appreciated as I'm also kinda new to this.

If you do have multiple strings, as you say, there's no need to cram all of that into one regex pattern.
/^APPLE/ && !/TEA|WINE1C/
If you have a single string, the best approach is probably to split it into lines (split /\n/), but you could also use a single regex match. Note that the alternation must be grouped so that .* applies to both alternatives:
/^APPLE(?!.*(?:TEA|WINE1C)).*/mg
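Both variants can be sketched as follows. The sketch is in Python rather than Perl, purely for illustration; the variable names are made up, and re.MULTILINE plays the role of Perl's /m flag.

```python
import re

strings = [
    "APPLEJUCE1A", "APPLETREE2B", "APPLECAKE3C",
    "APPLETEA1B", "APPLEWINE3B", "APPLEWINE1C",
]

# Separate strings: two simple tests instead of one complicated pattern.
kept = [
    s for s in strings
    if re.match(r"APPLE", s) and not re.search(r"TEA|WINE1C", s)
]

# One multi-line string: re.MULTILINE lets ^ match at the start of every line.
text = "\n".join(strings)
kept_multiline = re.findall(
    r"^APPLE(?!.*(?:TEA|WINE1C)).*", text, flags=re.MULTILINE
)

print(kept)
print(kept_multiline)
```

Without the multiline flag, ^ only matches at the very start of the input, which is one plausible reason the pattern in the question appeared not to work on multi-line input.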

You can use
^APPLE(?!.*TEA)(?!.*WINE1C).*
Details:
^ - start of string
APPLE - a fixed string
(?!.*TEA) - no TEA allowed anywhere to the right of the current location
(?!.*WINE1C) - no WINE1C allowed anywhere to the right of the current location
.* - any zero or more chars other than line break chars as many as possible.
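Because each lookahead is zero-width, several of them can be stacked at the same position, one per forbidden substring. A minimal Python sketch of that idea follows; build_exclusion_pattern is a hypothetical helper name, not something from the answer.

```python
import re

def build_exclusion_pattern(prefix, banned):
    """Build ^PREFIX(?!.*A)(?!.*B)....* from a prefix and a list of banned substrings."""
    # One negative lookahead per banned substring; all are checked right after the prefix.
    lookaheads = "".join(f"(?!.*{re.escape(word)})" for word in banned)
    return re.compile(f"^{re.escape(prefix)}{lookaheads}.*")

pattern = build_exclusion_pattern("APPLE", ["TEA", "WINE1C"])

samples = [
    "APPLEJUCE1A", "APPLETREE2B", "APPLECAKE3C",
    "APPLETEA1B", "APPLEWINE3B", "APPLEWINE1C",
]
allowed = [s for s in samples if pattern.search(s)]
print(pattern.pattern)
print(allowed)
```

Adding a third forbidden substring then only means extending the list, at the cost of one more scan of the line per lookahead.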

If you only want to exclude strings that contain both of them (a case that does not occur in the current example data):
^APPLE(?!.*(WINE1C|TEA).*(?!\1)(?:TEA|WINE1C)).*
Explanation
^ Start of string
APPLE match literally
(?! Negative lookahead
.*(WINE1C|TEA) Capture either one of the values in group 1
.* Match 0+ characters
(?!\1)(?:TEA|WINE1C) Match either one of the values as long as it is not the same as previously matched in group 1
) Close the lookahead
.* Match the rest of the line
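To see what this pattern accepts and rejects, here is a small Python check. The sample strings containing both TEA and WINE1C are invented for the test, since the question's data has no such case.

```python
import re

# Rejects a line only when BOTH substrings occur, in either order.
both = re.compile(r"^APPLE(?!.*(WINE1C|TEA).*(?!\1)(?:TEA|WINE1C)).*")

samples = [
    "APPLETEA1B",       # only TEA: still matched
    "APPLEWINE1C",      # only WINE1C: still matched
    "APPLETEAWINE1C",   # both: rejected
    "APPLEWINE1CTEA",   # both, other order: rejected
    "APPLECAKE3C",      # neither: matched
]
kept_unless_both = [s for s in samples if both.search(s)]
print(kept_unless_both)
```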

Related

How to match characters between two occurrences of the same but random string

Base string looks like:
repeatedRandomStr ABCXYZ /an/arbitrary/##-~/sequence/of_characters=I+WANT+TO+MATCH/repeatedRandomStr/the/rest/of/strings.etc
The things I know about this base string are:
ABCXYZ is constant and always present.
repeatedRandomStr is random, but its first occurrence is always at the beginning and before ABCXYZ
So far I looked at regex context matching, recursion and subroutines but couldn't come up with a solution myself.
My currently working solution is to first determine what repeatedRandomStr is with:
^(.*)\sABCXYZ
and then use:
repeatedRandomStr\sABCXYZ\s(.*)\srepeatedRandomStr
to match what I want in $1. But this requires two separate regex queries. I want to know if this can be done in a single execution.
In Go, where the RE2 library is used, there is no way other than yours: first extract the value before ABCXYZ, then use a second regex to match the text between the two occurrences, because RE2 does not and won't support backreferences.
If the regex flavor can be switched to PCRE or a compatible one, you can use either of these (Group 2 greedy or lazy):
^(.*?)\s+ABCXYZ\s(.*)\1
^(.*?)\s+ABCXYZ\s(.*?)\1
Details:
^ - start of string
(.*?) - Group 1: zero or more chars other than line break chars as few as possible
\s+ - one or more whitespaces
ABCXYZ - some constant string
\s - a whitespace
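(.*) - Group 2: zero or more chars other than line break chars as many as possible (the second pattern uses (.*?) instead, which matches as few as possible and stops at the first repetition of Group 1)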
\1 - the same value as in Group 1.
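A quick way to try the backreference approach is Python's re module, which supports \1 (unlike RE2). This is only a sketch on the sample string from the question; the variable names are illustrative.

```python
import re

base = (
    "repeatedRandomStr ABCXYZ /an/arbitrary/##-~/sequence/"
    "of_characters=I+WANT+TO+MATCH/repeatedRandomStr/the/rest/of/strings.etc"
)

# Group 1 captures the random prefix; \1 requires the same text to appear again.
m = re.search(r"^(.*?)\s+ABCXYZ\s(.*?)\1", base)
if m is None:
    raise ValueError("pattern did not match")

repeated = m.group(1)
between = m.group(2)
print(repeated)
print(between)
```

Note that Group 2 ends right before the second occurrence of the random string, so here it includes the trailing slash.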

Match a part of a string using regex

I have a string and would like to match a part of it.
The string is Accept: multipart/mixedPrivacy: nonePAI: <sip:4168755400#1.1.1.238>From: <sip:4168755400#1.1.1.238>;tag=5430960946837208_c1b08.2.3.1602135087396.0_1237422_3895152To: <sip:4168755400#1.1.1.238>
I want to match PAI: <sip:4168755400#
The whitespace can be a word, so I would like to use .* there, but if I do, it matches most of the string.
This is what I'm matching if I use the whitespace instead of .*:
(PAI: <sip:)((?:\([2-9]\d{2}\)\ ?|[2-9]\d{2}(?:\-?|\ ?))[2-9]\d{2}[- ]?\d{4})#
This is what I'm trying to achieve with .*, but it should only match PAI: <sip:4168755400#
(PAI:.*<sip:)((?:\([2-9]\d{2}\)\ ?|[2-9]\d{2}(?:\-?|\ ?))[2-9]\d{2}[- ]?\d{4})#
I tried lookarounds, but without success.
Any idea?
thanks
Instead of matching the single space, you can use a character class that matches either a space or a word character, repeated 1 or more times so that at least one occurrence is required.
Note that you don't have to escape the spaces, and in both places you can use an optional character class matching either a space or a hyphen: [ -]?
If you only want the match, you can omit the 2 capturing groups.
(PAI:[ \w]+<sip:)((?:\([2-9]\d{2}\) ?|[2-9]\d{2}[ -]?)[2-9]\d{2}[- ]?\d{4})#
The regex should be like
PAI:.*?(<sip:.*?#)
Explanation:
PAI:.*? finds the literal PAI: followed by any characters (.*); the ? makes the quantifier lazy, so it matches as few characters as possible before the next part of the pattern.
(<sip:.*?#) is the capturing group that holds the result we want.
<sip:.*?# finds <sip: followed by as few characters as possible up to the first #.
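The difference between the lazy and the greedy quantifier is easy to see by running both on the string from the question. A minimal Python sketch (the variable names are illustrative):

```python
import re

header = (
    "Accept: multipart/mixedPrivacy: nonePAI: <sip:4168755400#1.1.1.238>"
    "From: <sip:4168755400#1.1.1.238>;"
    "tag=5430960946837208_c1b08.2.3.1602135087396.0_1237422_3895152"
    "To: <sip:4168755400#1.1.1.238>"
)

# Lazy quantifiers stop at the first <sip: and the first # after PAI:.
lazy = re.search(r"PAI:.*?(<sip:.*?#)", header)

# Greedy quantifiers run on to the last <sip: ... # in the string.
greedy = re.search(r"PAI:.*(<sip:.*#)", header)

print(lazy.group(0))
print(lazy.group(1))
print(len(lazy.group(0)), len(greedy.group(0)))
```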

Regex to ignore Cobol comment line

I'd like to use a regex to scan a few Cobol files for a specific word while skipping comment lines. Cobol comments have an asterisk in column 7. The regex I've come up with so far, using a negative lookbehind, looks like this:
^(?<!.{6}\*).+?COPY
It matches both lines:
* COPY
COPY
I would assume that .+? somehow overrides the negative lookbehind, but I'm stuck on how to correct this. What would I need to change to get a regex that only matches the second line?
You may use a lookahead instead of a lookbehind:
^(?!.{6}\*).+?COPY
The lookbehind required some pattern to be absent before the start of the string and was therefore redundant: it always returned true. Lookaheads check for a pattern to the right of the current location.
So,
^ - matches the start of the string
(?!.{6}\*) - fails the match if there are any 6 chars followed by * from the start of the string (replace . with a space if you need to match just spaces)
.+? - matches any 1+ chars, as few as possible, up to the first occurrence of
COPY - the literal COPY substring.
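The behaviour of both variants can be compared directly. The sketch below uses Python with re.MULTILINE so that ^ matches at each line start; the two sample lines are the ones from the question, assuming six spaces before the asterisk.

```python
import re

source = "\n".join([
    "      * COPY",   # asterisk in column 7: comment line
    "       COPY",    # regular line
])

# Lookahead: inspects the 7 characters AFTER the line start.
lookahead_hits = re.findall(r"^(?!.{6}\*).+?COPY", source, flags=re.MULTILINE)

# Lookbehind: inspects the 7 characters BEFORE the line start, so it never
# sees the asterisk on the current line and lets both lines through.
lookbehind_hits = re.findall(r"^(?<!.{6}\*).+?COPY", source, flags=re.MULTILINE)

print(lookahead_hits)
print(lookbehind_hits)
```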
If you want to filter out EVERY comment you could use:
^ {6}(?!\*)
That will match only lines that start with six spaces and do NOT have an '*' in the 7th position.
COBOL can use positions 1-6 for line numbers, so it may be safer to just use:
^.{6}(?!\*).*$
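As a rough illustration, the last pattern can be used as a line filter before searching for the word. The sample lines, including the sequence numbers in columns 1-6, are invented for this sketch.

```python
import re

lines = [
    "000100* THIS COMMENT MENTIONS COPY",
    "000200     COPY CUSTOMER-RECORD.",
    "      * COPY",
    "       COPY",
]

# Any 6 characters in columns 1-6, then anything except '*' in column 7.
non_comment = re.compile(r"^.{6}(?!\*).*$")

code_lines = [line for line in lines if non_comment.search(line)]
copy_lines = [line for line in code_lines if "COPY" in line]
print(copy_lines)
```

Lines shorter than six characters do not match this pattern at all, so they are dropped along with the comments.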

Regex extracting string after last hyphen and spaces

Which regex needs to be used to extract 'Manchester City' from string.
String is:
Aston Villa - Manchester City
I tried -(.*)\w|-(.), but it grabs - .
Note that -(.*)\w|-(.) matches - since both the alternatives here start with matching a hyphen. You can usually check if something is present or not with a lookaround.
However, in this case, I'd suggest
-\s*\K[^-]+$
Since you only need to match the substring after the last -, with the spaces trimmed off, you need something like a positive infinite-width lookbehind, (?<=-\s*). However, PCRE does not support infinite-width lookbehinds. Instead, there is the \K operator, which makes the engine drop everything matched so far by the current pattern from the reported match.
Breakdown:
- - a literal hyphen
\s* - zero or more whitespace characters
\K - operator that resets the start of the reported match, discarding everything matched so far
[^-]+ - one or more characters other than - up to ...
$ - the end of the string.
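Not every engine has \K. Python's built-in re module, for instance, does not support it, so a capture group around the wanted part is the usual substitute there. A minimal sketch of that substitution (not the PCRE pattern itself):

```python
import re

text = "Aston Villa - Manchester City"

# Same structure as -\s*\K[^-]+$, with a capture group standing in for \K.
m = re.search(r"-\s*([^-]+)$", text)
team = m.group(1) if m else None
print(team)
```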
The simplest is .*- (.*) and your data is in $1 or \1 or something else, depending on your tool. That assumes the data is in the format xxxxx-xxxxxx
Another simple option is - (.*) see: https://regex101.com/r/fY3oE7/1. Use the first capturing group in your language to get the part after the dash.

Regular expression to match non-integer values in a string

I want to match the following rules:
One dash is allowed at the start of a number.
Only values between 0 and 9 should be allowed.
I currently have the following regex pattern. I'm matching the inverse so that I can throw an exception upon finding a match that doesn't follow the rules:
[^-0-9]
The downside to this pattern is that it works for all cases except a hyphen in the middle of the String will still pass. For example:
"-2304923" is allowed correctly but "9234-342" is also allowed and shouldn't be.
Please let me know what I can do to specify the first character as [^-0-9] and the rest as [^0-9]. Thanks!
This regex will work for you:
^-?\d+$
Explanation: ^ anchors the start of the string, then comes an optional - (the ? makes it optional), then a digit \d repeated one or more times (+), and the string must end there ($).
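In Python, the same check could look like the sketch below. It uses fullmatch instead of ^ and $, and [0-9] instead of \d, because in Python $ also matches before a trailing newline and \d also accepts non-ASCII digits. The function name is illustrative.

```python
import re

INTEGER = re.compile(r"-?[0-9]+")

def is_integer(text):
    """Return True if text is an optional leading hyphen followed only by digits."""
    # fullmatch requires the whole string to match, like ^...$ without the
    # trailing-newline exception.
    return INTEGER.fullmatch(text) is not None

print(is_integer("-2304923"))
print(is_integer("9234-342"))
```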
You can do this:
(?:^|\s)(-?\d+)(?=["'\s]|$)
^^^^^^^^                      non-capturing group for start of line or space
        ^^^^^^^               captures the number
               ^^^^^^^^^^^^   lookahead for end of line, space or quote
This will capture all strings of digits in a line with an optional hyphen in front. The trailing delimiter is checked with a lookahead instead of being consumed, so the single space between two numbers can also serve as the leading delimiter of the next number.
-2304923" "9234-342" 1234 -1234
++++++++                          captured
           ^^^^^^^^               NOT captured
                     ++++         captured
                          +++++   captured
I don't understand how your pattern [^-0-9] is matching the strings you mention. That pattern is the opposite of what you want: the caret (^) at the beginning negates the character class, so it matches any character except a hyphen or a digit.
Anyway, for your requirement, you first need to allow a single optional hyphen at the beginning, so keep it outside the character class and make it optional with ?. Then, to match the digits that follow, you can use [0-9]+ or \d+.
So, your pattern to match the required format should be:
-?[0-9]+ // or -?\d+
The above regex finds the pattern inside a larger string. If you want the entire string to match this pattern, add anchors at both ends of the regex:
^-?[0-9]+$
For a regular expression like this, it's sometimes helpful to think of it in terms of two cases.
Is the first character messed up somehow?
If not, are any of the other characters messed up somehow?
Combine these with |
(^[^-0-9]|^.+?[^0-9])
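Used the way the question describes (match the invalid case, then throw), this could be sketched in Python as follows. The check function and the choice of ValueError are assumptions made for the example.

```python
import re

# Matches when the first character is bad, or when any later character is not a digit.
INVALID = re.compile(r"(^[^-0-9]|^.+?[^0-9])")

def check(value):
    """Raise ValueError if value breaks the rules, otherwise return it unchanged."""
    if INVALID.search(value):
        raise ValueError(f"not an integer: {value!r}")
    return value

print(check("-2304923"))
try:
    check("9234-342")
except ValueError as err:
    print(err)
```

One caveat of the inverse approach: an empty string and a lone "-" contain no offending character, so this pattern does not flag them and they need a separate check.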