Regex pattern that matches only if not included in another pattern - regex

I am trying to understand how to write a regular expression that matches a pattern only if that pattern is not contained in another one.
In the following example, I want to match dashes only if they are not inside a [code][/code] tag.
---------
[code]
-------------------------------------------------------------------------------------
Some text
-----------------
Some other text
-------------------------------------------------------------------------------------
test
[/code]
I have searched for explanations about lookahead and lookbehind but cannot understand if and how it could be suitable for what I need.
I wanted to use a combination of negative lookbehind and negative lookahead but it seems that it is not possible to use + or * in negative lookbehind pattern.
So, for example, this won't work (because of the + in the negative lookbehind):
/(?<!\[code\].+?)(-{5,100})(?!.+?\[\/code\])/m
How can I achieve this another way?

One possibility, if the tags are not nested, is to match everything from the opening tag to the closing tag, which consumes what you don't want. Then use an alternation to capture in a group what you do want, in this case a run of 5 to 100 hyphens.
/\[code\](?:(?!\[\/?code\]).)*\[\/code]|(-{5,100})/m
Explanation
\[code\] Match [code]
(?: Non capturing group
(?!\[\/?code\]). Assert that what is directly to the right is not [code] or [/code] (the / after the opening [ is optional), then match any character
)* Close the non-capturing group and repeat it 0+ times
\[\/code] Match [/code]
| Or
(-{5,100}) Capture 5 to 100 hyphens in group 1
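As a rough sketch of how the alternation could be used in practice, the JavaScript below keeps only the group 1 captures and ignores matches that consumed a whole [code] block. The function name dashesOutsideCode and the shortened sample text are made up for illustration. JavaScript needs the s flag for the dot to match newlines; the m flag on the pattern above is assumed to be Ruby's, which has that effect.

```javascript
// Alternation trick: whole [code]...[/code] blocks are consumed first,
// so dash runs are only captured (group 1) when they sit outside a block.
const pattern = /\[code\](?:(?!\[\/?code\]).)*\[\/code\]|(-{5,100})/gs;

// Hypothetical helper: returns every dash run found outside [code] blocks.
function dashesOutsideCode(text) {
  const found = [];
  for (const match of text.matchAll(pattern)) {
    // Group 1 is undefined when the match was a [code] block.
    if (match[1] !== undefined) {
      found.push(match[1]);
    }
  }
  return found;
}

const sample = [
  "---------",
  "[code]",
  "--------------------",
  "Some text",
  "[/code]",
  "------------",
].join("\n");

console.log(dashesOutsideCode(sample));
```

The dashes inside the block are part of the first alternative's match, so they never reach group 1.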

I don't believe a regular expression is the right tool for the job here.
str = <<END
---------
[code]
-------------------------------------------------------------------------------
Some text
----------------------------------
Some other text
-------------------------------------------------------------------------------
test
[/code]
------------
---
[code]
Some text
-------------------------------------------
[/code]
------------
END
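within = false
str.split("\n").select do |line|
  case line
  when "[code]"
    within = true   # entering a code block; drop the tag line itself
    false
  when "[/code]"
    within = false  # leaving the code block; drop the tag line itself
    false
  else
    !within         # keep only lines outside code blocks
  end
end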
#=> ["---------", "------------", "---", "------------"]
I would have used the to-some-beloved flip-flop operator had it not been deprecated.
str.split("\n").reject do |line|
  true if line == "[code]"..line == "[/code]"
end
#=> ["---------", "------------", "---", "------------"]
Hold the phone! It looks like Matz has un-deprecated it! (Scroll to end.)

Related

replaceAll regex to remove last - from the output

I get close to the desired output, but not exactly. I am using replaceAll with a regex; the sample code is below.
final String label = "abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9";
System.out.println(label.replaceAll(
"([^-]+)-([^-]+)-(.+)-([^-]+)-([^-]+)", "$3"));
I want this output:
abc-nyd-request-xyxpt
but I am getting:
abc-nyd-request-xyxpt-
Here is the code: https://ideone.com/UKnepg
You may use this .replaceFirst solution:
String label = "abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9";
label.replaceFirst("(?:[^-]*-){2}(.+?)(?:--1)?-[^-]+$", "$1");
//=> "abc-nyd-request-xyxpt"
RegEx Details:
(?:[^-]*-){2}: Match 2 repetitions of zero or more non-hyphen characters followed by a hyphen
(.+?): Match 1+ of any characters, as few as possible, and capture in group #1
(?:--1)?: Match optional --1
-: Match a -
[^-]+: Match a non-hyphenated string
$: End
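For readers who want to try the pattern outside Java, here is a minimal JavaScript sketch. It assumes that String.prototype.replace with a non-global regex is an acceptable stand-in for replaceFirst (both replace only the first match); the helper name middlePart is invented for this example.

```javascript
// Same pattern as the replaceFirst call above. Without the g flag,
// replace() substitutes only the first match.
const pattern = /(?:[^-]*-){2}(.+?)(?:--1)?-[^-]+$/;

// Hypothetical helper: drops the first two segments and the trailing ones.
function middlePart(label) {
  return label.replace(pattern, "$1");
}

console.log(middlePart("abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9"));
```

Because the match spans the whole label, replacing it with $1 leaves only the captured middle part.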
The following works for your example case
([^-]+)-([^-]+)-(.+[^-])-+([^-]+)-([^-]+)
https://regex101.com/r/VNtryN/1
The third group must end in a non-hyphen character, so it cannot capture any trailing -. The -+ that follows allows more than one hyphen there, which lets it consume the double --.
With your shown samples and attempts, please try the following regex. It creates one capturing group which can be used in the replacement. Use $1 as the replacement in your function.
^(?:.*?-){2}([^-]*(?:-[^-]*){3})--.*
Explanation of the above regex:
^(?:.*?-){2} ##From the start of the value, a non-capturing group lazily matches up to the nearest - and this is repeated 2 times.
([^-]*(?:-[^-]*){3}) ##The first and only capturing group: a run of non-hyphen characters, then 3 repetitions of a - followed by a run of non-hyphen characters, which yields the required output.
--.* ##Matches -- and then everything up to the end of the value.
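The two patterns above can be compared side by side. The snippet below is only a sketch in JavaScript (the variable names are invented here), applied to the sample label from the question, with $3 and $1 as the respective replacements.

```javascript
const label = "abcs-xyzed-abc-nyd-request-xyxpt--1-cnaq9";

// Pattern whose third group must end in a non-hyphen character.
const trailingDashes = /([^-]+)-([^-]+)-(.+[^-])-+([^-]+)-([^-]+)/;
// Pattern that captures four hyphen-separated segments right before "--".
const fixedSegments = /^(?:.*?-){2}([^-]*(?:-[^-]*){3})--.*/;

const viaTrailingDashes = label.replace(trailingDashes, "$3");
const viaFixedSegments = label.replace(fixedSegments, "$1");

console.log(viaTrailingDashes);
console.log(viaFixedSegments);
```

The second pattern is tied to exactly four segments before the double hyphen, so it is the less general of the two.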

Match string between delimiters, but ignore matches with specific substring

I have to parse all the text in parentheses, except text that contains "GST".
e.g:
(AUSTRALIAN RED CROSS – ATHERTON)
(Total GST for this Invoice $1,104.96)
today for a quote (07) 55394226 − admin.nerang#waste.com.au − this applies to your Nerang services.
expected parsed value:
AUSTRALIAN RED CROSS – ATHERTON
I am trying:
^\(((?!GST).)*$
But it's only matching the value and not grouping it correctly.
https://regex101.com/r/HndrUv/1
What would be the correct regex for the same?
This regex should work to get the expected string:
^\((?!.*GST)(.*)\)$
It first checks that the text does not contain a match for .*GST. If so, it then captures the entire text.
(?!.*GST)(.*)
All that is then surrounded by \( and \) to leave the parentheses out of the capturing group.
\((?!.*GST)(.*)\)
Finally you add the BOL and EOL symbols and you get the result.
^\((?!.*GST)(.*)\)$
The expected value is saved in the first capture group (.*).
You can use
^\((?![^()]*\bGST\b)([^()]*)\)$
See the regex demo. Details:
^ - start of string
\( - a ( char
(?![^()]*\bGST\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there are zero or more chars other than ) and ( and then GST as a whole word (remove \bs if you do not need whole word matching)
([^()]*) - Group 1: any zero or more chars other than ) and (
\) - a ) char
$ - end of string
Bonus:
If substrings in longer texts need to be matched, too, you need to remove ^ and $ anchors in the above regex.
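A small sketch of the anchored pattern in use, assuming each input line is tested separately in JavaScript. The third sample line is shortened from the question, and the variable names are illustrative only.

```javascript
// Anchored pattern: the whole line must be one parenthesised group
// that does not contain GST as a whole word.
const pattern = /^\((?![^()]*\bGST\b)([^()]*)\)$/;

const lines = [
  "(AUSTRALIAN RED CROSS – ATHERTON)",
  "(Total GST for this Invoice $1,104.96)",
  "today for a quote (07) 55394226",
];

// Keep group 1 of every line that matches.
const parsed = lines
  .map((line) => line.match(pattern))
  .filter((match) => match !== null)
  .map((match) => match[1]);

console.log(parsed);
```

Only the first line survives: the second is rejected by the lookahead, and the third does not start with a parenthesis.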

A regular expression for matching a group followed by a specific character

So I need to match the following:
1.2.
3.4.5.
5.6.7.10
((\d+)\.(\d+)\.((\d+)\.)*) works fine for the very first line, but the problem is that there can be one line or several.
\n will only appear if there is more than one line.
As a string, the input looks like this: "1.2.\n3.4.5.\n1.2."
So my issue is: if there is only one line, there must be no \n at the end, but if there is more than one line, each line except the very last must end with \n.
Here is the pattern I suggest:
^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$
Here is a brief explanation of the pattern:
^ from the start of the string
\d+ match a number
(?:\.\d+)* followed by dot, and another number, zero or more times
\.? followed by an optional trailing dot
(?:\n followed by a newline
\d+(?:\.\d+)*\.?)* and another path sequence, zero or more times
$ end of the string
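To see how the pattern treats the trailing newline, here is a short JavaScript check. It is a sketch only; the helper name isValid and the test strings beyond the one quoted in the question are made up.

```javascript
// Whole-string pattern: lines are joined by \n, with no \n after the last one.
const pattern = /^\d+(?:\.\d+)*\.?(?:\n\d+(?:\.\d+)*\.?)*$/;

// Hypothetical helper wrapping the test.
function isValid(text) {
  return pattern.test(text);
}

console.log(isValid("1.2."));               // single line, no newline
console.log(isValid("1.2.\n3.4.5.\n1.2.")); // several lines
console.log(isValid("1.2.\n"));             // dangling newline at the end
```

Without the m flag, $ in JavaScript matches only at the very end of the input, so the dangling newline in the last case makes the test fail while the first two pass.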
You might check if there is a newline at the end using a positive lookahead (?=.*\n):
(?=.*\n)(\d+)\.(\d+)\.((\d+)\.)*
Edit
You could use an alternation to either match when on the next line there is the same pattern following, or match the pattern when not followed by a newline.
^(?:\d+\.\d+\.(?:\d+\.)*(?=.*\n\d+\.\d+\.)|\d+\.\d+\.(?:\d+\.)*(?!.*\n))
^ Start of string
(?: Non capturing group
\d+\.\d+\. Match 2 times 1+ digits and a dot
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?=.*\n\d+\.\d+\.) Positive lookahead, assert that the rest of the line is followed by a newline and then the start of the pattern
| Or
\d+\.\d+\. Match 2 times 1+ digits and a dot
(?:\d+\.)* Repeat 0+ times matching 1+ digits and a dot
(?!.*\n) Negative lookahead, assert that no newline follows
) Close non capturing group
(\d+\.*)+\n* will match the text you provided. If you need to make sure the final line also ends with a . then (\d+\.)+\n* will work.
Most programming languages offer the m flag, which is the multiline modifier. Enabling it lets $ match at the end of each line as well as at the end of the string.
The solution below only appends $ to your current regex and sets the m flag. How the flag is set may vary depending on your programming language.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
    regex = /((\d+)\.(\d+)\.((\d+)\.)*)$/gm,
    match;
while (match = regex.exec(text)) {
  console.log(match);
}
You could simplify the regex to /(\d+\.){2,}$/gm, then split the full match based on the dot character to get all the different numbers. I've given a JavaScript example below, but getting a substring and splitting a string are pretty basic operations in most languages.
var text = "1.2.\n3.4.5.\n1.2.\n12.34.56.78.123.\nthis 1.2. shouldn't hit",
    regex = /(\d+\.){2,}$/gm;
/* Slice is used to drop the dot at the end, otherwise resulting in
 * an empty string on split.
 *
 * "1.2.3.".split(".") //=> ["1", "2", "3", ""]
 * "1.2.3.".slice(0, -1) //=> "1.2.3"
 * "1.2.3".split(".") //=> ["1", "2", "3"]
 */
console.log(
  text.match(regex)
    .map(match => match.slice(0, -1).split("."))
);
For more info about regex flags/modifiers have a look at: Regular Expression Reference: Mode Modifiers

Regular Expression to Anonymize Names

I am using Notepad++ and the Find and Replace pattern with regular expressions to alter usernames such that only the first and last character of the screen name is shown, separated by exactly four asterisks (*). For example, "albobz" would become "a****z".
Usernames are listed directly after the cue "screen_name: " and I know I can find all the usernames using the regular expression:
screen_name:\s([^\s]+)
However, this expression won't store the first or last letter and I am not sure how to do it.
Here is a sample line:
February 3, 2018 screen_name: FR33Q location: Europe verified: false lang: en
Method 1
You have to work with the \G meta-character. In N++ using \G is kinda tricky.
Regex to find:
(?>(screen_name:\s+\S)|\G(?!^))\S(?=\S)
Breakdown:
(?> Construct a non-capturing group (atomic)
( Beginning of first capturing group
screen_name:\s+\S Match up to first letter of name
) End of first CG
| Or
\G(?!^) Continue from previous match
) End of NCG
\S Match a non-whitespace character
(?=\S) Up to last but one character
Replace with:
\1*
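Outside Notepad++, the same length-preserving result can be reached without \G. The JavaScript sketch below swaps in a different technique, a replacer function that repeats * once per inner character, because JavaScript regexes have neither \G nor atomic groups. The function name maskInner is hypothetical.

```javascript
// Not the \G approach: capture the inner characters as one group and
// rebuild the name with one * per inner character.
function maskInner(line) {
  return line.replace(
    /(screen_name:\s+\S)(\S*)(\S)/g,
    (whole, head, inner, last) => head + "*".repeat(inner.length) + last
  );
}

const sampleLine =
  "February 3, 2018 screen_name: FR33Q location: Europe verified: false lang: en";

console.log(maskInner(sampleLine));
```

For the sample line the name FR33Q becomes F***Q, so the length of the name is unchanged.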
Method 2
The above solution substitutes each inner character with a *, so the length remains intact. If you want exactly four *s regardless of length, search for:
(screen_name:\s+\S)(\S*)(\S)
and replace with: \1****\3
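The same search and replace can be tried in JavaScript as a quick check. This is a sketch, not Notepad++ syntax: JavaScript writes the backreferences as $1 and $3 instead of \1 and \3, and the variable names are made up.

```javascript
const line =
  "February 3, 2018 screen_name: FR33Q location: Europe verified: false lang: en";

// Group 1: cue plus first character, group 2: inner characters (dropped),
// group 3: last character.
const anonymized = line.replace(/(screen_name:\s+\S)(\S*)(\S)/g, "$1****$3");

console.log(anonymized);
```

The name FR33Q comes out as F****Q, with exactly four asterisks regardless of how long the original name was.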

Regex to check only if the group is present

I have a String which may have values like the ones below.
854METHYLDOPA
041ALDOMET /00000101/
133IODETO DE SODIO [I 131]
From these I need to get the text starting from index 4 until we find either of these patterns: /00000101/ or [I 131]
Expected Output:
METHYLDOPA
ALDOMET
IODETO DE SODIO
I have tried the below RegEx for the same
(?:^.{3})(.*)(?:[[/][A-Z0-9\s]+[]/\s+])
But this RegEx only works if the string contains [ or /; it doesn't work for case 1, where these patterns don't exist.
I have tried adding ? at the end; then it works for case 1 but not for cases 2 and 3.
Could anyone please help me get the regex working?
Your logic is difficult to phrase. My interpretation is that you always want to capture from the 4th character onwards. What else gets captured depends on the remainder of the input. Should either /00000101/ or [I 131] occur, then you want to capture up until that point. Otherwise, you want to capture the entire string. Putting this all together yields this regex:
^.{3}(?:(.*)(?=/00000101/|\[I 131\])|(.*))
You may try this:
^.{3}(.*?)($|(?:\s*\/00000101\/)|(?:\s*\[I\s+131\])).*$
and replace by this to get the exact output you want.
\1
Explanation:
^ --> start of the string
.{3} --> followed by 3 characters
(.*?) --> followed by anything; the ? makes it lazy, so it matches only up to the point where the following group can match and won't go beyond that. It is captured as group 1 --> \1
($|(?:\s*\/00000101\/)|(?:\s*\[I\s+131\])) -->
$ --> end of the string, which means that neither of the patterns you mentioned is present
| or
(?:\s*\/00000101\/) --> one of your patterns, extended with \s* to cover zero or more whitespace characters
| or
(?:\s*\[I\s+131\]) --> your other pattern, extended with \s+, which means 1 or more whitespace characters. ?: indicates that we will not capture it.
.*$ --> .* matches anything that follows, and $ declares the end of the string.
So the whole string is matched and replaced by group 1 alone, which is your target output.
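A brief JavaScript sketch of this replace-based approach, run against the three sample values from the question. JavaScript writes the replacement as $1 rather than \1, and the variable names are illustrative.

```javascript
const pattern = /^.{3}(.*?)($|(?:\s*\/00000101\/)|(?:\s*\[I\s+131\])).*$/;

const inputs = [
  "854METHYLDOPA",
  "041ALDOMET /00000101/",
  "133IODETO DE SODIO [I 131]",
];

// The whole string matches, so replacing it with $1 leaves only group 1.
const names = inputs.map((value) => value.replace(pattern, "$1"));

console.log(names);
```

The \s* in front of each suffix is what keeps the space before /00000101/ or [I 131] out of the captured name.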
You could get the values you are looking for in group 1:
^.{3}(.+?)(?=$| ?\[I 131\]| ?\/00000101\/)
Explanation
From the beginning of the string ^
Match the first 3 characters .{3}
Match in a capturing group (where your values will be) any character one or more times non greedy (.+?)
A positive lookahead (?=
To assert that what follows is either the end of the string $
or |
an optional space ? followed by [I 131] \[I 131\]
or |
an optional space ? followed by /00000101/ \/00000101\/
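The lookahead version can be exercised the same way. The sketch below reads group 1 from the match in JavaScript; extractName is a hypothetical helper that returns null when nothing matches.

```javascript
const pattern = /^.{3}(.+?)(?=$| ?\[I 131\]| ?\/00000101\/)/;

// Hypothetical helper: returns group 1, or null if the value does not match.
function extractName(value) {
  const match = value.match(pattern);
  return match === null ? null : match[1];
}

console.log(extractName("854METHYLDOPA"));
console.log(extractName("041ALDOMET /00000101/"));
console.log(extractName("133IODETO DE SODIO [I 131]"));
```

Since the suffix is only asserted by the lookahead and never consumed, the match stops right after the name.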
If your regex engine supports \K, you could try it like this and the values you are looking for are not in a group but the full match:
^.{3}\K.+?(?=$| ?\[I 131\]| ?\/00000101\/)
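JavaScript has no \K, but engines with lookbehind support can get a similar full-match result by moving the prefix into a lookbehind. The sketch below is a substitute technique, not the \K pattern itself, and assumes a runtime that supports lookbehind (recent Node.js and current browsers do). The helper name fullMatch is invented.

```javascript
// (?<=^.{3}) asserts that exactly three characters precede the match start,
// so the full match begins at the fourth character.
const pattern = /(?<=^.{3}).+?(?=$| ?\[I 131\]| ?\/00000101\/)/;

// Hypothetical helper: returns the full match, or null if there is none.
function fullMatch(value) {
  const match = value.match(pattern);
  return match === null ? null : match[0];
}

console.log(fullMatch("854METHYLDOPA"));
console.log(fullMatch("041ALDOMET /00000101/"));
console.log(fullMatch("133IODETO DE SODIO [I 131]"));
```

Here the wanted value is the whole match (index 0), so no capture group is needed.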