Regular expression to replace shortest match - regex

my string is like this
sfdfdsfdsfstart112matlab2336endgfdgdfgkknfkgstart558899enddfdsfd
how can we replace part of a string such a way that the result will be
sfdfdsfdsfgfdgdfgkknfkgdfdsfd
i.e bolded content need to be removed.

You need to use non-greedy matching:
start.*?end

Use replacement function with this regex /start.+?end/g which will match the bold parts of your string. The g part of the regex means globally, and might need to be implemented differently depending on the language you use.
The key here is to use ? which turns on un-greedy matching. That means the match consumes the minimum amount of characters rather than the maximum, so will match from the start to the next rather than the last end

start[1-9]+end
if you need to have numbers between words

Related

Trying to extract repeating pattern from string in php/javascript

The following is in PHP but the regex will also be used in javascript.
Trying to extract repeating patterns from a string
string can be any of the following:
"something arbitrary"
"D123"
"D111|something"
"D197|what.org|when.net"
"D297|who.197d234.whatever|when.net|some other arbitrary string"
I'm currently using the following regex: /^D([0-9]{3})(?:\|([^\|]+))*/
This correctly does not match the first string, matches the second and third correctly. The problem is the third and fourth only match the Dxxx and the last string. I need each of the strings between the '|' to be matched.
I'm hoping to use a regex as it makes it a single step. I realize I could just detect the leading Dxxx then use explode or split as appropriate to break the strings out. I've just gotten stuck on wanting a single regular expression match step.
This same regex may be used in Python as well so just want a generic regex solution.
There is no way to have a dynamic number of capture groups in a regular expression, but if you know some upper limit to how many parts you would have in one string, you can just repeat the pattern that many times:
/^D([0-9]{3})(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)(.*?)(?:$|\|)/
So after the initial ^D([0-9]{3})(?:$|\|) you just repeat (.*?)(?:$|\|) as many times as you need it.
When the string has fewer elements, those remaining capture groups will match the empty string.
See regex tester.
Is something like preg_match_all() (the PHP variant of a global match) also acceptable for you?
Then you could use:
^(?|D([0-9]{3})|^.+$|(?!^)\|([^|\n]*)(?=\||$))
This will match everything in a string in different matches, e.g. take your string:
D197|what.org|when.net
It will you then give three matches:
D197
what.org
when.net
Running live: https://regex101.com/r/jL2oX6/4 (Everything in green are your group matches. Ignore what's in blue.)

Is there any upper limit for number of groups used or the length of the regex in Notepad++?

I am new to using regex. I am trying to use the regex find and replace option in Notepad++.
I have used the following regex:
((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))(/)((?:)|(\+)|(-))(\d)((?:)|(\+)|(-))
For the following text:
2/2
+2/+2
-2/-2
2+/2+
2-/2-
But I am able to get matches only for the first three. The last two, it only gives partial matches, excluding the last "+" and the "-". I am wondering if there is any upper limit for the number of groups (which i doubt is unlikely) that can be used or any upper limit for the maximum length of the regex. I am not sure why my regex is failing. Or if there is anything wrong with my regex, please correct it.
This is not an issue with Notepad++'s regex engine. The problem is that when you have alternations like (?:)|(\+)|(-), the regex engine will attempt to match the different options in the order they are specified. Since you specified an empty group first, it will attempt to match an empty string first, only matching the + or - if it needs to backtrack. This essentially makes the alternation lazy—it will never match any character unless it has to.
vks's answer works perfectly well, but just in case you actually needed those capturing groups separated out, you can do the same thing just by rewriting your alternations like this:
((\+)|(-)|(?:))(\d)((\+)|(-)|(?:))(/)((\+)|(-)|(?:))(\d)((\+)|(-)|(?:))
or even more simply, like this:
((\+)|(-)|)(\d)((\+)|(-)|)(/)((\+)|(-)|)(\d)((\+)|(-)|)
([-+]?)(\d)([-+]?)(/)([-+]?)(\d)([-+]?)
You can use this simple regex to match all cases.See here.
https://www.regex101.com/r/fG5pZ8/19

regex to match strings not ending with a pattern?

I am trying to form a regular expression that will match strings that do NOT end a with a DOT FOLLOWED BY NUMBER.
eg.
abcd1
abcdf12
abcdf124
abcd1.0
abcd1.134
abcdf12.13
abcdf124.2
abcdf124.21
I want to match first three.
I tried modifying this post but it didn't work for me as the number may have variable length.
Can someone help?
You can use something like this:
^((?!\.[\d]+)[\w.])+$
It anchors at the start and end of a line. It basically says:
Anchor at the start of the line
DO NOT match the pattern .NUMBERS
Take every letter, digit, etc, unless we hit the pattern above
Anchor at the end of the line
So, this pattern matches this (no dot then number):
This.Is.Your.Pattern or This.Is.Your.Pattern2012
However it won't match this (dot before the number):
This.Is.Your.Pattern.2012
EDIT: In response to Wiseguy's comment, you can use this:
^((?!\.[\d]+$)[\w.])+$ - which provides an anchor after the number. Therefore, it must be a dot, then only a number at the end... not that you specified that in your question..
If you can relax your restrictions a bit, you may try using this (extended) regular expression:
^[^.]*.?[^0-9]*$
You may omit anchoring metasymbols ^ and $ if you're using function/tool that matches against whole string.
Explanation: This regex allows any symbols except dot until (optional) dot is found, after which all non-numerical symbols are allowed. It won't work for numbers in improper format, like in string: abcd1...3 or abcd1.fdfd2. It also won't work correctly for some string with multiple dots, like abcd.ab123cd.a (the problem description is a bit ambigous).
Philosophical explanation: When using regular expressions, often you don't need to do exactly what your task seems to be, etc. So even simple regex will do the job. An abstract example: you have a file with lines are either numbers, or some complicated names(without digits), and say, you want to filter out all numbers, then simple filtering by [^0-9] - grep '^[0-9]' will do the job.
But if your task is more complex and requires validation of format and doing other fancy stuff on data, why not use a simple script(say, in awk, python, perl or other language)? Or a short "hand-written" function, if you're implementing stand-alone application. Regexes are cool, but they are often not the right tool to use.
I would just use a simple negative look-behind anchored at the end:
.*(?<!\\.\\d+)$

REGEX: Select everything NOT equal to a certain string

This should be straightforward. I need a regular expression that selects everything that does not specifically contain a certain word.
So if I have this sentence: "There is a word in the middle of this sentence."
And the regular expression gets everything but "middle", I should select everything in that sentence but "middle".
Is there any easy way to do this?
Thanks.
It is not possible for a single regex match operation to be discontinuous.
You could use two capturing groups:
(.*)middle(.*)
Then concatenate the contents of capturing groups 1 and 2 after the match.
You may wish to enable the "dot also matches newline" option in your parser.
See for example Java's DOTALL, .NET's Singleline, Perl's s, etc.
Positive lookaround is the way to go:
/^(.+)(?=middle)/ -- gets everything before middle, not including middle
and
/(?!middle)(.+)$/ -- gets everything after middle, not including middle
Then you just merge the results of both

How can I "inverse match" with regex?

I'm processing a file, line-by-line, and I'd like to do an inverse match. For instance, I want to match lines where there is a string of six letters, but only if these six letters are not 'Andrea'. How should I do that?
I'm using RegexBuddy, but still having trouble.
(?!Andrea).{6}
Assuming your regexp engine supports negative lookaheads...
...or maybe you'd prefer to use [A-Za-z]{6} in place of .{6}
Note that lookaheads and lookbehinds are generally not the right way to "inverse" a regular expression match. Regexps aren't really set up for doing negative matching; they leave that to whatever language you are using them with.
For Python/Java,
^(.(?!(some text)))*$
http://www.lisnichenko.com/articles/javapython-inverse-regex.html
In PCRE and similar variants, you can actually create a regex that matches any line not containing a value:
^(?:(?!Andrea).)*$
This is called a tempered greedy token. The downside is that it doesn't perform well.
The capabilities and syntax of the regex implementation matter.
You could use look-ahead. Using Python as an example,
import re
not_andrea = re.compile('(?!Andrea)\w{6}', re.IGNORECASE)
To break that down:
(?!Andrea) means 'match if the next 6 characters are not "Andrea"'; if so then
\w means a "word character" - alphanumeric characters. This is equivalent to the class [a-zA-Z0-9_]
\w{6} means exactly six word characters.
re.IGNORECASE means that you will exclude "Andrea", "andrea", "ANDREA" ...
Another way is to use your program logic - use all lines not matching Andrea and put them through a second regex to check for six characters. Or first check for at least six word characters, and then check that it does not match Andrea.
Negative lookahead assertion
(?!Andrea)
This is not exactly an inverted match, but it's the best you can directly do with regex. Not all platforms support them though.
If you want to do this in RegexBuddy, there are two ways to get a list of all lines not matching a regex.
On the toolbar on the Test panel, set the test scope to "Line by line". When you do that, an item List All Lines without Matches will appear under the List All button on the same toolbar. (If you don't see the List All button, click the Match button in the main toolbar.)
On the GREP panel, you can turn on the "line-based" and the "invert results" checkboxes to get a list of non-matching lines in the files you're grepping through.
I just came up with this method which may be hardware intensive but it is working:
You can replace all characters which match the regex by an empty string.
This is a oneliner:
notMatched = re.sub(regex, "", string)
I used this because I was forced to use a very complex regex and couldn't figure out how to invert every part of it within a reasonable amount of time.
This will only return you the string result, not any match objects!
(?! is useful in practice. Although strictly speaking, looking ahead is not a regular expression as defined mathematically.
You can write an inverted regular expression manually.
Here is a program to calculate the result automatically.
Its result is machine generated, which is usually much more complex than hand writing one. But the result works.
If you have the possibility to do two regex matches for the inverse and join them together you can use two capturing groups to first capture everything before your regex
^((?!yourRegex).)*
and then capture everything behind your regex
(?<=yourRegex).*
This works for most regexes. One problem I discovered was when I had a quantifier like {2,4} at the end. Then you gotta get creative.
In Perl you can do:
process($line) if ($line =~ !/Andrea/);