Capturing a repeating pattern with regex

I'm trying to match a string like CODE-UH87H-98HSH-HB383-JWWB2U, and I have the following regex pattern: CODE\-[A-Z0-9]+\-[A-Z0-9]+\-[A-Z0-9]+\-[A-Z0-9]+. Is there a better way of doing this? I tried CODE(\-[A-Z0-9]+\-){4} and it didn't work.

I tried CODE(\-[A-Z0-9]+\-){4} and it didn't work
That does require two dashes in succession. In full, it would be CODE\-[A-Z0-9]+\-\-[A-Z0-9]+\-\-[A-Z0-9]+\-\-[A-Z0-9]+\-. What you want is
CODE(\-[A-Z0-9]+){4}
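As a quick check, here is a minimal sketch in Python (the re module is an assumption; the question does not name an engine) comparing the attempted pattern with the corrected one. The doubled-dash string is made up purely to show what the attempted pattern actually accepts.

```python
import re

sample = "CODE-UH87H-98HSH-HB383-JWWB2U"

# The attempt: every repeated group both starts and ends with a dash.
attempted = re.compile(r"CODE(\-[A-Z0-9]+\-){4}")
# The correction: every repeated group only starts with a dash.
corrected = re.compile(r"CODE(\-[A-Z0-9]+){4}")

# Hypothetical input with doubled dashes, the shape the attempt requires.
doubled = "CODE-UH87H--98HSH--HB383--JWWB2U-"

attempt_on_sample = attempted.fullmatch(sample) is not None
corrected_on_sample = corrected.fullmatch(sample) is not None
attempt_on_doubled = attempted.fullmatch(doubled) is not None

print(attempt_on_sample, corrected_on_sample, attempt_on_doubled)
```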

You were almost there. CODE(\-[A-Z0-9]+){4} should work!

When the pattern between the dashes may contain any character, the following regex is even shorter:
CODE(-[^-]+){4}
Of course you may have to add \ for escaping before the dash depending on what regex engine you will use.
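A small sketch of what the shorter pattern accepts, assuming Python's re module (where the dash needs no escaping outside a character class). The second and third inputs are invented for illustration: [^-]+ lets through lowercase letters, spaces and punctuation, so it is only a good fit when the segments really may contain anything.

```python
import re

loose = re.compile(r"CODE(-[^-]+){4}")

# The original sample still matches.
original_ok = loose.fullmatch("CODE-UH87H-98HSH-HB383-JWWB2U") is not None
# Hypothetical input: segments with lowercase, spaces and punctuation also match.
messy_ok = loose.fullmatch("CODE-uh 87h-98_hsh-hb.383-jwwb2u") is not None
# Hypothetical input: only three segments, so the {4} repetition fails.
short_ok = loose.fullmatch("CODE-UH87H-98HSH-HB383") is not None

print(original_ok, messy_ok, short_ok)
```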

Related

Transform negative regex lookahead to greedy needed

The task I'm trying to solve seems pretty simple - I need to choose all font-changing tags except for the particular one (AIGDT). I'm going to cut them out in order to simplify further text processing.
I'm trying to use negative regex lookahead like this:
Font='(?!(AIGDT))(.*)'
But for the single-line text sample:
<StyleOverride Font='Arial' FontSize='0,32971'>[</StyleOverride><StyleOverride FontSize='0,21558'> </StyleOverride><StyleOverride Font='AIGDT' Italic='False'>n</StyleOverride><DimensionValue/> <StyleOverride Font='Arial' FontSize='0,32971'>]</StyleOverride>
It returns single 200+symbol match ... while I'm expecting two 12-symbol matches (Font='Arial').
I believe this is because the lookahead is greedy.
Can anybody point out my mistake?
Thanks in advance.
How does Font='(?!(AIGDT))([^']+)' work for you?
Basically, narrow down the second capture to "anything but a single quote".
(Full disclosure: On my phone at the moment so I haven't run it, but in theory it works nicely)
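Since the answer above was untested, here is a minimal check, assuming Python's re module (the question does not say which engine is in use). It runs both patterns over the sample line from the question. The variable names are only for illustration.

```python
import re

# The single-line sample from the question.
text = (
    "<StyleOverride Font='Arial' FontSize='0,32971'>[</StyleOverride>"
    "<StyleOverride FontSize='0,21558'> </StyleOverride>"
    "<StyleOverride Font='AIGDT' Italic='False'>n</StyleOverride>"
    "<DimensionValue/> "
    "<StyleOverride Font='Arial' FontSize='0,32971'>]</StyleOverride>"
)

# Original pattern: (.*) runs on to the last quote in the line.
greedy = [m.group(0) for m in re.finditer(r"Font='(?!(AIGDT))(.*)'", text)]
# Suggested pattern: [^']+ stops at the closing quote of the attribute.
narrow = [m.group(0) for m in re.finditer(r"Font='(?!(AIGDT))([^']+)'", text)]

print(len(greedy), narrow)
```

So the long match comes from the greedy (.*), not from the lookahead itself.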

Regex for any string not ending on .js

This has been driving me nuts. I'm trying to match everything that doesn't end in .js. I'm using perl, so ?<! etc. is more than welcome.
What I'm trying to do:
Do match these
mainfile
jquery.1.1.11
my.module
Do NOT match these
mainfile.js
jquery.1.1.11.js
my.module.js
This should be an insanely simple task, but I'm just stuck. I looked in the docs for both regex, sed, perl and was even fiddling around for half an hour on regexr. Intuitively, this example (/^.*?(?!\.js)$/) should do it. I guess I just stared myself blind.
Thanks in advance.
You can use this regex to make sure your match doesn't end with .js:
^(?!.+\.js$).+$
(?!.+\.js$) is a negative lookahead condition to fail the match if line has .js at the end.
This one should suit your needs:
^.*(?<![.]js)$
The simplest approach when you only have negative matching conditions is to construct a positive regex and then check that it doesn't match.
if ($string !~ /\.js$/)
{
    print "Doesn't end in .js";
}
This is easier to understand and more efficient than a negative look-around.
Look-arounds are only needed when you need to mix positive and negative conditions (for example, "I need to match 'foo' out of a string, but only when it is not followed by 'bar'"). Even then, sometimes it is easier to use multiple simple patterns and logic, rather than meeting all your requirements with one complex pattern.
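To compare the three suggestions side by side, here is a sketch in Python rather than Perl (an assumption made only for illustration; the patterns themselves are unchanged). It runs each approach over the six file names from the question.

```python
import re

names = [
    "mainfile", "jquery.1.1.11", "my.module",
    "mainfile.js", "jquery.1.1.11.js", "my.module.js",
]

lookahead = re.compile(r"^(?!.+\.js$).+$")   # negative lookahead at the start
lookbehind = re.compile(r"^.*(?<![.]js)$")   # negative lookbehind at the end
ends_in_js = re.compile(r"\.js$")            # positive pattern, negated in code

by_lookahead = [n for n in names if lookahead.search(n)]
by_lookbehind = [n for n in names if lookbehind.search(n)]
by_negation = [n for n in names if not ends_in_js.search(n)]

print(by_lookahead == by_lookbehind == by_negation)
```

On these inputs all three agree, which supports the point that the plain negated test is enough here.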

Greedy and non-greedy regex

I currently have this regex: this\.(.*)?\s[=,]\s, however I have come across a pickle I cannot fix.
I tried the following Regex, which works, but it captures the space as well which I don't want: this\.(.*)?(?<=\s)=|(?<!\s),. What I'm trying to do is match identifier names. An example of what I want and the result is this:
this.""W = blah; which would match ""W. The second regex above does this almost perfectly, however it also captures the space before the = in the first group. Can someone point me in the correct direction to fix this?
EDIT: The reason for not simply using [^\s] in the wildcard group is that sometimes I can get lines like this: this. "$ = blah;
EDIT2: Now I have another issue. It's not matching lines like param1.readBytes(this.=!3,0,param1.readInt()); properly. Instead of matching =!3 it's matching =!3,0. Is there a way to fix this? Again, I cannot simply use [^,] because there could be a name like param1.readBytes(this.,3$,0,param1.readInt()); which should match ,3$.
(.*) will match any character including whitespace.
To force it not to end in whitespace change it to (.*[^\s])
Eg:
this\.(.*[^\s])?\s?[=,]\s
For your second edit, it seems like you are doing a language parser. Even though regular expressions are powerful, they do have limits. You need a grammar parser for that.
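A minimal sketch of the difference on the first sample line, assuming Python's re module (the question does not name the engine). It only covers the first example; it does not attempt the cases from the second edit.

```python
import re

line = 'this.""W = blah;'

# The asker's second pattern: the group swallows the space before '='.
asker = re.compile(r"this\.(.*)?(?<=\s)=|(?<!\s),")
# The suggested pattern: the group must end on a non-whitespace character.
answer = re.compile(r"this\.(.*[^\s])?\s?[=,]\s")

asker_name = asker.search(line).group(1)
answer_name = answer.search(line).group(1)

print(repr(asker_name), repr(answer_name))
```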
Maybe you can tell in your first block to capture non space characters, instead of any.
this\.(\S*)?(?<=\s)=|(?<!\s),

Simple regex for matching up to an optional character?

I'm sure this is a simple question for someone at ease with regular expressions:
I need to match everything up until the character #
I don't want the string following the # character, just the stuff before it, and the character itself should not be matched. This is the most important part, and what I'm mainly asking. As a second question, I would also like to know how to match the rest, after the # character. But not in the same expression, because I will need that in another context.
Here's an example string:
topics/install.xml#id_install
I want only topics/install.xml. And for the second question (separate expression) I want id_install
First expression:
^([^#]*)
Second expression:
#(.*)$
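Applied to the example string, the two expressions behave like this. This is a sketch in Python's re module, which is an assumption; later answers in this thread mention C#.

```python
import re

s = "topics/install.xml#id_install"

# First expression: everything up to, but not including, the first '#'.
before = re.search(r"^([^#]*)", s).group(1)
# Second expression: everything after the '#'.
after = re.search(r"#(.*)$", s).group(1)

print(before, after)
```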
[a-zA-Z0-9]*[\#]
If your string contains any other special characters you need to add them into the first square bracket escaped.
I don't use C#, but I will assume that it uses PCRE. If so,
"([^#]*)#.*"
with a call to 'match'. A call to 'search' does not need the trailing ".*"
The parens define the 'keep group'; the [^#] means any character that is not a '#'
You probably tried something like
"(.*)#.*"
and found that it fails when multiple '#' signs are present (keeping the leading '#'s)?
That is because ".*" is greedy, and will match as much as it can.
Your matcher should have a method that looks something like 'group(...)'. Most matchers
return the entire matched sequence as group(0), the first paren-matched group as group(1),
and so forth.
PCRE is so important that I strongly encourage you to search for it on Google, learn it, and always have it in your programming toolkit.
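The greedy behaviour described above is easy to see on a string with more than one '#'. A minimal sketch, assuming Python's re module and a made-up input:

```python
import re

s = "a#b#c"  # hypothetical input with two '#' signs

# Greedy .* backtracks only as far as the LAST '#', so the group keeps "a#b".
greedy = re.match(r"(.*)#.*", s).group(1)
# The negated class cannot cross a '#', so the group stops at the first one.
negated = re.match(r"([^#]*)#.*", s).group(1)

print(greedy, negated)
```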
Use look ahead and look behind:
To get all characters up to, but not including the pound (#): .*?(?=\#)
To get all characters following, but not including the pound (#): (?<=\#).*
If you don't mind using groups, you can do it all in one shot:
(.*?)\#(.*) Your answers will be in group(1) and group(2). Notice the non-greedy construct, *?, which will attempt to match as little as possible instead of as much as possible.
If you want to allow for missing # section, use ([^\#]*)(?:\#(.*))?. It uses a non-collecting group to test the second half, and if it finds it, returns everything after the pound.
Honestly though, for your situation, it is probably easier to use the Split method provided in String.
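Here is a sketch of the lookaround and optional-group variants, written against Python's re module as an assumption (the answer is aimed at C#, where group access looks different). The second input, without a '#', is made up to show the optional part.

```python
import re

s = "topics/install.xml#id_install"

# Lookahead: everything up to, but not including, the '#'.
up_to_hash = re.search(r".*?(?=\#)", s).group(0)
# Lookbehind: everything after, but not including, the '#'.
past_hash = re.search(r"(?<=\#).*", s).group(0)

# One pattern, with the '#' section optional.
optional = re.compile(r"([^\#]*)(?:\#(.*))?")
with_fragment = optional.fullmatch(s).groups()
without_fragment = optional.fullmatch("topics/install.xml").groups()

print(up_to_hash, past_hash, with_fragment, without_fragment)
```

When the '#' section is missing, the second group is simply unset (None in Python).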
first:
/[^\#]*(?=\#)/ edit: is faster than /.*?(?=\#)/
second:
/(?<=\#).*/
For something like this in C# I would usually skip the regular expressions stuff altogether and do something like:
string[] split = exampleString.Split('#');
string firstString = split[0];
string secondString = split[1];
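A rough Python analogue of the same idea, as a sketch only: str.partition splits on the first '#' and also reports whether a separator was found, so a string with no '#' does not cause an index error. The helper name split_fragment is made up for illustration.

```python
def split_fragment(s):
    """Return (part before '#', part after '#' or None if there is no '#')."""
    head, sep, tail = s.partition("#")
    return head, (tail if sep else None)


with_hash = split_fragment("topics/install.xml#id_install")
without_hash = split_fragment("topics/install.xml")

print(with_hash, without_hash)
```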

How can I "inverse match" with regex?

I'm processing a file, line-by-line, and I'd like to do an inverse match. For instance, I want to match lines where there is a string of six letters, but only if these six letters are not 'Andrea'. How should I do that?
I'm using RegexBuddy, but still having trouble.
(?!Andrea).{6}
Assuming your regexp engine supports negative lookaheads...
...or maybe you'd prefer to use [A-Za-z]{6} in place of .{6}
Note that lookaheads and lookbehinds are generally not the right way to "inverse" a regular expression match. Regexps aren't really set up for doing negative matching; they leave that to whatever language you are using them with.
For Python/Java,
^(.(?!(some text)))*$
http://www.lisnichenko.com/articles/javapython-inverse-regex.html
In PCRE and similar variants, you can actually create a regex that matches any line not containing a value:
^(?:(?!Andrea).)*$
This is called a tempered greedy token. The downside is that it doesn't perform well.
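The position of the lookahead matters. A minimal sketch in Python's re module, with invented test lines, comparing the form that checks before each character with the form that checks after it:

```python
import re

# Lookahead tested BEFORE each character is consumed.
tempered = re.compile(r"^(?:(?!Andrea).)*$")
# Lookahead tested AFTER each character is consumed.
check_after = re.compile(r"^(.(?!Andrea))*$")

lines = ["hello world", "hello Andrea", "Andrea says hello"]

tempered_hits = [l for l in lines if tempered.search(l)]
check_after_hits = [l for l in lines if check_after.search(l)]

print(tempered_hits)
print(check_after_hits)
```

The second form never tests position 0, so a line that starts with "Andrea" slips through; the tempered form rejects it.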
The capabilities and syntax of the regex implementation matter.
You could use look-ahead. Using Python as an example,
import re
not_andrea = re.compile(r'(?!Andrea)\w{6}', re.IGNORECASE)
To break that down:
(?!Andrea) means 'match if the next 6 characters are not "Andrea"'; if so then
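\w means a "word character": letters, digits, and the underscore. For ASCII text this is equivalent to the class [a-zA-Z0-9_]; in Python 3, \w in a str pattern also matches non-ASCII letters and digits unless re.ASCII is set.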
\w{6} means exactly six word characters.
re.IGNORECASE means that you will exclude "Andrea", "andrea", "ANDREA" ...
Another way is to use your program logic - use all lines not matching Andrea and put them through a second regex to check for six characters. Or first check for at least six word characters, and then check that it does not match Andrea.
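A sketch of the program-logic route, with one caveat about the lookahead on its own. The word-boundary pattern and the helper name has_other_six_letter_word are assumptions for illustration; the question does not say whether the six letters must form a whole word.

```python
import re

not_andrea = re.compile(r"(?!Andrea)\w{6}", re.IGNORECASE)

# Caveat: the search can start one character later, so "Andreas"
# still yields a six-character match.
slipped = not_andrea.search("Andreas").group(0)

# Program logic instead: find six-letter words, then compare in code.
six_letter_word = re.compile(r"\b[A-Za-z]{6}\b")

def has_other_six_letter_word(line):
    """True if the line has a six-letter word other than 'Andrea'."""
    return any(w.lower() != "andrea" for w in six_letter_word.findall(line))

print(slipped)
print(has_other_six_letter_word("Hello Andrea"))
print(has_other_six_letter_word("Andrea called Robert"))
```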
Negative lookahead assertion
(?!Andrea)
This is not exactly an inverted match, but it's the best you can directly do with regex. Not all platforms support them though.
If you want to do this in RegexBuddy, there are two ways to get a list of all lines not matching a regex.
On the toolbar on the Test panel, set the test scope to "Line by line". When you do that, an item List All Lines without Matches will appear under the List All button on the same toolbar. (If you don't see the List All button, click the Match button in the main toolbar.)
On the GREP panel, you can turn on the "line-based" and the "invert results" checkboxes to get a list of non-matching lines in the files you're grepping through.
I just came up with this method which may be hardware intensive but it is working:
You can replace all characters which match the regex by an empty string.
This is a oneliner:
notMatched = re.sub(regex, "", string)
I used this because I was forced to use a very complex regex and couldn't figure out how to invert every part of it within a reasonable amount of time.
This will only return you the string result, not any match objects!
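A self-contained version of that one-liner, with a made-up pattern and input standing in for the complex regex mentioned above:

```python
import re

regex = r"\d+"            # hypothetical pattern; stands in for the complex one
string = "abc123def456"   # hypothetical input

# Delete everything the regex matches; what remains is the "inverse".
not_matched = re.sub(regex, "", string)

print(not_matched)
```

Note that the unmatched pieces are joined into one string, so the boundaries between them are lost.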
(?! is useful in practice, although strictly speaking a lookahead is not part of regular expressions as defined mathematically.
You can write an inverted regular expression manually.
Here is a program to calculate the result automatically.
Its result is machine generated, which is usually much more complex than hand writing one. But the result works.
If you have the possibility to do two regex matches for the inverse and join them together you can use two capturing groups to first capture everything before your regex
^((?!yourRegex).)*
and then capture everything behind your regex
(?<=yourRegex).*
This works for most regexes. One problem I discovered was when I had a quantifier like {2,4} at the end. Then you gotta get creative.
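A minimal sketch of the two-part approach in Python's re module, with "Andrea" standing in for yourRegex and an invented input line. The last part shows one possible reason a trailing {2,4} causes trouble, at least in Python: its lookbehind must be fixed-width. Whether that is the problem the answer ran into is a guess.

```python
import re

line = "hello Andrea world"  # hypothetical input

# Everything before the first occurrence.
before = re.search(r"^((?!Andrea).)*", line).group(0)
# Everything after the occurrence.
after = re.search(r"(?<=Andrea).*", line).group(0)

# Python's re rejects a variable-width lookbehind such as a{2,4}.
try:
    re.compile(r"(?<=a{2,4}).*")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(repr(before), repr(after), variable_width_ok)
```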
In Perl you can do:
process($line) if ($line !~ /Andrea/);