Negative integer Regex doesn't match - regex

I have Googled it, and found the following results:
http://icfun.blogspot.com/2008/03/regular-expression-to-handle-negative.html
http://regexlib.com/DisplayPatterns.aspx?cattabindex=2&categoryId=3
With some (very basic) Regex knowledge, I figured this would work:
r\.(^-?\d+)\.(^-?\d+)\.mcr
For parsing such strings:
r.0.0.mcr
r.-1.5.mcr
r.20.-1.mcr
r.-1.-1.mcr
But I don't get a match on these.
Since I'm learning (or trying to learn) Regex, could you please explain why my pattern doesn't match (instead of just writing a new working one for me)? From what I understood, it goes like so:
Match r
Match a period
Match a prefix negative sign or not, and store the group
Match a period
Match a prefix negative sign or not, and store the group
Match a period
Match mcr
But I'm wrong, apparently :).

You are very close. ^ matches the start of a string, so it should only be located at the start of a pattern (if you want to use it at all - that depends on whether you will also accept e.g. abcr.0.0.mcr or not). Similarly, one can use $ (but only at the end of the pattern) to indicate that you will only accept strings that do not contain anything after what the pattern matches (so that e.g. r.0.0.mcrabc won't be accepted). Otherwise, I think it looks good.

The ^ characters are telling it to match only at the beginning of a line; since it's obviously not at the beginning of a line in either case, it fails to match. In this case, you just need to remove both ^s. (I think what you're trying to say is "don't let anything else be in between these", but that's the default except at the start of the regex; you would need something like .* to make it allow additional characters between them.)

Note that ^ means 'not' only when it is the first character inside a character class, as in [^-]. Outside square brackets, as in this pattern, it is always the start anchor, so it does not mean that there should not be a dash there.
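To see the effect of the stray ^ anchors, here is a minimal sketch in Python's re module (an assumption; the question does not name a language). It runs the original pattern and a version with the inner ^ removed against the four sample strings. The outer ^ and $ in the fixed pattern are optional and only there to reject extra text around the name.

```python
import re

samples = ["r.0.0.mcr", "r.-1.5.mcr", "r.20.-1.mcr", "r.-1.-1.mcr"]

# Original pattern: the ^ inside each group demands "start of string"
# at a position that is already past "r.", so it can never succeed.
broken = re.compile(r"r\.(^-?\d+)\.(^-?\d+)\.mcr")

# Same pattern with the inner anchors removed.
fixed = re.compile(r"^r\.(-?\d+)\.(-?\d+)\.mcr$")

broken_hits = [s for s in samples if broken.search(s)]
coords = [tuple(int(g) for g in fixed.match(s).groups()) for s in samples]

print(broken_hits)  # []
print(coords)       # [(0, 0), (-1, 5), (20, -1), (-1, -1)]
```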

Related

Regex everything after, but not including

I am trying to regex the following string:
https://www.amazon.com/Tapps-Top-Apps-and-Games/dp/B00VU2BZRO/ref=sr_1_3?ie=UTF8&qid=1527813329&sr=8-3&keywords=poop
I want only B00VU2BZRO.
This substring is always going to be 10 characters, alphanumeric, preceded by dp/.
So far I have the following regex:
[d][p][\/][0-9B][0-9A-Z]{9}
This matches dp/B00VU2BZRO
I want to match only B00VU2BZRO with no dp/
How do I regex this?
Here is one regex option which would produce an exact match of what you want:
(?<=dp\/)(.*)(?=\/)
Note that this solution makes no assumptions about the length of the path fragment occurring after dp/. If you want to match a certain number of characters, replace (.*) with (.{10}), for example.
Depending on your language/method of application, you have a couple of options.
Positive look behind. This will make your regex more complicated, but will make it match what you want exactly:
(?<=dp/)[0-9A-Z]{10}
The construct (?<=...) is called a positive look behind. It will not consume any of the string, but will only allow the match to happen if the pattern between the parens is matched immediately before the current position.
Capture group. This will make the regex itself slightly simpler, but will add a step to the extraction process:
dp/([0-9A-Z]{10})
Anything between plain parens is a capture group. The entire pattern will be matched, including dp/, but most languages will give you a way of extracting the portion you are interested in.
Depending on your language, you may need to escape the forward slash (/).
As an aside, you never need to create a character class for single characters: [d][p][\/] can equally well be written as just dp\/.
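Both options can be tried side by side. The sketch below assumes Python's re module (the question does not say which language is in use), where / needs no escaping; the variable names are only for illustration.

```python
import re

url = ("https://www.amazon.com/Tapps-Top-Apps-and-Games/dp/B00VU2BZRO/"
       "ref=sr_1_3?ie=UTF8&qid=1527813329&sr=8-3&keywords=poop")

# Option 1: positive lookbehind, the whole match is the ID itself.
via_lookbehind = re.search(r"(?<=dp/)[0-9A-Z]{10}", url).group(0)

# Option 2: capture group, the match includes "dp/" and group 1 is the ID.
match = re.search(r"dp/([0-9A-Z]{10})", url)
via_group = match.group(1)

print(via_lookbehind)  # B00VU2BZRO
print(match.group(0))  # dp/B00VU2BZRO
print(via_group)       # B00VU2BZRO
```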

Regular Expression to match most explicit string

I have some experience with regular expressions but I am far from expert level and need a way to match the record with the most explicit string in a file where each record begins with a unique 1-5 digit integer and is padded with various other characters when it is shorter than 5 digits. For example, my file has records that begin with:
32000
3201X
32014
320xy
In this example, the non-numeric characters represent wildcards. I thought the following regex examples would work but rather than match the record with the MOST explicit number, they always match the record with the LEAST explicit number. Remember, I do not know what is in the file so I need to test all possibilities to locate the MOST explicit match.
If I need to search for 32000, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3200\D|^32000/
It should match 32000 but it matches 320xy
If I need to search for 32014, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3201\D|^32014/
It should match 32014 but it matches 320xy
If I need to search for 32015, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3201\D|^32015/
It should match 3201X but it matches 320xy
In each case, the matched result is the LEAST specific numeric value. I also tried reversing the regex as follows but still get the same results:
/^32014|^3201\D|^320\D{2}|^32\D{3}|^3\D{4}/
Any help is much appreciated.
Okay, if you want to match a string literally then use anchors. Then specify the string you want matched. For instance, to match '123456xyz', where the xyz can be anything except digits, use:
'^123456[^0-9]{3}$'
If you prefer specific letters to match at the end, if they will always be x y or z then use:
'^123456[xyz]{3}$'
Note the ^ and $ anchor the string to start with 123456 and end with three letters that are x, y or z.
Good luck!
Ok, I did quite some tinkering here. I am 99% sure that this is pretty much impossible (if we don't cheat and interpolate code into the regex). The reason is you will need a negative lookbehind with variable length at some point.
However, I came up with two alternatives. One is if you want just to find the "most exact match", the second one is if you want to replace it with something. Here we go:
/(32000)|\A(?!.*32000).*(3200\D)|\A(?!.*3200[0\D]).*(320\D\D)|\A(?!.*320[0\D][0\D]).*(32\D\D\D)|\A(?!.*32[0\D][0\D][0\D]).*(3\D\D\D\D)/m
Question:
So what is my "most exact match" here?
Answer:
The concatenation of the 5 matched groups - \1\2\3\4\5. In fact always only one of them will match, the other 4 will be empty.
/(32000)|\A(?!.*32000)(.*)(3200\D)|\A(?!.*3200[0\D])(.*)(320\D\D)|\A(?!.*320[0\D][0\D])(.*)(32\D\D\D)|\A(?!.*32[0\D][0\D][0\D])(.*)(3\D\D\D\D)/m
Question:
How can I use this to replace my "most exact match"?
Answer:
In this case your "most exact match" will be the concatenation of \1\3\5\7\9, but we will have also matched some other things before that, namely \2\4\6\8 (again, only one of these can be non empty). Therefore if you want to replace your "most exact match" with fubar you can match with the above regex and replace with \2\4\6\8fubar
Another way you can think about it (and might be helpful) is that your "most exact match" will be the last matched line of either of the two regexes.
Two things to note here:
I used Ruby-style regexes: \A means the beginning of the string (not the beginning of a line, which is ^), and the /m flag turns on multiline mode (in Ruby, this lets . match newlines too). You should be able to find syntax for the same things in your language/technology as long as it uses some flavor of PCRE.
This can be slow. If we don't find an exact match, we might have to match and replace the entire string (if the non-exact match is found at the end of the string).
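If a single regex is not a hard requirement, a plain loop sidesteps the problem. This is a different technique from the regexes above, not a translation of them: it tries one pattern per specificity level, most specific first, and returns the first record that matches. It is only a sketch in Python; the function name most_explicit and the in-memory record list are assumptions for illustration, and it takes the question's convention that non-digit characters are wildcards.

```python
import re

def most_explicit(records, number):
    """Return the record whose leading digits match `number` most specifically.

    Hypothetical helper: tries the full number first, then progressively
    shorter digit prefixes padded with non-digit wildcard characters.
    """
    for keep in range(len(number), 0, -1):
        padding = r"\D" * (len(number) - keep)
        pattern = re.compile(re.escape(number[:keep]) + padding)
        for record in records:
            # match() anchors at the start of the record, like ^ in the question
            if pattern.match(record):
                return record
    return None

records = ["32000", "3201X", "32014", "320xy"]
print(most_explicit(records, "32000"))  # 32000
print(most_explicit(records, "32014"))  # 32014
print(most_explicit(records, "32015"))  # 3201X
```

The cost is up to five passes over the records instead of one, which is usually acceptable for a file that fits in memory.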

Ant regex expression

Quite a simple one in theory but can't quite get it!
I want a regex in ant which matches anything as long as it has a slash on the end.
Below is what I expect to work
<regexp id="slash.end.pattern" pattern="*/"/>
However this throws back
java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*/
^
I have also tried escaping this to \*, but that matches a literal *.
Any help appreciated!
Your original regex pattern didn't work because * is a special character in regex that is only used to quantify other characters.
The pattern (.)*/$, which you mentioned in your comment, will match any string of characters not containing newlines, however it uses a possibly unnecessary capturing group. .*/$ should work just as well.
If you need to match newline characters, the dot . won't be enough. You could try something like [\s\S]*/$
On that note, it should be mentioned that you might not want to use $ in this pattern. Suppose you have the following string:
abc/def/
Should this be evaluated as two matches, abc/ and def/? Or is it a single match containing the whole thing? Your current approach creates a single match. If instead you would like to search for strings of characters and then stop the match as soon as a / is found, you could use something like this: [\s\S]*?/.
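The difference between the anchored and the lazy pattern is easy to check. Ant uses java.util.regex, but the sketch below uses Python's re module as a stand-in, on the assumption that these particular constructs (*, $, [\s\S], *?) behave the same way in both.

```python
import re

# A bare quantifier has nothing to repeat, so the pattern is rejected,
# much like the PatternSyntaxException in the question.
try:
    re.compile("*/")
    bare_star_ok = True
except re.error:
    bare_star_ok = False

text = "abc/def/"
whole = re.search(r".*/$", text).group(0)   # one match, anchored at the end
pieces = re.findall(r"[\s\S]*?/", text)     # lazy: stop at each slash
no_slash = re.search(r".*/$", "abc/def")    # no trailing slash, no match

print(bare_star_ok)  # False
print(whole)         # abc/def/
print(pieces)        # ['abc/', 'def/']
print(no_slash)      # None
```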

Simple regex for matching up to an optional character?

I'm sure this is a simple question for someone at ease with regular expressions:
I need to match everything up until the character #
I don't want the string following the # character, just the stuff before it, and the character itself should not be matched. This is the most important part, and what I'm mainly asking. As a second question, I would also like to know how to match the rest, after the # character. But not in the same expression, because I will need that in another context.
Here's an example string:
topics/install.xml#id_install
I want only topics/install.xml. And for the second question (separate expression) I want id_install
First expression:
^([^#]*)
Second expression:
#(.*)$
[a-zA-Z0-9]*[\#]
If your string contains any other special characters you need to add them into the first square bracket escaped.
I don't use C#, but I will assume that it uses PCRE... if so,
"([^#]*)#.*"
with a call to 'match'. A call to 'search' does not need the trailing ".*"
The parens define the 'keep group'; the [^#] means any character that is not a '#'
You probably tried something like
"(.*)#.*"
and found that it fails when multiple '#' signs are present (keeping the leading '#'s)?
That is because ".*" is greedy, and will match as much as it can.
Your matcher should have a method that looks something like 'group(...)'. Most matchers
return the entire matched sequence as group(0), the first paren-matched group as group(1),
and so forth.
PCRE is so important that I strongly encourage you to search for it on Google, learn it, and always have it in your programming toolkit.
Use look ahead and look behind:
To get all characters up to, but not including the pound (#): .*?(?=\#)
To get all characters following, but not including the pound (#): (?<=\#).*
If you don't mind using groups, you can do it all in one shot:
(.*?)\#(.*) Your answers will be in group(1) and group(2). Notice the non-greedy construct, *?, which will attempt to match as little as possible instead of as much as possible.
If you want to allow for a missing # section, use ([^\#]*)(?:\#(.*))?. It uses a non-capturing group to test the second half, and if it finds it, returns everything after the pound.
Honestly though, for your situation, it is probably easier to use the Split method provided in String.
More on lookahead and lookbehind
first:
/[^\#]*(?=\#)/ edit: is faster than /.*?(?=\#)/
second:
/(?<=\#).*/
For something like this in C# I would usually skip the regular expressions stuff altogether and do something like:
string[] split = exampleString.Split('#');
string firstString = split[0];
string secondString = split[1];
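For comparison, here is how the patterns from the answers above behave on the example string. This is a sketch in Python's re module rather than C#, so the method names differ (group(1) instead of Groups[1]); the helper split_fragment is hypothetical and assumes a single-line input.

```python
import re

s = "topics/install.xml#id_install"

before = re.match(r"^([^#]*)", s).group(1)   # everything up to the first #
after = re.search(r"#(.*)$", s).group(1)     # everything after the first #

# Greedy .* runs to the LAST #, so earlier # signs end up in the group.
greedy = re.match(r"(.*)#.*", "a#b#c").group(1)

def split_fragment(text):
    """Hypothetical helper: second item is None when there is no #."""
    m = re.fullmatch(r"([^#]*)(?:#(.*))?", text)
    return m.group(1), m.group(2)

print(before)                                # topics/install.xml
print(after)                                 # id_install
print(greedy)                                # a#b
print(split_fragment("topics/install.xml"))  # ('topics/install.xml', None)
print(s.partition("#"))                      # ('topics/install.xml', '#', 'id_install')
```

The last line is the no-regex route: like Split in C#, partition does the job without a pattern, and it does not fail when the # is missing.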

How to get the inverse of a regular expression?

Let's say I have a regular expression that works correctly to find all of the URLs in a text file:
(http://)([a-zA-Z0-9\/\.])*
If what I want is not the URLs but the inverse - all other text except the URLs - is there an easy modification to make to get this?
You could simply search and replace everything that matches the regular expression with an empty string, e.g. in Perl s/(http:\/\/)([a-zA-Z0-9\/\.])*//g
This would give you everything in the original text, except those substrings that match the regular expression.
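The same idea in Python, as a rough sketch: re.sub deletes the matches, and re.split returns the text between them as a list, which may be closer to "the inverse". The sample text and its example.com URLs are made up for illustration, and the pattern is the question's, with the capture groups dropped so that split does not return them.

```python
import re

URL = re.compile(r"http://[a-zA-Z0-9/.]*")

text = "see http://example.com/a.html and http://example.org/b for details"

without_urls = URL.sub("", text)  # delete every URL
pieces = URL.split(text)          # the non-URL text, one item per gap

print(repr(without_urls))  # 'see  and  for details'
print(pieces)              # ['see ', ' and ', ' for details']
```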
If for some reason you need a regex-only solution, try this:
((?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))|\A(?!http://[a-zA-Z0-9\/\.#?/%])).+?((?=http://[a-zA-Z0-9\/\.#?/%])|\Z)
I expanded the set of URL characters a little ([a-zA-Z0-9\/\.#?/%]) to include a few important ones, but this is by no means meant to be exact or exhaustive.
The regex is a bit of a monster, so I'll try to break it down:
(?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))
The first portion matches the end of a URL. http://[a-zA-Z0-9\/\.#?/%]+ matches the URL itself, while (?=[^a-zA-Z0-9\/\.#?/%]) asserts that the URL must be followed by a non-URL character so that we are sure we are at the end. A lookahead is used so that the non-URL character is sought but not captured. The whole thing is wrapped in a lookbehind (?<=...) to look for it as the boundary of the match, again without capturing that portion.
We also want to match a non-URL at the beginning of the file. \A(?!http://[a-zA-Z0-9\/\.#?/%]) matches the beginning of the file (\A), followed by a negative lookahead to make sure there's not a URL lurking at the start of the file. (This URL check is simpler than the first one because we only need the beginning of the URL, not the whole thing.)
Both of those checks are put in parentheses and OR'd together with the | character. After that, .+? matches the string we are trying to capture.
Then we come to ((?=http://[a-zA-Z0-9\/\.#?/%])|\Z). Here, we check for the beginning of a URL, once again with (?=http://[a-zA-Z0-9\/\.#?/%]). The end of the file is also a pretty good sign that we've reached the end of our match, so we should look for that, too, using \Z. As with the first big group, we wrap it in parentheses and OR the two possibilities together.
The | symbol requires the parentheses because its precedence is very low, so you have to explicitly state the boundaries of the OR.
This regex relies heavily on zero-width assertions (the \A and \Z anchors, and the lookaround groups). You should always understand a regex before you use it for anything serious or permanent (otherwise you might catch a case of perl), so you might want to check out Start of String and End of String Anchors and Lookahead and Lookbehind Zero-Width Assertions.
Corrections welcome, of course!
If I understand the question correctly, you can use search/replace...just wildcard around your expression and then substitute the first and last parts.
s/^(.*)(your regex here)(.*)$/$1$3/
I'm not sure if this will work exactly as you intend, but it might help:
Whatever you place in the brackets [] will be matched against. If you put ^ first within the brackets, i.e. [^a-zA-Z0-9/.], it will match any single character except those listed in the brackets.
http://www.regular-expressions.info/