Regular Expression for input validation - regex

I am trying to learn regular expressions and was hoping someone could help me out. I would appreciate it if someone could help me come up with a regular expression to validate that an input is of the form
Graph: XY5, YZ4, ST7
Each part, such as XY5, represents an edge in the graph, and the number is the edge weight. There can be any number of such edges.
This is what I have so far. It's probably not correct:
"^Graph:\\s{1}[A-ZA-Z\\d,\\s]+"

This might be what you're looking for:
/^Graph: (?:[A-Z]{2}\d(?:$|, ?))+/
See it here in action: http://regexr.com?309av
Here's an explanation of what the regex does (screenshot from RegexBuddy, which is probably the best tool for you if you're trying to learn Regular Expressions):

Try this
/^Graph:\s+[A-Z][A-Z]\d+(,\s+[A-Z][A-Z]\d+)*$/

You should explain your input format a little better. This might do it, from the single example I have and what you said. It does not allow a graph to be empty, which may or may not be part of your requirements.
"^Graph:(\s\w{2}\d+,?)+"
to explain:
^Graph: will cover the start of the line
(\s\w{2}\d+,?)+
\s matches a single whitespace character
\w{2} matches exactly 2 word characters, i.e. letters, digits, or underscores (hint: you could make this better!)
\d+ matches 1 or more digits, since I am assuming an edge weight can have two digits (such as 10)
,? matches an optional comma (hint: you could make this better as well, since it does not require a comma between entries; maybe use an alternation with the end-of-string anchor!)
I purposely left some room for improvement, because if you think of some of it on your own, you will accomplish your goal of becoming better with regular expressions.
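One way the hints above might be followed up is sketched below in Python. The stricter pattern (uppercase letters only, a comma and one space between entries) and the helper name parse_graph are assumptions for illustration, not part of the original answer.

```python
import re

# Assumed format: "Graph: " followed by one or more edges such as XY5,
# separated by a comma and a single space. Multi-digit weights are allowed.
GRAPH_RE = re.compile(r"Graph: [A-Z]{2}\d+(?:, [A-Z]{2}\d+)*")

def parse_graph(line):
    """Return a list of (start, end, weight) tuples, or None if invalid."""
    if GRAPH_RE.fullmatch(line) is None:
        return None
    body = line[len("Graph: "):]
    edges = re.findall(r"([A-Z])([A-Z])(\d+)", body)
    return [(a, b, int(w)) for a, b, w in edges]

print(parse_graph("Graph: XY5, YZ4, ST7"))
print(parse_graph("Graph: XY5 YZ4"))
```

Using fullmatch avoids having to remember both anchors, and writing the separator inside the repeated group means a trailing comma or a missing comma is rejected.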

Related

Lookahead in Oracle's regular expressions

I've got a problem with Oracle's regexes. I have lots of phone-numbers in different tables. Now my task is to unify them. So I take out all blanks, underscores, minuses and whatnot. But then the tricky part comes - which seemed so easy at first.
There are numbers both with and without international code, so e.g. 0046812345678 and 0812345678. So I want to replace one single (!) leading zero with '0046'. I thought that ^0(?=[1-9]) would do the job but Oracle seems to think that lookaheads are useless.
(^0)(1|2|3|4|5|6|7|8|9) doesn't do the job either (or (^01|02|03|04|05|06|07|08|09) for that matter) since it would replace the first non-zero number as well making 0812345678 into 004612345678 (so, the first '8' disappears).
I searched and tried for quite some time now and can't come up with any more possibilities. Any help would be greatly appreciated. Thanks in advance!
You need to add the first 1-9 to the result so that only numbers starting with a single 0 are matched. To keep the first 1-9 we capture it (using parentheses) and add it to the replace part (using \1). This seems to work:
select regexp_replace('0812345678', '^0([1-9])', '0046\1') from dual;
Result: 0046812345678
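For comparison, the same capture-and-reinsert idea can be sketched in Python. This is not Oracle code: the function names are made up for illustration, and the second function uses a lookahead only because Python's re module supports one.

```python
import re

def add_country_code(number):
    # Capture the first non-zero digit and put it back with \1,
    # mirroring the regexp_replace call above.
    return re.sub(r"^0([1-9])", r"0046\1", number)

def add_country_code_lookahead(number):
    # Equivalent in engines that support lookaheads: the digit is
    # checked but not consumed, so nothing needs to be re-inserted.
    return re.sub(r"^0(?=[1-9])", "0046", number)

print(add_country_code("0812345678"))
print(add_country_code("0046812345678"))
```

Numbers that already start with 00 are left alone, because the character after the leading zero is not in 1-9.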

What is wrong with my simple regex that accepts empty strings and apartment numbers?

So I wanted to limit a textbox which contains an apartment number which is optional.
Here is the regex in question:
([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)
Simple enough eh?
I'm using these tools to test my regex:
Regex Analyzer
Regex Validator
Here are the expected results:
Valid
"1234A"
"Z"
"(Empty string)"
Invalid
"A1234"
"fhfdsahds527523832dvhsfdg"
Obviously, if I'm here, the invalid ones are accepted by the regex. The goal of this regex is to accept either 1 to 4 digits with an optional letter, a single letter, or an empty string.
I just can't seem to figure out what's not working, I mean it is a simple enough regex we have here. I'm probably missing something as I'm not very good with regexes, but this syntax seems ok to my eyes. Hopefully someone here can point to my error.
Thanks for all help, it is greatly appreciated.
You need to use the ^ and $ anchors for your first two options as well. You can also fold the second option into the first one (which then covers the third variant as well):
^[0-9]{0,4}[A-Z]?$
Without the anchors your regular expression matches because it will just pick a single letter from anywhere within your string.
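A small Python check of the difference, using the test strings from the question. The helper names loose and strict are only for illustration.

```python
import re

UNANCHORED = re.compile(r"([0-9]{1,4}[A-Z]?)|([A-Z])|(^$)")
ANCHORED = re.compile(r"^[0-9]{0,4}[A-Z]?$")

def loose(text):
    # search() succeeds if any substring matches one of the alternatives.
    return UNANCHORED.search(text) is not None

def strict(text):
    # With ^ and $ the whole string has to fit the pattern.
    return ANCHORED.search(text) is not None

for sample in ["1234A", "Z", "", "A1234", "fhfdsahds527523832dvhsfdg"]:
    print(repr(sample), loose(sample), strict(sample))
```

The unanchored pattern accepts all five strings, since "A1234" contains a single capital letter and the long string contains a run of digits. The anchored one accepts only the first three.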
Depending on the language, you can also use a negative look ahead.
^[0-9]{0,4}[A-Za-z](?!.*[0-9])
Breakdown:
^[0-9]{0,4} = This matches zero to four digits at the beginning of the string
[A-Za-z] = This matches a single letter (either case)
(?!.*[0-9]) = This allows the letter only if no digits appear anywhere after it.
I haven't quite figured out how to validate against an empty string, but that might be easier done using tools from whatever language you are using. Something along this logic:
if the string doesn't equal $null, then check the regex
Something along those lines, just adjusted for however you would do it in your language.
I used RegEx Skinner to validate the answers.
Edit: Fixed error from comments
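The "check for empty first, then apply the regex" logic could look roughly like this in Python. The function name is hypothetical. Note that the lookahead pattern as written requires a letter, so a digits-only value such as "1234" would be rejected by this sketch.

```python
import re

LETTER_RE = re.compile(r"^[0-9]{0,4}[A-Za-z](?!.*[0-9])")

def is_valid_apartment(text):
    # The field is optional, so handle the empty string outside the regex.
    if text == "":
        return True
    return LETTER_RE.search(text) is not None

for sample in ["1234A", "Z", "", "A1234", "fhfdsahds527523832dvhsfdg"]:
    print(repr(sample), is_valid_apartment(sample))
```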

Match a line without number followed by "."

Update: I would like to match a line that starts with (" followed by a number and then anything except "." . For example
("10 Advanced topics 365" "#382")
is a match, while
("10.1 Approximation Algorithms 365" "#382")
is not a match.
My regex is
^\(\"\d+(?!\.).*?$
but it will match both examples above including the second one. So what am I missing here?
Thanks and regards!
While it's possible to write a RE that will match such a thing (see manji's answer) I hate such things; they're very hard to comprehend later on. I find it's easier to write an RE to match the case that you don't want, and then make the rest of the logic of the program conditional on that RE not matching. This is virtually always trivial to do.
EDIT:
Sometimes you can do better. If we're seeking to distinguish between the types of lines you describe, where good lines don't have a period after the first digit and there's always some text at that point:
("10 Advanced topics 365" "#382")
("10.1 Approximation Algorithms 365" "#382")
Then a regular expression of this form will suffice:
^\("\d+[^.\d].*
Potentially you might need more to properly match the remainder of the line more precisely (e.g., detecting whether it ends with the right character sequence) but that's separate.
Via update:
^\("\d[^.]*$
Try this pattern:
(?m)^(?!.*?\d\.).*$
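The reason the pattern in the question matches both lines is backtracking: when \d+ has consumed "10" and the lookahead sees ".", the engine gives back one digit, so \d+ matches just "1" and the lookahead then sees "0", which is not a period. A minimal Python sketch of that, with one possible fix (also forbidding a digit in the lookahead, which is my assumption about the intent):

```python
import re

good = '("10 Advanced topics 365" "#382")'
bad = '("10.1 Approximation Algorithms 365" "#382")'

# Pattern from the question: \d+ can back off to a shorter number.
original = re.compile(r'^\(\"\d+(?!\.).*?$')

# Forbidding a digit as well stops \d+ from ending in the middle of the number.
fixed = re.compile(r'^\("\d+(?![\d.]).*$')

for line in (good, bad):
    print(original.search(line) is not None, fixed.search(line) is not None)
```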

Regex href match a number

Well, here I am back at regex and my poor understanding of it. Spent more time learning it and this is what I came up with:
/(.*)
I basically want the number in this string:
510973
Is my regex almost right? My original was:
"/<a href=\"travis.php?theTaco(.*)\">(.*)<\/a>/";
But sometimes it returned me huge strings. So, I just want to get numbers only.
I searched through other posts but there is such a large amount of unrelated material, please give an example, resource, or a link directing to a very related question.
Thank you.
Try using an HTML parser provided by the language you are using.
Reason why your first regex fails:
[0-9999999] is not what you think. It is the same as [0-9], which matches one digit. To match a number you need [0-9]+. Also, .* is greedy and will try to match as much as it can. You can use .*? to make it non-greedy. Since you are trying to match a number again, use [0-9]+ again instead of .*. And if the two numbers you are capturing will always be the same, you can just capture the first and use a back reference \1 for the second.
There are also a few regex meta-characters that you need to escape, such as . and ?.
Try:
<a href=\"travis\.php\?theTaco=([0-9]+)\">\1<\/a>
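To illustrate the greedy-match problem in Python: the sample HTML below is a hypothetical input (two links on one line), guessed from the pattern above rather than taken from the question.

```python
import re

# Hypothetical input: two links on the same line.
html = ('<a href="travis.php?theTaco=510973">510973</a> '
        '<a href="travis.php?theTaco=42">42</a>')

# Greedy .* runs to the last "> on the line, so group 1 swallows most of it.
greedy = re.search(r'<a href="travis\.php\?theTaco=(.*)">(.*)</a>', html)

# [0-9]+ can only consume digits, and \1 requires the link text to repeat them.
strict = re.search(r'<a href="travis\.php\?theTaco=([0-9]+)">\1</a>', html)

print(greedy.group(1))
print(strict.group(1))
```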
To capture a number, you don't use a range like [0-99999]; you match it digit by digit. Something like [0-9]+ is closer to what you want for that section. Also, escaping is important, as codaddict said.
Others have already mentioned some issues regarding your regex, so I won't bother repeating them.
There are also issues regarding how you specified what it is you want. You can simply match via
/theTaco=(\d+)/
and take the first capturing group. You have not given us enough information to know whether this suits your needs.
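In Python, taking the first capturing group might look like the following sketch. The function name and the sample string are assumptions for illustration.

```python
import re

def taco_ids(text):
    # findall returns the contents of the single capturing group
    # for every occurrence of theTaco=<digits> in the text.
    return [int(n) for n in re.findall(r"theTaco=(\d+)", text)]

print(taco_ids('<a href="travis.php?theTaco=510973">510973</a>'))
```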

Regex to match a string that does not contain 'xxx'

One of my homework questions asked to develop a regex for all strings over x,y,z that did not contain xxx.
After doing some reading I found out about negative lookahead and made this which works great:
(x(?!xx)|y|z)*
Still, in the spirit of completeness, is there any way to write this without negative lookahead?
Reading I have done makes me think it can be done with some combination of carets (^), but I cannot get the right combination so I am not sure.
Taking it a step further, is it possible to exclude a string like xxx using only the or (|) operator, but still check the strings in a recursive fashion?
EDIT 9/6/2010:
Think I answered my own question. I messed with this some more, trying to make this regex with only or (|) statements, and I am pretty sure I figured it out... and it isn't nearly as messy as I thought it would be. If someone else has time to verify this with a human eye I would appreciate it.
(xxy|xxz|xy|xz|y|z)*(xxy|xxz|xx|xy|xz|x|y|z)
Try this:
^(x{0,2}(y|z|$))*$
The basic idea is this: match at most two x's, followed by another letter or the end of the string.
When you reach a point where you have three x's, the regex has no rule that allows it to keep matching, and it fails.
Working example: http://rubular.com/r/ePH0fHlZxL
A less compact way to write the same is (with free spaces, usually the /x flag):
^(
y| # y is ok
z| # so is z
x(y|z|$)| # a single x, not followed by x
xx(y|z|$) # 2 x's, not followed by x
)*$
Based on the latest edit, here's an even flatter version of the pattern. I'm not entirely sure I understand your fascination with the pipe, but you can eliminate some more options: by allowing an empty match on the second group you don't need to repeat permutations from the first group. That regex also allows ε, which I think is included in your language.
^(xxy|xxz|xy|xz|y|z)*(xx|x|)$
Basically you have the right answer already - well done you. :)
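Since the question asked for verification, here is a brute-force check sketched in Python. It compares each pattern against the plain test "xxx" not in s for every string over x, y, z up to a chosen length. The names and the length limit are arbitrary choices for the sketch.

```python
import re
from itertools import product

PATTERNS = {
    "lookahead": r"(x(?!xx)|y|z)*",
    "x{0,2}": r"^(x{0,2}(y|z|$))*$",
    "pipes only": r"^(xxy|xxz|xy|xz|y|z)*(xx|x|)$",
}

def disagreements(max_len=6):
    """List (pattern name, string) pairs where a pattern gives the wrong answer."""
    bad = []
    for n in range(max_len + 1):
        for chars in product("xyz", repeat=n):
            s = "".join(chars)
            expected = "xxx" not in s
            for name, pattern in PATTERNS.items():
                if (re.fullmatch(pattern, s) is not None) != expected:
                    bad.append((name, s))
    return bad

print(disagreements())
```

An empty list means the three patterns agree with the substring test on every string up to that length, which is evidence rather than a proof.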
A caret (^) in a set such as [^abc] matches only a single character that is not in that set, so its usefulness for excluding sequences of characters (i.e. strings) is limited.
Regex has numeric quantifiers {n} and {a,b}, which let you match a defined number of repetitions of a pattern. That would work for this specific case (because it's 'x' repeated), but it's not particularly expressive of the problem you're trying to solve (even for regex!) and is a bit brittle (it wouldn't be appropriate for negatively matching 'xyx', for example).
An or pattern again would be verbose and rather unexpressive but it could be done as the fragment:
(x|xx)[^x] // x OR xx followed by NOT x
Obviously you can do this with an iterative algorithm but that's highly inefficient compared to a regex.
Well done for thinking beyond the solution though.
I know you don't want to use lookahead, but here's another way to solve this:
^(?:(?!xxx)[xyz])*$
will match any line of characters x, y or z as long as it doesn't contain the string xxx.
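A quick Python sketch of how that pattern behaves. The lookahead is checked before every single character, so a match can never step into the start of xxx. The function name is made up for illustration.

```python
import re

def no_triple_x(text):
    # fullmatch plays the role of the ^ and $ anchors.
    return re.fullmatch(r"(?:(?!xxx)[xyz])*", text) is not None

for sample in ["xxyxxz", "", "xxx", "yxxxz", "abc"]:
    print(repr(sample), no_triple_x(sample))
```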