Can you have overlapping characters in Regex? - regex

Okay a bit of a weird one. I know you can do things a bit different and get what you want, but I am just curious whether the functionality exists somewhere or somehow in a single regex line.
Here is a sample expression:
(?s)^\\sqrt[^A-Za-z].*?(\{\\rho\})
           ^            ^
           1            2
Character 1 [^A-Za-z] is checking for a delimiter.
Character 2 \{ might be that delimiter. It also might be a space, or a ton of other random characters.
However, even if the delimiter is a space, \{ must exist, which means [ {] is not ideal.
Is it possible to just confirm that the spot filled by character 1 is not a letter, without having it count as a consumed character? The logic would be something like:
if ("(?s)^\\sqrt[^A-Za-z]" matches) {
Proceed to evaluate as "(?s)^\\sqrt.*?(\{\\rho\})"
}

The logic you described fits the negative lookahead behavior: it makes sure the text after the current position does not match its pattern.
Use
(?s)^\\sqrt(?![A-Za-z]).*?(\{\\rho\})
Here, ^ matches the start of string, then \\sqrt matches \sqrt and after it, the regex engine asserts that there is no ASCII letter right after it with (?![A-Za-z]) negative lookahead. Then, .*?(\{\\rho\}) goes on to match the rest.
See the regex demo.
Also, for more details, see another SO thread describing negative lookahead behavior.
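A minimal Python sketch of the difference between the two patterns, assuming Python's re module is an acceptable stand-in for the engine used in the question. The sample strings are made up for illustration:

```python
import re

# Question's pattern: [^A-Za-z] consumes the delimiter, so a "{" sitting
# directly after \sqrt is used up before \{ gets a chance to match.
consuming = re.compile(r"(?s)^\\sqrt[^A-Za-z].*?(\{\\rho\})")

# Answer's pattern: (?![A-Za-z]) only peeks at the next character
# and consumes nothing.
lookahead = re.compile(r"(?s)^\\sqrt(?![A-Za-z]).*?(\{\\rho\})")

# Hypothetical inputs chosen for illustration.
samples = [r"\sqrt{\rho}", r"\sqrt {\rho}", r"\sqrtx{\rho}"]

consuming_hits = [bool(consuming.search(s)) for s in samples]
lookahead_hits = [bool(lookahead.search(s)) for s in samples]

print(consuming_hits)  # [False, True, False]
print(lookahead_hits)  # [True, True, False]
```

Only the lookahead version accepts the input where the delimiter is the brace itself.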

Related

regex doesn't match the word if it's not the last word

I'm trying to write a regex that matches a word in a string under these conditions:
the word must be 8 characters long.
the word must have 1 alphabetic character at any position.
the word must have 7 digits at any position.
\b(?=\w{8}\z)(?=[^a-zA-Z]*[a-zA-Z]{1})(?=(?:[\D]*[\d]){7}).*\b
This finds "123r1234" and "foo 123r1234", but it doesn't find "foo bar 123r1234 foo".
I tried adding word boundaries, but that didn't work.
What is wrong with my regex, and how can I fix it?
Thanks.
You can use the following regex:
\b(?=[^a-zA-Z]*[a-zA-Z])(?=(?:\D*\d){7})\w{8}\b
See demo
There are several things to note here:
It is not necessary to enclose single shorthand classes (like \d) into character classes (pattern becomes too awkward and less readable). Thus, use \D instead of [\D].
As a rule, the number of look-aheads should equal the number of conditions minus 1 (see Fine-Tuning: Removing One Condition at rexegg.com). Length-restriction look-aheads over a single character or character class are usually good candidates for moving into the base pattern. Here, \w{8} can take the place of both the (?=\w{8}) look-ahead and the .* at the end.
The (?=\w{8}\z) look-ahead contains an end-of-string \z anchor that forces a match at the end of the string, while you need (as now I know) the end of a word.
[a-zA-Z]{1} is equal to [a-zA-Z] since {1} means exactly one repetition, so it is redundant (again, regex patterns should be as clean and concise as they can be).
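As a quick check, here is a small Python sketch that runs the suggested pattern against the three inputs from the question. It assumes Python's re module; the variable names are illustrative only:

```python
import re

# Suggested pattern: two look-aheads for the letter and the 7 digits,
# with the 8-character length check folded into the base pattern.
WORD = re.compile(r"\b(?=[^a-zA-Z]*[a-zA-Z])(?=(?:\D*\d){7})\w{8}\b")

# The three inputs mentioned in the question.
texts = ["123r1234", "foo 123r1234", "foo bar 123r1234 foo"]

found = [WORD.findall(t) for t in texts]
print(found)  # [['123r1234'], ['123r1234'], ['123r1234']]
```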
UPDATE (+1 goes to @Jonny5)
There is another way of approaching the current problem: by having the word contain 8 word characters, but matching only 1 letter enclosed with any number of digits. This can be achieved with
(?i)\b(?=\w{8}\b)\d*[a-z]\d*\b
See another demo (Note i modifier is used here)
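A hedged Python comparison of the two patterns from this answer (the second test string is a made-up edge case, not from the question). One observable difference is that the look-aheads in the first pattern are not confined to the current word, whereas the second pattern requires the letter to sit inside the word itself:

```python
import re

first = re.compile(r"\b(?=[^a-zA-Z]*[a-zA-Z])(?=(?:\D*\d){7})\w{8}\b")
second = re.compile(r"(?i)\b(?=\w{8}\b)\d*[a-z]\d*\b")

# Input from the question, with an upper-case letter to exercise (?i).
sample = "foo bar 123R1234 foo"
first_sample = first.findall(sample)
second_sample = second.findall(sample)

# Hypothetical edge case: eight digits, with the only letter in a later word.
edge = "12345678 a"
first_edge = first.findall(edge)    # the letter look-ahead reaches past the word
second_edge = second.findall(edge)  # the letter must be part of the word

print(first_sample, second_sample)  # ['123R1234'] ['123R1234']
print(first_edge, second_edge)      # ['12345678'] []
```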
You can remove the trailing .* and replace it with an 8-character counter, \w{8} (which also makes the (?=\w{8}\z) look-ahead unnecessary):
\b(?=[^a-zA-Z]*[a-zA-Z])(?=(?:[\D]*[\d]){7})\w{8}\b
You can view it running here:
https://regex101.com/r/bX6rK8/1

Condition for max character limit and on minimum character putting condition

I am trying to do the following match using regex.
The input should consist of capital letters and be 2 to 10 characters long.
If it is 2 characters long, allow it only if neither the first nor the second character is A, E, I, O, or U.
I tried:
[B-DF-HJ-NP-TV-XZ]{2,10}
It works well, but I am not too sure if this is the right and most efficient way to do regex here.
All credit to Jerry, for his answer:
^(?:(?![AEIOU])[A-Z]{2}|[A-Z]{3,10})$
Explanation:
^ = "start of string", and $ = "end of string". This is useful for preventing false matches (e.g. a 10-character match from an 11 character input, or "MR" matching in "AMRXYZ").
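(?![AEIOU]) is a negative look-ahead for the characters A, E, I, O and U, i.e. the regex will not match if the character at that position is a vowel. It appears only in the first branch of the alternation (|), so vowels are still allowed in longer matches. Note that, as written, the look-ahead is checked once, before the first letter, so only the first character of a 2-letter input is guaranteed not to be a vowel; to exclude vowels from both positions, repeat the check per character: (?:(?![AEIOU])[A-Z]){2}.
The rest is fairly obvious, given the understanding of regex you've already demonstrated in your question above.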
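A short Python sketch, under the assumption that the requirement really is "no vowel in either position of a 2-letter input". It compares the pattern as posted with a variant that repeats the look-ahead for each character. The names as_posted, per_char and accepts are hypothetical helpers for this illustration:

```python
import re

# Pattern as posted: the look-ahead guards only the first character.
as_posted = re.compile(r"^(?:(?![AEIOU])[A-Z]{2}|[A-Z]{3,10})$")

# Variant: the look-ahead is repeated for each of the two characters.
per_char = re.compile(r"^(?:(?:(?![AEIOU])[A-Z]){2}|[A-Z]{3,10})$")

def accepts(pattern, text):
    """Return True if the whole text satisfies the pattern."""
    return pattern.search(text) is not None

for word in ["BC", "AB", "BA", "ABC", "ABCDEFGHIJK"]:
    print(word, accepts(as_posted, word), accepts(per_char, word))
```

The two patterns differ only on inputs such as "BA", where the vowel is in the second position.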

Regular expression to match non-integer values in a string

I want to match the following rules:
One dash is allowed at the start of a number.
Only values between 0 and 9 should be allowed.
I currently have the following regex pattern. I'm matching the inverse so that I can throw an exception upon finding a match that doesn't follow the rules:
[^-0-9]
The downside to this pattern is that it works for all cases except a hyphen in the middle of the String will still pass. For example:
"-2304923" is allowed correctly but "9234-342" is also allowed and shouldn't be.
Please let me know what I can do to specify the first character as [^-0-9] and the rest as [^0-9]. Thanks!
This regex will work for you:
^-?\d+$
Explanation: ^ anchors the start of the string, then - is optional (?), then the digit \d is repeated one or more times (+), and the string must end there ($).
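A minimal sketch of using this pattern for validation in Python (an assumption; the question does not say which language is in use). The is_integer helper is a made-up name, and re.ASCII restricts \d to 0-9:

```python
import re

# Optional leading hyphen, then one or more ASCII digits, nothing else.
INTEGER = re.compile(r"^-?\d+$", re.ASCII)

def is_integer(text: str) -> bool:
    """Return True if text is an optionally negative run of digits."""
    return INTEGER.search(text) is not None

for candidate in ["-2304923", "9234-342", "42", "-", ""]:
    print(repr(candidate), is_integer(candidate))
```

In Python, $ also matches just before a trailing newline; re.fullmatch with the pattern -?\d+ avoids that if it matters.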
You can do this:
(?:^|\s)(-?\d+)(?:["'\s]|$)
^^^^^^^^                     non-capturing group for start of line or space
        ^^^^^^^              capture number
               ^^^^^^^^^^^^  non-capturing group for end of line, space or quote
See it work
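This will capture every run of digits, with an optional hyphen in front, that is delimited by whitespace, a quote, or the line boundaries:
-2304923" "9234-342" 1234 -1234
++++++++                         captured
          ^^^^^^^^               NOT captured
                     ++++        captured
                          +++++  captured only if it has its own leading whitespace
Each match consumes its trailing delimiter, so when a single space separates 1234 and -1234, that space is used up by the first match and -1234 is skipped.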
I don't understand how your pattern - [^-0-9] is matching those strings you are talking about. That pattern is just the opposite of what you want. You have simply negated the character class by using caret(^) at the beginning. So, this pattern would match anything except the hyphen and the digits.
Anyway, for your requirement, you first need to match an optional hyphen at the beginning, so keep it outside the character class and make it optional with ?. Then, to match the digits that follow, you can use [0-9]+ or \d+.
So, your pattern to match the required format should be:
-?[0-9]+ // or -?\d+
The above regex finds the pattern anywhere in a larger string. If you want the entire string to match this pattern, add anchors at both ends of the regex:
^-?[0-9]+$
For a regular expression like this, it's sometimes helpful to think of it in terms of two cases.
Is the first character messed up somehow?
If not, are any of the other characters messed up somehow?
Combine these with |
(^[^-0-9]|^.+?[^0-9])
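A small Python sketch of this "match the inverse" approach, assuming the goal is to flag invalid input (the is_invalid helper is a hypothetical name):

```python
import re

# Case 1: the first character is neither a hyphen nor a digit.
# Case 2: some character after the first is not a digit.
INVALID = re.compile(r"(^[^-0-9]|^.+?[^0-9])")

def is_invalid(text: str) -> bool:
    """Return True if text breaks either of the two rules."""
    return INVALID.search(text) is not None

for candidate in ["-2304923", "9234-342", "a12", "12", "1-"]:
    print(repr(candidate), is_invalid(candidate))
```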

Why do I get successful but empty regex matches?

I'm searching the pattern (.*)\\1 on the text blabl with regexec(). I get successful but empty matches in regmatch_t structures. What exactly has been matched?
The regex .* can match successfully a string of zero characters, or the nothing that occurs between adjacent characters.
So your pattern is matching zero characters in the parens, and then matching zero characters immediately following that.
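So if your regex were /f(.*)\1/ and the text were "fab", it would match only the 'f': the group captures the empty string between the 'f' and the 'a', and \1 matches that empty string again. (On "foo" the engine can do better: the group captures the first 'o' and \1 matches the second.)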
You might try using .+ instead of .*, as that matches one or more instead of zero or more. (Using .+ you should match the 'oo' in 'foo')
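To illustrate, a minimal sketch using Python's re module as a stand-in for regexec() (an assumption; the two engines differ in general, but they agree on these inputs):

```python
import re

# (.*)\1 on "blabl": no non-empty prefix is immediately repeated, so the
# group falls back to the empty string and \1 re-matches that emptiness.
m = re.search(r"(.*)\1", "blabl")
empty_span, empty_group = m.span(), m.group(1)

# (.+)\1 demands at least one character, so there is no match at all here.
plus_on_blabl = re.search(r"(.+)\1", "blabl")

# On "foo", (.+)\1 finds the doubled 'o'.
plus_on_foo = re.search(r"(.+)\1", "foo").group(0)

print(empty_span, repr(empty_group))  # (0, 0) ''
print(plus_on_blabl)                  # None
print(plus_on_foo)                    # oo
```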
\1 is the backreference typically used for replacement later or when trying to further refine your regex by getting a match within a match. You should just use (.*), this will give you the results you want and will automatically be given the backreference number 1. I'm no regex expert but these are my thoughts based on my limited knowledge.
As an aside, I always revert back to RegexBuddy when trying to see what's really happening.
\1 is the "re-match" instruction. The question is, do you want to re-match immediately (e.g., BLABLA)
/(.+)\1/
or later (e.g., BLAahemBLA)
/(.+).*\1/
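A brief Python sketch of the two variants on the example strings above (Python's re is assumed here purely for illustration):

```python
import re

# Re-match immediately: the captured text must repeat right away.
immediate = re.search(r"(.+)\1", "BLABLA")

# Re-match later: anything may sit between the capture and its repeat.
later = re.search(r"(.+).*\1", "BLAahemBLA")

# The immediate form does not match when something sits in between.
immediate_with_gap = re.search(r"(.+)\1", "BLAahemBLA")

print(immediate.group(0), immediate.group(1))  # BLABLA BLA
print(later.group(0), later.group(1))          # BLAahemBLA BLA
print(immediate_with_gap)                      # None
```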

Need a simple RegEx to find a number in a single word

I've got the following URL route and I want to make sure that a segment of the route only accepts numbers. To do that, I can provide a regex which checks the word.
/page/{currentPage}
So, can someone give me a regex which matches when the word is a number (any int) greater than 0 (i.e. 1 <-> int.max)?
/^[1-9][0-9]*$/
Problems with other answers:
/([1-9][0-9]*)/ // Will match -1 and foo1bar
#[1-9]+# // Will not match 10, same problems as the first
[1-9] // Will only match one digit, same problems as first
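The difference between the anchored and unanchored forms can be sketched in Python (an assumption; the question does not name a language, and the candidate strings are illustrative):

```python
import re

anchored = re.compile(r"^[1-9][0-9]*$")    # the whole word must be the number
unanchored = re.compile(r"([1-9][0-9]*)")  # finds a number anywhere

candidates = ["1", "10", "0", "03", "-1", "foo1bar"]

anchored_ok = [bool(anchored.search(c)) for c in candidates]
unanchored_ok = [bool(unanchored.search(c)) for c in candidates]

print(anchored_ok)    # [True, True, False, False, False, False]
print(unanchored_ok)  # [True, True, False, True, True, True]
```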
If you want it greater than 0, use this regex:
/([1-9][0-9]*)/
This'll work as long as the number doesn't have leading zeros (like '03').
However, I recommend just using a simple [0-9]+ regex, and validating the number in your actual site code.
This one would address your specific problem. This expression
/\/page\/(0*[1-9][0-9]*)/ or "Perl-compatible" /\/page\/(0*[1-9]\d*)/
should capture any non-zero number, even 0-filled. And because it doesn't even look for a sign, - after the slash will not fit the pattern.
The problem that I have with eyelidlessness' expression is that, likely you do not already have the number isolated so that ^ and $ would work. You're going to have to do some work to isolate it. But a general solution would not be to assume that the number is all that a string contains, as below.
/(^|[^0-9-])(0*[1-9][0-9]*)([^0-9]|$)/
And the two tail-end groups, you could replace with word boundary marks (\b), if the RE language had those. Failing that you would put them into non-capturing groups, if the language had them, or even lookarounds if it had those--but it would more likely have word boundaries before lookarounds.
Full Perl-compatible version:
/(?<![\d-])(0*[1-9]\d*)\b/
I chose a negative lookbehind instead of a word boundary, because '-' is not a word-character, and so -1 will have a "word boundary" between the '-' and the '1'. And a negative lookbehind will match the beginning of the string--there just can't be a digit character or '-' in front.
You could say that the zero-width assertion ^ is just one of the cases that satisfies the zero-width assertion (?<![\d-]).
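A hedged Python sketch of the lookbehind version; Python's re supports fixed-width lookbehind, so the pattern carries over unchanged. The page_numbers helper and the sample inputs are made up for illustration:

```python
import re

# Not preceded by a digit or '-', optional leading zeros, at least one
# non-zero digit, and a word boundary after the number.
NUMBER = re.compile(r"(?<![\d-])(0*[1-9]\d*)\b")

def page_numbers(text: str) -> list[str]:
    """Return every positive number found in text, leading zeros included."""
    return NUMBER.findall(text)

for sample in ["/page/100", "/page/007", "/page/-5", "/page/0", "1"]:
    print(sample, page_numbers(sample))
```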
string testString = @"/page/100";
string pageNumber = Regex.Match(testString, "/page/([1-9][0-9]*)").Groups[1].Value;
If not matched pageNumber will be ""
While Jeremy's regex isn't perfect (should be tested in context, against leading characters and such), his advice is good: go for a generic, simple regex (eg. if you must use it in Apache's mod_rewrite) but by any means, handle the final redirect in server's code (if you can) and do a real check of parameter's validity there.
Otherwise, I would improve Jeremy's expression with bounds: /\b([1-9][0-9]*)$/
Of course, a regex cannot provide a check against any max int, at best you can control the number of digits: /\b([1-9][0-9]{0,2})$/ for example.
This will match any string such that, if it contains /page/, it must be followed by a number, not consisting of only zeros.
^(?!.*?/page/([0-9]*[^0-9/]|0*/))
(?! ) is a negative look-ahead. It matches (as an empty string) only if its contained pattern does not match from the current position.