Understanding regex with OR - regex

I have a regular expression like this: ('0'|['0'‐'9']+'.'['0'‐'9' 'a'‐'f']*)
In order to test it I am using a handy tool called http://www.regexpal.com/
The thing is that I am getting stuck when trying to understand the logic, inserting a '0' is fine but then I don't get why the OR prevents inserting other characters. Any explanation is appreciated.

I'm not sure you understand how the brackets in the regex are working. It isn't the OR part that is preventing you.
('0'|['0'‐'9']+'.'['0'‐'9' 'a'‐'f']*)
Will match either '0' with the quotes or for example 0000000'z''''9 or anything else like it. The quotes are treated as literal and the period must be escaped because it is a wildcard.
(0|[0-9]+\.[0-9a-f]*)
May be what you are looking for. This will match values such as 0 or 23. or 3.14159

There are numerous problems in your regex (as others have pointed out), but I'll explain something about alternations.
Most regex flavors will short-circuit alternations.
This means that you should reorder it, if you want it to match the other expression first.

Related

Regular expression to select single words inside parentheses using stringr

I specified that I'm using stringr because its character escaping is not "standard" regex escaping.
I want to detect strings that have a single word inside parentheses. Thus, I want to detect
"Men's shirt (blue)"
and not detect
"Blade Runner (Director's cut)"
If it helps simplify the regex, all the parenthetical parts are always at the end of the string.
I have attempted
str_detect(my_string, "\\(\\w?\\)") which yields no results and
str_detect(my_string, "\\(\\S\\)$") which returns everything including multiple words
as well as various combinations using //S, with or without the $, etc.
When I look for other stack overflow answers, I usually find slightly different questions whose answers are simply "use this:" along with what seems like an incomprehensible regex using lookaheads and other seemilgly too-complicated things. I thank you for a little explanation on why the (probably obvious and simple) regex works.
Looks like I didn't hit upon the //S+ in my combinations. I was checking for a single non-space character, not one or more non-space characters. This solution works:
str_detect(desc, "\\(\\S+\\)$")

Regular Expression Ends

I have scoured the web in the past few hours trying to figure out why in the world one of my colleagues insists on using (?!.) as a last-character in his regular expressions instead of the usual $.
Some of the regular expressions I've seen have been ^.*.txt(?!.) which begin with the usual ^, but do not end with the $. I have not been able to find any definitive or time-efficient reasons, any pros and cons or differences at all?
$ may match end of line rather than end of input (this depends on modifiers used). Perhaps this is the reason.
In my opinion, the best way to match the end of input is \z - which means exactly end of input, regardless of modifiers. It is supported in most (if not all) regex implementations.
The only possible difference is with multiline
asdf$ :
http://rubular.com/r/B2cNEL1pln
asdf(?!.) :
http://rubular.com/r/rbhKi1lKGI
^.*\.txt(?!.) means match (beginning)(anything 0 or more times).txt and is not followed by anything.
You can get more info on the ?! pattern here.
If you look here, it says that using the m or s modifiers, you can modify the behavior of ^ and $, to match beginning or end of line, rather than the whole string. There's also an ms. So, I guess with (?!.), you can match the end of the entire multi-line string.
So, I wouldn't say using this is better. Rather, I would say you need to know exactly what you're looking for or what you actually intend to do, within a single-lined string, or multi-lined string and how you want to parse your input to get one-line or multi-line strings, before passing into the regexp.
I think many of us run regexps on single-lined strings and therefore do not feel a difference between the two syntaxes.

PCRE Regex Syntax

I guess this is more or less a two-part question, but here's the basics first: I am writing some PHP to use preg_match_all to look in a variable for strings book-ended by {}. It then iterates through each string returned, replaces the strings it found with data from a MySQL query.
The first question is this: Any good sites out there to really learn the ins and outs of PCRE expressions? I've done a lot of searching on Google, but the best one I've been able to find so far is http://www.regular-expressions.info/. In my opinion, the information there is not well-organized and since I'd rather not get hung up having to ask for help whenever I need to write a complex regex, please point me at a couple sites (or a couple books!) that will help me not have to bother you folks in the future.
The second question is this: I have this regex
"/{.*(_){1}(.*(_){1}[a-z]{1}|.*)}/"
and I need it to catch instances such as {first_name}, {last_name}, {email}, etc. I have three problems with this regex.
The first is that it sees "{first_name} {last_name}" as one string, when it should see it as two. I've been able to solve this by checking for the existence of the space, then exploding on the space. Messy, but it works.
The second problem is that it includes punctuation as part of the captured string. So, if you have "{first_name} {last_name},", then it returns the comma as part of the string. I've been able to partially solve this by simply using preg_replace to delete periods, commas, and semi-colons. While it works for those punctuation items, my logic is unable to handle exclamation points, question marks, and everything else.
The third problem I have with this regex is that it is not seeing instances of {email} at all.
Now, if you can, are willing, and have time to simply hand me the solution to this problem, thank you as that will solve my immediate problem. However, even if you can do this, please please provide an lmgfty that provides good web sites as references and/or a book or two that would provide a good education on this subject. Sites would be preferable as money is tight, but if a book is the solution, I'll find the money (assuming my local library system is unable to procure said volume).
Back then I found PHP's own PCRE syntax reference quite good: http://uk.php.net/manual/en/reference.pcre.pattern.syntax.php
Let's talk about your expression. It's quite a bit more verbose than necessary; I'm going to simplify it while we go through this.
A rather simpler way of looking at what you're trying to match: "find a {, then any number of letters or underscores, then a }". A regular expression for that is (in PHP's string-y syntax): '/\{[a-z_]+\}/'
This will match all of your examples but also some wilder ones like {__a_b}. If that's not an option, we can go with a somewhat more complex description: "find a {, then a bunch of letters, then (as often as possible) an underscore followed by a bunch of letters, then a }". In a regular expression: /\{([a-z]+(_[a-z]+)*\}/
This second one maybe needs a bit more explanation. Since we want to repeat the thing that matches _foo segments, we need to put it in parentheses. Then we say: try finding this as often as possible, but it's also okay if you don't find it at all (that's the meaning of *).
So now that we have something to compare your attempt to, let's have a look at what caused your problems:
Your expression matches any characters inside the {}, including } and { and a whole bunch of other things. In other words, {abcde{_fgh} would be accepted by your regex, as would {abcde} fg_h {ijkl}.
You've got a mandatory _ in there, right after the first .*. The (_){1} (which means exactly the same as _) says: whatever happens, explode if this ain't here! Clearly you don't actually want that, because it'll never match {email}.
Here's a complete description in plain language of what your regex matches:
Match a {.
Match a _.
Match absolutely anything as long as you can match all the remaining rules right after that anything.
Match a _.
Match a single letter.
Instead of that _ and the single letter, absolutely anything is okay, too.
Match a }.
This is probably pretty far from what you wanted. Don't worry, though. Regular expressions take a while to get used to. I think it's very helpful if you think of it in terms of instructions, i.e. when building a regular expression, try to build it in your head as a "find this, then find that", etc. Then figure out the right syntax to achieve exactly that.
This is hard mainly because not all instructions you might come up with in your head easily translate into a piece of a regular expression... but that's where experience comes in. I promise you that you'll have it down in no time at all... if you are fairly methodical about making your regular expressions at first.
Good luck! :)
For PCRE, I simply digested the PCRE manpages, but then my brain works that way anyway...
As for matching delimited stuff, you generally have 2 approaches:
Match the first delimiter, match anything that is not the closing delimiter, match the closing delimiter.
Match the first delimiter, match anything ungreedily, match the closing delimiter.
E.g. for your case:
\{([^}]+)\}
\{(.+?)\} - Note the ? after the +
I added a group around the content you'd likely want to extract too.
Note also that in the case of #1 in particular but also for #2 if "dot matches anything" is in effect (dotall, singleline or whatever your favourite regex flavour calls it), that they would also match linebreaks within - you'd need to manually exclude that and anything else you don't want if that would be a problem; see the above answer for if you want something more like a whitelist approach.
Here's a good regex site.
Here's a PCRE regex that will work: \{\w+\}
Here's how it works:
It's basically looking for { followed by one ore more word characters followed by }. The interesting part is that the word character class actually includes an underscore as well. \w is essentially shorthand for [A-Za-z0-9_]
So it will basically match any combination of those characters within braces and because of the plus sign will only match braces that are not empty.

The Greedy Option of Regex is really needed?

The Greedy Option of Regex is really needed?
Lets say I have following texts, I like to extract texts inside [Optionx] and [/Optionx] blocks
[Option1]
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
[/Option2]
But with Regex Greedy Option, its give me
Start=1
End=10
[/Option1]
[Option2]
Start=11
End=20
Anybody need like that? If yes, could you let me know?
If I understand correctly, the question is “why (when) do you need greedy matching?”
The answer is – almost always. Consider a regular expression that matches a sequence of arbitrary – but equal – characters, of length at least two. The regular expression would look like this:
(.)\1+
(\1 is a back-reference that matches the same text as the first parenthesized expression).
Now let’s search for repeats in the following string: abbbbbc. What do we find? Well, if we didn’t have greedy matching, we would find bb. Probably not what we want. In fact, in most application s we would be interested in finding the whole substring of bs, bbbbb.
By the way, this is a real-world example: the RLE compression works like that and can be easily implemented using regex.
In fact, if you examine regular expressions all around you will see that a lot of them use quantifiers and expect them to behave greedily. The opposite case is probably a minority. Often, it makes no difference because the searched expression is inside guard clauses (e.g. a quoted string is inside the quote marks) but like in the example above, that’s not always the case.
Regular expressions can potentially match multiple portion of a text.
For example consider the expression (ab)*c+ and the string "abccababccc". There are many portions of the string that can match the regular expressions:
(abc)cababccc
(abcc)ababccc
abcc(ababccc)
abccab(abccc)
ab(c)cababccc
ab(cc)ababccc
abcabab(c)ccc
....
some regular expressions implementation are actually able to return the entire set of matches but it is most common to return a single match.
There are many possible ways to determine the "winning match". The most common one is to take the "longest leftmost match" which results in the greedy behaviour you observed.
This is tipical of search and replace (a la grep) when with a+ you probably mean to match the entire aaaa rather than just a single a.
Choosing the "shortest non-empty leftmost" match is the usual non-greedy behaviour. It is the most useful when you have delimiters like your case.
It all depends on what you need, sometimes greedy is ok, some other times, like the case you showed, a non-greedy behaviour would be more meaningful. It's good that modern implementations of regular expressions allow us to do both.
If you're looking for text between the optionx blocks, instead of searching for .+, search for anything that's not "[\".
This is really rough, but works:
\[[^\]]+]([^(\[/)]+)
The first bit searches for anything in square brackets, then the second bit searches for anything that isn't "[\". That way you don't have to care about greediness, just tell it what you don't want to see.
One other consideration: In many cases, greedy and non-greedy quantifiers result in the same match, but differ in performance:
With a non-greedy quantifier, the regex engine needs to backtrack after every single character that was matched until it finally has matched as much as it needs to. With a greedy quantifier, on the other hand, it will match as much as possible "in one go" and only then backtrack as much as necessary to match any following tokens.
Let's say you apply a.*c to
abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc. This finds a match in 5 steps of the regex engine. Now apply a.*?c to the same string. The match is identical, but the regex engine needs 101 steps to arrive at this conclusion.
On the other hand, if you apply a.*c to abcbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb, it takes 101 steps whereas a.*?c only takes 5.
So if you know your data, you can tailor your regex to match it as efficiently as possible.
just use this algorithm which you can use in your fav language. No need regex.
flag=0
open file for reading
for each line in file :
if check "[/Option" in line:
flag=0
if check "[Option" in line:
flag=1
continue
if flag:
print line.strip()
# you can store the values of each option in this part

Regular expression that rejects all input?

Is is possible to construct a regular expression that rejects all input strings?
Probably this:
[^\w\W]
\w - word character (letter, digit, etc)
\W - opposite of \w
[^\w\W] - should always fail, because any character should belong to one of the character classes - \w or \W
Another snippets:
$.^
$ - assert position at the end of the string
^ - assert position at the start of the line
. - any char
(?#it's just a comment inside of empty regex)
Empty lookahead/behind should work:
(?<!)
The best standard regexs (i.e., no lookahead or back-references) that reject all inputs are (after #aku above)
.^
and
$.
These are flat contradictions: "a string with a character before its beginning" and "a string with a character after its end."
NOTE: It's possible that some regex implementations would reject these patterns as ill-formed (it's pretty easy to check that ^ comes at the beginning of a pattern and $ at the end... with a regular expression), but the few I've checked do accept them. These also won't work in implementations that allow ^ and $ to match newlines.
(?=not)possible
?= is a positive lookahead. They're not supported in all regexp flavors, but in many.
The expression will look for "not", then look for "possible" starting at the same position (since lookaheads don't move forward in the string).
One example of why such thing could possibly be needed is when you want to filter some input with regexes and you pass regex as an argument to a function.
In spirit of functional programming, for algebraic completeness, you may want some trivial primary regexes like "everything is allowed" and "nothing is allowed".
To me it sounds like you're attacking a problem the wrong way, what exactly
are you trying to solve?
You could do a regular expression that catches everything and negate the result.
e.g in javascript:
if (! str.match( /./ ))
but then you could just do
if (!foo)
instead, as #[jan-hani] said.
If you're looking to embed such a regex in another regex, you
might be looking for $ or ^ instead, or use lookaheads like #[henrik-n] mentioned.
But as I said, this looks like a "I think I need x, but what I really need is y" problem.
Why would you even want that? Wouldn't a simple if statment do the trick? Something along the lines of:
if ( inputString != "" )
doSomething ()
[^\x00-\xFF]
It depends on what you mean by "regular expression". Do you mean regexps in a particular programming language or library? In that case the answer is probably yes, and you can refer to any of the above replies.
If you mean the regular expressions as taught in computer science classes, then the answer is no. Every regular expression matches some string. It could be the empty string, but it always matches something.
In any case, I suggest you edit the title of your question to narrow down what kinds of regular expressions you mean.
[^]+ should do it.
In answer to aku's comment attached to this, I tested it with an online regex tester (http://www.regextester.com/), and so assume it works with JavaScript. I have to confess to not testing it in "real" code. ;)
EDIT:
[^\n\r\w\s]
Well,
I am not sure if I understood, since I always thought of regular expression of a way to match strings. I would say the best shot you have is not using regex.
But, you can also use regexp that matches empty lines like ^$ or a regexp that do not match words/spaces like [^\w\s] ...
Hope it helps!