^ and $ expressed in fundamental operations in regular expressions - regex

I've read a book where it states that all fundamental operations in regular expressions are concatatenation, or(|), closure(*) and parenthesis to override default precedence. Every other operation is just a shortcut for one or more fundamental operations.
For example, (AB)+ shortcut is expanded to (AB)(AB)* and (AB)? to (ε | AB) where ε is empty string. First of all, I looked up ASCII table and I am not sure which charcode is designated to empty string. Is it ASCII 0?
I'd like to figure out how to express the shortcuts ^ and $ as in ^AB or AB$ expression in the fundamental operations, but I am not sure how to do this. Can you help me out how this is expressed in fundamentals?

Regular expressions, the way they are defined in mathematics, are actually string generators, not search patterns. They are used as a convenient notation for a certain class of sets of strings. (Those sets can contain an infinite number of strings, so enumerating all elements is not practical.)
In a programming context, regexes are usually used as flexible search patterns. In mathematical terms we're saying, "find a substring of the target string S that is an element of the set generated by regex R". This substring search is not part of the regex proper; it's like there's a loop around the actual regex engine that tries to match every possible substring against the regex (and stops when it finds a match).
In fundamental regex terms, it's like there's an implicit .* added before and after your pattern. When you look at it this way, ^ and $ simply prevent .* from being added at the beginning/end of the regex.
As an aside, regexes (as commonly used in programming) are not actually "regular" in the mathematical sense; i.e. there are many constructs that cannot be translated to the fundamental operations listed above. These include backreferences (\1, \2, ...), word boundaries (\b, \<, \>), look-ahead/look-behind assertions ((?= ), (?! ), (?<= ), (?<! )), and others.
As for ε: It has no character code because the empty string is a string, not a character. Specifically, a string is a sequence of characters, and the empty string contains no characters.

^AB can be expressed as (εAB) ie an empty string followed by AB and AB$ can be expressed as (ABε) that's AB followed by an empty string.
The empty string is actually defined as '', that's a string of 0 length, so has no value in the ASCII table. However the C programming language terminates all strings with the ASCII NULL character, although this is not counted in the length of the string it still must be accounted for when allocating memory.
EDIT
As #melpomene pointed out in their comment εAB is equivalent to AB which makes the above invalid. Having talked to a work college I'm no longer sure how to do this or even if it's possible. Hopefully someone can come up with an answer.

Related

The most efficient lookahead substitute for jflex

I am writing tokenizer in jflex. I need to match words like interferon-a as one token, and words like interferon-alpha as three.
Obvious solution would be lookaheads, but they do not work in jflex. For a similar task, I wrote a function matching one additional wildcard character after the matched pattern, checking if it is a whitespace in java code and pushing it back with or without a part of the matched string.
REGEX = [:letter:]+\-[:letter:]\.
From string interferon-alpha it would match interferon-al.
Then, in Java code section it would check if the last character of the match is a whitespace. It is not, so -al would be pushed back and interferon returned.
In the case of interferon-a, whitespace would be pushed back and interferon returned.
However, this function does not work if matched string does not have anything succeeding. Also, it seems quite clunky. Hence, I was wondering if there is any 'nicer' way of ensuring that the following character is a whitespace without actually matching and returning it.
JFlex certainly has a lookahead facility, the same as (f)lex. Unlike Java regex lookahead assertions, the JFlex lookahead can only be applied at the end of a match, but it is otherwise similar. It is described in the Semantics section of JFlex manual:
In a lexical rule, a regular expression r may be followed by a look-ahead expression. A look-ahead expression is either $ (the end of line operator) or / followed by an arbitrary regular expression. In both cases the look-ahead is not consumed and not included in the matched text region, but it is considered while determining which rule has the longest match…
So you could certainly write the rule:
[:letter:]+\-[:letter:]/\s
However, you cannot put such a rule in a macro definition (REGEX = …), as the manual also mentions (in the section on macros):
The regular expression on the right hand side must be well formed and must not contain the ^, / or $ operators.
So the lookahead operator can only be used in a pattern rule.
Note that \s matches any whitespace character, including newline characters, while . does not match any newline character. I think that's what lead to your comment that REGEX = [:letter:]+\-[:letter:]\. "does not work if matched string does not have anything succeeding" (I'm guessing that you meant "does not have anything succeeding it on the same line, and also that you intended to write . rather than \.).
Rather than testing for following whitespace, you might (depending on your language) prefer to test for a non-word character:
[:letter:]+\-[:letter:]/\W
or to craft a more precise specification as a set of Unicode properties, as in the definition of \W (also found in the linked section of the JFlex manual).
Having said all that, I'd like to repeat the advice from my previous answer to a similar question of yours: put more specific patterns first. For example, using the following pair of patterns will guarantee that the first one picks up words with a single letter suffix, while avoiding the need to explicitly pushback.
[:letter:]+(-[:letter:])? { /* matches 'interferon' or 'interferon-a' */ }
[:letter:]+/-[:letter:]+ { /* matches only 'interferon' from 'interferon-alpha' */ }
Of course, in this case you could easily avoid the collision between the second pattern and the first pattern by using {2,} instead of + for the second repetition, but it's perfectly OK to rely on pattern ordering since it's often inconvenient to guarantee that patterns don't overlap.

What does (^?)* mean in this regex?

I have this regex:
^(^?)*\?(.*)$
If I understand correctly, this is the breakdown of what it does:
^ - start matching from the beginning of the string
(^?)* - I don't know know, but it stores it in $1
\? - matches a question mark
(.*)$ - matches anything until the end of the string
So what does (^?)* mean?
The (^?) is simply looking for the literal character ^. The ^ character in a regex pattern only has special meaning when used as the first character of the pattern or the first character in a grouping match []. When used outside those 2 positions the ^ is interpreted literally meaning in looks for the ^ character in the input string
Note: Whether or not ^ outside of the first and grouping position is interpreted literally is regex engine specific. I'm not familiar enough with LUA to state which it does
Lua does not have a conventional regexp language, it has Lua patterns in its place. While they look a lot like regexp, Lua patterns are a distinct language of their own that has a simpler set of rules and most importantly lacks grouping and alternation features.
Interpreted as a Lua pattern, the example will surprising a longtime regexp user since so many details are different.
Lua patterns are described in PiL, and at a first glance are similar enough to a conventional regexp to cause confusion. The biggest differences are probably the lack of an alternation operator |, parenthesis are only used to mark captures, quantifiers (?, -, +, and *) only apply to a character or character class, and % is the escape character not \. A big clue that this example was probably not written with Lua in mind is the lack of the Lua pattern quoting character % applied to any (or ideally, all) of the non-alphanumeric characters in the pattern string, and the suspicious use of \? which smells like a conventional regexp to match a single literal ?.
The simple answer to the question asked is: (^?)* is not a recommended form, and would match ^* or *, capturing the presence or absence of the caret. If that were the intended effect, then I would write it as (%^?)%* to make that clearer.
To see why this is the case, let's take the pattern given and analyze it as a Lua pattern. The entire pattern is:
^(^?)*\?(.*)$
Handed to string.match(), it would be interpreted as follows:
^ anchors the match to the beginning of the string.
( marks the beginning of the first capture.
^ is not at the beginning of the pattern or a character class, so it matches a literal ^ character. For clarity that should likely have been written as %^.
? matches exactly zero or one of the previous character.
) marks the end of the first capture.
* is not after something that can be quantified so it matches a literal * character. For clarity that should likely have been written as %*.
\ in a pattern matches itself, it is not an escape character in the pattern language. However, it is an escape character in a Lua short string literal, making the following character not special to the string literal parser which in this case is moot because the ? that follows was not special to it in any case. So if the pattern were enclosed in double or single quotes, then the \ would be absorbed by string parsing. If written in a long string (as [[^(^?)*\?(.*)$]], the backslash would survive the string parser, to appear in the pattern.
? matches exactly zero or one of the previous character.
( marks the beginning the second capture.
. matches any character at all, effectively a synonym for the class [\000-\255] (remember, in Lua numeric escapes are in decimal not octal as in C).
* matches zero or more of the previous character, greedily.
) marks the end of the second capture.
$ anchors the pattern to the end of the string.
So it matches and captures an optional ^ at the beginning of the string, followed by *, then an optional \ which is not captured, and captures the entire rest of the string. string.match would return two strings on success (either or both of which might be zero length), or nil on failure.
Edit: I've fixed some typos, and corrected an error in my answer, noticed by Egor in a comment. I forgot that in patterns, special symbols loose their specialness when in a spot where it can't apply. That makes the first asterisk match a literal asterisk rather than be an error. The cascade of that falls through most of the answer.
Note that if you really want a true regexp in Lua, there are libraries available that will provide it. That said, the built-in pattern language is quite powerful. If it is not sufficient, then you might be best off adopting a full parser, and use LPeg which can do everything a regexp can and more. It even comes with a module that provides a complete regexp syntax that is translated into an LPeg grammar for execution.
In this case, the (^?) refers to the previous string "^" meaning the literal character ^ as Jared has said. Check out regexlib for any further deciphering.
For all your Regex needs: http://regexlib.com/CheatSheet.aspx
It looks to me like the intent of the creator of the expression was to match any number of ^ before the question mark, but only wanted to capture the first instance of ^. However, it may not be a valid expression depending on the engine, as others have stated.

Regex to check if a string contains at least A-Za-z0-9 but not an &

I am trying to check if a string contains at least A-Za-z0-9 but not an &.
My experience with regexes is limited, so I started with the easy part and got:
.*[a-zA-Z0-9].*
However I am having troubling combining this with the does not contain an & portion.
I was thinking along the lines of ^(?=.*[a-zA-Z0-9].*)(?![&()]).* but that does not seem to do the trick.
Any help would be appreciated.
I'm not sure if this what you meant, but here is a regular expression that will match any string that:
contains at least one alpha-numeric character
does not contain a &
This expression ensures that the entire string is always matched (the ^ and $ at beginning and end), and that none of the characters matched are a "&" sign (the [^&]* sections):
^[^&]*[a-zA-Z0-9][^&]*$
However, it might be clearer in code to simply perform two checks, if you are not limited to a single expression.
Also, check out the \w class in regular expressions (it might be the better solution for catching alphanumeric chars if you want to allow non-ASCII characters).

Regular expression for parsing string inside ""

<A "SystemTemperatureOutOfSpec" >
What should be the regular expression for parsing the string inside "". In the above sample it is 'SystemTemperatureOutOfSpec'
In JavaScript, this regexp:
/"([^"]*)"/
ex.
> /"([^"]*)"/.exec('<A "SystemTemperatureOutOfSpec" >')[1]
"SystemTemperatureOutOfSpec"
Similar patterns should work in a bunch of other programming languages.
try this
string Exp = "\"!\"";
I am not sure I understand your question well but if you need to match everything between double quotes, here it is: /(?<=").*?(?=")/s
(?<=<A\s")(?<content>.*)(?="\s>)
Regular expressions don't get much easier than this, so you should be able to solve it by yourself. Here's how you go about doing that:
The first step is to try to define as precisely as possible what you want to find. Let's start with this: you want to find a quote, followed by some number of characters other than a quote, followed by a quote. Is that correct? If so, our pattern has three parts: "a quote", "some characters other than a quote", and "a quote".
Now all we need to do is figure out what the regular expressions for those patterns are.
A quote
For "a quote", the pattern is literally ". Regular expressions have special characters which you have to be aware of (*, ., etc). Anything that's not a special character matches itself, and " is one of those characters. For a complete list of special characters for your language, see the documentation.
Characters other than a quote
So now the question is, how do we match "characters other than a quote"? That sounds like a range. A range is square brackets with a list of allowable characters. If the list begins with ^ it means it is a list of not-allowed characters. We want any characters other than a quote, so that means [^"].
"Some"
That range just means any one of the characters in the range, but we want "some". "Some" usually means either zero-or-more, or one-or-more. You can place * after a part of an expression to mean zero-or-more of that part. Likewise, use + to mean one-or-more (and ? means zero-or-one). There are a few other variations, but that's enough for this problem.
So, "some characters other than a quote" is the range [^"] (any character other than a quote) followed by * (zero-or-more). Thus, [^"]*
Putting it all together
This is the easy part: just combine all the pieces. A quote, followed by some characters other than a quote, followed by a quote, is "[^"]*".
Capturing the interesting part
The pattern we have will now match your string. What you want, however, is just the part inside the quotes. For that you need a "capturing group", which is denoted by parenthesis. To capture a part of a regular expression, put it in parenthesis. So, if we want to capture everything but the beginning and ending quote, the pattern becomes "([^"]*)".
And that's how you learn regular expressions. Break your problem down into a precise statement composed of short sequences of characters, figure out the regular expression for each sequence, then put it all together.
The pattern in this answer may not actually be the perfect answer for you. There are some edge cases to worry about. For example, you may only want to match a quote following a non-word character, or only quotes at the beginning or end of a word. That's all possible, but is highly dependent on your exact problem. Figuring out how to do that is just as easy though -- decide what you want, then look at the documentation to see how to accomplish that.
Spend one day practicing on regular expressions and you'll never have to ask anyone for help with regular expressions for the rest of your career. They aren't hard, but they do require concentrated study.
Are you sure you need regular expression matching here? Looking at your "string" you might be better off using a Xml parser?

Regex for matching a character, but not when it's enclosed in quotes

I need to match a colon (':') in a string, but not when it's enclosed by quotes - either a " or ' character.
So the following should have 2 matches
something:'firstValue':'secondValue'
something:"firstValue":'secondValue'
but this should only have 1 match
something:'no:match'
If the regular expression implementation supports look-around assertions, try this:
:(?:(?<=["']:)|(?=["']))
This will match any colon that is either preceeded or followed by a double or single quote. So that does only consider construct like you mentioned. something:firstValue would not be matched.
It would be better if you build a little parser that reads the input byte-by-byte and remembers when quotation is open.
Regular expressions are stateless. Tracking whether you are inside of quotes or not is state information. It is, therefore, impossible to handle this correctly using only a single regular expression. (Note that some "regular expression" implementations add extensions which may make this possible; I'm talking solely about "true" regular expressions here.)
Doing it with two regular expressions is possible, though, provided that you're willing to modify the original string or to work with a copy of it. In Perl:
$string =~ s/['"][^'"]*['"]//g;
my $match_count = $string =~ /:/g;
The first will find every sequence consisting of a quote, followed by any number of non-quote characters, and terminated by a second quote, and remove all such sequences from the string. This will eliminate any colons which are within quotes. (something:"firstValue":'secondValue' becomes something:: and something:'no:match' becomes something:)
The second does a simple count of the remaining colons, which will be those that weren't within quotes to start with.
Just counting the non-quoted colons doesn't seem like a particularly useful thing to do in most cases, though, so I suspect that your real goal is to split the string up into fields with colons as the field delimiter, in which case this regex-based solution is unsuitable, as it will destroy any data in quoted fields. In that case, you need to use a real parser (most CSV parsers allow you to specify the delimiter and would be ideal for this) or, in the worst case, walk through the string character-by-character and split it manually.
If you tell us the language you're using, I'm sure somebody could suggest a good parser library for that language.
Uppps ... missed the point. Forget the rest. It's quite hard to do this because regex is not good at counting balanced characters (but the .NET implementation for example has an extension that can do it, but it's a bit complicated).
You can use negated character groups to do this.
[^'"]:[^'"]
You can further wrap the quotes in non-capturing groups.
(?:[^'"]):(?:[^'"])
Or you can use assertion.
(?<!['"]):(?!['"])
I've come up with the following slightly worrying construction:
(?<=^('[^']*')*("[^"]*")*[^'"]*):
It uses a lookbehind assertion to make sure you match an even number of quotes from the beginning of the line to the current colon. It allows for embedding a single quote inside double quotes and vice versa. As in:
'a":b':c::"':" (matches at positions 6, 8 and 9)
EDIT
Gumbo is right, using * within a look behind assertion is not allowed.
You can try to catch the strings withing the quotes
/(?<q>'|")([\w ]+)(\k<q>)/m
First pattern defines the allowed quote types, second pattern takes all Word-Digits and spaces.
Very good on this solution is, it takes ONLY Strings where opening and closing quotes match.
Try it at regex101.com