Special Characters in JFlex Regular Expression

I want to include all special characters in a regular expression in JFlex, so I prepared the one below.
P = ("_"|"-"|"/"|"."|","|"~"|"!"|"#"|"#"|"$"|"%"|"^"|"&"|"*"|"|"|"("|")"|"="|"+"|"|"|"\"|":"|";"|"""|"<"|">"|"?"|"`"|"{"|"}"|"["|"]"|"'")
Could somebody tell me whether there is a more compact way to cover all special characters?
Also, could you please point out what is wrong with the regex above? It gives me an "Unterminated string at end of line." error on compilation.

To include all special characters in a JFlex regular expression, I think it is easier to exclude digits, letters, newlines, tabs and spaces than to list every other possibility.
Use this regular expression:
[^0-9a-zA-Z\n\t ]?

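To fix the error, escape the backslash \ with another backslash: write "\\" instead of "\". As written, the \" is read as an escaped quote, so the string never closes. The same applies to """, which should be written "\"".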
An easier way to define these characters would be a character class.
[-/_.,~!##$%^&*|(){}\[\]<>?=+\\:;"'`]
You can keep adding characters you want to include to the class.
Note: You can reference the special characters at http://www.regular-expressions.info/characters.html
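The two approaches above (listing the wanted characters versus excluding letters, digits and whitespace) accept slightly different sets of characters. The sketch below illustrates the difference using JavaScript regular expressions rather than JFlex syntax, so the escaping details are not identical; the names listed, negated and samples are made up for this example.

```javascript
// Explicit list of punctuation, adapted from the character class above.
// In a JavaScript regex literal "/" is escaped, and "-" comes first so it is literal.
const listed = /^[-\/_.,~!#$%^&*|(){}\[\]<>?=+\\:;"'`]$/;

// Negated class: any single character that is not a letter, digit, newline, tab or space.
const negated = /^[^0-9a-zA-Z\n\t ]$/;

const samples = ['_', '\\', '"', ']', 'a', '7', ' ', '@'];
for (const ch of samples) {
  // Print the character and whether each class accepts it.
  console.log(JSON.stringify(ch), listed.test(ch), negated.test(ch));
}
```

Note that "@" is accepted by the negated class but not by the explicit list, because it was never listed. The negated form also lets through any other character you did not think of, which may or may not be what you want.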

Related

Grep Pattern matching- underline

I've not been able to find anything online to help so hoping someone may have an idea.
What does an underline in an expression mean when using grep?
For example: [_a-zA-Z0-9]
Could someone help to explain the purpose here?
The grep command uses a regular expression as it is also described in the manpage of grep:
A regular expression is a pattern that describes a set of strings. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions.
A quick reference of the regular expression syntax can be found here. To test regular expressions with several input strings I recommend regex101.
The pattern [_a-zA-Z0-9] means to match a single character in the list. The list is opened with [ and closed with ]. The underscore (_) has no special meaning; it is literally the underscore character. The minus character (-) denotes a range, for example from a to z (a-z).
In short, [_a-zA-Z0-9] matches a single character which is either _, a lowercase or uppercase letter of the alphabet, or a digit.
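As a quick check of that reading, here is a small sketch in JavaScript (not grep itself, although this particular class behaves the same way for ASCII input). The variable names are invented for the example.

```javascript
// Exactly one character from the class: underscore, ASCII letter, or digit.
const oneWordChar = /^[_a-zA-Z0-9]$/;

for (const ch of ['_', 'q', 'Q', '5', '-', ' ']) {
  console.log(JSON.stringify(ch), oneWordChar.test(ch));
}

// With + and the g flag, the same class picks out runs of such characters.
const runs = 'foo_bar-baz 42'.match(/[_a-zA-Z0-9]+/g);
console.log(runs);
```

The hyphen in "foo_bar-baz" splits the text into two runs because "-" is only a range operator inside the class, not a member of it.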

Period in .Net 3.5 Regex.IsMatch

I came across this regular expression in vb.net 3.5 code:
Regex.IsMatch(strString, "^[\w\s.+'\-\(\)\/\,\&\#]+$")
What is really confusing me is the ".+" part. I was under the impression that the period means any character and the plus sign means one or more. Following this, I feel like this regular expression should allow anything! But it doesn't, so I must be misunderstanding something. In testing it, it seems like the period and the plus sign are being taken as literals.
Could somebody help explain this to me?
Thanks!
The issue is that all of those characters are enclosed in a [character-group]. The escaping rules are different in character-groups than they are elsewhere in a RegEx expression. For instance, according to the MSDN documentation, \b inside a character-group means a backspace character whereas, outside of a character-group, it is an anchor that matches a word boundary.
According to the Regular-Expressions.info documentation:
In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^), and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.
Therefore, in your example RegEx expression, it looks for any one of the characters in that bracketed list, including either the literal . or + character. If you think about it, it wouldn't make any sense to use a . to mean "any character" inside of a character-group. Doing so would make the group, itself, moot. And certainly, using the + character to mean "one or more times" inside of a character-group really makes no sense.
.+ means any character, one or more times. Maybe you need to escape the dot, like \.+?
Within square brackets, dot and plus don't have their special meaning. The square brackets define a "character class". It does not contain a string but a set of characters allowed at this position.
So the expression [\w\s.+'\-\(\)\/\,\&\#] creates a character class of letters, digits, underscores, whitespace, dots, pluses, single quotes, minuses, opening round brackets, closing round brackets, slashes, commas, ampersands and hash marks.
The + after the closing square bracket means you expect one or more characters from this character class.
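To see that the dot and plus really are literals inside the brackets, the pattern from the question can be tried in JavaScript. This is only a sketch: .NET and JavaScript differ in details (for instance, \w is ASCII-only in JavaScript without the u flag), but the character-class behaviour discussed here is the same. The names allowed and anything are made up.

```javascript
// The pattern from the question, written as a JavaScript regex literal.
const allowed = /^[\w\s.+'\-\(\)\/\,\&\#]+$/;

// For contrast: outside a character class, .+ really does mean "one or more of anything".
const anything = /^.+$/;

for (const s of ['a.b+c', "O'Neil & Sons (UK) #1", 'a*b', '']) {
  console.log(JSON.stringify(s), allowed.test(s), anything.test(s));
}
```

The string "a*b" is rejected by the class (the asterisk is not in the list) but accepted by the unbracketed .+ pattern.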

Why is the closing bracket a special character that must be escaped to be taken as a literal?

It is clear that an opening bracket "(", among other characters, must be escaped (prefixed by a backslash) for the regex to contain a "literal opening bracket": Because there are regex options for which "(" is a lead-in.
But how comes the same holds true for the closing bracket ")"? There is no syntax construct that has ")" as a lead-in token, is there?
So why do I have to escape closing brackets for them to be taken literally?
Of course, the same question could be asked for the other closing brackets as well.
Sorry for this being a "why is this so?" question. It might possibly be un-answerable. But if there is a good reason, the only way to get to know it is by asking!
Addendum:
The rationale behind this question is:
For example, http://www.regexguru.com/2008/12/dont-escape-literal-characters-that-arent-metacharacters/ gives good reasons not to prefix characters that don't need prefixing.
And imho, the closing bracket does not need prefixing in most cases:
Since a closing bracket without an opening one is not part of a regex group, I find it totally illogical that it needs to be escaped in this case anyway.
Assume you want to match a group holding a closing bracket. Without escaping, this would look like ()). Escaping the bracket, as in (\)), makes it much easier (and perhaps only then possible) for the regular expression to be parsed correctly and unambiguously.
In the (unescaped) regular expression (\w)), does the closing bracket belong to the group or not, i.e., is the group closed by the first or the second )? E.g., for the string abc)d, does it match c or c)?
Of course one could omit some of the escape characters where the meaning is not ambiguous (and the regex parser allows it), but how would that help? You save a character here and there, but each time you encounter a ) or another special character you have to think: "Is this a control character or a character to be matched? Is it ambiguous?" Better to make it clear and consistent.
As a more specific example for tobias_k's answer:
Look at the following regex:
(a*))
Given the string bbaaa)bb, will it capture aaa or aaa)?
The result is clear with
(a*\))
versus
(a*)\)
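The two escaped variants can be run against the sample string to see which text ends up in the capture group. A minimal sketch in JavaScript; the variable names are invented for the example.

```javascript
const text = 'bbaaa)bb';

// Escaped bracket inside the group: the ")" is part of the capture.
const inside = text.match(/(a*\))/);

// Escaped bracket after the group: the ")" is matched but not captured.
const outside = text.match(/(a*)\)/);

console.log(inside[1]);  // aaa)
console.log(outside[1]); // aaa
```

Both patterns match the same overall text, aaa), and differ only in what group 1 holds.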
"Of course, the same question could be asked for the other closing brackets as well."
No, that's not correct (or at least it varies from one regex engine to another).
In the JavaScript regex engine, ] and } don't need to be escaped.
See this example:
var x = 'brackets)}]';
x.match(/]/); // works
x.match(/}/); // works
x.match(/)/); // fails
Only the third case fails, with the error Unmatched ')'.
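Because an invalid regex literal stops the whole script from parsing, the same comparison is easier to run with the RegExp constructor, which reports the problem as a catchable SyntaxError. A sketch, assuming a JavaScript engine in its default (non-Unicode) regex mode; the helper name compiles is hypothetical.

```javascript
// Returns true when the pattern source compiles, false on a regex syntax error.
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

for (const source of [']', '}', ')', '\\)']) {
  console.log(source, compiles(source));
}
```

With the u flag, JavaScript is stricter and rejects a lone ] or } as well, so this leniency is engine and mode specific.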

The Different Delimiters of Regex

When I look up regular expressions for various purposes, I see people using delimiters like /, #, !, and ~. Do these do anything different, or do they have the same effect?
They don't do anything different, they delimit the regular expression (in languages where it is needed).
The difference is: the behaviour of that character inside the regex does change. The regex delimiter becomes an additional special character and needs to be escaped (==> choose a delimiter that you don't need within the regex!).
Side note: In PHP you can even use a regex special character like + or | as the regex delimiter, but this works only when you don't need that character inside the regex (NOT recommended).
In some languages you can choose the delimiters, in others you can't.
You must escape that delimiter every time it appears in the regular expression. Choosing a delimiter that does not occur in the expression reduces the need for escaping, making the expression easier to read.
The following two regular expressions are identical, except that the first uses / as a delimiter, whereas the second uses #:
/http:\/\/example\.com\/.*\/foo\//
#http://example\.com/.*/foo/#
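JavaScript does not let you choose the delimiter of a regex literal (it is always /), but the same trade-off shows up when comparing a literal with the RegExp constructor, which takes a plain string and so has no delimiter to escape. A sketch with made-up variable names and a made-up sample URL:

```javascript
// Literal form: every "/" in the pattern has to be escaped.
const withSlashes = /http:\/\/example\.com\/.*\/foo\//;

// Constructor form: no delimiter, so "/" is plain; backslashes are doubled
// instead, because the pattern now lives inside a string literal.
const withoutDelimiter = new RegExp('http://example\\.com/.*/foo/');

const url = 'http://example.com/a/b/foo/';
console.log(withSlashes.test(url), withoutDelimiter.test(url));
```

Both forms describe the same pattern; only the escaping needed to write it down changes.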

Regular expression for parsing string inside ""

<A "SystemTemperatureOutOfSpec" >
What regular expression should I use to extract the string inside the double quotes? In the sample above it is SystemTemperatureOutOfSpec.
In JavaScript, this regexp:
/"([^"]*)"/
For example:
> /"([^"]*)"/.exec('<A "SystemTemperatureOutOfSpec" >')[1]
"SystemTemperatureOutOfSpec"
Similar patterns should work in a bunch of other programming languages.
Try this:
string Exp = "\"!\"";
I am not sure I understand your question well but if you need to match everything between double quotes, here it is: /(?<=").*?(?=")/s
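The lookaround pattern and the capture-group pattern /"([^"]*)"/ give the same result on the sample, but they diverge once a line holds more than one quoted string. The sketch below shows this in JavaScript (lookbehind needs an ES2018-capable engine); the input string and variable names are invented for illustration.

```javascript
const input = '<A "First" B "Second" >';

// Capture group: each match consumes both quotes, so pairs stay aligned.
const captured = [...input.matchAll(/"([^"]*)"/g)].map((m) => m[1]);

// Lookaround: the quotes are not consumed, so the text between a closing
// quote and the next opening quote also counts as "between two quotes".
const lookaround = input.match(/(?<=").*?(?=")/gs);

console.log(captured);
console.log(lookaround);
```

Here the lookaround version also returns " B ", the text between the two quoted strings, which is usually not wanted.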
(?<=<A\s")(?<content>.*)(?="\s>)
Regular expressions don't get much easier than this, so you should be able to solve it by yourself. Here's how you go about doing that:
The first step is to try to define as precisely as possible what you want to find. Let's start with this: you want to find a quote, followed by some number of characters other than a quote, followed by a quote. Is that correct? If so, our pattern has three parts: "a quote", "some characters other than a quote", and "a quote".
Now all we need to do is figure out what the regular expressions for those patterns are.
A quote
For "a quote", the pattern is literally ". Regular expressions have special characters which you have to be aware of (*, ., etc). Anything that's not a special character matches itself, and " is one of those characters. For a complete list of special characters for your language, see the documentation.
Characters other than a quote
So now the question is, how do we match "characters other than a quote"? That sounds like a range (more formally, a character class). A range is a pair of square brackets containing a list of allowable characters. If the list begins with ^, it is a list of not-allowed characters instead. We want any character other than a quote, so that means [^"].
"Some"
That range just means any one of the characters in the range, but we want "some". "Some" usually means either zero-or-more, or one-or-more. You can place * after a part of an expression to mean zero-or-more of that part. Likewise, use + to mean one-or-more (and ? means zero-or-one). There are a few other variations, but that's enough for this problem.
So, "some characters other than a quote" is the range [^"] (any character other than a quote) followed by * (zero-or-more). Thus, [^"]*
Putting it all together
This is the easy part: just combine all the pieces. A quote, followed by some characters other than a quote, followed by a quote, is "[^"]*".
Capturing the interesting part
The pattern we have will now match your string. What you want, however, is just the part inside the quotes. For that you need a "capturing group", which is denoted by parentheses. To capture a part of a regular expression, put it in parentheses. So, if we want to capture everything but the beginning and ending quote, the pattern becomes "([^"]*)".
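The build-up described above can be written out literally, one piece per step. This is just a sketch in JavaScript with hypothetical variable names; other languages would assemble the same pattern in much the same way.

```javascript
// The three pieces from the walkthrough.
const quote = '"';        // a quote matches itself
const notQuote = '[^"]';  // any one character other than a quote
const some = '*';         // zero or more of the preceding piece

// Quote, then a captured run of non-quote characters, then a quote.
const pattern = new RegExp(quote + '(' + notQuote + some + ')' + quote);

const match = pattern.exec('<A "SystemTemperatureOutOfSpec" >');
console.log(pattern.source); // "([^"]*)"
console.log(match[0]);       // "SystemTemperatureOutOfSpec" (with the quotes)
console.log(match[1]);       // SystemTemperatureOutOfSpec (captured part only)
```

Because the quantifier is *, an empty pair of quotes also matches and yields an empty capture; use + instead if that should not count.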
And that's how you learn regular expressions. Break your problem down into a precise statement composed of short sequences of characters, figure out the regular expression for each sequence, then put it all together.
The pattern in this answer may not actually be the perfect answer for you. There are some edge cases to worry about. For example, you may only want to match a quote following a non-word character, or only quotes at the beginning or end of a word. That's all possible, but is highly dependent on your exact problem. Figuring out how to do that is just as easy though -- decide what you want, then look at the documentation to see how to accomplish that.
Spend one day practicing on regular expressions and you'll never have to ask anyone for help with regular expressions for the rest of your career. They aren't hard, but they do require concentrated study.
Are you sure you need regular expression matching here? Looking at your "string", you might be better off using an XML parser.