What does a-z-A-Z mean in a regular expression? - regex

I've been working with someone else's code and I ran across the regular expression [^0-9a-z-A-Z]. This bears close resemblance to the common [^0-9a-zA-Z] which is meant to exclude non-alphanumeric characters, but note the extra dash in the middle, between the lowercase z and uppercase A.
I'm not very familiar with regular expressions, but I've read several pages on them now, and none of the rules I've seen seem to cover what this syntax would mean. Perhaps it's not even valid syntax, but the Golang regex interpreter doesn't seem to mind. I'd appreciate any clarification. Thanks.

A dash in a character class in a place where it cannot be interpreted as a range is interpreted as a literal dash. So the expression excludes the characters 0 to 9, a to z, A to Z, and -. That's why there's no syntax error.
It's probably a typo though. If the dash is meant to be there, then to prevent confusion it should be escaped and/or moved out from between the ranges, such as [^0-9a-zA-Z\-]

It excludes the minus sign.
You can test regex handily here: http://www.regexr.com/

Related

Understanding regex with OR

I have a regular expression like this: ('0'|['0'‐'9']+'.'['0'‐'9' 'a'‐'f']*)
In order to test it I am using a handy tool called http://www.regexpal.com/
The thing is that I am getting stuck when trying to understand the logic, inserting a '0' is fine but then I don't get why the OR prevents inserting other characters. Any explanation is appreciated.
I'm not sure you understand how the brackets in the regex are working. It isn't the OR part that is preventing you.
('0'|['0'‐'9']+'.'['0'‐'9' 'a'‐'f']*)
Will match either '0' with the quotes or for example 0000000'z''''9 or anything else like it. The quotes are treated as literal and the period must be escaped because it is a wildcard.
(0|[0-9]+\.[0-9a-f]*)
May be what you are looking for. This will match values such as 0 or 23. or 3.14159
There are numerous problems in your regex (as others have pointed out), but I'll explain something about alternations.
Most regex flavors will short-circuit alternations.
This means that you should reorder it, if you want it to match the other expression first.

E-mail address validation using Regular Expressions

I'm writing a simple, small app that allows me to share information. I have a question on using regx to validate email address.
I'm kind learning on my own. But when it comes to real-world examples, such that strings that can be validated with regular expressions, I'm kind stuck.
Exercise:
Untangle the following regular expression that validates an email address:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
It looks like a jumble of characters.
Can someone please explain to me how does this work?
I try to use this online resources by by Jan Goyvaerts.
Any help I will appreciate it.
First of all, there is a good thread about totally the same thing:
Using a regular expression to validate an email address
Then, below there is the explanation of your regular expression:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
- The square brackets represent the symbol class, containing all the symbols which are in the square brackets. The plus sign ('+') is a quantifier, which means that the sequence of symbols, represented by this symbol class must be at least one character long.
Also, the '+' is greedy, and, therefore, this part of the pattern will match the symbol sequence of the maximal possible length.
Talking about the square brackets contents, 'a-z' means any symbol in a range, which could be described mathematically as [a, z], and '0-9' is similar. All the other symbols are just symbols in this case.
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
- In Regular Expressions, the brackets represent grouping, and the asterisk ('*') is a greedy quantifier, which means "occurs zero or more times". So here we are not sure if we are going to find the brackets content, but we do not rule out the possibility.
Then, inside the brackets, we see the ?: character combination, which, being put inside brackets tells us that the symbol group inside should not be captured as a sub-string for the further reference.
Going further, \. means just a usual dot (see Escape sequence), since a dot symbol is a meta-symbol in Regex.
After the dot we see again the character of symbols, explained above.
#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
- Here we see the at symbol ('#'), which is just a symbol here, then there is a non-capturing symbol group, which will occur one or more times (because of + after it), and which includes a single symbol of [a-z0-9] class and another non-capturing group of symbols, which contents you can totally describe using my explanations above except for a question mark sign ('?'), which means "either once or not at all" in this context (i.e. if it is used as a quantifier).
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
- This last part is similar to what is found in a symbol group, explained above, so I believe you have now enough information to understand it.
More on quantifier types here: Greedy vs. Reluctant vs. Possessive Quantifiers.
A good Regular Expressions reference: Regular Expression Language - Quick Reference
Some information on capturing in Regular Expressions: Regex Tutorial - Parentheses for Grouping and Capturing
About special characters: Regex Tutorial - Literal Characters and Special Characters
Regex statements can be a fun yet tricky to follow. There are 5 parts to this statement.
One valid characters for a username
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
check for a single '.' and any additional amount of characters
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*
The '#' symbol
Valid second / lower level domain
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
A valid top level domain
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
I recommend http://www.ultrapico.com/expresso.htm. It will break the statement down for you.
I've found a remarkable tool for visualizing regular expressions here: http://regexper.com
It shows me that your regular expression breaks down like this. Hopefully this helps explain it.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+
This looks for at least one of of the characters given here (a-z, 0-9, and those special characters).
(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)
This looks for the same as above, but only when it stands after a dot. This part is optional and can be repeated indefinitely. It prevents dots at the end of the name.
#
Matches the # symbol
(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
This matches a-z, 0-9 ending with a dot and optional - in the middle ending with a dot. This has to be matched at least once.
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
This looks for a-z or 0-9, optionally followed by a-z, 0-9, -, but it cant end with a - again.
Two Suggestions I have for you.
Escaping special characters is messy. 2. Email addresses are complicated. I probably recommend you to study this post if you are really interested. Please check out this other posts: Validation in Regex and Regex Help.
See this answer. The problem is probably too difficult to solve. Two problems you have here. 1. RegEx are not easy. 2. Escaping special characters is messy. Finally, Email addresses are complicated. I probably recommend you to study this post if you are really interested.

Regular expression for unusual characters?

What would the regular expression be to detect unusual characters, like the ones found here:
http://www.theworldofstuff.com/characters
So for example, what would the expression be to only allow letters, numbers and the symbols found on a keyboard (.$%^, etc.)?
You just have to list everything you want. Something like:
[0-9a-zA-Z!##$%^&*\(\)\\\?\{\[\]\}:;<>~`"'/+-\., =_]
Be careful to escape any characters that might be confused as part of the regex with \
You can check for the ASCII charset with the following regular expression:
/^[\x00-\x7F]*$/
In order to only match the printable part of ASCII:
/^[\x20-\x7E]*$/
Thought it'd be useful if I posted exactly how I did this in the end, so here you go:
[^0-9a-zA-Z !\"£$%^&*\\(\\)_\\-\\+\\={}\\[\\]:;#'~#<,>.\\?/`|§]
Not sure if it was just a Java thing, but I had a hell of a time working out which characters needed escaping. :p

Regular expression for parsing string inside ""

<A "SystemTemperatureOutOfSpec" >
What should be the regular expression for parsing the string inside "". In the above sample it is 'SystemTemperatureOutOfSpec'
In JavaScript, this regexp:
/"([^"]*)"/
ex.
> /"([^"]*)"/.exec('<A "SystemTemperatureOutOfSpec" >')[1]
"SystemTemperatureOutOfSpec"
Similar patterns should work in a bunch of other programming languages.
try this
string Exp = "\"!\"";
I am not sure I understand your question well but if you need to match everything between double quotes, here it is: /(?<=").*?(?=")/s
(?<=<A\s")(?<content>.*)(?="\s>)
Regular expressions don't get much easier than this, so you should be able to solve it by yourself. Here's how you go about doing that:
The first step is to try to define as precisely as possible what you want to find. Let's start with this: you want to find a quote, followed by some number of characters other than a quote, followed by a quote. Is that correct? If so, our pattern has three parts: "a quote", "some characters other than a quote", and "a quote".
Now all we need to do is figure out what the regular expressions for those patterns are.
A quote
For "a quote", the pattern is literally ". Regular expressions have special characters which you have to be aware of (*, ., etc). Anything that's not a special character matches itself, and " is one of those characters. For a complete list of special characters for your language, see the documentation.
Characters other than a quote
So now the question is, how do we match "characters other than a quote"? That sounds like a range. A range is square brackets with a list of allowable characters. If the list begins with ^ it means it is a list of not-allowed characters. We want any characters other than a quote, so that means [^"].
"Some"
That range just means any one of the characters in the range, but we want "some". "Some" usually means either zero-or-more, or one-or-more. You can place * after a part of an expression to mean zero-or-more of that part. Likewise, use + to mean one-or-more (and ? means zero-or-one). There are a few other variations, but that's enough for this problem.
So, "some characters other than a quote" is the range [^"] (any character other than a quote) followed by * (zero-or-more). Thus, [^"]*
Putting it all together
This is the easy part: just combine all the pieces. A quote, followed by some characters other than a quote, followed by a quote, is "[^"]*".
Capturing the interesting part
The pattern we have will now match your string. What you want, however, is just the part inside the quotes. For that you need a "capturing group", which is denoted by parenthesis. To capture a part of a regular expression, put it in parenthesis. So, if we want to capture everything but the beginning and ending quote, the pattern becomes "([^"]*)".
And that's how you learn regular expressions. Break your problem down into a precise statement composed of short sequences of characters, figure out the regular expression for each sequence, then put it all together.
The pattern in this answer may not actually be the perfect answer for you. There are some edge cases to worry about. For example, you may only want to match a quote following a non-word character, or only quotes at the beginning or end of a word. That's all possible, but is highly dependent on your exact problem. Figuring out how to do that is just as easy though -- decide what you want, then look at the documentation to see how to accomplish that.
Spend one day practicing on regular expressions and you'll never have to ask anyone for help with regular expressions for the rest of your career. They aren't hard, but they do require concentrated study.
Are you sure you need regular expression matching here? Looking at your "string" you might be better off using a Xml parser?

Can regular expressions work with different languages?

English, of course, is a no-brainer for regex because that's what it was originally developed in/for:
Can regular expressions understand this character set?
French gets into some accented characters which I'm unsure how to match against - i.e. are è and e both considered word characters by regex?
Les expressions régulières peuvent comprendre ce jeu de caractères?
Japanese doesn't contain what I know as regex word characters to match against.
正規表現は、この文字を理解でき、設定?
Short answer: yes.
More specifically it depends on your regex engine supporting unicode matches (as described here).
Such matches can complicate your regular expressions enormously, so I can recommend reading this unicode regex tutorial (also note that unicode implementations themselves can be quite a mess so you might also benefit from reading Joel Spolsky's article about the inner workings of character sets).
"[\p{L}]"
This regular expression contains all characters that are letters, from all languages, upper and lower case.
so letters like (a-z A-Z ä ß è 正 の文字を理解) are accepted but signs like (, . ? > :) or other similar ones are not.
the brackets [] mean that this expression is a set.
If you want unlimited number of letters from this set to be accepted, use an astrix * after the brackets, like this: "[\p{L}]*"
it is always important to make sure you take care of white space in your regex. since your evaluation might fail because of white space. To solve this you can use: "[\p{L} ]*" (notice the white space inside brackets)
If you want to include the numbers as well, "[\p{L|N} ]*" can help. p{N} matches any kind of numeric character in any script.
As far as I know, there isn't any specific pattern you can use i.e. [a-zA-Z] to match "è", but you can always match them in separately, i.e. [a-zA-Zè正]
Obviously that can make your regexp immense, but you can always control this by adding your strings into variables, and only passing the variables into the expressions.
Generally speaking, regex is more for grokking machine-readable text than for human-readable text. It is in many ways a more general answer to the whole XML with regex thing; regex is by its very nature incapable of properly parsing human language, because the language is more complex than what you are using to parse it.
If you want to break down human language (English included), you would want to use a language analysis tool or even an AI, not mere regular expressions.
/[\p{Latin}]/ should for example, include Latin alphabet. You can get the full explanation and reference here.
it is not about the regular expression but about framework that executes it. java and .net i think are very good in handling unicode. so "è and e both considered word characters by regex" is true.
It depends on the implementation and the character set. In general the answer is "Yes," but it may require additional setup on your part.
In Perl, for example, the meaning of things like \w is altered by the chosen locale (use locale).
This SO thread might help. It includes the Unicode character classes you can use in a regex (e.g., [Ll] is all lowercase letters, regardless of language).