Regular expression for unusual characters? - regex

What would the regular expression be to detect unusual characters, like the ones found here:
http://www.theworldofstuff.com/characters
So for example, what would the expression be to only allow letters, numbers and the symbols found on a keyboard (.$%^, etc.)?

You just have to list everything you want. Something like:
[0-9a-zA-Z!@#$%^&*\(\)\\\?\{\[\]\}:;<>~`"'/+-\., =_]
Be careful to escape, with a backslash (\), any characters that would otherwise be interpreted as regex syntax.
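If listing and escaping every symbol by hand feels error-prone, the escaping can be left to the regex library. Here is a minimal Python sketch of that idea, assuming "keyboard symbols" means ASCII letters, digits, punctuation and the space character. The names ALLOWED, KEYBOARD_ONLY and only_keyboard_chars are made up for illustration.

```python
import re
import string

# Assumption: "keyboard" characters = ASCII letters, digits, punctuation, space.
ALLOWED = string.ascii_letters + string.digits + string.punctuation + " "

# re.escape() backslash-escapes the characters that are special in a pattern
# (such as ], ^, - and \), so the list can sit safely inside [...].
KEYBOARD_ONLY = re.compile("[" + re.escape(ALLOWED) + "]*")


def only_keyboard_chars(text: str) -> bool:
    """Return True if every character of text is in the allowed set."""
    return KEYBOARD_ONLY.fullmatch(text) is not None


print(only_keyboard_chars("Pa$$w0rd [ok]!"))  # True
print(only_keyboard_chars("caf\u00e9"))       # False (contains an accented e)
```

Adjust ALLOWED if your keyboard layout has symbols outside ASCII, such as £ or §.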

You can check for the ASCII charset with the following regular expression:
/^[\x00-\x7F]*$/
In order to only match the printable part of ASCII:
/^[\x20-\x7E]*$/
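As a rough illustration, here is how those two ranges might be used from Python. The function names are invented for this sketch. It uses re.fullmatch rather than ^...$ because in Python $ also matches just before a trailing newline, which would let a string ending in "\n" slip through the printable check.

```python
import re


def is_ascii(text: str) -> bool:
    """True if every character is in the 7-bit ASCII range."""
    return re.fullmatch(r"[\x00-\x7F]*", text) is not None


def is_printable_ascii(text: str) -> bool:
    """True if every character is printable ASCII (space through tilde)."""
    return re.fullmatch(r"[\x20-\x7E]*", text) is not None


print(is_printable_ascii("Hello, world! $%^"))  # True
print(is_printable_ascii("tab\there"))          # False (tab is a control character)
print(is_ascii("tab\there"))                    # True
print(is_ascii("caf\u00e9"))                    # False
```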

Thought it'd be useful if I posted exactly how I did this in the end, so here you go:
[^0-9a-zA-Z !\"£$%^&*\\(\\)_\\-\\+\\={}\\[\\]:;@'~#<,>.\\?/`|§]
Not sure if it was just a Java thing, but I had a hell of a time working out which characters needed escaping. :p

Related

Understanding regex with OR

I have a regular expression like this: ('0'|['0'‐'9']+'.'['0'‐'9' 'a'‐'f']*)
In order to test it I am using a handy tool called http://www.regexpal.com/
The thing is that I am getting stuck trying to understand the logic. Entering a '0' is fine, but I don't get why the OR prevents entering other characters. Any explanation is appreciated.
I'm not sure you understand how the brackets in the regex are working. It isn't the OR part that is preventing you.
('0'|['0'‐'9']+'.'['0'‐'9' 'a'‐'f']*)
This will match either '0' (including the quotes) or, for example, 0000000'z''''9, or anything else like it. The quotes are treated as literal characters, and the period must be escaped because an unescaped period is a wildcard.
(0|[0-9]+\.[0-9a-f]*)
May be what you are looking for. This will match values such as 0 or 23. or 3.14159
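To see the difference, both patterns can be tried in Python. This is only a sketch: it assumes the hyphens in the original pattern are ordinary ASCII hyphens, and the variable names are made up.

```python
import re

# The pattern as posted: quotes are literal characters, "." is a wildcard.
quoted = re.compile(r"('0'|['0'-'9']+'.'['0'-'9' 'a'-'f']*)")
# The corrected pattern from the answer above.
fixed = re.compile(r"(0|[0-9]+\.[0-9a-f]*)")

print(quoted.fullmatch("'0'") is not None)             # True: needs the quotes
print(quoted.fullmatch("0") is not None)               # False
print(quoted.fullmatch("0000000'z''''9") is not None)  # True

for value in ["0", "23.", "3.14159"]:
    print(value, fixed.fullmatch(value) is not None)   # True for each
```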
There are numerous problems in your regex (as others have pointed out), but I'll explain something about alternations.
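Most regex flavors will short-circuit alternations: the alternatives are tried from left to right, and the first one that succeeds is used.
This means that you should reorder the alternatives if you want the other expression to be tried first.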
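A small Python sketch of that ordering effect, using the corrected pattern from this thread (Python's re is one of the flavors that tries alternatives left to right):

```python
import re

text = "0.5"

# "0" is listed first, so it wins as soon as it matches.
zero_first = re.match(r"0|[0-9]+\.[0-9a-f]*", text)
# With the longer alternative first, the whole number is consumed.
number_first = re.match(r"[0-9]+\.[0-9a-f]*|0", text)

print(zero_first.group())    # 0
print(number_first.group())  # 0.5
```

A plain "0" still matches the reordered pattern, because the second alternative is tried once the first one fails.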

What does a-z-A-Z mean in a regular expression?

I've been working with someone else's code and I ran across the regular expression [^0-9a-z-A-Z]. This bears close resemblance to the common [^0-9a-zA-Z] which is meant to exclude non-alphanumeric characters, but note the extra dash in the middle, between the lowercase z and uppercase A.
I'm not very familiar with regular expressions, but I've read several pages on them now, and none of the rules I've seen seem to cover what this syntax would mean. Perhaps it's not even valid syntax, but the Golang regex interpreter doesn't seem to mind. I'd appreciate any clarification. Thanks.
A dash in a character class in a place where it cannot be interpreted as a range is interpreted as a literal dash. So the expression excludes the characters 0 to 9, a to z, A to Z, and -. That's why there's no syntax error.
It's probably a typo though. If the dash is meant to be there, then to prevent confusion it should be escaped and/or moved out from between the ranges, such as [^0-9a-zA-Z\-]
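The question is about Go, but the effect is easy to check in other engines too. Here is a quick Python sketch (Python's re treats the stray dash the same way); the sample string is made up.

```python
import re

with_dash = re.compile(r"[^0-9a-z-A-Z]")    # the pattern from the question
without_dash = re.compile(r"[^0-9a-zA-Z]")  # the common form

sample = "a-b_c D"

# The extra dash is a literal member of the set, so "-" is no longer matched
# by the negated class.
print(with_dash.findall(sample))     # ['_', ' ']
print(without_dash.findall(sample))  # ['-', '_', ' ']
```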
It excludes the minus sign.
You can test regex handily here: http://www.regexr.com/

Is there any way to have dot (.) match newline in C++ TR1 Regular Expressions?

I couldn't find anything regarding this on http://msdn.microsoft.com/en-us/library/bb982727.aspx.
Maybe I could use '[^]+' to match everything but that seems like a hack?
Boost.Regex has a mod_s flag to make the dot match newlines, but it's not part of the TR1 regex standard. (and not available as a Microsoft extension either, as far as I can see)
As a workaround, you could use [\s\S] (which means match any whitespace or any non-whitespace).
As C++ regular expressions appear to be based on ECMAScript regular expressions, the answer to the recent question about the same thing in JavaScript may help you.
[^] should work, but if you want something a little more clear and less hackish, you could try (.|\n).
One trick people use is a character class containing anything that is not the null character. The null character is expressed in hex. It looks something like this:
[^\x00]+
You can switch to a non-ECMA flavor of regular expression (there are a number of flags to control the regex flavor). Any POSIX regex should, if I recall correctly, match a newline with the dot (.).
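The character-class workarounds above are not specific to TR1. As a sanity check, here is how they behave in Python's re module, which is a different engine and is used here only for illustration (note that [^] is not valid syntax in Python, so it is left out):

```python
import re

text = "first line\nsecond line"

# A plain dot stops at the newline, so the whole text is not matched.
print(re.fullmatch(r".+", text) is not None)  # False

# Each workaround matches across the newline.
workarounds = [r"[\s\S]+", r"(.|\n)+", r"[^\x00]+"]
results = {p: re.fullmatch(p, text) is not None for p in workarounds}
print(results)
```

All three entries in results come out True for this input.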

Can regular expressions work with different languages?

English, of course, is a no-brainer for regex because that's what it was originally developed in/for:
Can regular expressions understand this character set?
French gets into some accented characters which I'm unsure how to match against - i.e. are è and e both considered word characters by regex?
Les expressions régulières peuvent comprendre ce jeu de caractères?
Japanese doesn't contain what I know as regex word characters to match against.
正規表現は、この文字を理解でき、設定?
Short answer: yes.
More specifically it depends on your regex engine supporting unicode matches (as described here).
Such matches can complicate your regular expressions enormously, so I can recommend reading this unicode regex tutorial (also note that unicode implementations themselves can be quite a mess so you might also benefit from reading Joel Spolsky's article about the inner workings of character sets).
"[\p{L}]"
This regular expression matches any single character that is a letter, from any language, upper or lower case.
So letters like (a-z A-Z ä ß è 正 の文字を理解) are accepted, but signs like (, . ? > :) or other similar ones are not.
The brackets [] mean that this expression is a set.
If you want an unlimited number of letters from this set to be accepted, use an asterisk * after the brackets, like this: "[\p{L}]*"
It is always important to make sure you take care of white space in your regex, since your evaluation might fail because of it. To solve this you can use: "[\p{L} ]*" (notice the white space inside the brackets).
If you want to include the numbers as well, "[\p{L}\p{N} ]*" can help. \p{N} matches any kind of numeric character in any script.
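Not every engine supports \p{...}; Python's built-in re module, for instance, does not. As a rough stand-in for the letters-and-spaces pattern, the same test can be written with Unicode general categories. This is only a sketch, and the function name is invented.

```python
import unicodedata


def letters_and_spaces_only(text: str) -> bool:
    # Approximates the letters-plus-space set: accept a character if it is a
    # plain space or its Unicode general category starts with "L" (a letter).
    return all(
        ch == " " or unicodedata.category(ch).startswith("L")
        for ch in text
    )


print(letters_and_spaces_only("expressions r\u00e9guli\u00e8res"))  # True
print(letters_and_spaces_only("正規表現"))                           # True
print(letters_and_spaces_only("a, b"))                              # False (comma)
```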
As far as I know, there isn't any specific pattern along the lines of [a-zA-Z] that matches "è", but you can always list such characters separately, e.g. [a-zA-Zè正]
Obviously that can make your regexp immense, but you can always control this by adding your strings into variables, and only passing the variables into the expressions.
Generally speaking, regex is more for grokking machine-readable text than for human-readable text. It is in many ways a more general answer to the whole XML with regex thing; regex is by its very nature incapable of properly parsing human language, because the language is more complex than what you are using to parse it.
If you want to break down human language (English included), you would want to use a language analysis tool or even an AI, not mere regular expressions.
/[\p{Latin}]/ should, for example, match characters from the Latin alphabet. You can get the full explanation and reference here.
It is not about the regular expression but about the framework that executes it. Java and .NET, I think, are very good at handling Unicode, so "è and e both considered word characters by regex" is true.
It depends on the implementation and the character set. In general the answer is "Yes," but it may require additional setup on your part.
In Perl, for example, the meaning of things like \w is altered by the chosen locale (use locale).
This SO thread might help. It includes the Unicode character classes you can use in a regex (e.g., \p{Ll} is all lowercase letters, regardless of language).
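Python is another example of the implementation deciding what \w means: for str patterns it is Unicode-aware by default, and the re.ASCII flag restricts it to [a-zA-Z0-9_]. A short sketch:

```python
import re

french = "r\u00e9guli\u00e8res"  # "régulières"
japanese = "正規表現"

# Default (Unicode) mode: accented and Japanese characters count as word characters.
print(re.fullmatch(r"\w+", french) is not None)    # True
print(re.fullmatch(r"\w+", japanese) is not None)  # True

# ASCII mode: only a-z, A-Z, 0-9 and underscore count.
print(re.fullmatch(r"\w+", french, re.ASCII) is not None)  # False
```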

Regular expression that rejects all input?

Is it possible to construct a regular expression that rejects all input strings?
Probably this:
[^\w\W]
\w - word character (letter, digit, etc)
\W - opposite of \w
[^\w\W] - should always fail, because any character should belong to one of the character classes - \w or \W
Some other snippets:
$.^
$ - assert position at the end of the string
^ - assert position at the start of the line
. - any char
(?#it's just a comment inside of empty regex)
An empty negative lookahead/lookbehind should work:
(?<!)
The best standard regexes (i.e., no lookahead or back-references) that reject all inputs are (after @aku above)
.^
and
$.
These are flat contradictions: "a string with a character before its beginning" and "a string with a character after its end."
NOTE: It's possible that some regex implementations would reject these patterns as ill-formed (it's pretty easy to check that ^ comes at the beginning of a pattern and $ at the end... with a regular expression), but the few I've checked do accept them. These also won't work in implementations that allow ^ and $ to match newlines.
(?=not)possible
?= is a positive lookahead. They're not supported in all regexp flavors, but in many.
The expression will look for "not", then look for "possible" starting at the same position (since lookaheads don't move forward in the string).
One example of why such a thing could possibly be needed is when you want to filter some input with regexes and you pass a regex as an argument to a function.
In the spirit of functional programming, for algebraic completeness, you may want some trivial primary regexes like "everything is allowed" and "nothing is allowed".
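Several of the patterns suggested so far can be spot-checked quickly. The sketch below tries them in Python against a handful of made-up sample strings; it is a test on a few inputs, not a proof, and other engines may behave differently.

```python
import re

NEVER_MATCH = [r"[^\w\W]", r".^", r"$.", r"(?=not)possible"]
SAMPLES = ["", "a", "abc\n", "\n", "not possible", "notpossible"]


def matches_anything(pattern: str, samples: list) -> bool:
    """True if the pattern is found anywhere in at least one sample."""
    return any(re.search(pattern, s) is not None for s in samples)


for pattern in NEVER_MATCH:
    print(pattern, matches_anything(pattern, SAMPLES))  # False for each

# The caveat about newlines is real: if $ may match at line ends and the dot
# may match a newline, "$." finds the newline itself.
print(re.search(r"$.", "a\nb", re.MULTILINE | re.DOTALL) is not None)  # True
```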
To me it sounds like you're attacking a problem the wrong way; what exactly are you trying to solve?
You could do a regular expression that catches everything and negate the result.
e.g. in JavaScript:
if (! str.match( /./ ))
but then you could just do
if (!foo)
instead, as @jan-hani said.
If you're looking to embed such a regex in another regex, you might be looking for $ or ^ instead, or use lookaheads like @henrik-n mentioned.
But as I said, this looks like an "I think I need x, but what I really need is y" problem.
Why would you even want that? Wouldn't a simple if statement do the trick? Something along the lines of:
if ( inputString != "" )
doSomething ()
[^\x00-\xFF]
It depends on what you mean by "regular expression". Do you mean regexps in a particular programming language or library? In that case the answer is probably yes, and you can refer to any of the above replies.
If you mean the regular expressions as taught in computer science classes, then the answer is no. Every regular expression matches some string. It could be the empty string, but it always matches something.
In any case, I suggest you edit the title of your question to narrow down what kinds of regular expressions you mean.
[^]+ should do it.
In answer to aku's comment attached to this, I tested it with an online regex tester (http://www.regextester.com/), and so assume it works with JavaScript. I have to confess to not testing it in "real" code. ;)
EDIT:
[^\n\r\w\s]
Well, I am not sure I understood, since I always thought of regular expressions as a way to match strings. I would say the best shot you have is not using regex.
But you can also use a regexp that matches empty lines, like ^$, or a regexp that does not match words/spaces, like [^\w\s] ...
Hope it helps!