Regex: match all special characters but not * - regex

Yet another question about a regex.
I'm trying to match all special characters, except '*'.
So if I match my regex against:
John%%%* dadidou
I should get:
John* dadidou
Here: How to match with regex all special chars except "-" in PHP?
The accepted answer advises using (if I want to exclude '-'):
[^\w-]
But doesn't that mean: "NOT a special character, NOT -", which is a bit redundant?

What you really want is this regex for matching:
[^\w\s*]+
Replace each match with an empty string.
The pattern matches one or more of any character that is:
Not a word character [AND]
Not a whitespace [AND]
Not a literal *
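As a minimal sketch of the replacement described above, here is the same pattern in Python (the question does not name a language, so Python and the helper name strip_specials are assumptions made for illustration):

```python
import re

# One or more characters that are not a word character, not whitespace and not '*'.
SPECIAL_EXCEPT_STAR = re.compile(r"[^\w\s*]+")


def strip_specials(text: str) -> str:
    """Remove every run of special characters, keeping '*' (hypothetical helper)."""
    return SPECIAL_EXCEPT_STAR.sub("", text)


print(strip_specials("John%%%* dadidou"))  # John* dadidou
```

Note that the underscore survives as well, because \w counts it as a word character.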

When you define a negative character class, you are really inverting it.
What does that mean?
A positive character class implicitly ORs its contents.
When you negate a class, you implicitly AND its negated contents.
So, [\w-] means word OR dash,
the inverse, [^\w-] means not word AND not dash.
A negated word class on its own, [^\w], would match a dash -.
So, to not match it, you have to add a not dash as well.
A C analogy would be
existing (varA || varB)
inverted (!varA && !varB)
where inverting changes the Boolean of each of the components.
Basically a negative class changes the Boolean of each of its components,
so the implicit OR becomes an implicit AND and the components characters
(or expressions) are negated.
What will really bake your noodle later on is when you see something like
[^\S\r\n]
This translates to NOT-NOT-Whitespace and NOT-cr and NOT-lf
which reduces to matching all whitespace except CR,LF
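The inversion rule can be checked mechanically. The following is only a sketch in Python (an assumption, since the answer is language-neutral); it compares the positive and negated classes over the printable ASCII characters and then tries the [^\S\r\n] class:

```python
import re
import string

word_or_dash = re.compile(r"[\w-]")
not_word_and_not_dash = re.compile(r"[^\w-]")
not_word = re.compile(r"[^\w]")

# For every printable ASCII character, the negated class matches exactly
# when the positive class does not.
for ch in string.printable:
    positive = word_or_dash.fullmatch(ch) is not None
    negated = not_word_and_not_dash.fullmatch(ch) is not None
    assert negated == (not positive)

# [^\w] alone still matches a dash, which is why the dash has to be listed too.
assert not_word.fullmatch("-") is not None
assert not_word_and_not_dash.fullmatch("-") is None

# [^\S\r\n]: NOT-NOT-whitespace AND NOT-CR AND NOT-LF.
horizontal_ws = re.compile(r"[^\S\r\n]")
results = [horizontal_ws.fullmatch(c) is not None for c in (" ", "\t", "\r", "\n", "x")]
print(results)  # [True, True, False, False, False]
```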

Related

Missing a simpler regex format?

So I am working on a fraction class for school and am using a regex pattern and matcher for user input. I found this online, so I'll admit I'm not exactly sure what does what, but the following pattern finds each digit and the middle operation, and allows spaces and tabs between all characters of the user's input (a fraction expression).
String fractionPattern = "\\s*\\t*\\\\*t*(-?\\d+)\\s*\\t*\\\\*t*\\/\\s*\\t*\\\\*t*(-?\\d+)\\s*\\t*\\\\*t*([-+*/])\\s*\\t*\\\\*t(\\d+)\\s*\\t*\\\\*t*\\/\\s*\\t*\\\\*t*(-?\\d+)\\s*\\t*\\\\*t*";
I've tried researching Java regex metacharacters and symbol meanings, but I am sort of struggling. Can someone offer me an explanation of each character? Or possibly a simpler way of accomplishing the same thing?
So you're looking to match fractions? Like "2 / 3" or whatever, and allow spaces? If so, you want to match (and capture) some string of digits, then a '/' character, then match and capture another string of digits. Without considering spaces, this is just:
"(\\d+)/(\\d+)"
\d in a regex matches any digit (0-9), but since we specify regexes with Strings, we have to escape the backslash itself. That is, the Java string literal "\\" results in a String object with one backslash character.
The + means "one or more", and the parentheses capture it. Then the slash is just a literal slash.
To make it allow spaces, match and discard any space. \s matches any type of space (space, tab, newline), and again we have to escape the backslash:
"(\\d+)\\s*/\\s*(\\d+)"
* means "zero or more". So from left to right, this means:
One or more digits, which are captured
Zero or more spaces
A literal slash
Zero or more spaces
One or more digits, which are captured
To handle whitespace on the ends, either match space there, or
trim() the string first.
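A small sketch of the simplified fraction pattern, written in Python rather than Java purely for illustration (in a Python raw string the backslashes are not doubled; parse_fraction is a hypothetical helper, and strip() stands in for Java's trim()):

```python
import re

# digits, optional spaces, a literal slash, optional spaces, digits
FRACTION = re.compile(r"(\d+)\s*/\s*(\d+)")


def parse_fraction(text: str):
    """Return (numerator, denominator) as ints, or None if text is not a fraction."""
    m = FRACTION.fullmatch(text.strip())  # trim the ends first, then match the whole string
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


print(parse_fraction(" 2 / 3 "))  # (2, 3)
print(parse_fraction("2 /"))      # None
```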
The regex pattern you posted actually matches any operation between two fractions.
There is a lot of noise in there from extra \t (tab) characters, which are redundant because \s already matches tabs. Removing those and changing the double backslashes to single backslashes for readability, we get the following:
\s*(-?\d+)\s*\/\s*(-?\d+)\s*([-+*/])\s*(\d+)\s*\/\s*(-?\d+)\s*
(Double backslash is needed when the regex is in a String, so it is not treated as an escape sequence)
Let's break it down:
\s* means match 0 or more whitespace characters
-? means match one - character or none, for negative numbers
\d+ means match 1 or more digits (0-9)
The parentheses around (-?\d+) mean we can capture this first number and refer to it later if we want to.
\s* means 0 or more whitespace characters again
\/ means match / literally. Some languages / regex parsers require the backslash in front of a forward slash. Others (like Python) do not.
\s*(-?\d+)\s* is the same thing we just did with the first number, again: get another (possibly negative) number, potentially surrounded by whitespace, and capture it with parentheses for later use.
([-+*/]) means match any of -, +, *, or /: the operation we want to perform between the two fractions. Parentheses are optional, only needed if you want to grab this character later.
\s*(\d+)\s*\/\s*(-?\d+)\s* is again what we had with the first fraction, except for some reason the possible negative sign is not included in the numerator, which is probably a mistake.
You can test it out on Regex101. (Regex101 is your friend.)
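To see the capture groups from the breakdown in action, here is a sketch of the cleaned-up pattern in Python (an assumption made for illustration, since the question is about Java; split_expression is a hypothetical helper):

```python
import re

# The de-noised pattern from above, unchanged apart from dropping the
# unnecessary backslash before each forward slash.
EXPRESSION = re.compile(
    r"\s*(-?\d+)\s*/\s*(-?\d+)\s*([-+*/])\s*(\d+)\s*/\s*(-?\d+)\s*"
)


def split_expression(text: str):
    """Return the five captured pieces as strings, or None if there is no match."""
    m = EXPRESSION.fullmatch(text)
    return m.groups() if m else None


print(split_expression(" -1/2 * 3 / 4 "))  # ('-1', '2', '*', '3', '4')
print(split_expression("1/2 + -3/4"))      # None
```

The second call returns None because the fourth group is (\d+) rather than (-?\d+), which is the likely mistake pointed out above.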

Regular expression to match anything but certain characters

Is there a way to have a regular expression to match anything but certain characters? Say for example the only characters that aren't allowed is the * character. Rather than list out all possibly characters allowed in the regular expression is there anything that will say "everything not equal to * is allowed".
You can use a negated character class, which is written [^...]. So, for your case you can use:
^[^*]+$
You can read more about the theory of negated character classes. The quotation below explains it.
Negated Character Classes
Typing a caret after the opening square bracket negates the character class. The result is that the character class matches any character that is not in the character class. Unlike the dot, negated character classes also match (invisible) line break characters. If you don't want a negated character class to match line breaks, you need to include the line break characters in the class. [^0-9\r\n] matches any character that is not a digit or a line break.
It is important to remember that a negated character class still must match a character. q[^u] does not mean: "a q not followed by a u". It means: "a q followed by a character that is not a u". It does not match the q in the string Iraq. It does match the q and the space after the q in Iraq is a country. Indeed: the space becomes part of the overall match, because it is the "character that is not a u" that is matched by the negated character class in the above regexp. If you want the regex to match the q, and only the q, in both strings, you need to use negative lookahead: q(?!u).
[^*] Any single character except: *
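The behaviour described in the quotation can be reproduced with a short sketch; Python is assumed here only for illustration, and the variable names are made up:

```python
import re

# Whole-string check: one or more characters, none of which is '*'.
no_star = re.compile(r"^[^*]+$")
allowed = no_star.match("anything goes") is not None
rejected = no_star.match("no * allowed") is None
# A negated class also matches line breaks.
spans_lines = no_star.match("line one\nline two") is not None

# q[^u] has to consume a character after the q ...
iraq_class = re.search(r"q[^u]", "Iraq")
country_class = re.search(r"q[^u]", "Iraq is a country").group()
# ... whereas the negative lookahead q(?!u) matches the q alone.
iraq_lookahead = re.search(r"q(?!u)", "Iraq").group()
country_lookahead = re.search(r"q(?!u)", "Iraq is a country").group()

print(allowed, rejected, spans_lines)  # True True True
print(iraq_class)                      # None
print(repr(country_class))             # 'q '
print(iraq_lookahead, country_lookahead)  # q q
```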
Whenever I have to work with regular expressions I usually go to rubular.com and test my attempts. It also has some examples, which are pretty useful.
This is explained in the manual.
The solution is:
"[^*]*"

Why doesn't this regex accept "s 1" type input?

I have the following regex:
/([^\s*][\l\u\w\d\s]+) (\d)/
It should match strings of the form: "some-string digit", e.g. "stackoverflow 1". Those strings cannot have whitespace at the beginning.
It works great, except for simple strings with only one character at the beginning, e.g.: "s 1". How can I fix it? I am using it in boost::regex (PCRE-compatible).
The [^\s*] is eating up your first string character, so when you require one-or-more string characters after it, that'll fail:
/([^\s*][\l\u\w\d\s]+) (\d)/
  ^^^^^ ^^^^^^^^^^^^^   ^^
   "s"    no match     "1"
If you fix your misplaced *:
/([^\s]*[\l\u\w\d\s]+) (\d)/
  ^^^^^ ^^^^^^^^^^^^^   ^^
   "s";      "s"       "1"
  match
  then cancelled
  by backtracking
But in order to avoid the backtracking, I would instead write the regex like this:
/([\l\u\w\d]+[\l\u\w\d\s]*) (\d)/
Note that I am only showing the regex itself — re-apply your extra backslashes for use in a C++ string literal as required; e.g.
const std::string my_regex = "/([\\l\\u\\w\\d]+[\\l\\u\\w\\d\\s]*) (\\d)/";
This can probably be done more optimally anyway (I'm sure most of those character classes are redundant), but this should fix your immediate problem.
You can test your regexes here.
The problem is that you have the * in the wrong place: [^\s*] matches exactly one character that is neither whitespace nor an asterisk. (The s in "s 1" qualifies as "neither whitespace nor an asterisk", so it is matched and consumed, and no longer available to serve as a match for the next part, [\l\u\w\d\s]+. Note that "s  1", with two spaces, would succeed.)
You probably meant [^\s]*, which matches any number (including zero) of non-whitespace characters. If you make that small change, that will fix your regular expression.
However, there are other improvements to be made. First, the backslash+letter sequences that are short for character classes can be negated by capitalizing the letter: the character class "everything that's not in \s" can be written as above, with [^\s], but it can also be written more simply as \S.
Next, I don't know what \l and \u are. You've tagged this c++, so you're presumably using the standard regex library, which uses ECMAScript regex syntax. But the ECMAScript regular expression specification doesn't define those metacharacters.
If you're trying to match "lowercase letters" and "uppercase letters", those are [:lower:] and [:upper:] - but both sets of letters are already included in \w, so you don't need to include them in a character class that also has \w.
Pulling those out leaves a character class of [\w\d\s] - which is still redundant, because \w also includes the digits, so we don't need \d. Removing that, we have [\w\s], which matches "an underscore, letter, digit, space, tab, formfeed, or linefeed (newline)."
That makes the whole regular expression \S*[\s\w]+ (\d): zero or more non-whitespace characters, followed by at least one whitespace or word character, followed by exactly one space, followed by a digit. That seems like an unusual set of criteria to me, but it should definitely match "s 1". And it does, in my testing.
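The three variants discussed in this thread can be compared side by side. This is a sketch in Python, not the boost::regex code from the question; \l and \u are left out as an assumption, because Python's re module does not define them and \w already covers letters and digits. The names ORIGINAL, FIXED, ALTERNATIVE and groups are hypothetical:

```python
import re

ORIGINAL = re.compile(r"([^\s*][\w\s]+) (\d)")   # '*' inside the class
FIXED = re.compile(r"(\S*[\w\s]+) (\d)")         # '*' moved outside the class
ALTERNATIVE = re.compile(r"(\w+[\w\s]*) (\d)")   # must start with a word character


def groups(pattern, text):
    """Return the captured groups for a whole-string match, or None."""
    m = pattern.fullmatch(text)
    return m.groups() if m else None


print(groups(ORIGINAL, "s 1"))                  # None
print(groups(ORIGINAL, "s  1"))                 # ('s ', '1')
print(groups(FIXED, "s 1"))                     # ('s', '1')
print(groups(ALTERNATIVE, "s 1"))               # ('s', '1')
print(groups(ALTERNATIVE, "stackoverflow 1"))   # ('stackoverflow', '1')
```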
I would expect you could do something like this: add {X,}, where X is a number, after the second set of brackets, as below:
([^\\s*][\\l\\u\\w\\d\\s]{2,}) (\\d)
Replace 2 with whatever you want to be your minimum string length.

What is the regular expression to allow uppercase/lowercase (alphabetical characters), periods, spaces and dashes only?

I am having problems creating a regex validator that checks to make sure the input has uppercase or lowercase alphabetical characters, spaces, periods, underscores, and dashes only. Couldn't find this example online via searches. For example:
These are ok:
Dr. Marshall
sam smith
.george con-stanza .great
peter.
josh_stinson
smith _.gorne
Anything containing other characters is not okay. That is, numbers or any other symbols.
The regex you're looking for is ^[A-Za-z.\s_-]+$
^ asserts that the regular expression must match at the beginning of the subject
[] is a character class - any character that matches inside this expression is allowed
A-Z allows a range of uppercase characters
a-z allows a range of lowercase characters
. matches a literal period; inside a character class the dot is not a wildcard
\s matches whitespace (spaces and tabs, but also line breaks)
_ matches an underscore
- matches a dash (hyphen); we have it as the last character in the character class so it doesn't get interpreted as being part of a character range. We could also escape it (\-) instead and put it anywhere in the character class, but that's less clear
+ asserts that the preceding expression (in our case, the character class) must match one or more times
$ Finally, this asserts that we're now at the end of the subject
When you're testing regular expressions, you'll likely find a tool like regexpal helpful. This allows you to see your regular expression match (or fail to match) your sample data in real time as you write it.
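If no online tester is at hand, the sample data from the question can also be checked with a few lines of code. A minimal sketch, assuming Python (the question does not say which language the validator uses; is_valid is a hypothetical helper):

```python
import re

NAME = re.compile(r"^[A-Za-z.\s_-]+$")


def is_valid(text: str) -> bool:
    """True if text consists only of letters, periods, whitespace, underscores and dashes."""
    return NAME.fullmatch(text) is not None


samples = [
    "Dr. Marshall",
    "sam smith",
    ".george con-stanza .great",
    "peter.",
    "josh_stinson",
    "smith _.gorne",
]
print(all(is_valid(s) for s in samples))  # True
print(is_valid("agent 007"))              # False
print(is_valid(""))                       # False
```

The empty string is rejected because + requires at least one character.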
Check out the basics of regular expressions in a tutorial. All it requires is two anchors and a repeated character class:
^[a-zA-Z ._-]*$
If you use the case-insensitive modifier, you can shorten this to
^[a-z ._-]*$
Note that the space is significant (it is just a character like any other).
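A short sketch of the case-insensitive variant, again assuming Python, where the modifier is the re.IGNORECASE flag (other regex flavours spell it differently):

```python
import re

# Literal space instead of \s, '*' instead of '+', case-insensitive matching.
NAME_CI = re.compile(r"^[a-z ._-]*$", re.IGNORECASE)

upper_ok = NAME_CI.fullmatch("Dr. Marshall") is not None
empty_ok = NAME_CI.fullmatch("") is not None           # '*' also allows zero characters
tab_ok = NAME_CI.fullmatch("sam\tsmith") is not None   # a tab is not a literal space

print(upper_ok, empty_ok, tab_ok)  # True True False
```

Two differences from the \s and + version follow from the pattern itself: the empty string is accepted, and tabs or line breaks are rejected because only the literal space is listed.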

Simple Regex for upper and lower case letters, numbers, and a few symbols

How can I create a Regular expression to match the following characters:
A-Z a-z 0-9 " - ? . ', !
... as well as new lines and spaces
This will match any single one of those characters:
[A-Za-z0-9"?.',! \n\r-]
There's a good chance you want something like:
^[A-Za-z0-9"?.',! \n\r-]+$
Or possibly a bit simpler will meet your needs:
^[\w\s"?.',!-]+$
Remember that if this is inside a string, you will need to escape either the " or the ' in it (either by doubling up, or by prefixing with a backslash).
Also note that the - is last so that it is not treated as a range inside the character class. (Can also be placed first, or prefixed with backslash to prevent that).
The \w will match a "word" character, which is almost always [A-Za-z0-9_].
The \s will match a whitespace character (i.e. space, tab, newline, carriage return).
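The explicit and the shorthand patterns above differ in a few characters, which a quick sketch makes visible. Python is an assumption here; its triple-quoted raw strings sidestep the quote-escaping issue, and the names ALLOWED and ALLOWED_SHORT are made up:

```python
import re

# r'''...''' lets both quote characters appear unescaped in the pattern.
ALLOWED = re.compile(r'''^[A-Za-z0-9"?.',! \n\r-]+$''')
ALLOWED_SHORT = re.compile(r'''^[\w\s"?.',!-]+$''')

text = 'Hello, "world"!\nHow are you?'
print(ALLOWED.fullmatch(text) is not None)        # True
print(ALLOWED_SHORT.fullmatch(text) is not None)  # True

# The shorthand version is wider: \w adds the underscore, \s adds tabs.
print(ALLOWED.fullmatch("snake_case") is not None)        # False
print(ALLOWED_SHORT.fullmatch("snake_case") is not None)  # True
print(ALLOWED.fullmatch("tab\there") is not None)         # False
print(ALLOWED_SHORT.fullmatch("tab\there") is not None)   # True
```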
But really you need to give more context to what you're trying to do so people can suggest more fitting solutions.