Misuning simpler regex format?


So I am working on a fraction class for school and am using a regex pattern and matcher for user input. I found this online, so I'll admit I'm not exactly sure what does what, but the following pattern finds each number and the middle operation, and allows spaces and tabs between all characters of the user's input (a fraction expression).
String fractionPattern = "\\s*\\t*\\\\*t*(-?\\d+)\\s*\\t*\\\\*t*\\/\\s*\\t*\\\\*t*(-?\\d+)\\s*\\t*\\\\*t*([-+*/])\\s*\\t*\\\\*t(\\d+)\\s*\\t*\\\\*t*\\/\\s*\\t*\\\\*t*(-?\\d+)\\s*\\t*\\\\*t*";
I've tried researching Java regex metacharacters and symbol meanings, but I am struggling. Can someone offer me an explanation of each character? Or possibly a simpler way of accomplishing the same thing?

So you're looking to match fractions? Like "2 / 3" or whatever, and allow spaces? If so, you want to match (and capture) some string of digits, then a '/' character, then match and capture another string of digits. Without considering spaces, this is just:
"(\\d+)/(\\d+)"
\d in a regex matches any digit (0-9), but since we specify regexes with Strings, we have to escape the backslash itself. That is, the Java string literal "\\" results in a String object with one backslash character.
The + means "one or more", and the parentheses capture it. Then the slash is just a literal slash.
To make it allow spaces, match and discard any space. \s matches any type of space (space, tab, newline), and again we have to escape the backslash:
"(\\d+)\\s*/\\s*(\\d+)"
* means "zero or more". So from left to right, this means:
One or more digits, which are captured
Zero or more spaces
A literal slash
Zero or more spaces
One or more digits, which are captured
To handle whitespace on the ends, either match space there, or trim() the string first.
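As a minimal sketch of the steps above (the class and method names here are made up for illustration and are not from the question), the pattern can be used like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleFraction {
    // One or more digits, optional whitespace, a literal slash,
    // optional whitespace, one or more digits.
    private static final Pattern FRACTION = Pattern.compile("(\\d+)\\s*/\\s*(\\d+)");

    // Hypothetical helper: returns {numerator, denominator},
    // or null if the input is not a single fraction.
    public static int[] parse(String input) {
        Matcher m = FRACTION.matcher(input.trim()); // trim() handles whitespace on the ends
        if (!m.matches()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    public static void main(String[] args) {
        int[] f = parse("  2 / 3 ");
        System.out.println(f[0] + " over " + f[1]);
        System.out.println(parse("2 / x") == null);
    }
}
```

matches() requires the whole (trimmed) input to fit the pattern, so trailing junk is rejected rather than silently ignored.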

The regex pattern you posted actually matches any operation between two fractions.
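There is a lot of noise in there. The extra \t (tab) tokens are redundant because \s already matches tabs, and the \\\\*t* fragments (zero or more literal backslashes followed by zero or more letter t's) look like a mangled attempt at matching tabs. One of those fragments is also missing its final *, so the posted pattern actually demands a literal t before the second numerator. Removing all of that and changing the double backslash to a single backslash for readability, we get the following: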
\s*(-?\d+)\s*\/\s*(-?\d+)\s*([-+*/])\s*(\d+)\s*\/\s*(-?\d+)\s*
(Double backslash is needed when the regex is in a String, so it is not treated as an escape sequence)
Let's break it down:
\s* means match 0 or more whitespace characters
-? means match one - character or none, for negative numbers
\d+ means match 1 or more digits (0-9)
The parentheses around (-?\d+) mean we can capture this first number and refer to it later if we want to.
\s* means 0 or more whitespace characters again
\/ means match / literally. Some languages and regex parsers require the backslash in front of a forward slash, typically those that use / as the regex delimiter. Others (like Python and Java) do not.
\s*(-?\d+)\s* is the same thing we just did with the first number, again: get another (possibly negative) number, potentially surrounded by whitespace, and capture it with parentheses for later use.
([-+*/]) means match any of -, +, *, or /: the operation we want to perform between the two fractions. Parentheses are optional, only needed if you want to grab this character later.
\s*(\d+)\s*\/\s*(-?\d+)\s* is again what we had with the first fraction, except for some reason the possible negative sign is not included in the numerator, which is probably a mistake.
You can test it out on Regex101. (Regex101 is your friend.)
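Here is a rough sketch of how the cleaned-up pattern might be used in Java. It assumes the missing sign on the second numerator was unintended and adds -? there, and it leaves the slash unescaped since Java does not need the backslash. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FractionExpression {
    // Cleaned-up pattern: fraction, operator, fraction, with optional
    // whitespace everywhere. Assumption: -? is allowed on every number.
    private static final Pattern EXPR = Pattern.compile(
        "\\s*(-?\\d+)\\s*/\\s*(-?\\d+)\\s*([-+*/])\\s*(-?\\d+)\\s*/\\s*(-?\\d+)\\s*");

    // Returns the five captured pieces in order (num1, den1, op, num2, den2),
    // or null when the input does not match.
    public static String[] split(String input) {
        Matcher m = EXPR.matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] parts = new String[5];
        for (int i = 0; i < 5; i++) {
            parts[i] = m.group(i + 1); // groups are numbered from 1
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", split(" -1/2 * 3 / -4 ")));
    }
}
```

The captured strings can then be passed to Integer.parseInt and to whatever constructor the fraction class exposes.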

Related

Regex find commas not between quotes which may or may not contain special characters as first character in quote

I have the following regex to count the number of commas contained in a string, where the comma does not appear between quotes:
(?!\B"[^"]*),(?![^"]*"\B)
For the following string, I would get 4 returned:
a,b,c,"my my, a bit of text",e
However, if the start of the quoted string is a special character, it fails to count anything prior to the end of the quoted string:
a,b,c,"[20200624T013030 Umognog Wrote] my my, a bit of text",e
This will return just 1
I am trying to alter it to determine all 4 occurrences in the second example, whilst not altering the results from the first example but failing miserably at the first hurdle! Regex is not my kung fu :(
I know I need to escape the '[' but I really cannot figure this out and am grasping in the dark.

If the double quotes are balanced, one option is to assert that what follows the current position, up to the end of the line, contains only balanced pairs of double quotes.
The newline in the character class is to prevent crossing line breaks. You could also add \r if that is necessary.
,(?=(?:[^"\n]*"[^"\n]*")*[^"\n]*$)
Explanation
, Match a comma
(?= Positive lookahead, assert what is on the right is
(?:[^"\n]*"[^"\n]*")* Match 0+ times pairs of double quotes
[^"\n]* Match 0+ times any char except a double quote
$ Assert end of string
) Close lookahead
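To try the lookahead pattern from Java, a small counting helper could look like the sketch below. The question does not say which language is in use, so Java and the names CommaCounter and count are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommaCounter {
    // A comma followed, up to the end of the string, only by balanced pairs
    // of double quotes, which means the comma itself is outside any quoted run.
    private static final Pattern OUTSIDE_QUOTES =
        Pattern.compile(",(?=(?:[^\"\\n]*\"[^\"\\n]*\")*[^\"\\n]*$)");

    public static int count(String line) {
        Matcher m = OUTSIDE_QUOTES.matcher(line);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("a,b,c,\"my my, a bit of text\",e"));
        System.out.println(count("a,b,c,\"[20200624T013030 Umognog Wrote] my my, a bit of text\",e"));
    }
}
```

Both sample strings from the question give the same count, because the pattern never looks at what the quoted text starts with.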

Regex: match all special characters but not *

Yet another question about a regex.
I'm trying to match all special characters, except '*'.
So if I match my regex against:
John%%%* dadidou
I should get:
John* dadidou
Here: How to match with regex all special chars except "-" in PHP?
The accepted answer advises using (if I want to exclude '-'):
[^\w-]
But doesn't that mean: "NOT a special character, NOT -", which is a bit redundant ?

What you really want is this regex for matching:
[^\w\s*]+
Replace each match with the empty string.
Which means match 1 or more of any character that is:
Not a word character [AND]
Not a whitespace [AND]
Not a literal *
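The question is about PHP, but the same character class works unchanged in other engines. A quick sketch in Java (StripSpecials and strip are illustrative names only):

```java
public class StripSpecials {
    // Removes every run of characters that are not word characters,
    // not whitespace, and not a literal asterisk.
    public static String strip(String s) {
        return s.replaceAll("[^\\w\\s*]+", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("John%%%* dadidou")); // prints "John* dadidou"
    }
}
```

Inside square brackets the * needs no escaping. Note that in Java \w covers only ASCII letters, digits, and underscore by default, so accented letters would be stripped too.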

When you define a negative character class, you are really inverting it.
What does that mean ?
A positive character class implicitly ORs its contents.
When you negate a class, you implicitly AND its contents.
So, [\w-] means word OR dash,
the inverse, [^\w-] means not word AND not dash.
A negated word class on its own, for instance [^\w], would match a dash -.
So, to not match it, you have to add a not dash as well.
A C analogy would be
existing (varA || varB)
inverted (!varA && !varB)
where inverting changes the Boolean of each of the components.
Basically a negative class changes the Boolean of each of its components,
so the implicit OR becomes an implicit AND and the components characters
(or expressions) are negated.
What will really bake your noodle later on is when you see something like
[^\S\r\n]
This translates to NOT-NOT-Whitespace and NOT-cr and NOT-lf
which reduces to matching all whitespace except CR,LF
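Both negated classes can be checked with a couple of replacements. This is only a sketch in Java with made-up method names, not something taken from the answer:

```java
public class NegatedClasses {
    // [^\w-] : not a word character AND not a dash, so dashes survive.
    public static String keepWordAndDash(String s) {
        return s.replaceAll("[^\\w-]", "");
    }

    // [^\S\r\n] : whitespace that is neither CR nor LF.
    // Runs of such whitespace are squeezed to one space; line breaks stay.
    public static String squeezeInlineSpace(String s) {
        return s.replaceAll("[^\\S\\r\\n]+", " ");
    }

    public static void main(String[] args) {
        System.out.println(keepWordAndDash("a-b c!"));        // prints "a-bc"
        System.out.println(squeezeInlineSpace("a \t b\nc"));  // prints "a b", then "c"
    }
}
```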

Regex to match anything

I know it seems a bit redundant but I'd like a regex to match anything.
At the moment we are using ^*$ but it doesn't seem to match no matter what the text.
I do a manual check for no text but the test view we use is always validated with a regex. However, sometimes we need it to validate anything using a regex. i.e. it doesn't matter what is in the text field, it can be anything.
I don't actually produce the regex and I'm a complete beginner with them.

The regex .* will match anything (including the empty string, as Junuxx points out).

The chosen answer is slightly incorrect, as it won't match line breaks or carriage returns. This regex to match anything is useful if your desired selection includes any line breaks:
[\s\S]+
[\s\S] matches a character that is either a whitespace character (including line break characters), or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character. The + matches one or more of the preceding expression, so unlike .* this pattern does not match the empty string.
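The difference between the two patterns is easy to see with a multi-line input. The sketch below uses Java, which is an assumption since the question names no language, and the method names are illustrative:

```java
import java.util.regex.Pattern;

public class MatchAnything {
    // Pattern.matches requires the whole input to match the pattern.
    public static boolean dotStar(String s) {
        return Pattern.matches(".*", s);
    }

    public static boolean anyChars(String s) {
        return Pattern.matches("[\\s\\S]+", s);
    }

    public static void main(String[] args) {
        String twoLines = "line one\nline two";
        System.out.println(dotStar(twoLines));  // false: . stops at the newline
        System.out.println(anyChars(twoLines)); // true
        System.out.println(dotStar(""));        // true: * allows zero characters
        System.out.println(anyChars(""));       // false: + needs at least one
    }
}
```

If the empty string should also pass, [\s\S]* is the variant to use.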

^ is the beginning-of-line anchor, so it will be a "zero-width match," meaning it won't match any actual characters (and the first character matched after the ^ will be the first character of the string). Similarly, $ is the end-of-line anchor.
* is a quantifier. It will not by itself match anything; it only indicates how many times a portion of the pattern can be matched. Specifically, it indicates that the previous "atom" (that is, the previous character or the previous parenthesized sub-pattern) can match any number of times.
To actually match some set of characters, you need to use a character class. As RichieHindle pointed out, the character class you need here is ., which represents any character except newlines (and it can be made to match newlines as well using the appropriate flag). So .* represents * (any number) matches on . (any character). Similarly, .+ represents + (at least one) matches on . (any character).
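As one example of "the appropriate flag", Java calls it Pattern.DOTALL (other engines use names such as "single-line" or /s). A minimal sketch, with hypothetical method names:

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    // With DOTALL, . also matches line terminators.
    private static final Pattern ANYTHING = Pattern.compile(".*", Pattern.DOTALL);
    private static final Pattern AT_LEAST_ONE = Pattern.compile(".+", Pattern.DOTALL);

    public static boolean anything(String s) {
        return ANYTHING.matcher(s).matches();
    }

    public static boolean atLeastOne(String s) {
        return AT_LEAST_ONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(anything("a\nb"));   // true
        System.out.println(anything(""));       // true
        System.out.println(atLeastOne(""));     // false
    }
}
```

The flag can also be written inline as (?s) at the start of the pattern, which helps when only the pattern string can be configured.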

I know this is a rather old post, but there are several ways to write it, such as:
.*
(.*?)

Regexp question mark (in emacs)

I'd like to ask what the following emacs regular expression means (if anyone wonders, this is the regexp that erlang-mode uses for matching a single-quoted atom):
'\\(?:[^\\']\\|\\(?:\\\\.\\)\\)*'
specifically I'm having trouble finding explanations for three things.
First, the question mark, which supposedly should either make the preceding item optional or make the preceding quantifier lazy. There is no item or quantifier here, only the start of a new group, so what effect does it have?
Second, the escaped apostrophe. Why would you need to escape the apostrophe?
Third, the quadruple escape \\\\., wouldn't this leave you with an escaped backslash and a \. which would make it an invalid regexp?
Thanks

"[^\\']"
Second, the escaped apostrophe. Why would you need to escape the apostrophe?
Firstly, note that in Emacs regexp syntax, \` matches the start of the string, and \' matches the end of the string. In multi-line strings this is different from the more familiar ^ and $, which match the beginning of a line and the end of a line.
However that is not relevant within a character alternative (square brackets), so this sequence is actually matching any character other than a backslash or an apostrophe.
Edit:
So from the comments, this is still causing confusion, so let's break it down:
"'\\(?:[^\\']\\|\\(?:\\\\.\\)\\)*'"
That code evaluates to this string/regexp:
'\(?:[^\']\|\(?:\\.\)\)*'
' matches an apostrophe
\(?:foo\)* matches zero or more foo
foo\|bar matches either of foo or bar
[^\'] matches any character other than a backslash or an apostrophe
\(?:\\.\) could (in this case, being a non-capturing group which occurs exactly once) be rewritten as simply \\., and matches a backslash followed by any character other than a newline.
' matches an apostrophe
So the whole thing matches a single-quoted string in which:
any other single-quotes must each be preceded by a backslash
any backslash must be paired with another non-newline character (which could also be a backslash)
Which of course sounds like a typical string syntax in which backslashes can be used to escape special characters, including backslashes themselves and any instances of the delimiting quote character.
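For readers more used to PCRE-style syntax, the same idea can be sketched in Java. This is a translation made for illustration, not the erlang-mode source: Emacs writes groups and alternation as \(?: \) and \|, while Java writes them as (?: ) and |. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class QuotedAtom {
    // Regex after Java string unescaping:  '(?:[^\\']|\\.)*'
    //   [^\\']  any character other than a backslash or an apostrophe
    //   \\.     a backslash followed by any character other than a newline
    private static final Pattern ATOM = Pattern.compile("'(?:[^\\\\']|\\\\.)*'");

    public static boolean isQuotedAtom(String s) {
        return ATOM.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isQuotedAtom("'abc'"));     // true
        System.out.println(isQuotedAtom("'it\\'s'"));  // true: the inner quote is escaped
        System.out.println(isQuotedAtom("'it's'"));    // false: unescaped inner quote
        System.out.println(isQuotedAtom("'a\\'"));     // false: backslash eats the closing quote
    }
}
```

As in the Elisp source, each backslash is doubled once for the string literal and once more for the regex, which is where the quadruple backslash comes from.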

First: (?: groups multiple tokens together without creating a capturing group. This allows you to apply quantifiers to the full group.
Second and third, I think those are escaped backslashes. Each pair means \, and the quadruple means \\. So it's not escaping the apostrophe at all.

What does \'.- mean in a Regular Expression

I'm new to regular expressions and I'm having trouble finding what "\'.-" means.
'/^[A-Z \'.-]{2,20}$/i'
So far from my research, I have found that the regular expression starts (^) and requires two to twenty ({2,20}) alphabetical (A-Z) characters. The expression is also case insensitive (/i).
Any hints about what "\'.-" means?

The character class is the entire expression [A-Z \'.-], meaning any of A-Z, space, single quote, period, or hyphen. The \ is needed to protect the single quote, since it's also being used as the string quote. This charclass must be repeated 2 to 20 times, and because of the leading ^ and trailing $ anchors that must be the entire content of the matching string.

It means to escape the single quote (') that delimits the regex (so as to not prematurely end the string), and then a . which means a literal . and a - which means a literal -.
Inside of the character class, the . is treated literally, and if the - isn't part of a valid range, e.g. a-z, then it is treated literally as well.
Your regex says Match the characters a-zA-Z '.- between 2 and 20 times as the entire string, with an optional trailing \n.

This regex is in a string. The backslash is there to escape the single quote so the string doesn't end early, in the middle of the regex. The dot and dash are just what they are, a period and a dash.
So, you were nearly right, except it's 2-20 characters that are letters, space, single quote, period, or dash.

It's quoting the quote.
The regular expression is ^[A-Z '.-]{2,20}$.
In the programming language you are using, you write it as a quoted string:
'SOMETHING'
To get a single quote in there, it's been backslashed.

Everything inside the square brackets is part of the character class, and will match a single character listed. In your example, the characters listed are the letters A through Z, a space, a single quote, a period, or a hyphen. (Note the hyphen must be listed last, or first, or be escaped, to avoid indicating a range like A-Z.) Your full regular expression will match between 2 and 20 of the listed characters. The backslash before the single quote is needed so the compiler knows you are not ending the string that defines the regular expression.
Some examples of things this will match:
....................
abaca af - .
AAfa- - ..
.z
And so on.
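To see that the backslash belongs to the string literal and not to the regex, it may help to write the same pattern in a language that quotes strings with double quotes. The following is a sketch in Java, not the PHP from the question; the /i modifier becomes Pattern.CASE_INSENSITIVE, and the names are made up.

```java
import java.util.regex.Pattern;

public class NameCheck {
    // No backslash before the single quote: it does not clash with
    // Java's double-quoted string delimiter.
    private static final Pattern NAME =
        Pattern.compile("^[A-Z '.-]{2,20}$", Pattern.CASE_INSENSITIVE);

    public static boolean isValid(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("...................."));  // true: 20 periods
        System.out.println(isValid("abaca af - ."));          // true
        System.out.println(isValid(".z"));                    // true
        System.out.println(isValid("a"));                     // false: too short
        System.out.println(isValid("abc1"));                  // false: digit not allowed
    }
}
```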
