Which characters should I be escaping using REGEXP_LIKE - amazon-athena

I need to return all rows where first_name contains anything but the following:
A-Z
a-z
Single quote
Punctuation mark
Round brackets
Backtick
Hyphen
Space
I am however having a difficult time determining which of these characters, apart from A-Z & a-z obviously, to escape. The regex seems to be producing the desired result but I am running this on a data set of 10 million rows so I just want to be sure and also for my own sanity. I have basically escaped every special character apart from the single quote as this is escaped using 2 single quotes instead. Should every other special character be escaped here? And then should I be adding anything else to my regex here such as a greedy quantifier aka + as I have seen some reference to this
REGEXP_LIKE(first_name, \'[^A-Za-z ''\.\(\)\`\-]\')

Related

Misuning simpler regex format?

So I am working on a fraction class for school and am using a regex pattern and matcher for user input. I found this online so i'll admit im not exactly sure what does what, but the following pattern finds each digit, middle operation, and allows spaces and tabs between all characters of the user's input(a fraction expression).
String fractionPattern = "\\s*\\t*\\\\*t*(-?\\d+)\\s*\\t*\\\\*t*\\/\\s*\\t*\\\\*t*(-?\\d+)\\s*\\t*\\\\*t*([-+*/])\\s*\\t*\\\\*t(\\d+)\\s*\\t*\\\\*t*\\/\\s*\\t*\\\\*t*(-?\\d+)\\s*\\t*\\\\*t*";
I've tried researching java regex metacharacters and symbol meanings, but I am sort of struggling. Can someone offer me an explanation on each character? Or possibly a simpler way of accomplishing the same thing.
So you're looking to match fractions? Like "2 / 3" or whatever, and allow spaces? If so, you want to match (and capture) some string of digits, then a '/' character, then match and capture another string of digits. Without considering spaces, this is just:
"(\\d+)/(\\d+)"
\d in a regex matches any digit (0-9), but since we specify regexes with Strings, we have to escape the backslash itself. That is, the Java string literal "\" results in a String object with one backslash character.
The + means "one or more", and the parentheses capture it. Then the slash is just a literal slash.
To make it allow spaces, match and discard any space. \s matches any type of space (space, tab, newline), and again we have to escape the backslash:
"(\\d+)\\s*/\\s*(\\d+)"
* means "zero or more". So from left to right, this means:
One or more digits, which are captured
Zero or more spaces
A literal slash
Zero or more spaces
One or more digits, which are captured
To handle whitespace on the ends, either match space there, or
trim() the string first.
The regex pattern you posted actually matches any operation between two fractions.
There is a lot of noise in there with extra \t (tab) characters (which is redundant with \s for whitespace). Removing those and changing the double backslash to single backslash for readability, we get the following:
\s*(-?\d+)\s*\/\s*(-?\d+)\s*([-+*/])\s*(\d+)\s*\/\s*(-?\d+)\s*
(Double backslash is needed when the regex is in a String, so it is not treated as an escape sequence)
Let's break it down:
\s* means match 0 or more whitespace characters
-? means match one - character or none, for negative numbers
\d+ means match 1 or more numbers (0-9)
The parentheses around (-?\d+) means we can capture this first number and refer to it later if we want to.
\s* means 0 or more whitespace characters again
\/ means match / literally. Some languages / regex parsers require the backslash in front of a forward slash. Others (like Python) do not.
\s*(-?\d+)\s* is the same thing we just did with the first number, again: get another (possibly negative) number, potentially surrounded by whitespace, and capture it with parentheses for later use.
([-+/]) means match any of -, +, or /: the operation we want to perform between the two fractions. Parentheses are optional, only needed if you want to grab this character later.
\s*(\d+)\s*\/\s*(-?\d+)\s* is again what we had with the first fraction, except for some reason the possible negative sign is not included in the numerator, which is probably a mistake.
You can test it out here. (Regex101 is your friend.)

Can't finish regexp

I'm trying to build regexp for finding valid names in mmorpg game. Here's the rules:
Must be at least 4 characters.
Cannot exceed 24 characters.
May contain the characters A-Z, a-z, 0-9, and single quotation. (Corporation names may also include minus and dot characters.)
Space or single quotation characters are not allowed as the first or last character in a name.
Here's what I've got so far.
/([A-z0-9]{1}[A-z0-9]{3,23}[A-z0-9]{1})$/
The problem is I can't insert zero or one quotation and zero or more spaces inside {3,23} part. Any tips?
You can use this regex:
/^[A-Za-z0-9](?!([^']*'){2})[ A-Za-z0-9'.-]{2,22}[A-Za-z0-9]$/
btw [A-z] is not same as [A-Za-z] as range from A to z will allow many more characters.
Online regex Demo
I suggest this regex
^(?=.{4,24}$)(?=[^']*'?[^']*$)(?![ ']|.*[ ']$)[A-Za-z0-9'. -]+$
(?=.{4,24}$) is for the character limit
(?=[^']*'?[^']*$) is for the optional one apostrophe character (note " is more known as quote character than ').
(?![ ']|.*[ ']$) prevents spaces and apostrophes at the beginning and end.
[A-Za-z0-9'. -]+$ allows alphanumeric, apostrophe (already restricted to 1 by an earlier lookahead), dashes, dots and spaces (any number).'

What is the regular expression to allow uppercase/lowercase (alphabetical characters), periods, spaces and dashes only?

I am having problems creating a regex validator that checks to make sure the input has uppercase or lowercase alphabetical characters, spaces, periods, underscores, and dashes only. Couldn't find this example online via searches. For example:
These are ok:
Dr. Marshall
sam smith
.george con-stanza .great
peter.
josh_stinson
smith _.gorne
Anything containing other characters is not okay. That is numbers, or any other symbols.
The regex you're looking for is ^[A-Za-z.\s_-]+$
^ asserts that the regular expression must match at the beginning of the subject
[] is a character class - any character that matches inside this expression is allowed
A-Z allows a range of uppercase characters
a-z allows a range of lowercase characters
. matches a period
rather than a range of characters
\s matches whitespace (spaces and tabs)
_ matches an underscore
- matches a dash (hyphen); we have it as the last character in the character class so it doesn't get interpreted as being part of a character range. We could also escape it (\-) instead and put it anywhere in the character class, but that's less clear
+ asserts that the preceding expression (in our case, the character class) must match one or more times
$ Finally, this asserts that we're now at the end of the subject
When you're testing regular expressions, you'll likely find a tool like regexpal helpful. This allows you to see your regular expression match (or fail to match) your sample data in real time as you write it.
Check out the basics of regular expressions in a tutorial. All it requires is two anchors and a repeated character class:
^[a-zA-Z ._-]*$
If you use the case-insensitive modifier, you can shorten this to
^[a-z ._-]*$
Note that the space is significant (it is just a character like any other).

What does \'.- mean in a Regular Expression

I'm new to regular expression and I having trouble finding what "\'.-" means.
'/^[A-Z \'.-]{2,20}$/i'
So far from my research, I have found that the regular expression starts (^) and requires two to twenty ({2,20}) alphabetical (A-Z) characters. The expression is also case insensitive (/i).
Any hints about what "\'.-" means?
The character class is the entire expression [A-Z \'.-], meaning any of A-Z, space, single quote, period, or hyphen. The \ is needed to protect the single quote, since it's also being used as the string quote. This charclass must be repeated 2 to 20 times, and because of the leading ^ and trailing $ anchors that must be the entire content of the matching string.
It means to escape the single quote (') that delmits the regex (as to not prematurely end the string), and then a . which means a literal . and a - which means a literal -.
Inside of the character range, the . is treated literally, and if the - isn't part of a valid range, e.g. a-z, then it is treated literally as well.
Your regex says Match the characters a-zA-Z '.- between 2 and 20 times as the entire string, with an optional trailing \n.
This regex is in a string. The backslash is there to escape the single quote so the string doesn't end early, in the middle of the regex. The dot and dash are just what they are, a period and a dash.
So, you were nearly right, except it's 2-20 characters that are letters, space, single quote, period, or dash.
It's quoting the quote.
The regular expression is ^[A-Z'.-]{2,20}$.
In the programming language you are using, you write it as a quoted string:
'SOMETHING'
To get a single quote in there, it's been backslashed.
Everything inside the square brackets is part of the character class, and will match a single character listed. In your example, the characters listed are the letters A through Z, a space, a single quote, a period, or a hyphen. (Note the hyphen must be listed last to avoid indicating a range, like A-Z.) Your full regular expression will match between 2 and 20 of the listed characters. The single quote is needed so the compiler knows you are not ending the string that defines the regular expression.
Some examples of things this will match:
....................
abaca af - .
AAfa- - ..
.z
And so on.

Simple Regex for upper and lower case letters, numbers, and a few symbols

How can I create a Regular expression to match the following characters:
A-Z a-z 0-9 " - ? . ', !
... as well as new lines and spaces
This will match any single one of those characters:
[A-Za-z0-9"?.',! \n\r-]
There's a good chance you want something like:
^[A-Za-z0-9"?.',! \n\r-]+$
Or possibly a bit simpler will meet your needs:
^[\w\s"?.',!-]+$
Remembering that if this is inside a string, you will need to escape either the " or ' in that (either by doubling up, or by prefixing with a backslash).
Also note that the - is last so that it is not treated as a range inside the character class. (Can also be placed first, or prefixed with backslash to prevent that).
The \w will match a "word" character, which is almost always [A-Za-z0-9_].
The \s will match a whitespace character, (i.e. space,tab,newline,carriage return).
But really you need to give more context to what you're trying to do so people can suggest more fitting solutions.