Problem replacing characters in a Regular Expression - regex

I'm just starting to get to grips with Regular Expressions. My first task is to remove all the characters in a string except a-z (upper and lower case), 0-9, and the characters - \ . : and ,
So I tried
objInstance.mystring.replaceAll("[^A-Za-z0-9\\- .:,]", "")
However, this still removes the hyphen and the backslash.
I suspect it's the placement of the \, but some guidance would be helpful here.

You need to escape the backslash as well as the hyphen. Both have special meaning in a regex, so you need to escape them to have the literal character matched.
[^A-Za-z0-9\\\-.:,] should be the correct regex. If you write it as a string literal in a language such as Java, every backslash has to be doubled once more: "[^A-Za-z0-9\\\\\\-.:,]". There's also a space in yours; there's no mention of it in your question, so I removed it. Note that the ^ at the start of a character class does not mean "start of the string" there. It negates the class, so it has to stay: you want to replace every character that is not in the list.
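As a rough illustration of the task in the question, here is a minimal sketch in Python (an assumption for illustration; the question's replaceAll call looks like Java). It assumes the goal is to delete every character outside the allowed set, so the class is negated with a leading ^. The function name and the sample string are made up.

```python
import re

# Raw string, so the regex engine sees exactly: [^A-Za-z0-9\\\-.:,]
#   \\  is a literal backslash, \-  is a literal hyphen.
# The leading ^ negates the class: it matches anything NOT listed.
DISALLOWED = re.compile(r"[^A-Za-z0-9\\\-.:,]")


def strip_disallowed(text: str) -> str:
    """Remove every character that is not a letter, digit, or one of - \\ . : ,"""
    return DISALLOWED.sub("", text)


# Hypothetical sample: spaces, parentheses and "!" should disappear.
cleaned = strip_disallowed("C:\\temp\\my file-01, v2.0 (final)!")
print(cleaned)
```

In a language without raw strings, each backslash in the pattern would have to be doubled again inside the string literal.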

Related

Regexp question mark (in emacs)

I'd like to ask what the following emacs regular expression means (if anyone wonders, this is the regexp that erlang-mode uses for matching a single-quoted atom):
'\\(?:[^\\']\\|\\(?:\\\\.\\)\\)*'
specifically I'm having trouble finding explanations for three things.
First, the question mark, which supposedly should either make the preceding item optional or make the preceding quantifier lazy. There is no item or quantifier here, only the start of a new group, so what effect does it have?
Second, the escaped apostrophe. Why would you need to escape the apostrophe?
Third, the quadruple escape \\\\.: wouldn't this leave you with an escaped backslash and a \., which would make it an invalid regexp?
Thanks
"[^\\']"
Second, the escaped apostrophe. Why would you need to escape the apostrophe?
Firstly, note that in Emacs regexp syntax, \` matches the start of the string and \' matches the end of the string. In multi-line strings this is different from the more familiar ^ and $, which match the beginning of a line and the end of a line.
However that is not relevant within a character alternative (square brackets), so this sequence is actually matching any character other than a backslash or an apostrophe.
Edit:
So from the comments, this is still causing confusion, so let's break it down:
"'\\(?:[^\\']\\|\\(?:\\\\.\\)\\)*'"
That code evaluates to this string/regexp:
'\(?:[^\']\|\(?:\\.\)\)*'
' matches an apostrophe
\(?:foo\)* matches zero or more foo
foo\|bar matches either of foo or bar
[^\'] matches any character other than a backslash or an apostrophe
\(?:\\.\) could (in this case, being a non-capturing group which occurs exactly once) be rewritten as simply \\., and matches a backslash followed by any character other than a newline.
' matches an apostrophe
So the whole thing matches a single-quoted string in which:
any other single-quotes must each be preceded by a backslash
any backslash must be paired with another non-newline character (which could also be a backslash)
Which of course sounds like a typical string syntax in which backslashes can be used to escape special characters, including backslashes themselves and any instances of the delimiting quote character.
First: (?: groups multiple tokens together without creating a capturing group. This allows you to apply quantifiers to the full group.
Second and third, I think those are escaped backslashes. Each pair means \, and the quadruple means \\. So it's not escaping the apostrophe at all.
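For readers more used to PCRE-style syntax, here is a hedged sketch of the same idea in Python's re module (an assumption for illustration; the question itself is about Emacs). In Python, grouping and alternation are written without backslashes, and a backslash inside a character class must itself be escaped, so the Emacs regexp '\(?:[^\']\|\(?:\\.\)\)*' becomes '(?:[^\\']|\\.)*'. The sample strings are made up.

```python
import re

# An apostrophe, then any number of either
#   - a character that is neither a backslash nor an apostrophe, or
#   - a backslash followed by any character except a newline,
# then a closing apostrophe.
QUOTED_ATOM = re.compile(r"'(?:[^\\']|\\.)*'")

samples = [
    "'hello world'",    # plain quoted atom
    r"'it\'s'",         # embedded quote, escaped with a backslash
    r"'back\\slash'",   # escaped backslash
    "'unterminated",    # no closing quote
]

# fullmatch requires the whole sample to be one quoted atom.
results = [QUOTED_ATOM.fullmatch(s) is not None for s in samples]
print(results)
```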

Regular Expressions pattern with Special characters

I'm working on a regular expressions pattern, but it contains a number of special characters. I'm not really sure how to incorporate them in a normal regex pattern string. Specifically, I need to test to see if a string contains '+/-'...
I've tried using quotes etc but have no luck (I'm extremely new to regex). I am coding this in C# 4.0.
One string example is "3Z1Z +/- 5.5"
Any help is much appreciated - Thanks a lot!
Create a simple regex :
foundMatch = Regex.IsMatch(SubjectString, @"\+/-");
Will return true if this sequence of characters is found anywhere in your string. The explanation is left as an exercise to you.
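The same check can be sketched in Python (an assumption for illustration, since the question uses C#). Only the + needs escaping here: / has no special meaning, and - is only special inside a character class. The function name is hypothetical.

```python
import re

# \+ is a literal plus sign; "/" and "-" are ordinary characters here.
PLUS_MINUS = re.compile(r"\+/-")


def contains_plus_minus(text: str) -> bool:
    """Return True if the sequence +/- occurs anywhere in text."""
    return PLUS_MINUS.search(text) is not None


print(contains_plus_minus("3Z1Z +/- 5.5"))
```

When the text to find is fixed rather than a pattern, re.escape("+/-") builds an equivalent escaped pattern without having to remember which characters are special.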
These are regex special characters (metacharacters). Basically, add them to the pattern by prefixing them with a backslash (\), e.g. + becomes \+
^\+|\-$ # a leading + or a trailing -
The same would go for anything else with special meaning, such as ., {, }, (, ), ^, $, |, [, ], etc.
There are some exceptions though. For instance, when creating a class such as: [a-z] the hyphen (-) would have special meaning (all letters from a through z). So if you wanted a literal hyphen you'd have to escape it (unless it falls as the last character of the class). e.g.
[a-z-A-Z] # hyphen should be escaped if you wanted a literal hyphen
[a-z\-A-Z] # the "correct" counter-part
[a-zA-Z-] # actually legal because it's inserted as the last character
# and therefore treated as a literal hyphen despite not being
# escaped.
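A small sketch of the two unambiguous spellings, written in Python as an assumption for illustration. Both classes accept a literal hyphen in addition to the letters; the variable names are made up.

```python
import re

# Hyphen escaped in the middle of the class.
ESCAPED_HYPHEN = re.compile(r"[a-z\-A-Z]")
# Hyphen placed last, where it cannot start a range.
TRAILING_HYPHEN = re.compile(r"[a-zA-Z-]")

for pattern in (ESCAPED_HYPHEN, TRAILING_HYPHEN):
    # Each class matches a letter or a hyphen, but not a digit.
    print(bool(pattern.fullmatch("-")), bool(pattern.fullmatch("q")), bool(pattern.fullmatch("5")))
```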

What does \'.- mean in a Regular Expression

I'm new to regular expressions and I'm having trouble finding what "\'.-" means.
'/^[A-Z \'.-]{2,20}$/i'
So far from my research, I have found that the regular expression starts (^) and requires two to twenty ({2,20}) alphabetical (A-Z) characters. The expression is also case insensitive (/i).
Any hints about what "\'.-" means?
The character class is the entire expression [A-Z \'.-], meaning any of A-Z, space, single quote, period, or hyphen. The \ is needed to protect the single quote, since it's also being used as the string quote. This charclass must be repeated 2 to 20 times, and because of the leading ^ and trailing $ anchors that must be the entire content of the matching string.
It means to escape the single quote (') that delimits the regex (so as not to prematurely end the string), and then a . which means a literal . and a - which means a literal -.
Inside of the character range, the . is treated literally, and if the - isn't part of a valid range, e.g. a-z, then it is treated literally as well.
Your regex says Match the characters a-zA-Z '.- between 2 and 20 times as the entire string, with an optional trailing \n.
This regex is in a string. The backslash is there to escape the single quote so the string doesn't end early, in the middle of the regex. The dot and dash are just what they are, a period and a dash.
So, you were nearly right, except it's 2-20 characters that are letters, space, single quote, period, or dash.
It's quoting the quote.
The regular expression is ^[A-Z '.-]{2,20}$.
In the programming language you are using, you write it as a quoted string:
'SOMETHING'
To get a single quote in there, it's been backslashed.
Everything inside the square brackets is part of the character class, and will match a single character listed. In your example, the characters listed are the letters A through Z, a space, a single quote, a period, or a hyphen. (Note the hyphen must be listed last to avoid indicating a range, like A-Z.) Your full regular expression will match between 2 and 20 of the listed characters. The backslash before the single quote is needed so the compiler knows you are not ending the string that defines the regular expression.
Some examples of things this will match:
....................
abaca af - .
AAfa- - ..
.z
And so on.
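To try this out, here is a minimal sketch in Python (an assumption for illustration; the /.../i delimiters in the question suggest a PCRE-style function). The /i flag corresponds to re.IGNORECASE, and because the pattern is in a raw double-quoted string, the single quote needs no backslash. The sample inputs are made up.

```python
import re

# 2 to 20 characters, each a letter, space, single quote, period or hyphen.
NAME = re.compile(r"^[A-Z '.-]{2,20}$", re.IGNORECASE)

for candidate in ["O'Brien", "Mary-Jane Smith", ".z", "A", "R2D2", "x" * 21]:
    print(repr(candidate), bool(NAME.match(candidate)))
```

As one of the answers notes, $ also matches just before a single trailing newline, so "abc\n" is accepted as well.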

Simple Regex for upper and lower case letters, numbers, and a few symbols

How can I create a Regular expression to match the following characters:
A-Z a-z 0-9 " - ? . ', !
... as well as new lines and spaces
This will match any single one of those characters:
[A-Za-z0-9"?.',! \n\r-]
There's a good chance you want something like:
^[A-Za-z0-9"?.',! \n\r-]+$
Or possibly a bit simpler will meet your needs:
^[\w\s"?.',!-]+$
Remember that if this is inside a string, you will need to escape either the " or the ' in it (either by doubling up, or by prefixing with a backslash, depending on the language).
Also note that the - is last so that it is not treated as a range inside the character class. (Can also be placed first, or prefixed with backslash to prevent that).
The \w will match a "word" character, which is almost always [A-Za-z0-9_].
The \s will match a whitespace character (e.g. space, tab, newline, carriage return).
But really you need to give more context to what you're trying to do so people can suggest more fitting solutions.
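A hedged sketch comparing the explicit class with the \w\s shorthand, written in Python since the question names no language. A triple-quoted raw string is used so that neither quote character needs escaping. The sample text is made up.

```python
import re

# Explicit list: letters, digits, the listed punctuation, space, CR/LF, hyphen last.
STRICT = re.compile(r"""^[A-Za-z0-9"?.',! \n\r-]+$""")
# Shorthand version: \w also admits "_" and \s also admits tabs and other whitespace.
LOOSE = re.compile(r"""^[\w\s"?.',!-]+$""")

text = 'Hello, "world"!\nIt\'s fine - ok?'
print(bool(STRICT.match(text)), bool(LOOSE.match(text)))

# The two differ on characters that only the shorthand classes cover.
print(bool(STRICT.match("a_b\tc")), bool(LOOSE.match("a_b\tc")))
```

In Python 3, \w on a str pattern also matches non-ASCII letters unless re.ASCII is passed, which is one more way the shorthand is looser than the explicit list.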

Regex Pattern - Allow alpha numeric, a bunch of special chars, but not a certain sequence of chars

I have the following regex:
(?!^[&#]*$)^([A-Za-z0-9-'.,&#:?!()$#/\\]*)$
So allow A-Z, a-z, 0-9, and these special chars '.,&#:?!()$#/\
I want to NOT match if the following set of chars is encountered anywhere in the string in this order:
&#
When I run this regex with just "&#" as input, it does not match my pattern, I get an error, great. When I run the regex with '.,&#:?!()$#/\ABC123 It does match my pattern, no errors.
However when I run it with:
'.,&##:?!()$#/\ABC123
It does not error either. I'm doing something wrong with the check for the &# sequence.
Can someone tell me what I've done wrong? I'm not great with these things.
Borrowing a technique for matching quoted strings, remove & from your character class, add an alternative for & not followed by #, and allow the string to optionally end with &:
^((?:[A-Za-z0-9-'.,#:?!()$#/\\]+|&[^#])*&?)$
I would actually do it in two parts:
Check your allowed character set. To do this I would look for characters that are not allowed, and return false if there's a match. That means I have a nice simple expression:
[^A-Za-z0-9'\.&#:?!()$#^]
Check your banned substring. And since it is just a substring, I probably wouldn't even use a regex for that part.
You didn't mention your language, but if in C#:
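using System.Text.RegularExpressions;

bool IsValid(string input)
{
    // Reject the banned substring, then reject any character outside the allowed set.
    return !( input.Contains("&#")
           || Regex.IsMatch(input, @"[^A-Za-z0-9'\.&#:?!()$#^]")
            );
}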
^((?!&#)[A-Za-z0-9-'.,&#:?!()$#/\\])*$
note that the last \ is escaped (doubled)
SO automatically turns \\ into \ if not in backticks
Assuming Perl compatible RegExp
To not match on the string '&#':
(?![^&]*&#)^([A-Za-z0-9-'.,&#:?!()$#/\\]*)$
Although you don't need the parentheses because you are matching the entire string.
Just FYI, although Ben Blank's regex works, it's more complicated than it needs to be. I would do it like this:
^(?:[A-Za-z0-9-'.,#:?!()$#/\\]+|&(?!#))+$
Because I used a negative lookahead instead of a negated character class, the regex doesn't need any extra help to match an ampersand at the end of the string.
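The lookahead approach can be tried with a short Python sketch (an assumption for illustration; the thread does not settle on a language). The character class is adapted from the question: the hyphen is escaped and the duplicated # is dropped, which does not change what it matches. The sample inputs are made up.

```python
import re

# Either a run of allowed characters other than "&",
# or a single "&" that is not immediately followed by "#".
VALID = re.compile(r"^(?:[A-Za-z0-9\-'.,#:?!()$/\\]+|&(?!#))+$")

for candidate in ["AT&T", "A&", "#&", "A&#B", "&#", ""]:
    print(repr(candidate), bool(VALID.match(candidate)))
```

An ampersand at the very end is accepted because the negative lookahead succeeds when nothing follows it.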
I'd recommend using two regular expressions in a conditional:
if (string has sequence "&#")
return false
else
return (string matches sequence "A-Za-z0-9-'.,&#:?!()$#/\")
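The two-step pseudocode above could look like this in Python (an assumption for illustration). The banned sequence is found with a plain substring test and the regex only checks the character set; the class is adapted from the question with the hyphen escaped, and the function name is hypothetical.

```python
import re

# One or more characters from the allowed set, nothing else.
ALLOWED_ONLY = re.compile(r"[A-Za-z0-9\-'.,&#:?!()$/\\]+")


def is_valid(text: str) -> bool:
    """Reject text containing "&#", then require only allowed characters."""
    if "&#" in text:
        return False
    return ALLOWED_ONLY.fullmatch(text) is not None


print(is_valid("AT&T"), is_valid("A&#B"))
```

Keeping the substring test separate keeps the regex simple, at the cost of scanning the string twice.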
I believe your second "main" regex of
^([A-Za-z0-9-'.,&#:?!()$#/\])$
has several errors:
It will test only one character in your set
The \ character in regular expressions is an escape: it indicates that the next character is to be treated specially (e.g. \n is the line feed character). The character sequence \] therefore escapes the bracket, so your bracketed list is never terminated.
You may be better off using
^[A-Za-z0-9-'.,&#:?!()$#/\\]+$
Note that the backslash character is represented by a double backslash.
The + character indicates that at least one character being tested has to match the regex; if it is fine to pass a zero-length string, replace the + with a *.