Removing nonnumeric and nonalpha characters from a string? - regex

What is the best way to remove all the special characters from a string - like these:
!##$%^&*(){}|:"?><,./;'[]\=-
The items having these characters removed would be rather short, so would it be better to use a regex on each or just use string manipulation?
Thx
Environment == C#/.NET

It's generally better to have a whitelist than a blacklist.
Regex has a convenient \w that effectively means alphanumeric plus underscore (some variants also add accented chars (á, é, ô, etc.) to the list; others don't).
You can invert that by using \W to mean everything that's not alphanumeric.
So replacing \W with an empty string will remove all 'special' characters.
Alternatively, if you do need a different set of characters to alphanumeric, you can use a negated character class: [^abc] will match everything that is not a or b or c, and [^a-z] will match everything that is not in the range a,b,c,d...x,y,z
The equivalent to \w is [A-Za-z0-9_] and thus \W is [^A-Za-z0-9_]
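To illustrate how \W differs from an explicit negated class, here is a minimal sketch in Python (used only for illustration; the question itself is about C#). The sample string is made up, and the behaviour shown assumes Python 3, where \w is Unicode-aware for str patterns unless re.ASCII is passed.

```python
import re

# Hypothetical sample: an underscore, two accented letters and a '!'
sample = "h\u00e9llo_w\u00f6rld!"

# In Python 3, \W is Unicode-aware: accented letters and '_' survive
unicode_words = re.sub(r"\W", "", sample)

# With re.ASCII, \W behaves like [^A-Za-z0-9_]
ascii_words = re.sub(r"\W", "", sample, flags=re.ASCII)

# An explicit whitelist also drops the underscore
alnum_only = re.sub(r"[^A-Za-z0-9]", "", sample)

print(unicode_words)
print(ascii_words)
print(alnum_only)
```

Which variant is appropriate depends on whether underscores and accented letters count as 'special' for your data.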

in php:
$tests = array(
'hello, world!'
,'this is a test'
,'and so is this'
,'another test with /slashes/ & (parenthesis)'
,'l3375p34k stinks'
);
function strip_non_alphanumerics( $subject )
{
return preg_replace( '/[^a-z0-9]/i', '', $subject );
}
foreach( $tests as $test )
{
printf( "%s\n", strip_non_alphanumerics( $test ) );
}
output would be:
helloworld
thisisatest
andsoisthis
anothertestwithslashesparenthesis
l3375p34kstinks

I prefer regex because the syntax is simpler to read and maintain:
# in Python
import re
re.sub("[abcdef]", "", text)
where abcdef are the properly escaped characters to be removed.
Alternatively, if you want only alphanumeric characters (plus the underscore), you could use:
re.sub(r"\W", "", text)
where \W represents a non-word character, i.e. [^a-zA-Z_0-9] for ASCII text (in Python 3, \W is Unicode-aware for str patterns unless you pass re.ASCII).

Here's a simple regex:
[^\w]
This catches all non-word characters, so it permits a-z, A-Z, 0-9 and _ (a space is not a word character, so spaces are removed as well). The underscore was not in your list, so this works. If you wanted to catch underscores also, then I would do something like this:
/[^a-z0-9]/i
This is the PHP format for anything other than a-z and 0-9; the i makes it case-insensitive.

When you just want to have alphanumeric characters, you could just express this by using an inverted character class:
[^A-Za-z0-9]+
This means: every character that is not alphanumeric.

In what language are you doing the regex?
For example, in Perl you can do a transliteration that deletes any of the chars in your list:
e.g. this will delete 'a', 'b', 'c' or 'd' (the d modifier deletes the matched characters; without it, tr with an empty replacement list only counts them):
$sentence =~ tr/abcd//d;
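For comparison, the same "delete these characters" idea without any regex can be sketched in Python with str.translate. The character list and the helper name delete_chars are placeholders, not something from the question.

```python
# Characters to delete (hypothetical; substitute your own list)
unwanted = "abcd"

# The third argument of str.maketrans lists characters to map to None,
# which makes translate() drop them
delete_table = str.maketrans("", "", unwanted)

def delete_chars(text: str) -> str:
    """Return text with every character listed in `unwanted` removed."""
    return text.translate(delete_table)

print(delete_chars("abcdefg"))
```

This is the plain string manipulation route; for short inputs either approach should be fast enough.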

Use the "tr" command?
You don't say what environment you're in... shell? C program? Java? Each of those would have different best solutions.

You could instead validate them at the frontend by checking the ASCII values of the characters as they are keyed in.

The ideal approach in PHP would be...
$text = "ABCDEF...Á123";
$text = preg_replace( '/[^\p{L}]/u', '', $text); # the u modifier treats pattern and subject as UTF-8
print($text); # Output: ABCDEFÁ
Or, in Perl...
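use utf8; # the source contains a literal non-ASCII character
binmode(STDOUT, ':encoding(UTF-8)'); # encode the output as UTF-8
my $text = "ABCDEF...Á123";
$text =~ s/[^\p{L}]//g;
print($text); # Output: ABCDEFÁ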
If you simply match on [^a-zA-Z], you will miss all accented characters, which (for the most part), I imagine you would want to retain.
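The same idea can be approximated in Python. The standard re module has no \p{L}, so this sketch removes everything matched by [\W\d_] instead, which leaves roughly the Unicode letters (a few non-decimal numeric characters also count as word characters and would survive). The function name keep_letters is made up for the example.

```python
import re

def keep_letters(text: str) -> str:
    # [\W\d_] matches non-word characters, decimal digits and the underscore,
    # so what remains is (approximately) Unicode letters only
    return re.sub(r"[\W\d_]+", "", text)

# \u00c1 is the accented capital A used in the examples above
print(keep_letters("ABCDEF...\u00c1123"))
```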

Related

perl: how to check alphanumeric values and limit the string size to 30 by using regex

I am trying to write a regex for Perl that checks for alphanumeric values (allowing spaces) but not the underscore "_", and limits the number of characters to 30. I am trying this but it is not working; could anyone please tell me what I am doing wrong? This code even accepts special characters as alphanumeric values. $currLine = 'Kapil# 123' should not be a valid value.
** Apologies: by $currLine = "regex" I meant $currLine =~ "regex"
if ($currLine = /^[a-zA-Z0-9]{1,30}$/){
say "Line3 Good: ", $currLine;
} else {
say "Error in Line 3: Name not alphanumeric ";
}
$currLine = /^[a-zA-Z0-9]{1,30}$/
means
$currLine = $_ =~ /^[a-zA-Z0-9]{1,30}$/
You want to use
$currLine =~ /^[a-zA-Z0-9]{1,30}$/
Now on to the other problems.
You didn't allow spaces. (What follows allows whitespace. If you mean SPACE specifically, use that instead of \s).
You allow a trailing newline.
You allow 31 characters if the 31st is a newline.
You forbid many alphanumeric characters.
You forbid zero characters.
$currLine =~ /^[\p{Alnum}\s]{0,30}\z/
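The trailing-newline pitfall listed above is not specific to Perl. As a rough Python sketch (the rule set here is an assumption: ASCII letters, digits and a literal space, 1 to 30 characters, which is narrower than \p{Alnum}), re.fullmatch plays the role of \z:

```python
import re

# Assumed rule set: ASCII letters, digits and a literal space, 1 to 30 chars
NAME_RE = re.compile(r"[A-Za-z0-9 ]{1,30}")

def is_valid_name(line: str) -> bool:
    # fullmatch anchors at both ends and does not tolerate a trailing newline
    return NAME_RE.fullmatch(line) is not None

print(is_valid_name("Kapil 123"))
print(is_valid_name("Kapil# 123"))
print(is_valid_name("Kapil\n"))

# For comparison: with ^...$ the trailing newline slips through
print(re.match(r"^[A-Za-z0-9 ]{1,30}$", "Kapil\n") is not None)
```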
You are using = (assignment) where you should have =~ (bind).
Enabling warnings may have alerted you to this. The code you have is matching $_ and then assigning the results of the match to $currLine.
For your regular expression to match all alphanumeric values including spaces, you need to include a space (or \s) inside your character class. You should also be using the bind operator =~ instead of = here.
if ( $currLine =~ /^[a-z0-9\s]{1,30}$/i ) { ...
Note: I included the i modifier for case-insensitive matching.
You are using assignment operator(=) instead of match operator(=~). You should change the if statement to:
if ($currLine =~ /^[a-zA-Z0-9]{1,30}$/)
This can also be shortened to:
if ($currLine =~ /^[^\W_]{1,30}$/)
\W matches anything not matched by \w, so [^\W] is equivalent to \w. To exclude _, we add it to the negated character class, giving [^\W_]. Note, however, that this matches much more than [a-zA-Z0-9]: it includes other Unicode characters that count as word characters. To restrict the regex to ASCII text, add the /a character set modifier:
/^[^\W_]{1,30}$/a

Regular Expression:- String can contain any characters but should not be empty

My requirement is
"A string should not be blank or empty"
E.g., a string can contain any number of characters, including special characters, but should never be empty. For example, a string can contain "a,b,c" or "xyz123abc" or "12!#$#%&*()9" or " aa bb cc ".
So, this is what I tried.
Regex for blank or space:
^\s*$
^ is the beginning of string anchor
$ is the end of string anchor
\s is the whitespace character class
* is zero-or-more repetition of
I'm stuck on how to negate the regex ^\s*$ so that it accepts any string like "a,b,c" or "xyz" or "12!#$#%&*()9"
Any help is appreciated.
No need for a regex. In Groovy you have the isAllWhitespace method:
groovy:000> "".allWhitespace
===> true
groovy:000> " \t\n ".allWhitespace
===> true
groovy:000> "something".allWhitespace
===> false
So asking !yourString.allWhitespace should tell you whether your string is something other than empty or blank :)
\S
\S matches any non-white space character
Each character class has its own anti-class defined: for \w you have \W, for \s you have \S, for \d you have \D, etc.
http://www.regular-expressions.info/charclass.html
Your regex engine may not support \S. If that is the case, use [^ \t\v]. If you support Unicode (which you should), there are more space types that you should watch for.
If both you and your regex engine support Unicode, and \S is not supported by your regex engine, then you'll probably want to use the following (if you care about people entering different Unicode space types):
[^ \r\f\t\v\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u200B\u2028\u2029\u202F\u205F\u3000\uFEFF]
http://www.cs.tut.fi/~jkorpela/chars/spaces.html
http://en.wikipedia.org/wiki/Whitespace_character#Unicode
To me, two simple ways to express it are (neither needs anchoring):
s.trim() =~ /.+/
or
s =~ /\S+/
The first assumes you know how trim() works; the second assumes you know the meaning of \S.
Of course,
!s.allWhitespace
is perfect, again if you know it exists.
The following regular expression will ensure that a string contains at least 1 non-whitespace character.
^(?!\s*$).+
Note: I am not familiar with Groovy, but I would imagine there are native functions (trim, empty, etc.) that test this more naturally than a regular expression.
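To make the two styles concrete, here is a small Python sketch (not Groovy; the function names are invented for the example) of a native check next to the "at least one non-whitespace character" regex check:

```python
import re

def is_non_blank(s: str) -> bool:
    # strip() removes leading and trailing whitespace;
    # an empty result means the string was empty or blank
    return bool(s.strip())

def is_non_blank_re(s: str) -> bool:
    # One non-whitespace character anywhere in the string is enough
    return re.search(r"\S", s) is not None

for candidate in ["", " \t\n ", "a,b,c", " aa bb cc "]:
    print(repr(candidate), is_non_blank(candidate), is_non_blank_re(candidate))
```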
Is this in a Grails domain class?
If so, just use the blank constraint.

match parentheses in powershell using regex

I'm trying to check for invalid filenames. I want a filename to only contain lowercase, uppercase, numbers, spaces, periods, underscores, dashes and parentheses. I've tried this regex:
$regex = [regex]"^([a-zA-Z0-9\s\._-\)\(]+)$"
$text = "hel()lo"
if($text -notmatch $regex)
{
write-host 'not valid'
}
I get this error:
Error: "parsing "^([a-zA-Z0-9\s\._-\)\(]+)$" - [x-y] range in reverse order"
What am I doing wrong?
Try to move the - to the end of the character class
^([a-zA-Z0-9\s\._\)\(-]+)$
In the middle of a character class it needs to be escaped; otherwise it defines a range.
You can replace a-zA-Z0-9 and _ with \w.
$regex = [regex]"^([\w\s\.\-\(\)]+)$"
From get-help about_Regular_Expressions:
\w
Matches any word character.
Equivalent to the Unicode
character categories [\p{Ll}
\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
If ECMAScript-compliant behavior
is specified with the ECMAScript
option, \w is equivalent to
[a-zA-Z_0-9].
I guess, add a backslash before the lone hyphen:
$regex = [regex]"^([a-zA-Z0-9\s\._\-\)\(]+)$"
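The same pitfall can be reproduced outside PowerShell. As a sketch in Python (whose regex engine differs from .NET's, so the error message will not be identical), an unescaped hyphen between '_' and ')' is rejected as a reversed range, while a hyphen at the end of the class is taken literally:

```python
import re

# The hyphen between '_' and ')' is read as a range whose end sorts before its start
bad_pattern = r"^([a-zA-Z0-9\s\._-\)\(]+)$"
try:
    re.compile(bad_pattern)
    compiled_ok = True
except re.error:
    compiled_ok = False

# With the hyphen moved to the end of the class it is a literal character
good = re.compile(r"^[a-zA-Z0-9\s._()-]+$")

print(compiled_ok)
print(bool(good.match("hel()lo")))
print(bool(good.match("bad*name")))
```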

PERL-Substitute any non alphanumerical character to "_"

In perl I want to substitute any character not [A-Z]i or [0-9] and replace it with "_" but only if this non alphanumerical character occurs between two alphanumerical characters. I do not want to touch non-alphanumericals at the beginning or end of the string.
I know enough regex to replace them, just not to only replace ones in the middle of the string.
s/(\p{Alnum})\P{Alnum}(\p{Alnum})/${1}_${2}/g;
Of course, that would hurt your chances with "#A#B%C", so you might use look-arounds:
s/(?<=\p{Alnum})\P{Alnum}(?=\p{Alnum})/_/g;
That way you isolate it to just the non "alnum" character.
Or you could use the "keep flag", as well and get the same thing done.
s/\p{Alnum}\K\P{Alnum}(?=\p{Alnum})/_/g;
EDIT based on input:
To not eat a newline, you could do the following:
s/\p{Alnum}\K[^\p{Alnum}\n](?=\p{Alnum})/_/g;
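To see why the capturing version misses overlapping cases while the look-around version does not, here is a Python sketch of both. It assumes ASCII alphanumerics only (narrower than \p{Alnum}), and the function names are made up.

```python
import re

ALNUM = "A-Za-z0-9"  # ASCII alphanumerics (assumption; \p{Alnum} is broader)

def underscore_inner(text: str) -> str:
    # Look-arounds assert the neighbours without consuming them
    return re.sub(rf"(?<=[{ALNUM}])[^{ALNUM}\n](?=[{ALNUM}])", "_", text)

def underscore_inner_consuming(text: str) -> str:
    # Capturing version: the neighbours are consumed, so the character
    # right after a match cannot serve as the left neighbour of the next one
    return re.sub(rf"([{ALNUM}])[^{ALNUM}]([{ALNUM}])", r"\1_\2", text)

print(underscore_inner("#A#B%C"))
print(underscore_inner_consuming("#A#B%C"))
```

With "#A#B%C" the first function replaces both inner characters, while the second leaves the '%' alone because 'B' was already consumed by the previous match.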
Try this:
my $str = 'a-2=c+a()_';
$str =~ s/(?<=[A-Z0-9])[^A-Z0-9](?=[A-Z0-9])/_/gi;

Term with no alphanumeric characters before or after

I am trying to write a regular expression that matches all occurrences of a specified word, but must not have any alphanumeric characters prefixed or suffixed.
For example, searching for the term "cat" should not return terms like "catalyst".
Here is what I have so far:
"?<!([a-Z0-9])*?TERMPLACEHOLDER?!([a-Z0-9])*?"
This should return the word "TERMPLACEHOLDER" on its own.
Any ideas?
Thanks.
How about:
\bTERMPLACEHOLDER\b
You could use word boundaries: \bTERMPLACEHOLDER\b
A quick test in Javascript:
var a = "this cat is not a catalyst";
console.log(a.match(/\bcat\b/));
Returns just "cat".
You may be looking for word boundaries. From there, you can use wildcards like \w*? on either side of the word if you want to make it match partials.
Search for any word containing "MYWORD"
\b\w*?MYWORD\w*?\b
Search for any word ending in "ING"
\b\w*?ING\b
Search for any word starting with "TH"
\bTH\w*?\b
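The patterns above can be tried quickly in Python as well. This is only a sketch with made-up sample sentences:

```python
import re

sentence = "this cat is not a catalyst"
shout = "THIS THING IS RINGING"

# Whole word only: "catalyst" is not matched
whole_word = re.findall(r"\bcat\b", sentence)

# Any word starting with "TH"
starts_with_th = re.findall(r"\bTH\w*?\b", shout)

# Any word ending in "ING"
ends_with_ing = re.findall(r"\b\w*?ING\b", shout)

print(whole_word)
print(starts_with_th)
print(ends_with_ing)
```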
Be careful when you say "word" to refer to a substring you want to find. On the regular expression side, "word" has a different meaning: it's a character class.
Define the 'literal' string you would like to find (not word). This can be anything, sentences, punctuation, newline combinations. Example "find this \exact phrase <> !abc".
Since this is going to be part of a regular expression (not the whole regex), you can escape the special regular expression metacharacters that might be embedded.
string = 'foo.bar' // the string you want to find
string =~ s/[.*+?|()\[\]{}^\$\\]/\\$&/g // Escape metachars
Now the 'literal' string is ready to be inserted into the regular expression. Note that if you want to individually allow classes or want metachars in the string, you would have to escape this yourself.
sample =~ /(?<![^\W_])$string(?![^\W_])/ig // Find the string globally
(expanded)
/
(?<![^\W_]) # assertion: No alphanumeric character behind us
$string # the 'string' we want to find
(?![^\W_]) # assertion: No alphanumeric character in front of us
/ig
Perl sample -
use strict;
use warnings;
my $string = 'foo.bar';
my $sample = 'foo.bar and !fooAbar and afoo.bar.foo.bar';
# Quote string metacharacters
$string =~ s/[.*+?|()\[\]{}^\$\\]/\\$&/g;
# Globally find the string in the sample target
while ( $sample =~ /(?<![^\W_])$string(?![^\W_])/ig )
{
print substr($sample, 0, $-[0]), "-->'",
substr($sample, $-[0], $+[0] - $-[0]), "'\n";
}
Output -
-->'foo.bar'
foo.bar and !fooAbar and afoo.bar.-->'foo.bar'
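For readers more at home in Python, roughly the same approach can be sketched with re.escape doing the metacharacter quoting. The helper name find_standalone is invented for the example; the sample string is the one from the Perl code above.

```python
import re

def find_standalone(literal: str, sample: str) -> list:
    """Return start offsets where `literal` occurs with no alphanumeric
    character directly before or after it (case-insensitive)."""
    # re.escape quotes any regex metacharacters in the literal
    pattern = r"(?<![^\W_])" + re.escape(literal) + r"(?![^\W_])"
    return [m.start() for m in re.finditer(pattern, sample, re.IGNORECASE)]

sample = "foo.bar and !fooAbar and afoo.bar.foo.bar"
print(find_standalone("foo.bar", sample))
```

The two offsets it reports correspond to the two matches shown in the Perl output.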