How can I allow a literal dot in a Perl regular expression?

I use this condition to check whether the value is alphanumeric:
$value =~ /^[a-zA-Z0-9]+$/
How can I modify this regex to account for a possible dot . in the value without accepting any other special characters?

$value =~ /^[a-zA-Z0-9.]+$/

Using the alnum POSIX character class, one char shorter :)
$value =~ /^[[:alnum:].]+$/;

Don't forget the /i option and the \d character class.
$value =~ /^[a-z\d.]+$/i

If you don't want to allow any characters other than those allowed in the character class, you shouldn't use the $ end of line anchor since that allows a trailing newline. Use the absolute end-of-string anchor \z instead:
$value =~ /^[a-z0-9.]+\z/i;
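The same pitfall can be reproduced outside Perl. This is a minimal Python sketch, not the answerer's code: Python's `$` likewise matches just before a trailing newline, and its `\Z` plays the role of Perl's `\z`. The helper names `is_valid_loose` and `is_valid_strict` are made up for illustration.

```python
import re

# Hypothetical helpers; the patterns mirror the Perl ones above.
LOOSE = re.compile(r'^[a-z0-9.]+$', re.IGNORECASE)    # $ tolerates one trailing "\n"
STRICT = re.compile(r'^[a-z0-9.]+\Z', re.IGNORECASE)  # \Z only matches at the very end


def is_valid_loose(value: str) -> bool:
    """True if value is letters, digits and dots, optionally followed by one newline."""
    return LOOSE.search(value) is not None


def is_valid_strict(value: str) -> bool:
    """True only if value consists entirely of letters, digits and dots."""
    return STRICT.search(value) is not None


if __name__ == "__main__":
    for candidate in ["file.txt", "file.txt\n", "file_txt"]:
        print(repr(candidate), is_valid_loose(candidate), is_valid_strict(candidate))
```

Only the second candidate should be treated differently by the two helpers: the loose pattern accepts it, the strict one rejects it.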

Look at the Perl regular expression documentation (perldoc perlre):
\w Match "word" character (alphanumeric plus "_")
$value =~ /^[\w.]+$/;

Related

Perl $1 variable not defined after regex match

This is probably a very basic error on my part, but I've been stuck on this problem for ages and it's driving me up the wall!
I am looping through a file of Python code using Perl and identifying its variables. I am using a Perl regex to pick out substrings of alphanumeric characters in between spaces. The regex works fine and identifies the lines that the matches belong to, but when I try to return the actual substring that matches the regex, the capture variable $1 is undefined.
Here is my regex:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
And here is the error:
x = 1
Use of uninitialized value $1 in print at ./vars.pl line 7, <> line 2.
As I understand it, $1 is supposed to return x. Where is my code going wrong?
You're not capturing the result:
if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) {
If you want to match a line like x = 1 and get both parts of it, you need to match and capture both with parentheses. A crude approach:
if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
my $var = $1;
my $val = $2;
}
The correct answer has been given by Leeft: You need to capture the string by using parentheses. I wanted to mention some other things. In your code:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
print $line;
print $1;
}
You are surrounding your match with .*\s+ and \s+.*. This is unlikely to do what you think. You never need to use .* with m// unless you are capturing a string (or capturing the whole match using $&). The match is not anchored by default and will match anywhere in the string. To anchor the match you must use ^ or $. E.g.:
if ('abcdef' =~ /c/) # returns true
if ('abcdef' =~ /^c/) # returns false, match anchored to beginning
if ('abcdef' =~ /c$/) # returns false, match anchored to end
if ('abcdef' =~ /c.*$/) # returns true
As you see in the last example, using .* is quite redundant, and to get the match you need only remove the anchor. Or if you wanted to capture the whole string:
if ('abcdef' =~ /(c.*)$/) # returns true, captures 'cdef'
You can also use $&, which contains the entire match, regardless of parentheses.
You are probably using \s+ to ensure you do not match partial words. You should be aware that there is an escape sequence called word boundary, \b. This is a zero-length assertion that checks that the characters on either side of it are a word character and a non-word character (the start and end of the string count as non-word).
'abc cde fgh' =~ /\bde\b/ # no match
'abc cde fgh' =~ /\bcde\b/ # match
'abc cde fgh' =~ /\babc/ # match
'abc cde fgh' =~ /\s+abc/ # no match! there is no whitespace before 'a'
As you see in the last example, using \s+ fails at the start or end of the string. Do note that \b also matches next to non-word characters that you might consider part of a word, such as a hyphen:
'aaa-xxx' =~ /\bxxx/ # match
You must decide if you want this behaviour or not. If you do not, an alternative to using \s is to use the double negated case: (?!\S). This is a zero-length negative look-ahead assertion, looking for non-whitespace. It will be true for whitespace, and for end of string. Use a look-behind to check the other side.
Lastly, you are using [a-zA-Z0-9]. This can be replaced with \w, although \w also includes underscore _ (and other word characters).
So your regex becomes:
/\b(\w+)\b/
Or
/(?<!\S)(\w+)(?!\S)/
Documentation:
perldoc perlvar - Perl built-in variables
perldoc perlop - Perl operators
perldoc perlre - Perl regular expressions
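To make the difference between capturing, `\b`, and the whitespace lookarounds concrete, here is a small Python sketch. Python's `re` module supports the same `\b`, `(?<!\S)` and `(?!\S)` constructs; the function name `parse_assignment` and the sample strings are illustrative assumptions, not part of the original answers.

```python
import re

# Two capture groups: variable name and value of a line such as "x = 1".
ASSIGNMENT = re.compile(r'^\s*(\w+)\s*=\s*(\w+)\s*$')

# \b treats any non-word character (including "-") as a boundary.
WORD_BOUNDARY = re.compile(r'\b(\w+)\b')

# Lookarounds: the word must be delimited by whitespace or the string edges.
SPACE_DELIMITED = re.compile(r'(?<!\S)(\w+)(?!\S)')


def parse_assignment(line: str):
    """Return (name, value) for a simple assignment line, or None if it does not match."""
    match = ASSIGNMENT.match(line)
    return match.groups() if match else None


if __name__ == "__main__":
    print(parse_assignment("x = 1\n"))
    print(WORD_BOUNDARY.findall("aaa-xxx yyy"))
    print(SPACE_DELIMITED.findall("aaa-xxx yyy"))
```

Without the parentheses there would be no groups to read back, which is the Python counterpart of `$1` being undefined.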

regular expressions in perl for extracting information

How would I match any number of any characters between two specific words... I have a document with a block of text enclosed between 'begin parameters' and 'end parameters'. These two phrases are separated by a number of lines of text. So my text looks like this:
begin parameters
<lines of text here \n.
end parameters
My current regular expression looks like this:
my $regex = "begin parameters[.*\n*]end parameters";
However this is not matching. Does anybody have any suggestions?
Use the /s switch so that the any-character metacharacter . also matches newlines.
I also suggest that you use non-greedy matching by adding ? to your quantifier.
use strict;
use warnings;
my $data = do {local $/; <DATA>};
if ($data =~ /begin parameters(.*?)end parameters/s) {
print "'$1'";
}
__DATA__
begin parameters
<lines of text here.
end parameters
Outputs:
'
<lines of text here.
'
Your current regular expression does not do what you may think: by placing those characters inside a character class, it matches a single character out of ., * and \n instead of what you actually want.
You can use the s modifier forcing the dot . to match newline sequences. By placing a capturing group around what you want to extract, you can access that by using $1
my $regex = qr/begin parameters(.*?)end parameters/s;
my $string = do {local $/; <DATA>};
print $1 if $string =~ /$regex/;
Please try this:
begin parameters([\S\s]+?)end parameters
Translation: this looks for any character that is whitespace or any character that is not whitespace (so, in effect, any character at all, including newlines) until it finds "end parameters".
I hope this is what you expect.
The metacharacter . loses its special properties inside a character class, and so does the quantifier *.
So [.*\n*] actually matches exactly one character: a literal period, a literal asterisk, or a newline.
What you actually want is to match zero or more of any character, including newlines, which you can represent with a non-capturing group:
begin parameters(?:.|\n)*?end parameters
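The approaches from these answers can be compared side by side in a short Python sketch. `re.DOTALL` is Python's equivalent of Perl's `/s`; the sample text and the function names `extract_block` and `extract_block_alt` are assumptions made for the example.

```python
import re

# Sample input modelled on the question; the exact contents are illustrative.
TEXT = "begin parameters\n<lines of text here.\nend parameters\n"


def extract_block(text: str):
    """Non-greedy match with DOTALL so that . also matches newlines."""
    match = re.search(r'begin parameters(.*?)end parameters', text, re.DOTALL)
    return match.group(1) if match else None


def extract_block_alt(text: str):
    """Same result without DOTALL, using an explicit 'any char or newline' group."""
    match = re.search(r'begin parameters((?:.|\n)*?)end parameters', text)
    return match.group(1) if match else None


def original_attempt(text: str):
    """The character class from the question: it matches exactly one character."""
    return re.search(r'begin parameters[.*\n*]end parameters', text)


if __name__ == "__main__":
    print(repr(extract_block(TEXT)))
    print(repr(extract_block_alt(TEXT)))
    print(original_attempt(TEXT))
```

The non-greedy quantifier matters when the input contains several blocks: it stops at the first "end parameters" rather than the last.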

Why do these two regexes behave differently?

Why do the following two regexes behave differently?
$millisec = "1391613310.1";
$millisec =~ s/.*(\.\d+)?$/$1/;
vs.
$millisec =~ s/\d*(\.\d+)?$/$1/;
This code prints nothing:
perl -e 'my $mtime = "1391613310.1"; my $millisec = $mtime; $millisec =~ s/.*(\.\d+)?$/$1/; print "$millisec";'
While this prints the decimal portion of the string:
perl -e 'my $mtime = "1391613310.1"; my $millisec = $mtime; $millisec =~ s/\d*(\.\d+)?$/$1/; print "$millisec";'
In the first regex, the .* is taking up everything to the end of the string, so there's nothing the optional (\.\d+)? can pick up. $1 will be empty, so the string is replaced by an empty string.
In the second regex, only digits are grabbed from the beginning so that \d* stops in front of the dot. (\.\d+)? will pick up the dot, including the trailing digits.
Note that an unescaped . inside the parentheses, as in (.\d+), would match any character followed by digits. If you want to match a dot explicitly, you have to escape it as \., as the regexes above do.
To make the first regex behave similarly to the second one you would have to write
$millisec =~ s/.*?(\.\d+)?$/$1/;
so that the initial .* doesn't take up everything.
Greed.
Perl's regex engine will match as much as possible with each term before moving on to the next term. So for .*(\.\d+)?$ the .* matches the entire string, then (\.\d+)? matches nothing as it is optional.
\d*(\.\d+)?$ can match only up to the dot, so it then has to match .1 against (\.\d+)?
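The greedy versus non-greedy behaviour described above is not specific to Perl. The following Python sketch reproduces it; the helper name `keep_fraction` is hypothetical. `count=1` imitates `s///` without `/g`, and the replacement function returns an empty string when group 1 did not take part in the match, which is what happens to `$1` in the first Perl example.

```python
import re


def keep_fraction(pattern: str, text: str) -> str:
    """Replace the first match of pattern with its first group (or "" if unset)."""
    return re.sub(pattern, lambda m: m.group(1) or "", text, count=1)


if __name__ == "__main__":
    mtime = "1391613310.1"
    # Greedy .* swallows the whole string; the optional group matches nothing.
    print(repr(keep_fraction(r'.*(\.\d+)?$', mtime)))
    # \d* has to stop at the dot, so the group captures the fraction.
    print(repr(keep_fraction(r'\d*(\.\d+)?$', mtime)))
    # Non-greedy .*? gives characters back, so the group captures the fraction.
    print(repr(keep_fraction(r'.*?(\.\d+)?$', mtime)))
```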

match parentheses in powershell using regex

I'm trying to check for invalid filenames. I want a filename to only contain lowercase, uppercase, numbers, spaces, periods, underscores, dashes and parentheses. I've tried this regex:
$regex = [regex]"^([a-zA-Z0-9\s\._-\)\(]+)$"
$text = "hel()lo"
if($text -notmatch $regex)
{
write-host 'not valid'
}
I get this error:
Error: "parsing "^([a-zA-Z0-9\s\._-\)\(]+)$" - [x-y] range in reverse order"
What am I doing wrong?
Try moving the - to the end of the character class:
^([a-zA-Z0-9\s\._\)\(-]+)$
In the middle of a character class it needs to be escaped; otherwise it defines a range.
You can replace a-zA-Z0-9 and _ with \w.
$regex = [regex]"^([\w\s\.\-\(\)]+)$"
From get-help about_Regular_Expressions:
\w Matches any word character. Equivalent to the Unicode character categories [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \w is equivalent to [a-zA-Z_0-9].
I guess you need to add a backslash before the lone hyphen:
$regex = [regex]"^([a-zA-Z0-9\s\._\-\)\(]+)$"
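The reversed-range error is easy to reproduce in other regex engines. This Python sketch (an analogue, not PowerShell) shows that the original class fails to compile because `_-\)` is read as a range from `_` down to `)`, and that placing the hyphen last fixes it. The names `compiles` and `is_valid_filename` are illustrative assumptions.

```python
import re

# The pattern from the question: "_-\)" is parsed as a range in reverse order.
BROKEN = r'^([a-zA-Z0-9\s\._-\)\(]+)$'

# Hyphen moved to the end of the class, where it is a literal character.
FIXED = re.compile(r'[a-zA-Z0-9\s._()-]+')


def compiles(pattern: str) -> bool:
    """Return True if the pattern is accepted by Python's regex compiler."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def is_valid_filename(name: str) -> bool:
    """fullmatch requires the whole name to consist of allowed characters."""
    return FIXED.fullmatch(name) is not None


if __name__ == "__main__":
    print(compiles(BROKEN))
    print(is_valid_filename("hel()lo"))
    print(is_valid_filename("bad*name"))
```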

Removing nonnumeric and nonalpha characters from a string?

What is the best way to remove all the special characters from a string - like these:
!##$%^&*(){}|:"?><,./;'[]\=-
The items having these characters removed would be rather short, so would it be better to use regex on each or just use string manipulation?
Thx
Environment == C#/.NET
It's generally better to have a whitelist than a blacklist.
Regex has a convenient \w that effectively means alphanumeric plus underscore (some variants also add accented chars (á, é, ô, etc.) to the list, others don't).
You can invert that by using \W to mean everything that's not alphanumeric or underscore.
So replacing \W with the empty string will remove all 'special' characters.
Alternatively, if you do need a different set of characters to alphanumeric, you can use a negated character class: [^abc] will match everything that is not a or b or c, and [^a-z] will match everything that is not in the range a,b,c,d...x,y,z
The equivalent to \w is [A-Za-z0-9_] and thus \W is [^A-Za-z0-9_]
In PHP:
$tests = array(
'hello, world!'
,'this is a test'
,'and so is this'
,'another test with /slashes/ & (parenthesis)'
,'l3375p34k stinks'
);
function strip_non_alphanumerics( $subject )
{
return preg_replace( '/[^a-z0-9]/i', '', $subject );
}
foreach( $tests as $test )
{
printf( "%s\n", strip_non_alphanumerics( $test ) );
}
output would be:
helloworld
thisisatest
andsoisthis
anothertestwithslashesparenthesis
l3375p34kstinks
I prefer regex because the syntax is simpler to read and maintain:
# in Python
import re
re.sub("[abcdef]", "", text)
where abcdef are the properly escaped characters to be removed.
Alternatively, if you want only alphanumeric characters (plus the underscore), you could use:
re.sub(r"\W", "", text)
where \W represents a non-word character, i.e. [^a-zA-Z_0-9] for ASCII text (in Python 3, \w also covers Unicode letters and digits unless re.ASCII is set).
Here's a simple regex:
[^\w]
This catches all non-word characters, so it keeps a-z, A-Z, 0-9 and _. The underscore was not in your list, so this already works; if you want to catch underscores as well, I would do something like this:
/[^a-z0-9]/i
This is the PHP format for anything other than a-z and 0-9; the i makes it case-insensitive.
When you just want to have alphanumeric characters, you could just express this by using an inverted character class:
[^A-Za-z0-9]+
This means: every character that is not alphanumeric.
In what language are you doing the regex?
For example, in Perl you can do a transliteration that deletes any of the chars in your list:
e.g. this deletes every 'a', 'b', 'c' or 'd' (note the /d modifier; without it, tr with an empty replacement list only counts the characters):
$sentence =~ tr/abcd//d;
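For readers outside Perl, a rough Python counterpart of character deletion with `tr` is `str.translate` with a deletion table built by `str.maketrans`. This is a sketch under that assumption; the function name `delete_chars` is made up for the example.

```python
def delete_chars(text: str, chars: str) -> str:
    """Remove every occurrence of each character in chars from text."""
    # The third argument of str.maketrans lists characters to map to None (delete).
    table = str.maketrans("", "", chars)
    return text.translate(table)


if __name__ == "__main__":
    print(repr(delete_chars("abcdef cab", "abcd")))
    print(repr(delete_chars("hello, world!", ",!")))
```

No regex escaping is needed here, which can be convenient when the blacklist contains characters such as `]`, `\` or `-`.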
Use the "tr" command?
You don't say what environment you're in... shell? C program? Java? Each of those would have different best solutions.
Alternatively, you can validate them at the frontend by checking the ASCII values of the keyed-in characters.
The ideal approach in PHP would be...
$text = "ABCDEF...Á123";
$text = preg_replace( '/[^\p{L}]/u', '', $text); # /u treats pattern and subject as UTF-8
print($text); # Output: ABCDEFÁ
Or, in Perl...
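use utf8;                            # the string literal below is UTF-8 source text
binmode(STDOUT, ':encoding(UTF-8)'); # encode the output as UTF-8
my $text = "ABCDEF...Á123";
$text =~ s/[^\p{L}]//g;
print($text); # Output: ABCDEFÁ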
If you simply strip everything matching [^a-zA-Z], you will also remove all accented characters, which (for the most part) I imagine you would want to retain.
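The whitelist idea from the answers above can be sketched in Python as follows. Two variants are shown: an ASCII-only whitelist via a negated character class, and a Unicode-aware letter filter. Python's standard `re` module has no `\p{L}`, so the second variant uses `str.isalpha` instead; both function names are illustrative assumptions.

```python
import re

# Anything that is not an ASCII letter or digit.
NON_ALNUM_ASCII = re.compile(r'[^A-Za-z0-9]')


def strip_non_alphanumerics(text: str) -> str:
    """Keep only ASCII letters and digits (accented letters are removed)."""
    return NON_ALNUM_ASCII.sub("", text)


def keep_letters(text: str) -> str:
    """Keep every character Unicode classifies as a letter, accented ones included."""
    return "".join(ch for ch in text if ch.isalpha())


if __name__ == "__main__":
    print(strip_non_alphanumerics("hello, world!"))
    print(strip_non_alphanumerics("ABCDEF...Á123"))
    print(keep_letters("ABCDEF...Á123"))
```

Which variant is appropriate depends on whether accented letters should survive the cleanup.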