How to match this string 'afa' but not 'bebeeeb' in Perl? - regex

$s1='afa';
$s2='bebeeeb';
$s1=~/((\w)(?!\2))+\2?/;
This regex matches both of the strings.
I want to match only the first string. (The first character followed by any character but not the first character. The captured two characters can be repeated any number of times.)

You can try this:
^(?:(\w)(?!\1))+$

Easier to check for the opposite.
# Doesn't contain repeated word characters.
$s !~ /(\w)\1/
Otherwise, you have to check at every position. (You were checking at any position.)
# Every character is a non-word character or a non-repeated word character
$s =~ /^(?:\W|(\w)(?!\1))*\z/
If the input can only contain word characters, the above simplifies to
# Every character is a non-repeated word character
$s =~ /^(?:(\w)(?!\1))*\z/
or even
# Every character is a non-repeated character
$s =~ /^(?:(.)(?!\1))*\z/s
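A minimal sketch of the two checks above, run against the strings from the question. The helper names no_adjacent_repeat and no_adjacent_repeat_anchored are made up for illustration; they are not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: true when no word character is immediately
# followed by a copy of itself (the "opposite" check).
sub no_adjacent_repeat {
    my ($s) = @_;
    return $s !~ /(\w)\1/ ? 1 : 0;
}

# Hypothetical helper: the same condition written as a positive match
# that must hold at every position of the string.
sub no_adjacent_repeat_anchored {
    my ($s) = @_;
    return $s =~ /^(?:\W|(\w)(?!\1))*\z/ ? 1 : 0;
}

for my $s ('afa', 'bebeeeb') {
    printf "%-8s %s %s\n", $s,
        (no_adjacent_repeat($s)          ? 'ok' : 'repeat'),
        (no_adjacent_repeat_anchored($s) ? 'ok' : 'repeat');
}
```

Both helpers should agree on every input; 'bebeeeb' is rejected because of the adjacent "ee".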

Related

Perl Regexp::Common package not matching certain real numbers when used with word boundary

The code below prints "34" instead of the expected ".34":
use strict;
use warnings;
use Regexp::Common;
my $regex = qr/\b($RE{num}{real})\s*/;
my $str = "This is .34 meters of cable";
if ($str =~ /$regex/) {
print $1;
}
Do I need to fix my regex? (The word boundary is needed, because without it the regex would match part of a string like xx34, which I don't want.)
Or is it a bug in Regexp::Common? I always thought that the longest match should win.
The word boundary is a context-dependent regex construct. When it is followed with a word char (letter, digit or _) this location should be preceded either with the start of a string or a non-word char. In this concrete case, the word boundary is followed with a non-word char and thus requires a word char to appear right before this character.
You may use a non-ambiguous word boundary expressed with a negative lookbehind:
my $regex = qr/(?<!\w)($RE{num}{real})/;
               ^^^^^^^
The (?<!\w) negative lookbehind always denotes one thing: fail the match if there
is a word character immediately to the left of the current location.
Or, use a whitespace boundary if you want your matches to only occur after whitespace or start of string:
my $regex = qr/(?<!\S)($RE{num}{real})/;
               ^^^^^^^
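The difference between \b and the lookbehind can be seen in a small sketch. To keep it free of non-core modules, $num below is a simplified stand-in for $RE{num}{real} (an assumption, not the module's actual pattern), and first_number is a hypothetical helper.

```perl
use strict;
use warnings;

# Simplified stand-in for $RE{num}{real}; an assumption for illustration.
my $num = qr/\d*\.?\d+/;

# Hypothetical helper: returns the first captured number, or undef.
sub first_number {
    my ($re, $str) = @_;
    return $str =~ $re ? $1 : undef;
}

my $with_b          = qr/\b($num)/;       # \b needs a word char on one side
my $with_lookbehind = qr/(?<!\w)($num)/;  # only forbids a word char on the left

my $str = "This is .34 meters of cable";
print first_number($with_b, $str), "\n";           # 34
print first_number($with_lookbehind, $str), "\n";  # .34
```

With \b, the match cannot start at the dot because a space and a dot are both non-word characters, so it starts between the dot and the 3 instead.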
Try this pattern: (?:^| )(\d*\.?\d+)
Explanation:
(?:...) - non-capturing group
^|  - match either ^ (the beginning of the string) or a space
\d* - match zero or more digits
\.? - match dot literally - zero or one
\d+ - match one or more digits
Matched number will be stored in first capturing group.

regexp print line by line and remove last word

I am trying to remove the last word from each line if the line contains more than one word.
If a line has only one word, it should be printed as is, with nothing deleted.
Say these are the lines:
address 34 address
value 1 value
valuedescription
size 4 size
From the lines above I want to remove the last word of each line, except for the 3rd line, which has only one word, using a regexp.
I tried the regexp below, but it also removes the word on single-word lines:
$_ =~ s/\s*\S+\s*+$//;
Need your help for the same.
You can use:
$_ =~ s/(?<=\w)\h+\w+$//m;
Explanation:
(?<=\w): Lookbehind to assert that we have at least one word char before last word
\h+: Match 1+ horizontal whitespaces
\w+: match a word with 1+ word characters
$: End of line
Try this regex:
^(?=(?:\w+ \w+)).*\K\b\w+
Replace each match with a blank string
OR
^((?=(?:\w+ \w+)).*\b)\w+
and replace each match with \1
Explanation(1st Regex):
^ - asserts the start of the line
(?=(?:\w+ \w+)) - positive lookahead to check if the string has 2 words present in it
.* - If the above condition satisfies, then match 0+ occurrences of any character(except newline) until the end of the line
\K - forget everything matched so far
\b - backtrack to find the last word boundary
\w+ - matches the last word
A single word with no whitespace matches your regex because you've used \s* both before and after the \S+, and \s* matches an empty string.
You could use $_ =~ s/^(.*\S)\s+(\S+)$/$1/;
[Explanation: Match the RegEx if the line contains some number of characters ending with a non-whitespace (stored in $1), followed by 1 or more white-space characters, followed by 1 or more non-white-space characters. If there is a match, replace it all with the first part ($1).]
Though you might want to trim leading/trailing whitespace if you think it might contain any - depends on what you want to happen in those cases.
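As a rough check, the last substitution can be run over the sample lines from the question. The wrapper drop_last_word is a hypothetical name used only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution: it only fires when
# whitespace separates a final word from something before it.
sub drop_last_word {
    my ($line) = @_;
    $line =~ s/^(.*\S)\s+\S+$/$1/;
    return $line;
}

print drop_last_word($_), "\n"
    for 'address 34 address', 'value 1 value', 'valuedescription', 'size 4 size';
```

The single-word line is left untouched because the pattern requires at least one whitespace character before the last word.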

How to extract a number between two different strings

I am new to Perl and I have a situation where I need to extract a number between two different strings.
I have this string variable:
my $var = "1234 23.3\"
How can I extract the number between the white-space and the dot? In the example the output should be 23.
The above var string may vary, so sometimes it may be 123 4.32 or 123 334.4\ in which the output should be 4 or 334 respectively.
White space can be matched using \s backslash sequence, see perlrecharclass:
\s matches any single character considered whitespace
Likewise, a digit can be matched using \d:
\d matches a single character considered to be a decimal digit.
To match a period or dot, beware that the dot is a regex meta character, that will match any character (except newline), see perlre and perlretut, so to match a dot explicitly you should escape it.
Hence, given $var = "1234 23.3", the following statement:
$var =~ /\s+(\d+)\./;
should extract the number after the space and before the dot into capture group variable $1. See perlre for more information on the + metacharacter and also for information about capture groups.
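A short sketch applying that match to the sample values from the question (without the trailing backslash). The helper name int_part is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the digits between whitespace and a dot,
# or undef when the pattern does not match.
sub int_part {
    my ($var) = @_;
    return $var =~ /\s+(\d+)\./ ? $1 : undef;
}

for my $var ('1234 23.3', '123 4.32', '123 334.4') {
    my $n = int_part($var);
    print defined $n ? "$var -> $n\n" : "$var -> no match\n";
}
```

Checking the result of the match before reading $1 matters, because $1 keeps its value from an earlier successful match when the current one fails.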

What does this snippet do in perl?

What does the following do in Perl?
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#\s+# #sg;
I understand that [^a-zA-Z0-9]+ is a start of sentence and at least one of a-zA-Z0-9 and \s+ is at least one whitespace.
But I can not figure out what this snippet does as a whole.
First, it replaces any sequence of non-alphanumeric characters (being neither upper case chars, lower case chars nor numbers) in the string with a single space.
After that it replaces all multi-spaces, i.e. any sequence of whitespaces with just one space character.
The first pattern replaces everything that is not alphanumeric with a space.
The second replaces any run of whitespace characters (spaces, tabs, newlines) with a single space.
Note that you can replace these two statements with a single one:
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#\s+# #sg;
is more commonly written as
$string =~ s/[^a-zA-Z0-9]+/ /sg;
$string =~ s/\s+/ /sg;
The choice of delimiter isn't significant, but / is used by convention unless the pattern contains many / characters.
Here we have two instances of the substitution operator. Between the first two delimiters is a regular expression pattern to search for. Between the last two delimiters is the string with which to replace the matching text. The trailing s and g are flags.
The s flag affects what . matches. Given that . isn't used, the s flag is useless.
The g flag causes all matches to be replaced instead of just the first one.
The first regex pattern, [^a-zA-Z0-9]+
[...] is a character class that matches a single character among those specified. A leading ^ negates the class, so [^a-zA-Z0-9] matches any character other than unaccented latin letters and numbers.
atom+ matches atom one or more times, so [^a-zA-Z0-9]+ matches a sequence of non-alphanumeric characters (and some alphanumeric characters such as "é").
Therefore, s/[^a-zA-Z0-9]+/ /g replaces all sequences of non-alphanumeric characters (and some alphanumeric characters such as "é") with a single space. For example, "abc - déf :)" becomes "abc d f ".
The second regex pattern, \s+
\s matches any whitespace character (except the vertical tab and the non-breaking space sometimes).
Therefore, s/\s+/ /g replaces all sequences of white space with a single space. For example, "abc\tdef ghi\n" becomes "abc def ghi ".
As a whole
When used together, the second statement does absolutely nothing. There will never be any sequences of two or more whitespace characters left in $string after the first statement.
So
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#\s+# #sg;
is the same as
$string =~ s/[^a-zA-Z0-9]+/ /g;
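A quick way to convince yourself of this is to compare both versions on a few inputs. The subs two_step and one_step and the sample strings are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: the original two statements.
sub two_step {
    my ($string) = @_;
    $string =~ s/[^a-zA-Z0-9]+/ /g;
    $string =~ s/\s+/ /g;
    return $string;
}

# Hypothetical helper: only the first statement.
sub one_step {
    my ($string) = @_;
    $string =~ s/[^a-zA-Z0-9]+/ /g;
    return $string;
}

for my $in ("abc - def :)", "a\t\tb\n\nc", "  x  ") {
    print two_step($in) eq one_step($in) ? "same\n" : "different\n";
}
```

Whitespace is itself non-alphanumeric, so the first substitution has already collapsed every run of it into one space before the second one runs.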

What do these Perl regexes mean?

What does the following syntax mean in Perl?
$line =~ /([^:]+):/;
and
$line =~ s/([^:]+):/$replace/;
See perldoc perlreref
[^:]
is a character class that matches any character other than ':'.
[^:]+
means match one or more of such characters.
I am not sure the capturing parentheses are needed. In any case,
([^:]+):
captures a sequence of one or more non-colon characters followed by a colon.
$line =~ /([^:]+):/;
The =~ operator is called the binding operator, it runs a regex or substitution against a scalar value (in this case $line). As for the regex itself, () specify a capture. Captures place the text that matches them in special global variables. These variables are numbered starting from one and correspond to the order the parentheses show up in, so given
"abc" =~ /(.)(.)(.)/;
the $1 variable will contain "a", the $2 variable will contain "b", and the $3 variable will contain "c" (if you haven't guessed yet . matches one character*). [] specifies a character class. Character classes will match one character in them, so /[abc]/ will match one character if it is "a", "b", or "c". Character classes can be negated by starting them with ^. A negated character class matches one character that is not listed in it, so [^abc] will match one character that is not "a", "b", or "c" (for instance, "d" will match). The + is called a quantifier. Quantifiers tell you how many times the preceding pattern must match. + requires the pattern to match one or more times. (the * quantifier requires the pattern to match zero or more times). The : has no special meaning to the regex engine, so it just means a literal :.
So, putting that information together we can see that the regex will match one or more non-colon characters (saving this part to $1) followed by a colon.
$line =~ s/([^:]+):/$replace/;
This is a substitution. Substitutions have two parts, the regex, and the replacement string. The regex part follows all of the same rules as normal regexes. The replacement part is treated like a double quoted string. The substitution replaces whatever matches the regex with the replacement, so given the following code
my $line = "key: value";
my $replace = "option";
$line =~ s/([^:]+):/$replace/;
The $line variable will hold the string "option value".
You may find it useful to read perldoc perlretut.
* except newline, unless the /s option is used, in which case it matches any character
The first one captures the part in front of a colon from a line, such as "abc" in the string "abc:foo". More precisely it matches at least one non-colon character (though as many as possible) directly before a colon and puts them into a capture group.
The second one substitutes said part, this time including the colon, with the contents of the variable $replace.
I may be misunderstanding some of the previous answers, but I think that there's a confusion about the second example. It will not replace only the captured item (i.e., one or more non-colons up until a colon) with $replace. It will replace all of ([^:]+): with $replace - the colon as well. (The substitution operates on the match, not just the capture.)
This means if you don't include a colon in $replace (and you want one), you will get bit:
my $line = 'http://www.example.com/';
my $replace = 'ftp';
$line =~ s/([^:]+):/$replace/;
print "Here's \$line now: $line\n";
Output:
Here's $line now: ftp//www.example.com/ # Damn, no colon!
I'm not sure if you are just looking at example code, but unless you plan to use the capture, I'm not sure you really want it in these examples.
If you are very unfamiliar with regular expressions (or Perl), you should look at perldoc perlrequick before trying perldoc perlre or perldoc perlretut.
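If the colon should survive the substitution, two possible approaches are sketched below: put it back in the replacement, or match it with a lookahead so it is never part of the match. The variable names $whole, $put_back and $lookahead are made up for this example.

```perl
use strict;
use warnings;

my $replace = 'ftp';

# Replaces the whole match, colon included.
(my $whole = 'http://www.example.com/') =~ s/([^:]+):/$replace/;

# Option 1: put the colon back in the replacement string.
(my $put_back = 'http://www.example.com/') =~ s/[^:]+:/${replace}:/;

# Option 2: assert the colon with a lookahead so it is not consumed.
(my $lookahead = 'http://www.example.com/') =~ s/[^:]+(?=:)/$replace/;

print "$whole\n$put_back\n$lookahead\n";
```

Neither variant needs the capturing parentheses, since $1 is not used in the replacement.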
The first one matches one or more characters that are anything but :, followed by a :. The second one matches the same thing but replaces the match with $replace.
perl -MYAPE::Regex::Explain -e "print YAPE::Regex::Explain->new('([^:]+):')->explain"
The regular expression:
(?-imsx:([^:]+):)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [^:]+                    any character except: ':' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
$line =~ /([^:]+):/;
Matches a run of characters that does not contain :, followed by a :
If $line = "http://www.google.com", it will match http: (the variable $1 will contain http)
$line =~ s/([^:]+):/$replace/;
This time, the matched text (including the colon) is replaced with the content of the variable $replace