I'm trying to match a string such that the leftmost symbol and the rightmost symbol are the same. How do I do that?
It’s impossible to know exactly what you mean without clarification of what you consider a “symbol”, but here is one possible solution:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?= \p{Grapheme_Base} ) \X ) .* \1 \s* \z /sx;
and here is another:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?= \p{Symbol} ) \X ) .* \1 \s* \z /sx;
and here is one more:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?: (?= \p{Symbol} ) \X )+ ) .* \1 \s* \z /sx;
And it is even possible that, in some very limited circumstances, you might be able to get away with:
$string =~ / ^ (\pS) .* \1 $ /xs;
But if you do, it’s also likely that someday you’re going to wish you had been more careful.
$string =~ m/^(.).*\1$/
should work. This fails to match strings of length 1, though.
Why do you want to do this with a regex? Is it homework? I avoid regexes for trivial patterns like this.
use Unicode::Normalize qw(NFC);
$s = NFC( $s );
substr( $s, 0, 1 ) eq substr( $s, -1, 1 );
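To see why the NFC step matters, here is a minimal sketch that wraps the comparison above in a helper. The name first_last_same and the test strings are made up for illustration; the comparison works on code points, not graphemes.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: true (1) if the first and last code points of the
# NFC-normalized string are equal, false (0) otherwise or for "".
sub first_last_same {
    my ($s) = @_;
    $s = NFC($s);
    return 0 if length($s) == 0;
    return substr($s, 0, 1) eq substr($s, -1, 1) ? 1 : 0;
}

# Precomposed e-acute at the start, "e" + combining acute at the end.
# NFC composes the trailing pair into the same single code point.
my $mixed = "\x{e9}abc" . "e\x{301}";

print first_last_same($mixed)  ? "same\n" : "different\n";
print first_last_same("abca")  ? "same\n" : "different\n";
print first_last_same("abcd")  ? "same\n" : "different\n";
```

Without the NFC call, the last code point of $mixed would be the bare combining accent and the comparison would fail.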
Because Tom will complain about characters versus graphemes, you can handle that too:
use v5.10.1;
use Unicode::GCString;
use Unicode::Normalize qw(NFC);
my $gcs = Unicode::GCString->new( NFC( $s ) );
$gcs->substr( 0, 1 ) eq $gcs->substr( -1, 1 )
These regexes match strings of length 1 or greater. In the expressions, (.) is a capture group; substitute your own class of symbols for the dot (see the Unicode gurus' answers, although that does not seem to be the intent of the question).
The regex runs in single-line mode (the /s modifier), which lets the dot match newlines as well as anything else (like [\s\S]), so newlines can be embedded in the string as well as being the outermost delimiter.
Without /m, \z behaves like $ except that $ can also match just before a newline at the very end of the string (matching at the end of the string is the more common case). If the character in question is a newline, you use a non-greedy quantifier (like .*?), and the target string is "\nasdf\n\n", $ could falsely match before the final newline. That is a moot issue here since the match is greedy. Still, leave \z in for grins.
/^(?=(.)).*\1\z/s
expanded
/
^ # Beginning of string
(?=(.)) # Lookahead - capture grp1, first (any) character (but don't consume it)
.* # Optionally consume all the characters up until before the last character
\1 # Backreference to capture grp1, this must exist
\z # End of string
/s # s modifier
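On the $ versus \z point above, a small sketch makes the difference visible by reporting where each match ends on the "\nasdf\n\n" target. The helper name match_end is an assumption for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: end offset of the match, or -1 if there is none.
sub match_end {
    my ($re, $str) = @_;
    return $str =~ $re ? $+[0] : -1;
}

my $target = "\nasdf\n\n";    # 7 characters

my $lazy_dollar   = qr/^(?=(.)).*?\1$/s;    # lazy, ends with $
my $lazy_z        = qr/^(?=(.)).*?\1\z/s;   # lazy, ends with \z
my $greedy_dollar = qr/^(?=(.)).*\1$/s;     # greedy, ends with $

printf "lazy with \$    ends at %d\n", match_end($lazy_dollar,   $target);
printf "lazy with \\z   ends at %d\n", match_end($lazy_z,        $target);
printf "greedy with \$  ends at %d\n", match_end($greedy_dollar, $target);
```

The lazy $ variant stops at offset 6, one short of the full string, because $ is satisfied just before the final newline; the other two reach offset 7.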
Example stipulating just word class characters
/^(?=(\w)).*\1\z/s
Again, just substitute your acceptable symbols
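As a quick check, this sketch compares the lookahead form with the simpler consuming form m/^(.).*\1$/ on a few inputs. The sub names same_ends and same_ends_consuming are invented for the example.

```perl
use strict;
use warnings;

# Lookahead form: the first character is captured but not consumed,
# so a one-character string can satisfy both (.) and \1.
sub same_ends {
    my ($s) = @_;
    return $s =~ /^(?=(.)).*\1\z/s ? 1 : 0;
}

# Consuming form: (.) eats the first character, so \1 needs a second one.
sub same_ends_consuming {
    my ($s) = @_;
    return $s =~ /^(.).*\1$/ ? 1 : 0;
}

for my $s ("a", "abca", "abcb", "\nfoo\n") {
    (my $shown = $s) =~ s/\n/\\n/g;    # make newlines visible when printing
    printf "%-10s lookahead=%d consuming=%d\n",
        "'$shown'", same_ends($s), same_ends_consuming($s);
}
```

The two forms agree on "abca" and "abcb" and differ on the single character "a", which only the lookahead form accepts.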
I have an SQL Select dump with many lines each looks like this:
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',,,,'text',,,0,0,
I want to do 2 things to each line:
Replace all dates with Oracle's sysdate function. Dates can also come without hour (like 07/11/2011).
Replace all null values with null string
Here's my attempt:
$_ =~ s/,(,|\n)/,null$1/g; # Replace no data by "null"
$_ =~ s/\d{2}\/\d{2}\/d{4}.*?,/sysdate,/g; # Replace dates by "sysdate"
But this would transform the string to:
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',null,,null,'text',null,,0,0,null
while I expect it to be
sysdate,sysdate,'YD','MANUAL',0,1,'text','text','text','text',null,null,null,'text',null,null,0,0,null
I don't understand why dates do not match and why some ,, are not replaced by null.
Any insights welcome, thanks in advance.
\d{2}\/\d{2}\/d{4}.*?, didn't work because the last d wasn't escaped.
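The other symptom, some ,, pairs being left alone, comes from the first substitution consuming the second comma, so that comma cannot start the next match. A small sketch of the difference follows; the sub names are made up, and the lookahead variant is just one way to avoid consuming.

```perl
use strict;
use warnings;

# The question's substitution: the second comma (or newline) is consumed.
sub fill_consuming {
    my ($row) = @_;
    $row =~ s/,(,|\n)/,null$1/g;
    return $row;
}

# Same idea, but the following comma/newline is only looked at,
# so it is still available to begin the next match.
sub fill_lookahead {
    my ($row) = @_;
    $row =~ s/,(?=,|\n)/,null/g;
    return $row;
}

my $row = "a,,,b,\n";
print fill_consuming($row);    # a,null,,b,null
print fill_lookahead($row);    # a,null,null,b,null
```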
If a , can be on either side, or begin/end of string, you could do it in 2 steps:
step 1
s/(?:^|(?<=,))(?=,|\n)/null/g
expanded:
/
(?: ^ # Beginning of line, i.e. nothing behind us
| (?<=,) # Or, a comma behind us
)
# we are HERE!, this is the place between characters
(?= , # A comma in front of us
| \n # Or, a newline in front of us
)
/null/g
# The above regex does not consume, it just inserts 'null', leaving the
# same search position (after the insertion, but before the comma).
# If you want to consume a comma, it would be done this way:
s/(?:^|(?<=,))(,|\n)/null$1/xg
# Now the search position is after the 'null,'
step 2
s/(?:^|(?<=,))\d{2}\/\d{2}\/\d{4}.*?(?=,|\n)/sysdate/g
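Put together, the two steps can be tried on the row from the question. This is only a sketch; the wrapper name two_step is an assumption.

```perl
use strict;
use warnings;

# Hypothetical wrapper applying step 1 (empty fields) and step 2 (dates).
sub two_step {
    my ($row) = @_;
    $row =~ s/(?:^|(?<=,))(?=,|\n)/null/g;
    $row =~ s/(?:^|(?<=,))\d{2}\/\d{2}\/\d{4}.*?(?=,|\n)/sysdate/g;
    return $row;
}

my $row = "07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,"
        . "'text','text','text','text',,,,'text',,,0,0,\n";

print two_step($row);
```

Because neither substitution consumes a comma, consecutive empty fields are all filled and the field separators are left untouched.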
Or, you could combine them into a single regex, using the eval modifier:
$row =~ s/(?:^|(?<=,))(\d{2}\/\d{2}\/\d{4}.*?|)(?=,|\n)/ length $1 ? 'sysdate' : 'null'/eg;
Broken down it looks like this
s{
(?: ^ | (?<=,) ) # begin of line or comma behind us
( # Capt group $1
\d{2}/\d{2}/\d{4}.*? # date format and optional non-newline chars
| # Or, nothing at all
) # End Capt group 1
(?= , | \n ) # comma or newline in front of us
}{
length $1 ? 'sysdate' : 'null'
}xeg
If there is a chance of non-newline whitespace padding, it could be written as:
$row =~ s/(?:^|(?<=,))(?:([^\S\n]*\d{2}\/\d{2}\/\d{4}.*?)|[^\S\n]*)(?=,|\n)/ defined $1 ? 'sysdate' : 'null'/eg;
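A short sketch of the whitespace-tolerant one-liner wrapped in a sub, run against a padded row and a date without a time. The sub name to_insert_values and the sample rows are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the combined substitution: $1 is defined
# only when the date alternative matched, otherwise the field was empty
# (or whitespace only).
sub to_insert_values {
    my ($row) = @_;
    $row =~ s/(?:^|(?<=,))(?:([^\S\n]*\d{2}\/\d{2}\/\d{4}.*?)|[^\S\n]*)(?=,|\n)/ defined $1 ? 'sysdate' : 'null'/eg;
    return $row;
}

print to_insert_values("07/11/2011, ,'x',,\n");
print to_insert_values(" 07/11/2011 16:48:08 ,0,\n");
```

Note that any padding around a date or inside an empty field is swallowed by the replacement, which is usually what you want for generated SQL values.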
You could do this:
$ cat perlregex.pl
use warnings;
use strict;
my $row = "07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',,,,'text',,,0,0,\n";
print( "$row\n" );
while ( $row =~ /,([,\n])/ ) { $row =~ s/,([,\n])/,null$1/; }
print( "$row\n" );
$row =~ s/\d{2}\/\d{2}\/\d{4}.*?,/sysdate,/g;
print( "$row\n" );
Which results in this:
$ ./perlregex.pl
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',,,,'text',,,0,0,
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',null,null,null,'text',null,null,0,0,null
sysdate,sysdate,'YD','MANUAL',0,1,'text','text','text','text',null,null,null,'text',null,null,0,0,null
This could certainly be optimized, but it gets the point across.
You want to replace something. Usually lookarounds (lookbehind and lookahead) are a better option for this:
$subject =~ s/(?<=,)(?=,|$)/null/g;
Explanation:
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
, # Match the character “,” literally
)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
, # Match the character “,” literally
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
Secondly, you wish to replace the dates:
$subject =~ s!\d{2}/\d{2}/\d{4}.*?(?=,)!sysdate!g;
That's almost the same as your original regex, with the final \d escaped and the trailing , replaced by a lookahead. (If you don't want to replace something, don't match it.)
# \d{2}/\d{2}/\d{4}.*?(?=,)
#
# Match a single digit 0..9 «\d{2}»
# Exactly 2 times «{2}»
# Match the character “/” literally «/»
# Match a single digit 0..9 «\d{2}»
# Exactly 2 times «{2}»
# Match the character “/” literally «/»
# Match a single digit 0..9 «\d{4}»
# Exactly 4 times «{4}»
# Match any single character that is not a line break character «.*?»
# Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=,)»
# Match the character “,” literally «,»
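For reference, here is a minimal sketch that applies both substitutions from this answer to one row. The sub name convert_row is invented, and the sample rows are assumptions.

```perl
use strict;
use warnings;

# Hypothetical wrapper: fill empty fields first, then replace dates.
sub convert_row {
    my ($subject) = @_;
    $subject =~ s/(?<=,)(?=,|$)/null/g;
    $subject =~ s!\d{2}/\d{2}/\d{4}.*?(?=,)!sysdate!g;
    return $subject;
}

print convert_row("07/11/2011,07/11/2011 16:48:08,'x',,0,\n");

# Limitation of this pair of regexes: a date in the very last field has
# no comma after it, so the (?=,) lookahead never succeeds there.
print convert_row("'x',07/11/2011\n");
```

The second call prints its input unchanged, which shows the limitation noted in the comment; an empty first field is likewise skipped because (?<=,) needs a comma behind it.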
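Maybe .*? matches more than you expect; a negated character class is more explicit. Note that the last \d also needs its backslash, and that [^,]* (rather than [^,]+) lets dates without a time match too. Try:
$_ =~ s/\d{2}\/\d{2}\/\d{4}[^,]*,/sysdate,/g;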
I am editing a Perl file, but I don't understand this regexp comparison. Can someone please explain it to me?
if ($lines =~ m/(.*?):(.*?)$/g) { } ..
What happens here? $lines is a line from a text file.
Break it up into parts:
$lines =~ m/ (.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $1.
: # Match a colon.
(.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $2.
$ # Match the end of the line.
/gx;
So, this will match strings like ":" (it matches zero characters, then a colon, then zero characters before the end of the line, $1 and $2 are empty strings), or "abc:" ($1 = "abc", $2 is an empty string), or "abc:def:ghi" ($1 = "abc" and $2 = "def:ghi").
And if you pass in a line that doesn't match (which here means the string does not contain a colon), then the code within the braces is skipped. But if it does match, then the code within the braces can use the special $1 and $2 variables (at least until the next successful pattern match, if there is one within the braces).
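The cases described above can be checked with a small sketch. The helper name capture_pair is an assumption, and the /g flag is left off because it is not needed for a single match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns ($1, $2) on a match, an empty list otherwise.
sub capture_pair {
    my ($lines) = @_;
    if ($lines =~ m/(.*?):(.*?)$/) {
        my ($left, $right) = ($1, $2);    # copy before any other match runs
        return ($left, $right);
    }
    return;
}

for my $input (":", "abc:", "abc:def:ghi", "abc:def\n", "no colon here") {
    my @pair = capture_pair($input);
    (my $shown = $input) =~ s/\n/\\n/g;
    if (@pair) {
        print "'$shown' -> \$1='$pair[0]' \$2='$pair[1]'\n";
    }
    else {
        print "'$shown' -> no match\n";
    }
}
```

For "abc:def\n" the second capture is "def" without the newline, since $ is allowed to match just before a trailing newline.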
There is a tool to help understand regexes: YAPE::Regex::Explain.
Ignoring the g modifier, which is not needed here:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(.*?):(.*?)$/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(.*?):(.*?)$)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
See also perldoc perlre.
It was written by someone who either knows too much about regular expressions or not enough about the $' and $` variables.
This could have been written as
if ($lines =~ /:/) {
... # use $` ($PREMATCH) instead of $1
... # use $' ($POSTMATCH) instead of $2
}
or
if ( ($var1,$var2) = split /:/, $lines, 2 and defined($var2) ) {
... # use $var1, $var2 instead of $1,$2
}
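A minimal sketch of the split variant follows; the sub name split_pair is made up. One difference from the regex is worth knowing: split keeps a trailing newline in the second field, so this sketch chomps first.

```perl
use strict;
use warnings;

# Hypothetical helper: split on the first colon only (LIMIT of 2).
# Returns an empty list when the line contains no colon.
sub split_pair {
    my ($line) = @_;
    chomp $line;    # the regex's $ ignores a trailing newline; split does not
    my ($key, $value) = split /:/, $line, 2;
    return defined $value ? ($key, $value) : ();
}

for my $input ("abc:def:ghi\n", "name:value", "no colon here\n") {
    my @pair = split_pair($input);
    print @pair ? "key='$pair[0]' value='$pair[1]'\n" : "no colon\n";
}
```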
(.*?) captures any characters, but as few of them as possible.
So it looks for patterns like <something>:<somethingelse><end of line>, and if there are multiple : in the string, the first one will be used as the divider between <something> and <somethingelse>.
That line says to perform a regular expression match on $lines with the regex m/(.*?):(.*?)$/g. It will effectively return true if a match can be found in $lines and false if one cannot be found.
An explanation of the =~ operator:
Binary "=~" binds a scalar expression to a pattern match. Certain operations search or modify the string $_ by default. This operator makes that kind of operation work on some other string. The right argument is a search pattern, substitution, or transliteration. The left argument is what is supposed to be searched, substituted, or transliterated instead of the default $_. When used in scalar context, the return value generally indicates the success of the operation.
The regex itself is:
m/ #Perform a "match" operation
(.*?) #Match zero or more of any character except newline, as few as possible (ungreedy); captured in $1
: #Match a literal colon character
(.*?) #Match zero or more of any character except newline, as few as possible (ungreedy); captured in $2
$ #Match the end of the string (or just before a trailing newline)
/g #Global flag: in this scalar (if) context it finds one match and remembers the position in $lines for the next /g match
So if $lines matches against that regex, it will go into the conditional portion, otherwise it will be false and will skip it.
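One side effect of the /g flag in an if condition is worth seeing in action: the match position is stored on $lines, so running the same test again on the same variable resumes from there. A small sketch, with a made-up sub name match_twice:

```perl
use strict;
use warnings;

# Hypothetical demo: run the same /g match twice against one variable
# and record what happened each time.
sub match_twice {
    my ($lines) = @_;
    my @results;
    for my $attempt (1 .. 2) {
        if ($lines =~ m/(.*?):(.*?)$/g) {
            push @results, "matched, pos=" . pos($lines);
        }
        else {
            push @results, "no match";
        }
    }
    return @results;
}

print "$_\n" for match_twice("key:value");
```

The first attempt matches the whole string and leaves the position at its end (9); the second attempt starts there, finds no colon, and fails. Dropping /g makes every attempt start from the beginning of the string.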