Regex not matching data and dates

Regex not matching data and dates - regex

I have an SQL Select dump with many lines each looks like this:
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',,,,'text',,,0,0,
I want to do 2 things to each line:
Replace all dates with Oracle's sysdate function. Dates can also come without hour (like 07/11/2011).
Replace all null values with null string
Here's my attempt:
$_ =~ s/,(,|\n)/,null$1/g; # Replace no data by "null"
$_ =~ s/\d{2}\/\d{2}\/d{4}.*?,/sysdate,/g; # Replace dates by "sysdate"
But this would transform the string to:
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',null,,null,'text',null,,0,0,null
while I expect it to be
sysdate,sysdate,'YD','MANUAL',0,1,'text','text','text','text',null,null,null,'text',null,null,0,0,null
I don't understand why dates do not match and why some ,, are not replaced by null.
Any insights welcome, thanks in advance.

\d{2}\/\d{2}\/d{4}.*?, didn't work because the last d wasn't escaped.
If a , can be on either side, or begin/end of string, you could do it in 2 steps:
step 1
s/(?:^|(?<=,))(?=,|\n)/null/g
expanded:
/
(?: ^ # Begining of line, ie: nothing behind us
| (?<=,) # Or, a comma behind us
)
# we are HERE!, this is the place between characters
(?= , # A comma in front of us
| \n # Or, a newline in front of us
)
/null/g
# The above regex does not consume, it just inserts 'null', leaving the
# same search position (after the insertion, but before the comma).
# If you want to consume a comma, it would be done this way:
s/(?:^|(?<=,))(,|\n)/null$1/xg
# Now the search position is after the 'null,'
step 2
s/(?:^|(?<=,))\d{2}\/\d{2}\/\d{4}.*?(?=,|\n)/sysdate/g
Or, you could combine them into a single regex, using the eval modifier:
$row =~ s/(?:^|(?<=,))(\d{2}\/\d{2}\/\d{4}.*?|)(?=,|\n)/ length $1 ? 'sysdate' : 'null'/eg;
Broken down it looks like this
s{
(?: ^ | (?<=,) ) # begin of line or comma behind us
( # Capt group $1
\d{2}/\d{2}/\d{4}.*? # date format and optional non-newline chars
| # Or, nothing at all
) # End Capt group 1
(?= , | \n ) # comma or newline in front of us
}{
length $1 ? 'sysdate' : 'null'
}eg
If there is a chance of non-newline whitespace padding, it could be written as:
$row =~ s/(?:^|(?<=,))(?:([^\S\n]*\d{2}\/\d{2}\/\d{4}.*?)|[^\S\n]*)(?=,|\n)/ defined $1 ? 'sysdate' : 'null'/eg;

You could do this:
$ cat perlregex.pl
use warnings;
use strict;
my $row = "07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',,,,'text',,,0,0,\n";
print( "$row\n" );
while ( $row =~ /,([,\n])/ ) { $row =~ s/,([,\n])/,null$1/; }
print( "$row\n" );
$row =~ s/\d{2}\/\d{2}\/\d{4}.*?,/sysdate,/g;
print( "$row\n" );
Which results in this:
$ ./perlregex.pl
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',,,,'text',,,0,0,
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',null,null,null,'text',null,null,0,0,null
sysdate,sysdate,'YD','MANUAL',0,1,'text','text','text','text',null,null,null,'text',null,null,0,0,null
This could certainly be optimized, but it gets the point across.

You want to replace something. Usually lookaheads are a better option for this :
$subject =~ s/(?<=,)(?=,|$)/null/g;
Explanation :
"
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
, # Match the character “,” literally
)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
, # Match the character “,” literally
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
\$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
"
Secodnly you wish to replace the dates :
$subject =~ s!\d{2}/\d{2}/\d{4}.*?(?=,)!sysdate!g;
That's almost the same with your original regex. Just replace the last , with lookahead. (If you don't want to replace it , don't match it.)
# \d{2}/\d{2}/\d{4}.*?(?=,)
#
# Match a single digit 0..9 «\d{2}»
# Exactly 2 times «{2}»
# Match the character “/” literally «/»
# Match a single digit 0..9 «\d{2}»
# Exactly 2 times «{2}»
# Match the character “/” literally «/»
# Match a single digit 0..9 «\d{4}»
# Exactly 4 times «{4}»
# Match any single character that is not a line break character «.*?»
# Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=,)»
# Match the character “,” literally «,»

Maybe .*? is too greedy, try:
$_ =~ s/\d{2}\/\d{2}\/d{4}[^,]+,/sysdate,/g;

Related

Perl regex to extract digits from string with parenthesis

I have the following string:
my $string = "Ethernet FlexNIC (NIC 1) LOM1:1-a FC:15:B4:13:6A:A8";
I want to extract the number that is in brackets (1) in another variable.
The following statement does not work:
my ($NAdapter) = $string =~ /\((\d+)\)/;
What is the correct syntax?

\d+(?=[^(]*\))
You can use this.See demo.Yours will not work as inside () there is more data besides \d+.
https://regex101.com/r/fM9lY3/57

You could try something like
my ($NAdapter) = $string =~ /\(.*(\d+).*\)/;
After that, $NAdapter should include the number that you want.

my $string = "Ethernet FlexNIC (NIC 1) LOM1:1-a FC:15:B4:13:6A:A8";
I want to extract the number that is in brackets (1) in another
variable
Your regex (with some spaces for clarity):
/ \( (\d+) \) /x;
says to match:
A literal opening parenthesis, immediately followed by...
A digit, one or more times (captured in group 1), immediately followed by...
A literal closing parenthesis.
Yet, the substring you want to match:
(NIC 1)
is of the form:
A literal opening parenthesis, immediately followed by...
Some capital letters
STOP EVERYTHING! NO MATCH!
As an alternative, your substring:
(NIC 1)
could be described as:
Some digits, immediately followed by...
A literal closing parenthesis.
Here's the regex:
use strict;
use warnings;
use 5.020;
my $string = "Ethernet FlexNIC (NIC 1234) LOM1:1-a FC:15:B4:13:6A:A8";
my ($match) = $string =~ /
(\d+) #Match any digit, one or more times, captured in group 1, followed by...
\) #a literal closing parenthesis.
#Parentheses have a special meaning in a regex--they create a capture
#group--so if you want to match a parenthesis in your string, you
#have to escape the parenthesis in your regex with a backslash.
/xms; #Standard flags that some people apply to every regex.
say $match;
--output:--
1234
Another description of your substring:
(NIC 1)
could be:
A literal opening parenthesis, immediately followed by...
Some non-digits, immediately followed by...
Some digits, immediately followed by..
A literal closing parenthesis.
Here's the regex:
use strict;
use warnings;
use 5.020;
my $string = "Ethernet FlexNIC (ABC NIC789) LOM1:1-a FC:15:B4:13:6A:A8";
my ($match) = $string =~ /
\( #Match a literal opening parethesis, followed by...
\D+ #a non-digit, one or more times, followed by...
(\d+) #a digit, one or more times, captured in group 1, followed by...
\) #a literal closing parentheses.
/xms; #Standard flags that some people apply to every regex.
say $match;
--output:--
789
If there might be spaces on some lines and not others, such as:
spaces
||
VV
(NIC 1 )
(NIC 2)
You can insert a \s* (any whitespace, zero or more times) in the appropriate place in the regex, for instance:
my ($match) = $string =~ /
#Parentheses have special meaning in a regex--they create a capture
#group--so if you want to match a parenthesis in your string, you
#have to escape the parenthesis in your regex with a backslash.
\( #Match a literal opening parethesis, followed by...
\D+ #a non-digit, one or more times, followed by...
(\d+) #a digit, one or more times, captured in group 1, followed by...
\s* #any whitespace, zero or more times, followed by...
\) #a literal closing parentheses.
/xms; #Standard flags that some people apply to every regex.

Matching first letter of word

I want to match the first letter of a word in one string to another with the similar letter. In this example the letter H:
25HB matches to HC
I am using the match operator shown below:
my ($match) = ( $value =~ m/^d(\w)/ );
to not match the digit, but the first matching word character. How could I correct this?

That regex doesn't do what you think it does:
m/^d(\w)/
Matches 'start of line' - letter d then a single word character.
You may want:
m/^\d+(\w)/
Which will then match one or more digits from the start of line, and grab the first word character after that.
E.g.:
my $string = '25HC';
my ( $match ) =( $string =~ m/^\d+(\w)/ );
print $match,"\n";
Prints H

You are not clear about what you want. If you want to match the first letter in a string to the same letter later in the string:
m{
( # start a capture
[[:alpha:]] # match a single letter
) # end of capture
.*? # skip minimum number of any character
\1 # match the captured letter
}msx; # /m means multilines, /s means . matches newlines, /x means ignore whitespace in pattern
See perldoc perlre for more details.
Addendum:
If by word, you mean any alphanumeric sequence, this may be closer to what you want:
m{
\b # match a word boundary (start or end of a word)
\d* # greedy match any digits
( # start a capture
[[:alpha:]] # match a single letter
) # end of capture
.*? # skip minimum number of any character
\b # match a word boundary (start or end of a word)
\d* # greedy match any digits
\1 # match the captured letter
}msx; # /m means multilines, /s means . matches newlines, /x means ignore whitespace in pattern

You could try ^.*?([A-Za-z]).
The following code returns:
ITEM: 22hb
MATCH: h
ITEM: 33HB
MATCH: H
ITEM: 3333
MATCH:
ITEM: 43 H
MATCH: H
ITEM: HB33
MATCH: H
Script.
#!/usr/bin/perl
my #array = ('22hb','33HB','3333','43 H','HB33');
for my $item (#array) {
my $match = $1 if $item =~ /^.*?([A-Za-z])/;
print "ITEM: $item \nMATCH: $match\n\n";
}

I believe this is what you are looking for:
(If you can provide more clear example of what you are looking for we may be able to help you better)
The following code takes two strings and finds the first non-digit character common in both the strings:
my $string1 = '25HB';
my $string2 = 'HC';
#strip all digits
$string1 =~ s/\d//g;
foreach my $alpha (split //, $string1) {
# for each non-digit check if we find a match
if ($string2 =~ /$alpha/) {
print "First matching non-numeric character: $alpha\n";
exit;
}
}

Suppress lines with more than 3 words in UpperCase in Perl - RegExp

I need to create a Perl script in order to suppress all lines of more than 3 words in UpperCase (each word is separated by a space).
For now, I have delete all sentences in UpperCase like that:
while(my $text = <IN>)
{
$text =~ s/(^[A-Z \d\W]+$)\n//g;
}

Utilize perlfaq4 - How can I count the number of occurrences of a substring within a string? to count the number of matches of your pattern.
Then apply a filter:
use strict;
use warnings;
while (<DATA>) {
my $uc_words = () = /\b[A-Z]{2,}\b/g;
print if $uc_words < 3;
}
__DATA__
FIRST lower SECOND
FIRST lower SECOND and THIRD and end
FIRST and SECOND and just an I, is that enough?
Filter me because of FIRST, SECOND, THIRD, and FOURTH.
Just First Letter Capitalized Is Cool, Right?
Outputs:
FIRST lower SECOND
FIRST and SECOND and just an I, is that enough?
Just First Letter Capitalized Is Cool, Right?

use this pattern
^(?=(.*\b[A-Z]+\b){3}).*(\n|$)
Demo
^ # Start of string/line
(?= # Look-Ahead
( # Capturing Group (1)
. # Any character except line break
* # (zero or more)(greedy)
\b # <word boundary>
[A-Z] # Character Class [A-Z]
+ # (one or more)(greedy)
\b # <word boundary>
) # End of Capturing Group (1)
{3} # (repeated {3} times)
) # End of Look-Ahead
. # Any character except line break
* # (zero or more)(greedy)
( # Capturing Group (2)
\n # <new line>
| # OR
$ # End of string/line
) # End of Capturing Group (2)

Matching the rightmost and the leftmost symbols in perl with regular expression

I'm trying to match a string such that the leftmost symbol and the rightmost symbol are the same. How do I do that?

It’s impossible to know exactly what you mean without clarification of what you consider a “symbol”, but here is one possible solution:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?= \p{Grapheme_Base} ) \X ) .* \1 \s* \z /sx;
and here is another:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?= \p{Symbol} ) \X ) .* \1 \s* \z /sx;
and here is one more:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?: (?= \p{Symbol} ) \X )+ ) .* \1 \s* \z /sx;
And it is even possible that you might be able in some very limited circumstances be able to get away with:
$string =~ / ^ (\pS) .* \1 $ /xs;
But if you do, it’s also likely that someday you’re going to wish you had been more careful.

$string =~ m/^(.).*\1$/
should work. This fails to match strings of length 1, though.

Why do you want to do this with a regex? Is it homework? I avoid regexes for trivial patterns like this.
use Unicode::Normalize qw(NFC);
$s = NFC( $s );
substr( $s, 0, 1 ) eq substr( $s, -1, 1 );
Because Tom will complain about characters versus graphemes, you can handle that too:
use v5.10.1;
use Unicode::GCString;
use Unicode::Normalize qw(NFC);
my $gcs = Unicode::GCString->new( NFC( $s ) );
$gcs->substr( 0, 1 ) eq $gcs->substr( -1, 1 )

These regex's match strings with length 1 and greater. In the expressions, (.) represent a capture group where the dot should be substituted with your class of symbols I guess (see Unicode guru's, although that does not seem to be the intent of the question).
The context of this regex is single line (/s modifier). It allows the dot to match
newlines as well as anything else (like [\s\S]) so newlines can be embedded as well as being the outter most delimeter.
Using \z is the same as $ (in /s mode), except \z corrects a scenario where $ could match before a newline (matches at the end of string is more commona). If the character in question is a newline and you use un-greedy quantifier (like .*?) and the target string is "\nasdf\n\n", it could falsly match before the final newline. But that is a moot issue since the match is all greedy. Still, leave it in for grins.
/^(?=(.)).*\1\z/s
expanded
/
^ # Beginning of string
(?=(.)) # Lookahead - capture grp1, first (any) character (but don't consume it)
.* # Optionally consume all the characters up until before the last character
\1 # Backreference to capture grp1, this must exist
\z # End of string
/s # s modifier
Example stipulating just word class characters
/^(?=(\w)).*\1\z/s
Again, just substitute your acceptable symbols

What does this Perl regex mean: m/(.?):(.?)$/g?

I am editing a Perl file, but I don't understand this regexp comparison. Can someone please explain it to me?
if ($lines =~ m/(.*?):(.*?)$/g) { } ..
What happens here? $lines is a line from a text file.

Break it up into parts:
$lines =~ m/ (.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $1.
: # Match a colon.
(.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $2.
$ # Match the end of the line.
/gx;
So, this will match strings like ":" (it matches zero characters, then a colon, then zero characters before the end of the line, $1 and $2 are empty strings), or "abc:" ($1 = "abc", $2 is an empty string), or "abc:def:ghi" ($1 = "abc" and $2 = "def:ghi").
And if you pass in a line that doesn't match (it looks like this would be if the string does not contain a colon), then it won't process the code that's within the brackets. But if it does match, then the code within the brackets can use and process the special $1 and $2 variables (at least, until the next regular expression shows up, if there is one within the brackets).

There is a tool to help understand regexes: YAPE::Regex::Explain.
Ignoring the g modifier, which is not needed here:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(.*?):(.*?)$/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(.*?):(.*?)$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
See also perldoc perlre.

It was written by someone who either knows too much about regular expressions or not enough about the $' and $` variables.
THis could have been written as
if ($lines =~ /:/) {
... # use $` ($PREMATCH) instead of $1
... # use $' ($POSTMATCH) instead of $2
}
or
if ( ($var1,$var2) = split /:/, $lines, 2 and defined($var2) ) {
... # use $var1, $var2 instead of $1,$2
}

(.*?) captures any characters, but as few of them as possible.
So it looks for patterns like <something>:<somethingelse><end of line>, and if there are multiple : in the string, the first one will be used as the divider between <something> and <somethingelse>.

That line says to perform a regular expression match on $lines with the regex m/(.*?):(.*?)$/g. It will effectively return true if a match can be found in $lines and false if one cannot be found.
An explanation of the =~ operator:
Binary "=~" binds a scalar expression
to a pattern match. Certain operations
search or modify the string $_ by
default. This operator makes that kind
of operation work on some other
string. The right argument is a search
pattern, substitution, or
transliteration. The left argument is
what is supposed to be searched,
substituted, or transliterated instead
of the default $_. When used in scalar
context, the return value generally
indicates the success of the
operation.
The regex itself is:
m/ #Perform a "match" operation
(.*?) #Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
: #Match a literal colon character
(.*?) #Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
$ #Match the end of string
/g #Perform the regex globally (find all occurrences in $line)
So if $lines matches against that regex, it will go into the conditional portion, otherwise it will be false and will skip it.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex not matching data and dates - regex

Maybe .*? is too greedy, try: $_ =~ s/\d{2}\/\d{2}\/d{4}[^,]+,/sysdate,/g;

Related

Perl regex to extract digits from string with parenthesis

Matching first letter of word

Suppress lines with more than 3 words in UpperCase in Perl - RegExp

Matching the rightmost and the leftmost symbols in perl with regular expression

What does this Perl regex mean: m/(.?):(.?)$/g?

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regex not matching data and dates - regex

Maybe .*? is too greedy, try: $_ =~ s/\d{2}\/\d{2}\/d{4}[^,]+,/sysdate,/g;

Related

Perl regex to extract digits from string with parenthesis

Matching first letter of word

Suppress lines with more than 3 words in UpperCase in Perl - RegExp

Matching the rightmost and the leftmost symbols in perl with regular expression

What does this Perl regex mean: m/(.*?):(.*?)$/g?

Categories

Resources

What does this Perl regex mean: m/(.?):(.?)$/g?