Perl regex to match up to but not including optional end-token

I am trying to efficiently match lines up to but not including an optional end-token.
/(.*)(?:$tok)?/
doesn't work. The end-token is optional, hence the final ?, but then the first group
greedily captures it.
/(.*?)(?:$tok)?/
also doesn't work: the first group matches a zero-length string.
The best I can do so far is
my $tok = 'end';
while (<>) {
my ($line) = /
(?| # 'branch reset'
(.*)$tok # either a line terminated with the end token
| # or
(.*) # the whole line
) # end branch reset group
/x;
print $line, "\n";
}
This works, but strikes me as inefficient. The regex engine has to parse the line twice, which is what I was trying to avoid.
I'm aware the problem as stated would be better solved with index():
my $i = index($_, $tok);
$line = $i < 0 ? $_ : substr $_, 0, $i;
but I need to do other processing of the line making a regex desirable - and in any event, I see this as a learning opportunity ;-)

Take a look at the following example. It captures everything up to the word great, or up to the end of the line ($) if great does not occur.
my $str = 'alexander the great alex';
if ($str =~ m/(.*?)(?=great|$)/i) {
print "$1";
}
Replace great in the example above with your $tok.
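As a rough sketch of how the lookahead idea could be dropped into the original loop: the helper name before_token and the sample lines are made up for illustration, and \Q...\E is added on the assumption that the token should be matched literally rather than as a pattern.

```perl
use strict;
use warnings;

my $tok = 'end';    # end-token from the question

# Return the part of a line before the first $tok,
# or the whole line when $tok does not occur.
sub before_token {
    my ($line) = @_;
    my ($head) = $line =~ /^(.*?)(?=\Q$tok\E|$)/;
    return $head;
}

# Hypothetical sample lines (assumed to be chomped already).
print before_token($_), "\n" for 'keep this end drop this', 'no token here';
```

The lazy .*? stops at the first position where the lookahead succeeds, so the line is scanned once and the token itself never becomes part of the capture.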

This should work -
/^(.*?)(?:(?:\b$tok)?$)/gm

Related

Perl Regex Remove Hyphen but Ignore Specific Hyphenated words

I have a perl regex which converts hyphens to spaces eg:-
$string =~ s/-/ /g;
I need to modify this to ignore specific hyphenated phrases and not replace the hyphen e.g. in a string like this:
"use-either-dvi-d-or-dvi-i"
I wish to NOT replace the hyphen in dvi-d and dvi-i so it reads:
"use either dvi-d or dvi-i"
I have tried various negative look ahead matches but failed miserably.
You can use this PCRE regex with verbs (*SKIP)(*F) to skip certain words from your match:
dvi-[id](*SKIP)(*F)|-
This will skip words dvi-i and dvi-d for splitting due to use of (*SKIP)(*F).
For your code:
$string =~ s/dvi-[id](*SKIP)(*F)|-/ /g;
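If there are several phrases to protect, the same (*SKIP)(*F) idea can be driven from a list. This is only a sketch: the @keep list and the dehyphenate helper are hypothetical names, and quotemeta is used on the assumption that the phrases are literal strings.

```perl
use strict;
use warnings;

# Hypothetical list of hyphenated phrases that must keep their hyphen.
my @keep    = ('dvi-d', 'dvi-i');
my $keep_re = join '|', map { quotemeta } @keep;

# Replace every hyphen with a space, except inside a protected phrase.
sub dehyphenate {
    my ($s) = @_;
    # A protected phrase is matched first, then (*SKIP)(*F) discards it
    # and resumes the search after it; any other hyphen is replaced.
    $s =~ s/(?:$keep_re)(*SKIP)(*F)|-/ /g;
    return $s;
}

print dehyphenate('use-either-dvi-d-or-dvi-i'), "\n";
```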
There is an alternate lookarounds based solution as well:
/(?<!dvi)-|-(?![di])/
Which basically means match hyphen if it is not preceded by dvi OR if it is not followed by d or i, thus making sure to not match - when we have dvi on LHS and [di] on RHS.
Perl code:
$string =~ s/(?<!dvi)-|-(?![di])/ /g;
$string =~ s/(?<!dvi)-(?![id])|(?<=dvi)-(?![id])|(?<!dvi)-(?=[id])/ /g;
Using just (?<!dvi)-(?![id]) would also leave the hyphen in place in strings such as dvi-x or x-i, where x can be any character.
It is unlikely that you could get a simple and straightforward regex solution to this. However, you could try the following:
#!/usr/bin/env perl
use strict;
use warnings;
my %whitelist = map { $_ => 1 } qw( dvi-d dvi-i );
my $string = 'use-either-dvi-d-or-dvi-i';
while ( $string =~ m{ ( [^-]+ ) ( - ) ( [^-]+ ) }gx ) {
my $segment = substr($string, $-[0], $+[0] - $-[0]);
unless ( $whitelist{ $segment } ) {
substr( $string, $-[2], 1, ' ');
}
pos( $string ) = $-[ 3 ];
}
print $string, "\n";
The @- array contains the starting offsets of matched groups, and the @+ array contains the end offsets. In both cases, element 0 refers to the whole match.
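To make the offset arrays concrete, here is a small sketch that lists the start and end offsets for each group of a match. The helper name group_offsets is invented for this example.

```perl
use strict;
use warnings;

# Return [start, end] offset pairs: element 0 is the whole match,
# element N is capture group N. Returns an empty list on no match.
sub group_offsets {
    my ($string, $re) = @_;
    return unless $string =~ $re;
    return map { [ $-[$_], $+[$_] ] } 0 .. $#-;
}

my $string = 'use-either';
for my $pair (group_offsets($string, qr/([^-]+)(-)([^-]+)/)) {
    my ($start, $end) = @$pair;
    printf "%d-%d: %s\n", $start, $end, substr($string, $start, $end - $start);
}
```

For this input the hyphen (group 2) spans offsets 3 to 4, which is the position the program above overwrites with a space.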
I had to resort to something like this because of how \G works:
Note also that s/// will refuse to overwrite part of a substitution that has already been replaced; so for example this will stop after the first iteration, rather than iterating its way backwards through the string:
$_ = "123456789";
pos = 6;
s/.(?=.\G)/X/g;
print; # prints 1234X6789, not XXXXX6789
Maybe @tchrist can figure out how to bend various assertions to his will.
We can ignore specific words using negative look-ahead and negative look-behind assertions.
Example:
(?!pattern)
is a negative look-ahead assertion, and (?<!pattern) is a negative look-behind assertion.
In your case the substitution is
$string =~ s/(?<!dvi)-(?<![id])/ /g;
output :
use either dvi-d or dvi-i
Reference : http://www.perlmonks.org/?node_id=518444
Hope this will help you.

Perl get string between 2 patterns

I have a log file that has this pattern several times:
Toggle('AFDACAAAAAIAAAA')" class="Failure">ABC</a>
I have this code line to get the entire line that has this pattern
but I’m only interested in the ABC string ( that may be any other string )
print "$line" if $line =~ /Toggle\('[A-Z]*'\)" class="Failure">.*<\/a>/g;
Can I do that with regex ?
If you don't want to use a capture group you can use a look ahead and the \K anchor:
print $& if $line =~ /Toggle\('[A-Z]*'\)" class="Failure">\K.*(?=<\/a>)/;
\K will basically throw away everything that has been matched before it - but it still has to match.
Capture the bit that you want. It'll be in $1. (Oh, and you don't need the /g.)
print $1 if $line =~ /Toggle\('[A-Z]*'\)" class="Failure">(.*)<\/a>/;
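Since the question says the pattern occurs several times, the capture-group version can also be run with /g in list context to collect every occurrence. This is a sketch only: the helper name failure_names and the two-occurrence sample line are made up, and a lazy .*? replaces .* so that each capture stops at the nearest closing tag.

```perl
use strict;
use warnings;

# Return every link text that follows the Toggle(...)" class="Failure"> marker.
sub failure_names {
    my ($text) = @_;
    # In list context, /g returns the capture from each successive match.
    return $text =~ /Toggle\('[A-Z]*'\)" class="Failure">(.*?)<\/a>/g;
}

# Made-up sample with two occurrences on one line.
my $line = q{Toggle('AFDACAAAAAIAAAA')" class="Failure">ABC</a> Toggle('AB')" class="Failure">XYZ</a>};
print "$_\n" for failure_names($line);
```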

Regex greedyness REasking

I have this text $line = "config.txt.1", and I want to match it with regex and extract the number
part of it. I am using two versions:
$line = "config.txt.1";
(my $result) = $line =~ /(\d*).*/; #ver 1, matched, but returns nothing
(my $result) = $line =~ /(\d).*/; #ver 2, matched, returns 1
(my $result) = $line =~ /(\d+).*/; #ver 3, matched, returns 1
I think the * is messing things up. I have been looking at this, but I still
don't understand the greedy mechanism in the regex engine. If I start from the left of the regex, there might be no digits in the text, so ver 1 will match too. But for
ver 3, it won't match. Can someone explain why that is, and how
I should write what I want? (potentially with a number, not necessarily a single digit)
Edit
Requirement: the text may or may not contain a number, which is not necessarily a single digit; the match may capture nothing, but it must not fail
The output must be as follows (for the above example):
config.txt 1
The regex /(\d*).*/ always matches immediately, because it can match zero characters. It translates to match as many digits at this position as possible (zero or more). Then, match as many non-newline characters as possible. Well, the match starts looking at the c of config. Ok, it matches zero digits.
You probably want to use a regex like /\.(\d+)$/ -- this matches an integer number between a period . and the end of string.
Use the literal '.' as a reference to match before the number:
#!/usr/bin/perl
use strict;
use warnings;
my @line = qw(config.txt file.txt config.txt.1 config.foo.2 config.txt.23 differentname.fsdfsdsdfasd.2444);
my (@capture1, @capture2);
foreach (@line){
my (@filematch) = ($_ =~ /(\w+\.\w+)/);
my (@numbermatch) = ($_ =~ /\w+\.\w+\.?(\d*)/);
my $numbermatch = $numbermatch[0] // $numbermatch[1];
push @capture1, @filematch;
push @capture2, @numbermatch;
}
print "$capture1[$_]\t$capture2[$_]\n" for 0 .. $#capture1;
Output:
config.txt
file.txt
config.txt 1
config.foo 2
config.txt 23
differentname.fsdfsdsdfasd 2444
Thanks guys, I think I figured out myself what I want:
my ($match) = $line =~ /\.(\d+)?/; #this will match and capture any digit
#number if there was one, and not fail
#if there wasn't one
To capture all digits following a final . and not fail the match if the string doesn't end with digits, use /(?:\.(\d+))?$/
perl -E 'if ("abc.123" =~ /(?:\.(\d+))?$/) { say "matched $1" } else { say "match failed" }'
matched 123
perl -E 'if ("abc" =~ /(?:\.(\d+))?$/) { say "matched $1" } else { say "match failed" }'
matched
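Building on the anchored pattern, one way to get both parts of the requested output (base name and optional number) is to add a lazy leading capture. This is a sketch, and split_name is a name chosen for the example. Note that the unanchored /\.(\d+)?/ would match at the first period of config.txt.1 and capture nothing, which is why the pattern here is tied to the end of the string.

```perl
use strict;
use warnings;

# Split a name into its base and an optional trailing ".<digits>" suffix.
# The number is undef when the name does not end in digits.
sub split_name {
    my ($name) = @_;
    my ($base, $number) = $name =~ /^(.*?)(?:\.(\d+))?$/;
    return ($base, $number);
}

for my $name (qw(config.txt.1 config.txt.23 config.txt)) {
    my ($base, $number) = split_name($name);
    print $base, ' ', $number // '', "\n";
}
```

The match always succeeds, because both the suffix group and the digits inside it are optional as a unit.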
You do not need .* at all. These two statements assign the exact same number:
my ($match1) = $str =~ /(\d+).*/;
my ($match2) = $str =~ /(\d+)/;
A regex by default matches partially, you do not need to add wildcards.
The reason your first match does not capture a number is because * can match zero times as well. And since it does not have to match your number, it does not. Which is why .* is actually detrimental in that regex. Unless something is truly optional, you should use + instead.

Substitute first character before match

For each line I need to insert a semicolon immediately before the first alphanumeric character that follows the first semicolon.
Example:
Input:
00000001;Root;;
00000002; Documents;;
00000003; oracle-advanced_plsql.zip;file;
00000004; Public;;
00000005; backup;;
00000006; 20110323-JM-F.7z.001;file;
00000007; 20110426-JM-F.7z.001;file;
00000008; 20110603-JM-F.7z.001;file;
00000009; 20110701-JM-F-via-summer_school;;
00000010; 20110701-JM-F-via-summer_school.7z.001;file;
Desired output:
00000001;;Root;;
00000002; ;Documents;;
00000003; ;oracle-advanced_plsql.zip;file;
00000004; ;Public;;
00000005; ;backup;;
00000006; ;20110323-JM-F.7z.001;file;
00000007; ;20110426-JM-F.7z.001;file;
00000008; ;20110603-JM-F.7z.001;file;
00000009; ;20110701-JM-F-via-summer_school;;
00000010; ;20110701-JM-F-via-summer_school.7z.001;file;
Could someone please help me create a Perl regex for that? I need it in a program, not as a one-liner.
This is a way to insert a semi-colon after the first semi-colon and whitespace, but before the first non-whitespace.
s/;\s*\K(?=\S)/;/
If you feel the need, you can use \w instead of \S, but I felt with this input it was an unnecessary specification.
The \K (keep) escape is similar to a lookbehind assertion in that it does not remove what it matches. The same goes for the lookahead assertion, so all this substitution does is insert a semi-colon in the designated spot.
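A short sketch of that substitution applied to two of the sample lines from the question. The wrapper add_field is a made-up name, and \K requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Insert a semicolon after the first semicolon and any whitespace
# that follows it, just before the next non-whitespace character.
sub add_field {
    my ($line) = @_;
    $line =~ s/;\s*\K(?=\S)/;/;
    return $line;
}

print add_field($_), "\n" for '00000001;Root;;', '00000002; Documents;;';
```

Because everything before \K is kept and the lookahead consumes nothing, the replacement text is inserted rather than substituted for anything.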
First of all, here is a program that seems to match your requirements:
#!/usr/bin/perl -w
while(<>) {
s/^(.*?;.*?)(\w)/$1;$2/;
print $_;
}
Store it in a file 'program.pl', make it executable with 'chmod u+x program.pl' and run it on your input data like this:
./program.pl input-data.txt
Here is an explanation of the regular expression:
s/ # start search-and-replace regexp
^ # start at the beginning of this line
( # save the matched characters until ')' in $1
.*?; # go forward until finding the first semicolon
.*? # go forward until finding... (to be continued below)
)
( # save the matched characters until ')' in $2
\w # ... the next alphanumeric character.
)
/ # continue with the replace part
$1;$2 # write all characters found above, but insert a ; before $2
/ # finish the search-and-replace regexp.
Based on your sample input, I would use a more specific regular expression:
s/^(\d*; *)(\w)/$1;$2/;
This expression starts at the beginning of the line, skips over numbers (\d*) followed by the first semicolon and space. Before the following word character, it inserts a semicolon.
Take what fits best to your needs!
First of all thank you for your really great answers!
Actually my code snippet looks like this:
our $seperator = ";"; # at the beginning of the file
#...
sub insert {
my ( $seperator, $line, @all_lines, $count, @all_out );
$count = 0;
@all_lines = read_file($filename);
foreach $line (@all_lines) {
$count = sprintf( "%08d", $count );
chomp $line;
$line =~ s/\:/$seperator/; # works
$line =~ s/\ file/file/; # works
#$line=~s/;\s*\K(?=\S)/;/; # doesn't work
$line =~ s/^(.*?$seperator.*?)(\w)/$1$seperator$2/; # doesn't work
say $count . $seperator . $line . $seperator;
$count++; # btw, is there maybe a hidden index variable in a foreach-loop I could use instead of a new variable??
push( @all_out, $count . $seperator . $line . $seperator . "\n" );
}
write_file( $csvfile, @all_out ); # using File::Slurp
}
In order to get the input which I presented you, I made already some small substitutions, as you can see in the beginning of the foreach-loop.
I am curious, why the regular expressions presented by TLP and Yaakov do not work in my code. In general they work, but only when written like in the example which Yaakov gave:
while(<>) {
s/^(.*?;.*?)(\w)/$1;$2/;
print $_;
}
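The thread does not record a confirmed answer. One thing that can be checked directly, offered here as an assumption about the cause rather than a diagnosis, is that listing $seperator in the my (...) declaration inside the sub creates a new, undefined lexical that hides the our variable. The sketch below uses made-up sub names to show that effect in isolation.

```perl
use strict;
use warnings;

our $seperator = ';';   # package variable, spelled as in the question

sub with_shadowing {
    my ($seperator);    # new, undefined lexical hides the package variable
    return defined $seperator ? 'defined' : 'undefined';
}

sub without_shadowing {
    # No inner declaration, so this refers to the package variable.
    return defined $seperator ? 'defined' : 'undefined';
}

print 'with my:    ', with_shadowing(), "\n";
print 'without my: ', without_shadowing(), "\n";
```

If this is what happens in the posted code, every $seperator interpolated into a pattern or replacement inside the sub would be empty, whereas the versions with a literal ; are unaffected.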

In Perl, how can I get the matched substring from a regex?

My program reads other programs' source code and collects information about the SQL queries they use. I have a problem with getting a substring.
...
$line = <FILE_IN>;
until( ($line =~m/$values_string/i && $line !~m/$rem_string/i) || eof )
{
if($line =~m/ \S{2}DT\S{3}/i)
{
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
$line =~/\S{2}DT\S{3}/i;
print $line."\n";
...
As a result, print prints the whole line and not the substring I expect. I tried different approaches, but I seldom use Perl and am probably making a basic conceptual error. (The position of the table name in the line is not fixed. Another problem is multiple occurrences, e.g. [... SELECT * FROM AADTTAB, BBDTTAB, ...].) How can I obtain that substring?
Use grouping with parentheses and store the first group.
if( $line =~ /(\S{2}DT\S{3})/i )
{
my $substring = $1;
}
The code above fixes the immediate problem of pulling out the first table name. However, the question also asked how to pull out all the table names. So:
# FROM\s+ match FROM followed by one or more spaces
# (.+?) match (non-greedy) and capture any character until...
# (?:x|y) match x OR y - next 2 matches
# [^,]\s+[^,] match non-comma, 1 or more spaces, and non-comma
# \s*; match 0 or more spaces followed by a semi colon
if( $line =~ /FROM\s+(.+?)(?:[^,]\s+[^,]|\s*;)/i )
{
# $1 will be table1, table2, table3
my @tables = split(/\s*,\s*/, $1);
# delim is a space/comma
foreach(@tables)
{
# $_ = table name
print $_ . "\n";
}
}
Result:
If $line = "SELECT * FROM AADTTAB, BBDTTAB;"
Output:
AADTTAB
BBDTTAB
If $line = "SELECT * FROM AADTTAB;"
Output:
AADTTAB
Perl Version: v5.10.0 built for MSWin32-x86-multi-thread
I prefer this:
my ( $table_name ) = $line =~ m/(\S{2}DT\S{3})/i;
This
scans $line and captures the text corresponding to the pattern
returns "all" the captures (1) to the "list" on the other side.
This pseudo-list context is how we catch the first item in a list. It's done the same way as parameters passed to a subroutine.
my ( $first, $second, @rest ) = @_;
my ( $first_capture, $second_capture, @others ) = $feldman =~ /$some_pattern/;
NOTE: That said, your regex assumes too much about the text to be useful in more than a handful of situations. It will not capture any table name that doesn't have DT in positions 3 and 4 of 7. It's good enough for 1) quick-and-dirty work, or 2) if you're okay with limited applicability.
It would be better to match the pattern if it follows FROM. I assume table names consist solely of ASCII letters. In that case, it is best to say what you want. With those two remarks out of the way, note that a successful capturing regex match in list context returns the matched substring(s).
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'select * from aadttab, bbdttab';
if ( my ($table) = $s =~ /FROM ([A-Z]{2}DT[A-Z]{3})/i ) {
print $table, "\n";
}
__END__
Output:
C:\Temp> s
aadttab
Depending on the version of perl on your system, you may be able to use a named capturing group which might make the whole thing easier to read:
if ( $s =~ /FROM (?<table>[A-Z]{2}DT[A-Z]{3})/i ) {
print $+{table}, "\n";
}
See perldoc perlre.
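For the multiple-occurrence part of the question, the same list-context behaviour combined with /g returns every table name in one statement. This is a sketch under the assumption made above that table names are exactly seven ASCII letters with DT in positions 3 and 4; table_names is a name invented for the example.

```perl
use strict;
use warnings;

# Return every seven-letter word with DT in positions 3 and 4.
sub table_names {
    my ($sql) = @_;
    # \b on both sides keeps the pattern from matching inside longer words.
    return $sql =~ /\b([A-Z]{2}DT[A-Z]{3})\b/gi;
}

print "$_\n" for table_names('SELECT * FROM AADTTAB, BBDTTAB');
```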
Parens will let you grab part of the regex into special variables: $1, $2, $3...
So:
$line = ' abc andtabl 1234';
if($line =~m/ (\S{2}DT\S{3})/i) {
# here I wish to get (only) substring that match to pattern \S{2}DT\S{3}
# (7 letter table name) and display it.
print $1."\n";
}
Use a capturing group:
$line =~ /(\S{2}DT\S{3})/i;
my $substr = $1;
$& contains the string matched by the last pattern match.
Example:
$str = "abcdefghijkl";
$str =~ m/cdefg/;
print $&;
# Output: "cdefg"
So you could do something like
if($line =~m/ \S{2}DT\S{3}/i) {
print $&."\n";
}
WARNING:
On Perl versions before 5.20, using $& anywhere in your code slows down all pattern matches.
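On Perl 5.10 and later, the /p modifier together with ${^MATCH} gives the same matched text without that global cost. A minimal sketch, with matched_text as a made-up helper name and a sample line based on the question:

```perl
use strict;
use warnings;

# Return the text matched by the pattern, or undef when there is no match.
sub matched_text {
    my ($line) = @_;
    # /p asks Perl to preserve the matched string in ${^MATCH}.
    return $line =~ /\S{2}DT\S{3}/ip ? ${^MATCH} : undef;
}

print matched_text('SELECT * FROM AADTTAB'), "\n";
```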