Perl - Unexpected behaviour with Regex in Array - regex

I'm trying to match lines that have
"/foldera/folderb/folderc/folderd/file.ext##/main" + "/" + ANY_NUMBER:
The regex I am using is:
(.+)(main)(.\d)
The lines:
/foldera/folderb/folderc/folderd/file.ext##/main
/foldera/folderb/folderc/folderd/file.ext##/main/0
/foldera/folderb/folderc/folderd/file.ext##/main/1
/foldera/folderb/folderc/folderd/file.ext##/main/2
/foldera/folderb/folderc/folderd/file.ext##/main/3
/foldera/folderb/folderc/folderd/file.ext##/main/4
/foldera/folderb/folderc/folderd/file.ext##/main/5 (RLT-abcde, BLD-abcde, DEV-abcde)
/foldera/folderb/folderc/folderx/file12.ext##/main/0
/foldera/folderb/folderc/folderx/file12.ext##/main/1
/foldera/folderb/folderc/folderx/file12.ext##/main/2
/foldera/folderb/folderc/folderx/file12.ext##/main/3
/foldera/folderb/folderc/folderx/file12.ext##/main/4
/foldera/folderb/folderc/folderx/file12.ext##/main/5
/foldera/folderb/folderc/folderx/file12.ext##/main/6 (RLS-abcde-5.0, RLS-abcde-4.1)
While my regex matches the desired lines (I checked it at http://www.regexe.com/), in my Perl program it does not match
/foldera/folderb/folderc/folderd/file.ext##/main
but it does match:
/foldera/folderb/folderc/folderd/file.ext##/main/5 (RLT-abcde, BLD-abcde, DEV-abcde)
Here is the code:
use warnings;
use strict;
my @file_list = `find /folder -type f -name '*.ext'|xargs cleartool lsvtree -all`;
foreach my $file(@file_list){
if ($file=~m/(.+)(main)(.\d)/g){
print $file;
}
}
I'm pretty sure that I'm making a stupid mistake somewhere, but I just can't see it!
Thank you in advance for your advice.
P.S. I tried it under Perl 5.8 and Perl 5.18 with the same results, OS is Solaris.

Change
print $file;
to:
print "$MATCH\n";
so you only print the part of the line that was matched by the regexp. Note that $MATCH is only available with use English; without it, the same value is in $&.
You should also change \d to \d+, to allow for numbers with more than one digit.

Just after a quick look
/foldera/folderb/folderc/folderd/file.ext##/main
has no number at the end, and \d requires a digit ;-)
You may also find this site useful: http://regexpal.com/

I think it is not matching your line because your regex explicitly requires a digit at the end.
Try changing your regex to this (note the curly brackets at the end):
(.+)(main)(.\d){0,1}
Or personally I would write it like this:
(.*?)main(\/\d*){0,1}
Hope this helps!
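Putting the answers above together, one way to accept both the bare /main line and the numbered ones is to make the /NUMBER part optional. This is only a sketch: the helper name main_version and the hard-coded sample lines are assumptions for illustration (the original code reads its lines from cleartool), and it sticks to syntax that should also work on Perl 5.8.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number after /main, an empty string
# when the line stops at /main, or undef when the line does not match.
sub main_version {
    my ($line) = @_;
    # (?:/(\d+))? makes "/NUMBER" optional; the lookahead requires
    # whitespace or end of string right after it.
    return unless $line =~ m{##/main(?:/(\d+))?(?=\s|\z)};
    return defined $1 ? $1 : '';
}

# Sample lines copied from the question (assumed input).
my @file_list = (
    "/foldera/folderb/folderc/folderd/file.ext##/main\n",
    "/foldera/folderb/folderc/folderd/file.ext##/main/0\n",
    "/foldera/folderb/folderc/folderx/file12.ext##/main/6 (RLS-abcde-5.0, RLS-abcde-4.1)\n",
);

foreach my $file (@file_list) {
    print $file if defined main_version($file);
}
```

All three sample lines are printed, including the one that ends at /main.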

Related

Regex does not match in Perl, while it does in other programs

I have the following string:
load Add 20 percent
to accommodate
I want to get to:
load Add 20 percent to accommodate
With, e.g., regex in sublime, this is easily done by:
Regex:
([a-z])\n\s([a-z])
Replace:
$1 $2
However, in Perl, if I input this command, (adapted to test if I can match the pattern in any case):
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
It doesn't match anything.
Does anyone know why Perl would be different in this case, and what the correct formulation of the Perl command should be?
By default, Perl's -p flag reads the input one line at a time, so you can't expect your regex to match anything after \n.
Instead, you want to read the whole input at once. You can do this with the -0777 flag (documented in perlrun):
perl -0777 -pi.orig -e 's/([a-z])\n\s(to)/$1 $2/' file
Just trying to help. As a reminder, this was your initial proposal for the Perl regex:
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
Note that in a Perl regex, [a-z] matches exactly one character and does not match whitespace. So, as a start, add a repetition specifier and allow the class to also consume spaces. Also, to keep the 'to' that the pattern matches (and therefore consumes), you must put it back in the replacement string, as in the example Perl program below:
$str = "load Add 20 percent
to accommodate";
print "before:\n$str\n";
$str =~ s/([ a-z]+)\n\s*to/\1 to/;
print "after:\n$str\n";
This program produces the output below:
before:
load Add 20 percent
to accommodate
after:
load Add 20 percent to accommodate
So, if I understood correctly what you want to do, your regexp should look more like this:
s/([ a-z]+)\n\s*to/\1 to/ (please note the leading space before 'a-z').
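The difference between line-by-line processing (what -p does by default) and whole-input processing (what -0777 enables) can be reproduced inside a small script. This is a sketch only; the helper name join_wrapped is made up for illustration, and the pattern is the one from the question with \s relaxed to \s*.

```perl
use strict;
use warnings;

# Hypothetical helper: joins a line to the next one when a lowercase
# letter is followed by a newline, optional whitespace and another
# lowercase letter.
sub join_wrapped {
    my ($text) = @_;
    $text =~ s/([a-z])\n\s*([a-z])/$1 $2/g;
    return $text;
}

my $whole = "load Add 20 percent\nto accommodate\n";

# Line by line: each chunk ends at its own newline, so the pattern
# never sees what follows \n and nothing is replaced.
my $by_line = join '', map { join_wrapped($_) } split /^/, $whole;
print $by_line;

# Whole input at once: the newline and the next line are both visible.
print join_wrapped($whole);
```

The first print leaves the two lines untouched; the second prints them joined with a single space.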

Regex expression matching block of lines

I have this kind of file:
Analysis of its root cause:
Blablablablabla
blabablabkjhjk
kjbsqbdqbds
Details of the fix
blablabla
Analysis of its root cause:
fddsfsdfsdfdsfs
blnskdbbqbbb
xxxxggggggg
Details of the fix
blablabla
Analysis of its root cause is repeated x times in the file. I would like to get the block of text delimited by "Analysis of its root cause" and "Details of the fix".
Thanks a lot for your help.
I'm pretty sure there is some better way to do this, but that's what I could manage:
/(?(?<=Analysis of its root cause:\n)((.*\n)*)(?=Details of the fix\n))/gU
I'm using positive lookahead and lookbehind, and the following modifiers:
g - global - Don't return after first match
U - Ungreedy - Make quantifiers lazy (this is a PCRE modifier; Perl itself has no /U)
Try it online: https://regex101.com/r/xpz7pg/2
Not a pure regex answer, but one using Perl.
Put your lines into a single file.
perl -e '$/="Analysis of its root cause:"; # Sets the record delimiter
while(<>){ # Iterates over the file, record by record
chomp; # Removes the delimiter
if ($_ =~ /\n(.*?)\nDetails of the fix\n(.*)\n/s){ # Captures the text before and after "Details of the fix"; with /s, . also matches newline
print "ONE:$1\nTWO:$2\n"} # $1 is the analysis, $2 is the details
}' file.txt
Output
ONE:Blablablablabla
blabablabkjhjk
kjbsqbdqbds
TWO:blablabla
ONE:fddsfsdfsdfdsfs
blnskdbbqbbb
xxxxggggggg
TWO:blablabla
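If only the analysis blocks are needed, a regex in list context can also collect them in one go in Perl. The following is a minimal sketch, assuming the whole file fits in memory; extract_blocks is a hypothetical helper name and the sample text is taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every block of text between the two
# marker lines. /m lets ^ and $ match at line boundaries, /s lets .
# match newlines, and .*? stops at the first "Details of the fix".
sub extract_blocks {
    my ($text) = @_;
    my @blocks = $text =~ /^Analysis of its root cause:\n(.*?)^Details of the fix$/msg;
    return @blocks;
}

my $text = <<'END';
Analysis of its root cause:
Blablablablabla
blabablabkjhjk
kjbsqbdqbds
Details of the fix
blablabla
Analysis of its root cause:
fddsfsdfsdfdsfs
blnskdbbqbbb
xxxxggggggg
Details of the fix
blablabla
END

my @blocks = extract_blocks($text);
print "BLOCK:\n$_" for @blocks;
```

Each returned block keeps its trailing newline, so the blocks can be printed back to back.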

Find all text within square brackets using regex

Because of a PHP version change, I need to change my code from $array[stringindex] to $array['stringindex'];
So I want to find all the text using regex, and replace them all. How to find all strings that look like this? $array[stringindex].
Here's a solution in PHP:
$re = "/(\\$[[:alpha:]][[:alnum:]]+\\[)([[:alpha:]][[:alnum:]]+)(\\])/";
$str = "here is \$array[stringindex] but not \$array['stringindex'] nor \$3array[stringindex] nor \$array[4stringindex]";
$subst = "$1'$2'$3";
$result = preg_replace($re, $subst, $str);
The pattern only matches indexes that begin with a letter; otherwise things like $foo[42] would be converted to $foo['42'], which might not be desirable.
Note that none of the solutions here will handle every case correctly.
Looking at the Sublime Text regex help, it would seem you could just paste the pattern without the PHP string escaping, (\$[[:alpha:]][[:alnum:]]+\[)([[:alpha:]][[:alnum:]]+)(\]), into the Search box and $1'$2'$3 into the Replace field.
It depends on the tool you want to use to do the replacement.
With sed, for example, it would be something like this:
sed "s/\(\$array\)\[\([^]]*\)\]/\1['\2']/g"
If sed is allowed you could simply do:
sed -i "s/\(\$[^[]*\[\)\([^]]*\)\]/\1'\2']/g" file
Explanation:
sed "s/pattern/replace/g" is a sed command which searches for pattern and replaces it with replace. The g option means every match on a line is replaced, not just the first one.
\(\$[^[]*\[\)\([^]]*\)\] this pattern consists of two groups (in between escaped parentheses). The first is a dollar sign followed by a series of non-[ characters and an opening square bracket. The second is a series of characters that are not closing brackets, and it is followed by a closing square bracket.
\1'\2'] the replacement string: \1 means insert the first captured group (and likewise for \2). Basically we wrap \2 in quotes (which is what you wanted).
The -i option means that the changes should be applied to the original file, which is supplied at the end.
For more information, see man sed.
This can be combined with the find command, as follows:
find . -name '*.php' -exec sed -i "s/\(\$[^[]*\[\)\([^]]*\)\]/\1'\2']/g" '{}' \;
This will apply the sed command to all PHP files found.
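For comparison, the same three-group idea (variable name plus bracket, bare index, closing bracket) can be sketched in Perl. This is not one of the original answers: the helper name quote_string_indexes is hypothetical, and the identifier pattern [A-Za-z_]\w* is an assumption about what counts as a name. Like the other solutions, it will not handle every case; for instance, a constant used as an index would be quoted too.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrites $name[index] to $name['index'] when the
# index looks like a bare word. Numeric and already-quoted indexes do
# not match the second group and are left alone.
sub quote_string_indexes {
    my ($code) = @_;
    $code =~ s/(\$[A-Za-z_]\w*\[)([A-Za-z_]\w*)(\])/${1}'${2}'$3/g;
    return $code;
}

# q{} keeps the dollar signs and quotes literal in the sample input.
my $sample = q{here is $array[stringindex] but not $array['stringindex'] nor $array[42]};
print quote_string_indexes($sample), "\n";
```

Only the first occurrence in the sample is changed; the quoted and numeric indexes stay as they are.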

sed/grep - get text between two strings (html)

I am trying to extract "pagename" from the following:
<a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a>
I tried to get it to work using "sed" but it only says invalid command code.
What line of code would you guys suggest to get the pagename? By the way: This is not a single line but there is more content on the same line - but that should not make a difference as it should just matter what is between the limiters, right?
Thanks in advance for helping me out!
I would use awk for this:
awk -F"[/?]" '/timetable work/ {print $4}' file
pagename
It searches for a line containing timetable work, then prints the fourth field, using / or ? as the separator.
As you commented, if you want to extract "<a class="timetable work" href="test.com/"; and "?tag=meta376">Test</a>" you can use the following regex:
<a class="timetable.*?<\/a>
If you want to grab the content just surround the regex with capturing groups:
(<a class="timetable.*?<\/a>)
The match is:
MATCH 1
1. [9-80] `<a class="timetable work" href="test.com/"; and "?tag=meta376">Test</a>`
I think this is what you want:
sed 's_^.*<a [^<>]* href="https*://[^/]*/\([^"?]*\).*$_\1_'
Giving you exactly what you asked for using exactly the delimiters you told us to use:
$ sed -n 's|.*<a class="timetable work" href="http://www\.test\.com/\(.*\)?tag=meta376">Test</a>|\1|p' file
pagename
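The sed approach above translates fairly directly to Perl. The sketch below is an illustration only: page_name is a hypothetical helper, the pattern assumes the href is an absolute http or https URL inside a single <a> tag, and it shares the usual fragility of matching HTML with a regex.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first path segment of the href in an
# <a> tag, i.e. the text between the host name and the next ? or ".
sub page_name {
    my ($html) = @_;
    return $1 if $html =~ m{<a\s[^<>]*\bhref="https?://[^/"]+/([^"?]*)};
    return;
}

my $line = '<a class="timetable work" href="http://www.test.com/pagename?tag=meta376">Test</a>';
print page_name($line), "\n";
```

Because the pattern is not anchored to the start of the line, other content on the same line before the tag does not matter.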
I know it may be tempting to handle this using a regular expression but here's an alternative.
You are trying to parse some HTML, so use an HTML parser. Here's an example in Perl:
use strict;
use warnings;
use feature qw(say);
use HTML::TokeParser::Simple;
use URI::URL;
my $filename = 'file.html';
my $parser = HTML::TokeParser::Simple->new($filename);
while (my $anchor = $parser->get_tag('a')) {
next unless defined(my $class = $anchor->get_attr('class'));
next unless $class =~ /\btimetable\b/ and $class =~ /\bwork\b/;
my $url = url $anchor->get_attr('href');
say substr($url->path, 1);
}
Parse the HTML using HTML::TokeParser::Simple. Loop through the <a> tags, skipping any that don't have the correct classes defined. For the ones that do, use URI::URL to parse the URL and extract the "path" component (which in your case would be "/pagename"). As you didn't want the leading slash, I used substr to remove the first character.
Output:
pagename
I know it's much longer than a single regex but it's also a lot more robust and will continue to work even when the format of your HTML changes slightly in the future. HTML parsers exist for a reason :)

regexp greediness: shrinking long path

Please have a look at my brain-teaser.
I'm stuck trying to shorten a long path with a regex, like this one:
/12345/123456/1234/123/12/1/1234567/13245678/123456789/1234567890
I'd like to transform this path to the following form:
/123/123/123/123/12/1/123/123/123/123
Each "directory" in the path should be abbreviated to its first 3 characters only. Here is what I tried:
LONG_PATH="/12345/123456/1234/123/12/1/1234567/13245678/123456789/1234567890"
perl -pe "s#/(.{1,3})[^/]*?(/|$)#/\1\2#g" <<<$LONG_PATH
/123/123456/123/123/12//1234567/132/123456789/123
sed -E "s#/(.{1,3})[^/]*?(/|$)#/\1\2#g" <<<$LONG_PATH
/123/123456/123/123/12//1234567/132/123456789/123
I have tried also:
perl -pe "s,/(.)(.)?(.)?[^/]*+,/\1\2\3,g" <<<$LONG_PATH
/123/123/123/123/12//123/132/123/123
and many others, with no luck. I still have no idea what is wrong.
Please point me in the right direction.
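Match a slash followed by exactly three non-slash characters and capture them. Then match the rest until the next slash. Replace by the capture (components shorter than three characters do not match and are left unchanged):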
"s#(/[^/]{3})[^/]*#\1#g"
There is no need for ungreediness or anything here, because the negated character class is mutually exclusive with the / or $.
EDIT: Although you seem to know this I should probably clarify for future visitors that this will work with either perl -pe... or sed -E... as you have used it in your question. The regex could also be used as is with sed -r.... If you leave out the -E or -r option, then (as usual) you will need to escape both the parentheses and curly brackets:
sed "s#\(/[^/]\{3\}\)[^/]*#\1#g" filename
Note also as ikegami points out that in Perl you should rather use $1 in the replacement than \1.
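Inside a Perl script, the substitution above, written with $1 as suggested, could look like the following sketch. The helper name shorten_path is made up for illustration; the sample path is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps the slash and the first three characters of
# each path component and drops the rest. Components with fewer than
# three characters do not match the pattern and stay unchanged.
sub shorten_path {
    my ($path) = @_;
    $path =~ s{(/[^/]{3})[^/]*}{$1}g;
    return $path;
}

my $long_path = '/12345/123456/1234/123/12/1/1234567/13245678/123456789/1234567890';
print shorten_path($long_path), "\n";
```

Because the pattern never consumes the slash that starts the next component, consecutive components are all processed, which is where the attempts in the question went wrong.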
You could do it like this:
perl -pe's#[^/]{3}\K[^/]*##g'
/12345/123456/1234/123/12/1/1234567/13245678/123456789/1234567890
/123/123/123/123/12/1/123/132/123/123
Find 3 non-slashes, and keep (\K) them, remove the following characters up until the next slash.
As ikegami pointed out, it is not required to match less than three characters, in which case a lookbehind assertion can be used instead of \K. The benefit is that \K requires perl v5.10, and I believe look-around assertions predate that.
perl -pe 's#(?<=[^/]{3})[^/]*##g'
The best way seems to be to use the File::Spec module to split and recombine the path. An intermediate call to map reduces each path segment to its first three characters. This program demonstrates the idea:
use strict;
use warnings;
use File::Spec;
my $path = '/12345/123456/1234/123/12/1/1234567/13245678/123456789/1234567890';
my $newpath = File::Spec->catdir(map substr($_, 0, 3), File::Spec->splitdir($path));
print $newpath;
output
/123/123/123/123/12/1/123/132/123/123