regexp greedness: shrinking long path - regex

Please have a look at my mind-breaker.
I'd stuck in shrinking with regex some long path, like this:
/12345/123456/1234/123/12/1/1234567/13245678/123456789/1234567890
I'd like to transform this path to the following form:
/123/123/123/123/12/1/123/123/123/123
each "directory" in a path abbreviates to only 3 first characters
LONG_PATH="/12345/123456/1234/123/12/1/1234567/13245678/123456789/1234567890"
perl -pe "s#/(.{1,3})[^/]*?(/|$)#/\1\2#g" <<<$LONG_PATH
/123/123456/123/123/12//1234567/132/123456789/123
sed -E "s#/(.{1,3})[^/]*?(/|$)#/\1\2#g" <<<$LONG_PATH
/123/123456/123/123/12//1234567/132/123456789/123
I have tried also:
perl -pe "s,/(.)(.)?(.)?[^/]*+,/\1\2\3,g" <<<$LONG_PATH
/123/123/123/123/12//123/132/123/123
and many another, no "luck" - I still have no idea about.
Please point me a right way to success.

Match up to three non-slash characters and capture them. Then match the rest until the next slash. Replace by the capture:
"s#(/[^/]{3})[^/]*#\1#g"
There is no need for ungreediness or anything here, because the negated character class is mutually exclusive with the / or $.
EDIT: Although you seem to know this I should probably clarify for future visitors that this will work with either perl -pe... or sed -E... as you have used it in your question. The regex could also be used as is with sed -r.... If you leave out the -E or -r option, then (as usual) you will need to escape both the parentheses and curly brackets:
sed "s#\(/[^/]\{3\}\)[^/]*#\1#g" filename
Note also as ikegami points out that in Perl you should rather use $1 in the replacement than \1.

You could do it like this:
perl -pe's#[^/]{3}\K[^/]*##g'
/12345/123456/1234/123/12/1/1234567/13245678/123456789/1234567890
/123/123/123/123/12/1/123/132/123/123
Find 3 non-slashes, and keep (\K) them, remove the following characters up until the next slash.
As ikegami pointed out, it is not required to match less than three characters, in which case a lookbehind assertion can be used instead of \K. The benefit is that \K requires perl v5.10, and I believe look-around assertions predate that.
perl -pe 's#(?<=[^/]{3})[^/]*##g'

The best way seems to use the File::Spec module to split and recombine a path. An intermediate call to map will reduce each path segment to its first three characters. This program demonstrates
use strict;
use warnings;
use File::Spec;
my $path = '/12345/123456/1234/123/12/1/1234567/13245678/123456789/1234567890';
my $newpath = File::Spec->catdir(map substr($_, 0, 3), File::Spec->splitdir($path));
print $newpath;
output
/123/123/123/123/12/1/123/132/123/123

Related

Regex does not match in Perl, while it does in other programs

I have the following string:
load Add 20 percent
to accommodate
I want to get to:
load Add 20 percent to accommodate
With, e.g., regex in sublime, this is easily done by:
Regex:
([a-z])\n\s([a-z])
Replace:
$1 $2
However, in Perl, if I input this command, (adapted to test if I can match the pattern in any case):
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
It doesn't match anything.
Does anyone know why Perl would be different in this case, and what the correct formulation of the Perl command should be?
By default, Perl -p flag read input lines one by one. You can't thus expect your regex to match anything after \n.
Instead, you want to read the whole input at once. You can do this by using the flag -0777 (this is documented in perlrun):
perl -0777 -pi.orig -e 's/([a-z])\n\s(to)/$1 $2/' file
Just trying to help and reminding below your initial proposal for perl regex:
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
Note that in perl regex, [a-z] will match only one character, NOT including any whitespace. Then as a start please include a repetition specifier and include capability to also 'eat' whitespaces. Also to keep the recognized (but 'eaten') 'to' in the replacement, you must put it again in the replacement string, like finally in the below example perl program:
$str = "load Add 20 percent
to accommodate";
print "before:\n$str\n";
$str =~ s/([ a-z]+)\n\s*to/\1 to/;
print "after:\n$str\n";
This program produces the below input:
before:
load Add 20 percent
to accommodate
after:
load Add 20 percent to accommodate
Then it looks like that if I understood well what you want to do, your regexp should better look like:
s/([ a-z]+)\n\s*to/\1 to/ (please note the leading whitespace before 'a-z').

Regex to match a path, excluding the final directory. Perl error "search pattern not terminated at -e"

Say I have a directory path
/abc/de/fgh/i/jk
And I just want to match
/abc/de/fgh/i/
How would I do this? I have tried:
(\/.*\/)*
In https://regexr.com/ and it seems to do the trick.
However, when I try:
path="/abc/de/fgh/i/jk"
sliced=`echo $path | perl -pe '(\/.*\/)*'`
I get the error:
Search pattern not terminated at -e line 1
Is there a better way to do this besides perl? What is wrong here anyway to give a perl error? Is my regex even correct here?
There is a coreutil to do this, dirname:
path="/abc/de/fgh/i/jk"
sliced=`dirname "$path"`
However, it also removes the trailing /. If you, for whatever reason, need it, just append it to sliced:
sliced=`dirname "$path"`/
Be sure to quote $path; it won't work with paths with spaces in them if you don't.
Your Perl program isn't working because you haven't put any regex operators around it. It ignores the first \ since it isn't in a string or regex, then finds the /, which begins a regex. Since there's no unescaped / to end it, it gives you that error. You probably want something like s/(\/.*\/)*.*/$1/, which should erase everything after the last /.
Use sed, but match what you want to remove (rather than what you want to keep):
echo $path | sed 's,[^/]*$,,'
Outputs /abc/de/fgh/i/
The regex matches all non-slash chars at the end and replaces them with nothing, removing them.
Why not use Parameter Expansion ?
echo "${path%/*}/"

Conditional in perl regex replacement

I'm trying to return different replacement results with a perl regex one-liner if it matches a group. So far I've got this:
echo abcd | perl -pe "s/(ab)(cd)?/defined($2)?\1\2:''/e"
But I get
Backslash found where operator expected at -e line 1, near "1\"
(Missing operator before \?)
syntax error at -e line 1, near "1\"
Execution of -e aborted due to compilation errors.
If the input is abcd I want to get abcd out, if it's ab I want to get an empty string. Where am I going wrong here?
You used regex atoms \1 and \2 (match what the first or second capture captured) outside of a regex pattern. You meant to use $1 and $2 (as you did in another spot).
Further more, dollar signs inside double-quoted strings have meaning to your shell. It's best to use single quotes around your program[1].
echo abcd | perl -pe's/(ab)(cd)?/defined($2)?$1.$2:""/e'
Simpler:
echo abcd | perl -pe's/(ab(cd)?)/defined($2)?$1:""/e'
Simpler:
echo abcd | perl -pe's/ab(?!cd)//'
Either avoid single-quotes in your program[2], or use '\'' to "escape" them.
You can usually use q{} instead of single-quotes. You can also switch to using double-quotes. Inside of double-quotes, you can use \x27 for an apostrophe.
Why torture yourself, just use a branch reset.
Find (?|(abcd)|ab())
Replace $1
And a couple of even better ways
Find abcd(*SKIP)(*FAIL)|ab
Replace ""
Find (?:abcd)*\Kab
Replace ""
These use regex wisely.
There is really no need nowadays to have to use the eval form
of the regex substitution construct s///e in conjunction with defined().
This is especially true when using the perl command line.
Good luck...

Find all text within square brackets using regex

I have a problem that because of PHP version, I need to change my code from $array[stringindex] to $array['stringindex'];
So I want to find all the text using regex, and replace them all. How to find all strings that look like this? $array[stringindex].
Here's a solution in PHP:
$re = "/(\\$[[:alpha:]][[:alnum:]]+\\[)([[:alpha:]][[:alnum:]]+)(\\])/";
$str = "here is \$array[stringindex] but not \$array['stringindex'] nor \$3array[stringindex] nor \$array[4stringindex]";
$subst = "$1'$2'$3";
$result = preg_replace($re, $subst, $str);
You can try it out interactively here. I search for variables beginning with a letter, otherwise things like $foo[42] would be converted to $foo['42'], which might not be desirable.
Note that all the solutions here will not handle every case correctly.
Looking at the Sublime Text regex help, it would seem you could just paste (\\$[[:alpha:]][[:alnum:]]+\\[)([[:alpha:]][[:alnum:]]+)(\\]) into the Search box and $1'$2'$3 into the Replace field.
It depends of the tool you want to use to do the replacement.
with sed for exemple, it would be something like that:
sed "s/\(\$array\)\[\([^]]*\)\]/\1['\2']/g"
If sed is allowed you could simply do:
sed -i "s/(\$[^[]*[)([^]]*)]/\1'\2']/g" file
Explanation:
sed "s/pattern/replace/g" is a sed command which searches for pattern and replaces it with replace. The g options means replace multiple times per line.
(\$[^[]*[)([^]]*)] this pattern consists of two groups (in between brackets). The first is a dollar followed by a series of non [ chars. Then an opening square bracket follows, followed by a series of non closing brackets which is then followed by a closing square bracket.
\1'\2'] the replacement string: \1 means insert the first captured group (analogous for \2. Basically we wrap \2 in quotes (which is what you wanted).
the -i options means that the changes should be applied to the original file, which is supplied at the end.
For more information, see man sed.
This can be combined with the find command, as follows:
find . -name '*.php' -exec sed -i "s/(\$[^[]*[)([^]]*)]/\1'\2']/g" '{}' \;
This will apply the sed command to all php files found.

Is there a truly universal wildcard in Grep? [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 3 years ago.
Really basic question here. So I'm told that a dot . matches any character EXCEPT a line break. I'm looking for something that matches any character, including line breaks.
All I want to do is to capture all the text in a website page between two specific strings, stripping the header and the footer. Something like HEADER TEXT(.+)FOOTER TEXT and then extract what's in the parentheses, but I can't find a way to include all text AND line breaks between header and footer, does this make sense? Thanks in advance!
When I need to match several characters, including line breaks, I do:
[\s\S]*?
Note I'm using a non-greedy pattern
You could do it with Perl:
$ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html
To print only the text between the delimiters, use
$ perl -000 -lne 'print $1 while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html
The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible.
The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:
$ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com";
print $1 while m!<head>(.+?)</head>!sg'
Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
By definition, grep looks for lines which match; it reads a line, sees whether it matches, and prints the line.
One possible way to do what you want is with sed:
sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$#"
This prints from the first line that matches 'HEADER TEXT' to the first line that matches 'FOOTER TEXT', and then iterates; the '-n' stops the default 'print each line' operation. This won't work well if the header and footer text appear on the same line.
To do what you want, I'd probably use perl (but you could use Python if you prefer). I'd consider slurping the whole file, and then use a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by '#gbacon' is an almost exact transliteration into Perl of the 'sed' script above and is neater than slurping.
The man page of grep says:
grep, egrep, fgrep, rgrep - print lines matching a pattern
grep is not made for matching more than a single line. You should try to solve this task with perl or awk.
As this is tagged with 'bbedit' and BBedit supports Perl-Style Pattern Modifiers you can allow the dot to match linebreaks with the switch (?s)
(?s).
will match ANY character. And yes,
(?s).+
will match the whole text.
As pointed elsewhere, grep will work for single line stuff.
For multiple-lines (in ruby with Regexp::MULTILINE, or in python, awk, sed, whatever), "\s" should also capture line breaks, so
HEADER TEXT(.*\s*)FOOTER TEXT
might work ...
here's one way to do it with gawk, if you have it
awk -vRS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file