How to do non-greedy parsing in ash with grep without Perl

How to do non-greedy parsing in ash with grep without Perl - regex

Hi I'm trying to parse through a folder structure and just have the first two folders in a directory passed to a variable. Right now I'm using the regex '/.?/.?/' which works fine except my version of ash on OpenWRT doesn't support Perl and thus the "? are ignored. How can I fix this?
var2=$(echo "$TR_TORRENT_DIR" | grep -Eio '\/.*?\/.*?\/')
echo $var2

Make your regex more specific.
In general, use of non-greedy modifiers is a red flag. It should make you think about whether you can rewrite your regex in a more specific way (so it doesn't matter whether a quantifier is greedy or not).
In this case the problem is .* (another flag), which matches far too much.
If you use [^/] instead of ., you avoid all problems:
/[^/]*/[^/]*/
will match the first two components and nothing more. (Also, there's no need to escape /; it's not special in a regex.)

Related

Regex whitespace before character [duplicate]

I am attempting to grep for all instances of Ui\. not followed by Line or even just the letter L
What is the proper way to write a regex for finding all instances of a particular string NOT followed by another string?
Using lookaheads
grep "Ui\.(?!L)" *
bash: !L: event not found
grep "Ui\.(?!(Line))" *
nothing

Negative lookahead, which is what you're after, requires a more powerful tool than the standard grep. You need a PCRE-enabled grep.
If you have GNU grep, the current version supports options -P or --perl-regexp and you can then use the regex you wanted.
If you don't have (a sufficiently recent version of) GNU grep, then consider getting ack.

The answer to part of your problem is here, and ack would behave the same way:
Ack & negative lookahead giving errors
You are using double-quotes for grep, which permits bash to "interpret ! as history expand command."
You need to wrap your pattern in SINGLE-QUOTES:
grep 'Ui\.(?!L)' *
However, see #JonathanLeffler's answer to address the issues with negative lookaheads in standard grep!

You probably cant perform standard negative lookaheads using grep, but usually you should be able to get equivalent behaviour using the "inverse" switch '-v'. Using that you can construct a regex for the complement of what you want to match and then pipe it through 2 greps.
For the regex in question you might do something like
grep 'Ui\.' * | grep -v 'Ui\.L'
(Edit: this is not as strong as a true lookahead, but can often be used to work around the problem.)

If you need to use a regex implementation that doesn't support negative lookaheads and you don't mind matching extra character(s)*, then you can use negated character classes [^L], alternation |, and the end of string anchor $.
In your case grep 'Ui\.\([^L]\|$\)' * does the job.
Ui\. matches the string you're interested in
\([^L]\|$\) matches any single character other than L or it matches the end of the line: [^L] or $.
If you want to exclude more than just one character, then you just need to throw more alternation and negation at it. To find a not followed by bc:
grep 'a\(\([^b]\|$\)\|\(b\([^c]\|$\)\)\)' *
Which is either (a followed by not b or followed by the end of the line: a then [^b] or $) or (a followed by b which is either followed by not c or is followed by the end of the line: a then b, then [^c] or $.
This kind of expression gets to be pretty unwieldy and error prone with even a short string. You could write something to generate the expressions for you, but it'd probably be easier to just use a regex implementation that supports negative lookaheads.
*If your implementation supports non-capturing groups then you can avoid capturing extra characters.

If your grep doesn't support -P or --perl-regexp, and you can install PCRE-enabled grep, e.g. "pcregrep", than it won't need any command-line options like GNU grep to accept Perl-compatible regular expressions, you just run
pcregrep "Ui\.(?!Line)"
You don't need another nested group for "Line" as in your example "Ui.(?!(Line))" -- the outer group is sufficient, like I've shown above.
Let me give you another example of looking negative assertions: when you have list of lines, returned by "ipset", each line showing number of packets in a middle of the line, and you don't need lines with zero packets, you just run:
ipset list | pcregrep "packets(?! 0 )"
If you like perl-compatible regular expressions and have perl but don't have pcregrep or your grep doesn't support --perl-regexp, you can you one-line perl scripts that work the same way like grep:
perl -e "while (<>) {if (/Ui\.(?!Lines)/){print;};}"
Perl accepts stdin the same way like grep, e.g.
ipset list | perl -e "while (<>) {if (/packets(?! 0 )/){print;};}"

At least for the case of not wanting an 'L' character after the "Ui." you don't really need PCRE.
grep -E 'Ui\.($|[^L])' *
Here I've made sure to match the special case of the "Ui." at the end of the line.

PCRE regex behaves differently when moved to subroutine

Using PCRE v8.42, I am trying to abstract a regex into a named subroutine, but when it's in a subroutine, it seems to behave differently.
This outputs 10/:
echo '10/' | pcregrep '(?:0?[1-9]|1[0-2])\/'
This outputs nothing:
echo '10/' | pcregrep '(?(DEFINE)(?<MONTHNUM>(?:0?[1-9]|1[0-2])))(?&MONTHNUM)\/'
Are these two regular expressions not equivalent?

In versions of PCRE2 prior to 10.30, all subroutine calls are always treated as atomic groups. Your (?(DEFINE)(?<MONTHNUM>(?:0?[1-9]|1[0-2])))(?&MONTHNUM)\/ regex is actually equal to (?>0?[1-9]|1[0-2])\/. See this regex demo, where 10/ does not match as expected.
There is no match because 0?[1-9] matched the 1 in 10/ and since there is no backtracking allowed, the second alternative was not tested ("entered"), and the whole match failed as there is no / after 1.
You need to make sure the longer alternative comes first:
(?(DEFINE)(?<MONTHNUM>(?:1[0-2]|0?[1-9])))(?&MONTHNUM)/
See the regex demo. Note that in the pcregrep pattern, you do not need to escape /.
Alternatively, you can use PCRE2 v10.30 or newer.

Difference between using grep regex pattern with or without quotes?

I'm learning from Linux Academy and the tutorial shows how to use grep and regex.
He is putting his regex pattern in between quotes something like this:
grep 'pattern' file.txt
This seems to be the same than doing it without quotes:
grep pattern file.txt
But when he does something like this, he needs to escape the { and }:
grep '^A\{1,4\}' file.txt
And after doing some testing these scape characters don't seem to be needed when writing the pattern without the quotes.
grep ^A{1,4} file.txt
So what is the difference between these two methods?
Are the quotations necessary?
Why in the first case the escape characters are needed?
Lastly, I've also seen other methods like grep -E and egrep, which is the most common method that people use to grep with regex?
Edit: Thanks for the reminder that the pattern goes before the file.
Many thanks!

You can sometimes get away with omitting quotes, but it's safest not to. This is because the syntax of regular expressions overlaps that of filename wildcard patterns, and when the shell sees something that looks like a wildcard pattern (and it isn't in quotes), the shell will try to "expand" it into a list of matching filenames. If there are no matching files, it gets passed through unchanged, but if there are matches it gets replaced with the matching filenames.
Here's a simple example. Suppose we're trying to search file.txt for an "a" followed optionally by some "b"s, and print only the matches. So you run:
grep -o ab* file.txt
Now, "ab* could be interpreted as a wildcard pattern looking for files that start with "ab", and the shell will interpret it that way. If there are no files in the current directory that start with "ab", this won't cause a problem. But suppose there are two, "abcd.txt" and "abcdef.jpg". Then the shell expands this to the equivalent of:
grep -o abcd.txt abcdef.jpg file.txt
...and then grep will search the files abcdef.jpg and file.txt for the regex pattern abcd.txt.
So, basically, using an unquoted regex pattern might work, but is not safe. So don't do it.
BTW, I'd also recommend using single-quotes instead of double-quotes, because there are some regex characters that're treated specially by the shell even when they're in double-quotes (mostly dollar sign and backslash/escape). Again, they'll often get passed through unchanged, but not always, and unless you understand the (somewhat messy) parsing rules, you might get unexpected results.
BTW^2, for similar reasons you should (almost) always put double-quotes around variable references (e.g. grep -O 'ab* "$filename" instead of grep -O 'ab*' $filename). Single-quotes don't allow variable references at all; unquoted variable references are subject to word splitting and wildcard expansion, both of which can cause trouble. Double-quoted variables get expanded and nothing else.
BTW^3, there are a bunch of variants of regular expression syntax. The reason the curly braces in your example expression need to be escaped is that, by default, grep uses POSIX "basic" regular expression syntax ("BRE"). In BRE syntax, some regex special characters (including curly brackets and parentheses) must be escaped to have their special meaning (and some others, like alternation with |, are just not available at all). grep -E, on the other hand, uses "extended" regular expression syntax ("ERE"), in which those characters have their special meanings unless they're escaped.
And then there's the Perl-compatible syntax (PCRE), and many other variants. Using the wrong variant of the syntax is a common cause of trouble with regular expressions (e.g. using perl extensions in an ERE context, as here and here). It's important to know which variant the tool you're using understands, and write your regex to that syntax.
Here's a simple example: "a", followed by 1 to 3 space-like characters, followed by "b", in various regex syntax variants:
a[[:space:]]\{1,3\}b # BRE syntax
a[[:space:]]{1,3}b # ERE syntax
a\s{1,3}b # PCRE syntax
Just to make things more complicated, some tools will nominally accept one syntax, but also allow some extensions from other syntax variants. In the example above, you can see that perl added the shorthand \s for a space-like character, which is not part of either POSIX standard syntax. But in fact many tools that nominally use BRE or ERE will actually accept the \s shorthand.

Actually, there are two completely unrelated aspects of escaping in your question. The first has to do how to represent strings in bash. This is about readability, which usually means personal taste. For example, I don't like escaping, hence I prefer writing ab\ cd as 'ab cd'. Hence, I would write
echo 'ab cd'
grep -F 'ab cd' myfile.txt
instead of
echo ab\ cd
grep -F ab\ cd myfile.txt
but there is nothing wrong with either one, and you can choose whichever looks simpler to you.
The other aspect indeed is related to grep, at least as long as you do not use the -F option in grep, which always interprets the search argument literally. In this case, the shell is not involved, and the question is whether a certain character is interpreted as a regexp character or as a literal. Gordon Davisson has already explained this in detail, so I give only an example which combines both aspects:
Say you want to grep for a space, followed by one or more periods, followed by another space. You can't write this as
grep -E .+ myfile.txt
because the spaces would be eaten by bash and the . would have special meaning to grep. Hence, you have to choose some escape mechanism. My personal style would be
grep -E ' [.]+ ' myfile.txt
but many people dislike the [.] and prefer \. instead. This would then become
grep -E ' \.+ ' myfile.txt
This still uses quotes to salvage the spaces from the shell, but escapes the period for grep. If you prefer to use no quotes at all, you can write
grep -E \ \\.+\ myfile.txt
Note that you need to prefix the \ which is intended for grep by another \, because the backslash has, like a space, a special meaning for the shell, and if you would not write \\., grep would not see a backslash-period, but just a period.

Regex lookahead

I am using a regex to find:
test:?
Followed by any character until it hits the next:
test:?
Now when I run this regex I made:
((?:test:\?)(.*)(?!test:\?))
On this text:
test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2
I expected to get:
test:?foo2=bar2&baz2=foo2
test:?foo=bar&baz=foo
test:?foo2=bar2&baz2=foo2
But instead it matches everything. Does anyone with more regex experience know where I have gone wrong? I've used regexes for pattern matching before but this is my first experience of lookarounds/aheads.
Thanks in advance for any help/tips/pointers :-)

I guess you could explore a greedy version.
(expanded)
(test:\? (?: (?!test:\?)[\s\S])* )

The Perl program below
#! /usr/bin/env perl
use strict;
use warnings;
$_ = "test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2";
while (/(test:\? .*?) (?= test:\? | $)/gx) {
print "[$1]\n";
}
produces the desired output from your question, plus brackets for emphasis.
[test:?foo2=bar2&baz2=foo2]
[test:?foo=bar&baz=foo]
[test:?foo2=bar2&baz2=foo2]
Remember that regex quantifiers are greedy and want to gobble up as much as they can without breaking the match. Each subsegment to terminate as soon as possible, which means .*? semantics.
Each subsegment terminates with either another test:? or end-of-string, which we look for with (?=...) zero-width lookahead wrapped around | for alternatives.
The pattern in the code above uses Perl’s /x regex switch for readability. Depending on the language and libraries you’re using, you may need to remove the extra whitespace.

Three issues:
(?!) is a negative lookahead assertion. You want (?=) instead, requiring that what comes next is test:?.
The .* is greedy; you want it non-greedy so that you grab just the first chunk.
You're wanting the last chunk also, so you want to match $ as well at the end.
End result:
(?:test:\?)(.*?)(?=test:\?|$)
I've also removed the outer group, seeing no point in it. All RE engines that I know of let you access group 0 as the full match, or some other such way (though perhaps not when finding all matches). You can put it back if you need to.
(This works in PCRE; not sure if it would work with POSIX regular expressions, as I'm not in the habit of working with them.)
If you're just wanting to split on test:?, though, regular expressions are the wrong tool. Split the strings using your language's inbuilt support for such things.
Python:
>>> re.findall('(?:test:\?)(.*?)(?=test:\?|$)',
... 'test:?foo2=bar2&baz2=foo2test:?foo=bar&baz=footest:?foo2=bar2&baz2=foo2')
['foo2=bar2&baz2=foo2', 'foo=bar&baz=foo', 'foo2=bar2&baz2=foo2']

You probably want ((?:test:\?)(.*?)(?=test:\?)), although you haven't told us what language you're using to drive the regexes.
The .*? matches as few characters as possible without preventing the whole string from matching, where .* matches as many as possible (is greedy).
Depending, again, on what language you're using to do this, you'll probably need to match, then chop the string, then match again, or call some language-specific match_all type function.
By the way, you don't need to anchor a regex using a lookahead (you can just match the pattern to search for, instead), so this will (most likely) do in your case:
test:[?](.*?)test:[?]

regex implementation to replace group with its lowercase version

Is there any implementation of regex that allow to replace group in regex with lowercase version of it?

If your regex version supports it, you can use \L, like so in a POSIX shell:
sed -r 's/(^.*)/\L\1/'

In Perl, you can do:
$string =~ s/(some_regex)/lc($1)/ge;
The /e option causes the replacement expression to be interpreted as Perl code to be evaluated, whose return value is used as the final replacement value. lc($x) returns the lowercased version of $x. (Not sure but I assume lc() will handle international characters correctly in recent Perl versions.)
/g means match globally. Omit the g if you only want a single replacement.

If you're using an editor like SublimeText or TextMate1, there's a good chance you may use
\L$1
as your replacement, where $1 refers to something from the regular expression that you put parentheses around. For example2, here's something I used to downcase field names in some SQL, getting everything to the right of the 'as' at the end of any given line. First the "find" regular expression:
(as|AS) ([A-Za-z_]+)\s*,$
and then the replacement expression:
$1 '\L$2',
If you use Vim (or presumably gvim), then you'll want to use \L\1 instead of \L$1, but there's another wrinkle that you'll need to be aware of: Vim reverses the syntax between literal parenthesis characters and escaped parenthesis characters. So to designate a part of the regular expression to be included in the replacement ("captured"), you'll use \( at the beginning and \) at the end. Think of \ as—instead of escaping a special character to make it a literal—marking the beginning of a special character (as with \s, \w, \b and so forth). So it may seem odd if you're not used to it, but it is actually perfectly logical if you think of it in the Vim way.
1 I've tested this in both TextMate and SublimeText and it works as-is, but some editors use \1 instead of $1. Try both and see which your editor uses.
2 I just pulled this regex out of my history. I always tweak regexen while using them, and I can't promise this the final version, so I'm not suggesting it's fit for the purpose described, and especially not with SQL formatted differently from the SQL I was working on, just that it's a specific example of downcasing in regular expressions. YMMV. UAYOR.

Several answers have noted the use of \L. However, \E is also worth knowing about if you use \L.
\L converts everything up to the next \U or \E to lowercase. ... \E turns off case conversion.
(Source: https://www.regular-expressions.info/replacecase.html )
So, suppose you wanted to use rename to lowercase part of some file names like this:
artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
artist_-_album_-_Another_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
you could do something like:
rename -v 's/^(.*_-_)(.*)(_-_.*.m4a)/$1\L$2\E$3/g' *

In Perl, there's
$string =~ tr/[A-Z]/[a-z]/;

Most Regex implementations allow you to pass a callback function when doing a replace, hence you can simply return a lowercase version of the match from the callback.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js