Where is lookaround regex supported and where isn't it? - regex

Trying to improve my regex skills, I wanted to learn about lookahead and lookbehind expressions. On my Archlinux system I tried the following:
a=ab;if [[ $a =~ [a-z](?=b) ]]; then echo "Y";else echo "N";fi
Which, as far as I understand it, should match and thus echo out a "Y", but doesn't.
echo ab |sed 's/[a-z](?=b)/x/'
...also doesn't seem to match.
grep doesn't seem to lookaround either, but pcregrep does. I also tried several attempts on quoteing and/or escaping the expressions, to no avail.
I'm a little confused, now. Could someone please clarify where lookaround, which doesn't seem that exotic judging from the number of mentions in tutorials, can actually be used? Or did I just mess up escaping my expressions?

Lookaround assertions aren't supported by basic or extended posix regular expressions which are available in bash or sed.
A good tool to test is GNU grep which supports the -P option for perl compatible regular expressions. Like this:
grep --color=auto -P '[a-z](?=b)' <<< 'ab'
Even a greater resource are online regex testing tools like https://regex101.com/

You should distinguish between basic and extended Regular Expressions.
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; They need to be escaped to get their "regex" meaning.
On the other hand, in the extended Regular Expressions, these characters get their "regex" meaning.
If you grep --help, you'll get:
-E, --extended-regexp PATTERN is an extended regular expression (ERE)
Note that grep doesn't support look-arounds, it's supported in pcregrep.

Related

How can I translate a regex within vim to work with sed?

I have a string that exists within a text file that I am trying to modify with regex.
"configuration_file_for_wks_33-40"
and I want to modify it so that it looks like this
"configuration_file_for_wks_33-40_6ks"
Within vim I can accomplish this with the following regex command
%s/33-\(\d\d\)/33-\1_6ks/
But if I try to pass that regex command to sed such as
sed 's/33-\(\d\d\)/33-\1_6ks/' input_file.json
The string is not changed, even if I include the -e parameter.
I have also tried to do this using ex as
echo '%s/33-\(\d\d\)/33-\1_6ks/' | ex input_file.json
If I use
sed 's/wks_33-\(\d\d\)*/wks_33-\1_6ks/' input_file.json
then I get
configuration_file_for_wks_33-_6ks40
For that, I've tried various different escaping patterns without any luck.
Can someone help me understand why this changes are not working?
vim has a different syntax for regular expressions (which is even configurable). Unfortunately, sed doesn't understand \d (see https://unix.stackexchange.com/a/414230/304256). With -E, you can match digits with [0-9] or [[:digit:]]:
$ sed -E 's/33-[0-9][0-9]/&_6ks/'
configuration_file_for_wks_33-40_6ks
Note that you can use & in the replacement for adding the entire matched string.
So why is this:
$ sed 's/wks_33-\(\d\d\)*/wks_33-\1_6ks/' input_file.json
configuration_file_for_wks_33-_6ks40
Here, (\d\d)* is simply matched 0 times, so you replace wks_33- by wks_33-_6ks (\1 is a zero-length string) and 40 remains where it was before.
Translation from one language to another is best done with some reference material on hand:
sed BRE syntax
sed ERE syntax
sed classes
sed RE extensions
The superficial reading of which shows that sed doesn't support \d.
Possible alternatives to \d\d:
[[:digit:]]\{2\}
[0-9]\{2\}
How can I translate a regex within vim to work with sed?
Since you write "a regex", I think you refer to any regex.
Translating a Vim regex to a Sed regex is not always possible, because a Vim regex can have lookarounds, whereas a Sed regex has no such things.

Sed doesn't work in command line however regular expression in online test regex101 works

I have a string like
July 20th 2017, 11:03:37.620 fc384c3d-9a75-459d-ba92-99069db0e7bf
I need to remove everything from the beginning of the line till the UUID substring (it's a tab, \t just before the UUID).
My regex looks like that:
^\s*July(.*)\t
When I test it in regex101 it all works beatufully: https://regex101.com/r/eZ1gT7/1077
However, when I plonk that into a sed command it doesn't do any substitution:
less pensionQuery.txt | sed -e 's/^\s*July(.*)\t//'
where pensionQuery.txt is a file full of the lines similar to the above. So the command above simply spits out unmodified file contnent.
Is my sed command wrong?
Any ideas?
The regex is right, you are not trying sed with --regexp-extended
'-E'
'--regexp-extended'
Use extended regular expressions rather than basic regular
expressions. Extended regexps are those that egrep accepts; they
can be clearer because they usually have fewer backslashes.
Historically this was a GNU extension, but the -E extension has
since been added to the POSIX standard
echo -e $'July 20th 2017, 11:03:37.620\tfc384c3d-9a75-459d-ba92-99069db0e7bf' |
sed -E 's/^\s*July(.*)\t//'
fc384c3d-9a75-459d-ba92-99069db0e7bf
Also a simple read-up on Basic (BRE) and extended (ERE) regular expression
Basic and extended regular expressions are two variations on the syntax of the specified pattern. Basic Regular Expression (BRE) is the default in sed (and similarly in grep). Extended Regular Expression syntax (ERE) is activated by using the -r or -E options (and similarly, grep -E).

Regex for prosodically-defined words: working in Atom but not grep

I'm trying to search a .txt dictionary for all trisyllabic roots, and then have the matching roots passed to a new .txt file. The dictionary in question is a raw text version of Heath's Nunggubuyu dictionary. When I search the file in Atom (my preferred text editor), the following string does a pretty good job of singling out the desired roots and eliminating any material from the definitions below the headwords (which begin with whitespace), as well as any English words, and any trisyllabic strings interrupted by a hyphen or equals sign (which mean they are not monomorphemic roots). Forgive me if it looks clunky; I'm an absolute beginner. (In this orthography, vowel length is indicated with a ':', and there are only three vowels 'a,i,u'. None of the headwords have uppercase letters.)
^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b
However, I need the matched strings to be output to a new file. When I try using this same string in grep (on a Mac), nothing is matched. I use the syntax
grep -o "^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b" Dict-nofrontmatter.txt > output.txt
I've been searching for hours trying to figure out how to translate from Atom's regex dialect to grep (Mac), to no avail. Whenever I do manage to get matches, the results looks wildly different to what I expect, and what I get from Atom. I've also looked at some apparent grep tools for Atom, but the documentation is virtually non-existent so I can't work out what they even do. What am I getting wrong here? Should I try an alternative to grep?
grep supports different regex styles. From man re_format:
Regular expressions ("RE"s), as defined in POSIX.2, come in two
forms:
modern REs (roughly those of egrep; POSIX.2 calls these extended REs) and
obsolete REs (roughly those of ed(1); POSIX.2 basic REs).
Grep has switches to choose which variant is used. Sorted from less to many features:
fixed string: grep -F or fgrep
No regex at all. Plain text search.
basic regex: grep -G or just grep
|, +, and ? are ordinary characters. | has no equivalent. Parentheses must be escaped to work as sub-expressions.
extended regex: grep -E or egrep
"Normal" regexes with |, +, ? bounds and so on.
perl regex: grep -P (for GNU grep, not pre-installed on Mac)
Most powerful regexes. Supports lookaheads and other features.
In your case you should try grep -Eo "^\S....
Possibly the only thing missing from your grep command is the -E option:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
grep -Eo "$regex" Dict-nofrontmatter.txt > output.txt
-E activates support for extended (modern) regular expressions, which work as one expects nowadays (duplication symbols + and ? work as expected, ( and ) form capture groups, | is alternation).
Without -E (or with -G) basic regular expressions are assumed - a limited legacy form that differs in syntax. Given that -E is part of POSIX, there's no reason not to use it.
On macOS, grep does understand character-class shortcuts such as \S and \W, and also word-boundary assertions such as \b - this is in contrast with the other BSD utilities that macOS comes with, notably sed and awk.
It doesn't look like you need it, but PRCEs (Perl-compatible Regular Expressions) would provide additional features, such as look-around assertions.
macOS grep doesn't support them, but GNU grep does, via the -P option. You can install GNU grep on macOS via Homebrew.
Alternatively, you can simply use perl directly; the equivalent of the above command would be:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
perl -lne "print for m/$regex/g" Dict-nofrontmatter.txt > output.txt

Regular expression to search for not-a-specific-sequence-of-characters

I am looking for a regular expression that matches "not-a-specific-sequence-of-characters". A solution suddenly dawned on me (after a few years!) I am running bash on a Macintosh computer.
As an example, I want to match the word Path as long as it is not preceded by the word posix or Posix. Here is the regular expression I came up with:
[^[:space:]]*([^x]|[^i]x|[^s]ix|[^o]six|[^Pp]osix)Path
I would like to ask if there might be a more efficient or otherwise better approach. This approach can become somewhat cumbersome the longer the "not" sequence of characters is.
Perl regexes have handy "look-around" features.
perl -ne 'print if /(?<![pP]osix)Path' file
GNU grep has a -P flag to enable perl-compatible regular expressions, but OSX does not have GNU tools by default.
A straightforward technique is to filter the output of grep:
grep 'Path' file | grep -v '[pP]osixPath'

SED regular expressions trouble

I have build the following regular expression in order to fix a big sql dump with invalid tags
This searches
\[ame=(?:\\"){0,1}(?:http://){0,1}(http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^&,",\\]+))[^\]]*\].+?video\]|\[video\](http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^\[,&,\\,"]+))\[/video\]
This replaces
[video=youtube;$2$4]$1$3[/video]
So this:
[ame=\"http://www.youtube.com/watch?v=FD5ArmOMisM\"]YouTube - Official Install Of X360FDU![/video]
will become
[video=youtube;FD5ArmOMisM]http://www.youtube.com/watch?v=FD5ArmOMisM[/video]
It behaves like a charm in EditPadPro (Windows) but it gives me conflicts with the codepages when I try to import it in my Linux based MySQL.
So since the file comes from a Linux installation I tried my luck with SED but it gives me errors errors errors. Obviously it has a different way to build regular expressions.
It is quite urgent to do the substitutions so I have no time reading the SED manual.
Can you give a hand to migrate my regular expressions to a SED friendly format?
Thanx in advance!
UPDATE: I added the escape chars proposed
\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\))[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\))\[\/video\]
but I still get errors - Unkown command: ')'
Your regular expressions are using PCRE - Perl Compatible Regular Expression - notations. As defined by POSIX (codifying what was standardized by 7th Edition Unix circa 1978, which was a continuation of the previous versions of Unix), sed does not support PCRE.
Even GNU sed version 4.2.1, which supports ERE (extended regular expressions) as well as BRE (basic regular expressions) does not support PCRE.
Your best bet is probably to use Perl to provide you with the PCRE you need. Failing that, take the scripting language of your choice with PCRE support.
Sed just has some different escaping rules to the Regex flavor you're using.
() escaped \( \) - for grouping
[] are not - for character classes
{} escaped \{ \} - for numerators
\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\)\)[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\)\)\[\/video\]
I noticed a couple of unescaped )'s on enclosing groups.