SED regular expressions trouble

I have built the following regular expression in order to fix a big SQL dump with invalid tags.
This searches:
\[ame=(?:\\"){0,1}(?:http://){0,1}(http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^&,",\\]+))[^\]]*\].+?video\]|\[video\](http://(?:www.|uk.|fr.|il.|hk.){0,1}youtube.com/watch\?v=([^\[,&,\\,"]+))\[/video\]
This replaces
[video=youtube;$2$4]$1$3[/video]
So this:
[ame=\"http://www.youtube.com/watch?v=FD5ArmOMisM\"]YouTube - Official Install Of X360FDU![/video]
will become
[video=youtube;FD5ArmOMisM]http://www.youtube.com/watch?v=FD5ArmOMisM[/video]
It works like a charm in EditPadPro (Windows), but the result gives me code page conflicts when I try to import it into my Linux-based MySQL.
Since the file comes from a Linux installation, I tried my luck with SED, but it gives me errors, errors, errors. Obviously it has a different way of building regular expressions.
The substitutions are quite urgent, so I have no time to read the SED manual.
Can you give me a hand migrating my regular expressions to a SED-friendly format?
Thanx in advance!
UPDATE: I added the escape chars proposed
\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\))[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\))\[\/video\]
but I still get errors - Unknown command: ')'

Your regular expressions are using PCRE - Perl Compatible Regular Expression - notations. As defined by POSIX (codifying what was standardized by 7th Edition Unix circa 1978, which was a continuation of the previous versions of Unix), sed does not support PCRE.
Even GNU sed version 4.2.1, which supports ERE (extended regular expressions) as well as BRE (basic regular expressions) does not support PCRE.
Your best bet is probably to use Perl to provide you with the PCRE you need. Failing that, take the scripting language of your choice with PCRE support.

Sed just has some different escaping rules from the regex flavor you're using.
() escaped \( \) - for grouping
[] are not - for character classes
{} escaped \{ \} - for quantifiers (repetition counts)
\[ame=\(?:\\"\)\{0,1\}\(?:http:\/\/\)\{0,1\}\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^&,",\\]+\)\)[^\]]*\].+?video\]|\[video\]\(http:\/\/\(?:www.|uk.|fr.|il.|hk.\)\{0,1\}youtube.com\/watch\?v=\([^\[,&,\\,"]+\)\)\[\/video\]
I noticed a couple of unescaped )'s on enclosing groups.

Related

How to comprehend expression/pattern in find/grep/rsync?

I have to use find, grep and rsync commands for my program. Generally, I rarely used all of these in a single script so didn't notice earlier. Is there a category of regular-expression that fit these commands like:
find command: follows regex type1
grep command: follows regex type2
rsync command: follows regex type3
For example, for finding all the paths which lead to my program log file, we can do:
find -type f -name "foo.log*"
Here, in the above command, the star is not acting like a proper regular expression: in a regex, the star matches zero, one, or multiple instances of the immediately preceding expression, which is the character 'g' in this case. So if it actually followed regex rules, it would match filenames like:
foo.lo
foo.log
foo.logg
foo.loggg
and so on...
The rsync command behaves like find when given an expression for its source and destination paths. On the other hand, I noticed that the grep command does follow regular expressions.
So, in total:
Do all of these commands follow a different kind of regular expression?
Or do some of them follow regex while others do not, and if not, what pattern do they follow? Basically, I'm looking for a generalisation of the patterns used by all these tools.
I'm new to Linux tools. Please guide!
There is a big difference between wildcards and regular expressions.
Wildcards:
special characters that define a simple search pattern
used by shells (bash, old MS-DOS, ...), and by many unix commands (find, ...)
limited set of wildcards, typically just:
* - zero or more chars (any combination)
? - exactly one char (any char)
[...] - exactly one char out of a set or range of chars, such as [0-9a-f] for a hex digit
see tutorial: https://linuxhint.com/bash_wildcard_tutorial/
Regular Expression:
a sequence of characters that define a search pattern
think of regular expressions (regex for short) as wildcards on steroids
regex patterns are used to find or find and replace strings
powerful language, natively supported by most programming languages
there are different flavors of regular expressions, typically grouped into these categories:
POSIX Basic (BRE - Basic Regular Expressions)
POSIX Extended (ERE - Extended Regular Expressions)
Perl and PCRE (Perl Compatible Regular Expressions)
JavaScript
many more flavors, see https://en.wikipedia.org/wiki/Comparison_of_regular-expression_engines
some unix commands allow you to select one regex flavor or another; for example:
grep uses POSIX Basic by default
grep -E or egrep uses POSIX Extended
grep -P uses Perl-compatible regular expressions (PCRE; GNU grep only)
Wikipedia article: https://en.wikipedia.org/wiki/Regular_expression
tutorial: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
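To make the difference concrete, here is a small sketch that tests the same text, `foo.log*`, once as a shell wildcard and once as a POSIX basic regular expression. The helper names `glob_match` and `regex_match` are made up for this example.

```shell
# Wildcard: "*" means "any run of characters", so the pattern
# matches every name that starts with "foo.log".
glob_match() {
  case "$1" in
    foo.log*) echo yes ;;
    *)        echo no ;;
  esac
}

# Regex (POSIX BRE via grep): "g*" means "zero or more g" and "."
# matches any single character. ^ and $ anchor the whole name.
regex_match() {
  if printf '%s\n' "$1" | grep -q '^foo.log*$'; then
    echo yes
  else
    echo no
  fi
}

for name in foo.lo foo.log foo.loggg foo.log.1; do
  printf '%-10s glob=%s regex=%s\n' "$name" "$(glob_match "$name")" "$(regex_match "$name")"
done
```

The two readings disagree on `foo.lo` (only the regex matches) and on `foo.log.1` (only the wildcard matches), which is why `find -name "foo.log*"` finds rotated log files but not `foo.lo`.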

What flavor of regex does git use

I'm trying to use the git diff --word-diff-regex= command and it seems to reject any types of lookaheads and lookbehinds. I'm having trouble pinning down what flavor of regex git uses. For example
git diff --word-diff-regex='([.\w]+)(?!>)'
Comes back as an invalid regular expression.
I am trying to get all the words that are not HTML tags. So the resulting matches of the regex should be 'Hello' 'World' 'Foo' 'Bar' for the below string
<p> Hello World </p><p> Foo Bar </p>
The Git source uses regcomp and regexec, which are defined by POSIX 1003.2. The code to compile a diff regexp is:
if (regcomp(ecbdata->diff_words->word_regex,
            o->word_regex,
            REG_EXTENDED | REG_NEWLINE))
which in POSIX means that these are "extended" regular expressions as defined here.
(Not every C library actually implements the same POSIX REG_EXTENDED. Git includes its own implementation, which can be built in place of the system's.)
Edit (per updated question): POSIX EREs have neither lookahead nor lookbehind, nor do they have \w (but [_[:alnum:]] is probably close enough for most purposes).
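As a rough illustration of that translation, the sketch below rewrites the `[.\w]+` part of the question's pattern as a POSIX ERE bracket expression and runs it through `grep -oE` on the sample string. It assumes a grep that has the nonstandard `-o` option (GNU and BSD grep both do); the variable name `words` is arbitrary.

```shell
# PCRE-style [.\w]+ rewritten for POSIX ERE: \w becomes "_" plus [:alnum:].
# The lookahead (?!>) has no ERE equivalent and is simply dropped here.
words=$(printf '%s\n' '<p> Hello World </p><p> Foo Bar </p>' | grep -oE '[._[:alnum:]]+')
printf '%s\n' "$words"
```

The same bracket expression should be accepted by `git diff --word-diff-regex='[._[:alnum:]]+'`. Note that without the lookahead the tag names (`p`) are matched as words too, so this only fixes the syntax error, not the "skip HTML tags" part of the question.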
Thanks to the hints from @torek's answer above, I now realize that there are different flavors of regular expression engines, and that they can even have different syntax.
Even one particular program, such as git, could be compiled with a different regex engine. For example, this blog post hints that \w is supported by git, contradicting what I observed on my machine and what the OP here asked.
I ended up finding this section of the Wikipedia page you recommended most helpful, since it presents the different syntaxes in one table, so that I could do some "translation" between, for example, [:alnum:] and \w, [:digit:] and \d, [:space:] and \s, etc.

Translate PCRE pattern to POSIX

I have the following pcre that works just fine:
/[c,f]=("(?:[a-z A-Z 0-9]|-|_|\/)+\.(?:js|html)")/g
It produces the desired output "foo.js" and "bar.html" from the inputs
<script src="foo.js"...
<link rel="import" href="bar.html"...
Problem is, the OS X version of grep doesn't seem to have any option like -o to only print the captured group (according to another SO question, that apparently works on linux). Since this will be part of a makefile, I need a version that I can count on running on any *nix platform.
I tried sed but the following
s/[c,f]=("(?:[[:alphanum:]]|-|_|\/)+\.(?:js|html)")/\1/pg
Throws an error: 'invalid operand for repetition-operator'. I've tried trimming it down and excluding the filepath separator characters, but I just can't seem to crack it. Any help translating my PCRE into something that I'm pretty much guaranteed to have on a POSIX-compliant (even unofficially so) platform?
P.S. I'm aware of the potential failure modes inherent in the regex I wrote, it only will be used against very specific files with fairly specific formatting.
POSIX defines two flavors of regular expressions:
BREs (Basic Regular Expressions) - the older flavor with fewer features and the need to \-escape certain metacharacters, notably \(, \) and \{, \}, and no support for duplication symbols \+ (emulate with \{1,\}) and \? (emulate with \{0,1\}), and no support for \| (alternation; cannot be emulated).
EREs (Extended Regular Expressions) - the more modern flavor, which, however, lacks regex-internal back-references (which is not the same as capture groups); also, there is no support for word-boundary assertions (e.g., \<) and no support for non-capturing groups.
POSIX also mandates which utilities support which flavor: some support only BREs, some only EREs, and some let you choose either; notably:
grep uses BREs by default, but can enable EREs with -E
sed, sadly, only supports BREs
Both GNU and BSD sed, however, - as a nonstandard extension - do support EREs with the -E switch (the better known alias with GNU sed is -r, but -E is supported too).
awk only supports EREs
Additionally, the regex libraries on both Linux and BSD/OSX implement extensions to the POSIX ERE syntax - sadly, these extensions are in part incompatible (such as the syntax for word-boundary assertions).
As for your specific regex:
It uses the syntax for non-capturing groups, (?:...), which POSIX EREs do not support; in any case, the distinction between capturing and non-capturing groups is pointless in the context of grep, because grep offers no replacement feature.
If we remove this aspect, we get:
[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")
This is now a valid POSIX ERE (which can be simplified - see Benjamin W's helpful answer).
However, since it is an Extended RE, using sed is not an option, if you want to remain strictly POSIX-compliant.
Because both GNU and BSD/OSX sed happen to implement -E to support EREs, you can get away with sed, if these platforms are the only ones you need to support - see anubhava's answer.
Similarly, both GNU and BSD/OSX grep happen to implement the nonstandard -o option (unlike what you state in your question), so, again, if these platforms are the only ones you need to support, you can use:
$ grep -Eo '[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")' file | cut -c 3-
"foo.js"
"bar.html"
(Note that only GNU grep supports -P to enable PCREs, which would simplify the solution to the following (note the \K, which drops everything matched so far):
$ grep -Po '[c,f]=\K("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")' file
)
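If you really wanted a strictly POSIX-compliant solution, you could use awk, whose match() function sets RSTART and RLENGTH so that the leading c= or f= can be skipped:
$ awk 'match($0, /[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")/) { print substr($0, RSTART + 2, RLENGTH - 2) }' file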
On OSX following sed should work with your given input:
sed -E 's~.*[cf]=("[ a-zA-Z0-9_/-]+\.(js|html)").*~\1~' file
"foo.js"
"bar.html"
The spec for POSIX sed points out that only basic regular expressions (BRE) are supported, so no + or |; non-capturing groups aren't even in the spec for extended regular expressions (ERE).
Thankfully, both GNU sed and BSD sed support ERE, so we can use alternation and the + quantifier.
A few points:
Did you really want that comma in the first bracket expression? I suspect it could be just [cf].
The expression
(?:[a-z A-Z 0-9]|-|_|\/)+
can be simplified to a single bracket expression,
[a-zA-Z0-9_\/ -]+
Only one space is needed. You can also use a POSIX character class: [[:alnum:]_/ -]+. Note that [:alphanum:] is not a valid class name (the name is [:alnum:]); not sure if that is what tripped sed up.
For the whole expression between quotes, I'd just use an expression for "something between quotes, ending in .js or .html, preceded by non-quotes":
"[^"]+\.(js|html)"
To emulate grep -o behaviour, you have to also match everything before and after your expression on the line with .* at the start and end of your regex.
All in all, I'd say that for a sed using ERE (-r option for GNU sed, -E option for BSD sed), this should work:
sed -rn 's/.*[cf]=("[^"]+\.(js|html)").*/\1/p' infile
Or, with BRE only (requiring two commands because of the alternation):
sed -n 's/.*[cf]=\("[^"][^"]*\.js"\).*/\1/p;s/.*[cf]=\("[^"][^"]*\.html"\).*/\1/p' infile
Notice how BRE can emulate the + quantifier with [abc][abc]* instead of [abc]+.
The limitation to this approach is that if there are multiple matches on the same line, only the first one will be printed, because the s/// command removes everything before and after the part we extract.

Regular expression to search for not-a-specific-sequence-of-characters

I am looking for a regular expression that matches "not-a-specific-sequence-of-characters". A solution suddenly dawned on me (after a few years!). I am running bash on a Macintosh computer.
As an example, I want to match the word Path as long as it is not preceded by the word posix or Posix. Here is the regular expression I came up with:
[^[:space:]]*([^x]|[^i]x|[^s]ix|[^o]six|[^Pp]osix)Path
I would like to ask if there might be a more efficient or otherwise better approach. This approach can become somewhat cumbersome the longer the "not" sequence of characters is.
Perl regexes have handy "look-around" features.
perl -ne 'print if /(?<![pP]osix)Path/' file
GNU grep has a -P flag to enable perl-compatible regular expressions, but OSX does not have GNU tools by default.
A straightforward technique is to filter the output of grep:
grep 'Path' file | grep -v '[pP]osixPath'

Seeking comparison table for different regexes

I use vim, sed, bash and Perl. Each has somewhat different regex syntax. I just spent time finding out that I need to escape the curly braces in sed, but not in bash (when using them as counter elements). Grrr.
Can anybody point me to a table that summarizes the differences between the regex parsers in these 4 environments?
TIA
http://www.regular-expressions.info/refflavors.html - scroll down a bit.
Bash uses POSIX regexes. Sed and vim (which uses ed) use what are listed as "GNU BRE", although this depends on what flags you pass.
Jan Goyvaerts's site regular-expressions.info has a listing of popular regex engines and which options they support.