Translate PCRE pattern to POSIX - regex

I have the following pcre that works just fine:
/[c,f]=("(?:[a-z A-Z 0-9]|-|_|\/)+\.(?:js|html)")/g
It produces the desired output "foo.js" and "bar.html" from the inputs
<script src="foo.js"...
<link rel="import" href="bar.html"...
Problem is, the OS X version of grep doesn't seem to have any option like -o to only print the captured group (according to another SO question, that apparently works on linux). Since this will be part of a makefile, I need a version that I can count on running on any *nix platform.
I tried sed but the following
s/[c,f]=("(?:[[:alphanum:]]|-|_|\/)+\.(?:js|html)")/\1/pg
Throws an error: 'invalid operand for repetition-operator'. I've tried trimming it down and excluding the filepath separator characters, but I just can't seem to crack it. Any help translating my pcre into something that I'm pretty much guaranteed to have on a POSIX-compliant (even unofficially so) platform?
P.S. I'm aware of the potential failure modes inherent in the regex I wrote, it only will be used against very specific files with fairly specific formatting.

POSIX defines two flavors of regular expressions:
BREs (Basic Regular Expressions) - the older flavor with fewer features and the need to \-escape certain metacharacters, notably \(, \) and \{, \}, and no support for duplication symbols \+ (emulate with \{1,\}) and \? (emulate with \{0,1\}), and no support for \| (alternation; cannot be emulated).
EREs (Extended Regular Expressions) - the more modern flavor, which, however, lacks regex-internal back-references (which is not the same as capture groups); also, there is no support for word-boundary assertions (e.g., \<) and no support for non-capturing groups.
POSIX also mandates which flavor each utility supports: some support only BREs, some only EREs, and some can use either; notably:
grep uses BREs by default, but can enable EREs with -E
sed, sadly, only supports BREs
Both GNU and BSD sed, however, - as a nonstandard extension - do support EREs with the -E switch (the better known alias with GNU sed is -r, but -E is supported too).
awk only supports EREs
Additionally, the regex libraries on both Linux and BSD/OSX implement extensions to the POSIX ERE syntax - sadly, these extensions are in part incompatible (such as the syntax for word-boundary assertions).
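To make the flavor difference concrete, here is a small sketch that expresses "one or more" and "zero or one" in both flavors, using only POSIX grep options. The sample input and the helper name count_matches are made up for illustration.

```shell
# Count how many of three sample lines match a pattern.
# count_matches is a hypothetical helper; grep options, if any, go first.
count_matches() {
  printf '%s\n' 'ac' 'abc' 'abbc' | grep -c "$@"
}

count_matches 'ab\{1,\}c'     # BRE: interval emulates "one or more"  (abc, abbc)
count_matches -E 'ab+c'       # ERE: the same with +                   (abc, abbc)
count_matches 'ab\{0,1\}c'    # BRE: interval emulates "zero or one"  (ac, abc)
count_matches -E 'ab?c'       # ERE: the same with ?                   (ac, abc)
```

Each call should report 2 matching lines; only the spelling of the quantifier differs between the flavors.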
As for your specific regex:
It uses the syntax for non-capturing groups, (?:...); however, capture groups are pointless in the context of grep, because grep offers no replacement feature.
If we remove this aspect, we get:
[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")
This is now a valid POSIX ERE (which can be simplified - see Benjamin W's helpful answer).
However, since it is an Extended RE, using sed is not an option, if you want to remain strictly POSIX-compliant.
Because both GNU and BSD/OSX sed happen to implement -E to support EREs, you can get away with sed, if these platforms are the only ones you need to support - see anubhava's answer.
Similarly, both GNU and BSD/OSX grep happen to implement the nonstandard -o option (unlike what you state in your question), so, again, if these platforms are the only ones you need to support, you can use:
$ grep -Eo '[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")' file | cut -c 3-
"foo.js"
"bar.html"
(Note that only GNU grep supports -P to enable PCREs, which would simplify the solution to (note the \K, which drops everything matched so far):
$ grep -Po '[c,f]=\K("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")' file
)
If you really wanted a strictly POSIX-compliant solution, you could use awk:
$ awk -F\" '/[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")/ { print "\"" $2 "\"" }' file
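One caveat with splitting on ": $2 is the first quoted value on the line, which for the sample <link rel="import" href="bar.html" line is import rather than bar.html. A possible alternative, sketched here under the assumption that only the first match per line is wanted, uses awk's POSIX match() function together with RSTART and RLENGTH. The function name extract_refs and the simplified "[^"]+" pattern are illustrative choices, not part of the original command.

```shell
# Print the quoted .js/.html reference found on each input line.
# extract_refs is a hypothetical name; it reads files given as arguments or stdin.
# The match begins at the c or f, so substr() skips 2 characters: the letter and the = sign.
extract_refs() {
  awk 'match($0, /[cf]="[^"]+\.(js|html)"/) {
    print substr($0, RSTART + 2, RLENGTH - 2)
  }' "$@"
}

printf '%s\n' '<script src="foo.js"></script>' \
              '<link rel="import" href="bar.html">' | extract_refs
```

With the two sample lines this should print "foo.js" and "bar.html", quotes included.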

On OSX following sed should work with your given input:
sed -E 's~.*[cf]=("[ a-zA-Z0-9_/-]+\.(js|html)").*~\1~' file
"foo.js"
"bar.html"

The spec for POSIX sed points out that only basic regular expressions (BRE) are supported, so no + or |; non-capturing groups aren't even in the spec for extended regular expressions (ERE).
Thankfully, both GNU sed and BSD sed support ERE, so we can use alternation and the + quantifier.
A few points:
Did you really want that comma in the first bracket expression? I suspect it could be just [cf].
The expression
(?:[a-z A-Z 0-9]|-|_|\/)+
can be simplified to a single bracket expression,
[a-zA-Z0-9_\/ -]+
Only one space is needed. You can also use a POSIX character class: [[:alnum:]_/ -]+. Note that the class is spelled [:alnum:], not [:alphanum:], which may be what tripped sed up.
For the whole expression between quotes, I'd just use an expression for "something between quotes, ending in .js or .html, preceded by non-quotes":
"[^"]+\.(js|html)"
To emulate grep -o behaviour, you have to also match everything before and after your expression on the line with .* at the start and end of your regex.
All in all, I'd say that for a sed using ERE (-r option for GNU sed, -E option for BSD sed), this should work:
sed -rn 's/.*[cf]=("[^"]+\.(js|html)").*/\1/p' infile
Or, with BRE only (requiring two commands because of the alternation):
sed -n 's/.*[cf]=\("[^"][^"]*\.js"\).*/\1/p;s/.*[cf]=\("[^"][^"]*\.html"\).*/\1/p' infile
Notice how BRE can emulate the + quantifier with [abc][abc]* instead of [abc]+.
The limitation to this approach is that if there are multiple matches on the same line, only one of them will be printed (the last one, because the leading .* is greedy), since the s/// command removes everything before and after the part we extract.
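A small experiment with a made-up input line shows which candidate survives when two sit on the same line. Because the leading .* is greedy, the expectation is that the match closest to the end of the line is the one printed. The function name extract_js is hypothetical.

```shell
# Two candidate matches on one line (sample input invented for illustration).
line='<script src="a.js"></script><script src="b.js"></script>'

# The BRE substitution from above, restricted to .js for brevity.
extract_js() {
  sed -n 's/.*[cf]=\("[^"][^"]*\.js"\).*/\1/p'
}

printf '%s\n' "$line" | extract_js   # prints "b.js"
```

If every match is needed, one option is to split the input so that each tag lands on its own line before running sed.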

Related

Regex for prosodically-defined words: working in Atom but not grep

I'm trying to search a .txt dictionary for all trisyllabic roots, and then have the matching roots passed to a new .txt file. The dictionary in question is a raw text version of Heath's Nunggubuyu dictionary. When I search the file in Atom (my preferred text editor), the following string does a pretty good job of singling out the desired roots and eliminating any material from the definitions below the headwords (which begin with whitespace), as well as any English words, and any trisyllabic strings interrupted by a hyphen or equals sign (which mean they are not monomorphemic roots). Forgive me if it looks clunky; I'm an absolute beginner. (In this orthography, vowel length is indicated with a ':', and there are only three vowels 'a,i,u'. None of the headwords have uppercase letters.)
^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b
However, I need the matched strings to be output to a new file. When I try using this same string in grep (on a Mac), nothing is matched. I use the syntax
grep -o "^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b" Dict-nofrontmatter.txt > output.txt
I've been searching for hours trying to figure out how to translate from Atom's regex dialect to grep (Mac), to no avail. Whenever I do manage to get matches, the results look wildly different from what I expect, and from what I get in Atom. I've also looked at some apparent grep tools for Atom, but the documentation is virtually non-existent so I can't work out what they even do. What am I getting wrong here? Should I try an alternative to grep?
grep supports different regex styles. From man re_format:
Regular expressions ("RE"s), as defined in POSIX.2, come in two
forms:
modern REs (roughly those of egrep; POSIX.2 calls these extended REs) and
obsolete REs (roughly those of ed(1); POSIX.2 basic REs).
Grep has switches to choose which variant is used. Sorted from fewest to most features:
fixed string: grep -F or fgrep
No regex at all. Plain text search.
basic regex: grep -G or just grep
|, +, and ? are ordinary characters. | has no equivalent. Parentheses must be escaped to work as sub-expressions.
extended regex: grep -E or egrep
"Normal" regexes with |, +, ? bounds and so on.
perl regex: grep -P (for GNU grep, not pre-installed on Mac)
Most powerful regexes. Supports lookaheads and other features.
In your case you should try grep -Eo "^\S....
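The difference between the basic and extended variants is easy to check with a two-line sample (invented here for illustration; sample is a hypothetical helper):

```shell
# Emit two test lines: one with repeated b, one with a literal plus sign.
sample() { printf '%s\n' 'abb' 'ab+'; }

sample | grep 'ab+'      # basic: + is an ordinary character, so only the line ab+ matches
sample | grep -E 'ab+'   # extended: + means "one or more b", so both lines match
```

This is why a pattern full of + quantifiers can match nothing at all when -E is omitted.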
Possibly the only thing missing from your grep command is the -E option:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
grep -Eo "$regex" Dict-nofrontmatter.txt > output.txt
-E activates support for extended (modern) regular expressions, which work as one expects nowadays (duplication symbols + and ? work as expected, ( and ) form capture groups, | is alternation).
Without -E (or with -G) basic regular expressions are assumed - a limited legacy form that differs in syntax. Given that -E is part of POSIX, there's no reason not to use it.
On macOS, grep does understand character-class shortcuts such as \S and \W, and also word-boundary assertions such as \b - this is in contrast with the other BSD utilities that macOS comes with, notably sed and awk.
It doesn't look like you need them, but PCREs (Perl-compatible Regular Expressions) would provide additional features, such as look-around assertions.
macOS grep doesn't support them, but GNU grep does, via the -P option. You can install GNU grep on macOS via Homebrew.
Alternatively, you can simply use perl directly; the equivalent of the above command would be:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
perl -lne "print for m/$regex/g" Dict-nofrontmatter.txt > output.txt

Given a string, how to remove the run of the first character? (sed)

I need a sed command that takes a string and removes all copies of the first character from the beginning (but not from the rest of the string).
For instance, AAABAC should produce BAC, because the first letter is A, so we remove the entire run of A's from the beginning.
My original thought was:
data=$(echo $data | sed 's/^.\+\(.*\)/\1/')
but this doesn't work (outputs empty string). If I replace the first . with a specific character, it will successfully work just for that character, but I can't get it to wildcard properly.
What I think is that the . matches the first character like I want, but then the + doesn't remember the letter I want and continues accepting every character until the end of the string, so that the parentheses contain nothing and so the whole string gets replaced with nothing. How can I initially accept any character, but then "lock in" that character for the +?
You can use:
$> s='AAABAC'
$> sed -E 's/^(.)\1*//' <<< "$s"
BAC
(.) will match the first character and captures it in group #1
\1* will match 0 or more instances of same character
Alternatively here is a pure BASH way of doing the same:
$> shopt -s extglob
$> echo "${s##+(${s:0:1})}"
BAC
${s:0:1} gives us the first character of $s and ##+(${s:0:1}) removes all the instances of first char from the start.
To provide a road map to the existing answers with respect to portability:
Note: It can be inferred from the syntax used in the question and from what answer was accepted that GNU sed is being used, but the question isn't tagged as such, and it may be of broader interest.
anubhava's helpful answer works with GNU sed, but not with (more) strictly POSIX-compliant sed implementations such as the one found on macOS.
Benjamin W.'s helpful answer works with GNU grep, due to requiring the -P option for PCRE support, which other grep implementations, such as the one found on macOS, do not support.
soronta's helpful answer works on platforms that use the GNU regular-expression libraries (most Linux distros), or, more generally, on platforms whose ERE (extended regular expression) syntax supports backreferences, as a nonstandard extension to the POSIX spec.
Note that =~, Bash's regex-matching operator, is one of the rare Bash features whose behavior is platform-dependent, due to using the respective platform's regex libraries.
Here's a POSIX-compliant solution that should work on all modern Unix-like platforms, because it uses BREs (basic regular expressions), for which POSIX does mandate backreference support:
$ echo 'AAABAC' | sed 's/^\(.\)\1*//'
BAC
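If spawning sed is undesirable, the same idea can be sketched with POSIX shell parameter expansion alone. This is only an illustration: strip_run is a hypothetical helper name, and what counts as one "character" depends on the shell and locale.

```shell
# Print the argument with the leading run of its first character removed.
# strip_run is a hypothetical name; it uses global variables because
# POSIX sh has no "local".
strip_run() {
  s=$1
  # First character = the string minus everything after its first character.
  first=${s%"${s#?}"}
  # Drop one leading copy at a time while the string still starts with it.
  while [ -n "$s" ] && [ "${s#"$first"}" != "$s" ]; do
    s=${s#"$first"}
  done
  printf '%s\n' "$s"
}

strip_run 'AAABAC'   # prints BAC
```

Quoting "$first" inside the expansion makes the shell treat it literally, so a leading * or ? is handled like any other character.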
You can do it with grep, if your grep understands Perl compatible regular expressions:
$ grep -Po '^(.)\1*\K.*' <<< 'AABAC'
BAC
or
$ grep -Po '^(.)\1*\K.*' <<< 'ABAC'
BAC
-o retains only the match, and \K is a variable-length look-behind, removing as many identical characters from the beginning of the string as possible.
Bash also supports regular expressions:
$ m='^(.)(\1*)(.*)'; [[ AAAAABAC =~ $m ]]; printf '%s' "${BASH_REMATCH[3]}"
BAC
Valid for GNU ERE regex system library (varies with the system).

Which characters must be masked when using grep and sed?

I have learned that when I use the command grep, I must mask the characters {, }, (, ) and |
But I have found now an example, where / was masked!
Which characters must be masked when using grep and sed command?
When writing regexes in a shell script, it is normally sensible to enclose the regex in single quotes. Then you don't have to worry about anything except single quotes that appear in the regex itself. Occasionally, it may make sense to enclose the regex in double quotes (if it involves matching single quotes and not matching double quotes), but then you have to be careful about $, the back-quote `, and backslashes \.
So:
grep -e '^.*([a-z]*)[[:space:]]*{[^}]*}$'
With sed, you need to worry about s/// operations when the search or replacement pattern itself contains slashes /. The simplest technique is to use an alternative character such as %:
sed -e 's%/where/it/was/%/it/goes/here/now/%'
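The alternative delimiter also helps when the paths live in shell variables. The sketch below uses hypothetical variable names and assumes the values contain no % and no regex metacharacters; double quotes are needed here so that the shell expands the variables.

```shell
# Hypothetical paths, used only to illustrate the delimiter choice.
old=/where/it/was/
new=/it/goes/here/now/

# relocate is a hypothetical helper: % is the delimiter, so the slashes
# inside $old and $new need no escaping.
relocate() {
  sed -e "s%$old%$new%"
}

printf '%s\n' '/where/it/was/file.txt' | relocate   # prints /it/goes/here/now/file.txt
```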
There are three or four dialects of grep:
Plain grep
Extended grep (grep -E, once upon a time known as egrep)
Fixed grep (grep -F, once upon a time known as fgrep)
Sometimes you get grep with PCRE (Perl-compatible Regular Expression) support: grep -P.
Even within 'plain grep', you can find there is some variability between implementations.
Similarly, there are two main dialects of sed:
Plain sed
Extended sed (sed -E or sed -r; sed -E is more widely available)
You need to read about POSIX BRE (basic regular expressions), supported by plain grep and plain sed, and POSIX ERE (extended regular expressions), supported by grep -E and sed -E (when EREs are supported by sed at all).
See also the POSIX specifications for grep and sed.
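As a quick illustration of why the answer to "what must be masked" depends on the dialect, parentheses behave in opposite ways in the two POSIX flavors. The two sample lines and the helper name sample are made up for this sketch.

```shell
# Emit two test lines: one with literal parentheses, one without.
sample() { printf '%s\n' 'f(x)' 'fx'; }

sample | grep 'f(x)'        # BRE: ( and ) are literal      -> matches f(x)
sample | grep 'f\(x\)'      # BRE: \( and \) form a group   -> matches fx
sample | grep -E 'f\(x\)'   # ERE: \( and \) are literal    -> matches f(x)
sample | grep -E 'f(x)'     # ERE: ( and ) form a group     -> matches fx
```

The same reversal applies to { and }, which is why the escaping rules for plain grep do not carry over to grep -E.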

How to use sed to replace regex capture group?

I have a large file with many scattered file paths that look like
lolsed_bulsh.png
I want to prepend these file names with an extended path like:
/full/path/lolsed_bullsh.png
I'm having a hard time matching and capturing these. currently i'm trying variations of:
cat myfile.txt| sed s/\(.+\)\.png/\/full\/path\/\1/g | ack /full/path
I think sed has some regex or capture group behavior I'm not understanding
In your regex, replace + with *:
sed -E "s/(.*)\.png/\/full\/path\/\1/g" <<< "lolsed_bulsh.png"
It prints:
/full/path/lolsed_bulsh
NOTE: The non standard -E option is to avoid escaping ( and )
Save yourself some escaping by choosing a different separator (and -E option), for example:
cat myfile.txt | sed -E "s|(..*)\.png|/full/path/\1|g" | ack /full/path
Note that where supported, the -E option ensures ( and ) don't need escaping.
sed uses POSIX BRE, and BRE doesn't support one or more quantifier +. The quantifier + is only supported in POSIX ERE. However, POSIX sed uses BRE and has no option to switch to ERE.
Use ..* to simulate .+ if you want to maintain portability.
Or if you can assume that the code is always run on GNU sed, you can use GNU extension \+. Alternatively, you can also use the GNU extension -r flag to switch to POSIX ERE. The -E flag in higuaro's answer has been tagged for inclusion in POSIX.1 Issue 8, and exists in POSIX.1-202x Draft 1 (June 2020).
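For completeness, here is a BRE-only sketch that keeps the .png extension, which the desired output in the question includes, and handles several names on one line. It relies on & (the whole match) in the replacement and assumes the file names contain no whitespace or slashes; prefix_png is a hypothetical name.

```shell
# Prepend /full/path/ to every bare *.png name in the input.
# prefix_png is a hypothetical helper; it reads files given as arguments or stdin.
prefix_png() {
  sed 's|[^[:space:]/][^[:space:]/]*\.png|/full/path/&|g' "$@"
}

printf '%s\n' 'see lolsed_bulsh.png and other.png here' | prefix_png
# prints: see /full/path/lolsed_bulsh.png and /full/path/other.png here
```

Using | as the delimiter avoids escaping the slashes, and [x][x]* stands in for the + quantifier that BRE lacks.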

Regular expression to search for not-a-specific-sequence-of-characters

I am looking for a regular expression that matches "not-a-specific-sequence-of-characters". A solution suddenly dawned on me (after a few years!) I am running bash on a Macintosh computer.
As an example, I want to match the word Path as long as it is not preceded by the word posix or Posix. Here is the regular expression I came up with:
[^[:space:]]*([^x]|[^i]x|[^s]ix|[^o]six|[^Pp]osix)Path
I would like to ask if there might be a more efficient or otherwise better approach. This approach can become somewhat cumbersome the longer the "not" sequence of characters is.
Perl regexes have handy "look-around" features.
perl -ne 'print if /(?<![pP]osix)Path/' file
GNU grep has a -P flag to enable perl-compatible regular expressions, but OSX does not have GNU tools by default.
A straightforward technique is to filter the output of grep:
grep 'Path' file | grep -v '[pP]osixPath'
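One thing to keep in mind with the two-stage filter is that it also drops a line that contains both posixPath and an acceptable Path. A possible POSIX awk sketch that avoids this masks the unwanted occurrences in a copy of the line and then tests the copy. The function name path_not_after_posix is hypothetical, and a newline serves as the mask because it cannot occur inside a record.

```shell
# Print lines containing "Path" that is not preceded by posix or Posix.
# path_not_after_posix is a hypothetical name; it reads files or stdin.
path_not_after_posix() {
  awk '{
    t = $0
    gsub(/[pP]osixPath/, "\n", t)   # mask unwanted occurrences in the copy only
  }
  t ~ /Path/' "$@"
}

printf '%s\n' 'usePath' 'posixPath' 'PosixPath and myPath' 'none' |
  path_not_after_posix
```

With the four sample lines, only usePath and the line "PosixPath and myPath" should be printed.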