Which characters must be masked when using grep and sed? - regex

I have learned that when I use the grep command, I must mask (escape) the characters {, }, (, ) and |.
But I have found now an example, where / was masked!
Which characters must be masked when using grep and sed command?

When writing regexes in a shell script, it is normally sensible to enclose the regex in single quotes. Then the shell leaves everything alone, and the only thing you have to worry about is single quotes that appear in the regex itself. Occasionally, it may make sense to enclose the regex in double quotes (if it involves matching single quotes and not matching double quotes), but then you have to be careful about $, the back-quote `, and the backslash \.
So:
grep -e '^.*([a-z]*)[[:space:]]*{[^}]*}$'
With sed, you need to worry about s/// operations when the search or replacement pattern itself contains slashes /. The simplest technique is to use an alternative character such as %:
sed -e 's%/where/it/was/%/it/goes/here/now/%'
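As a minimal sketch of the delimiter point, the two commands below perform the same substitution on a made-up sample line (the `line` variable and its contents are assumptions for illustration only), once with escaped slashes and once with % as the delimiter:

```shell
# Sample line containing slashes (made up for illustration).
line='cp /where/it/was/file /tmp'

# With / as the delimiter, every slash in the pattern and replacement needs a backslash.
escaped=$(printf '%s\n' "$line" | sed -e 's/\/where\/it\/was\//\/it\/goes\/here\/now\//')

# With % as the delimiter, the slashes are ordinary characters.
alternate=$(printf '%s\n' "$line" | sed -e 's%/where/it/was/%/it/goes/here/now/%')

printf '%s\n%s\n' "$escaped" "$alternate"
```

Both commands should print the same rewritten line; the second is simply easier to read.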
There are three or four dialects of grep:
Plain grep
Extended grep (grep -E, once upon a time known as egrep)
Fixed grep (grep -F, once upon a time known as fgrep)
Sometimes you get grep with PCRE (Perl-compatible Regular Expression) support: grep -P.
Even within 'plain grep', you can find there is some variability between implementations.
Similarly, there are two main dialects of sed:
Plain sed
Extended sed (sed -E or sed -r; sed -E is more widely available)
You need to read about POSIX BRE (basic regular expressions), supported by plain grep and plain sed, and POSIX ERE (extended regular expressions), supported by grep -E and sed -E (when EREs are supported by sed at all).
See also the POSIX specifications for grep and sed.
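To make the BRE/ERE difference concrete, here is a small sketch that runs the same two searches in both dialects. The sample text is made up, and the variable names are only for illustration:

```shell
# Made-up sample input: one C-like line and one line with a repeated letter.
sample='main() { return 0; }
aaa'

# BRE (plain grep): ( ) { } are literal; \{3\} is the interval operator.
bre_literal=$(printf '%s\n' "$sample" | grep -e '^.*([a-z]*)[[:space:]]*{[^}]*}$')
bre_interval=$(printf '%s\n' "$sample" | grep -e '^a\{3\}$')

# ERE (grep -E): the roles swap, so a literal ( ) or { needs a backslash,
# while {3} without backslashes is the interval operator.
ere_literal=$(printf '%s\n' "$sample" | grep -E -e '^.*\([a-z]*\)[[:space:]]*\{[^}]*}$')
ere_interval=$(printf '%s\n' "$sample" | grep -E -e '^a{3}$')

printf '%s\n' "$bre_literal" "$bre_interval" "$ere_literal" "$ere_interval"
```

In short, the backslash toggles the meaning: in a BRE it turns ( ) { } into operators, in an ERE it turns them back into literals.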

Related

sed regexp, number reformatting: how to escape for bash

I have a working (in macOS app Patterns) RegExp that reformats GeoJSON MultiPolygon coordinates, but don't know how to escape it for sed.
The file I'm working on is over 90 Mb in size, so bash terminal looks like the ideal place and sed the perfect tool for the job.
Search Text Example:
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
Desired outcome:
[[[37.9017735,69.400367955],[37.90098431,69.400425761],[37.90004869,69.400489545],[37.89915455,69.400578128],[37.89840665,69.400660744],[37.89747072,69.400762152],[37.89628639,69.400905283],[37.89545822,69.401014028],[37.89479369,69.401113128],[37.89414564,69.401195094],[37.89362565,69.401281229],[37.89276089,69.401414764],[37.89196611,69.401540312],[37.891721,69.401587053],[37.89137614,69.401634443],[37.89136515,69.401635893],[37.89114453,69.401663531],
My current RegExp:
((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))
and reformatting:
$1\.$2$4,$6.$7$9
The command should be something along these lines:
sed -i -e 's/ The RegExp escaped /$1\.$2$4,$6.$7$9/g' large_file.geojson
But what should be escaped in the RegExp to make it work?
My attempts always complain of being unbalanced.
I'm sorry if this has already been answered elsewhere, but I couldn't find even after extensive searching.
Edit: 2017-01-07: I didn't make it clear that the file contains properties other than just the GPS-points. One of the other example values picked from GeoJSON Feature properties is "35.642.1.001_001", which should be left unchanged. The braces check in my original regex is there for this reason.
That regex is not legal in sed; since it uses Perl syntax, my recommendation would be to use perl instead. The regular expression works exactly as-is, and even the command line is almost the same; you just need to add the -p option to get perl to operate in filter mode (which sed does by default). I would also recommend adding an argument suffix to the -i option (whether using sed or perl), so that you have a backup of the original file in case something goes horribly wrong. As for quoting, all you need to do is put the substitution command in single quotation marks:
perl -p -i.bak -e \
's/((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))/$1\.$2$4,$6.$7$9/g' \
large_file.geojson
If your data is just like you showed, you needn't worry about the brackets. You may use a POSIX ERE enabled with -E (or -r in some other distributions) like this:
sed -i -E 's/([0-9]{2})([0-9]*)\.([0-9]+)/\1.\2\3/g' large_file.geojson
Or a POSIX BRE:
sed -i 's/\([0-9]\{2\}\)\([0-9]*\)\.\([0-9]\+\)/\1.\2\3/g' large_file.geojson
See an online demo.
You may see how this regex works here (just a demo, not proof).
Note that in POSIX BRE you need to escape { and } in interval quantifiers and ( and ) in grouping constructs; otherwise they denote literal characters. The \+ quantifier used above is a GNU extension to BRE, not part of POSIX. In POSIX ERE, you do not need to escape these characters to make them special; this flavor is closer to modern regexes.
Also, the replacement pattern must refer to groups as \1, \2, and so on, not $1, $2.
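If the bracket check from the question is needed (so that values such as "35.642.1.001_001" are never touched), the original pattern can be written as an ERE along these lines. This is a sketch assuming a sed that accepts -E; the `input` sample is shortened from the question and the group numbering differs from the original because the helper groups were dropped:

```shell
# Shortened sample from the question, plus the property value that must stay unchanged.
input='[[[379017.735,6940036.7955],[378917.21,6940158.7053]] "35.642.1.001_001"'

# ERE: \[ is a literal opening bracket, a bare ] outside a bracket expression is literal,
# ( ) group without backslashes, and the replacement uses \1..\6 rather than $1..$9.
result=$(printf '%s\n' "$input" |
  sed -E 's/\[([0-9]{2})([0-9]+)\.([0-9]+),([0-9]{2})([0-9]+)\.([0-9]+)]/[\1.\2\3,\4.\5\6]/g')

printf '%s\n' "$result"
```

Only coordinate pairs enclosed in [ and ] are rewritten, so the quoted property value passes through untouched.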
A simple sed will do it:
$ echo "$var"
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
$ echo "$var" | sed 's/\([0-9]\{3\}\)\./.\1/g'
[[[379.017735,6940.0367955],[379.0098431,6940.0425761],[379.0004869,6940.0489545],[378.9915455,6940.0578128],[378.9840665,6940.0660744],[378.9747072,6940.0762152],[378.9628639,6940.0905283],[378.9545822,6940.1014028],[378.9479369,6940.1113128],[378.9414564,6940.1195094],[378.9362565,6940.1281229],[378.9276089,6940.1414764],[378.9196611,6940.1540312],[378.91721,6940.1587053],[378.9137614,6940.1634443],[378.9136515,6940.1635893],[378.9114453,6940.1663531],

Why does sed provide an "invalid content" error on linux but not on mac

I have the following sed extended regular expressions replacement inside a bash script:
sed -i.bak -E 's~^[[:blank:]]*\\iftoggle{[[:alnum:]_]+}{\\input{([[:alnum:]_\/]+)}}{}~\\input{\1}~' file.txt
which should replace strings like
\iftoggle{xx_yy}{\input{xx_yy/zz}}{}
with
\input{xx_yy/zz}
inside file.txt.
This works just fine locally, on OS X, but the script needs to be POSIX. Specifically, it fails on my remote Travis CI build (which uses Linux). While sed -E is not documented for GNU sed, it behaves just like sed -r and seems to work fine, allowing for a POSIX version of sed with extended regular expressions.
The error given is:
sed: -e expression #1, char 81: Invalid content of \{\}
I'm also not sure where the error starts counting characters from, whether it's the beginning of the line, or only that part which is encased in quotes (the expression)?
You don't need ERE here. Using BRE:
sed -i.bak 's~^[[:blank:]]*\\iftoggle{[[:alnum:]_][[:alnum:]_]*}{\\input{\([[:alnum:]_\/][[:alnum:]_\/]*\)}}{}~\\input{\1}~' file.txt
In a BRE, { does not need to be escaped to be literal, but ( must be written as \( to act as a group.
As + is not part of the BRE, you can replace [[:alnum:]_]+ with [[:alnum:]_][[:alnum:]_]* or with [[:alnum:]_]\{1,\}.
And as a side note, \+ can be used with GNU sed in BRE but keep in mind that it's not portable, it's a GNU extension.
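The \{1,\} variant mentioned above can be tried without touching any file. The following is a sketch that feeds the sample line from the question through sed on standard input (the `line` and `result` variable names are just for illustration):

```shell
# Sample line from the question.
line='\iftoggle{xx_yy}{\input{xx_yy/zz}}{}'

# Portable BRE: X\{1,\} stands in for X+, and \( \) capture.
# Plain braces are literal in a BRE, so the TeX braces need no backslash.
result=$(printf '%s\n' "$line" |
  sed 's~^[[:blank:]]*\\iftoggle{[[:alnum:]_]\{1,\}}{\\input{\([[:alnum:]_/]\{1,\}\)}}{}~\\input{\1}~')

printf '%s\n' "$result"
```

Because ~ is the delimiter, the / inside the bracket expression needs no escaping either.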
This does not directly answer the question with sed, but provides an alternate simpler way to do this in perl command-line regex search and replacement.
perl -p -e 's|\iftoggle\{(\w+)\}\{\\input\{(\w+)/(\w+)\}\}\{\}|\input\{\2/\3\}|g' file
\input{xx_yy/zz}
This uses | as the delimiter and \w+ to match runs of [[:alnum:]_] characters.
For in-place replacement, use the -i flag, as with sed:
perl -p -i.bak -e 's|\iftoggle\{(\w+)\}\{\\input\{(\w+)/(\w+)\}\}\{\}|\input\{\2/\3\}|g' file
The Perl documentation on character classes says this about word characters (\w):
Word characters
A \w matches a single alphanumeric character (an alphabetic character, or a decimal digit); or a connecting punctuation character, such as an underscore ("_"); or a "mark" character (like some sort of accent) that attaches to one of those. It does not match a whole word. To match a whole word, use \w+ . This isn't the same thing as matching an English word, but in the ASCII range it is the same as a string of Perl-identifier characters.
For input with several directory levels inside \input, e.g.
cat file
\iftoggle{xx_yy}{\input{xx_yy/zz_yy_zz_kk/dude_hjgk}}{}
perl -p -e 's|\iftoggle\{(\w+)\}\{\\input\{(\w+)/(\w+)/(\w+)\}\}\{\}|\input\{\2/\3/\4\}|g' file
\input{xx_yy/zz_yy_zz_kk/dude_hjgk}
Add as many capturing groups as you need.

Translate PCRE pattern to POSIX

I have the following pcre that works just fine:
/[c,f]=("(?:[a-z A-Z 0-9]|-|_|\/)+\.(?:js|html)")/g
It produces the desired output "foo.js" and "bar.html" from the inputs
<script src="foo.js"...
<link rel="import" href="bar.html"...
Problem is, the OS X version of grep doesn't seem to have any option like -o to only print the captured group (according to another SO question, that apparently works on linux). Since this will be part of a makefile, I need a version that I can count on running on any *nix platform.
I tried sed but the following
s/[c,f]=("(?:[[:alphanum:]]|-|_|\/)+\.(?:js|html)")/\1/pg
throws an error: 'invalid operand for repetition-operator'. I've tried trimming it down, excluding the filepath separator characters, but I just can't seem to crack it. Any help translating my PCRE into something that I'm pretty much guaranteed to have on a POSIX-compliant (even unofficially so) platform?
P.S. I'm aware of the potential failure modes inherent in the regex I wrote, it only will be used against very specific files with fairly specific formatting.
POSIX defines two flavors of regular expressions:
BREs (Basic Regular Expressions) - the older flavor with fewer features and the need to \-escape certain metacharacters, notably \(, \) and \{, \}, and no support for duplication symbols \+ (emulate with \{1,\}) and \? (emulate with \{0,1\}), and no support for \| (alternation; cannot be emulated).
EREs (Extended Regular Expressions) - the more modern flavor, which, however, lacks regex-internal back-references (which is not the same as capture groups); there is also no support for word-boundary assertions (e.g., \<) and no support for non-capturing groups.
POSIX also mandates which utilities support which flavor: some support only BREs, some only EREs, and some can switch between the two; notably:
grep uses BREs by default, but can enable EREs with -E
sed, sadly, only supports BREs
Both GNU and BSD sed, however, - as a nonstandard extension - do support EREs with the -E switch (the better known alias with GNU sed is -r, but -E is supported too).
awk only supports EREs
Additionally, the regex libraries on both Linux and BSD/OSX implement extensions to the POSIX ERE syntax - sadly, these extensions are in part incompatible (such as the syntax for word-boundary assertions).
As for your specific regex:
It uses the syntax for non-capturing groups, (?:...); however, capture groups are pointless in the context of grep, because grep offers no replacement feature.
If we remove this aspect, we get:
[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")
This is now a valid POSIX ERE (which can be simplified - see Benjamin W's helpful answer).
However, since it is an Extended RE, using sed is not an option, if you want to remain strictly POSIX-compliant.
Because both GNU and BSD/OSX sed happen to implement -E to support EREs, you can get away with sed, if these platforms are the only ones you need to support - see anubhava's answer.
Similarly, both GNU and BSD/OSX grep happen to implement the nonstandard -o option (unlike what you state in your question), so, again, if these platforms are the only ones you need to support, you can use:
$ grep -Eo '[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")' file | cut -c 3-
"foo.js"
"bar.html"
(Note that only GNU grep supports -P to enable PCREs, which would simplify the solution to the following (note the \K, which drops everything matched so far):
$ grep -Po '[c,f]=\K("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")' file
)
If you really wanted a strictly POSIX-compliant solution, you could use awk:
$ awk -F\" '/[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")/ { print "\"" $2 "\"" }' file
On OSX following sed should work with your given input:
sed -E 's~.*[cf]=("[ a-zA-Z0-9_/-]+\.(js|html)").*~\1~' file
"foo.js"
"bar.html"
The spec for POSIX sed points out that only basic regular expressions (BRE) are supported, so no + or |; non-capturing groups aren't even in the spec for extended regular expressions (ERE).
Thankfully, both GNU sed and BSD sed support ERE, so we can use alternation and the + quantifier.
A few points:
Did you really want that comma in the first bracket expression? I suspect it could be just [cf].
The expression
(?:[a-z A-Z 0-9]|-|_|\/)+
can be simplified to a single bracket expression,
[a-zA-Z0-9_\/ -]+
Only one space is needed. You can also use a POSIX character class: [[:alnum:]_/ -]+. Note that [:alphanum:] is not a POSIX class name (the class is [:alnum:]), which may be what tripped sed up.
For the whole expression between quotes, I'd just use an expression for "something between quotes, ending in .js or .html, preceded by non-quotes":
"[^"]+\.(js|html)"
To emulate grep -o behaviour, you have to also match everything before and after your expression on the line with .* at the start and end of your regex.
All in all, I'd say that for a sed using ERE (-r option for GNU sed, -E option for BSD sed), this should work:
sed -rn 's/.*[cf]=("[^"]+\.(js|html)").*/\1/p' infile
Or, with BRE only (requiring two commands because of the alternation):
sed -n 's/.*[cf]=\("[^"][^"]*\.js"\).*/\1/p;s/.*[cf]=\("[^"][^"]*\.html"\).*/\1/p' infile
Notice how BRE can emulate the + quantifier with [abc][abc]* instead of [abc]+.
The limitation of this approach is that if there are multiple matches on the same line, only one of them is printed (the last one, because the leading .* is greedy), since the s/// command removes everything before and after the part we extract.
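The BRE-only command and its one-match-per-line limitation can be checked with a short sketch. The first two input lines are modelled on the question; the third line with two script tags is made up to show the limitation:

```shell
# Input: two lines modelled on the question plus a made-up line with two matches.
html='<script src="foo.js"></script>
<link rel="import" href="bar.html">
<script src="a.js"></script><script src="b.js"></script>'

# BRE only: one s command per extension; -n plus the p flag prints only lines that matched.
result=$(printf '%s\n' "$html" |
  sed -n 's/.*[cf]=\("[^"][^"]*\.js"\).*/\1/p;s/.*[cf]=\("[^"][^"]*\.html"\).*/\1/p')

printf '%s\n' "$result"
```

For the third line only "b.js" comes out: the greedy leading .* swallows the earlier match. If every match on a line is needed, grep -o (where available) or awk is a better fit.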

How to use sed to replace regex capture group?

I have a large file with many scattered file paths that look like
lolsed_bulsh.png
I want to prepend these file names with an extended path like:
/full/path/lolsed_bullsh.png
I'm having a hard time matching and capturing these. Currently I'm trying variations of:
cat myfile.txt| sed s/\(.+\)\.png/\/full\/path\/\1/g | ack /full/path
I think sed has some regex or capture group behavior I'm not understanding
In your regex change + with *:
sed -E "s/(.*)\.png/\/full\/path\/\1/g" <<< "lolsed_bulsh.png"
It prints:
/full/path/lolsed_bulsh
NOTE: The non standard -E option is to avoid escaping ( and )
Save yourself some escaping by choosing a different separator (and -E option), for example:
cat myfile.txt | sed -E "s|(..*)\.png|/full/path/\1|g" | ack /full/path
Note that where supported, the -E option ensures ( and ) don't need escaping.
POSIX sed uses BRE, which has no one-or-more quantifier +; that quantifier exists only in POSIX ERE, and POSIX sed has no option to switch to ERE.
Use ..* to simulate .+ if you want to maintain portability.
Or if you can assume that the code is always run on GNU sed, you can use GNU extension \+. Alternatively, you can also use the GNU extension -r flag to switch to POSIX ERE. The -E flag in higuaro's answer has been tagged for inclusion in POSIX.1 Issue 8, and exists in POSIX.1-202x Draft 1 (June 2020).
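Putting these points together, here is a sketch of the substitution in both dialects, run on the file name from the question. Note one assumption beyond the answers above: the replacement re-adds .png, since the desired output in the question keeps the extension. On lines that contain other text besides the file name, ..* would capture that text too, so the pattern would need tightening.

```shell
# File name from the question.
name='lolsed_bulsh.png'

# Portable BRE: ..* stands in for .+, and | as the delimiter avoids escaping the slashes.
# The replacement re-adds .png, which the match consumed.
bre=$(printf '%s\n' "$name" | sed 's|\(..*\)\.png|/full/path/\1.png|')

# Same substitution as an ERE, for sed implementations that accept -E.
ere=$(printf '%s\n' "$name" | sed -E 's|(.+)\.png|/full/path/\1.png|')

printf '%s\n%s\n' "$bre" "$ere"
```

The whole sed script is single-quoted, so the shell passes the backslashes and parentheses through unchanged.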

How to use regex OR in grep in Cygwin?

I need to return results for two different matches from a single file.
grep "string1" my.file
correctly returns the single instance of string1 in my.file
grep "string2" my.file
correctly returns the single instance of string2 in my.file
but
grep "string1|string2" my.file
returns nothing
In regex test apps that syntax is correct, so why does it not work for grep in Cygwin?
Using the | character without escaping it in a basic regular expression will only match the | literal. For instance, if you have a file with contents
string1
string2
string1|string2
then grep "string1|string2" my.file will only match the last line:
$ grep "string1|string2" my.file
string1|string2
In order to use the alternation operator |, you could:
Use a basic regular expression (just grep) and escape the | character in the regular expression
grep "string1\|string2" my.file
Use an extended regular expression with egrep or grep -E, as Julian already pointed out in his answer
grep -E "string1|string2" my.file
If it is two different patterns that you want to match, you could also specify them separately in -e options:
grep -e "string1" -e "string2" my.file
You might find the following sections of the grep reference useful:
Basic vs Extended Regular Expressions
Matching Control, where it explains -e
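The options above can be compared side by side with a small sketch that counts matching lines (grep -c) on the three-line sample. The variable names are only for illustration; the \| form is left out because it is a GNU extension to BRE:

```shell
# Sample file contents from the answer above.
data='string1
string2
string1|string2'

# Count matching lines for each way of writing the pattern.
literal=$(printf '%s\n' "$data" | grep -c 'string1|string2')          # BRE: | is literal
extended=$(printf '%s\n' "$data" | grep -E -c 'string1|string2')      # ERE: | alternates
separate=$(printf '%s\n' "$data" | grep -c -e 'string1' -e 'string2') # two patterns

printf '%s %s %s\n' "$literal" "$extended" "$separate"
```

The BRE form matches only the line that contains a literal |, while the other two match all three lines.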
You may need to either use egrep or grep -E. The pipe OR symbol is part of 'extended' grep and may not be supported by the basic Cygwin grep.
Also, you probably need to escape the pipe symbol.
The best and most clear way I've found is:
grep -e REG1 -e REG2 -e REG3 _FILETOGREP_
I never use pipe as it's less evident and very awkward to get working.
You can find this information by reading the fine manual: grep(1), which you can find by running 'man grep'. It describes the difference between grep and egrep, and between basic and extended regular expressions, along with a lot of other useful information about grep.