I am using sed to clean up a 100MB text file containing word frequencies.
To test my work I work with this short sample:
86501.522305 .
30876.406478 yes
15806.203945 no
15397.078939 what
9461.059877 8
10526.408684 ,
The whitespace is a single tab character.
My goal is to empty all rows with "non-words", i.e line 1, 5 and 6.
My regex
^\S*?\t[\W\d]+$
works fine when testing on Regex101 and in Notepad++, but my sed command
sed -ri 's/^\S*?\t[\W\d]+$//g' sample.txt
keeps the file completely unaltered (except for the file metadata).
Does anyone have an idea what could cause this weird behaviour?
I have checked the docs for extended regular expressions and tried escaping all kinds of characters, but with no success.
There's nothing weird about seds behavior, you just misunderstood that there are multiple different flavors of regexp and multiple tools that support some/all of them in different ways with different options and different caveats.
sed by default supports POSIX BREs while your regexp contains a PCRE (not an ERE) with a bunch of non-POSIX extensions. GNU and OSX/BSD sed support EREs with the -E argument (older GNU seds use -r) and GNU sed supports some extensions - I'd expect \S and maybe \W to work but not \d. No sed supports PCREs.
FWIW I'd use awk for this for clarity, efficiency, portability, etc.:
$ awk '{print ($NF ~ /[[:alnum:]_]/ ? $0 : "")}' file | cat -n
1
2 30876.406478 yes
3 15806.203945 no
4 15397.078939 what
5 9461.059877 8
6
That will work with any awk in any shell on every UNIX box. The | cat -n is just to show that the lines were emptied rather than deleted.
I have a working (in macOS app Patterns) RegExp that reformats GeoJSON MultiPolygon coordinates, but don't know how to escape it for sed.
The file I'm working on is over 90 Mb in size, so bash terminal looks like the ideal place and sed the perfect tool for the job.
Search Text Example:
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
Desired outcome:
[[[37.9017735,69.400367955],[37.90098431,69.400425761],[37.90004869,69.400489545],[37.89915455,69.400578128],[37.89840665,69.400660744],[37.89747072,69.400762152],[37.89628639,69.400905283],[37.89545822,69.401014028],[37.89479369,69.401113128],[37.89414564,69.401195094],[37.89362565,69.401281229],[37.89276089,69.401414764],[37.89196611,69.401540312],[37.891721,69.401587053],[37.89137614,69.401634443],[37.89136515,69.401635893],[37.89114453,69.401663531],
My current RegExp:
((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))
and reformatting:
$1\.$2$4,$6.$7$9
The command should be something along these lines:
sed -i -e 's/ The RegExp escaped /$1\.$2$4,$6.$7$9/g' large_file.geojson
But what should be escaped in the RegExp to make it work?
My attempts always complain of being unbalanced.
I'm sorry if this has already been answered elsewhere, but I couldn't find even after extensive searching.
Edit: 2017-01-07: I didn't make it clear that the file contains properties other than just the GPS-points. One of the other example values picked from GeoJSON Feature properties is "35.642.1.001_001", which should be left unchanged. The braces check in my original regex is there for this reason.
That regex is not legal in sed; since it uses Perl syntax, my recommendation would be to use perl instead. The regular expression works exactly as-is, and even the command line is almost the same; you just need to add the -p option to get perl to operate in filter mode (which sed does by default). I would also recommend adding an argument suffix to the -i option (whether using sed or perl), so that you have a backup of the original file in case something goes horribly wrong. As for quoting, all you need to do is put the substitution command in single quotation marks:
perl -p -i.bak -e \
's/((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))/$1\.$2$4,$6.$7$9/g' \
large_file.geojson
If your data is just like you showed, you needn't worry about the brackets. You may use a POSIX ERE enabled with -E (or -r in some other distributions) like this:
sed -i -E 's/([0-9]{2})([0-9]*)\.([0-9]+)/\1.\2\3/g' large_file.geojson
Or a POSIX BRE:
sed -i 's/\([0-9]\{2\}\)\([0-9]*\)\.\([0-9]\+\)/\1.\2\3/g' large_file.geojson
See an online demo.
You may see how this regex works here (just a demo, not proof).
Note that in POSIX BRE you need to escape { and } in limiting / range quantifiers and ( and ) in grouping constructs, and the + quantifier, else they denote literal symbols. In POSIX ERE, you do not need to escape the special chars to make them special, this POSIX flavor is closer to the modern regexes.
Also, you need to use \n notation inside the replacement pattern, not $n.
A simple sed will do it:
$ echo "$var"
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
$ echo "$var" | sed 's/\([0-9]\{3\}\)\./.\1/g'
[[[379.017735,6940.0367955],[379.0098431,6940.0425761],[379.0004869,6940.0489545],[378.9915455,6940.0578128],[378.9840665,6940.0660744],[378.9747072,6940.0762152],[378.9628639,6940.0905283],[378.9545822,6940.1014028],[378.9479369,6940.1113128],[378.9414564,6940.1195094],[378.9362565,6940.1281229],[378.9276089,6940.1414764],[378.9196611,6940.1540312],[378.91721,6940.1587053],[378.9137614,6940.1634443],[378.9136515,6940.1635893],[378.9114453,6940.1663531],
I have learned that whene I use the command grep then I must mask those characters {,},(,) and |
But I have found now an example, where / was masked!
Which characters must be masked when using grep and sed command?
When writing regexes in a shell script, it is normally sensible to enclose the regex in single quotes. Then you don't have to worry about anything except single quotes that appear in the regex itself. Occasionally, it may make sense to enclose the regex in double quotes (if it involves matching single quotes and not matching double quotes), but then you have to be careful about $, the back-quote ` , and backslashes \.
So:
grep -e '^.*([a-z]*)[[:space:]]*{[^}]*}$'
With sed, you need to worry about s/// operations when the search or replacement pattern itself contains slashes /. The simplest technique is to use an alternative character such as %:
sed -e 's%/where/it/was/%/it/goes/here/now/%'
There are three or four dialects of grep:
Plain grep
Extended grep (grep -E, once upon a time known as egrep)
Fixed grep (grep -F, once upon a time known as fgrep)
Sometimes you get grep with PCRE (Perl-compatible Regular Expression) support: grep -P.
Even within 'plain grep', you can find there is some variability between implementations.
Similarly, there are two main dialects of sed:
Plain sed
Extended sed (sed -E or sed -r; sed -E is more widely available)
You need to read about POSIX BRE (basic regular expressions), supported by plain grep and plain sed, and POSIX ERE (extended regular expressions), supported by grep -E and sed -E (when EREs are supported by sed at all).
See also the POSIX specifications for grep and sed.
I have the following pcre that works just fine:
/[c,f]=("(?:[a-z A-Z 0-9]|-|_|\/)+\.(?:js|html)")/g
It produces the desired output "foo.js" and "bar.html" from the inputs
<script src="foo.js"...
<link rel="import" href="bar.html"...
Problem is, the OS X version of grep doesn't seem to have any option like -o to only print the captured group (according to another SO question, that apparently works on linux). Since this will be part of a makefile, I need a version that I can count on running on any *nix platform.
I tried sed but the following
s/[c,f]=("(?:[[:alphanum:]]|-|_|\/)+\.(?:js|html)")/\1/pg
Throws an error: 'invalid operand for repetition-operator'. I've tried trimming it down, excluding the filepath separator characters, I just cant seem to crack it. Any help translating my pcre into something that I'm pretty much guaranteed to have on a POSIX-compliant (even unofficially so) platform?
P.S. I'm aware of the potential failure modes inherent in the regex I wrote, it only will be used against very specific files with fairly specific formatting.
POSIX defines two flavors of regular expressions:
BREs (Basic Regular Expressions) - the older flavor with fewer features and the need to \-escape certain metacharacters, notably \(, \) and \{, \}, and no support for duplication symbols \+ (emulate with \{1,\}) and \? (emulate with \{0,1\}), and no support for \| (alternation; cannot be emulated).
EREs (Extended Regular Expressions) - the more modern flavor, which, however lacks regex-internal back-references (which is not the same as capture groups); also there is no support for word-boundary assertions (e.g, \<) and no support for capture groups.
POSIX also mandates which utilities support which flavor: which support BREs, which support EREs, and which optionally support either, and which exclusively support only BREs, or only EREs; notably:
grep uses BREs by default, but can enable EREs with -E
sed, sadly, only supports BREs
Both GNU and BSD sed, however, - as a nonstandard extension - do support EREs with the -E switch (the better known alias with GNU sed is -r, but -E is supported too).
awk only supports EREs
Additionally, the regex libraries on both Linux and BSD/OSX implement extensions to the POSIX ERE syntax - sadly, these extensions are in part incompatible (such as the syntax for word-boundary assertions).
As for your specific regex:
It uses the syntax for non-capturing groups, (?:...); however, capture groups are pointless in the context of grep, because grep offers no replacement feature.
If we remove this aspect, we get:
[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")
This is now a valid POSIX ERE (which can be simplified - see Benjamin W's helpful answer).
However, since it is an Extended RE, using sed is not an option, if you want to remain strictly POSIX-compliant.
Because both GNU and BSD/OSX sed happen to implement -E to support EREs, you can get away with sed, if these platforms are the only ones you need to support - see anubhava's answer.
Similarly, both GNU and BSD/OSX grep happen to implement the nonstandard -o option (unlike what you state in your question), so, again, if these platforms are the only ones you need to support, you can use:
$ grep -Eo '[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")' file | cut -c 3-
c="foo.js"
f="bar.html"
(Note that only GNU grep supports -P to enable PCREs, which would simply the solution to (note the \K, which drops everything matched so far):
$ grep -Po '[c,f]=\K("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")' file
)
If you really wanted a strictly POSIX-compliant solution, you could use awk:
$ awk -F\" '/[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")/ { print "\"" $2 "\"" }' file
On OSX following sed should work with your given input:
sed -E 's~.*[cf]=("[ a-zA-Z0-9_/-]+\.(js|html)").*~\1~' file
"foo.js"
"bar.html"
RegEx Demo
The spec for POSIX sed points out that only basic regular expressions (BRE) are supported, so no + or |; non-capturing groups aren't even in the spec for extended regular expressions (ERE).
Thankfully, both GNU sed and BSD sed support ERE, so we can use alternation and the + quantifier.
A few points:
Did you really want that comma in the first bracket expression? I suspect it could be just [cf].
The expression
(?:[a-z A-Z 0-9]|-|_|\/)+
can be simplified to a single bracket expression,
[a-zA-Z0-9_\/ -]+
Only one space is needed. You can also use a POSIX character class: [[:alnum:]]_/ -]+. Not sure if your [:alphanum:] tripped sed up.
For the whole expression between quotes, I'd just use an expression for "something between quotes, ending in .js or .html, preceded by non-quotes":
"[^"]+\.(js|html)"
To emulate grep -o behaviour, you have to also match everything before and after your expression on the line with .* at the start and end of your regex.
All in all, I'd say that for a sed using ERE (-r option for GNU sed, -E option for BSD sed), this should work:
sed -rn 's/.*[cf]=("[^"]+\.(js|html)").*/\1/p' infile
Or, with BRE only (requiring two commands because of the alternation):
sed -n 's/.*[cf]=\("[^"][^"]*\.js"\).*/\1/p;s/.*[cf]=\("[^"][^"]*\.html"\).*/\1/p' infile
Notice how BRE can emulate the + quantifier with [abc][abc]* instead of [abc]+.
The limitation to this approach is that if there are multiple matches on the same line, only the first one will be printed, because the s/// command removes everything before and after the part we extract.
My following regex in Sed doesn't extract the file I want without the #30 substring.
Could you please help pointing out what I am missing here?
[machine]# echo "//dir1/dir2/dir3/component/file.rb#70" | sed 's/\(.*rb\)#\d+$/\1/g'
Output: //dir1/dir2/dir3/component/file.rb#70
What I want is simply: //dir1/dir2/dir3/component/file.rb without #70 substring.
Thanks in advance
PL
The flavor of regular expression understood by sed by default doesn't include either \d for digits or + for "1 or more".
This will work:
sed 's/\(.*\.rb\)#[0-9][0-9]*$/\1/g'
Or you could turn on "extended" regular expression syntax with -E, which makes the + work (though still not \d), and swaps the meaning of backslashed vs non-backslashed parentheses:
sed -E 's/(.*\.rb)#[0-9]+$/\1/g'
Both of the above commands will work even on non-GNU sed, as you get by default on BSD and Mac OS X systems. In normal mode (without the -E), GNU sed also understands \+ to mean the same as bare + in extended mode, but BSD sed does not.
If all you're trying to do is get rid of the #digits, though, you can do it more simply. Sed regexes aren't anchored to the start of the line, so you don't have to include the filename - just replace the part you don't want with nothing at all:
sed 's/#[0-9][0-9]*$//'
or
sed -E 's/#[0-9]+$//'
If your real problem does require the fancy version, though, you could also use Perl, which has the advantage that there's relatively few (almost no) changes in regex syntax across versions. It also understands that \d syntax you tried to use:
perl -pe 's/(.*\.rb)#\d+$/\1/g'
With GNU sed, your command works if you use -E and change \d to [0-9] or [[:digit:]]:
echo "//dir1/dir2/dir3/component/file.rb#70" | sed -E 's/(.*rb)#[0-9]+$/\1/g'
//dir1/dir2/dir3/component/file.rb
Depending on the context, you may be able to use a simpler command, such as
sed 's/#[0-9]\+//g'
You got the answer but have you considered simply:
$ echo "//dir1/dir2/dir3/component/file.rb#70" | cut -d'#' -f1
//dir1/dir2/dir3/component/file.rb