I am using sed to clean up a 100MB text file containing word frequencies.
To test my work I work with this short sample:
86501.522305 .
30876.406478 yes
15806.203945 no
15397.078939 what
9461.059877 8
10526.408684 ,
The whitespace is a single tab character.
My goal is to empty all rows with "non-words", i.e. lines 1, 5 and 6.
My regex
^\S*?\t[\W\d]+$
works fine when testing on Regex101 and in Notepad++, but my sed command
sed -ri 's/^\S*?\t[\W\d]+$//g' sample.txt
keeps the file completely unaltered (except for the file metadata).
Does anyone have an idea what could cause this weird behaviour?
I have checked the docs for extended regular expressions and tried escaping all kinds of characters, but with no success.
There's nothing weird about sed's behavior; you just overlooked that there are several different flavors of regexp, and that different tools support some or all of them in different ways, with different options and different caveats.
sed by default supports POSIX BREs while your regexp contains a PCRE (not an ERE) with a bunch of non-POSIX extensions. GNU and OSX/BSD sed support EREs with the -E argument (older GNU seds use -r) and GNU sed supports some extensions - I'd expect \S and maybe \W to work but not \d. No sed supports PCREs.
FWIW I'd use awk for this for clarity, efficiency, portability, etc.:
$ awk '{print ($NF ~ /[[:alnum:]_]/ ? $0 : "")}' file | cat -n
1
2 30876.406478 yes
3 15806.203945 no
4 15397.078939 what
5 9461.059877 8
6
That will work with any awk in any shell on every UNIX box. The | cat -n is just to show that the lines were emptied rather than deleted.
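If you'd rather stay with sed, here is a minimal sketch of the same idea using only POSIX bracket expressions. It assumes the two columns are separated by exactly one tab, and it treats a token as a "word" only if it contains a letter or an underscore, so the lone 8 on line 5 is emptied as well (the awk above keeps it, because a digit is alphanumeric). The names empty_non_words, tab and sample are made up for illustration.

```shell
# Literal tab character; \t inside a sed regex is not guaranteed by POSIX.
tab=$(printf '\t')

# Hypothetical helper: empty every line whose second column has no letter or underscore.
# [^[:space:]]* replaces \S*?, and [^[:alpha:]_]\{1,\} replaces [\W\d]+.
empty_non_words() {
    sed 's/^[^[:space:]]*'"$tab"'[^[:alpha:]_]\{1,\}$//'
}

# The sample from the question, rebuilt with real tabs (printf reuses the format per pair).
sample=$(printf '%s\t%s\n' 86501.522305 . 30876.406478 yes 15806.203945 no \
                           15397.078939 what 9461.059877 8 10526.408684 ,)

emptied=$(printf '%s\n' "$sample" | empty_non_words | grep -c '^$')
kept=$(printf '%s\n' "$sample" | empty_non_words | grep -c .)
echo "emptied=$emptied kept=$kept"
```

Lines 1, 5 and 6 come out empty and the three real words are left untouched.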
I have a working (in macOS app Patterns) RegExp that reformats GeoJSON MultiPolygon coordinates, but don't know how to escape it for sed.
The file I'm working on is over 90 MB in size, so a bash terminal looks like the ideal place and sed the perfect tool for the job.
Search Text Example:
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
Desired outcome:
[[[37.9017735,69.400367955],[37.90098431,69.400425761],[37.90004869,69.400489545],[37.89915455,69.400578128],[37.89840665,69.400660744],[37.89747072,69.400762152],[37.89628639,69.400905283],[37.89545822,69.401014028],[37.89479369,69.401113128],[37.89414564,69.401195094],[37.89362565,69.401281229],[37.89276089,69.401414764],[37.89196611,69.401540312],[37.891721,69.401587053],[37.89137614,69.401634443],[37.89136515,69.401635893],[37.89114453,69.401663531],
My current RegExp:
((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))
and reformatting:
$1\.$2$4,$6.$7$9
The command should be something along these lines:
sed -i -e 's/ The RegExp escaped /$1\.$2$4,$6.$7$9/g' large_file.geojson
But what should be escaped in the RegExp to make it work?
My attempts always complain of being unbalanced.
I'm sorry if this has already been answered elsewhere, but I couldn't find it even after extensive searching.
Edit: 2017-01-07: I didn't make it clear that the file contains properties other than just the GPS-points. One of the other example values picked from GeoJSON Feature properties is "35.642.1.001_001", which should be left unchanged. The braces check in my original regex is there for this reason.
That regex is not legal in sed; since it uses Perl syntax, my recommendation would be to use perl instead. The regular expression works exactly as-is, and even the command line is almost the same; you just need to add the -p option to get perl to operate in filter mode (which sed does by default). I would also recommend adding an argument suffix to the -i option (whether using sed or perl), so that you have a backup of the original file in case something goes horribly wrong. As for quoting, all you need to do is put the substitution command in single quotation marks:
perl -p -i.bak -e \
's/((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))/$1\.$2$4,$6.$7$9/g' \
large_file.geojson
If your data is just like you showed, you needn't worry about the brackets. You may use a POSIX ERE enabled with -E (or -r in older GNU sed versions) like this. Note that BSD/macOS sed requires an explicit backup suffix after -i, e.g. -i '' or -i .bak:
sed -i -E 's/([0-9]{2})([0-9]*)\.([0-9]+)/\1.\2\3/g' large_file.geojson
Or a POSIX BRE:
sed -i 's/\([0-9]\{2\}\)\([0-9]*\)\.\([0-9]\{1,\}\)/\1.\2\3/g' large_file.geojson
See an online demo.
You may see how this regex works here (just a demo, not proof).
Note that in POSIX BRE you need to escape { and } in interval quantifiers and ( and ) in grouping constructs, otherwise they denote literal symbols. The + quantifier is not part of POSIX BRE at all: \+ is a GNU extension, and the portable spelling is \{1,\}. In POSIX ERE you do not need to escape these characters to make them special; this flavor is closer to modern regexes.
Also, you need to use \n notation inside the replacement pattern, not $n.
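If the brackets do matter (the question's edit mentions property values such as "35.642.1.001_001" that must stay untouched), a bracket-anchored ERE is one option. This is only a sketch: the function name shift_decimal is made up, and it assumes every coordinate pair looks like [digits.digits,digits.digits] with at least three digits before each decimal point.

```shell
# Hypothetical helper: keep two digits before the decimal point, but only inside [x,y] pairs.
# \[ matches a literal opening bracket; a closing ] needs no escape outside a bracket expression.
shift_decimal() {
    sed -E 's/\[([0-9]{2})([0-9]+)\.([0-9]+),([0-9]{2})([0-9]+)\.([0-9]+)]/[\1.\2\3,\4.\5\6]/g'
}

coords=$(printf '%s\n' '[[[379017.735,6940036.7955],[379009.8431,6940042.5761],' | shift_decimal)
prop=$(printf '%s\n' '"id": "35.642.1.001_001"' | shift_decimal)

echo "$coords"   # [[[37.9017735,69.400367955],[37.90098431,69.400425761],
echo "$prop"     # "id": "35.642.1.001_001"
```

To run it over the real file, the same s command can be passed to sed -i (with the backup suffix your sed expects) instead of being used in a pipe.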
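A simple sed will do it if all you need is to move the decimal point three places to the left (note that this is not quite the desired outcome shown in the question, which keeps only two digits before the point):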
$ echo "$var"
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
$ echo "$var" | sed 's/\([0-9]\{3\}\)\./.\1/g'
[[[379.017735,6940.0367955],[379.0098431,6940.0425761],[379.0004869,6940.0489545],[378.9915455,6940.0578128],[378.9840665,6940.0660744],[378.9747072,6940.0762152],[378.9628639,6940.0905283],[378.9545822,6940.1014028],[378.9479369,6940.1113128],[378.9414564,6940.1195094],[378.9362565,6940.1281229],[378.9276089,6940.1414764],[378.9196611,6940.1540312],[378.91721,6940.1587053],[378.9137614,6940.1634443],[378.9136515,6940.1635893],[378.9114453,6940.1663531],
I need to run a regular expression to match part of a string. On OS X I would do:
echo "$string" | sed -E 's/blah(.*)blah/\1/g'
However, this use of sed isn't compatible with other platforms, many of which would invoke the same command using sed -r.
So what I'm looking for is either a good way to detect which option to use, or a widely available (and compatible) alternative to sed that I can try to do the same thing (retrieve part of a string using a pattern).
There are alternatives like awk, perl, tr or even pure bash. It depends upon what you want to do.
However for your case you don't really need special regex flag -E of sed. You can do:
sed 's/blah\(.*\)blah/\1/g'
To make it compatible with sed on other platforms.
This is indeed incredibly annoying. I do something like:
# Default to the GNU sed spelling; BSD and OS X sed use -E instead.
SED_EXTENDED_REGEXP_FLAG=-r
case "$(uname)" in
    *BSD|Darwin) SED_EXTENDED_REGEXP_FLAG=-E ;;
esac
echo "$string" | sed "$SED_EXTENDED_REGEXP_FLAG" 's/blah(.*)blah/\1/g'
That's off the top of my head, so apologies if the shell script syntax is a bit off.
This assumes that any platform which is not a BSD or OS X has GNU sed (or another sed where -r is the flag for extended regular expressions, if there is such a thing).
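Instead of guessing from uname, one could also probe the sed binary directly. A rough sketch, not checked on every platform; the variable name sed_ere_flag is made up, and the probe simply tries each flag on a throwaway input:

```shell
# Try -E first, then -r, and keep the first flag the local sed actually accepts.
sed_ere_flag=
for flag in -E -r; do
    if [ "$(printf 'a\n' | sed "$flag" 's/(a)/<\1>/' 2>/dev/null)" = '<a>' ]; then
        sed_ere_flag=$flag
        break
    fi
done

string=blahfooblah
if [ -n "$sed_ere_flag" ]; then
    result=$(printf '%s\n' "$string" | sed "$sed_ere_flag" 's/blah(.*)blah/\1/')
else
    # Neither flag works: fall back to the equivalent basic regular expression.
    result=$(printf '%s\n' "$string" | sed 's/blah\(.*\)blah/\1/')
fi
echo "$result"   # foo
```

The probe checks the output rather than just the exit status, so a sed that accepts the flag but gives it a different meaning is not mistaken for one that supports EREs.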
By far the best solution using sed is to use the portable (POSIX) basic regular expression equivalent, which will work on all platforms:
echo "$string" | sed -e 's/blah\(.*\)blah/\1/g'
This -e indicates the sed-script follows; it could be omitted.
Failing that, Perl was in part a sed substitute (there's still a program s2p that converts sed scripts into Perl scripts).
perl -e 'foreach (@ARGV) { s/blah(.*)blah/$1/; print "$_\n"; }' "$string"
I have this text in a file:
"0000000441244"
"0000000127769"
I want to replace all zeros with 'L'.
I am trying this, and nothing gets changed:
sed -e 's/0+/L/g' regex.txt
sed -e 's/(0+)/L/g' regex.txt
I want to know where I am going wrong.
A POSIX-compliant version should use 00* instead of 0+:
sed -e 's/00*/L/g' regex.txt
As a side note, you only need the g flag if you want to convert "000000012700009" or even "000000012709" into "L127L9". Otherwise, the * in 's/00*/L/' will include all zeros at the beginning anyway.
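To see the difference the g flag makes, here is a small sketch using the example string from the side note (the variable names are just for illustration):

```shell
# With g every run of zeros is replaced; without it only the first run is.
with_g=$(printf '%s\n' '000000012700009' | sed 's/00*/L/g')
without_g=$(printf '%s\n' '000000012700009' | sed 's/00*/L/')

echo "$with_g"      # L127L9
echo "$without_g"   # L12700009
```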
On Linux (GNU sed), both sed -e 's/0\+/L/g' regex.txt and sed -r 's/0+/L/g' regex.txt will do,
but if you are on a Mac (BSD sed), neither of them works; instead you have to use this:
sed -E 's/0+/L/g' regex.txt
Actually the last one works on Linux too, so it's more portable.
For this particular problem, @perreal's suggestion is also portable. But when you do need + or another metacharacter in a regex, you'd better know how to work around it.
Try this
sed -e 's/0\+/L/g' regex.txt
If your flavor of Unix does not ship GNU sed, you can either install GNU sed yourself or just switch to awk or ruby or perl.
For example:
ruby -e 'ARGF.each{|l|puts l.gsub(/0+/, "L")}' regex.txt
Using awk:
awk '{gsub("0+", "L"); print $0}' regex.txt
Extended regular expressions are available on Mac OS X via -E rather than -e.
From the "BSD General Commands Manual":
-E Interpret regular expressions as extended (modern) regular
expressions rather than basic regular expressions (BRE's).
The re_format(7) manual page fully describes both formats.
This might work for you (GNU sed):
sed 'y/0/L/' file
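Note that y transliterates character by character, so each individual zero becomes its own L, while the s commands in the other answers collapse a whole run of zeros into a single L. Which one is wanted depends on how you read the question. A quick sketch on the first sample line (variable names are made up):

```shell
# y/0/L/ maps every single 0 to L; s/00*/L/g replaces each run of zeros with one L.
per_char=$(printf '%s\n' '"0000000441244"' | sed 'y/0/L/')
per_run=$(printf '%s\n' '"0000000441244"' | sed 's/00*/L/g')

echo "$per_char"   # "LLLLLLL441244"
echo "$per_run"    # "L441244"
```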
I'm trying to use the following regex in a sed script but it doesn't work:
sed -n '/\(www\.\)\?teste/p'
The regex above doesn't seem to work. sed doesn't seem to apply the ? to the grouped www\..
It works if you use the -E parameter that switches sed to use the Extended Regex, so the syntax becomes:
sed -En '/(www\.)?teste/p'
This works fine, but I want to run this script on a machine that doesn't support the -E option. I'm pretty sure that this is possible and that I'm doing something very stupid.
Standard sed only understands POSIX Basic Regular Expressions (BRE), not Extended Regular Expressions (ERE), and the ? is a metacharacter in EREs, but not in BREs.
Your version of sed might support EREs if you turn them on. With GNU sed, the relevant options are -r and --regexp-extended, described as "use extended regular expressions in the script".
However, if your sed does not support it - quite plausible - then you are stuck. Either import a version of sed that does support them, or redesign your processing. Maybe you should use awk instead.
2014-02-21
I don't know why I didn't mention that even though sed does not support the shorthand ? or \? notation, it does support counted ranges with \{n,m\}, so you can simulate ? with \{0,1\}:
sed -n '/\(www\.\)\{0,1\}teste/p' << EOF
http://www.tested.com/
http://tested.com/
http://www.teased.com/
EOF
which produces:
http://www.tested.com/
http://tested.com/
Tested on Mac OS X 10.9.1 Mavericks with the standard BSD sed and with GNU sed 4.2.2.