sed (bash) has different interpretation of regex than any other tool? - regex

I am using sed to clean up a 100MB text file containing word frequencies.
To test my work I work with this short sample:
86501.522305 .
30876.406478 yes
15806.203945 no
15397.078939 what
9461.059877 8
10526.408684 ,
The whitespace is a single tab character.
My goal is to empty all rows with "non-words", i.e line 1, 5 and 6.
My regex
^\S*?\t[\W\d]+$
works fine when testing on Regex101 and in Notepad++, but my sed command
sed -ri 's/^\S*?\t[\W\d]+$//g' sample.txt
keeps the file completely unaltered (except for the file metadata).
Does anyone have an idea what could cause this weird behaviour?
I have checked the docs for extended regular expressions and tried escaping all kinds of characters, but with no success.

There's nothing weird about seds behavior, you just misunderstood that there are multiple different flavors of regexp and multiple tools that support some/all of them in different ways with different options and different caveats.
sed by default supports POSIX BREs while your regexp contains a PCRE (not an ERE) with a bunch of non-POSIX extensions. GNU and OSX/BSD sed support EREs with the -E argument (older GNU seds use -r) and GNU sed supports some extensions - I'd expect \S and maybe \W to work but not \d. No sed supports PCREs.
FWIW I'd use awk for this for clarity, efficiency, portability, etc.:
$ awk '{print ($NF ~ /[[:alnum:]_]/ ? $0 : "")}' file | cat -n
1
2 30876.406478 yes
3 15806.203945 no
4 15397.078939 what
5 9461.059877 8
6
That will work with any awk in any shell on every UNIX box. The | cat -n is just to show that the lines were emptied rather than deleted.

Related

SED not updating with complex regex

I'm trying to automate updating the version number in a file as part of build process. I can get the following to work, but only for version numbers with single digits in each of the Major/minor/fix positions.
sed -i 's/version="[0-9]\.[0-9]\.[0-9]"/version="2.4.567"/g' projectConfig.xml
I've tried a more complex regex pattern and it works in the MS Regular Xpression Tool, but won't match when running sed.
sed -i 's/version="\b\d{1,3}\.\d{1,3}\.\d{1,3}\b"/version="2.4.567"/g' projectConfig.xml
Example Input:
This is a file at version="2.1.245" and it consists of much more text.
Desired output
This is a file at version="2.4.567" and it consists of much more text.
I feel that there is something that I'm missing.
There are 3 problems:
To enable quantifiers ({}) in sed you need the -E / --regexp-extended switch (or use \{\}, see http://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html#Regular-Expressions)
The character set shorthand \d is [[:digit:]] in sed.
Your input does not quote the version in ".
sed 's/version=\b[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\b/version="2.4.567"/g' \
<<< "This is a file at version=2.1.245 and it consists of much more text."
To stay more portable, you might want to use the --posix switch (which requires removing \b):
sed --posix 's/version=[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}/version="2.4.567"/g' \
<<< "This is a file at version=2.1.245 and it consists of much more text."

sed regexp, number reformatting: how to escape for bash

I have a working (in macOS app Patterns) RegExp that reformats GeoJSON MultiPolygon coordinates, but don't know how to escape it for sed.
The file I'm working on is over 90 Mb in size, so bash terminal looks like the ideal place and sed the perfect tool for the job.
Search Text Example:
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
Desired outcome:
[[[37.9017735,69.400367955],[37.90098431,69.400425761],[37.90004869,69.400489545],[37.89915455,69.400578128],[37.89840665,69.400660744],[37.89747072,69.400762152],[37.89628639,69.400905283],[37.89545822,69.401014028],[37.89479369,69.401113128],[37.89414564,69.401195094],[37.89362565,69.401281229],[37.89276089,69.401414764],[37.89196611,69.401540312],[37.891721,69.401587053],[37.89137614,69.401634443],[37.89136515,69.401635893],[37.89114453,69.401663531],
My current RegExp:
((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))
and reformatting:
$1\.$2$4,$6.$7$9
The command should be something along these lines:
sed -i -e 's/ The RegExp escaped /$1\.$2$4,$6.$7$9/g' large_file.geojson
But what should be escaped in the RegExp to make it work?
My attempts always complain of being unbalanced.
I'm sorry if this has already been answered elsewhere, but I couldn't find even after extensive searching.
Edit: 2017-01-07: I didn't make it clear that the file contains properties other than just the GPS-points. One of the other example values picked from GeoJSON Feature properties is "35.642.1.001_001", which should be left unchanged. The braces check in my original regex is there for this reason.
That regex is not legal in sed; since it uses Perl syntax, my recommendation would be to use perl instead. The regular expression works exactly as-is, and even the command line is almost the same; you just need to add the -p option to get perl to operate in filter mode (which sed does by default). I would also recommend adding an argument suffix to the -i option (whether using sed or perl), so that you have a backup of the original file in case something goes horribly wrong. As for quoting, all you need to do is put the substitution command in single quotation marks:
perl -p -i.bak -e \
's/((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))/$1\.$2$4,$6.$7$9/g' \
large_file.geojson
If your data is just like you showed, you needn't worry about the brackets. You may use a POSIX ERE enabled with -E (or -r in some other distributions) like this:
sed -i -E 's/([0-9]{2})([0-9]*)\.([0-9]+)/\1.\2\3/g' large_file.geojson
Or a POSIX BRE:
sed -i 's/\([0-9]\{2\}\)\([0-9]*\)\.\([0-9]\+\)/\1.\2\3/g' large_file.geojson
See an online demo.
You may see how this regex works here (just a demo, not proof).
Note that in POSIX BRE you need to escape { and } in limiting / range quantifiers and ( and ) in grouping constructs, and the + quantifier, else they denote literal symbols. In POSIX ERE, you do not need to escape the special chars to make them special, this POSIX flavor is closer to the modern regexes.
Also, you need to use \n notation inside the replacement pattern, not $n.
A simple sed will do it:
$ echo "$var"
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
$ echo "$var" | sed 's/\([0-9]\{3\}\)\./.\1/g'
[[[379.017735,6940.0367955],[379.0098431,6940.0425761],[379.0004869,6940.0489545],[378.9915455,6940.0578128],[378.9840665,6940.0660744],[378.9747072,6940.0762152],[378.9628639,6940.0905283],[378.9545822,6940.1014028],[378.9479369,6940.1113128],[378.9414564,6940.1195094],[378.9362565,6940.1281229],[378.9276089,6940.1414764],[378.9196611,6940.1540312],[378.91721,6940.1587053],[378.9137614,6940.1634443],[378.9136515,6940.1635893],[378.9114453,6940.1663531],

Regex for prosodically-defined words: working in Atom but not grep

I'm trying to search a .txt dictionary for all trisyllabic roots, and then have the matching roots passed to a new .txt file. The dictionary in question is a raw text version of Heath's Nunggubuyu dictionary. When I search the file in Atom (my preferred text editor), the following string does a pretty good job of singling out the desired roots and eliminating any material from the definitions below the headwords (which begin with whitespace), as well as any English words, and any trisyllabic strings interrupted by a hyphen or equals sign (which mean they are not monomorphemic roots). Forgive me if it looks clunky; I'm an absolute beginner. (In this orthography, vowel length is indicated with a ':', and there are only three vowels 'a,i,u'. None of the headwords have uppercase letters.)
^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b
However, I need the matched strings to be output to a new file. When I try using this same string in grep (on a Mac), nothing is matched. I use the syntax
grep -o "^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b" Dict-nofrontmatter.txt > output.txt
I've been searching for hours trying to figure out how to translate from Atom's regex dialect to grep (Mac), to no avail. Whenever I do manage to get matches, the results looks wildly different to what I expect, and what I get from Atom. I've also looked at some apparent grep tools for Atom, but the documentation is virtually non-existent so I can't work out what they even do. What am I getting wrong here? Should I try an alternative to grep?
grep supports different regex styles. From man re_format:
Regular expressions ("RE"s), as defined in POSIX.2, come in two
forms:
modern REs (roughly those of egrep; POSIX.2 calls these extended REs) and
obsolete REs (roughly those of ed(1); POSIX.2 basic REs).
Grep has switches to choose which variant is used. Sorted from less to many features:
fixed string: grep -F or fgrep
No regex at all. Plain text search.
basic regex: grep -G or just grep
|, +, and ? are ordinary characters. | has no equivalent. Parentheses must be escaped to work as sub-expressions.
extended regex: grep -E or egrep
"Normal" regexes with |, +, ? bounds and so on.
perl regex: grep -P (for GNU grep, not pre-installed on Mac)
Most powerful regexes. Supports lookaheads and other features.
In your case you should try grep -Eo "^\S....
Possibly the only thing missing from your grep command is the -E option:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
grep -Eo "$regex" Dict-nofrontmatter.txt > output.txt
-E activates support for extended (modern) regular expressions, which work as one expects nowadays (duplication symbols + and ? work as expected, ( and ) form capture groups, | is alternation).
Without -E (or with -G) basic regular expressions are assumed - a limited legacy form that differs in syntax. Given that -E is part of POSIX, there's no reason not to use it.
On macOS, grep does understand character-class shortcuts such as \S and \W, and also word-boundary assertions such as \b - this is in contrast with the other BSD utilities that macOS comes with, notably sed and awk.
It doesn't look like you need it, but PRCEs (Perl-compatible Regular Expressions) would provide additional features, such as look-around assertions.
macOS grep doesn't support them, but GNU grep does, via the -P option. You can install GNU grep on macOS via Homebrew.
Alternatively, you can simply use perl directly; the equivalent of the above command would be:
regex='^\S[^aeiousf]*[aiu:]+[^csfaioeu:\-\=\W]+[aiu:]+[^VNcsfaeiou:\-\=]+[aiu:]+[^VcsfNaeiou:]*\b'
perl -lne "print for m/$regex/g" Dict-nofrontmatter.txt > output.txt

How to use sed to replace regex capture group?

I have a large file with many scattered file paths that look like
lolsed_bulsh.png
I want to prepend these file names with an extended path like:
/full/path/lolsed_bullsh.png
I'm having a hard time matching and capturing these. currently i'm trying variations of:
cat myfile.txt| sed s/\(.+\)\.png/\/full\/path\/\1/g | ack /full/path
I think sed has some regex or capture group behavior I'm not understanding
In your regex change + with *:
sed -E "s/(.*)\.png/\/full\/path\/\1/g" <<< "lolsed_bulsh.png"
It prints:
/full/path/lolsed_bulsh
NOTE: The non standard -E option is to avoid escaping ( and )
Save yourself some escaping by choosing a different separator (and -E option), for example:
cat myfile.txt | sed -E "s|(..*)\.png|/full/path/\1|g" | ack /full/path
Note that where supported, the -E option ensures ( and ) don't need escaping.
sed uses POSIX BRE, and BRE doesn't support one or more quantifier +. The quantifier + is only supported in POSIX ERE. However, POSIX sed uses BRE and has no option to switch to ERE.
Use ..* to simulate .+ if you want to maintain portability.
Or if you can assume that the code is always run on GNU sed, you can use GNU extension \+. Alternatively, you can also use the GNU extension -r flag to switch to POSIX ERE. The -E flag in higuaro's answer has been tagged for inclusion in POSIX.1 Issue 8, and exists in POSIX.1-202x Draft 1 (June 2020).

Sed doesn't replace my text properly

My following regex in Sed doesn't extract the file I want without the #30 substring.
Could you please help pointing out what I am missing here?
[machine]# echo "//dir1/dir2/dir3/component/file.rb#70" | sed 's/\(.*rb\)#\d+$/\1/g'
Output: //dir1/dir2/dir3/component/file.rb#70
What I want is simply: //dir1/dir2/dir3/component/file.rb without #70 substring.
Thanks in advance
PL
The flavor of regular expression understood by sed by default doesn't include either \d for digits or + for "1 or more".
This will work:
sed 's/\(.*\.rb\)#[0-9][0-9]*$/\1/g'
Or you could turn on "extended" regular expression syntax with -E, which makes the + work (though still not \d), and swaps the meaning of backslashed vs non-backslashed parentheses:
sed -E 's/(.*\.rb)#[0-9]+$/\1/g'
Both of the above commands will work even on non-GNU sed, as you get by default on BSD and Mac OS X systems. In normal mode (without the -E), GNU sed also understands \+ to mean the same as bare + in extended mode, but BSD sed does not.
If all you're trying to do is get rid of the #digits, though, you can do it more simply. Sed regexes aren't anchored to the start of the line, so you don't have to include the filename - just replace the part you don't want with nothing at all:
sed 's/#[0-9][0-9]*$//'
or
sed -E 's/#[0-9]+$//'
If your real problem does require the fancy version, though, you could also use Perl, which has the advantage that there's relatively few (almost no) changes in regex syntax across versions. It also understands that \d syntax you tried to use:
perl -pe 's/(.*\.rb)#\d+$/\1/g'
With GNU sed, your command works if you use -E and change \d to [0-9] or [[:digit:]]:
echo "//dir1/dir2/dir3/component/file.rb#70" | sed -E 's/(.*rb)#[0-9]+$/\1/g'
//dir1/dir2/dir3/component/file.rb
Depending on the context, you may be able to use a simpler command, such as
sed 's/#[0-9]\+//g'
You got the answer but have you considered simply:
$ echo "//dir1/dir2/dir3/component/file.rb#70" | cut -d'#' -f1
//dir1/dir2/dir3/component/file.rb