SED not updating with complex regex - regex

I'm trying to automate updating the version number in a file as part of a build process. I can get the following to work, but only for version numbers with single digits in each of the major/minor/fix positions.
sed -i 's/version="[0-9]\.[0-9]\.[0-9]"/version="2.4.567"/g' projectConfig.xml
I've tried a more complex regex pattern and it works in the MS Regular Xpression Tool, but won't match when running sed.
sed -i 's/version="\b\d{1,3}\.\d{1,3}\.\d{1,3}\b"/version="2.4.567"/g' projectConfig.xml
Example Input:
This is a file at version="2.1.245" and it consists of much more text.
Desired output
This is a file at version="2.4.567" and it consists of much more text.
I feel that there is something that I'm missing.

There are 3 problems:
To enable interval quantifiers ({m,n}) in sed you need the -E / --regexp-extended switch (or escape the braces as \{ \}, see http://www.gnu.org/software/sed/manual/html_node/Regular-Expressions.html#Regular-Expressions)
sed does not support the shorthand \d; use the character class [[:digit:]] (or [0-9]) instead.
Your input does not quote the version in ".
sed 's/version=\b[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\b/version="2.4.567"/g' \
<<< "This is a file at version=2.1.245 and it consists of much more text."
To stay more portable, you might want to use the --posix switch (which requires removing \b):
sed --posix 's/version=[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}\.[[:digit:]]\{1,3\}/version="2.4.567"/g' \
<<< "This is a file at version=2.1.245 and it consists of much more text."
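If the file does quote the version, as in the example input from the question, the same idea can be sketched with -E so the braces need no backslashes. This is only an illustration on the sample line, assuming a sed that accepts -E (current GNU and BSD versions do); the variable names are made up for the example.

```shell
# Sample line taken from the question (the version is quoted here).
input='This is a file at version="2.1.245" and it consists of much more text.'

# -E enables extended regular expressions, so {1,3} works unescaped.
# [0-9] replaces \d, which sed does not understand.
result=$(printf '%s\n' "$input" |
  sed -E 's/version="[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"/version="2.4.567"/g')

printf '%s\n' "$result"
```

Once the output looks right, the same expression can be used with -i on projectConfig.xml.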

Related

sed (bash) has different interpretation of regex than any other tool?

I am using sed to clean up a 100MB text file containing word frequencies.
To test my work I work with this short sample:
86501.522305 .
30876.406478 yes
15806.203945 no
15397.078939 what
9461.059877 8
10526.408684 ,
The whitespace is a single tab character.
My goal is to empty all rows with "non-words", i.e. lines 1, 5 and 6.
My regex
^\S*?\t[\W\d]+$
works fine when testing on Regex101 and in Notepad++, but my sed command
sed -ri 's/^\S*?\t[\W\d]+$//g' sample.txt
keeps the file completely unaltered (except for the file metadata).
Does anyone have an idea what could cause this weird behaviour?
I have checked the docs for extended regular expressions and tried escaping all kinds of characters, but with no success.
There's nothing weird about sed's behavior; there are simply multiple flavors of regexp, and different tools support some or all of them in different ways, with different options and different caveats.
sed by default supports POSIX BREs while your regexp contains a PCRE (not an ERE) with a bunch of non-POSIX extensions. GNU and OSX/BSD sed support EREs with the -E argument (older GNU seds use -r) and GNU sed supports some extensions - I'd expect \S and maybe \W to work but not \d. No sed supports PCREs.
FWIW I'd use awk for this for clarity, efficiency, portability, etc.:
$ awk '{print ($NF ~ /[[:alnum:]_]/ ? $0 : "")}' file | cat -n
1
2 30876.406478 yes
3 15806.203945 no
4 15397.078939 what
5 9461.059877 8
6
That will work with any awk in any shell on every UNIX box. The | cat -n is just to show that the lines were emptied rather than deleted.
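For comparison, here is a rough sed sketch of the original regex rewritten with POSIX character classes. Unlike the awk output above, which keeps line 5, it treats a digits-only second field as a non-word, as the question asked. It assumes a sed that accepts -E; the helper name clean and the variable names are made up for this example.

```shell
# The sample from the question; printf expands \t into a real tab.
sample=$(printf '86501.522305\t.\n30876.406478\tyes\n15806.203945\tno\n15397.078939\twhat\n9461.059877\t8\n10526.408684\t,')

# Empty every line whose second field contains no letter or underscore.
# [^[:space:]]* stands in for \S*?, [[:space:]] for \t, [^[:alpha:]_]+ for [\W\d]+.
clean() {
  sed -E 's/^[^[:space:]]*[[:space:]][^[:alpha:]_]+$//'
}

# Count the emptied lines and list the surviving ones.
emptied=$(printf '%s\n' "$sample" | clean | grep -c '^$')
kept=$(printf '%s\n' "$sample" | clean | grep -v '^$')
printf 'emptied: %s\n%s\n' "$emptied" "$kept"
```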

Sed Regex OSX find Roman numerals and replace with empty string. Error "unterminated substitute pattern"

This is probably a Sed and shell scripting syntax issue as well as Regex.
(Edit: maybe an I/O issue, as the regex worked when reading the file within the bash shell, but the actual .txt file was not altered as desired)
Trying to prepare a .txt file for some natural language processing work. Wanted to delete some Roman numerals in a plain text file containing Shakespeare's sonnets, each sonnet beginning with a Roman numeral such as IX. and XVIII. which represents the title of the individual sonnet, including the decimal character.
Example input text:
XXV.
Let those who are in favour with their stars
Of public honour and proud titles boast,
Desired output:
Let those who are in favour with their stars
Of public honour and proud titles boast,
Following the example in this question, I tried all the following commands in Terminal bash shell:
$ sed -i 's/[IVXLC]{1,}[.]//g' sonnets.txt
$ sed -i 's/[IVXLC]{1,}[.]/^$/g' sonnets.txt
$ sed -i 's/[IVXLC]{1,}[.]/()/g' sonnets.txt
$ sed -i 's/[IVXLC]{1,}[.]/[]/g' sonnets.txt
The idea was to replace any match with an empty string. Since that didn't work, I tried to replace match with a space character:
$ sed -i 's/[IVXLC]{1,}[.]/^ $/g' sonnets.txt
No luck. All commands above returned the same error:
sed: 1: "sonnets.txt": unterminated substitute pattern
I tested the regex in the "find" field on https://regexr.com/ and it seemed to be correct. The target file was right in the working directory. Any idea what went wrong? What characters should I be using in the "replace" field of the Sed command? Should I modify the regex and/or the Sed command?
The curly brackets need to be escaped.
$ sed 's/[IVXLC]\{1,\}[.]//g' sonnets.txt
Let those who are in favour with their stars
Of public honour and proud titles boast,
As @Jonathan Leffler mentioned in the comments, my Mac uses BSD sed, and that's why the command didn't work.
So I installed GNU sed through Homebrew:
brew install gnu-sed
Then used the command:
gsed -i 's/[IVXLC]\{1,\}[.]//g' sonnets.txt
Typing in gsed invokes the GNU sed, and it worked as desired. It altered the content of the .txt file in place.
In this configuration, as @Hakan Baba mentioned, the curly braces in the regex did need to be escaped:
\{ \}
The problem seems to be with the range (or limiting) quantifier {m,n}: without -E, sed uses basic syntax, where unescaped braces are literal characters. Note that you may rewrite the {1,} quantifier as [IVXLC][IVXLC]* (one Roman "digit" followed by zero or more Roman digits):
sed -i 's/[IVXLC][IVXLC]*[.]//g' sonnets.txt
          ^^^^^^^^^^^^^^^
Also, if you need to make sure you only match Roman numerals at the start of a line, add ^ at the start of the pattern (which also means you may omit the g flag at the end of the command). To match them as whole words in BSD sed, add the [[:<:]] leading word boundary at the start of the pattern.
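The error message itself is a hint: sed names "sonnets.txt" as the script, which suggests that BSD sed took the s/.../ expression as the backup suffix its -i option expects. Below is a minimal sketch that should run on both BSD and GNU sed, assuming -E is available. It works on a throwaway temp file (hypothetical, created only for this example) and deletes the whole heading line, matching the desired output.

```shell
# Work on a scratch copy so the sketch is safe to run anywhere.
tmp=$(mktemp)
printf 'XXV.\nLet those who are in favour with their stars\nOf public honour and proud titles boast,\n' > "$tmp"

# -i.bak (suffix attached, no space) is accepted by both GNU and BSD sed.
# -E lets {1,} be written without backslashes; ^ and $ restrict the match
# to lines that consist only of a Roman numeral and a period.
sed -i.bak -E '/^[IVXLC]{1,}\.$/d' "$tmp"

result=$(cat "$tmp")
printf '%s\n' "$result"

# Clean up the scratch file and the backup that -i.bak created.
rm -f "$tmp" "$tmp.bak"
```

On macOS the form sed -i '' ... (an empty suffix as a separate argument) also avoids the backup file, but GNU sed does not accept that spelling.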

Regex not working in Bash

I have this regex for now.
It should match something like this:
org.package;version="[1.0.41, 1.0.51)" followed by an optional "," when it is not the last element.
I also added .* after the package name, because the package could be "org.package.util.something" up to ";version".
I tried it in an online regex tool, where it works like this:
org.package(.*.*)?;version="[[0-9].[0-9].[0-9][0-9],\s[0-9].[0-9].[0-9][0-9])",?
but I don't know what I should change so that it works in bash:
package="org.package"
sed -i "s/"$$package.*;version="\[[0-9].[0-9].[0-9][0-9],[[:space:]][0-9].[0-9].[0-9][0-9]\)",?"//g" "$file"
Replace the double quotes around the sed command with single quotes. To expand $package, close the single quotes, put the variable in double quotes, and reopen the single quotes. The pattern also uses \) and ? with their extended-syntax meanings, so add -E (and escape the literal dots):
package="org.package"
sed -E -i 's/'"$package"'.*;version="\[[0-9]\.[0-9]\.[0-9][0-9],[[:space:]][0-9]\.[0-9]\.[0-9][0-9]\)",?//g' "$file"
Before using the command with the -i option, check that the output is correct.
There is more than one problem:
$$ is replaced by bash with its own PID, which is probably not what you want.
Online regex evaluators usually use extended regex or Perl regex syntax, while sed defaults to basic syntax.
sed -r (or -E) enables extended regex mode (for grep there are -E and -P).
You use . when you want to match literal dots. However you should be using \., because . actually means "any character" in regular expressions.
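The points from both answers can be combined into one sketch that runs on a sample string instead of a file. The sample line and the hand-escaped package value are made up for this example. It also swaps .* for [^;]*, which is a choice made here so that the match cannot run across several list elements. It assumes a sed that accepts -E.

```shell
# Hypothetical sample: two list elements, only the first belongs to org.package.
line='org.package.util;version="[1.0.41, 1.0.51)",org.other;version="[2.0.10, 2.0.20)"'

# The dots in the package name are escaped by hand so they match literally.
package='org\.package'

# Single quotes protect the regex from the shell; they are closed only
# around "$package" so that the variable is expanded.
result=$(printf '%s\n' "$line" |
  sed -E 's/'"$package"'[^;]*;version="\[[0-9]\.[0-9]\.[0-9]{2},[[:space:]][0-9]\.[0-9]\.[0-9]{2}\)",?//g')

printf '%s\n' "$result"
```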

sed regexp, number reformatting: how to escape for bash

I have a working (in macOS app Patterns) RegExp that reformats GeoJSON MultiPolygon coordinates, but don't know how to escape it for sed.
The file I'm working on is over 90 Mb in size, so bash terminal looks like the ideal place and sed the perfect tool for the job.
Search Text Example:
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
Desired outcome:
[[[37.9017735,69.400367955],[37.90098431,69.400425761],[37.90004869,69.400489545],[37.89915455,69.400578128],[37.89840665,69.400660744],[37.89747072,69.400762152],[37.89628639,69.400905283],[37.89545822,69.401014028],[37.89479369,69.401113128],[37.89414564,69.401195094],[37.89362565,69.401281229],[37.89276089,69.401414764],[37.89196611,69.401540312],[37.891721,69.401587053],[37.89137614,69.401634443],[37.89136515,69.401635893],[37.89114453,69.401663531],
My current RegExp:
((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))
and reformatting:
$1\.$2$4,$6.$7$9
The command should be something along these lines:
sed -i -e 's/ The RegExp escaped /$1\.$2$4,$6.$7$9/g' large_file.geojson
But what should be escaped in the RegExp to make it work?
My attempts always complain of being unbalanced.
I'm sorry if this has already been answered elsewhere, but I couldn't find even after extensive searching.
Edit: 2017-01-07: I didn't make it clear that the file contains properties other than just the GPS-points. One of the other example values picked from GeoJSON Feature properties is "35.642.1.001_001", which should be left unchanged. The braces check in my original regex is there for this reason.
That regex is not legal in sed; since it uses Perl syntax, my recommendation would be to use perl instead. The regular expression works exactly as-is, and even the command line is almost the same; you just need to add the -p option to get perl to operate in filter mode (which sed does by default). I would also recommend adding an argument suffix to the -i option (whether using sed or perl), so that you have a backup of the original file in case something goes horribly wrong. As for quoting, all you need to do is put the substitution command in single quotation marks:
perl -p -i.bak -e \
's/((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))/$1\.$2$4,$6.$7$9/g' \
large_file.geojson
If your data is just like you showed, you needn't worry about the brackets. You may use a POSIX ERE enabled with -E (or -r in older GNU sed versions) like this:
sed -i -E 's/([0-9]{2})([0-9]*)\.([0-9]+)/\1.\2\3/g' large_file.geojson
Or a POSIX BRE:
sed -i 's/\([0-9]\{2\}\)\([0-9]*\)\.\([0-9]\+\)/\1.\2\3/g' large_file.geojson
Note that in POSIX BRE you need to escape { and } in limiting / range quantifiers and ( and ) in grouping constructs, else they denote literal symbols; \+ is a GNU extension to BRE, so a portable BRE would use \{1,\} instead. In POSIX ERE, you do not need to escape these special chars to make them special; this POSIX flavor is closer to modern regexes.
Also, you need to use \n notation inside the replacement pattern, not $n.
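If the surrounding brackets have to be part of the match, as the edit to the question requires, the original expression can be carried over to ERE fairly directly. sed has no (?:...) groups, so the brackets move outside the capture groups and are written back literally in the replacement. This is a sketch on two coordinate pairs from the question, assuming a sed that accepts -E; the function and variable names are made up for the example.

```shell
# Two coordinate pairs from the question, plus the property value
# that must stay untouched.
coords='[[[379017.735,6940036.7955],[379009.8431,6940042.5761]]]'
prop='"35.642.1.001_001"'

# ERE version: \[ is a literal bracket, (...) captures,
# and \1..\6 in the replacement refer back to the groups.
fix_coords() {
  sed -E 's/\[([0-9]{2})([0-9]+)\.([0-9]+),([0-9]{2})([0-9]+)\.([0-9]+)]/[\1.\2\3,\4.\5\6]/g'
}

new_coords=$(printf '%s\n' "$coords" | fix_coords)
new_prop=$(printf '%s\n' "$prop" | fix_coords)
printf '%s\n%s\n' "$new_coords" "$new_prop"
```

Because the pattern requires an opening bracket, two numbers separated by a comma, and a closing bracket, values such as "35.642.1.001_001" do not match.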
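A simple sed can shift the decimal point (three places here, which differs from the two-leading-digit format requested above):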
$ echo "$var"
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
$ echo "$var" | sed 's/\([0-9]\{3\}\)\./.\1/g'
[[[379.017735,6940.0367955],[379.0098431,6940.0425761],[379.0004869,6940.0489545],[378.9915455,6940.0578128],[378.9840665,6940.0660744],[378.9747072,6940.0762152],[378.9628639,6940.0905283],[378.9545822,6940.1014028],[378.9479369,6940.1113128],[378.9414564,6940.1195094],[378.9362565,6940.1281229],[378.9276089,6940.1414764],[378.9196611,6940.1540312],[378.91721,6940.1587053],[378.9137614,6940.1634443],[378.9136515,6940.1635893],[378.9114453,6940.1663531],

Why does sed provide an "invalid content" error on linux but not on mac

I have the following sed extended regular expressions replacement inside a bash script:
sed -i.bak -E 's~^[[:blank:]]*\\iftoggle{[[:alnum:]_]+}{\\input{([[:alnum:]_\/]+)}}{}~\\input{\1}~' file.txt
which should replace strings like
\iftoggle{xx_yy}{\input{xx_yy/zz}}{}
with
\input{xx_yy/zz}
inside file.txt.
This works just fine locally, on OS X, but the script needs to be POSIX. Specifically, it fails on my remote Travis CI build (which uses Linux). While sed -E is not documented for GNU sed, it behaves just like sed -r and seems to work fine, allowing for a POSIX version of sed with extended regular expressions.
The error given is:
sed: -e expression #1, char 81: Invalid content of \{\}
I'm also not sure where the error starts counting characters from, whether it's the beginning of the line, or only that part which is encased in quotes (the expression)?
You don't need ERE here. Using BRE:
sed -i.bak 's~^[[:blank:]]*\\iftoggle{[[:alnum:]_][[:alnum:]_]*}{\\input{\([[:alnum:]_\/][[:alnum:]_\/]*\)}}{}~\\input{\1}~' file.txt
{ doesn't need to be escaped here, but ( does.
As + is not part of the BRE, you can replace [[:alnum:]_]+ with [[:alnum:]_][[:alnum:]_]* or with [[:alnum:]_]\{1,\}.
And as a side note, \+ can be used with GNU sed in BRE but keep in mind that it's not portable, it's a GNU extension.
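The \{1,\} variant mentioned above can be checked on the sample line from the question without touching any file. This is a small sketch; the variable names are made up for the example.

```shell
# Sample line from the question; single quotes keep the backslashes intact.
line='\iftoggle{xx_yy}{\input{xx_yy/zz}}{}'

# Basic regular expression: plain { and } are literal, \{1,\} is the
# "one or more" interval, and \( \) form the capture group.
result=$(printf '%s\n' "$line" |
  sed 's~^[[:blank:]]*\\iftoggle{[[:alnum:]_]\{1,\}}{\\input{\([[:alnum:]_/]\{1,\}\)}}{}~\\input{\1}~')

printf '%s\n' "$result"
```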
This does not directly answer the question with sed, but provides an alternate simpler way to do this in perl command-line regex search and replacement.
perl -p -e 's|\\iftoggle\{(\w+)\}\{\\input\{(\w+)/(\w+)\}\}\{\}|\\input{$2/$3}|g' file
\input{xx_yy/zz}
Using | as the delimiter and \w+ to match the [[:alnum:]_] characters.
For in-place replacement, use the -i flag similar to sed
perl -p -i.bak -e 's|\\iftoggle\{(\w+)\}\{\\input\{(\w+)/(\w+)\}\}\{\}|\\input{$2/$3}|g' file
From the Perl documentation on character classes, regarding word characters (\w):
Word characters
A \w matches a single alphanumeric character (an alphabetic character, or a decimal digit); or a connecting punctuation character, such as an underscore ("_"); or a "mark" character (like some sort of accent) that attaches to one of those. It does not match a whole word. To match a whole word, use \w+ . This isn't the same thing as matching an English word, but in the ASCII range it is the same as a string of Perl-identifier characters.
For an input-with multiple folders inside input, e.g.
cat file
\iftoggle{xx_yy}{\input{xx_yy/zz_yy_zz_kk/dude_hjgk}}{}
perl -p -e 's|\\iftoggle\{(\w+)\}\{\\input\{(\w+)/(\w+)/(\w+)\}\}\{\}|\\input{$2/$3/$4}|g' file
\input{xx_yy/zz_yy_zz_kk/dude_hjgk}
Just add as many capturing groups as you need.