Using sed with regex to replace text on OSX and Linux

I am trying to replace some strings inside a file with sed using Regular Expressions. To complicate the matter, this is being done inside a Makefile script that needs to work on both osx and linux.
Specifically, within file.tex I want to replace
\subimport{chapters/}{xxx}
with
\subimport{chapters/}{xxx-yyy}
(xxx and yyy are just example text.)
Note, xxx could contain any letters, numbers, and _ (underscore) but really the regex can simply match anything inside the brackets. Sometimes there is some whitespace at the beginning of the line before \subimport....
The design of the string being searched for requires a lot of escaping (when searched for with regex) and I am guessing somewhere therein lies my error.
Here's what I've tried so far:
sed -i'.bak' -e 's/\\subimport\{chapters\/\}\{xxx\}/\\subimport\{chapters\/\}\{xxx-yyy\}/g' file.tex
# the -i'.bak' is required so SED works on OSX and Linux
rm -f file.tex.bak # because of this, we have to delete the .bak files after
This results in an error of RE error: invalid repetition count(s) when I build my Makefile that contains this script.
I thought part of my problem was that the -E option for sed was not available in the osx version of sed. It turns out, when using the -E option, fewer things should be escaped (see comments on my question).

POSIX-ly:
sed 's#^\(\\subimport{chapters/}{[[:alnum:]_]\+\)}$#\1-yyy}#'
# is used as the delimiter for sed's s (substitute) command, so the / in the pattern needs no escaping
\(\\subimport{chapters/}{[[:alnum:]_]\+\) is the captured group, containing everything required up to the last }, which is preceded by one or more alphanumerics or underscores (note that \+ is a GNU extension in basic regex)
In the replacement, the first captured group is followed by the required string and closed by a }
Example:
$ sed 's#^\(\\subimport{chapters/}{[[:alnum:]_]\+\)}$#\1-yyy}#' <<<'\subimport{chapters/}{foobar9}'
\subimport{chapters/}{foobar9-yyy}
$ sed 's#^\(\\subimport{chapters/}{[[:alnum:]_]\+\)}$#\1-yyy}#' <<<'\subimport{chapters/}{spamegg923}'
\subimport{chapters/}{spamegg923-yyy}
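Since \+ in a basic regex is a GNU extension, a more portable sketch of the same substitution could use the \{1,\} interval instead. It also allows optional leading blanks, as the question asks. The function name append_suffix and the sample lines are illustrative only, not part of the original answer:

```shell
# Portable BRE sketch: \{1,\} replaces the GNU-only \+,
# and [[:blank:]]* tolerates whitespace before \subimport.
append_suffix() {
  sed 's#^\([[:blank:]]*\\subimport{chapters/}{[[:alnum:]_]\{1,\}\)}$#\1-yyy}#'
}

# Matching lines gain the -yyy suffix; other lines pass through untouched.
printf '%s\n' \
  '\subimport{chapters/}{foobar9}' \
  '  \subimport{chapters/}{intro_1}' \
  'unrelated line' | append_suffix
```

Unescaped { and } are literal in a basic regex, so only the interval itself needs backslashes.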

Here is the version that ended up working for me.
sed -i.bak -E 's#^([[:blank:]]*\\subimport{chapters/}{[[:alnum:]_]+)}$#\1-yyy}#' file.tex
rm -f file.tex.bak
Many thanks to @heemayl. Their answer is the better-written one; it simply required some tweaking to get a version that worked for me.

Related

Sed Regex OSX find Roman numerals and replace with empty string. Error "unterminated substitute pattern"

This is probably a Sed and shell scripting syntax issue as well as Regex.
(Edit: maybe an I/O issue, as the regex worked when reading the file within the bash shell, but the actual .txt file was not altered as desired)
I am trying to prepare a .txt file for some natural language processing work. I want to delete the Roman numerals in a plain text file containing Shakespeare's sonnets. Each sonnet begins with a Roman numeral such as IX. or XVIII., which is the title of the individual sonnet and includes the trailing period.
Example input text:
XXV.
Let those who are in favour with their stars
Of public honour and proud titles boast,
Desired output:
Let those who are in favour with their stars
Of public honour and proud titles boast,
Following the example in this question, I tried all the following commands in Terminal bash shell:
$ sed -i 's/[IVXLC]{1,}[.]//g' sonnets.txt
$ sed -i 's/[IVXLC]{1,}[.]/^$/g' sonnets.txt
$ sed -i 's/[IVXLC]{1,}[.]/()/g' sonnets.txt
$ sed -i 's/[IVXLC]{1,}[.]/[]/g' sonnets.txt
The idea was to replace any match with an empty string. Since that didn't work, I tried to replace match with a space character:
$ sed -i 's/[IVXLC]{1,}[.]/^ $/g' sonnets.txt
No luck. All commands above returned the same error:
sed: 1: "sonnets.txt": unterminated substitute pattern
I tested the regex in the "find" field on https://regexr.com/ and it seemed to be correct. The target file was right in the working directory. Any idea what went wrong? What characters should I be using in the "replace" field of the Sed command? Should I modify the regex and/or the Sed command?
The curly brackets need to be escaped.
$ sed 's/[IVXLC]\{1,\}[.]//g' sonnets.txt
Let those who are in favour with their stars
Of public honour and proud titles boast,
As @Jonathan Leffler mentioned in the comments, my Mac is using BSD sed, and that's why the command didn't work.
So I installed GNU sed through Homebrew:
brew install gnu-sed
Then used the command:
gsed -i 's/[IVXLC]\{1,\}[.]//g' sonnets.txt
Typing in gsed invokes the GNU sed, and it worked as desired. It altered the content of the .txt file in place.
In this configuration, as @Hakan Baba mentioned, the regex did need to escape the curly braces:
\{ \}
The problem seems to be with the range (or limiting) quantifier {m,n}: in sed's default basic regex syntax, unescaped braces are literal characters, so {1,} is not read as a quantifier. Note that you may rewrite the {1,} quantifier as [IVXLC][IVXLC]* (one Roman "digit" followed by zero or more Roman digits):
sed -i 's/[IVXLC][IVXLC]*[.]//g' sonnets.txt
          ^^^^^^^^^^^^^^^
Also, if you need to make sure you only match the Roman numerals at the start of the line, prepend ^ to the pattern (which also means you may omit the g flag at the end of the command). To match them as whole words, add the [[:<:]] leading word boundary (a BSD regex extension) at the start of the pattern.
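The error message in the question quotes "sonnets.txt" as the script, which looks consistent with BSD sed consuming the s/// expression as the backup suffix that its -i option requires. A minimal sketch that should behave the same on BSD and GNU sed attaches the suffix directly to -i and passes the script with -e. The helper name strip_numerals and the temporary demo file are assumptions for illustration:

```shell
# Remove a leading Roman numeral title such as "XXV." in place.
# -i.bak (suffix attached) is accepted by both BSD and GNU sed.
strip_numerals() {
  # $1: file to edit in place; the backup is removed on success
  sed -i.bak -e 's/^[IVXLC][IVXLC]*[.]//' "$1" && rm -f "$1.bak"
}

# Demo on a throwaway file so nothing real is modified.
demo=$(mktemp)
printf '%s\n' \
  'XXV.' \
  'Let those who are in favour with their stars' \
  'Of public honour and proud titles boast,' > "$demo"
strip_numerals "$demo"
cat "$demo"
rm -f "$demo"
```

The substitution leaves an empty line where the numeral was. Deleting the whole line would need a different command (for example a d command on the same address) and is not shown here.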

sed regexp, number reformatting: how to escape for bash

I have a working (in macOS app Patterns) RegExp that reformats GeoJSON MultiPolygon coordinates, but don't know how to escape it for sed.
The file I'm working on is over 90 Mb in size, so bash terminal looks like the ideal place and sed the perfect tool for the job.
Search Text Example:
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
Desired outcome:
[[[37.9017735,69.400367955],[37.90098431,69.400425761],[37.90004869,69.400489545],[37.89915455,69.400578128],[37.89840665,69.400660744],[37.89747072,69.400762152],[37.89628639,69.400905283],[37.89545822,69.401014028],[37.89479369,69.401113128],[37.89414564,69.401195094],[37.89362565,69.401281229],[37.89276089,69.401414764],[37.89196611,69.401540312],[37.891721,69.401587053],[37.89137614,69.401634443],[37.89136515,69.401635893],[37.89114453,69.401663531],
My current RegExp:
((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))
and reformatting:
$1\.$2$4,$6.$7$9
The command should be something along these lines:
sed -i -e 's/ The RegExp escaped /$1\.$2$4,$6.$7$9/g' large_file.geojson
But what should be escaped in the RegExp to make it work?
My attempts always complain of being unbalanced.
I'm sorry if this has already been answered elsewhere, but I couldn't find it even after extensive searching.
Edit: 2017-01-07: I didn't make it clear that the file contains properties other than just the GPS points. One of the other example values picked from GeoJSON Feature properties is "35.642.1.001_001", which should be left unchanged. The bracket check in my original regex is there for this reason.
That regex is not legal in sed; since it uses Perl syntax, my recommendation would be to use perl instead. The regular expression works exactly as-is, and even the command line is almost the same; you just need to add the -p option to get perl to operate in filter mode (which sed does by default). I would also recommend adding an argument suffix to the -i option (whether using sed or perl), so that you have a backup of the original file in case something goes horribly wrong. As for quoting, all you need to do is put the substitution command in single quotation marks:
perl -p -i.bak -e \
's/((?:\[)[0-9]{2})([0-9]+)(\.)([0-9]+)(,)([0-9]{2})([0-9]+)(\.)([0-9]+(?:\]))/$1\.$2$4,$6.$7$9/g' \
large_file.geojson
If your data is just like you showed, you needn't worry about the brackets. You may use a POSIX ERE enabled with -E (or -r in some other distributions) like this:
sed -i -E 's/([0-9]{2})([0-9]*)\.([0-9]+)/\1.\2\3/g' large_file.geojson
Or a POSIX BRE:
sed -i 's/\([0-9]\{2\}\)\([0-9]*\)\.\([0-9]\+\)/\1.\2\3/g' large_file.geojson
See an online demo.
You may see how this regex works here (just a demo, not proof).
Note that in POSIX BRE you need to escape { and } in limiting/range quantifiers and ( and ) in grouping constructs; otherwise they denote literal symbols. The same goes for the + quantifier, although \+ is a GNU extension rather than part of POSIX BRE. In POSIX ERE, you do not need to escape these special characters to make them special; this POSIX flavor is closer to modern regex syntax.
Also, you need to use backslash notation (\1, \2, ...) for backreferences inside the replacement pattern, not $1, $2, ....
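If the bracket check from the question has to be kept (so that values like "35.642.1.001_001" stay untouched), the original Perl-style pattern can be translated to ERE fairly directly. The following is a sketch under that assumption; the function name shift_decimals is made up for the example, and the non-capturing groups are simply dropped because ERE has no (?:...) syntax:

```shell
# ERE translation of the question's regex: match "[lon,lat]" pairs only,
# and move each decimal point so that two digits remain in front of it.
shift_decimals() {
  sed -E 's/\[([0-9]{2})([0-9]+)\.([0-9]+),([0-9]{2})([0-9]+)\.([0-9]+)]/[\1.\2\3,\4.\5\6]/g'
}

printf '%s\n' \
  '[[[379017.735,6940036.7955],[379009.8431,6940042.5761],' \
  '"35.642.1.001_001"' | shift_decimals
# prints:
# [[[37.9017735,69.400367955],[37.90098431,69.400425761],
# "35.642.1.001_001"
```

Because the pattern requires the surrounding [ and ], a number that is not part of a coordinate pair is not rewritten.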
A simple sed will do it:
$ echo "$var"
[[[379017.735,6940036.7955],[379009.8431,6940042.5761],[379000.4869,6940048.9545],[378991.5455,6940057.8128],[378984.0665,6940066.0744],[378974.7072,6940076.2152],[378962.8639,6940090.5283],[378954.5822,6940101.4028],[378947.9369,6940111.3128],[378941.4564,6940119.5094],[378936.2565,6940128.1229],[378927.6089,6940141.4764],[378919.6611,6940154.0312],[378917.21,6940158.7053],[378913.7614,6940163.4443],[378913.6515,6940163.5893],[378911.4453,6940166.3531],
$ echo "$var" | sed 's/\([0-9]\{3\}\)\./.\1/g'
[[[379.017735,6940.0367955],[379.0098431,6940.0425761],[379.0004869,6940.0489545],[378.9915455,6940.0578128],[378.9840665,6940.0660744],[378.9747072,6940.0762152],[378.9628639,6940.0905283],[378.9545822,6940.1014028],[378.9479369,6940.1113128],[378.9414564,6940.1195094],[378.9362565,6940.1281229],[378.9276089,6940.1414764],[378.9196611,6940.1540312],[378.91721,6940.1587053],[378.9137614,6940.1634443],[378.9136515,6940.1635893],[378.9114453,6940.1663531],

Why does sed provide an "invalid content" error on linux but not on mac

I have the following sed extended regular expressions replacement inside a bash script:
sed -i.bak -E 's~^[[:blank:]]*\\iftoggle{[[:alnum:]_]+}{\\input{([[:alnum:]_\/]+)}}{}~\\input{\1}~' file.txt
which should replace strings like
\iftoggle{xx_yy}{\input{xx_yy/zz}}{}
with
\input{xx_yy/zz}
inside file.txt.
This works just fine locally, on OS X, but the script needs to be POSIX. Specifically, it fails on my remote Travis CI build (which uses Linux). While sed -E is not documented for GNU sed, it behaves just like sed -r and seems to work fine, allowing for a POSIX version of sed with extended regular expressions.
The error given is:
sed: -e expression #1, char 81: Invalid content of \{\}
I'm also not sure where the error starts counting characters from, whether it's the beginning of the line, or only that part which is encased in quotes (the expression)?
You don't need ERE here. Using BRE:
sed -i.bak 's~^[[:blank:]]*\\iftoggle{[[:alnum:]_][[:alnum:]_]*}{\\input{\([[:alnum:]_\/][[:alnum:]_\/]*\)}}{}~\\input{\1}~' file.txt
{ doesn't need to be escaped here, since it is a literal character in BRE, but ( does need the backslash to open a capture group.
As + is not part of the BRE, you can replace [[:alnum:]_]+ with [[:alnum:]_][[:alnum:]_]* or with [[:alnum:]_]\{1,\}.
And as a side note, \+ can be used with GNU sed in BRE but keep in mind that it's not portable, it's a GNU extension.
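The reported error appears consistent with GNU sed reading the unescaped { in the ERE as the start of an interval expression, which BSD sed seems to tolerate as a literal. The sketch below shows two variants that avoid the ambiguity: a BRE where braces are literal anyway, and an ERE where every literal brace is escaped. Both function names are hypothetical, and the commands read from standard input instead of editing file.txt in place:

```shell
# BRE: unescaped { } are literal; \{1,\} is the portable "one or more".
strip_toggle_bre() {
  sed 's~^[[:blank:]]*\\iftoggle{[[:alnum:]_]\{1,\}}{\\input{\([[:alnum:]_/]\{1,\}\)}}{}~\\input{\1}~'
}

# ERE: { } are special, so literal braces in the pattern are written \{ and \}.
strip_toggle_ere() {
  sed -E 's~^[[:blank:]]*\\iftoggle\{[[:alnum:]_]+\}\{\\input\{([[:alnum:]_/]+)\}\}\{\}~\\input{\1}~'
}

line='\iftoggle{xx_yy}{\input{xx_yy/zz}}{}'
printf '%s\n' "$line" | strip_toggle_bre   # prints \input{xx_yy/zz}
printf '%s\n' "$line" | strip_toggle_ere   # prints \input{xx_yy/zz}
```

Because / is part of the bracket expression, the same commands also handle paths with several directory levels.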
This does not directly answer the question with sed, but provides an alternative, simpler way to do the search and replacement with a Perl one-liner.
perl -p -e 's|\\iftoggle\{(\w+)\}\{\\input\{(\w+)/(\w+)\}\}\{\}|\\input{$2/$3}|g' file
\input{xx_yy/zz}
This uses | as the delimiter and \w+ to match runs of [[:alnum:]_] characters.
For in-place replacement, use the -i flag, similar to sed:
perl -p -i.bak -e 's|\\iftoggle\{(\w+)\}\{\\input\{(\w+)/(\w+)\}\}\{\}|\\input{$2/$3}|g' file
Regarding word characters (\w), the Perl character classes documentation says:
Word characters
A \w matches a single alphanumeric character (an alphabetic character, or a decimal digit); or a connecting punctuation character, such as an underscore ("_"); or a "mark" character (like some sort of accent) that attaches to one of those. It does not match a whole word. To match a whole word, use \w+ . This isn't the same thing as matching an English word, but in the ASCII range it is the same as a string of Perl-identifier characters.
For an input with multiple folders inside \input, e.g.
cat file
\iftoggle{xx_yy}{\input{xx_yy/zz_yy_zz_kk/dude_hjgk}}{}
perl -p -e 's|\\iftoggle\{(\w+)\}\{\\input\{(\w+)/(\w+)/(\w+)\}\}\{\}|\\input{$2/$3/$4}|g' file
\input{xx_yy/zz_yy_zz_kk/dude_hjgk}
Just add as many capturing groups as you need.

Trying to remove version number from a string using sed in OSX

I have what I hope is a simple issue which is stumping me. I need to take an installer file with a name like:
installer_v0.29_linux.run
installer_v10.22_linux_x64.run
installer_v1.1_osx.app
installer_v5.6_windows.exe
and zip it up into a file with the format
installer_linux.zip
installer_linux_x64.zip
installer_osx.zip
installer_windows.zip
I already have a bash script running on OSX which does almost everything else I need in the build chain, and was certain I could achieve this with sed using something like:
ZIP_NAME=`echo "$OUTPUT_NAME" | sed -E 's/_(?:\d*\.)?\d+//g'`
That is, replacing the regex _(?:\d*\.)?\d+ with a blank - the regex should match any decimal number preceded by an underscore.
However, I get the error RE error: repetition-operator operand invalid when I try to run this. At this stage I am stumped - I have Googled around this and can't see what I am doing wrong. The regex I wrote works correctly at Regexr, but clearly some element of it is not supported by the sed implementation in OSX. Does anyone know what I am doing wrong?
You can try this sed:
sed 's/_v[^_]*//; s/\.[[:alnum:]]\+$/.zip/' file
installer_linux.zip
installer_linux_x64.zip
installer_osx.zip
installer_windows.zip
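Since the question is about the sed shipped with OSX, and \+ in a basic regex is a GNU extension, a portable sketch of the same two substitutions could use \{1,\} instead. The function name zip_name is an assumption for the example:

```shell
# Drop the "_v<version>" segment, then swap the final extension for .zip.
zip_name() {
  sed -e 's/_v[^_]*//' -e 's/\.[[:alnum:]]\{1,\}$/.zip/'
}

printf '%s\n' \
  installer_v0.29_linux.run \
  installer_v10.22_linux_x64.run \
  installer_v1.1_osx.app \
  installer_v5.6_windows.exe | zip_name
```

This uses only basic regex syntax, so none of the Perl-style constructs from the question (such as (?:...) or \d) are needed.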
You don't need sed, just some parameter expansion magic with an extended pattern.
shopt -s extglob
zip_name=${OUTPUT_NAME/_v+([^_])/}
The pattern _v+([^_]) matches a string starting with _v and all characters up to the next _. The extglob option enables the use of the +(...) pattern to match one or more occurrences of the enclosed pattern (in this case, a non-_ character). The parameter expansion ${var/pattern/} removes the first occurrence of the given pattern from the expansion of $var.
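Where extglob is not available (for example in a plain sh script), a similar result can be sketched with standard parameter expansions only. This variant also replaces the extension with .zip. The function name zip_name_for is hypothetical, and the sketch assumes the name contains exactly one version segment introduced by _v and followed by an underscore:

```shell
# POSIX-only sketch: split the name around "_v<version>_" and rebuild it.
zip_name_for() (
  name=$1
  prefix=${name%%_v*}   # text before the first "_v"        e.g. installer
  rest=${name#*_v}      # text after the first "_v"         e.g. 0.29_linux.run
  rest=${rest#*_}       # drop the version up to the "_"    e.g. linux.run
  printf '%s\n' "${prefix}_${rest%.*}.zip"
)

for f in installer_v0.29_linux.run installer_v10.22_linux_x64.run \
         installer_v1.1_osx.app installer_v5.6_windows.exe; do
  zip_name_for "$f"
done
```

The function body runs in a subshell, so the helper variables do not leak into the calling script.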
You can also try it this way:
sed 's/_[^_]\+//' FileName
Output:
installer_linux.run
installer_linux_x64.run
installer_osx.app
installer_windows.exe
If you want to append .zip to the name as well, use the method below:
sed 's/\([^_]\+\).*\(_.*\).*/\1\2.zip/' Filename
Output:
installer_linux.run.zip
installer_x64.run.zip
installer_osx.app.zip
installer_windows.exe.zip

Sed regex to find-replace version numbers

I'm new to sed and am trying to write a script to find and replace text in a file. The file (test.txt) looks like this:
hello_world (1.2.0.123)
and I'm finding that this script (which I inherited):
sed -i 's/\(^\s*hello_world \)(.*)/\1hello_world (1.2.0.456)/' test.txt
is leading to:
hello_world hello_world (1.2.0.456)
when I need it to be
hello_world (1.2.0.456)
I'm not sure how to make the first part match only the parentheses, any assistance would be appreciated.
EDIT
The whitespace before the hello_world is important
The sed line is being auto-generated using variables etc. I'm looking for a way to make this regex work without changing that. The variables I have to play with are
variable1: hello_world
variable2: hello_world (1.2.0.456)
(hopefully it's obvious where these variables sit within the sed expression)
EDIT
I got this sorted in the end, answer below if anyone else is interested.
Got it
sed -i 's/\(^\s*\)phoenix_utils (.*)/\1phoenix_utils (1.0.0.28583)/' test.txt
sed -i -e 's/^\([[:blank:]]*hello_world \).*/\1(1.0.0.28583)/' YourFile
\1 is the content of the first \( \) group, so \1hello_world writes the name twice in your sample.
Be careful with escaping, which depends on the regex mode in use (the behavior changes, and in the default basic mode sed needs \( \) for a grouping pattern).
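To tie this back to the two variables from the question, here is a minimal sketch in which the capture group holds only the leading whitespace, so the replacement text (which already contains the name) is not duplicated. The function name bump_version is hypothetical, and the sketch assumes neither argument contains a / or other characters that are special to sed:

```shell
# $1: name to look for          (variable1 in the question)
# $2: full replacement text     (variable2 in the question)
bump_version() {
  # \1 restores only the leading blanks; the name comes from $2.
  sed "s/^\([[:blank:]]*\)$1 (.*)/\1$2/"
}

printf '%s\n' '    hello_world (1.2.0.123)' |
  bump_version 'hello_world' 'hello_world (1.2.0.456)'
# prints "    hello_world (1.2.0.456)" with the indentation preserved
```

[[:blank:]] is used in place of \s because \s is a GNU extension. The parentheses around .* are literal here, since this is a basic regex.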