Sed script to rewrite certain strings - regex

I'm dealing with a body of XML files containing unstructured texts with semantic markup for personal names.
For reasons to do with the stylesheet that will eventually show them via a web application, I need to replace:
<persName>Fred</persName>'s
<persName>Wilma</persName>'s
with
<persName>Fred's</persName>
<persName>Wilma's</persName>
I have a single line in a shell script, run in Git Bash for Windows, shown below. It runs OK but has no effect. I suppose I'm missing something obvious, perhaps to do with escaping characters, but any help would be appreciated.
sed -i "s/<\/persName>\'s/\'s<\/persName>/g" test.xml

You may use
sed -i "s,</persName>'s,'s</persName>,g" test.xml
Details
s - we want to replace
, - a delimiter
</persName>'s - this string to find
, - delimiter
's</persName> - replace with this string
, - delimiter
g - multiple times if more than one is found
The -i option makes the replacements directly in the file.
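Note that you do not have to escape ' when defining the sed command inside a double-quoted string. Inside double quotes the shell passes \' to sed unchanged, and GNU sed treats \' as an anchor that matches only at the end of the pattern space, which is why the original command never matched anything.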
It is a good idea to use a delimiter char other than the common / if there are / chars inside the regex or/and replacement pattern.
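To try the command without touching a file, a minimal sketch along these lines pipes sample text through the same substitution. The function name move_possessive and the sample sentences are made up for illustration; only the sed expression comes from the answer above.

```shell
#!/bin/sh
# Hypothetical helper: reads text on stdin and moves a trailing 's
# inside the preceding </persName> closing tag.
move_possessive() {
    # "," is the delimiter, so the "/" in </persName> needs no escaping.
    sed "s,</persName>'s,'s</persName>,g"
}

# Made-up sample input standing in for test.xml.
printf '%s\n' \
    "I saw <persName>Fred</persName>'s car." \
    "It was near <persName>Wilma</persName>'s house." |
    move_possessive
```

Once the output looks right, the same expression can be run with -i against the real file.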

The comment on your question suggests an easier solution, but I guess there might be names where the suffix differs from 's, such as names ending in s. So I chose a solution that captures whatever follows the closing tag and moves it inside the element.
You can choose any character as the separator for sed's search-and-replace command. I've chosen #, so you don't have to escape the slashes in the text. The escaped parentheses store what's inside them in the groups \1 and \2. Note that the second \(.*\) captures everything up to the end of the line, so this works only when the suffix is the last thing on the line.
sed 's#<persName>\(.*\)</persName>\(.*\)#<persName>\1\2</persName>#g' testfile
Result:
<persName>Fred's</persName>
<persName>Wilma's</persName>
If you want to replace it in file, you can use the -i parameter. But be sure to check the result first.
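If the suffix can be either 's or a bare apostrophe (for names ending in s) and more text follows on the same line, a narrower pattern is one possible alternative. This is only a sketch: the helper name move_suffix and the sample line are invented, and it assumes that an apostrophe directly after the closing tag always belongs to the name.

```shell
#!/bin/sh
# Hypothetical helper: moves ' or 's that directly follows </persName>
# to just before the closing tag.
move_suffix() {
    # \('s\{0,1\}\) captures an apostrophe optionally followed by one s.
    sed "s#</persName>\('s\{0,1\}\)#\1</persName>#g"
}

# Made-up sample line with two names and trailing text.
printf '%s\n' \
    "<persName>Fred</persName>'s hat and <persName>James</persName>' coat" |
    move_suffix
```

Because only the apostrophe and an optional s are captured, the rest of the line stays outside the element.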

Related

Using Variables with Regex that contain a space (\s) and sed

I'm trying to create a sort script using literal string variables, regex, and sed in bash. I cannot seem to match literal strings with spaces when using variables, although I can match them when using the regex directly. So:
#!/bin/bash
group1="IRISHFHD"
group2="REGIONAL FHD"
sed -i '/group-title="'${group1}/',+1d' JWLINE.m3u
sed -i '/group-title="'${group2}/',+1d' JWLINE.m3u
I've tried adding \s into the group variable, but it doesn't work.
John
The problem has nothing to do with regex, it's all down to how the shell treats variables' values. When a variable is expanded without double-quotes around it (i.e. ${group2}), the shell will split it into "words" based on whitespace. It'll also try to expand any words that contain shell wildcards into lists of matching files, and several regex metacharacters look like shell wildcards, which can cause serious chaos.
In this example:
sed -i '/group-title="'${group2}/',+1d' JWLINE.m3u
It's a little more complicated, because the variable reference is in between two single-quoted sections. In this case, the part before the variable reference gets attached to the first "word" in the variable, and the part after gets attached to the last word. Essentially, it expands into the equivalent of this:
sed -i '/group-title="REGIONAL' 'FHD/,+1d' JWLINE.m3u
(The space between 'REGIONAL' and 'FHD separates two arguments.)
Anyway, since it gets split on the whitespace, sed gets two partial arguments instead of one whole one, and it doesn't work at all.
Solution: as in almost all situations, you should have double-quotes around the variable reference to prevent weird effects like this. There are a few options for this. You could just add double-quotes around the variable part:
sed -i '/group-title="'"${group2}"/',+1d' JWLINE.m3u
...but IMO this is confusing; some of those quotes are syntactic (i.e. parsed by the shell), and one is literal (passed to sed as part of the regex), and it's not obvious which are which. I'd prefer to just use double-quotes around the whole thing, and escape the double-quote that's supposed to be literal:
sed -i "/group-title=\"${group2}/,+1d" JWLINE.m3u
(The backslash before the second double quote makes that " a literal part of the argument.)
(In double-quotes, you'd also need to escape any dollar signs, backslashes, or backticks that were supposed to be literal parts of the argument. But in this case, there aren't any of those.)
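The splitting is easy to observe by counting how many arguments each form produces. The following is a small sketch; count_args is a hypothetical helper, not part of the original script, and the default IFS is assumed.

```shell
#!/bin/sh
# Hypothetical helper: prints the number of arguments it received.
count_args() {
    echo "$#"
}

group2="REGIONAL FHD"

# Unquoted expansion: the value is split at the space, so two arguments.
count_args '/group-title="'${group2}/',+1d'      # prints 2

# Double quotes around the variable only: one argument.
count_args '/group-title="'"${group2}"/',+1d'    # prints 1

# Double quotes around the whole expression: one argument.
count_args "/group-title=\"${group2}/,+1d"       # prints 1
```

sed needs the single-argument form, because it treats the first non-option argument as the script and any further ones as file names.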

egrep: how to search for text that includes double quotes (win7 cmd window)

I'm trying to make a TOC in my HTML file by searching for all HTML tags that contain one of three classes: article, section, and subsection.
I'm using GNU grep 2.4.2 in a Windows 7 cmd window. Now I've read at least 12 pages from my Google search and tried 20+ permutations of my grep command. I'm trying to find classes in my HTML file. Luckily in my HTML file there is only one HTML tag per line in the HTML file, which simplifies things.
I made a cmd batch file and tried running this and got various errors. I've tried escaping the double quotes, and not escaping them. I tried escaping the parens and not escaping them. I've tried different switches, with and without -E, etc. This is the regex I need to search for on every line and print the lines that match.
/class="\(article\|section\|subsection\)"/
This is one of my later grep attempts.
grep -i -E 'class="\(article\|section\|subsection\)"' ch18IP.htm
In this example I'm not getting any lines returned nor any error message. What am I doing wrong here?
Thank you!
You have three problems:
1) double quote " literals must be escaped as \" when using grep on windows.
2) meta-characters (, ), and | should only be escaped as \(, \), and \| when using basic mode. The -E extended regex option uses the more traditional unescaped form. This is documented at http://www.gnu.org/software/grep/manual/html_node/Basic-vs-Extended.html
3) If a parameter requires quoting on Windows, then double quotes are used, not single quotes. But in this case, enclosing quotes are not required, and would actually get in the way. I'll explain this later in the answer.
I also suggest that you add a word boundary assertion \b before class so that you don't mistakenly match something like subclass.
So either of the following should work:
grep -i -E \bclass=\"(article|section|subsection)\" ch18IP.htm
grep -i \bclass=\"\(article\|section\|subsection\)\" ch18IP.htm
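For comparison, the same two patterns can be tried in a POSIX shell such as Git Bash, where single quotes are enough and none of the cmd.exe escaping applies. This is a sketch under assumptions: the sample lines are invented, and \b (and \| in basic mode) are GNU grep extensions.

```shell
#!/bin/sh
# Made-up sample standing in for ch18IP.htm: one tag per line.
sample() {
    printf '%s\n' \
        '<h1 class="article">One</h1>' \
        '<h2 class="section">Two</h2>' \
        '<p class="note">Three</p>' \
        '<div subclass="section">Four</div>'
}

# Extended mode (-E): ( ) and | are metacharacters without backslashes.
sample | grep -i -E '\bclass="(article|section|subsection)"'

# Basic mode: the same metacharacters need backslashes.
sample | grep -i '\bclass="\(article\|section\|subsection\)"'
```

Both commands print only the first two sample lines; the subclass line is skipped because \b requires a word boundary before class.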
It gets tricky if you want to enclose your search argument in quotes because the search term also includes quote literals, as well as poison characters like | that have special meaning to the cmd "shell". So you may end up having to escape some characters for both grep and cmd.exe. See https://stackoverflow.com/a/19816688/1012053 for more info.
In your case, here are two options for how you could quote your search term for Windows.
grep -i -E ^"\bclass=\"(article|section|subsection)\"^" ch18IP.htm
grep -i -E "\bclass=\"(article^|section^|subsection)\"" ch18IP.htm
That last form looks mighty weird if you decide to use the basic regex:
grep -i "\bclass=\"\(article\^|section\^|subsection\)\"" ch18IP.htm
Getting double-quotes as input on Windows cmd.exe command line is notoriously problematic. See if this works for you: https://www.gnu.org/software/gawk/manual/html_node/DOS-Quoting.html

Rename Files Mac Command Line

I have a bunch of files in a directory that were produced with rather unfortunate names. I want to change two of the characters in the name.
For example I have:
>ch:sdsn-sdfs.txt
and I want to remove the ">" and change the ":" to a "_".
Resulting in
ch_sdsn-sdfs.txt
I tried to just say mv \\>ch\:* ch_* but that didn't work.
Is there a simple solution to this?
For command line script to rename, this stackoverflow question has good answers.
On a Mac, the Finder GUI has bulk rename capabilities, which are handy when the file names share a pattern to find and replace:
Select all the files that need to be renamed, right-click and select Rename.
In the rename dialog, enter the find and replace strings.
The rename dialog also has options to sequence the file names and to add prefix or suffix text.
First, I should say that the easiest way to do this is to use the
prename or rename commands.
Homebrew package rename, MacPorts package renameutils:
rename s/0000/000/ F0000*
That's a lot more understandable than the equivalent sed command.
But as for understanding the sed command, the sed manpage is helpful. If
you run man sed and search for & (using the / command to search),
you'll find it's a special character in s/foo/bar/ replacements.
s/regexp/replacement/
Attempt to match regexp against the pattern space. If successful,
replace that portion matched with replacement. The
replacement may contain the special character & to refer to that
portion of the pattern space which matched, and the special
escapes \1 through \9 to refer to the corresponding matching
sub-expressions in the regexp.
Therefore, \(.\) matches the first character, which can be referenced by \1.
Then . matches the next character, which is always 0.
Then \(.*\) matches the rest of the filename, which can be referenced by \2.
The replacement string puts it all together using & (the original
filename) and \1\2 which is every part of the filename except the 2nd
character, which was a 0.
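The sed command being described is not quoted here, but from the description it probably had roughly the following shape. Treat this as a reconstruction rather than the original command; the function name and the file name are made-up examples.

```shell
#!/bin/sh
# Assumed shape of the command described above:
#   \(.\)   first character            -> \1
#   .       second character (a 0)     -> dropped
#   \(.*\)  rest of the name           -> \2
#   &       the whole original name
build_mv_command() {
    sed 's/\(.\).\(.*\)/mv & \1\2/'
}

# Hypothetical file name; prints: mv F00001-0708-a.txt F0001-0708-a.txt
echo 'F00001-0708-a.txt' | build_mv_command
```

The output is only text; nothing is renamed until that text is piped to sh.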
This is a pretty cryptic way to do this, IMHO. If for
some reason the rename command was not available and you wanted to use
sed to do the rename (or perhaps you were doing something too complex
for rename?), being more explicit in your regex would make it much
more readable. Perhaps something like:
ls F00001-0708-*|sed 's/F0000\(.*\)/mv & F000\1/' | sh
Being able to see what's actually changing in the
s/search/replacement/ makes it much more readable. Also it won't keep
sucking characters out of your filename if you accidentally run it
twice or something.
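The same idea can be applied to the names in the question (>ch:sdsn-sdfs.txt). This is a possible sketch, assuming every affected file starts with > and that each colon should become an underscore. The function names are hypothetical, and the loop only prints the mv commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: strips a leading ">" and turns ":" into "_".
clean_name() {
    printf '%s\n' "$1" | sed 's/^>//; s/:/_/g'
}

# Hypothetical helper: shows what would be renamed in the current directory.
preview_renames() {
    for f in ./'>'*; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        old=${f#./}                      # drop the leading "./"
        echo mv -- "$f" "./$(clean_name "$old")"
    done
}

clean_name '>ch:sdsn-sdfs.txt'           # prints ch_sdsn-sdfs.txt
```

Quoting the > in the glob keeps the shell from treating it as a redirection, and the ./ prefix keeps mv from misreading odd names. Removing the echo in preview_renames would perform the renames.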

Bash, find and replace - re-use with variable?

I'm building a script in bash that goes and finds references to other files (such as a reference in an HTML file to an img source like image.jpg).
The problem is that I'm using sed to replace all instances that contain (in this example) "/some/random/directory/image.jpg".
The "/some/random/directory/image.jpg" path is going to be different every single time, so when it comes to my sed line I need to use a regex, but in order to find the line to replace I need to include image.jpg.
so for example my sed line would be something like
sed 's/\/some\/random\/directory\/image.jpg/images\/image.jpg/g'
But how do I get the end of what's in the find and put it into the replace? (In this example it would be image.jpg.) Is there some way to make that a variable?
Here's my script as it stands now:
#!/bin/bash
cd /home/username/www/immrqbe/
for file in $(grep -rlI ".jpg" *)
do
sed -e "s/\".*\/.*.jpg//ig" $file > /tmp/tempfile.tmp
mv /tmp/tempfile.tmp /home/username/www/immrqbe/$file
done
This obviously isn't functionally complete, as I need help with it, but you get the idea of how I'd like it to work.
What you're looking for is called a Backreference in the world of regular expressions. You want to refer back to a previously matched string.
There are a couple of ways to do this with sed, but what you want to use is the grouping mechanism: \( and \). Anything sed finds between \( and \) will be put into a group and you can refer back to that group using \n where n is the number of the group that you want to use, from left to right.
So, in your example, you want:
sed 's/".*\/\(.\+\.jpg\)"/\1/ig' file
Your filename will be in the \(.\+\.jpg\) group and you can then refer to it using \1 in the replacement section.
As a side note, notice that, as long as you don't want the shell to expand a variable in your quoted string, you can use single quotes and avoid escaping the double quotes in your pattern.
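Putting the backreference to work on the question's example, a variant that keeps the surrounding quotes and adds the images/ prefix might look like the sketch below. The function name rewrite_jpg_paths and the sample tag are assumptions for illustration, and # is used as the delimiter so the slashes need no escaping.

```shell
#!/bin/sh
# Hypothetical helper: rewrites any quoted path ending in <name>.jpg
# to "images/<name>.jpg", keeping only the file name via \1.
rewrite_jpg_paths() {
    # "[^"]*/         opening quote and directory part, up to the last slash
    # \([^"/]*\.jpg\) the file name, captured as group 1
    sed 's#"[^"]*/\([^"/]*\.jpg\)"#"images/\1"#g'
}

# Made-up sample line; prints: <img src="images/image.jpg">
echo '<img src="/some/random/directory/image.jpg">' | rewrite_jpg_paths
```

Using [^"]* instead of .* keeps each match inside one quoted string, which matters when a line contains more than one image reference.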
Use parentheses to capture the match and then refer to it using backslash.
sed -e 's/".*\/\(.*\.jpg\)/\1/ig'

Regular expression with sed

I'm having hard time selecting from a file using a regular expression. I'm trying to replace a specific text in the file which is full of lines like this.
/home/user/test2/data/train/train38.wav /home/user/test2/data/train/train38.mfc
I'm trying to replace the bolded text. The problem is that I don't know how to select only the bolded text, since I need to use .wav in my regexp and the filename and the location of the file are also going to be different.
Hope you can help
Best regards,
Jökull
This assumes that what you want to replace is the string between the last two slashes in the first path.
sed 's|\([^/]*/\)[^/]*\(/[^/]* .*\)|\1FOO\2|' filename
produces:
/home/user/test2/data/FOO/train38.wav /home/user/test2/data/train/train38.mfc
sed processes lines one at a time, so you can omit the global option and it will only change the first 'train' on each line
sed 's/train/FOO/' testdat
vs
sed 's/train/FOO/g' testdat
which is a global replace
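The difference is easy to see on the sample line from the question. A small sketch, with hypothetical function names:

```shell
#!/bin/sh
line='/home/user/test2/data/train/train38.wav /home/user/test2/data/train/train38.mfc'

# Without g: only the first "train" on the line is replaced.
replace_first() {
    sed 's/train/FOO/'
}

# With g: every "train" on the line is replaced.
replace_all() {
    sed 's/train/FOO/g'
}

printf '%s\n' "$line" | replace_first
# /home/user/test2/data/FOO/train38.wav /home/user/test2/data/train/train38.mfc
printf '%s\n' "$line" | replace_all
# /home/user/test2/data/FOO/FOO38.wav /home/user/test2/data/FOO/FOO38.mfc
```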
This is quite a bit more readable and less error-prone than some of the other possibilities, but of course there are applications which will not simplify quite as readily.
sed 's;\(\(/[^/]\+\)*\)/train\(\(/[^/]\+\)*\)\.wav;\1/FOO\3.wav;'
You can do it like this
sed -e 's/\<train\>/plane/g'
The \< tells sed to match the beginning of that word and the \> tells it to match the end of the word.
The g at the end means global so it performs the match and replace on the entire line and does not stop after the first successful match as it would normally do without g.
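On the sample line from the question, the word anchors make the command skip train38, because the 3 that follows is still part of the same word. A minimal sketch, assuming GNU sed (\< and \> are GNU extensions) and a hypothetical function name:

```shell
#!/bin/sh
line='/home/user/test2/data/train/train38.wav /home/user/test2/data/train/train38.mfc'

# Replaces "train" only where it stands as a whole word.
replace_word() {
    sed -e 's/\<train\>/plane/g'
}

printf '%s\n' "$line" | replace_word
# /home/user/test2/data/plane/train38.wav /home/user/test2/data/plane/train38.mfc
```

Note that g still replaces the directory name in both paths; dropping it would change only the first one.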