Bash, find and replace - re-use with variable? - regex

I'm building a bash script that finds references to other files, such as a reference in an HTML file to an img source (image.jpg).
The problem is that I'm using sed to replace all instances that contain (in this example) "/some/random/directory/image.jpg".
The "some/random/directory/image.jpg" part is going to be different every single time, so my sed line needs a regex, but in order to find the line to replace I need to include image.jpg.
So, for example, my sed line would be something like:
sed 's/\/some\/random\/directory\/image.jpg/images\/image.jpg/g'
But how do I get the end of what's matched in the find part and put it into the replace part? (In this example it would be image.jpg.) Is there some way to make that a variable?
Here's my script as it stands now:
#!/bin/bash
cd /home/username/www/immrqbe/
for file in $(grep -rlI ".jpg" *)
do
sed -e "s/\".*\/.*.jpg//ig" $file > /tmp/tempfile.tmp
mv /tmp/tempfile.tmp /home/username/www/immrqbe/$file
done
This obviously isn't functionally complete, as I need help with it, but you get the idea of how I'd like it to work.

What you're looking for is called a Backreference in the world of regular expressions. You want to refer back to a previously matched string.
There are a couple of ways to do this with sed, but what you want to use is the grouping mechanism: \( and \). Anything sed finds between \( and \) will be put into a group and you can refer back to that group using \n where n is the number of the group that you want to use, from left to right.
So, in your example, you want:
sed 's/".*\/\(.\+\.jpg\)"/\1/ig' file
Your filename will be in the \(.\+\.jpg\) group and you can then refer to it using \1 in the replacement section.
As a side note, notice that, as long as you don't want the shell to expand a variable in your quoted string, you can use single quotes and avoid escaping the double quotes in your pattern.
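As a minimal sketch of the same backreference idea, assuming (as in the question) that every image should end up under an images/ directory and that each path sits inside double quotes. The function name rewrite_img_paths is made up for illustration. Using [^"] instead of . keeps the match inside a single quoted string, and | as the delimiter avoids escaping the slashes.

```shell
#!/bin/bash
# Hypothetical helper: rewrite any double-quoted path ending in .jpg
# so that only the file name is kept, prefixed with images/.
rewrite_img_paths() {
    # "[^"]*/ matches the opening quote and everything up to the last slash
    # inside the same quoted string; \([^"/]*\.jpg\) captures the file name,
    # which \1 puts back in the replacement.
    sed 's|"[^"]*/\([^"/]*\.jpg\)"|"images/\1"|g'
}

printf '%s\n' '<img src="/some/random/directory/image.jpg">' | rewrite_img_paths
# prints: <img src="images/image.jpg">
```

The function reads standard input, so it can be tried on a single line before being pointed at real files.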

Use escaped parentheses to capture the match and then refer to it with a backslash followed by the group number.
sed -e 's/".*\/\(.*\.jpg\)/\1/ig'

Related

Is there a bash script for finding a specific character between two given expressions?

I have a 3-step problem: I need to
1. find all occurrences of the character : in a latex file but only when it is in a \ref{} or in a \label{}, in which there can be other characters. Example: The system's total energy (\ref{eq:E}).
2. replace those : with _. Example becomes: The system's total energy (\ref{eq_E}).
3. do this for all such occurrences of : in references or labels, in about 100 files.
I've never done this before. I've worked out that I can use regular expressions to find complex occurrences. I can find either \ref{ or \label{ with (\\ref\{|\\label\{), but I can't put it in a lookbehind because it is not fixed width. My other problem with lookbehind and lookahead is that I can only match everything between my assertions, not specific characters (from what I've understood).
I've also worked out that I can use sed for find and replace. I was planning on using a regular expression as my sed "find". Does that make sense?
And finally, I'm not sure how to go about looping on all my files (which have ordered names). Can I do an if or while loop in a bash script?
I know that my questions are all over the place, as I said, never done this before and there is a mountain of documentation I'm only beginning to tackle. Any help or pointers would be appreciated.
You can use the following command, which relies on capturing groups to extract the different parts of a ref or label containing a colon and replaces it with the equivalent using an underscore:
sed -E 's/\\(ref|label)\{([^:}]*):([^}]*)}/\\\1\{\2_\3}/g'
The expression captures the whole ref or label tag, matching the tag name in the first capturing group, the part that precedes the colon in the second capturing group and the part that follows the colon in the third capturing group. The replacement pattern uses references to these capturing groups and can be read as \<tagName>{<before colon>_<after colon>}.
Note that it would be preferable to use a parser that understands the LaTeX format; the regex is likely to fail for some edge cases.
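One such edge case is a label with more than one colon, e.g. \label{fig:a:b}. A possible sketch that handles it by repeating the substitution until nothing changes is below. The function name fix_latex_colons is hypothetical, and it assumes labels never contain a nested } and that your sed supports -E.

```shell
#!/bin/bash
# Hypothetical helper: turn every ":" inside \ref{...} or \label{...} into "_".
fix_latex_colons() {
    # Each pass replaces the first remaining colon that follows \ref{ or \label{
    # without a closing brace in between; "ta" jumps back to label "a" for as
    # long as a substitution was made.
    sed -E -e ':a' -e 's/(\\(ref|label)\{[^}:]*):/\1_/' -e 'ta'
}

printf '%s\n' '(\ref{eq:E}) and \label{fig:a:b} but a:b' | fix_latex_colons
# prints: (\ref{eq_E}) and \label{fig_a_b} but a:b
```

The colon in the plain text a:b is left alone because it is not preceded by \ref{ or \label{.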
And finally, I'm not sure how to go about looping on all my files (which have ordered names). Can I do an if or while loop in a bash script?
sed accepts a list of files as parameters and will apply its command to all of them. The list of files can be produced by the expansion of a glob, e.g. sed 'sedCommand' /your/directory/*.txt, which would work on all files in /your/directory/ whose names end in .txt.
In this case you will likely want to use sed's -i "in place" flag, which asks sed to write its result directly to the target file rather than to its standard output. The flag can be followed by a suffix if you want a backup of the original; for instance, sed -i.bak 'command' file.txt will have file.txt contain the result and file.txt.bak the original.
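To see the glob and the backup suffix working together without risking real files, here is a small sketch that runs in a throwaway directory. The file names and the trivial s/:/_/ command are placeholders, not the asker's actual files or the full substitution.

```shell
#!/bin/bash
# Work in a temporary directory so the demo touches nothing real.
demo_dir=$(mktemp -d)
printf '%s\n' 'see (\ref{eq:E})' > "$demo_dir/chapter01.tex"
printf '%s\n' '\label{fig:setup}' > "$demo_dir/chapter02.tex"

# The glob expands to every .tex file; -i.bak edits each one in place
# and keeps the untouched original next to it as <name>.bak.
sed -i.bak 's/:/_/' "$demo_dir"/*.tex

cat "$demo_dir/chapter01.tex"       # edited content
cat "$demo_dir/chapter01.tex.bak"   # original content
```

Once the edited files look right, the .bak copies can be deleted.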

Sed script to rewrite certain strings

I'm dealing with a body of XML files containing unstructured texts with semantic markup for personal names.
For reasons to do with the stylesheet that will eventually show them via a web application, I need to replace:
<persName>Fred</persName>'s
<persName>Wilma</persName>'s
with
<persName>Fred's</persName>
<persName>Wilma's</persName>
I have a single line in a shell script, run in Git Bash for Windows, shown below. It runs without error but has no effect. I suppose I'm missing something obvious, perhaps to do with escaping characters, but any help is appreciated.
sed -i "s/<\/persName>\'s/\'s<\/persName>/g" test.xml
You may use
sed -i "s,</persName>'s,'s</persName>,g" test.xml
Details
s - we want to replace
, - a delimiter
</persName>'s - this string to find
, - delimiter
's</persName> - replace with this string
, - delimiter
g - multiple times if more than one is found
The -i option makes the replacements directly in the file.
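Note that you do not have to escape ' when defining the sed command inside a double-quoted string. With GNU sed, escaping it actually changes the meaning: \' is an anchor for the end of the pattern space, which would explain why the original command ran but replaced nothing.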
It is a good idea to use a delimiter char other than the common / if there are / chars inside the regex or/and replacement pattern.
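A quick side-by-side sketch of the delimiter point, using the sample line from the question. The variable names are just for illustration.

```shell
#!/bin/bash
line="<persName>Fred</persName>'s"

# Default delimiter: every / in the pattern and replacement must be escaped.
with_slash=$(printf '%s\n' "$line" | sed "s/<\/persName>'s/'s<\/persName>/g")

# Comma delimiter: the slashes can be written literally.
with_comma=$(printf '%s\n' "$line" | sed "s,</persName>'s,'s</persName>,g")

printf '%s\n' "$with_slash" "$with_comma"
```

Both commands produce the same line; the second is just easier to read and harder to get wrong.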
The comment on your question suggests an easier solution, but I guess there might be names where the suffix differs from 's, such as names ending in s. So I chose a solution that grabs whatever is to the right of the closing tag and moves it inside the element.
As the separator for the search and replace command in sed you can choose almost any character. I've chosen #, so you don't have to escape the slashes in the text. The escaped parentheses store what they enclose in the groups \1 and \2.
sed 's#<persName>\(.*\)</persName>\(.*\)#<persName>\1\2</persName>#g' testfile
Result:
<persName>Fred's</persName>
<persName>Wilma's</persName>
If you want to replace it in file, you can use the -i parameter. But be sure to check the result first.

Sed dynamic backreference replacement

I am trying to use sed for transforming wikitext into latex code. I am almost done, but I would like to automate the generation of the labels of the figures like this:
[[Image(mypicture.png)]]
... into:
\includegraphics{mypicture.png}\label{img-1}
I would like to keep using sed for this. The current regex and bash code I am using is the following:
__tex_includegraphics="\\\\includegraphics[width=0.95\\\\textwidth]{$__images_dir\/"
__tex_figure_pre="\\\\begin{figure}[H]\\\\centering$__tex_includegraphics"
__tex_figure_post="}\\\\label{img-$__images_counter}\\\\end{figure}"
sed -e "s/\[\[Image(\([^)]*\))\]\].*/$__tex_figure_pre\1$__tex_figure_post/g"\
... but I cannot get that counter to increase. Any ideas?
From a more general perspective, my question is the following: can I use a backreference in sed to create a replacement that is different for each match? That is, each time sed matches the pattern, can I use \1 as the input of a function and use the result of this function as the replacement?
I know it is a tricky question and I might have to use AWK for this. However, if somebody has a solution, I would appreciate his or her help.
This might work for you (GNU sed):
sed -r ':a;/'"$PATTERN"'/{x;/./s/.*/echo $((&+1))/e;/./!s/^/1/;x;G;s/'"$PATTERN"'(.*)\n(.*)/'"$PRE"'\2'"$POST"'\1/;ba}' file
This looks for a PATTERN contained in a shell variable and, if it is not present, prints the current line. If the pattern is present, it increments or primes the counter in the hold space and then appends said counter to the current line. The pattern is then replaced using the shell variables PRE and POST and the counter. Lastly, the current line is checked for further cases of the pattern and the procedure is repeated if necessary.
You could read the file line-by-line using shell features, and use a separate sed command for each line. Something like
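exec 0<input_file
while IFS= read -r line; do
# Rebuild the suffix on every pass so it picks up the current counter value.
__tex_figure_post="}\\\\label{img-$__images_counter}\\\\end{figure}"
printf '%s\n' "$line" | sed -e "s/\[\[Image(\([^)]*\))\]\].*/$__tex_figure_pre\1$__tex_figure_post/g"
# Advance the counter only when this line actually contained an image.
case $line in *'[[Image('*) __images_counter=$((__images_counter + 1)) ;; esac
done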
(This won't work if there are multiple matches in a line, though.)
For the second part, my best idea is to run sed or grep to find what is being matched, and then run sed again with the value of the function of the matched text substituted into the command.
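Since the question already mentions AWK as a fallback, here is a rough sketch of that route, wrapped in a shell function. It is not a sed solution: it swaps in awk's match() so that a counter can be bumped for every match, including several on one line. The function name number_images is made up, and the replacement is the simplified \includegraphics{...}\label{img-N} form from the question rather than the full figure environment.

```shell
#!/bin/bash
# Hypothetical helper: replace each [[Image(name)]] with
# \includegraphics{name}\label{img-N}, where N counts matches across the input.
number_images() {
    awk '{
        out = ""
        # Consume the line match by match; RSTART/RLENGTH describe the match.
        while (match($0, /\[\[Image\([^)]*\)\]\]/)) {
            # "[[Image(" is 8 characters and ")]]" is 3, so the name sits between.
            name = substr($0, RSTART + 8, RLENGTH - 11)
            out = out substr($0, 1, RSTART - 1) "\\includegraphics{" name "}\\label{img-" (++n) "}"
            $0 = substr($0, RSTART + RLENGTH)
        }
        print out $0
    }' "$@"
}

printf '%s\n' 'a [[Image(x.png)]] b [[Image(y.png)]]' '[[Image(z.png)]]' | number_images
```

The same loop is where a function of the captured name could be applied, which is the more general thing the question asks about.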

Regular expression with sed

I'm having a hard time selecting from a file using a regular expression. I'm trying to replace a specific piece of text in a file which is full of lines like this.
/home/user/test2/data/train/train38.wav /home/user/test2/data/train/train38.mfc
I'm trying to replace the bolded text. The problem is that I don't know how to select only the bolded text, since I need to use .wav in my regexp and the filename and the location of the file are also going to be different.
Hope you can help
Best regards,
Jökull
This assumes that what you want to replace is the string between the last two slashes in the first path.
sed 's|\([^/]*/\)[^/]*\(/[^/]* .*\)|\1FOO\2|' filename
produces:
/home/user/test2/data/FOO/train38.wav /home/user/test2/data/train/train38.mfc
sed processes lines one at a time, so you can omit the global option and it will only change the first 'train' on each line
sed 's/train/FOO/' testdat
vs
sed 's/train/FOO/g' testdat
which is a global replace
This is quite a bit more readable and less error-prone than some of the other possibilities, but of course there are applications which will not simplify quite as readily.
sed 's;\(\(/[^/]\+\)*\)/train\(\(/[^/]\+\)*\)\.wav;\1/FOO\3.wav;'
You can do it like this
sed -e 's/\<train\>/plane/g'
The \< tells sed to match the beginning of that word and the \> tells it to match the end of the word.
The g at the end means global so it performs the match and replace on the entire line and does not stop after the first successful match as it would normally do without g.
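A small sketch of the difference the word boundaries make, on a shortened version of the path from the question. Note that \< and \> are GNU extensions, so this assumes GNU sed.

```shell
#!/bin/bash
line='/data/train/train38.wav'

# Without boundaries, every "train" is replaced, including the one in "train38".
plain=$(printf '%s\n' "$line" | sed 's/train/plane/g')

# With \< and \>, only "train" standing as a whole word is replaced.
bounded=$(printf '%s\n' "$line" | sed 's/\<train\>/plane/g')

printf '%s\n' "$plain" "$bounded"
# prints: /data/plane/plane38.wav
# prints: /data/plane/train38.wav
```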

Match single character between Start string and End string

I can't seem to understand regular expression at all. How can I match a character which resides between a START and END string. For Example
#START-EDIT
#ValueA=0
#ValueB=1
#END-EDIT
I want to match any # which is between the #START-EDIT and #END-EDIT.
Specifically I want to use sed to replace the matches # values with nothing (delete them) on various files which may or may not have multiple START-EDIT and END-EDIT sections.
^#START-EDIT.*(#) *. *#END-EDIT$
sed is line based. You can easily search and replace based on a regex within one line, but there is no really easy way to match a single regex across multiple lines. AWK might do the trick.
If the whole block is on one line, the following command could be what you are looking for:
sed -e 's/^#START-EDIT.*#END-EDIT$//' myInput.txt
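If the markers sit on their own lines, as in the example from the question, a sed address range might be enough without any multi-line regex. The sketch below assumes that only the leading # of each line inside a block should go and that the marker lines themselves stay as they are. The function name uncomment_edit_blocks is made up.

```shell
#!/bin/bash
# Hypothetical helper: strip the leading "#" from lines that lie between
# a #START-EDIT line and the next #END-EDIT line, leaving the markers intact.
uncomment_edit_blocks() {
    # The address range selects each block; "b" with no label skips to the
    # end of the script, so the two marker lines are printed unchanged.
    sed '/^#START-EDIT/,/^#END-EDIT/{
/^#START-EDIT/b
/^#END-EDIT/b
s/^#//
}' "$@"
}

printf '%s\n' '#keep' '#START-EDIT' '#ValueA=0' '#ValueB=1' '#END-EDIT' '#after' \
    | uncomment_edit_blocks
```

Because the range restarts at every #START-EDIT, files with several such sections are handled the same way, and lines outside the sections keep their #.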