Regular expression with sed - regex

I'm having a hard time selecting text from a file using a regular expression. I'm trying to replace a specific piece of text in a file that is full of lines like this:
/home/user/test2/data/train/train38.wav /home/user/test2/data/train/train38.mfc
I'm trying to replace the bolded text. The problem is that I don't know how to select only the bolded text, since I need to use .wav in my regexp, and the filename and the location of the file are also going to be different.
Hope you can help
Best regards,
Jökull

This assumes that what you want to replace is the string between the last two slashes in the first path.
sed 's|\([^/]*/\)[^/]*\(/[^/]* .*\)|\1FOO\2|' filename
produces:
/home/user/test2/data/FOO/train38.wav /home/user/test2/data/train/train38.mfc

sed processes lines one at a time, so you can omit the global option and it will change only the first 'train' on each line:
sed 's/train/FOO/' testdat
vs
sed 's/train/FOO/g' testdat
which is a global replace
This is quite a bit more readable and less error-prone than some of the other possibilities, but of course there are applications which will not simplify quite as readily.
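To make the difference concrete, here is a small sketch that runs both commands on the sample line from the question. The variable names (line, first_only, all) are only for illustration, and the line is piped in rather than read from a file.

```shell
# Sample line from the question, held in a variable instead of a file.
line='/home/user/test2/data/train/train38.wav /home/user/test2/data/train/train38.mfc'

# Without g: only the first "train" on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/train/FOO/')

# With g: every "train" is replaced, including the ones inside the file names.
all=$(printf '%s\n' "$line" | sed 's/train/FOO/g')

printf '%s\n%s\n' "$first_only" "$all"
```

Note that the global version also rewrites train38.wav and train38.mfc, which is probably not what is wanted here.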

sed 's;\(\(/[^/]\+\)*\)/train\(\(/[^/]\+\)*\)\.wav;\1/FOO\3.wav;'

You can do it like this
sed -e 's/\<train\>/plane/g'
The \< tells sed to match the beginning of a word and the \> tells it to match the end of the word.
The g at the end means global so it performs the match and replace on the entire line and does not stop after the first successful match as it would normally do without g.
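As a quick check on the sample line from the question (a sketch only; \< and \> are not part of POSIX, so this assumes a sed that supports them, such as GNU sed):

```shell
line='/home/user/test2/data/train/train38.wav /home/user/test2/data/train/train38.mfc'

# "train" as a whole word matches the directory name in both paths,
# but not the "train" inside "train38", because "3" is a word character.
result=$(printf '%s\n' "$line" | sed -e 's/\<train\>/plane/g')

printf '%s\n' "$result"
```

Both directory components are changed because of the g flag; drop it to change only the first one.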

Related

Is there a bash script for finding a specific character between two given expressions?

I have a 3-step problem: I need to
1. find all occurrences of the character : in a LaTeX file, but only when it is in a \ref{} or in a \label{}, in which there can be other characters. Example: The system's total energy (\ref{eq:E}).
2. replace those : with _. Example becomes: The system's total energy (\ref{eq_E}).
3. do this for all such occurrences of : in references or labels, in about 100 files.
I've never done this before. I've worked out that I can use regular expressions to find complex occurrences. I can find either \ref{ or \label{ with (\\ref\{|\\label\{), but I can't put it in a lookbehind because it is not fixed width. My other problem with lookbehind and lookahead is that I can only match everything between my assertions, not specific characters (from what I've understood).
I've also worked out that I can use sed for find and replace. I was planning on using a regular expression as my sed "find". Does that make sense?
And finally, I'm not sure how to go about looping on all my files (which have ordered names). Can I do an if or while loop in a bash script?
I know that my questions are all over the place, as I said, never done this before and there is a mountain of documentation I'm only beginning to tackle. Any help or pointers would be appreciated.
You can use the following command, which relies on capturing groups to extract the different parts of a ref or label containing a colon and replaces it with the equivalent using an underscore:
sed -E 's/\\(ref|label)\{([^:}]*):([^}]*)\}/\\\1{\2_\3}/g'
The expression captures the whole ref or label tag, matching the tag name in the first capturing group, the part that precedes the colon in the second capturing group and the part that follows the colon in the third capturing group. The replacement pattern uses references to these capturing groups and can be read as \<tagName>{<before colon>_<after colon>}.
Note that it would be preferable to use a parser that understands the LaTeX format; the regex is likely to fail for some edge cases.
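Here is a small sketch of the substitution in action. It uses [^:}]* for the part before the colon so that a match cannot run past a closing brace into later text. The function name fix_refs and the sample sentence are made up for the example.

```shell
# Hypothetical helper: replace the colon inside \ref{...} and \label{...}
# with an underscore, reading from standard input.
fix_refs() {
  sed -E 's/\\(ref|label)\{([^:}]*):([^}]*)\}/\\\1{\2_\3}/g'
}

# The colon after the closing brace is outside any tag and is left alone.
printf '%s\n' 'See (\ref{eq:E}) and \label{sec:intro}: done.' | fix_refs
```

One edge case to be aware of: only one colon per tag is replaced, so a label such as \label{a:b:c} would need a second pass.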
And finally, I'm not sure how to go about looping on all my files (which have ordered names). Can I do an if or while loop in a bash script?
sed accepts a list of files as parameters and will apply its command to all of them. The list of files can be produced by the expansion of a glob, e.g. sed 'sedCommand' /your/directory/*.txt, which would work on all files in /your/directory/ whose names end in .txt.
In this case you will likely want to use sed's -i "in place" flag, which asks sed to write its result directly to the target file rather than to its standard output. The flag can be followed by a suffix if you want a backup of the original; for instance, sed -i.bak 'command' file.txt will leave the result in file.txt and the original in file.txt.bak.
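A sketch of how that could look end to end, run in a throwaway directory so nothing real is modified. The file names chapter1.tex and chapter2.tex and the workdir variable are assumptions for the example, not taken from the question.

```shell
# Work in a temporary directory so no real files are touched.
workdir=$(mktemp -d)
printf '%s\n' 'Energy (\ref{eq:E}).' > "$workdir/chapter1.tex"
printf '%s\n' '\label{fig:setup}'    > "$workdir/chapter2.tex"

# One sed call handles every file matched by the glob; -i.bak rewrites each
# file in place and keeps the original next to it with a .bak suffix.
sed -E -i.bak 's/\\(ref|label)\{([^:}]*):([^}]*)\}/\\\1{\2_\3}/g' "$workdir"/*.tex

cat "$workdir/chapter1.tex" "$workdir/chapter2.tex"
```

No explicit loop is needed for this, although a for loop over the glob would also work if each file needs extra per-file handling.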

How to conditionally remove characters and preserve a text in between?

How could sed or another POSIX command be used to remove the braces but only when we encounter "codeBlock":{"_id":{"varying24characters"}. There may be multiple matches with this condition in the line and I want to avoid removing the braces on something that looks similar like the smoreBlock.
Input (a single line)
test,"codeBlock":{"_id":{"4c9d4e1fe2c101000138eb4b"},morestuff,"smoreBlock":{"_id":{"6c9d4e1fe2c101000138eb4b"},hey,stuff,test,"codeBlock":{"_id":{"7c9d4e1fe7c101111138eb4b"},otherstuff
Desired output
test,"codeBlock":{"_id":"4c9d4e1fe2c101000138eb4b",morestuff,"smoreBlock":{"_id":{"6c9d4e1fe2c101000138eb4b"},hey,stuff,test,"codeBlock":{"_id":"7c9d4e1fe7c101111138eb4b",otherstuff
I've been banging my head reading about sed backreferences and can't even get close to what I'm looking for. Unfortunately this is not homework. I could write a small program to brute force through it but I know there has got to be a way for sed, awk, or perl to handle this. Planning to run this on a RHEL7 or CENTOS7 host.
Think of it the other way around: match the needed and unneeded parts together, but keep the needed parts in capturing groups. Then you can replace the whole match with only the needed parts.
sed 's/\("codeBlock":{"_id":\){\("[0-9a-f]\{24\}"\)}/\1\2/g' file
Or, if you have GNU sed:
sed -E 's/("codeBlock":\{"_id":)\{("[0-9a-f]{24}")\}/\1\2/g' file
both yield:
test,"codeBlock":{"_id":"4c9d4e1fe2c101000138eb4b",morestuff,"smoreBlock":{"_id":{"6c9d4e1fe2c101000138eb4b"},hey,stuff,test,"codeBlock":{"_id":"7c9d4e1fe7c101111138eb4b",otherstuff

How to extract last 2 characters before the extension of a filename in bash?

What I would like to accomplish is to take a file name, let's say myfileRE.txt, and return the new file name myfile.txt. The extra part will always be exactly two characters, so what I tried to do was:
${filename%??.}
and my idea was "match the 2 characters that come right before the period and rip those characters out". Unfortunately, that just returned the entire filename.
I ended up doing this:
${filename%??????}.txt
but that's not very friendly and there must be a cleaner way to do it. Any advice? Maybe something with regular expressions?
In order to pull something out of the middle of a string, you can use a substitution. The following works in bash:
filename=myfileRE.txt
echo "${filename/??./.}"
This matches the pattern "??." and replaces it with ".". It is similar to a Perl or sed substitution, except that it uses shell pattern matching instead of regex.
Jordanm's approach is probably the way to go, but just for variety:
echo "${filename%%??.*}.${filename#*.}"
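Another possible variant, sketched here under the assumption that the name always has an extension: split at the last dot, trim two characters from the base, and put the pieces back together. The variable names base, ext and newname are just for illustration.

```shell
filename=myfileRE.txt

base=${filename%.*}     # everything before the last dot: myfileRE
ext=${filename##*.}     # everything after the last dot:  txt

# Drop the last two characters of the base, then reattach the extension.
newname="${base%??}.$ext"

echo "$newname"
```

Because it splits at the last dot, this also copes with names that contain more than one dot, and it only uses POSIX parameter expansions.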

sed add text around regex

I would like to be able to go:
sed "s/^\(\w+\)$/leftside\1rightside/"
and have the group matched by \(\w+\) appear in between 'leftside' and 'rightside'.
But it seems like I have to pipe it twice, one for the left of the text, another time for the right. If anyone knows a way to do it in one pass, I'd appreciate it.
It's not working because the regex is probably not what you intended. As written, the text is added at the beginning and end of a line only if the line consists entirely of word characters (assuming your version of sed supports the \w notation). You also didn't escape the +, which is required unless you use the -r option.
Try starting with sed "s/^\(.*\)$/leftside\1rightside/" or just sed "s/.*/leftside&rightside/" and working from that.
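A short sketch comparing the two styles on a sample word. It uses -E (GNU sed accepts it as an equivalent of -r) and the bracket expression [[:alnum:]_] as a portable stand-in for \w; the variable names a, b and c are only for the example.

```shell
word='hello'

# & stands for the whole match, so no capturing group is needed.
a=$(printf '%s\n' "$word" | sed 's/.*/leftside&rightside/')

# With -E, + and the parentheses are written without backslashes.
b=$(printf '%s\n' "$word" | sed -E 's/^([[:alnum:]_]+)$/leftside\1rightside/')

# A line containing a space is not made up only of word characters,
# so the anchored pattern does not match and the line passes through unchanged.
c=$(printf '%s\n' 'hello world' | sed -E 's/^([[:alnum:]_]+)$/leftside\1rightside/')

printf '%s\n%s\n%s\n' "$a" "$b" "$c"
```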

Match single character between Start string and End string

I can't seem to understand regular expression at all. How can I match a character which resides between a START and END string. For Example
#START-EDIT
#ValueA=0
#ValueB=1
#END-EDIT
I want to match any # which is between the #START-EDIT and #END-EDIT.
Specifically I want to use sed to replace the matches # values with nothing (delete them) on various files which may or may not have multiple START-EDIT and END-EDIT sections.
^#START-EDIT.*(#) *. *#END-EDIT$
sed is line-based: you can easily search and replace with a regex within a single line, but a single regex does not easily match across multiple lines. AWK might do the trick.
If the text to match is all on one line, the following command could be what you are looking for:
sed -e 's/^#START-EDIT.*#END-EDIT$//' myInput.txt
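For the multi-line layout shown in the question, one option that stays within sed is an address range, which limits a command to the lines between two markers instead of trying to match across lines. A minimal sketch, assuming the markers start at the beginning of a line and that only the leading # should be removed; the function name uncomment_block and the extra #Outside=2 line are made up for the example.

```shell
input='#START-EDIT
#ValueA=0
#ValueB=1
#END-EDIT
#Outside=2'

# Hypothetical helper: within each START-EDIT..END-EDIT range, strip the
# leading # from every line except the two marker lines themselves.
uncomment_block() {
  sed -E '/^#START-EDIT/,/^#END-EDIT/{/^#(START|END)-EDIT/!s/^#//;}'
}

result=$(printf '%s\n' "$input" | uncomment_block)
printf '%s\n' "$result"
```

The range starts again at every new #START-EDIT line, so files with several such sections should be handled too, and lines outside the markers keep their #.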