Keep the first character of a sed regex match

I'm trying to use the rename perl command in Debian to rename files and remove detritus from the end of the filename.
The file names may be like this (varying length/nodes before the series/episode identifier)
A.TV.Show.S01E01.HDTV.XVid[stuff].avi
Other.Prog.S07E09.WEB.H264[things].mp4
And I want to remove everything after the SnnEnn bit and keep the file extension. For example
A.TV.Show.S01E01.avi
Other.Prog.S07E09.mp4
I don't mind having a command per file extension, although a single command that is extension agnostic would be better.
What I have so far is as follows:
rename -nv -- 's/[0-9][.].*?[.]avi$/.avi/' *.avi
I'm using -n just now so it just shows what the rename would do, without doing it.
The problem is it's losing the number at the end of the series and episode identifier - I need it to keep the first character of the matched text then throw the rest away.
What it gives me currently is files named thus:
A.TV.Show.S01E0.avi
Other.Prog.S07E0.mp4
Any idea how to do this? Is there a better pattern than I'm using?

This should work. It captures the parts that you want to keep in parentheses and then refers to them in the replacement as $1 and $2.
rename -nv -- 's/(^.*?S\d{2}E\d{2})\..*?\.([^.]+)$/$1.$2/' *

You need parentheses to capture parts of the strings:
s/([0-9])[.].*?[.]([^.]+)$/$1.$2/
or, you can use a look-behind instead of the first capture:
s/(?<=[0-9])[.].*?[.]([^.]+)$/.$1/

Related

Regular expression in rename Unix command

I am trying to rename some files and am pretty new to regular expressions. I know how to do this the long way, but I am trying some code golf to shorten it up.
My file:
abc4800_12_S200_R1_001.fastq.gz
My goal:
abc4800_12_R1.fastq.gz
Right now I have a two-step process for renaming it:
rename 's/_S[0-9]+//g' *gz
rename 's/_001//g' *gz
But I was trying to shorten this into one single line to clean it up in one go.
I was trying to use a regular expression to skip over the parts in between, but I don’t know if that is actually a possibility in this function.
rename 's/_S[0-9]+_*?_001//g' *gz
Use a capture group to preserve the middle part of the segment you're replacing.
rename 's/_S\d+_(.*)_001/_$1/' *gz
With your shown samples, please try the following rename command. The -n option makes it a dry run that only prints what would be renamed. Once you are happy with the output, remove -n to actually rename the files.
rename -n 's/(^[^_]*_[^_]*)_[^_]*(_[^_]*)[^.]*(\..*$)/$1$2$3/' *.gz
The output will be as follows:
rename(abc4800_12_S200_R1_001.fastq.gz, abc4800_12_R1.fastq.gz)
Explanation:
(^[^_]*_[^_]*) ## 1st capturing group: everything from the start of the name up to (not including) the 2nd _.
_[^_]* ## Matches (without capturing) the 2nd _ and the text up to the next _.
(_[^_]*) ## 2nd capturing group: the next _ and the text up to the following _.
[^.]* ## Matches (without capturing) everything up to the first dot.
(\..*$) ## 3rd capturing group: from the first dot to the end of the name.
You are trying to replace two parts of the string with nothing. Use the alternation operator. It matches either the left or the right side, and every match is replaced with the same replacement string (here, nothing):
rename 's/_S[0-9]+|_001//g' *gz

Is there a bash script for finding a specific character between two given expressions?

I have a 3-step problem: I need to
1. find all occurrences of the character : in a latex file but only when it is in a \ref{} or in a \label{}, in which there can be other characters. Example: The system's total energy (\ref{eq:E}).
2. replace those : with _. Example becomes: The system's total energy (\ref{eq_E}).
3. do this for all such occurrences of : in references or labels, in about 100 files.
I've never done this before. I've worked out that I can use regular expressions to find complex occurrences. I can find either \ref{ or \label{ with (\\ref\{|\\label\{), but I can't put it in a lookbehind because it is not fixed width. My other problem with lookbehind and lookahead is that I can only match everything between my assertions, not specific characters (from what I've understood).
I've also worked out that I can use sed for find and replace. I was planning on using a regular expression as my sed "find". Does that make sense?
And finally, I'm not sure how to go about looping on all my files (which have ordered names). Can I do an if or while loop in a bash script?
I know that my questions are all over the place, as I said, never done this before and there is a mountain of documentation I'm only beginning to tackle. Any help or pointers would be appreciated.
You can use the following command, which relies on capturing groups to extract the different parts of a ref or label containing a colon and replaces it with the equivalent using an underscore:
sed -E 's/\\(ref|label)\{([^:}]*):([^}]*)\}/\\\1{\2_\3}/g'
The expression captures the whole ref or label tag, matching the tag name in the first capturing group, the part that precedes the colon in the second capturing group and the part that follows the colon in the third capturing group. The replacement pattern uses references to these capturing groups and can be read as \<tagName>{<before colon>_<after colon>}.
Note that it would be preferable to use a parser that understands the LaTeX format; the regex is likely to fail for some edge cases.
And finally, I'm not sure how to go about looping on all my files (which have ordered names). Can I do an if or while loop in a bash script?
sed accepts a list of files as parameters and applies its command to all of them. The list of files can be produced by the expansion of a glob, e.g. sed 'sedCommand' /your/directory/*.txt, which works on all files in /your/directory/ whose names end in .txt.
In this case you will likely want to use sed's -i "in place" flag, which asks sed to write its result directly to the target file rather than to its standard output. The flag can be followed by a suffix if you want a backup of the original; for instance, sed -i.bak 'command' file.txt leaves the result in file.txt and the original in file.txt.bak.
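Putting the pieces together, here is a sketch that runs the substitution in place over several files and keeps .bak copies. It works in a throwaway directory with two made-up .tex files, and it assumes a variant of the expression in which the group before the colon also excludes }, so a match cannot run past a closing brace.

```shell
#!/bin/sh
# Work in a temporary directory so no real files are touched.
dir=$(mktemp -d)
printf '%s\n' 'The total energy (\ref{eq:E}) is conserved.' > "$dir/ch1.tex"
printf '%s\n' '\label{sec:intro} See \ref{fig:a} and 3:1 odds.' > "$dir/ch2.tex"

# Replace the first colon inside each \ref{...} or \label{...};
# the glob hands every .tex file to a single sed invocation.
sed -E -i.bak 's/\\(ref|label)\{([^:}]*):([^}]*)\}/\\\1{\2_\3}/g' "$dir"/*.tex

cat "$dir"/*.tex
# Remove "$dir" by hand once you have inspected the result.
```

The colon in "3:1" is outside any \ref or \label, so it is left untouched. No explicit if or while loop is needed; the glob expansion does the iteration.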

How to conditionally remove characters and preserve a text in between?

How could sed or another POSIX command be used to remove the braces but only when we encounter "codeBlock":{"_id":{"varying24characters"}. There may be multiple matches with this condition in the line and I want to avoid removing the braces on something that looks similar like the smoreBlock.
Input (a single line)
test,"codeBlock":{"_id":{"4c9d4e1fe2c101000138eb4b"},morestuff,"smoreBlock":{"_id":{"6c9d4e1fe2c101000138eb4b"},hey,stuff,test,"codeBlock":{"_id":{"7c9d4e1fe7c101111138eb4b"},otherstuff
Desired output
test,"codeBlock":{"_id":"4c9d4e1fe2c101000138eb4b",morestuff,"smoreBlock":{"_id":{"6c9d4e1fe2c101000138eb4b"},hey,stuff,test,"codeBlock":{"_id":"7c9d4e1fe7c101111138eb4b",otherstuff
I've been banging my head reading about sed backreferences and can't even get close to what I'm looking for. Unfortunately this is not homework. I could write a small program to brute force through it but I know there has got to be a way for sed, awk, or perl to handle this. Planning to run this on a RHEL7 or CENTOS7 host.
Think of it the other way around: match the needed and unneeded parts together, but keep the former in capturing groups. You can then replace the whole match with only the needed parts.
sed 's/\("codeBlock":{"_id":\){\("[0-9a-f]\{24\}"\)}/\1\2/g' file
Or, if you have GNU sed:
sed -E 's/("codeBlock":\{"_id":)\{("[0-9a-f]{24}")\}/\1\2/g' file
both yield:
test,"codeBlock":{"_id":"4c9d4e1fe2c101000138eb4b",morestuff,"smoreBlock":{"_id":{"6c9d4e1fe2c101000138eb4b"},hey,stuff,test,"codeBlock":{"_id":"7c9d4e1fe7c101111138eb4b",otherstuff

Rename Files Mac Command Line

I have a bunch of files in a directory that were produced with rather unfortunate names. I want to change two of the characters in the name.
For example I have:
>ch:sdsn-sdfs.txt
and I want to remove the ">" and change the ":" to a "_".
Resulting in
ch_sdsn-sdfs.txt
I tried to just say mv \\>ch\:* ch_* but that didn't work.
Is there a simple solution to this?
For a command-line script to do the renaming, this Stack Overflow question has good answers.
On a Mac, Finder comes with bulk rename capabilities in the GUI. If the file names share a pattern to find and replace, it comes in very handy.
Select all the files that need to be renamed, right-click and select Rename.
In the rename dialog, enter the find and replace strings.
Other options in the rename dialog let you number the file names in sequence or add a prefix or suffix.
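For the command-line route, a minimal sketch using only sh and sed might look like the following. clean_name is a hypothetical helper, the loop only prints the mv commands as a dry run, and it replaces every colon in the name, which is an assumption beyond the single colon in the example.

```shell
#!/bin/sh
# Hypothetical helper: compute the cleaned-up name for one file name.
# Drops a leading ">" and turns every ":" into "_".
clean_name() {
    printf '%s\n' "$1" | sed -e 's/^>//' -e 's/:/_/g'
}

# Dry run over files in the current directory whose names start with ">".
# The "./" prefix keeps the ">" from being read as a redirection or option.
for f in ./\>*; do
    [ -e "$f" ] || continue           # the glob matched nothing
    base=${f#./}
    echo mv -- "$f" "./$(clean_name "$base")"
done
```

Once the printed commands look right, dropping the echo performs the renames.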
First, I should say that the easiest way to do this is to use the
prename or rename commands.
Homebrew package rename, MacPorts package renameutils :
rename s/0000/000/ F0000*
That's a lot more understandable than the equivalent sed command.
But as for understanding the sed command, the sed manpage is helpful. If
you run man sed and search for & (using the / command to search),
you'll find it's a special character in s/foo/bar/ replacements.
s/regexp/replacement/
Attempt to match regexp against the pattern space. If success‐
ful, replace that portion matched with replacement. The
replacement may contain the special character & to refer to that
portion of the pattern space which matched, and the special
escapes \1 through \9 to refer to the corresponding matching
sub-expressions in the regexp.
Therefore, \(.\) matches the first character, which can be referenced by \1.
Then . matches the next character, which is always 0.
Then \(.*\) matches the rest of the filename, which can be referenced by \2.
The replacement string puts it all together using & (the original
filename) and \1\2 which is every part of the filename except the 2nd
character, which was a 0.
This is a pretty cryptic way to do this, IMHO. If for
some reason the rename command was not available and you wanted to use
sed to do the rename (or perhaps you were doing something too complex
for rename?), being more explicit in your regex would make it much
more readable. Perhaps something like:
ls F00001-0708-*|sed 's/F0000\(.*\)/mv & F000\1/' | sh
Being able to see what's actually changing in the
s/search/replacement/ makes it much more readable. Also it won't keep
sucking characters out of your filename if you accidentally run it
twice or something.
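The difference between the two styles can be seen by generating the mv commands without piping them to sh. In this sketch the cryptic expression is an assumed reconstruction from the description above (first character, one dropped character, the rest), the file name is made up, and the explicit version adds -n and the p flag so that names which do not match produce no output at all.

```shell
#!/bin/sh
# Assumed form of the cryptic version: & is the whole name, \1\2 is the
# name with its 2nd character dropped.
cryptic()  { printf '%s\n' "$1" | sed 's/\(.\).\(.*\)/mv & \1\2/'; }

# Explicit version: only names containing F0000 yield a command.
explicit() { printf '%s\n' "$1" | sed -n 's/F0000\(.*\)/mv & F000\1/p'; }

# First run: both build the same command.
cryptic  'F00001-0708-a.txt'
explicit 'F00001-0708-a.txt'

# Accidental second run on the already renamed file:
cryptic  'F0001-0708-a.txt'    # would strip another character
explicit 'F0001-0708-a.txt'    # prints nothing
```

Only after checking the generated commands would you append | sh to execute them.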

Replace and add leading zeros when renaming files

Please be patient, this post will be somewhat long...
I have a bunch of files, some of them with a simple and clean name (e.g. 1E01.txt) and some with a lot of extras:
Sample2_Name_E01_-co_032.txt
Sample2_Name_E02_-co_035.txt
...
Sample12_Name_E01_-co_061.txt
and so on. What is important here is the number after "Sample" and the letter+number after "Name"; the rest is disposable. If I get rid of the non-important parts, the filename reduces to the same pattern as the "clean" filenames (2E01.txt, 2E02.txt, ..., 12E01.txt). I've managed to rename the files with the following expression (I came up with this one myself; I don't know if it is very elegant, but it works fine):
rename -v 's/Sample([0-9]+)_Name_([A-Z][0-9]+).*/$1$2\.txt/' *.txt
Now, the second part is adding a leading zero to filenames with just one digit, so that 1E01.txt turns into 01E01.txt. I've managed to do this with (found and modified from another StackExchange post):
rename -v 'unless (/^[0-9]{2}.*\.txt/) {s/^([0-9]{1}.*\.txt)$/0$1/;s/0*([0-9]{2}\..*)/$1/}' *.txt
So I finally got to my question: is there a way to merge both expressions in just one rename command? I know I could do a bash script to automate the process, but what I want is to find a one-pass renaming solution.
thanks
You can try this command to rename 1-file.txt to 0001-file.txt
# fill zeros
$ rename 's/\d+/sprintf("%04d",$&)/e' *.txt
You can change the command a little to meet your need.
Well, if that is your "parsing" regex, then you are limiting the files that the script can act on to those matching that pattern. Thus, a sprintf using the same literal strings is no more specialized, and you could just do this:
s{Sample(\d+)_Name_(\p{IsUpper})(\d+)}
{sprintf "Sample%02d_Name_%s%03d", $1, $2, $3}e
;
Here, you are using the same known features again and simply formatting the accompanying numbers.
The /e switch is for 'eval' and it evaluates the replacement as Perl for each match.
I replaced some of your expressions with more standard character class symbols: [A-Z] becomes the property class \p{IsUpper}, and [0-9] becomes the digit code \d (\p{IsDigit} is also possible).