Regular expression in rename Unix command

I am trying to rename some files and am pretty new to regular expressions. I know how to do this the long way, but I am trying some code golf to shorten it up.
My file:
abc4800_12_S200_R1_001.fastq.gz
My goal:
abc4800_12_R1.fastq.gz
Right now I have a two-step process for renaming it:
rename 's/_S[0-9]+//g' *gz
rename 's/_001//g' *gz
But I was trying to shorten this into one single line to clean it up in one go.
I was trying to use a regular expression to skip over the parts in between, but I don’t know if that is actually a possibility in this function.
rename 's/_S[0-9]+_*?_001//g' *gz

Use a capture group to preserve the middle part of the segment you're replacing.
rename 's/_S\d+_(.*)_001/_$1/' *gz

With the samples shown, try the following rename command. The -n option makes it a dry run: it prints how the files would be renamed without renaming anything. Once you are happy with the output, remove -n to perform the actual rename.
rename -n 's/(^[^_]*_[^_]*)_[^_]*(_[^_]*)[^.]*(\..*$)/$1$2$3/' *.gz
The output will be as follows:
rename(abc4800_12_S200_R1_001.fastq.gz, abc4800_12_R1.fastq.gz)
Explanation:
(^[^_]*_[^_]*) ## 1st capturing group: everything from the start of the name up to (but not including) the 2nd _.
_[^_]* ## Not captured: the 2nd _ and everything up to the next _.
(_[^_]*) ## 2nd capturing group: the 3rd _ and everything up to the next _.
[^.]* ## Not captured: everything up to the first dot.
(\..*$) ## 3rd capturing group: the first dot and everything after it, to the end of the name.

You are trying to replace two parts of the string with nothing. Use the alternation operator: it matches either the left or the right side, and every match is replaced with the same replacement string (i.e. nothing):
rename 's/_S[0-9]+|_001//g' *gz

Related

Replacing with regex: How to insert a number right after a group match

How to insert a number after a group match in a find-replace regex? Like this:
mat367 -> mat0367
fis434 -> fis0434
chm185 -> chm0185
I was renaming those files with the rename command line tool. I tried the following regex
rename 's/([a-z]{3})(.+)/$1\0$2/g' *
s at the beginning means replace
* at the end means every file.
([a-z]{3})(.+) is the regex to match the name of the files.
$1\0$2 is the replacement.
I thought the regex above would insert a 0 after the first group match ($1), but it doesn't insert anything. So I tried:
rename 's/([a-z]{3})(.+)/$10$2/g' *
However, this makes the regex think that I'm referring to $10 (group number ten), and throws errors.
I'd like to know if it is possible to accomplish my goal in a single regex. In other words, don't use the rename command twice or more. For example, use the rename command to insert a letter instead of 0, and then replace that letter with 0, but this would require two regex, two commands. Using only one regex may be useful in contexts other than renaming files.
Note: It seems like the regex used by rename is based on perl. That may help if someone knows perl.
The argument is evaluated as Perl code, and you are correct about Perl seeing $10.
In a double-quoted string literal (which the replacement expression is), you can only safely escape non-word characters. Like letters, digits are word characters. Specifically, \0 refers to the NUL character. So using \0 is not acceptable.
The solution is to use curlies to delimit the var name.
rename 's/([a-z]{3})(.+)/${1}0$2/g' *
Another way to address the problem in this case is by side-stepping it. Since there's no need to replace the text before the insertion point, we don't need to capture it.
rename 's/[a-z]{3}\K(.+)/0$1/g' *
We can further simplify the second solution.
The .+ ensures there's going to be at most one match per line, so the above can be simplified to the following (assuming none of the file names contain a line feed):
rename 's/[a-z]{3}\K(.)/0$1/' *
We could even avoid the remaining capture with a look-ahead.
rename 's/[a-z]{3}\K(?=.)/0/' *
But is there really a reason to look-ahead? The following isn't equivalent as it doesn't require anything to follow the letters, but I don't think that's a problem.
rename 's/[a-z]{3}\K/0/' *
Finally, if the goal is to add a zero before the number (and thus before the first digit encountered), I'd use
rename 's/(?=\d)/0/' *
You can wrap your variable name $1 in curly braces.
$ rename 's/([a-z]{3})(.+)/${1}0$2/g' *
This is Perl's way to enclose variable names inside strings.

Add constants to start and end of "file" after multiple replacements

I have already found how to do multiple replacements, by replacing
(from1)|(from2).....
with
(?1to1)(?2to2)
For example, if I have:
hello all! I think saying hello to all is a nice way to introduce oneself.
and I replace
(hello)|(all)
with
(?1greetings)(?2everyone)
I get
greetings everyone! I think saying greetings to everyone is a nice way to introduce oneself.
Now, I want to add a string at the very beginning and end of file - not each line. So, in that case, my desired result is:
StartOfAllgreetings everyone! I think saying greetings to everyone is a nice way to introduce oneself.EndOfAll
Can you help me with this? Things that I have tried unsuccessfully include using $,\z,\Z to identify the end of line, and using branch reset groups like this (?|(hello)|(all))*
Use
Find What: (^)(?<!(?s:.))|(hello)|(all)|($)(?!(?s:.))
Or with . matches newline ON: (^)(?<!.)|(hello)|(all)|($)(?!.)
Replace with: (?1StartOfAll)(?2greetings)(?3everyone)(?4EndOfAll)
NOTE: In order to also handle the end of file match when another alternative also matches at the end of the file, you need to add optional groups and handle them in the replacement pattern, too:
Find What: (?s)(^)(?<!.)|(hello)(?:($)(?!.))?|(all)(?:($)(?!.))?|($)(?!.)
Replace with: (?1StartOfAll)(?2greetings)(?3EndOfAll)(?4everyone)(?5EndOfAll)(?6EndOfAll)
Now, the (?:($)(?!.))? optional non-capturing groups ensure an additional capture for end of file positions, and that is why there are additional (?nEndOfAll) in the replacement pattern.
Details
The (^)(?<!(?s:.))|(hello)|(all)|($)(?!(?s:.)) pattern has four alternatives; the ones you are interested in are:
(^)(?<!(?s:.)) - the first alternative matches the start of the file and captures it into Group 1: ^ must not be preceded by any char, which the negative lookbehind (?<!(?s:.)) ensures. The inline (?s:...) modifier group makes the regex work regardless of the ". matches newline" setting in Notepad++.
($)(?!(?s:.)) - the last alternative matches the end of the file and captures it into Group 4: $ must not be followed by any char (see the (?!(?s:.)) negative lookahead).

Is there a bash script for finding a specific character between two given expressions?

I have a 3-step problem: I need to
find all occurrences of the character : in a latex file but only when it is in a \ref{} or in a \label{}, in which there can be other characters. Example: The system's total energy (\ref{eq:E}).
replace those : with _. Example becomes: The system's total energy (\ref{eq_E}).
do this for all such occurrences of : in references or labels, in about 100 files.
I've never done this before. I've worked out that I can use regular expressions to find complex occurrences. I can find either \ref{ or \label{ with (\\ref\{|\\label\{), but I can't put it in a lookbehind because it is not fixed width. My other problem with lookbehind and lookahead is that I can only match everything between my assertions, not specific characters (from what I've understood).
I've also worked out that I can use sed for find and replace. I was planning on using a regular expression as my sed "find". Does that make sense?
And finally, I'm not sure how to go about looping on all my files (which have ordered names). Can I do an if or while loop in a bash script?
I know that my questions are all over the place, as I said, never done this before and there is a mountain of documentation I'm only beginning to tackle. Any help or pointers would be appreciated.
You can use the following command, which relies on capturing groups to extract the different parts of a ref or label containing a colon and replace it with the equivalent using an underscore:
sed -E 's/\\(ref|label)\{([^:}]*):([^}]*)}/\\\1{\2_\3}/g'
The expression captures the whole ref or label tag, matching the tag name in the first capturing group, the part that precedes the colon in the second capturing group and the part that follows the colon in the third capturing group. The replacement pattern uses references to these capturing groups and can be read as \<tagName>{<before colon>_<after colon>}.
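A small sketch of the substitution applied to a sample line, assuming a sed that supports -E. It uses [^:}]* for the part before the colon so that a match cannot run past a closing brace into the surrounding prose; fix_refs is a hypothetical helper name:

```shell
# Hypothetical helper: rewrite ref/label colons on standard input.
fix_refs() {
  sed -E 's/\\(ref|label)\{([^:}]*):([^}]*)\}/\\\1{\2_\3}/g'
}

printf '%s\n' 'Total energy (\ref{eq:E}), see \label{fig:one}.' | fix_refs
# prints: Total energy (\ref{eq_E}), see \label{fig_one}.
```

Only the first colon inside each tag is replaced per pass, so a label containing two colons would need the command run twice.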
Note that it would be preferable to use a parser that understands the LaTeX format; the regex is likely to fail for some edge cases.
Regarding your last question ("And finally, I'm not sure how to go about looping on all my files (which have ordered names). Can I do an if or while loop in a bash script?"):
sed accepts a list of files as parameters and applies its command to each of them. The list of files can be produced by the expansion of a glob, e.g. sed 'sedCommand' /your/directory/*.txt, which would work on all files in /your/directory/ whose names end in .txt.
In this case you will likely want to use sed's -i ("in place") flag, which asks sed to write its result directly to the target file rather than to its standard output. The flag can be followed by a suffix if you want a backup of the original; for instance, sed -i.bak 'command' file.txt will leave the result in file.txt and the original in file.txt.bak.
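Putting the glob and the -i.bak flag together might look like the sketch below. The directory argument and the .tex extension are assumptions about how the files are laid out, and fix_tex_files is a hypothetical name; the pattern uses [^:}]* before the colon so a match stays inside one pair of braces.

```shell
# Hypothetical helper: fix every .tex file in a directory in place,
# keeping each original next to it as <name>.tex.bak.
fix_tex_files() {
  # $1 = directory that holds the .tex files (assumed layout)
  sed -E -i.bak 's/\\(ref|label)\{([^:}]*):([^}]*)\}/\\\1{\2_\3}/g' "$1"/*.tex
}

# Usage (not run here): fix_tex_files /your/directory
```

No explicit loop is needed because the shell expands the glob into the full list of files before sed starts. Trying this on a copy of the directory first is a cheap safeguard.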

Keep the first character of a sed regex match

I'm trying to use the rename perl command in Debian to rename files and remove detritus from the end of the filename.
The file names may be like this (varying length/nodes before the series/episode identifier)
A.TV.Show.S01E01.HDTV.XVid[stuff].avi
Other.Prog.S07E09.WEB.H264[things].mp4
And I want to remove everything after the SnnEnn bit and keep the file extension. For example
A.TV.Show.S01E01.avi
Other.Prog.S07E09.mp4
I don't mind having a command per file extension, although a single command that is extension agnostic would be better.
What I have so far is as follows:
rename -nv -- 's/[0-9][.].*?[.]avi$/.avi/' *.avi
I'm using -n just now so it just shows what the rename would do, without doing it.
The problem is it's losing the number at the end of the series and episode identifier - I need it to keep the first character of the matched text then throw the rest away.
What it gives me currently is files named thus:
A.TV.Show.S01E0.avi
Other.Prog.S07E0.mp4
Any idea how to do this? Is there a better pattern than I'm using?
This should work. It's capturing the part that you want to keep in parentheses, and then referring to it in the replacement as $1.
rename -nv -- 's/(^.*?S\d{2}E\d{2})\..*?\.([^.]+)$/$1.$2/' *
You need parentheses to capture parts of the strings:
s/([0-9])[.].*?[.]([^.]+)$/$1.$2/
or, you can use a look-behind instead of the first capture:
s/(?<=[0-9])[.].*?[.]([^.]+)$/.$1/

Regex for SublimeText Snippet

I've been stuck for a while on this Sublime Snippet now.
I would like to display the correct package name when creating a new class, using TM_FILEPATH and TM_FILENAME.
When printing TM_FILEPATH variable, I get something like this:
/Users/caubry/d/[...]/src/com/[...]/folder/MyClass.as
I would like to transform this output, so I could get something like:
com.[...].folder
This includes:
Removing anything before /com/[...]/folder/MyClass.as;
Removing the TM_FILENAME, with its extension; in this example MyClass.as;
And finally finding all the slashes and replacing them by dots.
So far, this is what I've got:
${1:${TM_FILEPATH/.+(?:src\/)(.+)\.\w+/\l$1/}}
and this displays:
com/[...]/folder/MyClass
I do understand how to replace slashes with dots, such as:
${1:${TM_FILEPATH/\//./g/}}
However, I'm having difficulty adding this logic to the previous one, as well as removing the TM_FILENAME at the end of the logic.
I'm really inexperienced with Regex, thanks in advance.
:]
EDIT: [...] indicates variable number of folders.
We can do this in a single replacement with some trickery. What we'll do is, we put a few different cases into our pattern and do a different replacement for each of them. The trick to accomplish this is that the replacement string must contain no literal characters, but consist entirely of "backreferences". In that case, those groups that didn't participate in the match (because they were part of a different case) will simply be written back as an empty string and not contribute to the replacement. Let's get started.
First, we want to remove everything up until the last src/ (to mimic the behaviour of your snippet - use an ungreedy quantifier if you want to remove everything until the first src/):
^.+/src/
We just want to drop this, so there's no need to capture anything - nor to write anything back.
Now we want to match subsequent folders until the last one. We'll capture the folder name and also match the trailing /, but write back the folder name followed by a period. But I said no literal text in the replacement string! So the . has to come from a capture as well. This is where the assumption that your file always has an extension comes into play: we can grab the period from the file name with a lookahead. We'll also use that lookahead to make sure that there's at least one more folder ahead:
^.+/src/|\G([^/]+)/(?=[^/]+/.*([.]))
And we'll replace this with $1$2. Now if the first alternative catches, groups $1 and $2 will be empty, and the leading bit is still removed. If the second alternative catches, $1 will be the folder name, and $2 will have captured a period. Sweet. The \G is an anchor that ensures that all matches are adjacent to one another.
Finally, we'll match the last folder and everything that follows it, and only write back the folder name:
^.+/src/|\G([^/]+)/(?=[^/]+/.*([.]))|\G([^/]+)/[^/]+$
And now we'll replace this with $1$2$3 for the final solution.
A conceptually similar variant would be:
^.+/src/|\G([^/]+)/(?:(?=[^/]+/.*([.]))|[^/]+$)
replaced with $1$2. I've really only factored out the beginning of the second and third alternative.
Finally, if Sublime is using Boost's extended format string syntax, it is actually possible to get characters into the replacement conditionally (without magically conjuring them from the file extension):
^.+/src/|\G(/)?([^/]+)(?=/)|\G/[^/]+$
Now we have the first alternative for everything up to src (which is to be removed), the third alternative for the last slash and file name (which is to be removed), and the middle alternative for all folders you want to keep. The (?=/) lookahead stops the middle alternative from matching the file name itself, which it would otherwise claim before the third alternative is tried. This time I put the slash to be replaced optionally at the beginning. With a conditional replacement we can write a . there if and only if that slash was matched:
(?1.:)$2
Unfortunately, I can't test this right now and I don't know an online tester that uses Boost's regex engine. But this should do the trick just fine.