How to capture the beginning of a filename using a regex in Bash? - regex

I have a number of files in a directory named edit_file_names.sh, each containing a ? in their name. I want to use a Bash script to shorten the file names right before the ?. For example, these would be my current filenames:
test.file.1?twagdsfdsfdg
test.file.2?
test.file.3?.?
And these would be my desired filenames after running the script:
test.file.1
test.file.2
test.file.3
However, I can't seem to capture the beginning of the filenames in my regex to use in renaming the files. Here is my current script:
#!/bin/bash
cd test_file_name_edit/
regex="(^[^\?]*)"
for filename in *; do
$filename =~ $regex
echo ${BASH_REMATCH[1]}
done
At this point I'm just attempting to print off the beginnings of each filename so that I know that I'm capturing the correct string, however, I get the following error:
./edit_file_names.sh: line 7: test.file.1?twagdsfdsfdg: command not found
./edit_file_names.sh: line 7: test.file.2?: command not found
./edit_file_names.sh: line 7: test.file.3?.?: command not found
How can I fix my code to successfully capture the beginnings of these filenames?

Regex as such may not be the best tool for this job. Instead, I'd suggest using bash parameter expansion. For example:
#!/bin/bash
files=(test.file.1?twagdsfdsfdg test.file.2? test.file.3?.?)
for f in "${files[@]}"; do
echo "${f} shortens to ${f%%\?*}"
done
which prints
test.file.1?twagdsfdsfdg shortens to test.file.1
test.file.2? shortens to test.file.2
test.file.3?.? shortens to test.file.3
Here, ${f%%\?*} expands f and trims the longest suffix that matches a ? followed by any characters (the ? must be escaped since it's a wildcard character).
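To apply the same expansion to real files instead of a hard-coded array, a minimal sketch might look like this. The function name rename_before_qmark and the choice to skip a file when the target name already exists are assumptions made for illustration; they are not part of the answer above.

```bash
#!/bin/bash
# Sketch: rename every file in a directory whose name contains a literal '?',
# keeping only the part before the first '?'.
# rename_before_qmark is a hypothetical helper name.
rename_before_qmark() {
    local dir=$1 f base new
    for f in "$dir"/*\?*; do
        [[ -e $f ]] || continue          # the glob matched nothing
        base=${f##*/}                    # strip the directory part
        new=${base%%\?*}                 # trim from the first '?' onward
        # Skip empty results and avoid overwriting an existing file.
        if [[ -n $new && ! -e $dir/$new ]]; then
            mv -- "$f" "$dir/$new"
        fi
    done
}
```

It would be called as rename_before_qmark test_file_name_edit. Working on the base name keeps a '?' in the directory path from affecting the result.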

You are missing the [[ ]] test command:
for filename in *; do
[[ $filename =~ $regex ]] && echo ${BASH_REMATCH[1]}
done

Related

How do I get terminal in OSX recognise Regex as filenames?

In OS X terminal I have the following:
for filename in ^.* 2\.jpeg$; do printf "$filename\n"; done;
which I want to match filenames in the current folder ending in the string " 2.jpeg"
but it's not being recognised as Regex and it's not searching the current directory. It simply prints the two strings:
^.*
2\.jpeg$
obviously there's more I want to do with these files but I can't get it to match. Putting the regex in inverted commas doesn't seem to help either.
You need to use a glob pattern, regex doesn't work in for ... in ... construct. And don't print variables like that, use echo or printf '%s\n' "$variable".
for filename in ./*' '2.jpeg; do
echo "$filename"
done
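One caveat with the glob approach: if nothing matches, bash leaves the pattern unexpanded and the loop runs once with the literal pattern text. A minimal sketch that guards against this with the nullglob shell option follows; the function name list_space2_jpegs and the directory argument are assumptions for illustration.

```bash
#!/bin/bash
# Sketch: print the names of files in a directory that end in " 2.jpeg".
# list_space2_jpegs is a hypothetical helper name.
list_space2_jpegs() {
    local dir=$1 f
    local -a matches
    shopt -s nullglob                 # unmatched globs expand to nothing
    matches=("$dir"/*' 2.jpeg')       # the quoted part is matched literally
    shopt -u nullglob                 # note: switches the option off afterwards
    for f in "${matches[@]}"; do
        printf '%s\n' "${f##*/}"      # print the name without the directory
    done
}
```

Calling list_space2_jpegs . lists the matches in the current directory and prints nothing when there are none.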
You can do the following:
for filename in *2.jpeg; do echo ${filename}; done
This gives the following for me:
for filename in *2.jpeg; do echo ${filename}; done
2.jpeg
a2.jpeg
In a directory with 3 files:
touch 1.jpeg
touch a2.jpeg
touch 2.jpeg

Extract Filename before date Bash shellscript

I am trying to extract a part of the filename - everything before the date and suffix. I am not sure the best way to do it in bashscript. Regex?
The names are part of the filename. I am trying to store it in a shellscript variable. The prefixes will not contain strange characters. The suffix will be the same. The files are stored in a directory - I will use loop to extract the portion of the filename for each file.
Expected input files:
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Expected Extract:
EXAMPLE_FILE
EXAMPLE_FILE_2
Attempt:
filename=$(basename "$file")
folder=sed '^s/_[^_]*$//)' $filename
echo 'Filename:' $filename
echo 'Foldername:' $folder
$ cat file.txt
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
$
$ cat file.txt | sed 's/_[0-9]*-[0-9]*-[0-9]*\.out$//'
EXAMPLE_FILE
EXAMPLE_FILE_2
$
No need for useless use of cat, expensive forks and pipes. The shell can cut strings just fine:
$ file=EXAMPLE_FILE_2_2017-10-12.out
$ echo ${file%%_????-??-??.out}
EXAMPLE_FILE_2
Read all about how to use the %%, %, ## and # operators in your friendly shell manual.
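As a quick illustration of how the four operators differ, here is a small sketch applied to one of the example names with the pattern "_*" or "*_". The variable names are made up for the example.

```bash
#!/bin/bash
file=EXAMPLE_FILE_2_2017-10-12.out

shortest_suffix=${file%_*}     # remove the shortest match of "_*" from the end
longest_suffix=${file%%_*}     # remove the longest match of "_*" from the end
shortest_prefix=${file#*_}     # remove the shortest match of "*_" from the start
longest_prefix=${file##*_}     # remove the longest match of "*_" from the start

printf '%s\n' "$shortest_suffix" "$longest_suffix" "$shortest_prefix" "$longest_prefix"
```

This prints EXAMPLE_FILE_2, EXAMPLE, FILE_2_2017-10-12.out and 2017-10-12.out, one per line.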
Bash itself has regex capability so you do not need to run a utility. Example:
for fn in *.out; do
[[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
done
With the example files, output is:
EXAMPLE_FILE_2017-09-12.out => EXAMPLE_FILE
EXAMPLE_FILE_2_2017-10-12.out => EXAMPLE_FILE_2
Using Bash itself will be faster, more efficient than spawning sed, awk, etc for each file name.
Of course in use, you would want to test for a successful match:
for fn in *.out; do
if [[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]; then
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
else
echo "$fn no match"
fi
done
As a side note, you can use Bash parameter expansion rather than a regex if you only need to trim the string after the last _ in the file name:
for fn in *.out; do
cap="${fn%_*}"
printf "%s => %s\n" "$fn" "$cap"
done
And then test $cap against $fn. If they are equal, the parameter expansion did not trim the file name after _ because it was not present.
The regex allows a test that a date-like string \d\d\d\d-\d\d-\d\d is after the _. Up to you which you need.
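The comparison of $cap against $fn described above could be wrapped up roughly as follows. This is only a sketch: strip_last_field is a hypothetical name, and returning a non-zero status when there is no underscore is an assumed convention.

```bash
#!/bin/bash
# Sketch: print the name with everything from the last '_' removed,
# or fail (status 1) when the name contains no '_' at all.
# strip_last_field is a hypothetical helper name.
strip_last_field() {
    local fn=$1 cap
    cap=${fn%_*}
    if [[ $cap == "$fn" ]]; then
        return 1                      # nothing was trimmed: no '_' present
    fi
    printf '%s\n' "$cap"
}
```

For example, strip_last_field EXAMPLE_FILE_2_2017-10-12.out prints EXAMPLE_FILE_2, while a name without an underscore prints nothing and returns 1.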
Code
^\w+(?=_)
Results
Input
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Output
EXAMPLE_FILE
EXAMPLE_FILE_2
Explanation
^ Assert position at start of line
\w+ Match any word character (a-zA-Z0-9_) between 1 and unlimited times
(?=_) Positive lookahead ensuring what follows is an underscore _ character
Simply with sed:
sed 's/_[^_]*$//' file
The output:
EXAMPLE_FILE
EXAMPLE_FILE_2
----------
In case of iterating through the list of files with extension .out - bash solution:
for f in *.out; do echo "${f%_*}"; done
awk -F_ 'NF-=1' OFS=_ file
EXAMPLE_FILE
EXAMPLE_FILE_2
You could also try an awk solution, which handles all the .out files. Note that this was written and tested in GNU awk.
awk --re-interval 'FNR==1{if(val){close(val)};split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");print array[1];val=FILENAME;nextfile}' *.out
My awk version is old, so I am using --re-interval; with a recent version of awk you may not need it.
Explanation: here is a non-one-liner form of the solution, with comments.
awk --re-interval '##Using --re-interval for supporting ERE in my OLD awk version, if OP has new version of awk it could be removed.
FNR==1{ ##Checking here condition that when very first line of any Input_file is being read then do following actions.
if(val){ ##Checking here if variable named val value is NOT NULL then do following.
close(val) ##close the Input_file named which is stored in variable val, so that we will NOT face problem of TOO MANY FILES OPENED, so it will be like one file read close it in background then.
};
split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");##Splitting FILENAME(which will have Input_file name in it) into array named array only, whose separator is a 4 digits-2 digits- then 2 digits, actually this will take care of YYYY-MM-DD format in Input_file(s) and it will be easier for us to get the file name part.
print array[1]; ##Printing array 1st element here.
val=FILENAME; ##Storing FILENAME variable value which will have current Input_file name in it to variable named val, so that we could close it in background.
nextfile ##nextfile, as its name suggests, skips the remaining lines of the current file and jumps to the next file, saving some CPU cycles.
}
' *.out ##Mentioning all *.out Input_file(s) here.

Capturing multiline string with regex and replacing it with itself, indented on every line

Background
I have a large batch of markdown files that have code blocks marked off with three backticks and a language name (github style). Like so:
```ruby
def method_missing
puts "Where's the method?"
end
```
I'd like to change the way these are marked off, so that instead of using three backticks, code blocks are set off with indentation (Stack Overflow style), as follows:
    def method_missing
    puts "Where's the method?"
    end
Problem
I'm doing a find-and-replace across multiple files in Sublime Text with this expression
(?s)```ruby(.*?)```
This effectively captures what I'd like, but I'm having trouble finding a good way to replace the capture group $1 with an indented version of itself. At best, I can insert a soft tab before the entire capture group.
Any suggestions? Thanks in advance.
Alternatively: Is there a quick way to do this with a bash script using grep?
You can't do this with grep, but you could do it using text manipulation utilities like sed. For example, saying:
sed -n '/```ruby/,/```/{/```ruby/b;/```/b;s/^/    /p }' filename
would produce:
    def method_missing
    puts "Where's the method?"
    end
for your sample input.
It captures lines between ```ruby and three backticks; adds 4 spaces in front of those lines; and prints those.
If you want a TAB character instead of those spaces, substitute s/^/    /p with s/^/\t/p in the expression above.
I'm sure you could get a one-liner to work, but it might just be simpler to do it with a shell script:
#!/bin/bash
while IFS= read -r line; do
if [[ $line =~ ^'```ruby' ]]; then
indent=true
elif [[ $line =~ ^'```' ]]; then
indent=
else
[[ -n $indent ]] && echo -e "\t$line" || echo "$line"
fi
done < file

Regex to remove lines in file(s) that ending with same or defined letters

I need a Bash script for Mac OS X that works this way:
./script.sh * folder/to/files/
#
# or #
#
./script.sh xx folder/to/files/
This script
read a list of files
open each file and read each lines
if lines ended with the same letters ('*' mode) or with custom letters ('xx') then
remove line and RE-SAVE file
backup original file
My first approach to do this:
#!/bin/bash
# ck init params
if [ $# -le 0 ]
then
echo "Usage: $0 <letters>"
exit 0
fi
# list files in current dir
list=`ls BRUTE*`
for i in $list
do
# prepare regex
case $1 in
"*") REGEXP="^.*(.)\1+$";;
*) REGEXP="^.*[$1]$";;
esac
FILE=$i
# backup file
cp $FILE $FILE.bak
# removing line with same letters
sed -Ee "s/$REGEXP//g" -i '' $FILE
cat $FILE | grep -v "^$"
done
exit 0
But it doesn't work as i want....
What's wrong?
How can i fix this script?
Example:
$cat BRUTE02.dat BRUTE03.dat
aa
ab
ac
ad
ee
ef
ff
hhh
$
If i use '*' i want all files that ended with same letters to be clean.
If i use 'ff' i want all files that ended with 'ff' to be clean.
Ah, it's on Mac OS X. Remember that its sed is a little different from the GNU sed found on Linux.
man sed
sed [-Ealn] command [file ...]
sed [-Ealn] [-e command] [-f command_file] [-i extension] [file
...]
DESCRIPTION
The sed utility reads the specified files, or the standard input
if no files are specified, modifying the input as specified by a list
of commands. The
input is then written to the standard output.
A single command may be specified as the first argument to sed.
Multiple commands may be specified by using the -e or -f options. All
commands are applied
to the input in the order they are specified regardless of their
origin.
The following options are available:
-E Interpret regular expressions as extended (modern)
regular expressions rather than basic regular expressions (BRE's).
The re_format(7) manual page
fully describes both formats.
-a The files listed as parameters for the ``w'' functions
are created (or truncated) before any processing begins, by default.
The -a option causes
sed to delay opening each file until a command containing
the related ``w'' function is applied to a line of input.
-e command
Append the editing commands specified by the command
argument to the list of commands.
-f command_file
Append the editing commands found in the file
command_file to the list of commands. The editing commands should
each be listed on a separate line.
-i extension
Edit files in-place, saving backups with the specified
extension. If a zero-length extension is given, no backup will be
saved. It is not recommended to give a zero-length extension when in-place
editing files, as you risk corruption or partial content in situations
where disk space is
exhausted, etc.
-l Make output line buffered.
-n By default, each line of input is echoed to the standard
output after all of the commands have been applied to it. The -n
option suppresses this
behavior.
The form of a sed command is as follows:
[address[,address]]function[arguments]
Whitespace may be inserted before the first address and the
function portions of the command.
Normally, sed cyclically copies a line of input, not including
its terminating newline character, into a pattern space, (unless there
is something left
after a ``D'' function), applies all of the commands with
addresses that select that pattern space, copies the pattern space to
the standard output, appending
a newline, and deletes the pattern space.
Some of the functions use a hold space to save all or part of the
pattern space for subsequent retrieval.
Anything else?
Is my problem clear?
Thanks.
I don't know bash shell too well so I can't evaluate what the failure is.
This is just an observation of the regex as understood (this may be wrong).
The * mode regex looks ok:
^.*(.)\1+$ that ended with same letters..
But the literal mode might not do what you think.
current: ^.*[$1]$ that ended with 'literal string'
This shouldn't use a character class.
Change it to: ^.*$1$
Realize, though, that the string in $1 should be escaped before it goes into the regex,
in case it contains any regex metacharacters.
Otherwise, do you intend to have a character class?
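Escaping the argument before it goes into the regex could be done along these lines in bash. This is a sketch under assumptions: escape_ere is a hypothetical helper, and the set of characters it escapes is chosen for an extended regex used inside a /.../ sed address.

```bash
#!/bin/bash
# Sketch: backslash-escape characters that are special in an extended regex
# (plus '/', since the result is meant for a /.../ sed address).
# escape_ere is a hypothetical helper name.
escape_ere() {
    local s=$1 out='' c i
    local specials='\.^$*+?()[]{}|/'
    for ((i = 0; i < ${#s}; i++)); do
        c=${s:i:1}
        if [[ $specials == *"$c"* ]]; then
            out+="\\$c"               # prefix the special character with '\'
        else
            out+=$c
        fi
    done
    printf '%s\n' "$out"
}

# Example: delete lines ending in the literal string ".b" (not "any character, then b").
printf 'aa\nab\na.b\n' | sed -E "/$(escape_ere '.b')\$/d"
```

With the escaping in place the example prints aa and ab; without it, the unescaped dot would also match ab.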
perl -ne '
BEGIN {$arg = shift; $re = $arg eq "*" ? qr/([[:alpha:]])\1$/ : qr/$arg$/}
/$re/ && next || print
'
Example:
echo "aa
ab
ac
ad
ee
ef
ff" | perl -ne '
BEGIN {$arg = shift; $re = $arg eq "*" ? qr/([[:alpha:]])\1$/ : qr/$arg$/}
/$re/ && next || print
' '*'
produces
ab
ac
ad
ef
A possible issue:
When you put * on the command line, the shell replaces it with the name of all the files in your directory. Your $1 will never equal *.
And some tips:
You can make these replacements:
This:
# list files in current dir
list=`ls BRUTE*`
for i in $list
With:
for i in BRUTE*
And:
This:
cat $FILE | grep -v "^$"
With:
grep -v "^$" $FILE
Besides the possible issue, I can't see anything jumping out at me. What do you mean clean? Can you give an example of what a file should look like before and after and what the command would look like?
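Putting the tips above together, one possible reading of the requirement (delete the matching lines, keep a backup) could be sketched as below. The function name remove_matching_lines is hypothetical, the custom suffix is assumed to contain no regex metacharacters, and the '*' mode is assumed to mean "line ends with a doubled character".

```bash
#!/bin/bash
# Sketch: remove_matching_lines MODE FILE...
# MODE '*' deletes lines ending in a doubled character; any other MODE
# deletes lines ending in that literal string.
# remove_matching_lines is a hypothetical helper name.
remove_matching_lines() {
    local mode=$1 pattern f
    shift
    if [[ $mode == '*' ]]; then
        pattern='\(.\)\1$'            # basic regex: a character followed by itself at end of line
    else
        pattern="$mode\$"             # assumes $mode holds no regex metacharacters
    fi
    for f in "$@"; do
        [[ -f $f ]] || continue
        sed -i.bak -e "/$pattern/d" "$f"   # the original is kept as "$f.bak"
    done
}
```

It would be invoked as remove_matching_lines '*' BRUTE*, with the asterisk quoted so the shell passes it through unexpanded. The attached -i.bak form is accepted by both the Mac OS X sed and GNU sed.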
This is the problem!
grep '\(.\)\1[^\r\n]$' *
on MAC OSX, ( ) { }, etc... must be quoted!!!
Solved, thanks.

Shell scripting (regex to move files)

I have a list of files such as:
i60.st082313ea.jpg
i61.st51249c5e.jpg
i62.stef1fe5f2.jpg
I would like to rename each file in the directory by decrementing the starting integer (eg. 60, 61, 62) by one.
I've done svn-renaming in the shell using something like the following:
for file in *.xml;
do svn mv $file `basename $file xml`json;
done;
But when it comes to creating a regular expression and subtracting 1 from part of the file name, I'm at a loss. Worth mentioning: the file name could have the expression i[0-9]+ repeated elsewhere, so the pattern should only match the leading string.
Any help/tutorials/links would really be appreciated.
for file in *.jpg; do
newfilename=$(echo "$file" | awk -F '.' '{ OFS="."; print "i" substr($1,2,2)-1, $2, $3 }')
mv "$file" "$newfilename"
done;
NOTE: this only works if the filenames match your example (i.e. the integer occupies the 2nd and 3rd character positions, and the name has exactly two dots).
HTH
The great thing is, your input is very regular. So using a regex like
^i([0-9]+)(\..*)$
would start at the beginning of the input, match the 'i' as a necessary character, then match a decimal number, then match the rest of the input, up to the end.
The parentheses make groups available for capturing the matches. If you're matching with bash, the capture groups are available in BASH_REMATCH (an array). With this regex, you have two capture groups: the digits you want to decrement, and the rest of the filename.
To make the new filename, you want to concatenate the character 'i', ${BASH_REMATCH[1]} - 1, and ${BASH_REMATCH[2]}.
If you're not matching with bash, perhaps try with perl? (Plus you'd be able to use \d for digits, a particular favorite of mine.) It's a bit heavier on the processor than sed or awk, but much easier for what you're trying to do.
sed supports backreferences, but to do arithmetic on them, you'd have to invoke a shell command from inside the sed expression (doable). awk doesn't really support backreferences, but gawk has a function to help, gensub.
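The bash route described above (match with the regex, then rebuild the name from BASH_REMATCH) might be sketched like this. decrement_name is a hypothetical helper, and forcing base 10 with the 10# prefix is an added assumption so that a number such as 08 is not read as octal.

```bash
#!/bin/bash
# Sketch: print the new name for a file like i60.st082313ea.jpg with the
# leading integer decremented by one; fail (status 1) if the name does not match.
# decrement_name is a hypothetical helper name.
decrement_name() {
    local name=$1
    local re='^i([0-9]+)(\..*)$'      # group 1: the digits, group 2: the rest
    if [[ $name =~ $re ]]; then
        printf 'i%d%s\n' "$(( 10#${BASH_REMATCH[1]} - 1 ))" "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}
```

In a loop this could be used as new=$(decrement_name "$file") && mv -- "$file" "$new". Leading zeros in the number are not preserved, and i0 would become i-1, so real use may need extra checks.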
Without creating a process for each file to rename, and in bash shell (you can learn which shell you are using by typing the command echo $SHELL) , assuming that your file name always begin by 'i', followed by a number, followed by a '.':
ls *.jpg | while read file; do
post_dot=${file#*.}
pre_dot=${file%%.*}
number=${pre_dot#i}
echo "mv "$file" "i$((number-1)).$post_dot""
done
When you are happy with the printed result, remove 'echo "' at the beginning and the '"' at the end of the mv line. The '"' around $file and i$((number-1)).$post_dot in the mv command are important to account for spaces in the filename.
I wrote a little script I call mv_re although maybe mv_eval or mv_perl would make more sense :
#!/usr/bin/perl
foreach (@ARGV) {
push(@evals,$_);
}
sub mv_eval {
my ($dirname) = @_;
opendir (my($dh), $dirname) or
die "Couldn't open dir '$dirname': $!";
my @fs = readdir $dh;
foreach $f (@fs) {
$_ = $f;
foreach $e (@evals) { eval $e; }
print "mv '$f' '$_'\n" if ($_ ne $f);
}
closedir $dh;
}
mv_eval(".");
It treats each argument as a line of perl code and runs them against every file name in the current directory. Any file names changed get appropriate mv commands printed. So mv_re s/www.// writes a script to remove the first www. from every file in the current directory.
Ideally, I should add optional command line arguments for the directory or filespec, alternative commands to mv, and an option to just do it rather than writing the user a script.