How do I get terminal in OSX recognise Regex as filenames? - regex

In OS X terminal I have the following:
for filename in ^.* 2\.jpeg$; do printf "$filename\n"; done;
which I want to match filenames in the current folder ending in the string " 2.jpeg"
but it's not being recognised as Regex and it's not searching the current directory. It simply prints the two strings:
^.*
2\.jpeg$
obviously there's more I want to do with these files but I can't get it to match. Putting the regex in inverted commas doesn't seem to help either.

You need to use a glob pattern, regex doesn't work in for ... in ... construct. And don't print variables like that, use echo or printf '%s\n' "$variable".
for filename in ./*' '2.jpeg; do
echo "$filename"
done

You can do the following:
for filename in *2.jpeg; do echo ${filename}; done
This gives the following for me:
for filename in *2.jpeg; do echo ${filename}; done
2.jpeg
a2.jpeg
In a directory with 3 files:
touch 1.jpeg
touch a2.jpeg
touch 2.jpeg

Related

How to capture the beginning of a filename using a regex in Bash?

I have a number of files in a directory named edit_file_names.sh, each containing a ? in their name. I want to use a Bash script to shorten the file names right before the ?. For example, these would be my current filenames:
test.file.1?twagdsfdsfdg
test.file.2?
test.file.3?.?
And these would be my desired filenames after running the script:
test.file.1
test.file.2
test.file.3
However, I can't seem to capture the beginning of the filenames in my regex to use in renaming the files. Here is my current script:
#!/bin/bash
cd test_file_name_edit/
regex="(^[^\?]*)"
for filename in *; do
$filename =~ $regex
echo ${BASH_REMATCH[1]}
done
At this point I'm just attempting to print off the beginnings of each filename so that I know that I'm capturing the correct string, however, I get the following error:
./edit_file_names.sh: line 7: test.file.1?twagdsfdsfdg: command not found
./edit_file_names.sh: line 7: test.file.2?: command not found
./edit_file_names.sh: line 7: test.file.3?.?: command not found
How can I fix my code to successfully capture the beginnings of these filenames?
Regex as such may not be the best tool for this job. Instead, I'd suggest using bash parameter expansion. For example:
#!/bin/bash
files=(test.file.1?twagdsfdsfdg test.file.2? test.file.3?.?)
for f in "${files[#]}"; do
echo "${f} shortens to ${f%%\?*}"
done
which prints
test.file.1?twagdsfdsfdg shortens to test.file.1
test.file.2? shortens to test.file.2
test.file.3?.? shortens to test.file.3
Here, ${f%%\?*} expands f and trims the longest suffix that matches a ? followed by any characters (the ? must be escaped since it's a wildcard character).
You miss the test command [[ ]] :
for filename in *; do
[[ $filename =~ $regex ]] && echo ${BASH_REMATCH[1]}
done

Extract Filename before date Bash shellscript

I am trying to extract a part of the filename - everything before the date and suffix. I am not sure the best way to do it in bashscript. Regex?
The names are part of the filename. I am trying to store it in a shellscript variable. The prefixes will not contain strange characters. The suffix will be the same. The files are stored in a directory - I will use loop to extract the portion of the filename for each file.
Expected input files:
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Expected Extract:
EXAMPLE_FILE
EXAMPLE_FILE_2
Attempt:
filename=$(basename "$file")
folder=sed '^s/_[^_]*$//)' $filename
echo 'Filename:' $filename
echo 'Foldername:' $folder
$ cat file.txt
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
$
$ cat file.txt | sed 's/_[0-9]*-[0-9]*-[0-9]*\.out$//'
EXAMPLE_FILE
EXAMPLE_FILE_2
$
No need for useless use of cat, expensive forks and pipes. The shell can cut strings just fine:
$ file=EXAMPLE_FILE_2_2017-10-12.out
$ echo ${file%%_????-??-??.out}
EXAMPLE_FILE_2
Read all about how to use the %%, %, ## and # operators in your friendly shell manual.
Bash itself has regex capability so you do not need to run a utility. Example:
for fn in *.out; do
[[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
done
With the example files, output is:
EXAMPLE_FILE_2017-09-12.out => EXAMPLE_FILE
EXAMPLE_FILE_2_2017-10-12.out => EXAMPLE_FILE_2
Using Bash itself will be faster, more efficient than spawning sed, awk, etc for each file name.
Of course in use, you would want to test for a successful match:
for fn in *.out; do
if [[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]; then
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
else
echo "$fn no match"
fi
done
As a side note, you can use Bash parameter expansion rather than a regex if you only need to trim the string after the last _ in the file name:
for fn in *.out; do
cap="${fn%_*}"
printf "%s => %s\n" "$fn" "$cap"
done
And then test $cap against $fn. If they are equal, the parameter expansion did not trim the file name after _ because it was not present.
The regex allows a test that a date-like string \d\d\d\d-\d\d-\d\d is after the _. Up to you which you need.
Code
See this code in use here
^\w+(?=_)
Results
Input
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Output
EXAMPLE_FILE
EXAMPLE_FILE_2
Explanation
^ Assert position at start of line
\w+ Match any word character (a-zA-Z0-9_) between 1 and unlimited times
(?=_) Positive lookahead ensuring what follows is an underscore _ character
Simply with sed:
sed 's/_[^_]*$//' file
The output:
EXAMPLE_FILE
EXAMPLE_FILE_2
----------
In case of iterating through the list of files with extension .out - bash solution:
for f in *.out; do echo "${f%_*}"; done
awk -F_ 'NF-=1' OFS=_ file
EXAMPLE_FILE
EXAMPLE_FILE_2
Could you please try awk solution too, which will take care of all the .out files, note this has ben written and tested in GNU awk.
awk --re-interval 'FNR==1{if(val){close(val)};split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");print array[1];val=FILENAME;nextfile}' *.out
Also my awk version is old so I am using --re-interval, if you have latest version of awk you may need not to use it then.
Explanation and Non-one liner fom of solution: Adding a non-one liner form of solution too here with explanation.
awk --re-interval '##Using --re-interval for supporting ERE in my OLD awk version, if OP has new version of awk it could be removed.
FNR==1{ ##Checking here condition that when very first line of any Input_file is being read then do following actions.
if(val){ ##Checking here if variable named val value is NOT NULL then do following.
close(val) ##close the Input_file named which is stored in variable val, so that we will NOT face problem of TOO MANY FILES OPENED, so it will be like one file read close it in background then.
};
split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");##Splitting FILENAME(which will have Input_file name in it) into array named array only, whose separator is a 4 digits-2 digits- then 2 digits, actually this will take care of YYYY-MM-DD format in Input_file(s) and it will be easier for us to get the file name part.
print array[1]; ##Printing array 1st element here.
val=FILENAME; ##Storing FILENAME variable value which will have current Input_file name in it to variable named val, so that we could close it in background.
nextfile ##nextfile as it name suggests it will skip all the lines in current line and jump onto the next file to save some cpu cycles of our system.
}
' *.out ##Mentioning all *.out Input_file(s) here.

Mac Terminal batch file rename: flip string and digits

I read some threads about batch renames in Mac OS X Terminal and understood you can do something like this:
for file in *.pdf
do
mv "$file" "<some magic regex>"
done
As you can guess, my issue is with the regex for this particular purpose:
Old file
Author-Year-Title.pdf
New file
Year-Author-Title.pdf
I did try some regex code but got stuck. I "just" want to flip Author and Year, but cannot figure out how. Any help would be appreciated.
This will do, or come quite close to doing, what you are asking for. The technique I am using is called Bash Parameter Substitution and it is documented and described very well here.
#!/bin/bash
for file in *.pdf
do
echo DEBUG: Processing file $file
f=${file%.*} # strip extension from right end
author=${f%%-*} # shortest str at start that ends with dash
title=${f##*-} # shortest str at end that starts with dash
authoryear=${f%-*} # longest string at start that ends in dash
year=${authoryear#*-}
echo DEBUG: author:$author, year:$year, title:$title
echo mv "$file" "$year-$author-$title.pdf"
done
Basically, I am extracting the Author, Year and Title into variables for you and then you can put them together in whatever order you like with whatever separators you like at the end and do the actual renaming. Note that the script actually does nothing until you remove the echo statement in front of the mv command so you can test it out and see what it would do.
Please practice on a COPY of your data in a spare, temporary directory.
Sample Output
DEBUG: Processing file Banks-2012-Something.pdf
DEBUG: author:Banks, year:2012, title:Something
mv Banks-2012-Something.pdf 2012-Banks-Something.pdf
DEBUG: Processing file Shakey-2013-SomethingElse.pdf
DEBUG: author:Shakey, year:2013, title:SomethingElse
mv Shakey-2013-SomethingElse.pdf 2013-Shakey-SomethingElse.pdf
If you like ugly sed commands you can do it more succinctly like this:
#!/bin/bash
for file in *.pdf
do
echo DEBUG: Processing file $file
new=$(sed -E 's/(.*)-([0-9]{4})-(.*)\.*/\2-\1-\3.pdf/' <<< $file)
echo $new
done
The s/xxx/yyy/ means substitute or replace xxx with yyy. Anything inside parentheses must be captured as capture groups and then the first capture group becomes available as \1 in the replacement and the second capture group becomes available as \2 and so on. So it says... save anything up to the first dash as \1, exactly 4 digits between the next pair of dashes as \2 and the other stuff as \3, and then it prints the captured groups out in a different order.

how to rejoin words that are split accross lines with a hyphen in a text file

OCR texts often have words that flow from one line to another with a hyphen at the end of the first line. (ie: the word has '-\n' inserted in it).
I would like rejoin all such split words in a text file (in a linux environment).
I believe this should be possible with sed or awk, but the syntax for these is dark magic to me! I knew a text editor in windows that did regex search/replace with newlines in the search expression, but am unaware of such in linux.
Make sure to back up ocr_file before running as this command will modify the contents of ocr_file:
perl -i~ -e 'BEGIN{$/=undef} ($f=<>) =~ s#-\s*\n\s*(\S+)#$1\n#mg; print $f' ocr_file
This answer is relevant, because I want the words joined together... not just a removal of the dash character.
cat file| perl -CS -pe's/-\n//'|fmt -w52
is the short answer, but uses fmt to reform paragraphs after the paragraphs were mangled by perl.
without fmt, you can do
#!/usr/bin/perl
use open qw(:std :utf8);
undef $/; $_=<>;
s/-\n(\w+\W+)\s*/$1\n/sg;
print;
also, if you're doing OCR, you can use this perl one-liner to convert unicode utf-8 dashes to ascii dash characters. note the -CS option to tell perl about utf-8.
# 0x2009 - 0x2015 em-dashes to ascii dash
perl -CS -pe 'tr/\x{2009}\x{2010}\x{2011}\x{2012\x{2013}\x{2014}\x{2015}/-/'
cat file | perl -p -e 's/-\n//'
If the file has windows line endings, you'll need to catch the cr-lf with something like:
cat file | perl -p -e 's/-\s\n//'
Hey this is my first answer post, here goes:
'-\n' I suspect are the line-feed characters. You can use sed to remove these. You could try the following as a test:
1) create a test file:
echo "hello this is a test -\n" > testfile
2) check the file has the expected contents:
cat testfile
3) test the sed command, this sends the edited text stream to standard out (ie your active console window) without overwriting anything:
sed 's/-\\n//g' testfile
(you should just see 'hello this is a test file' printed to the console without the '-\n')
If I build up the command:
a) First off you have the sed command itself:
sed
b) Secondly the expression and sed specific controls need to be in quotations:
sed 'sedcontrols+regex' (the text in quotations isn't what you'll actually enter, we'll fill this in as we go along)
c) Specify the file you are reading from:
sed 'sedcontrols+regex' testfile
d) To delete the string in question, sed needs to be told to substitute the unwanted characters with nothing (null,zero), so you use 's' to substitute, forward-slash, then the unwanted string (more on that in a sec), then forward-slash again, then nothing (what it's being substituted with), then forward-slash, and then the scale (as in do you want to apply the edit to a single line or more). In this case I will select 'g' which represents global, as in the whole text file. So now we have:
sed 's/regex//g' testfile
e) We need to add in the unwanted string but it gets confusing because if there is a slash in your string, it needs to be escaped out using a back-slash. So, the unwanted string
-\n ends up looking like -\\n
We can output the edited text stream to stdout as follows:
sed 's/-\\n//g' testfile
To save the results without overwriting anything (assuming testfile2 doesn't exist) we can redirect the output to a file:
sed 's/-\\n//g' testfile >testfile2
sed -z 's/-\n//' file_with_hyphens

Using sed to remove all console.log from javascript file

I'm trying to remove all my console.log, console.dir etc. from my JS file before minifying it with YUI (on osx).
The regex I got for the console statements looks like this:
console.(log|debug|info|warn|error|assert|dir|dirxml|trace|group|groupEnd|time|timeEnd|profile|profileEnd|count)\((.*)\);?
and it works if I test it with the RegExr.
But it won't work with sed.
What do I have to change to get this working?
sed 's/___???___//g' <$RESULT >$RESULT_STRIPPED
update
After getting the first answer I tried
sed 's/console.log(.*)\;//g' <test.js >result.js
and this works, but when I add an OR
sed 's/console.\(log\|dir\)(.*)\;//g' <test.js >result.js
it doesn't replace the "logs":
Your original expression looks fine. You just need to pass the -E flag to sed, for extended regular expressions:
sed -E 's/console.(log|debug|info|...|count)\((.*)\);?//g'
The difference between these types of regular expressions is explained in man re_format.
To be honest I have never read that page, but instead simply tack on an -E when things don't work as expected. =)
You must escape ( (for grouping) and | (for oring) in sed's regex syntax. E.g.:
sed 's/console.\(log\|debug\|info\|warn\|error\|assert\|dir\|dirxml\|trace\|group\|groupEnd\|time\|timeEnd\|profile\|profileEnd\|count\)(.*);\?//g'
UPDATE example:
$ sed 's/console.\(log\|debug\|info\|warn\|error\|assert\|dir\|dirxml\|trace\|group\|groupEnd\|time\|timeEnd\|profile\|profileEnd\|count\)(.*);\?//g'
console.log # <- input line, not matches, no replacement printed on next line
console.log
console.log() # <- input line, matches, no printing
console.log(blabla); # <- input line, matches, no printing
console.log(blabla) # <- input line, matches, no printing
console.debug(); # <- input line, matches, no printing
console.debug(BAZINGA) # <- input line, matches, no printing
DATA console.info(ditto); DATA2 # <- input line, matches, printing of expected data
DATA DATA2
HTH
I also find the way to remove all the console.log ,
and i am trying to use python to do this,
but i find the Regex is not work for.
my writing like this:
var re=/^console.log(.*);?$/;
but it will match the following string:
'console.log(23);alert(234dsf);'
does it work? with the
"s/console.(log|debug|info|...|count)((.*));?//g"
I try this:
sed -E 's/console.(log|debug|info)( ?| +)\([^;]*\);//g'
See the test:
Regex Tester
Here's my implementation
for i in $(find ./dir -name "*.js")
do
sed -E 's/console\.(log|warn|error|assert..timeEnd)\((.*)\);?//g' $i > ${i}.copy && mv ${i}.copy $i
done
took the sed thing from github
I was feeling lazy and hoping to find a script to copy & paste. Alas there wasn't one, so for the lazy like me, here is mine. It goes in a file named something like 'minify.sh' in the same directory as the files to minify. It will overwrite the original file and it needs to be executable.
#!/bin/bash
for f in *.js
do
sed -Ei 's/console.(log|debug|info)\((.*)\);?//g' $f
yui-compressor $f -o $f
done
I'd just like to add here that I was running into issues with namespaced console.logs such as window.console.log. Also Tweenmax.js has some interesting uses of console.log in some parts such as
window.console&&console.log(t)
So I used this
sed -i.bak s/[^\&a-zA-Z0-9\.]console.log\(/\\/\\//g js/combined.js
The regex effectively says replace all console.logs that don't start with &, alphanumerics, and . with a '//' comment, which uglify later takes out.
Rodrigocorsi's works with nested parentheses. I added a ? after the ; because yuicompressor was omitting some semicolons.
It is probable that the reason this is not working is that you are not 'limiting'
the regex to not include a closing parenthesises ()) in the method parameters.
Try this regular expression:
console\.(log|trace|error)\(([^)]+)\);
Remember to include the rest of your method names in the capture group.