I read some threads about batch renames in Mac OS X Terminal and understood you can do something like this:
for file in *.pdf
do
mv "$file" "<some magic regex>"
done
As you can guess, my issue is with the regex for this particular purpose:
Old file
Author-Year-Title.pdf
New file
Year-Author-Title.pdf
I did try some regex code but got stuck. I "just" want to flip Author and Year, but cannot figure out how. Any help would be appreciated.
This will do, or come quite close to doing, what you are asking for. The technique I am using is called Bash Parameter Substitution and it is documented and described very well here.
#!/bin/bash
for file in *.pdf
do
echo DEBUG: Processing file $file
f=${file%.*} # strip extension from right end
author=${f%%-*} # shortest str at start that ends with dash
title=${f##*-} # shortest str at end that starts with dash
authoryear=${f%-*} # longest string at start that ends in dash
year=${authoryear#*-}
echo DEBUG: author:$author, year:$year, title:$title
echo mv "$file" "$year-$author-$title.pdf"
done
Basically, I am extracting the Author, Year and Title into variables for you and then you can put them together in whatever order you like with whatever separators you like at the end and do the actual renaming. Note that the script actually does nothing until you remove the echo statement in front of the mv command so you can test it out and see what it would do.
Please practice on a COPY of your data in a spare, temporary directory.
Sample Output
DEBUG: Processing file Banks-2012-Something.pdf
DEBUG: author:Banks, year:2012, title:Something
mv Banks-2012-Something.pdf 2012-Banks-Something.pdf
DEBUG: Processing file Shakey-2013-SomethingElse.pdf
DEBUG: author:Shakey, year:2013, title:SomethingElse
mv Shakey-2013-SomethingElse.pdf 2013-Shakey-SomethingElse.pdf
If you like ugly sed commands you can do it more succinctly like this:
#!/bin/bash
for file in *.pdf
do
echo DEBUG: Processing file $file
new=$(sed -E 's/(.*)-([0-9]{4})-(.*)\.*/\2-\1-\3.pdf/' <<< $file)
echo $new
done
The s/xxx/yyy/ means substitute or replace xxx with yyy. Anything inside parentheses must be captured as capture groups and then the first capture group becomes available as \1 in the replacement and the second capture group becomes available as \2 and so on. So it says... save anything up to the first dash as \1, exactly 4 digits between the next pair of dashes as \2 and the other stuff as \3, and then it prints the captured groups out in a different order.
Related
In OS X terminal I have the following:
for filename in ^.* 2\.jpeg$; do printf "$filename\n"; done;
which I want to match filenames in the current folder ending in the string " 2.jpeg"
but it's not being recognised as Regex and it's not searching the current directory. It simply prints the two strings:
^.*
2\.jpeg$
obviously there's more I want to do with these files but I can't get it to match. Putting the regex in inverted commas doesn't seem to help either.
You need to use a glob pattern, regex doesn't work in for ... in ... construct. And don't print variables like that, use echo or printf '%s\n' "$variable".
for filename in ./*' '2.jpeg; do
echo "$filename"
done
You can do the following:
for filename in *2.jpeg; do echo ${filename}; done
This gives the following for me:
for filename in *2.jpeg; do echo ${filename}; done
2.jpeg
a2.jpeg
In a directory with 3 files:
touch 1.jpeg
touch a2.jpeg
touch 2.jpeg
I'm trying to remove whitespace in file names and replace them.
Input:
echo "File Name1.xml File Name3 report.xml" | sed 's/[[:space:]]/__/g'
However the output
File__Name1.xml__File__Name3__report.xml
Desired output
File__Name1.xml File__Name3__report.xml
You named awk in the title of the question, didn't you?
$ echo "File Name1.xml File Name3 report.xml" | \
> awk -F'.xml *' '{for(i=1;i<=NF;i++){gsub(" ","_",$i); printf i<NF?$i ".xml ":"\n" }}'
File_Name1.xml File_Name3_report.xml
$
-F'.xml *' instructs awk to split on a regex, the requested extension plus 0 or more spaces
the loop {for(i=1;i<=NF;i++) is executed for all the fields in which the input line(s) is(are) splitted — note that the last field is void (it is what follows the last extension), but we are going to take that into account...
the body of the loop
gsub(" ","_", $i) substitutes all the occurrences of space to underscores in the current field, as indexed by the loop variable i
printf i<NF?$i ".xml ":"\n" output different things, if i<NF it's a regular field, so we append the extension and a space, otherwise i equals NF, we just want to terminate the output line with a newline.
It's not perfect, it appends a space after the last filename. I hope that's good enough...
▶ A D D E N D U M ◀
I'd like to address:
the little buglet of the last space...
some of the issues reported by Ed Morton
generalize the extension provided to awk
To reach these goals, I've decided to wrap the scriptlet in a shell function, that changing spaces into underscores is named s2u
$ s2u () { awk -F'\.'$1' *' -v ext=".$1" '{
> NF--;for(i=1;i<=NF;i++){gsub(" ","_",$i);printf "%s",$i ext (i<NF?" ":"\n")}}'
> }
$ echo "File Name1.xml File Name3 report.xml" | s2u xml
File_Name1.xml File_Name3_report.xml
$
It's a bit different (better?) 'cs it does not special print the last field but instead special-cases the delimiter appended to each field, but the idea of splitting on the extension remains.
This seems a good start if the filenames aren't delineated:
((?:\S.*?)?\.\w{1,})\b
( // start of captured group
(?: // non-captured group
\S.*? // a non-white-space character, then 0 or more any character
)? // 0 or 1 times
\. // a dot
\w{1,} // 1 or more word characters
) // end of captured group
\b // a word boundary
You'll have to look-up how a PCRE pattern converts to a shell pattern. Alternatively it can be run from a Python/Perl/PHP script.
Demo
Assuming you are asking how to rename file names, and not remove spaces in a list of file names that are being used for some other reason, this is the long and short way. The long way uses sed. The short way uses rename. If you are not trying to rename files, your question is quite unclear and should be revised.
If the goal is to simply get a list of xml file names and change them with sed, the bottom example is how to do that.
directory contents:
ls -w 2
bob is over there.xml
fred is here.xml
greg is there.xml
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[#]}; i++));do
echo "${a_glob[i]}";
done
shopt -u nullglob
# output
bob is over there.xml
fred is here.xml
greg is there.xml
# then rename them
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[#]}; i++));do
# I prefer 'rename' for such things
# rename 's/[[:space:]]/_/g' "${a_glob[i]}";
# but sed works, can't see any reason to use it for this purpose though
mv "${a_glob[i]}" $(sed 's/[[:space:]]/_/g' <<< "${a_glob[i]}");
done
shopt -u nullglob
result:
ls -w 2
bob_is_over_there.xml
fred_is_here.xml
greg_is_there.xml
globbing is what you want here because of the spaces in the names.
However, this is really a complicated solution, when actually all you need to do is:
cd [your space containing directory]
rename 's/[[:space:]]/_/g' *.xml
and that's it, you're done.
If on the other hand you are trying to create a list of file names, you'd certainly want the globbing method, which if you just modify the statement, will do what you want there too, that is, just use sed to change the output file name.
If your goal is to change the filenames for output purposes, and not rename the actual files:
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[#]}; i++));do
echo "${a_glob[i]}" | sed 's/[[:space:]]/_/g';
done
shopt -u nullglob
# output:
bob_is_over_there.xml
fred_is_here.xml
greg_is_there.xml
You could use rename:
rename --nows *.xml
This will replace all the spaces of the xml files in the current folder with _.
Sometimes it comes without the --nows option, so you can then use a search and replace:
rename 's/[[:space:]]/__/g' *.xml
Eventually you can use --dry-run if you want to just print filenames without editing the names.
I want to prepend every commit message with a portion of my branch name. For example lets say my branch naming convention is the following:
issueType/repoName-issueNumber --> bug/myRepo-123
I want my commit message to go from
added missing semicolon
to
myRepo-123 : added missing semicolon
So far in my prepare-commit-msg file I have the following but it is printing out the entire branch name
#!/bin/sh
NAME=$(git branch | grep '*' | sed 's/* //')
STORY_NUMBER=$(echo $NAME | sed 's/.*\(myRepo-[0-9]\+\.log\).*/\1/')
echo "$STORY_NUMBER" : $(cat "$1") > "$1"
... renders ...
bug/myRepo-123 : added missing semicolon
What edits do I need to make to the regex to extract everything AFTER /?
Thanks
I ended up finding a really great question and answer that helped me solve my problem. It can be found here https://unix.stackexchange.com/questions/24140/return-only-the-portion-of-a-line-after-a-matching-pattern
In the end this was my script which prints out myRepo-123: commit message
#!/bin/sh
BRANCH_NAME=$(git symbolic-ref --short HEAD)
STORY_NUMBER=$(echo $BRANCH_NAME | sed -n -e 's/^.*\(myRepo-\)/\1/p')
echo "$STORY_NUMBER": $(cat "$1") > "$1"
Before I explain how I modified the regex it's worth noting that I found an easier way to get my current branch name using the --short flag on the symbolic ref command.
The following is taken from the other stack exchange answer for the regex
Detailed explanation:
-n means not to print anything by default.
-e is followed by a sed command. s is the pattern replacement command.
The regular expression ^.stalled: matches the pattern
you're looking for, plus any preceding text (. meaning any text,
with an initial ^ to say that the match begins at the beginning of
the line). Note that if stalled: occurs several times on the line,
this will match the last occurrence.
The match, i.e. everything on the line up to stalled:, is replaced by
the empty string (i.e. deleted).
The final p means to print the transformed line.
If you want to retain the matching portion, use a
backreference: \1 in the replacement part designates what is
inside a group (…) in the pattern. Here, you could write
stalled: again in the replacement part; this feature is useful
when the pattern you're looking for is more general than a simple
string.
I have a bunch of daily printer logs in CSV format and I'm writing a script to keep track of how much paper is being used and save the info to a database, but I've come across a small problem
Essentially, some of the document names in the logs include commas in them (which are all enclosed within double quotes), and since it's in comma separated format, my code is messing up and pushing everything one column to the right for certain records.
From what I've been reading, it seems like the best way to go about fixing this would be using awk or sed, but I'm unsure which is the best option for my situation, and how exactly I'm supposed to implement it.
Here's a sample of my input data:
2015-03-23 08:50:22,Jogn.Doe,1,1,Ineo 4000p,"MicrosoftWordDocument1",COMSYRWS14,A4,PCL6,,,NOT DUPLEX,GRAYSCALE,35kb,
And here's what I have so far:
#!/bin/bash
#Get today's file name
yearprefix="20"
currentdate=$(date +"%m-%d-%y");
year=${currentdate:6};
year="$yearprefix$year"
month=${currentdate:0:2};
day=${currentdate:3:2};
filename="papercut-print-log-$year-$month-$day.csv"
echo "The filename is: $filename"
# Remove commas in between quotes.
#Loop through CSV file
OLDIFS=$IFS
IFS=,
[ ! -f $filename ] && { echo "$Input file not found"; exit 99; }
while read time user pages copies printer document client size pcl blank1 blank2 duplex greyscale filesize blank3
do
#Remove headers
if [ "$user" != "" ] && [ "$user" != "User" ]
then
#Remove any file name with an apostrophe
if [[ "$document" =~ "'" ]];
then
document="REDACTED"; # Lazy. Need to figure out a proper solution later.
fi
echo "$time"
#Save results to database
mysql -u username -p -h localhost -e "USE printerusage; INSERT INTO printerlogs (time, username, pages, copies, printer, document, client, size, pcl, duplex, greyscale, filesize) VALUES ('$time', '$user', '$pages', '$copies', '$printer', '$document', '$client', '$size', '$pcl', '$duplex', '$greyscale', '$filesize');"
fi
done < $filename
IFS=$OLDIFS
Which option is more suitable for this task? Will I have to create a second temporary file to get this done?
Thanks in advance!
As I wrote in another answer:
Rather than interfere with what is evidently source data, i.e. the stuff inside the quotes, you might consider replacing the field-separator commas (with say |) instead:
s/,([^,"]*|"[^"]*")(?=(,|$))/|$1/g
And then splitting on | (assuming none of your data has | in it).
Is it possible to write a regular expression that matches a particular pattern and then does a replace with a part of the pattern
There is probably an easier way using sed alone, but this should work. Loop on the file, for each line match the parentheses with grep -o then replace the commas in the line with spaces (or whatever it is you would like to use to get rid of the commas - if you want to preserve the data you can use a non printable and explode it back to commas afterward).
i=1 && IFS=$(echo -en "\n\b") && for a in $(< test.txt); do
var="${a}"
for b in $(sed -n ${i}p test.txt | grep -o '"[^"]*"'); do
repl="$(sed "s/,/ /g" <<< "${b}")"
var="$(sed "s#${b}#${repl}#" <<< "${var}")"
done
let i+=1
echo "${var}"
done
Consider the following:
var="text more text and yet more text"
echo $var | egrep "yet more (text)"
It should be possible to get the result of the regex as the string: text
However, I don't see any way to do this in bash with grep or its siblings at the moment.
In perl, php or similar regex engines:
$output = preg_match('/yet more (text)/', 'text more text yet more text');
$output[1] == "text";
Edit: To elaborate why I can't just multiple-regex, in the end I will have a regex with multiple of these (Pictured below) so I need to be able to get all of them. This also eliminates the option of using lookahead/lookbehind (As they are all variable length)
egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "
Example input as requested, straight from lsof (Replace $USER with "j" for this input data):
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)
So far I've got:
#!/bin/bash
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"
echo "npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.
End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):
#!/bin/bash
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Edit: look at my other answer for a simpler bash-only solution.
So, here the solution using sed to fetch the right groups and split them up. You later still have to use bash to read them. (And in this way it only works if the groups themselves do not contain any spaces - otherwise we had to use another divider character and patch read by setting $IFS to this value.)
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of numbers. Alternatively here a \b (word limit) would have worked, too.
Ah, I forget mentioning that you should pipe the text to this script, like this:
./grep-result.sh < grep-result-test.txt
(provided your files are named like this). Instead you can add a < grep-result-test after the sed call (before the |), or prepend the line with cat grep-result-test.txt |.
How does it work?
sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
Note that this .* grabs greedy, so we have to insert a space into our regexp to avoid it grabbing the start of the first digit group too.
The replacement text contains of the three parenthesed groups, separated by spaces.
the p flag at the end of the command says to print out the pattern space after replacement. Since we grabbed the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
5 11 /tmp/FlashXXu8pvMg
5 17 /tmp/FlashXXIBH29F
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
We could also have written read var1 var2 var3 instead (preferably using better variable names), then the first two words would be put to $var1 and $var2, with $var3 getting the rest.
If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
${array[0]} is expanded to the first element of the array and similarly.
When the input ends, the loop ends, too.
This isn't possible using grep or another tool called from a shell prompt/script because a child process can't modify the environment of its parent process. If you're using bash 3.0 or better, then you can use in-process regular expressions. The syntax is perl-ish (=~) and the match groups are available via $BASH_REMATCH[x], where x is the match group.
After creating my sed-solution, I also wanted to try the pure-bash approach suggested by Mark. It works quite fine, for me.
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
while read
do
if [[ $REPLY =~ $regex ]]
then
echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
fi
done
(If you upvote this, you should think about also upvoting Marks answer, since it is essentially his idea.)
The same as before: pipe the text to be filtered to this script.
How does it work?
As said by Mark, the [[ ... ]] special conditional construct supports the binary operator =~, which interprets his right operand (after parameter expansion) as a extended regular expression (just as we want), and matches the left operand against this. (We have again added a space at front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)
Another note: Both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop, a simple if read ; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.
echo "$var" | pcregrep -o "(?<=yet more )text"
Well, for your simple example, you can do this:
var="text more text and yet more text"
echo $var | grep -e "yet more text" | grep -o "text"