Strip all characters from image filename after a dash (-) - regex

I have about 1400 images like this.
101018-202x300.jpg
100116-215x300.jpg
1000748-300x157.jpg
100138-196x300.jpg
100308-companion-in-surgical-studies-208x300.jpg
100463-Ambroise-Pare-300x216.jpg
100523-Grulee-collection-pediatrics-194x300.jpg
I need to strip out all the character after the FIRST dash so that it reads like this
101018.jpg
100116.jpg
1000748.jpg
100138.jpg
100308.jpg
100463.jpg
100523.jpg
I know this can be done with Regular Expressions but I have not a clue where to begin with it?
I am busy working through this Regex Site to learn more about the topic.
Thank you.
EDIT: Apologies, I did not add some of the other more varying examples.

You can rename capturing the 101018 and rename the file that you used. NEW (Demo)
OLD (Demo)
(\d+)[\w-]+\.jpg
EDIT: A second option is splitting by "-" and getting the first parameter.

In PowerShell:
$_ -replace '-.*(?=\.\w+$)'
e.g.
PS> -split'101018-202x300.jpg
>> 100116-215x300.jpg
>> 1000748-300x157.jpg
>> 100138-196x300.jpg' | %{ $_ -replace '-.*(?=\.\w+$)'}
>>
101018.jpg
100116.jpg
1000748.jpg
100138.jpg
You can simplify the regex a bit if you only have JPEGs or it's always the part between a hyphen-minus and a dot:
-[^.]+

Using shell and sed..
for i in $(ls /path/to/your/images)
do
mv $i $(echo $i | sed -r 's/([0-9]\*)\-.*\.jpg/\1.jpg/g')
done
Let me explain what this will do.
for loop will take one file for each iteration
echo $i | sed -r 's/([0-9]*)\-.*\.jpg/\1.jpg/g' is for changing file names to your desired output.

Related

Remove character occurring after _ from all the files excluding file extension (.png)

I was searching for a unix command/shell script to remove characters occurred after _ in all the files excluding file extension.
Example:
b6d28-insurance-renewal-shop_6b5c74fa3d4b96f7557c3fd66f2555af.png
should be renamed to
b6d28-insurance-renewal-shop.png
I have tried searching online and but was not able to find out a quick and optimal solution.
Please note that those extra characters are added randomly and varying in each file.
Thanks in Advance!
You can use sed like this using a negated character class:
f='b6d28-insurance-renewal-shop_6b5c74fa3d4b96f7557c3fd66f2555af.png'
sed 's/_[^_.]*//' <<< "$f"
b6d28-insurance-renewal-shop.png
[^_.] matches any character except DOT or underscore.
If you're using bash then you can do this in shell itself using:
echo "${f%_*}.png"
You could also use cut for the result like this:
file="b6d28-insurance-renewal-shop_6b5c74fa3d4b96f7557c3fd66f2555af.png"
new_file=$(echo $file | cut -d'_' -f1).$(echo $file | cut -d'.' -f2)
echo "New file name: ${new_file}"
Output:
New file name: b6d28-insurance-renewal-shop.png
Regex pattern:
(\_[\d\w]+)(?=(\.\w{2,3}))
to find every _akfgasfhsgfhha before .ext[ension]
Assuming that f holds the original filename,
${f%_*}.${f##*.}
would give you the transformed filename.

How can I use sed to regex string and number in bash script

I want to separate string and number in a file to get a specific number in bash script, such as:
Branches executed:75.38% of 1190
I want to only get number
75.38
. I have try like the code below
$new_value=value | sed -r 's/.*_([0-9]*)\..*/\1/g'
but it was incorrect and it was failed.
How should it works? Thank you before for your help.
You can use the following regex to extract the first number in a line:
^[^0-9]*\([0-9.]*\).*$
Usage:
% echo 'Branches executed:75.38% of 1190' | sed 's/^[^0-9]*\([0-9.]*\).*$/\1/'
75.38
Give this a try:
value=$(sed "s/^Branches executed:\([0-9][.0-9]*[0-9]*\)%.*$/\1/" afile)
It is assumed that the line appears only once in afile.
The value is stored in the value variable.
There are several things here that we could improve. One is that you need to escape the parentheses in sed: \(...\)
Another one is that it would be good to have a full specification of the input strings as well as a good script that can help us to play with this.
Anyway, this is my first attempt:
Update: I added a little more bash around this regex so it'll be more easy to play with it:
value='Branches executed:75.38% of 1190'
new_value=`echo $value | sed -e 's/[^0-9]*\([0-9]*\.[0-9]*\).*/\1/g'`
echo $new_value
Update 2: as john pointed out, it will match only numbers that contain a decimal dot. We can fix it with an optional group: \(\.[0-9]\+\)?.
An explanation for the optional group:
\(...\) is a group.
\(...\)? Is a group that appears zero or one times (mind the question mark).
\.[0-9]\+ is the pattern for a dot and one or more digits.
Putting all together:
value='Branches executed:75.38% of 1190'
new_value=`echo $value | sed -e 's/[^0-9]*\([0-9]\+\(\.[0-9]\+\)\?\).*/\1/g'`
echo $new_value

Bash script regex for file size

I'm trying to extract the size (in kb) from a file. Trying to do so as follows:
textA=$(du a)
sizeA=$(expr match "$textA" '\(^[^\s]*\)')
textB=$(du b)
sizeB=$(expr match "$textB" '\(^[^\s]*\)')
echo $textA
echo $sizeA
echo $textB
echo $sizeB
[[ $sizeA == $sizeB ]] && echo "eq"
But this just prints in console textA and textB. Both are like:
30745 a
Can someone please explain why is not the regex matching? I've tried to test the regex against the text in many sites, just to make sure, and it appears to capture the correct text.
I've also tried changing it to:
'^\([^\s]*\)'
But this way it will capture all the text. Any thoughts?
My expr match does not understand \s or other extended regexps. Try '\([0-9]*\)' instead.
But as others mentioned already, using regexp for getting "the first word" is a little overkill. I'd use du s | { read a b; echo $a; }, but you could also use the awk version or solutions using cut.
Not a direct answer, but I would do it like this:
sizeA=$(du a | awk '{print $1}')
size=$(wc -c < file)
If you want to use du, I would use the bash builtin read:
read size filename < <(du file)
Note that you can't say du file | read size filename because in bash, components of a pipeline are executed in subshells, so the variables will disappear when the subshell exits.
Do not parse the output of du, if available you can e.g. use stat to get the size of a file in bytes:
sizeA=$(stat -c%s "${fileA}")

Add specific source file types to svn recursively

I know this is an ugly script but it does the job.
What I am facing now is adding a few more extensions what would clutter the scrip even more.
How can I make it more modular?
Specifically, how can I write this long regular expression (source file extensions) on multiple lines? say one extension on each line. I guess I am doing something wrong with string concatenation but not quite sure what exactly.
Here's the original file:
#!/bin/bash
COMMAND='svn status'
XARGS='xargs'
SVN='svn add'
$COMMAND | grep -E '(\.m|\.mat|\.java|\.js|\.php|\.cpp|\.h|\.c|\.py|\.hs|\.pl|\.xml|\.html|\.sh|.\asm|\.s|\.tex|\.bib|.\Makefile|.\jpg|.\gif|.\png|.\css)'$ | awk ' { print$2 } ' | $XARGS $SVN
and here's roughly what I am aiming at
...code code
'(.\m|
\.mat|
\.js|
.
.
.
.\css)'
..more code here
Anybody?
I know this doesn't answer the question directly, but from a readability perspective, I think most developers would agree that a single-line regex is the most common way to do things and therefore the most maintainable approach. Further, I'm not sure why you're including a period in each of your extensions, this should only need to be used once.
I wrote this little script to automatically add all images to svn. You should be able to simply add extensions between the pipes in the regex to add or remove different file types. Note that it makes sure to only add files that are unrecognized by making sure each line starts with a "?" (^\?) and ends with a period (\.(extensions)$). Hope it's helpful!
#!/bin/bash
svn st | grep -E "^\?.*\.(png|jpg|jpeg|tiff|bmp|gif)$" > /tmp/svn-auto-add-img
while read output; do
FILE=$(echo $output | awk '{ print $2 }')
svn add $FILE
done < /tmp/svn-auto-add-img
exit 0
How about this:
PATTERNS="
\.foo
\.bar
\.baz"
# Put them into one list separated by or ("|").
PATTERNS=`echo $PATTERNS |sed 's/\s\+/|/g'`
$COMMAND | grep -E "($PATTERNS)"
(Note that this would not work if you put quotes around $PATTERNS in the call to echo -- echo is taking care of stripping whitespace and converting newlines to spaces for us.)
#!/bin/bash
COMMAND='svn status'
XARGS='xargs'
SVNADD='svn add'
pats=
pats+=' \.m'
pats+=' \.mat'
pats+=' \.java'
pats+=' \.js'
# add your 'or-able' sub patterns here
# build the full pattern
pattern='(';for pat in $pats;do pattern+="$pat|";done;pattern=${pattern%\|}')$'
# run grep with the generated pattern
files=$($COMMAND | grep -E ${pattern} | awk ' { print $NF } ')
if [ " $files" != " " ]
then
$COMMAND | grep -E ${pattern} | awk ' { print $NF } ' | $XARGS $SVNADD
fi

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Consider the following:
var="text more text and yet more text"
echo $var | egrep "yet more (text)"
It should be possible to get the result of the regex as the string: text
However, I don't see any way to do this in bash with grep or its siblings at the moment.
In perl, php or similar regex engines:
$output = preg_match('/yet more (text)/', 'text more text yet more text');
$output[1] == "text";
Edit: To elaborate why I can't just multiple-regex, in the end I will have a regex with multiple of these (Pictured below) so I need to be able to get all of them. This also eliminates the option of using lookahead/lookbehind (As they are all variable length)
egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "
Example input as requested, straight from lsof (Replace $USER with "j" for this input data):
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)
So far I've got:
#!/bin/bash
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"
echo "npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.
End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):
#!/bin/bash
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Edit: look at my other answer for a simpler bash-only solution.
So, here the solution using sed to fetch the right groups and split them up. You later still have to use bash to read them. (And in this way it only works if the groups themselves do not contain any spaces - otherwise we had to use another divider character and patch read by setting $IFS to this value.)
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of numbers. Alternatively here a \b (word limit) would have worked, too.
Ah, I forget mentioning that you should pipe the text to this script, like this:
./grep-result.sh < grep-result-test.txt
(provided your files are named like this). Instead you can add a < grep-result-test after the sed call (before the |), or prepend the line with cat grep-result-test.txt |.
How does it work?
sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
Note that this .* grabs greedy, so we have to insert a space into our regexp to avoid it grabbing the start of the first digit group too.
The replacement text contains of the three parenthesed groups, separated by spaces.
the p flag at the end of the command says to print out the pattern space after replacement. Since we grabbed the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
5 11 /tmp/FlashXXu8pvMg
5 17 /tmp/FlashXXIBH29F
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
We could also have written read var1 var2 var3 instead (preferably using better variable names), then the first two words would be put to $var1 and $var2, with $var3 getting the rest.
If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
${array[0]} is expanded to the first element of the array and similarly.
When the input ends, the loop ends, too.
This isn't possible using grep or another tool called from a shell prompt/script because a child process can't modify the environment of its parent process. If you're using bash 3.0 or better, then you can use in-process regular expressions. The syntax is perl-ish (=~) and the match groups are available via $BASH_REMATCH[x], where x is the match group.
After creating my sed-solution, I also wanted to try the pure-bash approach suggested by Mark. It works quite fine, for me.
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
while read
do
if [[ $REPLY =~ $regex ]]
then
echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
fi
done
(If you upvote this, you should think about also upvoting Marks answer, since it is essentially his idea.)
The same as before: pipe the text to be filtered to this script.
How does it work?
As said by Mark, the [[ ... ]] special conditional construct supports the binary operator =~, which interprets his right operand (after parameter expansion) as a extended regular expression (just as we want), and matches the left operand against this. (We have again added a space at front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)
Another note: Both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop, a simple if read ; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.
echo "$var" | pcregrep -o "(?<=yet more )text"
Well, for your simple example, you can do this:
var="text more text and yet more text"
echo $var | grep -e "yet more text" | grep -o "text"