Remove character occurring after _ from all the files excluding file extension (.png) - regex

I am looking for a Unix command or shell script that removes the characters occurring after _ in all file names, leaving the file extension intact.
Example:
b6d28-insurance-renewal-shop_6b5c74fa3d4b96f7557c3fd66f2555af.png
should be renamed to
b6d28-insurance-renewal-shop.png
I have searched online but was not able to find a quick and optimal solution.
Please note that the extra characters are random and differ from file to file.
Thanks in Advance!

You can use sed like this using a negated character class:
f='b6d28-insurance-renewal-shop_6b5c74fa3d4b96f7557c3fd66f2555af.png'
sed 's/_[^_.]*//' <<< "$f"
b6d28-insurance-renewal-shop.png
[^_.] matches any character except DOT or underscore.
If you're using bash then you can do this in shell itself using:
echo "${f%_*}.png"

You could also use cut for the result like this:
file="b6d28-insurance-renewal-shop_6b5c74fa3d4b96f7557c3fd66f2555af.png"
new_file=$(echo "$file" | cut -d'_' -f1).$(echo "$file" | cut -d'.' -f2)
echo "New file name: ${new_file}"
Output:
New file name: b6d28-insurance-renewal-shop.png

Regex pattern:
(\_[\d\w]+)(?=(\.\w{2,3}))
to find every _akfgasfhsgfhha before .ext[ension]

Assuming that f holds the original filename,
${f%_*}.${f##*.}
would give you the transformed filename.
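To apply the parameter-expansion idea to a whole directory, something like the following might work. This is only a sketch: strip_suffix and rename_pngs are hypothetical helper names, and it assumes the files sit in the current directory and end in .png. Names without an underscore are left untouched, and an existing target is never overwritten.

```shell
#!/bin/sh
# Sketch only: strip_suffix and rename_pngs are hypothetical names.

# Print the name with everything from the last "_" up to the extension removed.
strip_suffix() {
    case $1 in
        *_*.*) printf '%s\n' "${1%_*}.${1##*.}" ;;  # "_" followed later by an extension
        *)     printf '%s\n' "$1" ;;                # nothing to strip
    esac
}

# Rename every matching .png in the current directory.
rename_pngs() {
    for f in *_*.png; do
        [ -e "$f" ] || continue                     # the glob matched nothing
        new=$(strip_suffix "$f")
        if [ -e "$new" ]; then
            echo "skipping $f: $new already exists" >&2
            continue
        fi
        mv -- "$f" "$new"
    done
}
```

Calling rename_pngs inside the image directory performs the renames; running strip_suffix on a few names first is a cheap way to preview the result.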

Related

Regex - how to prevent or work around interference between files when searching through them?

So, I am using a regular expression to search through a bunch of files from a corpus. The point is to find the titles of newspaper articles.
This is what I use:
cat *.txt | grep -P '(^[A-ZÖÄÜÕŠŽ].*[^\.]$)' --colour
It finds lines that begin with a capital, followed by any character, but not ending with a dot and that works for these specific files.
The problem is that two files interfere with each other and the dot from the very end of one file shows up in the beginning of another and I get this:
Kõik Kataria jüngrid kinnitavad , et nende elu on pärast naeruklubiga liitumist oluliselt paranenud
.Kosmosepall teeb maailmareisi 39 kilomeetri kõrgusel.
Is there any way to prevent that interference without actually modifying the files or a way to change the regular expression, so that this dot at the beginning is excluded? I must say that I am a beginner, I tried to find solutions, but none of them were specific to my case.
The files probably do not have a newline at the end, so the last line of the first file is merged with the first line of the second one.
You can try to append a newline on the fly:
find *.txt | xargs -I{} sh -c "cat {}; echo ''" | grep -P '(^[A-ZÖÄÜÕŠŽ].*[^\.]$)' --colour
Source: https://stackoverflow.com/a/44675414/580346
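The same idea can be written as a plain loop, which avoids passing file names through sh -c. This is a sketch under the assumption that an extra blank line after files that already end in a newline is harmless (it is for the title pattern, which needs a capital letter at the start of the line). cat_with_newlines is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: cat_with_newlines is a hypothetical name.
# Concatenate the given files, terminating each one with a newline so that
# the last line of one file cannot run into the first line of the next.
cat_with_newlines() {
    for f in "$@"; do
        cat -- "$f"
        echo    # ends an unterminated last line; otherwise adds one blank line
    done
}

# Usage (pattern copied from the question):
#   cat_with_newlines *.txt | grep -P '(^[A-ZÖÄÜÕŠŽ].*[^\.]$)' --colour
```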

Select a single character in an alphanumeric string in bash

I have an issue with string manipulation in bash. I have a list of names, each name being composed of two parts, chars and numbers: for example
abcdef01234
I want to cut the last character before the numeric part starts, in this case
f
I think there is a regular expression to help me with this but just can't figure it out. AWK/sed solutions are accepted too. Hope someone can help.
Thank you.
In bash it can be done with parameter expansion with substring removal and string indexes, e.g.,
a=abcdef01234 # your string
tmp=${a%%[0-9]*} # remove everything from the first digit onward
echo ${tmp:(-1)} # output last of remaining chars
Output: f
You can use a regexp like [a-zA-Z]+([a-zA-Z])[0-9]+. If you know how to use sed, this is pretty easy.
Check https://regex101.com/r/XCkKM5/1
The match will be the letter you want.
^\w+([a-zA-Z])\d+$
As a sed command (on OSX) this will be :
echo "abcdef12345" | sed -E "s#^[a-zA-Z]+([a-zA-Z])[0-9]+\$#\1#"
You could also try the following:
echo "abcdef01234" | awk '{match($0,/[a-zA-Z]+/);print substr($0,RLENGTH,1)}'
Assuming the list of names is stored in a file called file, you can use grep's PCRE mode with a (positive) lookahead:
$ grep -oP "[a-z](?=[^a-z])" file
f
It prints out the first (lowercase) letter followed by a non-(lowercase)-letter.
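For a whole list of names, the parameter-expansion approach can be wrapped in a small function that uses only POSIX expansions, so it does not depend on bash's ${tmp:(-1)} syntax. A minimal sketch: last_letter_before_digits is a hypothetical name, and names.txt in the usage comment is an assumed input file with one name per line.

```shell
#!/bin/sh
# Sketch only: last_letter_before_digits is a hypothetical name.
last_letter_before_digits() {
    prefix=${1%%[0-9]*}                       # drop everything from the first digit onward
    printf '%s\n' "${prefix#"${prefix%?}"}"   # keep only the last character of the rest
}

# Usage, assuming one name per line in names.txt:
#   while IFS= read -r name; do last_letter_before_digits "$name"; done < names.txt
```

If a name starts with a digit, the prefix is empty and the function prints an empty line.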

sed match pattern \tTEXT\t not working

I use the following command on a huge text file
sed 's/\tEN-GB\t//g' "/home/ubuntu/0214/corpus/C.txt"
The file contains a [tab]EN-GB[tab] in each row, but what I get is the original text. I cannot figure out why.
NOTE: when I'm using 's/\t//g' it works and the resulting string is [a lot of no-tabs]EN-GB[a lot of no-tabs] in each row, so the tabs vanished.
UPDATE: Here is the incriminated part of the output from cat -vet:
^#2^#0^#0^#7^#0^#1^#0^#4^#~^#1^#6^#3^#2^#4^#3^#^I^#^I^#0^#^I^#E^#N^#-^#G^#B^#^I^#T^#h^#e^# ^#a^#d^#m^#i^#n^#i^#s^#t^#
I'm out of black magic... thanks in advance
It appears that your sed command is correct, but you have some null characters in your text file.
Run this sed command, which removes the nulls first:
sed -i.bak 's/\x0//g; s/\tEN-GB\t//g' "/home/ubuntu/0214/corpus/C.txt"
You can use ANSI-C quoting to represent the TAB character:
sed 's/'$'\tEN-GB\t''//g' filename
EDIT: The output of cat -vet suggests that you have NULL characters in your input. Remove those before piping the results to the above command. Say:
tr -d '\x0' < filename | sed 's/'$'\tEN-GB\t''//g'
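The two steps can also be combined in a way that does not rely on ANSI-C quoting or on sed understanding \t. This is a sketch, not a definitive fix: strip_nuls_and_tag is a hypothetical name, and it assumes the NUL bytes carry no information. A NUL after every character often indicates a UTF-16 encoded file, in which case converting the file with iconv first may be the cleaner route.

```shell
#!/bin/sh
# Sketch only: strip_nuls_and_tag is a hypothetical name.
# Reads stdin, deletes NUL bytes, then removes every <TAB>EN-GB<TAB>.
strip_nuls_and_tag() {
    tab=$(printf '\t')                        # a literal TAB, portable across sed versions
    tr -d '\000' | sed "s/${tab}EN-GB${tab}//g"
}

# Usage (C.clean.txt is an assumed output name):
#   strip_nuls_and_tag < C.txt > C.clean.txt
```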

Strip all characters from image filename after a dash (-)

I have about 1400 images like this.
101018-202x300.jpg
100116-215x300.jpg
1000748-300x157.jpg
100138-196x300.jpg
100308-companion-in-surgical-studies-208x300.jpg
100463-Ambroise-Pare-300x216.jpg
100523-Grulee-collection-pediatrics-194x300.jpg
I need to strip out all the characters after the FIRST dash so that it reads like this
101018.jpg
100116.jpg
1000748.jpg
100138.jpg
100308.jpg
100463.jpg
100523.jpg
I know this can be done with regular expressions, but I have no clue where to begin.
I am busy working through this Regex Site to learn more about the topic.
Thank you.
EDIT: Apologies, I did not add some of the other more varying examples.
You can capture the leading number (101018 in the first example) and rename the file to that captured group. NEW (Demo)
OLD (Demo)
(\d+)[\w-]+\.jpg
EDIT: A second option is splitting by "-" and getting the first parameter.
In PowerShell:
$_ -replace '-.*(?=\.\w+$)'
e.g.
PS> -split'101018-202x300.jpg
>> 100116-215x300.jpg
>> 1000748-300x157.jpg
>> 100138-196x300.jpg' | %{ $_ -replace '-.*(?=\.\w+$)'}
>>
101018.jpg
100116.jpg
1000748.jpg
100138.jpg
You can simplify the regex a bit if you only have JPEGs or it's always the part between a hyphen-minus and a dot:
-[^.]+
Using shell and sed:
cd /path/to/your/images || exit
for i in *-*.jpg
do
    mv -- "$i" "$(echo "$i" | sed -r 's/^([0-9]*)-.*\.jpg$/\1.jpg/')"
done
Let me explain what this does.
The for loop takes one file name per iteration; a glob is used instead of parsing the output of ls, so names containing spaces are handled correctly.
echo "$i" | sed -r 's/^([0-9]*)-.*\.jpg$/\1.jpg/' captures the leading digits and rewrites the file name to your desired output.
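The same rename can be done without sed, using shell parameter expansion only. This is a sketch under a few assumptions: strip_after_first_dash and rename_images are hypothetical names, the images are in the current directory, and a file is skipped when its target name already exists, since two thumbnails of the same image would otherwise collide.

```shell
#!/bin/sh
# Sketch only: strip_after_first_dash and rename_images are hypothetical names.

# Print the name with everything from the first "-" up to the extension removed.
strip_after_first_dash() {
    case $1 in
        *-*.*) printf '%s\n' "${1%%-*}.${1##*.}" ;;  # e.g. 101018-202x300.jpg
        *)     printf '%s\n' "$1" ;;                 # no dash: leave as is
    esac
}

rename_images() {
    for f in *-*.jpg; do
        [ -e "$f" ] || continue                      # the glob matched nothing
        new=$(strip_after_first_dash "$f")
        if [ -e "$new" ]; then
            echo "skipping $f: $new already exists" >&2
            continue
        fi
        mv -- "$f" "$new"
    done
}
```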

grep - search for "<?\n" at start of a file

I have a hunch that I should probably be using ack or egrep instead, but what should I use to basically look for
<?
at the start of a file? I'm trying to find all files that contain the php short open tag since I migrated a bunch of legacy scripts to a relatively new server with the latest php 5.
I know the regex would probably be '/^<\?\n/'
I RTFM and ended up using:
grep -RlIP '^<\?\n' *
the P argument enabled full perl compatible regexes.
If you're looking for all php short tags, use a negative lookahead
/<\?(?!php)/
will match <? but will not match <?php
[meder ~/project]$ grep -rP '<\?(?!php)' .
find . -name "*.php" | xargs grep -nHo "<?[^p^x]"
The ^x excludes the XML start tag.
If you are worried about Windows line endings, just add \r?.
grep '^<?$' filename
Don't know if that is showing up correctly. Should be
grep ' ^ < ? $ ' filename
Do you mean a literal "backslash n" or do you mean a newline?
For the former:
grep '^<?\\n' [files]
For the latter:
grep '^<?$' [files]
Note that grep will search all lines, so if you want to find matches just at the beginning of the file, you'll need to either filter each file down to its first line, or ask grep to print out line numbers and then only look for line-1 matches.
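Filtering each file down to its first line could look like the sketch below. has_short_tag_first_line and list_short_tag_files are hypothetical names; the check treats a first line of exactly <? (optionally followed by a carriage return) as a match and assumes only *.php files are of interest.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical.

# Succeeds if the first line of the file is exactly "<?" (CR tolerated).
has_short_tag_first_line() {
    first=$(head -n 1 "$1" | tr -d '\r')
    [ "$first" = "<?" ]
}

# Print every *.php file under the given directory whose first line is "<?".
list_short_tag_files() {
    find "$1" -type f -name '*.php' | while IFS= read -r f; do
        if has_short_tag_first_line "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage: list_short_tag_files .
```

File names containing newlines would confuse the read loop; for legacy PHP trees that is usually not a concern.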