Replacing <title> content with <h1> - regex

Or more generally, how to replace a pattern with another captured pattern.
I want to do this for multiple html files. Like:
find . -name '*.html' | xargs sed -i 's/(?<=title\>).*(?=\<\/title)/string2/g'
except string2 is dynamically captured with another pattern.
UPDATE
It's a little verbose, but I mostly worked it out with a shell script:
#!/bin/bash
file=$1
h1=$(grep -oP '(?<=h1\>)(?!FreeType).*(?=\<\/h1)' "$file") # I want to find an h1 tag without the word 'FreeType' in it
echo "${h1} found"
perl -i.bak -pe "s[(?<=title\>).*(?=\<\/title)][${h1}]g" "$file"
and
find . -name '*.html' -exec ~/replace.sh {} \;

sed doesn't support lookarounds, but you can just match the tags and put them back:
find . -name '*.html' | xargs sed -i "s_<title>[^<]*</title>_<title>${string2}</title>_g"
I replaced .* with [^<]* so that a match cannot gobble up everything from the first open tag to the last close tag on a line; sed has no reluctant quantifier (.*?) either. The expression is in double quotes so that the shell expands $string2.
Also note that you can avoid sawtooth patterns (i.e. escaped slashes such as \/\/) by using a character other than / as the delimiter. Here I used an underscore, but nearly any character works, and it makes the pattern more readable.
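A minimal sketch of the whole job using only sed, assuming each file keeps its <h1> and its <title> on a single line. The sample file, the temporary directory and the variable names (tmp, h1, esc, result) are made up for illustration. The captured text is escaped before it is placed in the replacement, so that an &, a \ or the _ delimiter inside a heading cannot corrupt the expression.

```shell
# Throwaway sample file (hypothetical content) so the sketch is self-contained.
tmp=$(mktemp -d)
printf '%s\n' '<html><head><title>Old title</title></head>' \
              '<body><h1>Tom & Jerry</h1></body></html>' > "$tmp/page.html"

for f in "$tmp"/*.html; do
    # Text of the first single-line <h1>...</h1> in the file.
    h1=$(sed -n 's_.*<h1>\([^<]*\)</h1>.*_\1_p' "$f" | head -n 1)
    [ -n "$h1" ] || continue          # no h1 found: leave the file alone
    # Backslash-escape \, & and the _ delimiter for use in a sed replacement.
    esc=$(printf '%s\n' "$h1" | sed 's/[\\&_]/\\&/g')
    # -i.bak keeps a backup copy next to each edited file.
    sed -i.bak "s_<title>[^<]*</title>_<title>${esc}</title>_" "$f"
done

result=$(sed -n 's_.*\(<title>[^<]*</title>\).*_\1_p' "$tmp/page.html")
echo "$result"
```

Once the output looks right, pointing the loop at the real files (for example via find . -name '*.html') is the only change needed.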

Related

Using find/sed to replace strings in text files- works only on some of the matches

I want to replace
{not STRING }
with
(not STRING )
I ran
find . -maxdepth 1 -type f -exec sed -i -E 's/{not\s([^\s}]+)\s}/(not \1 )/g' {} ;
It worked on some of the matches. When I run grep with the same pattern it shows more files that still have STRING. Ran find/sed again, same result.
You need to escape curly braces ({}), as they are regex meta-characters. Also \s is not POSIX sed, I would use the more portable [[:space:]].
Your code did not work on the example text for me (GNU/Linux). This does:
sed -E 's/\{not[[:space:]]+([^[:space:]}]+)[[:space:]]+\}/(not \1 )/g'
I also allowed for variable length whitespace directly after not and directly before } (using [[:space:]]+). You may or may not want that.
Also:
On macOS sed, I believe you need to supply a suffix argument to -i.
The trailing ; for find -exec must be quoted (\;) to avoid interpretation by the shell.
So the command would be:
find . -maxdepth 1 -type f -exec \
sed -E -i .TMP 's/\{not[[:space:]]+([^[:space:]}]+)[[:space:]]+\}/(not \1 )/g' {} \;
If .TMP conflicts with an existing file, choose a different suffix. With GNU sed, the suffix must be attached directly to the option (-i.TMP) or can be omitted.
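A quick way to check the pattern before touching any files is to run it on a sample line without -i. The sample text below is invented; the expression is the one given above.

```shell
# Made-up sample line with two matches and one non-match.
sample='x {not FOO } y {not   BAR } z {other}'
result=$(printf '%s\n' "$sample" |
    sed -E 's/\{not[[:space:]]+([^[:space:]}]+)[[:space:]]+\}/(not \1 )/g')
echo "$result"
# prints: x (not FOO ) y (not BAR ) z {other}
```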

How do I find and replace a character when it is not one of the first 8 characters in the filename using prename?

This will for example recursively find and replace all hyphens in all filenames with a single space:
find . -type f -name "*-*" -execdir prename 's/-/ /g' "{}" \;
How would I modify this to replace only hyphens that are not within the first 8 characters of the file name?
Pathnames passed to prename are prefixed with ./ because of the -execdir primary. So you need to keep the first ten characters intact and replace each dash with a space in the rest of the path string. This can be achieved fairly easily with a while loop (the g flag alone doesn't work because the matches overlap: each match consumes the text in front of the hyphen) and Perl's \K escape, which drops everything matched so far from the text being replaced, much like a variable-length lookbehind.
find -name '????????*-*' -type f -execdir prename -n '1 while s/.{10,}\K-/ /' {} +
This invokes prename at least once for each directory, and thus, may be slow due to overhead from initialization. If that is a concern, you can use -exec instead of -execdir, and adjust the Perl expression accordingly. Below is my amateur attempt at it, use with caution.
-exec prename -n '/(.*\/.{8})(.*)/; $_ = $1 . $2 =~ y/-/ /r' {} +
Drop -n if the output looks good.

Using SED to replace a domain name in a large number of HTML files

Ok, I give up. I've been trying for a couple of hours to get sed to replace an incorrectly formatted domain name in several thousand html files but I cannot seem to get the escaping of the slashes (and possibly dot/colon) correct.
Text to find:
http://www.domain.com/http
Replace with:
http
What I have tried:
sed -i 's/http:\/\/www.domain.com\/http/http/'
sed -i 's/http\\:\\/\\/www\\.domain\\.com\\/http/http/'
sed -i 's/http\:\/\/www\.domain\.com\/http/http/'
sed -i 's=http://www.domain.com/http=http='
UPDATE:
As it transpires, I was chasing ghosts. A piece of JavaScript was adding http://www.domain.com/ to the beginning of all my img tags! Unfortunately, I now need to remove this from all pages. So instead of the above, I am now looking to:
Replace this:
http://www.domain.com/'+img[0]
with this:
'+img[0]
I have tried the following to no avail:
find . -name "*.html" -type f -exec sed -i 's|http://www\.domain\.com/\'+img\[0\]|\'+img\[0\]|g' {} \;
find . -name "*.html" -type f -exec sed -i 's|http://www\.domain\.com/\'+img[0]|\'+img[0]|g' {} \;
I appear to be stuck on the escaping of certain characters again. Only this time, when I try to run one of the above commands, it just takes me to a > prompt.
You can avoid a lot of the escaping by using a different delimiter. The dot . is the only character with special meaning that needs to be escaped; everything else you can match literally. Also use the global modifier with your pattern.
sed -i 's|http://www\.domain\.com/http|http|g'
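Edit: You can use the following to replace the other part. It uses double quotes because \' does not escape a single quote inside a single-quoted string; the resulting unterminated quote is why the shell dropped to a > prompt.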
sed -i "s|http://www\.domain\.com/\('[+]img\[0\]\)|\1|g"
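Both expressions can be tried on sample strings before running them with -i over thousands of files. This is a small sketch; the example.org URL and the .png suffix are placeholders, not taken from the question.

```shell
# Sample inputs (example.org and the .png suffix are placeholders).
link='href="http://www.domain.com/http://example.org/"'
js="src='http://www.domain.com/'+img[0]+'.png'"

# | as delimiter: only the dots need escaping.
fixed_link=$(printf '%s\n' "$link" | sed 's|http://www\.domain\.com/http|http|g')
# Double quotes let the expression contain a literal single quote.
fixed_js=$(printf '%s\n' "$js" | sed "s|http://www\.domain\.com/\('[+]img\[0\]\)|\1|g")

echo "$fixed_link"
echo "$fixed_js"
```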

Pass sed output to mv

I'm trying to batch rename text files according to a string they contain.
I used sed to isolate the pattern with \( and \) as I couldn't get this to work in grep.
sed -i '' 's/<title>\(.*\)<\/title>/&/g' *.txt | mv *.txt $sed.txt
(The text I want to use as the filename is between HTML title tags.)
Where I wrote $sed would be the output of sed.
Hope that's clear!
A simple loop in bash can accomplish this. If each file is valid HTML, meaning you have only one <title> tag in the file, you can rename them all this way:
for file in *.txt; do
mv "$file" "$(sed -n 's/<title>\([^<]*\)<\/title>/\1/p' "$file" | sed -e 's/[ ][ ]*/_/g').txt"
done
So, if you have files 1.txt, 2.txt and 3.txt, each with cat, dog and my hippo in their TITLE tags, you'll end up with cat.txt, dog.txt and my_hippo.txt after the above loop.
EDIT: quoted initial $file in case there are spaces in filenames; and added a second sed to convert any spaces in the <title> tag to _'s in resulting filenames. NOTE the whitespace inside the []'s in the second sed command is a literal space and tab character.
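You can enclose a command in grave accent characters (`) to insert its output at the place you want. Note that sed -i edits the file in place and prints nothing, so drop -i and print only the captured title; mv also has to be given one file at a time. For a single file, say 1.txt, try:
mv 1.txt "`sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' 1.txt`.txt"
It is not very flexible, but should work.
(I haven't used it in a while and cannot test it now, so I might be wrong.)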
Here is the command I would use:
for i in *.txt ; do
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" $i
done
The sed substitution searches for the pattern in each of your .txt files. For each file, it builds the string mv 'file_name' 'found_pattern'.
With the e flag at the end of the s command (a GNU sed extension), the resulting string is executed directly as a shell command, which renames your files.
Some hints:
Note the use of = instead of / as the delimiter for the sed substitution: it's more readable because you already have a / in your pattern, and this way you don't have to escape it (you could use many other symbols if you don't like =).
The e command for sed executes the created string.
(I'm speaking of this one below:
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" $i
                                          ^
)
So use it with caution! I would recommend first running the line without the final e: it won't execute any mv command, but will just print what would be executed if you added the e.
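A dry run along those lines might look like the sketch below. It differs slightly from the command above: it uses -n with the p flag so that only the generated mv line is printed, and it adds .* on both sides so that text around the tags is dropped. The temporary directory and the sample file are invented for the demonstration.

```shell
# Sample file in a scratch directory (hypothetical content).
tmp=$(mktemp -d)
printf '%s\n' '<html><title>cat</title></html>' > "$tmp/1.txt"

for i in "$tmp"/*.txt; do
    # p instead of e: print the command that would be run, do not execute it.
    preview=$(sed -n "s=.*<title>\(.*\)</title>.*=mv '$i' '\1'=p" "$i")
    echo "$preview"
done
```

Because the file name is pasted into the sed expression, names containing =, & or a single quote would need extra care.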
What I read from your question is:
you have a number of text (html) files in a directory
each file contains at least the tag <title> ... </title>
you want to extract the content (elements.text) and use it as filename
last you want to rename that file to the extracted filename
Is this correct?
So, then you need to loop through the files, e.g. with xargs or find
ls *.txt | xargs -i\{\} command "{}" ...
find -maxdepth 1 -type f -name '*.txt' -exec command "{}" ... \;
I always write the xargs placeholder as -i\{\} because the resulting command stays compatible if I later use it with find and its {} placeholder.
The -maxdepth option stops find from descending into subdirectories; if there are none, you can leave it out.
command could be something very simple like echo "Testing File: {}" or a really small script if you use it with bash:
find . -name '*.txt' -exec bash -c 'CUR_FILE="{}"; echo "Working on: $CUR_FILE"; ls -l "$CUR_FILE";' \;
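A variation worth considering: instead of embedding {} inside the script text, the file names can be passed to the inline script as positional arguments, which keeps names containing quotes or $ from being interpreted as shell code. This is a sketch with made-up file names in a temporary directory; sh is used here, but bash would work the same way.

```shell
# Scratch directory with two empty sample files (names are invented).
tmp=$(mktemp -d)
: > "$tmp/plain.txt"
: > "$tmp/with space.txt"

# The word after the script ("sh") becomes $0; the found files become "$@".
output=$(find "$tmp" -name '*.txt' -exec sh -c '
    for f in "$@"; do
        echo "Working on: $f"
    done' sh {} +)
echo "$output"
```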
The big decision for your question is: how to get the text from title element.
A simple solution (suitable if opening and closing tag is on same textline) would be by grep
A more solid solution is to use a HTML Parser and navigate by DOM operation
The simple solution is based on:
get the title line
remove everything before and after the title content
So do it together:
ls *.txt | xargs -i\{\} bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"'
Same with usage of find:
find . -maxdepth 1 -type f -name '*.txt' -exec bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"' \;
Hopefully it is what you expected.
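Putting the pieces from this thread together, a more defensive loop could look like the sketch below. It assumes the title sits on one line, skips files without a title, and strips slashes so the new name cannot point into another directory. The scratch directory, the sample files and the variable names are hypothetical.

```shell
# Scratch directory with one titled and one untitled sample file.
tmp=$(mktemp -d)
printf '%s\n' '<html><head><title>my hippo</title></head></html>' > "$tmp/3.txt"
printf '%s\n' 'no title here' > "$tmp/4.txt"

for file in "$tmp"/*.txt; do
    # First single-line title in the file, if any.
    title=$(sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p' "$file" | head -n 1)
    [ -n "$title" ] || continue       # nothing to rename to: skip the file
    # Runs of whitespace become _, slashes are removed.
    new=$(printf '%s\n' "$title" | sed 's/[[:space:]][[:space:]]*/_/g' | tr -d '/')
    mv -- "$file" "$tmp/$new.txt"
done

ls "$tmp"
```

If two files share a title, the second mv would overwrite the first result; mv -i or an existence check would guard against that.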

Recursively replace django template tag with sed

I renamed something in my django application, and I want to recursively search and replace the tag in all of the templates. I tried to do this using find and sed like so.
find . -name *.html -exec sed -i 's/\{\{\s*oldtag\s*\}\}/{{ newtag }}/g' {} \;
I get this error.
sed: -e expression #1, char 44: Invalid preceding regular expression
Ok, so I tried a whole bunch of different things to try to make it work. I tried unescaping and double-escaping the curly braces. I tried using [ \t] instead of \s. Nothing seems to work. Some of the combinations don't give an error, but they also don't find or replace anything. What's even worse is sometimes I get this other error.
find: paths must precede expression: index.html
How can the path precede the expression? . is the path, and it immediately follows the find command. It precedes all the expressions.
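Quote the pattern given to -name; unquoted, the shell expands *.html before find runs, which is what produces the "paths must precede expression" error. Try: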
find . -name '*.html' -exec sed -i 's|{{\s*oldtag\s*}}|{{ newtag }}|g' {} +
With some assumptions:
your sed implementation recognizes the \s escape sequence and the -i option
your find implementation supports the {} + syntax
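The braces should not be escaped: in sed's basic regular expressions, \{ starts an interval expression, which is the likely source of "Invalid preceding regular expression". Single-quote the glob as well. This should work:
find . -name '*.html' -exec sed -i 's/{{\s*oldtag\s*}}/{{ newtag }}/g' {} \;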
Tip: You can always just insert echo just before the word sed to see a printout of what it looks like (see what is escaped).
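A small end-to-end check, run in a scratch directory rather than on the real templates. This is a sketch: the directory layout and the sample template are invented, [[:space:]] stands in for \s for portability, and -i.bak is used because that form of -i is accepted by both GNU and BSD sed.

```shell
# Scratch tree with one sample template (hypothetical content).
tmp=$(mktemp -d)
mkdir -p "$tmp/templates/sub"
printf '%s\n' '<p>{{oldtag}} and {{  oldtag }} but {{ other }}</p>' \
    > "$tmp/templates/sub/index.html"

# Quoted glob, unescaped braces, | as the delimiter.
find "$tmp" -name '*.html' -exec \
    sed -i.bak 's|{{[[:space:]]*oldtag[[:space:]]*}}|{{ newtag }}|g' {} +

result=$(cat "$tmp/templates/sub/index.html")
echo "$result"
# prints: <p>{{ newtag }} and {{ newtag }} but {{ other }}</p>
```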