Perform a regex operation over files in a directory

I want to replace instances of <span class='i'> </span> with <i> </i> because I decided I want to format my pages this way instead. So I have come up with this command:
perl -pe "s/<span +class *= *['\"]i['\"] *>(.*?)<\/span>/<i>\1<\/i>/g"
I could make it more elaborate but I really don't think there are instances of weirdly formed tags like < / span> or anything so I'll leave it at that. It does have a non greedy capture which is why I used perl -p rather than sed.
So this will output the correctly modified lines but I'm not sure about the best way to send multiple files through this command. What's the best way to do it if I want all of pages/*.html to have the span class='i' tags fixed? Does bash provide some provision for doing this other than a for loop?

@Steven, as per your comment on the answer by @SiegeX, the following will work fine:
perl -pi -e "s/<span +class *= *['\"]i['\"] *>(.*?)<\/span>/<i>\1<\/i>/g" *.html
I would have Perl create backups of the files though, so change the first part to
perl -pi.bak -e ...

The following will iterate over all HTML files in pages/ and do an in-place edit with your Perl one-liner.
#!/bin/bash
for file in pages/*.html; do
perl -pi -e "s/<span +class *= *['\"]i['\"] *>(.*?)<\/span>/<i>\1<\/i>/g" "$file"
done

Related

sed replace base64 encoded images in bulk

I'm trying to use sed to recursively find/replace several base64 encoded images from files.
Every image embed starts with src="data:image/svg+xml;base64, and ends with +"
I've tried this, but couldn't get it to work: find . -type f -name "*.txt" -exec sed -i 's/.*src=\"data\:image\([^;]*\)+\".*/\/path\/to\/image.png/g' {} +
I'm pretty new to regex and sed; I'm expecting something wrong with the regex match or escaping.
Thanks!
When you use s with sed, it replaces the entire text that the pattern matches, so you probably don't want the .* at the start and end of the pattern; match only the part you actually want to replace.
You can also use a delimiter other than / with sed, which is handy if you have / in either part of the substitution, as you have in the path. So we can try a different substitution like:
sed 's_"data:image[^"]*"_/path/to/image.png_g' your_file
This will find everything that starts with "data:image and ends with the next ", which I'm guessing will work, though you didn't give much sample data to show whether the +" was really important, or whether the " itself is sufficient.
We can test this match with
$ echo '<img src="data:image/svg+xml;base64,+">' | sed 's_"data:image[^"]*"_/path/to/image_g'
<img src=/path/to/image>
So you can use that sed command in your find. I assume you're using GNU sed since you're using -i like that. If you have a different sed, you'll probably need to fix that part of your command too.
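Put together with find, it could look something like the sketch below. It assumes GNU sed (for -i without an argument) and uses a made-up scratch directory and sample file. Note one deliberate difference from the command above: the replacement here keeps the double quotes around the new path so the src attribute stays quoted.

```shell
#!/bin/bash
# Hypothetical scratch tree with one sample file, created only for this demo.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub"
printf '%s\n' '<img src="data:image/svg+xml;base64,AAAA+">' > "$demo_dir/sub/a.txt"

# Recursively edit every .txt file in place (GNU sed assumed).
# "_" is the delimiter, so the slashes in the path need no escaping.
find "$demo_dir" -type f -name '*.txt' \
  -exec sed -i 's_"data:image[^"]*"_"/path/to/image.png"_g' {} +

result=$(cat "$demo_dir/sub/a.txt")
echo "$result"
```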

deleting and replacing string inside .php file

I am having some problems with loading a PHP file and then replacing its content with something else.
My code looks like this:
$pattern="*random text*"
$rep=" "
$where=`ls *.php`
find -f $where -name "*.php" -exec sed -i 's/$pattern/$rep/g' {} \;
This won't load the entire line of text. Also, is there a limit on how many characters $pattern can hold?
Also, is there a way to make this .sh file execute every 15 minutes, for example?
I am using Mac OS X.
Thanks!
The syntax $var="value" is wrong. You need to say var="value".
If you just want to do something to the files matching *.php in a single directory, there is no need to use find. Just use a for loop:
pattern="*random text*"
rep=" "
for file in *.php
do
sed -i "s/$pattern/$rep/g" "$file"
done
Note the use of sed "s/$var/.../g" instead of sed 's/$var/.../g'. The double quotes let the shell expand the variables within the expression; otherwise, you would be looking for a literal $var.
Note that sed -i alone does not work in OS X, so you probably have to say sed -i ''.
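To see the quoting difference without touching any files, here is a small sketch that pipes one line through sed both ways. The sample line and the values of pattern and rep are made up for the demo.

```shell
#!/bin/bash
# Hypothetical values, chosen only to make the difference visible.
pattern="random text"
rep="X"
line="some random text here"

# Single quotes: sed receives the literal string $pattern, which matches nothing here.
single=$(printf '%s\n' "$line" | sed 's/$pattern/$rep/g')

# Double quotes: the shell expands the variables before sed sees the expression.
double=$(printf '%s\n' "$line" | sed "s/$pattern/$rep/g")

echo "single quotes: $single"
echo "double quotes: $double"
```

The single-quoted version leaves the line unchanged, while the double-quoted one prints "some X here".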
Example of replacement:
Given a file:
$ cat a
hello
<?php eval(1234567890) regular php code ?>
bye
Let's remove everything from within eval():
$ sed -E 's/(eval\()[^)]*/\1X/' a
hello
<?php eval(X) regular php code ?>
bye

use regular expressions to identify html form action tags

I am trying to use sed -i to update all my HTML forms for URL shortening. Basically I need to delete the .php from all the action="..." attributes in my HTML forms.
But I am stuck at just identifying these instances. I am trying this testfile:
action = "yo.php"
action = 'test.php'
action='test.php'
action="upup.php"
And I am using this expression:
grep -R "action\s?=\s?(.*)php(\"|\')" testfile
And grep returns nothing at all.
I've tried a bunch of variations, and I can see that even the \s? isn't working because just this grep command also returns nothing:
grep -R "action\s?=\s?" testfile
grep -R "action\\s?=\\s?" testfile
(the latter I tried thinking maybe I had to escape the \ in \s).
Can someone tell me what's wrong with these commands?
Edit:
Fix 1 - apparently I need to escape the question mark in \s? to make it be perceived as an optional-character marker rather than a literal question mark.
The way you're using it, grep accepts basic POSIX regex syntax. The single quote does not need to be escaped in it [1], but some of the metacharacters you use do -- in particular, ?, (), and |. You can use
grep -R "action\s\?=\s\?\(.*\)php\(\"\|'\)" testfile
I recommend, however, that you use extended POSIX regex syntax by giving grep the -E flag:
grep -E -R "action\s?=\s?(.*)php(\"|')" testfile
As you can see, that makes the whole thing much more readable.
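A quick way to check the difference is to count matches with each syntax against the test file from the question. This is only a sketch: it assumes GNU grep (for \s) and writes the four sample lines to a temporary file.

```shell
#!/bin/bash
# Recreate the question's test file in a temporary location.
testfile=$(mktemp)
printf '%s\n' 'action = "yo.php"' "action = 'test.php'" \
  "action='test.php'" 'action="upup.php"' > "$testfile"

# Basic syntax: the unescaped ? is a literal question mark, so nothing matches.
# grep -c exits non-zero on zero matches, hence the "|| true".
bre_count=$(grep -c "action\s?=\s?" "$testfile" || true)

# Extended syntax: ?, ( ) and | are operators without backslashes.
ere_count=$(grep -c -E "action\s?=\s?(.*)php(\"|')" "$testfile" || true)

echo "basic: $bre_count matching lines, extended: $ere_count matching lines"
```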
Addendum: To remove the .php extension from all action attributes in a file, you could use
sed -i 's/\(action\s*=\s*["'\''][^"'\'']*\)\.php\(["'\'']\)/\1\2/g' testfile
Shell strings make this look scarier than it is; the sed code is simply
s/\(action\s*=\s*["'][^"']*\)\.php\(["']\)/\1\2/g
I amended the regex slightly so that in a line action='foo.php' somethingelse='bar.php' the right .php would be removed. I tried to make this as safe as I can, but be aware that handling HTML with sed is always hacky.
Combine this with find and its -exec filter to handle a whole directory.
[1] The double quote needs to be escaped because you use a double-quoted shell string, not because the regex requires it.
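As a rough sketch of the find combination, assuming GNU sed and a made-up scratch directory with one sample form file (the data-x attribute is hypothetical, there only to show that other .php values are left alone):

```shell
#!/bin/bash
# Hypothetical scratch directory and sample file, created only for this demo.
demo_dir=$(mktemp -d)
printf '%s\n' "<form action='foo.php' data-x='bar.php'>" \
  '<form action = "yo.php">' > "$demo_dir/form.html"

# Strip .php only from action attributes, in every .html file under demo_dir.
find "$demo_dir" -type f -name '*.html' \
  -exec sed -i 's/\(action\s*=\s*["'\''][^"'\'']*\)\.php\(["'\'']\)/\1\2/g' {} +

result=$(cat "$demo_dir/form.html")
echo "$result"
```

Running it on a copy of your files first is a good idea, since sed -i without a backup suffix overwrites them.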
You need to use the -P option to use Perl regexes:
$ grep -P "action\s?=\s?(.*)php(\"|\')" test
action = "yo.php"
action = 'test.php'
action='test.php'
action="upup.php"
Try this unescaped plain regex, which only selects text within quotes:
action\s?=\s?["'](.*)\.php["']
You can fiddle around with it here:
https://regex101.com/r/lN8iG0/1
So on the command line this would be:
grep -P "action\s?=\s?[\"'](.*)\.php[\"']" test

Sed script - Removing lines

I need help with my sed script. I have an XML file where I have to remove everything except the text enclosed in these tags:
<TEXT>......</TEXT>
<HEADLINE>......</HEADLINE>
How do I write the sed code? I know how to remove everything except the text enclosed in ONE tag.
s/.*<TEXT>\(.*\)<\/TEXT>.*/\1/
But how do I write the sed code for several tags?
You can pass multiple commands to sed:
$ echo '<TEXT>Hello</TEXT>
<HEADLINE>there</HEADLINE>' | sed -n 's/.*<TEXT>\(.*\)<\/TEXT>.*/\1/gp; s/.*<HEADLINE>\(.*\)<\/HEADLINE>.*/\1/gp'
Hello
there
But you really should be careful when applying regex to XML-like files.
Assuming that you have valid XML:
sed '/.*<\(TEXT\|HEADLINE\)>\(.*\)<\/\(TEXT\|HEADLINE\)>.*/!d;s//\2/' yourfile.xml
If you want to use a sed script add this line:
/.*<\(TEXT\|HEADLINE\)>\(.*\)<\/\(TEXT\|HEADLINE\)>.*/!d;s//\2/
Then run:
sed -f yourscript.sed < yourfile.xml
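Here is a small sketch of that one-liner on a made-up sample file. It assumes GNU sed (for \| in a basic regex) and that each tag opens and closes on the same line; the DOC and DATE elements are invented for the demo.

```shell
#!/bin/bash
# Hypothetical sample document, written to a temporary file.
xml=$(mktemp)
printf '%s\n' '<DOC>' '<HEADLINE>there</HEADLINE>' '<DATE>today</DATE>' \
  '<TEXT>Hello</TEXT>' '</DOC>' > "$xml"

# Delete lines without a TEXT/HEADLINE pair; on the rest, the empty pattern in
# s//\2/ reuses the address regex and keeps only the enclosed text.
result=$(sed '/.*<\(TEXT\|HEADLINE\)>\(.*\)<\/\(TEXT\|HEADLINE\)>.*/!d;s//\2/' "$xml")
echo "$result"
```

Only the contents of the HEADLINE and TEXT elements are printed, in file order.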
This might work for you (GNU sed):
sed -r '/<(text|headline)>/I!d;s//&\n/;s/^[^\n]*\n//;:a;/<\//!{$!{N;ba}};s/\n/ /g;s/<\//\n&/;P;D' file
This removes all text except that which is between TEXT and HEADLINE tags, and for multi-line values it replaces newlines with spaces.

What's wrong with this shell/sed script?

I have about 150 HTML files in a given directory that I'd like to make some changes to. Some of the anchor tags have an href along the following lines: index.php?page=something. I'd like all of those to be changed to something.html. Simple regex, simple script. I can't seem to get it correct, though. Can somebody weigh in on what I'm doing wrong?
Sample html, before and after output:
<!-- Before -->
<ul>
<li>Apple</li>
<li>Dandelion</li>
<li>Elephant</li>
<li>Resonate</li>
</ul>
<!-- After -->
<ul>
<li>Apple</li>
<li>Dandelion</li>
<li>Elephant</li>
<li>Resonate</li>
</ul>
Script file:
#! /bin/bash
for f in *.html
do
sed s/\"index\.php?page=\([.]*\)\"/\1\.html/g < $f >! $f
done
It's your regex, and the fact that the shell is trying to interpret bits of your regex.
First - the [.]* matches any number of literal dots (.). Change it to .*.
Secondly, enclose the entire regex in single quotes ' to prevent the bash shell from interpreting any of it.
sed 's/"index\.php?page=\(.*\)"/\1\.html/g'
Also, instead of < $f >! $f you can just feed in the '-i' switch to sed to have it operate in-place:
sed -i 's/"index\.php?page=\(.*\)"/"\1\.html"/g' "$f"
(Also, as another point I think in your replacement you want double quotes around the \1.html so that the new URL is quoted within the HTML. I also quoted your $f to "$f", because if the file name contains spaces bash will complain).
EDIT: as @TimPote notes, the standard way to match something within quotes is either ".*?" (so that the .* is non-greedy) or "[^"]+". Sed doesn't support the former, so try:
sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g' "$f"
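This stops the match from running past the closing quote of the href. On a line with more than one double-quoted string, the greedy \(.*\) would capture everything up to the last double quote on the line, not just the page name.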
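A small sketch of the difference, using a made-up line with two links (the page names are hypothetical) and GNU sed for \+:

```shell
#!/bin/bash
# Hypothetical input line with two links on it.
line='<a href="index.php?page=apple">Apple</a> <a href="index.php?page=pear">Pear</a>'

# Greedy: \(.*\) runs to the last double quote on the line.
greedy=$(printf '%s\n' "$line" | sed 's/"index\.php?page=\(.*\)"/"\1.html"/g')

# Bounded: [^"]\+ stops at the first closing quote, so each link is handled on its own.
bounded=$(printf '%s\n' "$line" | sed 's/"index\.php?page=\([^"]\+\)"/"\1.html"/g')

echo "greedy:  $greedy"
echo "bounded: $bounded"
```

The greedy version mangles the line (the first href becomes "apple" and .html is appended after pear), while the bounded version rewrites both links as intended.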
Your .* was too greedy. Use [^"]\+ instead. Plus your quotes were all messed up. Surround the whole thing with single quotes instead, then you can use " without escaping them.
sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g'
You can do this whole operation with a single statement using find:
find . -maxdepth 1 -type f -name '*.html' \
-exec sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g' {} \+
The following works:
sed "s/\"index\.php?page=\(.*\)\"/\"\1.html\"/g" < 1.html
I think it was mostly the square brackets. Not sure why you had them.
Oh, and the entire sed command needs to be in quotes.