sed replace base64 encoded images in bulk

I'm trying to use sed to recursively find and replace several base64-encoded images in files.
Every image embed starts with src="data:image/svg+xml;base64, and ends with +"
I've tried this, but couldn't get it to work: find . -type f -name "*.txt" -exec sed -i 's/.*src=\"data\:image\([^;]*\)+\".*/\/path\/to\/image.png/g' {} +
I'm pretty new to regex and sed; I suspect something is wrong with the regex match or the escaping.
Thanks!

When you use s with sed, it replaces everything the pattern matches, so you probably don't want the .* at the start and end of the pattern: with them, the whole line is replaced rather than just the image data.
You can also use a delimiter other than / with sed, which is handy when either part of the substitution contains /, as your path does. So we can try a different substitution like:
sed 's_"data:image[^"]*"_/path/to/image.png_g' your_file
This will find everything that starts with "data:image and ends with the next ", which I'm guessing will work, though you didn't give much sample data to show whether the +" is really important or the " by itself is sufficient.
We can test this match with
$ echo '<img src="data:image/svg+xml;base64,+">' | sed 's_"data:image[^"]*"_/path/to/image_g'
<img src=/path/to/image>
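So you can use that sed command in your find. Note that the match includes both double quotes, so they are removed along with the data; put them back in the replacement if you want the attribute to stay quoted. I assume you're using GNU sed since you're using -i like that. If you have a different sed, you'll probably need to fix that part of your command too.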
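As a rough sketch of how the pieces might fit together, the following runs the substitution through find in a scratch directory. The directory, the file name a.txt, and the AAAA payload are made up for illustration, and the replacement keeps the surrounding quotes, which is a small departure from the command above. It assumes GNU sed for -i.

```shell
# Hypothetical scratch directory and sample file, for illustration only.
tmpdir=$(mktemp -d)
printf '%s\n' '<img src="data:image/svg+xml;base64,AAAA+">' > "$tmpdir/a.txt"

# Assumes GNU sed (-i without a backup suffix).
# "_" is the delimiter, so the slashes in the path need no escaping.
find "$tmpdir" -type f -name '*.txt' \
  -exec sed -i 's_"data:image[^"]*"_"/path/to/image.png"_g' {} +

result=$(cat "$tmpdir/a.txt")
echo "$result"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

Trying this on a copy of the files first is probably wise, since -i overwrites them in place.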

Related

use regular expressions to identify html form action tags

I am trying to sed -i to update all my html forms for url shortening. Basically I need to delete the .php from all the action="..." tags in my html forms.
But I am stuck at just identifying these instances. I am trying this testfile:
action = "yo.php"
action = 'test.php'
action='test.php'
action="upup.php"
And I am using this expression:
grep -R "action\s?=\s?(.*)php(\"|\')" testfile
And grep returns nothing at all.
I've tried a bunch of variations, and I can see that even the \s? isn't working because just this grep command also returns nothing:
grep -R "action\s?=\s?" testfile
grep -R "action\\s?=\\s?" testfile
(the latter I tried thinking maybe I had to escape the \ in \s).
Can someone tell me what's wrong with these commands?
Edit:
Fix 1 - apparently I need to escape the question mark in \s? so that it is treated as "optional" rather than as a literal question mark.

The way you're using it, grep accepts basic POSIX regex syntax. The single quote does not need to be escaped in it [1], but some of the metacharacters you use do: in particular, ?, (), and |. You can use
grep -R "action\s\?=\s\?\(.*\)php\(\"\|'\)" testfile
I recommend, however, that you use extended POSIX regex syntax by giving grep the -E flag:
grep -E -R "action\s?=\s?(.*)php(\"|')" testfile
As you can see, that makes the whole thing much more readable.
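To see the difference between the two syntaxes, here is a small sketch that counts matching lines. The temporary file and the extra method="post" line are assumptions added for the demonstration, and a plain space followed by ? stands in for \s? to keep it portable.

```shell
# Hypothetical test file: the four lines from the question plus one non-match.
tmpfile=$(mktemp)
printf '%s\n' \
  'action = "yo.php"' \
  "action = 'test.php'" \
  "action='test.php'" \
  'action="upup.php"' \
  'method="post"' > "$tmpfile"

# Basic syntax: the unescaped ? is a literal question mark, so nothing matches.
# grep exits non-zero when there are no matches, hence the "|| true".
bre_count=$(grep -c 'action ?=' "$tmpfile" || true)

# Extended syntax (-E): ? means "optional", so all four action lines match.
ere_count=$(grep -c -E "action ?= ?[\"'][^\"']*\.php[\"']" "$tmpfile")

echo "basic: $bre_count, extended: $ere_count"
rm -f "$tmpfile"
```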
Addendum: To remove the .php extension from all action attributes in a file, you could use
sed -i 's/\(action\s*=\s*["'\''][^"'\'']*\)\.php\(["'\'']\)/\1\2/g' testfile
Shell strings make this look scarier than it is; the sed code is simply
s/\(action\s*=\s*["'][^"']*\)\.php\(["']\)/\1\2/g
I amended the regex slightly so that in a line action='foo.php' somethingelse='bar.php' the right .php would be removed. I tried to make this as safe as I can, but be aware that handling HTML with sed is always hacky.
Combine this with find and its -exec filter to handle a whole directory.
[1] The double quote needs to be escaped because you use a double-quoted shell string, not because the regex requires it.
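A quick way to check the addendum's substitution without touching any file is to pipe a sample line through it. This sketch uses [[:space:]] instead of \s (an assumption made for portability across sed implementations) and a double-quoted shell string so the single quotes need no special handling.

```shell
# Sample line from the answer: only the action attribute should change.
input="action='foo.php' somethingelse='bar.php'"

# \( \) capture the part before ".php" and the closing quote;
# \1\2 puts them back together without the extension.
result=$(printf '%s\n' "$input" |
  sed "s/\(action[[:space:]]*=[[:space:]]*[\"'][^\"']*\)\.php\([\"']\)/\1\2/g")

echo "$result"
```

The second attribute keeps its .php because the pattern is anchored on the word action.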

You can use the -P option (available in GNU grep) to use Perl-compatible regexes:
$ grep -P "action\s?=\s?(.*)php(\"|\')" test
action = "yo.php"
action = 'test.php'
action='test.php'
action="upup.php"

Try this unescaped plain regex, which only matches text within quotes:
action\s?=\s?["'](.*)\.php["']
You can fiddle around with it here:
https://regex101.com/r/lN8iG0/1
On the command line this would be:
grep -P "action\s?=\s?[\"'](.*)\.php[\"']" test

Using SED to replace a domain name in a large number of HTML files

Ok, I give up. I've been trying for a couple of hours to get sed to replace an incorrectly formatted domain name in several thousand html files but I cannot seem to get the escaping of the slashes (and possibly dot/colon) correct.
Text to find:
http://www.domain.com/http
Replace with:
http
What I have tried:
sed -i 's/http:\/\/www.domain.com\/http/http/'
sed -i 's/http\\:\\/\\/www\\.domain\\.com\\/http/http/'
sed -i 's/http\:\/\/www\.domain\.com\/http/http/'
sed -i 's=http://www.domain.com/http=http='
UPDATE:
As it transpires, I was chasing ghosts. A piece of JavaScript was adding the http://www.domain.com/ to the beginning of all my img tags! Unfortunately, I now need to try and remove this from all pages. So instead of the above, I am now looking to:
Replace this:
http://www.domain.com/'+img[0]
with this:
'+img[0]
I have tried the following to no avail:
find . -name "*.html" -type f -exec sed -i 's|http://www\.domain\.com/\'+img\[0\]|\'+img\[0\]|g' {} \;
find . -name "*.html" -type f -exec sed -i 's|http://www\.domain\.com/\'+img[0]|\'+img[0]|g' {} \;
I appear to be stuck on the escaping of certain characters again. Only this time, when I try to run one of the above commands, it just takes me to a > prompt.

You can avoid a lot of the escaping by using a different delimiter. The dot . is the only character with special meaning here that needs to be escaped; everything else you can match literally. Also use the global modifier (g) with your pattern.
sed -i 's|http://www\.domain\.com/http|http|g'
Edit — You can use the following to replace the other part.
sed -i "s|http://www\.domain\.com/\('[+]img\[0\]\)|\1|g"
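Both substitutions can be tried on sample lines before running them over thousands of files. The two input lines below are invented for the test. The second command is wrapped in double quotes because a backslash cannot escape a single quote inside a single-quoted shell string; the \' in the question's commands ends the string early, which is why the shell drops to a > prompt waiting for a closing quote.

```shell
# Hypothetical sample lines, for illustration only.
line1='<a href="http://www.domain.com/http://example.org/">'
line2="src='http://www.domain.com/'+img[0]"

# First case: strip the bogus prefix in front of a real URL.
fixed1=$(printf '%s\n' "$line1" |
  sed 's|http://www\.domain\.com/http|http|g')

# Second case: double quotes around the script let it contain a single quote.
# [+] matches a literal plus; \[ and \] match literal brackets.
fixed2=$(printf '%s\n' "$line2" |
  sed "s|http://www\.domain\.com/\('[+]img\[0\]\)|\1|g")

echo "$fixed1"
echo "$fixed2"
```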

What's wrong with this shell/sed script?

I have about 150 HTML files in a given directory that I'd like to make some changes to. Some of the anchor tags have an href along the following lines: index.php?page=something. I'd like all of those to be changed to something.html. Simple regex, simple script. I can't seem to get it correct, though. Can somebody weigh in on what I'm doing wrong?
Sample html, before and after output:
<!-- Before -->
<ul>
<li>Apple</li>
<li>Dandelion</li>
<li>Elephant</li>
<li>Resonate</li>
</ul>
<!-- After -->
<ul>
<li>Apple</li>
<li>Dandelion</li>
<li>Elephant</li>
<li>Resonate</li>
</ul>
Script file:
#! /bin/bash
for f in *.html
do
sed s/\"index\.php?page=\([.]*\)\"/\1\.html/g < $f >! $f
done

It's your regex, and the fact that the shell is trying to interpret bits of your regex.
First, [.]* matches any number of literal dots (.) and nothing else. Change it to .* to match any characters.
Secondly, enclose the entire regex in single quotes ' to prevent the bash shell from interpreting any of it.
sed 's/"index\.php?page=\(.*\)"/\1\.html/g'
Also, instead of < $f >! $f you can just feed in the '-i' switch to sed to have it operate in-place:
sed -i 's/"index\.php?page=\(.*\)"/"\1\.html"/g' "$f"
(Also, as another point I think in your replacement you want double quotes around the \1.html so that the new URL is quoted within the HTML. I also quoted your $f to "$f", because if the file name contains spaces bash will complain).
EDIT: as @TimPote notes, the standard way to match something within quotes is either ".*?" (so that the .* is non-greedy) or "[^"]+". Sed doesn't support the former, so try:
sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g' "$f"
This prevents the greedy \(.*\) from running past the closing quote of the href whenever another " appears later on the same line (for example, capturing asdf">"asdf instead of just asdf).
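The effect is easy to reproduce on a line with two links. The sample line below is made up, since the question's HTML did not survive intact, and [^"]* is used as a portable stand-in for [^"]\+.

```shell
# Hypothetical line containing two links of the kind described in the question.
line='<a href="index.php?page=apple">Apple</a> <a href="index.php?page=pear">Pear</a>'

# Greedy: .* runs to the LAST double quote on the line, swallowing the first link's tail.
greedy=$(printf '%s\n' "$line" |
  sed 's/"index\.php?page=\(.*\)"/"\1.html"/g')

# Bounded: [^"]* stops at the first closing quote, so each href is handled separately.
bounded=$(printf '%s\n' "$line" |
  sed 's/"index\.php?page=\([^"]*\)"/"\1.html"/g')

echo "greedy:  $greedy"
echo "bounded: $bounded"
```

With the greedy pattern only one substitution happens and the second href ends up mangled; the bounded pattern rewrites both links.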

Your .* was too greedy. Use [^"]\+ instead. Plus your quotes were all messed up. Surround the whole thing with single quotes instead, then you can use " without escaping them.
sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g'
You can do this whole operation with a single statement using find:
find . -maxdepth 1 -type f -name '*.html' \
-exec sed -i 's/"index\.php?page=\([^"]\+\)"/"\1\.html"/g' {} \+

The following works:
sed "s/\"index\.php?page=\(.*\)\"/\"\1.html\"/g" < 1.html
I think it was mostly the square brackets. Not sure why you had them.
Oh, and the entire sed command needs to be in quotes.

Regex in sed to replace parts of an url given a specific format

I'm having some issues in doing a simple regex using sed.
I've to do some replacement in a sql file and I'm trying to use sed.
I should replace the url of some links. The links are in the following format:
www.site1.com/blog/2012/12/12
I would like to replace site1 with site2 in all links.
To find these links I've written the following regex:
(site1.com)\/blog\/\d{4}\/\d{2}\/\d{2}
And it seems to work properly.
Using sed to do the replacement, I've written the following code:
cat back.sql | sed 's:(site1.com)\/blog\/\d{4}\/\d{2}\/\d{2}:site2.com:' > fixed.sql
But it doesn't seem to be working.

sed does not support \d (not to my knowledge, at least), and supports {4} only with extended regular expressions. Note that the captured group already starts with /, so the replacement is site2.com\1 with no extra slash.
sed -r 's:site1\.com(/blog/[0-9]{4}/[0-9]{2}/[0-9]{2}):site2.com\1:'
As a basic regular expression (requires lots of escaping):
sed 's:site1\.com\(/blog/[0-9]\{4\}/[0-9]\{2\}/[0-9]\{2\}\):site2.com\1:'
P.S. You don't need to escape slashes if you use a different delimiter (here, :).
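Here is a small sketch that runs both forms on the sample URL from the question. It uses -E rather than -r for the extended form, on the assumption that -E is accepted by both GNU and BSD sed.

```shell
# Sample URL from the question.
url='www.site1.com/blog/2012/12/12'

# Extended syntax: ( ) and {n} work unescaped.
ere=$(printf '%s\n' "$url" |
  sed -E 's:site1\.com(/blog/[0-9]{4}/[0-9]{2}/[0-9]{2}):site2.com\1:')

# Basic syntax: the same pattern with \( \) and \{n\}.
bre=$(printf '%s\n' "$url" |
  sed 's:site1\.com\(/blog/[0-9]\{4\}/[0-9]\{2\}/[0-9]\{2\}\):site2.com\1:')

echo "$ere"
echo "$bre"
```

Only links that match the full date layout are rewritten; a bare site1.com elsewhere in the SQL file is left alone.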

Looks to be a straight substitution to me:
$ sed -i 's/\.site1\./.site2./g' afile.txt
... where afile.txt contains your list of sites.
If you want to output to another file, remove the -i and redirect the output using > .

put regular expression in variable

output=`grep -R -l "${images}" *`
new_output=`regex "slide[0-9]" $output`
Basically $output is a string like this:
slides/_rels/slide9.xml.rels
The number in $output will change. I want to grab "slide9" and put that in a variable. I was hoping new_output would do that but I get a command not found for using regex. Any other options? I'm using a bash shell script.

Well, regex is not a program like grep. ;)
But you can use
grep -Eo "(slide[0-9]+)"
as a simple approach. -o means: show only the matching part, -E means: extended regex (allows more sophisticated patterns).
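To get the match into a variable, the grep call can be wrapped in a command substitution. This is a minimal sketch using the sample path from the question; in the real script the input would come from $output.

```shell
# Sample path from the question.
output='slides/_rels/slide9.xml.rels'

# -E enables + as "one or more"; -o prints only the matched text.
# "slides" does not match because "slide" must be followed by a digit.
new_output=$(printf '%s\n' "$output" | grep -Eo 'slide[0-9]+')

echo "$new_output"
```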

Reading "I want to grab "slide9" and put that in a variable", I assume you want what matches your regexp to be the only thing put in $new_output? If so, then you can change that to:
new_output=`grep -E -R -l "${images}" * | sed 's/.*\(slide[0-9][0-9]*\).*/\1/'`
Note that no setting of output= is required (unless you use that for something else).
If you need $output to use elsewhere, then instead use:
output=`grep -R -l "${images}" *`
new_output=`echo "${output}" | sed 's/.*\(slide[0-9][0-9]*\).*/\1/'`
sed's s/// command is similar to Perl's s/// command and has an equivalent in most languages.
Here I'm matching zero or more characters .* before and after your slide[0-9]+ and then remembering (backreferencing) the result with \( ... \) in sed (whether the brackets need to be escaped depends on the regex mode: basic regexes need \( \), extended ones do not). Because sed uses basic regexes by default, where a bare + is a literal plus sign, "one or more digits" is written [0-9][0-9]* here. We then replace that whole match (i.e. the whole line) with \1, which expands to the first captured result, in this case your slide[0-9]+ match.
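One pitfall with this approach is that a bare + is not a quantifier in sed's default (basic) syntax. The sketch below, using a made-up path with a two-digit slide number, compares a pattern written with + against the portable [0-9][0-9]* spelling.

```shell
# Hypothetical path with a two-digit slide number.
output='slides/_rels/slide12.xml.rels'

# Basic syntax treats + literally, so the pattern does not match
# and sed prints the line unchanged.
broken=$(printf '%s\n' "$output" | sed 's/.*\(slide[0-9]+\).*/\1/')

# [0-9][0-9]* means "one or more digits" in any sed.
new_output=$(printf '%s\n' "$output" | sed 's/.*\(slide[0-9][0-9]*\).*/\1/')

echo "with +:          $broken"
echo "with [0-9][0-9]*: $new_output"
```

The leading .* is greedy, but it cannot eat into the digits because the literal text slide has to match first.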

In these situations, using awk is better:
output="`grep -R -l "main" codes`"
echo $output
tout=`echo $output | awk -F. '{for(i=1;i<=NF;i++){if(index($i,"/")>0){n=split($i,ar,"/");print ar[n];}}}'`
echo $tout
This prints the filename without the extension. If you want to grab only slide9, then use the solutions provided by others.
Sample output :
A@A-laptop ~ $ bash try.sh
codes/quicksort_iterative.cpp codes/graham_scan.cpp codes/a.out
quicksort_iterative graham_scan a
