Preventing XSS in Node.js / server-side JavaScript

Any idea how one would go about preventing XSS attacks on a Node.js app? Any libs out there that handle removing JavaScript in hrefs, onclick attributes, etc., from POSTed data?
I don't want to have to write a regex for all that :)
Any suggestions?

I've created a module that bundles the Caja HTML Sanitizer
npm install sanitizer
http://github.com/theSmaw/Caja-HTML-Sanitizer
https://www.npmjs.com/package/sanitizer
Any feedback appreciated.

One of the answers to Sanitize/Rewrite HTML on the Client Side suggests borrowing the whitelist-based HTML sanitizer in JS from Google Caja which, as far as I can tell from a quick scroll-through, implements an HTML SAX parser without relying on the browser's DOM.
Update: Also, keep in mind that the Caja sanitizer has apparently been given a full, professional security review while regexes are known for being very easy to typo in security-compromising ways.
Update 2017-09-24: There is also now DOMPurify (a usage sketch follows this list). I haven't used it yet, but it looks like it meets or exceeds every point I look for:
Relies on functionality provided by the runtime environment wherever possible. (Important both for performance and to maximize security by relying on well-tested, mature implementations as much as possible.)
Relies on either a browser's DOM or jsdom for Node.JS.
Default configuration designed to strip as little as possible while still guaranteeing removal of javascript.
Supports HTML, MathML, and SVG
Falls back to Microsoft's proprietary, un-configurable toStaticHTML under IE8 and IE9.
Highly configurable, making it suitable for enforcing limitations on an input which can contain arbitrary HTML, such as a WYSIWYG or Markdown comment field. (In fact, it's the top of the pile here)
Supports the usual tag/attribute whitelisting/blacklisting and URL regex whitelisting
Has special options to sanitize further for certain common types of HTML template metacharacters.
They're serious about compatibility and reliability
Automated tests running on 16 different browsers as well as three different major versions of Node.JS.
To ensure developers and CI hosts are all on the same page, lock files are published.
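As promised above, here is a minimal sketch of using DOMPurify under Node.JS with jsdom, following the project's documented API; the input string is just an example:

const createDOMPurify = require('dompurify');
const { JSDOM } = require('jsdom');

// DOMPurify needs a DOM implementation; on Node.JS, jsdom provides one
const window = new JSDOM('').window;
const DOMPurify = createDOMPurify(window);

const dirty = '<img src=x onerror="alert(1)"><p>hello</p>';
console.log(DOMPurify.sanitize(dirty)); // the onerror handler is stripped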

All usual techniques apply to node.js output as well, which means:
Blacklists will not work.
You're not supposed to filter input in order to protect HTML output; either it will not work, or it will work only by needlessly malforming the data.
You're supposed to HTML-escape text in HTML output.
I'm not sure if Node.js comes with a built-in for this, but something like this should do the job:
function htmlEscape(text) {
  return text.replace(/&/g, '&amp;')
             .replace(/</g, '&lt;') // it's not necessary to escape >
             .replace(/"/g, '&quot;')
             .replace(/'/g, '&#39;');
}
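Purely as an illustration, assuming the htmlEscape helper above is in scope, escaping at output time in a hand-built response might look like this (the route, port, and query parameter are made up):

const http = require('http');

http.createServer(function (req, res) {
  // 'name' is a hypothetical user-supplied query parameter
  const name = new URL(req.url, 'http://localhost').searchParams.get('name') || '';
  res.setHeader('Content-Type', 'text/html; charset=utf-8');
  res.end('<p>Hello, ' + htmlEscape(name) + '</p>'); // escaped at output time
}).listen(3000);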

I recently discovered node-validator by chriso.
Example
get('/', function (req, res) {
  // Sanitize user input
  req.sanitize('textarea').xss(); // No longer supported
  req.sanitize('foo').toBoolean();
});
XSS Function Deprecation
The XSS function is no longer available in this library.
https://github.com/chriso/validator.js#deprecations

You can also look at ESAPI. There is a javascript version of the library. It's pretty sturdy.

In newer versions of the validator module you can use the following to help prevent XSS attacks:
var validator = require('validator');
var escaped_string = validator.escape(someString);

Try out the npm module strip-js. It performs the following actions:
Sanitizes HTML
Removes script tags
Removes attributes such as "onclick", "onerror", etc. which contain JavaScript code
Removes "href" attributes which contain JavaScript code
https://www.npmjs.com/package/strip-js

Update 2021-04-16: xss is a module used to filter input from users to prevent XSS attacks.
Sanitize untrusted HTML (to prevent XSS) with a configuration specified by a Whitelist.
Visit https://www.npmjs.com/package/xss
Project Homepage: http://jsxss.com
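A minimal sketch of the module's documented usage (the input string is just an example):

const xss = require('xss');

const html = xss('<script>alert("xss")</script><b>bold</b>');
console.log(html); // the script tag is escaped; the whitelisted <b> survives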

You should try the npm library insane.
https://github.com/bevacqua/insane
I've used it in production and it works well. It is very small (around 3 kB gzipped).
Sanitizes HTML
Removes all attributes or tags that evaluate JavaScript
You can allow attributes or tags that you don't want sanitized
The documentation is very easy to read and understand.
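For example, a minimal sketch based on the project's README; the options shown are assumptions about what you might allow:

const insane = require('insane');

// allow only <a> tags, and only their href attributes
const clean = insane('<div>foo <a href="/x" onclick="evil()">bar</a></div>', {
  allowedTags: ['a'],
  allowedAttributes: { a: ['href'] }
});
console.log(clean); // foo <a href="/x">bar</a>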

Related

How can one parse HTML/XML and extract information from it?
Native XML Extensions
I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.
DOM
The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.
DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.
It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API then.
How to use the DOM extension has been covered extensively on StackOverflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.
A basic usage example and a general conceptual overview are available in other answers.
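For completeness, a minimal sketch of that usage, with placeholder markup:

$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings for real-world (broken) HTML
$dom->loadHTML('<div><a href="/foo">Foo</a><a href="/bar">Bar</a></div>');

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a/@href') as $href) {
    echo $href->nodeValue, PHP_EOL; // "/foo", then "/bar"
}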
XMLReader
The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.
XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM where you can explicitly tell it to use libxml's HTML Parser Module.
A basic usage example is available in another answer.
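A minimal sketch of the pull-parsing style (the file and element names are hypothetical):

$reader = new XMLReader();
$reader->open('feed.xml'); // hypothetical file
while ($reader->read()) {
    // stop at each <title> element and read its text content
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        echo $reader->readString(), PHP_EOL;
    }
}
$reader->close();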
XML Parser
This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.
The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
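A minimal sketch of the push style, with handlers invoked as parse events occur:

$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) { echo "open:  $name\n"; },
    function ($parser, $name)         { echo "close: $name\n"; }
);
xml_parse($parser, '<root><item/></root>', true); // true = final chunk
xml_parser_free($parser);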
SimpleXml
The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.
SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXml because it will choke.
A basic usage example is available, and there are lots of additional examples in the PHP Manual.
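A minimal sketch, assuming the input really is well-formed:

$xml = simplexml_load_string('<ul><li>one</li><li>two</li></ul>');
foreach ($xml->li as $li) {
    echo (string) $li, PHP_EOL; // "one", then "two"
}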
3rd Party Libraries (libxml based)
If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.
FluentDom
FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.
HtmlPageDom
Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.
phpQuery
phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.
The library is written in PHP5 and provides additional Command Line Interface (CLI).
This is described as "abandonware and buggy: use at your own risk" but does appear to be minimally maintained.
laminas-dom
The Laminas\Dom component (formerly Zend_DOM) provides tools for working with DOM documents and structures. Currently, we offer Laminas\Dom\Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.
This package is considered feature-complete, and is now in security-only maintenance mode.
fDOMDocument
fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.
sabre/xml
sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.
FluidXML
FluidXML is a PHP library for manipulating XML with a concise and fluent API.
It leverages XPath and the fluent programming pattern to be fun and effective.
3rd-Party (not libxml-based)
The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.
PHP Simple HTML DOM Parser
An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
Requires PHP 5+.
Supports invalid HTML.
Find tags on an HTML page with selectors just like jQuery.
Extract contents from HTML in a single line.
I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.
PHP Html Parser
PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assist in the development of tools which require a quick, easy way to scrape html, whether it's valid or not! This project was originally supported by sunra/php-simple-html-dom-parser, but that support seems to have stopped, so this project is my adaptation of his previous work.
Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.
HTML 5
You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser. Note that these are written in PHP, so suffer from slower performance and increased memory usage compared to a compiled extension in a lower-level language.
HTML5DomDocument
HTML5DOMDocument extends the native DOMDocument library. It fixes some bugs and adds some new functionality.
Preserves html entities (DOMDocument does not)
Preserves void tags (DOMDocument does not)
Allows inserting HTML code that moves the correct parts to their proper places (head elements are inserted in the head, body elements in the body)
Allows querying the DOM with CSS selectors (currently available: *, tagname, tagname#id, #id, tagname.classname, .classname, tagname.classname.classname2, .classname.classname2, tagname[attribute-selector], [attribute-selector], div, p, div p, div > p, div + p, and p ~ ul.)
Adds support for element->classList.
Adds support for element->innerHTML.
Adds support for element->outerHTML.
HTML5
HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP. It is stable and used in many production websites, and has well over five million downloads.
HTML5 provides the following features.
An HTML5 serializer
Support for PHP namespaces
Composer support
Event-based (SAX-like) parser
A DOM tree builder
Interoperability with QueryPath
Runs on PHP 5.3.0 or newer
Regular Expressions
Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.
Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the regex fail when it's not properly written. You should know what you are doing before using regex on HTML.
HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules anew for each regex you write. Regexes are fine in some cases, but it really depends on your use-case.
You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job on this.
Also see Parsing Html The Cthulhu Way
Books
If you want to spend some money, have a look at
PHP Architect's Guide to Webscraping with PHP
I am not affiliated with PHP Architect or the authors.
Try Simple HTML DOM Parser.
An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
Requires PHP 5+.
Supports invalid HTML.
Find tags on an HTML page with selectors just like jQuery.
Extract contents from HTML in a single line.
Download
Note: as the name suggests, it can be useful for simple tasks. It uses regular expressions instead of an HTML parser, so will be considerably slower for more complex tasks. The bulk of its codebase was written in 2008, with only small improvements made since then. It does not follow modern PHP coding standards and would be challenging to incorporate into a modern PSR-compliant project.
Examples:
How to get HTML elements:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';
How to modify HTML elements:
// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';
echo $html;
Extract content from HTML:
// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;
Scraping Slashdot:
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');
// Find all article blocks
foreach($html->find('div.article') as $article) {
    $item['title'] = $article->find('div.title', 0)->plaintext;
    $item['intro'] = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}
print_r($articles);
Just use DOMDocument->loadHTML() and be done with it. libxml's HTML parsing algorithm is quite good and fast, and contrary to popular belief, does not choke on malformed HTML.
Why you shouldn't and when you should use regular expressions?
First off, a common misconception: regexes are not for "parsing" HTML. Regexes can however "extract" data. Extracting is what they're made for. The major drawback of regex HTML extraction over proper SGML toolkits or baseline XML parsers is their syntactic effort and varying reliability.
Consider that making a somewhat dependable HTML extraction regex:
<a\s+class="?playbutton\d?[^>]+id="(\d+)".+? <a\s+class="[\w\s]*title
[\w\s]*"[^>]+href="(http://[^">]+)"[^>]*>([^<>]+)</a>.+?
is way less readable than a simple phpQuery or QueryPath equivalent:
$div->find(".stationcool a")->attr("title");
There are however specific use cases where they can help.
Many DOM traversal frontends don't reveal HTML comments <!--, which however are sometimes the more useful anchors for extraction. In particular pseudo-HTML variations <$var> or SGML residues are easy to tame with regexps.
Oftentimes regular expressions can save post-processing. However HTML entities often require manual caretaking.
And lastly, for extremely simple tasks like extracting <img src= urls, they are in fact a viable tool. The speed advantage over SGML/XML parsers mostly comes into play for these very basic extraction procedures.
It's sometimes even advisable to pre-extract a snippet of HTML using regular expressions /<!--CONTENT-->(.+?)<!--END-->/ and process the remainder using the simpler HTML parser frontends.
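A sketch of that hybrid approach, reusing the marker comments from the example above (the /s modifier is added here so the dot can span newlines):

// Pre-extract the commented region with a regex, then parse just that snippet properly
if (preg_match('/<!--CONTENT-->(.+?)<!--END-->/s', $html, $m)) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML($m[1]); // hand only the pre-extracted snippet to the parser
}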
Note: I actually have this app, where I employ XML parsing and regular expressions alternatively. Just last week the PyQuery parsing broke, and the regex still worked. Yes weird, and I can't explain it myself. But so it happened.
So please don't vote real-world considerations down, just because it doesn't match the regex=evil meme. But let's also not vote this up too much. It's just a sidenote for this topic.
Note, this answer recommends libraries that have now been abandoned for 10+ years.
phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're two of the easiest approaches to properly parse HTML in PHP.
Examples for QueryPath
Basically you first create a queryable DOM tree from an HTML string:
$qp = qp("<html><body><h1>title</h1>..."); // or give filename or URL
The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:
$qp->find("div.classname")->children()->...;
foreach ($qp->find("p img") as $img) {
print qp($img)->attr("src");
}
Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use XPath statements, which sometimes are faster. Also typical jQuery methods like ->children() and ->text() and particularly ->attr() simplify extracting the right HTML snippets. (And already have their SGML entities decoded.)
$qp->xpath("//div/p[1]"); // get first paragraph in a div
QueryPath also allows injecting new tags into the stream (->append), and later output and prettify an updated document (->writeHTML). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and even extract data from HTML microformats (XFN, vCard).
$qp->find("a[target=_blank]")->toggleClass("usability-blunder");
phpQuery or QueryPath?
Generally QueryPath is better suited for manipulation of documents. While phpQuery also implements some pseudo AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because of fewer overall features).
For further information on the differences see this comparison on the wayback machine from tagbyte.org. (Original source went missing, so here's an internet archive link. Yes, you can still locate missing pages, people.)
Advantages
Simplicity and Reliability
Simple to use alternatives ->find("a img, a object, div a")
Proper data unescaping (in comparison to regular expression grepping)
Simple HTML DOM is a great open-source parser:
simplehtmldom.sourceforge
It treats DOM elements in an object-oriented way, and the new iteration has a lot of coverage for non-compliant code. There are also some great functions like you'd see in JavaScript, such as the "find" function, which will return all instances of elements of that tag name.
I've used this in a number of tools, testing it on many different types of web pages, and I think it works great.
One general approach I haven't seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it.
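For instance, a sketch using PHP's tidy extension ($brokenHtml is a placeholder; the option key is per the Tidy documentation):

$tidy = new tidy();
$tidy->parseString($brokenHtml, array('output-xhtml' => true), 'utf8');
$tidy->cleanRepair();
$xhtml = tidy_get_output($tidy);          // now well-formed XHTML
$xml = simplexml_load_string($xhtml);     // so any XML toolchain will accept it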
But to your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ -- it's a modified version of the Readability algorithm, which is designed to extract just the textual content (not headers and footers) from a page.
For 1a and 2: I would vote for the new Symfony component class DomCrawler.
This class allows queries similar to CSS Selectors. Take a look at this presentation for real-world examples: news-of-the-symfony2-world.
The component is designed to work standalone and can be used without Symfony.
The only drawback is that it will only work with PHP 5.3 or newer.
This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML Dom Parser.
We have created quite a few crawlers for our needs before. At the end of the day, it is usually simple regular expressions that do the thing best. While the libraries listed above are good for the reason they were created, if you know what you are looking for, regular expressions are a safer way to go, as you can also handle invalid HTML/XHTML structures that would fail to load in most of the parsers.
I recommend PHP Simple HTML DOM Parser.
It really has nice features, like:
foreach($html->find('img') as $element)
echo $element->src . '<br>';
This sounds like a good task description of W3C XPath technology. It's easy to express queries like "return all href attributes in img tags that are nested in <foo><bar><baz> elements." Not being a PHP buff, I can't tell you in what form XPath may be available. If you can call an external program to process the HTML file you should be able to use a command line version of XPath.
For a quick intro, see http://en.wikipedia.org/wiki/XPath.
Third party alternatives to SimpleHtmlDom that use DOM instead of String Parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.
Yes, you can use simple_html_dom for the purpose. However, I have worked quite a lot with simple_html_dom, particularly for web scraping, and have found it to be too fragile. It does the basic job, but I won't recommend it anyway.
I have never used curl for the purpose but what I have learned is that curl can do the job much more efficiently and is much more solid.
Kindly check out this link: scraping-websites-with-curl
Advanced Html Dom is a simple HTML DOM replacement that offers the same interface, but it's DOM-based which means none of the associated memory issues occur.
It also has full CSS support, including jQuery extensions.
QueryPath is good, but be careful of "tracking state": if you don't realise what it means, you can waste a lot of debugging time trying to find out what happened and why the code doesn't work.
What it means is that each call on the result set modifies the result set in the object. It's not chainable like jQuery, where each link is a new set; you have a single set, the results from your query, and each function call modifies that single set.
In order to get jQuery-like behaviour, you need to branch before you do a filter/modify-like operation; that way it mirrors what happens in jQuery much more closely.
$results = qp("div p");
$forename = $results->find("input[name='forename']");
$results now contains the result set for input[name='forename'], NOT the original query "div p". This tripped me up a lot; what I found was that QueryPath tracks the filters and finds and everything else that modifies your results, and stores them in the object. You need to do this instead:
$forename = $results->branch()->find("input[name='forename']");
Then $results won't be modified and you can reuse the result set again and again. Perhaps somebody with more knowledge can clear this up a bit, but it's basically like this from what I've found.
For HTML5, html5 lib has been abandoned for years now. The only HTML5 library I can find with a recent update and maintenance records is html5-php which was just brought to beta 1.0 a little over a week ago.
I created a library named PHPPowertools/DOM-Query, which allows you to crawl HTML5 and XML documents just like you do with jQuery.
Under the hood, it uses symfony/DomCrawler for conversion of CSS selectors to XPath selectors. It always uses the same DomDocument, even when passing one object to another, to ensure decent performance.
Example use :
namespace PowerTools;
// Get file content
$htmlcode = file_get_contents('https://github.com');
// Define your DOMCrawler based on file string
$H = new DOM_Query($htmlcode);
// Define your DOMCrawler based on an existing DOM_Query instance
$H = new DOM_Query($H->select('body'));
// Passing a string (CSS selector)
$s = $H->select('div.foo');
// Passing an element object (DOM Element)
$s = $H->select($documentBody);
// Passing a DOM Query object
$s = $H->select( $H->select('p + p'));
// Select the body tag
$body = $H->select('body');
// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');
// Nest your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');
// Use a lambda function to set the text of all site blocks
$siteblocks->text(function( $i, $val) {
return $i . " - " . $val->attr('class');
});
// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');
// Use a descendant selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');
// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));
// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function( $i, $val) {
return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});
// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();
// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');
// Wrap the site's footer within two new wrapper elements
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');
[...]
Supported methods :
[x] $ (1)
[x] $.parseHTML
[x] $.parseXML
[x] $.parseJSON
[x] $selection.add
[x] $selection.addClass
[x] $selection.after
[x] $selection.append
[x] $selection.attr
[x] $selection.before
[x] $selection.children
[x] $selection.closest
[x] $selection.contents
[x] $selection.detach
[x] $selection.each
[x] $selection.eq
[x] $selection.empty (2)
[x] $selection.find
[x] $selection.first
[x] $selection.get
[x] $selection.insertAfter
[x] $selection.insertBefore
[x] $selection.last
[x] $selection.parent
[x] $selection.parents
[x] $selection.remove
[x] $selection.removeAttr
[x] $selection.removeClass
[x] $selection.text
[x] $selection.wrap
(1) Renamed 'select', for obvious reasons
(2) Renamed 'void', since 'empty' is a reserved word in PHP
NOTE :
The library also includes its own zero-configuration autoloader for PSR-0 compatible libraries. The example included should work out of the box without any additional configuration. Alternatively, you can use it with composer.
You could try using something like HTML Tidy to clean up any "broken" HTML and convert the HTML to XHTML, which you can then parse with an XML parser.
I have written a general purpose XML parser that can easily handle GB files. It's based on XMLReader and it's very easy to use:
$source = new XmlExtractor("path/to/tag", "/path/to/file.xml");
foreach ($source as $tag) {
echo $tag->field1;
echo $tag->field2->subfield1;
}
Here's the github repo: XmlExtractor
Another option you can try is QueryPath. It's inspired by jQuery, but on the server in PHP and used in Drupal.
XML_HTMLSax is rather stable, even if it's not maintained any more. Another option could be to pipe your HTML through HTML Tidy and then parse it with standard XML tools.
There are many ways to process HTML/XML DOM of which most have already been mentioned. Hence, I won't make any attempt to list those myself.
I merely want to add that I personally prefer using the DOM extension and why :
it makes optimal use of the performance advantage of the underlying C code
it's OO PHP (and allows me to subclass it)
it's rather low level (which allows me to use it as a non-bloated foundation for more advanced behavior)
it provides access to every part of the DOM (unlike eg. SimpleXml, which ignores some of the lesser known XML features)
it has a syntax used for DOM crawling that's similar to the syntax used in native Javascript.
And while I miss the ability to use CSS selectors for DOMDocument, there is a rather simple and convenient way to add this feature: subclassing the DOMDocument and adding JS-like querySelectorAll and querySelector methods to your subclass.
For parsing the selectors, I recommend using the very minimalistic CssSelector component from the Symfony framework. This component just translates CSS selectors to XPath selectors, which can then be fed into a DOMXpath to retrieve the corresponding Nodelist.
You can then use this (still very low level) subclass as a foundation for more high level classes, intended to eg. parse very specific types of XML or add more jQuery-like behavior.
The code below comes straight out my DOM-Query library and uses the technique I described.
For HTML parsing :
namespace PowerTools;
use \Symfony\Component\CssSelector\CssSelector as CssSelector;
class DOM_Document extends \DOMDocument {
    public function __construct($data = false, $doctype = 'html', $encoding = 'UTF-8', $version = '1.0') {
        parent::__construct($version, $encoding);
        if ($doctype && $doctype === 'html') {
            #$this->loadHTML($data);
        } else {
            #$this->loadXML($data);
        }
    }

    public function querySelectorAll($selector, $contextnode = null) {
        if (isset($this->doctype->name) && $this->doctype->name == 'html') {
            CssSelector::enableHtmlExtension();
        } else {
            CssSelector::disableHtmlExtension();
        }
        $xpath = new \DOMXpath($this);
        return $xpath->query(CssSelector::toXPath($selector, 'descendant::'), $contextnode);
    }

    [...]

    public function loadHTMLFile($filename, $options = 0) {
        $this->loadHTML(file_get_contents($filename), $options);
    }

    public function loadHTML($source, $options = 0) {
        if ($source && $source != '') {
            $data = trim($source);
            $html5 = new HTML5(array('targetDocument' => $this, 'disableHtmlNsInDom' => true));
            $data_start = mb_substr($data, 0, 10);
            if (strpos($data_start, '<!DOCTYPE ') === 0 || strpos($data_start, '<html>') === 0) {
                $html5->loadHTML($data);
            } else {
                #$this->loadHTML('<!DOCTYPE html><html><head><meta charset="' . $encoding . '" /></head><body></body></html>');
                $t = $html5->loadHTMLFragment($data);
                $docbody = $this->getElementsByTagName('body')->item(0);
                while ($t->hasChildNodes()) {
                    $docbody->appendChild($t->firstChild);
                }
            }
        }
    }

    [...]
}
See also Parsing XML documents with CSS selectors by Symfony's creator Fabien Potencier on his decision to create the CssSelector component for Symfony and how to use it.
The Symfony framework has components that can parse HTML, and you can use CSS selectors to select DOM elements instead of XPath.
With FluidXML you can query and iterate XML using XPath and CSS Selectors.
$doc = fluidxml('<html>...</html>');
$title = $doc->query('//head/title')[0]->nodeValue;
$doc->query('//body/p', 'div.active', '#bgId')
    ->each(function($i, $node) {
        // $node is a DOMNode.
        $tag   = $node->nodeName;
        $text  = $node->nodeValue;
        $class = $node->getAttribute('class');
    });
https://github.com/servo-php/fluidxml
JSON and array from XML in three lines:
$xml = simplexml_load_string($xml_string);
$json = json_encode($xml);
$array = json_decode($json,TRUE);
Ta da!
There are several reasons not to parse HTML by regular expression. But if you have total control of what HTML will be generated, then you can get by with a simple regular expression.
Below is a function that parses HTML with regular expressions. Note that this function is very sensitive and demands that the HTML obey certain rules, but it works very well in many scenarios. If you want a simple parser and don't want to install libraries, give this a shot:
function array_combine_($keys, $values) {
    $result = array();
    foreach ($keys as $i => $k) {
        $result[$k][] = $values[$i];
    }
    // create_function() is deprecated/removed in modern PHP; a closure does the same job
    array_walk($result, function (&$v) {
        $v = (count($v) == 1) ? array_pop($v) : $v;
    });
    return $result;
}

function extract_data($str) {
    return (is_array($str))
        ? array_map('extract_data', $str)
        : ((!preg_match_all('#<([A-Za-z0-9_]*)[^>]*>(.*?)</\1>#s', $str, $matches))
            ? $str
            : array_map('extract_data', array_combine_($matches[1], $matches[2])));
}

print_r(extract_data(file_get_contents("http://www.google.com/")));
print_r(extract_data(file_get_contents("http://www.google.com/")));
I've created a library called HTML5DOMDocument that is freely available at https://github.com/ivopetkov/html5-dom-document-php
It supports query selectors too that I think will be extremely helpful in your case. Here is some example code:
$dom = new IvoPetkov\HTML5DOMDocument();
$dom->loadHTML('<!DOCTYPE html><html><body><h1>Hello</h1><div class="content">This is some text</div></body></html>');
echo $dom->querySelector('h1')->innerHTML;
The best method for parsing XML:
$url = 'http://www.example.com/rss.xml';
$rss = simplexml_load_file($url); // load_file fetches and parses the URL; load_string would not
$i = 0;
foreach ($rss->channel->item as $feedItem) {
    $i++;
    echo $title = $feedItem->title;
    echo '<br>';
    echo $link = $feedItem->link;
    echo '<br>';
    if ($feedItem->description != '') {
        $des = $feedItem->description;
    } else {
        $des = '';
    }
    echo $des;
    echo '<br>';
    if ($i > 5) break;
}
There are many ways:
In General:
Native XML Extensions: they come bundled with PHP, are usually faster than all the 3rd party libs, and give you all the control you need over the markup.
DOM: DOM is capable of parsing and modifying real-world (broken) HTML and it can do XPath queries. It is based on libxml.
XML Reader: XMLReader, like DOM, is based on libxml. The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way
XML Parser: This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust. It implements a SAX style XML push parser.
Simple XML: The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.
3rd Party Libraries [ libxml based ]:
FluentDom - Repo: FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. It can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.
HtmlPageDom: a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.
ZendDOM: Zend_Dom provides tools for working with DOM documents and structures. Currently, they offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.
QueryPath: QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.
fDOM Document: fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.
Sabre/XML: sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.
FluidXML: FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.
3rd Party Libraries [ Not libxml based ]:
PHP Simple HTML DOM Parser: An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way. It requires PHP 5+ and supports invalid HTML.
It extracts contents from HTML in a single line. However, the codebase is horrible and it is very slow.
PHP Html Parser: PHPHtmlParser is a simple, flexible, HTML parser that allows you to select tags using any CSS selector, like jQuery. The goal is to assist in the development of tools that require a quick, easy way to scrape HTML, whether it's valid or not. It is slow and takes too much CPU power.
Ganon (recommended): A universal tokenizer and HTML/XML/RSS DOM parser. It can manipulate elements and their attributes, supports invalid HTML and UTF-8, and can perform advanced CSS3-like queries on elements (like jQuery, with namespaces supported). It also includes an HTML beautifier (like HTML Tidy), can minify CSS and JavaScript, and can sort attributes, change character case, correct indentation, and more.
It is extensible, with operations separated into smaller functions for easy overriding.
Fast and easy to use.
Web Services:
If you don't feel like programming PHP, you can also use Web services. ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.
I have shared all the resources, you can choose according to your taste, usefulness, etc.

extract youtube iframe src attribute using preg_match_all [duplicate]

How can one parse HTML/XML and extract information from it?
Native XML Extensions
I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.
DOM
The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.
DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.
It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API then.
How to use the DOM extension has been covered extensively on StackOverflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching/browsing Stack Overflow.
A basic usage example and a general conceptual overview are available in other answers.
XMLReader
The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.
XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM where you can explicitly tell it to use libxml's HTML Parser Module.
A basic usage example is available in another answer.
XML Parser
This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.
The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
SimpleXml
The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.
SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXml because it will choke.
A basic usage example is available, and there are lots of additional examples in the PHP Manual.
3rd Party Libraries (libxml based)
If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.
FluentDom
FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.
HtmlPageDom
Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML
documents using DOM. It requires DomCrawler from Symfony2
components for traversing
the DOM tree and extends it by adding methods for manipulating the
DOM tree of HTML documents.
phpQuery
phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.
The library is written in PHP5 and provides additional Command Line Interface (CLI).
This is described as "abandonware and buggy: use at your own risk" but does appear to be minimally maintained.
laminas-dom
The Laminas\Dom component (formerly Zend_DOM) provides tools for working with DOM documents and structures. Currently, we offer Laminas\Dom\Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.
This package is considered feature-complete, and is now in security-only maintenance mode.
fDOMDocument
fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.
sabre/xml
sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.
FluidXML
FluidXML is a PHP library for manipulating XML with a concise and fluent API.
It leverages XPath and the fluent programming pattern to be fun and effective.
3rd-Party (not libxml-based)
The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them listed below
PHP Simple HTML DOM Parser
An HTML DOM parser written in PHP5+ lets you manipulate HTML in a very easy way!
Require PHP 5+.
Supports invalid HTML.
Find tags on an HTML page with selectors just like jQuery.
Extract contents from HTML in a single line.
I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.
PHP Html Parser
PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assiste in the development of tools which require a quick, easy way to scrape html, whether it's valid or not! This project was original supported by sunra/php-simple-html-dom-parser but the support seems to have stopped so this project is my adaptation of his previous work.
Again, I would not recommend this parser. It is rather slow with high CPU usage. There is also no function to clear memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.
HTML 5
You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser. Note that these are written in PHP, so suffer from slower performance and increased memory usage compared to a compiled extension in a lower-level language.
HTML5DomDocument
HTML5DOMDocument extends the native DOMDocument library. It fixes some bugs and adds some new functionality.
Preserves html entities (DOMDocument does not)
Preserves void tags (DOMDocument does not)
Allows inserting HTML code that moves the correct parts to their proper places (head elements are inserted in the head, body elements in the body)
Allows querying the DOM with CSS selectors (currently available: *, tagname, tagname#id, #id, tagname.classname, .classname, tagname.classname.classname2, .classname.classname2, tagname[attribute-selector], [attribute-selector], div, p, div p, div > p, div + p, and p ~ ul.)
Adds support for element->classList.
Adds support for element->innerHTML.
Adds support for element->outerHTML.
HTML5
HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP. It is stable and used in many production websites, and has well over five million downloads.
HTML5 provides the following features.
An HTML5 serializer
Support for PHP namespaces
Composer support
Event-based (SAX-like) parser
A DOM tree builder
Interoperability with QueryPath
Runs on PHP 5.3.0 or newer
Regular Expressions
Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.
Most of the snippets you will find on the web to match markup are brittle. In most cases they are only working for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding, or changing attributes in a tag, can make the RegEx fails when it's not properly written. You should know what you are doing before using RegEx on HTML.
HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught for each new RegEx you write. RegEx are fine in some cases, but it really depends on your use-case.
You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job on this.
Also see Parsing Html The Cthulhu Way
Books
If you want to spend some money, have a look at
PHP Architect's Guide to Webscraping with PHP
I am not affiliated with PHP Architect or the authors.
Try Simple HTML DOM Parser.
A HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
Require PHP 5+.
Supports invalid HTML.
Find tags on an HTML page with selectors just like jQuery.
Extract contents from HTML in a single line.
Download
Note: as the name suggests, it can be useful for simple tasks. It uses regular expressions instead of an HTML parser, so will be considerably slower for more complex tasks. The bulk of its codebase was written in 2008, with only small improvements made since then. It does not follow modern PHP coding standards and would be challenging to incorporate into a modern PSR-compliant project.
Examples:
How to get HTML elements:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
How to modify HTML elements:
// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';
echo $html;
Extract content from HTML:
// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;
Scraping Slashdot:
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');
// Find all article blocks
foreach($html->find('div.article') as $article) {
$item['title'] = $article->find('div.title', 0)->plaintext;
$item['intro'] = $article->find('div.intro', 0)->plaintext;
$item['details'] = $article->find('div.details', 0)->plaintext;
$articles[] = $item;
}
print_r($articles);
Just use DOMDocument->loadHTML() and be done with it. libxml's HTML parsing algorithm is quite good and fast, and contrary to popular belief, does not choke on malformed HTML.
Why you shouldn't and when you should use regular expressions?
First off, a common misnomer: Regexps are not for "parsing" HTML. Regexes can however "extract" data. Extracting is what they're made for. The major drawback of regex HTML extraction over proper SGML toolkits or baseline XML parsers are their syntactic effort and varying reliability.
Consider that making a somewhat dependable HTML extraction regex:
<a\s+class="?playbutton\d?[^>]+id="(\d+)".+? <a\s+class="[\w\s]*title
[\w\s]*"[^>]+href="(http://[^">]+)"[^>]*>([^<>]+)</a>.+?
is way less readable than a simple phpQuery or QueryPath equivalent:
$div->find(".stationcool a")->attr("title");
There are however specific use cases where they can help.
Many DOM traversal frontends don't reveal HTML comments <!--, which however are sometimes the more useful anchors for extraction. In particular pseudo-HTML variations <$var> or SGML residues are easy to tame with regexps.
Oftentimes regular expressions can save post-processing. However HTML entities often require manual caretaking.
And lastly, for extremely simple tasks like extracting <img src= urls, they are in fact a probable tool. The speed advantage over SGML/XML parsers mostly just comes to play for these very basic extraction procedures.
It's sometimes even advisable to pre-extract a snippet of HTML using regular expressions /<!--CONTENT-->(.+?)<!--END-->/ and process the remainder using the simpler HTML parser frontends.
Note: I actually have this app, where I employ XML parsing and regular expressions alternatively. Just last week the PyQuery parsing broke, and the regex still worked. Yes weird, and I can't explain it myself. But so it happened.
So please don't vote real-world considerations down, just because it doesn't match the regex=evil meme. But let's also not vote this up too much. It's just a sidenote for this topic.
Note, this answer recommends libraries that have now been abandoned for 10+ years.
phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're two of the easiest approaches to properly parse HTML in PHP.
Examples for QueryPath
Basically you first create a queryable DOM tree from an HTML string:
$qp = qp("<html><body><h1>title</h1>..."); // or give filename or URL
The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:
$qp->find("div.classname")->children()->...;
foreach ($qp->find("p img") as $img) {
print qp($img)->attr("src");
}
Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use XPath statements, which sometimes are faster. Also typical jQuery methods like ->children() and ->text() and particularly ->attr() simplify extracting the right HTML snippets. (And already have their SGML entities decoded.)
$qp->xpath("//div/p[1]"); // get first paragraph in a div
QueryPath also allows injecting new tags into the stream (->append), and later output and prettify an updated document (->writeHTML). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and even extract data from HTML microformats (XFN, vCard).
$qp->find("a[target=_blank]")->toggleClass("usability-blunder");
.
phpQuery or QueryPath?
Generally QueryPath is better suited for manipulation of documents. While phpQuery also implements some pseudo AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because of fewer overall features).
For further information on the differences see this comparison on the wayback machine from tagbyte.org. (Original source went missing, so here's an internet archive link. Yes, you can still locate missing pages, people.)
Advantages
Simplicity and Reliability
Simple to use alternatives ->find("a img, a object, div a")
Proper data unescaping (in comparison to regular expression grepping)
Simple HTML DOM is a great open-source parser:
simplehtmldom.sourceforge
It treats DOM elements in an object-oriented way, and the new iteration has a lot of coverage for non-compliant code. There are also some great functions like you'd see in JavaScript, such as the "find" function, which will return all instances of elements of that tag name.
I've used this in a number of tools, testing it on many different types of web pages, and I think it works great.
One general approach I haven't seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it.
But to your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ -- it's a modified version of the Readability algorithm, which is designed to extract just the textual content (not headers and footers) from a page.
For 1a and 2: I would vote for the new Symfony Componet class DOMCrawler ( DomCrawler ).
This class allows queries similar to CSS Selectors. Take a look at this presentation for real-world examples: news-of-the-symfony2-world.
The component is designed to work standalone and can be used without Symfony.
The only drawback is that it will only work with PHP 5.3 or newer.
This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML Dom Parser.
We have created quite a few crawlers for our needs before. At the end of the day, it is usually simple regular expressions that do the thing best. While libraries listed above are good for the reason they are created, if you know what you are looking for, regular expressions is a safer way to go, as you can handle also non-valid HTML/XHTML structures, which would fail, if loaded via most of the parsers.
I recommend PHP Simple HTML DOM Parser.
It really has nice features, like:
foreach($html->find('img') as $element)
echo $element->src . '<br>';
This sounds like a good task description of W3C XPath technology. It's easy to express queries like "return all href attributes in img tags that are nested in <foo><bar><baz> elements." Not being a PHP buff, I can't tell you in what form XPath may be available. If you can call an external program to process the HTML file you should be able to use a command line version of XPath.
For a quick intro, see http://en.wikipedia.org/wiki/XPath.
Third party alternatives to SimpleHtmlDom that use DOM instead of String Parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.
Yes you can use simple_html_dom for the purpose. However I have worked quite a lot with the simple_html_dom, particularly for web scraping and have found it to be too vulnerable. It does the basic job but I won't recommend it anyways.
I have never used curl for the purpose but what I have learned is that curl can do the job much more efficiently and is much more solid.
Kindly check out this link:scraping-websites-with-curl
Advanced Html Dom is a simple HTML DOM replacement that offers the same interface, but it's DOM-based which means none of the associated memory issues occur.
It also has full CSS support, including jQuery extensions.
QueryPath is good, but be careful of "tracking state" cause if you didn't realise what it means, it can mean you waste a lot of debugging time trying to find out what happened and why the code doesn't work.
What it means is that each call on the result set modifies the result set in the object, it's not chainable like in jquery where each link is a new set, you have a single set which is the results from your query and each function call modifies that single set.
in order to get jquery-like behaviour, you need to branch before you do a filter/modify like operation, that means it'll mirror what happens in jquery much more closely.
$results = qp("div p");
$forename = $results->find("input[name='forename']");
$results now contains the result set for input[name='forename'] NOT the original query "div p" this tripped me up a lot, what I found was that QueryPath tracks the filters and finds and everything which modifies your results and stores them in the object. you need to do this instead
$forename = $results->branch()->find("input[name='forname']")
then $results won't be modified and you can reuse the result set again and again, perhaps somebody with much more knowledge can clear this up a bit, but it's basically like this from what I've found.
For HTML5, html5 lib has been abandoned for years now. The only HTML5 library I can find with a recent update and maintenance records is html5-php which was just brought to beta 1.0 a little over a week ago.
I created a library named PHPPowertools/DOM-Query, which allows you to crawl HTML5 and XML documents just like you do with jQuery.
Under the hood, it uses symfony/DomCrawler for conversion of CSS selectors to XPath selectors. It always uses the same DomDocument, even when passing one object to another, to ensure decent performance.
Example use :
namespace PowerTools;
// Get file content
$htmlcode = file_get_contents('https://github.com');
// Define your DOMCrawler based on file string
$H = new DOM_Query($htmlcode);
// Define your DOMCrawler based on an existing DOM_Query instance
$H = new DOM_Query($H->select('body'));
// Passing a string (CSS selector)
$s = $H->select('div.foo');
// Passing an element object (DOM Element)
$s = $H->select($documentBody);
// Passing a DOM Query object
$s = $H->select( $H->select('p + p'));
// Select the body tag
$body = $H->select('body');
// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');
// Nest your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');
// Use a lambda function to set the text of all site blocks
$siteblocks->text(function( $i, $val) {
    return $i . " - " . $val->attr('class');
});
// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');
// Use a descendant selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');
// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));
// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function( $i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});
// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();
// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');
// Wrap the site's footer within two new elements
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');
[...]
Supported methods :
[x] $ (1)
[x] $.parseHTML
[x] $.parseXML
[x] $.parseJSON
[x] $selection.add
[x] $selection.addClass
[x] $selection.after
[x] $selection.append
[x] $selection.attr
[x] $selection.before
[x] $selection.children
[x] $selection.closest
[x] $selection.contents
[x] $selection.detach
[x] $selection.each
[x] $selection.eq
[x] $selection.empty (2)
[x] $selection.find
[x] $selection.first
[x] $selection.get
[x] $selection.insertAfter
[x] $selection.insertBefore
[x] $selection.last
[x] $selection.parent
[x] $selection.parents
[x] $selection.remove
[x] $selection.removeAttr
[x] $selection.removeClass
[x] $selection.text
[x] $selection.wrap
(1) Renamed 'select', for obvious reasons
(2) Renamed 'void', since 'empty' is a reserved word in PHP
NOTE :
The library also includes its own zero-configuration autoloader for PSR-0 compatible libraries. The example included should work out of the box without any additional configuration. Alternatively, you can use it with composer.
You could try using something like HTML Tidy to clean up any "broken" HTML and convert the HTML to XHTML, which you can then parse with an XML parser.
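A minimal sketch of that pipeline, assuming PHP's tidy extension is installed ($brokenHtml is a placeholder variable):
$config = array('output-xhtml' => true, 'numeric-entities' => true);
$tidy = tidy_parse_string($brokenHtml, $config, 'utf8');
$tidy->cleanRepair(); // fixes unclosed tags, mis-nesting, etc.
$xml = simplexml_load_string(tidy_get_output($tidy)); // the repaired document should now be well-formed XHTML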
I have written a general-purpose XML parser that can easily handle gigabyte-sized files. It's based on XMLReader and it's very easy to use:
$source = new XmlExtractor("path/to/tag", "/path/to/file.xml");
foreach ($source as $tag) {
    echo $tag->field1;
    echo $tag->field2->subfield1;
}
Here's the github repo: XmlExtractor
Another option you can try is QueryPath. It's inspired by jQuery, but it runs on the server in PHP and is used in Drupal.
XML_HTMLSax is rather stable, even if it's not maintained any more. Another option could be to pipe your HTML through HTML Tidy and then parse it with standard XML tools.
There are many ways to process HTML/XML DOM of which most have already been mentioned. Hence, I won't make any attempt to list those myself.
I merely want to add that I personally prefer using the DOM extension, and here's why:
it makes optimal use of the performance advantage of the underlying C code
it's OO PHP (and allows me to subclass it)
it's rather low level (which allows me to use it as a non-bloated foundation for more advanced behavior)
it provides access to every part of the DOM (unlike eg. SimpleXml, which ignores some of the lesser known XML features)
it has a syntax used for DOM crawling that's similar to the syntax used in native Javascript.
And while I miss the ability to use CSS selectors for DOMDocument, there is a rather simple and convenient way to add this feature: subclassing the DOMDocument and adding JS-like querySelectorAll and querySelector methods to your subclass.
For parsing the selectors, I recommend using the very minimalistic CssSelector component from the Symfony framework. This component just translates CSS selectors to XPath selectors, which can then be fed into a DOMXpath to retrieve the corresponding Nodelist.
You can then use this (still very low level) subclass as a foundation for more high level classes, intended to eg. parse very specific types of XML or add more jQuery-like behavior.
The code below comes straight out of my DOM-Query library and uses the technique I described.
For HTML parsing :
namespace PowerTools;

use \Symfony\Component\CssSelector\CssSelector as CssSelector;

class DOM_Document extends \DOMDocument {

    public function __construct($data = false, $doctype = 'html', $encoding = 'UTF-8', $version = '1.0') {
        parent::__construct($version, $encoding);
        if ($doctype && $doctype === 'html') {
            @$this->loadHTML($data); // @ suppresses libxml warnings on malformed markup
        } else {
            @$this->loadXML($data);
        }
    }

    public function querySelectorAll($selector, $contextnode = null) {
        if (isset($this->doctype->name) && $this->doctype->name == 'html') {
            CssSelector::enableHtmlExtension();
        } else {
            CssSelector::disableHtmlExtension();
        }
        $xpath = new \DOMXpath($this);
        return $xpath->query(CssSelector::toXPath($selector, 'descendant::'), $contextnode);
    }

    [...]

    public function loadHTMLFile($filename, $options = 0) {
        $this->loadHTML(file_get_contents($filename), $options);
    }

    public function loadHTML($source, $options = 0) {
        if ($source && $source != '') {
            $data = trim($source);
            $html5 = new HTML5(array('targetDocument' => $this, 'disableHtmlNsInDom' => true));
            $data_start = mb_substr($data, 0, 10);
            if (strpos($data_start, '<!DOCTYPE ') === 0 || strpos($data_start, '<html>') === 0) {
                $html5->loadHTML($data);
            } else {
                #$this->loadHTML('<!DOCTYPE html><html><head><meta charset="' . $encoding . '" /></head><body></body></html>');
                $t = $html5->loadHTMLFragment($data);
                $docbody = $this->getElementsByTagName('body')->item(0);
                while ($t->hasChildNodes()) {
                    $docbody->appendChild($t->firstChild);
                }
            }
        }
    }

    [...]
}
See also Parsing XML documents with CSS selectors by Symfony's creator Fabien Potencier on his decision to create the CssSelector component for Symfony and how to use it.
The Symfony framework has components that can parse HTML, and you can use CSS selectors to pick out DOM nodes instead of using XPath.
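A hedged sketch of that approach using the DomCrawler and CssSelector components (the URL and selector are placeholders):
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler(file_get_contents('http://example.com/'));
// filter() accepts CSS selectors when symfony/css-selector is installed
$crawler->filter('body > p')->each(function (Crawler $node) {
    echo $node->text(), "\n";
});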
With FluidXML you can query and iterate XML using XPath and CSS Selectors.
$doc = fluidxml('<html>...</html>');
$title = $doc->query('//head/title')[0]->nodeValue;
$doc->query('//body/p', 'div.active', '#bgId')
    ->each(function($i, $node) {
        // $node is a DOMNode.
        $tag = $node->nodeName;
        $text = $node->nodeValue;
        $class = $node->getAttribute('class');
    });
https://github.com/servo-php/fluidxml
JSON and array from XML in three lines:
$xml = simplexml_load_string($xml_string);
$json = json_encode($xml);
$array = json_decode($json,TRUE);
Ta da!
There are several reasons not to parse HTML with regular expressions. But if you have total control of what HTML will be generated, then a simple regular expression can do.
Below is a function that parses HTML by regular expression. Note that this function is very sensitive and demands that the HTML obey certain rules, but it works very well in many scenarios. If you want a simple parser and don't want to install libraries, give this a shot:
function array_combine_($keys, $values) {
    $result = array();
    foreach ($keys as $i => $k) {
        $result[$k][] = $values[$i];
    }
    // Collapse single-element arrays to scalars (a closure replaces the
    // original create_function() call, which was removed in PHP 8)
    array_walk($result, function (&$v) {
        $v = (count($v) == 1) ? array_pop($v) : $v;
    });
    return $result;
}

function extract_data($str) {
    return (is_array($str))
        ? array_map('extract_data', $str)
        : ((!preg_match_all('#<([A-Za-z0-9_]*)[^>]*>(.*?)</\1>#s', $str, $matches))
            ? $str
            : array_map('extract_data', array_combine_($matches[1], $matches[2])));
}

print_r(extract_data(file_get_contents("http://www.google.com/")));
I've created a library called HTML5DOMDocument that is freely available at https://github.com/ivopetkov/html5-dom-document-php
It supports query selectors too, which I think will be extremely helpful in your case. Here is some example code:
$dom = new IvoPetkov\HTML5DOMDocument();
$dom->loadHTML('<!DOCTYPE html><html><body><h1>Hello</h1><div class="content">This is some text</div></body></html>');
echo $dom->querySelector('h1')->innerHTML;
The best method for parsing XML (an RSS feed, for example) is SimpleXML:
$url = 'http://www.example.com/rss.xml';
$rss = simplexml_load_file($url); // fetches and parses the feed in one step
$i = 0;
foreach ($rss->channel->item as $feedItem) {
    $i++;
    echo $title = $feedItem->title;
    echo '<br>';
    echo $link = $feedItem->link;
    echo '<br>';
    if ($feedItem->description != '') {
        $des = $feedItem->description;
    } else {
        $des = '';
    }
    echo $des;
    echo '<br>';
    if ($i > 5) break;
}
There are many ways:
In General:
Native XML Extensions: they come bundled with PHP, are usually faster than all the 3rd-party libs, and give you all the control you need over the markup.
DOM: DOM is capable of parsing and modifying real-world (broken) HTML and it can do XPath queries. It is based on libxml.
XML Reader: XMLReader, like DOM, is based on libxml. The XMLReader extension is an XML pull parser: the reader acts as a cursor going forward on the document stream, stopping at each node on the way (see the short sketch after this list).
XML Parser: This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust. It implements a SAX style XML push parser.
Simple XML: The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.
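As an illustration of the pull-parser style mentioned above, a minimal hedged XMLReader sketch (the file path and element name are placeholders):
$reader = new XMLReader();
$reader->open('/path/to/file.xml');
$doc = new DOMDocument();
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'item') {
        // Expand only the current <item> into a DOM node; the rest of the
        // file is never held in memory at once
        $item = simplexml_import_dom($doc->importNode($reader->expand(), true));
        echo $item->title, "\n";
    }
}
$reader->close();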
3rd Party Libraries [ libxml based ]:
FluentDOM: provides a jQuery-like fluent XML interface for the DOMDocument in PHP. It can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.
HtmlPageDom: a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from the Symfony2 components for traversing the DOM tree, and extends it by adding methods for manipulating the DOM tree of HTML documents.
ZendDOM: Zend_Dom provides tools for working with DOM documents and structures. Currently, they offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.
QueryPath: QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.
fDOM Document: fDOMDocument extends the standard DOM to throw exceptions on all errors, instead of PHP warnings or notices. It also adds various custom methods and shortcuts for convenience and to simplify the usage of DOM.
Sabre/XML: sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.
FluidXML: FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.
3rd Party Libraries [ Not libxml based ]:
PHP Simple HTML DOM Parser: an HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way. It requires PHP 5+ and also supports invalid HTML.
It extracts content from HTML in a single line. However, the codebase is horrible and very slow.
PHP Html Parser: PHPHtmlParser is a simple, flexible HTML parser that allows you to select tags using any CSS selector, like jQuery. The goal is to assist in the development of tools that require a quick, easy way to scrape HTML, whether it's valid or not. However, it is slow and takes too much CPU power.
Ganon (recommended): a universal tokenizer and HTML/XML/RSS DOM parser. It can manipulate elements and their attributes, supports invalid HTML and UTF-8, and can perform advanced CSS3-like queries on elements (like jQuery, with namespaces supported). It also includes an HTML beautifier (like HTML Tidy) that can minify CSS and JavaScript, sort attributes, change character case, correct indentation, and so on.
Extensible: operations are separated into smaller functions for easy overriding.
Fast and easy to use.
Web Services:
If you don't feel like programming PHP, you can also use Web services. ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.
I have shared all these resources; you can choose according to your taste, usefulness, etc.

XSS DOM vulnerable

I tested a site for vulnerabilities (folder /service-contact) and a possible DOM XSS issue came up (using Kali Linux, Vega and XSSER). I then tried to manually test the URL with an 'alert' script to make sure it's vulnerable. I used
www.babyland.nl/service-contact/<script>alert("test")</script>
No alert box/pop-up was shown; only the HTML code showed up in the contact form box.
I am not sure whether I used the right code (I'm a rookie) or interpreted the result correctly. The server is Apache, using JavaScript/JS.
Can you help?
Thanks!
This is not vulnerable to XSS. Whatever you write in the URL ends up in the form section below (Vraag/opmerking), and the double quotes (") are escaped. If you try another payload like <script>alert(/xss/)</script>, that won't work either, because the input is neither reflected nor stored; you will see the output as text in Vraag/opmerking. Don't rely on online scanners; test manually. For DOM-based XSS, check sinks and sources and analyse them.
The tool is right. There is an XSS vulnerability on the site, but the proof-of-concept (PoC) code is wrong. The content of a <textarea> can only contain character data (see the <textarea> description on MDN), so your <script>alert("test")</script> is interpreted as text and not as HTML code. But you can close the <textarea> tag and insert the JavaScript code after that.
Here is the working PoC URL:
https://www.babyland.nl/service-contact/</textarea><script>alert("test")</script>
which is rendered as:
<textarea rows="" cols="" id="comment" name="comment"></textarea><script>alert("test")</script></textarea>
A little note on testing for XSS injection: Chrome/Chromium has built-in XSS protection, so this code doesn't exploit in that browser. For manual testing you can use Firefox, or run Chrome with --disable-web-security (see this Stack Overflow question and this for more information).

How to disable Sitecore's embedded language parser

I have a site with many URL rewrites, and a good portion of them contain old links that are prefixed with a country code (e.g. /fr, /de, etc.). Rewrites without the prefixes work just fine, but those with prefixes trigger Sitecore's embedded language URL parser, which bypasses the rewrite module entirely.
Example
/fr/old-link tries to parse 'fr' as a language and fails as 'fr-FR' is the name of the French language.
Solution: I need to disable Sitecore's ability to detect a language prefix in the URL so the URL rewrite module can proceed unhindered.
I can't find where in the pipeline this occurs. I've gone through numerous pipeline processors with Reflector and come up short. I need help, please.
Another pipeline to look at is the preprocessRequest pipeline. It has a StripLanguage processor that detects if the first part of the URL is a language and acts on it.
More info on how to get Sitecore to ignore the language part of the url can be found in this post http://sitecoreblog.patelyogesh.in/2013/11/sitecore-item-with-language-name.html
You will need to create a new LanguageResolver to replace the standard Sitecore one (Sitecore.Pipelines.HttpRequest.LanguageResolver). This is referenced in the <httpRequestBegin> pipeline section in web.config. Here you can handle requests beginning with fr as opposed to fr-FR, etc. In the past I have done a similar thing when we wanted to use non-ISO language codes.
EDIT
The LanguageResolver resolves the language based on the query string first, but will also resolve it based on the file path (i.e. having fr-FR at the start of your path). I think you would need to inherit from the Sitecore LanguageResolver and override the GetLanguageFromRequest method, changing the else statement to use something other than Context.Data.FilePathLanguage, possibly just using regex/string manipulation to get the first folder from the URL and then using that to set the context language. This should prevent the failure to resolve the language, which I understand is killing your URL rewrite module.

Cleansing string / input in Coldfusion 9

I have been working with ColdFusion 9 lately (my background is primarily PHP) and I am scratching my head trying to figure out how to clean/sanitize user-submitted strings.
I want to make them HTML-safe and eliminate any JavaScript or SQL injection, the usual.
I am hoping I've overlooked some kind of function that already comes with CF9.
Can someone point me in the proper direction?
Well, for SQL injection, you want to use CFQUERYPARAM.
As for sanitizing the input for XSS and the like, you can use the ScriptProtect attribute in CFAPPLICATION, though I've heard that doesn't work flawlessly. You could look at Portcullis or similar 3rd-party CFCs for better script protection if you prefer.
This is an addition to Kyle's suggestions, not an alternative answer, but the comments panel is a bit rubbish for links.
Take a look at the ColdFusion string functions. You've got HTMLCodeFormat, HTMLEditFormat, JSStringFormat and URLEncodedFormat, all of which can help you work with content posted from a form.
You can also try to use the regex functions to remove HTML tags, but it's never a precise science. This ColdFusion-based regex/HTML question should help there a bit.
You can also try to protect yourself from bots and known spammers using something like cfformprotect, which integrates Project Honeypot and Akismet protection amongst other tools into your forms.
You've got several options:
"Global Script Protection" Administrator setting, which applies a regular expression against post and get (i.e. FORM and URL) variables to strip out <script/>, <img/> and several other tags
Use isValid() to validate variables' data types (see my in depth answer on this one).
<cfqueryparam/>, which serves to create SQL bind parameters and validate the datatype passed to it.
That noted, if you are really trying to sanitize HTML, use Java, which ColdFusion can access natively. In particular use the OWASP AntiSamy Project, which takes an HTML fragment and whitelists what values can be part of it. This is the same approach that sites like SO and slashdot.org use to protect submissions and is a more secure approach to accepting markup content.
Sanitizing strings, in ColdFusion as in virtually any language, is very important and depends on what you want to do with the string. Most mitigations are for:
saving content to a database (e.g. <cfqueryparam ...>)
using content to show on the next page (e.g. putting a URL parameter in a link or showing a URL parameter as text)
saving files and using uploaded filenames and content
There is always a risk if you follow the idea of allowing basically everything in the first step and then sanitizing malicious code "away" by deleting or replacing characters (the blacklist approach).
The better solution is to check strings with reFind(...) / reReplace(...) against regular expressions that explicitly allow only the characters needed for the scenario at hand (the whitelist approach), whenever this is possible. Use cases are inputs for numbers, lists, email addresses, URLs, names, ZIP codes, cities, etc.
For example, if you want to ask for an email address, you could use
<cfif reFindNoCase("^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$", stringtosanitize)>...ok, clean...<cfelse>...not ok...</cfif>
(or your own regex).
For HTML input or CSS input I would also recommend the OWASP Java HTML Sanitizer Project.