How to apply Drupal's Ckeditor HTML filters when manually saving node fields? - drupal-8

I have written a custom content import script for Drupal 8, which imports content from a JSON export from another website.
My CKEditor fields have fairly basic HTML filtering which, for example, replaces <i> tags with <em> tags and <b> with <strong>.
When I save my HTML into a field with such settings, it works fine for <p> and <ul> tags, but <i> tags are not displayed:
$html = '<p><i>Italic text</i> and some <b>bold</b> text</p>';
$node->set('field_some_html', ['value' => $html, 'format' => 'basic_html']);
It now renders as:
<p>Italic text and some bold text</p>
When I then edit the node, I do see the text in italics or bold in the editor.
When I save the node, everything is corrected: the tags have been converted.
It now renders as:
<p><em>Italic text</em> and some <strong>bold</strong> text</p>
So my question is: how do I fix this? How do I apply the filters to my HTML input before saving the node to the database?
Update 1: After some more investigation I found FilterFormat, and tried this:
$html = '<p><i>Italic text</i> and some <b>bold</b> text</p>';
$filter_format = FilterFormat::load($format);
$filters = $filter_format->filters();
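/** @var \Drupal\filter\Plugin\Filter\FilterHtml $filter */
foreach ($filters as $filter) {
  // The first pass receives the raw string; later passes receive the previous filter's result object.
  $html = $filter->process(is_string($html) ? $html : $html->getProcessedText(), 'nl');
}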
die($html->getProcessedText());
However, this does the opposite of what I want to achieve: it returns the HTML stripped of its <i> and <b> tags.
I think I may be close to the solution though...

Related

Select every text node in a HTML document except script nodes with XPath

I am currently writing a web crawler with Scrapy, and I would like to fetch all the text displayed on the screen of every HTML document with a single XPath query.
Here is the HTML I'm working with:
<body>
<div>
<h1>Main title</h1>
<div>
<script>var grandson;</script>
<p>Paragraph</p>
</div>
</div>
<script>var child;</script>
</body>
As you can see, there are some script tags that I want to filter out when getting the text inside the body tag.
Here is my first XPath query and its result:
XPath: /body/*//text()
Result: Main title / var grandson; / Paragraph / var child;
This is not good because it also fetches the text inside the script tag.
Here is my second try:
XPath: /body/*[not(self::script)]//text()
Result: Main title / var grandson; / Paragraph
Here, the last script tag (which is body's child) is filtered, but the inner script is not.
How would you filter out all the script tags? Thanks in advance.
Try
//*[not(self::script)]/text()
This XPath does what you want:
.//text()[not(parent::script)]
It looks at the parent of each text node and skips the nodes whose parent is a script element.
A more general variant, which I use for any element that contains HTML code, also excludes style and noscript:
.//text()[not(ancestor::script|ancestor::style|ancestor::noscript)]
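The same filtering idea can be sketched outside Scrapy with only Python's standard library. This is an illustration under assumptions, not the answerers' code: xml.etree.ElementTree supports neither text() nor the ancestor axis, so the hypothetical helper visible_text walks the tree by hand. Unlike the XPath expressions above, it also drops whitespace-only text nodes.

```python
import xml.etree.ElementTree as ET

# Elements whose text should never be collected (same set as the XPath above).
SKIPPED = {"script", "style", "noscript"}

HTML = """<body>
<div>
<h1>Main title</h1>
<div>
<script>var grandson;</script>
<p>Paragraph</p>
</div>
</div>
<script>var child;</script>
</body>"""


def visible_text(element):
    """Yield non-blank text that has no skipped element as an ancestor."""
    if element.tag in SKIPPED:
        return  # ignore the element and everything below it
    if element.text and element.text.strip():
        yield element.text.strip()
    for child in element:
        yield from visible_text(child)
        # The tail follows the child's closing tag, so it belongs to `element`.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


texts = list(visible_text(ET.fromstring(HTML)))
print(texts)  # ['Main title', 'Paragraph']
```

ET.fromstring only accepts well-formed markup, so real-world pages would still need an HTML-aware parser such as the one Scrapy provides.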

preg_match_all grab everything in a HTML tag when malformatted

I am trying to automatically grab everything inside a particular tag in an HTML string.
What I need to do is grab everything in
<font size="8"></font>
so I wrote the following preg_match_all:
preg_match_all('/<font(.*?)size="8"(.*?)>(.*?)<\/font\>/s', $row['html'], $titles,PREG_PATTERN_ORDER);
However, it only works in certain cases. For example, the following malformed string fails to match. Do you have any idea how to fix this, or how to modify the pattern above to handle it?
<font FACE="Times New Roman" SIZE="8">
<p><font color="#003300">adadas <br>
dfsf sdfsdf <font size="4"><br>
<br>
gdfgdg
</font>
</font>
Give something like this a try:
<?php
$titles = array(); // CREATE AN ARRAY
$string = '<font FACE="Times New Roman" SIZE="8"><p><font color="#003300">adadas <br>dfsf sdfsdf <font size="4"><br><br>gdfgdg</font></font>';
$dom_document = new DOMDocument(); // CREATE A NEW DOCUMENT
$dom_document->loadHTML($string); // LOAD THE STRING INTO THE DOCUMENT
// LOOP THROUGH EACH font TAG
foreach ($dom_document->getElementsByTagName('font') as $font_item) {
// CHECK TO SEE IF IT HAS A SIZE ATTRIBUTE OF 8
if ($font_item->getAttribute('size') == 8) {
$titles[] = $font_item->ownerDocument->saveXML($font_item);
}
}
print_r($titles);
Basically, instead of using a regex, you can use PHP's built-in DOM parser. This script creates a new document named $dom_document and loads your string into it. It then loops through every font tag it finds and checks whether it has a size attribute equal to 8. For each match, it serializes the element and stores the resulting markup in the $titles array.
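For comparison, the parser-instead-of-regex approach can be sketched in Python with the standard library's html.parser, which also tolerates the unclosed tags in the sample. This is a sketch under assumptions, not a port of the PHP above: the class name FontSizeCollector is hypothetical, and it collects only the text inside the matching font tag rather than the serialized markup.

```python
from html.parser import HTMLParser


class FontSizeCollector(HTMLParser):
    """Collect the text inside every <font> tag whose size attribute matches."""

    def __init__(self, size="8"):
        super().__init__()
        self.size = size
        self.depth = 0      # number of <font> tags open since the match began
        self.buffer = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so SIZE="8" matches.
        if self.depth:
            if tag == "font":
                self.depth += 1
        elif tag == "font" and dict(attrs).get("size") == self.size:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "font":
            self.depth -= 1
            if self.depth == 0:
                self._flush()

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)

    def close(self):
        super().close()
        if self.depth:      # the matching tag was never closed
            self.depth = 0
            self._flush()

    def _flush(self):
        self.titles.append("".join(self.buffer))
        self.buffer = []


string = ('<font FACE="Times New Roman" SIZE="8"><p><font color="#003300">adadas <br>'
          'dfsf sdfsdf <font size="4"><br><br>gdfgdg</font></font>')
collector = FontSizeCollector()
collector.feed(string)
collector.close()
print(collector.titles)  # ['adadas dfsf sdfsdf gdfgdg']
```

Counting nested font tags is what lets the collector find the right closing tag, which the non-greedy regex cannot do.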

Hyperlinks inside object fields

I have an object inside my Ember app, with a description field. This description field may contain hyperlinks, like this
My fancy text <a href='http://other.site.com' target='_blank'>My link</a> My fancy text continues...
However, when I output it normally, like {{ description }}, my hyperlinks are displayed as plain text. Why is this happening and how can I fix it?
Handlebars escapes any HTML within output by default. For unescaped text in markup use triple-stashes:
{{{ description }}}
There's an alternative when one controls the property: Handlebars.SafeString. SafeStrings are assumed to be safe and are not escaped either. From the documentation:
Handlebars.registerHelper('link', function(text, url) {
text = Handlebars.Utils.escapeExpression(text);
url = Handlebars.Utils.escapeExpression(url);
var result = '<a href="' + url + '">' + text + '</a>';
return new Handlebars.SafeString(result);
});
Note - please be careful with this. There are security concerns with rendering unescaped text that has come from user input; an attacker could inject a malicious script into the description and hijack your page, for example.
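The pattern behind the helper above (escape every dynamic piece first, then assemble the markup you trust) is not specific to Handlebars. As a minimal sketch in Python, assuming a hypothetical link function and using only html.escape from the standard library:

```python
from html import escape


def link(text, url):
    """Build an anchor tag from untrusted text and URL, escaping both."""
    safe_text = escape(text)             # escapes & < > and both quote types
    safe_url = escape(url, quote=True)   # safe inside a double-quoted attribute
    return '<a href="{}">{}</a>'.format(safe_url, safe_text)


result = link("My <b>link</b>", "http://other.site.com/?a=1&b=2")
print(result)
# <a href="http://other.site.com/?a=1&amp;b=2">My &lt;b&gt;link&lt;/b&gt;</a>
```

Escaping alone does not reject dangerous URL schemes such as javascript:, so a real helper would also need to validate the URL.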

HTML: sanitize a set of tags but allow all tags in <code> blocks

I'm using Django+Markdown for processing user input. Text produced by the markdown filter needs to be 'safe', so it is not protected by Django's auto-escape mechanism and I have to escape user input myself. This is how I do it now:
{{ text|force_escape|markdown:"codehilite" }}
However, if text contains something that Markdown would mark as <code>, it is escaped as well and the output is pretty ugly (e.g., '<' is displayed as &lt; inside <code>). For example, if
text = u'''
<script>alert("I'm not working 'cause I'll be escaped")</script>
The following would be marked as a code block:
<script>alert("not xss 'cause I'm in <code>")</script>
'''
Using the filter mentioned above, the produced text is:
<p>
&lt;script&gt;alert(&quot;I&#39;m not working &#39;cause I&#39;ll be escaped&quot;)&lt;/script&gt;
The following would be marked as a code block:
</p>
<pre class="codehilite">
<code>
&amp;lt;script&amp;gt;alert(&amp;quot;not xss &amp;#39;cause I&amp;#39;m in &amp;lt;code&amp;gt;&amp;quot;)&amp;lt;/script&amp;gt;
</code>
</pre>
What I want is:
<p>
&lt;script&gt;alert(&quot;I&#39;m not working &#39;cause I&#39;ll be escaped&quot;)&lt;/script&gt;
The following would be marked as a code block:
</p>
<pre class="codehilite">
<code>
&lt;script&gt;alert("not xss 'cause I'm in &lt;code&gt;")&lt;/script&gt;
</code>
</pre>
I'm thinking about using BeautifulSoup to get the <code> blocks produced by Markdown and reverse-escape their content. But soup.code.text returns only the text, excluding the tags, so I can't get my hands on any of the <, >, ', " or & characters in it.
Don't escape the input before passing it to Markdown. As you found, this breaks user input in some cases. And, it doesn't ensure security: consider, e.g., "[clickme](javascript:alert%28%22xss%22%29)".
Instead, the correct approach is to use Markdown in its safe mode. I've written elsewhere about how to do so, but the short version in Django is to use something like {{ text | markdown:"safe" }}. (Alternatively, you can apply an HTML sanitizer, like HTML Purifier, to the output of the Markdown processor.)
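The point about escaping not ensuring security can be checked directly. The sketch below uses only Python's standard library; is_safe_url and ALLOWED_SCHEMES are hypothetical names, and the allowlist is an assumption for illustration rather than a complete sanitizer.

```python
from html import escape
from urllib.parse import urlsplit

# The example link from the answer: it contains no character that
# HTML escaping touches, so escaping leaves it fully intact.
payload = "[clickme](javascript:alert%28%22xss%22%29)"
unchanged = escape(payload) == payload
print(unchanged)  # True

# Assumed allowlist; the empty scheme covers relative URLs.
ALLOWED_SCHEMES = {"", "http", "https", "mailto"}


def is_safe_url(url):
    """Return True when the URL's scheme is on the allowlist."""
    return urlsplit(url.strip()).scheme.lower() in ALLOWED_SCHEMES


print(is_safe_url("javascript:alert%28%22xss%22%29"))  # False
print(is_safe_url("https://example.com/page"))         # True
```

This is why the filtering has to happen where links are generated, in the Markdown processor or a sanitizer run on its output, and not on the raw input.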

actionscript htmltext. removing a table tag from dynamic html text

AS 3.0 / Flash
I am consuming XML which I have no control over.
The XML has HTML in it, which I am styling and displaying in an HTML text field.
I want to remove all the HTML except the links.
The approach from "Strip all HTML tags except links" is not working for me.
Does anyone have any tips? A regex?
The following removes tables:
var reTable:RegExp = /<table\s+[^>]*>.*?<\/table>/s;
But now I realize I need to keep the content that is in the tables, and I also need the links.
thanks!!!
cp
You probably shouldn't use a regex to parse HTML, but if you don't care, something simple like this works:
find /<table\b[^>]*>.*?<\/table\s*>/gs
replace ""
ActionScript has a pretty neat tool for handling XML: E4X. Rather than relying on RegEx, which I find often messes things up with XML, just modify the actual XML tree, and from within AS:
var xml : XML = <page>
<p>Other elements</p>
<table><tr><td>1</td></tr></table>
<p>won't</p>
<div>
<table><tr><td>2</td></tr></table>
</div>
<p>be</p>
<table><tr><td>3</td></tr></table>
<p>removed</p>
<table><tr><td>4</td></tr></table>
</page>;
clearTables (xml);
trace (xml.toXMLString()); // will output everything but the tables
function clearTables (xml : XML ) : void {
xml.replace( "table", "");
for each (var child:XML in xml.elements("*")) clearTables(child);
}
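Neither snippet above keeps the table content together with the links, which is what the question finally asks for. As a language-neutral illustration of that requirement (a sketch in Python with the standard library's html.parser, not ActionScript, using the hypothetical names KeepLinks and strip_except_links), a tag-aware pass can drop every tag except <a> while keeping all text:

```python
from html import escape
from html.parser import HTMLParser


class KeepLinks(HTMLParser):
    """Rebuild the input, keeping text and <a> tags and dropping other tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Re-emit the anchor's start tag exactly as it was written.
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("</a>")

    def handle_data(self, data):
        # The parser decodes entities, so re-escape the text on output.
        self.parts.append(escape(data, quote=False))


def strip_except_links(markup):
    parser = KeepLinks()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


sample = ('<table><tr><td>1 <a href="http://other.site.com">link</a></td></tr></table>'
          '<p>text</p>')
print(strip_except_links(sample))
# 1 <a href="http://other.site.com">link</a>text
```

The same structure (keep text, pass through one allowed tag, discard the rest) should carry over to an E4X traversal, though that port is not shown here.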