I'm using TinyMCE as a rich text editor and separating the excerpt from the content via the page break button, which inserts a <!-- pagebreak --> tag. I'm wondering what the best way is to extract the excerpt from the database.
I know I can use preg_match as well as preg_split, but is that really the best and most optimized solution?
Wouldn't it be better and faster to save the excerpt in a separate column?
This should work, without using any regex functions:
$pagebreak = '<!-- pagebreak -->';
$content = 'I am the excerpt<!-- pagebreak -->I am the rest of the content';
$pos = strpos($content, $pagebreak); // false when there is no page break
$excerpt = $pos === false ? '' : substr($content, 0, $pos);
$restOfTheContent = $pos === false ? $content : substr($content, $pos + strlen($pagebreak));
var_dump($excerpt); // string(16) "I am the excerpt"
var_dump($restOfTheContent); // string(28) "I am the rest of the content"
Please note that this is really only designed to work with a single page break. It wouldn't be too difficult to modify it to generate an array of $pages based off of the string $content should multiple page breaks be necessary.
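For the multiple page break case mentioned above, here is a minimal sketch. It is written in Python rather than PHP purely for illustration, and the helper names split_pages and split_excerpt are hypothetical. It assumes the marker is always spelled exactly <!-- pagebreak -->.

```python
PAGEBREAK = "<!-- pagebreak -->"


def split_pages(content: str, marker: str = PAGEBREAK) -> list[str]:
    """Split content into pages at every occurrence of the marker."""
    return content.split(marker)


def split_excerpt(content: str, marker: str = PAGEBREAK) -> tuple[str, str]:
    """Return (excerpt, rest), cutting at the first marker only.

    When the marker is absent, the excerpt is empty and the whole
    content is returned as the rest.
    """
    excerpt, found, rest = content.partition(marker)
    if not found:
        return "", content
    return excerpt, rest


if __name__ == "__main__":
    sample = "intro<!-- pagebreak -->page two<!-- pagebreak -->page three"
    print(split_pages(sample))
    print(split_excerpt(sample))
```

Whether this beats storing the excerpt in its own column depends on how often the excerpt is read; the split itself is a plain substring search and needs no regex.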
I have a situation where I need to differentiate two calls by the path in the src attribute of an HTML img tag. This is what the img tag looks like:
<img src="/folder/12280218/160024536.images.jpg" />
I am planning to alter the source to
<img src="/folder/12280218/160024536.images.jpg/1" />
Note the "/1" at the end of the src.
I need this so that I can change the flow in the controller when I am serving this image.
This is what I have tried so far:
my $string = '<p><img src="/folder/12280218/160024536.images.jpg" /></p>';
$string =~ s/<img\s+src\=\"(.*)"\s+\/><\/p>/<img src\=\"$1\/1" \><\/p>/g;
This works as long as $string looks exactly like this.
In our application, users can alter the HTML input using CKEditor.
They can change the image tag by adding width="800" before or after the src attribute. I want the regular expression to handle all of these situations.
Please let me know how to proceed.
Thanks in advance.
Replace :
(<img.*src="[^"]*)(".*\/>)
by
$1/1$2
Edit : Changed the regex to handle situations with other attributes (like the "width" part)
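As a rough illustration of the same idea, here is a sketch in Python (not the Perl from the question). The name append_suffix is hypothetical, and the pattern assumes the src value is double-quoted. It restricts the match to a single tag with [^>] so that two images on one line are handled separately.

```python
import re

# Group 1: everything from "<img" up to and including src="
# Group 2: the current src value
# Group 3: the closing quote
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]*)(")', re.IGNORECASE)


def append_suffix(html: str, suffix: str = "/1") -> str:
    """Append suffix to the src value of every <img> tag in html."""
    return IMG_SRC.sub(
        lambda m: m.group(1) + m.group(2) + suffix + m.group(3),
        html,
    )


if __name__ == "__main__":
    before = '<p><img width="800" src="/folder/12280218/160024536.images.jpg" /></p>'
    print(append_suffix(before))
```

Because the suffix is appended on every call, running it twice on the same string would add "/1" twice; guarding against that is left out of this sketch.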
Please see the Google Doc below:
https://docs.google.com/document/d/1dw6mJW0VxHzD3_h86RgtZwmelBQE8tYGgi41jb1oz-o/edit
I am attempting to put the data into HBase using either MapReduce or ImportTsv, but my main problem is dealing with the photos. I would like to put the photos in a separate column family. How do I go about selecting only the photos and importing them into HBase, given that the photos don't have anything they can be identified by, such as a (text) name?
I thought about using regex, but some of the districts have a different structure, for instance "Arizona 1" vs. "Alaska at large".
I need to know how to specifically identify the photos so that they can be distinguished and imported appropriately.
Having in mind the structure of the document mentioned above, this is the expression you need. It will match all image URLs and each image description.
<image\sxlink:href="(https:\/\/[^"\s]+)".*?<title><\/title><desc>(.+?)<\/desc><\/image>
Usage in PHP:
$html = '<p>Members of our tim</p><image xlink:href="https://lh4.googleusercontent.com/z3GK1MdYyLTo0Q0xLmawvcptIrK4qkQx7XJWUgTK_i6Psm22GBqZXBh-w0TeQ5xgKxckQOB2wHWySSIpNj3tXx65MPXmaxKjK4ye_Xu-wAUFKLVhvWFgIedtzxo" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Bradley Byrne.jpg</desc></image><h1>Some big title</h1><p>Something <span>more</span> here</p><image xlink:href="https://lh5.googleusercontent.com/fWYh7qTWqu4_4oxAiNhmnMCmD6DScZ6bIvkF5nSFunU8NxKlBT1T-1J85MJCqghhbChFzoLi-p4ZFVDCA2DWWBP9Paagp9ZgshqnGK5CQQF6D7IoBGihcFZoOms" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Spencer Bachus 113th Congress.jpg</desc></image><h1>TITLE</h1><p>Testing, testing, testing</p><image xlink:href="https://lh5.googleusercontent.com/VAHzM6OkdtxT61j9XSgTDKlpVi99WsFfzNAlvqmnpCi90XFs9aUNMfuCeeeQ3e26fykjveoxldHvv5jO1Bk9IeEmeU7DdGVAM1N9xXoB8tJTYBeTeFBxigXtT5s" width="100%" height="100%" preserveAspectRatio="none"><title></title><desc>Kyrsten Sinema 113th Congress.jpg</desc></image><p>Last updated on 25th of July, 2014</p>';
$pattern = '/<image\sxlink:href="(https:\/\/[^"\s]+)".*?<title><\/title><desc>(.+?)<\/desc><\/image>/';
if (preg_match_all($pattern, $html, $matches)) {
    $size_of_matches = count($matches[0]);
    for ($i = 0; $i < $size_of_matches; $i++) {
        echo $matches[1][$i] . " -> " . $matches[2][$i] . "<br />";
    }
}
Output:
https://lh4.googleusercontent.com/z3GK1MdYyLTo0Q0xLmawvcptIrK4qkQx7XJWUgTK_i6Psm22GBqZXBh-w0TeQ5xgKxckQOB2wHWySSIpNj3tXx65MPXmaxKjK4ye_Xu-wAUFKLVhvWFgIedtzxo -> Bradley Byrne.jpg
https://lh5.googleusercontent.com/fWYh7qTWqu4_4oxAiNhmnMCmD6DScZ6bIvkF5nSFunU8NxKlBT1T-1J85MJCqghhbChFzoLi-p4ZFVDCA2DWWBP9Paagp9ZgshqnGK5CQQF6D7IoBGihcFZoOms -> Spencer Bachus 113th Congress.jpg
https://lh5.googleusercontent.com/VAHzM6OkdtxT61j9XSgTDKlpVi99WsFfzNAlvqmnpCi90XFs9aUNMfuCeeeQ3e26fykjveoxldHvv5jO1Bk9IeEmeU7DdGVAM1N9xXoB8tJTYBeTeFBxigXtT5s -> Kyrsten Sinema 113th Congress.jpg
I do not have experience with MapReduce or ImportTsv, so I went about this a different way, using C#. As hex494D49 pointed out, the images do have text associated with them. You just have to obtain that data from the document's source (i.e. right click --> View page source).
This code reads in the document's source, attempts to match each politician with an image file (based on the available information that was posted), and writes the results to a text file. The code has many examples of the C# flavor of regex. A sample of the output is here.
AS 3.0 / Flash
I am consuming XML which I have no control over.
The XML has HTML in it, which I am styling and displaying in an HTML text field.
I want to remove all the HTML except the links.
The approach from "Strip all HTML tags except links" is not working for me.
Does anyone have any tips? A regex?
The following removes tables:
var reTable:RegExp = /<table\s+[^>]*>.*?<\/table>/s;
But now I realize I need to keep the content that is inside the tables, and I also need the links.
thanks!!!
cp
Probably shouldn't use regex to parse html, but if you don't care, something simple like this:
find /<table\b[^>]*>.*?<\/table\s*>/s
replace ""
ActionScript has a pretty neat tool for handling XML: E4X. Rather than relying on RegEx, which I find often messes things up with XML, just modify the actual XML tree, and from within AS:
var xml : XML = <page>
<p>Other elements</p>
<table><tr><td>1</td></tr></table>
<p>won't</p>
<div>
<table><tr><td>2</td></tr></table>
</div>
<p>be</p>
<table><tr><td>3</td></tr></table>
<p>removed</p>
<table><tr><td>4</td></tr></table>
</page>;
removeTables(xml);
trace(xml.toXMLString()); // will output everything but the tables
// Removes <table> children of this node, then recurses into the remaining child elements.
function removeTables(xml : XML) : void {
    xml.replace("table", "");
    for each (var child : XML in xml.elements("*")) removeTables(child);
}
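The original question asks for the opposite of removing tables: keep the text and the links, drop every other tag. The following is a minimal sketch of that idea using Python's standard html.parser, not ActionScript. The names KeepLinks and strip_except_links are hypothetical, and it assumes the embedded HTML is reasonably well formed.

```python
from html import escape
from html.parser import HTMLParser


class KeepLinks(HTMLParser):
    """Collects text and <a> tags, silently dropping every other tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        rendered = ""
        for name, value in attrs:
            # Attributes without a value are written with an empty string.
            safe = escape(value or "", quote=True)
            rendered += f' {name}="{safe}"'
        self.parts.append(f"<a{rendered}>")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append("</a>")

    def handle_data(self, data):
        # Re-escape text so characters such as & and < stay valid markup.
        self.parts.append(escape(data, quote=False))


def strip_except_links(html: str) -> str:
    parser = KeepLinks()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = '<table><tr><td>See <a href="http://example.com">this</a></td></tr></table><p>text</p>'
    print(strip_except_links(sample))
```

Table cell content survives because only the tags are dropped, which matches the requirement stated in the question.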
I'm writing an automated PHP script to check whether my link exists on my partner's website (link exchange). Besides making sure my link exists in the source code, I want to make sure he is not placing it in an HTML comment like <!-- http://www.mywebsite.com --> and cheating me.
I tried to match it with a regular expression, but failed.
Use the DOM and XPath, it ignores comments:
$doc = new DOMDocument();
$doc->loadHTML($htmlstring);
$xpath = new DOMXPath($doc);
$result = $xpath->query('//a[contains(@href, "mywebsite.com")]');
if (!$result->length) echo "You've been cheated\n";
And then, if you still want to know whether your link has been commented out:
if (strpos($htmlstring, 'mywebsite.com') !== false && !$result->length)
echo "Your partner is hiding your link in a comment, sneaky bastard\n";
Sounds like a perfect use for an HTML parser like DOMDocument->loadHTML() and look for an anchor tag with your link. He could still remove it via javascript on the browser side, but that's a different issue.
If it's a cat and mouse game of "are you showing a link to my site" using a standard parser is your best bet. There are just too many ways for a regex to fail on html.
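To illustrate why a parser handles this more reliably than a regex, here is a small sketch in Python's standard html.parser (the answers above use PHP's DOMDocument; this is only an analogue). The names LinkAudit and audit are hypothetical. A parser reports comments and real anchor tags through separate callbacks, so the two cases cannot be confused.

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Records whether a domain appears in a real <a href> or only in a comment."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.in_anchor = False
        self.in_comment = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.needle in href:
                self.in_anchor = True

    def handle_comment(self, data):
        # Called with the text between <!-- and -->.
        if self.needle in data:
            self.in_comment = True


def audit(html, needle):
    """Return 'linked', 'commented out', or 'missing'."""
    parser = LinkAudit(needle)
    parser.feed(html)
    parser.close()
    if parser.in_anchor:
        return "linked"
    if parser.in_comment:
        return "commented out"
    return "missing"


if __name__ == "__main__":
    page = '<p><!-- <a href="http://www.mywebsite.com">me</a> --></p>'
    print(audit(page, "mywebsite.com"))
```

As noted above, this only inspects the served markup; a link removed by JavaScript in the browser would still be reported as linked.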
I want to extract the image URLs from any website. I am reading the page source through WebRequest. I want a regular expression that will fetch the image URL from this content, i.e. the src value in the <img> tag.
I'd recommend using an HTML parser to read the html and pull the image tags out of it, as regexes don't mesh well with data structures like xml and html.
In C#: (from this SO question)
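// Requires the HtmlAgilityPack package (using HtmlAgilityPack;)
var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");
var nodes = doc.DocumentNode.SelectNodes("//img[@src]"); // null when nothing matches
if (nodes != null)
{
    foreach (var node in nodes)
    {
        Console.WriteLine(node.GetAttributeValue("src", ""));
    }
}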
/(?:\"|')[^\\x22*<>|\\\\]+?\.(?:jpg|bmp|gif|png)(?:\"|')/i
is a decent one I have used before. This gets any reference to an image file within an html document. I didn't strip " or ' around the match, so you will need to do that.
Try this*:
<img .*?src=["']?([^'">]+)["']?.*?>
Tested here with:
<img class="test" src="/content/img/so/logo.png" alt="logo homepage">
Gives
$1 = /content/img/so/logo.png
The $1 (you have to mouseover the match to see it) corresponds to the part of the regex between (). How you access that value will depend on what implementation of regex you are using.
*If you want to know how this works, leave a comment
EDIT
As nearly always with regexp, there are edge cases:
<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">
This would be matched as 'hack'.
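The edge case can be reproduced, and avoided, with a short sketch. It uses Python rather than any language from the answers above, and the names NAIVE, ImgSrcCollector and img_sources are hypothetical. The regex is the one given above; the parser-based version reads the src attribute by name, so text inside other attributes does not confuse it.

```python
import re
from html.parser import HTMLParser

# The regex from the answer above, unchanged.
NAIVE = re.compile(r"""<img .*?src=["']?([^'">]+)["']?.*?>""")


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def img_sources(html):
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


if __name__ == "__main__":
    tricky = '<img title="src=hack" src="/content/img/so/logo.png" alt="logo homepage">'
    print(NAIVE.search(tricky).group(1))  # the regex picks up the title text
    print(img_sources(tricky))            # the parser reads the real attribute
```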