Parse exported bookmarks file with ColdFusion - regex

I need to parse a list of bookmarks exported from a browser like Chrome, Firefox and IE. Maybe even Google, etc.
I played around and did something like this reMatchNoCase("(<h3)(.*?)(</dl>)",myfile1) loop. Then I use reMatchNoCase("(<dt[>])(.*?)(</a>)",i) within the h3/dl tags, and then a lot of cleanup, but it's really not reliable.
The thing is that they have categories using h3 tags surrounded by dl tags and then the bookmarks in that. I can't just parse all URLs since I want to get the categories as in the browser.
Thanks.

If it is XHTML, use XPath.
If it is not, it wouldn't be easy. Search https://stackoverflow.com/search?q=parse+html
Can you consider using a hybrid approach: parse with jQuery on the client side first and post to CF?
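A tag-stack parser recovers the categories that the regex loop loses. The following is an added example, not code from the thread, and is written in Python rather than ColdFusion so it runs stand-alone. It treats the export as untrusted input and keeps only http/https entries (the scheme allow-list, the class and function names, and the sample are assumptions made here).

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit
ALLOWED_SCHEMES = {"http", "https"}  # assumption: skip javascript:, file:, data: entries

class BookmarkParser(HTMLParser):
    """Walk a Netscape-format bookmark export, tracking the <H3>/<DL> folder stack."""
    def __init__(self):
        super().__init__()
        self.folders, self.opened = [], []  # folder names; per-<DL> "pushed a name" flags
        self.pending = None                 # <H3> text waiting for the <DL> that follows
        self.capture, self.text, self.href = None, [], None  # tag being read, text, href
        self.bookmarks = []                 # (folder path, title, url)

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "a"):
            self.capture, self.text = tag, []
            self.href = dict(attrs).get("href") if tag == "a" else None
        elif tag == "dl":
            self.opened.append(self.pending is not None)
            if self.pending is not None:
                self.folders.append(self.pending)
            self.pending = None

    def handle_data(self, data):
        if self.capture:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == self.capture:
            label = "".join(self.text).strip()
            if tag == "h3":
                self.pending = label
            elif self.href and urlsplit(self.href).scheme in ALLOWED_SCHEMES:
                self.bookmarks.append(("/".join(self.folders), label, self.href))
            self.capture = None
        elif tag == "dl" and self.opened and self.opened.pop():
            self.folders.pop()

def parse_bookmarks(html_text):
    parser = BookmarkParser()
    parser.feed(html_text)
    parser.close()
    return parser.bookmarks

# Constructed sample in the exported layout: folder name in <H3>, contents in the next <DL>.
SAMPLE = ('<DL><p><DT><H3>News</H3><DL><p><DT><A HREF="https://example.com/a">Site A</A>'
          '<DT><A HREF="javascript:void(0)">Bookmarklet</A><DT><H3>Tech</H3><DL><p>'
          '<DT><A HREF="https://example.org/b">Site B</A></DL><p></DL><p>'
          '<DT><A HREF="https://example.net/c">Site C</A></DL><p>')
```

parse_bookmarks(SAMPLE) returns one tuple per kept bookmark, with nested folders joined by "/" and an empty path for top-level entries.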

Related

Regex to replace spam links in Wordpress

I am dealing with old hacked sites in Wordpress where there are injection spam links on images.
I have access to the database and would like to remove links that look like this:
<a style="text-decoration:none" href="/ansaid-retail-cost">.</a>
Now text varies inside the <href> it might be for cialas or any product, but the rest doesn't vary. I want to remove the entire LINK, so the result is a single space.
I don't know regex, so I would appreciate the help. I've tried online generators but they don't seem to be working.
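A cleanup pass for the injected link shape quoted above, as an added example that is not part of the original question. It assumes the post bodies have already been read out of the database as (ID, content) pairs. The function names and sample rows are constructed here, and the pattern matches only the fixed style attribute with a lone "." as link text, so ordinary links are left alone.

```python
import re

# Only the injected shape from the question: fixed style attribute,
# any href value, and a single "." as the link text.
SPAM_LINK = re.compile(r'<a style="text-decoration:none" href="([^"]*)">\.</a>')

def strip_spam_links(content):
    """Return (cleaned_content, removed_hrefs) for one post body."""
    removed = SPAM_LINK.findall(content)
    return SPAM_LINK.sub(" ", content), removed

def clean_posts(rows):
    """rows: iterable of (post_id, post_content).

    Returns the rows that changed plus an audit list of (post_id, href),
    so the removed targets can be reviewed before anything is written back.
    """
    changed, audit = {}, []
    for post_id, content in rows:
        cleaned, removed = strip_spam_links(content)
        if removed:
            changed[post_id] = cleaned
            audit.extend((post_id, href) for href in removed)
    return changed, audit

# Constructed sample rows standing in for (ID, post_content).
SAMPLE_ROWS = [
    (1, '<img src="a.jpg"><a style="text-decoration:none" href="/ansaid-retail-cost">.</a>Caption'),
    (2, '<a style="text-decoration:none" href="/about">About us</a>'),
    (3, 'No links here'),
]
```

Write the changed rows back with a parameterized UPDATE, and keep the audit list as a record of which paths the injection pointed at.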

Realtime URI-translation of HTML content in C/C++

For the development of a custom reverse proxy (written in C++) I want to do a realtime translation of URIs in HTML content. For example if I want to access a resource on http://myserver/ using http://my-reverse-proxy/myserver, all absolute and toplevel links like http://myserver/somecontent1.ext or /somecontent2.ext need to be modified.
An HTML tag
<img src="/sample.png">
would therefore be translated to
<img src="/myserver/sample.png">
From my point of view there are two approaches:
1) Using regular expressions and string replacement to find all related HTML tags and their paths using capture groups and do some string replacement.
2) Parse entire HTML content, do some transformation on the parse tree and pretty-print the result back to a valid HTML resource.
And this is what this question is all about: Do you have any experiences what solution might be faster and maybe even more reasonable? Do you know a framework I might use to not reinvent the wheel? As this process should be used later for CSS and XML-based resources as well, it should not be an HTML-dependent solution.
Thanks in advance!
Proxy servers generally work by being servers. They handle all HTTP requests, modify the requested URLs, and then pass the modified request on to the server on the other side.
You should stick to this paradigm. It is far easier and more efficient than mucking around with the files themselves. Anything that is being done real-time can be done at the point of the request.
Also, it should probably be asked: why a custom reverse proxy? Such things exist already.

Regex with iframe in Yahoo! Pipes

I'm building a Yahoo! Pipe to pull an RSS feed from Reddit which links to some content in the description. I'm using a regex to match the href attribute of the anchor link in an item.description field. The regex I'm using is:
^.+?href="([^"]+)">\[link\].+?$
As a test, I set the replace to simply:
$1
and I see that the entire description field has been replaced with the URL. So far, so good.
I then put the following in the replace field. The idea being to iframe the content that's linked to:
Content: <iframe src="$1">no iframe support</iframe> End
What I get out however is:
Content: no iframe support End
I've confirmed that this is also coming through in the pipe's output and not just in the Yahoo! Pipes debug console.
I've so far tried replacing my angle brackets with &lt; and &gt; entities. I've tried wrapping the entire thing in a <![CDATA[ ... ]]> block and still, I get nothing. If I break my iframe tag by removing an angle bracket, the broken content comes through fine, but if I have a well-formed iframe element, it vanishes, leaving the "no iframe support" text. Am I doing something wrong here, or is Yahoo! actively preventing me from using iframe tags in my generated pipe? A cursory search on Google isn't turning up anything related to this.
The pipe in question is here:
http://pipes.yahoo.com/pipes/pipe.info?_id=2ba41448cadd2347d86f377efd3d199f
This Pipes FAQ Question "Why does Pipes Strip <object> and <embed> tags... ?" shows that a certain amount of sanitization is performed, by placing content (at least certain content) into an iframe for the safety of RSS consumers - though it does not state it specifically, this probably also removes other iframes in order to avoid nesting and other work-arounds.
Yahoo is big enough I would doubt they have a weak sanitizer, but an extremely long shot is that you might be able to fool it by nesting the iframe in a bunch of other tags (again I doubt this will work). Also depending upon which step does the sanitization, perhaps adding part of the tag in one step, then adding another part somewhere else might work (yet again, doubt overwhelms me)
Not sure what else to suggest, other than getting something else to consume and transform your RSS a little bit more (by fixing otherwise broken tags??) - but that's what you're using pipes for to begin with, isn't it? Idunno...
Good luck!
Pipes has a fanatical devotion to the RSS spec and the spec says the description field is plain text only. HTML etc is supposed to go in the content:encoded field, not that I've had much luck getting pipes to do that.

Cleansing string / input in Coldfusion 9

I have been working with Coldfusion 9 lately (background in PHP primarily) and I am scratching my head trying to figure out how to 'clean/sanitize' input / string that is user submitted.
I want to make it HTMLSAFE, eliminate any javascript, or SQL query injection, the usual.
I am hoping I've overlooked some kind of function that already comes with CF9.
Can someone point me in the proper direction?
Well, for SQL injection, you want to use CFQUERYPARAM.
As for sanitizing the input for XSS and the like, you can use the ScriptProtect attribute in CFAPPLICATION, though I've heard that doesn't work flawlessly. You could look at Portcullis or similar 3rd-party CFCs for better script protection if you prefer.
This is an addition to Kyle's suggestions, not an alternative answer, but the comments panel is a bit rubbish for links.
Take a look at the ColdFusion string functions. You've got HTMLCodeFormat, HTMLEditFormat, JSStringFormat and URLEncodedFormat. All of which can help you with working with content posted from a form.
You can also try to use the regex functions to remove HTML tags, but it's never a precise science. This ColdFusion based regex/html question should help there a bit.
You can also try to protect yourself from bots and known spammers using something like cfformprotect, which integrates Project Honeypot and Akismet protection amongst other tools into your forms.
You've got several options:
"Global Script Protection" Administrator setting, which applies a regular expression against post and get (i.e. FORM and URL) variables to strip out <script/>, <img/> and several other tags
Use isValid() to validate variables' data types (see my in depth answer on this one).
<cfqueryparam/>, which serves to create SQL bind parameters and validate the datatype passed to it.
That noted, if you are really trying to sanitize HTML, use Java, which ColdFusion can access natively. In particular use the OWASP AntiSamy Project, which takes an HTML fragment and whitelists what values can be part of it. This is the same approach that sites like SO and slashdot.org use to protect submissions and is a more secure approach to accepting markup content.
Sanitization of strings in ColdFusion, and in nearly any language, is very important and depends on what you want to do with the string. Most mitigations are for
saving content to database (e.g. <cfqueryparam ...>)
using content to show on next page (e.g. put url-parameter in link or show url-parameter in text)
saving files and using upload filenames and content
There is always a risk if you follow the idea to prevent and reduce a string by allowing basically everything in the first step and then sanitize malicious code "away" by deleting or replacing characters (blacklist approach).
The better solution is to replace strings with rereplace(...) against regular expressions that explicitly allow only the characters needed for the scenario you use it in, as an easy solution, whenever this is possible. Use cases are inputs for numbers, lists, email addresses, URLs, names, zip, cities, etc.
For example if you want to ask for an email address, you could use
<cfif reFindNoCase("^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.(?:[A-Z]{2,})$", stringtosanitize)>...ok, clean...<cfelse>...not ok...</cfif>
(or an own regex).
For HTML input or CSS input I would also recommend OWASP Java HTML Sanitizer Project.
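The three controls named in these answers (allow-list validation on input, bind parameters when saving, encoding when the value is shown) applied to one form post. This is an added example, not code from any of the answers, and is written in Python with an in-memory SQLite table rather than CFML so it runs stand-alone. The field names, table and sample posts are assumptions made here.

```python
import html
import re
import sqlite3

# Allow-list patterns: each accepts only what the field needs (assumed field names).
RULES = {
    "email": re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}", re.IGNORECASE),
    "zip": re.compile(r"[0-9]{5}"),
}

def validate(form):
    """Return the fields that fail their pattern; fullmatch also rejects a trailing newline."""
    return [name for name, rx in RULES.items() if not rx.fullmatch(form.get(name, ""))]

def save_comment(db, form):
    """Bind parameters keep values out of the SQL text (the role cfqueryparam plays)."""
    db.execute("INSERT INTO comments (email, zip, body) VALUES (?, ?, ?)",
               (form["email"], form["zip"], form["body"]))

def render_comments(db):
    """Free text is stored as typed and encoded at the point it is written into HTML."""
    rows = db.execute("SELECT email, body FROM comments ORDER BY rowid")
    return ["<li>%s: %s</li>" % (html.escape(e), html.escape(b)) for e, b in rows]

def handle(db, form):
    """Reject the request on any validation failure instead of repairing the value."""
    errors = validate(form)
    if not errors:
        save_comment(db, form)
    return errors

def demo():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE comments (email TEXT, zip TEXT, body TEXT)")
    # Constructed form posts.
    good = {"email": "ann@example.com", "zip": "12345", "body": "O'Brien said <b>hi</b>"}
    bad = {"email": "ann@example.com<script>", "zip": "12345", "body": "x"}
    return handle(db, good), handle(db, bad), render_comments(db)
```

The free-text body has no useful allow-list, which is why it relies on the other two controls: the apostrophe is harmless to the INSERT because it is bound, and the markup is inert on output because it is encoded.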

Multiple pages html output from a .rst document in Django

I'm writing a Django app to serve some documentation written in RestructuredText.
I have many documents written in *.rst, each of them is quite long with many sections, subsections and so on.
Displaying the whole document in a single page is not a problem using Django filters, but I'd rather have just the topic index on a first page, with links to a URL where I can display a single section / subsection (which will need some 'previous | up | home | next' link I guess...). In a way similar to a 'multiple HTML page output' as in a docbook / XML to HTML conversion.
Can anyone point me to some direction to build a document tree of a *.rst document and parse a single section of it, or suggest a clever way to obtain a similar result?
Choice 1. Include URL links to the other parts of the document.
You write an index.rst, part1.rst, part2.rst, etc. And your index.rst has links to the other parts. This requires almost no work, except careful planning to make sure that your RST HTML links are correct.
There's no "parse". You just break your document into sections. Manually.
[This seems so obvious, I'm afraid to mention it.]
Choice 2. Use Sphinx. It manages table-of-contents and inter-document connections very nicely.
However, the Sphinx extensions to RST aren't handled directly by Django, so you'd need to save the Sphinx output and then display that in Django. We use the JSON HTML Builder (http://sphinx.pocoo.org/builders.html?highlight=json#sphinx.builders.html.JSONHTMLBuilder) output from Sphinx. Then we render these documents through a template.