Non-destructive parsing and modification of HTML elements in C++

I need to make some simple modifications to HTML in C++, preferably without completely rewriting the markup, which is what happens when I use libxml2 or MSHTML.
In particular, I need to be able to read, and then potentially modify, the "src" attribute of every "img" element. It needs to be robust enough to handle any valid HTML, but preferably without changing any of the other HTML in the process.
Are there any libraries out there that can handle this? Or is this something I could do with regular expressions? I'm not too savvy with regular expressions, and I've read many questions here saying you shouldn't use them to parse HTML, but I'm not clear whether that advice applies to a task like this, or primarily to parsing in the sense of building a tree from the HTML.

Regular expressions aren't recommended for parsing HTML in general because they can't handle arbitrarily nested tags. For a flat task like this, reading and rewriting one attribute on one kind of element, they should be fine.
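If you do go the regex route, here is a minimal C++ sketch of that idea (the function name and pattern are my own illustration, not from any library). It handles only double-quoted src attributes and will be fooled by "<img" text inside comments or scripts, which is exactly the fragility the parse-HTML-with-regex warnings are about:

```cpp
#include <functional>
#include <regex>
#include <string>

// Rewrite the src attribute of every <img> tag by applying `transform`
// to the old value; all other text is copied through untouched.
// Caveat: double-quoted attributes only, and no awareness of comments
// or scripts containing "<img".
std::string rewrite_img_src(
    const std::string& html,
    const std::function<std::string(const std::string&)>& transform)
{
    // Group 1: "<img ... src=\"", group 2: the value, group 3: closing quote.
    static const std::regex img_src(R"((<img\b[^>]*?\bsrc=")([^"]*)("))",
                                    std::regex::icase);
    std::string out;
    auto last = html.cbegin();
    for (std::sregex_iterator it(html.cbegin(), html.cend(), img_src), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);   // everything before this match
        out += m[1].str();              // up to and including the opening quote
        out += transform(m[2].str());   // the rewritten attribute value
        out += m[3].str();              // the closing quote
        last = m[0].second;
    }
    out.append(last, html.cend());      // everything after the last match
    return out;
}
```

Because only the matched attribute value is replaced, the rest of the document, including whitespace and attribute order, survives byte for byte, which is the "non-destructive" property the question asks for.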

Try looking at HTML Tidy.
I have used it for similar things in the past.

Related

Creating my own HTML parser

I know that post, and I've already read it, but I'd still like to learn how an HTML parser actually works. Does it parse the whole source with regexes, or is it written as ordinary code in a language such as C# or Python?
Apart from that, can you give me a brief idea of where I should start in creating my own parser? (I'd like to create an HTML parser for my personal needs. :)
Python, Java, and Perl are all fine languages for learning to write an HTML parser. Perl is very pleasant for regular expressions, but that's not what you need for a parser. It is a bit more pleasant to write OO programs in Python or Java. C/C++/C#, etc., are also common, for very fast parsers. However, as a learning exercise, I recommend Python or Java, so that you can compare your work with standard parsers.
The standard way is to use a Lex/Yacc pair: Lex generates code that splits the input into tokens, and Yacc generates code that converts that token stream into some desired structure.
There is also a more tempting option, Ragel. With it you write one big regex-like structure capable of matching the entire file, and define hooks that fire when particular sub-patterns are matched.
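As a toy illustration of that lexer/parser split (all names here are my own, and this handles only a trivial HTML subset with no attributes or comments), the "Lex" half below produces a flat token stream and the "Yacc" half consumes it:

```cpp
#include <string>
#include <vector>

// Tokens a Lex-generated scanner might emit for a toy HTML subset.
enum class Tok { OpenTag, CloseTag, Text };
struct Token { Tok kind; std::string value; };

// The "Lex" half: split the input into tokens. No structure yet.
std::vector<Token> lex(const std::string& in) {
    std::vector<Token> out;
    size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '<') {
            size_t end = in.find('>', i);
            if (end == std::string::npos) break;   // unterminated tag: give up
            if (i + 1 < in.size() && in[i + 1] == '/')
                out.push_back({Tok::CloseTag, in.substr(i + 2, end - i - 2)});
            else
                out.push_back({Tok::OpenTag, in.substr(i + 1, end - i - 1)});
            i = end + 1;
        } else {
            size_t end = in.find('<', i);
            if (end == std::string::npos) end = in.size();
            out.push_back({Tok::Text, in.substr(i, end - i)});
            i = end;
        }
    }
    return out;
}

// The "Yacc" half would build a tree from the token stream; here we just
// check that open and close tags nest properly, using an explicit stack.
bool balanced(const std::vector<Token>& toks) {
    std::vector<std::string> stack;
    for (const Token& t : toks) {
        if (t.kind == Tok::OpenTag) stack.push_back(t.value);
        else if (t.kind == Tok::CloseTag) {
            if (stack.empty() || stack.back() != t.value) return false;
            stack.pop_back();
        }
    }
    return stack.empty();
}
```

The point of the split is that the tokenizer never needs to understand nesting, and the structural pass never needs to look at individual characters.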

How do HTML parsers work?

I've seen the humorous threads and read the warnings, and I know that you don't parse HTML with regex. Don't worry... I'm not planning on trying it.
BUT... that leads me to ask: how are HTML parsers coded (including the built-in functions of programming languages, like DOM parsers and PHP's strip_tags)? What mechanism do they employ to parse the (sometimes malformed) markup?
I found the source of one coded in JavaScript, and it actually uses regex to do the job:
// Regular Expressions for parsing tags and attributes
var startTag = /^<(\w+)((?:\s+\w+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>/,
endTag = /^<\/(\w+)[^>]*>/,
attr = /(\w+)(?:\s*=\s*(?:(?:"((?:\\.|[^"])*)")|(?:'((?:\\.|[^'])*)')|([^>\s]+)))?/g;
Do they all do this? Is there a conventional, standard way to code an HTML parser?
I do not know that that style is a “normal” way to do things. It is better than most I’ve seen, but it’s still too close to what I refer to as a “naïve” approach in this answer. For one thing, it isn’t accounting for HTML comments getting in the way of things. There are also legal but somewhat unusual uses of entities it isn’t dealing with. But it’s HTML comments where most such approaches fall down.
A more natural way is to use a lexer to peel off tokens, more like what is shown in this answer’s script, then assemble those tokens meaningfully. The lexer can easily enough be taught about HTML comments.
You could approach this with a full grammar, such as the one shown here for parsing an RFC 5322 mail address. That is the sort of approach I take in the second, “wizardly” solution in this answer. But even that is only a complete grammar for well-formed HTML, and I’m only interested in a few different sorts of tags. Those I define fully, but I don’t define valid fields for the tags I’m unconcerned with.
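To make the comment problem concrete, here is a small hand-rolled scanner sketch (my own illustration, not from the quoted answer) that collects tag names while skipping <!-- ... --> sections. A bare regex for tags would also match the commented-out markup:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Collect tag names in document order, skipping HTML comments -- the
// case the answer says trips up naive regex-based approaches.
std::vector<std::string> tag_names(const std::string& html) {
    std::vector<std::string> names;
    size_t i = 0;
    while ((i = html.find('<', i)) != std::string::npos) {
        if (html.compare(i, 4, "<!--") == 0) {         // comment: skip to -->
            size_t end = html.find("-->", i + 4);
            if (end == std::string::npos) break;        // unterminated comment
            i = end + 3;
            continue;
        }
        size_t j = i + 1;
        if (j < html.size() && html[j] == '/') ++j;     // closing tag
        size_t start = j;
        while (j < html.size() &&
               std::isalnum(static_cast<unsigned char>(html[j]))) ++j;
        if (j > start) names.push_back(html.substr(start, j - start));
        i = j;
    }
    return names;
}
```

Note that the <img> inside the comment in the test below is never reported; a pattern match over the raw text would have found it.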

'javadoc' look-a-like, using parser generator?

I'm going to create a javadoc look-alike for the language I mainly use, but I was wondering: is it worthwhile to use a parser generator for this? The main reason I'd use a parser generator is that I could then use templates for the exported HTML, and PDF templates as well if I need them.
Thanks,
William v. Doorn
If all you are going to do is extract the "Javadoc" comments, you don't need a full parser; after all, you only need to recognize the comments and regexps will likely do fine.
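As a rough sketch of that comment-extraction-only case (assuming Javadoc-style /** ... */ comments; the function name is my own illustration), a single regex really is enough:

```cpp
#include <regex>
#include <string>
#include <vector>

// Pull out the bodies of /** ... */ doc comments. For plain extraction
// no full parser is needed; [\s\S] matches any character, including
// newlines, so comments may span multiple lines.
std::vector<std::string> doc_comments(const std::string& source) {
    static const std::regex doc(R"(/\*\*([\s\S]*?)\*/)");
    std::vector<std::string> out;
    for (std::sregex_iterator it(source.begin(), source.end(), doc), end;
         it != end; ++it)
        out.push_back((*it)[1].str());   // group 1: the comment body
    return out;
}
```

The lazy quantifier (*?) is what stops each match at the first */ rather than swallowing everything up to the last one.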
If you want to extract information from the code and use it to augment the Javadoc comments, you'll need not only a parser but also name and type resolution.
You can see the results of combining parsing, name/type resolution, and Javadoc comment extraction in the Java Source Code Browser, which produces Javadoc results along with fully hyperlinked source code cross-referenced into the Javadocs.
The machinery that produced this is a generalization of something like ANTLR. But there was little need for code templates to produce the HTML itself; all the hard work is in the parsing and in collecting facts across the symbol tables.

How do you implement syntax highlighting?

I am embarking on some learning and I want to write my own syntax highlighter for source files, in C++.
Can anyone give me ideas on how to go about doing this?
To me it seems that when a file is opened:
1. It would need to be parsed to decide what type of source file it is; trusting the extension might not be fool-proof.
2. A way to know which keywords/commands apply to which language.
3. A way to decide what color each keyword/command gets.
I want to do this on OS X, using C++ or Objective-C.
Can anyone provide pointers on how I might get started with this?
Syntax highlighters typically don't go beyond lexical analysis, which means you don't have to parse the whole language into statements and declarations and expressions and whatnot. You only have to write a lexer, which is fairly easy with regular expressions. I recommend you start by learning regular expressions, if you haven't already. It'll take all of 30 minutes.
You may want to consider toying with Flex (the lexical analyzer generator; https://github.com/westes/flex) as a learning exercise. It should be quite easy to implement a basic syntax highlighter in Flex that outputs highlighted HTML or something similar.
In short, you give Flex a set of regular expressions and actions to run on the matching text, and the generated scanner matches your expressions, always taking the longest match. You can make your lexer transition among exclusive states (e.g. in and out of string literals, comments, etc.) as shown in the Flex FAQ. Here's a canonical example of a lexer for C written in Flex: http://www.lysator.liu.se/c/ANSI-C-grammar-l.html.
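The lexical-only approach can also be sketched without Flex. Here is a minimal C++ keyword highlighter (the function name, keyword set, and CSS class are my own illustration, not from any real highlighter): it tokenizes identifiers with std::regex and wraps recognized keywords in a span, leaving all other text untouched.

```cpp
#include <regex>
#include <set>
#include <string>

// Lexical-only highlighting: wrap known keywords in a <span>. A real
// highlighter would also track states (string literals, comments, ...)
// as described above; this only classifies bare identifiers.
std::string highlight(const std::string& code) {
    static const std::set<std::string> keywords = {
        "if", "else", "for", "while", "return", "int", "void"};
    static const std::regex word(R"([A-Za-z_]\w*)");
    std::string out;
    auto last = code.cbegin();
    for (std::sregex_iterator it(code.cbegin(), code.cend(), word), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);           // punctuation, spaces, etc.
        if (keywords.count(m.str()))
            out += "<span class=\"kw\">" + m.str() + "</span>";
        else
            out += m.str();                      // ordinary identifier
        last = m[0].second;
    }
    out.append(last, code.cend());
    return out;
}
```

Swapping in a different keyword set is all it takes to "support" another language at this lexical level, which is why highlighters are usually driven by per-language definition files.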
Making an extensible syntax highlighter would be the next part of your journey. Although I am by no means a fan of XML, take a look at how Kate syntax highlighting files are defined, such as this one for C++. Your task would be to figure out how you want to define syntax highlighters, then make a program that uses those definitions to generate HTML or whatever you please.
You may want to look at how GeSHI implements highlighting, etc. In addition, it has a whole bunch of language packs that contain all the keywords you'll ever want.
Assuming that you are using Cocoa frameworks you can use UTIs to determine the file type.
For an overview of the api:
http://developer.apple.com/mac/library/documentation/FileManagement/Conceptual/understanding_utis/understand_utis_intro/understand_utis_intro.html#//apple_ref/doc/uid/TP40001319-CH201-SW1
For a list of known UTIs:
http://developer.apple.com/mac/library/documentation/Miscellaneous/Reference/UTIRef/Articles/System-DeclaredUniformTypeIdentifiers.html#//apple_ref/doc/uid/TP40009259-SW1
The two keys you are probably most interested in would be kUTTypeObjectiveCPlusPlusSource and kUTTypeCPlusPlusHeader.
For the highlighting you might find the information on this page helpful as it discusses syntax highlighting with an NSView and temporary attributes:
http://www.cocoadev.com/index.pl?ImplementSyntaxHighlightingUsingTemporaryAttributes
I think (1) isn't possible, since the only way to tell if a file is valid C++ is to run it through a C++ parser and see if it parses... but if you used that as your standard, you couldn't operate on code that doesn't compile because it is a work-in-progress, which you probably want to do. It's probably best just to trust the extension, as I don't think any other method will work better than that.
You can get a list of C++ keywords here: http://www.cppreference.com/wiki/keywords/start
The colors are up to you (or if you want, you can make them configurable and leave the choice to the user)

Are Regular Expressions useful for a Web Designer (XHTML/CSS)?

Are Regular Expressions useful for a Web Designer (XHTML/CSS)? Can a web designer get any help if they learn regular expressions?
Anyone who works with text files on a regular basis, which includes all programmers, can benefit from learning regular expressions. They make find-and-replace tasks much easier, and save you a lot of manual editing. Almost all text-editing programs support regular expression searches.
You won't be able to use them in your code if all you're doing is HTML and CSS. But if you start to use JavaScript, you'll find them useful for things like testing the value of input fields.
Yes.
Regular expressions are not part of (X)HTML or CSS, but they are part of the tools you will probably use with them: Javascript, XSLT, and any server scripts.
The key is to remember that, because of the difficulty of parsing a language with quoted strings and SGML-style tags, regex shouldn't be used to parse (X)HTML except as a last resort. Tokenizers exist so you don't have to do that particular hard work yourself. You'll find most regular expressions in use for checking and sanitizing input.
In short: yes, but always use the best tool for the job.
Among other things, they're great for validating input and sanitizing HTML.
Even if you don't use them directly, regexes are a way of thinking about manipulating text that has proven very valuable to me over the years, from the very first project I was paid to write in Perl to the things I do today in .NET.
"Can a web designer get any help if they learn regular expressions?"
If you're asking for some resources to help you with regular expressions, I find the MDC page on Regular Expressions and the Regular Expression Cheat Sheet quite useful references, and the Regular Expression Tester handy for testing regular expressions.
Steve
CSS attribute selectors support some regex-like matching (prefix ^=, suffix $=, and substring *= matches), so the mindset might be of some use in that area.
Other than that, only if you code Javascript.
There are languages and protocols you use when creating web artifacts (pages, scripts, styles, etc.) and there are tools you use to manipulate those artifacts (editors, utilities, interpreters, etc.).
Regular expressions belong to the latter set, and to the more advanced end of its spectrum (the average web designer is likely to have no clue about them). But any computer-literate person would benefit from learning and using regular expressions, not just web designers.
Anything that makes you stand out from the crowd (in a positive way, of course) will benefit you, so go ahead and learn regex: you won't regret it.