Is there an alternative to HTML Tidy? - c++

I have embedded HTML Tidy in my application to clean incoming HTML. But Tidy has a huge amount of bugs and fixing them directly in the source is my worst nightmare. Tidy source code is an unreadable abomination. Thousand+ line functions, poor variable naming, spaghetti code etc. It's truly horrible.
Worse yet, official development seems to have ceased. In the last 12 months, there have been three write transactions to the official CVS repo. But it's been dead and buried for much longer than that...
So I'm looking for an OSS C or C++ application/library that can do what Tidy can (when it feels like it): fix bad HTML markup and transform it into valid XHTML (this is the part I'm interested in). And I mean all sorts of bad markup.
Is there something like that out there?
EDIT: I need it both for manipulations on the DOM tree by an XML handling tool and for general compliance with the XHTML spec. My app needs to accept HTML from users (which is often invalid in all sorts of ways) and output valid XHTML. It needs to be able to handle even HTML that would normally not display in a browser because the user edited it by hand and didn't check afterwards.
A drop-in replacement for Tidy's error-correcting parser... that doesn't suck. I don't mind bugs if the source is readable and I can fix problems myself, or if there are active developers who provide bugfixes on a timely basis.

Could you tell us what you plan to use this tool for? As in, do you want to fix static web pages, or do you want some sort of filtering step before other manipulations, so that some tool can handle buggy web pages?
Personally, I write my own tool atop Python's BeautifulSoup or lxml whenever I need to --- it's at most a dozen line script and does much of what I want.

There is a new, nice, proper HTML 5 supporting Tidy, so the alternative to old, ugly Tidy would be Tidy (GitHub repository).

Try Pretty Diff. It is a vastly superior beautification algorithm and it does not make any assumptions about your input.
http://prettydiff.com/?m=beautify&html

For something that actually fixes code, your best bet is still HTML Tidy. There are a lot of linters, but not really anything that repairs errors to HTML, other than Tidy.
At first glance, modern OOP programmers might think that the source code is an unreadable abomination, but in the C world, Tidy is pretty sophisticated library that uses a lot of advanced OO concepts and offers a very thoughtful interface that exposes nearly all of its functionality in a pure C API.
A casual developer will be lost, but once immersed, the code is quite beautiful. Granted, naming conventions are a mixed bad, but PR's are welcome!

Related

From LaTeX to BBCode and back

I wasn't able to recover a similar thread, but I'm surprised nobody asked something so elementary before.
I would like to convert a couple of (quite long, so long I don't want to do it manually) LaTeX notes into something I can post in a forum which supports TeX code between the [tex]...[/tex] BBCode delimiters.
Hence I would like to find an automated way to replace, say,
$e^{i\pi}$
with
[tex]e^{i\pi}[/tex]
and vice versa (easier); possibly something I can write once and for all and execute each time I need it. The best of all would be a solution which also converts \section{...}, \subsection{...} and other environments, but this isn't mandatory, since the only issue with these documents is that they contain tons of math.
My impression is that a professional tool like, say, PanDoc, is too much a "nuke the fly" approach (not to mention I'm not able to use it)... I'm able to use a couple of features of the sublime-text editor, so it would be wonderful if you want to help me referring to it. In any case, keep in mind that I feel kinda yahoo about regex-stuff and suchlike (I've always seen them like a sorcery, or better, I was too dumb to learn them), so please be verbose. :)
LaTeX is a Turing-complete programming language, so a simple regex won't do what you want in general. That said, Andrew Stacey specializes in compiling LaTeX code to various formats, e.g. today's post on G+. I bet he has a program that would parse your latex and emit bbcode.

Best choice for runtime templating engine?

We're designing an app that will generate lots of different types of text output, eg email, html, sms, etc. The output will be generated using some kind of template, with data coming from a db. Our requirements include:
Basic logic / calculated fields within template. Eg "ifs" and "for" loops, plus some things like adding percentages for tax etc
Runtime editing. Our users need to be able to tweak the templates to their needs, such as change boilerplate text, add new logic, etc
Multi lingual. We need to choose the correct template for the current culture.
Culture sensitive. Eg dates and currencies will output according to current ui culture.
Flexibility. We need the templates to be able to handle multiple repeating groups, hierarchies, etc.
Cannot use commercial software as a solution (e.g. InfoPath). We need to be able to modify the source code at any time.
The app is c#.net. We are considering using T4, XML + XSLT or hosting the Razor engine. Given that the syntax cant be too overwhelming for non-techie users, we'd like to get your opinion on which you feel is the right templating engine for us. We're happy to consider ones notalready mentioned too.
Thanks.
I'm very hesitant to try and answer this question on a forum, because technology choices depend on far more factors than are conveyed in the question, including things such as attitude to risk, attitude to open source, previous good and bad experiences, politics and leadership on the project etc. The big advantage of XSLT over Razor is that it's a standard and has multiple implementations on multiple platforms (including at least three implementations on .NET!) so there's no lock-in; but that doesn't seem to be a factor in your statement of requirements. And the fact that you're using .NET suggests that supplier lock-in isn't something that worries you anyway.
One thing to bear in mind is that non-programmers often take to XSLT a lot more quickly than programmers do. Its rule-based declarative approach, and its XML syntax, sometimes make programmers uncomfortable (it's not like anything they have seen before) but end-users often take to it like ducks to water.
We've decided to go with Razor Hosting. The reason why I've posted this is an answer is that I thought it would help others if I include the following article link:
http://www.west-wind.com/weblog/posts/2010/Dec/27/Hosting-the-Razor-Engine-for-Templating-in-NonWeb-Applications
This excellent piece of work by Rick Strahl makes it really easy to host Razor.

Writing a tokenizer, where to begin?

I'm trying to write a tokenizer for CSS in C++, but I have no idea how to write a tokenizer. I know that it should be greedy, reading as much input as possible, for each token, and in theory I know how I could put that in code.
I have looked at Boost.Tokenizer, and it seems nice, but it doesn't help me whatsoever. It sure is a nice wrapper for a tokenizer, but the problem lies in writing the token splitter, the TokenizerFunction in Boost terms.
I have no idea how to write this tokenizer, are there any "neat" ways of doing it, like something that closely resembles the syntax itself?
Please note, I'm not looking for a parser! My application doesn't need to be able to understand CSS, just read a CSS file to a general internal tokenized format, process some things and output again.
Writing a "correct" lexer and/or parser is more difficult than you might think. And it can get ugly when you start dealing with weird corner cases.
My best suggestion is to invest some time in learning a proper lexer/parser system. CSS should be a fairly easy language to implement, and then you will have acquired an amazingly powerful tool you can use for all sorts of future projects.
I'm an Old FartĀ® and I use lex/yacc (or things that use the same syntax) for this type of project. I first learned to use them back in the early 80's and it has returned the effort to learn them many, many times over.
BTW, if you have anything approaching a BNF of the language, lex/yacc can be laughably easy to work with.
Boost.Spirit.Qi would be my first choice.
Spirit.Qi is designed to be a practical parsing tool. The ability to generate a fully-working parser from a formal EBNF specification inlined in C++ significantly reduces development time. Programmers typically approach parsing using ad hoc hacks with primitive tools such as scanf. Even regular-expression libraries (such as boost regex) or scanners (such as Boost tokenizer) do not scale well when we need to write more elaborate parsers. Attempting to write even a moderately-complex parser using these tools leads to code that is hard to understand and maintain.
The Qi tutorials even finish by implementing a parser for an XMLish language; writing a grammar for CSS should be considerably easier.

Is it possible for programmer to analyze unknown code fast?

I got a task related to ANCIENT C++ project which hasn't any documentation, comments at all and all code/variables is written in foreign language. Do I have a chance to analyze this code in a 1 working day and make a design/UML to create new features? I have been sitting around for 3 hours already and I feel so frustrated... Maybe somebody also had same problem? Any advice?
BR,
I suspect the biggest issue may be the fact that it's in a foreign language. You can use various static code analysis tools to try and understand what's going on, but if everything is presented in an unfamiliar language then that's still no use. Your first step (I believe) is to find someone who can speak this language and get them to translate as you go...
1) Use Doxygen , You can configure doxygen to extract the code structure from undocumented source files.
2) Use source Insight, Source Insight is an advanced code editor and browser with built-in analysis for C/C++, C#, and Java programs
Short answer, no - you probably don't have a chance to understand the code in one day. Reading/maintaining code is one of the hardest things to do, especially when it's lacking documentation. The fact that the code is in a foreign language (!) makes it even harder.
Sounds like you are on a very restricted (unrealistic) time-budget, but Working With Legacy Software is a good book if you're working with legacy systems. If you are planning to keep adding new features to the legacy system it's your responsibility to make your management aware of the scope of the operation. Or at least try.
Under this time constraint (1 day) it may or may not be doable depending on the size of the project - if its a few hundred lines of code then for sure. If its a serious project with several tens of thousands code lines, then likely no.
The first thing you need to know is what is this program supposed to do at all. If you have no idea what it does and how it does it, then analyzing the code will give you the answer but it will be a long and frustrating task. So my first suggestion would be to get yourself familiar with the outer workings of the software - what does it supposed to do and generally how it is supposed to do it. If you are doing it as part as your work then you should be able to get someone to walk you through using the program - even if its UI is in a foreign language (which I hope it doesn't, even if the code is written by a foreign language speaker).
Once you know what the software is attempting to do, then it should be fairly straight forward (even if lengthy and daunting) to rewrite all the comments in your own language for you to understand. I suggest doing so in a bottoms-up approach: its easier to understand the small and trivial things a program does, then to understand the top-level logic - and a lot of trivial things in order make up the logic of the software.
Only once you understand - to a large degree, anyway - the inner workings of the program you may write its functional spec and work on features.
Non-free way on Windows:
You can use CppDepend. This application is able to parse your visual project or your source files. It gives you a lot of information like dependency trees. You can try the trial (Maybe it will be enough for what you have to do).
Free way multi-platform:
You can use doxygen with a special configuration (extract code structure from undocumented code) and analyze the result.
I was quite happy with a tool called Understand (15-day eval license available) for this kind of task. However, I agree with Guss that the time you'll need depends a lot on the size of the code, and one day is probably just enough for a small program.
cscope & ctags are a must when I do my own code, and even more when looking to other's code.
You may also try this ::
http://www.sgvsarc.com/product_crystalflow.htm

Why does XSLT seem to irritate so many people? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 months ago.
Improve this question
What it is about XSLT that people find irritating? Is it the syntax (which is pretty unusual) or just the way XSLT works in general? Are there features that are lacking?
I did a little bit of XSLT (around 800 lines) a while ago and found it not that bad. So why the general animosity against it?
I think people find it difficult to get their heads around XSLT (and bitch about it) because it is functional and declarative in nature, unlike c# or java programming. Navigating around documents can end up being complicated when XPATH statements get clever - though this is a feature of XPATH rather than XSLT. XPATH typically gets complex when you don't know at design time the exact structure of a document so you start querying siblings, descendents and ancestors. This is when people inheriting a complex XSLT start considering career changes!
With XSLT it is very much 'right tool for the right job'. It is designed to transform an xml document into another xml document extremely quickly and efficiently. XSLT is almost certainly the best tool to use for this purpose because of its extensibility, the fact that it has been written for this purpose, widespread support for it in xml processors across the board, and in case i didnt mention it already, performance. Common use-cases:
converting an xml document purely containing data into a document exposing a user-interface such as an xhtml document
converting an xml document into a different structure to suit someone elses schema e.g. Biz2Biz communications
A great implementation of the xslt technology is the apache-cocoon project which transforms xml documents into multiple output formats including html, excel, chart images, pdf's with an extensible plugin architecture. We use it a lot for our reporting platform and it works very well. When developers start with it, they find the same familiar issues. Once they get over them, they would typically be writing what i am here.
I once worked with a guy who didnt want to work with (and learn) XSLT and ended up presenting a demo to the client which took over 20 seconds to render a page. When i finally persuaded him to use an XSLT transform instead of his dumb DOM code it took under a second.
I like xslt, and use it quite a bit. As long as you think in terms of functional programming (i.e. set-once variables, similar to F# etc), then it is hugely versatile. I use it regularly for data transformation, presentation (in particular [x]html), and versatile code generation.
Definitely highly programming related; nobody except a programmer would grok it - but a very powerful tool.
I have a few xslt (split over a few xsl:import/xsl:include files) that is substantially more than the 800 you mention in the post... it really can (when used correctly) be a fully featured environment.
Notes:
best used at the server; client-side support is hit'n'miss
a few key things missed in 1.0; regex; case-insensitivity; etc
can be tricky if whitespace is important
One particularly useful feature of xslt (as a separate file) is that it makes it possible to change the transform without rebuilding any code. The code-gen example is from an open source project I run; I know of several users who have dipped in and tweaked the code-gen for their local standards. One use even went as far as writing the transform for an entire second language - and all without touching the binaries.
I personally dislike XSLT because it seems to combine several things that are generrally disliked in the developer community:
it uses magic strings (XPATH) that look like noise aka perl reg exs.
xml tags which can make statements verbose - aka xml programming language.
I've worked with XSLT before and I didn't much care for it because I found it extremely verbose for the simple task I wanted to perform.
Just out of curiosity, what did your 800 lines of XSLT do?
XSLT is a really powerful tool in the developer arsenal. I use it all the time for code generation. Performance counters, data access layer, REST interfaces, you name it. anything repetitive.
As a language it sure has its quirks, but as a tool is invaluable.
Many programmers don't have any experience with Functional Programming. XSLT, in many ways, resembles Functional Programming and a new and foreign paradigm to learn.
Learning an unfamiliar programming paradigm can be challenging, let alone learning an unfamiliar programming paradigm expressed in XML.
Code written in a Functional Programming language is typically minimalistic. XML is rarely minimalistic. So folks who know Functional Programming and appreciate its minimalism have to give up that minimalism.
I personally think it is very suitable for certain types of programming problems. To me, in certain situations, it is much easier to maintain a form using XSLT versus having to rewrite/recompile/redeploy code changes. While XSLT is not the only way to accomplish that, I haven't found any other solutions for those cases that is much cleaner and easier.
It has its place. Like everything else, when misused, it becomes a garbled mess of code, just as any language would. When used correctly, it can be a good supplement or solution to a programming problem.
XSLT is very powerful, so long as what you want to do with it matches what it's good for. However, maintaining someone else's XSLT can be a bit daunting. It's a programming language but it's also an XML file, so it can be hard to understand, even when laid out cleanly and adequately commented.
Our Library CMS largely consists of html stylesheets to do almost everything. Our data is XML natively of course. Some of our programmers don't get the functional programming paradigm. Your first experiences might lead to complex templates misusing the iterative features of XSLT. The first thing you have to tell a programmer is not to use the for each statement or travel the xpath axes
If they learn to refrain they may learn to understand the concepts of templates.
I find that the people that complain about XSLT are the ones that misuse it. For example, I think using it as an HTML templating language for a CMS is a terrible idea, unless your data is in XML already. Those people might complain that XSLT is ugly, or verbose, or whatever, but that's because they are using it for the wrong reasons.
XSLT is both functional and imperative at the same time. This trips up a lot of people. they have match and for loops with variables.
It is easy to write bad code in it. But if you follow good patterns you can do some really neat things very easily.
Check out http://www.worldofwarcraft.com/index.xml and http://www.wowarmory.com/index.xml if you have an XSLT-capable browser (FF 3 is good). They are totally written in client side XSLT with underlying XML. It makes scraping those sites REALLY easy and nice and they are forced to keep the data and presentation separate. A great example is their character pages http://www.wowarmory.com/character-achievements.xml?r=Mal%27Ganis&cn=Vosk&gn=Juggernaut
It's an example of turning XML into a programming language. Yuck. I wish people wouldn't do that. We have perfectly good programming languages already, and they are far better at it than XML.
because MS doesn't implement exslt2