I know this post, I've already read it, but I'd still like to learn what language an HTML parser may use. I mean, does it parse the whole source with a regex, or does it use a normal programming language such as C# or Python?
Apart from the question above, can you also brief me on where I should start if I want to create my own parser? (I'd like to create an HTML parser for my personal needs. :)
Python, Java, and Perl are all fine languages for learning to write an HTML parser. Perl is very pleasant for regular expressions, but that's not what you need for a parser. It is a bit more pleasant to write OO programs in Python or Java. C/C++/C#, etc., are also common, for very fast parsers. However, as a learning exercise, I recommend Python or Java, so that you can compare your work with standard parsers.
The standard way is to use the Lex/Yacc duo: Lex generates code that splits the input into tokens, and Yacc generates code that converts the token stream into some desired structure.
There is also a more tempting option, Ragel. There you just write a big regexp-like structure capable of matching the entire file and define hooks that fire when a certain sub-pattern is matched.
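To make the tokenizing half concrete, here is a minimal hand-rolled sketch in C++ (not generated by Lex or Ragel; the token names and the naive tag/text split are mine, purely for illustration) that chops an HTML fragment into tag and text tokens:

    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical token kinds, for illustration only.
    struct Token {
        enum Kind { Tag, Text } kind;
        std::string value;
    };

    // Very naive splitter: anything between '<' and '>' becomes a Tag token,
    // everything else becomes a Text token. Real HTML needs far more care
    // (comments, attributes containing '>', scripts, entities, ...).
    std::vector<Token> tokenize(const std::string& input) {
        std::vector<Token> tokens;
        std::size_t i = 0;
        while (i < input.size()) {
            if (input[i] == '<') {
                std::size_t end = input.find('>', i);
                if (end == std::string::npos) break;   // unterminated tag
                tokens.push_back({Token::Tag, input.substr(i + 1, end - i - 1)});
                i = end + 1;
            } else {
                std::size_t end = input.find('<', i);
                if (end == std::string::npos) end = input.size();
                tokens.push_back({Token::Text, input.substr(i, end - i)});
                i = end;
            }
        }
        return tokens;
    }

    int main() {
        for (const Token& t : tokenize("<p>Hello <b>world</b></p>"))
            std::cout << (t.kind == Token::Tag ? "TAG  " : "TEXT ") << t.value << "\n";
    }

A Yacc-style parser would then consume that token stream and build whatever structure you need; the generator tools mainly save you from writing both halves by hand.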
I'm wondering if Perl is a good (easy to use and to learn) tool for this:
I'd like to do some custom preprocessing on my C/C++ source code. Basically, this is to allow me to insert my own custom annotations into the source code and generate new code based on them. The required processing is mainly line-oriented search/replace and insertion of new source code lines.
I can think of two tools to achieve this: (1) UltraEdit's scripting feature (or any other capable editor), and (2) Perl scripting.
UltraEdit's scripting looks good and I'm familiar with it. Best of all, its natural line-oriented processing is a good abstraction for processing source code lines.
I'm wondering if Perl is also a good tool. I have zero experience with Perl, except that I'm familiar with Perl-style regular expressions used in other contexts. Is Perl a good tool for line-oriented text processing? I'll have to search forward and backward and replace source code lines with other text.
Yes, Perl is a good tool for what you want. Personally I'd go for Python: it's quick, easy, beautiful, and has a good regex module in the standard library; but it's purely a matter of taste.
Perl is an excellent tool for this if you are familiar with it. It's essentially geared towards that kind of text analysis and translation, so you'll find that it has all of the extensions you could ask for.
Another option is to use UltraEdit's JavaScript functionality. The execution speed on it is a little slower than what you can get in Perl, but it provides a decent user interface where you can use UltraEdit to indicate where you want the changes to be made. Also, UltraEdit JavaScript has a great deal more flexibility than UltraEdit scripting.
I can't personally recommend Python for it, but I'm currently part of a company initiative to use it for exactly that kind of function, so hopefully the previous answerer is right.
I'm trying to write a tokenizer for CSS in C++, but I have no idea how to write a tokenizer. I know that it should be greedy, reading as much input as possible for each token, and in theory I know how I could put that in code.
I have looked at Boost.Tokenizer, and it seems nice, but it doesn't help me whatsoever. It sure is a nice wrapper for a tokenizer, but the problem lies in writing the token splitter, the TokenizerFunction in Boost terms.
I have no idea how to write this tokenizer. Are there any "neat" ways of doing it, like something that closely resembles the syntax itself?
Please note, I'm not looking for a parser! My application doesn't need to be able to understand CSS, just read a CSS file into a general internal tokenized format, process some things, and output it again.
Writing a "correct" lexer and/or parser is more difficult than you might think. And it can get ugly when you start dealing with weird corner cases.
My best suggestion is to invest some time in learning a proper lexer/parser system. CSS should be a fairly easy language to implement, and then you will have acquired an amazingly powerful tool you can use for all sorts of future projects.
I'm an Old Fart® and I use lex/yacc (or things that use the same syntax) for this type of project. I first learned to use them back in the early 80's, and the effort to learn them has been repaid many, many times over.
BTW, if you have anything approaching a BNF of the language, lex/yacc can be laughably easy to work with.
Boost.Spirit.Qi would be my first choice.
Spirit.Qi is designed to be a practical parsing tool. The ability to generate a fully-working parser from a formal EBNF specification inlined in C++ significantly reduces development time. Programmers typically approach parsing using ad hoc hacks with primitive tools such as scanf. Even regular-expression libraries (such as boost regex) or scanners (such as Boost tokenizer) do not scale well when we need to write more elaborate parsers. Attempting to write even a moderately-complex parser using these tools leads to code that is hard to understand and maintain.
The Qi tutorials even finish by implementing a parser for an XMLish language; writing a grammar for CSS should be considerably easier.
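To give a feel for what an inlined grammar looks like, here is a small hedged sketch with Spirit.Qi (not taken from the Qi tutorials; the declaration format is simplified to a single property: value; pair, and the identifier rule is made up):

    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <string>

    namespace qi = boost::spirit::qi;

    int main() {
        std::string input = "color: red;";
        std::string name, value;

        auto it = input.begin();
        // Grammar, inlined in C++: identifier ':' identifier ';'
        bool ok = qi::phrase_parse(
            it, input.end(),
            +(qi::alnum | qi::char_('-')) >> ':' >> +(qi::alnum | qi::char_('-')) >> ';',
            qi::space,
            name, value);

        if (ok && it == input.end())
            std::cout << "property=" << name << " value=" << value << "\n";
        else
            std::cout << "no match\n";
    }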
In my master's program I learned how to write parsers and compilers using ANTLR. But in the real world, we often need to parse and extract relevant content from a heavy load of incoming stream data. Each language has its own regular expression engine, which can conveniently be used to parse the data. Alternatively, we can write an EBNF grammar and use a slick tool like ANTLR to automatically generate the parser. The latter approach is less error-prone and more reliable than the former (especially when there are extra spaces or newlines).
I would just like to know where the borderline between these two worlds lies: when would one write a whole grammar and generate a parser, versus quickly using the language's built-in regex engine to roll out a small parser that does the work quickly enough? Again, I am not looking for arguments, but trying to understand to what extent, and with which approach, people go about writing parsers.
If your input stream is processable by a regular expression and it isn't complex, then use a regex.
A stream of records where each record has a slot and value can be processed pretty reasonably this way.
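For example, here is a hedged C++ sketch of that slot/value case using std::regex (the record format and field names are invented for illustration):

    #include <iostream>
    #include <regex>
    #include <sstream>
    #include <string>

    int main() {
        // Hypothetical input: one "slot = value" record per line.
        std::istringstream in("user = alice\nrole = admin\nquota = 10\n");

        std::regex record(R"(^\s*(\w+)\s*=\s*(.+?)\s*$)");
        std::string line;
        while (std::getline(in, line)) {
            std::smatch m;
            if (std::regex_match(line, m, record))
                std::cout << "slot: " << m[1] << "  value: " << m[2] << "\n";
        }
    }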
If the stream has arbitrarily nested records, doing it by regex is impractical (in fact impossible), and you should switch to using a BNF and parser generator.
I am embarking on some learning and I want to write my own syntax highlighting for files in C++.
Can anyone give me ideas on how to go about doing this?
To me it seems that when a file is opened:
1. It would need to be parsed to decide what type of source file it is; trusting the extension might not be fool-proof.
2. A way to know what keywords/commands apply to which language.
3. A way to decide what color each keyword/command gets.
I want to do this on OS X, using C++ or Objective-C.
Can anyone provide pointers on how I might get started with this?
Syntax highlighters typically don't go beyond lexical analysis, which means you don't have to parse the whole language into statements and declarations and expressions and whatnot. You only have to write a lexer, which is fairly easy with regular expressions. I recommend you start by learning regular expressions, if you haven't already. It'll take all of 30 minutes.
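As a rough illustration of how far plain regular expressions can take you, here is a toy C++ sketch (the keyword list and the CSS class name are placeholders, not from any real highlighter) that wraps a few keywords in HTML spans:

    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
        std::string code = "if (count > 0) return count; else return 0;";

        // Placeholder keyword set; a real highlighter would load these per language.
        std::regex keywords(R"(\b(if|else|for|while|return|int|double)\b)");

        // Wrap every keyword in a span; "$&" stands for the matched text.
        std::string html = std::regex_replace(code, keywords, "<span class=\"kw\">$&</span>");
        std::cout << html << "\n";
    }

A real lexer also has to track context such as string literals and comments, which is where the state-based approach described below comes in.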
You may want to consider toying with Flex (the lexical analyzer generator; https://github.com/westes/flex) as a learning exercise. It should be quite easy to implement a basic syntax highlighter in Flex that outputs highlighted HTML or something.
In short, you would give Flex a set of regular expressions and what to do with matching text, and the generator will greedily match against your expressions. You can make your lexer transition among exclusive states (e.g. in and out of string literals, comments, etc.) as shown in the flex FAQ. Here's a canonical example of a lexer for C written in Flex: http://www.lysator.liu.se/c/ANSI-C-grammar-l.html .
Making an extensible syntax highlighter would be the next part of your journey. Although I am by no means a fan of XML, take a look at how Kate syntax highlighting files are defined, such as this one for C++. Your task would be to figure out how you want to define syntax highlighters, then make a program that uses those definitions to generate HTML or whatever you please.
You may want to look at how GeSHI implements highlighting, etc. In addition, it has a whole bunch of language packs that contain all the keywords you'll ever want.
Assuming that you are using the Cocoa frameworks, you can use UTIs (Uniform Type Identifiers) to determine the file type.
For an overview of the API:
http://developer.apple.com/mac/library/documentation/FileManagement/Conceptual/understanding_utis/understand_utis_intro/understand_utis_intro.html#//apple_ref/doc/uid/TP40001319-CH201-SW1
For a list of known UTIs:
http://developer.apple.com/mac/library/documentation/Miscellaneous/Reference/UTIRef/Articles/System-DeclaredUniformTypeIdentifiers.html#//apple_ref/doc/uid/TP40009259-SW1
The two keys you are probably most interested in are kUTTypeObjectiveCPlusPlusSource and kUTTypeCPlusPlusHeader.
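As a rough sketch of what that lookup might look like from C++ (this assumes the Core Services UTType C functions UTTypeCreatePreferredIdentifierForTag and UTTypeConformsTo; double-check against the documentation linked above):

    #include <CoreServices/CoreServices.h>   // link with -framework CoreServices
    #include <iostream>

    int main() {
        // Map a filename extension to its preferred UTI.
        CFStringRef uti = UTTypeCreatePreferredIdentifierForTag(
            kUTTagClassFilenameExtension, CFSTR("cpp"), NULL);

        if (uti) {
            // Check whether that UTI conforms to the C++ source type.
            Boolean isCppSource = UTTypeConformsTo(uti, kUTTypeCPlusPlusSource);
            std::cout << "C++ source? " << (isCppSource ? "yes" : "no") << "\n";
            CFRelease(uti);
        }
    }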
For the highlighting you might find the information on this page helpful as it discusses syntax highlighting with an NSView and temporary attributes:
http://www.cocoadev.com/index.pl?ImplementSyntaxHighlightingUsingTemporaryAttributes
I think (1) isn't possible, since the only way to tell if a file is valid C++ is to run it through a C++ parser and see if it parses... but if you used that as your standard, you couldn't operate on code that doesn't compile because it is a work-in-progress, which you probably want to do. It's probably best just to trust the extension, as I don't think any other method will work better than that.
You can get a list of C++ keywords here: http://www.cppreference.com/wiki/keywords/start
The colors are up to you (or if you want, you can make them configurable and leave the choice to the user)
I work on an open source project focused around Biblical texts. I would like to create a standard string format to build up a search string. I would then need to parse the search string and run the search with the options given. There are a number of different options, from scope of the search, to searching multiple texts, to wildcards, etc.
I'm thinking that using something like lex/yacc to generate a parser for this format might be a good idea. I think the Xapian project uses Lemon to achieve a similar goal. My question is: is using one (or more) of these tools the best way to accomplish this?
In addition to the question, I would appreciate any links to resources on these tools (and any others that might be options). The biggest problem I've run into so far is that most of the examples and tutorials are either geared towards a programming language or something simple like a calculator rather than parsing a string format.
Tools like Lex and Yacc are suitable for your purposes. A parser for a search string is not that different from a parser for a programming language (the big difference is that a search string parser generates rules guiding the search, while the programming language parser generates a parse tree from which code is generated).
I assume your syntax will contain rules like the following:
expression : word
| expression AND expression
| expression OR expression
| NOT expression
| '(' expression ')'
all of which are easy to express in Yacc.
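If you want to prototype without a generator first, a grammar this small can also be hand-coded. Below is a hedged recursive-descent sketch in C++ (not Yacc output; the tokenizer is deliberately naive, and the precedence NOT over AND over OR is my own choice) that only reports whether a query parses:

    #include <cctype>
    #include <iostream>
    #include <string>
    #include <vector>

    // Simplified tokenizer: bare alphanumeric words, AND/OR/NOT keywords, parentheses.
    std::vector<std::string> tokenize(const std::string& s) {
        std::vector<std::string> toks;
        std::size_t i = 0;
        while (i < s.size()) {
            if (std::isspace(static_cast<unsigned char>(s[i]))) { ++i; }
            else if (s[i] == '(' || s[i] == ')') { toks.push_back(std::string(1, s[i])); ++i; }
            else {
                std::size_t j = i;
                while (j < s.size() && std::isalnum(static_cast<unsigned char>(s[j]))) ++j;
                toks.push_back(s.substr(i, j - i));
                i = j;
            }
        }
        return toks;
    }

    struct Parser {
        std::vector<std::string> toks;
        std::size_t pos;

        bool at(const std::string& t) const { return pos < toks.size() && toks[pos] == t; }
        bool eat(const std::string& t) { if (at(t)) { ++pos; return true; } return false; }

        // expression : term (OR term)*
        bool expression() { if (!term()) return false; while (eat("OR")) { if (!term()) return false; } return true; }
        // term : factor (AND factor)*
        bool term() { if (!factor()) return false; while (eat("AND")) { if (!factor()) return false; } return true; }
        // factor : NOT factor | '(' expression ')' | word
        bool factor() {
            if (eat("NOT")) return factor();
            if (eat("(")) return expression() && eat(")");
            if (pos < toks.size() && toks[pos] != ")" && toks[pos] != "AND" && toks[pos] != "OR") { ++pos; return true; }
            std::cerr << "expected a word at token " << pos << "\n";   // crude error diagnosis
            return false;
        }
    };

    int main() {
        Parser p{tokenize("genesis AND NOT (flood OR ark)"), 0};
        std::cout << (p.expression() && p.pos == p.toks.size() ? "parsed" : "syntax error") << "\n";
    }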
You can look at A Compact Guide to Lex & Yacc, which I've found very useful for learning Lex and Yacc.
If you're trying to build a parser in C++, have a look at
boost::spirit
It certainly is advanced C++, but it will build quite complex and performant parsers from C++ templates without code generation. It took me a few days to get into it, but using and modifying the samples was straightforward. I also recommend reading the following book:
C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond by David Abrahams and Aleksey Gurtovoy
Keep "syntax error diagnosis and message" in your mind uppermost - if the user makes a mistake, a handcrafted recursive-descent-style parser can have some idea based on what it has scanned so far, what mistake the user might have made. If you're going to use an automated tool, be sure to test how it responds to typical user typos - genius-programmers can handle cryptic messages from their compiler, while it sounds like you are targeting a much less sophisticated user who therefore needs a friendlier interface.