I am embarking on some learning and I want to write my own syntax highlighting for files in C++.
Can anyone give me ideas on how to go about doing this?
To me it seems that when a file is opened:
1. The file would need to be parsed to decide what type of source file it is. Trusting the extension might not be foolproof.
2. There needs to be a way to know which keywords/commands apply to which language.
3. There needs to be a way to decide what color each keyword/command gets.
I want to do this on OS X, using C++ or Objective-C.
Can anyone provide pointers on how I might get started with this?
Syntax highlighters typically don't go beyond lexical analysis, which means you don't have to parse the whole language into statements and declarations and expressions and whatnot. You only have to write a lexer, which is fairly easy with regular expressions. I recommend you start by learning regular expressions, if you haven't already. It'll take all of 30 minutes.
You may want to consider toying with Flex (the lexical analyzer generator; https://github.com/westes/flex) as a learning exercise. It should be quite easy to implement a basic syntax highlighter in Flex that outputs highlighted HTML or something.
In short, you would give Flex a set of regular expressions and what to do with matching text, and the generator will greedily match against your expressions. You can make your lexer transition among exclusive states (e.g. in and out of string literals, comments, etc.) as shown in the flex FAQ. Here's a canonical example of a lexer for C written in Flex: http://www.lysator.liu.se/c/ANSI-C-grammar-l.html .
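To make the "set of regular expressions plus an action" idea concrete, here is a minimal hand-rolled sketch that uses std::regex instead of Flex. The rule table, the CSS class names and the function names (escapeHtml, highlightToHtml) are all made up for illustration, and the keyword list is deliberately tiny. Rules are tried in order at the current position and the first one that matches wins, which only approximates what a generated lexer does.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Escape the characters that are special in HTML text.
std::string escapeHtml(const std::string& text) {
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;";  break;
            case '>': out += "&gt;";  break;
            default:  out += c;
        }
    }
    return out;
}

// Turn source text into HTML with <span class="..."> around recognised tokens.
std::string highlightToHtml(const std::string& source) {
    // Hypothetical rule table: CSS class -> pattern. Order matters.
    static const std::vector<std::pair<std::string, std::regex>> rules = {
        {"comment", std::regex(R"re(//[^\n]*|/\*[\s\S]*?\*/)re")},
        {"string",  std::regex(R"re("(\\.|[^"\\\n])*")re")},
        {"keyword", std::regex(R"re((if|else|for|while|return|int|static|class)\b)re")},
        {"number",  std::regex(R"re(\d+(\.\d+)?)re")},
        {"",        std::regex(R"re([A-Za-z_]\w*)re")},  // identifiers: no span
    };

    std::string html;
    auto cursor = source.cbegin();
    while (cursor != source.cend()) {
        bool matched = false;
        for (const auto& rule : rules) {
            std::smatch m;
            // match_continuous: the match must start exactly at the cursor.
            if (std::regex_search(cursor, source.cend(), m, rule.second,
                                  std::regex_constants::match_continuous)) {
                const std::string text = escapeHtml(m.str(0));
                if (rule.first.empty()) {
                    html += text;
                } else {
                    html += "<span class=\"" + rule.first + "\">" + text + "</span>";
                }
                cursor += m.length(0);
                matched = true;
                break;
            }
        }
        if (!matched) {  // whitespace, punctuation, anything unrecognised
            html += escapeHtml(std::string(1, *cursor));
            ++cursor;
        }
    }
    return html;
}
```

Calling highlightToHtml on a line of code returns a string you can drop into a <pre> element and style with CSS. A Flex-generated lexer would handle multi-line comments and strings with explicit states instead of one big pattern.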
Making an extensible syntax highlighter would be the next part of your journey. Although I am by no means a fan of XML, take a look at how Kate syntax highlighting files are defined, such as the one for C++. Your task would be to figure out how you want to define syntax highlighters, then make a program that uses those definitions to generate HTML or whatever you please.
You may want to look at how GeSHi implements highlighting. In addition, it has a whole bunch of language packs that contain all the keywords you'll ever want.
Assuming that you are using Cocoa frameworks you can use UTIs to determine the file type.
For an overview of the api:
http://developer.apple.com/mac/library/documentation/FileManagement/Conceptual/understanding_utis/understand_utis_intro/understand_utis_intro.html#//apple_ref/doc/uid/TP40001319-CH201-SW1
For a list of known UTIs:
http://developer.apple.com/mac/library/documentation/Miscellaneous/Reference/UTIRef/Articles/System-DeclaredUniformTypeIdentifiers.html#//apple_ref/doc/uid/TP40009259-SW1
The two keys you are probably most interested in are kUTTypeObjectiveCPlusPlusSource and kUTTypeCPlusPlusHeader.
For the highlighting you might find the information on this page helpful as it discusses syntax highlighting with an NSView and temporary attributes:
http://www.cocoadev.com/index.pl?ImplementSyntaxHighlightingUsingTemporaryAttributes
I think (1) isn't possible, since the only way to tell if a file is valid C++ is to run it through a C++ parser and see if it parses... but if you used that as your standard, you couldn't operate on code that doesn't compile because it is a work-in-progress, which you probably want to do. It's probably best just to trust the extension, as I don't think any other method will work better than that.
You can get a list of C++ keywords here: http://www.cppreference.com/wiki/keywords/start
The colors are up to you (or if you want, you can make them configurable and leave the choice to the user)
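One possible way to keep the keyword list and the colors separate, so that either can be swapped out, is sketched below. Every name here (TokenKind, Language, Theme, makeCppSubset, makeDefaultTheme, classify, colorFor) is invented for the example, the keyword set is only a small subset, and the color values are arbitrary.

```cpp
#include <map>
#include <string>
#include <unordered_set>

enum class TokenKind { Keyword, Identifier };

// Per-language data: here just a name and a keyword set.
struct Language {
    std::string name;
    std::unordered_set<std::string> keywords;
};

// Per-user data: which color each kind of token gets.
using Theme = std::map<TokenKind, std::string>;

// A deliberately small subset of the C++ keywords.
Language makeCppSubset() {
    return {"C++", {"class", "else", "for", "if", "return", "static", "while"}};
}

// Arbitrary default colors; a user could overwrite any entry.
Theme makeDefaultTheme() {
    return {{TokenKind::Keyword, "#0000ff"}, {TokenKind::Identifier, "#000000"}};
}

// Decide what kind of token a word is in the given language.
TokenKind classify(const Language& lang, const std::string& word) {
    return lang.keywords.count(word) ? TokenKind::Keyword : TokenKind::Identifier;
}

// Look up the color for a token kind, falling back to black.
std::string colorFor(const Theme& theme, TokenKind kind) {
    auto it = theme.find(kind);
    return it != theme.end() ? it->second : "#000000";
}
```

Supporting another language then means supplying another Language value, and making colors configurable means letting the user edit the Theme.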
I noticed something while looking at the input language of a piece of software (openFOAM) that is written in C++. Its entries look something like this:
field value;
For example,
temperature 25;
I wonder how this works. How is temperature set to 25 without using an equals sign? Any idea?
Because openFOAM has a parser that understands that format.
Such a parser is not hard to make, and C++ (with the help of a parser-generator tool like yacc or bison) is a popular choice for parsers because it can be very fast.
C++ is a Turing-complete language, therefore it can do anything that any other language can. Specifically, it can process data that doesn't look like C++ code.
Programming languages are typically context free. If the language is context free it can be parsed (I won't say it's trivial, but it's a problem that has been solved many times before, and if you take a compilers class you'll have to do it for simple languages).
The parser is simply looking for declarations of the format field value;. It appears there is no type checking happening, so it is simply splitting on the semicolon and then on the space. Read about parsing context-free languages, pushdown automata, and context-free grammars if you're interested in learning about parsing source code.
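The "split on the semicolon, then on the space" idea can be sketched in a few lines. This is only an illustration of that description, not how openFOAM actually reads its files (its real format has nested dictionaries, comments and more). The function name parseEntries is made up, and values are kept as plain strings.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse text of the form "field value; field value; ..." into a map.
std::map<std::string, std::string> parseEntries(const std::string& text) {
    std::map<std::string, std::string> entries;
    std::istringstream statements(text);
    std::string statement;
    // First split: everything up to the next ';' is one statement.
    while (std::getline(statements, statement, ';')) {
        std::istringstream words(statement);
        std::string field, value;
        // Second split: whitespace separates the field name from its value.
        if (words >> field >> value) {
            entries[field] = value;
        }
    }
    return entries;
}
```

With the input "temperature 25;" the returned map holds the key temperature with the value "25". No equals sign is needed because the position of each word already says which is the name and which is the value.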
I have worked with Java for a while now, and I found Checkstyle to be very useful. I am starting to work with C++ and I was wondering if there is a style checker with similar functionality. I am mainly looking for the ability to write customized checks.
What about Vera++ ?
Vera++ is a programmable tool for verification, analysis and transformation of C++ source code.
Vera++ is mainly an engine that parses C++ source files and presents the result of this parsing to scripts in the form of various collections - the scripts are actually performing the requested tasks.
Click here to see a more complete demo of what it can do.
crc.hpp:157: keyword 'explicit' not followed by a single space
crc.hpp:588: closing curly bracket not in the same line or column
dynamic_property_map.hpp:82: keyword 'if' not followed by a single space
functional.hpp:106: line is longer than 100 characters
multi_index_container.hpp:472: comma should be followed by whitespace
version.hpp:37: too many consecutive empty lines
weak_ptr.hpp:108: keyword 'catch' not followed by a single space
...
I have had good feedback about Artistic Style, which lets you apply a uniform style to code without too much hassle.
It's free and there are plenty of "classic" styles already defined. It might not work with C++0x new constructs though.
I am also expecting a Clang library, though I haven't found any to date. Normally, given Clang's structure it should be relatively easy, but then it's always easier to say than to code and I guess nobody took the time yet.
KWStyle seems to be a lightweight fit
I'm about to start developing a sub-component of an application to evaluate math functions with operands of C++ objects. This will be accessed via a user interface to provide drag and drop, feedback of appropriate types followed by an execute button.
I'm quite interested in using flex and bison for this, having looked at equation parsing and the like, both here and further afield. What I'm unsure of is whether flex/bison is appropriate when you're trying to parse with custom C++ types. Obviously normal parsing is done on text and this is quite a departure from that, so I wanted to see what people thought, and whether I'm trying to put a square peg in a round hole.
What do you think?
Edit
There are some very good sources of information in the links people have provided below. One that looks promising but hasn't been mentioned yet is Boost.Spirit. I was taking a look through the examples earlier today and there are some informative calculator-based examples in the boost/libs/spirit/examples directory, should you have Boost downloaded and be interested. Their homepage is here.
Please check out muparser
Flex and Bison are the right tools for parsing arithmetic expressions, equations and the like.
Here are a few examples:
Parsing arithmetic expressions - Bison and Flex. It uses C (and not C++), but you can readapt it.
Flex Bison C++ Template/Example. A "framework" to integrate Flex and Bison "into a modern C++ program". An arithmetic calculator is included as an example.
Certainly sounds like a square peg in a round hole to me (unless I grossly misunderstand the question):
Flex would create a state machine to tokenize a stream, in your case - the contents are already tokenized
Bison sounds a bit more relevant, since it can deal with operator precedence, but integrating with it would be too much of a pain for the relatively small benefit.
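If the input really does arrive already tokenized, operator precedence can be handled by hand with a small precedence-climbing loop, with no generator involved. The sketch below is an assumption-laden illustration: Token, operand, oper, precedence, apply and evaluate are invented names, and double stands in for whatever custom C++ object type the operands really have.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// One element of the already-tokenized input.
struct Token {
    enum Kind { Operand, Operator } kind;
    double value;  // stand-in for a custom C++ object
    char op;       // '+', '-', '*' or '/'
};

Token operand(double v) { return Token{Token::Operand, v, '\0'}; }
Token oper(char c) { return Token{Token::Operator, 0.0, c}; }

// Higher number binds tighter.
int precedence(char op) { return (op == '*' || op == '/') ? 2 : 1; }

double apply(char op, double a, double b) {
    switch (op) {
        case '+': return a + b;
        case '-': return a - b;
        case '*': return a * b;
        case '/': return a / b;
    }
    throw std::invalid_argument("unknown operator");
}

// Precedence climbing: read one operand, then keep folding in operators
// whose precedence is at least minPrec. All operators are left-associative.
double evaluate(const std::vector<Token>& tokens, std::size_t& pos, int minPrec) {
    if (pos >= tokens.size() || tokens[pos].kind != Token::Operand) {
        throw std::invalid_argument("expected an operand");
    }
    double lhs = tokens[pos++].value;
    while (pos < tokens.size() && tokens[pos].kind == Token::Operator &&
           precedence(tokens[pos].op) >= minPrec) {
        const char op = tokens[pos++].op;
        const double rhs = evaluate(tokens, pos, precedence(op) + 1);
        lhs = apply(op, lhs, rhs);
    }
    return lhs;
}

// Evaluate a whole token sequence; leftover tokens are an error.
double evaluate(const std::vector<Token>& tokens) {
    std::size_t pos = 0;
    const double result = evaluate(tokens, pos, 1);
    if (pos != tokens.size()) throw std::invalid_argument("unexpected token");
    return result;
}
```

For example, the sequence operand(2), oper('+'), operand(3), oper('*'), operand(4) evaluates the multiplication first. Swapping double for a real operand type would mean changing apply and the value field.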
I work on an open source project focused around Biblical texts. I would like to create a standard string format to build up a search string. I would then need to parse the search string and run the search with the options given. There are a number of different options, from scope of the search, to searching multiple texts, to wildcards, etc.
I'm thinking that using something like lex/yacc to generate a parser for this format might be a good idea. I think the Xapian project uses the Lemon parser generator to achieve a similar goal. My question is, is using one (or more) of these tools the best way to accomplish this?
In addition to the question, I would appreciate any links to resources on these tools (and any others that might be options). The biggest problem I've run into so far is that most of the examples and tutorials are either geared towards a programming language or something simple like a calculator rather than parsing a string format.
Tools like Lex and Yacc are suitable for your purposes. A parser for a search string is not that different from a parser for a programming language (the big difference is that a search string parser generates rules guiding the search, while the programming language parser generates a parse tree from which code is generated).
I assume your syntax will contain rules like the following:
expression : word
| expression AND expression
| expression OR expression
| NOT expression
| '(' expression ')'
all of which are easy to express in Yacc.
You can look at A Compact Guide to Lex & Yacc which I've found very useful for learning Lex and Yacc
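For comparison, the same grammar can also be handled by a small hand-written recursive-descent parser instead of Yacc. The sketch below makes several assumptions that the grammar above does not state: NOT binds tighter than AND, AND binds tighter than OR, tokens are separated by whitespace, and a "word" matches when it appears in a given set of words. The names tokenize, QueryParser and matches are made up for the example.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Split a query on whitespace; parentheses become tokens of their own.
std::vector<std::string> tokenize(const std::string& query) {
    std::string spaced;
    for (char c : query) {
        if (c == '(' || c == ')') { spaced += ' '; spaced += c; spaced += ' '; }
        else spaced += c;
    }
    std::istringstream in(spaced);
    std::vector<std::string> tokens;
    for (std::string t; in >> t;) tokens.push_back(t);
    return tokens;
}

// Evaluates the query directly against a set of words while parsing.
class QueryParser {
public:
    QueryParser(std::vector<std::string> tokens, const std::set<std::string>& words)
        : tokens_(std::move(tokens)), words_(words) {}

    bool parse() {
        const bool result = parseOr();
        if (pos_ != tokens_.size())
            throw std::runtime_error("unexpected '" + tokens_[pos_] + "'");
        return result;
    }

private:
    bool peekIs(const std::string& t) const {
        return pos_ < tokens_.size() && tokens_[pos_] == t;
    }
    bool parseOr() {  // or-expr : and-expr (OR and-expr)*
        bool value = parseAnd();
        while (peekIs("OR")) { ++pos_; const bool rhs = parseAnd(); value = value || rhs; }
        return value;
    }
    bool parseAnd() {  // and-expr : not-expr (AND not-expr)*
        bool value = parseNot();
        while (peekIs("AND")) { ++pos_; const bool rhs = parseNot(); value = value && rhs; }
        return value;
    }
    bool parseNot() {  // not-expr : NOT not-expr | primary
        if (peekIs("NOT")) { ++pos_; return !parseNot(); }
        return parsePrimary();
    }
    bool parsePrimary() {  // primary : word | '(' or-expr ')'
        if (pos_ >= tokens_.size())
            throw std::runtime_error("query ended where a word was expected");
        if (peekIs("(")) {
            ++pos_;
            const bool value = parseOr();
            if (!peekIs(")")) throw std::runtime_error("missing ')'");
            ++pos_;
            return value;
        }
        const std::string& word = tokens_[pos_];
        if (word == ")" || word == "AND" || word == "OR")
            throw std::runtime_error("expected a word but found '" + word + "'");
        ++pos_;
        return words_.count(word) > 0;
    }

    std::vector<std::string> tokens_;
    const std::set<std::string>& words_;
    std::size_t pos_ = 0;
};

// True if the query is satisfied by the given set of words.
bool matches(const std::string& query, const std::set<std::string>& words) {
    return QueryParser(tokenize(query), words).parse();
}
```

A real search would build a query tree rather than evaluate on the spot, but the shape of the functions mirrors the grammar rules one to one, and each throw is a place to put a user-friendly error message.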
If you're trying to build a parser in C++ have a look at
boost::spirit
It certainly is advanced C++, but it will build quite complex and performant parsers from C++ templates without code generation. It took me a few days to get into it, but using and modifying the samples was straightforward. I also recommend reading the following book:
C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond by David Abrahams and Aleksey Gurtovoy
Keep "syntax error diagnosis and message" in your mind uppermost - if the user makes a mistake, a handcrafted recursive-descent-style parser can have some idea based on what it has scanned so far, what mistake the user might have made. If you're going to use an automated tool, be sure to test how it responds to typical user typos - genius-programmers can handle cryptic messages from their compiler, while it sounds like you are targeting a much less sophisticated user who therefore needs a friendlier interface.
While editing this and that in Vim, I often find that its syntax highlighting (for some filetypes) has some defects. I can't remember any examples at the moment, but someone surely will. Usually, it consists of strings badly highlighted in some cases, some things with arithmetic and boolean operators and a few other small things as well.
Now, vim uses regexes for that kinda stuff (its own flavour).
However, I've started to come across editors which, at first glance, have syntax highlighting better taken care of. I've always thought that regexes are the way to go for that kind of stuff.
So I'm wondering, do those editors just have better written regexes, or do they take care of that in some other way ? What ? How is syntax highlighting taken care of when you want it to be "stable" ?
And in your opinion, which editor has handled it best (your editor of choice), and how does it do it (language-wise)?
Edit-1: For example, editors like Emacs, Notepad2, Notepad++, Visual Studio - do you perchance know what mechanism they use for syn. high. ?
The thought that immediately comes to mind for what you'd want to use instead of regexes for syntax highlighting is parsing. Regexes have a lot of advantages, but as we see with vim's highlighting, there are limits. (If you look for threads about using regexes to analyze XML, you'll find extensive material on why regexes can't do what parsers do.)
Since what we want from syntax highlighting is for it to follow the syntactic structure of the language, which regexes can only approximate, you need to perform some level of real parsing to go beyond what regexes can do. A simple recursive descent lexer will probably do great for most languages, I'm thinking.
Some programming languages have a formal definition/specification written in Backus-Naur Form. All*) programming languages can be described in it. All you then need, is some kind of parser for the notation.
*) not verified
For instance, C's BNF definition is "only five pages long".
If you want accurate highlighting, you need real programming, not regular expressions. Regexes are rarely the answer for anything but trivial tasks. To do highlighting in a better way you need to write a simple parser. Parsers basically have separate components that can each do something like identify and consume a quoted string or a number literal. If a component, looking at its given cursor position, can't consume what's underneath, it does nothing. From that you can parse or highlight fairly simply.
Given something like
static int field = 123;
• The first matcher would skip the whitespace before "static". The keyword, literal etc. matchers would do nothing because handling whitespace is not their thing.
• The keyword matcher, when positioned over "static", would consume that. Because "s" is not a digit, the literal matcher does nothing. The whitespace skipper does nothing as well because "s" is not a whitespace character.
Naturally your loop continues to advance the cursor over the input string until the end is reached. The ordering of your matchers is of course important.
This approach is both flexible in that it handles syntactically incorrect fragments and is also easy to extend and reuse individual matchers to support highlighting of other languages...
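A rough sketch of that matcher loop might look like the following. The names (Span, Matcher, matchWhitespace, matchNumber, matchWord, matchKeyword, scan) and the three-word keyword list are assumptions made for the example. Each matcher reports how many characters it consumes at the cursor, and 0 means "not mine".

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct Span { std::string kind; std::string text; };

// A matcher returns how many characters it consumes at pos (0 = nothing).
using Matcher = std::function<std::size_t(const std::string&, std::size_t)>;

std::size_t matchWhitespace(const std::string& s, std::size_t pos) {
    std::size_t n = 0;
    while (pos + n < s.size() && std::isspace(static_cast<unsigned char>(s[pos + n]))) ++n;
    return n;
}

std::size_t matchNumber(const std::string& s, std::size_t pos) {
    std::size_t n = 0;
    while (pos + n < s.size() && std::isdigit(static_cast<unsigned char>(s[pos + n]))) ++n;
    return n;
}

// A word is a letter or underscore followed by letters, digits or underscores.
std::size_t matchWord(const std::string& s, std::size_t pos) {
    if (pos >= s.size()) return 0;
    const unsigned char first = static_cast<unsigned char>(s[pos]);
    if (!std::isalpha(first) && first != '_') return 0;
    std::size_t n = 1;
    while (pos + n < s.size() &&
           (std::isalnum(static_cast<unsigned char>(s[pos + n])) || s[pos + n] == '_')) ++n;
    return n;
}

// Consumes a whole word, but only if it is in the (tiny, made-up) keyword list.
std::size_t matchKeyword(const std::string& s, std::size_t pos) {
    static const std::vector<std::string> keywords = {"static", "int", "return"};
    const std::size_t n = matchWord(s, pos);
    if (n == 0) return 0;
    for (const auto& kw : keywords) {
        if (s.compare(pos, n, kw) == 0) return n;
    }
    return 0;
}

// Advance a cursor over the input, letting the first willing matcher consume.
std::vector<Span> scan(const std::string& input) {
    static const std::vector<std::pair<std::string, Matcher>> matchers = {
        {"whitespace", matchWhitespace},
        {"keyword",    matchKeyword},  // must come before the generic word matcher
        {"number",     matchNumber},
        {"identifier", matchWord},
    };
    std::vector<Span> spans;
    std::size_t pos = 0;
    while (pos < input.size()) {
        std::size_t consumed = 0;
        std::string kind = "other";
        for (const auto& entry : matchers) {
            consumed = entry.second(input, pos);
            if (consumed > 0) { kind = entry.first; break; }
        }
        if (consumed == 0) consumed = 1;  // nobody claimed it: pass one character through
        spans.push_back({kind, input.substr(pos, consumed)});
        pos += consumed;
    }
    return spans;
}
```

Running scan over the example line above yields one span per token, each tagged with the matcher that claimed it. Because unclaimed characters are passed through one at a time, incomplete or incorrect code still gets highlighted as far as possible.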
I suggest the use of REs for syntax highlighting. If it's not working properly, then your RE isn't powerful or complicated enough :-) This is one of those areas where REs shine.
But given that you couldn't supply any examples of failure (so we can tell you what the problem is) or the names of the editors that do it better (so we can tell you how they do it), there's not a lot more we'll be able to give you in an answer.
I've never had any trouble with Vim with the mainstream languages and I've never had a need to use weird esoteric languages, so it suits my purposes fine.