Related
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I would like to write a simple in-house program that parses user commands written in a language of our team's own invention (but based closely on another program we are already familiar with). The command parser that I am working on now will simply be the UI through which the user can run the other algorithms I have already written. (Those other algorithms, by the way, are used to generate the input files for a molecular dynamic simulation package called LAMMPS.) The only thing I really have left to do is just write this UI, but as it turns out, writing your own scripting language is almost an intractable challenge for a non software engineer to tackle on his own.
According to the answers I received, what I am try to make would be considered a Domain Specific Language, and it is not advisable to try to make one's own DSL due to the enormous amount of work required to make it useful and bug-free.
The best option then would actually be to use an existing scripting language like Lua or Python, and embed it in the program.
To do this, I will most likely use Lua because it seems most fitting for our needs. So at this point, the rest of this question is no longer relevant since the answer would be: "Don't do it yourself." But I'm still going to keep part of it here for other users to be able read and learn from the wonderful answers below.
Thanks again to everyone who replied!
Old Question:
I would like to write a program that parses a user text input and then
runs a function corresponding to that input. To do this I would need
to parse the string for relevant keywords. I believe there will be
less than 15 keywords when I'm done, so ideally I'd like this code
to be simple and short.
The problem is that I am currently using if-statements to parse the
strings. This is an extremely inconvenient way to parse commands
because even for a short 3 word commands the code explodes into nested-ifs
3 layers deep. So longer 8+ word sentences will become nested-ifs more than
8 layers deep.
This kind of programing approach quickly becomes unmanageable, especially
when I need to make any significant changes to a command.
My question is whether or not there exists a data structure in C++ that
can help me better manage my giant nested-ifs, or if anyone could suggest
a better way to parse a string for lots of different data types (i.e.
substings, ints, and floats) and output an error message when the expected
type is not found?
Here is an example of a short user session to show the kinds of commands
I would like to interpret:
load "Basis.Silicon" as material 1
add material 1 to layer 1
rotate layer 1 about x-axis by 45 degrees
translate layer 1 in x-axis by 10 nm
generate crystal
These commands are based on an already-existing program that our team
uses, but unfortunately the source code for this program has never been
publicly released so I am left guessing as to how it was actually
implemented.
One final note, unlike natural language processors, I know exactly what
the format of each line will be. So my issue isn't so much how to interpret
the text, but rather how to code the logic in a concise and manageable way.
Thanks everyone!
Your question is not clear. And your goals are more difficult than what you believe.
Either you consider that you want to somehow process human language sentences (e.g. in English). Then you want to study natural language processing, and you can find some libraries related to that field.
Or you consider that you want to interpret some formal programming or scripting language. Then you want to study interpreters and compilers. BTW, in that case, you might just embed an existing interpreter (like Lua, Guile, Python, etc....) in your program.
You could also think in terms of expert systems with a knowledge base made of rules (this approach could be viewed as in the middle between NLP and scripting language) You'll then need some inference engine (perhaps CLIPS). See also J.Pitrat's blog.
Notice that even coding a simple interpreter is more difficult than you believe. You absolutely need to represent abstract syntax trees, which you construct from textual input with a parsing phase.
BTW, All of NLP, expert systems, and interpreter design and implementation are difficult fields. You could get a PhD in all 3 fields (but you have to choose which).
If you go the embedded interpreter way: study the interpreters I mentioned (Guile, Lua, Python, Neko, etc...) and choose which one you want, to embed.
If for whatever reason, you want to make an interpreter from scratch: Learn several programming languages first (including scripting languages like Ruby, Python, Ocaml, Scheme, Lua, Neko, ...). Read books on Programming Language Pragmatics (by M.Scott) and Lisp In Small Pieces (by Queinnec). Read also text books on compilation and parsing, and on Garbage Collection and formal (e.g. denotational) semantics. All this may need a dozen years of work.
Notice that by experience embedding a software in an interpreter is a very structuring design. If you did not thought of that at the beginning you probably need to redesign and refactor a lot your existing application. For instance, when embedding a software in an interpreter, you cannot afford that bad input crashes the program. So error handling and memory management (interfacing to the GC of the interpreter) is challenging and gives new constraints. Hence you'll need to re-think your application.
If all this is new (and even if you don't choose e.g. Guile as the embedding interpreter): learn and practice a bit of Scheme -e.g. with Guile or PltScheme- (e.g. reading SICP), read a little bit about λ-calculus and closures, then read Queinnec's Lisp In Small Pieces book. Remember the halting problem (which is partly why interpreters are difficult to code).
BTW the syntax you are proposing (e.g. rotate mat 1 by x 90) is not very readable and looks COBOL-like. If possible, have a language which looks familiar to existing ones. Make it easy to read !
Start by reading all the wikipages I am referencing here.
FWIW, I am the main author of MELT, a domain specific language (inspired a lot by Scheme) to extend the GCC compiler. Some of the papers / documentations I wrote might inspire you (and contain valuable references).
Addenda (after question was reformulated)
You seems to invent some formal syntax like
add material 1 to layer 1
rotate layer 1 about x-axis by 90 degrees
translate layer 1 in x-axis by 10 inches
I can't guess what kind of language is it? Are you implementing a 3D printer? If yes, you should stick to some existing standard formal language in that domain.
I believe that such a COBOL-like syntax is really wrong. The point is that it is too verbose, and that you are wishing to implement some domain specific language. I find your example very bad-looking.
Is that syntax your invention, or is there some document specifying (and many thousands already existing lines coded in) your domain specific language. If you are just inventing it, please reconsider the syntax and the semantics.
First, you need to specify on paper the full syntax and semantics of your DSL.
Is your DSL Turing complete? (I guess that yes, because Turing completeness is reached very quickly - e.g. with variables and loops....). If yes, you are inventing a scripting language. Please don't invent scripting language without knowing several programming & scripting languages (then read Programming Language Pragmatics...). The point is that, if your scripting language will become successful, advanced users will soon or later write important programs in it (e.g. many thousand lines). Then, these advanced users will be programmers. In that case, it is very important (for social & economic reasons) to have a DSL well founded and looking familiar (if possible, an extension of some existing scripting language).
If your DSL already exists, stick to its specification on paper. If that specification is not good enough, improve it with formalization (e.g. by writing some BNF syntax, and some formal (e.g. denotational) semantics for it). Publish and discuss that formalization with existing users.
Several industries got some ad-hoc DSLs which became widely used but was ill designed
(e.g., in the French nuclear industry, the Gibiane DSL designed in the 1970s by nuclear physicists, not computer scientists; the US Boeing corporation is also rumored to have made similar mistakes). Then, maintaining and improving the many hundred thousands lines of DSL scripts is becoming a nightmare (and may means losing millions of dollars or euros). So you better stick to some existing scripting language. The advantages are that there exist some culture on it (e.g. you can find dozens of books on Python or Lua, and many trained engineers familiar with them), that the interpreter is widely used and tested, that the community working on them is improving the interpreters, so it has quite few uncorrected bugs.
You should not attempt to design and implement your own DSL if you are not a trained computer scientist. Stick to some existing scripting language (of course their syntax is not like you want it to be), and leverage on existing implementations and experiment.
As a counter-example, J.Ousterhout has invented the widely used Tcl scripting language, with the claim that scripts are always small (e.g. hundreds of line only) and won't grow to big code base; unfortunately, some of them did, and Tcl is known as a bad language to code many dozens of thousands of lines (even if Tcl is an easy and convenient language for tiny scripts). The moral of the story is that if a (turing complete) scripting language is becoming successful, some "crazy" advanced user will code hundred of thousands of script code. So you need that scripting language to be well designed from the start. Hence, you should adopt and adapt a good existing scripting language (and avoid inventing an unfamiliar syntax without having a good knowledge of several existing scripting languages)
later additions
PS: my criticism of Tcl is not entirely subjective: the point is that Tcl was designed for small scripts in mind (read J.Ousterhout's first papers about Tcl), but my point is that when you offer a Turing-complete scripting language, some "crazy" user will eventually write huge scripts for it. Hence, you need to anticipate such "crazy" usage by offering a scripting language which "scales up" to big scripts, so is built according to software engineering practices for large software code base.
NB. Lua is probably a good choice as a language to embed. It is small, has a nice implementation, is well documented, and has good performance. But be careful about memory management issues (and this advice holds for any scripting language).
EDIT: To be more clear, I would like to have a short list of key words
(<15). The order/presence of which would determine which function will
be run.
You can build a small ruleset engine (e.g. something that processes lists of words). You write that engine/function once and just pass the data structures to it.
As an alternative, a solution using regular expressions would be probably the fastest to code (the engine is ready for you), assuming you're familiar with the regexp syntax (if not, it's still a good investment).
You could build a table of keywords and function pointers:
typedef void (*Function_Pointer)(void);
struct table_entry
{
const char * keyword;
Function_Pointer p_function;
};
table_entry function_table[] =
{
{"car", Process_Car},
{"bike", Process_Bike},
};
Search the table for a keyword. If the keyword is found, dereference the function pointer.
The following snippet will execute the function for processing the word "car":
(function_table[0].p_function)();
There is a famous program, called Eliza, which parses sentences for keywords.
Examples can be found at: Eliza C++ examples
Current Choice: lua-jit. Impressive benchmarks, I am getting used to the syntax. Writing a high performance ABI will require careful consideration on how I will structure my C++.
Other Questions of interest
Gambit-C and Guile as embeddable languages
Lua Performance Tips (have the option of running with a disabled collector, and calling the collector at the end of a processing run(s) is always an option).
Background
I'm working on a realtime high volume (complex) event processing system. I have a DSL that represents the schema of the event structure at source, the storage format, certain domain specific constructs, firing internal events (to structure and drive general purpose processing), and encoding certain processing steps that always happen.
The DSL looks pretty similar to SQL, infact I am using berkeley db (via sqlite3 interface) for long-term storage of events. The important part here is that the processing of events is done set-based, like SQL. I have come to conclusion however that I should not add general-purpose processing logic to the DSL, and rather embed lua or lisp to take care of this.
The processing core is built arround boost::asio, it is multithreaded, rpc is done via protocol buffers, events are encoded using the protocol buffer IO library --i.e., the events are not structured using protocol buffer object they just use the same encoding/decoding library. I will create a dataset object that contains rows, pretty similar to how a database engine stores in memory sets. processing steps in the DSL will be taken care of first and then presented to the general purpose processing logic.
Regardless of what embeddable scripting environment I use, each thread in my processing core will probably needs it's own embedded-language-environment (that is how lua requires it to be at least if you are doing multi-threaded work).
The Question(s)
The choice at the moment is between lisp ECL and lua. Keeping in mind that performance and throughput is a strong requirement, this means minimising memory allocations is highly desired:
If you were in my position which language would you chose ?
are there any alternatives I should consider (don't suggest languages that don't have an embeddable implementation). Javascript v8 perhaps ?
Does lisp fit the domain better ? I don't think lua and lisp are that different in terms of what they provide. Call me out :D
Are there any other properties (like the ones below) I should be thinking about ?
I assert that any form of embedded database IO (see the example DSL below for context) dwarfs the scripting language call on orders of magnitude, and that picking either will not add much overhead to the overall throughput. Am I on the right track ? :D
Desired Properties
I would like to map my dataset onto a lisp list or lua table and I would like to minimise redundant data copies. For example adding a row from one dataset to another should try to use reference semantics if both tables have the same shape.
I can guarantee that the dataset that is passed as input will not change whilst I have made the lua/lisp call. I want lua and lisp to enforce not altering the dataset as well if possible.
After the embedded call end's the datasets should be destroyed, any references created would need to be replaced with copies (I guess).
DSL Example
I attach a DSL for your viewing pleasure so you can get an idea of what I am trying to achieve. Note: The DSL does not show general purpose processing.
// Derived Events : NewSession EndSession
NAMESPACE WebEvents
{
SYMBOLTABLE DomainName(TEXT) AS INT4;
SYMBOLTABLE STPageHitId(GUID) AS INT8;
SYMBOLTABLE UrlPair(TEXT hostname ,TEXT scriptname) AS INT4;
SYMBOLTABLE UserAgent(TEXT UserAgent) AS INT4;
EVENT 3:PageInput
{
//------------------------------------------------------------//
REQUIRED 1:PagehitId GUID
REQUIRED 2:Attribute TEXT;
REQUIRED 3:Value TEXT;
FABRRICATED 4:PagehitIdSymbol INT8;
//------------------------------------------------------------//
PagehitIdSymbol AS PROVIDED(INT8 ph_symbol)
OR Symbolise(PagehitId) USING STPagehitId;
}
// Derived Event : Pagehit
EVENT 2:PageHit
{
//------------------------------------------------------------//
REQUIRED 1:PageHitId GUID;
REQUIRED 2:SessionId GUID;
REQUIRED 3:DateHit DATETIME;
REQUIRED 4:Hostname TEXT;
REQUIRED 5:ScriptName TEXT;
REQUIRED 6:HttpRefererDomain TEXT;
REQUIRED 7:HttpRefererPath TEXT;
REQUIRED 8:HttpRefererQuery TEXT;
REQUIRED 9:RequestMethod TEXT; // or int4
REQUIRED 10:Https BOOL;
REQUIRED 11:Ipv4Client IPV4;
OPTIONAL 12:PageInput EVENT(PageInput)[];
FABRRICATED 13:PagehitIdSymbol INT8;
//------------------------------------------------------------//
PagehitIdSymbol AS PROVIDED(INT8 ph_symbol)
OR Symbolise(PagehitId) USING STPagehitId;
FIRE INTERNAL EVENT PageInput PROVIDE(PageHitIdSymbol);
}
EVENT 1:SessionGeneration
{
//------------------------------------------------------------//
REQUIRED 1:BinarySessionId GUID;
REQUIRED 2:Domain STRING;
REQUIRED 3:MachineId GUID;
REQUIRED 4:DateCreated DATETIME;
REQUIRED 5:Ipv4Client IPV4;
REQUIRED 6:UserAgent STRING;
REQUIRED 7:Pagehit EVENT(pagehit);
FABRICATED 8:DomainId INT4;
FABRICATED 9:PagehitId INT8;
//-------------------------------------------------------------//
DomainId AS SYMBOLISE(domain) USING DomainName;
PagehitId AS SYMBOLISE(pagehit:PagehitId) USING STPagehitId;
FIRE INTERNAL EVENT pagehit PROVIDE (PagehitId);
}
}
This project is a component of a Ph.D research project and is/will be free software. If your interested in working with me (or contributing) on this project, please leave a comment :D
I strongly agree with #jpjacobs's points. Lua is an excellent choice for embedding, unless there's something very specific about lisp that you need (for instance, if your data maps particularly well to cons-cells).
I've used lisp for many many years, BTW, and I quite like lisp syntax, but these days I'd generally pick Lua. While I like the lisp language, I've yet to find a lisp implementation that captures the wonderful balance of features/smallness/usability for embedded use the way Lua does.
Lua:
Is very small, both source and binary, an order of magnitude or more smaller than many more popular languages (Python etc). Because the Lua source code is so small and simple, it's perfectly reasonable to just include the entire Lua implementation in your source tree, if you want to avoid adding an external dependency.
Is very fast. The Lua interpreter is much faster than most scripting languages (again, an order of magnitude is not uncommon), and LuaJIT2 is a very good JIT compiler for some popular CPU architectures (x86, arm, mips, ppc). Using LuaJIT can often speed things up by another order of magnitude, and in many cases, the result approaches the speed of C. LuaJIT is also a "drop-in" replacement for standard Lua 5.1: no application or user code changes are required to use it.
Has LPEG. LPEG is a "Parsing Expression Grammar" library for Lua, which allows very easy, powerful, and fast parsing, suitable for both large and small tasks; it's a great replacement for yacc/lex/hairy-regexps. [I wrote a parser using LPEG and LuaJIT, which is much faster than the yacc/lex parser I was trying emulate, and was very easy and straight-forward to create.] LPEG is an add-on package for Lua, but is well-worth getting (it's one source file).
Has a great C-interface, which makes it a pleasure to call Lua from C, or call C from Lua. For interfacing large/complex C++ libraries, one can use SWIG, or any one of a number of interface generators (one can also just use Lua's simple C interface with C++ of course).
Has liberal licensing ("BSD-like"), which means Lua can be embedded in proprietary projects if you wish, and is GPL-compatible for FOSS projects.
Is very, very elegant. It's not lisp, in that it's not based around cons-cells, but it shows clear influences from languages like scheme, with a straight-forward and attractive syntax. Like scheme (at least in it's earlier incarnations), it tends towards "minimal" but does a good job of balancing that with usability. For somebody with a lisp background (like me!), a lot about Lua will seem familiar, and "make sense", despite the differences.
Is very flexible, and such features as metatables allow easily integrating domain-specific types and operations.
Has a simple, attractive, and approachable syntax. This might not be such an advantage over lisp for existing lisp users, but might be relevant if you intend to have end-users write scripts.
Is designed for embedding, and besides its small size and fast speed, has various features such as an incremental GC that make using a scripting language more viable in such contexts.
Has a long history, and responsible and professional developers, who have shown good judgment in how they've evolved the language over the last 2 decades.
Has a vibrant and friendly user-community.
You don't state what platform you are using, but if it would be capable of using LuaJIT 2 I'd certainly go for that, since execution speeds approach that of compiled code, and interfacing with C code just got a whole lot easier with the FFI library.
I don't quit know other embeddable scripting languages so I can't really compare what they can do, and how they work with tables.
Lua mostly works with references: all functions, userdata, tables are used by reference, and are collected on the next gc run when no references to the data are left.
Strings are internalised, so a certain string is in the memory only once.
The thing to take into account is that you should avoid creating and subsequently discarding loads of tables, since this can slow down the GC cycle (as explained in the Lua gem you cited)
For parsing your code sample, I'd take a look at the LPEG library
There is a number of options for implementing high performance embedded compilers. One is Mono VM, it naturally comes with dozens of already made high quality languages implemented on top of it, and it is quite embeddable (see how Second Life is using it). It is also possible to use LLVM - looks like your DSL is not complicated, so implementing an ad hoc compiler would not be a big deal.
I happened to work on a project which have some parts that is similar to your project, It's a cross-platform system running on Win-CE,Android,iOS, I need maximize cross-platform-able code, C/C++ combine with a embeddable language is a good choice. here is my solution related to your questions.
If you were in my position which language would you chose ?
The DSL in my project is similar to yours. for performance, I wrote a compiler with Yacc/Lex to compile the DSL to binary for runtime and a bunch of API to get information from binary, but it's annoying when there is something modified in DSL syntax, I need to modify both compiler and APIs, so I abondoned the DSL, turned into XML(don't write XML directly, a well defined schema is worthy), I wrote a general compiler converting XML to lua table, reimplement APIs with lua. by doing this I got two benefits: Readability and flexibility, without perceivable performance degradation.
Are there any alternatives I should consider (don't suggest languages that don't have an embeddable implementation). Javascript v8 perhaps ?
Before choosing lua, I considerd Embedded Ch(mostly used in industrial control system) , embedded lisp and lua, at last lua stand out, because lua is well integrated with C, lua have a prosperous community, and lua is easy to learn for another team member. regarding Javascript v8, it's like using a steam-hammer to crack nuts, if used in a embedded realtime system.
Does lisp fit the domain better ? I don't think lua and lisp are that different in terms of what they provide. Call me out :D
For my domain, lisp and lua have the same ability in semantic, they both can handle XML-based DSL easily, or you might even wrote a simple compiler converting XML to lisp list or lua table. they both can handle domain logic easily. but lua is better integrated with C/C++, this is what lua aim for.
Are there any other properties (like the ones below) I should be thinking about ?
Working alone or with team members is also a weighting factor of solution selection. nowadays not so many programmers are familiar with lisp-like language.
I assert that any form of embedded database IO (see the example DSL below for context) dwarfs the scripting language call on orders of magnitude, and that picking either will not add much overhead to the overall throughput. Am I on the right track ? :D
here is a list of programming languages performance, here is a list of access time of computer components. if your system is IO-bound, the overhead of script is not key point. my system is a O&M(Operation & Maintenance) system, script performance is insignificant.
I'm working on a project where I have to divide a C program into modules and find the time taken for the execution of each module. I have to use C or C++ to do this.
For example: the first module will be main() then inside you may have other modules like for, or while or if etc.
It's actually like generating a syntax tree for the program, in a textual form.
So what I'm looking for is a way to do this, or any framework which can assist me with this, or any suggestions, anything would be great!
Thanks in advance :)
EDIT:
and ok, il make it more clear, by giving the sample input and the expected output as mentioned by our institute,
Input:
void main() {
int h=2,j=5,k=1;
while (h>j) {
h++;
for (k=100;k>0;k/=2) {
if (i<=5) {
d=n*10;
p=n/10;
}
}
}
}
Output:
main
begin
h=2
j=5
k=1
*************
while
h>j
0.2087
*************
begin
h++
*************
for
k=100
k>0
k/2
0.8956
*************
begin
if i<=5
*************
begin
d=n*10
p=n/10
0.1153
end
*************
end
end
sorry for the output, its vertical, i think it gives every1 a clear idea ! :) –
Being less than 100 lines doesn't help you much; people can write a program that uses every feature of C language (and the compiler extensions!) in a program that size. So you are saying you have to deal with the entire C language (statements, expressions, functions, structures, typedefs, macros, preprocessor conditionals, ...). If this is a school project, I think that's too hard; you will need to limit your project to a interesting subset, say, functions, assignments, while statements and function calls.
What you need to do this straightforwardly is
A parser for the C language
The ability to climb over the AST
The ability to patch the AST
The ability to emit the AST as source text
Program transformations system are nearly ideal engines to accomplish this kind of work.
These allow you to additionally do pattern-directed changes to the AST, which makes this much easier to implement.
You can probably do this with TXL or Stratego; I'm sure both have C parsers that will handle the limited subset of C that you should be willing to do.
You could do with this our DMS Software Reengineering Toolkit, although this is probably not suited for student use in short period of time. (TXL and Stratego are free, but might have steep learning curves, too).
But you will find this document on instrumenting code to build test coverage tools probably an ideal introduction to what you need to do to accomplish your task with a transformation system.
EDIT: I transcribed Hari's comment with an example into his question.
I think he wants a tracer not a profiler.
He can do that using instrumentation, but he'll have a lot of instrumentation to insert because it looks like the goal is tracing every action.
In that case, it would be easier if he had an interpreter, but he needs one that keeps the code structure around so that he can report the code structure as it runs (so CINT which compiles to a pcode wouldn't be the right answer).
What he needs is full C parsing to AST with symbol tables, an interpreter that executes the ASTs, and a prettyprinter so that he can convert appropriate parts of the AST he is currently executing back into text to report as progress.
DMS is still a very good foundation for this, as it has all the machinery needed to support this task too.
Parse the C text, build the symbol table (all built into DMS for C) and then run a custom interpreter.
This is easily coded as an interpreter over the AST; a big case statement over the tree node types would be the interpreter core, with node-type specific cases carrying out the action implied by the AST node (including climbing up or down the tree, updating symbol table entries with new values, computing expression intermediate results, and of course reporting progress.
ANTLR might work; it has a C parser but I'm not so sure about a prettyprinter. The interpreter would work pretty much the same way.
TXL and Stratego would likely be awkward for this, as it isn't clear how you'd use their pure-transformational style to alternatate interpreting and printing trace data.
What you're describing is a profiler. Which one is suitable very much depends on the platform which you're using.
For Window and Visual C++ I've found LTProf to be lightweight, inexpensive and usable http://www.lw-tech.com/index.php?option=com_content&task=view&id=25&Itemid=65. Alternatively the higher end Visual Studio products include a profiler and Intel sell various profilers.
Alternatively are you trying to write your own profiler? That's quite a tricky piece of code to write and very platform dependent. You may find open source examples to use as starting points.
EDIT:
NB: Using ltprof may require you to turn off compiler optimization to get reliable profiling results. Be aware that this can potentially change the results.
If you had the possibility of having an application that would use both Haskell and C++.
What layers would you let Haskell-managed and what layers would you let C++-managed ?
Has any one ever done such an association, (surely) ?
(the Haskell site tells it's really easy because Haskell has a mode where it can be compiled in C by gcc)
At first I think I would keep all I/O operations in the C++ layers. As well as GUI management.
It is pretty vague a question, but as I am planning to learn Haskell, I was thinking about delegating some work to Haskell-code (I learn in actually coding), and I want to choose some part where I will see Haskell benefits.
The benefit of Haskell is the powerful abstractions it allows you to use. You're not thinking in terms of ones and zeros and addresses and registers but computations and type properties and continuations.
The benefit of C++ is how tightly you can optimize it when necessary. You aren't thinking about high-minded monads, arrows, partial application, and composing pure functions: with C++, you can get right down to the bare metal!
There's tension between these two statements. In his paper “Structured Programming with go to statements,” Donald Knuth wrote
I have felt for a long time that a talent for programming consists largely of the ability to switch readily from microscopic to macroscopic views of things, i.e., to change levels of abstraction fluently.
Knowing how to use Haskell and C++ but also how and when to combine them well will knock down all sorts of problems.
The last big project I wrote that used FFI involved using an in-house radar modeling library written in C. Reimplementing it would have been silly, and expressing the high-level logic of the rest of the application would have been a pain. I kept the “brains” of it in Haskell and called the C library when I needed it.
You're wanting to do this as an exercise, so I'd recommend the same approach: write the smarts in Haskell. Shackling Haskell as a slave to C++ will probably end up frustrating you or making you feel as though you wasted your time. Use each language where its strengths lie.
Here is how I see things:
Functional languages excel at transforming things. Whenever you write programs which take an input and map/filter/reduce it, use functional constructs. Wonderful real world examples where Haskell should excel are given by web applications (you basically transform things stored in a database to web pages).
Procedural languages (OOP languages are procedural) excel at side effects and communication between objects. They are cumbersome to use to transform data, but whenever you want to do system programming or (bidirectional) interaction with humans (user interfaces of any kind, including client-side web programming), they do the job cleanly.
However, some may argue that user interfaces should have a functional description, I answer that well established frameworks are easy enough to use with OOP languages that one should use them this way. After all, it is natural to think of UI components in terms of objects and communication between objects.
IO is only a tool: whenever you transform input to output, Haskell (or whatever FP language) should do the IO. Whenever you speak to a human, C++ (or whatever OOP language) should do the IO.
Don't care about speed. When you use Haskell for the right job, it's efficient. When you use C++ or Python for the right job, it's efficient.
Therefore, let's say I'm fluent in Haskell, C, C++ and Python, here's how I write applications:
If my application's main role is to transform data, I write it in Haskell, with possibly some low-level parts written in C (which may, in turn, call some high-tech low level parts written in C++, but I'd stick with C as an interface for portability reasons).
If my application's main role is to interact with a user, I write it in Python (PyQt for instance), and let Python call performance-critical routines written in C++ (boost::python is pretty good as a binding generator). I may also have to call subroutines which transform or fetch data, which will be written in Haskell.
If I have to write a part of an application in Haskell, I segregate it into a C-callable API.
Sometimes, a Haskell application which reads things on stdin and write back on stdout is useful as a submodule (that you call with fork/exec, or whatever on your platform). Sometimes, a shell script is the right wrapper for such applications.
This answer is more a story than a comprehensive answer, but I used a mix of Haskell, Python and C++ for my dissertation in computational linguistics, as well as several C and Java tools that I didn't write. I found it simplest to run everything as a separate process, using Python as glue code to start the Haskell, C++ and Java programs.
The C++ was a fairly simple, tight loop that counted feature occurrences. Basically all it did was math and simple I/O. I actually controlled options by having the Python glue code write out a header full of #defines and recompiling. Kind of hacky, but it worked.
The Haskell was all the intermediate processing: taking the complex output from the various C and Java parsers that I used, filtering extraneous data, and transforming it the simple format the C++ code expected. Then I took the C++ output and transformed it into LaTeX markup (among other formats).
This is an area that you would expect Python to be strong, but I found that Haskell makes manipulation of complex structures easier; Python is probably better for simple line-to-line transformations, but I was slicing and dicing parse trees and I found that I forgot the input and output types when I wrote code in Python.
Since I was using Haskell a lot like a more-structured scripting language, I ended up writing a few file I/O utilities, but beyond that, the built in libraries for tree and list manipulation sufficed.
In summary, if you have a problem like mine, I would suggest C++ for the memory-constrained, speed-critical part, Haskell for the high-level transformations, and Python to run it all.
I have never mixed both languages but your approach feels a little upside down to me.
Haskell is more apt at high-level operations while C++ can be optimized and can be most beneficial for tight loops and other performance critical code.
One of the largest benefits of Haskell is the encapsulation of IO into monads. As long as this IO isn't time critical I don't see any reason to do it in C++.
For the GUI part you are probably right. There is a plethora of Haskell GUI libraries but C++ has powerful tools such as QtCreator which simplify the tedious tasks a lot.
I am interested in writing a very minimalistic compiler.
I want to write a small piece of software (in C/C++) that fulfills the following criteria:
output in ELF format (*nix)
input is a single textfile
C-like grammar and syntax
no linker
no preprocessor
very small (max. 1-2 KLOC)
Language features:
native data types: char, int and floats
arrays (for all native data types)
variables
control structures (if-else)
functions
loops (would be nice)
simple algebra (div, add, sub, mul, boolean expressions, bit-shift, etc.)
inline asm (for system calls)
Can anybody tell me how to start? I don't know what parts a compiler consists of (at least not in the sense that I just could start right off the shelf) and how to program them. Thank you for your ideas.
With all that you hope to accomplish, the most challenging requirement might be "very small (max. 1-2 KLOC)". I think your first requirement alone (generating ELF output) might take well over a thousand lines of code by itself.
One way to simplify the problem, at least to start with, is to generate code in assembly language text that you then feed into an existing assembler (nasm would be a good choice). The assembler would take care of generating the actual machine code, as well as all the ELF specific code required to build an actual runnable executable. Then your job is reduced to language parsing and assembly code generation. When your project matures to the point where you want to remove the dependency on an assembler, you can rewrite this part yourself and plug it in at any time.
If I were you, I might start with an assembler and build pieces on top of it. The simplest "compiler" might take a language with just a few very simple possible statements:
print "hello"
a = 5
print a
and translate that to assembly language. Once you get that working, then you can build a lexer and parser and abstract syntax tree and code generator, which are most of the parts you'll need for a modern block structured language.
Good luck!
Firstly, you need to decide whether you are going to make a compiler or an interpreter. A compiler translates your code into something that can be run either directly on hardware, in an interpreter, or get compiled into another language which then is interpreted in some way. Both types of languages are turing complete so they have the same expressive capabilities. I would suggest that you create a compiler which compiles your code into either .net or Java bytecode, as it gives you a very optimized interpreter to run on as well as a lot of standard libraries.
Once you made your decision there are some common steps to follow
Language definition Firstly, you have to define how your language should look syntactically.
Lexer The second step is to create the keywords of your code, known as tokens. Here, we are talking about very basic elements such as numbers, addition sign, and strings.
Parsing The next step is to create a grammar that matches your list of tokens. You can define your grammar using e.g. a context-free grammar. A number of tools can be fed with one of these grammars and create the parser for you. Usually, the parsed tokens are organized into a parse tree. A parse tree is the representation of your grammar as a data structure which you can move around in.
Compiling or Interpreting The last step is to run some logic on your parse tree. A simple way to make your own interpreter is to create some logic associated to each node type in your tree and walk through the tree either bottom-up or top-down. If you want to compile to another language you can insert the logic of how to translate the code in the nodes instead.
Wikipedia is great for learning more, you might want to start here.
Concerning real-world reading material I would suggest "Programming language processors in JAVA" by David A Watt & Deryck F Brown. I used that book in my compilers course and learning by example is great in this field.
These are the absolutely essential parts:
Scanner: This breaks the input file into tokens
Parser: This constructs an abstract syntax tree (AST) from the tokens identified by the scanner.
Code generation: This produces the output from the AST.
You'll also probably want:
Error handling: This tells the parser what to do if it encounters an unexpected token
Optimization: This will enable the compiler to produce more efficient machine code
Edit: Have you already designed the language? If not, you'll want to look into language design, too.
I don't know what you hope to get out of this, but if it is learning, and looking at existing code works for you, there is always tcc.
The number one essential is a book on compiler writing. A lot of people will tell you to read the "Dragon Book" by Aho et al, but the best book I've read on compilers is "Brinch Hansen on Pascal Compilers". I suspect it's out of print (Amazon is your friend), but it takes you through all the steps of designing and writing a compiler using recursive descent, which is the easiest method for compiler newbies to understand.
Although the book uses Pascal as the implementation and target languages, the lessons and techniques presented apply equally to all other languages.
The examples are all in Perl, but Exploring Programming Language Architecture in Perl is a good book (and free).
A really good set of free references, IMHO, are:
Overall compiler tutorial: Let's Build a Compiler by Jack Crenshaw (http://compilers.iecc.com/crenshaw/) It's wordy, but I like it.
Assembler: NASM (nasm.us) good for Linux and Windows/DOS, and most importantly lots of doco and examples/tutorials. (FASM is also good but less documentation/tutorials out there)
Other sources
The PC Assembly book (http://www.drpaulcarter.com/pcasm/index.php)
I'm trying to write a LISP, so I'm using the Lisp 1.5 Manual. You may want to get the language spec for whatever language you're writing.
As far as 1-2KLOC, assuming you use a high level language (like Py or Rb) you should be close if you're not too ambitious.
I always recommend flex and bison for this kind of work as a beginner. You can always learn the ins and outs of writing your own scanner and parser later, although they may increase the code size at least they will be generated for you by tools. :)