How to split a simple Lisp-like code to tokens in C++? - c++

Basically, the language has 3 list and 3 fixed-length types, one of them is string.
This is simple to detect the type of a token using regular expressions, but splitting them into tokens is not that trivial.
String is notated with double-quote, and double-qoute is escaped with backslash.
EDIT:
Some example code
{
print (sum (1 2 3 4))
if [( 2 + 3 ) < 6] : {print ("Smaller")}
}
Lists like
() are argument lists that are only evaluated when necessary.
[] are special list to express 2 operand operations in a prettier
way.
{} are lists that are always evaluated. First element is a function
name, second is a list of arguments, and this repeats.
anything : anything [ : anything [: ...]] translate to argument lists that have the elements joined by the :s. This is only for making loops and conditionals look better.
All functions take a single argument. Argument lists can be used for functions that need more. You can fore and argument list to evaluate using different types of eval functions. (There would be eval functions for each list model)
So, if you understand this, this works very similar like Lisp does, it's only has different list types for prettifying the code.
EDIT:
#rici
[[2 + 3] < 6] is OK too. As I mentioned, argument lists are evaluated only when it's necessary. Since < is a function that requires an argument list of length 2, (2 + 3) must be evaluated somehow, other ways it [(2 + 3) < 6] would translate to < (2 + 3) : 6 which equals to < (2 + 3 6) which is and invalid argument list for <. But I see you point, it's not trivial that how automatic parsing in this case should work. The version that I described above, is that the [...] evaluates arguments list with a function like eval_as_oplist (...) But I guess you are right, because this way, you couldn't use an argument list in the regular way inside a [...] which is problematic even if you don't have a reason to do so, because it doesn't lead to a better code. So [[. . .] . .] is a better code, I agree.

Rather than inventing your own "Lisp-like, but simpler" language, you should consider using an existing Lisp (or Scheme) implementation and embedding it in your C++ application.
Although designing your own language and then writing your own parser and interpreter for it is surely good fun, you will have hard time to come up with something better designed, more powerful and implemented more efficiently and robustly than, say, Scheme and it's numerous implementations.
Chibi Scheme: http://code.google.com/p/chibi-scheme/ is particularly well suited for embedding in C/C++ code, it's very small and fast.

I would suggest using Flex (possibly with Bison) or ANTLR, which has a C++ output target.
Since google is simpler than finding stuff on my own file server, here is someone else's example:
http://ragnermagalhaes.blogspot.com/2007/08/bison-lisp-grammar.html
This example has formatting problems (which can be resolved by viewing the HTML in a text editor) and only supports one type of list, but it should help you get started and certainly shows how to split the items into tokens.

I believe Boost.Spirit would be suitable for this task provided you could construct a PEG-compatible grammar for the language you're proposing. It's not obvious from the examples as to whether or not this is the case.
More specifically, Spirit has a generalized AST called utree, and there is example code for parsing symbolic expressions (ie lisp syntax) into utree.
You don't have to use utree in order to take advantage of Spirit's parsing and lexing capabilities, but you would have to have your own AST representation. Maybe that's what you want?

Related

Pattern Matching a 2d array with tuples

I have a 2d list that looks as follows (string * int) list list
[ [("size1", 1);("size2",2)] ; [("size3",3);("size4",4)] ]
When given a string name, I want to return the int associated with that name
"size1" -> 1
"size2" -> 2
"size4" -> 4
I have an idea of how I would do this using the List module
but how would I do something like this using pattern matching?
An OCaml pattern essentially matches a single constructor of a type, with subpatterns in the places of the constructed values. For a list, this means you need to pick the number of :: constructors you want to have in your pattern. Hence there's no OCaml pattern that matches at an arbitrary place in a list. So there's no way to solve your problem with just pattern matching (if I understand what you're trying to do).
If you know the maximum length of your lists you can enumerate all possible lengths, but somehow I doubt this is what you want to do.
My way of looking at it is that OCaml patterns have a very clean and disciplined structure, which allows them to be implemented very efficiently. Other languages (like perl say) have very powerful patterns that can't really be implemented efficiently. They can require more or less arbritrary amounts of retrying of different subpatterns. Naturally being an OCaml guy I much prefer the OCaml approach.

C++ parsing expressions, breaking down the order of evaluation

I'm trying to write an expression parser. One part I'm stuck on is breaking down an expression into blocks via its appropriate order of precedence.
I found the order of precedence for C++ operators here. But where exactly do I split the expression based on this?
I have to assume the worst of the user. Here's a really messy over-exaggerated test example:
if (test(s[4]) < 4 && b + 3 < r && a!=b && ((c | e) == (g | e)) ||
r % 7 < 4 * givemeanobj(a & c & e, b, hello(c)).method())
Perhaps it doesn't even evaluate, and if it doesn't I still need to break it down to determine that.
It should break down into blocks of singles and pairs connected by operators. Essentially it breaks down into a tree-structure where the branches are the groupings, and each node has two branches.
Following the order of precedence the first thing to do would be to evaluate the givemeanobj(), however that's an easy one to see. The next would be the multiplication sign. Does that split everything before the * into a separate , or just the 4? 4 * givemeanobj comes before the <, right? So that's the first grouping?
Is there a straightforward rule to follow for this?
Is there a straightforward rule to follow for this?
Yes, use a parser generator such as ANTLR. You write your language specification formally, and it will generate code which parses all valid expressions (and no invalid ones). ANTLR is nice in that it can give you an abstract syntax tree which you can easily traverse and evaluate.
Or, if the language you are parsing is actually C++, use Clang, which is a proper compiler and happens to be usable as a library as well.

What is the name of the data structure for and-or-lists (or and-or-trees) and where can I read about it?

I recently needed to make a data structure which was a nested list of and/or questions. Since most every interesting thing has been discovered by someone else previously, I’m looking for the name of this data structure. it looks something like this.
‘((a b c) (b d e) (c (a b) (f a)))
The interpretation is I want to find abc or bde or caf or caa or cbf or cba and the list encapsulates that. At the top level each item is or’ed together and sub-lists of the top level are and’ed together and sub-lists of sub-lists are or’ed again sub-lists of those are and’ed and sub-lists of those or’ed ad infinitum. Note that in my example, all the lists are the same length, in my real application the lists vary in length.
The code to walk such a “tree” is relatively simple, but I’m assuming that there is a name for that type of tree and there is stuff I can read about it.
These lists are equivalent to fixed length regular expressions (which I've seen referred to as "network expressions", but I am particularly interested in this data structure and representation thereof.
In general (in the very high level of abstraction) it is:
Context free grammar -Wiki
If you allow it to be infinitely nested, then it is not a regular expression because of presence of parentheses (left and right should match).
If you consider, that expressions inside parentheses are ordered. I mean that a and b and c is equivalent to (a and b) and c. You get then Binary expression tree -Wiki
But for your particular case, it is probably: Disjunctive normal form -Wiki
I am not sure, but my intuition says that it is regular expression again because you have only 2 levels of nesting (1st - for 'or-ed' and 2nd - for 'and-ed' parts)
The trees are also a subset of DAWGS - directed acyclic word graphs and one could construct them the same way.
In my case, I have a very small set that I have built by hand and I don't worry about getting the minimal set, but instead just want something that I can easily write down but deals with the types of simple variations I see. Basically, I have different ways of finding where I keep my .el files based upon the different directory structures of various OSes I use. (E.g. when I was working at Google, the /usr/local/emacs/site-lisp directory was actually more like /usr/local/Google/emacs/site-lisp.)
I don't need a full regex, but there are about a dozen variations, some having quite long lists of nested sub-directories (c:\users\cfclark\appData\roaming\emacs.emacs.d or some other awful thing) that I wanted to write down (and then have emacs make an automated search to find the one that is appropriate to this particular installation). And every time I go to a new job, I can simply add to the list a description of where they are in that setup.
Anyway, as that code has evolved, I found that I had I was doing (nested or's and and's and realized that the structure generalized to the alternating or/and/or/and/... case). So, my assumption is that someone must have discovered this before. I had hints of it myself several years ago, but didn't set down to implement it. The Disjunctive Normal Form link mpasko256 gave is also particularly relevant. I don't normalize to that level, I still keep nested and's and or's rather than flattening to 2, but I do have a distinct structure, or's at the top, then and's, then or's....

What would Clojure lose by switching away from leading parenthesis like Dylan, Julia and Seph?

Three lispy homoiconic languages, Dylan, Julia and Seph all moved away from leading parenthesis - so a hypothetical function call in Common Lisp that would look like:
(print hello world)
Would look like the following hypothetical function call
print(hello world)
in the three languages mentioned above.
Were Clojure to go down this path - what would it have to sacrifice to get there?
Reasoning:
Apart from the amazing lazy functional data structures in Clojure, and the improved syntax for maps and seqs, the language support for concurrency, the JVM platform, the tooling and the awesome community - the distinctive thing about it being 'a LISP' is leading parenthesis giving homoiconicity which gives macros providing syntax abstraction.
But if you don't need leading parentheses - why have them? The only arguments I can think of for keeping them are
(1) reusing tool support in emacs
(2) prompting people to 'think in LISP' and not try and treat it as another procedural language)
(Credit to andrew cooke's answer, who provided the link to Wheeler's and Gloria's "Readable Lisp S-expressions Project")
The link above is a project intended to provide a readable syntax for all languages based on s-expressions, including Scheme and Clojure. The conclusion is that it can be done: there's a way to have readable Lisp without the parentheses.
Basically what David Wheeler's project does is add syntactic sugar to Lisp-like languages to provide more modern syntax, in a way that doesn't break Lisp's support for domain-specific languages. The enhancements are optional and backwards-compatible, so you can include as much or as little of it as you want and mix them with existing code.
This project defines three new expression types:
Curly-infix-expressions. (+ 1 2 3) becomes {1 + 2 + 3} at every place you want to use infix operators of any arity. (There is a special case that needs to be handled with care if the inline expression uses several operators, like {1 + 2 * 3} - although {1 + {2 * 3} } works as expected).
Neoteric-expressions. (f x y) becomes f(x y) (requires that no space is placed between the function name and its parameters)
Sweet-expressions. Opening and closing parens can be replaced with (optional) python-like semantic indentation. Sweet-expressions can be freely mixed with traditional parentheses s-expressions.
The result is Lisp-compliant but much more readable code. An example of how the new syntactic sugar enhances readability:
(define (gcd_ a b)
(let (r (% b a))
(if (= r 0) a (gcd_ r a))))
(define-macro (my-gcd)
(apply gcd_ (args) 2))
becomes:
define gcd_(a b)
let r {b % a}
if {r = 0} a gcd_(r a)
define-macro my-gcd()
apply gcd_ (args) 2
Note how the syntax is compatible with macros, which was a problem with previous projects that intended to improve Lisp syntax (as described by Wheeler and Gloria). Because it's just sugar, the final form of each new expression is a s-expression, transformed by the language reader before macros are processed - so macros don't need any special treatment. Thus the "readable Lisp" project preserves homoiconicity, the property that allows Lisp to represent code as data within the language, which is what allows it to be a powerful meta-programming environment.
Just moving the parentheses one atom in for function calls wouldn't be enough to satisfy anybody; people will be complaining about lack of infix operators, begin/end blocks etc. Plus you'd probably have to introduce commas / delimiters in all sorts of places.
Give them that and macros will be much harder just to write correctly (and it would probably be even harder to write macros that look and act nicely with all the new syntax you've introduced by then). And macros are not something that's a nice feature you can ignore or make a whole lot more annoying; the whole language (like any other Lisp) is built right on top of them. Most of the "user-visible" stuff that's in clojure.core, including let, def, defn etc are macros.
Writing macros would become much more difficult because the structure would no longer be simple you would need another way to encode where expressions start and stop using some syntactic symbol to mark the start and end of expressions to you can write code that generates expressions perhaps you could solve this problem by adding something like a ( to mark the start of the expression...
On a completely different angle, it is well worth watching this video on the difference between familiar and easy making lisps syntax more familiar wont make it any easier for people to learn and may make it misleading if it looks to much like something it is not.
even If you completely disagree, that video is well worth the hour.
you wouldn't need to sacrifice anything. there's a very carefully thought-out approach by david wheeler that's completely transparent and backwards compatible, with full support for macros etc.
You would have Mathematica. Mathematica accepts function calls as f[x, y, z], but when you investigate it further, you find that f[x, y, z] is actually syntactic sugar for a list in which the Head (element 0) is f and the Rest (elements 1 through N) is {x, y, z}. You can construct function calls from lists that look a lot like S-expressions and you can decompose unevaluated functions into lists (Hold prevents evaluation, much like quote in Lisp).
There may be semantic differences between Mathematica and Lisp/Scheme/Clojure, but Mathematica's syntax is a demonstration that you can move the left parenthesis over by one atom and still interpret it sensibly, build code with macros, etc.
Syntax is pretty easy to convert with a preprocessor (much easier than semantics). You could probably get the syntax you want through some clever subclassing of Clojure's LispReader. There's even a project that seeks to solve Clojure's off-putting parentheses by replacing them all with square brackets. (It boggles my mind that this is considered a big deal.)
I used to code C/C#/Java/Pascal so I emphasize with the feeling that Lisp code is a bit alien. However that feeling only lasts a few weeks - after a fairly short amount of time the Lisp style will feel very natural and you'll start berating other languages for their "irregular" syntax :-)
There is a very good reason for Lisp syntax. Leading parentheses make code logically simpler to parse and read, by collecting both a function and the expressions that make up it's arguments in a single form.
And when you manipulate code / use macros, it is these forms that matter: these are the building blocks of all your code. So it fundamentally makes sense to put the parentheses in a place that exactly delimits these forms, rather than arbitrarily leaving the first element outside the form.
The "code is data" philosophy is what makes reliable metaprogramming possible. Compare with Perl / Ruby, which have complex syntax, their approaches to metaprogramming are only reliable in very confined circumstances. Lisp metaprogramming is so perfectly reliable that the core of the language depends on it. The reason for this is the uniform syntax shared by code and data (the property of homoiconicity). S-expressions are the way this uniformity is realized.
That said, there are circumstances in Clojure where implied parenthesis is possible:
For example the following three expressions are equivalent:
(first (rest (rest [1 2 3 4 5 6 7 8 9])))
(-> [1 2 3 4 5 6 7 8 9] (rest) (rest) (first))
(-> [1 2 3 4 5 6 7 8 9] rest rest first)
Notice that in the third the -> macro is able in this context to infer parentheses and thereby leave them out. There are many special case scenarios like this. In general Clojure's design errs on the side of less parentheses. See for example the controversial decision of leaving out parenthesis in cond. Clojure is very principled and consistent about this choice.
Interestingly enough - there is an alternate Racket Syntax:
#foo{blah blah blah}
reads as
(foo "blah blah blah")
Were Clojure to go down this path - what would it have to sacrifice to get there?
The sacrifice would the feeling/mental-modal of writing code by creating List of List of List.. or more technically "writing the AST directly".
NOTE: There are other things as well that will be scarified as mentioned in other answers.

calculating user defined formulas (with c++)

We would like to have user defined formulas in our c++ program.
e.g. The value v = x + ( y - (z - 2)) / 2. Later in the program the user would define x,y and z -> the program should return the result of the calculation. Somewhen later the formula may get changed, so the next time the program should parse the formula and add the new values. Any ideas / hints how to do something like this ? So far I just came to the solution to write a parser to calculate these formulas - maybe any ideas about that ?
If it will be used frequently and if it will be extended in the future, I would almost recommend adding either Python or Lua into your code. Lua is a very lightweight scripting language which you can hook into and provide new functions, operators etc. If you want to do more robust and complicated things, use Python instead.
You can represent your formula as a tree of operations and sub-expressions. You may want to define types or constants for Operation types and Variables.
You can then easily enough write a method that recurses through the tree, applying the appropriate operations to whatever values you pass in.
Building your own parser for this should be a straight-forward operation:
) convert the equation from infix to postfix notation (a typical compsci assignment) (I'd use a stack)
) wait to get the values you want
) pop the stack of infix items, dropping the value for the variable in where needed
) display results
Using Spirit (for example) to parse (and the 'semantic actions' it provides to construct an expression tree that you can then manipulate, e.g., evaluate) seems like quite a simple solution. You can find a grammar for arithmetic expressions there for example, if needed... (it's quite simple to come up with your own).
Note: Spirit is very simple to learn, and quite adapted for such tasks.
There's generally two ways of doing it, with three possible implementations:
as you've touched on yourself, a library to evaluate formulas
compiling the formula into code
The second option here is usually done either by compiling something that can be loaded in as a kind of plugin, or it can be compiled into a separate program that is then invoked and produces the necessary output.
For C++ I would guess that a library for evaluation would probably exist somewhere so that's where I would start.
If you want to write your own, search for "formal automata" and/or "finite state machine grammar"
In general what you will do is parse the string, pushing characters on a stack as you go. Then start popping the characters off and perform tasks based on what is popped. It's easier to code if you force equations to reverse-polish notation.
To make your life easier, I think getting this kind of input is best done through a GUI where users are restricted in what they can type in.
If you plan on doing it from the command line (that is the impression I get from your post), then you should probably define a strict set of allowable inputs (e.g. only single letter variables, no whitespace, and only certain mathematical symbols: ()+-*/ etc.).
Then, you will need to:
Read in the input char array
Parse it in order to build up a list of variables and actions
Carry out those actions - in BOMDAS order
With ANTLR you can create a parser/compiler that will interpret the user input, then execute the calculations using the Visitor pattern. A good example is here, but it is in C#. You should be able to adapt it quickly to your needs and remain using C++ as your development platform.