Read Clojure source code one token at a time - clojure

In Clojure, it's possible to read a whole s-expression with (read). Is there some way to read just one token at a time? So calling (read-token "(read)") would return something like ["(", "read", ")"].

"tokens" are not something that the clojure reader works with: it doesn't have distinct lex/parse phases like languages with more complicated grammars often do. Of course you can write your own grammar for clojure forms, call ( an OPEN_PAREN token and so on, but there's no built-in support for it.

Related

Abstract structure of Clojure

I've been learning Clojure and am a good way through a book on it, but I've realized how much I'm still struggling to interpret the code. What I'm looking for is the abstract structure, interface, or rules Clojure uses to parse code. I think it looks something like:
(some-operation optional-args)
optional-args can be nearly anything and that's where I start getting confused.
(operation optional-name-string [vector of optional args]) would equal (defn newfn [argA, argB])
I think this pattern holds for all lists () but with so much flexibility and variation in Clojure, I'm not sure. It would be really helpful to see the rules the interpreter follows.
You are not crazy. Sure, it's easy to point out how "easy" ("simple"? but that's another discussion) Clojure syntax is, but there are two things a new learner should be aware of that are not pointed out very clearly in beginning tutorials and that greatly complicate understanding what you are seeing:
Destructuring. Spend some quality time with guides on destructuring in Clojure. I will say that this adds complexity to the language and is not dissimilar to "*args" and "**kwargs" arguments in Python or to the "..." spread operator in JavaScript. They are all complicated enough to require some dedicated reading time. This relates to the optional-args you reference above.
Macros and metaprogramming. In the some-operation you reference above, you wish to "see the rules the interpreter follows". In the majority of cases it is a function, but Clojure gives you no indication of whether you are looking at a function or a macro. In the standard library, you just need to know the common macros and how they affect the syntax they headline (e.g. if, defn, etc.). For included libraries, there will typically be a small set of macros that are core to understanding that library. Any macro will modify, dare I say complicate, the syntax inside the parens you are looking at, so be on your toes.
Clojure is fantastic and easy to learn but those two points are not to be glossed over IMHO.
Before you start coding with Clojure, I highly recommend studying functional programming and Lisp. In Clojure, everything is in prefix notation: when you want to run a specific function, you call it and then feed it some arguments. For example, 1+2+3 will be (+ 1 2 3) in Clojure. In other words, every function you call comes at the start of a parenthesized form, and all of its arguments follow the function name.
If you define a function, you may do it as follows:
(defn newfunc [a1 a2]
  (+ 100 a1 a2))
Here newfunc adds 100, a1, and a2. When you call it, you should do this:
(newfunc 1 2)
and the result will be 103.
In the first example, + is a function, so we call it at the start of the parentheses.
Clojure is a beautiful world full of simplicity. Please learn it deeply.

What should a parser for a programming language do?

I have already written a lexer which returns tokens, and now I'm working on a parser. I have one problem.
Imagine this code example:
print("Hello, world!")
The lexer returns four tokens (print, (, "Hello, world!" and )). The final program should print the string "Hello, world!".
But what should the parser do? Should the parser already execute the code, should it return something (and what) that is handled by another object?
The parser should generate an abstract syntax tree, which is an in-memory representation of the program. This tree can be traversed after parsing to do code generation. I'd recommend reading a good book on the subject, maybe one involving dragons.
What should the parser do?
The typical role of a parser is to read the stream of tokens and, from that, build a parse tree or abstract syntax tree.
Should the parser already execute the code
No. That's not parsing.
Typically, a parser does not execute anything. Parsers usually take an input (text or binary) and produce an in-memory representation, nothing more... but that's already a lot!
If you already have a lexer, then the next step is normally syntactic analysis, which produces an Abstract Syntax Tree.
This means producing something of the form:
(FunctionCall "print" [
(StringLiteral "Hello, World!")
]
)
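To make that form concrete, here is a minimal, hypothetical C++ sketch of the same tree for print("Hello, world!") — the node names (Node, FunctionCall, StringLiteral) are invented for illustration, not taken from any particular tool:

#include <memory>
#include <string>
#include <vector>

// A tiny, illustrative AST: every node is either a function call or a string literal.
struct Node {
    virtual ~Node() = default;
};

struct StringLiteral : Node {
    std::string value;
    explicit StringLiteral(std::string v) : value(std::move(v)) {}
};

struct FunctionCall : Node {
    std::string name;
    std::vector<std::unique_ptr<Node>> args;
};

// What a parser might build for: print("Hello, world!")
std::unique_ptr<FunctionCall> parsePrintExample() {
    auto call = std::make_unique<FunctionCall>();
    call->name = "print";
    call->args.push_back(std::make_unique<StringLiteral>("Hello, world!"));
    return call;
}

A later interpretation or code-generation pass would then walk this structure instead of the raw text.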
It should return an abstract syntax tree.
The parser should basically do two things:
Produce some form of intermediate representation, generally a tree or reverse-Polish form, that the code generator can consume.
Clearly and accurately report any errors encountered, identifying the failing line number, the precise cause for the error (in reasonable non-techspeak), and, to the degree possible, the position within the line or the identity of the element that caused the parser to "choke".
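As a rough sketch of that error-reporting half (the field names here are my own invention, not from any particular parser), the parser would fill in records like this instead of just saying "syntax error":

#include <string>

// Hypothetical diagnostic record a parser could emit.
struct ParseError {
    int line;               // failing line number
    int column;             // position within the line, when known
    std::string expected;   // what the parser was looking for
    std::string found;      // the element that made it "choke"
    std::string message;    // plain-language description for the user
};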

Finite State Machine parser

I would like to parse a self-designed file format with an FSM-like parser in C++ (this is a teach-myself-c++-the-hard-way-by-doing-something-big-and-difficult kind of project :)). I have a tokenized string with newlines signifying the end of a, euh... line. See here for an input example. All the comments and junk are filtered out, so I have a std::string like this:
global \n { \n SOURCE_DIRS src \n HEADER_DIRS include \n SOURCES bitwise.c framing.c \n HEADERS ogg/os_types.h ogg/ogg.h \n } \n ...
Syntax explanation:
{ } are scopes, and capitalized words signify that a list of options/files is to follow.
\n are only important in a list of options/files, signifying the end of the list.
So I thought that an FSM would be simple/extensible enough for my needs/knowledge. As far as I can tell (and want my file design to be), I don't need concurrent states or anything fancy like that. Some design/implementation questions:
Should I use an enum or an abstract class + derivatives for my states? The first is probably better for a small syntax but could get ugly later, and the second is the exact opposite. I'm leaning toward the first, for its simplicity. enum example and class example. EDIT: what about this suggestion for goto? I thought gotos were considered evil in C++?
When reading a list, I need to NOT ignore \n. My preferred way of using the string, via stringstream, ignores \n by default, so I need a simple way of telling (the same!) stringstream not to ignore newlines when a certain state is enabled.
Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
Here's the draft states I have in mind:
upper: reads global, exe, lib+ target names...
normal: inside a scope, can read SOURCES..., create user variables...
list: adds items to a list until a newline is encountered.
Each scope will have a kind of conditional (e.g. win32:global { gcc:CFLAGS = ... }) and will need to be handled in the exact same fashion everywhere (even in the list state, per item).
Thanks for any input.
If you have nesting scopes, then a finite state machine is not the right way to go, and you should look at a context-free grammar parser. An LL(1) parser can be written as a set of recursive functions, or an LALR(1) parser can be written using a parser generator such as Bison.
If you add a stack to an FSM, then you're getting into pushdown automaton territory. A nondeterministic pushdown automaton is equivalent to a context free grammar (though a deterministic pushdown automaton is strictly less powerful.) LALR(1) parser generators actually generate a deterministic pushdown automaton internally. A good compiler design textbook will cover the exact algorithm by which the pushdown automaton is constructed from the grammar. (In this way, adding a stack isn't "hacky".) This Wikipedia article also describes how to construct the LR(1) pushdown automaton from your grammar, but IMO, the article is not as clear as it could be.
If your scopes nest only finitely deep (i.e. you have the upper, normal and list levels but you don't have nested lists or nested normals), then you can use a FSM without a stack.
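To illustrate the "FSM plus a stack" idea, here is a hedged C++ sketch (the state names and error handling are invented, not tied to the poster's real format): the current state is pushed when a { opens a scope and restored when the matching } closes it, so nesting depth is unbounded.

#include <sstream>
#include <stack>
#include <stdexcept>
#include <string>

enum class State { Upper, Normal, List };

// Walk a whitespace-separated token stream, pushing the current state on '{'
// and popping it on '}'.
void scanScopes(const std::string& input) {
    std::stack<State> saved;
    State state = State::Upper;
    std::istringstream tokens(input);
    std::string tok;
    while (tokens >> tok) {
        if (tok == "{") {
            saved.push(state);
            state = State::Normal;                       // entered a scope
        } else if (tok == "}") {
            if (saved.empty()) throw std::runtime_error("unmatched '}'");
            state = saved.top();
            saved.pop();
        }
        // ... handle keywords, list items and '\n' according to 'state'
    }
    if (!saved.empty()) throw std::runtime_error("unclosed '{'");
}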
There are two stages to analyzing a text input stream for parsing:
Lexical Analysis: This is where your input stream is broken into lexical units. It looks at a sequence of characters and generates tokens (analogous to words in spoken or written languages). Finite state machines are very good at lexical analysis provided you've made good design decisions about the lexical structure. From your data above, individual lexemes would be things like your keywords (e.g. "global"), identifiers (e.g. "bitwise", "SOURCES"), symbolic tokens (e.g. "{", "}", ".", "/"), numeric values, escape values (e.g. "\n"), etc.
Syntactic / Grammatical Analysis: Upon generating a sequence of tokens (or perhaps while you're doing so) you need to be able to analyze the structure to determine whether the sequence of tokens is consistent with your language design. You generally need some sort of parser for this, though if the language structure is not very complicated, you may be able to do it with a finite state machine instead. In general (and since you want nesting structures in your case in particular) you will need to use one of the techniques Ken Bloom describes.
So in response to your questions:
Should I use an enum or an abstract class + derivatives for my states?
I found that for small tokenizers, a matrix of state/transition values is suitable, something like next_state = state_transitions[current_state][current_input_char]. In this case, next_state and current_state are some integer type (possibly an enumerated type). Input errors are detected when you transition to an invalid state. The end of a token is identified when you reach a valid end state that has no valid transition to another state for the next input character. If you're concerned about space, you could use a vector of maps instead. Making the states classes is possible, but I think that's probably making things more difficult than you need.
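A minimal sketch of that table-driven approach, with invented states (nothing here is specific to your format):

#include <array>

enum State { START, IN_WORD, ERROR, NUM_STATES };

// One row per state, one column per possible input byte; ERROR marks invalid transitions.
using TransitionTable = std::array<std::array<State, 256>, NUM_STATES>;

State step(const TransitionTable& table, State current, unsigned char input) {
    return table[current][input];   // next_state = state_transitions[current_state][current_input_char]
}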
When reading a list, I need to NOT ignore \n.
You can either create a token called "\n", or a more generalized escape token (an identifier preceded by a backslash). If you're talking about identifying line breaks in the source, then those are simply characters you need to create transitions for in your state transition matrix (be aware of the difference between Unix and Windows line breaks, however; you could create an FSM that operates on either).
Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
This is where you will need a grammar or pushdown automaton unless you can guarantee that the nesting will not exceed a certain level. Even then, it will likely make your FSM very complex.
Here's the draft states I have in mind: ...
See my comments on lexical and grammatical analysis above.
For parsing I always try to use something already proven to work: ANTLR with ANTLRWorks, which is of great help for designing and testing a grammar. You can generate code for C/C++ (and other languages), but you need to build the ANTLR runtime for those languages.
Of course, if you find flex or bison easier to use, you can use them too (I know they generate only C and C++, but I may be wrong since I haven't used them for some time).

Generalized stream parsing?

Are there any libraries or technologies (in any language) that provide a regular-expression-like tool for any sort of stream-like or list-like data (as opposed to only character strings)?
For example, suppose you were writing a parser for your pet programming language. You've already got it lexed into a list of Common Lisp objects representing the tokens.
You might use a pattern like this to parse function calls (using C-style syntax):
(pattern (:var (:class ident)) (:class left-paren)
         (:optional (:var object)) (:star (:class comma) (:var :object))
         (:class right-paren))
Which would bind variables for the function name and each of the function arguments (actually, it would probably be implemented so that the pattern binds one variable for the function name, one for the first argument, and a list of the rest, but that's not really an important detail).
Would something like this be useful at all?
I don't know how many replies you'll receive on a subject like this, as most languages lack the sort of robust stream APIs you seem to have in mind; thus, most of the people reading this probably don't know what you're talking about.
Smalltalk is a notable exception, shipping with a rich hierarchy of Stream classes that--coupled with its Collection classes--allow you to do some pretty impressive stuff. While most Smalltalks also ship with regex support (the pure ST implementation by Vassili Bykov is a popular choice), the regex classes unfortunately are not integrated with the Stream classes in the same way the Collection classes are. This means that using streams and regexes in Smalltalk usually involves reading character strings from a stream and then testing those strings separately with regex patterns--not the "read next n characters up until a pattern matches" or "read next n characters matching this pattern" sort of functionality you likely have in mind.
I think a powerful stream API coupled with powerful regex support would be great. However, I think you'd have trouble generalizing about different stream types. A read stream on a character string would pose few difficulties, but file and TCP streams would have their own exceptions and latencies that you would have to handle gracefully.
Try looking at scala.util.regexp, both the API documentation and the code example at http://scala.sygneca.com/code/automata. I think it would allow a computational linguist to match strings of words by looking for part-of-speech patterns, for example.
This is the principle behind most syntactic parsers, which operate in two phases. The first phase is the lexer, where identifiers, language keywords, and other special characters (arithmetic operators, braces, etc) are identified and split into Token objects that typically have a numeric field indicating the type of the lexeme, and optionally another field indicating the text of the lexeme.
In the second phase, a syntactic parser operates on the Token objects, matching them by magic number alone, to parse phrases. (Software for doing this includes Antlr, yacc/bison, Scala's scala.util.parsing.combinator.syntactical library, and plenty of others). The two phases don't entirely have to depend on each other -- you can get your Token objects from anywhere else that you like. The magic number aspect seems to be important, though, because the magic numbers are assigned to constants, and they're what make it easy to express your grammar in a readable language.
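For concreteness, the Token objects described above are often little more than this (a generic sketch, not the API of any of the tools named):

#include <string>

// The "magic number" identifying the kind of lexeme.
enum class TokenType { Identifier, Keyword, LeftParen, RightParen, Comma, StringLiteral, Number };

struct Token {
    TokenType type;     // which kind of lexeme this is
    std::string text;   // the original text, e.g. "print" or "Hello, world!"
};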
And remember that anything you can accomplish with a regular expression can also be accomplished with a context-free grammar (usually just as easily).

Expression Evaluation in C++

I'm writing an Excel-like C++ console app for homework.
My app should be able to accept formulas for its cells; for example, it should evaluate something like this:
Sum(tablename\fieldname[recordnumber], fieldname[recordnumber], ...)
tablename\fieldname[recordnumber] points to a cell in another table,
fieldname[recordnumber] points to a cell in current table
or
Sin(fieldname[recordnumber])
or
anotherfieldname[recordnumber]
or
"10" // (simply a number)
something like that.
functions are Sum, Ave, Sin, Cos, Tan, Cot, Mul, Div, Pow, Log (10), Ln, Mod
It's pathetic, I know, but it's my homework :'(
So does anyone know a trick to evaluate something like this?
Ok, nice homework question by the way.
It really depends on how heavy you want this to be. You can create a full expression parser (which is fun but also time consuming).
In order to do that, you need to describe the full grammar and write a frontend (have a look at lex and yacc, or flex and bison).
But as I see your question you can limit yourself to three subcases:
a simple value
a lookup (possibly to another table)
a function which inputs are lookups
I think a little OO design can help you out here.
I'm not sure whether you have to deal with real-time refresh and circular dependency checks. If so, those can be tricky too.
For the parsing, I'd look at Recursive descent parsing. Then have a table that maps all possible function names to function pointers:
#include <string>

struct FunctionTableEntry {
    std::string name;       // function name as written in the formula, e.g. "Sin"
    double (*f)(double);    // pointer to its implementation
};
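Continuing from that struct, a hedged sketch of the lookup-and-call step (the helper name and error handling are my own, purely illustrative):

#include <stdexcept>
#include <string>
#include <vector>

// Uses the FunctionTableEntry struct defined above.
double applyFunction(const std::vector<FunctionTableEntry>& table,
                     const std::string& name, double argument) {
    for (const auto& entry : table) {
        if (entry.name == name)
            return entry.f(argument);    // call through the stored function pointer
    }
    throw std::runtime_error("unknown function: " + name);
}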
You should write a parser. The parser should take each expression (i.e., each line), identify the command, and construct the parse tree. That is the first phase. In the second phase you can evaluate the tree by substituting the data for each element of the command.
Previous responders have hit it on the head: you need to parse the cell contents, and interpret them.
StackOverflow already has a whole slew of questions on building compilers and interpreters where you can find pointers to resources. Some of them are:
Learning to write a compiler (#1669 people!)
Learning Resources on Parsers, Interpreters, and Compilers
What are good resources on compilation?
References Needed for Implementing an Interpreter in C/C++
...
and so on.
Aside: I never have the energy to link them all together, or even try to build a comprehensive list.
I guess you cannot use yacc/lex (or the like) so you have to parse "manually":
Iterate over the string and divide it into its parts. What a part is depends on your grammar (syntax). That way you can find the function names and the parameters. The difficulty of this depends on the complexity of your syntax.
Maybe you should read a bit about lexical analysis.
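If you do end up hand-rolling it, a minimal tokenizer along these lines (purely illustrative, assuming the formula syntax shown in the question) is usually the first step:

#include <cctype>
#include <string>
#include <vector>

// Splits e.g.  Sum(fieldname[3], other[7])  into
// "Sum" "(" "fieldname" "[" "3" "]" "," "other" "[" "7" "]" ")"
std::vector<std::string> tokenize(const std::string& line) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < line.size()) {
        unsigned char c = static_cast<unsigned char>(line[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalnum(c) || c == '_') {
            std::size_t start = i;
            while (i < line.size() &&
                   (std::isalnum(static_cast<unsigned char>(line[i])) || line[i] == '_'))
                ++i;
            tokens.push_back(line.substr(start, i - start));   // identifier or number
        } else {
            tokens.push_back(std::string(1, line[i]));         // (, ), [, ], comma, backslash, quote...
            ++i;
        }
    }
    return tokens;
}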