Generalized stream parsing? - regex

Are there any libraries or technologies(in any language) that provide a regular-expression-like tool for any sort of stream-like or list-like data(as opposed to only character strings)?
For example, suppose you were writing a parser for your pet programming language. You've already got it lexed into a list of Common Lisp objects representing the tokens.
You might use a pattern like this to parse function calls(using C-style syntax):
(pattern (:var (:class ident)) (:class left-paren)
(:optional (:var object)) (:star (:class comma) (:var :object)) (:class right-paren))
Which would bind variables for the function name and each of the function arguments(actually, it would probably be implemented so that this pattern would probably bind a variable for the function name, one for the first argument, and a list of the rest, but that's not really an important detail).
Would something like this be useful at all?

I don't know how many replies you'll receive on a subject like this, as most languages lack the sort of robust stream APIs you seem to have in mind; thus, most of the people reading this probably don't know what you're talking about.
Smalltalk is a notable exception, shipping with a rich hierarchy of Stream classes that--coupled with its Collection classes--allow you to do some pretty impressive stuff. While most Smalltalks also ship with regex support (the pure ST implementation by Vassili Bykov is a popular choice), the regex classes unfortunately are not integrated with the Stream classes in the same way the Collection classes are. This means that using streams and regexes in Smalltalk usually involves reading character strings from a stream and then testing those strings separately with regex patterns--not the sort "read next n characters up until a pattern matches," or "read next n characters matching this pattern" type of functionally you likely have in mind.
I think a powerful stream API coupled with powerful regex support would be great. However, I think you'd have trouble generalizing about different stream types. A read stream on a character string would pose few difficulties, but file and TCP streams would have their own exceptions and latencies that you would have to handle gracefully.

Try looking at scala.util.regexp, both the API documentation, and the code example at http://scala.sygneca.com/code/automata. I think would allow a computational linguist to match strings of words by looking for part of speech patterns, for example.

This is the principle behind most syntactic parsers, which operate in two phases. The first phase is the lexer, where identifiers, language keywords, and other special characters (arithmetic operators, braces, etc) are identified and split into Token objects that typically have a numeric field indicating the type of the lexeme, and optionally another field indicating the text of the lexeme.
In the second phase, a syntactic parser operates on the Token objects, matching them by magic number alone, to parse phrases. (Software for doing this includes Antlr, yacc/bison, Scala's cala.util.parsing.combinator.syntactical library, and plenty of others). The two phases don't entirely have to depend on each other -- you can get your Token objects from anywhere else that you like. The magic number aspect seems to be important, though, because the magic numbers are assigned to constants, and they're what make it easy to express your grammar in a readable language.
And remember, that anything you can accomplish with a regular expression can also be accomplished with a context-free grammar (usually just as easily).

Related

How to test if a string is a valid C++(ish) expression?

I am writing a program in C++ that needs to be able to test if a string (probably std::string) is a valid C++ expression. Variables can be checked if they have been declared (bool variableDeclared(std::string identifier)) and their type can also be checked (std::string variableType(std::string identifier)). The variableType function returns a string based on how it would be declared in C++ ("bool", "double", "char", etc).
The expression doesn't need to be evaluated but only tested to see if it is valid. The function only needs to support character literals, string literals, number literals, brackets, simple operators (+, -, *, /, ! (logic not), &&, ||, >, <, ==), and variables of type double, std::string (no function calls needed), bool and char. It is also not required to support string concatenation.
The desired result would be a function that is something like bool validExpression(std::string expression). It is also preferable that it allows me to modify the operations (for example I could change "==" to "equal-to").
How would I implement this? Is there a library that could do something like this, a regex statement or is it simply a matter of a long function with lots of if statements?
Formally, your situation is: you have a grammar which describes the language of expressions which you want to validate, and a word for which you want to determine whether it belongs in that language. This is a job for a parser of that language.
You could hand-cook something like a recursive-descent LL(1) parser, or use a tool to generate a parser. A well-known example of such a tool is Bison for generating LALR(1) parsers. Wikipedia has a long parser generator list.
Technical terms are used above mainly to provide entry points for googling.
You would start from defining your language more or less formally. (A language is a set of strings). A good way to define a language is to specify its context-free grammar. Describe additional conditions (like the requirement that variables must be declared, and of the right type) informally in prose.
The next step would be building a parser for your grammar specified at the previous step. There are several tools for building parsers from grammars automatically, from yacc/bison to boost::spirit.
After building and checking the parser, implement the informally-specified rules and plug them into your parser code/data.
Normally the next step, building an evaluator, would probably the easiest part of writing a simple interpreter, but you say you don't need one.
Describing your language as "just like C++ only with certain bits taken out" could be a preliminary step to the sequence outlined above. It is however not recommended to start out from C++ if you can help it. C++ is an extremely hard language to specify formally, and its parsers tend to be rather hairy, due to its convoluted declaration syntax.
you can run compiler as sub-process of your application. All you have to do is to pass arguments and parse response properly

Parse an XML in standard C/C++ without additional libraries

I have an XML (assuming it is valid) and I must parse it and store it in a tree.
What is the best approach to parse it, without using other libraries, just basic manipulation of strings?
Keep in mind that I don't have to validate it, just parse and memorize it into a tree.
The basic structure of XML is quite simple:
<tagname [attribute[="value"] ...]>content</tagname>
where the content may contain both normal text and more XML structures, or the special form
<tagname [attribute[="value"] ...]/>
which is equivalent to
<tagname [attribute[="value"] ...]></tagname>
that is,. empty content.
So if you don't need to interpret a DTD or do other fancy things, you can do the following:
Check that the first non-whitespace character is <. If not, you don't have XML and can just give an error and exit.
Now follows the tag name, until the first whitespace, or the / or the > character. Store that.
If the next non-whitespace character is /, check that it is followed by >. If so, you've finished parsing and can return your result. Otherwise, you've got malformed XML, and can exit with an error.
If the character is >, then you've found the end of the begin tag. Now follows the content. Continue at step 6.
Otherwise what follows is an argument. Parse that, store the result, and continue at step 3.
Read the content until you find a < character.
If that character is followed by /, it's the end tag. Check that it is followed by the tag name and >, and if yes, return the result. Otherwise, throw an error.
If you get here, you've found the beginning of a nested XML. Parse that with this algorithm, and then continue at 6.
Reading XML looks simple but doing it correctly involves a few complexities you don't really want to deal with. Indeed, writing a simple XML parser effectively amounts to creating yet another XML library. I have done it and an incomplete version of this is sitting somewhere on my disk. Even if you don't need to validate your XML structure:
whether you validate or not, you need to deal with entity references like < and the variety of character entity references like A and
the plain body of an XML document is relatively simple but the header a major pain to deal with in particular the DTD: there are two versions thereof which are slightly different and you probably need to process the inline DTD
even the body isn't entirely trivial because of these annoying character data segments
even without validation you may need to support external entity references
the characters to be accepted and/or rejected for various parts of XML are also somewhat interesting
note that XML is defined in terms of Unicode and proper handling of this isn't entirely trivial either: just using char or wchar_t just doesn't cut it.
The first version I implemented was a nice little iterator intended to pop out all the elements encountered. This allowed for the nice feature of easily stopping and continuing the parsing at the choice of the iterator user. Unfortunately, I didn't get it to fly when trying to copy with the various entity references. It would parse simple XML files nice and fast but some quirks in the specification I just didn't get right.
What worked best for me was creating a simple recursive decent parser combined with a suitable stack of buffers to somewhat transparently deal with entity references. However, to finish this completely I still need to deal with some encoding issues and in the end I just had higher priority projects to work on (in my spare time, that is).
In summary: it can be done, obviously, as others did. It is probably a somewhat pointless exercise unless you have a really bright idea which makes your implementation uniquely better suited than the alternatives.
The best and only approach is to re-implement such a library from scratch without using any other libraries...
You're welcome to use existing libraries like pugixml, for example. It's installation is as simple as adding the files to your project and start using it. It's lightweight compared to other validating parsers, such as Xerces.

what exactly is a token, in relation to parsing

I have to use a parser and writer in c++, i am trying to implement the functions, however i do not understand what a token is. one of my function/operations is to check to see if there are more tokens to produce
bool Parser::hasMoreTokens()
how exactly do i go about this, please help
SO!
I am opening a text file with text in it, all words are lowercased. How do i go about checking to see if it hasmoretokens?
This is what i have
bool Parser::hasMoreTokens() {
while(source.peek()!=NULL){
return true;
}
return false;
}
Tokens are the output of lexical analysis and the input to parsing. Typically they are things like
numbers
variable names
parentheses
arithmetic operators
statement terminators
That is, roughly, the biggest things that can be unambiguously identified by code that just looks at its input one character at a time.
One note, which you should feel free to ignore if it confuses you: The boundary between lexical analysis and parsing is a little fuzzy. For instance:
Some programming languages have complex-number literals that look, say, like 2+3i or 3.2e8-17e6i. If you were parsing such a language, you could make the lexer gobble up a whole complex number and make it into a token; or you could have a simpler lexer and a more complicated parser, and make (say) 3.2e8, -, 17e6i be separate tokens; it would then be the parser's job (or even the code generator's) to notice that what it's got is really a single literal.
In some programming languages, the lexer may not be able to tell whether a given token is a variable name or a type name. (This happens in C, for instance.) But the grammar of the language may distinguish between the two, so that you'd like "variable foo" and "type name foo" to be different tokens. (This also happens in C.) In this case, it may be necessary for some information to be fed back from the parser to the lexer so that it can produce the right sort of token in each case.
So "what exactly is a token?" may not always have a perfectly well defined answer.
A token is whatever you want it to be. Traditionally (and for
good reasons), language specifications broke the analysis into
two parts: the first part broke the input stream into tokens,
and the second parsed the tokens. (Theoretically, I think you
can write any grammar in only a single level, without using
tokens—or what is the same thing, using individual
characters as tokens. I wouldn't like to see the results of
that for a language like C++, however.) But the definition of
what a token is depends entirely on the language you are
parsing: most languages, for example, treat white space as
a separator (but not Fortran); most languages will predefine
a set of punctuation/operators using punctuation characters, and
not allow these characters in symbols (but not COBOL, where
"abc-def" would be a single symbol). In some cases (including
in the C++ preprocessor), what is a token depends on context, so
you may need some feedback from the parser. (Hopefully not;
that sort of thing is for very experienced programmers.)
One thing is probably sure (unless each character is a token):
you'll have to read ahead in the stream. You typically can't
tell whether there are more tokens by just looking at a single
character. I've generally found it useful, in fact, for the
tokenizer to read a whole token at a time, and keep it until the
parser needs it. A function like hasMoreTokens would in fact
scan a complete token.
(And while I'm at it, if source is an istream:
istream::peek does not return a pointer, but an int.)
A token is the smallest unit of a programming language that has a meaning. A parenthesis (, a name foo, an integer 123, are all tokens. Reducing a text to a series of tokens is generally the first step of parsing it.
A token is usually akin to a word in sponken language. In C++, (int, float, 5.523, const) will be tokens. Is the minimal unit of text which constitutes a semantic element.
When you split a large unit (long string) into a group of sub-units (smaller strings), each of the sub-units (smaller strings) is referred to as a "token". If there are no more sub-units, then you are done parsing.
How do I tokenize a string in C++?
A token is a terminal in a grammar, a sequence of one or more symbol(s) that is defined by the sequence itself, ie it does not derive from any other production defined in the grammar.

Finite State Machine parser

I would like to parse a self-designed file format with a FSM-like parser in C++ (this is a teach-myself-c++-the-hard-way-by-doing-something-big-and-difficult kind of project :)). I have a tokenized string with newlines signifying the end of a euh... line. See here for an input example. All the comments will and junk is filtered out, so I have a std::string like this:
global \n { \n SOURCE_DIRS src \n HEADER_DIRS include \n SOURCES bitwise.c framing.c \n HEADERS ogg/os_types.h ogg/ogg.h \n } \n ...
Syntax explanation:
{ } are scopes, and capitalized words signify that a list of options/files is to follow.
\n are only important in a list of options/files, signifying the end of the list.
So I thought that a FSM would be simple/extensible enough for my needs/knowledge. As far as I can tell (and want my file design to be), I don't need concurrent states or anything fancy like that. Some design/implementation questions:
Should I use an enum or an abstract class + derivatives for my states? The first is probably better for small syntax, but could get ugly later, and the second is the exact opposite. I'm leaning to the first, for its simplicity. enum example and class example. EDIT: what about this suggestion for goto, I thought they were evil in C++?
When reading a list, I need to NOT ignore \n. My preferred way of using the string via stringstream, will ignore \n by default. So I need simple way of telling (the same!) stringstream to not ignore newlines when a certain state is enabled.
Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
Here's the draft states I have in mind:
upper: reads global, exe, lib+ target names...
normal: inside a scope, can read SOURCES..., create user variables...
list: adds items to a list until a newline is encountered.
Each scope will have a kind of conditional (e.g. win32:global { gcc:CFLAGS = ... }) and will need to be handled in the exact same fashion eveywhere (even in the list state, per item).
Thanks for any input.
If you have nesting scopes, then a Finite State Machine is not the right way to go, and you should look at a Context Free Grammar parser. An LL(1) parser can be written as a set of recursive funcitons, or an LALR(1) parser can be written using a parser generator such as Bison.
If you add a stack to an FSM, then you're getting into pushdown automaton territory. A nondeterministic pushdown automaton is equivalent to a context free grammar (though a deterministic pushdown automaton is strictly less powerful.) LALR(1) parser generators actually generate a deterministic pushdown automaton internally. A good compiler design textbook will cover the exact algorithm by which the pushdown automaton is constructed from the grammar. (In this way, adding a stack isn't "hacky".) This Wikipedia article also describes how to construct the LR(1) pushdown automaton from your grammar, but IMO, the article is not as clear as it could be.
If your scopes nest only finitely deep (i.e. you have the upper, normal and list levels but you don't have nested lists or nested normals), then you can use a FSM without a stack.
There are two stages to analyzing a text input stream for parsing:
Lexical Analysis: This is where your input stream is broken into lexical units. It looks at a sequence of characters and generates tokens (analagous to word in spoken or written languages). Finite state machines are very good at lexical analysis provided you've made good design decision about the lexical structure. From your data above, individal lexemes would be things like your keywords (e.g. "global"), identifiers (e.g. "bitwise", "SOURCES"), symbolic tokesn (e.g. "{" "}", ".", "/"), numeric values, escape values (e.g. "\n"), etc.
Syntactic / Grammatic Analysis: Upon generating a sequence of tokens (or perhaps while you're doing so) you need to be able to analyze the structure to determine if the sequence of tokens is consistent with your language design. You generally need some sort of parser for this, though if the language structure is not very complicated, you may be able to do it with a finite state machine instead. In general (and since you want nesting structures in your case in particular) you will need to use one of the techniques Ken Bloom describes.
So in response to your questions:
Should I use an enum or an abstract class + derivatives for my states?
I found that for small tokenizers, a matrix of state / transition values is suitable, something like next_state = state_transitions[current_state][current_input_char]. In this case, the next_state and current_state are some integer types (including possibly an enumerated type). Input errors are detected when you transition to an invalid state. The end of an token is identified based on the state identification of valid endstates with no valid transition available to another state given the next input character. If you're concerned about space, you could use a vector of maps instead. Making the states classes is possible, but I think that's probably making thing more difficult than you need.
When reading a list, I need to NOT ignore \n.
You can either create a token called "\n", or a more generalize escape token (an identifier preceded by a backslash. If you're talking about identifying line breaks in the source, then those are simply characters you need to create transitions for in your state transition matrix (be aware of the differnce between Unix and Windows line breaks, however; you could create a FSM that operates on either).
Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
This is where you will need a grammar or pushdown automaton unless you can guarantee that the nesting will not exceed a certain level. Even then, it will likely make your FSM very complex.
Here's the draft states I have in mind: ...
See my commments on lexical and grammatical analysis above.
For parsing I always try to use something already proven to work: ANTLR with ANTLRWorks which is of great help for designing and testing a grammar. You can generate code for C/C++ (and other languages) but you need to build the ANTLR runtime for those languages.
Of course if you find flex or bison easier to use you can use them too (I know that they generate only C and C++ but I may be wrong since I didn't use them for some time).

Is it feasible to ascribe pronunciations to distinct source code concepts?

I frequently tutor fellow students in programming, most often in C++ or Java.
It is uniquely aggravating to try to verbally convey the essential syntax of a C++ expression. The speaker must give either an idiomatic translation into English, or a full specification of the code in verbal longhand, using explicit yet slow terms such as "opening parenthesis", "bitwise and", et cetera. Neither of these solutions is optimal.
In C++, there is a finite set of keywords—63—and operators—54, discounting named operators and treating compound assignment operators and prefix versus postfix auto-increment and decrement as distinct. There are just a few types of literal, a similar number of grouping symbols, and the semicolon. Unless I'm utterly mistaken, that's about it.
Would it not then be feasible to ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised.
Instead of creating new "words" to describe them, for things such as "include" you could simply prefix it with "keyword" when saying it aloud. You could use words/phrases commonly known to say other parts as well. As with any new programmer, you have to literally describe everything anyway, so I don't think that requires special attention. I think creating new words is the harder method...
So, for example:
#include <iostream>;
int main()
{
if (1 < 2)
return 1;
else
return 0;
}
Could be read out as:
(keyword) include iostream new-line
(keyword) int main no params start
block if number 1 (operator) less than
number 2 new-line (keyword) return
number 1 new-line (keyword) else
new-line (keyword) return number 0 end
block
Treat words in () as optional descriptive words, most likely to be used in more complex code. You could use the word 'literal' if you want them to actually write the descriptive word. For example
(keyword) if literal number (operator)
less than literal keyword
becomes
if (number < keyword)
Other words could be given defined meanings as well, such as 'split-line' when you want them to continue on the next line, without closing any currently open parenthesis, etc.
I personally find this method quite simple to use and easy to teach. YMMV, as always.
Of course, this doesn't solve the internationalisation issue, but at worst, would result in 'new words' being used in the non-English languages, which is no worse than the proposed solution you offered.
As a blind developer, programming since I was 13, I found this question really interesting. First of all, as mentioned by other peple, learning a new language to be able to understand code is not a practical solution, as it would probably take longer to learn the spoken utterances as it would to learn the actual programming language.
Reading the question/answers two further points occured to me:
Firstly, you'd be surprised how important "thinking time" is. I have previously programmed in C/C++/Java and now use C# as my primary language, and consider myself very competant. But when I did a couple of projects in Python, I found the reduced punctuation robbed me of my "thinking time" - subconsciously, I was using the punctuation to digest what I'd just heard - fascinating... However, the situation is a bit different when it comes to identifiers, as these aren't well known by the listener - I personally find it hard to listen to code with acronym variables (RGXRatio, RGVRatio) as I don't have time to figure out what it means. On the flip side, hungarian notation and initial underscores makes code hard to listen to as the length of the variables (in terms of time taken to speak) is much longer than the more important operations being performed on those variables.
Another thing to consider is that the length of the audio stream is an end result, but not the root cause. The reason the audio is so long is because audio is a one-dimensional medium, whereas reading text is a 2d medium with the ability to jump around and skip past irelevant/familiar text. It wouldn't work for a face-to-face lecture, but what if there were keyboard commands for controlling the speech. In text documents my screen reader lets me jump to the next line, but what if this were adapted to the semantics of a programming language. some research, such as by T V Raman at Google, includes using different voices for syntax highlighting, and audio cues to mark metadata like capitals.
I know the original question specifically related to a lecture given to a class, but if like myself you have to listen to entire files of source code , I also find the structure of the code makes a huge difference. I personally read code like a story - left to right, top to bottom. so it's very hard to trace through unfamiliar code when it's written bottom-up.
So would it not then be feasible to simply ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised
Perhaps, but you've lost sight of your goal. The premise was that the person listening did not already know the language. If he does, we can simply say "include iostream" when we mean #include <iostream>, or "vector of int" when we mean std::vector<int>.
Your premise was that the person listening is not familiar enough with the language to understand what you read out loud unless you read out exactly what it says.
Now, inventing a whole new language just to describe the primitives that occur in your source code doesn't solve the problem. Instead, you still have to read out every syntactic token (with simpler, more "standardized" pronunciations, yes, but they still have to be read out loud), and the person listening still won't understand you, because if they don't know C++ well enough to understand "include iostream", they won't understand your standardized pronunciation either. And if you're going to teach them your pronunciation, why bother, when you could've just taught them to understand C++ syntax directly instead?
There's also the root problem that C++ code tends to consist of a lot of syntactic tokens. Take a line as simple as this:
std::vector<int> v;
I count 9 tokens. Not one of them can be omitted. If the person listening does not understand the code and syntax well enough to understand a high-level description such as "declare a vector of int, named v", then you'll have to read out all 9 tokens in some form. Even if you come up with simpler names than "namespace resolution operator" and "less than sign", you still have to list 9 token names. Which is a lot of work.
In short, no, I don't think it'd work. First, it's still too cumbersome, and second, it's presuming prior knowledge on the part of the person listening, when the motivation for this was that the person listening was a student without the prior knowledge that made it possible to understand a high-level description of the code.