Implementing a stack based virtual machine for a subset of C - c++

Hello everyone, I'm currently implementing a simple programming language as a learning experience, but I'm in need of some advice. I'm designing my interpreter and I've run into a problem.
My language is a subset of C and I'm having a problem regarding the stack interpreter implementation. In the language the following will compile:
somefunc ()
{
    1 + 2;
}
main ()
{
    somefunc ();
}
Now this is alright but when "1+2" is computed the result is pushed onto a stack and then the function returns but there's still a number on the stack, and there shouldn't be. How can I get around this problem?
I've thought about saving a "state" of the stack before a function call and restoring the "state" after the function call. For example saving the number of elements on the stack, then execute the function code, return, and then pop from the stack until we have the same number of elements as before (or maybe +1 if the function returned something).
Any ideas? Thanks for any tips!

Great question! One of my hobbies is writing compilers for toy languages, so kudos for your excellent programming taste.
An expression statement is one where the code in the statement is simply an expression. This means anything of the form <expression> ;, which includes things like assignments and function calls, but not ifs, whiles, or returns. Any expression statement will have a left over value on the stack at the end, which you should discard.
1 + 2 is an expression statement, but so are these:
x = 5;
The assignment expression leaves the value 5 on the stack since the result of an assignment is the value of the left-hand operand. After the statement is finished you pop off the unused value 5.
printf("hello world!\n");
printf() returns the number of characters output. You will have this value left over on the stack, so pop it when the statement finishes.
Effectively every expression statement will leave a value on the stack unless the expression's type is void. In that case you either special-case void statements and don't pop anything afterwards, or push a pretend "void" value onto the stack so you can always pop a value.
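The rule above can be sketched with a tiny interpreter loop. This is only an illustration under assumptions: the opcode names (OP_PUSH, OP_ADD, OP_POP, OP_HALT) and the bytecode layout are invented here and are not taken from the question. It shows that the code for `1 + 2;` leaves one value behind unless the compiler emits a POP after the expression.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy stack VM; the names are illustrative only. */
enum { OP_PUSH, OP_ADD, OP_POP, OP_HALT };

/* Runs `code` until OP_HALT and returns how many values are left on the stack. */
static size_t run(const int *code)
{
    int stack[64];
    size_t sp = 0;                  /* number of values currently on the stack */

    for (size_t pc = 0;; ) {
        switch (code[pc++]) {
        case OP_PUSH:               /* operand follows the opcode */
            assert(sp < 64);
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:                /* replace the top two values by their sum */
            assert(sp >= 2);
            stack[sp - 2] += stack[sp - 1];
            sp--;
            break;
        case OP_POP:                /* discard the unused expression value */
            assert(sp >= 1);
            sp--;
            break;
        case OP_HALT:
            return sp;
        }
    }
}

/* Bytecode for the statement `1 + 2;` without and with the trailing POP. */
static const int leaky[]    = { OP_PUSH, 1, OP_PUSH, 2, OP_ADD, OP_HALT };
static const int balanced[] = { OP_PUSH, 1, OP_PUSH, 2, OP_ADD, OP_POP, OP_HALT };
```

Calling `run(leaky)` reports one leftover value, while `run(balanced)` reports an empty stack, which is the state the function should return with.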

You'll need a smarter parser. When you see an expression whose value isn't being used then you need to emit a POP.

This is an important opportunity to learn about optimization. You have a function that does nothing but integer math, and the result of that math isn't even used in any way, shape, or form.
Having your compiler optimize the function away would avoid a lot of bytecode being generated and executed for nothing!

Related

Ternary operator as a command?

In the source-code for nanodns, there is an atypical use of the ternary operator in an attempt to reduce the size of the code:
/* If the incoming packet has an AR record (such as in an EDNS request),
* mark the reply as "NOT IMPLEMENTED"; using a?b:c form to save one byte*/
q[11]?q[3]|=4:1;
It’s not obvious what this line does. At first glance, it looks like it is assigning a value to one of two array elements, but it is not. Rather, it seems to be either or’ing an array element, or else, doing nothing (running the “command” 1).
It looks like it is supposed to be a replacement for this line of code (which is indeed one byte longer):
if(q[11])q[3]|=4;
The literal equivalent would be this:
if (q[11])
    q[3]|=4;
else
    1;
The ternary operator is typically used as part of an expression, so seeing it used as a standalone command seems odd. Coupled with the seemingly out of place 1, this line almost qualifies as obfuscated code.
I did a quick test and was able to compile and run a C(++) program with data constants as “commands”, such as void main() {0; 'a'; "foobar"; false;}. It seems to be a sort of nop command, but I cannot find any information about such usage (Google isn’t very amenable to this type of search query).
Can anyone explain exactly what it is and how it works?
In C and C++ any expression can be made into a statement by putting ; at the end.
Another example is that the expression x = 5 can be made into a statement: x = 5; . Hopefully you agree that this is a good idea.
It would needlessly complicate the language to try and "ban" some subset of expressions from having ; come after them. This code isn't very useful but it is legal.
Please note that the code you linked to is awful and written by a really bad programmer. Particularly, the statement
"It is common practice in tiny C programs to define reused expressions
to make the code smaller"
is complete b***s***. That statement is where things started to go terribly wrong.
The size of the source code has no relation to the size of the compiled executable, nor any relation to that executable's memory consumption, nor any relation to program performance. The only thing it affects is the size of the source code files on the programmer's computer, expressed in bytes.
Unless you are programming on some 8086 computer from the mid-80s with very limited hard drive space, you never need to "reduce the size of the code". Instead, write readable code.
That being said, since q is an array of characters, the code you linked is equivalent to
if(q[11])
{
    (int)(q[3] |= 4);
}
else
{
    1;
}
Here 1 is a statement with no side effect, so it will get optimized away. It was only placed there because the ?: operator demands a 3rd operand.
The only difference between if statements and the ?: operator is subtle: the ?: implicitly balances the type between the 2nd and 3rd operand.
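To check the claimed equivalence, here is a small sketch that applies both forms to identical buffers and compares the results. The helper names (mark_ternary, mark_if, forms_agree) and the 12-byte buffer size are assumptions made for this illustration. The buffers use unsigned char rather than char so the bit operation itself is not in question.

```c
#include <assert.h>
#include <string.h>

/* Ternary-as-statement form: the trailing 1 is a discarded expression. */
static void mark_ternary(unsigned char *q)
{
    assert(q != NULL);
    q[11] ? q[3] |= 4 : 1;
}

/* Plain if form of the same operation. */
static void mark_if(unsigned char *q)
{
    assert(q != NULL);
    if (q[11])
        q[3] |= 4;
}

/* Returns 1 when both forms leave a 12-byte buffer in the same state. */
static int forms_agree(unsigned char flag)
{
    unsigned char a[12] = {0}, b[12] = {0};

    a[11] = b[11] = flag;
    mark_ternary(a);
    mark_if(b);
    return memcmp(a, b, sizeof a) == 0;
}
```

A compiler may warn about the unused value in the ternary form, which is a fair hint that the if form states the intent more directly.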
To increase readability and produce self-documenting code, the code should get rewritten to something like
if (q[AR_INDEX] != 0)
{
    q[REPLY_INDEX] |= NOT_IMPLEMENTED;
}
As a side note, there is a bug here: q[2]|=128;. q is of type char, which has implementation-defined signedness, so this line is potentially disastrous. The core problem is that you should never use the char type for bit-wise operations or any form of arithmetic, which is a classic beginner mistake. It must be replaced with uint8_t or unsigned char.
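The signedness problem can be made visible with a short sketch. It uses signed char explicitly to stand in for a platform where plain char is signed. The function names are made up for this example, and the exact value produced by the signed variant is implementation-defined, so the comment only says what is typical.

```c
#include <assert.h>
#include <limits.h>

/* Sets the top bit of a byte held in a signed char, then widens it to int. */
static int top_bit_signed(void)
{
    signed char c = 0;

    assert(CHAR_BIT == 8);  /* the sketch assumes 8-bit bytes */
    c |= 128;               /* 128 does not fit: implementation-defined result */
    return c;               /* typically -128 on two's-complement machines */
}

/* Same operation on an unsigned char. */
static int top_bit_unsigned(void)
{
    unsigned char c = 0;

    c |= 128;
    return c;               /* always 128 */
}
```

Only the unsigned variant is guaranteed to give back 128, which is why uint8_t or unsigned char is the safer choice for packet bytes.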

Read and write variable in an IF statement

I'm hoping to perform the following steps in a single IF statement to save on code writing:
If ret is TRUE, set ret to the result of function lookup(). If ret is now FALSE, print error message.
The code I've written to do this is as follows:
BOOLEAN ret = TRUE;
// ... functions assigning to `ret`
if ( ret && !(ret = lookup()) )
{
    fprintf(stderr, "Error in lookup()\n");
}
I've got a feeling that this isn't as simple as it looks. Reading from, assigning to and reading again from the same variable in an IF statement. As far as I'm aware, the compiler will always split statements like this up into their constituent operations according to precedence and evaluates conjuncts one at a time, failing immediately when evaluating an operand to false rather than evaluating them all. If so, then I expect the code to follow the steps I wrote above.
I've used assignments in IF statements a lot and I know they work, but not with another read beforehand.
Is there any reason why this isn't good code? Personally, I think it's easy to read and the meaning is clear, I'm just concerned about the compiler maybe not producing the equivalent logic for whatever reason. Perhaps compiler vendor disparities, optimisations or platform dependencies could be an issue, though I doubt this.
"...to save on code writing"
This is almost never a valid argument. Don't do this. Particularly, don't obfuscate your code into a buggy, unreadable mess to "save typing". That is very bad programming.
"I've got a feeling that this isn't as simple as it looks. Reading from, assigning to and reading again from the same variable in an IF statement."
Correct. It has little to do with the if statement in itself though, and everything to do with the operators involved.
"As far as I'm aware, the compiler will always split statements like this up into their constituent operations according to precedence and evaluates conjuncts one at a time"
Well, yes... but there is operator precedence and there is order of evaluation of subexpressions, they are different things. To make things even more complicated, there are sequence points.
If you don't know the difference between operator precedence and order of evaluation, or if you don't know what sequence points are, you need to instantly stop stuffing as many operators as you can into a single line, because in that case, you are going to write horrible bugs all over the place.
In your specific case, you get away with the bad programming only because, as a special case, there happens to be a sequence point between the evaluation of the left and right operands of the && operator. Had you written some similar mess with a different operator, for example ret + !(ret = lookup()), your code would have undefined behavior: a bug which will take hours, days or weeks to find. Well, at least you saved 10 seconds of typing!
Also, in both C and C++ use the standard bool type and not some home-brewed version.
You need to correct your code into something more readable and safe:
bool ret = true;
if(ret)
{
    ret = lookup();
}
if(!ret)
{
    fprintf(stderr, "Error in lookup()\n");
}
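The sequence point at && also means the right operand is skipped entirely when the left one is false. A minimal sketch of that, assuming a stand-in lookup() that always fails and counts its calls (the counter and the helper names are hypothetical, not from the question):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static int lookup_calls = 0;        /* how often the stand-in lookup() ran */

/* Hypothetical stand-in for the question's lookup(); it always fails. */
static bool lookup(void)
{
    lookup_calls++;
    return false;
}

/* The condition from the question; true means the error branch is taken. */
static bool error_branch_taken(bool ret)
{
    return ret && !(ret = lookup());
}

/* Evaluates the condition and returns how many times lookup() was called. */
static int calls_for(bool ret, bool *taken)
{
    assert(taken != NULL);
    lookup_calls = 0;
    *taken = error_branch_taken(ret);
    return lookup_calls;
}
```

With ret starting out false, lookup() is never called and the error branch is not taken. With ret starting out true, lookup() runs exactly once and its failure selects the error branch.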
"Is there any reason why this isn't good code?"
Yes, there are a lot of issues with such dirty code fragments!
1) Nobody can read it and it is not maintainable. A lot of coding guidelines contain a rule which tells you: "One statement per line".
2) If you combine multiple expressions in one if statement, the operands are only evaluated until the result of the whole expression is known. This means: if you have multiple expressions combined with AND, the first expression which yields false will be the last one to be evaluated. Same with OR combinations: the first one which evaluates to true is the last one evaluated. You already wrote this and you know this, but it is a bit of tricky programming. If all your colleagues write code that way, it may be OK, but as far as I know my colleagues, they would not understand what you are doing at first glance!
3) You should never compare and assign in one statement. It is simply ugly!
4) If you are already thinking "I'm just concerned about the compiler maybe not producing the equivalent logic", you should think again about why you are not sure what you are doing! I believe that everybody who must work with such dirty code will have second thoughts about such combinations.
Hint: Don't do that! Never!

How can I avoid using the stack with continuation-passing style?

For my diploma thesis I chose to implement the task of the ICFP 2004 contest.
The task--as I translated it to myself--is to write a compiler which translates a high-level ant-language into a low-level ant-assembly. In my case this means using a DSL written in Clojure (a Lisp dialect) as the high-level ant-language to produce ant-assembly.
UPDATE:
The ant-assembly has several restrictions: there are no assembly-instructions for calling functions (that is, I can't write CALL function1, param1), nor returning from functions, nor pushing return addresses onto a stack. Also, there is no stack at all (for passing parameters), nor any heap, or any kind of memory. The only thing I have is a GOTO/JUMP instruction.
Actually, the ant-assembly is meant to describe the transitions of a state machine (=the ants' "brain"). For "function calls" (=state transitions) all I have is a JUMP/GOTO.
While not having anything like a stack, heap or a proper CALL instruction, I still would like to be able to call functions in the ant-assembly (by JUMPing to certain labels).
In several places I read that by transforming my Clojure DSL function calls into continuation-passing style (CPS) I can avoid using the stack[1], and can translate my ant-assembly function calls into plain JUMPs (or GOTOs). That is exactly what I need, because in the ant-assembly I have no stack at all, only a GOTO instruction.
My problem is that after an ant-assembly function has finished, I have no way to tell the interpreter (which interprets the ant-assembly instructions) where to continue. Maybe an example helps:
The high-level Clojure DSL:
(defn search-for-food [cont]
  (sense-food-here? ; a conditional w/ 2 branches
    (pickup-food ; true branch, food was found
      (go-home ; ***
        (drop-food
          (search-for-food cont))))
    (move ; false branch, continue searching
      (search-for-food cont))))
(defn run-away-from-enemy [cont]
  (sense-enemy-here? ; a conditional w/ 2 branches
    (go-home ; ***
      (call-help-from-others cont))
    (search-for-food cont)))
(defn go-home [cont]
  (turn-backwards
    ; don't bother that this "while" is not in CPS now
    (while (not (sense-home-here?))
      (move)))
  (cont))
The ant-assembly I'd like to produce from the go-home function is:
FUNCTION-GO-HOME:
turn left nextline
turn left nextline
turn left nextline ; now we turned backwards
SENSE-HOME:
sense here home WE-ARE-AT-HOME CONTINUE-MOVING
CONTINUE-MOVING:
move SENSE-HOME
WE-ARE-AT-HOME:
JUMP ???
FUNCTION-DROP-FOOD:
...
FUNCTION-CALL-HELP-FROM-OTHERS:
...
The syntax for the ant-asm instructions above:
turn direction which-line-to-jump
sense direction what jump-if-true jump-if-false
move which-line-to-jump
My problem is that I can't figure out what to write on the last line of the assembly (JUMP ???). Because--as you can see in the example--go-home can be invoked with two different continuations:
(go-home
(drop-food))
and
(go-home
(call-help-from-others))
After go-home has finished I'd like to call either drop-food or call-help-from-others. In assembly: after I arrived at home (=the WE-ARE-AT-HOME label) I'd like to jump either to the label FUNCTION-DROP-FOOD or to the FUNCTION-CALL-HELP-FROM-OTHERS.
How could I do that without a stack, without PUSHing the address of the next instruction (=FUNCTION-DROP-FOOD / FUNCTION-CALL-HELP-FROM-OTHERS) to the stack? My problem is that I don't understand how continuation-passing style (=no stack, only a GOTO/JUMP) could help me solve this problem.
(I can try to explain this again if the things above are incomprehensible.)
And huge thanks in advance for your help!
--
[1] "interpreting it requires no control stack or other unbounded temporary storage". Steele: Rabbit: a compiler for Scheme.
Yes, you've provided the precise motivation for continuation-passing style.
It looks like you've partially translated your code into continuation-passing-style, but not completely.
I would advise you to take a look at PLAI, but I can show you a bit of how your function would be transformed, assuming I can guess at clojure syntax, and mix in scheme's lambda.
(defn search-for-food [cont]
  (sense-food-here? ; a conditional w/ 2 branches
    (search-for-food
      (lambda (r)
        (drop-food r
          (lambda (s)
            (go-home s cont)))))
    (search-for-food
      (lambda (r)
        (move r cont)))))
I'm a bit confused by the fact that you're searching for food whether or not you sense food here, and I find myself suspicious that either this is weird half-translated code, or just doesn't mean exactly what you think it means.
Hope this helps!
And really: go take a look at PLAI. The CPS transform is covered in good detail there, though there's a bunch of stuff for you to read first.
Your ant assembly language is not even Turing-complete. You said it has no memory, so how are you supposed to allocate the environments for your function calls? You can at most get it to accept regular languages and simulate finite automata: anything more complex requires memory. To be Turing-complete you'll need what amounts to a garbage-collected heap. To do everything you need to do to evaluate CPS terms you'll also need an indirect GOTO primitive. Function calls in CPS are basically (possibly indirect) GOTOs that provide parameter passing, and the parameters you pass require memory.
Clearly, your two basic options are to inline everything, with no "external" procedures (for extra credit look up the original meaning of "internal" and "external" here), or somehow "remember" where you need to go on "return" from a procedure "call" (where the return point does not necessarily need to fall in the physical locations immediately following the "calling" point). Basically, the return point identifier can be a code address, an index into a branch table, or even a character symbol -- it just needs to identify the return target relative to the called procedure.
The most obvious approach here would be to track, in your compiler, all of the return targets for a given call target, then, at the end of the called procedure, build a branch table (or branch ladder) to select one of the several possible return targets. (In most cases there are only a handful of possible return targets, though for commonly used procedures there could be hundreds or thousands.) Then, at the call point, the caller needs to load a parameter with the index of its return point relative to the called procedure.
Obviously, if the callee in turn calls another procedure, the first return point identifier must be preserved somehow.
Continuation passing is, after all, just a more generalized form of a return address.
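The branch-table idea can be modeled in a few lines of C. This is a sketch under assumptions: the enum names are invented (the labels only mirror the question's assembly), and it presumes there is somewhere to hold the return-point identifier. The question's ant assembly has no such storage, so there the compiler would presumably have to emit one copy of go-home per return point instead.

```c
#include <assert.h>

/* Labels of a toy state machine; the names mirror the question's assembly. */
enum label { FUNCTION_DROP_FOOD, FUNCTION_CALL_HELP_FROM_OTHERS };

/* Return-point identifiers, one per call site of go-home (hypothetical). */
enum ret_id { RET_FROM_SEARCH_FOR_FOOD, RET_FROM_RUN_AWAY, RET_COUNT };

/* Branch table the compiler would build after collecting every call site
   of go-home: index = return-point id, value = label to jump to. */
static const enum label go_home_returns[RET_COUNT] = {
    [RET_FROM_SEARCH_FOR_FOOD] = FUNCTION_DROP_FOOD,
    [RET_FROM_RUN_AWAY]        = FUNCTION_CALL_HELP_FROM_OTHERS,
};

/* Models the final "JUMP ???" of go-home: pick the target from the table. */
static enum label go_home_jump_target(enum ret_id ret)
{
    assert((unsigned)ret < (unsigned)RET_COUNT);
    return go_home_returns[ret];
}
```

Each caller passes its own identifier, so go-home called from search-for-food continues at FUNCTION_DROP_FOOD and go-home called from run-away-from-enemy continues at FUNCTION_CALL_HELP_FROM_OTHERS.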
You might be interested in Andrew Appel's book Compiling with Continuations.

Is this undefined behavior in C/C++ (Part 2) [duplicate]

What does the rule about sequence points say about the following code?
int main(void) {
    int i = 5;
    printf("%d", ++i, i); /* Statement 1 */
}
There is just one %d. I am confused because I am getting 6 as output in compilers GCC, Turbo C++ and Visual C++. Is the behavior well defined or what?
This is related to my last question.
It's undefined for 2 reasons:
The value of i is used twice without an intervening sequence point (the comma in argument lists is not the comma operator and does not introduce a sequence point).
You're calling a variadic function without a prototype in scope.
The number of arguments passed to printf() is not compatible with the format string.
The default output stream is usually line buffered. Without a '\n' there is no guarantee the output will actually be written.
All arguments get evaluated when calling a function, even if they are not used, so, since the order of evaluation of function arguments is unspecified, you have UB again.
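For contrast, here is a sketch of a version that avoids each of the issues listed above: the header supplies the prototype, the increment happens in its own statement, the format has exactly one conversion for one argument, and the output ends with a newline. The function name and the use of snprintf into a caller-supplied buffer are choices made for this illustration so the result can be inspected.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats the incremented value into buf and returns the new value of i. */
static int format_incremented(char *buf, size_t size)
{
    int i = 5;

    ++i;                                    /* modify i in its own statement */
    int n = snprintf(buf, size, "%d\n", i); /* one conversion, one argument */
    assert(n > 0 && (size_t)n < size);      /* output fit into the buffer */
    return i;
}
```

With a 16-byte buffer this writes the text "6" followed by a newline and returns 6, with no dependence on argument evaluation order.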
I think it's well defined. The printf matches the first % placeholder to the first argument, which in this instance is a preincremented variable.
All arguments are evaluated. Order not defined. All implementations of C/C++ (that I know of) evaluate function arguments from right to left. Thus i is usually evaluated before ++i.
In printf, %d maps to the first argument. The rest are ignored.
So printing 6 is the correct behavior.
I believe that the right-to-left evaluation order is very, very old (dating back to the first C compilers). It certainly existed well before C++ was invented, and most implementations of C++ would have kept the same evaluation order because early C++ implementations simply translated into C.
There are some technical reasons for evaluating function arguments right-to-left. In stack architectures, arguments are typically pushed onto the stack. In C, you can call a function with more arguments than actually specified -- the extra arguments are simply ignored. If arguments are evaluated left-to-right, and pushed left-to-right, then the stack slot right under the stack pointer will hold the last argument, and there is no way for the function to get at the offset of any particular argument (because the actual number of arguments pushed depends on the caller).
In a right-to-left push order, the stack slot right under the stack pointer will always hold the first argument, and the next slot holds the second argument etc. Argument offsets will always be deterministic for the function (which may be written and compiled elsewhere into a library, separately from where it is called).
Now, right-to-left push order does not mandate right-to-left evaluation order, but in early compilers, memory was scarce. In right-to-left evaluation order, the same stack can be used in-place (essentially, after evaluating the argument -- which may be an expression or a function call! -- the return value is already at the right position on the stack). In left-to-right evaluation, the argument values must be stored separately and then pushed back to the stack in reverse order.
Would be interested to know the true history behind right-to-left evaluation though.
According to this documentation, any additional arguments passed to a format string shall be ignored. It also mentions for fprintf that the argument will be evaluated then ignored. I'm not sure if this is the case with printf.

Fortran return statement

I'm trying to get some code compiled under gfortran that compiles fine under g77. The problem seems to be from a return statement:
ffuncs.f:934.13:
RETURN E
1
Error: Alternate RETURN statement at (1) requires a SCALAR-INTEGER return specifier
In the code, E was implicitly typed as real*8:
IMPLICIT REAL*8 ( A - H , O -Z )
However, E was never given a value or anything; in fact, you never see it until the return statement. I know almost nothing about Fortran. What is the meaning of a return statement with an argument in Fortran?
Thanks.
In FORTRAN (up to Fortran 77, which I'm very familiar with), RETURN n is not used to return a function value; instead, it does something like what in other languages would be handled by an exception: An exit to a code location other than the normal one.
You'd normally call such a SUBROUTINE or FUNCTION with labels as arguments, e.g.
      CALL MYSUB(A, B, C, *998, *999)
      ...
  998 STOP 'Error 1'
  999 STOP 'Error 2'
and if things go wrong in MYSUB then you do RETURN 1 or RETURN 2 (rather than the normal RETURN) and you'd be hopping straight to label 998 or 999 in the calling routine.
That's why normally you want an integer on that RETURN - it's not a value but an index to which error exit you want to take.
RETURN E sounds wrong to me. Unless there's a syntax I'm unaware of, the previous compiler should have flagged that as an error.
In a Fortran function one returns the value by assigning it to a result variable that has the same name as the function. Once you do that, simply return.
I think @Carl Smotricz has the answer. Does the argument list of ffuncs have dummy arguments that are asterisks (to match the asterisk-label in the calls)? Or was this used without there being alternate returns? If there were no alternate returns, just delete the "E". If there are alternate returns, the big question is what the program was doing before at run time, since the variable was of the wrong type and uninitialized. If the variable didn't have an integer value matching one of the expected branches, perhaps the program took the regular return branch -- but that's just a guess -- and if so, the easy fix is again to delete the "E".
The "alternate return" feature is considered "obsolescent" by the language standard and could be deleted in a future standard; compilers would likely continue to support it if it were removed because of legacy code. For new code, one simple alternative is to return an integer status variable and use a "select case" statement in the caller.