What dynamic programming features of Perl should I be using?

I am pretty new to scripting languages (Perl in particular), and most of the code I write is an unconscious effort to convert C code to Perl.
Reading about Perl, one of the things most often mentioned as the biggest difference is that Perl is a dynamic language: it can do things at runtime that static languages can only do at compile time, and it can do them better, because it has access to runtime information.
All that is okay, but what specific features should I, with some experience in C and C++, keep in mind while writing code in Perl to use all the dynamic programming features that it has, to produce some awesome code?

This question is more than enough to fill a book. In fact, that's precisely what happened!
Mark Jason Dominus' excellent Higher-Order Perl is available online for free.
Here is a quote from its preface that really grabbed me by the throat when I first read the book:
Around 1993 I started reading books about Lisp, and I discovered something important: Perl is much more like Lisp than it is like C. If you pick up a good book about Lisp, there will be a section that describes Lisp's good features. For example, the book Paradigms of Artificial Intelligence Programming, by Peter Norvig, includes a section titled What Makes Lisp Different? that describes seven features of Lisp. Perl shares six of these features; C shares none of them. These are big, important features, features like first-class functions, dynamic access to the symbol table, and automatic storage management.

A list of C habits not to carry over into Perl 5 (a sketch illustrating several of these points follows the lists):
Don't declare your variables at the top of the program/function. Declare them as they are needed.
Don't assign empty lists to arrays and hashes when declaring them (they are already empty and don't need to be initialized).
Don't use if (!(complex logical statement)) {}; that is what unless is for.
Don't use goto to break out of deeply nested loops; next, last, and redo all take a loop label as an argument.
Don't use global variables (this is a general rule even for C, but I have found a lot of C people like to use global variables).
Don't create a named function where a closure will do (callbacks in particular). See perldoc perlsub and perldoc perlref for more information.
Don't use in/out parameters; return multiple values instead.
Things to do in Perl 5:
Always use the strict and warnings pragmas.
Read the documentation (perldoc perl and perldoc -f function_name).
Use hashes the way you used structs in C.
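Here is a minimal sketch pulling several of these points together (the field names, sample data, and helper names are invented for illustration):
use strict;
use warnings;

# A hash works where a C struct would: named fields, no separate declaration.
my %point = ( x => 3, y => 4 );
print "x is $point{x}\n";

# Return multiple values instead of filling in out-parameters.
sub min_max {
    my @sorted = sort { $a <=> $b } @_;
    return ( $sorted[0], $sorted[-1] );
}
my ( $min, $max ) = min_max( 5, 2, 9 );    # (2, 9)

# unless instead of if (!(...)), and loop labels instead of goto.
ROW: for my $row ( 1 .. 3 ) {
    for my $col ( 1 .. 3 ) {
        next ROW unless $col <= $row;      # keep only the lower triangle
        print "($row, $col) ";
    }
}
print "\n";

# A closure where C would need a separate callback function.
my $count = 0;
my $tick  = sub { $count++ };
$tick->() for 1 .. 3;
print "ticked $count times\n";             # ticked 3 times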

Use the features that solve your problem with the best combination of maintainability, developer time, testability, and flexibility. Talking about any technique, style, or library outside of the context of a particular application isn't very useful.
Your goal shouldn't be to find problems for your solutions. Learn a bit more Perl than you plan on using immediately (and keep learning). One day you'll come across a problem and think "I remember something that might help with this".
You might want to see some of these books, however:
Higher-Order Perl
Mastering Perl
Effective Perl Programming
I recommend that you slowly and gradually introduce new concepts into your coding. Perl is designed so that you don't have to know a lot to get started, but you can improve your code as you learn more. Trying to grasp lots of new features all at once usually gets you in trouble in other ways.

I think the biggest hurdle will not be the dynamic aspect but the 'batteries included' aspect.
I think the most powerful aspects of Perl are:
hashes: they allow you to express very effective data structures easily
regular expressions: they're really well integrated
the use of the default variables like $_
the libraries and CPAN for whatever is not installed as standard
Something I noticed with C converts is the overuse of for loops. Many can be removed using grep and map, as the sketch below shows.
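For instance, a rough sketch of the kind of C-style loop that collapses into a grep/map pipeline (the data here is made up):
my @temps = ( 18, 25, 31, 22, 28 );

# The C-style accumulation loop...
my @hot;
for ( my $i = 0; $i < @temps; $i++ ) {
    if ( $temps[$i] > 24 ) {
        push @hot, $temps[$i] * 1.8 + 32;
    }
}

# ...collapses into a single pipeline, read right to left.
my @hot2 = map { $_ * 1.8 + 32 } grep { $_ > 24 } @temps;
print "@hot2\n";   # 77 87.8 82.4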
Another motto of Perl is "there is more than one way to do it". In order to climb the learning curve you have to tell yourself often: "There has got to be a better way of doing this, I cannot be the first one wanting to do ...". Then you can typically turn to Google and CPAN with its ridiculous number of libraries.
The learning curve of Perl is not steep, but it is very long... take your time and enjoy the ride.

Two points.
First, in general, I think you should be asking yourself two slightly different questions:
1) Which dynamic programming features of Perl can be used in which situations/to solve which problems?
2) What are the trade-offs, pitfalls, and downsides of each feature?
Then the answer to your question becomes extremely obvious: you should be using the features that solve your problem better (performance-wise or code-maintainability-wise) than a comparable non-DP solution, and which incur less than the maximum acceptable level of downsides.
As an example, to quote from FM's comment, the string form of eval has some fairly nasty downsides; but it MIGHT in certain cases be an extremely elegant solution which is orders of magnitude better than any alternative DP or SP approach.
Second, please be aware that a lot of "dynamic programming" features of Perl are actually packaged for you into extremely useful modules that you might not even recognize as being of the DP nature.
I'll have to think of a set of good examples, but one that immediately springs to mind is Text template modules, many of which are implemented using the above-mentioned string form of eval; or Try::Tiny exception mechanism which uses block form of eval.
Another example is aspect-oriented programming, which can be achieved via Moose (I can't find the relevant StackOverflow link now - if someone has it please edit in the link) - which underneath uses the symbol-table-access feature of DP.
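To make the two eval forms mentioned above concrete, here is a small hedged sketch (the division and doubling examples are invented): the block form traps runtime exceptions, while the string form compiles code assembled at runtime.
my $divisor = 0;

# Block form: compiled with the surrounding code; traps runtime exceptions.
my $result = eval { 10 / $divisor };
warn "division failed: $@" if $@;

# String form: compiles and runs code built at runtime (use sparingly).
my $code    = 'sub { my $n = shift; $n * 2 }';
my $doubler = eval $code or die $@;
print $doubler->(21), "\n";   # 42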

Most of the other answers here are complete and I won't repeat them. I will focus on my personal bias about excessive, or insufficient, use of idioms in the language you are writing code in. As the quip goes, it is possible to write C in any language. It is also possible to write unreadable code in any language.
I was trained in C and C++ in college and picked up Perl later. Perl is fabulous for quick solutions and some really long-lived solutions. I built a company on Perl and Oracle, solving logistics problems for the DoD with about 100 active programmers. I also have some experience in managing the habits of other Perl programmers, new and old. (I was the founder/CEO and not in technical management directly, however...)
I can only comment on my transition to a Perl programmer and what I saw at my company. Many of our engineers shared my background of primarily being C/C++ programmers by training and Perl programmers by choice.
The first issue I have seen (and had myself) is writing code that is so idiomatic that it is unreadable, unmaintainable, and unusable after a short period of time. Perl and C++ share the ability to write terse code that is entertaining to understand at the moment, but you will forget it, not be around, and others won't get it.
We hired (and fired) many programmers over the 5 years I had the company. A common interview question was the following: Write a short Perl program that will print all the odd numbers between 1 and 50 inclusive, separated by a space between each number and terminated with a CR. Do not use comments. They could do this on their own time of a few minutes and could do it on a computer to proof the output.
After they wrote the script and explained it, we would then ask them to modify it to print only the evens (in front of the interviewer), then to produce a pattern of results: every even, every odd except every seventh and 11th, as an example. Another potential mod would be every even in this range, odd in that range, and no primes, etc. The purpose was to see if their original small script withstood being modified, debugged, and discussed by others, and whether they thought in advance that the spec might change.
While the test did not say 'in a single line', many took the challenge to make it a single terse line, at the cost of readability. Others made a full module that just took too long given the simple spec. Our company needed to deliver solid code very quickly; that is why we used Perl. We needed programmers that thought the same way.
The following submitted code snippets all do exactly the same thing:
1) Too C like, but very easy to modify. Because of the C style 3 argument for loop it takes more bug prone modifications to get alternate cycles. Easy to debug, and a common submission. Any programmer in almost any language would understand this. Nothing particularly wrong with this, but not killer:
for ($i = 1; $i <= 50; $i += 2) {
    printf("%d ", $i);
}
print "\n";
2) Very Perl like, easy to get evens, easy (with a subroutine) to get other cycles or patterns, easy to understand:
print join(' ',(grep { $_ % 2 } (1..50))), "\n"; #original
print join(' ',(grep { !($_ % 2) } (1..50))), "\n"; #even
print join(' ',(grep { suba($_) } (1..50))), "\n"; #other pattern
3) Too idiomatic, getting a little weird, why does it get spaces between the results? Interviewee made mistake in getting evens. Harder to debug or read:
print "#{[grep{$_%2}(1..50)]}\n"; #original
print "#{[grep{$_%2+1}(1..50)]}\n"; #even - WRONG!!!
print "#{[grep{~$_%2}(1..50)]}\n"; #second try for even
4) Clever! But also too idiomatic. You have to think about what happens to the anonymous hash created from a range-operator list and why that yields odds and evens. Impossible to modify to another pattern:
print "$_ " for (sort {$a<=>$b} keys %{{1..50}}), "\n"; #orig
print "$_ " for (sort {$a<=>$b} keys %{{2..50}}), "\n"; #even
print "$_ " for (sort {$a<=>$b} values %{{1..50}}), "\n"; #even alt
5) Kinda C like again but a solid framework. Easy to modify beyond even/odd. Very readable:
for (1..50) {
    print "$_ " if ($_ % 2);
} #odd
print "\n";
for (1..50) {
    print "$_ " unless ($_ % 2);
} #even
print "\n";
6) Perhaps my favorite answer. Very Perl like yet readable (to me anyway) and step-wise in formation and right to left in flow. The list is on the right and can be changed, the processing is immediately to the left, formatting again to the left, final operation of 'print' on the far left.
print map { "$_ " } grep { $_ & 1 } 1..50; #original
print "\n";
print map { "$_ " } grep { !($_ & 1) } 1..50; #even
print "\n";
print map { "$_ " } grep { suba($_) } 1..50; #other
print "\n";
7) This is my least favorite credible answer. Neither C nor Perl, impossible to modify without gutting the loop, mostly showing the applicant knew Perl array syntax. He wanted to have a case statement really badly...
for (1..50) {
    if ($_ & 1) {
        $odd[++$#odd] = "$_ ";
        next;
    } else {
        push @even, "$_ ";
    }
}
print @odd, "\n";
print @even;
Interviewees with answers 5, 6, 2 and 1 got jobs and did well. Answers 7,3,4 did not get hired.
Your question was about using dynamic constructs like eval or others that you cannot do in a purely compiled language such as C. This last example is "dynamic" with the eval in the regex but truly poor style:
$t='D ' x 25;
$i=-1;
$t=~s/D/$i+=2/eg;
print "$t\n"; # don't let the door hit you on the way out...
Many will tell you "don't write C in Perl." I think this is only partially true. The error and mistake is to rigidly write new Perl code in C style even when there are so many more expressive forms in Perl. Use those. And yes, don't write NEW Perl code in C style because C syntax and idiom is all you know. (bad dog -- no biscuit)
Don't write dynamic code in Perl just because you can. There are certain algorithms that you will run across that you will say 'I don't quite know how I would write THAT in C' and many of these use eval. You can write a Perl regex to parse many things (XML, HTML, etc) using recursion or eval in the regex, but you should not do that. Use a parser just like you would in C. There are certain algorithms though that eval is a gift. Larry Wall's file name fixer rename would take a lot more C code to replicate, no? There are many other examples.
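As a rough sketch of the idea behind that classic rename script (simplified from memory, not the verbatim original): the user supplies a Perl expression on the command line, and eval applies it to each filename.
#!/usr/bin/perl
# Usage sketch:  rename 's/\.bak$//' *.bak
my $op = shift or die "usage: rename 'perl-expression' files...\n";
for (@ARGV) {
    my $was = $_;
    eval $op;                              # e.g. runs s/\.bak$// against $_
    die $@ if $@;
    rename( $was, $_ ) unless $was eq $_;  # only touch files the expression changed
}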
Don't rigidly avoid C style either. The C-style 3-argument form of a for loop may be the perfect fit for certain algorithms. Also, remember why you are using Perl: presumably for high programmer productivity. If I have a completely debugged piece of C code that does exactly what I want and I need it in Perl, I just rewrite the silly thing C-style in Perl! That is one of the strengths of the language (but also its weakness for larger or team projects, where individual coding styles may vary and make the overall code difficult to follow).
By far the killer verbal response to this interview question (from the applicant who wrote answer 6) was: This single line of code fits the spec and can easily be modified. However, there are many other ways to write this. The right way depends on the style of the surrounding code, how it will be called, performance considerations, and if the output format may change. Dude! When can you start?? (He ended up in management BTW.)
I think that attitude also applies to your question.

At least IME, the "dynamic" nature isn't really that big of a deal. I think the biggest difference you need to take into account is that in C or C++, you're mostly accustomed to there being only a fairly minor advantage to using library code. What's in the library is already written and debugged, so it's convenient, but if push comes to shove you can generally do pretty much the same thing on your own. For efficiency, it's mostly a question of whether your ability to write something a bit more specialized outweighs the library author's ability to spend more time on polishing each routine. There's little enough difference, however, that unless a library routine really does what you want, you may be better off writing your own.
With Perl, that's no longer true. Much of what's in the (huge, compared to C) library is actually written in C. Attempting to write anything very similar at all on your own (unless you write a C module, of course) will almost inevitably come out quite a bit slower. As such, if you can find a library routine that does even sort of close to what you want, you're probably better off using it. Using pre-written library code is much more important than in C or C++.
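For example, a small illustration with List::Util, a core module whose routines are implemented in C (the sample data is invented):
use List::Util qw(first sum max);

my @sizes = ( 312, 47, 1024, 8, 530 );

# These C-implemented routines will generally beat an equivalent hand-rolled Perl loop.
my $total   = sum @sizes;                  # 1921
my $largest = max @sizes;                  # 1024
my $big     = first { $_ > 500 } @sizes;   # 1024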

Good programming practices aren't specific to individual languages. They are valid across all languages. In the long run, you may find it best not to rely on tricks possible in dynamic languages (for example, functions that can return either integer or text values), as they make the code harder to maintain and to understand quickly. So ultimately, to answer your question, I don't think you should be looking for features specific to dynamically typed languages unless you have some compelling reason that you need them. Keep things simple and easy to maintain - that will be far more valuable in the long run.

There are many things you can do only in a dynamic language, and the coolest one is eval.
With eval, you can execute a string as if it were a pre-written command. You can also access a variable by name at runtime.
For example,
$Double = "print (\$Ord1 * 2);";
$Ord1 = 8;
eval $Double; # Prints 8*2 => 16.
$Ord1 = 7;
eval $Double; # Prints 7*2 => 14.
The variable $Double holds a string, but we can execute it as if it were a regular statement. This cannot be done in C/C++.
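As for accessing a variable by name at runtime, here is a minimal sketch using symbolic references (note this works on package variables only, and requires relaxing strict 'refs'):
our $total = 42;
my $name = "total";
{
    no strict 'refs';
    print ${$name}, "\n";   # 42 - the variable is looked up by its name
    ${$name} = 99;          # and can be assigned through its name, too
}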
The cool thing is that a string can be manipulated at run time; therefore, we can create a command at runtime.
# String concatenation of operands and operator is done inside eval (calculate) and then printed.
$Cmd = "print (eval (\"(\$Ord1 \".\$Opr.\" \$Ord2)\"));";
$Opr = "*";
$Ord1 = "5";
$Ord2 = "2";
eval $Cmd; # Prints 5*2 => 10.
$Ord2 = 3;
eval $Cmd; # Prints 5*3 => 15.
$Opr = "+";
eval $Cmd; # Prints 5+3 => 8.
eval is very powerful, so (as in Spider-Man) with power comes responsibility. Use it wisely.
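One hedged addendum: string eval swallows runtime exceptions into $@, so it is worth checking after every call (a tiny made-up example):
my $code = 'print 10 / $n, "\n";';
my $n = 0;
eval $code;
warn "eval failed: $@" if $@;   # Illegal division by zero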
Hope this helps.

Writing a Parser for a programming language: Output [closed]

I'm trying to write a simple interpreted programming language in C++. I've read that a lot of people use tools such as Lex/Flex and Bison to avoid "reinventing the wheel", but since my goal is to understand how these little beasts work and to improve my knowledge, I've decided to write the lexer and the parser from scratch. At the moment I'm working on the parser (the lexer is complete) and I was asking myself what its output should be. A tree? A linear vector of statements with a "depth" or "shift" parameter? How should I manage loops and if statements? Should I replace them with invisible goto statements?
A parser should almost always output an AST. An AST is simply, in the broadest sense, a tree representation of the syntactical structure of the program. A Function becomes an AST node containing the AST of the function body. An if becomes an AST node containing the AST of the condition and the body. A use of an operator becomes an AST node containing the AST of each operand. Integer literals, variable names, and so on become leaf AST nodes. Operator precedence and such is implicit in the relationship of the nodes: Both 1 * 2 + 3 and (1 * 2) + 3 are represented as Add(Mul(Int(1), Int(2)), Int(3)).
Many details of what's in the AST depend on your language (obviously) and on what you want to do with the tree. If you want to analyze and transform the program (i.e. emit altered source code at the end), you might preserve comments. If you want detailed error messages, you might add source locations (as in, this integer literal was on line 5, column 12).
A compiler will proceed to turn the AST into a different format (e.g. a linear IR with gotos, or data flow graphs). Going through the AST is still a good idea, because a well-designed AST has a good balance of being syntax-oriented but only storing what's important for understanding the program. The parser can focus on parsing while the later transformations are protected from irrelevant details such as the amount of white space and operator precedence. Note that such a "compiler" might also output bytecode that's later interpreted (the reference implementation of Python does this).
A relatively pure interpreter might instead interpret the AST. Much has been written about this; it is about the easiest way to execute the parser's output. This strategy benefits from the AST in much the same way as a compiler; in particular most interpretation is simply top-down traversal of the AST.
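To make that traversal concrete, here is a hedged sketch of a tiny AST interpreter in Perl (the node shapes are invented for illustration; it evaluates the Add(Mul(Int(1), Int(2)), Int(3)) example from above):
use strict;
use warnings;

# Each node is a hash: { op => 'Add'|'Mul', kids => [...] } or { op => 'Int', val => n }.
sub evaluate {
    my ($node) = @_;
    return $node->{val} if $node->{op} eq 'Int';
    my @v = map { evaluate($_) } @{ $node->{kids} };
    return $v[0] + $v[1] if $node->{op} eq 'Add';
    return $v[0] * $v[1] if $node->{op} eq 'Mul';
    die "unknown node type: $node->{op}";
}

# 1 * 2 + 3 parses to Add(Mul(Int(1), Int(2)), Int(3)).
my $ast = {
    op   => 'Add',
    kids => [
        { op => 'Mul', kids => [ { op => 'Int', val => 1 },
                                 { op => 'Int', val => 2 } ] },
        { op => 'Int', val => 3 },
    ],
};
print evaluate($ast), "\n";   # 5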
The formal and most properly correct answer is going to be that you should return an Abstract Syntax Tree. But that is simultaneously the tip of an iceberg and no answer at all.
An AST is simply a structure of nodes describing the parse; a visualization of the paths your parse took thru the token/state machine.
Each node represents a path or description. For example, you would have nodes which represents language statements, nodes which represent compiler directives and nodes which represent data.
Consider a node which describes a variable, and let's say your language supports variables of int and string and the notion of "const". You may well choose to make the type a direct property of the Variable node struct/class, but typically in an AST you make properties - like constness - a "mutator", which is itself some form of node linked to the Variable node.
You could implement the C++ concept of "scope" by having locally-scoped variables as mutations of a BlockStatement node; the constraints of a "Loop" node (for, do, while, etc) as mutators.
When you closely tie your parser/tokenizer to your language implementation, it can become a nightmare making even small changes.
While this is true, if you actually want to understand how these things work, it is worth going through at least one first implementation where you begin to implement your runtime system (vm, interpreter, etc) and have your parser target it directly. (The alternative is, e.g., to buy a copy of the "Dragon Book" and read how it's supposed to be done, but it sounds like you are actually wanting to have the full understanding that comes from having worked thru the problem yourself).
The trouble with being told to return an AST is that an AST actually needs a form of parsing.
#include <string>

struct Node
{
    enum class Type {
        Variable,
        Condition,
        Statement,
        Mutator,
    };
    Node* m_parent;
    Node* m_next;
    Node* m_child;
    Type m_type;
    std::string m_file;
    size_t m_lineNo;
};

struct VariableMutatorNode : public Node
{
    enum class Mutation {
        Const
    };
    Mutation m_mutation;
    // ...
};

struct VariableNode : public Node
{
    VariableMutatorNode* m_mutators;
    // ...
};

Node* ast; // Top-level node of the AST.
This sort of AST is probably OK for a compiler that is independent of its runtime, but you'd need to tighten it up a lot for a complex, performance-sensitive language down the road (at which point there is less 'A' in 'AST').
The way you walk this tree is to start with the first node of 'ast' and act according to it. If you're writing in C++, you can do this by attaching behaviors to each node type. But again, that's not so "abstract", is it?
Alternatively, you have to write something which works its way thru the tree.
switch (node->m_type) {
case Node::Type::Variable:
    declareVariable(node);
    break;
case Node::Type::Condition:
    evaluate(node);
    break;
case Node::Type::Statement:
    execute(node);
    break;
}
And as you write this, you'll find yourself thinking "wait, why didn't the parser do this for me?" because processing an AST often feels a lot like you did a crap job of implementing the AST :)
There are times when you can skip the AST and go straight to some form of final representation, and (rare) times when that is desirable; then there are times when you could go straight to some form of final representation but now you have to change the language and that decision will cost you a lot of reimplementation and headaches.
This is also generally the meat of building your compiler - the lexer and parser are generally the lesser parts of such an undertaking. Working with the abstract/post-parse representation is a much more significant part of the work.
That's why people often go straight to flex/bison or antlr or some such.
And if that's what you want to do, looking at .NET or LLVM/Clang can be a good option, but you can also fairly easily bootstrap yourself with something like this: http://gnuu.org/2009/09/18/writing-your-own-toy-compiler/4/
Best of luck :)
I would build a tree of statements. After that, yes, goto statements are how the majority of it works (jumps and calls). Are you translating to a low level like assembly?
The output of the parser should be an abstract syntax tree, unless you know enough about writing compilers to directly produce byte-code, if that's your target language. It can be done in one pass but you need to know what you're doing. The AST expresses loops and ifs directly: you're not concerned with translating them yet. That comes under code generation.
People don't use lex/yacc to avoid re-inventing the wheel; they use it to build a more robust compiler prototype more quickly, with less effort, and to focus on the language while avoiding getting bogged down in other details. From personal experience with several VM projects, compilers, and assemblers, I suggest that if you want to learn how to build a language, do just that -- focus on building a language (first).
Don't get distracted with:
Writing your own VM or runtime
Writing your own parser generator
Writing your own intermediate language or assembler
You can do these later.
This is a common thing I see when a bright young computer scientist first catches the "language fever" (and it's a good thing to catch), but you need to be careful and focus your energy on the one thing you want to do well, and make use of other robust, mature technologies like parser generators, lexers, and runtime platforms. You can always circle back later, once you have slain the compiler dragon first.
Just spend your energy learning how a LALR grammar works, write your language grammar in Bison or Yacc++ if you can still find it, don't get distracted by people who say you should be using ANTLR or whatever else, that isn't the goal early on. Early on, you need to focus on crafting your language, removing ambiguities, creating a proper AST (maybe the most important skillset), semantic checking, symbol resolution, type resolution, type inference, implicit casting, tree rewriting, and of course, end program generation. There is enough to be done making a proper language that you don't need to be learning multiple other areas of research that some people spend their whole careers mastering.
I recommend you target an existing runtime like the CLR (.NET). It is one of the best runtimes for crafting a hobby language. Get your project off the ground using a textual output to IL, and assemble with ilasm. ilasm is relatively easy to debug, assuming you put some time into learning it. Once you get a prototype going, you can then start thinking about other things like an alternate output to your own interpreter, in case you have language features that are too dynamic for the CLR (then look at the DLR). The main point here is that CLR provides a good intermediate representation to output to. Don't listen to anyone that tells you you should be directly outputting bytecode. Text is king for learning in the early stages and allows you to plug and play with different languages / tools. A good book is by the author John Gough, titled Compiling for the .NET Common Language Runtime (CLR) and he takes you through the implementation of the Gardens Point Pascal Compiler, but it isn't a book about Pascal, it is a book about how to build a real compiler on the CLR. It will answer many of your questions on implementing loops and other high level constructs.
Related to this, a great tool for learning is to use Visual Studio and ildasm (the disassembler) and .NET Reflector. All available for free. You can write small code samples, compile them, then disassemble them to see how they map to a stack based IL.
If you aren't interested in the CLR for whatever reason, there are other options out there. You will probably run across llvm, Mono, NekoVM, and Parrot (all good things to learn) in your searches. I was an original Parrot VM / Perl 6 developer, and wrote the Perl Intermediate Representation language and imcc compiler (which is quite a terrible piece of code I might add) and the first prototype Perl 6 compiler. I suggest you stay away from Parrot and stick with something easier like .NET CLR, you'll get much further. If, however, you want to build a real dynamic language, and want to use Parrot for its continuations and other dynamic features, see the O'Reilly Books Perl and Parrot Essentials (there are several editions), the chapters on PIR/IMCC are about my stuff, and are useful. If your language isn't dynamic, then stay far away from Parrot.
If you are bent on writing your own VM, let me suggest you prototype the VM in Perl, Python or Ruby. I have done this a couple of times with success. It allows you to avoid too much implementation early, until your language starts to mature. Perl+Regex are easy to tweak. An intermediate language assembler in Perl or Python takes a few days to write. Later, you can rewrite the 2nd version in C++ if you still feel like it.
All this I can sum up with: avoid premature optimizations, and avoid trying to do everything at once.
First you need to get a good book. So I refer you to the book by John Gough in my other answer, but emphasize, focus on learning to implement an AST for a single, existing platform first. It will help you learn about AST implementation.
How to implement a loop?
Your language parser should return a tree node during the reduce step for the WHILE statement. You might name your AST class WhileStatement; the WhileStatement has, as members, a ConditionExpression, a BlockStatement, and several labels (these could also be inherited, but I added them inline for clarity).
The grammar pseudocode below shows how the reduce step creates a new WhileStatement object in a typical shift-reduce parser reduction.
How does a shift-reduce parser work?
WHILE ( ConditionExpression ) BlockStatement
{
    $$ = new WhileStatement($3, $5);
    statementList.Add($$); // this is your statement list (AST nodes), not the parse stack
}
;
As your parser sees "WHILE", it shifts the token on the stack. And so forth.
parseStack.push(WHILE);
parseStack.push('(');
parseStack.push(ConditionalExpression);
parseStack.push(')');
parseStack.push(BlockStatement);
The instance of WhileStatement is a node in a linear statement list. So behind the scenes, the "$$ =" represents a parse reduce (though if you want to be pedantic, $$ = ... is user-code, and the parser is doing its own reductions implicitly, regardless). The reduce can be thought of as popping off the tokens on the right side of the production, and replacing with the single token on the left side, reducing the stack:
// shift-reduce
parseStack.pop_n(5); // pop off the top 5 tokens ($1 = WHILE, $2 = (, $3 = ConditionExpression, etc.)
parseStack.push(currToken); // replace with the current $$ token
You still need to add your own code to add statements to a linked list, with something like "statements.add(whileStatement)" so you can traverse this later. The parser has no such data structure, and its stacks are only transient.
During parse, synthesize a WhileStatement instance with its appropriate members.
In a later phase, implement the visitor pattern to visit each statement and resolve symbols and generate code. So a while loop might be implemented with the following AST C++ class:
class WhileStatement : public CompoundStatement {
public:
    ConditionExpression* condExpression; // this is the conditional check
    Label* startLabel;                   // Label can simply be a Symbol
    Label* redoLabel;
    Label* endLabel;
    BlockStatement* loopStatement;       // this is the loop code

    bool ResolveSymbolsAndTypes();
    bool SemanticCheck();
    bool Emit(); // emit code
};
Your code generator needs to have a function that generates sequential labels for your assembler. A simple implementation is a function to return a string with a static int that increments, and returns LBL1, LBL2, LBL3, etc. Your labels can be symbols, or you might get fancy with a Label class, and use a constructor for new Labels:
class Label : public Symbol {
public:
    Label() {
        this->name = newLabel(); // incrementing LBL1, LBL2, LBL3
    }
};
A loop is implemented by emitting the redoLabel, then generating the code for condExpression, then the blockStatement, and at the end of the blockStatement, a goto back to redoLabel.
A sample from one of my compilers to generate code for the CLR.
// Generate code for the .NET CLR for a While statement.
//
void WhileStatement::clr_emit(AST *ctx)
{
    redoLabel  = compiler->mkLabelSym();
    startLabel = compiler->mkLabelSym();
    endLabel   = compiler->mkLabelSym();

    // Emit the redo label, which marks the beginning of each iteration.
    compiler->out("%s:\n", redoLabel->getName());

    if (condExpr) {
        condExpr->clr_emit_handle();
        condExpr->clr_emit_fetch(this, t_bool);
        // Test the condition; if false, branch to endLabel, else fall through.
        compiler->out("brfalse %s\n", endLabel->getName());
    }

    // The body of the loop.
    compiler->out("%s:\n", startLabel->getName()); // start label, only for clarity
    loopStmt->clr_emit(this);                      // generate code for the block

    // Jump back to the top of the loop; endLabel exits it.
    compiler->out("br %s\n", redoLabel->getName());   // goto redoLabel
    compiler->out("%s:\n", endLabel->getName());      // endLabel for goto out of loop
}

Where does the k prefix for constants come from?

It's a pretty common practice that constants are prefixed with k (e.g. k_pi). But what does the k mean?
Is it simply that c already meant char?
It's a historical oddity, still common practice among teams who like to blindly apply coding standards that they don't understand.
Long ago, most commercial programming languages were weakly typed; automatic type checking, which we take for granted now, was still mostly an academic topic. This meant that it was easy to write code with category errors; it would compile and run, but go wrong in ways that were hard to diagnose. To reduce these errors, a chap called Simonyi suggested that you begin each variable name with a tag to indicate its (conceptual) type, making it easier to spot when it was misused. Since he was Hungarian, the practice became known as "Hungarian notation".
Some time later, as typed languages (particularly C) became more popular, some idiots heard that this was a good idea, but didn't understand its purpose. They proposed adding redundant tags to each variable, to indicate its declared type. The only use for them is to make it easier to check the type of a variable; unless someone has changed the type and forgotten to update the tag, in which case they are actively harmful.
The second (useless) form was easier to describe and enforce, so it was blindly adopted by many, many teams; decades later, you still see it used, and even advocated, from time to time.
"c" was the tag for type "char", so it couldn't also be used for "const"; so "k" was chosen, since that's the first letter of "konstant" in German, and is widely used for constants in mathematics.
I haven't seen it that much, but maybe it comes from certain languages' (the Germanic ones in particular) spelling of the word constant - konstant.
Don't use Hungarian Notation. If you want constants to stand out, make them all caps.
As a side note: there are a lot of things in the Google Coding Standards that are poor practice (in terms of code readability). That is what happens when you design a coding standard by committee.
It means the value is k-onstant.
I think mathematical convention was the precedent. k is used in maths all the time as just some constant.
K stands for konstant, a wordplay on constant. It relates to Coding Styles.
It's just a matter of preference, some people and projects use them which means they also embrace the Hungarian notation, many don't. That's not that important.
If you're unsure what a prefix or style might mean, always check if the project has a coding style reference and read that.
Actually, whenever I define constants in TypeScript, I do something like this:
NODE_ENV = 'production';
But recently, I saw that the k prefix is used in the Flutter SDK. It makes sense to me to keep using the k prefix, because it helps your editor/IDE in searching out constants in your codebase.
It's a convention, probably from math. But there are other suggestions for constants too; for example, Kernighan and Ritchie in their book The C Programming Language suggest writing constants' names in capital letters (e.g. #define MAX 55).
I think it means coefficient (as k often does in math).

In "aa67bc54c9", is there any way to print "aa" 67 times, "bc" 54 times and so on, using regular expressions?

I was asked this question in an interview for an internship, and the first solution I suggested was to try and use a regular expression (I usually am a little stumped in interviews). Something like this
(?P<str>[a-zA-Z]+)(?P<n>[0-9]+)
I thought it would match the strings and store them in the variable "str" and the numbers in the variable "n". How, I was not sure.
So it matches strings of type "a1b2c3", but a problem here is that it also matches strings of type "a1b". Could anyone suggest a solution to deal with this problem?
Also, is there any other regular expression that could solve this problem?
Do you know why "regular expressions" are called "regular"? :-)
That would be too long to explain, so I'll just outline the way. To match a pattern (i.e. to decide whether a given string is "valid" or "invalid"), a theoretical computer scientist would use a finite state automaton. That's an abstract machine that has a finite number of states; each tick it reads a char from the input and jumps to another state. The pattern of where to jump from a particular state when a particular character is read is fixed. Some states are marked as "OK", some as "FAIL", so that by examining the state of the machine you can check whether your text is "valid" (i.e. a valid e-mail).
For example, this machine only accepts "nice" as its "valid" word (the original answer illustrates it with a state diagram from Wikipedia).
The set of "valid" words such a machine can theoretically distinguish from invalid ones is called a "regular language". Not every set is a regular language: for example, finite state automata are incapable of checking whether the parentheses in a string are balanced.
But constructing state machines was a complex task, compared to the complexity of defining what "valid" is. So the mathematicians (mainly S. Kleene) noted that every regular language could be described with a "regular expression". They had *s and |s and were the prototypes of what we know as regexps now.
What does it have to do with the problem? The problem in subject is essentially non-regular. It can't be expressed with anything that works like a finite automaton.
The essence is that it would need a memory cell capable of holding an arbitrary number (the repetition count in your case). Finite automata and classical regular expressions can not do this.
However, modern regexps are more expressive and are said to be able to check balanced parentheses! But this may serve as a good example of why you shouldn't use regexps for tasks they don't suit. Let alone that such a solution contains code snippets; this makes the expression far from being "regular".
Answering the initial question: you can't solve your problem using anything "regular" only. However, regexps can aid you in solving this problem, as in tster's answer.
Perhaps I should look closer at tster's answer (do a "+1" there, please!) and show why it's not a "regular expression" solution. One may think that it is: it just contains a print statement (not essential) and a loop - and the loop concept is compatible with the expressive power of a finite state automaton. But there is one more elusive thing:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
    print $1
          x     # <--- this one
          $2;
}
The task of reading a string and a number and printing that string repeatedly the given number of times, where the number is an arbitrary integer, cannot be done on a finite state machine without additional memory. You need a memory cell to keep that number, decrease it, and check whether it is still greater than zero. But that number may be arbitrarily big, and this contradicts the finite memory available to a finite state machine.
However, there's nothing wrong with the classical pattern /([abc]*){5}/ that matches something "regular" repeated a fixed number of times. We essentially have states that correspond to "matched pattern once", "matched pattern twice" ... "matched pattern 5 times". There's a finite number of them, and that's the gist of the difference.
how about:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
    print $1 x $2;
}
Answering your question directly:
No, regular expressions match text and don't print anything, so there is no way to do it solely using regular expressions.
The regular expression you gave will match one string/number pair; you can then print that repeatedly using an appropriate mechanism. The Perl solution from @tster is about as compact as it gets. (It doesn't use the names that you applied in your regex; I'm pretty sure that doesn't matter.)
The remaining details depend on your implementation language.
Nope, this is your basic 'trick question': no matter how you answer it, your answer is wrong unless it is exactly the answer the interviewer was trained to parrot. See the workup of the issue given by Pavel Shved: the machine just keeps sliding from state to state, and even when it changes state there is no counter in that state.
I have a rather advanced book by Kenneth C. Louden, a college professor on the matter, in which it is stated that the issue at hand is codified as "regexes can't count." The obvious answer to the question seems to me, at the moment, to be using the lookahead feature of regexes...
It probably depends on what build of what brand of regex the interviewer is using, which probably depends on the flight dynamics of golf balls.
Nice answers so far. Regular expressions alone are generally thought of as a way to match patterns, not generate output in the manner you mentioned.
Having said that, there is a way to use regex as part of the solution. @Jonathan Leffler made a good point in his comment on tster's reply: "... maybe you need a better regex library in your language."
Depending on your language of choice and the library available, it is possible to pull this off. Using C# and .NET, for example, this could be achieved via the Regex.Replace method. However, the solution is not 100% regex since it still relies on other classes and methods (StringBuilder, String.Join, and Enumerable.Repeat) as shown below:
string input = "aa67bc54c9";
string pattern = @"([a-z]+)(\d+)";
string result = Regex.Replace(input, pattern, m =>
    // can be achieved using StringBuilder or String.Join/Enumerable.Repeat
    // don't use both
    //new StringBuilder().Insert(0, m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToString()
    String.Join("", Enumerable.Repeat(m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToArray())
    + Environment.NewLine // comment out to prevent line breaks
);
Console.WriteLine(result);
A clearer solution would be to identify the matches, loop over them and insert them using the StringBuilder rather than rely on Regex.Replace. Other languages may have compact idioms to handle the string multiplication that doesn't rely on other library classes.
To answer the interview question, I would reply with, "it's possible, however the solution would not be a stand-alone 100% regex approach and would rely on other language features and/or libraries to handle the generation aspect of the question since the regex alone is helpful in matching patterns, not generating them."
And based on the other responses here you could beef up that answer further if needed.

Is it feasible to ascribe pronunciations to distinct source code concepts?

I frequently tutor fellow students in programming, most often in C++ or Java.
It is uniquely aggravating to try to verbally convey the essential syntax of a C++ expression. The speaker must give either an idiomatic translation into English, or a full specification of the code in verbal longhand, using explicit yet slow terms such as "opening parenthesis", "bitwise and", et cetera. Neither of these solutions is optimal.
In C++, there is a finite set of keywords (63) and operators (54, discounting named operators and treating compound assignment operators and prefix versus postfix auto-increment and decrement as distinct). There are just a few types of literal, a similar number of grouping symbols, and the semicolon. Unless I'm utterly mistaken, that's about it.
Would it not then be feasible to ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised.
Instead of creating new "words" to describe them, for things such as "include" you could simply prefix it with "keyword" when saying it aloud. You could use words/phrases commonly known to say other parts as well. As with any new programmer, you have to literally describe everything anyway, so I don't think that requires special attention. I think creating new words is the harder method...
So, for example:
#include <iostream>

int main()
{
    if (1 < 2)
        return 1;
    else
        return 0;
}
Could be read out as:
(keyword) include iostream new-line (keyword) int main no params start block if number 1 (operator) less than number 2 new-line (keyword) return number 1 new-line (keyword) else new-line (keyword) return number 0 end block
Treat words in () as optional descriptive words, most likely to be used in more complex code. You could use the word 'literal' if you want them to actually write the descriptive word. For example
(keyword) if literal number (operator) less than literal keyword
becomes
if (number < keyword)
Other words could be given defined meanings as well, such as 'split-line' when you want them to continue on the next line, without closing any currently open parenthesis, etc.
I personally find this method quite simple to use and easy to teach. YMMV, as always.
Of course, this doesn't solve the internationalisation issue, but at worst, would result in 'new words' being used in the non-English languages, which is no worse than the proposed solution you offered.
As a blind developer, programming since I was 13, I found this question really interesting. First of all, as mentioned by other people, learning a new language to be able to understand code is not a practical solution, as it would probably take longer to learn the spoken utterances than it would to learn the actual programming language.
Reading the question/answers, two further points occurred to me:
Firstly, you'd be surprised how important "thinking time" is. I have previously programmed in C/C++/Java and now use C# as my primary language, and consider myself very competent. But when I did a couple of projects in Python, I found the reduced punctuation robbed me of my "thinking time" - subconsciously, I was using the punctuation to digest what I'd just heard - fascinating... However, the situation is a bit different when it comes to identifiers, as these aren't well known by the listener - I personally find it hard to listen to code with acronym variables (RGXRatio, RGVRatio) as I don't have time to figure out what they mean. On the flip side, Hungarian notation and initial underscores make code hard to listen to, as the length of the variables (in terms of time taken to speak) is much longer than the more important operations being performed on those variables.
Another thing to consider is that the length of the audio stream is an end result, but not the root cause. The reason the audio is so long is that audio is a one-dimensional medium, whereas reading text is a 2D medium with the ability to jump around and skip past irrelevant/familiar text. It wouldn't work for a face-to-face lecture, but what if there were keyboard commands for controlling the speech? In text documents my screen reader lets me jump to the next line, but what if this were adapted to the semantics of a programming language? Some research, such as that by T. V. Raman at Google, includes using different voices for syntax highlighting, and audio cues to mark metadata like capitals.
I know the original question specifically related to a lecture given to a class, but if, like myself, you have to listen to entire files of source code, I also find that the structure of the code makes a huge difference. I personally read code like a story - left to right, top to bottom. So it's very hard to trace through unfamiliar code when it's written bottom-up.
So would it not then be feasible to simply ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised
Perhaps, but you've lost sight of your goal. The premise was that the person listening did not already know the language. If he does, we can simply say "include iostream" when we mean #include <iostream>, or "vector of int" when we mean std::vector<int>.
Your premise was that the person listening is not familiar enough with the language to understand what you read out loud unless you read out exactly what it says.
Now, inventing a whole new language just to describe the primitives that occur in your source code doesn't solve the problem. Instead, you still have to read out every syntactic token (with simpler, more "standardized" pronunciations, yes, but they still have to be read out loud), and the person listening still won't understand you, because if they don't know C++ well enough to understand "include iostream", they won't understand your standardized pronunciation either. And if you're going to teach them your pronunciation, why bother, when you could've just taught them to understand C++ syntax directly instead?
There's also the root problem that C++ code tends to consist of a lot of syntactic tokens. Take a line as simple as this:
std::vector<int> v;
I count 9 tokens. Not one of them can be omitted. If the person listening does not understand the code and syntax well enough to understand a high-level description such as "declare a vector of int, named v", then you'll have to read out all 9 tokens in some form. Even if you come up with simpler names than "namespace resolution operator" and "less than sign", you still have to list 9 token names. Which is a lot of work.
In short, no, I don't think it'd work. First, it's still too cumbersome, and second, it's presuming prior knowledge on the part of the person listening, when the motivation for this was that the person listening was a student without the prior knowledge that made it possible to understand a high-level description of the code.

Are there particular cases where native text manipulation is more desirable than regex?

Are there particular cases where native text manipulation is more desirable than regex?
In particular .net?
Note:
Regex appears to be a highly emotive subject, so I am wary of asking such a question. This question is not inviting personal/professional opinions on regex, only specific situations where a solution including its use is not as good as language-native commands (including those which have underlying code using regex), and why.
Also, note that desirable can mean performance or can mean code readability; it does not mean a panacea, as each solution to a problem has its benefits and limitations.
Apologies if this is a duplicate, I have searched SO for a similar question.
I prefer text manipulation over regular expressions to parse delimited string input. It's far simpler (for me at least) to issue a string split than to manage a regular expression.
Given some text:
value1, value2, value3
You can parse the line easily:
var values = myString.Split(',');
I'm sure there's a better way but with regular expressions you'd have to do something like:
var match = Regex.Match(myString, "^([^,]*),([^,]*),([^,]*)$");
var value1 = match.Groups[1].Value;
...
When you can do it simply with native text manipulation, it is usually preferable (simpler to read & better performance) not to use regex.
Personal rule of thumb: if it's tricky or takes relatively longer to do it "manually" and the performance gain is negligible, don't. Otherwise, do.
Don't examples:
split
simple find & replace
long text
loop
existing native functions (like, in PHP, strrchr, ucwords...)
Using a regex basically means embedding a tiny program, written in a different programming language, in the middle of your program. I'll ignore the inefficiency of using a regex over native string manipulation, because it probably isn't relevant in most cases.
I prefer native text manipulation over regex any time native text manipulation will be easier to follow for other people. Which is true quite frequently, since plenty of the people around me are not strongly familiar with regex. Unless working with something that is very much about parsing (via regex) they should not need to be!
Regular expressions are usually slower, less readable, and harder to debug than native string manipulation.
The main case where I'll prefer regex over string manipulation is when I want to be able to parse strings in different ways depending on the source, and the number of source types will increase over time. Native string manipulation is not really practical in this case. I've had cases where I've stuck a regex column in a database...
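A hedged sketch of that pattern in Perl (the source names and patterns are invented; in practice they might be loaded from a database):
# One compiled pattern per input source; new sources just add entries.
my %pattern_for = (
    csv_feed  => qr/^([^,]+),([^,]+)$/,
    pipe_feed => qr/^([^|]+)\|([^|]+)$/,
);

sub parse_line {
    my ( $source, $line ) = @_;
    my $re = $pattern_for{$source} or die "unknown source: $source";
    return $line =~ $re;   # in list context, returns the captured fields
}

my ( $name, $value ) = parse_line( csv_feed => "widgets,42" );
print "$name=$value\n";   # widgets=42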
Regexes are very flexible and powerful, because they are in many ways similar to an eval() statement. That being said, depending on the implementation, they can be a bit slow. Normally this is not an issue; however, if they can be avoided in a particularly costly loop, that can boost performance.
That said, I tend to use them, and only worry about performance when the app is "done" and I have real benchmarks to prove I need to tweak performance. That is, avoid premature optimization.
Whenever the same result can be achieved with a reasonable amount of code.
Regular expressions are very powerful, but they tend to get hard to read. If you can do the same with simple string operations that usually means that the code gets easier to manage and maintain.
There is some overhead in setting up the regex object and parsing the expression. For simpler string manipulation you can get better performance with simple string methods.
Example:
Getting the file name from a file path (yes, I know that the Path class should be used for that, it's just an example...)
string name = Regex.Match(path, @"([^\\]+)$").Groups[0].Value;
vs.
string name = path.Substring(path.LastIndexOf('\\') + 1);
The second solution is straightforward and does the minimal work needed to get the result. The regular expression solution produces the same result, but it does more work to parse the string, and it produces a bunch of objects that are not needed for the result.
Regex parsing and execution requires the host language to defer processing to its regex "engine". This adds overhead, so for any instance where native string manipulation could be used, it is preferable for speed (and readability!).
I'll usually just use text manipulation for simple string replacements (e.g. replacing tokens in a template with actual values). You could certainly do this with regex, but replacements are much easier (a small sketch follows).
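For instance, a hedged Perl sketch of plain token replacement using index and substr (the template and token names are invented):
my %value_for = ( NAME => "Ada", YEAR => "1843" );
my $template  = "Hello {NAME}, welcome to {YEAR}.";

my $out = $template;
for my $token ( keys %value_for ) {
    my $needle = "{$token}";
    # Plain index/substr replacement - no regex engine involved.
    while (1) {
        my $pos = index $out, $needle;
        last if $pos < 0;
        substr( $out, $pos, length $needle ) = $value_for{$token};
    }
}
print "$out\n";   # Hello Ada, welcome to 1843.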
Yes. Example:
#include <string.h>

const char* basename (const char* path)
{
    const char* p = strrchr(path, '/');
    return (p != NULL) ? (p + 1) : path;
}