Are there any conventions for printing optional values? - c++

Suppose I want to overload operator<< for an optional<T> class template. How would I print the "absent value", and how would I print a "real value" x?
none
some x
or
[]
[x]
Or should I literally print nothing for the first case and x for the second? How is this normally handled?

I like the option of printing None and Some x. I think that this immediately describes what's going on (especially for people familiar with Haskell).
Personally, I would not use the [] and [x] alternative, because many languages use the square brackets to denote some sort of list. If I were to see that output, I would immediately be thinking that a list had been printed, as opposed to an optional type.

In the absence of any context, I would think of an optional as a special case of a collection, that either is empty or has one member.
You probably already have a convention for how to print collections or compound objects, but something like {} if it's empty or {x} if it has the value x would seem reasonable. If you print out an empty vector as none and a vector with three elements as some x y z, then by all means apply the same convention to an optional type :-)
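As a rough sketch of the "optional is a collection with zero or one member" idea, the same brace-style routine can serve both a vector and an optional. This assumes C++17's std::optional stands in for the asker's optional<T>; the names format_braced and format_collection are made up for illustration.

```cpp
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: formats any iterator range as {a, b, c}.
template <typename It>
std::string format_braced(It first, It last) {
    std::ostringstream out;
    out << '{';
    for (It it = first; it != last; ++it) {
        if (it != first) out << ", ";  // separator between elements only
        out << *it;
    }
    out << '}';
    return out.str();
}

// The existing convention for a real collection.
template <typename T>
std::string format_collection(const std::vector<T>& values) {
    return format_braced(values.begin(), values.end());
}

// An optional is treated as a range holding zero or one element,
// so it is printed with exactly the same convention.
template <typename T>
std::string format_collection(const std::optional<T>& value) {
    const T* first = value ? &*value : nullptr;
    return format_braced(first, first + (value ? 1 : 0));
}
```

With this, an empty optional comes out as {} and an engaged one as {x}, matching whatever the vector overload produces for zero or one element.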

I'm not aware of any specific convention. Personally I'd print (null) when the value is missing and the actual value otherwise.

It depends on the type. If you want a string to be printed, it could be "" or "Fred". If it is an array, it could be {} or { 1, 2, 3 }.
Following this video or the code attached in the web page can help you. Pretty Printer

The choice of behavior should probably be based on why you're implementing an output operator. If it is mostly for debugging purposes it is important to provide a visual clue that a value is missing. Printing an existing value between square brackets or just the open and closed brackets if the value is missing is a valid approach, given that square brackets are often used to indicate optionality, e.g. in command help messages.
On the other hand if this is meant to be a general purpose output operator the best approach is probably to print existing values as you would for the non-optional underlying type and the empty string for missing values.
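To make the two approaches concrete, here is a minimal sketch that lets the caller choose between the debugging style and the general-purpose style. It assumes C++17's std::optional; OptionalStyle, write_optional and optional_to_text are hypothetical names, not part of any library.

```cpp
#include <optional>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical switch between the two conventions discussed above.
enum class OptionalStyle {
    Debug,  // [x] for a value, [] for a missing value
    Plain   // x for a value, nothing at all for a missing value
};

template <typename T>
void write_optional(std::ostream& os, const std::optional<T>& value,
                    OptionalStyle style) {
    if (style == OptionalStyle::Debug) {
        os << '[';
        if (value) os << *value;
        os << ']';
    } else if (value) {
        os << *value;  // printed exactly like the underlying type
    }
}

// Convenience wrapper that returns the text instead of streaming it.
template <typename T>
std::string optional_to_text(const std::optional<T>& value,
                             OptionalStyle style) {
    std::ostringstream out;
    write_optional(out, value, style);
    return out.str();
}
```

An operator<< overload would have to commit to one of the two styles, which is why the choice depends on what the output is for.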

The UNIX and the Echo.
There dwelt in the land of New Jersey the UNIX, a fair maid whom savants traveled far to admire. Dazzled by her purity, all sought to espouse her, one for her virginal grace, another for her polished civility, yet another for her agility in performing exacting tasks seldom accomplished even in much richer lands. So large of heart and accommodating of nature was she that the UNIX adopted all but the insufferably rich of her suitors. Soon many offspring grew and prospered and spread to the ends of the earth.
Nature herself smiled and answered to the UNIX more eagerly than to other mortal beings. Humbler folk, who knew little of more courtly manners, delighted in her echo, so precise and crystal clear they scarce believed she could be answered by the same rocks and woods that so garbled their own shouts into the wilderness. And the compliant UNIX obliged with perfect echoes of whatever she was asked. When one impatient swain asked the UNIX, 'Echo nothing', the UNIX obligingly opened her mouth, echoed nothing, and closed it again.
'Whatever do you mean,' the youth demanded, 'opening your mouth like that? Henceforth never open your mouth when you are supposed to echo nothing!' And the UNIX obliged.
'But I want a perfect performance, even when you echo nothing,' pleaded a sensitive youth, 'and no echoes can come from a closed mouth.' Not wishing to offend either one, the UNIX agreed to say different nothings for the impatient youth and the sensitive youth. She called the sensitive nothing '\n.'
Yet now when she said '\n,' she was really not saying nothing so she had to open her mouth twice, once to say '\n,' and once to say nothing, and so she did not please the sensitive youth, who said forthwith, 'The \n sounds like a perfect nothing to me, but the second ruins it. I want you to take back one of them.' So the UNIX, who could not abide offending, agreed to undo some echoes and called that '\c'. Now the sensitive youth could hear a perfect echo of nothing by asking for '\n' and '\c' together.
But they say that he died of a surfeit of notation before he ever heard one.
-- Doug McIlroy


Is it a good style to write constants on the left of equal to == in If statement in C++? [duplicate]

Okay, we know that the following two lines are equivalent -
(0 == i)
(i == 0)
Also, the first method was encouraged in the past because that would have allowed the compiler to give an error message if you accidentally used '=' instead of '=='.
My question is - in today's generation of pretty slick IDEs and intelligent compilers, do you still recommend the first method?
In particular, this question popped into my mind when I saw the following code -
if(DialogResult.OK == MessageBox.Show("Message")) ...
In my opinion, I would never recommend the above. Any second opinions?
I prefer the second one, (i == 0), because it feels much more natural when reading it. You ask people, "Are you 21 or older?", not, "Is 21 less than or equal to your age?"
It doesn't matter in C# if you put the variable first or last, because assignments don't evaluate to a bool (or something castable to bool) so the compiler catches any errors like "if (i = 0) EntireCompanyData.Delete()"
So, in the C# world at least, it's a matter of style rather than desperation. And putting the variable last is unnatural to English speakers. Therefore, for more readable code, variable first.
If you have a list of ifs that can't be represented well by a switch (because of a language limitation, maybe), then I'd rather see:
if (InterstingValue1 == foo) { } else
if (InterstingValue2 == foo) { } else
if (InterstingValue3 == foo) { }
because it allows you to quickly see which are the important values you need to check.
In particular, in Java I find it useful to do:
if ("SomeValue".equals(someString)) {
}
because someString may be null, and in this way you'll never get a NullPointerException. The same applies if you are comparing constants that you know will never be null against objects that may be null.
(0 == i)
I will always pick this one. It is true that most compilers today do not allow the assignment of a variable in a conditional statement, but the truth is that some do. In programming for the web today, I have to use a myriad of languages on a system. By using 0 == i, I always know that the conditional statement will be correct, and I am not relying on the compiler/interpreter to catch my mistake for me. Now if I have to jump from C# to C++ or JavaScript, I know that I am not going to have to track down assignment errors in conditional statements in my code. For something this small to save that amount of time, it's a no-brainer.
I used to be convinced that the more readable option (i == 0) was the better way to go with.
Then we had a production bug slip through (not mine thankfully), where the problem was a ($var = SOME_CONSTANT) type bug. Clients started getting email that was meant for other clients. Sensitive type data as well.
You can argue that Q/A should have caught it, but they didn't, that's a different story.
Since that day I've always pushed for the (0 == i) version. It basically removes the problem. It feels unnatural, so you pay attention, so you don't make the mistake. There's simply no way to get it wrong here.
It's also a lot easier to catch that someone didn't reverse the if statement in a code review than it is that someone accidentally assigned a value in an if. If the format is part of the coding standards, people look for it. People don't typically debug code during code reviews, and the eye seems to scan over a (i = 0) vs an (i == 0).
I'm also a much bigger fan of the java "Constant String".equals(dynamicString), no null pointer exceptions is a good thing.
You know, I always use the if (i == 0) format of the conditional and my reason for doing this is that I write most of my code in C# (which would flag the other one anyway) and I do a test-first approach to my development and my tests would generally catch this mistake anyhow.
I've worked in shops where they tried to enforce the 0==i format but I found it awkward to write, awkward to remember and it simply ended up being fodder for the code reviewers who were looking for low-hanging fruit.
Actually, the DialogResult example is a place where I WOULD recommend that style. It places the important part of the if() toward the left where it can be seen. If it is on the right and MessageBox.Show has more parameters (which is likely), you might have to scroll right to see it.
OTOH, I never saw much use in the "(0 == i)" style. If you can remember to put the constant first, you can remember to use two equals signs.
I always try to use the first form (0==i), and this has saved my life a few times!
I think it's just a matter of style. And it does help with accidentally using assignment operator.
I absolutely wouldn't ask the programmer to grow up though.
I prefer (i == 0), but I still sort of make a "rule" for myself to do (0 == i), and then break it every time.
"Eh?", you think.
Well, if I'm making a conscious decision to put an lvalue on the left, then I'm paying enough attention to what I'm typing to notice if I type "=" for "==". I hope. In C/C++ I generally use -Wall for my own code, which generates a warning on gcc for most "=" for "==" errors anyway. I don't recall seeing that warning recently, perhaps because the longer I program the more reflexively paranoid I am about errors I've made before...
if(DialogResult.OK == MessageBox.Show("Message"))
seems misguided to me. The point of the trick is to avoid accidentally assigning to something.
But who is to say whether DialogResult.OK is more, or less likely to evaluate to an assignable type than MessageBox.Show("Message")? In Java a method call can't possibly be assignable, whereas a field might not be final. So if you're worried about typing = for ==, it should actually be the other way around in Java for this example. In C++ either, neither or both could be assignable.
(0==i) is only useful because you know for absolute certain that a numeric literal is never assignable, whereas i just might be.
When both sides of your comparison are assignable you can't protect yourself from accidental assignment in this way, and that goes for when you don't know which is assignable without looking it up. There's no magic trick that says "if you put them the counter-intuitive way around, you'll be safe". Although I suppose it draws attention to the issue, in the same way as my "always break the rule" rule.
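The point about assignability can be checked at compile time. The following is only an illustrative sketch using std::is_assignable from <type_traits>; the two helper functions is_zero_with_typo and is_zero are hypothetical names made up for the example.

```cpp
#include <string>
#include <type_traits>

// A plain int variable is an lvalue, so "i = 0" is well-formed.
static_assert(std::is_assignable<int&, int>::value,
              "an int variable is assignable");

// A non-lvalue int (such as the literal 0) is not: "0 = i" does not compile.
static_assert(!std::is_assignable<int, int>::value,
              "an int literal is not assignable");

// A const variable is just as safe as a literal.
static_assert(!std::is_assignable<const int&, int>::value,
              "a const int is not assignable");

// A temporary of class type can still be assigned to, so putting a
// function result on the left is no protection in general.
static_assert(std::is_assignable<std::string, const char*>::value,
              "a std::string temporary is assignable");

// The typo the convention guards against: compiles, assigns, always false.
inline bool is_zero_with_typo(int i) {
    if ((i = 0)) {  // meant i == 0; the extra parentheses hide the usual warning
        return true;
    }
    return false;
}

// Constant-first form: writing "0 = i" here would be a compile error.
inline bool is_zero(int i) { return 0 == i; }
```

The std::string case is the C++ version of "either, neither or both could be assignable": the trick only helps when the left-hand side is known not to be assignable.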
I use (i == 0) for the simple reason that it reads better. It makes a very smooth flow in my head. When you read through the code back to yourself for debugging or other purposes, it simply flows like reading a book and just makes more sense.
My company has just dropped the requirement to do if (0 == i) from its coding standards. I can see how it makes a lot of sense but in practice it just seems backwards. It is a bit of a shame that by default a C compiler probably won't give you a warning about if (i = 0).
Third option - disallow assignment inside conditionals entirely:
In high reliability situations, you are not allowed (without good explanation in the preceding comments) to assign a variable in a conditional statement - it eliminates this question entirely because you either turn it off at the compiler or with LINT and only under very controlled situations are you allowed to use it.
Keep in mind that generally the same code is generated whether the assignment occurs inside the conditional or outside - it's simply a shortcut to reduce the number of lines of code. There are always exceptions to the rule, but it never has to be in the conditional - you can always write your way out of that if you need to.
So another option is merely to disallow such statements, and where needed use the comments to turn off the LINT checking for this common error.
-Adam
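As an illustration of "you can always write your way out of that", here is the same loop written with and without an assignment inside the condition. The Source type and its -1 end marker are invented for the example; they stand in for any read-until-done API.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical value source: next() returns -1 once it is exhausted.
class Source {
public:
    explicit Source(std::vector<int> items) : items_(std::move(items)) {}
    int next() { return pos_ < items_.size() ? items_[pos_++] : -1; }

private:
    std::vector<int> items_;
    std::size_t pos_ = 0;
};

// Shortcut form: the assignment lives inside the loop condition.
inline int sum_assigning_in_condition(Source src) {
    int total = 0;
    int value;
    while ((value = src.next()) != -1) {
        total += value;
    }
    return total;
}

// Same behavior with the assignment hoisted out of the condition,
// so a lint rule banning "=" inside conditionals is never triggered.
inline int sum_without_assignment_in_condition(Source src) {
    int total = 0;
    for (;;) {
        const int value = src.next();
        if (value == -1) break;
        total += value;
    }
    return total;
}
```

The second form costs a couple of extra lines, which is the trade-off the answer describes.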
I'd say that (i == 0) would sound more natural if you attempted to phrase a line in plain (and ambiguous) English. It really depends on the coding style of the programmer or the standards they are required to adhere to though.
Personally I don't like (1) and always do (2), however that reverses for readability when dealing with dialog boxes and other methods that can be extra long. It doesn't look bad as it is now, but if you expand the MessageBox call out to its full length, you have to scroll all the way right to figure out what kind of result you are returning.
So while I agree with your assertions of the simplistic comparison of value types, I don't necessarily think it should be the rule for things like message boxes.
Both are equal, though I would prefer the 0==i variant slightly.
When comparing strings, it is less error-prone to write "MyString".equals(getDynamicString()), since getDynamicString() might return null.
To be more consistent, write 0==i.
Well, it depends on the language and the compiler in question. Context is everything.
In Java and C#, the "assignment instead of comparison" typo ends up with invalid code apart from the very rare situation where you're comparing two Boolean values.
I can understand why one might want to use the "safe" form in C/C++ - but frankly, most C/C++ compilers will warn you if you make the typo anyway. If you're using a compiler which doesn't, you should ask yourself why :)
The second form (variable then constant) is more readable in my view - so anywhere that it's definitely not going to cause a problem, I use it.
Rule 0 for all coding standards should be "write code that can be read easily by another human." For that reason I go with (most-rapidly-changing value) test-against (less-rapidly-changing-value, or constant), i.e "i == 0" in this case.
Even where this technique is useful, the rule should be "avoid putting an lvalue on the left of the comparison", rather than the "always put any constant on the left", which is how it's usually interpreted - for example, there is nothing to be gained from writing
if (DateClass.SATURDAY == dateObject.getDayOfWeek())
if getDayOfWeek() is returning a constant (and therefore not an lvalue) anyway!
I'm lucky (in this respect, at least) in that these days I'm mostly coding in Java and, as has been mentioned, if (someInt = 0) won't compile.
The caveat about comparing two booleans is a bit of a red-herring, as most of the time you're either comparing two boolean variables (in which case swapping them round doesn't help) or testing whether a flag is set, and woe-betide-you if I catch you comparing anything explicitly with true or false in your conditionals! Grrrr!
In C, yes, but you should already have turned on all warnings and be compiling warning-free, and many C compilers will help you avoid the problem.
I rarely see much benefit from a readability POV.
Code readability is one of the most important things for code larger than a few hundred lines, and definitely i == 0 reads much easier than the reverse
Maybe not an answer to your question.
I try to use === (checking for identity) instead of equality. This way no type conversion is done and it forces the programmer to make sure the right type is passed.
You are right that placing the important component first helps readability, as readers tend to browse the left column primarily, and putting important information there helps ensure it will be noticed.
However, never talk down to a co-worker, and implying that would be your action even in jest will not get you high marks here.
I always go with the second method. In C#, writing
if (i = 0) {
}
results in a compiler error (cannot convert int to bool) anyway, so that you could make a mistake is not actually an issue. If you test a bool, the compiler is still issuing a warning and you shouldn't compare a bool to true or false. Now you know why.
I personally prefer the use of variable-operand-value format in part because I have been using it so long that it feels "natural" and in part because it seems to be the predominant convention. There are some languages that make use of assignment statements such as the following:
:1 -> x
So in the context of those languages it can become quite confusing to see the following even if it is valid:
:if(1=x)
So that is something to consider as well. I do agree with the message box response being one scenario where using a value-operand-variable format works better from a readability standpoint, but if you are looking for consistency then you should forgo its use.
This is one of my biggest pet peeves. There is no reason to decrease code readability (if (0 == i), what? how can the value of 0 change?) to catch something that any C compiler written in the last twenty years can catch automatically.
Yes, I know, most C and C++ compilers don't turn this on by default. Look up the proper switch to turn it on. There is no excuse for not knowing your tools.
It really gets on my nerves when I see it creeping into other languages (C#,Python) which would normally flag it anyway!
I believe the only factor to ever force one over the other is if the tool chain does not provide warnings to catch assignments in expressions. My preference as a developer is irrelevant. An expression is better served by presenting business logic clearly. If (0 == i) is more suitable than (i == 0) I will choose it. If not I will choose the other.
Many constants in expressions are represented by symbolic names. Some style guides also limit the parts of speech that can be used for identifiers. I use these as a guide to help shape how the expression reads. If the resulting expression reads loosely like pseudo code then I'm usually satisfied. I just let the expression express itself and If I'm wrong it'll usually get caught in a peer review.
We might go on and on about how good our IDEs have gotten, but I'm still shocked by the number of people who turn the warning levels on their IDE down.
Hence, for me, it's always better to ask people to use (0 == i), as you never know which programmer is doing what.
It's better to be safe than sorry.
if(DialogResult.OK == MessageBox.Show("Message")) ...
I would always recommend writing the comparison this way. If the result of MessageBox.Show("Message") can possibly be null, then you risk a NPE/NRE if the comparison is the other way around.
Mathematical and logical operations aren't symmetric in a world that includes NULLs.

What dynamic programming features of Perl should I be using?

I am pretty new to scripting languages (Perl in particular), and most of the code I write is an unconscious effort to convert C code to Perl.
Reading about Perl, one of the things that is often mentioned as the biggest difference is that Perl is a dynamic language. So, it can do stuff at runtime that the other languages (static ones) can only do at compiletime, and so be better at it because it can have access to realtime information.
All that is okay, but what specific features should I, with some experience in C and C++, keep in mind while writing code in Perl to use all the dynamic programming features that it has, to produce some awesome code?
This question is more than enough to fill a book. In fact, that's precisely what happened!
Mark Jason Dominus' excellent Higher-Order Perl is available online for free.
Here is a quote from its preface that really grabbed me by the throat when I first read the book:
Around 1993 I started reading books about Lisp, and I discovered something important: Perl is much more like Lisp than it is like C. If you pick up a good book about Lisp, there will be a section that describes Lisp’s good features. For example, the book Paradigms of Artificial Intelligence Programming, by Peter Norvig, includes a section titled What Makes Lisp Different? that describes seven features of Lisp. Perl shares six of these features; C shares none of them. These are big, important features, features like first-class functions, dynamic access to the symbol table, and automatic storage management.
A list of C habits not to carry over into Perl 5:
Don't declare your variables at the top of the program/function. Declare them as they are needed.
Don't assign empty lists to arrays and hashes when declaring them (they are empty already and don't need to be initialized).
Don't use if (!(complex logical statement)) {}, that is what unless is for.
Don't use goto to break deeply nested loops, next, last, and redo all take a loop label as an argument.
Don't use global variables (this is a general rule even for C, but I have found a lot of C people like to use global variables).
Don't create a function where a closure will do (callbacks in particular). See perldoc perlsub and perldoc perlref for more information.
Don't use in/out returns, return multiple values instead.
Things to do in Perl 5:
Always use the strict and warnings pragmas.
Read the documentation (perldoc perl and perldoc -f function_name).
Use hashes the way you used structs in C.
Use the features that solve your problem with the best combination of maintainability, developer time, testability, and flexibility. Talking about any technique, style, or library outside of the context of a particular application isn't very useful.
Your goal shouldn't be to find problems for your solutions. Learn a bit more Perl than you plan on using immediately (and keep learning). One day you'll come across a problem and think "I remember something that might help with this".
You might want to see some of these books, however:
Higher-Order Perl
Mastering Perl
Effective Perl Programming
I recommend that you slowly and gradually introduce new concepts into your coding. Perl is designed so that you don't have to know a lot to get started, but you can improve your code as you learn more. Trying to grasp lots of new features all at once usually gets you in trouble in other ways.
I think the biggest hurdle will not be the dynamic aspect but the 'batteries included' aspect.
I think the most powerful aspects of perl are
hashes : they allow you to easily express very effective datastructures
regular expressions : they're really well integrated.
the use of the default variables like $_
the libraries and the CPAN for whatever is not installed standard
Something I noticed with C converts is the over use of for loops. Many can be removed using grep and map
Another motto of Perl is "there is more than one way to do it". To climb the learning curve you have to tell yourself often: "There has got to be a better way of doing this, I cannot be the first one wanting to do ...". Then you can typically turn to Google and the CPAN with its ridiculous number of libraries.
The learning curve of perl is not steep, but it is very long... take your time and enjoy the ride.
Two points.
First, In general, I think you should be asking yourself 2 slightly different questions:
1) Which dynamic programming features of Perl can be used in which situations/to solve which problems?
2) What are the trade-offs, pitfalls and downsides of each feature.
Then the answer to your question becomes extremely obvious: you should be using the features that solve your problem better (in performance or code maintainability) than a comparable non-DP solution, and that incur less than the maximum acceptable level of downsides.
As an example, to quote from FM's comment, string form of eval has some fairly nasty downsides; but it MIGHT in certain cases be an extremely elegant solution which is orders of magnitude better than any alternate DP or SP approach.
Second, please be aware that a lot of "dynamic programming" features of Perl are actually packaged for you into extremely useful modules that you might not even recognize as being of the DP nature.
I'll have to think of a set of good examples, but one that immediately springs to mind is Text template modules, many of which are implemented using the above-mentioned string form of eval; or Try::Tiny exception mechanism which uses block form of eval.
Another example is aspect programming, which can be achieved via Moose (I can't find the relevant StackOverflow link now - if someone has it please edit in the link) and which underneath uses the symbol-table access feature of DP.
Most of the other comments are complete here and I won't repeat them. I will focus on my personal bias about excessive or not enough use of language idioms in the language you are writing code in. As a quip goes, it is possible to write C in any language. It is also possible to write unreadable code in any language.
I was trained in C and C++ in college and picked up Perl later. Perl is fabulous for quick solutions and some really long life solutions. I built a company on Perl and Oracle solving logistics solutions for the DoD with about 100 active programmers. I also have some experience in managing the habits of other Perl programmers new and old. (I was the founder / ceo and not in technical management directly however...)
I can only comment on my transition to a Perl programmer and what I saw at my company. Many of our engineers shared my background of primarily being C / C++ programmers by training and Perl programmers by choice.
The first issue I have seen (and had myself) is writing code that is so idiomatic that it is unreadable, unmaintainable, and unusable after a short period of time. Perl, and C++ share the ability to write terse code that is entertaining to understand at the moment but you will forget, not be around, and others won't get it.
We hired (and fired) many programmers over the 5 years I had the company. A common Interview Question was the following: Write a short Perl program that will print all the odd numbers between 1 and 50 inclusive separated by a space between each number and terminated with a CR. Do not use comments. They could take a few minutes to do this on their own and could use a computer to check the output.
After they wrote the script and explained it, we would then ask them to modify it to print only the evens, (in front of the interviewer), then have a pattern of results based on every single digit even, every odd, except every seventh and 11th as an example. Another potential mod would be every even in this range, odd in that range, and no primes, etc. The purpose was to see if their original small script withstood being modified, debugged, and discussed by others and whether they thought in advance that the spec may change.
While the test did not say 'in a single line', many took the challenge to make it a single terse line at the cost of readability. Others made a full module that just took too long given the simple spec. Our company needed to deliver solid code very quickly; that is why we used Perl. We needed programmers that thought the same way.
The following submitted code snippets all do exactly the same thing:
1) Too C like, but very easy to modify. Because of the C style 3 argument for loop it takes more bug prone modifications to get alternate cycles. Easy to debug, and a common submission. Any programmer in almost any language would understand this. Nothing particularly wrong with this, but not killer:
for($i=1; $i<=50; $i+=2) {
printf("%d ", $i);
}
print "\n";
2) Very Perl like, easy to get evens, easy (with a subroutine) to get other cycles or patterns, easy to understand:
print join(' ',(grep { $_ % 2 } (1..50))), "\n"; #original
print join(' ',(grep { !($_ % 2) } (1..50))), "\n"; #even
print join(' ',(grep { suba($_) } (1..50))), "\n"; #other pattern
3) Too idiomatic, getting a little weird, why does it get spaces between the results? Interviewee made mistake in getting evens. Harder to debug or read:
print "@{[grep{$_%2}(1..50)]}\n"; #original
print "@{[grep{$_%2+1}(1..50)]}\n"; #even - WRONG!!!
print "@{[grep{~$_%2}(1..50)]}\n"; #second try for even
4) Clever! But also too idiomatic. Have to think about what happens to the anon hash created from a range operator list and why that creates odds and evens. Impossible to modify to another pattern:
print "$_ " for (sort {$a<=>$b} keys %{{1..50}}), "\n"; #orig
print "$_ " for (sort {$a<=>$b} keys %{{2..50}}), "\n"; #even
print "$_ " for (sort {$a<=>$b} values %{{1..50}}), "\n"; #even alt
5) Kinda C like again but a solid framework. Easy to modify beyond even/odd. Very readable:
for (1..50) {
print "$_ " if ($_%2);
} #odd
print "\n";
for (1..50) {
print "$_ " unless ($_%2);
} #even
print "\n";
6) Perhaps my favorite answer. Very Perl like yet readable (to me anyway) and step-wise in formation and right to left in flow. The list is on the right and can be changed, the processing is immediately to the left, formatting again to the left, final operation of 'print' on the far left.
print map { "$_ " } grep { $_ & 1 } 1..50; #original
print "\n";
print map { "$_ " } grep { !($_ & 1) } 1..50; #even
print "\n";
print map { "$_ " } grep { suba($_) } 1..50; #other
print "\n";
7) This is my least favorite credible answer. Neither C nor Perl, impossible to modify without gutting the loop, mostly showing the applicant knew Perl array syntax. He wanted to have a case statement really badly...
for (1..50) {
if ($_ & 1) {
$odd[++$#odd]="$_ ";
next;
} else {
push @even, "$_ ";
}
}
print @odd, "\n";
print @even;
Interviewees with answers 5, 6, 2 and 1 got jobs and did well. Answers 7,3,4 did not get hired.
Your question was about using dynamic constructs like eval or others that you cannot do in a purely compiled language such as C. This last example is "dynamic" with the eval in the regex but truly poor style:
$t='D ' x 25;
$i=-1;
$t=~s/D/$i+=2/eg;
print "$t\n"; # don't let the door hit you on the way out...
Many will tell you "don't write C in Perl." I think this is only partially true. The error and mistake is to rigidly write new Perl code in C style even when there are so many more expressive forms in Perl. Use those. And yes, don't write NEW Perl code in C style because C syntax and idiom is all you know. (bad dog -- no biscuit)
Don't write dynamic code in Perl just because you can. There are certain algorithms that you will run across that you will say 'I don't quite know how I would write THAT in C' and many of these use eval. You can write a Perl regex to parse many things (XML, HTML, etc) using recursion or eval in the regex, but you should not do that. Use a parser just like you would in C. There are certain algorithms though that eval is a gift. Larry Wall's file name fixer rename would take a lot more C code to replicate, no? There are many other examples.
Don't rigidly avoid C style either. The C 3 argument form of a for loop may be the perfect fit to certain algorithms. Also, remember why you are using Perl: presumably for high programmer productivity. If I have a completely debugged piece of C code that does exactly what I want and I need that in Perl, I just rewrite the silly thing C style in Perl! That is one of the strengths of the language (but also its weakness for larger or team projects where individual coding styles may vary and make the overall code difficult to follow.)
By far the killer verbal response to this interview question (from the applicant who wrote answer 6) was: This single line of code fits the spec and can easily be modified. However, there are many other ways to write this. The right way depends on the style of the surrounding code, how it will be called, performance considerations, and if the output format may change. Dude! When can you start?? (He ended up in management BTW.)
I think that attitude also applies to your question.
At least IME, the "dynamic" nature isn't really that big of a deal. I think the biggest difference you need to take into account is that in C or C++, you're mostly accustomed to there being only a fairly minor advantage to using library code. What's in the library is already written and debugged, so it's convenient, but if push comes to shove you can generally do pretty much the same thing on your own. For efficiency, it's mostly a question of whether your ability to write something a bit more specialized outweighs the library author's ability to spend more time on polishing each routine. There's little enough difference, however, that unless a library routine really does what you want, you may be better off writing your own.
With Perl, that's no longer true. Much of what's in the (huge, compared to C) library is actually written in C. Attempting to write anything very similar at all on your own (unless you write a C module, of course) will almost inevitably come out quite a bit slower. As such, if you can find a library routine that does even sort of close to what you want, you're probably better off using it. Using pre-written library code is much more important than in C or C++.
Good programming practices aren't specific to individual languages. They are valid across all languages. In the long run, you may find it best not to rely on tricks possible in dynamic languages (for example, functions that can return either integer or text values) as it makes the code harder to maintain and quickly understand. So ultimately, to answer your question, I don't think you should be looking for features specific to dynamically typed languages unless you have some compelling reason that you need them. Keep things simple and easy to maintain - that will be far more valuable in the long run.
There are many things you can do only with dynamic language but the coolest one is eval. See here for more detail.
With eval, you can execute a string as if it were a pre-written command. You can also access a variable by name at runtime.
For example,
$Double = "print (\$Opd1 * 2);";
$Opd1 = 8;
eval $Double; # Prints 8*2 => 16.
$Opd1 = 7;
eval $Double; # Prints 7*2 => 14.
The variable $Double is a string, but we can execute it as if it were a regular statement. This cannot be done in C/C++.
The cool thing is that a string can be manipulated at run time; therefore, we can create a command at runtime.
# string concatenation of operand and operator is done before eval (calculate) and then print.
$Cmd = "print (eval (\"(\$Ord1 \".\$Opr.\" \$Ord2)\"));";
$Opr = "*";
$Ord1 = "5";
$Ord2 = "2";
eval $Cmd; # Prints 5*2 => 10.
$Ord2 = 3;
eval $Cmd; # Prints 5*3 => 15.
$Opr = "+";
eval $Cmd; # Prints 5+3 => 8.
eval is very powerful so (as in Spiderman) power comes with responsibility. Use it wisely.
Hope this helps.

Is there a fairly simple way for a script to tell (from context) whether "her" is a possessive pronoun?

I am writing a script to reverse all genders in a piece of text, so all gendered words are swapped - "man" is swapped with "woman", "she" is swapped with "he", etc. But there is an ambiguity as to whether "her" should be replaced with "him" or "his".
Okay. Let's look at this like a linguist might. I am thinking aloud here.
"Her" is a pronoun. It can either be a:
1. possessive pronoun
This is her book.
2. personal pronoun
Give it to her. (after preposition)
He wrote her a letter. (indirect object)
He treated her for a cold. (direct object)
So let's look at case (1), possessive pronoun. That is, it is a pronoun which is in the "genitive" case (meaning it is a noun which is being "possessive". Okay, that detail isn't quite as important as the next one.)
In this case, "her" is acting as a "determiner". Determiners may occur in two places in a sentence (this is a simplification):
Det + Noun ("her book")
Det + Adj + Noun ("her nice book")
So to figure out if her is a determiner, you could have this logic:
a. If the word following "her" is a noun, then "her" is a determiner.
b. If the 2 words following "her" are an adjective and then a noun, then "her" is a determiner.
And if you establish that "her" is a determiner, then you know that you must replace it with "his", which is also a determiner (aka genitive noun, aka possessive pronoun).
If it doesn't match criteria (a) and (b) above, then you could possibly conclude that it is not a determiner, which means it must be a personal pronoun. In that case, you would replace "her" with "him".
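Rules (a) and (b) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the word lists NOUNS and ADJECTIVES are tiny hypothetical stand-ins for the dictionary the approach needs, and the function names are made up for this example.

```python
# Hypothetical toy lexicon; a real script would need a proper dictionary
# (or a part-of-speech tagger) to decide the category of each word.
NOUNS = {"book", "letter", "cold", "car"}
ADJECTIVES = {"nice", "old", "red"}
PUNCT = ".,;:!?"

def choose(words, i):
    """Return 'his' or 'him' for the 'her' at index i, using rules (a) and (b)."""
    nxt = words[i + 1].strip(PUNCT).lower() if i + 1 < len(words) else ""
    nxt2 = words[i + 2].strip(PUNCT).lower() if i + 2 < len(words) else ""
    if nxt in NOUNS:                         # rule (a): her + Noun
        return "his"
    if nxt in ADJECTIVES and nxt2 in NOUNS:  # rule (b): her + Adj + Noun
        return "his"
    return "him"                             # not a determiner: personal pronoun

def swap_her(sentence):
    """Replace every 'her' in a whitespace-separated sentence."""
    words = sentence.split()
    out = []
    for i, w in enumerate(words):
        core = w.strip(PUNCT)
        if core.lower() == "her":
            repl = choose(words, i)
            if core[0].isupper():            # keep sentence-initial capitalisation
                repl = repl.capitalize()
            out.append(w.replace(core, repl))
        else:
            out.append(w)
    return " ".join(out)
```

With this toy lexicon, swap_her("This is her nice book.") gives "This is his nice book." and swap_her("Give it to her.") gives "Give it to him." Any word missing from the lists falls through to "him", which is exactly the weakness of a dictionary-based approach.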
You wouldn't even have to do the tests below, but I'll try to describe them anyway.
Looking at (2) from above: personal pronoun, rather than possessive. This gets trickier.
The examples above show "her" occurring in 3 ways:
(1) Give it to her. (after preposition. we call this the "object of a preposition".)
So you could maybe devise a rule: "If 'her' occurs immediately after a preposition, then it should be treated as a noun, so we would replace it with 'him'".
The next two are tricky. "her" can either be a direct object or an indirect object.
(2) He wrote her a letter. (indirect object)
(3) He treated her for a cold. (direct object)
Syntactically, how can we tell the difference?
A direct object occurs immediately after a verb.
If you have a verb, followed by a noun, then that noun is a direct object. eg:
He treated her.*
If you have a verb, followed by a noun, followed by a prepositional phrase, then the noun is a direct object.
He treated her for a cold. ("her" is a noun, and it comes immediately after the verb "treated". "for a cold" is a prepositional phrase.)
Which means that you could say "If you have Verb + Noun + Prep" then the noun is a direct object. Since the noun is a direct object, then it is a personal pronoun, so use "him". (note, you only have to check for a preposition, not the entire prep phrase, since the phrase will always begin with a preposition.)
If it is an indirect object, then you'll have the form "verb + noun + noun".
He wrote her a letter. ("her" is a noun, "letter" is a noun. well, "a letter" is a "noun phrase", so you'd have to account for determiners as well.)
So... if "her" is a direct object, indirect object, or obj of prep, you could change it to "him", otherwise, change it to "his".
This method seems a lot more complicated - so I'd just start by checking to see if "her" is a determiner (see above), and if it is a determiner, use "his" otherwise, just use "him".
So, the above has a lot of simplifications. It doesn't cover "interrupting phrases", or clause structures, or constituency tests, or embedded clauses, or punctuation, or anything like that.
Also, this solution requires a dictionary - a list of "nouns" and "verbs" and "prepositions" so that you can determine the lexical category of each word in the sentence.
And even there, man, natural language processing is hard. You'd want to do some sort of "training" for your model to have a good solution. BUT for very simple things, try some of the stuff described above.
Sorry for being so verbose! (None of the existing answers gave any hard data, or precise linguistic definitions, so here goes.)
Given the scope of your project: reversing all gender-related words, it appears that :
The "investment" in a more fundamental approach would be justified
No heuristic based on simple lookup/substitution will adequately serve all or even most cases.
Furthermore, Regex too seems a poor choice of tool; natural language is just not a regular language ;-).
Instead, you should consider introducing Part-of-Speech (POS) tagging, possibly with a hint of Named Entity Recognition, and then apply substitution rules based on the extra info the tagging supplied.
This may seem like a lot of work, but if for example your scripting language happens to be Python, you can leverage NLTK to implement all this with a relatively small effort.
G'day,
This is one of those cases where you could invest an inordinate amount of time tracking down the automatic solution and finish up with a result that you're going to have to check through anyway.
I'd suggest making your script insert a piece of text that will really stand out at every instance of "her" and would be easily searchable. Maybe even make the script insert both "him" and "his" strings so that you only need to delete one of them after you've seen the context?
You're going to save a lot of time and effort this way. Not to mention blood, sweat and tears even! (-:
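A minimal sketch of that marker idea in Python, assuming a marker string of [[HIM|HIS]] (the marker and the function name are arbitrary choices for this example, not anything prescribed above):

```python
import re

def mark_her(text):
    """Replace each standalone 'her' with a marker that is easy to search for."""
    # \b keeps words such as "there" or "hers" untouched.
    return re.sub(r"\bher\b", "[[HIM|HIS]]", text, flags=re.IGNORECASE)
```

After running this, a search for [[ finds every spot that still needs a human decision, and deleting one half of the marker resolves it.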
Coming up with a fully automatic solution is no mean feat as it will involve scanning a massive corpus of words to determine if the following word is an object.
Sometimes gaining that extra 5 or 10 percent improvement is just not worth the extra effort involved. Except of course as an "it is left as an interesting exercise for the reader..." type problem that some text books seem to love.
Edit: I forgot to mention that finding this "tipping point" is a true art. Definitely one skill that only comes with experience. (-:
Edit: Part II - The Revenge. I also forgot to mention that you can eliminate one edge case though. If the word "her" is followed by punctuation, e.g. "... to her.", "... for her," etc., then you can eliminate the uncertainty for those cases and just replace them with "him". Similarly, if the word is followed by a certain class of words, e.g. "... for her to", the "her" can easily be replaced with "him". Edit 3: This is not a full list of exceptions but is merely intended as a suggestion for a starting point of the list of items you'll need to look for.
HTH
Trying to determine whether her is a possessive or personal pronoun is harder than trying to determine the class of him or his. However, you would expect both to be used in the same contexts given a large enough corpus. So why not reverse the problem? Take a large corpus and find all occurrences of him and his. Then look at the words surrounding them (just how many words you need to look at is left up to you). With enough training examples, you can estimate the probability that a given set of words in the vicinity of the word indicates him or his. Then you can use those probability estimates on an occurrence of her to determine whether you should be using him or his. As other responses have indicated, you're not going to be perfect. Also, figuring out how big of a neighborhood to use and how to calculate the probabilities is a fair bit of work. You could probably do fairly well using a simple classifier like Naive Bayes.
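As a rough sketch of that idea, here is a hand-rolled Naive Bayes over neighbouring words in Python. Everything in it is an assumption made for illustration: the one-word window, the (offset, word) features, the add-one smoothing, and the function names. A real attempt would need a large corpus and some tuning.

```python
import math
import re
from collections import Counter

def train(corpus, window=1):
    """Count the words seen within `window` positions of each 'him' / 'his'."""
    words = re.findall(r"[a-z']+", corpus.lower())
    counts = {"him": Counter(), "his": Counter()}
    totals = Counter()
    for i, w in enumerate(words):
        if w in counts:
            totals[w] += 1
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    counts[w][(j - i, words[j])] += 1  # feature: (offset, word)
    return counts, totals

def classify(context, counts, totals):
    """Pick 'him' or 'his' for a 'her' whose neighbours are given as (offset, word)."""
    vocab = len(set(counts["him"]) | set(counts["his"])) + 1
    best, best_score = None, -math.inf
    for label in ("him", "his"):
        score = math.log(totals[label] / sum(totals.values()))  # prior
        n = sum(counts[label].values())
        for feat in context:
            # add-one smoothing so unseen neighbours do not zero out the score
            score += math.log((counts[label][feat] + 1) / (n + vocab))
        if score > best_score:
            best, best_score = label, score
    return best
```

To use it, train on text containing "him" and "his", then for each "her" in the target text pass its neighbours, e.g. classify([(-1, "took"), (1, "book")], counts, totals). The sketch assumes both words occur at least once in the training corpus.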
I suspect, though, you can get a decent bit of accuracy just by looking at patterns in parts of speech and writing some rules. Naturally, you'll miss some, but probably a dozen rules or so will account for the majority of occurrences. I just glanced through about fifty occurrences of her in "The Phantom Rickshaw" by Rudyard Kipling and you can easily get 90% accuracy just by the rule:
her_followed_by_noun ? possessive : personal
You can use an off-the-shelf part-of-speech (POS) tagger like the Stanford POS Tagger to automatically determine whether a word is a noun or something else in context. Again, it's not perfect, but it does pretty well.
Edge cases with odd clause structures are hard to get right, but they also occur fairly rarely in most text. It just depends on your data.
I don’t think so. You could check if the possessive pronoun is followed by a noun or an adjective and thereby conclude that is indeed a possessive pronoun. But of course you would have to write a script that is able to do this and even if you had a method it would still be wrong in some other cases. A simple pattern matching algorithm won’t help you here.
Good luck with analysing this: http://en.wikipedia.org/wiki/X-bar_theory
Definitely no. You would have to do syntactic analysis on your input text (parsing the English language, really, that's where the word “to parse” comes from). That's the only way you can determine with certainty what the “her” in your text stands for; you can't rely on search-and-replace. There are many ways to do that, but none would qualify as “fairly simple”, I think.
I will address regex, since that is one of the tags. Regular expressions are insufficiently powerful for parsing human language, because regex does not do recursion, and all human languages are recursive.
When this fact is combined with the other ambiguities in English, such as the way many words can serve multiple functions in a sentence, I think that a reliable automated solution will be a very difficult and costly project.
About the only one I can think of (and I'm sure someone in the comments will prove me wrong!) is that any instance of her followed by punctuation can most probably be replaced with him. But I still agree with the previous answers that you're probably best off doing a manual replace.
OK, based on some of the answers people gave I've got a better idea of how to approach this. Instead of trying to write a script that gets this right 100% of the time I'll just aim to get it right as often as possible. A quick search through some English-language texts shows that "his" appears (very roughly) twice as often as "him", so the default behaviour should be to convert "her" to "his". If I did this and nothing else it should be right about two thirds of the time.
Now I'm not interested in finding patterns that would show "her" should be converted to "his", since this is what I would do anyway, I'm only interested in finding patterns that would show "her" should be converted to "him", since these would allow me to lower the error rate. There's two rules I can implement fairly painlessly:
If "her" is followed immediately by a comma or period, it should be converted to "him", as Michael Itzoe said.
If 'her' occurs immediately after a preposition, then it should be treated as a noun, we would replace it with 'him', as Rasher said.
And I'll be able to do more than that if I use Part of Speech tagging software. I think I'll get on with doing the easy stuff first :-)
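The plan above (default to "his", switch to "him" on the two easy patterns) might look like this in Python. It is a sketch only: the PREPOSITIONS set is a short hypothetical list, and the function name is made up for the example.

```python
import re

# Hypothetical, deliberately short preposition list; a real script needs a fuller one.
PREPOSITIONS = {"to", "for", "with", "at", "from", "by", "about", "of", "on", "in"}

def convert_her(text):
    """Convert 'her' to 'his' by default, and to 'him' where one of the two rules fires."""
    # Split into words, single punctuation marks, and whitespace runs,
    # so that joining the tokens reproduces the text exactly.
    tokens = re.findall(r"\w+|[^\w\s]|\s+", text)
    out = []
    prev_word = ""
    for i, tok in enumerate(tokens):
        if tok.lower() == "her":
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if nxt in {",", "."} or prev_word in PREPOSITIONS:
                repl = "him"   # rule 1: comma/period follows; rule 2: preposition precedes
            else:
                repl = "his"   # default, expected to be right more often than not
            out.append(repl.capitalize() if tok[0].isupper() else repl)
        else:
            out.append(tok)
        if re.fullmatch(r"\w+", tok):
            prev_word = tok.lower()
        elif not tok.isspace():
            prev_word = ""     # punctuation breaks the "immediately after" relation
    return "".join(out)
```

For instance, convert_her("Give it to her.") returns "Give it to him." and convert_her("This is her book.") returns "This is his book." Cases such as "He wrote her a letter" still come out wrong, which is the error rate accepted above until part-of-speech tagging is added.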

Is it feasible to ascribe pronunciations to distinct source code concepts?

I frequently tutor fellow students in programming, most often in C++ or Java.
It is uniquely aggravating to try to verbally convey the essential syntax of a C++ expression. The speaker must give either an idiomatic translation into English, or a full specification of the code in verbal longhand, using explicit yet slow terms such as "opening parenthesis", "bitwise and", et cetera. Neither of these solutions is optimal.
In C++, there is a finite set of keywords—63—and operators—54, discounting named operators and treating compound assignment operators and prefix versus postfix auto-increment and decrement as distinct. There are just a few types of literal, a similar number of grouping symbols, and the semicolon. Unless I'm utterly mistaken, that's about it.
Would it not then be feasible to ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised.
Instead of creating new "words" to describe them, for things such as "include" you could simply prefix it with "keyword" when saying it aloud. You could use words/phrases commonly known to say other parts as well. As with any new programmer, you have to literally describe everything anyway, so I don't think that requires special attention. I think creating new words is the harder method...
So, for example:
#include <iostream>
int main()
{
if (1 < 2)
return 1;
else
return 0;
}
Could be read out as:
(keyword) include iostream new-line
(keyword) int main no params start
block if number 1 (operator) less than
number 2 new-line (keyword) return
number 1 new-line (keyword) else
new-line (keyword) return number 0 end
block
Treat words in () as optional descriptive words, most likely to be used in more complex code. You could use the word 'literal' if you want them to actually write the descriptive word. For example
(keyword) if literal number (operator)
less than literal keyword
becomes
if (number < keyword)
Other words could be given defined meanings as well, such as 'split-line' when you want them to continue on the next line, without closing any currently open parenthesis, etc.
I personally find this method quite simple to use and easy to teach. YMMV, as always.
Of course, this doesn't solve the internationalisation issue, but at worst, would result in 'new words' being used in the non-English languages, which is no worse than the proposed solution you offered.
As a blind developer, programming since I was 13, I found this question really interesting. First of all, as mentioned by other people, learning a new language to be able to understand code is not a practical solution, as it would probably take longer to learn the spoken utterances than it would to learn the actual programming language.
Reading the question/answers, two further points occurred to me:
Firstly, you'd be surprised how important "thinking time" is. I have previously programmed in C/C++/Java and now use C# as my primary language, and consider myself very competent. But when I did a couple of projects in Python, I found the reduced punctuation robbed me of my "thinking time" - subconsciously, I was using the punctuation to digest what I'd just heard - fascinating... However, the situation is a bit different when it comes to identifiers, as these aren't well known by the listener - I personally find it hard to listen to code with acronym variables (RGXRatio, RGVRatio) as I don't have time to figure out what they mean. On the flip side, Hungarian notation and initial underscores make code hard to listen to, as the length of the variables (in terms of time taken to speak) is much longer than the more important operations being performed on those variables.
Another thing to consider is that the length of the audio stream is an end result, but not the root cause. The reason the audio is so long is that audio is a one-dimensional medium, whereas reading text is a 2D medium with the ability to jump around and skip past irrelevant/familiar text. It wouldn't work for a face-to-face lecture, but what if there were keyboard commands for controlling the speech? In text documents my screen reader lets me jump to the next line, but what if this were adapted to the semantics of a programming language? Some research, such as by T V Raman at Google, includes using different voices for syntax highlighting, and audio cues to mark metadata like capitals.
I know the original question specifically related to a lecture given to a class, but if, like myself, you have to listen to entire files of source code, the structure of the code also makes a huge difference. I personally read code like a story - left to right, top to bottom - so it's very hard to trace through unfamiliar code when it's written bottom-up.
So would it not then be feasible to simply ascribe a concise, unique pronunciation to each of these distinct concepts (including one for whitespace, where it is required) and go from there? Programming languages are far more regular than natural languages, so the pronunciation could be standardised
Perhaps, but you've lost sight of your goal. The premise was that the person listening did not already know the language. If he does, we can simply say "include iostream" when we mean #include <iostream>, or "vector of int" when we mean std::vector<int>.
Your premise was that the person listening is not familiar enough with the language to understand what you read out loud unless you read out exactly what it says.
Now, inventing a whole new language just to describe the primitives that occur in your source code doesn't solve the problem. Instead, you still have to read out every syntactic token (with simpler, more "standardized" pronunciations, yes, but they still have to be read out loud), and the person listening still won't understand you, because if they don't know C++ well enough to understand "include iostream", they won't understand your standardized pronunciation either. And if you're going to teach them your pronunciation, why bother, when you could've just taught them to understand C++ syntax directly instead?
There's also the root problem that C++ code tends to consist of a lot of syntactic tokens. Take a line as simple as this:
std::vector<int> v;
I count 8 tokens (std, ::, vector, <, int, >, v and the semicolon). Not one of them can be omitted. If the person listening does not understand the code and syntax well enough to understand a high-level description such as "declare a vector of int, named v", then you'll have to read out all 8 tokens in some form. Even if you come up with simpler names than "namespace resolution operator" and "less than sign", you still have to list 8 token names. Which is a lot of work.
In short, no, I don't think it'd work. First, it's still too cumbersome, and second, it's presuming prior knowledge on the part of the person listening, when the motivation for this was that the person listening was a student without the prior knowledge that made it possible to understand a high-level description of the code.

Short example of regular expression converted to a state machine?

In the Stack Overflow podcast #36 (https://blog.stackoverflow.com/2009/01/podcast-36/), this opinion was expressed:
Once you understand how easy it is to set up a state machine, you’ll never try to use a regular expression inappropriately ever again.
I've done a bunch of searching. I've found some academic papers and other complicated examples, but I'd like to find a simple example that would help me understand this process. I use a lot of regular expressions, and I'd like to make sure I never use one "inappropriately" ever again.
A rather convenient way to look at this is to use Python's little-known re.DEBUG flag on any pattern:
>>> re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>', re.DEBUG)
literal 60
subpattern 1
in
range (65, 90)
max_repeat 0 65535
in
range (65, 90)
range (48, 57)
at at_boundary
max_repeat 0 65535
not_literal 62
literal 62
subpattern 2
min_repeat 0 65535
any None
literal 60
literal 47
groupref 1
literal 62
The numbers after 'literal' and 'range' refer to the integer values of the ascii characters they're supposed to match.
Sure, although you'll need more complicated examples to truly understand how REs work. Consider the following RE:
^[A-Za-z][A-Za-z0-9_]*$
which is a typical identifier (must start with alpha and can have any number of alphanumeric and underscore characters following, including none). The following pseudo-code shows how this can be done with a finite state machine:
state = FIRSTCHAR
for char in all_chars_in(string):
    if state == FIRSTCHAR:
        if char is not in the set "A-Z" or "a-z":
            error "Invalid first character"
        state = SUBSEQUENTCHARS
        next char
    if state == SUBSEQUENTCHARS:
        if char is not in the set "A-Z" or "a-z" or "0-9" or "_":
            error "Invalid subsequent character"
        state = SUBSEQUENTCHARS
        next char
Now, as I said, this is a very simple example. It doesn't show how to do greedy/nongreedy matches, backtracking, matching within a line (instead of the whole line) and other more esoteric features of state machines that are easily handled by the RE syntax.
That's why REs are so powerful. The actual finite state machine code required to do what a one-liner RE can do is usually very long and complex.
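For comparison, the identifier pseudo-code above translates almost line for line into runnable Python. This is a sketch: the function name is_identifier and the choice to return False rather than raise an error are assumptions of this example.

```python
import string

FIRST = set(string.ascii_letters)            # A-Z, a-z
REST = FIRST | set(string.digits) | {"_"}    # plus 0-9 and underscore

def is_identifier(text):
    """Two-state machine equivalent to ^[A-Za-z][A-Za-z0-9_]*$ on a single line."""
    state = "FIRSTCHAR"
    for ch in text:
        if state == "FIRSTCHAR":
            if ch not in FIRST:
                return False                 # invalid first character
            state = "SUBSEQUENTCHARS"
        elif state == "SUBSEQUENTCHARS":
            if ch not in REST:
                return False                 # invalid subsequent character
    # An empty string never leaves FIRSTCHAR, so it is rejected.
    return state == "SUBSEQUENTCHARS"
```

Even for this tiny pattern the state machine takes around fifteen lines where the RE takes one, which is the trade-off described above.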
The best thing you could do is grab a copy of some lex/yacc (or equivalent) code for a specific simple language and see the code it generates. It's not pretty (it doesn't have to be since it's not supposed to be read by humans, they're supposed to be looking at the lex/yacc code) but it may give you a better idea as to how they work.
Make your own on the fly!
http://osteele.com/tools/reanimator/???
This is a really nicely put together tool which visualises regular expressions as FSMs. It doesn't have support for some of the syntax you'll find in real-world regular expression engines, but certainly enough to understand exactly what's going on.
Is the question "How do I choose the states and the transition conditions?", or "How do I implement my abstract state machine in Foo?"
How do I choose the states and the transition conditions?
I usually use FSMs for fairly simple problems and choose them intuitively. In my answer to another question about regular expressions, I just looked at the parsing problem as one of being either inside or outside a tag pair, and wrote out the transitions from there (with a beginning and ending state to keep the implementation clean).
How do I implement my abstract state machine in Foo?
If your implementation language supports a structure like C's switch statement, then you switch on the current state and process the input to see which action and/or transition to perform next.
Without switch-like structures, or if they are deficient in some way, you use if-style branching. Ugh.
Written all in one place in C, the example I linked would look something like this:
token_t token;
state_t state=BEGIN_STATE;
do {
    switch ( state.getValue() ) {
    case BEGIN_STATE:
        state=OUT_STATE;
        break;
    case OUT_STATE:
        switch ( token.getValue() ) {
        case CODE_TOKEN:
            state = IN_STATE;
            output(token.string());
            break;
        case NEWLINE_TOKEN:
            output("<break>");
            output(token.string());
            break;
        ...
        }
        break;
    ...
    }
} while (state != END_STATE);
which is pretty messy, so I usually rip the state cases out to separate functions.
I'm sure someone has better examples, but you could check this post by Phil Haack, which has an example of a regular expression and a state machine doing the same thing (there's a previous post with a few more regex examples in there as well I think..)
Check the "HenriFormatter" on that page.
I don't know what academic papers you've already read, but it really isn't that difficult to understand how to implement a finite state machine. There is some interesting mathematics, but the idea is actually very easy to understand. The easiest way to understand an FSM is through input and output (actually, this comprises most of the formal definition, which I won't describe here). A "state" is essentially just describing a set of inputs and outputs that have occurred and can occur from a certain point.
Finite state machines are easiest to understand via diagrams. For example:
[Image: finite state machine diagram] http://img6.imageshack.us/img6/7571/mathfinitestatemachinedco3.gif
All this is saying is that if you begin in some state q0 (the one with the Start symbol next to it) you can go to other states. Each state is a circle. Each arrow represents an input or output (depending on how you look at it). Another way to think of a finite state machine is in terms of "valid" or "acceptable" input. There are certain output strings that are NOT possible for certain finite state machines; this would allow you to "match" expressions.
Now suppose you start at q0. Now, if you input a 0 you will go to state q1. However, if you input a 1 you will go to state q2. You can see this by the symbols above the input/output arrows.
Let's say you start at q0 and get this input
0, 1, 0, 1, 1, 1
This means you have gone through states (no input for q0, you just start there):
q0 -> q1 -> q0 -> q1 -> q0 -> q2 -> q3 -> q3
Trace the picture with your finger if it doesn't make sense. Notice that q3 goes back to itself for both inputs 0 and 1.
Another way to say all this is "If you are in state q0 and you see a 0 go to q1 but if you see a 1 go to q2." If you make these conditions for each state you are nearly done defining your state machine. All you have to do is have a state variable and then a way to pump input in and that is basically what is there.
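The "state variable plus a way to pump input in" can be sketched as a lookup table in Python. Note the assumptions: the table holds only the transitions mentioned in this answer (the q1 and q2 entries are read off the example trace), the rest of the diagram is not reproduced, and the names TRANSITIONS and run are made up for the example.

```python
# (current state, input symbol) -> next state.
# Only transitions mentioned in the text; anything else in the diagram is omitted.
TRANSITIONS = {
    ("q0", "0"): "q1",
    ("q0", "1"): "q2",
    ("q1", "1"): "q0",   # read off the example trace
    ("q2", "1"): "q3",   # read off the example trace
    ("q3", "0"): "q3",
    ("q3", "1"): "q3",
}

def run(inputs, start="q0"):
    """Feed the symbols in one at a time and return the list of states visited."""
    state = start
    visited = [state]
    for symbol in inputs:
        key = (state, symbol)
        if key not in TRANSITIONS:
            raise ValueError(f"no transition listed for {key}")
        state = TRANSITIONS[key]
        visited.append(state)
    return visited
```

Calling run("0") returns ['q0', 'q1'], and feeding in the longer input from the example ends in q3. The whole machine is the table; the loop never changes.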
Ok, so why is this important regarding Joel's statement? Well, building the "ONE TRUE REGULAR EXPRESSION TO RULE THEM ALL" can be very difficult, and the result can be difficult to maintain, modify, or even for others to come back and understand. Also, in some cases a state machine is more efficient.
Of course, state machines have many other uses. Hope this helps in some small way. Note, I didn't bother going into the theory but there are some interesting proofs regarding FSMs.