Bests practices for localized texts in C++ cross-platform applications? - c++

In the current C++ standard (C++03), there are too few specifications about text localization and that makes the C++ developer's life harder than usual when working with localized texts (certainly the C++0x standard will help here later).
Assuming the following scenario (which is from real PC-Mac game development cases):
responsive (real time) application: the application has to minimize non-responsive times to "not noticeable", so speed of execution is important.
localized texts: displayed texts are localized in more than two languages, potentially more - don't expect a fixed number of languages, should be easily extensible.
language defined at runtime: the texts should not be compiled in the application (nor having one application per language), you get the chosen language information at application launch - which implies some kind of text loading.
cross-platform: the application is be coded with cross-platform in mind (Windows - Linux/Ubuntu - Mac/OSX) so the localized text system have to be cross platform too.
stand-alone application: the application provides all that is necessary to run it; it won't use any environment library or require the user to install anything other than the OS (like most games for example).
What are the best practices to manage localized texts in C++ in this kind of application?
I looked into this last year that and the only things I'm sure of are that you should use std::wstring or std::basic_string<ABigEnoughType> to manipulate the texts in the application. I stopped my research because I was working more on the "text display" problem (in the case of real-time 3D), but I guess there are some best practices to manage localized texts in raw C++ beyond just that and "use Unicode".
So, all best-practices, suggestions and information (cross-platform makes it hard I think) are welcome!

At a small Video Game Company, Black Lantern Studios, I was the Lead developer for a game called Lionel Trains DS. We localized into English, Spanish, French, and German. We knew all the languages up front, so including them at compile time was the only option. (They are burned to a ROM, you see)
I can give you information on some of the things we did. Our strings were loaded into an array at startup based on the language selection of the player. Each individual language went into a separate file with all the strings in the same order. String 1 was always the title of the game, string 2 always the first menu option, and so on. We keyed the arrays off of an enum, as integer indexing is very fast, and in games, speed is everything. ( The solution linked in one of the other answers uses string lookups, which I would tend to avoid.) When displaying the strings, we used a printf() type function to replace markers with values. "Train 3 is departing city 1."
Now for some of the pitfalls.
1) Between languages, phrase order is completely different. "Train 3 is departing city 1." translated to German and back ends up being "From City 1, Train 3 is departing". If you are using something like printf() and your string is "Train %d is departing city %d." the German will end up saying "From City 3, Train 1 is departing." which is completely wrong. We solved this by forcing the translation to retain the same word order, but we ended up with some pretty broken German. Were I to do it again, I would write a function that takes the string and a zero-based array of the values to put in it. Then I would use markers like %0 and %1, basically embedding the array index into the string. Update: #Jonathan Leffler pointed out that a POSIX-compliant printf() supports using %2$s type markers where the 2$ portion instructs the printf() to fill that marker with the second additional parameter. That would be quite handy, so long as it is fast enough. A custom solution may still be faster, so you'll want to make sure and test both.
2) Languages vary greatly in length. What was 30 characters in English came out sometimes to as much as 110 characters in German. This meant it often would not fit the screens we were putting it on. This is probably less of a concern for PC/Mac games, but if you are doing any work where the text must fit in a defined box, you will want to consider this. To solve this issue, we stripped as many adjectives from our text as possible for other languages. This shortened the sentence, but preserved the meaning, if loosing a bit of the flavor. I later designed an application that we could use which would contain the font and the box size and allow the translators to make their own modifications to get the text fit into the box. Not sure if they ever implemented it. You might also consider having scrolling areas of text, if you have this problem.
3) As far as cross platform goes, we wrote pretty much pure C++ for our Localization system. We wrote custom encoded binary files to load, and a custom program to convert from a CSV of language text into a .h with the enum and file to language map, and a .lang for each language. The most platform specific thing we used was the fonts and the printf() function, but you will have something suitable for wherever you are developing, or could write your own if needed.

I strongly disagree with the accepted answer. First, the part about using static array lookups to speed up the text lookups is counterproductive premature optimization - Calculating the layout for said text and rendering said text uses 2-4 orders of magnitude more time than a hash lookup. If anyone wanted to implement their own language library it should never be based on static arrays, because doing so trades real benefits (translators don't need access to the code) for imaginary benefits (speed increase of ~0.01%).
Next, writing your own language library to use in your own game is even worse than premature optimization.
There are some extremely good reasons to never write your own localization library:
Planning the time to use an existing localization library is much easier than planning the time to write a localization library. Localization libraries exist, they work, and many people have used them.
Localization is tricky, so you will get things wrong. Every language adds a new quirk, which means whenever you add a new language to your own homegrown localization library you will need to change code again to account for the quirks. Did you know that some languages have more than 2 plural forms, depending on the number of items in question? More than 2 genders (more than 10, even)? Also, the number and date formats vary a lot between different in many languages.
When your application becomes successful you will want add support for more languages. Languages nobody on your team speaks fluently. Hiring someone to write a translation will be considerably cheaper if they already know the tools they are working with.
A very well known and complete localization library is GNU Gettext, which uses the GPL, and should therefore be avoided for commercial work. You can instead use the boost library boost.locale which works with Gettext files, and is free to use and modify for commercial and non-commercial projects of any kind.

GNU Gettext does it all.

There won't be any additional features in the C++0x standard, as far as I can tell. I suspect the Committee considers this a matter for third-party libraries.

Related

Cleanest data structure to use when interpreting data from neatly-structured user commands (in C++) [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I would like to write a simple in-house program that parses user commands written in a language of our team's own invention (but based closely on another program we are already familiar with). The command parser that I am working on now will simply be the UI through which the user can run the other algorithms I have already written. (Those other algorithms, by the way, are used to generate the input files for a molecular dynamic simulation package called LAMMPS.) The only thing I really have left to do is just write this UI, but as it turns out, writing your own scripting language is almost an intractable challenge for a non software engineer to tackle on his own.
According to the answers I received, what I am try to make would be considered a Domain Specific Language, and it is not advisable to try to make one's own DSL due to the enormous amount of work required to make it useful and bug-free.
The best option then would actually be to use an existing scripting language like Lua or Python, and embed it in the program.
To do this, I will most likely use Lua because it seems most fitting for our needs. So at this point, the rest of this question is no longer relevant since the answer would be: "Don't do it yourself." But I'm still going to keep part of it here for other users to be able read and learn from the wonderful answers below.
Thanks again to everyone who replied!
Old Question:
I would like to write a program that parses a user text input and then
runs a function corresponding to that input. To do this I would need
to parse the string for relevant keywords. I believe there will be
less than 15 keywords when I'm done, so ideally I'd like this code
to be simple and short.
The problem is that I am currently using if-statements to parse the
strings. This is an extremely inconvenient way to parse commands
because even for a short 3 word commands the code explodes into nested-ifs
3 layers deep. So longer 8+ word sentences will become nested-ifs more than
8 layers deep.
This kind of programing approach quickly becomes unmanageable, especially
when I need to make any significant changes to a command.
My question is whether or not there exists a data structure in C++ that
can help me better manage my giant nested-ifs, or if anyone could suggest
a better way to parse a string for lots of different data types (i.e.
substings, ints, and floats) and output an error message when the expected
type is not found?
Here is an example of a short user session to show the kinds of commands
I would like to interpret:
load "Basis.Silicon" as material 1
add material 1 to layer 1
rotate layer 1 about x-axis by 45 degrees
translate layer 1 in x-axis by 10 nm
generate crystal
These commands are based on an already-existing program that our team
uses, but unfortunately the source code for this program has never been
publicly released so I am left guessing as to how it was actually
implemented.
One final note, unlike natural language processors, I know exactly what
the format of each line will be. So my issue isn't so much how to interpret
the text, but rather how to code the logic in a concise and manageable way.
Thanks everyone!
Your question is not clear. And your goals are more difficult than what you believe.
Either you consider that you want to somehow process human language sentences (e.g. in English). Then you want to study natural language processing, and you can find some libraries related to that field.
Or you consider that you want to interpret some formal programming or scripting language. Then you want to study interpreters and compilers. BTW, in that case, you might just embed an existing interpreter (like Lua, Guile, Python, etc....) in your program.
You could also think in terms of expert systems with a knowledge base made of rules (this approach could be viewed as in the middle between NLP and scripting language) You'll then need some inference engine (perhaps CLIPS). See also J.Pitrat's blog.
Notice that even coding a simple interpreter is more difficult than you believe. You absolutely need to represent abstract syntax trees, which you construct from textual input with a parsing phase.
BTW, All of NLP, expert systems, and interpreter design and implementation are difficult fields. You could get a PhD in all 3 fields (but you have to choose which).
If you go the embedded interpreter way: study the interpreters I mentioned (Guile, Lua, Python, Neko, etc...) and choose which one you want, to embed.
If for whatever reason, you want to make an interpreter from scratch: Learn several programming languages first (including scripting languages like Ruby, Python, Ocaml, Scheme, Lua, Neko, ...). Read books on Programming Language Pragmatics (by M.Scott) and Lisp In Small Pieces (by Queinnec). Read also text books on compilation and parsing, and on Garbage Collection and formal (e.g. denotational) semantics. All this may need a dozen years of work.
Notice that by experience embedding a software in an interpreter is a very structuring design. If you did not thought of that at the beginning you probably need to redesign and refactor a lot your existing application. For instance, when embedding a software in an interpreter, you cannot afford that bad input crashes the program. So error handling and memory management (interfacing to the GC of the interpreter) is challenging and gives new constraints. Hence you'll need to re-think your application.
If all this is new (and even if you don't choose e.g. Guile as the embedding interpreter): learn and practice a bit of Scheme -e.g. with Guile or PltScheme- (e.g. reading SICP), read a little bit about λ-calculus and closures, then read Queinnec's Lisp In Small Pieces book. Remember the halting problem (which is partly why interpreters are difficult to code).
BTW the syntax you are proposing (e.g. rotate mat 1 by x 90) is not very readable and looks COBOL-like. If possible, have a language which looks familiar to existing ones. Make it easy to read !
Start by reading all the wikipages I am referencing here.
FWIW, I am the main author of MELT, a domain specific language (inspired a lot by Scheme) to extend the GCC compiler. Some of the papers / documentations I wrote might inspire you (and contain valuable references).
Addenda (after question was reformulated)
You seems to invent some formal syntax like
add material 1 to layer 1
rotate layer 1 about x-axis by 90 degrees
translate layer 1 in x-axis by 10 inches
I can't guess what kind of language is it? Are you implementing a 3D printer? If yes, you should stick to some existing standard formal language in that domain.
I believe that such a COBOL-like syntax is really wrong. The point is that it is too verbose, and that you are wishing to implement some domain specific language. I find your example very bad-looking.
Is that syntax your invention, or is there some document specifying (and many thousands already existing lines coded in) your domain specific language. If you are just inventing it, please reconsider the syntax and the semantics.
First, you need to specify on paper the full syntax and semantics of your DSL.
Is your DSL Turing complete? (I guess that yes, because Turing completeness is reached very quickly - e.g. with variables and loops....). If yes, you are inventing a scripting language. Please don't invent scripting language without knowing several programming & scripting languages (then read Programming Language Pragmatics...). The point is that, if your scripting language will become successful, advanced users will soon or later write important programs in it (e.g. many thousand lines). Then, these advanced users will be programmers. In that case, it is very important (for social & economic reasons) to have a DSL well founded and looking familiar (if possible, an extension of some existing scripting language).
If your DSL already exists, stick to its specification on paper. If that specification is not good enough, improve it with formalization (e.g. by writing some BNF syntax, and some formal (e.g. denotational) semantics for it). Publish and discuss that formalization with existing users.
Several industries got some ad-hoc DSLs which became widely used but was ill designed
(e.g., in the French nuclear industry, the Gibiane DSL designed in the 1970s by nuclear physicists, not computer scientists; the US Boeing corporation is also rumored to have made similar mistakes). Then, maintaining and improving the many hundred thousands lines of DSL scripts is becoming a nightmare (and may means losing millions of dollars or euros). So you better stick to some existing scripting language. The advantages are that there exist some culture on it (e.g. you can find dozens of books on Python or Lua, and many trained engineers familiar with them), that the interpreter is widely used and tested, that the community working on them is improving the interpreters, so it has quite few uncorrected bugs.
You should not attempt to design and implement your own DSL if you are not a trained computer scientist. Stick to some existing scripting language (of course their syntax is not like you want it to be), and leverage on existing implementations and experiment.
As a counter-example, J.Ousterhout has invented the widely used Tcl scripting language, with the claim that scripts are always small (e.g. hundreds of line only) and won't grow to big code base; unfortunately, some of them did, and Tcl is known as a bad language to code many dozens of thousands of lines (even if Tcl is an easy and convenient language for tiny scripts). The moral of the story is that if a (turing complete) scripting language is becoming successful, some "crazy" advanced user will code hundred of thousands of script code. So you need that scripting language to be well designed from the start. Hence, you should adopt and adapt a good existing scripting language (and avoid inventing an unfamiliar syntax without having a good knowledge of several existing scripting languages)
later additions
PS: my criticism of Tcl is not entirely subjective: the point is that Tcl was designed for small scripts in mind (read J.Ousterhout's first papers about Tcl), but my point is that when you offer a Turing-complete scripting language, some "crazy" user will eventually write huge scripts for it. Hence, you need to anticipate such "crazy" usage by offering a scripting language which "scales up" to big scripts, so is built according to software engineering practices for large software code base.
NB. Lua is probably a good choice as a language to embed. It is small, has a nice implementation, is well documented, and has good performance. But be careful about memory management issues (and this advice holds for any scripting language).
EDIT: To be more clear, I would like to have a short list of key words
(<15). The order/presence of which would determine which function will
be run.
You can build a small ruleset engine (e.g. something that processes lists of words). You write that engine/function once and just pass the data structures to it.
As an alternative, a solution using regular expressions would be probably the fastest to code (the engine is ready for you), assuming you're familiar with the regexp syntax (if not, it's still a good investment).
You could build a table of keywords and function pointers:
typedef void (*Function_Pointer)(void);
struct table_entry
{
const char * keyword;
Function_Pointer p_function;
};
table_entry function_table[] =
{
{"car", Process_Car},
{"bike", Process_Bike},
};
Search the table for a keyword. If the keyword is found, dereference the function pointer.
The following snippet will execute the function for processing the word "car":
(function_table[0].p_function)();
There is a famous program, called Eliza, which parses sentences for keywords.
Examples can be found at: Eliza C++ examples

What are the key decisions to get right when designing a fully Unicode-aware language or library?

Looking at Tom Christiansen's talk
🔫 Unicode Support Shootout
👍 The Good, the Bad, & the (mostly) Ugly 👎
working with text seems to be so incredibly hard, that there is no programming language (except Perl 6) which gets it even remotely correct.
What are the key design decisions to make to have a chance to implement Unicode support correctly on a clean table (i. e. no backward-compatibility requirements).
What about default file encodings, which transfer format and normalization format to use internally and for strings? What about case-mapping and case-folding? What about locale- and RTL-support? What about Regex engines as defined by UTS#18? How should common APIs look like?
EDIT: I'll add more as I think of them.
You need no existing code that you have to support. A legacy of code that requires that everything be in 8- or 16-bit unit code units is a royal pain. It makes even libraries awkward when you have to support pre-existing models that don't consider this.
You have to work with blind people only so fonts are no issue. :)
You have to follow the Unicode rules for identifier characters, and pattern syntax characters. You should normalize your identifiers internally. If your language is itself LTR, you may not wish to allow RTL idents; unclear here.
You need to provide primitives in your language that map to Unicode concepts, like instead of just uppercase and lowercase, you need uppercase, titlecase, lowercase, and foldcase (or lc, uc, tc, and fc).
You need to give full access to the Unicode Character Database, including all character properties, so that the various tech reports' algorithms can be easily built up using them.
You need a clear logical model that is easily extensible to graphemes as needed. Just as people have come to realize a code point interface is vastly more important than a code unit one, you have to be able to deal with graphemes, etc. For example, nobody in their right mind should be forced to rewrite:
printf "%-10.10s", $string;
as this every time:
# this library treats strings as sequences of
# extended grapheme clusters for indexing purposes etc.
use Unicode::GCString;
my $gcstring = Unicode::GCString->new($string);
my $colwidth = $gcstring->columns();
if ($colwidth > 10) {
print $gcstring->substr(0,10);
} else {
print " " x (10 - $colwidth);
print $gcstring;
}
You have to do it that way, BTW, because you have to have a notion of print columns, which can be 0 for combining and control characters, or 2 for characters with certain East Asian Width properties. Etc. It would be much better if there was no existing printf code so you could start from scratch and do it right. I have no idea what to do about RTL scripts' widths.
The operating system is a pre-existing code-unit library.
You need not to interact with the filesystem name space, as you have no control over whether filesystem A runs things through NFD (Linux, I believe), filesystem B runs things through NFC (HSF+, nearly), or filesystem C (traditional Unix) doesn't no any at all. Alternately, it is possible that you might be able to provide an abstraction layer here with local filters to hide some of that from the user if possible. Operating systems always have code-unit limits, not code-point ones, which is going to annoy you.
Other things with code-unit stipulations include databases that allocate fixed-size records. Fixed size just doesn't work: it's grapheme-hostile, and normalization form hostile.

Word language detection in C++

After searching on Google I don't know any standard way or library for detecting whether a particular word is of which language.
Suppose I have any word, how could I find which language it is: English, Japanese, Italian, German etc.
Is there any library available for C++? Any suggestion in this regard will be greatly appreciated!
Simple language recognition from words is easy. You don't need to understand the semantics of the text. You don't need any computationally expensive algorithms, just a fast hash map. The problem is, you need a lot of data. Fortunately, you can probably find dictionaries of words in each language you care about. Define a bit mask for each language, that will allow you to mark words like "the" as recognized in multiple languages. Then, read each language dictionary into your hash map. If the word is already present from a different language, just mark the current language also.
Suppose a given word is in English and French. Then when you look it up ex("commercial") will map to ENGLISH|FRENCH, suppose ENGLISH = 1, FRENCH=2, ... You'll find the value 3. If you want to know whether the words are in your lang only, you would test:
int langs = dict["the"];
if (langs | mylang == mylang)
// no other language
Since there will be other languages, probably a more general approach is better.
For each bit set in the vector, add 1 to the corresponding language. Do this for n words. After about n=10 words, in a typical text, you'll have 10 for the dominant language, maybe 2 for a language that it is related to (like English/French), and you can determine with high probability that the text is English. Remember, even if you have a text that is in a language, it can still have a quote in another, so the mere presence of a foreign word doesn't mean the document is in that language. Pick a threshhold, it will work quite well (and very, very fast).
Obviously the hardest thing about this is reading in all the dictionaries. This isn't a code problem, it's a data collection problem. Fortunately, that's your problem, not mine.
To make this fast, you will need to preload the hash map, otherwise loading it up initially is going to hurt. If that's an issue, you will have to write store and load methods for the hash map that block load the entire thing in efficiently.
I have found Google's CLD very helpful, it's written in C++, and from their web site:
"CLD (Compact Language Detector) is the library embedded in Google's Chromium browser. The library detects the language from provided UTF8 text (plain text or HTML). It's implemented in C++, with very basic Python bindings."
Well,
Statistically trained language detectors work surprisingly well on single-word inputs, though there are obviously some cases where they can't possible work, as observed by others here.
In Java, I'd send you to Apache Tika. It has an Open-source statistical language detector.
For C++, you could use JNI to call it. Now, time for a disclaimer warning. Since you specifically asked for C++, and since I'm unaware of a C++ free alternative, I will now point you at a product of my employer, which is a statistical language detector, natively in C++.
http://www.basistech.com, the product name is RLI.
This will not work well one word at a time, as many words are shared. For instance, in several languages "the" means "tea."
Language processing libraries tend to be more comprehensive than just this one feature, and as C++ is a "high-performance" language it might be hard to find one for free.
However, the problem might not be too hard to solve yourself. See the Wikipedia article on the problem for ideas. Also a small support vector machine might do the trick quite handily. Just train it with the most common words in the relevant languages, and you might have a very effective "database" in just a kilobyte or so.
I wouldn't hold my breath. It is difficult enough to determine the language of a text automatically. If all you have is a single word, without context, you would need a database of all the words of all the languages in the world... the size of which would be prohibitive.
Basically you need a huge database of all the major languages. To auto-detect the language of a piece of text, pick the language whose dictionary contains the most words from the text. This is not something you would want to have to implement on your laptop.
Spell check first 3 words of your text in all languages (the more words to spell check, the better). The spelling with least number of spelling errors "wins". With only 3 words it is technically possible to have same spelling in a few languages but with each additional word it becomes less probable. It is not a perfect method, but I figure it would work in most cases.
Otherwise if there is equal number of errors in all languages, use the default language. Or randomly pick another 3 words until you have more clear result. Or expand the number of spell checked words to more than 3, until you get a more clear result as well.
As for the spell checking libraries, there are many, I personally prefer Hunspell. Nuspell is probably also good. It is a matter of personal opinion and/or technical capabilities which one to use.
I assume that you are working with text not with speech.
If you are working with UNICODE than it has provided slot for each languages.
So you can identify that all characters of particular word is fall in this language slot.
For more help about unicode language slot you can fine over here

How essential is polymorphism for writing a text editor?

Many years ago when I didn't know much about object oriented design I heard one guy said something like "How can you write a text editor without polymorphism?" I didn't know much about OOP and so I couldn't judge how wise that though was or ask any specific questions at that time.
Now, after many years of software development (mostly C++), I've used polymorphism many times to solve various problems when designing software. Yet I've never created text editors. So I still can't evaluate that guy's idea.
Is using polymorphism so essential for implementing a text editor in object-oriented languages and why?
Polymorphism for writing a text editor is by no means essential. In fact, polymorphism for solving any programming problem is not essential. It's just one way to do it. Sometimes it makes solving certain kinds of problems easier, and sometimes it just gets in the way.
The evidence for this is that there are perfectly usable text editors developed long before "OOP" became popular.
I would say "no", because it's entirely possible to write perfectly good text editors in non-object-oriented languages, so it can't be that essential.
Polymorphism is a great technique for the problems it addresses, but it's by no means the golden hammer for everything that ails a software developer.
This is a term that was thrown around a lot when OO programming was the rage. This guy was probably trying to intimidate you with large words, I doubt if he fully understood what he was saying although it is a simple concept when explained.
Any here lies the crux of the argument - how many times would you have to write, maintain or extend a text editor - none - therefore imho an OO paradigm is of little use in a for what is a relatively simple piece of code that needs to be highly efficient.
Many of the design patterns like Memento, Flyweight etc that may be used to design/implement Text Editor require inheritance and polymorphism.
The other points about polymorphism as being just a tool are spot on.
However if "the guy" did have some experience with writing text editors he may well have been talking about using polymorphism in the implementation of a document composition hierarchy.
Basically this is just a tree of objects that represent the structure of your document including details such as formatting (bold, italic etc) coloring and so on.
(Most web browsers implement something similar in the form of the browser Document Object Model (DOM), although there is certainly no requirement that they use polymorphism.)
Each of these objects inherits from a common base class (often abstract) that defines a method such as Compose().
Then when it is time to display or to update the structure of the document, the code simply traverses the tree calling the concrete Compose() on each object. Each object is then repsonsible for composing and rendering itself at the appropriate location in the document.
This is a classic use of polymorphism because it allows new document "components" to be added (or changed) without any (or minimal) change to the main application code.
Once again though, there are many ways to build a text manipulation program, polymorphism is definitely not required to build one.
I once wrote a text editor in Basic. It wasn't a sophisticated text editor by any means, it's big highlight being a text-mode windowing thing used for some menus and dialogs, but it still did it's job at the time - ie it proved I could write a text editor in Basic. I even used it sometimes. I won't be showing the source in public - it's just too embarrassing!
When your text editor is mostly just inserting/deleting characters in a big array of strings and displaying them, little or no abstraction is needed other than the usual provided-as-standard abstractions of arrays and strings.
On the other hand, the amount of text that a text editor on a PC is expected to cope with has increased a lot over the last 20 years, sometimes to the extent that even a modern PC with multiple Gigabytes may not be able to keep the whole file in RAM. On top of that there are character set and encoding issues. A good text editor is expected to remember a (potentially large) number of bookmarks into multiple files, and to maintain them so they refer to the same point despite edits. And then there's syntax highlighting, the ability to record/playback macros, and more.
In short, modern text editors are much more complex than the things used in DOS and on other micros twenty years ago. That complexity is no doubt much easier to manage with a good toolkit for handling abstractions.
While a simple text editor (below edit.com from MS-DOS) may be realized easier in a static class only (because the functionality is very limited), as soon as you get to menus and dialogs, you will find yourself in dire need of object oriented language features.
Personally, I frown upon procedural code anyway - I prefer a mixture of OOP (program structure, separation of functionality, etc...) and functional programming (implementation).
This may sound like a religious argument of some sort, but I find my personal style quite recommendable. Usually I need far less lines of code (which are much easier to understand) than most of the developers I work with and my code feels much more "agile" and "flexible".
Try it. :-)
Oh - and polymorphy is not hard to understand. Simply imagine that you (as a person) can be handled as:
a) Man or woman
b) European, asian, american, african, oceanian (I hope this is right), etc...
c) By your name
d) By your occupation
But still you are a person - and a living being, and a part of the universe... and YOU.
So for someone who asks you for statistical reasons a few questions, you may be handled as a, say, woman from oceania (I don't know where you come from, but lets just assume) who is, hm, 42 years old and lived in, hm, Switzerland for 23 years (hahaha).
For your employer, you may be competent in programming and talking to your collegues.
However, HOW you fill those roles is dependent on your implementation. This is you.
Is using polymorphism so essential for implementing a text editor in object-oriented languages and why?
Depends on what kind of text editor you're talking about.
You can write notepad without OOP. But you most likely will need OOP for something like MS Word or OpenOffice.
Design Patterns: Elements of Reusable Object-Oriented Software uses text editor for examples (i.e. "case study") of Design Pattern application. You may want to check out the book.

Picking a front-end/interpreter for a scientific code

The simulation tool I have developed over the past couple of years, is written in C++ and currently has a tcl interpreted front-end. It was written such that it can be run either in an interactive shell, or by passing an input file. Either way, the input file is written in tcl (with many additional simulation-specific commands I have added). This allows for quite powerful input files (e.g.- when running monte-carlo sims, random distributions can be programmed as tcl procedures directly in the input file).
Unfortunately, I am finding that the tcl interpreter is becoming somewhat limited compared to what more modern interpreted languages have to offer, and its syntax seems a bit arcane. Since the computational engine was written as a library with a c-compatible API, it should be straightforward to write alternative front-ends, and I am thinking of moving to a new interpreter, however I am having a bit of a time choosing (mostly because I don't have significant experience with many interpreted languages). The options I have begun to explore are as follows:
Remaining with tcl:
Pros:
- No need to change the existing code.
- Existing input files stay the same. (though I'd probably keep the tcl front end as an option)
- Mature language with lots of community support.
Cons:
- Feeling limited by the language syntax.
- Getting complaints from users as to the difficulty of learning tcl.
Python:
Pros:
- Modern interpreter, known to be quite efficient.
- Large, active community.
- Well known scientific and mathematical modules, such as scipy.
- Commonly used in the academic Scientific/engineering community (typical users of my code)
Cons:
- I've never used it and thus would take time to learn the language (this is also a pro, as I've been meaning to learn python for quite some time)
- Strict formatting of the input files (indentation, etc..)
Matlab:
Pros:
- Very power and widely used mathematical tool
- Powerful built-in visualization/plotting.
- Extensible, through community submitted code, as well as commercial toolboxes.
- Many in science/engineering academia is familiar with and comfortable with matlab.
Cons:
- Can not distribute as an executable- would need to be an add-on/toolbox.
- Would require (?) the matlab compiler (which is pricy).
- Requires Matlab, which is also pricy.
These pros and cons are what I've been able to come up with, though I have very little experience with interpreted languages in general. I'd love to hear any thoughts on both the interpreters I've proposed here, if these pros/cons listed are legitimate, and any other interpreters I haven't thought of (e.g.- would php be appropriate for something like this? lua?). First hand experience with embedding an interpreter in your code is definitely a plus!
I was a strong Tcl/Tk proponent from pre-release, until I did a largish project with it and found how unmaintainable it is. Unfortunately, since prototypes are so easy in Tcl, you wind up with "one-off" scripts taking on lives of their own.
Having adopted Python in the last few months, I'm finding it to be all that Tcl promised and a whole lot more. As many a Python veteran can tell you, source indentation is a bother for the first hour at most and then it seems not a hindrance but affirmatively helpful. Incidentally Tcl's author, John Ousterhout was alternately praised and panned for writing a language that forced the One True Brace Style on Tcl coders (I was 1TBS so it was fine by me).
The only Tcl constructs that aren't handled well by Python are arbitrary eval "${prefix}${command} arg" constructs that shouldn't be used in Tcl anyway but are and the uplevel type statements (which were a nice idea but made for some hairy code). Indeed, Python feels a little antagonistic to dynamic eval but I think that is a Good Thing. Unfortunately, I've yet to come along with a language that embraced its GUI as well as Tcl/Tk; Tkinter does the job in Python but it hurts.
I cannot speak to Matlab at all.
With a few months of Python under my belt, I'd almost certainly port any Tcl program that is in ongoing development to Python for purposes of sanity.
Have you considered using Octave? From what I gather, it is nearly a drop-in replacement for much of matlab. This might allow you to support matlab for those who have it, and a free alternative for those who don't. Since the "meat" of your program appears to be written in another language, the performance considerations seem to be not as important as providing an environment that has: plotting and visualization capabilities, is cross-platform, has a big user base, and in a language that nearly everyone in academia and/or involved with modelling fluid flow probably already knows. Matlab/Octave can potentially have all of those.
Well, unless there are any other suggestions, the final answer I have arrived at is to go with Python.
I seriously considered matlab/octave, but when reading the octave API and matlab API, they are different enough that I'd need to build separate interfaces for each (or get very creative with macros). With python I end up with a single, easier to maintain codebase for the front end, and it is used by just about everyone we know. Thanks for the tips/feedback everyone!