Can flex and bison accept input from other sources? - c++

I am going to write a text editor in Qt which can provide highlighting/code completion/syntax analysis for a programming language (a toy language, for learning purposes).
At first, I thought of writing handcrafted C++, which would be more comfortable since I'm more familiar with it. However, upon searching, I found that flex/bison can simplify parser creation. Upon trying a few simple examples, it seems the working examples accept input from standard input in the terminal. So I just want to know: can flex/bison accept input from the text editor widget in a GUI framework (such as Qt, which I'm going to learn at the same time, after I finish a few features in the parser engine), and then later output the result back to the text editor?

If you do not want to use a FILE * pointer, you can also scan from in-memory buffers such as character arrays and NUL-terminated C strings by creating flex input buffers: yy_scan_string() creates a buffer from a NUL-terminated string, and yy_scan_bytes() creates a buffer from a fixed-length character array. See Multiple Input Buffers in the flex documentation for more information.
And if that does not meet your needs, you can also re-define the YY_INPUT macro for complete control - see Generated Scanner.
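For example, here is a minimal sketch of scanning a string held in memory with yy_scan_string(); it assumes the code sits where the flex-generated declarations (YY_BUFFER_STATE, yylex, and so on) are visible, such as in the .l file itself:
void scanText(const char *text)
{
    /* copy the string into a new flex input buffer and make it current */
    YY_BUFFER_STATE buf = yy_scan_string(text);
    while (int token = yylex()) {
        /* act on each token: highlight, record symbols, ... */
    }
    yy_delete_buffer(buf);  /* release the buffer when done */
}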

flex reads its input from yyin. If you point it at something that is not stdin, it will read from there instead. See here for an example.
Edit: btw, yyin is a FILE *. You're using C++, which means you'd probably want to pass a stream instead. Please read flex's documentation on C++ interfacing.
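A minimal sketch of that C++ interface, assuming a scanner generated with %option c++ (which produces the yyFlexLexer class declared in FlexLexer.h); the function name and token handling are illustrative:
#include <FlexLexer.h>
#include <sstream>
#include <string>

void analyze(const std::string &editorText)
{
    std::istringstream in(editorText);
    yyFlexLexer lexer(&in);               // scan from the stream, not stdin
    while (int token = lexer.yylex()) {
        // lexer.YYText() / lexer.YYLeng() give the matched text
    }
}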
Edit2: for the output... you're the one programming the yacc/bison actions for the rules, and also the error handler, so you're given quite a lot of freedom in what to do there. For example, you can "emit" highlighted code, and also use the error handlers to point out errors when analyzing the code. Completion would force you to implement at least part of the semantics (symbol table, etc.), but that's a different story...

Related

C++ Running code from a text file

I realize that this (Is it possible to dynamically compile and execute C# code fragments?) is a similar question, but it is 5 years old and also in C#. Basically, what I would like to do is have a program take code from a text file, and then compile and run it. If C++ is not a good language for this, a recommendation would be helpful. Here is my basic goal:
Have code that runs code from a text file (e.g. Test0).
Have the code from the file output 'Hello'.
Then make another file (Test1) and change the output to a mixing of those letters.
When it saves, it overwrites any previous file of the same name.
The code from the text file then ends.
The original program then makes Test0 = Test1.
The program then loops or ends.
The most basic solution is to invoke the C/C++ compiler to compile the text file each time it changes and before you call it. But what you are probably looking for is an interpreter, not a compiler. There are various C/C++ interpreters (which you can search for) that may meet your needs; however, C/C++ is not traditionally interpreted. More popular interpreted languages are Python, Lua, and JavaScript. With those, it is possible to embed the interpreter into your C/C++ program and interact with it: you can call the interpreted language's functions from C/C++ and vice versa.
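As a rough sketch of that embedding approach, here is a C++ program that runs a script file through an embedded Lua interpreter (assuming Lua 5.2 or later with its lua.hpp header; the script name Test0.lua is illustrative):
#include <lua.hpp>
#include <iostream>

int main()
{
    lua_State *L = luaL_newstate();  // create an interpreter instance
    luaL_openlibs(L);                // load Lua's standard libraries
    if (luaL_dofile(L, "Test0.lua") != LUA_OK) {
        // on failure the error message is left on the Lua stack
        std::cerr << lua_tostring(L, -1) << '\n';
    }
    lua_close(L);
    return 0;
}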

What is the fastest method to read STDIN data into console app with C++

I have a console app written in C++ that needs to accept text input from another process via a WriteLine call (followed by a newline). I'm assuming that this needs to be done via STDIN, but it also needs to work in the fastest way possible. My console app then needs to reply back to the other process.
I haven't done console programming for a while now, but I remember from C classes at school that there are a lot of C-style functions, like fgets, getline, etc., but what I remember is that they seemed to be quite slow.
So is there any way to do this exchange (again, "quick in and then out") with WinAPIs?
The fastest method in theory will almost certainly be the system-level input routines, since both stdin (from C, but also available in C++) and std::cin build on these. On the other hand, they have generally been optimized for the platform, so unless you can find the optimal configuration yourself (e.g. things like buffer size), you might not gain much, if anything: calling read (Unix) or ReadFile (Windows) for each character will probably be slower than using something like std::getline.
The other question is what you plan on doing with the data after you've read it. Functions like read or ReadFile give you a buffer (char[]) with a certain number of characters; you then have to analyse it, break it into lines, etc. Functions like std::getline give you a std::string with the line in it. If you're a really skilled C++ programmer, you can probably organize things so that the actual data are never moved from the char[], but this would require re-implementing a lot of things that are already implemented in the standard library. The use of templates in the standard library means that you don't have to implement nearly as much as you would otherwise, but you'll still need to create an equivalent to std::string which maintains two iterators (char const*) rather than the data itself.
In the end, I'd start by writing the application using std::getline and std::string. Once it's working, I'd see what its actual performance is, and only then, if necessary, consider ways of improving it.
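A minimal sketch of that starting point, assuming a simple line-in, line-out protocol (the "ack" reply is illustrative):
#include <iostream>
#include <string>

int main()
{
    std::ios::sync_with_stdio(false);  // skip C-stdio synchronization for speed
    std::string line;
    while (std::getline(std::cin, line)) {
        // process the line, then reply to the other process on stdout
        std::cout << "ack: " << line << std::endl;  // endl flushes the reply
    }
    return 0;
}
Note that std::endl is used deliberately here: the reply must be flushed so the other process actually sees it.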

Interpreters: Handling includes/imports

I've built an interpreter in C++ and everything works fine so far, but now I'm getting stuck on the design of the import/include/however-you-want-to-call-it feature.
I thought about the following:
Handling includes in the tokenizing process: when an include is found in the code, the tokenizing function is called recursively with the specified filename. The tokenized code of the included file is then inserted at the position of the include.
Disadvantage: no conditional includes(!)
Handling includes during the interpreting process: I don't know how. All I know is that PHP must do it this way, since conditional includes are possible.
Now my questions:
What should I do about includes?
How do modern interpreters (Python/Ruby) handle this? Do they allow conditional includes?
This problem is easy to solve if you have a clean design and you know what you're doing. Otherwise it can be very hard. I have written at least 6 interpreters that all have this feature, and it's fairly straightforward.
Your interpreter needs to maintain an environment that knows about all the global variables, functions, types and so on that have been defined. You might feel more comfortable calling this the "symbol table".
You need to define an internal function that reads a file and updates the environment. Depending on your language design, you might or might not do some evaluation the moment you read things in. My interpreters are very dynamic and evaluate each definition as soon as it is read in.
Your life will be infinitely easier if you structure your interpreter in layers:
Tokenizer (breaks input into tokens)
Parser (reads one token at a time, converts to abstract-syntax tree)
Evaluator (reads the abstract syntax and updates the environment)
The abstract-syntax tree is really the key. If you have this, when you encounter the import/include construct in the input, you just make a recursive call and get more abstract syntax back. You can do this in the parser or the evaluator. If you want conditional import, you have to do it in the evaluator, since only the evaluator can compute a condition.
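A minimal sketch of an include handled in the evaluator, which is what makes it conditional; every type and function name here is hypothetical:
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

struct Env { /* globals, functions, types: the "symbol table" */ };

struct Ast {
    enum Kind { Include, If, Block } kind;
    std::string includePath;     // for Include nodes
    bool condition = false;      // stand-in for a real condition expression
    std::vector<Ast> children;   // for If / Block nodes
};

Ast parse(const std::string &source);  // tokenizer + parser, defined elsewhere
void eval(const Ast &node, Env &env);

void evalFile(const std::string &path, Env &env)
{
    std::ifstream in(path);
    std::stringstream src;
    src << in.rdbuf();            // slurp the included file
    eval(parse(src.str()), env);  // recursive call: more abstract syntax back
}

void eval(const Ast &node, Env &env)
{
    switch (node.kind) {
    case Ast::Include:
        evalFile(node.includePath, env);  // reached only if actually executed,
        break;                            // so conditional includes just work
    case Ast::If:
        if (node.condition)               // only the evaluator can compute this
            for (const Ast &c : node.children) eval(c, env);
        break;
    case Ast::Block:
        for (const Ast &c : node.children) eval(c, env);
        break;
    }
}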
Source code for my interpreters is on the web. Two of them are written in C; the others are written in Standard ML.

Is there a better (more modern) tool than lex/flex for generating a tokenizer for C++?

I recently added source-file parsing to an existing tool that generated output files from complex command-line arguments.
The command-line arguments got to be so complex that we started allowing them to be supplied as a file that was parsed as if it were a very large command line, but the syntax was still awkward. So I added the ability to parse a source file using a more reasonable syntax.
I used flex 2.5.4 for Windows to generate the tokenizer for this custom source-file format, and it worked. But I hated the code: global variables, a weird naming convention, and the C++ code it generated was awful. The existing code-generation backend was glued to the output of flex; I don't use yacc or bison.
I'm about to dive back into that code, and I'd like to use a better/more modern tool. Does anyone know of something that:
Runs in Windows command prompt (Visual studio integration is ok, but I use make files to build)
Generates a proper encapsulated C++ tokenizer. (No global variables)
Uses regular expressions for describing the tokenizing rules (compatible with lex syntax a plus)
Does not force me to use the c-runtime (or fake it) for file reading. (parse from memory)
Warns me when my rules force the tokenizer to backtrack (or fixes it automatically)
Gives me full control over variable and method names (so I can conform to my existing naming convention)
Allows me to link multiple parsers into a single .exe without name collisions
Can generate a UNICODE (16bit UCS-2) parser if I want it to
Is NOT an integrated tokenizer + parser-generator (I want a lex replacement, not a lex+yacc replacement)
I could probably live with a tool that just generated the tokenizing tables if that was the only thing available.
Ragel: http://www.complang.org/ragel/ It fits most of your requirements.
It runs on Windows
It doesn't declare the variables, so you can put them inside a class or inside a function as you like.
It has nice tools for analyzing regular expressions to see when they would backtrack. (I don't know about this very much, since I never use syntax in Ragel that would create a backtracking parser.)
Variable names can't be changed.
Table names are prefixed with the machine name, and they're declared "const static", so you can put more than one in the same file and have more than one with the same name in a single program (as long as they're in different files).
You can declare the variables as any integer type, including UChar (or whatever UTF-16 type you prefer). It doesn't automatically handle surrogate pairs, though. It doesn't have special character classes for Unicode either (I think).
It only does regular expressions... has no bison/yacc features.
The code it generates interferes very little with a program. The code is also incredibly fast, and the Ragel syntax is more flexible and readable than anything I've ever seen. It's a rock solid piece of software. It can generate a table-driven parser or a goto-driven parser.
Flex also has a C++ output option.
The result is a set of classes that do the parsing.
Just add the following to the head of your lex file:
%option c++
%option yyclass="Lexer"
Then in your source it is:
std::fstream file("config");
Lexer lexer(&file);                 // Lexer derives from yyFlexLexer
while (int token = lexer.yylex())
{
    // handle each token
}
Boost.Spirit.Qi (parser-tokenizer) or Boost.Spirit.Lex (tokenizer only). I absolutely love Qi, and Lex is not bad either, but I just tend to take Qi for my parsing needs...
The only real drawback with Qi tends to be an increase in compile time, and it also runs slightly slower than hand-written parsing code. It is generally much faster than parsing with regexes, though.
http://www.boost.org/doc/libs/1_41_0/libs/spirit/doc/html/index.html
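A minimal sketch of Qi in action, parsing a comma-separated list of integers (the grammar is illustrative):
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;

int main()
{
    std::string input = "1, 2, 3, 4";
    std::vector<int> out;
    auto first = input.begin();
    auto last = input.end();
    bool ok = qi::phrase_parse(first, last,
                               qi::int_ % ',',               // comma-separated ints
                               boost::spirit::ascii::space,  // skip whitespace
                               out);
    if (ok && first == last)
        std::cout << "parsed " << out.size() << " numbers\n";
    return 0;
}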
There are two tools that come to mind, although you would need to find out for yourself which would be suitable: ANTLR and GoldParser. Both have language bindings that can be plugged into a C++ runtime environment.
boost.spirit and the YARD parser come to mind. Note that the approach of using lexer generators has been somewhat superseded by C++ internal DSLs (domain-specific languages) for specifying tokens: the grammar becomes part of your code, specified by following a set of rules, without using an external utility.

Best way to design for localization of strings

This is kind of a general question, open to opinions. I've been trying to come up with a good way to design for localization of string resources for a Windows MFC application and related utilities. My wishlist is:
Must preserve string literals in code (as opposed to replacing them with macro #define resource IDs), so that the messages are still readable inline
Must allow localized string resources (duh)
Must not impose additional run-time environment restrictions (e.g. dependency on .NET, etc.)
Should have minimal intrusion into existing code (the less modification the better)
Should be debuggable
Should generate resource files which are editable by common tools (i.e. a common format)
Should not use copy/paste comment blocks to preserve literal strings in code, or anything else which creates the potential for de-synchronization
Would be nice to allow static (compile-time) checking that every "notated" string is in the resource file(s)
Would be nice to allow cross-language resource-string pooling (for components in various languages, e.g. native C++ and .NET)
I have a way which fulfills all my wishlist to some extent except for static checking, but I have had to develop a bit of custom code to achieve it (and it has limitations). I'm wondering if anyone has solved this problem in a particularly good way.
Edit:
The solution I currently have looks like this:
ShowMessage( RESTRING( _T("Some string") ) );
ShowMessage( RESTRING( _T("Some string with variable %1"), sNonTranslatedStringVariable ) );
I then have a custom utility to parse out the strings from within the 'RESTRING' blocks and put them into a .resx file for localization, and a separate C# COM object to load them from localized resource files with fallback. If the C# object is not available (or cannot load), I fall back to the string in the code. The macro expands to a template class which calls the COM object and does the formatting, etc.
Anyway, I thought it would be useful to add what I have now for reference.
We use the English string as the ID.
If the lookup from the international resource object (loaded from the installed I18N dll) fails, then we default to the ID string.
Code looks like:
doAction(I18N.get("Press OK to continue"));
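A minimal sketch of that lookup-with-fallback, assuming a hypothetical I18N class whose table is filled at startup from the installed I18N dll:
#include <string>
#include <unordered_map>

class I18N {
public:
    // returns the translation if present, otherwise the English ID itself
    static std::string get(const std::string &id)
    {
        auto it = table().find(id);
        return it != table().end() ? it->second : id;
    }
private:
    static std::unordered_map<std::string, std::string> &table()
    {
        // populated at startup from the locale-specific I18N dll
        static std::unordered_map<std::string, std::string> strings;
        return strings;
    }
};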
As part of the build process we have a perl script that parses all source for string constants. It builds a temp file of all strings in the application and then compares these against the resource strings in each locale to see if they exist. Any missing string generates an e-mail to the appropriate translation team.
We can have multiple dlls, one for each locale. The name of the dll is based on RFC 3066:
language[_territory][.codeset][#modifier]
We try to extract the locale from the machine and be as specific as possible when loading the I18N dll, but fall back to less specific locale variations if the more specific version is not present.
Example:
In the UK, if the locale was en_GB.UTF-8:
(I use the term dll loosely, not in the specific Windows sense.)
First look for the I18N.en_GB.UTF-8 dll. If this dll does not exist, fall back to I18N.en_GB. If that does not exist, fall back to I18N.en. If that does not exist, fall back to I18N.default.
The only exception to this rule is:
Simplified Chinese (zh_CN), where the fallback is US English (en_US). If the machine does not support Simplified Chinese then it is unlikely to support full Chinese.
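A minimal sketch of that fallback chain (loadDll and the naming are illustrative, and the Simplified Chinese exception is omitted for brevity):
#include <string>
#include <vector>

bool loadDll(const std::string &name);  // platform-specific, defined elsewhere

void loadI18n(const std::string &locale)  // e.g. "en_GB.UTF-8"
{
    std::vector<std::string> candidates;
    candidates.push_back("I18N." + locale);                          // I18N.en_GB.UTF-8
    std::string lang = locale.substr(0, locale.find('.'));
    candidates.push_back("I18N." + lang);                            // I18N.en_GB
    candidates.push_back("I18N." + lang.substr(0, lang.find('_')));  // I18N.en
    candidates.push_back("I18N.default");                            // last resort
    for (const std::string &name : candidates)
        if (loadDll(name))
            return;
}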
The simple way is to use only string IDs in your code: no literal strings.
You can then produce different versions of the .rc file for each language and either create resource-only DLLs or simply different language builds.
There are a couple of shareware utilities to help localise the .rc file, which handle resizing dialog elements for languages with longer words and warn about missing translations.
A more complicated problem is word order, if you have several values in a printf which must appear in a different order in different languages' grammar.
There are some extended printf classes on CodeProject that let you specify things like printf("word %1s and %2s", var1, var2), so you can switch %1s and %2s if necessary.
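A minimal sketch of the same idea using Boost.Format's positional placeholders, one readily available alternative to those classes:
#include <boost/format.hpp>
#include <iostream>
#include <string>

int main()
{
    // a translated format string can reorder the arguments freely
    std::string en = "Copy %1% to %2%";
    std::string swapped = "To %2%, copy %1%";  // illustrative reordering
    std::cout << boost::format(en) % "file.txt" % "backup" << '\n';
    std::cout << boost::format(swapped) % "file.txt" % "backup" << '\n';
    return 0;
}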
I don't know much about how this is normally done on Windows, but the way localized strings are handled in Apple's Cocoa framework works pretty well. They have a very basic text-format file that you can send to a translator, and some preprocessor macros to retrieve the values from the files.
In your code, you'll see the strings in your native language, rather than as opaque IDs.
Since it is open for opinions, here is how I do it.
My localized text file is a simple tab-delimited text file that can be loaded into Excel and edited.
The first column is the define, and each column to the right is a subsequent language, for example:
ID ENGLISH FRENCH GERMAN
STRING_YES YES OUI JA
STRING_NO NO NON NEIN
Then in my makefile there is a custom build step that generates a strings.h file and a strings.dat. In my case it builds an enum list for the string IDs and then a binary file with offsets for the text. Since in my app the user can change the language at any time, I have them all in memory, but you could easily have your preprocessor generate a different output file for each language if necessary.
The thing I like about this design is that if any strings are missing I get a compile error, whereas if strings were looked up at runtime I might not know about a missing string in a seldom-used part of the code until later.
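A minimal sketch of what the generated output might look like (the generator, names, and file layout are all illustrative):
#include <cstdint>
#include <string>
#include <vector>

// strings.h (generated): a missing ID becomes a compile error at the call site
enum StringId { STRING_YES, STRING_NO, STRING_COUNT };

// in-memory view of strings.dat: per-language offsets into one text blob
struct StringTable {
    std::vector<std::uint32_t> offsets;  // STRING_COUNT entries per language
    std::string text;                    // concatenated NUL-terminated strings

    const char *get(StringId id, int language) const
    {
        return text.c_str() + offsets[language * STRING_COUNT + id];
    }
};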
Your solution is quite similar to the Unix/Linux "gettext" solution. In fact, you would not need to write the extraction routines.
I'm not sure why you want the _RESTRING macro to handle multiple arguments. My code (using wxWidgets' support for gettext) looks like this: MyString.Format(_("Some string with variable %ls"), _("variable"));. That is to say, String::Format(...) gets two individually translated arguments. In hindsight, Boost::Format would have been better, but it too would allow boost::format(_("Some string with variable %1")) % _("variable");
(We use the _() macro for brevity)
On one project that I localized into 10+ languages, I put everything that was to be localized into a single resource-only dll. At install time, the user selected which dll got installed with their application.
I only had to deliver the English dll to the localization team. They returned a localized dll to me for each language which I included in the build.
I know it's not perfect, but it worked.
You want an advanced utility that I've always wanted to write but never had the time to.
If you don't find such a tool, you may want to fall back on my CMsg() and CFMsg() wrapper classes, which make it very easy to pull strings from the resource table. (CFMsg even provides a one-liner FormatMessage wrapper.)
And yes, in the absence of that tool you're looking for, keeping a copy of the string in a comment is a good solution. Regarding desynchronization of the comments, remember that string literals are very rarely changed.
http://www.codeproject.com/KB/string/stringtable.aspx
BTW, native Win32 programs and .NET programs have a totally different resource storage management. You'll have a hard time finding a common solution for both.