QString(const char* p) constructor misrepresents non ASCII characters - c++

I am trying to upgrade a large C++ program from Qt4 to Qt5 or higher and have some problems with legacy code that was written in ISO Latin-1 (ISO-8859-1). From Qt5 on, the source code is expected to be UTF-8, so that is what we converted it to. We use our own string class (let's call it myQString here), which was basically a char* under Qt4 and became a QString-derived class in Qt5. So far so good.
I still have problems when I pass char* variables that contain non-ASCII characters (for example letters with a diaeresis: 'ä', 'Ä', 'ö', 'Ö', etc.) to the myQString class.
I tried to write a mini program that reproduces and illustrates the problem. I could post some code to make it clearer, but a picture is better in this case:
[Image: Debugger output]
In the debugger view we can see that the desired end products (marked cyan and yellow: mstr2, mstr4, qstr2, qstr4), which should all store "Äa", misrepresent the first byte 'Ä'. All of them use the constructor marked green, myQString(const char* p).
The function that illustrates the problem (in the debugger) is charPointerToQstring(). It is part of the file main.cpp (see the last code block).
If you want to run this mini program, here are all four files needed to do so. I am using Qt Creator and have a project file, which you can name whatever you like, say "testingQString.pro":
QT -= gui
CONFIG += c++17 console
SOURCES += \
main.cpp \
myqstring.cpp
HEADERS += \
myqstring.h
Then we have a "stripped" myQString class, with the two files:
myqstring.h:
#ifndef MYQSTRING_H
#define MYQSTRING_H
#include <QString>
class myQString : public QString
{
public:
myQString();
myQString(const QString& str);
myQString(char c);
myQString(const char* p);
myQString(const QByteArray& ba);
};
#endif // MYQSTRING_H
and "stripped" myqstring.cpp:
#include "myqstring.h"
#include <QDebug>
#define ENTER_FUNCTION qDebug() << "========== Entering:" << Q_FUNC_INFO
myQString::myQString()
{
ENTER_FUNCTION;
}
myQString::myQString(const QString& str) : QString(str)
{
ENTER_FUNCTION;
}
myQString::myQString(char c) : QString(QChar(c))
{
ENTER_FUNCTION;
}
myQString::myQString(const char* p) : QString(p)
{
ENTER_FUNCTION;
}
myQString::myQString(const QByteArray& ba)
{
ENTER_FUNCTION;
foreach (auto c, ba) {
#if QT_VERSION_MAJOR == 5
append(QChar(c));
#endif
#if QT_VERSION_MAJOR == 6
append(char(c));
#endif
}
}
The file main.cpp is also "stripped" here and only shows that one specific problem:
#include "myqstring.h"
#include <QDebug>
#include <array>
// -----------------------------------------------------------------------------
// -----------------------------------------------------------------------------
void charPointerToQstring() {
// case 1 - const char* with string as initialiser
const char* buf1("Äa");
myQString mstr1(buf1);
QString qstr1(buf1);
// case 2 - char* with char assignment
const int len = 2;
char* buf2 = new char[len+1];
buf2[0] = char(0xC4); // 0xC4 == 196 == 'Ä' in ISO-8859-1 (U+00C4)
buf2[1] = 'a';
buf2[len] = '\0';
myQString mstr2(buf2);
QString qstr2(buf2);
// case 3 - str
myQString mstr3("Äa");
QString qstr3("Äa");
// case 4 - std::array<char>
std::array<char, len+1> stda1;
stda1[0] = char(0xC4);
stda1[1] = 'a';
stda1[len] = '\0';
myQString mstr4(stda1.data());
QString qstr4(stda1.data());
qDebug() << "Set a breakpoint exactly on ME (nr 3) and check the results via Debugger!!!";
}
// -----------------------------------------------------------------------------
// -----------------------------------------------------------------------------
int main(int argc, char *argv[])
{
Q_UNUSED(argc)
Q_UNUSED(argv)
// missing code with more tests here...
charPointerToQstring();
}
The big question is: why doesn't Qt handle a single char of a char* argument correctly, while a string literal argument (with the same information) works? If we have a char* as an argument, each char can only range from 0x00 to 0xFF (unsigned). Why not map these to 0x0000 to 0x00FF?
Edit:
Artyer's answer explains the behavior for buf1 but not for buf2. buf2 is a char[3] { 0xC4, 0x61, '\0' }, which gets converted (with Artyer's help) to a QString with the elements QChar { 0x00C4, 0x0061 }. So Qt can easily convert those 0xC4 characters to 0x00C4. In fact, qstr1 shows that it can convert the two chars { 0xC3, 0x84 } of 'Ä' to one correct QChar { 0x00C4 }. If we have a char* as an argument, each char can only range from 0x00 to 0xFF (unsigned). Why not map these to 0x0000 to 0x00FF?
And by the way, I can't accept that approach yet because it now breaks mstr1 and mstr3. They then get exactly the same elements as buf1, just as QChars (so, without the closing '\0'): char[3] { 0xC3, 0x84, 0x61 } becomes QChar { 0x00C3, 0x0084, 0x0061 }, but it should become QChar { 0x00C4, 0x0061 }.

What is probably the case is that "Äa" is three UTF-8 encoded bytes in the source file (Equivalent to char[4]{ 0xC3, 0x84, 'a', '\0' }), and the QString constructor expects UTF-8 encoded data.
The 65533 character (U+FFFD) is the replacement character for the invalid UTF-8 data.
Use QString::fromLatin1:
myQString::myQString(const char* p) : QString(QString::fromLatin1(p, std::strlen(p)))
{
ENTER_FUNCTION;
}
myQString::myQString(const QByteArray& ba) : QString(QString::fromLatin1(ba))
{
ENTER_FUNCTION;
}
Also consider using QLatin1StringView instead of char* to avoid confusion about the encoding (it might be called QLatin1String in older Qt versions).
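To make the byte-level difference concrete, here is a minimal sketch in plain C++ (no Qt). The helper latin1ToUtf8 is a hypothetical name, not a Qt function. It only illustrates the idea behind a Latin-1 conversion, where each byte becomes the code point with the same value, and shows why the single byte 0xC4 is not what a UTF-8 decoder expects for 'Ä'.

```cpp
#include <string>

// Hypothetical helper: re-encode ISO-8859-1 (Latin-1) bytes as UTF-8.
// Every Latin-1 byte maps to the Unicode code point with the same value.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            // ASCII stays a single byte.
            utf8 += static_cast<char>(c);
        } else {
            // U+0080..U+00FF need two bytes: 110xxxxx 10xxxxxx.
            utf8 += static_cast<char>(0xC0 | (c >> 6));
            utf8 += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return utf8;
}
```

With this, the two-byte buffer { 0xC4, 'a' } from case 2 becomes the three bytes { 0xC3, 0x84, 'a' }, which is the same sequence that a UTF-8 source file stores for "Äa".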


Casting QByteArray to `long` outputs different result for same input

EDIT: Added full MCV example project.
I have a strange problem where the same code and same input produce different output values.
The purpose of the code is to test a function that takes a value packed into 4 bytes and unpacks it into a single 32-bit value. The expected value of value1, value2 and value3 in test_unpack() is 2018915346 (i.e. 0x78563412, because of little-endian unpacking). I got this method of unpacking from another answer. Below is an MCV example that you can easily build to see the problem for yourself. Note that if you comment out the body of test1(), test_unpack() magically passes with the correct value.
test_canserialcomm.cpp
#include "test_canserialcomm.h"
#include <QtTest/QtTest>
#include <QByteArray>
long unpack() noexcept
{
quint8 a_bytes[] = {0x12, 0x34, 0x56, 0x78};
QByteArray a = QByteArray(reinterpret_cast<char*>(a_bytes), 4);
long value1 = *((long*)a.data());
qDebug() << value1; // outputs "32651099317351442" (incorrect value)
quint8 b_bytes[] = {0x12, 0x34, 0x56, 0x78};
QByteArray b = QByteArray(reinterpret_cast<char*>(b_bytes), 4);
long value2 = *((long*)b.data());
qDebug() << value2; // outputs "2018915346" (correct value)
quint8 c_bytes[] = {0x12, 0x34, 0x56, 0x78};
QByteArray c = QByteArray(reinterpret_cast<char*>(c_bytes), 4);
long value3 = *((long*)c.data());
qDebug() << value3; // outputs "2018915346" (correct value)
return value1;
}
void TestCanSerialComm::test1()
{
QCOMPARE("aoeu", "aoeu"); // If you comment this line, the next test will pass, as expected.
}
void TestCanSerialComm::test_unpack()
{
long expected {0x78563412};
QCOMPARE(unpack(), expected);
}
test_canserialcomm.h
#ifndef TEST_CANSERIALCOMM_H
#define TEST_CANSERIALCOMM_H
#include <QtTest>
class TestCanSerialComm: public QObject
{
Q_OBJECT
private slots:
void test1();
void test_unpack();
};
#endif // TEST_CANSERIALCOMM_H
test_main.cpp
#include <QtTest>
#include "test_canserialcomm.h"
#include <QCoreApplication>
int main(int argc, char** argv) {
QCoreApplication app(argc, argv);
TestCanSerialComm testCanSerialComm;
// Execute test-runner.
return QTest::qExec(&testCanSerialComm, argc, argv); }
tmp.pro
QT += core \
testlib
QT -= gui
CONFIG += c++11
TARGET = tmp
CONFIG += console
CONFIG -= app_bundle
TEMPLATE = app
TARGET = UnitTests
HEADERS += test_canserialcomm.h
SOURCES += test_canserialcomm.cpp \
test_main.cpp
The output of value1 in test_unpack() is wrong, despite the same code and same inputs. Strangely, if I remove the qDebug() calls and set a breakpoint, the debugger expression evaluator now shows that value2 has the wrong value.
Any idea why this is happening? Or even how to troubleshoot this further?
Additional Notes: If I add a line qDebug() << "garbage"; at the top of my function, all 3 values produced are correct.
You're compiling and running this program on a system where long is 8 bytes, but your QByteArray has only 4 bytes. That means that when you alias the array as a long (using *((long*)a.data())) you're reading 4 bytes past the end of the array, into uninitialized heap storage.
The fix is to use a type that is guaranteed to be 4 bytes in size, e.g. std::int32_t.
As an aside, using *((long*)[...]) to alias memory is not guaranteed to work, primarily because of alignment issues but also (in the general case) because aliasing is only supported for types equivalent to char or a signed or unsigned variant. The safer technique is to use memcpy:
#include <cassert>
#include <cstdint>
#include <cstring>
std::uint32_t value1;
assert(a.size() == sizeof(value1));
std::memcpy(&value1, a.data(), a.size());
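If the result should not depend on the byte order of the host, the value can also be assembled with shifts. This is a minimal sketch, assuming the four bytes are stored in little-endian order as in the question; unpackLittleEndian32 is a hypothetical helper name.

```cpp
#include <cstdint>

// Hypothetical helper: combine four little-endian bytes into one 32-bit value.
// The caller must make sure that at least four bytes are readable.
std::uint32_t unpackLittleEndian32(const unsigned char* bytes)
{
    return  static_cast<std::uint32_t>(bytes[0])
         | (static_cast<std::uint32_t>(bytes[1]) << 8)
         | (static_cast<std::uint32_t>(bytes[2]) << 16)
         | (static_cast<std::uint32_t>(bytes[3]) << 24);
}
```

With a QByteArray a, this could be called as unpackLittleEndian32(reinterpret_cast<const unsigned char*>(a.constData())) after checking that a.size() is at least 4. For the bytes 0x12, 0x34, 0x56, 0x78 it yields 0x78563412.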

Qt convert unicode entities

In Qt 5.4 and C++, I am trying to decode a string that contains Unicode entities.
I have this QString:
QString string = "file\u00d6\u00c7\u015e\u0130\u011e\u00dc\u0130\u00e7\u00f6\u015fi\u011f\u00fc\u0131.txt";
I want to convert this string to this: fileÖÇŞİĞÜİçöşiğüı.txt
I tried QString's toUtf8 and fromUtf8 methods. Also tried to decode it character by character.
Is there a way to convert it by using Qt?
Qt provides a macro called QStringLiteral for handling string literals correctly.
Here's a full working example:
#include <QString>
#include <QDebug>
int main(void) {
QString string = QStringLiteral("file\u00d6\u00c7\u015e\u0130\u011e\u00dc\u0130\u00e7\u00f6\u015fi\u011f\u00fc\u0131.txt");
qDebug() << string;
return 0;
}
As mentioned in the above comments, you do need to print to a console that supports these characters for this to work.
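Note that a \uXXXX sequence in a C++ literal is already resolved by the compiler, so nothing is left to decode at run time. Here is a small sketch in plain C++ (no Qt) using a UTF-16 literal, which as far as I know is close to what QStringLiteral stores. The function name escapedPrefix is made up for illustration.

```cpp
#include <string>

// Hypothetical helper: the first six characters of the file name from the
// question as a UTF-16 string. Each \uXXXX escape becomes one code unit.
std::u16string escapedPrefix()
{
    return u"file\u00d6\u00c7";
}
```

The returned string has six elements, and the fifth one is already the code unit 0x00D6 ('Ö'), not the six characters of the escape sequence.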
I have just tested this code:
int main(int argc, char *argv[])
{
QApplication a(argc, argv);
QString s = "file\u00d6\u00c7\u015e\u0130\u011e\u00dc\u0130\u00e7\u00f6\u015fi\u011f\u00fc\u0131.txt";
qDebug() << s.length(); //Outputs: 22
qDebug() << s; //Outputs: fileÖÇŞİĞÜİçöşiğüı.txt
return a.exec();
}
This is with Qt 5.4 on Ubuntu, so it looks like your problem only occurs on some operating systems.
#include <QTextDocument>
QTextDocument doc;
QString string = "file\u00d6\u00c7\u015e\u0130\u011e\u00dc\u0130\u00e7\u00f6\u015fi\u011f\u00fc\u0131.txt";
doc.setHtml(string); // to convert entities to text
QString result = doc.toPlainText(); // result = "fileÖÇŞİĞÜİçöşiğüı.txt"
Note: this is NOT useful if you have a console app, because QTextDocument needs the GUI module.

UTF-8 error in GtkTextView while decoding base64

I have been trying to figure this out for a few days now. All I am trying to do is decode a base64 string and add it to a Gtk::TextView. Below is the code:
txtbuffer_ = Gtk::TextBuffer::create();
txtview_.set_buffer(txtbuffer_);
const Glib::ustring str = Glib::Base64::decode("YmJi3A==");
txtbuffer_->set_text(str);
When I run the program I get the error:
Gtk-CRITICAL **: gtk_text_buffer_emit_insert: assertion 'g_utf8_validate (text, len, NULL)' failed
This error only occurs with Unicode characters. When the text is ASCII it all works fine.
I have tried three different base64 decoders, I tried using std::string and Glib::ustring with all the different decoders. I also tried using the function Glib::locale_to_utf8(), but that gives me the error terminate called after throwing an instance of 'Glib::ConvertError'. And I tried using Glib::convert with the same error.
I know that Gtk::TextView can display Unicode because if I set the text to a string with Unicode it will display the text.
I read that Gtk::TextView displays text in UTF-8, so I think my problem is that the decoded string is not coded in UTF-8, but I am not sure. So my question is how can I get Gtk::TextView to display the decoded base64?
Added note: I am using version 3.8 of Gtkmm
Tested using version 3.12, same error message
Minimal program:
//test.h
#ifndef TEST_H_
#define TEST_H_
#include <gtkmm.h>
class MainWindow : public Gtk::Window
{
public:
MainWindow();
virtual ~MainWindow();
protected:
Gtk::Box box_main;
Gtk::TextView txtview_;
Glib::RefPtr<Gtk::TextBuffer> txtbuffer_;
};
#endif /* TEST_H_ */
//test.cpp
#include "test.h"
MainWindow::MainWindow()
{
Gtk::Window::add(box_main);
box_main.pack_start(txtview_);
txtbuffer_ = Gtk::TextBuffer::create();
txtview_.set_buffer(txtbuffer_);
const Glib::ustring str = Glib::Base64::decode("YmJi3A==");
txtbuffer_->set_text(str);
Gtk::Window::show_all_children();
}
MainWindow::~MainWindow()
{
}
//main.cpp
#include "test.h"
int main(int argc, char* argv[])
{
Glib::RefPtr<Gtk::Application> app = Gtk::Application::create(argc, argv, "test.program");
MainWindow mw;
return app->run(mw);
}
It was not working because the string that I had encoded was not UTF-8. Thanks to https://mail.gnome.org/archives/gtk-list/2014-April/msg00016.html, I found out that the encoding was ISO-8859-1. So there are two possible fixes. First, make sure the string is UTF-8 before encoding it:
const Glib::ustring str2 = Glib::Base64::encode("bbbÜ");
Or, second, figure out the original encoding of the string and convert the decoded data to UTF-8. For me this worked:
Glib::convert(base64_str, "UTF-8", "ISO-8859-1");
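To see why the assertion fires, note that "YmJi3A==" decodes to the four bytes 'b', 'b', 'b', 0xDC, and a lone 0xDC is not a complete UTF-8 sequence. Below is a simplified structural check in plain C++ (no Glib). looksLikeUtf8 is a hypothetical helper and not a replacement for g_utf8_validate; for example, it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: checks only that every lead byte is followed by the
// right number of continuation bytes (10xxxxxx).
bool looksLikeUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (c < 0x80)                extra = 0;  // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;  // 110xxxxx
        else if ((c & 0xF0) == 0xE0) extra = 2;  // 1110xxxx
        else if ((c & 0xF8) == 0xF0) extra = 3;  // 11110xxx
        else return false;                       // stray continuation or invalid byte

        if (extra > s.size() - i - 1)
            return false;                        // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char next = static_cast<unsigned char>(s[i + k]);
            if ((next & 0xC0) != 0x80)
                return false;                    // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

The ISO-8859-1 bytes "bbb\xDC" fail this check, while the UTF-8 form "bbb\xC3\x9C" passes.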
From documentation:
Note that the returned binary data is not necessarily zero-terminated,
so it should not be used as a character string.
That means UTF-8 validation will read beyond the bounds of the data and, with a likelihood near 1, hit a sequence of bytes that is not valid UTF-8.
But even that did not fix it. It seems that the length is one too long and the last value is just garbage.
So you can either use (which I'd recommend)
std::string stdstr = Glib::Base64::decode (x);
const Glib::ustring str(stdstr.c_str(), stdstr.length()-1);
or
gsize len = 0;
const gchar *ret = (gchar*)g_base64_decode (x, &len);
len --;
const Glib::ustring str(ret, len);
g_free (ret);
So I guess this is a bug in GTK+ (which gtkmm wraps).

C++ Linking and running LuaJit compiled files with loadbuffer and runbuffer

I have compiled test.lua with LuaJIT into test.obj and test.h. How do I correctly use the loadBuffer or runBuffer functions that I have? All I need to find out is how to pass test.lua, test.obj and test.h to these calls, but I just can't get it right. I've tried hundreds of ways, but nothing seems to work. I have stripped other functions from main and so forth, so that only the problem remains visible and not the things that work just fine.
C++: This is the main
int main(int argc, const char* argv[])
{
std::vector<std::string> args(argv, argv + argc);
g_lua.loadBuffer("test.lua", "test.obj");
// I have tried both runBuffer and loadBuffer, but I just can't get it right; it always fails.
}
Here is the loadBuffer function:
void LuaCodes::loadBuffer(const std::string& buffer, const std::string& source)
{
int ret = luaL_loadbuffer(L, buffer.c_str(), buffer.length(), source.c_str());
if(ret != 0)
throw LuaException(popString(), 0);
}
Here is the runBuffer function:
void LuaCodes::runBuffer(const std::string& buffer, const std::string& source)
{
loadBuffer(buffer, source);
safeCall(0, 0);
}
Here are the insides of test.h:
#define luaJIT_BC_test_SIZE 1186
static const char luaJIT_BC_test[] = {
27,76,74,1,2,154,9,2,0,12,0,47,0,151,1,52,0,0,0,55,0,1,0,52,1,2,0,55,1,3,1,62,
1,1,2,52,2,4,0,55,2,5,2,62,2,1,2,37,3,6,0,36,1,3,1,62,0,2,1,52,0,0,0,55,0,7,0,
52,1,8,0,55,1,9,1,37,2,10,0,62,1,2,0,61,0,0,1,52,0,0,0,55,0,7,0,52,1,4,0,55,1,
11,1,62,1,1,2,37,2,12,0,52,3,4,0,55,3,13,3,62,3,1,2,37,4,14,0,52,5,4,0,55,5,
15,5,62,5,1,2,37,6,16,0,52,7,4,0,55,7,17,7,62,7,1,2,37,8,18,0,52,9,4,0,55,9,
19,9,62,9,1,2,37,10,20,0,52,11,4,0,55,11,21,11,62,11,1,2,36,1,11,1,62,0,2,1,
52,0,2,0,55,0,22,0,52,1,2,0,55,1,3,1,62,1,1,2,37,2,23,0,36,1,2,1,41,2,2,0,62,
0,3,2,14,0,0,0,84,0,4,128,52,0,0,0,55,0,24,0,37,1,25,0,62,0,2,1,52,0,2,0,55,0,
22,0,52,1,2,0,55,1,3,1,62,1,1,2,37,2,26,0,36,1,2,1,41,2,2,0,62,0,3,2,14,0,0,0,
84,0,4,128,52,0,0,0,55,0,24,0,37,1,27,0,62,0,2,1,52,0,2,0,55,0,22,0,52,1,2,0,
55,1,3,1,62,1,1,2,37,2,28,0,36,1,2,1,41,2,2,0,62,0,3,1,52,0,2,0,55,0,29,0,52,
1,4,0,55,1,5,1,62,1,1,0,61,0,0,1,52,0,2,0,55,0,30,0,37,1,31,0,37,2,32,0,41,3,
2,0,62,0,4,1,52,0,33,0,55,0,34,0,37,1,35,0,62,0,2,1,52,0,36,0,55,0,37,0,62,0,
1,1,52,0,36,0,55,0,38,0,39,1,99,0,62,0,2,1,52,0,36,0,55,0,39,0,37,1,40,0,62,0,
2,1,52,0,36,0,55,0,39,0,37,1,41,0,62,0,2,1,52,0,36,0,55,0,38,0,39,1,243,1,62,
0,2,1,52,0,36,0,55,0,39,0,37,1,42,0,62,0,2,1,52,0,36,0,55,0,38,0,39,1,231,3,
62,0,2,1,52,0,36,0,55,0,39,0,37,1,43,0,62,0,2,1,52,0,36,0,55,0,38,0,39,1,15,
39,62,0,2,1,37,0,31,0,52,1,4,0,55,1,5,1,62,1,1,2,37,2,44,0,36,0,2,0,52,1,2,0,
55,1,45,1,16,2,0,0,62,1,2,2,15,0,1,0,84,2,3,128,52,1,46,0,16,2,0,0,62,1,2,1,
71,0,1,0,11,100,111,102,105,108,101,15,102,105,108,101,69,120,105,115,116,115,
7,114,99,19,103,97,109,101,95,105,110,116,101,114,102,97,99,101,11,99,108,105,
101,110,116,12,103,97,109,101,108,105,98,12,99,111,114,101,108,105,98,23,101,
110,115,117,114,101,77,111,100,117,108,101,76,111,97,100,101,100,20,97,117,
116,111,76,111,97,100,77,111,100,117,108,101,115,20,100,105,115,99,111,118,
101,114,77,111,100,117,108,101,115,14,103,95,109,111,100,117,108,101,115,17,
47,99,111,110,102,105,103,46,111,116,109,108,9,108,111,97,100,14,103,95,99,
111,110,102,105,103,115,11,46,111,116,112,107,103,6,47,25,115,101,97,114,99,
104,65,110,100,65,100,100,80,97,99,107,97,103,101,115,22,115,101,116,117,112,
85,115,101,114,87,114,105,116,101,68,105,114,9,109,111,100,115,56,85,110,97,
98,108,101,32,116,111,32,97,100,100,32,109,111,100,117,108,101,115,32,100,105,
114,101,99,116,111,114,121,32,116,111,32,116,104,101,32,115,101,97,114,99,104,
32,112,97,116,104,46,12,109,111,100,117,108,101,115,53,85,110,97,98,108,101,
32,116,111,32,97,100,100,32,100,97,116,97,32,100,105,114,101,99,116,111,114,
121,32,116,111,32,116,104,101,32,115,101,97,114,99,104,32,112,97,116,104,46,
10,102,97,116,97,108,9,100,97,116,97,18,97,100,100,83,101,97,114,99,104,80,97,
116,104,17,103,101,116,66,117,105,108,100,65,114,99,104,15,32,102,111,114,32,
97,114,99,104,32,17,103,101,116,66,117,105,108,100,68,97,116,101,16,41,32,98,
117,105,108,116,32,111,110,32,19,103,101,116,66,117,105,108,100,67,111,109,
109,105,116,7,32,40,21,103,101,116,66,117,105,108,100,82,101,118,105,115,105,
111,110,10,32,114,101,118,32,15,103,101,116,86,101,114,115,105,111,110,6,32,
12,103,101,116,78,97,109,101,42,61,61,32,97,112,112,108,105,99,97,116,105,111,
110,32,115,116,97,114,116,101,100,32,97,116,32,37,98,32,37,100,32,37,89,32,37,
88,9,100,97,116,101,7,111,115,9,105,110,102,111,9,46,108,111,103,19,103,101,
116,67,111,109,112,97,99,116,78,97,109,101,10,103,95,97,112,112,15,103,101,
116,87,111,114,107,68,105,114,16,103,95,114,101,115,111,117,114,99,101,115,15,
115,101,116,76,111,103,70,105,108,101,13,103,95,108,111,103,103,101,114,0
};
For luaL_loadbuffer (and hence LuaCodes::loadBuffer) the 1st argument should be a string containing the bytecode and the 2nd argument should be a human-readable name (e.g. the filename that the bytecode was compiled from.)
Try:
#include "test.h"
// ...
int main(int argc, const char* argv[])
{
// ...
std::string bytecode(luaJIT_BC_test, luaJIT_BC_test_SIZE);
g_lua.loadBuffer(bytecode, "#test.lua");
}
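The explicit length matters because bytecode contains zero bytes, so anything that treats it as a C string would cut it short. Here is a minimal sketch with a made-up eight-byte array. kFakeBytecode and makeBuffer are hypothetical names; the real data is luaJIT_BC_test from test.h.

```cpp
#include <cstddef>
#include <string>

// Made-up stand-in for a generated bytecode array; note the embedded zeros.
static const char kFakeBytecode[] = {27, 76, 74, 1, 2, 0, 12, 0};
static const std::size_t kFakeBytecodeSize = sizeof(kFakeBytecode);

// Hypothetical helper: copies exactly `size` bytes, zero bytes included.
std::string makeBuffer(const char* data, std::size_t size)
{
    return std::string(data, size);
}
```

makeBuffer(kFakeBytecode, kFakeBytecodeSize) keeps all eight bytes, whereas std::string(kFakeBytecode) stops at the first zero byte and keeps only five.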

Using lex generated source code in another file

I would like to use the code generated by lex from other code that I have, but all the examples I have seen embed the main function inside the lex file, not the other way around.
Is it possible to use (include) the C file generated by lex in other code, to get something like this (not necessarily identical)?
#include<something>
int main(){
Lexer l = Lexer("some string or input file");
while (l.has_next()){
Token * token = l.get_next_token();
//somecode
}
//where token is just a simple object to hold the token type and lexeme
return 0;
}
This is what I would start with:
Note: this is an example of using the C interface.
To use the C++ interface, add %option c++ (see below).
Test.lex
IdentPart1 [A-Za-z_]
Identifier {IdentPart1}[A-Za-z_0-9]*
WHITESPACE [ \t\r\n]
%option noyywrap
%%
{Identifier} {return 257;}
{WHITESPACE} {/* Ignore */}
. {return 258;}
%%
// This is the bit you want.
// It is best just to put this at the bottom of the lex file
// By default functions are extern. So you can create a header file with
// these as extern then included that header file in your code (See Lexer.h)
void* setUpBuffer(char const* text)
{
YY_BUFFER_STATE buffer = yy_scan_string(text);
yy_switch_to_buffer(buffer);
return buffer;
}
void tearDownBuffer(void* buffer)
{
yy_delete_buffer((YY_BUFFER_STATE)buffer);
}
Lexer.h
#ifndef LOKI_A_LEXER_H
#define LOKI_A_LEXER_H
#include <string>
extern int yylex();
extern char* yytext;
extern int yyleng;
// Here is the interface to the lexer you set up above
extern void* setUpBuffer(char const* text);
extern void tearDownBuffer(void* buffer);
class Lexer
{
std::string token;
std::string text;
void* buffer;
public:
Lexer(std::string const& t)
: text(t)
{
// Use the interface to set up the buffer
buffer = setUpBuffer(text.c_str());
}
~Lexer()
{
// Tear down your interface
tearDownBuffer(buffer);
}
// Don't use RAW pointers
// This is only a quick and dirty example.
bool nextToken()
{
int val = yylex();
if (val != 0)
{
token = std::string(yytext, yyleng);
}
return val;
}
std::string const& theToken() const {return token;}
};
#endif
main.cpp
#include "Lexer.h"
#include <iostream>
int main()
{
Lexer l("some string or input file");
// Did not like your has_next() interface.
// Just call nextToken() until it fails.
while (l.nextToken())
{
std::cout << l.theToken() << "\n";
}
//where token is just a simple object to hold the token type and lexeme
return 0;
}
Build
> flex test.lex
> g++ main.cpp lex.yy.c
> ./a.out
some
string
or
input
file
>
Alternatively, you can use the C++ interface to flex (it's experimental)
test.lex
%option c++
IdentPart1 [A-Za-z_]
Identifier {IdentPart1}[A-Za-z_0-9]*
WHITESPACE [ \t\r\n]
%%
{Identifier} {return 257;}
{WHITESPACE} {/* Ignore */}
. {return 258;}
%%
// Note this needs to be here
// If you define no yywrap() in the options it gets added to the header file
// which leads to multiple definitions if you are not careful.
int yyFlexLexer::yywrap() { return 1;}
main.cpp
#include "MyLexer.h"
#include <iostream>
#include <sstream>
int main()
{
std::istringstream data("some string or input file");
yyFlexLexer l(&data, &std::cout);
while (l.yylex())
{
std::cout << std::string(l.YYText(), l.YYLeng()) << "\n";
}
//where token is just a simple object to hold the token type and lexeme
return 0;
}
build
> flex --header-file=MyLexer.h test.lex
> g++ main.cpp lex.yy.cc
> ./a.out
some
string
or
input
file
>
Sure. I'm not sure about the generated class; we use the C generated
parsers, and call them from C++. Or you can insert any sort of wrapper
code you want in the lex file, and call anything there from outside of
the generated file.
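As a rough illustration of such wrapper code, here is a sketch of the has_next()/get_next_token() interface from the question in plain C++. Token and TokenStream are hypothetical names, and the two callbacks merely stand in for yylex() and for reading yytext/yyleng, so the sketch compiles without a generated scanner.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical token type: the code returned by the scanner plus its text.
struct Token {
    int type;
    std::string lexeme;
};

// Hypothetical wrapper with one token of lookahead.
class TokenStream {
public:
    TokenStream(std::function<int()> scan, std::function<std::string()> text)
        : scan_(std::move(scan)), text_(std::move(text)) {}

    // True while the scanner has not reported end of input (code 0).
    bool has_next()
    {
        fill();
        return lookahead_.type != 0;
    }

    // Returns the buffered token; at end of input it keeps returning type 0.
    Token get_next_token()
    {
        fill();
        Token result = lookahead_;
        if (result.type != 0)
            filled_ = false;
        return result;
    }

private:
    // Pulls one token from the scanner if none is buffered.
    void fill()
    {
        if (filled_)
            return;
        lookahead_.type = scan_();
        lookahead_.lexeme = (lookahead_.type != 0) ? text_() : std::string();
        filled_ = true;
    }

    std::function<int()> scan_;          // stands in for yylex()
    std::function<std::string()> text_;  // stands in for std::string(yytext, yyleng)
    Token lookahead_{0, std::string()};
    bool filled_ = false;
};
```

With the C interface, the two callbacks would be something like [] { return yylex(); } and [] { return std::string(yytext, yyleng); }.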
The keywords are %option reentrant or %option c++.
As an example here's the ncr2a scanner:
/** ncr2a_lex.l: Replace all NCRs by corresponding printable ASCII characters. */
%%
&#(1([01][0-9]|2[0-6])|3[2-9]|[4-9][0-9]); { /* accept 32..126 */
/** `+2` skips '&#', `atoi()` ignores ';' at the end */
fputc(atoi(yytext + 2), yyout); /* non-recursive version */
}
The scanner code can be left unchanged.
Here the program that uses it:
/** ncr2a.c */
#include "ncr2a_lex.h"
typedef struct {
int i,j; /** put here whatever you need to keep extra state */
} State;
int main () {
yyscan_t scanner;
State my_custom_data = {0,0};
yylex_init(&scanner);
yyset_extra(&my_custom_data, scanner);
yylex(scanner);
yylex_destroy(scanner);
return 0;
}
To build ncr2a executable:
flex -R -oncr2a_lex.c --header-file=ncr2a_lex.h ncr2a_lex.l
cc -c -o ncr2a_lex.o ncr2a_lex.c
cc -o ncr2a ncr2a_lex.o ncr2a.c -lfl
Example
$ echo 'three colons :::' | ./ncr2a
three colons :::
This example uses stdin/stdout as input/output and it calls yylex() once.
To read from a file:
yyin = fopen("input.txt", "r" );
@Loki Astari's answer shows how to read from a string (buffer = yy_scan_string(text, scanner); yy_switch_to_buffer(buffer, scanner)).
To call yylex() once for each token, add a return statement inside the rule definitions that yield a full token in the *.l file.