How to Simplify C++ Boolean Comparisons

I'm trying to find a way to simplify the comparison cases of booleans. Currently, there are only three (as shown below), but I'm about to add a 4th option and this is getting very tedious.
bracketFirstIndex = message.indexOf('[');
mentionFirstIndex = message.indexOf('#');
urlFirstIndex = message.indexOf(urlStarter);
bool startsWithBracket = (bracketFirstIndex != -1);
bool startsWithAtSymbol = (mentionFirstIndex != -1);
bool startsWithUrl = (urlFirstIndex != -1);
if (!startsWithBracket)
{
    if (!startsWithAtSymbol)
    {
        if (!startsWithUrl)
        {
            // No brackets, mentions, or urls. Send message as normal
            cursor.insertText(message);
            break;
        }
        else
        {
            // There's a URL, lets begin!
            index = urlFirstIndex;
        }
    }
    else
    {
        if (!startsWithUrl)
        {
            // There's an # symbol, lets begin!
            index = mentionFirstIndex;
        }
        else
        {
            // There's both an # symbol and URL, pick the first one... lets begin!
            index = std::min(urlFirstIndex, mentionFirstIndex);
        }
    }
}
else
{
    if (!startsWithAtSymbol)
    {
        // There's a [, look down!
        index = bracketFirstIndex;
    }
    else
    {
        // There's both a [ and #, pick the first one... look down!
        index = std::min(bracketFirstIndex, mentionFirstIndex);
    }
    if (startsWithUrl)
    {
        // If there's a URL, pick the first one... then lets begin!
        // Otherwise, just "lets begin!"
        index = std::min(index, urlFirstIndex);
    }
}
Is there a better/simpler way to compare several boolean values, or am I stuck with this format and should I just attempt to squeeze the 4th option into the appropriate locations?

Some types of text processing are fairly common, and for those you should strongly consider using an existing library. For example, if the text you are processing uses Markdown syntax, consider using an existing library to parse the Markdown into a structured format for you to interpret.
If this is completely custom parsing, then there are a few options:
For very simple text processing (like a single string expected to
be in one of a few formats or containing a piece of subtext in an expected format), use regular expressions. In C++, the RE2 library provides very powerful support for matching and extracting using regexes.
For more complicated text processing, such as data spanning many lines or having a wide variety of content / syntax, consider using an existing lexer and parser generator. Flex and Bison are common tools (used together) to auto-generate logic for parsing text according to a grammar.
You can, by hand, as you are doing now, write your own parsing logic.
If you go with the latter approach, there are a few ways to simplify things:
Separate the "lexing" (breaking up the input into tokens) and "parsing" (interpreting the series of tokens) into separate phases.
Define a "Token" class and a corresponding hierarchy representing the types of symbols that can appear within your grammar (like RawText, Keyword, AtMention, etc.)
Create one or more enums representing the states that your parsing logic can be in.
Implement your lexing and parsing logic as a state machine that transforms the state given the current state and the next token or letter. Building up a map from (state, token type) to next_state or from (state, token type) to handler_function can help you to simplify the structure.
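To make that last point concrete, here is a minimal sketch of such a table-driven state machine. The token and state names are invented for illustration (only RawText and AtMention echo the token types mentioned above):

#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token and state types, purely for illustration.
enum class TokenType { RawText, OpenBracket, AtMention, UrlStart };
enum class State { Normal, InBracket, InMention, InUrl };

struct Token {
    TokenType type;
    std::string text;
};

int main() {
    // The table: (current state, token type) -> handler that returns the next state.
    using Key = std::pair<State, TokenType>;
    std::map<Key, std::function<State(const Token&)>> transitions;

    transitions[{State::Normal, TokenType::OpenBracket}] =
        [](const Token&) { /* begin bracket handling */ return State::InBracket; };
    transitions[{State::Normal, TokenType::AtMention}] =
        [](const Token&) { /* begin mention handling */ return State::InMention; };
    // ... one entry per (state, token) pair you care about ...

    std::vector<Token> tokens = { {TokenType::OpenBracket, "["} };  // output of the lexing phase
    State state = State::Normal;
    for (const Token& t : tokens) {
        auto it = transitions.find({state, t.type});
        if (it != transitions.end())
            state = it->second(t);   // run the handler and move to the next state
        // else: stay in the current state (or flag an error)
    }
}

Adding a fourth (or tenth) starting symbol is then one more enum value and a few more table entries, rather than another level of nested if/else.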

Since you are switching only on the starting character, you can use a switch with cases:
enum State { Start1, Start2, Start3, Start4 };
State state;
if (startsWithBracket) {
    state = Start1;
} else {
    ...
}
switch (state) {
    case Start1:
        doSomething();
        break;
    case Start2:
        ...
}
More information about switch syntax and use cases can be found here.

Related

Looking for a better algorithm for finding substrings in strings using Qt

UPDATE
I'll add some info about the problem to give you a better idea of why everything is done the way it is.
The main point of the whole script is to find all errors in a special file that keeps original and translated strings.
The script requires the "special" bilingual file (an XML file in real life) and a "special" vocabulary file which keeps words and their translations (xls/xlsx, constructed by hand; PO would probably be better).
As a result it finds all errors in the translation, using the provided vocabulary.
Obviously if the vocab is bad the result sucks.
At some point in time the whole thing used 'std', or mostly 'std' and Boost regular expressions.
At some later point came the need for UTF-8 support, including in the regular expressions. We had no time to write complex stuff, so it was decided to go the Qt way.
We were aware that it is possible to iterate over bytes. But we needed actual letters and sequences of letters; we also needed to cut off word endings, which is done through regular expressions, and no other regex engine supports UTF-8 reasonably well.
It was decided that Qt fitted the role far better than anything we would write ourselves in very limited time, as Qt has UTF-8 support and, as of v5, keeps all internal strings UTF-8 encoded (as far as I am aware).
It was pointed out that the complexity of the proposed solution looks like O(m * n).
In reality it's probably even worse - closer to O(m * n * log(l)) or even O(m * n * l) straight. Here m is the number of strings, n the number of vocabulary records, and l the number of synonyms each word has (l is always at least 1).
Since we need to check all strings, and for each string run through the whole vocabulary to find all errors, I currently see no way to make it fundamentally faster.
As the question implies I am looking for a better solution to an existing coding problem.
I am gonna try to explain what exactly the problem is as best as I can.
Imagine you have a piece of code written in C++ that takes a string, a translation of the string,
gets rid of pesky word endings.
After that it takes another file which is a vocabulary and actually runs the whole vocab to find out whether the translation of the string has any errors.
Obviously this thing is highly dependent on the actual vocabulary, but that is not really a problem.
I actually have the piece of code described, although I need to mention the whole thing runs through CGI (don't ask, but at some point it was decided that C++ would run it faster). I can have the full code uploaded to a git repo, it's rather big, but I will share the essential parts here.
The current problem I am facing is twofold: either the code does not find everything it is supposed to, or it works too slowly (it probably gets stuck somewhere, but I have not yet pinpointed where).
The main idea behind the code was:
// All definitions for essential structures so you have a better idea what the hell is going on
struct Word {
    QString full = "";
    QString stemmed = "";
};
struct VocRecord {
    QVector<Word> orig;
    QVector<Word> trans;
    QString error = "";
    void clearRecord() {
        this->orig.clear();
        this->trans.clear();
        this->error = "";
    }
};
typedef QVector<VocRecord> Vocabluary;
......
Vocabluary voc = .....; // Obviously here we get the vocabulary; how we get it is rather complicated, you can just assume it looks like the vector of records defined above.
QString origStemmed, transStemmed, orig, trans;
// orig - original string
// trans - it's translation
// origStemmed - original string with removed word endings (we call it stemming hence stemmed)
// transStemmed - translation with removed word endings.
At first the algo was something along the lines of:
origStemmed = QString(" ") + origStemmed + QString(" "); // Add whitespace at the beginning and end of the string for searching
transStemmed = QString(" ") + transStemmed + QString(" ");
for(int i = 0; i < voc.length(); i++) {
    VocRecord record = voc[i];
    for(int j = 0; j < record.orig.length(); j++) {
        Word origWord = record.orig[j];
        int si = origStemmed.indexOf(QString(" ") + origWord.stemmed + QString(" "));
        if(si > -1) {
            int ind = origWord.stemmed.indexOf(" ");
            int idx = 0;
            if(ind != -1) {
                // Found a space in the record, which means the record contains at least two words.
                // Here we care where the first word ends, and it's part of the global problem
                idx = origMod.indexOf(origWord.full.mid(0, ind));
            } else {
                // We did not find a space, so it's one word only: take the whole thing.
                idx = origMod.indexOf(origWord.full);
            }
            // Now comes the tricky part: we try to figure out whether the original text, in which we found our voc record, had any punctuation after the word.
            // This actually matters only for records that have more than one word, but as you'll see we check all of them, and that is not correct - still figuring out how to get around it.
            QChar symb; // We'll keep the last symbol of the first word here
            // origMod - modified original: everything is lowercase, punctuation is kept.
            // The main reason we have this at all is that when stemming we have to get rid of all punctuation, so we keep the "lowercased" string separate.
            // I am 100% sure we don't need it at all since Qt supports case-insensitive search, but I would like to hear your opinion on it.
            if(origMod.indexOf(" ", idx) > 0) {
                symb = origMod[origMod.indexOf(" ", idx)-1];
            } else {
                symb = origMod[origMod.length()-1];
            }
            // When we have the last symbol we skip the found word
            if(ind != -1 && (symb == QChar(',') || symb == QChar(';') || symb == QChar('!') || symb == QChar(':') || symb == QChar('?') || symb == QChar('.'))) {
                continue;
            }
            // The important part ends here
............
As you will notice, we search for the stemmed word in the original string.
By all accounts it should work, but the main problem of the proposed search is that it can have several matches, including false ones, and we only care about the first one found. The most obvious solution is probably to go through all matches, but I am unsure that is a good idea: it requires another loop and the algorithm is quite slow already.
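For illustration, the "go through all matches" idea would look roughly like this, reusing the names from the snippet above (so treat it as a fragment rather than a drop-in fix): the search is simply restarted just past each hit.

int si = origStemmed.indexOf(QString(" ") + origWord.stemmed + QString(" "));
while (si > -1) {
    // ... run the existing "where does the first word end / punctuation" checks
    // ... against this occurrence instead of only the first one ...
    si = origStemmed.indexOf(QString(" ") + origWord.stemmed + QString(" "), si + 1);
}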
The next solution I came up with for solving the problem was using regular expressions, but I must have messed up, because the algorithm became really slow.
The main idea of the second solution:
// We do NOT add spaces! Spaces suck big time.
for(int i = 0; i < voc.length(); i++) {
    VocRecord record = voc[i];
    for(int j = 0; j < record.orig.length(); j++) {
        Word origWord = record.orig[j];
        // Instead of using spaces, we search for a regular expression made from the vocab record.
        // A simple contains() actually runs into the same set of problems, namely more than one match, or in some cases false matches (when the searched part matches something it should not).
        // Now this is terribly slow, as you can imagine, because we create the regular expressions on the fly instead of pre-making them. But I still have not thought of a way around it.
        if(origStemmed.contains(QRegularExpression(origWord.stemmed + "\\b",
                QRegularExpression::UseUnicodePropertiesOption | QRegularExpression::CaseInsensitiveOption))) {
            // Here we do something ungodly. We take our stemmed voc record, split it by spaces, then go through all the parts making a string that will become our regular expression later
            QString temp;
            parts.clear();
            parts = origWord.stemmed.split(" ");
            for(int k = 0; k < parts.count(); k++) {
                parts[k] = "\\b" + parts[k] + "[a-z]*?\\b";
            }
            // After we have modified everything we need, we join the whole thing back together with spaces.
            temp = parts.join(" ");
            // And here is the ungodly check - we actually search for the constructed regular expression in the original string, and because we made sure to exclude any punctuation from the expression, in theory this should work.
            if(!origMod.contains(QRegularExpression(temp, QRegularExpression::UseUnicodePropertiesOption | QRegularExpression::CaseInsensitiveOption))) {
                continue;
            }
            // Well, it does not work, or rather it works so slowly it's impossible to get any result, and even if we do, we still don't find everything we should - I blame the shitty regex here.
// And the important part ends.
As I pointed out, the second solution sucks big time. Currently I am aiming for some intermediate solution and would gladly accept any tips or suggestions on where to look or what to look for.
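One concrete thing to try for the "pre-make the regexes" problem mentioned in the comments: build each record's QRegularExpression once, before looping over the strings, and only reuse it inside the loops. A rough sketch, assuming the Word/VocRecord/Vocabluary definitions above (the PreparedWord struct and the added escape() call are my own additions for illustration):

struct PreparedWord {
    Word word;
    QRegularExpression stemmedRe;   // e.g. "\bpart1[a-z]*?\b \bpart2[a-z]*?\b ..."
};

QVector<QVector<PreparedWord> > preparedOrig(voc.length());
for(int i = 0; i < voc.length(); i++) {
    for(int j = 0; j < voc[i].orig.length(); j++) {
        QStringList parts = voc[i].orig[j].stemmed.split(" ");
        for(int k = 0; k < parts.count(); k++)
            parts[k] = "\\b" + QRegularExpression::escape(parts[k]) + "[a-z]*?\\b";
        PreparedWord pw;
        pw.word = voc[i].orig[j];
        pw.stemmedRe = QRegularExpression(parts.join(" "),
                QRegularExpression::UseUnicodePropertiesOption |
                QRegularExpression::CaseInsensitiveOption);
        preparedOrig[i].append(pw);
    }
}
// In the per-string loops, origMod.contains(pw.stemmedRe) can then be called with
// no per-iteration pattern construction or compilation.

Pattern compilation is usually the expensive part of constructing a QRegularExpression inside a hot loop, so hoisting it out should address the "terribly slow" part without changing what is matched.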
If any of you want to see the full code for this thing, just add a comment and I'll put all the important files on GitHub in a separate repo.

How to remove elements by substrings from a STL container

I have a vector of objects (the objects are term nodes that, amongst other fields, contain a string field with the term string)
class TermNode {
private:
std::wstring term;
double weight;
...
public:
...
};
After some processing and calculating the scores these objects get finally stored in a vector of TermNode pointers such as
std::vector<TermNode *> termlist;
A resulting list of this vector, containing up to 400 entries, looks like this:
DEBUG: 'knowledge' term weight=13.5921
DEBUG: 'discovery' term weight=12.3437
DEBUG: 'applications' term weight=11.9476
DEBUG: 'process' term weight=11.4553
DEBUG: 'knowledge discovery' term weight=11.4509
DEBUG: 'information' term weight=10.952
DEBUG: 'techniques' term weight=10.4139
DEBUG: 'web' term weight=10.3733
...
What I am trying to do is clean up that final list, removing single terms that are also contained in phrases inside the terms list. For example, looking at the above list snippet, there is the phrase 'knowledge discovery', and therefore I would like to remove the single terms 'knowledge' and 'discovery', because they are also in the list and redundant in this context. I want to keep the phrases containing the single terms. I am also thinking about removing all strings of 3 characters or fewer. But that is just a thought for now.
For this cleanup process I would like to code a class using remove_if / find_if (using the new C++ lambdas) and it would be nice to have that code in a compact class.
I am not really sure how to solve this. The problem is that I would first have to identify which strings to remove, probably by setting a flag as a delete marker. That would mean I would have to pre-process the list. I would have to find the single terms and the phrases that contain one of those single terms. I think that is not an easy task to do and would need some advanced algorithm. Using a suffix tree to identify substrings?
Another loop over the vector, and maybe a copy of the same vector, could do the cleanup. I am looking for something that is most efficient time-wise.
I've been playing with ideas/directions such as those shown in "std::list erase incompatible iterator" using remove_if / find_if, and the idea used in "Erasing multiple objects from a std::vector?".
So the question is basically is there a smart way to do this and avoid multiple loops and how could I identify the single terms for deletion? Maybe I am really missing something, but probably someone is out there and give me a good hint.
Thanks for your thoughts!
Update
I implemented the removal of redundant single terms the way Scrubbins recommended as follows:
/**
* Functor gets the term of each TermNode object, looks if the term string
* contains spaces (i.e. the term is a phrase), splits the phrase by spaces and finally
* stores these term tokens into a set. Only terms with a weight higher than
* 'skipAtWeight' are taken into account.
*/
struct findPhrasesAndSplitIntoTokens {
private:
set<wstring> tokens;
double skipAtWeight;
public:
findPhrasesAndSplitIntoTokens(const double skipAtWeight)
: skipAtWeight(skipAtWeight) {
}
/**
* Implements operator()
*/
void operator()(const TermNode * tn) {
// --- skip all terms lower skipAtWeight
if (tn->getWeight() < skipAtWeight)
return;
// --- get term
wstring term = tn->getTerm();
// --- iterate over term, check for spaces (if this term is a phrase)
for (unsigned int i = 0; i < term.length(); i++) {
if (isspace(term.at(i))) {
if (0) {
wcout << "input term=" << term << endl;
}
// --- simply tokenize the term by spaces and store the tokens into
// --- the tokens set
// --- TODO: check if this really is UTF-8 aware, esp. for
// --- strings containing umlauts, etc !!
wistringstream iss(term);
copy(istream_iterator<wstring,
wchar_t, std::char_traits<wchar_t> >(iss),
istream_iterator<wstring,
wchar_t, std::char_traits<wchar_t> >(),
inserter(tokens, tokens.begin()));
if (0) {
wcout << "size of token set=" << tokens.size() << endl;
for_each(tokens.begin(), tokens.end(), printSingleToken());
}
}
}
}
/**
* return set of extracted tokens
*/
set<wstring> getTokens() const {
return tokens;
}
};
/**
* Functor to find terms in tokens set
*/
class removeTermIfInPhraseTokensSet {
private:
set<wstring> tokens;
public:
removeTermIfInPhraseTokensSet(const set<wstring>& termTokens)
: tokens(termTokens) {
}
/**
* Implements operator()
*/
bool operator()(const TermNode * tn) const {
if (tokens.find(tn->getTerm()) != tokens.end()) {
return true;
}
return false;
}
};
...
findPhrasesAndSplitIntoTokens objPhraseTokens(6.5);
objPhraseTokens = std::for_each(
termList.begin(), termList.end(), objPhraseTokens);
set<wstring> tokens = objPhraseTokens.getTokens();
wcout << "size of tokens set=" << tokens.size() << endl;
for_each(tokens.begin(), tokens.end(), printSingleToken());
// --- remove all extracted single tokens from the final terms list
// --- of similar search terms
removeTermIfInPhraseTokensSet removeTermIfFound(tokens);
termList.erase(
remove_if(
termList.begin(), termList.end(), removeTermIfFound),
termList.end()
);
for (vector<TermNode *>::const_iterator tl_iter = termList.begin();
tl_iter != termList.end(); tl_iter++) {
wcout << "DEBUG: '" << (*tl_iter)->getTerm() << "' term weight=" << (*tl_iter)->getNormalizedWeight() << endl;
if ((*tl_iter)->getNormalizedWeight() <= 6.5) break;
}
...
I couldn't use the C++11 lambda syntax, because my Ubuntu servers currently have g++ 4.4.1 installed. Anyway, it does the job for now.
The way to go now is to check the quality of the resulting weighted terms against other search result sets, see how I can improve the quality, and find a way to boost the more relevant terms in conjunction with the original query term. It might not be an easy task; I wish there were some "simple heuristics".
But that might be another new question once I've stepped a little further :-)
So thanks to all for this rich contribution of thoughts!
What you need to do is first iterate through the list and split up all the multi-word values into single words. If you're allowing Unicode, this means you will need something akin to ICU's BreakIterators; otherwise you can go with a simple punctuation/whitespace split. When each string is split into its constituent words, use a hash map to keep track of all the current words. When you reach a multi-word value, you can then check whether its words have already been found. This should be the simplest way to identify duplicates.
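One way to read that suggestion in terms of the question's types (a sketch only: it uses a plain whitespace split rather than ICU, assumes getTerm() from the update code above, and the function name is made up):

#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Sketch: collect the single-word terms that also occur inside multi-word terms.
// TermNode and termlist are assumed to be the types from the question.
std::unordered_set<std::wstring> findRedundantSingleTerms(const std::vector<TermNode*>& termlist)
{
    std::unordered_set<std::wstring> singleWords;
    for (std::size_t i = 0; i < termlist.size(); ++i) {
        const std::wstring term = termlist[i]->getTerm();
        if (term.find(L' ') == std::wstring::npos)
            singleWords.insert(term);                  // remember every single-word term
    }

    std::unordered_set<std::wstring> redundant;
    for (std::size_t i = 0; i < termlist.size(); ++i) {
        const std::wstring term = termlist[i]->getTerm();
        if (term.find(L' ') == std::wstring::npos)
            continue;                                  // only phrases are of interest here
        std::wistringstream iss(term);                 // naive whitespace split; ICU for real Unicode
        std::wstring word;
        while (iss >> word)
            if (singleWords.count(word))
                redundant.insert(word);                // single term also appears inside a phrase
    }
    return redundant;                                  // feed this to remove_if / erase
}

The returned set can then be fed to the erase/remove_if shown in the next answer; on an older compiler without <unordered_set>, a std::set works the same way, just with logarithmic lookups.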
I can suggest using the "erase-remove" idiom in this way:
struct YourConditionFunctor {
    bool operator()(TermNode* term) {
        if (/* you have to remove term */) {
            delete term;
            return true;
        }
        return false;
    }
};
and then write:
termlist.erase(
remove_if(
termlist.begin(),
termlist.end(),
YourConditionFunctor()
),
termlist.end()
);

Better, or advantages in different ways of coding similar functions

I'm writing the code for a GUI (in C++), and right now I'm concerned with the organisation of text into lines. One of the problems I'm having is that the code is getting very long and confusing, and I'm starting to get into an n^2 scenario where for every option I add for the text's presentation, the number of functions I have to write is the square of that. In trying to deal with this, a particular design choice has come up, and I don't know which is the better method, or the extent of the advantages or disadvantages between them:
I have two methods which are very similar in flow, i.e. they iterate through the same objects and take into account the same constraints, but ultimately perform different operations within this flow. For anyone's interest, the methods render the text, and determine if any text overflows the line due to wrapping the text around other objects or simply the end of the line, respectively.
These functions need to be copied and rewritten for left, right or centred text, which have different flow, so whatever design choice I make would be repeated three times.
Basically, I could continue what I have now, which is two separate methods to handle these different actions, or I could merge them into one function, which has if statements within it to determine whether or not to render the text or figure out if any text overflows.
Is there a generally accepted right way to going about this? Otherwise, what are the tradeoffs concerned, what are the signs that might indicate one way should be used over the other? Is there some other way of doing things I've missed?
I've edited through this a few times to try and make it more understandable, but if it isn't please ask me some questions so I can edit and explain. I can also post the source code of the two different methods, but they use a lot of functions and objects that would take too long to explain.
// EDIT: Source Code //
Function 1:
void GUITextLine::renderLeftShifted(const GUIRenderInfo& renderInfo) {
    if(m_renderLines.empty())
        return;
    Uint iL = 0;
    Array2t<float> renderCoords;
    renderCoords.s_x = renderInfo.s_offset.s_x + m_renderLines[0].s_x;
    renderCoords.s_y = renderInfo.s_offset.s_y + m_y;
    float remainingPixelsInLine = m_renderLines[0].s_y;
    for (Uint iTO = 0; iTO != m_text.size(); ++iTO)
    {
        if(m_text[iTO].s_pixelWidth <= remainingPixelsInLine)
        {
            string preview = m_text[iTO].s_string;
            m_text[iTO].render(&renderCoords);
            remainingPixelsInLine -= m_text[iTO].s_pixelWidth;
        }
        else
        {
            FSInternalGlyphData intData = m_text[iTO].stealFSFastFontInternalData();
            float characterWidth = 0;
            Uint iFirstCharacterOfRenderLine = 0;
            for(Uint iC = 0;;++iC)
            {
                if(iC == m_text[iTO].s_string.size())
                {
                    // wrap up
                    string renderPart = m_text[iTO].s_string;
                    renderPart.erase(iC, renderPart.size());
                    renderPart.erase(0, iFirstCharacterOfRenderLine);
                    m_text[iTO].s_font->renderString(renderPart.c_str(), intData,
                                                     &renderCoords);
                    break;
                }
                characterWidth += m_text[iTO].s_font->getWidthOfGlyph(intData,
                                                                      m_text[iTO].s_string[iC]);
                if(characterWidth > remainingPixelsInLine)
                {
                    // Can't push in the last character
                    // No more space in this line
                    // First though, render what we already have:
                    string renderPart = m_text[iTO].s_string;
                    renderPart.erase(iC, renderPart.size());
                    renderPart.erase(0, iFirstCharacterOfRenderLine);
                    m_text[iTO].s_font->renderString(renderPart.c_str(), intData,
                                                     &renderCoords);
                    if(++iL != m_renderLines.size())
                    {
                        remainingPixelsInLine = m_renderLines[iL].s_y;
                        renderCoords.s_x = renderInfo.s_offset.s_x + m_renderLines[iL].s_x;
                        // Cool, so now try rendering this character again
                        --iC;
                        iFirstCharacterOfRenderLine = iC;
                        characterWidth = 0;
                    }
                    else
                    {
                        // Quit
                        break;
                    }
                }
            }
        }
    }
    // Done!
}
Function 2:
vector<GUIText> GUITextLine::recalculateWrappingContraints_LeftShift()
{
    m_pixelsOfCharacters = 0;
    float pixelsRemaining = m_renderLines[0].s_y;
    Uint iRL = 0;
    // Go through every text object, fitting them into render lines
    for(Uint iTO = 0; iTO != m_text.size(); ++iTO)
    {
        // If an entire text object fits in a single line
        if(pixelsRemaining >= m_text[iTO].s_pixelWidth)
        {
            pixelsRemaining -= m_text[iTO].s_pixelWidth;
            m_pixelsOfCharacters += m_text[iTO].s_pixelWidth;
        }
        // Otherwise, character by character
        else
        {
            // Get some data now so we don't get it on every function call
            FSInternalGlyphData intData = m_text[iTO].stealFSFastFontInternalData();
            for(Uint iC = 0; iC != m_text[iTO].s_string.size(); ++iC)
            {
                float characterWidth = m_text[iTO].s_font->getWidthOfGlyph(intData, '-');
                if(characterWidth < pixelsRemaining)
                {
                    pixelsRemaining -= characterWidth;
                    m_pixelsOfCharacters += characterWidth;
                }
                else // End of render line!
                {
                    m_pixelsOfWrapperCharacters += pixelsRemaining; // we might track how much wrapping px we use
                    // If this is true, then we ran out of render lines before we ran out of text. Means we have some overflow to return
                    if(++iRL == m_renderLines.size())
                    {
                        return harvestOverflowFrom(iTO, iC);
                    }
                    else
                    {
                        pixelsRemaining = m_renderLines[iRL].s_y;
                    }
                }
            }
        }
    }
    vector<GUIText> emptyOverflow;
    return emptyOverflow;
}
So basically, render() takes renderCoordinates as a parameter and gets from it the global position of where it needs to render from. calcWrappingConstraints figures out how much text in the object goes over the allocated space, and returns that overflowing text from the function.
m_renderLines is a std::vector of a two-float structure, where .s_x = where rendering can start and .s_y = how large the space for rendering is - note, it's essentially the width of the 'renderLine', not where it ends.
m_text is a std::vector of GUIText objects, which contain a string of text and some data, like style, colour, size etc. It also contains, under s_font, a reference to a font object, which performs rendering, calculating the width of a glyph, etc.
Hopefully this clears things up.
There is no generally accepted way in this case.
However, common practice in any programming scenario is to remove duplicated code.
I think you're getting stuck on how to divide code by direction, when direction changes the outcome too much to make this division. In these cases, focus on the common portions of the three algorithms and divide them into tasks.
I did something similar when I duplicated WinForms flow layout control for MFC. I dealt with two types of objects: fixed positional (your pictures etc.) and auto positional (your words).
In the code you provided, I can list out the common portions:
Write Line (direction)
bool TestPlaceWord (direction) // returns false if it cannot place word next to previous word
bool WrapPastObject (direction) // returns false if it runs out of line
bool WrapLine (direction) // returns false if it runs out of space for new line.
Each of these would be performed no matter which direction you are dealing with; a rough sketch of this task split follows below.
Ultimately, the algorithm for each direction is just too different to simplify any more than that.
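As an illustration of that split (the types and bodies below are invented; only the function names mirror the list above), the direction becomes a parameter so the render pass and the overflow pass can share one flow:

#include <string>

// Invented types for the sketch; they are not the poster's real classes.
enum Direction { LeftShifted, RightShifted, Centred };

struct WordBox { std::string text; float pixelWidth; };

class LineFlow {
public:
    explicit LineFlow(float firstLineWidth) : m_remaining(firstLineWidth) {}

    // Returns false if the word cannot be placed next to the previous word.
    bool TestPlaceWord(Direction dir, const WordBox& w) {
        // A real implementation would measure differently for right/centred text;
        // the sketch only models the left-shifted case.
        (void)dir;
        if (w.pixelWidth > m_remaining) return false;
        m_remaining -= w.pixelWidth;
        return true;
    }

    // Returns false if it runs out of space for a new line.
    bool WrapLine(Direction dir, float nextLineWidth) {
        (void)dir;
        if (nextLineWidth <= 0) return false;
        m_remaining = nextLineWidth;
        return true;
    }

private:
    float m_remaining;
};

renderLeftShifted() and recalculateWrappingContraints_LeftShift() could then both drive the same helpers and only differ in what they do once a word (or partial word) has been placed - one draws it, the other just accumulates widths.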
How about an implementation of the Visitor Pattern? It sounds like it might be the kind of thing you are after.

Design Pattern For Making An Assembler

I'm making an 8051 assembler.
Before everything there is a tokenizer which reads the next token, sets error flags, recognizes EOF, etc.
Then there is the main loop of the compiler, which reads the next tokens and checks for valid mnemonics:
mnemonic = NextToken();
if (mnemonic.Error)
{
    // throw some error
}
else if (mnemonic.Text == "ADD")
{
    ...
}
else if (mnemonic.Text == "ADDC")
{
    ...
}
And it continues like this for several more mnemonics. Worse than that is the code inside each case, which checks for valid parameters and then converts them to compiled code. Right now it looks like this:
if (mnemonic.Text == "MOV")
{
    arg1 = NextToken();
    if (arg1.Error) { /* throw error */ break; }
    arg2 = NextToken();
    if (arg2.Error) { /* throw error */ break; }
    if (arg1.Text == "A")
    {
        if (arg2.Text == "B")
            output << 0x1234; // Example compiled code
        else if (arg2.Text == "#B")
            output << 0x5678; // Example compiled code
        else
            /* throw "Invalid parameters" */
    }
    else if (arg1.Text == "B")
    {
        if (arg2.Text == "A")
            output << 0x9ABC; // Example compiled code
        else if (arg2.Text == "#A")
            output << 0x0DEF; // Example compiled code
        else
            /* throw "Invalid parameters" */
    }
}
For each of the mnemonics I have to check for valid parameters and then create the correct compiled code. Very similar code for checking the valid parameters repeats in each mnemonic's case.
So is there a design pattern for improving this code?
Or simply a simpler way to implement this?
Edit: I accepted plinth's answer, thanks to him. Still, if you have ideas on this, I will be happy to learn about them. Thanks all.
I've written a number of assemblers over the years doing hand parsing and frankly, you're probably better off using a grammar language and a parser generator.
Here's why - a typical assembly line will probably look something like this:
[label:] [instruction|directive][newline]
and an instruction will be:
plain-mnemonic|mnemonic-withargs
and a directive will be:
plain-directive|directive-withargs
etc.
With a decent parser generator like Gold, you should be able to knock out a grammar for 8051 in a few hours. The advantage to this over hand parsing is that you will be able to have complicated enough expressions in your assembly code like:
.define kMagicNumber 0xdeadbeef
CMPA #(2 * kMagicNumber + 1)
which can be a real bear to do by hand.
If you want to do it by hand, make a table of all your mnemonics that also includes the various addressing modes each one supports and, for each addressing mode, the number of bytes that variant will take and its opcode. Something like this:
enum AddressingMode {
    Implied = 1, Direct = 2, Extended = 4, Indexed = 8 // etc
};
/* for a 4 char mnemonic, this struct will be 5 bytes. A typical small processor
 * has on the order of 100 instructions, making this table come in at ~500 bytes when all
 * is said and done.
 * The time to binary search that will be, worst case, 8 compares on the mnemonic.
 * I claim that I/O will take way more time than look up.
 * You will also need a table and/or a routine that, given a mnemonic and addressing mode,
 * will give you the actual opcode.
 */
struct InstructionInfo {
    char Mnemonic[4];
    char AddressingMode;
};
/* order them by mnemonic */
static InstructionInfo instrs[] = {
    { {'A', 'D', 'D', '\0'}, Direct|Extended|Indexed },
    { {'A', 'D', 'D', 'A'},  Direct|Extended|Indexed },
    { {'S', 'U', 'B', '\0'}, Direct|Extended|Indexed },
    { {'S', 'U', 'B', 'A'},  Direct|Extended|Indexed }
}; /* etc */
static int nInstrs = sizeof(instrs)/sizeof(InstructionInfo);

InstructionInfo *GetInstruction(char *mnemonic) {
    /* binary search for mnemonic */
}

int InstructionSize(AddressingMode mode)
{
    switch (mode) {
    case Implied: return 1;
    /* etc */
    }
}
Then you will have a list of every instruction which in turn contains a list of all the addressing modes.
So your parser becomes something like this:
char *line = ReadLine();
int nextStart = 0;
int labelLen;
char *label = GetLabel(line, &labelLen, nextStart, &nextStart); // may be empty
int mnemonicLen;
char *mnemonic = GetMnemonic(line, &mnemonicLen, nextStart, &nextStart); // may be empty
if (IsOpcode(mnemonic, mnemonicLen)) {
    AddressingModeInfo info = GetAddressingModeInfo(line, nextStart, &nextStart);
    if (IsValidInstruction(mnemonic, info)) {
        GenerateCode(mnemonic, info);
    }
    else throw new BadInstructionException(mnemonic, info);
}
else if (IsDirective()) { /* etc. */ }
Yes. Most assemblers use a table of data which describes the instructions: mnemonic, opcode, operand forms, etc.
I suggest looking at the source code for as. I'm having some trouble finding it though. Look here. (Thanks to Hossein.)
I think you should look into the Visitor pattern. It might not make your code that much simpler, but will reduce coupling and increase reusability. SableCC is a java framework to build compilers that uses it extensively.
When I was playing with a Microcode emulator tool, I converted everything into descendants of an Instruction class. From Instruction were category classes, such as Arithmetic_Instruction and Branch_Instruction. I used a factory pattern to create the instances.
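A bare-bones sketch of that kind of layout (the class names and the single opcode below are illustrative, not from the original tool):

#include <string>
#include <vector>

// Base class: every instruction knows how to emit its own machine code.
class Instruction {
public:
    virtual ~Instruction() {}
    virtual void emit(std::vector<unsigned char>& out) const = 0;
};

// Category classes group behaviour shared by related opcodes.
class Arithmetic_Instruction : public Instruction {};
class Branch_Instruction     : public Instruction {};

class AddImmediate : public Arithmetic_Instruction {
public:
    void emit(std::vector<unsigned char>& out) const { out.push_back(0x24); } // ADD A,#data on the 8051
};

// A very small factory: given the parsed mnemonic, hand back the matching object.
Instruction* createInstruction(const std::string& mnemonic) {
    if (mnemonic == "ADD") return new AddImmediate();
    // ... one branch (or a lookup table) per supported mnemonic ...
    return 0; // unknown mnemonic
}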
Your best bet may be to get hold of the assembly language syntax specification. Write a lexer to convert it to tokens (please, don't use if-elseif-else ladders). Then, based on semantics, issue the code.
A long time ago, assemblers were a minimum of two passes: the first to resolve constants and form the skeletal code (including symbol tables); the second to generate more concrete or absolute values.
Have you read the Dragon Book lately?
Have you looked at the "Command Dispatcher" pattern?
http://en.wikipedia.org/wiki/Command_pattern
The general idea would be to create an object that handles each instruction (command), and create a look-up table that maps each instruction to the handler class. Each command class would have a common interface (Command.Execute( *args ) for example) which would definitely give you a cleaner / more flexible design than your current enormous switch statement.
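A compact sketch of that dispatcher idea (Token, Command and MovCommand are illustrative names, not taken from the question's code):

#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Token { std::string Text; };

class Command {
public:
    virtual ~Command() {}
    // Common interface: each handler validates its own operands and emits code.
    virtual void Execute(const std::vector<Token>& args, std::ostream& output) = 0;
};

class MovCommand : public Command {
public:
    void Execute(const std::vector<Token>& args, std::ostream& output) {
        // validate args[0] / args[1] and emit the opcode, as in the MOV case above
        if (args.size() == 2 && args[0].Text == "A" && args[1].Text == "B")
            output << 0x1234;   // example compiled code
        // else: report "Invalid parameters"
    }
};

int main() {
    // Look-up table: mnemonic -> handler object, replacing the if/else ladder.
    std::map<std::string, Command*> handlers;
    handlers["MOV"] = new MovCommand();
    // handlers["ADD"] = new AddCommand(); ...

    std::vector<Token> args(2);
    args[0].Text = "A"; args[1].Text = "B";

    std::map<std::string, Command*>::iterator it = handlers.find("MOV");
    if (it != handlers.end())
        it->second->Execute(args, std::cout);
    // else: unknown mnemonic -> report an error

    for (it = handlers.begin(); it != handlers.end(); ++it)
        delete it->second;      // the sketch owns its handlers
}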

Checking lists and running handlers

I find myself writing code that looks like this a lot:
set<int> affected_items;
while (string code = GetKeyCodeFromSomewhere())
{
    if (code == "some constant" || code == "some other constant") {
        affected_items.insert(some_constant_id);
    } else if (code == "yet another constant" || code == "the constant I didn't mention yet") {
        affected_items.insert(some_other_constant_id);
    } // else if etc...
}
for (set<int>::iterator it = affected_items.begin(); it != affected_items.end(); it++)
{
    switch(*it)
    {
        case some_constant_id:
            RunSomeFunction(with, these, params);
            break;
        case some_other_constant_id:
            RunSomeOtherFunction(with, these, other, params);
            break;
        // etc...
    }
}
The reason I end up writing this code is that I need to only run the functions in the second loop once even if I've received multiple key codes that might cause them to run.
This just doesn't seem like the best way to do it. Is there a neater way?
One approach is to maintain a map from strings to booleans. The main logic can start with something like:
if(done[code])
continue;
done[code] = true;
Then you can perform the appropriate action as soon as you identify the code.
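Slightly fleshed out (still using the question's placeholder names, and a hypothetical GetKeyCodeFromSomewhere(code) that fills its argument, so this is a fragment rather than a complete program):

std::map<std::string, bool> done;   // which codes have already been handled
std::string code;
while (GetKeyCodeFromSomewhere(code))            // however the codes actually arrive
{
    if (done[code])
        continue;                                // already handled this code once
    done[code] = true;
    if (code == "some constant" || code == "some other constant")
        RunSomeFunction(with, these, params);    // act immediately; no second loop
    else if (code == "yet another constant" || code == "the constant I didn't mention yet")
        RunSomeOtherFunction(with, these, other, params);
}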
Another approach is to store something executable (object, function pointer, whatever) into a sort of "to do list." For example:
while (string code = GetKeyCodeFromSomewhere())
{
todo[code] = codefor[code];
}
Initialize codefor to contain the appropriate function pointer, or object subclassed from a common base class, for each code value. If the same code shows up more than once, the appropriate entry in todo will just get overwritten with the same value that it already had. At the end, iterate over todo and run all of its members.
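Spelled out as a self-contained sketch (the two Run... functions and the hard-coded incoming codes are stand-ins for the question's placeholders):

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Stand-ins for the question's placeholder actions, just to keep the sketch self-contained.
void RunSomeFunction()      { std::cout << "first action\n"; }
void RunSomeOtherFunction() { std::cout << "second action\n"; }

typedef void (*Action)();

int main() {
    // codefor: code value -> the action it should trigger.
    std::map<std::string, Action> codefor;
    codefor["some constant"]                     = RunSomeFunction;
    codefor["some other constant"]               = RunSomeFunction;
    codefor["yet another constant"]              = RunSomeOtherFunction;
    codefor["the constant I didn't mention yet"] = RunSomeOtherFunction;

    // todo: filled while reading key codes; a repeated code just overwrites its own entry.
    std::map<std::string, Action> todo;
    std::vector<std::string> incoming;               // stands in for GetKeyCodeFromSomewhere()
    incoming.push_back("some constant");
    incoming.push_back("some constant");
    incoming.push_back("yet another constant");
    for (std::size_t i = 0; i < incoming.size(); ++i)
        todo[incoming[i]] = codefor[incoming[i]];

    // Run everything that ended up on the to-do list - once per distinct code seen.
    for (std::map<std::string, Action>::iterator it = todo.begin(); it != todo.end(); ++it)
        it->second();
}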
Since you don't seem to care about the actual values in the set, you could replace it with setting bits in an int. You can also replace the linear-time search logic with log-time search logic. Here's the final code:
// Ahead of time you build a static map from your strings to bit values.
std::map< std::string, int > codesToValues;
codesToValues[ "some constant" ] = 1;
codesToValues[ "some other constant" ] = 1;
codesToValues[ "yet another constant" ] = 2;
codesToValues[ "the constant I didn't mention yet" ] = 2;
// When you want to do your work
int affected_items = 0;
while (string code = GetKeyCodeFromSomewhere())
affected_items |= codesToValues[ code ];
if( affected_items & 1 )
RunSomeFunction(with, these, params);
if( affected_items & 2 )
RunSomeOtherFunction(with, these, other, params);
// etc...
It's certainly not neater, but you could maintain a set of flags that say whether you've called that specific function or not. That way you avoid having to save things off in a set; you just have the flags.
Since there is (presumably, from the way it is written) a number of different if/else blocks that is fixed at compile time, you can do this pretty easily with a bitset.
Obviously, it will depend on the specific circumstances, but it might be better to have the functions that you call keep track of whether they've already been run and exit early if required.