C++ - strcmp() does not work correctly? - c++

There's something really weird going on: strcmp() returns -1 though both strings are exactly the same. Here is a snippet from the output of the debugger (gdb):
(gdb) print s[i][0] == grammar->symbols_from_int[107][0]
$36 = true
(gdb) print s[i][1] == grammar->symbols_from_int[107][1]
$37 = true
(gdb) print s[i][2] == grammar->symbols_from_int[107][2]
$38 = true
(gdb) print s[i][3] == grammar->symbols_from_int[107][3]
$39 = true
(gdb) print s[i][4] == grammar->symbols_from_int[107][4]
$40 = true
(gdb) print s[i][5] == grammar->symbols_from_int[107][5]
$41 = false
(gdb) print grammar->symbols_from_int[107][4]
$42 = 0 '\0'
(gdb) print s[i]
$43 = (char * const&) #0x202dc50: 0x202d730 "Does"
(gdb) print grammar->symbols_from_int[107]
$44 = (char * const&) #0x1c9fb08: 0x1c9a062 "Does"
(gdb) print strcmp(s[i],grammar->symbols_from_int[107])
$45 = -1
Any idea what's going on?
Thanks in advance,
Onur
Edit 1:
Here are some snippets of my code:
# include <unordered_map> // Used as hash table
# include <stdlib.h>
# include <string.h>
# include <stdio.h>
# include <vector>
using namespace std;
using std::unordered_map;
using std::hash;
struct eqstr
{
bool operator()(const char* s1, const char* s2) const
{
return strcmp(s1, s2) == 0;
}
};
...
<some other code>
...
class BPCFG {
public:
char *symbols; // Character array holding all grammar symbols, with NULL seperating them.
char *rules; // Character array holding all rules, with NULL seperating them.
unordered_map<char *, int , hash<char *> , eqstr> int_from_symbols; // Hash table holding the grammar symbols and their integer indices as key/value pairs.
...
<some other code>
...
vector<char *> symbols_from_int; // Hash table holding the integer indices and their corresponding grammar symbols as key/value pairs.
void load_symbols_from_file(const char *symbols_file);
}
void BPCFG::load_symbols_from_file(const char *symbols_file) {
char buffer[200];
FILE *input = fopen(symbols_file, "r");
int symbol_index = 0;
while(fscanf(input, "%s", buffer) > 0) {
if(buffer[0] == '/')
strcpy(symbols + symbol_index, buffer+1);
else
strcpy(symbols + symbol_index, buffer);
symbols_from_int.push_back(symbols + symbol_index);
int_from_symbols[symbols+symbol_index] = symbols_from_int.size()-1;
probs.push_back(vector<double>());
hyperprobs.push_back(vector<double>());
rules_from_IntPair.push_back(vector<char *>());
symbol_index += strlen(symbols+symbol_index) + 1;
}
fclose(input);
}
This last function (BPCFG::load_symbols_from_file) seems to be the only function I modify symbols_from_int in my whole code. Please tell me if you need some more code. I'm not putting everything because it's hundreds of lines.
Edit 2:
OK, I think I should add one more thing from my code. This is the constructor of BPCFG class:
BPCFG(int symbols_length, int rules_length, int symbol_count, int rule_count):
int_from_symbols(1.5*symbol_count),
IntPair_from_rules(1.5*rule_count),
symbol_after_dot(10*rule_count)
{
symbols = (char *)malloc(symbols_length*sizeof(char));
rules = (char *)malloc(rules_length*sizeof(char));
}
Edit 3:
Here is the code on the path to the point of error. It's not compilable, but it shows where the code stepped through (I checked with next and step commands in the debugger that the code indeed follows this route):
BPCFG my_grammar(2000, 5500, 194, 187);
my_grammar.load_symbols_from_file("random_50_1_words_symbols.txt");
<some irrelevant code>
my_grammar.load_rules_from_file("random_50_1_words_grammar.txt", true);
<some irrelevant code>
my_grammar.load_symbols_after_dots();
BPCFGParser my_parser(&my_grammar);
BPCFGParser::Sentence s;
// (Sentence is defined in the BPCFGParser class with
// typedef vector<char *> Sentence;)
Edge e;
try {
my_parser.parse(s, e);
}
catch(char *e) {fprintf(stderr, "%s", e);}
void BPCFGParser::parse(const Sentence & s, Edge & goal_edge) {
/* Initializing the chart */
chart::active_sets.clear();
chart::passive_sets.clear();
chart::active_sets.resize(s.size());
chart::passive_sets.resize(s.size());
// initialize(sentence, goal);
try {
initialize(s, goal_edge);
}
catch (char *e) {
if(strcmp(e, UNKNOWN_WORD) == 0)
throw e;
}
<Does something more, but the execution does not come to this point>
}
void BPCFGParser::initialize(const Sentence & s, Edge & goal_edge) {
// create a new chart and new agendas
/* For now, we plan to do this during constructing the BPCFGParser object */
// for each word w:[start,end] in the sentence
// discoverEdge(w:[start,end])
Edge temp_edge;
for(int i = 0;i < s.size();i++) {
temp_edge.span.start = i;
temp_edge.span.end = i+1;
temp_edge.isActive = false;
/* Checking whether the given word is ever seen in the training corpus */
unordered_map<char *, int , hash<char *> , eqstr>::const_iterator it = grammar->int_from_symbols.find(s[i]);
if(it == grammar->int_from_symbols.end())
throw UNKNOWN_WORD;
<Does something more, but execution does not come to this point>
}
}
Where I run the print commands in the debugger is the last
throw UNKNOWN_WORD;
command. I mean, I was stepping with next on GDB and after seeing this line, I ran all these print commands.
Thank you for your interest,
Onur
OK, I think I should add one more thing from my code. This is the constructor of BPCFG class:
BPCFG(int symbols_length, int rules_length, int symbol_count, int rule_count):
int_from_symbols(1.5*symbol_count),
IntPair_from_rules(1.5*rule_count),
symbol_after_dot(10*rule_count)
{
symbols = (char *)malloc(symbols_length*sizeof(char));
rules = (char *)malloc(rules_length*sizeof(char));
}

This sounds like s is a pointer to an array that was on the stack which is overwritten as soon as a new function is called, ie strcmp()
What does the debugger say they are after the strcmp() call?

In recent Linux distributions strcmp is a symbol of type STT_GNU_IFUNC. This is not supported in the last release of GDB (7.2 at the time of writing). That may be the cause of your problem, although in your case the return value looks genuine.

Given that GDB output the only possible cause I can see is that strcmp() is bugged.
You basically did in GDB what strcmp does: compare character per character, until both are zeros (at 4).
Can you try print strcmp("Does", "Does"); ?
EDIT: also try:
print stricmp(s[i], grammar->symbols_from_int[107], 1);
print stricmp(s[i], grammar->symbols_from_int[107], 2);
print stricmp(s[i], grammar->symbols_from_int[107], 3);
print stricmp(s[i], grammar->symbols_from_int[107], 4);
print stricmp(s[i], grammar->symbols_from_int[107], 5);

The only way you'll ever figure this out is to step INTO strcmp with the debugger.

I would strongly suggest you zero out the memory before you start using it. I realize that the GDB output makes no sense because you do verify it's a null terminated strings, but I've had a lot of string.h bizarre problems go away with a memset, bzero, calloc or whatever you want to use.
Specifically, zero out the memory in the constructor and the buffer you use when reading from the file.

As others have pointed out, it's near to impossible that there would be a problem with strcmp.
It would be best to strip down the problematic code to the absolute minimum required to reproduce the problem (also include instructions on how to compile - which compiler, flags, runtime library are you using?). Most likely you will find a bug in the process.
If not, you will receive a lot of credit for finding a bug in one of the most intensively used C functions ;-)

May be size of the two string are not same.

Does strlen(s[i]) and strlen(grammar->symbols_from_int[107]) return the same?
Also, I can't imagine this is the problem but could you use a constant value rather than i just to make sure that something weird is not going?

I would strongly recommend trying to solve the problem with regular STL strings first - you'll get cleaner code and automated memory management so you can concentrate on the parser logic. Only after the thing is working and profiling proves that string manipulation is the performance bottleneck I would look at all the algorithms used in greater detail, then at specialized string allocators, and - as the last resort - at manual character array manipulation, in that order.

Thanks to all spared time to answer. I wrote my own string comparison function and it worked correctly at the same point, so obviously it's a problem with strcmp(). Nevertheless, my code still does not work as I expect. But I'll request help for that only after I analyze it thoroughly. Thanks.

Mark you own strcmp realization as inline and look what happens ...
From GCC 4.3 Release Series Changes, New Features, and Fixes for GCC 4.3.4 :
"During feedback directed optimizations, the expected block size the memcpy, memset and bzero functions operate on is discovered and for cases of commonly used small sizes, specialized inline code is generated."
May be some other related bugs exists ...
Try to switch compiler optimization or/and function inlining off.

Maybe there is a non-printable character in one of the strings?

Related

c_str() is only reading half of my string, why? How can I fix this? Is it a byte issue?

I am writing a client program and server program. On the server program, in order to write the result back to the client, I have to convert the string to const char* to put it in a const void* variable to use the write() function. The string itself is outputting the correct result when I checked, but when I use the c_str() function on the string, it is only outputting up until the first variable in the string. I am providing some code for reference (not sure if this is making any sense).
I have already tried all sorts of different ways to adjust the string, but nothing has worked yet.
Here are how the variables have been declared:
string final;
const void * fnlPrice;
carTable* table = new carTable[fileLength];
Here is the struct for the table:
struct carTable
{
string mm; // make and model
string hPrice; // high price
string lPrice; // low price
};
Here is a snipped of the code with the issue, starting with updating the string variable, final, with text as well as the resulting string variables:
final = "The high price for that car is $" + table[a].hPrice + "\nThe low
price for that car is $" + table[a].lPrice;;
if(found = true)
{
fnlPrice = final.c_str();
n = write(newsockfd,fnlPrice, 200);
if (n < 0)
{
error("ERROR writing to socket");
}
}
else
{
n = write(newsockfd, "That make and model is not in
the database. \n", 100);
if (n < 0)
{
error("ERROR writing to socket");
}
}
Unfortunately your code does not make any sense. And that may be your major problem. You should rewrite you code end eliminate the bugs.
Switch on all compiler warnings and eliminate the warnings.
Do not use new and pointers. Never
Do not use C-Style arrays. So, something with []. Never. Use STL containers
Always initialize all variables. Always. Even if you assign an other value in the next line
Do not use magic constants like 200 (The size of the string is final.size())
If an error happens then print the error text with strerror (or a compatible function)
Make sure that your array itself and the array values are initalized
To test your function, write to socket 1 (_write(1,fnlPrice,final.size()); 1 is equal to std::cout
There is no need to use the void pointer. You can use n = _write(newsockfd, final.c_str(), final.size()); directly
If you want a detailed answer here on SO then you need to post your compiled code. I have rewritten your function and tested it. It works for me and prints the complete string. So, there is a bug in an other part of your code that we cannot not see.

R G B element array swap

I'm trying to create this c++ program to perform the description below. I am pretty certain the issue is in the recursive, but uncertain how to fix it. I'm guessing it just keeps iterating through to infinity and crashes. I do not even get an output. I figured I could just compare the previous and current pointer and perform a 3-piece temp swap based on lexicography. I would use a pointer to iterate through the array and decrement it after each swap, then recursively call with that ptr as the parameter. Didn't work, I'm here, help me please :). If there is a simpler solution that would work too, but prefer to understand where I went wrong with this code.
#include <string>
#include <iostream>
using namespace std;
// Given an array of strictly the characters 'R', 'G', and
// 'B', segregate the values of the array so that all the
// Rs come first, the Gs come second, and the Bs come last.
// You can only swap elements of the array.
char* RGBorder(char* c_a)
{
size_t sz = sizeof(c_a)/sizeof(*c_a);
char* ptr_ca = c_a;
char* prv_ptr = ptr_ca;
ptr_ca++;
char temp;
while(*ptr_ca)
{
switch(*ptr_ca)
{
case 'R' :
if( *prv_ptr < *ptr_ca ) {
temp = *prv_ptr; *prv_ptr = *ptr_ca; *ptr_ca = temp;
} else if( *prv_ptr == *ptr_ca ) {
continue;
} else { ptr_ca--; RGBorder(ptr_ca); }
case 'G' :
if( *prv_ptr < *ptr_ca ) {
temp = *prv_ptr; *prv_ptr = *ptr_ca; *ptr_ca = temp;
} else if( *prv_ptr == *ptr_ca ) {
continue;
} else { ptr_ca--; RGBorder(ptr_ca); }
default:
ptr_ca++;
continue;
}
ptr_ca++;
cout << *ptr_ca;
}
return c_a;
}
int main()
{
char ca[] = {'G', 'B', 'R', 'R', 'B', 'R', 'G'};
char *oca =RGBorder(ca);
char *pca = oca;
while(*pca)
{
cout << *pca << endl;
pca++;
}
}
There are many issues with your code.
1) You call the function RGBorder with a character pointer, and then attempt to get the number of characters using this:
size_t sz = sizeof(c_a)/sizeof(*c_a);
This will not get you the number of characters. Instead this will simply get you the
sizeof(char *) / sizeof(char)
which is usually 4 or 8. The only way to call your function using a char array is either provide a null-terminated array (thus you can use strlen), or you have to pass the number of characters in the array as a separate argument:
char *RGBorder(char *c_a, int size)
2) I didn't go through your code, but there are easier ways to do a 3-way partition in an array. One popular algorithm to do this is one based on the Dutch National Flag problem.
Since you want the array in RGB order, you know that the series of G will always come in the middle (somewhere) of the sequence, with R on the left of the sequence, and B always on the right of the sequence.
So the goal is to simply swap R to the left of the middle, and B to the right of the middle. So basically you want a loop that incrementally changes the "middle" when needed, while swapping R's and B's to their appropriate position when they're detected.
The following code illustrates this:
#include <algorithm>
char *RGBorder(char *c_a, int num)
{
int middle = 0; // assume we only want the middle element
int low = 0; // before the G's
int high = num - 1; // after the G's
while (middle <= high)
{
if ( c_a[middle] == 'R' ) // if we see an 'R' in the middle, it needs to go before the middle
{
std::swap(c_a[middle], c_a[low]); // swap it to a place before middle
++middle; // middle has creeped up one spot
++low; // so has the point where we will swap when we do this again
}
else
if (c_a[middle] == 'B') // if we see a 'B' as the middle element, it needs to go after the middle
{
std::swap(c_a[middle], c_a[high]); // place it as far back as you can
--high; // decrease the back position for next swap that comes here
}
else
++middle; // it is a 'G', do nothing
}
return c_a;
}
Live Example
Here is another solution that uses std::partition.
#include <algorithm>
#include <iostream>
char *RGBorder(char *c_a, int num)
{
auto iter = std::partition(c_a, c_a + num, [](char ch) {return ch == 'R';});
std::partition(iter, c_a + num, [](char ch) {return ch == 'G';});
return c_a;
}
Live Example
Basically, the first call to std::partition places the R's to the front of the array. Since std::partition returns an iterator (in this case, a char *) to the end of where the partition occurs, we use that as a starting position in the second call to std::partition, where we partition the G values.
Note that std::partition also accomplishes its goal by swapping.
Given this solution, we can generalize this for an n-way partition by using a loop. Assume we want to place things in RGBA order (4 values instead of 3).
#include <algorithm>
#include <iostream>
#include <cstring>
char *RGBorder(char *c_a, int num, char *order, int num2)
{
auto iter = c_a;
for (int i = 0; i < num2 - 1; ++i)
iter = std::partition(iter, c_a + num, [&](char ch) {return ch == order[i];});
return c_a;
}
int main()
{
char ca[] = "AGBRRBARGGARRBGAGRARAA";
std::cout << RGBorder(ca, strlen(ca), "RGBA", 4);
}
Output:
RRRRRRRGGGGGBBBAAAAAAA
Sorry to put it blunt, but that code is a mess. And I don't mean the mistakes, those are forgivable for beginners. I mean the formatting. Multiple statements in one line make it super hard to read and debug the code. Short variable names that carry no immediate intrinsic meaning make it hard to understand what the code is supposed to do. using namespace std; is very bad practise as well, but I can imagine you were taught to do this by whoever gives that course.
1st problem
Your cases don't break, thus you execute all cases for R, and both G and default for G. Also your code will never reach the last 2 lines of your loop, as you continue out before in every case.
2nd problem
You have an endless loop. In both cases you have two situations where you'll end up in an endless loop:
In the else if( *prv_ptr == *ptr_ca ) branch you simply continue; without changing the pointer.
In the else branch you do ptr_ca--;, but then in default you call ptr_ca++; again.(Note that even with breaks you would still call ptr_ca++; at the end of the loop.)
In both cases the pointer doesn't change, so once you end up in any of those conditions your loop will never exit.
Possible 3rd problem
I can only guess, because it is not apparent from the name, but it seems that prv_ptr is supposed to hold whatever was the last pointer in the loop? If so, it seems wrong that you don't update that pointer, ever. Either way, proper variable names would've made it more clear what the purpose of this pointer is exactly. (On a side note, consistent usage of const can help identify such issues. If you have a variable that is not const, but never gets updated, you either forgot to add const or forgot to update it.)
How to fix
Format your code:
Don't use using namespace std;.
One statement per line.
Give your variables proper names, so it's easy to identify what is what. (This is not 1993, really, I'd rather have a thisIsThePointerHoldingTheCharacterThatDoesTheThing than ptr_xy.)
Fix the aforementioned issues (add breaks, make sure your loop actually exits).
Then debug your code. With a debugger. While it runs. With breakpoints and stepping through line by line, inspecting the values of your pointers as the code executes. Fancy stuff.
Good luck!
just count the number of 'R', 'G' and 'B' letters and fill the array from scratch.
much easier, no recursions.

Split a even-numbered string in c++

I am very new to c++. I am trying to split a string that contains even numbered sub strings till there is no even numbered sub string left. For example, if I input AB ABCD ABC, the output should be A B A B C D ABC. I am trying to do it without tokens, because I don't know how to..
What I have so far only split the first even sub string and it doesn't work if I only have 1 sub string. Can someone please help me out?
Any advise will be much appreciated. Thank you!
string temp = "";
void check(string &str, int &i, int &flag)
{
int count = 0;
int reminder;
do
{
count++;
temp += str[i];
i++;
} while (str[i] != ' ');
i = i - temp.size();
reminder = count % 2;
if (reminder == 0)
flag = 1;
else
flag = 0;
}
void SplitEvenWord(string &str)
{
int i = 0;
int flag = 0;
for (i = 0; i < str.size(); i++)
{
check(str, i, flag);
if (flag == 1)
{
temp.insert(temp.size() / 2, " ");
str.replace(i, temp.size() - 1, temp);
}
}
}
There are two skills that are absolutely vital in software engineering (Well, more than two, but two for now): developing new functions in isolation, and testing things in the simplest possible way.
You say that the code fails if there is only one substring. You don't say how it fails (I should have mentioned clear error reports in the list) so I don't know whether to test your code with an even-length string which it ought to split ("ABCD" => "A B C D") or an odd-length string which it ought to leave alone ("ABC" => "ABC"). Before I try to code these up, I look at your first function:
void check(string &str, int &i, int &flag)
{
...
do
{
count++;
temp += str[i];
i++;
} while (str[i] != ' ');
...
}
Trouble already. The strings I have in mind do not contain any spaces, so the loop cannot terminate. This code will run past the end of the string into whatever happens to be in that memory space, which will cause undefined behavior. (If you don't know that term, it means that there's no telling what will happen, but if you're lucky the program will just crash.)
Fix that, try running that code on "ABC" and "ABCD" and "A" and "" and "ABC DEF", and get it working perfectly. Once it does, take a look at your other function. Don't test it with random typing, test it with short, clearly defined strings. Once it works perfectly, try longer, more complicated ones. If you find a string which causes it to fail, hold onto it! That string will lead you to a bug.
That should be enough to get you started.
I'm writing this as an answer because it was too long to fit as a comment.
I have a couple of suggestions that may help you to figure out what the problem is.
Separate "check" into at least two functions, one to split the string into individual words and check them and one to check the length of the string.
Test the "check" and "tokenize" functions by separately and see if they give you the expected answers. Work on them individually until they are correct.
Separate the formatting of the answers out of "SplitEvenWord" into a separate function.
"SplitEvenWord" should then be nothing more than calling the functions you created as a result of the steps above.
When I'm stuck, I always try to break the problem down into small bite sized pieces that I know I can get working. Eventually, the problem becomes assembling the already working pieces of the solution into a larger function that solves the original problem.

Design Pattern For Making An Assembler

I'm making an 8051 assembler.
Before everything is a tokenizer which reads next tokens, sets error flags, recognizes EOF, etc.
Then there is the main loop of the compiler, which reads next tokens and check for valid mnemonics:
mnemonic= NextToken();
if (mnemonic.Error)
{
//throw some error
}
else if (mnemonic.Text == "ADD")
{
...
}
else if (mnemonic.Text == "ADDC")
{
...
}
And it continues to several cases. Worse than that is the code inside each case, which checks for valid parameters then converts it to compiled code. Right now it looks like this:
if (mnemonic.Text == "MOV")
{
arg1 = NextToken();
if (arg1.Error) { /* throw error */ break; }
arg2 = NextToken();
if (arg2.Error) { /* throw error */ break; }
if (arg1.Text == "A")
{
if (arg2.Text == "B")
output << 0x1234; //Example compiled code
else if (arg2.Text == "#B")
output << 0x5678; //Example compiled code
else
/* throw "Invalid parameters" */
}
else if (arg1.Text == "B")
{
if (arg2.Text == "A")
output << 0x9ABC; //Example compiled code
else if (arg2.Text == "#A")
output << 0x0DEF; //Example compiled code
else
/* throw "Invalid parameters" */
}
}
For each of the mnemonics I have to check for valid parameters then create the correct compiled code. Very similar codes for checking the valid parameters for each mnemonic repeat in each case.
So is there a design pattern for improving this code?
Or simply a simpler way to implement this?
Edit: I accepted plinth's answer, thanks to him. Still if you have ideas on this, i will be happy to learn them. Thanks all.
I've written a number of assemblers over the years doing hand parsing and frankly, you're probably better off using a grammar language and a parser generator.
Here's why - a typical assembly line will probably look something like this:
[label:] [instruction|directive][newline]
and an instruction will be:
plain-mnemonic|mnemonic-withargs
and a directive will be:
plain-directive|directive-withargs
etc.
With a decent parser generator like Gold, you should be able to knock out a grammar for 8051 in a few hours. The advantage to this over hand parsing is that you will be able to have complicated enough expressions in your assembly code like:
.define kMagicNumber 0xdeadbeef
CMPA #(2 * kMagicNumber + 1)
which can be a real bear to do by hand.
If you want to do it by hand, make a table of all your mnemonics which will also include the various allowable addressing modes that they support and for each addressing mode, the number of bytes that each variant will take and the opcode for it. Something like this:
enum {
Implied = 1, Direct = 2, Extended = 4, Indexed = 8 // etc
} AddressingMode;
/* for a 4 char mnemonic, this struct will be 5 bytes. A typical small processor
* has on the order of 100 instructions, making this table come in at ~500 bytes when all
* is said and done.
* The time to binary search that will be, worst case 8 compares on the mnemonic.
* I claim that I/O will take way more time than look up.
* You will also need a table and/or a routine that given a mnemonic and addressing mode
* will give you the actual opcode.
*/
struct InstructionInfo {
char Mnemonic[4];
char AddessingMode;
}
/* order them by mnemonic */
static InstructionInfo instrs[] = {
{ {'A', 'D', 'D', '\0'}, Direct|Extended|Indexed },
{ {'A', 'D', 'D', 'A'}, Direct|Extended|Indexed },
{ {'S', 'U', 'B', '\0'}, Direct|Extended|Indexed },
{ {'S', 'U', 'B', 'A'}, Direct|Extended|Indexed }
}; /* etc */
static int nInstrs = sizeof(instrs)/sizeof(InstrcutionInfo);
InstructionInfo *GetInstruction(char *mnemonic) {
/* binary search for mnemonic */
}
int InstructionSize(AddressingMode mode)
{
switch (mode) {
case Inplied: return 1;
/ * etc */
}
}
Then you will have a list of every instruction which in turn contains a list of all the addressing modes.
So your parser becomes something like this:
char *line = ReadLine();
int nextStart = 0;
int labelLen;
char *label = GetLabel(line, &labelLen, nextStart, &nextStart); // may be empty
int mnemonicLen;
char *mnemonic = GetMnemonic(line, &mnemonicLen, nextStart, &nextStart); // may be empty
if (IsOpcode(mnemonic, mnemonicLen)) {
AddressingModeInfo info = GetAddressingModeInfo(line, nextStart, &nextStart);
if (IsValidInstruction(mnemonic, info)) {
GenerateCode(mnemonic, info);
}
else throw new BadInstructionException(mnemonic, info);
}
else if (IsDirective()) { /* etc. */ }
Yes. Most assemblers use a table of data which describes the instructions: mnemonic, op code, operands forms etc.
I suggest looking at the source code for as. I'm having some trouble finding it though. Look here. (Thanks to Hossein.)
I think you should look into the Visitor pattern. It might not make your code that much simpler, but will reduce coupling and increase reusability. SableCC is a java framework to build compilers that uses it extensively.
When I was playing with a Microcode emulator tool, I converted everything into descendants of an Instruction class. From Instruction were category classes, such as Arithmetic_Instruction and Branch_Instruction. I used a factory pattern to create the instances.
Your best bet may be to get a hold of the assembly language syntax specification. Write a lexer to convert to tokens (**please, don't use if-elseif-else ladders). Then based on semantics, issue the code.
Long time ago, assemblers were a minimum of two passes: The first to resolve constants and form the skeletal code (including symbol tables). The second pass was to generate more concrete or absolute values.
Have you read the Dragon Book lately?
Have you looked at the "Command Dispatcher" pattern?
http://en.wikipedia.org/wiki/Command_pattern
The general idea would be to create an object that handles each instruction (command), and create a look-up table that maps each instruction to the handler class. Each command class would have a common interface (Command.Execute( *args ) for example) which would definitely give you a cleaner / more flexible design than your current enormous switch statement.

Char* vs String Speed in C++

I have a C++ program that will read in data from a binary file and originally I stored data in std::vector<char*> data. I have changed my code so that I am now using strings instead of char*, so that std::vector<std::string> data. Some changes I had to make was to change from strcmp to compare for example.
However I have seen my execution time dramatically increase. For a sample file, when I used char* it took 0.38s and after the conversion to string it took 1.72s on my Linux machine. I observed a similar problem on my Windows machine with execution time increasing from 0.59s to 1.05s.
I believe this function is causing the slow down. It is part of the converter class, note private variables designated with_ at the end of variable name. I clearly am having memory problems here and stuck in between C and C++ code. I want this to be C++ code, so I updated the code at the bottom.
I access ids_ and names_ many times in another function too, so access speed is very important. Through the use of creating a map instead of two separate vectors, I have been able to achieve faster speeds with more stable C++ code. Thanks to everyone!
Example NewList.Txt
2515 ABC 23.5 32 -99 1875.7 1
1676 XYZ 12.5 31 -97 530.82 2
279 FOO 45.5 31 -96 530.8 3
OLD Code:
void converter::updateNewList(){
FILE* NewList;
char lineBuffer[100];
char* id = 0;
char* name = 0;
int l = 0;
int n;
NewList = fopen("NewList.txt","r");
if (NewList == NULL){
std::cerr << "Error in reading NewList.txt\n";
exit(EXIT_FAILURE);
}
while(!feof(NewList)){
fgets (lineBuffer , 100 , NewList); // Read line
l = 0;
while (!isspace(lineBuffer[l])){
l = l + 1;
}
id = new char[l];
switch (l){
case 1:
n = sprintf (id, "%c", lineBuffer[0]);
break;
case 2:
n = sprintf (id, "%c%c", lineBuffer[0], lineBuffer[1]);
break;
case 3:
n = sprintf (id, "%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2]);
break;
case 4:
n = sprintf (id, "%c%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2],lineBuffer[3]);
break;
default:
n = -1;
break;
}
if (n < 0){
std::cerr << "Error in processing ids from NewList.txt\n";
exit(EXIT_FAILURE);
}
l = l + 1;
int s = l;
while (!isspace(lineBuffer[l])){
l = l + 1;
}
name = new char[l-s];
switch (l-s){
case 2:
n = sprintf (name, "%c%c", lineBuffer[s+0], lineBuffer[s+1]);
break;
case 3:
n = sprintf (name, "%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2]);
break;
case 4:
n = sprintf (name, "%c%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2],lineBuffer[s+3]);
break;
default:
n = -1;
break;
}
if (n < 0){
std::cerr << "Error in processing short name from NewList.txt\n";
exit(EXIT_FAILURE);
}
ids_.push_back ( std::string(id) );
names_.push_back(std::string(name));
}
bool isFound = false;
for (unsigned int i = 0; i < siteNames_.size(); i ++) {
isFound = false;
for (unsigned int j = 0; j < names_.size(); j ++) {
if (siteNames_[i].compare(names_[j]) == 0){
isFound = true;
}
}
}
fclose(NewList);
delete [] id;
delete [] name;
}
C++ CODE
void converter::updateNewList(){
std::ifstream NewList ("NewList.txt");
while(NewList.good()){
unsigned int id (0);
std::string name;
// get the ID and name
NewList >> id >> name;
// ignore the rest of the line
NewList.ignore( std::numeric_limits<std::streamsize>::max(), '\n');
info_.insert(std::pair<std::string, unsigned int>(name,id));
}
NewList.close();
}
UPDATE: Follow up question: Bottleneck from comparing strings and thanks for the very useful help! I will not be making these mistakes in the future!
My guess it that it should be tied to the vector<string>'s performance
About the vector
A std::vector works with an internal contiguous array, meaning that once the array is full, it needs to create another, larger array, and copy the strings one by one, which means a copy-construction and a destruction of string which had the same contents, which is counter-productive...
To confirm this easily, then use a std::vector<std::string *> and see if there is a difference in performance.
If this is the case, they you can do one of those four things:
if you know (or have a good idea) of the final size of the vector, use its method reserve() to reserve enough space in the internal array, to avoid useless reallocations.
use a std::deque, which works almost like a vector
use a std::list (which doesn't give you random access to its items)
use the std::vector<char *>
About the string
Note: I'm assuming that your strings\char * are created once, and not modified (through a realloc, an append, etc.).
If the ideas above are not enough, then...
The allocation of the string object's internal buffer is similar to a malloc of a char *, so you should see little or no differences between the two.
Now, if your char * are in truth char[SOME_CONSTANT_SIZE], then you avoid the malloc (and thus, will go faster than a std::string).
Edit
After reading the updated code, I see the following problems.
if ids_ and names_ are vectors, and if you have the slightest idea of the number of lines, then you should use reserve() on ids_ and and names_
consider making ids_ and names_ deque, or lists.
faaNames_ should be a std::map, or even a std::unordered_map (or whatever hash_map you have on your compiler). Your search currently is two for loops, which is quite costly and inneficient.
Consider comparing the length of the strings before comparing its contents. In C++, the length of a string (i.e. std::string::length()) is a zero cost operation)
Now, I don't know what you're doing with the isFound variable, but if you need to find only ONE true equality, then I guess you should work on the algorithm (I don't know if there is already one, see http://www.cplusplus.com/reference/algorithm/), but I believe this search could be made a lot more efficient just by thinking on it.
Other comments:
Forget the use of int for sizes and lengths in STL. At very least, use size_t. In 64-bit, size_t will become 64-bit, while int will remain 32-bits, so your code is not 64-bit ready (in the other hand, I see few cases of incoming 8 Go strings... but still, better be correct...)
Edit 2
The two (so called C and C++) codes are different. The "C code" expects ids and names of length lesser than 5, or the program exists with an error. The "C++ code" has no such limitation. Still, this limitation is ground for massive optimization, if you confirm names and ids are always less then 5 characters.
Before fixing something make sure that it is bottleneck. Otherwise you are wasting your time. Plus this sort of optimization is microoptimization. If you are doing microoptimization in C++ then consider using bare C.
Resize vector to large enough size before you start populating it. Or, use pointers to strings instead of strings.
The thing is that the strings are being copied each time the vector is being auto-resized. For small objects such as pointers this cost nearly nothing, but for strings the whole string is copied in full.
And id and name should be string instead of char*, and be initialized like this (assuming that you still use string instead of string*):
id = string(lineBuffer, lineBuffer + l);
...
name = string(lineBuffer + s, lineBuffer + s + l);
...
ids_.push_back(id);
names_.push_back(name);
Except for std::string, this is a C program.
Try using fstream, and use the profiler to detect the bottle neck.
You can try to reserve a number of vector values in order to reduce the number of allocations (which are costly), as said Dialecticus (probably from the ancient Roma?).
But there is something that may deserve some observation: how do you store the strings from the file, do you perform concatenations etc...
In C, strings (which do not exist per say - they don't have a container from a library like the STL) need more work to deal with, but at least we know what happens clearly when dealing with them. In the STL, each convenient operation (meaning requiring less work from the programmer) may actually require a lot of operations behind the scene, within the string class, depending on how you use it.
So, while the allocations / freeings are a costly process, the rest of the logic, especially the strings process, may / should probably be looked at as well.
I believe the main issue here is that your string version is copying things twice -- first into dynamically allocated char[] called name and id, and then into std::strings, while your vector<char *> version probably does not do that. To make the string version faster, you need to read directly into the strings and get rid of all the redundant copies
streams take care of a lot of the heavy lifting for you. Stop doing it all yourself, and let the library help you:
void converter::updateNewList(){
std::ifstream NewList ("NewList.txt");
while(NewList.good()){
int id (0);
std::string name;
// get the ID and name
NewList >> id >> name;
// ignore the rest of the line
NewList.ignore( numeric_limits<streamsize>::max(), '\n');
ids_.push_back (id);
names_.push_back(name);
}
NewList.close();
}
There's no need to do the whitespace-tokenizing manually.
Also, you may find this site a helpful reference:
http://www.cplusplus.com/reference/iostream/ifstream/
You can use a profiler to find out where your code consumes most time. If you are for example using gcc, you can compile your program with -pg. When you run it, it saves profiling results in a file. You can the run gprof on the binary to get human readable results. Once you know where most time is consumed you can post that piece of code for further questions.