Performance regarding the 'getting' of struct data

I have a question about the performance effects of two possible methods of 'getting' data from a given struct. Assume that the value of the 'name' variable is determined by the value of 'id'.
Assuming I have a struct and enum as follows,
enum GenericId { NONE, ONE, TWO };
struct GenericTypeDefinition {
GenericId id;
const char name[8];
...
};
Let's say I wanted to get the name of this struct. Quite easy, I could just refer to the instance of the GenericTypeDefinition struct and refer (or point) to the name member. Simple enough.
Now here is where my performance question becomes relevant. Say I need to create hundreds of these instances, each restricted to a fixed set of names, with a unique id per name. These instances will be referred to as a collection of possible 'GenericTypeDefinition's throughout the program. Keep in mind that the value of 'name' depends on the value of 'id'. My question is: would I be able to save some memory if I implemented a function like the following (and removed the name variable from the struct)?
struct GenericTypeDefinition { // 'name' is now removed.
GenericId id;
...
};
const char* Definition_ToString(GenericId e) {
switch (e) {
case NONE: return "Nothing";
case ONE: return "One is not nothing.";
...
}
}
I assume it would, because I am removing the need to store the string (8 bytes in length) in each struct that I create.
If you would like any clarification please ask, as I have not been able to find much on this.

If I understand what you're asking, you are putting redundant data into your struct. Essentially, you are able to get the name of the struct from the id in the struct. But, you could also store the name directly in the struct.
So, you are right -- not storing the name will save memory, because you won't store the name with every item. The cost is a bit of time. You will need to call a function to give you the name from the id each time you need it. You will have to weigh these tradeoffs to determine which is more important.
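A minimal sketch of that trade-off, assuming the name really is a fixed function of the id. The struct names `WithName` and `WithoutName`, the function `nameOf`, and the name strings are made up for illustration; the members elided in the question are left out.

```cpp
#include <cassert>
#include <cstring>

enum GenericId { NONE, ONE, TWO };

// Variant 1: every instance carries its own copy of the name.
struct WithName {
    GenericId id;
    char name[8];
};

// Variant 2: only the id is stored; the name is derived on demand.
struct WithoutName {
    GenericId id;
};

// Returns a pointer to a string literal; nothing is allocated or copied.
inline const char* nameOf(GenericId id) {
    switch (id) {
        case NONE: return "Nothing";
        case ONE:  return "One";
        case TWO:  return "Two";
    }
    return "Unknown";  // reached only for out-of-range values
}

// Dropping the array saves at least its 8 bytes per instance.
static_assert(sizeof(WithoutName) + 8 <= sizeof(WithName),
              "variant 2 should be at least 8 bytes smaller");

// Small usage check: the name is recovered from the id alone.
inline void nameLookupDemo() {
    WithoutName def{ONE};
    assert(std::strcmp(nameOf(def.id), "One") == 0);
}
```

The cost side of the trade-off is the call to `nameOf` each time the name is needed; whether that matters depends on how often the name is read.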

The devil is in the details. The answer depends on many things: for example, how often such a structure is allocated, how often it is used, and how often char name[8]; is used.
If you remove name from the structure, several scenarios may happen:
if you have many objects of this type and a good allocator, you will save space.
if you use those objects extensively during some calculation and you use name only from time to time, you will save time thanks to better cache performance.
if you use name extensively for some computation and your function Definition_ToString is just a little bit more complex than the one in your example, you will lose performance.
However, in my estimation, optimizations like this can speed up a program by only a small constant factor. It may help in cases where you count in microseconds. If your program is desperately slow, look for an asymptotically better algorithm.

In most cases the compiler will do this job for you. It usually stores all the const string literals in the read-only section of the executable. Depending on the optimization level, it may even do away with the memory taken up by the char array in the struct. So your executable size will grow, but it won't affect the run-time memory.
However, since the name is tied to the ID, logically it makes sense to implement the second version, so that in future, if you want to add a new id, you don't need to do any redundant work.

In your first case, initializing the structs with the proper id and name means that the program will, at startup, copy the string literals (I assume you initialize the structs with strings written in the code) into another region of RAM, where each char[] member lives.
In the second case, the value is read from the program image itself (the literals are stored in a table somewhere in the compiled binary), and the function returns a pointer to it (correct me if the pointer is not into the program image and the returned const char* is instead stored as a variable), so you do save some memory.
My personal comment (which you may consider beyond the question's scope) is that even though the second alternative may save you some memory, it implies that the ids and names are hard-coded, which rules out any expansion at runtime (e.g. adding more ids that are received via a console).

Related

Defending classes with 'magic numbers'

A few months ago I read a book on security practices, and it suggested the following method for protecting our classes from overwriting with e.g. overflows etc.:
first define a magic number and a fixed-size array (can be a simple integer too)
use that array containing the magic number, and place one at the top, and one at the bottom of our class
a function compares these numbers, and if they are equal, and equal to the static variable, the class is ok, return true, else it is corrupt, and return false.
place this function at the start of every other class method, so this will check the validity of the class on function calls
it is important to place this array at the start and the end of the class
At least this is as I remember it. I'm coding a file encryptor for learning purposes, and I'm trying to make this code exception safe.
So, in which scenarios is it useful, and when should I use this method, or is this something totally useless to count on? Does it depend on the compiler or OS?
PS: I forgot the name of the book mentioned in this post, so I cannot check it again, if anyone of you know which one was it please tell me.

What you're describing sounds like a canary, but implemented within your program rather than inserted by the compiler. Compiler-inserted stack canaries are often on by default when using gcc or g++ (along with a few other buffer overflow countermeasures), depending on how the toolchain was built.
If you're doing mutable operations on your class and you want to make sure you don't have side effects, I don't know if having a magic number is very useful. Why rely on a homebrew validity check when there are methods out there that are more likely to be successful?
Checksums: I think it'd be more useful for you to hash the unencrypted text and add that to the end of the encrypted file. When decrypting, remove the hash and compare the hash(decrypted text) with what it should be.
I think most, if not all, widely used encryptors/decryptors store some sort of checksum in order to verify that the data has not changed.
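A rough sketch of the hash-and-append idea, under loud assumptions: the hash is 32-bit FNV-1a, which is not cryptographic and only stands in for a real digest, and the "cipher" is a single-byte XOR placeholder, not real encryption. The function names `seal` and `unseal` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 32-bit FNV-1a: a simple non-cryptographic hash, used here only as a stand-in.
inline std::uint32_t fnv1a(const std::string& data) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : data) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Placeholder "cipher": XOR every byte with one key byte. Not secure.
inline std::string xorCipher(std::string text, unsigned char key) {
    for (char& c : text)
        c = static_cast<char>(static_cast<unsigned char>(c) ^ key);
    return text;
}

// Encrypts the plaintext and appends the 4-byte hash of the plaintext.
inline std::string seal(const std::string& plain, unsigned char key) {
    std::string out = xorCipher(plain, key);
    const std::uint32_t h = fnv1a(plain);
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<char>((h >> (8 * i)) & 0xFF));
    return out;
}

// Removes the hash, decrypts, and reports whether the hash still matches.
inline bool unseal(const std::string& sealed, unsigned char key,
                   std::string& plainOut) {
    if (sealed.size() < 4) return false;
    const std::size_t bodyLen = sealed.size() - 4;
    std::uint32_t stored = 0;
    for (int i = 0; i < 4; ++i) {
        const unsigned char b = static_cast<unsigned char>(sealed[bodyLen + i]);
        stored |= static_cast<std::uint32_t>(b) << (8 * i);
    }
    plainOut = xorCipher(sealed.substr(0, bodyLen), key);
    return fnv1a(plainOut) == stored;
}
```

If any byte of the encrypted body is altered, the recomputed hash no longer matches the stored one and `unseal` returns false.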

This type of a canary will partially protect you against a very specific type of overflow attack. You can make it a little more robust by randomizing the canary value every time you run the program.
If you're worried about buffer overflow attacks (and you should be if you are ever parsing user input), then go ahead and do this. It probably doesn't cost too much in speed to check your canaries every time. There will always be other ways to attack your program, and there might even be careful buffer overflow attacks that get around your canary, but it's a cheap measure to take so it might be worth adding to your classes.
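A minimal sketch of such a guarded class, assuming one canary value per process drawn from `std::random_device` at first use. The class name `GuardedBuffer`, the 16-byte payload, and the `simulateOverflowForDemo` hook are all invented for illustration; the book's actual scheme is not known here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <random>

// One canary value per process, chosen the first time it is needed.
inline std::uint32_t processCanary() {
    static const std::uint32_t value = std::random_device{}();
    return value;
}

class GuardedBuffer {
public:
    // True while both guards still hold the process-wide canary value.
    bool intact() const {
        return head_ == processCanary() && tail_ == processCanary();
    }

    // Checks the guards first, then copies at most sizeof(data_) - 1 bytes.
    bool store(const char* text) {
        if (!intact()) return false;
        std::strncpy(data_, text, sizeof(data_) - 1);
        data_[sizeof(data_) - 1] = '\0';
        return true;
    }

    // Demo only: mimics an overflow that ran past data_ into tail_.
    void simulateOverflowForDemo() { tail_ = ~tail_; }

private:
    std::uint32_t head_ = processCanary();  // guard before the payload
    char data_[16] = {};
    std::uint32_t tail_ = processCanary();  // guard after the payload
};
```

Calling `intact()` at the top of each member function, as `store` does, is the per-call check described in the question; it only detects overwrites that actually cross one of the guards.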

why is "char" a bad programming practice in struct types?

I see in c++ programming language some recommendations like "don't use char in struct type",
struct Student{
int stdnum, FieldCode, age;
double average, marks, res[NumOfCourses];
char Fname[20], Lname[20], cmp[20];
};
And is better to use:
struct Student{
int stdnum, FieldCode, age;
double average, marks, res[NumOfCourses];
string Fname, Lname, cmp;
};
Any other recommendations would be welcome on the matter
Thank you in advance

Because string handling using C-level raw character arrays is hard to get right and prone to failure, and when it fails it can fail rather horribly.
With a higher-level string datatype, your code becomes easier to write, more likely to be correct. It's also often easier to get shorter code, since the string datatype does a lot of work for you that you otherwise have to do yourself.
The original question was tagged with the c tag, but of course string is a c++ construct so this answer doesn't apply to C. In C, you can choose to use some string library (glib's GString is nice) to gain similar benefits but of course you'll never have overloaded operators like in C++.

@unwind's answer is generally perfectly appropriate, unless it's really a sequence of 20 chars you're after. In that case a std::string might be overkill (performance-wise as well), but a std::array might still be better.
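A small sketch of the std::array alternative, assuming the field really is a fixed block of 20 chars rather than text. The struct name `Code` and the helper `sameCode` are hypothetical.

```cpp
#include <array>
#include <cassert>

struct Code {
    std::array<char, 20> bytes{};  // exactly 20 chars, zero-initialized
};

// Element-wise comparison comes with std::array; no strcmp or memcmp needed.
inline bool sameCode(const Code& a, const Code& b) {
    return a.bytes == b.bytes;
}
```

Unlike a raw char[20], the array knows its own size (`bytes.size()`) and offers bounds-checked access through `bytes.at(i)`.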

For your student class, you have first name, last name and "cmp" - I've no idea what that's supposed to be. Clearly a student could have a first and/or last name longer than 20 characters, so by hardcoding an array of 20 elements you've already created a system that:
has to bother to check all inputs to make sure no attempt is made to store more than 19 characters (leaving space for a NUL), and
can't reliably print out any formal documents (e.g. graduation certificates) that require students' exact names.
If you don't carefully check all your input handling when modifying arrays, your program could crash, corrupt data and/or malfunction.
With std::string, you can generally just let people type into whatever fields your User Interface has, or pick up data from files, databases or the network, and store whatever you're given, print whatever you're given, add extra characters to it without worrying about crossing that threshold etc.. In extreme situations where you can't trust your input sources (e.g. accepting student data over untrusted network connections) you may still want to do some length and content checks, but you'd very rarely find it necessary to let those proliferate throughout all your code the way array-bounds checking often needs to.
There are some performance implications:
fixed length arrays have to be sized to your "worst supported case" size, so may waste considerable space for average content
strings have some extra data members, and if the textual content is larger than any internal Short String Optimisation buffer, then they may use further dynamic memory allocation (i.e. new[]), and the allocation routines might allocate more memory than you actually asked for, or be unable to effectively reuse delete[]d memory due to fragmentation
if a std::string implementation happens to share a buffer between std::string objects that are copied (just until one is or might be modified), then std::strings could reduce your memory usage when there are multiple copies of the student struct - but this probably only happens transiently and is only likely to help with very long strings
It's hard to conclude much from just reading about all these factors - if you care about potential performance impact in your application, you should always benchmark with your actual data and usage.
Separately, std::string is intuitive to use, supporting operators like +, +=, ==, !=, < etc., so you're more likely to be able to write correct, concise code that's easily understood and maintained.
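To make the contrast concrete, here is a short sketch using only the last-name field from the question. The struct names `StudentFixed` and `StudentString` and the `setName` helpers are made up for illustration.

```cpp
#include <cassert>
#include <cstring>
#include <string>

struct StudentFixed {
    char Lname[20];
};

struct StudentString {
    std::string Lname;
};

// With a fixed array the caller must truncate and terminate by hand.
inline void setName(StudentFixed& s, const char* name) {
    std::strncpy(s.Lname, name, sizeof(s.Lname) - 1);
    s.Lname[sizeof(s.Lname) - 1] = '\0';
}

// With std::string the whole name is stored, whatever its length.
inline void setName(StudentString& s, const char* name) {
    s.Lname = name;
}
```

A name longer than 19 characters is silently cut short in `StudentFixed`, while `StudentString` keeps it intact and can be compared with `==` directly.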

I don't think it makes a difference in C either.
Check this Stack Overflow answer, which clearly explains that strings are more secure than char*.

std::string or char[] as the element for a large array-like structure

I'm creating a hash table. Each value is a string. I have the problem of deciding what structure to use to store the string. Intuitively I thought of std::string and char*. But,
1), std::string seems to use the stack if the string is short. That means it's not a good choice if my hash table is really big.
2), If using char* then I don't know what to return if I want to change a value, for example like in the following situation: myTable[i] = changedString; It seems in this case I'll need to implement a new string class. But I'm feeling it won't be necessary with std::string there.
Could anyone give any suggestions/comments? Thanks!

I'll assume you are trying to implement unordered_map (homework?) and that this is why you don't use it.
You should use std::vector or std::string, but don't use the array.
And why is it a problem that std::string uses the stack?

If your goal is to create a hash table, you should try to eliminate any distractions that would make that specific task more complicated. As such, you should use std::string for the mutable values in your table so you don't have to spend development effort on allocating and deallocating char*.
Once your hash table is functional and correct, if you have a reason to move to char*, then you can always change to that later on. Focus on your highest priority goal, the hash table, and don't spend time on trying to beat std::string performance until your first goal is reached; beating std::string may not be worthwhile in any case.
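As a sketch of why std::string values keep the table simple, the class below uses separate chaining and returns a `std::string&` from `operator[]`, so `myTable[key] = changedString;` works without a custom string class. The class name `StringTable`, the string keys, and the default of 64 buckets are assumptions; it never resizes and expects a non-zero bucket count.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class StringTable {
public:
    // bucketCount must be non-zero; this sketch never rehashes.
    explicit StringTable(std::size_t bucketCount = 64) : buckets_(bucketCount) {}

    // Returns a reference to the value for key, inserting an empty one if absent.
    std::string& operator[](const std::string& key) {
        auto& bucket = buckets_[std::hash<std::string>{}(key) % buckets_.size()];
        for (auto& entry : bucket)
            if (entry.first == key) return entry.second;
        bucket.emplace_back(key, std::string());
        return bucket.back().second;
    }

private:
    std::vector<std::vector<std::pair<std::string, std::string>>> buckets_;
};
```

Assignment, copying, and cleanup of the values are all handled by std::string itself, which is the development effort the answer suggests avoiding.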

The overhead caused by std::string is minimal, actually AFAIK apart from the pointer to the string's internal buffer, there are just size and capacity members, both of type size_t, causing let's say (it's environment dependent) 8 bytes per string, so if you have an array of 100 000 strings, there would be about 780KB overhead, which is something I wouldn't worry about unless you are in an environment with strict memory limitations.
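Since the exact per-object size is implementation-defined, it may be simpler to measure it than to estimate it. The helper below is a hypothetical one-liner that counts only the string objects themselves, not any heap buffers they own.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes occupied by `count` string objects, excluding their heap buffers.
constexpr std::size_t bookkeepingBytes(std::size_t count) {
    return count * sizeof(std::string);
}
```

Printing `bookkeepingBytes(100000)` on the target platform gives the figure to compare against the memory budget.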
In case the length of the string is fixed or varies in minimal way (let's say 2 to 4 characters), then it could be more reasonable to use an array with automatic storage duration:
struct X {
...
char code[4]; // up to 4 characters
};
which would work fine even while copying instances of it in a following way:
X x1, x2;
...
x2 = x1;
However in case you don't have a really good reason to worry about this right now, anything you'd do at this point is pretty much a premature optimization.

Why are empty classes 8 bytes and larger classes always > 8 bytes?

class foo { }
writeln(foo.classinfo.init.length); // = 8 bytes
class foo { char d; }
writeln(foo.classinfo.init.length); // = 9 bytes
Is D actually storing anything in those 8 bytes, and if so, what? It seems like a huge waste. If I'm just wrapping a few value types, then the class significantly bloats the program, especially if I am using a lot of them. A char becomes 8 times larger while an int becomes 3 times as large.
A struct's minimum size is 1 byte.

In D, objects have a header containing 2 pointers (so it may be 8 bytes or 16, depending on your architecture).
The first pointer is to the virtual method table. This is an array generated by the compiler and filled with function pointers, which makes virtual dispatch possible. All instances of the same class share the same virtual method table.
The second pointer is the monitor. It is used for synchronization. It is not certain that this field will stay forever, because D emphasizes thread-local storage and immutability, which make synchronization on many objects unnecessary. As this field is older than these features, it is still here and can be used. However, it may disappear in the future.
Such a header on objects is very common; you'll find the same in Java or C#, for instance. You can look here for more information: http://dlang.org/abi.html

D uses two machine words in each class instance for:
A pointer to the virtual function table. This contains the addresses of virtual methods. The first entry points towards the class's classinfo, which is also used by dynamic casts.
The monitor, which allows the synchronized(obj) syntax, documented here.
These fields are described in the D documentation here (scroll down to "Class Properties") and here (scroll down to "Classes").

I don't know the particulars of D, but in both Java and .net, every class object contains information about its type, and also holds information about whether it's the target of any monitor locks, whether it's eligible for finalization cleanup, and various other things. Having a standard means by which all objects store such information can make many things more convenient for both users and implementers of the language and/or framework. Incidentally, in 32-bit versions of .net, the overhead for each object is 8 bytes except that there is a 12-byte minimum object size. This minimum stems from the fact that when the garbage-collector moves objects around, it needs to temporarily store in the old location a reference to the new one as well as some sort of linked data structure that will permit it to examine arbitrarily-deep nested references without needing an arbitrarily-large stack.
Edit
If you want to use a class because you need to be able to persist references to data items, space is at a premium, and your usage patterns are such that you'll know when data items are still useful and when they become obsolete, you may be able to define an array of structures, and then pass around indices to the array elements. It's possible to write code to handle this very efficiently with essentially zero overhead, provided that the structure of your program allows you to ensure that every item that gets allocated is released exactly once and things are not used once they are released.
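The array-plus-indices idea can be sketched as follows, assuming a single item type and a caller that releases every handle exactly once. `Item`, `ItemPool`, and `Handle` are hypothetical names, and the sketch is written in C++ rather than D.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Item {
    int value;
};

class ItemPool {
public:
    using Handle = std::size_t;  // an index into items_, not a pointer

    // Reuses a released slot if one exists, otherwise grows the array.
    Handle allocate(int value) {
        if (!free_.empty()) {
            const Handle h = free_.back();
            free_.pop_back();
            items_[h].value = value;
            return h;
        }
        items_.push_back(Item{value});
        return items_.size() - 1;
    }

    // The caller promises to release each handle exactly once
    // and not to use it afterwards; nothing here checks that.
    void release(Handle h) { free_.push_back(h); }

    Item& get(Handle h) { return items_[h]; }

private:
    std::vector<Item> items_;
    std::vector<Handle> free_;
};
```

There is no per-item header at all; the price is that correctness depends entirely on the release discipline described above.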
If you would not be able to readily determine when the last reference to an object is going to go out of scope, eight bytes would be a very reasonable level of overhead. I would expect that most frameworks would force objects to be aligned on 32-bit boundaries (so I'm surprised that adding a byte would push the size to nine rather than twelve). If a system is going to have a garbage collector that works better than a Commodore 64(*), it would need to have an absolute minimum of a bit of overhead per object to indicate which things are used and which aren't. Further, unless one wants to have separate heaps for objects which can contain supplemental information and those which can't, one will need every object to either include space for a supplemental-information pointer, or include space for all the supplemental information (locking, abandonment notification requests, etc.). While it might be beneficial in some cases to have separate heaps for the two categories of objects, I doubt the benefits would very often justify the added complexity.
(*) The Commodore 64 garbage collector worked by allocating strings from the top of memory downward, while variables (which are not GC'ed) were allocated bottom-up. When memory got full, the system would scan all variables to find the reference to the string that was stored at the highest address. That string would then be moved to the very top of memory and all references to it would be updated. The system would then scan all variables to find the reference to the string at the highest address below the one it just moved and update all references to that. The process would repeat until it didn't find any more strings to move. This algorithm didn't require any extra data to be stored with strings in memory, but it was of course dog slow. The Commodore 128 garbage collector stored with each string in GC space a pointer to the variable that holds a reference and a length byte that could be used to find the next lower string in GC space; it could thus check each string in order to find out whether it was still used, relocating it to the top of memory if so. Much faster, but at the cost of three bytes' overhead per string.

You should look into the storage requirements for various types. Every instruction, storage allocation (ie:variable/object, etc) uses up a specific amount of space. In c# an Int32 type integer object should store integer information to the tune of 4 bytes (32bit). It might have other information, too, because it is an object, but your character data type probably only requires 1 byte of information. If you have constructs like for or while in your class, those things will take up space, too, because each of those things is telling your class to do something. The class itself requires a number of instructions to be created in memory, which would account for the 8 initial bytes.
Take an assembler language course. You'll learn all you ever wanted to know and then some about why your programs use however much memory or take up however much storage when compiled.

fast retrieval of (void *) memory block

I have a system that returns a void* to a memory block. This memory block stores contiguous data records of different types (int, char, double, etc.) and gives the number of bytes of each field in each record. I essentially look up the type of the record and get the value of the record. To retrieve all the records, I do
switch(type)
{
case 'int' : *(int*)(ptr+index); break;
case 'char': *(char*)(ptr+index); break;
}
When I have to go through 300000 records, this takes a lot of time. Is there a faster way to go through all the records?

If a single block can be of multiple types that can only be resolved at runtime, you will have to dispatch to handlers in a switch statement. Note that:
unions are usually used in C for such things to save space
switch statements are very fast; when the case values are dense, compilers typically translate them to constant-time jump tables
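A sketch of such a dispatch, assuming the field type is available as a small enum and that returning every value widened to double is acceptable for the example. `FieldType` and `readField` are hypothetical names; memcpy is used instead of the pointer casts from the question so that unaligned offsets are safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

enum class FieldType { Int, Char, Double };

// Reads one field at byte offset `offset` and returns it widened to double.
// memcpy avoids misaligned reads and aliasing problems with raw casts.
inline double readField(const void* block, std::size_t offset, FieldType type) {
    const unsigned char* p = static_cast<const unsigned char*>(block) + offset;
    switch (type) {
        case FieldType::Int:    { int v;    std::memcpy(&v, p, sizeof v); return v; }
        case FieldType::Char:   { char v;   std::memcpy(&v, p, sizeof v); return v; }
        case FieldType::Double: { double v; std::memcpy(&v, p, sizeof v); return v; }
    }
    return 0.0;  // not reached for valid FieldType values
}
```

With dense enum values like these, the switch itself is unlikely to be the bottleneck over 300000 records; whatever is done with each value afterwards usually costs more.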

If I understand your question correctly, I assume you're going through each record, sort of saying "for each record i, if its type is 'char' then access the record at location i".
If you know how many records you have to access in advance, can't you just cache them all first?
If I'm completely off track forgive me for not understanding your point.

Your comments finally gave enough information to answer the question: you're writing to an ostringstream. Which means that you're probably doing a lot of string manipulation inside your switch. Figure out how to optimize this, and your performance problems should go away.
(to convince yourself this is the case, simply comment out all code that references the stream and run your program again)
