Which container to use for String-Interning


My goal is to do string-interning. For this I am looking for a hashed
container class that can do the following:
allocate only one block of memory per node
different userdata size per node
The value type looks like this:
struct String
{
size_t refcnt;
size_t len;
char data[];
};
Every String object will have a different size. This will be accomplished with
operator new + placement new.
So basically I want to allocate the Node myself and push it in the container later.
Following containers are not suitable:
std::unordered_set
boost::multi_index::*
Cannot allocate different sized nodes
boost::intrusive::unordered_set
Seems to work at first, but it has some drawbacks. First of all, you have to allocate
the bucket array and maintain the load factor yourself. This is unnecessary
and error-prone.
Another problem is harder to solve: you can only search for objects of
type String. It is inefficient to allocate a String every time you look for an entry
when all you have as input is, e.g., a std::string.
Are there any other hashed containers that can be used for this task?

I don't think you can do that with any of the standard containers.
What you can do is store pointers to String and provide custom hash and comparison functors:
struct StringHash
{
size_t operator() (const String* str) const
{
// calc hash from the characters of str and return it
}
};
struct StringCmp
{
bool operator() (const String* str1, const String* str2) const
{
// compare: return true when both hold the same characters
}
};
std::unordered_set<String*, StringHash, StringCmp> my_set;

Your definition for String won't compile in C++; the obvious
solution is to replace the data field with a pointer (in which
case, you can put the structures themselves in
std::unordered_set).
It's possible to create an open ended struct in C++ with
something like the following:
struct String
{
int refcnt;
int len;
char* data()
{
return reinterpret_cast<char*>(this + 1);
}
};
You're skating on thin ice if you do, however; for types other
than char, there is a risk that this + 1 won't be
appropriately aligned.
If you do this, then your std::unordered_set will have to
contain pointers, rather than the elements, so I doubt you'll
gain anything for the effort.
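To make the ideas above concrete, here is a minimal sketch that allocates the header and the characters in one block (operator new + placement new, with data() computed from this + 1 as in the answer above) and looks entries up without building a temporary String. The InternTable class and its intern() function are hypothetical names, and keying a std::unordered_map on a std::string_view (C++17) that points into the node's own characters is my assumption, not something proposed in the thread. Note that it does not meet the "one block per node" requirement: the map still allocates its own node for each entry. Exception safety is left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>
#include <string_view>
#include <unordered_map>

// Header that is followed directly by len + 1 chars in the same allocation.
struct String
{
    std::size_t refcnt;
    std::size_t len;
    char* data() { return reinterpret_cast<char*>(this + 1); }
    const char* data() const { return reinterpret_cast<const char*>(this + 1); }
};

// Hypothetical interning table: each key is a view into its node's characters.
class InternTable
{
    std::unordered_map<std::string_view, String*> table_;

public:
    InternTable() = default;
    InternTable(const InternTable&) = delete;
    InternTable& operator=(const InternTable&) = delete;

    ~InternTable()
    {
        // String is trivially destructible, so releasing the raw block is enough.
        for (auto& entry : table_) {
            ::operator delete(entry.second);
        }
    }

    // Returns the shared node for text, creating it on first use.
    String* intern(std::string_view text)
    {
        auto it = table_.find(text);  // no String is allocated for the lookup
        if (it != table_.end()) {
            ++it->second->refcnt;
            return it->second;
        }
        // One block holds the header, the characters and a terminating '\0'.
        void* raw = ::operator new(sizeof(String) + text.size() + 1);
        String* node = new (raw) String{1, text.size()};
        if (!text.empty()) {
            std::memcpy(node->data(), text.data(), text.size());
        }
        node->data()[text.size()] = '\0';
        table_.emplace(std::string_view(node->data(), node->len), node);
        return node;
    }

    std::size_t size() const { return table_.size(); }
};
```

Calling intern("hello") twice should hand back the same String* with refcnt 2. Releasing a node when its count drops to zero is not shown.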

Related

C++ Pointer not being updated?

Not exactly sure how to word the title but I'll explain as best I can.
I have a program that originally used a 2D array of a set size and so it was defined as:
typedef char Map[Row][Col];
I'm now trying to dynamically allocate memory for it and it has now also become of variable size based on input. It's now defined as:
typedef char** Map;
In my main method, I originally had:
Map map;
readUserInput(map);
Basically readUserInput takes the map array as a parameter, and assigns values to it based on user input. The map then contains values and is used in other functions.
I've updated the readUserInput function so that it dynamically sizes the array and it allocates/deallocates memory for it. This works fine, but the problem comes from the fact that now in the main method, map is not being updated. The above code in main now looks like:
Map map = nullptr;
readUserInput(map);
but after running the readUserInput function, map is still null. Inside of the function, map is updated fine, so I'm not understanding the difference made between the changes.

What you pass to the function is a pointer, and it is passed by value, so the function cannot change the caller's copy. But replacing the array with a pointer to pointer is incorrect in most cases. A pointer to pointer suggests that you have a 1D array of pointers, each of which may (or may not) point to another array. Such an organization is sometimes referred to as a jagged array, because it allows each row to have a different length. In practice, though, jagged arrays and their subclass, sparse matrices, are usually implemented on top of a 1D array to avoid re-allocation.
To avoid decay and to actually store a monolithic array in memory, you should use a 1D array and, preferably, encapsulate the pointer arithmetic and reallocation in a class, then pass a reference to the object that stores all the required state. The reference ensures that the function can modify the object (a version without smart pointers, as an example):
class Map
{
int rows, cols;
char *data;
public:
Map() : rows(), cols(), data(nullptr) {}
Map(int r, int c) : rows(r), cols(c), data(new char[r*c]()) {}
~Map() { delete[] data; }
void resize(int r, int c) {
if(rows == r && cols == c) return;
char* tmp = new char[r*c]();
if(data)
{
// copy old data here if required
delete[] data;
}
rows = r; cols = c;
data = tmp;
}
char& operator() (int r, int c) { return data[r*cols + c]; }
char operator() (int r, int c) const { return data[r*cols + c]; }
};
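NB: as written, this class must not be copied, because the implicitly generated copy operations would leave two objects deleting the same buffer. Implement the copy and move operations if copying must be allowed.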
The function prototype would be:
void readUserInput(Map& map);
With such class you can do dynamic resizing, store its size, and address element as simple as this:
int main()
{
Map test(4, 5); // declaring and allocating memory
test.resize(3,3); // reallocating
test(1,1) = 3; // writing
//reading
std::cout << +test(1,1) << std::endl;
}

The function should accept the array by reference. In C terms, call it like
readUserInput( &map );
when the function is declared like
void readUserInput( Map *map );
or, in C++ terms, declare it like
void readUserInput( Map &map );
and call it like
readUserInput(map);
Instead of dynamically allocating arrays, you could use the container std::vector<std::string>.
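A minimal sketch of that last suggestion, assuming the map arrives as one row per line. The extra std::istream parameter and the line-per-row format are my assumptions for illustration; the original readUserInput only took the map.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using Map = std::vector<std::string>;

// Fills map with one string per input line. Because map is taken by
// reference, the caller's object is the one that gets modified.
void readUserInput(std::istream& in, Map& map)
{
    map.clear();
    std::string row;
    while (std::getline(in, row)) {
        map.push_back(row);
    }
}
```

The caller declares a Map, passes it in, and afterwards reads cells as map[r][c]. The vector and the strings release their memory on their own, so there is nothing to deallocate by hand.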

The code you have used is a pure C-style code, and is prone to many mistakes:
You use typedef instead of: using Map = char**;
You use a function which gets a pointer and fills it, which is more common in C than in C++.
You use raw pointer instead of smart pointers (added in C++11), which may cause a memory leak in the end.
I've updated the readUserInput function so that it dynamically sizes the array and it allocates/deallocates memory for it.
This means that it should now be a class named Map, since it has to allocate/deallocate, insert and remove values, and act as a valid container. You are effectively writing your own kind of std::vector here, and unless you are doing it as a learning exercise, I strongly suggest using the std containers!
It is possible to pass both pointer and references in C++, notice that:
You can pass a reference only if the value isn't nullptr.
When there should be a value, reference is recommended.
In this case, your function should look like
void readUserInput(Map* map);
and should be called using:
readUserInput(&map);

Converting std::vector container into an std::set using std::transform

More specifically I have a vector of some struct
std::vector<SomeStruct> extensions = getThoseExtensions();
where someStructVariable.extensionName returns a string.
And I want to create a set of extensionName, something like this std::set<const char*>.
Process is fairly straightforward when done using some for loops but I want to use std::transform from <algorithm> instead.
std::transform has four parameters.
1,2. First range (to transform first range from and to)
3. Second range/inserter (to transform second range)
4. A function
This is what I have so far
auto lambdaFn =
[](SomeStruct x) -> const char* { return x.extensionName; };
std::transform(availableExtensions.begin(),
availableExtensions.end(),
std::inserter(xs, xs.begin()),
lambdaFn);
because there's no "proper context" for std::back_inserter in std::set I'm using std::inserter(xs, xs.begin()).
The problem is I'm trying to return stack mem in my lambda function. So how do I get around this problem?
Oddly enough if I remove return from the function it works just like I expect it to! But I don't understand why and that strikes fear of future repercussion.
EDIT:
I'm using several structs in place of SomeStruct like VkExtensionProperties defined in vulkan_core
typedef struct VkExtensionProperties {
char extensionName[VK_MAX_EXTENSION_NAME_SIZE];
uint32_t specVersion;
} VkExtensionProperties;
From Khronos specs

You probably can't create a set of char * unless all instances of extensionName with the same value point to the same char array (it would store unique pointers instead of unique values). If you use std::set<std::string> instead this will both work and only store unique values and solve your variable lifetime problem as std::string takes care of copying (or moving) itself for you where necessary:
auto lambdaFn =
[](const SomeStruct& x) { return std::string(x.extensionName); };
std::set<std::string> xs;
std::transform(availableExtensions.begin(),
availableExtensions.end(),
std::inserter(xs, xs.begin()),
lambdaFn);
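For reference, a self-contained version of the same approach might look like the sketch below. ExtensionProperties is a hypothetical stand-in for VkExtensionProperties (the array size 256 is an assumption), and collectNames and makeExtension are illustrative helper names that do not appear in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for VkExtensionProperties; 256 is an assumed size.
struct ExtensionProperties
{
    char extensionName[256];
    std::uint32_t specVersion;
};

// Copies every name into a std::string, so the set owns its own data and
// compares by value rather than by pointer.
std::set<std::string> collectNames(const std::vector<ExtensionProperties>& extensions)
{
    std::set<std::string> names;
    std::transform(extensions.begin(), extensions.end(),
                   std::inserter(names, names.begin()),
                   [](const ExtensionProperties& x) { return std::string(x.extensionName); });
    return names;
}

// Illustrative helper for building sample data.
ExtensionProperties makeExtension(const char* name, std::uint32_t version)
{
    ExtensionProperties e{};  // zero-initialized, so the name stays terminated
    std::strncpy(e.extensionName, name, sizeof(e.extensionName) - 1);
    e.specVersion = version;
    return e;
}
```

Two entries with the same extensionName collapse into a single element of the set, which would not be guaranteed with std::set<const char*>.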

One way to do what you want is with the following lambda
auto lambda = [](const SomeStruct& x) -> const char* { return x.extensions.data();};
The problem with this is, that you are saving pointers to memory owned by someone else (those strings). When they are destroyed (this seems to be the case at the end of the function), the pointer will be dangling. You can get around this by allocating memory in your lambda and copying the data:
auto lambda = [](const SomeStruct & x) -> const char* {
char* c = new char[x.extensions.length()+1];
std::strcpy(c, x.extensions.data());
return c;
};
But then you have to do the memory management yourself (i.e. remember to free those const char*), and that is a bad idea. You should probably reconsider what you are doing. Why are you using const char* here and not std::string?
Please remember that the typical use for const char* is to hold string literals in C code, i.e. the code
const char* str = "Hello World!";
creates a char array of sufficient size in the static section of memory, initializes it with the string (a compile-time constant) and then saves a pointer to it in str. This is also why it has to be a const char*: another pointer referring to an equal string literal may (or may not) point to exactly the same char array, and you don't want to allow changes there. So don't use const char* just because you see strings in C stored in const char* without anyone needing to free them later.

There are a couple of things you can do here.
If you own the definition of SomeStruct, it is best if you changed that member to std::string.
Short of that, see if your lambda can take a by-reference parameter, const auto& obj. This will not create a copy, and the returned pointer will refer back to the object the container holds. However, I am still wary of this solution, since it smells like bad class design where the ownership and lifetime of members are ambiguous.

How can I make my char buffer more performant?

I have to read a lot of data into:
vector<char>
A 3rd party library reads this data in many turns. In each turn it calls my callback function whose signature is like this:
CallbackFun ( int CBMsgFileItemID,
unsigned long CBtag,
void* CBuserInfo,
int CBdataSize,
void* CBdataBuffer,
int CBisFirst,
int CBisLast )
{
...
}
Currently I have implemented a buffer container using an STL Container where my method insert() and getBuff are provided to insert a new buffer and getting stored buffer. But still I want better performing code, so that I can minimize allocations and de-allocations:
template<typename T1>
class buffContainer
{
private:
class atomBuff
{
private:
atomBuff(const atomBuff& arObj);
atomBuff operator=(const atomBuff& arObj);
public:
int len;
char *buffPtr;
atomBuff():len(0),buffPtr(NULL)
{}
~atomBuff()
{
if(buffPtr!=NULL)
delete []buffPtr;
}
};
public :
buffContainer():_totalLen(0){}
void insert(const char const *aptr,const unsigned long &alen);
unsigned long getBuff(T1 &arOutObj);
private:
std::vector<atomBuff*> moleculeBuff;
int _totalLen;
};
template<typename T1>
void buffContainer< T1>::insert(const char const *aPtr,const unsigned long &aLen)
{
if(aPtr==NULL,aLen<=0)
return;
atomBuff *obj=new atomBuff();
obj->len=aLen;
obj->buffPtr=new char[aLen];
memcpy(obj->buffPtr,aPtr,aLen);
_totalLen+=aLen;
moleculeBuff.push_back(obj);
}
template<typename T1>
unsigned long buffContainer<T1>::getBuff(T1 &arOutObj)
{
std::cout<<"Total Lenght of Data is: "<<_totalLen<<std::endl;
if(_totalLen==0)
return _totalLen;
// Note : Logic pending for case size(T1) > T2::Value_Type
int noOfObjRqd=_totalLen/sizeof(T1::value_type);
arOutObj.resize(noOfObjRqd);
char *ptr=(char*)(&arOutObj[0]);
for(std::vector<atomBuff*>::const_iterator itr=moleculeBuff.begin();itr!=moleculeBuff.end();itr++)
{
memcpy(ptr,(*itr)->buffPtr,(*itr)->len);
ptr+= (*itr)->len;
}
std::cout<<arOutObj.size()<<std::endl;
return _totalLen;
}
How can I make this more performant?

If my wild guess about your callback function makes sense, you don't need anything more than a vector:
std::vector<char> foo;
foo.reserve(MAGIC); // this is the important part. Reserve the right amount here.
// and you don't have any reallocs.
setup_callback_fun(CallbackFun, &foo);
CallbackFun ( int CBMsgFileItemID,
unsigned long CBtag,
void* CBuserInfo,
int CBdataSize,
void* CBdataBuffer,
int CBisFirst,
int CBisLast )
{
std::vector<char>* pFoo = static_cast<std::vector<char>*>(CBuserInfo);
char* data = static_cast<char*>(CBdataBuffer);
pFoo->insert(pFoo->end(), data, data+CBdataSize);
}

Depending on how you plan to use the result, you might try putting the incoming data into a rope datastructure instead of vector, especially if the strings you expect to come in are very large. Appending to the rope is very fast, but subsequent char-by-char traversal is slower by a constant factor. The tradeoff might work out for you or not, I don't know what you need to do with the result.
EDIT: I see from your comment this is no option, then. I don't think you can do much more efficient in the general case when the size of the data coming in is totally arbitrary. Otherwise you could try to initially reserve enough space in the vector so that the data will fit without or at most one reallocation in the average case or so.
One thing I noticed about your code:
if(aPtr==NULL,aLen<=0)
I think you mean
if(aPtr==NULL || aLen<=0)

The main thing you can do is avoid doing quite so much copying of the data. Right now, when insert() is called, you're copying the data into your buffer. Then, when getbuff() is called, you're copying the data out to a buffer they've (hopefully) specified. So, to get data from outside to them, you're copying each byte twice.
This part:
arOutObj.resize(noOfObjRqd);
char *ptr=(char*)(&arOutObj[0]);
Seems to assume that arOutObj is really a vector. If so, it would be a whole lot better to rewrite getbuff as a normal function taking a (reference to a) vector instead of being a template that really only works for one type of parameter.
From there, it becomes a fairly simple matter to completely eliminate one copy of the data. In insert(), instead of manually allocating memory and tracking the size, put the data directly into a vector. Then, when getbuff() is called, instead of copying the data into their buffer, just give them a reference to your existing vector.
class buffContainer {
std::vector<char> moleculeBuff;
public:
void insert(char const *p, unsigned long len) {
// Edit: here you really want to add:
moleculeBuff.reserve(moleculeBuff.size()+len);
// End of edit.
std::copy(p, p+len, std::back_inserter(moleculeBuff));
}
void getbuff(vector<char> &output) {
output = moleculeBuff;
}
};
Note that I've changed the result of getbuff to void -- since you're giving them a vector, its size is known, and there's no point in returning the size. In reality, you might want to actually change the signature a bit, to just return the buffer:
vector<char> getbuff() {
vector<char> temp;
temp.swap(moleculeBuff);
return temp;
}
Since it's returning a (potentially large) vector by value, this depends heavily on your compiler implementing the named return value optimization (NRVO), but 1) the worst case is that it does about what you were doing before anyway, and 2) virtually all reasonably current compilers DO implement NRVO.
This also addresses one other detail your original code didn't (seem to). As it was, getbuff returns some data, but if you call it again, it (apparently) doesn't keep track of what data has already been returned, so it returns it all again. It keeps allocating data, but never deletes any of it. That's what the swap is for: it creates an empty vector, and then swaps that with the one that's being maintained by buffContainer, so buffContainer now has an empty vector, and the filled one is handed over to whatever called getbuff().
Another way to do things would be to take the swap a step further: basically, you have two buffers:
one owned by buffContainer
one owned by whatever calls getbuffer()
In the normal course of things, we can probably expect that the buffer sizes will quickly reach some maximum size. From there on, we'd really like to simply re-cycle that space: read some data into one, pass it to be processed, and while that's happening, read data into the other.
As it happens, that's pretty easy to do too. Change getbuff() to look something like this:
void getbuff(vector<char> &output) {
swap(moleculeBuff, output);
moleculeBuff.clear();
}
This should improve speed quite a bit: instead of copying data back and forth, it just swaps one vector's pointer to the data with the other's (along with a couple of other details like the current allocation size and the used size of the vector). The clear is normally really fast: for a vector<char> (or a vector of any type without a dtor) it'll just set the number of items in the vector to zero (if the items have dtors, it has to destroy them, of course). From there, the next time insert() is called, the new data will just be copied into the memory the vector already owns (until/unless it needs more space than the vector had allocated).
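Put together, the swapping variant could look like the following sketch. It keeps the buffContainer, insert and getbuff names from above; the null/zero-length guard and the use of vector::insert in place of std::copy with a back_inserter are my choices for the illustration, not part of the answer's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class buffContainer
{
    std::vector<char> moleculeBuff;

public:
    // Appends len bytes starting at p; vector growth handles reallocation.
    void insert(const char* p, std::size_t len)
    {
        if (p == nullptr || len == 0) return;
        moleculeBuff.insert(moleculeBuff.end(), p, p + len);
    }

    // Hands the accumulated bytes to the caller and takes the caller's old
    // storage in exchange, so that storage can be reused by later inserts.
    void getbuff(std::vector<char>& output)
    {
        moleculeBuff.swap(output);
        moleculeBuff.clear();
    }
};
```

If the caller passes the same vector to getbuff() each time, the two buffers simply alternate roles between reading and processing.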

Statically initializing a structure with arrays of varying length

I've got a static map of identifier<=>struct pairs, and each struct should contain some arrays. Everything is known at compile time. That is, I want to have something like this here:
ID1 => name: someString
flagCount: 3
flags: [1, 5, 10]
statically created (if possible). Of course, a declaration like:
struct Info
{
const char* name;
int flagCount;
int flags[];
};
would be ideal, as long as I could initialize it like ...
Info infos [] = { ... };
which is not possible, due to the varying-length arrays (unless I'm missing something). Alternatively, I thought about (ab)using boost::assign for this, but I'm wondering if there is a recommended solution. I'm fine if I can store only the info structures in an array and do the mapping elsewhere.
Edit: A note on the current solution. At the moment, I have:
struct Info
{
Info (const std::vector<int>& flags) : flags (flags) {}
std::vector<int> flags;
};
and I use:
const std::map<ID, Info> map = boost::assign::map_list_of
("ID1", Info (boost::assign::list_of (1)(2)(3)));
which works, I'm just curious whether there is a simpler solution (template-based?).

The elements in an array must be the same size as each other, otherwise you can't use infos[i] to access them - the compiler would have to step through the array and look at the size of each element up to i to find where the next one started. You can allocate enough memory for each element contiguously, and then create an array of pointers to the elements (pointers being a fixed size). If you only need the mapping and not to index the infos then your map will be identifier to pointer.
Alternatively, as you know the size at compile time, if there are only a few flags, make the Info::flags array large enough for the maximum flags, or make it a pointer to an array of flags, so that Info is a fixed size struct.

Either use a pointer to the variable-length array:
struct Info
{
const char* name;
int flagCount;
int *flags;
};
or fixed-size array large enough to hold all flags:
struct Info
{
const char* name;
int flagCount;
int flags[MAX_FLAGS];
};
Both solutions will waste some memory; but for solution 1, it's just one pointer per struct; note that you are already implicitly using this solution for the name field.
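With the pointer variant, the static initialization asked for in the question could be sketched as follows. The flag arrays are separate static objects that the table points to. The names flagsId1, flagsId2 and countOf, and the second entry's values, are made up for illustration; I also use const int* instead of int* so that the flag data can itself be const.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct Info
{
    const char* name;
    int flagCount;
    const int* flags;  // points at a separately defined static array
};

// Each flag list is its own static array, so the lengths may differ.
static const int flagsId1[] = {1, 5, 10};
static const int flagsId2[] = {2, 4};

// Number of elements in a built-in array, computed at compile time.
template <typename T, std::size_t N>
constexpr int countOf(const T (&)[N]) { return static_cast<int>(N); }

// Info has a fixed size, so an ordinary array initializer works.
static const Info infos[] = {
    {"someString",  countOf(flagsId1), flagsId1},
    {"otherString", countOf(flagsId2), flagsId2},
};
```

The identifier-to-Info mapping can then be built elsewhere, for instance by storing pointers to elements of infos in a map.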

Using a vector as you have done is almost certainly the best solution. oefe has given you a solution where you include some indirection in the Infos themselves; another option is to add the indirection in the map, i.e. map<ID, Info*> (or, since you're using boost, map<ID, shared_ptr<Info> >) and define Info as in the code that follows. Actually, don't do this. Use a vector. It's the best solution.
struct Info {
const char *name;
int flagCount;
int flags[1]; // this is cheating...
};
Info* make_info(int count) {
// room for the struct plus the count - 1 flags beyond flags[1]
char *buf = new char[sizeof(Info) + (sizeof(int) * (count - 1))];
Info *rv = static_cast<Info*>(static_cast<void*>(buf));
rv->flagCount = count;
return rv;
}

Best Replacement for a Character Array

we have a data structure
struct MyData
{
int length ;
char package[MAX_SIZE];
};
where MAX_SIZE is a fixed value. Now we want to change it so as to support
"unlimited" package lengths greater than MAX_SIZE. One of the proposed solutions
is to replace the static array with a pointer and then dynamically allocate
the size we require, for example:
struct MyData
{
int length ;
char* package;
};
and then
package = (char*)malloc(SOME_RUNTIME_SIZE) ;
Now my question is: is this the most efficient way to meet the requirement, or is there another method, maybe using STL data structures like growable arrays?
We want a solution where most of the code that works for the static char array also works for the new structure.

Much, much better/safer:
struct my_struct
{
std::vector<char>package;
};
To resize it:
my_struct s;
s.package.resize(100);
To look at how big it is:
my_struct s;
int size = s.package.size();
You can even put the functions in the struct to make it nicer:
struct my_struct
{
std::vector<char>package;
void resize(int n) {
package.resize(n);
}
int size() const {
return package.size();
}
};
my_struct s;
s.resize(100);
int z = s.size();
And before you know it, you're writing good code...

using STL data structures like growable arrays
The STL provides you with a host of containers. Unfortunately, the choice depends on your requirements. How often do you add to the container? How many times do you delete? Where do you delete from/add to? Do you need random access? What performance guarantees do you need? Once you have a sufficiently clear idea about such things, look up vector, deque, list, set etc.
If you can provide some more detail, we can surely help pick a proper one.

I would also wrap a vector:
// wraps a vector. provides convenience conversion constructors
// and assign functions.
struct bytebuf {
explicit bytebuf(size_t size):c(size) { }
template<size_t size>
bytebuf(char const(&v)[size]) { assign(v); }
template<size_t size>
void assign(char const(&v)[size]) {
c.assign(v, v+size);
}
// provide access to wrapped vector
std::vector<char> & buf() {
return c;
}
private:
std::vector<char> c;
};
int main() {
bytebuf b("data");
process(&b.buf()[0], b.buf().size()); // process 5 byte
std::string str(&b.buf()[0]);
std::cout << str; // outputs "data"
bytebuf c(100);
read(&c.buf()[0], c.buf().size()); // read 100 byte
// ...
}
There is no need to add many more functions to it, I think. You can always get the vector using buf() and operate on it directly. Since a vector's storage is contiguous, you can use it like a C array, but it is still resizable:
c.buf().resize(42)
The template conversion constructor and assign function allow you to initialize or assign from a C array directly. If you like, you can add more constructors that initialize from a pair of iterators or from a pointer and a length. But I would try to keep the amount of added functionality low, so that it stays a tight, transparent vector-wrapping struct.

If this is C:
Don't cast the return value of malloc().
Use size_t to represent the size of the allocated "package", not int.

If you're using the character array as an array of characters, use a std::vector<char> as that's what vectors are for. If you're using the character array as a string, use a std::string which will store its data in pretty much the same way as a std::vector<char>, but will communicate its purpose more clearly.

Yep, I would use an STL vector for this:
struct
{
std::vector<char> package;
// not sure if you have anything else in here ?
};
but your struct length member just becomes package.size ().
You can index characters in the vector as you would in your original char array (package[index]).

use a deque. sure a vector will work and be fine, but a deque will use fragmented memory and be almost as fast.

How are you using your structure?
Is it like an array or like a string?
I would just typedef one of the C++ containers:
typedef std::string MyData; // or std::vector<char> if that is more appropriate

What you have written can work and is probably the best thing to do if you do not need to resize on the fly. If you find that you need to expand your array, you can run
package = (char*)realloc((void*)package, SOME_RUNTIME_SIZE) ;
You can use an STL vector
#include <vector>
std::vector<char> myVec; // optionally myVec(SOME_RUNTIME_SIZE)
that you can then resize using myVec.resize(newSize) or by using functions such as push_back that add to the vector and resize it automatically. The good thing about the vector solution is that it takes away many memory management issues: if the vector is stack-allocated, its destructor will be called when it goes out of scope and the dynamically allocated array underlying it will be deleted. However, if you pass the vector around by value, the data will be copied, which can be slow, so you may need to pass references or pointers to vectors instead.
