String pooling/interning - Is this good practice?

As an example, I have a class which I'm storing some information about in binary files:
class car {
char car_manufacturer;
//other stuff
};
Where the value of car_manufacturer is one of the values in:
enum car_manufacturers : char {
VOLVO = 0,
AUDI,
MERCEDES
};
Now in this example application, the user will want a string representation of their car manufacturer rather than a number. So I create an array with the string representations of the enumerators, in the same order as the enum, so that car_manufacturer can be used as an array index:
std::string car_manufacturers_strings[MERCEDES + 1] = {
"Volvo",
"Audi",
"Mercedes"
};
So now, after loading the file and creating a car object out of the data, I can get the make of the car as a string simply with car_manufacturers_strings[car.car_manufacturer];
The advantage is that I don't have to store a bunch of repeated strings in the files when many cars are the same make, which saves a lot of space. The drawback is that the code is slightly more complex.
So is this good or bad practice?

This is pretty standard practice to provide string representations of enumerators as an array.
One point is that:
std::string car_manufacturers_strings[MERCEDES + 1] = {
"Volvo",
"Audi",
"Mercedes"
};
This stores the string literals in the binary and also creates copies of those literals as std::string objects during the dynamic initialization phase.
You may like to change it to:
char const* const car_manufacturers_strings[MERCEDES + 1] = {
"Volvo",
"Audi",
"Mercedes"
};
So that it does not unnecessarily create those copies.
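A minimal sketch of that suggestion, assuming the enum from the question. The `COUNT` sentinel enumerator and the `manufacturer_name` helper are hypothetical additions for illustration, not part of the original code. The `static_assert` is one way to notice a table that has fallen out of sync with the enum, and the range check guards against a corrupt byte read from a file.

```cpp
#include <cstddef>

enum car_manufacturers : char {
    VOLVO = 0,
    AUDI,
    MERCEDES,
    COUNT  // hypothetical sentinel, not in the original enum
};

// Pointers to string literals: nothing is copied at program start-up.
constexpr const char* car_manufacturers_strings[] = {
    "Volvo",
    "Audi",
    "Mercedes"
};

// Fails to compile if an enumerator is added without a matching string.
static_assert(sizeof(car_manufacturers_strings) / sizeof(car_manufacturers_strings[0])
                  == static_cast<std::size_t>(COUNT),
              "string table out of sync with car_manufacturers");

// Hypothetical helper: bounds-checked lookup for a byte read from a file.
constexpr const char* manufacturer_name(char raw) {
    return (raw >= 0 && raw < COUNT)
               ? car_manufacturers_strings[static_cast<std::size_t>(raw)]
               : "Unknown";
}
```

With this in place, `manufacturer_name(car.car_manufacturer)` yields a pointer to the literal, and any out-of-range value falls back to "Unknown" instead of reading past the array.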

I generally do it with a function, along the lines of:
char const * to_string(car_manufacturers cm) {
switch (cm) {
#define CASE(CM) case CM: return #CM;
CASE(Volvo)
CASE(Audi)
CASE(Mercedes)
#undef CASE
}
return "Unknown"; // or throw, or whatever
}
Advantages:
no need to convert the argument into a numeric index, so it works with scoped enumerations without need for hideous casting;
my compiler gives a warning if I add an enumerator without updating the function (not as good as removing the duplication, but at least consistency is enforced)
Disadvantages:
switch may (or may not) be less efficient than an array lookup
Capitalisation of the enumerators has to match the strings, so you'd have to drop your fetish for SHOUTY_CAPS. Some would say that's a good thing; they certainly hurt my eyes.
Ewwww! Macros!
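For completeness, here is a sketch of the same function applied to a scoped enumeration, which is the case the first advantage refers to. The enum name `car_manufacturer` and the mixed-case enumerators are assumptions made for this example; the only change to the macro is the qualified enumerator name in the `case` label.

```cpp
// Assumed scoped version of the question's enum, with mixed-case enumerators.
enum class car_manufacturer : char { Volvo, Audi, Mercedes };

constexpr const char* to_string(car_manufacturer cm) {
    switch (cm) {
// #CM turns the enumerator name into a string literal.
#define CASE(CM) case car_manufacturer::CM: return #CM;
        CASE(Volvo)
        CASE(Audi)
        CASE(Mercedes)
#undef CASE
    }
    return "Unknown";  // reached only for values outside the enumerator list
}
```

No cast to an integer index is needed, and compilers that warn about unhandled enumerators in a `switch` can flag a missing `CASE` line.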

Related

How to create a C++ multidimensional array of names with basic types

I'm trying to instantiate and easily access an array of names in C++ using basic types in contiguous memory. I'm astounded that this is extremely difficult or complicated to do in C++ WITH ONLY basic types.
For some background, I am programming a microcontroller with limited memory, modest processing power, and it is handling serial communication over a network to 36 other microcontrollers sending continuous sensor data which is uploaded to a webserver. The shorter the refresh rate of the data, the better, so I prefer basic program features.
Not that the more complicated approaches I've found on other forums, such as an array of strings, have worked either.
In my desperation, I was able to get this to work.
char names_array[] = "Bob\0\0Carl";
printf("%s",names_array); //outputs Bob
printf("%s",names_array + 5); //outputs Carl
This is a horrible solution though. My indexing is dependent on the longest name in the array, so if I added "Megan" to my list of names, I'd have to add a bunch of null characters throughout the entire array.
What I want to do is something like this:
char names_array[2] = {"Bob","Carl"}; //does not compile
printf("%s",names_array[0]);
printf("%s",names_array[1]);
//Error: A value of type "const char *" cannot be used to
//initialize an entity of type "char" in "main.cpp"
but that didn't work.
I want to loop through the names in my list and do something with each name, so at this point, this is my best solution.
char name0[] = "Bob";
loop_code(name0);
char name1[] = "Carl";
loop_code(name1);
.
.
.
I expect there's a reasonable way to make an array of pointers, each to an array of char terminated by null(s). I must be doing something wrong. I refuse to believe that a language like C++ is incapable of such a basic memory allocation.

You can, e.g., get an array of pointers to null-terminated strings:
const char* names_array[] = { "Bob", "Carl" };
and then
std::printf("%s", names_array[0]);
std::printf("%s", names_array[1]);
The problem with your attempt
char names_array[2] = {"Bob","Carl"};
is that you declare names_array to be an array of characters. This can never compile, because = {"Bob","Carl"} attempts to initialize each character in that array with an entire array of characters of its own. A character is just a character; you cannot assign an entire array of characters to an individual character. More precisely, initialization of a character array from a string literal is a special case of initialization [dcl.init.string] that allows a single string literal to initialize an entire character array (because anything else doesn't make sense). What you actually want would be something more like an array of character arrays. However, the problem there is that you'd have to pick a fixed maximum length for all strings in the array:
char names_array[][5] = { "Bob", "Carl" }; // each subarray is 5 characters in length
which would be potentially wasteful. You can flatten a series of multiple strings into one long array and then index into that, like you did with your first approach. The downside of that, as you've found out, is that you then need to know where each string starts in that array…
If you just want an array of string constants, a more modern C++ approach would be something like this:
#include <string_view>
using namespace std::literals;
constexpr std::string_view names[] = {
"Bob"sv,
"Carl"sv
};
The advantage of std::string_view is that it also has information about the length of the string. However, while std::string_view is compatible with most of the C++ standard library facilities that handle strings, it's not so simple to use it together with functions that expect C-style null-terminated strings. If you need null-terminated strings, I'd suggest to simply use an array of pointers to strings as shown at the very beginning of this answer…
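To tie this back to the question's goal of looping over the names, here is a small sketch using the array of pointers. The name "Megan" comes from the question; `name_count` and `print_all_names` are hypothetical names chosen for this example. The element count is derived from the array itself, so adding a name needs no other change.

```cpp
#include <cstddef>
#include <cstdio>

// One pointer per name; the characters stay in the string literals.
constexpr const char* names_array[] = { "Bob", "Carl", "Megan" };

// Number of entries, computed at compile time from the array itself.
constexpr std::size_t name_count = sizeof(names_array) / sizeof(names_array[0]);

// Hypothetical stand-in for the question's per-name processing.
void print_all_names() {
    for (std::size_t i = 0; i < name_count; ++i) {
        std::printf("%s\n", names_array[i]);
    }
}
```

Each entry costs one pointer plus the characters of the name, so names of different lengths do not need padding.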

A char can hold only one character.
If you want to use char, you can do it like this:
char name0[4] = "Bob";   // 3 letters plus the terminating '\0'
char name1[5] = "Carl";  // 4 letters plus the terminating '\0'
char *nameptr[2] = {&name0[0], &name1[0]};
Actually, this is pretty awkward.
I suggest you use std::string instead:
std::string name[2] = {"Bob","Carl"};
This code is acceptable.

Change endianness of entire struct in C++

I am writing a parser in C++ to parse a well-defined binary file. I have declared all the required structs. Since only particular fields are of interest to me, I skip the other fields in my structs by declaring char arrays whose size equals the number of skipped bytes. So I just read the file into a char array and cast the char pointer to my struct pointer. The problem is that all data fields in that binary are in big-endian order, so after the cast I need to change the endianness of all the struct fields. One way is to do it manually for each field, but there are various structs with many fields, so that would be very cumbersome. What's the best way to achieve this? Since I'll be parsing huge files (terabytes), I need a fast way to do it.
EDIT: I have used __attribute__((packed)), so there is no need to worry about padding.

If you can do misaligned accesses with no penalty, and you don't mind compiler- or platform-specific tricks to control padding, this can work. (I assume you are OK with this since you mention __attribute__((packed))).
In this case the nicest approach is to write value wrappers for your raw data types, and use those instead of the raw type when declaring your struct in the first place. Remember the value wrapper must be trivial/POD-like for this to work. If you have a POSIX platform you can use ntohs/ntohl for the endian conversion; it's likely to be better optimized than whatever you write yourself.
If misaligned accesses are illegal or slow on your platform, you need to deserialize instead. Since we don't have reflection yet, you can do this with the same value wrappers (plus an Ignore<N> placeholder that skips N bytes for fields you're not interested in), and declare them in a tuple instead of a struct - you can iterate over the members in a tuple and tell each to deserialize itself from the message.
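A rough sketch of what such value wrappers could look like. This is one possible variation, not necessarily what the answer had in mind: the wrapper stores raw bytes rather than the integer itself, so every member has alignment 1 and no packing attribute is needed. `BigEndian`, `Ignore`, `Header`, its field layout, and `read_header` are all hypothetical names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical wrapper: an unsigned integer stored in big-endian byte order.
template <typename T>
struct BigEndian {
    static_assert(std::is_unsigned<T>::value, "sketch handles unsigned integers only");
    unsigned char bytes[sizeof(T)];

    // Assemble the value byte by byte, most significant byte first.
    operator T() const {
        T value = 0;
        for (unsigned char b : bytes) {
            value = static_cast<T>((value << 8) | b);
        }
        return value;
    }
};

// Hypothetical placeholder for N bytes the parser does not care about.
template <std::size_t N>
struct Ignore {
    unsigned char bytes[N];
};

// Hypothetical record layout, declared with wrappers instead of raw integers.
struct Header {
    BigEndian<std::uint16_t> version;
    Ignore<6> reserved;
    BigEndian<std::uint32_t> length;
};

static_assert(sizeof(Header) == 12, "wrappers must not introduce padding");
static_assert(std::is_trivially_copyable<Header>::value, "must be safe to memcpy");

// Copy one record out of the raw buffer; fields convert lazily when read.
Header read_header(const unsigned char* raw) {
    assert(raw != nullptr);
    Header h;
    std::memcpy(&h, raw, sizeof h);
    return h;
}
```

Only the fields that are actually read pay for a conversion, which may matter when most of each record is skipped.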

One way to do that is to combine the C preprocessor with C++ operators. Write a couple of C++ classes like this one:
#include "immintrin.h"
class FlippedInt32
{
int value;
public:
inline operator int() const
{
return _bswap( value );
}
};
class FlippedInt64
{
__int64 value;
public:
inline operator __int64() const
{
return _bswap64( value );
}
};
Then,
#define int FlippedInt32
before including the header that defines these structures. #undef immediately after the #include.
This will replace all int fields in the structures with FlippedInt32, which has the same size but returns flipped bytes.
If it’s your own structures which you can modify you don’t need the preprocessor part. Just replace the integers with the byte-flipping classes.

If you can come up with a list of offsets (in bytes, relative to the top of the file) of the fields that need endian-conversion, as well as the size of those fields, then you could do all of the endian-conversion with a single for-loop, directly on the char array. E.g. something like this (pseudocode):
struct EndianRecord {
size_t offsetFromTop;
size_t fieldSizeInBytes;
};
std::vector<EndianRecord> todoList;
// [populate the todo list here...]
char * rawData = [pointer to the raw data];
for (size_t i=0; i<todoList.size(); i++)
{
const EndianRecord & er = todoList[i];
// ByteSwap() reverses the given number of bytes in place
ByteSwap(&rawData[er.offsetFromTop], er.fieldSizeInBytes);
}
struct MyPackedStruct * data = (struct MyPackedStruct *) rawData;
// Now you can just read the member variables
// as usual because you know they are already
// in the correct endian-format.
... of course the difficult part is coming up with the correct todoList, but since the file format is well-defined, it should be possible to generate it algorithmically (or better yet, create it as a generator with e.g. a GetNextEndianRecord() method that you can call, so that you don't have to store a very large vector in memory)
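A compilable sketch of that loop, under a few assumptions: the host is little-endian (so every listed field really does need swapping), `ByteSwap` is implemented here with `std::reverse`, and `SwapAllFields` is a hypothetical name for the driver function. How the list of records gets populated is still left to the file format.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct EndianRecord {
    std::size_t offsetFromTop;     // where the field starts in the buffer
    std::size_t fieldSizeInBytes;  // how many bytes the field occupies
};

// Reverse the bytes of one field in place.
inline void ByteSwap(char* field, std::size_t size) {
    std::reverse(field, field + size);
}

// Swap every listed field directly inside the raw buffer.
void SwapAllFields(char* rawData, std::size_t rawSize,
                   const std::vector<EndianRecord>& todoList) {
    for (const EndianRecord& er : todoList) {
        // A record that points past the buffer is a bug in the list.
        assert(er.offsetFromTop + er.fieldSizeInBytes <= rawSize);
        ByteSwap(rawData + er.offsetFromTop, er.fieldSizeInBytes);
    }
}
```

For very large files the same function can be called once per chunk with offsets relative to the chunk, so the whole file never needs to be in memory at once.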

C++ Unicode: Bytes, Code Points and Graphemes

So, I'm building a scripting language and one of my goals is convenient string operations. I tried some ideas in C++.
String as sequences of bytes and free functions that return vectors containing the code-points indices.
A wrapper class that combines a string and a vector containing the indices.
Both ideas had a problem, and that problem was, what should I return. It couldn't be a char, and if it was a string it would be wasted space.
I ended up creating a wrapper class around a char array of exactly 4 bytes: a string that has exactly 4 bytes in memory, no more nor less.
After creating this class, I felt tempted to just wrap it in a std::vector of it in another class and build from there, thus making a string type of code-points. I don't know if this is a good approach, it would end up being much more convenient but it would end up wasting more space.
So, before posting some code, here's a more organized list of ideas.
My character type would be not a byte, nor a grapheme but rather a code-point. I named it a rune like the one in the Go language.
A string as a series of decomposed runes, thus making indexing and slicing O(1).
Because a rune is now a class and not a primitive, it could be expanded with methods for detecting Unicode whitespace: mystring[0].is_whitespace()
I still don't know how to handle graphemes.
Curious fact! An odd thing about the way I built the prototype of the rune class is that it always prints as UTF-8. Because my rune is not an int32 but a 4-byte string, this ends up having some interesting properties.
My code:
class rune {
char data[4] {};
public:
rune(char c) {
data[0] = c;
}
// This constructor needs a string, a position and an offset!
rune(std::string const & s, size_t p, size_t n) {
for (size_t i = 0; i < n; ++i) {
data[i] = s[p + i];
}
}
void swap(rune & other) {
rune t = *this;
*this = other;
other = t;
}
// Output as UTF8!
friend std::ostream & operator <<(std::ostream & output, rune input) {
for (size_t i = 0; i < 4; ++i) {
if (input.data[i] == '\0') {
return output;
}
output << input.data[i];
}
return output;
}
};
Error handling ideas:
I don't like to use exceptions in C++. My idea is, if the constructor fails, initialize the rune as 4 '\0', then overload the bool operator explicitly to return false if the first byte of the rune happens to be '\0'. Simple and easy to use.
So, thoughts? Opinions? Different approaches?
Even if my rune string is too much, at least I have a rune type. Small and fast to copy. :)

It sounds like you're trying to reinvent the wheel.
There are, of course, two ways you need to think about text:
As an array of codepoints
As an encoded array of bytes.
In some codebases, those two representations are the same (and all encodings are basically arrays of char32_t or unsigned int). In some (I'm inclined to say "most" but don't quote me on that), the encoded array of bytes will use UTF-8, where the codepoints are converted into variable lengths of bytes before being placed into the data structure.
And of course many codebases simply ignore unicode entirely and store their data in ASCII. I don't recommend that.
For your purposes, while it does make sense to write a class to "wrap around" your data (though I wouldn't call it a rune, I'd probably just call it a codepoint), you'll want to think about your semantics.
You can (and probably should) treat all std::string objects as UTF-8 encoded strings, and prefer this as your default interface for dealing with text. It's safe for most external interfaces (the only time it will fail is when interfacing with a UTF-16 input, and you can write corner cases for that), and it'll save you the most memory, while still obeying common string conventions (it's lexicographically comparable, which is the big one).
If you need to work with your data in codepoint form, then you'll want to write a struct (or class) called codepoint, with the following useful functions and constructors
While I have had to write code that handles text in codepoint form (notably for a font renderer), this is probably not how you should store your text. Storing text as codepoints leads to problems later on when you're constantly comparing against UTF-8 or ASCII encoded strings.
code:
struct codepoint {
char32_t val;
codepoint(char32_t _val = 0) : val(_val) {}
codepoint(std::string const& s);
codepoint(std::string::const_iterator begin, std::string::const_iterator end);
//I don't know the UTF-8→codepoint conversion off-hand. There are lots of places
//online that show how to do this
std::string to_utf8() const;
//Again, look up an algorithm. They're not *too* complicated.
void append_to_string_as_utf8(std::string & s) const;
//This might be more performant if you're trying to reduce how many dynamic memory
//allocations you're making.
//codepoint(std::wstring const& s);
//std::wstring to_utf16() const;
//void append_to_string_as_utf16(std::wstring & s) const;
//Anything else you need, equality operator, comparison operator, etc.
};
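Since the struct above leaves the conversions as an exercise, here is a sketch of the two directions as free functions. It follows the standard UTF-8 bit layout but deliberately skips validation: it assumes the code point is a valid Unicode scalar value and that the input string is well-formed UTF-8. The function names `append_as_utf8` and `decode_utf8` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append one code point to s as UTF-8 (1 to 4 bytes).
// Assumes cp <= 0x10FFFF and is not a surrogate; nothing is validated.
void append_as_utf8(std::string& s, char32_t cp) {
    if (cp < 0x80) {
        s += static_cast<char>(cp);
    } else if (cp < 0x800) {
        s += static_cast<char>(0xC0 | (cp >> 6));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        s += static_cast<char>(0xE0 | (cp >> 12));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        s += static_cast<char>(0xF0 | (cp >> 18));
        s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode the code point that starts at s[pos] and advance pos past it.
// Assumes well-formed UTF-8; malformed input is not detected.
char32_t decode_utf8(const std::string& s, std::size_t& pos) {
    assert(pos < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    char32_t cp;
    std::size_t extra;  // number of continuation bytes that follow
    if (lead < 0x80)             { cp = lead;        extra = 0; }
    else if ((lead >> 5) == 0x6) { cp = lead & 0x1F; extra = 1; }
    else if ((lead >> 4) == 0xE) { cp = lead & 0x0F; extra = 2; }
    else                         { cp = lead & 0x07; extra = 3; }
    ++pos;
    for (; extra > 0; --extra, ++pos) {
        assert(pos < s.size());
        // Each continuation byte contributes its low six bits.
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos]) & 0x3F);
    }
    return cp;
}
```

A production version would also reject overlong encodings, surrogates, and truncated sequences; this sketch only shows the bit manipulation.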

C++ What is wrong about using this approach instead of enums when I want a string representation?

There are several questions around concerning this topic (e.g. here and here). I am a bit surprised how lengthy the proposed solutions are. Also, I am a bit lazy and would like to avoid maintaining an extra list of strings for my enums.
I came up with the following and I wonder if there is anything fundamentally wrong with my approach...
class WEEKDAY : public std::string{
public:
static const WEEKDAY MONDAY() {return WEEKDAY("MONDAY");}
static const WEEKDAY TUESDAY(){return WEEKDAY("TUESDAY");}
/* ... and so on ... */
private:
WEEKDAY(std::string s):std::string(s){};
};
I still have to type the name/string representation more than once, but at least it's now all on a single line for each possible value, and in total it does not take many more lines than a plain enum. Using these WEEKDAYs looks almost identical to using enums:
bool isAWorkingDay(WEEKDAY w){
if (w == WEEKDAY::MONDAY()){return true;}
/* ... */
return false;
}
and it's straightforward to get the "string representation" (well, in fact it is just a string):
std::cout << WEEKDAY::MONDAY() << std::endl;
I am still relatively new to C++ (not in writing but in understanding ;), so maybe there are things that can be done with enums that cannot be done with such kind of constants.

You could use the preprocessor to avoid duplicating the names:
#define WEEKDAY_FACTORY(DAY) \
static const WEEKDAY DAY() {return WEEKDAY(#DAY);}
WEEKDAY_FACTORY(MONDAY)
WEEKDAY_FACTORY(TUESDAY)
// and so on
Whether the deduplication is worth the obfuscation is a matter of taste. It would be more efficient to use an enumeration rather than a class containing a string in most places; I'd probably do that, and only convert to a string when needed. You could use the preprocessor to help with that in a similar way:
char const * to_string(WEEKDAY w) {
switch (w) {
#define CASE(DAY) case DAY: return #DAY;
CASE(MONDAY)
CASE(TUESDAY)
// and so on
#undef CASE
}
return "UNKNOWN";
}
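If the remaining duplication between the enum and the switch is still a concern, a related preprocessor idiom, often called an X-macro, can generate both from a single list. This is a different technique from the one shown above, sketched here under the assumption that WEEKDAY is a scoped enum; `WEEKDAY_LIST`, `AS_ENUMERATOR`, and `AS_CASE` are hypothetical macro names.

```cpp
// Single source of truth for the day names (remaining days omitted for brevity).
#define WEEKDAY_LIST(X) \
    X(MONDAY)           \
    X(TUESDAY)          \
    X(WEDNESDAY)

// Expand the list once as enumerators...
enum class WEEKDAY {
#define AS_ENUMERATOR(DAY) DAY,
    WEEKDAY_LIST(AS_ENUMERATOR)
#undef AS_ENUMERATOR
};

// ...and once more as switch cases that return the stringified name.
constexpr const char* to_string(WEEKDAY w) {
    switch (w) {
#define AS_CASE(DAY) case WEEKDAY::DAY: return #DAY;
        WEEKDAY_LIST(AS_CASE)
#undef AS_CASE
    }
    return "UNKNOWN";
}
```

Adding a day then means touching only `WEEKDAY_LIST`, at the cost of even more macro machinery, so the same matter-of-taste caveat applies.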

How do I write setters and getters for an array? (c++)

I'm writing a class in C++, but I am not sure how to create the setters and getters for the arrays (sorry for it being a basic question!). I am getting the following error:
expected primary expression before ']' token
Here is my code:
Class planet: public body
{
private:
string name[];
string star[];
public:
void nameSetter (string h_name[])
{
name[] = h_name[];
}
};
Once again, I am sorry for such a silly question. I know I am not passing an index through; however, when I add an index it throws up a large number of errors!

string name[];
This is not a valid member declaration: the array has no size. Use vectors instead:
#include <string>
#include <vector>
class planet: public body
{
private:
vector<string> name;
vector<string> star;
public:
void nameSetter (const vector<string> &h_name)
{
name = h_name;
}
};

Arrays in C++ have compile-time fixed sizes. You can't have a declaration like string name[]; because it leaves the size empty. You can't do that unless you provide an initialization list from which the size is determined.
In addition, array type arguments are transformed to pointer arguments. So your string h_name[] argument is actually a string* h_name.
name[] = h_name[];
This line doesn't make much sense. It's almost like you're trying to access elements of name and h_name without giving an index. Perhaps you were intending to assign the h_name array to the name array, like so:
name = h_name;
However, as we've just seen, h_name is actually a pointer. And in fact, you can't assign to an array anyway, so even if h_name were an array, this still wouldn't work.
You'll be much better off using a standard container like std::vector. It appears that you want dynamically sized arrays anyway, so this will make that easy.
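As an illustration of that advice, here is a sketch of a class with both a setter and getters around a std::vector. The empty `body` base class is a stand-in for the one in the question, and the member and function names (`names_`, `set_names`, `names`, `name_at`) are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class body {};  // hypothetical stand-in for the question's base class

class planet : public body {
public:
    // Setter: take the vector by value and move it into the member,
    // so callers can pass either an existing vector or a temporary.
    void set_names(std::vector<std::string> names) {
        names_ = std::move(names);
    }

    // Getter for the whole collection: a const reference avoids a copy.
    const std::vector<std::string>& names() const {
        return names_;
    }

    // Getter for one element; the index is checked in debug builds.
    const std::string& name_at(std::size_t i) const {
        assert(i < names_.size());
        return names_[i];
    }

private:
    std::vector<std::string> names_;
};
```

Unlike a raw array, the vector can be assigned as a whole and knows its own size, which is what makes the setter a one-liner.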

Even though an answer has been selected, I think maybe the original question may have been misunderstood.
I think what the OP intended was that each instance of planet should have 1 name and 1 star; so the array notation he's used in his code is a misunderstanding on his part about arrays and strings. Based on this assumption I will continue.
When you declare
string name[];
I believe you just want to hold the name of 1 planet, in which case you don't need an array, you just need a single string.
ie
string name;
The same goes for star.
This would make the code
class planet: public body
{
private:
string name;
string star;
public:
void nameSetter (const string& h_name)
{
name = h_name;
}
};
