In place Tokenization of std::string into a map of Key Value


In C, the delimiters can be replaced by null characters, and a map of char* -> char* with a comparison function would work.
I am trying to figure out the fastest possible way to do this in modern C++. The idea is to avoid copying characters into the map.
std::string sample_string("name=alpha;title=something;job=nothing");
to
std::map<std::string,std::string> sample_map;
Without copying characters.
It's OK to lose the original input string.

Two std::strings cannot point to the same underlying bytes, so no, it's not possible to do this with strings.
To avoid copying bytes, you could use iterators:
// requires <algorithm>, <map> and <string>
struct Slice {
    std::string::iterator begin, end;
    bool operator<(const Slice& that) const {
        return std::lexicographical_compare(begin, end, that.begin, that.end);
    }
};
std::map<Slice, Slice> sample_map;
Beware that if you modify the original string, the iterators may all be invalidated.
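If C++17 is available, std::string_view can play the role of the Slice above. The following is only a sketch under that assumption: tokenize_in_place and its separator parameters are hypothetical names, and the resulting map holds views into the input rather than std::string objects.

```cpp
#include <map>
#include <string>
#include <string_view>

// Hypothetical helper: splits "k1=v1;k2=v2" into views over `input`.
// No characters are copied; every key and value refers to bytes owned by
// the caller's string, which must outlive the map and must not be modified.
std::map<std::string_view, std::string_view>
tokenize_in_place(std::string_view input, char pair_sep = ';', char kv_sep = '=')
{
    std::map<std::string_view, std::string_view> result;
    while (!input.empty()) {
        // Cut off the next "key=value" chunk.
        const auto pair_end = input.find(pair_sep);
        const std::string_view pair = input.substr(0, pair_end);
        input.remove_prefix(pair_end == std::string_view::npos ? input.size()
                                                               : pair_end + 1);
        // Split the chunk at the first '='; chunks without one are skipped.
        const auto eq = pair.find(kv_sep);
        if (eq == std::string_view::npos) continue;
        result.emplace(pair.substr(0, eq), pair.substr(eq + 1));
    }
    return result;
}
```

Calling tokenize_in_place(sample_string) on the string from the question yields three entries. As with the iterator version, the views dangle as soon as sample_string is modified or destroyed.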

Related

Erase inside a std::string by std::string_view

I need to find and then erase a portion of a string (a substring). string_view seems such a good idea, but I cannot make it work with string::erase:
// guaranteed to return a view into `str`
auto gimme_gimme_gimme(const std::string& str) -> std::string_view;
auto after_midnight(std::string& str)
{
    auto man = gimme_gimme_gimme(str);
    str.erase(man); // way too hopeful, not a chance though
    str.erase(man.begin(), man.end()); // nope
    str.erase(std::distance(str.begin(), man.begin()), man.size()); // nope
    str.erase(std::distance(str.data(), man.data()), man.size()); // nope again
    // for real???
}
Am I overthinking this? Given a std::string_view into a std::string how to erase that part of the string? Or am I misusing string_view?

The string view could indeed be empty, or it could be a view to the outside of the container. Your suggested erase overload, as well as the implementation of the function in your answer relies on a pre-condition that the string view is to the same string object.
Of course, the iterator overloads are very much analogous and rely on the same pre-condition. Such a pre-condition is conventional for iterators, however, and unconventional for string views.
I don't think that string view is an ideal way to represent the sub range in this case. Instead, I would suggest using a relative sub range based on the indices. For example:
struct sub_range {
    size_t begin;
    size_t count;
    constexpr size_t past_end() const noexcept {
        return begin + count;
    }
};
It is a matter of taste whether to use end (i.e. past_end) or count for the second member, and to provide the other as a function. Regardless, there should be no confusion because the member will have a name. Using count is somewhat more conventional with indices.
Another choice is whether to use signed or unsigned indices. Signed indices can be used to represent backwards ranges. std::string interface doesn't understand such ranges however.
Example usage:
auto gimme_gimme_gimme(const std::string& str) -> sub_range;
auto after_midnight(std::string& str)
{
    auto man = gimme_gimme_gimme(str);
    str.erase(man.begin, man.count);
}

Am I overthinking this?
You're underthinking it, unless I'm missing something obvious. To make the code compile you need this:
auto gimme_gimme_gimme(const std::string& str) -> std::string_view;
auto after_midnight(std::string& str)
{
    auto man = gimme_gimme_gimme(str);
    str.erase(std::distance(std::as_const(str).data(), man.data()), man.size()); // urrr... growling in pain
}
But wait!! There's more! Notice I said "to make it compile". The code is error prone!! Because...
std::string::data cannot return nullptr, but an empty string_view can be represented either as (valid pointer inside the string + size 0) or as (nullptr + size 0). The problem arises when string_view::data is nullptr, because std::distance is then applied to two unrelated pointers.
So you need to make sure that the string_view always points inside the string, even if the view is empty. Or do extra checks on the erase side.
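The "extra checks on the erase side" could look roughly like the sketch below. erase_view is a hypothetical helper, not part of the standard library, and it assumes that comparing pointers through std::less is an acceptable way to test whether the view lies inside the string.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>

// Hypothetical helper: erases the part of `str` that `view` refers to.
// Returns false, leaving `str` untouched, if `view` does not lie inside
// `str` (this includes the nullptr + size 0 case).
bool erase_view(std::string& str, std::string_view view)
{
    const char* const first = str.data();
    const char* const last = first + str.size();
    // std::less yields a total order even for unrelated pointers.
    const std::less<const char*> before{};
    if (view.data() == nullptr || before(view.data(), first) ||
        before(last, view.data())) {
        return false;
    }
    const auto offset = static_cast<std::size_t>(view.data() - first);
    if (view.size() > str.size() - offset) return false;
    str.erase(offset, view.size());
    return true;
}
```

The caller can then decide what a false result means, instead of running into undefined behaviour inside std::distance.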

Get an iterator from a char pointer (C++)

I am challenging myself to write a palindrome tester using only STL algorithms, iterators etc. I also want the program to work with raw strings. Below, I use the raw pointer pal in the copy_if algorithm, but instead, how could I define an iterator to go here, i.e. by using something like begin(pal) and end(pal + size)?
#include <algorithm>
#include <iterator>
#include <cctype>
#include <cstring>  // strlen
#include <string>
using namespace std;
bool isPalindrome(const char* pal) {
    if (!pal) { return(false); }
    int size = strlen(pal);
    string pal_raw;
    pal_raw.reserve(size);
    // Copy alphabetical chars only (no spaces, punctuations etc.) into pal_raw
    copy_if(pal, pal+size, back_inserter(pal_raw),
        [](char item) {return isalpha(item); }
    );
    // Test if palindromic, ignoring capitalisation
    bool same = equal(begin(pal_raw), end(pal_raw), rbegin(pal_raw), rend(pal_raw),
        [](char item1, char item2) {return tolower(item1) == tolower(item2); }
    );
    return same;
}
int main(){
    char pal[] = "Straw? No, too stupid a fad. I put soot on warts.";
    bool same = isPalindrome(pal);
    return 0;
}
Bonus Question: Is it possible to eliminate the need to copy_if() by incrementing the iterators 'in place' from within equal() i.e. when !isalpha(item)?

Iterators implement the concept of pointers, when it comes to C++ library algorithms. And, as you've discovered, C++ library algorithms that take iterators are perfectly happy to also take pointers. It's the same concept.
And when you already have pointers to begin with, there is no separate iterator type that they need to be converted to.
It is true that
std::begin(arr)
and
std::end(arr)
are defined on flat arrays. But, guess what: they return a pointer to the beginning and the end of the array, and not an iterator class of some kind.
However, you cannot use std::begin() and std::end() here, because by the time you need them, inside your function, the array has already decayed to a char *. std::begin() and std::end() work on real arrays, not on decayed pointers.
If you insist on using iterators, you should pass a std::string to your palindrome function, instead of a char *. std::string implements a begin() and an end() method that return a std::string::iterator, that you can use.

If I were doing that and wanted it to work for different types, I'd templatize the function, for example:
template<class ITER_T>
bool isPalindrome(ITER_T begin, ITER_T end) {
    // check for palindrome generically between begin and end
}
That way it would work for const char * pointers, std::string and std::vector<char> with the same code. And if implemented properly, even for other containers, as long as you can get suitable iterators for them and the item type has a comparison operator defined.
As a bonus, I could then also check if a part of character array, string or a vector is a palindrome.
By the way, this:
equal(begin(pal_raw), end(pal_raw), rbegin(pal_raw), rend(pal_raw), ...
unnecessarily checks each character twice when the string is a palindrome, which is not very efficient. In this case you could perhaps do better with a "manual" loop (not sure if there is a std algo for that).
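One way to stay with a standard algorithm while comparing each pair only once is to hand std::equal just the first half of the range. A minimal sketch, with is_palindrome_half as a hypothetical name and std::string as the assumed input type:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Compares only the first half against the reversed second half, so each
// pair of characters is examined once. Case is ignored, as in the question.
bool is_palindrome_half(const std::string& s)
{
    return std::equal(s.begin(), s.begin() + s.size() / 2, s.rbegin(),
                      [](unsigned char a, unsigned char b) {
                          return std::tolower(a) == std::tolower(b);
                      });
}
```

For a string of odd length the middle character is never compared, which is fine because it always matches itself.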
If you'd like to make the begin(arr)/end(arr) work inside the overloaded function, you'd need to templatize it anyway, like this:
template<size_t N>
bool isPalindrome(const char (&arr)[N]) {
...
However, you then get a separate instantiation for each different array size. So it is better anyway to templatize on the iterators and get a single instantiation for any char array size.
So to answer the "bonus question", it is indeed possible to avoid creating the temporary string (i.e. dynamic memory allocation) at all, by iterating over the array directly:
template<typename ITER_T>
bool isPalindrome(ITER_T begin, ITER_T end) {
    while (begin < end) {
        if (tolower(*begin++) != tolower(*--end))
            return false;
    }
    return true;
}
// end(pal) points past the terminating '\0' of the array, so step back by one
bool same = isPalindrome(begin(pal), end(pal) - 1);
I leave the isalpha test to you as practice. Hint: you can do that before the equality check and increment/decrement begin/end appropriately (hint #2: the solution I have in mind uses the continue keyword).
The same goes for making it work with element types other than char: type traits can then be used to abstract out the isalpha/tolower calls via template specializations.
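For reference, one possible shape of that exercise is sketched below. It is not necessarily the solution the answer has in mind: is_palindrome_alpha is a hypothetical name, and the code assumes char elements and bidirectional iterators.

```cpp
#include <cctype>
#include <cstring>

// Sketch: walks inward from both ends, skipping non-letters and ignoring case.
template <typename Iter>
bool is_palindrome_alpha(Iter first, Iter last)
{
    const auto is_alpha = [](unsigned char c) { return std::isalpha(c) != 0; };
    const auto lower = [](unsigned char c) { return std::tolower(c); };
    while (first != last) {
        if (!is_alpha(*first)) { ++first; continue; }  // skip on the left
        --last;                          // step onto the last unread element
        if (first == last) break;        // a single middle letter remains
        if (!is_alpha(*last)) continue;  // skip on the right
        if (lower(*first) != lower(*last)) return false;
        ++first;
    }
    return true;
}

// Convenience overload for null-terminated strings; nullptr is rejected.
bool is_palindrome_alpha(const char* s)
{
    return s != nullptr && is_palindrome_alpha(s, s + std::strlen(s));
}
```

No temporary string is allocated, and the same template accepts std::string or std::vector<char> iterators.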

Interpret a std::string as a std::vector of char_type?

I have a template<typename T> function that takes a const vector<T>&. In said function, I use the vector's cbegin(), cend(), size(), and operator[].
As far as I understand it, both string and vector use contiguous space, so I was wondering if I could reuse the function for both data types in an elegant manner.
Can a std::string be reinterpreted as a std::vector of (the appropriate) char_type? If so, what would the limitations be?

If you make your template take a plain const T& and use only the functions that vector and string share (begin(), end(), etc.), then your code will work with both types.
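A small sketch of that idea follows. count_equal is a hypothetical function invented for illustration; it relies only on size() and operator[], which std::vector and std::basic_string both provide.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical example: counts the elements of `c` that equal `value`.
// The container type is deduced, so std::vector<T> and std::string both work.
template <typename Container>
std::size_t count_equal(const Container& c,
                        const typename Container::value_type& value)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        if (c[i] == value) ++n;
    }
    return n;
}
```

The price is that the function stays a template and therefore has to live in a header.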

Go the STL way and use iterators. Accept an iterator to the beginning and an iterator to the end. It will work with all possible containers, and even with non-containers like streams.

There is no guarantee the layout of string and vector will be the same. They theoretically could be, but they probably aren't in any common implementation. Therefore, you can't do this safely. See Zan's answer for a better solution.
Let me explain: If I am a standard library implementer and decide to implement std::string like so....
template ...
class basic_string {
public:
    ...
private:
    CharT* mData;
    size_t mSize;
};
and decide to implement std::vector like so...
template ...
class vector {
public:
    ...
private:
    T* mEnd;
    T* mBegin;
};
When you reinterpret_cast<string*>(&myVector) you wind up interpreting the pointer to the end of your data as the pointer to the start of your data, and the pointer to the start of your data as the size of your data. If the padding between members is different, or there are extra members, it could get even weirder and more broken than that.
So yes, in order for this to possibly work they both need to store contiguous data, but they also need quite a bit else to be the same between the implementations for it to work.

std::experimental::array_view<const char> (proposal N4512) represents a contiguous buffer of chars.
Writing your own is not hard, and it solves this problem and (in my experience) many more.
Both string and vector are compatible with an array view.
This lets you move your implementation into a .cpp file (and not expose it), gives you the same performance as doing it with std::vector<T> const& and probably the same implementation, avoids duplicating code, and uses light weight contiguous buffer type erasure (which is full of tasty keywords).
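To give an idea of how little is needed, here is a minimal hand-written view. It is a sketch only and does not reproduce the interface of the proposed std::experimental::array_view; count_spaces is a hypothetical consumer added for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Minimal sketch of a non-owning, read-only view over contiguous elements.
template <typename T>
class array_view {
public:
    array_view(const T* data, std::size_t size) : data_(data), size_(size) {}
    // Any container exposing data() and size() over T converts implicitly.
    template <typename Container>
    array_view(const Container& c) : data_(c.data()), size_(c.size()) {}

    const T* begin() const { return data_; }
    const T* end() const { return data_ + size_; }
    std::size_t size() const { return size_; }
    const T& operator[](std::size_t i) const { return data_[i]; }

private:
    const T* data_;
    std::size_t size_;
};

// Hypothetical consumer: one non-template function serves both containers,
// so its definition could live in a .cpp file.
std::size_t count_spaces(array_view<char> text)
{
    std::size_t n = 0;
    for (char c : text) {
        if (c == ' ') ++n;
    }
    return n;
}
```

Both count_spaces(some_string) and count_spaces(some_vector_of_char) compile against the same function. The view does not own the data, so it must not outlive the container it was built from.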

If the key point is that you want to access a contiguous area in memory where instances of a specific char type are stored, then you could define your function as
void myfunc(const CType *p, int size) {
...
}
to make it clear that you assume they must be adjacent in memory.
Then for example to pass the content of a vector the code is simply
myfunc(&myvect[0], myvect.size());
and for a string
myfunc(mystr.data(), mystr.size());
or
myfunc(buffer, n);
for an array.

You can't directly typecast a std::vector to a std::string or vice versa. But using the iterators that STL containers provide does allow you to iterate both a vector and a string in the same way. And if your function requires random access of the container in question then either would work.
std::vector<char> str1 {'a', 'b', 'c'};
std::string str2 = "abc";
template<typename Iterator>
void iterator_function(Iterator begin, Iterator end)
{
    for(Iterator it = begin; it != end; ++it)
    {
        std::cout << *it << std::endl;
    }
}
iterator_function(str1.begin(), str1.end());
iterator_function(str2.begin(), str2.end());
Both of those last two function calls would print the same thing.
Now if you wanted to write a non-template version that handles only characters, whether they are stored in a string or in a vector, you could write something that iterates over the internal array.
void array_function(const char * array, unsigned length)
{
    for(unsigned i = 0; i < length; ++i)
    {
        std::cout << array[i] << std::endl;
    }
}
Both functions would do the same thing in the following scenarios.
std::vector<char> str1 {'a', 'b', 'c'};
std::string str2 = "abc";
iterator_function(str1.begin(), str1.end());
iterator_function(str2.begin(), str2.end());
array_function(str1.data(), str1.size());
array_function(str2.data(), str2.size());
There are always multiple ways to solve a problem. Depending on what you have available any number of solutions might work. Try both and see which works better for your application. If you don't know the iterator type then the char typed array iteration is useful. If you know you will always have the template type to pass in then the template iterator method might be more useful.

The way your question is put at the moment is a bit confusing. If you mean to be asking "is it safe to cast a std::vector type to a std::string type or vice versa if the vector happens to contain char values of the appropriate type?", the answer is: no way, don't even think about it! If you're asking: "can I access the contiguous memory of non-empty sequences of char type if they're of the type std::vector or std::string?" then the answer is, yes you can (with the data() member function).

Parsing key/value pairs from a string in C++

I'm working in C++11, no Boost. I have a function that takes as input a std::string that contains a series of key-value pairs, delimited with semicolons, and returns an object constructed from the input. All keys are required, but may be in any order.
Here is an example input string:
Top=0;Bottom=6;Name=Foo;
Here's another:
Name=Bar;Bottom=20;Top=10;
There is a corresponding concrete struct:
struct S
{
    const uint8_t top;
    const uint8_t bottom;
    const string name;
};
I've implemented the function by repeatedly running a regular expression on the input string, once per member of S, and assigning the captured group of each to the relevant member of S, but this smells wrong. What's the best way to handle this sort of parsing?

For an easily readable solution, you can e.g. use std::regex_token_iterator and a sorted container to distinguish the attribute-value pairs (alternatively, use an unsorted container and std::sort).
std::regex r{R"([^;]+;)"};
std::set<std::string> tokens{std::sregex_token_iterator{std::begin(s), std::end(s), r}, std::sregex_token_iterator{}};
Now the attribute value strings are sorted lexicographically in the set tokens, i.e. the first is Bottom, then Name and last Top.
Lastly use a simple std::string::find and std::string::substr to extract the desired parts of the string.
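Put together, the approach might look like the sketch below. parse_pairs is a hypothetical name, and returning a std::map of strings is an assumption made for illustration; the answer itself only describes the tokenizing step and the use of find and substr.

```cpp
#include <map>
#include <regex>
#include <set>
#include <string>

// Sketch: tokenizes "Key=Value;" chunks as in the answer, then splits each
// chunk with find/substr.
std::map<std::string, std::string> parse_pairs(const std::string& s)
{
    static const std::regex r{R"([^;]+;)"};
    const std::set<std::string> tokens{
        std::sregex_token_iterator{std::begin(s), std::end(s), r},
        std::sregex_token_iterator{}};

    std::map<std::string, std::string> result;
    for (const std::string& token : tokens) {
        const auto eq = token.find('=');
        if (eq == std::string::npos) continue;  // not a key/value chunk
        // The value runs from after '=' up to, but excluding, the trailing ';'.
        result[token.substr(0, eq)] =
            token.substr(eq + 1, token.size() - eq - 2);
    }
    return result;
}
```

The numeric members of S would still need a conversion such as std::stoi on the extracted values.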

Do you care about performance or readability? If readability is good enough, then pick your favorite version of split and away we go:
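std::map<std::string, std::string> tag_map;
for (const std::string& tag : split(input, ';')) {
    auto key_val = split(tag, '=');  // split the current tag, not the whole input
    tag_map.insert(std::make_pair(key_val[0], key_val[1]));
}
// Keys must match the input's spelling ("Top", not "top"); the casts avoid narrowing errors.
S s{static_cast<uint8_t>(std::stoi(tag_map["Top"])),
    static_cast<uint8_t>(std::stoi(tag_map["Bottom"])),
    tag_map["Name"]};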
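The answer leaves the choice of split open. One possible version, sketched here with std::getline and with the assumption that empty pieces (such as the one after a trailing ';') should be dropped:

```cpp
#include <sstream>
#include <string>
#include <vector>

// One possible split(): breaks `text` at every `delim`, dropping empty pieces.
std::vector<std::string> split(const std::string& text, char delim)
{
    std::vector<std::string> parts;
    std::istringstream stream(text);
    std::string piece;
    while (std::getline(stream, piece, delim)) {
        if (!piece.empty()) parts.push_back(piece);
    }
    return parts;
}
```

Because empty pieces are dropped, a tag such as "Name=" yields a single element, so real code should check the size of key_val before indexing it.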

std::map with a char[5] key that may contain null bytes

The keys are binary garbage and I only defined them as chars because I need a 1-byte array.
They may contain null bytes.
Now the problem is, when I have two keys, ab(0)a and ab(0)b ((0) being a null byte), the map treats them as strings, considers them equal and I don't get two unique map entries.
What's the best way to solve this?

Why not use std::string as key:
//must use this as:
std::string key1("ab\0a",4);
std::string key2("ab\0b",4);
std::string key3("a\0b\0b",5);
std::string key4("a\0\0b\0b",6);
Second argument should denote the size of the c-string. All of the above use this constructor:
string ( const char * s, size_t n );
description of which is this:
Content is initialized to a copy of the string formed by the first n characters in the array of characters pointed by s.

Use std::array<char,5> or maybe even better (if you want really to handle keys as binary values) std::bitset
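A short sketch of the std::array variant follows. The Key alias, the int mapped type and make_example_map are assumptions made for illustration.

```cpp
#include <array>
#include <map>

using Key = std::array<char, 5>;

// std::array compares all five elements lexicographically, so embedded
// null bytes are ordinary values rather than terminators.
std::map<Key, int> make_example_map()
{
    std::map<Key, int> m;
    m[Key{{'a', 'b', '\0', 'a', 'x'}}] = 1;
    m[Key{{'a', 'b', '\0', 'b', 'x'}}] = 2;
    return m;
}
```

The two keys differ only after the null byte, yet they end up as two separate entries.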

If you really want to use char[5] as your key, consider writing your own comparison class to compare keys correctly. The map class requires one of these in order to organize its contents. By default, it uses a version that doesn't work with your key.
You'd want to write your own Compare class to replace less<Key>, which is the third template parameter to map.

If you only need to distinguish them and don't rely on a lexicographical ordering you could treat each key as uint64_t. This has the advantage, that you could easily replace std::map by a hashmap implementation and that you don't have to do anything by hand.
Otherwise you can also write your own comparator along these lines:
class MyKeyComp
{
public:
    bool operator()(const char* lhs, const char* rhs) const
    {
        return lhs[0] == rhs[0] ?
            (lhs[1] == rhs[1] ?
                (lhs[2] == rhs[2] ?
                    (lhs[3] == rhs[3] ? lhs[4] < rhs[4]
                    : lhs[3] < rhs[3])
                : lhs[2] < rhs[2])
            : lhs[1] < rhs[1])
        : lhs[0] < rhs[0];
    }
};
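A shorter comparator with the same purpose can be built on std::memcmp. This is a swapped-in technique rather than the code above: memcmp orders bytes as unsigned char, so the resulting order may differ from element-wise char comparison, but it is still a valid strict ordering over all five bytes. FiveByteLess and FiveByteMap are hypothetical names.

```cpp
#include <cstring>
#include <map>

// Orders keys by their first five bytes; null bytes are compared like any
// other value. Each pointer must refer to at least five readable bytes.
struct FiveByteLess {
    bool operator()(const char* lhs, const char* rhs) const
    {
        return std::memcmp(lhs, rhs, 5) < 0;
    }
};

// The map stores only the pointers, so the key buffers must outlive it.
using FiveByteMap = std::map<const char*, int, FiveByteLess>;
```

Since the map does not copy the key bytes, this fits best when the keys live in storage that outlasts the map.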
