Parse char vector to std::map<string, string> - c++

I have a vector of unsigned char values. The data inside is keys and values in the format "key=value". Each pair is terminated by a '\0' character and the last pair is terminated by a double '\0'. A value can contain a "=" as well, so this delimiter shouldn't be a criterion for the end of a pair, only the terminating '\0'.
From this vector of single unsigned char I want to get a std::map<std::string, std::string> object.
What would be an effective way to go through the vector and fill the map?
Thanks in advance!
This code finds all pairs, but when printing them, my console seems to mess things up, so I suspect invalid characters... maybe it has to do with the fact that I'm using unsigned char?
unsigned char *env, *nxt;
std::string delimiter = "=";
std::string line;
std::map<string, string> myMap;
for (env = &myVector[0]; *env != '\0'; env = nxt + 1)
{
for (nxt = env; *nxt != '\0'; ++nxt)
{
if (nxt >= &myVector[myVector.size()])
{
printf("string not terminated\n");
return -1;
}
}
line = std::string(env, env + myVector.size());
myMap.insert(std::pair<string, string>(
line.substr(0, line.find(delimiter)),
line.substr(line.find(delimiter) + 1, line.size())));
}

Since the key cannot contain the = sign, it's fairly easy:
Split on (generate all positions of) zero characters.
Split created entries on first =.
Use the generated regions as key/value.
To make things clearer (than your double nested loops, ew), I'd use some reasonable data structure to mark the faux-strings in the buffer (if you want to avoid copying), such as pair<unsigned, unsigned>.
So, the signatures of the functions (which in this case tell more than their implementations, I suppose), could look like this:
using BufferRange = pair<unsigned, unsigned>;
using BufferEntry = pair<BufferRange, BufferRange>;
list<BufferRange> splitOnZeroes(Buffer const& b);
BufferEntry splitOnEquality(BufferRange const& br, Buffer const& b);
void addToMap(map<string, string>& m, BufferEntry const& p);
Those are fairly simplified; my code design OCD tells me that BufferRange could be a type carrying not only the numerical indices, but also the reference to the buffer itself. Changing that (if required) left as an exercise.
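The three steps above can also be sketched in a single pass that copies straight into the map instead of building index ranges first. This is only a minimal sketch: the function name parseEnvBlock is made up for illustration, and it assumes the buffer is a std::vector<unsigned char> as in the question.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: parses "key=value\0key=value\0\0" into a map.
// Each entry is split on its FIRST '=' only, so values may contain '='.
std::map<std::string, std::string>
parseEnvBlock(const std::vector<unsigned char>& buf)
{
    std::map<std::string, std::string> result;
    auto it = buf.begin();
    // An empty entry (a '\0' right at the start of an entry) ends the block.
    while (it != buf.end() && *it != '\0') {
        auto entryEnd = std::find(it, buf.end(), '\0'); // end of this entry
        auto eq = std::find(it, entryEnd, '=');         // first '=' inside it
        std::string key(it, eq);
        std::string value;
        if (eq != entryEnd)
            value.assign(eq + 1, entryEnd);
        result.emplace(std::move(key), std::move(value));
        if (entryEnd == buf.end())
            break;                                      // last entry was unterminated
        it = entryEnd + 1;
    }
    return result;
}
```

Because every search is bounded by buf.end(), a missing terminator cannot make the loop read past the vector.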

Related

How to split a string by emojis in C++

I'm trying to take a string of emojis and split them into a vector of each emoji Given the string:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
I'm trying to get:
std::vector<std::string> splitted_emojis = {"😀", "🔍", "🦑", "😁", "🔍", "🎉", "😂", "🤣"};
Edit
I've tried to do:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::vector<std::string> splitted_emojis;
size_t pos = 0;
std::string token;
while ((pos = emojis.find("")) != std::string::npos)
{
token = emojis.substr(0, pos);
splitted_emojis.push_back(token);
emojis.erase(0, pos);
}
But it seems like it throws terminate called after throwing an instance of 'std::bad_alloc' after a couple of seconds.
When trying to check how many emojis are in a string using:
std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::cout << emojis.size() << std::endl; // returns 32
It returns a bigger number, which I assume is the size of the Unicode data. I don't know much about Unicode, but I'm trying to figure out how to detect where the data of an emoji begins and ends so that I can split the string into individual emojis.
I would definitely recommend that you use a library with better unicode support (all large frameworks do), but in a pinch you can get by with knowing that the UTF-8 encoding spreads Unicode characters over multiple bytes, and that the first bits of the first byte determine how many bytes a character is made up of.
I stole a function from boost. The split_by_codepoint function uses an iterator over the input string and constructs a new string using the first N bytes (where N is determined by the byte count function) and pushes it to the ret vector.
// Taken from boost internals
inline unsigned utf8_byte_count(uint8_t c)
{
// if the most significant bit with a zero in it is in position
// 8-N then there are N bytes in this UTF-8 sequence:
uint8_t mask = 0x80u;
unsigned result = 0;
while(c & mask)
{
++result;
mask >>= 1;
}
return (result == 0) ? 1 : ((result > 4) ? 4 : result);
}
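// needs <algorithm>, <cstddef>, <string> and <vector>
std::vector<std::string> split_by_codepoint(const std::string& input) {
std::vector<std::string> ret;
auto it = input.cbegin();
while (it != input.cend()) {
// clamp so that a truncated trailing sequence cannot run past the end
auto count = std::min<std::ptrdiff_t>(utf8_byte_count(*it), input.cend() - it);
ret.emplace_back(it, it + count);
it += count;
}
return ret;
}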
int main() {
std::string emojis = u8"😀🔍🦑😁🔍🎉😂🤣";
auto split = split_by_codepoint(emojis);
std::cout << split.size() << std::endl;
}
Note that this function simply splits a string into UTF-8 strings containing one code point each. Determining if the character is an emoji is left as an exercise: UTF-8-decode any 4-byte characters and see if they are in the proper range.
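For the exercise mentioned above, here is a rough sketch of decoding a 4-byte sequence and testing its range. The names decode_utf8_4 and looks_like_emoji are made up for illustration, and the range U+1F300 to U+1FAFF is only an assumption that covers the emojis in the question; it is not the full Unicode emoji definition.

```cpp
#include <cstdint>
#include <string>

// Decodes a single well-formed 4-byte UTF-8 sequence into its code point.
// Returns 0 if s is not exactly 4 bytes with a 11110xxx lead byte
// followed by three 10xxxxxx continuation bytes.
inline std::uint32_t decode_utf8_4(const std::string& s)
{
    if (s.size() != 4)
        return 0;
    const auto b = [&](int i) { return static_cast<std::uint8_t>(s[i]); };
    if ((b(0) & 0xF8u) != 0xF0u)
        return 0;
    for (int i = 1; i < 4; ++i)
        if ((b(i) & 0xC0u) != 0x80u)
            return 0;
    return (std::uint32_t(b(0) & 0x07u) << 18) |
           (std::uint32_t(b(1) & 0x3Fu) << 12) |
           (std::uint32_t(b(2) & 0x3Fu) << 6)  |
            std::uint32_t(b(3) & 0x3Fu);
}

// ASSUMPTION: a rough range check, not a complete emoji test.
inline bool looks_like_emoji(const std::string& s)
{
    const std::uint32_t cp = decode_utf8_4(s);
    return cp >= 0x1F300u && cp <= 0x1FAFFu;
}
```

Each element returned by split_by_codepoint can be passed to looks_like_emoji; anything shorter than four bytes is rejected.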

How to break up a CString that does not have any delimiters within it?

I have a CString that I want to break up into small strings. It is a string consisting of a constant 2 byte header and 2 byte footer, but the rest of the string has no discernible pattern. I need to break them up based on sizes: so the first two bytes become the header, then I need to extract the next 2, then 3 and so on. (these numbers are in no pattern either)
Example:
CString test = "1010eefabbccde1f1f";
I need
CString header = "1010";
CString test1 = "eefa";
CString test2 = "bbccde";
CString footer = "1f1f";
I read about sscanf being used for this purpose, but I have only managed to use it split strings into int.
Example:
CString test = "101022223333331010";
int a,b,c,d;
sscanf(test,"%02d%02d%03d%02d",&a,&b,&c,&d);
This works for strings containing only numbers. But when I do the same for strings by changing %d to %s, exceptions get raised.
Is there a better way to do this?
My understanding is that given an input string test and vector<size_t> sizes of sizes, you wish to break the string apart into those sizes, and then you wish to take those parts, treat them as hex numbers, and return them in vector<int> result.
I'm going to presume that you have already tested test to ensure the correct number of characters exist. And I'm going to assume that the sizes include the header and footer sizes.
After running something like this:
const auto test = "1010eefabbccde1f1f"s;
const vector<size_t> sizes { 4U, 4U, 6U, 4U };
const auto result = accumulate(cbegin(sizes), cend(sizes), vector<int>(), [&, pos = 0U](auto a, const auto& b) mutable {
a.push_back(stoi(test.substr(pos, b), nullptr, 16));
pos += b;
return a;
});
result will contain:
4112
61178
12307678
7967
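If the pieces are wanted as strings rather than as integers, which is what the question literally shows, the same walk over the sizes works without the hex conversion. A minimal sketch, using std::string instead of CString so that it stays self-contained (with CString, the Mid member would presumably play the role of substr); splitBySizes is a made-up name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: cuts `text` into consecutive pieces of the given sizes.
// Stops early if the text runs out of characters.
std::vector<std::string> splitBySizes(const std::string& text,
                                      const std::vector<std::size_t>& sizes)
{
    std::vector<std::string> parts;
    std::size_t pos = 0;
    for (std::size_t n : sizes) {
        if (pos + n > text.size())
            break;                          // not enough characters left
        parts.push_back(text.substr(pos, n));
        pos += n;
    }
    return parts;
}
```

Called with the example string and the sizes 4, 4, 6, 4, this yields "1010", "eefa", "bbccde" and "1f1f".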

Parsing a non-uniform string into integers

I am writing a parser for .obj files, and there is a part of the file that is in the format
f [int]/[int] [int]/[int] [int]/[int]
and the integers are of unknown length. In each [int]/[int] pair, they both need to be put onto separate arrays. What is the simplest method to separate them as integers?
You can do it with fscanf:
int matched = fscanf(fptr, "f %d/%d %d/%d %d/%d", &a, &b, &c, &d, &e, &f);
if (matched != 6) fail();
or ifstream and sscanf:
char buf[100];
yourIfstream.getline(buf, sizeof(buf));
int matched = sscanf(buf, "f %d/%d %d/%d %d/%d", &a, &b, &c, &d, &e, &f);
if (matched != 6) fail();
Consider using one of the scanf functions (fscanf if you are reading the file using <stdio.h> and FILE*, or sscanf to parse a line in memory buffer).
So, if you have a buffer with data and two integer arrays like this:
int first[3], second[3];
const char *buffer = "f 10/20 1/300 344/2";
Then you can just write:
sscanf(buffer, "f %d/%d %d/%d %d/%d",
&first[0], &second[0], &first[1], &second[1], &first[2], &second[2]);
(The spaces in sscanf's input pattern are not necessary as %d skips the spaces, but they improve readability.)
If you need error checking, then analyse the result of sscanf: this function returns number of successfully entered values (6 for this example if everything was correct).
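The error check described above could be wrapped up like this. It is just a sketch; parseFace is a made-up name, and it assumes exactly three pairs per line as in the question.

```cpp
#include <cstdio>

// Hypothetical helper: fills first[0..2] and second[0..2] from a line such as
// "f 10/20 1/300 344/2". Returns true only if all six integers were read.
bool parseFace(const char* line, int first[3], int second[3])
{
    const int matched = std::sscanf(line, "f %d/%d %d/%d %d/%d",
                                    &first[0], &second[0],
                                    &first[1], &second[1],
                                    &first[2], &second[2]);
    return matched == 6;
}
```

A line with fewer than three pairs makes sscanf return a smaller count, so the function reports failure instead of leaving the arrays half filled without notice.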
I would use regular expressions for this. If you have a C++11-compliant compiler you can use <regex>, otherwise you can look into boost::regex. In Perl-like syntax, your regular expression pattern would look something like this: f ([0-9]+)/([0-9]+) ([0-9]+)/([0-9]+) ([0-9]+)/([0-9]+). Then you take the sub-matches in turn (what's inside the parentheses) and convert them from string or char* to integer with istringstream.
#include <stdlib.h>
long int strtol(const char *nptr, char **endptr, int base);
long long int strtoll(const char *nptr, char **endptr, int base);
The strtol function will both parse an integer from input, and return the place where the integer ends in the string. You could use it like
const char *input = "f 123/234 234/345 345/456";
const char *c = input;
char *endptr;
if (*c++ != 'f') fail();
if (*c++ != ' ') fail();
long l1 = strtol(c, &endptr, 10);
if (l1 < 0) fail(); /* you expect them unsigned, right? */
if (endptr == c) fail();
if (*endptr != '/') fail();
c = endptr+1;
...
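To round off the fragment above, here is one way the loop might be completed so that it handles any number of a/b pairs. This is a sketch under assumptions: parseFaceStrtol is a made-up name, and negative values are accepted rather than rejected.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical helper: parses "f a/b c/d ..." and appends a, c, ... to `first`
// and b, d, ... to `second`. Returns false on malformed input.
bool parseFaceStrtol(const char* line,
                     std::vector<long>& first, std::vector<long>& second)
{
    const char* c = line;
    if (*c++ != 'f')
        return false;
    for (;;) {
        char* end = nullptr;
        const long a = std::strtol(c, &end, 10);  // skips leading whitespace
        if (end == c)
            break;                                // no further number found
        if (*end != '/')
            return false;
        c = end + 1;
        const long b = std::strtol(c, &end, 10);
        if (end == c)
            return false;                         // nothing after the '/'
        first.push_back(a);
        second.push_back(b);
        c = end;
    }
    // Only trailing whitespace may remain after the last pair.
    while (*c == ' ' || *c == '\t' || *c == '\r' || *c == '\n')
        ++c;
    return !first.empty() && *c == '\0';
}
```

Since strtol reports where each number ended, no fixed number of pairs is baked into the code, unlike the sscanf format string.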
The easiest way would be to use C++11 regular expressions:
static const std::regex ex("f (-?\\d+)/(-?\\d+) (-?\\d+)/(-?\\d+) (-?\\d+)/(-?\\d+)");
std::smatch match;
if(!std::regex_match(line, match, ex))
throw std::runtime_error("invalid face data");
int v0 = std::stoi(match[1]), t0 = std::stoi(match[2]),
v1 = std::stoi(match[3]), t1 = std::stoi(match[4]),
v2 = std::stoi(match[5]), t2 = std::stoi(match[6]);
While this might be sufficient for your case, I can't help adding a more flexible way to read those index tuples, which better copes with non-triangular faces and different face specification formats. For this we assume you have already put the face line into a std::istringstream and already ate away the face tag. This is usually the case, since the easiest way to read an OBJ file is still:
for(std::string line,tag; std::getline(file, line); )
{
std::istringstream sline(line);
sline >> tag;
if(tag == "v")
...
else if(tag == "f")
...
}
To now read the face data (inside the "f" case of course) we first read each single index tuple individually. Then we just parse this index using regular expressions for each possible index format and handle them appropriately, returning the individual vertex, texcoord and normal indices in a 3-element std::tuple:
for(std::string corner; sline>>corner; )
{
static const std::regex vtn_ex("(-?\\d+)/(-?\\d+)/(-?\\d+)");
static const std::regex vn_ex("(-?\\d+)//(-?\\d+)");
static const std::regex vt_ex("(-?\\d+)/(-?\\d+)/?");
std::smatch match;
std::tuple<int,int,int> idx;
if(std::regex_match(corner, match, vtn_ex))
idx = std::make_tuple(std::stoi(match[1]),
std::stoi(match[2]), std::stoi(match[3]));
else if(std::regex_match(corner, match, vn_ex))
idx = std::make_tuple(std::stoi(match[1]), 0, std::stoi(match[2]));
else if(std::regex_match(corner, match, vt_ex))
idx = std::make_tuple(std::stoi(match[1]), std::stoi(match[2]), 0);
else
idx = std::make_tuple(std::stoi(corner), 0, 0);
//do whatever you want with the indices in std::get<...>(idx)
}
Of course this offers possibilities for performance-guided optimizations (if necessary), like eliminating the need for allocating new strings and streams in each and every loop iteration. But it is the easiest way to provide the flexibility necessary for a proper OBJ loader. It may also be that the above version for triangles with vertices and texcoords only is already sufficient for you.

Parsing a character array with several null terminated characters into different strings - C++

I asked this question before but with less information than I have now.
What I essentially have is a data block of type char. That block contains filenames that I need to format and put into a vector. I initially thought this char block had three spaces between each filename. Now, I realize they are '\0' null characters. So the solution that was provided was fantastic for the example I gave when I thought that there were spaces rather than null chars.
Here is what the structure looks like. Also, I should point out I DO have the size of the character data block.
filename1.bmp/0/0/0brick.bmp/0/0/0toothpaste.gif/0/0/0
The way the best solution did it was this:
// The stringstream will do the dirty work and deal with the spaces.
std::istringstream iss(s);
// Your filenames will be put into this vector.
std::vector<std::string> v;
// Copy every filename to a vector.
std::copy(std::istream_iterator<std::string>(iss),
std::istream_iterator<std::string>(),
std::back_inserter(v));
// They are now in the vector, print them or do whatever you want with them!
for(int i = 0; i < v.size(); ++i)
std::cout << v[i] << "\n";
This works fantastic for my original question but not with the fact they are null chars instead of spaces. Is there any way to make the above example work. I tried replacing null chars in the array with spaces but that didn't work.
Any ideas on the best way to format this char block into a vector of strings?
Thanks.
If you know your filenames don't have embedded "\0" characters in them, then this should work. (untested)
const char * buffer = "filename1.bmp\0\0\0brick.bmp\0\0\0toothpaste.gif\0\0\0";
int size_of_buffer = 1234; //Or whatever the real value is
const char * end_of_buffer = buffer + size_of_buffer;
std::vector<std::string> v;
while( buffer < end_of_buffer)
{
v.push_back( std::string(buffer) );
buffer = buffer + v.back().size() + 3; //skip the name and its three nulls
}
If they do have embedded null characters in the filename you'll need to be a little cleverer.
Something like this should work. (untested)
const char * start_of_filename = buffer;
while( start_of_filename < end_of_buffer )
{
//Create a cursor at the current spot and move cursor until we hit three nulls
const char * scan_cursor = start_of_filename;
while( scan_cursor + 2 < end_of_buffer &&
!( scan_cursor[0]=='\0' && scan_cursor[1]=='\0' && scan_cursor[2]=='\0' ) )
{
++scan_cursor;
}
//From our start to the cursor is our word.
v.push_back( std::string(start_of_filename,scan_cursor) );
//Move on to the next word
start_of_filename = scan_cursor+3;
}
If spaces would be a suitable separator, you could just replace the null characters by spaces:
std::replace(s.begin(), s.end(), '\0', ' ');
... and go from there. However, I'd suspect that you really need to use the null characters as separators as file names typically can include spaces. In this case, you could either use std::getline() with '\0' as the end of line or use the find() and substr() members of the string itself. The latter would look something like this:
std::vector<std::string> v;
std::string const null(1, '\0');
for (std::string::size_type pos(0); (pos = s.find_first_not_of(null, pos)) != s.npos; )
{
std::string::size_type end = s.find(null, pos);
v.push_back(s.substr(pos, end - pos));
pos = end;
}
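The other option mentioned above, std::getline() with '\0' as the delimiter, might look roughly like this. It is a sketch only; splitOnNulls is a made-up name, and it assumes the block is available as a pointer plus a size, as the question states.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a block of '\0'-separated names into strings.
std::vector<std::string> splitOnNulls(const char* data, std::size_t size)
{
    // Construct with an explicit length so that embedded '\0' bytes are kept.
    const std::string block(data, size);
    std::istringstream iss(block);
    std::vector<std::string> names;
    for (std::string name; std::getline(iss, name, '\0'); ) {
        if (!name.empty())          // consecutive nulls produce empty tokens
            names.push_back(name);
    }
    return names;
}
```

Skipping the empty tokens means the same code copes with one, two or three nulls between names.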

How to work with null pointers in a std::vector

Say I have a vector of null-terminated strings, some of which may be null pointers. I don't even know if this is legal. It is a learning exercise. Example code:
std::vector<char*> c_strings1;
char* p1 = "Stack Over Flow";
c_strings1.push_back(p1);
p1 = NULL; // I am puzzled you can do this and what exactly is stored at this memory location
c_strings1.push_back(p1);
p1 = "Answer";
c_strings1.push_back(p1);
for(std::vector<char*>::size_type i = 0; i < c_strings1.size(); ++i)
{
if( c_strings1[i] != 0 )
{
cout << c_strings1[i] << endl;
}
}
Note that the size of vector is 3 even though I have a NULL at location c_strings1[1]
Question. How can you re-write this code using std::vector<char>
What exactly is stored in the vector when you push a null value?
EDIT
The first part of my question has been thoroughly answered but not the second. Not to my satisfaction at least. I do want to see usage of vector<char>, not some nested variant or std::vector<std::string>. Those are familiar. So here is what I tried (hint: it does not work):
std::vector<char> c_strings2;
string s = "Stack Over Flow";
c_strings2.insert(c_strings2.end(), s.begin(), s.end() );
// char* p = NULL;
s = ""; // this is not really NULL, But would want a NULL here
c_strings2.insert(c_strings2.end(), s.begin(), s.end() );
s = "Answer";
c_strings2.insert(c_strings2.end(), s.begin(), s.end() );
const char *cs = &c_strings2[0];
while (cs <= &c_strings2[2])
{
std::cout << cs << "\n";
cs += std::strlen(cs) + 1;
}
You don't have a vector of strings -- you have a vector of pointer-to-char. NULL is a perfectly valid pointer-to-char which happens to not point to anything, so it is stored in the vector.
Note that the pointers you are actually storing are pointers to string literals. The strings are not copied.
It doesn't make a lot of sense to mix the C++ style vector with the C-style char pointers. It's not illegal to do so, but mixing paradigms like this often results in confused & busted code.
Instead of using a vector<char*> or a vector<char>, why not use a vector<string> ?
EDIT
Based on your edit, it seems like what you're trying to do is flatten several strings into a single vector<char>, with a NULL-terminator between each of the flattened strings.
Here's a simple way to accomplish this:
#include <algorithm>
#include <vector>
#include <string>
#include <iterator>
using namespace std;
int main()
{
// create a vector of strings...
typedef vector<string> Strings;
Strings c_strings;
c_strings.push_back("Stack Over Flow");
c_strings.push_back("");
c_strings.push_back("Answer");
/* Flatten the strings in to a vector of char, with
a NULL terminator between each string
So the vector will end up looking like this:
S t a c k _ O v e r _ F l o w \0 \0 A n s w e r \0
***********************************************************/
vector<char> chars;
for( Strings::const_iterator s = c_strings.begin(); s != c_strings.end(); ++s )
{
// append this string to the vector<char>
copy( s->begin(), s->end(), back_inserter(chars) );
// append a null-terminator
chars.push_back('\0');
}
}
So,
char *p1 = "Stack Over Flow";
char *p2 = NULL;
char *p3 = "Answer";
If you notice, the type of all three of those is exactly the same. They are all char *. Because of this, we would expect them all to have the same size in memory as well.
You may think that it doesn't make sense for them to have the same size in memory, because p3 is shorter than p1. What actually happens, is that the compiler, at compile-time, will find all of the strings in the program. In this case, it would find "Stack Over Flow" and "Answer". It will throw those to some constant place in memory, that it knows about. Then, when you attempt to say that p3 = "Answer", the compiler actually transforms that to something like p3 = 0x123456A0.
Therefore, with either version of the push_back call, you are only pushing into the vector a pointer, not the actual string itself.
The vector itself doesn't know or care that a NULL char * is an empty string. So in its counting, it sees that you have pushed three pointers into it, so it reports a size of 3.
I have a funny feeling that what you would really want is to have the vector contain something like "Stack Over Flow Answer" (possibly without space before "Answer").
In this case, you can use a std::vector<char>, you just have to push the whole arrays, not just pointers to them.
This cannot be accomplished with push_back; however, vector has an insert method that accepts ranges.
/// Maintain the invariant that the vector shall be null terminated
/// p shall be either null or point to a null terminated string
void push_back(std::vector<char>& v, char const* p) {
if (p) {
v.insert(v.end(), p, p + strlen(p));
}
v.push_back('\0');
} // push_back
int main() {
std::vector<char> v;
push_back(v, "Stack Over Flow");
push_back(v, 0);
push_back(v, "Answer");
for (size_t i = 0, max = v.size(); i < max; i += strlen(&v[i]) + 1) {
std::cout << &v[i] << "\n";
}
}
This uses a single contiguous buffer to store multiple null-terminated strings. Passing a null string to push_back results in an empty string being displayed.
What exactly is stored in the vector when you push a null value?
A NULL. You're storing pointers, and NULL is a possible value for a pointer. Why is this unexpected in any way?
Also, use std::string as the value type (i.e. std::vector<std::string>), char* shouldn't be used unless it's needed for C interop. To replicate your code using std::vector<char>, you'd need std::vector<std::vector<char>>.
You have to be careful when storing pointers in STL containers - copying the containers results in shallow copy and things like that.
With regard to your specific question, the vector will store a pointer of type char* regardless of whether or not that pointer points to something. It's entirely possible you would want to store a null-pointer of type char* within that vector for some reason - for example, what if you decide to delete that character string at a later point from the vector? Vectors only support amortized constant time for push_back and pop_back, so there's a good chance if you were deleting a string inside that vector (but not at the end) that you would prefer to just set it null quickly and save some time.
Moving on - I would suggest making a std::vector<std::vector<char> > if you want a dynamic array of strings, which looks like what you're going for.
A std::vector<char> as you mentioned would be useless compared to your original code, because your original code stores a dynamic array of strings and a std::vector<char> would only hold one dynamically changeable string (as a string is essentially an array of characters).
NULL is just 0. A pointer with value 0 has a meaning. But a char with value 0 has a different meaning. It is used as a delimiter to show the end of a string. Therefore, if you use std::vector<char> and push_back 0, the vector will contain a character with value 0. vector<char> is a vector of characters, while std::vector<char*> is a vector of C-style strings -- very different things.
Update. As the OP wants, I am giving an idea of how to store (in a vector) null terminated strings some of which are nulls.
Option 1: Suppose we have vector<char> c_strings;. Then, we define a function to store a string pi. A lot of complexity is introduced since we need to distinguish between an empty string and a null char*. We select a delimiting character that does not occur in our usage. Suppose this is the '~' character.
char delimiter = '~';
// push each character in pi into c_strings
void push_into_vec(vector<char>& c_strings, char* pi) {
if(pi != 0) {
for(char* p=pi; *p!='\0'; p++)
c_strings.push_back(*p);
// also add a NUL character to denote end-of-string
c_strings.push_back('\0');
}
c_strings.push_back(delimiter);
// Note that a NULL pointer would be stored as a single '~' character
// while an empty string would be stored as '\0~'.
}
// now a method to retrieve each of the stored strings.
vector<char*> get_stored_strings(const vector<char>& c_strings) {
vector<char*> r;
const char* end = c_strings.data() + c_strings.size();
const char* current = c_strings.data();
bool nullstring = true;
for(const char* c = current; c != end; c++) {
if(*c == '\0') {
// [current, c) is the stored string; copy it together with its NUL
size_t size = c - current;
char* nc = new char[size+1];
memcpy(nc, current, size+1);
r.push_back(nc);
nullstring = false;
}
if(*c == delimiter) {
if(nullstring) r.push_back(0);
nullstring = true; // reset nullstring for the next string
current = c+1; // set the next string
}
}
return r;
}
You still need to call delete[] on the memory allocated by new[] above. All this complexity is taken care of by using the string class. I very rarely use char* in C++.
Option 2: You could use vector<boost::optional<char> >. Then the '~' can be replaced by an empty boost::optional, but the other parts are the same as option 1. The memory usage in this case would be higher, though.
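A rough sketch of what option 2 might look like. Note the substitutions: it uses std::optional (C++17) in place of boost::optional so that it compiles without Boost, it returns std::optional<std::string> instead of raw char* so that nothing has to be deleted, and the names Cell, push_entry and get_entries are made up for illustration.

```cpp
#include <optional>
#include <string>
#include <vector>

using Cell = std::optional<char>;   // an empty Cell acts as the entry separator

// Appends the characters of p plus a '\0' (or nothing at all if p is null),
// followed by an empty Cell that closes the entry.
void push_entry(std::vector<Cell>& v, const char* p)
{
    if (p) {
        for (; *p != '\0'; ++p)
            v.push_back(*p);
        v.push_back('\0');          // marks "a real, possibly empty, string"
    }
    v.push_back(std::nullopt);      // replaces the '~' of option 1
}

// Rebuilds the entries; a null pointer comes back as an empty optional.
std::vector<std::optional<std::string>> get_entries(const std::vector<Cell>& v)
{
    std::vector<std::optional<std::string>> out;
    std::string current;
    bool isNull = true;
    for (const Cell& c : v) {
        if (!c) {                   // separator: finish the current entry
            if (isNull)
                out.push_back(std::nullopt);
            else
                out.push_back(current);
            current.clear();
            isNull = true;
        } else if (*c == '\0') {
            isNull = false;         // saw the terminator of a real string
        } else {
            current.push_back(*c);
        }
    }
    return out;
}
```

Unlike the '~' approach, no character value has to be reserved as a delimiter, at the cost of the extra flag stored next to every char.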