Parsing a non-uniform string into integers

I am writing a parser for .obj files, and there is a part of the file that is in the format
f [int]/[int] [int]/[int] [int]/[int]
and the integers are of unknown length. In each [int]/[int] pair, they both need to be put onto separate arrays. What is the simplest method to separate them as integers?

You can do it with fscanf:
int matched = fscanf(fptr, "f %d/%d %d/%d %d/%d", &a, &b, &c, &d, &e, &f);
if (matched != 6) fail();
or ifstream and sscanf:
char buf[100];
yourIfstream.getline(buf, sizeof(buf));
int matched = sscanf(buf, "f %d/%d %d/%d %d/%d", &a, &b, &c, &d, &e, &f);
if (matched != 6) fail();

Consider using one of the scanf functions (fscanf if you are reading the file using <stdio.h> and FILE*, or sscanf to parse a line held in a memory buffer).
So, if you have a buffer with data and two integer arrays like this:
int first[3], second[3];
const char *buffer = "f 10/20 1/300 344/2";
Then you can just write:
sscanf(buffer, "f %d/%d %d/%d %d/%d",
&first[0], &second[0], &first[1], &second[1], &first[2], &second[2]);
(The spaces in sscanf's input pattern are not necessary as %d skips the spaces, but they improve readability.)
If you need error checking, inspect the result of sscanf: the function returns the number of values successfully converted and assigned (6 for this example if everything was correct).
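As a minimal sketch of the return-value check described above, the call can be wrapped in a small function. The name parseFace is hypothetical and not part of any library; it only reports whether all six integers were read.

```cpp
#include <cstdio>

// Hypothetical helper (not from the original answer): parses one
// "f a/b a/b a/b" line, putting the first number of each pair into
// `first` and the second into `second`.
// Returns true only when sscanf converted all six integers.
bool parseFace(const char *line, int first[3], int second[3])
{
    int matched = std::sscanf(line, "f %d/%d %d/%d %d/%d",
                              &first[0], &second[0],
                              &first[1], &second[1],
                              &first[2], &second[2]);
    return matched == 6;
}
```

A line with fewer than three pairs, or one that does not start with f, makes sscanf stop early, so the function returns false and the caller can skip or reject the line.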

I would use regular expressions for this. If you have a C++11-compliant compiler you can use std::regex (from the <regex> header), otherwise you can look into boost::regex. In Perl-like syntax, your regular expression pattern would look something like this: f ([0-9]+)/([0-9]+) ([0-9]+)/([0-9]+) ([0-9]+)/([0-9]+). Then you take the sub-matches in turn (what's inside the parentheses) and convert them from string or char* to integer with istringstream.

#include <stdlib.h>
long int strtol(const char *nptr, char **endptr, int base);
long long int strtoll(const char *nptr, char **endptr, int base);
The strtol function will both parse an integer from input, and return the place where the integer ends in the string. You could use it like
const char *input = "f 123/234 234/345 345/456";
const char *c = input;
char *endptr;
if (*c++ != 'f') fail();
if (*c++ != ' ') fail();
long l1 = strtol(c, &endptr, 10);
if (l1 < 0) fail(); /* you expect them unsigned, right? */
if (endptr == c) fail();
if (*endptr != '/') fail();
c = endptr+1;
...
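The elided part repeats the same steps for every number. One way it could be completed, as a sketch only, is a loop over the three pairs. The function name parseFaceStrtol is hypothetical, and the sketch assumes the line has no trailing newline and does not reject negative values.

```cpp
#include <cstdlib>

// Hypothetical helper: reads "f a/b a/b a/b" using strtol and fills two arrays.
// Returns false on any deviation from that layout.
bool parseFaceStrtol(const char *input, long first[3], long second[3])
{
    const char *c = input;
    if (*c++ != 'f') return false;

    for (int i = 0; i < 3; ++i) {
        char *end;
        if (*c++ != ' ') return false;           // each pair is preceded by a space

        first[i] = std::strtol(c, &end, 10);
        if (end == c) return false;              // no digits were consumed
        if (*end != '/') return false;           // the pair separator is missing
        c = end + 1;

        second[i] = std::strtol(c, &end, 10);
        if (end == c) return false;
        c = end;                                 // continue right after the number
    }
    return *c == '\0';                           // assumption: nothing follows the last pair
}
```

Because strtol reports where it stopped, no separate tokenizing pass is needed; the cost is that every separator has to be checked by hand.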

The easiest way would be to use C++11 regular expressions:
static const std::regex ex("f (-?\\d+)/(-?\\d+) (-?\\d+)/(-?\\d+) (-?\\d+)/(-?\\d+)");
std::smatch match;
if(!std::regex_match(line, match, ex))
throw std::runtime_error("invalid face data");
int v0 = std::stoi(match[1]), t0 = std::stoi(match[2]),
v1 = std::stoi(match[3]), t1 = std::stoi(match[4]),
v2 = std::stoi(match[5]), t2 = std::stoi(match[6]);
While this might be sufficient for your case, I can't help adding a more flexible way to read those index tuples, one that copes better with non-triangular faces and different face specification formats. For this we assume you have already put the face line into a std::istringstream and already consumed the face tag. This is usually the case, since the easiest way to read an OBJ file is still:
for(std::string line,tag; std::getline(file, line); )
{
std::istringstream sline(line);
sline >> tag;
if(tag == "v")
...
else if(tag == "f")
...
}
To now read the face data (inside the "f" case of course) we first read each single index tuple individually. Then we just parse this index using regular expressions for each possible index format and handle them appropriately, returning the individual vertex, texcoord and normal indices in a 3-element std::tuple:
for(std::string corner; sline>>corner; )
{
static const std::regex vtn_ex("(-?\\d+)/(-?\\d+)/(-?\\d+)");
static const std::regex vn_ex("(-?\\d+)//(-?\\d+)");
static const std::regex vt_ex("(-?\\d+)/(-?\\d+)/?");
std::smatch match;
std::tuple<int,int,int> idx;
if(std::regex_match(corner, match, vtn_ex))
idx = std::make_tuple(std::stoi(match[1]),
std::stoi(match[2]), std::stoi(match[3]));
else if(std::regex_match(corner, match, vn_ex))
idx = std::make_tuple(std::stoi(match[1]), 0, std::stoi(match[2]));
else if(std::regex_match(corner, match, vt_ex))
idx = std::make_tuple(std::stoi(match[1]), std::stoi(match[2]), 0);
else
idx = std::make_tuple(std::stoi(corner), 0, 0);
//do whatever you want with the indices in std::get<...>(idx)
}
Of course this offers possibilities for performance-guided optimizations (if necessary), like eliminating the need to allocate new strings and streams in each and every loop iteration. But it is the easiest way to provide the flexibility necessary for a proper OBJ loader. It may also be that the above version for triangles with vertices and texcoords only is already sufficient for you.

Related

Is there a function/WinAPI to tell if one string starts with another string in a case-insensitive linguistic way?

The best way to illustrate my question is with this example (that doesn't work if I use the wcsstr CRT function):
const wchar_t* s1 = L"Hauptstraße ist die längste";
const wchar_t* s2 = L"Hauptstrasse";
bool b_s1_starts_with_s2 = !!wcsstr(s1, s2);
_ASSERT(b_s1_starts_with_s2); //Should be true
So far the only WinAPI that seems to recognize linguistic string equivalency is CompareStringEx when used with the LINGUISTIC_IGNORECASE flag, but it is somewhat tricky & inefficient to use for this purpose as I will have to call it on s2 repeatedly until I reach its end.
So I was wondering if there's a better approach to doing this (under Windows)?
EDIT: Here's what I mean:
bool b_s1_starts_with_s2 = false;
int ln1 = (int)wcslen(s1);
int ln2 = (int)wcslen(s2);
for(int p = 1; p <= ln1; p++)
{
if(::CompareString(LOCALE_USER_DEFAULT, LINGUISTIC_IGNORECASE,
s1, p,
s2, ln2) == CSTR_EQUAL)
{
//Match
b_s1_starts_with_s2 = true;
break;
}
}

You can use FindNLSString and check whether the return value is zero, since it returns the index at which the match starts.
Evidently it matches ß with ss:
const wchar_t *s1 = L"Hauptstraße ist die längste";
const wchar_t *s2 = L"Hauptstrasse";
INT found = 0;
int start = FindNLSString(0, LINGUISTIC_IGNORECASE, s1, -1, s2, -1, &found);
wprintf(L"start = %d\n", start);
s1 = L"δεθ Testing Greek";
s2 = L"ΔΕΘ";
start = FindNLSString(0, LINGUISTIC_IGNORECASE, s1, -1, s2, -1, &found);
wprintf(L"start = %d\n", start);

I have not tried it, but I think you probably could use LCMapStringEx to transform all strings to lowercase appropriately for the locale, and then do a normal string prefix match with wcsncmp.
(As noted in comments, it makes no sense that you used wcsstr in your example since wcsstr determines if one string contains another string. To determine if one string starts with another string, it's more efficient to use wcsncmp with the length of the prefix string.)

sscanf format string that can detect a hyphenated and non-hyphenated string

I'm working on an implementation that uses sscanf() to detect input.
Input will either contain a hyphen character '-' or it won't. The length of the strings and position of the hyphen character will vary.
Examples:
"18509726-550"
"14782"
I've been trying to come up with a format string that has a return > 0 for strings with a hyphen but not for strings without a hyphen, but I have not had success. It may not be possible. I also realize there are much better ways to achieve this, but this code was written way before my time.
Code:
// what format string will return 0 for one but not the other?
String^ pszString = "18509726-550"; // or "14782"
int nP;
char buffer[256];
char buffer1[256];
const char* pSrc = (gcnew marshal_context())->marshal_as<const char*>(pszString);
nP=sscanf(pSrc, "%[-]s", buffer, buffer1);
I've tried several format strings:
"%s-%s"
"%[0-9-]s"
"%[-]s"
"%[^-]s"
I've pored over any sscanf documentation I can find, and I don't think it is possible with sscanf alone, but validation helps too.
Thanks SO community!

%[...] is a conversion specifier in its own right, so it does not take a trailing s.
Just use "%[^-]-%[^-]":
#include <stdio.h>
void test(const char* s)
{
char s1[100]{}, s2[100]{};
int n = sscanf(s, "%[^-]-%[^-]", s1, s2);
printf("%d: %s, %s\n", n, s1, s2);
}
int main()
{
test("18509726-550");
test("14782");
}
Prints:
2: 18509726, 550
1: 14782,
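Since the return value is 2 with a hyphen and 1 without, the check the question asks for can be a comparison against 2 rather than against 0. A minimal sketch follows; the function name hasHyphenatedParts is made up for illustration, and the field widths are an addition to keep sscanf inside the buffers.

```cpp
#include <cstdio>

// Hypothetical helper: true when s has the shape "<text>-<text>",
// i.e. when both %[^-] conversions succeed.
bool hasHyphenatedParts(const char *s)
{
    char left[256];
    char right[256];
    // The width 255 leaves room for the terminating '\0' in each buffer.
    int n = std::sscanf(s, "%255[^-]-%255[^-]", left, right);
    return n == 2;
}
```

Note that a string starting or ending with the hyphen (for example "-550" or "123-") counts as not hyphenated here, because %[^-] needs at least one character to match.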

Parse char vector to std::map<string, string>

I have a vector of unsigned char values. The data inside is keys and values in the format "key=value". Each pair is terminated by a '\0' character and the last pair is terminated by a double '\0'. A value can contain a "=" as well, so this delimiter shouldn't be a criterion for where a pair ends, only the terminating '\0'.
From this vector of single unsigned char I want to get a std::map<std::string, std::string> object.
What would be an effective way to go through the vector and fill the map?
Thanks in advance!
This code finds all pairs, but when printing them, my console seems to mess things up, so I suspect invalid characters... maybe it has to do with the fact that I'm using unsigned char?
unsigned char *env, *nxt;
std::string delimiter = "=";
std::string line;
std::map<string, string> myMap;
for (env = &myVector[0]; *env != '\0'; env = nxt + 1)
{
for (nxt = env; *nxt != '\0'; ++nxt)
{
if (nxt >= &myVector[myVector.size()])
{
printf("string not terminated\n");
return -1;
}
}
line = std::string(env, env + myVector.size());
myMap.insert(std::pair<string, string>(
line.substr(0, line.find(delimiter)),
line.substr(line.find(delimiter) + 1, line.size())));
}

Since the key cannot contain the = sign, it's fairly easy:
Split on (generate all positions of) zero characters.
Split created entries on first =.
Use the generated regions as key/value.
To make things clearer (than your double nested loops, ew), I'd use some reasonable data structure to mark the faux-strings in the buffer (if you want to avoid copying), such as pair<unsigned, unsigned>.
So, the signatures of the functions (which in this case tell more than their implementations, I suppose), could look like this:
using BufferRange = pair<unsigned, unsigned>;
using BufferEntry = pair<BufferRange, BufferRange>;
list<BufferRange> splitOnZeroes(Buffer const& b);
BufferEntry splitOnEquality(BufferRange const& br, Buffer const& b);
void addToMap(map<string, string>& m, BufferEntry const& p);
Those are fairly simplified; my code design OCD tells me that BufferRange could be a type carrying not only the numerical indices, but also the reference to the buffer itself. Changing that (if required) is left as an exercise.
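For comparison, here is a compact sketch of the same three steps (split on zero bytes, split each entry on the first =, store the result). It copies each entry into a std::string instead of tracking index pairs as suggested above, which is simpler but not copy-free. The function name parseEnvBlock is hypothetical.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Sketch: walks a buffer laid out as "key=value\0key=value\0\0" and fills a map.
std::map<std::string, std::string> parseEnvBlock(const std::vector<unsigned char> &buf)
{
    std::map<std::string, std::string> result;
    auto it = buf.begin();

    // An empty entry (a '\0' right at the start of an entry) marks the end of the block.
    while (it != buf.end() && *it != '\0') {
        auto end = std::find(it, buf.end(), '\0');
        if (end == buf.end())
            break;                        // unterminated last entry: ignored in this sketch

        // Copy exactly one entry: the range stops at its own terminator.
        std::string entry(it, end);

        // Split on the first '=' only, so values may contain '=' themselves.
        std::string::size_type eq = entry.find('=');
        if (eq != std::string::npos)
            result[entry.substr(0, eq)] = entry.substr(eq + 1);

        it = end + 1;                     // step over the terminator
    }
    return result;
}
```

The important detail is that each string is built from the range ending at the entry's own '\0'; building it from a range that runs to the end of the vector would pull the following entries and their terminators into the string.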

C++: <= conflict between signed and unsigned

I have created a wrapper around the .substr function:
wstring MidEx(wstring u, long uStartBased1, long uLenBased1)
{
//Extracts a substring. It is fail-safe. In case we read beyond the string, it will just read as much as it has
// For example when we read from the word HELLO , and we read from position 4, len 5000, it will just return LO
if (uStartBased1 > 0)
{
if (uStartBased1 <= u.size())
{
return u.substr(uStartBased1-1, uLenBased1);
}
}
return wstring(L"");
}
It works fine, however the compiler gives me the warning "<= Conflict between signed and unsigned".
Can somebody tell me how to do it correctly?
Thank you very much!

You should use wstring::size_type (or size_t) instead of long:
wstring MidEx(wstring u, wstring::size_type uStartBased1, wstring::size_type uLenBased1)
{
//Extracts a substring. It is fail-safe. In case we read beyond the string, it will just read as much as it has
// For example when we read from the word HELLO , and we read from position 4, len 5000, it will just return LO
if (uStartBased1 > 0)
{
if (uStartBased1 <= u.size())
{
return u.substr(uStartBased1-1, uLenBased1);
}
}
return wstring(L"");
}
which is the exact return type of u.size(). This way, you ensure that the comparison gives the expected result.
If you are working with std::wstring or another standard library container (like std::vector etc.), then x::size_type will be defined as size_t. So using it will be more consistent.
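To see why the compiler warns at all, here is a small sketch of what a mixed comparison does. On mainstream platforms the long operand is converted to an unsigned type before comparing, so a negative value turns into a very large one. Both function names are made up for illustration; the first spells out the conversion with an explicit cast, the second checks the sign first.

```cpp
#include <cstddef>

// Roughly what `a <= b` does when a is long and b is size_t
// (assumption: the usual arithmetic conversions pick an unsigned type,
// as they do on common 32-bit and 64-bit platforms).
bool mixedLessEqual(long a, std::size_t b)
{
    return static_cast<std::size_t>(a) <= b;
}

// Sign-aware version: a negative value is always below any size.
bool safeLessEqual(long a, std::size_t b)
{
    return a < 0 || static_cast<std::size_t>(a) <= b;
}
```

The original MidEx happens to be safe because it tests uStartBased1 > 0 before the comparison, but switching the parameters to size_type removes both the warning and the need for that reasoning.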

You want unsigned arguments, something like:
wstring MidEx(wstring u, unsigned long uStartBased1, unsigned long uLenBased1)

How to capture length of sscanf'd string?

I'm parsing a string that follows a predictable pattern:
1 character
an integer (one or more digits)
1 colon
a string, whose length came from #2
For example:
s5:stuff
I can see easily how to parse this with PCRE or the like, but I'd rather stick to plain string ops for the sake of speed.
I know I'll need to do it in 2 steps because I can't allocate the destination string until I know its length. My problem is gracefully getting the offset for the start of said string. Some code:
unsigned start = 0;
char type = serialized[start++]; // get the type tag
int len = 0;
char* dest = NULL;
char format[20];
//...
switch (type) {
//...
case 's':
// Figure out the length of the target string...
sscanf(serialized + start, "%d", &len);
// <code type='graceful'>
// increment start by the STRING LENGTH of whatever %d was
// </code>
// Don't forget to skip over the colon...
++start;
// Build a format string which accounts for length...
sprintf(format, "%%%ds", len);
// Finally, grab the target string...
sscanf(serialized + start, format, string);
break;
//...
}
That code is roughly taken from what I have (which isn't complete because of the issue at hand) but it should get the point across. Maybe I'm taking the wrong approach entirely. What's the most graceful way to do this? The solution can be either C or C++ (and I'd actually like to see the competing methods if there are enough responses).

You can use the %n conversion specifier, which doesn't consume any input - instead, it expects an int * parameter, and writes the number of characters consumed from the input into it:
int consumed;
sscanf(serialized + start, "%d%n", &len, &consumed);
start += consumed;
(But don't forget to check that sscanf() returned > 0!)
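Putting the pieces together, a sketch of the whole 's' case might look like this. The function name parseSerializedString is hypothetical, and the sketch copies the payload into a std::string instead of building a second format string. Because %n is placed after the colon, consumed is only written when the colon was actually matched.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: parses "s<len>:<payload>" and stores exactly
// <len> characters of the payload in `out`.
bool parseSerializedString(const char *serialized, std::string &out)
{
    int len = 0;
    int consumed = 0;   // stays 0 unless sscanf reaches the %n directive

    // %n does not count towards the return value, so success is 1 (the %d).
    int matched = std::sscanf(serialized, "s%d:%n", &len, &consumed);
    if (matched != 1 || consumed == 0 || len < 0)
        return false;

    const char *payload = serialized + consumed;
    if (std::strlen(payload) < static_cast<std::size_t>(len))
        return false;   // the input is shorter than the announced length

    out.assign(payload, static_cast<std::size_t>(len));
    return true;
}
```

Checking the return value alone is not enough here: for an input such as "s5stuff" the %d still succeeds, and only the untouched consumed reveals that the colon was missing.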

Use the %n format specifier to write the number of characters read so far to an integer argument.

Here's a C++ solution, it could be better, and is hard-coded specifically to deal with your example input, but shouldn't require much modification to get working.
std::stringstream ss;
char type;
unsigned length;
char dummy;
std::string value;
ss << "s5:Helloxxxxxxxxxxx";
ss >> type;
ss >> length;
ss >> dummy;
ss.width(length);
ss >> value;
std::cout << value << std::endl;
Disclaimer:
I'm a noob at C++.

You can probably just use atoi which will ignore the colon.
e.g. len = atoi(serialized + start);
The only thing with atoi is that if it returns zero it could mean either the conversion failed, or that the length was truly zero. So it's not always the most appropriate function.
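If that ambiguity matters, strtol can stand in for atoi, because its end pointer shows whether any digits were read. This is a sketch of that substitution, not what the answer above proposes; the function name parseLength is made up.

```cpp
#include <cstdlib>

// Hypothetical helper: parses a decimal length at the start of s.
// Returns false when no digits were found, so a parsed value of 0 is unambiguous.
// On success, `rest` points at the first character after the number.
bool parseLength(const char *s, long &len, const char *&rest)
{
    char *end;
    long value = std::strtol(s, &end, 10);
    if (end == s)
        return false;   // nothing was converted
    len = value;
    rest = end;
    return true;
}
```

As a side effect, rest also gives the offset of the colon, which atoi cannot provide.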

If you replace your colon with a space, scanf will stop at it, so you can read the size, malloc a buffer of that size, and then run another scanf to get the rest of the string:
#include <stdio.h>
#include <stdlib.h>
int main (int argc, const char * argv[]) {
char foo[20];
char *test;
scanf("%19s", foo); //input: "hello world"; the width keeps scanf inside foo
printf("foo = %s\n", foo);//prints hello
//get size
test = malloc(sizeof(char) * 10);//replace 10 with your string size
scanf("%9s", test); //width = buffer size - 1
printf("test = %s\n", test);//prints world
free(test);
return 0;
}

Seems like the format is overspecified... (using a variable length field to specify the length of a variable length field).
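If you're using GCC with glibc, I'd suggest the allocating %as conversion (a glibc extension; POSIX.1-2008 standardized the same idea as %ms):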
if (sscanf(serialized,"%c%d:%as",&type,&len,&dest)<3) return -1;
/* use type, dest; ignore len */
free(dest);
return 0;
