How to capture length of sscanf'd string? - c++

I'm parsing a string that follows a predictable pattern:
1. one character
2. an integer (one or more digits)
3. one colon
4. a string, whose length came from #2
For example:
s5:stuff
I can see easily how to parse this with PCRE or the like, but I'd rather stick to plain string ops for the sake of speed.
I know I'll need to do it in 2 steps because I can't allocate the destination string until I know its length. My problem is gracefully getting the offset for the start of said string. Some code:
unsigned start = 0;
char type = serialized[start++]; // get the type tag
int len = 0;
char* dest = NULL;
char format[20];
//...
switch (type) {
//...
case 's':
// Figure out the length of the target string...
sscanf(serialized + start, "%d", &len);
// <code type='graceful'>
// increment start by the STRING LENGTH of whatever %d was
// </code>
// Don't forget to skip over the colon...
++start;
// Build a format string which accounts for length...
sprintf(format, "%%%ds", len);
// Finally, grab the target string...
sscanf(serialized + start, format, string);
break;
//...
}
That code is roughly taken from what I have (which isn't complete because of the issue at hand), but it should get the point across. Maybe I'm taking the wrong approach entirely. What's the most graceful way to do this? The solution can be either C or C++ (and I'd actually like to see the competing methods if there are enough responses).

You can use the %n conversion specifier, which doesn't consume any input - instead, it expects an int * parameter, and writes the number of characters consumed from the input into it:
int consumed;
sscanf(serialized + start, "%d%n", &len, &consumed);
start += consumed;
(But don't forget to check that sscanf() returned > 0!)
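To make the %n approach concrete, here is a minimal sketch that parses the whole "s5:stuff" record in one function. The name parse_tagged and the use of std::string for the result are assumptions for illustration, not part of the original code.

```cpp
#include <cstdio>
#include <string>

// Parses "<tag><len>:<payload>" (for example "s5:stuff").
// Returns true and fills type/value on success, false on malformed input.
bool parse_tagged(const std::string& serialized, char& type, std::string& value) {
    if (serialized.empty()) return false;
    type = serialized[0];

    int len = 0;
    int consumed = 0;
    // %n stores how many characters have been read so far. It is not counted
    // in sscanf's return value, so success here means exactly 1.
    if (std::sscanf(serialized.c_str() + 1, "%d%n", &len, &consumed) != 1) return false;
    if (len < 0) return false;

    std::size_t pos = 1 + static_cast<std::size_t>(consumed);
    if (pos >= serialized.size() || serialized[pos] != ':') return false;
    ++pos;  // skip the colon

    // Refuse to read past the end of the input.
    if (serialized.size() - pos < static_cast<std::size_t>(len)) return false;
    value.assign(serialized, pos, static_cast<std::size_t>(len));
    return true;
}
```

Copying by explicit length (rather than a second sscanf with %s) also keeps payloads that contain spaces intact.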

Use the %n format specifier to write the number of characters read so far to an integer argument.

Here's a C++ solution, it could be better, and is hard-coded specifically to deal with your example input, but shouldn't require much modification to get working.
std::stringstream ss;
char type;
unsigned length;
char dummy;
std::string value;
ss << "s5:Helloxxxxxxxxxxx";
ss >> type;
ss >> length;
ss >> dummy;
ss.width(length);
ss >> value;
std::cout << value << std::endl;
Disclaimer:
I'm a noob at C++.
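One caveat with the stream version above: operator>> into a std::string stops at whitespace, so a payload such as "hello world" would be cut short. A possible variation, assuming the payload may contain any characters, reads exactly length bytes with istream::read. The function name read_tagged is hypothetical.

```cpp
#include <sstream>
#include <string>

// Extracts the payload of "<tag><len>:<payload>" from a stream.
// Returns false if the input does not match that shape or is too short.
bool read_tagged(std::istream& in, char& type, std::string& value) {
    unsigned length = 0;
    char colon = 0;
    if (!(in >> type >> length) || !in.get(colon) || colon != ':') return false;
    value.resize(length);
    // read() copies raw characters, including spaces, and fails on short input.
    if (length > 0 && !in.read(&value[0], length)) return false;
    return true;
}
```

This sketch trusts the declared length; real code would probably cap it before calling resize.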

You can probably just use atoi which will ignore the colon.
e.g. len = atoi(serialized + start);
The only thing with atoi is that if it returns zero it could mean either the conversion failed, or that the length was truly zero. So it's not always the most appropriate function.
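If the ambiguity between "failed" and "zero" matters, strtol is a close relative of atoi that also reports where the conversion stopped. The sketch below (read_length is a made-up helper name) uses that end pointer both to detect failure and to get the offset the question asks about.

```cpp
#include <cstdlib>

// Reads a non-negative decimal length at `text`. On success stores the value
// in `len`, the number of characters consumed in `consumed`, and returns true.
bool read_length(const char* text, long& len, std::size_t& consumed) {
    char* end = nullptr;
    long value = std::strtol(text, &end, 10);
    if (end == text) return false;  // no digits at all: a real failure, not a zero
    if (value < 0) return false;
    len = value;
    consumed = static_cast<std::size_t>(end - text);
    return true;
}
```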

If you replace your colon with a space, scanf will stop at it. You can then read the size, malloc that many bytes (plus one for the terminator), and run another scanf to get the rest of the string:
#include <stdio.h>
#include <stdlib.h>
int main (int argc, const char * argv[]) {
char foo[20];
char *test;
scanf("%19s", foo); // input: "hello world"
printf("foo = %s\n", foo); // prints hello
// get size
test = malloc(sizeof(char) * 10); // replace 10 with your string size + 1
scanf("%9s", test); // the width keeps scanf inside the buffer
printf("test = %s\n", test); // prints world
free(test);
return 0;
}

Seems like the format is overspecified... (using a variable length field to specify the length of a variable length field).
If you're using GCC, I'd suggest
if (sscanf(serialized,"%c%d:%as",&type,&len,&dest)<3) return -1;
/* use type, dest; ignore len */
free(dest);
return 0;

Related

ifstream not reading the same characters as they are written in the file

Simple explanation: ifstream's get() is reading the wrong chars (console is different from file) and I need to know why.
I am recording registers into a file as a char array. When I write it to the file, it writes successfully. I open the file and find the chars I intended, except notepad apparently shows unicode character 0000 ( NULL) as a space.
For instance, the entries
id = 1000; //an 8-byte long long
name = "stack"; //variable size
surname = "overflow"; //variable size
degree = "internet"; //variable size
sex = 'c'; //1-byte char
birthdate = 256; //4-byte int
become this on the file:
& èstackoverflowinternetc
or, putting the number of unicode characters that disappear when posted here between brackets:
&[3]| [1]è|stack|overflow|internet|c| [1] | //separating each section with a | for easier reading. Some unicode characters disappear when I post them here, but I assure you they are the correct ones
SIZE| ID | name| surname| degree |g| birth
(writing is working fine and puts the expected characters)
Trouble is, when the console in the code below prints what the buffer is reading from the file, it gives me the following record (extra spaces included)
Þstackoverflowinternetc
Which is bad because it returns me the wrong ID and birthdate. Either "-21" and "4747968" or "Ù" and "-1066252288". Other fields are unaffected. Weird because size bytes show up as empty space in the console, so it shouldn't be able to split name, surname, degree and sex.
ifstream infile("alumni.freire", ios::binary);
if(infile.is_open()){
infile.seekg(pos, ios::beg);
int size;
size = infile.get();
char charreg[size];
charreg[0] = size;
//testing what buffer gives me
for(int i = 1; i < size; i++){
charreg[i] = infile.get();
cout << charreg[i];
}
}
What am I doing wrong?
EDIT: to explain better what I did:
I get the entries on the first "code" from user input and use them as parameters when creating a "reg" class I implemented. The reg class then does (adequately, I've already tested it) the conversion to strings, and calculates a hidden four-element char array containing instance size, name size, surname size and degree size. When the program writes the class on-file, it is written perfectly, as I showed in the second "code" section. (If you do the calculations you'll see '&' equals the size of the entire thing, for example). When I read it from the file, it appears differently on console for some reason. Different characters. But it reads the right amount of characters because "name", "surname" and "degree" appear correctly.
EDIT n2: I made "charreg[]" into an int array and printed it and the values are correct. I have no idea what's happening anymore.
EDIT n3: Apparently the reason I was getting the wrong chars is that I should have used unsigned chars...
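The unsigned char point can be shown in a few lines. On platforms where plain char is signed, a byte such as 0xE8 (the 'è' in the dump, and the low byte of 1000) turns into -24 when widened to int, which corrupts any integer rebuilt from the bytes. The sketch below assumes the file stores integers in little-endian order; load_le32 is a hypothetical helper.

```cpp
#include <cstdint>

// Reassembles a little-endian 32-bit integer from raw bytes, going through
// unsigned char so that bytes >= 0x80 are not sign-extended.
std::uint32_t load_le32(const char* bytes) {
    std::uint32_t value = 0;
    for (int i = 3; i >= 0; --i) {
        value = (value << 8) | static_cast<unsigned char>(bytes[i]);
    }
    return value;
}
```

Without the cast to unsigned char, the 0xE8 byte would be OR-ed in as a negative value with all its upper bits set.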
The idea to write, as is, your structure is good. But your approach is wrong.
You must have something to separate your fields.
For example you know that your ID is 8 byte long, great ! You can read 8 bytes :
long long id;
read(fd, &id, 8);
In your example you got -24 because you read the first byte of the full id number.
But for the rest of the file, how can you know the length of the first name and the last name ?
You could read byte by byte until you find a null byte.
But I suggest using a more structured file.
For example, you can define a structure like this :
long long id; // 8 bytes
char firstname[256]; // 256 bytes
char lastname[256]; // 256 bytes
char sex; // 1 byte
int birthdate; // 4 bytes
With this structure you can read and write super easily :
struct my_struct s;
read(fd, &s, sizeof(struct my_struct)); // read 8+256+256+1+4 bytes
s.birthdate = 128;
write(fd, &s, sizeof(struct my_struct));// write the structure
Of course you lose the "variable length" of the first name and last name. Do you really need more than 100 chars for a name?
If you really need it, you could introduce a header before each variable-length value. But you lose the ability to read everything at once.
long long id;
int foo_size;
char *foo;
And then to read it :
struct my_struct s;
read(fd, &s, 12); // read the header, 8 + 4 bytes
char foo[s.foo_size];
read(fd, foo, s.foo_size); // read the payload into foo, not back into s
You should define what exactly you need to save. Define a precise data structure that you can easily deduce at read, avoid things like "oh, let's read until null-byte".
I used C functions to explain this because they are much more explicit: you know exactly what you read and what you write.
Start by playing with this, and then try the same with C++ streams and functions.
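For the C++ stream version of the header-plus-payload idea, a rough sketch could look like the following. The helper names write_field and read_field are invented here, and the 4-byte size is written in the machine's native byte order, so the file is only portable between machines that share it.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Writes a 4-byte length followed by the raw characters of `text`.
void write_field(std::ostream& out, const std::string& text) {
    const std::uint32_t size = static_cast<std::uint32_t>(text.size());
    out.write(reinterpret_cast<const char*>(&size), sizeof size);
    out.write(text.data(), static_cast<std::streamsize>(text.size()));
}

// Reads a field produced by write_field; returns false on truncated input.
bool read_field(std::istream& in, std::string& text) {
    std::uint32_t size = 0;
    if (!in.read(reinterpret_cast<char*>(&size), sizeof size)) return false;
    text.resize(size);
    if (size > 0 && !in.read(&text[0], static_cast<std::streamsize>(size))) return false;
    return true;
}
```

The same functions work with std::ofstream and std::ifstream opened in binary mode, or with a std::stringstream for experiments.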
I don't know how you are writing back information to the file but here is how I would do that, I'm hoping this is a fairly simple way of doing it. Keep in mind I have no idea what kind of file you are actually working with.
long long id = 1000;
std::string name = "name";
std::string surname = "overflow";
std::string degree = "internet";
unsigned char sex = 'c';
int birthdate = 256;
ofstream outfile("test.txt", ios::binary);
if (outfile.is_open())
{
const char* idBytes = static_cast<char*>(static_cast<void*>(&id));
const char* nameBytes = name.c_str();
const char* surnameBytes = surname.c_str();
const char* degreeBytes = degree.c_str();
const char* birthdateBytes = static_cast<char*>(static_cast<void*>(&birthdate));
outfile.write(idBytes, sizeof(id));
outfile.write(nameBytes, name.length());
outfile.write(surnameBytes, surname.length());
outfile.write(degreeBytes, degree.length());
outfile.put(sex);
outfile.write(birthdateBytes, sizeof(birthdate));
outfile.flush();
outfile.close();
}
and here is how I am going to output it, which to me seems to be coming out as expected.
ifstream infile("test.txt", std::ifstream::ate | ios::binary);
if (infile.is_open())
{
std::size_t fileSize = infile.tellg();
infile.seekg(0);
for (int i = 0; i < fileSize; i++)
{
char c = infile.get();
std::cout << c;
}
std::cout << std::endl;
}

Basics of strtol?

I am really confused. I have to be missing something rather simple but nothing I am reading about strtol() is making sense. Can someone spell it out for me in a really basic way, as well as give an example for how I might get something like the following to work?
string input = getUserInput;
int numberinput = strtol(input,?,?);
The first argument is the string. It has to be passed in as a C string, so if you have a std::string use .c_str() first.
The second argument is optional, and specifies a char * to store a pointer to the character after the end of the number. This is useful when converting a string containing several integers, but if you don't need it, just set this argument to NULL.
The third argument is the radix (base) to convert. strtol can do anything from binary (base 2) to base 36. If you want strtol to pick the base automatically based on prefix, pass in 0.
So, the simplest usage would be
long l = strtol(input.c_str(), NULL, 0);
If you know you are getting decimal numbers:
long l = strtol(input.c_str(), NULL, 10);
strtol returns 0 if there are no convertible characters at the start of the string. If you want to check if strtol succeeded, use the middle argument:
const char *s = input.c_str();
char *t;
long l = strtol(s, &t, 10);
if(s == t) {
/* strtol failed */
}
If you're using C++11, use stol instead:
long l = stol(input);
Alternately, you can just use a stringstream, which has the advantage of being able to read many items with ease just like cin:
stringstream ss(input);
long l;
ss >> l;
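The end-pointer argument is easiest to understand when a string holds several integers. In this small sketch (parse_all is a made-up name) each call resumes where the previous one stopped, and the loop ends when strtol converts nothing.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Collects every whitespace-separated decimal integer at the start of `input`.
// Stops at the first token that is not a number.
std::vector<long> parse_all(const std::string& input) {
    std::vector<long> values;
    const char* cursor = input.c_str();
    for (;;) {
        char* end = nullptr;
        long value = std::strtol(cursor, &end, 10);
        if (end == cursor) break;  // nothing converted: no more numbers
        values.push_back(value);
        cursor = end;              // resume right after the number just read
    }
    return values;
}
```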
Suppose you're given a string char const * str. Now convert it like this:
#include <cstdlib>
#include <cerrno>
char * e;
errno = 0;
long n = std::strtol(str, &e, 0);
The last argument 0 determines the number base you want to apply; 0 means "auto-detect". Other sensible values are 8, 10 or 16.
Next you need to inspect the end pointer e. This points to the character after the consumed input. Thus if all input was consumed, it points to the null-terminator.
if (*e != '\0') { /* error, die */ }
It's also possible to allow for partial input consumption using e, but that's the sort of stuff that you'll understand when you actually need it.
Lastly, you should check for errors, which can essentially only be overflow errors if the input doesn't fit into the destination type:
if (errno != 0) { /* error, die */ }
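In C++, it might be preferable to use std::stol, which reports errors through exceptions. (It also accepts an optional position pointer and a number base, as in std::stol(str, &pos, base); the base defaults to 10.)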
#include <string>
try { long n = std::stol(str); }
catch (std::invalid_argument const & e) { /* error */ }
catch (std::out_of_range const & e) { /* error */ }
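The pieces above (full-input check, overflow check, choice of base) can be combined into one helper. This is only a sketch: to_long is a hypothetical name, and it relies on std::stol's optional position and base parameters.

```cpp
#include <stdexcept>
#include <string>

// Converts the whole of `text` to long in the given base (0 means auto-detect).
// Returns false if the text is not a number, has trailing junk, or overflows.
bool to_long(const std::string& text, long& out, int base = 10) {
    try {
        std::size_t pos = 0;
        long value = std::stol(text, &pos, base);
        if (pos != text.size()) return false;  // trailing characters were left over
        out = value;
        return true;
    } catch (const std::invalid_argument&) {
        return false;  // no digits at all
    } catch (const std::out_of_range&) {
        return false;  // value does not fit in a long
    }
}
```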
Quote from C++ reference:
long int strtol ( const char * str, char ** endptr, int base );
Convert string to long integer
Parses the C string str interpreting its content as an integral number of the specified base, which is returned as a long int value. If endptr is not a null pointer, the function also sets the value of endptr to point to the first character after the number.
So try something like
long l = strtol(pointerToStartOfString, NULL, 0);
I always use simply strtol(str,0,0) - it returns a long value. 0 for radix (last parameter) means to auto-detect it from the input string, so both 0x10 as hex and 10 as decimal could be used in the input string.

C++ get hour and minutes from string

I'm writing C++ code for school in which I can only use the std library, so no boost. I need to parse a string like "14:30" and parse it into:
unsigned char hour;
unsigned char min;
We get the string as a c++ string, so no direct pointer. I tried all variations on this code:
sscanf(hour.c_str(), "%hhd[:]%hhd", &hours, &mins);
but I keep getting wrong data. What am I doing wrong.
As everyone else has mentioned, you have to use the %d format specifier (or %u). As for the alternative approaches, I am not a big fan of "because C++ has feature XX it must be used" and often resort to C-level functions, though I never use scanf()-like functions, as they have their own problems. That being said, here is how I would parse your string using strtol() with error checking:
#include <cstdio>
#include <cstdlib>
int main()
{
unsigned char hour;
unsigned char min;
const char data[] = "12:30";
char *ep;
hour = (unsigned char)strtol(data, &ep, 10);
if (!ep || *ep != ':') {
fprintf(stderr, "cannot parse hour: '%s' - wrong format\n", data);
return EXIT_FAILURE;
}
min = (unsigned char)strtol(ep+1, &ep, 10);
if (!ep || *ep != '\0') {
fprintf(stderr, "cannot parse minutes: '%s' - wrong format\n", data);
return EXIT_FAILURE;
}
printf("Hours: %u, Minutes: %u\n", hour, min);
}
Hope it helps.
Your problem is, of course, that you are using sscanf. And that
you're using some very special type for the hours and minutes, instead
of int. Since you're parsing a string of exactly 5 characters, the
simplest solution is just to ensure that all of the characters are legal
in that position, using isdigit for characters 0, 1, 3 and 4, and
comparing to ':' for character 2. Once you've done that, it's trivial
to create an std::istringstream from the string, and input into an
int, a char (which you'll ignore afterwards) and a second int. If
you want to be more flexible in the input, for example allowing things
like "9:45" as well, you can skip the initial checks, and just input
into int, char and int, then check that the char contains ':'
(and that the two int are in range).
As to why your sscanf is failing: you're asking it to match something
like "12[:]34", which is not what you're giving it. I'm not sure
whether you're trying to use "%hhd:%hhd", or if for some reason you
really do want a character class, in which case, you have to use [ as
a conversion specifier, and then ignore the input: "%hhd%*[:]%hhd".
(This would allow accepting more than one character as the separator,
but otherwise, I don't see the advantage. Also, technically at least,
using %d and then passing the address of an unsigned integral type
is not supported, %hhd must be a signed char. In practice,
however, I don't think you'll ever run into any problems for
non-negative input values less than 128.)
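The strict five-character check described at the start of this answer might look roughly like this. Note two departures that are assumptions of the sketch: it computes the values directly from the validated digits instead of going through std::istringstream, and it applies a 00:00 to 23:59 range check. The name parse_hhmm is hypothetical.

```cpp
#include <cctype>
#include <string>

// Parses a strict "HH:MM" string. Returns false for any other shape
// or for values outside 00:00 .. 23:59.
bool parse_hhmm(const std::string& text, unsigned char& hour, unsigned char& min) {
    if (text.size() != 5 || text[2] != ':') return false;
    const int digit_positions[] = {0, 1, 3, 4};
    for (int i : digit_positions) {
        // isdigit needs a value representable as unsigned char.
        if (!std::isdigit(static_cast<unsigned char>(text[i]))) return false;
    }
    const int h = (text[0] - '0') * 10 + (text[1] - '0');
    const int m = (text[3] - '0') * 10 + (text[4] - '0');
    if (h > 23 || m > 59) return false;
    hour = static_cast<unsigned char>(h);
    min = static_cast<unsigned char>(m);
    return true;
}
```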
As mentioned by izomorphius sscanf and variants are not C++ they are C. The C++ way would be to use streams. The following works (it's not amazingly flexible but should give you an idea)
#include <iostream>
#include <string>
#include <sstream>
using namespace std;
int main(int argc, char* argv[])
{
string str = "14:30";
stringstream sstrm;
int hour,min;
sstrm << str;
sstrm >> hour;
sstrm.get(); // get colon
sstrm >> min;
cout << hour << endl;
cout << min << endl;
return 0;
}
You could also use getline to get everything up to the colon.
I would do it like this
unsigned tmp_hours, tmp_mins;
unsigned char hours, mins;
sscanf(time.c_str(), "%u:%u", &tmp_hours, &tmp_mins); // time is the std::string holding "14:30"
hours = tmp_hours;
mins = tmp_mins;
Less messing around with obscure scanf options. I would add some error checking too.
My understanding is that h in %hhd is not a valid format specifier. The correct specifier for decimal integers is %d.
As R.Martinho Fernandes says in his comment, %d:%d will match two numbers separated by a colon (':').
Did you want something different?
You can always read the entire text string and parse it any way you want.
sscanf with %hhd:%hhd seems to work perfectly fine:
std::string time("14:30");
unsigned char hour, min;
sscanf(time.c_str(), "%hhd:%hhd", &hour, &min);
Note that the hh length modifier is simply to allow storing the value in an unsigned char.
However, sscanf is from the C Standard Library and there are better C++ ways to do this. A C++11 way to do this is using stoi:
std::string time("14:30");
unsigned char hour = std::stoi(time);
unsigned char min = std::stoi(time.substr(3));
In C++03, we can use stringstream instead but it's a bit of a pain if you really want it in a char:
std::stringstream stream("14:30");
unsigned int hour, min;
stream >> hour;
stream.ignore();
stream >> min;

How to read in only a particular number of characters

I have a small query regarding reading a set of characters from a structure. For example: A particular variable contains a value "3242C976*32" (char - type). How can I get only the first 8 bits of this variable. Kindly help.
Thanks.
Edit:
I'm trying to read in a signal:
For Ex: $ASWEER,2,X:3242C976*32
into this structure:
struct pg
{
char command[7]; // saves as $ASWEER,2,X:3242C976*32
char comma1[1]; // saves as ,2,X:3242C976*32
char groupID[1]; // saves as 2,X:3242C976*32
char comma2[1]; // etc
char handle[2]; // this is the problem, need it to save specifically each part, buts its not
char canID[8];
char checksum[3];
}m_pg;
...
When memcopying buffer into a structure, it works but because there is no carriage returns it saves the rest of the signal in each char variable. So, there is always garbage at the end.
you could..
convert your hex value in canID to float(depending on how you want to display it), e.g.
float value1 = HexToFloat(m_pg.canID); // find a conversion script for HexToFloat
CString val;
val.Format("%0.3f",value1);
the garbage values aren't actually being stored in the structure, it only displays it as so, as there is no carriage return, so format the message however you want to and display it using the CString val;
If "3242C976*3F" is a c-string or std::string, you can just do:
const char* str = "3242C976*3F";
char first_byte = str[0];
Or with an arbitrary memory block you can do:
SomeStruct memoryBlock;
char firstByte;
memcpy(&firstByte, &memoryBlock, 1);
Both copy the first 8bits or 1 byte from the string or arbitrary memory block just as well.
After the edit (original answer below)
Just copy by parts. In C, something like this should work (could also work in C++ but may not be idiomatic)
strncpy(m_pg.command, value, 7); // m_pg.command[7] = 0; // oops
strncpy(m_pg.comma1, value+7, 1); // m_pg.comma1[1] = 0; // oops
strncpy(m_pg.groupID, value+8, 1); // m_pg.groupID[1] = 0; // oops
strncpy(m_pg.comma2, value+9, 1); // m_pg.comma2[1] = 0; // oops
// etc
Also, you don't have space for the string terminator in the members of the structure (therefore the oopses above). They are NOT strings. Do not printf them!
Don't read more than 8 characters. In C, something like
char value[9]; /* 8 characters and a 0 terminator */
int ch;
scanf("%8s", value);
/* optionally ignore further input */
while (((ch = getchar()) != '\n') && (ch != EOF)) /* void */;
/* input terminated with ch (either '\n' or EOF) */
I believe the above code also "works" in C++, but it may not be idiomatic in that language
If you have a char pointer, you can just set str[8] = '\0'; Be careful though, because if the buffer is less than 8 (EDIT: 9) bytes, this could cause problems.
(I'm just assuming that the name of the variable that already is holding the string is called str. Substitute the name of your variable.)
It looks to me like you want to split at the comma, and save up to there. This can be done with strtok(), to split the string into tokens based on the comma, or strchr() to find the comma, and strncpy() to copy the string up to the comma.
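As one possible C++ take on splitting at the delimiters, the sketch below uses std::string::find in place of strtok() or strchr(). The Message struct and split_message function are invented for illustration and are not the asker's pg structure; the field layout is inferred from the single sample line in the question.

```cpp
#include <string>

struct Message {            // hypothetical holder, not the asker's struct
    std::string command;    // e.g. "$ASWEER"
    std::string group_id;   // e.g. "2"
    std::string handle;     // e.g. "X"
    std::string can_id;     // e.g. "3242C976"
    std::string checksum;   // e.g. "32"
};

// Splits "<command>,<group>,<handle>:<canID>*<checksum>" at its delimiters.
bool split_message(const std::string& line, Message& out) {
    const std::size_t c1 = line.find(',');
    if (c1 == std::string::npos) return false;
    const std::size_t c2 = line.find(',', c1 + 1);
    if (c2 == std::string::npos) return false;
    const std::size_t colon = line.find(':', c2 + 1);
    if (colon == std::string::npos) return false;
    const std::size_t star = line.find('*', colon + 1);
    if (star == std::string::npos) return false;

    out.command = line.substr(0, c1);
    out.group_id = line.substr(c1 + 1, c2 - c1 - 1);
    out.handle = line.substr(c2 + 1, colon - c2 - 1);
    out.can_id = line.substr(colon + 1, star - colon - 1);
    out.checksum = line.substr(star + 1);
    return true;
}
```

Because each field is its own std::string, there is no shared buffer for a neighbouring field to run into, which avoids the trailing garbage described in the question.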

Very strange char array behaviour

unsigned int fname_length = 0;
//fname length equals 30
file.read((char*)&fname_length,sizeof(unsigned int));
//fname contains random data as you would expect
char *fname = new char[fname_length];
//fname contains all the data 30 bytes long as you would expect, plus 18 bytes of random data on the end (intellisense display)
file.read((char*)fname,fname_length);
//m_material_file (std:string) contains all 48 characters
m_material_file = fname;
// count = 48
int count = m_material_file.length();
Now, when trying it this way, IntelliSense still shows the 18 bytes of data after setting the char array to all ' ', and I get exactly the same results, even without the file read:
char name[30];
for(int i = 0; i < 30; ++i)
{
name[i] = ' ';
}
file.read((char*)fname,30);
m_material_file = name;
int count = m_material_file.length();
Any idea what's going wrong here? It's probably something completely obvious, but I'm stumped!
thanks
Sounds like the string in the file isn't null-terminated, and intellisense is assuming that it is. Or perhaps when you wrote the length of the string (30) into the file, you didn't include the null character in that count. Try adding:
fname[fname_length] = '\0';
after the file.read(). Oh yeah, you'll need to allocate an extra character too:
char * fname = new char[fname_length + 1];
I guess that intellisense is trying to interpret char* as C string and is looking for a '\0' byte.
fname is a char* so both the debugger display and m_material_file = fname will be expecting it to be terminated with a '\0'. You're never explicitly doing that, but it just happens that whatever data follows that memory buffer has a zero byte at some point, so instead of crashing (which is a likely scenario at some point), you get a string that's longer than you expect.
Use
m_material_file.assign(fname, fname + fname_length);
which removes the need for the zero terminator. Also, prefer std::vector to raw arrays.
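Putting the two suggestions together (an explicit length instead of a terminator, and std::vector instead of a raw array) gives something like the sketch below. read_exact is a hypothetical helper name.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Reads exactly `length` bytes and returns them as a std::string.
// No '\0' terminator is needed because the length is passed explicitly.
bool read_exact(std::istream& in, std::size_t length, std::string& out) {
    std::vector<char> buffer(length);
    if (length > 0 && !in.read(buffer.data(), static_cast<std::streamsize>(length))) {
        return false;  // the stream held fewer bytes than requested
    }
    out.assign(buffer.begin(), buffer.end());
    return true;
}
```

The vector frees its memory automatically, so there is no new[] to pair with a delete[].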
std::string::operator=(char const*) is expecting a sequence of bytes terminated by a '\0'. You can solve this with any of the following:
extend fname by a character and add the '\0' explicitly as others have suggested or
use m_material_file.assign(&fname[0], &fname[fname_length]); instead or
use repeated calls to file.get(ch) and m_material_file.push_back(ch)
Personally, I would use the last option since it eliminates the explicitly allocated buffer altogether. One fewer explicit new is one fewer chance of leaking memory. The following snippet should do the job:
std::string read_name(std::istream& is) {
unsigned int name_length;
std::string file_name;
if (is.read((char*)&name_length, sizeof(name_length))) {
for (unsigned int i=0; i<name_length; ++i) {
char ch;
if (is.get(ch)) {
file_name.push_back(ch);
} else {
break;
}
}
}
return file_name;
}
Note:
You probably don't want to use sizeof(unsigned int) to determine how many bytes to write to a binary file. The number of bytes read/written is dependent on the compiler and platform. If you have a maximum length, then use it to determine the specific byte size to write out. If the length is guaranteed to be at most 255 bytes, then only write a single byte for the length. Then your code will not depend on the byte size of intrinsic types.
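A single-byte length prefix, as suggested in the note, could be sketched like this. The function names are invented for the example, and the 255-character limit is the assumption the whole scheme rests on.

```cpp
#include <sstream>
#include <string>

// Writes a one-byte length (0..255) followed by the characters.
bool write_short_string(std::ostream& out, const std::string& text) {
    if (text.size() > 255) return false;  // would not fit in one byte
    out.put(static_cast<char>(static_cast<unsigned char>(text.size())));
    out.write(text.data(), static_cast<std::streamsize>(text.size()));
    return static_cast<bool>(out);
}

// Reads a string written by write_short_string; false on EOF or short input.
bool read_short_string(std::istream& in, std::string& text) {
    const int length = in.get();  // the byte as 0..255, or EOF
    if (length == std::char_traits<char>::eof()) return false;
    text.resize(static_cast<std::size_t>(length));
    if (length > 0 && !in.read(&text[0], length)) return false;
    return true;
}
```

Since istream::get() returns the byte as a non-negative int, lengths above 127 survive the round trip even where plain char is signed.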