Parsing char array in C++

I'm coding an HTTP server and I'm stuck on parsing GET requests.
How would I parse a char buffer with something like
GET /images/logo.png HTTP/1.1
So that I only get the path and file extension but ignore the other parts?

You don't specifically say what sort of storage this string is in - a simple char* or some string class.
In general, you could either do it the quick and dirty way, by splitting the string on the space character and taking the middle section, or take the better approach and get familiar with regular expressions. C++ has several regex libraries - Boost's is well regarded.
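For the quick and dirty route, a minimal sketch (assuming the request line sits in a null-terminated buffer and is well formed, with exactly one space on each side of the path) might look like this:
#include <cstring>
#include <iostream>
#include <string>

// Extract the path and extension from a request line such as
// "GET /images/logo.png HTTP/1.1". Real requests need more validation.
void parseRequestLine(const char* buffer)
{
    const char* firstSpace = std::strchr(buffer, ' ');
    if (firstSpace == nullptr) return;                  // malformed request
    const char* secondSpace = std::strchr(firstSpace + 1, ' ');
    if (secondSpace == nullptr) return;                 // malformed request

    std::string path(firstSpace + 1, secondSpace);      // "/images/logo.png"
    std::string extension;
    std::string::size_type dot = path.rfind('.');
    if (dot != std::string::npos)
        extension = path.substr(dot + 1);                // "png"

    std::cout << path << " " << extension << "\n";
}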

Related

Parse an XML in standard C/C++ without additional libraries

I have an XML (assuming it is valid) and I must parse it and store it in a tree.
What is the best approach to parse it, without using other libraries, just basic manipulation of strings?
Keep in mind that I don't have to validate it, just parse it and store it in a tree.
The basic structure of XML is quite simple:
<tagname [attribute[="value"] ...]>content</tagname>
where the content may contain both normal text and more XML structures, or the special form
<tagname [attribute[="value"] ...]/>
which is equivalent to
<tagname [attribute[="value"] ...]></tagname>
that is, empty content.
So if you don't need to interpret a DTD or do other fancy things, you can do the following (a minimal sketch in code follows the steps):
1. Check that the first non-whitespace character is <. If not, you don't have XML and can just give an error and exit.
2. Now follows the tag name, until the first whitespace, or the / or the > character. Store that.
3. If the next non-whitespace character is /, check that it is followed by >. If so, you've finished parsing and can return your result. Otherwise, you've got malformed XML and can exit with an error.
4. If the character is >, then you've found the end of the begin tag. Now follows the content. Continue at step 6.
5. Otherwise what follows is an attribute. Parse that, store the result, and continue at step 3.
6. Read the content until you find a < character.
7. If that character is followed by /, it's the end tag. Check that it is followed by the tag name and >, and if yes, return the result. Otherwise, throw an error.
8. If you get here, you've found the beginning of a nested XML element. Parse that with this algorithm, and then continue at step 6.
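A rough sketch of those steps in code, assuming the simplified grammar above (no DTDs, entity references, comments or CDATA, and mostly well-formed input) and using made-up type and function names:
#include <cctype>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical node type for the resulting tree.
struct Node {
    std::string name;
    std::map<std::string, std::string> attributes;
    std::string text;
    std::vector<Node> children;
};

// Minimal sketch of the numbered steps; pos is an index into xml.
// Entity references, comments, CDATA and DTDs are not handled, and
// malformed input is only partially detected.
Node parseElement(const std::string& xml, std::size_t& pos)
{
    auto skipSpace = [&]() { while (pos < xml.size() && std::isspace((unsigned char)xml[pos])) ++pos; };

    skipSpace();                                              // step 1
    if (pos >= xml.size() || xml[pos] != '<') throw std::runtime_error("expected '<'");
    ++pos;

    Node node;
    while (pos < xml.size() && !std::isspace((unsigned char)xml[pos]) && xml[pos] != '>' && xml[pos] != '/')
        node.name += xml[pos++];                              // step 2: tag name

    for (;;) {                                                // steps 3-5: attributes / end of begin tag
        skipSpace();
        if (xml[pos] == '/') { pos += 2; return node; }       // "/>": empty content
        if (xml[pos] == '>') { ++pos; break; }                // end of begin tag
        std::string key, value;
        while (xml[pos] != '=') key += xml[pos++];
        pos += 2;                                             // skip ="
        while (xml[pos] != '"') value += xml[pos++];
        ++pos;
        node.attributes[key] = value;
    }

    for (;;) {                                                // steps 6-8: content
        while (pos < xml.size() && xml[pos] != '<') node.text += xml[pos++];
        if (pos + 1 < xml.size() && xml[pos + 1] == '/') {    // end tag; a fuller version would
            pos = xml.find('>', pos) + 1;                     // check the tag name matches here
            return node;
        }
        node.children.push_back(parseElement(xml, pos));      // nested element, back to step 1
    }
}
Calling this on <greeting lang="en">hello</greeting> would give a node named greeting with one attribute and the text hello; anything much beyond that quickly runs into the complications described in the next answer.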
Reading XML looks simple but doing it correctly involves a few complexities you don't really want to deal with. Indeed, writing a simple XML parser effectively amounts to creating yet another XML library. I have done it and an incomplete version of this is sitting somewhere on my disk. Even if you don't need to validate your XML structure:
whether you validate or not, you need to deal with entity references like &lt; and the variety of numeric character references like &#65;
the plain body of an XML document is relatively simple, but the header is a major pain to deal with, in particular the DTD: there are two slightly different versions of it, and you probably need to process the inline DTD
even the body isn't entirely trivial because of the annoying character data (CDATA) segments
even without validation you may need to support external entity references
the characters to be accepted and/or rejected for various parts of XML are also somewhat interesting
note that XML is defined in terms of Unicode and proper handling of this isn't entirely trivial either: simply using char or wchar_t doesn't cut it.
The first version I implemented was a nice little iterator intended to pop out all the elements encountered. This allowed for the nice feature of easily stopping and continuing the parsing at the choice of the iterator user. Unfortunately, I didn't get it to fly when trying to cope with the various entity references. It would parse simple XML files nicely and fast, but some quirks in the specification I just didn't get right.
What worked best for me was creating a simple recursive descent parser combined with a suitable stack of buffers to somewhat transparently deal with entity references. However, to finish this completely I still need to deal with some encoding issues, and in the end I just had higher priority projects to work on (in my spare time, that is).
In summary: it can be done, obviously, as others did. It is probably a somewhat pointless exercise unless you have a really bright idea which makes your implementation uniquely better suited than the alternatives.
The best and only approach is to re-implement such a library from scratch without using any other libraries...
You're welcome to use existing libraries like pugixml, for example. Its installation is as simple as adding the files to your project and starting to use it. It's lightweight compared to validating parsers such as Xerces.
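As a quick illustration (the file name and element names here are made up for the example), loading a document and walking its children with pugixml looks roughly like this:
#include <iostream>
#include "pugixml.hpp"

int main()
{
    pugi::xml_document doc;
    pugi::xml_parse_result result = doc.load_file("creatures.xml");   // hypothetical file
    if (!result) {
        std::cerr << "parse error: " << result.description() << "\n";
        return 1;
    }

    // Iterate over <creature> elements under a hypothetical <creatures> root.
    for (pugi::xml_node creature : doc.child("creatures").children("creature"))
        std::cout << creature.attribute("name").value() << "\n";

    return 0;
}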

I'm getting contradictory answers, what should I do with my code?

I am making a RPG game in C++ and DirectX.
I store all the data for the game in .txt files and read/write it using ifstream/ofstream. This has worked well for me so far for creature stats, and I have a hack together for creature names, but this is becoming a problem.
I can store strings in the txt file and read them, but I am having trouble using them. For single words I have a hack, but now that I am up to the story line where characters are talking to each other it is a real problem.
I asked on gamedevelopment.stackexchange how to put text on screen and was told to use D3Dtext, but that only accepts C-style strings and I can only read C++ strings from the text file. This is such a big problem now that I am willing to go back and re-factor what needs it, as no progress can be made until this is sorted.
So now I have a bunch of questions and I don't know which to ask first:
I want a way to draw the letters like graphics. I was told this is what D3Dtext does, but I want to implement it myself if I can; I just need info on how, if someone knows.
If I am to use D3Dtext like the so-called experts advise, I have to use C-style strings. So how can I convert between C++ strings and C-style strings? I have a method now, but it requires the new and delete operators for every string, and I can see this becoming a big problem as it grows in complexity.
Is there a way to read C-style strings directly? Maybe a replacement for ifstream. I would like to keep the txt files, as I really don't want to use XML, but I could change the file format if that were a viable solution.
Premature optimisation, I know, but I plan to use the same function for every piece of text in the game, so what would be a good way of doing this in terms of speed (which is why I don't want new/delete for every string)?
I am happy to provide any information that would be needed to help me, just ask.
std::string mystr = "Hello World.";
mystr.c_str(); // gets a null terminated const char* C-style string
Read your file as you are currently doing, then access the C strings as above when you need to.
You can convert freely between C-style strings and C++'s std::string. Just use my_cpp_string.c_str() to get the C-string representation of a C++ string, and std::string my_cpp_string(my_c_string) to initialize a new std::string from a C-style string.
2) Use the c_str() method to pass your C++ strings to D3Dtext
some_D3Dtext_function(some_text.c_str())
3 and 4 then become non-issues.
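Putting it together, a minimal sketch (drawText() is a made-up stand-in for whatever D3Dtext call you end up using, and the file name is assumed) could look like this:
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for the real D3Dtext call, which wants a C-style string.
void drawText(const char* text) { std::cout << text << "\n"; }

int main()
{
    std::ifstream file("dialogue.txt");        // assumed file name
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(file, line))
        lines.push_back(line);                  // read C++ strings as usual

    for (const std::string& s : lines)
        drawText(s.c_str());                    // hand a C-style view to the C API
}
No new/delete is involved; c_str() just exposes the string's internal null-terminated buffer for the duration of the call.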

C++ Dynamically convert string to any basic type

In C++ I need to convert a string to any type at runtime where I do not know what type I might be getting in the string. I have heard there is a lexical_cast in boost that I can use, but what would be the most effective way to implement it?
I might get a bunch of strings like this from a client: Date="25/08/2010", Someval="2", Blah="25.5".
Now I want to be able to convert these strings to their types, e.g. Someval is obviously an int, and Date could be a boost::date or whatever. The point is, I don't know at this time in what order these would be given to me, so it's hard to write some code that will perform a bunch of casts.
I could use a bunch of if/else statements or a switch/case statements, however I'm thinking that there is possibly a better way to do this.
I'm not looking for something different to lexical_cast, I can totally use that; I am looking to see if someone knows a better way than doing this:
std::string str = "256";
int a = lexical_cast<int>(str);
//now check if the cast worked, if not, try another...
This is too much of a guessing game, and if I have 10 possible types for any given string, it sounds a bit inefficient. Especially if it has to do thousands of these at any given time.
Can anybody advise?
Alex Brown notes - the example string is a fragment of the XML data that comes from the client.
Use an XML parser to read XML data, it will do almost all of the legwork for you, and deal with the ordering issues. Then you simply need to ask the parser for the data you need for the calculation.
Details differ with different XML parsers - go find one, read the documentation. If you need more help, come back here with an XML parser question.
GMan is right: you cannot cast an arbitrary string to, for example, a Date type if the underlying data structure is different. You can, however, parse the content and instantiate a new object using the data in the string. std::atoi(), for example, parses a C string into an int.
You need to parse the string, not cast it.
What you're describing is actually a parser. Even the trial-and-error approach using lexical_cast is really just a (crude) parser.
I suggest clarifying the format of the input string and then, if it's simple enough, writing a recursive descent parser by hand to parse the input string into whatever data structure is convenient for your needs.
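For input shaped like the example above (Key="value" pairs separated by commas, which is an assumption about the format), a hand-rolled sketch might just collect the pairs into a map and leave the per-key type conversion for later:
#include <map>
#include <string>

// Sketch only: assumes input like Date="25/08/2010", Someval="2", Blah="25.5"
// with no escaped quotes or other complications.
std::map<std::string, std::string> parsePairs(const std::string& input)
{
    std::map<std::string, std::string> result;
    std::string::size_type pos = 0;
    while (pos < input.size()) {
        std::string::size_type eq = input.find('=', pos);
        if (eq == std::string::npos) break;
        std::string key = input.substr(pos, eq - pos);
        std::string::size_type k = key.find_first_not_of(", ");   // trim leading separators
        key = (k == std::string::npos) ? "" : key.substr(k);

        std::string::size_type openQuote = input.find('"', eq);
        std::string::size_type closeQuote = input.find('"', openQuote + 1);
        if (openQuote == std::string::npos || closeQuote == std::string::npos) break;
        result[key] = input.substr(openQuote + 1, closeQuote - openQuote - 1);
        pos = closeQuote + 1;
    }
    return result;
}
Since you then know which key each value belongs to, every value can be converted with the single appropriate cast instead of trial and error.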
You could use a VARIANT type of struct (i.e. one field for every possible result, plus a "type" member specifying which one it is, drawn from a big enum of types), and a ConvertStringToVariant() function.
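A minimal sketch of that idea (the set of types and the guessing order here are assumptions; only the ConvertStringToVariant() name comes from the suggestion above):
#include <cstdlib>
#include <string>

enum class ValueType { Int, Double, Text };

struct Variant {
    ValueType type;
    int i = 0;
    double d = 0.0;
    std::string s;            // a real VARIANT would likely put the plain types in a union
};

// Try the numeric interpretations first, fall back to plain text.
Variant ConvertStringToVariant(const std::string& str)
{
    Variant v;
    char* end = nullptr;

    long asLong = std::strtol(str.c_str(), &end, 10);
    if (end != str.c_str() && *end == '\0') {
        v.type = ValueType::Int;
        v.i = static_cast<int>(asLong);
        return v;
    }

    double asDouble = std::strtod(str.c_str(), &end);
    if (end != str.c_str() && *end == '\0') {
        v.type = ValueType::Double;
        v.d = asDouble;
        return v;
    }

    v.type = ValueType::Text;
    v.s = str;
    return v;
}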
This is too much of a guessing game, and if I have 10 possible types, for any given string
If you're concerned about this, you need a lexical analyzer, such as flex or Boost::Spirit.
It will still be a guessing game, but a more "informed" guessing one.

Parse URLs using C-Strings in C++

I'm learning C++ for one of my CS classes, and for our first project I need to parse some URLs using c-strings (i.e. I can't use the C++ String class).
The only way I can think of approaching this is just iterating through (since it's a char[]) and using some switch statements. From someone who is more experienced in C++ - is there a better approach? Could you maybe point me to a good online resource? I haven't found one yet.
Weird that you're not allowed to use C++ standard library features, i.e. C++ strings!
There are some C string functions available in the standard C library.
e.g.
strdup - duplicate a string
strtok - breaking a string into tokens. Beware - this modifies the original string.
strcpy - copying string
strstr - find string in string
strncpy - copy up to n bytes of string
etc.
There is a good online reference with a full list of available C string functions for searching and finding things:
http://www.cplusplus.com/reference/clibrary/cstring/
You can walk through strings by accessing them like an array if you need to.
e.g.
const char* url = "http://stackoverflow.com/questions/1370870/c-strings-in-c";
int len = strlen(url);
for (int i = 0; i < len; ++i) {
    std::cout << url[i];
}
std::cout << std::endl;
As for actually how to do the parsing, you'll have to work that out on your own. It is an assignment after all.
There are a number of C standard library functions that can help you.
First, look at the C standard library function strtok. This allows you to retrieve parts of a C string separated by certain delimiters. For example, you could tokenize with the delimiter / to get the protocol, domain, and then the file path. You could tokenize the domain with delimiter . to get the subdomain(s), second level domain, and top level domain. Etc.
It's not nearly as powerful as a regular expression parser, which is what you would really want for parsing URLs, but it works on C strings, is part of the C standard library and is probably OK to use in your assignment.
Other C standard library functions that may help:
strstr() Finds a substring within a string, much like std::string::find()
strspn(), strchr() and strpbrk() Find a character or characters in a string, similar to std::string::find_first_of(), etc.
Edit: A reminder that the proper way to use these functions in C++ is to include <cstring> and use them in the std:: namespace, e.g. std::strtok().
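As a rough illustration of the strtok() approach (note that strtok() modifies its argument, so the URL is placed in a writable array; this only tokenizes, it is not a complete URL parser):
#include <cstring>
#include <iostream>

int main()
{
    char url[] = "http://stackoverflow.com/questions/1370870/c-strings-in-c";

    // std::strtok() writes '\0' into the buffer, so it must be modifiable.
    char* token = std::strtok(url, "/");
    while (token != nullptr) {
        std::cout << token << "\n";          // "http:", "stackoverflow.com", "questions", ...
        token = std::strtok(nullptr, "/");
    }
    return 0;
}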
You might want to refer to an open source library that can parse URLs (as a reference for how others have done it -- obviously don't copy and paste it!), such as curl or wget (links are directly to their url parsing files).
I don't know what the requirements are for parsing the URLs, but if this is CS level it would be appropriate to use a (very simple) BNF and a (very simple) recursive descent parser. This would make for a more robust solution than direct iteration, e.g. for malformed URLs. Very few string functions from the standard C library would be needed.
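A minimal sketch of that idea, with a made-up grammar and made-up function names, and no bounds checking (purely for illustration):
#include <cctype>
#include <cstdio>

/* Tiny ad hoc grammar, just for illustration:
 *   url    ::= scheme "://" host path
 *   scheme ::= alpha+
 *   host   ::= (alnum | '.' | '-')+
 *   path   ::= whatever remains
 */
static const char* p;                         // cursor into the URL being parsed

static int scheme(char* out) {
    int n = 0;
    while (isalpha((unsigned char)*p)) out[n++] = *p++;
    out[n] = '\0';
    return n > 0;
}

static int literal(const char* text) {        // match a fixed piece of punctuation
    for (; *text; ++text, ++p)
        if (*p != *text) return 0;
    return 1;
}

static int host(char* out) {
    int n = 0;
    while (isalnum((unsigned char)*p) || *p == '.' || *p == '-') out[n++] = *p++;
    out[n] = '\0';
    return n > 0;
}

int main(void) {
    char sch[32], hst[128];
    p = "http://stackoverflow.com/questions/1370870/c-strings-in-c";
    if (scheme(sch) && literal("://") && host(hst))
        printf("scheme=%s host=%s path=%s\n", sch, hst, p);
    else
        printf("malformed URL\n");
    return 0;
}
Each grammar rule becomes one small function, so handling query strings or ports later is a matter of adding rules rather than rewriting one big loop.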
You can use C functions like strtok, strchr, strstr etc.
Many of the runtime library functions that have been mentioned work quite well, either in conjunction with or apart from the approach of iterating through the string that you mentioned (which I think is time honored).

Are there particular cases where native text manipulation is more desirable than regex?

Are there particular cases where native text manipulation is more desirable than regex?
In particular .net?
Note:
Regex appears to be a highly emotive subject, so I am wary of asking such a question. This question is not inviting personal/professional opinions on regex, only specific situations where a solution including its use is not as good as language-native commands (including those which have underlying code using regex) and why.
Also, note that desirable can mean performance or code readability; it does not mean panacea, as each solution to a problem has its benefits and limitations.
Apologies if this is a duplicate, I have searched SO for a similar question.
I prefer text manipulation over regular expressions to parse delimited string input. It's far simpler (for me at least) to issue a string split than to manage a regular expression.
Given some text:
value1, value2, value3
You can parse the line easily:
var values = myString.Split(',');
I'm sure there's a better way but with regular expressions you'd have to do something like:
var match = Regex.Match(myString, "^([^,]*),([^,]*),([^,]*)$");
var value1 = match.Groups[1].Value;
...
When you can do it simply with native text manipulation, it is usually preferable (simpler to read & better performance) not to use regex.
Personal rule of thumb: if it's tricky or relatively longer to do it "manually" and that performance gain is negligible, don't. Else do.
Don't examples:
split
simple find & replace
long text
loop
existing native functions (like, in PHP, strrchr, ucwords...)
Using a regex basically means embedding a tiny program, written in a different programming language, in the middle of your program. I'll ignore the inefficiency of using a regex over native string manipulation, because it probably isn't relevant in most cases.
I prefer native text manipulation over regex any time native text manipulation will be easier to follow for other people. Which is true quite frequently, since plenty of the people around me are not strongly familiar with regex. Unless working with something that is very much about parsing (via regex) they should not need to be!
Regular expressions are usually slower, less readable, and harder to debug than native string manipulation.
The main case where I'll prefer regex over string manipulation is when I want to be able to have different ways to parse strings depending on the source, and the types of sources will increase over time. Native string manipulation is not really practical in this case. I've had cases where I've stuck a regex column in a database...
Regexes are very flexible and powerful, because they are in many ways similar to an eval() statement. That being said, depending on the implementation, they can be a bit slow. Normally this is not an issue; however, if they can be avoided in a particularly costly loop, that can boost performance.
That being said, I tend to use them, and only worry about performance when the app is "done" and I have real benchmarks to prove I need to tweak performance, i.e. avoid premature optimization.
Whenever the same result can be achieved with a reasonable amount of code.
Regular expressions are very powerful, but they tend to get hard to read. If you can do the same with simple string operations that usually means that the code gets easier to manage and maintain.
There is some overhead in setting up the object and parsing the expression. For simpler string manipulation you can get better performance with simple string methods.
Example:
Getting the file name from a file path (yes, I know that the Path class should be used for that, it's just an example...)
string name = Regex.Match(path, @"([^\\]+)$").Groups[0].Value;
vs.
string name = path.Substring(path.LastIndexOf('\\') + 1);
The second solution is straightforward and does the minimal work needed to get the result. The regular expression solution produces the same result, but it does more work to parse the string, and it produces a bunch of objects that are not needed for the result.
Regex parsing and execution requires the host language to defer processing to its regex "engine". This adds overhead, so for any case where native string manipulation could be used, it is preferable for speed (and readability!).
I'll usually just use text manipulation for simple string replacements (e.g. replacing tokens in a template with actual values). You could certainly do this with Regex, but replacements are much easier.
Yes. Example:
#include <string.h>

/* Return the file name part of a path: everything after the last '/',
 * or the whole path if it contains no '/'. */
const char* basename (const char* path)
{
    const char* p = strrchr(path, '/');
    return (p != NULL) ? (p + 1) : path;
}