Parse URLs using C-Strings in C++ - c++

I'm learning C++ for one of my CS classes, and for our first project I need to parse some URLs using c-strings (i.e. I can't use the C++ String class).
The only way I can think of approaching this is just iterating through (since it's a char[]) and using some switch statements. From someone who is more experienced in C++ - is there a better approach? Could you maybe point me to a good online resource? I haven't found one yet.

Weird that you're not allowed to use C++ language features i.e. C++ strings!
There are some C string functions available in the standard C library.
e.g.
strdup - duplicate a string
strtok - breaking a string into tokens. Beware - this modifies the original string.
strcpy - copying string
strstr - find string in string
strncpy - copy up to n bytes of string
etc
There is a good online reference here with a full list of available c string functions
for searching and finding things.
http://www.cplusplus.com/reference/clibrary/cstring/
You can walk through strings by accessing them like an array if you need to.
e.g.
char* url="http://stackoverflow.com/questions/1370870/c-strings-in-c"
int len = strlen(url);
for (int i = 0; i < len; ++i){
std::cout << url[i];
}
std::cout << endl;
As for actually how to do the parsing, you'll have to work that out on your own. It is an assignment after all.

There are a number of C standard library functions that can help you.
First, look at the C standard library function strtok. This allows you to retrieve parts of a C string separated by certain delimiters. For example, you could tokenize with the delimiter / to get the protocol, domain, and then the file path. You could tokenize the domain with delimiter . to get the subdomain(s), second level domain, and top level domain. Etc.
It's not nearly as powerful as a regular expression parser, which is what you would really want for parsing URLs, but it works on C strings, is part of the C standard library and is probably OK to use in your assignment.
Other C standard library functions that may help:
strstr() Extracts substrings just like std::string::substr()
strspn(), strchr() and strpbrk() Find a character or characters in a string, similar to std::string::find_first_of(), etc.
Edit: A reminder that the proper way to use these functions in C++ is to include <cstring> and use them in the std:: namespace, e.g. std::strtok().

You might want to refer to an open source library that can parse URLs (as a reference for how others have done it -- obviously don't copy and paste it!), such as curl or wget (links are directly to their url parsing files).

I don't know what the requirements are for parsing the URLs,
but if this is CS level it would be appropriate to use (very
simple) BNF and a (very simple) recursive descent parser.
This would make for a more robust solution than direct
iteration, e.g. for malformed URLs.
Very few string functions from the standard C library would
be needed.

You can use C functions like strtok, strchr, strstr etc.

Many of the runtime library functions that have been mentioned work quite well, either in conjunction with or apart from the approach of iterating through the string that you mentioned (which I think is time honored).

Related

Expand escape sequences detected by flex

In my scanner.lex file I have this:
{Some rule that matches strings} return STRING; //STRING is enum
in my c++ file I have this:
if (yylex == STRING) {
cout << "STRING: " << yytext << endl;
Obviously with some logic to take the input from stdin.
Now if this program gets the input "Hello\nWorld", my output is "STRING: Hello\nWorld", while I would want my output to be:
Hello
World
The same goes for other escape characters such as \",\0, \x<hex_number>, \t, \\... But I'm not sure how to achieve this. I'm not even sure if that's a flex issue or if I can solve this using only c++ tools...
How can I get this done?
As #Some programmer dude mentions in a comment, there is an an example of how to do this using start conditions in the Flex documentation. That example puts the escaping rules into a separate start condition; each rule is implemented by appending the unescaped text to a buffer. And that's the way it's normally done.
Of course, you might find an external library which unescapes C-style escaped strings, which you could call on the string returned by flex. But that would be both slower and less flexible than the approach suggested in the Flex manual: slower because it requires a second scan of the string, and less flexible because the library is likely to have its own idea of what escapes to handle.
If you're using C++, you might find it more elegant to modify that example to use a std::string buffer instead of an arbitrary fixed-size character array. You can compile a flex-generated scanner with C++, so there is no problem using C++ standard library objects in your scanner code.
Depending on the various semantic value types you are managing, you will probably want to modify the yylex prototype to either use an additional reference parameter or a more structured return type, in order to return the token value to the caller. Note that while it is OK to use yytext before the next call to yylex, it's not generally considered good style since it won't work with most parsers: in general, parsers require the ability to look one or more tokens ahead, and thus yytext is likely to be overwritten by the time your parser needs its value. The flex manual documents the macro hook used to modify the yylex() prototype.

Parsing char array in C++

I'm coding a httpserver and I'm stuck at parsing GET requests.
How would I parse a char buffer with something like
GET /images/logo.png HTTP/1.1
So that I only get the path and file extension but ignore the other parts?
You don't say specifically say what sort of storage this string is in - simple char* or some string class.
So, in general, you could either do it the rather simple and dirty way, by splitting the string on the space character, and taking the second or middle section. Or, a better approach would be to get familiar with Regular Expressions. C++ has several regex libraries - Boost is well regarded.

C++, Multilanguage/Localisation support

what's the best way to add multilanguage support to a C++ program?
If possible, the language should be read in from a plain text file containing something like key-value pairs (§WelcomeMessage§ "Hello %s!").
I thought of something like adding a localizedString(key) function that returns the string of the loaded language file. Are there better or more efficient ways?
//half-pseudo code
//somewhere load the language key value pairs into langfile[]
string localizedString(key)
{
//do something else here with the string like parsing placeholders
return langfile[key];
}
cout << localizedString(§WelcomeMessage§);
Simplest way without external libraries:
// strings.h
enum
{
LANG_EN_EN,
LANG_EN_AU
};
enum
{
STRING_HELLO,
STRING_DO_SOMETHING,
STRING_GOODBYE
};
// strings.c
char* en_gb[] = {"Well, Hello","Please do something","Goodbye"};
char* en_au[] = {"Morning, Cobber","do somin'","See Ya"};
char** languages[MAX_LANGUAGES] = {en_gb,en_au};
This will give you what you want. Obviously you could read the strings from a file. I.e.
// en_au.lang
STRING_HELLO,"Morning, CObber"
STRING_DO_SOMETHING,"do somin'"
STRING_GOODBYE,"See Ya"
But you would need a list of string names to match to the string titles. i.e.
// parse_strings.c
struct PARSE_STRINGS
{
char* string_name;
int string_id;
}
PARSE_STRINGS[] = {{"STRING_HELLO",STRING_HELLO},
{"STRING_DO_SOMETHING",STRING_DO_SOMETHING},
{"STRING_GOODBYE",STRING_GOODBYE}};
The above should be slightly easier in C++ as you could use the enum classes toString() method (or what ever it as - can't be bothered to look it up).
All you then have to do is parse the language files.
I hope this helps.
PS: and to access the strings:
languages[current_language][STRING_HELLO]
PPS: apologies for the half c half C++ answer.
Space_C0wb0w's suggestion is a good one. We currently use successfully use ICU for that in our products.
Echoing your comment to his answer: It is indeed hard to say that ICU is "small, clean, uncomplicated". There is "accidental" complexity in ICU coming from its "Java-ish" interface, but a large part of the complexity and size simply comes from the complexity and size of the problem domain it is addressing.
If you don't need ICU's full power and are only interested in "message translation", you may want to look at GNU gettext which, depending on your platform and licencing requirements, may be a "smaller, cleaner and less-complicated" alternative.
The Boost.Locale project is also an interesting alternative. In fact, its "Messages Formatting" functionality is based on the gettext model.
Since you are asking for the best way (and didn't mention the platform) I would recommend GNU Gettext.
Arguably it is the most complete and mature internationalization library for C/C++ programming.

C++ - Splitting Filename and File Extension

Ok, first of all I don't want to use Boost, or any external libraries. I just want to use the C++ Standard Library. I can easily split strings with a given delimiter with my split() function:
void split(std::string &string, std::vector<std::string> &tokens, const char &delim) {
std::string ea;
std::stringstream stream(string);
while(getline(stream, ea, delim))
tokens.push_back(ea);
}
I do this on filenames. But there's a problem. There are files that have extensions like: tar.gz, tar.bz2, etc. Also there are some filenames that have extra dots. Some.file.name.tar.gz. I wish to separate Some.file.name and tar.gz Note: The number of dots in a filename isn't constant.
I also tried PathFindExtension but no luck. Is this possible? If so, please enlighten me. Thank you.
Edit: I'm very sorry about not specifying the OS. It's Windows.
I think you could use std::string find_last_of to get the index of the last ., and substr to cut the string (although the "complex extensions" involving multiple dots will require additional work).
There is no way of doing what you want that does not involve a database of extensions for your purpose. There's nothing magical about extensions, they are just part of a filename (if you gunzip foo.tar.gz you'll likely get a foo.tar, so for this application .gz actually is "the extension"). So, in order to do what you want, build a database of extensions that you want to look for and fall back on "last dot" if you don't find one.
There's nothing in the C++ standard library -- that is, it's not in the Standard --, but every operating system I know of provides this functionality in a variety of ways.
In Windows you can use _splitpath(), and in Linux you can use dirname() & basename()
The problem is indeed filenames like *.tar.gz, which can not be split consistently, due to the fact that (at least in Windows) the .tar part isn't part of the extension. You'll either have to keep a list for these special cases and use a one-dot string::rfind for the rest or find some pre-implemented way. Note that the .tar.* extensions aren't infinite, and very much standardized (there's about ten of them I think).
You could create a look-up table of file extensions that you think you might encounter. And also add a command line option to add a new one to the look-up table if you encounter anything new. Then parse through the file name to see if it any entry in the look-up table is a sub-string in the file name.
EDIT: You can also refer to this question: C++/STL string: How to mimic regex like function with wildcards?

C++/STL string: How to mimic regex like function with wildcards?

I would like to compare 4 character string using wildcards.
For example:
std::string wildcards[]=
{"H? ", "RH? ", "H[0-5] "};
/*in the last one I need to check if string is "H0 ",..., and "H5 " */
Is it possible to manage to realize only by STL?
Thanks,
Arman.
EDIT:
Can we do it without boost.regex?
Or should I add yet another library dependences to my project?:)
Use Boost.Regex
No - you need boost::regex
Regular expressions were made for this sort of thing. I can understand your reluctance to avoid a dependency, but in this case it's probably justified.
You might check your C++ compiler to see if it includes any built-in regular expression library. For example, Microsoft includes CAtlRegExp.
Barring that, your problem doesn't look too difficult to write custom code for.
You can do it without introducing a new library dependency, but to do so you'd end up writing a regular expression engine yourself (or at least a subset of one).
Is there some reason you don't want to use a library for this?