C++ - Splitting Filename and File Extension

Ok, first of all I don't want to use Boost, or any external libraries. I just want to use the C++ Standard Library. I can easily split strings with a given delimiter with my split() function:
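#include <sstream>
#include <string>
#include <vector>

// Splits str at every occurrence of delim and appends the pieces to tokens.
void split(const std::string &str, std::vector<std::string> &tokens, char delim) {
    std::string token;
    std::stringstream stream(str);
    while (std::getline(stream, token, delim))
        tokens.push_back(token);
}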
I do this on filenames. But there's a problem: some files have compound extensions such as tar.gz or tar.bz2, and some filenames contain extra dots, e.g. Some.file.name.tar.gz. I want to separate Some.file.name from tar.gz. Note: the number of dots in a filename isn't constant.
I also tried PathFindExtension but no luck. Is this possible? If so, please enlighten me. Thank you.
Edit: I'm very sorry about not specifying the OS. It's Windows.

I think you could use std::string::find_last_of to get the index of the last . and substr to cut the string (although the "complex extensions" involving multiple dots will require additional work).

There is no way of doing what you want that does not involve a database of extensions for your purpose. There's nothing magical about extensions, they are just part of a filename (if you gunzip foo.tar.gz you'll likely get a foo.tar, so for this application .gz actually is "the extension"). So, in order to do what you want, build a database of extensions that you want to look for and fall back on "last dot" if you don't find one.

There's nothing for this in the C++ Standard Library, but every operating system I know of provides this functionality in a variety of ways.
In Windows you can use _splitpath(), and in Linux you can use dirname() and basename().

The problem is indeed filenames like *.tar.gz, which can not be split consistently, due to the fact that (at least in Windows) the .tar part isn't part of the extension. You'll either have to keep a list for these special cases and use a one-dot string::rfind for the rest or find some pre-implemented way. Note that the .tar.* extensions aren't infinite, and very much standardized (there's about ten of them I think).

You could create a look-up table of the file extensions that you think you might encounter, and add a command line option for adding a new entry to the table if you encounter anything new. Then parse the file name to see if any entry in the look-up table is a sub-string of the file name.
EDIT: You can also refer to this question: C++/STL string: How to mimic regex like function with wildcards?

Related

Indexing string literals for c++ project

I have a huge c++ project and I find myself rgrep-ing for patterns that I know are in string literals. Is there a way to get clang or xtags or cscope or whatever to build a file with a mapping of each string literal in the project to the file and line where it was found?
I don't know of a way to make cscope or friends do this. You could almost certainly write a custom Starscope extractor that would do this, if you don't mind writing a dozen or so lines of Ruby (starscope: https://github.com/eapache/starscope, adding an extractor: https://github.com/eapache/starscope/blob/master/doc/LANGUAGE_SUPPORT.md#how-to-add-another-language)
Alternatively it may just be enough to use something like ag instead, which is grep-like but generally a lot faster: https://github.com/ggreer/the_silver_searcher

Create a safe, escaped path base/file name, check if safe

I wonder if there is a generic, portable way to produce filesystem-safe filenames. That is, I have a user-entered string and would like to produce a file whose name resembles the name they have chosen as closely as possible. The resulting name must not include any path reference or other file-system special name or tag.
Currently I just replace a bunch of known bad characters with other characters, or empty strings. For example, given the name ABC / DEF* : A Company? I'd produce the string ABC - DEF - A Company. My choice for replacement characters is totally arbitrary as I don't know of a generic escape symbol.
So my related questions are:
1. Is there a method (perhaps in boost filesystem) that can tell me if the name refers strictly to a file without a path?
2. Is there a function that tells me if the name is "safe" to use as a file (this may be an additional check on top of 1 for some filesystems)?
3. Is there a function to convert a string into a reasonably safe name?
Additional Notes
For #1 I thought to just compare a boost path::filename() to the original object; if they are the same then I have a file. However, this still allows things like '..' and '.'. But that might be okay if there is a good solution for #2.
In theory I'd have to provide a directory in which the file would reside, since different file-systems may have different requirements. But a global solution for the OS would also be okay.
I already have a function that just replaces a bunch of commonly known unsafe characters.
Common file dialogs cannot be used to do the filtering since the interface may not always allow them and in some cases the user isn't directly aware of the relationship to the file (advanced users would however).
According to POSIX fully portable filenames, the only portable filenames are those that contain only A-Za-z0-9._- and are at most 14 characters long.
That said, a more practical approach is to assume that modern filesystems can cope with longer filenames and to simply replace all characters which are not explicitly marked as "safe" with _. Sometimes, instead of replacing with _, those characters are hex-encoded, like in URLs: sample%20file.txt. KDE applications use this, for example.
As for implementation, it's as simple as s/[^A-Za-z0-9._-]/_/g.
How portable is portable? Many systems had limits on length, and some
probably still do. Is distinguishing between names an issue? Some
systems distinguish case, and others don't. What about a final .xxx?
For some systems, it is significant, for others, it's just text.
Neglecting length, the safest bet is to take the opposite approach:
create a set of known safe characters, and convert everything outside of
that to a specific character. ASCII alphanumerics, and '_' seem
pretty safe, and you're probably OK (today) with '-', but I doubt the
list goes much further. And depending on what you're doing with these
names, you might want to force them to a single case, either upper or
lower.

C++, Multilanguage/Localisation support

what's the best way to add multilanguage support to a C++ program?
If possible, the language should be read in from a plain text file containing something like key-value pairs (§WelcomeMessage§ "Hello %s!").
I thought of something like adding a localizedString(key) function that returns the string of the loaded language file. Are there better or more efficient ways?
// half-pseudocode
// somewhere, load the language key-value pairs into langfile
std::map<std::string, std::string> langfile;
std::string localizedString(const std::string& key)
{
    // do something else here with the string, like parsing placeholders
    return langfile[key];
}
std::cout << localizedString("WelcomeMessage");
Simplest way without external libraries:
// strings.h
enum
{
    LANG_EN_GB,
    LANG_EN_AU,
    MAX_LANGUAGES   // number of languages, keep last
};
enum
{
    STRING_HELLO,
    STRING_DO_SOMETHING,
    STRING_GOODBYE
};
// strings.c
const char* en_gb[] = {"Well, Hello","Please do something","Goodbye"};
const char* en_au[] = {"Morning, Cobber","do somin'","See Ya"};
const char** languages[MAX_LANGUAGES] = {en_gb,en_au};
This will give you what you want. Obviously you could read the strings from a file, e.g.
// en_au.lang
STRING_HELLO,"Morning, Cobber"
STRING_DO_SOMETHING,"do somin'"
STRING_GOODBYE,"See Ya"
But you would need a table that maps the string names in the file to the string IDs, i.e.
// parse_strings.c
struct PARSE_STRINGS
{
    const char* string_name;
    int string_id;
}
PARSE_STRINGS[] = {{"STRING_HELLO",STRING_HELLO},
                   {"STRING_DO_SOMETHING",STRING_DO_SOMETHING},
                   {"STRING_GOODBYE",STRING_GOODBYE}};
Note that C++ enums have no built-in toString() method, so the same kind of name-to-ID table is needed in C++ too (a std::map<std::string, int> would do).
All you then have to do is parse the language files.
I hope this helps.
PS: and to access the strings:
languages[current_language][STRING_HELLO]
PPS: apologies for the half c half C++ answer.
Space_C0wb0w's suggestion is a good one. We currently use ICU successfully for that in our products.
Echoing your comment to his answer: It is indeed hard to say that ICU is "small, clean, uncomplicated". There is "accidental" complexity in ICU coming from its "Java-ish" interface, but a large part of the complexity and size simply comes from the complexity and size of the problem domain it is addressing.
If you don't need ICU's full power and are only interested in "message translation", you may want to look at GNU gettext which, depending on your platform and licencing requirements, may be a "smaller, cleaner and less-complicated" alternative.
The Boost.Locale project is also an interesting alternative. In fact, its "Messages Formatting" functionality is based on the gettext model.
Since you are asking for the best way (and didn't mention the platform) I would recommend GNU Gettext.
Arguably it is the most complete and mature internationalization library for C/C++ programming.

C++/STL string: How to mimic regex like function with wildcards?

I would like to compare 4-character strings using wildcards.
For example:
std::string wildcards[]=
{"H? ", "RH? ", "H[0-5] "};
/*in the last one I need to check if string is "H0 ",..., and "H5 " */
Is it possible to do this using only the STL?
Thanks,
Arman.
EDIT:
Can we do it without boost.regex?
Or should I add yet another library dependency to my project? :)
Use Boost.Regex
No - you need boost::regex
Regular expressions were made for this sort of thing. I can understand your reluctance to add a dependency, but in this case it's probably justified.
You might check your C++ compiler to see if it includes any built-in regular expression library. For example, Microsoft includes CAtlRegExp.
Barring that, your problem doesn't look too difficult to write custom code for.
You can do it without introducing a new library dependency, but to do so you'd end up writing a regular expression engine yourself (or at least a subset of one).
Is there some reason you don't want to use a library for this?

Parse URLs using C-Strings in C++

I'm learning C++ for one of my CS classes, and for our first project I need to parse some URLs using c-strings (i.e. I can't use the C++ String class).
The only way I can think of approaching this is just iterating through (since it's a char[]) and using some switch statements. From someone who is more experienced in C++ - is there a better approach? Could you maybe point me to a good online resource? I haven't found one yet.
Weird that you're not allowed to use C++ language features i.e. C++ strings!
There are some C string functions available in the standard C library.
e.g.
strdup - duplicate a string
strtok - breaking a string into tokens. Beware - this modifies the original string.
strcpy - copying string
strstr - find string in string
strncpy - copy up to n bytes of string
etc
There is a good online reference with a full list of the available C string functions for searching and finding things:
http://www.cplusplus.com/reference/clibrary/cstring/
You can walk through strings by accessing them like an array if you need to.
e.g.
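#include <cstring>
#include <iostream>
const char* url = "http://stackoverflow.com/questions/1370870/c-strings-in-c";
std::size_t len = std::strlen(url);
for (std::size_t i = 0; i < len; ++i) {
    std::cout << url[i];   // print one character at a time
}
std::cout << std::endl;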
As for actually how to do the parsing, you'll have to work that out on your own. It is an assignment after all.
There are a number of C standard library functions that can help you.
First, look at the C standard library function strtok. This allows you to retrieve parts of a C string separated by certain delimiters. For example, you could tokenize with the delimiter / to get the protocol, domain, and then the file path. You could tokenize the domain with delimiter . to get the subdomain(s), second level domain, and top level domain. Etc.
It's not nearly as powerful as a regular expression parser, which is what you would really want for parsing URLs, but it works on C strings, is part of the C standard library and is probably OK to use in your assignment.
Other C standard library functions that may help:
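strstr() Finds a substring within a string, similar to std::string::find()
strchr() and strpbrk() Find a character, or any one of a set of characters, in a string, similar to std::string::find() and std::string::find_first_of(); strspn() returns the length of the leading run of characters taken from a given set, similar to std::string::find_first_not_of()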
Edit: A reminder that the proper way to use these functions in C++ is to include <cstring> and use them in the std:: namespace, e.g. std::strtok().
You might want to refer to an open source library that can parse URLs (as a reference for how others have done it -- obviously don't copy and paste it!), such as curl or wget.
I don't know what the requirements are for parsing the URLs,
but if this is CS level it would be appropriate to use (very
simple) BNF and a (very simple) recursive descent parser.
This would make for a more robust solution than direct
iteration, e.g. for malformed URLs.
Very few string functions from the standard C library would
be needed.
You can use C functions like strtok, strchr, strstr etc.
Many of the runtime library functions that have been mentioned work quite well, either in conjunction with or apart from the approach of iterating through the string that you mentioned (which I think is time honored).