C++ Returning Multiple Items - c++

I am designing a class in C++ that extracts URLs from an HTML page. I am using Boost's Regex library to do the heavy lifting for me. I started designing a class and realized that I didn't want to tie down how the URLs are stored. One option would be to accept a std::vector<Url> by reference and just call push_back on it. I'd like to avoid forcing consumers of my class to use std::vector. So, I created a member template that took a destination iterator. It looks like this:
template <typename TForwardIterator, typename TOutputIterator>
TOutputIterator UrlExtractor::get_urls(
TForwardIterator begin,
TForwardIterator end,
TOutputIterator dest);
I feel like I am overcomplicating things. I like to write fairly generic code in C++, and I struggle to lock down my interfaces. But then I get into these predicaments where I am trying to templatize everything. At this point, someone reading the code doesn't realize that TForwardIterator is iterating over a std::string.
In my particular situation, I am wondering if being this generic is a good thing. At what point do you start making code more explicit? Is there a standard approach to getting values out of a function generically?

Yes, it's not only fine but a very nice design. Templating that way is how most of the standard library algorithms work, like std::fill or std::copy; they are made to work with iterators so that you can fill a container that already has elements in it, or you can take an empty container and fill it up with data by using std::back_inserter.
This is a very good design IMO, and takes advantage of the power of templates and the iterator concept.
You can use it like this (but you already know this):
std::list<Url> l;
std::vector<Url> v;
x.get_urls(begin(dat1), end(dat1), std::back_inserter(l));
y.get_urls(begin(dat2), end(dat2), std::back_inserter(v));
I get the feeling that you are afraid of using templates, that they are not "normal" C++, or that they should be avoided and are bloated or something. I assure you, they are very normal and a powerful language feature that no other language (that I know of) has, so whenever it is appropriate to use them, USE THEM. And here, it is very appropriate.
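For reference, here is one way the body of such a template might look. This is only a sketch under several assumptions: it is a free function rather than a member of UrlExtractor, it uses std::regex (C++11) instead of Boost.Regex, Url is a plain typedef for std::string, and the pattern is deliberately naive.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for the question's Url type.
typedef std::string Url;

// Writes every match of a simplistic URL pattern found in [begin, end)
// to dest and returns the advanced output iterator, as std::copy does.
template <typename BidirIt, typename OutputIt>
OutputIt get_urls(BidirIt begin, BidirIt end, OutputIt dest)
{
    // Naive pattern, for illustration only; real HTML needs more care.
    static const std::regex pattern("https?://[^\\s\"'<>]+");
    typedef std::regex_iterator<BidirIt> RegexIt;
    for (RegexIt it(begin, end, pattern), last; it != last; ++it)
        *dest++ = Url(it->str());
    return dest;
}
```

The caller picks the destination, for example get_urls(html.begin(), html.end(), std::back_inserter(v)) for any container v that supports push_back.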

Looks to me that you have the wrong interface.
There are already algorithms for copying from iterators into containers. It seems to me that your class provides a stream of URLs (without modifying its source). So all you really need is a way to expose your internal data via iterators (forward iterators), and thus all you need to provide is begin() and end().
UrlExtractor page(/* Some way of constructing page */);
std::vector<std::string> data;
std::copy(page.begin(), page.end(), std::back_inserter(data));
I would just provide the following interface:
class UrlExtractor
{
// ...... STUFF
iterator begin();
iterator end();
};
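A minimal sketch of that interface, under the assumptions that URLs are plain std::string values, that extraction happens eagerly in the constructor, and that std::regex (C++11) with a naive pattern stands in for Boost.Regex:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

class UrlExtractor
{
public:
    typedef std::vector<std::string>::const_iterator iterator;

    // Extracts eagerly; a lazy variant could wrap std::sregex_iterator instead.
    explicit UrlExtractor(const std::string& html)
    {
        // Naive pattern, for illustration only.
        static const std::regex pattern("https?://[^\\s\"'<>]+");
        for (std::sregex_iterator it(html.begin(), html.end(), pattern), last;
             it != last; ++it)
            urls_.push_back(it->str());
    }

    iterator begin() const { return urls_.begin(); }
    iterator end() const { return urls_.end(); }

private:
    std::vector<std::string> urls_;
};
```

With this in place the std::copy call shown earlier works unchanged, and so does constructing any container directly from page.begin() and page.end().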

Yes, you are being too general. The point of a template is that you can generate multiple copies of the function that behave differently. You probably don't want that because you should pick one way of representing a URL and use that in your entire program.
How about you just do:
typedef std::string url;
That allows you to change the class you use for urls in the future.
Maybe std::vector implements some interface with push_back() in it and your method can take a reference to that interface (back_inserter?).

It's hard to say without knowing the actual use case scenarios, but in
general, it's better to avoid templates (or any other unnecessary
complexity) unless it actually buys you something. The most obvious
signature here would be:
std::vector<Url> UrlExtractor::get_urls( std::string const& source );
Is there really any likely scenario where you'll have to parse anything
but an std::string? (There might be if you also supported input
iterators. But in practice, if you're parsing, the sources will be
either a std::string or an std::istream&. Unless you really want to
support the latter, just use std::string.) And of course, client code
can do whatever it wants with the returned vector, including appending
it to another type of collection.
If the cost of returning a std::vector does become an issue, then you
could take an std::vector<Url>& as an argument. I can't see any
reasonable scenario where any additional flexibility would buy you much,
and a function like get_urls is likely to be fairly complicated, and
not the sort of thing you'd want to put in a header.

Related

Replace RogueWave with standard library by writing a wrapper

With reference to this post
How do I abstract away from using RogueWave in legacy code?
The new wrapper will have equivalent RogueWave methods wrapped around standard library. Something like -
template<class T, class Container = std::deque<T> >
class my_stack
{
public:
void push(const T& t)
{
m_stack.push(t);
}
// ... so on ...
protected:
std::stack<T, Container> m_stack;
};
How do we expose the standard library methods which do not exist in RogueWave?
Should the wrapper be a union of the RogueWave::stack and std::stack methods? Or do we expose the underlying std::stack object so that the client can call std::stack methods directly? Should the client use the standard library directly, or do everything through a wrapper?
Thoughts please.
Thanks.
Is there a reason why you don't replace RogueWave::stack with std::stack directly, without a wrapper?
The wrapper approach requires work to maintain the interface between your wrapper and the container class. You need to get things like rvalue references right, but unless you are adding significant functionality (and with std::stack that seems unlikely) the benefit you can get from this is limited.
I see that you have protected:, so it could be that you are using inheritance on your containers. That could also be a good thing to remove.
std::stack is part of the standard library now; it will have a longer life than a third-party library like RogueWave.
We are getting near the end of a long project to wrap and remove RW in legacy code. I will use RWOrdered as an example, replacing it with OOrdered.
If you have code with RW in it, it is probably legacy code. If you find removing RW not to be trivial, it is probably large. You probably no longer understand all the details of its design. Changing how it works may cause a lot of trouble. Any interface or behavior differences between RWOrdered and OOrdered are to be avoided.
If you replace RW with something just like RW, you get rid of license fees, own all of your code, can do 64 bit builds, etc. But you don't get anything better than RW. You probably don't want to use it as the basis of all future containers.
RW was world class code in its day. You are replacing it with home built software, even if you wrap world class std containers. Sometimes std classes work differently than RW classes. You will have to solve problems that have been solved by RW. You will also find the std library solves problems RW did not.
C++ templated containers are strongly typed. RW containers can contain anything that inherits from RWCollectable. That is, you can mix types. Making OOrdered a template class may not be what you want.
The std classes make a clear distinction between equivalence and equality. The find() algorithm uses equality, operator==(), to find an item. set<> is sorted, by default with operator<(). set::insert(a) uses equivalence based on this ordering to determine whether an item is already in the set: it looks for an item b where a < b and b < a are both false.
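The difference is easy to see with a comparator that ignores case. The following is a small sketch; CaseInsensitiveLess, NameSet and the two helper functions are made-up names for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Orders strings while ignoring case, so "abc" and "ABC" are equivalent
// (neither is less than the other) although they are not equal under ==.
struct CaseInsensitiveLess
{
    static bool char_less(unsigned char a, unsigned char b)
    {
        return std::tolower(a) < std::tolower(b);
    }
    bool operator()(const std::string& a, const std::string& b) const
    {
        return std::lexicographical_compare(a.begin(), a.end(),
                                            b.begin(), b.end(), char_less);
    }
};

typedef std::set<std::string, CaseInsensitiveLess> NameSet;

// Uses the set's comparator, i.e. equivalence.
bool contains_equivalent(const NameSet& s, const std::string& key)
{
    return s.find(key) != s.end();
}

// Uses operator==, i.e. equality.
bool contains_equal(const NameSet& s, const std::string& key)
{
    return std::find(s.begin(), s.end(), key) != s.end();
}
```

After inserting "Widget", a second insert of "WIDGET" is rejected because the two are equivalent, yet contains_equal(s, "WIDGET") is false because they are not equal.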
In RW, RWCollectable defines isEqual(), compareTo(), and hash(). That is, equivalence and equality are available to all collections. Sometimes, RW mixes equality and equivalence, particularly when collections are nested.
You need to be aware of which RW function does what, as well as which std library entity does what. You need to choose to exactly match RW behavior, or how you want to differ.
These examples are easily solved, but some are harder. Serialization was a pain point for us. We wanted to use boost serialization to replace SaveGuts() and RestoreGuts(). We have nested containers, where sometimes an inner container is allocated by a different DLL than the outer container. This breaks boost. Workarounds exist, but they are not trivial.
Take it in steps.
Write OCollectable, which wraps RWCollectable.
Write a type that inherits from OCollectable, such as OWidget. RW defines macros that implement some base functionality like isA(). Use the macros.
Write OOrdered, which wraps RWOrdered.
Replace RWOrdered and Widget in your code. If they behave exactly the same, your code will still work.
Expand the macros
Add a std::vector to OOrdered. Rewrite OOrdered member functions. Drop ones you don't use.
Rewrite Widget functions.
One way to expose the wrapped vector is to add a getVector() function. A typedef for the return type will help.
Another way is to add functions that expose the vector operations you want.
Test, test, test
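The two ways of exposing the wrapped vector mentioned in the steps above could look roughly like this. It is a sketch of the end state only: OCollectable here is an empty hypothetical base class, the container holds non-owning pointers, and the forwarding function names (append, size, at) are illustrative and not claimed to match RogueWave's interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical replacement base class; early in the migration it would
// still wrap or inherit from RWCollectable.
class OCollectable
{
public:
    virtual ~OCollectable() {}
};

class OOrdered
{
public:
    // The typedef keeps client code short and hides the exact container type.
    typedef std::vector<OCollectable*> Vector;

    // Option 1: hand out the wrapped vector.
    Vector& getVector() { return items_; }
    const Vector& getVector() const { return items_; }

    // Option 2: forward only the operations clients actually need.
    void append(OCollectable* item) { items_.push_back(item); }
    std::size_t size() const { return items_.size(); }
    OCollectable* at(std::size_t i) const { return items_.at(i); }

private:
    Vector items_; // non-owning pointers, so mixed derived types are allowed
};
```

Option 1 gives clients the whole std::vector interface at once but ties them to the container type; option 2 keeps the wrapper in control of what is exposed.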
Gotchas
Getting rid of some RW is not too hard. Getting rid of every last trace is harder. For example, your OWidget inherits from OCollectable, which inherits from RWCollectable. So you can take a widget out of your OOrdered, and hand it to a method that takes an RWCollectable. If you change OCollectable so that it no longer inherits, you can't pass your OWidget class in any more. You have to wait until the end of your project to drop RWCollectable, when everything inherits from OCollectable.
By the time you are done, you will be an expert in a dead library in a dying language. This may be better for your career than it sounds. People who don't want to dig into it themselves need such experts. On the other hand, you will also know a lot about some proprietary legacy code. You might prefer a career in steam engines.
Likewise, you will be familiar with std library. Effective STL is dated, but still a really good book. There isn't any equivalent for RW. This is good. It keeps the market from being flooded with RW experts.

Is it recommended to specify e.g. vector<t> in my public interface?

I'm new to C++, and while writing a class I realized one of my methods was asking for a vector-of-vectors. Should this be done or should I rethink my class's interface? (How?)
I think it is no problem what container you use. You could do it like
void func(std::vector<std::vector<int> > const& int_matrix);
or, in C++11, where consecutive > characters are no longer parsed as '>>', you could also write
void func(std::vector<std::vector<int>> const& int_matrix);
But there is a problem if your work is published as a binary instead of source code: users of the interface must have the same STL implementation as yours, otherwise strange runtime errors may occur. So using STL containers in the interface is not appropriate in this situation. You have to define your own structures as the parameter types.
A vector of vectors isn't necessarily a bad thing. If you need something like a multidimensional array, then you need what you need. (Just make sure you pass the vector by [const] reference).
You might want to change the title of your question though, because the title says "vector<T>" (boldness because it thinks it's an HTML tag) but your question asks about a vector of vectors.
IMO, if possible it's better to merge all the vectors into a single vector. Having vector of vector doesn't make much sense to me.
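If the data is rectangular (every inner vector has the same length), merging into a single vector could be done along these lines. IntMatrix and sum are hypothetical names used only for this sketch; ragged data would still need nested vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix backed by one contiguous vector.
class IntMatrix
{
public:
    IntMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0) {}

    int& at(std::size_t r, std::size_t c)
    {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }
    int at(std::size_t r, std::size_t c) const
    {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_, cols_;
    std::vector<int> data_;
};

// The public interface then takes the matrix type instead of nested vectors.
int sum(const IntMatrix& m)
{
    int total = 0;
    for (std::size_t r = 0; r < m.rows(); ++r)
        for (std::size_t c = 0; c < m.cols(); ++c)
            total += m.at(r, c);
    return total;
}
```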

how do C++ professional programmers implement common abstractions?

I've never programmed in C++ professionally and am working with (Visual) C++ as a student. I'm having difficulty dealing with the lack of abstractions, especially with the STL container classes. For example, the vector class doesn't contain a simple remove method, which is common in many libraries, e.g. the .NET Framework. I know there's an erase method, but it isn't abstract enough to reduce the operation to a one-line method call. For example, if I have a
std::vector<std::string>
I don't know how else to remove a string element from the vector without iterating through it and searching for a matching string element.
bool remove(vector<string> & msgs, string toRemove) {
if (msgs.size() > 0) {
vector<string>::iterator it = msgs.end() - 1;
while (it >= msgs.begin()) {
string remove = it->data();
if (remove == toRemove) {
//std::cout << "removing '" << it->data() << "'\n";
msgs.erase(it);
return true;
}
it--;
}
}
return false;
}
What do professional C++ programmers do in this situation? Do you write out the implementation every time? Do you create your own container class, your own library of helper functions, or do you suggest using another library i.e. Boost (even if you program Windows in Visual Studio)? or something else?
(if the above remove operation needs work, please leave an alternative method of doing this, thanks.)
You would use the "remove and erase idiom":
v.erase(std::remove(v.begin(), v.end(), mystring), v.end());
The point is that vector is a sequence container and not geared towards manipulation by value. Depending on your design needs, a different standard library container may be more appropriate.
Note that the remove algorithm merely reorders elements of the range, it does not erase anything from the container. That's because iterators don't carry information about their container with them, and this is fully intentional: By separating iterators from their containers, one can write generic algorithms that work on any sensible container.
Idiomatic modern C++ would try to follow that pattern whenever applicable: Expose your data through iterators and use generic algorithms to manipulate it.
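If you want the one-line call at the use site, the idiom can be wrapped once in a small helper. A sketch, with erase_all and erase_first as made-up names:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Removes every element equal to value; returns how many were removed.
template <typename T, typename U>
std::size_t erase_all(std::vector<T>& v, const U& value)
{
    // std::remove shifts the kept elements forward and returns the new end.
    typename std::vector<T>::iterator new_end =
        std::remove(v.begin(), v.end(), value);
    const std::size_t removed = static_cast<std::size_t>(v.end() - new_end);
    v.erase(new_end, v.end());
    return removed;
}

// Removes only the first element equal to value; returns true if one was found.
template <typename T, typename U>
bool erase_first(std::vector<T>& v, const U& value)
{
    typename std::vector<T>::iterator it = std::find(v.begin(), v.end(), value);
    if (it == v.end())
        return false;
    v.erase(it);
    return true;
}
```

Note that erase_first removes the first match from the front, whereas the loop in the question searches from the back.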
Have you considered std::remove_if?
http://www.cplusplus.com/reference/algorithm/remove_if/
IMO, professionally, it is perfectly logical to write a custom implementation for custom tasks, especially if the standard doesn't provide one. This is much better than writing (read: copy-pasting) the same stuff again and again. One may also take advantage of inline functions, template functions, and macros to keep the same stuff in one place. This reduces the bugs you may introduce while re-using the same stuff (which may go somewhat wrong while pasting). It also makes it possible to correct a bug in one place.
Templates and macros, if properly designed, are very useful - they aren't code bloat.
Edit: Your code needs improvement:
bool remove(vector<string>& msgs, const string& toRemove);
To iterate over a collection, a for loop is sufficient. There is no need to check the size, take the last iterator, compare with begin, get the data and all.
There is no need to waste a string - just compare it, and remove.
For your problem, I believe a map or set would fit much better.
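With a std::set, for instance, removal by value is a single member call. A minimal sketch (remove_message is a made-up name), assuming the messages are unique and their insertion order does not matter, since a set keeps neither duplicates nor insertion order:

```cpp
#include <cassert>
#include <set>
#include <string>

// set::erase(key) returns the number of elements removed (0 or 1 for a set),
// so no explicit search loop is needed.
bool remove_message(std::set<std::string>& msgs, const std::string& to_remove)
{
    return msgs.erase(to_remove) > 0;
}
```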

Interface-based programming in C++ in combination with iterators. How to keep this simple?

In my developments I am slowly moving from an object-oriented approach to interface-based-programming approach. More precisely:
in the past I was already satisfied if I could group logic in a class
now I tend to put more logic behind an interface and let a factory create the implementation
A simple example clarifies this.
In the past I wrote these classes:
Library
Book
Now I write these classes:
ILibrary
Library
LibraryFactory
IBook
Book
BookFactory
This approach allows me to easily implement mocking classes for each of my interfaces, and to switch between old, slower implementations and new, faster implementations, and compare them both within the same application.
For most cases this works very well, but it becomes a problem if I want to use iterators to loop over collections.
Suppose my Library has a collection of books and I want to iterate over them. In the past this wasn't a problem: Library::begin() and Library::end() returned an iterator (Library::iterator) on which I could easily write a loop, like this:
for (Library::iterator it=myLibrary.begin();it!=myLibrary.end();++it) ...
Problem is that in the interface-based approach, there is no guarantee that different implementations of ILibrary use the same kind of iterator. If e.g. OldLibrary and NewLibrary both inherit from ILibrary, then:
OldLibrary could use an std::vector to store its books, and return std::vector::const_iterator in its begin and end methods
NewLibrary could use an std::list to store its books, and return std::list::const_iterator in its begin and end methods
Requiring both ILibrary implementations to return the same kind of iterator isn't a solution either, since in practice the increment operation (++it) needs to be implemented differently in both implementations.
This means that in practice I have to make the iterator an interface as well, meaning that application can't put the iterator on the stack (typical C++ slicing problem).
I could solve this problem by wrapping the iterator interface within a non-interface class, but this seems quite a complex solution for what I am trying to obtain.
Are there better ways to handle this problem?
EDIT:
Some clarifications after remarks made by Martin.
Suppose I have a class that returns all books sorted on popularity: LibraryBookFinder.
It has begin() and end() methods that return a LibraryBookFinder::const_iterator which refers to a book.
To replace my old implementation with a brand new one, I want to put the old LibraryBookFinder behind an interface ILibraryBookFinder, and rename the old implementation to OldSlowLibraryBookFinder.
Then my new (blistering fast) implementation called VeryFastCachingLibraryBookFinder can inherit from ILibraryBookFinder. This is where the iterator problem comes from.
The next step could be to hide the interface behind a factory, where I can ask the factory to "give me a 'finder' that is very good at returning books according to popularity, or according to title, or author, ...". You end up with code like this:
ILibraryBookFinder *myFinder = LibraryBookFinderFactory (FINDER_POPULARITY);
for (ILibraryBookFinder::const_iterator it=myFinder->begin();it!=myFinder->end();++it) ...
or if I want to use another criterion:
ILibraryBookFinder *myFinder = LibraryBookFinderFactory (FINDER_AUTHOR);
for (ILibraryBookFinder::const_iterator it=myFinder->begin();it!=myFinder->end();++it) ...
The argument of LibraryBookFinderFactory may be determined by an external factor: a configuration setting, a command line option, a selection in a dialog, ... And every implementation has its own kind of optimizations (e.g. the author of a book doesn't change so this can be a quite static cache; the popularity can change daily which may imply a totally different data structure).
You are mixing metaphors here.
If a library is a container, then it needs its own iterator; it can't reuse an iterator of a member. Thus you would wrap the member iterator in an implementation of ILibraryIterator.
But strictly speaking, a Library is not a container; it is a library.
Thus the methods on a library are actions (think verbs here) that you can perform on a library. A library may contain a container, but strictly speaking it is not a container and thus should not be exposing begin() and end().
So if you want to perform an action on the books, you should ask the library to perform the action (by providing a functor). The concept of a class is that it is self-contained. Users should not be using getters to get stuff out of the object and then put stuff back; the object should know how to perform the action on itself (this is why I hate getters/setters, as they break encapsulation).
class ILibrary
{
public:
IBook const& getBook(Index i) const;
template<typename R, typename A>
R checkBooks(A const& librarianAction);
};
If your libraries hold a lot of books, you should consider putting your "aggregate" functions into your collections and passing in the action you want performed.
Something in the nature of:
class ILibrary
{
public:
virtual ~ILibrary();
virtual void for_each( boost::function1<void, IBook> func ) = 0;
};
void LibraryImpl::for_each( boost::function1<void, IBook> func )
{
std::for_each( myImplCollection.begin(), myImplCollection.end(), func );
}
Although probably not exactly like that because you may need to deal with using shared_ptr, constness etc.
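A more complete sketch of the same idea, with some liberties taken: it uses std::function from C++11 in place of boost::function1, a plain Book struct instead of an IBook interface, and const references throughout. VectorLibrary, ListLibrary and titles are made-up names.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <list>
#include <string>
#include <vector>

struct Book { std::string title; }; // hypothetical concrete book type

class ILibrary
{
public:
    typedef std::function<void(const Book&)> BookAction;
    virtual ~ILibrary() {}
    virtual void for_each(const BookAction& action) const = 0;
};

// Two implementations with different storage; neither leaks its iterator type.
class VectorLibrary : public ILibrary
{
public:
    void add(const Book& b) { books_.push_back(b); }
    void for_each(const BookAction& action) const override
    {
        std::for_each(books_.begin(), books_.end(), action);
    }
private:
    std::vector<Book> books_;
};

class ListLibrary : public ILibrary
{
public:
    void add(const Book& b) { books_.push_back(b); }
    void for_each(const BookAction& action) const override
    {
        std::for_each(books_.begin(), books_.end(), action);
    }
private:
    std::list<Book> books_;
};

// Client code sees only ILibrary and supplies the action as a lambda.
std::vector<std::string> titles(const ILibrary& lib)
{
    std::vector<std::string> out;
    lib.for_each([&out](const Book& b) { out.push_back(b.title); });
    return out;
}
```

The price is one indirect call per book through std::function, and the caller can no longer stop the traversal early without extending the interface.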
For this purpose (or in general in implementations where I make heavy use of interfaces), I have also created an interface for an iterator, and other objects return this. It becomes pretty Java-like.
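Such a Java-style iterator interface might look like the following sketch. All names are hypothetical, Book is a plain struct, and the iterator is returned through std::unique_ptr (C++11), so it lives on the heap rather than on the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>
#include <string>
#include <vector>

struct Book { std::string title; };

class IBookIterator
{
public:
    virtual ~IBookIterator() {}
    virtual bool has_next() const = 0;
    virtual const Book& next() = 0; // precondition: has_next()
};

// One adapter serves any standard container of Book.
template <typename Container>
class ContainerBookIterator : public IBookIterator
{
public:
    explicit ContainerBookIterator(const Container& c)
        : cur_(c.begin()), end_(c.end()) {}
    bool has_next() const override { return cur_ != end_; }
    const Book& next() override { assert(has_next()); return *cur_++; }
private:
    typename Container::const_iterator cur_, end_;
};

class ILibrary
{
public:
    virtual ~ILibrary() {}
    virtual std::unique_ptr<IBookIterator> books() const = 0;
};

class OldLibrary : public ILibrary
{
public:
    std::vector<Book> shelf;
    std::unique_ptr<IBookIterator> books() const override
    {
        return std::unique_ptr<IBookIterator>(
            new ContainerBookIterator<std::vector<Book> >(shelf));
    }
};

class NewLibrary : public ILibrary
{
public:
    std::list<Book> shelf;
    std::unique_ptr<IBookIterator> books() const override
    {
        return std::unique_ptr<IBookIterator>(
            new ContainerBookIterator<std::list<Book> >(shelf));
    }
};

// Client code depends only on the two interfaces.
std::size_t count_books(const ILibrary& lib)
{
    std::size_t n = 0;
    for (std::unique_ptr<IBookIterator> it = lib.books(); it->has_next(); it->next())
        ++n;
    return n;
}
```

Each traversal costs one heap allocation plus a virtual call per step, which is the overhead the stack-allocation ideas below try to avoid.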
If you care about having the iterator on the stack in most cases: your problem is of course that you don't know the size of the iterator at compile time, so you cannot allocate a stack variable of the correct size. But if you really care a lot about this, maybe you could write a wrapper which reserves a fixed-size buffer on the stack (e.g. 128 bytes) and, if the new iterator fits in, moves it there (be sure that your iterator has a proper interface to allow this in a clean way). Or you could use alloca(). E.g. your iterator interface could be like:
struct IIterator {
// iterator stuff here
// ---
// now for the handling on the stack
virtual size_t size() = 0; // must return own size
virtual void copyTo(IIterator* pt) = 0;
};
and your wrapper:
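struct IteratorWrapper {
IIterator* pt;
IteratorWrapper(IIterator* i) {
// Caveat: memory from alloca() is released when the function that called it
// returns, so in real code the allocation has to happen in the caller's frame.
pt = static_cast<IIterator*>(alloca(i->size()));
i->copyTo(pt);
}
// ...
};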
Or so.
Another way, if in theory it would be always clear at compile time (not sure if that holds true for you; it is a clear restriction): Use functors everywhere. This has many other disadvantages (mainly having all real code in header files) but you will have really fast code. Example:
template<typename T>
void do_sth_with_library(T& library) {
for(typename T::iterator i = library.begin(); i != library.end(); ++i) {
// ...
}
}
But the code can become pretty ugly if you rely too heavily on this.
Another nice solution (making the code more functional -- implementing a for_each interface) was provided by CashCow.
With current C++, this could make the code also a bit complicated/ugly to use though. With the upcoming C++0x and lambda functions, this solution can become much more clean.

How do you use stl's functions like for_each?

I started using stl containers because they came in very handy when I needed the functionality of a list, set and map and had nothing else available in my programming environment. I did not care much about the ideas behind it. STL documentation was interesting up to the point where it came to functions, etc. Then I skipped reading and just used the containers.
But yesterday, still being relaxed from my holidays, I just gave it a try and wanted to go a bit more the stl way. So I used the transform function (can I have a little bit of applause for me, thank you).
From an academic point of view it really looked interesting and it worked. But the thing that bothers me is that if you intensify the use of those functions, you need thousands of helper classes for mostly everything you want to do in your code. The whole logic of the program is sliced into tiny pieces. This slicing is not the result of good coding habits; it's just a technical need. Something, that makes my life probably harder not easier.
I learned the hard way, that you should always choose the simplest approach that solves the problem at hand. I can't see what, for example, the for_each function is doing for me that justifies the use of a helper class over several simple lines of code that sit inside a normal loop so that everybody can see what is going on.
I would like to know, what you are thinking about my concerns? Did you see it like I do when you started working this way and have changed your mind when you got used to it? Are there benefits that I overlooked? Or do you just ignore this stuff as I did (and will go on doing it, probably).
Thanks.
PS: I know that there is a real for_each loop in boost. But I ignore it here since it is just a convenient way for my usual loops with iterators I guess.
"The whole logic of the program is sliced in tiny pieces. This slicing is not the result of good coding habits. It's just a technical need. Something, that makes my life probably harder not easier."
You're right, to a certain extent. That's why the upcoming revision to the C++ standard will add lambda expressions, allowing you to do something like this:
std::for_each(vec.begin(), vec.end(), [&](int& val){val++;});
but I also think it is often a good coding habit to split up your code as currently required. You're effectively separating the code describing the operation you want to do, from the act of applying it to a sequence of values. It is some extra boilerplate code, and sometimes it's just annoying, but I think it also often leads to good, clean, code.
Doing the above today would look like this:
void incr(int& val) { ++val; }
// and at the call-site
std::for_each(vec.begin(), vec.end(), incr);
Instead of bloating up the call site with a complete loop, we have a single line describing:
which operation is performed (if it is named appropriately)
which elements are affected
so it's shorter, and conveys the same information as the loop, but more concisely.
I think those are good things. The drawback is that we have to define the incr function elsewhere. And sometimes that's just not worth the effort, which is why lambdas are being added to the language.
I find it most useful when used along with boost::bind and boost::lambda so that I don't have to write my own functor. This is just a tiny example:
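#include <algorithm>
#include <vector>
#include <boost/lambda/lambda.hpp>
#include <boost/lambda/bind.hpp>
class A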
{
public:
A() : m_n(0)
{
}
void set(int n)
{
m_n = n;
}
private:
int m_n;
};
int main(){
using namespace boost::lambda;
std::vector<A> a;
a.push_back(A());
a.push_back(A());
std::for_each(a.begin(), a.end(), bind(&A::set, _1, 5));
return 0;
}
You'll find disagreement among experts, but I'd say that for_each and transform are a bit of a distraction. The power of STL is in separating non-trivial algorithms from the data being operated on.
Boost's lambda library is definitely worth experimenting with to see how you get on with it. However, even if you find the syntax satisfactory, the awesome amount of machinery involved has disadvantages in terms of compile time and debug-ability.
My advice is use:
for (Range::const_iterator i = r.begin(), end = r.end(); i != end; ++i)
{
*out++ = .. // for transform
}
instead of for_each and transform, but more importantly get familiar with the algorithms that are very useful: sort, unique, rotate to pick three at random.
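As a small illustration of why those algorithms earn their keep, here is a sketch that combines sort and unique to de-duplicate a vector (sort_unique is a made-up name):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sorts v and drops duplicate values. std::unique only compacts adjacent
// duplicates and returns the new logical end, so the range is sorted first
// and the leftover tail is erased afterwards.
template <typename T>
void sort_unique(std::vector<T>& v)
{
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
}
```

Writing the same thing as hand-rolled loops takes noticeably more code and is easier to get wrong.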
Incrementing a counter for each element of a sequence is not a good example for for_each.
If you look at better examples, you may find it makes the code much clearer to understand and use.
This is some code I wrote today:
// assume some SinkFactory class is defined
// and mapItr is an iterator of a std::map<int,std::vector<SinkFactory*> >
std::for_each(mapItr->second.begin(), mapItr->second.end(),
checked_delete<SinkFactory>);
checked_delete is part of boost, but the implementation is trivial and, apart from a compile-time check that T is a complete type, looks like this:
template<typename T>
void checked_delete(T* pointer)
{
delete pointer;
}
The alternative would have been to write this:
for(vector<SinkFactory*>::iterator pSinkFactory = mapItr->second.begin();
pSinkFactory != mapItr->second.end(); ++pSinkFactory)
delete (*pSinkFactory);
More than that, once you have checked_delete written (or if you already use boost), you can delete pointers in any sequence anywhere, with the same code, without caring what types you're iterating over (that is, you don't have to declare vector<SinkFactory*>::iterator pSinkFactory).
There is also a small performance improvement from the fact that with for_each the container.end() will be only called once, and potentially great performance improvements depending on the for_each implementation (it could be implemented differently depending on the iterator tag received).
Also, if you combine boost::bind with stl sequence algorithms you can make all kinds of fun stuff (see here: http://www.boost.org/doc/libs/1_43_0/libs/bind/bind.html#with_algorithms).
I guess the C++ committee has the same concerns. The new C++0x standard, which is yet to be ratified, introduces lambdas. This new feature will enable you to use the algorithms while writing simple helper functions directly in the algorithm's parameter list.
std::transform(in.begin(), in.end(), out.begin(), [](int a) { return ++a; });
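Local classes are a great feature to solve this (note that C++03 does not allow a local class as a template argument, so passing one to an algorithm as below requires C++11). For example: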
void IncreaseVector(std::vector<int>& v)
{
class Increment
{
public:
int operator()(int& i)
{
return ++i;
}
};
std::for_each(v.begin(), v.end(), Increment());
}
IMO, this is way too much complexity for just an increment, and it'll be clearer to write it as a regular plain for loop. But when the operation you want to perform over a sequence becomes more complex, I find it useful to clearly separate the operation to be performed on each element from the actual loop statement. If your functor name is properly chosen, the code gains descriptiveness.
These are indeed real concerns, and these are being addressed in the next version of the C++ standard ("C++0x") which should be published either at the end of this year or in 2011. That version of C++ introduces a notion called C++ lambdas which allow for one to construct simple anonymous functions within another function, which makes it very easy to accomplish what you want without breaking your code into tiny little pieces. Lambdas are (experimentally?) supported in GCC as of GCC 4.5.
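For a taste of what that looks like, here is a sketch in which the lambda captures a local value, something that would otherwise need a helper class with a member variable (scaled is a made-up name; this requires a compiler with C++0x/C++11 lambda support):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns a copy of in with every element multiplied by factor.
// The lambda captures factor by value, replacing a hand-written functor.
std::vector<int> scaled(const std::vector<int>& in, int factor)
{
    std::vector<int> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(),
                   [factor](int x) { return x * factor; });
    return out;
}
```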
Libraries like STL and Boost are complex partly because they need to serve every need and work on any platform.
As a user of these libraries -- you're not planning on remaking .NET are you? -- you can use their simplified goodies.
Here is possibly a simpler foreach from Boost I like to use:
BOOST_FOREACH(string& item, my_list)
{
...
}
Looks much neater and simpler than using .begin(), .end(), etc. and yet it works for pretty much any iterable collection (not just arrays/vectors).
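If a C++11 compiler is available, the language's own range-based for gives the same shape without Boost. A minimal sketch (shout is a made-up name):

```cpp
#include <cassert>
#include <list>
#include <string>

// Appends "!" to every string in the list using a range-based for loop.
// Taking item by reference lets the loop modify the elements in place.
void shout(std::list<std::string>& my_list)
{
    for (std::string& item : my_list)
        item += "!";
}
```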