Is Apache Arrow's Vector.toArray() zero-copy in JS?

Same as title: is toArray() a zero-copy memory cast, in effect? Is there a way to find out this sort of thing without asking on forums? Thanks.

Apache Arrow supports a number of different languages and I don't see any language tag here. I'm going to assume JavaScript because that is the only language that has a toArray method on something called a Vector. If it is not JavaScript then please let me know.
The answer to your question is maybe. If it is a vector of int, float, time, decimal, or timestamp then it will be zero-copy and it is just returning a window into a private variable in the vector.
Otherwise, if it is a different type, then it performs an actual memory copy.
Source: https://github.com/apache/arrow/blob/abc786099627ef429109da16b9dc768b4efbd866/js/src/visitor/toarray.ts
There is also an Arrow users' mailing list, user@arrow.apache.org, which is probably the place to ask this kind of question for the fastest answer.

Dictionary access optimization described - does it take place anywhere?

Most of us compiler nerds have read Google's paper on V8 object property access where the resulting technique is just (in-)directly accessing an array member. My question is:
Does anyone optimize dictionary access the same way (assigning a fixed index to a (compile-time) fixed key)? It doesn't have to be applicable everywhere, but perhaps when it's compilation-unit wide? Or when the dictionary is read-only? Or between compilation units in a later pass? Whatever, maybe even unrolling the dictionary access or inlining it using a fixed array index instead of a key.
I do know how constant time lookup dictionaries work, but maybe the proposed optimization takes place to further boost the compiled languages (e.g. C++) where the hardware is coached to deal with V-table-like structures at runtime.
Please, if you know any of that, give me a hunch. Thank you much!
TL;DR I want to know of an existing way to optimize dictionary access (e.g. accessing std::map via array index), not of internal struct/object arrangement in a particular language.
Doubtful
While this could in theory be possible (being that std::map implementation is part of the Standard library) I know of no C++ compiler that performs such a trick.
And they do not really have to: if you want array indexing in C++, you pick an array and index it (possibly with named constants).
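As a rough sketch of what "an array indexed with named constants" could look like when the key set is fixed at compile time (the Key enum, Settings alias and volume function are made up for illustration):

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed key set: each key is assigned a compile-time index.
enum Key : std::size_t { kWidth, kHeight, kDepth, kKeyCount };

// A "dictionary" whose keys are all known up front is just an array.
using Settings = std::array<int, kKeyCount>;

inline int volume(const Settings& s) {
    // Each lookup is a plain indexed load: no hashing, no tree walk.
    return s[kWidth] * s[kHeight] * s[kDepth];
}
```

This is the hand-written version of the optimization the question asks about; the compiler does not need to discover it because the programmer has already picked the fixed indices.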

Mapping vectors of arbitrary type

I need to store a list of vectors of different types, each to be referenced by a string identifier. For now, I'm using std::map with std::string as the key and boost::any as its value (example implementation posted here).
I've come unstuck when trying to run a method on all the stored vectors, e.g.:
std::map<std::string, boost::any>::iterator it;
for (it = map_.begin(); it != map_.end(); ++it) {
    it->second.reserve(100); // FAIL: refers to boost::any not std::vector
}
My questions:
Is it possible to cast boost::any to an arbitrary vector type so I can execute its methods?
Is there a better way to map vectors of arbitrary types and retrieve them later on with the correct type?
At present, I'm toying with an alternative implementation which replaces boost::any with a pointer to a base container class as suggested in this answer. This opens up a whole new can of worms with other issues I need to work out. I'm happy to go down this route if necessary but I'm still interested to know if I can make it work with boost::any, or if there are other better solutions.
P.S. I'm a C++ n00b novice (and have been spoilt silly by Python's dynamic typing for far too long), so I may well be going about this the wrong way. Harsh criticism (ideally followed by suggestions) is very welcome.
The big picture:
As pointed out in comments, this may well be an XY problem so here's an overview of what I'm trying to achieve.
I'm writing a task scheduler for a simulation framework that manages the execution of tasks; each task is an elemental operation on a set of data vectors. For example, if task_A is defined in the model to be an operation on "x"(double), "y"(double), "scale"(int) then what we're effectively trying to emulate is the execution of task_A(double x[i], double y[i], int scale[i]) for all values of i.
Every task (function) operates on a different subset of the data, so these functions share a common function signature and only have access to data via specific APIs, e.g. get_int("scale") and set_double("x", 0.2).
In a previous incarnation of the framework (written in C), tasks were scheduled statically and the framework generated code based on a given model to run the simulation. The ordering of tasks is based on a dependency graph extracted from the model definition.
We're now attempting to create a common runtime for all models with a run-time scheduler that executes tasks as their dependencies are met. The move from generating model-specific code to a generic one has brought about all sorts of pain. Essentially, I need to be able to generically handle heterogeneous vectors and access them by "name" (and perhaps type_info), hence the above question.
I'm open to suggestions. Any suggestion.
Looking through the added detail, my immediate reaction would be to separate the data out into a number of separate maps, with the type as a template parameter. For example, you'd replace get_int("scale") with get<int>("scale") and set_double("x", 0.2) with set<double>("x", 0.2).
Alternatively, using std::map, you could pretty easily change that (for one example) to something like doubles["x"] = 0.2; or int scale_factor = ints["scale"]; (though you may need to be a bit wary with the latter -- if you try to retrieve a nonexistent value, it'll create it with default initialization rather than signaling an error).
Either way, you end up with a number of separate collections, each of which is homogeneous, instead of trying to put a number of collections of different types together into one big collection.
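A minimal sketch of the "one map per type" idea, assuming each name refers to a std::vector<T> and that tasks address a single element by index. The sim namespace, the store() helper and the extra index parameter are my own assumptions, not part of the framework described in the question:

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

namespace sim {  // hypothetical namespace, avoids clashing with std::set / std::get

// One homogeneous map per element type T (function-local static, created on first use).
template <typename T>
std::map<std::string, std::vector<T>>& store() {
    static std::map<std::string, std::vector<T>> instance;
    return instance;
}

// Write element i of the named vector, growing the vector if needed.
template <typename T>
void set(const std::string& name, std::size_t i, const T& value) {
    std::vector<T>& v = store<T>()[name];
    if (v.size() <= i) v.resize(i + 1);
    v[i] = value;
}

// Read element i. at() throws std::out_of_range for an unknown name or index
// instead of silently creating a default entry the way operator[] would.
template <typename T>
const T& get(const std::string& name, std::size_t i) {
    return store<T>().at(name).at(i);
}

}  // namespace sim
```

With this, sim::set<double>("x", 2, 0.2) and sim::get<int>("scale", 0) stay statically typed, and asking for a name under the wrong type fails loudly rather than returning garbage.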
If you really do need to put those together into a single overall collection, I'd think hard about just using a struct, so it would become something like vals.doubles["x"] = 0.2; or int scale_factor = vals.ints["scale"];
At least offhand, I don't see this losing much of anything, and by retaining static typing throughout, it certainly seems to fit better with how C++ is intended to work.
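On the narrower question of whether the stored value can be cast back to a vector: yes, as long as you name the exact type. The sketch below uses C++17 std::any as a stand-in for boost::any (boost::any_cast has the same pointer form); AnyMap and reserve_as are hypothetical names:

```cpp
#include <any>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using AnyMap = std::map<std::string, std::any>;

// Reserve capacity in the vector stored under `name`, if it holds a std::vector<T>.
// Returns false when the key is missing or the stored type is not std::vector<T>.
template <typename T>
bool reserve_as(AnyMap& m, const std::string& name, std::size_t n) {
    auto it = m.find(name);
    if (it == m.end()) return false;
    // The pointer form of any_cast yields nullptr on a type mismatch instead of throwing.
    auto* vec = std::any_cast<std::vector<T>>(&it->second);
    if (vec == nullptr) return false;
    vec->reserve(n);
    return true;
}
```

Note that the caller still has to know T for every name, which is why the loop in the question cannot simply call reserve on every entry; that is the main argument for the typed maps above.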

Could the S of S.O.L.I.D be extended for every single element of the code?

The S in the well-known SOLID principles of object-oriented design stands for:
Single responsibility principle, the notion that an object should have
only a single responsibility.
I was wondering, can this principle be extended even to arrays, variables, and all the other elements of a program?
For example, let's say we have:
int A[100];
And we use it to store the result of a function, but somehow we also use the same A[100] to record, for example, which indexes of A we have already checked and processed.
Could this be considered wrong? Shouldn't we create another element to store, for example, the indexes that we have already checked? Isn't this a hint of future messy code?
PS: I'm sorry if my question is not comprehensible but English is not my primary language. If you have any problem understanding the point of it please let me know in a comment below.
If the same instance of A is used in different parts of the program, you should follow this principle. If A is an auxiliary variable, a local one for example, I don't think you need to worry about it.
If you are tracking the use of bits of the array that have been updated, then you probably shouldn't be using an array, but a map instead.
In any case, if you need that sort of extra control over the array, then basically, you should be considering a class that contains both the contents of the array and the various information about what has and hasn't been done. So your array becomes local to the class object, as do your controls, and voila. You have single responsibility again.
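Something along these lines, as a sketch only (the Results class and its member names are invented for illustration, and the size of 100 is taken from the question):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical wrapper: results and "already processed" flags live side by side,
// so neither array is asked to do two jobs.
class Results {
public:
    // Store a result and mark its slot as processed.
    void store(std::size_t i, int value) {
        assert(i < kSize);
        values_[i] = value;
        done_[i] = true;
    }

    // at() throws std::out_of_range on a bad index.
    bool processed(std::size_t i) const { return done_.at(i); }
    int value(std::size_t i) const { return values_.at(i); }

private:
    static constexpr std::size_t kSize = 100;
    std::array<int, kSize> values_{};  // zero-initialized results
    std::array<bool, kSize> done_{};   // all false until store() is called
};
```

Callers only ever see store, processed and value, so the bookkeeping can later change (say, to a map) without touching the rest of the program.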

Initializing an AS 3.0 vector with type array - Vector.<Array>? (plus C++ equivalent)

I have a quick question for you all. I'm trying to convert over some ActionScript code to C++ and am having a difficult time with this one line:
private var edges:Vector.<Array>
What is this exactly? Is this essentially a multidimensional vector then? Or is this simply declaring the vector as a container? I understand from researching that vectors, like C++ vectors, have to be declared with a type. However, in C++ I can't just put down Array, I have to use another vector (probably) so it looks like:
vector<vector<T> > example;
or possibly even
vector<int[]> example;
I don't expect you guys to know the C++ equivalent because I'm primarily posting this with AS tags, but if you could confirm my understanding of the AS half, that would be great. I did some googling but didn't find any cases where someone used Array as its type.
From Mike Chambers (Adobe evangelist):
"Essentially, the Vector class is a typed Array, and in addition to ensuring your collection is type safe, can also provide (sometimes significant) performance improvements over using an Array."
http://www.mikechambers.com/blog/2008/08/19/using-vectors-in-actionscript-3-and-flash-player-10/
Essentially the vector in C++ is based on the same principles. As far as porting a vector of Arrays in AS3 to C++, well that's not a conversion that is clear cut in principle, as you could have a collection (array) of various types in C++, such as a char array. However, it appears you've got the idea, as you've pretty much posted examples of both avenues in your question.
I would post some code but I think you've got it exactly. Whether you use a vector within a vector or you declare a specifically typed collection comes down, I think, to what works best for your specific project.
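For what it's worth, the vector-within-a-vector route might look like the sketch below. It assumes each inner AS3 Array only ever holds integers (for example neighbour indices for the edges member); EdgeList, make_edges and add_edge are made-up names:

```cpp
#include <cstddef>
#include <vector>

// Assumption: each inner AS3 Array holds ints (e.g. neighbour indices).
using EdgeList = std::vector<std::vector<int>>;

// Build an adjacency list with one (initially empty) inner vector per node.
inline EdgeList make_edges(std::size_t node_count) {
    return EdgeList(node_count);
}

// Record an undirected edge between nodes a and b.
// at() throws std::out_of_range if either node index is invalid.
inline void add_edge(EdgeList& edges, std::size_t a, std::size_t b) {
    edges.at(a).push_back(static_cast<int>(b));
    edges.at(b).push_back(static_cast<int>(a));
}
```

If the AS3 arrays actually mix element types, a plain std::vector<int> will not do and the inner element type needs more thought.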
Also you might be interested in:
http://www.mikechambers.com/blog/2008/09/24/actioscript-3-vector-array-performance-comparison/

What is the usefulness of project1st<Arg1, Arg2> in the STL?

I was browsing the SGI STL documentation and ran into project1st<Arg1, Arg2>. I understand its definition, but I am having a hard time imagining a practical usage.
Have you ever used project1st or can you imagine a scenario?
A variant of project1st (taking a std::pair, and returning .first) is quite useful. You can use it in combination with std::transform to copy the keys from a std::map<K,V> to a std::vector<K>. Similarly, a variant of project2nd can be used to copy the values from a map to a vector<V>.
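A short sketch of the key-copying case, with a lambda standing in for the pair-taking variant (keys_of is a hypothetical helper name):

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

// Copy the keys of a std::map<K, V> into a std::vector<K>.
template <typename K, typename V>
std::vector<K> keys_of(const std::map<K, V>& m) {
    std::vector<K> keys;
    keys.reserve(m.size());
    // The lambda plays the role of the pair-taking project1st variant.
    std::transform(m.begin(), m.end(), std::back_inserter(keys),
                   [](const std::pair<const K, V>& kv) { return kv.first; });
    return keys;
}
```

Returning kv.second instead gives the project2nd-style version that copies the values.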
As it happens, none of the standard algorithms really benefits from project1st. The closest is partial_sum(project1st), which would set all output elements to the first input element. It mainly exists because the STL is heavily founded in mathematical set theory, and there operations like project1st are basic building blocks.
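To see the partial_sum behaviour concretely, here is a sketch with a hand-rolled stand-in for project1st, since the SGI functor is an extension and not part of the standard library (fill_with_first is a hypothetical helper):

```cpp
#include <numeric>
#include <vector>

// Stand-in for SGI's project1st: returns its first argument, ignores the second.
template <typename Arg1, typename Arg2>
struct project1st {
    Arg1 operator()(const Arg1& x, const Arg2&) const { return x; }
};

// partial_sum keeps a running accumulator; with project1st as the operation the
// accumulator never changes, so every output equals the first input.
inline std::vector<int> fill_with_first(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin(), project1st<int, int>());
    return out;
}
```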
My guess is that if you were using the strategy pattern and had a situation where you needed to pass an identity object, this would be a good choice. For example, there might be a case where an algorithm takes several such objects, and perhaps it is possible that you want one of them to do nothing under some situation.
Parallel programming. Imagine a situation where two processes come up with two valid but different results for a given computation, and you need to force them to be the same. project1st/2nd provides a very convenient way to perform this operation on a whole container, using an appropriate parallel call that takes a functor as an argument.
I assume that someone had a practical use for it, or it wouldn't have been written, but I'm drawing a blank on what it might have been. Presumably its use-case is similar to the identity function that the description mentions, where there's no real need for processing but the syntax requires a functor anyway.
The example on that same page suggests using it with the two-container form of std::transform, but if I'm not mistaken, the way they're using it is functionally identical to std::copy, so I don't see the point.
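A sketch of that equivalence, again with a hand-written stand-in for the SGI functor (copy_via_transform is a made-up name): the second range is read but never affects the result, so the output is just a copy of the first range.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for SGI's project1st: returns its first argument, ignores the second.
template <typename Arg1, typename Arg2>
struct project1st {
    Arg1 operator()(const Arg1& x, const Arg2&) const { return x; }
};

// Two-range std::transform with project1st: same result as std::copy of `a`.
inline std::vector<int> copy_via_transform(const std::vector<int>& a,
                                           const std::vector<char>& b) {
    assert(b.size() >= a.size());  // transform reads one element of b per element of a
    std::vector<int> out(a.size());
    std::transform(a.begin(), a.end(), b.begin(), out.begin(),
                   project1st<int, char>());
    return out;
}
```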
It looks like a solution in search of a problem to me.