Building a dataframe in C++ - c++

I am trying to build a DataFrame in C++. I'm facing some problems, such as dealing with variable data types.
I am thinking of a DataFrame inspired by the Pandas DataFrame (from Python). So my design idea is:
1. Build an object 'Series' which is a vector of a fixed data type.
2. Build an object 'DataFrame' which will store a list of Series (this list can be variable).
Item 1 is just a regular vector. So, for instance, the user would call
Series.fill({1,2,3,4}) and it would store the vector {1,2,3,4} in some attribute of Series, say Series.data.
Problem 1. How would I make a class that understands {1,2,3,4} as a vector of 4 integers? Is it possible?
The next problem is:
About 2., I can see the DataFrame as a matrix of n columns and m rows, but the columns can have different data types.
I tried to design this as a vector of n pointers, where each pointer would point to a vector of dimension m with different data types.
I tried to do something like
vector<void*> columns(10)
and fill it with something like
columns[0] = (int*) malloc(8*sizeof(int))
But this does not work, if I try to fill the vector, like
(*columns[0])[0] = 5;
I get an error
::value_type {aka void*}’ is not a pointer-to-object type
(int *) (*a[0])[0] = 5;
How can I do it properly? I still have other questions like, how would I append an undetermined number of Series into a DataFrame, but for now, just building a matrix with columns with different data types is a great start.
I know that I must keep track of the types of pointers inside my void vector but I can create a parallel list with all data types and make this an attribute of my class DataFrame.

Building a heterogeneous container (which a dataframe is supposed to be) in C++ is more complex than you think, because C++ is statically typed. It means you have to know all the types at compile time.
Your approach uses a vector of pointers (there are a few variations of this approach, which I am not going into). This approach is very inefficient, because the pointers point all over memory and wreck your cache locality. I do not recommend even attempting to implement such a dataframe because there is really no point to it.
Look at this implementation of DataFrame in C++: https://github.com/hosseinmoein/DataFrame. You might be able to just use it as is, or use it for insight into how to implement a true heterogeneous DataFrame. It uses a collection of static vectors in a hash table to implement a true heterogeneous container. It also uses contiguous memory space, so it avoids the pointer effect.
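To make the "all types known at compile time" constraint concrete, here is a minimal sketch of one possible design. It is not how the linked library works; it swaps in std::variant (C++17) and a std::map, and every name in it (Column, DataFrame, add_column, column, rows) is hypothetical. Each column is a single contiguous vector, and the set of allowed element types is fixed in one place.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical names for illustration; not the API of any existing library.
// A column is one contiguous vector whose element type is chosen from a
// fixed list known at compile time.
using Column = std::variant<std::vector<int>,
                            std::vector<double>,
                            std::vector<std::string>>;

class DataFrame {
public:
    // Store (or replace) a column under the given name.
    template <typename T>
    void add_column(const std::string& name, std::vector<T> values) {
        columns_[name] = std::move(values);
    }

    // Typed access. std::get throws std::bad_variant_access if T is not
    // the type the column was stored with, and at() throws
    // std::out_of_range if the name is unknown.
    template <typename T>
    std::vector<T>& column(const std::string& name) {
        return std::get<std::vector<T>>(columns_.at(name));
    }

    // Number of rows in a column, whatever its element type.
    std::size_t rows(const std::string& name) const {
        return std::visit([](const auto& v) { return v.size(); },
                          columns_.at(name));
    }

private:
    std::map<std::string, Column> columns_;
};
```

With this sketch, df.add_column<int>("id", {1, 2, 3, 4}) accepts a braced list directly (which also touches on Problem 1 from the question), and df.column<int>("id")[0] = 5 writes through to the stored vector without any casts.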

TL;DR Version
Discard what you are doing.
Use vector<vector<int>> columns;. When you need a column, use columns[index].data() to get a pointer to the backing array from the indexed inner vector and pass that int * to whatever required the void *. The int * will be implicitly converted.
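A minimal sketch of that suggestion, assuming some C-style function that only accepts a void * (sum_ints below is a hypothetical stand-in, as is make_columns; the sizes 10 and 8 come from the question):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a C-style API that only accepts void*.
// The receiver has to know that the buffer really holds ints.
int sum_ints(const void* buffer, std::size_t count) {
    const int* values = static_cast<const int*>(buffer);
    int total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}

// Build 10 columns of 8 zero-initialized ints each; no malloc needed.
std::vector<std::vector<int>> make_columns() {
    return std::vector<std::vector<int>>(10, std::vector<int>(8));
}
```

After auto columns = make_columns();, the assignment columns[0][0] = 5; just works, and sum_ints(columns[0].data(), columns[0].size()) passes the int * where a void * is expected with no cast at the call site.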
Explanation
Quoting cppreference
void - type with an empty set of values. It is an incomplete type that cannot be completed (consequently, objects of type void are disallowed). There are no arrays of void, nor references to void. However, pointers to void and functions returning type void (procedures in other languages) are permitted.
Since void is incomplete, you can't have a void. void* needs to be cast back to the actual data type, int*, before it can be used for anything other than passing the anonymously typed pointer around. All receivers of the void * have to know what it really is to do anything with it other than pass it on.
Functions that require void * parameters will take any pointer you give them without any further effort on your part, so there is almost no need to make void * variables in C++. Almost all cases where you would need a void * are filled in with polymorphism or templates. The last time I used a void * in C++ was back when I wrote C++ as C with classes bolted on.
The Error
Given
vector<void*> columns(10);
where each element will contain an array of ints, let's work through
(*columns[0])[0] = 5;
step by step to see what types we have and make sure the types at each step are consistent
columns[0]
Gets the first element in the vector, a void*. So far so good.
*columns[0]
dereferences the void* at columns[0]. As covered in the preamble, this cannot be done. You cannot dereference a void * because that would give you a value of type void. This produces the reported "::value_type {aka void*}’ is not a pointer-to-object type" error message.
We could
*reinterpret_cast<int*>(columns[0])
to turn it into a pointer to int, something we can dereference and matches the initial type, and receive an int, specifically the first int in the array.
(*reinterpret_cast<int*>(columns[0]))[0]
will fail because you can't index an int. That would be like writing 42[0]. This means the dereference is unnecessary.
The end result needs to look like
reinterpret_cast<int*>(columns[0])[0]
But don't do this. It is unnecessary and grossly over-complicated.

Related

Write multidimensional vectors (tensors) of scalars to file in C++

The description of the object I have
I have several N-dimensional containers in my code, representing tensors, whose types are defined as
std::vector<std::vector<std::vector<...<double>...>>>
These type of data structures occur in several different sizes and dimensions and they only contain scalar numbers. The number of dimensions is known for every vector and can be accessed as eg. tensor::dimension. Since they're representing tensors, they're never "irregular": at the bottom level, vectors always contain the same number of elements, like this:
// THIS IS HOW THEY ALWAYS LOOK LIKE
T tensor = {{1,2,3,4}, {1,2,3,4}, {1,2,3,4}}
// THIS IS WHAT NEVER HAPPENS
T tensor = {{1,2,3}, {1,2,3,4}, {1,2}}
What I want to do with this object
I want to save each of these multidimensional vectors (tensors basically) into different files, which I can then easily load/read eg. in Python into a numpy.array - for further analysis and visualization. How can I achieve this to save any of these N-dimensional std::vectors in modern C++ without explicitly defining a basic write-to-txt function with N nested loops for each vector with different dimensions?
(Note: Solutions/advice that require/mention only standard libraries are preferred, but I'm happy to hear any other answers too!)
The only way to iterate over something in C++ is a loop, in some shape, manner, or form. So no matter what, you're going to have loops. There are no workarounds or alternatives, but that doesn't mean you actually have to write all these loops yourself, one at a time. This is why we have templates in C++. What you are looking for is a recursive template that peels away one dimension at a time until it reaches the last one, which gets implemented for real, basically letting your compiler write every loop for you. Mission accomplished. Starting with a simplistic example of writing out a plain vector:
#include <iostream>
#include <vector>

void write_vec(const std::vector<double> &v)
{
    // Print each scalar on its own line.
    for (const auto &value : v)
        std::cout << value << std::endl;
}
The actual details of how you want to save each value, and to which files, are irrelevant here; you can adjust the above code to make it work in whichever way you see fit. The point is that you want to make it work for an arbitrary number of dimensions. Simply add a template with the same name, then let overload resolution do all the work for you:
template<typename T>
void write_vec(const std::vector<std::vector<T>> &v)
{
    // Peel off one dimension and recurse into each inner vector.
    for (const auto &value : v)
        write_vec(value);
}
Now, a write_vec(anything), where anything is any N-"deep" vector that ends up in a std::vector<double> will walk its way downhill, on its own, and write out every double.
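Putting the two overloads together, a variation that writes to any output stream might look like the sketch below. The extra std::ostream parameter and the one-row-per-line, space-separated layout are assumptions made here for illustration, not part of the answer above.

```cpp
#include <cstddef>
#include <fstream>   // std::ofstream, for callers that write to a file
#include <ostream>
#include <sstream>   // std::ostringstream, for callers that want a string
#include <vector>

// Base case: the innermost vector holds the scalars, written as one
// space-separated row per line.
void write_vec(std::ostream& out, const std::vector<double>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (i != 0) out << ' ';
        out << v[i];
    }
    out << '\n';
}

// Recursive case: peel off one dimension and recurse into each inner vector.
template <typename T>
void write_vec(std::ostream& out, const std::vector<std::vector<T>>& v) {
    for (const auto& inner : v) {
        write_vec(out, inner);
    }
}
```

Passing a std::ofstream as the first argument sends the rows to a file. Note that this layout flattens every dimension above the innermost one into plain rows, so for more than two dimensions the reader has to know the original shape and restore it (in NumPy, for example, by reshaping the loaded array).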

Is it possible to create storage for an n element array where the elements are tuples?

I'm making a class that is supposed to be able to store a 20 element array with each element being a tuple of four predefined types. Another catch is, I can't use parameters.
I can't find good online sources for this and the material provided from my university is honestly insufficient. I'm preparing for an exam and I'm stumped when it comes to objects in OCaml.
I was thinking of doing something like
val mutable arr = Array.make 20 (input 20 values)
but this seems too simplistic and inefficient to be a correct solution.
The fields of a class can have any type. This certainly includes an array type. Arrays, in turn, can contain any type, which includes tuples.
Any given mutable field and any given array is, of course, restricted to always contain values of the same type. This is what it means to have "strong" typing.
OCaml is a high level language, so there's no need (or opportunity, really) to be concerned with too many details of representation. If you want a class with a field like you say, your proposed type sounds perfectly fine.
type mytuple = int * float * char
class myclass = object
val mutable myfield : mytuple array = [||]
end
You can find good documentation on OCaml at realworldocaml.org. There are more resources listed at ocaml.org.

C++ Table of Vectors of Different Types

I have a collection of vectors of different types like this:
std::vector<int> a_values;
std::vector<float> b_values;
std::vector<std::string> c_values;
Every time I get a new value for a, b and c I want to push those to their respective vectors a_values, b_values and c_values.
I want to do this in the most generic way possible, ideally in a way I can iterate over the vectors. So I want a function addValue(...) which automatically calls the respective push_back() on each vector. If I add a new vector d_values I only want to have to specify it in one place.
The first answer to this post https://softwareengineering.stackexchange.com/questions/311415/designing-an-in-memory-table-in-c seems relevant, but I want to easily get the vector out for a given name, without having to manually cast to a particular type. ie. I want to call getValues("d") which will give me the underlying std::vector.
Does anyone have a basic example of a collection class that does this?
This idea can be achieved with the heterogeneous container tuple, which will allow for storage of vectors containing elements of different types.
In particular, we can define a simple data structure as follows
template <typename ...Ts>
using vector_tuple = std::tuple<std::vector<Ts>...>;
In the initial case of the provided example, the three vectors a_values, b_values and c_values simply correspond to the type vector_tuple<int, float, std::string>. Adding another vector simply requires adding another type to our collection.
Indexing into our new collection is simple too, given the collection
vector_tuple<int, float, std::string> my_vec_tup;
we have the following methods for extracting a_values, b_values and c_values
auto const &a_values = std::get<0>(my_vec_tup);
auto const &b_values = std::get<1>(my_vec_tup);
auto const &c_values = std::get<2>(my_vec_tup);
Note that the tuple container is indexed at compile-time, which is excellent if you know the intended size at compile-time, but will be unsuitable otherwise.
In the description of the problem that has been provided, the number of vectors does appear to be decided at compile-time and the naming convention appears to be arbitrary and constant (i.e., the index of the vectors won't need to be changed at runtime). Hence, a tuple of vectors seems to be a suitable solution, if you associate each name with an integer index.
As for the second part of the question (i.e., iterating over each vector), if you're using C++17, the great news is that you can simply use the std::apply function to do so: https://en.cppreference.com/w/cpp/utility/apply
All that is required is to pass a function which takes a vector (you may wish to define appropriate overloading to handle each container separately), and your tuple to std::apply.
However, for earlier versions of C++, you'll need to implement your own for_each function. This problem has fortunately already been solved: How can you iterate over the elements of an std::tuple?
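For the C++17 route, the pieces above can be sketched as follows. The helper names add_values, total_size and detail::push_row are hypothetical; add_values plays the role of the addValue(...) function from the question, and total_size shows std::apply visiting every vector in the tuple.

```cpp
#include <cstddef>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

template <typename... Ts>
using vector_tuple = std::tuple<std::vector<Ts>...>;

namespace detail {
// Push the I-th value of the row onto the I-th vector, for every index I.
template <typename Table, typename Row, std::size_t... Is>
void push_row(Table& table, const Row& row, std::index_sequence<Is...>) {
    (std::get<Is>(table).push_back(std::get<Is>(row)), ...);
}
}  // namespace detail

// Hypothetical helper: append one value to each vector, in order.
template <typename... Ts, typename... Us>
void add_values(vector_tuple<Ts...>& table, const Us&... values) {
    static_assert(sizeof...(Ts) == sizeof...(Us),
                  "pass exactly one value per vector");
    detail::push_row(table, std::tie(values...),
                     std::index_sequence_for<Ts...>{});
}

// Hypothetical helper: std::apply unpacks the tuple into a parameter pack,
// and the fold expression sums the size of every vector.
template <typename... Ts>
std::size_t total_size(const vector_tuple<Ts...>& table) {
    return std::apply(
        [](const auto&... columns) {
            return (std::size_t{0} + ... + columns.size());
        },
        table);
}
```

Given vector_tuple<int, float, std::string> t;, a call such as add_values(t, 1, 2.5f, "x"); pushes onto all three vectors, and adding a fourth vector only means adding a fourth type to the alias.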

Returning a vector of tuples c++

I am trying to create and return a vector of two element arrays (which I will refer to as tuples), however I am running into issues.
std::vector<int *> distr;
int tuple[2];
distr.push_back(tuple);
//modify tuple's contents
distr.push_back(tuple);
In this case distr then has two copies of the modified tuple rather than the two distinct tuples I desired.
So I figured it had to do with memory so I tried this approach instead
distr.push_back(new int [num1, num2]);
But it doesn't save the tuples correctly as trying to access their values returns weird false values.
This is clearly due to a misunderstanding of how memory is allocated. I can understand why the first example fails in that fashion but I do not understand the issue with the second example.
When you use
distr.push_back(new int [num1, num2]);
You are not creating a two-element array filled with num1, num2. That would be done like the following:
new int[2] {num1, num2}
I would advise against using this method though. If all of your tuples will be the same size, I would make a struct to represent that data type (in the special case of two, you can even use std::pair).
Use a pair instead of a pointer:
std::vector<std::pair<int, int> > distr;
// Do some code
distr.emplace_back(num1, num2);
First, you should understand that "classic" C and C++ arrays are just buffers of allocated memory. In your sample, tuple is an array of 2 integers that decays to a pointer to its first element when you pass it to push_back. So, when you push_back tuple twice, you just add the same pointer twice. The array itself is not copied into the std::vector, so you end up with a vector containing two pointers to the SAME area of memory. To achieve the desired behavior, you can use higher-level C++ data types, such as std::tuple or std::array.
As for your second code snippet, it's just a syntax misunderstanding: the expression new <type>[<count>] creates a memory buffer (similar to your tuple, but on the HEAP) of <count> values of type <type>. So, if you are going to create a buffer of 2 ints, you should write new int[2]. When you write an a, b expression there, it is evaluated with the comma operator, so <count> will be num2 in your sample.
P.S. Be aware that to work correctly with heap memory you should study C++ memory management in much more depth.
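As a minimal sketch of the std::array suggestion (make_distr and the values 1, 2, 10, 20 are placeholders chosen here for illustration), each element of the vector owns its own two ints, so modifying the local array after a push_back no longer affects what was stored:

```cpp
#include <array>
#include <vector>

// Each element owns its own two ints, so push_back copies the values
// instead of storing a pointer to a shared buffer.
std::vector<std::array<int, 2>> make_distr() {
    std::vector<std::array<int, 2>> distr;
    std::array<int, 2> tuple = {1, 2};
    distr.push_back(tuple);   // stores a copy of {1, 2}
    tuple[0] = 10;            // modifying the local does not touch the copy
    tuple[1] = 20;
    distr.push_back(tuple);   // stores a copy of {10, 20}
    return distr;             // the vector owns its elements; nothing to delete
}
```

Returning the vector by value is fine here, since no raw pointers or new expressions are involved.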

Compare pointers by type?

How could I compare two pointers to see if they are of the same type?
Say I have...
int * a;
char * b;
I want to know whether or not these two pointers differ in type.
Details:
I'm working on a lookup table (implemented as a 2D void pointer array) to store pointers to structs of various types. I want to use one insert function that compares the pointer type given to the types stored in the first row of the table. If they match I want to add the pointer to the table in that column.
Basically I want to be able to store each incoming type into its own column.
Alternative methods of accomplishing this are welcomed.
In this case, since you know the types beforehand, it doesn't make much sense to check. You can just proceed knowing that they are different types.
However, assuming that perhaps the types may be dependent on some compile-time properties like template arguments, you could use std::is_same:
std::is_same<decltype(a), decltype(b)>::value
This will be true if they are the same type and false otherwise.
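A small sketch of how that check could be wrapped (same_type is a hypothetical helper name). Keep in mind that the comparison happens entirely at compile time: once a pointer has been stored as void * in the table, its original type can no longer be recovered this way.

```cpp
#include <type_traits>

// Hypothetical helper: true when both arguments have exactly the same type.
// The arguments are only used for type deduction; their values are never read.
template <typename A, typename B>
constexpr bool same_type(const A&, const B&) {
    return std::is_same<A, B>::value;
}
```

With int * a and char * b as in the question, same_type(a, b) is false, while comparing a with another int * gives true.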