How can I get a row view of data read from a Parquet file? - c++

Example: let's say a table named user has id, name, email, phone, and is_active as attributes, and there are thousands of users in this table. I would like to read the details per user.
void ParquetReaderPlus::read_next_row(long row_group_index, long local_row_num)
{
    // Build the list of column indices to read; note this should probably be
    // sized by the column count, not the row count.
    std::vector<int> columns_to_tabulate(this->total_row);
    for (int idx = 0; idx < this->total_row; idx++)
        columns_to_tabulate[idx] = idx;
    this->file_reader->set_num_threads(4);
    int rg = this->total_row_group;
    // Read a single row group into a table rather than the whole Parquet file.
    std::shared_ptr<arrow::Table> table;
    this->file_reader->ReadRowGroup(row_group_index, columns_to_tabulate, &table);
    auto rows = table->num_rows();
    // TODO
    // Now I am confused how to proceed from here
}
Any suggestions?
I am not sure whether converting with ColumnarTableToVector would work.

It's difficult to answer this question without knowing what you plan on doing with those details. A Table has a list of columns, and each column (in Arrow C++) holds a type-agnostic array of data. Since the columns are type-agnostic, there is not much you can do with them other than get the count and access the underlying bytes.
If you want to interact with the values then you will either need to know the type of a column ahead of time (and cast), have a series of different actions for each type of data you might encounter (switch case plus cast), or interact with the values as buffers of bytes. One could probably write a complete answer for all three of those options.
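As a minimal sketch of the first option (knowing the types ahead of time and casting), something like the following should work. The column names and types are assumptions based on your example schema, and it also assumes both columns share the same chunking, which holds for a table read from a single row group:

#include <arrow/api.h>
#include <iostream>

// Sketch only: assumes "id" is an int64 column and "name" is a string column.
void print_ids_and_names(const std::shared_ptr<arrow::Table>& table) {
    auto id_col = table->GetColumnByName("id");      // a ChunkedArray
    auto name_col = table->GetColumnByName("name");
    for (int c = 0; c < id_col->num_chunks(); c++) {
        auto ids = std::static_pointer_cast<arrow::Int64Array>(id_col->chunk(c));
        auto names = std::static_pointer_cast<arrow::StringArray>(name_col->chunk(c));
        for (int64_t i = 0; i < ids->length(); i++) {
            if (ids->IsNull(i)) continue;  // always check for nulls
            std::cout << ids->Value(i) << " " << names->GetString(i) << std::endl;
        }
    }
}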
You might want to read up a bit on the Arrow compute API (https://arrow.apache.org/docs/cpp/compute.html although the documentation is a bit sparse for C++). This API allows you to perform some common operations on your data (somewhat) regardless of type. For example, I see the word "tabulate" in your code snippet. If you wanted to sum up the values in a column then you could use the "sum" function in the compute API. This function follows the "have a series of different actions for each different type of data you might encounter" advice above and will allow you to sum up any numeric column.
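For instance, a hedged sketch of summing a numeric column with the compute API (the column index and function name are mine, chosen for illustration):

#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <iostream>

// Sketch only: sums whatever numeric column sits at index 0.
arrow::Status sum_first_column(const std::shared_ptr<arrow::Table>& table) {
    ARROW_ASSIGN_OR_RAISE(arrow::Datum result, arrow::compute::Sum(table->column(0)));
    std::cout << "sum: " << result.scalar()->ToString() << std::endl;
    return arrow::Status::OK();
}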

As far as I know, what you are trying to do isn't easy. You'd have to:
iterate through each row
iterate through each column
figure out the type of the column
cast the arrow::Array of the column to the underlying type (e.g. arrow::StringArray)
get the value for that column, convert it to a string, and append it to your output
This is further complicated by:
the fact that the rows are grouped into chunks (so iterating through rows isn't as simple)
the need to also deal with list and struct types.
It's not impossible, but it is a lot of code (though you'd only have to write it once); a rough sketch follows.
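The sketch below handles only int64 and string columns and sidesteps the chunking problem by combining chunks first; a real version would add a case per type you expect, plus list and struct handling. The helper names are mine, not Arrow's:

#include <arrow/api.h>
#include <iostream>
#include <string>

// Sketch only: renders one cell as a string, covering just two types.
std::string cell_to_string(const arrow::Array& col, int64_t row) {
    if (col.IsNull(row)) return "null";
    switch (col.type_id()) {
        case arrow::Type::INT64:
            return std::to_string(static_cast<const arrow::Int64Array&>(col).Value(row));
        case arrow::Type::STRING:
            return static_cast<const arrow::StringArray&>(col).GetString(row);
        default:
            return "<unhandled type>";
    }
}

arrow::Status print_rows(const std::shared_ptr<arrow::Table>& input) {
    // Collapse each column to a single chunk so row iteration stays simple.
    ARROW_ASSIGN_OR_RAISE(auto table, input->CombineChunks());
    for (int64_t row = 0; row < table->num_rows(); row++) {
        for (int col = 0; col < table->num_columns(); col++) {
            std::cout << cell_to_string(*table->column(col)->chunk(0), row)
                      << (col + 1 < table->num_columns() ? "," : "\n");
        }
    }
    return arrow::Status::OK();
}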
Another option is to write that table to CSV in memory and print it:
#include <arrow/api.h>
#include <arrow/csv/api.h>  // WriteCSV
#include <arrow/io/api.h>   // BufferOutputStream
#include <iostream>

arrow::Status dumpTable(const std::shared_ptr<arrow::Table>& table) {
    auto outputResult = arrow::io::BufferOutputStream::Create();
    ARROW_RETURN_NOT_OK(outputResult.status());
    std::shared_ptr<arrow::io::BufferOutputStream> output = outputResult.ValueOrDie();
    // Serialize the whole table as CSV into the in-memory buffer.
    ARROW_RETURN_NOT_OK(arrow::csv::WriteCSV(*table, arrow::csv::WriteOptions::Defaults(), output.get()));
    auto finishResult = output->Finish();
    ARROW_RETURN_NOT_OK(finishResult.status());
    std::cout << finishResult.ValueOrDie()->ToString();
    return arrow::Status::OK();
}

Related

crossfilter dimension on 2 fields

My data looks like this
field1,field2,value1,value2
a,b,1,1
b,a,2,2
c,a,3,5
b,c,6,7
d,a,6,7
I don't have a good way of rearranging that data so let's assume the data has to stay like this.
I want to create a dimension on field1 and field2 combined: a single dimension that would take the union of all values in both field1 and field2 (in my example, the values should be [a,b,c,d]).
As a reduce function you can assume reduceSum on value2 for example (allowing double counting for now).
(have tagged dc.js and reductio because it could be useful for users of those libraries)
First I need to point out that your data is denormalized, so the counts you get might be somewhat confusing, no matter what technique you use.
In standard usage of crossfilter, each row will be counted in exactly one bin, and all the bins in a group will add up to 100%. However, in your case, each row will be counted twice (unless the two fields are the same), so for example a pie chart wouldn't make any sense.
That said, the "tag dimension" feature is perfect for what you're trying to do.
The dimension declaration could be as simple as:
var tagDimension = cf.dimension(function(d) { return [d.field1,d.field2]; }, true);
Now each row will get counted twice - this dimension and its associated groups will act exactly as if each of the rows were duplicated, with one copy indexed by field1 and the other by field2.
If you made a bar chart with this, say, the total count would be 2N minus the number of rows where field1 === field2. If you click on bar 'b', all rows which have 'b' in either field will get selected. This only affects groups built on this dimension, so any other charts will only see one copy of each row.

spotfire plot list of elements

I have a data table in this format: [table screenshot missing] an object column, a Description column that is either 'time' or 'temperature', and a Values column holding a bracketed, comma-separated list such as [1,2,3,4]. I want to plot temperature against time; any idea how to do that?
This can be done in a TERR data function. I don't know how comfortable you are integrating Spotfire with TERR; there is an intro video here, for instance (the demo starts around minute 7):
https://www.youtube.com/watch?v=ZtVltmmKWQs
With that in mind, I wrote the script without loading any library, so it is quite verbose and explicit, but hopefully simpler to follow step by step. I am sure there is a more elegant way, and there are better ways of making it flexible with column names, but this is a start.
Your input will be a data table (dt, the original data) and the output a new data table (dt.out, the transformed data). All column names (and some values) are addressed explicitly in the script (so if you change them it won't work).
#remove the []
dt$Values=gsub('\\[|\\]','',dt$Values)
#separate into two different data frames, one for time and one for temperature
dt.time=dt[dt$Description=='time',]
dt.temperature=dt[dt$Description=='temperature',]
#split the columns we want to separate into a list of vectors
dt2.time=strsplit(as.character(dt.time$Values),',')
dt2.temperature=strsplit(as.character(dt.temperature$Values),',')
#rearrange times
names(dt2.time)=dt.time$object
dt2.time=stack(dt2.time) #stack vectors
dt2.time$id=c(1:nrow(dt2.time)) #assign running id for merging later
colnames(dt2.time)[colnames(dt2.time)=='values']='time'
#rearrange temperatures
names(dt2.temperature)=dt.temperature$object
dt2.temperature=stack(dt2.temperature) #stack vectors
dt2.temperature$id=c(1:nrow(dt2.temperature)) #assign running id for merging later
colnames(dt2.temperature)[colnames(dt2.temperature)=='values']='temperature'
#merge time and temperature
dt.out=merge(dt2.time,dt2.temperature,by=c('id','ind'))
colnames(dt.out)[colnames(dt.out)=='ind']='object'
dt.out$time=as.numeric(dt.out$time)
dt.out$temperature=as.numeric(dt.out$temperature)
Because all of the example rows you've shown here contain exactly four list items and you haven't specified otherwise, I'll assume that all of the data fits this format.
With this assumption it becomes pretty trivial, albeit a little messy, to split the values out into columns using the RXReplace() expression function.
You can create four calculated columns, each with an expression like:
Int(RXReplace([values],"\\[([\\d\\-]+),([\\d\\-]+),([\\d\\-]+),([\\d\\-]+)]","\\1",""))
The third argument "\\1" determines which number in the list to extract. Backslashes are doubled ("escaped") per the requirements of the RXReplace() function.
Note that this example assumes the numbers are all whole numbers. If you have decimals, you'd need to adjust each "phrase" of the regular expression to ([\\d\\-\\.]+), and you'd need to wrap the expression in Real() rather than Int() (if you leave this part out, the result will be a String type, which could cause confusion later on when working with the data).
Once you have the four columns, you'll be able to unpivot to get the data easily.

Exporting an oracle table dynamically to a flat file

I am trying to build a C++ program using the OCCI libraries that takes a SELECT statement or a table name as input and turns it into a delimited file. However, looking at the documentation, I can't find a way to export all columns of a query result into a file. Almost all examples I found were along the following lines:
string query = "SELECT col1 FROM table1";
stmt = con->createStatement(query);
res = stmt->executeQuery();
while (res->next())
{
    outfile << res->getInt(1) << endl;
}
What I want to do is run a SELECT * and then export each full row to the file in one go, without specifying the type of each column, but I haven't been able to find something that does this.
I know that row-by-row exports are not really efficient for large sets, but I want to make this work before optimizing it.
Does anyone have any ideas on how to do this?
I don't think that you will find something like this "in the box" when using OCCI.
However, using the STL you can push the result of every iteration into a stringstream, and once rs->next() reports no more data you can append the stringstream's contents to the file.
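A rough sketch of that approach (the two-column layout, types, and function name are assumed purely for illustration):

#include <fstream>
#include <sstream>
#include <string>
#include <occi.h>  // OCCI; linking against the Oracle client libraries is assumed

// Sketch only: buffers all rows in memory, then writes the file in one go.
void exportToFile(oracle::occi::ResultSet* res, const std::string& path) {
    std::stringstream buffer;
    while (res->next()) {
        buffer << res->getInt(1) << "," << res->getString(2) << "\n";
    }
    std::ofstream out(path);
    out << buffer.str();
}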
I found that there is no way to do this without iterating over the metadata object at least once. Since I only need to do this once per query execution, I ended up writing the attribute types and column positions to a map and using that map within the result-set loop to read the data. Here's the code I used:
res = stmt->executeQuery();
// Fetch the column metadata once and remember each column's data type.
std::vector<oracle::occi::MetaData> meta = res->getColumnListMetaData();
std::map<int, int> mapper;
for (size_t i = 0; i < meta.size(); i++) {
    mapper[i] = meta[i].getInt(oracle::occi::MetaData::ATTR_DATA_TYPE);
}
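To make that concrete, here is a hedged sketch of how such a map could drive the per-column reads inside the fetch loop; the type code checked and the delimiter are assumptions, and a real version would cover every type your tables use:

#include <occi.h>
#include <fstream>
#include <map>

// Sketch only: picks a getter per column based on the recorded type code,
// writing one delimited line per row.
void writeRows(oracle::occi::ResultSet* res,
               const std::map<int, int>& mapper,
               std::ofstream& outfile) {
    while (res->next()) {
        for (size_t i = 0; i < mapper.size(); i++) {
            int col = static_cast<int>(i) + 1;  // OCCI columns are 1-based
            switch (mapper.at(i)) {
                case 2:   // SQLT_NUM, Oracle's numeric type code
                    outfile << res->getDouble(col);
                    break;
                default:  // fall back to a string conversion for this sketch
                    outfile << res->getString(col);
                    break;
            }
            if (i + 1 < mapper.size()) outfile << "|";
        }
        outfile << "\n";
    }
}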

Maths Operations on Columns from Different Data Frames

I have two data frames, imported through pandas from Fama-French and Yahoo. I am trying to compare column values from the two data frames (more specifically, subtract one from the other), but a ValueError occurs whenever I try to do so. The data frames have different indexing and I don't know how to take this factor into account (I'm quite new to Python and pandas).
Here is the code in question:
import datetime as dt
import pandas as pd
from pandas_datareader import data  # assumed source of DataReader/get_data_yahoo

start, end = dt.datetime.now() - dt.timedelta(days=60*30), dt.datetime.now()
f = data.DataReader('F-F_Research_Data_Factors', 'famafrench', start, end)[0]
s = data.get_data_yahoo('aapl', start, end)
s = s.resample('M').last()  # resample('M', how='last') in older pandas
s['returns'] = s['Adj Close'].pct_change()
Ideally, I would like to create a series with row values = f['RF'] - s['returns']
Any help would be much appreciated.
Convert f.index so that its monthly period labels line up with the month-end timestamps in s:
f.index = f.index.to_datetime() + pd.offsets.MonthEnd()  # on recent pandas, use f.index.to_timestamp()
f['RF'] - s['returns']
Ask yourself how you could possibly define a difference between two matrices when they have different sizes.
The first thing to do is to match the two data frames on a common value (say, the date). Then you will be able to do any operation you want.

kettle sample rows for each type

I have a set of rows, let's say "rowId", "type", "value". On output I need a set of 10 sample rows for each "type". How can I do it? "type" has approx. 100 different, changing values, so a switch is not a good option.
Well, I've figured out a workaround for this situation. I split the transformation into parts. The first part collects all the data into a temp table, finds the unique types, and copies them to the result.
The second part runs for every input row (where we have the types) and collects the data of the given type from the temp table. Then you need no grouping to do a stratified sample.