Exporting an Oracle table dynamically to a flat file - C++

I am trying to build a C++ program using the OCCI libraries that will take a select statement or a table name as input and turn it into a delimited file. However, looking at the documentation, I can't find a way to export all columns of a query result into a file. Almost all examples I found were along the following lines:
string query = "SELECT col1 FROM table1";
stmt = con->createStatement(query);
res = stmt->executeQuery();
while (res->next())
{
    outfile << res->getInt(1) << endl;
}
What I want to do is a SELECT * and then export the full row to the file in one go, without specifying the type for each column, but I haven't been able to find something that does this.
I know that row-by-row exports are not really efficient for large sets, but I want to make this work before optimizing it.
Does anyone have any ideas on how to do this efficiently?

I don't think you will find something like this "in the box" with OCCI.
However, using the STL you can push the result of every iteration into a stringstream, and when res->next() returns false you can append the stringstream's contents to the file.
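A minimal sketch of that buffering idea, reusing the res and outfile from the question (and still reading only a single int column):
std::ostringstream buffer;              // requires <sstream>
while (res->next()) {
    buffer << res->getInt(1) << "\n";   // accumulate rows in memory
}
outfile << buffer.str();                // write the whole buffer to the file in one go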

I found that there is no way to do this without iterating over the metadata object at least once. Since I only need to do this once per query execution, I ended up writing the attribute types and column positions to a map and using that map within the result-set loop to read the data. Here's the code I used:
res = stmt->executeQuery();
vector<oracle::occi::MetaData> meta = res->getColumnListMetaData();
map<int, int> mapper;
for (size_t i = 0; i < meta.size(); i++) {
    mapper[i] = meta[i].getInt(oracle::occi::MetaData::ATTR_DATA_TYPE);
}
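To show how the map can then drive the export, here is a rough sketch of the result-set loop (not the exact code from my program, and it assumes the same outfile as in the question; the OCCI_SQLT_* type codes come from occiCommon.h, so verify which values ATTR_DATA_TYPE returns on your client before relying on them):
while (res->next()) {
    for (size_t i = 0; i < meta.size(); i++) {
        if (i > 0) outfile << "|";                    // field delimiter
        switch (mapper[i]) {
            case oracle::occi::OCCI_SQLT_NUM:         // numeric columns
                outfile << res->getDouble(i + 1);     // OCCI column indexes are 1-based
                break;
            default:
                outfile << res->getString(i + 1);     // OCCI converts most scalar types to string
        }
    }
    outfile << "\n";
}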

Related

How can I get the row view of data read from parquet file?

Example: let's say a table named user has id, name, email, phone, and is_active as attributes, and there are thousands of users in this table. I would like to read the details per user.
void ParquetReaderPlus::read_next_row(long row_group_index, long local_row_num)
{
    std::vector<int> columns_to_tabulate(this->total_row);
    for (int idx = 0; idx < this->total_row; idx++)
        columns_to_tabulate[idx] = idx;
    this->file_reader->set_num_threads(4);
    int rg = this->total_row_group;
    // Read into table as row group rather than the whole Parquet file.
    std::shared_ptr<arrow::Table> table;
    this->file_reader->ReadRowGroup(row_group_index, columns_to_tabulate, &table);
    auto rows = table->num_rows();
    // TODO
    // Now I am confused how to proceed from here
}
Any suggestions?
I am not sure whether converting the table with ColumnarTableToVector would work.
It's difficult to answer this question without knowing what you plan on doing with those details. A Table has a list of columns and each column (in Arrow-C++) has a type-agnostic array of data. Since the columns are type-agnostic there is not much you can do with them other than get the count and access the underlying bytes.
If you want to interact with the values then you will either need to know the type of a column ahead of time (and cast), have a series of different actions for each different type of data you might encounter (switch case plus cast), or interact with the values as buffers of bytes. One could probably write a complete answer for all three of those options.
You might want to read up a bit on the Arrow compute API (https://arrow.apache.org/docs/cpp/compute.html although the documentation is a bit sparse for C++). This API allows you to perform some common operations on your data (somewhat) regardless of type. For example, I see the word "tabulate" in your code snippet. If you wanted to sum up the values in a column then you could use the "sum" function in the compute API. This function follows the "have a series of different actions for each different type of data you might encounter" advice above and will allow you to sum up any numeric column.
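For illustration, a hedged sketch of what summing a column through the compute API might look like (the column name is made up; the key point is that Sum() dispatches on the column's actual numeric type, so the caller stays type-agnostic):
#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <iostream>

arrow::Status SumColumn(const std::shared_ptr<arrow::Table>& table,
                        const std::string& column_name) {
    std::shared_ptr<arrow::ChunkedArray> col = table->GetColumnByName(column_name);
    if (col == nullptr) return arrow::Status::KeyError("no such column: ", column_name);
    // Sum() works for any numeric column type without an explicit cast.
    ARROW_ASSIGN_OR_RAISE(arrow::Datum sum, arrow::compute::Sum(col));
    std::cout << column_name << " total: " << sum.scalar()->ToString() << std::endl;
    return arrow::Status::OK();
}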
As far as I know what you are trying to do isn't easy. You'd have to:
iterate through each row
iterate through each column
figure out the type of the column
cast the arrow::Array of the column to the underlying type (e.g. arrow::StringArray)
get the value for that column, convert it to string and append it to your output
This is further complicated by:
the fact that the columns are chunked (so iterating through rows isn't as simple)
the need to also deal with list and struct types.
It's not impossible, it's just a lot of code (but you'd only have to write it once); a rough sketch of the approach follows.
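This sketch (my own, untested against your data) handles only int64 and string columns and writes comma-separated rows to stdout; list, struct, and every other type would each need their own case:
#include <arrow/api.h>
#include <iostream>
#include <sstream>

arrow::Status dumpRows(const std::shared_ptr<arrow::Table>& table) {
    // Combine chunks first so each column is a single contiguous array.
    ARROW_ASSIGN_OR_RAISE(auto combined, table->CombineChunks());
    for (int64_t row = 0; row < combined->num_rows(); ++row) {
        std::ostringstream line;
        for (int col = 0; col < combined->num_columns(); ++col) {
            auto array = combined->column(col)->chunk(0);
            if (col > 0) line << ",";
            // Figure out the type of the column and cast to the matching array class.
            switch (array->type_id()) {
                case arrow::Type::INT64:
                    line << std::static_pointer_cast<arrow::Int64Array>(array)->Value(row);
                    break;
                case arrow::Type::STRING:
                    line << std::static_pointer_cast<arrow::StringArray>(array)->GetString(row);
                    break;
                default:
                    line << "?";  // add more cases (double, list, struct, ...) as needed
            }
        }
        std::cout << line.str() << "\n";
    }
    return arrow::Status::OK();
}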
Another option is to write that table to CSV in memory and print it:
arrow::Status dumpTable(const std::shared_ptr<arrow::Table>& table) {
    // Write the CSV into an in-memory buffer rather than a file.
    auto outputResult = arrow::io::BufferOutputStream::Create();
    ARROW_RETURN_NOT_OK(outputResult.status());
    std::shared_ptr<arrow::io::BufferOutputStream> output = outputResult.ValueOrDie();
    ARROW_RETURN_NOT_OK(arrow::csv::WriteCSV(*table, arrow::csv::WriteOptions::Defaults(), output.get()));
    // Finish() hands back the buffer, which can then be printed as a string.
    auto finishResult = output->Finish();
    ARROW_RETURN_NOT_OK(finishResult.status());
    std::cout << finishResult.ValueOrDie()->ToString();
    return arrow::Status::OK();
}

How to choose indexed assignment variable dynamically in SAS?

I am trying to build a custom transformation in SAS DI. This transformation will "act" on columns in an input data set, producing the desired output. For simplicity let's assume the transformation will use input_col1 to compute output_col1, input_col2 to compute output_col2, and so on up to some specified number of columns to act on (let's say 2).
In the Code Options section of the custom transformation users are able to specify (via prompts) the names of the columns to be acted on; for example, a user could specify that input_col1 should refer to the column named "order_datetime" in the input dataset, and either make a similar specification for input_col2 or else leave that prompt blank.
Here is the code I am using to generate the output for the custom transformation:
data cust_trans;
    set &_INPUT0;
    i=1;
    do while(i<3);
        call symputx('index',i);
        result = myfunc("&&input_col&index");
        output_col&index = result; /*what is proper syntax here?*/
        i = i+1;
    end;
run;
Here myfunc refers to a custom function I made using proc fcmp which works fine.
The custom transformation works fine if I do not try to take into account the variable number of input columns to act on (i.e. if I use "&&input_col&i" instead of "&&input_col&index" and just use the column result on the output table).
However, I'm having two issues with trying to make the approach more dynamic:
I get the following warning on the line containing
result = myfunc("&&input_col&index"):
WARNING: Apparent symbolic reference INDEX not resolved.
I do not know how to have the assignment to the desired output column happen dynamically; i.e., depending on the iteration of the do loop I'd like to assign the output value to the corresponding output column.
I feel confident that the solution to this must be well known amongst experts, but I cannot find anything explaining how to do this.
Any help is greatly appreciated!
You can't use macro variables that depend on data variables in this manner. Macro variables are resolved at compile time, not at run time.
So you either have to
%do i = 1 %to .. ;
which is fine if you're in a macro (it won't work outside of an actual macro), or you need to use an array.
data cust_trans;
    set &_INPUT0;
    array in[2] &input_col1 &input_col2; *or however you determine the input columns;
    array output_col[2];                 *automatically names the results;
    do i = 1 to dim(in);
        result = myfunc(in[i]);          *you quote the input - I cannot see what your function is doing, but it is probably wrong to do so;
        output_col[i] = result;          /*what is proper syntax here?*/
    end;
run;
That's the way you'd normally do that. I don't know what myfunc does, and I also don't know why you quote "&&input_col&index." when you pass it to it, but that would be a strange way to operate unless you want the name of the input column as text (and don't want to know what data is in that variable). If you do, then pass vname(in[i]) which passes the name of the variable as a character.

WEKA input of predictions in 10folds CSV output

I'm using WEKA Explorer to run a 10fold cross validation. I output the predictions to a CSV file. Because the 10fold approach mixes the order of the data, I do not know which specific data is correctly or incorrectly classified.
I mean, by looking at the CSV I do not know which specific 1 or 0 is classified as 1 or 0. Is there any way to see what is the classification result for every specific instance in test set for every fold? For example, it would be great if the CSV would record the ID of the instance being classified.
One alternative could be for me to implement the 10-fold approach manually; i.e., I could create the 10 ARFF files and then run on each of them a percentage split of 90/10 (and preserve order). This solution looks pretty elaborate, effort-expensive, and error-prone.
Thanks for your help!
To do that you need to do the following for every fold:
double[] res = new double[testSet.numInstances()];
for (int j = 0; j < testSet.numInstances(); j++) {
    res[j] = classifier.classifyInstance(testSet.get(j));
}
Now the res array holds the classification result for every instance in the test set. You can use this information as you want.
For example, you can print the attributes of each instance (e.g. if the attributes are strings you can print them, before adding the filter, using testSet.get(j).stringValue(PositionOfAttributeYouWantToPrint)) followed by the classification result.
Note that if the classification result is a nominal value you can print it using this:
testSet.classAttribute().value((int) res[j])

hdf5 multiple extensible tables

I am analysing a huge number of files to strip out the important statistical information. For every analysed file, the analysis program creates approximately 3000 double arrays of length n (approx. 100), each together with a string naming the content of the respective array. I want to write the results into an HDF5 file, where each array is written into a table whose name is the respective string. For that I use the following function:
#include "hdf5.h"
#include "hdf5_hl.h"
hid_t file_id;
hsize_t dims[RANK]={1,n};
herr_t status;
....
void hdf5_write ( double& array , string arrayname )
{
const char * tablename = arrayname.c_str();
status = H5LTmake_dataset(file_id,tablename,RANK,dims,H5T_NATIVE_DOUBLE,array);
}
This works fine for the first file; however, when analysing multiple files one after another, the existing tables are simply overwritten by the new arrays, whereas I want the new arrays to be appended to the already existing tables. Is there an HDF5 function for that case?
I'm afraid you can't append using the high level (H5LT) interface.
Here is a complete example using the low level interface. It is much more complex but it gives you total control.
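As a hedged sketch of what that low-level pattern typically looks like (names such as create_extensible and append_row are made up for illustration, and error checking is omitted): create each dataset chunked with an unlimited first dimension, then extend it and write a hyperslab each time a new array arrives.
// Sketch: create an extensible (unlimited-rows) dataset once per table name.
hid_t create_extensible(hid_t file_id, const char* tablename, hsize_t n)
{
    hsize_t init_dims[2]  = {0, n};
    hsize_t max_dims[2]   = {H5S_UNLIMITED, n};
    hsize_t chunk_dims[2] = {1, n};
    hid_t space = H5Screate_simple(2, init_dims, max_dims);
    hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(plist, 2, chunk_dims);   // chunking is required for extensible datasets
    hid_t dset  = H5Dcreate2(file_id, tablename, H5T_NATIVE_DOUBLE,
                             space, H5P_DEFAULT, plist, H5P_DEFAULT);
    H5Pclose(plist);
    H5Sclose(space);
    return dset;
}

// Sketch: grow the dataset by one row and write the new array into that slab.
void append_row(hid_t dset, const double* row, hsize_t n, hsize_t cur_rows)
{
    hsize_t new_dims[2] = {cur_rows + 1, n};
    H5Dset_extent(dset, new_dims);        // extend the dataset by one row
    hid_t filespace = H5Dget_space(dset);
    hsize_t start[2] = {cur_rows, 0};
    hsize_t count[2] = {1, n};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, H5P_DEFAULT, row);
    H5Sclose(memspace);
    H5Sclose(filespace);
}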
Or if you think this is overkill, you can ask yourself if you really need a single large dataset vs multiple small ones. Depending on the application you have in mind, multiple datasets might simply be a better design.

Is it possible to detect and handle string collisions among grouped values when grouping in Hadoop Pig?

Assuming I have lines of data like the following that show user names and their favorite fruits:
Alice\tApple
Bob\tApple
Charlie\tGuava
Alice\tOrange
I'd like to create a pig query that shows the favorite fruit of each user. If a user appears multiple times, then I'd like to show "Multiple". For example, the result with the data above should be:
Alice\tMultiple
Bob\tApple
Charlie\tGuava
In SQL, this could be done something like this (although it wouldn't necessarily perform very well):
select user, case when count(fruit) > 1 then 'Multiple' else max(fruit) end
from FruitPreferences
group by user
But I can't figure out the equivalent PigLatin. Any ideas?
Write a "Aggregate Function" Pig UDF (scroll down to "Aggregate Functions"). This is a user-defined function that takes a bag and outputs a scalar. So basically, your UDF would take in the bag, determine if there is more than one item in it, and transform it accordingly with an if statement.
I can think of a way of doing this without a UDF, but it is definitely awkward. After your GROUP, use SPLIT to split your data set into two: one in which the count is 1 and one in which the count is more than one:
SPLIT grouped INTO one IF COUNT(fruit) == 1, more IF COUNT(fruit) > 1;
Then, separately use FOREACH ... GENERATE on each to transform it:
one = FOREACH one GENERATE name, MAX(fruit); -- hack using MAX to get the item
more = FOREACH more GENERATE name, 'Multiple';
Finally, union them back:
out = UNION one, more;
I haven't really found a better way of handling the same data set in two different ways based on some conditional, like you want. I typically do some sort of split/recombine like I did here. I believe Pig will be smart and make a plan that doesn't use more than 1 M/R job.
Disclaimer: I can't actually test this code at the moment, so it may have some mistakes.
Update:
In looking harder, I was reminded of the bincond operator and I think that will work here.
b = FOREACH a GENERATE name, (COUNT(fruit) == 1 ? MAX(fruit) : 'Multiple');