Is there a way to specify the Parquet version using AWS Data Wrangler (aws-data-wrangler)?

We are writing Parquet files which seem to default to format version 1.
Teradata NOS complains about these files with a
"Native Object Store user error: Unsupported file version"
How can we specify the Parquet version to write with AWS Data Wrangler / SDK for pandas?
Not sure what to try... maybe we need to find an alternative writer?

You can set the version using pyarrow_additional_kwargs:
wr.s3.to_parquet(
    ...
    pyarrow_additional_kwargs={"version": "2.6"}
)
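A fuller sketch of the same call (the DataFrame and bucket path are placeholders; as described above, awswrangler hands pyarrow_additional_kwargs through to the pyarrow writer):
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3]})  # placeholder data

wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-prefix/",  # hypothetical bucket/prefix
    dataset=True,
    # "2.6" selects the newer Parquet format; "1.0" forces the older,
    # most widely supported format if the reader still complains.
    pyarrow_additional_kwargs={"version": "2.6"},
)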

Related

How to write file-wide metadata into Parquet files with Apache Parquet in C++

I use Apache Parquet to create Parquet tables with process information of a machine, and I need to store file-wide metadata (machine ID and machine name).
It is stated that Parquet files are capable of storing file-wide metadata, however I couldn't find anything about it in the documentation.
There is another Stack Overflow post that explains how it is done with pyarrow. As far as that post tells, I need some kind of key-value pair (maybe map<string, string>) and to add it to the schema somehow.
I found a class inside the Parquet source code called parquet::FileMetaData that may be used for this purpose, however there is nothing about it in the docs.
Is it possible to store file-wide metadata with C++?
Currently I am using the stream_reader_writer example for writing Parquet files.
You can pass the file-level metadata when calling parquet::ParquetFileWriter::Open; see the source code here.
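For comparison, the pyarrow approach the question refers to attaches the same kind of key-value pairs through the schema; a minimal sketch with hypothetical keys and file name (the C++ Open call accepts an analogous key-value metadata argument):
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"temperature": [20.5, 21.0]})  # placeholder process data

# The key-value pairs end up in the file footer's metadata.
table = table.replace_schema_metadata({"machine_id": "42", "machine_name": "press-01"})
pq.write_table(table, "process.parquet")

# Read the metadata back to verify it was stored.
print(pq.read_schema("process.parquet").metadata)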

How can I figure out why BigQuery is rejecting my parquet file?

When trying to upload a parquet file into BigQuery, I get this error:
Error while reading data, error message: Read less values than expected from: prod-scotty-45ecd3eb-e041-450c-bac8-3360a39b6c36; Actual: 0, Expected: 10
I don't know why I get the error.
I tried inspecting the file with parquet-tools and it prints the file contents without issues.
The parquet file is written using the parquetjs JavaScript library.
Update: I also filed this in the BigQuery issue tracker here: https://issuetracker.google.com/issues/145797606
It turns out BigQuery doesn't support the latest version of the parquet format. I changed the output not to use the version 2 format and BigQuery accepted it.
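For anyone hitting the same limit while writing with pyarrow instead of parquetjs, a minimal sketch of forcing the older format version (hypothetical table and file name):
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3]})  # placeholder data

# version="1.0" avoids the format v2 features that BigQuery rejected at the time.
pq.write_table(table, "upload_to_bq.parquet", version="1.0")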
From the error message it seems like a rogue line break might be causing this.
We use Dataprep to clean up our data and it works quite well. If I am not mistaken, it's also Google's recommended method of cleaning up / sanitising data for BigQuery.
https://cloud.google.com/dataprep/docs/html/BigQuery-Data-Type-Conversions_102563896

How to load training data into OpenCV from UCI?

I have a character/font dataset from the UCI repository:
https://archive.ics.uci.edu/ml/datasets/Character+Font+Images
Take any CSV file as an example, for instance 'AGENCY.csv'. I am struggling to load it into OpenCV using the C++ functions. It seems that the structure of the dataset is quite different from what is normally assumed by the function
cv::ml::TrainData::loadFromCSV
Any ideas on how to do this neatly, or do I need to pre-process the CSV files directly?
You can try to load the CSV file like this:
CvMLData data;
data.read_csv(filename);
For details on the OpenCV ML CSV loader, refer to this page:
http://www.opencv.org.cn/opencvdoc/2.3.1/html/modules/ml/doc/mldata.html
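If pre-processing the CSVs turns out to be the easier route, one option is to build the training matrices outside of OpenCV and hand them to the ML module directly. A minimal Python sketch, assuming the label column is called m_label and the pixel columns are named r0c0 ... r19c19 as in the UCI font CSVs (adjust to the real layout):
import cv2
import numpy as np
import pandas as pd

df = pd.read_csv("AGENCY.csv")

# Assumption: pixel columns start with "r" (r0c0 ... r19c19) and the
# character code lives in "m_label"; adjust these names if they differ.
pixel_cols = [c for c in df.columns if c.startswith("r")]
samples = df[pixel_cols].to_numpy(dtype=np.float32)
responses = df["m_label"].to_numpy(dtype=np.int32)

# The matrices can then feed any OpenCV ML model, e.g. k-NN.
knn = cv2.ml.KNearest_create()
knn.train(samples, cv2.ml.ROW_SAMPLE, responses)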

How do you convert HDF5 files into a format that is readable by SAS Enterprise Miner (sas7bdat)?

I have a subset of the dataset called the 'Million Song Dataset', available on the website (http://labrosa.ee.columbia.edu/millionsong/), on which I would like to perform data mining operations in SAS Enterprise Miner (13.2).
The subset I have downloaded contains 10,000 files and they are all in HDF5 format.
How do you convert HDF5 files into a format that is readable by SAS Enterprise Miner (sas7bdat)?
On Windows there is an ODBC driver for HDF5. If you have SAS/ACCESS to ODBC then you can use that to read the file.
I don't think it's feasible to do this directly, as HDF5 seems to be a binary file format. You might be able to use another application to convert HDF5 to a plain text format and then write SAS code to import that.
I think some of the other files on this page might be easier to import:
http://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset
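As an illustration of that conversion route, here is a short Python sketch that flattens one table from each HDF5 file into a CSV which SAS (e.g. PROC IMPORT) can then turn into a sas7bdat; the /analysis/songs path and the folder name are assumptions based on the Million Song subset layout, so adjust them to your files:
import glob
import h5py
import pandas as pd

rows = []
for path in glob.glob("MillionSongSubset/**/*.h5", recursive=True):
    with h5py.File(path, "r") as f:
        # Assumption: each file holds a one-row compound table at /analysis/songs.
        rec = f["/analysis/songs"][0]
        rows.append({name: rec[name] for name in rec.dtype.names})

# Plain text output that SAS can import and save as sas7bdat.
pd.DataFrame(rows).to_csv("songs.csv", index=False)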

How can I read datasets in Weka?

I want to use some of the datasets available on the Weka website to perform some experiments with neural networks.
What do I have to do to read the data?
I downloaded the datasets and they were saved as .arff.txt, so I deleted the .txt extension to leave only .arff. I then used this file as an input, but an error occurs.
What is the right way to read the data?
Do I have to write code?
Please help me.
Thank you.
I'm using Weka 3.6.6 and coc81.arff opens just fine. You are using Weka 3.7.x, which is the development branch of Weka. I suggest that you download 3.6.6 or 3.6.7 (the latest stable release) and try to open the file again.
There is also another simple approach:
open your dataset file in Excel (in my case MS Excel 2010) and format the fields by type,
save it as 'csv',
then reload that CSV file in the Weka Explorer and save it to your local drive in ARFF format.
Maybe this helps.
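If you'd rather script that CSV-to-ARFF step instead of going through Excel, a minimal Python sketch (hypothetical file names; it declares every attribute as numeric, so change the @ATTRIBUTE lines for nominal columns such as a class label):
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical input

with open("dataset.arff", "w") as out:
    out.write("@RELATION dataset\n\n")
    # Assumption: all columns are numeric; a nominal column would need
    # something like "@ATTRIBUTE class {yes,no}" instead.
    for col in df.columns:
        out.write(f"@ATTRIBUTE {col} NUMERIC\n")
    out.write("\n@DATA\n")
    df.to_csv(out, header=False, index=False)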