How to read an ORC file in Informatica

How can I read an ORC file in Informatica PowerCenter?
I tried reading it as a flat file and through ODBC, but only junk characters are returned.

It seems you will need the PowerExchange adapter for HDFS.
Please refer to the PowerExchange for HDFS Overview for detailed information and
to ORC Data Types and Transformation Data Types for ORC-related details.
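
As background on the junk characters: ORC is a binary columnar format, not delimited text, so a flat-file reader has nothing sensible to split on. A minimal sketch, assuming you can get at the raw file bytes outside Informatica, that checks for the three-byte "ORC" marker at the start of a file (the helper name and the sample payload are made up for illustration):

```python
import io

ORC_MAGIC = b"ORC"  # ORC files begin with these three bytes


def looks_like_orc(stream) -> bool:
    """Return True if the binary stream starts with the ORC magic bytes."""
    return stream.read(len(ORC_MAGIC)) == ORC_MAGIC


# Stand-in for open("some_file.orc", "rb"); the bytes after the marker are made up.
sample = io.BytesIO(b"ORC\x08\x03\x10\x03")
print(looks_like_orc(sample))
```

If the check succeeds, the file needs an ORC-aware reader such as the adapter mentioned above rather than a flat-file or ODBC source.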

Related

JDBC Connectivity for Data Source in AWS Glue

We are trying to bring Oracle table catalogs into Glue, but we are unable to read the data from the source.
We have tried every possibility for the include path parameter, but we still cannot bring the data in.
Has anyone tried Oracle as a JDBC data store for AWS Glue? Please help us fix the issue.

How to export an Oracle DB table with complex CLOB data into BigQuery through batch upload?

We are currently using Apache Sqoop once daily to export an Oracle DB table containing a CLOB column into HDFS. As part of this, we first map the CLOB column to a Java String (using --map-column-java) and save the imported data in Parquet format. We have this scheduled as an Oozie workflow.
There is a plan to move from Apache Hive to BigQuery. I am not able to find a way to get this table into BigQuery and would like help on the best approach to get this done.
If we go with real-time streaming from Oracle DB into BigQuery using Google Datastream, can you tell me whether the CLOB column will be streamed correctly? It contains some malformed XML data (close to XML structure, but with some discrepancies in obeying that structure).
Another option I read about was to extract the table as a CSV file, transfer it to GCS, and have the BigQuery table refer to it there. But since my data in the CLOB column is very large and wild, with multiple commas and special characters in between, I think there will be issues with parsing or exporting. Are there any options to do it in Parquet or ORC format?
The preferred approach is a scheduled batch upload performed daily from Oracle to BigQuery. I would appreciate any input on how to achieve this.
You can convert CLOB data from Oracle DB to a desired format such as ORC, Parquet, TSV, or Avro files through Enterprise Flexter.
Also, you can refer to this on how to ingest on-premises Oracle data with Google Cloud Dataflow via JDBC, using the Hybrid Data Pipeline On-Premises Connector.
For your other query, about moving from Apache Hive to BigQuery:
The fastest way to import to BQ is using GCP resources. Dataflow is a scalable solution to read and write. Dataproc is also another option that is more flexible and you can use more open source stacks to read from the Hive cluster.
You can also use this Dataflow template, which would require a connection to be established directly between the Dataflow workers and the Apache Hive nodes.
There is also a plugin for moving data from Hive into BigQuery which utilises GCS as a temporary storage and uses BigQuery Storage API to move data to BigQuery.
You can also use Cloud SQL to migrate your Hive data to BigQuery.
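
On the CSV concern raised in the question: commas, quotes, and line breaks inside a CLOB value are only a problem when the field is not quoted. A minimal sketch using Python's standard csv module (the sample rows are made up) that writes every field quoted and confirms the value survives a round trip. If such a file were loaded into BigQuery, the load job would presumably also need its option for allowing quoted newlines enabled; staying with Parquet, which the existing Sqoop job already produces and BigQuery load jobs accept, avoids delimiter handling altogether.

```python
import csv
import io

# Hypothetical rows: an id plus a CLOB-like value containing commas,
# double quotes, a line break, and not-quite-valid XML.
rows = [
    (1, '<doc attr="x">a, b, c\n<unclosed>text, more</doc>'),
    (2, "plain value"),
]

# Write with every field quoted; embedded quotes are doubled by the writer.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read it back: each CLOB value comes back as one intact field.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(len(parsed), parsed[0][1] == rows[0][1])
```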

Datafusion load BQ with XML 2003 worksheet data

I have a system exporting data as an XML 2003 Worksheet. I need to load it into BigQuery through Data Fusion or any other process using GCP resources.
So, is it possible to complete this with Data Fusion?
I have followed the process for XML transformation in https://www.youtube.com/watch?v=e-5K4cxwGrc&feature=youtu.be. So far I have reached a point where the header and data rows appear in different rows but in the same column. I am not able to parse it any further (using Wrangler) into individual columns, as it just keeps isolating the JSON key:value pairs in different rows but the same column.
As I am new to Data Fusion, I would appreciate some detailed guidance.
This can be implemented using Data Fusion.
Basically, once you have the file (either uploaded directly or read through a source connection) and have applied the XML to JSON transformation, you can add a JSON parsing operation so that it is parsed into columns [1]. This adds another transformation step in Wrangler.
Additionally, I would suggest that you take a look at the documentation for Data Fusion in GCP which is very self-explanatory [2].
[1]- Column transformations -> Parse -> JSON
[2]- https://cloud.google.com/data-fusion/docs
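
Since the question also allows "any other process", here is a minimal sketch, outside Data Fusion, of the step that pairs the header row with each data row. It assumes the export follows the usual Excel 2003 XML Spreadsheet layout (Workbook > Worksheet > Table > Row > Cell > Data) and that the first row holds the column names; the sample document and function names are made up, and cells that skip columns through ss:Index are not handled.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by the Excel 2003 XML Spreadsheet format.
NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}

# Hypothetical sample; a real export would be read from a file instead.
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Sheet1">
    <Table>
      <Row><Cell><Data ss:Type="String">id</Data></Cell><Cell><Data ss:Type="String">name</Data></Cell></Row>
      <Row><Cell><Data ss:Type="Number">1</Data></Cell><Cell><Data ss:Type="String">alpha</Data></Cell></Row>
      <Row><Cell><Data ss:Type="Number">2</Data></Cell><Cell><Data ss:Type="String">beta</Data></Cell></Row>
    </Table>
  </Worksheet>
</Workbook>"""


def worksheet_to_records(xml_text):
    """Return one dict per data row, keyed by the values of the first row."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.iterfind(".//ss:Table/ss:Row", NS):
        cells = []
        for cell in row.findall("ss:Cell", NS):
            data = cell.find("ss:Data", NS)
            cells.append(data.text if data is not None and data.text else "")
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, values)) for values in body]


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(record) for record in records)


records = worksheet_to_records(SAMPLE)
print(to_ndjson(records))
```

Newline-delimited JSON is one of the formats BigQuery load jobs accept, which is why the sketch ends there; all values stay strings, so type conversion is left to the table schema or a later step.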

Big Query can't query some csvs in Cloud Storage bucket

I created a permanent BigQuery table that reads some CSV files from a Cloud Storage bucket sharing the same prefix name (filename*.csv) and the same schema.
Some CSVs, however, make BigQuery queries fail with a message like the following: "Error while reading table: xxxx.xxxx.xxx, error message: CSV table references column position 5, but line starting at position:10 contains only 2 columns."
By moving the CSVs out of the bucket one by one, I identified the one responsible.
This CSV file doesn't have 10 lines...
I found the ticket "BigQuery error when loading csv file from Google Cloud Storage", so I thought the issue was an empty line at the end. But other CSVs in my bucket have one too, so this can't be the reason.
On the other hand, this CSV is the only one with content type text/csv; charset=utf-8, all the others being text/csv, application/vnd.ms-excel, or application/octet-stream.
Furthermore, after downloading this CSV to my local Windows machine and uploading it again to Cloud Storage, the content type is automatically converted to application/vnd.ms-excel.
After that, even with the missing line, BigQuery can query the permanent table based on filename*.csv.
Is it possible that BigQuery has issues querying CSVs with UTF-8 encoding, or is it just a coincidence?
Use Google Cloud Dataprep to load your CSV file. Once the file is loaded, analyze the data and clean it if required.
Once all the rows are cleaned, you can then sink that data into BigQuery.
Dataprep is a GUI-based ETL tool, and it runs a Dataflow job internally.
Do let me know if any more clarification is required.
To clarify the issue: the CSV file had gzip as its encoding, which was why BigQuery did not interpret it as a CSV file.
According to the documentation, BigQuery expects CSV data to be UTF-8 encoded:
"encoding": "UTF-8"
In addition, since this issue is related to the metadata of the files in GCS, you can edit the metadata directly from the Console.
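
To narrow down a file like this one, a minimal diagnostic sketch, assuming the raw object bytes are already in hand (how they are fetched is left out, and the function name is made up). It checks for the gzip magic bytes and then lists the rows that have fewer columns than the schema expects:

```python
import csv
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def inspect_csv_bytes(raw, expected_columns):
    """Return (was_gzipped, short_rows) for the raw bytes of one file.

    short_rows lists (row_number, column_count) for every non-empty row
    that has fewer columns than expected.
    """
    was_gzipped = raw.startswith(GZIP_MAGIC)
    if was_gzipped:
        raw = gzip.decompress(raw)
    text = raw.decode("utf-8")
    short_rows = []
    reader = csv.reader(io.StringIO(text, newline=""))
    for number, row in enumerate(reader, start=1):
        if row and len(row) < expected_columns:
            short_rows.append((number, len(row)))
    return was_gzipped, short_rows


# Made-up content: three columns expected, the third row has only two.
plain = b"a,b,c\n1,2,3\n4,5\n"
print(inspect_csv_bytes(plain, 3))
print(inspect_csv_bytes(gzip.compress(plain), 3))
```

A file that reports as gzipped while its name and content type suggest plain CSV would be a candidate for the metadata fix described above.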

Re-parsing Blob data stored in HDFS imported from Oracle by Sqoop

Using Sqoop, I've successfully imported a few rows from a table that has a BLOB column. Now the part-m-00000 file contains all the records, along with the BLOB field, as CSV.
Questions:
1) As per the doc, knowledge of the Sqoop-specific format can help to read those BLOB records.
So, what does the Sqoop-specific format mean?
2) Basically, the BLOB is a .gz file of a text file containing some float data. These .gz files are stored in the Oracle DB as BLOBs and imported into HDFS using Sqoop. So how can I get those float data back from the HDFS file?
Any sample code will be of very great use.
I see these options.
1) Sqoop import from Oracle directly to a Hive table with a binary data type. This option may limit the processing capabilities outside Hive, such as MR, Pig, etc.; that is, you may need to know how the BLOB gets stored in Hive as binary. This is the same limitation that you described in your question 1.
2) Sqoop import from Oracle to the Avro, sequence, or ORC file formats, which can hold binary. You should be able to read this by creating a Hive external table on top of it. You can write a Hive UDF to decompress the binary data. This option is more flexible, as the data can be processed easily with MR as well, especially in the Avro and sequence file formats.
Hope this helps. How did you resolve it?
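
The decompression step itself is small. A minimal Python sketch of the logic (not a Hive UDF; the same steps would go inside one), assuming the BLOB bytes for one record are already available and that the text inside holds whitespace-separated floats. The hex helper additionally assumes that a text-mode import wrote the BLOB as hex digits separated by spaces; verify that against your own part-m-* files. Both function names are made up.

```python
import gzip


def floats_from_gzip_blob(blob):
    """Decompress a gzip BLOB and parse whitespace-separated floats."""
    text = gzip.decompress(blob).decode("utf-8")
    return [float(token) for token in text.split()]


def blob_from_hex(hex_field):
    """Turn a hex-encoded field (e.g. '1f 8b 08 ...') back into bytes.

    Assumption: the import wrote the BLOB as hex digits, optionally
    separated by spaces.
    """
    return bytes.fromhex(hex_field)


# Made-up record: gzip a small text payload, then hex-encode it the
# way the assumed text format would.
original = gzip.compress(b"1.5 2.25\n-3.0\n")
hex_field = " ".join(f"{byte:02x}" for byte in original)

values = floats_from_gzip_blob(blob_from_hex(hex_field))
print(values)  # [1.5, 2.25, -3.0]
```

With Avro or sequence files the field arrives as bytes already, so only floats_from_gzip_blob would be needed.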