I need to read a HIVE table using Informatica and then write the data, after some transformations, to an MS SQL table.
Can anyone please let me know which driver / connector is required to connect to Apache HIVE from Informatica? Is there a specific Informatica version from which this is supported?
Informatica Big Data Edition (BDE) supports Hive both as a source and target.
More information: BDE User Guide
We are trying to bring Oracle table catalogs into AWS Glue, but we are unable to read the data from the source.
We have tried every possibility we could think of for the include path parameter (a sketch of how the JDBC target is typically specified follows below), but we still cannot bring in the data.
Has anyone tried Oracle as a JDBC data store for AWS Glue? Please help us fix the issue.
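For reference, here is a minimal sketch of how a JDBC crawler target is usually wired up with boto3. All names (connection, role, catalog database, the ORCL/HR/% include path) are placeholders rather than values from any real setup, and the Glue JDBC connection itself is assumed to already exist and reach the Oracle host.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="oracle-hr-crawler",
    Role="AWSGlueServiceRole-oracle",      # IAM role allowed to use the Glue connection
    DatabaseName="oracle_catalog",         # Glue Data Catalog database to populate
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "oracle-connection",
                # Include path format is <database>/<schema>/<table>;
                # % matches every table in the HR schema.
                "Path": "ORCL/HR/%",
            }
        ]
    },
)

glue.start_crawler(Name="oracle-hr-crawler")

If the crawler still finds no tables, the usual suspects are an include path that does not follow the database/schema/table pattern and a connection whose VPC, subnet, or security group cannot actually reach the Oracle listener.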
We are currently using Apache Sqoop once daily to export an Oracle DB table containing a CLOB column into HDFS. As part of this we first map the CLOB column to a Java string (using --map-column-java) and save the imported data in Parquet format. We have this scheduled as an Oozie workflow.
There is a plan to move from Apache Hive to BigQuery. I am not able to find a way to get this table into BigQuery and would like help on the best approach to get this done.
If we go with real-time streaming from the Oracle DB into BigQuery using Google Datastream, can you tell me whether the CLOB column will be streamed correctly? It contains some malformed XML data (close to an XML structure, but with some discrepancies in obeying that structure).
Another option I read about was to extract the table as a CSV file, transfer it to GCS, and have the BigQuery table refer to it there. But since the data in my CLOB column is very large and wild, with multiple commas and special characters in between, I think there will be issues with parsing or exporting. Are there any options to do this in Parquet or ORC formats?
The preferred approach is a scheduled batch upload performed daily from Oracle to BigQuery. I would appreciate any inputs on how to achieve this.
You can convert CLOB data from an Oracle DB to the desired format, such as ORC, Parquet, TSV, or Avro files, with Enterprise Flexter.
Also, you can refer to this guide on ingesting on-premises Oracle data with Google Cloud Dataflow via JDBC, using the Hybrid Data Pipeline On-Premises Connector.
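As a rough, untested sketch of the JDBC route with Beam's Python SDK (independent of the Hybrid Data Pipeline product), the pipeline below reads the Oracle table through the cross-language ReadFromJdbc transform and writes to BigQuery. The URL, credentials, column names (id, name, payload for the CLOB) and the destination table are all placeholders, and the Oracle JDBC driver has to be available to the Java expansion service behind ReadFromJdbc.

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder options; on Dataflow add --runner=DataflowRunner, --project,
# --region and --temp_location.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    rows = p | "ReadOracle" >> ReadFromJdbc(
        table_name="MY_SCHEMA.MY_TABLE",
        driver_class_name="oracle.jdbc.driver.OracleDriver",
        jdbc_url="jdbc:oracle:thin:@//db-host:1521/ORCLPDB1",
        username="etl_user",
        password="etl_password",
    )
    (
        rows
        # Hypothetical columns: id, name, and payload (the CLOB, read as a string).
        | "ToDict" >> beam.Map(lambda r: {"id": r.id, "name": r.name, "payload": r.payload})
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="id:INTEGER,name:STRING,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )

Since the CLOB comes across JDBC as a string, it is mapped here to a plain STRING column on the BigQuery side, which matches the --map-column-java handling already used in the Sqoop job.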
For your other query about moving from Apache Hive to BigQuery:
The fastest way to import into BigQuery is to use GCP resources. Dataflow is a scalable solution for reading and writing. Dataproc is another, more flexible option that lets you use more open-source stacks to read from the Hive cluster.
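Since the Hive table in question is already stored as Parquet (from the Sqoop import described above), one hedged variant of the Dataflow option is to copy those Parquet files to GCS and read them straight into a Beam Python pipeline; the paths, destination table, and two-column schema below are assumptions, not values from your environment.

import apache_beam as beam
from apache_beam.io.parquetio import ReadFromParquet
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder locations; assumes the Hive table's Parquet files were copied from
# HDFS to GCS (for example with hadoop distcp) before this pipeline runs.
options = PipelineOptions()  # --runner=DataflowRunner, --project, --temp_location, ...

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadParquet" >> ReadFromParquet("gs://my-bucket/hive/my_table/*.parquet")
        # ReadFromParquet yields one dict per record, keyed by column name, so any
        # cleanup of the CLOB-derived string column can be inserted at this point.
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="id:INTEGER,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )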
You can also use this Dataflow template, which would require a connection to be established directly between the Dataflow workers and the Apache Hive nodes.
There is also a plugin for moving data from Hive into BigQuery which utilises GCS as temporary storage and uses the BigQuery Storage API to move the data to BigQuery.
You can also use Cloud SQL to migrate your Hive data to BigQuery.
In the Snowflake web interface, the INFORMATION_SCHEMA is visible and accessible. When logging in to PowerBI with exactly the same user, the INFORMATION_SCHEMA is not shown. The PowerBI report should contain data from the INFORMATION_SCHEMA; how can I make it visible in PowerBI?
Are you using PowerBI's native connector to Snowflake? Have you tried the ODBC connector option instead? I ask because I've seen differences between native connectors and the ODBC option, and it's possible the native connector has some limitations. Note that the PowerBI team maintains/owns the native connector to Snowflake, so you can follow up with them.
There's also the potential option of using the ACCOUNT_USAGE views instead of the INFORMATION_SCHEMA, depending on what the user is looking for.
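If it is only metadata the report needs, a hedged alternative is to point it at the SNOWFLAKE.ACCOUNT_USAGE views (which lag behind real time and require the role to have been granted IMPORTED PRIVILEGES on the SNOWFLAKE database). A quick way to confirm that the same user and role can actually see them, sketched with the Snowflake Python connector and placeholder credentials:

import snowflake.connector

# Placeholder credentials; use the same user/role that PowerBI connects with.
conn = snowflake.connector.connect(
    account="my_account",
    user="powerbi_user",
    password="********",
    role="POWERBI_ROLE",
    warehouse="REPORTING_WH",
)

cur = conn.cursor()
# ACCOUNT_USAGE.TABLES is roughly the account-wide counterpart of
# INFORMATION_SCHEMA.TABLES; DELETED IS NULL filters out dropped tables.
cur.execute(
    "SELECT table_catalog, table_schema, table_name "
    "FROM snowflake.account_usage.tables "
    "WHERE deleted IS NULL LIMIT 10"
)
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()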
Hive partitioned tables have a folder structure with the partition date as the folder name. I have explored loading externally partitioned tables directly into BigQuery, which is possible.
What I would like to know is whether this is also possible with Dataflow, since I am going to run some feature transforms with Dataflow before loading the data into BigQuery. What I have found is that if I add the partition date as a column, then partitioning is possible, but I am looking for a direct method where I wouldn't add the column during the transforms and would instead partition directly while loading the data into BigQuery.
Is such a thing possible?
Hive partitioning is a beta feature in BigQuery, released on Oct 31st, 2019. The latest version of the Apache Beam SDK supported by Dataflow is 2.16.0, which was released on Oct 7th, 2019. At the moment, neither the Java nor the Python SDK supports this feature directly. So, if you want to use it from Dataflow, you could try calling the BigQuery API directly.
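To make that last suggestion concrete, here is a minimal sketch of calling the BigQuery API directly through the google-cloud-bigquery Python client. The bucket layout (partition_date=... folders), the Parquet format, and the table names are assumptions, and HivePartitioningOptions needs a client library recent enough to expose this beta feature.

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Describe the Hive-style layout, e.g. gs://my-bucket/my_table/partition_date=2019-10-31/...
hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "AUTO"                      # infer partition_date from the folder names
hive_opts.source_uri_prefix = "gs://my-bucket/my_table/"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    hive_partitioning=hive_opts,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/my_table/*",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish

With this split, the Dataflow job can still do the feature transforms and write the transformed data back to a Hive-style folder layout on GCS, and the load step above runs afterwards, so the partition column never has to be added inside the transforms.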
We are working on an ETL. How do we read data from a PostgreSQL database using streams in Data Analytics Server, perform some operations on the streams, and insert the manipulated data into another PostgreSQL database on a schedule? Please share the procedure to follow.
Actually, you don't need to publish data from your PostgreSQL server. Using WSO2 Data Analytics Server (DAS) you can pull data from your database, do the analysis, and finally push the results back to the PostgreSQL server. DAS has a special connector called "CarbonJDBC", and using that connector you can easily do this.
The current version of the "CarbonJDBC" connector supports the following database management systems:
MySQL
H2
MS SQL
DB2
PostgreSQL
Oracle
You can use the following queries to pull data from your PostgreSQL database and populate a Spark table. Once the Spark table is populated with data, you can start your data analysis tasks.
create temporary table <temp_table> using CarbonJDBC options (dataSource "<datasource name>", tableName "<table name>");
select * from <temp_table>;
insert into / overwrite table <temp_table> <some select statement>;
For more information regarding the "CarbonJDBC" connector, please refer to the following blog post [1].
[1]. https://pythagoreanscript.wordpress.com/2015/08/11/using-the-carbon-spark-jdbc-connector-for-wso2-das-part-1/