I was looking for ways to move data onto an HDFS system. Can Apache Sqoop be used to pull/extract data from an external REST service?
My favorite way to pull data from a REST service:
curl http:// | hdfs dfs -put - /my/hdfs/directory
From http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
So it doesn't support importing data from a REST service.
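For contrast, the kind of job Sqoop is designed for looks roughly like this; a minimal sketch, where the connection string, credentials, table, and target directory are all placeholders:
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --username dbuser -P \
    --table employees \
    --target-dir /data/employees \
    -m 4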
Related
We are currently using Apache Sqoop once daily to import an Oracle DB table containing a CLOB column into HDFS. As part of this we first map the CLOB column to a Java String (using --map-column-java) and have the imported data saved in Parquet format. We have this scheduled as an Oozie workflow.
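For context, the current job is roughly of this shape; a sketch only, with the connection string, table, and column names as placeholders:
$ sqoop import --connect jdbc:oracle:thin:@//oracledb.example.com:1521/ORCL \
    --username dbuser -P \
    --table MY_TABLE \
    --map-column-java CLOB_COL=String \
    --as-parquetfile \
    --target-dir /data/my_table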
There is a plan to move from Apache Hive to BigQuery. I am not able to find a way to get this table into BigQuery and would like help on the best approach.
If we go with real-time streaming from Oracle DB into BigQuery using Google Datastream, can you tell me whether the CLOB column will get streamed correctly? It contains some malformed XML data (close to an XML structure, but with possible discrepancies in conforming to it).
Another option I read about was to extract the table as a CSV file, transfer it to GCS, and have the BigQuery table refer to it there. But since the data in the CLOB column is very large and messy, with multiple commas and special characters in between, I think there will be parsing or export issues. Are there any options to do this in Parquet or ORC format?
The preferred approach is a scheduled batch upload performed daily from Oracle to BigQuery. I would appreciate any inputs on how to achieve this.
We can convert CLOB data from Oracle DB to a desired format such as ORC, Parquet, TSV, or Avro files through Enterprise Flexter.
Also, you can refer to this on how to ingest on-premises Oracle data with Google Cloud Dataflow via JDBC, using the Hybrid Data Pipeline On-Premises Connector.
For your other query, moving from Apache Hive to BigQuery:
The fastest way to import to BQ is to use GCP resources. Dataflow is a scalable solution for reading and writing. Dataproc is another, more flexible option that lets you use more open-source stacks to read from the Hive cluster.
You can also use this Dataflow template, which would require a connection to be established directly between the Dataflow workers and the Apache Hive nodes.
There is also a plugin for moving data from Hive into BigQuery which utilises GCS as temporary storage and uses the BigQuery Storage API to move data to BigQuery.
You can also use Cloud SQL to migrate your Hive data to BigQuery.
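If you go with the daily batch route you prefer, one rough shape for it is to keep the Parquet output from the existing Sqoop job, copy it to GCS, and load it into BigQuery; a sketch only, where the bucket, dataset, and table names are placeholders and the GCS connector is assumed to be configured on the cluster:
$ hadoop distcp /data/my_table gs://my-bucket/my_table
$ bq load --source_format=PARQUET my_dataset.my_table "gs://my-bucket/my_table/*.parquet"
Since Parquet keeps the CLOB-derived string column as a single typed value, the embedded commas and special characters are not the parsing problem they would be with CSV.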
HDFS stores data in replicated form. When we move data from HDFS to an RDBMS using Sqoop, how does Sqoop avoid exporting duplicated data from HDFS to the RDBMS?
HDFS handles replication internally. You usually read a file using the HDFS protocol/API, and HDFS manages the replicas internally and returns only one copy of the data.
Sqoop also uses the HDFS API/protocol to read data.
So there is no need for any extra handling on the Sqoop end.
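As a minimal sketch (connection string, table, and directory are placeholders), an export simply points at the HDFS directory, and Sqoop reads one logical copy of each file regardless of the replication factor:
$ sqoop export --connect jdbc:mysql://db.example.com/corp \
    --username dbuser -P \
    --table employees \
    --export-dir /data/employees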
I am trying to query an HDFS file with Presto, like Apache Drill can. I have searched but haven't found anything, due to the lack of Presto resources. I can query HDFS data with the Hive connector, no problem with that. But I want to query a file in HDFS that is not managed by Hive. Is it possible?
Yes, it is possible. You can use the Hive connector with
hive.metastore=file
hive.metastore.catalog.dir=/home/youruser/metastore
This will create an "embedded metastore" with the /home/youruser/metastore directory as its storage. Then you can declare your table as if you were using the Hive metastore and read from it.
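For example, you could then declare an external table directly over the raw HDFS location and query it; a sketch under the assumption that the external_location and format table properties of the Hive connector are available in your Presto version, with the table name and HDFS path as placeholders:
$ presto --catalog hive --schema default \
    --execute "CREATE TABLE raw_logs (line VARCHAR) WITH (format = 'TEXTFILE', external_location = 'hdfs://namenode:8020/data/raw_logs')"
After that, SELECT queries against hive.default.raw_logs read the files in that directory without Hive being involved.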
Using the unload command in any database, we can get the data into a flat file and copy it to HDFS.
What is the advantage of using sqoop over unload command in this regard?
Please explain. Thanks in advance.
Sqoop will translate data from your DB's types to Java types which can then be used in a downstream process. Sqoop also allows you to offload various relational DBs into Hive or HBase, or it can produce Parquet output (one of the most used formats for datasets under Hadoop).
Sqoop will also split and distribute the work across mappers following a provided key.
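A minimal sketch of what that looks like in practice, with the connection string, table, and split key as placeholders; --split-by drives the parallelism and --hive-import loads the result straight into a Hive table:
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --username dbuser -P \
    --table orders \
    --split-by order_id \
    --hive-import --hive-table orders \
    -m 8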
I am using Sqoop as a technology to download lots of data from MySQL to HDFS. Sometimes I need to write special queries in Sqoop to download the data.
One of the problems I feel with Sqoop is that it's virtually untestable. There is absolutely no guidance or technology to unit test a Sqoop job.
If anyone is using Sqoop for data integration, how do you test your Sqoop applications?
AFAIK there are no unit testing frameworks for Sqoop as of now; you can follow the approach below.
1) Schedule a sqoop eval job that runs the source query and displays the output of the source table:
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
--query "SELECT * FROM employees LIMIT 10"
2) Run the corresponding Hive query or HDFS shell command to get the data or count after the Sqoop job has completed, and compare it with the output of step 1.
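For instance, the check in step 2 might look like one of these; the paths, database, and table names are placeholders:
$ hdfs dfs -cat /data/employees/part-* | wc -l
$ hive -e "SELECT COUNT(*) FROM corp.employees"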
If you don't use free-form queries via --query, you can use the built-in --validate option to match the record count between the source table and HDFS. Unfortunately it will fail on big tables in MS SQL (record count > int capacity), because Sqoop is not aware of count_big().
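As a rough sketch (connection string, table, and target directory are placeholders), validation is just an extra flag on the import:
$ sqoop import --connect jdbc:mysql://db.example.com/corp \
    --username dbuser -P \
    --table employees \
    --target-dir /data/employees \
    --validate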