sqoop vs unload command - hdfs

Using the unload command in any database, we can get the data into a flat file and we can copy that to HDFS.
What is the advantage of using Sqoop over the unload command in this regard?
Please explain. Thanks in advance.

Sqoop translates the column types of your database into Java types, which can then be used in downstream processing. Sqoop also lets you offload various relational databases to Hive or HBase, or it can produce Parquet output (one of the most widely used formats for datasets on Hadoop).
Sqoop will also split and distribute the exported data according to a key that you provide.
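
As a rough sketch of the points above, an import that splits the work on a key column and writes Parquet might look like the following. The host, database, table, column and target directory are made-up placeholders, and credentials are left out. The script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical connection details; replace with your own.
JDBC_URL="jdbc:mysql://db.example.com/corp"
TABLE="employees"
SPLIT_COLUMN="id"                 # assumed numeric key used to divide the work
TARGET_DIR="/data/corp/employees"

# Print the command by default; run it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Import one table with 4 parallel mappers, split on SPLIT_COLUMN,
# and store the result as Parquet files under TARGET_DIR.
sqoop_import_parquet() {
  run sqoop import \
    --connect "$JDBC_URL" \
    --table "$TABLE" \
    --split-by "$SPLIT_COLUMN" \
    --num-mappers 4 \
    --as-parquetfile \
    --target-dir "$TARGET_DIR"
}

sqoop_import_parquet
```

A plain unload to a flat file followed by a copy gives you a single serial stream, whereas here each mapper pulls its own range of the split column in parallel.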

Related

How to query hdfs file with presto

I am trying to query an HDFS file with Presto, the way Apache Drill can. I have searched but found nothing, due to the lack of Presto resources. I can query HDFS data with the Hive connector, no problem with that. But I want to query a file in HDFS that is not controlled by Hive. Is it possible?
Yes, it is possible. You can use the Hive connector with:
hive.metastore=file
hive.metastore.catalog.dir=/home/youruser/metastore
This will create an "embedded metastore" that uses the /home/youruser/metastore directory as its storage. Then you can declare your table as if you were using the Hive metastore and read from it.
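
For context, these two properties go into the catalog file of the Hive connector. A minimal sketch that writes such a file is shown below. The ./etc/catalog location and the connector.name value hive-hadoop2 are assumptions about a typical Presto install (newer releases may use a different connector name), so check them against your version.

```shell
#!/bin/sh
# Assumed location of the Presto catalog directory; adjust to your install.
CATALOG_DIR="${CATALOG_DIR:-./etc/catalog}"

# Write a Hive catalog that keeps its metadata in a local directory
# instead of talking to a Hive metastore service.
write_hive_catalog() {
  mkdir -p "$CATALOG_DIR"
  cat > "$CATALOG_DIR/hive.properties" <<'EOF'
connector.name=hive-hadoop2
hive.metastore=file
hive.metastore.catalog.dir=/home/youruser/metastore
EOF
}

write_hive_catalog
cat "$CATALOG_DIR/hive.properties"
```

Presto reads catalog files at startup, so the server needs a restart before the catalog shows up.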

Vertica HDFS as external table

What is the best practice for working with Vertica and Parquet?
My application architecture is:
Kafka Topic (Avro Data).
Vertica DB.
Vertica's scheduler consumes the data from Kafka and ingests it into a managed table in Vertica.
Let's say I have storage in Vertica for only one month of data.
As far as I understand, I can create an external table on HDFS using Parquet, and Vertica's API lets me query such tables as well.
What is the best practice for this scenario? Can I add some Vertica scheduler for copying the data from managed tables to external tables (as Parquet)?
How do I configure rolling data in Vertica (dropping data older than 30 days, every day)?
Thanks.
You can use external tables with Parquet data, whether that data was ever in Vertica or came from some other source. For Parquet and ORC formats specifically, there are some extra features, like predicate pushdown and taking advantage of partition columns.
You can export data in Vertica to Parquet format. You can export the results of a query, so you can select only the 30-day-old data. And despite that section being in the Hadoop section of Vertica's documentation, you can actually write your Parquet files anywhere; you don't need to be running HDFS at all. It just has to be somewhere that all nodes in your database can reach, because external tables read the data at query time.
I don't know of an in-Vertica way to do scheduled exports, but you could write a script and run it nightly. You can run a .sql script from the command line using vsql -f filename.sql.
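
A nightly script along those lines could look roughly like this. The table name events, the column event_time and the archive path are invented for illustration. The dated subdirectory is there because, as far as I know, the export expects a target directory that does not exist yet. The script prints the vsql command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical names: the events table, its event_time column, and the
# archive path are placeholders, not taken from the question.
SQL_FILE="${SQL_FILE:-export_old_data.sql}"
RUN_DATE=$(date +%Y-%m-%d)

# Print the command by default; run it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Write the export statement; each run targets a new dated directory.
write_export_sql() {
  cat > "$SQL_FILE" <<EOF
-- Export rows older than 30 days to Parquet files.
EXPORT TO PARQUET(directory = 'hdfs:///data/archive/events/$RUN_DATE')
AS SELECT * FROM events
WHERE event_time < CURRENT_DATE - 30;
EOF
}

write_export_sql
run vsql -f "$SQL_FILE"
```

Scheduling is then left to whatever you already use, for example a cron entry that calls the script once a night.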

Unit Testing Sqoop Applications

I am using Sqoop to download lots of data from MySQL to HDFS. Sometimes I need to write special queries in Sqoop to download the data.
One of the problems I have with Sqoop is that it's virtually untestable. There is absolutely no guidance or technology for unit testing a Sqoop job.
If anyone is using Sqoop for data integration, how do you test your Sqoop applications?
AFAIK, as of now there is no unit testing framework for Sqoop. You can follow the approach below:
1) Schedule a sqoop eval job that runs the source query and displays the output from the source table.
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
--query "SELECT * FROM employees LIMIT 10"
2) Run the corresponding Hive query or HDFS shell command to get the data or the count after the Sqoop job has completed.
If you don't use free-form queries via --query, you can use the built-in --validate option to compare the record counts of the source table and HDFS. Unfortunately it will fail on big tables in MS SQL (record count > int capacity) because Sqoop is not aware of count_big().
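
For a plain table import, turning on the row-count check might look like the sketch below. The connection string reuses the example above, while the table name and target directory are placeholders. By default the script only prints the command; set DRY_RUN=0 to run it.

```shell
#!/bin/sh
# Placeholder values; the table and target directory are assumptions.
JDBC_URL="jdbc:mysql://db.example.com/corp"
TABLE="employees"
TARGET_DIR="/data/corp/employees"

# Print the command by default; run it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Import a single table and let Sqoop compare source and target row counts.
sqoop_import_validated() {
  run sqoop import \
    --connect "$JDBC_URL" \
    --table "$TABLE" \
    --target-dir "$TARGET_DIR" \
    --validate
}

sqoop_import_validated
```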

Loading data onto HDFS using Sqoop

I was looking for ways to move data onto an HDFS system and wanted to know whether Apache Sqoop can be used to pull/extract data from an external REST service.
My favorite way to pull data from a REST service:
curl http:// | hdfs dfs -put - /my/hdfs/directory
From http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html:
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
So it doesn't support importing data from a REST service.

Load table from hbase to hdfs through map-reduce program

How do I write a MapReduce program to load any table from HBase into HDFS?
There are a few ways:
Use HBase's TableMapReduceUtil (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html), where you can specify a table scan and then manipulate the data or write it to HDFS.
You can use HBase's Export utility (http://hbase.apache.org/book/ops_mgt.html#export), which writes the contents of a whole HBase table to HDFS; the Import tool can then be used to load the backed-up table from HDFS into HBase.
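
The second option needs no custom code. A minimal sketch of the two commands follows; the table name and backup directory are hypothetical, and the script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical table name and HDFS backup directory.
TABLE="my_table"
EXPORT_DIR="/backup/my_table"

# Print the command by default; run it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Dump the table to files under EXPORT_DIR with the bundled MapReduce job.
export_table() {
  run hbase org.apache.hadoop.hbase.mapreduce.Export "$TABLE" "$EXPORT_DIR"
}

# Load a previous dump back into an existing table.
import_table() {
  run hbase org.apache.hadoop.hbase.mapreduce.Import "$TABLE" "$EXPORT_DIR"
}

export_table
```

import_table is defined but not called here, since restoring is normally a separate step run against a table that already exists.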