Unit Testing Sqoop Applications - unit-testing

I am using Sqoop to download large amounts of data from MySQL to HDFS. Sometimes I need to write special queries in Sqoop to download the data.
One of the problems I have with Sqoop is that it's virtually untestable. There is absolutely no guidance or tooling for unit testing a Sqoop job.
If anyone is using Sqoop for data integration, how do you test your Sqoop applications?

AFAIK, as of now there are no unit testing frameworks for Sqoop. You can follow the approach below:
1) Schedule a sqoop eval job with the source query to display the output from the source table.
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
--query "SELECT * FROM employees LIMIT 10"
2) After the Sqoop import has completed, run the corresponding Hive query or HDFS shell command to get the data or the row count, and compare it with the source.
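The two steps above boil down to comparing a source-side count with a target-side count. Here is a minimal Python sketch, assuming both numbers have already been obtained (for example from a `sqoop eval` count query and from a Hive count). The helper names `build_eval_count_args` and `check_counts` are hypothetical and not part of Sqoop.

```python
def build_eval_count_args(jdbc_url, table):
    """Return the argv for a `sqoop eval` call that counts rows in `table`."""
    return [
        "sqoop", "eval",
        "--connect", jdbc_url,
        "--query", f"SELECT COUNT(*) FROM {table}",
    ]


def check_counts(source_count, target_count):
    """Raise AssertionError if the imported row count differs from the source."""
    if source_count != target_count:
        raise AssertionError(
            f"row count mismatch: source={source_count}, target={target_count}"
        )
    return True


# Example using the connection string from the answer above.
args = build_eval_count_args("jdbc:mysql://db.example.com/corp", "employees")
print(" ".join(args))
```

A count comparison only catches missing or duplicated rows. It says nothing about whether individual values were converted correctly.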

If you don't use free-form queries via --query, you can use the built-in --validate option to compare record counts between the source table and HDFS. Unfortunately, it will fail on big tables in MS SQL (record count > int capacity) because Sqoop is not aware of count_big().

Related

Speed up BigQuery query job to import from Cloud SQL

I am running a query to generate a new BigQuery table of size ~1 TB (a few billion rows), as part of migrating a Cloud SQL table to BigQuery using a federated query. I use the BigQuery Python client to submit the query job; in the query I select everything from the Cloud SQL database table using EXTERNAL_QUERY.
I find that the query can take 6+ hours (and fails with "Operation timed out after 6.0 hour")! Even if it didn't fail, I would like to speed it up as I may need to perform this migration again.
I see that the PostgreSQL egress is 20Mb/sec, consistent with a job that would take half a day. Would it help to consider something more distributed, such as Dataflow? Or, more simply, to extend my Python code using the BigQuery client to generate multiple queries, which BigQuery can run asynchronously?
Or is it possible to still use that single query but increase the egress traffic (database configuration)?
I think a dump export is more suitable here.
Running a single query over such a large table is inefficient.
I recommend exporting the Cloud SQL data to a CSV file.
BigQuery can import CSV files, so you can use this file to create your new BigQuery table.
I'm not sure how long this job will take, but at least it will not fail.
Refer here for more details on exporting Cloud SQL data to a CSV dump.
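For comparison, the "multiple queries" idea raised in the question could be sketched as follows. This is only an illustration, and it assumes the table has a numeric key column with a known range. The connection ID, table name and column name are hypothetical. Whether it helps depends on whether the Cloud SQL egress rate is the real bottleneck.

```python
def chunked_external_queries(connection_id, table, key_column, lo, hi, step):
    """Build one EXTERNAL_QUERY statement per key range of width `step`.

    Covers the inclusive key range [lo, hi]; each inner query uses a
    half-open range (>= start, < end) so no row is selected twice.
    """
    queries = []
    for start in range(lo, hi + 1, step):
        end = min(start + step, hi + 1)  # exclusive upper bound
        inner = (
            f"SELECT * FROM {table} "
            f"WHERE {key_column} >= {start} AND {key_column} < {end}"
        )
        queries.append(
            f'SELECT * FROM EXTERNAL_QUERY("{connection_id}", "{inner}")'
        )
    return queries


# Hypothetical names, for illustration only.
queries = chunked_external_queries("my-project.us.my-conn", "events", "id", 1, 100, 25)
for q in queries:
    print(q)
```

Each generated statement could then be submitted as its own query job through the BigQuery Python client.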

Vertica HDFS as external table

What is the best practice for working with Vertica and Parquet?
My application architecture is:
Kafka topic (Avro data).
Vertica DB.
Vertica's scheduler consumes the data from Kafka and ingests it into a managed table in Vertica.
Let's say I have storage in Vertica for only one month of data.
As far as I understand, I can create an external table on HDFS using Parquet, and the Vertica API lets me query these tables as well.
What is the best practice for this scenario? Can I add a Vertica scheduler for copying the data from managed tables to external tables (as Parquet)?
How do I configure rolling data in Vertica (dropping data older than 30 days, every day)?
Thanks.
You can use external tables with Parquet data, whether that data was ever in Vertica or came from some other source. For Parquet and ORC formats specifically, there are some extra features, like predicate pushdown and taking advantage of partition columns.
You can export data in Vertica to Parquet format. You can export the results of a query, so you can select only the 30-day-old data. And despite that section being in the Hadoop section of Vertica's documentation, you can actually write your Parquet files anywhere; you don't need to be running HDFS at all. It just has to be somewhere that all nodes in your database can reach, because external tables read the data at query time.
I don't know of an in-Vertica way to do scheduled exports, but you could write a script and run it nightly. You can run a .sql script from the command line using vsql -f filename.sql.
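A nightly script along those lines might generate the export statement like this. It is a rough sketch that assumes a date column on the managed table. The table name, column name and target directory are hypothetical, and the output is meant to be written to a .sql file and run with vsql -f.

```python
from datetime import date, timedelta


def build_export_sql(table, date_column, base_dir, today, keep_days=30):
    """Return an EXPORT TO PARQUET statement for the day that just aged out."""
    day = today - timedelta(days=keep_days)
    # One directory per exported day, so each run writes to a fresh location.
    directory = f"{base_dir}/{day.isoformat()}"
    return (
        f"EXPORT TO PARQUET (directory = '{directory}') AS "
        f"SELECT * FROM {table} WHERE {date_column} = '{day.isoformat()}';"
    )


# Hypothetical table and path, for illustration only.
sql = build_export_sql("events", "event_date", "/data/archive", date(2024, 3, 31))
print(sql)
```

Removing the exported day from the managed table would be a separate step, run only after the export has been verified.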

sqoop vs unload command

Using the unload command in any database, we can get the data into a flat file and we can copy that to HDFS.
What is the advantage of using Sqoop over the unload command in this regard?
Please explain. Thanks in advance.
Sqoop translates the column types of your DB into Java types, which can then be used in downstream processing. Sqoop also lets you offload data from various relational DBs to Hive or HBase, or produce Parquet output (one of the most widely used formats for datasets on Hadoop).
Sqoop will also split and distribute the export file according to a provided key.
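To illustrate the idea of splitting on a key, here is a small Python sketch that divides a numeric key range evenly across a number of mappers. It shows the concept only and is not Sqoop's actual splitter code. The column name `id` is an assumption.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into contiguous per-mapper ranges."""
    total = hi - lo + 1
    size, extra = divmod(total, num_mappers)
    ranges = []
    start = lo
    for i in range(num_mappers):
        # The first `extra` mappers take one additional key each.
        length = size + (1 if i < extra else 0)
        if length == 0:
            continue  # more mappers than keys: skip empty ranges
        end = start + length - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Each mapper would then fetch only its own slice of the table.
for start, end in split_ranges(1, 100, 4):
    print(f"WHERE id >= {start} AND id <= {end}")
```

An even split like this only balances the work if the key values are roughly uniformly distributed.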

Number of reducer in sqoop

How many mappers and reducers does Sqoop use by default? (4 mappers, 0 reducers.)
If a --where or --query condition is used in a sqoop import, how many reducers will there be?
On my local cluster it shows 0 reducers when using a --where or --query condition.
As per the Sqoop user guide, Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the --num-mappers argument. By default, four tasks are used. Since no aggregation is performed, the number of reducer tasks is zero. For more details, see http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_free_form_query_imports
Sqoop jobs are map only. There is no reducer phase.
For example, a sqoop import from MySQL to HDFS with 4 mappers will open 4 concurrent connections and start fetching data. Four mapper tasks are created, and the data is written to HDFS part files. There is no reducer stage.
Reducers are required for aggregation. While fetching data from MySQL, Sqoop simply uses select queries, which are executed by the mappers.
There are no reducers in Sqoop. Sqoop only uses mappers, as it does parallel import and export. Whenever we write a query (even an aggregation such as count or sum), it runs on the RDBMS, and the result is fetched by the mappers using select queries and loaded into Hadoop in parallel. Hence a where clause or an aggregation query runs on the RDBMS, and no reducers are required.
For most of the functions, sqoop is a map-only job.
Even if there are aggregations in the free-form query
that query would be executed at the RDBMS hence no reducers.
However for one particular option "--incremental lastmodified",
reducer(s) are called if "--merge-key" is specified (used for merging
the new incremental data with the previously extracted data).
In this case, there seems to be a way to specify the number of reducers also
using the property "mapreduce.job.reduces" as below.
sqoop import -Dmapreduce.job.reduces=3 --incremental lastmodified --connect jdbc:mysql://localhost/testdb --table employee --username root --password cloudera --target-dir /user/cloudera/SqoopImport --check-column trans_dt --last-value "2019-07-05 00:00:00" --merge-key emp_id
The "-D" properties are expected before the command options.
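If the command is assembled programmatically, the ordering rule can be enforced in one place. Below is a minimal Python sketch. The helper `build_sqoop_command` is hypothetical, and the arguments are a shortened version of the command above with the credentials left out.

```python
def build_sqoop_command(tool, generic_props, tool_args):
    """Place -D properties right after the tool name, before the tool options."""
    cmd = ["sqoop", tool]
    for key, value in generic_props.items():
        cmd.append(f"-D{key}={value}")
    cmd.extend(tool_args)
    return cmd


cmd = build_sqoop_command(
    "import",
    {"mapreduce.job.reduces": 3},
    [
        "--incremental", "lastmodified",
        "--connect", "jdbc:mysql://localhost/testdb",
        "--table", "employee",
        "--check-column", "trans_dt",
        "--merge-key", "emp_id",
    ],
)
print(" ".join(cmd))
```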

Loading data onto HDFS using Sqoop

I was looking for ways to move data onto HDFS and wanted to know whether Apache Sqoop can be used to pull/extract data from an external REST service.
My favorite way to pull data from a REST service:
curl http:// | hdfs dfs -put - /my/hdfs/directory
From http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
So it doesn't support importing data from a REST service.
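Since Sqoop does not cover this case, the curl pipeline from the answer above can also be written in Python. This is a rough sketch that assumes the `hdfs` CLI is on the PATH. The function names are hypothetical, and `hdfs_path` should be the destination file path rather than a directory.

```python
import shutil
import subprocess
import urllib.request


def put_command(hdfs_path):
    """argv that makes `hdfs dfs -put` read stdin ("-") and write to hdfs_path."""
    return ["hdfs", "dfs", "-put", "-", hdfs_path]


def stream_to_hdfs(url, hdfs_path):
    """Stream an HTTP response body into HDFS without a local temp file."""
    with urllib.request.urlopen(url) as response:
        proc = subprocess.Popen(put_command(hdfs_path), stdin=subprocess.PIPE)
        shutil.copyfileobj(response, proc.stdin)
        proc.stdin.close()
        return proc.wait()  # exit code of the hdfs command
```

Calling `stream_to_hdfs(url, "/my/hdfs/directory/data.json")` would return 0 on success. The file name here is only an example.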