Number of reducers in Sqoop - MapReduce

How many mappers and reducers does Sqoop use by default? (4 mappers, 0 reducers.)
If a --where or --query condition is used in a Sqoop import, how many reducers will there be?
On my local cluster it still shows 0 reducers after using a --where or --query condition.

As per the Sqoop user guide, Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use for the import with the --num-mappers argument. By default, four tasks are used. Since an import does no aggregation, the number of reducer tasks is zero. For more details see http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_free_form_query_imports

Sqoop jobs are map-only; there is no reducer phase.
For example, a Sqoop import from MySQL to HDFS with 4 mappers opens 4 concurrent database connections and starts fetching data. 4 mapper tasks are created, and the data is written to HDFS part files. There is no reducer stage.
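For illustration, a minimal import along those lines might look like the following (the connection string, table and target directory are placeholders):
sqoop import --connect jdbc:mysql://db.example.com/corp --table employees --target-dir /user/cloudera/employees --num-mappers 4
Each of the 4 mappers reads its own slice of the table (split on the primary key by default) and writes its own part-m-0000N file on HDFS; no reduce stage ever runs.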

Reducers are required for aggregation. While fetching data from MySQL, Sqoop simply issues SELECT queries, which are executed by the mappers.
There are no reducers in Sqoop; it uses only mappers, since it performs parallel import and export. Whenever we write a query (even an aggregation such as COUNT or SUM), that query runs on the RDBMS, and the mappers fetch the generated result via SELECT queries and load it into Hadoop in parallel. Hence the WHERE clause or any aggregation runs on the RDBMS, and no reducers are required.
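As a sketch (table and column names are made up), even an aggregating free-form query is pushed down to the RDBMS; the mandatory $CONDITIONS token only splits the already-aggregated result across the mappers:
sqoop import --connect jdbc:mysql://db.example.com/corp \
--query "SELECT dept_id, COUNT(*) AS emp_count FROM employees WHERE \$CONDITIONS GROUP BY dept_id" \
--split-by dept_id --target-dir /user/cloudera/emp_counts
MySQL computes the GROUP BY; the mappers merely copy the result rows into HDFS, so there is still nothing for a reducer to do.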

For most operations, Sqoop runs a map-only job. Even if there are aggregations in a free-form query, that query is executed on the RDBMS, hence no reducers.
However, for one particular option, --incremental lastmodified, reducer(s) are invoked if --merge-key is specified (used for merging the new incremental data with the previously extracted data). In this case there is also a way to specify the number of reducers, using the property mapreduce.job.reduces as below:
sqoop import -Dmapreduce.job.reduces=3 --incremental lastmodified --connect jdbc:mysql://localhost/testdb --table employee --username root --password cloudera --target-dir /user/cloudera/SqoopImport --check-column trans_dt --last-value "2019-07-05 00:00:00" --merge-key emp_id
The -D properties must come before the other command-line options.

Related

How can I reuse spark SQL view/table across multiple AWS EMR steps?

I am submitting multiple steps (concurrency = 1) to an AWS EMR cluster with 'spark-submit --deploy-mode client --master yarn <>', one after the other.
In the first step I read from S3 and create a DataFrame from it. I register this DataFrame as a Spark SQL table/view using createGlobalTempView.
In the second step I try to access the table/view in my Spark SQL query (I tried with the global_temp prefix as well), but I get a table/view-not-found exception.
What am I missing? Shouldn't createGlobalTempView be accessible across multiple sessions? Or are sessions and steps different things? How can I achieve this?
A step on EMR is like a single Spark application, and the lifetime of a view created with createGlobalTempView is tied to that Spark application.
The same can be seen in the pyspark documentation.
And yes, sessions and steps are different things.
A step in EMR submits a job that starts a Spark application, and once the step finishes executing, all the temp views created within it are gone.
You can verify this by adding this small line of code to each step's script:
print(spark.sparkContext.getApplicationId())
(spark above is the SparkSession.)
The two steps you ran will show different application IDs.
There can be multiple SparkSessions associated with one Spark application, but only one SparkContext per application.
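A quick way to see the session-versus-context distinction within a single application is a sketch like this (plain pyspark, nothing EMR-specific):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()    # first session of this application
other = spark.newSession()                    # another session, same application
print(spark.sparkContext is other.sparkContext)   # True: one shared SparkContext
print(spark.sparkContext.getApplicationId())      # same application id for both sessions
A global temp view created through spark is visible from other (via the global_temp database) because they share one application, but it disappears as soon as that application, i.e. the EMR step, ends.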
To achieve this, you can create a staging table in the Hive metastore like:
df.write.mode('overwrite').saveAsTable("table_name")
and use that table as the input data in the next step:
df = spark.sql("select * from table_name")
Or, if you don't want to create an intermediate table, another way is to include your transformation SQL in the same script.
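Putting that together, a minimal two-script sketch of the approach (it assumes the cluster's Hive metastore is reachable and enableHiveSupport() is used; the S3 path and table name are placeholders):
# step1.py - first EMR step
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.json("s3://my-bucket/input/")             # placeholder input path
df.write.mode("overwrite").saveAsTable("staging_table")   # persisted in the metastore, not a temp view

# step2.py - second EMR step (a different Spark application)
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.sql("SELECT * FROM staging_table")             # still visible: it lives in the metastore
df.show()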

How to export BigQuery partitions to GCS using Airflow

I need to create an Airflow job that exports the partitions of a BigQuery table to GCS for a given range of _PARTITIONDATE. I need each partition in a separate file named with the partition date. How can I achieve this?
I have tried using Airflow tasks that use SQL to fetch the _PARTITIONDATE, but can I do it programmatically?
For this, I recommend building a loop in your DAG definition (the loop runs in Python code and adds many tasks to the DAG; by definition the DAG itself cannot contain a cycle).
The algorithm should be like this (a rough sketch follows the list):
For each day in the range:
Query BigQuery for that day and save the result to a temporary table whose name contains the date. Use BigQueryOperator.
Export the temporary table to GCS. Use BigQueryToCloudStorageOperator.
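A rough sketch of that loop (it assumes Airflow 1.x contrib operators as named above; project, dataset, table and bucket names are placeholders, and the date range should be adapted to your table):
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator

START = datetime(2019, 7, 1)   # first partition date to export (placeholder)
DAYS = 7                       # number of partitions to export (placeholder)

dag = DAG("export_bq_partitions", start_date=datetime(2019, 7, 1), schedule_interval=None)

for i in range(DAYS):
    day = (START + timedelta(days=i)).strftime("%Y-%m-%d")
    suffix = day.replace("-", "_")

    # 1) materialize one partition into its own temporary table
    extract = BigQueryOperator(
        task_id="extract_{}".format(suffix),
        sql="SELECT * FROM `my_project.my_dataset.my_table` "
            "WHERE _PARTITIONDATE = DATE '{}'".format(day),
        destination_dataset_table="my_project.my_dataset.tmp_{}".format(suffix),
        write_disposition="WRITE_TRUNCATE",
        use_legacy_sql=False,
        dag=dag,
    )

    # 2) export that table to a per-day file on GCS
    export = BigQueryToCloudStorageOperator(
        task_id="export_{}".format(suffix),
        source_project_dataset_table="my_project.my_dataset.tmp_{}".format(suffix),
        destination_cloud_storage_uris=["gs://my-bucket/exports/{}.csv".format(day)],
        export_format="CSV",
        dag=dag,
    )

    extract >> export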
Just follow the link below; it is a guide to exporting BigQuery partitions to GCS using Airflow:
https://m.youtube.com/watch?v=wAyu5BN3VpY&t=28s

Vertica HDFS as external table

What is the best practice for working with Vertica and Parquet?
My application architecture is:
Kafka topic (Avro data).
Vertica DB.
Vertica's scheduler consumes the data from Kafka and ingests it into a managed table in Vertica.
Let's say I have Vertica storage for only one month of data.
As far as I understand, I can create an external table over Parquet files on HDFS, and Vertica's API lets me query those tables as well.
What is the best practice for this scenario? Can I add some Vertica scheduler for copying the data from managed tables to external tables (as Parquet)?
How do I configure the rolling data in Vertica (every day, drop the data that is older than 30 days)?
Thanks.
You can use external tables with Parquet data, whether that data was ever in Vertica or came from some other source. For Parquet and ORC formats specifically, there are some extra features, like predicate pushdown and taking advantage of partition columns.
You can export data from Vertica to Parquet format. You can export the results of a query, so you can select only the data older than 30 days. And although that topic sits in the Hadoop section of Vertica's documentation, you can actually write your Parquet files anywhere; you don't need to be running HDFS at all. The files just have to be somewhere that all nodes in your database can reach, because external tables read the data at query time.
I don't know of an in-Vertica way to do scheduled exports, but you could write a script and run it nightly. You can run a .sql script from the command line using vsql -f filename.sql.
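A hedged sketch of what such a nightly filename.sql could contain (the events table, its event_date column and the HDFS paths are made up; check the EXPORT TO PARQUET and CREATE EXTERNAL TABLE ... AS COPY syntax against your Vertica version):
-- 1) archive rows older than 30 days as Parquet (the target directory must not exist yet, so use one per run)
EXPORT TO PARQUET(directory = 'hdfs:///archive/events/batch_2019_07_05')
  AS SELECT * FROM events WHERE event_date < CURRENT_DATE - 30;
-- 2) remove the archived rows from the managed table to free Vertica storage
DELETE FROM events WHERE event_date < CURRENT_DATE - 30;
SELECT PURGE_TABLE('events');
-- 3) one-time setup: an external table over the whole archive tree
CREATE EXTERNAL TABLE events_archive (event_id INT, event_date DATE, payload VARCHAR(65000))
  AS COPY FROM 'hdfs:///archive/events/*/*.parquet' PARQUET;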

Hive on Tez engine

Currently in our production environment we use Hive on Tez instead of the MapReduce engine, so I wanted to ask: are all the Hive join optimizations relevant for Tez as well? For example, for multi-table joins it is mentioned that if the join key is the same, a single map/reduce job is used. But when I checked an HQL query in our environment that left-outer-joins one table with many tables on the same key, I did not see 1 reducer; in fact there were 17 reducers running. So is this because Hive on Tez behaves differently from Hive on MR?
Hive version: 1.2
Hadoop: 2.7
Below is the documentation where it mentions using only 1 reducer:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
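For reference, the pattern that page describes as collapsing into a single map/reduce job is a multi-way join on one shared key (table names as in the wiki example):
SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key1)
       JOIN c ON (c.key = b.key1);
A query that also joins on a different key (for example b.key2) compiles into more than one job even on classic Hive-on-MR.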

Unit Testing Sqoop Applications

I am using Sqoop to download lots of data from MySQL to HDFS. Sometimes I need to write special queries in Sqoop to download the data.
One of the problems I have with Sqoop is that it is virtually untestable: there is absolutely no guidance or tooling for unit testing a Sqoop job.
If anyone is using Sqoop for data integration, how do you test your Sqoop applications?
AFAIK there is no unit-testing framework for Sqoop as of now; you can follow the approach below.
1) Schedule a sqoop eval job that runs the source query and displays the output of the source table:
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
--query "SELECT * FROM employees LIMIT 10"
2) Run the corresponding Hive query or HDFS shell command to get the data or the row count after the Sqoop job has completed, and compare it with the source.
If you don't use free-form queries via --query, you can use the built-in --validate option to match the record counts in the source table and on HDFS. Unfortunately, it will fail on big tables in MS SQL (record count > int capacity) because Sqoop is not aware of count_big().
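For the non-free-form case, a hedged example of that built-in check (connection details and table name are placeholders):
sqoop import --connect jdbc:mysql://db.example.com/corp --table employees \
--target-dir /user/cloudera/employees --validate
After the transfer, Sqoop compares the number of rows it copied with a row count taken from the source table and reports a validation failure if the two differ.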