Currently in our production environment we use Hive on Tez instead of the MapReduce engine, so I wanted to ask: will all the Hive join optimizations be relevant for Tez as well? For example, for multi-table joins it was mentioned that if the join key is the same, a single MapReduce job will be used. But when I checked an HQL query in our environment that left-outer-joins one table with many tables on the same key, I didn't see 1 reducer; in fact there were 17 reducers running. So is it because Hive on Tez behaves differently from Hive on MR?
Hive version: 1.2
Hadoop: 2.7
Below is the documentation where it mentions using only 1 reducer:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
I am submitting multiple steps (concurrency = 1) to an AWS EMR cluster, one after the other, with the command 'spark-submit --deploy-mode client --master yarn <>'.
In the first step I'm reading from S3 and creating a DataFrame out of it. I'm registering this DataFrame as a Spark SQL table/view using createGlobalTempView.
In the second step I'm trying to access the table/view in my Spark SQL query (I tried qualifying it with global_temp as well), but I'm getting a table/view not found exception.
What am I missing? Shouldn't a view created with createGlobalTempView be accessible across multiple sessions? Or are sessions and steps different things? How can I achieve this?
One step in EMR is like a single Spark application, and the lifetime of a view created using createGlobalTempView is tied to that Spark application.
The same can be seen in the PySpark documentation.
And yes, sessions and steps are different things.
A step in EMR submits a job that creates a Spark application, and once the step finishes executing, all the temp views created within it are gone.
You can verify this by adding this small line of code to each step's script:
print(spark.sparkContext.applicationId)
spark mentioned above is the SparkSession.
The two steps that you executed will have different applications.
There can be multiple SparkSessions associated with a Spark application, but only one SparkContext per application.
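As a minimal sketch (assuming PySpark; the view and variable names are just illustrative), the scoping works like this: a global temp view is visible to other sessions of the same application under the global_temp database, but it disappears with the application, so a later EMR step cannot see it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext.applicationId)  # the application ID of this step

df = spark.range(5)
df.createGlobalTempView("my_view")  # registered under the global_temp database

# Another session inside the SAME application can still read the view.
other_session = spark.newSession()
other_session.sql("SELECT * FROM global_temp.my_view").show()

# Once this application (this EMR step) finishes, global_temp.my_view is gone,
# which is why the next step fails with a "table or view not found" error.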
To achieve this you can persist the DataFrame as a Hive table (a fuller sketch follows at the end of this answer) like:
df.write.mode('overwrite').saveAsTable("table_name")
and
use this table in the next step as input data:
df = spark.sql("select * from table_name")
or
if you don't want to create a table, another way is to include your transformation SQL in the same script.
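For illustration, a minimal sketch of the table-based option above could look like this, assuming PySpark with Hive support enabled and a persistent Hive metastore on the EMR cluster (the table name and S3 path are placeholders):

# --- step 1 script ---
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.read.parquet("s3://my-bucket/input/")         # placeholder source
df.write.mode("overwrite").saveAsTable("staging_table")  # persisted in the Hive metastore

# --- step 2 script (a separate spark-submit, i.e. a new application) ---
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.sql("select * from staging_table")
df.show()

Because the table lives in the metastore rather than in the application, the second step can read it even though it runs as a different Spark application.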
What is the best practice for working with Vertica and Parquet?
My application architecture is:
Kafka topic (Avro data).
Vertica DB.
Vertica's scheduler consumes the data from Kafka and ingests it into a managed table in Vertica.
Let's say I have Vertica storage for only one month of data.
As far as I understand, I can create an external table over Parquet files on HDFS, and the Vertica API enables me to query these tables as well.
What is the best practice for this scenario? Can I add some Vertica scheduler for copying the data from managed tables to external tables (as Parquet)?
How do I configure the rolling data in Vertica (every day, drop the data that is more than 30 days old)?
Thanks.
You can use external tables with Parquet data, whether that data was ever in Vertica or came from some other source. For Parquet and ORC formats specifically, there are some extra features, like predicate pushdown and taking advantage of partition columns.
You can export data in Vertica to Parquet format. You can export the results of a query, so you can select only the 30-day-old data. And despite that section being in the Hadoop section of Vertica's documentation, you can actually write your Parquet files anywhere; you don't need to be running HDFS at all. It just has to be somewhere that all nodes in your database can reach, because external tables read the data at query time.
I don't know of an in-Vertica way to do scheduled exports, but you could write a script and run it nightly. You can run a .sql script from the command line using vsql -f filename.sql.
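As an illustration only (not an official Vertica feature), a rough Python sketch of such a nightly script could look like the following. It writes a .sql file that exports rows older than 30 days to Parquet and then deletes them from the managed table, and runs it with vsql -f as described above. The table name (events), timestamp column (event_ts), and HDFS directory are placeholders, EXPORT TO PARQUET assumes a Vertica version that supports it, and you may need to add connection options (-h, -U, -w) to the vsql call.

import subprocess
from datetime import date

# Export everything older than 30 days to a dated Parquet directory,
# then delete those rows from the managed table.
sql = """
EXPORT TO PARQUET (directory = 'hdfs:///archive/events/{d}')
   AS SELECT * FROM events WHERE event_ts < CURRENT_DATE - 30;
DELETE FROM events WHERE event_ts < CURRENT_DATE - 30;
COMMIT;
""".format(d=date.today().isoformat())

with open("roll_events.sql", "w") as f:
    f.write(sql)

subprocess.run(["vsql", "-f", "roll_events.sql"], check=True)

A cron entry (or whatever scheduler you already use) can then run this script every night.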
How many mappers and reducers does Sqoop use by default? (4 mappers, 0 reducers)
If a --where or --query condition is used in a Sqoop import, how many reducers will there be?
On my local cluster it shows 0 reducers after using the --where or --query condition.
As per the Sqoop user guide, Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the
--num-mappers
argument. By default, four tasks are used. Since we are not doing any aggregation, the number of reduce tasks will be zero. For more details, see http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_free_form_query_imports
Sqoop jobs are map-only. There is no reducer phase.
For example, a Sqoop import from MySQL to HDFS with 4 mappers will open 4 concurrent connections and start fetching data. 4 mapper tasks are created, and the data is written to HDFS part files. There is no reducer stage.
Reducers are required for aggregation. While fetching data from MySQL, Sqoop simply uses select queries, which are executed by the mappers.
There are no reducers in Sqoop. Sqoop only uses mappers, as it does parallel import and export. Whenever we write any query (even an aggregation such as count or sum), the query runs on the RDBMS, and the generated result is fetched by the mappers from the RDBMS using select queries and loaded into Hadoop in parallel. Hence the where clause or any aggregation query runs on the RDBMS, and no reducers are required.
For most operations, Sqoop runs a map-only job. Even if there are aggregations in the free-form query, that query is executed on the RDBMS, hence no reducers.
However, for one particular option, "--incremental lastmodified", reducer(s) are called if "--merge-key" is specified (used for merging the new incremental data with the previously extracted data). In this case, there seems to be a way to specify the number of reducers as well, using the property "mapreduce.job.reduces", as below.
sqoop import -Dmapreduce.job.reduces=3 --incremental lastmodified --connect jdbc:mysql://localhost/testdb --table employee --username root --password cloudera --target-dir /user/cloudera/SqoopImport --check-column trans_dt --last-value "2019-07-05 00:00:00" --merge-key emp_id
The "-D" properties are expected before the command options.
Suppose I have a million rows in a table. I want to flip a flag in a column from true to false. How do I do that in Spanner with a single statement?
That is, I want to achieve the equivalent of the following DML statement:
Update mytable set myflag=true where 1=1;
Cloud Spanner doesn't currently support DML, but we are working on a Dataflow connector (Apache Beam) that would allow you to do bulk mutations.
You can use this open source JDBC driver in combination with a standard JDBC tool such as SQuirreL or SQL Workbench. Have a look here for a short tutorial on how to use the driver with these tools: http://www.googlecloudspanner.com/2017/10/using-standard-database-tools-with.html
The JDBC driver supports both DML and DDL statements, so this statement should work out of the box:
Update mytable set myflag=true
DML statements operating on a large number of rows are supported, but the underlying transaction quotas of Cloud Spanner continue to apply (max 20,000 mutations in one transaction). You can bypass this by setting the AllowExtendedMode=true connection property (see the wiki pages of the driver). This breaks a large update into several smaller updates and executes each of these in its own transaction. You can also do this batching yourself by dividing your update statement into several different parts.
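To illustrate the do-it-yourself batching mentioned at the end, here is a rough sketch in Python that splits the update into primary-key ranges and commits each range in its own transaction, so no single transaction exceeds the mutation limit. The conn object stands for a hypothetical DB-API-style connection that can execute DML against Spanner (for example through the JDBC driver via a bridge of your choice), and the table, column, and integer key are placeholders; adapt the predicate to your actual schema.

def flip_flag_in_batches(conn, max_id, batch_size=10000):
    """Run one big UPDATE as many small, range-bounded updates."""
    for start in range(0, max_id, batch_size):
        end = start + batch_size
        cur = conn.cursor()
        # Assumes an integer primary key `id`; each range is committed separately.
        cur.execute(
            "UPDATE mytable SET myflag = true "
            "WHERE id >= {0} AND id < {1}".format(start, end)
        )
        conn.commit()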
We have created a test table from the Spark shell as well as from Zeppelin. But when we run 'show tables', only a single table is visible in each respective environment. The table created via the Spark shell is not displayed by the 'show tables' command in Zeppelin.
What is the difference between these two tables? Can anybody please explain?
The show tables command only shows the tables defined in the current session.
A table is created in the current session and also in a (persistent) catalog in ZooKeeper. You can show all tables that Vora saved in ZooKeeper via this command:
SHOW DATASOURCETABLES
USING com.sap.spark.vora
OPTIONS(zkurls "<zookeeper_server>:2181")
You can also register all tables, or a single table, in the current session via these commands:
REGISTER ALL TABLES
USING com.sap.spark.vora
OPTIONS(zkurls "<zookeeper_server>:2181")
or
REGISTER TABLE <tablename>
USING com.sap.spark.vora
OPTIONS(zkurls "<zookeeper_server>:2181")
So if you want to access the table that you created in the Spark shell from Zeppelin, and vice versa, you need to register it first.
You can use the following commands if you need to clear the ZooKeeper catalog. Be aware that tables then need to be recreated:
import com.sap.spark.vora.client._
ClusterUtils.clearZooKeeperCatalog("<zookeeper_server>:2181")
This (and more) information can be found in the Vora Installation and Developer Guide