I have been running some tests against both Vora and Hive, from the SAP Spark Controller as well as from a plain Spark Thrift Server. Both the Controller and the Spark Thrift Server have the same configuration.
Test table: 12 columns, 10 million rows, 680 MB.
Both the Spark Thrift Server and the SAP Controller are started with --master yarn and the same number of executors, executor memory, and cores. The Controller and the Thrift Server sit on the same host in the Hadoop cluster; I run one test, shut down that Controller/Thrift Server, then start up the other for the next test.
All numbers below are the Thrift Server or SAP Controller job completion times; I am not waiting for the results to appear in HANA, Beeline, or the Spark shell.
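For context, the Thrift Server is launched roughly like this (the values here are placeholders, not the exact ones from the test; the Controller is given the equivalent executor settings in its own configuration):

  ./sbin/start-thriftserver.sh \
    --master yarn \
    --num-executors 6 \
    --executor-memory 4g \
    --executor-cores 2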
Results:
Spark-Shell -> Spark Thrift Server -> Hive
Select column returns in: 13s
Count returns in: 1.2s
Spark-Shell -> Spark Thrift Server -> Vora
Select column returns in: 5s
Count returns in: 100ms
HANA -> SAP Controller -> Hive
Select column returns in: 45s
Count returns in: 4s
HANA -> SAP Controller -> Vora
Select column returns in: 24s
Count returns in: 2.1s
Beeline -> Spark Thrift Server -> Hive
Select column returns in: 35s
Count returns in: 1.9s
Beeline -> Spark Thrift Server -> Vora
Select column returns in: 55s
Count returns in: 1.2s
Are there any important performance tuning tips that would help the Controller? It is interesting that I can select from Hive through the Thrift Server faster than the Controller can select from Vora.
After some partitioning changes, I have gotten the SAP Controller to select the data from Hive at a faster rate; Vora is still about the same speed.
It seems that a smaller number of splits helps the Controller tremendously: consolidating the data from 31 files down to 10 cuts the query time by more than 75%. A sketch of the file-compaction step is shown below, after the results.
Current results:
Spark-Shell -> Spark Thrift Server -> Hive
Select column returns in: 14s
Count returns in: 1s
HANA -> SAP Controller -> Hive
Select column returns in: 10s
Count returns in: 5s
Beeline -> Spark Thrift Server -> Hive
Select column returns in: 7s
Count returns in: 1.3s
The count still returns a bit slowly, but that is not a problem.
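One way to consolidate the files like this is sketched below (the table name, column, and reducer count are illustrative, and the exact properties depend on the Hive/Hadoop versions):

  -- Rewrite the table through a fixed, small number of reducers so the
  -- output lands in ~10 files instead of 31.
  SET mapred.reduce.tasks=10;
  INSERT OVERWRITE TABLE test_table_compacted   -- hypothetical compacted copy
  SELECT * FROM test_table
  DISTRIBUTE BY id;                             -- forces a reduce phase; 'id' is illustrative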
I am using https://aws-quickstart.github.io/quickstart-neo4j/ to deploy Neo4j on AWS.
My server connects to the cluster Leader's IP.
But when I run a query, even a simple one sometimes takes a very long time: it can take up to 70 seconds, while another run of the same query takes only about 1 second.
My query is just like:
MATCH (user:USER {alias:"userAlias"}) RETURN user
I am using the .NET Neo4jClient.
I have already set a unique constraint.
Has anyone faced this issue? Please let me know how to solve it.
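In case it helps, the constraint and the lookup look roughly like this (a sketch only; the constraint syntax shown is the older 3.x style and differs in newer Neo4j versions, and the parameter name is illustrative):

  // Uniqueness constraint on the property used for the lookup
  CREATE CONSTRAINT ON (u:USER) ASSERT u.alias IS UNIQUE;

  // The lookup itself, with the alias passed as a parameter
  MATCH (user:USER {alias: $alias})
  RETURN user;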
In Redshift, one of my queries takes 3 hours to execute, and when analyzing its query plan it seems the network step is taking all the time. How can I troubleshoot and resolve this problem?
Below is my query execution plan:
XN Merge  (cost=1000000033805.99..1000000033819.89 rows=5560 width=1488)
  Merge Key: productndc10
  ->  XN Network  (cost=1000000033805.99..1000000033819.89 rows=5560 width=1488)
        Send to leader
        ->  XN Sort  (cost=1000000033805.99..1000000033819.89 rows=5560 width=1488)
              Sort Key: productndc10
              ->  XN Subquery Scan bi_sample_transparency_view  (cost=0.00..33460.13 rows=5560 width=1488)
                    ->  XN Multi Scan  (cost=0.00..33404.53 rows=5560 width=4980)
                          ->  XN Seq Scan on gskus_sample_transparency  (cost=0.00..33348.33 rows=5558 width=993)
                                Filter: ((to_date((productrequestdate)::text, 'YYYY-MM-DD'::text) <= '2022-01-31'::date) AND (to_date((productrequestdate)::text, 'YYYY-MM-DD'::text) >= '2020-01-01'::date) AND (quantityrequested >= 0) AND ((mstrclientid)::text = 'GSKUS'::text))
                          ->  XN Seq Scan on verri_sample_transparency  (cost=0.00..0.30 rows=1 width=4980)
                                Filter: ((to_date((productrequestdate)::text, 'YYYY-MM-DD'::text) <= '2022-01-31'::date) AND (to_date((productrequestdate)::text, 'YYYY-MM-DD'::text) >= '2020-01-01'::date) AND ((mstrclientid)::text = 'GSKUS'::text) AND (quantityrequested >= 0))
                          ->  XN Seq Scan on brsit_sample_transparency  (cost=0.00..0.30 rows=1 width=4980)
                                Filter: ((to_date((productrequestdate)::text, 'YYYY-MM-DD'::text) <= '2022-01-31'::date) AND (to_date((productrequestdate)::text, 'YYYY-MM-DD'::text) >= '2020-01-01'::date) AND ((mstrclientid)::text = 'GSKUS'::text) AND (quantityrequested >= 0))
As you say, this is the problematic step in the plan (the network transfer before the SORT, which isn't really a plan step but an activity that needs to be performed). With only 5,560 rows being reported, it doesn't seem like this should be a ton of data, but your column count is high and I don't know the sizes of these columns. It could be that there is a lot of data moving even for this limited number of rows. Or it could be that the reported number of rows is not indicative of the number of rows actually moved during the network activity, which can happen, but this would need to be a huge difference. You can look at stl_dist for this query to see exactly how much data (in bytes) is being moved.
Another possibility is that your query was a victim, not the culprit. Redshift is a cluster, clusters are connected by networks, and those networks are common infrastructure for all queries running on the cluster. If a really bad query was running during this window and browned out the internode network (a bandwidth hog), then your query was simply caught up in the traffic jam. Does your query run normally most of the time and only went slow this once? What was the cluster activity like at that time? Were other queries impacted? I've debugged plenty of "slow" queries that were victims. That said, in a clustered database like Redshift it is always good to avoid transferring excessive amounts of data over the network.
If you want to debug this query further (assuming it is the culprit), then the query text, the stl_dist information, and the EXPLAIN plan could shine some more light on the situation.
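As a starting point, something along these lines should show how many bytes each slice shipped for that query (the text filter and query id are placeholders):

  -- Find the query id of the slow run.
  SELECT query, starttime, substring(querytxt, 1, 60) AS sql_snippet
  FROM stl_query
  WHERE querytxt LIKE '%bi_sample_transparency_view%'
  ORDER BY starttime DESC
  LIMIT 10;

  -- Bytes and rows moved by the distribution steps of that query.
  SELECT slice, segment, step, rows, bytes
  FROM stl_dist
  WHERE query = 123456   -- replace with the query id from above
  ORDER BY bytes DESC;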
[Image: crawler snapshot showing the tables created]
I am unable to see the tables under the Databases tab in the AWS data lake / Glue console, even though the crawler log states that 2 tables have been created.
2020-09-05T15:16:45.020+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] BENCHMARK : Running Start Crawl for Crawler db1
2020-09-05T15:17:02.149+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] BENCHMARK : Classification complete, writing results to database db1
2020-09-05T15:17:02.150+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] INFO : Crawler configured with SchemaChangePolicy {"UpdateBehavior":"UPDATE_IN_DATABASE","DeleteBehavior":"DEPRECATE_IN_DATABASE"}.
2020-09-05T15:17:23.963+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] INFO : Created table customers in database db1
2020-09-05T15:17:23.965+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] INFO : Created table sales in database db1
2020-09-05T15:17:24.674+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] BENCHMARK : Finished writing to Catalog
2020-09-05T15:18:30.608+05:30 [7bf19dc8-e723-4852-b92f-ccd1ab313849] BENCHMARK : Crawler has finished running and is in state READY
The role has full admin policies attached. I have tried refreshing many times and still cannot see the tables.
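One way to narrow this down (a sketch; the region is a placeholder and the database name is taken from the log above) is to check whether the tables actually exist in the Data Catalog via the CLI, which separates a catalog problem from a console or permissions problem:

  # List the tables the crawler wrote to the catalog.
  aws glue get-tables --database-name db1 --region us-east-1

If the tables are returned here but are not visible in the console, the console region and any Lake Formation permissions on the database are worth checking.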
Currently, in our production environment we use Hive on Tez instead of the MapReduce engine, so I wanted to ask: are all the Hive join optimizations relevant for Tez as well? For example, for multi-table joins the documentation mentions that if the join key is the same across tables, a single MapReduce job is used. But when I checked an HQL query in our environment that left-outer-joins one table with many tables on the same key, I did not see 1 reducer; in fact there were 17 reducers running. Is this because Hive on Tez behaves differently from Hive on MR?
Hive version: 1.2
Hadoop: 2.7
Below is the documentation that mentions using only 1 reducer:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
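For reference, the pattern in question looks roughly like this (table and column names are illustrative, in the style of the examples on the linked wiki page); running EXPLAIN on Tez shows the Tez DAG with its map/reduce vertices rather than the single MR job described in the docs:

  -- All joins share the same key: the case the wiki describes as a single MR job on Hive-on-MR.
  EXPLAIN
  SELECT a.val, b.val, c.val
  FROM a
  LEFT OUTER JOIN b ON a.key = b.key
  LEFT OUTER JOIN c ON a.key = c.key;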
I have connected WSO2 BAM with an external Cassandra and successfully inserted data into it.
Now I have written a Hive query to move the data from this Cassandra into PostgreSQL. The data is fetched and inserted successfully, but the query takes a long time to execute. Why is this happening, and how can I reduce the execution time? Is there any way to do this?
Please help.
Are you using an external Cassandra 2.x?
If so, try setting the 'num_tokens' configuration in cassandra.yaml to 1. It defaults to 256 in Cassandra 2.x, whereas in older Cassandra versions this configuration is commented out by default and is therefore taken as 1.
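For example, in cassandra.yaml:

  num_tokens: 1

Note that changing num_tokens on a node that already holds data is not a simple edit-and-restart; it generally requires re-bootstrapping the node.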
Thanks,
Harsha