We have an issue migrating a Spark job from a single-master to a multi-master EMR cluster.
Multiple jobs that use Hive were migrated to the new cluster successfully, but one keeps failing even though it works fine on the single-master cluster.
ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:127)
at org.apache.spark.sql.hive.HiveExternalCatalog.doListPartitions(HiveExternalCatalog.scala:1226)
at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionLocations(HiveExternalCatalog.scala:1211)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionLocations(ExternalCatalogWithListener.scala:246)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionLocationsOptimized(SessionCatalog.scala:960)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionLocations(SessionCatalog.scala:941)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:196)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:196)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3390)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:200)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:92)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3389)
at org.apache.spark.sql.Dataset.(Dataset.scala:196)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:81)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:650)
The job fails while inserting into a partitioned external table:
"""insert into table {tableName} partition ({part1}, {part2}) select {columns} from {view}"""
- We've checked hive configurations between old and new clusters - no significant differences.
- We are able to connect to Hive and run any command from the terminal on that cluster.
- All the jobs are using the same metastore.
- We tried to increase metastore timeout - no luck.
- We tried to run this job from another master node - same result.
Any suggestions?
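For anyone trying to reproduce this, here is a minimal sketch of how a template like the one in the question is typically filled in before being passed to spark.sql. The function name and the table, column, and view names are hypothetical. The check assumes a dynamic-partition insert, where Hive and Spark expect the partition columns to be the last columns of the select list.

```python
INSERT_TEMPLATE = (
    "insert into table {tableName} partition ({part1}, {part2}) "
    "select {columns} from {view}"
)


def build_insert_sql(table_name, part1, part2, columns, view):
    """Fill in the insert template from the question (hypothetical helper)."""
    columns = list(columns)
    # Dynamic-partition inserts map the trailing select columns onto the
    # partition columns by position, so they must come last and in order.
    if columns[-2:] != [part1, part2]:
        raise ValueError(
            "partition columns must be the last columns of the select list"
        )
    return INSERT_TEMPLATE.format(
        tableName=table_name,
        part1=part1,
        part2=part2,
        columns=", ".join(columns),
        view=view,
    )
```

The resulting string would then be run with spark.sql(...), which is the call at the bottom of the stack trace above.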
I'm having trouble with a job I've set up on Dataflow.
Here is the context: I created a dataset on BigQuery with the following path
bi-training-gcp:sales.sales_data
In the properties I can see that the data location is "US".
Now I want to run a job on Dataflow, so I enter the following command in Cloud Shell:
gcloud dataflow sql query ' SELECT country, DATE_TRUNC(ORDERDATE , MONTH),
sum(sales) FROM bi-training-gcp.sales.sales_data group by 1,2 ' --job-name=dataflow-sql-sales-monthly --region=us-east1 --bigquery-dataset=sales --bigquery-table=monthly_sales
The console accepts the query and returns what looks like an acceptance message.
After that I go to the Dataflow dashboard. I can see a new job as queued, but after 5 minutes or so the job fails and I get the following error messages:
Error 2021-09-29T18:06:00.795Z Invalid/unsupported arguments for SQL job launch: Invalid table specification in Data Catalog: Could not resolve table in Data Catalog: bi-training-gcp.sales.sales_data
Error 2021-09-29T18:10:31.592036462Z Error occurred in the launcher container: Template launch failed. See console logs.
My guess is that it cannot find my table, maybe because I specified the wrong location/region. Since my table's location is "US", I assumed it would be on a US server (which is why I specified us-east1 as the region), but I tried all US regions with no success.
Does anybody know how I can solve this?
Thank you
This error occurs if the Dataflow service account doesn't have access to the Data Catalog API. To resolve this issue, enable the Data Catalog API in the Google Cloud project that you're using to write and run queries. Alternately, assign the roles/datacatalog.
I have a PySpark program running on an AWS EMR cluster.
The cluster configuration is: emr-5.31.0, hadoop 2.10.0, hive 2.3.7, hue 4.7.1, pig 0.17.0.
The program processes some files on HDFS, but at some point it starts getting errors.
In the Amazon console, under YARN applications - application_XXX (Spark) - executors - driver - stderr:
'could not obtain block ... file=
A little before this message there is 'Task 0 in stage 35 failed 4 times. aborting job'.
If I go to the Amazon console - YARN applications - application_XXX (Spark) - stages - 35 - tasks - 0 - stdout, I don't see anything bad at first glance except a lot of 'GC (allocation Failure)' messages.
In its stderr there is a WARN: 'Could not obtain block XXX, file= No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException.'
If I go to the monitoring tab - node status, I see that one node became unhealthy at that time, and that's it. The number of nodes also changed in the 'live data nodes', 'MR total nodes', 'MR active nodes', and 'MR lost nodes' charts.
As I understand it, the task cannot find the file on HDFS because the node that hosted it became unhealthy.
My question is where I can find the reason the node became unhealthy. I wasn't able to find any other logs in the Amazon console. Maybe there are some node-local places where this reason is stored?
Hi, I launched an EMR cluster myself some time ago and don't remember the details about the logs, but I consulted the docs here:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html
They state that the logs are stored on the machines (for which I assume you have the keys) and are also stored on S3 by default. I'm not sure in which bucket they will be created.
Best Regards :)
On the Summary page for your EMR cluster there is a section named "Configuration details".
Below that, there is a label named "Log URI". It points to an S3 URI, but, there is also a small folder icon.
Click on that icon and you can browse to the logs on the nodes for your EMR cluster.
Actually, on Amazon there are more logs accessible via the S3 location: logs for the node boot and configuration phase, and logs from the services running on the node (HDFS and YARN), which is what I was looking for. The path looks like this: s3 location/cluster id/node/node id/applications. There I was able to find the HDFS and YARN logs.
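The path layout described above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, the bucket, cluster ID, and instance ID in the usage note are placeholders, and the layout is simply the one reported in this answer.

```python
def node_application_log_prefix(log_uri, cluster_id, instance_id):
    """Build the S3 prefix that holds one node's application logs.

    Follows the layout reported above:
    <log URI>/<cluster id>/node/<node id>/applications/
    """
    parts = [log_uri.rstrip("/"), cluster_id, "node", instance_id, "applications"]
    return "/".join(parts) + "/"
```

For example, node_application_log_prefix("s3://my-bucket/emr-logs/", "j-XXXXXXXX", "i-0abc") gives a prefix that can be passed to `aws s3 ls --recursive` to list the HDFS and YARN logs of the node that went unhealthy.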
We have managed to get a valid connection from Azure Data Factory to our Azure Databricks cluster using the Spark (ODBC) connector. The list of tables comes back as expected, but when querying a specific table we get an exception.
ERROR [HY000] [Microsoft][Hardy] (35) Error from server: error code:
'0' error message:
'com.databricks.backend.daemon.data.common.InvalidMountException:
Error while using path xxxx for resolving path xxxx within mount at
'/mnt/xxxx'.'.. Activity ID:050ac7b5-3e3f-4c8f-bcd1-106b158231f3
In our case the Databricks tables are backed by mounted Parquet files stored in Azure Data Lake Gen2, which is what the above exception relates to. Any suggestions on how to solve this issue?
PS: the same error appears when connecting from Power BI Desktop.
Thanks
Bart
In your configuration to mount the lake can you add this setting:
"fs.azure.createRemoteFileSystemDuringInitialization": "true"
I haven't tried your exact scenario - however this solved a similar problem for me using Databricks-Connect.
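As a rough sketch of where that setting goes, the helper below builds the extra_configs dictionary for a mount. It assumes the lake is mounted with a service principal (OAuth client credentials), which may not match your setup; the function name and all argument values are hypothetical.

```python
def adls_gen2_mount_configs(client_id, client_secret, tenant_id):
    """Build extra_configs for mounting ADLS Gen2 (assumes service-principal OAuth)."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
        # The setting suggested in this answer.
        "fs.azure.createRemoteFileSystemDuringInitialization": "true",
    }

# In a Databricks notebook (not runnable elsewhere), the dictionary is passed as:
# dbutils.fs.mount(
#     source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
#     mount_point="/mnt/xxxx",
#     extra_configs=adls_gen2_mount_configs(client_id, client_secret, tenant_id),
# )
```

An existing mount would need to be unmounted and mounted again for the changed configuration to take effect.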
I'm able to connect to MySQL when running my PySpark code locally in a Jupyter notebook, but the same code gets a communication error when run in AWS Glue. I have added the MySQL jar under the required jar files when creating the job in AWS Glue.
Reading from MYSQL
dataframe_mysql = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://localhost/read").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "student").option("user", "root").option("password", "root").load()
Writing to MYSQL
df = sc.parallelize([[25, 'Prem'],
[20, 'Kate'],
[20, 'Kate'],
[40, 'Cheng']]).toDF(["Depy_id","Dept_name"])
df.write.format('jdbc').options(
url='jdbc:mysql://localhost/test',
driver='com.mysql.jdbc.Driver',
dbtable='dept',
user='root',
password='root').mode('overwrite').save()
Please note that you have to provide a valid database URL, not localhost. I believe your Jupyter notebook was run locally on a laptop, in the same local environment where your MySQL is running.
AWS Glue runs within the AWS environment, and behind the scenes it launches a number of EC2 instances depending on the DPU configuration. If your URL is set to localhost, the EC2 instance where the PySpark code is running will look for a MySQL database on that same node.
Please make sure you have a valid public IP for the MySQL database, set up a connection in AWS Glue as suggested by bdcloud, and retry. If you don't want to create a connection, you can hard-code the connection parameters in the code and try again. If you cannot get a public IP for the installed MySQL database, you could set up an RDS MySQL instance on AWS and use it for testing.
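To make the localhost point concrete, here is a minimal sketch of a helper that builds the JDBC URL and refuses loopback hosts, since those would resolve to the Glue worker itself. The function name and the host in the usage note are hypothetical.

```python
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def mysql_jdbc_url(host, database, port=3306):
    """Build a MySQL JDBC URL, rejecting hosts that only work on a local machine."""
    if host.strip().lower() in LOOPBACK_HOSTS:
        raise ValueError(
            f"{host!r} points at the machine running the job; "
            "use the database's reachable hostname or IP instead"
        )
    return f"jdbc:mysql://{host}:{port}/{database}"
```

The result, for example mysql_jdbc_url("mydb.example.com", "test"), can be passed as the url option of the JDBC reader and writer shown in the question.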
Sample code snippet:
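import mysql.connector

# url, uname, pwd and dbase hold the connection details of the reachable MySQL host
conn = mysql.connector.connect(host=url, user=uname, password=pwd, database=dbase)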
cur = conn.cursor()
insertQry = "INSERT INTO emp (id, emp_name, dept, designation, address1, city, state, active_start_date, is_active) SELECT (SELECT coalesce(MAX(ID),0) + 1 FROM atlas.emp) id, tmp.emp_name, tmp.dept, tmp.designation, tmp.address1, tmp.city, tmp.state, tmp.active_start_date, tmp.is_active from EMP_STG tmp ON DUPLICATE KEY UPDATE dept=tmp.dept, designation=tmp.designation, address1=tmp.address1, city=tmp.city, state=tmp.state, active_start_date=tmp.active_start_date, is_active =tmp.is_active ;"
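cur.execute(insertQry)  # in mysql-connector-python execute() returns None, so read rowcount instead
conn.commit()  # persist the insert; autocommit is off by default
print("Rows affected:", cur.rowcount)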
Please refer to the AWS Glue connections section:
yes that's true I am able to connect it as above by just adding the connection to the job as well as changing the local host to the respective
I have around 500 GB of compressed data in Amazon S3 that I want to load into Amazon Redshift. For that, I created an internal table in AWS Athena, and I am trying to load its data into an internal table in Amazon Redshift.
Loading this much data into Amazon Redshift takes more than an hour. The problem is that the load query gets aborted after 1 hour. I tried 2-3 times, and each time it was aborted after 1 hour. I am using the Aginity tool to run the query. Meanwhile, Aginity still shows the query as running, with the loader spinning.
More Details:
The Redshift cluster has 12 nodes with 2 TB of space per node, and I have used 1.7 TB.
The S3 files are not the same size. One of them is 250 GB; some are only a few MB.
I am using the command
create table table_name as select * from athena_schema.table_name
It stops after exactly 1 hour.
Note: I have set the current query timeout in Aginity to 90000 seconds.
I know this is an old thread, but for anyone coming here with the same issue: in my case at least, the problem was the Aginity client. It is not related to Redshift or its Workload Manager, only to the third-party client. In summary, use a different client such as SQL Workbench and run the COPY command from there.
Hope this helps!
Carlos C.
More information, about my environment:
Redshift:
Cluster Type: Multi Node
Cluster: ds2.xlarge
Nodes: 4
Cluster Version: 1.0.4852
Client Environment:
Aginity Workbench for Redshift
Version 4.9.1.2686 (build 05/11/17)
Microsoft Windows NT 6.2.9200.0 (64-bit)
Network:
Connected to OpenVPN, via SSH Port tunneling.
The connection is not being dropped. This issue is only affecting the COPY command. The connection remains active.
Command:
copy tbl_XXXXXXX
from 's3://***************'
iam_role 'arn:aws:iam::***************:role/***************';
S3 Structure:
120 files of 6.2 GB each. 20 files of 874MB.
Output:
ERROR: 57014: Query (22381) cancelled on user's request
Statistics:
Start: ***************
End: ***************
Duration: 3,600.2420863
I'm not sure whether the following answer will solve your exact problem of the timeout at exactly 1 hour.
But in my experience, the COPY command is the best and fastest way to load data into Redshift, so I feel the timeout issue shouldn't happen at all in your case.
The COPY command in Redshift can load data from S3, from an EMR cluster, or from a remote host via SSH.
e.g.
Simple copy
copy sales from 'emr://j-SAMPLE2B500FC/myoutput/part-*' iam_role
'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' lzop;
e.g. Using a manifest
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
PS: If you use a manifest and split your data into multiple files, the load will also be faster, because Redshift loads the files in parallel.
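For reference, the manifest is a small JSON file listing the objects to load. A minimal sketch of generating one follows; the function name, bucket, and file names are hypothetical.

```python
import json


def build_copy_manifest(s3_urls, mandatory=True):
    """Build a Redshift COPY manifest listing the given S3 objects."""
    return {"entries": [{"url": url, "mandatory": mandatory} for url in s3_urls]}


manifest = build_copy_manifest([
    "s3://mybucket/cust/part-0000.gz",
    "s3://mybucket/cust/part-0001.gz",
])
print(json.dumps(manifest, indent=2))
```

The printed JSON would be uploaded to S3 (for example as s3://mybucket/cust.manifest) and referenced in the COPY command with the manifest option, as in the example above.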