When I try to run Hive commands that involve MapReduce, I get the following error; please help me solve this:
hive (hiveclass)> create table companies2 as select * from companies;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20161105115631_5868a474-de3c-4cc9-80f9-40a6fac60fa2
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1478325284966_0010, Tracking URL = http://ubuntu:8088/proxy/application_1478325284966_0010/
Kill Command = /home/hduser/hadoop/bin/hadoop job -kill job_1478325284966_0010
After the Kill Command line, nothing happens. I then have to press Ctrl+C and exit the Hive shell.
I need to automatically schedule a bq load process that picks up AVRO files from a GCS bucket, loads them into BigQuery, and waits for completion so that another task can run afterwards, specifically a task that reads from the above-mentioned table.
As shown here, there is a nice API to run this [command][1], for example:
bq load \
--source_format=AVRO \
mydataset.mytable \
gs://mybucket/mydata.avro
This will give me a job_id
Waiting on bqjob_Rsdadsajobc9sda5dsa17_0sdasda931b47_1 ... (0s) Current status
a job_id that I can check with bq show --job=true bqjob_Rsdadsajobc9sda5dsa17_0sdasda931b47_1
And that is nice... I guess under the hood the bq load job is a DataTransfer. I found some operators related to this: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/bigquery_dts.html
Even though the documentation does not specifically cover AVRO load configuration, digging through it gave me what I was looking for.
My question is: is there an easier way to get the status of a job, given a job_id, similar to the bq show --job=true <job_id> command?
Is there something that would spare me from creating a DataTransferJob, starting it, monitoring it and then deleting it (I don't need it to stay around, since the parameters will change next time)?
Maybe a custom sensor, using the python-sdk-api?
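To make the idea concrete, the kind of custom sensor I have in mind would look roughly like this (just a sketch based on the google-cloud-bigquery client; the class name and polling logic are my own guess, not tested):

    from airflow.sensors.base import BaseSensorOperator
    from google.cloud import bigquery


    class BigQueryJobStateSensor(BaseSensorOperator):
        """Pokes a BigQuery job by job_id until it reaches the DONE state."""

        def __init__(self, job_id, location='US', **kwargs):
            super().__init__(**kwargs)
            self.job_id = job_id
            self.location = location

        def poke(self, context):
            client = bigquery.Client()
            job = client.get_job(self.job_id, location=self.location)
            self.log.info('Job %s state: %s', self.job_id, job.state)
            if job.state == 'DONE':
                if job.error_result:
                    raise RuntimeError(f'Job failed: {job.error_result}')
                return True
            return False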
Thank you in advance.
[1]: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro
I think you can use the GCSToBigQueryOperator operator and task sequencing in Airflow to solve your issue:
import airflow
from airflow.operators.dummy import DummyOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with airflow.DAG(
        'dag_name',
        default_args=your_args,
        schedule_interval=None) as dag:

    load_avro_to_bq = GCSToBigQueryOperator(
        task_id='load_avro_file_to_bq',
        bucket='{your_bucket}',
        source_objects=['folder/{SOURCE}/*.avro'],
        destination_project_dataset_table='{your_project}:{your_dataset}.{your_table}',
        source_format='AVRO',
        compression=None,
        create_disposition='CREATE_NEVER',
        write_disposition='WRITE_TRUNCATE'
    )

    second_task = DummyOperator(task_id='next_operator', dag=dag)

    load_avro_to_bq >> second_task
The first operator loads the Avro file from GCS into BigQuery.
If this operator succeeds, the second task is executed; otherwise it is not.
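If the second task is the one that reads from the freshly loaded table (as described in the question), the DummyOperator placeholder could be swapped, inside the same DAG block, for e.g. a BigQueryInsertJobOperator running a query against that table. A rough sketch, with placeholder project/dataset/table names:

    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

    read_loaded_table = BigQueryInsertJobOperator(
        task_id='read_loaded_table',
        configuration={
            'query': {
                'query': 'SELECT COUNT(*) FROM `{your_project}.{your_dataset}.{your_table}`',
                'useLegacySql': False,
            }
        },
    )

    load_avro_to_bq >> read_loaded_table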
I am trying to process the billed bytes of each BigQuery job run by all users. I was able to find the details in the BigQuery UI under Project History. Also, running bq --location=europe-west3 show --job=true --format=prettyjson JOB_ID in Google Cloud Shell gives exactly the information that I want (BQ SQL query, billed bytes, run time for each BigQuery job).
For the next step, I want to access the JSON returned by the above command on my local machine. I have already configured the gcloud CLI properly, and I am able to list BigQuery jobs using gcloud alpha bq jobs list --show-all-users --limit=10.
I select a job id and run the following command: gcloud alpha bq jobs describe JOB_ID --project=PROJECT_ID.
I get (gcloud.alpha.bq.jobs.describe) NOT_FOUND: Not found: Job PROJECT_ID:JOB_ID--toyFH. This is possibly because of the creation and end times, as shown here.
What am I doing wrong? Is there another way to get the details of a BigQuery job using the gcloud CLI (or maybe a way to get the billed bytes along with the query details using the Python SDK)?
You can get the job details through different APIs, or the way you are already doing it, but first, why are you using the alpha version of the bq command?
To do it in Python, you can try something like this:
from google.cloud import bigquery

def get_job(
    client: bigquery.Client,
    location: str = "us",
    job_id: str = "<JOB_ID>",  # replace with the job id you want to inspect
) -> None:
    job = client.get_job(job_id, location=location)

    print(f"{job.location}:{job.job_id}")
    print(f"Type: {job.job_type}")
    print(f"State: {job.state}")
    print(f"Created: {job.created.isoformat()}")
There are more properties that you can read from the job object. Also, check the status of the job in the console first, so you can compare the two.
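For example, if what you are ultimately after is the billed bytes per job for all users (as in the question), a sketch along these lines might help; the project id is a placeholder, and total_bytes_billed is only populated for query jobs:

    from google.cloud import bigquery

    client = bigquery.Client(project="your-project-id")  # placeholder project

    # List recent jobs from all users and print billed bytes for query jobs.
    for job in client.list_jobs(all_users=True, max_results=10):
        print(f"{job.job_id} ({job.job_type}, {job.state})")
        if job.job_type == "query":
            print(f"  Query:        {job.query}")
            print(f"  Bytes billed: {job.total_bytes_billed}")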
You can find more details here: https://cloud.google.com/bigquery/docs/managing-jobs#python
I have a Glue ETL job (using PySpark) that, seemingly at random, gives a timeout error when trying to access the awsglueml.transforms.FindMatches library. The error shown on the Glue dashboard is:
An error occurred while calling z:com.amazonaws.services.glue.ml.FindMatches.apply. The target server failed to respond
Basically, if I run this Glue ETL job late at night, it succeeds most of the time. But if I run it in the middle of the day, it fails with this error. Sometimes just retrying it enough times makes it succeed, but that doesn't seem like a good solution. It seems like the AWS FindMatches service doesn't have enough capacity for everyone who wants to use it, but I could be wrong here.
The Glue ETL job was set up using the option "A proposed script generated by AWS Glue".
The line of code that times out is one that was provided by Glue when I created this job:
from awsglueml.transforms import FindMatches
...
findmatches2 = FindMatches.apply(frame = datasource0, transformId = "<redacted>", computeMatchConfidenceScores = True, transformation_ctx = "findmatches2")
I'd welcome any information on this elusive issue.
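For what it's worth, one way to automate the "retry until it works" approach inside the job itself would be a wrapper like this (just a sketch; the retry count and sleep are arbitrary choices of mine, not a recommended fix):

    import time

    findmatches2 = None
    last_error = None
    for attempt in range(5):
        try:
            findmatches2 = FindMatches.apply(
                frame=datasource0,
                transformId="<redacted>",
                computeMatchConfidenceScores=True,
                transformation_ctx="findmatches2",
            )
            break
        except Exception as err:  # the timeout surfaces as a Py4J error
            last_error = err
            time.sleep(60)  # arbitrary pause before the next attempt
    if findmatches2 is None:
        raise last_error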
Exactly as in this AWS forum question, I was running two jobs concurrently. The job was configured with Max concurrency: 10, but when executing job.commit() I received this error message:
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.amazonaws.services.glue.util.Job.commit.
: com.amazonaws.services.gluejobexecutor.model.VersionMismatchException:
Continuation update failed due to version mismatch. Expected version 6 but found version 7
(Service: AWSGlueJobExecutor; Status Code: 400; Error Code: VersionMismatchException; Request ID: 123)
The two jobs read different portions of the data.
I can't understand what the problem is here or how to deal with it. Can anyone help?
Reporting @bgiannini's answer from this other AWS forum question, it looks like the "version" refers to job bookmarking.
If multiple instances of the same job run simultaneously (i.e. max concurrency > 1) and use bookmarks, then when job run 1 calls job.init() it gets a version, and job.commit() seems to expect a certain value (+1 to the version for every job.commit() that is executed, I guess?). If job run 2 started at the same time, got the same initial version from job.init(), and then submits job.commit() before job run 1 does, job run 1 no longer finds the version it expected.
I was indeed running the two jobs with Job bookmark: Enable. When I disabled bookmarking, it started working for me.
I understand it might not be the best solution but it can be a good compromise.
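If it helps, bookmarks can also be disabled for a single run when starting the job, for example with the AWS CLI (the job name here is a placeholder):

    aws glue start-job-run --job-name JobA --arguments '{"--job-bookmark-option":"job-bookmark-disable"}'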
The default JobName for your bookmark is the glue JOB_NAME, but it doesn't have to be.
Say you have a Glue job called JobA that executes concurrently with different input parameters. You have two concurrent executions with an input parameter contextName; let's call the values passed into this parameter contextA and contextB.
The default initialisation in your pyspark script is:
Job.init(args['JOB_NAME'], args)
but you can change this to be unique for your execution context. Instead:
Job.init(args['JOB_NAME']+args['contextName'], args)
This is unique for each concurrent execution, so the runs would never clash. When you view the bookmark state from the CLI for this job, you'd need to view it like this:
aws glue get-job-bookmark --job-name "jobAcontextA"
or
aws glue get-job-bookmark --job-name "jobAcontextB"
You wouldn't be able to use the UI to pause or reset the bookmark; you'd need to do it programmatically.
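A sketch of how that could be done with boto3, assuming the per-context job names from the example above:

    import boto3

    glue = boto3.client('glue')

    # Inspect the bookmark state for one execution context.
    bookmark = glue.get_job_bookmark(JobName='jobAcontextA')
    print(bookmark['JobBookmarkEntry'])

    # Reset the bookmark for that context only.
    glue.reset_job_bookmark(JobName='jobAcontextA')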
I am in the process of migrating a database from an external server to Cloud SQL 2nd gen. I have been following the recommended steps; the 2 TB mysqldump process completed and replication started. However, I got an error:
'Error ''Access denied for user ''skip-grants user''@''skip-grants host'' (using password: NO)'' on query. Default database: ''mondovo_db''. Query: ''LOAD DATA INFILE ''/mysql/tmp/SQL_LOAD-0a868f6d-8681-11e9-b5d3-42010a8000a8-6498057-322806.data'' IGNORE INTO TABLE seoi_volume_update_tracker FIELDS TERMINATED BY ''^#^'' ENCLOSED BY '''' ESCAPED BY ''\'' LINES TERMINATED BY ''^|^'' (keyword_search_volume_id)'''
Two questions:
1) I'm guessing the error has come about because Cloud SQL requires LOAD DATA LOCAL INFILE instead of LOAD DATA INFILE? However, I am quite sure that on the master we only run LOAD DATA LOCAL INFILE, so I am not sure how the LOCAL gets dropped during replication. Is that possible?
2) I can't stop the slave to skip the error and restart it, since SUPER privileges aren't available, so I am not sure how to skip this error and also avoid it in the future while the final sync happens. Suggestions?
There was no way to work around the slave replication error in Google Cloud SQL, so I had to come up with another approach.
Since replication wasn't going to work, I had to copy all the databases over. However, because the aggregate size of all my DBs was 2 TB, that was going to take a long time.
The final strategy that took the least amount of time:
1) Prerequisite: you need at least 1.5x the current database size in free disk space on your SQL drive. My 2 TB DB was on a 2.7 TB SSD, so I had to temporarily move everything to a 6 TB SSD before I could proceed with the steps below. DO NOT proceed without sufficient disk space; you'll waste a lot of your time, as I did.
2) Install cloudsql-import on your server. Without this you can't proceed, and it took me a while to discover that. It facilitates the quick transfer of your SQL dumps to Google.
3) I had multiple databases to migrate, so if you are in a similar situation, pick one at a time, and for the sites that access that DB, prevent any further insertions/updates. I put a "Website under Maintenance" notice on each site while I executed the operations outlined below.
4) Run the commands in the steps below in a separate screen. I launched a few processes in parallel on different screens.
screen -S DB_NAME_import_process
5) Run a mysqldump using the following command. Note that the output is a plain SQL file, not a compressed one:
mysqldump {DB_NAME} --hex-blob --default-character-set=utf8mb4 --skip-set-charset --skip-triggers --no-autocommit --single-transaction --set-gtid-purged=off > {DB_NAME}.sql
6) (Optional) For my largest DB, around 1.2 TB, I also split the backup into individual per-table SQL files using the script mentioned here: https://stackoverflow.com/a/9949414/1396252
7) For each of the files dumped, I converted the INSERT commands into INSERT IGNORE because I didn't want any further duplicate errors during the import process.
cat {DB_OR_TABLE_NAME}.sql | sed s/"^INSERT"/"INSERT IGNORE"/g > new_{DB_OR_TABLE_NAME}_ignore.sql
8) Create a database on Google Cloud SQL with the same name as the one you want to import. Also create a global user that has permission to access all the databases.
9) Now we import the SQL files using the cloudsql-import tool. If you split the larger DB into individual table files in step 6, use the cat command to combine them into batches, making as many batch files as you see appropriate; an example is shown below.
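For example, combining the per-table dumps produced in steps 6 and 7 into one batch file might look like this (the file names are placeholders):

cat new_{TABLE_1}_ignore.sql new_{TABLE_2}_ignore.sql > {BATCH_1}.sql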
Run the following command:
cloudsql-import --dump={DB_OR_TABLE_NAME}.sql --dsn='{DB_USER_ON_GCLOUD}:{DB_PASSWORD}@tcp({GCLOUD_SQL_PUBLIC_IP}:3306)/{DB_NAME_CREATED_ON_GOOGLE}'
10) While the process is running, you can step out of the screen session using Ctrl+a + Ctrl+d (or refer here) and then reconnect to the screen later to check on progress. You can create another screen session and repeat the same steps for each of the DBs/batches of tables that you need to import.
Because of the large sizes I had to import, I believe it took me a day or two. I don't remember exactly, since it's been a few months, but I know it was much faster than any other way. I had tried using Google's copy utility to copy the SQL files to Cloud Storage and then using Cloud SQL's built-in visual import tool, but that was slow and not nearly as fast as cloudsql-import. I would recommend this method until Google adds the ability to skip slave errors.