Submitting the same Oozie workflow job multiple times at the same time - mapreduce

I am wondering how Oozie handles conflicts (if any really exist) when I submit two identical workflow jobs (just the Oozie sample examples) at the same time.
I can submit the two jobs successfully, and the Oozie server returns two different job IDs. In the Oozie Web Console, I saw the status of both jobs as RUNNING, and then both as SUCCEEDED after some time.
My workflow.xml is as follows:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf">
<start to="mr-node"/>
<action name="mr-node">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/user/${wf:user()}/mapreduce_test/output-data"/>
</prepare>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>mapred.mapper.class</name>
<value>org.apache.oozie.example.SampleMapper</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>org.apache.oozie.example.SampleReducer</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>1</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/${wf:user()}/mapreduce_test/input</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/${wf:user()}/mapreduce_test/output-data/</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
I know that deleting the output directory in the "prepare" element helps make the action repeatable and enables retries after failure, and I also understand the basic action run model.
So, my questions are:
Are the two identical jobs really running concurrently? (I saw both in RUNNING state in the Oozie Web Console.)
Is there a write conflict? (The two identical jobs point to the same output directory.)

Oozie does not detect any job duplication or anything like it. It accepts the workflow jobs, schedules them on the cluster for execution, and monitors them until completion or failure.
Are the two identical jobs really running concurrently? (I saw both in RUNNING state in the Oozie Web Console.)
Yes. Both jobs will run concurrently.
Is there a write conflict? (The two identical jobs point to the same output directory.)
Oozie does not have any checks related to write conflicts. I guess these are taken care of by either the MapReduce or HDFS framework.

As per your question:
1. Oozie schedules jobs on the cluster for execution until they finish with a status such as SUCCEEDED or FAILED.
2. Both jobs will run at the same time and will execute the same actions that have been defined.
To avoid this, you can perform the steps below, which may help a bit.
Oozie jobs are started by submitting job.properties or coordinator.properties, and the workflow then runs at the interval specified in job.xml/coordinator.xml.
So whenever a request is submitted, a fresh entry is made in the COORD_JOBS table (for coordinators) or the WF_JOBS table (for workflows) in Oozie's metastore DB (which could be Oracle/MySQL/PostgreSQL/Derby).
So even though the job has already been triggered, it can be started repeatedly, because a new ID is assigned to each submission (the coordinator job ID is assigned on an incremental basis).
One way to avoid duplicate processing of the same job is a validation check on the metastore DB side.
Create a trigger on the COORD_JOBS table in the metastore DB that checks the table for an entry with the same job name, with a check along these lines (MySQL syntax shown):
IF (SELECT COUNT(*) FROM COORD_JOBS WHERE app_name = NEW.app_name AND status = 'RUNNING') > 0 THEN
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Error: a job with this app_name is already RUNNING';
END IF;
Such a trigger on the COORD_JOBS/WF_JOBS tables will fire every time Oozie tries to insert a new job.
The COORD_JOBS table can be replaced with the WF_JOBS table, which stores the details of workflow jobs started from coordinator.properties.
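For completeness: if these submissions are driven by a coordinator, Oozie itself offers a built-in knob that is lighter-weight than a metastore trigger, namely the coordinator's <controls> block, which can cap how many materialized actions run at once. This is not from the answer above, just a minimal sketch; the app name, frequency, start/end times, and ${workflowAppPath} are placeholders:
<coordinator-app name="map-reduce-coord" frequency="${coord:hours(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <controls>
        <!-- discard a materialized action if it has been waiting more than 10 minutes -->
        <timeout>10</timeout>
        <!-- allow only one action of this coordinator to run at a time -->
        <concurrency>1</concurrency>
    </controls>
    <action>
        <workflow>
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>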

Related

Accessing GCS with Hadoop client from outside of Cloud

I want to access Google Cloud Storage via the Hadoop client. I want to use it on a machine outside of Google Cloud.
I followed instructions from here.
I created a service account and generated a key file. I also created a core-site.xml file and downloaded the necessary library.
However, when I try to run a simple hdfs dfs -ls gs://bucket-name command, all I get is this:
Error getting access token from metadata server at: http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token
When I do this inside Google Cloud it works, but when trying to connect to GCS from outside, it shows the error above.
How can I connect to GCS with the Hadoop client in this way? Is it even possible? I have no route to the 169.254.169.254 address.
Here is my core-site.xml (I changed the key path and email in this example):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>spark.hadoop.google.cloud.auth.service.account.enable</name>
<value>true</value>
</property>
<property>
<name>spark.hadoop.google.cloud.auth.service.account.json.keyfile</name>
<value>path/to/key.json</value>
</property>
<property>
<name>fs.gs.project.id</name>
<value>ringgit-research</value>
<description>
Optional. Google Cloud Project ID with access to GCS buckets.
Required only for list buckets and create bucket operations.
</description>
</property>
<property>
<name>fs.AbstractFileSystem.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
<description>The AbstractFileSystem for gs: uris.</description>
</property>
<property>
<name>fs.gs.auth.service.account.email</name>
<value>myserviceaccountaddress#google</value>
<description>
The email address is associated with the service account used for GCS
access when fs.gs.auth.service.account.enable is true. Required
when authentication key specified in the Configuration file (Method 1)
or a PKCS12 certificate (Method 3) is being used.
</description>
</property>
</configuration>
It could be that the Hadoop services haven't picked up the updates made in your core-site.xml file yet, so my suggestion is to restart the Hadoop services. Another action you can take is to check the access control options [1].
If you are still having the same issue after taking the suggested actions, please post the complete error message.
[1] https://cloud.google.com/storage/docs/access-control/
The problem was that I tried the wrong authentication method. The method I used assumes that it is running inside Google Cloud and tries to connect to Google's metadata servers. When running outside of Google Cloud, it doesn't work, for obvious reasons.
The answer to this is here: Migrating 50TB data from local Hadoop cluster to Google Cloud Storage, with the proper core-site.xml in the selected answer.
The property fs.gs.auth.service.account.keyfile should be used instead of spark.hadoop.google.cloud.auth.service.account.json.keyfile. The only difference is that this property needs a p12 key file instead of a JSON one.
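Putting that fix together, a minimal sketch of how the relevant core-site.xml entries might look with the p12-based property (the property names are those referenced above and in the question's own config; the email and key path are placeholders):
<property>
    <name>fs.gs.auth.service.account.enable</name>
    <value>true</value>
</property>
<property>
    <name>fs.gs.auth.service.account.email</name>
    <value>my-service-account@my-project.iam.gserviceaccount.com</value>
</property>
<property>
    <name>fs.gs.auth.service.account.keyfile</name>
    <value>/path/to/key.p12</value>
</property>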

How to configure Hive EMR to use S3 as the default filesystem and warehouse

I'm trying to configure EMR Hive to use S3 as the database warehouse and default location for internal/managed databases/tables.
I have already configured S3 as the default filesystem for Hadoop by setting fs.defaultFS to s3://.. in the core-site.xml of the namenode host -
<property>
<name>fs.defaultFS</name>
<value>s3://***********</value>
</property>
Some properties were present by default which I didn't touch
<property>
<name>fs.s3.impl</name>
<value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
<property>
<name>fs.s3n.impl</name>
<value>com.amazon.ws.emr.hadoop.fs.EmrFileSystem</value>
</property>
<property>
<name>fs.s3.buffer.dir</name>
<value>/mnt/s3,/mnt1/s3</value>
<final>true</final>
</property>
<property>
<name>fs.s3.buckets.create.region</name>
<value>sa-east-1</value>
</property>
<property>
<name>fs.s3bfs.impl</name>
<value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.s3.impl</name>
<value>org.apache.hadoop.fs.s3.EMRFSDelegate</value>
</property>
This works as expected with the hadoop fs commands and I'm able to access the S3 bucket.
To configure Hive to use S3, I made these changes to the hive-site.xml of the namenode host, where hive-server2 is also running
<property>
<name>fs.defaultFS</name>
<value>s3://***********</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>s3://***********/user/hive/warehouse</value>
</property>
After restarting the components hadoop-hdfs-namenode, hive-server2, hadoop-yarn-resourcemanager, I try to use the Hive CLI to create a database.
create database test
But when I describe that database, it gives me its location as hdfs://..
I got that fixed by using s3a:// as the URI scheme in hive-site.xml; the database location is then s3a://.. but when I try to insert data into it, it fails with the error -
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
The hadoop classpath includes the path where the hadoop-aws jars are present -
/etc/hadoop/conf:/usr/lib/hadoop/lib/:/usr/lib/hadoop/.//:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/:/usr/lib/hadoop-hdfs/.//:/usr/lib/hadoop-yarn/lib/:/usr/lib/hadoop-yarn/.//:/usr/lib/hadoop-mapreduce/lib/:/usr/lib/hadoop-mapreduce/.//::/etc/tez/conf:/usr/lib/tez/:/usr/lib/tez/lib/:/usr/lib/hadoop-lzo/lib/:/usr/share/aws/aws-java-sdk/:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/:/usr/share/aws/emr/emrfs/auxlib/:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/
Please mention in the comments if any other information is required to debug the issue. Thanks in advance!

How to Rebuild Sitecore Index without blowing my log files?

I have a Sitecore + Coveo system. I have automated the index rebuild/refresh using a command, but while performing the rebuild/refresh my log files grow to ~40 GB.
Is there any way I can restrict logging during the rebuild/refresh?
You need to set the logging level for your Crawling log. In the web.config file, find the logger called Sitecore.Diagnostics.Crawling and set the log level.
This is mine, set to INFO
<logger name="Sitecore.Diagnostics.Crawling" additivity="false">
<level value="INFO" />
<appender-ref ref="CrawlingLogFileAppender" />
</logger>
That should reduce the amount of logs written. If you want to reduce it even further, you can set it to ERROR or NONE, but I would not recommend NONE.

Hibernate running on separate JVM fail to read

I am implementing a web service with Hibernate to write/read data to a database (MySQL). One big issue I have is that when I successfully insert data (e.g., into the USER table) via one JVM (for example, a JUnit test or directly from a DB UI suite), my web service's Hibernate, running on a separate JVM, cannot find this new data. They all point to the same DB server. Only if I destroy the web service's Hibernate SessionFactory and recreate it can the web service's Hibernate layer read the newly inserted data. In contrast, the same JUnit test or a direct query from the DB UI suite can find the inserted data.
Any assistance is appreciated.
This issue is resolved today with the following:
I changed our Hibernate config file (hibernate.cfg.xml) to set the isolation level to at least "2" (READ COMMITTED). This immediately resolved the issue above. To understand more about this isolation level setting, please refer to these (a sketch of the config entry follows the links):
Hibernate reading function shows old data
Transaction isolation levels relation with locks on table
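A minimal sketch of what that entry typically looks like in hibernate.cfg.xml, assuming the standard hibernate.connection.isolation property (the value is the java.sql.Connection isolation constant; 2 = TRANSACTION_READ_COMMITTED):
<!-- inside <session-factory> -->
<property name="hibernate.connection.isolation">2</property>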
I ensured I did not use 2nd-level caching by setting CacheMode to IGNORE for each of my Session objects:
Session session = getSessionFactory().openSession();
session.setCacheMode(CacheMode.IGNORE);
Reference only: Some folks did the following in hibernate.cfg.xml to disable their 2nd level caching in their apps (BUT I didn't need to):
<property name="cache.provider_class">org.hibernate.cache.internal.NoCacheProvider</property>
<property name="hibernate.cache.use_second_level_cache">false</property>
<property name="hibernate.cache.use_query_cache">false</property>

MapReduce inefficient reducer

What would cause only a single reducer in a MapReduce job apart from all the keys output by the map function being the same?
Possible causes:
Your cluster still has the default setting of having only 1 reducer (= default value).
Your code explicitly sets the value to be 1 reducer.
You are running in local mode (i.e. no cluster at all).
Quote from mapred-default.xml
<property>
<name>mapred.reduce.tasks</name>
<value>1</value>
<description>The default number of reduce tasks per job. Typically set to 99%
of the cluster's reduce capacity, so that if a node fails the reduces can
still be executed in a single wave.
Ignored when mapred.job.tracker is "local".
</description>
</property>
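If the reducer count is being pinned by configuration rather than by code, the override looks like the quoted property, just with a higher value; a minimal sketch for mapred-site.xml or the job configuration (the value 8 is only a placeholder, and newer Hadoop releases name the property mapreduce.job.reduces):
<property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
</property>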