I'm having a hard time setting up a 2-node Cassandra cluster on EC2 instances. This is version 2.2.19. I cannot upgrade because of other dependencies.
The EC2 instances are in a private subnet and have static private IPs assigned.
Here is my cassandra.yaml:
cluster_name: 'Test-cluster'
data_file_directories:
    - /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
saved_caches_directory: /var/lib/cassandra/saved_caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "${private_ip}"
listen_address: ${private_ip}
start_native_transport: true
native_transport_port: 9042
storage_port: 7000
num_tokens: 32
ssl_storage_port: 9042
start_rpc: true
rpc_address: ${private_ip}
rpc_port: 9160
broadcast_rpc_address: ${private_ip}
endpoint_snitch: Ec2Snitch
partitioner: org.apache.cassandra.dht.RandomPartitioner
Here is my system.log
INFO [main] 2021-06-07 18:42:41,900 DatabaseDescriptor.java:327 - DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
INFO [main] 2021-06-07 18:42:42,022 DatabaseDescriptor.java:437 - Global memtable on-heap threshold is enabled at 251MB
INFO [main] 2021-06-07 18:42:42,023 DatabaseDescriptor.java:441 - Global memtable off-heap threshold is enabled at 251MB
ERROR [main] 2021-06-07 18:42:42,049 CassandraDaemon.java:787 - Exception encountered during startup
org.apache.cassandra.exceptions.ConfigurationException: Error instantiating snitch class 'org.apache.cassandra.locator.Ec2Snitch'.
at org.apache.cassandra.utils.FBUtilities.construct(FBUtilities.java:551) ~[apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.utils.FBUtilities.construct(FBUtilities.java:529) ~[apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.config.DatabaseDescriptor.createEndpointSnitch(DatabaseDescriptor.java:741) ~[apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.config.DatabaseDescriptor.applyConfig(DatabaseDescriptor.java:465) ~[apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:133) ~[apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:599) [apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:774) [apache-cassandra-2.2.19.jar:2.2.19]
Caused by: org.apache.cassandra.exceptions.ConfigurationException: Ec2Snitch was unable to execute the API call. Not an ec2 node?
at org.apache.cassandra.locator.Ec2Snitch.awsApiCall(Ec2Snitch.java:79) ~[apache-cassandra-2.2.19.jar:2.2.19]
at org.apache.cassandra.locator.Ec2Snitch.<init>(Ec2Snitch.java:55) ~[apache-cassandra-2.2.19.jar:2.2.19]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_282]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_282]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_282]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_282]
at java.lang.Class.newInstance(Class.java:442) ~[na:1.8.0_282]
at org.apache.cassandra.utils.FBUtilities.construct(FBUtilities.java:536) ~[apache-cassandra-2.2.19.jar:2.2.19]
Note: When I change snitch to SimpleSnitch it actually works.
Please help!!
Answering my own question
Ec2Snitch uses IMDSv1 to fetch metadata from http://169.254.169.254/latest/meta-data/placement/availability-zone, which it uses to determine properties such as the node's datacenter and rack.
I created the EC2 instances through Terraform, where my code had:
metadata_options {
  http_endpoint = "enabled"
  http_tokens   = "required"
}
The above code forces IMDSv2 only, which is what causes the issue: Ec2Snitch could not get the metadata with a simple, token-less request (the equivalent of a plain curl command).
Solution:
metadata_options {
  http_endpoint = "enabled"
  http_tokens   = "optional"
}
If you are launching the instance through the console, make sure the metadata version is set to "V1 and V2".
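To make the difference between the two metadata versions concrete, here is a minimal Python sketch (standard library only) that builds both styles of request. The function names and the 60-second token TTL are my own choices for illustration; the metadata path is the one quoted above, and the token headers are the ones AWS documents for IMDSv2. As far as I can tell, Ec2Snitch in 2.2.x only ever sends the first, token-less kind.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"
AZ_PATH = "/meta-data/placement/availability-zone"


def imdsv1_request():
    """IMDSv1 style: a plain GET with no session token."""
    return urllib.request.Request(IMDS_BASE + AZ_PATH, method="GET")


def imdsv2_token_request(ttl_seconds=60):
    """IMDSv2 step 1: PUT to the token endpoint to obtain a session token."""
    return urllib.request.Request(
        IMDS_BASE + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def imdsv2_request(token):
    """IMDSv2 step 2: the same GET, but carrying the session token."""
    return urllib.request.Request(
        IMDS_BASE + AZ_PATH,
        method="GET",
        headers={"X-aws-ec2-metadata-token": token},
    )


def fetch(request, timeout=2):
    """Send a request and return the body as text (only works on an EC2 instance)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode()
```

On an instance with http_tokens = "optional", fetch(imdsv1_request()) should return the availability zone. When tokens are required, that same call is expected to be rejected, and only the two-step flow (fetch the token, then pass it to imdsv2_request) works.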
I have a GKE cluster where I create jobs through Django. The jobs run my C++ code images, and the builds are triggered through GitHub. It was working just fine until now. However, I recently pushed a new commit to GitHub (a really small change, three or four lines of basic operations) and it built an image as usual. But this time, when I tried to create the job, it said "Pod errors: BackoffLimitExceeded, Error with exit code 137", and the job did not complete.
I did some digging into the problem, and by running kubectl describe pod POD_NAME I got this output from a failed pod:
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-nqgnl:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m32s default-scheduler Successfully assigned default/xvb8zfzrhhmz-jk9vf to gke-cluster-1-default-pool-ee7e99bb-xzhk
Normal Pulling 7m7s kubelet Pulling image "gcr.io/videoo3-360019/github.com/videoo-io/videoo-render:latest"
Normal Pulled 4m1s kubelet Successfully pulled image "gcr.io/videoo3-360019/github.com/videoo-io/videoo-render:latest" in 3m6.343917225s
Normal Created 4m1s kubelet Created container jobcontainer
Normal Started 4m kubelet Started container jobcontainer
Warning Evicted 3m29s kubelet The node was low on resource: ephemeral-storage. Container jobcontainer was using 91144Ki, which exceeds its request of 0.
Normal Killing 3m29s kubelet Stopping container jobcontainer
Warning ExceededGracePeriod 3m19s kubelet Container runtime did not kill the pod within specified grace period.
The error occurs because of this line:
The node was low on resource: ephemeral-storage. Container jobcontainer was using 91144Ki, which exceeds its request of 0.
I do not have a YAML file where I set my pod information. Instead, a Django call handles the configuration, which looks like this:
import base64
import json
from tempfile import NamedTemporaryFile

import google.auth
import google.auth.transport.requests
import kubernetes
from django.conf import settings
from google.cloud.container_v1 import ClusterManagerClient
from kubernetes import client
from kubernetes.client.rest import ApiException

# id_generator() and get_first_success_build_from_list_builds() are
# project helpers defined elsewhere.


def kube_create_job_object(name, container_image, namespace="default", container_name="jobcontainer", env_vars=None):
    # Body is the object Body
    body = client.V1Job(api_version="batch/v1", kind="Job")
    # Body needs Metadata
    # Attention: Each JOB must have a different name!
    body.metadata = client.V1ObjectMeta(namespace=namespace, name=name)
    # And a Status
    body.status = client.V1JobStatus()
    # Now we start with the Template...
    template = client.V1PodTemplate()
    template.template = client.V1PodTemplateSpec()
    # Passing Arguments in Env (None instead of a mutable {} default):
    env_list = []
    for env_name, env_value in (env_vars or {}).items():
        env_list.append(client.V1EnvVar(name=env_name, value=env_value))
    print(env_list)
    security = client.V1SecurityContext(privileged=True, allow_privilege_escalation=True, capabilities=client.V1Capabilities(add=["CAP_SYS_ADMIN"]))
    container = client.V1Container(name=container_name, image=container_image, env=env_list, stdin=True, security_context=security)
    template.template.spec = client.V1PodSpec(containers=[container], restart_policy='Never')
    body.spec = client.V1JobSpec(backoff_limit=0, ttl_seconds_after_finished=600, template=template.template)
    return body


def kube_create_job(manifest, output_uuid, output_signed_url, webhook_url, valgrind, sleep, isaudioonly):
    credentials, project = google.auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform', ])
    credentials.refresh(google.auth.transport.requests.Request())
    cluster_manager = ClusterManagerClient(credentials=credentials)
    cluster = cluster_manager.get_cluster(name=f"path/to/cluster")
    # Write the cluster CA certificate to a file the client can point at
    with NamedTemporaryFile(delete=False) as ca_cert:
        ca_cert.write(base64.b64decode(cluster.master_auth.cluster_ca_certificate))
    config = client.Configuration()
    config.host = f'https://{cluster.endpoint}:443'
    config.verify_ssl = True
    config.api_key = {"authorization": "Bearer " + credentials.token}
    config.username = credentials._service_account_email
    config.ssl_ca_cert = ca_cert.name
    client.Configuration.set_default(config)
    # Setup K8 configs
    api_instance = kubernetes.client.BatchV1Api(kubernetes.client.ApiClient(config))
    container_image = get_first_success_build_from_list_builds(client)
    name = id_generator()
    body = kube_create_job_object(name, container_image,
                                  env_vars={
                                      "PROJECT": json.dumps(manifest),
                                      "BUCKET": settings.GS_BUCKET_NAME,
                                  })
    try:
        api_response = api_instance.create_namespaced_job("default", body, pretty=True)
        print(api_response)
    except ApiException as e:
        print("Exception when calling BatchV1Api->create_namespaced_job: %s\n" % e)
    return body
What causes this and how can I fix it? Am I supposed to set the resource request/limit variables to some value, and if so, how can I do that inside my Django job call?
It looks like you are running out of storage on the node itself. Since your job spec does not request any ephemeral storage, the pod can be scheduled on any node, and in this case that particular node apparently did not have enough storage available.
I'm not a Python expert, but it looks like you should be able to do something like:
storage_size = SOME_VALUE
requests = {'ephemeral-storage': storage_size}
resources = client.V1ResourceRequirements(requests=requests)
container = client.V1Container(name=container_name, image=container_image, env=env_list, stdin=True, security_context=security, resources=resources)
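If you also want a limit, as asked in the question, the request and limit can be built together. Below is a small standard-library sketch. The helper names are mine, and the "1Gi"/"2Gi" values in the usage note are placeholders, not recommendations. It produces a plain dict with the requests/limits layout that Kubernetes uses, and I'd expect it to be usable as client.V1ResourceRequirements(**resources), though I have not tested that against your client version.

```python
# Binary suffixes used by Kubernetes resource quantities (subset only).
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4}


def quantity_to_bytes(quantity):
    """Convert a quantity such as '91144Ki' or '1Gi' to a byte count."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a plain byte count


def ephemeral_storage_resources(request, limit=None):
    """Build a requests/limits dict for ephemeral storage.

    If no limit is given, the limit is set equal to the request.
    """
    limit = limit or request
    if quantity_to_bytes(limit) < quantity_to_bytes(request):
        raise ValueError("limit must be greater than or equal to request")
    return {
        "requests": {"ephemeral-storage": request},
        "limits": {"ephemeral-storage": limit},
    }
```

For example, ephemeral_storage_resources("1Gi", "2Gi") asks the scheduler for a node with at least 1Gi of free ephemeral storage and caps the container at 2Gi. The 91144Ki reported in the eviction event is quantity_to_bytes("91144Ki") bytes, roughly 89Mi, so the request should be comfortably above that.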
This is my Jenkins EC2 configuration:
URL: $JENKINS_URL/configureClouds/
Add new cloud: Amazon EC2
Name: Amazon EC2 eu-central-1
Amazon EC2 Credentials: AKIA...
Region: eu-central-1
EC2 Key Pair's Private Key: ubuntu
Test connection: success
Advanced...
Instance Cap: 3
No delay provisioning: checked
Add AMI
Description: Linux node
AMI ID: ami-0293...
Check AMI: 05052029...
Instance Type: T3aMedium
EBS Optimized: checked
Monitoring: checked
T2 Unlimited: checked
Security group names: sg-0c2d... (opens SSH port 22)
Remote FS root: ./jenkins
Remote user: ubuntu
AMI Type: unix
Labels: aws ubuntu linux
Usage: Use this node as much as possible
Idle termination time: 30
Advanced...
Number of executors: 2
Stop/Disconnect on Idle Timeout: checked
Minimum number of instances: 1
Minimum number of spare instances: 0
Instance cap: 10
Block device mapping: /dev/sda1=snap-0eadbe3f...:200:true:gp2, /dev/sdb=ephemeral0, /dev/sdc=ephemeral1
Associate Public IP: checked
Connection Strategy: Public DNS
Host Key Verification Strategy: off
Maximum Total Uses: 10
Environment variables: checked
(not listing all environment variables)
Tool locations: checked
(not listing all tool locations)
With this configuration, I would expect that at least 1 EC2 instance would be started, but no instance is started.
On the Nodes page in Jenkins, when I hit the "Provision via" button, I get an error:
Oops! A problem occurred while processing the request. Logging ID=8ead3651-3809-4a47-984c-e0e494c705bb
In /log/all I have:
Apr 14, 2021 5:34:37 PM INFO hudson.plugins.ec2.SlaveTemplate getImage
Getting image for request {ExecutableUsers: [],Filters: [],ImageIds: [ami-0293c4ed***],Owners: []}
Apr 14, 2021 5:34:37 PM INFO hudson.plugins.ec2.SlaveTemplate logProvisionInfo
SlaveTemplate{description='Linux node', labels='aws ubuntu linux'}. Considering launching
Apr 14, 2021 5:34:37 PM INFO hudson.plugins.ec2.SlaveTemplate setupRootDevice
AMI had /dev/sda1
Apr 14, 2021 5:34:37 PM INFO hudson.plugins.ec2.SlaveTemplate setupRootDevice
{DeleteOnTermination: true,SnapshotId: snap-0eadbe3f***,VolumeSize: 20,VolumeType: gp2,Encrypted: false}
Apr 14, 2021 5:34:37 PM INFO hudson.plugins.ec2.SlaveTemplate logProvisionInfo
SlaveTemplate{description='Linux node', labels='aws ubuntu linux'}. Setting Instance Initiated Shutdown Behavior : ShutdownBehavior.Stop
Apr 14, 2021 5:34:37 PM INFO hudson.plugins.ec2.SlaveTemplate logProvisionInfo
SlaveTemplate{description='Linux node', labels='aws ubuntu linux'}. Looking for existing instances with describe-instance: {Filters: [{Name: image-id,Values: [ami-0293c4ed***]}, {Name: instance-type,Values: [t3a.medium]}, {Name: key-name,Values: [***]}, {Name: tag:jenkins_server_url,Values: [https://jenkins.***.com/]}, {Name: tag:jenkins_slave_type,Values: [demand_Linux node]}],InstanceIds: [],}
Apr 14, 2021 5:34:37 PM WARNING hudson.init.impl.InstallUncaughtExceptionHandler handleException
Caught unhandled exception with ID c080ae42-6b7b-47aa-93ea-1b8064503c1c
com.amazonaws.services.ec2.model.AmazonEC2Exception: Value () for parameter groupId is invalid. The value cannot be empty (Service: AmazonEC2; Status Code: 400; Error Code: InvalidParameterValue; Request ID: da74dbbf-0685-45ac-8454-c3f5d1b4c700; Proxy: null)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1819)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1403)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1372)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazonaws.services.ec2.AmazonEC2Client.doInvoke(AmazonEC2Client.java:29240)
at com.amazonaws.services.ec2.AmazonEC2Client.invoke(AmazonEC2Client.java:29207)
at com.amazonaws.services.ec2.AmazonEC2Client.invoke(AmazonEC2Client.java:29196)
at com.amazonaws.services.ec2.AmazonEC2Client.executeRunInstances(AmazonEC2Client.java:28011)
at com.amazonaws.services.ec2.AmazonEC2Client.runInstances(AmazonEC2Client.java:27980)
at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:1100)
at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:1042)
at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:867)
at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:693)
at hudson.plugins.ec2.EC2Cloud.doProvision(EC2Cloud.java:430)
at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
at org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:396)
at org.kohsuke.stapler.Function$InstanceFunction.invoke(Function.java:408)
at org.kohsuke.stapler.interceptor.RequirePOST$Processor.invoke(RequirePOST.java:77)
at org.kohsuke.stapler.PreInvokeInterceptedFunction.invoke(PreInvokeInterceptedFunction.java:26)
at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:212)
at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:145)
at org.kohsuke.stapler.MetaClass$11.doDispatch(MetaClass.java:536)
at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:766)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:898)
at org.kohsuke.stapler.MetaClass$4.doDispatch(MetaClass.java:281)
at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:766)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:898)
at org.kohsuke.stapler.Stapler.invoke(Stapler.java:694)
at org.kohsuke.stapler.Stapler.service(Stapler.java:240)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:791)
at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1626)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
at org.jenkinsci.plugins.ssegateway.Endpoint$SSEListenChannelFilter.doFilter(Endpoint.java:248)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at jenkins.security.ResourceDomainFilter.doFilter(ResourceDomainFilter.java:76)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at jenkins.telemetry.impl.UserLanguages$AcceptLanguageFilter.doFilter(UserLanguages.java:129)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at hudson.plugins.audit_trail.AuditTrailFilter.doFilter(AuditTrailFilter.java:111)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at io.jenkins.blueocean.auth.jwt.impl.JwtAuthenticationFilter.doFilter(JwtAuthenticationFilter.java:60)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at com.cloudbees.jenkins.support.slowrequest.SlowRequestFilter.doFilter(SlowRequestFilter.java:37)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at io.jenkins.blueocean.ResourceCacheControl.doFilter(ResourceCacheControl.java:134)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at org.jenkinsci.plugins.modernstatus.ModernStatusFilter.doFilter(ModernStatusFilter.java:50)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at hudson.plugins.greenballs.GreenBallFilter.doFilter(GreenBallFilter.java:64)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at jenkins.metrics.impl.MetricsFilter.doFilter(MetricsFilter.java:125)
at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:157)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:153)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:92)
at jenkins.security.AcegiSecurityExceptionFilter.doFilter(AcegiSecurityExceptionFilter.java:52)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:97)
at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:97)
at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:119)
at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:113)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:97)
at org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(AnonymousAuthenticationFilter.java:105)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:97)
at org.springframework.security.web.authentication.rememberme.RememberMeAuthenticationFilter.doFilter(RememberMeAuthenticationFilter.java:101)
at org.springframework.security.web.authentication.rememberme.RememberMeAuthenticationFilter.doFilter(RememberMeAuthenticationFilter.java:92)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:97)
at org.springframework.security.web.authentication.AbstractAuthenticationProcessingFilter.doFilter(AbstractAuthenticationProcessingFilter.java:218)
at org.springframework.security.web.authentication.AbstractAuthenticationProcessingFilter.doFilter(AbstractAuthenticationProcessingFilter.java:212)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:97)
at jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:93)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:97)
at org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:110)
at org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:80)
at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:62)
at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:97)
at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:109)
at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:168)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:51)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:82)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at jenkins.security.SuspiciousRequestFilter.doFilter(SuspiciousRequestFilter.java:36)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:578)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:516)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:279)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
at java.lang.Thread.run(Thread.java:748)
Write-up of the comments, for anyone else looking for help diagnosing an EC2 agent plugin issue.
Once you have configured your agents, go to the Nodes page (Jenkins URL/computer).
Hit the button to provision a new agent from your cloud.
If there is a configuration issue, you will get Evil Jenkins and a Logging ID.
Go to the Jenkins logs page (Jenkins URL/log/all) and search for that ID.
This should give you the stack trace from the AWS SDK call, which will help you narrow down whether missing config, IAM permissions, etc. are at fault.
If there are no config errors, you are taken to the config page of the node being launched, where you can see its EC2 startup log and check for any User Data or AMI issues.
For this particular issue, 'Value () for parameter groupId is invalid', the plugin expects you to provide at least one subnet ID (or a list of them) in the configuration.
I faced the same issue, and this resolved it.
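When the log is long, it can help to pull the key fields out of the AWS SDK exception line before reading the rest of the stack trace. Here is a rough Python sketch. The regular expression is my own and is based only on the shape of the exception line in the question, so treat it as an assumption rather than a documented format.

```python
import re

# Pattern modelled on the AmazonEC2Exception line quoted in the question.
_AWS_ERROR = re.compile(
    r"^(?P<exception>[\w.$]+): (?P<message>.*) "
    r"\(Service: (?P<service>[^;]+); Status Code: (?P<status>\d+); "
    r"Error Code: (?P<code>[^;]+); Request ID: (?P<request_id>[^;)]+)"
)


def parse_aws_error(line):
    """Return the exception fields as a dict, or None if the line does not match."""
    match = _AWS_ERROR.match(line.strip())
    return match.groupdict() if match else None


def find_aws_errors(log_text):
    """Scan a whole log dump and return every AWS error line that parses."""
    parsed = (parse_aws_error(line) for line in log_text.splitlines())
    return [fields for fields in parsed if fields is not None]
```

For the line in the question, the error code comes out as InvalidParameterValue with status 400, which points at the request parameters (configuration) rather than at IAM permissions.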
TID: [-1] [] [2019-11-22 13:18:34,362] WARN {org.apache.ode.scheduler.simple.SimpleScheduler} - Error while processing a persisted job: Job hqejbhcnphreqf4l2mpcoj time: 2019-11-22 13:18:31 WEST transacted: true persisted: true details: JobDetails( instanceId: null mexId: hqejbhcnphreqf4l2mpcoi processId: {http://wso2.org/bps/sample}my-process-7 type: INVOKE_INTERNAL channel: null correlatorId: null correlationKeySet: null retryCount: 4 inMem: false detailsExt: {enqueue=false}) {org.apache.ode.scheduler.simple.SimpleScheduler}
java.lang.NullPointerException
at org.apache.ode.bpel.engine.BpelRuntimeContextImpl.checkDuplicateCSetKey(BpelRuntimeContextImpl.java:621)
at org.apache.ode.bpel.engine.BpelRuntimeContextImpl.checkDuplicateCSets(BpelRuntimeContextImpl.java:578)
at org.apache.ode.bpel.runtime.PICK$WAITING$2.onRequestRcvd(PICK.java:300)
at sun.reflect.GeneratedMethodAccessor1427.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.ode.jacob.vpu.JacobVPU$JacobThreadImpl.run(JacobVPU.java:451)
at org.apache.ode.jacob.vpu.JacobVPU.execute(JacobVPU.java:139)
at org.apache.ode.bpel.engine.BpelRuntimeContextImpl.execute(BpelRuntimeContextImpl.java:1002)
at org.apache.ode.bpel.engine.PartnerLinkMyRoleImpl.invokeNewInstance(PartnerLinkMyRoleImpl.java:208)
at org.apache.ode.bpel.engine.BpelProcess$1.invoke(BpelProcess.java:283)
at org.apache.ode.bpel.engine.BpelProcess.invokeProcess(BpelProcess.java:224)
at org.apache.ode.bpel.engine.BpelProcess.invokeProcess(BpelProcess.java:279)
at org.apache.ode.bpel.engine.BpelProcess.handleJobDetails(BpelProcess.java:434)
at org.apache.ode.bpel.engine.BpelEngineImpl.sendMyRoleFault(BpelEngineImpl.java:835)
at org.apache.ode.bpel.engine.BpelEngineImpl.onScheduledJob(BpelEngineImpl.java:581)
at org.apache.ode.bpel.engine.BpelServerImpl.onScheduledJob(BpelServerImpl.java:467)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob$1.call(SimpleScheduler.java:633)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob$1.call(SimpleScheduler.java:627)
at org.apache.ode.scheduler.simple.SimpleScheduler.execTransaction(SimpleScheduler.java:298)
at org.apache.ode.scheduler.simple.SimpleScheduler.execTransaction(SimpleScheduler.java:253)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob.call(SimpleScheduler.java:627)
at org.apache.ode.scheduler.simple.SimpleScheduler$RunJob.call(SimpleScheduler.java:611)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(T
The problem is caused by the data:
A BPEL process had 3 versions (v1, v2, v3).
Version v1 was removed (from the registry and bpel / -123), but its instances remained in the database.
An old instance of version v1 remained in ACTIVE status with correlation id = 400200 (for example).
When a new instance of version v3 is started with correlation id = 400200, the exception is raised.
For each new instance, Apache ODE checks whether an instance in ACTIVE status already carries the same correlation id (the checkDuplicateCSets / checkDuplicateCSetKey calls in the stack trace). In our case, Apache ODE finds an instance of version v1 and throws a NullPointerException because it cannot find process v1 in its registry.
Solution: clean up the old instances of version v1 that are still in ACTIVE status.
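The mechanism can be illustrated with a toy model in Python. This is not Apache ODE's code: the class and function names are invented for the illustration, and Python raises AttributeError where Java raises NullPointerException. It only shows why an orphaned ACTIVE instance breaks the duplicate check and why cleaning it up fixes the problem.

```python
from dataclasses import dataclass


@dataclass
class Process:
    version: str


@dataclass
class Instance:
    version: str
    status: str
    correlation_id: int


def has_duplicate(registry, instances, correlation_id):
    """Toy duplicate check: look for an ACTIVE instance with the same correlation id."""
    for inst in instances:
        if inst.status == "ACTIVE" and inst.correlation_id == correlation_id:
            process = registry.get(inst.version)  # None if that version was undeployed
            _ = process.version  # unguarded access: fails for orphaned instances
            return True
    return False


def remove_orphaned_active(registry, instances):
    """Drop ACTIVE instances whose process version is no longer in the registry."""
    return [
        inst for inst in instances
        if not (inst.status == "ACTIVE" and inst.version not in registry)
    ]
```

With a registry holding only v2 and v3 and a leftover Instance("v1", "ACTIVE", 400200), has_duplicate(registry, instances, 400200) raises. After remove_orphaned_active, the same check returns False and a new v3 instance can start.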
Environment:
ignite server:
centos6.5 with kernel 2.6.32-431.el6.x86_64
ignite version 1.9
hadoop version 2.6.2
3 server nodes with each having '-Xms16g -Xmx16g -server -XX:+AggressiveOpts -XX:MaxMetaspaceSize=256m' set when started
I ran a MapReduce test job with Ignite MapReduce. The job simply computes the average number for each person. The data looks like this:
Jack 0.35
Tom 0.78
Lily 0.92
Jack 0.28
Tom 0.18
...
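For reference, the computation itself is just a per-key average. Here is a plain-Python sketch of the map and reduce steps. This is not the actual Hadoop/Ignite job, and the function names are only illustrative.

```python
from collections import defaultdict


def map_line(line):
    """Map step: turn 'Jack 0.35' into ('Jack', 0.35)."""
    name, value = line.split()
    return name, float(value)


def reduce_average(pairs):
    """Reduce step: accumulate a running sum and count per name, then divide."""
    totals = defaultdict(lambda: [0.0, 0])
    for name, value in pairs:
        totals[name][0] += value
        totals[name][1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


def average_per_person(lines):
    """Run map then reduce over an iterable of text lines, skipping blank lines."""
    return reduce_average(map_line(line) for line in lines if line.strip())
```

Only a sum and a count per name need to be kept in memory, so the size of the result depends on the number of distinct names, not on the number of input lines.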
At first, I generated a data set of 100M lines, about 2.53GB. The job finished correctly in about 30s. Then I generated a data set of 1 billion lines, about 25.3GB. The job always failed with exceptions. I tried several times, with the same result.
The ignite server node threw exception below:
[15:06:56,804][ERROR][sys-#2740%null%][GridTcpRestProtocol] Failed to process client request [ses=GridSelectorNioSessionImpl [worker=ByteBufferNioClientWorker [readBuf=java.nio.HeapByteBuffer[pos=549 lim=549 cap=8192], super=AbstractNioClientWorker [selector=sun.nio.ch.EPollSelectorImpl#1cba0431, idx=3, bytesRcvd=0, bytesSent=0, bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker [name=grid-nio-worker-tcp-rest-3, gridName=null, finished=false, isCancelled=false, hashCode=906881587, interrupted=false, runner=grid-nio-worker-tcp-rest-3-#50%null%]]], writeBuf=null, readBuf=null, inRecovery=null, outRecovery=null, super=GridNioSessionImpl [locAddr=/172.31.68.204:11211, rmtAddr=/172.31.68.202:39473, createTime=1493967985751, closeTime=1493968009502, bytesSent=2715, bytesRcvd=2641, bytesSent0=0, bytesRcvd0=0, sndSchedTime=1493968016794, lastSndTime=1493967998303, lastRcvTime=1493968009502, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=GridTcpRestParser [jdkMarshaller=JdkMarshaller [], routerClient=false], directMode=false]], accepted=true]], msg=GridClientTaskRequest [taskName=o.a.i.i.processors.hadoop.proto.HadoopProtocolJobStatusTask, arg=HadoopProtocolTaskArguments []]]
class org.apache.ignite.IgniteCheckedException: Failed to send message (connection was closed): GridSelectorNioSessionImpl [worker=ByteBufferNioClientWorker [readBuf=java.nio.HeapByteBuffer[pos=549 lim=549 cap=8192], super=AbstractNioClientWorker [selector=sun.nio.ch.EPollSelectorImpl#1cba0431, idx=3, bytesRcvd=0, bytesSent=0, bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker [name=grid-nio-worker-tcp-rest-3, gridName=null, finished=false, isCancelled=false, hashCode=906881587, interrupted=false, runner=grid-nio-worker-tcp-rest-3-#50%null%]]], writeBuf=null, readBuf=null, inRecovery=null, outRecovery=null, super=GridNioSessionImpl [locAddr=/172.31.68.204:11211, rmtAddr=/172.31.68.202:39473, createTime=1493967985751, closeTime=1493968009502, bytesSent=2715, bytesRcvd=2641, bytesSent0=0, bytesRcvd0=0, sndSchedTime=1493968016794, lastSndTime=1493967998303, lastRcvTime=1493968009502, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=GridTcpRestParser [jdkMarshaller=JdkMarshaller [], routerClient=false], directMode=false]], accepted=true]]
at org.apache.ignite.internal.util.IgniteUtils.cast(IgniteUtils.java:7239)
at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:170)
at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:119)
at org.apache.ignite.internal.processors.rest.protocols.tcp.GridTcpRestNioListener$1$1.apply(GridTcpRestNioListener.java:264)
at org.apache.ignite.internal.processors.rest.protocols.tcp.GridTcpRestNioListener$1$1.apply(GridTcpRestNioListener.java:261)
at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:271)
at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:228)
at org.apache.ignite.internal.processors.rest.protocols.tcp.GridTcpRestNioListener$1.apply(GridTcpRestNioListener.java:261)
at org.apache.ignite.internal.processors.rest.protocols.tcp.GridTcpRestNioListener$1.apply(GridTcpRestNioListener.java:229)
at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:271)
at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListeners(GridFutureAdapter.java:259)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:389)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:355)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:332)
at org.apache.ignite.internal.processors.rest.GridRestProcessor$2$1.apply(GridRestProcessor.java:158)
at org.apache.ignite.internal.processors.rest.GridRestProcessor$2$1.apply(GridRestProcessor.java:155)
at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:271)
at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListeners(GridFutureAdapter.java:259)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:389)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:355)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:332)
at org.apache.ignite.internal.util.future.GridFutureChainListener.applyCallback(GridFutureChainListener.java:78)
at org.apache.ignite.internal.util.future.GridFutureChainListener.apply(GridFutureChainListener.java:70)
at org.apache.ignite.internal.util.future.GridFutureChainListener.apply(GridFutureChainListener.java:30)
at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:271)
at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListeners(GridFutureAdapter.java:259)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:389)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:355)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:332)
at org.apache.ignite.internal.processors.rest.handlers.task.GridTaskCommandHandler$2.apply(GridTaskCommandHandler.java:294)
at org.apache.ignite.internal.processors.rest.handlers.task.GridTaskCommandHandler$2.apply(GridTaskCommandHandler.java:257)
at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:271)
at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListeners(GridFutureAdapter.java:259)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:389)
at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:355)
at org.apache.ignite.internal.processors.task.GridTaskWorker.finishTask(GridTaskWorker.java:1579)
at org.apache.ignite.internal.processors.task.GridTaskWorker.finishTask(GridTaskWorker.java:1547)
at org.apache.ignite.internal.processors.task.GridTaskWorker.reduce(GridTaskWorker.java:1157)
at org.apache.ignite.internal.processors.task.GridTaskWorker.onResponse(GridTaskWorker.java:942)
at org.apache.ignite.internal.processors.task.GridTaskProcessor.processJobExecuteResponse(GridTaskProcessor.java:996)
at org.apache.ignite.internal.processors.task.GridTaskProcessor$JobMessageListener.onMessage(GridTaskProcessor.java:1221)
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1222)
at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:850)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$2100(GridIoManager.java:108)
at org.apache.ignite.internal.managers.communication.GridIoManager$7.run(GridIoManager.java:790)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to send message (connection was closed): GridSelectorNioSessionImpl [worker=ByteBufferNioClientWorker [readBuf=java.nio.HeapByteBuffer[pos=549 lim=549 cap=8192], super=AbstractNioClientWorker [selector=sun.nio.ch.EPollSelectorImpl#1cba0431, idx=3, bytesRcvd=0, bytesSent=0, bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker [name=grid-nio-worker-tcp-rest-3, gridName=null, finished=false, isCancelled=false, hashCode=906881587, interrupted=false, runner=grid-nio-worker-tcp-rest-3-#50%null%]]], writeBuf=null, readBuf=null, inRecovery=null, outRecovery=null, super=GridNioSessionImpl [locAddr=/172.31.68.204:11211, rmtAddr=/172.31.68.202:39473, createTime=1493967985751, closeTime=1493968009502, bytesSent=2715, bytesRcvd=2641, bytesSent0=0, bytesRcvd0=0, sndSchedTime=1493968016794, lastSndTime=1493967998303, lastRcvTime=1493968009502, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=GridTcpRestParser [jdkMarshaller=JdkMarshaller [], routerClient=false], directMode=false]], accepted=true]]
at org.apache.ignite.internal.util.nio.GridNioServer.send0(GridNioServer.java:554)
at org.apache.ignite.internal.util.nio.GridNioServer.send(GridNioServer.java:494)
at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onSessionWrite(GridNioServer.java:3036)
at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionWrite(GridNioFilterAdapter.java:118)
at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onSessionWrite(GridNioCodecFilter.java:94)
at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionWrite(GridNioFilterAdapter.java:118)
at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onSessionWrite(GridNioFilterChain.java:264)
at org.apache.ignite.internal.util.nio.GridNioFilterChain.onSessionWrite(GridNioFilterChain.java:189)
at org.apache.ignite.internal.util.nio.GridNioSessionImpl.send(GridNioSessionImpl.java:108)
at org.apache.ignite.internal.processors.rest.protocols.tcp.GridTcpRestNioListener$1.apply(GridTcpRestNioListener.java:258)
... 40 more
The job client threw the exception below:
java.io.IOException: Failed to get job status: job_1fbf9083-9a44-4be9-9199-695a97652dc2_0002
at org.apache.ignite.internal.processors.hadoop.impl.proto.HadoopClientProtocol.getJobStatus(HadoopClientProtocol.java:197)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:326)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:323)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:323)
at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:611)
at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1357)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1318)
at com.tscloud.sdk.test.ignite.MRTest.run(MRTest.java:81)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.tscloud.sdk.test.ignite.MRTest.main(MRTest.java:53)
Caused by: class org.apache.ignite.internal.client.impl.connection.GridClientConnectionResetException: Failed to perform request (connection failed): /172.31.68.204:11211
at org.apache.ignite.internal.client.impl.connection.GridClientConnection.getCloseReasonAsException(GridClientConnection.java:491)
at org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection.close(GridClientNioTcpConnection.java:339)
at org.apache.ignite.internal.client.impl.connection.GridClientNioTcpConnection.close(GridClientNioTcpConnection.java:299)
at org.apache.ignite.internal.client.impl.connection.GridClientConnectionManagerAdapter$NioListener.onDisconnected(GridClientConnectionManagerAdapter.java:630)
at org.apache.ignite.internal.util.nio.GridNioFilterChain$TailFilter.onSessionClosed(GridNioFilterChain.java:253)
at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:93)
at org.apache.ignite.internal.util.nio.GridNioCodecFilter.onSessionClosed(GridNioCodecFilter.java:70)
at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:93)
at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onSessionClosed(GridNioServer.java:3005)
at org.apache.ignite.internal.util.nio.GridNioFilterChain.onSessionClosed(GridNioFilterChain.java:147)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2306)
at org.apache.ignite.internal.util.nio.GridNioServer$ByteBufferNioClientWorker.processRead(GridNioServer.java:929)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2026)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1863)
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1568)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:745)
The job configuration is below:
Configuration configuration = new Configuration();
configuration.set(MRConfig.FRAMEWORK_NAME, IgniteHadoopClientProtocolProvider.FRAMEWORK_NAME);
configuration.set(MRConfig.MASTER_ADDRESS, "172.31.68.202:11211");
configuration.set("fs.igfs.impl", "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem");
configuration.set("fs.default.name", "igfs://igfs#172.31.68.202/");
I checked the node status with ignitevisorcmd.sh after the job failed. The server nodes were usually all OK, but sometimes one of the server nodes was down. I do not know why it behaves like this.
Any help is appreciated.
Edit(2017-05-16):
I changed the Hadoop core-site.xml and added the hadoop.tmp.dir property as below:
<property>
<name>hadoop.tmp.dir</name>
<value>/data/hadoop/hadoop-2.6.2/tmp</value>
</property>
Then I reformatted HDFS and uploaded the 25.3 GB data file again. After that, the test ran successfully. It turns out something was wrong with my HDFS; reformatting the namenode solved the problem.
Before the above steps, I also checked the JVM heap usage with VisualVM.
(Screenshot: VisualVM monitor snapshot of one of the server nodes.)
If your node was down temporarily, then it is reasonable that a job would fail over to another node, or fail altogether if there are no other nodes. I would check that your network was not reset or down. You should also check for any software firewalls between your nodes (it is best to disable operating system firewalls).
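One way to act on the network and firewall suggestion is to probe the REST port from the machine that submits the job. The sketch below is only an illustration: the class PortProbe and its Result enum are hypothetical names, not part of Ignite or Hadoop, and the two addresses and port 11211 are taken from the log and job configuration in the question. As a rough rule, a failed connect (most often "Connection refused") means the host answered but nothing was listening on the port, while a timeout is more typical of a firewall silently dropping packets.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical diagnostic helper; not part of Ignite or Hadoop.
final class PortProbe {

    enum Result { OPEN, CONNECT_FAILED, TIMED_OUT, OTHER_ERROR }

    // Attempts a plain TCP connect and classifies the outcome.
    static Result probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return Result.OPEN;
        } catch (ConnectException e) {
            // Most often "Connection refused": host reachable, no listener.
            return Result.CONNECT_FAILED;
        } catch (SocketTimeoutException e) {
            // No answer within timeoutMillis, e.g. packets dropped on the way.
            return Result.TIMED_OUT;
        } catch (IOException e) {
            // Anything else, such as an unresolvable host name.
            return Result.OTHER_ERROR;
        }
    }

    public static void main(String[] args) {
        // Addresses assumed from the question; replace with your own nodes.
        String[] nodes = {"172.31.68.202", "172.31.68.204"};
        for (String node : nodes) {
            System.out.println(node + ":11211 -> " + probe(node, 11211, 2000));
        }
    }
}
```

Running this a few times while a job is in progress may show whether a node drops off the network only intermittently, which would match the occasional "node down" status seen in Visor.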
This is related to my previous question. Basically, to summarize: I
1) Set up a vagrant ubuntu 14.04 box locally
2) Packaged the vagrant instance into a package.box following these instructions
3) Converted the package.box into a .vmdk file using this function
4) Ran the following CLI command:
ec2-import-instance tmpdir/box-disk1.vmdk -f VMDK -t t2.micro -a x86_64 -b <S3 Bucket> -o $AWS_ACCESS_KEY -w $AWS_SECRET_KEY -p Linux
I suspected the problem was with cloud-init, which I have read about but never used (and I don't really know what it does), so I tried the above twice: once with the original /etc/cloud/cloud.cfg file and again with the /etc/cloud/cloud.cfg file I found here.
What I eventually see in the AWS Console is a running instance that does not have a public IP address. I attached an Elastic IP to the instance, but I can't ssh to that IP address; it says port 22: Connection refused
I'm at a loss because these instances are launching in the Default VPC which has a security group attached to it that allows all ports and all protocols from any IP.
By the way: I'm pretty new to all of AWS and don't really know my way fully around the console, so any direct guidance would be much appreciated.
Original /etc/cloud/cloud.cfg file:
# The top level settings are used as module
# and system configuration.
# A set of users which may be applied and/or used by various modules
# when a 'default' entry is found it will reference the 'default_user'
# from the distro configuration specified below
users:
- default
# If this is set, 'root' will not be able to ssh in and they
# will get a message to login instead as the above $user (ubuntu)
disable_root: true
# This will cause the set+update hostname module to not operate (if true)
preserve_hostname: false
# Example datasource config
# datasource:
# Ec2:
# metadata_urls: [ 'blah.com' ]
# timeout: 5 # (defaults to 50 seconds)
# max_wait: 10 # (defaults to 120 seconds)
# The modules that run in the 'init' stage
cloud_init_modules:
- migrator
- seed_random
- bootcmd
- write-files
- growpart
- resizefs
- set_hostname
- update_hostname
- update_etc_hosts
- ca-certs
- rsyslog
- users-groups
- ssh
# The modules that run in the 'config' stage
cloud_config_modules:
# Emit the cloud config ready event
# this can be used by upstart jobs for 'start on cloud-config'.
- emit_upstart
- disk_setup
- mounts
- ssh-import-id
- locale
- set-passwords
- grub-dpkg
- apt-pipelining
- apt-configure
- package-update-upgrade-install
- landscape
- timezone
- puppet
- chef
- salt-minion
- mcollective
- disable-ec2-metadata
- runcmd
- byobu
# The modules that run in the 'final' stage
cloud_final_modules:
- rightscale_userdata
- scripts-vendor
- scripts-per-once
- scripts-per-boot
- scripts-per-instance
- scripts-user
- ssh-authkey-fingerprints
- keys-to-console
- phone-home
- final-message
- power-state-change
# System and/or distro specific settings
# (not accessible to handlers/transforms)
system_info:
  # This will affect which distro class gets used
  distro: ubuntu
  # Default user name + that default users groups (if added/used)
  default_user:
    name: ubuntu
    lock_passwd: True
    gecos: Ubuntu
    groups: [adm, audio, cdrom, dialout, dip, floppy, netdev, plugdev, sudo, video]
    sudo: ["ALL=(ALL) NOPASSWD:ALL"]
    shell: /bin/bash
  # Other config here will be given to the distro class and/or path classes
  paths:
    cloud_dir: /var/lib/cloud/
    templates_dir: /etc/cloud/templates/
    upstart_dir: /etc/init/
  package_mirrors:
    - arches: [i386, amd64]
      failsafe:
        primary: http://archive.ubuntu.com/ubuntu
        security: http://security.ubuntu.com/ubuntu
      search:
        primary:
          - http://%(ec2_region)s.ec2.archive.ubuntu.com/ubuntu/
          - http://%(availability_zone)s.clouds.archive.ubuntu.com/ubuntu/
          - http://%(region)s.clouds.archive.ubuntu.com/ubuntu/
        security: []
    - arches: [armhf, armel, default]
      failsafe:
        primary: http://ports.ubuntu.com/ubuntu-ports
        security: http://ports.ubuntu.com/ubuntu-ports
  ssh_svcname: ssh
Second try /etc/cloud/cloud.cfg file:
users:
- default
disable_root: 1
ssh_pwauth: 0
locale_configfile: /etc/sysconfig/i18n
mount_default_fields: [~, ~, 'auto', 'defaults,nofail', '0', '2']
resize_rootfs_tmp: /dev
ssh_deletekeys: 0
ssh_genkeytypes: ~
syslog_fix_perms: ~
cloud_init_modules:
- bootcmd
- write-files
- resizefs
- set_hostname
- update_hostname
- update_etc_hosts
- rsyslog
- users-groups
- ssh
cloud_config_modules:
- mounts
- locale
- set-passwords
- timezone
- runcmd
cloud_final_modules:
- scripts-per-once
- scripts-per-boot
- scripts-per-instance
- scripts-user
- ssh-authkey-fingerprints
- keys-to-console
- final-message
system_info:
  distro: rhel
  default_user:
    name: ec2-user
  paths:
    cloud_dir: /var/lib/cloud
    templates_dir: /etc/cloud/templates
  ssh_svcname: sshd
EOF
This happens because, when you imported the instance into AWS from your local machine, no PEM key was associated with it, which is why you were not able to SSH in.
After you took an image of the instance and launched it again with an associated key, you were able to SSH into the instance.