I'm trying to insert into a table with a query on an EMR cluster on AWS. The table is created correctly, and a colleague can run the exact same code without it failing. However, when I run the code, I get failures in Map 1 that make the entire job fail with the error below.
Can someone help me figure out why the job fails when I run it, while my colleague can run it without issue? I've been staring at this all day and can't get past it.
----------------------------------------------------------------------------------------------
VERTICES          MODE        STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 1             container   RUNNING       13          0        0       13      40       1
Map 3 ..........  container   SUCCEEDED      1          1        0        0       0       0
Map 5 ..........  container   SUCCEEDED      1          1        0        0       0       0
Map 7 ..........  container   SUCCEEDED      1          1        0        0       0       0
Map 8 ..........  container   SUCCEEDED      1          1        0        0       0       0
Reducer 2         container   INITED         6          0        0        6       0       0
Reducer 4 ......  container   SUCCEEDED      2          2        0        0       0       0
Reducer 6 ......  container   SUCCEEDED      2          2        0        0       0       0
Reducer 9 ......  container   SUCCEEDED      2          2        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 07/09 [========>>------------------] 34% ELAPSED TIME: 132.71 s
----------------------------------------------------------------------------------------------
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1544915203536_0453_2_07, diagnostics=[Task failed, taskId=task_1544915203536_0453_2_07_000009, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1544915203536_0453_2_07_000009_0:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.lang.IllegalArgumentException: [
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: java.lang.IllegalArgumentException: [VALUE] BINARY is not in the store:
at
I think the key part of the error is:
[tuning_event_start_ts] BINARY is not in the store
The most important point is "not in the store".
Check your query (only the SELECT, without the INSERT).
Which table contains TUNING_EVENT_START_TS?
So it turns out that vectorization was the issue. These settings were being enabled at the beginning of the session:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
With these settings left off, the query ran more slowly but successfully. It seems Hive's vectorized execution does not like the timestamp value. The wiki page below has a limitations section at the bottom. The query definitely works without these options set.
https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution
In summary, timestamps and vectorization don't like each other in Hive... but only sometimes.
I have a dataset keyed by county FIPS code in which some of the codes need to be merged. The dataset looks like:
FIPS yr1990 yr2000 yr2010
1001 1 0 1
1002 1 1 0
1003 1 0 0
1004 0 0 0
1005 0 0 1
County boundaries have changed and I need to merge several FIPS codes together. Essentially, I need the dataset to look like:
FIPS yr1990 yr2000 yr2010
1001/1003 1 1 1
1002 1 1 0
1004/1005 0 0 1
Is there a way to select specific FIPS to be merged over rows?
This solution might not scale to very large datasets, because the replace statements must be written manually, but it keeps the exact format you are using in your example. A more scalable approach might be difficult if there is no systematic rule for how the FIPS codes were combined.
* Example generated by -dataex-. For more info, type help dataex
clear
input str4 FIPS byte(yr1990 yr2000 yr2010)
"1001" 1 0 1
"1002" 1 1 0
"1003" 1 0 0
"1004" 0 0 0
"1005" 0 0 1
end
*Combine the FIPS codes
replace FIPS = "1001/1003" if inlist(FIPS,"1001","1003")
replace FIPS = "1004/1005" if inlist(FIPS,"1004","1005")
*Collapse rows by FIPS value, use max value for each var on format yr????
collapse (max) yr???? , by(FIPS)
I believe all I'd need to do to resolve this is configure SSM inside Image Builder to use my proxy via the environment variable HTTP_PROXY = HOST:PORT.
For example, I can run this on another server where all traffic is directed through the proxy:
curl -I --socks5-hostname socks.local:1080 https://s3.amazonaws.com/aws-cli/awscli-bundle.zip -o awscli-bundle.zip
Here's what Image Builder is trying to do and failing at (before any of the Image Builder components are run):
SSM execution '68711005-5dc4-41f6-8cdd-633728ca41da' failed with status = 'Failed' in state = 'BUILDING' and failure message = 'Step fails when it is verifying the command has completed. Command 76b55646-79bb-417c-8bb6-6ee01f9a76ff returns unexpected invocation result: {Status=[Failed], ResponseCode=[7], Output=[ ----------ERROR------- + sudo systemctl stop ecs + curl https://s3.amazonaws.com/aws-cli/awscli-bundle.zip -o /tmp/imagebuilder_service/awscli-bundle.zip % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0 0 0 0 0 0 0 0 0 --:--:-- 0:00:05 --:--:-- 0 0 0 0 0 0 0 0 0 ...'
These environment variables should be all that is needed. The problem is that I see no way to add them (the way you would in CodeBuild):
http_proxy=http://hostname:port
https_proxy=https://hostname:port
no_proxy=169.254.169.254
SSM Agent does not read environment variables from the host. You need to put them in the appropriate file below and then restart SSM Agent:
On Ubuntu Server instances where SSM Agent is installed by using a snap: /etc/systemd/system/snap.amazon-ssm-agent.amazon-ssm-agent.service.d/override.conf
On Amazon Linux 2 instances: /etc/systemd/system/amazon-ssm-agent.service.d/override.conf
On other operating systems: /etc/systemd/system/amazon-ssm-agent.service.d/amazon-ssm-agent.override
[Service]
Environment="http_proxy=http://hostname:port"
Environment="https_proxy=https://hostname:port"
Environment="no_proxy=169.254.169.254"
Reference: https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-proxy-with-ssm-agent.html#ssm-agent-proxy-systemd
I am trying to understand what the series values stand for in a Prometheus unit test.
The official docs do not provide any info.
For example, fire an alert if any instance is down for more than 10 seconds.
alerting-rules.yml
groups:
  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 10 seconds."
alerting-rules.test.yml
rule_files:
  - alerting-rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="prometheus", instance="localhost:9090"}'
        values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
    alert_rule_test:
      - eval_time: 10m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: localhost:9090
              job: prometheus
            exp_annotations:
              summary: "Instance localhost:9090 down"
              description: "localhost:9090 of job prometheus has been down for more than 10 seconds."
Originally, I thought because of interval: 1m, which is 60 seconds, and there are 15 numbers, 60 / 15 = 4s, so each value stands for 4 seconds (1 means up, 0 means down).
However, when the values are
values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
or
values: '1 1 1 1 1 1 1 1 1 0 0 0 0 0 0'
Both will pass the test when I run promtool test rules alerting-rules.test.yml.
But below will fail:
values: '1 1 1 1 1 1 1 1 1 1 0 0 0 0 0'
So my original idea that each number stands for 4s is wrong. If that assumption were correct, the test would fail only when there are fewer than three 0s.
What do series values stand for in Prometheus unit test?
Your assumption is incorrect. The numbers in values do not subdivide the interval. Each number is the value the series has at one interval step, starting at time 0. For example:
values: 1 1 1 1 1 1
# 0m 1m 2m 3m 4m 5m
In your example, the rule is evaluated every minute (evaluation_interval: 1m) and the test checks at 10m (eval_time) which alerts are firing. Because of for: 10s, the alert only fires at 10m if up == 0 already held at the previous evaluation at 9m, which reads the tenth value. When you change the tenth value to 1, the condition first becomes true at 10m, so the alert is only pending and the test fails because the alert is not firing as expected.
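The timing can be checked with a small simulation. This is only a simplified model, not promtool's actual implementation: it assumes the first sample sits at time 0, that the rule is evaluated once per sample (interval equal to evaluation_interval, as in the question), and that an alert fires once the condition has held for at least the for duration. The function name alert_state is made up for this sketch.

```python
def alert_state(values, interval_s=60, for_s=10, eval_time_s=600):
    """Return 'inactive', 'pending' or 'firing' for an `up == 0` rule at eval_time_s.

    Simplified model (assumption): sample i is placed at i * interval_s and the
    rule is evaluated at exactly those timestamps.
    """
    active_since = None  # timestamp at which `up == 0` first became true
    state = "inactive"
    for i, v in enumerate(values):
        t = i * interval_s
        if t > eval_time_s:
            break
        if v == 0:
            if active_since is None:
                active_since = t
            # The alert stays pending until the condition has held for `for_s` seconds.
            state = "firing" if t - active_since >= for_s else "pending"
        else:
            active_since = None
            state = "inactive"
    return state


all_down = [0] * 15
nine_up = [1] * 9 + [0] * 6   # first 0 is the 10th value, at 9m
ten_up = [1] * 10 + [0] * 5   # first 0 is the 11th value, at 10m

print(alert_state(all_down))
print(alert_state(nine_up))
print(alert_state(ten_up))
```

Under these assumptions the first two inputs are firing at 10m and the third is only pending, which matches the pass/pass/fail pattern reported in the question.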
Given that I have the following output :
Loopback1 is up, line protocol is up
Hardware is Loopback
Description: ** NA4-ISIS-MGMT-LOOPBACK1_MPLS **
Internet address is 84.116.226.27/32
MTU 1514 bytes, BW 8000000 Kbit, DLY 5000 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation LOOPBACK, loopback not set
Keepalive set (10 sec)
Last input 12w3d, output never, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/0 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
0 packets input, 0 bytes, 0 no buffer
Received 0 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
6 packets output, 456 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 output buffer failures, 0 output buffers swapped out
How can I match "Loopback1" and not "Loopback" ?
In other words, how can I match the interface name only if there is a number next to it, in Tcl ?
Use a lookahead:
Loopback(?=\d+)
It matches only the Loopback part of a Loopback that is followed by one or more digits. If you want to match Loopback together with the number, use Loopback\d+
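To see the difference between the two patterns, here is a quick check written with Python's re module rather than Tcl. Both patterns are taken verbatim from the answer, and the two sample lines come from the question's output. The helper name first_match is made up for this sketch.

```python
import re

lines = [
    "Loopback1 is up, line protocol is up",
    "Hardware is Loopback",
]

# Lookahead: requires digits after "Loopback" but does not include them in the match.
lookahead = re.compile(r"Loopback(?=\d+)")
# Plain version: the digits become part of the match.
with_number = re.compile(r"Loopback\d+")


def first_match(pattern, text):
    """Return the first matched substring, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None


for line in lines:
    print(repr(line), first_match(lookahead, line), first_match(with_number, line))
```

Neither pattern matches the "Hardware is Loopback" line, because no digit follows the name there.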
I have a Django view which creates 500-5000 new database INSERTS in a loop. Problem is, it is really slow! I'm getting about 100 inserts per minute on Postgres 8.3. We used to use MySQL on lesser hardware (smaller EC2 instance) and never had these types of speed issues.
Details:
Postgres 8.3 on Ubuntu Server 9.04.
Server is a "large" Amazon EC2 with database on EBS (ext3) - 11GB/20GB.
Here is some of my postgresql.conf -- let me know if you need more
shared_buffers = 4000MB
effective_cache_size = 7128MB
My python:
for k in kw:
    k = k.lower()
    p = ProfileKeyword(profile=self)
    logging.debug(k)
    p.keyword, created = Keyword.objects.get_or_create(keyword=k, defaults={'keyword':k,})
    if not created and ProfileKeyword.objects.filter(profile=self, keyword=p.keyword).count():
        #checking created is just a small optimization to save some database hits on new keywords
        pass #duplicate entry
    else:
        p.save()
Some output from top:
top - 16:56:22 up 21 days, 20:55, 4 users, load average: 0.99, 1.01, 0.94
Tasks: 68 total, 1 running, 67 sleeping, 0 stopped, 0 zombie
Cpu(s): 5.8%us, 0.2%sy, 0.0%ni, 90.5%id, 0.7%wa, 0.0%hi, 0.0%si, 2.8%st
Mem: 15736360k total, 12527788k used, 3208572k free, 332188k buffers
Swap: 0k total, 0k used, 0k free, 11322048k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14767 postgres 25 0 4164m 117m 114m S 22 0.8 2:52.00 postgres
1 root 20 0 4024 700 592 S 0 0.0 0:01.09 init
2 root RT 0 0 0 0 S 0 0.0 0:11.76 migration/0
3 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/0
4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
5 root 10 -5 0 0 0 S 0 0.0 0:00.08 events/0
6 root 11 -5 0 0 0 S 0 0.0 0:00.00 khelper
7 root 10 -5 0 0 0 S 0 0.0 0:00.00 kthread
9 root 10 -5 0 0 0 S 0 0.0 0:00.00 xenwatch
10 root 10 -5 0 0 0 S 0 0.0 0:00.00 xenbus
18 root RT -5 0 0 0 S 0 0.0 0:11.84 migration/1
19 root 34 19 0 0 0 S 0 0.0 0:00.01 ksoftirqd/1
Let me know if any other details would be helpful.
One common reason for slow bulk operations like this is each insert happening in its own transaction. If you can get all of them to happen in a single transaction, it could go much faster.
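As a rough illustration of the single-transaction idea, here is a sketch that uses Python's built-in sqlite3 module as a stand-in for Postgres and the Django ORM, so it can be run without a project. The keyword table and the insert_keywords helper are assumptions made for this example. In current Django versions the analogous construct would be wrapping the loop in transaction.atomic().

```python
import sqlite3


def insert_keywords(conn, keywords):
    """Insert every keyword inside one transaction instead of one per row."""
    # `with conn` commits once when the block succeeds and rolls back
    # everything if an exception escapes the block.
    with conn:
        for k in keywords:
            conn.execute("INSERT INTO keyword (keyword) VALUES (?)", (k.lower(),))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keyword (keyword TEXT PRIMARY KEY)")

insert_keywords(conn, ["Alpha", "Beta", "Gamma"])
count = conn.execute("SELECT COUNT(*) FROM keyword").fetchone()[0]
print(count)
```

A side effect of this design is that a failure partway through leaves no partially inserted batch behind, which may or may not be what the view wants.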
Firstly, ORM operations are always going to be slower than pure SQL. I once wrote an update to a large database in ORM code and set it running, but quit it after several hours when it had completed only a tiny fraction. After rewriting it in SQL the whole thing ran in less than a minute.
Secondly, bear in mind that your code here is doing up to four separate database operations for every row in your data set - the get in get_or_create, possibly also the create, the count on the filter, and finally the save. That's a lot of database access.
Bearing in mind that a maximum of 5000 objects is not huge, you should be able to read the whole dataset into memory at the start. Then you can do a single filter to get all the existing Keyword objects in one go, saving a huge number of queries in the Keyword get_or_create and also avoiding the need to instantiate duplicate ProfileKeywords in the first place.
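The "fetch everything once, then insert only what is missing" approach could look roughly like the sketch below. It again uses sqlite3 as a stand-in so that it runs on its own. The keyword table and the add_missing_keywords helper are hypothetical. With the Django ORM the single lookup would be a filter with keyword__in instead of a hand-written IN clause.

```python
import sqlite3


def add_missing_keywords(conn, raw_keywords):
    """Insert only the keywords that are not stored yet; return the set that was added."""
    wanted = {k.lower() for k in raw_keywords}
    if not wanted:
        return set()
    # One query for all existing rows instead of one lookup per keyword.
    # Note: very long lists may need to be split into chunks because
    # databases limit the number of bound parameters per statement.
    placeholders = ",".join("?" * len(wanted))
    rows = conn.execute(
        f"SELECT keyword FROM keyword WHERE keyword IN ({placeholders})",
        tuple(wanted),
    ).fetchall()
    existing = {row[0] for row in rows}
    missing = wanted - existing
    # One transaction and one batched statement for all new rows.
    with conn:
        conn.executemany(
            "INSERT INTO keyword (keyword) VALUES (?)",
            [(k,) for k in sorted(missing)],
        )
    return missing


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keyword (keyword TEXT PRIMARY KEY)")
with conn:
    conn.executemany("INSERT INTO keyword (keyword) VALUES (?)", [("alpha",), ("beta",)])

added = add_missing_keywords(conn, ["Alpha", "Gamma", "gamma", "Delta"])
total = conn.execute("SELECT COUNT(*) FROM keyword").fetchone()[0]
print(sorted(added), total)
```

The same set difference can be applied to the ProfileKeyword rows, so duplicates are never instantiated in the first place.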