Pig's "dump" is not working on AWS - amazon-web-services

I am trying Pig commands on AWS EMR, but even simple commands do not work as I expect. Here is what I did.
Save the following 6 lines as ~/a.csv.
1,2,3
4,2,1
8,3,4
4,3,3
7,2,5
8,4,3
Start Pig.
Load the csv file.
grunt> A = load './a.csv' using PigStorage(',');
16/01/06 13:09:09 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
Dump the variable A.
grunt> dump A;
But this command fails. I expected it to produce the 6 tuples described in a.csv. Instead, dump prints a lot of INFO lines and ERROR lines. The ERROR lines are the following.
91711 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
16/01/06 13:10:08 ERROR pigstats.PigStats: ERROR 0: java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
91711 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
16/01/06 13:10:08 ERROR mapreduce.MRPigStatsUtil: 1 map reduce job(s) failed!
[...skipped...]
Input(s):
Failed to read data from "hdfs://ip-xxxx.eu-central-1.compute.internal:8020/user/hadoop/a.csv"
Output(s):
Failed to produce result in "hdfs://ip-xxxx.eu-central-1.compute.internal:8020/tmp/temp-718505580/tmp344967938"
[...skipped...]
91718 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias A. Backend error : java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
16/01/06 13:10:08 ERROR grunt.Grunt: ERROR 1066: Unable to open iterator for alias A. Backend error : java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
(I have masked the IP-like hostname.) The error message seems to say that the load operator also fails.
I have no idea why even the dump operator fails. Can you give me any advice?
Note
I also tried a file a-tab.csv, using TABs instead of commas, and executed A = load './a-tab.csv';, but it did not help.
I also tried local mode: $ pig -x local, then A = load 'a.csv' using PigStorage(','); and dump A;. Then I get
Input(s):
Failed to read data from "file:///home/hadoop/a.csv"
If I use the full path, namely A = load '/home/hadoop/a.csv' using PigStorage(',');, then I get
Input(s):
Failed to read data from "/home/hadoop/a.csv"

I have encountered the same problem. You may try su root to switch to the root user, then run ./bin/pig from PIG_HOME to start Pig in MapReduce mode. Alternatively, you can keep the current user and run sudo ./bin/pig from PIG_HOME to start Pig, but then you must export JAVA_HOME and HADOOP_HOME in the ./bin/pig file.

If you want to use your local file system, you have to start Pig in step 2 as below:
bin/pig -x local
If you start it simply as bin/pig, it will look for the file in HDFS. That is why you get the error Failed to read data from "hdfs://ip-xxxx.eu-central-1.compute.internal:8020/user/hadoop/a.csv".
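If you want to stay in the default MapReduce mode instead, a minimal sketch of the other route (assuming the hadoop user's HDFS home directory, which matches the path in the error message) is to copy the file into HDFS first and then load it:
hdfs dfs -put ~/a.csv /user/hadoop/a.csv
pig
grunt> A = load 'a.csv' using PigStorage(',');
grunt> dump A;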

Related

ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream

I am doing distributed training on the GCP Vertex AI platform. The model is trained in parallel on 4 GPUs using PyTorch and Hugging Face. After training, when I copy the saved model from the local container to a GCS bucket, it throws the error below.
Here is the code:
I launch the train.py this way:
python -m torch.distributed.launch --nproc_per_node 4 train.py
After training is complete, I save the model files as follows. There are 3 files that need to be saved.
trainer.save_model("model_mlm")  # saves to a local directory
subprocess.call('gsutil -o GSUtil:parallel_composite_upload_threshold=0 cp -r /pythonPackage/trainer/model_mlm gs://*****/model_mlm', shell=True, stdout=subprocess.PIPE)  # copies from local to GCS
Error:
ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream; this can happen if a file changes size while being uploaded
And sometimes I get this error:
ResumableUploadAbortException: 409 The object has already been created in an earlier attempt and was overwritten, possibly due to a race condition.
As per the documentation on name conflicts, you are trying to overwrite a file that has already been created.
So I would recommend changing the destination location to include a unique identifier per training run, so you don't receive this type of error. For example, append a timestamp in string format to the end of your bucket path, like:
- gs://pypl_bkt_prd_row_std_aiml_vertexai/model_mlm_vocab_exp2_50epocs/20220407150000
I would also like to mention that this kind of error is retryable, as noted in the error docs.
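As a minimal sketch of that suggestion (the bucket name and local path here are placeholders, and the gsutil flags are simply carried over from the question):
import subprocess
from datetime import datetime

# Hypothetical destination; replace the bucket and prefix with your own.
run_id = datetime.utcnow().strftime("%Y%m%d%H%M%S")
dest = f"gs://my-bucket/model_mlm/{run_id}"

subprocess.call(
    "gsutil -o GSUtil:parallel_composite_upload_threshold=0 "
    f"cp -r /pythonPackage/trainer/model_mlm {dest}",
    shell=True,
)
If several of the launched workers execute the same save/upload code, restricting the upload to a single rank (for example, only rank 0) also avoids concurrent writes to the same object.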

Batch JCL error for CICS web services using the CICS web service assistant tool

I am facing an issue while submitting the job below; can someone please suggest a fix?
Error:
IEF344I KA7LS2W2 INPUT STEP1 SYSUT2 - ALLOCATION FAILED DUE TO DATA FACILITY SYSTEM ERROR
IGD17501I ATTEMPT TO OPEN A UNIX FILE FAILED,
RETURN CODE IS (00000081) REASON CODE IS (0594003D)
FILENAME IS (/ka7a/KA7A.in)
JCL:
//KA7LS2W2 JOB (51,168),'$ACCEPT',CLASS=1,
// MSGCLASS=X,MSGLEVEL=(1,0),NOTIFY=&SYSUID,REGION=0M
// EXPORT SYMLIST=*
// JCLLIB ORDER=SYS2.CI55.SDFHINST
//STEP1 EXEC DFHLS2WS,
// JAVADIR='java/J7.0_64',PATHPREF='',TMPDIR='/ka7a',
// USSDIR='',TMPFILE=&QT.&SYSUID.&QT
//INPUT.SYSUT1 DD *
PDSLIB=//DJPN.KA7A.POC
LANG=COBOL
PGMINT=CHANNEL
PGMNAME=KZHFEN1C
REQMEM=PAYIN
RESPMEM=PAYOUT
MAPPING-LEVEL=2.2
LOGFILE=/home/websrvices/wsbind/payws.log
WSBIND=/home/webservices/wsbind/payws.wsbind
WSDL=/home/webservices/wsdl/payws.wsdl
/*
Based on the Return Code 81 / Reason Code 0594003D the pathname can't be resolved.
The message IGD17501I explains the error. You'll find more information by looking up Reason Code 0594003D.
You can use BPXMTEXT to lookup more detail on the Reason Code.
Executing this command in USS you'll see:
$ bpxmtext 0594003D
BPXFVLKP 05/14/20
JRDirNotFound: A directory in the pathname was not found
Action: One of the directories specified was not found. Verify that the name
specified is spelled correctly.
Per @phunsoft: the same command can also be executed in TSO, where it is not case sensitive as it is in USS.
I'd suspect that /ka7a doesn't exist. Is it a case issue? Or perhaps you meant /u/ka7a/ or /home/ka7a?
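If you have a USS shell on that system, a quick sketch to confirm the suspicion (the directory name is taken from the TMPDIR parameter in the JCL):
ls -ld /ka7a        # check whether the directory exists
mkdir -p /ka7a      # create it if missing, with ownership/permissions suitable for the job's user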

What should the GridMix input look like?

I used Rumen to mine job-history files, which produced job-trace.json and job-topology.json.
The GridMix usage looks like:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-gridmix-2.7.3.jar -libjars $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.7.3.jar -Dgridmix.compression-emulation.enable=false <iopath> <trace>
<iopath> means the working directory for GridMix, so I fed it file:///home/hadoop/input; <trace> means the trace file extracted from the log files, so I fed it file:///home/hadoop/rumen/job-trace-1hr.json.
Finally, I get the following exceptions:
2019-03-07 16:37:12,495 ERROR [main] gridmix.Gridmix (Gridmix.java:start(534)) - Startup failed. java.io.IOException: Found no satisfactory file in file:/home//hadoop/input
2019-03-07 16:37:13,040 INFO [main] util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 2
2019-03-07 16:37:13,041 INFO [Thread-1] gridmix.Gridmix (Gridmix.java:run(657)) - Exiting...
So what should <iopath> look like, and how should I use it?
Does anyone have any ideas?
Thanks.
I found it was my own incorrect usage.
I checked the GridMix parameter documentation; the problem was simply that my input data was too small.
gridmix.min.file.size | The minimum size of the input files. The default limit is 128 MiB. Tweak this parameter if you see an error-message like "Found no satisfactory file" while testing GridMix with a relatively-small input data-set.
So I generated a larger input data set, using -generate 10G.
Thanks.
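For reference, a sketch of the full invocation with the data-generation step included (paths and jar versions are carried over from the question, and 10G is the size mentioned above):
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-gridmix-2.7.3.jar \
  -libjars $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.7.3.jar \
  -Dgridmix.compression-emulation.enable=false \
  -generate 10G \
  file:///home/hadoop/input \
  file:///home/hadoop/rumen/job-trace-1hr.json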

BigQuery - Where can I find the error stream?

I have uploaded a CSV file with 300K rows from GCS to BigQuery, and received the following error:
Where can I find the error stream?
I've changed the create-table configuration to allow 4,000 errors and it worked, so it must be a problem with the 3,894 rows mentioned in the message, but the error message does not tell me much about which rows failed or why.
Thanks
I finally managed to see the error stream by running the following command in the terminal:
bq --format=prettyjson show -j <JobID>
It returns a JSON with more details.
In my case it was:
"message": "Error while reading data, error message: Could not parse '16.66666666666667' as int for field Course_Percentage (position 46) starting at location 1717164"
You should be able to click on Job History in the BigQuery UI, then click the failed load job. I tried loading an invalid CSV file just now, and the errors that I see are:
Errors:
Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the error stream for more details. (error code: invalid)
Error while reading data, error message: CSV table references column position 1, but line starting at position:0 contains only 1 columns. (error code: invalid)
The first one is just a generic message indicating the failure, but the second error (from the "error stream") is the one that provides more context for the failure, namely CSV table references column position 1, but line starting at position:0 contains only 1 columns.
Edit: given a job ID, you can also use the BigQuery CLI to see complete information about the failure. You would use:
bq --format=prettyjson show -j <job ID>
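If you don't have the job ID handy, listing recent jobs with the CLI should surface it (this is an extra tip, not part of the answer above; check bq ls --help if the flags differ on your version):
bq ls -j -n 10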
Using the Python client, it's:
from google.api_core.exceptions import BadRequest

job = client.load_table_from_file(*args, **kwargs)

try:
    result = job.result()
except BadRequest as ex:
    for err in ex.errors:
        print(err)
    raise
    # or alternatively
    # job.errors
You could also just do:
from google.api_core.exceptions import ClientError

try:
    load_job.result()  # Waits for the job to complete.
except ClientError as e:
    print(load_job.errors)
    raise e
This will print the errors to the screen, or you could log them, etc.
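For completeness, here is a minimal sketch of how the client call above might be wired up end to end; the project, dataset, table, and file names are placeholders, and passing a string table ID assumes a reasonably recent google-cloud-bigquery client:
from google.cloud import bigquery
from google.api_core.exceptions import BadRequest

client = bigquery.Client()

# Hypothetical destination table and local CSV file.
table_id = "my-project.my_dataset.my_table"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

with open("data.csv", "rb") as source_file:
    load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)

try:
    load_job.result()  # waits for the load to finish
except BadRequest:
    for err in load_job.errors:  # the per-row error stream
        print(err)
    raise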
Following the rest of the answers, you could also see this information in the GCP logs (Stackdriver) tool.
But it might happen that this does not answer your question. It seems there are detailed errors (such as the one Elliot found) and more imprecise ones, which give you no description at all regardless of the UI you use to explore them.

Redshift COPY from S3 fails when timestamp is not correct

While loading data into Redshift from S3 via the COPY command, if any record in the file contains an incorrect timestamp, the copy fails. I have passed maxerror as 1000 to the COPY command, but it still fails.
However, on subsequent retries the same command works, though it fails to load the corrupted records.
This is the error I am getting:
ERROR: Assert
DETAIL:
-----------------------------------------------
error: Assert
code: 1000
context: status == 0 - timestamp: '-6585881136298398395'
query: 30903
location: cg_util.cpp:1063
process: query1_69 [pid=25674]
-----------------------------------------------
AWS cli version : aws-cli/1.10.56 Python/2.7.12 Linux/4.4.19-29.55.amzn1.x86_64 botocore/1.4.46
Is there anyone who faced the same issue? How did you resolve it?
Append
ACCEPTANYDATE DATEFORMAT 'auto'
to your COPY statement.
See ACCEPTANYDATE and DATEFORMAT in the AWS documentation.
This will at least help ensure that your COPY statements don't fail outright. Values in unsupported formats may still be loaded as NULL, which matches what you said: you are fine with the corrupt record (the one containing the wrong timestamp) not getting loaded into Redshift as long as the other records are loaded.
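Putting it together, a sketch of such a COPY statement (the table name, S3 path, IAM role, and file format are placeholders; MAXERROR is the value from the question, and TIMEFORMAT 'auto' is an extra option, not part of the original answer, that applies the same leniency to timestamp columns):
COPY my_table
FROM 's3://my-bucket/path/data'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
ACCEPTANYDATE
DATEFORMAT 'auto'
TIMEFORMAT 'auto'
MAXERROR 1000;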