BigQuery - Where can I find the error stream? - google-cloud-platform

I have uploaded a CSV file with 300K rows from GCS to BigQuery, and received the following error:
Where can I find the error stream?
I've changed the create table configuration to allow 4000 errors and it worked, so it must be a problem with the 3894 rows in the message, but this error message does not tell me much about which rows or why.
Thanks

I finally managed to see the error stream by running the following command in the terminal:
bq --format=prettyjson show -j <JobID>
It returns JSON with more details.
In my case it was:
"message": "Error while reading data, error message: Could not parse '16.66666666666667' as int for field Course_Percentage (position 46) starting at location 1717164"

You should be able to click on Job History in the BigQuery UI, then click the failed load job. I tried loading an invalid CSV file just now, and the errors that I see are:
Errors:
Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the error stream for more details. (error code: invalid)
Error while reading data, error message: CSV table references column position 1, but line starting at position:0 contains only 1 columns. (error code: invalid)
The first one is just a generic message indicating the failure, but the second error (from the "error stream") is the one that provides more context for the failure, namely CSV table references column position 1, but line starting at position:0 contains only 1 columns.
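Both kinds of failure quoted in this thread (a row with too few columns, and a non-integer value in an integer field) can be looked for locally before loading. The following is only a rough sketch, not how BigQuery itself validates rows. It assumes a file without a header row; the helper name find_bad_rows and its parameters are made up for illustration.

```python
import csv
import io


def find_bad_rows(lines, expected_columns, int_columns=()):
    """Yield (row_number, reason) for rows a strict load would likely reject.

    lines: an iterable of CSV text lines (an open file works).
    expected_columns: number of columns the table schema defines.
    int_columns: zero-based positions of columns declared as INTEGER.
    """
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != expected_columns:
            yield row_number, "has %d columns, expected %d" % (len(row), expected_columns)
            continue
        for position in int_columns:
            value = row[position]
            if value == "":
                continue  # empty field; treated here as a null, not an error
            try:
                int(value)
            except ValueError:
                yield row_number, "could not parse %r as int at position %d" % (value, position)


# Tiny in-memory sample: row 2 has a float in an integer column, row 3 is short.
data = io.StringIO("1,10\n2,16.66666666666667\n3\n")
for row_number, reason in find_bad_rows(data, expected_columns=2, int_columns=[1]):
    print(row_number, reason)
```

Python's int() is more forgiving than a real loader in some cases (it accepts surrounding whitespace, for example), so treat the output as a hint about where to look rather than a guarantee.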
Edit: given a job ID, you can also use the BigQuery CLI to see complete information about the failure. You would use:
bq --format=prettyjson show -j <job ID>
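In the JSON that this command prints, the summary error is under status.errorResult and the individual entries of the error stream are under status.errors. A minimal sketch for pulling out the messages, assuming the output has already been parsed into a dict; the helper name and the trimmed sample below are hand-written for illustration, using messages quoted in this thread.

```python
import json


def load_error_messages(job):
    """Return the error messages recorded in a BigQuery job resource.

    `job` is the dict parsed from `bq --format=prettyjson show -j <job ID>`.
    """
    status = job.get("status", {})
    messages = [err.get("message", "") for err in status.get("errors", [])]
    # Fall back to the summary error when the detailed list is absent.
    if not messages and "errorResult" in status:
        messages.append(status["errorResult"].get("message", ""))
    return messages


# Trimmed, hand-written sample in the shape of a job resource (not real output).
sample = json.loads("""
{
  "status": {
    "state": "DONE",
    "errorResult": {
      "reason": "invalid",
      "message": "Error while reading data, error message: CSV table encountered too many errors, giving up."
    },
    "errors": [
      {
        "reason": "invalid",
        "message": "Error while reading data, error message: Could not parse '16.66666666666667' as int for field Course_Percentage (position 46) starting at location 1717164"
      }
    ]
  }
}
""")

for message in load_error_messages(sample):
    print(message)
```

To use it on a real job, redirect the command's output to a file and pass the result of json.load on that file to the helper.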

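Using the Python client:
from google.api_core.exceptions import BadRequest
# client is a google.cloud.bigquery.Client; args and kwargs are your load arguments
job = client.load_table_from_file(*args, **kwargs)
try:
    result = job.result()  # waits for the load job to finish
except BadRequest as ex:
    # ex.errors holds the individual entries of the error stream
    for err in ex.errors:
        print(err)
    raise
# or alternatively, inspect
# job.errors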

You could also just do:
from google.api_core.exceptions import ClientError
try:
    load_job.result()  # Waits for the job to complete.
except ClientError:
    print(load_job.errors)
    raise
This prints the errors to the screen; you could also log them instead.

Following the rest of the answers, you could also see this information in the GCP logs (Stackdriver) tool.
But this may not answer your question. There seem to be detailed errors (such as the one Elliot found) and vaguer ones that give no description at all, regardless of the UI you use to explore them.

Related

ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream

I am running distributed training on the GCP Vertex platform. The model is trained in parallel on 4 GPUs using PyTorch and HuggingFace. After training, when I save the model from the local container to a GCP bucket, it throws an error.
Here is the code:
I launch train.py this way:
python -m torch.distributed.launch --nproc_per_node 4 train.py
After training is complete, I save the model files as follows. There are 3 files that need to be saved.
trainer.save_model("model_mlm") #Saves in local directory
subprocess.call('gsutil -o GSUtil:parallel_composite_upload_threshold=0 cp -r /pythonPackage/trainer/model_mlm gs://*****/model_mlm', shell=True, stdout=subprocess.PIPE) #from local to GCP
Error:
ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream; this can happen if a file changes size while being uploaded
And sometimes I get this error:
ResumableUploadAbortException: 409 The object has already been created in an earlier attempt and was overwritten, possibly due to a race condition.
According to the documentation on name conflicts, you are trying to overwrite a file that has already been created.
So I would recommend changing the destination location to include a unique identifier per training run so you don't receive this type of error. For example, add the timestamp in string format at the end of your bucket path, like:
- gs://pypl_bkt_prd_row_std_aiml_vertexai/model_mlm_vocab_exp2_50epocs/20220407150000
I would also like to mention that this kind of error is retryable, as described in the error documentation.
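The timestamp suffix can be generated in the training script itself. A minimal sketch, assuming UTC time at second resolution is unique enough for your runs; the function names and the gs://my-bucket path are placeholders, and the gsutil options are the ones from the question.

```python
from datetime import datetime, timezone


def unique_destination(base_uri, now=None):
    """Append a UTC timestamp (YYYYMMDDHHMMSS) to a gs:// prefix."""
    now = now or datetime.now(timezone.utc)
    return "%s/%s" % (base_uri.rstrip("/"), now.strftime("%Y%m%d%H%M%S"))


def build_upload_command(local_dir, base_uri, now=None):
    """Build the gsutil argument list; pass it to subprocess.run to execute."""
    return [
        "gsutil",
        "-o", "GSUtil:parallel_composite_upload_threshold=0",
        "cp", "-r",
        local_dir,
        unique_destination(base_uri, now),
    ]


# "gs://my-bucket/model_mlm" is a placeholder bucket path.
command = build_upload_command("/pythonPackage/trainer/model_mlm", "gs://my-bucket/model_mlm")
print(" ".join(command))
```

Passing the list to subprocess.run(command, check=True) avoids shell=True and raises if gsutil exits with an error.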

Redshift COPY from S3 fails when timestamp is not correct

While loading data into Redshift from S3 via the COPY command, if any record in the file contains an incorrect timestamp, the copy fails. I have passed maxerror as 1000 to the COPY command, but it still fails.
However, on subsequent retries, the same command works, though it does not load the corrupted records.
This is the error I am getting:
ERROR: Assert
DETAIL:
-----------------------------------------------
error: Assert
code: 1000
context: status == 0 - timestamp: '-6585881136298398395'
query: 30903
location: cg_util.cpp:1063
process: query1_69 [pid=25674]
-----------------------------------------------
AWS cli version : aws-cli/1.10.56 Python/2.7.12 Linux/4.4.19-29.55.amzn1.x86_64 botocore/1.4.46
Is there anyone who faced the same issue? How did you resolve it?
Append
ACCEPTANYDATE dateformat 'auto'
to your COPY statement. See ACCEPTANYDATE and DATEFORMAT in the AWS documentation.
This should at least help keep your COPY statements from failing. Values in an unsupported format may still be loaded as NULL (as you mentioned: "I am fine with the corrupt record (record containing wrong timestamp) not getting loaded to Redshift. But other records should be loaded").
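For illustration, here is how the full statement might be assembled from a script. This is a sketch under assumptions: the table name, S3 path and IAM role are placeholders, authorization via IAM_ROLE is assumed (your COPY may use different credentials options), and the values are formatted directly into the SQL, so they must come from trusted configuration.

```python
def build_copy_statement(table, s3_uri, iam_role, max_errors=1000):
    """Compose a Redshift COPY statement with lenient date handling."""
    return (
        "COPY {table}\n"
        "FROM '{s3_uri}'\n"
        "IAM_ROLE '{iam_role}'\n"
        "ACCEPTANYDATE\n"
        "DATEFORMAT 'auto'\n"
        "MAXERROR {max_errors};"
    ).format(
        table=table,
        s3_uri=s3_uri,
        iam_role=iam_role,
        max_errors=int(max_errors),
    )


# All three identifiers below are placeholders.
statement = build_copy_statement(
    "events",
    "s3://my-bucket/events/",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
)
print(statement)
```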

GATE_Using for Thesis_Run-time Error

When I try to run a corpus pipeline on language resources, it throws the error below (even though I follow the order Document Reset, English Tokeniser, Sentence Splitter).
Can someone help me with how to debug this run-time error?
Error:
gate.creole.ExecutionException: No sentences or tokens to process in document Password_Safe-window1.txt_0003E
Please run a sentence splitter and tokeniser first!
at gate.creole.POSTagger.execute(POSTagger.java:257)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.creole.SerialController.runComponent(SerialController.java:225)
at gate.creole.SerialController.executeImpl(SerialController.java:157)
at gate.creole.SerialAnalyserController.executeImpl(SerialAnalyserController.java:223)
at gate.creole.SerialAnalyserController.execute(SerialAnalyserController.java:126)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1759)
at java.lang.Thread.run(Thread.java:745)
Edit:
The files are not empty. When I implemented @dedek's suggestion, it threw no errors, but it raised another problem:
Exception in thread "ApplicationViewer1" java.lang.OutOfMemoryError: Java heap space
I think it is because your document is empty.
Can you confirm that?
There is a run-time parameter of the POSTagger, failOnMissingInputAnnotations; set it to false and it should be OK.
See also the docs:
failOnMissingInputAnnotations - if set to false, the PR will not fail with an ExecutionException if no input Annotations are found and instead only log a single warning message per session and a debug message per document that has no input annotations (run-time, default = true).
Concerning the OutOfMemoryError: Java heap space, see the following questions:
Getting OOM while using GATE on large data set
GATE PersistenceManager.loadObjectFromFile outofmemory error while loading .gapp files
JAVA PermGem memory

Pig's "dump" is not working on AWS

I am trying Pig commands on AWS EMR, but even small commands do not work as I expect. Here is what I did.
Save the following 6 lines as ~/a.csv.
1,2,3
4,2,1
8,3,4
4,3,3
7,2,5
8,4,3
Start Pig
Load the csv file.
grunt> A = load './a.csv' using PigStorage(',');
16/01/06 13:09:09 INFO Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
Dump the variable A.
grunt> dump A;
But this command fails. I expected it to produce the 6 tuples described in a.csv. Instead, the dump command prints a lot of INFO and ERROR lines. The ERROR lines are as follows.
91711 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR 0: java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
16/01/06 13:10:08 ERROR pigstats.PigStats: ERROR 0: java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
91711 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
16/01/06 13:10:08 ERROR mapreduce.MRPigStatsUtil: 1 map reduce job(s) failed!
[...skipped...]
Input(s):
Failed to read data from "hdfs://ip-xxxx.eu-central-1.compute.internal:8020/user/hadoop/a.csv"
Output(s):
Failed to produce result in "hdfs://ip-xxxx.eu-central-1.compute.internal:8020/tmp/temp-718505580/tmp344967938"
[...skipped...]
91718 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias A. Backend error : java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
16/01/06 13:10:08 ERROR grunt.Grunt: ERROR 1066: Unable to open iterator for alias A. Backend error : java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
(I have changed the IP-like parts.) The error message seems to say that the load operator also fails.
I have no idea why even the dump operator fails. Can you give me any advice?
Note
I also tried using TABs in a.csv instead of commas and executing A = load './a-tab.csv';, but it does not help.
$ pig -x local -> A = load 'a.csv' using PigStorage(','); -> dump A;. Then
Input(s):
Failed to read data from "file:///home/hadoop/a.csv"
If I use the full path, namely A = load '/home/hadoop/a.csv' using PigStorage(',');, then I get
Input(s):
Failed to read data from "/home/hadoop/a.csv"
I have encountered the same problem. You can switch to the root user with su root, then run ./bin/pig in PIG_HOME to start Pig in mapreduce mode. Alternatively, you can stay as the current user and start Pig with sudo ./bin/pig in PIG_HOME, but then you must export JAVA_HOME and HADOOP_HOME in the ./bin/pig file.
If you want to use your local file system, you have to start Pig in step 2 as below:
bin/pig -x local
If you start it as just bin/pig, it will look for the file in DFS. That's why you get the error Failed to read data from "hdfs://ip-xxxx.eu-central-1.compute.internal:8020/user/hadoop/a.csv"

"Resource temporarily unavailable" error on reading CSV file in web2py app on PythonAnywhere

I have a Python web2py app uploaded to PythonAnywhere. The app is working fine. I want to read a CSV file placed in a folder alongside my app and import it into a MySQL table. When I try to read that CSV file, I get the error "[Errno 11] Resource temporarily unavailable".
I am new to Python as well as PythonAnywhere. I don't understand this issue and can't figure out how to overcome this error and read a CSV file successfully on the server.
Note: I can run this code successfully on my local machine.
What I am doing is this:
import csv
path = '/home/user123/web2py/files/'
file_ = path + filename
print file_
with open(file_, "r") as f_obj:
    reader = csv.reader(f_obj)
    fields = reader.next()  # first row holds the column names
    print fields
    self.create_new_table(tablename, fields)
Will appreciate any help in this regard.
Thanx in advance.
I opened the server.log file in the Web tab and found that the statement "print fields" was causing the error. It tried to print all the column names and, partway through them, produced this error and stopped execution. I removed the print statements that were trying to print long output, and the error was gone!
There seems to be a limit on print or something similar; I don't know exactly.
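If writing long lines to standard output really is the trigger, one option is to send such diagnostics to a log file through the logging module and keep each entry short. This is only a sketch in Python 3 syntax, and I have not verified that it avoids the error on PythonAnywhere; the logger name, log path and read_header helper are made up for illustration.

```python
import csv
import io
import logging
import os
import tempfile

# Placeholder log location; point this at a file your app may write to.
LOG_PATH = os.path.join(tempfile.gettempdir(), "csv_import.log")

logger = logging.getLogger("csv_import")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(logging.FileHandler(LOG_PATH))


def read_header(f_obj, max_logged=10):
    """Return the header row, logging only a short summary of it."""
    reader = csv.reader(f_obj)
    fields = next(reader)  # first row holds the column names
    logger.info("read %d columns; first ones: %s", len(fields), fields[:max_logged])
    return fields


# In-memory stand-in for the uploaded file.
sample = io.StringIO("id,name,score\n1,Ann,90\n")
fields = read_header(sample)
```

The returned list can then be passed on to create_new_table as in the question, without anything being written to stdout.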