ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream - google-cloud-platform

I am doing distributed training on the GCP Vertex AI platform. The model is trained in parallel on 4 GPUs using PyTorch and HuggingFace. After training, when I save the model from the local container to a GCS bucket, it throws the error below.
Here is the code:
I launch the train.py this way:
python -m torch.distributed.launch --nproc_per_node 4 train.py
After training is complete, I save the model files with the code below; there are 3 files that need to be saved.
trainer.save_model("model_mlm") #Saves in local directory
subprocess.call('gsutil -o GSUtil:parallel_composite_upload_threshold=0 cp -r /pythonPackage/trainer/model_mlm gs://*****/model_mlm', shell=True, stdout=subprocess.PIPE) #from local to GCP
Error:
ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream; this can happen if a file changes size while being uploaded
And sometimes I get this error:
ResumableUploadAbortException: 409 The object has already been created in an earlier attempt and was overwritten, possibly due to a race condition.

As per the documentation, this is a name conflict: you are trying to overwrite a file that has already been created.
So I would recommend changing the destination location to include a unique identifier per training run so you don't receive this type of error, for example by appending a timestamp string to your bucket path:
- gs://pypl_bkt_prd_row_std_aiml_vertexai/model_mlm_vocab_exp2_50epocs/20220407150000
I would also like to mention that this kind of error is retryable, as described in the error docs.
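A minimal sketch of that suggestion, reusing the gsutil command from the question (the bucket name below is a placeholder, since the real one is masked):
import subprocess
from datetime import datetime, timezone

# Placeholder paths; substitute your own local model directory and bucket.
local_dir = "/pythonPackage/trainer/model_mlm"
run_id = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
destination = f"gs://your-bucket/model_mlm/{run_id}"

# Each training run writes to a unique prefix, so a retry never races
# against an object created by an earlier attempt.
subprocess.check_call(
    f"gsutil -o GSUtil:parallel_composite_upload_threshold=0 cp -r {local_dir} {destination}",
    shell=True,
)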

Related

Does AWS S3 GetObject read the partial of the Object being uploaded to s3 at the same time

I have a Lambda (L1) replacing a file (100MB) at an S3 location (s3://bucket/folder/abc.json). I have two other Lambdas (L2, L3) reading the same file at the same time, one via a Golang API and another via an Athena query. The S3 bucket/folder is not versioned.
The question is: do the Lambdas L2 and L3 read the old copy of the file until the new file has been uploaded, or do they read the partial file that is being uploaded? If it's the latter, how do you make sure that L2 and L3 read the file only after a full upload?
Amazon S3 is now strongly consistent. This means once you upload an object, all people that read that object are guaranteed to get the updated version of the object.
On the surface, that sounds like it guarantees that the answer to your question is "yes, all clients will get either the old version or the new version of the file". The truth is still a bit fuzzier than that.
Under the covers, many of the S3 APIs upload with a multi-part upload. This is well known, and doesn't change what I've said above, since the upload must be done before the object is available. However, many of the APIs also use multiple byte-range requests during downloads to download larger objects. This is problematic. It means a download might download part of file v1, then when it goes to download another part, it might get v2 if v2 was just uploaded.
With a little bit of effort, we can demonstrate this:
#!/usr/bin/env python3
import boto3
import multiprocessing
import io
import threading

bucket = "a-bucket-to-use"
key = "temp/dummy_key"
size = 104857600

class ProgressWatcher:
    def __init__(self, filesize, downloader):
        self._size = float(filesize)
        self._seen_so_far = 0
        self._lock = threading.Lock()
        self._launch = True
        self.downloader = downloader
    def __call__(self, bytes_amount):
        with self._lock:
            self._seen_so_far += bytes_amount
            if self._launch and (self._seen_so_far / self._size) >= 0.95:
                self._launch = False
                self.downloader.start()

def upload_helper(pattern, name, callback):
    # Upload a file of 100mb of "pattern" bytes
    s3 = boto3.client('s3')
    print(f"Uploading all {name}..")
    temp = io.BytesIO(pattern * size)
    s3.upload_fileobj(temp, bucket, key, Callback=callback)
    print(f"Done uploading all {name}")

def download_helper():
    # Download a file
    s3 = boto3.client('s3')
    print("Starting download...")
    s3.download_file(bucket, key, "temp_local_copy")
    print("Done with download")

def main():
    # See how long an upload takes
    upload_helper(b'0', "zeroes", None)
    # Watch how the next upload progresses, this will start a download when it's nearly done
    watcher = ProgressWatcher(size, multiprocessing.Process(target=download_helper))
    # Start another upload, overwriting the all-zero file with all-ones
    upload_helper(b'1', "ones", watcher)
    # Wait for the downloader to finish
    watcher.downloader.join()
    # See what the resulting file looks like
    print("Loading file..")
    counts = [0, 0]
    with open("temp_local_copy") as f:
        for x in f.read():
            counts[ord(x) - ord(b'0')] += 1
    print("Results")
    print(counts)

if __name__ == "__main__":
    main()
This code uploads an object to S3 that's 100mb of "0". It then starts an upload, using the same key, of 100mb of "1", and when that second upload is 95% done, it starts a download of that S3 object. It then counts how many "0" and "1"s it sees in the downloaded file.
Running this with the latest versions of Python and Boto3, your exact output will no doubt differ from mine due to network conditions, but this is what I saw with a test run:
Uploading all zeroes..
Done uploading all zeroes
Uploading all ones..
Starting download...
Done uploading all ones
Done with download
Loading file..
Results
[83886080, 20971520]
The last line is important. The downloaded file was mostly "0" bytes, but there were 20mb of "1" bytes. Meaning, I got some part of v1 of the file and some part of v2, despite only performing one download call.
Now, in practice, this is unlikely to happen, even more so if you have better network bandwidth than I do here on a run-of-the-mill home Internet connection.
But it can always potentially happen. If you need to ensure that downloaders never see a mixed file like this, you either need to do something like verifying a hash of the file, or (my preference) upload to a different key each time and have some mechanism for the client to discover the "latest" key, so it can download the whole unchanged file even if a new upload finishes while it's downloading.
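A rough sketch of the different-keys approach with boto3 (the key names and the small "pointer" object are just one possible discovery mechanism, not something S3 provides for you):
import time
import boto3

s3 = boto3.client("s3")
bucket = "a-bucket-to-use"

def publish(data: bytes):
    # 1. Upload the payload under a key that is never reused.
    versioned_key = f"data/abc-{int(time.time())}.json"
    s3.put_object(Bucket=bucket, Key=versioned_key, Body=data)
    # 2. Only after that upload completes, point a tiny "latest" marker at it.
    s3.put_object(Bucket=bucket, Key="data/abc.latest", Body=versioned_key.encode())

def fetch() -> bytes:
    # Readers resolve the pointer first, then download the immutable object,
    # so a newer upload can never change the bytes they are mid-download on.
    latest = s3.get_object(Bucket=bucket, Key="data/abc.latest")["Body"].read().decode()
    return s3.get_object(Bucket=bucket, Key=latest)["Body"].read()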
The readers see only the old file until the new one is fully uploaded. There is no read of a partial file.
"Amazon S3 never adds partial objects."
(See the related SO discussion and the AWS announcement of strong consistency.)

Hyperledger Indy Node - Seed value

I am playing with indy-sdk, and in step 3 of the walkthrough demo (https://github.com/hyperledger/indy-sdk/blob/master/docs/getting-started/indy-walkthrough.md#step-3-getting-the-ownership-for-stewards-verinym) the seed value for the Steward is set to '000000000000000000000000Steward1'.
If I change it (or leave it empty), I get an error. Also, in /tmp/indy/pool1.txn there is no information about this specific value.
My question is: how do we know that this is the right value, and how could we get it?
Why it doesn't work
000000000000000000000000Steward1 is a seed which (given the default key derivation method) generates the DID Th7MpTaRZVRYnPiabds81Y. You can verify this yourself using indy-cli (a command line tool):
indy> wallet create test key=123
Wallet "test" has been created
indy> wallet open test key=123
Wallet "test" has been opened
wallet(test):indy> did new seed=000000000000000000000000Steward1
Did "Th7MpTaRZVRYnPiabds81Y" has been created with "~7TYfekw4GUagBnBVCqPjiC" verkey
In the network you are using, the owner of the DID Th7MpTaRZVRYnPiabds81Y (i.e. whoever knows its associated private key or seed) has the steward role, which grants permission to do various operations on the ledger. So if you modify the seed, it will generate a different DID which won't have the permissions required to execute the operations used further in the tutorial (like writing data to the ledger).
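If you don't want to install indy-cli, here is a rough sketch of the same derivation in Python; as far as I know the default method is: ed25519 keys from the 32-byte seed, and the DID is the base58 encoding of the first 16 bytes of the verkey (requires the base58 and PyNaCl packages):
import base58
from nacl.signing import SigningKey

seed = b"000000000000000000000000Steward1"    # must be exactly 32 bytes
verkey = SigningKey(seed).verify_key.encode()  # 32-byte ed25519 public key

did = base58.b58encode(verkey[:16]).decode()
print(did)  # should print Th7MpTaRZVRYnPiabds81Y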
Where is 000000000000000000000000Steward1 coming from
From what you say, I presume you are using the prebuilt docker image from the indy-sdk repo, which runs a pool of indy-node instances, following some of these instructions.
So the simple answer is that the configuration for 000000000000000000000000Steward1 is pre-baked into it. Look at the Dockerfile used for building the indy-pool image. Notice these lines:
RUN awk '{if (index($1, "NETWORK_NAME") != 0) {print("NETWORK_NAME = \"sandbox\"")} else print($0)}' /etc/indy/indy_config.py> /tmp/indy_config.py
RUN mv /tmp/indy_config.py /etc/indy/indy_config.py
Let's look at what's in that file:
docker exec indylocalhost cat '/etc/indy/indy_config.py'
# Current network
# Disable stdout logging
enableStdOutLogging = False
# Directory to store ledger.
LEDGER_DIR = '/var/lib/indy'
# Directory to store logs.
LOG_DIR = '/var/log/indy'
# Directory to store keys.
KEYS_DIR = '/var/lib/indy'
# Directory to store genesis transactions files.
GENESIS_DIR = '/var/lib/indy'
# Directory to store backups.
BACKUP_DIR = '/var/lib/indy/backup'
# Directory to store plugins.
PLUGINS_DIR = '/var/lib/indy/plugins'
# Directory to store node info.
NODE_INFO_DIR = '/var/lib/indy'
NETWORK_NAME = 'sandbox'
This
# Directory to store genesis transactions files.
GENESIS_DIR = '/var/lib/indy'
Looks like what we are looking for. Let's see what's there
docker exec indylocalhost ls '/var/lib/indy/sandbox'
data
domain_transactions_genesis
keys
node1_additional_info.json
node1_info.json
node1_version_info.json
node2_additional_info.json
node2_info.json
node2_version_info.json
node3_additional_info.json
node3_info.json
node3_version_info.json
node4_additional_info.json
node4_info.json
node4_version_info.json
pool_transactions_genesis
In blockchains, the genesis file is typically the file you use to initially kick off the network, and it may populate the network with some data. In the case of hyperledger-indy, there are 4 "subledgers" which contain different types of transactions: domain, pool, config, and audit. The domain subledger is the one which contains things like DIDs, credential schemas, and credential definitions. We are looking for a DID, so let's look at the domain genesis file.
docker exec indylocalhost cat '/var/lib/indy/sandbox/domain_transactions_genesis'
{"reqSignature":{},"txn":{"data":{"dest":"V4SGRU86Z58d6TV7PBUe6f","role":"0","verkey":"~CoRER63DVYnWZtK8uAzNbx"},"metadata":{},"type":"1"},"txnMetadata":{"seqNo":1},"ver":"1"}
{"reqSignature":{},"txn":{"data":{"dest":"Th7MpTaRZVRYnPiabds81Y","role":"2","verkey":"~7TYfekw4GUagBnBVCqPjiC"},"metadata":{"from":"V4SGRU86Z58d6TV7PBUe6f"},"type":"1"},"txnMetadata":{"seqNo":2},"ver":"1"}
{"reqSignature":{},"txn":{"data":{"dest":"EbP4aYNeTHL6q385GuVpRV","role":"2","verkey":"~RHGNtfvkgPEUQzQNtNxLNu"},"metadata":{"from":"V4SGRU86Z58d6TV7PBUe6f"},"type":"1"},"txnMetadata":{"seqNo":3},"ver":"1"}
{"reqSignature":{},"txn":{"data":{"dest":"4cU41vWW82ArfxJxHkzXPG","role":"2","verkey":"~EMoPA6HrpiExVihsVfxD3H"},"metadata":{"from":"V4SGRU86Z58d6TV7PBUe6f"},"type":"1"},"txnMetadata":{"seqNo":4},"ver":"1"}
{"reqSignature":{},"txn":{"data":{"dest":"TWwCRQRZ2ZHMJFn9TzLp7W","role":"2","verkey":"~UhP7K35SAXbix1kCQV4Upx"},"metadata":{"from":"V4SGRU86Z58d6TV7PBUe6f"},"type":"1"},"txnMetadata":{"seqNo":5},"ver":"1"}
{"reqSignature":{},"txn":{"data":{"dest":"7JhapNNMLnwkbiC2ZmPZSE","verkey":"~LgpYPrzkB6awcHMTPZ9TVn"},"metadata":{"from":"V4SGRU86Z58d6TV7PBUe6f"},"type":"1"},"txnMetadata":{"seqNo":6},"ver":"1"}
{"reqSignature":{},"txn":{"data":{"dest":"MEPecrczs4Wh6FA12u519D","verkey":"~A4rMgHYboWYS1DXibCgo9W"},"metadata":{"from":"V4SGRU86Z58d6TV7PBUe6f"},"type":"1"},"txnMetadata":{"seqNo":7},"ver":"1"}
{"reqSignature":{},"txn":{"data":{"dest":"EAPtwgevBpzP8hkj9sxuzy","verkey":"~gmzSzu3feXC6g2djF7ar4"},"metadata":{"from":"V4SGRU86Z58d6TV7PBUe6f"},"type":"1"},"txnMetadata":{"seqNo":8},"ver":"1"}
{"reqSignature":{},"txn":{"data":{"dest":"LuL1HK1sDruwkfm68jrVfD","verkey":"~Nyv9BKUJuvjgMbfbwk8CFD"},"metadata":{"from":"V4SGRU86Z58d6TV7PBUe6f"},"type":"1"},"txnMetadata":{"seqNo":9},"ver":"1"}
{"reqSignature":{},"txn":{"data":{"dest":"462p8mtcX6jpa9ky565YEL","verkey":"~LCgq4hnSvMvB8nKd9vgsTD"},"metadata":{"from":"V4SGRU86Z58d6TV7PBUe6f"},"type":"1"},"txnMetadata":{"seqNo":10},"ver":"1"}
And you can see that the DID Th7MpTaRZVRYnPiabds81Y is hardcoded on this ledger with verkey ~7TYfekw4GUagBnBVCqPjiC, which matches what we generated from the seed 000000000000000000000000Steward1. You can also see it's given role "2". If you dig deeper into indy-plenum, the consensus project used by Indy, you can find that role ID 2 is the steward role.
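If you'd rather not eyeball the JSON, here is a small sketch that prints the DID, verkey, and role for each genesis transaction (it assumes you copied the file out of the container, e.g. with docker cp; the role-code mapping is the one used by indy-plenum as far as I know):
import json

ROLES = {"0": "TRUSTEE", "2": "STEWARD", "101": "ENDORSER"}  # role "2" is the steward role

with open("domain_transactions_genesis") as f:
    for line in f:
        data = json.loads(line)["txn"]["data"]
        print(data["dest"], data.get("verkey"), ROLES.get(data.get("role"), "none"))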
The seed is a secret value; when we generate an Indy network's genesis transactions, we use public information/keys which are derived from that secret seed value.
To know the right seed values, to create a custom network with keys generated for your own actors, and to generate the pool_transactions_genesis and domain_transactions_genesis files, you have to use indy-plenum.
You can find details on the following tutorial:
https://taseen-junaid.medium.com/hyperledger-indy-custom-network-with-indy-node-plenum-protocol-ledger-85fd10eb5bf5
You can find the code base of that tutorial into following link:
https://github.com/Ta-SeenJunaid/Hyperledger-Indy-Tutorial

BigQuery - Where can I find the error stream?

I have uploaded a CSV file with 300K rows from GCS to BigQuery, and the load job failed with an error.
Where can I find the error stream?
I've changed the create table configuration to allow 4000 errors and it worked, so it must be a problem with the 3894 rows in the message, but this error message does not tell me much about which rows or why.
Thanks
I finally managed to see the error stream by running the following command in the terminal:
bq --format=prettyjson show -j <JobID>
It returns a JSON with more details.
In my case it was:
"message": "Error while reading data, error message: Could not parse '16.66666666666667' as int for field Course_Percentage (position 46) starting at location 1717164"
You should be able to click on Job History in the BigQuery UI, then click the failed load job. I tried loading an invalid CSV file just now, and the errors that I see are:
Errors:
Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the error stream for more details. (error code: invalid)
Error while reading data, error message: CSV table references column position 1, but line starting at position:0 contains only 1 columns. (error code: invalid)
The first one is just a generic message indicating the failure, but the second error (from the "error stream") is the one that provides more context for the failure, namely CSV table references column position 1, but line starting at position:0 contains only 1 columns.
Edit: given a job ID, you can also use the BigQuery CLI to see complete information about the failure. You would use:
bq --format=prettyjson show -j <job ID>
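If you prefer to stay in Python, you can also fetch a finished job by its ID and read its error stream, roughly like this (sketch only; the job ID is a placeholder):
from google.cloud import bigquery

client = bigquery.Client()
job = client.get_job("your-job-id")  # pass location=... if the job is not in the default location

# job.error_result is the summary; job.errors is the full error stream.
for err in job.errors or []:
    print(err["message"])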
Using python client it's
from google.api_core.exceptions import BadRequest

job = client.load_table_from_file(*args, **kwargs)
try:
    result = job.result()
except BadRequest as ex:
    for err in ex.errors:
        print(err)
    raise
# or alternatively
# job.errors
You could also just do:
from google.api_core.exceptions import ClientError

try:
    load_job.result()  # Waits for the job to complete.
except ClientError as e:
    print(load_job.errors)
    raise e
This will print the errors to screen or you could log them etc.
Following the rest of the answers, you could also see this information in the GCP logs (Stackdriver) tool.
But it might happen that this does not answer your question. It seems like there are detailed errors (such as the one Elliot found) and more imprecise ones, which give you no description at all regardless of the UI you're using to explore them.

Processing large files with django celery tasks

My goal is to process a large CSV file, uploaded through a Django form, using Celery. When the file's size is less than settings.FILE_UPLOAD_MAX_MEMORY_SIZE, I can pass the form's cleaned_data variable to a Celery task and read the file with:
@task
def taskFunction(cleaned_data):
    for line in csv.reader(cleaned_data['upload_file']):
        MyModel.objects.create(field=line[0])
However, when the file's size is greater than the above setting, I get the following error:
expected string or Unicode object, NoneType found
Where the stack trace shows the error occurring during pickle:
return dumper(obj, protocol=pickle_protocol)
It appears that when the uploaded file has been spooled to a temporary file on disk, pickling it fails.
The simple solution to this problem is to increase FILE_UPLOAD_MAX_MEMORY_SIZE. However, I am curious whether there is a better way to manage this issue?
Save it to a temp file and pass the file name to celery instead. Delete after processing.
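A minimal sketch of that approach, assuming a Django view handling the upload and a Celery task (process_csv and handle_upload are made-up names; MyModel is the model from the question). It also assumes the web process and the Celery worker share a filesystem; otherwise the file has to go to shared storage first.
import csv
import os
import tempfile

from celery import shared_task

from myapp.models import MyModel  # model from the question; app path is hypothetical

@shared_task
def process_csv(path):
    # Read the CSV from disk instead of pickling the uploaded file object.
    with open(path, newline="") as f:
        for line in csv.reader(f):
            MyModel.objects.create(field=line[0])
    os.remove(path)  # clean up once processing succeeds

def handle_upload(upload_file):
    # Stream the upload to a named temp file, then hand only the path to the
    # task so its arguments stay small and picklable.
    with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as tmp:
        for chunk in upload_file.chunks():
            tmp.write(chunk)
    process_csv.delay(tmp.name)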

Getting "file does not exist" error when running an Amazon EMR job

I have uploaded my data
genotype1_large_ind_large.txt
phenotype1_large_ind_large_1.txt
to S3, and in the EMR UI I set the parameters like below:
RunDear.run s3n://scalability/genotype1_large_ind_large.txt s3n://scalability/phenotype1_large_ind_large_1.txt s3n://scalability/output_1phe 33 10 4
In my class RunDear.run I distribute the files genotype1_large_ind_large.txt and phenotype1_large_ind_large_1.txt to the cache.
However, after running the EMR, I get the following error:
java.io.FileNotFoundException: File does not exist: /genotype1_large_ind_large.txt
I am wondering why there is a slash '/' in front of the file name, and how I can make this work.
I also tried the invocation below, but my program then takes -cacheFile as an argument, so that does not work either:
RunDear.run -cacheFile s3n://scalability/genotype1_large_ind_large.txt#genotype.txt -cacheFile s3n://scalability/phenotype1_large_ind_large_1.txt#phenotype.txt s3n://scalability/output_1phe 33 280 4
I finally realized it was a problem with which filesystem was being used, so I added code to the program like below:
// Resolve paths against the S3 bucket rather than the cluster's default (local HDFS) filesystem
FileSystem fs = FileSystem.get( URI.create("s3://scalability"), conf);