Apache Spark - Reading data from an AWS S3 bucket with Glacier objects

The scenario is this:
I'm using Spark to read an S3 bucket where some objects (parquet) were transitioned to the Glacier storage class. I'm not trying to read those objects, but Spark fails on this kind of bucket (https://jira.apache.org/jira/browse/SPARK-21797).
There is a workaround that "fixes" this issue: https://jira.apache.org/jira/browse/SPARK-21797?focusedCommentId=16140408&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16140408. But looking into the code (https://github.com/apache/spark/pull/16474/files), the calls are still made and only the files that raise an IOException are skipped. Is there a better way to configure Spark to load only Standard-class objects from the S3 bucket?

Someone (you?) gets to fix https://issues.apache.org/jira/browse/HADOOP-14837: have s3a raise a specific exception when an attempt to read glaciated data fails.
Then Spark needs to recognise and skip that exception when it happens.
I don't think S3's LIST call flags when an object is glaciated, so the filtering cannot be done during query planning/partitioning, and it would be very expensive to call HEAD for each object at that point in the process.
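That said, one way to sidestep the problem is to build the file list outside Spark and hand only non-Glacier paths to the reader. This is a minimal sketch rather than a Spark configuration option; it assumes boto3 is available and relies on the per-key StorageClass field that list_objects_v2 returns (bucket and prefix names below are placeholders):
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
s3 = boto3.client("s3")

bucket, prefix = "mybucket", "data/"  # placeholders

# Collect only objects that are not archived in Glacier.
paths = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj.get("StorageClass", "STANDARD") not in ("GLACIER", "DEEP_ARCHIVE"):
            paths.append("s3a://{}/{}".format(bucket, obj["Key"]))

df = spark.read.parquet(*paths)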

Related

BigQuery external table operator use wrong schema path

Here is a snippet from a DAG that I am working on:
create_ext_table = bigquery_operator.BigQueryCreateExternalTableOperator(
    task_id='create_ext_table',
    bucket='bucket-a',
    source_objects='path/*',
    schema_object='bucket-b/data/schema.json',
    destination_project_dataset_table='sandbox.write_to_BQ',
    source_format='CSV',
    field_delimiter=';')
create_ext_table
When I run the code, I get the following error on Composer 1.10.10+composer:
404 GET https://storage.googleapis.com/download/storage/v1/b/bucket-a/o/bucket-b%2Fdata%2Fschema.json?alt=media: (u'Request failed with status code', 404, u'Expected one of', 200, 206)
As seen in the error, Airflow concatenates the bucket param with the schema_object param ... Is there any workaround for this? I cannot store the table schema and the table files in the same bucket.
Thanks
This is expected: as you can see in the source code for the operator here, the bucket argument is used to fetch the schema_object, so the operator assumes both live in the same bucket.
As you mentioned you cannot store them together, there are a few workarounds you can try. I'll describe them at a high level:
You can extend the operator and override the execute method so that it retrieves the schema from the bucket you care about.
You can add an upstream task to move the schema object to bucket-a using GoogleCloudStorageToGoogleCloudStorageOperator (see the sketch after the final note below). This requires handling schema_object differently from the way the source code handles it, namely parsing it for the bucket name and object path and then retrieving it. Alternatively, you can create your own argument (something like schema_bucket) and use it in a similar manner.
You can also delete this object using GoogleCloudStorageDeleteOperator as a downstream task after creating the external table, so it does not have to be persisted in bucket-a.
Final note on the schema_object argument: it's meant to be the GCS path within the same bucket, so if you use the operator as already defined it should be schema_object='data/schema.json'.
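A minimal sketch of the copy-the-schema workaround from the list above, assuming Airflow 1.10-style contrib imports and the bucket/object names from the question (the destination path data/schema.json is a choice for this example, not something the operator requires; the dag= argument is omitted, as in the snippet above):
from airflow.contrib.operators import bigquery_operator
from airflow.contrib.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator

# Copy the schema from bucket-b into bucket-a so the operator can find it.
copy_schema = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_schema',
    source_bucket='bucket-b',
    source_object='data/schema.json',
    destination_bucket='bucket-a',
    destination_object='data/schema.json')

create_ext_table = bigquery_operator.BigQueryCreateExternalTableOperator(
    task_id='create_ext_table',
    bucket='bucket-a',
    source_objects=['path/*'],
    schema_object='data/schema.json',  # path relative to bucket-a
    destination_project_dataset_table='sandbox.write_to_BQ',
    source_format='CSV',
    field_delimiter=';')

copy_schema >> create_ext_table
With the schema copied into bucket-a, schema_object stays relative to the same bucket the operator already uses.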

How to set the starting index for an ObjectWriteStream?

I'm trying to build a connection to a file in a Google Storage bucket, but I'm having difficulty implementing an ObjectWriteStream. The problem is that if I create an ObjectWriteStream to a file that is already in the cloud, it deletes the old file and starts from the beginning. Here is an example:
namespace gcs = google::cloud::storage;

void test(gcs::Client client, std::string bucket_name, std::string file_name) {
    auto writeCon = client.WriteObject(bucket_name, file_name);
    writeCon << "This is a test";
    writeCon.Close();
}
What should I do to prevent the ObjectWriteStream from deleting my file, and to upload data from the location I want (e.g. append data to the file)? I have tried calling the standard ostream function seekp to set the stream position, but that does not work since ObjectWriteStream does not support it. Strangely, ObjectReadStream does not support this operation either, but it has an option, gcs::ReadRange(start, end), to set the starting location. Therefore, I am wondering if there is a non-standard way to set the position for an ObjectWriteStream. I would appreciate any advice.
it will delete the old file and start from the beginning of it.
This is by design. Remember that GCS is not a filesystem; it is an object store. In an object store, the object is the atomic unit: you cannot modify objects in place.
If you require filesystem semantics, you may want to use Cloud Filestore instead.
The answers indicating that objects are immutable are correct. However, two or more objects can be concatenated together using the compose API. Here's the relevant javadoc.
So you could combine a few techniques to effectively append to objects in GCS.
You could copy your existing object (A) to a new object (B) in the same location and storage class (this will be very fast), delete A, upload new data into object C, and then compose B+C into A's original location. Then delete B and C. This will require a copy, delete, upload, compose, and then two deletes -- so six operations. Be mindful of operations costs.
You could simply upload a new object (B) and compose A+B into a new object, C, and record the name of the new object in a metadata database, if you're using one. This would require only an upload, compose, and two deletes.
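For illustration, here is a rough sketch of that append-by-compose pattern, written with the Python google-cloud-storage client rather than the C++ one (bucket and object names are placeholders; compose allows the destination to appear among its sources, which is what makes the in-place append work):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket name

# Upload the data to append as a temporary object.
new_part = bucket.blob("data.txt.part")
new_part.upload_from_string("This is a test\n")

# Compose the existing object plus the new part back into the original name.
target = bucket.blob("data.txt")
target.compose([bucket.blob("data.txt"), new_part])

# Remove the temporary part once the compose has succeeded.
new_part.delete()
Keep in mind that a single compose call accepts at most 32 source components.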
Within Google Cloud Storage, objects are immutable. See:
https://cloud.google.com/storage/docs/key-terms#immutability
What this means is that you simply can't append to a file. You can re-write the file passing in the original content and then add more content.

Datastore: Batch must be in progress to put()

I am trying to use the Google Cloud Datastore API client library to upload an entity with a batch on Datastore. My version is 1.6.0.
This is my code:
from google.cloud import datastore

client = datastore.Client()
batch = client.batch()
key = client.key('test', 'key1')
entity = datastore.Entity(
    key,
    exclude_from_indexes=['attribute2'])
entity.update({
    'attribute1': 'hello',
    'attribute2': 'hello again',
    'attribute3': 0.98,
})
batch.put(entity)
And I am getting this error when I do batch.put():
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/google/cloud/datastore/batch.py", line 185, in put
raise ValueError('Batch must be in progress to put()')
ValueError: Batch must be in progress to put()
What am I doing wrong?
You'll need to explicitly call batch.begin() if you aren't doing the puts in a context manager, i.e. using the with keyword.
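A minimal sketch of both options, reusing the client and entity from the question (the context-manager form calls begin() and commit() for you):
# Option 1: drive the batch explicitly.
batch = client.batch()
batch.begin()      # moves the batch into the in-progress state
batch.put(entity)
batch.commit()     # sends the queued mutations to Datastore

# Option 2: the context manager calls begin() and commit() automatically.
with client.batch() as batch:
    batch.put(entity)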
Short answer: as stated by @JimMorrison in his answer, you need to call the batch.begin() method when beginning a batch in Datastore, unless you are using the with statement, as explained in the documentation for google.cloud.datastore.batch.Batch.
Long answer: the short answer works, but working with batches directly is not the recommended approach according to the documentation. The Datastore documentation offers some information about how to work with batch operations, distinguishing two types of batch operations:
Non-transactional batches: predefined methods (generally carrying the _multi suffix, indicating that they operate over multiple entities at once) that let you act on multiple objects in a single Datastore call. These methods are get_multi(), put_multi() and delete_multi(), but they are not executed transactionally, i.e. if an error happens it is possible that only some of the operations in the request succeeded.
Transactions: atomic operations which are guaranteed never to be partially applied, i.e. either all operations are applied, or none (if an error occurs). This is a handy feature depending on the type of operations you are trying to perform.
According to the documentation about Datastore's Batch class, Batch is extended by the Transaction class, so unless you want to perform some specific operations with the underlying Batch, you should probably work with transactions instead, as explained in the documentation, where you will find more examples and best practices. This is preferred to what you are doing now, which is working with the Batch class directly.
TL;DR: batch.begin() will solve the issue in your code, as the batch needs to be initialized into the _IN_PROGRESS status. However, if you want to make your life easier and use batches through the transaction abstraction, with better documentation and examples, I would recommend using transactions instead (see the sketch below).
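For completeness, a rough sketch of the two recommended alternatives, reusing the client and entity from the question:
# Non-transactional batch: a single call, but not atomic across entities.
client.put_multi([entity])

# Transaction: the context manager begins and commits atomically.
with client.transaction():
    client.put(entity)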

downloading from AWS S3 while file is being updated

This may seem like a really basic question, but if I am downloading a file from S3 while it is being updated by another process, do I have to worry about getting an incomplete file?
Example: a 200MB CSV file. User A starts to update the file with 200MB of new content at 1Mbps. 16 seconds later, User B starts downloading the file at 200Mbps. Does User B get all 200MB of the original file, or does User B get ~2MB of User A's changes and nothing else?
User B gets all 200MB of the original file.
Here's why:
PUT operations on S3 are atomic. There's technically no such thing as "modifying" an object. What actually happens when an object is overwritten is that the object is replaced with another object having the same key. But the original object is not actually replaced until the new (overwriting) object is uploaded in its entirety, and successfully...and even then, the overwritten object is not technically "gone" yet -- it's only been replaced in the bucket's index, so that future requests will be served the new object.
(Serving the new object is actually documented as not being guaranteed to always happen immediately. In contrast with uploads of new objects, which are immediately available for download, overwrites of existing objects are eventually consistent, meaning that it's possible -- however unlikely -- that for a short period of time after you overwrite an object, the old copy could still be served for subsequent requests.)
But when you overwrite an object, and versioning is not enabled on the bucket, the old and new objects are actually stored independently in S3, in spite of having the same key. The old object is no longer referenced by the bucket's index, so you are no longer billed for storage of it, and it will shortly be purged from S3's backing store. It's not actually documented how much later this happens... but (tl;dr) overwriting an object that is currently being downloaded should not cause any unexpected side effects.
Updates to a single key are atomic. For example, if you PUT to an existing key, a subsequent read might return the old data or the updated data, but it will never write corrupted or partial data.
http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel

Google cloud buckets - is there a way to fetch by prefix

Google Cloud Storage Buckets has a function to get a paginated listing of the object names inside a bucket, called "list". Here are the docs:
https://developers.google.com/storage/docs/json_api/v1/buckets/list
If I want to discover whether a certain object name exists, the only (apparent) way to do so is to fetch ALL object names, one page at a time, and look through them myself. This is not scalable.
We have 10,000+ objects stored. So if I want to find gs://mybucket/my/simulated/dir/* or if I want to find gs://mybucket/my/sim*/subdir/*.txt the only way to do so is to retrieve 600,000 bytes of listing information and filter through it with code.
The question: Does anyone know a way, short of keeping track of the object names myself somehow, to get JUST the listings I care about?
It turns out I'm crazy. I was looking at the /buckets/ documentation, and instead I should have been looking at the /objects/ documentation.
https://developers.google.com/storage/docs/json_api/v1/objects/list
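For reference, the objects/list endpoint accepts prefix (and delimiter) parameters, so the directory-style lookup can be narrowed server-side. A small sketch with the Python client, using the bucket and path from the question (the mid-path wildcard case, gs://mybucket/my/sim*/subdir/*.txt, still needs client-side filtering of the returned names):
from google.cloud import storage

client = storage.Client()

# Only names under the "directory" prefix are returned, not the whole bucket.
for blob in client.list_blobs("mybucket", prefix="my/simulated/dir/"):
    print(blob.name)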