Here is a snippet from a DAG that I am working on:
create_ext_table = bigquery_operator.BigQueryCreateExternalTableOperator(
    task_id='create_ext_table',
    bucket='bucket-a',
    source_objects='path/*',
    schema_object='bucket-b/data/schema.json',
    destination_project_dataset_table='sandbox.write_to_BQ',
    source_format='CSV',
    field_delimiter=';')

create_ext_table
When I run the code on Cloud Composer (Airflow 1.10.10+composer), I get the following error:
404 GET https://storage.googleapis.com/download/storage/v1/b/bucket-a/o/bucket-b%2Fdata%2Fschema.json?alt=media: (u'Request failed with status code', 404, u'Expected one of', 200, 206)
As seen in the error, Airflow concatenates the bucket param with the schema_object param... Is there any workaround for this? I cannot store the table schema and the table files in the same bucket.
Thanks
This is expected: as you can see in the source code for the operator here, the bucket argument is used to retrieve the schema_object, so the operator assumes you have them in the same bucket.
Since, as you mentioned, you cannot store them together, there are a few workarounds you can try; I'll describe them at a high level:
You can extend the operator and override the execute method so that it retrieves the schema from the bucket you care about (a sketch follows this list of workarounds).
You can add an upstream task to move the schema object to bucket-a using GoogleCloudStorageToGoogleCloudStorageOperator. This requires handling schema_object differently from the way the source code handles it, namely parsing it for the bucket name and object path and then retrieving it. Alternatively, you can create your own argument (something like schema_bucket) and use it in a similar manner (see the sketch at the end of this answer).
You can also delete this object using GoogleCloudStorageDeleteOperator as a downstream task after creating the external table, so it does not have to be persisted in bucket-a.
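To illustrate the first workaround, here is a minimal, untested sketch of a subclass that downloads the schema from a separate bucket and hands it to the parent operator as schema_fields. The schema_bucket argument is my own invention (not part of Airflow), and the exact import paths depend on your Airflow version:

import json

from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.contrib.operators.bigquery_operator import BigQueryCreateExternalTableOperator


class CreateExternalTableWithSchemaBucket(BigQueryCreateExternalTableOperator):
    def __init__(self, schema_bucket=None, *args, **kwargs):
        super(CreateExternalTableWithSchemaBucket, self).__init__(*args, **kwargs)
        self.schema_bucket = schema_bucket

    def execute(self, context):
        if self.schema_bucket and self.schema_object:
            gcs_hook = GoogleCloudStorageHook(
                google_cloud_storage_conn_id=self.google_cloud_storage_conn_id,
                delegate_to=self.delegate_to)
            # Download the schema JSON from the separate bucket and pass it on as
            # schema_fields, so the parent execute() never looks in self.bucket.
            self.schema_fields = json.loads(
                gcs_hook.download(self.schema_bucket, self.schema_object).decode('utf-8'))
            self.schema_object = None
        super(CreateExternalTableWithSchemaBucket, self).execute(context)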
One final note on the schema_object argument: it is meant to be the object path within that same bucket, so if you use the operator as already defined it should be schema_object='data/schema.json'.
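And to illustrate the second and third workarounds together, a rough sketch (again untested, and import paths depend on your Airflow version) that copies the schema into bucket-a before creating the external table and deletes it afterwards:

from airflow.contrib.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator
from airflow.contrib.operators.gcs_delete_operator import GoogleCloudStorageDeleteOperator

copy_schema = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_schema',
    source_bucket='bucket-b',
    source_object='data/schema.json',
    destination_bucket='bucket-a',
    destination_object='data/schema.json')

# create_ext_table would then use schema_object='data/schema.json' (relative to bucket-a)

delete_schema = GoogleCloudStorageDeleteOperator(
    task_id='delete_schema',
    bucket_name='bucket-a',
    objects=['data/schema.json'])

copy_schema >> create_ext_table >> delete_schema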
I am copying a workspace using the sample script “Copy a Workspace”.
While GET workspace returns 200, the POST workspace request is returning this error:
“TypeError: Cannot read properties of undefined (reading ‘id’)”
When I look at the response body in the console, I see this error:
{"error":{"name":"invalidParamError","message":"body.visibilityStatus is invalid"}}
The request body is programmatically populated by the sample with my first name, the date, and the workspace I am trying to copy in this format:
"workspace":{"name":"[MY FIRST NAME] 2/16 - [WORKSPACE]"
I have tried manually changing the request body to a simple string containing just my name under the ‘Variables’ section, and I have tried manually defining a “name” variable in the same section. I don’t think either of these gets at the problem, but I am unsure what the workspace is actually expecting and why the request isn’t processing.
Please help. Thank you!
I am trying to build a DAG that first checks whether a given path in Google Cloud Storage exists. If it does not, the DAG should simply exit/end.
Is there any Airflow operator available to do this?
Easy answer: The folder doesn't exist!!
More seriously, the folder really doesn't exist: all the objects are stored at the root of the bucket. The folder is only a human representation that makes it easy to classify and navigate the files.
However, you can achieve what you want: perform a list-objects operation and provide a prefix. The prefix can contain /, which is the human representation of a folder.
I found this operator, but I'm not a Composer user, so it may be the wrong version; still, the idea is this one: list the files with the targeted prefix. If there is no file, there is no folder!
As per this documentation of Airflow operators for Google Cloud Storage, there is no direct operator for detecting GCS paths. However, we can check if the bucket/object (which constitutes the path) exists or not. If the bucket/object doesn't exist, there won't be a path for that object.
You can try using GCSListObjectsOperator to list the objects in a particular bucket by adding appropriate prefixes.
For example:
from airflow import models
from airflow.providers.google.cloud.operators.gcs import GCSDeleteBucketOperator, GCSListObjectsOperator
from airflow.utils.dates import days_ago

# BUCKET_1 is assumed to be defined elsewhere as the name of the bucket to inspect.

with models.DAG(
    "example_gcs_sensors",
    start_date=days_ago(1),
    schedule_interval=None,
    tags=['example'],
) as dag2:
    GCS_Files = GCSListObjectsOperator(
        task_id='GCS_Files',
        bucket=BUCKET_1,
        prefix='hi-',
    )
    delete_bucket = GCSDeleteBucketOperator(task_id="delete_bucket", bucket_name=BUCKET_1)
    GCS_Files >> delete_bucket
Here GCSListObjectsOperator will list all the objects in BUCKET_1, if the object with the specified prefix is not present, ‘GCS_Files’ task will fail.
Alternatively, Airflow also provides GCS sensors, GCSObjectExistenceSensor and GCSObjectsWithPrefixExistenceSensor, which can be used to check the existence of objects in a bucket.
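For example, a minimal sketch using the prefix sensor, assuming it is placed inside the same DAG block as the example above (the prefix is a placeholder). With soft_fail=True the sensor is marked skipped when nothing matches before the timeout, so downstream tasks are skipped and the DAG simply ends:

from airflow.providers.google.cloud.sensors.gcs import GCSObjectsWithPrefixExistenceSensor

check_path = GCSObjectsWithPrefixExistenceSensor(
    task_id='check_path',
    bucket=BUCKET_1,
    prefix='path/to/folder/',
    poke_interval=30,
    timeout=300,
    soft_fail=True,  # mark the task as skipped instead of failed when the prefix never appears
)
check_path >> GCS_Files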
The scenario is this:
I'm using Spark to read an S3 bucket where some objects (Parquet) were transitioned to the Glacier storage class. I'm not trying to read those objects, but Spark raises an error when such objects are present in the bucket (https://jira.apache.org/jira/browse/SPARK-21797).
There is a workaround that "fixes" this issue: https://jira.apache.org/jira/browse/SPARK-21797?focusedCommentId=16140408&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16140408. But looking at the code (https://github.com/apache/spark/pull/16474/files), the calls are still made, and only the files that raise an IOException are skipped. Is there a better way to configure Spark to load only Standard-class objects from an S3 bucket?
Someone (you?) gets to fix https://issues.apache.org/jira/browse/HADOOP-14837: have s3a raise a specific exception when an attempt to read glaciated data fails.
Then Spark needs to recognise and skip that exception when it happens.
I don't think S3's LIST call flags when an object is glaciated, so the filtering cannot be done during query planning/partitioning. It would be very expensive to call HEAD for each object at that point in the process.
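One application-level workaround, outside of Spark/s3a configuration, is to build the list of paths yourself before handing them to Spark: boto3's ListObjectsV2 response includes each object's StorageClass, so you can keep only the non-Glacier Parquet files. A rough sketch, with placeholder bucket and prefix names:

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
s3 = boto3.client("s3")

bucket = "my-bucket"         # placeholder
prefix = "warehouse/table/"  # placeholder

paths = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        # Skip objects that have been transitioned to archival storage classes.
        if obj.get("StorageClass") in ("GLACIER", "DEEP_ARCHIVE"):
            continue
        if obj["Key"].endswith(".parquet"):
            paths.append("s3a://{}/{}".format(bucket, obj["Key"]))

df = spark.read.parquet(*paths)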
I've been trying to wrap my mind around how batch_write_item works in boto3 (the Python SDK for AWS). My code is as follows:
users = self.scan_table(filterKey="key", filterValue="value", table="users")
delUsers = []

# Create the list of delete requests
for u in users:
    delUsers.append({"DeleteRequest": {"Key": {"S": u["user_id"]}}})

# Delete the items in a single batch call
ret = self.db.batch_write_item(
    RequestItems={
        "users": delUsers
    }
)
The output is as follows (for each item to be deleted):
ClientError: An error occurred (ValidationException) when calling the
BatchWriteItem operation: The provided key element does not match the schema
As far as I can tell, I am following the prescribed instructions in the documentation exactly. What am I missing?
EDIT:
As per jarmod's comment, I felt I needed to specify that I am indeed using the boto3 resource, not the client. There are two separate sets of documentation for the client vs. the resource. Turns out that using the correct documentation for the function you are using is advisable.
Oh, it turns out I wasn't reading the documentation correctly (though it could stand to be a lot clearer in places). I changed the following code:
# Create the list of delete requests
for u in users:
    delUsers.append({"DeleteRequest": {"Key": {"S": u["user_id"]}}})
To:
# Create the list of delete requests
for u in users:
    delUsers.append({"DeleteRequest": {"Key": {"user_id": u["user_id"]}}})
The explanation is that I assumed "S" was there to designate the type of the key, but that isn't what's being asked for. After playing around with removing the brackets and with dummy data, it became clear that I needed to replace "S" with the actual name of the table's primary key attribute, in this case "user_id".
It worked just fine after that.
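For anyone hitting the same wall, the key shape differs between the two APIs. A minimal sketch (table and key names are placeholders): the resource expects plain Python values keyed by attribute name, while the low-level client additionally requires the type descriptor.

import boto3

# boto3 resource: plain Python values, keyed by attribute name.
dynamodb = boto3.resource("dynamodb")
dynamodb.batch_write_item(
    RequestItems={
        "users": [
            {"DeleteRequest": {"Key": {"user_id": "some-id"}}},
        ]
    }
)

# boto3 client: every value carries an explicit type descriptor such as "S".
client = boto3.client("dynamodb")
client.batch_write_item(
    RequestItems={
        "users": [
            {"DeleteRequest": {"Key": {"user_id": {"S": "some-id"}}}},
        ]
    }
)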
I am trying to design a RESTful filesystem-like service, and copy/move operations are causing me some trouble.
First of all, uploading a new file is done using a PUT to the file's ultimate URL:
PUT /folders/42/contents/<name>
The question is, what if the new file already resides on the system under a different URL?
Copy/move Idea 1: PUTs with custom headers.
This is similar to S3's copy. A PUT that looks the same as the upload, but with a custom header:
PUT /folders/42/contents/<name>
X-custom-source: /files/5
This is nice because it's easy to change the file's name at copy/move time. However, S3 doesn't offer a move operation, perhaps because a move using this scheme won't be idempotent.
Copy/move Idea 2: POST to parent folder.
This is similar to the Google Docs copy. A POST to the destination folder with XML content describing the source file:
POST /folders/42/contents
...
<source>/files/5</source>
<newName>foo</newName>
I might be able to POST to the file's new URL to change its name..? Otherwise I'm stuck with specifying a new name in the XML content, which amplifies the RPCness of this idea. It's also not as consistent with the upload operation as idea 1.
Ultimately I'm looking for something that's easy to use and understand, so in addition to criticism of the above, new ideas are certainly welcome!
The HTTP spec says if the resource already exists then you update the resource and return 200.
If the resource doesn't exist then you create it and you return 201.
Edit:
OK, I misread. I prefer the POST-to-the-parent-folder approach. You could also refer to the source file using a query string parameter, e.g.
POST /destination/folder?sourceFile=/source/folder/filename.txt
To create a new resource you usually use POST. This should create a new resource at a URI chosen by the server.
POST /folders/42/contents/fileName
<target>newFile</target>
What REST says is that with POST the new resource is located at a path determined by the server. This is even how copy works in the (Windows) file system. If you copy a file to a name that already exists, then the response to the above example could be:
<newFileLocation>/folders/42/contents/newFile-2</newFileLocation>
A move is then made by a copy followed by a delete. You should not do these two actions in one request.
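For example, reusing the copy request above, a move would be two separate requests: first the copy, then a delete of the source:

POST /folders/42/contents/fileName
<target>newFile</target>

DELETE /folders/42/contents/fileName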
Edit:
I found the book RESTful Web Services Cookbook very good.
Chapter 11 handles the Copy method and recommends the following in 11.1:
Problem: You want to know how to make a copy of an existing resource.
Solution: Design a controller resource that can create a copy. The client makes a POST request to this controller to copy the resource. To make the POST conditional, provide a one-time URI to the client. After the controller creates the copy, return response code 201 (Created) with a Location header containing the URI of the copy.
Request:
POST /albums/2009/08/1011/duplicate;t=a5d0e32ddff373df1b3351e53fc6ffb1

Response:
<album xmlns:atom="http://www.w3.org/2005/Atom">
<id>urn:example:album:1014</id>
<atom:link rel="self" href="http://www.example.org/albums/2009/08/1014"/>
...
</album>
REST is not limited to the default set of HTTP methods. You could use WebDAV in this case, which defines COPY and MOVE methods for exactly this purpose.
For the move part, just do a combo of Copy (PUT) then Delete if you want to keep it simple.
For moving, you could
a). copy via PUT with a custom source header followed by a DELETE on the source.
b). DELETE source with a custom move header.
I prefer the latter because it can be atomic and it is clear to the client that the resource was removed from the original collection. And when it GETs the collection at its new location, it will find the moved resource there.
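As an illustration of option b), the exchange might look something like this; the X-Move-To header name and the file paths are made up for the example, not a standard:

DELETE /folders/42/contents/report.txt HTTP/1.1
X-Move-To: /folders/99/contents/report.txt

HTTP/1.1 201 Created
Location: /folders/99/contents/report.txt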