Can you please help me here:
I'm batch processing JSON files from Cloud Storage and writing the data into BigQuery.
I have a Pub/Sub topic with a Cloud Function subscribed to it; the function processes the message and writes the data into BQ.
I have created a Dataflow job that notifies the topic whenever a JSON file is created/stored in my source bucket.
The above flow processes the JSON file and inserts rows into the BQ table perfectly.
I want to delete the source JSON file from Cloud Storage after the file is successfully processed. Any input on how this can be done?
You can use the Cloud Storage client libraries and call the object's delete method at some point in your pipeline.
For example, install the Java client library and, after the file has been processed, call delete as shown in this sample:
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

// Build a Storage client and delete the processed object
Storage storage = StorageOptions.getDefaultInstance().getService();
BlobId blobId = BlobId.of(bucketName, blobName);
boolean deleted = storage.delete(blobId);
if (deleted) {
  // the blob was deleted
} else {
  // the blob was not found
}
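If the processing happens inside the Cloud Function (Python), a rough equivalent could look like the sketch below. This is only an illustration: it assumes the message attributes carry the bucket and object name the way a Cloud Storage Pub/Sub notification does (bucketId / objectId), that the JSON file holds an array of row dicts, and that process_file and my_dataset.my_table are placeholder names.

import json
from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()
TABLE = "my_dataset.my_table"  # placeholder

def process_file(event, context):
    # Assumed message layout: bucket and object name in the Pub/Sub attributes
    bucket_name = event["attributes"]["bucketId"]
    blob_name = event["attributes"]["objectId"]

    blob = gcs.bucket(bucket_name).blob(blob_name)
    rows = json.loads(blob.download_as_text())  # assumes a JSON array of row dicts

    errors = bq.insert_rows_json(TABLE, rows)
    if not errors:
        # Delete the source file only after the insert succeeded
        blob.delete()

The key point is the same as in the Java sample: only delete the source object after the BigQuery insert has returned without errors.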
UPDATE:
Another thing that comes to mind is to use Pub/Sub notifications in order to know when certain events occur in your storage bucket. The list of supported event types includes object creation:
OBJECT_FINALIZE Sent when a new object (or a new generation of an existing object) is successfully created in the bucket ...
OBJECT_METADATA_UPDATE Sent when the metadata of an existing object changes ...
OBJECT_DELETE Sent when an object has been permanently deleted. This includes objects that are overwritten or are deleted as part of
the bucket's lifecycle configuration ...
OBJECT_ARCHIVE Only sent when a bucket has enabled object versioning ...
Important: Additional event types may be released later. Client code
should either safely ignore unrecognized event types, or else
explicitly specify in their notification configuration which event
types they are prepared to accept.
Hope this helps.
Related
I was wondering if it's possible to customise the Pub/Sub messages that are triggered by GCS events. In particular, I'm interested in adding metadata to the message "attributes".
For example, upon the creation of a new object in GCS, the OBJECT_FINALIZE event (see https://cloud.google.com/functions/docs/calling/storage) is triggered.
I pull this message, e.g.
response = pubsub_v1.SubscriberClient().pull(request)
for msg in response.received_messages:
    message_data = json.loads(msg.message.data.decode('utf-8')).get("data")
    msg_attributes = message_data.get("attributes")
I want to be able to customise what goes into "attributes" prior to creating the object in GCS.
It is not possible to customize the Pub/Sub notifications from Cloud Storage: they are published by Cloud Storage, and their schema and contents are controlled by the service, as specified in the notifications documentation.
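What you can still do is read the attributes that Cloud Storage itself sets on each notification (eventType, bucketId, objectId, and so on, per the notifications documentation). A minimal sketch, assuming the current Python Pub/Sub client and a placeholder subscription path:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = "projects/MY_PROJECT/subscriptions/MY_SUB"  # placeholder

response = subscriber.pull(request={"subscription": subscription, "max_messages": 10})
for msg in response.received_messages:
    attrs = msg.message.attributes  # set by Cloud Storage, not user-definable
    print(attrs["eventType"], attrs["bucketId"], attrs["objectId"])
    subscriber.acknowledge(request={"subscription": subscription,
                                    "ack_ids": [msg.ack_id]})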
I would like to send a message to an HTTP-triggered Google Cloud Function. Specifically, I want to tell the function when a file version has changed so that the function loads the new version of the file into memory.
I thought about updating an environment variable as a way of sending that message, but it is not so straightforward to run an update-env-vars command, since this needs to be done in the context of the function's project.
I also thought of using a database, which sounds like too much for a single variable, or a simple text file in storage holding the current version, which sounds like too little. Any other ideas?
According to the conversation in the comments section, I believe the best way to achieve what you are looking for is a GCS notification that triggers Pub/Sub:
gsutil notification create -t TOPIC_NAME -f json gs://BUCKET_NAME
Pub/Sub will be notified based on event types, and which type you need depends on what you consider a new version of the file (a metadata change? a new blob being created?).
Basically, you can pass the -e flag in the command above to indicate the event type (see the example after the list below):
OBJECT_FINALIZE Sent when a new object (or a new generation of an
existing object) is successfully created in the bucket. This includes
copying or rewriting an existing object. A failed upload does not
trigger this event.
OBJECT_METADATA_UPDATE Sent when the metadata of an existing object
changes.
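For example, to be notified only when a new object (or a new generation of an existing object) is created, the command above could be run with the -e flag as:

gsutil notification create -t TOPIC_NAME -f json -e OBJECT_FINALIZE gs://BUCKET_NAME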
That means any file upload or metadata change in GCS will trigger Pub/Sub, which in turn triggers your Cloud Function. Here is an example function that handles the Pub/Sub message:
import base64

def hello_pubsub(event, context):
    print("""This Function was triggered by messageId {} published at {} to {}
""".format(context.event_id, context.timestamp, context.resource["name"]))
    if 'data' in event:
        name = base64.b64decode(event['data']).decode('utf-8')
    else:
        name = 'World'
    print('Hello {}!'.format(name))
Documents for reference:
https://cloud.google.com/storage/docs/pubsub-notifications
https://cloud.google.com/functions/docs/calling/pubsub#functions_calling_pubsub-python
I have an S3 bucket into which clients drop data files (CSV files) each month. I was wondering if there is a way to automatically create a new "folder" (object) each month when the files are dropped, and put the newest files into that "folder". I need the CSV files separated by month so that AWS Glue can create new partitions when I run incremental crawlers on this bucket.
For example, let's say I have a S3 bucket called "client." On December 1st, a new CSV file ("DecClientData") will be dropped into that "client" bucket. I want to know if there is a way to automate the following two processes:
Create a "folder" (let's call it "dec") within "client".
Place the "DecClientData" file in the "dec" "folder".
Thanks in advance for any assistance you can provide!
S3 doesn't have the notion of folders commonly found in file systems; instead it has a flat structure (more details can be found in the S3 documentation).
Instead, the full path of an object is stored in its Key (filename). For example, an object can be stored in Amazon S3 with a Key of files/2020-12/data.txt regardless of the existence of files and 2020-12 directories (they are not really directories but zero-length objects).
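To illustrate, with boto3 you can simply upload to a Key containing slashes and the "folders" appear implicitly (the bucket and key names here are placeholders):

import boto3

s3 = boto3.client("s3")
# No "files/2020-12/" folder has to exist beforehand; the Key alone implies it
s3.put_object(Bucket="my_bucket", Key="files/2020-12/data.txt", Body=b"some data")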
In your case, to solve both points you are mentioning, you should leverage S3 event notifications and use them as a Lambda trigger. When the Lambda function is triggered, it is passed the name of the object (Key) in the event payload; at that point you can effectively change its Key.
I.e. an object is uploaded to s3://my_bucket/uploads/file.txt, which creates an event notification that triggers a Lambda function. The function gets the object, re-uploads it to s3://my_bucket/files/dec/file.txt, and deletes the original.
Write an AWS Lambda function to create a folder in the client bucket and move the most recent .csv file (or files) into the new folder.
Then configure the client S3 bucket to trigger the AWS Lambda function on new uploads through the event notification settings.
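A minimal handler sketch (Python with boto3), assuming the function is wired to the bucket's ObjectCreated notifications and that the month prefix is derived from the upload date; lambda_handler and the prefix logic are illustrative, not a definitive implementation:

import boto3
from datetime import datetime, timezone
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Each record describes one newly created object
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        if "/" in key:
            continue  # already under a month prefix
        month = datetime.now(timezone.utc).strftime("%b").lower()  # e.g. "dec"
        new_key = f"{month}/{key}"
        # S3 has no rename: copy to the new Key, then delete the original
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)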
I have a service that uploads files during the day. The same file gets updated multiple times on different events (there is no deterministic way to know when it gets updated). At the same time, a client downloads the file. What happens if the file gets updated during the download? Does S3 preserve the old version until all active reads of it are done (like a filesystem does)? Can the file be corrupted (part from the old version, part from the new)? Can the connection be closed abruptly in this case?
An object will only be created in Amazon S3 if the upload process completed fully. Partial files will not appear in Amazon S3.
Similarly, when overwriting an object in Amazon S3, the object will only be replaced if the new object was fully uploaded. The new object completely replaces the old object.
There might be a small delay between the upload completing and the new object appearing because objects in Amazon S3 are replicated between multiple servers for durability.
Referring to item: Watching for new files matching a filepattern in Apache Beam
Can you use this for simple use cases? My use case is that a user uploads data to Cloud Storage -> Pipeline (process CSV to JSON) -> BigQuery. I know a Cloud Storage source is a bounded collection, so it normally represents a batch Dataflow job.
What I would like is to keep the pipeline running in streaming mode and, as soon as a file is uploaded to Cloud Storage, have it processed through the pipeline. Is this possible with watchForNewFiles?
I wrote my code as follows:
p.apply(TextIO.read()
    .from("<bucketname>")
    .watchForNewFiles(
        // Check for new files every 30 seconds
        Duration.standardSeconds(30),
        // Never stop checking for new files
        Watch.Growth.<String>never()));
None of the contents are being forwarded to BigQuery, but the pipeline shows that it is streaming.
You may use Google Cloud Storage triggers here:
https://cloud.google.com/functions/docs/calling/storage#functions-calling-storage-python
These triggers use Cloud Functions, similarly to Cloud Pub/Sub, and fire on objects when they are created, deleted, archived, or when their metadata changes.
These events are sent using Pub/Sub notifications from Cloud Storage, but be careful not to set too many functions on the same bucket, as there are notification limits.
Also, at the end of that document there is a link to a sample implementation.
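As a minimal sketch (Python, assuming a first-generation background function deployed with a google.storage.object.finalize trigger; the function name on_new_file is a placeholder), the function receives the object's metadata directly:

def on_new_file(event, context):
    # For google.storage.object.finalize events, `event` holds the object's metadata
    print("Event type: {}".format(context.event_type))
    print("Bucket: {}".format(event["bucket"]))
    print("File: {}".format(event["name"]))

It could be deployed, for example, with gcloud functions deploy on_new_file --runtime python39 --trigger-resource YOUR_BUCKET --trigger-event google.storage.object.finalize.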