Is there a way to delete a file after some operations are done in a U-SQL script? - delete-file

I want to delete some files in Azure Data Lake Store after some operations (an extraction method) have been done using a U-SQL script. Is there any way to delete files using functions, or any other way, in U-SQL?
I know that U-SQL can only be used to read files, but I want to delete some of them. We can do the same using the .NET SDK, but I want the deletion to happen right after the U-SQL job completes.

Deleting files directly from U-SQL is not supported, at least not that I know of.
What about using Azure Data Factory? You could use Data Factory to orchestrate the U-SQL jobs (https://learn.microsoft.com/en-us/azure/data-factory/v1/data-factory-usql-activity) and at the end (after the last Data Lake Analytics activity), use a custom C# activity (https://learn.microsoft.com/en-us/azure/data-factory/transform-data-using-dotnet-custom-activity) to delete the respective files.
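If all you need is to remove the files once the U-SQL job has finished, the post-job step (whether a Data Factory custom activity or any script that runs after the job) can call the Data Lake Store SDK directly. Below is a minimal Python sketch using the azure-datalake-store package; the tenant, credentials, store name and file paths are placeholders, not values from the question.

```python
# Minimal sketch: delete ADLS (Gen1) files after a U-SQL job has completed.
# Assumes the azure-datalake-store package; all names below are placeholders.
from azure.datalake.store import core, lib

# Authenticate with a service principal (IDs/secret are assumptions).
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<client-id>",
                 client_secret="<client-secret>")

adls = core.AzureDLFileSystem(token, store_name="<your-adls-account>")

# Remove the intermediate files the U-SQL script has finished with.
for path in ["/staging/input1.csv", "/staging/input2.csv"]:
    if adls.exists(path):
        adls.rm(path)
```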

Related

Updating AVRO files in GCS using Dataflow

I'm working on a POC to extract data from an API and load new/updated records into an AVRO file stored in GCS. I also want to delete records that come with a deleted flag from the AVRO file.
What would be a feasible approach to implement this using Dataflow, and are there any resources I can refer to?
You can't update a file in GCS. You can only READ, WRITE and DELETE. If you have to change 1 byte in the file, you need to download the file, make the change and upload it again.
You can keep versions in GCS, but each BLOB is unique and can't be changed.
Anyway, you can do that with Dataflow, but keep in mind that you need 2 inputs:
The data to update
The file stored in GCS (which you also have to read and process with Dataflow)
At the end, you need to write the new file to GCS with the data held in Dataflow.
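To make the "download, change, re-upload" pattern concrete, here is a small sketch with the google-cloud-storage client. The bucket and object names are placeholders, and the record-merging step is a hypothetical helper, since that logic depends on your Avro schema.

```python
# Sketch of read-modify-rewrite for an immutable GCS object.
# Bucket/object names are placeholders; merge_records() is hypothetical.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("exports/records.avro")

# 1. Read the current file (objects are immutable, so take a full copy).
current_bytes = blob.download_as_bytes()

# 2. Apply new/updated records and drop the ones flagged as deleted
#    (Avro merge logic omitted here).
updated_bytes = merge_records(current_bytes)

# 3. Write the result back, replacing the old object.
blob.upload_from_string(updated_bytes, content_type="application/octet-stream")
```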

Cloud function is unable to move files in archive bucket

I have implemented the architecture described at https://cloud.google.com/solutions/streaming-data-from-cloud-storage-into-bigquery-using-cloud-functions
The issue arises when multiple files land in the bucket at the same time (e.g. 3 files with the same timestamp, 21/06/2020, 12:13:54 UTC+5:30). In this scenario, the Cloud Function is unable to move all of these files with the same timestamp to the success bucket after processing.
Can someone please suggest a solution?
Google Cloud Storage is not a file system. You can only CREATE, READ and DELETE a BLOB. Therefore, you can't MOVE a file. The MOVE that exists in the console or in some client libraries (in Python, for example) performs a CREATE (copying the existing BLOB to the target name) and then a DELETE of the old BLOB.
Consequently, you can't keep the original timestamp when you perform a MOVE operation.
NOTE: because you perform a CREATE and a DELETE when you MOVE your file, you are charged for early deletion when you use storage classes such as Nearline, Coldline and Archive.
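For reference, this is roughly what the copy-then-delete "move" looks like with the google-cloud-storage client. Bucket and object names are placeholders for illustration only.

```python
# Sketch of the CREATE + DELETE pattern behind a GCS "move".
# Bucket and object names are placeholders.
from google.cloud import storage

client = storage.Client()
src_bucket = client.bucket("incoming-bucket")
dst_bucket = client.bucket("success-bucket")

blob = src_bucket.blob("uploads/file_20200621.csv")

# CREATE: copy the blob to the target bucket (optionally under a new name)...
src_bucket.copy_blob(blob, dst_bucket, new_name="processed/file_20200621.csv")

# ...then DELETE the original. The copy gets a fresh creation timestamp.
blob.delete()
```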

AWS Lambda generates large size files to S3

We currently have an AWS Lambda (Java-based runtime) which takes an SNS message as input, performs business logic, generates one XML file and stores it in S3.
The current implementation creates the XML in the /tmp location, which we know is subject to the AWS Lambda space limitation (500 MB).
Is there any way to still use Lambda but stream the XML file to S3 without using the /tmp folder?
I have done some research but still haven't found a solution.
Thank you.
You can upload an object to S3 directly from memory without having to store it locally, using the put object API. However, keep in mind that you still have time and total memory limits with Lambda as well; you may run out of those too if your object is too big.
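As a rough illustration of the in-memory approach (Python/boto3 here rather than the Java SDK used in the question; bucket, key and the XML-building helper are placeholders):

```python
# Sketch: write the generated XML to S3 straight from memory with put_object,
# instead of staging it under /tmp. Names are placeholders; build_xml() is
# a hypothetical helper returning the document as bytes.
import boto3

s3 = boto3.client("s3")

xml_bytes = build_xml()

s3.put_object(
    Bucket="my-output-bucket",
    Key="reports/output.xml",
    Body=xml_bytes,
    ContentType="application/xml",
)
```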
If you can split the file into chunks and don't need to update the beginning of the file while working on its end, you can use a multipart upload, providing each finished chunk and then freeing the memory for the next one.
Otherwise you still need temporary storage to assemble all the parts of the XML. You could use DynamoDB or Redis, and once all the parts of the XML have been collected there, upload it part by part and then clean up the DB (or set a TTL to automate the cleanup).
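A minimal sketch of the chunked multipart case with boto3 (again Python rather than Java, as an illustration). The bucket and key are placeholders, and generate_chunks() is a hypothetical helper that yields XML fragments of at least 5 MB each (S3's minimum part size, except for the last part).

```python
# Sketch of a multipart upload: stream the XML in parts, freeing each
# chunk's memory once it has been uploaded. Names are placeholders.
import boto3

s3 = boto3.client("s3")
bucket, key = "my-output-bucket", "reports/output.xml"

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for part_number, chunk in enumerate(generate_chunks(), start=1):
    resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                          UploadId=mpu["UploadId"], Body=chunk)
    parts.append({"ETag": resp["ETag"], "PartNumber": part_number})

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})
```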

Processing unpartitioned data with AWS Glue using bookmarking

I have data being written from Kafka to a directory in s3 with a structure like this:
s3://bucket/topics/topic1/files1...N
s3://bucket/topics/topic2/files1...N
.
.
s3://bucket/topics/topicN/files1...N
There is already a lot of data in this bucket and I want to use AWS Glue to transform it into Parquet and partition it, but there is way too much data to do it all at once. I was looking into bookmarking and it seems like you can't use it to read only the most recent data or to process data in chunks. Is there a recommended way of processing data like this so that bookmarking will work when new data comes in?
Also, does bookmarking require Spark or Glue to scan my entire dataset each time I run a job to figure out which files are newer than the last run's max_last_modified timestamp? That seems pretty inefficient, especially as the data in the source bucket continues to grow.
I have learned that Glue wants all similar files (files with the same structure and purpose) to be under one folder, with optional subfolders.
s3://my-bucket/report-type-a/yyyy/mm/dd/file1.txt
s3://my-bucket/report-type-a/yyyy/mm/dd/file2.txt
...
s3://my-bucket/report-type-b/yyyy/mm/dd/file23.txt
All of the files under the report-type-a folder must be of the same format. Put a different report, like report-type-b, in a different folder.
You might try putting just a few of your input files in the proper place, running your ETL job, placing more files in the bucket, running again, and so on.
I tried this by getting the current files working (one file per day), then back-filling historical files. Note, however, that this did not work completely. Files were processed fine in s3://my-bucket/report-type/2019/07/report_20190722.gzip, but when I tried to add past files to s3://my-bucket/report-type/2019/05/report_20190510.gzip, Glue did not "see" or process the file in the older folder.
However, if I moved the old report to the current partition, it worked: s3://my-bucket/report-type/2019/07/report_20190510.gzip.
AWS Glue bookmarking works only with a select few formats (more here) and only when the source is read using the glueContext.create_dynamic_frame.from_options function. Along with this, job.init() and job.commit() should also be present in the Glue script, as in the sketch below. You can check out a related answer.
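Here is a minimal Glue script skeleton showing where those pieces go so that bookmarking can take effect: the source is read with create_dynamic_frame.from_options and the job is wrapped in job.init()/job.commit(). The S3 paths, source format and transformation_ctx name are placeholders.

```python
# Sketch of a bookmark-friendly Glue job (paths/format are placeholders).
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # required for bookmarks to be tracked

source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/topics/topic1/"]},
    format="json",                  # must be a bookmark-supported format
    transformation_ctx="source",    # bookmarks are keyed on this ctx name
)

glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://bucket/parquet/topic1/"},
    format="parquet",
)

job.commit()  # persists the bookmark state for the next run
```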

Microsoft Sync framework, sync database with files

I want to sync a database that represents my files with a local folder that contains the files.
I have seen the FileSyncProvider and the SQLSyncProvider, but I want to know if I need to create a custom provider or if someone has already done this?
The two providers (FileSyncProvider and SQLSyncProvider) will not sync with each other. You will have to write your own code to do this.