I have migrated a big data application to the cloud, and the input files are stored in GCS. The files can be in different formats (txt, csv, avro, parquet, etc.), and they contain sensitive data that I want to mask.
Also, I have read that there is a quota restriction on file size; in my case, a single file can contain 15M records.
I have tried the DLP UI as well as the client library to inspect those files, but it's not working.
GitHub page - https://github.com/Hitman007IN/DataLossPreventionGCPDemo
Under resources there are two files: test.txt works, while test1.txt, which is the sample file I use in my application, does not.
Google Cloud DLP just launched support last week for scanning Avro files natively.
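For the client-library route, a common pattern is to point a DLP inspect job at the bucket rather than at individual files. Here is a minimal sketch of building that request with the google-cloud-dlp library; the bucket URL and infoTypes are placeholder assumptions, not values from the question:

```python
# Sketch: building a DLP inspect-job request for files in a GCS bucket.
# Bucket URL and infoTypes below are placeholder assumptions.

def build_gcs_inspect_job(bucket_url, max_findings=0):
    """Return the inspect_job dict for DlpServiceClient.create_dlp_job.

    bucket_url should look like 'gs://my-bucket/**' (the '**' makes
    DLP scan files under the prefix recursively).
    """
    return {
        "storage_config": {
            "cloud_storage_options": {
                "file_set": {"url": bucket_url},
                "bytes_limit_per_file": 0,  # 0 = no per-file byte limit
                # FILE_TYPE_UNSPECIFIED = scan all supported file types
                "file_types": ["FILE_TYPE_UNSPECIFIED"],
            }
        },
        "inspect_config": {
            "info_types": [
                {"name": "EMAIL_ADDRESS"},
                {"name": "PHONE_NUMBER"},
            ],
            "limits": {"max_findings_per_request": max_findings},
        },
    }

# The actual call (requires credentials and google-cloud-dlp installed):
# from google.cloud import dlp_v2
# client = dlp_v2.DlpServiceClient()
# client.create_dlp_job(
#     request={
#         "parent": "projects/my-project/locations/global",  # placeholder
#         "inspect_job": build_gcs_inspect_job("gs://my-bucket/**"),
#     }
# )
```

Because this runs as an asynchronous job on the DLP service side, the 15M-record file is sampled and scanned by the service itself, rather than being pushed through a single synchronous content.inspect request.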
I'm working on a POC to extract data from an API and load new/updated records into an Avro file in GCS. I also want to delete records that arrive with a deleted flag from that Avro file.
What would be a feasible approach to implement this using Dataflow? Are there any resources I can refer to for it?
You can't update a file in GCS; you can only READ, WRITE, and DELETE. If you have to change even 1 byte in the file, you need to download the file, make the change, and upload it again.
You can keep versions of an object in GCS, but each blob is unique and cannot be changed once written.
Anyway, you can do that with Dataflow, but keep in mind that you need 2 inputs:
The data to update
The file stored in GCS (which you also have to read and process with Dataflow)
At the end, you need to write a new file to GCS containing the data processed in Dataflow.
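Since the object can't be edited in place, the pipeline is logically read-merge-rewrite. The merge step itself is plain record logic, sketched below; the key field `id` and the flag field `deleted` are assumptions about your schema, not something the question specifies:

```python
def merge_records(existing, incoming, key="id", deleted_flag="deleted"):
    """Merge API records into the records read from the GCS Avro file.

    - new records are added
    - updated records replace the stored version with the same key
    - records arriving with deleted_flag=True are dropped entirely
    """
    by_key = {r[key]: r for r in existing}
    for rec in incoming:
        if rec.get(deleted_flag):
            by_key.pop(rec[key], None)  # drop from the rewritten file
        else:
            by_key[rec[key]] = rec      # insert or overwrite
    return list(by_key.values())
```

In a Beam/Dataflow pipeline you would read the existing Avro file and the API extract as the two inputs described above (e.g. via a CoGroupByKey on the record key), apply this merge logic per key, and write the result out as a fresh Avro file.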
I am trying to transfer multiple .csv files under a directory from an Azure storage container to Google Cloud Storage (as .txt files) through Data Fusion.
From Data Fusion, I can successfully transfer a single file and convert it to a .txt file as part of the GCS sink.
But when I try to transfer all the .csv files under the Azure container to GCS, it merges the data from all the .csv files and generates a single .txt file in GCS.
Can someone help with how to transfer each file separately and convert it to .txt on the sink side?
What you're seeing is expected behavior when using the GCS sink.
You need an Azure-to-GCS copy action plugin or, more generally, an HCFS-to-GCS copy action plugin. Unfortunately, such a plugin doesn't exist yet. You could consider writing one using https://github.com/data-integrations/example-action as a starting point.
I have data being written from Kafka to a directory in s3 with a structure like this:
s3://bucket/topics/topic1/files1...N
s3://bucket/topics/topic2/files1...N
...
s3://bucket/topics/topicN/files1...N
There is already a lot of data in this bucket, and I want to use AWS Glue to transform it into Parquet and partition it, but there is far too much data to do it all at once. I was looking into bookmarks, and it seems you can't use them to read only the most recent data or to process the data in chunks. Is there a recommended way of processing data like this so that bookmarking will work when new data comes in?
Also, does bookmarking require that Spark or Glue scan my entire dataset each time I run a job to figure out which files are newer than the last run's max_last_modified timestamp? That seems pretty inefficient, especially as the data in the source bucket continues to grow.
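One way to process the backlog in chunks is to drive the selection yourself: list the objects under each topic prefix, keep only those newer than the watermark your previous run recorded, and cap the batch size. A minimal sketch of that selection step; here the listing is passed in as (key, last_modified) pairs, which with boto3 you would obtain from a list_objects_v2 paginator (names and the chunk size are illustrative assumptions):

```python
def next_chunk(objects, last_watermark, chunk_size=1000):
    """Pick the next batch of files to process.

    objects: iterable of (key, last_modified) pairs, e.g. gathered from a
             boto3 list_objects_v2 paginator over s3://bucket/topics/.
    last_watermark: the last_modified value recorded by the previous run.
    Returns (keys_to_process, new_watermark).
    """
    # Keep only files newer than the previous run's watermark,
    # oldest first, capped at chunk_size.
    fresh = sorted(
        (o for o in objects if o[1] > last_watermark),
        key=lambda o: o[1],
    )[:chunk_size]
    if not fresh:
        return [], last_watermark
    return [k for k, _ in fresh], max(m for _, m in fresh)
```

You would persist the returned watermark between runs (e.g. in DynamoDB or a small S3 object) and feed the selected keys to the Glue job explicitly, instead of relying on the bookmark's own modified-time scan.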
I have learned that Glue wants all similar files (files with the same structure and purpose) to be under one folder, with optional subfolders.
s3://my-bucket/report-type-a/yyyy/mm/dd/file1.txt
s3://my-bucket/report-type-a/yyyy/mm/dd/file2.txt
...
s3://my-bucket/report-type-b/yyyy/mm/dd/file23.txt
All of the files under report-type-a folder must be of the same format. Put a different report like report-type-b in a different folder.
You might try putting just a few of your input files in the proper place, running your ETL job, placing more files in the bucket, running again, etc.
I tried this by getting the current files working (one file per day), then back-filling historical files. Note, however, that this did not work completely: files in s3://my-bucket/report-type/2019/07/report_20190722.gzip were processed fine, but when I added past files to s3://my-bucket/report-type/2019/05/report_20190510.gzip, Glue did not "see" or process the file in the older folder.
However, if I moved the old report into the current partition, it worked: s3://my-bucket/report-type/2019/07/report_20190510.gzip.
AWS Glue bookmarking works only with a select few formats (more here), and only when the data is read using the glueContext.create_dynamic_frame.from_options function. In addition, job.init() and job.commit() must both be present in the Glue script. You can check out a related answer.
I have a web app with download buttons to download objects from S3 buckets. I also have plot buttons that read the contents of CSV files in an S3 bucket using pandas read_csv, in order to read the columns and make visualizations. I wanted to understand whether the price for S3 data transfer out to the internet applies only to actual file downloads, or whether it also covers just reading the contents, since the bytes are transferred over the internet in that case as well.
S3 does not operate like a file system: you can't open an object and read or write it in place the way you would on a local or remote drive (although ranged GET requests can fetch a portion of an object's bytes). Either way, when pandas reads a CSV from S3, the object's bytes are transferred over the network just as in a download, so both cases are billed as data transfer out. That is why AWS only shows pricing for data transfer.
I have uploaded a simple 10-row CSV file (S3) into the AWS ML website. It keeps giving me the error:
"We cannot find any valid records for this datasource."
There are records there, and the Y variable is continuous (not binary). I am pretty much stuck at this point because there is only one button to move forward and build the machine learning model. Does anyone know what I should do to fix it? Thanks!
The only way I have been able to upload .csv files to S3 that I created myself is by downloading an existing .csv file from S3, modifying the data, uploading it, and then changing the name in the S3 console.
Could you post the first few lines of the .csv file's contents? I am able to upload my own .csv file along with a schema that I created, and it works. However, I did have issues in that Amazon ML was unable to create the schema for me.
Also, did you try saving the data in something like Sublime, Notepad++, etc., in order to get a different format? On my Mac with Microsoft Excel, the CSV did not work, but when I tried LibreOffice on Windows, the same file worked perfectly.
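If the Mac-Excel export is the culprit, a likely cause is old-style CR ("\r") line endings, which some parsers don't treat as row separators. A small sketch that rewrites a CSV with plain "\n" line endings and UTF-8 encoding before uploading (file paths are placeholders):

```python
def normalize_csv(in_path, out_path):
    """Rewrite a CSV file with Unix newlines and UTF-8 encoding."""
    # newline='' exposes the raw line endings; 'utf-8-sig' also strips
    # a leading BOM if the file was saved with one.
    with open(in_path, "r", encoding="utf-8-sig", newline="") as f:
        text = f.read()
    # Convert Windows (\r\n) and old Mac (\r) endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
```

Running the file through this (or doing the equivalent re-save in LibreOffice, as above) before uploading to S3 gives the datasource a format with conventional line endings and no BOM.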