Does the pricing for S3 data transfer out to the internet include reading file contents? - amazon-web-services

I have a web app with download buttons to download objects from S3 buckets. It also has plot buttons that read the contents of CSV files in an S3 bucket with pandas read_csv to extract columns and build visualizations. I want to understand whether the price for S3 data transfer out to the internet applies only to actual downloads of files, or whether it also covers merely reading the contents, since the bytes are transferred over the internet in that case as well.

S3 does not operate like a file system: there is no notion of reading and writing portions of files as you would on a local or remote drive. To read a file you must always download the entire file and then read portions of it as needed. Either way the bytes leave AWS, so both your download button and your plot button incur data transfer out. That is why AWS only shows pricing for data transfer.
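A stdlib-only illustration of the billing consequence (the counting wrapper and sample data are invented for this sketch; nothing here is an AWS API): whether you save the file or only plot a column from it, the CSV parser still consumes every byte of the stream, so the full object size counts as data transfer out.

```python
import csv

class CountingLines:
    """Iterate over CSV text line by line, counting characters consumed."""
    def __init__(self, text):
        self._lines = iter(text.splitlines(keepends=True))
        self.chars_read = 0

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self._lines)
        self.chars_read += len(line)
        return line

# Toy CSV standing in for an object fetched from S3 over the internet.
raw = "a,b,c\n1,2,3\n4,5,6\n"
reader = CountingLines(raw)

# Extract just column "a" -- the parser still pulls the whole stream.
first_column = [row["a"] for row in csv.DictReader(reader)]

print(first_column)       # ['1', '4']
print(reader.chars_read)  # 18, i.e. len(raw): every byte was transferred
```

The same holds for pandas read_csv pointed at an S3 URL: selecting columns with usecols happens after the bytes have already crossed the wire.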

Related

GCP Data fusion transfer multiples from Azure storage to Google Storage

I am trying to transfer multiple (.csv) files under a directory from an Azure storage container to Google Cloud Storage (as .txt files) through Data Fusion.
From Data Fusion, I can successfully transfer a single file and convert it to a .txt file as part of the GCS sink.
But when I try to transfer all the .csv files under the Azure container to GCS, it merges all the .csv files' data and generates a single .txt file at GCS.
Can someone help on how to transfer each file separately and convert it to .txt on the sink side?
What you're seeing is expected behavior when using the GCS sink.
You need an Azure to GCS copy action plugin, or more generally an HCFS to GCS copy action plugin. Unfortunately such a plugin doesn't already exist. You could consider writing one using https://github.com/data-integrations/example-action as a starting point.

Google Cloud - Download large file from web

I'm trying to download the GhTorrent dump from http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2020-07-17.tar.gz, which is about 127 GB.
I tried in the cloud, but after 6 GB it stops; I believe there is a size limit on piping through curl:
curl http://ghtorrent... | gsutil cp - gs://MY_BUCKET_NAME/mysql-2020-07-17.tar.gz
I cannot use Data Transfer because I need to specify the URL, the size in bytes (which I have) and the MD5 hash, which I don't have and can only generate by having the file on my disk. I think(?)
Is there any other option to download and upload the file directly to the cloud?
My total disk size is 117 GB, sad beep.
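Regarding the MD5 concern: an MD5 can in fact be computed while the bytes stream through, without ever holding the whole file on disk. A stdlib sketch, with the chunked download loop simulated by an in-memory stream (the payload and chunk size are illustrative); Storage Transfer's URL list format, as I understand it, expects the base64 form of the raw digest:

```python
import base64
import hashlib
import io

def streaming_md5(stream, chunk_size=8 * 1024 * 1024):
    """Digest a stream chunk by chunk; memory use stays at one chunk."""
    md5 = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        md5.update(chunk)
    return md5

# Stand-in for a huge HTTP response body streamed with requests/curl.
payload = b"x" * (3 * 1024)
digest = streaming_md5(io.BytesIO(payload), chunk_size=1024)

print(base64.b64encode(digest.digest()).decode())
```

The same loop works over an HTTP response streamed in chunks, so a 127 GB file never needs to touch a 117 GB disk just to be hashed.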
Worked for me with the Storage Transfer Service: https://console.cloud.google.com/transfer/
Have a look at the pricing before moving TBs, especially if your target is Nearline/Coldline: https://cloud.google.com/storage-transfer/pricing
A simple example that copies a file from a public URL to my bucket using a transfer job:
Create a file theTsv.tsv and specify the complete list of files that must be copied. This example contains just one file:
TsvHttpData-1.0
http://public-url-pointing-to-the-file
Upload the theTsv.tsv file to your bucket or any publicly accessible URL. In this example I store my .tsv file in my bucket: https://storage.googleapis.com/<my-bucket-name>/theTsv.tsv
Create a transfer job - List of object URLs
Add the URL that points to the theTsv.tsv file in the "URL of TSV file" field;
Select the target bucket
Run immediately
My file, named MD5SUB, was copied from the source URL into my bucket under an identical directory structure.
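The manifest used in the steps above is plain text: a version header followed by one URL per line. A minimal sketch that generates it (the URL here is a hypothetical placeholder, and the optional per-line size/MD5 columns are omitted):

```python
def make_url_list(urls):
    """Build a TsvHttpData-1.0 manifest: version header, then one URL per line."""
    return "\n".join(["TsvHttpData-1.0"] + list(urls)) + "\n"

# Hypothetical source URL; replace it with the real public file.
manifest = make_url_list(["http://public-url-pointing-to-the-file"])
print(manifest, end="")
```

Writing the string to theTsv.tsv and uploading it to the bucket reproduces the file shown in the steps.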

Are parquet files splittable when stored in AWS S3?

I know that Parquet files are splittable if they are stored in block storage, e.g. on HDFS.
Are they also splittable when stored in object storage such as AWS S3?
This confuses me because object storage is supposed to be atomic: you either access the entire file or none of it. You can't even change metadata on an S3 object without rewriting the entire object. On the other hand, AWS recommends using splittable file formats in S3 to improve the performance of Athena and other frameworks in the Hadoop ecosystem.
Yes, Parquet files are splittable.
S3 supports positioned reads (range requests), which can be used to read only selected portions of the input file (object).
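A sketch of how a range request enables splitting, using an in-memory stand-in for the object (the real call is an HTTP GET with a Range: bytes=start-end header, e.g. boto3's get_object(..., Range="bytes=0-99"); the toy layout below only imitates Parquet's structure):

```python
def range_get(obj: bytes, start: int, end: int) -> bytes:
    """Simulate GET with 'Range: bytes=start-end' (inclusive, like HTTP)."""
    return obj[start:end + 1]

# Toy "object": row-group payload, then file metadata, then the
# 4-byte magic "PAR1" that real Parquet files end with.
obj = b"row-group-bytes" + b"metadata" + b"PAR1"

# A reader first fetches the tail to locate the metadata ...
tail = range_get(obj, len(obj) - 4, len(obj) - 1)
print(tail)  # b'PAR1'

# ... then fetches only the row groups it needs, never the whole object.
row_group = range_get(obj, 0, 14)
print(row_group)  # b'row-group-bytes'
```

So "atomic" applies to writes (you replace a whole object), while reads can target arbitrary byte ranges, which is all a split needs.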
I'm not 100% sure what you mean here, but generally (I think) Parquet partitions on your partition keys and saves columns in blocks of rows. When I have used it in AWS S3, it was saved like:
|-Folder
|--Partition Keys
|---Columns
|----Rows_1-100.snappy.parquet
|----Rows_101-200.snappy.parquet
This handles the splitting efficiencies you mention.

Data Loss Prevention on Big Data files

I have migrated a big data application to the cloud and the input files are stored in GCS. The files can be of different formats like txt, csv, Avro, Parquet, etc., and they contain sensitive data that I want to mask.
Also, I have read that there is a quota restriction on file size. In my case a single file can contain 15M records.
I have tried the DLP UI as well as the client library to inspect those files, but it's not working.
GitHub page - https://github.com/Hitman007IN/DataLossPreventionGCPDemo
Under the resources there are 2 files: test.txt works, while test1.txt, the sample file I use in my application, does not.
Google Cloud DLP just launched support last week for scanning Avro files natively.

Does the Amazon S3 sync command upload the entire modified file again, or just the delta in the file?

My system generates large log files continuously and I want to upload all of them to Amazon S3. I am planning to use the s3 sync command for this. My system appends logs to the same file until it is about 50 MB, then creates a new log file. I understand that the sync command will sync a modified local log file to the S3 bucket, but I don't want to upload the entire log file every time it changes, because the files are large and sending the same data again and again will consume my bandwidth.
So I am wondering: does the s3 sync command send the entire modified file, or just the delta?
The documentation implies that it copies the whole updated file:
Recursively copies new and updated files
Plus, there would be no way to do delta uploads without first downloading the file from S3, which would effectively double the cost of an upload, since you'd pay both the download and the upload.
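Given that, one workaround in the spirit of the question is to upload only log files that have rotated (closed), so the still-growing file is never re-sent. A sketch under the assumption, stated in the question, that the writer rotates at ~50 MB; the filenames and helper are hypothetical, and the actual upload call (e.g. boto3's upload_file) is omitted:

```python
import os
import tempfile

def files_to_upload(log_dir, active_file, already_uploaded):
    """Pick rotated (closed) log files that have not been uploaded yet.

    The active, still-growing file is skipped, so no file is ever
    transferred twice; each log is sent once, right after rotation.
    """
    return sorted(name for name in os.listdir(log_dir)
                  if name != active_file and name not in already_uploaded)

# Simulated log directory: one active log, two rotated, one already synced.
with tempfile.TemporaryDirectory() as d:
    for name in ("app.log", "app.log.1", "app.log.2"):
        open(os.path.join(d, name), "w").close()
    todo = files_to_upload(d, active_file="app.log",
                           already_uploaded={"app.log.1"})

print(todo)  # ['app.log.2']
```

With aws s3 sync itself, a similar effect can be had by excluding the active file (e.g. --exclude "app.log"), since rotated files never change after they are closed.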