I have a CSV hosted on a server which updates daily. I'd like to set up a transfer to load this into Google Cloud Storage so that I can then query it using BigQuery.
I'm looking at the transfer service and it doesn't seem to have what I need, e.g. it only accepts CSVs or files from other Google Cloud Storage buckets or Amazon S3 buckets.
Thanks in advance
You can also use a URL to a TSV file as explained here and configure the transfer to run daily at the time of your choice.
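For reference, the URL list is a small tab-separated file: the first line is the marker TsvHttpData-1.0, and each following line gives a URL, the size of the object in bytes, and its base64-encoded MD5. Here is a rough Python sketch of generating one. The function name url_list_tsv and the example URL are made up for illustration, and it assumes you already have the file contents in memory. Since your CSV changes daily, the size and checksum would have to be regenerated whenever the file changes; check the current documentation for whether those two columns can be left out.

```python
import base64
import hashlib


def url_list_tsv(entries):
    """Build a URL list for the transfer service.

    entries: iterable of (url, content_bytes) pairs (hypothetical input shape).
    Returns the TSV text: a header line, then one line per URL with
    URL <tab> size in bytes <tab> base64-encoded MD5.
    """
    lines = ["TsvHttpData-1.0"]
    # Sort by the UTF-8 bytes of the URL so the list is in a stable order.
    for url, content in sorted(entries, key=lambda e: e[0].encode("utf-8")):
        md5_b64 = base64.b64encode(hashlib.md5(content).digest()).decode("ascii")
        lines.append(f"{url}\t{len(content)}\t{md5_b64}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical URL and contents, for illustration only.
    print(url_list_tsv([("https://example.com/daily.csv", b"id,value\n1,2\n")]))
```

The resulting file itself has to be reachable by URL (or stored in a bucket) so the transfer job can read it.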
Alternatively, if it still doesn't fit your need, you may install gsutil on your remote machine and use the gsutil rsync command and schedule it to run daily.
Related
I want to download millions of files from an S3 bucket, which would take more than a week to download one by one. Is there any way or command to download those files in parallel using a shell script?
Thanks,
AWS CLI
You can certainly issue GetObject requests in parallel. In fact, the AWS Command-Line Interface (CLI) does exactly that when transferring files, so that it can take advantage of available bandwidth. The aws s3 sync command will transfer the content in parallel.
See: AWS CLI S3 Configuration
If your bucket has a large number of objects, it can take a long time to list the contents of the bucket. Therefore, you might want to sync the bucket by prefix (folder) rather than trying it all at once.
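As a rough illustration of syncing by prefix, here is a Python sketch that runs one aws s3 sync per prefix, several at a time. The bucket name, the prefixes and the destination directory are placeholders, and the helper names sync_command and sync_prefixes are my own. It assumes the AWS CLI is installed and configured on the machine.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def sync_command(bucket, prefix, dest_root):
    """Return the aws s3 sync command (as an argv list) for one prefix."""
    prefix = prefix.strip("/")
    return ["aws", "s3", "sync", f"s3://{bucket}/{prefix}/", f"{dest_root}/{prefix}/"]


def sync_prefixes(bucket, prefixes, dest_root, workers=4, dry_run=False):
    """Sync each prefix with its own aws s3 sync process.

    With dry_run=True the commands are only returned, not executed.
    """
    commands = [sync_command(bucket, p, dest_root) for p in prefixes]
    if dry_run:
        return commands
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every sync; check=True raises if one of them fails.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return commands


if __name__ == "__main__":
    # Placeholder values; prints the commands without running them.
    for cmd in sync_prefixes("my-bucket", ["2023/", "2024/"], "/data", dry_run=True):
        print(" ".join(cmd))
```

Each aws s3 sync process already transfers files in parallel on its own, so a small number of workers is probably enough; the per-process concurrency is controlled by the max_concurrent_requests setting described in the CLI S3 configuration page.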
AWS DataSync
You might instead want to use AWS DataSync:
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data to and from AWS storage services over the internet or AWS Direct Connect... Move active datasets rapidly over the network into Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server. DataSync includes automatic encryption and data integrity validation to help make sure that your data arrives securely, intact, and ready to use.
DataSync uses a protocol that takes full advantage of available bandwidth and will manage the parallel downloading of content. A fee of $0.0125 per GB applies.
AWS Snowball
Another option is to use AWS Snowcone (8TB) or AWS Snowball (50TB or 80TB), which are physical devices that you can pre-load with content from S3 and have it shipped to your location. You then connect it to your network and download the data. (It works in reverse too, for uploading bulk data to Amazon S3).
Is it possible to get per file statistics (or at least download count) for files in google cloud storage?
I want to find the number of downloads for a js plugin file to get an idea of how frequently these are used (in client pages).
Yes, it is possible, but it has to be enabled.
The official recommendation is to create another bucket for the logs generated by the main bucket that you want to trace.
gsutil mb gs://<some-unique-prefix>-example-logs-bucket
then assign Cloud Storage the roles/storage.legacyBucketWriter role for the bucket:
gsutil iam ch group:cloud-storage-analytics@google.com:legacyBucketWriter gs://<some-unique-prefix>-example-logs-bucket
and finally enable the logging for your main bucket:
gsutil logging set on -b gs://<some-unique-prefix>-example-logs-bucket gs://<main-bucket>
Generate some activity on your main bucket, then wait up to an hour or so, since usage logs are generated hourly and storage logs daily. You will then be able to browse these events in the logs bucket created in the first step:
https://imgur.com/a/fncnxwM (imgur is down at the moment..I will fix this image later)
More info can be found at https://cloud.google.com/storage/docs/access-logs
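Once usage logs have landed in the logs bucket, the download count per file can be tallied from the CSV contents. Below is a minimal Python sketch, assuming the log files have a header row with the cs_method, sc_status and cs_object columns described in the access-logs documentation, and that you have already downloaded a log file and read it into a string. The function name count_downloads is just for illustration.

```python
import csv
import io
from collections import Counter


def count_downloads(log_text):
    """Count successful GET requests per object in one usage log (CSV text)."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(log_text)):
        is_get = row.get("cs_method") == "GET"
        is_ok = row.get("sc_status") == "200"
        obj = row.get("cs_object")
        # Rows with an empty cs_object are bucket-level requests, not downloads.
        if is_get and is_ok and obj:
            counts[obj] += 1
    return counts


if __name__ == "__main__":
    # Tiny made-up log with only the three columns used above.
    sample = (
        '"cs_method","sc_status","cs_object"\n'
        '"GET","200","plugin.js"\n'
        '"GET","404","missing.js"\n'
    )
    print(count_downloads(sample))
```

This only counts full 200 responses; whether partial (206) responses should count as downloads depends on what you want to measure.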
In most cases, using Cloud Audit Logs is now recommended instead of using legacyBucketWriter.
Logging to a separate Cloud Storage bucket with legacyBucketWriter produces CSV files, which you would then have to load into BigQuery yourself to make them actionable, and this would be far from real time. Cloud Audit Logs are easier to set up and work with by comparison, and logs are delivered almost instantly.
I am trying to transfer some files stored on my VM instances to BigQuery. Normally we use a two-step process:
Transfer the files from the VM instances to a Cloud Storage bucket.
Load the data from the Cloud Storage bucket into BigQuery.
Now I want to load the files directly from the VM instances into BigQuery. Is there any way to do it?
You can load data directly from a readable data source (such as your local machine) by using:
The Cloud Console or the classic BigQuery web UI
The bq command-line tool's bq load command
The API
The client libraries
Please follow the official documentation to see examples of each method.
Moreover, if you want to stay with the idea of sending your files to a Cloud Storage bucket, you can think about using Dataflow templates:
Cloud Storage Text to BigQuery (Stream)
Cloud Storage Text to BigQuery (Batch)
which allow you to read text files stored in Cloud Storage, transform them using a JavaScript User Defined Function (UDF) that you provide, and output the result to BigQuery. It is an automated solution.
I hope you find the above pieces of information useful.
The solution would be to use the bq command for this.
The command would look like this:
bq load --autodetect --source_format=CSV x.y abc.csv
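If there are several CSV files on the VM, the same command can be repeated per file. A small Python sketch of that idea follows; it assumes the bq tool is installed and authenticated on the VM, and it names each table after the file (my own convention, not something bq requires, and it will only work if the file names are valid table names). The helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def bq_load_commands(directory, dataset):
    """Build one bq load command per CSV file found in directory."""
    commands = []
    for csv_path in sorted(Path(directory).glob("*.csv")):
        # Assumption: the table is named after the file, e.g. abc.csv -> dataset.abc
        table = f"{dataset}.{csv_path.stem}"
        commands.append(
            ["bq", "load", "--autodetect", "--source_format=CSV", table, str(csv_path)]
        )
    return commands


def load_all(directory, dataset):
    """Run the load commands one after another, stopping at the first failure."""
    for cmd in bq_load_commands(directory, dataset):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder directory and dataset; prints the commands without running them.
    for cmd in bq_load_commands(".", "x"):
        print(" ".join(cmd))
```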
My organization is evaluating options for a hybrid data warehouse using AWS Redshift and S3. The objective is to process the data on-premises, send the processed copy to S3, and then load it into Redshift for visualization.
As we are in the initial stages, no file/storage gateway is set up yet.
Initially we used the Informatica Cloud tool to upload data from the on-premises server to AWS S3, but it was taking a long time. The data volume is a few hundred million records of history and a few thousand records in the daily incremental load.
Now I have created custom UNIX scripts that use the AWS CLI cp command to transfer files between the on-premises server and AWS S3 in gzip-compressed format.
This option is working fine.
But I would like to understand from experts whether this is the right way of doing it, or whether there are other, more optimized approaches available.
If the volume of your data is more than 100 MB, AWS suggests using multipart upload for better performance.
You can refer to the link below for details:
AWS Java SDK to upload large file in S3
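To give an idea of what multipart upload involves, here is a small Python sketch that only does the arithmetic of splitting a file into parts; it does not talk to S3. It relies on the S3 limits of at most 10,000 parts per upload and a minimum part size of 5 MiB for every part except the last. The 100 MiB default part size and the function name plan_parts are my own choices. As far as I know, aws s3 cp already switches to multipart on its own for files above its multipart_threshold setting, so the existing scripts may already benefit from it.

```python
import math

MIN_PART_SIZE = 5 * 1024**2   # S3 minimum part size (all parts but the last)
MAX_PARTS = 10_000            # S3 maximum number of parts per upload


def plan_parts(file_size, part_size=100 * 1024**2):
    """Return a list of (offset, length) byte ranges covering the file."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    part_size = max(part_size, MIN_PART_SIZE)
    # Grow the part size if needed so the file fits in MAX_PARTS parts.
    part_size = max(part_size, math.ceil(file_size / MAX_PARTS))
    ranges = []
    offset = 0
    while offset < file_size:
        length = min(part_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


if __name__ == "__main__":
    # A 250 MiB file with the default part size.
    for offset, length in plan_parts(250 * 1024**2):
        print(offset, length)
```

Each range would then be sent as one part, and the parts can be uploaded in parallel, which is where the performance gain comes from.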
Can anyone suggest documentation for transferring data from my personal computer to S3 on AWS? I have about 50 GB of data to transfer, and I will later use Spark to analyze it.
There are many free ways to upload files to S3, including:
use the AWS console: go into S3, navigate to the S3 bucket, then use Actions | Upload
use s3cmd
use the awscli
use Cloudberry Explorer
To upload from your local machine to S3, you can use tools like Cyberduck. Sometimes large uploads get interrupted; tools like Cyberduck can resume an aborted upload.
If you already have the data on an Amazon EC2 instance, then s3cmd works pretty well.