Migrate a large file from AWS S3 to Azure Blob Storage and monitor the progress of the migration in Java

I've written code that first downloads the file from AWS and then uploads it to Azure. I also need to monitor the full progress of the migration. But this approach consumes a lot of bandwidth and time, and gives me no visibility into the data being transferred. What is the best way to make a reliable transfer from S3 to Blob Storage along with monitoring of the migration?
// Downloading from AWS
BasicAWSCredentials awsCreds = new BasicAWSCredentials(t.getDaccID(), t.getDaccKey());
AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
        .withRegion(Regions.fromName("us-east-2"))
        .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
        .build();
S3Object s3object = s3Client.getObject(new GetObjectRequest(t.getDbucket(), t.getDfileName()));
byte[] bytes = IOUtils.toByteArray(s3object.getObjectContent());

// Uploading to Azure
String connStr = "DefaultEndpointsProtocol=https;AccountName=" + t.getUaccID()
        + ";AccountKey=" + t.getUaccKey() + ";EndpointSuffix=core.windows.net";
CloudStorageAccount cloudStorageAccount = CloudStorageAccount.parse(connStr);
CloudBlobClient blobClient = cloudStorageAccount.createCloudBlobClient();
CloudBlobContainer container = blobClient.getContainerReference(t.getUbucket());
CloudBlockBlob blob = container.getBlockBlobReference(t.getDfileName());
blob.uploadFromByteArray(bytes, 0, bytes.length);
writer.append("File Uploaded to Azure Successful \n");

You don't really need to download the file from S3 and upload it again to Azure Blob Storage. Azure Blob Storage supports creating a new blob by copying from a publicly accessible URL. This is an asynchronous operation and is performed on the server side by Azure Storage itself.
Here's what you would need to do (in place of your current code):
1. Create a signed URL for the object in AWS S3, or make the object publicly available.
2. Use the Azure Storage Java SDK to create a blob using the Copy Blob functionality, with the signed URL as the copy source.
3. Once the copy has been initiated, periodically fetch the blob's properties. The copy properties report the progress (both the status and the bytes copied out of the total), which you can use to monitor the migration (see the sketch below).
I wrote a blog post a long time back (when async Copy Blob was first introduced) that talks about copying objects from Amazon S3 to Azure Blob Storage. You can read it here: https://gauravmantri.com/2012/06/14/how-to-copy-an-object-from-amazon-s3-to-windows-azure-blob-storage-using-copy-blob/.
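Putting steps 2 and 3 together, here is a minimal sketch that reuses the same legacy com.microsoft.azure.storage SDK and the s3Client, container, writer and t objects from your code (error handling omitted):

// Create a pre-signed GET URL for the S3 object, valid for one hour
Date expiry = new Date(System.currentTimeMillis() + 60 * 60 * 1000);
URL signedUrl = s3Client.generatePresignedUrl(t.getDbucket(), t.getDfileName(), expiry);

// Start the server-side copy inside Azure Storage (asynchronous)
CloudBlockBlob blob = container.getBlockBlobReference(t.getDfileName());
blob.startCopy(signedUrl.toURI());

// Poll the copy state to monitor progress
blob.downloadAttributes();
CopyState state = blob.getCopyState();
while (state.getStatus() == CopyStatus.PENDING) {
    Thread.sleep(5000);
    blob.downloadAttributes();   // refreshes the copy state
    state = blob.getCopyState();
    writer.append(String.format("Copied %d of %d bytes%n",
            state.getBytesCopied(), state.getTotalBytes()));
}
writer.append("Copy finished with status: " + state.getStatus() + "\n");

CopyState and CopyStatus come from the com.microsoft.azure.storage.blob package. Since the data never flows through your machine, the bandwidth cost on your side is limited to the polling calls.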

Related

Multi-part upload to Google Storage

I'm trying to implement multi-part upload to Google Storage, but to my surprise it does not seem to be straightforward (I could not find a Java example).
The only mention I found was in the XML API: https://cloud.google.com/storage/docs/multipart-uploads
I also found some discussion around a compose API (StorageExample.java#L446, mentioned in google-cloud-java issue 1440).
Any recommendations on how to do a multi-part upload?
I got the multi-part upload working with @Kolban's suggestion (for details, check the blog post).
This is how I create the S3 client and point it at Google Storage:
def createClient(accessKey: String, secretKey: String, region: String = "us"): AmazonS3 = {
  val endpointConfig = new EndpointConfiguration("https://storage.googleapis.com", region)
  val credentials = new BasicAWSCredentials(accessKey, secretKey)
  val credentialsProvider = new AWSStaticCredentialsProvider(credentials)

  val clientConfig = new ClientConfiguration()
  clientConfig.setUseGzip(true)
  clientConfig.setMaxConnections(200)
  clientConfig.setMaxErrorRetry(1)

  val clientBuilder = AmazonS3ClientBuilder.standard()
  clientBuilder.setEndpointConfiguration(endpointConfig)
  clientBuilder.withCredentials(credentialsProvider)
  clientBuilder.withClientConfiguration(clientConfig)
  clientBuilder.build()
}
Because I'm doing the upload from the frontend (after generating signed URLs for each part using the AmazonS3 client), I needed to enable CORS.
For testing, I enabled everything for now:
$ gsutil cors get gs://bucket
$ echo '[{"origin": ["*"],"responseHeader": ["Content-Type", "ETag"],"method": ["GET", "HEAD", "PUT", "DELETE", "PATCH"],"maxAgeSeconds": 3600}]' > cors-config.json
$ gsutil cors set cors-config.json gs://bucket
See https://cloud.google.com/storage/docs/configuring-cors#gsutil_1
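For reference, generating the per-part signed URLs mentioned above could look roughly like this in Java. This is a sketch only: the bucket, key and helper name are placeholders, and the AmazonS3 client passed in must already be configured with the storage.googleapis.com endpoint as shown above.

import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import java.net.URL;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical helper: starts an XML-API multipart upload and returns one
// signed PUT URL per part, which the frontend can upload to directly.
static List<URL> signedPartUrls(AmazonS3 s3, String bucket, String key, int parts) {
    String uploadId = s3
            .initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key))
            .getUploadId();

    Date expiry = new Date(System.currentTimeMillis() + 60 * 60 * 1000); // 1 hour
    List<URL> urls = new ArrayList<>();
    for (int partNumber = 1; partNumber <= parts; partNumber++) {
        GeneratePresignedUrlRequest req = new GeneratePresignedUrlRequest(bucket, key)
                .withMethod(HttpMethod.PUT)
                .withExpiration(expiry);
        req.addRequestParameter("uploadId", uploadId);
        req.addRequestParameter("partNumber", String.valueOf(partNumber));
        urls.add(s3.generatePresignedUrl(req));
    }
    return urls;
}

The frontend then PUTs each part to its URL, collects the returned ETags, and the server finishes with completeMultipartUpload once all parts are in.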
Currently a Java client library for multi-part upload in Cloud Storage is not available. You can raise a feature request for it in this link. As mentioned by John Hanley, the next best thing you can do is a parallel composite upload with gsutil (CLI), the JSON or XML API, or a resumable upload with the Java libraries.
In a parallel composite upload, the parallel writes can be done using the JSON or XML API for Google Cloud Storage. Specifically, you write a number of smaller objects in parallel and then (once all of those objects have been written) call the compose request to combine them into one larger object.
If you're using the JSON API, the compose documentation is at: https://cloud.google.com/storage/docs/json_api/v1/objects/compose
If you're using the XML API, the compose documentation is at: https://cloud.google.com/storage/docs/reference-methods#putobject (see the compose query parameter).
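If you handle the parallel part uploads yourself, the google-cloud-storage Java client can at least do the compose step. A minimal sketch, with placeholder bucket and object names:

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class ComposeExample {
    public static void main(String[] args) {
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // Combine previously uploaded parts into a single object
        Storage.ComposeRequest request = Storage.ComposeRequest.newBuilder()
                .setTarget(BlobInfo.newBuilder("my-bucket", "big-file.bin").build())
                .addSource("big-file.part-1")
                .addSource("big-file.part-2")
                .addSource("big-file.part-3")
                .build();

        Blob composed = storage.compose(request);
        System.out.println("Composed object size: " + composed.getSize());
    }
}

Note that a single compose call accepts at most 32 source objects, so very large uploads may need to compose in stages.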
There is also an interesting document linked by Kolban which you can try and work out. I would also mention that you can do multi-part uploads in Java if you use the Google Drive API (v3); there is a code example where the files.create method is used with uploadType=multipart.

Amazon Textract without using Amazon S3

I want to extract information from PDFs using Amazon Textract (as in How to use the Amazon Textract with PDF files). All the answers and the AWS documentation require the input to be an Amazon S3 object.
Can I use Textract without uploading the PDFs to Amazon S3, by passing them directly in the REST call? (I have to store the PDFs locally.)
I will answer this question with the Java API in mind. The short answer is yes.
If you look at the TextractAsyncClient Javadoc for a given operation:
https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/textract/TextractAsyncClient.html#analyzeDocument-software.amazon.awssdk.services.textract.model.AnalyzeDocumentRequest-
it states:
"Documents for asynchronous operations can also be in PDF format"
This means you can reference a local PDF document and create an AnalyzeDocumentRequest object like this (without pulling from an Amazon S3 bucket):
public static void analyzeDoc(TextractClient textractClient, String sourceDoc) {
    try {
        // Read the local document into an SdkBytes buffer
        InputStream sourceStream = new FileInputStream(new File(sourceDoc));
        SdkBytes sourceBytes = SdkBytes.fromInputStream(sourceStream);

        // Get the input Document object as bytes
        Document myDoc = Document.builder()
                .bytes(sourceBytes)
                .build();

        List<FeatureType> featureTypes = new ArrayList<FeatureType>();
        featureTypes.add(FeatureType.FORMS);
        featureTypes.add(FeatureType.TABLES);

        AnalyzeDocumentRequest analyzeDocumentRequest = AnalyzeDocumentRequest.builder()
                .featureTypes(featureTypes)
                .document(myDoc)
                .build();

        // Use the Textract client to perform an operation like analyzeDocument
        ...

    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
}
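For completeness, the elided part inside the try block would invoke the Textract operation on the request that was just built. A rough sketch (not the answer's exact code) that prints every detected LINE block:

AnalyzeDocumentResponse response = textractClient.analyzeDocument(analyzeDocumentRequest);
for (Block block : response.blocks()) {
    if (block.blockType() == BlockType.LINE) {
        System.out.println(block.text());
    }
}

AnalyzeDocumentResponse, Block and BlockType all live in software.amazon.awssdk.services.textract.model.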

Best practices for uploading a file to S3 and metadata to RDS?

Context
I'm building a mock service to learn AWS. I want a user to be able to upload a sound file (which other users can listen to). To do this I need the sound file to be uploaded to S3, and metadata such as the file name, the name of the uploader, the length, and the S3 ID to be stored in RDS. It is preferable that the user uploads directly to S3 with a signed URL instead of doubling the data transferred by first uploading it to my server and from there to S3.
Optimally this would be transactional, but from what I have gathered there is no built-in functionality for that. To implement this and minimize the risk of the file being successfully uploaded to S3 but not the metadata to RDS (and vice versa), my best guess is as follows:
My solution
With words:
First, I attempt to upload the file to S3 with a key (UUID) that I generate locally or server-side. If the upload succeeds, I make a request to my API to store the metadata, including the key, in RDS. If that request fails, I remove the object from S3.
With code:
uuid = get_uuid_from_server();
s3Client.putObject({.., key: uuid, ..}, function(err, data) {
    if (err) {
        reject(err);
    } else {
        resolve(data);
        // Upload metadata to RDS through an API call to the EC2 server.
        // Remove the S3 object with key `uuid` if this call is unsuccessful.
    }
});
As I'm learning, my approaches are seldom best practice, but I was unable to find any good information on this particular problem. Is my approach above in line with best practices?
Bonus question: is it beneficial for security purposes to generate the file's key (UUID) server-side instead of client-side?
Here are two approaches you can pick from, assuming the client is a web browser or mobile app.
1. Use your server as a proxy to S3
Your server acts as a proxy between your clients and S3. You have full control of the upload flow, can restrict the supported file types, and can inspect the file contents (for example, to make sure the file really is a sound file) before uploading to S3.
2. Use your server to create pre-signed upload URLs
In this approach, your client first asks the server to create one or more (for a multi-part upload) pre-signed URLs. The client then uploads to S3 using those URLs. Your server can save those URLs to keep track of the uploads later.
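A server-side sketch of issuing such a pre-signed PUT URL with the AWS SDK for Java v1; the bucket name and key prefix are placeholders:

import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;
import java.net.URL;
import java.util.Date;
import java.util.UUID;

public class PresignedUploadUrl {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Allow the client to PUT directly to S3 for the next 15 minutes
        Date expiry = new Date(System.currentTimeMillis() + 15 * 60 * 1000);
        GeneratePresignedUrlRequest request =
                new GeneratePresignedUrlRequest("my-sound-bucket", "uploads/" + UUID.randomUUID())
                        .withMethod(HttpMethod.PUT)
                        .withExpiration(expiry);

        URL uploadUrl = s3.generatePresignedUrl(request);
        System.out.println(uploadUrl); // hand this URL to the browser / mobile client
    }
}

The object key here is generated server-side with UUID.randomUUID(), so the client never chooses object names.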
To be notified when the upload finishes, successfully or not, you can either:
(1) Ask clients to call another API, e.g. /ack, after the upload finishes for a particular signed URL. If this API is not called after some time, e.g. 1 hour, you can check with S3 and delete the file accordingly. You can do this because you have the signed URL stored in your DB from the start of the upload.
or
(2) Make use of S3 events. You can configure the ObjectCreated event in S3, which is fired whenever an object is created, send all the events to a queue in SQS, and have your server process each event from there. This way you do not rely on clients to update your server after an upload finishes; S3 will notify your server of every successful upload (a rough consumer sketch follows below).
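The server side of option (2) could look roughly like this with the AWS SDK for Java v2; the queue URL is a placeholder for the SQS queue that receives the ObjectCreated notifications:

import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

public class S3EventConsumer {
    public static void main(String[] args) {
        SqsClient sqs = SqsClient.create();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-upload-events"; // placeholder

        while (true) {
            ReceiveMessageRequest receive = ReceiveMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .maxNumberOfMessages(10)
                    .waitTimeSeconds(20) // long polling
                    .build();

            for (Message message : sqs.receiveMessage(receive).messages()) {
                // message.body() is the S3 event JSON: bucket name, object key, size, etc.
                // Parse it and insert/confirm the metadata row in RDS here.
                System.out.println(message.body());

                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .receiptHandle(message.receiptHandle())
                        .build());
            }
        }
    }
}

Long polling (waitTimeSeconds) keeps the number of empty receive calls low; each message body carries the bucket name and object key, which is what you would record in RDS.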

Does the GCP SDK download the file when calling the copyTo method on the Blob object?

Given the example in the Google Cloud documentation:
CopyWriter copyWriter = blob.copyTo(destBucket, destBlob);
Blob copiedBlob = copyWriter.getResult();
boolean deleted = blob.delete();
Does calling the copyTo method download the file to the local machine and then upload it to the new destination?
The copyTo API does not download the object. Both the source and the destination are objects in buckets, and the copy is performed by the Cloud Storage backend. This can be verified by looking at the REST API: it provides no mechanism to download or upload data for this operation.

How to make requests to third-party APIs and load the results periodically into Google BigQuery? Which Google services should I use?

I need to get data from a third-party API and ingest it into Google BigQuery, and I probably need to automate this process through Google services so it runs periodically.
I am trying to use Cloud Functions, but it needs a trigger. I have also read about App Engine, but I believe it is not suitable for a single function that just pulls data.
Another doubt: do I need to load the data into Cloud Storage, or can I load it straight into BigQuery? Should I use Dataflow, and does it need any particular configuration?
def upload_blob(bucket_name, request_url, destination_blob_name):
    """
    Uploads a file to the bucket.
    """
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)
    request_json = requests.get(request_url['url'])

    print('File {} uploaded to {}.'.format(
        bucket_name,
        destination_blob_name))


def func_data(request_url):
    BUCKET_NAME = 'dataprep-staging'
    BLOB_NAME = 'any_name'
    BLOB_STR = '{"blob": "some json"}'

    upload_blob(BUCKET_NAME, request_url, BLOB_NAME)
    return f'Success!'
I expect advice about the architecture (Google services) that I should use to create this pipeline. For example: use a Cloud Function to get the data from the API, then schedule a job using service 'X' to write the data to storage, and finally pull the data from storage into BigQuery.
You can use Cloud Functions. Create an HTTP-triggered function and call it periodically with Cloud Scheduler.
By the way, you can also call an HTTP endpoint on App Engine or Cloud Run the same way.
About storage, the answer is no. If the API result is not too large for the function's allowed memory, you can write it to the /tmp directory and load the data into BigQuery from that file. You can size your function up to 2 GB of memory if needed.
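If you take the "/tmp file, then load" route, a rough sketch with the google-cloud-bigquery Java client could look like this; the dataset, table and file names are placeholders:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.TableDataWriteChannel;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.WriteChannelConfiguration;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LoadTmpFile {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        WriteChannelConfiguration config = WriteChannelConfiguration
                .newBuilder(TableId.of("my_dataset", "api_results"))
                .setFormatOptions(FormatOptions.json()) // newline-delimited JSON
                .build();

        // Stream the file saved under /tmp into a BigQuery load job
        TableDataWriteChannel writer = bigquery.writer(config);
        try (OutputStream stream = Channels.newOutputStream(writer)) {
            Files.copy(Paths.get("/tmp/api_result.json"), stream);
        }

        Job job = writer.getJob().waitFor();
        System.out.println("Load finished without error: " + (job.getStatus().getError() == null));
    }
}

The file never needs to touch Cloud Storage; it is streamed straight into a BigQuery load job.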