gsutil: back up files that are encrypted with a customer-supplied key - google-cloud-platform

I have a Google Cloud Storage bucket containing files. Each of these files is encrypted with a different key, for security reasons. This bucket is the source. I want to copy its contents from the source bucket to the destination bucket, just to have a backup.
I tried to run this command:
$ gsutil cp -r gs://source-bucket/* gs://dest-bucket/
Traceback (most recent call last):
File "/usr/lib/google-cloud-sdk/platform/gsutil/gsutil", line 21, in <module>
File "/usr/lib/google-cloud-sdk/platform/gsutil/", line 132, in RunMain
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 431, in main
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 760, in _RunNamedCommandAndHandleExceptions
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 626, in _RunNamedCommandAndHandleExceptions
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 411, in RunNamedCommand
return_code = command_inst.RunCommand()
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/", line 1200, in RunCommand
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 1515, in Apply
arg_checker, should_return_results, fail_on_error)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 1586, in _SequentialApply
worker_thread.PerformTask(task, self)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 2306, in PerformTask
results = task.func(cls, task.args, thread_state=self.thread_gsutil_api)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/", line 790, in _CopyFuncWrapper
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/", line 1000, in CopyFunc
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/", line 3729, in PerformCopy
decryption_key = GetDecryptionCSEK(src_url, src_obj_metadata)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/", line 3645, in GetDecryptionCSEK
(src_obj_metadata.customerEncryption.keySha256, src_url))
gslib.cloud_api.EncryptionException: Missing decryption key with SHA256 hash 0z1dPrWjTL6yrU5U6GP2gTaBriwNbMJnh6CcIuLSy8o=. No decryption key matches object gs://source-bucket/myfile.json
I guess the copy fails because the decryption keys are missing.
I also tried to create a Transfer operation, but it failed for a strange reason.
How can I back up my files in this case? I just want to copy them as they are.
What are my alternatives?

You have to supply the keys that you used to encrypt the files.
With gsutil you have to use the .boto file and put something like this in its [GSUtil] section:
encryption_key = ...
decryption_key1 = ...
decryption_key2 = ...
gsutil automatically detects the correct CSEK to use for a cloud
object by comparing the key's SHA256 hash against the hash of the
CSEK. gsutil considers the configured encryption key and up to 100
decryption keys when searching for a match. Decryption keys must be
listed in the boto configuration file in ascending numerical order
starting with 1.
For more on customer-supplied encryption keys, check here.
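Since every object here has its own key, the decryption_keyN entries are easier to generate than to type by hand. The following is a minimal Python sketch under a few assumptions: the keys are available as base64-encoded AES-256 strings, and the helper names csek_sha256 and boto_gsutil_section are made up for this example. The hash helper computes the base64-encoded SHA256 of the raw key bytes, which should be comparable with the hash printed in the EncryptionException, so you can check which key belongs to which object.

```python
import base64
import hashlib


def csek_sha256(key_b64):
    """Return the base64-encoded SHA256 hash of a base64-encoded key."""
    raw_key = base64.b64decode(key_b64)
    return base64.b64encode(hashlib.sha256(raw_key).digest()).decode("ascii")


def boto_gsutil_section(encryption_key, decryption_keys):
    """Build the text of a [GSUtil] section listing the given keys."""
    if len(decryption_keys) > 100:
        # The quoted documentation says gsutil considers up to 100 decryption keys.
        raise ValueError("gsutil considers at most 100 decryption keys")
    lines = ["[GSUtil]", "encryption_key = " + encryption_key]
    # Decryption keys are numbered in ascending order starting with 1.
    for number, key in enumerate(decryption_keys, start=1):
        lines.append("decryption_key%d = %s" % (number, key))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder keys for illustration only; substitute the real ones.
    demo_keys = [base64.b64encode(bytes([i]) * 32).decode("ascii") for i in range(3)]
    print(boto_gsutil_section(demo_keys[0], demo_keys[1:]))
    for key in demo_keys:
        print(csek_sha256(key))
```

If the bucket holds more than 100 distinct keys, one option is to copy in batches, each batch with its own set of keys in the configuration.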


Create a file from using an AWS Lambda function and an Amazon S3 event (JAVA) [duplicate]

I'm seeing the below error from my lambda function when I drop a file.csv into an S3 bucket. The file is not large and I even added a 60 second sleep prior to opening the file for reading, but for some reason the file has the extra ".6CEdFe7C" appended to it. Why is that?
[Errno 30] Read-only file system: u'/file.csv.6CEdFe7C': IOError
Traceback (most recent call last):
File "/var/task/", line 75, in lambda_handler
s3.download_file(bucket, key, filepath)
File "/var/runtime/boto3/s3/", line 104, in download_file
extra_args=ExtraArgs, callback=Callback)
File "/var/runtime/boto3/s3/", line 670, in download_file
extra_args, callback)
File "/var/runtime/boto3/s3/", line 685, in _download_file
self._get_object(bucket, key, filename, extra_args, callback)
File "/var/runtime/boto3/s3/", line 709, in _get_object
extra_args, callback)
File "/var/runtime/boto3/s3/", line 723, in _do_get_object
with, 'wb') as f:
File "/var/runtime/boto3/s3/", line 332, in open
return open(filename, mode)
IOError: [Errno 30] Read-only file system: u'/file.csv.6CEdFe7C'
def lambda_handler(event, context):
    s3_response = {}
    counter = 0
    event_records = event.get("Records", [])
    s3_items = []
    for event_record in event_records:
        if "s3" in event_record:
            bucket = event_record["s3"]["bucket"]["name"]
            key = event_record["s3"]["object"]["key"]
            filepath = '/' + key
            s3.download_file(bucket, key, filepath)
The result of the above is:
[Errno 30] Read-only file system: u'/file.csv.6CEdFe7C'
If the key/file is "file.csv", then why does the s3.download_file method try to download "file.csv.6CEdFe7C"? I'm guessing when the function is triggered, the file is file.csv.xxxxx but by the time it gets to line 75, the file is renamed to file.csv?
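Only /tmp seems to be writable in AWS Lambda. The ".6CEdFe7C" suffix is not part of the key: boto3 downloads to a temporary file with a random extension and renames it to the requested filename once the transfer completes, so the error is about writing under '/', not about the file name.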
Therefore this would work:
filepath = '/tmp/' + key
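As a small illustration of this fix, here is a Python 3 sketch that builds a /tmp path for an S3 key. The helper name tmp_path_for_key is hypothetical. It also takes two precautions that are not in the original code: it URL-decodes the key (keys in S3 event records arrive URL-encoded) and keeps only the last path component, so that keys with prefixes such as folder/file.csv do not need subdirectories under /tmp.

```python
import posixpath
from urllib.parse import unquote_plus


def tmp_path_for_key(key, tmp_dir="/tmp"):
    """Map an S3 object key from an event record to a writable path in tmp_dir."""
    # Event records encode spaces as '+' and other characters as %XX.
    decoded_key = unquote_plus(key)
    # S3 keys use '/' as separator, so posixpath is the right tool here.
    file_name = posixpath.basename(decoded_key)
    return posixpath.join(tmp_dir, file_name)


if __name__ == "__main__":
    print(tmp_path_for_key("file.csv"))
    print(tmp_path_for_key("reports/2020/my+file.csv"))
```

In the handler this would replace filepath = '/' + key. If the key can contain encoded characters, the decoded key is also the one to pass to s3.download_file.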
According to the documentation, the example passes the name of the object in the cloud first and the local path to download to last.
On the other hand, according to the Amazon docs, we have 512 MB in /tmp for creating files.
Here is my code on AWS Lambda; it works like a charm.
I noticed that when I uploaded the Lambda code directly as a zip file, I was able to write only to the /tmp folder, but when I uploaded the code from S3, I was able to write to the project root folder too.
It also works perfectly in C#:
using (var fileStream = File.Create("/tmp/" + fName)) {
    str.Seek(0, SeekOrigin.Begin);   // str is the source stream; rewind it first
    str.CopyTo(fileStream);
}

UnicodeEncodeError while transferring ".eml" file to Google Cloud Platform (gsutil v4.6.1 on Linux)

While transferring file(s) from a Linux system to Google Cloud Platform using the gsutil cp command, it fails at some old ".eml" files when trying to process its content (not just file name!) which contains non-English characters not encoded in Unicode.
The command attempted was:
gsutil cp "/home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml" gs://darsen_backup_monthly/
The error message was:
UnicodeEncodeError: 'ascii' codec can't encode character '\udca8' in position 22881: ordinal not in range(128)
gsutil rsync gives a very similar error. Position 22881 (0x5961) turns out to be towards the end of the multi-part e-mail source file. The following shows the hex-dumped file content:
00005960: 20a8 43a4 d1b3 a320 5961 686f 6f21 a95f .C.... Yahoo!._
00005970: bcaf 203e 2020 7777 772e 7961 686f 6f2e .. >
00005980: 636f 6d2e 7477 0d0a
We see byte "0xa8" at position 0x5961, which was the source of the problem as indicated by the error message. For some reason gsutil was trying to encode the text. When opening the file in a terminal that supports Chinese characters, we see this:
< 每天都 Yahoo!奇摩 >
The first Chinese character "每" is 0xa843 when encoded in Big-5. A simple workaround is to rename the file extension to something other than ".eml", such as ".eml.bak", so that gsutil does not process the file content. Unfortunately, it is difficult to know in advance which files contain such non-English characters when doing a bulk transfer, so the whole process can be stopped multiple times.
The following is the full error message:
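The '\udca8' in the error message is a hint about the mechanism. Python's surrogateescape error handler maps an undecodable byte 0xNN to the lone surrogate U+DCNN, and such a character cannot be encoded back to ASCII. The sketch below reproduces that with the bytes from the hex dump and adds a hypothetical helper, first_non_ascii_offset, that could be used to find affected files before a bulk transfer. It illustrates the failure mode only; it is not the gsutil code path itself.

```python
# Bytes 0x5960-0x5973 from the hex dump above (Big-5 encoded text).
RAW = bytes.fromhex("20a843a4d1b3a3205961686f6f21a95fbcaf203e")


def first_non_ascii_offset(data):
    """Return the offset of the first byte >= 0x80, or None if data is pure ASCII."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return None


def reproduce_error(data):
    """Decode with surrogateescape, then try to encode the result as ASCII."""
    text = data.decode("ascii", errors="surrogateescape")
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


if __name__ == "__main__":
    print(first_non_ascii_offset(RAW))
    print(reproduce_error(RAW))
    # Decoding with the right codec recovers the readable text.
    print(RAW.decode("big5", errors="replace"))
```

Running first_non_ascii_offset over the bytes of each ".eml" file before the transfer would at least list the files that are likely to trip the same error.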
darsenlu#devmodel:~/Home$ gsutil cp "/home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml" gs://darsen_backup_monthly/
Copying file:///home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml [Content-Type=message/rfc822]...
Traceback (most recent call last):
File "/usr/lib/google-cloud-sdk/platform/gsutil/gsutil", line 21, in <module>
File "/usr/lib/google-cloud-sdk/platform/gsutil/", line 122, in RunMain
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 444, in main
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 780, in _RunNamedCommandAndHandleExceptions
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 639, in _RunNamedCommandAndHandleExceptions
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 411, in RunNamedCommand
return_code = command_inst.RunCommand()
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/", line 1124, in RunCommand
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 1525, in Apply
arg_checker, should_return_results, fail_on_error)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 1596, in _SequentialApply
worker_thread.PerformTask(task, self)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 2316, in PerformTask
results = task.func(cls, task.args, thread_state=self.thread_gsutil_api)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/", line 709, in _CopyFuncWrapper
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/", line 924, in CopyFunc
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/", line 3957, in PerformCopy
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/", line 2250, in _UploadFileToObject
parallel_composite_upload, logger)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/", line 2066, in _DelegateUploadFileToObject
elapsed_time, uploaded_object = upload_delegate()
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/", line 2227, in CallNonResumableUpload
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/", line 1762, in _UploadFileToObjectNonResumable
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 388, in UploadObject
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 1712, in UploadObject
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/", line 1534, in _UploadObject
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/third_party/storage_apitools/", line 1182, in Insert
upload=upload, upload_config=upload_config)
File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/", line 703, in _RunMethod
File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/", line 679, in PrepareHttpRequest
upload.ConfigureRequest(upload_config, http_request, url_builder)
File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/", line 763, in ConfigureRequest
File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/", line 823, in __ConfigureMultipartRequest
g.flatten(msg_root, unixfrom=False)
File "/usr/lib/python3.6/email/", line 116, in flatten
File "/usr/lib/python3.6/email/", line 181, in _write
File "/usr/lib/python3.6/email/", line 214, in _dispatch
File "/usr/lib/python3.6/email/", line 272, in _handle_multipart
g.flatten(part, unixfrom=False, linesep=self._NL)
File "/usr/lib/python3.6/email/", line 116, in flatten
File "/usr/lib/python3.6/email/", line 181, in _write
File "/usr/lib/python3.6/email/", line 214, in _dispatch
File "/usr/lib/python3.6/email/", line 361, in _handle_message
payload = self._encode(payload)
File "/usr/lib/python3.6/email/", line 412, in _encode
return s.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\udca8' in position 22881: ordinal not in range(128)
The Linux system is Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-76-generic x86_64).
I took your string with Chinese characters and was able to reproduce your error. I fixed it after updating to gsutil 4.62. Here's the merged PR and issue tracker as reference.
Update Cloud SDK by running:
gcloud components update

Parsing multipage tables into CSV files with AWS Textract

I'm a total AWS newbie trying to parse tables of multi page files into CSV files with AWS Textract.
I tried using AWS's example on this page. However, when we are dealing with a multi-page file, the call response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES']) breaks, since multi-page files need asynchronous processing, as you can see in the documentation here. The correct function to call would be client.start_document_analysis, and after running it, retrieve the result using client.get_document_analysis(JobId).
So I adapted their example to use this logic instead of the client.analyze_document function. The adapted piece of code looks like this:
client = boto3.client('textract')
response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
while jobstatus == "IN_PROGRESS":
    if jobstatus == "IN_PROGRESS": print("IN_PROGRESS")
But when I run that I get the following error:
Traceback (most recent call last):
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/", line 125, in <module>
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/", line 112, in main
table_csv = get_table_csv_results(file_name)
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/", line 62, in get_table_csv_results
response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 316, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 608, in _make_api_call
api_params, operation_model, context=request_context)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 656, in _convert_to_request_dict
api_params, operation_model)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 297, in serialize_to_request
raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Missing required parameter in input: "DocumentLocation"
Unknown parameter in input: "Document", must be one of: DocumentLocation, FeatureTypes, ClientRequestToken, JobTag, NotificationChannel
And that happens because the standard way to call start_document_analysis is with a file in S3, using this sort of syntax:
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=['TABLES'])
However, if I do that I will break the command line logic proposed in the AWS example:
python file.pdf.
The question is: how do I adapt the AWS example to be able to process multi-page files?
Consider using two different Lambdas: one to call Textract and one to process the result.
Please read this document
And check this repository
To process the JSON, you can use this sample as a reference or use it directly as a library:
python -m pip install amazon-textract-response-parser
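For the original command-line script, the asynchronous flow can also be done in a single process. The following is a minimal Python sketch, not the AWS sample itself: the function names start_table_analysis and collect_blocks are hypothetical, the client is passed in as a parameter (in practice boto3.client('textract')), and the document is assumed to be in S3 already. It starts the job, polls get_document_analysis until the status is no longer IN_PROGRESS, and then follows NextToken to gather every page of results.

```python
import time


def start_table_analysis(client, bucket, name):
    """Start an asynchronous TABLES analysis for a document stored in S3."""
    response = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": name}},
        FeatureTypes=["TABLES"],
    )
    return response["JobId"]


def collect_blocks(client, job_id, poll_seconds=5.0, sleep=time.sleep):
    """Poll until the job leaves IN_PROGRESS, then gather Blocks from every result page."""
    response = client.get_document_analysis(JobId=job_id)
    while response["JobStatus"] == "IN_PROGRESS":
        sleep(poll_seconds)
        response = client.get_document_analysis(JobId=job_id)
    if response["JobStatus"] != "SUCCEEDED":
        raise RuntimeError("Textract job ended with status " + response["JobStatus"])
    blocks = list(response.get("Blocks", []))
    # Large results are paginated; follow NextToken until it is absent.
    while response.get("NextToken"):
        response = client.get_document_analysis(
            JobId=job_id, NextToken=response["NextToken"]
        )
        blocks.extend(response.get("Blocks", []))
    return blocks
```

To keep the "python script file.pdf" workflow, the script could first upload the local file with boto3.client('s3').upload_file(file_name, bucket, key), where bucket and key are placeholders you choose, and then pass the same bucket and key to start_table_analysis. The collected blocks can then go through the same table-to-CSV logic as in the synchronous example.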

Processing many WARC archives from CommonCrawl using Hadoop Streaming and MapReduce

I am working on a project in which I need to download crawl data (from CommonCrawl) for specific URLs from an S3 container and then process that data.
Currently I have a MapReduce job (Python via Hadoop Streaming) which gets the correct S3 file paths for a list of URLs. Then I am trying to use a second MapReduce job to process this output by downloading the data from the commoncrawl S3 bucket. In the mapper I am using boto3 to download the gzip contents for a specific URL from the commoncrawl S3 bucket and then output some information about the gzip contents (word counter information, content length, URLs linked to, etc.). The reducer then goes through this output to get the final word count, URL list, etc.
The output file from the first MapReduce job is only about 6 MB in size (but will be larger once we scale to the full dataset). When I run the second MapReduce job, this file is split into only two parts. Normally this is not a problem for such a small file, but the mapper code I described above (fetching S3 data, spitting out mapped output, etc.) takes a while to run for each URL. Since the file is split into only two parts, only 2 mappers are run. I need to increase the number of splits so that the mapping can be done faster.
I have tried setting "mapreduce.input.fileinputformat.split.maxsize" and "mapreduce.input.fileinputformat.split.minsize" for the MapReduce job, but it doesn't change the number of splits taking place.
Here is some of the code from the mapper:
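import gzip
import io
import boto3
from botocore import UNSIGNED
from botocore.client import Config
s3 = boto3.client('s3', 'us-west-2', config=Config(signature_version=UNSIGNED))
offset_end = offset + length - 1
# Fetch only the byte range of this record and read it into memory
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename, Range='bytes=%s-%s' % (offset, offset_end))['Body'].read()
fileobj = io.BytesIO(gz_file)
with gzip.open(fileobj, 'rb') as file:
    [do stuff]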
I also manually split the input file up into multiple files with a maximum of 100 lines. This had the desired effect of giving me more mappers, but then I began encountering a ConnectionError from the s3client.get_object() call:
Traceback (most recent call last):
File "", line 103, in <module>
commoncrawl_reader(base_url, full_url, offset, length, warc_file)
File "", line 14, in commoncrawl_reader
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename, Range='bytes=%s-%s' % (offset, offset_end))[
File "/usr/lib/python3.6/site-packages/botocore/", line 314, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/", line 599, in _make_api_call
operation_model, request_dict)
File "/usr/lib/python3.6/site-packages/botocore/", line 148, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/lib/python3.6/site-packages/botocore/", line 177, in _send_request
success_response, exception):
File "/usr/lib/python3.6/site-packages/botocore/", line 273, in _needs_retry
caught_exception=caught_exception, request_dict=request_dict)
File "/usr/lib/python3.6/site-packages/botocore/", line 227, in emit
return self._emit(event_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/", line 210, in _emit
response = handler(**kwargs)
File "/usr/lib/python3.6/site-packages/botocore/", line 183, in __call__
if self._checker(attempts, response, caught_exception):
File "/usr/lib/python3.6/site-packages/botocore/", line 251, in __call__
File "/usr/lib/python3.6/site-packages/botocore/", line 277, in _should_retry
return self._checker(attempt_number, response, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/", line 317, in __call__
File "/usr/lib/python3.6/site-packages/botocore/", line 223, in __call__
attempt_number, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/", line 359, in _check_caught_exception
raise caught_exception
File "/usr/lib/python3.6/site-packages/botocore/", line 222, in _get_response
proxies=self.proxies, timeout=self.timeout)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/", line 573, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/", line 415, in send
raise ConnectionError(err, request=request)
botocore.vendored.requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
I am currently running this with only a handful of URLs, but I will need to do it with several thousand (each with many subdirectories) once I get it working.
I am not certain where to start with fixing this. I feel that it is highly likely there is better approach than what I am trying. The fact that the mapper seems to take so long for each URL seems like a big indication that I am approaching this wrong. I should also mention that the mapper and the reducer both run correctly if run directly as a pipe command:
"cat short_url_list.txt | python | sort | python" -> Produces desired output, but would take too long to run on the entire list of URLs.
Any guidance would be greatly appreciated.
The MapReduce API provides the NLineInputFormat. The property "mapreduce.input.lineinputformat.linespermap" lets you control the maximum number of lines (here WARC records) passed to each mapper. It works with mrjob, cf. Ilya's WARC indexer.
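For a plain Hadoop Streaming job, the same setting can be passed on the command line. The sketch below only assembles the argument list in Python so that it is easy to inspect or hand to subprocess; the function name streaming_args, the jar path, and the script names are placeholders, not values from the question. It assumes the old-API class org.apache.hadoop.mapred.lib.NLineInputFormat, which is the one Streaming's -inputformat option expects, and puts the generic -D option before the streaming options.

```python
def streaming_args(streaming_jar, input_path, output_path, mapper, reducer,
                   lines_per_map):
    """Build a Hadoop Streaming command that gives each mapper at most lines_per_map lines."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options such as -D must come before the streaming options.
        "-D", "mapreduce.input.lineinputformat.linespermap=%d" % lines_per_map,
        "-inputformat", "org.apache.hadoop.mapred.lib.NLineInputFormat",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


if __name__ == "__main__":
    # All paths and script names below are hypothetical.
    print(" ".join(streaming_args(
        "hadoop-streaming.jar", "urls/", "out/", "mapper.py", "reducer.py", 100)))
```

One thing to check on your cluster: with an input format other than the default text format, Streaming may pass the key (the byte offset) to the mapper as well, so each line could arrive as offset, tab, record, and the mapper would need to drop the first field.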
Regarding the S3 connection error: it's better to run the job in the us-east-1 AWS region where the data is located.
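Even in the right region, an occasional "Connection reset by peer" is possible when many mappers fetch byte ranges at once. A common mitigation, which is an addition here and not part of the advice above, is to retry the fetch with exponential backoff. The sketch below is generic Python; retry_with_backoff and its parameters are hypothetical names. In the mapper, the wrapped callable would be the s3.get_object(...) range request, and retry_on would need to name the exception class shown in the traceback, since that class is not the built-in ConnectionError used as the default here.

```python
import random
import time


def retry_with_backoff(func, retry_on=(ConnectionError,), max_attempts=5,
                       base_delay=1.0, sleep=time.sleep):
    """Call func(); on a retryable exception wait base_delay * 2**n plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * (2 ** attempt)
            # Jitter spreads out retries from mappers that failed at the same time.
            sleep(delay + random.uniform(0, delay / 2))
```

A call could look like retry_with_backoff(lambda: fetch_record(filename, offset, length), retry_on=(SomeConnectionError,)), where fetch_record and SomeConnectionError stand in for the mapper's own download function and the exception type it raises.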

'No such file or directory' error after submitting a training job

I execute:
gcloud beta ml jobs submit training ${JOB_NAME} --config config.yaml
and after about 5 minutes the job errors out with this error:
Traceback (most recent call last):
File "/usr/lib/python2.7/", line 162, in _run_module_as_main "__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/", line 72, in _run_code exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/", line 232, in <module>
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/", line 30, in run sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/", line 228, in main run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/", line 129, in run_training data_sets = input_data.read_data_sets(FLAGS.train_dir, FLAGS.fake_data)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/", line 212, in read_data_sets with open(local_file, 'rb') as f: IOError: [Errno 2] No such file or directory: 'gs://my-bucket/mnist/train/train-images.gz'
The strange thing is, as far as I can tell, that file exists at that url.
This error usually indicates you are using a multi-region GCS bucket for your output. To avoid this error you should use a regional GCS bucket. Regional buckets provide stronger consistency guarantees which are needed to avoid these types of errors.
For more information about properly setting up GCS buckets for Cloud ML please refer to the Cloud ML Docs
Normal file IO does not know how to deal with GCS gs:// paths correctly. You need:
from tensorflow.python.lib.io import file_io
first_data_file = args.train_files[0]
file_stream = file_io.FileIO(first_data_file, mode='r')
# run experiment
But ironically, you can copy files from the gs:// bucket to your root directory, which your programs CAN then actually see:
# presentation_mplstyle_path holds the gs:// URL of the source file
with file_io.FileIO(presentation_mplstyle_path, mode='r') as input_f:
    with file_io.FileIO('presentation.mplstyle', mode='w+') as output_f:
        output_f.write(input_f.read())
And finally, copying a file from your root directory back to a gs:// bucket:
with file_io.FileIO(report_name, mode='r') as input_f:
    with file_io.FileIO(job_dir + '/' + report_name, mode='w+') as output_f:
        output_f.write(input_f.read())
Should be easier IMO.
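The copy pattern in this answer can be wrapped in one helper. The sketch below is plain Python with a hypothetical name, copy_file, and takes the file-opening function as a parameter. The assumption is that passing file_io.FileIO (from tensorflow.python.lib.io) as opener makes gs:// paths work on either side, as in the snippets above, while the default built-in open only handles local paths.

```python
def copy_file(src, dst, opener=open, chunk_size=1024 * 1024):
    """Copy src to dst in chunks, opening both files with the given opener."""
    with opener(src, "rb") as input_f:
        with opener(dst, "wb") as output_f:
            while True:
                chunk = input_f.read(chunk_size)
                if not chunk:
                    break  # end of the source file
                output_f.write(chunk)
```

With TensorFlow available, a call such as copy_file(job_dir + '/' + report_name, report_name, opener=file_io.FileIO) would pull a report down from the bucket, and swapping the two paths would push it back.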