How to write to Google Cloud Storage with PySpark writeStream?

I tried to write to GCP storage from a PySpark stream.
This is the code:
df_test \
    .writeStream.format("parquet") \
    .option("path", "gs://{my_bucketname}/test") \
    .option("checkpointLocation", "gs://{my_checkpointBucket}/checkpoint") \
    .start() \
    .awaitTermination()
but I got this error:
20/11/15 16:37:59 WARN CheckpointFileManager: Could not use FileContext API
for managing Structured Streaming checkpoint files at gs://name-bucket/test/_spark_metadata.
Using FileSystem API instead for managing log files. If the implementation
of FileSystem.rename() is not atomic, then the correctness and fault-tolerance
of your Structured Streaming is not guaranteed.
Traceback (most recent call last):
File "testgcp.py", line 40, in <module>
.option("checkpointLocation", "gs://check_point_bucket/checkpoint")\
File "/home/naya/anaconda3/lib/python3.6/site-
packages/pyspark/sql/streaming.py", line 1105, in start
return self._sq(self._jwrite.start())
File "/home/naya/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py",
line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/naya/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py",
line 63, in deco
return f(*a, **kw)
File "/home/naya/anaconda3/lib/python3.6/site-packages/py4j/protocol.py",
line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o55.start.
: java.io.IOException: No FileSystem for scheme: gs
What should be the right syntax?

As per this documentation, it seems that you first need to set the Spark configuration with the proper authentication:
spark.conf.set("google.cloud.auth.service.account.enable", "true")
spark.conf.set("google.cloud.auth.service.account.email", "Your_service_email")
spark.conf.set("google.cloud.auth.service.account.keyfile", "path/to/your/files")
Then you can access the files in your bucket using the read functions:
df = spark.read.option("header",True).csv("gs://bucket_name/path_to_your_file.csv")
df.show()
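Putting the two pieces together, here is a minimal sketch of the streaming write with that authentication applied. The connector package coordinates and version, the bucket names and the service-account values are placeholders; the underlying point is that the gs:// scheme is provided by the Cloud Storage connector, so the No FileSystem for scheme: gs error usually means that jar is not on Spark's classpath.
from pyspark.sql import SparkSession

# Minimal sketch, assuming the GCS connector jar is available to Spark.
# The package coordinates below are illustrative; pick the version that
# matches your Spark/Hadoop build (or ship the jar with --jars instead).
spark = (
    SparkSession.builder
    .appName("gcs-writestream-sketch")
    .config("spark.jars.packages",
            "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.5")
    .getOrCreate()
)

# Authentication settings from the answer above (placeholder values).
spark.conf.set("google.cloud.auth.service.account.enable", "true")
spark.conf.set("google.cloud.auth.service.account.email", "your-service-account@your-project.iam.gserviceaccount.com")
spark.conf.set("google.cloud.auth.service.account.keyfile", "/path/to/keyfile")

# df_test is assumed to be a streaming DataFrame created from this session.
query = (
    df_test.writeStream
    .format("parquet")
    .option("path", "gs://my_bucketname/test")
    .option("checkpointLocation", "gs://my_checkpointBucket/checkpoint")
    .start()
)
query.awaitTermination()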

Related

Airflow EmrCreateJobFlowOperator `label is invalid: emr-6.8.0` Error On Latest EMR Version

EMR released a new cluster version today
But when I attempt to upgrade to the latest released EMR version using the contributed EMR create job flow operator, I'm hitting:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1138, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/airflow/dags/plugins/operators/shippo_emr_operators.py", line 133, in execute
return super().execute(context)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/emr_create_job_flow.py", line 81, in execute
response = emr.create_job_flow(job_flow_overrides)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/emr.py", line 88, in create_job_flow
response = self.get_conn().run_job_flow(**config)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 676, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the RunJobFlow operation: The supplied release label is invalid: emr-6.8.0.
Looking at the EMR contribution code I don't see any hard-coded values, so I'm not sure why we're hitting this error at this point. Has the label format changed, and if so, where can I find the exact string?
EDIT: The plot thickens. If I run aws emr list-release-labels I get
NextToken: AAIAAdZ_6MGjAhReZYcOrXICLpYU98iQO_ZB3kCK65qEWRH9MrJLdi_r-alVGb1AZlnFg0vsdxRUzdBLt-SyQ3TznUBM8Ncu7n94pJVQykbWe_TapxBi2WpUkcZfRAcxYgcg6TwejeaxGKcbysA89Jc9M3vIlVQetGgY1zQESS2Dq3P9vxvsOo3xxZoTqnmOVjs24Hy1hPM8zfzoUfH7MMomXkqhU5MHZ0cG3Aee5F51LtNS0_NBge399SiDYwhz1W2RB2tAjDc=
ReleaseLabels:
- emr-6.7.0
- emr-6.6.0
- emr-6.5.0
- emr-6.4.0
Which indicates that the release label has been updated in the docs but not actually released to the tooling?
EMR releases new versions in a few regions first; you are probably trying to launch the cluster in a region where the new release is not available yet.
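As a quick check, a small sketch like the following lists the release labels EMR reports for a specific region, so you can confirm whether emr-6.8.0 has reached the region you launch in. The region names below are examples only, not a recommendation.
import boto3

def release_labels(region_name):
    # Return the EMR release labels reported for the given region.
    emr = boto3.client('emr', region_name=region_name)
    labels, token = [], None
    while True:
        kwargs = {'NextToken': token} if token else {}
        resp = emr.list_release_labels(**kwargs)
        labels.extend(resp.get('ReleaseLabels', []))
        token = resp.get('NextToken')
        if not token:
            return labels

# Example regions only; substitute the region your cluster launches in.
for region in ('us-east-1', 'eu-west-1'):
    print(region, 'emr-6.8.0' in release_labels(region))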

Parsing multipage tables into CSV files with AWS Textract

I'm a total AWS newbie trying to parse tables from multi-page files into CSV files with AWS Textract.
I tried using AWS's example on this page; however, when dealing with a multi-page file, the call response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES']) breaks, since those cases require asynchronous processing, as you can see in the documentation here. The correct function to call would be client.start_document_analysis, and after starting it you retrieve the result using client.get_document_analysis(JobId).
So I adapted their example to use this logic instead of the client.analyze_document function; the adapted piece of code looks like this:
client = boto3.client('textract')
response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
jobid = response['JobId']
jobstatus = "IN_PROGRESS"
while jobstatus == "IN_PROGRESS":
    response = client.get_document_analysis(JobId=jobid)
    jobstatus = response['JobStatus']
    if jobstatus == "IN_PROGRESS":
        print("IN_PROGRESS")
        time.sleep(5)
But when I run that I get the following error:
Traceback (most recent call last):
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 125, in <module>
main(file_name)
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 112, in main
table_csv = get_table_csv_results(file_name)
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 62, in get_table_csv_results
response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 316, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 608, in _make_api_call
api_params, operation_model, context=request_context)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 656, in _convert_to_request_dict
api_params, operation_model)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/validate.py", line 297, in serialize_to_request
raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Missing required parameter in input: "DocumentLocation"
Unknown parameter in input: "Document", must be one of: DocumentLocation, FeatureTypes, ClientRequestToken, JobTag, NotificationChannel
And that happens because the standard way to call start_document_analysis is with an S3 file, using this sort of syntax:
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])
However, if I do that, I will break the command-line logic proposed in the AWS example:
python textract_python_table_parser.py file.pdf
The question is: how do I adapt the AWS example to be able to process multi-page files?
Consider using two different Lambdas: one to call Textract and one to process the result.
Please read this document
https://aws.amazon.com/blogs/compute/getting-started-with-rpa-using-aws-step-functions-and-amazon-textract/
And check this repository
https://github.com/aws-samples/aws-step-functions-rpa
To process the JSON, you can use this sample as a reference:
https://github.com/aws-samples/amazon-textract-response-parser
or use it directly as a library:
python -m pip install amazon-textract-response-parser
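For completeness, here is a minimal sketch of the upload-then-poll flow described in the question, which keeps the single-file command-line usage. The staging bucket name is a placeholder, and pagination of large results via NextToken is omitted.
import os
import time

import boto3

def analyze_multipage_pdf(file_name, bucket):
    # Upload the local PDF to S3, since start_document_analysis only
    # accepts a DocumentLocation, then run the asynchronous analysis.
    key = os.path.basename(file_name)
    boto3.client('s3').upload_file(file_name, bucket, key)

    textract = boto3.client('textract')
    job = textract.start_document_analysis(
        DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': key}},
        FeatureTypes=['TABLES'])

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        response = textract.get_document_analysis(JobId=job['JobId'])
        if response['JobStatus'] != 'IN_PROGRESS':
            return response  # large results are paginated via NextToken
        time.sleep(5)

# Keeps the original command-line shape, with the staging bucket supplied
# separately ('my-textract-staging-bucket' is a placeholder):
#   result = analyze_multipage_pdf('file.pdf', 'my-textract-staging-bucket')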

Glue Boto Client -- NoCredentialsError

I've been running my Glue Jobs on a schedule for a few months. Last night my Glue Job failed due to botocore.exceptions.NoCredentialsError: Unable to locate credentials after calling bucket.objects.filter(Prefix=productionDirectory):
I am under the impression this is a result of not having defined a credentials file, but AWS Glue has always pulled credentials without issue. I just re-ran my job and everything worked perfectly. For reference, I define my Glue Client via: glue = boto3.client('glue'). Has anyone ever experienced this before? Is this just an edge-case?
Full Logs:
Traceback (most recent call last):
File "/tmp/data-deployment", line 67, in <module>
for obj in bucket.objects.filter(Prefix=productionDirectory):
File "/home/spark/.local/lib/python3.7/site-packages/boto3/resources/collection.py", line 83, in __iter__
for page in self.pages():
File "/home/spark/.local/lib/python3.7/site-packages/boto3/resources/collection.py", line 166, in pages
for page in pages:
File "/home/spark/.local/lib/python3.7/site-packages/botocore/paginate.py", line 255, in __iter__
response = self._make_request(current_kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/paginate.py", line 332, in _make_request
return self._method(**current_kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 613, in _make_api_call
operation_model, request_dict, request_context)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 632, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/endpoint.py", line 102, in make_request
return self._send_request(request_dict, operation_model)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/endpoint.py", line 132, in _send_request
request = self.create_request(request_dict, operation_model)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/endpoint.py", line 116, in create_request
operation_name=operation_model.name)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/hooks.py", line 356, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/hooks.py", line 228, in emit
return self._emit(event_name, kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/hooks.py", line 211, in _emit
response = handler(**kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/signers.py", line 90, in handler
return self.sign(operation_name, request)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/signers.py", line 160, in sign
auth.add_auth(request)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
Edit/Update: This is a known bug. I've posted the mitigation strategy provided by AWS as an answer below.
Update: I reached out to AWS via Support and they responded. Apparently this is a known bug and issue. While they do not have a solution or ETA for solution, they do have a way to mitigate the issue. Information below:
Thank you for reporting your issue to us; the product team is aware of this intermittent issue.
They are working on a resolution; however, I do not have an ETA.
To mitigate this issue, increase the timeout and number of attempts for the metadata service request in your code:
# Increase metadata service timeout and number of attempts
import os
os.environ['AWS_METADATA_SERVICE_NUM_ATTEMPTS'] = "5"
os.environ['AWS_METADATA_SERVICE_TIMEOUT'] = "30"
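For reference, a sketch of where that mitigation sits in a Glue script: the environment variables should be set before the first boto3 client or resource is created. The bucket name and prefix below are placeholders standing in for the original script's values.
import os

# Apply the mitigation before any boto3 session, client or resource exists.
os.environ['AWS_METADATA_SERVICE_NUM_ATTEMPTS'] = "5"
os.environ['AWS_METADATA_SERVICE_TIMEOUT'] = "30"

import boto3

glue = boto3.client('glue')
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-production-bucket')                # placeholder bucket name
for obj in bucket.objects.filter(Prefix='production/'):   # placeholder prefix
    print(obj.key)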
I faced a similar issue with Glue, but not exactly the same.
We used external tables with SparkSQL and S3, and sometimes an exception was raised out of nowhere, e.g. Table not found. The issue was never reproduced in testing and occurred only rarely. Since our jobs ran perfectly fine on retries, we enabled the retry mechanism to work around it.
It has something to do with the internal workings of Glue and its serverless environment.

'No such file or directory' error after submitting a training job

I execute:
gcloud beta ml jobs submit training ${JOB_NAME} --config config.yaml
and after about 5 minutes the job errors out with this error:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main "__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 232, in <module> tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 228, in main run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 129, in run_training data_sets = input_data.read_data_sets(FLAGS.train_dir, FLAGS.fake_data)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 212, in read_data_sets with open(local_file, 'rb') as f: IOError: [Errno 2] No such file or directory: 'gs://my-bucket/mnist/train/train-images.gz'
The strange thing is that, as far as I can tell, the file exists at that URL.
This error usually indicates you are using a multi-region GCS bucket for your output. To avoid this error you should use a regional GCS bucket. Regional buckets provide stronger consistency guarantees which are needed to avoid these types of errors.
For more information about properly setting up GCS buckets for Cloud ML, please refer to the Cloud ML docs.
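For illustration, one way to create such a regional bucket from Python is sketched below using the google-cloud-storage client; the bucket name and region are placeholders, and gsutil mb -l <region> achieves the same from the command line.
from google.cloud import storage

# Sketch: create a bucket pinned to a single region instead of multi-region.
client = storage.Client()
client.create_bucket('my-cloudml-bucket', location='us-central1')  # placeholder names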
Normal file I/O does not know how to deal with GCS gs:// paths correctly. You need TensorFlow's file_io module instead:
from tensorflow.python.lib.io import file_io

first_data_file = args.train_files[0]
# file_io.FileIO understands gs:// paths
file_stream = file_io.FileIO(first_data_file, mode='r')
# run experiment
model.run_experiment(file_stream)
But ironically, you can also copy files from the gs:// bucket to your local working directory, which your program can then actually see:
# presentation_mplstyle_path is a gs:// URI string
with file_io.FileIO(presentation_mplstyle_path, mode='r') as input_f:
    with file_io.FileIO('presentation.mplstyle', mode='w+') as output_f:
        output_f.write(input_f.read())
mpl.pyplot.style.use(['./presentation.mplstyle'])
And finally, copying a file from your local directory back to a gs:// bucket:
with file_io.FileIO(report_name, mode='r') as input_f:
    with file_io.FileIO(job_dir + '/' + report_name, mode='w+') as output_f:
        output_f.write(input_f.read())
Should be easier IMO.

TypeError when using botocore to read from AWS SQS queue

I'm using a Tornado server with tornado-botocore to connect to Amazon SQS services.
When running stress tests we sometimes get the following exception:
Traceback (most recent call last):
File "/home/app/handlers/WebSocketsHandler.py", line 95, in listen_outgoing_queue
message = yield tornado.gen.Task(self.outgoing_queue.read)
File "/home/local/lib/python2.7/site-packages/tornado/gen.py", line 870, in run
value = future.result()
File "/home/local/lib/python2.7/site-packages/tornado/concurrent.py", line 215, in result
raise_exc_info(self._exc_info)
File "/home/local/lib/python2.7/site-packages/tornado/stack_context.py", line 314, in wrapped
ret = fn(*args, **kwargs)
File "/home/local/lib/python2.7/site-packages/tornado_botocore/base.py", line 70, in prepare_response
response_dict, operation_model.output_shape)
File "/home/local/lib/python2.7/site-packages/botocore/parsers.py", line 155, in parse
return self._do_error_parse(response, shape)
File "/home/.env/local/lib/python2.7/site-packages/botocore/parsers.py", line 314, in _do_error_parse
root = self._parse_xml_string_to_dom(xml_contents)
File "/home/local/lib/python2.7/site-packages/botocore/parsers.py", line 274, in _parse_xml_string_to_dom
parser.feed(xml_string)
TypeError: must be string or read-only buffer, not None
Could it be caused by the concurrency?
Has anyone encountered such behavior?
We are using tornado 4.2.1, botocore 0.65.0 and tornado-botocore 0.1.6.
Problem solved once I removed the @tornado.gen.engine decorator from the method.
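The answer above removes @tornado.gen.engine; an alternative commonly used in Tornado 4.x is @tornado.gen.coroutine. A hedged sketch of the method in that style is below (not the original author's code; the identifiers come from the traceback above, and the tornado-botocore queue object is assumed).
import tornado.gen

# Sketch only: the accepted fix was simply removing @tornado.gen.engine.
# @tornado.gen.coroutine, shown here, is the usual Tornado 4.x style for a
# method that yields tornado.gen.Task, included purely for illustration.
@tornado.gen.coroutine
def listen_outgoing_queue(self):
    # self.outgoing_queue is the tornado-botocore SQS wrapper from the traceback
    message = yield tornado.gen.Task(self.outgoing_queue.read)
    raise tornado.gen.Return(message)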