Parsing multipage tables into CSV files with AWS Textract - amazon-web-services

I'm a total AWS newbie trying to parse the tables of multi-page files into CSV files with AWS Textract.
I tried using AWS's example on this page. However, when dealing with a multi-page file, the call response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES']) breaks, because multi-page documents require asynchronous processing, as you can see in the documentation here. The correct function to call is client.start_document_analysis, and after it runs you retrieve the result with client.get_document_analysis(JobId).
So I adapted their example to use this logic instead of the client.analyze_document function; the adapted piece of code looks like this:
client = boto3.client('textract')
response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
jobid = response['JobId']
jobstatus = "IN_PROGRESS"
while jobstatus == "IN_PROGRESS":
    response = client.get_document_analysis(JobId=jobid)
    jobstatus = response['JobStatus']
    if jobstatus == "IN_PROGRESS":
        print("IN_PROGRESS")
    time.sleep(5)
But when I run that I get the following error:
Traceback (most recent call last):
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 125, in <module>
main(file_name)
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 112, in main
table_csv = get_table_csv_results(file_name)
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 62, in get_table_csv_results
response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 316, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 608, in _make_api_call
api_params, operation_model, context=request_context)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 656, in _convert_to_request_dict
api_params, operation_model)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/validate.py", line 297, in serialize_to_request
raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Missing required parameter in input: "DocumentLocation"
Unknown parameter in input: "Document", must be one of: DocumentLocation, FeatureTypes, ClientRequestToken, JobTag, NotificationChannel
That happens because the standard way to call start_document_analysis is with a document stored in S3, using this sort of syntax:
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])
However, if I do that I will break the command line logic proposed in the AWS example:
python textract_python_table_parser.py file.pdf.
The question is: how do I adapt AWS example to be able to process multipage files?

Consider using two different Lambda functions: one to call Textract and one to process the result.
Please read this document:
https://aws.amazon.com/blogs/compute/getting-started-with-rpa-using-aws-step-functions-and-amazon-textract/
And check this repository:
https://github.com/aws-samples/aws-step-functions-rpa
To process the JSON output you can use this sample as a reference:
https://github.com/aws-samples/amazon-textract-response-parser
or use it directly as a library:
python -m pip install amazon-textract-response-parser
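If you would rather keep the single-script python textract_python_table_parser.py file.pdf flow instead of Step Functions, one possible adaptation (a sketch, not part of the AWS sample; the bucket name is a placeholder you would replace with one you own) is to upload the PDF to S3, start the asynchronous job, poll until it finishes, and then page through the results:

import time
import boto3

def analyze_multipage_pdf(file_name, bucket='my-textract-bucket'):  # hypothetical bucket
    s3 = boto3.client('s3')
    textract = boto3.client('textract')

    # Asynchronous Textract calls only accept documents stored in S3, so upload first.
    s3.upload_file(Filename=file_name, Bucket=bucket, Key=file_name)

    job_id = textract.start_document_analysis(
        DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': file_name}},
        FeatureTypes=['TABLES'])['JobId']

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        response = textract.get_document_analysis(JobId=job_id)
        if response['JobStatus'] != 'IN_PROGRESS':
            break
        time.sleep(5)
    if response['JobStatus'] == 'FAILED':
        raise RuntimeError('Textract job %s failed' % job_id)

    # Results for multi-page documents are paginated; collect every page of blocks.
    blocks = response['Blocks']
    while 'NextToken' in response:
        response = textract.get_document_analysis(JobId=job_id, NextToken=response['NextToken'])
        blocks.extend(response['Blocks'])
    return blocks

The returned blocks can then be fed to the same table-to-CSV logic as in the AWS example, or parsed with amazon-textract-response-parser.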

Related

How to write to Google Cloud Storage with a PySpark writeStream?

I tried to write to Google Cloud Storage from a PySpark structured stream.
This is the code:
df_test\
.writeStream.format("parquet")\
.option("path","gs://{my_bucketname}/test")\
.option("checkpointLocation", "gs://{my_checkpointBucket}/checkpoint")\
.start()\
.awaitTermination()
but I got this error:
20/11/15 16:37:59 WARN CheckpointFileManager: Could not use FileContext API for managing Structured Streaming checkpoint files at gs://name-bucket/test/_spark_metadata.
Using FileSystem API instead for managing log files. If the implementation of FileSystem.rename() is not atomic, then the correctness and fault-tolerance of your Structured Streaming is not guaranteed.
Traceback (most recent call last):
File "testgcp.py", line 40, in <module>
.option("checkpointLocation", "gs://check_point_bucket/checkpoint")\
File "/home/naya/anaconda3/lib/python3.6/site-packages/pyspark/sql/streaming.py", line 1105, in start
return self._sq(self._jwrite.start())
File "/home/naya/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/naya/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/naya/anaconda3/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o55.start.
: java.io.IOException: No FileSystem for scheme: gs
What should be the right syntax?
As per this documentation, it seems that you first need to set spark.conf with the proper authentication.
spark.conf.set("google.cloud.auth.service.account.enable", "true")
spark.conf.set("google.cloud.auth.service.account.email", "Your_service_email")
spark.conf.set("google.cloud.auth.service.account.keyfile", "path/to/your/files")
Then you can access the file in your bucket using the read function.
df = spark.read.option("header",True).csv("gs://bucket_name/path_to_your_file.csv")
df.show()
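Applying the same configuration to the streaming write from the question might look roughly like the sketch below (assuming the GCS connector is already on the Spark classpath; the bucket names, service account email, and key file path are placeholders, and the built-in rate source just stands in for the real streaming DataFrame):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-stream-test").getOrCreate()

# Same authentication settings as suggested above (placeholder values).
spark.conf.set("google.cloud.auth.service.account.enable", "true")
spark.conf.set("google.cloud.auth.service.account.email", "my-service-account@my-project.iam.gserviceaccount.com")
spark.conf.set("google.cloud.auth.service.account.keyfile", "/path/to/keyfile")

# Stand-in streaming source; replace with the real df_test.
df_test = spark.readStream.format("rate").load()

(df_test.writeStream
    .format("parquet")
    .option("path", "gs://my_bucketname/test")
    .option("checkpointLocation", "gs://my_checkpointBucket/checkpoint")
    .start()
    .awaitTermination())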

Failed to query Elasticsearch using: TransportError(400, 'parsing_exception')

I have been trying to get Elasticsearch to work in a Django application. It has been a problem because of the mess of compatibility considerations this apparently involves. I followed the recommendations, but I still get an error when I actually perform a search.
Here is what I have
Django==2.1.7
Django-Haystack==2.5.1
Elasticsearch(django)==1.7.0
Elasticsearch(Linux app)==5.0.1
There are also DjangoCMS==3.7 and aldryn-search==1.0.1, but I am not sure how relevant those are.
Here is the error I get when I submit a search query via the basic text form.
GET /videos/modelresult/_search?_source=true [status:400 request:0.001s]
Failed to query Elasticsearch using '(video)': TransportError(400, 'parsing_exception')
Traceback (most recent call last):
File "/home/user-name/miniconda3/envs/project-web/lib/python3.7/site-packages/haystack/backends/elasticsearch_backend.py", line 524, in search
_source=True)
File "/home/user-name/miniconda3/envs/project-web/lib/python3.7/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
return func(*args, params=params, **kwargs)
File "/home/user-name/miniconda3/envs/project-web/lib/python3.7/site-packages/elasticsearch/client/__init__.py", line 527, in search
doc_type, '_search'), params=params, body=body)
File "/home/user-name/miniconda3/envs/project-web/lib/python3.7/site-packages/elasticsearch/transport.py", line 307, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/home/user-name/miniconda3/envs/project-web/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 93, in perform_request
self._raise_error(response.status, raw_data)
File "/home/user-name/miniconda3/envs/project-web/lib/python3.7/site-packages/elasticsearch/connection/base.py", line 105, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, 'parsing_exception')
Could someone tell me if this is an issue with compatibility or is there something else going on? How can I fix it?
The combination that appears to have worked for my setup is as follows. I believe the key was to drastically downgrade Elasticsearch.
Elasticsearch==1.7.6 (with Java 8)
Django==2.1.7
Django-Haystack==2.8.1
elasticsearch==1.7.0
The two items below may or may not be relevant. I have not changed them.
DjangoCMS==3.7.0
aldryn-search==1.0.1
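For reference, a typical Haystack connection block for this downgraded stack might look like the sketch below (the URL and index name are placeholders, not taken from the question):

# settings.py
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}

After changing backends or versions, rebuild the index with python manage.py rebuild_index so the index mapping matches the new Elasticsearch server.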

Processing many WARC archives from CommonCrawl using Hadoop Streaming and MapReduce

I am working on a project in which I need to download crawl data (from CommonCrawl) for specific URLs from an S3 container and then process that data.
Currently I have a MapReduce job (Python via Hadoop Streaming) which gets the correct S3 file paths for a list of URLs. Then I am trying to use a second MapReduce job to process this output by downloading the data from the commoncrawl S3 bucket. In the mapper I am using boto3 to download the gzip contents for a specific URL from the commoncrawl S3 bucket and then output some information about the gzip contents (word count information, content length, URLs linked to, etc.). The reducer then goes through this output to get the final word count, URL list, etc.
The output file from the first MapReduce job is only about 6mb in size (but will be larger once we scale to the full dataset). When I run the second MapReduce, this file is only split twice. Normally this is not a problem for such a small file, but the mapper code I described above (fetching S3 data, spitting out mapped output, etc.) takes a while to run for each URL. Since the file is only splitting twice, there are only 2 mappers being run. I need to increase the number of splits so that the mapping can be done faster.
I have tried setting "mapreduce.input.fileinputformat.split.maxsize" and "mapreduce.input.fileinputformat.split.minsize" for the MapReduce job, but it doesn't change the number of splits taking place.
Here is some of the code from the mapper:
import io
import gzip

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client('s3', 'us-west-2', config=Config(signature_version=UNSIGNED))
offset_end = offset + length - 1
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename,
                        Range='bytes=%s-%s' % (offset, offset_end))['Body'].read()
fileobj = io.BytesIO(gz_file)
with gzip.open(fileobj, 'rb') as file:
    [do stuff]
I also manually split the input file up into multiple files with a maximum of 100 lines. This had the desired effect of giving me more mappers, but then I began encountering a ConnectionError from the s3client.get_object() call:
Traceback (most recent call last):
File "dmapper.py", line 103, in <module>
commoncrawl_reader(base_url, full_url, offset, length, warc_file)
File "dmapper.py", line 14, in commoncrawl_reader
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename, Range='bytes=%s-%s' % (offset, offset_end))[
File "/usr/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/client.py", line 599, in _make_api_call
operation_model, request_dict)
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 148, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 177, in _send_request
success_response, exception):
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 273, in _needs_retry
caught_exception=caught_exception, request_dict=request_dict)
File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 227, in emit
return self._emit(event_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 210, in _emit
response = handler(**kwargs)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 183, in __call__
if self._checker(attempts, response, caught_exception):
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 251, in __call__
caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 277, in _should_retry
return self._checker(attempt_number, response, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 317, in __call__
caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 223, in __call__
attempt_number, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
raise caught_exception
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 222, in _get_response
proxies=self.proxies, timeout=self.timeout)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/adapters.py", line 415, in send
raise ConnectionError(err, request=request)
botocore.vendored.requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
I am currently running this with only a handful of URLs, but I will need to do it with several thousand (each with many subdirectories) once I get it working.
I am not certain where to start with fixing this. I feel it is highly likely that there is a better approach than what I am trying. The fact that the mapper seems to take so long for each URL seems like a strong indication that I am approaching this wrong. I should also mention that the mapper and the reducer both run correctly when run directly as a pipe command:
"cat short_url_list.txt | python mapper.py | sort | python reducer.py" -> Produces desired output, but would take too long to run on the entire list of URLs.
Any guidance would be greatly appreciated.
The MapReduce API provides the NLineInputFormat. The property "mapreduce.input.lineinputformat.linespermap" allows you to control how many lines (here WARC records) are passed to a mapper at most. It works with mrjob, cf. Ilya's WARC indexer.
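With mrjob, the input format and the per-mapper line limit can be set directly on the job class. A minimal sketch (the class name, the value of 50 lines per mapper, and the placeholder mapper/reducer bodies are illustrative, not taken from the question):

from mrjob.job import MRJob

class CommonCrawlFetchJob(MRJob):
    # Hand each mapper at most 50 lines of the S3 path/offset list so that
    # more mappers run in parallel (50 is an illustrative value).
    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.lib.NLineInputFormat'
    JOBCONF = {'mapreduce.input.lineinputformat.linespermap': 50}

    def mapper(self, _, line):
        # The real job would fetch and parse the WARC record for this line here.
        yield 'lines', 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    CommonCrawlFetchJob.run()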
Regarding the S3 connection error: it's better to run the job in the us-east-1 AWS region where the data is located.

AWS Polly sample example in Python?

This is the first time I am trying AWS services. I have to integrate AWS Polly with Asterisk for text-to-speech.
Here is the example code I wrote to convert text to speech:
from boto3 import client
import boto3
import StringIO
from contextlib import closing

polly = client("polly", 'us-east-1')
response = polly.synthesize_speech(
    Text="Good Morning. My Name is Rajesh. I am Testing Polly AWS Service For Voice Application.",
    OutputFormat="mp3",
    VoiceId="Raveena")
print(response)

if "AudioStream" in response:
    with closing(response["AudioStream"]) as stream:
        data = stream.read()
        fo = open("pollytest.mp3", "w+")
        fo.write(data)
        fo.close()
I am getting the following error.
Traceback (most recent call last):
File "pollytest.py", line 11, in <module>
VoiceId="Raveena")
File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 253, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python2.7/dist-packages/botocore/client.py", line 530, in _make_api_call
operation_model, request_dict)
File "/usr/local/lib/python2.7/dist-packages/botocore/endpoint.py", line 141, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/local/lib/python2.7/dist-packages/botocore/endpoint.py", line 166, in _send_request
request = self.create_request(request_dict, operation_model)
File "/usr/local/lib/python2.7/dist-packages/botocore/endpoint.py", line 150, in create_request
operation_name=operation_model.name)
File "/usr/local/lib/python2.7/dist-packages/botocore/hooks.py", line 227, in emit
return self._emit(event_name, kwargs)
File "/usr/local/lib/python2.7/dist-packages/botocore/hooks.py", line 210, in _emit
response = handler(**kwargs)
File "/usr/local/lib/python2.7/dist-packages/botocore/signers.py", line 90, in handler
return self.sign(operation_name, request)
File "/usr/local/lib/python2.7/dist-packages/botocore/signers.py", line 147, in sign
auth.add_auth(request)
File "/usr/local/lib/python2.7/dist-packages/botocore/auth.py", line 316, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
I want to provide credentials directly in this script so that I can use it in an Asterisk system application.
UPDATE:
I created a file ~/.aws/credentials with the content below:
[default]
aws_access_key_id=XXXXXXXX
aws_secret_access_key=YYYYYYYYYYY
Now it works fine for my current login user, but it is not working for the Asterisk PBX.
Your code runs perfectly fine for me!
The last line is saying:
botocore.exceptions.NoCredentialsError: Unable to locate credentials
So, it is unable to authenticate against AWS.
If you are running this code on an Amazon EC2 instance, the simplest method is to assign an IAM Role to the instance when it is launched (it can't be added later). This will automatically provide credentials that can be used by applications running on the instance -- no code changes required.
Alternatively, you could obtain an Access Key and Secret Key from IAM for your IAM User and store those credentials in a local file via the aws configure command.
It is bad practice to put credentials in source code, since they may become compromised.
See:
IAM Roles for Amazon EC2
Best Practices for Managing AWS Access Keys
Please note that the Asterisk PBX usually runs under the asterisk user.
So you have to put the authentication credentials for that user, not root.
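A minimal sketch of one way to let the script running under the asterisk user find those credentials (the file path below is an assumption; use wherever the asterisk user's home directory actually lives, and make sure the file is readable by that user):

import os
import boto3

# Hypothetical path to a credentials file the asterisk user can read.
os.environ.setdefault('AWS_SHARED_CREDENTIALS_FILE',
                      '/var/lib/asterisk/.aws/credentials')

session = boto3.Session(profile_name='default')
polly = session.client('polly', region_name='us-east-1')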

"xml.sax._exceptions.SAXReaderNotAvailable: No parsers found" when run in jenkins

So I'm working towards having automated staging deployments via Jenkins and Ansible. Part of this is using a script called ec2.py from ansible in order to dynamically retrieve a list of matching servers to deploy to.
When I SSH into the Jenkins server and run the script as the jenkins user, it works as expected. However, running the script from within Jenkins leads to the following error:
ERROR: Inventory script (ec2/ec2.py) had an execution error: Traceback (most recent call last):
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/ec2/ec2.py", line 1262, in <module>
Ec2Inventory()
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/ec2/ec2.py", line 159, in __init__
self.do_api_calls_update_cache()
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/ec2/ec2.py", line 386, in do_api_calls_update_cache
self.get_instances_by_region(region)
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/ec2/ec2.py", line 417, in get_instances_by_region
reservations.extend(conn.get_all_instances(filters = { filter_key : filter_values }))
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/.local/lib/python2.7/site-packages/boto/ec2/connection.py", line 585, in get_all_instances
max_results=max_results)
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/.local/lib/python2.7/site-packages/boto/ec2/connection.py", line 681, in get_all_reservations
[('item', Reservation)], verb='POST')
File "/opt/bitnami/apps/jenkins/jenkins_home/jobs/Deploy API/workspace/deploy/.local/lib/python2.7/site-packages/boto/connection.py", line 1181, in get_list
xml.sax.parseString(body, h)
File "/usr/lib/python2.7/xml/sax/__init__.py", line 43, in parseString
parser = make_parser()
File "/usr/lib/python2.7/xml/sax/__init__.py", line 93, in make_parser
raise SAXReaderNotAvailable("No parsers found", None)
xml.sax._exceptions.SAXReaderNotAvailable: No parsers found
I don't know too much about python, so I'm not sure how to debug this issue further.
So it turns out the issue had to do with Jenkins overwriting the default LD_LIBRARY_PATH variable. By unsetting that variable before running Python, I was able to make the script work!
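For illustration, a minimal sketch of one way to do that unsetting from inside the inventory script itself (an assumption, not the original fix, which could equally be applied in the Jenkins job configuration). Because the dynamic loader only reads LD_LIBRARY_PATH at process start, the script re-executes itself once with the variable removed:

import os
import sys

# If Jenkins injected an LD_LIBRARY_PATH, drop it and re-exec this script so
# the C extension behind xml.sax (pyexpat) can load its proper libraries.
if os.environ.pop('LD_LIBRARY_PATH', None) is not None:
    os.execv(sys.executable, [sys.executable] + sys.argv)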