Error while listing objects using boto3: botocore.parsers.ResponseParserError

Hi, I am using boto3 to pull data from S3.
result = s3.list_objects_v2(
    Bucket=bucket,
    Prefix='1/abc/2/cde',
)
I am trying to list all the folder names after the "Prefix".
I am getting the following error:
File "/usr/local/lib/python3.7/site-packages/botocore/", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/", line 648, in _make_api_call
operation_model, request_dict, request_context)
File "/usr/local/lib/python3.7/site-packages/botocore/", line 667, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/usr/local/lib/python3.7/site-packages/botocore/", line 102, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/local/lib/python3.7/site-packages/botocore/", line 135, in _send_request
request, operation_model, context)
File "/usr/local/lib/python3.7/site-packages/botocore/", line 167, in _get_response
request, operation_model)
File "/usr/local/lib/python3.7/site-packages/botocore/", line 218, in _do_get_response
response_dict, operation_model.output_shape)
File "/usr/local/lib/python3.7/site-packages/botocore/", line 242, in parse
parsed = self._do_parse(response, shape)
File "/usr/local/lib/python3.7/site-packages/botocore/", line 775, in _do_parse
self._parse_payload(response, shape, member_shapes, final_parsed)
File "/usr/local/lib/python3.7/site-packages/botocore/", line 811, in _parse_payload
original_parsed = self._initial_body_parse(response['body'])
File "/usr/local/lib/python3.7/site-packages/botocore/", line 897, in _initial_body_parse
return self._parse_xml_string_to_dom(xml_string)
File "/usr/local/lib/python3.7/site-packages/botocore/", line 437, in _parse_xml_string_to_dom
"invalid XML received:\n%s" % (e, xml_string))
botocore.parsers.ResponseParserError: Unable to parse response (not well-formed (invalid token): line 1, column 0), invalid XML received:
Is this because it is not able to parse "etag" ? Please help!

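In your case, the response body is JSON, but boto3 tries to parse it as XML. You need to explicitly ask ListObjects to reply with application/xml. One way that works for me is to add an Accept header before each call:
def add_xml_header(params, **kwargs):
    # Ask the endpoint to answer in XML, which is what the botocore parser expects
    params['headers']['Accept'] = 'application/xml'
# Register the handler on the client's event system
# (for list_objects_v2 the event name is 'before-call.s3.ListObjectsV2')
s3.meta.events.register('before-call.s3.ListObjects', add_xml_header)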
Another common cause is described in the list_objects documentation:
EncodingType (string) -- Requests Amazon S3 to encode the object keys in the response and specifies the encoding method to use. An object key may contain any Unicode character; however, XML 1.0 parser cannot parse some characters, such as characters with an ASCII value from 0 to 10. For characters that are not supported in XML 1.0, you can add this parameter to request that Amazon S3 encode the keys in the response.
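In your case, check whether any object keys under your prefix contain such characters (ASCII values 0 to 10); if they do, passing EncodingType='url' asks S3 to encode them in the response.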
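To list the "folder" names under a prefix, one option is to pass Delimiter='/' and read CommonPrefixes, with EncodingType='url' so that unusual characters survive XML parsing. The following is a minimal sketch, not a definitive implementation: the function name list_folders is hypothetical, the client is assumed to be a boto3 S3 client created elsewhere, and the decoding step assumes that prefixes come back URL-encoded when EncodingType is set explicitly.

```python
from urllib.parse import unquote_plus


def list_folders(s3_client, bucket, prefix):
    """Return the immediate "folder" names found directly under `prefix`.

    `s3_client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    if not prefix.endswith("/"):
        prefix += "/"
    folders = []
    # The paginator follows continuation tokens, so more than 1000 entries work too.
    paginator = s3_client.get_paginator("list_objects_v2")
    pages = paginator.paginate(
        Bucket=bucket,
        Prefix=prefix,
        Delimiter="/",       # group keys by the next "/" after the prefix
        EncodingType="url",  # ask S3 to URL-encode keys in the XML response
    )
    for page in pages:
        for entry in page.get("CommonPrefixes", []):
            # Assumption: explicitly requested URL encoding must be undone here.
            full_prefix = unquote_plus(entry["Prefix"])
            folders.append(full_prefix[len(prefix):].rstrip("/"))
    return folders
```

Called as list_folders(s3, bucket, '1/abc/2/cde'), this would return only the names one level below the prefix, not every object key.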

Related command prompt lacking credentials

I'm trying to put to work AWS's Textract export table suggestion in this link
I'm a complete newbie in AWS's solutions and in command prompt so I'm trying to do exactly as they suggest. I'm running that in python so I'm using this piece of code:
import os
k=os.system("python my_pdf_file_path.pdf")
The code runs and prints "Image loaded my_pdf_file_path.pdf", but at some point it fails on credentials:
Traceback (most recent call last):
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/", line 108, in <module>
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/", line 94, in main
table_csv = get_table_csv_results(file_name)
File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/", line 53, in get_table_csv_results
response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 316, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 622, in _make_api_call
operation_model, request_dict, request_context)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 641, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 102, in make_request
return self._send_request(request_dict, operation_model)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 132, in _send_request
request = self.create_request(request_dict, operation_model)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 116, in create_request
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 356, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 228, in emit
return self._emit(event_name, kwargs)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 211, in _emit
response = handler(**kwargs)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 90, in handler
return self.sign(operation_name, request)
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 160, in sign
File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/", line 357, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
I'm aware I didn't pass any credentials and that's natural to happen but where should I pass it and what would be the right syntax for that using python os? Amazon's example doesn't say anything about that.
It depends on where you run your code. For example:
local computer - run the aws configure CLI command to set your credentials
EC2 instance - use an instance role
Lambda function - use the Lambda execution role
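Since the question launches the script from Python, another option on a local computer is to pass a named profile to the child process through environment variables. This is only a sketch under stated assumptions: the helper names env_with_aws_profile and run_script are hypothetical, and the profile is assumed to already exist (for example, created with aws configure). AWS_PROFILE and AWS_DEFAULT_REGION are environment variables that boto3 reads.

```python
import os
import subprocess
import sys


def env_with_aws_profile(profile, region=None, base_env=None):
    """Copy the environment and select an AWS profile (and optionally a region)."""
    env = dict(os.environ if base_env is None else base_env)
    env["AWS_PROFILE"] = profile
    if region:
        env["AWS_DEFAULT_REGION"] = region
    return env


def run_script(script_path, pdf_path, profile="default", region=None):
    """Run `script_path` on `pdf_path` with the chosen profile; return the exit code."""
    completed = subprocess.run(
        [sys.executable, script_path, pdf_path],
        env=env_with_aws_profile(profile, region),
        check=False,
    )
    return completed.returncode
```

Using subprocess.run with an explicit env instead of os.system keeps the credentials selection visible at the call site and avoids shell quoting issues with file paths.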

AWS chalice error. How do I properly put my credentials in?

Hi, I'm trying to access a serverless API. I got as far as creating a virtual environment, activating it, and putting my credentials in. But when I try to deploy with AWS Chalice, this is what I get:
Creating deployment package.
Traceback (most recent call last):
File "c:\users\jerom\desktop\venv\lib\site-packages\chalice\cli\", line 599, in main
return cli(obj={})
File "c:\users\jerom\desktop\venv\lib\site-packages\click\", line 829, in __call__
return self.main(*args, **kwargs)
File "c:\users\jerom\desktop\venv\lib\site-packages\click\", line 782, in main
rv = self.invoke(ctx)
File "c:\users\jerom\desktop\venv\lib\site-packages\click\", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "c:\users\jerom\desktop\venv\lib\site-packages\click\", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "c:\users\jerom\desktop\venv\lib\site-packages\click\", line 610, in invoke
return callback(*args, **kwargs)
File "c:\users\jerom\desktop\venv\lib\site-packages\click\", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "c:\users\jerom\desktop\venv\lib\site-packages\chalice\cli\", line 206, in deploy
deployed_values = d.deploy(config, chalice_stage_name=stage)
File "c:\users\jerom\desktop\venv\lib\site-packages\chalice\deploy\", line 353, in deploy
return self._deploy(config, chalice_stage_name)
File "c:\users\jerom\desktop\venv\lib\site-packages\chalice\deploy\", line 364, in _deploy
plan = self._plan_stage.execute(resources)
File "c:\users\jerom\desktop\venv\lib\site-packages\chalice\deploy\", line 139, in execute
result = handler(resource)
File "c:\users\jerom\desktop\venv\lib\site-packages\chalice\deploy\", line 195, in _plan_lambdafunction
if not self._remote_state.resource_exists(resource):
File "c:\users\jerom\desktop\venv\lib\site-packages\chalice\deploy\", line 61, in resource_exists
result = handler(resource)
File "c:\users\jerom\desktop\venv\lib\site-packages\chalice\deploy\", line 94, in _resource_exists_lambdafunction
return self._client.lambda_function_exists(resource.function_name)
File "c:\users\jerom\desktop\venv\lib\site-packages\chalice\", line 103, in lambda_function_exists
client = self._client('lambda')
File "c:\users\jerom\desktop\venv\lib\site-packages\chalice\", line 708, in _client
self._client_cache[service_name] = self._session.create_client(
File "c:\users\jerom\desktop\venv\lib\site-packages\botocore\", line 831, in create_client
client = client_creator.create_client(
File "c:\users\jerom\desktop\venv\lib\site-packages\botocore\", line 83, in create_client
client_args = self._get_client_args(
File "c:\users\jerom\desktop\venv\lib\site-packages\botocore\", line 285, in _get_client_args
return args_creator.get_client_args(
File "c:\users\jerom\desktop\venv\lib\site-packages\botocore\", line 99, in get_client_args
endpoint = endpoint_creator.create_endpoint(
File "c:\users\jerom\desktop\venv\lib\site-packages\botocore\", line 286, in create_endpoint
raise ValueError("Invalid endpoint: %s" % endpoint_url)
ValueError: Invalid endpoint: https://lambda.New
Does anyone have any idea how to solve this?
It would appear that you provided an invalid value for "Region" when storing your credentials.
The region name forms part of the URL when connecting to AWS services, which is why your code is trying to access https://lambda.New (New Jersey is not a valid Region).
To fix:
Use the AWS CLI aws configure command to update your credentials.
In the Region field, provide a region code from the list of AWS Endpoints, such as us-west-2 or eu-west-2.
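To illustrate why a place name breaks the endpoint, here is a small sketch of how a region code ends up in the host name. The helper names are hypothetical, the regular expression is only a rough heuristic for the shape of region codes (it does not check that the region exists), and the URL pattern shown is the one for the standard commercial partition.

```python
import re

# Heuristic shape of a region code: "us-west-2", "eu-west-2", "us-gov-west-1", ...
_REGION_CODE = re.compile(r"[a-z]{2}(-[a-z]+)+-\d+")


def looks_like_region_code(value):
    """True if `value` has the general shape of an AWS region code."""
    return _REGION_CODE.fullmatch(value) is not None


def lambda_endpoint(region):
    """Build the Lambda endpoint URL for a region code, rejecting malformed input."""
    if not looks_like_region_code(region):
        raise ValueError("not a region code: %r" % region)
    return "https://lambda.%s.amazonaws.com" % region
```

With this check, lambda_endpoint("us-west-2") returns a usable URL, while lambda_endpoint("New Jersey") fails immediately with a readable message instead of a truncated host such as https://lambda.New.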

Processing many WARC archives from CommonCrawl using Hadoop Streaming and MapReduce

I am working on a project in which I need to download crawl data (from CommonCrawl) for specific URLs from an S3 container and then process that data.
Currently I have a MapReduce job (Python via Hadoop Streaming) which gets the correct S3 file paths for a list of URLs. Then I am trying to use a second MapReduce job to process this output by downloading the data from the commoncrawl S3 bucket. In the mapper I am using boto3 to download the gzip contents for a specific URL from the commoncrawl S3 bucket and then output some information about the gzip contents (word counter information, content length, URLs linked to, etc.). The reducer then goes through this output to get the final word count, URL list, etc.
The output file from the first MapReduce job is only about 6 MB in size (but will be larger once we scale to the full dataset). When I run the second MapReduce job, this file is split into only two parts. Normally this is not a problem for such a small file, but the mapper code I described above (fetching S3 data, emitting mapped output, etc.) takes a while to run for each URL. Since the file is split into only two parts, only 2 mappers run. I need to increase the number of splits so that the mapping can be done faster.
I have tried setting "mapreduce.input.fileinputformat.split.maxsize" and "mapreduce.input.fileinputformat.split.minsize" for the MapReduce job, but it doesn't change the number of splits taking place.
Here is some of the code from the mapper:
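import gzip
import io
import boto3
from botocore import UNSIGNED
from botocore.client import Config
# Anonymous (unsigned) S3 client
s3 = boto3.client('s3', 'us-west-2', config=Config(signature_version=UNSIGNED))
offset_end = offset + length - 1
# Fetch only the byte range that holds this record
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename, Range='bytes=%s-%s' % (offset, offset_end))['Body'].read()
fileobj = io.BytesIO(gz_file)
with gzip.open(fileobj, 'rb') as file:
    [do stuff]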
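The two pure-Python steps of that mapper, building the Range header and decompressing the fetched bytes, can be checked without touching S3. A minimal sketch, with hypothetical helper names byte_range and read_gzip_member:

```python
import gzip
import io


def byte_range(offset, length):
    """HTTP Range header value covering `length` bytes starting at `offset`."""
    # Range ends are inclusive, hence the "- 1".
    return "bytes=%d-%d" % (offset, offset + length - 1)


def read_gzip_member(raw_bytes):
    """Decompress one gzip member that was fetched as raw bytes."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes), mode="rb") as handle:
        return handle.read()
```

Testing these locally helps separate logic errors from network problems such as the ConnectionError discussed here.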
I also manually split the input file up into multiple files with a maximum of 100 lines. This had the desired effect of giving me more mappers, but then I began encountering a ConnectionError from the s3client.get_object() call:
Traceback (most recent call last):
File "", line 103, in <module>
commoncrawl_reader(base_url, full_url, offset, length, warc_file)
File "", line 14, in commoncrawl_reader
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename, Range='bytes=%s-%s' % (offset, offset_end))[
File "/usr/lib/python3.6/site-packages/botocore/", line 314, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/", line 599, in _make_api_call
operation_model, request_dict)
File "/usr/lib/python3.6/site-packages/botocore/", line 148, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/lib/python3.6/site-packages/botocore/", line 177, in _send_request
success_response, exception):
File "/usr/lib/python3.6/site-packages/botocore/", line 273, in _needs_retry
caught_exception=caught_exception, request_dict=request_dict)
File "/usr/lib/python3.6/site-packages/botocore/", line 227, in emit
return self._emit(event_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/", line 210, in _emit
response = handler(**kwargs)
File "/usr/lib/python3.6/site-packages/botocore/", line 183, in __call__
if self._checker(attempts, response, caught_exception):
File "/usr/lib/python3.6/site-packages/botocore/", line 251, in __call__
File "/usr/lib/python3.6/site-packages/botocore/", line 277, in _should_retry
return self._checker(attempt_number, response, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/", line 317, in __call__
File "/usr/lib/python3.6/site-packages/botocore/", line 223, in __call__
attempt_number, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/", line 359, in _check_caught_exception
raise caught_exception
File "/usr/lib/python3.6/site-packages/botocore/", line 222, in _get_response
proxies=self.proxies, timeout=self.timeout)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/", line 573, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/", line 415, in send
raise ConnectionError(err, request=request)
botocore.vendored.requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
I am currently running this with only a handful of URLs, but I will need to do it with several thousand (each with many subdirectories) once I get it working.
I am not certain where to start with fixing this. I feel that it is highly likely there is a better approach than what I am trying. The fact that the mapper takes so long for each URL seems like a strong indication that I am approaching this wrong. I should also mention that the mapper and the reducer both run correctly if run directly as a pipe command:
"cat short_url_list.txt | python | sort | python" -> Produces desired output, but would take too long to run on the entire list of URLs.
Any guidance would be greatly appreciated.
The MapReduce API provides the NLineInputFormat. The property "mapreduce.input.lineinputformat.linespermap" controls the maximum number of lines (here, WARC records) passed to each mapper. It works with mrjob, cf. Ilya's WARC indexer.
Regarding the S3 connection error: it's better to run the job in the us-east-1 AWS region, where the data is located.
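To make the effect of the lines-per-map setting concrete, here is a plain-Python sketch of the same grouping: each chunk corresponds to the input one mapper would receive. The function name split_lines is hypothetical, and this only imitates the splitting behaviour; it is not how Hadoop implements it.

```python
from itertools import islice


def split_lines(lines, lines_per_map):
    """Yield consecutive chunks of at most `lines_per_map` lines."""
    if lines_per_map < 1:
        raise ValueError("lines_per_map must be at least 1")
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, lines_per_map))
        if not chunk:
            return
        yield chunk
```

For a 6 MB input with, say, 100 lines per chunk, the number of chunks (and so mappers) depends on the line count rather than on the file size, which is why the split.maxsize settings alone did not help.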

OneLogin SAML2 module throws an error `lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1`

I am trying to implement SAML2 SSO functionality using the OneLogin SAML2 module. There is a lot of information in the readme and also in the demo.
I have implemented most of it already and I am testing my ACS endpoint using the Samling tool.
I am able to receive the SAML response, but I am getting the error mentioned above at this point in my implementation.
The XML which I receive looks fine and its first character is <. I do not understand where the problem lies. Please help.
Here is the complete Traceback:
Internal Server Error: /auth/sso/saml2/
Traceback (most recent call last):
File "/usr/local/lib/python3.4/site-packages/django/core/handlers/", line 140, in get_response
response = middleware_method(request, callback, callback_args, callback_kwargs)
File "/usr/local/lib/python3.4/site-packages/debug_toolbar/", line 78, in process_view
response = panel.process_view(request, view_func, view_args, view_kwargs)
File "/usr/local/lib/python3.4/site-packages/debug_toolbar/panels/", line 151, in process_view
return self.profiler.runcall(view_func, *args, **view_kwargs)
File "/usr/local/lib/python3.4/", line 109, in runcall
return func(*args, **kw)
File "/code/authtoken/", line 63, in sso_handler
resp = do_saml2(request)
File "/code/authtoken/sso/saml2/", line 83, in do_saml2
File "/usr/local/lib/python3.4/site-packages/onelogin/saml2/", line 99, in process_response
response = OneLogin_Saml2_Response(self.__settings, self.__request_data['post_data']['SAMLResponse'])
File "/usr/local/lib/python3.4/site-packages/onelogin/saml2/", line 39, in __init__
self.document = OneLogin_Saml2_XML.to_etree(self.response)
File "/usr/local/lib/python3.4/site-packages/onelogin/saml2/", line 66, in to_etree
return OneLogin_Saml2_XML._parse_etree(xml)
File "/usr/local/lib/python3.4/site-packages/defusedxml/", line 143, in fromstring
rootelement = _etree.fromstring(text, parser, base_url=base_url)
File "src/lxml/lxml.etree.pyx", line 3228, in lxml.etree.fromstring (src/lxml/lxml.etree.c:79609)
File "src/lxml/parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:119128)
File "src/lxml/parser.pxi", line 1736, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:117808)
File "src/lxml/parser.pxi", line 1102, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:112052)
File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105896)
File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:107604)
File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:106458)
File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
[2017/07/10 14:07:58] HTTP POST /auth/sso/saml2/ 500 [0.28,]
I found it. The response XML (the SAMLResponse value in post_data) must be base64 encoded. Then it is processed correctly.
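A quick way to check a test payload before handing it to the library is sketched below. The helper names are hypothetical; the sketch only shows standard-library base64 handling and makes no claim about what the OneLogin module does internally.

```python
import base64
import binascii


def encode_saml_response(xml_text):
    """Base64-encode raw SAML XML so it can be used as the SAMLResponse value."""
    return base64.b64encode(xml_text.encode("utf-8")).decode("ascii")


def is_base64_xml(value):
    """True if `value` is valid base64 that decodes to something starting with '<'."""
    try:
        decoded = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        # Raw XML lands here: characters such as '<' are not in the base64 alphabet.
        return False
    return decoded.lstrip().startswith(b"<")
```

If is_base64_xml returns False for the incoming value, the payload is probably raw XML, which matches the "Start tag expected" error seen in the traceback.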

Getting error 'ascii' codec can't decode byte 0xc3 in position 149: ordinal not in range(128)' when rebuilding haystack index

I have an application where I have to store people's names and make them searchable. The technologies I am using are Python (v2.7.6), Django (v1.9.5) and REST framework. The DBMS is PostgreSQL (v9.2). Since the user names can be Arabic, we are using UTF-8 as the database encoding. For search we are using Haystack (v2.4.1) with Amazon Elasticsearch for indexing. The index was building fine a few days ago, but now when I try to rebuild it with
python rebuild_index
it fails with the following error
'ascii' codec can't decode byte 0xc3 in position 149: ordinal not in range(128)
The full error trace is
File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/", line 188, in handle_label
self.update_backend(label, using)
File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/", line 233, in update_backend
do_update(backend, index, qs, start, end, total, verbosity=self.verbosity, commit=self.commit)
File "/usr/local/lib/python2.7/dist-packages/haystack/management/commands/", line 96, in do_update
backend.update(index, current_qs, commit=commit)
File "/usr/local/lib/python2.7/dist-packages/haystack/backends/", line 193, in update
bulk(self.conn, prepped_docs, index=self.index_name, doc_type='modelresult')
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/", line 188, in bulk
for ok, item in streaming_bulk(client, actions, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/", line 160, in streaming_bulk
for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/", line 85, in _process_bulk_chunk
resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/", line 69, in _wrapped
return func(*args, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/client/", line 795, in bulk
doc_type, '_bulk'), params=params, body=self._bulk_body(body))
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/", line 329, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/", line 68, in perform_request
response = self.session.request(method, url, data=body, timeout=timeout or self.timeout)
File "/usr/lib/python2.7/dist-packages/requests/", line 455, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python2.7/dist-packages/requests/", line 558, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python2.7/dist-packages/requests/", line 330, in send
File "/usr/local/lib/python2.7/dist-packages/urllib3/", line 558, in urlopen
body=body, headers=headers)
File "/usr/local/lib/python2.7/dist-packages/urllib3/", line 353, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python2.7/", line 979, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python2.7/", line 1013, in _send_request
File "/usr/lib/python2.7/", line 975, in endheaders
File "/usr/lib/python2.7/", line 833, in _send_output
msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 149: ordinal not in range(128)
My guess is that before we didn't have Arabic characters in our database, so the index was building fine, but now that users have entered Arabic characters the index fails to build.
If you are using the requests-aws4auth package, then you can use the following wrapper class in place of the AWS4Auth class. It encodes the headers created by AWS4Auth into byte strings thus avoiding the UnicodeDecodeError downstream.
from requests_aws4auth import AWS4Auth
class AWS4AuthEncodingFix(AWS4Auth):
    def __call__(self, request):
        # Let AWS4Auth sign the request, then re-encode every header it produced
        request = super(AWS4AuthEncodingFix, self).__call__(request)
        for header_name in request.headers:
            self._encode_header_to_utf8(request, header_name)
        return request
    def _encode_header_to_utf8(self, request, header_name):
        # Python 2 only: convert unicode header names and values to UTF-8 byte strings
        value = request.headers[header_name]
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        if isinstance(header_name, unicode):
            del request.headers[header_name]
            header_name = header_name.encode('utf-8')
        request.headers[header_name] = value
I suspect you're correct about the arabic chars now showing up in the DB.
are also possibly related to this issue. The first link seems to have some kind of work around for it, but doesn't have a lot of detail. I suspect what the author meant with
The proper fix is to use unicode type instead of str or set the default encoding properly to (I assume) utf-8.
is that you need to check that the machine it's running on has LANG=en_US.UTF-8, or at least some UTF-8 LANG.
Elasticsearch supports different encodings, so having Arabic characters shouldn't be the problem.
Since you are using AWS, I will assume you also use an authorization library such as requests-aws4auth.
If that is the case, notice that during authorization some unicode headers are added, like u'x-amz-date'. That is a problem, since Python's httplib performs the following during _send_output(): msg = "\r\n".join(self._buffer), where _buffer is a list of the HTTP headers. Having unicode headers makes msg be of <type 'unicode'> while it really should be of type str (here is a similar issue with a different auth library).
The line that raises the exception, msg += message_body, raises it because Python needs to decode message_body to unicode so that it matches the type of msg. py-elasticsearch has already encoded the body to bytes, so Python implicitly decodes those bytes with the default ascii codec, which fails on the first non-ASCII byte (as explained here).
You may want to try replacing the auth library (for example with DavidMuller/aws-requests-auth) and see if it fixes the problem.
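The implicit decode that Python 2 performs in msg += message_body can be reproduced explicitly. The sketch below uses Python 3 syntax purely for illustration (the original stack is Python 2.7); the helper name ascii_decode_error is hypothetical.

```python
def ascii_decode_error(payload):
    """Return the error message from decoding `payload` as ASCII, or None if it succeeds."""
    try:
        payload.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# A body that was already encoded to UTF-8 bytes: "é" becomes b'\xc3\xa9'.
body = "é".encode("utf-8")
print(ascii_decode_error(body))
# prints: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
```

Keeping every header as a byte string, as the wrapper class does, means msg stays a byte string and this decode is never attempted.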