boto.connect_s3 bucket operations -> TypeError: sequence item 0: expected string, int found - python-2.7

I've been using boto for a while to interface with AWS on different computers without issue until now. For some reason, on one of my machines I keep getting "TypeError: sequence item 0: expected string, int found" on boto.connect_s3 operations related to buckets.
I'm able to connect to S3 without issues, but if I try to create a new bucket, retrieve a specific bucket, or do anything else related to buckets, I get that error. This only happens on one of my three computers.
Any help would be appreciated.
Edit: I was getting the error on Python 2.7.9 32-bit. I installed 2.7.9 64-bit and it seems to be working fine.
This is the code I'm using; it works fine on other machines.
import boto
conn = boto.connect_s3()
conn.create_bucket("gbatestingbucket2")
This is the error I get:
Traceback (most recent call last):
File "C:\Users\Sunpower\Desktop\EC2Automation\test.py", line 4, in <module>
conn.create_bucket("gbatestingbucket2")
File "c:\Python27\lib\site-packages\boto\s3\connection.py", line 612, in create_bucket
data=data)
File "c:\Python27\lib\site-packages\boto\s3\connection.py", line 664, in make_request
retry_handler=retry_handler
File "c:\Python27\lib\site-packages\boto\connection.py", line 1070, in make_request
retry_handler=retry_handler)
File "c:\Python27\lib\site-packages\boto\connection.py", line 942, in _mexe
request.body, request.headers)
File "c:\Python27\lib\httplib.py", line 946, in request
self._send_request(method, url, body, headers)
File "c:\Python27\lib\httplib.py", line 986, in _send_request
self.putheader(hdr, value)
File "c:\Python27\lib\httplib.py", line 924, in putheader
str = '%s: %s' % (header, '\r\n\t'.join(values))
TypeError: sequence item 0: expected string, int found
[Finished in 0.5s with exit code 1]
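The last frame of the traceback shows where the error is raised: httplib joins the header values with str.join, which rejects any item that is not a string. The sketch below reproduces only that join, using a hypothetical integer header value; which header boto actually passed as an int on the affected machine is not known from the question. It is written for Python 3, where the message wording differs slightly from the Python 2.7 one shown above.

```python
def join_header_values(values):
    """Join header values the way the httplib.putheader line in the traceback does."""
    return '\r\n\t'.join(values)


def join_header_values_coerced(values):
    """Same join, but every value is converted to a string first."""
    return '\r\n\t'.join(str(v) for v in values)


# Hypothetical header value: an int where a string is expected.
try:
    join_header_values([0])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

coerced = join_header_values_coerced([0])
print(error_message)
print(coerced)
```

Converting values to strings before they reach the HTTP layer avoids the error, but it does not explain why only the 32-bit install produced an integer there.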

Glue Boto Client -- NoCredentialsError

I've been running my Glue jobs on a schedule for a few months. Last night one of them failed with botocore.exceptions.NoCredentialsError: Unable to locate credentials after calling bucket.objects.filter(Prefix=productionDirectory).
I am under the impression this is a result of not having defined a credentials file, but AWS Glue has always pulled credentials without issue. I just re-ran my job and everything worked perfectly. For reference, I define my Glue Client via: glue = boto3.client('glue'). Has anyone ever experienced this before? Is this just an edge-case?
Full Logs:
Traceback (most recent call last):
File "/tmp/data-deployment", line 67, in <module>
for obj in bucket.objects.filter(Prefix=productionDirectory):
File "/home/spark/.local/lib/python3.7/site-packages/boto3/resources/collection.py", line 83, in __iter__
for page in self.pages():
File "/home/spark/.local/lib/python3.7/site-packages/boto3/resources/collection.py", line 166, in pages
for page in pages:
File "/home/spark/.local/lib/python3.7/site-packages/botocore/paginate.py", line 255, in __iter__
response = self._make_request(current_kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/paginate.py", line 332, in _make_request
return self._method(**current_kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 613, in _make_api_call
operation_model, request_dict, request_context)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 632, in _make_request
return self._endpoint.make_request(operation_model, request_dict)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/endpoint.py", line 102, in make_request
return self._send_request(request_dict, operation_model)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/endpoint.py", line 132, in _send_request
request = self.create_request(request_dict, operation_model)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/endpoint.py", line 116, in create_request
operation_name=operation_model.name)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/hooks.py", line 356, in emit
return self._emitter.emit(aliased_event_name, **kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/hooks.py", line 228, in emit
return self._emit(event_name, kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/hooks.py", line 211, in _emit
response = handler(**kwargs)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/signers.py", line 90, in handler
return self.sign(operation_name, request)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/signers.py", line 160, in sign
auth.add_auth(request)
File "/home/spark/.local/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials
Edit/Update: This is a known bug. I've posted the mitigation strategy provided by AWS as an answer below.
Update: I reached out to AWS Support and they responded. Apparently this is a known bug. While they do not have a fix or an ETA for one, they do have a way to mitigate the issue. Their reply:
Thank you for reporting your issue to us and product team is aware of this intermittent issue.
They are working on resolution however, I do not have an ETA.
To mitigate this issue, increase the timeout / attempts to meta service request in your code:
import os
# Increase the metadata service timeout and number of attempts.
# Set these before creating any boto3 client or session.
os.environ['AWS_METADATA_SERVICE_NUM_ATTEMPTS'] = "5"
os.environ['AWS_METADATA_SERVICE_TIMEOUT'] = "30"
I faced a similar issue with Glue, but not exactly the same.
We used external tables with SparkSQL and S3, and sometimes an exception such as "Table not found" was raised out of nowhere. The issue occurred very rarely and we could never reproduce it in testing. Since our jobs ran perfectly fine on retries, we enabled the retry mechanism to work around it.
It has something to do with the internal workings of Glue and its serverless environment.
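The retry approach described above can also be applied inside the job around a single flaky call. The following is a minimal sketch, not Glue's own retry mechanism: `call_with_retries`, `TransientError`, and `flaky` are hypothetical names, and `TransientError` stands in for an intermittent error such as `NoCredentialsError`.

```python
import time


def call_with_retries(func, retriable, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a retriable exception wait and try again, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retriable:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


class TransientError(Exception):
    """Stand-in for an intermittent error such as NoCredentialsError."""


calls = {'count': 0}


def flaky():
    # Fails twice, then succeeds, to imitate an intermittent failure.
    calls['count'] += 1
    if calls['count'] < 3:
        raise TransientError('not yet')
    return 'ok'


# Record the delays instead of sleeping so the demo runs instantly.
delays = []
result = call_with_retries(flaky, TransientError, attempts=5, sleep=delays.append)
print(result, calls['count'], delays)
```

In a real job the `retriable` argument would be the exception class that fails intermittently, and `sleep` would be left at its default.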

Processing many WARC archives from CommonCrawl using Hadoop Streaming and MapReduce

I am working on a project in which I need to download crawl data (from CommonCrawl) for specific URLs from an S3 container and then process that data.
Currently I have a MapReduce job (Python via Hadoop Streaming) which gets the correct S3 file paths for a list of URLs. Then I am trying to use a second MapReduce job to process this output by downloading the data from the commoncrawl S3 bucket. In the mapper I am using boto3 to download the gzip contents for a specific URL from the commoncrawl S3 bucket and then output some information about the gzip contents (word counter information, content length, URLs linked to, etc.). The reducer then goes through this output to get the final word count, URL list, etc.
The output file from the first MapReduce job is only about 6 MB in size (but will be larger once we scale to the full dataset). When I run the second MapReduce job, this file is only split into two pieces. Normally this is not a problem for such a small file, but the mapper code I described above (fetching S3 data, emitting mapped output, etc.) takes a while to run for each URL. Since the file is only split in two, only 2 mappers are run. I need to increase the number of splits so that the mapping can be done faster.
I have tried setting "mapreduce.input.fileinputformat.split.maxsize" and "mapreduce.input.fileinputformat.split.minsize" for the MapReduce job, but it doesn't change the number of splits taking place.
Here is some of the code from the mapper:
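import gzip
import io
import boto3
from botocore import UNSIGNED
from botocore.config import Config
# Anonymous (unsigned) client for the public commoncrawl bucket
s3 = boto3.client('s3', 'us-west-2', config=Config(signature_version=UNSIGNED))
# HTTP byte ranges are inclusive, hence the -1
offset_end = offset + length - 1
byte_range = 'bytes=%s-%s' % (offset, offset_end)
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename, Range=byte_range)['Body'].read()
fileobj = io.BytesIO(gz_file)
with gzip.open(fileobj, 'rb') as file:
    [do stuff]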
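The range arithmetic in the mapper relies on each record being stored as its own gzip member, so a slice of exactly `length` bytes starting at `offset` decompresses on its own. The sketch below illustrates this without touching S3: the two-record `archive` is made-up stand-in data, and `byte_range` and `read_member` are hypothetical helper names.

```python
import gzip
import io


def byte_range(offset, length):
    """Build an inclusive HTTP Range value covering `length` bytes from `offset`."""
    return 'bytes=%d-%d' % (offset, offset + length - 1)


def read_member(blob, offset, length):
    """Decompress the single gzip member stored at blob[offset:offset + length]."""
    chunk = blob[offset:offset + length]
    with gzip.open(io.BytesIO(chunk), 'rb') as fh:
        return fh.read()


# Stand-in for a WARC file: two independently gzipped records, concatenated.
first = gzip.compress(b'record one')
second = gzip.compress(b'record two')
archive = first + second

print(byte_range(len(first), len(second)))
print(read_member(archive, len(first), len(second)))
```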
I also manually split the input file into multiple files of at most 100 lines each. This had the desired effect of giving me more mappers, but then I began encountering a ConnectionError from the s3.get_object() call:
Traceback (most recent call last):
File "dmapper.py", line 103, in <module>
commoncrawl_reader(base_url, full_url, offset, length, warc_file)
File "dmapper.py", line 14, in commoncrawl_reader
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename, Range='bytes=%s-%s' % (offset, offset_end))[
File "/usr/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/client.py", line 599, in _make_api_call
operation_model, request_dict)
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 148, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 177, in _send_request
success_response, exception):
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 273, in _needs_retry
caught_exception=caught_exception, request_dict=request_dict)
File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 227, in emit
return self._emit(event_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 210, in _emit
response = handler(**kwargs)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 183, in __call__
if self._checker(attempts, response, caught_exception):
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 251, in __call__
caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 277, in _should_retry
return self._checker(attempt_number, response, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 317, in __call__
caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 223, in __call__
attempt_number, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
raise caught_exception
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 222, in _get_response
proxies=self.proxies, timeout=self.timeout)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/adapters.py", line 415, in send
raise ConnectionError(err, request=request)
botocore.vendored.requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
I am currently running this with only a handful of URLs, but I will need to do it with several thousand (each with many subdirectories) once I get it working.
I am not certain where to start with fixing this. I feel it is highly likely that there is a better approach than what I am trying. The fact that the mapper takes so long for each URL seems like a strong indication that I am approaching this wrong. I should also mention that the mapper and the reducer both run correctly if run directly as a pipe command:
"cat short_url_list.txt | python mapper.py | sort | python reducer.py" -> Produces desired output, but would take too long to run on the entire list of URLs.
Any guidance would be greatly appreciated.
The MapReduce API provides NLineInputFormat. The property "mapreduce.input.lineinputformat.linespermap" lets you control the maximum number of lines (here, WARC records) passed to each mapper. It works with mrjob, cf. Ilya's WARC indexer.
Regarding the S3 connection error: it's better to run the job in the us-east-1 AWS region where the data is located.
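For a plain Hadoop Streaming job, the NLineInputFormat suggestion could look roughly like the command assembled below. This is a sketch under assumptions: the jar location and the input and output paths are placeholders, the script names are taken from the question, and the old-API class name `org.apache.hadoop.mapred.lib.NLineInputFormat` is used because Streaming's `-inputformat` option expects an old-API input format. Shipping the scripts to the cluster (for example with `-files`) is left out.

```python
def streaming_command(streaming_jar, input_path, output_path, lines_per_map):
    """Build a Hadoop Streaming command that gives each mapper at most lines_per_map lines."""
    return [
        'hadoop', 'jar', streaming_jar,
        # Generic -D options must come before the streaming-specific options.
        '-D', 'mapreduce.input.lineinputformat.linespermap=%d' % lines_per_map,
        # Streaming expects the old-API (mapred) input format class here.
        '-inputformat', 'org.apache.hadoop.mapred.lib.NLineInputFormat',
        '-input', input_path,
        '-output', output_path,
        '-mapper', 'dmapper.py',
        '-reducer', 'reducer.py',
    ]


# Placeholder paths, for illustration only.
command = streaming_command('hadoop-streaming.jar', 'urls/', 'out/', 100)
print(' '.join(command))
```

With an input format other than the default text format, Streaming may pass the line's byte offset as a key in front of each line, so the mapper should be checked against the actual input it receives.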

Python request to API keeps returning ZeroReturnError exception

Python 2.7.3
I'm calling an API from a Raspberry Pi 3. The API logs show the request hits the correct endpoint and returns a 200 status code, but the Python code on the Pi produces a huge stack trace. I read in some forums that ZeroReturnError is always raised and does not mean anything went wrong, but that seems odd, since I can't get at the response from an except block around the call.
My code is literally:
import requests
response = requests.get(<URL I AM USING>, json={JSON I AM USING})
Not sure what to do.
Traceback (most recent call last):
File "music.py", line 13, in <module>
response = requests.get(url, json={'blah':{'blah':'*********'}})
File "/usr/lib/python2.7/dist-packages/requests/api.py", line 60, in get
return request('get', url, **kwargs)
File "/usr/lib/python2.7/dist-packages/requests/api.py", line 49, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 457, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 606, in send
r.content
File "/usr/lib/python2.7/dist-packages/requests/models.py", line 724, in content
self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
File "/usr/lib/python2.7/dist-packages/requests/models.py", line 653, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/usr/lib/python2.7/dist-packages/urllib3/response.py", line 256, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/usr/lib/python2.7/dist-packages/urllib3/response.py", line 186, in read
data = self._fp.read(amt)
File "/usr/lib/python2.7/httplib.py", line 602, in read
s = self.fp.read(amt)
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
File "/usr/lib/python2.7/dist-packages/urllib3/contrib/pyopenssl.py", line 188, in recv
data = self.connection.recv(*args, **kwargs)
OpenSSL.SSL.ZeroReturnError
Some more searching led me to think it was a version issue.
Running sudo pip install urllib3 --upgrade on the Raspberry Pi cleared it up.
I am getting a DependencyWarning about installing PySocks, but it's working correctly now.
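Because the fix was a library upgrade, it can help to record which versions are installed before and after. A small sketch, assuming Python 3.8+ for `importlib.metadata` (the Python 2.7 setup in the question would need a different mechanism, such as `pip show`); `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages that appear in the traceback above.
for name in ('requests', 'urllib3', 'pyOpenSSL'):
    print(name, installed_version(name))
```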

TypeError when using botocore to read from AWS SQS queue

I'm using a Tornado server with tornado-botocore to connect to Amazon SQS services.
When running stress tests we sometimes get the following exception:
Traceback (most recent call last):
File "/home/app/handlers/WebSocketsHandler.py", line 95, in listen_outgoing_queue
message = yield tornado.gen.Task(self.outgoing_queue.read)
File "/home/local/lib/python2.7/site-packages/tornado/gen.py", line 870, in run
value = future.result()
File "/home/local/lib/python2.7/site-packages/tornado/concurrent.py", line 215, in result
raise_exc_info(self._exc_info)
File "/home/local/lib/python2.7/site-packages/tornado/stack_context.py", line 314, in wrapped
ret = fn(*args, **kwargs)
File "/home/local/lib/python2.7/site-packages/tornado_botocore/base.py", line 70, in prepare_response
response_dict, operation_model.output_shape)
File "/home/local/lib/python2.7/site-packages/botocore/parsers.py", line 155, in parse
return self._do_error_parse(response, shape)
File "/home/.env/local/lib/python2.7/site-packages/botocore/parsers.py", line 314, in _do_error_parse
root = self._parse_xml_string_to_dom(xml_contents)
File "/home/local/lib/python2.7/site-packages/botocore/parsers.py", line 274, in _parse_xml_string_to_dom
parser.feed(xml_string)
TypeError: must be string or read-only buffer, not None
Could it be caused by the concurrency?
Has anyone encountered such behavior?
We are using tornado 4.2.1, botocore 0.65.0 and tornado-botocore 0.1.6.
The problem was solved once I removed the @tornado.gen.engine decorator from the method.

NotSupportedError when trying to build primary index in N1QL in Couchbase Python SDK

I'm trying to get into the new N1QL Queries for Couchbase in Python.
I got my database set up in Couchbase 4.0.0.
My first attempt was to retrieve all documents like this:
from couchbase.bucket import Bucket
bucket = Bucket('couchbase://localhost/default')
rv = bucket.n1ql_query('CREATE PRIMARY INDEX ON default').execute()
for row in bucket.n1ql_query('SELECT * FROM default'):
    print row
But this produces a NotSupportedError:
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 2357, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1777, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Users/my_user/python_tests/test_n1ql.py", line 9, in <module>
rv = bucket.n1ql_query('CREATE PRIMARY INDEX ON default').execute()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/couchbase/n1ql.py", line 215, in execute
for _ in self:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/couchbase/n1ql.py", line 235, in __iter__
self._start()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/couchbase/n1ql.py", line 180, in _start
self._mres = self._parent._n1ql_query(self._params.encoded)
couchbase.exceptions.NotSupportedError: <RC=0x13[Operation not supported], Couldn't schedule n1ql query, C Source=(src/n1ql.c,82)>
Here are the version numbers of everything I use:
Couchbase Server: 4.0.0
couchbase python library: 2.0.2
cbc: 2.5.1
python: 2.7.8
gcc: 4.2.1
Does anyone have an idea what might have gone wrong here? I have not been able to find a solution so far.
There was another ticket for node.js where the same issue happened. There was a proposal to enable n1ql for the specific bucket first. Is this also needed in python?
It would seem you didn't configure any cluster nodes with the Query or Index services. As such, the error returned is one that indicates no nodes are available.
I also got a similar error while trying to create a primary index.
Create a primary index...
Traceback (most recent call last):
File "post-upgrade-test.py", line 45, in <module>
mgr.n1ql_index_create_primary(ignore_exists=True)
File "/usr/local/lib/python2.7/dist-packages/couchbase/bucketmanager.py", line 428, in n1ql_index_create_primary
'', defer=defer, primary=True, ignore_exists=ignore_exists)
File "/usr/local/lib/python2.7/dist-packages/couchbase/bucketmanager.py", line 412, in n1ql_index_create
return IxmgmtRequest(self._cb, 'create', info, **options).execute()
File "/usr/local/lib/python2.7/dist-packages/couchbase/_ixmgmt.py", line 160, in execute
return [x for x in self]
File "/usr/local/lib/python2.7/dist-packages/couchbase/_ixmgmt.py", line 144, in __iter__
self._start()
File "/usr/local/lib/python2.7/dist-packages/couchbase/_ixmgmt.py", line 132, in _start
self._cmd, index_to_rawjson(self._index), **self._options)
couchbase.exceptions.NotSupportedError: <RC=0x13[Operation not supported], Couldn't schedule ixmgmt operation, C Source=(src/ixmgmt.c,98)>
Adding a query and index node to the cluster solved the issue.