Getting error when uploading CSV to Snowflake - snowflake-connector

Trying to upload a CSV of size 100 GB and getting this error after the compressed files are created in tmp:
data = sf_file_transfer_agent.result()
File "/apps/tools/python/python36/lib/python3.6/site-packages/snowflake/connector/file_transfer_agent.py", line 722, in result
"errno": ER_FAILED_TO_UPLOAD_TO_STAGE,
File "/apps/tools/python/python36/lib/python3.6/site-packages/snowflake/connector/errors.py", line 258, in errorhandler_wrapper
error_value,
File "/apps/tools/python/python36/lib/python3.6/site-packages/snowflake/connector/errors.py", line 309, in hand_to_other_handler
cursor.errorhandler(connection, cursor, error_class, error_value)
File "/apps/tools/python/python36/lib/python3.6/site-packages/snowflake/connector/errors.py", line 195, in default_errorhandler
cursor=cursor,
snowflake.connector.errors.OperationalError: AttributeError("'StorageCredential' object has no attribute '_command'",)

Loading very large files (e.g. 100 GB or larger) is not recommended. Try splitting the file into chunks of roughly 100-250 MB (compressed) and uploading those instead.
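A minimal sketch of that approach, using pandas to write the chunks and the connector's PUT command to upload them (the file names, the stage @my_stage, and the connection object conn are placeholders, not from the question):
import pandas as pd

# Write gzip-compressed chunks; tune chunksize so each compressed file
# lands in the 100-250 MB range.
for i, chunk in enumerate(pd.read_csv('big_file.csv', chunksize=1_000_000)):
    chunk.to_csv('chunk_%04d.csv.gz' % i, index=False, compression='gzip')

# Then upload them to a stage with the connector:
# cur = conn.cursor()
# cur.execute("PUT file://chunk_*.csv.gz @my_stage AUTO_COMPRESS=FALSE")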

Related

gsutil: back up files encrypted with a Customer-Supplied Key

I have a Google Cloud Storage bucket containing files. Each of these files is encrypted with a different key, for security reasons. This bucket is the source. I want to copy its content from the source bucket to the destination bucket, just to have a backup...
I tried to run this command:
$ gsutil cp -r gs://source-bucket/* gs://dest-bucket/
Traceback (most recent call last):
File "/usr/lib/google-cloud-sdk/platform/gsutil/gsutil", line 21, in <module>
gsutil.RunMain()
File "/usr/lib/google-cloud-sdk/platform/gsutil/gsutil.py", line 132, in RunMain
sys.exit(gslib.__main__.main())
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 431, in main
user_project=user_project)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 760, in _RunNamedCommandAndHandleExceptions
_HandleUnknownFailure(e)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 626, in _RunNamedCommandAndHandleExceptions
user_project=user_project)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command_runner.py", line 411, in RunNamedCommand
return_code = command_inst.RunCommand()
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 1200, in RunCommand
seek_ahead_iterator=seek_ahead_iterator)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 1515, in Apply
arg_checker, should_return_results, fail_on_error)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 1586, in _SequentialApply
worker_thread.PerformTask(task, self)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 2306, in PerformTask
results = task.func(cls, task.args, thread_state=self.thread_gsutil_api)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 790, in _CopyFuncWrapper
preserve_posix=cls.preserve_posix_attrs)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 1000, in CopyFunc
preserve_posix=preserve_posix)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 3729, in PerformCopy
decryption_key = GetDecryptionCSEK(src_url, src_obj_metadata)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 3645, in GetDecryptionCSEK
(src_obj_metadata.customerEncryption.keySha256, src_url))
gslib.cloud_api.EncryptionException: Missing decryption key with SHA256 hash 0z1dPrWjTL6yrU5U6GP2gTaBriwNbMJnh6CcIuLSy8o=. No decryption key matches object gs://source-bucket/myfile.json
I guess the reason for this failure is that the keys needed to copy the files are missing.
I also tried to create a Transfer operation, but it failed for a strange reason.
How can I back up my files in this case? Just copy them as is.
What are my alternatives?
You have to supply the keys that you used to encrypt the files.
With gsutil you have to use the .boto file and add something like this:
[GSUtil]
encryption_key = ...
decryption_key1 = ...
decryption_key2 = ...
gsutil automatically detects the correct CSEK to use for a cloud object by comparing the key's SHA256 hash against the hash of the CSEK. gsutil considers the configured encryption key and up to 100 decryption keys when searching for a match. Decryption keys must be listed in the boto configuration file in ascending numerical order starting with 1.
For more on customer-supplied encryption keys, check here.
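If you are not sure which of your keys matches the hash in the error message, you can compute the hash yourself. This is a small sketch based on how CSEK hashes are defined (base64 of the SHA256 of the raw key bytes); the function name is just illustrative:
import base64, hashlib

def csek_sha256(b64_key):
    # b64_key is the same base64 string you would put in the .boto file.
    raw = base64.b64decode(b64_key)
    return base64.b64encode(hashlib.sha256(raw).digest()).decode()

# Compare the result against the hash reported in the EncryptionException,
# e.g. 0z1dPrWjTL6yrU5U6GP2gTaBriwNbMJnh6CcIuLSy8o=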

Processing many WARC archives from CommonCrawl using Hadoop Streaming and MapReduce

I am working on a project in which I need to download crawl data (from CommonCrawl) for specific URLs from an S3 container and then process that data.
Currently I have a MapReduce job (Python via Hadoop Streaming) which gets the correct S3 file paths for a list of URLs. Then I am trying to use a second MapReduce job to process this output by downloading the data from the commoncrawl S3 bucket. In the mapper I am using boto3 to download the gzip contents for a specific URL from the commoncrawl S3 bucket and then output some information about the gzip contents (word count information, content length, URLs linked to, etc.). The reducer then goes through this output to get the final word count, URL list, etc.
The output file from the first MapReduce job is only about 6 MB in size (but will be larger once we scale to the full dataset). When I run the second MapReduce job, this file is only split twice. Normally this is not a problem for such a small file, but the mapper code I described above (fetching S3 data, emitting mapped output, etc.) takes a while to run for each URL. Since the file is only split twice, only 2 mappers are run. I need to increase the number of splits so that the mapping can be done faster.
I have tried setting "mapreduce.input.fileinputformat.split.maxsize" and "mapreduce.input.fileinputformat.split.minsize" for the MapReduce job, but it doesn't change the number of splits taking place.
Here is some of the code from the mapper:
import gzip
import io
import boto3
from botocore import UNSIGNED
from botocore.client import Config

s3 = boto3.client('s3', 'us-west-2', config=Config(signature_version=UNSIGNED))
offset_end = offset + length - 1
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename,
                        Range='bytes=%s-%s' % (offset, offset_end))['Body'].read()
fileobj = io.BytesIO(gz_file)
with gzip.open(fileobj, 'rb') as file:
    [do stuff]
I also manually split the input file up into multiple files with a maximum of 100 lines. This had the desired effect of giving me more mappers, but then I began encountering a ConnectionError from the s3client.get_object() call:
Traceback (most recent call last):
File "dmapper.py", line 103, in <module>
commoncrawl_reader(base_url, full_url, offset, length, warc_file)
File "dmapper.py", line 14, in commoncrawl_reader
gz_file = s3.get_object(Bucket='commoncrawl', Key=filename, Range='bytes=%s-%s' % (offset, offset_end))[
File "/usr/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/client.py", line 599, in _make_api_call
operation_model, request_dict)
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 148, in make_request
return self._send_request(request_dict, operation_model)
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 177, in _send_request
success_response, exception):
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 273, in _needs_retry
caught_exception=caught_exception, request_dict=request_dict)
File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 227, in emit
return self._emit(event_name, kwargs)
File "/usr/lib/python3.6/site-packages/botocore/hooks.py", line 210, in _emit
response = handler(**kwargs)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 183, in __call__
if self._checker(attempts, response, caught_exception):
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 251, in __call__
caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 277, in _should_retry
return self._checker(attempt_number, response, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 317, in __call__
caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 223, in __call__
attempt_number, caught_exception)
File "/usr/lib/python3.6/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
raise caught_exception
File "/usr/lib/python3.6/site-packages/botocore/endpoint.py", line 222, in _get_response
proxies=self.proxies, timeout=self.timeout)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3.6/site-packages/botocore/vendored/requests/adapters.py", line 415, in send
raise ConnectionError(err, request=request)
botocore.vendored.requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
I am currently running this with only a handful of URLs, but I will need to do it with several thousand (each with many subdirectories) once I get it working.
I am not certain where to start with fixing this. I feel it is highly likely there is a better approach than what I am trying. The fact that the mapper seems to take so long for each URL seems like a big indication that I am approaching this wrong. I should also mention that the mapper and the reducer both run correctly if run directly as a pipe command:
"cat short_url_list.txt | python mapper.py | sort | python reducer.py" -> Produces desired output, but would take too long to run on the entire list of URLs.
Any guidance would be greatly appreciated.
The MapReduce API provides the NLineInputFormat. The property "mapreduce.input.lineinputformat.linespermap" lets you control how many lines (here WARC records) are passed to a mapper at most. It works with mrjob, cf. Ilya's WARC indexer.
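A minimal mrjob sketch of that setup (the class name, the line count of 50, and the mapper/reducer bodies are illustrative assumptions, not from the question):
from mrjob.job import MRJob

class ProcessWarcRecords(MRJob):
    # Ask Hadoop to hand each mapper at most 50 input lines instead of
    # letting the default input format create only two large splits.
    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.lib.NLineInputFormat'
    JOBCONF = {'mapreduce.input.lineinputformat.linespermap': 50}

    def mapper(self, _, line):
        # Parse the line and fetch/process the WARC record as in the
        # original mapper, then emit partial counts.
        yield 'records_seen', 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    ProcessWarcRecords.run()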
Regarding the S3 connection error: it's better to run the job in the us-east-1 AWS region where the data is located.
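If moving the whole job to us-east-1 is not immediately possible, you can at least create the client against that region and allow a few extra retries for the transient connection resets. This is only a sketch; whether the retries option is honored depends on the installed botocore version:
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Client pointed at us-east-1 (where the commoncrawl bucket lives), with
# anonymous access and a higher retry budget for ConnectionResetError.
s3 = boto3.client(
    's3', 'us-east-1',
    config=Config(signature_version=UNSIGNED, retries={'max_attempts': 10}),
)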

'No such file or directory' error after submitting a training job

I execute:
gcloud beta ml jobs submit training ${JOB_NAME} --config config.yaml
and after about 5 minutes the job errors out with this error:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main "__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 232, in <module> tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 228, in main run_training()
File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 129, in run_training data_sets = input_data.read_data_sets(FLAGS.train_dir, FLAGS.fake_data)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py", line 212, in read_data_sets with open(local_file, 'rb') as f: IOError: [Errno 2] No such file or directory: 'gs://my-bucket/mnist/train/train-images.gz'
The strange thing is that, as far as I can tell, the file exists at that URL.
This error usually indicates you are using a multi-region GCS bucket for your output. To avoid this error you should use a regional GCS bucket. Regional buckets provide stronger consistency guarantees which are needed to avoid these types of errors.
For more information about properly setting up GCS buckets for Cloud ML please refer to the Cloud ML Docs
Normal Python file I/O does not know how to deal with GCS gs:// paths correctly. You need TensorFlow's file_io module:
from tensorflow.python.lib.io import file_io

first_data_file = args.train_files[0]
file_stream = file_io.FileIO(first_data_file, mode='r')
# run experiment
model.run_experiment(file_stream)
But ironically, you can also copy files from the gs:// bucket to your local working directory, which your program CAN then actually see:
# presentation_mplstyle_path holds a gs:// path to the file in the bucket
with file_io.FileIO(presentation_mplstyle_path, mode='r') as input_f:
    with file_io.FileIO('presentation.mplstyle', mode='w+') as output_f:
        output_f.write(input_f.read())
mpl.pyplot.style.use(['./presentation.mplstyle'])
And finally, moving a file from your local working directory back to a gs:// bucket:
with file_io.FileIO(report_name, mode='r') as input_f:
    with file_io.FileIO(job_dir + '/' + report_name, mode='w+') as output_f:
        output_f.write(input_f.read())
Should be easier IMO.
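For a straight copy without the read/write dance, file_io also exposes a copy helper in the TensorFlow versions I have seen (treat the exact signature as an assumption for your version):
from tensorflow.python.lib.io import file_io

# Works in either direction between local paths and gs:// paths.
file_io.copy(report_name, job_dir + '/' + report_name, overwrite=True)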

pandas reading .csv files

I have a small script to read and print a .csv file (generated from MS Excel) using pandas.
import pandas as pd
data = pd.read_csv('./2010-11.csv')
print(data)
Now this script runs in Python 2.7.8, but in Python 3.4.1 it gives the following error. Any ideas why this might be so? Thanks in advance for any help with this.
Traceback (most recent call last):
File "proc_csv_0-0.py", line 3, in <module>
data = pd.read_csv('./2010-11.csv')
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 474, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 260, in _read
return parser.read()
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 721, in read
ret = self._engine.read(nrows)
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 1170, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:7566)
File "pandas/parser.pyx", line 791, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7806)
File "pandas/parser.pyx", line 866, in pandas.parser.TextReader._read_rows (pandas/parser.c:8639)
File "pandas/parser.pyx", line 973, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:9950)
File "pandas/parser.pyx", line 1033, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:10737)
File "pandas/parser.pyx", line 1130, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:12141)
File "pandas/parser.pyx", line 1150, in pandas.parser.TextReader._string_convert (pandas/parser.c:12355)
File "pandas/parser.pyx", line 1382, in pandas.parser._string_box_utf8 (pandas/parser.c:17679)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 4: unexpected end of data
In Python 3, when pd.read_csv is passed a file path (as opposed to a file buffer) it decodes the contents with the utf-8 codec by default.1 It appears your CSV file is using a different encoding. Since it was generated by MS Excel, it might be cp1252:
In [25]: print('\xc9'.decode('cp1252'))
É
In [27]: import unicodedata as UDAT
In [28]: UDAT.name('\xc9'.decode('cp1252'))
Out[28]: 'LATIN CAPITAL LETTER E WITH ACUTE'
The error message
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9
says that '\xc9'.decode('utf-8') raises a UnicodeDecodeError.
The above shows byte 0xc9 can be decoded with cp1252. It remains to be seen if the rest of the file can also be decoded with cp1252, and if it produces the desired result.
Unfortunately, given only a file, there is no surefire way to tell what encoding (if any) was used. It depends entirely on the program used to generate the file.
If cp1252 is the right encoding, then to load the file into a DataFrame use
data = pd.read_csv('./2010-11.csv', encoding='cp1252')
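If you want a best-effort guess at the encoding before committing to cp1252, the third-party chardet package can inspect the raw bytes (chardet is not part of pandas; it is shown here only as one option):
import chardet

with open('./2010-11.csv', 'rb') as f:
    raw = f.read(100000)   # sample the first ~100 KB
print(chardet.detect(raw)) # e.g. {'encoding': 'windows-1252', 'confidence': ...}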
1 When pd.read_csv is passed a buffer, the buffer could have been opened with encoding already set:
# Python3
with open('/tmp/test.csv', 'r', encoding='cp1252') as f:
    df = pd.read_csv(f)
    print(df)
in which case pd.read_csv will not attempt to decode since the buffer f is already supplying decoded strings.

reading csv file in python?

I am working on a machine learning project where I am supposed to read a CSV file to build a linear regression model. Here is how I read the CSV file:
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0)
but when I run it I get this error:
/usr/bin/python2.7 /home/halawa/PycharmProjects/ML/evergreen.py
Traceback (most recent call last):
File "/home/halawa/PycharmProjects/ML/evergreen.py", line 24, in <module>
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 470, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 256, in _read
return parser.read()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 715, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1164, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 758, in pandas.parser.TextReader.read (pandas/parser.c:7411)
File "pandas/parser.pyx", line 780, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7651)
File "pandas/parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas/parser.c:8268)
File "pandas/parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8142)
File "pandas/parser.pyx", line 1758, in pandas.parser.raise_parser_error (pandas/parser.c:20728)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 8
Process finished with exit code 1
Your issue is that your CSV doesn't have a consistent number of fields on each line. For example, it appears that the first line has 3 fields:
x,y,z
while the third line has 8:
x,y,z,a,b,c,d,e
You will need to fix your source CSV file to avoid this error.
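If you want to locate the offending lines before editing the file, a quick scan with Python's csv module works. This is only a sketch, reusing the path from the question; the expected field count of 3 comes from the error message:
import csv

path = "/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv"
with open(path) as f:
    for line_no, row in enumerate(csv.reader(f), start=1):
        if len(row) != 3:  # 3 fields expected per the error message
            print("line %d has %d fields" % (line_no, len(row)))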
Alternatively, if you know that you have 8 fields max, and are OK with some lines missing fields, you can use names:
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0, names=list('abcdefgh'))
The names parameter tells the CSV reader how many columns to expect (and what to call them); missing values are filled in with NaN.
EDIT:
If your null columns are marked with a ? then you should set the pandas na_values parameter like so:
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0, na_values=['?'])