Accessing a bz2 file in S3 from a SageMaker notebook

I am able to read and write CSV files to and from an S3 bucket from a SageMaker notebook, but when I try to read a bz2 file using the same path style I use for the CSV files, I get a "no such file or directory" error:
IOErrorTraceback (most recent call last)
<ipython-input-19-d14d47a702e1> in <module>()
2 # Create corpus
3 #%time wiki = WikiCorpus("resources/articles1.xml.bz2", tokenizer_func=spacy_tokenize)
----> 4 wiki = WikiCorpus("s3://sagemakerq/enwiki.xml.bz2", tokenizer_func=spacy_tokenize)
/home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages/gensim/corpora/wikicorpus.pyc in __init__(self, fname, processes, lemmatize, dictionary, filter_namespaces, tokenizer_func, article_min_tokens, token_min_len, token_max_len, lower, filter_articles)
634
635 if dictionary is None:
--> 636 self.dictionary = Dictionary(self.get_texts())
637 else:
638 self.dictionary = dictionary
/home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages/gensim/corpora/dictionary.pyc in __init__(self, documents, prune_at)
82
83 if documents is not None:
---> 84 self.add_documents(documents, prune_at=prune_at)
85
86 def __getitem__(self, tokenid):
/home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages/gensim/corpora/dictionary.pyc in add_documents(self, documents, prune_at)
195
196 """
--> 197 for docno, document in enumerate(documents):
198 # log progress & run a regular check for pruning, once every 10k docs
199 if docno % 10000 == 0:
/home/ec2-user/anaconda3/envs/amazonei_mxnet_p27/lib/python2.7/site-packages/gensim/corpora/wikicorpus.pyc in get_texts(self)
676 ((text, self.lemmatize, title, pageid, tokenization_params)
677 for title, text, pageid
--> 678 in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces, self.filter_articles))
679 pool = multiprocessing.Pool(self.processes, init_to_ignore_interrupt)
680
IOError: [Errno 2] No such file or directory: 's3://sagemakerq/enwiki.xml.bz2'

It looks like you are using the Python gensim package to construct a corpus from a Wikipedia database dump stored in S3. The package does not support reading directly from S3, so download the file first and work with the local copy:
import boto3
from gensim.corpora.wikicorpus import WikiCorpus
# Download the dump to local disk first; WikiCorpus opens its input with
# bz2.BZ2File, which only understands local file paths.
s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')
wiki = WikiCorpus('FILE_NAME')
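With the bucket, key, and tokenizer from the question (spacy_tokenize is the asker's own function), that would look roughly like this:
import boto3
from gensim.corpora.wikicorpus import WikiCorpus
s3 = boto3.client('s3')
# download s3://sagemakerq/enwiki.xml.bz2 into the notebook's working directory
s3.download_file('sagemakerq', 'enwiki.xml.bz2', 'enwiki.xml.bz2')
wiki = WikiCorpus('enwiki.xml.bz2', tokenizer_func=spacy_tokenize)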

Related

torchvision.datasets.mnist RuntimeError on JupyterLab

I'm trying to run the following sample code on JupyterLab (through GCP Vertex AI):
import torch
from torchvision import transforms
from torchvision import datasets
train_data = datasets.MNIST(root='data', train=True, download=True, transform=None)
print(train_data)
with versions:
torch-1.12.1+cu113
torchvision-0.13.1+cu113
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_10081/229378695.py in <module>
11 from torchvision import datasets
12
---> 13 train_data = datasets.MNIST(root='data', train=True, download=True, transform=None)
14 print(train_data)
/opt/conda/lib/python3.7/site-packages/torchvision/datasets/mnist.py in __init__(self, root, train, transform, target_transform, download)
102 raise RuntimeError("Dataset not found. You can use download=True to download it")
103
--> 104 self.data, self.targets = self._load_data()
105
106 def _check_legacy_exist(self):
/opt/conda/lib/python3.7/site-packages/torchvision/datasets/mnist.py in _load_data(self)
121 def _load_data(self):
122 image_file = f"{'train' if self.train else 't10k'}-images-idx3-ubyte"
--> 123 data = read_image_file(os.path.join(self.raw_folder, image_file))
124
125 label_file = f"{'train' if self.train else 't10k'}-labels-idx1-ubyte"
/opt/conda/lib/python3.7/site-packages/torchvision/datasets/mnist.py in read_image_file(path)
542
543 def read_image_file(path: str) -> torch.Tensor:
--> 544 x = read_sn3_pascalvincent_tensor(path, strict=False)
545 if x.dtype != torch.uint8:
546 raise TypeError(f"x should be of dtype torch.uint8 instead of {x.dtype}")
/opt/conda/lib/python3.7/site-packages/torchvision/datasets/mnist.py in read_sn3_pascalvincent_tensor(path, strict)
529
530 assert parsed.shape[0] == np.prod(s) or not strict
--> 531 return parsed.view(*s)
532
533
RuntimeError: shape '[60000, 28, 28]' is invalid for input of size 9437168
and I'm getting this strange error when trying to load MNIST.
I tried reproducing it in other environments but couldn't; it works fine both locally and on Colab.
I tried lots of other versions of torch and torchvision, but none of them works.
This error is often caused by corrupted MNIST dataset files on disk: the expected shape [60000, 28, 28] needs 60000 × 28 × 28 = 47,040,000 pixel bytes, but only 9,437,168 were read, so the downloaded file is incomplete. Try deleting the MNIST dataset files in the data directory and then running the code again to download fresh copies:
import os
import shutil
from torchvision import datasets

# remove any previously downloaded (possibly corrupted) copies
mnist_folder = 'data/MNIST'
if os.path.exists(mnist_folder):
    shutil.rmtree(mnist_folder)

# re-download a fresh copy
train_data = datasets.MNIST(root='data', train=True, download=True, transform=None)
If this method doesn't work, visit this website, download the dataset files manually, and place them in the data/MNIST folder.
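As a quick sanity check before retrying, you can verify the size of the raw file (the path below assumes torchvision's default layout under root='data'):
import os
raw_file = 'data/MNIST/raw/train-images-idx3-ubyte'
if os.path.exists(raw_file):
    # expected: 16-byte IDX header + 60000 * 28 * 28 pixel bytes = 47040016
    print(os.path.getsize(raw_file))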

Keep getting Permission denied when using the fastai library on AWS

I'm learning deep learning by taking a course that uses fastai. I'm running the fastai library on an AWS p2.xlarge instance. When I run certain fastai functions I get this error:
Traceback (most recent call last)
<ipython-input-12-1d86fc0ece07> in <module>()
1 arch = resnet34
2 data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch,sz ))
----> 3 learn = ConvLearner.pretrained(arch, data, precompute = True)
4 learn.fit(0.01, 2)
~/fastai/fastai/conv_learner.py in pretrained(cls, f, data, ps, xtra_fc, xtra_cut, custom_head, precompute, pretrained, **kwargs)
112 models = ConvnetBuilder(f, data.c, data.is_multi, data.is_reg,
113 ps=ps, xtra_fc=xtra_fc, xtra_cut=xtra_cut, custom_head=custom_head, pretrained=pretrained)
--> 114 return cls(data, models, precompute, **kwargs)
115
116 @classmethod
~/fastai/fastai/conv_learner.py in __init__(self, data, models, precompute, **kwargs)
95 def __init__(self, data, models, precompute=False, **kwargs):
96 self.precompute = False
---> 97 super().__init__(data, models, **kwargs)
98 if hasattr(data, 'is_multi') and not data.is_reg and self.metrics is None:
99 self.metrics = [accuracy_thresh(0.5)] if self.data.is_multi else [accuracy]
~/fastai/fastai/learner.py in __init__(self, data, models, opt_fn, tmp_name, models_name, metrics, clip, crit)
35 self.tmp_path = tmp_name if os.path.isabs(tmp_name) else os.path.join(self.data.path, tmp_name)
36 self.models_path = models_name if os.path.isabs(models_name) else os.path.join(self.data.path, models_name)
---> 37 os.makedirs(self.tmp_path, exist_ok=True)
38 os.makedirs(self.models_path, exist_ok=True)
39 self.crit = crit if crit else self._get_crit(data)
~/anaconda3/envs/fastai/lib/python3.6/os.py in makedirs(name, mode, exist_ok)
218 return
219 try:
--> 220 mkdir(name, mode)
221 except OSError:
222 # Cannot rely on checking for EEXIST, since the operating system
PermissionError: [Errno 13] Permission denied: 'data/dogscats/tmp'
I think the notebook process has no permission to create the directory.
I ran sudo mkdir tmp data/dogscats/ but then got another error that I couldn't understand.
I think I have to grant some permission somewhere, but I have no clue how to do that.
I hope you can give me some clear idea of how to solve this kind of problem.
fastai saves data such as the current loss in a folder that it creates. By default the folder is created in the working directory, but you can pass a path argument pointing to a location where you have the privileges to create folders.
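Based on the Learner signature visible in the traceback (tmp_name and models_name are joined onto data.path unless they are absolute), one workaround is to pass absolute paths you can write to; the directories below are just examples:
arch = resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
# absolute paths skip os.path.join(self.data.path, ...), so fastai writes
# to directories the notebook user can create
learn = ConvLearner.pretrained(arch, data, precompute=True,
                               tmp_name='/home/ubuntu/fastai_tmp',
                               models_name='/home/ubuntu/fastai_models')
Alternatively, taking ownership of the data directory (for example sudo chown -R $USER data/dogscats) lets fastai create tmp/ and models/ in place.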

IOError: [Errno 22] loading parquet file

I have parquet data like the sample below. I'm trying to load it into a dataframe with the code below, using the pyarrow engine. It works fine for other files, but when I try to load this file I get the error below. I'm new to parquet; does anyone see what the issue might be?
Code:
pd.read_parquet('/tmp/dt=20/09_0')
Error:
ArrowIOErrorTraceback (most recent call last)
<ipython-input-20-23dfd4ca529a> in <module>()
----> 1 view_df=pd.read_parquet('/data_tmp/view_coremetrics/dt=20180402/000119_0')
2 # view_df=pd.read_parquet('/data_tmp/000031_0')
3 print view_df.shape
4 view_df.head()
/data2/user1/anaconda2/lib/python2.7/site-packages/pandas/io/parquet.pyc in read_parquet(path, engine, columns, **kwargs)
255
256 impl = get_engine(engine)
--> 257 return impl.read(path, columns=columns, **kwargs)
/data2/user1/anaconda2/lib/python2.7/site-packages/pandas/io/parquet.pyc in read(self, path, columns, **kwargs)
128 kwargs['use_pandas_metadata'] = True
129 return self.api.parquet.read_table(path, columns=columns,
--> 130 **kwargs).to_pandas()
131
132 def _validate_write_lt_070(self, df):
/data2/user1/anaconda2/lib/python2.7/site-packages/pyarrow/parquet.pyc in read_table(source, columns, nthreads, metadata, use_pandas_metadata)
937 return fs.read_parquet(source, columns=columns, metadata=metadata)
938
--> 939 pf = ParquetFile(source, metadata=metadata)
940 return pf.read(columns=columns, nthreads=nthreads,
941 use_pandas_metadata=use_pandas_metadata)
/data2/user1/anaconda2/lib/python2.7/site-packages/pyarrow/parquet.pyc in __init__(self, source, metadata, common_metadata)
62 self.reader = ParquetReader()
63 source = _ensure_file(source)
---> 64 self.reader.open(source, metadata=metadata)
65 self.common_metadata = common_metadata
66 self._nested_paths_by_prefix = self._build_nested_paths()
_parquet.pyx in pyarrow._parquet.ParquetReader.open()
error.pxi in pyarrow.lib.check_status()
ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument
Data: the file begins with the PAR1 magic header followed by binary content (not reproducible here).

"RuntimeError: Could not create write struct" with pyplot

UPDATE:
I get this message no matter what I attempt to plot: even this
plt.plot([1,2,3,4])
plt.ylabel('some numbers')
plt.show()
returns the error RuntimeError: Could not create write struct
I am trying to plot a raw image inline. My Jupyter notebook is up on an AWS instance with port forwarding.
My code is as follows:
(see the update above)
When I try this, I get the error message below, which culminates in the message RuntimeError: Could not create write struct.
The weird thing is, the exact same code runs fine locally. I can view images all day long.
So as an experiment I pulled the image down off AWS and ran it locally and I could see it displayed just fine.
I'm thinking there must be some problem with either my Matplotlib or my Jupyter notebook installation.
I've removed / reinstalled both multiple times, in multiple configurations. I made sure the local and AMI versions of the packages are the exact same.
I have no idea what is going on.
The error itself, naturally, isn't useful. And when googling the error, there are few exact string-matching results, which is always scary.
Other random stuff:
I'm using Python 2.7
Both libraries are managed within Conda
Jupyter: 4.4.0
Matplotlib: 2.1.2
<matplotlib.image.AxesImage at 0x7f261c1f2b50>
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/home/ubuntu/anaconda3/envs/pytorch_p27/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
332 pass
333 else:
--> 334 return printer(obj)
335 # Finally look for special method names
336 method = get_real_method(obj, self.print_method)
/home/ubuntu/anaconda3/envs/pytorch_p27/lib/python2.7/site-packages/IPython/core/pylabtools.pyc in <lambda>(fig)
238
239 if 'png' in formats:
--> 240 png_formatter.for_type(Figure, lambda fig: print_figure(fig, 'png', **kwargs))
241 if 'retina' in formats or 'png2x' in formats:
242 png_formatter.for_type(Figure, lambda fig: retina_figure(fig, **kwargs))
/home/ubuntu/anaconda3/envs/pytorch_p27/lib/python2.7/site-packages/IPython/core/pylabtools.pyc in print_figure(fig, fmt, bbox_inches, **kwargs)
122
123 bytes_io = BytesIO()
--> 124 fig.canvas.print_figure(bytes_io, **kw)
125 data = bytes_io.getvalue()
126 if fmt == 'svg':
/home/ubuntu/anaconda3/envs/pytorch_p27/lib/python2.7/site-packages/matplotlib/backend_bases.pyc in print_figure(self, filename, dpi, facecolor, edgecolor, orientation, format, **kwargs)
2214 orientation=orientation,
2215 dryrun=True,
-> 2216 **kwargs)
2217 renderer = self.figure._cachedRenderer
2218 bbox_inches = self.figure.get_tightbbox(renderer)
/home/ubuntu/anaconda3/envs/pytorch_p27/lib/python2.7/site-packages/matplotlib/backends/backend_agg.pyc in print_png(self, filename_or_obj, *args, **kwargs)
524 try:
525 _png.write_png(renderer._renderer, filename_or_obj,
--> 526 self.figure.dpi, metadata=metadata)
527 finally:
528 if close:
RuntimeError: Could not create write struct
<matplotlib.figure.Figure at 0x7f2624b94950>
I have no idea why this works, but I removed the conda installation of matplotlib and then reinstalled matplotlib with pip.
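For reference, the swap was roughly the following, run in the terminal for the pytorch_p27 environment (exact versions may differ):
conda remove matplotlib
pip install matplotlib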
Now everything works fine.
¯\_(ツ)_/¯

Issue starting out with xlwings - AttributeError: Excel.Application.Workbooks

I was trying to use the package xlwings and ran into a simple error right from the start. I was able to run the example files they provided here without any major issues (except for multiple Excel workbooks opening up when running the code), but as soon as I tried to execute code via IPython I got the error AttributeError: Excel.Application.Workbooks. Specifically, I ran:
from xlwings import Workbook, Sheet, Range, Chart
wb = Workbook()
Range('A1').value = 'Foo 1'
and got
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-7-7436ba97d05d> in <module>()
1 from xlwings import Workbook, Sheet, Range, Chart
----> 2 wb = Workbook()
3 Range('A1').value = 'Foo 1'
PATH\xlwings\main.pyc in __init__(self, fullname, xl_workbook, app_visible)
139 else:
140 # Open Excel if necessary and create a new workbook
--> 141 self.xl_app, self.xl_workbook = xlplatform.new_workbook()
142
143 self.name = xlplatform.get_workbook_name(self.xl_workbook)
PATH\xlwings\_xlwindows.pyc in new_workbook()
103 def new_workbook():
104 xl_app = _get_latest_app()
--> 105 xl_workbook = xl_app.Workbooks.Add()
106 return xl_app, xl_workbook
107
PATH\win32com\client\dynamic.pyc in __getattr__(self, attr)
520
521 # no where else to look.
--> 522 raise AttributeError("%s.%s" % (self._username_, attr))
523
524 def __setattr__(self, attr, value):
AttributeError: Excel.Application.Workbooks
I noticed the examples have an .xlsm file already present in the folder with the Python code. Does the Python code only ever work if it's in the same location as an existing Excel file? Does this mean it can't create Excel files automatically? Apologies if this is basic.