How can I access the HDFS file system in the latest TensorFlow 2.6.0?

I recently upgraded the TensorFlow version used in my program to the newly released 2.6.0, but I ran into trouble.
import tensorflow as tf
pattern = 'hdfs://mypath'
print(tf.io.gfile.glob(pattern))
The above API throws an exception in version 2.6:
tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme 'hdfs' not implemented (file:xxxxx)
I then checked the relevant implementation code and found that the official recommendation is to use tensorflow/io to access HDFS, and that the environment variable TF_USE_MODULAR_FILESYSTEM is provided for legacy access support. Since my code is complex and hard to refactor on short notice, I tried this environment variable, but it still failed.
In general, my questions are:
In the latest version of TensorFlow, if "tfio" is not used, how can I still access HDFS files?
If "tfio" must be used, what is the equivalent code call to tf.io.gfile.glob?

TL;DR: Install tensorflow-io and import it.
After some trial and error, I found a solution (it may be the officially recommended way):
Since v2.6.0, TensorFlow no longer provides HDFS, GCS, and other file system support in the core framework; that support has moved to the TensorFlow I/O project.
Therefore, to get HDFS, GCS, and other file system support in this and future versions, you only need to install tensorflow-io and import it in the training program:
$ pip install tensorflow-io
$ cat test.py
import tensorflow as tf
import tensorflow_io as tfio
print(tf.io.gfile.glob('hdfs://...'))
$ CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath --glob) python test.py
It will load libtensorflow_io.so and libtensorflow_io_plugins.so, which contain the implementation and registration logic for each extra file system:
# tensorflow_io/python/ops/__init__.py
core_ops = LazyLoader("core_ops", "libtensorflow_io.so")
try:
    plugin_ops = _load_library("libtensorflow_io_plugins.so", "fs")
except NotImplementedError as e:
    warnings.warn("unable to load libtensorflow_io_plugins.so: {}".format(e))

# Note: load libtensorflow_io.so imperatively in case of statically linking
try:
    core_ops = _load_library("libtensorflow_io.so")
    plugin_ops = _load_library("libtensorflow_io.so", "fs")
except NotImplementedError as e:
    warnings.warn("file system plugins are not loaded: {}".format(e))
Ref:
Remove hdfs as support has been moved to modular file systems
Remove AWS files as s3 support is now in modular file system
SIG IO Meeting Notes


How to invoke the Flex delegate for tflite interpreters?

I have a TensorFlow model that I want to convert into a tflite model, which is going to be deployed on an ARM64 platform.
Two operations in my model (RandomStandardNormal, Softplus) seem to require custom implementations. Since execution time is not that important, I decided to go with a hybrid model that uses the extended runtime. I converted it via:
graph_def_file = './model/frozen_model.pb'
inputs = ['eval_inputs']
outputs = ['model/y']
converter = tf.lite.TFLiteConverter.from_frozen_graph(graph_def_file, inputs, outputs)
converter.target_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS]
tflite_file_name = 'vae_' + str(tf.__version__) + '.tflite'
tflite_model = converter.convert()
open(tflite_file_name, 'wb').write(tflite_model)
This worked and I ended up with a seemingly valid tflite model file. Whenever I try to load this model with an interpreter, I get an error (it does not matter if I use the Python or C++ API):
ERROR: Regular TensorFlow ops are not supported by this interpreter. Make sure you invoke the Flex delegate before inference.
ERROR: Node number 4 (FlexSoftplus) failed to prepare.
I am having a hard time finding documentation on the TF website on how to invoke the Flex delegate for either API. I stumbled across a header file ("tensorflow/lite/delegates/flex/delegate_data.h") which seems to be related to this issue, but including it in my C++ project yields another error:
In file included from /tensorflow/tensorflow/core/common_runtime/eager/context.h:28:0,
from /tensorflow/tensorflow/lite/delegates/flex/delegate_data.h:18,
from /tensorflow/tensorflow/lite/delegates/flex/delegate.h:19,
from demo.cpp:7:
/tensorflow/tensorflow/core/lib/core/status.h:23:10: fatal error: tensorflow/core/lib/core/error_codes.pb.h: No such file or directory
#include "tensorflow/core/lib/core/error_codes.pb.h"
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
By any chance, has anybody encountered and resolved this before? If you have an example snippet, please share the link!
When building TensorFlow Lite libraries using the bazel pipeline, the additional TensorFlow ops library can be included and enabled as follows:
Enable monolithic builds if necessary by adding the --config=monolithic build flag.
Add the TensorFlow ops delegate library dependency to the build dependencies: tensorflow/lite/delegates/flex:delegate.
Note that the necessary TfLiteDelegate will be installed automatically when creating the interpreter at runtime as long as the delegate is linked into the client library. It is not necessary to explicitly install the delegate instance as is typically required with other delegate types.
Python pip package
Python support is actively under development.
source: https://www.tensorflow.org/lite/guide/ops_select
According to https://www.tensorflow.org/lite/guide/ops_select#android_aar as of 2019/9/25:
Python support of 'select operators' is actively under development.
You can test the model on Android by using FlexDelegate.
I ran my model successfully in the same way.
e.g. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/java/src/test/java/org/tensorflow/lite/InterpreterFlexTest.java
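As a later note: in more recent TensorFlow releases, the full tensorflow pip package links the Flex delegate into tf.lite.Interpreter, so a Python sketch like the following should work (this is an assumption about newer versions than those discussed above, and the model path is hypothetical):

import tensorflow as tf

# With the full tensorflow pip package installed, the SELECT_TF_OPS (Flex)
# kernels are linked in, so nodes such as FlexSoftplus can prepare.
interpreter = tf.lite.Interpreter(model_path='vae.tflite')
interpreter.allocate_tensors()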

AttributeError: module 'Pyro4' has no attribute 'expose' while running gensim distributed LSI

So I am trying to run the gensim demo for distributed LSI (you can find it here), yet whenever I run the code I get the error:
AttributeError: module 'Pyro4' has no attribute 'expose'
I have checked similar issues here on Stack Overflow, and usually they are caused by misuse of the library.
However, I am not using Pyro4 directly; I am using the distributed LSI provided by gensim. So there is no room for mistakes on my side (or so I believe).
My code is really simple; you can find it below:
from gensim import corpora, models, utils
import logging, os, Pyro4
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
os.environ["PYRO_SERIALIZERS_ACCEPTED"] = 'pickle'
os.environ["PYRO_SERIALIZER"] = 'pickle'
corpus = corpora.MmCorpus('wiki_corpus.mm') # load a corpus of nine documents, from the Tutorials
id2word = corpora.Dictionary.load('wiki_dict.dict')
lsi = models.LsiModel(corpus, id2word=id2word, num_topics=200, chunksize=1, distributed=True) # run distributed LSA on nine documents
Pyro4.expose was added in Pyro4 version 4.27, from August 2014.
It looks like you have a very old Pyro4 version installed, from before that date, and your gensim requires a more recent one.
Check using:
$ python -m Pyro4.configuration | head -3
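A programmatic check from inside Python works too (a small sketch; Pyro4.constants.VERSION holds the installed version string):

import Pyro4.constants

# expose() was added in Pyro4 4.27, so anything older than that
# will raise the AttributeError seen above.
print(Pyro4.constants.VERSION)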
You should probably upgrade your Pyro4 library...
Be careful though: I believe gensim doesn't support the most recent versions of Pyro4, so you should check its manual for the correct version you need. You can always try installing the latest (4.61 right now) and see how it goes.
Edit: I suppose you could also try to find gensim-specific support: https://radimrehurek.com/gensim/support.html

How to start the Udacity machine learning course on an Anaconda Jupyter notebook with Python 2.7?

I want to start the Udacity machine learning course, so I downloaded the ud120-projects-master.zip file and extracted it into my downloads folder. I have installed the Anaconda Jupyter notebook (Python 2.7).
The first mini-project is Naive Bayes, so I opened the Jupyter notebook and used %load nb_author_id.py to convert it into a .ipynb.
But I think I first have to run startup.py in the tools folder to extract the data, so I ran startup.ipynb:
# %load startup.py
print
print "checking for nltk"
try:
    import nltk
except ImportError:
    print "you should install nltk before continuing"

print "checking for numpy"
try:
    import numpy
except ImportError:
    print "you should install numpy before continuing"

print "checking for scipy"
try:
    import scipy
except:
    print "you should install scipy before continuing"

print "checking for sklearn"
try:
    import sklearn
except:
    print "you should install sklearn before continuing"

print
print "downloading the Enron dataset (this may take a while)"
print "to check on progress, you can cd up one level, then execute <ls -lthr>"
print "Enron dataset should be last item on the list, along with its current size"
print "download will complete at about 423 MB"
import urllib
url = "https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz"
urllib.urlretrieve(url, filename="../enron_mail_20150507.tgz")
print "download complete!"

print
print "unzipping Enron dataset (this may take a while)"
import tarfile
import os
os.chdir("..")
tfile = tarfile.open("enron_mail_20150507.tgz", "r:gz")
tfile.extractall(".")
print "you're ready to go!"
But I am getting an error:
checking for nltk
checking for numpy
checking for scipy
checking for sklearn
downloading the Enron dataset (this may take a while)
to check on progress, you can cd up one level, then execute <ls -lthr>
Enron dataset should be last item on the list, along with its current size
download will complete at about 423 MB
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-1-c30fe1ced56a> in <module>()
32 import urllib
33 url = "https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz"
---> 34 urllib.urlretrieve(url, filename="../enron_mail_20150507.tgz")
35 print "download complete!"
36
This is nb_author_id.py:
# %load nb_author_id.py
#!/usr/bin/python
"""
This is the code to accompany the Lesson 1 (Naive Bayes) mini-project.
Use a Naive Bayes Classifier to identify emails by their authors
authors and labels:
Sara has label 0
Chris has label 1
"""
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#########################################################
### your code goes here ###
#########################################################
Error/warning:
C:\Users\jr31964\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
no. of Chris training emails: 7936
no. of Sara training emails: 7884
How do I start the Naive Bayes mini-project, and what prerequisite actions are needed?
Since the course is, I presume, in Python 3, I would suggest making a conda environment with Python 3. You can do this even though your base Python installation is Python 2. This should save you from converting all the Python 3 course code to your Python 2.
conda create --name UdacityCourseEnvironment python=3.6
# to get into your new environment (mac/linux)
source activate UdacityCourseEnvironment
# to get into your new environment (windows)
activate UdacityCourseEnvironment
# When you need new packages inside your new environment
conda install nameOfPackage
Source: Switching between python 2 and 3 with Conda
You made the right decision to go with Anaconda - this solves a bunch of incompatibility issues between Python 2 and Python 3 and the various package dependencies. I did it the hard way and am converting the code to Python 3 (and its dependencies) as I go along, because I want an up-to-date environment and programming skills when I finish; but that's just me.
Obviously, you can ignore that deprecation warning: sklearn 0.19.0 still works. Anyone who tries to run this with 0.20.0 or later will have an issue. But if you find it annoying (like me), you can edit the file tools/email_preprocess.py and change the following lines (originals in comments):
# from sklearn import cross_validation
from sklearn.model_selection import train_test_split
and
#features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)
features_train, features_test, labels_train, labels_test = train_test_split(word_data, authors, test_size=0.1, random_state=42)
Also, some installs depend on others: an earlier successful install (e.g. numpy) can cause the install of another package (e.g. scipy) to fail, because the prerequisite for scipy is numpy+mkl. If you just installed plain numpy, it needs to be uninstalled and replaced. See more on that at https://github.com/scipy/scipy/issues/7221
The next problem I hit was that, on my machine, the volume of the email files in enron_mail_20150507.tgz was so large that it ran for several hours without reaching the completion message:
print "you're ready to go!"
It turns out that my IDE (PyCharm) was indexing the files as they were being unpacked, and this was thrashing the disk. As indexing these text files is unnecessary, I turned it off for the 'maildir' directory. That allowed startup.py to finish.
The error you are encountering with urllib is due to a change in the package: you need to change the import statement to:
import urllib.request
...and then your line 34 (error message above) to:
urllib.request.urlretrieve(url, filename="../enron_mail_20150507.tgz")
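Putting the two changes together, the download and extraction steps of startup.py become the following under Python 3 (a minimal sketch; the remaining print statements in the script must also be converted to print() calls):

import urllib.request
import tarfile
import os

url = "https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz"
urllib.request.urlretrieve(url, filename="../enron_mail_20150507.tgz")
print("download complete!")

os.chdir("..")
tfile = tarfile.open("enron_mail_20150507.tgz", "r:gz")
tfile.extractall(".")
print("you're ready to go!")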
Note also this link on github is very helpful: https://github.com/MLTO/general/wiki/Python-Setup-for-Udacity-ud120-course
The rest of this answer relates to Windows 10, so Linux users can skip this.
The next problem I encountered was that some of the package imports were failing, due to the installs not being correctly optimized for Windows 10. An invaluable resource for resolving this is a set of Windows-optimized .whl (wheel) files, which can be found at http://www.lfd.uci.edu/~gohlke/pythonlibs/
The next problem was that unpacking the .tgz file introduced the probably familiar LF/CRLF line-ending issues between Linux and Windows files. There is a fix for this from #monkshow92 on GitHub: https://github.com/udacity/ud120-projects/issues/46
Apart from that, it was a breeze....

Cython module imports suddenly result in undefined symbol errors on Ubuntu 16.04 after previously working

Cython no longer works properly on my version of Ubuntu. This appears to be related to my installing either CLion or PyCharm (both professional editions), as building worked fine previously. Anaconda version = 3.5.2, using Python 3.5.2. The code is simple:
# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

source = ["cythonsrc/testing.pyx"]
extensions = [Extension("testing", source, language='c++',
                        extra_compile_args=["-std=c++11", "-g"],
                        extra_link_args=["-std=c++11"])]
setup(
    ext_modules=cythonize(extensions)
)
The .pyx file:
# testing.pyx
#!python
# cython: language_level=3, boundscheck=False
# distutils: language=c++

def test2():
    print("HEF")
Build results
python cythonsrc/setup.py build_ext --inplace
/home/name/anaconda3/lib/python3.5/site-packages/Cython/Distutils/old_build_ext.py:30: UserWarning: Cython.Distutils.old_build_ext does not properly handle dependencies and is deprecated.
"Cython.Distutils.old_build_ext does not properly handle dependencies "
running build_ext
Module import results
>>> import testing
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: /home/name/Documents/CythonTest/testing.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE9_M_createERmm
I'm not sure what would cause this issue, and Google is failing me. I've seen people mention g++ versions causing issues like this, but that doesn't appear to be my problem, as no one has had this issue with the g++ version currently installed; it wouldn't matter anyway, since I need the build process to be cross-platform (and thus can't enforce g++ versions). Other people's similar undefined symbol errors were due to setup.py user error, but I've used this process with other projects on other systems and it worked, and now previous projects that built fine no longer build on my system without editing the code.
My inclination is that I must have downloaded or updated something (such as the JetBrains products) that has caused some issue with the normal build process or environment variables on my system.
EDIT:
I should mention that the actual C++ build process works fine for .cpp/.h files outside the context of Cython. I've read people mentioning that Python 2.7 has issues that might cause this, but couldn't see any solutions come out of that (besides, I am not using 2.7).
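For what it's worth, the missing symbol demangles to std::__cxx11::basic_string<...>::_M_create, which points at GCC's dual ABI: the extension was compiled against the new C++11 ABI, while the libstdc++ loaded at import time (e.g. the one bundled with Anaconda) only provides the old one. A commonly suggested workaround, sketched here as an assumption rather than a confirmed fix for this exact setup, is to build the extension against the old ABI:

# setup.py -- same as above, plus the old-ABI define
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

source = ["cythonsrc/testing.pyx"]
extensions = [Extension("testing", source, language='c++',
                        # force the pre-C++11 std::string ABI so the symbols
                        # match an older libstdc++ at import time
                        extra_compile_args=["-std=c++11", "-g",
                                            "-D_GLIBCXX_USE_CXX11_ABI=0"],
                        extra_link_args=["-std=c++11"])]
setup(
    ext_modules=cythonize(extensions)
)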

Importing Numpy in embedded Python c++ application

I would like to have a script invoke NumPy from a Python runtime embedded in a C++ application, by setting the runtime path so that it knows about the numpy module located within site-packages.
However I get the error:
cannot import name 'multiarray'
from \Lib\site-packages\numpy\core\__init__.py on the line
from . import multiarray
I have tried setting os.path to xxx\numpy\core, but it still cannot seem to find the multiarray.pyd file during the import statement.
I have read through similar questions posed but none of the answers seem relevant to my case.
I am using Python 3.4.4 (32 bit) and have installed Numpy 1.11.1 using the wheel
numpy-1.11.1-cp34-none-win32.whl
python -m pip install numpy-1.11.1-cp34-none-win32.whl
Completed without any errors.
It seems like the failure message may be more general than just an incomplete PYTHONPATH?
I also think it might be broader than NumPy, in that ANY .pyd-based package imported from the embedded environment will have this problem?
Any help appreciated.
Did you ensure all your NumPy includes (\numpy\core\include\numpy\) were present during the build? The only time I get those types of errors is when the build can't find all the NumPy includes... although for embedding I found that the entire numpy directory (already built on your build machine) has to be inside a directory included in Py_SetPath (python35.lib;importlibs), assuming importlibs is a directory with NumPy inside, plus anything else you want to bundle.
It seems the answer was to install Python 3.4.1 to match python34.dll, which was version 3.4.1.
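As a quick sanity check for this kind of mismatch, you can run a few lines inside the embedded interpreter to see which Python the C++ host actually linked against (a Windows-specific sketch, since sys.winver only exists there):

import sys

print(sys.version)  # full interpreter version, e.g. 3.4.4
print(sys.winver)   # maps to the DLL name, e.g. '3.4' -> python34.dll
print(sys.path)     # confirm site-packages (and numpy) are visible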