I'm creating a PySpark UDF inside a class-based view, and I have the function that I want to call inside another class-based view. Both of them are in the same file (api.py), but when I inspect the content of the resulting dataframe, I get this error:
ModuleNotFoundError: No module named 'api'
I can't understand why this happens. I tried similar code in the pyspark console and it worked fine. A similar question was asked here, but the difference is that I'm trying to do it in the same file.
This is a piece of my full code:
api.py
class TextMiningMethods():
    def clean_tweet(self, tweet):
        '''
        some logic here
        '''
        return "Hello: " + tweet
class BigDataViewSet(TextMiningMethods, viewsets.ViewSet):

    @action(methods=['post'], detail=False)
    def word_cloud(self, request, *args, **kwargs):
        '''
        some previous logic here
        '''
        spark = SparkSession \
            .builder \
            .master("spark://"+SPARK_WORKERS) \
            .appName('word_cloud') \
            .config("spark.executor.memory", '2g') \
            .config('spark.executor.cores', '2') \
            .config('spark.cores.max', '2') \
            .config("spark.driver.memory", '2g') \
            .getOrCreate()
        spark.sparkContext.addPyFile('path/to/udfFile.py')

        cols = ['text']
        rows = []
        for tweet_account_index, tweet_account_data in enumerate(tweets_list):
            tweet_data_aux_pandas_df = pd.Series(tweet_account_data['tweet']).dropna()
            for tweet_index, tweet in enumerate(tweet_data_aux_pandas_df):
                row = [tweet['text']]
                rows.append(row)

        # Create a Pandas DataFrame of tweets
        tweet_pandas_df = pd.DataFrame(rows, columns=cols)

        schema = StructType([
            StructField("text", StringType(), True)
        ])

        # Convert to a Spark DataFrame
        df = spark.createDataFrame(tweet_pandas_df, schema=schema)

        clean_tweet_udf = udf(TextMiningMethods().clean_tweet, StringType())
        clean_tweet_df = df.withColumn("clean_tweet", clean_tweet_udf(df["text"]))
        clean_tweet_df.show()  # This line produces the error
This similar test in the pyspark console works fine:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import udf

def clean_tweet(name):
    return "This is " + name

schema = StructType([StructField("Id", IntegerType(), True), StructField("tweet", StringType(), True)])
data = [[1, "tweet 1"], [2, "tweet 2"], [3, "tweet 3"]]

df = spark.createDataFrame(data, schema=schema)

clean_tweet_udf = udf(clean_tweet, StringType())
clean_tweet_df = df.withColumn("clean_tweet", clean_tweet_udf(df["tweet"]))
clean_tweet_df.show()
So these are my questions:
What is this error related to, and how can I fix it?
What is the right way to create a PySpark UDF when you're working with a class-based view? Is it bad practice to write the functions that you will use as PySpark UDFs in the same file where you will call them? (In my case, all my API endpoints use Django REST Framework.)
Any help will be appreciated, thanks in advance
UPDATE:
This link and this link explain how to use custom classes with pyspark using SparkContext, but not with SparkSession, which is my case, but I used this:
spark.sparkContext.addPyFile('path/to/udfFile.py')
The problem is that I defined the class with the functions I want to use as pyspark udfs in the same file where I'm creating the udf for the dataframe (as I showed in my code). I couldn't find how to achieve that behaviour when the path given to addPyFile() refers to the same file that is running the code. In spite of that, I moved my code and followed these steps (that was another error that I fixed):
Create a new folder called udf.
Create a new empty __init__.py file, to make the directory a package.
Create a file.py for my udf functions.
core/
udf/
├── __init__.py
├── __pycache__
└── pyspark_udf.py
api/
├── admin.py
├── api.py
├── apps.py
├── __init__.py
In this file, I tried to import the dependencies either at the beginning or inside the function. In all the cases I receive ModuleNotFoundError: No module named 'udf'
pyspark_udf.py
import re
import string
import unidecode
from nltk.corpus import stopwords

class TextMiningMethods():
    """docstring for TextMiningMethods"""
    def clean_tweet(self, tweet):
        # some logic here
I have tried all of these. At the beginning of my api.py file:
from udf.pyspark_udf import TextMiningMethods
# or
from udf.pyspark_udf import *
And inside the word_cloud function:
class BigDataViewSet(viewsets.ViewSet):
    def word_cloud(self, request, *args, **kwargs):
        from udf.pyspark_udf import TextMiningMethods
In the python debugger this line works:
from udf.pyspark_udf import TextMiningMethods
But when I show the dataframe, I receive the error:
clean_tweet_df.show()
ModuleNotFoundError: No module named 'udf'
Obviously, the original problem changed into another one. Now my problem is more related to this question, but I still couldn't find a satisfactory way to import the file and create a pyspark udf that calls a class function from another class function.
What am I missing?
After different tries, I couldn't find a solution where the path passed to addPyFile() references a method located in the same file where I was creating the udf (I would like to know if this is a bad practice) or in another file. Technically, the addPyFile(path) documentation says:
Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
So what I mention should be possible. Based on that, I had to use this solution and zip the whole udf folder from its highest level with:
zip -r udf.zip udf
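Putting it together, the relevant part of api.py then looks roughly like this (the path to udf.zip is an assumption, adjust it to wherever you keep the zip):
# register the zipped package so the executors can import it
# (the zip path here is a placeholder)
spark.sparkContext.addPyFile('path/to/udf.zip')

# import from the zipped package and build the udf as before
from udf.pyspark_udf import TextMiningMethods

clean_tweet_udf = udf(TextMiningMethods().clean_tweet, StringType())
clean_tweet_df = df.withColumn("clean_tweet", clean_tweet_udf(df["text"]))
clean_tweet_df.show()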
Also, in the pyspark_udf.py I had to import my dependencies as below to avoid this problem
class TextMiningMethods():
    """docstring for TextMiningMethods"""
    def clean_tweet(self, tweet):
        import re
        import string
        import unidecode
        from nltk.corpus import stopwords
Instead of:
import re
import string
import unidecode
from nltk.corpus import stopwords

class TextMiningMethods():
    """docstring for TextMiningMethods"""
    def clean_tweet(self, tweet):
Then, finally, this line worked fine:
clean_tweet_df.show()
I hope this is useful for anyone else.
Thank you! Your approach worked for me.
Just to clarify my steps:
Made a udf module with __init__.py and pyspark_udfs.py
Made a bash file to zip the udfs first and then run my files from the top level:
runner.sh
echo "zipping udfs..."
zip -r udf.zip udf
echo "udfs zipped"
echo "running script..."
/opt/conda/bin/python runner.py
echo "script ended."
In the actual code, I imported my udfs from the udf.pyspark_udfs module and initialized them in the Python function where I need them, like so:
def _produce_period_statistics(self, df: pyspark.sql.DataFrame, period: str) -> pyspark.sql.DataFrame:
    """ Produces basic and trend statistics based on user visits."""
    # udfs
    get_hist_vals_udf = F.udf(lambda array, bins, _range: get_histogram_values(array, bins, _range), ArrayType(IntegerType()))
    get_hist_edges_udf = F.udf(lambda array, bins, _range: get_histogram_edges(array, bins, _range), ArrayType(FloatType()))
    get_mean_udf = F.udf(get_mean, FloatType())
    get_std_udf = F.udf(get_std, FloatType())
    get_lr_coefs_udf = F.udf(lambda bar_height, bar_edges, hist_upper: get_linear_regression_coeffs(bar_height, bar_edges, hist_upper), StringType())
    ...
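Just for illustration, applying those udfs afterwards looks something like this (the visits column name is a made-up example, not from my real data):
# hypothetical usage: "visits" is an assumed column name
df = df.withColumn("visits_mean", get_mean_udf(F.col("visits"))) \
       .withColumn("visits_std", get_std_udf(F.col("visits")))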
I want to access my modules.module function in my main, but when I do this I get an error that it cannot be imported. How can I fix it? I have seen multiple articles but haven't had any luck. How can I fix this error when importing modules from subfolders?
/tool
    main.py
    /google
        /modules
            __init__.py
            module.py
ImportError: cannot import name google
main.py
#!/usr/bin/python
import sys
import core.settings
from google.modules import google

if __name__ == '__main__':
    try:
        core.settings.settings()
        google()
    except KeyboardInterrupt:
        print "interrupted by user.."
    except:
        sys.exit()
module.py
def google():
    print 'A'
The easiest way to work this out is to have your main.py in your highest directory (it makes sense anyway for main to be there). If you really want the real main to be in a subdirectory, have a dummy main at the top level that just calls the actual main. That way Python can see your entire directory tree and will know how to import any subdirectory.
Alternatively,
you could add your parent directory to sys.path:
import os
import sys

parent_dir = os.path.realpath(os.path.join(os.getcwd(), '..'))
sys.path.append(parent_dir)
which will add that directory to the places Python searches when you try to import stuff.
But then you will need to keep track of how deep your main is in the directory tree, and it's a pretty inelegant solution in my opinion.
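For completeness, a self-contained sketch of that second approach using the directory names from the question (this assumes main.py sits one level below the folder that contains the google package, and that google/ and modules/ both have an __init__.py):
import os
import sys

# put the parent directory on the module search path before the project imports
parent_dir = os.path.realpath(os.path.join(os.getcwd(), '..'))
sys.path.append(parent_dir)

from google.modules.module import google  # the function actually lives in module.py

if __name__ == '__main__':
    google()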
I am having a tricky ImportError when packaging some modules, while trying to support both Python 2.7 and 3.6. There is nothing in the code fancy enough to exclude either of these versions, so I thought I would try. The repo is at https://github.com/raamana/neuropredict
It worked fine before on 2.7 alone. To support both 2.7 and 3.6,
after much googling and head-scratching, I thought I had figured out how to do that with the following import code at the top of each of the modules in my package:
from sys import version_info

if version_info.major == 2 and version_info.minor == 7:
    import rhst, visualize
    from freesurfer import aseg_stats_subcortical, aseg_stats_whole_brain
    import config_neuropredict as cfg
elif version_info.major > 2:
    from neuropredict import rhst, visualize
    from neuropredict.freesurfer import aseg_stats_subcortical, aseg_stats_whole_brain
    from neuropredict import config_neuropredict as cfg
else:
    raise NotImplementedError('neuropredict supports only 2.7 or Python 3+. Upgrade to Python 3+ is recommended.')
The directory structure is :
$ 22:19:54 miner neuropredict >> tree
.
├── config_neuropredict.py
├── freesurfer.py
├── __init__.py
├── __main__.py
├── model_comparison.py
├── neuropredict.py
├── rhst.py
├── test_rhst.py
└── visualize.py
The __init__.py looks like this:
__all__ = ['neuropredict', 'rhst', 'visualize', 'freesurfer',
           'config_neuropredict', 'model_comparison']

from sys import version_info

if version_info.major == 2 and version_info.minor == 7:
    import neuropredict, config_neuropredict, rhst, visualize, freesurfer, model_comparison
elif version_info.major > 2:
    from neuropredict import neuropredict, config_neuropredict, rhst, visualize, freesurfer, model_comparison
else:
    raise NotImplementedError('neuropredict supports only 2.7 or Python 3+. Upgrade to Python 3+ is recommended.')
So when I run the unit tests locally or on CI, it works fine with the above import mechanism.
$ 22:19:25 miner neuropredict >> pytest test_rhst.py
=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.6.1, pytest-3.2.1, py-1.4.34, pluggy-0.4.0
rootdir: /data1/strother_lab/praamana/neuropredict, inifile:
plugins: hypothesis-3.23.2
collected 1 item
test_rhst.py .
============================================================================================ 1 passed in 8.73 seconds =============================================================================================
$ 22:19:42 miner neuropredict >>
However, when I run neuropredict.py directly, it throws an error:
$ 22:19:57 miner neuropredict >> python ./neuropredict.py
Traceback (most recent call last):
File "neuropredict.py", line 23, in <module>
from neuropredict import rhst, visualize
File "/data1/strother_lab/praamana/neuropredict/neuropredict/neuropredict.py", line 23, in <module>
from neuropredict import rhst, visualize
ImportError: cannot import name 'rhst'
$ 22:29:39 miner neuropredict >> pwd
/data1/strother_lab/praamana/neuropredict/neuropredict
$ 22:29:43 miner neuropredict >>
This is killing me - I need to figure out where I am making a mistake, or whether I am doing something sacrilegious.
Question 1: why am I not able to
python ./neuropredict.py
and get some parser help, when the test scripts can import it successfully? This used to work in many other scenarios. It's driving me crazy, as I can't understand what is going on.
This is what I need to be able to achieve:
import config_neuropredict.py into all other modules
import config_neuropredict, rhst, visualize, freesurfer, model_comparison into the neuropredict.py module
make this reliably work in Python 3+ (2.7 support not important)
understand python's behaviour during Question 1 above.
If you need more details or into the code, I would appreciate if you can quickly head over to this public repo: https://github.com/raamana/neuropredict
I am having problems using Sphinx to generate documentation for a Flask app. Without going into the specific details of the app, its basic structure looks as follows.
__all__ = ['APP']

<python 2 imports>
<flask imports>
<custom imports>

APP = None  # module-level variable to store the Flask app
<other module level variables>

# App initialisation
def init():
    global APP
    APP = Flask(__name__)
    <other initialisation code>

try:
    init()
except Exception as e:
    logger.exception(str(e))

@APP.route(os.path.join(<service base url>, <request action>), methods=["POST"])
<request action handler>

if __name__ == '__main__':
    init()
    APP.run(debug=True, host='0.0.0.0', port=5000)
I've installed Sphinx in a venv along with other packages needed for the web service, and the build folder is within a docs subfolder which looks like this
docs
├── Makefile
├── _build
├── _static
├── _templates
├── conf.py
├── index.rst
├── introduction.rst
└── make.bat
The conf.py was generated by running sphinx-quickstart and it contains the line
autodoc_mock_imports = [<external imports to ignore>]
to ensure that Sphinx will ignore the listed external imports. The index.rst is standard
.. myapp documentation master file, created by
   sphinx-quickstart on Fri Jun 16 12:35:40 2017.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to myapp's documentation!
=============================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   introduction

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
and I've added an introduction.rst page to document the app members
===================
`myapp`
===================

Oasis Flask app that handles keys requests.

myapp.app
---------------------

.. automodule:: myapp.app
   :members:
   :undoc-members:
When I run make html in docs, I get HTML output in the _build subfolder, but I also get the following warning:
WARNING: /path/to/myapp/docs/introduction.rst:10: (WARNING/2) autodoc: failed to import module u'myapp.app'; the following exception was raised:
Traceback (most recent call last):
File "/path/to/myapp/venv/lib/python2.7/site-packages/sphinx/ext/autodoc.py", line 657, in import_object
__import__(self.modname)
File "/path/to/myapp/__init__.py", line 4, in <module>
from .app import APP
File "/path/to/myapp/app.py", line 139, in <module>
@APP.route(os.path.join(<service base url>, <request action>), methods=['GET'])
File "/path/to/myapp/venv/lib/python2.7/posixpath.py", line 70, in join
elif path == '' or path.endswith('/'):
AttributeError: 'NoneType' object has no attribute 'endswith'
and I am not seeing the documentation I expect for the app members, like the request handler and the app init method.
I don't know what the problem is; any help would be appreciated.
Try using sphinx-apidoc to automatically generate Sphinx sources that, using the autodoc extension, document a whole package in the style of other automatic API documentation tools. You will need to add 'sphinx.ext.autodoc' to your list of Sphinx extensions in your conf.py, too.
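For example, the conf.py change could be as small as this (the mocked module names below are placeholders for whatever external imports your app pulls in):
# docs/conf.py -- only the autodoc-related lines are shown
extensions = ['sphinx.ext.autodoc']

# external packages that should not be imported while building the docs
autodoc_mock_imports = ['flask']  # placeholder list
After that, running sphinx-apidoc with an output directory and the package path (sphinx-apidoc -o <output_dir> <package_dir>) generates the per-module .rst stubs that autodoc then fills in.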
I'm trying to import items from the .items file with from test_1.items import Items, but I keep getting the "No module named test_1.items" error. My structure is:
test_1/
test_1/test_2/
test_1/scrapy.cfg
test_1/test_2/spiders/
test_1/test_2/__init__.py
test_1/test_2/items.py
test_1/test_2/pipelines.py
test_1/test_2/settings.py
test_1/test_2/spiders/__init__.py
test_1/test_2/spiders/today_spider.py
I'm coding in today_spider; after I modify the items.py file I get this ImportError: No module named... error. As you can see, I tried to change the names so they are not identical. I also tried to start today_spider with from __future__ import absolute_import. Any advice?
Thanks a lot for your time.