I'm creating a PySpark UDF inside a class-based view, and the function I want to call lives in another class-based view; both of them are in the same file (api.py). But when I inspect the contents of the resulting DataFrame, I get this error:
ModuleNotFoundError: No module named 'api'
I can't understand why this happens; I tried similar code in the pyspark console and it worked fine. A similar question was asked here, but the difference is that I'm trying to do it all in the same file.
This is a piece of my full code:
api.py
class TextMiningMethods():
    def clean_tweet(self, tweet):
        '''
        some logic here
        '''
        return "Hello: " + tweet
class BigDataViewSet(TextMiningMethods, viewsets.ViewSet):

    @action(methods=['post'], detail=False)
    def word_cloud(self, request, *args, **kwargs):
        '''
        some previous logic here
        '''
        spark = SparkSession \
            .builder \
            .master("spark://" + SPARK_WORKERS) \
            .appName('word_cloud') \
            .config("spark.executor.memory", '2g') \
            .config('spark.executor.cores', '2') \
            .config('spark.cores.max', '2') \
            .config("spark.driver.memory", '2g') \
            .getOrCreate()
        spark.sparkContext.addPyFile('path/to/udfFile.py')

        cols = ['text']
        rows = []
        for tweet_account_index, tweet_account_data in enumerate(tweets_list):
            tweet_data_aux_pandas_df = pd.Series(tweet_account_data['tweet']).dropna()
            for tweet_index, tweet in enumerate(tweet_data_aux_pandas_df):
                row = [tweet['text']]
                rows.append(row)

        # Create a pandas DataFrame of tweets
        tweet_pandas_df = pd.DataFrame(rows, columns=cols)

        schema = StructType([
            StructField("text", StringType(), True)
        ])

        # Convert to a Spark DataFrame
        df = spark.createDataFrame(tweet_pandas_df, schema=schema)

        clean_tweet_udf = udf(TextMiningMethods().clean_tweet, StringType())
        clean_tweet_df = df.withColumn("clean_tweet", clean_tweet_udf(df["text"]))
        clean_tweet_df.show()  # This line produces the error
This similar test in the pyspark console works fine:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import udf
def clean_tweet(name):
    return "This is " + name
schema = StructType([StructField("Id", IntegerType(),True),StructField("tweet", StringType(),True)])
data = [[ 1, "tweet 1"],[2,"tweet 2"],[3,"tweet 3"]]
df = spark.createDataFrame(data,schema=schema)
clean_tweet_udf = udf(clean_tweet,StringType())
clean_tweet_df = df.withColumn("clean_tweet", clean_tweet_udf(df["tweet"]))
clean_tweet_df.show()
So these are my questions:
What is this error related to, and how can I fix it?
What is the right way to create a PySpark UDF when you're working with class-based views? Is it bad practice to write the functions you will use as PySpark UDFs in the same file where you call them? (In my case, all my API endpoints, working with Django REST framework.)
Any help will be appreciated; thanks in advance.
UPDATE:
This link and this link explain how to use custom classes with PySpark using SparkContext, but not with SparkSession, which is my case; even so, I used this:
spark.sparkContext.addPyFile('path/to/udfFile.py')
The problem is that I defined the class holding the functions to use as PySpark UDFs in the same file where I'm creating the UDF for the DataFrame (as shown in my code). I couldn't find how to achieve that behaviour when the path passed to addPyFile() is in the same code. In spite of that, I moved my code out and followed these steps (along the way I fixed another error):
Create a new folder called udf
Create a new empty __init__.py file, to turn the directory into a package.
Create a file.py for my udf functions.
core/
udf/
├── __init__.py
├── __pycache__
└── pyspark_udf.py
api/
├── admin.py
├── api.py
├── apps.py
├── __init__.py
In this file, I tried to import the dependencies either at the beginning of the file or inside the function. In all cases I receive ModuleNotFoundError: No module named 'udf'.
pyspark_udf.py
import re
import string
import unidecode
from nltk.corpus import stopwords
class TextMiningMethods():
    """docstring for TextMiningMethods"""
    def clean_tweet(self, tweet):
        # some logic here
I have tried all of these, at the beginning of my api.py file:
from udf.pyspark_udf import TextMiningMethods
# or
from udf.pyspark_udf import *
And inside the word_cloud function
class BigDataViewSet(viewsets.ViewSet):
    def word_cloud(self, request, *args, **kwargs):
        from udf.pyspark_udf import TextMiningMethods
In the Python debugger this line works:
from udf.pyspark_udf import TextMiningMethods
But when I show the DataFrame, I receive the error:
clean_tweet_df.show()
ModuleNotFoundError: No module named 'udf'
Obviously, the original problem has changed into another one; now my problem is more closely related to this question, but I still couldn't find a satisfactory way to import the file and create a PySpark UDF that calls a class function from another class function.
What am I missing?
After different tries, I couldn't find a solution by referencing a module in the path of addPyFile(), whether located in the same file where I was creating the UDF (I would like to know if this is bad practice) or in another file. Technically, the addPyFile(path) documentation says:
Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
So what I mention should be possible. Based on that, I had to use this solution and zip the whole udf folder from its top level with:
zip -r udf.zip udf
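Why zipping from the top level works can be sketched without Spark at all: addPyFile() ships the archive to every executor and puts it on sys.path, and Python can import packages directly out of a zip. A self-contained illustration of that mechanism (the package contents here are a tiny stand-in, not the real pyspark_udf.py):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny stand-in for the udf/ package described above
tmp = tempfile.mkdtemp()
pkg_zip = os.path.join(tmp, 'udf.zip')
with zipfile.ZipFile(pkg_zip, 'w') as zf:
    zf.writestr('udf/__init__.py', '')
    zf.writestr('udf/pyspark_udf.py',
                'class TextMiningMethods:\n'
                '    def clean_tweet(self, tweet):\n'
                '        return "Hello: " + tweet\n')

# This is essentially what each executor does with the shipped archive:
# the zip goes on sys.path and ordinary imports resolve inside it
sys.path.insert(0, pkg_zip)
from udf.pyspark_udf import TextMiningMethods

print(TextMiningMethods().clean_tweet('world'))  # -> Hello: world
```

Because the zip was created from the parent of udf/, the archive contains the package directory itself, which is why the top-level zip matters.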
Also, in pyspark_udf.py I had to import my dependencies inside the method, as below, to avoid this problem:
class TextMiningMethods():
    """docstring for TextMiningMethods"""
    def clean_tweet(self, tweet):
        import re
        import string
        import unidecode
        from nltk.corpus import stopwords
Instead of:
import re
import string
import unidecode
from nltk.corpus import stopwords
class TextMiningMethods():
    """docstring for TextMiningMethods"""
    def clean_tweet(self, tweet):
Then, finally, this line worked fine:
clean_tweet_df.show()
I hope this is useful for anyone else.
Thank you! Your approach worked for me.
Just to clarify my steps:
Made a udf module with __init__.py and pyspark_udfs.py
Made a bash file to zip the udfs first and then run my files at the top level:
runner.sh
echo "zipping udfs..."
zip -r udf.zip udf
echo "udfs zipped"
echo "running script..."
/opt/conda/bin/python runner.py
echo "script ended."
In the actual code, I imported my udfs from the udf.pyspark_udfs module and initialized them in the Python function where I need them, like so:
def _produce_period_statistics(self, df: pyspark.sql.DataFrame, period: str) -> pyspark.sql.DataFrame:
    """Produces basic and trend statistics based on user visits."""
    # udfs
    get_hist_vals_udf = F.udf(lambda array, bins, _range: get_histogram_values(array, bins, _range), ArrayType(IntegerType()))
    get_hist_edges_udf = F.udf(lambda array, bins, _range: get_histogram_edges(array, bins, _range), ArrayType(FloatType()))
    get_mean_udf = F.udf(get_mean, FloatType())
    get_std_udf = F.udf(get_std, FloatType())
    get_lr_coefs_udf = F.udf(lambda bar_height, bar_edges, hist_upper: get_linear_regression_coeffs(bar_height, bar_edges, hist_upper), StringType())
    ...
I transferred some code from IDLE 3.5 (64-bit) to PyCharm (Python 2.7). Most of the code still works; for example, I can import WD_LINE_SPACING from docx.enum.text, but for some reason I can't import WD_ALIGN_PARAGRAPH.
At first, nearly none of the imports worked, but after I did
pip install python-docx
instead of
pip install docx
most of the imports worked except for WD_ALIGN_PARAGRAPH.
# works
from __future__ import print_function
import xlrd
import xlwt
import os
import subprocess
from calendar import monthrange
import datetime
from docx import Document
from datetime import datetime
from datetime import date
from docx.enum.text import WD_LINE_SPACING
from docx.shared import Pt
# does not work
from docx.enum.text import WD_ALIGN_PARAGRAPH
I don't get any error messages, but PyCharm marks the line as an error:
"Cannot find reference 'WD_ALIGN_PARAGRAPH' in 'text.py'".
You can use this instead:
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT
and then substitute WD_PARAGRAPH_ALIGNMENT wherever WD_ALIGN_PARAGRAPH would have appeared before.
The reason this is happening is that the actual enum object is named WD_PARAGRAPH_ALIGNMENT, and a decorator is applied that also allows it to be referenced as WD_ALIGN_PARAGRAPH (which is a little shorter, and possibly clearer). I expect the syntax checker in PyCharm is operating on direct module attributes and doesn't pick up the alias, which is resolved by the Python parser/compiler.
Interestingly, I expect your code would work fine either way. But to get rid of the annoying message you can use the base name.
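For anyone curious, the aliasing pattern can be illustrated in plain Python. This is only a sketch of the idea, not python-docx's actual implementation:

```python
from enum import Enum

class WD_PARAGRAPH_ALIGNMENT(Enum):
    LEFT = 0
    CENTER = 1
    RIGHT = 2

# The alias: one object, two module-level names. A static checker that only
# scans class definitions sees WD_PARAGRAPH_ALIGNMENT but misses this binding,
# while the Python runtime resolves both names to the same enum.
WD_ALIGN_PARAGRAPH = WD_PARAGRAPH_ALIGNMENT

print(WD_ALIGN_PARAGRAPH is WD_PARAGRAPH_ALIGNMENT)  # -> True
print(WD_ALIGN_PARAGRAPH.CENTER.value)               # -> 1
```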
If someone uses pylint it can be easily suppressed with # pylint: disable=E0611 added at the end of the import line.
I am very new to Python and am trying to append some functionality to an existing Python program. I want to read values from a config INI file like this:
[Admin]
AD1 = 1
AD2 = 2
RSW = 3
When I execute the following code from IDLE, it works as it should (I was already able to read in values from the file, but deleted that part for a shorter code snippet):
#!/usr/bin/python
import ConfigParser
# builtin python libs
from time import sleep
import sys

def main():
    print("Test")
    sleep(2)

if __name__ == '__main__':
    main()
But the compiled exe quits before printing and waiting 2 seconds. If I comment out the import of ConfigParser, the exe runs fine.
This is how I compile into exe:
from distutils.core import setup
import py2exe, sys
sys.argv.append('py2exe')
setup(
    options={'py2exe': {'bundle_files': 1}},
    zipfile=None,
    console=['Test.py'],
)
What am I doing wrong? Is there maybe another way to read in a configuration easily, if ConfigParser for some reason doesn't work in a compiled exe?
Thanks in advance for your help!
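For reference, a minimal sketch of reading the [Admin] section shown above. Note the module was renamed to lowercase configparser in Python 3; the sketch below uses the Python 3 name and feeds the INI text from a string only so it is self-contained:

```python
import configparser  # named ConfigParser in Python 2

INI = """\
[Admin]
AD1 = 1
AD2 = 2
RSW = 3
"""

config = configparser.ConfigParser()
config.read_string(INI)  # with a file on disk, use config.read('config.ini')

print(config.get('Admin', 'RSW'))         # -> 3 (values come back as strings)
print(config.getint('Admin', 'AD1') + 1)  # -> 2 (getint converts for you)
```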
Following is a Python code snippet I am running on my server. I want to replicate a file n times, saving it with a different name each time. However, whatever value I give the loop, I always end up with a single replica.
import os
import time
import shutil

os.chdir('server_directory')
src = 'myFile.jpg'
numberofcopies = 10
for i in range(0, numberofcopies):
    print "replicating {0}".format(i+1)
    timestamp = int(round(time.time()))
    dst = '{0}.jpg'.format(timestamp)
    shutil.copy2(src, dst)
Apparently, using a sleep in the main thread and running the loop infinitely until a Ctrl-C interruption solved my problem.
import os
import time
import shutil

os.chdir('server_directory')
src = 'file to replicate.jpg'
i = 0
while True:
    print "replicating {0}".format(i+1)
    timestamp = int(round(time.time()))
    dst = '{0}.jpg'.format(timestamp)
    shutil.copy2(src, dst)
    i = i + 1
    time.sleep(1)
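For context on why the original loop produced only one file: int(round(time.time())) changes once per second, so every iteration of a fast loop computes the same destination name and keeps overwriting the same copy. A sketch that makes each name unique without sleeping (the scratch directory and dummy source file are only there so the snippet runs anywhere):

```python
import os
import shutil
import tempfile
import time

# Work in a scratch directory with a dummy source file
os.chdir(tempfile.mkdtemp())
with open('myFile.jpg', 'wb') as f:
    f.write(b'dummy image bytes')

numberofcopies = 10

# Appending the loop index makes each name unique even within one second
for i in range(numberofcopies):
    timestamp = int(round(time.time()))
    dst = '{0}_{1}.jpg'.format(timestamp, i)
    shutil.copy2('myFile.jpg', dst)

copies = [n for n in os.listdir('.') if n != 'myFile.jpg']
print(len(copies))  # -> 10
```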
This should replicate my file 5 times with different names.
import shutil

for r in range(5):
    original = r'transactions.xlsx'
    target = f'Newfile{r}.xlsx'
    shutil.copyfile(original, target)
    print(f'Action: replicating {original} as {target}')
I am trying to understand how to import code from one file to another. I have two files, file1.py and file2.py. I run code in the first file and have many variables and functions defined in the second. I am using "from file2 import *" to import the code into file1.py. I have no problem using variables defined in file2.py from file1.py, but with functions I get NameError: name 'myfunc' is not defined when I try to use a function in file1.py. I can fix this by writing "from file2 import myfunc", but I thought the * would import everything from that file. What is the difference for functions versus variables?
I have tried to recreate the setup you have described and it is working OK for me. Hopefully this will give you an idea of how to get it to work.
# file1.py #####################################
import sys
sys.path.append("/home/neko/test/")
import file2

if __name__ == "__main__":
    file2.testfunc()

# file2.py ######################################
testvar = 'hello'

def testfunc():
    print testvar
For this test I was using Python version 2.6.6.
Both file1.py and file2.py are in /home/neko/test/.
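One common cause of the asker's symptom, which the snippets above cannot show, is an __all__ list in file2.py that omits the function: a star import honours __all__. A self-contained sketch (the module is built in memory purely for illustration):

```python
import sys
import types

# Simulate a file2.py whose __all__ lists the variable but not the function
file2 = types.ModuleType('file2')
exec(
    "__all__ = ['testvar']\n"   # myfunc is deliberately left out
    "testvar = 'hello'\n"
    "def myfunc():\n"
    "    return testvar\n",
    file2.__dict__,
)
sys.modules['file2'] = file2

# A star import only binds the names listed in __all__
ns = {}
exec('from file2 import *', ns)
print('testvar' in ns)  # -> True
print('myfunc' in ns)   # -> False: excluded by __all__
```

Without an __all__, a star import brings in every public (non-underscore) name, functions included, which matches the working test above.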