How to create a pyspark udf, calling a class function from another class function in the same file? - django

I'm creating a pyspark udf inside a class based view and I have the function what I want to call, inside another class based view, both of them are in the same file (api.py), but when I inspect the content of the dataframe resulting, I get this error:
ModuleNotFoundError: No module named 'api'
I can't understand why this happens, I tried to do a similar code in the pyspark console and it worked good. A similar question was asked here but the difference is that I'm trying to do that in the same file.
This a piece of my full code:
api.py
class TextMiningMethods():
def clean_tweet(self,tweet):
'''
some logic here
'''
return "Hello: "+tweet
class BigDataViewSet(TextMiningMethods,viewsets.ViewSet):
#action(methods=['post'], detail=False)
def word_cloud(self, request, *args, **kwargs):
'''
some previous logic here
'''
spark=SparkSession \
.builder \
.master("spark://"+SPARK_WORKERS) \
.appName('word_cloud') \
.config("spark.executor.memory", '2g') \
.config('spark.executor.cores', '2') \
.config('spark.cores.max', '2') \
.config("spark.driver.memory",'2g') \
.getOrCreate()
sc.sparkContext.addPyFile('path/to/udfFile.py')
cols = ['text']
rows = []
for tweet_account_index, tweet_account_data in enumerate(tweets_list):
tweet_data_aux_pandas_df = pd.Series(tweet_account_data['tweet']).dropna()
for tweet_index,tweet in enumerate(tweet_data_aux_pandas_df):
row= [tweet['text']]
rows.append(row)
# Create a Pandas Dataframe of tweets
tweet_pandas_df = pd.DataFrame(rows, columns = cols)
schema = StructType([
StructField("text", StringType(),True)
])
# Converts to Spark DataFrame
df = spark.createDataFrame(tweet_pandas_df,schema=schema)
clean_tweet_udf = udf(TextMiningMethods().clean_tweet, StringType())
clean_tweet_df = df.withColumn("clean_tweet", clean_tweet_udf(df["text"]))
clean_tweet_df.show() # This line produces the error
This similar test in pyspark works good
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import udf
def clean_tweet(name):
return "This is " + name
schema = StructType([StructField("Id", IntegerType(),True),StructField("tweet", StringType(),True)])
data = [[ 1, "tweet 1"],[2,"tweet 2"],[3,"tweet 3"]]
df = spark.createDataFrame(data,schema=schema)
clean_tweet_udf = udf(clean_tweet,StringType())
clean_tweet_df = df.withColumn("clean_tweet", clean_tweet_udf(df["tweet"]))
clean_tweet_df.show()
So these are my questions:
What is this error related to? and How can I fix it?
What is the right way to create a pyspark udf when you're working with class based view? is a wrong practice to write functions that you will use as pyspark udf, in the same file where you will call them? (in my case, all my api endpoints, working with django rest framework)
Any help will be appreciated, thanks in advance
UPDATE:
This link and this link explains how to use custom classes with pyspark using SparkContext, but not with SparkSession that is my case , but I used this:
sc.sparkContext.addPyFile('path/to/udfFile.py')
The problem is that I defined the class where I have the functions to use as pyspark udf, in the same file where I'm creating the udf function for the dataframe (as a showed in my code). I couldn't found how to reach that behaviour when the path of addPyFile() is in the same code. In spite of that, I moved my code and I followed these steps (that was another error that I fixed):
Create a new folder called udf
Create a new empty __ini__.py file, to make the directory to a package.
And create a file.py for my udf functions.
core/
udf/
├── __init__.py
├── __pycache__
└── pyspark_udf.py
api/
├── admin.py
├── api.py
├── apps.py
├── __init__.py
In this file, I tried to import the dependencies either at the beginning or inside the function. In all the cases I receive ModuleNotFoundError: No module named 'udf'
pyspark_udf.py
import re
import string
import unidecode
from nltk.corpus import stopwords
class TextMiningMethods():
"""docstring for TextMiningMethods"""
def clean_tweet(self,tweet):
# some logic here
I have tried with all of these, At the beginning of my api.py file
from udf.pyspark_udf import TextMiningMethods
# or
from udf.pyspark_udf import *
And inside the word_cloud function
class BigDataViewSet(viewsets.ViewSet):
def word_cloud(self, request, *args, **kwargs):
from udf.pyspark_udf import TextMiningMethods
In the python debugger this line works:
from udf.pyspark_udf import TextMiningMethods
But when I show the dataframe, i receive the error:
clean_tweet_df.show()
ModuleNotFoundError: No module named 'udf'
Obviously, the original problem changed to another, now my problem is more related with this question, but I couldn't find a satisfactory way to import the file yet and create a pyspark udf callinf a class function from another class function.
What I'm missing?

After different tries, I couldn't find a solution by referencing to a method in the path of addPyFile(), located in the same file where I was creating the udf (I would like to know if this is a bad practice) or in another file, technically addPyFile(path) documentation says:
Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
So what I mention should be possible. Based on that, I had to used this solution and zip all the udf folder from it's highest level with:
zip -r udf.zip udf
Also, in the pyspark_udf.py I had to import my dependencies as below to avoid this problem
class TextMiningMethods():
"""docstring for TextMiningMethods"""
def clean_tweet(self,tweet):
import re
import string
import unidecode
from nltk.corpus import stopwords
Instead of:
import re
import string
import unidecode
from nltk.corpus import stopwords
class TextMiningMethods():
"""docstring for TextMiningMethods"""
def clean_tweet(self,tweet):
Then, finally this line worked good:
clean_tweet_df.show()
I hope this could be useful for anyone else

Thank you! Your approach worked for me.
Just to clarify my steps:
Made a udf module with __init__.py and pyspark_udfs.py
Made a bash file to zip udfs first and then run my files on the top level:
runner.sh
echo "zipping udfs..."
zip -r udf.zip udf
echo "udfs zipped"
echo "running script..."
/opt/conda/bin/python runner.py
echo "script ended."
In actual code imported my udfs from udf.pyspark_udfs module and initialized my udfs in the python function I need, like so:
def _produce_period_statistics(self, df: pyspark.sql.DataFrame, period: str) -> pyspark.sql.DataFrame:
""" Produces basic and trend statistics based on user visits."""
# udfs
get_hist_vals_udf = F.udf(lambda array, bins, _range: get_histogram_values(array, bins, _range), ArrayType(IntegerType()))
get_hist_edges_udf = F.udf(lambda array, bins, _range: get_histogram_edges(array, bins, _range), ArrayType(FloatType()))
get_mean_udf = F.udf(get_mean, FloatType())
get_std_udf = F.udf(get_std, FloatType())
get_lr_coefs_udf = F.udf(lambda bar_height, bar_edges, hist_upper: get_linear_regression_coeffs(bar_height, bar_edges, hist_upper), StringType())
...

Related

How we can use SFTPToGCSOperator in GCP composer enviornment(1.10.6)?

Here I want to use SFTPToGCSOperator in composer enviornment(1.10.6) of GCP. I know there is a limitation because The operator present only in latest version of airflow not in composer latest version 1.10.6.
See the refrence -
https://airflow.readthedocs.io/en/latest/howto/operator/gcp/sftp_to_gcs.html
I found the alternative of operator and I created a plugin class, But again I faced the issue for sftphook class, Now I am using older version of sftphook class.
see the below refrence -
from airflow.contrib.hooks.sftp_hook import SFTPHook
https://airflow.apache.org/docs/stable/_modules/airflow/contrib/hooks/sftp_hook.html
I have created a plugin class, later It's import in my DAG script. It's working fine only when we are moveing one file, In that case we need to pass complete file path with extension.
Please refer below example(It's working fine in this scenrio)
DIR = "/test/sftp_dag_test/source_dir"
OBJECT_SRC_1 = "file.csv"
source_path=os.path.join(DIR, OBJECT_SRC_1),
Except this If we are using wildcard, I mean if we want to move all the files from directory I am getting error for get_tree_map method.
Please see below DAG code
import os
from airflow import models
from airflow.models import Variable
from PluginSFTPToGCSOperator import SFTPToGCSOperator
#from airflow.contrib.operators.sftp_to_gcs import SFTPToGCSOperator
from airflow.utils.dates import days_ago
default_args = {"start_date": days_ago(1)}
DIR_path = "/main_dir/sub_dir/"
BUCKET_SRC = "test-gcp-bucket"
with models.DAG(
"dag_sftp_to_gcs", default_args=default_args, schedule_interval=None
) as dag:
copy_sftp_to_gcs = SFTPToGCSOperator(
task_id="t_sftp_to_gcs",
sftp_conn_id="test_sftp_conn",
gcp_conn_id="google_cloud_default",
source_path=os.path.join(DIR_path, "*.gz"),
destination_bucket=BUCKET_SRC,
)
copy_sftp_to_gcs
Here we are using wildcard * in DAG script, please see below plugin class.
import os
from tempfile import NamedTemporaryFile
from typing import Optional, Union
from airflow.plugins_manager import AirflowPlugin
from airflow import AirflowException
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.models import BaseOperator
from airflow.contrib.hooks.sftp_hook import SFTPHook
from airflow.utils.decorators import apply_defaults
WILDCARD = "*"
class SFTPToGCSOperator(BaseOperator):
template_fields = ("source_path", "destination_path", "destination_bucket")
#apply_defaults
def __init__(
self,
source_path: str,
destination_bucket: str = "destination_bucket",
destination_path: Optional[str] = None,
gcp_conn_id: str = "google_cloud_default",
sftp_conn_id: str = "sftp_conn_plugin",
delegate_to: Optional[str] = None,
mime_type: str = "application/octet-stream",
gzip: bool = False,
move_object: bool = False,
*args,
**kwargs
) -> None:
super().__init__(*args, **kwargs)
self.source_path = source_path
self.destination_path = self._set_destination_path(destination_path)
print('destination_bucket : ',destination_bucket)
self.destination_bucket = destination_bucket
self.gcp_conn_id = gcp_conn_id
self.mime_type = mime_type
self.delegate_to = delegate_to
self.gzip = gzip
self.sftp_conn_id = sftp_conn_id
self.move_object = move_object
def execute(self, context):
print("inside execute")
gcs_hook = GoogleCloudStorageHook(
google_cloud_storage_conn_id=self.gcp_conn_id, delegate_to=self.delegate_to
)
sftp_hook = SFTPHook(self.sftp_conn_id)
if WILDCARD in self.source_path:
total_wildcards = self.source_path.count(WILDCARD)
if total_wildcards > 1:
raise AirflowException(
"Only one wildcard '*' is allowed in source_path parameter. "
"Found {} in {}.".format(total_wildcards, self.source_path)
)
print('self.source_path : ',self.source_path)
prefix, delimiter = self.source_path.split(WILDCARD, 1)
print('prefix : ',prefix)
base_path = os.path.dirname(prefix)
print('base_path : ',base_path)
files, _, _ = sftp_hook.get_tree_map(
base_path, prefix=prefix, delimiter=delimiter
)
for file in files:
destination_path = file.replace(base_path, self.destination_path, 1)
self._copy_single_object(gcs_hook, sftp_hook, file, destination_path)
else:
destination_object = (
self.destination_path
if self.destination_path
else self.source_path.rsplit("/", 1)[1]
)
self._copy_single_object(
gcs_hook, sftp_hook, self.source_path, destination_object
)
def _copy_single_object(
self,
gcs_hook: GoogleCloudStorageHook,
sftp_hook: SFTPHook,
source_path: str,
destination_object: str,
) -> None:
"""
Helper function to copy single object.
"""
self.log.info(
"Executing copy of %s to gs://%s/%s",
source_path,
self.destination_bucket,
destination_object,
)
with NamedTemporaryFile("w") as tmp:
sftp_hook.retrieve_file(source_path, tmp.name)
print('before upload self det object : ',self.destination_bucket)
gcs_hook.upload(
self.destination_bucket,
destination_object,
tmp.name,
self.mime_type,
)
if self.move_object:
self.log.info("Executing delete of %s", source_path)
sftp_hook.delete_file(source_path)
#staticmethod
def _set_destination_path(path: Union[str, None]) -> str:
if path is not None:
return path.lstrip("/") if path.startswith("/") else path
return ""
#staticmethod
def _set_bucket_name(name: str) -> str:
bucket = name if not name.startswith("gs://") else name[5:]
return bucket.strip("/")
class SFTPToGCSOperatorPlugin(AirflowPlugin):
name = "SFTPToGCSOperatorPlugin"
operators = [SFTPToGCSOperator]
So this plugin class I am importing in my DAG script and it's wotking fine when we are using file name, Because code is going inside else condition.
But when we are using wildcard we have cursor inside if condition and I am getting error for get_tree_map method.
see below error -
ERROR - 'SFTPHook' object has no attribute 'get_tree_map'
I found the reason of this error this method itself is not present in composer(airflow 1.10.6)-
https://airflow.apache.org/docs/stable/_modules/airflow/contrib/hooks/sftp_hook.html
This method is present in latest version of airflow
https://airflow.readthedocs.io/en/latest/_modules/airflow/providers/sftp/hooks/sftp.html
Now What should I can try, Is there any alternative of this method or any alternative of this operator class.
Does anyone know if there is a solution for this?
Thanks in Advance.
Please ignore Typo or indentation error in stackoverflow. In my code there is no Indentation error.
"providers" packages are only available from Airflow 2.0, which is not yet available in Cloud Composer (as I write this post, the latest available Airflow image is 1.10.14, released this morning).
BUT you can import backport packages which let you enjoy these new packages in earlier versions 1.10.*.
My requirements.txt:
apache-airflow-backport-providers-ssh==2020.10.29
apache-airflow-backport-providers-sftp==2020.10.29
pysftp>=0.2.9
paramiko>=2.6.0
sshtunnel<0.2,>=0.1.4
You can import PyPi packages directly in your Composer environment from the console.
With these dependencies, I could use the newest airflow.providers.ssh.operators.ssh.SSHOperator (formerly airflow.contrib.operators.ssh_operator.SSHOperator) and the new airflow.providers.google.cloud.transfers.gcs_to_sftp.GCSToSFTPOperator (which had no equivalent in contrib operators).
Enjoy!
To use SFTPToGCSOperator in Google Cloud Composer on Airflow version 1.10.6 we need to create a plugin and somehow "hack" Airflow by copying operator/hook codes into one file to enable SFTPToGCSOperator use code from Airflow 1.10.10 version.
The latest Airflow version has a new airflow.providers directory, which does not exist in earlier versions. This is why you saw following error: No module named airflow.providers. All the changes I made are described here:
I prepared working plugin, which you can download here. Before using it, we have to install following PyPI libraries on the Cloud Composer environment: pysftp, paramiko, sshtunnel.
I copied full SFTPToGCSOperator code, which starts in 792nd line. You can see that this operator uses GCSHook:
from airflow.providers.google.cloud.hooks.gcs import GCSHook
which also need to be copied to the plugin - starts in 193rd line.
Then, GCSHook inherits from GoogleBaseHook class, which we can change for GoogleCloudBaseHook accessible in Airflow 1.10.6 version, and import it:
from airflow.contrib.hooks.gcp_api_base_hook import GoogleCloudBaseHook
Finally, there is a need to import SFTPHook code into the plugin - starts in 39th line, which inherits from SSHHook class, we can use one from Airflow 1.10.6 version by changing import statement:
from airflow.contrib.hooks.ssh_hook import SSHHook
At the end of file, you can find the definition of the plugin:
class SFTPToGCSOperatorPlugin(AirflowPlugin):
name = "SFTPToGCSOperatorPlugin"
operators = [SFTPToGCSOperator]
Plugin creation is needed, as an Airflow built-in operator is not currently available in Airflow 1.10.6 version (the latest in Cloud Composer). You can keep an eye on Cloud Composer version lists in order to see when the newest version of Airflow will be available to use.
I hope you find the above pieces of information useful.

Python packaging can't import class handler

I have two packages (diretories ) in my Python project
src
/textmining
mining.py...def mining():#...
__init.py....__all__ = ["mining"]
/crawler
crawler.py
in crawler.py I use the mining class
mining=mining()
main.py
__init__.py
my main.py is as follow:
scrapy_command = 'scrapy runspider {spider_name} -a crawling_level="{param_1}"'.format(spider_name='crawler/crawler.py',
param_1=crawling_level)
process = subprocess.Popen(scrapy_command, shell=True)
when I run crawler, it prompts
runspider: error: Unable to load 'Crawler.py': cannot import name mining
You need an __init__.py in every folder that's a part of the same package module.
src
__init__.py
/textmining
__init__.py
mining.py
/crawler
__init__.py
crawler.py
For simplicity, you should add a main.py in the src folder and call the function you want to start your program with from there as it's fairly difficult to import modules from sibling directories if you start your script in a non-root directory.
main.py
from crawler import crawler
crawler.start_function()
crawler.py
from src.textmining import mining
miner = mining()
Without turning everything into a python module you'd have to import a folder into the current script or __init__.py module by adding to path:
# In crawler.py
import sys
import os
sys.path.append(os.path.abspath('../textmining'))
import mining
However, messing around with the path requires you to keep in mind what you've done and may not be a thing you desire.

From module import * is not importing my functions

I am trying to understand how to import code from one file to another. I have two files file1.py and file2.py. I am running code in the first file, and have many variables and functions defined in the second file. I am using from file2 import * to import the code into file1.py. I have no problem using variables defined in file2.py in file1.py, but with functions I am getting NameError: name 'myfunc' is not defined when I try to use the function in file1.py. I can fix this problem by writing from file2 import myfunc, but I thought writing * would import everything from that file. What is the difference for functions versus variables?
I have tried to recreate the setup you have described and it is working OK for me. Hopefully this will give you an idea of how to get it to work.
# file1.py #####################################
import sys
sys.path.append("/home/neko/test/")
import file2
if __name__ == "__main__":
file2.testfunc()
# file2.py ######################################
testvar = 'hello'
def testfunc(): print testvar
For this test I was using python version 2.6.6
Both file1.py and file2.py is in /home/neko/test/

Importing CSV to Django and settings not recognised

So i'm getting to grips with Django, or trying to. I have some code that isn't dependent on being called by the webpage - it's designed to populate the database with information. Eventually it will be set up as a cron job to run overnight. This is the first crack at it, which is to do an initial population (once I have that working, I'll move to an add structure, where only new records are pushed.) I'm using Python 2.7, Django 1.5 and Sqlite3. When I run this code, I get
Requested setting DATABASES, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.
That seems fairly obvious, but I've spent a couple of hours now trying to work out how to adjust that setting. How do I call / open a connection / whatever the right terminology is here? I have a number of functions like this that will be scheduled jobs, and this has been frustrating me all afternoon.
import urllib2
import csv
import requests
from django.db import models
from gmbl.models import Match
master_data_file = urllib2.urlopen("http://www.football-data.co.uk/mmz4281/1213/E0.csv", "GET")
data = list(tuple(rec) for rec in csv.reader(master_data_file, delimiter=','))
for row in data:
current_match = Match(matchdate=row[1],
hometeam=row[2],
awayteam = row [3],
homegoals = row [4],
awaygoals = row[5],
homeshots = row[10],
awayshots = row[11],
homeshotsontarget = row[12],
awayshotsontarget = row[13],
homecorners = row[16],
awaycorners = row[17])
current_match.save()
I had originally started out with http://django-csv-importer.readthedocs.org/en/latest/ but I had the same error, and the documentation doesn't make much sense trying to debug it. When I tried calling settings.configure in the function, it said it didn't exist; presumably I had to import it, but couldn't make that work.
Make sure Django, and your project are in PYTHONPATH then you can do:
import urllib2
import csv
import requests
from django.core.management import setup_environ
from django.db import models
from yoursite import settings
setup_environ(settings)
from gmbl.models import Match
master_data_file = urllib2.urlopen("http://www.football-data.co.uk/mmz4281/1213/E0.csv", "GET")
data = list(tuple(rec) for rec in csv.reader(master_data_file, delimiter=','))
# ... your code ...
Reference: http://www.b-list.org/weblog/2007/sep/22/standalone-django-scripts/
Hope it helps!

Configure Django to find all doctests in all modules?

If I run the following command:
>python manage.py test
Django looks at tests.py in my application, and runs any doctests or unit tests in that file. It also looks at the __ test __ dictionary for extra tests to run. So I can link doctests from other modules like so:
#tests.py
from myapp.module1 import _function1, _function2
__test__ = {
"_function1": _function1,
"_function2": _function2
}
If I want to include more doctests, is there an easier way than enumerating them all in this dictionary? Ideally, I just want to have Django find all doctests in all modules in the myapp application.
Is there some kind of reflection hack that would get me where I want to be?
I solved this for myself a while ago:
apps = settings.INSTALLED_APPS
for app in apps:
try:
a = app + '.test'
__import__(a)
m = sys.modules[a]
except ImportError: #no test jobs for this module, continue to next one
continue
#run your test using the imported module m
This allowed me to put per-module tests in their own test.py file, so they didn't get mixed up with the rest of my application code. It would be easy to modify this to just look for doc tests in each of your modules and run them if it found them.
Use django-nose since nose automatically find all tests recursivelly.
Here're key elements of solution:
tests.py:
def find_modules(package):
"""Return list of imported modules from given package"""
files = [re.sub('\.py$', '', f) for f in os.listdir(os.path.dirname(package.__file__))
if f.endswith(".py") and os.path.basename(f) not in ('__init__.py', 'test.py')]
return [imp.load_module(file, *imp.find_module(file, package.__path__)) for file in files]
def suite(package=None):
"""Assemble test suite for Django default test loader"""
if not package: package = myapp.tests # Default argument required for Django test runner
return unittest.TestSuite([doctest.DocTestSuite(m) for m in find_modules(package)])
To add recursion use os.walk() to traverse module tree and find python packages.
Thanks to Alex and Paul. This is what I came up with:
# tests.py
import sys, settings, re, os, doctest, unittest, imp
# import your base Django project
import myapp
# Django already runs these, don't include them again
ALREADY_RUN = ['tests.py', 'models.py']
def find_untested_modules(package):
""" Gets all modules not already included in Django's test suite """
files = [re.sub('\.py$', '', f)
for f in os.listdir(os.path.dirname(package.__file__))
if f.endswith(".py")
and os.path.basename(f) not in ALREADY_RUN]
return [imp.load_module(file, *imp.find_module(file, package.__path__))
for file in files]
def modules_callables(module):
return [m for m in dir(module) if callable(getattr(module, m))]
def has_doctest(docstring):
return ">>>" in docstring
__test__ = {}
for module in find_untested_modules(myapp.module1):
for method in modules_callables(module):
docstring = str(getattr(module, method).__doc__)
if has_doctest(docstring):
print "Found doctest(s) " + module.__name__ + "." + method
# import the method itself, so doctest can find it
_temp = __import__(module.__name__, globals(), locals(), [method])
locals()[method] = getattr(_temp, method)
# Django looks in __test__ for doctests to run
__test__[method] = getattr(module, method)
I'm not up to speed on Djano's testing, but as I understand it uses automatic unittest discovery, just like python -m unittest discover and Nose.
If so, just put the following file somewhere the discovery will find it (usually just a matter of naming it test_doctest.py or similar).
Change your_package to the package to test. All modules (including subpackages) will be doctested.
import doctest
import pkgutil
import your_package as root_package
def load_tests(loader, tests, ignore):
modules = pkgutil.walk_packages(root_package.__path__, root_package.__name__ + '.')
for _, module_name, _ in modules:
try:
suite = doctest.DocTestSuite(module_name)
except ValueError:
# Presumably a "no docstrings" error. That's OK.
pass
else:
tests.addTests(suite)
return tests