I am trying to import the bitarray library (https://pypi.python.org/pypi/bitarray/0.8.1) into a SparkContext.
To do this I have zipped up the contents of the bitarray folder and then tried to add it to my Python files. However, even after I push the library to the nodes, my RDD cannot find it. Here is my code:
zip bitarray.zip bitarray-0.8.1/bitarray/*
# Check the contents of the zip file
unzip -l bitarray.zip
Archive: bitarray.zip
Length Date Time Name
--------- ---------- ----- ----
143455 2015-11-06 02:07 bitarray/_bitarray.so
4440 2015-11-06 02:06 bitarray/__init__.py
6224 2015-11-06 02:07 bitarray/__init__.pyc
68516 2015-11-06 02:06 bitarray/test_bitarray.py
78976 2015-11-06 02:07 bitarray/test_bitarray.pyc
--------- -------
301611 5 files
Then, in Spark:
import os
import sys
# Environment
import findspark
findspark.init("/home/utils/spark-1.6.0/")
import pyspark
sparkConf = pyspark.SparkConf()
sparkConf.set("spark.executor.instances", "2")
sparkConf.set("spark.executor.memory", "10g")
sparkConf.set("spark.executor.cores", "2")
sc = pyspark.SparkContext(conf = sparkConf)
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import HiveContext
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import udf
hiveContext = HiveContext(sc)
PYBLOOM_LIB = '/home/ryandevera/pybloom.zip'
sys.path.append(PYBLOOM_LIB)
sc.addPyFile(PYBLOOM_LIB)
from pybloom import BloomFilter
f = BloomFilter(capacity=1000, error_rate=0.001)
x = sc.parallelize([(1,("hello",4)),(2,("goodbye",5)),(3,("hey",6)),(4,("test",7))],2)
def bloom_filter_spark(iterator):
    for id, _ in iterator:
        f.add(id)
    yield (None, f)
x.mapPartitions(bloom_filter_spark).take(1)
This yields the error -
ImportError: pybloom requires bitarray >= 0.3.4
I am not sure where I am going wrong. Any help would be greatly appreciated!
Probably the simplest thing you can do is to create and distribute egg files. Assuming you've downloaded and unpacked the source files from PyPI and set the PYBLOOM_SOURCE_DIR and BITARRAY_SOURCE_DIR variables:
cd $PYBLOOM_SOURCE_DIR
python setup.py bdist_egg
cd $BITARRAY_SOURCE_DIR
python setup.py bdist_egg
In PySpark add:
from itertools import chain
import os
import glob
eggs = chain.from_iterable([
    glob.glob(os.path.join(os.environ[x], "dist/*")) for x in
    ["PYBLOOM_SOURCE_DIR", "BITARRAY_SOURCE_DIR"]
])

for egg in eggs:
    sc.addPyFile(egg)
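As a quick sanity check (this probe is my own addition, not part of the original answer), you can confirm that the executors can actually import the distributed packages:
def probe(_):
    # Import inside the function so the check runs on the executors.
    import bitarray
    import pybloom
    return [(bitarray.__name__, pybloom.__name__)]

print(sc.parallelize([0], 1).mapPartitions(probe).collect())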
The problem is that the BloomFilter object cannot be properly serialized, so if you want to use it you'll have to either patch it or extract the underlying bitarrays and pass those around:
def buildFilter(iter):
    bf = BloomFilter(capacity=1000, error_rate=0.001)
    for x in iter:
        bf.add(x)
    return [bf.bitarray]

rdd = sc.parallelize(range(100))
rdd.mapPartitions(buildFilter).reduce(lambda x, y: x | y)
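If you need a queryable filter back on the driver, here is a hedged sketch; it assumes pybloom's BloomFilter keeps its state in the .bitarray attribute (which the code above already relies on) and that capacity and error_rate match those used on the executors:
# Merge the per-partition bitarrays, then load them into a fresh filter.
merged_bits = rdd.mapPartitions(buildFilter).reduce(lambda x, y: x | y)

combined = BloomFilter(capacity=1000, error_rate=0.001)  # same parameters as above
combined.bitarray = merged_bits  # assumption: the filter's state lives entirely in this attribute

print(42 in combined)  # membership test against the merged filter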
Related
Following: https://docs.delta.io/latest/quick-start.html#python
I have installed delta-spark and run:
import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
However when I run:
data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
the error states: delta not recognised
& if I run
DeltaTable.isDeltaTable(spark, "packages/tests/streaming/data")
It states: TypeError: 'JavaPackage' object is not callable
I thought I could run these commands locally (for example in unit tests) without Maven or a pyspark shell? It would be good to know whether I am just missing a dependency.
You can just install the delta-spark PyPI package using pip install delta-spark (it will pull in pyspark as well), and then refer to it.
Or you can add a configuration option that will fetch the Delta package. It's .config("spark.jars.packages", "io.delta:delta-core_2.12:<delta-version>"). For Spark 3.1 the Delta version is 1.0.0 (see the releases mapping docs for more information).
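For illustration, here is a minimal sketch of the spark.jars.packages approach (assuming Spark 3.1, hence Delta 1.0.0 as noted above):
import pyspark

# The Delta JAR is fetched when the session starts, so DeltaTable calls
# no longer fail with the "'JavaPackage' object is not callable" error.
builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = builder.getOrCreate()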
I have an example of using Delta tables in unit tests (please note that the import statement is inside the test function because the Delta package is loaded dynamically):
import pyspark
import pyspark.sql
import pytest
import shutil
from pyspark.sql import SparkSession
delta_dir_name = "/tmp/delta-table"
@pytest.fixture
def delta_setup(spark_session):
    data = spark_session.range(0, 5)
    data.write.format("delta").save(delta_dir_name)
    yield data
    shutil.rmtree(delta_dir_name, ignore_errors=True)

def test_delta(spark_session, delta_setup):
    from delta.tables import DeltaTable

    deltaTable = DeltaTable.forPath(spark_session, delta_dir_name)
    hist = deltaTable.history()
    assert hist.count() == 1
The environment is initialized via pytest-spark (pytest.ini):
[pytest]
filterwarnings =
    ignore::DeprecationWarning
spark_options =
    spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
    spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
    spark.jars.packages: io.delta:delta-core_2.12:1.0.0
    spark.sql.catalogImplementation: in-memory
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-2cacdf187bba> in <module>
6 import numpy as np
7
----> 8 from pytorchcv import load_mnist, train, plot_results, plot_convolution, display_dataset
9 load_mnist(batch_size=128)
ImportError: cannot import name 'load_mnist' from 'pytorchcv' (/anaconda/envs/py37_pytorch/lib/python3.7/site-packages/pytorchcv/__init__.py)
How can I fix this bug?
I am using Python 3.7 and a Jupyter notebook. The code is from Microsoft's PyTorch Fundamentals module: https://learn.microsoft.com/en-gb/learn/modules/intro-computer-vision-pytorch/5-convolutional-networks
import torch
import torch.nn as nn
import torchvision
import matplotlib.pyplot as plt
from torchinfo import summary
import numpy as np
from pytorchcv import load_mnist, train, plot_results, plot_convolution, display_dataset
load_mnist(batch_size=128)
I installed the package with the command: pip install pytorchcv
I assume you might have the wrong pytorchcv package. The one on PyPI does not contain load_mnist.
Starting from scratch, you could download MNIST like this:
from torchvision.transforms import ToTensor

data_train = torchvision.datasets.MNIST('./data', download=True, train=True, transform=ToTensor())
data_test = torchvision.datasets.MNIST('./data', download=True, train=False, transform=ToTensor())
One command was missed before importing pytorchcv. This pytorchcv is different from the pytorchcv package on PyPI. Run this before importing it:
!wget https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/computer-vision-pytorch/pytorchcv.py
Then it should work.
Please download the prepared .py file from https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/computer-vision-pytorch/pytorchcv.py and put it into the current folder; then VS Code will be able to resolve the import.
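If you prefer to fetch the file from Python instead of wget (my own suggestion, using only the standard library), something like this should work:
import urllib.request

# Save the tutorial's helper module next to your notebook/script.
url = ("https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/"
       "main/computer-vision-pytorch/pytorchcv.py")
urllib.request.urlretrieve(url, "pytorchcv.py")

from pytorchcv import load_mnist  # now resolves to the downloaded helper file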
I have two files.
test.py
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark import SQLContext
class Connection():
    conf = SparkConf()
    conf.setMaster("local")
    conf.setAppName("Remote_Spark_Program - Leschi Plans")
    conf.set('spark.executor.instances', 1)
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)
    print('all done.')
con = Connection()
test_test.py
from test import Connection
sparkConnect = Connection()
When I run test.py the connection is made successfully, but with test_test.py it gives:
raise KeyError(key)
KeyError: 'SPARK_HOME'
The KeyError arises if SPARK_HOME is not found or is invalid, so it's better to set it in your .bashrc and then check and reload it in your code. Add this at the top of your test.py:
import os
import sys
import pyspark
from pyspark import SparkContext, SparkConf, SQLContext
# Create a variable for our root path
SPARK_HOME = os.environ.get('SPARK_HOME',None)
# Add the PySpark/py4j to the Python Path
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
Also add this at the end of your ~/.bashrc file (for example with vim ~/.bashrc, if you are using any Linux-based OS):
# needed for Apache Spark
export SPARK_HOME="/opt/spark"
export IPYTHON="1"
export PYSPARK_PYTHON="/usr/bin/python3"
export PYSPARK_DRIVER_PYTHON="ipython3"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYTHONPATH="$SPARK_HOME/python/:$PYTHONPATH"
export PYTHONPATH="$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
export CLASSPATH="$CLASSPATH:/opt/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar"
Note:
In the bashrc snippet above, I have set SPARK_HOME to /opt/spark; use the location where you keep your Spark folder (the one downloaded from the website).
Also, I'm using python3; change it to python in the bashrc if you are using a Python 2.x version.
I was using IPython for easy testing during runtime, e.g. to load the data once and test the code many times. If you are using a plain old text editor, let me know and I will update the bashrc accordingly.
I have a Flask application that I would like to convert into an executable for deployment elsewhere; I have used py2exe for that. I am getting a jinja2.TemplateNotFound error. I have copied the static and templates folders into the dist folder where the exe files reside. Please let me know if I am missing something. My setup file is as follows:
from distutils.core import setup
import py2exe
import os
from glob import glob
import sys
from distutils.filelist import findall
import matplotlib
matplotlibdatadir = matplotlib.get_data_path()
matplotlibdata = findall(matplotlibdatadir)
matplotlibdata_files = []
for f in matplotlibdata:
    dirname = os.path.join('matplotlibdata', f[len(matplotlibdatadir)+1:])
    matplotlibdata_files.append((os.path.split(dirname)[0], [f]))

data_files = [('static', glob("D:\\pythonLearning\\static\\*.*")),
              ('templates', glob("D:\\pythonLearning\\templates\\login.html"))]
data_files.extend(matplotlibdata_files)
print data_files

sys.path.append('C:\\Windows\\winsxs\\x86_microsoft.vc90.crt_1fc8b3b9a1e18e3b_9.0.30729.6161_none_50934f2ebcb7eb57')

setup(console=['myfile.py'],
      options={'py2exe': {'packages': ['matplotlib', 'pytz', 'werkzeug', 'email', 'jinja2.ext'],
                          'includes': ['flask', 'jinja2']}},
      data_files=data_files)
This is because Jinja expects your package to be unzipped and its templates available via a real file path on disk.
You can work around this for typical data files, but jinja2 has no direct support for it, so you would have to implement the workaround yourself.
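One such workaround (my own sketch, not part of the original answer) is to point Flask at the templates and static folders that sit next to the frozen executable, so Jinja loads them from the real filesystem:
import os
import sys

from flask import Flask

# py2exe sets sys.frozen on the bundled executable; fall back to the source
# file's directory when running unfrozen.
if getattr(sys, 'frozen', False):
    base_dir = os.path.dirname(sys.executable)
else:
    base_dir = os.path.dirname(os.path.abspath(__file__))

app = Flask(__name__,
            template_folder=os.path.join(base_dir, 'templates'),
            static_folder=os.path.join(base_dir, 'static'))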
I'm trying to get data from a zipped CSV file. Is there a way to do this without unzipping the whole file first? If not, how can I unzip the files and read them efficiently?
I used the zipfile module to import the ZIP directly into a pandas dataframe.
Let's say the file name is "intfile" and it's in a .zip named "THEZIPFILE":
import pandas as pd
import zipfile
zf = zipfile.ZipFile('C:/Users/Desktop/THEZIPFILE.zip')
df = pd.read_csv(zf.open('intfile.csv'))
If you aren't using Pandas it can be done entirely with the standard lib. Here is Python 3.7 code:
import csv
from io import TextIOWrapper
from zipfile import ZipFile
with ZipFile('yourfile.zip') as zf:
    with zf.open('your_csv_inside_zip.csv', 'r') as infile:
        reader = csv.reader(TextIOWrapper(infile, 'utf-8'))
        for row in reader:
            # process the CSV here
            print(row)
A quick solution is to use the code below:
import pandas as pd

# pandas supports reading zip files directly
df = pd.read_csv("/path/to/file.csv.zip")
zipfile also supports the with statement.
So, adding onto Yaron's answer of using pandas:
with zipfile.ZipFile('file.zip') as zip:
    with zip.open('file.csv') as myZip:
        df = pd.read_csv(myZip)
I thought Yaron had the best answer, but I wanted to add code that iterates through multiple files inside a zip folder and then appends the results:
import os
import pandas as pd
import zipfile
curDir = os.getcwd()
zf = zipfile.ZipFile(curDir + '/targetfolder.zip')
text_files = zf.infolist()
list_ = []
print ("Uncompressing and reading data... ")
for text_file in text_files:
    print(text_file.filename)
    df = pd.read_csv(zf.open(text_file.filename))
    # do df manipulations
    list_.append(df)

df = pd.concat(list_)
Yes. You want the module 'zipfile'
You open the zip file itself with zipfile.ZipFile(filename[, mode])
You can then use ZipFile.infolist() to enumerate each file within the zip, and extract it with ZipFile.open(name[, mode[, pwd]])
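For concreteness, here is a minimal sketch of that workflow (the archive and member names are placeholders):
import zipfile

with zipfile.ZipFile('data.zip') as zf:
    # Enumerate every member of the archive.
    for info in zf.infolist():
        print(info.filename, info.file_size)
    # Read the first member without extracting it to disk.
    with zf.open(zf.infolist()[0].filename) as fh:
        first_bytes = fh.read(100)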
This is the simplest approach, and the one I always use:
import pandas as pd
df = pd.read_csv("Train.zip",compression='zip')
Supposing you are downloading a zip file that contains a CSV and you don't want to use temporary storage. Here is what a sample implementation looks like:
#!/usr/bin/env python3
from csv import DictReader
from io import TextIOWrapper, BytesIO
from zipfile import ZipFile
import requests
def all_tickers():
    url = "https://simfin.com/api/bulk/bulk.php?dataset=industries&variant=null"
    r = requests.get(url)
    zip_ref = ZipFile(BytesIO(r.content))
    for name in zip_ref.namelist():
        print(name)
        with zip_ref.open(name) as file_contents:
            reader = DictReader(TextIOWrapper(file_contents, 'utf-8'), delimiter=';')
            for item in reader:
                print(item)
This takes care of all python3 bytes/str issues.
Modern pandas (since version 0.18.1) natively supports compressed CSV files: its read_csv method has a compression parameter: {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
If you have a file named my_big_file.csv and you zip it under the same name, my_big_file.zip,
you may simply do this:
df = pd.read_csv("my_big_file.zip")
Note: check your pandas version first (not applicable for older versions)