Missing s3 package from AWS Wrangler

I installed the latest version of AWS Wrangler, 2.19.0. When I run the import, this happens:
import awswrangler as wr
File ~/opt/anaconda3/lib/python3.9/site-packages/awswrangler/lakeformation/_utils.py:13, in <module>
11 from awswrangler import _utils, exceptions
12 from awswrangler.catalog._utils import _catalog_id, _transaction_id
---> 13 from awswrangler.s3._describe import describe_objects
15 _QUERY_FINAL_STATES: List[str] = ["ERROR", "FINISHED"]
16 _QUERY_WAIT_POLLING_DELAY: float = 2  # SECONDS
ModuleNotFoundError: No module named 'awswrangler.s3._describe'; 'awswrangler.s3' is not a package


Spark job submission error on EMR: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3

I submit Spark jobs to EMR via AWS's managed Airflow service, MWAA (Amazon Managed Workflows for Apache Airflow). The jobs were running fine with MWAA version 1.10.12. Recently, AWS released a newer MWAA version, 2.0.2. I created a new environment with this version and tried submitting the same job to EMR, but it failed with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:
at org.apache.hadoop.fs.Path.initialize(Path.java:263)
at org.apache.hadoop.fs.Path.<init>(Path.java:221)
at org.apache.hadoop.fs.Path.<init>(Path.java:129)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:229)
at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2096)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2078)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:192)
at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:147)
at org.apache.spark.deploy.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:145)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.deploy.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:145)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$5(SparkSubmit.scala:364)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:364)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:902)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1038)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1047)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 3: s3:
at java.net.URI$Parser.fail(URI.java:2847)
at java.net.URI$Parser.failExpecting(URI.java:2853)
at java.net.URI$Parser.parse(URI.java:3056)
at java.net.URI.<init>(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:260)
... 27 more
Command exiting with ret '1'
The spark-submit command looks like this:
spark-submit --deploy-mode cluster
--master yarn
--queue low
--jars s3://bucket-name/jars/mysql-connector-java-8.0.20.jar,s3://bucket-name/jars/postgresql-42.1.1.jar,s3://bucket-name/jars/delta-core_2.12-1.0.0.jar
--py-files s3://bucket-name/dependencies/spark.py, s3://bucket-name/dependencies/helper_functions.py
--files s3://bucket-name/configs/spark_config.json
s3://bucket-name/jobs/data_processor.py [command-line-args]
The job submission failed in under 10 seconds. Hence, the YARN application ID did not get created.
What I tried to resolve the error:
I added the Amazon-related packages to requirements.txt:
apache-airflow[mysql]==2.0.2
pycryptodome==3.9.9
apache-airflow-providers-amazon==1.3.0
I changed the import statements from:
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable
to
from airflow.providers.amazon.aws.operators.emr_create_job_flow import EmrCreateJobFlowOperator
from airflow.providers.amazon.aws.operators.emr_add_steps import EmrAddStepsOperator
from airflow.providers.amazon.aws.operators.emr_terminate_job_flow import EmrTerminateJobFlowOperator
from airflow.providers.amazon.aws.sensors.emr_step import EmrStepSensor
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from airflow import DAG
from airflow.models import Variable
I changed the URI scheme to s3n and s3a.
I checked the official documentation and blogs on MWAA as well as Airflow 2.0.2 and made the above changes, but nothing has worked so far. Any help in resolving this error would be appreciated. Thanks in advance.

Failed to find data source: delta in Python environment

Following: https://docs.delta.io/latest/quick-start.html#python
I have installed delta-spark and run:
import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
However when I run:
data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
the error states that the delta data source is not recognised, and if I run
DeltaTable.isDeltaTable(spark, "packages/tests/streaming/data")
It states: TypeError: 'JavaPackage' object is not callable
My understanding was that I could run these commands locally (for example in unit tests) without Maven or a pyspark shell. It would be good to know whether I am just missing a dependency.
You can just install the delta-spark PyPI package using pip install delta-spark (it will pull in pyspark as well) and then refer to it.
Or you can add a configuration option that will fetch the Delta package: .config("spark.jars.packages", "io.delta:delta-core_2.12:<delta-version>"). For Spark 3.1 the Delta version is 1.0.0 (see the releases mapping docs for more information).
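For example, a minimal sketch of that second approach (assuming Spark 3.1 with Delta 1.0.0, and reusing the extension and catalog settings from the question) could look like this:
from pyspark.sql import SparkSession

# Let Spark fetch the Delta package itself instead of using configure_spark_with_delta_pip.
spark = (
    SparkSession.builder.appName("MyApp")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.range(0, 5).write.format("delta").save("/tmp/delta-table")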
I have an example of using Delta tables in unit tests (note that the import statement is inside the function definition, because the Delta package is loaded dynamically):
import pyspark
import pyspark.sql
import pytest
import shutil
from pyspark.sql import SparkSession

delta_dir_name = "/tmp/delta-table"

@pytest.fixture
def delta_setup(spark_session):
    data = spark_session.range(0, 5)
    data.write.format("delta").save(delta_dir_name)
    yield data
    shutil.rmtree(delta_dir_name, ignore_errors=True)

def test_delta(spark_session, delta_setup):
    from delta.tables import DeltaTable

    deltaTable = DeltaTable.forPath(spark_session, delta_dir_name)
    hist = deltaTable.history()
    assert hist.count() == 1
The environment is initialized via pytest-spark:
[pytest]
filterwarnings =
    ignore::DeprecationWarning
spark_options =
    spark.sql.extensions: io.delta.sql.DeltaSparkSessionExtension
    spark.sql.catalog.spark_catalog: org.apache.spark.sql.delta.catalog.DeltaCatalog
    spark.jars.packages: io.delta:delta-core_2.12:1.0.0
    spark.sql.catalogImplementation: in-memory

ImportError: cannot import name 'load_mnist' from 'pytorchcv'

---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-1-2cacdf187bba> in <module>
6 import numpy as np
7
----> 8 from pytorchcv import load_mnist, train, plot_results, plot_convolution, display_dataset
9 load_mnist(batch_size=128)
ImportError: cannot import name 'load_mnist' from 'pytorchcv' (/anaconda/envs/py37_pytorch/lib/python3.7/site-packages/pytorchcv/__init__.py)
How can I fix this bug?
I use Python 3.7 and a Jupyter notebook. The code is from Microsoft's PyTorch Fundamentals module: https://learn.microsoft.com/en-gb/learn/modules/intro-computer-vision-pytorch/5-convolutional-networks
import torch
import torch.nn as nn
import torchvision
import matplotlib.pyplot as plt
from torchinfo import summary
import numpy as np
from pytorchcv import load_mnist, train, plot_results, plot_convolution, display_dataset
load_mnist(batch_size=128)
I installed pytorchcv with the command: pip install pytorchcv
I assume you might have the wrong pytorchcv package. The one on PyPI does not contain load_mnist.
Starting from scratch, you could download MNIST like this:
from torchvision.transforms import ToTensor

data_train = torchvision.datasets.MNIST('./data', download=True, train=True, transform=ToTensor())
data_test = torchvision.datasets.MNIST('./data', download=True, train=False, transform=ToTensor())
They missed one command before importing pytorchcv: this pytorchcv is a helper module from the course, not the pytorchcv package on PyPI.
Run this before importing pytorchcv:
!wget https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/computer-vision-pytorch/pytorchcv.py
Then it should work.
Please download the prepared .py file from https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/computer-vision-pytorch/pytorchcv.py and put it into the current folder; then VS Code will recognize it.
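If wget isn't available (for example in a plain VS Code setup on Windows), a small Python-only sketch of the same download step, using only the standard library, would be:
import urllib.request

# Fetch the course's helper module next to the notebook/script; the current
# directory normally precedes site-packages on sys.path, so the import should
# resolve to this copy rather than the unrelated PyPI pytorchcv package.
url = ("https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/"
       "main/computer-vision-pytorch/pytorchcv.py")
urllib.request.urlretrieve(url, "pytorchcv.py")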

Error importing parquet on Windows

I'm trying to install the parquet package so I can use the Parquet file format with Apache Spark. I learned that I had to install Thrift, ThriftPy, and python-snappy in order to fully install parquet.
I installed Thrift using the command
pip install thrift
Then I installed python-snappy manually from a wheel file found here, because I was unable to install it automatically; in any case, python-snappy was installed successfully.
I also installed ThriftPy using a similar command:
pip install ThriftPy
Finally, I used pip to install parquet, which succeeded. But when I try to import parquet, it raises this error:
---------------------------------------------------------------------------
ThriftParserError Traceback (most recent call last)
<ipython-input-55-942008defa53> in <module>()
----> 1 import parquet
C:\anaconda2\lib\site-packages\parquet\__init__.py in <module>()
17 from thriftpy.protocol.compact import TCompactProtocolFactory
18
---> 19 from . import encoding
20 from . import schema
21 from .converted_types import convert_column
C:\anaconda2\lib\site-packages\parquet\encoding.py in <module>()
17
18 THRIFT_FILE = os.path.join(os.path.dirname(__file__), "parquet.thrift")
---> 19 parquet_thrift = thriftpy.load(THRIFT_FILE, module_name=str("parquet_thrift")) # pylint: disable=invalid-name
20
21 logger = logging.getLogger("parquet") # pylint: disable=invalid-name
C:\anaconda2\lib\site-packages\thriftpy\parser\__init__.pyc in load(path, module_name, include_dirs, include_dir)
28 real_module = bool(module_name)
29 thrift = parse(path, module_name, include_dirs=include_dirs,
---> 30 include_dir=include_dir)
31
32 if real_module:
C:\anaconda2\lib\site-packages\thriftpy\parser\parser.pyc in parse(path, module_name, include_dirs, include_dir, lexer, parser, enable_cache)
494 raise ThriftParserError('ThriftPy does not support generating module '
495 'with path in protocol \'{}\''.format(
--> 496 url_scheme))
497
498 if module_name is not None and not module_name.endswith('_thrift'):
ThriftParserError: ThriftPy does not support generating module with path in protocol 'c'
Would someone tell me what I'm doing wrong?
For reference, I'm using Anaconda Python 2.7 in a Jupyter notebook. My OS is Windows 7, and I'm using Spark on a single cluster.
Modify the thriftpy parser code on the Windows side; it lives in your Python site-packages directory, for example:
C:\Anaconda\python27\Lib\site-packages\thriftpy\parser\parser.py
Modify line 488 so that a single-character scheme is also accepted (on Windows the drive letter, e.g. C:, gets parsed as a one-letter URL scheme, which is the protocol 'c' in the error):
# if url_scheme == '':
if len(url_scheme) <= 1:
Then try again.
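A quick way to see why the patch helps (a hypothetical illustration using only the standard library, not thriftpy's actual code):
# On the asker's Python 2.7, replace the import with: from urlparse import urlparse
from urllib.parse import urlparse

path = r"C:\anaconda2\lib\site-packages\parquet\parquet.thrift"
scheme = urlparse(path).scheme
print(scheme)            # prints 'c' -- the drive letter is parsed as a one-letter URL scheme
print(len(scheme) <= 1)  # True, which is exactly what the patched check tests for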

Importing bitarray library into SparkContext

I am trying to import the bitarray library into a SparkContext. https://pypi.python.org/pypi/bitarray/0.8.1.
To do this I have zipped up the contents of the bitarray folder and then tried to add it to my Python files. However, even after I push the library to the nodes, my RDD cannot find the library. Here is my code:
zip bitarray.zip bitarray-0.8.1/bitarray/*
# Check the contents of the zip file
unzip -l bitarray.zip
Archive: bitarray.zip
Length Date Time Name
--------- ---------- ----- ----
143455 2015-11-06 02:07 bitarray/_bitarray.so
4440 2015-11-06 02:06 bitarray/__init__.py
6224 2015-11-06 02:07 bitarray/__init__.pyc
68516 2015-11-06 02:06 bitarray/test_bitarray.py
78976 2015-11-06 02:07 bitarray/test_bitarray.pyc
--------- -------
301611 5 files
Then in Spark:
import os
import sys
# Environment
import findspark
findspark.init("/home/utils/spark-1.6.0/")
import pyspark
sparkConf = pyspark.SparkConf()
sparkConf.set("spark.executor.instances", "2")
sparkConf.set("spark.executor.memory", "10g")
sparkConf.set("spark.executor.cores", "2")
sc = pyspark.SparkContext(conf = sparkConf)
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import HiveContext
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import udf
hiveContext = HiveContext(sc)
PYBLOOM_LIB = '/home/ryandevera/pybloom.zip'
sys.path.append(PYBLOOM_LIB)
sc.addPyFile(PYBLOOM_LIB)
from pybloom import BloomFilter
f = BloomFilter(capacity=1000, error_rate=0.001)
x = sc.parallelize([(1,("hello",4)),(2,("goodbye",5)),(3,("hey",6)),(4,("test",7))],2)
def bloom_filter_spark(iterator):
    for id, _ in iterator:
        f.add(id)
    yield (None, f)
x.mapPartitions(bloom_filter_spark).take(1)
This yields the error -
ImportError: pybloom requires bitarray >= 0.3.4
I am not sure where I am going wrong. Any help would be greatly appreciated!
Probably the simplest thing you can do is to create and distribute egg files. Assuming you've downloaded and unpacked the source files from PyPI and set the PYBLOOM_SOURCE_DIR and BITARRAY_SOURCE_DIR variables:
cd $PYBLOOM_SOURCE_DIR
python setup.py bdist_egg
cd $BITARRAY_SOURCE_DIR
python setup.py bdist_egg
In PySpark add:
from itertools import chain
import os
import glob
eggs = chain.from_iterable([
    glob.glob(os.path.join(os.environ[x], "dist/*")) for x in
    ["PYBLOOM_SOURCE_DIR", "BITARRAY_SOURCE_DIR"]
])

for egg in eggs:
    sc.addPyFile(egg)
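To confirm the eggs actually reached the executors before debugging further, a quick sanity check along these lines can help (a hypothetical snippet that only uses standard PySpark calls):
# Run a tiny job whose only work is importing the libraries on an executor;
# an ImportError here means the eggs were not picked up by the workers.
def check_imports(_):
    import bitarray
    import pybloom
    yield getattr(bitarray, "__version__", "bitarray imported")

print(sc.parallelize([0], 1).mapPartitions(check_imports).collect())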
The problem is that the BloomFilter object cannot be properly serialized, so if you want to use it you'll have to either patch it or extract the underlying bitarrays and pass those around:
def buildFilter(iter):
    bf = BloomFilter(capacity=1000, error_rate=0.001)
    for x in iter:
        bf.add(x)
    return [bf.bitarray]

rdd = sc.parallelize(range(100))
rdd.mapPartitions(buildFilter).reduce(lambda x, y: x | y)
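If you then want to query the combined result on the driver, one option (an untested sketch, assuming pybloom keeps all of its state in the bitarray attribute and that you reuse the same capacity and error_rate as in buildFilter) is to wrap the merged bit array back into a filter:
# Merge the per-partition bit arrays, then rebuild a local filter for membership checks.
merged = rdd.mapPartitions(buildFilter).reduce(lambda x, y: x | y)

combined = BloomFilter(capacity=1000, error_rate=0.001)  # same parameters as in buildFilter
combined.bitarray = merged

print(42 in combined)    # expected True: 42 was added on one of the partitions
print(1234 in combined)  # expected False, up to the filter's false-positive rate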