How to use pyhive in a Lambda function?

I've written a function that uses PyHive to read from Hive. Running it locally, it works fine. However, when trying to run it as a Lambda function I got the error:
"Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found'"
I've tried to use the guidelines in this link:
https://github.com/cloudera/impyla/issues/201
However, I wasn't able to run the last command:
yum install cyrus-sasl-lib cyrus-sasl-gssapi cyrus-sasl-md5
since the system I was building on is Ubuntu, which doesn't have yum.
I tried to install the equivalent packages using apt-get:
sasl2-bin libsasl2-2 libsasl2-dev libsasl2-modules libsasl2-modules-gssapi-mit
as described in:
python cannot connect hiveserver2
But still no luck. Any ideas?
Thanks,
Nir.

You can follow this GitHub issue. I am able to connect to HiveServer2 with LDAP authentication using the pyhive library in AWS Lambda with Python 2.7. What I did to make it work:
Launch an EC2 instance (or a container) with the same AMI that Lambda uses.
Run the following commands to install the required dependencies:
yum upgrade
yum install gcc
yum install gcc-c++
sudo yum install cyrus-sasl cyrus-sasl-devel cyrus-sasl-ldap # include the cyrus-sasl dependency for the authentication mechanism you are using to connect to Hive
pip install six==1.12.0
Bundle /usr/lib64/sasl2/ into the Lambda deployment package and set os.environ['SASL_PATH'] = os.path.join(os.getcwd(), 'lib/sasl2'). Verify that the .so files are present at the os.environ['SASL_PATH'] path.
My Lambda code looks like:
from pyhive import hive
import logging
import os
os.environ['SASL_PATH'] = os.path.join(os.getcwd(), 'lib/sasl2')
log = logging.getLogger()
log.setLevel(logging.INFO)
log.info('Path: %s',os.environ['SASL_PATH'])
def lambda_handler(event, context):
    cursor = hive.connect(host='hiveServer2Ip', port=10000, username='userName', auth='LDAP', password='password').cursor()
    SHOW_TABLE_QUERY = "show tables"
    cursor.execute(SHOW_TABLE_QUERY)
    tables = cursor.fetchall()
    log.info('tables: %s', tables)
    log.info('done')
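To double-check that the bundled SASL modules are actually visible at runtime, you can log what was shipped; this is a minimal, self-contained sketch that assumes the same lib/sasl2 layout as above:
import glob
import logging
import os
log = logging.getLogger()
# List the Cyrus SASL plugin shared objects bundled with the deployment package
sasl_path = os.environ.get('SASL_PATH', '')
modules = glob.glob(os.path.join(sasl_path, '*.so*'))
log.info('SASL modules found at %s: %s', sasl_path, modules)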

Related

What is the correct configuration of the AWS SageMaker Python SDK to achieve local debugging/training with an Apple M1 Pro?

I want to run an RL training job on AWS SageMaker (script given below). But since the project is complex, I was hoping to do a test run using SageMaker Local Mode (on my M1 MacBook Pro) before submitting to paid instances. However, I am struggling to make this local run succeed, even with a simple training task.
I did use tensorflow-metal and tensorflow-macos when running local training jobs (without SageMaker), but I don't see anywhere to specify these in framework_version, nor am I sure that "local_gpu", the correct argument for a normal Linux machine with a GPU, is the right match for Apple Silicon (M1 Pro).
I searched all over but cannot find a case where this is addressed. (Very odd, am I doing something wrong? If so, please correct me.) If not, and anyone knows of a working configuration, a Docker image, or an example done properly with an M1 Pro, please share.
I tried to run the following code, which hangs after logging in. (If you are trying to run it, use any simple training script as entry_point, and make sure to log in with awscli using a command like the following, adjusted to your region.)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
##main.py
import boto3
import sagemaker
import os
import keras
import numpy as np
from keras.datasets import fashion_mnist
from sagemaker.tensorflow import TensorFlow
sess = sagemaker.Session()
#role = <'arn:aws:iam::0000000000000:role/CFN-SM-IM-Lambda-Catalog-sk-SageMakerExecutionRole-BlaBlaBla'> #KINDLY ADD YOUR ROLE HERE
(x_train, y_train), (x_val, y_val) = fashion_mnist.load_data()
os.makedirs("./data", exist_ok = True)
np.savez('./data/training', image=x_train, label=y_train)
np.savez('./data/validation', image=x_val, label=y_val)
# Train on local data. S3 URIs would work too.
training_input_path = 'file://data/training.npz'
validation_input_path = 'file://data/validation.npz'
# Store model locally. A S3 URI would work too.
output_path = 'file:///tmp/model/'
tf_estimator = TensorFlow(entry_point='mnist_keras_tf.py',
                          role=role,
                          instance_count=1,
                          instance_type='local_gpu', # Train on the local CPU ('local_gpu' if it has a GPU)
                          framework_version='2.1.0',
                          py_version='py3',
                          hyperparameters={'epochs': 1},
                          output_path=output_path
                          )
tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})
The prebuilt SageMaker Docker Images for Deep Learning don't have Arm-based support yet.
You can see Available Deep Learning Containers Images here.
The solution is to build your own Docker image and use it with SageMaker.
This is an example Dockerfile that uses miniconda to install TensorFlow dependencies:
FROM arm64v8/ubuntu
RUN apt-get -y update && apt-get install -y --no-install-recommends \
wget \
nginx \
ca-certificates \
gcc \
linux-headers-generic \
libc-dev
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.9.2-Linux-aarch64.sh
RUN chmod a+x Miniconda3-py38_4.9.2-Linux-aarch64.sh
RUN bash Miniconda3-py38_4.9.2-Linux-aarch64.sh -b
ENV PATH /root/miniconda3/bin/:$PATH
COPY ml-dependencies.yml ./
RUN conda env create -f ml-dependencies.yml
ENV PATH /root/miniconda3/envs/ml-dependencies/bin:$PATH
This is the ml-dependencies.yml:
name: ml-dependencies
dependencies:
  - python=3.8
  - numpy
  - pandas
  - scikit-learn
  - tensorflow==2.8.2
  - pip
  - pip:
    - sagemaker-training
And this is how you'll run the training using SageMaker script mode:
from sagemaker.estimator import Estimator
image = 'sagemaker-tensorflow2-graviton-training-toolkit-local'
california_housing_estimator = Estimator(
    image_uri=image,
    entry_point='california_housing_tf2.py',
    source_dir='code',
    role=DUMMY_IAM_ROLE,  # placeholder IAM role ARN, defined elsewhere in the sample
    instance_count=1,
    instance_type='local',
    hyperparameters={'epochs': 10,
                     'batch_size': 64,
                     'learning_rate': 0.1})
inputs = {'train': 'file://./data/train', 'test': 'file://./data/test'}
california_housing_estimator.fit(inputs, logs=True)
You can find the full working sample code on the Amazon SageMaker Local Mode Examples GitHub repository here.

How do I install Apache Superset CLI on Windows?

Superset offers a CLI for managing the Superset instance, but I am unable to find instructions for getting it installed and talking to my instance of Superset.
My local machine is Windows, but my instance of Superset is running in a hosted Kubernetes cluster.
-- Update 2, 2022.08.06
After some continued exploration, I have found some steps that seem to be getting me closer.
# clone the Superset repo
git clone https://github.com/apache/superset
cd superset
# create a virtual environment using Python 3.9,
# which is compatible with the current version of numpy
py -3.9 -m venv .venv
.venv\Scripts\activate
# install the Superset package
pip install apache-superset
# install requirements (not 100% sure which requirements are needed)
pip install -r .\requirements\base.txt
pip install -r .\requirements\development.txt
# install psycopg2
pip install psycopg2
# run superset-cli
superset-cli
# error: The term 'superset-cli' is not recognized
# run superset
superset
superset will run, but now I'm getting an error from psycopg2 about unknown host:
Loaded your LOCAL configuration at [c:\git\superset\superset_config.py]
logging was configured successfully
2022-08-06 06:29:08,311:INFO:superset.utils.logging_configurator:logging was configured successfully
2022-08-06 06:29:08,317:INFO:root:Configured event logger of type <class 'superset.utils.log.DBEventLogger'>
Falling back to the built-in cache, that stores data in the metadata database, for the following cache: `FILTER_STATE_CACHE_CONFIG`. It is recommended to use `RedisCache`, `MemcachedCache` or another dedicated caching backend for production deployments
2022-08-06 06:29:08,318:WARNING:superset.utils.cache_manager:Falling back to the built-in cache, that stores data in the metadata database, for the following cache: `FILTER_STATE_CACHE_CONFIG`. It is recommended to use `RedisCache`, `MemcachedCache` or another dedicated caching backend for production deployments
Falling back to the built-in cache, that stores data in the metadata database, for the following cache: `EXPLORE_FORM_DATA_CACHE_CONFIG`. It is recommended to use `RedisCache`, `MemcachedCache` or another dedicated caching backend for production deployments
2022-08-06 06:29:08,322:WARNING:superset.utils.cache_manager:Falling back to the built-in cache, that stores data in the metadata database, for the following cache: `EXPLORE_FORM_DATA_CACHE_CONFIG`. It is recommended to use `RedisCache`, `MemcachedCache` or another dedicated caching backend for production deployments
2022-08-06 06:29:10,602:ERROR:flask_appbuilder.security.sqla.manager:DB Creation and initialization failed: (psycopg2.OperationalError) could not translate host name "None" to address: Unknown host
My config file c:\git\superset\superset_config.py has the following database settings:
DATABASE_HOST = os.getenv("DATABASE_HOST")
DATABASE_DB = os.getenv("DATABASE_DB")
POSTGRES_USER = os.getenv("POSTGRES_USER")
POSTGRES_PASSWORD = os.getenv("DATABASE_PASSWORD")
I could set those values in superset_config.py, or I could set the environment variables and let superset_config.py read them. However, my instance of Superset is running in a hosted Kubernetes cluster, and the superset-postgres service is not exposed by an external IP. The only service with an external IP is superset.
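For reference, Superset ultimately builds its metadata database connection from these values via SQLALCHEMY_DATABASE_URI, which is why an unset DATABASE_HOST shows up as the hostname "None" in the psycopg2 error. A minimal sketch of how the pieces are typically wired together in superset_config.py (the psycopg2 driver and port 5432 are assumptions for a standard Postgres setup):
import os
DATABASE_HOST = os.getenv("DATABASE_HOST")
DATABASE_DB = os.getenv("DATABASE_DB")
POSTGRES_USER = os.getenv("POSTGRES_USER")
POSTGRES_PASSWORD = os.getenv("DATABASE_PASSWORD")
# Superset reads the metadata database connection from SQLALCHEMY_DATABASE_URI;
# if DATABASE_HOST is unset, os.getenv returns None and psycopg2 tries to resolve "None"
SQLALCHEMY_DATABASE_URI = (
    f"postgresql+psycopg2://{POSTGRES_USER}:{POSTGRES_PASSWORD}"
    f"@{DATABASE_HOST}:5432/{DATABASE_DB}"
)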
Still stuck...
I was way off track - once I found the Preset-io backend-sdk repo on GitHub, it started coming together.
https://github.com/preset-io/backend-sdk
Install superset-cli
mkdir superset_cli
cd superset_cli
py -3.9 -m venv .venv
.venv\Scripts\activate
pip install -U setuptools setuptools_scm wheel #for good measure
pip install "git+https://github.com/preset-io/backend-sdk.git"
Example command
# Export resources (databases, datasets, charts, dashboards)
# into a directory as YAML files from a superset site: https://superset.example.org
mkdir export
superset-cli -u [username] -p [password] https://superset.example.org export export

GCP Composer Airflow - unable to install packages using PyPi

I have created a Composer environment with image version -> composer-2.0.13-airflow-2.2.5
When I try to install packages using PyPI, it fails.
Details below:
Command:
gcloud composer environments update $AIRFLOW --location us-east1 --update-pypi-packages-from-file requirements.txt
requirements.txt
---------------
google-api-core
google-auth
google-auth-oauthlib
google-cloud-bigquery
google-cloud-core
google-cloud-storage
google-crc32c
google-resumable-media
googleapis-common-protos
google-endpoints
joblib
json5
jsonschema
pandas
requests
requests-oauthlib
Error :
Karans-MacBook-Pro:composer_dags karanalang$ gcloud composer environments update $AIRFLOW --location us-east1 --update-pypi-packages-from-file requirements.txt
Waiting for [projects/versa-sml-googl/locations/us-east1/environments/versa-composer3] to be updated with [projects/versa-sml-googl/locations/us-east1/operations/c23b77a9-f46b-4222-bafd-62527bf27239]..
.failed.
ERROR: (gcloud.composer.environments.update) Error updating [projects/versa-sml-googl/locations/us-east1/environments/versa-composer3]: Operation [projects/versa-sml-googl/locations/us-east1/operations/c23b77a9-f46b-4222-bafd-62527bf27239] failed: Failed to install PyPI packages. looker-sdk 22.4.0 has requirement attrs>=20.1.0; python_version >= "3.7", but you have attrs 17.4.0.
Check the Cloud Build log at https://console.cloud.google.com/cloud-build/builds/60ac972a-8f5e-4b4f-a4a7-d81049fb19a3?project=939354532596 for details. For detailed instructions see https://cloud.google.com/composer/docs/troubleshooting-package-installation
Please note:
I have an older Composer cluster (Composer version 1.16.8, Airflow version 1.10.15), where the above command works fine.
However, it is not working with the new cluster.
What needs to be done to debug/fix this?
Thanks in advance!
I was able to get this working using the following code:
from airflow import models
from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)
# PROJECT_ID, REGION and default_dag_args are defined elsewhere in the DAG file
path = "gs://dataproc-spark-configs/pip_install.sh"
CLUSTER_GENERATOR_CONFIG = ClusterGenerator(
    project_id=PROJECT_ID,
    zone="us-east1-b",
    master_machine_type="n1-standard-4",
    worker_machine_type="n1-standard-4",
    num_workers=4,
    storage_bucket="dataproc-spark-logs",
    init_actions_uris=[path],
    metadata={'PIP_PACKAGES': 'pyyaml requests pandas openpyxl kafka-python'},
).make()
with models.DAG(
    'Versa-Alarm-Insights-UsingComposer2',
    # Continue to run DAG twice per day
    default_args=default_dag_args,
    schedule_interval='0 0/12 * * *',
    catchup=False,
) as dag:
    create_dataproc_cluster = DataprocCreateClusterOperator(
        task_id="create_dataproc_cluster",
        cluster_name="versa-composer2",
        region=REGION,
        cluster_config=CLUSTER_GENERATOR_CONFIG,
    )
The earlier approach, which involved installing packages by reading from a file, was working in Composer 1 (Airflow 1.x) but fails with Composer 2.x (Airflow 2.x).
From the error, it is clear that you are running an old version of the attrs package.
Run the command below and try again:
pip install attrs==20.3.0
or
pip install attrs==20.1.0
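Since you cannot run pip directly against a Composer environment, the equivalent (an assumption based on how --update-pypi-packages-from-file works) is to pin attrs in the requirements.txt passed to that flag, e.g. by adding the line:
attrs==20.3.0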

Errors when trying to call pycurl in a lambda on AWS

I want to use pycurl in order to measure TTFB and TTLB, but I'm unable to call pycurl from an AWS Lambda.
To focus on the issue, let's say I call this simple lambda function:
import json
import pycurl
import certifi
def lambda_handler(event, context):
    client_curl = pycurl.Curl()
    client_curl.setopt(pycurl.CAINFO, certifi.where())
    client_curl.setopt(pycurl.URL, "https://www.arolla.fr/blog/author/edouard-gomez-vaez/")  # set url
    client_curl.setopt(pycurl.FOLLOWLOCATION, 1)
    client_curl.setopt(pycurl.WRITEFUNCTION, lambda x: None)
    content = client_curl.perform()
    dns_time = client_curl.getinfo(pycurl.NAMELOOKUP_TIME)  # DNS time
    conn_time = client_curl.getinfo(pycurl.CONNECT_TIME)  # TCP/IP 3-way handshaking time
    starttransfer_time = client_curl.getinfo(pycurl.STARTTRANSFER_TIME)  # time-to-first-byte time
    total_time = client_curl.getinfo(pycurl.TOTAL_TIME)  # total time of the last request
    client_curl.close()
    data = json.dumps({'dns_time': dns_time,
                       'conn_time': conn_time,
                       'starttransfer_time': starttransfer_time,
                       'total_time': total_time,
                       })
    return {
        'statusCode': 200,
        'body': data
    }
I have the following error, which is understandable:
Unable to import module 'lambda_function': No module named 'pycurl'
I followed the tutorial https://aws.amazon.com/fr/premiumsupport/knowledge-center/lambda-layer-simulated-docker/ in order to create a layer, but then got the following error while generating the layer with Docker (I extracted the interesting part):
Could not run curl-config: [Errno 2] No such file or directory: 'curl-config': 'curl-config'
I even tried to generate the layer by simply running this on my own machine:
pip install -r requirements.txt -t python/lib/python3.6/site-packages/
zip -r mypythonlibs.zip python > /dev/null
and then uploading the zip as a layer in AWS, but I then got another error when launching the lambda:
Unable to import module 'lambda_function': libssl.so.1.0.0: cannot open shared object file: No such file or directory
It seems that the layer has to be built in an environment matching the Lambda target, extended with the needed libraries.
After a couple of hours scratching my head, I managed to resolve this issue.
TL;DR: build the layer using a Docker image inherited from the AWS one, but with the needed libraries installed, for instance libcurl-devel, openssl-devel, python36-devel. Have a look at the trick in Note 3 :).
The detailed way:
Prerequisite: have Docker installed.
In an empty directory, copy your requirements.txt containing pycurl (in my case: pycurl~=7.43.0.5).
In this same directory, create the following Dockerfile (cf Note 3):
FROM public.ecr.aws/sam/build-python3.6
RUN yum install libcurl-devel python36-devel -y
RUN yum install openssl-devel -y
ENV PYCURL_SSL_LIBRARY=openssl
RUN ln -s /usr/include /var/lang/include
Build the docker image:
docker build -t build-python3.6-pycurl .
Build the layer using this image (cf. Note 2) by running:
docker run -v "$PWD":/var/task "build-python3.6-pycurl" /bin/sh -c "pip install -r requirements.txt -t python/lib/python3.6/site-packages/; exit"
Zip the layer by running:
zip -r mylayer.zip python > /dev/null
Send the file mylayer.zip to AWS as a layer and make your lambda point to it (using the console, or following the tutorial https://aws.amazon.com/fr/premiumsupport/knowledge-center/lambda-layer-simulated-docker/).
Test your lambda and celebrate!
Note 1. If you want to use python 3.8, just change 3.6 or 36 by 3.8 and 38.
Note 2. Do not forget to remove the python folder when regenerating the layer, using admin rights if necessary.
Note 3. Mind the symlink in the last line of the Dockerfile. Without it, gcc won't be able to find some header files, such as Python.h.
Note 4. Compile pycurl with the openssl backend, since that is the SSL backend used in the Lambda execution environment. Otherwise you'll get a "libcurl link-time ssl backend (openssl) is different from compile-time ssl backend" error when executing the lambda.
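To confirm which SSL backend the bundled pycurl was actually compiled against, you can log pycurl.version from inside the lambda; a minimal sketch:
import pycurl
# pycurl.version reports the libcurl version and the SSL backend it was built with,
# e.g. something like "PycURL/7.43.0.5 libcurl/... OpenSSL/..."
print(pycurl.version)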

elastic beanstalk: incremental push git

When I try to push incremental changes to my AWS Elastic Beanstalk application, I get the following:
$ git aws.push
Updating the AWS Elastic Beanstalk environment None...
Error: Failed to get the Amazon S3 bucket name
I've already added full S3 access to my AWS user's policies.
I had a similar issue today; here are the steps I followed to investigate:
I modified line 133 of .git/AWSDevTools/aws/dev_tools.py to print the exception, like so:
except Exception, e:
print e
(Please make sure the indentation is correct, since Python is whitespace-sensitive.)
I ran git aws.push again,
and here is the exception printed:
BotoServerError: 403 Forbidden
{"Error":{"Code":"SignatureDoesNotMatch","Message":"Signature not yet current: 20150512T181122Z is still later than 20150512T181112Z (20150512T180612Z + 5 min.)","Type":"Sender"},"
The issue was a time difference between the server and my machine. Once I corrected it, everything started working fine.
Basically, the exception helps you find the exact root cause; it may be related to the secret key as well.
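If you suspect the same clock skew, a quick standalone Python 3 sketch (not part of the eb tools) is to compare your local clock against the Date header returned by an AWS endpoint:
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import urlopen
# A skew of more than a few minutes is enough to trigger SignatureDoesNotMatch errors
with urlopen("https://checkip.amazonaws.com") as resp:
    server_time = parsedate_to_datetime(resp.headers["Date"])
local_time = datetime.now(timezone.utc)
print("clock skew:", local_time - server_time)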
It may have something to do with the boto library (related thread). If you are on Ubuntu/Debian, try this:
Remove old version of boto
sudo apt-get remove python-boto
Install newer version
sudo apt-get install python-pip
sudo pip install -U boto
Other systems (e.g. Mac)
Via easy_install
sudo easy_install pip
pip install boto
Or simply build from source
git clone git://github.com/boto/boto.git
cd boto
python setup.py install
Had the same problem a moment ago.
Note:
I just noticed your environment is called None. Did you follow all instructions and execute eb config/eb init?
One more try:
Add export PATH=$PATH:<path to unzipped eb CLI package>/AWSDevTools/Linux/ to your path and execute AWSDevTools-RepositorySetup.sh; maybe something is wrong with your repository setup (notice the None weirdness). Other possible solutions:
Double-check your AWS credentials (maybe you are using different key IDs or the wrong credentials-file format)
Old/mismatched versions of the eb client and Python (check with eb -v and python -V) (the current client is this)
Use Amazon's policy validator to double-check that your AWS user is allowed to perform all required actions
If all that doesn't help, I'm out of options. Good luck.