Google pytrends in Python jupyter notebook - google-trends

I am trying to run a script in a Jupyter notebook but am getting a connection error.
Here is my script:
from pytrends.request import TrendReq
import requests
#make a pytrends object to request Google Trends data
pytrends = TrendReq(hl='en-US')
#extract data about weekly searches of certain keywords
keywords = ["Python", "R", "C++", "Java", "HTML"]
pytrends.build_payload(keywords, timeframe='today 5-y')
and my error is:
ProxyError: HTTPSConnectionPool(host='trends.google.com', port=443): Max retries exceeded with url: /?geo=US (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x000001605F2C53C8>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed')))
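The traceback shows the request failing while resolving a proxy host on the local machine ([Errno 11001] getaddrinfo failed), so this is an environment/proxy issue rather than a pytrends bug. For reference, TrendReq can be pointed at a proxy explicitly; a minimal sketch, assuming pytrends' proxies and timeout parameters and a placeholder proxy address:
from pytrends.request import TrendReq

# placeholder proxy address -- replace with the proxy your network actually uses,
# or drop the proxies argument entirely if no proxy is required
pytrends = TrendReq(
    hl='en-US',
    timeout=(10, 25),
    proxies=['https://proxy.example.com:8080'],
)

keywords = ["Python", "R", "C++", "Java", "HTML"]
pytrends.build_payload(keywords, timeframe='today 5-y')

# build_payload only prepares the request; interest_over_time() fetches the weekly data
df = pytrends.interest_over_time()
print(df.head())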

Related

Creating Connection for RedshiftDataOperator

So I went to the Airflow documentation for AWS Redshift; there are two operators that can execute a SQL query: RedshiftSQLOperator and RedshiftDataOperator. I already implemented my job using RedshiftSQLOperator, but I want to do it with RedshiftDataOperator instead, because I don't want to use a Postgres connection (as RedshiftSQLOperator does) but the AWS API.
RedshiftDataOperator Documentation
I read this documentation; there is an aws_conn_id parameter. But when I try to use the same connection ID, I get an error.
[2023-01-11, 04:55:56 UTC] {base.py:68} INFO - Using connection ID 'redshift_default' for task execution.
[2023-01-11, 04:55:56 UTC] {base_aws.py:206} INFO - Credentials retrieved from login
[2023-01-11, 04:55:56 UTC] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/redshift_data.py", line 146, in execute
self.statement_id = self.execute_query()
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/redshift_data.py", line 124, in execute_query
resp = self.hook.conn.execute_statement(**filter_values)
File "/home/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 415, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/airflow/.local/lib/python3.7/site-packages/botocore/client.py", line 745, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (UnrecognizedClientException) when calling the ExecuteStatement operation: The security token included in the request is invalid.
From the task:
redshift_data_task = RedshiftDataOperator(
    task_id='redshift_data_task',
    database='rds',
    region='ap-southeast-1',
    aws_conn_id='redshift_default',
    sql="""
        call some_procedure();
    """
)
What should I fill in for the Airflow connection? The documentation has no example of the values I should provide. Thanks
Airflow RedshiftDataOperator Connection Required Value
Have you tried using the Amazon Redshift connection? There is both an option for authenticating using your Redshift credentials:
Connection ID: redshift_default
Connection Type: Amazon Redshift
Host: <your-redshift-endpoint> (for example, redshift-cluster-1.123456789.us-west-1.redshift.amazonaws.com)
Schema: <your-redshift-database> (for example, dev, test, prod, etc.)
Login: <your-redshift-username> (for example, awsuser)
Password: <your-redshift-password>
Port: <your-redshift-port> (for example, 5439)
(source)
and an option for using an IAM role (there is an example in the first link).
Disclaimer: I work at Astronomer :)
EDIT: Tested the following with Airflow 2.5.0 and Amazon provider 6.2.0:
Added the IP of my Airflow instance to the VPC security group with "All traffic" access.
Airflow Connection with the connection id aws_default, Connection type "Amazon Web Services", extra: { "aws_access_key_id": "<your-access-key-id>", "aws_secret_access_key": "<your-secret-access-key>", "region_name": "<your-region-name>" }. All other fields blank. I used a root key for my toy AWS account. If you use other credentials, you need to make sure the IAM role has access and the right permissions for the Redshift cluster (there is a list in the link above).
Operator code:
red = RedshiftDataOperator(
    task_id="red",
    database="dev",
    sql="SELECT * FROM dev.public.users LIMIT 5;",
    cluster_identifier="redshift-cluster-1",
    db_user="awsuser",
    aws_conn_id="aws_default"
)
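If you prefer not to create the connection through the UI, the same aws_default connection can also be supplied as an environment variable. A sketch under the assumption that your Airflow version supports JSON-serialized connections in AIRFLOW_CONN_* variables (available in recent 2.x releases, including the 2.5.0 tested above); the credential values are placeholders:
export AIRFLOW_CONN_AWS_DEFAULT='{
    "conn_type": "aws",
    "extra": {
        "aws_access_key_id": "<your-access-key-id>",
        "aws_secret_access_key": "<your-secret-access-key>",
        "region_name": "<your-region-name>"
    }
}'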

Read/write to AWS S3 from Apache Spark Kubernetes container via vpc endpoint giving 400 Bad Request

I am trying to read and write data to AWS S3 from an Apache Spark Kubernetes container via a VPC endpoint.
The Kubernetes container is on premises (data center) in the US region. Following is the PySpark code to connect to S3:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
conf = (
    SparkConf()
    .setAppName("PySpark S3 Example")
    .set("spark.hadoop.fs.s3a.endpoint.region", "us-east-1")
    .set("spark.hadoop.fs.s3a.endpoint", "<vpc-endpoint>")
    .set("spark.hadoop.fs.s3a.access.key", "<access_key>")
    .set("spark.hadoop.fs.s3a.secret.key", "<secret_key>")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    # note: this second executor setting overrides the enableV4 option set on the line above
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
    # note: unlike the other s3a options, this key is missing the spark.hadoop. prefix
    .set("spark.fs.s3a.path.style.access", "true")
    .set("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
data = [{"key1": "value1", "key2": "value2"}, {"key1":"val1","key2":"val2"}]
df = spark.createDataFrame(data)
df.write.format("json").mode("append").save("s3a://<bucket-name>/test/")
Exception Raised:
py4j.protocol.Py4JJavaError: An error occurred while calling o91.save.
: org.apache.hadoop.fs.s3a.AWSBadRequestException: doesBucketExist on <bucket-name>
: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: <requestID>;
Any help would be appreciated
Unless your Hadoop S3A client is region aware (3.3.1+), setting that region option won't work. There's an AWS SDK option, "aws.region", which you can set as a system property instead.
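A minimal sketch of that suggestion, assuming the bundled AWS SDK honours the aws.region system property; the endpoint, keys and region value are placeholders carried over from the question:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("PySpark S3 Example")
    .set("spark.hadoop.fs.s3a.endpoint", "<vpc-endpoint>")
    .set("spark.hadoop.fs.s3a.access.key", "<access_key>")
    .set("spark.hadoop.fs.s3a.secret.key", "<secret_key>")
    .set("spark.hadoop.fs.s3a.path.style.access", "true")
    # pass the SDK region as a JVM system property on both driver and executors
    .set("spark.driver.extraJavaOptions", "-Daws.region=us-east-1")
    .set("spark.executor.extraJavaOptions", "-Daws.region=us-east-1")
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()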

Using a notebook (Databricks) shows error java.io.IOException: Error getting access token from metadata server

I'm using https://community.cloud.databricks.com/ (notebook). When I try to access GCP Cloud Storage through the Python command below:
df = spark.read.format("csv").load("gs://test-gcs-doc-bucket-pr/test")
Error:
java.io.IOException: Error getting access token from metadata server at: 169.254.169.xxx/computeMetadata/v1/instance/service-accounts/default/token
Databricks Spark Configuration:
spark.hadoop.fs.gs.auth.client_id "10"
spark.hadoop.fs.gs.auth.auth_uri "https://accounts.google.com/o/oauth2/auth"
spark.databricks.delta.preview.enabled true
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email "test-gcs.iam.gserviceaccount.com"
spark.hadoop.fs.gs.auth.token_uri "https://oauth2.googleapis.com/token"
spark.hadoop.fs.gs.project_id "oval-replica-9999999"
spark.hadoop.fs.gs.auth.service.account.private_key "--BEGIN"
spark.hadoop.fs.gs.auth.service.account.private_key_id "3f869c98d389bb28c5b13a0e31785e73d8b"
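For comparison, the Hadoop GCS connector and the Databricks GCS documentation spell the service-account properties with dots rather than underscores. The exact property names below are an assumption to verify against the docs for your runtime; the credential values are placeholders:
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.project.id <your-project-id>
spark.hadoop.fs.gs.auth.service.account.email <service-account>@<project>.iam.gserviceaccount.com
spark.hadoop.fs.gs.auth.service.account.private.key <private-key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private-key-id>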

GPU utilization is zero when running batch transform in Amazon SageMaker

I want to run a batch transform job on AWS SageMaker. I have an image classification model which I have trained on a local GPU. Now I want to deploy it on AWS SageMaker and make predictions using Batch Transform. While the batch transform job runs successfully, the GPU utilization during the job is always zero (GPU memory utilization, however, is at 97%). That's what CloudWatch is telling me. Also, the job takes approx. 7 minutes to process 500 images; I would expect it to run much faster than this, at least compared to the time it takes to process the images on a local GPU.
My question: Why doesn't the GPU get used during batch transform, even though I am using a GPU instance (I am using an ml.p3.2xlarge instance)? I was able to deploy the very same model to an endpoint and send requests. When deploying to an endpoint instead of using batch transform, the GPU actually gets used.
Model preparation
I am using a Keras model with a TensorFlow backend. I converted this model to a SageMaker model using this guide: https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/
import tensorflow as tf
from tensorflow.python.saved_model import builder
from tensorflow.python.saved_model.signature_def_utils import predict_signature_def
from tensorflow.python.saved_model import tag_constants
import tarfile
import sagemaker

# deactivate eager mode
if tf.executing_eagerly():
    tf.compat.v1.disable_eager_execution()

builder = builder.SavedModelBuilder(export_dir)

# Create prediction signature to be used by TensorFlow Serving Predict API
signature = predict_signature_def(
    inputs={"image_bytes": model.input}, outputs={"score_bytes": model.output})

with tf.compat.v1.keras.backend.get_session() as sess:
    # Save the meta graph and variables
    builder.add_meta_graph_and_variables(
        sess=sess, tags=[tag_constants.SERVING], signature_def_map={"serving_default": signature})
    builder.save()

with tarfile.open(tar_model_file, mode='w:gz') as archive:
    archive.add('export', recursive=True)

sagemaker_session = sagemaker.Session()
s3_uri = sagemaker_session.upload_data(path=tar_model_file, bucket=bucket, key_prefix=sagemaker_model_dir)
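Note that export_dir, model, tar_model_file, bucket and sagemaker_model_dir are defined elsewhere in the notebook. Judging by the warmup-request path in the logs below (/opt/ml/model/export/my_model/1/...), export_dir presumably follows the export/<model_name>/<version> layout; the values here are hypothetical examples, not taken from the original code:
export_dir = 'export/my_model/1'   # assumed layout: export/<model_name>/<version>
tar_model_file = 'model.tar.gz'    # hypothetical archive name uploaded to S3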
Batch Transform
Container image used for batch transform: 763104351884.dkr.ecr.eu-central-1.amazonaws.com/tensorflow-inference:2.0.0-gpu
from sagemaker.tensorflow import TensorFlowModel

framework = 'tensorflow'
instance_type = 'ml.p3.2xlarge'
image_scope = 'inference'
tf_version = '2.0.0'
py_version = '3.6'

sagemaker_model = TensorFlowModel(model_data=MODEL_TAR_ON_S3, role=role, image_uri=tensorflow_image)

transformer = sagemaker_model.transformer(
    instance_count=1,
    instance_type=instance_type,
    strategy='MultiRecord',
    max_concurrent_transforms=8,
    max_payload=10,  # in MB
    output_path=output_data_path,
)

transformer.transform(
    data=input_data_path,
    job_name=job_name,
    content_type='application/json',
    logs=False,
    wait=True
)
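tensorflow_image is not defined in the excerpt above; given the framework/version/scope variables, it was presumably resolved with the SageMaker SDK's image_uris helper. A minimal sketch, assuming SageMaker Python SDK v2 and the eu-central-1 region from the container image URI above:
from sagemaker import image_uris

# resolve the ECR URI of the TensorFlow inference container; the GPU variant is
# selected automatically for a p3 instance type
tensorflow_image = image_uris.retrieve(
    framework=framework,          # 'tensorflow'
    region='eu-central-1',
    version=tf_version,           # '2.0.0'
    image_scope=image_scope,      # 'inference'
    instance_type=instance_type,  # 'ml.p3.2xlarge'
)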
Log file excerpts
Loading the model takes quite a long time (several minutes). During this time, the following error message is logged:
2020-11-08T15:14:12.433+01:00 2020/11/08 14:14:12 [error]
14#14: *3066 no live upstreams while connecting to upstream, client:
169.254.255.130, server: , request: "GET /ping HTTP/1.1", subrequest: "/v1/models/my_model:predict", upstream:
"http://tfs_upstream/v1/models/my_model:predict", host:
"169.254.255.131:8080" 2020-11-08T15:14:12.433+01:00
169.254.255.130 - - [08/Nov/2020:14:14:12 +0000] "GET /ping HTTP/1.1" 502 157 "-" "Go-http-client/1.1" 2020-11-08T15:14:12.433+01:00
2020/11/08 14:14:12 [error] 14#14: *3066 js: failed ping#015
2020-11-08T15:14:12.433+01:00 502 Bad
Gateway#015 2020-11-08T15:14:12.433+01:00
#015 2020-11-08T15:14:12.433+01:00 502
Bad Gateway#015 2020-11-08T15:14:12.433+01:00
nginx/1.16.1#015 2020-11-08T15:14:12.433+01:00
#015 2020-11-08T15:14:12.433+01:00 #015
There was a log entry about NUMA node read:
successful NUMA node read from SysFS had negative value (-1), but
there must be at least one NUMA node, so returning NUMA node zero
And about a serving warmup request:
No warmup data file found at
/opt/ml/model/export/my_model/1/assets.extra/tf_serving_warmup_requests
And this warning:
[warn] getaddrinfo: address family for nodename not supported

Python Orion Context Broker Token problems

I've been developing the following code:
import json
import requests

# token is obtained elsewhere (e.g. from the FIWARE Lab OAuth2 token endpoint)
datos = {
    "id": "1",
    "type": "Car",
    "bra": "0",
}
jsonData = json.dumps(datos)

url = 'http://130.456.456.555:1026/v2/entities'
head = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "X-Auth-Token": token
}
response = requests.post(url, data=jsonData, headers=head)
My problem is that I can't establish a connection between my computer and my FIWARE Lab instance.
The error is:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='130.206.113.177', port=1026): Max retries exceeded with url: /v1/entities (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f02c97c1f90>: Failed to establish a new connection: [Errno 110] Connection timed out',))
Seems to be a network connectivity problem.
Assuming that there is actually an Orion process listening on port 1026 at IP 130.206.113.177 (this should be checked, e.g. with a curl localhost:1026/version command executed in the same VM where Orion runs), the most probable causes of Orion connection problems are:
Something in the Orion host (e.g. a firewall or security group) is blocking the incoming connection
Something in the client host (e.g. a firewall) is blocking the outgoing connection
Some other network issue is causing the connection problem.
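One quick way to narrow this down is to hit Orion's version endpoint from the client machine as well (the IP below is the one from the traceback and is assumed to be the Orion host); a minimal sketch:
import requests

# if this also times out, the problem is at the network level (firewall,
# security group, routing), not in the entity-creation code itself
try:
    r = requests.get('http://130.206.113.177:1026/version', timeout=10)
    print(r.status_code, r.text)
except requests.exceptions.RequestException as e:
    print("Cannot reach Orion:", e)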