I am running PySpark to retrieve a table called books_df from S3 and convert it to a Spark DataFrame for further use in my application.
Pyspark Code:
import os

import findspark

# Point findspark at the JVM and the Spark distribution before importing pyspark
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/drive/<<path to curr dir>>/spark-3.1.2-bin-hadoop2.7"
findspark.init()

import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = pyspark.SparkConf().setAppName('appName').setMaster('local')
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession(sc)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.createDataFrame(books_df)  # books_df: the table retrieved from S3
df.show(3)
+--------------------+--------------------+--------------------+-----------+
| book_authors| book_desc| book_edition|book_format|
+--------------------+--------------------+--------------------+-----------+
| Suzanne Collins|Winning will make...| null| Hardcover|
|J.K. Rowling|Mary...|There is a door a...| US Edition| Paperback|
| Harper Lee|The unforgettable...| 50th Anniversary| Paperback|
+--------------------+--------------------+--------------------+-----------+
only showing top 3 rows
flask_ngrok section:
from flask import Flask, render_template, redirect, url_for, request, session, abort, flash
from flask_ngrok import run_with_ngrok

app = Flask(__name__, template_folder='/content/drive/<<path to curr dir>>/templates', static_folder='/content/drive/<<path to curr dir>>/static')
run_with_ngrok(app)

@app.route('/')
def upload():
    return render_template('upload.html')

if __name__ == '__main__':
    app.run()
When running this, I get the following error:
* Serving Flask app "__main__" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
Exception in thread Thread-25:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 1177, in run
self.function(*self.args, **self.kwargs)
File "/usr/local/lib/python3.7/dist-packages/flask_ngrok.py", line 70, in start_ngrok
ngrok_address = _run_ngrok()
File "/usr/local/lib/python3.7/dist-packages/flask_ngrok.py", line 36, in _run_ngrok
j = json.loads(tunnel_url)
File "/usr/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Important note: when I add sc.stop() after the PySpark section, flask_ngrok works perfectly fine, but then I cannot access the PySpark DataFrame df because the SparkContext is closed.
Are PySpark and flask_ngrok trying to use the same port, and is that what causes the error?
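One thing worth checking: flask_ngrok reads the tunnel URL from ngrok's local API at http://localhost:4040, and 4040 is also the default port of the Spark UI, so a running SparkContext may be what the JSON parsing trips over. Below is a minimal sketch of moving the Spark UI out of the way; it is untested in this setup, and the port 4050 is an arbitrary choice.
import pyspark

# Sketch, assuming the clash is on port 4040: flask_ngrok queries
# http://localhost:4040/api/tunnels, and 4040 is also the Spark UI's default port.
conf = (
    pyspark.SparkConf()
    .setAppName('appName')
    .setMaster('local')
    .set("spark.ui.port", "4050")        # move the Spark UI off 4040
    # .set("spark.ui.enabled", "false")  # or disable the UI entirely
)
sc = pyspark.SparkContext(conf=conf)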
Related
I am trying to migrate from Cloud Composer 1 to Cloud Composer 2 (from Airflow 1.10.15 to Airflow 2.2.5). When I attempt to load data from BigQuery into GCS using the BigQueryToGCSOperator:
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator

# ...

BigQueryToGCSOperator(
    task_id='my-task',
    source_project_dataset_table='my-project-name.dataset-name.table-name',
    destination_cloud_storage_uris=f'gs://my-bucket/another-path/*.jsonl',
    export_format='NEWLINE_DELIMITED_JSON',
    compression=None,
    location='europe-west2'
)
the task fails with the following error:
[2022-06-07, 11:17:01 UTC] {taskinstance.py:1776} ERROR - Task failed with exception
Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/transfers/bigquery_to_gcs.py", line 141, in execute
job = hook.get_job(job_id=job_id).to_api_repr()
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/common/hooks/base_google.py", line 439, in inner_wrapper
return func(self, *args, **kwargs)
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/google/cloud/hooks/bigquery.py", line 1492, in get_job
job = client.get_job(job_id=job_id, project=project_id, location=location)
File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 2066, in get_job
resource = self._call_api(
File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/bigquery/client.py", line 782, in _call_api
return call()
File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
return retry_target(
File "/opt/python3.8/lib/python3.8/site-packages/google/api_core/retry.py", line 190, in retry_target
return target()
File "/opt/python3.8/lib/python3.8/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
raise exceptions.from_http_response(response)
google.api_core.exceptions.NotFound: 404 GET https://bigquery.googleapis.com/bigquery/v2/projects/my-project-name/jobs/airflow_1654592634552749_1896245556bd824c71f31c79d28cdfbe?projection=full&prettyPrint=false: Not found: Job my-project-name:airflow_1654592634552749_1896245556bd824c71f31c79d28cdfbe
Any clue what the issue may be here, and why it does not work on Airflow 2.2.5, even though the equivalent BigQueryToCloudStorageOperator works on Cloud Composer 1 with Airflow 1.10.15?
This appears to be a bug introduced in apache-airflow-providers-google version 7.0.0.
Also note that the file transfer from BigQuery into GCS will actually succeed (even though the task fails).
As a workaround you can either revert to a working provider version (if that is possible), e.g. 6.8.0, or use the BigQuery API directly and drop the BigQueryToGCSOperator.
For example,
from google.cloud import bigquery
from airflow.operators.python import PythonOperator

def load_bq_to_gcs():
    client = bigquery.Client()
    job_config = bigquery.job.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
    destination_uri = f"{<gcs-bucket-destination>}*.jsonl"
    dataset_ref = bigquery.DatasetReference(bq_project_name, bq_dataset_name)
    table_ref = dataset_ref.table(bq_table_name)
    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        job_config=job_config,
        location='europe-west2',
    )
    extract_job.result()
and then create an instance of PythonOperator:
PythonOperator(
    task_id='test_task',
    python_callable=load_bq_to_gcs,
)
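For context, here is a minimal sketch of how that operator could be wired into a DAG; the dag_id, start_date, and schedule below are placeholders, not taken from the original environment.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id='bq_to_gcs_workaround',    # hypothetical name
    start_date=datetime(2022, 6, 1),  # placeholder
    schedule_interval=None,
    catchup=False,
) as dag:
    export_to_gcs = PythonOperator(
        task_id='test_task',
        python_callable=load_bq_to_gcs,  # the function defined above
    )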
I am currently starting the process of preparing an App Engine app for Python 3 migration.
During the first step:
Migrate the App Engine bundled services in your Python 2 app to Google Cloud services ...
Following all the instructions to switch the datastore module from google.appengine.ext.ndb to google.cloud.ndb, I immediately get the following ImportError:
File "/usr/lib/google-cloud-sdk/platform/google_appengine/google/appengine/runtime/wsgi.py", line 240, in Handle
handler = _config_handle.add_wsgi_middleware(self._LoadHandler())
File "/usr/lib/google-cloud-sdk/platform/google_appengine/google/appengine/runtime/wsgi.py", line 311, in _LoadHandler
handler, path, err = LoadObject(self._handler)
File "/usr/lib/google-cloud-sdk/platform/google_appengine/google/appengine/runtime/wsgi.py", line 85, in LoadObject
obj = __import__(path[0])
File "/home/---.py", line 8, in <module>
from google.cloud import ndb
File "/home/test_env/local/lib/python2.7/site-packages/google/cloud/ndb/__init__.py", line 28, in <module>
from google.cloud.ndb.client import Client
File "/home/test_env/local/lib/python2.7/site-packages/google/cloud/ndb/client.py", line 23, in <module>
from google.cloud import _helpers
File "/home/test_env/local/lib/python2.7/site-packages/google/cloud/_helpers.py", line 29, in <module>
from six.moves import http_client
ImportError: No module named moves
This happens whether or not I am testing in a virtual environment. Importing six.moves works in a Python console.
Apparently this is an issue with the bundled development server dev_appserver.py. I found it solved here by Andrewsg:
I think we've identified an issue with devappserver related to the six library specifically. Could you please try a workaround? Add the line: import six; reload(six) to the top of your app, before NDB is loaded
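Applied to the code above, the suggested workaround would look roughly like this (Python 2 only, where reload is a builtin):
# Workaround for the dev_appserver/six issue: re-import six before
# google.cloud.ndb (and therefore six.moves) is loaded.
import six
reload(six)

from google.cloud import ndb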
I am trying to connect Python with a Hive database.
They are on different servers.
Hive resides on host xx.xxx.xxx.x and Python runs on my local system.
I am trying to use the code below, however it is not working:
import pyhive
from pyhive import hive

conn = hive.connect(
    host='xx.xxx.xx.xx',
    port=8888,
    auth='KERBEROS',
    kerberos_service_name='adsuedscaihen01.aipcore.local',
    username='user1',
    database='database1'
)
cur = conn.cursor()
cur.execute('SELECT * from table1')
result = cur.fetchall()
print(result)
When I run the above code, I get the following error:
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/site-packages/pyhive/hive.py", line 64, in connect
return Connection(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/pyhive/hive.py", line 162, in init
self._transport.open()
File "/usr/lib/python2.7/site-packages/thrift_sasl/init.py", line 79, in open
message=("Could not start SASL: %s" % self.sasl.getError()))
thrift.transport.TTransport.TTransportException: Could not start SASL: Error in sasl_client_start (-1) SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Server not found in Kerberos database)
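Not a definitive fix, but "Server not found in Kerberos database" usually means the KDC has no principal matching <kerberos_service_name>/<host>. HiveServer2's principal is normally hive/<fqdn>@REALM, so kerberos_service_name is usually just 'hive' rather than the hostname. Below is a sketch under those assumptions; the port 10000 is HiveServer2's default, not taken from your setup, and a valid ticket from kinit is required first.
from pyhive import hive

# Assumes a valid Kerberos ticket on the client (kinit) and the usual
# hive/<fqdn>@REALM service principal on the HiveServer2 side.
conn = hive.connect(
    host='xx.xxx.xx.xx',
    port=10000,                     # HiveServer2 default; adjust if different
    auth='KERBEROS',
    kerberos_service_name='hive',   # service part of the principal, not the FQDN
    database='database1',
)
cur = conn.cursor()
cur.execute('SELECT * FROM table1 LIMIT 10')
print(cur.fetchall())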
I'm trying to put an item into an Amazon DynamoDB table using a Python script, but when I run the script I get the following error:
Traceback (most recent call last):
File "./table.py", line 32, in <module>
item.put(None, None)
File "/usr/local/lib/python2.7/dist-packages/boto/dynamodb/item.py", line 183, in put
return self.table.layer2.put_item(self, expected_value, return_values)
File "/usr/local/lib/python2.7/dist-packages/boto/dynamodb/layer2.py", line 551, in put_item
object_hook=self.dynamizer.decode)
File "/usr/local/lib/python2.7/dist-packages/boto/dynamodb/layer1.py", line 384, in put_item
object_hook=object_hook)
File "/usr/local/lib/python2.7/dist-packages/boto/dynamodb/layer1.py", line 119, in make_request
retry_handler=self._retry_handler)
File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 954, in _mexe
status = retry_handler(response, i, next_sleep)
File "/usr/local/lib/python2.7/dist-packages/boto/dynamodb/layer1.py", line 159, in _retry_handler
data)
boto.exception.DynamoDBResponseError: DynamoDBResponseError: 400 Bad Request
{u'message': u'Requested resource not found', u'__type': u'com.amazonaws.dynamodb.v20111205#ResourceNotFoundException'}
My code is:
#!/usr/bin/python
import boto
import boto.s3
import sys
from boto import dynamodb2
from boto.dynamodb2.table import Table
from boto.s3.key import Key
import boto.dynamodb

conn = boto.dynamodb.connect_to_region('us-west-2', aws_access_key_id=<My_access_key>, aws_secret_access_key=<my_secret_key>)
entity = conn.create_schema(hash_key_name='RPI_ID', hash_key_proto_value=str, range_key_name='PIC_ID', range_key_proto_value=str)
table = conn.create_table(name='tblSensor', schema=entity, read_units=10, write_units=10)

item_data = {
    'Pic_id': 'P100',
    'RId': 'R100',
    'Temperature': '28.50'
}
item = table.new_item(
    # Our hash key is 'forum'
    hash_key='RPI_ID',
    # Our range key is 'subject'
    range_key='PIC_ID',
    # This has the
    attrs=item_data
)
item.put()  # I get the error here.
My reference is: Setting/Getting/Deleting CORS Configuration on a Bucket
I ran your code in my account and it worked perfectly, returning:
{u'ConsumedCapacityUnits': 1.0}
You might want to check that you are using the latest version of boto:
pip install boto --upgrade
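Another quick sanity check, since ResourceNotFoundException usually means the table does not exist in the region you connected to. The snippet below is only an illustration using the same boto.dynamodb layer; credentials are assumed to come from your environment or the same keys as above.
import boto.dynamodb

# Hypothetical check: confirm 'tblSensor' actually exists in us-west-2.
conn = boto.dynamodb.connect_to_region('us-west-2')
print(conn.list_tables())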
I searched on Google and solved the problem: I set the correct date and time on my Raspberry Pi board, ran the program again, and it works fine.
I'm trying to understand this code, which uses Py4J. However, each time I run it I get the same error. I have py4j installed on my Ubuntu 14.04 system, and the jar file is in /usr/share/py4j.
The code is:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
from nltk.tokenize import wordpunct_tokenize , sent_tokenize
from py4j.java_gateway import JavaGateway
import nltk
from nltk.tree import Tree
import os.path
import parsers
LangPaths = os.path.realpath("/home/Downloads/Abstractive Summarizer/SumMe-master/Summarizer/langdetector/profiles/")
tltagger = nltk.data.load("taggers/english.pickle")
tlChunker = nltk.data.load("chunkers/maxent_ne_chunker/english_ace_binary.pickle")
enChunker = nltk.data.load("chunkers/maxent_ne_chunker/english_ace_multiclass.pickle")
punkt_param = PunktParameters() #creates an opening for tokenizer parameters.
punkt_param.abbrev_types = set(['gng','mr','mrs','dr','rep']) #abbreviations further accepted goes here
sentence_splitter = PunktSentenceTokenizer(punkt_param)
tokenized = ""
gateway = JavaGateway()
detector = gateway.entry_point
detector.init(LangPaths)
The error which I'm getting is:
File "/home/shiju/Downloads/Abstractive
Summarizer/SumMe-master/Summarizer/preprocessor.py", line 29, in
detector.init(LangPaths)
File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py",
line 811, in __call_ answer =
self.gateway_client.send_command(command)`
File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py",
line 624, in send_command
connection = self._get_connection()
File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py",
line 579, in _get_connection
connection = self._create_connection() File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line
585, in _create_connection
connection.start()
File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py",
line 697, in start
raise Py4JNetworkError(msg, e) py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server
I think Python is unable to connect to the Java application.
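That diagnosis is most likely right: JavaGateway() only connects to a GatewayServer that is already listening on the Java side; it does not start the JVM for you. A quick way to check whether anything is listening before creating the gateway is sketched below; 25333 is py4j's default gateway port and an assumption here.
import socket

# Assumption: the Java entry point runs a GatewayServer on py4j's default port 25333.
try:
    socket.create_connection(("127.0.0.1", 25333), timeout=2).close()
    print("Something is listening on 25333; JavaGateway() should be able to connect.")
except socket.error as exc:
    print("Nothing is listening on 25333; start the Java GatewayServer first:", exc)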