get the client from pyspark - hdfs

I want to retrieve a list of files. I saw a post saying that these commands would do the job:
from hdfs import Config
client = Config().get_client('dev')
client.list('/*')
But actually, execution fails:
---------------------------------------------------------------------------
HdfsError Traceback (most recent call last)
<ipython-input-308-ab40dc16879a> in <module>()
----> 1 client = Config().get_client('dev')
/opt/cloudera/extras/anaconda3/lib/python3.5/site-packages/hdfs/config.py in get_client(self, alias)
117 break
118 else:
--> 119 raise HdfsError('Alias %r not found in %r.', alias, self.path)
120 return self._clients[alias]
121
HdfsError: Alias 'dev' not found in '/home/sbenet/.hdfscli.cfg'.
As you can see, it is trying to access the file /home/sbenet/.hdfscli.cfg, which does not exist.
If I want to use this method to retrieve the list of files, I need to fix this .hdfscli.cfg issue, or use another method, perhaps with sc.

You have to create a configuration file first, for example:
[global]
default.alias = dev
[dev.alias]
url = http://dev.namenode:port
user = ann
[prod.alias]
url = http://prod.namenode:port
root = /jobs/
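If it helps, the same file can be generated programmatically with Python's configparser. A minimal sketch, with the URL and user taken from the placeholder config above (replace them with your own cluster's values):

```python
import configparser
import os

# Build the config shown above; the alias URL and user are placeholders
# for your own namenode address and username.
config = configparser.ConfigParser()
config['global'] = {'default.alias': 'dev'}
config['dev.alias'] = {'url': 'http://dev.namenode:port', 'user': 'ann'}

# Write it where hdfscli looks for it by default.
path = os.path.expanduser('~/.hdfscli.cfg')
with open(path, 'w') as f:
    config.write(f)
```

Once the file exists, Config().get_client('dev') should resolve the alias.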


Getting `dtype of input object does not match expected dtype <U0` when invoking MLflow-deployed NLP model in SageMaker

I deployed a Hugging Face Transformers model in SageMaker using MLflow's sagemaker.deploy().
When logging the model I used infer_signature(np.array(test_example), loaded_model.predict(test_example)) to infer input and output signatures.
The model is deployed successfully. When trying to query the model, I get a ModelError (full traceback below).
To query the model, I am using precisely the same test_example that I used for infer_signature():
test_example = [['This is the subject', 'This is the body']]
The only difference is that when querying the deployed model, I am not wrapping the test example in np.array(), as that is not JSON-serializable.
To query the model I tried two different approaches:
import boto3
import json
import pandas as pd

SAGEMAKER_REGION = 'us-west-2'
MODEL_NAME = '...'
client = boto3.client("sagemaker-runtime", region_name=SAGEMAKER_REGION)

# Approach 1
client.invoke_endpoint(
    EndpointName=MODEL_NAME,
    Body=json.dumps(test_example),
    ContentType="application/json",
)

# Approach 2
client.invoke_endpoint(
    EndpointName=MODEL_NAME,
    Body=pd.DataFrame(test_example).to_json(orient="split"),
    ContentType="application/json; format=pandas-split",
)
but both result in the same error.
I would be grateful for your suggestions.
Thank you!
Note: I am using Python 3 and all strings are unicode.
---------------------------------------------------------------------------
ModelError Traceback (most recent call last)
<ipython-input-89-d09862a5f494> in <module>
2 EndpointName=MODEL_NAME,
3 Body=test_example,
----> 4 ContentType="application/json; format=pandas-split",
5 )
~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
393 "%s() only accepts keyword arguments." % py_operation_name)
394 # The "self" in this scope is referring to the BaseClient.
--> 395 return self._make_api_call(operation_name, kwargs)
396
397 _api_call.__name__ = str(py_operation_name)
~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
723 error_code = parsed_response.get("Error", {}).get("Code")
724 error_class = self.exceptions.from_code(error_code)
--> 725 raise error_class(parsed_response, operation_name)
726 else:
727 return parsed_response
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{"error_code": "BAD_REQUEST", "message": "dtype of input object does not match expected dtype <U0"}". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/bec-sagemaker-model-test-app in account 543052680787 for more information.
Environment info:
{'channels': ['defaults', 'conda-forge', 'pytorch'],
 'dependencies': ['python=3.6.10',
                  'pip==21.3.1',
                  'pytorch=1.10.2',
                  'cudatoolkit=10.2',
                  {'pip': ['mlflow==1.22.0',
                           'transformers==4.17.0',
                           'datasets==1.18.4',
                           'cloudpickle==1.3.0']}],
 'name': 'bert_bec_test_env'}
I encoded the strings into numbers before sending them to the model.
Next, I added code within the model wrapper that decodes the numbers back to strings. This workaround worked without issues.
My understanding is that this might indicate a problem with MLflow's type checking for strings.
Added an issue here: https://github.com/mlflow/mlflow/issues/5474
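Roughly, the encode/decode workaround looks like this. This is a simplified sketch, not the exact code; the vocabulary handling is illustrative, and any reversible mapping would do:

```python
import numpy as np

def encode(texts, vocab):
    # Map each string to an integer code, assigning new codes on first sight.
    return np.array([[vocab.setdefault(t, len(vocab)) for t in row]
                     for row in texts])

def decode(codes, vocab):
    # Reverse the mapping inside the model wrapper.
    inverse = {i: t for t, i in vocab.items()}
    return [[inverse[int(c)] for c in row] for row in codes]

vocab = {}
encoded = encode([['This is the subject', 'This is the body']], vocab)
decoded = decode(encoded, vocab)
```

The integer array passes MLflow's dtype check, and the wrapper recovers the original strings before calling the underlying model.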

BigQuery Storage API: the table has a storage format that is not supported

I've used the sample from the BQ documentation to read a BQ table into a pandas dataframe using this query:
query_string = """
SELECT
  CONCAT(
    'https://stackoverflow.com/questions/',
    CAST(id as STRING)) as url,
  view_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags like '%google-bigquery%'
ORDER BY view_count DESC
"""

dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)
print(dataframe.head())
url view_count
0 https://stackoverflow.com/questions/22879669 48540
1 https://stackoverflow.com/questions/13530967 45778
2 https://stackoverflow.com/questions/35159967 40458
3 https://stackoverflow.com/questions/10604135 39739
4 https://stackoverflow.com/questions/16609219 34479
However, the minute I try to use any other non-public dataset, I get the following error:
google.api_core.exceptions.FailedPrecondition: 400 there was an error creating the session: the table has a storage format that is not supported
Is there some setting I need to set in my table so that it can work with the BQ Storage API?
This works:
query_string = """SELECT funding_round_type, count(*) FROM `datadocs-py.datadocs.investments` GROUP BY funding_round_type order by 2 desc LIMIT 2"""
>>> bqclient.query(query_string).result().to_dataframe()
funding_round_type f0_
0 venture 104157
1 seed 43747
However, when I set it to use the bqstorageclient I get that error:
>>> bqclient.query(query_string).result().to_dataframe(bqstorage_client=bqstorageclient)
Traceback (most recent call last):
File "/Users/david/Desktop/V/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
return callable_(*args, **kwargs)
File "/Users/david/Desktop/V/lib/python3.6/site-packages/grpc/_channel.py", line 533, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/Users/david/Desktop/V/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.FAILED_PRECONDITION
details = "there was an error creating the session: the table has a storage format that is not supported"
debug_error_string = "{"created":"#1565047973.444089000","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"there was an error creating the session: the table has a storage format that is not supported","grpc_status":9}"
>
I experienced the same issue as of 06 Nov 2019. The error you are getting is a known issue with the Read API: it currently cannot handle result sets smaller than 10 MB. I came across this GitHub issue, which sheds some light on the problem:
GitHub.com - GoogleCloudPlatform/spark-bigquery-connector - FAILED_PRECONDITION: there was an error creating the session: the table has a storage format that is not supported #46
I have tested it with a query that returns a result set larger than 10 MB, and it seems to work fine for me with an EU multi-regional location for the dataset that I am querying against.
Also, you will need to install fastavro in your environment for this functionality to work.
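Until the limitation is fixed, one possible stopgap is to fall back to the plain REST download when the Storage API refuses the session. A hedged sketch (bqclient and bqstorageclient stand in for the clients created in the question; the exception is checked by class name only to keep this snippet dependency-free):

```python
def query_to_dataframe(bqclient, bqstorageclient, query):
    """Try the fast Storage API first; fall back to the slower REST
    download when the result set is too small for it."""
    result = bqclient.query(query).result()
    try:
        return result.to_dataframe(bqstorage_client=bqstorageclient)
    except Exception as exc:
        # google.api_core.exceptions.FailedPrecondition is what surfaces
        # as "the table has a storage format that is not supported".
        if type(exc).__name__ != 'FailedPrecondition':
            raise
        return result.to_dataframe()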

Pyomo - Location of Log Files

Pretty basic question, but where can I find solver log files from Pyomo? I have a local installation of the COIN-OR solvers on an Ubuntu machine.
This is happening in a Jupyter notebook, but I'm getting the same error message when I run the .py file from terminal.
solverpath_exe = '~/COIN-OR/bin/couenne'
opt = SolverFactory('couenne', executable=solverpath_exe)
opt.solve(model, tee=True)
---------------------------------------------------------------------------
ApplicationError Traceback (most recent call last)
<ipython-input-41-48380298846e> in <module>()
29 #instance = model.create_instance()
30 opt = SolverFactory('couenne', executable = solverpath_exe)
---> 31 opt.solve(model,tee=True)
32 #solver=SolverFactory(solvername,executable=solverpath_exe)
/home/ralphasher/.local/lib/python3.6/site-packages/pyomo/opt/base/solvers.py in solve(self, *args, **kwds)
598 logger.error("Solver log:\n" + str(_status.log))
599 raise pyutilib.common.ApplicationError(
--> 600 "Solver (%s) did not exit normally" % self.name)
601 solve_completion_time = time.time()
602 if self._report_timing:
ApplicationError: Solver (asl) did not exit normally
To keep the solver log files, you need to specify that you want to keep them when calling solve on your model.
opt.solve(model, tee=True, keepfiles=True)
The resulting file will be next to your main executable.
You can also write the log to a file with a specific name, using
opt.solve(model, tee=True, logfile="some_file_name.log")

Python2: the meaning of '!../'

Hi, I am studying Caffe with this tutorial (http://nbviewer.jupyter.org/github/BVLC/caffe/blob/tutorial/examples/00-caffe-intro.ipynb).
I don't know the meaning of '!../' in code like the following:
import os
if os.path.isfile(caffe_root + 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'):
    print 'CaffeNet found.'
else:
    print 'Downloading pre-trained CaffeNet model...'
    !../scripts/download_model_binary.py ../models/bvlc_reference_caffenet

# load ImageNet labels (for understanding the output)
labels_file = 'synset_words.txt'
if not os.path.exists(labels_file):
    print 'begin'
    !../home2/challege98/caffe/data/ilsvrc12/get_ilsvrc_aux.sh
    print 'finish'
labels = np.loadtxt(labels_file, str, delimiter='\t')
Could you explain it in detail? When I run the code, there is an error:
Downloading pre-trained CaffeNet model...
/bin/sh: 1: ../scripts/download_model_binary.py: not found
begin
/bin/sh: 1: ../home2/challege98/caffe/data/ilsvrc12/get_ilsvrc_aux.sh: not found
finish
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-19-8534d29d47f5> in <module>()
12 get_ipython().system(u'../home2/challege98/caffe/data/ilsvrc12/get_ilsvrc_aux.sh')
13 print 'finish'
---> 14 labels = np.loadtxt(labels_file, str, delimiter='\t')
15
16
/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin)
856 fh = iter(bz2.BZ2File(fname))
857 elif sys.version_info[0] == 2:
--> 858 fh = iter(open(fname, 'U'))
859 else:
860 fh = iter(open(fname))
IOError: [Errno 2] No such file or directory: 'synset_words.txt'
The exclamation point tells IPython to run the rest of the line as a shell command.
The error you are seeing occurs because the file synset_words.txt does not exist and is not being created, since the script that would create it cannot be found. Check that this path is correct: ../home2/challege98/caffe/data/ilsvrc12/get_ilsvrc_aux.sh
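For context, the "!command" syntax is IPython-only, not valid Python; in a plain script the closest equivalent is the subprocess module. A small sketch, using echo as a stand-in for the download script:

```python
import subprocess

# Run a shell command much as IPython's "!" does under the hood;
# "echo" stands in here for ../scripts/download_model_binary.py.
completed = subprocess.run(['echo', 'CaffeNet found.'],
                           capture_output=True, text=True)
print(completed.stdout, end='')
```

This also makes the failure mode clearer: if the path after "!" does not exist, the shell (not Python) reports "not found", which is exactly what /bin/sh printed in the output above.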

Issue starting out with xlwings - AttributeError: Excel.Application.Workbooks

I was trying to use the xlwings package and ran into a simple error right from the start. I was able to run the example files they provide without any major issues (except for multiple Excel workbooks opening up when running the code), but as soon as I tried to execute code via IPython, I got the error AttributeError: Excel.Application.Workbooks. Specifically, I ran:
from xlwings import Workbook, Sheet, Range, Chart
wb = Workbook()
Range('A1').value = 'Foo 1'
and got
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-7-7436ba97d05d> in <module>()
1 from xlwings import Workbook, Sheet, Range, Chart
----> 2 wb = Workbook()
3 Range('A1').value = 'Foo 1'
PATH\xlwings\main.pyc in __init__(self, fullname, xl_workbook, app_visible)
139 else:
140 # Open Excel if necessary and create a new workbook
--> 141 self.xl_app, self.xl_workbook = xlplatform.new_workbook()
142
143 self.name = xlplatform.get_workbook_name(self.xl_workbook)
PATH\xlwings\_xlwindows.pyc in new_workbook()
103 def new_workbook():
104 xl_app = _get_latest_app()
--> 105 xl_workbook = xl_app.Workbooks.Add()
106 return xl_app, xl_workbook
107
PATH\win32com\client\dynamic.pyc in __getattr__(self, attr)
520
521 # no where else to look.
--> 522 raise AttributeError("%s.%s" % (self._username_, attr))
523
524 def __setattr__(self, attr, value):
AttributeError: Excel.Application.Workbooks
I noticed the examples have a .xlsm file already present in the folder with the Python code. Does the Python code only work if it's in the same location as an existing Excel file? Does this mean it can't create Excel files automatically? Apologies if this is basic.