Here is my issue:
I download some netCDF4 files from an FTP server in two different ways: with FileZilla, and with a Python 2.7 script that uses ftplib.
Here is the Python script (running on Windows):
# download the file
try:
    ftp = FTP(server_address)
    ftp.login(server_login, server_pass)
    filepath = 'the_remote_rep/myNetCDF4File.nc'
    filename = 'myNetCDF4File.nc'
    local_dir = 'toto'
    new_file = open('%s/%s' % (local_dir, filename), "w")
    ftp.retrbinary('RETR %s' % filepath, new_file.write)
    ftp.close()
    new_file.close()
except Exception as e:
    print("Error FTP : '" + str(e) + "'")
# update title into the file
try:
    fname = 'toto/myNetCDF4File.nc'
    dataset = netCDF4.Dataset(fname, mode='a')
    setattr(dataset, 'title', 'In Situ Observation Re-Analysis')
    dataset.close()
except Exception as e:
    print("Error netCDF4 : '" + str(e) + "'")
Then I get this message:
Error netCDF4 : '[Errno 22] Invalid argument: 'toto/myNetCDF4File.nc''
When I run the second block of code on a netCDF4 file downloaded via FileZilla (the same file, for example), there is no error.
Also, when I try to get the netCDF version of the file with "ncdump -k", I get the following response (the command works on the FileZilla copy):
ncdump: myNetCDF4File.nc: Invalid argument
In addition, the files do not have the same size depending on the download method:
FileZilla: 22,972 KB
Python ftplib: 23,005 KB
Is this a problem with how ftplib writes the retrieved file? Or did I miss a parameter needed to write the file correctly?
Thanks in advance.
EDIT: verbose messages from FileZilla:
...
Response: 230 Login successful.
Trace: CFtpLogonOpData::ParseResponse() in state 5
Trace: CControlSocket::SendNextCommand()
Trace: CFtpLogonOpData::Send() in state 9
Command: OPTS UTF8 ON
Trace: CFtpControlSocket::OnReceive()
Response: 200 Always in UTF8 mode.
Trace: CFtpLogonOpData::ParseResponse() in state 9
Status: Logged in
Trace: Measured latency of 114 ms
Trace: CFtpControlSocket::ResetOperation(0)
Trace: CControlSocket::ResetOperation(0)
Trace: CFtpLogonOpData::Reset(0) in state 14
Trace: CFtpControlSocket::FileTransfer()
Trace: CControlSocket::SendNextCommand()
Trace: CFtpFileTransferOpData::Send() in state 0
Status: Starting download of /INSITU_OBSERVATIONS/myNetCDF4File.nc
Trace: CFtpChangeDirOpData::Send() in state 0
Trace: CFtpControlSocket::ResetOperation(0)
Trace: CControlSocket::ResetOperation(0)
Trace: CFtpChangeDirOpData::Reset(0) in state 0
Trace: CFtpFileTransferOpData::SubcommandResult(0) in state 1
Trace: CControlSocket::SendNextCommand()
Trace: CFtpFileTransferOpData::Send() in state 5
Trace: CFtpRawTransferOpData::Send() in state 2
Command: PASV
Trace: CFtpControlSocket::OnReceive()
Response: 227 Entering Passive Mode (193,68,190,45,179,16).
Trace: CFtpRawTransferOpData::ParseResponse() in state 2
Trace: CControlSocket::SendNextCommand()
Trace: CFtpRawTransferOpData::Send() in state 4
Trace: Binding data connection source IP to control connection source IP 134.xx.xx.xx
Command: RETR myNetCDF4File.nc
Trace: CTransferSocket::OnConnect
Trace: CFtpControlSocket::OnReceive()
Response: 150 Opening BINARY mode data connection for myNetCDF4File.nc (9411620 bytes).
Trace: CFtpRawTransferOpData::ParseResponse() in state 4
Trace: CControlSocket::SendNextCommand()
Trace: CFtpRawTransferOpData::Send() in state 5
Trace: CTransferSocket::TransferEnd(1)
Trace: CFtpControlSocket::TransferEnd()
Trace: CFtpControlSocket::OnReceive()
Response: 226 Transfer complete.
Trace: CFtpRawTransferOpData::ParseResponse() in state 7
Trace: CFtpControlSocket::ResetOperation(0)
Trace: CControlSocket::ResetOperation(0)
Trace: CFtpRawTransferOpData::Reset(0) in state 7
Trace: CFtpFileTransferOpData::SubcommandResult(0) in state 7
Trace: CFtpControlSocket::ResetOperation(0)
Trace: CControlSocket::ResetOperation(0)
Trace: CFtpFileTransferOpData::Reset(0) in state 7
Status: File transfer successful, transferred 9 411 620 bytes in 89 seconds
Status: Disconnected from server
Trace: CFtpControlSocket::ResetOperation(66)
Trace: CControlSocket::ResetOperation(66)
In fact, this was a problem with binary mode (thanks for your questions).
I added ftp.voidcmd('TYPE I') before retrieving the file with ftplib, then changed the mode used to open the local file to new_file = open('%s/%s' % (local_ftp_path, filename), "wb") so that it is written as a binary file.
Now the file is readable after downloading it via ftplib, and it has the same size as the one downloaded with FileZilla.
Thanks for your contributions.
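For reference, here is a minimal sketch of the corrected download step. The helper name download_binary is hypothetical (it is not part of ftplib), and it only assumes an object with ftplib's retrbinary() method. Note that retrbinary() already sends TYPE I itself, so the explicit voidcmd call should be redundant; opening the local file with "wb" is most likely the change that matters on Windows.

```python
import os


def download_binary(ftp, remote_path, local_path):
    """Download remote_path through an ftplib.FTP-like object into local_path.

    `download_binary` is a hypothetical helper, not part of ftplib.
    """
    # "wb" is the important part: in text mode ("w") Python on Windows writes
    # every "\n" as "\r\n", which grows the file and corrupts binary formats.
    with open(local_path, "wb") as local_file:
        # retrbinary() switches the transfer to binary itself (it sends TYPE I).
        ftp.retrbinary("RETR %s" % remote_path, local_file.write)
    # Return the size on disk so it can be compared with the server's size.
    return os.path.getsize(local_path)
```

With a real connection this would be called as download_binary(ftp, 'the_remote_rep/myNetCDF4File.nc', 'toto/myNetCDF4File.nc') after ftp.login(...). The with block also closes the file if the transfer raises.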
I tried the Camunda community Python client from the repo (https://github.com/camunda-community-hub/camunda-8-code-studio/tree/main/src/PythonCloudWorker). I have set up a Camunda 8 SaaS account to run my tasks from the repo.
I'm getting an error when I try to run the Python file; the error is posted below. Any suggestions are appreciated.
communda_connect.py:59: DeprecationWarning: There is no current event loop
loop = asyncio.get_event_loop()
E0118 00:29:19.302897000 6259650560 hpack_parser.cc:1218] Error parsing metadata: error=invalid value key=content-type value=text/plain; charset=utf-8
E0118 00:29:19.307140000 6259650560 hpack_parser.cc:1218] Error parsing metadata: error=invalid value key=content-type value=text/plain; charset=utf-8
E0118 00:29:19.310754000 6259650560 hpack_parser.cc:1218] Error parsing metadata: error=invalid value key=content-type value=text/plain; charset=utf-8
Traceback (most recent call last):
env/lib/python3.10/site-packages/grpc/aio/_call.py", line 236, in _raise_for_status
raise _create_rpc_error(await self.initial_metadata(), await
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNIMPLEMENTED
details = "Received http2 header with status: 404"
debug_error_string = "UNKNOWN:Error received from peer ipv4:32.12.17.224:443 {created_time:"2023-01-18T00:29:19.304994+05:30", grpc_status:12, grpc_message:"Received http2 header with status: 404"}"
>
During handling of the above exception, another exception occurred:
env/lib/python3.10/site-packages/pyzeebe/grpc_internals/zeebe_adapter_base.py", line 33, in _handle_grpc_error
raise pyzeebe_error
pyzeebe.errors.zeebe_errors.UnkownGrpcStatusCodeError
The problem was that I had not passed the region parameter, so it was defaulting to bru-2.
camunda_region = os.environ.get('CAMUNDA_CLUSTER_REGION')
channel = create_camunda_cloud_channel(client_id=zeebe_client_id, client_secret=zeebe_client_secret, cluster_id=camundacloud_cluster_id, region=camunda_region)
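One caveat with the snippet above: os.environ.get returns None when the variable is not set, so a missing CAMUNDA_CLUSTER_REGION would not pass a usable region either. A minimal sketch of a fail-fast lookup follows; require_env is a hypothetical helper name, not part of pyzeebe.

```python
import os


def require_env(name, environ=None):
    """Return a non-empty environment variable or raise a descriptive error.

    `require_env` is a hypothetical helper; `environ` exists only so the
    lookup can be exercised with a plain dict instead of os.environ.
    """
    env = os.environ if environ is None else environ
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError("environment variable %s is not set" % name)
    return value
```

It would be used as camunda_region = require_env('CAMUNDA_CLUSTER_REGION') before creating the channel, so a missing setting fails immediately with a readable message rather than as a gRPC error later.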
I'm using RabbitMQ 3.8.2 with Erlang 22.2.7 and I'm having a problem while consuming tasks. My setup is Django + Celery + RabbitMQ. Publishing messages to a queue works fine until the queue length reaches 1200 messages. After this point RabbitMQ starts to close AMQP connections with the following errors:
...
2022-11-01 09:35:25.327 [info] <0.20608.9> accepting AMQP connection <0.20608.9> (185.121.83.107:60447 -> 185.121.83.116:5672)
2022-11-01 09:35:25.483 [info] <0.20608.9> connection <0.20608.9> (185.121.83.107:60447 -> 185.121.83.116:5672): user 'rabbit_admin' authenticated and granted access to vhost '/'
...
2022-11-01 09:36:59.129 [warning] <0.19994.9> closing AMQP connection <0.19994.9> (185.121.83.108:36149 -> 185.121.83.116:5672, vhost: '/', user: 'rabbit_admin'):
client unexpectedly closed TCP connection
...
[error] <0.11162.9> closing AMQP connection <0.11162.9> (185.121.83.108:57631 -> 185.121.83.116:5672):
{writer,send_failed,{error,enotconn}}
...
2022-11-01 09:35:48.256 [error] <0.20201.9> closing AMQP connection <0.20201.9> (185.121.83.108:50058 -> 185.121.83.116:5672):
{inet_error,enotconn}
...
Then the django-celery consumer disappears from the queue list, the messages become "ready" again, and the Celery pods are unable to ack a message after its job has finished, failing with the following error:
ERROR: [2022-11-01 09:20:23] /usr/src/app/project/celery.py:114 handle_message Error while handling Rabbit task: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/amqp/connection.py", line 514, in channel
return self.channels[channel_id]
KeyError: None
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/app/project/celery.py", line 76, in handle_message
message.ack()
File "/usr/local/lib/python3.10/site-packages/kombu/message.py", line 125, in ack
self.channel.basic_ack(self.delivery_tag, multiple=multiple)
File "/usr/local/lib/python3.10/site-packages/amqp/channel.py", line 1407, in basic_ack
return self.send_method(
File "/usr/local/lib/python3.10/site-packages/amqp/abstract_channel.py", line 70, in send_method
conn.frame_writer(1, self.channel_id, sig, args, content)
File "/usr/local/lib/python3.10/site-packages/amqp/method_framing.py", line 186, in write_frame
write(buffer_store.view[:offset])
File "/usr/local/lib/python3.10/site-packages/amqp/transport.py", line 347, in write
self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
I have noticed that the message size also affects this behavior. In the above case each message is about 1000-1500 characters long. If I decrease that to 50 characters, the threshold at which RabbitMQ starts to close AMQP connections shifts to 4000-5000 messages.
I suspect that RabbitMQ is running short of some resource, but I don't know how to find out what exactly is going wrong. If I run htop on the server, I see that the 2 available CPUs are never under high load (less than 20% each) and that RAM usage is 400 MB out of 3840 MB. So nothing seems to be wrong. Is there any resource checking command for RabbitMQ? The tasks do not take long to complete, about 10 seconds each.
Maybe there are also some missed heartbeats from the client (I had that problem earlier, but not now; there are currently no error messages about it).
Also, if I run sudo journalctl --system | grep rabbitmq, I get the following output:
......
May 24 05:15:49 oms-git.omsystem sshd[809111]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=43.154.63.169 user=rabbitmq
May 24 05:15:51 oms-git.omsystem sshd[809111]: Failed password for rabbitmq from 43.154.63.169 port 37010 ssh2
May 24 05:15:51 oms-git.omsystem sshd[809111]: Disconnected from authenticating user rabbitmq 43.154.63.169 port 37010 [preauth]
May 24 16:12:32 oms-git.omsystem sudo[842182]: ad : TTY=pts/3 ; PWD=/var/log/rabbitmq ; USER=root ; COMMAND=/usr/bin/tail -f -n 1000 rabbit@XXX-git.log
......
Maybe there is another issue with the firewall, but I don't see any error messages about that in /var/log/rabbitmq/rabbit@XXX.log.
My Celery configuration on the client looks like this:
CELERY_TASK_IGNORE_RESULT = True
CELERY_RESULT_BACKEND = 'django-db'
CELERY_CACHE_BACKEND = 'django-cache'
CELERY_SEND_EVENTS = False
CELERY_BROKER_POOL_LIMIT = 30
CELERY_BROKER_HEARTBEAT = 30
CELERY_BROKER_CONNECTION_TIMEOUT = 600
CELERY_PREFETCH_MULTIPLIER = 1
CELERY_SEND_EVENTS = False
CELERY_WORKER_CONCURRENCY = 1
CELERY_TASK_ACKS_LATE = True
Currently I'm running the pod with the following command:
celery -A project.celery worker -l info -f /var/log/celery/celery.log -Ofair
I have also tried various arguments to limit prefetch or turn off heartbeats, but that didn't work:
celery -A project.celery worker -l info -f /var/log/celery/celery.log --without-heartbeat --without-gossip --without-mingle
celery -A project.celery worker -l info -f /var/log/celery/celery.log --prefetch-multiplier=1 --pool=solo --
I expect there to be no limit on queue length, and every Celery pod in my Kubernetes cluster to consume and ack messages without errors.
I'm trying to use Ray from a Flask web application.
The whole thing runs in a Docker container.
The Ray version is 0.8.6, Flask is 1.1.2.
When I start the web application, Ray seems to try to init twice, and then the process crashes. I added the memory limits later on because there were some warnings about insufficient shared memory (the docker compose setting is "shm_size: '4gb'").
If I start Ray in the same container without Flask, it runs well.
import os
import flask
import ray
from flask import Flask
def create_app(test_config=None):
    app = Flask(__name__, instance_relative_config=True)
    app.config.from_mapping(
        SECRET_KEY='dev',
        DEBUG = True
    )
    # ensure the instance folder exists
    try:
        os.makedirs(app.instance_path)
    except OSError:
        pass
    if ray.is_initialized() == False:
        ray.init(ignore_reinit_error=True,
                 include_webui=False,
                 object_store_memory=1*1024*1014*1024,
                 redis_max_memory=2*1024*1014*1024)
        ray.worker.global_worker.run_function_on_all_workers(setup_ray_logger)
    @app.route('/api/GetAccountRatings', methods=['GET'])
    def GetAccountRatings():
        return ...
    return app
When I start the flask web app with:
export FLASK_APP="mifad.api:create_app()"
export FLASK_ENV=development
flask run --host=0.0.0.0 --port=8084
I get the following error messages:
* Serving Flask app "mifad.api:create_app()" (lazy loading)
* Environment: development
* Debug mode: on
* Running on http://0.0.0.0:8084/ (Press CTRL+C to quit)
* Restarting with stat
Failed to set SIGTERM handler, processes mightnot be cleaned up properly on exit.
* Debugger is active!
* Debugger PIN: 331-620-174
Failed to set SIGTERM handler, processes mightnot be cleaned up properly on exit.
2020-07-06 07:38:10,382 INFO resource_spec.py:212 -- Starting Ray with 59.18 GiB memory available for workers and up to 0.99 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-06 07:38:10,610 WARNING services.py:923 -- Redis failed to start, retrying now.
2020-07-06 07:38:10,675 INFO resource_spec.py:212 -- Starting Ray with 59.13 GiB memory available for workers and up to 0.99 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-07-06 07:38:10,781 WARNING services.py:923 -- Redis failed to start, retrying now.
2020-07-06 07:38:11,043 WARNING services.py:923 -- Redis failed to start, retrying now.
2020-07-06 07:38:11,479 ERROR import_thread.py:93 -- ImportThread: Error 111 connecting to 172.29.0.2:44946. Connection refused.
2020-07-06 07:38:11,481 ERROR worker.py:949 -- print_logs: Connection closed by server.
2020-07-06 07:38:11,488 ERROR worker.py:1049 -- listen_error_messages_raylet: Connection closed by server.
2020-07-06 07:38:11,899 ERROR import_thread.py:93 -- ImportThread: Error while reading from socket: (104, 'Connection reset by peer')
2020-07-06 07:38:11,901 ERROR worker.py:1049 -- listen_error_messages_raylet: Connection closed by server.
2020-07-06 07:38:11,908 ERROR worker.py:949 -- print_logs: Connection closed by server.
F0706 07:38:17.390182 4555 4659 service_based_gcs_client.cc:104] Check failed: num_attempts < RayConfig::instance().gcs_service_connect_retries() No entry found for GcsServerAddress
*** Check failure stack trace: ***
@ 0x7ff84ae8061d google::LogMessage::Fail()
@ 0x7ff84ae81a8c google::LogMessage::SendToLog()
@ 0x7ff84ae802f9 google::LogMessage::Flush()
@ 0x7ff84ae80511 google::LogMessage::~LogMessage()
@ 0x7ff84ae5dde9 ray::RayLog::~RayLog()
@ 0x7ff84ac39cea ray::gcs::ServiceBasedGcsClient::GetGcsServerAddressFromRedis()
@ 0x7ff84ac39f37 _ZNSt17_Function_handlerIFSt4pairISsiEvEZN3ray3gcs21ServiceBasedGcsClient7ConnectERN5boost4asio10io_contextEEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7ff84ac6ffb7 ray::rpc::GcsRpcClient::Reconnect()
@ 0x7ff84ac71da8 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc19AddProfileDataReplyEEZNS4_12GcsRpcClient14AddProfileDataERKNS4_21AddProfileDataRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
@ 0x7ff84ac4251d ray::rpc::ClientCallImpl<>::OnReplyReceived()
@ 0x7ff84ab96870 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
@ 0x7ff84b0b80df boost::asio::detail::scheduler::do_run_one()
@ 0x7ff84b0b8cf1 boost::asio::detail::scheduler::run()
@ 0x7ff84b0b9c42 boost::asio::io_context::run()
@ 0x7ff84ab7db10 ray::CoreWorker::RunIOService()
@ 0x7ff84a7763e7 execute_native_thread_routine_compat
@ 0x7ff84deed6db start_thread
@ 0x7ff84dc1688f clone
F0706 07:38:17.804720 4553 4703 service_based_gcs_client.cc:104] Check failed: num_attempts < RayConfig::instance().gcs_service_connect_retries() No entry found for GcsServerAddress
*** Check failure stack trace: ***
@ 0x7fedd65e261d google::LogMessage::Fail()
@ 0x7fedd65e3a8c google::LogMessage::SendToLog()
@ 0x7fedd65e22f9 google::LogMessage::Flush()
@ 0x7fedd65e2511 google::LogMessage::~LogMessage()
@ 0x7fedd65bfde9 ray::RayLog::~RayLog()
@ 0x7fedd639bcea ray::gcs::ServiceBasedGcsClient::GetGcsServerAddressFromRedis()
@ 0x7fedd639bf37 _ZNSt17_Function_handlerIFSt4pairISsiEvEZN3ray3gcs21ServiceBasedGcsClient7ConnectERN5boost4asio10io_contextEEUlvE_E9_M_invokeERKSt9_Any_data
@ 0x7fedd63d1fb7 ray::rpc::GcsRpcClient::Reconnect()
@ 0x7fedd63d3da8 _ZNSt17_Function_handlerIFvRKN3ray6StatusERKNS0_3rpc19AddProfileDataReplyEEZNS4_12GcsRpcClient14AddProfileDataERKNS4_21AddProfileDataRequestERKSt8functionIS8_EEUlS3_S7_E_E9_M_invokeERKSt9_Any_dataS3_S7_
@ 0x7fedd63a451d ray::rpc::ClientCallImpl<>::OnReplyReceived()
@ 0x7fedd62f8870 _ZN5boost4asio6detail18completion_handlerIZN3ray3rpc17ClientCallManager29PollEventsFromCompletionQueueEiEUlvE_E11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm
@ 0x7fedd681a0df boost::asio::detail::scheduler::do_run_one()
@ 0x7fedd681acf1 boost::asio::detail::scheduler::run()
@ 0x7fedd681bc42 boost::asio::io_context::run()
@ 0x7fedd62dfb10 ray::CoreWorker::RunIOService()
@ 0x7fedd5ed83e7 execute_native_thread_routine_compat
@ 0x7fedd968f6db start_thread
@ 0x7fedd93b888f clone
Aborted (core dumped)
What am I doing wrong?
Best regards,
Bernd
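One possible explanation, not confirmed anywhere in this thread: the log shows "Restarting with stat", which is the Flask development reloader. With the reloader active, the application factory can run in two processes (the watcher and the child that serves requests), which would match the two "Starting Ray" lines. The sketch below is a guard under that assumption; is_reloader_watcher is a hypothetical helper, and it relies on Werkzeug setting WERKZEUG_RUN_MAIN=true only in the serving child process.

```python
import os


def is_reloader_watcher(debug, environ=None):
    """Return True in the outer watcher process of the Werkzeug reloader.

    Hypothetical helper. Werkzeug sets WERKZEUG_RUN_MAIN=true only in the
    child process that actually serves requests; the watcher lacks it.
    `environ` exists only so the check can be exercised with a plain dict.
    """
    env = os.environ if environ is None else environ
    return bool(debug) and env.get("WERKZEUG_RUN_MAIN") != "true"
```

Inside create_app the Ray start-up could then be wrapped as: if not is_reloader_watcher(app.debug) and not ray.is_initialized(): ray.init(...). The guard assumes the reloader is on whenever debug is on; when starting with flask run --no-reload there is only one process, WERKZEUG_RUN_MAIN is never set, and the guard would have to be dropped. Running once with --no-reload is also a quick way to check whether the reloader is involved at all.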
I tried to install Spark on my Windows 10 machine. I have Anaconda2 with Python 2.7. I managed to open an IPython notebook instance, and I am able to run the following lines:
airlines=sc.textFile("airlines.csv")
print (airlines)
But I get an error when I run airlines.first().
Here's the error I get:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-6-85a5d6f5110f> in <module>()
----> 1 airlines.first()
C:\spark\python\pyspark\rdd.py in first(self)
1326 ValueError: RDD is empty
1327 """
-> 1328 rs = self.take(1)
1329 if rs:
1330 return rs[0]
C:\spark\python\pyspark\rdd.py in take(self, num)
1308
1309 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1310 res = self.context.runJob(self, takeUpToNumLeft, p)
1311
1312 items += res
C:\spark\python\pyspark\context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
932 mappedRDD = rdd.mapPartitions(partitionFunc)
933 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
--> 934 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
935
936 def show_profiles(self):
C:\spark\python\pyspark\rdd.py in _load_from_socket(port, serializer)
137 break
138 if not sock:
--> 139 raise Exception("could not open socket")
140 try:
141 rf = sock.makefile("rb", 65536)
Exception: could not open socket
I get a different error when I execute: airlines.collect()
Here's the error:
---------------------------------------------------------------------------
error Traceback (most recent call last)
<ipython-input-5-3745b2fa985a> in <module>()
1 # Using the collect operation, you can view the full dataset
----> 2 airlines.collect()
C:\spark\python\pyspark\rdd.py in collect(self)
775 with SCCallSiteSync(self.context) as css:
776 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
--> 777 return list(_load_from_socket(port, self._jrdd_deserializer))
778
779 def reduce(self, f):
C:\spark\python\pyspark\rdd.py in _load_from_socket(port, serializer)
140 try:
141 rf = sock.makefile("rb", 65536)
--> 142 for item in serializer.load_stream(rf):
143 yield item
144 finally:
C:\spark\python\pyspark\serializers.py in load_stream(self, stream)
515 try:
516 while True:
--> 517 yield self.loads(stream)
518 except struct.error:
519 return
C:\spark\python\pyspark\serializers.py in loads(self, stream)
504
505 def loads(self, stream):
--> 506 length = read_int(stream)
507 if length == SpecialLengths.END_OF_DATA_SECTION:
508 raise EOFError
C:\spark\python\pyspark\serializers.py in read_int(stream)
541
542 def read_int(stream):
--> 543 length = stream.read(4)
544 if not length:
545 raise EOFError
C:\Users\AS\Anaconda2\lib\socket.pyc in read(self, size)
382 # fragmentation issues on many platforms.
383 try:
--> 384 data = self._sock.recv(left)
385 except error, e:
386 if e.args[0] == EINTR:
error: [Errno 10054] An existing connection was forcibly closed by the remote host
Please help.
INSTALL PYSPARK ON WINDOWS 10
JUPYTER NOTEBOOK WITH ANACONDA NAVIGATOR
STEP 1
Download packages
1) spark-2.2.0-bin-hadoop2.7.tgz
2) Java JDK 8
3) Anaconda v5.2
4) scala-2.12.6.msi
5) Hadoop v2.7.1
STEP 2
Make a spark folder on the C:/ drive and put everything inside it.
NOTE: During the installation of Scala, set the Scala install path inside the spark folder.
STEP 3
Now set new Windows environment variables:
HADOOP_HOME=C:\spark\hadoop
JAVA_HOME=C:\Program Files\Java\jdk1.8.0_151
SCALA_HOME=C:\spark\scala\bin
SPARK_HOME=C:\spark\spark\bin
PYSPARK_PYTHON=C:\Users\user\Anaconda3\python.exe
PYSPARK_DRIVER_PYTHON=C:\Users\user\Anaconda3\Scripts\jupyter.exe
PYSPARK_DRIVER_PYTHON_OPTS=notebook
Now select the Path variable, choose Edit, and add a new entry:
Add "C:\spark\spark\bin" to the Windows "Path" variable
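Before moving on, it can help to confirm that the variables from STEP 3 are actually visible to Python. The sketch below only checks that each name is set; it does not validate the paths. missing_vars and EXPECTED_VARS are hypothetical names introduced for this illustration.

```python
import os

# Variable names copied from STEP 3; the values are not checked here.
EXPECTED_VARS = [
    "HADOOP_HOME",
    "JAVA_HOME",
    "SCALA_HOME",
    "SPARK_HOME",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
    "PYSPARK_DRIVER_PYTHON_OPTS",
]


def missing_vars(environ=None, names=EXPECTED_VARS):
    """Return the names from `names` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for name in missing_vars():
        print("not set:", name)
```

Run it from a freshly opened Anaconda command prompt, since environment variables changed through the Windows dialog are only picked up by newly started processes.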
STEP 4
Make a folder where you want to store your Jupyter notebook outputs and files.
After that, open the Anaconda command prompt and cd into that folder,
then enter pyspark.
That's it: your browser will pop up with Jupyter on localhost.
STEP 5
Check whether pyspark is working.
Type some simple code and run it:
from pyspark.sql import Row
a = Row(name = 'Vinay' , age=22 , height=165)
print("a: ",a)
Let's assume that my file is named 'data' and looks like this:
2343234 {23.8375,-2.339921102} {(343.34333,-2.0000022)} 5-23-2013-11-am
I need to convert the 2nd field to a pair of coordinate numbers. So I wrote the following code and called it basic.pig:
A = LOAD 'data' AS (f1:int, f2:chararray, f3:chararray. f4:chararray);
B = foreach A generate STRSPLIT(f2,',').$0 as f5, STRSPLIT(f2,',').$1 as f6;
C = foreach B generate REPLACE(f5,'{',' ') as f7, REPLACE(f6,'}',' ') as f8;
and then used (float) to convert the string to a float. But the REPLACE command fails to work and I get the following error:
-bash-3.2$ pig -x local basic.pig
2013-06-24 16:38:45,030 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled
Mar 22 2013, 02:13:53 2013-06-24 16:38:45,031 [main] INFO org.apache.pig.Main - Logging error messages to: /home/--/p/--test/pig_1372117125028.log
2013-06-24 16:38:45,321 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/isl/pmahboubi/.pigbootup not found
2013-06-24 16:38:45,425 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2013-06-24 16:38:46,069 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 7, column 0. Encountered: <EOF> after : ""
Details at logfile: /home/--/p/--test/pig_1372117125028.log
And these are the details from pig_137..log:
Pig Stack Trace
---------------
ERROR 1000: Error during parsing. Lexical error at line 7, column 0. Encountered: <EOF> after : ""
org.apache.pig.tools.pigscript.parser.TokenMgrError: Lexical error at line 7, column 0. Encountered: <EOF> after : ""
at org.apache.pig.tools.pigscript.parser.PigScriptParserTokenManager.getNextToken(PigScriptParserTokenManager.java:3266)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.jj_ntk(PigScriptParser.java:1134)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:104)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:604)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
================================================================================
I've got data like this:
2724 1919 2012-11-18T23:57:56.000Z {(33.80981975),(-118.105289)}
2703 6401 2012-11-18T23:57:56.000Z {(55.83525609),(-4.07733138)}
1200 4015 2012-11-18T23:57:56.000Z {(41.49609152),(13.8411998)}
7104 9227 2012-11-18T23:57:56.000Z {(-24.95351118),(-53.46538723)}
and I can do this:
A = LOAD 'my_tsv_data' USING PigStorage('\t') AS (id1:int, id2:int, date:chararray, loc:chararray);
B = FOREACH A GENERATE REPLACE(loc,'\\{|\\}|\\(|\\)','');
C = LIMIT B 10;
DUMP C;
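To make the pattern easier to follow outside Pig: the string '\\{|\\}|\\(|\\)' is the regular expression \{|\}|\(|\), which matches any single brace or parenthesis, so REPLACE removes all four characters and leaves the two numbers separated by a comma. The Python sketch below mirrors that step on one of the sample values and then converts the result to floats; parse_loc is a hypothetical name used only for illustration.

```python
import re

# Same alternation as in the Pig REPLACE call: any of { } ( )
BRACKETS = re.compile(r"\{|\}|\(|\)")


def parse_loc(loc):
    """Turn '{(33.80981975),(-118.105289)}' into a (lat, lon) float pair."""
    cleaned = BRACKETS.sub("", loc)  # -> '33.80981975,-118.105289'
    lat, lon = cleaned.split(",")
    return float(lat), float(lon)


print(parse_loc("{(33.80981975),(-118.105289)}"))
```

In Pig the equivalent follow-up would be to split the cleaned string on the comma and cast both parts with (float), as the question intended.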
This error
ERROR 1000: Error during parsing. Lexical error at line 7, column 0. Encountered: <EOF> after : ""
came up for me because I had used different types of quotation marks: I started with ' and ended with ´ or `, and it took quite a while to find what went wrong. So it had nothing to do with line 7 (my script was not that long, and shortening the data to four lines naturally did not help), nothing to do with column 0, nothing to do with the EOF of the data, and hardly anything to do with " marks, which I didn't use. So it is quite a misleading error message.
I found the cause by using grunt, the Pig command shell.
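Since the parser message does not point at the offending character, a small scan of the script for quote look-alikes can save time. This is only a sketch: find_suspect_quotes and the character table are hypothetical, and backticks are legitimate in some Pig statements, so a hit is a hint to inspect the line, not proof of an error.

```python
# Characters that are easy to type or paste in place of a plain apostrophe.
SUSPECT_QUOTES = {
    "\u00b4": "acute accent",
    "`": "backtick",
    "\u2018": "left single quotation mark",
    "\u2019": "right single quotation mark",
}


def find_suspect_quotes(script_text):
    """Return (line, column, description) for each look-alike quote found.

    Lines and columns are 1-based.
    """
    hits = []
    for line_no, line in enumerate(script_text.splitlines(), start=1):
        for col, char in enumerate(line, start=1):
            if char in SUSPECT_QUOTES:
                hits.append((line_no, col, SUSPECT_QUOTES[char]))
    return hits
```

For a script file it could be used as find_suspect_quotes(open('basic.pig').read()), printing each hit before running Pig.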