pq.ParquetDataset throws errors as FileNotFoundError

When reading data from AWS, spark.read.parquet("path") works without any issue, but for the same path pq.ParquetDataset("path").read_pandas() raises FileNotFoundError. I have already added import pyarrow.parquet as pq so that pq.ParquetDataset is available.
Does anyone know what the issue might be?

Related

Dataflow breaks using TaggedOutputs, "can't pickle WeakDictionary"

We are trying to deploy a streaming pipeline to Dataflow in which we split the data into a few different "routes" that are each processed differently.
We did the whole development with the DirectRunner, and it worked smoothly in our tests, but now that we have deployed it to Dataflow it does not work.
The code fails when yielding in the following DoFn:
import logging

import apache_beam as beam
from apache_beam.pvalue import TaggedOutput


class SplitByRoute(beam.DoFn):
    OUTPUT_TAG_ROUTE_ONE = "route_one"
    OUTPUT_TAG_ROUTE_TWO = "route_two"
    OUTPUT_NOT_SUPPORTED = "not_supported"

    def __init__(self):
        beam.DoFn.__init__(self)

    def process(self, elem):
        try:
            route = self.define_route(elem["param"])  # Just tag it depending on param
        except Exception:
            route = None
        logging.info(f"Routed to {route}")
        if route == self.OUTPUT_TAG_ROUTE_ONE:
            yield TaggedOutput(self.OUTPUT_TAG_ROUTE_ONE, elem)
        elif route == self.OUTPUT_TAG_ROUTE_TWO:
            logging.info(f"Element: {elem}")
            yield TaggedOutput(self.OUTPUT_TAG_ROUTE_TWO, elem)
        else:
            yield TaggedOutput(self.OUTPUT_NOT_SUPPORTED, elem)
It logs the element, yields the output, and then fails with the following error:
AttributeError: Can't pickle local object 'WeakValueDictionary.__init__.<locals>.remove' [while running 'generatedPtransform-3196']
Another consideration: we use TaggedOutputs earlier in the pipeline, before this DoFn, and they work on Dataflow; only this one fails with the error above. Could it be the memory cache, or something related to it, where weakrefs are used?
As far as I know, this error happens when you have a class defined inside another one, but maybe not.
Any suggestions on how we could handle this? It has been a very frustrating error.
Thank you!!! :)

We found the error.
As you might know, apache-beam uses the dill package to serialize data between modules. This lets us pickle an instance of an object and send it through the pipeline.
The problem was that self.define_route(elem["param"]) used that instance of the class and modified one of its attributes. As the answer from Samuel Romero says, you can pickle a class, but I did not know (though probably someone does) that once you modify the class instance it can no longer be pickled. That is strange behaviour, I know, so I opened an issue on BEAM, https://issues.apache.org/jira/browse/BEAM-10384, if you want to check it out.
I will probably dig into it sooner or later to understand the problem better, but if someone hits the same error, the workaround, as I mentioned, is not to modify the instance of a class that is being serialized.
Thanks to everyone who tried to help!
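One way an instance can stop being picklable after it is modified is sketched below. This is only an illustration under assumptions: it uses the standard library pickle rather than dill, Router is a hypothetical stand-in for whatever object define_route touched, and the weakref cache is a guess based on the WeakValueDictionary named in the error message. The helper checks the instance state (its __dict__), since that is what gets serialized for a plain object.

```python
import pickle
import weakref


class Router:
    """Hypothetical stand-in for the object that define_route modified."""

    def __init__(self):
        self.cache = None  # plain attribute: picklable

    def define_route(self, param):
        # Mutating the instance at call time: it now carries a weakref-based cache.
        if self.cache is None:
            self.cache = weakref.WeakValueDictionary()
        return "route_one" if param else "not_supported"


def state_is_picklable(obj):
    """Return True if the instance attributes of obj can be pickled."""
    try:
        pickle.dumps(vars(obj))
        return True
    except Exception:
        return False


router = Router()
before = state_is_picklable(router)  # state is just {"cache": None}
router.define_route("x")             # instance is modified here
after = state_is_picklable(router)   # state now holds a WeakValueDictionary
print(before, after)
```

If something similar is happening in the pipeline, keeping such caches out of the serialized object (for example, building them inside the call instead of storing them on self) would be consistent with the workaround described above.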

As you can read here, Python uses the pickle library for data serialization and is subject to its limitations. Serialization is how processes transfer data between one another, since they do not share memory space.
Here I found a suggestion to use a fork of the multiprocessing module that relies on the dill package instead of pickle. This fork is part of the pathos framework (as is dill) and is now called pathos.multiprocess and not pathos.multiprocessing as seen in the reference I mentioned previously.

TypeError: can only concatenate str (not "DeferredAttribute") to str

I am wondering whether anyone can help me with something I cannot get my head around; it is really bothering me, as I have spent the last two days on it without success.
Basically, I am building a Django (Python) app to store information about all our network devices, e.g. hostname, IP address, S/N, rack info, etc. I also want to provide a few options such as Add, Edit, Delete and Connect next to each device entry. I was able to create all the options except Connect, where I am completely stuck. I am trying to query the database for the IP address and then use Popen to open a PuTTY window with an SSH session to that device. I tried everything I could, but nothing worked, so I am asking for your help. Alternatively, is there another way so that when a user clicks Connect, PuTTY or a similar app opens and they only need to enter the login credentials to get into the device?
I am sharing my code here; let me know if I am doing something wrong.
On the show-all-devices page, show.html, I have this code:
<td>Connect</td>
<!--<td>Connect</td>-->
I tried both ways, with the id and with the IP address entry from the database.
In views.py:
def connect(request, ip_address):
    hostlist_ip = HostList.ip_address
    print(hostlist_ip)
    Popen("putty.exe" + hostlist_ip)
    return redirect('/show')
and in the url.py
path('connect/<str:ip_address>', views.connect),
or
path('connect/<str:ip_address>', views.connect),
Since I am also printing the output to the terminal, I noticed that it does not return the actual IP address but this:
<django.db.models.query_utils.DeferredAttribute object at 0x04B77C50>
and on the web page I receive this error:
TypeError at /connect/10.10.32.10
can only concatenate str (not "DeferredAttribute") to str
Request Method: GET
Request URL: http://localhost:8000/connect/10.10.32.10
Django Version: 2.2.3
Exception Type: TypeError
Exception Value:
can only concatenate str (not "DeferredAttribute") to str
Let me know if you can help.
Just an FYI: I have already tested Popen from plain Python, but since we are not getting the actual IP address from the database I am seeing this error. I am a complete newbie with HTML/CSS and Django, though I have some basic knowledge of Python, so please forgive any stupid comments in the post.
Many thanks

Ahh, I cannot believe I spent two days troubleshooting this; I just changed the name from ip_address to ip_add and it is working now :) As I mentioned in the comment above, I think it was probably getting confused with the built-in module.
Here is the simple solution:
views.py
from subprocess import Popen

from django.shortcuts import redirect


def connect(request, ip_add):
    # ip_add is the value captured from the URL pattern below
    Popen("powershell putty.exe " + ip_add)
    return redirect('/show')
url.py
path('connect/<str:ip_add>', views.connect),
I may have to find a way to replace powershell with something else for users on Mac or Linux, but in any case it is working on Windows.
Thanks, all, for the responses.
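A note on why the rename helped: the original view read HostList.ip_address, which is the field descriptor on the model class rather than a value from a database row, and that is most likely where the DeferredAttribute came from. The working version uses the value from the URL directly. For the open Mac/Linux question, one possible approach is sketched below. It is only a sketch: build_connect_command is a hypothetical helper name, and the macOS and Linux launch commands are assumptions about what is installed on those machines.

```python
import ipaddress
import sys


def build_connect_command(ip_add, platform=sys.platform):
    """Return an argument list that opens an SSH session to ip_add."""
    # Raises ValueError for anything that is not a valid IP address,
    # so text from the URL never reaches the command line unchecked.
    host = str(ipaddress.ip_address(ip_add))

    if platform.startswith("win"):
        # PuTTY's -ssh option selects the SSH protocol.
        return ["putty.exe", "-ssh", host]
    if platform == "darwin":
        # Assumption: ask Terminal.app to run ssh through AppleScript.
        script = f'tell application "Terminal" to do script "ssh {host}"'
        return ["osascript", "-e", script]
    # Assumption: a Debian-style x-terminal-emulator is available.
    return ["x-terminal-emulator", "-e", "ssh", host]


print(build_connect_command("10.10.32.10", "win32"))
```

The returned list could be passed to Popen in place of the concatenated string. Keep in mind that Popen starts the program on the machine running Django, so this only opens a window for the user when the server and the browser are on the same computer.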

Redmine_RE plugin is not saving the data

I am using Redmine 3.1.0 and have installed the redmine_re plugin. When I try to save a requirement using the plugin, I get the following error:
NameError (undefined local variable or method `connection' for #<ReArtifactRelationship:0x800ddb0>):
lib/plugins/acts_as_list/lib/active_record/acts/list.rb:220:in `bottom_item'
lib/plugins/acts_as_list/lib/active_record/acts/list.rb:214:in `bottom_position_in_list'
lib/plugins/acts_as_list/lib/active_record/acts/list.rb:205:in `add_to_list_bottom'
lib/redmine/sudo_mode.rb:63:in `sudo_mode'
Please suggest how to resolve this error.
#ste26054

I did not develop this plugin, but I think support for Redmine 3.1.0 is only partial at the moment (and you may get other errors even after fixing this).
I believe you are getting an error because of this: Deprecate #connection in favour of accessing it via the class
And your error is related to this file:
In this method:
def scope_condition()
  "#{connection.quote_column_name("source_id")} = #{quote_value(self.source_id)}
  AND
  #{connection.quote_column_name("relation_type")} = #{quote_value(self.relation_type)}"
end
Try adding self.class. in front of connection.
You may have to repeat this in other files in the code.
If your changes work, I would suggest submitting a pull request on the plugin's GitHub page :)

Getting an 500 internal error when uploading file using cgi and python

Hi, I have a project that involves uploading documents. I am using HTML for the front end and Python for the back end. I have managed to link my HTML and Python files, but I am having a problem with the server. At first I thought it was a random thing, but I am pretty sure it is because of what I added to the Python code. I have:
import cgi
import sys
import os

htmlform = cgi.FieldStorage()
file_data = htmlform['myfile']
if not fileitem.file:
    return
(name, ext) = os.path.splitext(fileitem.filename)
#if ext == ".jpg" or ext == ".png" or ext == ".gif":
#ioFlag = "wb"
#else:
#ioFlag = "w"
I was able to log in to my page, go to the HTML form, submit it, and reach a basic success page that I had placed below the code above. I am pretty new to Python and did not realise that the body of an if statement must be indented, so I got a 500 internal error when I uncommented the if statement. I did it once and then went through commenting out my code, completely confused about why I was getting the error, but after a while it just started working again. My guess is that the incorrect if statement somehow got the server stuck. I expect it will be working again in about an hour, but ideally I would like to know whether I can stop the process on the server. I was following this guide: http://www.alwaysgetbetter.com/blog/2009/01/02/python-file-upload/

Fixed it! The problem was indeed the indentation. If you are ever unsure about this kind of thing, look at the error logs. I am using an Apache server and do not have direct access to the error logs, so I used
sudo cat /var/log/apache2/error.log
It gave me the answer, and hopefully this helps you even if your question is unrelated.
EDIT: And for completeness' sake, file_data should be fileitem.
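For reference, the commented-out branch from the question, indented the way Python expects, might look like the sketch below. choose_io_flag is a hypothetical helper name introduced only for illustration; the extension list and the "wb"/"w" flags are taken from the question.

```python
import os

# Extensions from the question that should be written in binary mode.
BINARY_EXTENSIONS = {".jpg", ".png", ".gif"}


def choose_io_flag(filename):
    """Pick the file mode for an uploaded file based on its extension."""
    _name, ext = os.path.splitext(filename)
    if ext.lower() in BINARY_EXTENSIONS:
        # The body of each branch is indented one level under its if/else.
        io_flag = "wb"
    else:
        io_flag = "w"
    return io_flag


print(choose_io_flag("photo.jpg"), choose_io_flag("notes.txt"))
```

Running a script like this locally before uploading it surfaces an IndentationError immediately, which avoids having to dig through the server logs for a syntax problem.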

Django 404 dilemma

I have a bug in my 404 setup. I know that because when I try to reach a page that doesn't exist, I get my server error template. But that template is useless because it doesn't give me any debug info. To get Django's debug page, I need to set DEBUG=True in the settings file. But if I do that, the bug doesn't appear, because Django doesn't try to use my buggy 404 setup. So what do you guys think? This is in my root urls file:
handler404 = 'portal.blog.views.handlenotfound'
And this is portal.blog.views.handlenotfound:
def handlenotfound(request):
    global common_data
    datas = {
        'tags': Tag.objects.all(),
        'date_list': Post.objects.filter(yayinlandi=True).dates("pub_date", "year")
    }
    data.update(common_data)
    return render_to_response("404.html", datas)
Edit:
I guess I also need to return an HttpResponseNotFound, right?

If I had to debug this kind of error, I would either
temporarily turn the handler into a simple view served by a custom URL, so that Django's internal mechanisms don't get in the way, or
(temporarily) wrap the handler code in a try..except block to log any error you may have missed.
Anyway, are you sure your handler doesn't get called if DEBUG=True?
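The second suggestion could be sketched as a small decorator. This is an illustration in plain Python rather than Django-specific code: log_exceptions, the logger name, and broken_handler are hypothetical names, and the NameError is a deliberate stand-in for whatever the real handler is hiding.

```python
import functools
import logging

logger = logging.getLogger("handler404_debug")  # hypothetical logger name


def log_exceptions(view):
    """Log any exception raised by the wrapped view, then re-raise it."""
    @functools.wraps(view)
    def wrapper(*args, **kwargs):
        try:
            return view(*args, **kwargs)
        except Exception:
            # logger.exception records the message together with the traceback.
            logger.exception("Error inside %s", view.__name__)
            raise
    return wrapper


@log_exceptions
def broken_handler(request):
    context = {"tags": []}
    undefined_name.update(context)  # deliberate NameError for the demo
    return context


try:
    broken_handler(None)
except NameError as exc:
    print("caught:", exc)
```

Re-raising keeps the handler's behaviour unchanged while the traceback lands in the log, so the decorator can be removed again once the bug is found.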

data.update(common_data) should be datas.update(common_data).
(Incidentally, data is already plural: the singular is datum.)
