How to import a .py file to Qubole?

I'm connecting to Azure Data Lake, and I have the file there, but it's at a different path and I don't know how to import it.
Thank you in advance for your help!

You can try the steps below:
-> sc.addPyFile("cloudstoragepath")
-> Then run import filename_without_py_extension
I hope this is what you need, as far as I understand the requirement; a rough sketch follows.
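A minimal sketch of that flow, assuming an existing SparkContext named sc (as in a Qubole notebook); the Data Lake path and the helpers.py file name are placeholders for your own storage path and module:

# Minimal sketch, assuming an existing SparkContext `sc`; the Data Lake path
# and helpers.py name below are placeholders.
sc.addPyFile("adl://yourdatalake.azuredatalakestore.net/scripts/helpers.py")

# The file is shipped to the driver and executors; import it by the file name
# without the .py extension.
import helpers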

If you are looking to run a Python script by importing it from Azure Blob or Data Lake, the file path can be passed in the 'query path' field in Qubole Analyze.
If this is not what you are looking for, do let us know the exact requirement and we would be happy to help!

Related

Test data to requests for Postman monitor

I run my collection using test data from a CSV file; however, there is no option to upload the test data file when adding a monitor for the collection. Searching the internet, I could see that the test data file has to be provided via a URL (saved in the cloud, e.g. Google Drive), but I couldn't find a source for how to provide this URL to the collection. Can anyone please help?
https://www.postman.com/praveendvd-public/workspace/postman-tricks-and-tips/request/8296678-d06b3fc0-6b8b-4370-9847-aee0f526e7db
You cannot use a CSV file in a monitor, but you can store the content of the CSV as a variable and use that to drive the monitor. An example can be seen in the public workspace linked above.

Unable to Upload Huge Files/Datasets on Google Colab

I am uploading a TSV file for processing on Google Colab. The file is 4 GB and the upload has not completed for a very long time (hours). Any pointers here would be a great help.
It could be your internet connection. Uploading directly to Google Colab works best for small files such as .py scripts. For huge files, I'd suggest you upload the file to Google Drive in your account and then simply move or copy it into your Google Colab instance:
1. Copy the file you want to use:
%cp "path/to/the file/file_name.extension" "path/to/your/google-colab-instance"
The Google Colab instance path is usually /content/.
2. Similarly, move the file you want to use:
%mv "path/to/the file/file_name.extension" "path/to/your/google-colab-instance"
The first "" would be the path to where you uploaded the .csv file in your drive.
Hope this helps. Let me know in the comments.
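One detail the steps above assume is that Google Drive is already mounted in the Colab runtime; a minimal sketch of the full flow (mount, then copy), with placeholder paths:

# Minimal sketch: mount Google Drive, then copy the uploaded file onto the
# Colab VM's local disk. "My Drive/data/dataset.tsv" is a placeholder path.
from google.colab import drive
import shutil

drive.mount('/content/drive')   # prompts for authorization on first run

shutil.copy('/content/drive/My Drive/data/dataset.tsv',  # path in your Drive
            '/content/dataset.tsv')                      # local path on the Colab VM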

openshift django opening and writing to a text file

I have created a questionnaire with Django, and in my views.py I have the following code as part of a function:
if text is not None:
    for answer in datas:
        f = open('/Users/arsenios/Desktop/data.txt', 'a')
        f.write(answer + ",")
        f.write("\n")
        f.close()
This works fine locally. It creates a text file on the desktop and fills it in with the data of each person who completes the questionnaire. When I run the code on OpenShift I get:
"[Errno 2] No such file or directory: '/Users/arsenios/Desktop/data.txt'".
I have seen some people asking about and mentioning "OPENSHIFT_DATA_DIR", but I feel like there are steps they haven't included. I don't know what changes I should make to settings.py and views.py.
Any help would be appreciated.
The OPENSHIFT_DATA_DIR is from OpenShift 2 and is not set in OpenShift 3.
The bigger question is whether that is a temporary file or whether it needs to be persistent across restarts of the application container. If it is a temporary file, use a name under the /tmp directory. If it needs to be persistent, then you need to look at mounting a persistent volume to save the data in, or at using a separate database with its own persistent storage. A rough sketch of the directory approach is below.
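A minimal sketch of that approach, assuming a hypothetical DATA_DIR environment variable that you would point at a mounted persistent volume (it falls back to /tmp otherwise); text and datas are the variables from the original view code:

# Minimal sketch: write under a configurable directory instead of a hard-coded
# desktop path. DATA_DIR is a hypothetical variable you would set to the mount
# point of a persistent volume; without it the data goes to /tmp and is lost
# when the container restarts.
import os

data_dir = os.environ.get('DATA_DIR', '/tmp')
data_path = os.path.join(data_dir, 'data.txt')

if text is not None:
    with open(data_path, 'a') as f:
        for answer in datas:
            f.write(answer + ",")
            f.write("\n")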
For an explanation of some of the fundamentals of using OpenShift 3, I suggest you look at the free eBook at:
https://www.openshift.com/deploying-to-openshift/
I managed to solve it. It turns out the data was getting saved in data.txt on OpenShift, and the command I had to use was oc rsync pod:/opt/app-root/src/data.txt /path/to/directory. This command downloads the data.txt file from OpenShift to the directory I want, so in my case I had to use oc rsync save-4-tb2dm:/opt/app-root/src/data.txt /Users/arsenios/Desktop.

Connecting to Google Drive using Python(PyDrive)

I have this code:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
import time
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)
# Auto-iterate through all files that matches this query
file_list = drive.ListFile({'q': "'root' in parents"}).GetList()
for file1 in file_list:
    print('title: %s, id: %s' % (file1['title'], file1['id']))
    time.sleep(1)
However, each time I want to run it, it opens my browser and asks for permission to access Google Drive. How do I bypass that? I mean, at least ask only once, then save the "permission" and don't ask again, or (which would be best) silently accept the permission in the background without my decision. I have downloaded client_secrets.json, which is used for passing the authorization details.
What if I wanted to release my application? I mean, I had to generate and download client_secrets.json in order to make it work, and I guess my users wouldn't want to do so. Is there a better, more convenient way?
I would also appreciate a tutorial-for-dummies about using Google Drive API because it's hard for me to understand it reading the documentation alone.
PyDrive2 has done a fabulous job of automating the authentication flow via a settings.yaml file. Use the package below if you are running into the problem above; a credential-caching sketch follows after the link.
https://docs.iterative.ai/PyDrive2/
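For reference, the credential-caching pattern looks roughly like this with the classic PyDrive API (PyDrive2 keeps the same interface); the mycreds.txt file name is just an example:

# Rough sketch of caching OAuth credentials so the browser prompt only appears
# on the first run; "mycreds.txt" is an arbitrary local file name.
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
gauth.LoadCredentialsFile("mycreds.txt")   # silently does nothing if the file is missing
if gauth.credentials is None:
    gauth.LocalWebserverAuth()             # first run only: opens the browser
elif gauth.access_token_expired:
    gauth.Refresh()                        # refresh the token without a prompt
else:
    gauth.Authorize()
gauth.SaveCredentialsFile("mycreds.txt")   # reuse the credentials next time

drive = GoogleDrive(gauth)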

Run Scrapy script, process output, and load into database all at once?

I've managed to write a Scrapy project that scrapes data from a web page, and when I call it with scrapy crawl dmoz -o items.json -t json at the command line, it successfully outputs the scraped data to a JSON file.
I then wrote another script that takes that JSON file, loads it, changes the way the data is organized (I didn't like the default way it was being organized), and spits it out as a second JSON file. I then use Django's manage.py loaddata fixture.json command to load the contents of that second file into a Django database.
Now, I'm sensing that I'm going to get laughed out of the building for doing this in three separate steps, but I'm not quite sure how to put it all together into one script. For starters, it does seem really stupid that I can't just have my Scrapy project output my data in the exact way that I want. But where do I put the code to modify the 'default' way that Feed exports is outputting my data? Would it just go in my pipelines.py file?
And secondly, I want to call the scraper from inside a python script that will then also load the resulting JSON fixture into my database. Is that as simple as putting something like:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
at the top of my script, and then following it with something like:
from django.something.manage import loaddata
loaddata('/path/to/fixture.json')
? And finally, is there any specific place this script would have to live relative to both my Django project and the Scrapy project for it to work properly?
Exactly that. Define a custom item pipeline in pipelines.py to output the item data as desired, then add the pipeline class to settings.py. The Scrapy documentation has a JsonWriterPipeline example that may be of use; a rough adaptation is sketched below.
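A rough sketch of such a pipeline, patterned after the docs' JsonWriterPipeline; the reshaping inside process_item and the dotted path in ITEM_PIPELINES are placeholders for your own project:

# pipelines.py -- rough sketch of a pipeline that writes reshaped items as JSON
# lines; adjust the reshaping in process_item to the structure you want.
import json

class ReshapedJsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        record = dict(item)                       # reorganize fields here as needed
        self.file.write(json.dumps(record) + "\n")
        return item

# settings.py -- enable the pipeline (the dotted path is a placeholder)
# ITEM_PIPELINES = {'myproject.pipelines.ReshapedJsonWriterPipeline': 300}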
Well, on the basis that the script in your example was taken from the scrapy documentation, it should work. Have you tried it?
The location shouldn't matter, so long as all of the necessary imports work. You could test this by firing a Python interpreter in the desired location and then checking all of the imports one by one. If they all run correctly, then the script should be fine.
If anything goes wrong, then post another question here with the relevant code and what was tried, and I'm sure someone will be happy to help out. :)
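As for triggering loaddata from plain Python (the from django.something.manage import in the question is a guess), the usual route is Django's call_command; a minimal sketch, where myproject.settings is a placeholder for your actual settings module:

# Minimal sketch of running "loaddata" programmatically; "myproject.settings"
# is a placeholder for your real settings module.
import os

import django
from django.core.management import call_command

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
django.setup()

call_command('loaddata', '/path/to/fixture.json')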