I know how to use PyDrive to download a file from my own drive; the problem is that I need to download (or at the very least OPEN) an .xlsx file on a shared drive. Here is my code so far to download the file:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
gauth = GoogleAuth()
gauth.LoadClientConfigFile('client_secret.json')
drive = GoogleDrive(gauth)
team_drive_id = 'XXXXX'
parent_folder_id = 'XXXXX'
file_id = 'XXXXX'
f = drive.CreateFile({
    'id': file_id,
    'parents': [{
        'kind': 'drive#fileLink',
        'teamDriveId': team_drive_id,
        'id': parent_folder_id
    }]
})
f.GetContentFile('<file_name>')
The code returns error 404 (file not found), which makes sense: when I check the URL that GetContentFile is looking at, it is the URL leading to my drive, not the shared drive. I am probably missing a 'supportsTeamDrives': True somewhere (but where?).
There is actually a post related to my question at https://github.com/gsuitedevs/PyDrive/issues/149, where someone raised the exact same issue. Apparently that prompted a developer to modify PyDrive about two weeks ago, but I still don't understand how to interpret those modifications or how they fix my problem. I have not noticed any other similar post on Stack Overflow (not about downloading from a shared drive, anyway). Any help would be deeply appreciated.
Kind regards,
Berti
I found an answer in a newer GitHub issue: https://github.com/gsuitedevs/PyDrive/issues/160
The answer from SMB784 works: supportsTeamDrives=True should be added to files.py (in the pydrive package) on lines 235-236.
This definitely fixed my issue.
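For anyone applying the patch by hand before it ships in a release: the edit boils down to adding supportsTeamDrives=True to the Drive API request that PyDrive builds in files.py when it fetches the file's metadata. I haven't re-checked the exact surrounding code, so treat this as an approximate sketch of the patched call, not the literal line from the package:
# Approximate sketch of the patched request inside pydrive's files.py;
# the surrounding names follow the package's style and may differ slightly.
metadata = self.auth.service.files().get(
    fileId=file_id, supportsTeamDrives=True).execute(http=self.http)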
Move from pydrive to pydrive2
After encountering the same problem, I ran into this 2020 comment on an issue in the googleworkspace/PyDrive GitHub repository:
...there is an actively maintained version PyDrive2 (Travis tests, regular releases including conda) from the DVC.org team...please give it a try and let us know - https://github.com/iterative/PyDrive2
Moving from the pydrive package to the pydrive2 package enabled me to download a local copy of a file stored on a shared drive by only requiring me to know the file ID.
After installing pydrive2, you can download a local copy of the file within the shared drive by using the following code template:
# load necessary modules ----
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive
# authenticate yourself using your credentials
gauth = GoogleAuth()
gauth.LoadClientConfigFile('client_secret.json')
drive = GoogleDrive(gauth)
# store the file ID of the file located within the shared drive
file_id = 'XXXXX'
# store the output file name
output_file_name = 'YYYYY'
# create an instance of Google Drive file with auth of this instance
f = drive.CreateFile({'id': file_id})
# save content of this file as a local file
f.GetContentFile(output_file_name)
I have a container that is built to run Selenium with ChromeDriver in Python to download an Excel (.xlsx) file from a website.
I am using SAM to build and deploy this image to be run in AWS Lambda.
When I build the container and invoke it locally, the program executes as expected: The download occurs and I can see the file placed in the root directory of the container.
The problem is: when I deploy this image to AWS and invoke my lambda function, I get no errors; however, the download never executes and the file never appears in my root directory.
My first thought was that maybe I didn't allocate enough memory to the lambda instance. I gave it 512 MB, and the logs said it was using 416 MB. Maybe there wasn't enough room to fit another file inside? So I increased the memory provided to 1024 MB, but still no luck.
My next thought was that maybe the download was just taking a long time, so I also allowed the program to wait for 5 minutes after clicking the download to ensure that the download is given time to complete. Still no luck.
I have also tried setting the following options for chromedriver (full list of chromedriver options posted at bottom):
options.add_argument(f"--user-data-dir={'/tmp'}"),
options.add_argument(f"--data-path={'/tmp'}"),
options.add_argument(f"--disk-cache-dir={'/tmp'}")
and also setting tempfolder = mkdtemp() and passing that into the chrome options as above in place of /tmp. Still no luck.
Since this application is in a container, it should run the same locally as it does on AWS. So I am wondering if some configuration outside of the container is blocking my ability to download a file. Maybe the request is going out but the response is not being allowed back in?
Please let me know if there is anything I need to clarify -- Any help on this issue is greatly appreciated!
Full list of Chromedriver options
options.binary_location = '/opt/chrome/chrome'
options.headless = True
options.add_argument('--disable-extensions')
options.add_argument('--no-first-run')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--disable-client-side-phishing-detection')
options.add_argument('--allow-running-insecure-content')
options.add_argument('--disable-web-security')
options.add_argument('--lang=' + random.choice(language_list))
options.add_argument('--user-agent=' + fake_user_agent.user_agent())
options.add_argument('--no-sandbox')
options.add_argument("--window-size=1920x1080")
options.add_argument("--single-process")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--disable-dev-tools")
options.add_argument("--no-zygote")
options.add_argument(f"--user-data-dir={'/tmp'}")
options.add_argument(f"--data-path={'/tmp'}")
options.add_argument(f"--disk-cache-dir={'/tmp'}")
options.add_argument("--remote-debugging-port=9222")
options.add_argument("start-maximized")
options.add_argument("enable-automation")
options.add_argument("--headless")
options.add_argument("--disable-browser-side-navigation")
options.add_argument("--disable-gpu")
driver = webdriver.Chrome("/opt/chromedriver", options=options)
Just in case anybody stumbles across this question in the future, adding the following to the Chrome options solved my issue:
prefs = {
    "profile.default_content_settings.popups": 0,
    "download.default_directory": r"/tmp",
    "directory_upgrade": True
}
options.add_experimental_option("prefs", prefs)
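For anyone wiring this into the setup from the question: download.default_directory has to point somewhere writable, which inside a Lambda container means /tmp, and the prefs must be attached before webdriver.Chrome(...) is called. A small, hypothetical check I added after clicking the download (not from the original post) is to poll /tmp until the finished file appears:
import os, time

# hypothetical helper: wait until the .xlsx shows up in /tmp and Chrome's
# temporary .crdownload file is gone, then return the finished file's path
def wait_for_download(directory="/tmp", timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        names = os.listdir(directory)
        done = [n for n in names if n.endswith(".xlsx")]
        in_progress = [n for n in names if n.endswith(".crdownload")]
        if done and not in_progress:
            return os.path.join(directory, done[0])
        time.sleep(1)
    raise TimeoutError("download did not finish in time")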
I have a Lambda (L1) replacing a file (100 MB) at an S3 location (s3://bucket/folder/abc.json). I have two other Lambdas (L2, L3) reading the same file at the same time, one via a Golang API and the other via an Athena query. The S3 bucket/folder is not versioned.
The question is: do the Lambdas L2 and L3 read the old copy of the file until the new file is fully uploaded, or do they read the partial file that is being uploaded? If it's the latter, how do you make sure that L2 and L3 read the file only after the upload is complete?
Amazon S3 is now strongly consistent. This means once you upload an object, all people that read that object are guaranteed to get the updated version of the object.
On the surface, that sounds like it guarantees the answer to your question is "yes, every client will get either the old version or the new version of the file". The truth is still a bit fuzzier than that.
Under the covers, many of the S3 APIs upload with a multi-part upload. This is well known, and doesn't change what I've said above, since the upload must be done before the object is available. However, many of the APIs also use multiple byte-range requests during downloads to download larger objects. This is problematic. It means a download might download part of file v1, then when it goes to download another part, it might get v2 if v2 was just uploaded.
With a little bit of effort, we can demonstrate this:
#!/usr/bin/env python3
import boto3
import multiprocessing
import io
import threading

bucket = "a-bucket-to-use"
key = "temp/dummy_key"
size = 104857600

class ProgressWatcher:
    def __init__(self, filesize, downloader):
        self._size = float(filesize)
        self._seen_so_far = 0
        self._lock = threading.Lock()
        self._launch = True
        self.downloader = downloader
    def __call__(self, bytes_amount):
        with self._lock:
            self._seen_so_far += bytes_amount
            if self._launch and (self._seen_so_far / self._size) >= 0.95:
                self._launch = False
                self.downloader.start()

def upload_helper(pattern, name, callback):
    # Upload a file of 100mb of "pattern" bytes
    s3 = boto3.client('s3')
    print(f"Uploading all {name}..")
    temp = io.BytesIO(pattern * size)
    s3.upload_fileobj(temp, bucket, key, Callback=callback)
    print(f"Done uploading all {name}")

def download_helper():
    # Download a file
    s3 = boto3.client('s3')
    print("Starting download...")
    s3.download_file(bucket, key, "temp_local_copy")
    print("Done with download")

def main():
    # See how long an upload takes
    upload_helper(b'0', "zeroes", None)
    # Watch how the next upload progresses, this will start a download when it's nearly done
    watcher = ProgressWatcher(size, multiprocessing.Process(target=download_helper))
    # Start another upload, overwriting the all-zero file with all-ones
    upload_helper(b'1', "ones", watcher)
    # Wait for the downloader to finish
    watcher.downloader.join()
    # See what the resulting file looks like
    print("Loading file..")
    counts = [0, 0]
    with open("temp_local_copy") as f:
        for x in f.read():
            counts[ord(x) - ord(b'0')] += 1
    print("Results")
    print(counts)

if __name__ == "__main__":
    main()
This code uploads an object to S3 that's 100mb of "0". It then starts an upload, using the same key, of 100mb of "1", and when that second upload is 95% done, it starts a download of that S3 object. It then counts how many "0" and "1"s it sees in the downloaded file.
Running this with the latest versions of Python and Boto3, your exact output will no doubt differ from mine due to network conditions, but this is what I saw in one test run:
Uploading all zeroes..
Done uploading all zeroes
Uploading all ones..
Starting download...
Done uploading all ones
Done with download
Loading file..
Results
[83886080, 20971520]
The last line is important. The downloaded file was mostly "0" bytes, but there were 20 MB of "1" bytes. In other words, I got part of v1 of the file and part of v2, despite only performing one download call.
Now, in practice this is unlikely to happen, and even less likely if you have better network bandwidth than I do here on a run-of-the-mill home internet connection.
But it can always potentially happen. If you need to ensure that downloaders never see a partial file like this, you either need to do something like verify a hash of the file, or (my preference) upload to a different key each time and have some mechanism for the clients to discover the "latest" key, so they can download a whole, unchanged file even if a new upload finishes while they're downloading.
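As a rough illustration of that second option (the bucket name, key prefix, and pointer key below are made up for the example, and this is a sketch of the idea rather than production code): the writer uploads each revision under a fresh key and only then updates a tiny pointer object, so readers always download a key that is never overwritten.
import json
import uuid
import boto3

s3 = boto3.client('s3')
BUCKET = "a-bucket-to-use"         # hypothetical bucket
POINTER_KEY = "folder/abc.latest"  # hypothetical pointer object

def publish(local_path):
    # Upload the new revision under a unique key, then repoint "latest" at it.
    data_key = f"folder/abc-{uuid.uuid4().hex}.json"
    s3.upload_file(local_path, BUCKET, data_key)
    s3.put_object(Bucket=BUCKET, Key=POINTER_KEY,
                  Body=json.dumps({"key": data_key}).encode())
    return data_key

def fetch(local_path):
    # Read the pointer first, then download the immutable revision it names.
    pointer = json.loads(s3.get_object(Bucket=BUCKET, Key=POINTER_KEY)["Body"].read())
    s3.download_file(BUCKET, pointer["key"], local_path)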
The readers see only the old file until the new one is fully uploaded. There is no read of a partial file.
"Amazon S3 never adds partial objects."
SO discussion
Announcement
I'm trying to send some files (a zip and a Word doc) to a directory on a server using ftplib. I have the broad strokes sorted out:
import ftplib

session = ftplib.FTP('ftp.server', 'user', 'pass')
filewpt = open(file, mode)
readfile = open(file, mode)
session.cwd('new/work/directory')
session.storbinary('STOR filename.zip', filewpt)
session.storbinary('STOR readme.doc', readfile)
print "filename.zip and readme.doc were sent to the folder on ftp"
readfile.close()
filewpt.close()
session.quit()
This may provide someone else what they are after but not me. I have been using FileZilla as a check to make sure the files were transferred. When I see they have made it to the server, I see that they are both way smaller or even zero K for the readme.doc file. Now I'm guessing this has something to do with the fact that I stored the file in 'binary transfer mode' <--- whatever that means.
This is where my problems lie. I have no idea at all (yet) what is meant by binary transfer mode. Is it simply that I have to use retrbinary to return the files to their original state?
Could someone please explain to me like I'm a two year old what has happened to my files? If there's any more info required, please let me know.
This is a fantastic resource and solved most of my problems. I'm still trying to work out the intricacies of FTP, but I guess I will save that for another day. The link below builds a function to effortlessly upload files to an FTP server without the partial-upload problem that I've seen more than one Stack Exchanger run into.
http://effbot.org/librarybook/ftplib.htm
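For readers who hit the same truncation issue, a common culprit is passing storbinary a file handle that was opened in text mode, or one that is still open for writing and not yet flushed. A minimal sketch of a safe upload helper (the host, credentials, and paths below are made up for the example) looks roughly like this: open the local file with 'rb' so the raw bytes are sent untouched, and let the context managers close everything before you check the result.
import ftplib

def upload_binary(host, user, password, remote_dir, local_path, remote_name):
    # Open the local file in binary mode ('rb') so the bytes are sent untouched,
    # then upload it with STOR inside context managers that close both handles.
    with ftplib.FTP(host, user, password) as session, open(local_path, 'rb') as fh:
        session.cwd(remote_dir)
        session.storbinary('STOR ' + remote_name, fh)

# hypothetical example call
upload_binary('ftp.example.com', 'user', 'pass', 'new/work/directory',
              'filename.zip', 'filename.zip')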
I have read through the manual and I cannot find the answer. Given a magnet link, I would like to generate a torrent file so that it can be loaded on the next startup to avoid re-downloading the metadata. I have tried the fast-resume feature, but I still have to fetch the metadata when I do it, and that can take quite a bit of time. The examples I have seen are for creating torrent files for a new torrent, whereas I would like to create one matching a magnet URI.
Solution found here:
http://code.google.com/p/libtorrent/issues/detail?id=165#c5
See creating torrent:
http://www.rasterbar.com/products/libtorrent/make_torrent.html
Modify first lines:
file_storage fs;
// recursively adds files in directories
add_files(fs, "./my_torrent");
create_torrent t(fs);
To this:
torrent_info ti = handle.get_torrent_info();
create_torrent t(ti);
"handle" is from here:
torrent_handle add_magnet_uri(session& ses, std::string const& uri, add_torrent_params p);
Also, before creating the torrent you have to make sure that the metadata has been downloaded; you can check this by calling handle.has_metadata().
UPDATE
It seems the libtorrent Python API is missing some of the important C++ API required to create a torrent from a magnet link: the example above won't work in Python because the create_torrent Python class does not accept a torrent_info parameter (the C++ version does).
So I tried another way, but hit a brick wall there as well; here is the code:
if handle.has_metadata():
    torinfo = handle.get_torrent_info()
    fs = libtorrent.file_storage()
    for file in torinfo.files():
        fs.add_file(file)
    torfile = libtorrent.create_torrent(fs)
    torfile.set_comment(torinfo.comment())
    torfile.set_creator(torinfo.creator())
    for i in xrange(0, torinfo.num_pieces()):
        hash = torinfo.hash_for_piece(i)
        torfile.set_hash(i, hash)
    for url_seed in torinfo.url_seeds():
        torfile.add_url_seed(url_seed)
    for http_seed in torinfo.http_seeds():
        torfile.add_http_seed(http_seed)
    for node in torinfo.nodes():
        torfile.add_node(node)
    for tracker in torinfo.trackers():
        torfile.add_tracker(tracker)
    torfile.set_priv(torinfo.priv())
    f = open(magnet_torrent, "wb")
    f.write(libtorrent.bencode(torfile.generate()))
    f.close()
There is an error thrown on this line:
torfile.set_hash(i, hash)
It expects hash to be const char* but torrent_info.hash_for_piece(int) returns class big_number which has no api to convert it back to const char*.
When I find some time I will report this missing api bug to libtorrent developers, as currently it is impossible to create a .torrent file from a magnet uri when using python bindings.
torrent_info.orig_files() is also missing in python bindings, I'm not sure whether torrent_info.files() is sufficient.
UPDATE 2
I've created an issue on this, see it here:
http://code.google.com/p/libtorrent/issues/detail?id=294
Star it so they fix it fast.
UPDATE 3
It is fixed now; there is a 0.16.0 release. Binaries for Windows are also available.
Just wanted to provide a quick update using the modern libtorrent Python package: libtorrent now has the parse_magnet_uri function, which you can use to generate a torrent handle:
import libtorrent, os, time
def magnet_to_torrent(magnet_uri, dst):
    """
    Args:
        magnet_uri (str): magnet link to convert to torrent file
        dst (str): path to the destination folder where the torrent will be saved
    """
    # Parse magnet URI parameters
    params = libtorrent.parse_magnet_uri(magnet_uri)

    # Download torrent info
    session = libtorrent.session()
    handle = session.add_torrent(params)
    print("Downloading metadata...")
    while not handle.has_metadata():
        time.sleep(0.1)

    # Create torrent and save to file
    torrent_info = handle.get_torrent_info()
    torrent_file = libtorrent.create_torrent(torrent_info)
    torrent_path = os.path.join(dst, torrent_info.name() + ".torrent")
    with open(torrent_path, "wb") as f:
        f.write(libtorrent.bencode(torrent_file.generate()))
    print("Torrent saved to %s" % torrent_path)
If saving the resume data didn't work for you, you are able to generate a new torrent file using the information from the existing connection.
fs = libtorrent.file_storage()
libtorrent.add_files(fs, "somefiles")
t = libtorrent.create_torrent(fs)
t.add_tracker("http://10.0.0.1:312/announce")
t.set_creator("My Torrent")
t.set_comment("Some comments")
t.set_priv(True)
libtorrent.set_piece_hashes(t, "C:\\", lambda x: 0)
f=open("mytorrent.torrent", "wb")
f.write(libtorrent.bencode(t.generate()))
f.close()
I doubt that it'll make the resume faster than the function built specifically for this purpose.
Take a look at this code: http://code.google.com/p/libtorrent/issues/attachmentText?id=165&aid=-5595452662388837431&name=java_client.cpp&token=km_XkD5NBdXitTaBwtCir8bN-1U%3A1327784186190
It uses add_magnet_uri, which I think is what you need.
I was reading django.contrib.sessions.backends.file today; in the save method of SessionStore there is something like the following, used to achieve integrity when saving from multiple threads:
output_file_fd, output_file_name = tempfile.mkstemp(dir=dir,
                                                    prefix=prefix + '_out_')
renamed = False
try:
    try:
        os.write(output_file_fd, self.encode(session_data))
    finally:
        os.close(output_file_fd)
    os.rename(output_file_name, session_file_name)
    renamed = True
finally:
    if not renamed:
        os.unlink(output_file_name)
I don't quite understand how this solves the integrity problem.
Technically this doesn't solve the integrity problem completely. #9084 addresses this issue.
Essentially this works by writing the data to a temporary file created (with a unique name, atomically) by tempfile.mkstemp, and then calling os.rename() to rename the temp file over the session file. On Unix the rename atomically replaces any existing file; on Windows it raises an error if the destination already exists. This should be fixed for Django 1.1.
If you look at the revision history you'll see that they previously used locks, but changed to this method for various reasons.
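To make the pattern concrete outside of Django's internals, here is a minimal sketch of the same write-temp-then-rename idea (file names are made up for the example): readers always see either the complete old contents or the complete new contents, never a half-written file, because the rename swaps the whole file in one step.
import os
import tempfile

def atomic_write(path, data):
    # Write to a temp file in the same directory (so the rename stays on one
    # filesystem), then rename it over the target in a single atomic step.
    dir_name = os.path.dirname(path) or '.'
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, prefix='.tmp_')
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        os.replace(tmp_path, path)   # atomic on POSIX; also replaces on Windows
    except BaseException:
        os.unlink(tmp_path)
        raise

# hypothetical usage
atomic_write('session_file', b'{"user_id": 42}')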