I want to know whether it is possible to download a portion of a public AWS data set, and how to do it.
Specifically, I want to download part of the Common Crawl corpus to run local tests.
It looks like you can. If you point your browser at the public URL Amazon provides, you can see links to the full data sets as well as to subsets.
You can download them using your browser or any S3 client tool or library.
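For example, here is a minimal sketch using Python and boto3 with anonymous (unsigned) requests against the public commoncrawl bucket; the prefix shown is a placeholder for whichever crawl segment you actually want:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous access is enough for public data sets.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a small slice of the corpus under a hypothetical prefix, then
# download just those objects for local testing.
resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/", MaxKeys=5)
for obj in resp.get("Contents", []):
    filename = obj["Key"].split("/")[-1]
    s3.download_file("commoncrawl", obj["Key"], filename)
    print(f"downloaded {obj['Key']} ({obj['Size']} bytes)")
```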
This seems like something that happens often enough that there might already be some provision for doing it that I'm not aware of...
Users of our app download our installer package via a link on our site that points to the file hosted in an S3 bucket on AWS. Once installed, our app uses the same (hard-coded) URL to download and install updates when they become available.
One downside to this is that it requires that the download URL be static, and thus it can't contain any version information.
If we were hosting the download on our own (configurable) web server, I'd know how to set up a redirect from https://.../Foo_Latest to https://.../Foo_v1.0.13, and we could manage the alias in that one place.
But since we upload new releases to an S3 bucket on AWS, I'm wondering whether there's an existing AWS capability to alias the URL. This seems like a common enough use case that there may already be a solution in place for it.
Of course, we could just have the static URL point to a server we control, and then do the redirect there to the AWS URL. But that feels like it would somewhat defeat the high-availability and high-bandwidth benefits of using AWS...
Am I missing anything?
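For what it's worth, S3's static website hosting does support per-object redirects, which could serve as the alias. A minimal sketch with boto3, assuming the bucket is configured as a website endpoint (note that website endpoints don't serve HTTPS directly, so this may not fit if you need https:// on the alias URL); the bucket and key names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# A zero-byte "alias" object whose only job is to redirect requests for
# Foo_Latest to the current release. WebsiteRedirectLocation only takes
# effect on the S3 *website* endpoint, not the plain REST URL.
s3.put_object(
    Bucket="my-downloads-bucket",
    Key="Foo_Latest",
    WebsiteRedirectLocation="/Foo_v1.0.13",
)
```

Publishing a new release would then just mean re-pointing that one alias object at the new version's key.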
Is it possible to copy a file from SharePoint to S3? Preferably coding it from the AWS side.
I've searched but haven't found much out there. There's a question with a similar title, but it doesn't answer this one:
upload files from sharepoint online to aws s3 bucket
It is definitely possible. SharePoint Online has a REST API, and I use a Python package called office365 that implements a SharePoint client to handle most of the day-to-day operations you will need.
The repo is: https://github.com/O365/python-o365
Some tips on things I struggled with the first time:
The ClientContext object requires the base site URL of the SharePoint site you want to authenticate against. For example, if your document library is at:
https://mysharepoint.mydomain.com/sites/mysite/shareddocuments/
then the URL you must pass to the ClientContext will be: https://mysharepoint.mydomain.com/sites/mysite
The UserCredential method requires your user in the following format: user#mydomain
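To make this concrete, here is a minimal sketch of the SharePoint-to-S3 copy. It assumes the Office365-REST-Python-Client style of API (the ClientContext/UserCredential objects referenced above) plus boto3 on the AWS side; the site URL, file path, credentials, and bucket name are all placeholders:

```python
import io

import boto3
from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext

# Authenticate against the base site URL (not the document library URL).
ctx = ClientContext("https://mysharepoint.mydomain.com/sites/mysite").with_credentials(
    UserCredential("user#mydomain", "my-password")
)

# Download the file from SharePoint into memory (no temp file on disk).
buffer = io.BytesIO()
ctx.web.get_file_by_server_relative_url(
    "/sites/mysite/shareddocuments/report.docx"  # hypothetical path
).download(buffer).execute_query()
buffer.seek(0)

# Upload the bytes to S3.
s3 = boto3.client("s3")
s3.upload_fileobj(buffer, "my-bucket", "sharepoint/report.docx")
```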
I am using AWS S3 to store private MP4 files that are rendered on my site. I also have an AWS CloudFront distribution to speed up content delivery. The S3 bucket has a policy so it can only be accessed from my site, plus an OAI so the content can only be accessed through the distribution.
The problem I'm facing is that my videos can still be downloaded using a browser extension, even though the absolute path of each video is blocked outside the site. Is there anything I can do to avoid this?
Any help/direction would be appreciated.
If the browser needs to play the video then it will need to download it.
As you say, it is not that hard to download or capture the file, so you have to consider what your goals are.
The usual approach is to accept that it can be downloaded and encrypt the file so that only users with access to the decryption key can play back the content.
The tricky part then becomes how to securely share the decryption key with authorised users in a way that neither they nor a third party can view or share the key. This is the essence of nearly all common DRM systems.
You can use a proprietary way to share the key securely, even something as simple as via some other communication channel, if this addresses your requirements. It will likely not be leveraging the full security capability of the devices, such as a secure media path, but it may be enough for your needs.
If not, then you will probably want to look at one or more of the common DRM systems in use today. You generally need more than one to cover all devices and clients: Widevine for Android, Chrome, etc.; FairPlay for Safari and iOS; and PlayReady for Xbox, Edge, etc.
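As a toy illustration of the simple key-sharing approach mentioned above (emphatically not a real DRM system, since the key itself is exposed to the client), one could encrypt the file with a symmetric key and deliver that key over a separate channel. A sketch using the Python cryptography package; file names are placeholders:

```python
from cryptography.fernet import Fernet

# Generate a symmetric key; deliver it to authorised users over a separate,
# secure channel (this out-of-band delivery is the weak point vs. real DRM).
key = Fernet.generate_key()
fernet = Fernet(key)

# Encrypt the video before uploading it to S3. (Fernet reads the whole file
# into memory, which is fine for a sketch but not for very large files.)
with open("video.mp4", "rb") as f:
    encrypted = fernet.encrypt(f.read())
with open("video.mp4.enc", "wb") as f:
    f.write(encrypted)

# An authorised client holding the key can decrypt and play the content back.
with open("video.mp4.enc", "rb") as f:
    decrypted = fernet.decrypt(f.read())
```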
What would be the best approach to updating a static website hosted in an S3 bucket such that there is no downtime? The updates will be done by a company's marketing teams with zero knowledge of CLI commands or how to move around in the console. Are there ways to achieve this without them having to learn the console?
Edit
The website is a collection of static HTML pages and will be updated using an HTML editor. Once a page is edited, the marketing team will upload each updated file to the S3 bucket. There are no more than 10 such files, including HTML and images. The site is currently hosted on a shared server, and we now want to move it to an S3 bucket capable of hosting simple web pages. The preference is not to provision console access for these users, as they are only comfortable using a WYSIWYG HTML editor and uploading with an FTP client. The editors don't know HTML, and the site doesn't use JavaScript. I am thinking of writing a batch script to manage the uploads (see the sketch below) so that all the CLI complexity is kept away and they only work on the HTML in the editor. Looking for the simplest approach to achieve this.
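For what it's worth, here is a minimal sketch of such an upload wrapper, written in Python with boto3 rather than as a batch file; the local folder name, bucket name, and content-type fallback are assumptions:

```python
import mimetypes
from pathlib import Path

import boto3

SITE_DIR = Path("site")        # local folder the editors drop files into
BUCKET = "my-website-bucket"   # hypothetical bucket name

s3 = boto3.client("s3")

# Upload every file in the folder, setting Content-Type so browsers
# render HTML and images instead of downloading them.
for path in SITE_DIR.rglob("*"):
    if path.is_file():
        content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        s3.upload_file(
            str(path),
            BUCKET,
            path.relative_to(SITE_DIR).as_posix(),
            ExtraArgs={"ContentType": content_type},
        )
        print(f"uploaded {path}")
```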
You can set up a CI/CD pipeline as described here:
Update static website in an automated way
The above pipeline generally uses a code commit as the trigger. It depends on what your marketing teams are doing with the content and how they are updating it. If they are updating content hosted on AWS, you can change the trigger to S3 updates. The right solution depends on the individual use case, and may require some development on your side to make things simpler for your marketing teams.
I'm unsure what you are asking here, because to "update" a static website, surely you must have some technical knowledge of at least the very basics of web development.
It's important here to define what exactly you mean by update, because again, updating a website and updating a bucket are two completely different things.
Also, S3 has eventual consistency for overwrite PUTs (updates), so there may be a brief window where the old version of a file is still served.
The easiest way to update an S3 bucket is via the console, not the CLI. The console is pretty user-friendly and shouldn't take long to get used to.
On my PHP server I have a list of URLs that point to large files (not stored locally). These files can be hundreds of MB, so I'm looking for the best way to add them to GCS without first saving them to my server. Is this possible, or will I have to save each one and then upload it to GCS?
Edit
I forgot to mention that the list of URLs I have is managed programmatically and changes often, so any solution needs to work without manual interaction.
If your URLs are publicly reachable, you may be interested in the Transfer Service provided by Google Cloud Storage. You can provide a TSV file with a list of URLs from which your files will be transferred to the bucket of your choice. You can have a look at the service docs here for further details.
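Since your URL list is managed programmatically, generating that TSV can be automated. A minimal sketch in Python, assuming the Transfer Service's URL-list format (a TsvHttpData-1.0 header line followed by one URL per line; see the docs for the optional size and MD5 columns). The URLs and output path are placeholders:

```python
# Hypothetical helper: build the URL-list TSV that the Transfer Service
# consumes, regenerated whenever the programmatic URL list changes.
urls = [
    "https://example.com/files/big-file-1.bin",
    "https://example.com/files/big-file-2.bin",
]

with open("url_list.tsv", "w") as f:
    f.write("TsvHttpData-1.0\n")   # required header for URL lists
    for url in urls:
        f.write(url + "\n")        # optional size/MD5 columns omitted here
```

Note that the TSV itself must also be hosted somewhere the Transfer Service can reach (a public URL or GCS) when you create the transfer job.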