Using regex to download an entire directory with wget

I want to download multiple PDFs from URLs such as this: https://dummy.site.com/aabbcc/xyz/2017/09/15/2194812/O7ca217a71ac444eda516d8f78c29091a.pdf
If I run wget on the complete URL, it downloads the file: wget https://dummy.site.com/aabbcc/xyz/2017/09/15/2194812/O7ca217a71ac444eda516d8f78c29091a.pdf
But if I try to recursively download the entire folder, it returns 403 (Forbidden):
wget -r https://dummy.site.com/aabbcc/xyz/
I have tried setting the user agent, ignoring robots.txt, and a bunch of other solutions from the internet, but I keep coming back to the same point.
So I want to form a list of all possible URLs, treating the given URL as a common pattern, but I have no idea how to do that.
I just know that I can pass that list as an input file to wget, which will then download the files. So I'm asking for help with forming the URL list using regex.
Thank You!

You can't use wildcards to download files you can't see. If the host does not support directory listing, you have no idea what the filenames/paths are. And since you don't know the algorithm used to generate the filenames, you can't generate them yourself and fetch them.

Related

Assist With AWS CLI S3 Bucket/Folder File Search

I have several folders and only certain folders contain files with ".003" at the end of their name. These files do not have an extension.
I am interested in finding out:
The names of the folders (inside the bucket) that any of those files ARE in, ideally listed only once (no duplicates)?
The names of the folders that those files are NOT in?
I know how to do a search for a file like so:
aws s3 ls s3://{bucket}/{folder1}/{folder2} --recursive | grep "\.003"
Are there CLI commands that can give me what I am looking for?
If this or something like this has been asked before please point me in the correct direction. My apologies if so! :)
Thank you for your time!
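A rough sketch of one way to build both folder lists, using boto3 rather than the CLI; the bucket name and prefix below are placeholders, and this is only an illustration of the idea, not a tested solution:
import boto3

# Sketch only: walk every key under a prefix, record each key's "containing
# folder" (the part of the key before the object name), and split the folders
# into those that do and do not hold a ".003" file.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

folders_with_003 = set()
all_folders = set()

for page in paginator.paginate(Bucket="my-bucket", Prefix="folder1/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        folder = key.rsplit("/", 1)[0] if "/" in key else ""
        all_folders.add(folder)
        if key.endswith(".003"):
            folders_with_003.add(folder)

print("Folders containing .003 files:", sorted(folders_with_003))
print("Folders with no .003 files:", sorted(all_folders - folders_with_003))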

I have a GCP storage bucket with 1000+ images in it. What's the easiest way to get a text file that lists all the URLs of objects in the bucket?

I know that this API https://storage.googleapis.com/storage/v1/b/<BUCKET_NAME>/o? can be used to retrieve JSON data for 1000 objects at a time, and we can parse the output in code to pick out just the names and generate URLs of the required form. But is there a simpler way to generate a text file listing the URLs in a bucket?
Edit: adding more details
I have configured a Google load balancer (with CDN, if that matters) with IP address <LB_IP> in front of this bucket. So ideally I would want to be able to generate a list of URLs like
http://<LB_IP>/image1.jpg
http://<LB_IP>/image2.jpg
...
In general, you can just run gsutil ls gs://my_bucket > your_list.txt on Linux to get all your objects into a text list.
If this is not what you are looking for, please edit your question with more specific details.
gsutil doesn't have a command to print URLs for objects in a bucket; however, it can list objects, as @Chris32 mentioned.
In addition, according to this Stack Overflow post, you could pipe the listing through a sed program to replace the listings with object names and generate URLs of the required form.
For publicly visible objects, public links are predictable, as they match the following:
https://storage.googleapis.com/BUCKET_NAME/OBJECT_NAME
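To go from a listing to the load-balancer URLs asked about in the question, a minimal sketch with the google-cloud-storage Python client could look like the following; the bucket name and output filename are placeholders, and <LB_IP> is the question's placeholder for the load balancer address:
from google.cloud import storage

# Sketch: write one URL per object in the bucket, fronted by the load balancer.
client = storage.Client()
with open("urls.txt", "w") as out:
    for blob in client.list_blobs("my-bucket"):
        out.write("http://<LB_IP>/{}\n".format(blob.name))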

Uploading Local Directory to GCS using Airflow

I am trying to use Airflow to upload a directory (of Parquet files) to GCS.
I tried the FileToGoogleCloudStorageOperator for this purpose.
I tried the following options:
Option 1
src=<Path>/*.parquet
It errors out: No such file found
Option 2
src=<Path> -> where <Path> is the directory path
It errors out with: Is a directory
Questions
Is there any way FileToGoogleCloudStorageOperator can work at the directory level?
Is there an alternative way of doing the same?
Short Answer: Currently it is not possible. But I will take it as a feature request and try to add this in the upcoming release.
Until then, you can just use BashOperator with gsutil to copy multiple files at the same time.
Another option is to use PythonOperator: list the files with the os package, loop over them, and use GoogleCloudStorageHook.upload to upload each file, as sketched below.
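A minimal sketch of that PythonOperator approach, assuming Airflow 1.10-style import paths; the directory, bucket, prefix, and connection id are placeholders, and the hook's argument names may differ between Airflow versions:
import os
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

def upload_parquet_dir(local_dir, bucket, prefix):
    # Sketch: upload every .parquet file in local_dir to gs://bucket/prefix/
    hook = GoogleCloudStorageHook(google_cloud_storage_conn_id="google_cloud_default")
    for name in os.listdir(local_dir):
        if name.endswith(".parquet"):
            hook.upload(
                bucket=bucket,
                object="{}/{}".format(prefix.rstrip("/"), name),
                filename=os.path.join(local_dir, name),
            )

upload_task = PythonOperator(
    task_id="upload_parquet_dir",
    python_callable=upload_parquet_dir,
    op_kwargs={"local_dir": "/path/to/dir", "bucket": "my-bucket", "prefix": "data"},
    dag=dag,  # assumes an existing DAG object named dag
)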

Specify output filename of Cloud Vision request

So I'm working with Google Cloud Vision (for Node.js), trying to dynamically upload a document to a Google Cloud bucket, process it with the Cloud Vision API, and then download the .json afterwards. However, when Cloud Vision processes my request and places the result in my bucket for saved text extractions, it appends output-1-to-n.json to the end of the filename. So if I process a file called foo.pdf that's 8 pages long, the output will not be foo.json (even though I specified that), but rather foooutput1-to-8.json.
Of course, this could be remedied by checking the page count of the PDF before uploading it and appending that to the path I search for when downloading, but that seems like an unnecessarily hacky solution. I can't seem to find anything in the documentation about not appending output-1-to-n to outputs. Extremely happy for any pointers!
You can't specify a single output file for asyncBatchAnnotate because, depending on your input, many files may get created. The output config is only a prefix, and you have to do a wildcard search in GCS for your given prefix (so you should make sure your prefix is unique).
For more details see this answer.
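As an illustration of that prefix-based lookup (sketched in Python here rather than the question's Node.js; the bucket and prefix names are placeholders):
from google.cloud import storage

client = storage.Client()
# Sketch: find every output shard Vision wrote under the chosen prefix
# (e.g. files named <prefix>output-1-to-8.json, and so on).
for blob in client.list_blobs("my-bucket", prefix="foo"):
    if blob.name.endswith(".json"):
        print(blob.name)  # download/merge these shards as needed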

Procmail to automatically make new folders to store emails from new senders

I am learning how to use procmail but at this point, I am not even sure it's the right tool for what I am trying to do.
So far, I have managed to get fetchmail to retrieve emails from a Google IMAP account and procmail to filter those emails into local folders I had previously created.
I am wondering though whether there is a way to get procmail to automatically create a new folder locally when an email from a new sender is being retrieved and to store that email into that folder.
So far, I have only found a website describing how procmail can automatically create folders for mailing lists, but the recipe is something crazy, using characters whose meaning I have no idea of; furthermore, the official procmail website seems unreachable.
Please can you help? Thank you.
It's not clear what you expect the folder to be called, or what mailbox format you're using; but assuming maildir folders named by the sender's email terminus, try
Who=`formail -rtzxTo:`
:0
* ? mkdir -p "$Who"
$Who/
For an mbox folder, you don't need the directory check at all, because the folder is just a single text file, and you'd drop the final slash from the folder name. Mbox needs locking, so add a second colon after the zero.
Who=`formail -rtzxTo:`
:0:
$Who
Getting formail to create a reply and then extracting the To: header of the generated reply is a standard but slightly unobvious way to obtain just the email terminus for the sender of the input message.
The shell snippet mkdir -p dir creates dir if it doesn't already exist, and is a harmless no-op otherwise.