RegEx: need to automatically direct downloads from URL with changing section - regex

I have no knowledge of RegEx code but I've just downloaded a Google Chrome extension that lets me automatically direct downloads to specific folders on my computer.
I want jpgs from a stock photo website to be downloaded in a specific folder, but part of the URL changes for every single file. How do I write out the File URL so it ignores the random section of the URL?
https://website.com/photos/IGNORE THIS PART with azAZ01 RANDOM CODE/download?force=true

Your question was a little confusing to me, but this should be the regex you need:
https:\/\/website.com\/photos\/(\w|\d)*
To help break it down, the basic text (e.g. https:, website.com, photos) just matches the raw text. The \/ sequence is an escape character '\' followed by the slash we want. For the random part, I'm assuming it's made up of letters and numbers, so the last part translates roughly to: any word character '\w' (letters, digits, and underscore) or any digit '\d', and the * means any number of those. Strictly speaking, \w already covers digits, so \w* alone would match the same thing.
Also, you can use Regex101.com as a helpful tool when making regex
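As a minimal sketch of the same idea, here is how the pattern could be checked in Python. The URL shape is taken from the question; the assumption that the random part consists only of letters, digits, or underscores, and the helper name is_photo_download, are illustrative and not part of the Chrome extension.

```python
import re

# Assumed URL shape from the question; the random ID is taken to be
# one or more word characters (letters, digits, underscore).
DOWNLOAD_URL = re.compile(r"https://website\.com/photos/\w+/download\?force=true")

def is_photo_download(url: str) -> bool:
    """Return True if the whole URL matches the photo download pattern."""
    return DOWNLOAD_URL.fullmatch(url) is not None

print(is_photo_download("https://website.com/photos/aZ09xY/download?force=true"))
print(is_photo_download("https://website.com/videos/aZ09xY/download?force=true"))
```

The dot and the question mark are escaped here because both are special characters in a regex; the extension's own regex flavor may differ slightly.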

Related

How to handle a tilde / swung dash (~) in a regular expression in order to exclude temporary MS Office files?

I have a batch job in xml that gets scheduled by a job scheduling engine. This engine provides the possibility of observing directories for changes of their content. My task is to monitor directories on a file exchange server running Windows, where customers and clients upload files we need to process.
We need to know about the arrival of new files as soon as possible.
I have to put a regular expression into that xml-job in order to not match subdirectories and temporary files.
In most cases, customers and clients upload files formatted as text/csv/pdf, which don't cause any problems. Some upload MS Office files, which, on the other hand, become a problem if someone opens them in the directory. Then an invisible temporary file is created beginning with ~$.
According to the documentation of the scheduling engine, the regex follows the POSIX 1003.2 standard. However, I am not able to prevent notifications being sent when someone opens an MS Office file in a monitored directory.
The regular expressions I have tried so far are:
First try before even noticing temporary office files:
^[a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
Second try, intention was excluding a leading ~:
^[^~][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
Third try, intention was excluding a leading ~ by its character code:
^[^\x7e][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
Fourth try, intention was excluding a leading ~ by its character code with a capital E:
^[^\x7E][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
None of those stop notifications from being sent on file openings…
Does anyone have any idea what to do?
All suggestions and alternatives are welcome.
I even checked them at regex101, regexplanet.com, regexr.com and regextester.com where the second try was matching exactly as desired. I did not even forget to configure POSIX compilation if it was possible on those sites (not all).
How can I exclude the ~ character from matching the regular expression (at the beginning of a file name)?
Short version:
How can I create a regular expression that matches any file with any extension apart from .part, but matches neither the file thumbs.db nor any file whose name begins with a ~?
Requirements:
What should not be matched:
Subfolders (my approach was files without a .),
Thumbs.db (Windows thumbnails db),
*.part (filezilla partial uploads),
~$. (temporary files starting with ~ or ~$, MS Office tmp files)
The following list provides some files and folders that must be matched or not matched by the regex:
Ablage (subfolder, should not be matched)
Abrechnungen (subfolder, should not be matched)
eine_testdatei.csv
TEST-WORKBOOK.xlsx
TEST-WORKBOOK_äöüß.xlsx
Test-2018-08-08.txt
~$TEST-WORKBOOK.xlsx (temporary file, should not be matched)
TEST-WORKBOOK.xlsx.part (partial upload, should not be matched)
TEST-WORKBOOK.part (partial upload, should not be matched)
New Problems occurred while trying to find the regex
A few problems came up after the creation of this question when I tried to apply the actually correct regex stated in the answer given by @Bohemian. I wasn't aware of those problems, so I just add them here for completeness.
The first one occurred when certain characters in the regex were not allowed in xml. The xml file is parsed by a java class that throws an exception trying to parse < and >, they are forbidden in xml documents if not related to xml nodes directly (valid: <xml-node>...</xml-node>, invalid: attribute="<ome_on, why isn't this VALI|>").
This can be avoided by using the html names &lt; instead of < and &gt; instead of >.
The second (and currently unresolved) issue is that the engine rejects an operand in the actually correct regular expression ^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$. The engine says:
Error: 2018-08-17T06:05:46Z REGEX-13
[repetition-operator operand invalid, ^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$]
The corresponding line in the xml file looks like this:
<start_when_directory_changed directory="F:\someDirectory" regex="^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$" />
Now I am stuck again, because my knowledge of regular expressions is pretty low. It is so low, that I don't even have any idea what character could be that criticized operand in the regex.
Research has brought me to this question whose accepted answer states "POSIX regexes don't support using the question mark ? as a non-greedy (lazy) modifier to the star and plus quantifiers (…)", which gives me an idea about what is wrong with the great regex. Still, I am not able to provide a working regex, more research will have to follow…
POSIX ERE doesn't allow for a simple way to exclude a particular string from matching. You can disallow a particular character -- like in [^.part] you are matching a single character which is not (newline or) dot or p or a or r or t -- and you can specify alternations, but those are very cumbersome to combine into an expression which excludes some particular patterns.
Here's how to do it, but as you can see, it's not very readable.
^([^~t.]|t($|[^h])|th($|[^u])|thu($|[^m])|thum($|[^b])|thumb($|[^s])|thumbs($|[^.])|thumbs\.($|[^d])|thumbs\.d($|[^b])|\.($|[^p])|\.p($|[^a])|\.pa($|[^r])|\.par($|[^t]))+$
... and it still probably doesn't do exactly what you want.
Try this:
^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$
See live demo.
There is nothing special about the tilde character in regex.
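To see what the lookaround-based regex above accepts and rejects, here is a small sketch in Python, whose re module supports lookahead and lookbehind (unlike the POSIX engine in the question, so this does not solve the scheduler problem). The file names come from the question's list; the re.IGNORECASE flag is an added assumption so that "Thumbs.db" with a capital T is excluded as well.

```python
import re

# The regex from the answer; IGNORECASE is an assumption added here
# so that "Thumbs.db" is rejected regardless of capitalization.
FILE_FILTER = re.compile(r"^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$", re.IGNORECASE)

names = [
    "Ablage",
    "Abrechnungen",
    "eine_testdatei.csv",
    "TEST-WORKBOOK.xlsx",
    "TEST-WORKBOOK_äöüß.xlsx",
    "Test-2018-08-08.txt",
    "~$TEST-WORKBOOK.xlsx",
    "TEST-WORKBOOK.xlsx.part",
    "TEST-WORKBOOK.part",
    "Thumbs.db",
]

# Keep only the names the filter accepts.
matched = [name for name in names if FILE_FILTER.search(name)]
for name in matched:
    print(name)
```

Names without a dot fail the first lookahead, names starting with ~ fail [^~], and names ending in .part fail the lookbehind.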
I am very late on this but above comments were helpful for me. It may not work for you but my solution is:
file_list <- file_list[!grepl("~", file_list)]

KimonoLabs crawler Generated URL List with regex

So, I'm trying to crawl a website that has like 7,000 product pages and the link structure is like this:
https://example.com/category/sub-category/numericid-name-of-the-product/
What I'm trying to achieve is to Generate a URL list, the Kimono App has that option, and it actually sections the URL but I'm only offered default value, range, and custom list.
I tried to put in stuff like "/.+/" to match all the chars, but that does not work, I couldn't find any help on that on official kb.
I know that import.io had that "{alpahnumeric}" for example for different parts of URL so it matches them. Is there a way to accomplish that in kimonolabs app?
Try this regex: https://example.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)
Note: you may need to escape some characters (namely / would be escaped as \/).
Also, I'm not familiar with KimonoLabs, so I don't know if this is what you're looking for exactly. Feel free to clarify.
Explanation
https://example.com/ literally
([^/]+)/ a bunch of not /s, followed by a /
([0-9]+)-([^/]+) Numbers followed by another bunch of not /s
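The breakdown above can be tried out in Python. This is only a sketch of how the capture groups split the URL; it says nothing about what KimonoLabs accepts, and the function name parse_product_url and the sample ID 12345 are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Same structure as the regex in the answer, with the dot escaped.
PRODUCT_URL = re.compile(r"https://example\.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)")

def parse_product_url(url: str) -> Optional[Tuple[str, str, str, str]]:
    """Return (category, sub_category, numeric_id, name) or None if no match."""
    m = PRODUCT_URL.match(url)
    return m.groups() if m else None

print(parse_product_url("https://example.com/category/sub-category/12345-name-of-the-product/"))
print(parse_product_url("https://example.com/about/"))
```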

Regular Expression to match a specific URL broken up by arbitrary characters

I run a Django-based forum (the framework is probably not important to the question, but still) and it has been increasingly getting spammed with posts that link to a specific website constantly (www.solidwoodkitchen.co.uk - these people are apparently the worst).
I've implemented a string blocking system that stops them posting to the forum if the URL of the website is included in the post, but as spam bots usually do, it has figured out a way around that by breaking up the URL with other characters (eg. w_w_w.s*olid_wood*kit_ch*en._*co.*uk .). So a couple of questions:
Is it even possible to build a regex capable of finding the specific URL within a block of text even when it has been modified like that?
If it is, would this cause a performance hit?
Description
You could break the url into a string of characters, then join them together with [^a-z0-9]*?. So in this case with www.solidwoodkitchen.co.uk the resulting regex would look like:
w[^a-z0-9]*?w[^a-z0-9]*?w[^a-z0-9]*?[.][^a-z0-9]*?s[^a-z0-9]*?o[^a-z0-9]*?l[^a-z0-9]*?i[^a-z0-9]*?d[^a-z0-9]*?w[^a-z0-9]*?o[^a-z0-9]*?o[^a-z0-9]*?d[^a-z0-9]*?k[^a-z0-9]*?i[^a-z0-9]*?t[^a-z0-9]*?c[^a-z0-9]*?h[^a-z0-9]*?e[^a-z0-9]*?n[^a-z0-9]*?[.][^a-z0-9]*?c[^a-z0-9]*?o[^a-z0-9]*?[.][^a-z0-9]*?u[^a-z0-9]*?k
This would basically search for the entire string of characters separated by zero or more non-alphanumeric characters.
Or you could take the input text and strip out all punctuation then simply search for wwwsolidwoodkitchencouk.
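Both suggestions can be sketched in a few lines of Python. The helper names below are hypothetical, and lowercasing the text before matching is an added assumption so that mixed-case spam is caught too.

```python
import re

# Separator between the URL's characters: zero or more non-alphanumerics.
SEPARATOR = "[^a-z0-9]*?"

def build_obfuscation_pattern(url: str) -> "re.Pattern[str]":
    """Join the escaped characters of the URL with the separator pattern."""
    return re.compile(SEPARATOR.join(re.escape(ch) for ch in url.lower()))

def contains_obfuscated(url: str, text: str) -> bool:
    """First approach: regex that tolerates junk between the URL's characters."""
    return build_obfuscation_pattern(url).search(text.lower()) is not None

def contains_after_stripping(url: str, text: str) -> bool:
    """Second approach: strip everything non-alphanumeric, then do a plain search."""
    def strip(s: str) -> str:
        return re.sub(r"[^a-z0-9]", "", s.lower())
    return strip(url) in strip(text)

spam = "Great deals at w_w_w.s*olid_wood*kit_ch*en._*co.*uk . today"
print(contains_obfuscated("www.solidwoodkitchen.co.uk", spam))
print(contains_after_stripping("www.solidwoodkitchen.co.uk", spam))
```

Note that the first approach still requires the dots to be present, as in the regex above, while the stripping approach ignores punctuation entirely.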

Finding a URL within two strings regex

I have a long HTML file that contains the names of organizations and their URL's. Each organization's "section" in the code is demarcated by the word "organization" followed by a lot of code, with their URL located inside that code, and ends with the word "organization".
For example:
organization -- a lot of code (with the URL located somewhere inside) -- organization
I have tried to use regex to search and extract the URL, but to no avail.
organization(?<Protocol>\w+):\/\/(?<Domain>[\w#][\w.:#]+)\/?[\w\.?=%&=\ #/$,]*organization
I suspect my problem lies somewhere in my trying to demarcate the search for URL's by just using the word "organization", but I am not sure.
Try group 1 from this:
organization.*\b(\w+://[\w.?%&=#/$,-]+).*?organization
Your current regex is searching for something sandwiched immediately between two instances of "organization". If there's any chance of characters existing between "organization" and your URL, you'll need to introduce a non-greedy match for any instances of anything (.*?), and if there are newlines in the mix you'll need to use (?:.|\n)*?.
So your regex becomes:
organization(?:.|\n)*?(?<Protocol>\w+):\/\/(?<Domain>[\w#][\w.:#]+)\/?[\w\.?=%&=\ #/$,]*(?:.|\n)*?organization
(Because of the bold insertions, this mistakenly appears to have spaces, but it does not. If you select it and copy/paste, it will paste correctly without spaces)
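As a rough sketch of the non-greedy approach in Python: the re.DOTALL flag lets . match newlines, which stands in for the (?:.|\n) construct. The sample HTML and the function name extract_urls are invented for illustration, and the URL character class is borrowed from the first answer.

```python
import re
from typing import List

# Lazy gaps on both sides of the URL; DOTALL makes "." span newlines.
SECTION_URL = re.compile(
    r"organization.*?(\w+://[\w.?%&=#/$,-]+).*?organization",
    re.DOTALL,
)

def extract_urls(html: str) -> List[str]:
    """Return the first URL found in each organization ... organization section."""
    return [m.group(1) for m in SECTION_URL.finditer(html)]

sample = (
    'organization <p>Acme</p>\n<a href="http://acme.example/home">site</a>\n organization\n'
    'organization <b>Beta</b> <a href="https://beta.example/x?id=1">b</a> organization'
)
print(extract_urls(sample))
```

If a section contains no URL at all, the lazy match will run on into the next section, so this only holds under the assumption that every section has one.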

replace urls

I have a huge txt file (in EditPad Pro) with a list of URLs, some of which point to images in the root folder.
http://www.othersite.com/image01.jpg
http://www.mysite.com/image01.jpg
http://www.mysite.com/category/image01.jpg
How can I change only the ones that have images in the root using a regexp?
http://www.othersite.com/image01.jpg
http://www.NEW_WEBSITE.com/image01.jpg
http://www.mysite.com/category/image01.jpg
I'm using the RegExr online app.
Search and replace (case insensitive, regular expression):
http://www\.mysite\.com/([^/]*\.(?:jpg|gif|png))
with:
http://www.NEW_WEBSITE.com/\1
EDIT
And yes, this will also re-base files such as http://www.mysite.com/.jpg, if any such files or directories exist. If anyone doesn't like this, just replace * with +, or with {X,} if your assumption happens to be that an image file needs a name of at least X characters, and so on. But really, this is probably quite outside the scope of what lab72 is trying to achieve (i.e. not image file name validation).
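The same search and replace can be sketched in Python, where \1 in the replacement refers to the captured file name. The function name rebase_root_images is hypothetical; the URLs are the ones from the question.

```python
import re

# Matches only images directly under the root: no "/" allowed in the file name.
ROOT_IMAGE = re.compile(
    r"http://www\.mysite\.com/([^/]*\.(?:jpg|gif|png))",
    re.IGNORECASE,
)

def rebase_root_images(text: str) -> str:
    """Point root-level image URLs on mysite.com at the new host."""
    return ROOT_IMAGE.sub(r"http://www.NEW_WEBSITE.com/\1", text)

urls = "\n".join([
    "http://www.othersite.com/image01.jpg",
    "http://www.mysite.com/image01.jpg",
    "http://www.mysite.com/category/image01.jpg",
])
print(rebase_root_images(urls))
```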
url1.replace(/(https?:\/\/www\.)(\w*?)(\.com\/image\d*?\.(png|gif|jpg))/,
"$1newName$3");
Something like the above should work. The code is in AS (not compiled though :P). Note that $2 matches the site's name, which we are replacing with newName.
Replace
http://www\.mysite\.com/image(.*)
with
http://www.newsite.com/image$1
That being said, you might also be interested in a decent text editor. That flash applet is really yucky. You can still use the same regexp, although you'll have to replace the dollar sign $ with a backslash \.