Writing valid RegEx for use in file/folder exclusion

I'm trying to write two expressions to use in the files/folder Exclusion List for Code42 CrashPlan backup. Their support won't help with RegEx expressions, they just point me to their KB article.
In their "File Exclusions" section, I'd like to exclude:
this folder specifically: S:\Google Drive\Temp
any file or folder containing the string Backup_Excluded anywhere in its name.
This is what I've got so far - but I have no way of knowing if they're correct:
(?i).*Google Drive\\Temp ...but since I really want to exclude a specific folder, not a pattern - do I need to escape the slashes and colon in the path of S:\Google Drive\Temp
(?i).*Backup_Excluded
Research disclaimer: I know there are RegEx resources out there, but am unsure which flavor/syntax to use, as I'd imagine there are many. I was hoping those with more RegEx familiarity could advise.

The link you posted says:
The Code42 app treats all file separators as forward slashes /.
So it seems you'd want to use / instead of \\ in your regular expressions.
Colon doesn't need escaping.
\ needs escaping because it's the escaping character itself.
/ needs escaping in languages that use it as the delimiter around a regular expression. However, the examples in your link don't escape it, which suggests that only the bare pattern is expected (no delimiters), so no escaping is needed.
Then you could probably use:
S:/Google Drive/Temp
or [A-Z]:/Google Drive/Temp (to allow any drive)
.*Backup_Excluded.*
I probably wouldn't use (?i), as the capitals in those strings are usually there, but that's your call.
Check out e.g. https://regex101.com/ to test your regular expressions (also in different flavours).
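The suggested patterns can also be checked offline. The following is a minimal Python sketch, not the Code42 app itself: the helper names normalize and is_excluded are made up for illustration, and treating the folder rule as a path-prefix match is an assumption about how the app applies patterns.

```python
import re

# Patterns from the answer above. The tool is said to treat every separator
# as "/", so the patterns use forward slashes.
FOLDER_RULE = re.compile(r"[A-Z]:/Google Drive/Temp")
NAME_RULE = re.compile(r".*Backup_Excluded.*")


def normalize(path: str) -> str:
    """Convert Windows separators to forward slashes."""
    return path.replace("\\", "/")


def is_excluded(path: str) -> bool:
    """Return True if either rule matches the normalized path.

    Assumption: the folder rule is applied as a prefix match and the
    name rule must match the whole path.
    """
    normalized = normalize(path)
    return bool(FOLDER_RULE.match(normalized) or NAME_RULE.fullmatch(normalized))
```

For example, is_excluded(r"S:\Google Drive\Temp\draft.txt") is true, while a path under S:\Google Drive\Docs is not excluded.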

Related

How to handle a tilde / swung dash (~) in a regular expression in order to exclude temporary MS Office files?

I have a batch job in xml that gets scheduled by a job scheduling engine. This engine provides the possibility of observing directories for changes of their content. My task is to monitor directories on a file exchange server running Windows, where customers and clients upload files we need to process.
We need to know about the arrival of new files as soon as possible.
I have to put a regular expression into that xml-job in order to not match subdirectories and temporary files.
In most cases, customers and clients upload files formatted as text/csv/pdf, which don't cause any problems. Some upload MS Office files, which, on the other hand, become a problem if someone opens them in the directory. Then an invisible temporary file is created beginning with ~$.
According to the documentation of the scheduling engine, the regex follows the POSIX 1003.2 standard. However, I am not able to prevent notifications being sent when someone opens an MS Office file in a monitored directory.
My regular expressions, that I have tried so far are:
First try before even noticing temporary office files:
^[a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
Second try, intention was excluding a leading ~:
^[^~][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
Third try, intention was excluding a leading ~ by its character code:
^[^\x7e][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
Fourth try, intention was excluding a leading ~ by its character code with a capital E:
^[^\x7E][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
None of these stops notifications from being sent when a file is opened…
Does anyone have any idea what to do?
All suggestions and alternatives are welcome.
I even checked them at regex101, regexplanet.com, regexr.com and regextester.com where the second try was matching exactly as desired. I did not even forget to configure POSIX compilation if it was possible on those sites (not all).
How can I exclude the ~ character from matching the regular expression (at the beginning of a file name)?
Short version:
How can I create a regular expression that matches any file with any extension apart from .part, but matches neither the file thumbs.db nor any file whose name begins with a ~?
Requirements:
What should not be matched:
Subfolders (my approach was files without a .),
Thumbs.db (Windows thumbnails db),
*.part (filezilla partial uploads),
~$. (temporary files starting with ~ or ~$, MS Office tmp files)
The following list provides some files and folders that must be matched or not matched by the regex:
Ablage (subfolder, should not be matched)
Abrechnungen (subfolder, should not be matched)
eine_testdatei.csv
TEST-WORKBOOK.xlsx
TEST-WORKBOOK_äöüß.xlsx
Test-2018-08-08.txt
~$TEST-WORKBOOK.xlsx (temporary file, should not be matched)
TEST-WORKBOOK.xlsx.part (partial upload, should not be matched)
TEST-WORKBOOK.part (partial upload, should not be matched)
New Problems occurred while trying to find the regex
A few problems came up after the creation of this question when I tried to apply the actually correct regex stated in the answer given by @Bohemian. I wasn't aware of those problems, so I am adding them here for completeness.
The first one occurred because certain characters in the regex are not allowed in xml. The xml file is parsed by a java class that throws an exception when it tries to parse < and >. They are forbidden in xml documents unless they belong to xml nodes directly (valid: <xml-node>...</xml-node>, invalid: attribute="<ome_on, why isn't this VALI|>").
This can be avoided by using the html names &lt; instead of < and &gt; instead of >.
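Escaping an attribute value by hand is easy to get wrong, so here is a small Python sketch that lets the standard library do it. It only illustrates the escaping step: the function name build_job_element is hypothetical, and the element and attribute names are copied from the job line quoted in this question.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

REGEX = r"^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$"


def build_job_element(directory: str, regex: str) -> str:
    """Build the job line with both attribute values safely escaped."""
    # quoteattr() escapes &, < and > and wraps the value in quotes.
    return (
        "<start_when_directory_changed "
        f"directory={quoteattr(directory)} regex={quoteattr(regex)} />"
    )


xml_line = build_job_element(r"F:\someDirectory", REGEX)

# Parsing the line back should give the original, unescaped regex.
round_trip = ET.fromstring(xml_line).get("regex")
```

The generated line contains the escaped form of the less-than sign, and an XML parser hands the original regex back to whatever reads the attribute.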
The second (and currently unresolved) issue is that the engine rejects an operand in the actually correct regular expression ^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$. The engine says:
Error: 2018-08-17T06:05:46Z REGEX-13
[repetition-operator operand invalid, ^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$]
The corresponding line in the xml file looks like this:
<start_when_directory_changed directory="F:\someDirectory" regex="^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$" />
Now I am stuck again, because my knowledge of regular expressions is pretty low. It is so low, that I don't even have any idea what character could be that criticized operand in the regex.
Research has brought me to this question whose accepted answer states "POSIX regexes don't support using the question mark ? as a non-greedy (lazy) modifier to the star and plus quantifiers (…)", which gives me an idea about what is wrong with the great regex. Still, I am not able to provide a working regex, more research will have to follow…

POSIX ERE doesn't allow for a simple way to exclude a particular string from matching. You can disallow a particular character -- like in [^.part] you are matching a single character which is not (newline or) dot or p or a or r or t -- and you can specify alternations, but those are very cumbersome to combine into an expression which excludes some particular patterns.
Here's how to do it, but as you can see, it's not very readable.
^([^~t.]|t($|[^h])|th($|[^u])|thu($|[^m])|thum($|[^b])|thumb($|[^s])|thumbs($|[^.])|thumbs\.($|[^d])|thumbs\.d($|[^b])|\.($|[^p])|\.p($|[^a])|\.pa($|[^r])|\.par($|[^t]))+$
... and it still probably doesn't do exactly what you want.
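To make the point about [^.part] concrete, here is a short Python sketch that runs the first attempt from the question against some of the sample names. Python's re module is not a POSIX ERE engine, so this only illustrates the character-class behaviour, which is the same in both; the helper name is_matched is made up.

```python
import re

# First attempt from the question. [^.part] is a character class: it matches
# ONE character that is not ".", "p", "a", "r" or "t". It does not mean
# "anything except the string .part".
FIRST_TRY = re.compile(r"^[a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$")


def is_matched(name: str) -> bool:
    """Return True if the first attempt accepts the file name."""
    return FIRST_TRY.search(name) is not None


results = {
    name: is_matched(name)
    for name in (
        "eine_testdatei.csv",
        "TEST-WORKBOOK.xlsx",
        "Test-2018-08-08.txt",
        "TEST-WORKBOOK.part",
    )
}
```

The .part file is rejected, but so is Test-2018-08-08.txt, because the final t of its extension is one of the characters the class forbids.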

Try this:
^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$
See live demo.
There is nothing special about the tilde character in regex.
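The pattern can be checked against the sample names from the question with a few lines of Python. This is only a sketch of the logic: the scheduling engine in the question is described as POSIX-based and rejects this pattern, so passing here says nothing about that engine. The re.IGNORECASE flag is an addition of this sketch so that Thumbs.db with a capital T is rejected as well, and the function name accepted is made up.

```python
import re

# Regex from the answer above; re.IGNORECASE is an assumption added here.
PATTERN = re.compile(r"^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$", re.IGNORECASE)

SAMPLES = [
    "Ablage",
    "Abrechnungen",
    "eine_testdatei.csv",
    "TEST-WORKBOOK.xlsx",
    "TEST-WORKBOOK_äöüß.xlsx",
    "Test-2018-08-08.txt",
    "~$TEST-WORKBOOK.xlsx",
    "TEST-WORKBOOK.xlsx.part",
    "TEST-WORKBOOK.part",
    "Thumbs.db",
]


def accepted(names: list[str]) -> list[str]:
    """Return the names the pattern matches, in their original order."""
    return [name for name in names if PATTERN.search(name)]


matched = accepted(SAMPLES)
```

Only the four regular files remain; the subfolders, the temporary Office file, the partial uploads and Thumbs.db are all filtered out.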

I am very late on this but above comments were helpful for me. It may not work for you but my solution is:
file_list <- file_list[!grepl("~", file_list)]

Problems with finding and replacing

Hey Stack Overflow community. I need help with a huge data file. Is it possible to use a regular expression to find this tag:
<category_name><![CDATA[Prekiniai ženklai&gt;Adler|Kita buitinė technika&gt;Buičiai naudingi prietaisai|Kita buitinė technika&gt;Lygintuvai]]></category_name>
and somehow replace all the other data, leaving only 'Adler' or 'Lygintuvai'? I'm using Altova to edit xml files, so I can't find any way other than find-and-replace. I'm new to regex, so I thought maybe you could help me.

#\<category_name\>.+?gt\;([\w]+?)\|.+?gt;([\w]+?)\]\]\>\<\/category_name\>#i
\1 - Adler
\2 - Lygintuvai
PHP
regex101.com
Fields may contain alphanumeric characters without spaces.
If you want to modify the scope of acceptable characters change [\w] to something other:
[a-z] - only letters
[0-9] - only digits
etc.

It's possible, but use of regular expressions to process XML will never be 100% correct (you can prove that using computer science theory), and it may also be very inefficient. For example, the solution given by Luk is incorrect because it doesn't allow whitespace in places where XML allows it. Much better to use XQuery or XSLT, both of which are designed for the job (and both work in Altova). You can then use XPath expressions to locate the element or attribute nodes you are interested in, and you can still use regular expressions (e.g. in the XPath replace() function) to process the content of text or attribute nodes.
Incidentally, your input is rather strange because it uses escape sequences like &gt; within a CDATA section; but XML escape sequences are not recognized in a CDATA section.
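As a rough illustration of the parse-first approach, here is a Python sketch using the standard library's ElementTree. It is a stand-in for the XQuery/XSLT route recommended above, not the same thing. The function name leaf_categories is made up, and the sample assumes the separator appears as the literal text &gt; inside the CDATA section; the code decodes it explicitly, so a plain > separator works too.

```python
import html
import xml.etree.ElementTree as ET

SAMPLE = (
    "<category_name><![CDATA[Prekiniai ženklai&gt;Adler|"
    "Kita buitinė technika&gt;Buičiai naudingi prietaisai|"
    "Kita buitinė technika&gt;Lygintuvai]]></category_name>"
)


def leaf_categories(xml_text: str) -> list[str]:
    """Return the last segment of each '|'-separated category path."""
    element = ET.fromstring(xml_text)
    # Entities are not decoded inside CDATA, so decode them explicitly.
    text = html.unescape(element.text or "")
    return [path.split(">")[-1].strip() for path in text.split("|")]


leaves = leaf_categories(SAMPLE)
```

The first and last entries of leaves are the two names the question asks for; the parser takes care of whitespace and CDATA handling that a single regex over the raw file would have to guess at.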

Regex expression to match a string but exclude something at the same time

I want to try and ask this as concisely as possible please forgive me if I'm leaving something out. I want the expression to match all cases except where an exact filename string is present.
The backup software I'm using supports regular expressions, and I want to set up an exclusion that skips all files of a particular extension, except that there are certain files I need to back up, so I don't want them to match.
For this example, say the files I want to exclude are *.FLV
(?i).*\.flv
I want to include in my backups three files: abc123.flv, ghk432.flv, and fdw917.flv
This is where I'm having trouble, even just including one file from the three to be included to backup
(?i).*\.flv^(?!(abc123\.flv))&
The expression is being added to an Exclusion List for code42 CrashPlan backup, their support unfortunately cannot assist with complex RegEx expressions.
The closest thing I can supply as an example is their Example 3: Using An Exclude To Include:
.*/Documents/((?!(.*\.(doc|rtf)|.*/)$).)*$
http://support.code42.com/Administrator/3.6_And_4.0/Configuring/Using_Include_And_Exclude_Filters
However, it excludes all files within directories named "Documents" and includes any files in those folders with doc or rtf file extensions. I'm trying to create an expression that works with file extensions regardless of folder location.
In my brain logically it seems like I need to write this as some kind of if then else statement but regex is not my forte.

Use an anchored negative look ahead with an alternation for the files you want to keep:
^(?i)(?!.*(abc123|ghk432|fdw917)\.flv).*\.flv
The negative lookahead asserts that the following input does not match its regex, and the pipe character means "or".

Try to put the negative lookahead at the position of the filename in the path:
^([^/]*/)*(?!(abc123|ghk432|fdw917)\.flv$)[^/]*\.flv$
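The two suggestions behave differently on near-miss names, which a short Python sketch can show. It makes two assumptions: the backup tool applies a pattern to the full forward-slash path, and Python's re module is close enough to the tool's regex flavour for these constructs. Because recent Python versions reject an inline (?i) that is not at the very start of the pattern, the sketch passes re.IGNORECASE instead. The helper name is_excluded is made up.

```python
import re

# First answer: the lookahead scans the whole path for a protected name.
ANYWHERE = re.compile(
    r"^(?!.*(abc123|ghk432|fdw917)\.flv).*\.flv", re.IGNORECASE
)

# Second answer, unchanged: the lookahead sits exactly where the file name starts.
FILENAME_ONLY = re.compile(
    r"^([^/]*/)*(?!(abc123|ghk432|fdw917)\.flv$)[^/]*\.flv$"
)


def is_excluded(pattern: re.Pattern, path: str) -> bool:
    """Return True if the exclusion pattern matches the path."""
    return pattern.search(path) is not None
```

Both patterns exclude C:/video/holiday.flv and keep C:/video/abc123.flv. They differ on C:/video/xabc123.flv: the first pattern keeps it because abc123.flv appears as a substring, while the second excludes it because the file name is not exactly one of the three protected names.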

Removing everything between a tag (including the tag itself) using Regex / Eclipse

I'm fairly new to figuring out how Regex works, but this one is just frustrating.
I have a massive XML document with a lot of <description>blahblahblah</description> tags. I want to basically remove any and all instances of <description></description>.
I'm using Eclipse and have tried a few examples of Regex I've found online, but nothing works.
<description>(.*?)</description>
Shouldn't that work?
EDIT:
Here is the actual code.
<description><![CDATA[<center><table><tr><th colspan='2' align='center'><em>Attributes</em></th></tr><tr bgcolor="#E3E3F3"><th>ID</th><td>308</td></tr></table></center>]]></description>

I'm not familiar with Eclipse, but I would expect its regex search facility to use Java's built-in regex flavor. You probably just need to check a box labeled "DOTALL" or "single-line" or something similar, or you can add the corresponding inline modifier to the regex:
(?s)<description>(.*?)</description>
That will allow the . to match newlines, which it doesn't by default.
EDIT: This is assuming there are newlines within the <description> element, which is the only reason I can think of why your regex wouldn't work. I'm also assuming you really are doing a regex search; is that automatic in Eclipse, or do you have to choose between regex and literal searching?
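The effect of the (?s) modifier can be reproduced outside Eclipse. The sketch below uses Python's re module, which treats (?s) the same way as the Java flavour for this pattern; the sample document and the function name strip_descriptions are made up for illustration.

```python
import re

# Same regex as above; (?s) lets "." match newlines as well.
DESCRIPTION = re.compile(r"(?s)<description>(.*?)</description>")

SAMPLE = (
    "<item>\n"
    "<description><![CDATA[<center>\n"
    "<table></table>\n"
    "</center>]]></description>\n"
    "<name>x</name>\n"
    "</item>"
)


def strip_descriptions(xml_text: str) -> str:
    """Remove every <description>...</description> element, tags included."""
    return DESCRIPTION.sub("", xml_text)


with_dotall = strip_descriptions(SAMPLE)
# Without (?s) the match cannot cross the line breaks, so nothing is removed.
without_dotall = re.sub(r"<description>(.*?)</description>", "", SAMPLE)
```

Here with_dotall no longer contains the description element, while without_dotall is identical to the input.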

hgignore: help ignoring all files but certain ones

I need an .hgdontignore file :-) to include certain files and exclude everything else in a directory. Basically I want to include only the .jar files in a particular directory and nothing else. How can I do this? I'm not that skilled in regular expression syntax. Or can I do it with glob syntax? (I prefer that for readability)
Just as an example location, let's say I want to exclude all files under foo/bar/ except for foo/bar/*.jar.

The answer from Michael is a fine one, but another option is to just exclude:
foo/bar/**
and then manually add the .jar files. You can always add files that are excluded by an ignore rule and it overrides the ignore. You just have to remember to add any jars you create in the future.

To do this, you'll need to use this regular expression:
foo/bar/.+?\.(?!jar).+
Explanation
You are telling it what to ignore, so this expression is searching for things you don't want.
You look for any file whose name (including relative directory) includes (foo/bar/)
You then look for any characters that precede a period ( .+?\. == match one or more characters of any type until you reach the period character)
You then make sure it doesn't have the "jar" ending (?!jar) (This is called a negative lookahead)
Finally you grab the ending it does have (.+)
Regular expressions are easy to mess up, so I strongly suggest that you get a tool like Regex Buddy to help you build them. It will break down a regex into plain English which really helps.
EDIT
Hey Jason S, you caught me, it does miss those files.
This corrected regex will work for every example you listed:
foo/bar/(?!.*\.jar$).+
It finds:
foo/bar/baz.txt
foo/bar/baz
foo/bar/jar
foo/bar/baz.jar.txt
foo/bar/baz.jar.
foo/bar/baz.
foo/bar/baz.txt.
But does not find
foo/bar/baz.jar
New Explanation
This says look for files in "foo/bar/" , then do not match if there are zero or more characters followed by ".jar" and then no more characters ($ means end of the line), then, if that isn't the case, match any following characters.
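The list above can be verified with a few lines of Python, the same regex flavour Mercurial falls back to for lookaheads. Using re.search is an assumption about how the ignore patterns are applied, and the name is_ignored is made up for this sketch.

```python
import re

# Corrected pattern from the answer above.
CORRECTED = re.compile(r"foo/bar/(?!.*\.jar$).+")

PATHS = [
    "foo/bar/baz.txt",
    "foo/bar/baz",
    "foo/bar/jar",
    "foo/bar/baz.jar.txt",
    "foo/bar/baz.jar.",
    "foo/bar/baz.",
    "foo/bar/baz.txt.",
    "foo/bar/baz.jar",
]


def is_ignored(path: str) -> bool:
    """Return True if the ignore pattern matches the path."""
    return CORRECTED.search(path) is not None


ignored = [path for path in PATHS if is_ignored(path)]
```

Every path except foo/bar/baz.jar ends up in ignored, which matches the lists given in the answer.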

Anyone who wants to use negative lookaheads (?! in regex syntax) or any kind of back-referencing mechanism should be aware that Mercurial will fall back from Google's RE2 to Python's re module for matching.
RE2 is a non-backtracking engine that guarantees a run time linear in the size of the input. If performance is important to you, that is, if you have a big repository, you should consider sticking to simpler patterns that RE2 supports, which is why I think the solution offered by Ryan is preferable.
