How to handle a tilde / swung dash (~) in a regular expression in order to exclude temporary MS Office files? - regex

I have a batch job in xml that gets scheduled by a job scheduling engine. This engine provides the possibility of observing directories for changes of their content. My task is to monitor directories on a file exchange server running Windows, where customers and clients upload files we need to process.
We need to know about the arrival of new files as soon as possible.
I have to put a regular expression into that xml-job in order to not match subdirectories and temporary files.
In most cases, customers and clients upload files formatted as text/csv/pdf, which don't cause any problems. Some upload MS Office files, which, on the other hand, become a problem if someone opens them in the directory. Then an invisible temporary file is created beginning with ~$.
According to the documentation of the scheduling engine, the regex follows the POSIX 1003.2 standard. However, I am not able to prevent notifications being sent when someone opens an MS Office file in a monitored directory.
The regular expressions I have tried so far are:
First try before even noticing temporary office files:
^[a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
Second try, intention was excluding a leading ~:
^[^~][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
Third try, intention was excluding a leading ~ by its character code:
^[^\x7e][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
Fourth try, intention was excluding a leading ~ by its character code with a capital E:
^[^\x7E][a-zA-Z0-9_\-]+\.+[a-zA-Z0-9_\-][^~][^.part]*$
None of these stops notifications from being sent when a file is opened…
Does anyone have any idea what to do?
All suggestions and alternatives are welcome.
I even checked them at regex101, regexplanet.com, regexr.com and regextester.com where the second try was matching exactly as desired. I did not even forget to configure POSIX compilation if it was possible on those sites (not all).
How can I exclude the ~ character from matching the regular expression (at the beginning of a file name)?
Short version:
How can I create a regular expression that matches any file with any extension apart from .part and matches neither the file thumbs.db nor any file whose name begins with a ~?
Requirements:
What should not be matched:
Subfolders (my approach was files without a .),
Thumbs.db (Windows thumbnails db),
*.part (filezilla partial uploads),
~$. (temporary files starting with ~ or ~$, MS Office tmp files)
The following list provides some files and folders that must be matched or not matched by the regex:
Ablage (subfolder, should not be matched)
Abrechnungen (subfolder, should not be matched)
eine_testdatei.csv
TEST-WORKBOOK.xlsx
TEST-WORKBOOK_äöüß.xlsx
Test-2018-08-08.txt
~$TEST-WORKBOOK.xlsx (temporary file, should not be matched)
TEST-WORKBOOK.xlsx.part (partial upload, should not be matched)
TEST-WORKBOOK.part (partial upload, should not be matched)
New Problems occurred while trying to find the regex
A few problems came up after the creation of this question when I tried to apply the actually correct regex stated in the answer given by @Bohemian. I wasn't aware of those problems, so I am adding them here for completeness.
The first one occurred when certain characters in the regex were not allowed in xml. The xml file is parsed by a java class that throws an exception trying to parse < and >, they are forbidden in xml documents if not related to xml nodes directly (valid: <xml-node>...</xml-node>, invalid: attribute="<ome_on, why isn't this VALI|>").
This can be avoided by using the html names &lt; instead of < and &gt; instead of >.
The second (and currently unresolved) issue is that the engine rejects an operand in the actually correct regular expression ^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$. The engine says:
Error: 2018-08-17T06:05:46Z REGEX-13
[repetition-operator operand invalid, ^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$]
The corresponding line in the xml file looks like this:
<start_when_directory_changed directory="F:\someDirectory" regex="^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$" />
Now I am stuck again, because my knowledge of regular expressions is pretty low. It is so low, that I don't even have any idea what character could be that criticized operand in the regex.
Research has brought me to this question whose accepted answer states "POSIX regexes don't support using the question mark ? as a non-greedy (lazy) modifier to the star and plus quantifiers (…)", which gives me an idea about what is wrong with the great regex. Still, I am not able to provide a working regex, more research will have to follow…

POSIX ERE doesn't allow for a simple way to exclude a particular string from matching. You can disallow a particular character -- like in [^.part] you are matching a single character which is not (newline or) dot or p or a or r or t -- and you can specify alternations, but those are very cumbersome to combine into an expression which excludes some particular patterns.
Here's how to do it, but as you can see, it's not very readable.
^([^~t.]|t($|[^h])|th($|[^u])|thu($|[^m])|thum($|[^b])|thumb($|[^s])|thumbs($|[^.])|thumbs\.($|[^d])|thumbs\.d($|[^b])|\.($|[^p])|\.p($|[^a])|\.pa($|[^r])|\.par($|[^t]))+$
... and it still probably doesn't do exactly what you want.
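To see what this expression actually accepts, here is a minimal sketch that runs it against the file names from the question. It uses Python's re module only because the pattern happens to be valid there as well; Python is not a POSIX ERE engine, so the scheduler may behave differently. SAMPLE_NAMES and ere_matches are names made up for this illustration.

```python
import re

# The alternation-based pattern from the answer above, copied unchanged.
ERE_PATTERN = re.compile(
    r"^([^~t.]|t($|[^h])|th($|[^u])|thu($|[^m])|thum($|[^b])|thumb($|[^s])"
    r"|thumbs($|[^.])|thumbs\.($|[^d])|thumbs\.d($|[^b])"
    r"|\.($|[^p])|\.p($|[^a])|\.pa($|[^r])|\.par($|[^t]))+$"
)

# File and folder names taken from the question, plus both spellings of thumbs.db.
SAMPLE_NAMES = [
    "Ablage",
    "Abrechnungen",
    "eine_testdatei.csv",
    "TEST-WORKBOOK.xlsx",
    "TEST-WORKBOOK_äöüß.xlsx",
    "Test-2018-08-08.txt",
    "~$TEST-WORKBOOK.xlsx",
    "TEST-WORKBOOK.xlsx.part",
    "TEST-WORKBOOK.part",
    "thumbs.db",
    "Thumbs.db",
]


def ere_matches(name):
    """Return True if the pattern accepts the given name."""
    return ERE_PATTERN.search(name) is not None


for name in SAMPLE_NAMES:
    print(f"{name!r}: {'match' if ere_matches(name) else 'no match'}")
```

In this run the temporary file, both .part files and lowercase thumbs.db are rejected, but the two subfolders still match because the expression never requires a dot, and Thumbs.db with a capital T gets through because the comparison is case-sensitive.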

Try this:
^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$
See live demo.
There is nothing special about the tilde character in regex.
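A quick way to check this pattern against the names from the question is a small script like the one below. It is only a sketch: it uses Python's re module, which supports the lookahead (?=...), (?!...) and lookbehind (?<!...) syntax. That syntax is a Perl-style extension and not part of POSIX ERE, so an engine that is strictly POSIX will reject it. LOOKAROUND_PATTERN, QUESTION_NAMES and matching_names are names invented for the example.

```python
import re

# The lookaround-based pattern from the answer above, copied unchanged.
LOOKAROUND_PATTERN = re.compile(r"^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$")

# File and folder names taken from the question.
QUESTION_NAMES = [
    "Ablage",
    "Abrechnungen",
    "eine_testdatei.csv",
    "TEST-WORKBOOK.xlsx",
    "TEST-WORKBOOK_äöüß.xlsx",
    "Test-2018-08-08.txt",
    "~$TEST-WORKBOOK.xlsx",
    "TEST-WORKBOOK.xlsx.part",
    "TEST-WORKBOOK.part",
    "thumbs.db",
]


def matching_names(pattern, names):
    """Return the names accepted by the compiled pattern, in input order."""
    return [name for name in names if pattern.search(name)]


matched = matching_names(LOOKAROUND_PATTERN, QUESTION_NAMES)
for name in matched:
    print(name)

# The pattern is case-sensitive, so "Thumbs.db" with a capital T still matches.
# Compiling with re.IGNORECASE (an assumption, not part of the answer) avoids that.
CASE_INSENSITIVE = re.compile(LOOKAROUND_PATTERN.pattern, re.IGNORECASE)
print(bool(LOOKAROUND_PATTERN.search("Thumbs.db")))
print(bool(CASE_INSENSITIVE.search("Thumbs.db")))
```

Only the four regular files are printed; the subfolders, the ~$ temporary file, the .part uploads and lowercase thumbs.db are all rejected.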

I am very late on this but above comments were helpful for me. It may not work for you but my solution is:
file_list <- file_list[!grepl("~", file_list)]
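The same idea, filtering the list in code instead of inside the regex, could look roughly like this in Python. This is a sketch under the assumption that you get the file names as a list and can filter them yourself, which the scheduler in the question may not allow. should_report and incoming are hypothetical names, and the rules simply restate the requirements from the question.

```python
def should_report(name):
    """Apply the question's rules with plain string checks instead of a regex."""
    lowered = name.lower()
    return (
        "." in name                          # names without a dot are treated as subfolders
        and not name.startswith("~")         # temporary files such as ~$TEST-WORKBOOK.xlsx
        and lowered != "thumbs.db"           # Windows thumbnail database, any casing
        and not lowered.endswith(".part")    # partial uploads
    )


incoming = [
    "Ablage",
    "eine_testdatei.csv",
    "~$TEST-WORKBOOK.xlsx",
    "TEST-WORKBOOK.xlsx.part",
    "Thumbs.db",
    "Test-2018-08-08.txt",
]
reported = [name for name in incoming if should_report(name)]
print(reported)
```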

Related

Writing valid RegEx for use in file/folder exclusion

I'm trying to write two expressions to use in the files/folder Exclusion List for Code42 CrashPlan backup. Their support won't help with RegEx expressions, they just point me to their KB article.
In their "File Exclusions" section, I'd like to:
exclude this folder specifically: S:\Google Drive\Temp
any file or folder containing the string Backup_Excluded anywhere in its name.
This is what I've got so far - but I have no way of knowing if they're correct:
(?i).*Google Drive\\Temp ...but since I really want to exclude a specific folder, not a pattern - do I need to escape the slashes and colon in the path of S:\Google Drive\Temp
(?i).*Backup_Excluded
Research disclaimer: I know there are RegEx resources out there, but am unsure which flavor/syntax to use, as I'd imagine there are many. I was hoping those with more RegEx familiarity could advise.
The link you posted says:
The Code42 app treats all file separators as forward slashes /.
So it seems you'd want to use / instead of \\ in your regular expressions.
Colon doesn't need escaping.
\ needs escaping because it's the escaping character itself.
/ normally needs escaping in languages where it is the default delimiter around a regular expression. However, the examples in your link don't escape it, which suggests that only the pattern itself is expected, so no escaping is needed.
Then you could probably use:
S:/Google Drive/Temp
or [A-Z]:/Google Drive/Temp (to allow any drive)
.*Backup_Excluded.*
I probably wouldn't use (?i), as the capitals in those strings are usually there, but that's your call.
Check out e.g. https://regex101.com/ to test your regular expressions (also in different flavours).
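If you want to try these patterns locally first, a rough sketch in Python follows. It assumes two things that are not confirmed by the Code42 documentation quoted here: that backslashes are turned into forward slashes before matching, and that a pattern counts as a hit when it is found anywhere in the path. to_forward_slashes and is_excluded are made-up helper names.

```python
import re

# The two patterns suggested above.
FOLDER_PATTERN = re.compile(r"[A-Z]:/Google Drive/Temp")
MARKER_PATTERN = re.compile(r".*Backup_Excluded.*")


def to_forward_slashes(path):
    """Mimic the quoted rule that all file separators are treated as /."""
    return path.replace("\\", "/")


def is_excluded(path):
    """Return True if either pattern is found in the normalized path."""
    normalized = to_forward_slashes(path)
    return bool(FOLDER_PATTERN.search(normalized) or MARKER_PATTERN.search(normalized))


for candidate in [
    r"S:\Google Drive\Temp\draft.docx",
    r"D:\Work\Backup_Excluded_2019\a.txt",
    r"D:\Work\notes.txt",
    r"S:\Google Drive\Temporary\x.txt",
]:
    print(candidate, is_excluded(candidate))
```

Note that with an unanchored search the folder pattern also hits S:/Google Drive/Temporary, so you may want to append a / or an end anchor if that matters to you.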

Capture everything after one word [duplicate]

I am trying to make a regular expression capture any words in the specific line after the word Attachment:
This question is for work, so it is not a homework or test question. I took the paragraph below as an example from www.regular-expressions.info. I did not major in computers but Psychology so this is completely foreign to me. I've read the manuals for the last two days, and because this is going over my head, I don't know how to begin.
I have a task which involves me linking the attachments to a specific file with the same name saved in a folder (at least 500 attachments) on Adobe PDF. What I did before was to manually select the words and link it to a specific file in a folder, but it is tedious to do when they can go up to 500 attachments.
I was aware of an application plug-in called EVERMAP that you can download for Adobe to automatically link specific words to a specific file in a folder. However, it requires me to use regular expressions which again, I don't know how to use.
I will bold the words I want to capture in the paragraph below.
The repetition operator manual expand the match as far as they, and only come back if they must to satisfy the remainder.
Attachment: The repetition operator manual
The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more.
Attachment: Asterisk and stars engine
Attachment: (.+) should work in your case unless there are other exceptions to this rule. The regex simply tells the parser to capture 1 or more character after the word Attachment:. See here for the sample
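To show what the capture group picks up, here is a small sketch in Python using the sample paragraph from the question. Whether the EVERMAP plug-in uses the same regex flavor is an assumption; sample_text is just the text from the question pasted into a string.

```python
import re

# The sample paragraph from the question, one line per entry.
sample_text = "\n".join([
    "The repetition operator manual expand the match as far as they, "
    "and only come back if they must to satisfy the remainder.",
    "Attachment: The repetition operator manual",
    "The asterisk or star tells the engine to attempt to match the preceding "
    "token zero or more times. The plus tells the engine to attempt to match "
    "the preceding token once or more.",
    "Attachment: Asterisk and stars engine",
])

# "." does not match a newline by default, so each capture stops at the end of its line.
attachments = re.findall(r"Attachment: (.+)", sample_text)
print(attachments)
```

This prints the two attachment names, The repetition operator manual and Asterisk and stars engine, without the Attachment: prefix.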
Like @Kevin said, the Regex is simple. Use Attachment: (.+).
Maybe you are confused on how to use Regex. I don't know about the Evermap plugin, but you can copy all the text from the PDF to Sublime Text (text editor to open .txt but with a lot of features) and do Regex part there. And then, since you are not a programmer, you should remove other irrelevant data. So the Regex will be:
`^\s*Attachment:\s*(.+)$|^(?!Attachment:).+$`
And replace it with:
`\1`
\1 is a variable containing group value caught in ()
In Sublime Text find Find and Replace, then apply the Regex there. Don't forget to turn on the Regex mode.

Regex expression to match a string but exclude something at the same time

I want to try and ask this as concisely as possible please forgive me if I'm leaving something out. I want the expression to match all cases except where an exact filename string is present.
A backup software I'm using uses regular expressions and I want to setup an exclusion to skip all of a particular file extension type, except I have certain files I need to backup so I don't want them to match.
The files I want to exclude are we'll say for this example *.FLV
(?i).*\.flv
I want to include in my backups three files: abc123.flv, ghk432.flv, and fdw917.flv
This is where I'm having trouble, even just including one file from the three to be included to backup
(?i).*\.flv^(?!(abc123\.flv))&
The expression is being added to an Exclusion List for code42 CrashPlan backup, their support unfortunately cannot assist with complex RegEx expressions.
The closest thing I can supply as an example is their Example 3: Using An Exclude To Include:
.*/Documents/((?!(.*\.(doc|rtf)|.*/)$).)*$
http://support.code42.com/Administrator/3.6_And_4.0/Configuring/Using_Include_And_Exclude_Filters
However, it excludes all files within directories named "Documents" and includes any files in those folders with doc or rtf file extensions. I'm trying to create an expression that works with file extensions regardless of folder location.
In my brain logically it seems like I need to write this as some kind of if then else statement but regex is not my forte.
Use an anchored negative look ahead with an alternation for the files you want to keep:
^(?i)(?!.*(abc123|ghk432|fdw917)\.flv).*\.flv
The negative lookahead asserts that the following input does not match its regex, and the pipe character means "or".
Try to put the negative lookahead at the position of the filename in the path:
^([^/]*/)*(?!(abc123|ghk432|fdw917)\.flv$)[^/]*\.flv$
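The difference between the two suggestions shows up with file names that merely end in one of the protected names. Below is a rough sketch in Python, not CrashPlan itself: the inline (?i) is replaced by the re.IGNORECASE flag because recent Python versions reject (?i) after ^, the flag is also applied to the second pattern (an assumption, since it has no (?i) as written), and the paths are invented.

```python
import re

# First suggestion: the lookahead scans the whole path.
LOOSE = re.compile(r"^(?!.*(abc123|ghk432|fdw917)\.flv).*\.flv", re.IGNORECASE)

# Second suggestion: the lookahead is pinned to the start of the file name.
STRICT = re.compile(
    r"^([^/]*/)*(?!(abc123|ghk432|fdw917)\.flv$)[^/]*\.flv$", re.IGNORECASE
)


def is_excluded(pattern, path):
    """A path that matches the pattern would be excluded from the backup."""
    return pattern.search(path) is not None


for path in [
    "C:/videos/movie.flv",
    "C:/videos/abc123.flv",
    "C:/videos/xabc123.flv",
    "C:/docs/readme.txt",
]:
    print(path, is_excluded(LOOSE, path), is_excluded(STRICT, path))
```

Both patterns exclude movie.flv and keep abc123.flv. Only the second one excludes xabc123.flv; the first keeps it by accident because abc123.flv appears somewhere in the path.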

Replacing all instances of a name in all strings in a solution

We have a large solution with many projects in it, and throughout the project in forms, messages, etc we have a reference to a company name. For years this company name has been the same, so it wasn't planned for it to change, but now it has.
The application is specific to one state in the US, so localizations/string resource files were never considered or used.
A quick Find All instances of the word pulled up 1309 lines, but we only need to change lines that actually end up being displayed to the user (button text, message text, etc).
Code can be refactored later to make it more readable when we have time to ensure nothing breaks, but for time being we're attempting to find all visible instances and replace them.
Is there any way to easily find these "instances"? Perhaps a type of Regex that can be used in the Find All functionality in Visual Studio to only pull out the word when it's wrapped inside quotes?
Before I go down the rabbit hole of trying to make my job easier and spending far more time than it would have taken to just go line by line, figured I would see if anyone has done something like this before and has a solution.
You can give this a try. (I hope your code is under source control!)
Foobar{[^"]*"([^"]*"[^"]*")*[^"]*}$
And replace with
NewFoobar\1
Explanation
Foobar the name you are searching for
[^"]*" a workaround for the missing non greedy modifier. [^"] means match anything but " that means this matches anything till the first ".
([^"]*"[^"]*")* To ensure that you are matching only inside quotes. This ensures that there are only complete sets of quotes following.
[^"]* ensures that there is no quote anymore till the end of the line $
{} the curly braces put all the stuff following your company's name into a capturing group, which you can refer to using \1
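If you would rather test the idea outside Visual Studio first, the same expression can be sketched in Python, where the capturing group is written with ( ) instead of { }. Foobar and NewFoobar are the placeholder names from above, rename_in_strings is a made-up helper, and the sample lines are invented. The function works on one line at a time because [^"]* would otherwise run across line breaks.

```python
import re

# Foobar followed by an odd number of double quotes up to the end of the line,
# which is taken as a sign that Foobar sits inside a string literal.
IN_STRING = re.compile(r'Foobar([^"]*"([^"]*"[^"]*")*[^"]*)$')


def rename_in_strings(line, new_name="NewFoobar"):
    """Replace Foobar on a single line only when it appears inside quotes."""
    return IN_STRING.sub(lambda m: new_name + m.group(1), line)


for line in [
    'MessageBox.Show("Welcome to Foobar!");',
    'var client = Foobar.Create("demo");',
    'label.Text = "Foobar";',
]:
    print(rename_in_strings(line))
```

Because the match always runs to the end of the line, at most one occurrence per line is replaced in a single pass, so a line with the name twice inside a string needs a second pass.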
The VS regex capability is quite stripped down. It perhaps represents 20% of what can be done with full-powered regular expressions. It won't be sufficient for your needs. For example, one way to solve this quote-delimited problem is to use non-greedy matching, which VS regex does not support.
If I were in your shoes, I would write a perl script or a C# assembly that runs outside of Visual Studio, and simply races through all files (having a particular file extension) and fixes everything. Then reload into Visual Studio, and you are done. Well, if all went well with the regex anyway.
Ultimately what you really must watch out for is code like this:
Log.WriteLine("Hello " + m_CompanyName + " There");
In this case, regex will think that "m_CompanyName" appears between two quotes - but it is not what you meant. In this case you need even more sophistication, and I think you'll find the answer with a special .net regular expression extension.

hgignore: help ignoring all files but certain ones

I need an .hgdontignore file :-) to include certain files and exclude everything else in a directory. Basically I want to include only the .jar files in a particular directory and nothing else. How can I do this? I'm not that skilled in regular expression syntax. Or can I do it with glob syntax? (I prefer that for readability)
Just as an example location, let's say I want to exclude all files under foo/bar/ except for foo/bar/*.jar.
The answer from Michael is a fine one, but another option is to just exclude:
foo/bar/**
and then manually add the .jar files. You can always add files that are excluded by an ignore rule and it overrides the ignore. You just have to remember to add any jars you create in the future.
To do this, you'll need to use this regular expression:
foo/bar/.+?\.(?!jar).+
Explanation
You are telling it what to ignore, so this expression is searching for things you don't want.
You look for any file whose name (including relative directory) includes (foo/bar/)
You then look for any characters that precede a period ( .+?\. == match one or more characters of any type until you reach the period character)
You then make sure it doesn't have the "jar" ending (?!jar) (This is called a negative lookahead)
Finally you grab the ending it does have (.+)
Regular expressions are easy to mess up, so I strongly suggest that you get a tool like Regex Buddy to help you build them. It will break down a regex into plain English which really helps.
EDIT
Hey Jason S, you caught me, it does miss those files.
This corrected regex will work for every example you listed:
foo/bar/(?!.*\.jar$).+
It finds:
foo/bar/baz.txt
foo/bar/baz
foo/bar/jar
foo/bar/baz.jar.txt
foo/bar/baz.jar.
foo/bar/baz.
foo/bar/baz.txt.
But does not find
foo/bar/baz.jar
New Explanation
This says look for files in "foo/bar/" , then do not match if there are zero or more characters followed by ".jar" and then no more characters ($ means end of the line), then, if that isn't the case, match any following characters.
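The corrected expression can be checked against the paths listed above with a few lines of Python. This is a sketch only: it assumes the pattern is searched for anywhere in the path, which is how I understand Mercurial's unrooted regex patterns to behave, and IGNORE_PATTERN and candidate_paths are names chosen for the example.

```python
import re

# The corrected ignore pattern from the edit above.
IGNORE_PATTERN = re.compile(r"foo/bar/(?!.*\.jar$).+")

candidate_paths = [
    "foo/bar/baz.txt",
    "foo/bar/baz",
    "foo/bar/jar",
    "foo/bar/baz.jar.txt",
    "foo/bar/baz.jar.",
    "foo/bar/baz.",
    "foo/bar/baz.txt.",
    "foo/bar/baz.jar",
]

# Paths that the pattern finds are the ones that would be ignored.
ignored = [path for path in candidate_paths if IGNORE_PATTERN.search(path)]
tracked = [path for path in candidate_paths if path not in ignored]
print("ignored:", ignored)
print("tracked:", tracked)
```

Only foo/bar/baz.jar ends up in the tracked list; every other path is ignored.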
Anyone who wants to use negative lookaheads (?! in regex syntax) or any kind of back-referencing mechanism should be aware that Mercurial will fall back from Google's RE2 to Python's re module for matching.
RE2 is a non-backtracking engine that guarantees a run time linear in the size of the input. If performance is important to you, that is, if you have a big repository, you should consider sticking to simpler patterns that RE2 supports, which is why I think that the solution offered by Ryan.