Regex Assistance for replacing filepaths in markdown documents

I migrated my notes from Evernote to Markdown files with yarle. Unfortunately, it created a lot of separate folders for the attachments (although I had set it up to use only one folder).
I moved all attachments into one folder, so the file paths to the attachments in the Markdown files need to be updated.
I think regex would be the right tool for this, but I don't have any knowledge of regex and would be really thankful for help.
File paths look like this: ![[./_attachmentsMove/Coordination_Patterns.resources/CoordinationPattern_Ipsi.MOV]]
All file paths are identical up to this point: ![[./_attachmentsMove/
The second folder varies, e.g. Coordination_Patterns.resources/.
I want to delete everything but the filename and extension, e.g. ![[CoordinationPattern_Ipsi.MOV]].
An example of the other filepaths:
![[./_attachmentsMove/Jonglieren_(Hände).resources/07 Jonglieren.MOV]]
(second folder changes, filename changes, I also have .png and .mov).
I use MassReplaceIt (app for mac) which allows me to replace expressions in documents with regex. If someone has a solution using the terminal/commandline, I'll try this as well of course :)

See whether this regex suffices:
(?<=!\[\[)[^\]]+/(?=[^\]/]+]])
Replace with empty string.
It should delete the part from the ![[ up to the last / before the next ]].
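For a command-line route, the same substitution can be sketched in Python. This is a minimal sketch under assumptions, not a tested migration tool: the names `strip_attachment_paths` and `rewrite_notes` are hypothetical, the closing brackets are escaped for clarity, and the notes are assumed to be UTF-8 `.md` files. Back up the notes before running anything like this over real files.

```python
import re
from pathlib import Path

# Same idea as the regex above: after "![[", drop everything up to the
# last "/" that precedes the closing "]]".
ATTACHMENT_PATH = re.compile(r"(?<=!\[\[)[^\]]+/(?=[^\]/]+\]\])")


def strip_attachment_paths(text: str) -> str:
    """Return text with embed paths reduced to the bare file name."""
    return ATTACHMENT_PATH.sub("", text)


def rewrite_notes(folder: str) -> None:
    """Rewrite every .md file under folder in place (hypothetical helper)."""
    for note in Path(folder).rglob("*.md"):
        original = note.read_text(encoding="utf-8")
        updated = strip_attachment_paths(original)
        if updated != original:
            note.write_text(updated, encoding="utf-8")


if __name__ == "__main__":
    sample = "![[./_attachmentsMove/Jonglieren_(Hände).resources/07 Jonglieren.MOV]]"
    print(strip_attachment_paths(sample))
```

Embeds that already contain only a file name have no "/" to match, so they are left untouched.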

Related

Regex for Current NTUSER.DAT files

I am trying to come up with a regex (PCRE) that finds current Windows NTUSER.DAT files when cycling through a file list (valid NTUSER.DAT files are the ones that are in the correct path for use by Windows).
I am trying to exclude any NTUSER.DAT files that have been copied by a user and placed in a different location (e.g. on the Desktop). In the following sample data, the first 4 results are valid, the next 3 are invalid:
\Users\John Thomas Hamilton\ntuser.dat
\Users\Default\NTUSER.DAT
\Users\Mary Thomas\NTUSER.DAT
\Users\UpdatusUser\NTUSER.DAT
\Users\John Thomas Hamilton\Desktop\My Stuff\Windows\Users\Default\NTUSER.DAT
\Users\John Thomas Hamilton\Desktop\My Stuff\Windows\Users\Student\NTUSER.DAT
\Users\John Thomas Hamilton\Desktop\My Stuff\My stuff to sort\Tech Support Fix it\NTUSER.DAT
Currently the best/simplest regex I have is:
\\USERS\\[A-Z0-9]+\\NTUSER.DAT$
but of course there are plenty of valid Windows file name characters other than letters and numbers that could exist in the user name.
I think I need to search up to the first occurrence of the next folder separator "\" and then reject the path if NTUSER.DAT does not come right after it. I have not had any luck doing this, so any help would be appreciated.
Well, assuming you have a valid file list, this would work:
^\\Users\\[^\\]+?\\NTUSER\.DAT$
Make sure you ignore case.
The secret is using [^\\]+? instead of .+? so that the match goes exactly one folder level deep. The dot in NTUSER\.DAT is escaped so that it matches only a literal period.
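To check the pattern against the sample data from the question, here is a small Python sketch. It uses Python's `re` module rather than PCRE (the pattern behaves the same in both for this case), writes the dot in the file name escaped, and the name `is_current_ntuser` is made up for the illustration.

```python
import re

# One folder level between \Users\ and NTUSER.DAT, matched case-insensitively.
CURRENT_NTUSER = re.compile(r"^\\Users\\[^\\]+\\NTUSER\.DAT$", re.IGNORECASE)

# Sample paths from the question: the first four are valid, the last three are copies.
PATHS = [
    r"\Users\John Thomas Hamilton\ntuser.dat",
    r"\Users\Default\NTUSER.DAT",
    r"\Users\Mary Thomas\NTUSER.DAT",
    r"\Users\UpdatusUser\NTUSER.DAT",
    r"\Users\John Thomas Hamilton\Desktop\My Stuff\Windows\Users\Default\NTUSER.DAT",
    r"\Users\John Thomas Hamilton\Desktop\My Stuff\Windows\Users\Student\NTUSER.DAT",
    r"\Users\John Thomas Hamilton\Desktop\My Stuff\My stuff to sort\Tech Support Fix it\NTUSER.DAT",
]


def is_current_ntuser(path: str) -> bool:
    """True when the path is \\Users\\<one folder>\\NTUSER.DAT."""
    return CURRENT_NTUSER.match(path) is not None


if __name__ == "__main__":
    for path in PATHS:
        print(is_current_ntuser(path), path)
```

Because [^\\]+ cannot cross a backslash, the copies buried under Desktop fail at the second folder level instead of matching further down the path.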

AppleScript to extract the Digital Object Identifier (DOI) from a PDF file

I looked for an AppleScript to extract the DOI from a PDF file, but could not find one. There is enough information available on the actual format of the DOI (i.e. the regular expression), but how could I use this to get the identifier from the PDF file?
(It would be no problem if some external program were used, such as Hazel.)
If you're ok with using an app, I'd recommend Skim. Good AppleScript support. I'd probably structure it like this (especially if the document might be large):
set DOIFound to false
tell application "Skim"
    set pp to pages of document 1
    repeat with p in pp
        set t to text of p
        -- look for DOI and set DOIFound to true
        if DOIFound then exit repeat -- if it's not found then use url?
    end repeat
end tell
I'm assuming a DOI would always sit on one page (not split across two). It looks like they are invariably (?) on the first page of an article, which would of course make this quick, even with a large document.
[edit]
Another way would be to get the Xpdf OSX binaries from http://www.foolabs.com/xpdf/download.html and use pdftotext in the command line (just tested this; it works well) and parse the text using AppleScript. If you want to stay in AppleScript, you can do something like:
do shell script "path/to/pdftotext 'path/to/pdf/file.pdf'"
which would output a file in the same directory with a txt file extension -- you parse that for DOI.
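The parsing step could look roughly like the following Python sketch, shown here in place of AppleScript. Several things are assumptions: the DOI pattern is a commonly used approximation rather than an official grammar, `pdftotext` is assumed to be on the PATH, only the first page is read, and the names `find_doi` and `doi_from_pdf` are hypothetical.

```python
import re
import subprocess
from typing import Optional

# Assumed DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing sentence punctuation is usually not part of the identifier.
    return match.group(0).rstrip(".,;")


def doi_from_pdf(pdf_path: str) -> Optional[str]:
    """Run pdftotext on the first page and search its output for a DOI."""
    # "-l 1" stops after page 1; "-" sends the text to stdout instead of a file.
    result = subprocess.run(
        ["pdftotext", "-l", "1", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_doi(result.stdout)
```

Calling `doi_from_pdf("path/to/pdf/file.pdf")` would return the identifier or None; a DOI that the PDF breaks across two lines would be cut short by this pattern.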
Have you tried pdfgrep? It works really well on the command line:
pdfgrep -n --max-count 1 --include "*.pdf" "DOI"
I have no idea how to build an AppleScript, though, but I would be interested in one as well, so that if I drop a PDF into that folder it automatically extracts the DOI and renames the file with the DOI in the filename.

Very basic image renaming with regex

I spent most of yesterday putting together a collection of regular expressions to convert all my image names and paths to lower case. Today, I processed a folder full of files and was surprised to discover that many image names are still capitalized.
So I decided to try it one step at a time, first renaming .jpg's, then .gif's, .png's, etc.
I'm working on a Mac, using Dreamweaver and TextWrangler as my text editors. The following regex works perfectly for jpg's, with one major flaw - it deletes the extension...
([\w/-]+)\.jpe?g
\L\1
In other words, it changes South-America.jpg to south-america.
How can I change it so that it retains the file extension? I assume I can then just change it to...
([\w/-]+)\.png
\L\1
...to process png's, etc.
([\w\/-]+)(\.jpe?g)
and replace with \L\1\2
It's deleting your extension because you never save it in a capture group.
You could perhaps capture the extension too?
([\w/-]+)(\.jpe?g)
\L\1\2
And I think you should be able to use something like this for all the files:
([\w/-]+)(\.[^.]+$)
\L\1\2
Or if you specifically want to convert those jpegs, pngs and gifs:
([\w/-]+)(\.(?:jpe?g|gif|png))
\L\1\2
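For comparison, here is how the capture-the-extension idea might look in Python. Python's `re` module has no \L operator, so this sketch lowercases the first group in a replacement function instead; the name `lowercase_image_names` is hypothetical.

```python
import re

# Group 1: the path and base name; group 2: the extension, kept as written.
IMAGE_NAME = re.compile(r"([\w/-]+)(\.(?:jpe?g|gif|png))")


def lowercase_image_names(text: str) -> str:
    """Lowercase image paths in text while keeping their extensions."""
    return IMAGE_NAME.sub(lambda m: m.group(1).lower() + m.group(2), text)


if __name__ == "__main__":
    print(lowercase_image_names('<img src="Images/South-America.jpg">'))
```

As in the regexes above, the extension alternatives are case-sensitive, so a name such as Photo.PNG would not be matched at all unless the pattern is made case-insensitive.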
If it's okay for the extension to become lowercase as well, you could just do
^(.*)$
\L\1
As long as you're certain that all lines contain file names.
If you want to process only certain file formats, use
^(.*\.(jpe?g|png|gif))$
\L\1

Crashplan and regex - excluding files based on filename

Crashplan allows for excluding files from a backup set by using regex for the exclusion criteria (there is no inclusion criteria functionality). For my particular use case I have a folder that contains these files:
C_VOL-b001.spf
C_VOL-b001-i001.md5
C_VOL-b001-i001.spi
E_VOL-b001.spf
E_VOL-b001-i001.md5
E_VOL-b001-i001.spi
F_VOL-b001.spf
F_VOL-b001-i001.md5
F_VOL-b001-i001.spi
G_VOL-b001.spf
G_VOL-b001-i001.md5
G_VOL-b001-i001.spi
and I want to exclude any file that doesn't begin with the C_VOL filename. These are backup files from another backup software, Shadowprotect, but I only want to include the C volume files and exclude the others. The incremental files will continue to be added to each of the volume sets using the naming schema of -i001, -i002, etc.
So far I've tried the following:
^E_VOL
^E_VOL.*
and a few other variations, with no success. I'm not sure if Crashplan only allows for selecting based on the filetype extension (their regex examples are here http://goo.gl/qDAEcR ). They do mention that "Note that CrashPlan treats all file separators as forward slashes (/)."
I'm not sure if Crashplan recognizes all regex expressions. If it helps, back in 2008 I emailed their tech support with a regex question and one of the founders of Crashplan, Matt Dornquast, helped me with the following regex:
I am trying to exclude any file that either:
1. has an extension of .spf, or
2. has a file name of the type XXXXXX-cd.spi,
3. while still allowing backup of files with a name of the type xxxxx.spi.
And his regex worked perfectly:
(?i).+(?:\-cd\.spi|\.spf)$
I've contacted their tech support again but they said they will no longer help with regex questions.
It seems that you could use the following regex:
.*/C_VOL.*
I created this based on an example featured on the website you linked in your question. Please let us know if it works :)
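Before putting a pattern into the exclusion list, it can help to check which of the listed files it actually selects. The Python sketch below does that under loud assumptions: the folder /backups/ is made up, Python's `re` stands in for whatever regex engine Crashplan uses, and the pattern is assumed to be matched against the whole forward-slash path. The second pattern is a hypothetical inverse using a negative lookahead, not something taken from Crashplan's documentation.

```python
import re

FILES = [
    "C_VOL-b001.spf", "C_VOL-b001-i001.md5", "C_VOL-b001-i001.spi",
    "E_VOL-b001.spf", "E_VOL-b001-i001.md5", "E_VOL-b001-i001.spi",
    "F_VOL-b001.spf", "F_VOL-b001-i001.md5", "F_VOL-b001-i001.spi",
    "G_VOL-b001.spf", "G_VOL-b001-i001.md5", "G_VOL-b001-i001.spi",
]
# Hypothetical folder; file separators are written as forward slashes.
PATHS = ["/backups/" + name for name in FILES]

# Pattern from the answer: selects files whose name starts with C_VOL.
ANSWER_PATTERN = re.compile(r".*/C_VOL.*")
# Hypothetical inverse: any <letter>_VOL- file whose name does not start with C_VOL.
INVERSE_PATTERN = re.compile(r".*/(?!C_VOL)[A-Z]_VOL-[^/]*")


def matched(pattern, paths):
    """Return the paths that the pattern matches in full."""
    return [p for p in paths if pattern.fullmatch(p)]


if __name__ == "__main__":
    print("answer pattern: ", matched(ANSWER_PATTERN, PATHS))
    print("inverse pattern:", matched(INVERSE_PATTERN, PATHS))
```

Under these assumptions the answer's pattern selects the three C_VOL files, so as an exclusion rule it would drop exactly the files the question wants to keep; the lookahead variant selects the E, F and G files instead.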

RegEx to rewrite folder structure of varying length and file name to string

This is the code I'm using, developed with the help of #anubhava to rewrite a path generated by a CGI script to redirect the path from the location of my jpg image files to another folder that contains watermarked image files in the same folder structure organization as the originals, but exclude files that begin with tn_ or AM (plus _category_image.jpg):
RewriteRule ^ImageFolio4_files/1/([^/]+)/((?!AM|tn_)[^.]+\.jpg)$ /ImageFolio4_files/cache/images/~$1~$2 [L,R=302,NC]
The original path of:
/ImageFolio4_files/1/Casual_Portraits/abc123_789-xyz.jpg
And the above RegEx works to properly generate this output:
/ImageFolio4_files/cache/images/~Casual_Portraits~abc123_789-xyz.jpg
My challenge: I need to accommodate a multi-folder structure up to three folders deep underneath the ImageFolio4_files/1/ structure. The current code doesn't accommodate that. I also need to exclude any files named _category_image.jpg, which occur at each of the folder levels beneath ImageFolio4_files/1/ (these files are unique small display icons that appear next to the category names).
I really have no idea how to accommodate the multi-folder structure, so your help would be appreciated.
First, change
([^/]+)/ to (([^/]+)/)+
in your expression.
Second, change
(?!AM|tn_) to (?!AM|tn_|_category_image.jpg)
You can use the negative lookahead (?!) for the whole filename as well; it doesn't consume characters, it just checks whether the regex "AM|tn_|_category_image.jpg" matches at the current position.
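The intended mapping can be sketched in Python to make the two changes concrete. This is an illustration of the logic only, not an Apache configuration, and it rests on assumptions: that deeper folders should be joined with "~" in the same way as the single-folder example, and that the dot in _category_image.jpg should be escaped. The name `watermark_path` is hypothetical. Note that a repeated capture group such as (([^/]+)/)+ keeps only its last iteration, so the sketch captures the whole folder run in one group and splits it afterwards.

```python
import re
from typing import Optional

# One to three folders below ImageFolio4_files/1/, then a .jpg whose name
# does not start with AM, tn_ or _category_image.jpg (case-insensitive, like NC).
WATERMARK_SOURCE = re.compile(
    r"^ImageFolio4_files/1/((?:[^/]+/){1,3})"
    r"((?!AM|tn_|_category_image\.jpg)[^/.]+\.jpg)$",
    re.IGNORECASE,
)


def watermark_path(path: str) -> Optional[str]:
    """Return the redirect target for path, or None if the rule does not apply."""
    match = WATERMARK_SOURCE.match(path)
    if match is None:
        return None
    folders = match.group(1).rstrip("/").split("/")
    return "/ImageFolio4_files/cache/images/~" + "~".join(folders) + "~" + match.group(2)


if __name__ == "__main__":
    print(watermark_path("ImageFolio4_files/1/Casual_Portraits/abc123_789-xyz.jpg"))
```

For the single-folder path from the question this reproduces the output shown there; paths more than three folders deep, thumbnails and category icons return None.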