PDF page count using regex

I used a regex to calculate the page count of a PDF. Below is the code I used.
Regex regex = new Regex(@"/Type\s*/Page[^s]");
MatchCollection matches = regex.Matches(sr.ReadToEnd());
return matches.Count;
It works fine with PDF versions below 1.6, but not with version 1.6 files: for those it returns 0 pages.

In your case you are most likely dealing with 1.6 documents which make use of compressed object streams, a feature introduced with that version. As the information you search for is compressed in such documents, your regular expression does not find it.
There are tools which allow you to decompress such streams in a file before searching it. Before you look for them, though, be aware that the result of your code cannot be trusted anyway, because
there can be more matches than pages, since the file may contain old, unused page objects or other false positives,
there can be fewer matches than pages, since PDF allows alternative ways to write those type entries.
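If using a PDF library is an option, letting it count the pages is far more reliable than scanning the raw bytes, since a library resolves compressed object streams and walks the actual page tree. A minimal sketch in Python using the pypdf package (an assumption on my part; the question's code is C#, where libraries such as iTextSharp offer the same kind of page count):
from pypdf import PdfReader

# Walks the document's page tree instead of pattern-matching raw bytes,
# so compressed object streams (PDF 1.5+) are handled transparently.
reader = PdfReader("input.pdf")
print(len(reader.pages))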


Regex Assistance for replacing filepaths in markdown documents

I migrated my notes from Evernote to markdown files with Yarle. Unfortunately it created a lot of separate folders for the attachments (although I set it up to use one folder only).
I moved all attachments to one folder, so the file paths to the attachments in the markdown files need to be updated.
I think regex would be right for this, but I don't have any knowledge of regex and would be really thankful for help.
File paths look like this: ![[./_attachmentsMove/Coordination_Patterns.resources/CoordinationPattern_Ipsi.MOV]]
All file paths start with the identical prefix ![[./_attachmentsMove/.
The second folder varies, e.g. Coordination_Patterns.resources/.
I want to delete everything but the filename.extension itself, e.g. ![[CoordinationPattern_Ipsi.MOV]].
An example of the other filepaths:
![[./_attachmentsMove/Jonglieren_(Hände).resources/07 Jonglieren.MOV]]
(second folder changes, filename changes, I also have .png and .mov).
I use MassReplaceIt (a Mac app) which allows me to replace expressions in documents with regex. If someone has a solution using the terminal/command line, I'll try that as well of course :)
See if this regexp suffices:
(?<=!\[\[)[^\]]+/(?=[^\]/]+]])
Replace with empty string.
It should delete everything from just after the ![[ up to and including the last / before the next ]].
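Since you mentioned the command line as an option, here is a minimal Python sketch that applies the same pattern to every markdown file in a folder (the folder name "notes" is just a placeholder):
import re
from pathlib import Path

# Placeholder folder; point this at wherever the markdown files live.
notes_dir = Path("notes")

# Same pattern as above: drop everything between "![[" and the last "/"
# before the closing "]]", leaving only filename.extension.
pattern = re.compile(r'(?<=!\[\[)[^\]]+/(?=[^\]/]+]])')

for md_file in notes_dir.glob("*.md"):
    text = md_file.read_text(encoding="utf-8")
    md_file.write_text(pattern.sub("", text), encoding="utf-8")
Work on a copy or keep a backup before running a bulk replacement like this.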

Is it possible to make an index search by regex in PDF?

I want to search for all lines that match this regex
^([0-9IVX]\.)*.*\R
and report with the page number they are at. The output would be something like:
1. Heading/page number
1.1 Subheading/page number
1.1.1. Subsubheading/page number
Is this possible to do with a PDF? I suppose that would require Ghostscript, but searching the How to Use Ghostscript page for "regex" turns up nothing.
I can't think why you would expect Ghostscript to do the searching for you.
I'm not sure whether you are hoping to get data such as 'heading', 'page number' etc. from the PDF file itself, or whether you are going to work that out yourself based on the text you find.
If it's the former then the first problem is that, in general, PDF files don't have the kind of structure information you are looking for. There is nothing in most PDF files which says 'this is a heading', 'this is a page number' etc.
There is such a thing as 'tagged PDF', which adds non-printing elements to a PDF file that do carry that kind of data around with them. It is an entirely optional feature, the vast majority of PDF files don't contain it, and Ghostscript completely ignores it.
Since most PDF files don't have that information, you can't rely on it, unless you are in the happy position of knowing where your PDF files are being generated and that they contain this kind of information. In which case there are numerous tools around which will extract it for you, or enable you to write code to do so.
The problem with just searching for the text is that firstly the text need not be written as a contiguous stream. So if you are looking for '1.1' that might be written as:
(1.1) Tj
(1) Tj
(.) Tj
(1) Tj
[(1) -0.1 (.) 0.1 (1)] TJ
or any combination of those. The individual character codes need not even appear in order or in the same content stream.
Secondly, the character codes in a PDF content stream need not be (and often are not) Unicode code points. Or ASCII, or any other standard coding scheme; they can be totally arbitrary.
Some PDF files carry a ToUnicode CMap around which maps the character codes to Unicode code points, but not all do. Some fonts may use a standard (that's PDF standard) Encoding, in which case it's possible to infer the Unicode code points. Some Encodings may contain glyph names, from which it's again possible to infer Unicode code points.
In the end though, some PDF files are simply impossible to extract text from without using OCR.
Your best bet is probably to write code to extract the text, and Ghostscript will do that. It even goes through the hierarchy of fallbacks listed above to try and find a Unicode code point. If all else fails it just uses the character code and hopes that's good enough.
If you use Ghostscript's txtwrite device it will produce either a faked up text page (the default) which attempts, as far as possible, to mimic the text layout in the original PDF file, including merging bits of text that aren't contiguous in the PDF file but are next to each other on the page. Or an 'XML-like' output which will tell you which Unicode code points, or character codes, were encountered and what their position is on the original page. If you don't like txtwrite's attempts to figure out which text goes with what, then you can use this to write your own.
I suspect the text page is probably good enough for your purposes. You can have the txtwrite device produce one file per page, so you can get the page number from the filename. Then you can write your own regex expression(s) to search the files and find your matches.
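As a rough illustration of that workflow (file names are placeholders, Python's re has no \R token so the sketch matches line by line, and the heading pattern is tightened from the one in the question so it does not match every line):
import re
import subprocess
from pathlib import Path

# Let Ghostscript's txtwrite device emit one text file per page;
# "%03d" in the output name is expanded to the page number.
subprocess.run(["gs", "-sDEVICE=txtwrite", "-o", "page-%03d.txt", "input.pdf"],
               check=True)

# Heading-like lines: one or more "1." / "IV." style prefixes at line start.
heading = re.compile(r"^([0-9IVX]+\.)+.*")

for page_file in sorted(Path(".").glob("page-*.txt")):
    page_number = int(page_file.stem.split("-")[1])
    for line in page_file.read_text(errors="replace").splitlines():
        if heading.match(line):
            print(f"{line.strip()}/page {page_number}")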

Applescript to extract the Digital Object Identifier (DOI) from a PDF file

I looked for an AppleScript to extract the DOI from a PDF file, but could not find one. There is enough information available on the actual format of the DOI (i.e. the regular expression), but how could I use this to get the identifier out of the PDF file?
(It would be no problem if some external program were used, such as Hazel.)
If you're OK with using an app, I'd recommend Skim, which has good AppleScript support. I'd probably structure it like this (especially if the document might be large):
set DOIFound to false
tell application "Skim"
    set pp to pages of document 1
    repeat with p in pp
        set t to text of p
        -- look for DOI and set DOIFound to true
        if DOIFound then exit repeat -- if it's not found then use url?
    end repeat
end tell
I'm assuming a DOI would always sit entirely on one page (not spread across two). It looks like they are invariably (?) on the first page of an article, which would make this quick of course, even with a large doc.
[edit]
Another way would be to get the Xpdf OS X binaries from http://www.foolabs.com/xpdf/download.html and use pdftotext on the command line (just tested this; it works well), then parse the resulting text using AppleScript. If you want to stay in AppleScript, you can do something like:
do shell script "path/to/pdftotext 'path/to/pdf/file.pdf'"
which would output a file in the same directory with a txt file extension -- you parse that for DOI.
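For the parsing step, here is a minimal sketch in Python rather than AppleScript; the DOI pattern is a common approximation (along the lines of Crossref's suggested expression), not an exact grammar, and the file name is a placeholder:
import re
import subprocess

# Approximate DOI pattern; real DOIs can be messier, so treat this as a heuristic.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+", re.IGNORECASE)

def extract_doi(pdf_path):
    # "-" makes pdftotext write the extracted text to stdout instead of a file.
    text = subprocess.run(["pdftotext", pdf_path, "-"],
                          capture_output=True, text=True, check=True).stdout
    match = DOI_RE.search(text)
    return match.group(0) if match else None

print(extract_doi("file.pdf"))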
Have you tried pdfgrep? It works really well on the command line:
pdfgrep -n --max-count 1 --include "*.pdf" "DOI"
I have no idea how to build an AppleScript for this though, but I would be interested in one as well, so that if I drop a PDF into a folder it automatically extracts the DOI and renames the file with the DOI in the filename.

Crashplan and regex - excluding files based on filename

CrashPlan allows excluding files from a backup set by using regex for the exclusion criteria (there is no inclusion functionality). For my particular use case I have a folder that contains these files:
C_VOL-b001.spf
C_VOL-b001-i001.md5
C_VOL-b001-i001.spi
E_VOL-b001.spf
E_VOL-b001-i001.md5
E_VOL-b001-i001.spi
F_VOL-b001.spf
F_VOL-b001-i001.md5
F_VOL-b001-i001.spi
G_VOL-b001.spf
G_VOL-b001-i001.md5
G_VOL-b001-i001.spi
and I want to exclude any file that doesn't begin with the C_VOL prefix. These are backup files from another backup program, ShadowProtect, but I only want to include the C volume files and exclude the others. Incremental files will continue to be added to each of the volume sets using the naming scheme -i001, -i002, etc.
So far I've tried the following:
^E_VOL
^E_VOL.*
and a few other variations, with no success. I'm not sure if Crashplan only allows for selecting based on the filetype extension (their regex examples are here http://goo.gl/qDAEcR ). They do mention that "Note that CrashPlan treats all file separators as forward slashes (/)."
I'm not sure if CrashPlan recognizes all regex expressions. If it helps, back in 2008 I emailed their tech support with a regex question and one of the founders of CrashPlan, Matt Dornquast, helped me with the following regex:
I am trying to exclude any file that either:
1. has an extension of .spf, or
2. has a file name of the form XXXXXX-cd.spi,
3. but also allow for backup of files with names of the form xxxxx.spi.
And his regex worked perfectly:
(?i).+(?:\-cd\.spi|\.spf)$
I've contacted their tech support again but they said they will no longer help with regex questions.
Since CrashPlan only supports exclusions, the regex has to match the files you do not want backed up, i.e. everything that does not start with C_VOL. Given that all of your file names follow the ?_VOL pattern, something like the following should work:
.*/[^C]_VOL.*
I based this on the example (link) they featured on the website you linked in your question. Please let us know if it's working :)
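As a quick sanity check outside CrashPlan (the folder name below is made up), you can verify which of the sample paths the exclusion pattern matches:
import re

# Paths matching the exclusion pattern are the ones CrashPlan would skip.
exclude = re.compile(r".*/[^C]_VOL.*")

paths = [
    "/backups/C_VOL-b001.spf",
    "/backups/C_VOL-b001-i001.spi",
    "/backups/E_VOL-b001.spf",
    "/backups/G_VOL-b001-i001.md5",
]

for path in paths:
    print(path, "-> excluded" if exclude.search(path) else "-> kept")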

Using regex to eliminate chunks in a file (categorized events in iCal file)

I have one .ics file from which I would like to create individual new .ics files depending on the event categories (I can't get eGroupware to export only the events of one category, so I want to create new calendars per category). My intended approach is to repeatedly eliminate all events but those of one category and then save the file using EditPad Lite 7 (Windows).
I am struggling to get the regular expression right. .+? is still too greedy and negating the string (e.g. to eliminate all but events from one category) doesn't work either.
Sample
BEGIN:VEVENT
DESCRIPTION:Event 2
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:Event 3
CATEGORIES:Sports
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:Event 4
END:VEVENT
The regex BEGIN:VEVENT.+?CATEGORIES:Sports.+?END:VEVENT should only match sports events, but it catches everything from the first BEGIN to the first END following the category.
Edit: negating doesn't work either: BEGIN:VEVENT.+?((?!CATEGORIES:Sports).).+?END:VEVENT.
What am I missing? Any pointers are highly appreciated.
I guess newlines are removed or ignored, because your regex does not care about them.
I only have a correction to the match after CATEGORIES:
BEGIN:VEVENT.+?CATEGORIES:Sports.*?END:VEVENT
Note the * (zero or more) instead of + after CATEGORIES:Sports.
The first part of your regex looks good; maybe the regex engine in EditPad is just not up to it.
Try it with a different editor or scripting language (like Eclipse, Perl, Notepad++ or Notepad2).
You could split the input and then grep the matching Sports events
@sportevents = grep /Sports/, split /END:VEVENT/, $input;
map { $_ .= "END:VEVENT" } @sportevents;
This was Perl; maybe you can launch a script from EditPad to do it.
The second line just restores the END:VEVENT that was stripped during the split.
OK. Solved it. I found something here which can be used to split ics files. I tweaked it to use the category rather than the summary in the file name and then merged the individually generated files according to category. I added the usual ics header and footer to all files and, voilà, I had individual calendar files.
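For anyone who prefers to skip the editor entirely, here is a minimal Python sketch of the same split-by-category idea; it assumes a flat calendar.ics without nested components and category names that are safe to use as file names:
import re
from collections import defaultdict
from pathlib import Path

text = Path("calendar.ics").read_text(encoding="utf-8")

# Grab each VEVENT block and bucket it by its CATEGORIES line (if any).
events = re.findall(r"BEGIN:VEVENT.*?END:VEVENT", text, re.DOTALL)
by_category = defaultdict(list)
for event in events:
    m = re.search(r"CATEGORIES:(.+)", event)
    by_category[m.group(1).strip() if m else "Uncategorized"].append(event)

# Write one calendar per category, with a minimal header and footer.
for category, evs in by_category.items():
    body = "BEGIN:VCALENDAR\nVERSION:2.0\n" + "\n".join(evs) + "\nEND:VCALENDAR\n"
    Path(category + ".ics").write_text(body, encoding="utf-8")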