Need to remove bunch of comments from every HTML file - regex

I'm looking to restore an old site of mine from the Internet Archive, which thankfully is pretty intact.
The only problem is that an extra comment has been added to the existing HTML, which I want to remove. The comment has been added to the bottom of every page and reads as follows:
<!--
FILE ARCHIVED ON 15:22:46 Jan 15, 2011 AND RETRIEVED FROM THE
INTERNET ARCHIVE ON 11:36:37 Jul 11, 2014.
JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE.
ALL OTHER CONTENT MAY ALSO BE PROTECTED BY COPYRIGHT (17 U.S.C.
SECTION 108(a)(3)).
-->
I've read that what I'm trying to do can be accomplished using regular expressions, but since I'm new to them I'd like some help.
This is all I've got after struggling for over 3 hours:
<!--(\s)*FILE ARCHIVED
I have no clue how to end it.
Any help would be greatly appreciated.

Match and replace the following regex with empty strings:
/<!--.+?-->/s

The regex below matches only the archive comment, so you can replace the match with an empty string.
/<!--\s*FILE ARCHIVED(?:[^\n]*[\n][^\n]*)*?-->/m
OR
With the s (DOTALL) modifier:
/<!--\s*FILE ARCHIVED(?:(?!-->).)*-->/sg
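As a rough sketch of how the DOTALL pattern above could be applied to a whole site, here is a Python version. The function names, the folder layout and the UTF-8 encoding are assumptions made for illustration, not something stated in the thread.

```python
import re
from pathlib import Path

# Matches only the comment that starts with "FILE ARCHIVED". The tempered
# dot (?:(?!-->).)* stops at the first "-->", and re.DOTALL lets "." cross
# newlines, so other HTML comments are left alone.
WAYBACK_COMMENT = re.compile(r"<!--\s*FILE ARCHIVED(?:(?!-->).)*-->", re.DOTALL)


def strip_wayback_comment(html: str) -> str:
    """Return html with the Wayback Machine footer comment removed."""
    return WAYBACK_COMMENT.sub("", html)


def clean_site(root: str) -> None:
    """Rewrite every .html file under root in place.

    Assumption: the files are UTF-8 encoded. Back up the folder first.
    """
    for path in Path(root).rglob("*.html"):
        text = path.read_text(encoding="utf-8")
        path.write_text(strip_wayback_comment(text), encoding="utf-8")
```

Calling clean_site("restored_site") (a hypothetical folder name) would process every page in that folder tree.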

The Internet Archive allows us to retrieve the raw version of web pages. For example, if you have this URL (https://web.archive.org/web/20170204063743/http://john.smith#example.org/), replace the timestamp 20170204063743 with 20170204063743id_ (so the modified URL will look like https://web.archive.org/web/20170204063743id_/http://john.smith#example.org/) then you will get the original HTML without any additional comments added by the Internet Archive.
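The URL rewrite described above can be sketched as a small Python helper. The function name is hypothetical, and the pattern assumes a full 14-digit timestamp like the one in the example URL.

```python
import re

# Assumed URL shape: https://web.archive.org/web/<14-digit timestamp>/<original URL>
_TIMESTAMP = re.compile(r"(/web/\d{14})(/)")


def to_raw_url(wayback_url: str) -> str:
    """Insert "id_" after the timestamp so the archive serves the raw page."""
    return _TIMESTAMP.sub(r"\1id_\2", wayback_url, count=1)
```

A URL that already carries the id_ suffix does not match the pattern and is returned unchanged.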

Please try this one:
$output = preg_replace("/<!--(.|\s)*?-->/", "", $input_lines);
Given the input below, only the text "HTML content goes here" remains:
HTML content goes here
<!--
FILE ARCHIVED ON 15:22:46 Jan 15, 2011 AND RETRIEVED FROM THE
INTERNET ARCHIVE ON 11:36:37 Jul 11, 2014.
JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE.
ALL OTHER CONTENT MAY ALSO BE PROTECTED BY COPYRIGHT (17 U.S.C.
SECTION 108(a)(3)).
-->

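I'd go with something like this:
<!--(\s)*FILE ARCHIVED(\s|.)*-->
Note that the final quantifier is greedy, so this matches up to the last --> in the file; that is fine here because the archive comment sits at the bottom of every page.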

Related

PDF page count using regex

I used a regex to calculate the page count of a PDF. Below is the code that I used.
Regex regex = new Regex(@"/Type\s*/Page[^s]");
MatchCollection matches = regex.Matches(sr.ReadToEnd());
return matches.Count;
It works fine with PDF versions below 1.6 but not with version 1.6 files, for which it returns a page count of 0.
In your case you are most likely dealing with 1.6 documents that make use of compressed object streams, a feature introduced in that version. Because the information you search for is compressed in such documents, your regular expression does not find it.
There are tools that decompress such streams in a file before you search it. Before you look for them, though, be aware that the result of your code cannot be trusted anyway, because
there can be more matches than pages, since the file may contain old, unused page objects or other false positives,
there can be fewer matches than pages, since PDF allows alternative ways to write those type entries.
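To make the point concrete, here is a small Python sketch of the same pattern applied to raw bytes. The sample bytes are invented for illustration and are not a valid PDF; the second call uses zlib only to mimic what stream compression does to the searchable text.

```python
import re
import zlib

# Same pattern as in the question: "/Type /Page" not followed by "s",
# so the "/Pages" tree node is not counted.
PAGE_OBJECT = re.compile(rb"/Type\s*/Page[^s]")


def count_page_markers(data: bytes) -> int:
    """Count textual page-object markers in uncompressed PDF bytes."""
    return len(PAGE_OBJECT.findall(data))


# Invented fragment: one /Pages node and two page objects.
sample = (
    b"<< /Type /Pages /Count 2 >>\n"
    b"<< /Type /Page /Parent 1 0 R >>\n"
    b"<< /Type/Page>>\n"
)

plain_count = count_page_markers(sample)
# Once the same bytes are deflated, the marker text is no longer present
# verbatim, which is why the search comes up empty on compressed streams.
compressed_count = count_page_markers(zlib.compress(sample))
```

For a dependable page count, a PDF library that parses the page tree is the safer route.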

Crashplan and regex - excluding files based on filename

Crashplan allows for excluding files from a backup set by using regex for the exclusion criteria (there is no inclusion criteria functionality). For my particular use case I have a folder that contains these files:
C_VOL-b001.spf
C_VOL-b001-i001.md5
C_VOL-b001-i001.spi
E_VOL-b001.spf
E_VOL-b001-i001.md5
E_VOL-b001-i001.spi
F_VOL-b001.spf
F_VOL-b001-i001.md5
F_VOL-b001-i001.spi
G_VOL-b001.spf
G_VOL-b001-i001.md5
G_VOL-b001-i001.spi
and I want to exclude any file that doesn't begin with the C_VOL filename. These are backup files from another backup software, Shadowprotect, but I only want to include the C volume files and exclude the others. The incremental files will continue to be added to each of the volume sets using the naming schema of -i001, -i002, etc.
So far I've tried the following:
^E_VOL
^E_VOL.*
and a few other variations, with no success. I'm not sure if Crashplan only allows for selecting based on the filetype extension (their regex examples are here http://goo.gl/qDAEcR ). They do mention that "Note that CrashPlan treats all file separators as forward slashes (/)."
I'm not sure if Crashplan recognizes all regex syntax. If it helps, back in 2008 I emailed their tech support with a regex question and one of the founders of Crashplan, Matt Dornquast, helped me with the following request:
I am trying to exclude any file that either:
1. has an extension of .spf, or
2. has a file name of the type XXXXXX-cd.spi,
3. while still allowing backup of files with a name of the type xxxxx.spi
And his regex worked perfectly:
(?i).+(?:\-cd\.spi|\.spf)$
I've contacted their tech support again but they said they will no longer help with regex questions.
It seems that you could use the following regex, which matches any file whose name begins with C_VOL:
.*/C_VOL.*
I created this based on this example (link) they featured on the website you linked in your question. Please let us know if it's working :)
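The following Python sketch checks which of the sample files the proposed pattern matches. Because Crashplan patterns are exclusions, the sketch also tries a negative-lookahead variant that matches the files not starting with C_VOL. That variant, the folder name "Backups" and the drive letter are assumptions for illustration only; the thread does not confirm that Crashplan's engine accepts lookaheads.

```python
import re

# Pattern proposed in the answer: any path containing a component
# that starts with C_VOL.
PROPOSED = re.compile(r".*/C_VOL.*")

# Hypothetical inversion (not from the thread): files directly inside a
# folder assumed to be called "Backups" whose names do NOT start with C_VOL.
INVERTED = re.compile(r".*/Backups/(?!C_VOL)[^/]+")

FOLDER = "D:/Backups/"  # assumed location, forward slashes as Crashplan uses
NAMES = [
    "C_VOL-b001.spf",
    "C_VOL-b001-i001.spi",
    "E_VOL-b001.spf",
    "F_VOL-b001-i001.md5",
]


def matched(pattern, names=NAMES):
    """Return the file names whose full path matches the pattern."""
    return [name for name in names if pattern.fullmatch(FOLDER + name)]
```

Here matched(PROPOSED) selects the two C_VOL files and matched(INVERTED) selects the E_VOL and F_VOL files, so only the second would leave the C volume in the backup if used as an exclusion.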

Regex expression for last updated file

Please help me out with a regex to find the most recently modified file in a folder.
The files are named in this manner:
Test.2014_02_20 updated 13:00:23
Test.2014_02_21 updated 15:23:23
Test.2014_02_25 updated 21:24:23
Using regex we need to pick up the file Test.2014_02_25 updated 21:24:23
Thanks.
Through the comments from @Theox and @piet.t, OP has concluded that regular expressions were not the best tool to accomplish the task here.
"You can use regexs to validate the format of the line (...)" - Theox
"(...) the concept of ordering things by some criteria is way out of the scope of regex" - piet.t
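As a sketch of the non-regex route the comments point to, the timestamp embedded in each name can be parsed and the names compared. This assumes every name follows the exact "Test.YYYY_MM_DD updated HH:MM:SS" layout shown in the question; the helper name is hypothetical.

```python
from datetime import datetime


def parse_stamp(name: str) -> datetime:
    """Extract the date and time embedded in a file name.

    Assumed layout: "<prefix>.YYYY_MM_DD updated HH:MM:SS".
    """
    _prefix, rest = name.split(".", 1)
    return datetime.strptime(rest, "%Y_%m_%d updated %H:%M:%S")


names = [
    "Test.2014_02_20 updated 13:00:23",
    "Test.2014_02_21 updated 15:23:23",
    "Test.2014_02_25 updated 21:24:23",
]

# Ordering is done on the parsed datetime, not on the raw text.
latest = max(names, key=parse_stamp)
```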

Using regex to eliminate chunks in a file (categorized events in iCal file)

I have one .ics file from which I would like to create individual new .ics files depending on the event categories (I can't get egroupware to export only events of one category, I want to create new calendars depending on category). My intended approach is to repeatedly eliminate all events but those of one category and then save the file using EditPad Lite 7 (Windows).
I am struggling to get the regular expression right. .+? is still too greedy and negating the string (e.g. to eliminate all but events from one category) doesn't work either.
Sample
BEGIN:VEVENT
DESCRIPTION:Event 2
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:Event 3
CATEGORIES:Sports
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:Event 4
END:VEVENT
The regex BEGIN:VEVENT.+?CATEGORIES:Sports.+?END:VEVENT should only match sports events, but it catches everything from the first BEGIN to the first END following the category.
Edit: negating doesn't work either: BEGIN:VEVENT.+?((?!CATEGORIES:Sports).).+?END:VEVENT.
What am I missing? Any pointers are highly appreciated.
I guess newlines are removed or ignored, because your regex does not account for them.
I only have a correction to the match after CATEGORIES:
BEGIN:VEVENT.+?CATEGORIES:Sports.*?END:VEVENT
                                 ^
                                 Zero or more
The first part of your regex looks good; maybe the regex engine in EditPad is not so good.
Try it with a different editor or scripting language (like Eclipse, Perl, Notepad++ or Notepad2).
You could split the input and then grep the matching Sports events
@sportevents = grep /Sports/, split /END:VEVENT/, $input;
map $_.="END:VEVENT", @sportevents;
This was Perl; maybe you can launch a script from EditPad to do it.
The second line just restores the END:VEVENT that was stripped during split.
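The same split-then-filter idea can be sketched in Python. Matching each event on its own first keeps a lazy quantifier from running across event boundaries, which is what the original single regex could not avoid. The function name is hypothetical, and the comma-separated handling of CATEGORIES is an assumption about the exported file.

```python
import re

# One match per event: lazy up to the nearest END:VEVENT; DOTALL lets "."
# cross line breaks.
EVENT = re.compile(r"BEGIN:VEVENT.*?END:VEVENT", re.DOTALL)


def events_in_category(ics_text: str, category: str) -> list:
    """Return the VEVENT blocks whose CATEGORIES line lists the category."""
    selected = []
    for event in EVENT.findall(ics_text):
        for line in event.splitlines():
            if line.startswith("CATEGORIES:"):
                # Assumption: multiple categories are comma-separated.
                values = [v.strip() for v in line[len("CATEGORIES:"):].split(",")]
                if category in values:
                    selected.append(event)
                    break
    return selected


sample = (
    "BEGIN:VEVENT\nDESCRIPTION:Event 2\nEND:VEVENT\n"
    "BEGIN:VEVENT\nDESCRIPTION:Event 3\nCATEGORIES:Sports\nEND:VEVENT\n"
    "BEGIN:VEVENT\nDESCRIPTION:Event 4\nEND:VEVENT\n"
)
sports = events_in_category(sample, "Sports")
```

The selected blocks would still need the usual calendar header and footer around them before being saved as a new .ics file.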
OK. Solved it. I found something here which can be used to split ics files. I tweaked it to use the category rather than the summary in the file name and then merged the individually generated files according to category. I added the usual ics header and footer to all files and, voilà, I had individual calendar files.

Textile: using code and footnotes

Hi, I'm using Redmine to write a wiki for my software. I need to put some notes next to a code section like this:
class.method()[1]
Where the "one" is a link to my note at the end of the page.
I've tried every method defined in the Textile syntax but none seems to work. In fact, when you use the code tag '@ @', any other tag stops working.
It's good even if I can use the link tag [[ ]] but only if it is like this google.com
Thanks for any help,
Alessandro
Redmine uses CodeRay to parse the code sections in the wiki. Take a look at the documentation for the different languages. Otherwise I would suggest using comments instead of footnotes or, in the worst case, line references to the code.
The footnote will only work if there is an alphanumeric character directly before the opening square bracket:
this[1], whereas this()[2] or this [3]
fn1. will work
fn2. won't work.
fn3. won't work.
At least this is true with Redmine 3.1. See this issue for more information.
Note that you need blank lines between fn1., fn2. and fn3. to get a correct rendering.
This extension for Redmine supports footnotes and custom styles in the wiki.
Redmine supports Textile markup syntax, and Textile has support for footnotes. As noted on ticket #974, this is the syntax for using footnotes in Redmine:
Text with a footnote[1]
fn1. and here the actual footnote.