R Wildcard in the middle of an expression - regex

I want to use the pattern expression in R to find files in my directory that match "ReportName*.HTML". Meaning that I only want to find files with certain file names and extensions, but there are dynamic characters between.
Here's an example: I want to find all reports that begin with "2016 Operations" but end with the extension ".HTML". Currently I am trying:
files.control <- dir(path, pattern="^2016 Operations*.HTML$")
Why doesn't this work? I like the one line of code; it's so simple.

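The "ReportName*.HTML" syntax is called a glob, which is not the same thing as a regular expression. In a regular expression, * means "zero or more of the preceding item" (here, the s at the end of Operations) and an unescaped . matches any single character, so the pattern in the question is not read as a wildcard. R supports globs directly via Sys.glob; the following returns a character vector of the filenames in the current directory that start with ReportName and end with .HTML.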
Sys.glob("ReportName*.HTML")
The R function glob2rx will translate globs to regular expressions so this does the same thing:
dir(pattern = glob2rx("ReportName*.HTML"))
We can discover the regular expression associated with a glob like this:
glob2rx("ReportName*.HTML")
## [1] "^ReportName.*\\.HTML$"
You can find more information on regular expressions from within R via ?regex, and further material at the links near the bottom of this page: https://code.google.com/archive/p/gsubfn/
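The difference between the two patterns can be checked outside R as well. Below is a minimal sketch in Python, whose fnmatch.translate plays a role similar to glob2rx; the filenames are made up for illustration, and the regex R uses internally may differ in detail.

```python
import fnmatch
import re

# Hypothetical filenames, invented for this illustration
filenames = [
    "2016 Operations Q1.HTML",
    "2016 Operations.HTML",
    "2016 OperationsXHTML",
    "2015 Operations Q1.HTML",
]

# Pattern from the question: "s*" is zero or more "s", "." is any one character
broken = re.compile(r"^2016 Operations*.HTML$")

# The glob translated into a regular expression (Python's analogue of glob2rx)
fixed = re.compile(fnmatch.translate("2016 Operations*.HTML"))

broken_hits = [name for name in filenames if broken.search(name)]
fixed_hits = [name for name in filenames if fixed.match(name)]

print(broken_hits)
print(fixed_hits)
```

The question's pattern misses "2016 Operations Q1.HTML" and accepts "2016 OperationsXHTML", while the translated glob does the reverse.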

Related

How to do a regex search in a GtkSourceBuffer

I am using GtkSourceView with a GtkSourceBuffer.
I need to do a regular expression search on its contents, and I know that GtkSourceBuffer is a subclass of GtkTextBuffer.
I'd like to do something like the Python code below, where search_text is a regular expression.
search_text = 'some regular expression'
source_buffer = source_view.get_buffer()
match_start = source_buffer.get_start_iter()
result = match_start.forward_search(search_text, 0, None)
if result:
    match_start, match_end = result
    source_buffer.select_range(match_start, match_end)
The regex isn't too complex: search_text = '/file_name\S*'. (Basically I want to match all file names in a document that are preceded by a separator character /, start with a common file name, and end with a sequence of non-space characters, including the file extension).
The Gtk.TextIter.forward_search() function only seems to accept these three flags, so I do not see a way of specifying that the search string is a regular expression:
Gtk.TextSearchFlags.VISIBLE_ONLY
Gtk.TextSearchFlags.TEXT_ONLY
Gtk.TextSearchFlags.CASE_INSENSITIVE
How can I achieve a regex search on a GtkSourceBuffer or GtkTextBuffer?
You should take a look at GtkSource.SearchSettings, which allows you to enable regex matching and set the search text.
After that, create a GtkSource.SearchContext and use its forward or backward methods to search.
GtkTextBuffer can also return its text with get_text, but that's not what you are looking for.
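If the GtkSource search API is not an option, one fallback is to run Python's own re module over the text returned by get_text and map the match offsets back to iterators. The sketch below covers only the offset-finding part, using the pattern from the question. find_match_offsets is a hypothetical helper, the sample text is invented, and the Gtk calls appear only as comments so the snippet runs without GTK installed. It also assumes the buffer holds plain text, so that character offsets in the string line up with buffer offsets.

```python
import re

def find_match_offsets(text, pattern):
    """Return (start, end) character offsets of every regex match in text."""
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]

# In the application the text would come from the buffer, for example:
#   text = source_buffer.get_text(source_buffer.get_start_iter(),
#                                 source_buffer.get_end_iter(), True)
text = "see /file_name_a.txt and /file_name_b.log here"
offsets = find_match_offsets(text, r"/file_name\S*")

# Each pair could then be turned back into iterators and selected:
#   start = source_buffer.get_iter_at_offset(offsets[0][0])
#   end = source_buffer.get_iter_at_offset(offsets[0][1])
#   source_buffer.select_range(start, end)
matched = [text[start:end] for start, end in offsets]
print(matched)
```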

Using multiple Perl regular expressions to find and replace

I'm a Perl and regex newcomer in need of your expertise.
I need to process text files that include placeholder lines like Foo Bar1.jpg and replace those with corresponding URLs like https:/baz/qux/Foo_Bar1.jpg.
As you may have guessed, I'm working with HTML. The placeholder text refers to the filename, which is the only thing available when writing the document. That's why I have to use placeholder text. Ultimately, of course, I want to replace the filename with the URL (after I upload file to my CMS to get the URL). At that point, I have all of the information at hand — the filename and the URL. Of course, I could just paste the URLs over the placeholder names in the HTML document. In fact, I've done that. But I'm certain that there's a better way.
In short, I have placeholder lines like this:
Foo Bar1.jpg
Foo Bar2.jpg
Foo Bar3.jpg
And I also have URL lines like this:
https:/baz/qux/Foo_Bar1.jpg
https:/baz/qux/Foo_Bar2.jpg
https:/baz/qux/Foo_Bar3.jpg
I want to find the placeholder string and capture a differentiator like Bar1 with a regex. Then I want to use the captured part like Bar1 to perform another regex search that matches part of the corresponding URL string, i.e. https:/baz/qux/Foo_Bar1.jpg. After a successful match, I want to replace the Foo Bar1.jpg line with https:/baz/qux/Foo_Bar1.jpg.
Ultimately, I want to do that for every permutation, so that https:/baz/qux/Foo_Bar2.jpg also replaces Foo Bar2.jpg and so on.
I've written regular expressions that match both the placeholder and the URL. That's not my problem, as far as I can tell. I can find the strings I need to process. For example, /[a-z]+\s([a-z0-9]+)\.jpg/ successfully matches what I'm calling the placeholder text and captures what I'm calling the differentiator.
However, though I've spent an embarrassing number of hours over the past week reading through Stack Overflow, various other sites and O'Reilly books on Perl and Perl regular expressions, I can't wrap my mind around how to process what I can find.
I think the piece you are missing is the idea of using Perl's built-in grep function to search a list of URL lines based on what you are calling your "differentiator".
Slurp your URL lines into a Perl array (assuming there are a finite manageable number of them, so that memory is not clobbered):
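open(my $urls_fh, '<', 'theUrlFile.txt') or die "Cannot open theUrlFile.txt: $!\n";
my @urls = <$urls_fh>;
chomp @urls;   # drop trailing newlines so they are not copied into the replacement
Then within the loop over your file containing "placeholders" (with the current line in $_):
# List assignment captures the differentiator, e.g. "Bar1"; /i because the sample names are mixed case
if (my ($key) = /[a-z]+\s([a-z0-9]+)\.jpg/i) {
    my @matches = grep { /\Q$key\E/ } @urls;   # URL lines that contain the differentiator
    if (@matches) {
        s/[a-z]+\s\Q$key\E\.jpg/$matches[0]/i;
    }
}
You may also want to insert error/warning messages if @matches != 1.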
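For comparison, the same capture-then-look-up idea can be sketched in Python. This is only an illustration under assumptions: replace_placeholders is a hypothetical helper, the sample lines are the ones from the question, and a placeholder is left untouched unless exactly one URL contains its differentiator.

```python
import re

# Same shape as the placeholder regex in the question, widened to mixed case
PLACEHOLDER = re.compile(r"[A-Za-z]+\s([A-Za-z0-9]+)\.jpg")

def replace_placeholders(text, urls):
    """Swap each placeholder for the single URL containing its differentiator."""
    def lookup(match):
        key = match.group(1)                    # e.g. "Bar1"
        hits = [url for url in urls if key in url]
        # Ambiguous or missing matches are left as they were
        return hits[0] if len(hits) == 1 else match.group(0)
    return PLACEHOLDER.sub(lookup, text)

urls = [
    "https:/baz/qux/Foo_Bar1.jpg",
    "https:/baz/qux/Foo_Bar2.jpg",
    "https:/baz/qux/Foo_Bar3.jpg",
]
document = "Foo Bar1.jpg\nFoo Bar2.jpg\nFoo Bar9.jpg\n"
result = replace_placeholders(document, urls)
print(result)
```

Foo Bar9.jpg has no corresponding URL in the list, so it passes through unchanged.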

Google Analytics Regex Advanced filter to include and exclude keyword and filetype

Okay so I need to create an advanced filter in Google Analytics that includes "breast", but DOES NOT include "before", "after" or "blog" in the URL. I also want to filter out .jpg file extensions.
Here are example URLs that I want the filter to return:
http://www.doctortaylor.com/breast-lift-surgery/
http://www.doctortaylor.com/breast-augmentation-pasadena-and-los-angeles-area/
I want to filter out any urls that are before and after photo pages, and any actual .jpg file urls.
I'm a regex beginner, but this is pretty advanced. Any help would be greatly appreciated!!
This regular expression does fairly well:
^(?!before|after|blog)*((?!before|after|blog).)*breast(?!before|after|blog|\.jpg)*((?!before|after|blog|\.jpg).)*$
UPDATED: I have updated the expression to capture all scenarios, even characters that begin or end the string. This regular expression excludes all words that you list in your description and correctly identifies the word breast.
MATCHES
http://www.doctortaylor.com/breast-lift-surgery/
http://www.doctortaylor.com/breast-augmentation-pasadena-and-los-angeles-area/
DOES NOT MATCH
http://www.doctortaylor.com/breast-lift-surgeryblog/
http://www.doctortaylor.com/breast-lift-surgery.jpg/
http://blog.doctortaylor.com/breast-lift-surgery/
http://www.doctortaylor.com/after-breast-lift-surgery/
This regular expression emulates inverse matching with negative lookahead: each (?!...) group rejects any position where one of the excluded strings begins.
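The match lists above can be checked with a short script. The sketch below uses Python's re module and a simplified pattern that drops the quantified lookaheads; this is an assumption of the sketch, made for portability across regex engines. It tests only the regex logic and says nothing about what the Google Analytics filter field accepts.

```python
import re

# Simplified form: every character before and after "breast" must not begin
# an excluded string
pattern = re.compile(
    r"^(?:(?!before|after|blog).)*breast(?:(?!before|after|blog|\.jpg).)*$"
)

urls = [
    "http://www.doctortaylor.com/breast-lift-surgery/",
    "http://www.doctortaylor.com/breast-augmentation-pasadena-and-los-angeles-area/",
    "http://www.doctortaylor.com/breast-lift-surgeryblog/",
    "http://www.doctortaylor.com/breast-lift-surgery.jpg/",
    "http://blog.doctortaylor.com/breast-lift-surgery/",
    "http://www.doctortaylor.com/after-breast-lift-surgery/",
]

kept = [url for url in urls if pattern.search(url)]
for url in kept:
    print(url)
```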

How to match and remove any line containing a specific string?

I have a huge directory list of URLs from my Web site. Example:
/folder/folder2/folder3/page.htm
/folder/folder2/folder3/page2.htm
/folder/folder2/folder3/page3.htm
/folder/folder2/folder3/page4.htm
I want to clean this list of all items that have /folder2 in the path. I need a regular expression to perform a find and replace for everything that uses /folder2/ and delete those lines from my list. So find/replace it with blank.
Does anyone know what the proper regular expression for this would be? I should specify I am using Dreamweaver as my editor, which may use different regular expressions.
This expression matches any entire line that contains the string "/folder2/", including lines that begin or end with it:
^.*/folder2/.*$
HTH.
In Python that would be:
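import re

# Matches any line whose path contains /folder2/
regex = re.compile('.*/folder2/.*')

# Copy every line that does not match to the output file
with open("input.txt") as src, open("filtered_file.txt", "w") as dst:
    for line in src:
        if not regex.match(line):
            dst.write(line)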

Regular expression for changing links in Dreamweaver

I'm in the process of moving my Dreamweaver-based website to a CMS, and I would like to replace site-wide the following kind of links:
a href="http://www.domain.com/category/item ### title.html" (where ### is a number)
to
a href="http://www.domain.com/category/item###"
What is the correct regular expression I should use in the find and replace built-in engine?
I propose
(http://www\.domain\.com/category/item) *(\d+).+?\.html
as the search pattern,
and substituting the entire match with $1$2 (the two captured groups joined together).
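The substitution can be tried outside Dreamweaver before running it site-wide. A minimal sketch in Python, where \g<1>\g<2> stands in for $1$2; the sample link is invented from the pattern in the question, and Dreamweaver's own engine may differ in small ways.

```python
import re

# Group 1: the fixed part of the URL; group 2: the item number
pattern = re.compile(r"(http://www\.domain\.com/category/item) *(\d+).+?\.html")

# Hypothetical link following the shape described in the question
link = 'a href="http://www.domain.com/category/item 123 some title.html"'

# Keep only the two captured groups, dropping the title and the extension
rewritten = pattern.sub(r"\g<1>\g<2>", link)
print(rewritten)
```

Because .+? can run past a closing quote to a later .html on the same line, it is worth testing on a copy of the site first.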