How to do a regex search in a GtkSourceBuffer - regex

I am using GtkSourceView with a GtkSourceBuffer.
I need to do a regular expression search on its contents, and I know that GtkSourceBuffer is a subclass of GtkTextBuffer.
I'd like to do something like the Python code below, where search_text is a regular expression.
search_text = 'some regular expression'
source_buffer = source_view.get_buffer()
match_start = source_buffer.get_start_iter()
result = match_start.forward_search(search_text, 0, None)
if result:
    match_start, match_end = result
    source_buffer.select_range(match_start, match_end)
The regex isn't too complex: search_text = '/file_name\S*'. (Basically I want to match all file names in a document that are preceded by a separator character /, start with a common file name, and end with a sequence of non-space characters, including the file extension).
The Gtk.TextIter.forward_search() method only seems to accept these three flags, so I do not see a way to specify that the search string is a regular expression:
Gtk.TextSearchFlags.VISIBLE_ONLY
Gtk.TextSearchFlags.TEXT_ONLY
Gtk.TextSearchFlags.CASE_INSENSITIVE
How can I do a regex search on a GtkSourceBuffer or GtkTextBuffer?

Take a look at GtkSource.SearchSettings, which lets you enable regex matching (set_regex_enabled) and set the search text (set_search_text).
After that, create a GtkSource.SearchContext for the buffer and search with its forward or backward methods.
GtkTextBuffer can also return its text with get_text, but that is probably not what you are looking for.
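A minimal sketch of the SearchSettings / SearchContext approach described above, assuming PyGObject and GtkSourceView are installed. The function name select_first_regex_match and the gtksource_version parameter are hypothetical, introduced only for illustration. The pattern is interpreted by GtkSourceView's own regex engine rather than by Python's re module, so check the syntax there for anything beyond simple patterns.

```python
def select_first_regex_match(source_buffer, pattern, gtksource_version="4"):
    """Select the first regex match in a GtkSource.Buffer.

    Returns True if a match was found and selected, False otherwise.
    """
    # Imported inside the function so the caller can pick the installed
    # GtkSourceView version (an assumption: "3.0", "4" or "5").
    import gi
    gi.require_version("GtkSource", gtksource_version)
    from gi.repository import GtkSource

    settings = GtkSource.SearchSettings()
    settings.set_regex_enabled(True)  # treat the search text as a regex
    settings.set_search_text(pattern)

    context = GtkSource.SearchContext.new(source_buffer, settings)

    # forward() returns (found, match_start, match_end) in GtkSourceView 3.x
    # and appends a has_wrapped_around flag in later versions, so the result
    # is indexed rather than unpacked.
    result = context.forward(source_buffer.get_start_iter())
    found, match_start, match_end = result[0], result[1], result[2]
    if found:
        source_buffer.select_range(match_start, match_end)
    return found
```

With the names from the question, the call would look like select_first_regex_match(source_view.get_buffer(), r'/file_name\S*'). To visit every match, call forward() again starting from the previous match_end.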

Related

RegEx. Get the value from the quotes and check for the attribute name [duplicate]

What would be a quick way to extract the value of the title attributes for an HTML table:
...
<li>Proclo</li>
<li>Proclus</li>
<li>Ptolemy</li>
<li>Pythagoras</li></ul><h3>S</h3>
...
so it would return Proclo, Proclus, Ptolemy, Pythagoras,.... in strings for each line. I'm reading the file using a StreamReader. I'm using C#.
Thank you.
This C# regex will find all title values:
(?<=\btitle=")[^"]*
The C# code is like this:
Regex regex = new Regex(@"(?<=\btitle="")[^""]*");
// Match() returns only the first match; Matches() returns all of them.
foreach (Match match in regex.Matches(input))
    Console.WriteLine(match.Value);
The regex uses positive lookbehind to find the position where the title value starts. It then matches everything up to the ending double quote.
Use the regexp below
title="([^"]+)"
and then use Groups to browse through matched elements.
EDIT: I have modified the regexp to cover the examples provided in the comment by @Staffan Nöteberg

R Wildcard in the middle of an expression

I want to use the pattern expression in R to find files in my directory that match "ReportName*.HTML". Meaning that I only want to find files with certain file names and extensions, but there are dynamic characters between.
Here's an example: I want to find all reports that begin with "2016 Operations" but end with the extension ".HTML". Currently I am trying:
files.control <- dir(path, pattern="^2016 Operations*.HTML$")
Why doesn't this work? I like the one line of code; it's so simple.
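The "ReportName*.HTML" syntax is called a glob, not a regular expression. In a regex, s* means zero or more s characters and . matches any single character, so your pattern only matches names consisting of "2016 Operation", any number of s characters, one arbitrary character, and then HTML. R supports globs via Sys.glob; the following returns a character vector of the filenames in the current directory that start with ReportName and end with .HTML: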
Sys.glob("ReportName*.HTML")
The R function glob2rx will translate globs to regular expressions so this does the same thing:
dir(pattern = glob2rx("ReportName*.HTML"))
We can discover the regular expression associated with a glob like this:
glob2rx("ReportName*.HTML")
## [1] "^ReportName.*\\.HTML$"
and you can find more information on regular expressions from within R via help using ?regex and more info at the links near the bottom of this page: https://code.google.com/archive/p/gsubfn/

String formatting in Xpath expression

I have a function as follows. I need to find all the links that contain a particular search term:
def parse(search_term):
    response.xpath("//a[contains(.,search_term)]/@href").extract()
I believe the above code gives me all the anchor links regardless of search_term.
If I replace search_term with "Energy" or any other string literal, it gives the expected result, e.g.:
def parse(search_term):
    response.xpath("//a[contains(.,'Energy')]/@href").extract()
The above code gives me the links that have 'Energy' in their text.
Is this a string formatting issue?
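XPath expressions are regular Python strings, so the name search_term inside the string is not replaced by the variable's value. Inside the XPath, the bare name search_term is read as a child element name; no such element exists, so it evaluates to an empty string, and contains() is true for every link. You have to interpolate the value explicitly:
def parse(search_term):
    response.xpath("//a[contains(.,'{}')]/@href".format(search_term)).extract()
Note that this only works for strings that contain no ' characters; if the search term contains one, you will need to quote it differently.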
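One possible way to handle search terms that contain quotes is sketched below. XPath 1.0 string literals have no escape character, so the usual workaround is to pick the quote style the value does not use, or to fall back to concat() when it uses both. The helper names xpath_string_literal and build_link_xpath are hypothetical, introduced only for this example.

```python
def xpath_string_literal(value):
    """Return an XPath 1.0 expression that evaluates to the given string."""
    if "'" not in value:
        return "'{}'".format(value)
    if '"' not in value:
        return '"{}"'.format(value)
    # Both quote styles occur: split on single quotes and glue the pieces
    # back together with concat(), passing each single quote as "'".
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join("'{}'".format(p) for p in parts) + ")"


def build_link_xpath(search_term):
    """Build the XPath from the question with the term safely quoted."""
    return "//a[contains(., {})]/@href".format(xpath_string_literal(search_term))


print(build_link_xpath("Energy"))
print(build_link_xpath("O'Neil"))
```

The resulting string can be passed to response.xpath() in place of the hand-formatted expression.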

What is the regex required to find specific urls within content from a list of urls generated by a for loop?

As I write this I realise there are two parts to this question, however I think I am only really stuck on the first part and therefore the second is only provided for context:
Part A:
I need to search the contents of each value returned by a for loop (where each value is a url) for the following:
href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-n"
where:
the numerals 163472311232 could be any length (ie it could be 5478)
-text-text-text-text could be any number of different words
where page-n could be from page-2 up until any number
where matches are not returned more than once, ie only unique matches are returned and therefore only one of the following would be returned:
href="/dir/Sub_Dir/dir/5422-la-la/page-4
href="/dir/Sub_Dir/dir/5422-la-la/page-4
Part B:
So the logic would be something like:
list_of_urls = original_list
for url in list_of_urls:
    headers = {'User-Agent' : 'Mozilla 5.0'}
    request = urllib2.Request(url, None, headers)
    url_for_re = urllib2.urlopen(request).read()
    another_url = re.findall(r'href="(/dir/Sub_dir\/dir/[^"/]*)"', url_for_re, re.I)
    file.write(url)
    file.write('\n')
    file.write(another_url)
    file.write('\n')
Which I am hoping will give me output similar to:
a.html
a/page-2.html
a/page-3.html
a/page-4.html
b.html
b/page-2.html
b/page-3.html
b/page-4.html
So my question is (assuming the logic in part B is ok):
What is the required regex pattern to use for part A?
I am a newbie to Python and regex, so my understanding of relatively complicated regex suggestions will be limited.
Update:
After the suggestions, I tried to test the following regex, which did not produce any results:
import re
content = 'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"'
matches = re.findall(r'href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"', content, re.I)
prefix = 'http://www.test.com'
for match in matches:
    i = prefix + match + '\n'
    print i
Solution:
I think this is the regex that will work:
matches = re.findall(r'href="(/dir/Sub_Dir/dir/[^"/]*/page-[2-9])"', content, re.I)
You can have... most of what you want. Regexes don't really do the distinct thing, so I suggest you just use them to get all the URLs, and then remove duplicates yourself.
Off the top of my head it would be something like this:
href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"
Plus or minus escaping rules, specifics on what words are allowed, etc. I'm a Windows guy, there's a great tool called Expresso which is helpful for learning regexes. I hope there's an equivalent for whatever platform you're using, it comes in handy.
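The "get all the URLs, then remove duplicates yourself" suggestion could look roughly like the sketch below. It is written for Python 3.7 or later (the question uses Python 2), and the pattern is an assumption based on the question's description: digits, a dash, any text up to the next slash or quote, then page-N with N of 2 or more. The names LINK_RE, unique_links and sample are made up for the example.

```python
import re

# Assumed pattern: capture the path inside href="...", requiring page-2 or higher.
LINK_RE = re.compile(
    r'href="(/dir/Sub_Dir/dir/[0-9]+-[^"/]+/page-(?:[2-9]|[1-9][0-9]+))"',
    re.IGNORECASE,
)


def unique_links(html):
    """Return matching paths in order of first appearance, without duplicates."""
    # dict.fromkeys keeps insertion order and drops repeated keys.
    return list(dict.fromkeys(LINK_RE.findall(html)))


sample = (
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-4">4</a>'
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-4">4</a>'
    '<a href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-12">12</a>'
    '<a href="/dir/Sub_Dir/dir/5422-la-la/page-1">1</a>'
)
print(unique_links(sample))
```

Because the whole path is inside the capture group, each result can be appended directly to a site prefix.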

How do I use regular expressions in Jinja2?

I'm new to Jinja2 and so far I've been able to do most of what I want. However, I need to use regular expressions and I can't seem to find anything anywhere in the documentation or on teh Googles.
I'd like to create a macro that mimics the behavior of this in Javascript:
function myFunc(str) {
    return str.replace(/someregexhere/, '').replace(' ', '_');
}
which will remove characters in a string and then replace spaces with underscores. How can I do this with Jinja2?
There is an already existing filter called replace that you can use if you don't actually need a regular expression. Otherwise, you can register a custom filter:
{# Replace method #}
{{my_str|replace("some text", "")|replace(" ", "_")}}
# Custom filter method
import re
def regex_replace(s, find, replace):
    """A non-optimal implementation of a regex filter"""
    return re.sub(find, replace, s)
# Register the filter so templates can use {{ my_str|regex_replace(find, replace) }}
jinja_environment.filters['regex_replace'] = regex_replace