Regex Capturing Groups in Vala - regex

Is there such a thing? I've been looking around the Vala API and the Regex object seems to have no support for capturing groups so that I can reference them later. Is there currently any way to get around this apparent limitation? Say I'm parsing a string out of a group of strings (the contents of a file) for a given pattern like:
parameter = value
But I want the syntax to be lax so that it could also say parameter=value or parameter = value etc... The first idea that springs to mind is using regular expressions with capturing groups but there seems to be no support for this feature as a part of Vala right now, as far as I can see.
The only alternative I can come up with is splitting the string with a regular expression that matches whitespace so that I end up with an array I can analyze, but then again the file might not contain only "parameter = value"-like formatted lines.

It goes something like this. Disclaimer: this is off the top of my head.
// Lazy quantifiers keep the surrounding whitespace out of the captured groups.
Regex r = /^\s*(?P<parameter>.*?)\s*=\s*(?P<value>.*?)\s*$/;
MatchInfo info;
if (r.match(the_string, 0, out info)) {
    var parameter = info.fetch_named("parameter");
    var value = info.fetch_named("value");
}

Related

Regular expression that contains one expression yet doesn't contain the other

We are currently matching "service_hub*queue"
I want to ignore the case "service_hub_scout_dead_queue" and yet still match everything else.
What is the regular expression for that ?
This JavaScript solution gives an array with the matches:
var myText = 'service_hub_anything_queue Add service_hub_scout_dead_queue something service_hub_someting_queue else';
var myMatches = myText.match(/service_hub(?!_scout_dead_)\w+queue/g);
If you are rather interested in what follows a match
var mySplit = ('dummy'+myText).split(/service_hub(?!_scout_dead_)\w+queue/g).filter(function(txt,i) {return (i>0);})
I prepend 'dummy' and then filter away the first part so that it works both when the string starts with a valid tag and when it does not.
Using negative lookbehind: "service_hub_.*?(?<!_scout_dead)_queue"
This appears to be widely supported by popular regex engines; I've tested with Java (or Scala, rather) just to make sure it works.
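For comparison, here is a minimal Python sketch of both approaches, run against the sample text from the JavaScript answer. One deliberate difference from the lookbehind pattern above: the sketch uses \w*? instead of .*?, on the assumption that several tags can appear in one string. In that case a lazy .*? can run past the excluded tag and match through to the next _queue.

```python
import re

# Sample text taken from the JavaScript answer above.
text = ('service_hub_anything_queue Add service_hub_scout_dead_queue '
        'something service_hub_someting_queue else')

# Negative lookahead: reject the tag right after "service_hub".
lookahead = re.compile(r'service_hub(?!_scout_dead_)\w+queue')

# Negative lookbehind: reject "_queue" when it is preceded by "_scout_dead".
# \w*? keeps the match inside a single tag.
lookbehind = re.compile(r'service_hub_\w*?(?<!_scout_dead)_queue')

lookahead_matches = lookahead.findall(text)
lookbehind_matches = lookbehind.findall(text)
print(lookahead_matches)
print(lookbehind_matches)
```

Both patterns skip service_hub_scout_dead_queue and return the other two tags.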

regex to match everything except character

I have a payload that contains the following:
����\�p�a�t�r�i�c�k�-�t�e�s�t�-�f�i�l�e�.�t�x�t������x�SMB2
I'm looking to extract the file name of patrick-test-file.txt
I can get close by using this, but it continues to include everything (including ascii characters)
[\\\\](.*?)x�SMB2
With a result of this: �p�a�t�r�i�c�k�-�t�e�s�t�-�f�i�l�e�.�t�x�t������ for the capture group.
How would I just match the characters of the file name, which could be anything of variable length, and could contain alphanumeric characters? Is this possible with pure regex?
Any help is much appreciated.
Sometimes you just can't do a single language-agnostic Regular Expression to accomplish something. And sometimes (usually) it is more performant to do a series of string functions.
I wouldn't personally accept any solution which has hard-coded values, such as x�SMB2.
If you want to use Regular Expressions only, you can first select the File-Name portion like so: (([-\w\d.\\]+)[^-\w\d.\\]?)+, then go ahead and replace [^-\w\d.\\] with nothing "".
Honestly, given the limited detail, the best function is like so:
var fileName = @"\patrick-test-file.txt";
But half-joking aside, and with that limited detail, your best bet is to do a couple string functions:
var yuckyString = @"����\�p�a�t�r�i�c�k�-�t�e�s�t�-�f�i�l�e�.�t�x�t������x�SMB2";
var fileNameArea = yuckyString.Split(new[] { "��" }, StringSplitOptions.RemoveEmptyEntries)[0];
var fileName = fileNameArea.Replace("�", "");
Granted, there was no language listed, so I'm using C#. Also, the answer would change if there were irregularities with those special characters. With the limited info, the pattern seems clear.
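If the unprintable characters are NUL bytes left over from UTF-16LE encoding (an assumption, since the question does not show the raw bytes), a regex over the bytes can do the job without hard-coding the trailing marker. The sketch below builds a hypothetical payload under that assumption; extract_filename and NAME_RE are illustrative names.

```python
import re

# Hypothetical payload: assumes the unprintable characters in the question
# are NUL bytes produced by UTF-16LE encoding of the file name.
payload = (
    b"\x00\x00\x00\x00"
    + "\\patrick-test-file.txt".encode("utf-16-le")
    + b"\x00\x00\x00\x00\x00x\x00SMB2"
)

# A backslash (with its NUL high byte), then one or more file-name
# characters, each followed by a NUL high byte.
NAME_RE = re.compile(rb"\\\x00((?:[\w.\-]\x00)+)")


def extract_filename(data: bytes):
    """Return the file name after the first backslash, or None if absent."""
    match = NAME_RE.search(data)
    if match is None:
        return None
    # Decoding as UTF-16LE drops the interleaved NUL bytes.
    return match.group(1).decode("utf-16-le")


print(extract_filename(payload))
```

The capture stops at the first character pair that is not a file-name character followed by NUL, so the variable-length name is handled without knowing what comes after it.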

How to make a regex to replace the value of a key in a json file

I want to make a regex so I can do a "Search/Replace"
over a json file with many object.
Every object has a key named "resource"
containing a URL.
Take a look at these examples:
"resource":"http://www.img/qwer/123/image.jpg"
"resource":"io.nl.info/221/elephant.gif"
"resource":"simgur.com/icon.png"
I want to make a regex to replace the whole url with
a string like this: img/filename.format.
This way, the result would be:
"resource":"img/image.jpg"
"resource":"img/elephant.gif"
"resource":"img/icon.png"
I'm just starting with regular expressions and I'm
completely lost. I was thinking that one valid idea would
be to write something starting with this pattern "resource":"
and ending with the last five characters. But I don't even know how to try
that.
How could I write the regular expression?
Thanks in advance!
Try this:
Find: "resource":\s*"[^"]+?([^\/"]+)"
Replace: "resource":"img/\1"
Using [^"]+? ensures the match won't run past the end of the current entry and gobble up too much input, and it's reluctant (with the added ?) so the group captures the whole image file name (instead of just its last character).
Edit:
I added optional whitespace after the key, which your pastebin has.
See a live demo of this regex with your pastebin.
Regex
.*\/
This will find the text you want to replace; replace it with img/. If you want to match the whole text, including the key, use the following regex instead:
("resource":").*\/
Then replace with $1img/, which gives you group 1 followed by the img/ part.
Let me know if there are any questions
Note: since you have JSON, I personally would just parse it into objects, iterate over them, and change each resource independently, rather than looking for a magic bullet.
If your JSON is an array of objects containing a resource field, I would do it in 3 steps: convert to objects, find the resources and replace them, convert back to a string (optional).
var tmp = JSON.parse('<your json>');
for (var i = 0; i < tmp.length; ++i) {
    for (var e in tmp[i]) {
        if (e == 'resource') {
            // Drop everything up to the last slash, then prepend img/
            tmp[i][e] = 'img/' + tmp[i][e].replace(/.*\//, '');
        }
    }
}
tmp = JSON.stringify(tmp);
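The same three steps can be sketched in Python with the standard json module. This assumes, as above, that the document is an array of objects; rewrite_resources and the prefix parameter are illustrative names, not part of any answer.

```python
import json


def rewrite_resources(json_text: str, prefix: str = "img/") -> str:
    """Parse, rewrite every 'resource' value, and serialize again."""
    objects = json.loads(json_text)
    for obj in objects:
        if "resource" in obj:
            # Keep only the part after the last slash (the file name).
            filename = obj["resource"].rsplit("/", 1)[-1]
            obj["resource"] = prefix + filename
    return json.dumps(objects)


sample = json.dumps([
    {"resource": "http://www.img/qwer/123/image.jpg"},
    {"resource": "io.nl.info/221/elephant.gif"},
    {"resource": "simgur.com/icon.png", "other": 1},
])
result = json.loads(rewrite_resources(sample))
print(result)
```

Working on parsed objects avoids any dependence on whitespace or key order in the file, which a text-level search/replace has to guess at.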

What is the regex required to find specific urls within content from a list of urls generated by a for loop?

As I write this I realise there are two parts to this question, however I think I am only really stuck on the first part and therefore the second is only provided for context:
Part A:
I need to search the contents of each value returned by a for loop (where each value is a url) for the following:
href="/dir/Sub_Dir/dir/163472311232-text-text-text-text/page-n"
where:
the numerals 163472311232 could be any length (ie it could be 5478)
-text-text-text-text could be any number of different words
where page-n could be from page-2 up until any number
where matches are not returned more than once, ie only unique matches are returned and therefore only one of the following would be returned:
href="/dir/Sub_Dir/dir/5422-la-la/page-4
href="/dir/Sub_Dir/dir/5422-la-la/page-4
Part B:
So the logic would be something like:
list_of_urls = original_list
for url in list_of_urls:
    headers = {'User-Agent' : 'Mozilla 5.0'}
    request = urllib2.Request(url, None, headers)
    url_for_re = urllib2.urlopen(request).read()
    another_url = re.findall(r'href="(/dir/Sub_dir\/dir/[^"/]*)"', url_for_re, re.I)
    file.write(url)
    file.write('\n')
    file.write(another_url)
    file.write('\n')
Which i am hoping will give me output similar to:
a.html
a/page-2.html
a/page-3.html
a/page-4.html
b.html
b/page-2.html
b/page-3.html
b/page-4.html
So my question is (assuming the logic in part B is ok):
What is the required regex pattern to use for part A?
I am a newbie to python and regex so this will limit my understanding somewhat in regards to relatively complicated regex suggestions etc.
update:
after suggestions i tried to test the following regex which did not produce any results:
import re
content = 'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2"'
matches = re.findall(r'href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"', content, re.I)
prefix = 'http://www.test.com'
for match in matches:
    i = prefix + match + '\n'
    print i
solution:
i think this is the regex that will work:
matches = re.findall(r'href="(/dir/Sub_Dir/dir/[^"/]*/page-[2-9])"', content, re.I)
You can have... most of what you want. Regexes don't really do the distinct thing, so I suggest you just use them to get all the URLs, and then remove duplicates yourself.
Off the top of my head it would be something like this:
href="/dir/Sub_Dir/dir/[0-9]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+-[a-zA-Z]+/page-([2-9]|[1-9][0-9]+)"
Plus or minus escaping rules, specifics on what words are allowed, etc. I'm a Windows guy, there's a great tool called Expresso which is helpful for learning regexes. I hope there's an equivalent for whatever platform you're using, it comes in handy.
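Putting the two ideas together (match every link, then drop duplicates yourself), a Python 3 sketch might look like this. It assumes the words after the number are purely alphabetic and allows any number of them; HREF_RE and unique_page_links are made-up names, and the sample HTML is invented for the demonstration.

```python
import re

# One capturing group for the whole path; the inner groups are
# non-capturing so findall returns plain strings.
HREF_RE = re.compile(
    r'href="(/dir/Sub_Dir/dir/[0-9]+(?:-[a-zA-Z]+)+/page-(?:[2-9]|[1-9][0-9]+))"',
    re.I,
)


def unique_page_links(html: str) -> list:
    """Return matching paths in first-seen order, without duplicates."""
    return list(dict.fromkeys(HREF_RE.findall(html)))


html = (
    'href="/dir/Sub_Dir/dir/5648342378-text-texttttt-texty-text-text/page-2" '
    'href="/dir/Sub_Dir/dir/5422-la-la/page-4" '
    'href="/dir/Sub_Dir/dir/5422-la-la/page-4" '
    'href="/dir/Sub_Dir/dir/5422-la-la/page-12" '
    'href="/dir/Sub_Dir/dir/5422-la-la/page-1"'
)
links = unique_page_links(html)
for link in links:
    print('http://www.test.com' + link)
```

dict.fromkeys keeps insertion order, so the duplicates disappear without reshuffling the results; page-1 is skipped because the page number must be 2 or higher.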

Article spinner with 2 tiers

I made an article spinner that used regex to find words in this syntax:
{word1|word2}
And then split them up at the "|", but I need a way to make it support tier 2 brackets, such as:
{{word1|word2}|{word3|word4}}
What my code does when presented with such a line, is take "{{word1|word2}" and "{word3|word4}", and this is not as intended.
What I want is when presented with such a line, my code breaks it up as "{word1|word2}|{word3|word4}", so that I can use this with the original function and break it into the actual words.
I am using c#.
Here is the pseudo code of how it might look like:
Check string for regex match to "{{word1|word2}|{word3|word4}}" pattern
If found, store each one as "{word1|word2}|{word3|word4}" in MatchCollection (mc1)
Split the word at the "|" but not the one inside the brackets, and select a random one (aka, "{word1|word2}" or "{word3|word4}")
Store the new results aka "{word1|word2}" and "{word3|word4}" in a new MatchCollection (mc2)
Now search the string again, this time looking for "{word1|word2}" only and ignore the double "{{" "}}"
Store these in mc2.
I can not split these up normally
Here is the regex I use to search for "{word1|word2}":
Regex regexObj = new Regex(@"\{.*?\}", RegexOptions.Singleline);
MatchCollection m = regexObj.Matches(originalText); //How I store them
Hopefully someone can help, thanks!
Edit: I solved this using a recursive method. I was building an article spinner btw.
That is not parsable using a regular expression; instead you have to use a recursive descent parser. Map it to JSON by replacing:
{ with [
} with ]
| with ,
wordX with "wordX" (regex \w+)
Then your input
{{word1|word2}|{word3|word4}}
becomes valid JSON
[["word1","word2"],["word3","word4"]]
and will map directly to PHP arrays when you call json_decode.
In C#, the same should be possible with JavaScriptSerializer.
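The mapping described above can be tried out quickly in Python. This is only a sketch: it assumes every option is a single \w+ word, as in the example, and spintax_to_lists is an illustrative name.

```python
import json
import re


def spintax_to_lists(text: str):
    """Turn {a|b} syntax into nested Python lists via JSON."""
    # Quote every word first, then swap the delimiters.
    quoted = re.sub(r"\w+", lambda m: json.dumps(m.group(0)), text)
    as_json = quoted.replace("{", "[").replace("}", "]").replace("|", ",")
    return json.loads(as_json)


nested = spintax_to_lists("{{word1|word2}|{word3|word4}}")
print(nested)
```

Options containing spaces or punctuation would need a different quoting step, since \w+ would split them into several strings.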
I'm really not completely sure WHAT you're asking for, but I'll give it a go:
If you want to get {word1|word2}|{word3|word4} out of any occurrence of {{word1|word2}|{word3|word4}} but not {word1|word2} or {word3|word4}, then use this:
@"\{(\{[^}]*\}\|\{[^}]*\})\}"
...which will match {{word1|word2}|{word3|word4}}, but with {word1|word2}|{word3|word4} in the first matching group.
I'm not sure if this will be helpful or even if it's along the right track, but I'll try to check back every once in a while for more questions or clarifications.
s = "{Spinning|Re-writing|Rotating|Content spinning|Rewriting|SEO Content Machine} is {fun|enjoyable|entertaining|exciting|enjoyment}! try it {for yourself|on your own|yourself|by yourself|for you} and {see how|observe how|observe} it {works|functions|operates|performs|is effective}."
print spin(s)
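The spin function called above is not shown here. One possible way to write it, as a Python 3 sketch, is to resolve the innermost {a|b} groups first and repeat until no braces remain. This is an iterative take on the recursive idea mentioned in the question's edit, not the original author's code; spin, INNERMOST and the rng parameter are illustrative names.

```python
import random
import re

# A group with no braces inside it, i.e. an innermost group.
INNERMOST = re.compile(r"\{([^{}]*)\}")


def spin(text: str, rng=random) -> str:
    """Replace every {a|b|c} group, however deeply nested, with one option."""
    while True:
        # Pick one option for each innermost group in this pass.
        text, count = INNERMOST.subn(
            lambda m: rng.choice(m.group(1).split("|")), text
        )
        if count == 0:
            return text


print(spin("{{word1|word2}|{word3|word4}} is {fun|enjoyable}!"))
```

Each pass removes one level of nesting, so two-tier input such as {{word1|word2}|{word3|word4}} takes two passes and deeper nesting needs no extra code.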
If you want to use the [square|brackets|syntax] use this line in the process function:
'/\[(((?>[^\[\]]+)|(?R))*)\]/x',