Using multiple Perl regular expressions to find and replace - regex

I'm a Perl and regex newcomer in need of your expertise.
I need to process text files that include placeholder lines like Foo Bar1.jpg and replace those with with corresponding URLs like https:/baz/qux/Foo_Bar1.jpg.
As you may have guessed, I'm working with HTML. The placeholder text refers to the filename, which is the only thing available when writing the document. That's why I have to use placeholder text. Ultimately, of course, I want to replace the filename with the URL (after I upload file to my CMS to get the URL). At that point, I have all of the information at hand — the filename and the URL. Of course, I could just paste the URLs over the placeholder names in the HTML document. In fact, I've done that. But I'm certain that there's a better way.
In short, I have placeholder lines like this:
Foo Bar1.jpg
Foo Bar2.jpg
Foo Bar3.jpg
And I also have URL lines like this:
https:/baz/qux/Foo_Bar1.jpg
https:/baz/qux/Foo_Bar2.jpg
https:/baz/qux/Foo_Bar3.jpg
I want to find the placeholder string and capture a differentiator like Bar1 with a regex. Then I want to use the captured part like Bar1 to perform another regex search that matches part of the corresponding URL string, i.e. https:/baz/qux/Foo_Bar1.jpg. After a successful match, I want to replace the Foo Bar1.jpg line with https:/baz/qux/Foo_Bar1.jpg.
Ultimately, I want to do that for every permutation, so that https:/baz/qux/Foo_Bar2.jpg also replaces Foo Bar2.jpg and so on.
I've written regular expressions that match both the placeholder and the URL. That's not my problem, as far as I can tell. I can find the strings I need to process. For example, /[a-z]+\s([a-z0-9]+)\.jpg/ successfully matches what I'm calling the placeholder text and captures what I'm calling the differentiator.
However, though I've spent an embarrassing number of hours over the past week reading through Stack Overflow, various other sites and O'Reilly books on Pearl and Pearl Regular Expressions, I can't wrap my mind around how to process what I can find.

I think the piece you are missing is the idea of using Perl's internal grep function, for searching a list of URL lines based on what you are calling your "differentiator".
Slurp your URL lines into a Perl array (assuming there are a finite manageable number of them, so that memory is not clobbered):
open URLS, theUrlFile.txt or die "Cannot open.\n";
my #urls = <URLS>;
Then within the loop over your file containing "placeholders":
while (my $key = /[a-z]+\s([a-z0-9]+)\.jpg/g) {
my #matches = grep $key, #urls;
if (#matches) {
s/[a-z]+\s$key\.jpg/$matches[0]/;
}
}
You may also want to insert error/warning messages if #matches != 1.

Related

Powershell: Using regex to match patterns and its variations (which has special characters)

I have several thousand text files containing form information (one text file for each form), including the unique id of each form.
I have been trying to extract just the form id using regex (which I am not too familiar with) to match the string of characters found before and after the form id and extract only the form ID number in between them. Usually the text looks like this: "... 12 ID 12345678 INDEPENDENT BOARD..."
The bolded 8-digit number is the form ID that I need to extract.
The code I used can be seen below:
$id= ([regex]::Match($text_file, "12 ID (.+) INDEPENDENT").Groups[1].Value)
This works pretty well, but I soon noticed that there were some files for which this script did not work. After investigation, I found that there was another variation to the text containing the form ID used by some of the text files. This variation looks like this: "... 12 ID 12345678 (a.12(3)(b),45)..."
So my first challenge is to figure out how to change the script so that it will match the first or the second pattern. My second challenge is to escape all the special characters in "(a.12(3)(b),45)".
I know that the pipe | is used as an "or" in regex and two backslashes are used to escape special characters, however the code below gives me errors:
$id= ([regex]::Match($text_one_line, "34 PR (.+) INDEPENDENT"|"34 PR (.+) //(a//.12//(3//)//(b//)//,45//)").Groups[1].Value)
Where have I gone wrong here and how I can fix my code?
Thank you!
When you approach a regex pattern always look for fixed vs. variable parts.
In your case the ID seems to be fixed, and it is, therefore, useful as a reference point.
The following pattern applies this suggestion: (?:ID\s+)(\d{8})
(click on the pattern for an explanation).
$str = "... 12 ID 12345678 INDEPENDENT BOARD..."
$ret = [Regex]::Matches($str, "(?:ID\s+)(\d{8})")
for($i = 0; $i -lt $ret.Count; $i++) {
$ret[0].Groups[1].Value
}
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. It contains a treasure trove of useful information.

Powershell with regex: Unable to find and replace ALL occurences of specified string in a set of data

I am new to regular expressions and stackoverflow. Any help would be greatly appreciated.
I am trying to remove unwanted data from a data set. The data is contained in a .csv file column with multiple cells, each cell containing data similar to this:
OSVDB #109124,OSVDB #109125,OSVDB #109126,OSVDB #109127,OSVDB #109128,OSVDB #109129,OSVDB #109130,OSVDB #109131,OSVDB #109132,OSVDB #109133,OSVDB #109134,OSVDB #109135,OSVDB #109136,OSVDB #109137,OSVDB #109138,OSVDB #109139,OSVDB #109140,OSVDB #109141,OSVDB #109142,OSVDB #109143,VMSA #2014-0012,OSVDB #102715,OSVDB #104972,OSVDB #106710,OSVDB #115364,IAVA #2014-A-0191,IAVB #2014-B-0160,IAVB #2014-B-0162,IAVB #2015-B-0007
I want to replace the above data with each occurrence of the strings beginning "IAV...". So, the above cell would read:
IAVA #2014-A-0191,IAVB #2014-B-0160,IAVB #2014-B-0162,IAVB #2015-B-0007
Below is a snippet of the script that imports the .csv and gets the column containing the data.
My regex, within powershell is:
$reg1 = '$1'
$reg2 = '(IAV[A|B]\s#[0-9]{4}-[A|B]-[0-9]{4}){1,}'
ForEach-Object {$_.IAVM = [regex]::replace($_.IAVM,$reg2,$reg1); $_}
The result is:
The entire cell contents posted above.
From my understanding {1,} at the end of the regex should return each occurrence of the string pattern, but I'm returning all contents of every cell containing my regex string.
Maybe instead of trying to pick out your string you just delete the stuff you don't want? Try something like:
$reg1=''
$reg2='((OSVDB|VMSA)\s#[M-S0-9-]{6,9}[,]?)'
You have .* in that regex at the very beginning. This will capture everything up to the last match of the pat that follows it. In your case I don't think you need that part anyway.
Also note that PowerShell has a handy -replace operator, so there's often no reason to use the static methods on the Regex type.

Article spinner with 2 tiers

I made an article spinner that used regex to find words in this syntax:
{word1|word2}
And then split them up at the "|", but I need a way to make it support tier 2 brackets, such as:
{{word1|word2}|{word3|word4}}
What my code does when presented with such a line, is take "{{word1|word2}" and "{word3|word4}", and this is not as intended.
What I want is when presented with such a line, my code breaks it up as "{word1|word2}|{word3|word4}", so that I can use this with the original function and break it into the actual words.
I am using c#.
Here is the pseudo code of how it might look like:
Check string for regex match to "{{word1|word2}|{word3|word4}}" pattern
If found, store each one as "{word1|word2}|{word3|word4}" in MatchCollection (mc1)
Split the word at the "|" but not the one inside the brackets, and select a random one (aka, "{word1|word2}" or "{word3|word4}")
Store the new results aka "{word1|word2}" and "{word3|word4}" in a new MatchCollection (mc2)
Now search the string again, this time looking for "{word1|word2}" only and ignore the double "{{" "}}"
Store these in mc2.
I can not split these up normally
Here is the regex I use to search for "{word1|word2}":
Regex regexObj = new Regex(#"\{.*?\}", RegexOptions.Singleline);
MatchCollection m = regexObj.Matches(originalText); //How I store them
Hopefully someone can help, thanks!
Edit: I solved this using a recursive method. I was building an article spinner btw.
That is not parsable using a regular expression, instead you have to use a recursive descent parser. Map it to JSON by replacing:
{ with [
| with ,
wordX with "wordX" (regex \w+)
Then your input
{{word1|word2}|{word3|word4}}
becomes valid JSON
[["word1","word2"],["word3","word4"]]
and will map directly to PHP arrays when you call json_decode.
In C#, the same should be possible with JavaScriptSerializer.
I'm really not completely sure WHAT you're asking for, but I'll give it a go:
If you want to get {word1|word2}|{word3|word4} out of any occurrence of {{word1|word2}|{word3|word4}} but not {word1|word2} or {word3|word4}, then use this:
#"\{(\{[^}]*\}\|\{[^}]*\})\}"
...which will match {{word1|word2}|{word3|word4}}, but with {word1|word2}|{word3|word4} in the first matching group.
I'm not sure if this will be helpful or even if it's along the right track, but I'll try to check back every once in a while for more questions or clarifications.
s = "{Spinning|Re-writing|Rotating|Content spinning|Rewriting|SEO Content Machine} is {fun|enjoyable|entertaining|exciting|enjoyment}! try it {for yourself|on your own|yourself|by yourself|for you} and {see how|observe how|observe} it {works|functions|operates|performs|is effective}."
print spin(s)
If you want to use the [square|brackets|syntax] use this line in the process function:
'/[(((?>[^[]]+)|(?R))*)]/x',

How to Regex Multiple URLs From Same Variable In Perl

I'm trying to search a field in a database to extract URLs. Sometimes there will be more than 1 URL in a field and I would like to extract those in to separate variables (or an array).
I know my regex isn't going to cover all possibilities. As long as I flag on anything that starts with http and ends with a space I'm ok.
The problem I'm having is that my efforts either seem to get only 1 URL per record or they get only 1 the last letter from each URL. I've tried a couple different techniques based on solutions other have posted but I haven't found a solution that works for me.
Sample input line:
Testing http://marko.co http://tester.net Just about anything else you'd like.
Output goal
$var[0] = http://marko.co
$var[1] = http://tester.net
First try:
if ( $status =~ m/http:(\S)+/g ) {
print "$&\n";
}
Output:
http://marko.co
Second try:
#statusurls = ($status =~ m/http:(\S)+/g);
print "#statusurls\n";
Output:
o t
I'm new to regex, but since I'm using the same regex for each attempt, I don't understand why it's returning such different results.
Thanks for any help you can offer.
I've looked at these posts and either didn't find what I was looking for or didn't understand how to implement it:
This one seemed the most promising (and it's where I got the 2nd attempt from, but it didn't return the whole URL, just the letter: How can I store regex captures in an array in Perl?
This has some great stuff in it. I'm curious if I need to look at the URL as a word since it's bookended by spaces: Regex Group in Perl: how to capture elements into array from regex group that matches unknown number of/multiple/variable occurrences from a string?
This one offers similar suggestions as the first two. How can I store captures from a Perl regular expression into separate variables?
Solution:
#statusurls = ($status =~ m/(http:\S+)/g);
print "#statusurls\n";
Thanks!
I think that you need to capture more than just one character. Try this regex instead:
m/http:(\S+)/g

2 VB RegEx Issues

I need some help with a VB RegEx.
I've got two RegEx that I need to do two specific things.
RegEx one - I am not exactly sure how to do this, but I need to get everything within a Href tag. i.e.
String = "<a href=""test.html"">"
I need the RegEx to return .... test.html
RegEx Two - I have partly got this working.
I've got tags like
RegEx = "<div class=""top""(.*?)</div>"
String = "<div class=""top""><a><b><div class=""bottom""></div></b></a></div>"
The problem I have is this isnt returning anything, it should return everything withing "top", but it returns nothing.
Neither use-case can be solved well with regular expressions.
Use an HTML parser instead, e.g. the HTML Agility Pack.
Well, if your html doesn't contain nested tags you can do the first part with regex (as long as you can control your search source code, you can be much more certain of your results).
\<a href=""([^""]+)\>
the test.html will be found in the non-passive group referred to as $1.
The second part I'm concerned that you have nested tags in there and it's failing on that. The thing with regex and html is that regex can't delve well into the nested-allowable-but-not-best-practice code that can execute as expected but isn't well formed.
Can you post some search source for the second case so we can look?