Struggling with regex in Yahoo Pipes

I'm using Yahoo Pipes to build a scraper that scrapes our company micro-site via XPath and generates an RSS feed that I can then embed on the main site.
So far I've managed to scrape the job title and location from the page, but I can't get the items to link out to the micro-site.
Here's my pipe so far: http://pipes.yahoo.com/pipes/pipe.info?_id=2bb5b8fedd0064b64d0e8861e3fc8fd5
I think I need to extract the href link from each node and then apply regex but I really can't get my head around it.
The link looks like this in the code: www2.jobs.badenochandclark.ch/JavaScript:OpenAssignment('a960c93a-11fe-4751-bc27-83a48429c3ba',%20'/Jobs/Details/a960c93a-11fe-4751-bc27-83a48429c3ba');
But I'm struggling to write a regex that would strip out the JavaScript call and leave only the job path, so that the link basically becomes this:
www2.jobs.badenochandclark.ch/Jobs/Details/a960c93a-11fe-4751-bc27-83a48429c3ba
So I'm stuck on how to extract the link and then how to build that into the pipe. Any help or nudge in the right direction would be really appreciated.

Here you go:
http://pipes.yahoo.com/pipes/pipe.info?_id=d564b802185d5777d757ed4189470941
I used slightly less complicated code in the Regex module. It is often easier to erase the text you do not want than to extract the part you want and assign it to a variable.
In plx.link.href, find JavaScript(.+)Jobs and replace it with jobs
In plx.link.href, find \'\); and replace it with nothing (leave the field blank)
The trailing '); needs a backslash before the ) because ) is a regex metacharacter; the backslash makes the regex read it literally as a text character. The ' is not special, so escaping it is harmless but not required.
The pattern a(.+?)b means match (and capture) everything between a and b, stopping at the first b that follows; it comes in handy a lot for this sort of thing.
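For anyone who wants to try the two find/replace rules outside of Pipes, here is a minimal sketch in Python. The function name clean_href is made up for illustration, and Python's re module stands in for the Pipes Regex module, so treat it as an approximation of what the pipe does rather than the pipe itself.

```python
import re

# Sample href copied from the question.
href = ("www2.jobs.badenochandclark.ch/JavaScript:OpenAssignment("
        "'a960c93a-11fe-4751-bc27-83a48429c3ba',%20"
        "'/Jobs/Details/a960c93a-11fe-4751-bc27-83a48429c3ba');")


def clean_href(value: str) -> str:
    """Apply the two find/replace rules from the answer, in order."""
    # Rule 1: erase everything from "JavaScript" through "Jobs", leaving "jobs".
    value = re.sub(r"JavaScript(.+)Jobs", "jobs", value)
    # Rule 2: erase the trailing '); left over from the JavaScript call.
    return re.sub(r"'\);", "", value)


print(clean_href(href))
```

Note that the first rule replaces the matched text with lowercase jobs, so the resulting path starts with /jobs/Details/.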

Full-fledged URL-parsing isn't simple, but given enough constraints it becomes manageable.
For example, if you know
that JavaScript:OpenAssignment( always follows a /,
that the first argument is always a hexadecimal+dashes string in quotes,
that the second argument (at least the portion you need) is also in quotes,
and that you can discard the remainder of URL after the "function,"
then something like this might be a starting point:
\/JavaScript:OpenAssignment\([^'"]*['"][0-9a-fA-F\-]+['"][^,)]*,[^'")]*['"][^'"]*\/([0-9a-fA-F\-]+)['"].*
Then, $1 would contain the match you desire to keep (here, the ID at the end of the second argument). The explanation follows.
\/                          Slashes need to be escaped (usually).
JavaScript:OpenAssignment   Our function of interest.
\(                          Parentheses need to be escaped too.
[^'"]*                      We're looking for a quote next, so ignore any string of non-quotes, e.g. %20.
['"]                        A quote character.
[0-9a-fA-F\-]+              A hexadecimal-and-dashes string.
['"]                        A quote character.
[^,)]*                      We're looking for a comma next, so ignore any string of non-commas, e.g., again, %20.
,                           A comma character.
[^'")]*                     We're looking for a quote again, so ignore any string of non-quotes, e.g. %20.
['"]                        A quote character.
[^'"]*\/                    The part of the path in front of the ID, up to its last slash, e.g. /Jobs/Details/.
([0-9a-fA-F\-]+)            A hexadecimal-and-dashes string, this time captured.
['"]                        A quote character.
.*                          The rest of the string that we don't care about.
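A rough Python sketch of the same capture idea, run against the sample link from the question. Two things here are assumptions for illustration: the name OPEN_ASSIGNMENT, and the /Jobs/Details/ replacement path, which is simply read off the sample link. The pattern also skips the path prefix in the second argument before the captured ID, since in the sample that argument starts with /Jobs/Details/ rather than with the ID itself.

```python
import re

# Verbose mode lets each piece of the pattern carry its own comment.
OPEN_ASSIGNMENT = re.compile(r"""
    /JavaScript:OpenAssignment\(
    [^'"]* ['"] [0-9a-fA-F-]+ ['"]   # first argument: quoted ID
    [^,)]* ,                         # skip to the comma
    [^'")]* ['"]                     # skip to the second argument's quote
    [^'"]* /                         # path prefix, e.g. /Jobs/Details/
    ([0-9a-fA-F-]+) ['"]             # captured ID, then the closing quote
    .*                               # rest of the string
""", re.VERBOSE)

href = ("www2.jobs.badenochandclark.ch/JavaScript:OpenAssignment("
        "'a960c93a-11fe-4751-bc27-83a48429c3ba',%20"
        "'/Jobs/Details/a960c93a-11fe-4751-bc27-83a48429c3ba');")

match = OPEN_ASSIGNMENT.search(href)
job_id = match.group(1) if match else None

# Rebuild the link from the captured ID (the path is an assumption
# taken from the sample link).
link = OPEN_ASSIGNMENT.sub(r"/Jobs/Details/\1", href)
print(job_id)
print(link)
```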

Related

What is the difference between `(\S.*\S)` and `^\s*(.*)\s*$` in regex?

I'm doing the RegexOne regex tutorial and it has a question about writing a regular expression to remove unnecessary whitespace.
The solution provided in the tutorial is
We can just skip all the starting and ending whitespace by not capturing it in a line. For example, the expression ^\s*(.*)\s*$ will catch only the content.
The setup for the question does indicate the use of the hat at the beginning and the dollar sign at the end, so it makes sense that this is the expression that they want:
We have previously seen how to match a full line of text using the hat ^ and the dollar sign $ respectively. When used in conjunction with the whitespace \s, you can easily skip all preceding and trailing spaces.
That said, using \S instead, I was able to come up with what seems like a simpler solution - (\S.*\S).
I've found this Stack Overflow solution that matches the one in the tutorial (Regex Email - Ignore leading and trailing spaces?), and I've seen other guides that recommend the same format, but I'm struggling to find an explanation for why the \S is bad.
Additionally, this validates as correct in their tool... so, are there cases where this would not work as well as the provided solution? Or is the recommended version just a standard format?
The tutorial's solution of ^\s*(.*)\s*$ is wrong. The capture group .* is greedy, so it expands as far as it can, all the way to the end of the line, and captures the trailing spaces too. Because the \s* that follows can match zero characters, the match succeeds without .* ever having to backtrack, so that \s* never consumes anything.
https://regex101.com/r/584uVG/1
Your solution is much better at actually matching only the non-whitespace content in the line, but there are a couple of odd cases in which it behaves differently. (\S.*\S) needs at least two non-whitespace characters, so it won't match a line whose content is a single character, whereas the tutorial's (.*) can capture a single character, or nothing at all if the input is all whitespace.
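A small Python check of these claims, in case it helps to see them side by side. The sample string and variable names are made up for illustration; Python's re module behaves like the regex101 demo for these particular patterns.

```python
import re

line = "   some content   "

# Tutorial pattern: the greedy group swallows the trailing spaces.
tutorial = re.match(r"^\s*(.*)\s*$", line).group(1)

# Forcing the group to end on a non-space character fixes that.
fixed = re.match(r"^\s*(.*\S)\s*$", line).group(1)

# (\S.*\S) needs two non-space characters, so a one-character line fails.
single = re.search(r"(\S.*\S)", " a ")

# The tutorial pattern still matches whitespace-only input, capturing nothing.
blank = re.match(r"^\s*(.*)\s*$", "   ").group(1)

print(repr(tutorial), repr(fixed), single, repr(blank))
```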
But, given the problem description at your link:
Occasionally, you'll find yourself with a log file that has ill-formatted whitespace where lines are indented too much or not enough. One way to fix this is to use an editor's search and replace and a regular expression to extract the content of the lines without the extra whitespace.
From this, matching only the non-whitespace content (like you're doing) probably wouldn't remove the undesirable leading and trailing spaces. The tutorial is probably thinking to guide you towards a technique that can be used to match a whole line with a particular pattern, and then replace that line with only the captured group, like:
Match ^\s*(.*\S)\s*$, replace with $1: https://regex101.com/r/584uVG/2/
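The match-the-whole-line-then-replace-with-the-group technique can be sketched in Python like this. The helper name strip_line_whitespace is hypothetical, and the text is processed one line at a time on purpose: \s also matches newlines, so running the pattern over the whole text in multiline mode could eat line breaks.

```python
import re

LINE = re.compile(r"^\s*(.*\S)\s*$")


def strip_line_whitespace(text: str) -> str:
    """Replace each line with its captured content (group 1)."""
    # Lines that are empty or all whitespace do not match and are left as-is.
    return "\n".join(LINE.sub(r"\1", line) for line in text.split("\n"))


print(strip_line_whitespace("   foo  \n bar\nbaz   "))
```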
Your technique would work given the problem if you had a way to make a new text file containing only the captured groups (or all the full matches), eg:
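// Sample input; each line may carry leading or trailing whitespace.
const input = ` foo
bar
baz
qux `;
// Match from the first to the last non-space character on each line
// (the $ alternative covers a line with a single non-space character).
const newText = (input.match(/\S(?:$|.*\S)/gm) || [])
  .join('\n');
console.log(newText);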
Using \S instead of . is not bad. If you know a particular position must be matched by a non-space character rather than by a space, \S is more precise, makes the intent of the pattern clearer, lets a bad match fail faster, and can avoid catastrophic backtracking in some cases. These patterns don't have backtracking issues, but it's still a good habit to get into.

Adding "/index.html" to paths in Vim

I'm trying to append "/index.html" to some folder paths in a list like this:
path/one/
/another/index.html
other/file/index.html
path/number/two
this/is/the/third/path/
path/five
sixth/path/goes/here/
Obviously, the text only needs to be added where it does not exist yet. I got fairly good results with this Vim command:
:%s/^\([^.]*\)$/\1\/index.html/
The only problem is that after running this command, some lines, like the 1st, 5th and 7th in the previous example, end up with duplicated slashes. That's easy to solve too: all I have to do is search for duplicates and replace them with a single slash.
But the question is:
Isn't there a better way to achieve the correct result at once?
I'm a Vim beginner, and not a regex master either. Any tips are really appreciated!
Thanks!
So very close :)
Just add an optional slash to the end of the regex:
\/\?
Then you need to change the rest of the pattern to a non-greedy match so that it ignores a trailing slash. The syntax for a non-greedy match in vim (replacing the *) is:
\{-}
So we end up with:
:%s/^\([^\.]\{-}\)\/\?$/\1\/index.html/
(The backslash before the period is optional here; inside [] a period is already literal.)
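The same idea (lazy group plus optional trailing slash) can be tried outside Vim. This is only a sketch in Python, where the lazy quantifier is written *? instead of \{-}; the function name add_index_lazy is made up for illustration.

```python
import re

paths = [
    "path/one/",
    "/another/index.html",
    "other/file/index.html",
    "path/number/two",
    "this/is/the/third/path/",
    "path/five",
    "sixth/path/goes/here/",
]


def add_index_lazy(path: str) -> str:
    # The lazy group leaves a trailing slash, if any, to the optional /?.
    # Lines containing a period never match and are returned unchanged.
    return re.sub(r"^([^.]*?)/?$", r"\1/index.html", path)


for p in paths:
    print(add_index_lazy(p))
```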
Vim's regex can match a bit of text foo only if it is, or isn't, preceded or followed by some other text bar, without including bar in the match, and this is exactly the sort of thing you're looking for. Here you want to match the end of the line with an optional /, but only if that end of line isn't preceded by index.html, and then replace it with /index.html. A quick look at Vim's help tells me \@<! is exactly what to use. It tells Vim that the preceding atom must not match just before what follows, and the atom never becomes part of the match. With a little experimentation, I get
:%s;/\?\(index\.html\)\@<!$;/index.html;
I use ; to delimit the parts of the :s command so that I don't have to escape any / in the regex or replacement expression. In this particular situation, it's not a big deal though.
The / is optional, and we say so with \?.
We need to group index.html together because otherwise our special \@<! would only affect the l.
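For comparison, a negative lookbehind version can be sketched in Python, where the construct is spelled (?<!...). The function name add_index_lookbehind is hypothetical. count=1 is there deliberately: it mirrors :s without the g flag, and without it Python would also replace an empty match at the very end of lines like path/one/.

```python
import re


def add_index_lookbehind(path: str) -> str:
    # Match an optional trailing slash at end of line, but only when the
    # end of line is not immediately preceded by "index.html".
    return re.sub(r"/?(?<!index\.html)$", "/index.html", path, count=1)


for p in ["path/one/", "/another/index.html", "path/five"]:
    print(add_index_lookbehind(p))
```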

How to read this command to remove all blanks at the end of a line

I happened across this page full of super useful and rather cryptic vim tips at http://rayninfo.co.uk/vimtips.html. I've tried a few of these, and I understand what is happening well enough to parse them correctly in my head so that I can possibly recreate them later. One thing I'm having a hard time wrapping my head around, though, is the following two commands, which remove all spaces from the end of every line:
:%s= *$== : delete end of line blanks
:%s= \+$== : Same thing
I'm interpreting %s as string replacement on every line in the file, but after that I am getting lost in what looks like some gnarly variation of :s and regex. I'm used to seeing and using :s/regex/replacement. But the above is super confusing.
What do the above commands mean in English, step by step?
The regex delimiters don't have to be slashes, they can be other characters as well. This is handy if your search or replacement strings contain slashes. In this case I don't know why they use equal signs instead of slashes, but you can pretend that the equals are slashes:
:%s/ *$//
:%s/ \+$//
Does that make sense? The first one searches for zero or more spaces, and the second one searches for one or more spaces. Each one is anchored at the end of the line with $. The replacement string is empty, so the spaces are deleted.
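If it helps, the two patterns can be compared outside Vim. This is a sketch in Python (the function names are made up), with + written unescaped because Python's regex syntax does not need the backslash that Vim's does.

```python
import re


def strip_trailing_star(line: str) -> str:
    # Equivalent of :s/ *$// (zero or more spaces at end of line).
    return re.sub(r" *$", "", line)


def strip_trailing_plus(line: str) -> str:
    # Equivalent of :s/ \+$// (one or more spaces at end of line).
    return re.sub(r" +$", "", line)


# Both give the same text; the * version also "matches" (an empty string)
# on lines that have no trailing spaces at all.
star_hits = re.subn(r" *$", "", "foo")[1]
plus_hits = re.subn(r" +$", "", "foo")[1]
print(repr(strip_trailing_star("foo   ")), repr(strip_trailing_plus("foo   ")))
print(star_hits, plus_hits)
```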
I understand your confusion, actually. If you look at :help :s you have to scroll down a few pages before you find this note:
*E146*
Instead of the '/' which surrounds the pattern and replacement string, you
can use any other character, but not an alphanumeric character, '\', '"' or
'|'. This is useful if you want to include a '/' in the search pattern or
replacement string. Example:
:s+/+//+
I do not know vim syntax, but it looks to me like these are sed-style substitution operators. In sed, the / (in s/REGEX/REPLACEMENT/) can be uniformly replaced with any other single character. Here it appears to be =. So if you mentally replace = with /, you'll get
:%s/ *$//
:%s/ \+$//
which should make more sense to you.

Regular expression question

I have some text like this:
dagGeneralCodes$_ctl1$_ctl0
Some text
dagGeneralCodes$_ctl2$_ctl0
Some text
dagGeneralCodes$_ctl3$_ctl0
Some text
dagGeneralCodes$_ctl4$_ctl0
Some text
I want to create a regular expression that extracts the last occurrence of dagGeneralCodes$_ctl[number]$_ctl0 from the text above.
the result should be: dagGeneralCodes$_ctl4$_ctl0
Thanks in advance
Wael
This should do it:
.*(dagGeneralCodes\$_ctl\d+\$_ctl0)
The .* at the front is greedy so initially it will grab the entire input string. It will then backtrack until it finds the last occurrence of the text you want.
Alternatively you can just find all the matches and keep the last one, which is what I'd suggest.
Also, the specifics depend on which language you're doing this in. In Java, for example, you will need to use DOTALL mode so that . matches newlines, because ordinarily it doesn't. Perl and .NET call this single-line mode, while Ruby calls it multiline mode. JavaScript has a slightly different workaround for this, and so on.
You can use:
[\d\D]*(dagGeneralCodes\$_ctl\d+\$_ctl0)
I'm using [\d\D] instead of . to make it match new-line as well. The * is used in a greedy way so that it will consume all but the last occurrence of dagGeneralCodes$_ctl[number]$_ctl0.
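Here is a short Python sketch of the approaches mentioned so far: a greedy prefix with dot-matches-newline, the [\d\D] variant, and simply taking the last of all matches. The variable names are made up, and the sample text is the one from the question.

```python
import re

text = (
    "dagGeneralCodes$_ctl1$_ctl0\nSome text\n"
    "dagGeneralCodes$_ctl2$_ctl0\nSome text\n"
    "dagGeneralCodes$_ctl3$_ctl0\nSome text\n"
    "dagGeneralCodes$_ctl4$_ctl0\nSome text\n"
)
TARGET = r"dagGeneralCodes\$_ctl\d+\$_ctl0"

# Greedy .* with DOTALL: grabs everything, then backtracks to the last hit.
via_dotall = re.search(r".*(" + TARGET + r")", text, re.DOTALL).group(1)

# [\d\D] matches any character, newlines included, without a mode flag.
via_class = re.search(r"[\d\D]*(" + TARGET + r")", text).group(1)

# Find every occurrence and keep the last one.
via_findall = re.findall(TARGET, text)[-1]

# Without DOTALL, .* cannot cross a newline, so the first line wins.
without_dotall = re.search(r".*(" + TARGET + r")", text).group(1)

print(via_dotall, via_class, via_findall, without_dotall)
```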
I really like using this Regular Expression Cheatsheet; it's free, a single page, and printed, fits on my cube wall.

Remove stuff, retrieve numbers, retrieve text with spaces in place of dots, remove the rest

This is my first question, so I hope I didn't mess too much with the title and the formatting.
I have a bunch of files a client of mine sent me in this form:
Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
What I need is a regex to output just:
212 The Actual Title Of the Chapter
I'm not gonna use it with any script language in particular; it's a batch renaming of files through an app supporting regex (which already "preserves" the extension).
So far, all I was able to do was this:
/.*x(\d+)\.(.*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Capture everything before a number preceded by an "x", group the digits after the "x", group everything that follows until a 3-letter uppercase word is met, then capture everything after it)
which gives me back:
212 The.Actual.Title.Of.the.Chapter
Having seen the result I thought that something like:
/.*x(\d+)\.([^.]*?)\.[A-Z]{3}.*/ -->REPLACE: $1 $2
(Changed second group to "Capture everything which is not a dot...") would have worked as expected.
Instead, the whole regex fails to match completely.
What am I missing?
TIA
ciao
ale
.*x(\d+)\. matches Name.Of.Chapter.021x212.
\.[A-Z]{3}.* matches .DOC.NAME-Some.stuff.Here.ext
But ([^.]*?) does not match The.Actual.Title.Of.the.Chapter because this regex does not allow for any periods at all.
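The difference between the two attempts is easy to reproduce. A minimal Python sketch (variable names are illustrative only), using the file name from the question:

```python
import re

name = ("Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter"
        ".DOC.NAME-Some.stuff.Here.ext")

# First attempt: (.*?) may contain periods, so it can reach ".DOC".
loose = re.search(r".*x(\d+)\.(.*?)\.[A-Z]{3}.*", name)

# Second attempt: [^.]*? can only cover "The", and ".Act" is not
# a period followed by three uppercase letters, so nothing matches.
strict = re.search(r".*x(\d+)\.([^.]*?)\.[A-Z]{3}.*", name)

print(loose.groups())
print(strict)
```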
Since you are on a Mac, you could use the shell:
$ s="Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext"
$ echo ${s#*x}
212.The.Actual.Title.Of.the.Chapter.DOC.NAME-Some.stuff.Here.ext
$ t=${s#*x}
$ echo ${t%.[A-Z][A-Z][A-Z].*}
212.The.Actual.Title.Of.the.Chapter
Or if you prefer sed, eg
echo "$filename" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//'
For processing multiple files:
for file in *.ext
do
    newfile=${file#*x}
    newfile=${newfile%.[A-Z][A-Z][A-Z].*}
    # or
    # newfile=$(echo "$file" | sed 's|.[^x]*x||;s/\.[A-Z][A-Z][A-Z].*//')
    mv "$file" "$newfile"
done
To your question "How can I remove the dots in the process of matching?" the answer is "You can't." The only way to do that is by processing the result of the match in a second step, as others have said. But I think there's a more basic question that needs to be addressed, which is "What does it mean for a regex to match a given input?"
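The "second step" can be as small as a plain string replacement on the captured title. A sketch in Python, assuming a scripting step is available at all (the asker's renaming app may not offer one); the function name new_name is made up for illustration.

```python
import re
from typing import Optional


def new_name(filename: str) -> Optional[str]:
    """Step 1: capture number and title. Step 2: turn dots into spaces."""
    m = re.search(r"x(\d+)\.(.*?)\.[A-Z]{3}", filename)
    if m is None:
        return None  # name does not follow the expected layout
    number, title = m.groups()
    return f"{number} {title.replace('.', ' ')}"


print(new_name("Name.Of.Chapter.021x212.The.Actual.Title.Of.the.Chapter"
               ".DOC.NAME-Some.stuff.Here.ext"))
```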
A regex is usually said to match a string when it describes any substring of that string. If you want to be sure the regex describes the whole string, you need to add the start (^) and end ($) anchors:
/^.*x(\d+)\.(.*?)\.[A-Z]{3}.*$/
But in your case, you don't need to describe the whole string; if you get rid of the .* at either end, it will serve you just as well:
/x(\d+)\.(.*?)\.[A-Z]{3}/
I recommend you not get in the habit of "padding" regexes with .* at beginning and end. The leading .* in particular can change the behavior of the regex in unexpected ways. For example, if there were two places in the input string where x(\d+)\. could match, your "real" match would have started at the second one. Also, if it's not anchored with ^ or \A, a leading .* can make the whole regex much less efficient.
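A tiny Python illustration of how a leading .* shifts the match to the last candidate. The sample string is invented for the purpose.

```python
import re

sample = "x1.a x2.b"

# Without padding, the search stops at the first place the pattern fits.
first = re.search(r"x(\d+)\.", sample).group(1)

# With a leading greedy .*, the engine backtracks from the end of the
# string, so the captured number comes from the last candidate.
last = re.search(r".*x(\d+)\.", sample).group(1)

print(first, last)
```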
I said "usually" above because some tools do automatically "anchor" the match at the beginning (Python's match()) or at both ends (Java's matches()), but that's pretty rare. Most of the shells and command-line tools available on *nix systems define a regex match in the traditional way, but it's a good idea to say what tool(s) you're using, just in case.
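Python happens to offer all three behaviours side by side, which makes the distinction easy to see. A minimal sketch with an invented sample string:

```python
import re

s = "abc 123 def"

anywhere = re.search(r"\d+", s)       # traditional: match any substring
at_start = re.match(r"\d+", s)        # implicitly anchored at the start
whole = re.fullmatch(r"\d+", s)       # implicitly anchored at both ends
whole_ok = re.fullmatch(r"\d+", "123")

print(anywhere.group(0), at_start, whole, whole_ok.group(0))
```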
Finally, a word or two about vocabulary. The parentheses in (\d+) cause the matched characters to be captured, not grouped. Many regex flavors also support non-capturing parentheses in the form (?:\d+), which are used for grouping only. Any text that is included in the overall match, whether it's captured or not, is said to have been consumed (not captured). The way you used the words "capture" and "group" in your question is guaranteed to cause maximum confusion in anyone who assumes you know what you're talking about. :D
If you haven't read it yet, check out this excellent tutorial.