Yahoo Pipes Using Regex to change link - regex

Hi, I am pretty new to regex. I can do some basic things but am having trouble with this: I need to change the link in an RSS feed.
I have a url like this:
http://mysite.test/Search/PropDetail.aspx?id=38464&id=38464&listingid=129-2-6430678&searchID=250554873&ResultsType=SearchResult
and want to change it to the updated site:
http://mysite.test/PropertyDetail/?id=38464&id=38464&listingid=129-2-6430678&searchID=250554873&ResultsType=SearchResult
The only thing that changes is /Search/PropDetail.aspx,
which becomes /PropertyDetail/
I don't have access to the original RSS feed, or I would change it there, so I have to use Pipes. Please help, thanks!

Use the Regex module.
In its "In" field, specify the path of the node containing your link (prefixed by "item."). In the "replace" field, type
(.*)/Search/PropDetail\.aspx
and in the "with" field, type
$1/PropertyDetail/
I've escaped the '.' before 'aspx' so that it matches a literal dot. I'm not sure the '/' characters need escaping at all, and the "with" field is replacement text rather than a pattern, so it should not contain '.*'. Some trial and error may be needed.
Hopefully this will achieve the result you want.
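The same capture-and-replace idea can be checked outside Pipes. The following is a minimal sketch in Python's re module, not Pipes itself: Python writes the back-reference as \1 where Pipes uses $1, and the function name update_link is made up for illustration.

```python
import re

# Everything before the old path is captured in group 1 and written back,
# followed by the new path. The dot before "aspx" is escaped so it matches
# a literal dot only.
OLD_PATH = re.compile(r"(.*)/Search/PropDetail\.aspx")


def update_link(url):
    """Return url with /Search/PropDetail.aspx rewritten to /PropertyDetail/."""
    return OLD_PATH.sub(r"\1/PropertyDetail/", url, count=1)


if __name__ == "__main__":
    old = ("http://mysite.test/Search/PropDetail.aspx?id=38464&id=38464"
           "&listingid=129-2-6430678&searchID=250554873&ResultsType=SearchResult")
    print(update_link(old))
```

URLs that do not contain the old path are returned unchanged, so the rule is safe to apply to every item in the feed.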

Related

How to use regexextract on an imported data?

I can't get my regexextract to work properly on google sheets.
I have imported data from one tab to another, like this:
=IMPORTRANGE("https://docs.google.com/spreadsheets/d/1o52z55YdNha4T_tCsKcHkrbA5sR4C1GyxYuBmMGGqu0/edit#gid=0"; "SheetName 1!A2:G103")
This works fine, and what I'm importing are percentages. Now, what I'm trying to do is use regex to extract the numbers and omit the '%' symbol.
Normally I would type: =VALUE(REGEXEXTRACT(A11,"\D+")) but it doesn't work. However, if I use this exact regex on any 'normal' cell (a cell that is not imported from another tab), it works.
Is there a reason regex doesn't work on an imported value?
Edit:
I have to send surveys to clients to find out whether they are satisfied with the service provided. Customers choose between a few options. I do that using Google Forms by creating one form and linking it to a Google Sheet where the answers from the form are pasted.
On the same sheet, I've added a tab where I import the data from the previous tab, using IMPORTRANGE.
It works, but I want to take out the letters and the '%' symbol. I just want the numbers so I can run an AVG function.
Is there a way to do that?
Thanks
You can try the following:
=REGEXEXTRACT(A11,"(\d+)")
I think the problem is just the capital "D": \D+ matches a run of non-digit characters, whereas \d+ matches a run of digits, which is what you need here.
You can check this other post as a reference:
What does \d+ mean in regular expression terms?
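The difference between the two character classes is easy to see outside Sheets. This is a small sketch in Python's re module (not the Sheets function itself); the sample value "85%" and the helper name extract_number are assumptions made for illustration.

```python
import re


def extract_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


if __name__ == "__main__":
    cell = "85%"
    print(re.search(r"\D+", cell).group())  # the non-digit run, i.e. the percent sign
    print(extract_number(cell))             # the digit run, converted to a number
```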

Regular Expression to select file paths from list of URL

I have a list of URLs in different formats that were extracted from a random website:
http://www.w3.org/2000/svg http://www.w3.org/1999/xlink
/bg-images/png/search-magnifying-glass.png
http://www.boston.com/weather?p1=BGMenu_SubnavBostonGlobe.com
http://www.w3.org/2000/svg
http://www.w3.org/1999/xlink
/bg-images/png/search-magnifying-glass.png http://www.w3.org/2000/svg
http://www.w3.org/1999/xlink
/bg-images/png/bg-logo--full.png
http://www.w3.org/2000/svg
http://www.w3.org/1999/xlink
/bg-images/png/bg-logo--bug.png
https://www.bostonglobe.com
https://www.bostonglobe.com
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking
http://www.w3.org/1999/xlink /bg-images/png/bg-logo-large--full.png
http://www.boston.com/section/cars?s_campaign=bg:hp:mainnav:cars
http://realestate.boston.com?s_campaign=bg:hp:mainnav:realestate
http://www.w3.org/2000/svg http://www.w3.org/1999/xlink
They are all in different formats (optional http/https/www). I need to filter the list to get any kind of "downloadable" content such as *jpg, *png, *html, etc.
Expected output:
/bg-images/png/search-magnifying-glass.png
/bg-images/png/search-magnifying-glass.png
/bg-images/png/bg-logo--full.png
/bg-images/png/bg-logo--bug.png
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking
/metro/2018/06/18/sjc-ruling-millionaires-tax-coming-monday/unxBjYa0JGHKfMKUBzsMjO/story.html?p1=BGHeader_SmartBar_Breaking (not sure about these yet just in case)
http://www.w3.org/1999/xlink /bg-images/png/bg-logo-large--full.png
This is my first time trying to write a regex, and I came up with something like this:
(https?/\/)?(www\.)?[-a-zA-Z0-9#:;%._\+~\/#=]{2,256}\.[a-z]{2,4}a{0,1}\b([-a-zA-Z0-9#:;!%_\+.,~#?&//=]*)
which outputs a lot of junk lines. Any advice?
Since your sample Input_file has trailing spaces on some lines, I am using sub to remove them; if they are not there, you can drop that part. Could you please try the following and let me know if it helps?
awk '{sub(/ *$/,"")}
(/^http/||/^https/||/^www/||/^\//) && \
(/.*png$/||/.*html$/||/.*jpg$/||/BGHeader_SmartBar_Breaking$/)
' Input_file
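For comparison, here is a rough Python sketch of the same filtering idea. It is not a translation of the awk command: instead of listing BGHeader_SmartBar_Breaking explicitly, it assumes that a wanted line ends in one of a few extensions, optionally followed by a query string. The extension list and the name filter_paths are assumptions.

```python
import re

# A wanted line ends in .png, .jpg/.jpeg or .html, optionally followed by
# a query string such as ?p1=...
WANTED = re.compile(r"\.(png|jpe?g|html)(\?\S*)?$")


def filter_paths(lines):
    """Strip surrounding whitespace and keep the lines that look like files."""
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if WANTED.search(line)]


if __name__ == "__main__":
    sample = [
        "http://www.w3.org/2000/svg http://www.w3.org/1999/xlink",
        "/bg-images/png/search-magnifying-glass.png ",
        "http://www.boston.com/weather?p1=BGMenu_SubnavBostonGlobe.com",
        "https://www.bostonglobe.com",
    ]
    for path in filter_paths(sample):
        print(path)
```

Like the awk version, this keeps a whole line when the file path is preceded by another URL on the same line.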
Instead of guessing from the URLs in some questionable feed, you need to check them yourself, because a URL in general DOES NOT contain information about its content. Many storage services use IDs to identify images, not names with extensions. The response headers, however, do contain this information:
How to get content type of a web address?
As for what is downloadable: everything. I mean literally everything you see is downloadable. For images, for example, the content types will be something like these:
image/gif, image/png, image/jpeg, image/bmp, image/webp
For audio/video:
audio/midi, audio/mpeg, audio/webm, audio/ogg, audio/wav
A partial list can be found here: http://htmlbook.ru/html/value/mime
As for a solution: just sniff every link in multiple I/O threads. This way you will also be able to filter out links that need authentication, have expired, or were invalid in the first place. These requests are usually pretty cheap.
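A minimal sketch of that approach in Python, using only the standard library: a HEAD request per URL, run on a thread pool, with failed requests recorded as None. The function names, the worker count, and the choice of which MIME prefixes count as media are all assumptions; relative paths such as /bg-images/... would first have to be joined to a base URL, which is not shown here.

```python
import http.client
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def head_content_type(url, timeout=5):
    """Return the Content-Type of url via a HEAD request, or None on failure."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.headers.get_content_type()
    except (OSError, ValueError, http.client.HTTPException):
        # Covers network and HTTP errors, timeouts and malformed URLs.
        return None


def sniff(urls, fetch=head_content_type, workers=8):
    """Map each URL to its content type, fetching in parallel threads."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))


def keep_media(types, prefixes=("image/", "audio/", "video/")):
    """Return the URLs whose content type starts with one of the prefixes."""
    return [url for url, ctype in types.items() if ctype and ctype.startswith(prefixes)]
```

The fetch parameter exists so the network call can be swapped out, for example with a dictionary lookup while testing.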

How to convert MS Word Smart Quotes and em-dashes to simple quotes and dashes in Ckeditor 4

Hi I really like the new Ckeditor 4 Advanced Content Filtering along with the pastefromword plugin - and have read the docs on what html tags to allow and not, and I understand why it kindly converts my client's MS Word crap into htmlentities. However, I'd like to do a little intervention and convert the smart quotes to straight quotes - and all em dashes to plain dashes and not allow - before the text gets sent to the CMS database. But I can't find any docs on this or examples.
I can see there were many questions about this on the old CKEditor forum (http://ckeditor.com/forums/CKEditor-3.x/Replacing-smart-quotes-regular-quotes, http://ckeditor.com/forums/CKEditor-3.x/Problem-copyingpasting-MS-Word), but they didn't get answered.
I'm also hoping the ckeditor team reads these forums as this is where they suggest we post questions now.
CKEditor dev here.
If you want the Paste From Word plugin to do this, you could add a rule in the plugin that replaces the contents of text nodes.
To achieve this, add a property named 'text' somewhere around here (on the same level as the 'comment' property):
https://github.com/ckeditor/ckeditor-dev/blob/master/plugins/pastefromword/filter/default.js#L1106
It should be a function that accepts one parameter - the text node content, e.g.:
text: function( content ) {
    return content.replace( /[\u201E\u201C]/g, '"' ); // Unicode for „ and “
}
This way whenever the PFW plugin filter encounters a text node it'll replace its contents with whatever is returned by the above mentioned function.
Caveats: there are quite a few Unicode symbols that represent quotation marks and dashes.
By the way: you may not want to get too attached to the current Paste From Word plugin - we're planning a major refactor of it for v4.6.
I hope this was helpful.
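To make the caveat about the many quotation-mark and dash code points concrete, here is a sketch of a fuller mapping. It is written in Python purely to enumerate the characters and is not CKEditor code; in the plugin the equivalent would be further replace calls inside the 'text' function. The selection of code points is an assumption and is not exhaustive.

```python
# Translation table from common "smart" punctuation to plain ASCII.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / apostrophe
    "\u201A": "'",  # single low-9 quotation mark
    "\u201C": '"',  # left double quotation mark
    "\u201D": '"',  # right double quotation mark
    "\u201E": '"',  # double low-9 quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
})


def plain_punctuation(text):
    """Replace smart quotes and long dashes in text with ASCII equivalents."""
    return text.translate(SMART_TO_PLAIN)
```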

How to remove query string ? from image url vb.net

What's the best way to remove a query string (the question-mark variables) from an image URL?
Say I've got a good image such as
http://i.ebayimg.com/00/s/MTYwMFgxNjAw/z/zoMAAOSwMpZUniWv/$_12.JPG?set_id=880000500F
But I can't really save it properly without adding a bunch of useless checking code because of the query string crap after it.
I just need
http://i.ebayimg.com/00/s/MTYwMFgxNjAw/z/zoMAAOSwMpZUniWv/$_12.JPG
I'm looking for the proper regular expression to handle this, so I can replace the query string with an empty string.
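It might be simple enough not to need regex.
This would work (the check leaves URLs without a "?" unchanged; for those, IndexOf returns -1 and Substring would otherwise throw):
Dim queryStart = url.IndexOf("?"c)
Dim cleaned = If(queryStart >= 0, url.Substring(0, queryStart), url)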
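As a language-neutral illustration of the same idea, a URL parser can drop the query string without any regex. This is a sketch in Python's standard urllib.parse rather than VB.NET; the function name strip_query is made up, and removing the fragment as well is an assumption about what "useless" parts should go.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return url without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query="", fragment=""))


if __name__ == "__main__":
    print(strip_query(
        "http://i.ebayimg.com/00/s/MTYwMFgxNjAw/z/zoMAAOSwMpZUniWv/$_12.JPG?set_id=880000500F"
    ))
```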

Are my regex just wrong or is there a buggy behaviour in td-agent's format behaviour?

I am using fluentd, elasticsearch and kibana to organize logs. Unfortunately, these logs are not written using any standard like apache, so I had to come up with the regex for the format myself. I used this site here to verify that they are working: http://fluentular.herokuapp.com/ .
The logs have roughly this format here:
DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] Start Activitiy 'barbecue' zu verabeiten.
the format regex I am using is as follows:
format /(?<pri>([INFO]|[DEBUG]|[ERROR])+)...(?<date>(\d{2}\.\d{2}\.\d{4})).(?<time>(\d{2}:\d{2}:\d{2})).\[(?<subject>(.*))\].(?<msg>(.*))/
Now, judging by that website that is supposed to test specifically fluentd's behaviour with regexes, the output SHOULD be this one:
Record
Key Value
pri DEBUG
date 24.04.2014
subject SingleActivityStrategy
msg Start Activitiy 'barbecue' zu verabeiten.
Instead though, I have this ?bug? that pri is always shortened to DEBU. Same for ERROR which becomes ERRO, only INFO stays INFO. I am not very experienced with regular expressions and I find it hard to believe that this is a bug, still it confuses me and any help is greatly appreciated.
I'm not sure I can link the complete config file, because I don't personally own these log files and I am trying to keep this at a level where my boss won't get mad at me for posting sensitive information. If it is definitely needed, I will post it later, after asking him how much I can reveal.
In general, the logs always look roughly like this:
First the priority, which is either DEBUG, ERROR or INFO; next the date and time; next what we call the subject, which is always written in [ ]; and finally just a message.
Here is a link to fluentular with the format I am using and a teststring that produces the right result in fluentular, but not in my config file:
Fluentular
Sorry I couldn't make it work like a regular link to just click on.
Another link to test out regex with my format and test string is this one:
http://rubular.com/r/dfXOkQYNXP
tl;dr version:
my td-agent format regex cuts off the last letter, although fluentular says it shouldn't. My fault or a bug?
This is how the regex would look if you match the data explicitly:
(INFO|DEBUG|ERROR)\:\s+(\d{2}\.\d{2}\.\d{4})\s(\d{2}:\d{2}:\d{2})\s\[(.*)\](.*)
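In your format string, you were using . and ... where the colon and spaces should be, and [INFO], [DEBUG] and [ERROR] are character classes that match one letter at a time rather than the whole word. Because ... demands exactly three characters before the date, a line with only a colon and one space there forces the engine to backtrack and hand the last letter of the priority to the first dot, which would explain DEBU and ERRO. I'm not too sure why this works in Fluentular, but you should match the colon and each space between the values explicitly.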
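The truncation can be reproduced outside td-agent. The sketch below uses Python's re module, which is an assumption in two ways: Fluentd uses Ruby's regex engine, and Python spells named groups (?P<name>...) instead of (?<name>...). For this particular pattern and the sample line from the question, Python shows the shortened priority with the original pattern and the full one with an explicit pattern.

```python
import re

LINE = ("DEBUG: 24.04.2014 16:00:00 [SingleActivityStrategy] "
        "Start Activitiy 'barbecue' zu verabeiten.")

# The pattern from the question, with only the group syntax adapted.
ORIGINAL = re.compile(
    r"(?P<pri>([INFO]|[DEBUG]|[ERROR])+)...(?P<date>(\d{2}\.\d{2}\.\d{4}))."
    r"(?P<time>(\d{2}:\d{2}:\d{2})).\[(?P<subject>(.*))\].(?P<msg>(.*))"
)

# Whole words for the priority, and the colon and spaces matched explicitly.
FIXED = re.compile(
    r"(?P<pri>INFO|ERROR|DEBUG):\s+(?P<date>\d{2}\.\d{2}\.\d{4})\s"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s\[(?P<subject>.*)\]\s(?P<msg>.*)"
)


def parse(pattern, line):
    """Return the named groups of the first match as a dict, or None."""
    match = pattern.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse(ORIGINAL, LINE)["pri"])  # DEBU
    print(parse(FIXED, LINE)["pri"])     # DEBUG
```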
So you'd be looking at the following regular expression with the Fluentd fields (which are the named groups):
(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))
Meaning your td-agent.conf should look like:
<source>
  type tail
  path /var/log/foo/bar.log
  pos_file /var/log/td-agent/foo-bar.log.pos
  tag foo.bar
  format /(?<pri>(INFO|ERROR|DEBUG))\:\s+(?<date>(\d{2}\.\d{2}\.\d{4}))\s(?<time>(\d{2}:\d{2}:\d{2}))\s\[(?<subject>(.*))\]\s(?<msg>(.*))/
</source>
I would also take a look into comparing Logstash vs. Fluentd. I like Logstash far more because you create Grok filters to match the type of data you want, and it makes formatting your fields much easier because you are providing an abstraction layer, but you essentially will get the same data.
And I would watch out when you're using sites like Rubular, as they are fairly particular about multi-line matching and the like. I'd suggest something like Regexr which gives immediate feedback and you can set global and multiline matching as well.