iMacros to get TITLE and URL from Google search results (SERP) - imacros

I cannot get the URL from a Google search result. I only need the main URL (the host), not the full URL. Please help.
sample SERP:
googleforeducation.blogspot.com/.../teach-and-learn-from-everywh
Desired result:
googleforeducation.blogspot.com
I did try and below is the full script. Thank you.
VERSION BUILD=8871104 RECORDER=FX
TAB T=1
SET !REPLAYSPEED FAST
SET !ERRORIGNORE YES
SET !EXTRACT_TEST_POPUP NO
URL GOTO=https://www.google.co.id/search?q=%2Bblogspot.com&bav=on.2,or.&biw=1064&bih=666&dpr=1#tbs=qdr:m&q=learn+blogspot+site:blogspot.com
TAG POS={{!LOOP}} TYPE=H3 ATTR=TXT:* EXTRACT=TXT
TAG POS={{!LOOP}} TYPE=CITE ATTR=CLASS:_Rm EXTRACT=HREF
SET !EXTRACT EVAL("want to get only <something>.blogspot.com OR only main URL");
SAVEAS TYPE=EXTRACT FOLDER=* FILE=Google.csv

Try this:
SET !EXTRACT EVAL("'{{!EXTRACT}}'.split('/')[0];")
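
The same idea outside iMacros, as a minimal Python sketch: the text Google displays for a result is roughly host/path, so everything before the first slash is the host. The helper name main_host is hypothetical, and stripping a leading scheme is an added assumption in case the extracted text starts with http:// or https://.

```python
def main_host(display_url: str) -> str:
    """Return the host part of a displayed result URL (hypothetical helper)."""
    text = display_url.strip()
    # Assumption: drop a leading scheme if the extracted text has one.
    for scheme in ("https://", "http://"):
        if text.startswith(scheme):
            text = text[len(scheme):]
            break
    # Everything before the first "/" is the host.
    return text.split("/")[0]


print(main_host("googleforeducation.blogspot.com/.../teach-and-learn-from-everywh"))
```

Without the scheme check, a value such as https://example.blogspot.com/post would split into "https:" first, which is why the sketch strips it before splitting.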

Your question: run a Google search restricted to a site and gather the subdomains from the results.
-Jump to last code sample for a working solution-
1) My suggestion is to look for the solution in another technology rather than in macros, e.g. Perl + LWP to get the page source and then a regex to parse it.
2) To the point: your macro does not work because the element you selected does not contain an HREF attribute. Use the "Inspect Element" button in your browser to see your page layout.
I would use a regex with iMacros to locate URLs at pre-defined locations, for example the TRANSLATE button that always appears next to a result in a foreign domain (or the webcache link, see the last example).
The next code catches the subdomain for the first translate button only.
SEARCH SOURCE=REGEXP:"https://translate.google.{20,50}u=http://(.{1,50}).blogspot.com/&" EXTRACT="Subdomain is $1"
PROMPT {{!EXTRACT}}
Unfortunately, when trying to loop the regex, the repeated group keeps overwriting $1.
E.g. (not working properly, but this is the more elegant way to go if someone can fix it):
SEARCH SOURCE=REGEXP:"(?:https://translate.google.{20,50}u=http://(.{1,50}).blogspot.com/&.+?){1,6}" EXTRACT="Subdomains are $1 $2 $3 $4 $5 $6"
PROMPT {{!EXTRACT}}
?: disables extraction for the current group.
{1,6} runs the group 1 to 6 times and is meant to extract the subdomain each time.
A workaround could be to copy and paste the pattern 6? 8? 20? times. This time I'm going to use a different anchor (the webcache link) that should work for more people out of the box regardless of language.
E.g.:
URL GOTO=https://www.google.co.il/?gfe_rd=cr&ei=tHCOV5S_INHb8Afd24GwCg#tbs=qdr:m&q=learn+blogspot+site:blogspot.com
SEARCH SOURCE=REGEXP:"(?:webcache.{19}com/search.q=cache:.{12}:(.{1,40}.blogspot.com)/).+?(?:webcache.{19}com/search.q=cache:.{12}:(.{1,40}.blogspot.com)/).+?(?:webcache.{19}com/search.q=cache:.{12}:(.{1,40}.blogspot.com)/).+?(?:webcache.{19}com/search.q=cache:.{12}:(.{1,40}.blogspot.com)/).+?(?:webcache.{19}com/search.q=cache:.{12}:(.{1,40}.blogspot.com)/).+?(?:webcache.{19}com/search.q=cache:.{12}:(.{1,40}.blogspot.com)/).+?(?:webcache.{19}com/search.q=cache:.{12}:(.{1,40}.blogspot.com)/).+?(?:webcache.{19}com/search.q=cache:.{12}:(.{1,40}.blogspot.com)/).+?" EXTRACT="Domains are $1,$2,$3,$4,$5,$6,$7,$8"
PROMPT {{!EXTRACT}}
The last one is a working solution for you, but code-wise it is ugly.
If someone reads this later, after Google has changed the page layout, you will need to "Inspect element" on the page, search for "cache" and tweak the regex a little.
If you want more explanation of the regex, I'd be glad to help step by step.
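
The overwritten-group problem goes away in a language whose regex engine can return every match, which is what suggestion 1 points at. Here is a minimal Python sketch, assuming the page source has already been fetched into a string. The sample HTML is made up for illustration, and the pattern keys only on the blogspot.com host rather than on Google's markup.

```python
import re

# Hypothetical snippet standing in for the fetched page source.
page_source = (
    '<a href="http://first-blog.blogspot.com/2016/07/post.html">one</a>'
    '<a href="https://second.blogspot.com/">two</a>'
    '<a href="http://first-blog.blogspot.com/other.html">again</a>'
)

# One capture group; findall returns it for every match instead of
# keeping only the last repetition.
HOST_RE = re.compile(r"https?://([a-z0-9-]+\.blogspot\.com)", re.IGNORECASE)


def blogspot_hosts(html: str) -> list:
    """Return unique blogspot hosts in order of first appearance."""
    seen = []
    for host in HOST_RE.findall(html):
        host = host.lower()
        if host not in seen:
            seen.append(host)
    return seen


print(blogspot_hosts(page_source))
```

Because each match is reported separately, there is no need to repeat the pattern once per expected result.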

Related

Getting regex redirection to cater for optional text

So, I have a series of regex redirections for some help topics.
Example:
(?i)\/msa/HelpOnline/source/helpoptionsemailgoogle.htm$
If the user had used the search tab in the old help system and clicked on the same link for the above topic the resulting link was:
/msa/helponline/index.html?page=source/helpoptionsemailgoogle.htm
So, I have an optional bit of URL that might be present:
/msa/helponline/[index.html?page=]source/helpoptionsemailgoogle.htm
Is it possible with regex to optionally allow for that text, so that the same redirection matches both:
/msa/helponline/source/helpoptionsemailgoogle.htm
/msa/helponline/index.html?page=source/helpoptionsemailgoogle.htm
I am using the WordPress Redirection plugin and the redirections are being stored with WordPress.
This works:
(?i)\/msa/HelpOnline/(index.html\?page=)?source/helpoptionsemailgoogle.htm$
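
A quick way to check that the optional group behaves as intended is to run the pattern against both URL forms. This is a Python sketch, not the Redirection plugin itself, and escaping the literal dots is an extra tightening that is not part of the pattern above.

```python
import re

# Same pattern as above; the dots are escaped here (an added assumption)
# so that "." only matches a literal dot.
PATTERN = re.compile(
    r"(?i)/msa/HelpOnline/(index\.html\?page=)?source/helpoptionsemailgoogle\.htm$"
)


def matches(url: str) -> bool:
    """Return True if the redirection pattern matches the URL."""
    return PATTERN.search(url) is not None


for url in (
    "/msa/helponline/source/helpoptionsemailgoogle.htm",
    "/msa/helponline/index.html?page=source/helpoptionsemailgoogle.htm",
    "/msa/helponline/other/helpoptionsemailgoogle.htm",
):
    print(url, matches(url))
```

The trailing ? after the group is what makes the index.html?page= segment optional; the \? inside the group matches the literal question mark in the URL.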

RegEx in Dreamweaver

I know this is possible because I used to do this before... but I can't remember how!
I have a lot of text and paragraphs that I would like to wrap in <p></p> tags, and I know you can do it very quickly with the find/replace feature using RegEx.
Can anyone refresh my memory? I'm using Dreamweaver.
I think the best way to do it in Dreamweaver is to use the "Edit > Paste Special" menu to paste from a word processor with formatting.
If that doesn't work for you for some reason, you can use Dreamweaver search and replace with the "Use regular expressions" option enabled, and do a search and replace for:
\r\n|\r|\n
and replace with
</p><p>
Then all that's needed is to add an outer paragraph tag if it's not already there.
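
If the text is available outside Dreamweaver, the same transformation can be scripted. Below is a minimal Python sketch, assuming one paragraph per line; the function name wrap_paragraphs is hypothetical. It wraps each line directly, so no outer tag has to be added afterwards.

```python
import re


def wrap_paragraphs(text: str) -> str:
    """Wrap each non-empty line of text in <p>...</p>."""
    # Split on any line-break style (CRLF, CR or LF) and trim each line.
    lines = [line.strip() for line in re.split(r"\r\n|\r|\n", text)]
    # Skip blank lines so they do not become empty paragraphs.
    return "\n".join(f"<p>{line}</p>" for line in lines if line)


print(wrap_paragraphs("First paragraph.\r\nSecond paragraph.\n\nThird paragraph."))
```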

Remove a character from the middle of a string with regex

I have no programming experience and thought this would be simple, but I have searched for days without luck. I am using a program to strip content from a web page. The program uses regex filters to display what you want from the stripped content. The stripped content can be any letters and is in the form USD/SEK. I want to display USDSEK, without the "/".
Thanks
To elaborate further - I am using a program called Data toolbar for chrome, which makes it easy to strip content from web pages. After it strips the content, it provides a regex filter to display what part of the content is displayed. But I have to know the regex command to remove the / from USD/SEK, to display just USDSEK. I've tried [A-Z.,]+ but that only displays USD. I need the regex command to grab the first 3 and last 3 characters only, or to omit the / from the string.
Try adding parentheses around the groups which you wish to capture:
([a-zA-Z]{3})\/([a-zA-Z]{3})
or
([a-zA-Z]{3})\/((?1))
Depending on the functionality of the program you are using, you can then reference these captured groups as $1 and $2 (or \1 and \2, depending on the regex flavor).
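
To see the capture-and-reassemble idea in isolation, here is a minimal Python sketch. It says nothing about what Data Toolbar itself supports; it only shows the two groups being joined without the slash, and the function name join_pair is hypothetical.

```python
import re

# Two groups of three letters separated by a slash.
PAIR_RE = re.compile(r"([a-zA-Z]{3})/([a-zA-Z]{3})")


def join_pair(text: str) -> str:
    """Replace each XXX/YYY pair with XXXYYY."""
    # \1 and \2 refer to the first and second captured groups.
    return PAIR_RE.sub(r"\1\2", text)


print(join_pair("USD/SEK"))
```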

regex to remove hyperlinks

Input:
source http://www.emaxhealth.com/1275/misdiagnosing from here http://www.cancerresearchuk.org/about-cancer/type recounting her experiences and thoughts blog http://fty720.blogspot.com even carried the new name. She was far from home.
From the above input I want to remove the hyperlinks. Below is the regex that I am trying:
http://[\w|\W|\d|\s]*(?=[ ])
My intent is that this regex matches all characters, digits and whitespace after encountering 'http' and continues until the first blank space.
Unfortunately, it is not working as expected. Please help me find my error. Thanks.
Try this sed command
sed 's/http[^ ]\+//g' FileName
Output :
source from here recounting her experiences and thoughts blog even carried the new name. She was far from home.
To find the hyperlink use:
\b(https?)://[A-Z0-9+&@#/%?=~_|$!:,.;-]*[A-Z0-9+&@#/%=~_|$]
Or, if you want to find the HTML a tag, use:
<a\b[^>]*>(.*?)</a>
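
For comparison, the same removal as a minimal Python sketch. It assumes a link runs until the next whitespace character, and it also collapses the double spaces left behind; the function name strip_links is hypothetical.

```python
import re

# A link is "http://" or "https://" followed by non-whitespace characters.
URL_RE = re.compile(r"https?://\S+")


def strip_links(text: str) -> str:
    """Remove bare URLs and tidy the spaces they leave behind."""
    without_links = URL_RE.sub("", text)
    # Removing a link between two spaces leaves a double space; collapse it.
    return re.sub(r" {2,}", " ", without_links).strip()


sample = (
    "source http://www.emaxhealth.com/1275/misdiagnosing from here "
    "http://www.cancerresearchuk.org/about-cancer/type recounting her "
    "experiences and thoughts blog http://fty720.blogspot.com even carried "
    "the new name. She was far from home."
)
print(strip_links(sample))
```

The original attempt fails mainly because a character class containing both \w and \W matches everything, spaces included, so the greedy * runs on to the last space in the input instead of stopping at the first.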

Wiki problem with anchor

I'm developing a project based on a wiki. One of its features is assigning anchors to headings (h1., h2., etc.), and I want to link a word to one of these anchors so that when it's clicked the page automatically scrolls down to the correct heading. As the help page says, the anchor should be used as follows:
Redmine assigns an anchor to each of those headings thus you can link to them with "#Heading", "#Subheading" and so forth.
And then adds:
[[Guide#further-reading]] takes you to the anchor "further-reading". Headings get automatically assigned anchors so that you can refer to them
So I tried to use it by writing [[LBAW#Glossario]], or [[LBAW"#Glossario"]] or [[LBAW#"Glossario"]]. None of them worked: each one created a new page instead of scrolling down as it should.
If anyone could give any advice, it would be very much appreciated.
I had the same problem and figured it out. The anchor link ("click me") format looks like:
[[your_wiki_pagename#the-anchor|click me]]
The link will take you to this heading on the wiki page "your_wiki_pagename":
h2. the anchor
You need to replace each space in the anchor text with "-" (the anchor -> the-anchor).
I have tried it in the current demo of redmine and it works. http://demo.redmine.org/projects/anewproject/wiki/Headers
Assume the following text on the page called Headers:
h1. Headers
some text
This link goes to [[Headers#header2]]
h2. header2
some
more
lines
of
text
to
see
scrolling
You can check the exact name of the anchor which Redmine has added by looking at the page source, or inspecting the element using browser tools.
I had the same problem because I hadn't used the exact anchor name, with the same case. My heading was:
DNS (moved to separate page)
and so Redmine created an anchor like this:
¶
so I had to use the following link:
[[Server_Configuration#DNS-moved-to-separate-page]]
and that worked.
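
The two examples in this thread ("the anchor" -> "the-anchor" and "DNS (moved to separate page)" -> "DNS-moved-to-separate-page") suggest a simple rule: punctuation is dropped and spaces become hyphens. The Python sketch below encodes only that guess; the function names are hypothetical and Redmine's real rule may differ, so checking the page source remains the reliable method.

```python
import re


def guess_anchor(heading: str) -> str:
    """Guess the anchor for a heading (rule inferred from this thread only)."""
    # Assumption: keep word characters, whitespace and hyphens; drop the rest.
    cleaned = re.sub(r"[^\w\s-]", "", heading.strip())
    # Assumption: each run of whitespace becomes a single hyphen.
    return re.sub(r"\s+", "-", cleaned)


def wiki_link(page: str, heading: str) -> str:
    """Build a [[Page#anchor]] wiki link for the given heading."""
    return f"[[{page}#{guess_anchor(heading)}]]"


print(wiki_link("Server_Configuration", "DNS (moved to separate page)"))
```

Case is preserved on purpose, since the link has to use the same case as the heading.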