For a long time, the following code worked perfectly to extract hyperlinks from text using a regular expression:
var text = "this is a http://google.de link!";
var link = text.match("(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+üäö&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+üäö&@#/%=~_|$?!:,.])*(?:\([-A-Z0-9+üäö&@#/%=~_|$?!:,.]*\)|[A-Z0-9+üäö&@#/%=~_|$])", "gi");
The result should be "http://google.de" for the variable link, but it no longer works. I suspect Google has changed something in Google Apps Script (GAS).
Can you please tell me which expression I can use to extract hyperlinks from a string?
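One possible cause, stated here as an assumption rather than something confirmed in this thread: the second argument to match (the "gi" flags) is a non-standard extension that not every JavaScript engine honors, and without the i flag the uppercase-only character classes cannot match "google.de". The Python sketch below uses the same pattern with the case-insensitive flag passed explicitly; the names URL_PATTERN and extract_links are illustrative only.

```python
import re

# Same pattern as in the question, with the ignore-case flag set explicitly.
URL_PATTERN = re.compile(
    r"(?:(?:https?|ftp|file)://|www\.|ftp\.)"
    r"(?:\([-A-Z0-9+üäö&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+üäö&@#/%=~_|$?!:,.])*"
    r"(?:\([-A-Z0-9+üäö&@#/%=~_|$?!:,.]*\)|[A-Z0-9+üäö&@#/%=~_|$])",
    re.IGNORECASE,
)

# The same pattern without any flags, to show what a dropped "i" flag does.
CASE_SENSITIVE_PATTERN = re.compile(URL_PATTERN.pattern)


def extract_links(text):
    """Return every URL-like substring found in text."""
    return URL_PATTERN.findall(text)
```

With the flag, extract_links("this is a http://google.de link!") yields the single link; the case-sensitive variant finds nothing in the same text.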
(Edit: The answer is to check the 'Encode?' option in the HTTP Request. Please see Vinoth's Edit 2 and the comment below, thanks!)
This is interesting!
I'm trying to parse an HTTP response that contains (to give a concrete example):
bigH:"2a3a6CEH+iJakQpQtPm8efv"
Using Regular Expression Extractor when I try
bigH:"(.+?)"
it extracts the string but replaces all the "+" in the string with space. That is, instead of
"2a3a6CEH+iJakQpQtPm8efv"
it gives me:
"2a3a6CEH iJakQpQtPm8efv"
Note the space between H and i.
How can I stop it from replacing the "+" with a space? I'd really appreciate it if someone could also explain why this happens.
Btw, I tried (.+?) and (.\++?) and even ([.|\+]+?) - didn't work :(
Thanks,
--Ishtiaque
Updating with screenshots below:
POST Response data:
After parsing with regular expression extractor in JMeter:
Side by side in Notepad++:
'Raw' tab shows the '+'s:
'HTTP' tab does not:
As you get the response in JSON format, I would go with the JSON Path Extractor.
It seems to be a much easier approach than using a regular expression.
The JSON Path below should take care of getting the encoded string from your JSON, and you should be able to access it using ${bigH}.
Check this for more details (scroll down for JSON Path extractor details).
EDIT:
I was wrong that you get the response in JSON format. Are you trying to access bigH:"XXX" from a script tag? For this, we have to use either the Regular Expression Extractor or Beanshell.
<script type='text/javascript' charset='utf-8'>
registerSubmit(document.forms[0].elements['SubmitTopButton']);
registerSubmit(document.forms[0].elements['SubmitBottomButton']);
(function($) {
$(".wb_tsauthall").wb_tsauthall({
auth : "Authorize All",
unauth : "Unauthorize All",
locMsgKeys : []
});
$(".wb_newedit").wb_newedit({
labels:['Job','Code','Work Premium','Flat Rate','Premium','Shift','Sched Times','LTA','Sched Times w Breaks','Delete Details','Employee Holiday','Work Detail','Schedule Detail'],values:[105,103,200,206,204,450,401,500,461,199,900,100,460],bigH:"PVxUbYIODBT31j8IZnPGxF/9O1iuKAkFzTO9WhXu8An8hAUa22tLiWrEHz8v9SIu/NXZH1a5IxO0xYeNwRIYM+3n1kNsrESnhiAYhwhCiqUY9mI4hvEPgAOx7B+MEB8iSIUyNGNZbeGx9nSogFYpNrzmCXirW7Nm9Tn7owPKHmc8dOf5SZ+eDzAOHIB8+5YzQ3bIdFoe60hOMkyd7FiUXtwPcNMUFEjOSMs9JhgIHTE4agpCdbFb6SLuSuLoO9rqxj+9GovUbzTmrxj4faBKZVATNN7iIFyDZHYAZuZRcPJBdUJ1xNHMCWyPZ4p2/Yk0Q0ujdKJbJw9NFysikZgBFNEhNXEA4w8HL1ycYCmZDgSUW1GsumDAKh0Brq3K8Kh2akep8YEjDMWipKgSPaNx3CVY4lf87e0oK70nK/zKGkmpWFvyMnxbkJtWmeuxmPgRZgg2lYbZXFauD1AidnQQhPULJTTV+P+Xkk9PYm3ZkIEcDnYJUmPg/D3iuwg84m2IZatFTdjiNuDAcGNKptTd54yMgohN87c3sRMiZlSY/r88u+Le3BKWJqyl7Xai7Odqz366DFgOzdPi92LnSaggKX++hy+Z04kjyfSZOUYWmiWlc38SUdeTq2v15egig2mMkSLMaUnHagk="
});
$("#codeSummaryBar").wb_expandableframe({
iframe : contextPath + '/dailytimesheet/summaryInline.jsp'
});
$("#codeSummaryBar").click(function(){$("#codeSummaryBar_expand_collapse_icon").toggleClass("collapse expand");});
$("#codeSummaryBar").click();
$("#selectionBar").wb_expandableframe({
iframe : contextPath + '/dailytimesheet/dailySelectInline.jsp',
onExpand : function() {
$(".selectionBarControl").css("visibility", "hidden");
$("#expand_collapse_icon").removeClass("expand").addClass("collapse");
},
onCollapse : function() {
$(".selectionBarControl").css("visibility", "");
$("#expand_collapse_icon").removeClass("collapse").addClass("expand");
}
});
DTS.onload();
})(jQuery);
</script>
EDIT 2:
I suspect that you might have checked 'Encode?' in the HTTP Request.
Uncheck it.
Try with the regular expression ([a-zA-Z0-9+]+)
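A plausible explanation for the "+" turning into a space, offered as an assumption since the thread does not spell it out: in form-style URL encoding a literal "+" stands for a space, so a value sent without percent-encoding is decoded with a space on the receiving side. The Python sketch below only illustrates that encoding rule with the standard library; it is not JMeter code.

```python
from urllib.parse import quote_plus, unquote_plus

token = "2a3a6CEH+iJakQpQtPm8efv"

# Decoding the raw value: the literal '+' is interpreted as a space.
decoded_raw = unquote_plus(token)

# Encoding first turns '+' into %2B, which survives the decoding step.
encoded = quote_plus(token)
round_trip = unquote_plus(encoded)
```

Here decoded_raw contains the space seen in the question, while round_trip is identical to the original token.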
I am making a web scraper for a website where I have to download images. I am currently using WWW::Mechanize and doing:
my @images = $mech->find_all_images(url_regex => qr/smallThumb/i);
which gives me all the images that have smallThumb in the URL.
How can I change smallThumb to zoom while retaining the previous links that have smallThumb?
You can do this:
my @smallthumbs = $mech->find_all_images(url_regex => qr/smallThumb/i);
my @zooms = $mech->find_all_images(url_regex => qr/zoom/i);
my @allimages = (@smallthumbs, @zooms);
The risk here is that you could have a URL that fits in both categories and get a dupe.
You can also go monkeying with the regex.
my @smallthumbs_or_zooms = $mech->find_all_images( url_regex => qr/smallThumb|zoom/i );
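The single-regex idea, together with a guard against the duplicate risk mentioned above, can be sketched in Python as follows. This is an illustration on plain URL strings rather than WWW::Mechanize image objects, and filter_image_urls is a hypothetical helper name.

```python
import re


def filter_image_urls(urls, pattern=r"smallThumb|zoom"):
    """Keep URLs matching the alternation, dropping exact duplicates."""
    regex = re.compile(pattern, re.IGNORECASE)
    seen = set()
    result = []
    for url in urls:
        # A URL that matches both alternatives is still added only once.
        if regex.search(url) and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Because a single pass is made over the list, a URL containing both "smallThumb" and "zoom" cannot appear twice in the result.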
I want to retrieve all links from the page where the link text is in the format below.
(10)
I tried the method below, but it didn't work.
There are many similar links on the same page; the numbers are not in sequence and many link texts repeat, so I want to first collect the matching web elements and then read the URL from each element's attribute.
Similar to this page.
http://www.dmoz.org/search?q=surat&start=0&type=more&all=no&cat=
I want the link after we click on those numbers present in the bracket.
List<WebElement> catLinks = driver.findElements(By.xpath("//html/body/div[@id='doc']/div[@id='bd-cross']/ol/li[1]/a[2]"));
for (WebElement catLink : catLinks) {
    System.out.println(nLink + ". " + catLink.getAttribute("href"));
}
Link XPath is:
//html/body/div[@id='doc']/div[@id='bd-cross']/ol/li[1]/a[2]
Using the above XPath I can get the first link's URL. What can I do to get the URLs of all the links?
I tried using a regex:
//html/body/div[@id='doc']/div[@id='bd-cross']/ol/li[\\d\\.\\*]/a[2]
But it is not working.
I also tried using below method.
List<WebElement> catLinks = driver.findElements(By.linkText("\\d\.\*"));
for (WebElement catLink : catLinks) {
System.out.println(nLink + ". " + catLink.getAttribute("href"));
}
but no luck.
What can I do to get the URLs of all the links?
I tried using a regex:
//html/body/div[@id='doc']/div[@id='bd-cross']/ol/li[\\d\\.\\*]/a[2]
No. Use:
/html/body/div[@id='doc']/div[@id='bd-cross']/ol/li/a[2]
Less is more.
You don't need to include the /html/body/ in the xpath locator, this will just make it more fragile if the page structure changes. Try this much simpler xpath locator: id('bd-cross')//li/a[2]
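The effect of dropping the index on li can be tried outside Selenium. The sketch below uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath, on a made-up, well-formed fragment; the markup and href values are assumptions for illustration, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that mirrors the structure described in the question.
SAMPLE = """
<html><body><div id="doc"><div id="bd-cross"><ol>
  <li><a href="/first">First</a> <a href="/search?cat=a">(10)</a></li>
  <li><a href="/second">Second</a> <a href="/search?cat=b">(3)</a></li>
</ol></div></div></body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# No index on li: the second <a> of every list item is selected.
links = root.findall(".//div[@id='bd-cross']/ol/li/a[2]")
hrefs = [a.get("href") for a in links]
```

In Selenium the equivalent locator returns one WebElement per list item, so the existing loop over catLinks prints every URL.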
Does someone have a regular expression that gets a link to a Youtube video (not embedded object) from (almost) all the possible ways of linking to Youtube?
I think this is a pretty common problem and I'm sure there are a lot of ways to link that.
A starting point would be:
http://www.youtube.com/watch?v=iwGFalTRHDA
http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related
http://youtu.be/iwGFalTRHDA
http://youtu.be/n17B_uFF4cA
http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4
http://www.youtube.com/watch?v=t-ZRX8984sc
http://youtu.be/t-ZRX8984sc
... please add more possible links and/or regular expressions to detect them.
So far I got this Regular expression working for the examples I posted, and it gets the ID on the first group:
http(?:s?):\/\/(?:www\.)?youtu(?:be\.com\/watch\?v=|\.be\/)([\w\-\_]*)(&(amp;)?[\w\?=]*)?
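As a quick check of this pattern, here is a minimal Python sketch; YOUTUBE_RE and video_id are illustrative names. Note that, at least in this sketch, the embed/ form from the list above is not matched, because the pattern only covers watch?v= and youtu.be/.

```python
import re

# Pattern copied from the post above; group 1 is the video ID.
YOUTUBE_RE = re.compile(
    r"http(?:s?):\/\/(?:www\.)?youtu(?:be\.com\/watch\?v=|\.be\/)"
    r"([\w\-\_]*)(&(amp;)?[\w\?=]*)?"
)


def video_id(url):
    """Return the captured video ID, or None if the URL does not match."""
    match = YOUTUBE_RE.search(url)
    return match.group(1) if match else None
```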
You can use this expression below.
(?:https?:\/\/)?(?:www\.)?youtu\.?be(?:\.com)?\/?.*(?:watch|embed)?(?:.*v=|v\/|\/)([\w\-_]+)\&?
I'm using it, and it covers the most commonly used URLs.
I'll keep updating it on This Gist.
You can test it on this tool.
I like @brunodles's solution the most, but it can still match non-video links like https://www.youtube.com/feed/subscriptions
I went with this solution
(?:https?:\/\/)?(?:www\.)?youtu(?:\.be\/|be.com\/\S*(?:watch|embed)(?:(?:(?=\/[-a-zA-Z0-9_]{11,}(?!\S))\/)|(?:\S*v=|v\/)))([-a-zA-Z0-9_]{11,})
It can also be used to match multiple whitespace separated links.
The video id will be captured in the first group.
Tested with the following urls:
youtu.be/iwGFalTRHDA
youtube.com/watch?v=iwGFalTRHDA
www.youtube.com/watch?v=iwGFalTRHDA
http://www.youtube.com/watch?v=iwGFalTRHDA
https://www.youtube.com/watch?v=iwGFalTRHDA
https://www.youtube.com/watch?v=MoBL33GT9S8&feature=share
https://www.youtube.com/embed/watch?feature=player_embedded&v=iwGFalTRHDA
https://www.youtube.com/embed/watch?v=iwGFalTRHDA
https://www.youtube.com/embed/v=iwGFalTRHDA
https://www.youtube.com/watch/iwGFalTRHDA
http://www.youtube.com/attribution_link?u=/watch?v=aGmiw_rrNxk&feature=share
https://m.youtube.com/watch?v=iwGFalTRHDA
// will not match
https://www.youtube.com/feed/subscriptions
https://www.youtube.com/channel/UCgc00bfF_PvO_2AvqJZHXFg
https://www.youtube.com/c/NatGeoEdOrg/videos
https://regex101.com/r/rq2KLv/1
I improved the links posted above with a friend for a script I wrote for IRC to recognize even links without http at all. It worked on all stress tests I got so far, including garbled text with barely recognizable youtube urls, so here it is:
~(?:https?://)?(?:www\.)?youtu(?:be\.com/watch\?(?:.*?&(?:amp;)?)?v=|\.be/)([\w\-]+)(?:&(?:amp;)?[\w\?=]*)?~
I tested all the regular expressions shown here, and none could cover all the URL types that my client was using.
I built this pretty much through trial and error, but it seems to work with all the patterns that Poppy Deejay posted.
"(?:.+?)?(?:\/v\/|watch\/|\?v=|\&v=|youtu\.be\/|\/v=|^youtu\.be\/)([a-zA-Z0-9_-]{11})+"
Maybe it helps someone who is in a situation similar to the one I had today ;)
Piggy backing on Fanmade, this covers the below links including the url encoded version of attribution_links:
(?:.+?)?(?:\/v\/|watch\/|\?v=|\&v=|youtu\.be\/|\/v=|^youtu\.be\/|watch\%3Fv\%3D)([a-zA-Z0-9_-]{11})+
https://www.youtube.com/attribution_link?a=tolCzpA7CrY&u=%2Fwatch%3Fv%3DMoBL33GT9S8%26feature%3Dshare
https://www.youtube.com/watch?v=MoBL33GT9S8&feature=share
http://www.youtube.com/watch?v=iwGFalTRHDA
https://www.youtube.com/watch?v=iwGFalTRHDA
http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related
http://youtu.be/iwGFalTRHDA
http://www.youtube.com/embed/watch?feature=player_embedded&v=iwGFalTRHDA
http://www.youtube.com/embed/watch?v=iwGFalTRHDA
http://www.youtube.com/embed/v=iwGFalTRHDA
http://www.youtube.com/watch?feature=player_embedded&v=iwGFalTRHDA
http://www.youtube.com/watch?v=iwGFalTRHDA
www.youtube.com/watch?v=iwGFalTRHDA
www.youtu.be/iwGFalTRHDA
youtu.be/iwGFalTRHDA
youtube.com/watch?v=iwGFalTRHDA
http://www.youtube.com/watch/iwGFalTRHDA
http://www.youtube.com/v/iwGFalTRHDA
http://www.youtube.com/v/i_GFalTRHDA
http://www.youtube.com/watch?v=i-GFalTRHDA&feature=related
http://www.youtube.com/attribution_link?u=/watch?v=aGmiw_rrNxk&feature=share&a=9QlmP1yvjcllp0h3l0NwuA
http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&u=/watch?v=qYr8opTPSaQ&feature=em-uploademail
http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&feature=em-uploademail&u=/watch?v=qYr8opTPSaQ
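A small Python sketch for trying this pattern against a few of the URLs listed above; ID_RE and extract_id are illustrative names. Since the capture group is repeated with +, group 1 holds the last 11-character repetition, which for the URLs tested here is simply the video ID.

```python
import re

# Pattern copied from the post above, including the URL-encoded watch%3Fv%3D form.
ID_RE = re.compile(
    r"(?:.+?)?(?:\/v\/|watch\/|\?v=|\&v=|youtu\.be\/|\/v=|^youtu\.be\/|watch\%3Fv\%3D)"
    r"([a-zA-Z0-9_-]{11})+"
)


def extract_id(url):
    """Return the 11-character video ID, or None when nothing matches."""
    match = ID_RE.search(url)
    return match.group(1) if match else None
```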
I've been having problems lately with the attribution_link URLs, so I tried making my own regex that works for those too.
Here is my regex string:
(https?://)?(www\\.)?(youtu\\.be/|youtube\\.com/)?((.+/)?(watch(\\?v=|.+&v=))?(v=)?)([\\w_-]{11})(&.+)?
and here are some test cases I've tried:
http://www.youtube.com/watch?v=iwGFalTRHDA
https://www.youtube.com/watch?v=iwGFalTRHDA
http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related
http://youtu.be/iwGFalTRHDA
http://www.youtube.com/embed/watch?feature=player_embedded&v=iwGFalTRHDA
http://www.youtube.com/embed/watch?v=iwGFalTRHDA
http://www.youtube.com/embed/v=iwGFalTRHDA
http://www.youtube.com/watch?feature=player_embedded&v=iwGFalTRHDA
http://www.youtube.com/watch?v=iwGFalTRHDA
www.youtube.com/watch?v=iwGFalTRHDA
www.youtu.be/iwGFalTRHDA
youtu.be/iwGFalTRHDA
youtube.com/watch?v=iwGFalTRHDA
http://www.youtube.com/watch/iwGFalTRHDA
http://www.youtube.com/v/iwGFalTRHDA
http://www.youtube.com/v/i_GFalTRHDA
http://www.youtube.com/watch?v=i-GFalTRHDA&feature=related
http://www.youtube.com/attribution_link?u=/watch?v=aGmiw_rrNxk&feature=share&a=9QlmP1yvjcllp0h3l0NwuA
http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&u=/watch?v=qYr8opTPSaQ&feature=em-uploademail
http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&feature=em-uploademail&u=/watch?v=qYr8opTPSaQ
Also remember to check the string you get for your video URL; sometimes it may contain percent-encoded characters. If so, just do this
url = [url stringByReplacingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
and it should fix it.
Remember also that the index of the youtube key is now index 9.
NSRange youtubeKey = [result rangeAtIndex:9]; //the youtube key
NSString * strKey = [url substringWithRange:youtubeKey] ;
It'd be the longest RegEx in the world if you managed to cover all link formats, but here's one to get you started which will cover the first couple of link formats:
http://(www\.)?youtube\.com/watch\?.*v=([a-zA-Z0-9]+).*
The second group will match the video ID if you need to get that out.
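To see what this starter pattern captures, here is a hedged Python sketch (SIMPLE_RE and video_id_simple are illustrative names). One thing it makes visible: the character class [a-zA-Z0-9] excludes "-" and "_", so an ID containing either character is cut short.

```python
import re

# Pattern copied from the post above; group 2 is the video ID.
SIMPLE_RE = re.compile(r"http://(www\.)?youtube\.com/watch\?.*v=([a-zA-Z0-9]+).*")


def video_id_simple(url):
    """Return group 2 of the match, or None when the URL does not match."""
    match = SIMPLE_RE.match(url)
    return match.group(2) if match else None
```

Widening the class to [a-zA-Z0-9_-] would be one way to keep such IDs intact, if that fits the URLs you need to handle.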
(?:http?s?:\/\/)?(?:www.)?(?:m.)?(?:music.)?youtu(?:\.?be)(?:\.com)?(?:(?:\w*.?:\/\/)?\w*.?\w*-?.?\w*\/(?:embed|e|v|watch|.*\/)?\??(?:feature=\w*\.?\w*)?&?(?:v=)?\/?)([\w\d_-]{11})(?:\S+)?
https://regex101.com/r/nJzgG0/3
Detects YouTube and YouTube Music links in any string
I took all variants from here:
https://gist.github.com/rodrigoborgesdeoliveira/987683cfbfcc8d800192da1e73adc486#file-youtubeurlformats-txt
And built this regexp (YouTube ID is in group 2):
(\/|%3D|v=|vi=)([0-9A-z-_]{11})[%#?&\s]
Check it here: https://regexr.com/4u4ud
Edit: Works for any single string w/o breaks.
I'm working with links of this kind:
http://www.youtube.com/v/M-faNJWc9T0?fs=1&rel=0
And here's the regEx I'm using to get ID from it:
"(.+?)(\/v/)([a-zA-Z0-9_-]{11})+"
This is iterating on the existing answers and handles edge cases better. (for example http://thisisnotyoutu.be/thing)
/(?:https?:\/\/|www\.|m\.|^)youtu(?:be\.com\/watch\?(?:.*?&(?:amp;)?)?v=|\.be\/)([\w\-]+)(?:&(?:amp;)?[\w\?=]*)?/
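The edge case mentioned above can be checked with a short Python sketch; STRICT_RE and strict_video_id are illustrative names. Because "youtu" must directly follow a scheme, "www.", "m." or the start of the string, a host such as thisisnotyoutu.be is rejected.

```python
import re

# Pattern copied from the post above, without the surrounding / delimiters.
STRICT_RE = re.compile(
    r"(?:https?:\/\/|www\.|m\.|^)youtu"
    r"(?:be\.com\/watch\?(?:.*?&(?:amp;)?)?v=|\.be\/)"
    r"([\w\-]+)(?:&(?:amp;)?[\w\?=]*)?"
)


def strict_video_id(text):
    """Return the video ID from the first matching link, or None."""
    match = STRICT_RE.search(text)
    return match.group(1) if match else None
```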
Here is a complete solution for getting the YouTube video ID in Java or Android. I haven't found any link that doesn't work with this function.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns the 11-character video ID, or "" if the input is blank or no ID is found.
public static String getValidYoutubeVideoId(String youtubeUrl)
{
    if (youtubeUrl == null || youtubeUrl.trim().isEmpty())
    {
        return "";
    }
    youtubeUrl = youtubeUrl.trim();
    String regexPattern = "^(?:https?:\\/\\/)?(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*";
    Pattern regexCompiled = Pattern.compile(regexPattern, Pattern.CASE_INSENSITIVE);
    Matcher regexMatcher = regexCompiled.matcher(youtubeUrl);
    // group(1) is the captured video ID; it is only read after find() succeeds.
    if (regexMatcher.find())
    {
        return regexMatcher.group(1);
    }
    return "";
}
This is my answer for use in Scala. It is useful for extracting the 11-character video ID from a YouTube URL.
"https?://(?:[0-9a-zA-Z-]+\.)?(?:youtu\.be/|youtube\.com\S*[^\w\-\s])([\w \-]{11})(?=[^\w\-]|$)(?![?=&+%\w]*(?:[\'"][^<>]*>|</a>))[?=&+%\w-./]*"
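// Imports assume Spark SQL, which the UserDefinedFunction and udf names suggest.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

def getVideoLinkWR: UserDefinedFunction = udf(f = (videoLink: String) => {
  val youtubeRgx = """https?://(?:[0-9a-zA-Z-]+\.)?(?:youtu\.be/|youtube\.com\S*[^\w\-\s])([\w \-]{11})(?=[^\w\-]|$)(?![?=&+%\w]*(?:[\'"][^<>]*>|</a>))[?=&+%\w-./]*""".r
  videoLink match {
    case youtubeRgx(a) => a   // the whole string matched: return the captured ID
    case _ => videoLink       // otherwise return the input unchanged
  }
})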
Convert a YouTube video URL to an iframe-compatible (embed) link:
REGEX: https://regex101.com/r/LeZ9WH/2/
http://www.youtube.com/watch?v=iwGFalTRHDA
http://www.youtube.com/watch?v=iwGFalTRHDA&feature=related
http://youtu.be/iwGFalTRHDA
http://youtu.be/n17B_uFF4cA
http://www.youtube.com/embed/watch?feature=player_embedded&v=r5nB9u4jjy4
http://www.youtube.com/watch?v=t-ZRX8984sc
http://youtu.be/t-ZRX8984sc
https://youtu.be/2sFlFPmUfNo?t=1
PHP function example:
if (!function_exists('clean_youtube_link')) {
    /**
     * @param $link
     * @return string|string[]|null
     */
    function clean_youtube_link($link)
    {
        return preg_replace(
            '#(.+?)(\/)(watch\x3Fv=)?(embed\/watch\x3Ffeature\=player_embedded\x26v=)?([a-zA-Z0-9_-]{11})+#',
            "https://www.youtube.com/embed/$5",
            $link
        );
    }
}
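For readers who want to try the same substitution outside PHP, here is a rough Python equivalent. It is a sketch rather than a drop-in port: to_embed_link is a hypothetical name, and Python's re engine is assumed to treat this particular pattern the same way PCRE does.

```python
import re

# Same pattern as the PHP version; \x3F is '?' and \x26 is '&'.
EMBED_RE = re.compile(
    r"(.+?)(\/)(watch\x3Fv=)?"
    r"(embed\/watch\x3Ffeature\=player_embedded\x26v=)?"
    r"([a-zA-Z0-9_-]{11})+"
)


def to_embed_link(link):
    """Rewrite a YouTube URL into its /embed/<id> form."""
    # \5 refers to the fifth capture group, the 11-character video ID.
    return EMBED_RE.sub(r"https://www.youtube.com/embed/\5", link)
```

Anything after the ID, such as ?t=1 or &feature=related, is left in place by the substitution, so strip it separately if the embed URL should be clean.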
This should work for almost all youtube links when extracting from a string:
((?:https?:)?\/\/)?((?:www|m)\.)?((?:youtube\.com|youtu.be))(\/(?:[\w\-]+\?v=|embed\/|v\/)?)([\w\-]{10}).\b
var isValidYoutubeLink: Bool {
    // Intended to match all YouTube URL formats
    NSPredicate(format: "SELF MATCHES %@", "(?:http?s?:\\/\\/)?(?:www.)?(?:m.)?(?:music.)?youtu(?:\\.?be)(?:\\.com)?(?:(?:\\w*.?:\\/\\/)?\\w*.?\\w*-?.?\\w*\\/(?:embed|e|v|watch|.*\\/)?\\??(?:feature=\\w*\\.?\\w*)?&?(?:v=)?\\/?)([\\w\\d_-]{11})(?:\\S+)?").evaluate(with: self)
}
With this Javascript Regex, the first capture is a video ID :
^(?:https?:)?(?:\/\/)?(?:www\.)?(?:youtu\.be\/|youtube(?:\-nocookie)?\.(?:[A-Za-z]{2,4}|[A-Za-z]{2,3}\.[A-Za-z]{2})\/)(?:watch|embed\/|vi?\/)*(?:\?[\w=&]*vi?=)?([^#&\?\/]{11}).*$
(?-s)^https?\W+(?:www\.|m\.|music\.)*youtu\.?be(?:\.com|\/watch|\/o?embed|\/shorts|\/attribution_link\?[&\w\-=]*[au]=|\/ytsc\w+|[\?&\/]+[ve]i?\b|\?feature=\w+|-nocookie)*[\/=]([a-z\d\-_]{11})[\?&#% \t ] *.*$
or
(?-s)^(?:(?!https?[:\/]|www\.|m\.yo|music\.yo|youtu\.?be[\/\.]|watch[\/\?]|embed\/)\V)*(?:https?[:\/]+|www\.|m\.|music\.)+youtu\.?be(?:\.com\/|watch|o?embed(?:\/|\?url=\S+?)?|shorts|attribution_link\?[&\w\-=]*[au]=\/?|ytsc\w+|[\?&]*[ve]i?\b|\?feature=\w+|[\?&]time_continue=\d+|-nocookie|%[23][56FD])*(?:[\/=]|%2F|%3D)([a-z\d\-_]{11})[\?&#% \t ]? *.*$
(the part >>#% \t⠀ ]<< should contain continuous space, which is Alt+255, but stackoverflow-com can't print it)
(this string may be replaced to \1, sorted and abbreviated with: )
V█(?-i)^([A-Za-z\d\-_]{11})(?:\v+\1)*$
>█https:\/\/youtu\.be\/\1
(./dot can take up any symbol; \V or [^\r\n] can any except special, emoji and others; this >> [^!-⠀:/‽|\s] << can grab some emoji)
https://youtu.be/x26ANNC3C-8 • ♾ 𝕳𝕰𝕽𝕰𝕿𝕳𝕰𝖄𝕮𝕺𝕸𝕰 - 𝔩𝔢𝔞𝔳𝔢 𝔪𝔢 𝔞𝔩𝔬𝔫𝔢 • 7:15
This regex solved my problem: it extracts YouTube links in the watch, embed, or shared-link forms.
(?:http(?:s)?:\/\/)?(?:www\.)?(?:youtu\.be\/|youtube\.com\/(?:(?:watch)?\?(?:.*&)?v(?:i)?=|(?:embed|v|vi|user)\/))([^\?&\"'<> #]+)
You can check here https://regex101.com/r/Kvk0nB/1