How to use regexp to retrieve URLs where the link text is a number in brackets

I want to retrieve all links from the page where the link text is in the format below.
(10)
I tried the method below, but it didn't work.
There are many similar links on the same page; the numbers are not in sequence and many of them are repeated in the link text, so I want to first collect the matching web elements and then read the URL from their attribute.
Similar to this page.
I want the link that opens when we click on those bracketed numbers.
List<WebElement> catLinks = driver.findElements(By.xpath("//html/body/div[@id='doc']/div[@id='bd-cross']/ol/li[1]/a[2]"));
int nLink = 1;
for (WebElement catLink : catLinks) {
    System.out.println(nLink + ". " + catLink.getAttribute("href"));
    nLink++;
}
Link XPath is:
Using the above XPath I can get the first link's URL. What can I do to get the URLs of all the links?
I tried using a regexp:
But it is not working.
I also tried the method below.
List<WebElement> catLinks = driver.findElements(By.linkText("\\d\.\*"));
for (WebElement catLink : catLinks) {
    System.out.println(nLink + ". " + catLink.getAttribute("href"));
}
but no luck.

Now what can I do to get all links?
I tried using regex:
Nope. Use:
Less is more.

You don't need to include /html/body/ in the XPath locator; it only makes the locator more fragile if the page structure changes. Try this much simpler XPath locator: id('bd-cross')//li/a[2]
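By.linkText matches literal link text rather than a regular expression, so one option is to collect the anchors with the simpler locator and filter their text in plain Java. The following is only a sketch of that filtering step, using the standard library alone. The class name `BracketedLinkFilter`, the method `hrefsWithBracketedNumber`, and the idea of passing text/href pairs are assumptions for illustration; each pair stands in for `catLink.getText()` and `catLink.getAttribute("href")`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class BracketedLinkFilter {
    // Matches link text that is exactly a number in parentheses, e.g. "(10)".
    private static final Pattern BRACKETED_NUMBER = Pattern.compile("\\(\\d+\\)");

    // Each entry is {linkText, href}; returns the hrefs whose text matches,
    // in page order. Repeated numbers are kept as separate entries.
    static List<String> hrefsWithBracketedNumber(List<String[]> links) {
        List<String> hrefs = new ArrayList<>();
        for (String[] link : links) {
            String text = link[0] == null ? "" : link[0].trim();
            if (BRACKETED_NUMBER.matcher(text).matches()) {
                hrefs.add(link[1]);
            }
        }
        return hrefs;
    }
}
```

With WebDriver, the pairs would be built inside the existing loop over `catLinks`, so the regex decides which `href` values are printed.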


End of string expected error when trying to post comment in Weblog

I'm trying to solve an issue with posting comments for a blog that uses the Weblog Sitecore module. From what I can tell, if the blog entry url contains dashes (i.e. http://[]/blog/2016/december/test-2-entry), then I get the "End of string expected at line [#]" error. If the blog entry url does NOT contain dashes, then the comment form works fine.
<replace mode="on" find="-" replaceWith="_"/>
Also tried to replace the dash with an empty space. Neither solution has worked as I still get the error.
Is there some other setting in the Web.config I can alter to escape the dashes in the urls? I have read that enclosing dashed url text with the # symbol works, but I'd like to be able to do that automatically instead of having the user go back and rename all their blog entries.
Here is a screenshot of the error for reference:
I have no experience with the Weblog module, but for the issue you are facing you should escape the dash with #. Please see the following code snippet:
public string EscapePath(string path)
{
    string[] joints = Regex.Split(path, "/");
    string output = string.Empty;
    for (int index = 0; index < joints.Length; index++)
    {
        string joint = joints[index];
        // Wrap each non-empty path segment in # so its dashes are escaped
        if (!string.IsNullOrEmpty(joint))
            output += string.Format("#{0}#", joint);
        // Re-insert the separator between segments
        if (index != joints.Length - 1)
            output += "/";
    }
    return output;
}
More information about escaping dash in queries can be found here
You should call this method before posting the comment for it to escape the dashes. You may also download the dll from here and use it in your solution

Using Servlet filter on all the pages except the index

I'm trying to use a Filter to force my users to log in if they want to access some pages.
So my Filter has to redirect them to an error page if there's no session.
But I don't want this to happen when they visit index.html, because they can log in on the index page.
So I need a URL pattern that matches all pages except / and index.xhtml.
How can I do that? Can I use regex in my web.xml?
After reading this
I thought that I can make something like :
if (!req.getRequestURI().matches("((!?index)(.*)\\.xhtml)|((.*)\\.(png|gif|jpg|css|js(\\.xhtml)?))"))
in my doFilter() method, but it still processes everything.
I'm sure that the regex works because I've tested it online and it matches the files that don't need to be filtered, but the body of the if is executed even for the excluded files!
EDIT 2 :
I'm trying a new way.
I've mapped the Filter to *.xhtml in my web.xml, so I don't need to exclude css, images and javascript with the regex above.
Here's the new code (into the doFilter())
if (req.getRequestURI().contains("index")) {
    chain.doFilter(request, response);
} else {
    if (!userManager.isLogged()) {
        request.getRequestDispatcher("error.xhtml").forward(request, response);
    } else {
        chain.doFilter(request, response);
    }
}
but it still doesn't work, because it calls chain.doFilter() (in the outer if) on every page.
How can I exclude my index page from being filtered?
The web.xml URL pattern doesn't support regex. It only supports wildcard prefix (folder) and suffix (extension) matching like /faces/* and *.xhtml.
As to your concrete problem, you apparently have the index file defined as a <welcome-file> and are opening it via /. This way request.getRequestURI() will equal /contextpath/, not /contextpath/index.xhtml. Debug request.getRequestURI() to learn what the filter actually received.
I suggest a rewrite:
String path = request.getRequestURI().substring(request.getContextPath().length());
if (userManager.isLogged() || path.equals("/") || path.equals("/index.xhtml") || path.startsWith(ResourceHandler.RESOURCE_IDENTIFIER)) {
    chain.doFilter(request, response);
} else {
    request.getRequestDispatcher("/WEB-INF/error.xhtml").forward(request, response);
}
Map this filter on /*. Note that I included the ResourceHandler.RESOURCE_IDENTIFIER check so that JSF resources like <h:outputStylesheet>, <h:outputScript> and <h:graphicImage> will also be skipped, otherwise you end up with an index page without CSS/JS/images when the user is not logged in.
Note that I assume that the FacesServlet is mapped on a URL pattern of *.xhtml. Otherwise you need to alter the /index.xhtml check on path accordingly.
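The path check in the rewrite above can be pulled out into a small helper and exercised without a servlet container. This is a sketch under assumptions, not part of the original answer: the class `LoginFilterRule` and its methods are hypothetical names, and the constant is hard-coded to "/javax.faces.resource", which is the value of ResourceHandler.RESOURCE_IDENTIFIER in the javax.faces API (newer Jakarta Faces versions use a different prefix).

```java
final class LoginFilterRule {
    // Assumption: value of ResourceHandler.RESOURCE_IDENTIFIER in javax.faces (JSF 2.x).
    static final String RESOURCE_IDENTIFIER = "/javax.faces.resource";

    // True when the path (request URI minus context path) needs no login:
    // the welcome URL "/", the index page itself, or a JSF resource request.
    static boolean isPublic(String requestUri, String contextPath) {
        String path = requestUri.substring(contextPath.length());
        return path.equals("/")
                || path.equals("/index.xhtml")
                || path.startsWith(RESOURCE_IDENTIFIER);
    }

    // Mirrors the filter condition: logged-in users always continue the chain.
    static boolean shouldContinueChain(boolean loggedIn, String requestUri, String contextPath) {
        return loggedIn || isPublic(requestUri, contextPath);
    }
}
```

Comparing with equals instead of contains("index") means a page such as /myindex.xhtml is still protected, while a request for / (the welcome file) is let through.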

How do I get a number in a link out of HTML code with preg_match?

I need to find out whether an array contains a specific piece of HTML. The array contains HTML code, and I need to get a number that is included in a link.
This is what I am searching for (the number 10 is the number I want):
class = "active" href = "
So I tried the following using preg_match:
if(preg_match('/class = "active" href = "*)/',$array["crawler"],$arr)) { print_r($arr,true); }
Unfortunately this gives me nothing as a result, so I guess something is wrong with my preg_match. I already checked all the manuals, but I still don't get what I am doing wrong.
Could someone help me with this? Thank you!
Aside from advising you to not parse HTML using regular expressions, your particular regular expression needs different delimiters:
preg_match('~class = "active" href = "http://www\.example\.com/something-(\d+)~', ...)
Alternatively, you could have escaped the slashes within the regex, but that leads to LSS (leaning slash syndrome):
preg_match('/class = "active" href = "http:\/\/www\.example\.com\/something-(.*)/', ...)
And that's just ugly.
You should have gotten an error, if your error_reporting is turned on.

Regular expression for youtube links

Does someone have a regular expression that gets a link to a Youtube video (not embedded object) from (almost) all the possible ways of linking to Youtube?
I think this is a pretty common problem and I'm sure there are a lot of ways to link that.
A starting point would be:
... please add more possible links and/or regular expressions to detect them.
So far I got this Regular expression working for the examples I posted, and it gets the ID on the first group:
You can use this expression below.
I'm using it, and it covers the most used URLs.
I'll keep updating it on This Gist.
You can test it on this tool.
I like @brunodles's solution the most, but it can still match non-video links like
I went with this solution
It can also be used to match multiple whitespace separated links.
The video id will be captured in the first group.
Tested with the following urls:
// will not match
I improved the links posted above with a friend for a script I wrote for IRC to recognize even links without http at all. It worked on all stress tests I got so far, including garbled text with barely recognizable youtube urls, so here it is:
I tested all the regular expressions that are shown here and none could cover all the URL types that my client was using.
I built this pretty much through trial and error, but it seems to work with all the patterns that Poppy Deejay posted.
Maybe it helps someone who is in a similar situation that I had today ;)
Piggy backing on Fanmade, this covers the below links including the url encoded version of attribution_links:
I've been having problems lately with the attribution_link URLs, so I tried making my own regex that works for those too.
Here is my regex string:
and here are some test cases I've tried:
Also remember to check the string you get for your video url, sometimes it may get the percent characters. If so just do this
url = [url stringByReplacingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
and it should fix it.
Remember also that the index of the youtube key is now index 9.
NSRange youtubeKey = [result rangeAtIndex:9]; //the youtube key
NSString * strKey = [url substringWithRange:youtubeKey] ;
It'd be the longest RegEx in the world if you managed to cover all link formats, but here's one to get you started which will cover the first couple of link formats:
The second group will match the video ID if you need to get that out.
Detects YouTube and YouTube Music links in any string
I took all variants from here:
And built this regexp (YouTube ID is in group 2):
Check it here:
Edit: Works for any single string w/o breaks.
I'm working with these kinds of links:
And here's the regEx I'm using to get ID from it:
This is iterating on the existing answers and handles edge cases better. (for example
Here is a complete solution for getting the YouTube video ID in Java or Android. I haven't found any link that doesn't work with this function.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String getValidYoutubeVideoId(String youtubeUrl) {
    if (youtubeUrl == null || youtubeUrl.trim().contentEquals("")) {
        return "";
    }
    youtubeUrl = youtubeUrl.trim();
    String validYoutubeVideoId = "";
    String regexPattern = "^(?:https?:\\/\\/)?(?:[0-9A-Z-]+\\.)?(?:youtu\\.be\\/|youtube\\.com\\S*[^\\w\\-\\s])([\\w\\-]{11})(?=[^\\w\\-]|$)(?![?=&+%\\w]*(?:['\"][^<>]*>|<\\/a>))[?=&+%\\w]*";
    Pattern regexCompiled = Pattern.compile(regexPattern, Pattern.CASE_INSENSITIVE);
    Matcher regexMatcher = regexCompiled.matcher(youtubeUrl);
    // The first capture group holds the 11-character video ID
    if (regexMatcher.find()) {
        try {
            validYoutubeVideoId = regexMatcher.group(1);
        } catch (Exception ex) {
            // keep the empty default when the group cannot be read
        }
    }
    return validYoutubeVideoId;
}
This is my answer for use in Scala. It extracts the 11-character video ID from a YouTube URL.
"https?://(?:[0-9a-zA-Z-]+\.)?(?:youtu\.be/|youtube\.com\S*[^\w\-\s])([\w \-]{11})(?=[^\w\-]|$)(?![?=&+%\w]*(?:[\'"][^<>]*>|</a>))[?=&+%\w-]*"
def getVideoLinkWR: UserDefinedFunction = udf(f = (videoLink: String) => {
  val youtubeRgx = """https?://(?:[0-9a-zA-Z-]+\.)?(?:youtu\.be/|youtube\.com\S*[^\w\-\s])([\w \-]{11})(?=[^\w\-]|$)(?![?=&+%\w]*(?:[\'"][^<>]*>|</a>))[?=&+%\w-./]*""".r
  videoLink match {
    case youtubeRgx(a) => s"$a".toString
    case _ => videoLink.toString
  }
})
Youtube video URL Change to iframe supported link:
Php function example:
if (!function_exists('clean_youtube_link')) {
* @param $link
* @return string|string[]|null
function clean_youtube_link($link)
return preg_replace(
This should work for almost all youtube links when extracting from a string:
var isValidYoutubeLink: Bool {
    // working for all the youtube url's
    NSPredicate(format: "SELF MATCHES %@", "(?:http?s?:\\/\\/)?(?:www.)?(?:m.)?(?:music.)?youtu(?:\\.?be)(?:\\.com)?(?:(?:\\w*.?:\\/\\/)?\\w*.?\\w*-?.?\\w*\\/(?:embed|e|v|watch|.*\\/)?\\??(?:feature=\\w*\\.?\\w*)?&?(?:v=)?\\/?)([\\w\\d_-]{11})(?:\\S+)?").evaluate(with: self)
}
With this Javascript Regex, the first capture is a video ID :
(?-s)^https?\W+(?:www\.|m\.|music\.)*youtu\.?be(?:\.com|\/watch|\/o?embed|\/shorts|\/attribution_link\?[&\w\-=]*[au]=|\/ytsc\w+|[\?&\/]+[ve]i?\b|\?feature=\w+|-nocookie)*[\/=]([a-z\d\-_]{11})[\?&#% \t ] *.*$
(?-s)^(?:(?!https?[:\/]|www\.|m\.yo|music\.yo|youtu\.?be[\/\.]|watch[\/\?]|embed\/)\V)*(?:https?[:\/]+|www\.|m\.|music\.)+youtu\.?be(?:\.com\/|watch|o?embed(?:\/|\?url=\S+?)?|shorts|attribution_link\?[&\w\-=]*[au]=\/?|ytsc\w+|[\?&]*[ve]i?\b|\?feature=\w+|[\?&]time_continue=\d+|-nocookie|%[23][56FD])*(?:[\/=]|%2F|%3D)([a-z\d\-_]{11})[\?&#% \t ]? *.*$
(the part >>#% \t⠀ ]<< should contain continuous space, which is Alt+255, but stackoverflow-com can't print it)
(this string may be replaced to \1, sorted and abbreviated with: )
(./dot can take up any symbol; \V or [^\r\n] can any except special, emoji and others; this >> [^!-⠀:/‽|\s] << can grab some emoji) • ♾ 𝕳𝕰𝕽𝕰𝕿𝕳𝕰𝖄𝕮𝕺𝕸𝕰 - 𝔩𝔢𝔞𝔳𝔢 𝔪𝔢 𝔞𝔩𝔬𝔫𝔢 • 7:15
This regex solved my problem. I can get YouTube links in watch, embed, or shared form.
(?:http(?:s)?:\/\/)?(?:www\.)?(?:youtu\.be\/|youtube\.com\/(?:(?:watch)?\?(?:.*&)?v(?:i)?=|(?:embed|v|vi|user)\/))([^\?&\"'<> #]+)
You can check here
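As a rough illustration of how the last pattern above might be applied from Java, here is a minimal sketch using `java.util.regex`. The class `YoutubeIdExtractor` and the method `extractId` are hypothetical names, and the pattern is the one quoted above with only the escaping adapted to a Java string literal. It uses `find()`, so the URL may be embedded in longer text.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class YoutubeIdExtractor {
    // Same expression as above; "/" needs no escaping in Java, "\" is doubled.
    private static final Pattern YOUTUBE_URL = Pattern.compile(
            "(?:https?://)?(?:www\\.)?"
            + "(?:youtu\\.be/|youtube\\.com/(?:(?:watch)?\\?(?:.*&)?vi?=|(?:embed|v|vi|user)/))"
            + "([^?&\"'<> #]+)");

    // Returns the first capture group (the ID) of the first match, if any.
    static Optional<String> extractId(String text) {
        if (text == null) {
            return Optional.empty();
        }
        Matcher m = YOUTUBE_URL.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Note that the pattern also accepts /user/ paths, in which case the captured group is a user name rather than a video ID, so callers may want to check the length of the result.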

How to use regex in selenium locators

I'm using selenium RC and I would like, for example, to get all the links elements with attribute href that match:
I would like to use:
sel.get_attribute( '//a[regx:match(@href, "http://[^/]*\")]/@name' )
which would return a list of the name attribute of all the links that match the regex.
(or something like it)
The answer above is probably the right way to find ALL of the links that match a regex, but I thought it'd also be helpful to answer the other part of the question, how to use regex in Xpath locators. You need to use the regex matches() function, like this:
(this, of course, would click the div with 'id=checkboxes', or 'id=cheANYTHINGHEREboxes')
Be aware, though, that the matches function is not supported by all native browser implementations of Xpath (most conspicuously, using this in FF3 will throw an error: invalid xpath[2]).
If you have trouble with your particular browser (as I did with FF3), try using Selenium's allowNativeXpath("false") to switch over to the JavaScript Xpath interpreter. It'll be slower, but it does seem to work with more Xpath functions, including 'matches' and 'ends-with'. :)
You can use the Selenium command getAllLinks to get an array of the ids of the links on the page, which you can then loop through, checking the href with getAttribute, which takes the locator followed by an @ and the attribute name. For example, in Java this might be:
String[] allLinks = session().getAllLinks();
List<String> matchingLinks = new ArrayList<String>();
for (String linkId : allLinks) {
    String linkHref = selenium.getAttribute("id=" + linkId + "@href");
    if (linkHref.matches("http://[^/]*\\")) {
        matchingLinks.add(linkId);
    }
}
A possible solution is to use sel.get_eval() and write a JS script that returns a list of the links. something like the following answer:
selenium: Is it possible to use the regexp in selenium locators
Here's some alternate methods as well for Selenium RC. These aren't pure Selenium solutions, they allow interaction with your programming language data structures and Selenium.
You can also get the HTML page source, then run a regular expression over it to return a match set of links. Use regex grouping to separate out URLs, link text/ID, etc., and you can then pass them back to Selenium to click on or navigate to.
Another method is to get the HTML page source or innerHTML (via DOM locators) of a parent/root element, then convert the HTML to XML as a DOM object in your programming language. You can then traverse the DOM with the desired XPath (with a regular expression or not) and obtain a nodeset of only the links of interest. From there, parse out the link text/ID or URL and pass it back to Selenium to click on or navigate to.
Upon request, I'm providing examples below. It's mixed languages since the post didn't appear to be language specific anyways. I'm just using what I had available to hack together for examples. They aren't fully tested or tested at all, but I've worked with bits of the code before in other projects, so these are proof of concept code examples of how you'd implement the solutions I just mentioned.
//Example of element attribute processing by page source and regex (in PHP)
$pgSrc = $sel->getPageSource();
//simple hyperlink extraction via regex below, replace with better regex pattern as desired
//$matches is a 2D array, $matches[0] is array of whole string matched, $matches[1] is array of what's in parenthesis
//you either get an array of all matched link URL values in parenthesis capture group or an empty array
$links = count($matches) >= 2 ? $matches[1] : array();
//now do as you wish, iterating over all link URLs
//NOTE: these are URLs only, not actual hyperlink elements
//Example of XML DOM parsing with Selenium RC (in Java)
String locator = "id=someElement";
String htmlSrcSubset = sel.getEval("this.browserbot.findElement(\""+locator+"\").innerHTML");
//using JSoup XML parser library for Java, see
Document doc = Jsoup.parse(htmlSrcSubset);
/* once you have this document object, can then manipulate & traverse
it as an XML/HTML node tree. I'm not going to go into details on this
as you'd need to know XML DOM traversal and XPath (not just for finding locators).
But this tutorial URL will give you some ideas:
the example there seems to indicate first getting the element/node defined
by content tag within the "document" or source, then from there get all
hyperlink elements/nodes and then traverse that as a list/array, doing
whatever you want with an object oriented approach for each element in
the array. Each element is an XML node with properties. If you study it,
you'd find this approach gives you the power/access that WebDriver/Selenium 2
now gives you with WebElements but the example here is what you can do in
Selenium RC to get similar WebElement kind of capability
*/
Selenium's By.Id and By.CssSelector methods do not support Regex and By.XPath only does where XPath 2.0 is enabled. If you want to use Regex, you can do something like this:
void MyCallingMethod(IWebDriver driver)
{
    //Search by ID:
    string attrName = "id";
    //Regex = 'a number that is 1-10 digits long'
    string attrRegex = "[0-9]{1,10}";
    SearchByAttribute(driver, attrName, attrRegex);
}

IEnumerable<IWebElement> SearchByAttribute(IWebDriver driver, string attrName, string attrRegex)
{
    List<IWebElement> elements = new List<IWebElement>();
    //Allows spaces around equal sign. Ex: id = 55
    string searchString = attrName + "\\s*=\\s*\"" + attrRegex + "\"";
    //Search page source
    MatchCollection matches = Regex.Matches(driver.PageSource, searchString, RegexOptions.IgnoreCase);
    //iterate over matches
    foreach (Match match in matches)
    {
        //Get exact attribute value
        Match innerMatch = Regex.Match(match.Value, attrRegex);
        string cssSelector = "[" + attrName + "='" + innerMatch.Value + "']";
        //Find element by exact attribute value
        elements.Add(driver.FindElement(By.CssSelector(cssSelector)));
    }
    return elements;
}
Note: this code is untested. Also, you can optimize this method by figuring out a way to eliminate the second search.
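The page-source approach described in the answers above can also be sketched in Java with the standard library only. This is an illustration under assumptions rather than a tested Selenium recipe: `HrefRegexSearch` and `matchingHrefs` are hypothetical names, the HTML string stands in for whatever the driver returns as page source, and the href extraction is deliberately simplistic (double-quoted values only).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefRegexSearch {
    // Simplistic href extraction: double-quoted values, spaces allowed around '='.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns every href value in the source that fully matches urlRegex.
    static List<String> matchingHrefs(String pageSource, String urlRegex) {
        Pattern urlPattern = Pattern.compile(urlRegex);
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(pageSource);
        while (m.find()) {
            String url = m.group(1);
            if (urlPattern.matcher(url).matches()) {
                result.add(url);
            }
        }
        return result;
    }
}
```

Each returned URL could then be turned into an exact-match locator, for example a CSS attribute selector on href, so the element lookup itself needs no regex support.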