I am trying to develop a simple test script using JMeter that performs a Bing search, randomly selects a link from the search results, and navigates to the selected link. I am able to capture and randomize the link selection using the Regular Expression Extractor and Random Variable functions, but what I am not sure about is how to extract the path from the randomly captured href link. For example, if the captured link is "http://en.wikipedia.org/wiki/.NET_Framework", I want to extract "/wiki/.NET_Framework" from it and substitute it into the "Path" field of the subsequent HTTP request. If I understand correctly, a Regular Expression Extractor won't work here, as there are no unique boundaries for extracting the path directly from the page response.
Let's break down http://en.wikipedia.org/wiki/.NET_Framework into components:
http - protocol
en.wikipedia.org - host
/wiki/.NET_Framework - path
Add a Beanshell PreProcessor as a child of the request where you need to substitute the path, and add the following code to the Script area:
import java.net.URL;
// Parse the previously extracted link and point the current sampler at it
URL url = new URL(vars.get("LINK"));
sampler.setProtocol(url.getProtocol()); // "http" for the example link
sampler.setDomain(url.getHost());       // "en.wikipedia.org"
sampler.setPath(url.getPath());         // "/wiki/.NET_Framework"
The code above assumes that the Reference Name for your URL is "LINK". Change it to the Reference Name specified in your Regular Expression Extractor and it should work fine.
The Beanshell PreProcessor is executed before the request, so all the necessary fields will be populated in time.
See the How to use BeanShell: JMeter's favorite built-in component guide for more details on Beanshell scripting.
I need to extract a CSRF token from a webpage, then log it via BeanShell. The latter part is working thanks to the help I received in this thread, but now I need to figure out how to get ${token} to populate with the right data.
Note: I know the Regular Expression Extractor is not the preferred method, but I have to stay within the parameters of the exercise in this case.
First, I have an HTTP Request sampler set to perform a GET against www.blazedemo.com/register.
Second, I checked the response data shown in the response tree to find the CSRF token:
<!-- CSRF Token -->
<meta name="csrf-token" content="4ZCKKqQgwJH5lT5dQSeAwgeyOr7plAe7IOVRGmQm">
I have a Regular Expression Extractor set up to grab it.
In case it fails to do so, I have default set as "NOT_FOUND".
Finally, I have a post processor logging whatever value is given to ${token}.
I find the following in my log:
2017-10-31 15:12:31,975 INFO o.a.j.u.BeanShellTestElement: The token is: NOT_FOUND
Remember that using regular expressions for parsing HTML is not recommended; I would go for the CSS/JQuery Extractor instead.
Add CSS/JQuery Extractor as a child of the request which has this CSRF token
Configure it as follows:
Reference Name: anything meaningful, e.g. token
CSS/JQuery Expression: meta[name=csrf-token]
Attribute: content
More information: How to Use the CSS/JQuery Extractor in JMeter
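Under the hood, the CSS/JQuery Extractor uses jsoup as its default implementation, so the configuration above does roughly the equivalent of this JSR223 PostProcessor snippet (a sketch for illustration only; "token" matches the Reference Name above):
import org.jsoup.Jsoup;
// prev holds the previous SampleResult; select the meta tag and read its "content" attribute
String html = prev.getResponseDataAsString();
String token = Jsoup.parse(html).select("meta[name=csrf-token]").attr("content");
vars.put("token", token);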
If you still want to go with Regular Expressions, change "Field to check" to Body. However, I wouldn't recommend this: when it comes to parsing HTML responses, regular expressions are a headache to develop and support, and they are very sensitive to any markup change, e.g. if the order of attributes changes or an attribute moves to a new line, it will break your test.
You chose Response Headers as the "Field to check", which means JMeter searches for the expression inside the response headers.
In your case you are searching for the HTML meta tag, so you need to choose Body instead.
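If you do stick with the Regular Expression Extractor approach, the equivalent logic in a BeanShell/JSR223 PostProcessor would look something like this (a sketch; the exact pattern is an assumption based on the markup shown above):
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Search the response body (not the headers) for the csrf-token meta tag
String body = prev.getResponseDataAsString();
Matcher m = Pattern.compile("name=\"csrf-token\" content=\"(.+?)\"").matcher(body);
vars.put("token", m.find() ? m.group(1) : "NOT_FOUND");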
I have a JMeter test plan that goes to a site's sitemap.xml page, retrieves each URL on that page with an XPath Extractor, then passes ${url} to an HTTP Request sampler within a ForEach Controller to send the results for each page to a file. This works great, except I just realized that the links on this sitemap.xml page are hardcoded. This is a problem when I want to test https://staging-website.com, because all of the links in sitemap.xml point at www.website.com pages. It seems like there must be a way to replace 'www.website.com' in each ${url} with 'staging-website.com' with regex or something, but I haven't been able to figure out how. Any suggestions would be greatly appreciated.
Add a BeanShell pre-processor to manipulate the url.
// Swap protocol and host in one go:
// http://www.website.com/... becomes https://staging-website.com/...
String sUrl = vars.get("url");
String sNewUrl = sUrl.replace("http://www.website.com", "https://staging-website.com");
log.info("sNewUrl: " + sNewUrl);
vars.put("url", sNewUrl);
You can also try correlating the sitemap.xml response with a Regular Expression Extractor anchored on www.website.com, so that you extract only the URI portion instead of the full URL. Shouldn't you have that already, since the HTTP Sampler only lets you enter the URI segment and not the host name?
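If you go that route, the extractor configuration might look like this (a sketch, assuming the sitemap lists absolute http URLs):
Reference Name: url
Regular Expression: www\.website\.com([^<]+)
Template: $1$
Match No.: -1
This captures only the path portion after the host, which you can then put in the sampler's Path field while pointing the Server Name field at staging-website.com.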
You can use the __strReplace() function (arguments: subject, regex, replacement, optional variable name) available via the JMeter Plugins project like:
${__strReplace(${url},www\.website\.com,staging-website.com,)}
The easiest way to install JMeter Custom Functions (as well as any other plugins) is using JMeter Plugins Manager
I was able to replace the host within the string by putting
${__javaScript('${url}'.replace('www.website'\,'staging.website'))}
in the path input of the second http request sampler. The answers provided by Selva and Dimitri were more elegant, so if I have time in the future to come back to this I will give them another try. I really appreciate the help!
I'm currently using Nutch 1.7 to crawl my domain. My issue is specific to URLs being indexed as www vs. non-www.
Specifically, after firing the crawl and index to Solr 4.5 then validating the results on the front-end with AJAX Solr, the search results page lists results/pages that are both 'www' and '' urls such as:
www.mywebsite.com
mywebsite.com
www.mywebsite.com/page1.html
mywebsite.com/page1.html
My understanding is that the url filtering aka regex-urlfilter.txt needs modification. Are there any regex/nutch experts that could suggest a solution?
Here is the code on pastebin.
There are at least a couple of solutions.
1.) urlfilter-regex plugin
If you don't want to crawl the non-www pages at all, or else filter them at a later stage such as at index time, that is what the urlfilter-regex plugin is for. It lets you mark any URLs matching the regex patterns starting with "+" to be crawled. Anything that does not match a regex prefixed with a "+" will not be crawled. Additionally in case you want to specify a general pattern but exclude certain URLs, you can use a "-" prefix to specify URLs to subsequently exclude.
In your case you would use a rule like:
+^(https?://)?www\.
This will match anything that starts with:
https://www.
http://www.
www.
and therefore will only allow such URLs to be crawled.
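Combined with an exclude rule, a minimal regex-urlfilter.txt might look like this (the static-resource rule is only an illustration of the "-" prefix described above):
# skip common static resources
-\.(gif|jpg|png|css|js)$
# crawl only www URLs; anything unmatched is rejected by default
+^(https?://)?www\.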
Given that the URLs listed were not being excluded by your regex-urlfilter, either the plugin wasn't turned on in your nutch-site.xml, or it is not pointed at that file.
In nutch-site.xml you have to specify regex-urlfilter in the list of plugins, e.g.:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-basic|query-(basic|site|url)|response-(json|xml)|urlnormalizer-(pass|regex|basic)</value>
</property>
Additionally, check that the property specifying which file to use is not overridden in nutch-site.xml and is correct in nutch-default.xml. It should be:
<property>
<name>urlfilter.regex.file</name>
<value>regex-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing regular expressions
used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>
and regex-urlfilter.txt should be in the conf directory for nutch.
There is also the option to perform the filtering only at certain steps, e.g. at index time, if that is the only place you want to filter.
2.) solrdedup command
If the URLs point to the exact same page, which I am guessing is the case here, they can be removed by running the nutch command to delete duplicates after crawling:
http://wiki.apache.org/nutch/bin/nutch%20solrdedup
This will use the digest values computed from the text of each indexed page to find any pages that were the same and delete all but one.
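A typical invocation against a local Solr instance looks along these lines (the Solr URL is an assumption; see the wiki page above for the exact syntax for your version):
bin/nutch solrdedup http://localhost:8983/solr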
However you would have to modify the plugin to change which duplicate is kept if you want to specifically keep the "www" ones.
3.) Write a custom indexing filter plugin
You can write a plugin that reads the URL field of a Nutch document and converts it in any way you want before indexing. This would give you more flexibility than using an existing plugin like urlnormalizer-regex.
It is actually very easy to make plugins and add them to Nutch, which is one of the great things about it. As a starting point, you can copy and look at one of the other plugins included with Nutch that implement IndexingFilter, such as the index-basic plugin.
You can also find a lot of examples:
http://wiki.apache.org/nutch/WritingPluginExample
http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
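As a rough starting point, a filter that rewrites bare-host URLs to their www form before indexing could look like this (a sketch against the Nutch 1.x IndexingFilter interface; the class name and the field-rewriting calls are assumptions to verify against your version):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class WwwNormalizingFilter implements IndexingFilter {
  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Prefix bare hosts with "www." so both variants index under one URL form
    String normalized = url.toString().replaceFirst("^(https?://)(?!www\\.)", "$1www.");
    doc.removeField("url");     // replace the value added by index-basic
    doc.add("url", normalized);
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}
Note that for the url field to be present, this filter has to run after index-basic; the order is controlled by the indexingfilter.order property in nutch-site.xml.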
How do I fetch Page Name with Yahoo Pipes?
I'm making a news / blog aggregator, and need to know the name of the site where the info is coming from (bbc, cnn, fox, etc).
Do I need to do this with regex?
Can anyone help?
You can fetch the page using the XPath Fetch Page or Fetch Feed modules in the Sources menu. Maybe with others too.
After that you can extract the page name itself using the various operators, possibly Regex, or others, depending on the source page you are using and the output you want to get.
In general your question is too broad and difficult to answer. To get you started, I created an example pipe that extracts the title of your question from this post, which is basically the "page name" of the current page.
http://pipes.yahoo.com/pipes/pipe.info?_id=668acf3f807c30d7b75f12459edd3252
I used the XPath Fetch Page with parameters:
URL = this page
Extract using XPath = //div[@id="question-header"]
I got that div path by inspecting the source code of this page, where I saw that div#question-header is the container of the question. I could have selected a deeper inner container or a higher-level container; it all depends on how much other information you need. The more information you want to use from the page, the higher-level container you select.
Next, I used the Create RSS operator to create a proper RSS feed, with parameters:
Title = h1.a
Link = h1.a.href
I chose these elements because in the container I extracted with xpath, the page name is inside h1 a. In Yahoo Pipes you use a dot as the path separator.
I found this sample pipe http://pipes.yahoo.com/pipes/pipe.info?_id=69b5dce1c59501a0c64a660c1cfdb856. The page title includes the name of the site too. I am not sure if this is what you are looking for.
I am a beginner in JMeter and am trying to save an ID (like ID=1234567) from an HTTP request's response data using a Regular Expression Extractor and a 3rd-party plugin called Dummy Sampler with File Writer, but I am failing every time. Here is what the response data looks like:
/Registration/PaymentInformation?accountRegistrationId=372036
My objective is to save these accountRegistrationId values in a CSV file and then use them in the following request as a parameter. The only part where I am stuck is capturing them and saving them to a file; after that I can manage. I have googled everywhere but can't find a solution. Please help me.
The following regex should work:
(.*accountRegistrationId=)([0-9]+)(.*)
Use $2$ as the Template to retrieve the id.
JMeter regex reference
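For the file-writing part, a BeanShell PostProcessor can append each extracted id to a CSV file directly, which removes the need for the Dummy Sampler / file writer combination (a sketch; "regId" is the assumed Reference Name of the Regular Expression Extractor above):
import java.io.FileWriter;
// "true" opens the file in append mode so each iteration adds one row
FileWriter out = new FileWriter("accountRegistrationIds.csv", true);
out.write(vars.get("regId") + "\n");
out.close();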