I'm trying to scrape a piece of text from a website using Kimonolabs. The text is successfully scraped using the advanced setting:
div > div > ul > li.location > span.value
The text being scraped using this CSS selector is:
Cityname, streetname 1
However, I wish to delete everything from the comma onward, so that only this remains:
Cityname
I wish to do this with regex, but I'm totally ignorant about it. What I do know is that it has to consist of 3 blocks when using Kimonolabs: https://help.kimonolabs.com/hc/en-us/articles/203043464-Manually-input-regular-expressions
Can anybody help me set up the correct regex? All I've got so far is the following, but it's not the correct format for Kimonolabs (the dashboard doesn't accept it):
^(.+?),
See the docs you referred to:
The regular expression pattern in kimono is defined in three parts. It's important that any custom regular expression you produce retains the three part notation, with the surrounding ( ) for each part. The first part refers to the pattern to the left of the desired content. The middle part refers to the pattern that the desired content must match and the third part refers to the pattern to the right of the desired content.
So, you seem to need:
/^()([^,]+)()/
Or, /(^)([^,]+)(,)/ (it should be equivalent), and the 2nd capture group (the middle part) should capture the Cityname.
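As a quick sanity check of the two patterns above outside Kimono, here is a minimal Python sketch. It only tests the expressions themselves; how Kimono applies the three parts internally is not covered here, and the helper name middle_part is made up for illustration.

```python
import re
from typing import Optional

# Both three-part patterns from the answer: (left)(wanted content)(right).
EMPTY_SIDES = re.compile(r"^()([^,]+)()")
COMMA_ON_RIGHT = re.compile(r"(^)([^,]+)(,)")

def middle_part(pattern: "re.Pattern[str]", text: str) -> Optional[str]:
    """Return the second capture group (the middle part), or None if nothing matches."""
    match = pattern.search(text)
    return match.group(2) if match else None

scraped = "Cityname, streetname 1"
print(middle_part(EMPTY_SIDES, scraped))
print(middle_part(COMMA_ON_RIGHT, scraped))
```

For the scraped text both patterns give the same middle group. They differ only when the text contains no comma at all: the first still matches the whole string, while the second requires a comma and does not match.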
I have a block of text in which I want to search for an IMDb link and, if one is found, extract the IMDb ID.
Here is an example string:
http://www.imdb.com/Title/tt2618986
http://www.google.com/tt2618986
https://www.imdb.com/Title/tt2618986
http://www.imdb.com/title/tt1979376/?ref_=nv_sr_1?ref_=nv_sr_1
I want to extract only the numeric ID (2618986 in the first example) from lines 1, 3 and 4.
Here is the regex I am currently using, without any luck:
(?:http|https)://(?:.*\.|.*)imdb.com/(?:t|T)itle(?:\?|/)(..\d+)(.+)?
https://regex101.com/r/ERtoRz/1
If you are interested in extracting only the ID, i.e. 2618986, none of the comments quite nail it, since they match tt2618986. Building on @The fourth bird's answer, you will need to separate tt2618986 into two parts: tt and 2618986. So instead of a single ([a-zA-Z0-9]+), use [a-zA-Z]+([0-9]+).
^https?://www\.imdb\.com/[Tt]itle[?/][a-zA-Z]+([0-9]+)
You can then extract the 2618986 part by calling group 1.
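To illustrate "calling group 1", here is a small Python sketch that runs the expression above over the sample lines from the question. The re.MULTILINE flag is my addition so that ^ anchors at the start of each line of the block; the variable names are illustrative.

```python
import re

# Expression from the answer; MULTILINE lets ^ match at the start of every line.
IMDB_ID = re.compile(
    r"^https?://www\.imdb\.com/[Tt]itle[?/][a-zA-Z]+([0-9]+)",
    re.MULTILINE,
)

sample = """http://www.imdb.com/Title/tt2618986
http://www.google.com/tt2618986
https://www.imdb.com/Title/tt2618986
http://www.imdb.com/title/tt1979376/?ref_=nv_sr_1?ref_=nv_sr_1"""

# With a single capture group, findall returns just the group 1 values.
ids = IMDB_ID.findall(sample)
print(ids)
```

The google.com line is skipped, and the other three lines each contribute their digits without the tt prefix.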
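This expression might simply extract those desired digits, provided it is run with the case-insensitive flag (i), since the sample URLs contain both Title and title: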
^(?:https?://)(?:www\.)?imdb\.com/title/[a-z]+([0-9]+).*$
If you wish to explore, simplify, or modify the expression, it is explained in the top right panel of regex101.com. If you'd like, you can also watch there how it would match against some sample inputs.
I'm stuck with the following.
I have a page on IBM FileNet containing a list of objects (documents or files) which have a specific classId and id in their href. I need JMeter to get all hrefs containing a specific type of ID:
<a href="http://ipaddress/Workplace/Browse.jsp?eventTarget=WcmController&eventName=GetInfo&id={350B278C-DE7D-44DE-9B54-099672152476}&vsId=&classId={F14AC85A-4474-479A-9B4E-BCBA180B7975}&objectStoreName=Nice&majorVersion=&minorVersion=&versionStatus=&mimeType=&mode=&objectType=customobject&isPopup=true" target="_blank">
The 'classId' {F14AC85A-4474-479A-9B4E-BCBA180B7975} is the class ID type I need to click on the page (there are several files with this classId, but that is no problem). The 'id', on the other hand, is different for each file.
How can I extract all 'id's belonging to this specific classId and make JMeter pass one to the next sampler, so that it clicks on just one of them? What would my regex look like?
As already mentioned in the comment, I do not know JMeter or how to implement this in it. A regular expression to match both id and classId within a link would be:
~(classId=|id=)([^&]*)~g
That is, search for the string classId= or id= first. If one of the strings is found, match any character afterwards except an ampersand (&), as many times as possible (*), and capture it in a group (the parentheses). You may need to fiddle with the modifiers after the regex (e.g. g for global).
See this regex101 fiddle for more information.
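Outside JMeter, the idea can be sketched in Python: apply the expression above to each href, then keep only the id values whose classId is the wanted one. This is an illustration of the matching logic only, not JMeter configuration; the helper name ids_for_class and the second sample link (with an all-zero id and classId) are made up for the example.

```python
import re

# The answer's expression (without the ~ delimiters) plus a simple href matcher.
PARAM = re.compile(r"(classId=|id=)([^&]*)")
HREF = re.compile(r'href="([^"]*)"')

def ids_for_class(html: str, class_id: str) -> list:
    """Collect the id parameter of every link whose classId equals class_id."""
    ids = []
    for href in HREF.findall(html):
        # findall yields (name, value) pairs such as ("id=", "{...}").
        params = {name.rstrip("="): value for name, value in PARAM.findall(href)}
        if params.get("classId") == class_id and "id" in params:
            ids.append(params["id"])
    return ids

WANTED = "{F14AC85A-4474-479A-9B4E-BCBA180B7975}"
page = (
    '<a href="http://ipaddress/Workplace/Browse.jsp?eventName=GetInfo'
    '&id={350B278C-DE7D-44DE-9B54-099672152476}&vsId='
    '&classId={F14AC85A-4474-479A-9B4E-BCBA180B7975}&isPopup=true">'
    # Hypothetical second link with a different classId, to show the filtering.
    '<a href="http://ipaddress/Workplace/Browse.jsp?eventName=GetInfo'
    '&id={00000000-0000-0000-0000-000000000000}&vsId='
    '&classId={00000000-0000-0000-0000-000000000001}&isPopup=true">'
)
print(ids_for_class(page, WANTED))
```

Because the match is case-sensitive, id= does not accidentally match inside vsId= or classId=.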
How can I fetch the contents of the value attribute of the tags below across multiple files?
<h:graphicImage .... value="*1.png*" ...../>
<h:graphicImage .... value="*2.png*" ....>...</h:graphicImage>
My regular expression search should result in:
1.png
2.png
All I could find was content for tags with separate closing tags, but what about self-closing tags?
Use an XML parser instead. Regex cannot truly parse XML properly unless you know the input will always follow a particular form.
However, here is a regex you can use to extract the value attribute of h:graphicImage tags, but read the caveats after:
<h:graphicImage[^>]+value="\*(.*?)\*"
and the 1.png or 2.png will be in the first captured group.
Caveats:
here I have assumed that your 1.png, 2.png etc are always surrounded by asterisks as that is what it seems from your question (that is what the \* is for)
this regex will fail if one of the attributes has a ">" character in it, for example
<h:graphicImage foo=">" value="*1.png*"
This is what I mentioned before about regex never being able to parse XML properly.
You could work around this by adjusting your regex:
<h:graphicImage.+?value="\*(.*?)\*"
But this means that if you had <h:graphicImage /><foo value="*1.png*"> then the 1.png from the foo tag is extracted, when you only want to extract from the graphicImage tag.
Again, regex will always have issues with corner cases for XML, so you need to adjust according to your application (for example, if you know that only the graphicImage tag will ever have a "value" attribute, then the second case may be better than the first).
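The first regex and its ">" caveat can be checked with a short Python sketch. The id attributes in the sample markup are made up to stand in for the elided attributes in the question.

```python
import re

# First regex from the answer: no ">" may occur between the tag name and value=.
GRAPHIC_VALUE = re.compile(r'<h:graphicImage[^>]+value="\*(.*?)\*"')

# Sample markup; the id attributes are placeholders for the elided ones.
markup = (
    '<h:graphicImage id="a" value="*1.png*" />\n'
    '<h:graphicImage id="b" value="*2.png*"></h:graphicImage>'
)
found = GRAPHIC_VALUE.findall(markup)
print(found)

# The caveat: a ">" inside an earlier attribute value stops [^>]+ too early.
tricky = '<h:graphicImage foo=">" value="*1.png*" />'
missed = GRAPHIC_VALUE.findall(tricky)
print(missed)
```

The first call finds both file names; the second finds nothing, which is exactly the corner case described in the caveats.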
I am not new to regex but have come across a problem I can't seem to solve. I'm trying to locate a specific HTML tag that has a specific attribute/value pair (it may have other attributes, too, but those are optional), extract its contents as a backreference, and wrap a new custom tag around it. The original tag is:
<p backgroundColor="#0066cc" color="#0066cc" lineHeight="18" paragraphSpaceAfter="15" paragraphSpaceBefore="15" fontSize="24">This is my second paragraph, with some <span fontStyle="italic">inline stuff</span>too.</p>
I'd like to get it to be:
<custom_heading>This is my second paragraph, with some <span fontStyle="italic">inline stuff</span>too.</custom_heading>
The expression I'm currently using is this:
r = new RegExp("<p[^<]*backgroundColor=\"#0066cc\"[^>?]*\>","gi");
s=s.replace(r,"<bwt_heading>");
This works fine (replacing only the opening tag) until I then try and add the content, and closing tag:
<p[^<]*backgroundColor=\"#0066cc\"[^>?]*\>(.*?)</p>
The above results in no match, no replacement at all. Please help! I have managed to replace several other tags (and preserve their contents via backreference, like so:
<span>
If the short regex works, the long one should, too--unless there are newlines between the <p> and </p>. Have you tried it with ([\s\S]*?) instead of (.*?)?
Also, you've got [^<]* and [^>?]* in your original regex, neither of which seems right. I suspect they should both be [^>]*.
EDIT: It appears AS3 supports the s modifier for dotall mode (a.k.a. single-line or dot-matches-newlines mode), which is nice. ActionScript supposedly implements the ECMAScript standard, which (as of version 3) does not support a dotall mode, necessitating the [\s\S] hack we all know and love from JavaScript regexes.
That being the case, you can keep (.*?) as it is and just add the s flag:
r = new RegExp("<p[^>]*backgroundColor=\"#0066cc\"[^>]*>(.*?)</p>","gis");
s=s.replace(r,"<bwt_heading>$1</bwt_heading>");
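The effect of dotall mode can be reproduced in Python, used here only as a stand-in for ActionScript: re.DOTALL plays the role of the s flag and \1 the role of $1. The sample paragraph is a shortened version of the one in the question with a newline inserted, which is an assumption about what the real input looks like.

```python
import re

PATTERN = r'<p[^>]*backgroundColor="#0066cc"[^>]*>(.*?)</p>'

# Shortened sample with a newline between <p> and </p> (assumed input shape).
source = (
    '<p backgroundColor="#0066cc" fontSize="24">This is my second paragraph,\n'
    'with some <span fontStyle="italic">inline stuff</span>too.</p>'
)

# Without dotall, "." stops at the newline, so nothing is replaced.
without_dotall = re.sub(PATTERN, r"<bwt_heading>\1</bwt_heading>", source, flags=re.IGNORECASE)

# With dotall (the "s" flag in the answer), "." also matches the newline.
with_dotall = re.sub(
    PATTERN, r"<bwt_heading>\1</bwt_heading>", source, flags=re.IGNORECASE | re.DOTALL
)
print(with_dotall)
```

In engines without a dotall flag, swapping (.*?) for ([\s\S]*?) gives the same result.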
I'm trying to write a regexp which will help to find non-translated texts in html code.
Translated texts are those that go through a special tag, <fmt:message>, or through the construction ${...}
Ex. non-translated:
<h1>Hello</h1>
Translated texts are:
<h1><fmt:message key="hello" /></h1>
<button>${expression}</button>
I've written the following expression:
\<(\w+[^>])(?:.*)\>([^\s]+?)\</\1\>
It finds correct strings like:
<p>text</p>
Correctly skips
<a><fmt:message key="common.delete" /></a>
But also catches:
<li><p><fmt:message key="common.delete" /></p></li>
And I can't figure out how to add an exception for ${...} strings to this expression.
Can anybody help me?
If I understand you properly, you want to ensure the data inside the "tag" doesn't contain fmt:message or ${....}
You might be able to use a negative lookahead in conjunction with a . to assert that the characters captured by the . are not one of those cases:
/<(\w+)[^>]*>(?:(?!<fmt:message|\$\{|<\/\1>).)*<\/\1>/i
If you want to avoid capturing any "tags" inside the tag, you can drop the <fmt:message portion and use [^<] instead of . so that only characters other than < are matched:
/<(\w+)[^>]*>(?:(?!\$\{)[^<])*<\/\1>/i
Added from comment: if you also want to exclude "empty" tags, add another negative lookahead, this time (?!\s*<), to ensure that the stuff inside the tag is not empty or whitespace only:
/<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*<\/\1>/i
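A short Python sketch can confirm how this last expression behaves on the examples from the question. The list of samples is assembled from the question's snippets; the names UNTRANSLATED and flagged are illustrative.

```python
import re

# Last expression from the answer, with the / delimiters and i flag translated to Python.
UNTRANSLATED = re.compile(
    r"<(\w+)[^>]*>(?!\s*<)(?:(?!\$\{)[^<])*</\1>",
    re.IGNORECASE,
)

samples = [
    "<h1>Hello</h1>",                                       # plain text: should be flagged
    '<h1><fmt:message key="hello" /></h1>',                 # translated via tag
    "<button>${expression}</button>",                       # translated via ${...}
    '<li><p><fmt:message key="common.delete" /></p></li>',  # nested, translated
]

flagged = [s for s in samples if UNTRANSLATED.search(s)]
print(flagged)
```

Only the plain-text heading is reported; the nested <li><p> case that the original expression caught is now skipped.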
If the format is simple as in your examples you can try this:
<(\w+)>(?:(?!<fmt:message).)+</\1>
Rewritten into a more formal question:
Can you match
aba
but not
aca
without catching
abcba ?
Yes.
FSM:
Start->A->B->A->Terminate
Insert abcba and run it
Start is ready for input.
a -> MATCH, transition to A
b -> MATCH, transition to B
c -> FAIL, return fail.
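The state machine traced above can be written out as a tiny Python sketch. The state names follow the diagram; the function name accepts_aba and the transition-table layout are choices made for this illustration.

```python
def accepts_aba(text: str) -> bool:
    """Run the Start->A->B->A machine; any unexpected character fails immediately."""
    # (current state, input character) -> next state
    transitions = {
        ("Start", "a"): "A",
        ("A", "b"): "B",
        ("B", "a"): "Terminate",
    }
    state = "Start"
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:  # no transition defined: FAIL
            return False
    return state == "Terminate"

for candidate in ("aba", "aca", "abcba"):
    print(candidate, accepts_aba(candidate))
```

Running abcba fails at the c, just as in the trace, because state B has no transition for that character.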
I've used a simple one like this with success,
<([^>]+)[^>]*>([^<]*)</\1>
Of course, if any CDATA in those contains '<', it's not going to work so well, but it should do fine for simple XML.
also see
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
for a discussion of using regex to parse html
executive summary: don't