How to get regex inner html in ubot studio - regex

Hi, I have some text inside uBot Studio that I am trying to scrape.
The function I use is find regular expression, and the list item is this:
<td class="amt base">
$136
</td>
How can i get the $136 with regex?
I tried to use the following:
<td class="amt base">(.*)</td>
<td class="amt base">(*)</td>
<td class="amt base">*</td>
But none of them seem to work.
Thanks a lot for sharing your regex knowledge.

How about the /s modifier?
/(?:<td class="amt base">)(.*)(?:<\/td>)/s
/s treats the string as a single line. With this change, .* will match any character whatsoever, even a newline, which it would not normally match.
?: makes a group non-capturing. It is optional here, but it was added so that the pattern yields a single capture group.
\/ it is also important to escape /, because it is the delimiter of the regex literal.
Important: I did not see a language specified in your post, and /s may not be supported by some languages. Older JavaScript engines lack it, and Ruby uses /m for the same behavior.
Update
Even though it is uncertain whether it will work with all of your input values, you could try this:
/(?:<td class="amt base">\s+)(.*)(?:\s+<\/td>)/
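As a rough illustration of the points above, here is a minimal Python sketch. Python is used only for demonstration; uBot Studio's own regex engine is not shown and may behave differently. re.DOTALL is Python's counterpart of the /s modifier, and the variable names are made up for this example.

```python
import re

# The list item from the question, with its line breaks preserved.
html = '<td class="amt base">\n$136\n</td>'

# Without DOTALL, "." does not match "\n", so the original attempt finds nothing.
plain = re.search(r'<td class="amt base">(.*)</td>', html)

# With DOTALL, "." also matches newlines; the group includes them.
dotall = re.search(r'<td class="amt base">(.*)</td>', html, re.DOTALL)

# The second pattern consumes the surrounding whitespace with \s+,
# so no flag is needed and the group holds only the value.
trimmed = re.search(r'(?:<td class="amt base">\s+)(.*)(?:\s+</td>)', html)

print(plain)                   # None
print(repr(dotall.group(1)))   # '\n$136\n'
print(trimmed.group(1))        # $136
```

The second pattern is usually the more convenient one for scraping, since the captured value needs no trimming afterwards.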

\d* will match any list of digits.
\$\d* will match a dollar sign and then a string of digits.
The options that you tried do not work because .* stops at the end of each line. You are attempting to match a multi-line statement.

Use this regular expression, which will scrape a dollar sign and the digits after it:
\$\d+
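A quick way to check this pattern is the following Python sketch. It assumes Python's re module rather than uBot Studio's engine, and the sample string is the list item from the question.

```python
import re

# The list item from the question, with its line breaks preserved.
html = '<td class="amt base">\n$136\n</td>'

# \$ matches a literal dollar sign, \d+ matches one or more digits after it.
# The pattern never has to cross a line break, so no multi-line flag is needed.
prices = re.findall(r'\$\d+', html)

print(prices)  # ['$136']
```

Because the pattern ignores the surrounding markup entirely, it will also pick up any other dollar amount in the same text.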

Related

Vim Regex for html

Working with vim regexes for folding html, trying to ignore html tags that start and end on the same line.
So far, I have
if line =~# '<\(\w\+\).*<\/\1>'
return '='
endif
Which works fine for tags like <a></a>, but when dealing with custom elements, I run into issues since there is a hyphen in the tag name.
Like for example, this element
<paper-input label="Input label"></paper-input>
What needs to change in the regex to also catch the hyphen?
The correct regex (updated because of this link) is:
<\([^ >]\+\)[ >].*<\/\1>
or
<\([^ >]\+\)\>.*<\/\1>
The important part is [^ >]. It matches any character up to whitespace or >, i.e. it will match both a and paper-input.
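To see the difference outside Vim, here is a sketch of the original and the first revised pattern translated into Python's re syntax. The translation is an assumption made purely for illustration; Vim's escaping rules differ, and the variable names are hypothetical.

```python
import re

# Original idea: the tag name is \w+, which stops at the hyphen.
word_tag = re.compile(r'<(\w+).*</\1>')

# Revised idea: the tag name is everything up to a space or ">".
any_tag = re.compile(r'<([^ >]+)[ >].*</\1>')

simple = '<a></a>'
custom = '<paper-input label="Input label"></paper-input>'
open_only = '<div class="x">'

print(bool(word_tag.search(simple)))   # True
print(bool(word_tag.search(custom)))   # False
print(any_tag.search(custom).group(1)) # paper-input
print(any_tag.search(open_only))       # None
```

The backreference \1 is what ties the closing tag to the captured name, so a line that only opens a tag is still left alone.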

Regex an innerHtml of a table to find special characters

I'm having a hard time getting this..
I have this html code:
<table border='1'><tr><th></th><th>Fact Questions Report Type Count</th></tr><tr>
<td class=' sorting_1'>0 - 18</td><td>78</td></tr><tr><td class=' sorting_1'>19-64</td>
<td>78</td></tr><tr><td class=' sorting_1'>65+</td><td>78</td></tr><tr>
<td class=' sorting_1'>אין גיל</td><td>78</td></tr><tr><td class=' sorting_1'>נפטר</td>
<td>78</td></tr><tr><td class=' sorting_1'>Unknown</td><td>78</td></tr></table>
As you see there are special characters that I want to catch like those:
אין גיל , נפטר
I thought to do a regex that will exclude all words \W and numbers \D and those->=|'
But I can't get it to work..
The perfect solution would be getting two items with the special characters... אין גיל , נפטר
P.S.: There could be other special characters
I would love to see an example of this here: RegexPal - Online Editor
Thanks!
If you are trying to catch characters in the Hebrew language specifically, you can try
[\u0590-\u05FF\s]+
assuming spaces are okay, or, if using a more advanced regex engine,
[\p{Hebrew}\s]+
If you're actually trying to catch non-English but printable characters then it's hard to help you without seeing what you've tried. \d is a subset of \w, so you should only need \W+, or, if I understand you correctly in that you want to exclude ->=|' as well, then [^\w>=|'-]+ (the dash must come last here, or first, directly after the ^).
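Here is a small Python sketch of the Hebrew-range approach. It is an illustration under a few assumptions: Python's standard re module is used (it does not accept \p{Hebrew}), the sample is a shortened piece of the table from the question, and the pattern is tightened slightly so that a run of plain spaces never matches on its own.

```python
import re

# A shortened piece of the table from the question.
html = ("<td class=' sorting_1'>65+</td><td>78</td></tr><tr>"
        "<td class=' sorting_1'>אין גיל</td><td>78</td></tr><tr>"
        "<td class=' sorting_1'>נפטר</td><td>78</td></tr>")

# One or more characters from the Hebrew block, optionally continued by
# whitespace-separated Hebrew words, so each phrase comes back as one item.
hebrew_run = re.compile(r'[\u0590-\u05FF]+(?:\s+[\u0590-\u05FF]+)*')

matches = hebrew_run.findall(html)
print(matches)  # ['אין גיל', 'נפטר']
```

The plain [\u0590-\u05FF\s]+ form would also return whitespace-only matches, such as the space inside class=' sorting_1', which is why the sketch requires at least one Hebrew character per match.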
This one matches only ASCII printable characters:
[\x20-\x7e]
To catch those אין גיל , נפטר (among many other non ASCII characters) you need
[^\x20-\x7e]
As requested: regexpal.com
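A short Python sketch of the negated ASCII class, assuming the standard re module and a shortened sample row. Note one side effect: the ordinary space is \x20, so it falls inside the excluded range and a two-word phrase comes back as two items.

```python
import re

# A shortened sample row with two Hebrew cells and one numeric cell.
row = "<td>אין גיל</td><td>נפטר</td><td>78</td>"

# Runs of characters outside the printable ASCII range \x20-\x7e.
non_ascii = re.findall(r'[^\x20-\x7e]+', row)

print(non_ascii)  # ['אין', 'גיל', 'נפטר']
```

On multi-line input this class also matches line breaks and tabs, since those sit below \x20.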
I thought to do a regex that will exclude all words \W and numbers \D and those =|'
Simply do it: [^\w\d=|']+
Visualization on Debuggex
Demo on RegExr
Note that you can't use [^\W]: since \W means anything but \w, [^\W] means anything but anything but \w, i.e. \w (- x - = +).

Matching all occurrences of a html element attribute in notepad++ regex

I have a file which has hundreds of links like this:
<h3>aspnet</h3>
Ex 1
Ex 2
Ex 3
So I want to remove all the elements
icon="..."
from all the lines. I went through the official Notepad++ regex wiki and have come up with this after several trials:
icon=\"[^\.]+\"
The problem with this is, it is selecting past the second double quote and stopping at the next occurring double quote. To illustrate, this will select the following content:
icon="data:image/png;base64,...jbvebich4sec9zgth1sfue1cdt...">EX 1</a> <a href="
If I modify the above regex to,
icon=\"[^\.]+\">
Then it is almost perfect, but it is also selecting the >:
icon="data:image/png;base64,...jbvebich4sec9zgth1sfue1cdt...">
The regex I am looking for would select like this:
icon="data:image/png;base64,...jbvebich4sec9zgth1sfue1cdt..."
I also tried the following, but it doesn't match anything at all
icon=\"[^\.]+\"$
Just match anything but a quote, followed by a quote:
icon="[^"]+"
Just tested with notepad++ 6.2.2 and confirmed that this matches correctly as written.
Broken down:
icon="
This is fairly obvious, match the literal text icon=".
[^"]+
This means to match any character that is not a ". Adding the + after it means "one or more times."
Finally we match another literal ".
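The same pattern can be tried outside Notepad++ with a short Python sketch. The link line below is hypothetical, since the original markup is not shown in full, and Python's re module stands in for the Notepad++ engine.

```python
import re

# Hypothetical link line; the real href and base64 payload are not known.
line = '<a href="https://example.com/1" icon="data:image/png;base64,AAAA">Ex 1</a>'

# [^"]+ cannot run past the closing quote, so the match ends exactly there.
pattern = re.compile(r'icon="[^"]+"')

print(pattern.search(line).group(0))
# icon="data:image/png;base64,AAAA"

cleaned = pattern.sub('', line)
print(cleaned)
# <a href="https://example.com/1" >Ex 1</a>
```

The replacement leaves the space that preceded the attribute; putting a leading space in the search pattern would remove it as well.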
I am not a Notepad++ user, so I don't know how Notepad++ handles regex, but can you try replacing
icon=\"[^>]* with (empty string)?
Try this solution:
I just checked that this works the way you wanted.
The way to achieve your goal:
Find what: (icon.*")|.*?
Replace with: $1

Regex filter XAML tags with specific attribute

I have the following Regex that matches XAML tags:
<[^<>]*>
This would return both of these lines:
<Button x:Name="Button1" />
<Button x:Name="Button2" Content="foo" />
What I want to do is filter out tags that have "Content="foo"" in them.
There are similar examples out there, but in this case the quotation marks are tripping me up.
Any ideas?
Regex
/<[^>]*Content="foo".*?>/g
Which means: start the regex (/), search for < followed by zero or more characters that are not > ([^>]*), followed by Content="foo", followed by zero or more of any character, matched non-greedily (.*?), followed by >, and end the regex (/). The g flag returns every match.
Test Code:
Written in JavaScript, but it can be ported to other languages easily.
// One string holding four tags; two of them carry Content="foo".
const x = ['<Button x:Name="Button1" /><Button x:Name="Button2" Content="foo" /><Button x:Name="Button1" /><Button Content="foo" />']
console.log( x[0].match(/<[^>]*Content="foo".*?>/g) )
Updated to match exact opposite (as required by OP)
Regex
Using negative lookahead assertion, as detailed here
<((?!Content="foo")[^>])*>
Which means: match <, followed by zero or more characters that are not > ([^>]), none of which may be the start of Content="foo", followed by >. The parentheses are needed for grouping.
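A minimal Python sketch of the negative-lookahead version, run against the same test string as the JavaScript example. Python's re module is assumed, and the group is written as non-capturing so that findall returns whole tags.

```python
import re

xaml = ('<Button x:Name="Button1" /><Button x:Name="Button2" Content="foo" />'
        '<Button x:Name="Button1" /><Button Content="foo" />')

# Before consuming each character, (?!Content="foo") checks that the
# forbidden attribute does not start at that position.
without_foo = re.findall(r'<(?:(?!Content="foo")[^>])*>', xaml)

print(without_foo)
# ['<Button x:Name="Button1" />', '<Button x:Name="Button1" />']
```

Tags that contain Content="foo" never reach their closing >, so they drop out of the result entirely.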
But remember - using regex as a substitute for XML parsing will make people on Stack Overflow get really angry with you... so better to post questions like this anonymously. ;)
<([^<>](?!\sContent\s*=\s*("foo"|'foo')))*>
This... But PLEASE, don't use Regexes to parse XML.
Each character we capture with [^<>] is checked to see whether it is followed by a whitespace character and then Content="foo". If it is, the match fails at that point, so the engine backs up one character and checks for the terminating >.
http://regexr.com?2umr9
You have to click around a little so that the Regex is "activated".
Some small notes: this Regex is WRONG because it's trying to parse XML, but ignoring this, it won't be fooled by SContent="foo" (ignored) or Content='foo' (caught) or :Content='foo' (ignored) or Content = "foo" (caught). This holds, of course, only if you are using a Regex engine that treats \s as the same set of space characters as XML :-) Otherwise some "strange, alien" spaces could break it a little. Remember to use it with case-sensitive matching!

Regular Expression does not remove html comment?

I have the following string:
<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>
I want to end up with:
<TD>6949</TD>
but instead I end up with just the tags and no information:
<TD></TD>
This is the regular expression I am using:
RegEx.Replace("<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>","<!--.*-->","")
Can someone explain how to keep the numbers and remove just the comments? Also, if possible, can someone explain why this is happening?
.* is a greedy quantifier, which matches as much as possible.
It matches everything up to the last -->.
Change it to .*?, which is a lazy quantifier.
.* is greedy, so it will match as many characters as possible: in this case, from the opening of the first comment to the end of the second. Changing it to .*? will fix it, as the ? makes the match lazy, which is to say it will match as few characters as possible. [^>]* also works here, because it cannot run past the first >.
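The difference can be reproduced with a short Python sketch. The question uses RegEx.Replace; Python's re.sub is assumed here as a stand-in, and the three patterns follow the same greedy and lazy rules.

```python
import re

cell = '<TD><!-- 1.91 -->6949<!-- 9.11 --></TD>'

# Greedy: .* runs from the first "<!--" to the last "-->", taking 6949 with it.
greedy = re.sub(r'<!--.*-->', '', cell)

# Lazy: .*? stops at the first "-->", so each comment is removed separately.
lazy = re.sub(r'<!--.*?-->', '', cell)

# Negated class: [^>]* cannot cross a ">", so it also stops at the first "-->".
negated = re.sub(r'<!--[^>]*-->', '', cell)

print(greedy)   # <TD></TD>
print(lazy)     # <TD>6949</TD>
print(negated)  # <TD>6949</TD>
```

The negated-class form breaks if a comment itself contains a > character, so the lazy form is the safer of the two.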
Parsing HTML with Regex is always going to be tricky. Instead, use something like HTML Agility Pack which will allow you to query and parse html in a structured manner.