Regex an innerHtml of a table to find special characters - regex

I'm having a hard time getting this to work.
I have this HTML code:
<table border='1'><tr><th></th><th>Fact Questions Report Type Count</th></tr><tr>
<td class=' sorting_1'>0 - 18</td><td>78</td></tr><tr><td class=' sorting_1'>19-64</td>
<td>78</td></tr><tr><td class=' sorting_1'>65+</td><td>78</td></tr><tr>
<td class=' sorting_1'>אין גיל</td><td>78</td></tr><tr><td class=' sorting_1'>נפטר</td>
<td>78</td></tr><tr><td class=' sorting_1'>Unknown</td><td>78</td></tr></table>
As you can see, there are special characters that I want to catch, such as these:
אין גיל , נפטר
I thought to do a regex that will exclude all words \W and numbers \D and those->=|'
But I can't get it to work.
The perfect solution would return two items containing the special characters: אין גיל , נפטר
P.S.: There could be other special characters.
I would love to see an example of this here: RegexPal - Online Editor
Thanks!

If you are trying to catch characters in the Hebrew language specifically, you can try
[\u0590-\u05FF\s]+
assuming spaces are okay, or, if using a more advanced regex engine,
[\p{Hebrew}\s]+
If you're actually trying to catch non-English but printable characters, then it's hard to help without seeing what you've tried. \d is a subset of \w, so you should only need \W+, or, if I understand you correctly in that you want to exclude ->=|' as well, then [^\w>=|'-]+ (the dash must come last here, or first, right after the ^).
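As a rough illustration of the Hebrew-range approach, here is a minimal Python sketch. Python's built-in re module does not support \p{Hebrew}, so it uses the explicit \u0590-\u05FF range. The shortened HTML and the variable names are made up for the example, and the pattern is tightened slightly (an assumption on my part) so that a match has to start and end with a Hebrew character; with the bare [\u0590-\u05FF\s]+ form, runs of plain whitespace would match as well.

```python
import re

# Shortened version of the table from the question (illustrative only).
html = (
    "<tr><td class=' sorting_1'>0 - 18</td><td>78</td></tr>"
    "<tr><td class=' sorting_1'>אין גיל</td><td>78</td></tr>"
    "<tr><td class=' sorting_1'>נפטר</td><td>78</td></tr>"
)

# One or more Hebrew characters, optionally followed by
# whitespace-separated groups of further Hebrew characters.
hebrew_run = re.compile(r"[\u0590-\u05FF]+(?:\s+[\u0590-\u05FF]+)*")

matches = hebrew_run.findall(html)
print(matches)
```

This should give the two items the question asks for, with the inner space of the first one preserved.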

This one matches only ASCII printable characters:
[\x20-\x7e]
To catch those אין גיל , נפטר (among many other non ASCII characters) you need
[^\x20-\x7e]
As requested: regexpal.com
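A small Python sketch of the negated class, with a made-up input string. One thing it shows: the ordinary space is itself inside \x20-\x7e, so a two-word value comes back as two separate matches, and control characters such as newlines or tabs would match too.

```python
import re

# Illustrative input, not the full table from the question.
cells = "<td>אין גיל</td><td>נפטר</td><td>Unknown</td>"

# Anything outside the printable ASCII range 0x20-0x7E.
non_ascii = re.compile(r"[^\x20-\x7e]+")

words = non_ascii.findall(cells)
print(words)
```

If the space between the two words has to be kept, the words would need to be re-joined afterwards or the pattern extended to allow spaces between non-ASCII runs.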

I thought to do a regex that will exclude all words \W and numbers \D and those =|'
Simply do it: [^\w\d=|']+
Visualization on Debuggex
Demo on RegExr
Note that you can't use [^\W]: since \W means anything but \w, [^\W] means anything but anything but \w, i.e. \w (- x - = +).

Related

Regex multiline to match string followed by text with arbitrary spaces in between

Was wondering if any regex gurus can help figure out how to create a regular expression to solve this. I am stumped.
I need to match "CUST-X" on variations of this multiline text..
"CUST-1 Some message\nLock-Id: Id74248cd199\n"
Requirement:
The "CUST-1" and "Some message" can be separated by a colon (:). The colon is optional.
There can be none, one or multiple spaces between the two strings.
Any number of spaces can be in front of "CUST-1".
There needs to be a message after "CUST-1". The message is arbitrary, there's no pattern to the message.
Ignore any other CUST-XX after the first one. Only match on the 1st occurrence.
java regex is preferable.
Examples:
Test strings that should match for "CUST-1"
"CUST-1 Some message\nLock-Id: Id14248cd199"
" CUST-1 another message\nLock-Id: Id14258cd199"
"CUST-1:I like apples\nLock-Id: Id84248cd199"
"CUST-1: peaches are sweet\nLock-Id: Id78248cd199"
"CUST-1: pies are great\nLock-Id: Id71248cd199"
Should match "CUST-1" but not "CUST-2"
"CUST-1: Nice message about CUST-2\nLock-Id: Id74248cd199\n"
Test strings that should not match "CUST-1"
"CUST-1\nLock-Id: Id78248cd199"
"CUST-1 \nLock-Id: Id74248cd199"
"CUST-1:\nLock-Id: Id84248cd199"
"CUST-1: \nLock-Id: Id94248cd199"
The closest I've come up with is:
^\\s*([A-Z]+-[0-9]+):?\\s+\\S+
But this will also match the cases where I do not want the match to happen.
I think this is what you are looking for:
^\\s*([A-Z]+-[0-9]+):?\\p{Z}+\\S+
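\s matches any whitespace character, which includes \n, so \\s+ can run across the line break and pick up text from the next line.
\p{Z} matches only Unicode separator characters such as the ordinary space; it does not match \n.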
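To check a pattern against the test strings from the question, here is a small Python sketch. It is not the Java pattern above: Python's re module has no \p{Z}, so this uses [ \t] instead, and the two lookaheads are my own additions. Note that a pattern requiring at least one separator (\p{Z}+) would not match "CUST-1:I like apples", so this variant allows zero spaces and uses the lookaheads to stop the engine from backtracking into a false match on an empty message. The helper name first_cust_id is hypothetical.

```python
import re

# ^[ \t]*          optional leading spaces/tabs (never a newline)
# ([A-Z]+-[0-9]+)  the id, captured
# (?![0-9])        the id may not be cut short by backtracking
# :?[ \t]*         optional colon, then zero or more spaces
# (?!:)\S          the message must start on the same line, not with ':'
CUST_ID = re.compile(r"^[ \t]*([A-Z]+-[0-9]+)(?![0-9]):?[ \t]*(?!:)\S")

def first_cust_id(text):
    """Return the leading id if it is followed by a message, else None."""
    m = CUST_ID.match(text)
    return m.group(1) if m else None

print(first_cust_id("CUST-1: Nice message about CUST-2\nLock-Id: Id74248cd199\n"))
print(first_cust_id("CUST-1: \nLock-Id: Id94248cd199"))
```

Because the pattern is anchored at the start and applied with match(), only the first id is ever considered.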

How to get regex inner html in ubot studio

Hi, I have some text inside uBot Studio that I am trying to scrape from.
The function that I use is "find regular expression" and the list item is this:
<td class="amt base">
$136
</td>
How can I get the $136 with regex?
I tried to use the following:
<td class="amt base">(.*)</td>
<td class="amt base">(*)</td>
<td class="amt base">*</td>
But none of them seem to work.
Thanks a lot for sharing your regex knowledge.
How about the /s modifier?
/(?:<td class="amt base">)(.*)(?:<\/td>)/s
Online Demo
/s will treat the string as a single line. With this change, .* will match any character whatsoever, even a newline, which normally it would not match.
?: the non-capturing groups are optional here, but they were added so that there is only a single capturing group.
\/ it is also important to escape the /, since it is the regex delimiter here.
Important: I did not see a language specified in your post, and /s may not be supported in some languages or engines (for example, older JavaScript engines; Ruby uses /m for this instead).
Update
Although I am not sure it will work with all of your input values, you could try this:
Online Demo 2
/(?:<td class="amt base">\s+)(.*)(?:\s+<\/td>)/
\d* will match any list of digits.
\$\d* will match a dollar sign and then a string of digits.
The options that you tried do not work because .* stops at the end of each line. You are attempting to match a multi-line statement.
Use this regular expression, which will scrape a dollar sign and the digits after it:
\$\d+
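For comparison, both ideas can be tried quickly outside uBot Studio. The following is a minimal Python sketch, not uBot syntax; the html string is a hand-written copy of the snippet from the question, and re.DOTALL plays the role of the /s modifier.

```python
import re

# Hand-written copy of the cell from the question (illustrative only).
html = '<td class="amt base">\n$136\n</td>'

# Capture whatever sits between the tags, trimming the surrounding newlines.
cell = re.search(r'<td class="amt base">\s*(.*?)\s*</td>', html, re.DOTALL)
amount = cell.group(1) if cell else None

# Or ignore the markup and look for a dollar sign followed by digits.
price = re.search(r"\$\d+", html)

print(amount, price.group(0) if price else None)
```

The first form is tied to that exact td markup; the second is shorter but picks up the first dollar amount anywhere in the input.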

Regular Expression replace special characters, only if not part of word

I have the following string:
'United Breaks Guitars': Did It Really Cost The Airline $180 Million? http://ow.ly/htPVk
Currently, my regex pattern looks like this: [^A-Za-z-0-9- - / -$]
I'm not an expert on regex and I've been playing around with this tool to figure things out, but I am stuck.
I'd like to remove characters such as ', ", :, etc. So far with the above pattern the highlighted characters are being removed from my example string:
'United Breaks Guitars' : Did It Really Cost The Airline $180 Million? http://ow.ly/htPVk
The issue above is that I don't want to remove the : and . from the URL. But if the string ends with a period I would like to remove it. Also, the apostrophe ' character should be kept in case it's used to omit characters or as a possession.
Thanks in advance.
It depends on how you define "part of a word"; a URL isn't much of a word.
If you define "part of a word" as surrounded by non-space characters, then you could use something like:
(?<!\S)[^\w $-]+|[^\w $-]+(?!\S)
(?!\S) is a shorter way of saying (?=\s|$), and the same applies for the lookbehind.
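Here is a quick Python sketch of that pattern applied to the string from the question. The function name strip_edge_punctuation is made up for the example. Python's re accepts (?<!\S) because it is a fixed-width lookbehind.

```python
import re

# Runs of "special" characters that touch whitespace (or the string edge)
# on the left, or on the right. $ and - are deliberately kept.
EDGE_PUNCT = re.compile(r"(?<!\S)[^\w $-]+|[^\w $-]+(?!\S)")

def strip_edge_punctuation(text):
    """Remove special characters at word edges, keep those inside words."""
    return EDGE_PUNCT.sub("", text)

title = ("'United Breaks Guitars': Did It Really Cost The Airline "
         "$180 Million? http://ow.ly/htPVk")
cleaned = strip_edge_punctuation(title)
print(cleaned)
```

The : and . inside the URL survive because they have non-space characters on both sides, while an apostrophe inside a word such as "It's" is kept for the same reason.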

Regex filter XAML tags with specific attribute

I have the following Regex that matches XAML tags:
<[^<>]*>
This would return both of these lines:
<Button x:Name="Button1" />
<Button x:Name="Button2" Content="foo" />
What I want to do is filter out tags that have "Content="foo"" in them.
There are similar examples out there, but in this case the quotation marks are tripping me up.
Any ideas?
Regex
/<[^>]*Content="foo".*?>/g
Which means start regex (/) and search for < followed by zero or more non-> characters ([^>]*), followed by Content="foo", followed by zero or more of anything, un-greedily (.*?), followed by > and end regex (/).
Test Code:
Written in JavaScript, but can be ported into other languages easily.
x = ['<Button x:Name="Button1" /><Button x:Name="Button2" Content="foo" /><Button x:Name="Button1" /><Button Content="foo" />']
console.log( x[0].match(/<[^>]*Content="foo".*?>/g) )
Updated to match exact opposite (as required by OP)
Regex
Using negative lookahead assertion, as detailed here
<((?!Content="foo")[^>])*>
Which means match < followed by zero or more non-> characters ([^>]), none of which is the start of Content="foo", followed by >. Brackets added for the necessary grouping.
But remember - using regex as a substitute for XML parsing will make people on Stack Overflow get really angry with you... so better to post questions like this anonymously. ;)
<([^<>](?!\sContent\s*=\s*("foo"|'foo')))*>
This... But PLEASE, don't use Regexes to parse XML.
For each character we capture with [^<>], we check whether it is followed by a space and then Content="foo". If it is, that step fails, so the capture is rolled back by one character and the Regex checks for the terminating >.
http://regexr.com?2umr9
You have to click around a little so that the Regex is "activated".
Some small notes: this Regex is WRONG because it's trying to parse XML, but ignoring this, it won't be fooled by SContent="foo" (ignored) or Content='foo' (caught) or :Content='foo' (ignored) or Content = "foo" (caught). This assumes, of course, that you are using a Regex parser that treats \s as the same list of space characters as XML :-) Otherwise some "strange, alien" spaces could break it a little. Remember to use it with case-sensitive parsing!
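For anyone who wants to try the negative-lookahead idea outside a browser tool, here is a minimal Python sketch. It uses the simpler form that only checks for the literal text Content="foo" (double quotes, no extra spaces), and the xaml string is a made-up sample. The same caveat applies: this is pattern matching on text, not XML parsing.

```python
import re

# Made-up sample: three tags, two of which carry Content="foo".
xaml = (
    '<Button x:Name="Button1" />'
    '<Button x:Name="Button2" Content="foo" />'
    '<Button Content="foo" />'
)

# Inside the tag, consume one character at a time, but refuse to step
# onto any position where the literal text Content="foo" begins.
WITHOUT_FOO = re.compile(r'<(?:(?!Content="foo")[^<>])*>')

kept = WITHOUT_FOO.findall(xaml)
print(kept)
```

The group is non-capturing so that findall returns the whole tag rather than the last character matched inside it.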

Regex - Multiline Problem

I think I'm burnt out, and that's why I can't see an obvious mistake. Anyway, I want the following regex:
#BIZ[.\s]*#ENDBIZ
to grab me the #BIZ tag, #ENDBIZ tag and all the text in between the tags. For example, if given some text, I want the expression to match:
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
At the moment, the regex matches nothing. What did I do wrong?
ADDITIONAL DETAILS
I'm doing the following in PHP
preg_replace('/#BIZ[.\s]*#ENDBIZ/', 'my new text', $strMultiplelines);
The dot loses its special meaning inside a character class — in other words, [.\s] means "match period or whitespace". I believe what you want is [\s\S], "match whitespace or non-whitespace".
preg_replace('/#BIZ[\s\S]*#ENDBIZ/', 'my new text', $strMultiplelines);
Edit: A bit about the dot and character classes:
By default, the dot does not match newlines. Most (all?) regex implementations have a way to specify that it match newlines as well, but it differs by implementation. The only way to match (really) any character in a compatible way is to pair a shorthand class with its negation — [\s\S], [\w\W], or [\d\D]. In my personal experience, the first seems to be most common, probably because this is used when you need to match newlines, and including \s makes it clear that you're doing so.
Also, the dot isn't the only special character which loses its meaning in character classes. In fact, the only characters which are special in character classes are ^, -, \, and ]. Check out the "Metacharacters Inside Character Classes" section of the character classes page on Regular-Expressions.info.
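The difference is easy to see in a quick experiment. The following is a Python sketch rather than PHP (the text variable is made up), with re.DOTALL standing in for the /s modifier:

```python
import re

text = "intro\n#BIZ\nsome text some test\nmore text\n#ENDBIZ\noutro"

# [.\s] is "literal dot or whitespace", so it stops at the first letter.
broken = re.search(r"#BIZ[.\s]*#ENDBIZ", text)

# [\s\S] matches any character at all, newlines included.
portable = re.search(r"#BIZ[\s\S]*#ENDBIZ", text)

# A plain dot only crosses newlines when DOTALL is switched on.
plain_dot = re.search(r"#BIZ.*#ENDBIZ", text)
dotall = re.search(r"#BIZ.*#ENDBIZ", text, re.DOTALL)

print(broken, plain_dot)
print(portable.group(0) == dotall.group(0))
```

The first two searches find nothing, while the [\s\S] and DOTALL versions both return the whole block from #BIZ to #ENDBIZ.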
// Replaces all of your code with "my new text", but I do not think
// this is actually what you want based on your description.
preg_replace('/#BIZ(.+?)#ENDBIZ/s', 'my new text', $contents);
// Actually "gets" the text, which is what I think you might be looking for.
preg_match('/(#BIZ)(.+?)(#ENDBIZ)/s', $contents, $matches);
list($dummy, $startTag, $data, $endTag) = $matches;
This should work
#BIZ[\s\S]*#ENDBIZ
You can try this online Regular Expression Testing Tool
The mistake is the character group [.\s] that will match a dot (not any character) or white space. You probably tried to get .* with . matching newline characters, too. You achieve this by enabling the single line option ((?s:) does this in .NET regex).
(?s:#BIZ.*?#ENDBIZ)
Depending on the environment you're using your regex in, it may need special care to properly parse multiline text, e.g. re.DOTALL in Python. So what environment is that?
You can use
preg_replace('/#BIZ.*?#ENDBIZ/s', 'my new text', $strMultiplelines);
The 's' modifier says "let the dot match anything, even the newline character". The '?' says "don't be greedy", which matters in a case like:
foo
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
bar
#BIZ
some text some test
more text
maybe some code
#ENDBIZ
hello world
Because the match is non-greedy, the "bar" in the middle is not swallowed.
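To make the greedy versus non-greedy difference concrete, here is a short Python sketch with a shortened, made-up version of the text above. re.DOTALL is Python's counterpart of the 's' modifier.

```python
import re

text = (
    "foo\n#BIZ\nfirst block\n#ENDBIZ\n"
    "bar\n#BIZ\nsecond block\n#ENDBIZ\nhello world"
)

# Non-greedy: each #BIZ ... #ENDBIZ pair is replaced on its own.
lazy = re.sub(r"#BIZ.*?#ENDBIZ", "my new text", text, flags=re.DOTALL)

# Greedy: one match runs from the first #BIZ to the last #ENDBIZ.
greedy = re.sub(r"#BIZ.*#ENDBIZ", "my new text", text, flags=re.DOTALL)

print(lazy)
print(greedy)
```

In the lazy result "bar" is still there between the two replacements; in the greedy result it is gone along with everything else between the first and last tag.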
Unless I am missing something, you handle this the same way that you would in Perl, with either the /m or /s modifier at the end? Oddly enough the other answers that rather correctly pointed this out got down voted?!
It looks like you're writing a JavaScript regex; you'll need to enable multiline by specifying the m flag at the end of the expression:
var re = /^deal$/mg