How to highlight searched for words using a regular expression - regex

Hi,
I am working on a Groovy application that requires me to highlight (wrap in spans) the word that is searched for. For instance, given the text below:
youtube
[href="youtube.com] i am here , in Youtube[/a]
I want to search for the word "youtube", and when the text is returned it should look like this:
[span]youtube[span]
[href="youtube.com] i am here , in [span]Youtube[/span] [/a]
Any occurrence of "youtube" inside the href or inside an iframe must be ignored.
At the moment I have the following code:
def m = test =~ /([^<]*)?(youtube)/
println m[0]
def highLightText = { attrs, body ->
    def postBody = attrs.text
    def m = postBody =~ /(?i:${attrs.searchTerm})/
    def array = []
    m.each{
        array << it as String
    }
    array.unique()
    String result = postBody
    array.each{
        result = result.replaceAll("${it}", "<span class='highlight'>${it}</span>")
    }
    out << result
}
And it returns :
[span]youtube[span]
[href="[span]youtube[span].com] i am here , in [span]Youtube[/span] [/a]
Can anyone help me with a regular expression that selects only words that are not contained in links or other tags?
Thanks

A maintainable solution is unlikely to be achievable with regular expressions alone; the problem is too complex.
Parse your HTML into a DOM and treat only text nodes as candidates for highlighting. Text nodes are, by definition, the pieces of content that get rendered; they are never element names, attributes, or attribute values.
The problem then reduces to: how do I find and highlight a string within another string?
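The text-node idea can be sketched in Python for illustration (the question itself is about Groovy, where the same approach applies with any HTML parser). This is a minimal sketch, not a definitive implementation: the class name `Highlighter`, the helper `highlight`, and the sample markup are assumptions, and it uses the standard library's streaming `html.parser` rather than a full DOM. Tags are copied through untouched, and only text passed to `handle_data` is searched.

```python
import re
from html.parser import HTMLParser


class Highlighter(HTMLParser):
    """Re-emits the document, wrapping matches found in text only."""

    def __init__(self, term):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(re.escape(term), re.IGNORECASE)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Copy the tag exactly as it appeared, attributes included
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_data(self, data):
        # Text content is the only place where highlighting happens
        self.out.append(
            self.pattern.sub(
                lambda m: f'<span class="highlight">{m.group(0)}</span>', data
            )
        )


def highlight(html, term):
    parser = Highlighter(term)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(highlight('youtube <a href="http://youtube.com">i am here, in Youtube</a>', "youtube"))
```

Note that this sketch also highlights matches inside script or style content, because the parser reports those as data too; a real implementation would probably skip them.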

Related

python - Wrong regex used?

Here is my func:
import re
#register.filter
def load_human_key(key):
    """
    load util based on type for key
    return: More readable key
    """
    regex = re.findall('[A-Z][^A-Z]*', key)
    if regex:
        joined_regex = " ".join(regex)
        return joined_regex
    return key
When I use load_human_key("JsonKey"), it works fine and returns "Json Key", but load_human_key("JsonKEY") returns "Json K E Y", which is not the behaviour I'd like. Can somebody help fix my function so that load_human_key("JsonKEY") == load_human_key("JsonKey")? I am completely new to regex.
Thanks!
A regex alone cannot change characters from upper case to lower case, so you'll need to map each match through Python code to take care of that.
Not your question, but the naming used in your code is confusing: regex is not the regex, but the list of matches you get from executing one.
Here is how you could do it:
import re
def load_human_key(key):
    # Match a run of capitals plus the lower-case tail, then normalise its case
    return re.sub('[A-Z]+[^A-Z]*', lambda m: ' ' + m[0].capitalize(), key).lstrip()
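To see why the two patterns behave differently, here is a small side-by-side sketch. The name `naive_split` is a hypothetical helper added for the comparison; `load_human_key` repeats the approach described above. The pattern `[A-Z][^A-Z]*` starts a new piece at every capital letter, while `[A-Z]+[^A-Z]*` takes a whole run of capitals as the start of one piece.

```python
import re


def naive_split(key):
    # One capital letter starts each piece, so "KEY" becomes three pieces
    return re.findall('[A-Z][^A-Z]*', key)


def load_human_key(key):
    # A run of capitals starts each piece; capitalize() lower-cases the rest
    return re.sub(
        '[A-Z]+[^A-Z]*', lambda m: ' ' + m.group(0).capitalize(), key
    ).lstrip()


print(naive_split("JsonKEY"))
print(load_human_key("JsonKEY"))
print(load_human_key("JsonKey"))
```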

Refactoring starting place for regex

I have a function that strips HTML markup so the text can be displayed inside a text element.
stripChar: function stripChar(string) {
    string = string.replace(/<\/?[^>]+(>|$)/g, "")
    string = string.trim()
    string = string.replace(/(\n{2,})/gm,"\n\n");
    string = string.replace(/&hellip;/g,"...")
    string = string.replace(/&nbsp;/g,"")
    let changeencode = entities.decode(string);
    return changeencode;
}
This has worked great for me, but I have a new requirement and I'm struggling to work out where I should start refactoring the code above. I still need to strip out the above, but I have two exceptions:
List items (<ul><li>): I need to handle these so that they still appear as bullet points
Hyperlinks: I want to use react-native-hyperlink, so I need to leave the <a> tags intact for me to handle separately
Whilst the function is great for general tag replacement, it's less flexible for my needs above.
You may use
stripChar: function stripChar(string) {
    string = string.replace(/&nbsp;|<(?!\/?(?:li|ul|a)\b)\/?[^>]+(?:>|$)/g, "");
    string = string.trim();
    string = string.replace(/\n{2,}/g,"\n\n");
    string = string.replace(/&hellip;/g,"...")
    let changeencode = entities.decode(string);
    return changeencode;
}
The main changes:
.replace(/&nbsp;/g,"") is moved to the first replace
The first replace is now used with a new regex pattern where the li, ul and a tags are excluded from the matches using a negative lookahead (?!\/?(?:li|ul|a)\b).
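The lookahead can also be tried outside the app. Below is a minimal Python sketch of the same idea, assuming the entity being removed is `&nbsp;`; the function name `strip_tags_except` and its `keep` parameter are hypothetical. The `\b` after the tag names is what stops `a` from also protecting tags such as `<abbr>`.

```python
import re


def strip_tags_except(html, keep=("li", "ul", "a")):
    # Build the alternation of tag names that must survive
    kept = "|".join(re.escape(tag) for tag in keep)
    # Remove &nbsp; and every tag whose name is not in the kept list
    pattern = re.compile(r"&nbsp;|<(?!/?(?:%s)\b)/?[^>]+(?:>|$)" % kept)
    return pattern.sub("", html)


sample = '<p>Hello&nbsp;<a href="x">link</a></p><ul><li>one</li></ul>'
print(strip_tags_except(sample))
```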

regex to return all values not just first found one

I'm learning Pig Latin and am using regular expressions. Not sure if the regex is language agnostic or not but here is what I'm trying to do.
If I have a table with two fields: tweet id and tweet, I'd like to go through each tweet and pull out all mentions up to 3.
So if a tweet goes something like "#tim bla #sam #joe something bla bla" then the line item for that tweet will have tweet id, tim, sam, joe.
The raw data has Twitter ids, not the actual handles, so this regex seems to return a mention: (.*)#user_(\\S{8})([:| ])(.*)
Here is what I have tried:
a = load 'data.txt' AS (id:chararray, tweet:chararray);
b = foreach a generate id, LOWER(tweet) as tweet;
-- filter data so only tweets with mentions
c = FILTER b BY tweet MATCHES '(.*)#user_(\\S{8})([:| ])(.*)';
-- try to pull out the mentions.
d = foreach c generate id,
    REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1}',3) as mention1,
    REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){1,2}',3) as mention2,
    REGEX_EXTRACT(tweet, '((.*)#user_(\\S{8})([:| ])(.*)){2,3}',3) as mention3;
e = limit d 20;
dump e;
So in that try I was playing with quantifiers, trying to return the first, second and 3rd instance of a match in a tweet {1}, {1,2}, {2,3}.
That did not work, mention 1-3 are just empty.
So I tried changing d:
d = foreach c generate id,
    REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)',2) as mention1,
    REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',5) as mention2,
    REGEX_EXTRACT(tweet, '(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)#user_(\\S{8})([:| ])(.*)',8) as mention3;
But instead of returning each user mentioned, this returned the same mention 3 times. I had expected that by copying and pasting the expression again I'd get the second match, and pasting it a third time would get the third match.
I'm not sure how well I've managed to word this question but to put it another way, imagine that the function regex_extract() returned an array of matched terms. I would like to get mention[0], mention[1], mention[2] on a single line item.
Whenever you use the REGEX_EXTRACT or REGEX_EXTRACT_ALL UDF, keep in mind that the pattern is just a plain regex handled by Java.
It is easier to test the regex through a local Java test. Here is the regex I found to be acceptable :
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Each optional group carries its own lazy ".*?"; a ".*?" placed directly
// before an optional group would match nothing and leave groups 2 and 3 null.
Pattern p = Pattern.compile("#(\\S+)(?:.*?#(\\S+)(?:.*?#(\\S+))?)?");
String input = "So if a tweet goes something like #tim bla #sam #joe #bill something bla bla";
Matcher m = p.matcher(input);
if (m.find()) {
    for (int i = 0; i <= m.groupCount(); i++) {
        System.out.println(i + " -> " + m.group(i));
    }
}
With this regex, if there is at least one mention, it returns three fields; the second and/or third is null if a second/third mention is not found.
Therefore, you may use the following Pig code:
d = foreach c generate id, REGEX_EXTRACT_ALL(
    tweet, '#(\\S+)(?:.*?#(\\S+)(?:.*?#(\\S+))?)?');
You do not even need to filter the data first.
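The optional-group idea can be checked quickly in Python, whose `re` module treats these constructs the same way as Java's. This is only a sketch: `first_three_mentions` is a hypothetical helper, and `#` stands in for the mention marker as in the examples above. Each lazy `.*?` sits inside its optional group, because a lazy `.*?` placed directly before an optional group matches nothing (the group is allowed to match empty), which would leave the second and third groups unset.

```python
import re

# Up to three mentions; groups 2 and 3 are optional
MENTIONS = re.compile(r'#(\S+)(?:.*?#(\S+)(?:.*?#(\S+))?)?')


def first_three_mentions(tweet):
    # search() scans for the first mention anywhere in the tweet
    match = MENTIONS.search(tweet)
    return match.groups() if match else (None, None, None)


print(first_three_mentions("like #tim bla #sam #joe #bill something bla bla"))
print(first_three_mentions("only #tim here"))
print(first_three_mentions("no mentions at all"))
```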

As3 Regex or alternative to split strings

I have an HTML page. I use a regex to remove all HTML tags from the page and extract the text, using the code below.
var foo = loader.data.replace(/<.*?>/g, "");
var bar:Array = foo.split("Total");
foo = foo.split(bar[0]);
trace(foo);
Using the lines below the replace call, I remove everything before the word "TOTAL". It does the job perfectly, but now I want to apply another split to get the contents after "TOTAL" and remove the content after "BYTES".
So when i try to split it up again with
var bar2:Array = foo.split("BYTES");
foo = foo.split(bar2[0]);
Flash returns an error saying split is not a valid method.
I tried several other ways (replace), but Flash still produces errors.
Can anyone help me get through this?
Thank you
".split()" is a method of String. When you did the assignment below:
foo = foo.split(bar[0]);
foo became an Array, so the call
var bar2:Array = foo.split("BYTES");
was made on an Array, which has no split method.
What you want instead is this:
var foo = loader.data.replace(/<.*?>/g, "");
trace(foo);
var result = foo.split("Total")[1].split("BYTES")[0];
trace(result);
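For comparison, the same "text between two markers" extraction can be sketched in Python. The helper name `between` and the sample string are assumptions made for illustration. It uses `str.partition` instead of chained splits so that a missing marker yields `None` rather than an index error.

```python
def between(text, start, end):
    # Keep only what follows the first occurrence of `start`
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    # Then keep only what precedes the first occurrence of `end`
    middle, found_end, _ = rest.partition(end)
    return middle if found_end else None


print(between("Header Total 1234 BYTES footer", "Total", "BYTES"))
```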

re pulls data from one tag and not the other

I am trying to get a program to work that parses HTML-like tags; it's for a TREC collection. I don't program often, except for databases, and I am getting stuck on syntax. Here's my current code:
parseTREC ('LA010189.txt')
#Following Code-re P worked in Python
def parseTREC (atext):
    atext=open(atext, "r")
    filePath= "testLA.txt"
    docID= []
    docTXT=[]
    p = re.compile ('<DOCNO>(.*?)</DOCNO>', re.IGNORECASE)
    m= re.compile ('<P>(.*?)</P>', re.IGNORECASE)
    for aline in atext:
        values=str(aline)
        if p.findall(values):
            docID.append(p.findall(values))
        if m.findall(values):
            docID.append(p.findall(values))
    print docID
    atext.close()
The p regex pulled the DOCNO as expected. The m regex, though, would not pull any data and printed an empty list. I'm pretty sure there are white spaces and also newlines inside the P tags. I tried re.M, and that did not help pull the data from the other lines. Ideally I would like to end up with a dictionary {DOCNO: Count}, where Count is the number of words inside the P tags that also appear in a given list. I would appreciate any suggestions or advice.
You can try removing all the line breaks from the file if you think that is impacting your regex results. Also, make sure you don't have nested <P> tags because your regex may not match as expected. For example:
<p>
<p>
<p>here's some data</p>
And some more data.
</p>
And even more data.
</p>
Once "." is allowed to match newlines (re.DOTALL), the lazy "?" stops the match at the first closing tag, so only this section is matched:
<p>
<p>
<p>here's some data</p>
Also, is this a typo?
if p.findall(values):
    docID.append(p.findall(values))
if m.findall(values):
    docID.append(p.findall(values))
Should the last line be:
docID.append(m.findall(values))
Add the re.DOTALL flag like so:
m = re.compile('<P>(.*?)</P>', re.IGNORECASE | re.DOTALL)
You may want to add this to the other regex as well.
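A short sketch of what the flag changes. The sample text and the helper `find_paragraphs` are made up for illustration. Without re.DOTALL the dot stops at every newline, so a `<P>` block that spans several lines is never matched. The pattern also has to be applied to the whole text rather than to one line at a time, because no single line contains both the opening and the closing tag.

```python
import re

# Hypothetical sample in the shape described in the question
SAMPLE = (
    "<DOCNO> LA010189-0001 </DOCNO>\n"
    "<P>\nFirst paragraph,\nsecond line.\n</P>\n"
)


def find_paragraphs(text, dotall):
    # With DOTALL, "." also matches "\n", so the match can span lines
    flags = re.IGNORECASE | (re.DOTALL if dotall else 0)
    return re.findall(r"<P>(.*?)</P>", text, flags)


print(find_paragraphs(SAMPLE, dotall=False))
print(find_paragraphs(SAMPLE, dotall=True))
```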
from xml.dom.minidom import parseString
import re
def parseTREC2(atext):
    # Read the whole file and wrap it in a single root element so it parses as XML
    with open(atext, 'r') as f:
        fc = '<DOCS>\n' + f.read() + '\n</DOCS>'
    dom = parseString(fc)
    w_re = re.compile('[a-z]+', re.IGNORECASE)
    for doc_node in dom.getElementsByTagName('DOC'):
        docno = doc_node.getElementsByTagName('DOCNO')[0].firstChild.data
        cnt = 1  # paragraph number within the current document
        for p_node in doc_node.getElementsByTagName('P'):
            p = p_node.firstChild.data
            words = w_re.findall(p)
            print("\t".join([docno, str(cnt), p]))
            print(words)
            cnt += 1
parseTREC2('LA010189.txt')
The code wraps the document in a <DOCS> root element because the file has no single parent tag. The program then retrieves the information through the XML parser.
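Building on the same parser approach, the {DOCNO: Count} dictionary asked for in the question could be sketched as follows. The function name `count_words`, the `vocabulary` parameter, and the sample document are assumptions, and the sketch only works if the file is well-formed XML once wrapped (for example, no unescaped `&` characters).

```python
import re
from xml.dom.minidom import parseString

WORD_RE = re.compile(r"[a-z]+", re.IGNORECASE)


def count_words(trec_text, vocabulary):
    """Map each DOCNO to the number of <P> words found in `vocabulary`."""
    wanted = {word.lower() for word in vocabulary}
    # Wrap in one root element so the text parses as a single XML document
    dom = parseString("<DOCS>\n" + trec_text + "\n</DOCS>")
    counts = {}
    for doc in dom.getElementsByTagName("DOC"):
        docno = doc.getElementsByTagName("DOCNO")[0].firstChild.data.strip()
        total = 0
        for p in doc.getElementsByTagName("P"):
            # Join the direct text children of the paragraph
            text = "".join(
                node.data for node in p.childNodes
                if node.nodeType == node.TEXT_NODE
            )
            total += sum(1 for w in WORD_RE.findall(text) if w.lower() in wanted)
        counts[docno] = total
    return counts


sample = """<DOC>
<DOCNO> LA010189-0001 </DOCNO>
<P>
The cat sat.
The dog ran.
</P>
<P>
A cat slept.
</P>
</DOC>"""
print(count_words(sample, ["cat", "dog"]))
```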