How to make a regex expression for this content? - regex

I am scraping a website which looks like this, and I am looking for 4 / 5 and 3 / 10. That is, I want (number) + space + slash + 3 spaces + another number.
I tried the regex expression ^[0-9]+(\/[0-9]+)" *"*$ but that did not work.
<td>Monday</td>
<td class="text-center text-danger font-weight-bold">4 / 5</td>
</td>
<td>Tuesday</td>
<td class="text-center text-danger font-weight-bold">3 / 10</td>
</td>

You were close. Use word boundary \b instead of ^ and $, because the text you are looking for is somewhere in the middle of your text. This regex should work:
/\b[0-9]+ +\/ +[0-9]+\b/
The " +" on each side of the slash matches one or more spaces, which makes the regex more forgiving about how many spaces surround the slash.
If you want to capture the numbers separately you can introduce capture groups, to reference them with $1 and $2, respectively:
/\b([0-9]+) +\/ +([0-9]+)\b/
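For anyone doing the scraping in Python, here is a minimal sketch of the capture-group version. The variable names and the shortened sample string are only for illustration; the pattern is the one from this answer, with the slash left unescaped because Python has no regex delimiters.

```python
import re

# Sample rows taken from the question's HTML (shortened).
html = """<td>Monday</td>
<td class="text-center text-danger font-weight-bold">4 / 5</td>
<td>Tuesday</td>
<td class="text-center text-danger font-weight-bold">3 / 10</td>"""

# Word boundaries instead of ^ and $, one or more spaces around the slash.
pattern = re.compile(r"\b([0-9]+) +/ +([0-9]+)\b")

# findall returns one (first, second) tuple of strings per match.
pairs = [(int(a), int(b)) for a, b in pattern.findall(html)]
print(pairs)  # [(4, 5), (3, 10)]
```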

$ grep "[[:digit:]]\{1,\} \/ [[:digit:]]\{1,\}" filename
<td class="text-center text-danger font-weight-bold">4 / 5</td>
<td class="text-center text-danger font-weight-bold">3 / 10</td>
$
Your regex starts with ^, which anchors the match to the beginning of the line, so it cannot match text that sits in the middle of a line.

Related

Regexp: How can I clean html tags of attributes but a few in Notepad++

Okay, in Notepad++ I want to clean html tags with lots of style attributes, e.g.
<td style="width: 457.4pt; border: solid windowtext 1.0pt; background: #BFBFBF; padding: 0cm 5.4pt 0cm 5.4pt;" colspan="2" valign="top" border="1" cellspacing="0" cellpadding="0">
and I want this as an outcome:
<td colspan="2" valign="top">
So far I am at this for search and replace:
<([a-z][A-Z]*)[^>]*?>
<$1>
which cleans all attributes of a html tag. But I want to keep colspan and valign. How do I have to modify the expression?
This can be error prone, but for the example string you might use a branch reset group to use 2 capture groups in different order and capture the name of the tag also in a group.
Find what
<(\w+)[^<>]*(?|(\bcolspan="[^"]*")[^<>]*(\bvalign="[^"]*")|(\bvalign="[^"]*")[^<>]*(\bcolspan="[^"]*"))[^<>]*>
Replace with
<$1 $2 $3>
Explanation
< Match literally
(\w+) Capture 1+ word chars in group 1
[^<>]* Optionally match any char except < and > (Assuming no nested tag etc..)
(?| Branch reset group
(\bcolspan="[^"]*")[^<>]*(\bvalign="[^"]*") Capture first colspan, then valign
| Or
(\bvalign="[^"]*")[^<>]*(\bcolspan="[^"]*") Or capture both the other way around
) Close group
[^<>]* Optionally match any char except < and >
> Match literally
Assuming both colspan and valign are present, maybe:
<td([^<>]*?(\h(?:colspan|valign)="\w+"))((?1))[^<>]*>
Replace with <td$2$3>, see an online demo.
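If the cleanup can be done outside Notepad++, a script avoids having to spell out every attribute order. The sketch below is a different technique from the two patterns above: Python's re module supports neither branch reset groups nor \h, so it uses a replacement callback with an allow-list instead. The names KEEP, clean_tag and clean_html are made up for this example, and it assumes double-quoted attribute values that contain no > character.

```python
import re

KEEP = ("colspan", "valign")  # attributes to preserve, as in the question


def clean_tag(match):
    tag_name, attr_text = match.group(1), match.group(2)
    # Collect name="value" pairs and keep only the allow-listed names.
    kept = [
        m.group(0)
        for m in re.finditer(r'\b([\w-]+)="[^"]*"', attr_text)
        if m.group(1).lower() in KEEP
    ]
    return "<" + " ".join([tag_name] + kept) + ">"


def clean_html(text):
    # Opening tags only: a tag name followed by anything up to the next '>'.
    # Closing tags such as </td> do not match and are left untouched.
    return re.sub(r"<(\w+)([^<>]*)>", clean_tag, text)
```

The kept attributes come out in the order they appear in the source tag, and a tag with neither attribute is reduced to its bare name. Note that a self-closing slash (as in a hypothetical <br/>) would be dropped by this sketch.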

How to match only one string and not another

This is String 1:
<td class="AAA"><span class="BBB">Text1</span></td>
I want to remove the span so it looks like this:
<td class="BBB">Text1</td>
Which is easy enough with this regex:
Search: <td class="AAA"><span class="BBB">(.*)</span></td>
Replace: <td class="BBB">$1</td>
The problem: Sometimes the string looks like this (String 2):
<td class="AAA"><span class="BBB">Text1</span>-<span class="BBB">Text2</span></td>
which also matches because of the 2 closing tags. But I don't want it to be matched at all. How do I find only String 1?
Instead of matching any character in your matching group, match all characters aside from the open <:
Search: <td class="AAA"><span class="BBB">([^<]*)</span></td>
Replace: <td class="BBB">$1</td>
This is assuming your Text1 doesn't contain the < character.
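A quick way to check the behaviour is to run both strings through the pattern. This is a small Python sketch; the function name unwrap_span is invented for the example, and Python writes the backreference as \1 where the editor's replace field uses $1.

```python
import re

# [^<]* cannot run past the first closing tag, unlike .*
pattern = re.compile(r'<td class="AAA"><span class="BBB">([^<]*)</span></td>')


def unwrap_span(line):
    return pattern.sub(r'<td class="BBB">\1</td>', line)


string1 = '<td class="AAA"><span class="BBB">Text1</span></td>'
string2 = ('<td class="AAA"><span class="BBB">Text1</span>'
           '-<span class="BBB">Text2</span></td>')

print(unwrap_span(string1))  # <td class="BBB">Text1</td>
print(unwrap_span(string2))  # unchanged, because the pattern does not match
```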

gregexpr function in R returning different results whether Perl is TRUE or FALSE

I have the following piece of HTML I'm trying to run regex on with gregexpr function in R
<div class=g-unit>
<div class=nwp style=display:inline>
<input type=hidden name=cid value="22144">
<input autocomplete=off class=id-fromdate type=text size=10 name=startdate value="Sep 6, 2013"> -
<input autocomplete=off class=id-todate type=text size=10 name=enddate value="Sep 5, 2014">
<input id=hfs type=submit value=Update style="height:1.9em; margin:0 0 0 0.3em;">
</div>
</div>
</div>
<div id=prices class="gf-table-wrapper sfe-break-bottom-16">
<table class="gf-table historical_price">
<tr class=bb>
<th class="bb lm lft">Date
<th class="rgt bb">Open
<th class="rgt bb">High
<th class="rgt bb">Low
<th class="rgt bb">Close
<th class="rgt bb rm">Volume
<tr>
...
...
</table>
</div>
I am trying to extract the table part from this html by using the following regex expression
<table\\s+class="gf-table historical_price">.+<
When I run the gregexpr function with perl=FALSE it works fine and I get a result
However if I run it with perl=TRUE I get back nothing. It doesn't seem to match it
Does anyone know why the results are different from just switching Perl on and off?
Many thanks in advance!
It seems that in the default extended-regex mode (perl = FALSE), the dot can match newline characters, which is not the case in Perl mode. To make it work in Perl mode, use the (?s) modifier so that the dot matches newline characters too:
> m <- gregexpr('(?s)<table\\s+class="gf-table historical_price">.+</table>', str, perl = TRUE)
In many regex flavors, the dot doesn't match newlines by default, probably to make line-by-line processing more convenient.
The s in the inline modifier (?s) stands for "singleline". In other words, this means that the whole string is seen as a single line (for the dot) even if there are newline characters.
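The same effect can be reproduced outside R. Python's re module behaves like PCRE in this respect (the dot does not match a newline unless (?s) or re.DOTALL is set), so the following sketch, with a shortened made-up sample string, shows the difference:

```python
import re

# Shortened stand-in for the question's HTML; note the embedded newlines.
html = ('<table class="gf-table historical_price">\n'
        '<tr class=bb>\n'
        '<th class="bb lm lft">Date\n'
        '</table>\n'
        '</div>')

pattern = r'<table\s+class="gf-table historical_price">.+</table>'

# Without (?s) the dot stops at the first newline, so nothing matches.
without_flag = re.search(pattern, html)

# With (?s) the dot also matches newlines and the whole table is found.
with_flag = re.search("(?s)" + pattern, html)

print(without_flag)                               # None
print(with_flag.group(0).endswith("</table>"))    # True
```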
You need to use the inline (?s) modifier forcing the dot to match all characters, including line breaks.
The perl=T argument switches to the (PCRE) library that implements regex pattern matching.
gregexpr('(?s)<table\\s+class="gf-table historical_price">.+</table>', x, perl=T)
However as stated in the comments, a parser is recommended to do this. I would start out using the XML library.
cat(paste(xpathSApply(htmlParse(html), '//table[@class="gf-table historical_price"]', xmlValue), collapse = "\n"))
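To illustrate the parser-based idea in another language, here is a rough Python sketch built on the standard library's html.parser rather than R's XML package. The class TableText and the helper table_text are names invented for this example; it simply collects the text found inside the table whose class attribute equals a given value.

```python
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collect text inside the <table> whose class equals class_value."""

    def __init__(self, class_value):
        super().__init__()
        self.class_value = class_value
        self.depth = 0      # greater than 0 while inside the target table
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth:
                self.depth += 1          # nested table inside the target
            elif dict(attrs).get("class") == self.class_value:
                self.depth = 1           # entering the target table

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-blank text only while inside the target table.
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def table_text(html, class_value):
    parser = TableText(class_value)
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Because the parser is event-based, it copes with the unclosed th and tr tags in the question's HTML, which is exactly the kind of input that makes a single regex fragile.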

Using a REGEX to replace words within a sub-match

I hope this isn't a repetition...
I need a regex to do what should be a fairly simple task. I have code for an HTML table, and I want to replace all <td> tags with <th> tags in the first row of the table, i.e. within the first set of <tr> </tr> tags. The table might look something like this:
<table cellpadding="5" cellspacing="0" border="1">
<tr>
<td>Capacity %</td>
<td>Tension V</td>
<td>Acid kg/l</td>
</tr>
<tr>
<td>100</td>
<td>12.70</td>
<td>1.265</td>
</tr>...etc
and I want:
<table cellpadding="5" cellspacing="0" border="1">
<tr>
<th>Capacity %</th>
<th>Tension V</th>
<th>Acid kg/l</th>
</tr>
<tr>
<td>100</td>
<td>12.70</td>
<td>1.265</td>
</tr>...etc
I've tried regexes similar to this:
/(<table>\n<tr>\n)(.+?)(</tr>)
...and then tried to rebuild the table row using back references, but I can't seem to apply the regex to the multiple
</?td>
matches that there might be.
I'm doing this in javascript, which means I can't use look-behinds (although if anyone has a look behind solution I'd be interested in seeing it anyway...).
Thanks in advance for any help.
You could do it if your regex engine supports indefinite repetition inside lookbehind assertions, for example in .NET (C#):
resultString = Regex.Replace(subjectString,
@"(?<= # Assert that we can match this before the current position:
<table # <table
(?: # followed by...
(?! # (unless there's an intervening
</table # </table
| # or
</tr # </tr)
) # (End of lookahead assertion)
. # any character
)* # any number of times
) # (End of lookbehind assertion)
<td # Then match <td",
"<th", RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
works on your example. But even in .NET, I wouldn't use a regex for it, it's just too brittle. Better manipulate the DOM directly, that's what it's there for.
You can't easily do this with a single regex in JavaScript. Since you've got a special condition ("only in the first <tr>"), you'll need to write some conditional logic along with the regex to make it work: isolate the first row, then replace the td tags inside it.
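One way to combine a regex with a bit of logic, without any lookbehind, is a two-step replacement: match only the first <tr>...</tr> block, then rewrite the td tags inside that match with a callback. The sketch below is in Python and the function name promote_first_row is made up; the same shape should carry over to JavaScript, where replace accepts a function and a non-global regex only touches the first match.

```python
import re


def promote_first_row(table_html):
    def to_header(match):
        # Inside the matched row, turn <td ...> and </td> into th tags.
        return re.sub(r"(</?)td\b", r"\1th", match.group(0),
                      flags=re.IGNORECASE)

    # count=1 restricts the outer replacement to the first <tr>...</tr>;
    # (?s) lets the lazy .*? run across line breaks inside the row.
    return re.sub(r"(?is)<tr\b.*?</tr>", to_header, table_html, count=1)
```

This still assumes reasonably regular markup (no nested tables in the first row), so manipulating the DOM remains the more robust option when it is available.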

Regular Expression doesn't match

I've got a string with very unclean HTML. Before I parse it, I want to convert this:
<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>
into NE DEK 143 so it is a bit easier to parse. I've got this regular expression (RegexKitLite):
NSString *str = [dataString stringByReplacingOccurrencesOfRegex:@"<TABLE><TR><TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<\\/TR><\\/TABLE>"
withString:@"$1 $3 $5"];
I'm not an expert in regex. Can someone help me out here?
Regards, dodo
Amarghosh and bobince (the winning answerer of the linked question) are generally right about this. However, since you are just sanitising, regexps are actually just fine.
First, strip the tags:
s/<.*?>//g
Then collapse all extra spaces into one:
s/\s+/ /g
Then remove leading/trailing space:
s/^\s+|\s+$//g
Then get the values:
^([^ ]+) ([^ ]+) ([^ ]+)$
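Put together, the four steps might look like this in Python. This is only a sketch: the function name extract_values is invented, and it assumes each tag sits on a single line, as in the question's sample. re.sub replaces every occurrence by default, which matters because there are many tags and several whitespace runs.

```python
import re


def extract_values(dirty_html):
    text = re.sub(r"<.*?>", "", dirty_html)      # 1. strip the tags
    text = re.sub(r"\s+", " ", text)             # 2. collapse whitespace runs
    text = re.sub(r"^\s+|\s+$", "", text)        # 3. trim both ends
    match = re.match(r"^([^ ]+) ([^ ]+) ([^ ]+)$", text)  # 4. get the values
    return match.groups() if match else None
```

For the table in the question this should yield the three values NE, DEK and 143; input that does not reduce to exactly three space-separated values returns None.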
I have a few suspicions about why your regex might fail (without knowing the rules for string escaping in the iPhone SDK): The dot . used in places where it would have to match newlines, the slash looks like it's escaped unnecessarily etc.,
but: in your example, the text you're trying to extract is characterized by not being surrounded by tags.
So a search for all occurrences of (?m)^[^<>\r\n]+$ should find all matches.
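As a rough check of that idea, the Python sketch below looks for whole lines that contain no angle brackets, using a one-or-more quantifier so that lines longer than a single character can match. The function name untagged_lines is made up, and the sketch assumes \n line endings, since Python's $ in multiline mode only matches just before \n.

```python
import re


def untagged_lines(html):
    # (?m) makes ^ and $ match at every line start and end; the character
    # class excludes angle brackets, so only lines without any tag match.
    return re.findall(r"(?m)^[^<>\r\n]+$", html)
```

Lines that mix text and tags are skipped entirely, so this only works when each value sits alone on its own line, as it does in the question.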
If you are sure of your HTML hierarchy, you can just extract the text enclosed by font tags:
// C# example
Regex r = new Regex(@"<\s*font((\s+[^<>]*)|(\s*))>(?<desiredText>[^<>]*)<\s*/\s*font\s*>");
foreach (Match m in r.Matches(txt))
    result += m.Groups["desiredText"].Value.Trim();
The result is the text enclosed by the font tags, with leading and trailing whitespace trimmed.