How to parse html string using R? - regex

How to grep data item from this html string
a <- "<div class=\"tst-10\">100%</div>"
so that the result is 100%? The main idea is to get data between > <.

I would use gsub() in this case:
gsub("(<.*>)(.*)(<.*>)", "\\2", a)
[1] "100%"
Basically, this breaks the string up into three parts, each enclosed in parentheses ( and ), which makes each part a capture group. We can then refer to these groups with backreferences. The contents matched by the first group can be referred to as \1 (in an R string, write a double backslash, \\1, to escape the backslash), those matched by the second as \2, and so on.
So, essentially, we're saying parse this string, figure out what matches my conditions, and return only the second backreference.
Piece by piece:
<.*> says to look for a "<" followed by any number of any characters ".*" up until you get to a ">"
.* means to match any number of characters (up until the next condition)
Keeping this in mind, you could actually probably use gsub("(.*>)(.*)(<.*)", "\\2", a) and get the same result.
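For comparison, here is a minimal sketch of the same capture-group idea in Python rather than R. It is not part of the original answer, and the variable names are only illustrative. `re.sub` replaces the whole match with the text of the second group:

```python
import re

a = '<div class="tst-10">100%</div>'

# Three capture groups: opening tag, content, closing tag.
# r"\2" in the replacement refers to the second group.
result = re.sub(r"(<.*>)(.*)(<.*>)", r"\2", a)
print(result)  # 100%
```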

I always use this regular expression to remove HTML tags:
gsub("<(.|\n)*?>","",a)
Gives:
[1] "100%"
This differs from mrdwab's answer in that I remove every HTML tag, while his extracts the content from within HTML tags, which is probably more appropriate for this example. Note that the two give different results if there are more tags:
> gsub("(<.*>)(.*)(<.*>)", "\\2", paste(a,"<lalala>foo</lalala>"))
[1] "foo"
> gsub("<(.|\n)*?>","", paste(a,"<lalala>foo</lalala>"))
[1] "100% foo"
I think that I found it here on SO once, not sure which answer.
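The difference between extracting and stripping can also be sketched in Python (an illustrative port, not the answerer's code; `two_tags` mimics what R's paste() produces with its default space separator):

```python
import re

a = '<div class="tst-10">100%</div>'
two_tags = a + " <lalala>foo</lalala>"

# Greedy extraction: group 1 swallows everything through the last opening tag,
# so only the content of the final element survives.
extracted = re.sub(r"(<.*>)(.*)(<.*>)", r"\2", two_tags)

# Tag stripping: each tag is matched lazily and removed, the text stays.
stripped = re.sub(r"<(.|\n)*?>", "", two_tags)

print(extracted)  # foo
print(stripped)   # 100% foo
```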

Related

Regex to match text between single, double and triple quotes

I have a text file that I want to parse strings from. The thing is that there are strings enclosed in either single ('), double (") or 3x single (''') quotes within the exact same file. The best result I was able to get so far is to use this:
((?<=["])(.*?)(?=["]))|((?<=['])(.*?)(?=[']))
to match only single-line strings between single and double quotes. Please note that the strings enclosed in each type of quotes can be either single- or multi-line, and that each type of string occurs several times within the file.
Here's a sample string:
<thisisthefirststring
'''- This is the first line of text
- This is the second line of text
- This is the third line of text
'''
>
<thisisanotheroption
"Just a string between quotes"
>
<thisisalsopossible
'Single quotes
Multiple lines.
With blank lines in between
'
>
<lineBreaksDoubleQoutes
"This is the first sentence here
After the first sentence, comes the blank line, and then the second one."
>
Use this:
((?:'|"){1,3})([^'"]+)\1
Using the group reference \1, you can simplify your work
Also, to get only what is inside of the quotes, use the 2nd group of the match
This regex: ('{3}|["']{1})([^'"][\s\S]+?)\1
does what you want.
Using Notepad++, you can use: ('''|'|")((?:(?!\1).)+)\1
Explanation:
('''|'|") : group 1, all types of quote
( : group 2
(?:(?!\1).)+ : any thing that is not the quote in group 1
) : end group 2
\1 : back reference to group 1 (i.e. same quote as the beginning)
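As a rough check of the pattern explained above, here is a sketch in Python instead of Notepad++. The shortened sample text and the names are assumptions for illustration, and `re.DOTALL` is needed so that `.` also crosses line breaks:

```python
import re

sample = """<first
'''- line one
- line two
'''
>
<second
"Just a string between quotes"
>
<third
'Single quotes
Multiple lines.
'
>"""

# Group 1: the opening quote; group 2: characters that do not start
# another copy of that quote; \1: the matching closing quote.
pattern = re.compile(r"""('''|'|")((?:(?!\1).)+)\1""", re.DOTALL)

strings = [m.group(2) for m in pattern.finditer(sample)]
for s in strings:
    print(repr(s))
```

Each of the three quoting styles yields one entry in `strings`, including the multi-line ones.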
Here's something that may work for you.
^(\"([^\"\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"|'([^'\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*'|\"\"\"((?!\"\"\")[^\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"\"\")$
To handle triple single quotes, replace the triple double quotes in the pattern with triple single quotes.
Named Group Version
Avoids problems when used in larger expressions by explicitly referring to the name of the group storing the last found quote.
Should work for most systems:
(?<Qt>'''|'|")(.*?)\k<Qt>
.NET version (written for a C# verbatim string literal, where a double quote is escaped as ""):
(?<Qt>'''|'|"")(.*?)\k<Qt>
Works as follows:
'''|'|": Check first for ''', then ', and finally ". Done in this order so ''' has priority over '.
(?<Qt>'''|'|"): When matched, place the match in <Qt> for later use.
(.*?): Lazily capture 0 or more of any character. Because .*? can match nothing, empty strings may be returned; to prevent that, change it to .+? (a lazy search for 1 or more characters). For multi-line strings, the engine's single-line (dotall) option must be on so that . also matches line breaks.
\k<Qt>: Search for the value last stored in <Qt>.
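A sketch of the named-group version in Python, which spells the syntax differently: `(?P<qt>...)` defines the group and `(?P=qt)` refers back to it. The test string is made up for illustration:

```python
import re

text = """a '''x 'y' z''' b "q" c 'r'"""

# ''' is listed first so it wins over a single '.
# (?P=qt) must match whatever quote the group named qt captured.
pattern = re.compile(r"""(?P<qt>'''|'|")(.*?)(?P=qt)""", re.DOTALL)

found = [m.group(2) for m in pattern.finditer(text)]
print(found)  # ["x 'y' z", 'q', 'r']
```

The single quotes around y are left alone because the closing delimiter has to be the same ''' that opened the string.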

Regex (ICU) for matching between parentheses

Looking for some regex which will create a capture group for words occurring within parentheses, ignoring the parentheses themselves. The regex must be either PCRE or ICU.
Input: ( lakshd asd___ asa1123 Name : _____)
Desired Output: Name
What I've tried:
\\((Name|name|NAME)\\)
(?<=\\()name|Name|NAME(?=\\))
\\(name|Name|NAME\\)
All these patterns are built around name, Name or NAME with a ( immediately before and a ) right after, the only difference being what is captured or returned as a match. To match a word somewhere inside parentheses, you need \([^()]* before the value you want and [^()]*\) after it.
Also, there is no point in extracting something you already know.
So, if you plan to extract the last word from the parentheses, you may use
> library(stringr)
> s = "( lakshd asd___ asa1123 Name : _____)"
> res <- str_match(s, "(?i)\\([^()]*\\b([a-z]\\w*)\\b[^()]*\\)")
> res[,2]
[1] "Name"
Note that str_match allows accessing captured values.
The (?i)\\([^()]*\\b([a-z]\\w*)\\b[^()]*\\) pattern matches parentheses and the last whole word from it.
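The same pattern can be tried outside R. This is a sketch with Python's `re` module (not the ICU engine that stringr uses), where `group(1)` plays the role of `res[,2]`:

```python
import re

s = "( lakshd asd___ asa1123 Name : _____)"

# [^()]* is greedy, so the captured word is the last one inside the
# parentheses that starts with a letter.
m = re.search(r"(?i)\([^()]*\b([a-z]\w*)\b[^()]*\)", s)
last_word = m.group(1) if m else None
print(last_word)  # Name
```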
If nested parentheses are unlikely, it is enough to check that the current position is followed by a closing parenthesis with no other parenthesis in between, on the assumption that an opening parenthesis came earlier (works with both ICU and PCRE):
(Name|name|NAME)(?=[^()]*\))
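A quick way to see what the lookahead does, sketched in Python. The extra occurrences of the word outside the parentheses are an assumption added to the question's input for illustration:

```python
import re

s = "Name ( lakshd asd___ asa1123 Name : _____) name"

# The lookahead requires a ")" ahead with no "(" or ")" in between,
# so only the occurrence inside the parentheses qualifies.
inside = re.findall(r"(Name|name|NAME)(?=[^()]*\))", s)
print(inside)  # ['Name']
```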

Need RegEx Pattern to get text between delimiters at start of text

My source text could be any number of characters between "[" and "]" at the beginning of the line. I will have ONLY one line.
For example:
[1] and some other text here
[10] more text, but maybe some brackets [KEY]
[1000000] a lot more text
I want to match/return the text between the "[" and "]".
EDIT AFTER ANSWER PROVIDED
The first answer, provided by @nickb, worked for me with this AppleScript:
Note that I had to convert the RegEx to a quoted string to use in AS. This uses the Satimage AppleScript Additions find text command, which provides the RegEx engine for AppleScript.
set strRegEx to "^\\[(.*?)\\]" -- Original: "^\[(.*?)\]"
set strTextToSearch to "[10] My Note title with [KEY] "
set strCaptureGroup to find text strRegEx in strTextToSearch using {"\\1"} with regexp and string result
log strCaptureGroup
-->10
The most simple regex you could use would be this:
^\[(.*?)\]
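As an illustration, here is a minimal Python sketch applying this pattern to the three sample lines from the question (the variable names are hypothetical):

```python
import re

lines = [
    "[1] and some other text here",
    "[10] more text, but maybe some brackets [KEY]",
    "[1000000] a lot more text",
]

# ^ anchors at the start of the line; the lazy group stops at the first "]".
leading = re.compile(r"^\[(.*?)\]")

values = [m.group(1) for line in lines if (m := leading.match(line))]
print(values)  # ['1', '10', '1000000']
```

The later [KEY] on the second line is ignored because of the anchor.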
Alternatively a pure AppleScript solution
set theText to "[1] and some other text here
[10] more text, but maybe some brackets [KEY]
[1000000] a lot more text"
set resultList to {}
set {TID, text item delimiters} to {text item delimiters, "]"}
repeat with aLine in (get paragraphs of theText)
if aLine starts with "[" then set end of resultList to text 2 thru -1 of text item 1 of aLine
end repeat
set text item delimiters to TID
resultList -- {"1", "10", "1000000"}
I think this will fit your criteria:
^\[([^]]*)\].*
With the stuff in brackets in the first matching group returned.
You can try running the following regular expression on each line:
[^\[]\w+[^\]]
I tested it at regex101 and it matches the contents inside the [], excluding the brackets.
/^\[(.*?)\]/
is really the most simple regex for this case, but it matches surrounding brackets too.
The exact value (without brackets) is stored in 1st capture group.
If you don't want to match brackets, you will need this:
/(?<=^\[).*?(?=\])/
… unless you're using a JavaScript engine that predates ES2018, which is when lookbehinds were added to the language.
In this case you'll need this regex:
/[^\[\]]+/
(assuming that every input starts with a […] component that is not empty; the pattern is unanchored because the first character of the input is the bracket itself)
The regex to use depends on how you are going to use it for the input it will parse. Some of the answers here have a trailing .* and some do not. Both are correct, it just depends on what exactly you are trying to match, and crucially how you ask it about a match. For example, in Java, with the regex ^\[(.*?)\], if you feed it the whole string "[1000000] a lot more text" and call matches(), it will return false because the regex pattern does not account for any of the trailing text outside the brackets. However, if you call find() after feeding in the same string, it will match because find() works on each substring as it parses and will return true on the first match it hits, while matches() will only return true if the entire input matches the regex. find() will also find subsequent substring matches to the regex in the string each time find() is called until the parser reaches the end of the input.
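The same distinction exists in Python, which makes for a compact illustration. This is an analogy rather than the Java API: `fullmatch()` behaves roughly like Java's matches(), and `search()` roughly like a first call to find().

```python
import re

pattern = re.compile(r"^\[(.*?)\]")
line = "[1000000] a lot more text"

whole = pattern.fullmatch(line)  # entire input must match, like matches()
first = pattern.search(line)     # first matching substring, like find()

# Adding a trailing .* lets the pattern account for the whole input.
with_tail = re.fullmatch(r"^\[(.*?)\].*", line)

print(whole)               # None
print(first.group(1))      # 1000000
print(with_tail.group(1))  # 1000000
```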
Personally, I like to use regex that account for the entire input and use capture groups to isolate the actual text I want to grab from the input. But your mileage may vary.

Extracting some data items in a string using regular expression

<![Apple]!>some garbage text may be here<![Banana]!>some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>
In the above string, I would like a regex that matches every <![FruitName]!>. Between these <![FruitName]!> items there may be some garbage text. My first attempt is this:
<!\[[^\]!>]+\]!>
It works, but as you can see I've used this part:
[^\]!>]+
This kills some innocents. If the fruit name contains any one of these characters: ] ! > It'd be discarded and we love eating fruit so much that this should not happen.
How do we construct a regex that disallows exactly this string ]!> in the FruitName while all these can still be obtained?
The above example is just made up by me, I just want to know what the regex would look like if it has to be done in regex.
The simplest way would be <!\[.+?]!> - just don't care about what is matched between the two delimiters at all. Only make sure that it always matches the closing delimiter at the earliest possible opportunity - therefore the ? to make the quantifier lazy.
(Also, no need to escape the ])
About the specification that the sequence ]!> should be "disallowed" within the fruit name - well that's implicit since it is the closing delimiter.
To match a fruit name, you could use:
<!\[(.*?)]!>
After the opening <![, this matches the least amount of text that's followed by ]!>. By using .*? instead of .*, the least possible amount of text is matched.
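A short Python sketch of this lazy match. The second input, with ], ! and > inside the name, is made up to show that single delimiter characters no longer break the match:

```python
import re

s = ("<![Apple]!>some garbage text may be here<![Banana]!>"
     "some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>")

# Lazy group: stop at the first "]!>" after each opening "<![".
names = re.findall(r"<!\[(.*?)]!>", s)
print(names)  # ['Apple', 'Banana', 'Orange', 'Pear', 'Pineapple']

# Only the full sequence "]!>" ends a name.
odd = re.findall(r"<!\[(.*?)]!>", "<![Star]fruit!>x]!>")
print(odd)  # ['Star]fruit!>x']
```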
Here's a full regex to match each fruit with the following text:
<!\[(.*?)]!>(.*?)(?=(<!\[)|$)
This uses positive lookahead (?=xxx) to match the beginning of the next tag or end-of-string. Positive lookahead matches but does not consume, so the next fruit can be matched by another application of the same regex.
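To see the lookahead version at work, here is a sketch in Python that pairs each fruit with the text following it (`pairs` is an illustrative name):

```python
import re

s = ("<![Apple]!>some garbage text may be here<![Banana]!>"
     "some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>")

# Group 1: fruit name; group 2: text up to the next "<![" or end of string.
# The lookahead does not consume the next tag, so it starts the next match.
pattern = re.compile(r"<!\[(.*?)]!>(.*?)(?=(<!\[)|$)")

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(s)]
for fruit, trailing in pairs:
    print(fruit, repr(trailing))
```

Fruits that are directly followed by another tag, or by the end of the string, come back with an empty trailing text.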
Depending on the language you are using, you can use its string methods to do simple splitting (with a simpler, more understandable regex). Split the string using "!>" as the separator. Go through each field and check for <!. If it is found, remove all characters from the start of the field up to and including <!. This will give you all the fruits. I use gawk to demonstrate, but the algorithm can be implemented in your language.
eg gawk
# set field separator as !>
awk -F'!>' '
{
    # for each field
    for (i = 1; i <= NF; i++) {
        # check if there is <!
        if ($i ~ /<!/) {
            # if <! is found, substitute from front till <!
            gsub(/.*<!/, "", $i)
        }
        # print result
        print $i
    }
}
' file
output
# ./run.sh
[Apple]
[Banana]
[Orange]
[Pear]
[Pineapple]
No complicated regex needed.
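The same split-based algorithm might look like this in Python. It is a sketch, not a line-for-line port: unlike the awk script, it keeps only the fields that contain <! instead of printing every field.

```python
s = ("<![Apple]!>some garbage text may be here<![Banana]!>"
     "some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>")

fruits = []
# Split on the closing marker, then drop everything up to the last "<!".
for field in s.split("!>"):
    if "<!" in field:
        fruits.append(field.rsplit("<!", 1)[1])

print(fruits)  # ['[Apple]', '[Banana]', '[Orange]', '[Pear]', '[Pineapple]']
```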

How to cycle through delimited tokens with a Regular Expression?

How can I create a regular expression that will grab delimited text from a string? For example, given a string like
text ###token1### text text ###token2### text text
I want a regex that will pull out ###token1###. Yes, I do want the delimiter as well. By adding another group, I can get both:
(###(.+?)###)
/###(.+?)###/
if you want the ###'s then you need
/(###.+?###)/
The ? makes the quantifier non-greedy; without it, the match would grab too much,
e.g. all of '###token1### text text ###token2###' would be grabbed.
My initial answer had a * instead of a +. * means 0 or more. + means 1 or more. * was wrong because that would allow ###### as a valid thing to find.
For playing around with regular expressions, I highly recommend http://www.weitz.de/regex-coach/ for Windows. You can type in the string you want and your regular expression and see what it's actually doing.
Your selected text will be stored in \1 or $1 depending on where you are using your regular expression.
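The points about greediness and about * versus + can be checked with a small Python sketch (the variable names are illustrative):

```python
import re

text = "text ###token1### text text ###token2### text text"

lazy = re.findall(r"###(.+?)###", text)           # shortest match each time
with_delims = re.findall(r"(###.+?###)", text)    # delimiters included
greedy = re.findall(r"###(.+)###", text)          # runs to the last ###
empty_ok = re.findall(r"###(.*?)###", "######")   # * accepts an empty token

print(lazy)         # ['token1', 'token2']
print(with_delims)  # ['###token1###', '###token2###']
print(greedy)       # ['token1### text text ###token2']
print(empty_ok)     # ['']
```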
In Perl, you actually want something like this:
$text = 'text ###token1### text text ###token2### text text';
while($text =~ m/###(.+?)###/g) {
print $1, "\n";
}
Which will give you each token in turn within the while loop. The (.+?) ensures that you get the shortest bit between the delimiters, preventing it from thinking the token is 'token1### text text ###token2'.
Or, if you just want to save them, not loop immediately:
@tokens = $text =~ m/###(.+?)###/g;
Assuming you want to match ###token2### as well...
/###.+###/
Use () and \x. A naive example that assumes the text within the tokens is always delimited by #:
text (#+.+#+) text text (#+.+#+) text text
The stuff in the () can then be grabbed by using \1 and \2 (\1 for the first set, \2 for the second) in the replacement expression, assuming you're doing a search/replace in an editor. For example, the replacement expression could be:
token1: \1, token2: \2
For the above example, that should produce:
token1: ###token1###, token2: ###token2###
If you're using a regexp library in a program, you'd presumably call a function to get at the contents of the first and second tokens, which you've indicated with the ()s around them.
Well, when you are using delimiters such as these, you basically grab the opening delimiter, then anything that does not match the ending delimiter, followed by the ending delimiter. One caution: in cases like the example above, [^#] alone would not work as the check that the end delimiter is not there, since a single # inside the token would cause the regex to fail (e.g. "###foo#bar###"). For the case above, the regex to parse it would be the following, assuming empty tokens are allowed (if not, change * to +):
###([^#]|#[^#]|##[^#])*###
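A small Python sketch comparing this pattern with the simpler [^#] version on the problem input mentioned above (the names `token` and `naive` are hypothetical):

```python
import re

token = re.compile(r"###([^#]|#[^#]|##[^#])*###")
naive = re.compile(r"###[^#]*###")

text = "###foo#bar###"

# The alternation allows one or two # inside the token, but never three.
accepts_single_hash = token.fullmatch(text) is not None
naive_result = naive.search(text)
accepts_empty = token.fullmatch("######") is not None

print(accepts_single_hash)  # True
print(naive_result)         # None
print(accepts_empty)        # True
```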