regex and string extraction from HTML

How do I modify the following string manipulation to look for "Text to Extract" in the HTML code below? I don't understand the "(?<=')[^']+" part. I understand it is a regex pattern, and I looked it up on a website, but I don't get the logic of it. Maybe if someone showed me how it applies to my question, I could understand it better.
if let match = dataString?.range(of: "(?<=')[^']+", options: .regularExpression) {
    print(dataString?.substring(with: match) as Any)
}
HTML code:
<span class="phrase">Text to Extract</span></span></span></p>

First, https://regex101.com/ is a free online resource where you can test regex, and it will explain what each part of it is doing.
The regex (?<=')[^']+ can be broken down as follows:
(?<=<token>) is a positive look-behind for a token. In this case, the token is the single-quote character (').
[^<chars>] matches any character that is not one of the listed characters. In this case, anything but a single quote (').
+ matches the previous token one or more times. In this case, [^'].
So the above regex matches any run of non-quote characters that directly follows a '. Note that this has no concept of opening and closing quotes, so on a'b'c'd'e it would match b, c, d, and e.
To match a literal phrase, you would just use that phrase in your regex (escaping any regex special characters with \).
If you need context-aware (nest-tracking) extraction, any regex will be inherently wrong, and you will need an HTML parser to extract the text for you.
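As a minimal sketch of the breakdown above, here it is in Python rather than the question's Swift, since the look-behind syntax carries over. The helper names quote_runs and phrase_text are made up for illustration, and the second pattern assumes the opening tag is written exactly as in the question's HTML.

```python
import re

# The pattern from the question: a run of non-quote characters
# that is directly preceded by a single quote.
QUOTE_RUN = re.compile(r"(?<=')[^']+")

# Same idea adapted to the HTML snippet (an assumption about the exact tag):
# look behind for the opening tag, then take everything up to the next '<'.
SPAN_TEXT = re.compile(r'(?<=<span class="phrase">)[^<]+')


def quote_runs(text):
    """Return every match of the original pattern."""
    return QUOTE_RUN.findall(text)


def phrase_text(html):
    """Return the text after the first phrase span tag, or None if absent."""
    match = SPAN_TEXT.search(html)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(quote_runs("a'b'c'd'e"))
    print(phrase_text('<span class="phrase">Text to Extract</span></span></span></p>'))
```

This only works because the snippet is small and fixed; for real pages an HTML parser remains the safer choice.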

Related

Match everything but Regular Expression

I've been searching for a long time and reading up on negative/positive lookahead, but I can't get this to match everything but my regular expression.
\b[A-Z]{1}\d{3,6}[A-Z0-9]+
is the pattern for the string I don't want to extract.
(?!\b[A-Z]{1}\d{3,6}[A-Z0-9]+).*
is my best attempt using negative lookahead, but it still matches the data.
I am using this Regex on:
11/02/2019 1 475.50 453.345 Serial number : C580A0453WD7996 
AFJ_LowGuard_NewNew
End User Details:
The output I want is:
11/02/2019 1 475.50 453.345 Serial number :
AFJ_LowGuard_NewNew
End User Details:

One approach is to use your regex to match the unwanted string and replace the match with an empty string.
The other approach, which you seem to be trying, is to use the following regex to match anything but your pattern:
\b(?:(?![A-Z]\d{3,6}[A-Z0-9]+).)+\b
This will match anything except your pattern. Personally, though, I suggest the first approach: matching your pattern and replacing it should be easier.
Edit:
OK, I read your comment that you want to replace anything except the string matched by your pattern. In that case, you can use the following regex to match everything except your pattern and replace it with an empty string to get your result:
\b(?:(?![A-Z]\d{3,6}[A-Z0-9]+).)+
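The first approach (match the unwanted pattern and replace it with an empty string) might look like the following in Python. This is only a sketch: the function name strip_serials is hypothetical, and the {1} quantifier from the question is dropped because it is redundant.

```python
import re

# The question's pattern for the string that should be removed.
SERIAL = re.compile(r"\b[A-Z]\d{3,6}[A-Z0-9]+")


def strip_serials(text):
    """Remove every substring that matches the serial-number pattern."""
    return SERIAL.sub("", text)


if __name__ == "__main__":
    sample = (
        "11/02/2019 1 475.50 453.345 Serial number : C580A0453WD7996 \n"
        "AFJ_LowGuard_NewNew\n"
        "End User Details:"
    )
    print(strip_serials(sample))
```

The spaces around the removed serial number are left in place; trimming them would need an extra step.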

Regex: ignore characters that follow

I'd like to know how I can ignore characters that follow a particular pattern in a regex.
I tried positive lookaheads, but they do not work because they preserve those characters for other matches, while I want them to be just... discarded.
For example, a part of my regex is: (?<DoubleQ>\"\".*?\"\")|(?<SingleQ>\".*?\")
in order to match some "key-parts" of this string:
This is a ""sample text"" just for "testing purposes": not to be used anywhere else.
I want to capture the entire ""sample text"", but then I want to "extract" only sample text, and the same with testing purposes. That is, I want the group match to be ""sample text"", but the full match to be sample text. I partially achieved that with the use of the \K option:
(?<DoubleQ>\"\"\K.*?\"\")|(?<SingleQ>\"\K.*?\")
Which ignores the first "" (or ") from the full match but takes it into account when matching the group. How can I ignore the following "" (")?
Note: positive lookahead does not work: it does not ignore characters from the following matches, it just does not include them in the current match.
Thanks a lot.

I hope I understood your question correctly. You want to match the whole string including the quotes, but extract only the text without the quotes, right?
You can typically use the regex replace functionality to extract just a part of the match.
This is the regex:
""?(.*?)""?
And this is the replace expression:
$1
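A small Python sketch of this capture-group approach follows. The function names are illustrative only, and note that Python's re module writes the replacement as \1 where other flavors use $1.

```python
import re

# One or two opening quotes, a lazily captured body, one or two closing quotes.
QUOTED = re.compile(r'""?(.*?)""?')


def extract_quoted(text):
    """Return only the captured text between the quotes."""
    return QUOTED.findall(text)


def unquote(text):
    """Replace each quoted span with its captured body (the $1 of the answer)."""
    return QUOTED.sub(r"\1", text)


if __name__ == "__main__":
    sample = 'This is a ""sample text"" just for "testing purposes": not to be used anywhere else.'
    print(extract_quoted(sample))
    print(unquote(sample))
```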

Extract url based on specific keyword

I am crawling data from certain websites and want to extract data from specific URLs, for example URLs containing *devicehelp.optus.com.au/web/*. Here is my regex:
/[^]*devicehelp\.optus\.com\.au\/web\/[^.]*/
This regex doesn't give me the match I am looking for. Could someone please tell me what I am missing?
Test urls -
*devicehelp.optus.com.au/web/*
http://www.top.abc.something.optus.devicehelp.optus.com.au/web/web/web/
This regex works when I test it on http://regexr.com/ but doesn't on https://regex101.com/

In most regex flavors, [^] is an invalid construct, whereas on the site you tested (regexr.com) it is parsed as "any character", since regexr uses the JavaScript regex flavor.
To match any character but a newline zero or more times, you may use .*.
.*\bdevicehelp\.optus\.com\.au\/web\/.*
The \b is a word boundary, so as to match devicehelp as a whole word (if you do not intend to match it as a whole word, you may remove it). Dots should be escaped to match literal dots.
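To see the effect of the word boundary and the escaped dots, here is a hedged sketch in Python. The function name matching_urls and the non-matching test URLs are invented for illustration; the forward slashes are left unescaped because Python patterns have no / delimiters.

```python
import re

# The answer's pattern, written without the /.../ delimiters.
URL_FILTER = re.compile(r".*\bdevicehelp\.optus\.com\.au/web/.*")


def matching_urls(urls):
    """Keep only the URLs that the pattern matches in full."""
    return [url for url in urls if URL_FILTER.fullmatch(url)]


if __name__ == "__main__":
    candidates = [
        "http://www.top.abc.something.optus.devicehelp.optus.com.au/web/web/web/",
        "http://mydevicehelp.optus.com.au/web/",  # no word boundary before "devicehelp"
        "http://devicehelpxoptus.com.au/web/",    # "x" where a literal dot is required
    ]
    print(matching_urls(candidates))
```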

Regex for deleting characters before a certain character?

I'm very new at regex, and to be completely honest it confounds me. I need to grab the string after a certain character is reached in said string. I figured the easiest way to do this would be using regex, however like I said I'm very new to it. Can anyone help me with this or point me in the right direction?
For instance:
I need to check the string "23444:thisstring" and save "thisstring" to a new string.

If this is your string:
I'm very new at regex, and to be completely honest it confounds me
and you want to grab everything after the first "c", then this regular expression will work:
/c(.*)/s
It will return this match in the first matched group:
"ompletely honest it confounds me"
Explanation:
The c is the character you are looking for
.* (in combination with /s, which lets . match newlines too) matches everything that is left
(.*) captures what .* matched, making it available in $1 and returned in list context.
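Applied to the string from the question, the same capture-group idea could be sketched in Python as follows. The helper name text_after is hypothetical, and re.DOTALL plays the role of the /s modifier.

```python
import re


def text_after(text, char):
    """Return everything after the first occurrence of char, or None."""
    # re.escape keeps characters such as "." or "$" literal.
    match = re.search(re.escape(char) + "(.*)", text, flags=re.DOTALL)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(text_after("23444:thisstring", ":"))
```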

Regex for deleting characters before a certain character!
You can use a lookahead like this:
.*(?=x)
where x is a particular character, word, or string. Characters like ., $, ^, *, and + have special meaning in regex, so don't forget to escape them when using them within x.
EDIT
For your sample string it would be
.*(?=thisstring)
.* matches zero or more characters up to thisstring
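A rough Python sketch of the lookahead variant is below. The name drop_before is made up for illustration, and re.escape handles the escaping that the answer warns about. Because .* is greedy, the text is cut at the last occurrence of the marker.

```python
import re


def drop_before(text, marker):
    """Delete everything that precedes the (last) occurrence of marker."""
    # The lookahead checks for marker without consuming it,
    # so only the text in front of it is replaced.
    pattern = ".*(?=" + re.escape(marker) + ")"
    return re.sub(pattern, "", text, count=1, flags=re.DOTALL)


if __name__ == "__main__":
    print(drop_before("23444:thisstring", "thisstring"))
```

If the marker does not occur at all, the pattern never matches and the text comes back unchanged.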

Here is a one-line solution for matching everything after "before"
print $1."\n" if "beforeafter" =~ m/before(.*)/;
Edit:
While using lookbehind is possible, it's not required. Grouping provides an easier solution.

To get the string after : in your example, you can use [^:][^:]*:\(.*\) (basic regex syntax, where \( and \) delimit the capture group). Notice that this requires at least one [^:], followed by any number of [^:]s, followed by an actual :, the character you are searching for; the group then captures everything after it.

Regular expression to split a string but consider multi-digit escape sequences

I could use some help with the following regular expression problem and would appreciate any suggestions. Thanks in advance.
I have to split a string by another string; let me call it separatorString. However, if an escape sequence precedes separatorString, the string should not be split at that point. The escape sequence is also a string; let me call it escapeSequence.
Maybe it is better to start with some examples:
separatorString = "§§";
escapeSequence = "###";
inputString = "Part1§§Part2" ==> Desired output: "Part1", "Part2"
inputString = "Part1§§Part2§§ThisIs###§§AllPart3" ==> Desired output: "Part1", "Part2", "ThisIs###§§AllPart3"
Searching stackoverflow, I found Splitting a string that has escape sequence using regular expression in Java and came up with the regular expression
"(?<!(###))§§".
This is basically saying: match if you find "§§", unless it is preceded by "###".
This works fine with Regex.Split for the examples above. However, if inputString is "Part1###§§§§Part2", I receive "Part1###§", "§Part2" instead of "Part1###§§", "Part2".
I understand why: the second "§" gives a match because the preceding chars are "##§" and not "###". I spent several hours trying to modify the regex, but the results only got worse. Does someone have an idea?

Let's call the things that appear between the separators tokens. Your regex needs to stipulate what the beginning and end of a token look like.
In the absence of any stipulation, in other words, using the regex you have now, the regex engine is happy to say that the first token is Part1###§ and the second is §Part2.
The syntax you used, (?<!foo), is called a zero-width negative look-behind assertion. In other words, it looks behind the current position and asserts that the preceding text must not match foo. Zero-width just indicates that the assertion does not advance the pointer or cursor in the subject string when the assertion is evaluated.
If you require that a new token start with something specific (say, an alphanumeric character), you can specify that with a zero-width positive lookahead assertion. It's similar to your lookbehind, but it says "the next bit has to match the following pattern", again without advancing the cursor or pointer.
To use it, put (?=[A-Z]) following the §§. The entire regex for the separator is then
(?<!###)§§(?=[A-Z])
This would assert that the character following a separator sequence needs to be an uppercase alpha, while the characters preceding the separator sequence must not be ###. In your example, it would force the match on the §§ separator to be the pair of chars before Part2. Then you would get Part1###§§ and Part2 as the tokens, or group captures.
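The separator regex can be tried against the question's three inputs with a short Python sketch. The question uses .NET's Regex.Split; re.split is used here as a stand-in, the function name split_tokens is hypothetical, and the lookahead assumes the uppercase class [A-Z] that the text describes.

```python
import re

# Split on "§§" only when it is not preceded by "###"
# and is followed by an uppercase letter (the start of the next token).
SEPARATOR = re.compile(r"(?<!###)§§(?=[A-Z])")


def split_tokens(text):
    """Split text at unescaped separators."""
    return SEPARATOR.split(text)


if __name__ == "__main__":
    for sample in (
        "Part1§§Part2",
        "Part1§§Part2§§ThisIs###§§AllPart3",
        "Part1###§§§§Part2",
    ):
        print(split_tokens(sample))
```

The lookahead is what rules out the unwanted split in the third input: the "§§" that straddles the escaped separator is followed by another "§", not by a letter.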
If you want to stipulate what a token is in the negative, in other words to stipulate that a token begins with anything except a certain pattern, you can use a negative lookahead assertion. The syntax for this is (?!foo). It works just as you would expect: like your negative lookbehind, only looking forward.
The regular-expressions.info website has good explanations for all things regex, including for the lookahead and lookbehind constructs.
ps: it's "Hello All", not "Hello Together".

How about doing the opposite: instead of splitting the string at the separators, match the non-separator parts and the separator parts:
/(?:[^§#]|§[^§#]|#(?:[^#]|#(?:[^#]|#§§)))+|§§/
Then you just have to remove every matched separator part to get the non-separator parts.
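A sketch of this match-then-filter idea in Python follows. The function name is invented for illustration, and the pattern has only been checked here against the three inputs from the question, so treat it as a starting point rather than a general solution.

```python
import re

# First alternative: a run of non-separator text, where "###§§" counts as text.
# Second alternative: a bare separator.
PARTS = re.compile(r"(?:[^§#]|§[^§#]|#(?:[^#]|#(?:[^#]|#§§)))+|§§")


def non_separator_parts(text):
    """Match all parts, then drop the ones that are separators."""
    return [part for part in PARTS.findall(text) if part != "§§"]


if __name__ == "__main__":
    print(non_separator_parts("Part1###§§§§Part2"))
```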
