Regexp Replace (as3) - Use text to find but not to replace - regex

I'm having some trouble with a regexp. My XML file, when loaded into ActionScript, has all its spaces removed (the text is trimmed automatically). So I want to replace every space with a placeholder word that I can handle later in my own parsing.
Here are examples of the tags I want to adjust.
<w:t> </w:t>
<w:t> Test</w:t>
<w:t>Test </w:t>
This is the result I want.
<w:t>%SPACE%</w:t>
<w:t>%SPACE%Test</w:t>
<w:t>Test%SPACE%</w:t>
The closest I got is <w:t>\s|\s</w:t>
The biggest problem is that it changes all spaces in the XML file, which corrupts everything. I only want to change spaces inside w:t nodes, without destroying the text.

When parsing XML with the standard XML class in ActionScript, you can tell it not to ignore whitespace by setting the static ignoreWhitespace property to false. It is true by default. This ensures that whitespace in XML text nodes is preserved. You can then do whatever you want with it.
XML.ignoreWhitespace = false;
/* parse your XML here */
That way you don't have to muck around with regular expressions and can use the standard XML ActionScript parsing.

var reg1 : RegExp = /((?:<w:t>|\G)[^<\s]*+)\s/g;
data = data.replace(reg1, "$1%SPACE%");
(?:<w:t>|\G) means every match starts at a <w:t> tag, or immediately after the previous match. Since [^<\s] can't match the closing </w:t> tag (or any other tag), every match is guaranteed to be inside a <w:t> element.
To do this properly, you would need to deal with some more questions, like:
\s matches several other kinds of whitespace, not just ' '. Do you want to replace any whitespace character with %SPACE%? Or do you know that ' ' will be the only kind of whitespace in those elements?
Will there be other elements inside the <w:t> elements (for example, <w:t> test <xyz> test </xyz> </w:t>)? If so, the regex becomes more complicated, but it's still doable.
I'm not set up to test ActionScript, but here's a demo in PHP, which uses the PCRE library under the hood, like AS3:
test it on ideone.com
EDIT: In addition to matching where the last match left off, \G matches the beginning of the input, just like \A. That's not a problem with the regex given here, but in the ideone demo it is. That regex should be
((?:<w:t>|\G(?!\A))(?:[^<\s]++|<(?!/w:t>))*+)\s
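For engines without \G, a similar effect can be sketched with a callback replacement. This is a Python illustration, not the regex from this answer: Python's re module has no \G, so the sketch matches each whole <w:t>...</w:t> element and rewrites only its text. The name mark_spaces is made up for the example, and it assumes the <w:t> tags carry no attributes, contain no nested tags, and that only literal ' ' characters should be replaced.

```python
import re

# Matches one <w:t> element whose content has no nested tags (assumption).
W_T = re.compile(r"(<w:t>)([^<]*)(</w:t>)")

def mark_spaces(xml_text, token="%SPACE%"):
    """Replace literal spaces inside <w:t> text only (hypothetical helper)."""
    def fix(match):
        opening, text, closing = match.groups()
        return opening + text.replace(" ", token) + closing
    return W_T.sub(fix, xml_text)

# Spaces in attributes or other elements are left alone.
print(mark_spaces('<w:p a="1 2"><w:t> Test</w:t></w:p>'))
```

Handling nested elements or other whitespace characters would need a more careful pattern, as the questions above point out.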

I made a workaround that isn't so nice, but that's what happens when you work against the clock.
I run the replace three times instead.
var reg1 : RegExp = /<w:t>\s/gm;
data = data.replace(reg1, "<w:t>%DEADSPACE%");
var reg2 :RegExp = /\s<\/w:t>/gm;
data = data.replace(reg2, "%DEADSPACE%</w:t>");
var reg3 :RegExp = /<w:t>\s<\/w:t>/gm;
data = data.replace(reg3, "<w:t>%DEADSPACE%</w:t>");
RegExp, what is it good for? Absolutely nothing (singing) ;)
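The three passes can also be folded into a single pattern. A minimal Python sketch, assuming the same %DEADSPACE% token and plain <w:t> tags without attributes; the function name is hypothetical. A lookbehind and a lookahead keep the tags themselves out of the match, so only the whitespace character is replaced:

```python
import re

# One whitespace character right after <w:t>, or right before </w:t>.
EDGE_SPACE = re.compile(r"(?<=<w:t>)\s|\s(?=</w:t>)")

def mark_edge_spaces(data, token="%DEADSPACE%"):
    """Replace a single leading or trailing whitespace inside <w:t> elements."""
    return EDGE_SPACE.sub(token, data)

print(mark_edge_spaces("<w:t> </w:t><w:t> Test</w:t><w:t>Test </w:t>"))
```

Like the three-pass version, this touches only one whitespace character at each edge and leaves spaces in the middle of the text alone.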

there's also another way

Related

How to get the string that start after the last > by regular expression?

I am writing C# code that reads a webpage and extracts content from it.
I spent a lot of time figuring out the content, and now I'm stuck on this:
<i class="icon"></i><a href="https://www.nytimes.com/2017/09/12/us/irma-storm-updates.html">Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged
I want to get only "Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged".
I used "(?<=\">)(.*)" to get some content out successfully, but it doesn't fit every case.
So how can I use a regular expression to get the text that starts after the last '>'?
Thank you.
If the substring that you want to match appears after the last > then the main thing you know about it is that it does not contain a >. This is matched with [^>]. If the string must contain at least one character then you'll want to use + as the quantifier; if it's allowed to be empty then you'll want to use * to allow for zero matches. Finally, you need to match the full remainder of the text, up to the end of the line, which you do with a $.
So the full expression is [^>]*$ (or [^>]+$ if it can't be zero length).
If you also want to require that the preceding text does have a >, you can make it a bit more complicated with a zero-width look-behind, (?<=\>). This says to find a > but not include it in the match. (The > does not strictly need escaping here, but \> is harmless.) The final expression would then be (?<=\>)[^>]*$. Now, regular C# string literals also use \ for escaping, so you have to double the backslash before passing the pattern to the Regex constructor. So it becomes new Regex("(?<=\\>)[^>]*$").
The simpler version, [^>]*$, is probably sufficient for your needs.
Finally, I would add that parsing XML or HTML with regular expressions is usually the wrong thing to do because there are lots of edge cases, and you will have to make assumptions about the formatting. For example, based on your example text, I assumed you are searching up to the end of the input text. It's usually better to parse XML with an XML parser, which won't have these problems.
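To see the two expressions side by side, here is a small Python sketch (the question is about C#, but both patterns read the same in Python's re module). The function names are made up for illustration:

```python
import re

text = ('<i class="icon"></i><a href="https://www.nytimes.com/2017/09/12/us/'
        'irma-storm-updates.html">Latest Updates: 90 Percent of Houses '
        'in Florida Keys Are Damaged')

def after_last_gt(s):
    """Text after the last '>' (the whole string if there is no '>')."""
    return re.search(r"[^>]*$", s).group()

def after_last_gt_strict(s):
    """Same, but returns None when the string contains no '>' at all."""
    m = re.search(r"(?<=>)[^>]*$", s)
    return m.group() if m else None

print(after_last_gt(text))
print(after_last_gt_strict("no angle brackets here"))
```

The only difference between the two shows up on input without any >: the simple version returns the whole string, the look-behind version finds no match.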
This is the regex you need; there is a working example on RegexStorm.net:
>([^<>]+)
This says: find a closing angle bracket followed by text that doesn't include angle brackets. The [^<>] matches any character (letters, numbers, whitespace and so on) that is NOT an opening or closing angle bracket. The parentheses around [^<>]+ capture the text as a separate group. The + means one or more.
Here is a C# example that uses it. The text you want is in the first capture group, Groups[1] (Groups[0] is the whole match, including the >).
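// LINQPad-style entry point; elsewhere, add
// using System; and using System.Text.RegularExpressions;
void Main()
{
    string text = "<i class=\"icon\"></i><a href=\"https://www.nytimes.com/2017/09/12/us/irma-storm-updates.html\">Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged";
    // '>' followed by one or more characters that are not angle brackets
    Regex regex = new Regex(">([^<>]+)");
    MatchCollection matchCollection = regex.Matches(text);
    foreach (Match m in matchCollection)
    {
        // Groups[1] is the captured text without the leading '>'
        Console.WriteLine(m.Groups[1].Value);
    }
}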
RegexStorm.net is a good .Net test site. Regex101.com is a good site to learn different Regex tools.

Regular Expressions - Select the Second Match

I have a txt file with <i> and </i> tags between words that I would like to remove using EditPad.
For example, I'd like to keep when it's like this:
<i>Phrases and words.</i>
And I'd like to remove the </i> and <i> tags inside the phrase, when it's like this:
<i>Phrases</i>and<i> words.</i>
<i>Phrases</i>and <i>words.</i>
I was trying to do that using regex, but I couldn't get it to work.
As the tag is followed by a space or a word character, I could find the lines that have the double tag with
/ <i>|<\/i> /
but this way I can't just replace everything with nothing; I have to edit each line I find.
Is there any way to accomplish that?
* Edited *
Another example of lines found on the subtitle text
<i>- find me on the chamber.</i>
- What? <i>Go. Go, go, go!</i>
Rule number one: you can't parse html with regex.
That being said, if you know each line follows a certain pattern, you can usually hack something together to work. ;)
If I've understood correctly, it looks like you can simply remove all <i> and </i> that aren't either at the beginning or end of the lines. In that case, one method you could try is the following regex:
(?<=.)\<\/?i\>(?=.)
This will match the tags, with a lookahead and a lookbehind to make sure that we aren't at the start or end of a line (by checking that another character exists before and after the tag). (Note that characters matched in a lookahead or lookbehind aren't part of the match, so they won't be replaced when you search and replace.)
Disclaimer: this works on regex101, but notepad++ may have some differences to the pcre regex style.
update to work with Editpad
EDIT: since this question is actually wanting to know how to do this in Editpad, below is a modified alternative:
Try searching for the regex: (.)\<\/?i\>(.). This will match (and capture) exactly one character before and after the <i> tags.
When replacing, use backreferences to replace the entire match with the two captured characters - a replacement string of \1\2 should work.
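Both variants can be tried outside the editor. The following is a Python sketch, not EditPad syntax (regex flavors differ, so treat it as an approximation); the function names are made up for the example:

```python
import re

def strip_inner_tags_lookaround(text):
    """Remove <i> / </i> that have a character on both sides on the same line."""
    return re.sub(r"(?<=.)</?i>(?=.)", "", text)

def strip_inner_tags_capture(text):
    """Same idea with capture groups and backreferences instead of lookarounds."""
    return re.sub(r"(.)</?i>(.)", r"\1\2", text)

line = "<i>Phrases</i>and<i> words.</i>"
print(strip_inner_tags_lookaround(line))
print(strip_inner_tags_capture(line))
```

Two caveats. The capture version consumes the characters around each tag, so two tags separated by a single character (for example x</i>b<i>y) are not both removed in one pass; running the replace twice works around that. And a line such as - What? <i>Go. Go, go, go!</i> would lose its opening <i> with either variant, so the rule only suits lines that both start and end with a tag.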

How do I properly format this Regex search in R? It works fine in the online tester

In R, I have a column of data in a data-frame, and each element looks something like this:
Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae
What I want is the section after the last semicolon. I've been trying to use 'sub', duplicating the existing column to create a new one with just the endings kept. In essence, I want this (the genus):
Marinilabiaceae
A snippet of the code looks like this:
mydata$new_column<- sub("([\\s\\S]*;)", "", mydata$old_column)
In this situation, I am using \\ rather than \ because of R's escape sequences. The sub replaces the parts I don't want and updates it to the new column. I've tested the Regex several times in places such as this: http://regex101.com/r/kS7fD8/1
However, I'm still struggling because the results are very bizarre. Now my new column is populated with the organism's domain rather than the genus: Bacteria.
How do I resolve this? Are there any good easy-to-understand resources for learning more about R's Regex formats?
Starting with your simple string,
string <- "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"
You can remove everything up to the last semicolon with "^(.*);" in your call to sub
> sub("^(.*);", "", string)
# [1] "Marinilabiaceae"
You can also use strsplit with tail
> tail(strsplit(string, ";")[[1]], 1)
# [1] "Marinilabiaceae"
Your regular expression, ([\\s\\S]*;), relies on [\s\S] meaning "any character, whitespace or not", which is how PCRE reads it. I think it worked on the regex101 site because that regex tester defaults to pcre (php) (see "Flavor" in the top-left corner), and R's default regex syntax is slightly different. If you want PCRE behaviour in R, pass perl = TRUE to sub. R also requires extra backslash escape characters in many situations. For reference, this R text processing wiki has come in handy for me many times before.
Make it Greedy and get the matched group from desired index.
(.*);(.*)
^^^------- Marinilabiaceae
Here is regex101 demo
Or to get the first word use Non-Greedy way
(.*?);(.*)
Bacteria -----^^^
Here is demo
To extract everything after the last ; to the end of the line you can use:
[^;]*?$
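The difference between the greedy and non-greedy versions is easy to check in any regex engine. Here it is as a small Python sketch (Python rather than R, purely for illustration; the variable names are made up):

```python
import re

taxon = "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Marinilabiaceae"

# Greedy: the first (.*) grabs as much as possible, so group 2 is the last field.
last_field = re.match(r"(.*);(.*)", taxon).group(2)

# Non-greedy: the first (.*?) stops at the first semicolon, so group 1 is the first field.
first_field = re.match(r"(.*?);(.*)", taxon).group(1)

# Anchored at the end: the run of non-semicolon characters that reaches the end.
tail = re.search(r"[^;]*$", taxon).group()

print(last_field, first_field, tail)
```

Here the third pattern drops the ? from [^;]*?$; with the $ anchor both spellings find the same text.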

Regular expression to only return every other match

I'm trying to write a regex that will match humanly readable quoted values. As one example, XML attributes. The problem I'm running into is that the data between quoted areas is actually quoted as well if you consider an attribute's ending quote and a subsequent attribute's beginning quote. Here's the expression I have so far:
(?<=\")(?(?!\s+\")[^\"]+)(?=\")
What I tried to express in plain English was: A quote (don't capture it), if not followed by just spaces terminating in another quote, match anything not a quote that is followed by another quote (not capturing the last quote).
and here's my sample data:
<computer name = "printserver" model = "1000ZS" />
The regex produces 3 matches:
printserver
model =
1000ZS
I think that if I could find a way to tell the regex engine to skip every other occurrence I'd have it.
Here's another sample data set, sort of like QML class attributes:
field1: "value1" field2: "value2" field3: "value3"
I can "see" the quoted data, but extracting it via regex is beating me :-)
I'm using the .NET 4.5 System.Text.RegularExpressions framework in my project. I'm not targeting a specific markup like XML, JSON, QML, etc. but am looking for a general purpose regex that would just grab the quoted values similar to how we interpret the data as humans...
Any suggestions? Thanks!
You can always consume the quote in your match:
\"([^\"]+)\"
And extract the part you need from the first capture group.
If it's explicitly a quote preceded by a space, then you can use the part you used, with a little tweak:
\"((?:(?!\s+\")[^\"])+)\"
Or if you just know that the string contains simple patterns like that, maybe something like this:
(?:(?!\s+\")[^\"])+(?=\")
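The first suggestion (consume both quotes and read the capture group) can be checked against the two samples from the question. A Python sketch for illustration, assuming values never contain escaped quotes; the function name is hypothetical:

```python
import re

# Consume the opening and closing quote; capture what is between them.
QUOTED = re.compile(r'"([^"]+)"')

def quoted_values(text):
    """Return the contents of every double-quoted run, in order."""
    return QUOTED.findall(text)

print(quoted_values('<computer name = "printserver" model = "1000ZS" />'))
print(quoted_values('field1: "value1" field2: "value2" field3: "value3"'))
```

Because each match consumes its closing quote, the next match cannot start there, which is what stops the text between two attributes from being reported as a value.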

Regular expression for XML

I'm attempting to build a regular expression that will match against the contents of an XML element containing some un-encoded data. Eg:
<myElement><![CDATA[<p>The <a href="http://blah"> draft </p>]]></myElement>
Usually in this circumstance I'd use
[^<]*
to match everything up to the less than sign but this isn't working in this case. I've also tried this unsuccessfully:
[^(</myElement>)]*
I'm using Groovy, i.e. Java.
Please don't do this, but you're probably looking for:
<myElement>(.*?)</myElement>
This won't work if <myElement> (or the closing tag) can appear in the CDATA. It won't work if the XML is malformed. It also won't work with nested <myElement>s. And the list goes on...
The proper solution is to use a real XML parser.
Your [^(</myElement>)]* regex was saying: match any number of characters that are not in the set (, <, /, m, etc., which is clearly not what you intended. You cannot place a group within a character class in order for it to be treated atomically -- the characters will always be treated as a set (with ( and ) being literal characters, too).
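The character-class point is easy to verify. A short Python sketch (Python instead of Groovy, for illustration only) comparing the bracket expression with the lazy pattern suggested above:

```python
import re

sample = '<myElement><![CDATA[<p>The <a href="http://blah"> draft </p>]]></myElement>'

# The brackets define a set of single characters, so '<' itself is excluded
# and the match at the start of the sample is empty.
as_set = re.match(r"[^(</myElement>)]*", sample).group()

# On other input it stops at the first character from the set.
stops_early = re.match(r"[^(</myElement>)]*", "123<x").group()

# The lazy version captures everything up to the first closing tag.
content = re.search(r"<myElement>(.*?)</myElement>", sample).group(1)

print(repr(as_set), repr(stops_early), content)
```

The same limits listed above still apply to the lazy pattern; the sketch only shows why the bracket version matched nothing useful.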
If you are doing it on a line-by-line basis, this will match the inside of your example:
>(.*)</
returns: <![CDATA[<p>The <a href="http://blah"> draft </p>]]>
Probably use it something like this:
import java.util.regex.Matcher

String subjectString = '<myElement><![CDATA[<p>The <a href="http://blah"> draft </p>]]></myElement>'
Matcher regexMatcher = subjectString =~ ">(.*)</"
if (regexMatcher.find()) {
    // group(1) is the text between the first '>' and the last '</'
    String resultString = regexMatcher.group(1)
}