HTML Agility Pack - ReplaceNode doesn't change the InnerHTML of the Body - replace

I have this
The body:
<body><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent leo leo, ultrices eu venenatis et, rutrum fringilla dolor.</p></body>
The code:
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
Dictionary<HtmlNode, HtmlNode> toReplace = new Dictionary<HtmlNode, HtmlNode>();
// I do some logic here adding nodes to the toReplace dictionary.
foreach (HtmlNode replaceNode in toReplace.Keys)
{
    replaceNode.ParentNode.ReplaceChild(toReplace[replaceNode], replaceNode);
}
After I do this, the InnerHtml of the body node remains the same as it was at the beginning, although OuterHtml and InnerText show the correct result. Is there something wrong with my code?
The result:
// body.InnerHtml
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent leo leo, ultrices eu venenatis et, rutrum fringilla dolor.</p>
// body.OuterHtml
<body><p>Lorem ipsum dolor sit amet...</p></body>

I think it may have something to do with the way you are adding nodes to replace old nodes. See if this solution, which truncates the paragraph content directly, works for you. I did a quick test and all three properties (InnerHtml, InnerText and OuterHtml) gave me the same result.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
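foreach (var paragraph in body.Descendants("p"))
{
    // Guard against paragraphs shorter than 25 characters.
    int length = Math.Min(25, paragraph.InnerHtml.Length);
    paragraph.InnerHtml = paragraph.InnerHtml.Substring(0, length) + "...";
}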
Console.WriteLine(body.InnerHtml);
Console.WriteLine(body.InnerText);
Console.WriteLine(body.OuterHtml);

Related

Regex that matches multiple new lines until finding pattern

I am not very familiar with regex and I am having trouble creating one that solves my problem.
I want to create a regex that finds the following example: (What the regex should match is in bold)
Action type: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam sodales tincidunt ipsum ut ullamcorper
Phasellus rhoncus quam id eros volutpat, ac sodales magna tincidunt Phasellus rhoncus quam id eros volutpat, ac sodales magna
Phasellus rhoncus quam id eros volutpat, ac sodales magna tincidunt
Number Name Degree
11111111 LOREM IPSUM COMPUTER ENGINEERING
31837183 DOLOR IPSUM COMPUTER ENGINEERING
Total: 2
Action type: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Number Name Degree
128172818211 SIT AMET IPSUM COMPUTER ENGINEERING
12183781 CONSECTETUR ELIT COMPUTER ENGINEERING
128172818212 ETIAM SODALES COMPUTER ENGINEERING
128172818213 IPSUM UT COMPUTER ENGINEERING
128172818215 SODALES MAGNA COMPUTER ENGINEERING
Total: 5
What I have accomplished so far is a regex that successfully matches the data lines and the first line of the action type, but not the subsequent lines. I would like to match everything that comes after Action type up to the line that contains Number, Name and Degree.
The regex I am currently using is (Action type: .+?\n|[0-9]{8,12} .+?\n). A preview of the current execution on regex101.com is attached.
As you can see, it works well for the second example, but it does not fulfil my needs for the first one.
Is it possible to adapt my current regex to handle these multiline blocks?
Try:
^Action type:.*?(?=^Number Name Degree)|^\d{8,12}[^\n]+
Regex demo.
^Action type:.*?(?=^Number Name Degree) - this matches all text beginning with Action type: until ^Number Name Degree is found.
^\d{8,12}[^\n]+ - this matches all lines beginning with 8-12 digits.
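Note: the expression needs the (?s) modifier so that . also matches newlines, and multiline mode (?m) so that ^ matches at the start of every line.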
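A minimal Python sketch of the pattern above, assuming Python's re module as the engine (the question does not name one) and a shortened sample input. re.MULTILINE and re.DOTALL stand in for the (?m) and (?s) modifiers.

```python
import re

# Shortened sample input modelled on the question's first block.
text = (
    "Action type: Lorem ipsum dolor sit amet\n"
    "Phasellus rhoncus quam id eros volutpat\n"
    "Number Name Degree\n"
    "11111111 LOREM IPSUM COMPUTER ENGINEERING\n"
    "31837183 DOLOR IPSUM COMPUTER ENGINEERING\n"
    "Total: 2\n"
)

# MULTILINE lets ^ match at every line start; DOTALL lets . cross newlines.
pattern = re.compile(
    r"^Action type:.*?(?=^Number Name Degree)|^\d{8,12}[^\n]+",
    re.MULTILINE | re.DOTALL,
)

matches = pattern.findall(text)
for m in matches:
    print(repr(m))
```

The first match spans both lines of the action description; the header line and the Total line are skipped because neither alternative applies to them.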

regex not capturing newline

I am trying to parse log files using regex. The logs look like this:
2022-04-01 00:00:00.0000|DEBUG|LOREM:LOREM|IPSUM:LOREM:LOREMIPSUM Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam vel placerat sapien. Suspendisse interdum est nulla, ac interdum sem pellentesque vel. Ut condimentum nisl ipsum (Failed:1/Total:5) [10.0000 ms].
2022-04-01 00:00:00.0000|DEBUG|LOREM:IPSUM|lorem ipsum \\SOME-PATH[Lorem Ipsum] (ID:000000-0000-0000-0000). Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam vel placerat sapien. Suspendisse interdum est nulla, ac interdum sem pellentesque vel. //line return here
Ut condimentum nisl ipsum.
2022-04-01 00:00:00.0000|DEBUG|LOREM:IPSUM|lorem ipsum \\SOME-PATH[Lorem Ipsum] (ID:000000-0000-0000-0000). Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam vel placerat sapien. Suspendisse interdum est nulla, ac interdum sem pellentesque vel. //line return here
Ut condimentum nisl ipsum.
Here is what I have tried (live version on regex 101 https://regex101.com/r/RoDU5L/1)
^(?<timestamp>^[\d-]+\s[\d:.]+)\|DEBUG\|(.*?)?\r?$|.*?(?<path>\\.*\]\s)(?<description>.*)+$ /gm
The problem is that it is not taking the last line "Ut condimentum nisl ipsum."
Thanks for your help
You can use
^(?<timestamp>^[\d-]+\s[\d:.]+)\|DEBUG\|(.*(?:\r?\n(?![\d-]+\s[\d:.]+\|).*)*)|.*?(?<path>\\.*\]\s)(?<description>.*)+$
See the regex demo.
The .*(?:\r?\n(?![\d-]+\s[\d:.]+\|).*)* part now matches
.* - any zero or more chars other than line break chars, as many as possible
(?:\r?\n(?![\d-]+\s[\d:.]+\|).*)* - zero or more occurrences of
\r?\n(?![\d-]+\s[\d:.]+\|) - CRLF or LF line ending not immediately followed with a datetime-like pattern and a | right after
.* - any zero or more chars other than line break chars, as many as possible.
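As a rough illustration, here is the first alternative of that pattern in Python. Two things are assumptions of this sketch: Python's re module spells named groups (?P<name>...), and the group name message is invented here for the previously unnamed capture. The sample log is shortened.

```python
import re

# Shortened sample: the second entry continues on a following line.
log = (
    "2022-04-01 00:00:00.0000|DEBUG|LOREM:LOREM|single line entry.\n"
    "2022-04-01 00:00:01.0000|DEBUG|LOREM:IPSUM|first line of entry\n"
    "Ut condimentum nisl ipsum.\n"
    "2022-04-01 00:00:02.0000|DEBUG|LOREM:IPSUM|last entry"
)

entry = re.compile(
    r"^(?P<timestamp>[\d-]+\s[\d:.]+)\|DEBUG\|"
    # The rest of the first line, then any following lines that do not
    # themselves start with a timestamp followed by a pipe.
    r"(?P<message>.*(?:\r?\n(?![\d-]+\s[\d:.]+\|).*)*)",
    re.MULTILINE,
)

entries = [(m["timestamp"], m["message"]) for m in entry.finditer(log)]
for timestamp, message in entries:
    print(timestamp, repr(message))
```

The negative lookahead is what keeps a continuation line attached to the entry before it: a line break is only consumed when the next line does not look like the start of a new log entry.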

Ignore a substring in RegEx pattern

I want to ignore a certain substring inside the match, rather than reject the match when that substring is present.
For example
I have the text:
Lorem ipsum dolor sit amet, consectetur adipiscing eliti qwer-
ty egeet qwewerty lectus. Proinera risus massa, placerat in q-
werty sed, tincidunt in nunci auspendisse vel dolor qwerty qw-
erty, molestie nisl sit amet, qwerty ligula curabitur ipsum,
euismod at augue at, dapibus feugiat qweerty
I need to find all qwerty, even if it contains -\n.
My current solution is to add (?:-\n)? after every char:
/q(?:-\n)?w(?:-\n)?e(?:-\n)?r(?:-\n)?t(?:-\n)?y/gm
But it looks bulky (even for this example, which contains only 6 chars) and it is too hard to modify the regex later. Is there a trick to make the regex shorter?
No, regex is not good at this kind of match. The easiest way would be to remove each -\n sequence from the text first and then search for the plain word.
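A small Python sketch of that suggestion, using the sample text from the question. It also shows a second option that is not part of the answer: generating the bulky pattern programmatically so that it stays easy to change. The variable names are illustrative only.

```python
import re

text = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing eliti qwer-\n"
    "ty egeet qwewerty lectus. Proinera risus massa, placerat in q-\n"
    "werty sed, tincidunt in nunci auspendisse vel dolor qwerty qw-\n"
    "erty, molestie nisl sit amet, qwerty ligula curabitur ipsum,\n"
    "euismod at augue at, dapibus feugiat qweerty"
)

# Option 1: undo the hyphenated line breaks, then search for the plain word.
joined = text.replace("-\n", "")
count = len(re.findall(r"qwerty", joined))

# Option 2 (an addition, not from the answer): build the long pattern from
# the word, allowing an optional "-\n" between any two characters.
word = "qwerty"
generated = r"(?:-\n)?".join(re.escape(ch) for ch in word)
count_generated = len(re.findall(generated, text))

print(count, count_generated)
```

Option 1 changes the offsets of the matches relative to the original text, so option 2 may be preferable when the match positions matter.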

Regular expression matching a sequence of words

Let's suppose we have a paragraph like this:
Lorem ipsum, sit amet consectetur adipiscing elit. Lorem - ipsum, sit
amet. Morbi a suscipit sem, quis finibus turpis. Lorem ipsum: sit
amet. Proin suscipit ac arcu pharetra tincidunt. Lorem ipsum. sit
amet. Pellentesque eu lacinia metus. sit amet: Lorem ipsum. Lorem
turpis ipsum, sit amet.
I need a case-insensitive PCRE pattern that only selects the words
1 lorem
2 ipsum
3 sit
4 amet
in that specific order, ignoring punctuation and ignoring occurrences like
Sit amet lorem ipsum
Lorem turpis ipsum, sit amet
A simple, straightforward approach is to allow certain punctuation characters between the words. You can add any further punctuation character inside the []:
([Ll]orem)[\s,.!:\-()?]+(ipsum)[\s,.!:\-()?]+(sit)[\s,.!:\-()?]+(amet)
or allow whitespace and any non-word character (\W, i.e. anything other than [A-Za-z0-9_]):
([Ll]orem)[\s\W]+(ipsum)[\s\W]+(sit)[\s\W]+(amet)
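Case-insensitive matching can usually be switched on with a flag, depending on the programming language (in PCRE, the i modifier or an inline (?i)). Otherwise you have to add every relevant variation manually, like ([Ll]orem). Note that a | inside a character class is a literal pipe, so [L|l] would also match |.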
Regex101 Example
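For illustration, here is the second pattern run against the question's paragraph in Python, with re.IGNORECASE taking the place of [Ll]. Python is only an assumption for the demo, since the question asks for PCRE; the pattern itself is unchanged apart from the flag.

```python
import re

paragraph = (
    "Lorem ipsum, sit amet consectetur adipiscing elit. Lorem - ipsum, sit\n"
    "amet. Morbi a suscipit sem, quis finibus turpis. Lorem ipsum: sit\n"
    "amet. Proin suscipit ac arcu pharetra tincidunt. Lorem ipsum. sit\n"
    "amet. Pellentesque eu lacinia metus. sit amet: Lorem ipsum. Lorem\n"
    "turpis ipsum, sit amet."
)

# Only whitespace and non-word characters may separate the four words.
sequence = re.compile(
    r"(lorem)[\s\W]+(ipsum)[\s\W]+(sit)[\s\W]+(amet)",
    re.IGNORECASE,
)

found = [m.group(0) for m in sequence.finditer(paragraph)]
for hit in found:
    print(repr(hit))
```

The trailing "sit amet: Lorem ipsum. Lorem turpis ipsum, sit amet." is not matched, because another word always sits between ipsum and sit there.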

Remove one iteration from every instance of a pattern with a RegEx?

Let's say I have the following text:
Lorem ipsum dolor sit amet, consectetur aaBaaBaaB adipiscing elit.
aaBaaB
aaB Ut in risus quis elit posuere faucibus sed vitae metus. aaBaaBaaBaaB
Fusce nec tortor in dolor aaBaaBaaB porttitor viverra. aaB
I'm trying to figure out how to perform a regular expression search and replace on this in such a way that the output is:
Lorem ipsum dolor sit amet, consectetur aaBaaB adipiscing elit.
aaB
Ut in risus quis elit posuere faucibus sed vitae metus. aaBaaBaaB
Fusce nec tortor in dolor aaBaaB porttitor viverra.
That is, to remove one "aaB" from each pattern of it. Is this actually possible, and if so, how would it be done? Specifically, I intend to do this in Sublime Text 2 as a RegEx search/replace in a file.
You can use a positive lookahead:
(?=(?<w>[a-z]{2}[A-Z]{1})\s)\k<w>
You just need to make sure you have case-sensitive matching on.
example: http://regex101.com/r/sK8bG1
Use either the leading or the trailing whitespace to remove the first or the last substring of each run. Either of these works:
(\s+)(aaB) with $1 in the Replace field
or
(aaB)(\s+) with $2 in the Replace field
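A quick way to check the trailing-whitespace variant outside the editor is a Python sketch like the following. Python writes the backreference as \2 where Sublime Text uses $2, and the sample text is assumed to end with a newline. A run at the very end of the input with no whitespace after it would be left untouched.

```python
import re

text = (
    "Lorem ipsum dolor sit amet, consectetur aaBaaBaaB adipiscing elit.\n"
    "aaBaaB\n"
    "aaB Ut in risus quis elit posuere faucibus sed vitae metus. aaBaaBaaBaaB\n"
    "Fusce nec tortor in dolor aaBaaBaaB porttitor viverra. aaB\n"
)

# Drop the last "aaB" of each run, keeping the whitespace that follows it.
result = re.sub(r"(aaB)(\s+)", r"\2", text)

# Where a lone "aaB" is removed, the neighbouring space is left behind,
# so strip each line before comparing with the desired output.
lines = [line.strip() for line in result.splitlines()]
for line in lines:
    print(line)
```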