How to match only one string and not another - regex

This is String 1:
<td class="AAA"><span class="BBB">Text1</span></td>
I want to remove the span so it looks like this:
<td class="BBB">Text1</td>
Which is easy enough with this regex:
Search: <td class="AAA"><span class="BBB">(.*)</span></td>
Replace: <td class="BBB">$1</td>
The problem: Sometimes the string looks like this (String 2):
<td class="AAA"><span class="BBB">Text1</span>-<span class="BBB">Text2</span></td>
which also matches because of the 2 closing tags. But I don't want it to be matched at all. How do I find only String 1?

Instead of matching any character in your capture group, match any character except the opening <:
Search: <td class="AAA"><span class="BBB">([^<]*)</span></td>
Replace: <td class="BBB">$1</td>
This assumes that Text1 never contains the < character.
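To see why the negated class excludes String 2, here is a minimal sketch in Python (an assumption, since the question does not name a tool; Python writes the backreference as \1 rather than $1). The variable names are made up for illustration.

```python
import re

# Pattern from the answer: [^<]* cannot run past the first "<",
# so the capture can never swallow a second <span>.
PATTERN = re.compile(r'<td class="AAA"><span class="BBB">([^<]*)</span></td>')
REPLACEMENT = r'<td class="BBB">\1</td>'

string1 = '<td class="AAA"><span class="BBB">Text1</span></td>'
string2 = ('<td class="AAA"><span class="BBB">Text1</span>-'
           '<span class="BBB">Text2</span></td>')

result1 = PATTERN.sub(REPLACEMENT, string1)  # span removed
result2 = PATTERN.sub(REPLACEMENT, string2)  # no match, left untouched

print(result1)
print(result2)
```

With the original (.*) the second string would match as well, because .* is free to run across the inner </span>-<span class="BBB"> text.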

Related

How to make a regex expression for this content?

I am scraping a website which looks like this, and I am looking for 4 / 5 and 3 / 10. That is, I want (number) + space + slash + 3 spaces + another number.
I tried the regex expression ^[0-9]+(\/[0-9]+)" *"*$ but that did not work.
<td>Monday</td>
<td class="text-center text-danger font-weight-bold">4 / 5</td>
</td>
<td>Tuesday</td>
<td class="text-center text-danger font-weight-bold">3 / 10</td>
</td>
You were close. Use word boundary \b instead of ^ and $, because the text you are looking for is somewhere in the middle of your text. This regex should work:
/\b[0-9]+ +\/ +[0-9]+\b/
The + after each space makes the regex more forgiving: it matches one or more spaces instead of exactly one.
If you want to capture the numbers separately you can introduce capture groups, to reference them with $1 and $2, respectively:
/\b([0-9]+) +\/ +([0-9]+)\b/
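As a rough sketch of how the capture groups could be used when scraping, here is the same pattern in Python (an assumption; the question does not say which language does the scraping). Python takes the pattern as a string, so the /.../ delimiters and the escaped slash are not needed. The names html, SCORE and pairs are illustrative only.

```python
import re

html = """<td>Monday</td>
<td class="text-center text-danger font-weight-bold">4 / 5</td>
</td>
<td>Tuesday</td>
<td class="text-center text-danger font-weight-bold">3 / 10</td>
</td>"""

# Group 1 is the number before the slash, group 2 the number after it.
SCORE = re.compile(r"\b([0-9]+) +/ +([0-9]+)\b")

# findall returns one (group1, group2) tuple per match.
pairs = [(int(a), int(b)) for a, b in SCORE.findall(html)]
print(pairs)
```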
$ grep "[[:digit:]]\{1,\} \/ [[:digit:]]\{1,\}" filename
<td class="text-center text-danger font-weight-bold">4 / 5</td>
<td class="text-center text-danger font-weight-bold">3 / 10</td>
$
Your regex starts with ^, which anchors it to the beginning of the line, so it cannot match text that sits in the middle of a line.

Regex - Find and replace an url inside href attribute

I have an xlsx/csv file whose contents I am trying to modify with Notepad++.
Specifically, I want to change a URL inside an href attribute. For example:
href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7/7521_Datasheet--de.pdf""
href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7609_Datasheet--de.pdf""
href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/6/7981_Datasheet--de.pdf""
etc...
After replace, I want them to look like this:
href=""/docs/7521_Datasheet--de.pdf""
href=""/docs/7609_Datasheet--de.pdf""
href=""/docs/7981_Datasheet--de.pdf""
Right now, I have this pattern on find:
(?<=href=(""|''))[^"']+(?=(.pdf""|.pdf''))
EDIT:
After trying the given examples, no string matches. Here is the full cell text:
"<table cellspacing=""0"" width=""100%"" border=""0"" cellpadding=""10""><tbody><tr>
<td align=""left"" valign=""top"">
<table cellspacing=""0"" width=""100%"" border=""0"" cellpadding=""0""><tbody><tr>
<td>
<table cellspacing=""0"" width=""100%"" border=""0"" cellpadding=""0""><tbody><tr>
<td align=""left"" valign=""top"" class=""DocRepCell1""><img src=""/catalog/pdf.gif"" alt="" "" border=""0""></td>
<td align=""left"" width=""97%"" valign=""middle"" class=""DocRepCell2""><span class=""NavigationButtonMoreInfos"">Produktinformation breite</span> </td>
<td align=""right"" width=""1%"" nowrap=""nowrap"" valign=""middle"" class=""DocRepCell3"">0,1 MB</td>
<td align=""right"" width=""1%"" nowrap=""nowrap"" valign=""middle"" class=""DocRepCell4"">
<a class=""NavigationButtonMoreInfos"" target=""_blank"" href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7/7521_Datasheet--de.pdf"">herunterladen</a></td></tr>
</tbody></table></td></tr></tbody>
</table></td></tr>
</tbody></table></td></tr>
</tbody></table>"
You can try the following find and replace in regex mode:
Find:
^href=""/.*?(\d+_Datasheet.*\.pdf"")$
Replace:
href=""/docs/$1
Note that the find pattern could be made more generic if it doesn't work on more of your data. But in general we would need some concrete way of identifying the start of the suffix which you wish to retain in the match. If my answer doesn't work for you, then state where it fails and provide logic which allows the suffix to be identified.
Here's a way to just match the part you want to replace with the path /docs
Find what :
^href=["']+\K(/.*?)(?=/\d+_[\w-]+\.pdf["']+$)
Replace with :
/docs
Search mode : Regular expression (best with ". matches newline" unchecked)
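If the replacement is scripted instead of done in Notepad++, the same idea can be sketched in Python. This is only an illustration under a few assumptions: Python's re module has no \K, so capture groups are used instead; the pattern is left unanchored because the href sits in the middle of a longer cell; and the helper name shorten is made up.

```python
import re

# Group 1: "href=" plus its opening quote characters.
# Group 2: the file name to keep (digits, underscore, name, .pdf).
# Everything between them (the old directory path) is discarded.
HREF = re.compile(r"""(href=["']+)/[^"']*/(\d+_[\w-]+\.pdf)(?=["'])""")

def shorten(cell: str) -> str:
    """Rewrite every datasheet href in the cell to point at /docs/."""
    return HREF.sub(r"\1/docs/\2", cell)

samples = [
    'href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7/7521_Datasheet--de.pdf""',
    'href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/7609_Datasheet--de.pdf""',
    '<a target=""_blank"" href=""/xs_db/DOKUMENT_DB/www/Datenblaetter/de/6/'
    '7981_Datasheet--de.pdf"">herunterladen</a>',
]
converted = [shorten(s) for s in samples]
for line in converted:
    print(line)
```

Because the path part is written as [^"']*, it cannot run past the closing quotes, so other attributes in the same cell (such as the img src) are left alone.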

Remove all tabs, line breaks, empty lines, and multiple successive spaces, keeping only single space characters

I have a HTML code that looks like this:
<TABLE>
<TR>
<TD>Item</TD>
<TD><A>48</A>
</TD></TR>
<TR>
<TD>Item</TD>
<TD><A >48</A>
</TD></TR>
<TR>
<TD>Tags</TD>
<TD><A>
keyword</A>, <A>keyword
</A>, <A>keyword
</A>, <A>keyword</A>, <A
>keyword</A>, <A
>keyword
</A>, <A>keyword
</A>
</TABLE>
Using .NET regex, can anyone help me to remove ALL whitespace characters EXCEPT single space characters from it so that I will end up with one long string of code?
You can use this regex:
[\p{Z}\s]{2,}
This matches any run of at least 2 whitespace characters. Replace each match with an empty string.
\p{Z} is the Unicode shorthand class for all separators.
This is possible with the following regex,
\s{2,} // \s matches any whitespace character, and {2,} requires at least 2 of them in a row
You can use it in C# like this:
string output = Regex.Replace(input, @"\s{2,}", "");
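One thing worth checking with \s{2,} is that a lone line break between two tags is a run of length 1 and therefore survives. The sketch below uses Python's re module as a stand-in for .NET (an assumption made only so the example is easy to run) and compares that pattern with a stricter two-step variant. The stricter variant is one reading of "remove everything except single spaces", not something either answer proposes.

```python
import re

html = "<TD>Item</TD>\n<TD><A >48</A>\n\t</TD>,  <A>keyword\n</A>"

# Same idea as the answers: drop runs of two or more whitespace characters.
runs_removed = re.sub(r"\s{2,}", "", html)

# Stricter variant: [^\S ] means "whitespace that is not a plain space",
# so tabs and line breaks are removed even when they occur alone.
no_breaks = re.sub(r"[^\S ]+", "", html)
# Then squeeze any remaining run of spaces down to a single space.
single_spaced = re.sub(r" {2,}", " ", no_breaks)

print(repr(runs_removed))
print(repr(single_spaced))
```

The first result still contains single \n characters, while the second is one long line in which the space inside <A > and a single space after the comma are kept.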

Select URL in HTML table with regular expression

I have a table with names and URLs like this:
<tr>
<td>name1</td>
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td> </tr>
I want to select all URL-tabledata in a table.
I tried:
<td>w{3,3}.*(</td>){1,1}
But this expression doesn't "stop" at the first </td>. I get:
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td>
as result. Where is my mistake?
There are several ways to match a URL. I'll try the simplest to your needs: just correcting your regex. You can use this one instead:
<td>w{3}.*?</td>
Explanation:
<td>          # this part is OK
w{3,3}        # the notation {3} is simpler for this case and has the same effect
.*            # the main problem: you have to use .*? to make .* non-greedy,
              # that is, to make it match as little as possible
(</td>){1,1}  # same as the second line; since the count is 1, {1,1} is not needed
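The difference between the greedy and the non-greedy form can be checked with a short Python sketch. It assumes the dot is allowed to match line breaks (re.DOTALL), which is what the result shown in the question suggests; the variable names are illustrative.

```python
import re

table = """<tr>
<td>name1</td>
<td>www.url.com</td> </tr>
<tr>
<td>name2</td>
<td>www.url2.com</td> </tr>"""

# Greedy: .* runs to the end of the input, then backs up to the LAST </td>.
greedy = re.findall(r"<td>w{3}.*</td>", table, flags=re.DOTALL)

# Non-greedy: .*? stops at the FIRST </td> after each match start.
lazy = re.findall(r"<td>w{3}.*?</td>", table, flags=re.DOTALL)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

The greedy pattern returns a single match spanning both rows, while the non-greedy one returns each URL cell separately.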
Your regex can be
\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]
or
"((((ht{2}ps?://)?)((w{3}\\.)?))?)[^.&&[a-zA-Z0-9]][a-zA-Z0-9.-]+[^.&&[a-zA-Z0-9]](\\.[a-zA-Z]{2,3})"
See also the question "What is the best regular expression to check if a string is a valid URL?". Many answers are available there.

Using a REGEX to replace words within a sub-match

I hope this isn't a repetition...
I need a regex to do what should be a fairly simple task. I have code for an HTML table, and I want to replace all <td> tags with <th> tags in the first row of the table, i.e. within the first set of <tr> </tr> tags. The table might look something like this:
<table cellpadding="5" cellspacing="0" border="1">
<tr>
<td>Capacity %</td>
<td>Tension V</td>
<td>Acid kg/l</td>
</tr>
<tr>
<td>100</td>
<td>12.70</td>
<td>1.265</td>
</tr>...etc
and I want:
<table cellpadding="5" cellspacing="0" border="1">
<tr>
<th>Capacity %</th>
<th>Tension V</th>
<th>Acid kg/l</th>
</tr>
<tr>
<td>100</td>
<td>12.70</td>
<td>1.265</td>
</tr>...etc
I've tried regexes similar to this:
/(<table>\n<tr>\n)(.+?)(</tr>)
...and then tried to rebuild the table row using back references, but I can't seem to apply the regex to the multiple </?td> matches that there might be.
I'm doing this in javascript, which means I can't use look-behinds (although if anyone has a look behind solution I'd be interested in seeing it anyway...).
Thanks in advance for any help.
You could do it if your regex engine supports indefinite repetition inside lookbehind assertions, for example in .NET (C#):
resultString = Regex.Replace(subjectString,
@"(?<= # Assert that we can match this before the current position:
<table # <table
(?: # followed by...
(?! # (unless there's an intervening
</table # </table
| # or
</tr # </tr)
) # (End of lookahead assertion)
. # any character
)* # any number of times
) # (End of lookbehind assertion)
<td # Then match <td",
"<th", RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
This works on your example. But even in .NET, I wouldn't use a regex for it; it's just too brittle. Better to manipulate the DOM directly; that's what it's there for.
You can't do this with a single regex. Since regex basically works line-by-line, and you've got a special condition ("only on the first <tr>"), you'll need to write some conditional logic along with the regex to make it work.
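One way to combine a regex with a bit of logic, without any lookbehind, is a two-step replace: find only the first row, then rewrite the cell tags inside that match. The sketch below is in Python for easy testing and is an assumption rather than the asker's code; the same shape should carry over to JavaScript's String.prototype.replace with a callback. The names FIRST_ROW, CELL_TAG and promote_header are made up, and like any regex approach to HTML it is brittle compared with DOM manipulation.

```python
import re

table = """<table cellpadding="5" cellspacing="0" border="1">
<tr>
<td>Capacity %</td>
<td>Tension V</td>
<td>Acid kg/l</td>
</tr>
<tr>
<td>100</td>
<td>12.70</td>
<td>1.265</td>
</tr>"""

# Step 1: the first <tr ...> up to the nearest </tr> (non-greedy, dot matches newlines).
FIRST_ROW = re.compile(r"<tr\b.*?</tr>", re.DOTALL)
# Step 2: an opening or closing td tag; group 1 keeps the optional slash.
CELL_TAG = re.compile(r"<(/?)td\b")

def promote_header(html: str) -> str:
    """Turn td into th inside the first table row only."""
    def convert(match: re.Match) -> str:
        return CELL_TAG.sub(r"<\1th", match.group(0))
    # count=1 limits the outer replacement to the first row.
    return FIRST_ROW.sub(convert, html, count=1)

result = promote_header(table)
print(result)
```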