Reduce multiple tags to just one occurrence if there are no words between them with preg_replace - regex

I searched the web up and down but couldn't find a working solution.
Let's say I have multiple occurrences of <u></u><font color="green"></font><i></i><div></div><big></big>
like this:
<u></u><font color="green"></font><i></i><div></div><big></big><u></u><font color="green"></font><i></i><div></div><big></big><u></u><font color="green"></font><i></i><div></div><big></big><u></u><font color="green"></font><i></i><div></div><big></big><u></u><font color="green"></font><i></i><div></div><big></big>
I want it to be reduced to just one occurrence: <u></u><font color="green"></font><i></i><div></div><big></big>
The tags are not always the same. Basically I want multiple occurrences of tags that repeat again and again with no words between them to be reduced to just one occurrence. Hope it makes sense.

I'm not entirely sure what you're after. This is a bit of a hack, but it might get you started.
Use (<.*>)\1+ as your search string, then replace with $1. You may need to run that a couple times.
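If you want to try the idea outside PHP, here is a minimal sketch in Python. It uses Python's re module as a stand-in for preg_replace, and the name collapse_repeats is made up for this example. Because the greedy .* grabs the longest repeating run first, a single pass can leave more than one copy behind, so the sketch simply repeats the substitution until the string stops changing.

```python
import re

# Same pattern as above: a tag-delimited run that is immediately
# repeated one or more times.
REPEATED_RUN = re.compile(r'(<.*>)\1+')

def collapse_repeats(html: str) -> str:
    """Apply the substitution until a pass makes no further change."""
    while True:
        reduced = REPEATED_RUN.sub(r'\1', html)
        if reduced == html:
            return html
        html = reduced

# Example input: the block from the question, repeated five times.
unit = '<u></u><font color="green"></font><i></i><div></div><big></big>'
collapsed = collapse_repeats(unit * 5)
```

Note that the pattern does not check that the run contains only tags, so identical runs that include text would be collapsed as well.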

Related

What is the correct regex pattern to use to clean up Google links in Vim?

As you know, Google links can be pretty unwieldy:
https://www.google.com/search?q=some+search+here&source=hp&newwindow=1&ei=A_23ssOllsUx&oq=some+se....
I have MANY Google links saved that I would like to clean up to make them look like so:
https://www.google.com/search?q=some+search+here
The only issue is that I cannot figure out the correct regex pattern for Vim to do this.
I figure it must be something like this:
:%s/&source=[^&].*//
:%s/&source=[^&].*[^&]//
:%s/&source=.*[^&]//
But none of these are working; they start at &source, and replace until the end of the line.
Also, the search?q=some+search+here can appear anywhere after the .com/, so I cannot rely on it being in the same place every time.
So, what is the correct Vim regex pattern to use in order to clean up these links?
Your example can easily be dealt with by using a very simple pattern:
:%s/&.*
because you want to keep everything that comes before the second parameter, which is marked by the first & in the string.
But, if the q parameter can be anywhere in the query string, as in:
https://www.google.com/search?source=hp&newwindow=1&q=some+search+here&ei=A_23ssOllsUx&oq=some+se....
then no amount of capturing or whatnot will be enough to cover every possible case with a single pattern, let alone a readable one. At this point, scripting is really the only reasonable approach, preferably with a language that understands URLs.
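For what it's worth, here is a rough sketch of the scripting route in Python, using the standard urllib.parse module. The function name keep_only_q is made up for this example, and it assumes that q is the only parameter worth keeping and that any fragment can be dropped.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def keep_only_q(url: str) -> str:
    """Return the URL with every query parameter except q removed."""
    parts = urlsplit(url)
    # parse_qsl decodes the query string into (name, value) pairs.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(name, value) for name, value in pairs if name == "q"]
    # urlencode re-encodes spaces as "+", matching the original style.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

cleaned = keep_only_q(
    "https://www.google.com/search?source=hp&newwindow=1"
    "&q=some+search+here&ei=A_23ssOllsUx"
)
```

Since the query string is parsed rather than pattern-matched, the position of q does not matter, and a parameter such as oq is never mistaken for q.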
--- EDIT ---
Hmm, scratch that. The following seems to work across the board:
:%s#^\(https://www.google.com/search?\)\(.*\)\(q=.\{-}\)&.*#\1\3
We use # as separator because of the many / in a typical URL.
We capture a first group, up to and including the ? that marks the beginning of the query string.
The second group, \(.*\), matches whatever comes between the ? and q=. It is captured but never used in the replacement.
We capture a third group, the q parameter, up to and excluding the next &.
We replace the whole thing with the first capture group followed by the third capture group (\1\3).
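To see what each group ends up holding, here is a rough Python translation of the substitution. This is only a sketch: the name clean is made up, and unlike the Vim original the dots in the host name are escaped.

```python
import re

# Group 1: the URL up to and including "?"
# Group 2: everything before "q=" (captured, never used)
# Group 3: the q parameter, up to but excluding the next "&"
PATTERN = re.compile(r'^(https://www\.google\.com/search\?)(.*)(q=.*?)&.*')

def clean(url: str) -> str:
    """Keep group 1 and group 3, dropping everything else."""
    return PATTERN.sub(r'\1\3', url)

url = ("https://www.google.com/search?source=hp&newwindow=1"
       "&q=some+search+here&ei=A_23ssOllsUx")
cleaned = clean(url)
```

Two caveats apply to this translation: the q parameter must be followed by an &, and because the middle group is greedy, the match settles on the last q= that still has an & after it. A URL ending in &oq=some+se&gs=1 (a made-up tail) would therefore keep q=some+se rather than the real q value.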

How to match only one group - Regex

I am using Regex in the program Octoparse and I need to match only this #67 in MIDI Controller or this: #30 in DJ Mixer. I don't need the #, but I don't mind it. Since not every time it is going to end with Controller or Mixer I can't use the words as an end.
Can I somehow group them and then choose which group to match? I know only basic Regex so it's a little bit hard for me. I saw I can use <\1> but it doesn't work.
Here is what the program looks like:
As you can see I can't remove the global flag.
As far as I understand your task, it doesn't matter to you whether you get the first occurrence or the last, so you can do this:
(?!(.|\s)*?#\d+(.|\s)*?)(?<=#)(.+?)[\w\s].+
https://regex101.com/r/7OSJ7p/1

Regex np++ Change different instances of "this" to different word

I have several text files that contain something like the following on many different lines:
this_is_THIS.doc
What I need to accomplish is to replace THIS with different objects for the first 5 occurrences and disregard the rest.
I would like for it to appear like the following:
this_is_TREE.doc
this_is_CAR.doc
this_is_CAT.doc
this_is_DONKEY.doc
this_is_ROCK.doc
I will have to do this many times in the future with the words changing so I feel a regex that I can alter in the future would help me a lot. I have searched but found nothing useful. Thanks for any help, you folks are great here.
As long as you want to replace just 5 instances of THIS, I think the following solution is manageable. For this particular case, you can replace:
^(.+?)THIS(.+?)THIS(.+?)THIS(.+?)THIS(.+?)THIS
With
$1TREE$2CAR$3CAT$4DONKEY$5ROCK
Change the replacement words (TREE, CAR, and so on) to suit your requirements.
Don't forget to check the ". matches newline" and "Match case" settings in the Replace dialog.
Note: I wouldn't recommend this method if you need to replace, say, 100 instances of THIS. The regex would be far too long in that case.
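If the list of words grows, a small script scales better than a longer regex. The following is a minimal sketch in Python rather than Notepad++; the function name replace_first is made up for this example, and it assumes the whole file fits in memory as one string.

```python
import re

def replace_first(text: str, target: str, replacements: list[str]) -> str:
    """Replace the first len(replacements) occurrences of target, in order."""
    words = iter(replacements)
    # count limits the number of substitutions, so next() is never
    # called more often than there are replacement words.
    return re.sub(re.escape(target), lambda match: next(words), text,
                  count=len(replacements))

sample = "this_is_THIS.doc\n" * 7
result = replace_first(sample, "THIS", ["TREE", "CAR", "CAT", "DONKEY", "ROCK"])
```

The match is case-sensitive by default, so the lowercase "this" at the start of each line is left alone, and any occurrences after the fifth are disregarded.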

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairs from a particular tag:
<aaa>key0:val0 key1:val1 key2:val2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s] which will also match correctly in groups on this:
key0:val0 key1:val1 key2:val2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with, (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>), there are two things being matched. The overall match covers everything inside the tags, while the parenthesized group matches a single "key:value" occurrence. The regex engine cannot give you each individual occurrence of a repeated group, because it does not work that way; it just gives you the last one.
If you think in Python, the match object obtained after applying your regex gives you matcher.group(0) (the overall match) and matcher.group(1) (the last repetition of the ( ) group).
But what you want is the n occurrences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
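A minimal sketch of that two-step approach in Python, in case it helps. The function name pairs_in_tag is made up for this example, and it assumes the tag has no attributes and is not nested inside itself.

```python
import re

def pairs_in_tag(text: str, tag: str = "aaa") -> list[str]:
    """Find each tag body first, then run the simple regex on it."""
    name = re.escape(tag)
    pairs = []
    # Step 1: grab the text between each opening and closing tag.
    for body in re.findall(rf"<{name}>(.*?)</{name}>", text, flags=re.S):
        # Step 2: every key:value occurrence inside that body.
        pairs.extend(re.findall(r"\w+:\w+", body))
    return pairs

found = pairs_in_tag("<aaa>key0:val0 key1:val1 key2:val2</aaa>")
```

Because findall returns every non-overlapping match, there is no need to know in advance how many pairs the tag contains.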
I uploaded this one at parsemarket, and I'm not sure it's what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.
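As a rough illustration of the parser route, here is a sketch using xml.etree.ElementTree from Python's standard library (not BeautifulSoup). The function name pairs_from_xml and the wrapping <doc> element are made up for this example, and it assumes the input is well-formed XML.

```python
import xml.etree.ElementTree as ET

def pairs_from_xml(xml_text: str, tag: str = "aaa") -> list[tuple[str, str]]:
    """Return (key, value) tuples from the text of every <tag> element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for element in root.iter(tag):
        # Split the element text on whitespace, then each token on ":".
        for token in (element.text or "").split():
            key, separator, value = token.partition(":")
            if separator:
                pairs.append((key, value))
    return pairs

parsed = pairs_from_xml("<doc><aaa>key0:val0 key1:val1 key2:val2</aaa></doc>")
```

The loop over tokens is the "some sort of loop" mentioned above; the parser takes care of finding the tag, so no regex is needed at all.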

If duplicate within brackets, delete one of the lines

Hi, I have a long list of items (~6k) that comes in this format:
'Entry': ['Entry'],
What I want to do is this: if the words within the first bracket match, i.e.:
'ACT': ['KOSOV'],
'ACT': ['STIG'],
I want to leave only one of the entries. It doesn't matter which one (the first, the second, or whatever); I just need one of them to remain.
If possible, I would like to do this in Sublime or Notepad++ using a regexp; if there is no way, then do whatever you think is best to solve this.
UPD: The AWK command did the job indeed, thank you
You can't solve this using just regular expressions. You either need to remember all entries you've seen so far while scanning the text (would require writing a small utility program, probably), or you could sort the entries and then remove any repeated entries.
If you have a sorted file, then you can solve it using a regular expression, such as this one:
^(([^:]+):.+\n)(?:\2.+\n)+
Replace with \1.
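For the first option, remembering all the entries seen so far, a small utility could look something like this Python sketch. The function name dedupe_by_key is made up for this example, and it assumes the key is everything before the first colon on each line.

```python
def dedupe_by_key(lines: list[str]) -> list[str]:
    """Keep only the first line seen for each key, preserving order."""
    seen = set()
    kept = []
    for line in lines:
        # The key is the part before the first ":", e.g. 'ACT'.
        key = line.split(":", 1)[0].strip()
        if key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return kept

entries = [
    "'ACT': ['KOSOV'],",
    "'ACT': ['STIG'],",
    "'Entry': ['Entry'],",
]
unique = dedupe_by_key(entries)
```

Unlike the regex, this does not need the file to be sorted first, and it keeps the original order of the remaining lines.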