How to extract out specific substrings? - regex

I have these long strings that have multiple substrings in them all separated by periods. The good news is I've found out how to extract most of the substrings on the left or right of the strings by using functions like left, mid, right, regexextract, find, len, and substitute, but I just can't figure out this last problem.
The problem with these substrings is sometimes some are there, sometimes none are there (most I've seen at once is, I think, 3). And other than being in all caps, which some of the other substrings that I don't want are also in, I don't think there's any regex pattern one could use except something like string1|string2|string3, etc all the way up to maybe string30.
I first thought it would be best to just have a formula look at the string, compare it to a range on another sheet, and if there was something in the range that was in the string, then show it. But I was lost on how to do that. Then I figured just put the whole range list in a regex and somehow extract any substrings that were in the string.
And that worked, but it would only extract the first substring it found whereas I wanted it to extract all the substrings it found. And while I think I'd prefer the substrings to be put into different columns (not rows) by using the Split function, I'd settle for them all being put in the same cell via the Textjoin function.
The farthest I've gotten is
=split(REGEXextract(A2,"\b(?:string1|string2|string3)\b")," ")
but like I said that only spits out the first substring it finds. And I've seen some people use REGEXreplace with Split and ArrayFormula and sometimes double REGEX functions, but I just can't seem to make those work for my purposes.
I'm doing this in Google Sheets, but even an Excel or LibreOffice answer will probably be helpful, as I can probably turn it into a Google Sheets solution. I realize I could just put a simple REGEXEXTRACT in 30 or so columns, but I'd really rather not do that. Thanks in advance, even if you just give me an idea of what direction to head in.

You could try something like this, which keeps only the pieces that match your list of desired substrings. Replace F1:F2 with the range that holds the values you want to appear, and A3 with the cell containing the full string. If needed, you can apply it to a whole column with MAP or BYROW. For example:
=filter(split(A3,"."),INDEX(REGEXMATCH(SPLIT(A3,"."),JOIN("|",F1:F2))))
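For readers porting this outside a spreadsheet, here is a minimal Python sketch of the same split-then-filter idea. The function name `keep_wanted` and the sample row are hypothetical, and unlike the REGEXMATCH formula it compares whole pieces rather than doing a partial regex match.

```python
def keep_wanted(text, wanted, sep="."):
    """Split text on sep and keep only the pieces found in wanted."""
    wanted_set = set(wanted)  # set lookup keeps the filter cheap for ~30 values
    return [piece for piece in text.split(sep) if piece in wanted_set]


# Hypothetical row; "STRING1" etc. stand in for the real substrings.
row = "abc.STRING1.misc.STRING3.xyz"
found = keep_wanted(row, ["STRING1", "STRING2", "STRING3"])
joined = ", ".join(found)  # same-cell variant, similar in spirit to TEXTJOIN
print(found)
print(joined)
```

Returning a list corresponds to spreading the results across columns, while `joined` corresponds to putting them all in one cell.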

Related

"REGEX" Match string not containing specific substring

I will give an example, I have two strings:
FL_0DS906555B_3661_27012221225012_V001_S
FL_0DS906555C_3661_27012221225012_V001_S
I want to match any string that does not contain "0DS906555B", does contain "2701222122", and where the four digits that follow (here "5012") are in the range 5003-5012.
My regex looks like this:
^.*(?!.*0DS906555B).{6}2701222122(500[3-9]|501[0-2]).*$
Unfortunately, it keeps matching everything. I have looked at many posts here, but nothing helped, since people usually asked about simpler, shorter strings.
Thank you
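The negative lookahead has to be checked at the start of the string. After a leading ^.* the engine can backtrack to a position where the lookahead succeeds, so the exclusion never takes effect. Try: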
^(?!.*0DS906555B)(?=.*_2701222122(?:500[3-9]|501[012])_).*$
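As a quick check, the pattern can be run against the two strings from the question. This is a small Python sketch; the third sample (ending in 5002) is a hypothetical string added only to exercise the range test.

```python
import re

PATTERN = re.compile(
    r"^(?!.*0DS906555B)"                           # reject if the unwanted part number appears anywhere
    r"(?=.*_2701222122(?:500[3-9]|501[012])_)"     # require the prefix plus 5003..5012
    r".*$"
)

samples = [
    "FL_0DS906555B_3661_27012221225012_V001_S",  # from the question: contains 0DS906555B
    "FL_0DS906555C_3661_27012221225012_V001_S",  # from the question: should match
    "FL_0DS906555C_3661_27012221225002_V001_S",  # hypothetical: 5002 is out of range
]

matches = [s for s in samples if PATTERN.match(s)]
print(matches)
```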

Find and replace with regular expression in Notepad++

At the moment, I have a PHP function that gets the contents of a CSV file and puts it into a multi-dimensional array, which contains text that I print out in various places, using the indexes.
An example of use would be:
$localText[index][pageText][conceptQualityText][$lang];
The first index, [index], would be the name of the page. The second index [pageText] would indicate what it is (text for the page). The third index, [conceptQualityText] indicates what the actual text is. The last index, [$lang] gets the text in the desired language.
so:
->page location
->what is it
->the content
->what language it should be displayed in.
This all worked fine in previous PHP versions. However, after upgrading to 7.2, PHP seems to be a bit stricter. I was a bit more green ~2 years ago when I first made this solution, and I now know that since these indexes aren't written as strings, e.g. enclosed in single quotes like ['index'], they use the notation of a constant (as created with define()). I didn't give it much thought back then, but PHP now interprets them as constants, and so I get a warning that the word is an undefined constant.
My initial thought is to make a search and replace on my example string:
$localText[index][pageText][conceptQualityText][$lang];
using the regular expression functionality in Notepad++.
However, the example is just one of many. The notation of the array indexing is basically:
$localText[index][index2][index3][$lang];
So my question is:
How can I use the Notepad++ search and replace, with a regular expression, so that my index pointers become strings instead of being treated as constants?
e.g. make:
$localText[index][index2][index3][$lang];
into:
$localText['index']['index2']['index3'][$lang];
I will need some sort of logic that checks for whatever is inside the brackets and encapsulates them with single quotes, except for the last index, [$lang].
I tried to give as much information as possible, let me know if anything needs to be elaborated.
I tried to refer to these docs without much luck.
I found a solution using this:
find: \b(localText\[)([a-zA-Z0-9_\-]+)(\]\[)([a-zA-Z0-9_\-]+)(\]\[)([a-zA-Z0-9_\-]+)
replace: $1'$2'$3'$4'$5'$6'
and it works like a charm. Thanks to everyone who took the time to help.
You can use the following regex to match:
\[(\w+)\]
The regex matches a word between square brackets unless it is quoted, because \w does not match a quote character.
Replace with:
['$1']
The regex will not match the last brackets because it contains a '$' sign.
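To try the substitution outside Notepad++, here is a minimal Python sketch. It assumes the pattern `\[(\w+)\]` and a hypothetical helper name `quote_bare_indexes`. Python's `re.sub` writes the backreference as `\1` where Notepad++ uses `$1`.

```python
import re

# A bare word between square brackets; \w matches neither quotes nor "$",
# so already-quoted indexes and [$lang] are left untouched.
BARE_INDEX = re.compile(r"\[(\w+)\]")


def quote_bare_indexes(php_source):
    """Wrap every bare array index in single quotes."""
    return BARE_INDEX.sub(r"['\1']", php_source)


before = "$localText[index][pageText][conceptQualityText][$lang];"
after = quote_bare_indexes(before)
print(after)
```

Note that purely numeric indexes such as `[0]` would be quoted as well; tighten the pattern if that is not wanted.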

Regex: Replace every char in the search string IF they're found in order

I am building a search functionality and I am trying to make it similar to the one in Sublime Text.
Assume "cmd" as the input string and "command" is one of the results.
To search the files, among other things, I split that input into characters and end up with the following regex: c.*?m.*?d. This part is successful in finding files like "command". However, when I use the same regex to wrap the match in HTML elements, to highlight that the searched string was found in that particular item, the result is something like this:
<span>command</span>
I understand exactly why this is happening, and I'm looking for an alternative that displays something like the following to the user:
<span>c</span>o<span>m</span><span>m</span>an<span>d</span>
Or, maybe just:
<span>c</span>o<span>m</span>man<span>d</span>
I have an idea of how to do this, which is to wrap every single character in parentheses and then replace each one with the <span>$x</span> part, but I'm not sure exactly how to do it.
Any kind of help is immensely appreciated.
Thanks,
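One way the capture-group idea described above might look, as a rough Python sketch rather than a definitive implementation: each query character gets its own group and the lazy gaps between them are captured too, so the match can be rebuilt piece by piece. The function name `highlight` is hypothetical.

```python
import re
from html import escape
from typing import Optional


def highlight(query: str, candidate: str) -> Optional[str]:
    """Wrap each matched query character in <span>, or return None if no match."""
    if not query:
        return escape(candidate)
    # "cmd" becomes (c)(.*?)(m)(.*?)(d): characters and gaps alternate as groups.
    pattern = "(.*?)".join(f"({re.escape(ch)})" for ch in query)
    m = re.search(pattern, candidate)
    if m is None:
        return None
    parts = [escape(candidate[:m.start()])]
    for i, piece in enumerate(m.groups()):
        if i % 2 == 0:                      # even positions hold query characters
            parts.append(f"<span>{escape(piece)}</span>")
        else:                               # odd positions hold the gaps between them
            parts.append(escape(piece))
    parts.append(escape(candidate[m.end():]))
    return "".join(parts)


result = highlight("cmd", "command")
print(result)
```

Because the gaps are lazy, each query character is matched at its earliest possible position, which gives the second of the two outputs shown in the question.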

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairs from a particular tag:
<aaa>key0:val0 key1:val1 key2:val2</aaa>
I'd like to get back
key0:val0
key1:val1
key2:val2
So far I have
(?<=<aaa>).*(?=<\/aaa>)
Which will match everything inside, but as one result.
I also have
[^\s][\w]*:[\w]*[^\s]
which also matches each pair correctly on this:
key0:val0 key1:val1 key2:val2
But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.
Thanks!
You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".
So in what you came up with - (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>) - there are two things you can read back: the whole match, which covers everything inside the tags, and the single capturing group, which matches one "key:value" occurrence. The regex engine cannot give you each individual occurrence, because a repeated group only keeps its most recent match. So it just gives you the last one.
If you think in Python, on the match object obtained after applying your regex, you will have access to matcher.group(0) (the whole match) and matcher.group(1) (the last repetition), because the lookarounds do not capture and there is only one capturing ( ) group in the regex.
But what you want is the n occurrences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
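The two-step approach could be sketched in Python like this. The helper name `key_value_pairs` is hypothetical, and the sketch assumes the tag appears once and contains no nested markup. The last lines illustrate the "only the last repetition survives" behaviour of a repeated group.

```python
import re


def key_value_pairs(xml_text, tag="aaa"):
    """Step 1: isolate the tag body. Step 2: find every key:value inside it."""
    name = re.escape(tag)
    inner = re.search(rf"<{name}>(.*?)</{name}>", xml_text, flags=re.DOTALL)
    if inner is None:
        return []
    return re.findall(r"\w+:\w+", inner.group(1))


sample = "<aaa>key0:val0 key1:val1 key2:val2</aaa>"
pairs = key_value_pairs(sample)
print(pairs)

# A repeated capturing group keeps only its final iteration.
repeated = re.search(r"<aaa>(\w+:\w+\s*)+</aaa>", sample)
last_only = repeated.group(1)
print(last_only)
```

For real XML documents a proper parser is usually the safer choice; the regex here only handles the simple single-tag case from the question.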
I uploaded this one at parsemarket, and I'm not sure it's what you are looking for, but maybe something like this:
(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)
AFAIK, unless you know how many k:v pairs are in the tags, you can't capture all of them in one regex. So, if there are only three, you could do something like this:
<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>
But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.

If duplicate within brackets, delete one of the lines

Hi, I have a long list of items (~6k) that come in this format:
'Entry': ['Entry'],
What I want is this: if the first field (the part before the colon) is the same on several lines, i.e.:
'ACT': ['KOSOV'],
'ACT': ['STIG'],
I want to keep only one of the entries. It doesn't matter which one (the first, the second, or any other); I just need one of them to remain.
If possible, I would like to do this in Sublime Text or Notepad++ using a regexp. If there is no way, then do whatever you think is best to solve it.
UPD: The AWK command did the job indeed, thank you
You can't solve this using just regular expressions. You either need to remember all entries you've seen so far while scanning the text (would require writing a small utility program, probably), or you could sort the entries and then remove any repeated entries.
If you have a sorted file, then you can solve it using a regular expression, such as this one:
^(([^:]+):.+\n)(?:\2.+\n)+
Replace with \1.
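The other option mentioned above, remembering every entry seen so far, needs only a few lines in a scripting language. Here is a minimal Python sketch that keeps the first line for each key. The function name `dedupe_by_key` is hypothetical, and it assumes the key is everything before the first colon.

```python
def dedupe_by_key(lines):
    """Keep only the first line seen for each key (the text before the first ':')."""
    seen = set()
    kept = []
    for line in lines:
        key, sep, _ = line.partition(":")
        if not sep:               # no colon: not an entry line, pass it through
            kept.append(line)
            continue
        key = key.strip()
        if key in seen:           # a later duplicate of an earlier key: drop it
            continue
        seen.add(key)
        kept.append(line)
    return kept


entries = [
    "'ACT': ['KOSOV'],",
    "'ACT': ['STIG'],",
    "'Entry': ['Entry'],",
]
unique = dedupe_by_key(entries)
print("\n".join(unique))
```

Unlike the sorted-file regex, this keeps the original line order and does not require the duplicates to be adjacent.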