Filtering 4 capital letters - regex

I have a couple of cells that all vary in their content, but one thing that is the same across all is that they all contain a string of 4 letters which are all capital.
Is there some way for me to show only those 4 capital letters through a formula?
I'm fairly flexible in the solution here, it could either be done in the cell itself or in another cell that references the cell in question.

try:
=INDEX(IFNA(REGEXEXTRACT(A1:A, "[A-Z]{4}")))
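For anyone doing the same thing outside of Sheets, here is a minimal Python sketch of the same idea. The function name `extract_code` and the sample strings are hypothetical. Note that, like the formula, a plain `[A-Z]{4}` also matches the first four letters of a longer run of capitals; the stricter pattern below (an assumption about what you want) only accepts runs of exactly four.

```python
import re

# Same pattern as the formula: the first run of 4 capital letters.
LOOSE = re.compile(r"[A-Z]{4}")
# Stricter variant (assumption): exactly 4 capitals, not part of a longer run.
STRICT = re.compile(r"(?<![A-Z])[A-Z]{4}(?![A-Z])")

def extract_code(cell, pattern=LOOSE):
    """Return the first match of the pattern, or "" when there is none."""
    match = pattern.search(cell)
    return match.group(0) if match else ""

print(extract_code("order 17 / ABCD / shipped"))    # ABCD
print(extract_code("no code here"))                 # empty string
print(extract_code("ABCDEF then WXYZ", STRICT))     # WXYZ
```

Returning an empty string for non-matching cells mirrors what IFNA does in the formula.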


Regex to remove first two characters of a string if they are alphabets

My client uses SKUs in which the first two characters are a prefix that changes to represent changes/updates in models. As an analyst, I need to make a unique list of SKUs to use in my Data Studio dashboard. A sample of the SKUs would look like:
NP9151BM01
NL9151BM01
NL6004SL01
NN6004SL01
NP1927YM05
NN1927YM05
NQ1296BM01
NG1296BM01
NQ1044YL04
NN1044YL04
NP9151YM05
9151YM05
1044YL04
I need to use regex to check if the first two characters are letters and remove them if they are. For example, if I have NP9151BM01 and NL9151BM01 as SKUs, I need to remove NP and NL from them to end up with the exact same SKU. However, if I have 9151YM05 or 1044YL04 as SKUs, I need to keep them as they are.
For my solution, I have researched on Google and Stack Overflow and found this regex, (?<=^..).*$, which will remove the first two characters in all SKUs, but I'm not sure how to customise it to only remove the first two characters if they are letters.
I would appreciate any help that I can get with this!
To remove the first two letters:
=REGEXREPLACE(A2,"^[A-Z]{2}","")
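If the list ever needs to be built outside the spreadsheet, the same anchored pattern can be sketched in Python. The helper name `strip_prefix` is hypothetical, and the SKUs are the ones from the question; this is an illustration, not a Data Studio feature.

```python
import re

def strip_prefix(sku):
    """Drop the first two characters only when both are capital letters."""
    return re.sub(r"^[A-Z]{2}", "", sku)

skus = [
    "NP9151BM01", "NL9151BM01", "NL6004SL01", "NN6004SL01",
    "NP1927YM05", "NN1927YM05", "NQ1296BM01", "NG1296BM01",
    "NQ1044YL04", "NN1044YL04", "NP9151YM05", "9151YM05", "1044YL04",
]

# A set removes the duplicates that appear once the prefixes are gone.
unique_skus = sorted({strip_prefix(sku) for sku in skus})
print(unique_skus)
```

The `^` anchor is what keeps SKUs such as 9151YM05 untouched: the pattern can only match at the very start, and there the characters are digits.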

Regex select words longer than 4 characters but only one instance if duplicates

I am trying to format text in InDesign using a GREP Style.
The goal is to select words longer than 4 letters in a paragraph, but if a word is duplicated in the paragraph, no more than the first instance of that word should be selected.
This is sample text:
"The Lord's right hand is lifted high; the Lord's right hand has done mighty things!"
The solution should give
Lord right hand lifted high done mighty things
I have done the first part:
[[:word:]]{4,}
but don't have a clue how to deal with those duplicates.
Is order a requirement? If not, how about words longer than 4 characters not followed by that same word later in the text? See:
([[:word:]]{4,})(?!.*\1)
https://regex101.com/r/Ug4dLZ/1
Result: lifted high Lord right hand done mighty things
To be more comprehensive, include word breaks (i.e. count "Person" and "Personhood" as 2 separate words):
\b([[:word:]]{4,})\b(?!.*\b\1\b)
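To see how the lookahead behaves outside InDesign, here is a rough Python sketch. Python's `re` has no `[[:word:]]` class, so `\w` is used as a stand-in (an assumption), and the whole-word anchors keep a match from shrinking to a fragment of a word. The `first_occurrences` helper is hypothetical and shows the order the question actually asked for, which a single regex match cannot easily give.

```python
import re

text = ("The Lord's right hand is lifted high; "
        "the Lord's right hand has done mighty things!")

# Negative lookahead: keep a word only if the same whole word
# does not appear again later, i.e. keep the LAST occurrence.
last_occurrences = re.findall(r"\b(\w{4,})\b(?!.*\b\1\b)", text)
print(" ".join(last_occurrences))

def first_occurrences(s):
    """Keep the FIRST occurrence of each word of 4+ characters, in reading order."""
    seen = set()
    result = []
    for word in re.findall(r"\w{4,}", s):
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result

print(" ".join(first_occurrences(text)))
```

The first line printed follows the regex approach (last occurrences); the second matches the order given in the question.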

Permutation/Combination with license plates

A California license plate has the format #LLL###, where L is a letter. I know the total number of combinations is 10^4 * 26^3. What if I exclude a certain word, such as "FSS", so that no license plate contains the word "FSS"?
How do I go about this? I can still use the letters, but the three can't appear together. It's throwing me for a loop. Do I use permutations to exclude the word? Any help is appreciated.
EDIT- the # = digits. So from 0-9, there are ten possibilities, sorry didn't clarify
There are only so many ways you can have FSS in a string of seven characters.
FSS####
#FSS###
##FSS##
###FSS#
####FSS
So there are five positions where the string FSS can appear. If there is no constraint on the four digits, you have 10,000 different license plates (0000 through 9999) for each position of "FSS".
You would want to subtract 10,000 * 5 from your total to get the plates allowed.
Edit:
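So you want all permutations of 0-9 in the first, fifth, sixth, and seventh positions, and all permutations of A-Z in the second, third, and fourth positions, except for the one combination with F in the second, S in the third, and S in the fourth, right? If so, only one of the 26^3 letter combinations is banned, so it would be 10^4 * (26^3 - 1) = 175,750,000. Note that 10*25*25*25*10*10*10 would undercount, because it also throws out every plate that merely has an F in the second position or an S in the third or fourth. Did I get your problem right?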
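As a sanity check on the counting, here is a small brute-force sketch in Python. It assumes the fixed #LLL### layout and that only the exact letter sequence "FSS" is banned, so exactly one of the 26^3 letter triples is removed; the variable names are made up for illustration.

```python
from itertools import product
from string import ascii_uppercase

# Enumerate every letter triple and drop the single banned one.
allowed_triples = sum(
    1 for triple in product(ascii_uppercase, repeat=3)
    if "".join(triple) != "FSS"
)

digit_combos = 10 ** 4            # one digit before the letters, three after
total_allowed = digit_combos * allowed_triples

print(allowed_triples)   # 17575, i.e. 26**3 - 1
print(total_allowed)     # 175750000
```

Only the letters are enumerated, since the four digits are independent of the ban and simply multiply the count.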

Word lexical families

I am given a set of N words, and an integer K. 2 words are in the same group if they have exactly the first k letters and the last k letters identical. If they have more than k letters identical or less than k letters identical then the words are not in the same group. For example:
For k=3.
"abcdefg" and "abczefg" are in the same group
"abcddefg" and "abcdzefg" are not in the same group (the first k+1 letters are identical)
"abc" and "abc" are in the same group
A word can be in more than 1 groups. For example (k=3):
"abczefg" and "abcefg" form a group
"abczaefg" and "abcefg" form a group
"abczaefg" and "abczefg" are not in the same group (the first k+1 letters are identical)
The problem asks me to find the number of groups which contain the maximum number of words.
I thought about using a Trie (or Prefix Tree), and I assume this is the right data structure for this problem, but I don't know how to adapt it, because the rule that 2 words with more than k identical letters are not in the same group confuses me. My idea has complexity O(N*N*K), and considering that N<=10,000 and K<=100, I don't think it is fast enough. I would like to explain my idea, but it is not yet clear even to me and I don't know if it is correct, so I will skip this part.
My question is whether there is a way to solve this problem using a faster algorithm, and if there is such an algorithm, I kindly ask you to explain it a little. Thank you in advance, and I am sorry for the grammatical mistakes and if I didn't explain the problem clearly!
First group all the words that share the first k letters and last k letters. Your largest group must sit inside one of these groups, since there's no way two words that differ at their starts and ends can be in the same solution.
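A minimal sketch of this grouping step in Python, assuming the words arrive as a plain list; the function name `group_by_ends` is hypothetical, and words shorter than k are simply skipped here (how to treat them is an assumption).

```python
from collections import defaultdict

def group_by_ends(words, k):
    """Bucket words by their (first k letters, last k letters) pair."""
    groups = defaultdict(list)
    for word in words:
        if len(word) >= k:  # assumption: shorter words cannot belong to any group
            groups[(word[:k], word[-k:])].append(word)
    return dict(groups)

buckets = group_by_ends(["abcdefg", "abczefg", "abcefg", "xyzqefg"], 3)
for ends, members in buckets.items():
    print(ends, members)
```

This takes O(N*K) time for the slicing and hashing, so the expensive part of the problem is whatever happens inside each bucket afterwards.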
So, within each of these groups (of words that share the same k letters at their start and end), you need to find a maximal set of words such that no two share the k+1'th letter, nor the k+1'th letter from the end.
Construct a graph where vertices are the pairs of letters that are (k+1) from each end (de-duping) from words in one of these groups, and edges occur between (a, b) and (c, d) if a=c or b=d.
You need to find the largest set of vertices in this graph with no edges between them. This reduced problem is an instance of the "maximum independent set" problem, which is NP-hard in general, so you'll need to solve it by using a search and hoping the set of words you're given isn't too nasty. Perhaps there's something about the graphs here to give a faster solution, but I don't see it.
The solution to the entire problem is the largest solution to one of the reduced problems described above.
Hope this helps!

Formatting UK postal codes for storage

I want to store UK postal codes in the database. Is it OK to store those postal codes without the spaces?
It is possible to store postcodes without spaces, but I would definitely recommend formatting them correctly when they are displayed/output.
You can look up the allowed formats for postcodes. There are always 3 characters after the space, so it's easy to reinsert it.
Last 3 are always xyy
x Digit 0-9
yy Alpha A-Z
Anything before is the first part of the grid reference and has various formats.
We store postcodes and accept inputs in any format, with or without a space, but then strip or correct the entry for data storage.
We find it works better this way when using the data for other things.
Why would you want to store with no spaces?
UK postcodes have a variety of formats:
list of formats
Why are you unable to store white spaces?
As others have said, there is no problem with removing all spaces and storing them, if that is what you want to do. As has been said, you can always format them with a space before the last three characters.
However, I would normally take them in any reasonable format, strip all spaces out, and then store them with this one extra space. The storage requirements are not an issue, and it makes it easier to simply display the value as it is. You would need to resolve the format before saving in some way, so you may as well save it in the form in which it is needed.
It's usually safe to remove the space. As others have said, you can re-insert the space later if required. The existence of a space between Outcode and Incode will not normally affect postal delivery. You should not have any non-alphanumeric characters in a UK postcode, so if you see a dash you can safely remove it.
I work for Experian Data Quality and if your aim is clean data you may want to consider an address verification web service, like our Pro On Demand product. This will ensure you capture the correct postcode, as they can change over time, and that it is formatted correctly for your database.
It is okay to store without a space because you can always add an empty space back in to each postcode string - the heuristic is pretty simple.
As some other users have very helpfully explained, all UK postcodes have two groups of numbers and letters, separated by a space. The group following the space always contains a number and then two letters (thus, there are always three characters after the space). The group before the space will have either two, three, or four characters (see the Wikipedia page on UK postcodes).
So, you can recreate the correct spacing by adding a space before the third-to-last character.
In R, it looks like this (but the same logic would work in other languages, such as Python):
# stringr provides str_sub()
library(stringr)
# list of example postcodes
postcodes = c("LS176JA", "OX41EZ", "A99AA")
# add a space before the last three characters of each postcode
for (postcode in postcodes) {
  last_three = str_sub(postcode, start = -3)  # the part after the space
  first_x = str_sub(postcode, end = -4)       # everything before it
  final_postcode = paste0(first_x, " ", last_three)
  print(final_postcode)
}
Which returns:
[1] "LS17 6JA"
[1] "OX4 1EZ"
[1] "A9 9AA"
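For comparison, here is a rough Python sketch of the same logic, combined with the "accept any input, strip, then store" approach mentioned in the other answers. The function name `format_postcode` is hypothetical, and it only reinserts the space; it does not validate that the input is a real postcode.

```python
def format_postcode(raw):
    """Strip spaces/dashes, uppercase, and put a space before the last three characters."""
    compact = "".join(ch for ch in raw if ch.isalnum()).upper()
    return f"{compact[:-3]} {compact[-3:]}"

for example in ["LS176JA", "ox4-1ez", " a9 9aa "]:
    print(format_postcode(example))
```

Whether you store the compact form or the spaced form, running every input through one function like this keeps the stored values consistent.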