Regex to remove first two characters of a string if they are letters - regex

My client uses SKUs in which the two-letter prefix changes to represent changes/updates in models. As an analyst, I need to make a unique list of SKUs to use in my Data Studio dashboard. A sample of the SKUs would look like:
NP9151BM01
NL9151BM01
NL6004SL01
NN6004SL01
NP1927YM05
NN1927YM05
NQ1296BM01
NG1296BM01
NQ1044YL04
NN1044YL04
NP9151YM05
9151YM05
1044YL04
I need to use regex to check whether the first two characters are letters and remove them if they are. For example, if I have NP9151BM01 and NL9151BM01 as SKUs, I need to remove NP and NL from them to end up with the exact same SKU. However, if I have 9151YM05 or 1044YL04 as SKUs, I need to keep them as they are.
For my solution, I have searched Google and Stack Overflow and found this regex (?<=^..).*$ which will remove the first two characters in all SKUs, but I'm not sure how to customise it to only remove the first two characters if they are letters.
I would appreciate any help that I can get with this!

To remove the first two characters when both are uppercase letters:
=REGEXREPLACE(A2,"^[A-Z]{2}","")
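For anyone preparing the list outside of Sheets, the same pattern can be sketched in Python. This is only an illustration: the function name strip_letter_prefix and the short SKU list are made up for the example, and it assumes the prefix is always two uppercase ASCII letters.

```python
import re

def strip_letter_prefix(sku):
    """Drop the first two characters only when both are uppercase letters."""
    return re.sub(r"^[A-Z]{2}", "", sku)

# A few SKUs taken from the question's sample.
skus = ["NP9151BM01", "NL9151BM01", "NP9151YM05", "9151YM05", "1044YL04"]

# Normalise every SKU, then deduplicate with a set and sort for stable output.
unique_skus = sorted({strip_letter_prefix(s) for s in skus})
print(unique_skus)
```

SKUs that already start with digits pass through unchanged, because the anchored pattern simply does not match them.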

Related

Regex to select two spaces between words (or two spaces before a letter)

I'm cleaning up some ancient HTML help files that contain a lot of double spaces, and I'd like to replace each double space with a single one.
Sample: <li><b>% Successful</b>. The percentage of jobs that returned a confirmation.</li>
I want to find double spaces only before the start of sentences or between words (so after the setting label or between 'percentage' and 'of', not spaces in isolation or before the XML tags).
I tried a simple search for two spaces, but that also brings up the tab/space mixture that the creator used for formatting the indents, so I'm getting five useless results for every relevant one.
Is there a single regex that would help with both use cases, or is it better to use two different ones for each format? I'm fine either way, just am still pretty new to regexes and not sure where to start on this one.
Spaces between periods and text (three here for a better visual):
<li><b>Submit Time</b>. The time the job was scheduled.</li>
Spaces between words (three here as well):
<li><b>End Time</b>. The date/time when the job was completed or canceled.</li>
I'm looking for multiple spaces either between words or at the start of sentences (generally after the setting name period).
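One possible way to cover both cases with a single pattern, sketched in Python for illustration: match two or more spaces only when they sit directly between a word character or sentence punctuation on the left and a word character on the right. The character class on the left and the helper name collapse_inner_spaces are assumptions made for this example, not something taken from the help files.

```python
import re

# Two or more spaces, preceded by a word character or punctuation and
# followed by a word character. Indentation (preceded by a tab or the line
# start) and spaces right before a tag (followed by "<") are left alone.
INNER_SPACES = re.compile(r"(?<=[\w.,;:]) {2,}(?=\w)")

def collapse_inner_spaces(line):
    return INNER_SPACES.sub(" ", line)

sample = "\t  <li><b>Submit Time</b>.   The time the  job was scheduled.</li>"
print(collapse_inner_spaces(sample))
```

The lookbehind and lookahead are zero-width, so only the spaces themselves are replaced.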

Replace trailing ".1" to ".2"

I am assuming you would need a regex for this. The best I could come up with is
=REGEXREPLACE(C2, "\.(?=[^.]*$)", ".2")
but it only matches the last period, and Google Sheets returns #REF!
Other approaches, such as directly changing the cells C2:C5, are also welcome.
You can just check whether the last two characters are equal to .1
get two chars from the right
test equality
RIGHT(A1,2)=".1"
Then, to convert matching values, you can slice off the last two chars (length-2) and append the .2
LEFT(A1,LEN(A1)-2)&".2"
All together
=IF(RIGHT(A1,2)=".1",LEFT(A1,LEN(A1)-2)&".2",A1)
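The same check-then-slice logic can be sketched in Python, alongside an anchored regex that does the replacement in one step. Both function names are made up for the illustration.

```python
import re

def bump_trailing_one(value):
    """Mirror of the IF/RIGHT/LEFT formula: swap a trailing '.1' for '.2'."""
    if value.endswith(".1"):
        return value[:-2] + ".2"
    return value

def bump_trailing_one_regex(value):
    """Regex variant: '$' anchors the match to the end of the string."""
    return re.sub(r"\.1$", ".2", value)

for v in ["1.0.1", "3.1", "3.11", "3.2"]:
    print(v, "->", bump_trailing_one(v), bump_trailing_one_regex(v))
```

Values such as 3.11 are left alone by both versions, since their last two characters are not ".1".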
If you actually want to increment arbitrary values (and not just .1), you can skip the equality check and add 0.1 intermediately
=LEFT(C3,LEN(C3)-2)&((RIGHT(C3,2)+0.1)&"")
If you have values with more than a single trailing digit, extract them into an intermediate column so you can use their length to
add the right power of ten (.5+0.1, .993+0.001, etc.)
exclude the right number of chars when appending
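As a rough sketch of the multi-digit case in Python: split at the last period, treat what follows as an integer counter, add one, and pad back to the original width. The function name increment_last_component is hypothetical, and treating the tail as an integer (so that 9 becomes 10 rather than rolling over) is an assumption of this sketch, not something the formulas above do.

```python
def increment_last_component(value):
    """Add one to the digits after the last period, keeping leading zeros."""
    head, sep, tail = value.rpartition(".")
    if not sep or not tail.isdecimal():
        return value  # no period, or a non-numeric tail: leave unchanged
    bumped = str(int(tail) + 1).zfill(len(tail))
    return head + "." + bumped

for v in ["1.0.1", "x.5", "x.993", "x.009", "abc"]:
    print(v, "->", increment_last_component(v))
```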
If you want a full version parser, consider VBA or passing the column to a more practical language

Regular Expression help to find letter in XML number field

I am getting an error importing an XML file into a custom program. Other files import correctly. However, one file produces an error from a float field. I am using Notepad++ search function with Regular Expression to try and find the issue in the XML file.
When I use <milepost>([a-zA-Z0-9.]+)</milepost> I get around 30,000 results which is the correct number of records but the field is supposed to be DOUBLE. When I use <milepost>([0-9.]+)</milepost> I only get 29,994 records. This tells me that the import is most likely failing because there are letters in my number fields.
I have tried a number of variations like:
<milepost>([\S\D\d]+)</milepost>
<milepost>(.*?)</milepost>
<milepost>([\Sa-zA-Z]+)</milepost>
<milepost>([0-9.\w]+)</milepost>
etc.
Each of these returns the expected 30,000 records.
When I try to search for letters using:
<milepost>([a-zA-Z.]*)</milepost>
<milepost>([a-zA-Z]+)</milepost>
<milepost>(^[a-zA-Z]+$)</milepost>
<milepost>([a-zA-Z.a-zA-Z]+)</milepost>
I get 0 results (most likely because these patterns exclude numbers).
I did manage to find one of the records I am looking for using this method:
<milepost>173.811818181818a</milepost>
But I do not feel like scrolling through 30,000+ lines to look for 5 more records with a letter in them.
Is there a regular expression that will return to me ONLY the values that have a letter/letters in them while allowing numbers? (Fields with only numbers and a period should be excluded)
The 6 problem records presumably contain a mixture of letters and numbers, but your searches for records containing letters will only match records consisting exclusively of letters.
Try
<milepost>.*[a-zA-Z].*</milepost>
which matches any record containing an ASCII letter in its value, as well as allowing other characters such as digits.
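If the file can also be checked with a script, the idea can be sketched in Python. The XML snippet below is invented sample data, and the pattern uses [^<]* instead of .* so that a match cannot run past the closing tag when several elements share a line; that tightening is an assumption of this sketch.

```python
import re

# Hypothetical sample data; only the second and fourth values contain letters.
xml = """<milepost>12.5</milepost>
<milepost>173.811818181818a</milepost>
<milepost>0.75</milepost>
<milepost>N/A</milepost>"""

# At least one ASCII letter somewhere inside the element's text.
HAS_LETTER = re.compile(r"<milepost>([^<]*[a-zA-Z][^<]*)</milepost>")

bad_values = HAS_LETTER.findall(xml)
print(bad_values)
```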
What you want is a negative look-ahead. Something like
<milepost>(?![0-9.]+</milepost>)
should be very close.
In plain English: <milepost> not followed by exclusively digits and dots and then a closing </milepost>.
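A small Python sketch of the negative look-ahead approach, using invented sample data. The capture group and closing tag are added here only so the offending value can be printed; they are not part of the pattern above.

```python
import re

# Hypothetical sample data with three values that are not plain numbers.
xml = """<milepost>12.5</milepost>
<milepost>173.811818181818a</milepost>
<milepost></milepost>
<milepost>0.75</milepost>
<milepost>1,5</milepost>"""

# Reject elements whose whole content is digits and dots; capture the rest.
NOT_NUMERIC = re.compile(r"<milepost>(?![0-9.]+</milepost>)([^<]*)</milepost>")

suspect_values = NOT_NUMERIC.findall(xml)
print(suspect_values)
```

Unlike a search for letters, this also flags empty values and values with other stray characters such as commas, which may or may not be what the import is choking on.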

Excel- Extract Number from Cell

I have multiple cells that I am attempting to extract a number from, and need help finding a regex alternative.
The cells range in the following formats:
asdfs. Seat#29 asfddsa
asdfsa. Seat#5d
asdfasN/A . Seat#22 as789fsd
Seat#111 words33
The closest that I came to a solution is:
=IFERROR(TRIM(MID([#DisplayName],FIND("#",[#DisplayName])+1,3)),"")
As you can see this will extract most of the numbers but for some it leaves a character at the end.
The only commonality is the # preceding the seat number. I am trying to extract only the seat number, no other numbers.
I cannot use VBA, this must be done using formulas. I have figured this out once before but stupidly pasted over the formulas with a values only paste.
This can be done utilizing a flash fill, but I was hoping for a more stable formula.
If you want just the numbers then use:
=--MID(A1,FIND("#",A1)+1,AGGREGATE(15,6,ROW(1:5)/(ISERROR(--MID(REPLACE(A1,1,FIND("#",A1),""),ROW(1:5),1))),1)-1)
If you want the letter also then:
=MID(A1,FIND("#",A1)+1,FIND(" ",REPLACE(A1,1,FIND("#",A1),""))-1)
If you do not need the letter following the seat number, you can use
.*#(\d+)
Edit for clarity: Excel does not have regex functions built in. You will either have to use a UDF (I can help with that if you'd like) or use a non-regex solution.
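For comparison, in an environment that does have regex support, the extraction is short. The Python sketch below runs against the sample cells from the question; the function name seat_number is made up, and the pattern is shortened to #(\d+) because search() does not need the leading .* to find the first "#".

```python
import re

SEAT = re.compile(r"#(\d+)")

def seat_number(text):
    """Return the digits right after the first '#', or None if there are none."""
    match = SEAT.search(text)
    return int(match.group(1)) if match else None

cells = [
    "asdfs. Seat#29 asfddsa",
    "asdfsa. Seat#5d",
    "asdfasN/A . Seat#22 as789fsd",
    "Seat#111 words33",
]
print([seat_number(c) for c in cells])
```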
Here is a solution without VBA to extract all numbers inside the strings.
https://drive.google.com/open?id=1Fk6VFznD3i8s6scADy_vXCEj-1zQpBPW
Sheet #3

Separating out a list with regex?

I have a CSV file which has been generated by a system. The problem is with one of the fields which used to be a list of items. An example of the original list is below....
The serial number of the desk is 45TYTU
This is the second item in the list
The colour of the apple is green
The ID code is 489RUI
This is the fourth item in the list.
And unfortunately the system spits out the code below.....
The serial number of the desk is 45TYTUThis is the second item in the listThe colour of the apple is greenThe ID code is 489RUIThis is the fourth item in the list.
As you can see, it ignores the line breaks and just bunches everything up. I am unable to modify the system that generates this output so what I am trying to do is come up with some sort of regex find and replace expression that will separate them out.
My original thought was to try to detect when an upper case letter is in the middle of a lower case word, but as in one of the items in the example, when a serial number is used it throws this out.
Does anyone have any suggestions? Is regex the way to go?
--- EDIT ---
I think I need to simplify things for myself and, for now, ignore the fact that lines that end in a serial number will break things. I just need to create an expression that will insert a line break if it detects that an upper case letter is being used after a lower case one.
--- EDIT 2 ---
Using the example given by fardjad, everything works for the sample data given. The string was...
(.(?=[A-Z][a-z]))
Now as I test with more data I can see an issue appearing, certain lines begin with numbers so it is seeing these as serial numbers, you can see an example of this at http://regexr.com?2vfi5
There are only about 10 known numbers it uses at the start of the lines such as 240v, 120v etc...
Is there a way to exclude these?
This won't be a robust solution, but it does what you asked for. It matches the character before an uppercase letter followed by a lowercase one. You can simply use regex replace and append a new line character:
(.(?=[A-Z][a-z]))
see this demo.
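As an illustration of the replace step, here is a minimal Python sketch run on the sample string from the question. The helper name split_items is made up; the replacement \1\n puts the captured character back and adds the line break after it.

```python
import re

squashed = (
    "The serial number of the desk is 45TYTU"
    "This is the second item in the list"
    "The colour of the apple is green"
    "The ID code is 489RUI"
    "This is the fourth item in the list."
)

def split_items(text):
    # Keep the matched character (\1) and append a newline after it.
    return re.sub(r"(.(?=[A-Z][a-z]))", r"\1\n", text).split("\n")

items = split_items(squashed)
for item in items:
    print(item)
```

Because the pattern only looks at what follows, it also breaks after a space that precedes any capitalised word, which is one reason it is not robust.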
You could search for this
(?<=\p{Ll})(?=\p{Lu})
and replace with a linebreak. The regex matches the empty space between a lowercase letter \p{Ll} and an uppercase letter \p{Lu}.
This assumes you're using a Unicode-aware regex engine (.NET, PCRE, Perl for example). If not, you might also get away with
(?<=[a-z])(?=[A-Z])
but this of course only detects lower-/uppercase changes in ASCII words.
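A short Python sketch of the ASCII variant, for illustration only. Python's built-in re module does not support \p{Ll} and \p{Lu} (the third-party regex package does), so the sketch uses the [a-z]/[A-Z] form; the function name split_on_case_change is made up.

```python
import re

# Zero-width position between a lowercase and an uppercase ASCII letter.
BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")

def split_on_case_change(text):
    return BOUNDARY.sub("\n", text).split("\n")

parts = split_on_case_change(
    "in the listThe colour is greenThe ID code is 489RUIThis is"
)
print(parts)
```

Note that "489RUIThis" stays in one piece: the serial number ends in an uppercase letter, so there is no lowercase-to-uppercase boundary there, which is the limitation raised in the question.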