Is there an error-proof way in google sheets to extract House numbers from address cell (street + house number) into another cell - regex

In Sheet1!AK2:AK I have addresses in the following formats:
rotenkamper weg, 323, Kirchstieg 2345, Im Schleedörn 20b
I need the street names to export into Sheet2!C3:C, i.e:
rotenkamper weg, Kirchenstieg, Im Schleedörn
The House numbers have to go into Sheet2!D3:D.
I have researched and tried for hours but couldn't find a solution that could fetch the house numbers including the letter i.e. 20b or if the number is a range 24-27.
Also, I have a lot of trouble getting it to work when the street name consists of two or more words.
Does anyone know an elegant solution for this?
Any help would be much appreciated. This will save me weeks of data entry work.

Try this in Sheet2!C3:
=ARRAYFORMULA(
{
REGEXREPLACE(REGEXREPLACE(Sheet1!AK2:AK, "\s+\S*\d\S*\b", ""), ",+", ","),
IFNA(REGEXEXTRACT(Sheet1!AK2:AK, "\S+$"))
}
)
Explanation:
REGEXREPLACE(Sheet1!AK2:AK, "\s+\S*\d\S*\b", "") removes any "word" that contains a digit. All of 323, 2345, and 20b will be gone.
REGEXREPLACE(..., ",+", ",") collapses any consecutive commas that may be left over after the first step. The result is the value for the first column.
IFNA(REGEXEXTRACT(Sheet1!AK2:AK, "\S+$")) this one just gets whatever is at the end of the address string from the last space to the end. This will be a value for the second column.
{value_for_the_first_column, value_for_the_second_column} placed in cell C3 will populate C3 with value_for_the_first_column and D3 with value_for_the_second_column.
ARRAYFORMULA will do all of the above for every row.
Regex pattern could be refined if you provide more than one example of the address.
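For readers who want to test the two patterns outside of Sheets, here is a minimal Python sketch of the same idea. The function name split_address is hypothetical, and Python's re module is only assumed to behave like Sheets' RE2 engine for these simple patterns.

```python
import re

def split_address(address: str) -> tuple[str, str]:
    """Return (street, house_number) for one address string."""
    # Drop every whitespace-separated "word" that contains a digit,
    # mirroring the inner REGEXREPLACE.
    street = re.sub(r"\s+\S*\d\S*\b", "", address)
    # Collapse runs of commas into one, mirroring the outer REGEXREPLACE.
    street = re.sub(r",+", ",", street)
    # Take whatever follows the last space, mirroring REGEXEXTRACT;
    # fall back to an empty string when nothing matches (the IFNA part).
    match = re.search(r"\S+$", address)
    house_number = match.group(0) if match else ""
    return street, house_number
```

With the sample "rotenkamper weg, 323" the street part keeps its trailing comma, so a further cleanup step may be needed for that format.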

Related

Extract multiple substrings of numbers of a specific length from string in Google Sheets

I'd need to split or extract only numbers made of 8 digits from a string in Google Sheets.
I've tried with SPLIT or REGEXREPLACE but I can't find a way to get only the numbers of that length, I only get all the numbers in the string!
For example I'm using
=SPLIT(lower(N2),"qwertyuiopasdfghjklzxcvbnm`-=[]\;' ,./!:##$%^&*()")
but I get all the numbers while I only need 8 digits numbers.
This may be a test value:
00150412632BBHBBLD 12458 32354 1312548896 ACT inv 62345471
I only need to extract "62345471" and nothing else!
Could you please help me out?
Many thanks!
Please use the following formula for a single cell.
Drag it down for more cells.
=INDEX(TRANSPOSE(QUERY(TRANSPOSE(IF(LEN(SPLIT(REGEXREPLACE(A2&" ","\D+"," ")," "))=8,
SPLIT(REGEXREPLACE(A2&" ","\D+"," ")," "),"")),"where Col1 is not null ",0)))
Functions used:
QUERY
INDEX
TRANSPOSE
IF
LEN
SPLIT
REGEXREPLACE
If you only need to do this for one cell (or you have your heart set on dragging the formula down into individual cells), use the following formula:
=REGEXEXTRACT(" "&N2&" ","\s(\d{8})\s")
However, I suspect you want to process the eight-digit number out of all cells running N2:N. If that is the case, clear whatever will be your results column (including any headers) and place the following in the top cell of that otherwise cleared results column:
=ArrayFormula({"Your Header"; IF(N2:N="",,IFERROR(REGEXEXTRACT(" "&N2:N&" ","\s(\d{8})\s")))})
Replace the header text Your Header with whatever you want your actual header text to be. The formula will show that header text and will return all results for all rows where N2:N is not null. Where no eight-digit number is found, null will be returned.
By prepending and appending a space to the N2:N raw strings before processing, spaces before and after string components can be used to determine where only eight digits exist together (as opposed to eight digits within a longer string of digits).
The only assumption here is that there are, in fact, spaces between string components. I did not assume that the eight-digit number will always be in a certain position (e.g., first, last) within the string.
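The space-padding trick can be checked quickly in Python. This is only a sketch: the function name extract_eight_digits is made up for illustration, and it assumes, as the answer does, that string components are separated by whitespace.

```python
import re
from typing import Optional

def extract_eight_digits(text: str) -> Optional[str]:
    # Pad with spaces so a number at the very start or end of the string
    # is still surrounded by whitespace on both sides.
    padded = f" {text} "
    # Exactly eight digits between two whitespace characters; longer digit
    # runs fail because the ninth digit is not whitespace.
    match = re.search(r"\s(\d{8})\s", padded)
    return match.group(1) if match else None
```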
Try this, take a look at Example sheet
=FILTER(TRANSPOSE(SPLIT(B2," ")),LEN(TRANSPOSE(SPLIT(B2," ")))=8)
Or this to get them all.
=JOIN(" ,",FILTER(TRANSPOSE(SPLIT(B2," ")),LEN(TRANSPOSE(SPLIT(B2," ")))=8))
Explanation
SPLIT the string on the delimiter " " (space), TRANSPOSE the result into a column, and FILTER that column with condition1 set to LEN(TRANSPOSE(SPLIT(B2," "))) = 8.
JOIN the resulting column with " ," to get all occurrences of numbers with a length of 8 in one cell.
Note: to get numbers with a length of N, replace the 8 in the FILTER function with a cell reference.
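The split-and-filter idea can be sketched in Python as follows. The function name tokens_of_length is hypothetical. Like the formula, it filters on length only, so a non-numeric token of the same length would also pass.

```python
def tokens_of_length(text: str, n: int = 8) -> list[str]:
    # Split on whitespace (the SPLIT step), then keep only the tokens
    # whose length equals n (the FILTER on LEN step).
    return [token for token in text.split() if len(token) == n]

sample = "00150412632BBHBBLD 12458 32354 1312548896 ACT inv 62345471"
# Equivalent of the JOIN variant: all matching tokens in one string.
joined = " ,".join(tokens_of_length(sample))
```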
Using this on a cell worked just fine for me:
(cell_with_data)=REGEXEXTRACT(A1,"[0-9]{8}$")

Regexmatch in Google Sheet to identify cells that include any string in another sheet

I have a ColumnA where each cell includes multiple values separated by commas, e.g.:
Elvis Costello, Madonna
Bob, Elvis Presley, Morgan Stanley
Frank, Morgan Stanley, Madonna Ford,
Elvis Costello, Madonna Ford
I want to identify which rows/cells include any of the exact terms listed in another sheet/column, e.g.:
Elvis Presley
Madonna
I found this simple solution using REGEXMATCH (the last answer on the page "Is there a way to REGEXMATCH from a range of cells from A1:A1000 for example?"):
Say you want to search for a match from a list of cities.
Put your list of cities in one tab.
Make them into lowercase for easier lookup since search terms are all in lowercase. You can do this by adding a new column and using the LOWER function.
Go back to your cell that has the list of search phrases.
In any blank cell out of the way (off to the side on the top row is a good place) put this formula: CITY LIST FORMULA: =TEXTJOIN("|",1,'vlookup city'!B$2:B$477) (if your tab is named 'vlookup city' and your cities are in column B of that tab)
Add a new column next to your search terms, or pick an existing one where you want to put your "match found" info.
In that new column, add this formula (if your data starts in row 4 and you put the City List formula in cell G4): =REGEXMATCH(A4,G$4)
Fill the formula all the way down your list. You can double-click the little blue square in the bottom right corner of the cell, or grab-and-drag all the way to the bottom of the list.
Ba-ding! It will search for any one of those city names, anywhere in your search phrase.
If the search phrase contains at least one matching term, it will return "True."
You can then add extra features on your formula to make it return something else. For example: =IF(REGEXMATCH(A4,G$4), "match found", "no match found")
This is a super lightweight solution that won't slow your sheet down too much and is easy to use.
https://docs.google.com/spreadsheets/d/1XAIDB98r2CGu7hL3ISirErDPNlgT6lVt-TCG0qI1uTE/edit?usp=sharing
The problem is that the REGEXMATCH solution also identifies "Elvis Costello" and "Madonna Ford", and I only want to identify cells/rows that include the exact term to match, i.e. "Elvis Presley" and "Madonna". In other words, whatever is between the commas has to be an exact match with one of the search terms, not just a partial match.
I hope it made sense:)
Thanks all!
I think I might have found the answer, still trying to double check if it's correct.
I added \b before and after. So in the example sheet from the quoted part of my question, I changed the cell:
Cell B3:
=TEXTJOIN("|",1,'vlookup city'!B$2:B$476)
and added another cell like this:
Cell B2:
=concatenate("\b(",$B$3,")\b")
Still checking if all false flags are removed.
Thanks
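A quick way to check for remaining false flags is to try both approaches on the sample rows. The Python sketch below is only an illustration, with made-up names (pattern, has_exact_term). It suggests that a word boundary alone still matches "Madonna" inside "Madonna Ford", whereas comparing whole comma-separated entries does not.

```python
import re

search_terms = ["Elvis Presley", "Madonna"]

# Equivalent of wrapping the TEXTJOIN("|", ...) result in \b( ... )\b.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, search_terms)) + r")\b")

def has_exact_term(cell: str) -> bool:
    # Split the cell on commas, trim each entry, and require that a whole
    # entry equals one of the search terms.
    entries = {part.strip() for part in cell.split(",")}
    return any(term in entries for term in search_terms)
```

In Sheets, the same intent could probably be expressed by anchoring the alternation to a comma or the start/end of the string instead of \b, but that is untested here.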

SQLite: How to split a column

I have a column containing two names, which I'd like to extract into two separate columns surname1 and surname2 (I don't need the name nor the initial letter (e.g. N.)).
The exemplary content of that column is:
AwyeEaef2012 MS101 N.Lopez-O.Lorenzi.txt
Lopez and Lorenzi are the two surnames we are looking for in this row.
What is good about my situation is that the first name comes always after the first dot (.) and ends just before the dash (-) and the second name comes just after second dot and ends just before the third dot and txt (.txt).
I know how to write a regex and use LIKE to check whether that column contains a specific surname, but not how to do the opposite: read the surnames and write them into two new columns.
Several rows from that column look like below:
WyeEaef MN2014 MS401 N.Lopez-O.Lorenzi.txt
AwyufEQ WCH2014 OS401 N.Lorenzi-O.Lopez.txt
THAFa5u WCH2014 LS107 N.Larry-O.Lolly.txt
So the pattern is as I mentioned *.Name1-[A-Z].Name2.txt
Where * is max 30 characters of capital and small letters and numbers
It could be approached in this manner. In other words, we need to split the string on dots: the first substring is discarded, the second without its last two characters (a dash and a capital letter, e.g. -O) is the first name, the third substring is the second name, and the fourth is also discarded (the file extension).
I'd like to have an output of three columns:
initialColumn, firstName, secondName
Here is the workaround I wrote as an Excel formula. I don't love it personally, but it might be useful for someone in the future.
=MID(A1;FIND(".";A1;1)+1;FIND(".";A1;FIND(".";A1;1)+1)-FIND(".";A1;1)-3)
I was surprised that Excel can process about 0.5 million records in the blink of an eye.
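Since core SQLite has no built-in regex capture function, one possible route is to register a small Python function with the sqlite3 module and call it from SQL. The sketch below is only an illustration: the table name files and the function name surname are hypothetical, and the pattern assumes every value ends in Initial.Name1-Initial.Name2.txt as described in the question.

```python
import re
import sqlite3

# ".Name1-X.Name2.txt" at the end of the value; X is a single capital letter.
SURNAMES = re.compile(r"\.([^.-]+)-[A-Z]\.([^.]+)\.txt$")

def surname(value, index):
    """Return the first (index=1) or second (index=2) surname, or None."""
    match = SURNAMES.search(value or "")
    return match.group(index) if match else None

rows = [
    "WyeEaef MN2014 MS401 N.Lopez-O.Lorenzi.txt",
    "AwyufEQ WCH2014 OS401 N.Lorenzi-O.Lopez.txt",
    "THAFa5u WCH2014 LS107 N.Larry-O.Lolly.txt",
]

conn = sqlite3.connect(":memory:")
conn.create_function("surname", 2, surname)  # expose the helper to SQL
conn.execute("CREATE TABLE files (initialColumn TEXT)")
conn.executemany("INSERT INTO files VALUES (?)", [(r,) for r in rows])

result = conn.execute(
    "SELECT initialColumn, surname(initialColumn, 1) AS firstName, "
    "surname(initialColumn, 2) AS secondName FROM files ORDER BY rowid"
).fetchall()
```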

extract number from string in Oracle

I am trying to extract a specific text from an Outlook subject line. This is required to calculate turn around time for each order entered in SAP. I have a subject line as below
SO# 3032641559 FW: Attached new PO 4500958640- 13563 TYCO LJ
My final output should be like this: 3032641559
I have been able to do this in MS excel with the formulas like this
=IFERROR(INT(MID([#[Normalized_Subject]],SEARCH(30,[#[Normalized_Subject]]),10)),"Not Found")
In the above formula, [#[Normalized_Subject]] is the name of the column in which the SO number exists. I have been asked to do this in Oracle, but I am very new to it. Your help would be greatly appreciated.
Note: in the above subject line the number 30 is common in every subject line.
The last parameter of REGEXP_SUBSTR() indicates the sub-expression you want to pick. In this case you can't just match 30 then some more numbers as the second set of digits might have a 30. So, it's safer to match the following, where x are more digits.
SO# 30xxxxxx
As a regular expression this becomes:
SO#\s30\d+
where \s indicates a space, \d indicates a numeric character, and + means you want to match as many as there are. We can then use the sub-expression argument; to do that you need sub-expressions, i.e. groups created where you want to split the string:
(SO#\s)(30\d+)
Put this in the function call and you have it:
regexp_substr(str, '(SO#\s)(30\d+)', 1, 1, 'i', 2)
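The grouping logic can be tried outside Oracle as well. Here is a minimal Python sketch, assuming Python's re treats this pattern the same way; the function name extract_so_number is made up for illustration.

```python
import re
from typing import Optional

def extract_so_number(subject: str) -> Optional[str]:
    # Group 1 is the "SO# " prefix, group 2 the number starting with 30.
    # re.IGNORECASE plays the role of the 'i' match parameter.
    match = re.search(r"(SO#\s)(30\d+)", subject, flags=re.IGNORECASE)
    # Returning group 2 corresponds to the last argument (2) in regexp_substr.
    return match.group(2) if match else None
```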

Regular Expression to break row with comma separated values into distinct rows

I have a file with many rows. Each row has a column which may contain comma separated values. I need each row to be distinct (ie no comma separated values).
Here is an example row:
AB AB10,AB11,AB12,AB15,AB16,AB21,AB22,AB23,AB24,AB25,AB99 ABERDEEN Aberdeenshire
The columns are comma separated (Postcode area, Postcode districts, Post town, Former postal county).
So the above row would get turned into:
AB AB10 ABERDEEN Aberdeenshire
AB AB11 ABERDEEN Aberdeenshire
AB AB12 ABERDEEN Aberdeenshire
...
...
I tried the following but it didn't work...
(.+)\t(([0-9A-Z]+),)+\t(.+)\t(.+)
I agree that regexes are not the best way, but this should hopefully work if that's all you have available to you. (Apply it repeatedly until there are no more matches.)
Edit
Updated with the OP's final solution from the comments.
Find: (.+)\t([^,\s]+),([^\t]+)\t(.+)
Replace: \1\t\2\t\4\r\1\t\3\t\4
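To see what "done repeatedly until there are no more matches" looks like, here is a small Python sketch that applies the same find/replace in a loop. It is an illustration under assumptions: the columns are tab-separated, "\n" is used as the line break instead of "\r", and the function name expand_rows is hypothetical.

```python
import re

FIND = re.compile(r"(.+)\t([^,\s]+),([^\t]+)\t(.+)")
# Emit one row for the first value and another row carrying the remaining values.
REPLACE = r"\1\t\2\t\4\n\1\t\3\t\4"

def expand_rows(text: str) -> str:
    # Each pass peels one value off every row that still contains a comma;
    # stop when a pass makes no replacement.
    while True:
        text, count = FIND.subn(REPLACE, text)
        if count == 0:
            return text
```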
I agree with stakx that this doesn't sound like a good place for regexes.
I would write a small program instead which read each line, split the line into columns, split each relevant column into a list of values, and then iterated over all combinations of those, outputting a line each time.
Assuming it's only that one column which can have multiple tokens, it would basically look like this:
while not InputFile.EndOfFile:
    line = InputFile.readline();
    columns = line.split('\t'); //Assuming 1-based array, so indexes 1-4
    col2values = columns[2].split(',');
    for each value in col2values:
        OutputFile.WriteLine(columns[1]+'\t'+value+'\t'+columns[3]+'\t'+columns[4]);
If multiple columns can have multiple values, simply put another loop inside the for each.
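The pseudocode above translates fairly directly into runnable Python. This is a minimal sketch, assuming exactly four tab-separated columns with only the second one holding comma-separated values; the function name expand_line is hypothetical.

```python
def expand_line(line: str) -> list[str]:
    # Split the row into its four tab-separated columns
    # (Python lists are 0-based, unlike the 1-based pseudocode).
    area, districts, town, county = line.rstrip("\n").split("\t")
    # Emit one output row per value in the comma-separated second column.
    return ["\t".join((area, district, town, county))
            for district in districts.split(",")]
```

To process a whole file, one could call expand_line on each line read from the input and write the returned rows to the output.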