Split a String, take only 5 items. But limit characters to <20

A simple query, perhaps.
I use this very useful formula:
=JOIN(replace(A1, find("|", SUBSTITUTE(A1, ", ", "|", 5)), len(A1), ""), "", )
This takes a comma-separated cell (which may contain 50 strings) and returns only the first 5. Next, I'd like to limit the returned strings to those under 20 characters. Is it possible to incorporate that into this formula? I currently use a regex find-and-replace with the pattern .{20,} to delete everything that is 20 characters or longer. There must be a more elegant way of doing this.
For example, cell A1 =
here's a string, here's a very long string over 20 chars, string 3, string 4, another string, string 5, another string 7, number 8
would become
here's a string, string 3, string 4, another string, string 5
Also, how do you handle errors in formulas? If the queried cell contains only 3 strings and I ask for 5, I get an error or a bad result. What's the trick for handling that case?
Thanks so much for reading this!

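=JOIN(", ",ARRAY_CONSTRAIN(FILTER(SPLIT(A1,", ",0),LEN(SPLIT(A1,", ",0))<20),1,5))
SPLIT splits the string on the ", " delimiter.
FILTER keeps only the items whose LEN is under 20 characters.
ARRAY_CONSTRAIN limits the result to 1 row and at most 5 columns; if fewer than 5 items remain, it returns them all.
JOIN joins the filtered items back together with ", ".
If no item passes the filter, FILTER returns #N/A; wrapping the whole formula in IFERROR handles that case.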
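For readers who want to check the logic outside Sheets, the same split, filter, constrain, join pipeline can be sketched in Python. This is only an illustration: the function name take_short_items and its parameters are made up for this example, and it assumes the ", " delimiter and the "under 20 characters" rule from the question.

```python
def take_short_items(cell, limit=5, max_len=20, sep=", "):
    """Split on sep, keep items shorter than max_len, take the first `limit`, rejoin."""
    items = cell.split(sep)                                   # SPLIT
    short = [item for item in items if len(item) < max_len]   # FILTER on LEN
    return sep.join(short[:limit])                            # ARRAY_CONSTRAIN + JOIN


example = (
    "here's a string, here's a very long string over 20 chars, string 3, "
    "string 4, another string, string 5, another string 7, number 8"
)
print(take_short_items(example))
# here's a string, string 3, string 4, another string, string 5
```

Slicing with short[:limit] does not fail when fewer than 5 items remain, which mirrors the "only 3 strings" case raised in the question.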

Related

Extract multiple substrings of numbers of a specific length from string in Google Sheets

I need to extract only the 8-digit numbers from a string in Google Sheets.
I've tried SPLIT and REGEXREPLACE, but I can't find a way to get only the numbers of that length; I get every number in the string.
For example I'm using
=SPLIT(lower(N2),"qwertyuiopasdfghjklzxcvbnm`-=[]\;' ,./!:##$%^&*()")
but I get all the numbers, while I only need the 8-digit ones.
This may be a test value:
00150412632BBHBBLD 12458 32354 1312548896 ACT inv 62345471
I only need to extract "62345471" and nothing else!
Could you please help me out?
Many thanks!

Please use the following formula for a single cell.
Drag it down for more cells.
=INDEX(TRANSPOSE(QUERY(TRANSPOSE(IF(LEN(SPLIT(REGEXREPLACE(A2&" ","\D+"," ")," "))=8,
SPLIT(REGEXREPLACE(A2&" ","\D+"," ")," "),"")),"where Col1 is not null ",0)))
Functions used:
QUERY
INDEX
TRANSPOSE
IF
LEN
SPLIT
REGEXREPLACE

If you only need to do this for one cell (or you have your heart set on dragging the formula down into individual cells), use the following formula:
=REGEXEXTRACT(" "&N2&" ","\s(\d{8})\s")
However, I suspect you want to process the eight-digit number out of all cells running N2:N. If that is the case, clear whatever will be your results column (including any headers) and place the following in the top cell of that otherwise cleared results column:
=ArrayFormula({"Your Header"; IF(N2:N="",,IFERROR(REGEXEXTRACT(" "&N2:N&" ","\s(\d{8})\s")))})
Replace the header text Your Header with whatever you want your actual header text to be. The formula will show that header text and will return all results for all rows where N2:N is not null. Where no eight-digit number is found, null will be returned.
By prepending and appending a space to the N2:N raw strings before processing, spaces before and after string components can be used to determine where only eight digits exist together (as opposed to eight digits within a longer string of digits).
The only assumption here is that there are, in fact, spaces between string components. I did not assume that the eight-digit number will always be in a certain position (e.g., first, last) within the string.
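The space-padding idea can be sketched in Python to see why it isolates exactly eight digits. This is a rough equivalent rather than the Sheets formula itself; the function name extract_eight_digits is hypothetical, and it makes the same assumption that string components are separated by whitespace.

```python
import re


def extract_eight_digits(text):
    """Return the first whitespace-delimited run of exactly 8 digits, or None."""
    padded = f" {text} "  # pad so a number at either end still has a space on both sides
    match = re.search(r"\s(\d{8})\s", padded)
    return match.group(1) if match else None


sample = "00150412632BBHBBLD 12458 32354 1312548896 ACT inv 62345471"
print(extract_eight_digits(sample))
# 62345471
```

The ten-digit 1312548896 is skipped because the character after its first eight digits is another digit, not whitespace.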

Try this (take a look at the example sheet):
=FILTER(TRANSPOSE(SPLIT(B2," ")),LEN(TRANSPOSE(SPLIT(B2," ")))=8)
Or this, to get them all in one cell:
=JOIN(", ",FILTER(TRANSPOSE(SPLIT(B2," ")),LEN(TRANSPOSE(SPLIT(B2," ")))=8))
Explanation
SPLIT the string with the delimiter set to " " (a space), TRANSPOSE the result into a column, and FILTER that column with condition1 set to LEN(TRANSPOSE(SPLIT(B2," ")))=8.
JOIN the output column with ", " to get all occurrences of numbers with a length of 8 in a single cell.
Note: to get the numbers with a length of N, replace 8 in the FILTER function with a cell reference.

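Using this on a cell worked just fine for me (A1 is the cell with the data):
=REGEXEXTRACT(A1,"[0-9]{8}$")
Note that the $ anchor means this only matches eight digits at the very end of the string.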

Alteryx - Split a string with an uncertain length into 5 characters per column

I am trying to split a string (the string length is uncertain; it could be 500 characters or 1500 characters) into multiple columns, and each column should only contain 5 characters.
For example,
If column A contains the string:
AAGANAB5ARAB7AAAB9AAAC--CAC--1ACMRD
Then, I need Column B to Column H to be:
AAGAN,
AB5AR,
AB7AA,
AB9AA,
AC--C,
AC--1,
ACMRD
Also, the string contains "-", but it is NOT a delimiter. It should be counted as part of the 5-character strings.
I know RegEx is probably the function I should use, and just by putting "(.....)" in the Regular Expression, Alteryx can extract the first 5 characters. But I don't know how to ask Alteryx to automatically split the entire string (length varies each row) to columns of 5 chars.

In Alteryx, use the RegEx tool (instead of the Formula tool with one of its REGEX functions). In the configuration panel of the RegEx tool, simply enter ..... as the regular expression; the key is to select "Split to Rows". This gives you rows with a new field that holds the result of the applied RegEx.
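To see what repeatedly applying a five-character pattern produces, here is a small Python sketch of the same idea. It is not Alteryx code; the function name split_fixed_width is invented for illustration, and it uses .{1,5} rather than ..... so that a trailing chunk shorter than five characters is kept instead of dropped.

```python
import re


def split_fixed_width(text, width=5):
    """Split text into consecutive chunks of `width` characters; '-' is an ordinary character."""
    return re.findall(rf".{{1,{width}}}", text)


print(split_fixed_width("AAGANAB5ARAB7AAAB9AAAC--CAC--1ACMRD"))
# ['AAGAN', 'AB5AR', 'AB7AA', 'AB9AA', 'AC--C', 'AC--1', 'ACMRD']
```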

Matlab Regexp for nested groups and captured tokens

Is there a way to capture tokens inside a non-captured group in Matlab regular expressions? Here is the specific problem:
InputString = 'Identifiers: 10 12 1 3 8 6 4 2'
Expression = 'Identifiers:\s(?:(\d*)\t?)+'
regexp(InputString, Expression, 'tokens')
I need to find the numbers after 'Identifiers:'. The string InputString is part of a big character array with lines before and after this line, separated by \r\n characters. The character after the colon is a whitespace, and the numbers are separated by tabs. The last number has no trailing tab. The number of numbers can vary, but there is always at least one, and they are only integers with one or more digits.
My idea for the Expression was as follows: identify the line by Identifiers:\s, capture a number and its possible trailing tab with (\d*)\t?, and repeat this one or more times with +. To repeat the digit part of the expression, I need to put it in a group. I don't want to capture the token of the outer group (?:(\d*)\t?), only the token of the inner group (\d*). That's why I used ?:. When I remove ?:, only one token containing 1012138642 is returned.
Isn't it possible to capture tokens inside a non-capturing group? Do you have any solution to return the numbers in a single statement?
In my current solution I find the line by
Expression = 'Identifiers:.+?\r\n'
Line = regexp(InputString, Expression, 'match')
and get the digits with
regexp(Line, '(\d+)\t+', 'tokens')
I have spent so much time looking for a single-statement solution that I now really need to know whether it's possible. I'm not sure whether I'm thinking about this the wrong way or whether it's simply impossible.

MATLAB doesn't support nested tokens, even if you mark them as non-capturing.
Starting in R2016b, there are some new text manipulation functions that make this easier:
str = "Identifiers: 10 12 1 3 8 6 4 2" + newline + "Blah";
str = str.extractBetween("Identifiers: ",newline).split
str =
8×1 string array
"10"
"12"
"1"
"3"
"8"
"6"
"4"
"2"
If your goal is one statement with regexp, using split might get you closer.
str = regexp(str,'(?<=Identifiers[^\n]*)\s*(?=[^\n]*)','split')
str =
1×10 string array
"Identifiers:" "10" "12" "1" "3" "8" "6" "4" "2" "Blah"

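For comparison only, the "find the line, then split it" approach looks like this in Python. It is a sketch in a different language, not a MATLAB solution; the function name identifiers is made up, and it assumes the line starts with "Identifiers:" and ends at \r\n or the end of the text.

```python
import re


def identifiers(text):
    """Return the integers that follow 'Identifiers:' on its own line, or [] if absent."""
    match = re.search(r"^Identifiers:[ \t]*([^\r\n]*)", text, flags=re.MULTILINE)
    if match is None:
        return []
    # split() with no argument splits on any run of whitespace, including tabs
    return [int(token) for token in match.group(1).split()]


block = "Header\r\nIdentifiers: 10\t12\t1\t3\t8\t6\t4\t2\r\nBlah"
print(identifiers(block))
# [10, 12, 1, 3, 8, 6, 4, 2]
```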
Regex leading zeros from string in Hive

I have a 19-character string in Hive that I need to split up and remove any leading zeros.
Example:
7212092180052740029
I need it to be split like this
721 20 9218 00527 40029
So there are no leading zeros in 1st, 2nd, or 3rd section, and 00 would be removed from the 4th section; section 5 will be disregarded. My desired result would be
721209218527
My first-pass solution is
trim(concat_ws('', regexp_replace(substr(some_string, 1, 3), '^0*', '')
, regexp_replace(substr(some_string, 4, 2), '^0*', '')
, regexp_replace(substr(some_string, 6, 4), '^0*', '')
, regexp_replace(substr(some_string, 10, 5), '^0*', '')))
but this seems like extreme overkill. Any ideas how to do this with one line of regex?
Also, note that none of the 5 sections, once split, will ever be all zeros (i.e., section one will never be 000). If one were, my 'solution' wouldn't work: all of its zeros would count as leading ones, and '^0*' would strip the section to nothing.

^0*|(?<=^.{3})0*|(?<=^.{5})0*|(?<=^.{9})0*|(?<=^.{14}).*$
You can use this regex and replace each match with an empty string. See the demo:
https://regex101.com/r/rO0yD8/15
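The pattern can be tried outside Hive as well. The Python sketch below applies the same regex with re.sub; the function name strip_section_zeros is hypothetical, and it assumes the fixed 3/2/4/5/5 section layout described in the question.

```python
import re

# Same pattern as above: strip zeros at offsets 0, 3, 5 and 9 (the starts of
# sections 1 to 4), then drop everything from offset 14 on (section 5).
PATTERN = re.compile(
    r"^0*|(?<=^.{3})0*|(?<=^.{5})0*|(?<=^.{9})0*|(?<=^.{14}).*$"
)


def strip_section_zeros(code):
    """Remove leading zeros from sections 1 to 4 and discard section 5."""
    return PATTERN.sub("", code)


print(strip_section_zeros("7212092180052740029"))
# 721209218527
```

Each lookbehind such as (?<=^.{9}) only succeeds at one fixed offset from the start of the string, which is what ties every 0* to the beginning of a section.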

Stata - searching if a particular part of a set of characters has a specific character

How do I do this in Stata?
Say my data are:
Var1
Whoareyou
Whoisme
Idontknow
Isthatyou
Isyoume
How do I know if the 6th and 7th characters are "me"?

Previous answer is correct only by accident.
substr(var1, 6, 2) == "me"
The last argument of substr() is the maximum length of the substring extracted, not the position of the last character selected.
In the examples given, "me" occurs at the end of the string, so using 6, 7 happens to work in those cases.
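The difference between "length" and "end position" is easy to check with a short Python sketch. This is an analogy, not Stata code: the helper has_at is invented here, and it takes a 1-based start position and compares exactly as many characters as the substring has.

```python
def has_at(text, start, sub):
    """True if `sub` occupies `text` starting at 1-based position `start`."""
    begin = start - 1  # convert the 1-based position to Python's 0-based index
    return text[begin:begin + len(sub)] == sub


words = ["Whoareyou", "Whoisme", "Idontknow", "Isthatyou", "Isyoume"]
matches = [word for word in words if has_at(word, 6, "me")]
print(matches)
# ['Whoisme', 'Isyoume']
```

With a value such as "Whoismeagain", taking up to 7 characters from position 6 would yield "meagain" rather than "me", which is why the length argument should be 2.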
