Teradata REGEX or SUBSTR to remove the text between two *'s and the asterisk? - regex

I'm working in Teradata with a dataset that has several occurrences of data in the following format:
*6A*H.ORTHO I
*4A*IMP
*16A*T.IMPLANTS
*2A*HIMPLANTS
*9A*IMP
*5A*F.IMPLANT
*6A*DIMP
*4A*TISSUE
*5A*KIMP
*7A*IMP
*10A*D.IMP
*3A*W.LSH
*10A*IMP
*16A*IMP
*22A*T.IMPLANTS
In the dataset above I'm attempting to extract everything after the second occurrence of an asterisk, i.e. D.IMP, IMP, T.IMPLANTS, F.IMPLANT, etc.
I've attempted to use SUBSTR and came close using:
SUBSTR(TRIM(FSS.Surgical_Inventory_Code),1,
INDEX(TRIM(FSS.Surgical_Inventory_Code),'*')-1)
But that only returns the data after the first *.
I believe the best solution would be a regular expression or SUBSTR. Teradata has a function called REGEXP_SUBSTR, but I'm not sure how to write a regex that solves my problem.

If you only ever have 2 asterisks in your string, you can use STRTOK:
strtok(<source string>,'*',2)
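To make the tokenizing idea concrete, here is a minimal Python sketch rather than Teradata SQL. The helper names `strtok` and `after_second_asterisk` are hypothetical. The sketch assumes STRTOK counts tokens from 1 and skips empty tokens, as the answer above implies for strings that start with the delimiter.

```python
import re

def strtok(source, delimiters, token_num):
    """Rough stand-in for STRTOK: 1-based token index, empty tokens skipped (assumption)."""
    tokens = [t for t in re.split("[" + re.escape(delimiters) + "]", source) if t]
    # Return None when the requested token does not exist.
    return tokens[token_num - 1] if 1 <= token_num <= len(tokens) else None

def after_second_asterisk(source):
    """Regex alternative: drop a leading '*...*' prefix and keep the rest."""
    return re.sub(r"^\*[^*]*\*", "", source.strip())

for code in ["*6A*H.ORTHO I", "*10A*D.IMP", "*22A*T.IMPLANTS"]:
    print(strtok(code, "*", 2), "|", after_second_asterisk(code))
```

The regex variant keeps everything after the second asterisk even if the remainder contains further asterisks, which the token approach would not.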

Related

How can I extract specific patterns from a string?

I currently have a dataset filled with the following pattern:
My goal is to get each value into a different cell.
I have tried the following formula, but it has not yielded the results I am looking for.
=SPLIT(D8,"[Stock]",FALSE,FALSE)
I would appreciate any guidance on how I can get to the ideal output, using Google Sheets.
Thank you in advance!
I will assume here from your post that your original data runs D8:D.
If you want to retain [Stock] in each entry, try the following in the Row-8 cell of a column that is otherwise empty from Row 8 downward:
=ArrayFormula(IF(D8:D="",,TRIM(SPLIT(REGEXREPLACE(D8:D&"~","(\[Stock\]).","$1~"),"~",1,1))))
If you don't want to retain [Stock] in each entry, use this version:
=ArrayFormula(IF(D8:D="",,TRIM(SPLIT(REGEXREPLACE(D8:D&"~","\[Stock\].","~"),"~",1,1))))
These formulas don't rely on any punctuation as markers. They also ensure that you don't wind up with blank (and therefore unusable) cells interspersed from trailing SPLITs.
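The same replace-then-split trick can be sketched outside Sheets. This Python version is only an illustration: the function name `split_on_marker` is hypothetical, and the sample string is made up, since the asker's actual data is not shown in the post. It also assumes the cell never contains a literal `~`.

```python
import re

def split_on_marker(cell, marker="[Stock]", keep_marker=True):
    # Blank cells yield no pieces, like the IF(D8:D="",,...) guard.
    if cell == "":
        return []
    if keep_marker:
        # Replace the one character after each marker with "~", keeping the marker.
        tagged = re.sub("(" + re.escape(marker) + ").", r"\1~", cell + "~")
    else:
        # Replace the marker and the one character after it with "~".
        tagged = re.sub(re.escape(marker) + ".", "~", cell + "~")
    # Split on "~", trim whitespace, and drop empty pieces.
    return [piece.strip() for piece in tagged.split("~") if piece.strip()]

# Hypothetical sample cell, not the asker's data.
print(split_on_marker("Apple [Stock], Pear [Stock], Plum [Stock]"))
print(split_on_marker("Apple [Stock], Pear [Stock], Plum [Stock]", keep_marker=False))
```

Appending `~` before the replacement is what lets the final marker be handled the same way as the others.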
If the comma is used only as the separator:
=ARRAYFORMULA(SPLIT(D8:D,", ",FALSE))
If the comma also appears inside each string ([Stock] will be removed):
=ARRAYFORMULA(SPLIT(D8:D," [Stock], ",FALSE))
If the comma also appears inside each string ([Stock] will be kept):
=ArrayFormula(SPLIT(REGEXREPLACE(M9:M11,"(\[Stock\]), ","$1♦"),"♦"))
use:
=INDEX(TRIM(IFNA(SPLIT(D8:D; ","))))

Regex_Extract Bigquery

I'm trying to get the first, second, or third value of a string in BigQuery using the REGEXP_EXTRACT function.
The string looks like this
"testimage International,testimageinternational,002533336564114VoIdiAA"
I've been googling around and I'm struggling to find an appropriate regex.
This is the closest I've gotten: REGEXP_EXTRACT(string, r'[^,]+1') as x
Although it works great if the string is test,test1,test2, it doesn't work on the actual string. Any explanation of where I'm going wrong would be much appreciated.
Below is for BigQuery Standard SQL
Based on the example in your question and the regex you tried, you might want to consider simply using SPLIT(), as below:
#standardSQL
WITH t AS (
SELECT 'testimage International,testimageinternational,002533336564114VoIdiAA' str
)
SELECT
SPLIT(str, ',')[SAFE_OFFSET(0)] AS first,
SPLIT(str, ',')[SAFE_OFFSET(1)] AS second,
SPLIT(str, ',')[SAFE_OFFSET(2)] AS third
FROM t
with result
Row | first                   | second                 | third
1   | testimage International | testimageinternational | 002533336564114VoIdiAA
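For comparison, here is a small Python sketch of the same split-and-index idea, together with a look at what the pattern from the question matches. The helper `nth_field` is a hypothetical name, and Python's `re` module stands in for BigQuery's regex engine, which should behave the same way for a pattern this simple.

```python
import re

def nth_field(value, index, sep=","):
    """Return the field at a 0-based index, or None if out of range (like SAFE_OFFSET)."""
    parts = value.split(sep)
    return parts[index] if 0 <= index < len(parts) else None

s = "testimage International,testimageinternational,002533336564114VoIdiAA"
print(nth_field(s, 0), nth_field(s, 1), nth_field(s, 2), sep=" | ")

# The pattern [^,]+1 means "one or more non-comma characters followed by a literal 1",
# so it only matches where a digit 1 happens to occur.
print(re.search(r"[^,]+1", "test,test1,test2").group(0))
print(re.search(r"[^,]+1", s).group(0))
```

This suggests why the original attempt seemed to work on test,test1,test2: the trailing `1` in the pattern matched the literal 1 in test1, not the position of the field.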

regex for CREATE TABLE PARTITIONED BY DDL

I need a regex that matches both of these strings:
CREATE TEMPORARY TABLE db.table (cols)USING parquet PARTITIONED BY (DATA2, DATA3)
CREATE TABLE db.table (cols)USING parquet
The closest I've got is this:
CREATE +?(TEMPORARY +)?TABLE *(?P<db>.*?\.)?(?P<table>.*?)\((?P<col>.*?)\).*?USING.*?(PARTITIONED BY \((?P<pcol>.*?)\))
But that doesn't match the second string. I've tried adding a ? at the end, but that didn't help. Basically, I've been playing around with this for hours now and can't figure it out, so I'm resorting to SO.
I've set up a demo of this here: https://regex101.com/r/ffSVuD/1 If anyone feels game enough to try and solve it, be my guest!
I ended up using CREATE +?(TEMPORARY +)?TABLE *(?P<db>.*?\.)?(?P<table>.*?)\((?P<col>.*?)\).*?USING +([^\s]+) *(PARTITIONED BY \((?P<pcol>.*?)\))? to match both your examples.
Basically, I replaced USING.*? with USING +([^\s]+) *, so that you don't end up with a .*? before your last group.
Finally, I added a ? after your last group to make it optional.
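As a quick check, the two patterns can be compared in Python, which supports the same `(?P<name>...)` group syntax. This is only a sketch: the constant names and the `parse_ddl` helper are hypothetical, while the patterns and the two DDL strings are copied from the post.

```python
import re

ORIGINAL = (r"CREATE +?(TEMPORARY +)?TABLE *(?P<db>.*?\.)?(?P<table>.*?)\((?P<col>.*?)\)"
            r".*?USING.*?(PARTITIONED BY \((?P<pcol>.*?)\))")
REVISED = (r"CREATE +?(TEMPORARY +)?TABLE *(?P<db>.*?\.)?(?P<table>.*?)\((?P<col>.*?)\)"
           r".*?USING +([^\s]+) *(PARTITIONED BY \((?P<pcol>.*?)\))?")

DDL_PARTITIONED = "CREATE TEMPORARY TABLE db.table (cols)USING parquet PARTITIONED BY (DATA2, DATA3)"
DDL_PLAIN = "CREATE TABLE db.table (cols)USING parquet"

def parse_ddl(pattern, ddl):
    """Return the named groups as a dict, or None when the pattern does not match."""
    m = re.match(pattern, ddl)
    return m.groupdict() if m else None

for ddl in (DDL_PARTITIONED, DDL_PLAIN):
    print(parse_ddl(ORIGINAL, ddl), parse_ddl(REVISED, ddl))
```

Note that the `table` group captures the trailing space before the parenthesis, so a caller would likely want to strip it.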

Possible combination (variations) of words in a string variable in stata

I have a string variable containing school names, and I need to find all the possible combinations of each word in this string variable in Stata.
For example, variations of the word "Academy" would be:
Academy,
Academy,
acdamey,
aacdemy,
dmcaamy,
aacedmy,
and so on.
I need this to standardize the raw data of school names, which has many typos of each word due to data entry issues, like the ones given above for "academy".
Depending on whether your data is already in Excel sheets or in a file, you can either use regex to try to match all possible combinations (and probably fix them when found) or parse the strings first before bringing them into Excel. In either case you could make a file (or Excel list/table/area/etc.) that includes all the common typos and use each typo as a regex match when comparing against your actual input.
Writing a regexp that would actually find all possible cases is next to impossible, especially if very similar (but correct) school names exist. In any case, direct regexps would be very messy and complex, so I would advise you to parse the data by first finding the correct form, excluding it, and then using a (greedy) search/regex to find the misspelled versions. You can then save the typos to use as a filter/match/pattern.
For some starting ideas, check these links:
Regex: Search for verb roots
Read text file and extract string into Excel sheet using regex
P.S. You should keep a count of all strings/school names and, at the end, get a list of all names that matched neither the correct form nor any of your regexp filters, so you can correct them manually.
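The typo-list workflow described above might look roughly like this in Python. It is a sketch under assumptions: it uses plain dictionary lookups instead of regexes, the names `CORRECT_WORDS`, `KNOWN_TYPOS` and `standardize` are hypothetical, and "Springfield" is a made-up school name. The misspellings come from the question.

```python
# Correct forms that need no fixing.
CORRECT_WORDS = {"academy"}

# Known misspellings mapped to their correct form (typos taken from the question).
KNOWN_TYPOS = {
    "acdamey": "academy",
    "aacdemy": "academy",
    "dmcaamy": "academy",
    "aacedmy": "academy",
}

def standardize(name):
    """Return the cleaned name plus the words that matched no list (for manual review)."""
    fixed, unmatched = [], []
    for word in name.lower().split():
        if word in CORRECT_WORDS:
            fixed.append(word)
        elif word in KNOWN_TYPOS:
            fixed.append(KNOWN_TYPOS[word])
        else:
            fixed.append(word)
            unmatched.append(word)
    return " ".join(fixed), unmatched

print(standardize("Springfield Acdamey"))
```

Collecting the unmatched words is what makes the final manual pass possible: anything neither correct nor a known typo is surfaced instead of silently kept.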

Separate text using regex

I have a string like
abcdefangners
and a set of numbers that specifies how to group the above string, such as
3,4
In this case, the output should be
abc,defa,ngners
Is something like this possible using regex? I have one option of using a loop to get the comparisons of the set one by one, but is there a better way to do it?
You could do:-
/(.{3})(.{4})(.*)/
This would give you the substrings which you'd then have to join together.
You'd have to create the regexp for each set of numbers so it would not be as easy as other methods of string manipulation.
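Building the regexp for each set of numbers can be automated. The following Python sketch generates a pattern of the same shape as the one above from a list of sizes. The function name `group_by_sizes` is hypothetical.

```python
import re

def group_by_sizes(text, sizes):
    """Split text into leading groups of the given sizes plus a final remainder."""
    # For sizes [3, 4] this builds (.{3})(.{4})(.*), matching the pattern above.
    pattern = "".join("(.{%d})" % n for n in sizes) + "(.*)"
    m = re.fullmatch(pattern, text, flags=re.DOTALL)
    if m is None:
        # The text is shorter than the requested groups.
        return None
    return ",".join(m.groups())

print(group_by_sizes("abcdefangners", [3, 4]))
```

For a plain fixed-width split like this, slicing inside a loop would do the same job without a regex, so this mainly shows that the pattern need not be written by hand.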