I write my PL/SQL code in TOAD v10.6, which is then run on Oracle servers that I believe are 11g.
Because I am working with client address information, I can't actually post any results.
The goal of my program is to parse address data into its correct fields. It's not the whole address, thankfully. The pieces of information it does contain are building number, street name, street type, direction, and sub-unit. The information is not always in the same presentation, and I have worked my way around that by the sequence in which I parse the information out.
The way I go about parsing the address field:
I load the address data into a new table
I delete duplicate rows
I mark key address patterns as errors (such as not enough fields, since an address needs at least three to be valid)
I extract the sub-unit, which can appear anywhere in the address
I extract the direction, which can also appear anywhere in the address
I extract the building number and make sure it contains only digits
I check whether an apartment was hyphenated onto the building number
I check that there is still enough information for a valid address, as I still need a street type and name
I extract the street type
Whatever remains is considered the street name
I have 27,000 addresses that are parsed correctly, about 3,000 that contain errors and are excluded, and 2,200 that are not handled correctly but do not trigger any errors. This is the second-to-last step:
UPDATE TEMP_PARSE_EXIST
SET V_STREET_TYPE = REGEXP_SUBSTR(ADRS, '\w+.$')
WHERE ADT_ACT IS NULL;
UPDATE TEMP_PARSE_EXIST
SET ADT_ACT = 'EMPTY STREET TYPE'
WHERE V_STREET_TYPE IS NULL AND ADT_ACT IS NULL;
I had an almost identical issue before, during the parsing of the sub-units. I never figured out what caused it, or why moving the regular expression from the WHERE clause to a different part of the statement corrected it.
UPDATE TEMP_PARSE_EXIST
SET ADT_ACT = 'PARSE ERROR: TOO MANY S_COM_RES_TYPE '
WHERE ADT_ACT IS NULL AND V_SECOND_LINE IS NULL
AND REGEXP_COUNT(ADRS, '\s' || S_COM_RES_TYPE || '.+\s.+' || S_COM_RES_TYPE , 1, 'i') > 1;
--this looks for a space before and after the sub-unit, then anything between it and another occurrence of a sub-unit keyword
--the space before and after are to prevent STE and FL from being matched with valid street names
--the second one is less strict about that
--if there starts to be an issue then a space before can be added
--however, adding a space after would make it miss cases where there is no space between the sub-unit and its number
--the block of code below is suspected of being where the error is happening
--the error in question is that a suite is not being noticed and extracted from the adrs line
--however, there are many more similar examples that are being handled correctly
UPDATE TEMP_PARSE_EXIST
SET V_SECOND_LINE = REGEXP_SUBSTR(ADRS, S_COM_RES_TYPE || '(\s?\w+|$)', 1, 1, 'i')
--'(\s\w+|$)' was the original expression, but the ? was added to account for there not always being a space
--so the pattern grabs the sub-unit and allows for a possible space between it and the number, or allows the end of the string, as there are some cases of that
WHERE ADT_ACT IS NULL AND V_SECOND_LINE IS NULL AND REGEXP_COUNT(ADRS, S_COM_RES_TYPE, 1, 'i') = 1;
--this removes v_second_line from the adrs
UPDATE TEMP_PARSE_EXIST
SET ADRS = TRIMMER(REPLACE(ADRS, V_SECOND_LINE))
WHERE V_SECOND_LINE IS NOT NULL;
The following code doesn't have the same error as above:
UPDATE TEMP_PARSE_EXIST
SET ADT_ACT = 'PARSE ERROR: TOO MANY S_COM_RES_TYPE '
WHERE REGEXP_like(adrs, '\s' || S_COM_RES_TYPE || '\s(|.+)' || S_COM_RES_TYPE , 'i');
--this looks for a space before and after the sub-unit, then anything between it and another occurrence of a sub-unit keyword
--the space before and after are to prevent STE and FL from being matched with valid street names
--which is a common issue if I am not so strict about it
UPDATE TEMP_PARSE_EXIST
SET V_SECOND_LINE = trimmer(REGEXP_substr(adrs, '\s' || S_COM_RES_TYPE || '\s\w+',1,1 ,'i'))
WHERE ADT_ACT IS NULL AND V_SECOND_LINE IS NULL;
--this removes v_second_line from the adrs, this is done for both parts
UPDATE TEMP_PARSE_EXIST
SET ADRS = TRIMMER(REPLACE(ADRS, V_SECOND_LINE))
WHERE V_SECOND_LINE IS NOT NULL;
I haven't been able to figure out why this is happening.
I am working on a project that is unusual for my area, and the people I work with do not need to use regular expressions, so they have been unable to help me.
So the question is: why are valid addresses making it past the regular expression?
Update:
Here are examples of addresses that are correctly handled, where all pieces are successfully parsed:
Full example adrs      | Dirn | Sub-unit  | number | type | name
100 Street1 Dr E       | E    |           | 100    | Dr   | Street1
1000 1st Ave Suite 501 |      | Suite 501 | 1000   | Ave  | 1st
1000 100th St          |      |           | 1000   | St   | 100th
1000 1st Ave N Unit 7  | N    | Unit 7    | 1000   | Ave  | 1st
Here are examples that are getting past the expression:
Full example adrs         | Dirn | Sub-unit | number | type | name
1000 1st Avenue E         | E    |          | 1000   |      | 1st Avenue
1000 Street1 Road         |      |          | 1000   |      | Street1 Road
1000 Street2 Street       |      |          | 1000   |      | Street2 Street
1000 Street3 Drive        |      |          | 1000   |      | Street3 Drive
100 1st Avenue S Unit 100 | S    | Unit 100 | 100    |      | 1st Avenue
All the example addresses listed above are real (I changed the building numbers and names) and come from the same data set. There are no extra or missing characters such as whitespace or special characters.
Jorge Campos is kind of correct that this was an XY problem.
The problem ended up being a piece of code that I had not included, because it was so simple I didn't think it could be the cause. I have a CASE statement correcting the abbreviations of the street types to full names, with no ELSE branch. So when a correct full name was already there, it got nulled out, because there were only correction branches.
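For anyone who runs into the same thing, here is a minimal sketch of the kind of fix involved (the abbreviation list below is illustrative, not my actual mapping):
UPDATE TEMP_PARSE_EXIST
SET V_STREET_TYPE = CASE UPPER(V_STREET_TYPE)
                      WHEN 'AVE' THEN 'AVENUE'
                      WHEN 'DR'  THEN 'DRIVE'
                      WHEN 'ST'  THEN 'STREET'
                      WHEN 'RD'  THEN 'ROAD'
                      ELSE V_STREET_TYPE --keep values that are already full names
                    END
WHERE ADT_ACT IS NULL;
--a CASE with no ELSE returns NULL whenever nothing matches,
--which is exactly what was wiping out street types that were already spelled out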
I have been working on a Python project to automate some reports my team was building by hand. I am running into a stubborn problem and can't figure out what I'm doing wrong.
Essentially, the area I am stuck on has 4 separate data columns, of which I have made generic versions below.
Start Time | Finish Time | Not Usable Reason | Start to Finish
12:36 | 15:36 |                   | 3:00
16:35 | 19:45 | Production Defect | 3:10
19:55 |       | QA Failure        |
Not Usable Reason has, at a high level, two options: blank, OR text describing the issue. As well, depending on the issue, a finish time may not have been recorded because a QA issue was noticed before the work was finished, resulting in a Not Usable Reason row for which a start-to-finish time cannot be calculated.
Essentially, what I am trying to do is: IF there is a Not Usable Reason, put a "--" into the Start to Finish field for that row.
The code that I used to attempt this:
processor_df['Start to Finish'] = processor_df['Finish Time'] - processor_df['Start Time']
processor_df['Start to Finish'] = processor_df['Start to Finish'].astype(str)
processor_df['Start to Finish'] = processor_df.loc[pd.isnull(processor_df['Not Usable Reason']) == False, 'Start to Finish'] == '--'
processor_df['Start to Finish'] = pd.to_timedelta(processor_df['Start to Finish'])
This is just the fraction of the code relating to the small portion that performs the calculation and then attempts to modify the Start to Finish column appropriately.
As well, I go from timedelta -> str -> timedelta due to an error I received when I didn't change it to a string:
ValueError: only leading negative signs are allowed
The issue is that my desired output would be:
Start Time | Finish Time | Not Usable Reason | Start to Finish
12:36 | 15:36 |                   | 3:00
16:35 | 19:45 | Production Defect | --
19:55 |       | QA Failure        | --
but the above code produces it as:
Start Time | Finish Time | Not Usable Reason | Start to Finish
12:36 | 15:36 |                   |
16:35 | 19:45 | Production Defect | False
19:55 |       | QA Failure        | False
What is the best way to check whether a condition exists and, if it does, replace the value? The above approach has worked when I used it purely for strings, but there my condition was not checking whether the field was blank, it was checking whether it matched a specific value.
Thank you for your help with this, and if it's a stupidly simple mistake, I thank you twice as hard for helping a novice out.
Best,
Andy
This line is where I think the problem lies:
processor_df['Start to Finish'] = processor_df.loc[pd.isnull(processor_df['Not Usable Reason']) == False, 'Start to Finish'] == '--'
You are basically overwriting the Start to Finish column with a boolean array. The last expression in that line is ==, which returns a boolean array of all False values, because of course the string '--' is not in any of the cells of that column.
The following line should do the trick instead:
processor_df.loc[processor_df['Not Usable Reason'].notnull() , 'Start to Finish'] = '--'
We use .loc (which I would in general recommend getting into the habit of using for indexing, rather than plain square brackets) to select the rows where there is a 'Not Usable Reason' and the column 'Start to Finish', and we assign (using =, the assignment operator) the string '--' to those cells.
We sync our Salesforce accounts and opportunities to QuickBooks (QB), but QB has character limits on its fields. Street lines have a 41-character limit per line, and I'm trying to use regex to control and enforce this, but it isn't working on the Address field type. I am using the very simple conditional formula:
REGEX(BillingStreet, '.{42,}')
which matches any run of 42 or more non-linebreak characters and should trigger the validation when it finds one. The problem is that the rule is simply ignored. I know this formula works, because if I apply it to another text field it works how it's supposed to. Here's an example of how it should work: https://www.regexpal.com/99217. If there's a match anywhere, it should throw the validation error.
Any ideas?
I ended up not using Regex because it doesn't seem to work well here. Instead I used formula functions to enforce the validation. Since we have a limit of two lines, it wasn't too bad to do this the long way.
AND(
IF(
//Look for a line break. If there is one, split and compare lengths separately.
FIND(MID( $Setup.Global__c.CLRF__c ,3,1), BillingStreet ) > 0
,
IF(
//If the first line is over the limit, return true to trigger validation.
LEN(LEFT(BillingStreet, FIND(MID( $Setup.Global__c.CLRF__c ,3,1), BillingStreet )-2))>41
,
TRUE
,
//If first line is fine, check second line and since this is a condition, it will return true/false automatically.
LEN(MID(BillingStreet, FIND(MID( $Setup.Global__c.CLRF__c ,3,1), BillingStreet )+1,LEN(BillingStreet))) > 41
)
,
//If there is no line break (one line) check the total length.
LEN(BillingStreet) > 41
)
,
//There is a validation for having more than 2 lines. Without this, it will combine lines 2 and above and check that length and will be confusing to users when it's > 41.
NOT(REGEX( BillingStreet , '(.*\r?\n.*){2,}'))
,
//Ignore this rule if the user has this flag active. Useful for bulk updating and don't have to worry about import errors.
$User.Bypass_Validation__c = False
)
This piece, MID($Setup.Global__c.CLRF__c, 3, 1), represents a line break in Salesforce. I found out how it works from "find line break in formula field". I would have liked to use Regex, but it just doesn't work here if you ask me, except for checking for 2+ lines as in the code above.
For data scrubbing I have a lot of hard-coded values in my program. I am trying to put those values into a table. One of the conditions for this scrubbing is to find values whose character length is 1, coded as character_length(name) = 1.
But when I try to emulate this by using ^.$, it is not catching values like ¿, ¥, Ã.
Please let me know if I am doing something wrong.
When I run the code below, I see these 3 values (¿, ¥, Ã):
select name from email_table
where character_length(name) = 1
and name not in
(select name from email_table
where regexp_similar(translate(name USING LATIN_TO_UNICODE WITH ERROR),'^.$', 'i') = 1)
It seems like the issue is due to the Teradata version.
We have TD 14 and TD 15 on different servers, and I ran the following query:
select case when regexp_similar('¥','^.$', 'i')=1
then 'Y'
else 'N'
end as output;
On TD 14 I get 'N' as output, and on TD 15 the answer is 'Y'.
I have been given the daunting task of sifting through a database of over 30,000 registrants and correcting the letter casing of names and addresses where needed. I am trying to write a program that will search for names and addresses in our database that are either all lowercase or all uppercase and output these mishaps in a webpage for me to review and correct more efficiently. I was informed that I could utilize Regular Expressions to find fields that adhere to my criteria, only I am new to programming and I am unfamiliar with the syntax of RegEx.
If anyone could provide me with some pointers as how to use RegEx to query for these inconsistencies, it would be greatly appreciated.
Thank you.
strComp should work
SELECT col
FROM table
WHERE strComp(col, lcase(col), 0) = 0 --all lower case
OR strComp(col, ucase(col), 0) = 0 --all upper case
The first two arguments are the strings to compare. The third argument says to do a binary comparison. If the two strings are equal, 0 is returned.
How will you accurately correct the data? If you see a last name of "MACGUYVER", should it change to Macguyver or MacGuyver? If you see a last name of "DE LA HOYA", will it become de la Hoya, De La Hoya, or something else? This task seems a bit dangerous.
If your plan is basically to just do initial capitalization then I suggest that you run an update first before doing any manual review.
You could run something like this to change your name fields to initial capital letters:
update yourTable
set lname = StrConv(lname,3)
where StrComp(lname, StrConv(lname,3), 0) <> 0
and StrComp(mid(lname,2,len(lname)), lcase(mid(lname,2,len(lname))), 0) = 0;
Where "lname" above is your last name column, for example.
The above would have to be run for each name field.
Note that this will not update names that legitimately have multiple capital letters, like MacGuyver or O'Connor, which need manual review.
Also note that it will update last names that start with van, von, de la, and others that may intentionally be lowercase.
You could then query for just the names that need manual review, which I assume will be a much smaller subset:
select *
from yourTable
where StrComp(lname, StrConv(lname,3), 0) <> 0;
Addresses are tougher. To find just those that are either all lowercase or all uppercase you can do this:
select *
from yourTable
where strComp(address1, lcase(address1), 0) = 0;
select *
from yourTable
where strComp(address1, ucase(address1), 0) = 0;
Obviously this won't catch address lines like "123 New YORK AveNUE".
Consider asking for permission to just set all address values to uppercase.
You'll save yourself a lot of trouble.
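If that is allowed, a minimal sketch of the one-off update could look like this (using the same placeholder table and column names as the queries above; adjust to your schema):
update yourTable
set address1 = ucase(address1)
where StrComp(address1, ucase(address1), 0) <> 0;
The WHERE clause just skips rows that are already entirely uppercase.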
I have a field that contains a mix of descriptions and dollar amounts. With T-SQL, I would like to extract those dollar amounts and then insert them into a new field for the record.
-- UPDATE --
Some data samples could be:
Used knife set for sale $200.00 or best offer.
$4,500 Persian rug for sale.
Today only, $100 rebate.
Five items for sale: $20 Motorola phone car charger, $150 PS2, $50.00 3 foot high shelf.
In the set above I was thinking of just grabbing the first occurrence of the dollar figure... that is the simplest.
I'm not trying to remove the amounts from the original text, just get their value, and add them to a new field.
The amounts may or may not contain decimals and commas.
I'm sure PATINDEX alone won't cut it, and I don't think I need an extremely robust RegEx function to accomplish this.
However, the OLE Regex Find (Execute) function here appears to be the most robust; when trying to use that function I get the following error message in SSMS:
SQL Server blocked access to procedure 'sys.sp_OACreate' of component
'Ole Automation Procedures' because this component is turned off as
part of the security configuration for this server. A system
administrator can enable the use of 'Ole Automation Procedures' by
using sp_configure. For more information about enabling 'Ole
Automation Procedures', see "Surface Area Configuration" in SQL Server
Books Online.
I don't want to go changing my server settings just for this function. I have another regex function that works just fine without any changes.
I can't imagine this being that complicated to just extract dollar amounts. Any simpler ways?
Thanks.
CREATE FUNCTION dbo.fnGetAmounts(@str nvarchar(max))
RETURNS TABLE
AS
RETURN
(
-- generate all possible starting positions ( 1 to len(@str))
WITH StartingPositions AS
(
SELECT 1 AS Position
UNION ALL
SELECT Position+1
FROM StartingPositions
WHERE Position <= LEN(@str)
)
-- generate possible lengths
, Lengths AS
(
SELECT 1 AS [Length]
UNION ALL
SELECT [Length]+1
FROM Lengths
WHERE [Length] <= 15
)
-- a Cartesian product between StartingPositions and Lengths
-- if the substring is numeric then get it
,PossibleCombinations AS
(
SELECT CASE
WHEN ISNUMERIC(substring(@str,sp.Position,l.Length)) = 1
THEN substring(@str,sp.Position,l.Length)
ELSE null END as Number
,sp.Position
,l.Length
FROM StartingPositions sp, Lengths l
WHERE sp.Position <= LEN(@str)
)
-- get only the numbers that start with Dollar Sign,
-- group by starting position and take the maximum value
-- (ie, from $, $2, $20, $200 etc)
SELECT MAX(convert(money, Number)) as Amount
FROM PossibleCombinations
WHERE Number like '$%'
GROUP BY Position
)
GO
declare @str nvarchar(max) = 'Used knife set for sale $200.00 or best offer.
$4,500 Persian rug for sale.
Today only, $100 rebate.
Five items for sale: $20 Motorola phone car charger, $150 PS2, $50.00 3 foot high shelf.'
SELECT *
FROM dbo.fnGetAmounts(@str)
OPTION(MAXRECURSION 32767) -- max recursion option is required in the select that uses this function
This link should help.
http://blogs.lessthandot.com/index.php/DataMgmt/DataDesign/extracting-numbers-with-sql-server
Assuming you are OK with extracting the numerics regardless of whether or not there is a $ sign. If the $ sign is a strict requirement, some mods will be needed.
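If the $ sign is required and only the first dollar figure per row is needed, a PATINDEX-based approach may also be enough. Here is a minimal sketch against one of the sample strings (the variable names and the "money character" class [$0-9,.] are my own assumptions):
declare @str nvarchar(max) = 'Used knife set for sale $200.00 or best offer.'
declare @start int = PATINDEX('%$[0-9]%', @str)

SELECT CASE
         WHEN @start = 0 THEN NULL -- no dollar amount in the text
         ELSE CONVERT(money,
                SUBSTRING(@str, @start,
                  -- length of the run of $/digit/comma/period characters starting at the $
                  PATINDEX('%[^$0-9,.]%', SUBSTRING(@str, @start, LEN(@str)) + ' ') - 1))
       END AS FirstAmount
CONVERT to money accepts the $ sign and commas, so '$4,500' comes through as 4500.00. The same expression could be used in an UPDATE to populate the new column from the description field.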