Google Sheets Regex Result Not Equal To Text - regex

I have a spreadsheet which I'm using to import prices with IMPORTHTML.
The import result contains the prices with text.
I'm using REGEXEXTRACT to get the price only.
The problem is that the extracted value is not equal to the same value in another cell.
For example:
The import result is:
$58.00 & FREE Shipping. Details
In cell A1, using REGEXEXTRACT with the regular_expression "[0-9][0-9].[0-9][0-9]", the result is 58.00.
In cell A2, I typed 58.00.
Trying to compare the two (using IF(A1=A2...)) fails.
Any idea why and how to fix it?
Thanks

You may use the following regex extraction:
REGEXEXTRACT(<CELL>, "^\W*([\d.]+)")
See the regex demo
The "^\W*([\d.]+)" means:
^ - start of string
\W* - zero or more non-word chars (non letters, digits, underscores)
([\d.]+) - Group 1: one or more digits or dots.
As Rubén's answer explains, REGEXEXTRACT returns text, so you need to convert the extracted string to a number with VALUE before comparing it.

Formula
Try
=VALUE(REGEXEXTRACT(A1,"[0-9][0-9].[0-9][0-9]"))=A2
Explanation
REGEXEXTRACT always returns a text value. If you type 58.00 into a cell, it is very likely identified as a number, so the text "58.00" and the number 58 do not compare as equal.
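The same text-versus-number mismatch can be reproduced outside Sheets. The following Python sketch is only an analogy: Python's re module and float() stand in for REGEXEXTRACT and VALUE, and the variable names are made up for illustration.

```python
import re

imported = "$58.00 & FREE Shipping. Details"

# Like REGEXEXTRACT, a regex match yields text, not a number.
extracted = re.search(r"^\W*([\d.]+)", imported).group(1)
typed = 58.00  # a typed-in 58.00 is stored as a number

print(repr(extracted))            # '58.00'
print(extracted == typed)         # False: text never equals a number
print(float(extracted) == typed)  # True: convert first, as VALUE does
```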

The answer for this is:
=VALUE(REGEXEXTRACT(<CELL1>,"^\W*([\d.]+)"))
and after that using:
IF(A1=A2...)

Related

Find all groups of 9 digits (\d{9}) up to a certain word

I have the following string extracted from a PDF file and I would like to obtain the nine digits "control class" number from it:
string = '(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)'
I want all the matches that occur before the word “Sector”, otherwise I will have undesired matches.
I’m using the “re” module, in Python 3.8.
I tried to use the negative lookbehind as follows:
(?<!Sector:)\d{9})
However, it didn’t work. I still had the matches like ‘54177846’ and ‘201874249’, which are after the ‘Sector’ word.
I also tried to “isolate” the search area between the words “Process ID” and “Sector”:
(Process ID:.*?)(\d{9})(.*Sector)
I also tried to search for the expression \d9 only up to the “Sector” word, but it returned no results.
I had to work around it in two steps: (1) I created a regex that would find everything up to the word “Sector” (desperate_regex = '(.*)Sector') and assigned the result to a new variable, partial_text; (2) I then searched for the desired regex ('\d{9}') within the new variable.
My code is working, but it does not satisfy me. How would I find my matches with a single regex search?
Please note that the first "control class" number is truncated with the text that comes before it ("CONTROL CLASS706345519").
(PS: I'm a total newbie, and this is my first post. I hope I explained myself clearly. Thank you!)
The easiest way is to get the string before Sector and just search that:
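import re
# Keep only the text before the first "Sector"; maxsplit=1 avoids an
# unpacking error when "Sector" is missing or appears more than once.
split_string = string.split("Sector", 1)[0]
nums = re.findall(r'\d{9}', split_string)
# ['706345519', '708393673', '706855190']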
Another would be to use the third-party regex module, which allows overlapping matches:
import regex as re
nums = re.findall(r'(\d{9}).*?Sector', string, overlapped=True)
# ['706345519', '708393673', '706855190']
The regex described below may be more than is required for the actual case being handled, but better safe than sorry.
If you want to match a string of exactly 9 digits, no more and no fewer, then you should use negative lookbehind and lookahead assertions to ensure that the 9 digits are neither preceded nor followed by another digit (again, in this case perhaps the OP knows that only 9-digit numbers will ever appear and this is overkill). You can also use a negative lookbehind assertion to ensure that Sector does not appear before the 9 digits. This latter assertion is a variable-length assertion, which requires the regex package from PyPI:
r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)'
(?<!Sector.*?) Assert that we haven't scanned past Sector. This handles the situation where Sector might appear multiple times in the input by ensuring that we never scan past the first occurrence.
(?<!\d) Assert that the previous character is not a digit.
\d{9} Match 9 digits.
(?!\d) Assert that the next character is not a digit.
The simplified version:
r'(?<!Sector.*?)\d{9}'
The code:
import regex as re
string = '(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)'
#print(re.findall(r'(?<!Sector.*?)\d{9}', string))
print(re.findall(r'(?<!Sector.*?)(?<!\d)\d{9}(?!\d)', string))
Prints:
['706345519', '708393673', '706855190']
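The effect of the two digit-boundary assertions, (?<!\d) and (?!\d), can be seen on their own with the standard re module, since both are fixed-width. This is a minimal sketch on a shortened, made-up sample string; it leaves out the Sector check, which needs the third-party regex package.

```python
import re

# Shortened, hypothetical sample: two 9-digit numbers and one 14-digit number.
sample = "CLASS706345519,708393673 SAAT: 54177846900115"

# Without boundaries, the first 9 digits of the 14-digit run also match.
plain = re.findall(r"\d{9}", sample)
# With boundaries, only runs of exactly 9 digits match.
bounded = re.findall(r"(?<!\d)\d{9}(?!\d)", sample)

print(plain)    # ['706345519', '708393673', '541778469']
print(bounded)  # ['706345519', '708393673']
```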
You could use an alternation and break if you find "Sector":
import re
text = """(some text before)Process ID: JD7717PO CONTROL CLASS706345519,708393673, 706855190 CODE AAZ-1585 ZZF-8017. Sector: Name:MULTIBANK S.A. SAAT: 54177846900115Date of Production2019/12/20\x02.02.037SBPEAA201874249B\x0c(some text after)"""
rx = re.compile(r'\d{9}|(Sector)')
results = []
for match in rx.finditer(text):
    if match.group(1):
        break
    results.append(match.group(0))
print(results)
Which yields
['706345519', '708393673', '706855190']
If either of these works, I'll add an explanation to it:
[\s\S]+(?:Process ID:\s+)(.*)(?:\s+Sector)[\s\S]+
\g<1>
Or this?
(?i)[\s\S]+(?:control\s+class\s*)(\d{9})[\s\S]+
\g<1>

regexextract isn't working the way I want in Google Sheets

...or rather...there's something wrong with my formula.
I have a series of item numbers, and I want to extract only the info between the first and 3rd dash, if any.
The info before the 1st dash must be letters.
The info between the 1st and second dash must be letters (i.e. A-z).
The info between the 2nd and 3rd dash must be numbers.
I want everything else to be ignored (I've wrapped my regexextract in an iferror to do this)
Here's my formula:
=arrayformula(iferror(regexextract(B1:B,"[A-z]+-([A-Z\{\\\]\^_`a-z]+-[0-9]+)-"),""))
It's working most of the time.
But for this: AAB-2971-PN-B-11-03
It extracts this: B-11
But I'm expecting this one to be an error/blank.
Other correct examples:
AAB-LL-1234-00 should extract LL-1234
AAN-1234 should error out
AAC-1234-LL should error out
AAC-1234-ll-123 should error out
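Use this regex:
^[A-Za-z]+-([A-Za-z]+-[0-9]+)-
REGEXEXTRACT returns the contents of the capture group, i.e. the part between the first and third dash. The ^ anchor forces the match to start at the beginning of the item number, so AAB-2971-PN-B-11-03 no longer matches on its PN-B-11- tail.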
There are a few problems with your regex, but the main one is that [A-z] does not mean "all letters"; it means "all characters between A and z", which includes the characters between Z and a, i.e. [, \, ], ^, _ and the backtick.
I suspect [A-Z\{\\\]\^_`a-z]+ is your attempt at [A-Za-z]+.
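The character-range point and the sample item numbers can be checked quickly in Python, whose re module should treat these simple patterns the same way as Sheets does. The pattern below is anchored with ^, which is an assumption made for this sketch so that the match has to start at the beginning of the item number. Treat it as an illustration on the sample data rather than a drop-in Sheets formula.

```python
import re

# [A-z] spans every code point from "A" to "z", punctuation included.
print(re.findall(r"[A-z]", "Z[\\]^_`a"))
# ['Z', '[', '\\', ']', '^', '_', '`', 'a']

# Anchored pattern (an assumption for this sketch): letters, dash,
# then the wanted "letters-digits" group, then a third dash.
pattern = re.compile(r"^[A-Za-z]+-([A-Za-z]+-[0-9]+)-")

items = ["AAB-LL-1234-00", "AAB-2971-PN-B-11-03", "AAN-1234",
         "AAC-1234-LL", "AAC-1234-ll-123"]
for item in items:
    m = pattern.search(item)
    # An empty string stands in for the IFERROR fallback.
    print(item, "->", m.group(1) if m else "")
```

Only AAB-LL-1234-00 produces a value (LL-1234); the other four fall through to the empty string.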
try:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(INDEX(SPLIT(B1:B, "-"),,2)&"", "\D+")&
REGEXEXTRACT(INDEX("-"&SPLIT(B1:B, "-"),,3), "-\d+")))
or:
=ARRAYFORMULA(IFERROR(IF((REGEXMATCH(INDEX(SPLIT(B1:B, "-"),,1), "[A-Za-z]+"))*
(NOT(REGEXMATCH(INDEX(SPLIT(B1:B, "-"),,1), "[0-9]+"))),
IFNA(REGEXEXTRACT(INDEX(SPLIT(B1:B, "-"),,2)&"", "\D+")&
REGEXEXTRACT(INDEX("-"&SPLIT(B1:B, "-"),,3), "-\d+")), )))

Hive REGEXP_EXTRACT returning null results

I am trying to extract R7080075 and X1234567 from the sample data below. The format is always a single upper case character followed by 7 digit number. This ID is also always preceded by an underscore. Since it's user generated data, sometimes it's the first underscore in the record and sometimes all preceding spaces have been replaced with underscores.
I'm querying HDP Hive with this in the select statement:
REGEXP_EXTRACT(column_name,'[(?:(^_A-Z))](\d{7})',0)
I've tried group indexes 0-2, and none of them returns an error or any data. I tested the pattern on regextester.com and it highlighted the data I want to extract. When I then run it in Zeppelin, it returns NULLs.
My regex experience is limited so I have reviewed the articles here on regexp_extract (+hive) and talked with a colleague. Thanks in advance for your help.
Sample data:
Sept Wk 5 Sunny Sailing_R7080075_12345
Holiday_Wk2_Smiles_X1234567_ABC
The Hive manual says this:
Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
Also, your expression includes unnecessary characters in the character class.
Try this:
REGEXP_EXTRACT(column_name,'_[A-Z](\\d{7})',0)
Since you want only the part without underscore, use this:
REGEXP_EXTRACT(column_name,'_([A-Z]\\d{7})',1)
It matches the entire pattern, but returns only capture group 1 (the part in parentheses) instead of the entire match (index 0).
Or alternatively:
REGEXP_EXTRACT(column_name,'(?<=_)[A-Z]\\d{7}', 0)
This uses a regexp technique called "positive lookbehind". It translates to: "find me an uppercase letter followed by 7 digits, but only if they are preceded by an _". It requires the _ to be present but doesn't consider it part of the extracted match.
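The difference between the capture-group and lookbehind variants can be tried out on the sample rows in Python. This is a sketch of the pattern logic only: Hive uses Java regular expressions and needs the doubled backslash (\\d) inside its string literal, whereas a Python raw string takes a single \d.

```python
import re

rows = [
    "Sept Wk 5 Sunny Sailing_R7080075_12345",
    "Holiday_Wk2_Smiles_X1234567_ABC",
]

for row in rows:
    # Capture group: the underscore is matched but only group 1 is returned.
    by_group = re.search(r"_([A-Z]\d{7})", row).group(1)
    # Lookbehind: the underscore is required but is not part of the match.
    by_lookbehind = re.search(r"(?<=_)[A-Z]\d{7}", row).group(0)
    print(by_group, by_lookbehind)
# R7080075 R7080075
# X1234567 X1234567
```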

Notepad++: find all columns of a table

I am currently trying to figure out how to find all columns of a table within an SQL statement using regex in Notepad++.
Let's take this query:
select
a.id,
a.id || a.name,
a.age,
b.id
From a,b
Now, I want to retrieve all columns for a using regex. The problem is that the real query is much larger, and I do not want to have to go through the whole query by hand.
The desired result is:
id
name
age
I already figured out that with
(?<=a\.)(\S+)
I match the desired strings, but Notepad++ still returns the whole lines and not only the words I need.
Can anyone help me here?
You may use this 2 step approach to extract values after a.:
Find: \ba\.(\w+)|(?s:.)
Replace With: (?1$1\n:)
Then, you need to remove duplicate lines to get the expected results.
Details
\ba\. - the substring a. with a as a whole word (a word boundary must precede it)
(\w+) - Group 1: one or more word chars (the group value will be kept + an LF will be appended in the replacement pattern)
| - or
(?s:.) - any char (it will be removed).
The (?1$1\n:) replacement means that the Group 1 value will be output and a line ending LF symbol will be appended to the result if Group 1 matches, else, empty string will be used as a replacement.
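If the query can be processed outside Notepad++, the same two steps (extract what follows a., then remove duplicates) can be sketched in Python. The variable names here are made up for illustration.

```python
import re

sql = """select
a.id,
a.id || a.name,
a.age,
b.id
From a,b"""

# Step 1: collect every identifier that follows the alias "a.".
columns = re.findall(r"\ba\.(\w+)", sql)
# Step 2: drop duplicates while keeping first-seen order.
unique_columns = list(dict.fromkeys(columns))

print(unique_columns)  # ['id', 'name', 'age']
```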
Maybe "matching non greedy" using "?" and looking for word boundaries can help? The expression would look like this (add a ? in the last bracket):
(?<=a\.)(\S+?\b)
This just came to mind as I read the question; I haven't tested it.
More information on non-greedy modifier can be found here.

Regular Expression - joining two lines, but first number of joined 2nd line is deleted

I have some sample data (simplified extract below - the real file contains 52,000 lines, with pairs of lines, the 2nd line of each pair is always a date field, and there are always 2 blank lines between each data pair):
The colour of money 20170233434
10-DEC-2015
SOME TEST DATA 32423412123
19-OCT-2015
I want to join each line up, using a Regular Expression (I am using TextPad, but I think the RegEx syntax is generic).
I am doing a replace search, and want to end up with this:
The colour of money 20170233434 10-DEC-2015
SOME TEST DATA 32423412123 19-OCT-2015
I am using this in the "Find what" field:
\n^[0|1|2|3|4|5|6|7|8|9]
And replacing with NULL.
The end result I am getting is almost there:
The colour of money 20170233434 0-DEC-2015
SOME TEST DATA 32423412123 9-OCT-2015
But not quite, because the first digit of each date value is being stripped out.
How would I modify the RegEx to not delete the first number of the 2nd line? I tried to replace with [0|1|2|3|4|5|6|7|8|9] but that just put that entire string in front of each date field, and still stripped out the first number of the date.
Just search for this
\r?\n(\d{1,2}\-)
And replace it with $1. See the live example here.
If you want to replace it with null, you can also use a lookahead:
\r?\n(?=\d{1,2}\-)
And replace it with null. See the live example here.
Those regular expressions only match a line break (\n on UNIX or \r\n on Windows) that is followed by one or two digits and then a dash. If you want to be more specific, you could also use this regular expression:
\r?\n(\d{1,2}\-[A-Z]{3}\-\d{4})
Or with a lookahead respectively:
\r?\n(?=\d{1,2}\-[A-Z]{3}\-\d{4})
You could even check for the double linebreaks after the statement (live example):
\r?\n(\d{1,2}\-[A-Z]{3}\-\d{4}(?:\r?\n){2})
Or with a lookahead respectively (live example):
\r?\n(?=\d{1,2}\-[A-Z]{3}\-\d{4}(?:\r?\n){2})
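To see why the original pattern eats the first digit and the lookahead does not, here is a small Python sketch. The sample text is hypothetical (no trailing spaces, \n line endings), and the replacement inserts a single space, which is an assumption made for this sample; the answers above replace with nothing.

```python
import re

# Hypothetical sample: no trailing spaces, "\n" line endings.
text = (
    "The colour of money 20170233434\n"
    "10-DEC-2015\n"
    "\n"
    "\n"
    "SOME TEST DATA 32423412123\n"
    "19-OCT-2015\n"
)

# Consuming the digit (as in the question) deletes it along with the newline.
broken = re.sub(r"\n[0-9]", "", text)

# A lookahead only checks for the date, so nothing but the newline is replaced.
joined = re.sub(r"\r?\n(?=\d{1,2}-[A-Z]{3}-\d{4})", " ", text)

print(broken.splitlines()[0])  # The colour of money 201702334340-DEC-2015
print(joined.splitlines()[0])  # The colour of money 20170233434 10-DEC-2015
```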