Bizarre Behavior with REGEXP_MATCH in Google Big Query

Bizarre Behavior with REGEXP_MATCH in Google Big Query - regex

I'm seeing very bizarre behavior with the REGEXP_MATCH function in google big query. The function appears to work perfectly fine for public data but is not working on my dataset. I have a dataset imported from the csv with the first two lines (first is header row which all becomes the schema where everything is a string), there's a lot more but the following is the only relevant data for this case.
"id","common_name","botanical_name","low_hardiness_zone","high_hardiness_zone","type","exposure_min","exposure_max","moisture_min","moisture_max"
"plant1","Abelia","Abelia zanderi 'Conti (Confetti)'","5b","9a","Shrub","Partial Sun","Full Sun","Dry","Dry"
When I run the query:
SELECT * FROM [PlantLink_Plant_Types.plant_data_set]
WHERE REGEXP_MATCH('common_name',r'.*')
I get every result.
However, when I run the query:
SELECT * FROM [PlantLink_Plant_Types.plant_data_set]
WHERE REGEXP_MATCH('common_name',r'A.*')
I get no results, which is really weird because the plant common name Abelia starts with an A.
Now my regex magic is not that strong, but I am pretty sure the pattern is not at fault. Additionally I've run the public dataset test queries with REGEXP_MATCH and they run correctly. Does anyone have any clue why REGEXP_MATCH would not always function as advertised?

Note:
REGEXP_MATCH('common_name',r'.*') matches the string 'common_name'
while
REGEXP_MATCH(common_name,r'.*') matches a field in your table that is called common_name
the 1st one is always true and therefore you get all results.
I guess you wanted to refer the content of the field, so you need to use the second one.
REGEXP_MATCH(common_name,r'A.*') should return all records that field common_name contains "A".
hope this helps.

Issue is the string 'common_name' does not start with 'A'.
Check this:
REGEXP_MATCH('common_name',r'.*'): All results.
REGEXP_MATCH('common_name',r'A.*'): No results.
REGEXP_MATCH('common_name',r'c.*'): All results.
REGEXP_MATCH(common_name,r'A.*'): All results that somewhere have an 'A'.
:)

Related

regexp_match not working on text string with special characters in google data studio

I've been trying to use REGEXP_MATCH to create a custom field in Google Data Studio but it's not working as expected.
Example of the data I'm using it on (this is how the data is formatted in the tags_name field:
{construction,po-johnson,po-james}
{construction,po-sandy,po-occonor}
The objective is to check if a certain name exists, then create a new label.
Here's the code I'm trying (tags_name is the field name where the original text string exists):
CASE
WHEN REGEXP_MATCH(tags_name, ".*(johnson?).*") THEN "Marc Johnson"
WHEN REGEXP_MATCH(tags_name, ".*(occonor?).*") THEN "Sam Occonor"
ELSE "undefined"
END
Is this happening due to the presence of the curly brackets/commas/hyphens?

I've tried to reproduce the error in Google Data Studio based on your problem statement. Everything worked exactly as expected though.
I've entered your input (and a few other expressions for confirmation) in the tags_name field and placed your REGEXP_MATCH function into another field:
Here is the result:
Is this the result you expected?
Is there still an issue? If so, you could edit your question and add corresponding screenshots.

checking if your elemMatch regex conditions work

I have a MongoDB query:
db.list.find({categories:{$elemMatch:{ "$regex":".*Bar.*", $not:/^Barbeque/}}}).pretty()
where it looks at the elements in the categories array and I think gets all documents where there is an element that contains "Bar" but none that contain "Barbecue". How to I check make sure that my query is correct?
Let me know if my query is wrong and how I could fix it.

You should specify the end of string like this:
$not:/^Barbeque$/
rather than
$not:/^Barbeque/
because if any of file contains (i.e. "Barbequee"), it must return to your query result.
about make sure that my query is correct or not, i have no idea. But if your query is correct, mongo must return the value that matches the written query.
if something goes wrong, it must be that the logic in the query that results in mongo returns an unexpected value for you.
so check your query before you run it. :D

Amazon Athena: no viable alternative at input

While creating a table in Athena; it gives me following exception:
no viable alternative at input

hyphens are not allowed in table name.. ( though wizard allows it ) .. Just remove hyphen and it works like a charm

Unfortunately, at the moment the syntax validation error messages are not very descriptive in Athena, this error may mean "almost" any possible syntax errors on the create table statement.
Although this is annoying at the moment you will need to check if the syntax follows the Create table documentation
Some examples are:
Backticks not in place (as already pointed out)
Missing/extra commas (remember that the last column doesn't need the comma after column definition
Missing spaces
More ..

This error generally occurs when the syntax of DDL has some silly errors.There are several answers that explain different errors based on there state.The simple solution to this problem is to patiently look into DDL and verify following points line by line:-
Check for missing commas
Unbalanced `(backtick operator)
Incompatible datatype not supported by HIVE(HIVE DATA TYPES REFERENCE)
Unbalanced comma
Hypen in table name

In my case, it was because of a trailing comma after the last column in the table. For example:
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
one STRING,
two STRING,
) LOCATION 's3://my-bucket/some/path';
After I removed the comma at the end of two STRING, it worked fine.

My case: it was an external table and the location had a typo (hence didn't exist)
Couple of tips:
Click the "Format query" button so you can spot errors easily
Use the example at the bottom of the documentation - it works - and modify it with your parameters: https://docs.aws.amazon.com/athena/latest/ug/create-table.html

Slashes. Mine was slashes. I had the DDL from Athena, saved as a python string.
WITH SERDEPROPERTIES (
'escapeChar'='\\',
'quoteChar'='\"',
'separatorChar'=',')
was changed to
WITH SERDEPROPERTIES (
'escapeChar'='\',
'quoteChar'='"',
'separatorChar'=',')
And everything fell apart.
Had to make it:
WITH SERDEPROPERTIES (
'escapeChar'='\\\\',
'quoteChar'='\\\"',
'separatorChar'=',')

In my case, it was an extra comma in PARTITIONED BY section,

In my case, I was missing the singlequotes for the S3 URL

In my case, it was that one of the table column names was enclosed in single quotes, as per the AWS documentation :( ('bucket')

As other users have noted, the standard syntax validation error message that Athena provides is not particularly helpful. Thoroughly checking the required DDL syntax (see HIVE data types reference) that other users have mentioned can be pretty tedious since it is fairly extensive.
So, an additional troubleshooting trick is to let AWS's own data parsing engine (AWS Glue) give you a hint about where your DDL may be off. The idea here is to let AWS Glue parse the data using its own internal rules and then show you where you may have made your mistake.
Specifically, here are the steps that worked for me to troubleshoot my DDL statement, which was giving me lots of trouble:
create a data crawler in AWS Glue; AWS and lots of other places go through the very detailed steps this requires so I won't repeat it here
point the crawler to the same data that you wanted (but failed) to upload into Athena
set the crawler output to a table (in an Athena database you've already created)
run the crawler and wait for the table with populated data to be created
find the newly-created table in the Athena Query Editor tab, click on the three vertical dots (...), and select "Generate Create Table DLL":
this will make Athena create the DLL for this table that is guaranteed to be valid (since the table was already created using that DLL)
take a look at this DLL and see if/where/how it differs from the DLL that you originally wrote. Naturally, this automatically-generated DLL will not have the exact choices for the data types that you may find useful, but at least you will know that it is 100% valid
finally, update your DLL based on this new Glue/Athena-generated-DLL, adjusting the column/field names and data types for your particular use case

After searching and following all the good answers here.
My issue was that working in Node.js i needed to remove the optional
ESCAPED BY '\' used in the Row settings to get my query to work. Hope this helps others.

Something that wasn't obvious for me the first time I used the UI is that if you get an error in the create table 'wizard', you can then cancel and there should be the query used that failed written in a new query window, for you to edit and fix.
My database had a hypen, so I added backticks in the query and rerun it.

This happened to me due to having comments in the query.
I realized this was a possibility when I tried the "Format Query" button and it turned the entire thing into almost 1 line, mostly commented out. My guess is that the query parser runs this formatter before sending the query to Athena.
Removed the comments, ran the query, and an angel got its wings!

Group by similar words

Is there any way to group a table by a text field, having in count that this text field is not always exactly the same?
Example:
select city_hotel, count(city_hotel)
from hotels, temp_grid
where st_intersects(hotels.geom, temp_grid.geom)
and potential=1
and part=4
group by city_hotel
order by (city_hotel) desc
The output I get is the expected, for example, City name and count:
"Vassiliki ";1
"Vassiliki";1
"Vassilias, Skiathos";1
"Vassilias";5
"Vasilikí";25
"Vasiliki";23
"Vasilias";1
But I'd want to group more this field, and get only one "Vasiliki" (or an array with all, this is not a problem) and a count of all the cells containing something similar between them.
I do not know if could this be possible. Maybe some function to text analysis or something similar?

SELECT COUNT(*), `etc` FROM table GROUP BY textfield LIKE '%sili%';
// The '%' is a SQL wildcard, which matches as many of any character as required.
You could do something like the above, choosing a word for the 'like' that best fits the spellings that your users have used.
Something that can help with that would be to do a
SELECT COUNT(*), textfield FROM table GROUP BY textfield ORDER BY textfield;
And selecting the most 'average' spelling for your words.
Otherwise you're starting to get into a bit of language processing, and for that you will want to write some code outside of SQL.
This would be something like https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
To find word's that are the same within an arbitrary margin of error.
There is a MySQL implementation here that you should be able to transpose as needed
https://stackoverflow.com/a/6392380/1287480
(credit https://stackoverflow.com/a/3515291/1287480)
.
(Personal thoughts on the topic)
You Really Really want to think about limiting the input from users that can give you this issue in the first place. It's far far better to give the users a list of places to select from, than it is to push potentially 'dirty' information into your database. That eventually always winds up with you trying to clean the information at a later time. A problem that has kept many people employed for many years.

How to get constraint errors from OCIErrorGet?

Our C++ program is using Oracle and OCI to do its database work. Occasionally, the user will trigger a constraint violation, which we detect and then show an error message from OCIErrorGet. OCIErrorGet returns strings like this:
ORA-02292: integrity constraint (MYSCHEMA.CC_MYCONSTRAINT) violated - child record found
ORA-06512: at line 5
I am looking for the cleanest way to extract "MYSCHEMA.CC_MYCONSTRAINT" from the Oracle error. Knowing the name of the constraint, I could show a better error message (our code could look up a very meaningful error message if it had access to the constraint name).
I could use a regex or something and assume that the Oracle message will never change, but this seems a little fragile to me. Or I could look for specific ORA codes and then grab whatever text falls between the parentheses. But I was hoping OCI had a cleaner/more robust way, if a constraint fails, to figure out the actual name of the failed constraint without resorting to hardcoded string manipulation.
Any ideas?

According to the Oracle Docs, a string search is exactly what you need to do:
Recognizing Variable Text in Messages
To help you find and fix errors, Oracle embeds object names, numbers,
and character strings in some messages. These embedded variables are
represented by string, number, or character, as appropriate. For
example:
ORA-00020: maximum number of processes (number) exceeded
The preceding message might actually appear as follows:
ORA-00020: maximum number of processes (50) exceeded
Oracle makes a big point in their docs of saying the strings will be kept up to date in their section on "Message Accuracy." It's a pretty strong suggestion that they intend you to do a string search.
Also, according to this website, the Oracle Error structure also pretty strongly implies that they intend you to do a string search, because the data structure lacks anything else for you to get:
array(4) {
["code"]=>int(942)
["message"]=>string(40) "ORA-00942: table or view does not exist"
["offset"]=>int(14)
["sqltext"]=>string(32) "select * from non_existing_table"
}
This output reveals the following information:
The variable $erris an array with four elements.
The first element is accessible by the key ‘code’ and its value is number 942.
The second value is accessible by the key ‘message’ and the value is string “ORA-00942: table or view does not exist”.
The third value is accessible by the key ‘offset’, and its value is the number 14. This is the character before the name of the
non-existing table.
The fourth member is the problematic SQL message causing the error in the first place.
I agree with you; it would be great if there were a better way to get the constraint name you're violating, but string-matching seems to be the intended way.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Bizarre Behavior with REGEXP_MATCH in Google Big Query - regex

Issue is the string 'common_name' does not start with 'A'. Check this: REGEXP_MATCH('common_name',r'.'): All results. REGEXP_MATCH('common_name',r'A.'): No results. REGEXP_MATCH('common_name',r'c.'): All results. REGEXP_MATCH(common_name,r'A.'): All results that somewhere have an 'A'. :)

Related

regexp_match not working on text string with special characters in google data studio

checking if your elemMatch regex conditions work

Amazon Athena: no viable alternative at input

Group by similar words

How to get constraint errors from OCIErrorGet?

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Bizarre Behavior with REGEXP_MATCH in Google Big Query - regex

Issue is the string 'common_name' does not start with 'A'. Check this: REGEXP_MATCH('common_name',r'.*'): All results. REGEXP_MATCH('common_name',r'A.*'): No results. REGEXP_MATCH('common_name',r'c.*'): All results. REGEXP_MATCH(common_name,r'A.*'): All results that somewhere have an 'A'. :)

Related

regexp_match not working on text string with special characters in google data studio

checking if your elemMatch regex conditions work

Amazon Athena: no viable alternative at input

Group by similar words

How to get constraint errors from OCIErrorGet?

Categories

Resources

Issue is the string 'common_name' does not start with 'A'. Check this: REGEXP_MATCH('common_name',r'.'): All results. REGEXP_MATCH('common_name',r'A.'): No results. REGEXP_MATCH('common_name',r'c.'): All results. REGEXP_MATCH(common_name,r'A.'): All results that somewhere have an 'A'. :)