Optimize join with regex

I have one table (A) containing a phrase column, and another table (B) whose phrases I want to find WITHIN table A's phrases. So I'm joining them as follows:
CREATE TABLE C AS
SELECT A.*
FROM A
JOIN B
WHERE (A.phrase LIKE concat("%", B.phrase, "%"));
It is taking a long time because it's only using one reducer, which I believe has to do with the nature of the query. Is there a way of speeding this up? I don't think a map join or bucket join would help, because I'm not equating two columns; rather, I'm searching within one table for phrases from another table...

I found the solution.
The problem was that Hive doesn't do non-equi joins well. So I did equi joins to get a subset of table A before I did the non-equi LIKE join. So, 3 steps (a rough sketch follows the list):
Break A.phrase and B.phrase into individual words.
Equi-join on these words to see which keywords from B.phrase are equal to any keywords from A.phrase; this gives a subset of table A where A.phrase contains at least one keyword from B.phrase.
Use this table A subset to find the whole "%B.phrase%".
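A minimal HiveQL sketch of those three steps, assuming both tables have a single phrase column, words are separated by single spaces, and the intermediate table names (a_words, b_words, a_subset) are purely illustrative:
-- Step 1: break each phrase into individual words (space-delimited assumption)
CREATE TABLE a_words AS
SELECT A.phrase, word
FROM A LATERAL VIEW explode(split(A.phrase, ' ')) w AS word;
CREATE TABLE b_words AS
SELECT B.phrase, word
FROM B LATERAL VIEW explode(split(B.phrase, ' ')) w AS word;
-- Step 2: equi-join on single words to get the candidate subset of A
CREATE TABLE a_subset AS
SELECT DISTINCT a_words.phrase
FROM a_words
JOIN b_words ON (a_words.word = b_words.word);
-- Step 3: run the expensive LIKE join only against the smaller candidate set
CREATE TABLE C AS
SELECT a_subset.*
FROM a_subset
JOIN B
WHERE (a_subset.phrase LIKE concat("%", B.phrase, "%"));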

I think that EXISTS may be faster, simply because your original query will return the same row from A multiple times when there are multiple matches:
SELECT a.*
FROM A as a
WHERE EXISTS (
    SELECT 1
    FROM B
    WHERE a.phrase LIKE concat("%", B.phrase, "%")
);

Related

How to use Calculated Join in Tableau using 1-M relationship and regex_extract

Question 1 - Is Tableau able to use multiple results from a single line in a REGEXP, using the global flag, to compare against another table during a Join operation? If no, question 2 is null. If yes...
Question 2 - I'm attempting to join two data sources in Tableau using a regexp in a calculated join because the left table has 1 value in each cell (ie. 64826) and the right table has 4 possible matches in each cell (ie. 00000|00000|21678|64826).
The problem is that my regex stops looking after it finds 1 match (the first of 4 values), and the global flag /g has the opposite effect to what I expected and eliminates all matches.
I've tried calculated joins on the Data Source tab. I've also tried separating those 4 values into their own columns in worksheets using
regexp_extract_nth. In both cases, regex stops looking after the first result. A Left Join seems to work somewhat, while an Outer Join returns nothing.
REGEXP_EXTRACT([Event Number],'(\d{5})')
REGEXP_EXTRACT_NTH([Event Number],'(?!0{5})(\d{5})',1)
With these examples, regex would match a NULL with the left table even though 64826 is in the right table. I expect the calculated join to return all possible matches from the right set, so there'd be a match on 21678 and on 64826, duplicating rows in the right table like so...
21678 - 00000|00000|21678|64826
64826 - 00000|00000|21678|64826
45245 - 45106|45245|00000|00000
45106 - 45106|45245|00000|00000
Your original expression is just fine. We might want to make sure that we are sending the right command in Tableau, which I'm not so sure about, so let's try an expression similar to:
\b([^0]....)\b
even just for testing, and then modify the commands to:
REGEXP_EXTRACT([Event Number], '\b([^0]....)\b')
or:
REGEXP_EXTRACT_NTH([Event Number], '\b([^0]....)\b', 1)
to see what happens. I'm assuming that the desired numbers won't start with 0.

Use multiple replace conditions for a single column in Amazon Redshift

I have a table where the amount column contains , and $ signs, for example $8,122.14. I want to write a replace function to remove both $ and , from that column in one go. Is there any way we can write multiple conditions in one replace in Redshift? Also, this is a part of post-processing the data, where I am inserting data from a stage table into a final table after replacing these values.
I tried the two approaches (take 1 and take 2) given in the code below, but both of them failed.
Take 1:
insert into db.stage_table
select
(coalesce(replace(logging_amount,'$',','),''))) as logging_amount
from db.table;
Take 2:
insert into db.stage_table
select
(coalesce(replace(logging_amount,'$',',')) as logging_amount
from db.table;
Both of them failed. The expected result is a single replace statement that handles both characters.
Yes, you can nest replace calls like this:
replace(replace(logging_amount,'$',''),',','')
Or you can use regex if you prefer (personally, for something like this I think nested replaces are easier to read).
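For example, a sketch of the full insert using nested replaces; the final-table name and the cast to a numeric type are my own illustrative assumptions rather than details from the question, and the commented regexp_replace line is the regex alternative mentioned above:
INSERT INTO db.final_table
SELECT
    CAST(replace(replace(logging_amount, '$', ''), ',', '') AS DECIMAL(18,2)) AS logging_amount
    -- regex alternative: CAST(regexp_replace(logging_amount, '[$,]', '') AS DECIMAL(18,2))
FROM db.stage_table;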

Google Sheets Pattern Matching/RegEx for COUNTIF

The documentation for pattern matching for Google Sheets has not been helpful. I've been reading and searching for a while now and can't find this particular issue. Maybe I'm having a hard time finding the correct terms to search for but here is the problem:
I have several numbers (part numbers) that follow this format: ##-####
Categories can be defined by the part numbers, i.e. 50-03## would be one product category, and the remaining 2 digits are specific for a model.
I've been trying to run this:
=countif(E9:E13,"50-03[123][012]*")
(E9:E13 contains the part number formatted as text. If I format it any other way, the values show up screwed up because Google Sheets thinks I'm writing a date or trying to do arithmetic.)
This returns 0 every time, unless I were to change to:
=countif(E9:E13,"50-03*")
So it seems like wildcards work, but pattern matching does not?
As you identified, and as Wiktor mentioned, COUNTIF only supports wildcards.
There are many ways to do what you want though; to name but two:
=ArrayFormula(SUM(--REGEXMATCH(E9:E13, "50-03[123][012]*")))
=COUNTA(FILTER(E9:E13, REGEXMATCH(E9:E13, "50-03[123][012]*")))
This is a really big hammer for a problem like yours, but you can use QUERY to do something like this:
=QUERY(E9:E13, "select count(E) where E matches '50-03[123][012]' label count(E) ''")
The label bit is to prevent QUERY from adding an automatic header to the count() column.
The nice thing about this approach is that you can pull in other columns, too. Say that over in column H, you have a number of orders for each part. Then, you can take two cells and show both the count of parts and the sum of orders:
=QUERY(E9:H13, "select count(E), sum(H) where E matches '50-03[123][012]' label count(E) '', sum(H) ''")

Update vlookup table array

Suppose I have the following vlookup command:
=VLOOKUP('Sheet1'!S2,'Sheet2'!$B$138:$C$145,2,FALSE)
When I drag the vlookup to the right I want it to update to
=VLOOKUP('Sheet1'!S2,'Sheet2'!$B$146:$C$153,2,FALSE)
In other words, I want the letters B and C fixed but the numbers to increment by 8. How would I do this?
It looks like your answer is always in column C and your lookup value in column B. If this is the case, use INDEX MATCH:
=INDEX('Sheet2'!$C:$C, MATCH('Sheet1'!S2, 'Sheet2'!$B:$B, 0))
I've made a few assumptions. Drop a pic of your tables and I can amend it if you can't work out which bits to change
That may be possible using some long if-then-else logic or a macro, but it seems an odd thing to do. The numbers represent rows, so if you are incrementing them across columns I wonder whether you need to transpose your data and/or use HLOOKUP instead. There is probably a better way to achieve what you want, but it is difficult to answer from the question as provided.

regular expression or replace function in where clause of a mysql query

I wrote a MySQL query
select * from table where name like '%salil%'
which works fine, but it will not return records with names like 'sal-il' or 'sa#lil'.
So I want a query something like the one below:
select * from table where remove_special_character_from(name) like '%salil%'
Here remove_special_character_from(name) is a MySQL function or a regular expression which removes all the special characters from name before the LIKE is executed.
No, MySQL doesn't support regexp-based replace.
I'd suggest using normalized versions of the search terms, stored in separate fields.
So, at insert time you strip all non-alpha characters from the data and store the result in a data_norm field for future searches.
Since I know of no way to do this directly, I'd use a "calculated column", i.e. a column which depends on the value of name but without the special characters. This way, the cost of the transformation is paid only once, and you can even create an index on the new column.
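A minimal sketch of that extra-column idea; the table name is illustrative, and note that REGEXP_REPLACE only exists in MySQL 8.0+, so on older versions you would strip the characters in application code before inserting:
-- add and backfill a normalized column (REGEXP_REPLACE requires MySQL 8.0+)
ALTER TABLE users ADD COLUMN data_norm VARCHAR(255);
UPDATE users SET data_norm = REGEXP_REPLACE(name, '[^A-Za-z]', '');
-- optionally index the new column, as suggested above
CREATE INDEX idx_users_data_norm ON users (data_norm);
-- search against the normalized column
SELECT * FROM users WHERE data_norm LIKE '%salil%';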
I agree with Aaron and Col. Shrapnel that you should use an extra column on the table e.g. search_name to store a normalised version of the name.
I noticed that this question was originally tagged ruby-on-rails. If this is part of a Rails application then you can use a before_save callback to set the value of this field.
In MySQL 5.1 you can use REGEXP to do regular expression matching, like this:
SELECT * FROM foo WHERE bar REGEXP "baz"
see http://dev.mysql.com/doc/refman/5.1/en/regexp.html
However, take note that it will be slow, and you should do what the other posters suggested and store the cleaned value in a separate field.
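Applied to the original 'salil' example, a hedged sketch of what that REGEXP match could look like; the pattern allowing arbitrary non-letter characters between the letters is my own illustration, and the table name is hypothetical:
-- matches 'salil', 'sal-il', 'sa#lil', etc.; still a full scan, so slow on large tables
SELECT *
FROM users
WHERE name REGEXP 's[^a-zA-Z]*a[^a-zA-Z]*l[^a-zA-Z]*i[^a-zA-Z]*l';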