SUM with conditions in a query/importrange from another Google sheet - regex

I have a table like this in spreadsheet A:
| title | total |
|-------|-------|
| X1    | 2     |
| Y     | 3     |
| Z     | 4     |
| X2    | 5     |
Since spreadsheet A is constantly updated and already uses other formulas, I need to export it to another sheet to work on.
I also need to sum the Total column when the Title column matches a condition such as a regexp.
The result should be:
| title | total |
|-------|-------|
| X     | 7     |
| Y     | 3     |
| Z     | 4     |
Please advise on this case. I've been studying QUERY with the SUMIF formula, but it does not seem to support summing when the condition is a pattern rather than an exact match.
Thanks in advance.

You can try SUMIFS() with the wildcard option. Use the formula below:
=SUMIFS($B$2:$B$5,$A$2:$A$5,D2 & "*")
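Here D2 holds the grouped title (assuming X, Y, and Z are listed in D2:D4), so D2 & "*" becomes the wildcard X*, which matches both X1 and X2 and gives a total of 7.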

After you allow access, try:
=INDEX(QUERY({REGEXREPLACE(
IMPORTRANGE("id", "sheetname!A2:A"), "\d+$", ),
IMPORTRANGE("id", "sheetname!B2:B")},
"select Col1,sum(Col2)
where Col1 is not null
group by Col1
label sum(Col2)''"))
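The REGEXREPLACE strips trailing digits ("\d+$") from the imported title column, so X1 and X2 both collapse to X before QUERY groups by Col1 and sums Col2; the empty label '' suppresses the default "sum" header.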


How to extract domain name using dynamic regex in Redshift?

I need to extract the domain name from a URL using Redshift (PostgreSQL-based). Example: extract 'google.com' from 'www.google.com'. Each URL in my dataset has a different top-level domain (TLD). My approach was to first join the matching TLD to the dataset and then use a regex to extract 'first_string.TLD'. In Redshift, I'm getting the error 'The pattern must be a valid UTF-8 literal character expression'. Is there a way around this?
A sample of my dataset:
+---+------------------------+--------------+
| id| trimmed_domain | tld |
+---+------------------------+--------------+
| 1 | sample.co.uk | co.uk |
| 2 | www.sample.co.uk | co.uk |
| 3 | www3.sample.co.uk | co.uk |
| 4 | biz.sample.co.uk | co.uk |
| 5 | digital.testing.sam.co | co |
| 6 | sam.co | co |
| 7 | www.google.com | com |
| 8 | 1.11.220 | |
+---+------------------------+--------------+
My code:
SELECT t1.extracted_domain, COUNT(DISTINCT(t1.id))
FROM (
    SELECT
        d.id,
        d.trimmed_domain,
        CASE
            WHEN d.tld IS NULL THEN d.trimmed_domain
            ELSE regexp_replace(d.trimmed_domain, '(.*\.)((.[a-z]*).*'||replace(tld,'.','\.')||')', '\2')
        END AS "extracted_domain"
    FROM dataset d
) t1
GROUP BY 1
ORDER BY 2;
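(This is what the error message is about: regexp_replace() in Redshift requires the pattern to be a string literal, so it cannot be built per row from the tld column with the || concatenation above.)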
Expected output:
+------------------------+--------------+
| extracted_domain | count |
+------------------------+--------------+
| sample.co.uk | 4 |
| sam.co | 2 |
| google.com | 1 |
| 1.11.220 | 1 |
+------------------------+--------------+
I'm not so sure about the query itself, but you can design and test the expression in an online regex tester before modifying your query.
My guess is that maybe this would help:
^(?!d|b|www3).*
You can list any prefix that you wish to exclude in the lookahead, separated by | (OR): (?!d|b|www3).
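As a quick sanity check of that lookahead in Python (the prefixes and sample strings here are illustrative, borrowed from the dataset above):

import re

# Negative lookahead: keep only strings that do NOT start with one of
# the listed prefixes.
pattern = re.compile(r'^(?!www3|biz)\S+')
for s in ['sample.co.uk', 'www3.sample.co.uk', 'biz.sample.co.uk']:
    if pattern.match(s):
        print(s)  # prints only sample.co.uk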
Alternatively, you may want to match your desired URLs with an expression similar to:
^(sam|www.google|1.11|www.sample|www3.sample).*
So, I've found a solution. Redshift does not support column-based regex patterns, so the alternative is to use a Python UDF:
1. Change the tld column into a regex pattern.
2. Go row by row and extract the domain name using the regex pattern column.
3. Group by extracted_domain and count the users.
The SQL query is as below:
CREATE OR REPLACE FUNCTION extractor(col_domain varchar)
RETURNS varchar
IMMUTABLE AS $$
    # col_domain holds the TLD for the row (e.g. 'co.uk');
    # copy it character by character into _regex
    _regex = ''
    for domain in col_domain:
        if domain is None:
            continue
        else:
            _regex += r'{}'.format(domain)
    # e.g. 'co.uk' -> ([^/.]+\.(co.uk)), capturing the label just before the TLD
    domain_regex = r'([^/.]+\.({}))'.format(_regex)
    return domain_regex
$$ LANGUAGE plpythonu;
CREATE OR REPLACE FUNCTION regex_match(in_pattern varchar, input_str varchar)
RETURNS varchar
IMMUTABLE AS $$
    import re
    if in_pattern == '':
        # no TLD known for this row: keep the raw domain (e.g. 1.11.220)
        a = str(input_str)
    else:
        a = str(re.search(in_pattern, input_str).group())
    return a
$$ LANGUAGE plpythonu;
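For example, extractor('co.uk') returns the pattern ([^/.]+\.(co.uk)), and regex_match() then pulls sample.co.uk out of www.sample.co.uk; rows with a null tld get an empty pattern and keep their raw domain (e.g. 1.11.220). Note that the dots inside the TLD are not escaped, so each one matches any character.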
SELECT
    t2.extracted_domain,
    COUNT(DISTINCT(t2.id)) AS "Unique Users"
FROM (
    SELECT
        t1.id,
        t1.trimmed_domain,
        regex_match(t1.regex_pattern, t1.trimmed_domain) AS "extracted_domain"
    FROM (
        SELECT
            id,
            trimmed_domain,
            CASE WHEN tld IS NULL THEN '' ELSE extractor(tld) END AS "regex_pattern"
        FROM dataset
    ) t1
) t2
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
The Python UDFs seem to be slow on a large dataset, so I'm open to suggestions on improving the query.
If you know the prefixes you would like to remove from the domains, then why not just exclude them? The following query simply removes the known www/http/etc. prefixes from the domain names and counts the normalized names.
SELECT COUNT(*) FROM
    (SELECT REGEXP_REPLACE(domain, '^(https|http|www|biz)') AS normalized_domain
     FROM domains) t
GROUP BY normalized_domain;

Regular Expression to parse Query Output

I need to execute a metadata query which will dump a list of tables into a file. However, I need a way to eliminate all the formatting besides the tableId itself. Can this be done with a regex? I appreciate all help in advance.
+-------------------------------------+-------+
| tableId | Type |
+-------------------------------------+-------+
| t_margins | TABLE |
| t_rev_test | TABLE |
| t_rev_share | TABLE |
You have some options, but I would suggest something like this:
^\| (\S+)
It matches from the start of the line: a pipe, a space, and then a run of non-spaces. The non-spaces will be your tableId. Here is a little example in Python:
import re

my_string = '''| t_margins | TABLE |
| t_rev_test | TABLE |
| t_rev_share | TABLE |'''

my_list = my_string.split('\n')
for line in my_list:
    match = re.search(r"^\| (\S+)", line)
    print(match.group(1))
This will give you:
t_margins
t_rev_test
t_rev_share
The following regexp captures just the values of the first column:
^\| (\w+)
https://regex101.com/r/gODhra/3

Power BI counting the occurrences in a row

Here is my data:
Name| 1st | 2nd | 3rd | 4th | 5th
Ann | five | five | four | five | one
Tom | four | one | four | five | four
and what I want to do is create columns containing the number of occurrences in each row, so in this case what I want to achieve is:
Name| 1st | 2nd | 3rd | 4th | 5th | Five| Four | One
Ann | five | five | four | five | one | 3 | 1 | 1
Tom | four | one | four | five | four | 1 | 3 | 1
Ideally, you want to unpivot your data so that it looks like this:
Name | Number | Value
-----|--------|------
Ann | 1st | five
Ann | 2nd | five
Ann | 3rd | four
Ann | 4th | five
Ann | 5th | one
Tom | 1st | four
Tom | 2nd | one
Tom | 3rd | four
Tom | 4th | five
Tom | 5th | four
Then you could easily create a matrix visual like this by putting Name on the rows, Value on the columns, and the count of Number in the values field.
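In Power BI itself, this reshape is the Unpivot Columns (or Unpivot Other Columns) transform in Power Query. Purely as an illustration, here is a minimal pandas sketch of the same unpivot and count; the column names are taken from the sample above:

import pandas as pd

# Wide layout as in the question
df = pd.DataFrame({
    'Name': ['Ann', 'Tom'],
    '1st': ['five', 'four'],
    '2nd': ['five', 'one'],
    '3rd': ['four', 'four'],
    '4th': ['five', 'five'],
    '5th': ['one', 'four'],
})

# Unpivot: one row per (Name, Number, Value), exactly the long layout shown
long = df.melt(id_vars='Name', var_name='Number', value_name='Value')

# Count occurrences of each value per name, as the matrix visual would
counts = long.groupby(['Name', 'Value']).size().unstack(fill_value=0)
print(counts)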
I don't recommend it, but if you need to keep it in your current layout, then your calculated columns could be written like:
Five = (TableName[1st] = "five") + (TableName[2nd] = "five") + (TableName[3rd] = "five") +
(TableName[4th] = "five") + (TableName[5th] = "five")
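Each comparison returns TRUE or FALSE, which DAX coerces to 1 and 0 under addition, so the sum is the number of columns matching "five".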
The Four and One column formulas would be analogous.
I have a similar issue here, but my table is not fixed in terms of columns: on the next refresh of the database the number of columns may change, getting bigger (more columns) or smaller (fewer columns), and perhaps the column headers change too. In that case I can't pass column names or indexes for the counter to look for; the formula needs to look at the entire row, no matter how many columns there are or what they are named.

Python spark extract characters from dataframe

I have a dataframe in spark, something like this:
ID | Column
------ | ----
1 | STRINGOFLETTERS
2 | SOMEOTHERCHARACTERS
3 | ANOTHERSTRING
4 | EXAMPLEEXAMPLE
What I would like to do is extract the first 5 characters from the column plus the 8th character and create a new column, something like this:
ID | New Column
------ | ------
1 | STRIN_F
2 | SOMEO_E
3 | ANOTH_S
4 | EXAMP_E
I can't use the following code, because the values in the column differ, and I don't want to split on a specific character but at a fixed position (the 6th character):
import pyspark
split_col = pyspark.sql.functions.split(DF['column'], ' ')
newDF = DF.withColumn('new_column', split_col.getItem(0))
Thanks all!
Use something like this:
from pyspark.sql.functions import concat, lit

df.withColumn('new_column', concat(df.Column.substr(1, 5),
                                   lit('_'),
                                   df.Column.substr(8, 1)))
This uses the functions substr and concat: substr(1, 5) takes the first five characters (positions are 1-based) and substr(8, 1) takes the single character at position 8. Those functions will solve your problem.

Why doesn't this command line (Batch file) work?

I have a command line which looks for certain IDs (2 IDs) in the 2nd column. But I want this command to search all the columns, not just the second.
Can anyone help?
The command line for searching 2nd column is:
findstr /rb /c:"[^|]*| *ID1 *|" /c:"[^|]*| *ID2 *|" "src.txt" >"dest.txt"
Can someone modify it so that it searches all the columns instead of just the second, and also give command lines which:
(1) search all the columns instead of just the 2nd;
(2) search for only 1 ID;
(3) search for 3 IDs.
src.txt is in this form:
Ja | 11 | xxx
Jn | 19 | yyy
Jx | 21 | yyyas | sas
Also, a few lines may have more columns, like that last one.
Thanks!
To find, in a src.txt containing the lines
Ja | 11 | xxx
Jn | 19 | yyy
nJ | 19 | yyy
Ax | 21 | Jyyas | sas
Ax | 23 | yyJas | sas
only the 3 lines where a value within a column starts with J, and therefore writing to the file dest.txt the lines
Ja | 11 | xxx
Jn | 19 | yyy
Ax | 21 | Jyyas | sas
the following command can be used
findstr /R /C:"^J" /C:"\| *J" "src.txt" >"dest.txt"
^J finds lines starting with J, and \| *J finds lines having a value starting with J after 0 or more spaces in a column other than the first.
Please note that the parameter /B is removed, as otherwise this would not work: /rb in your example is /R and /B combined in one parameter string.
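The same idea extends to the other cases asked for, since each /C:"pattern" is an independent alternative: a line is written to dest.txt if any one of them matches. As a sketch under the same assumptions (columns separated by pipes as in src.txt), searching all columns for a single ID such as 19 could use /C:"^ *19 *|" for the first column, /C:"| *19 *|" for middle columns, and /C:"| *19 *$" for the last column; for 3 IDs, repeat those three patterns for each ID.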