Extract a string before certain punctuation with regex

How do I extract the words before the first | character in Presto SQL?
Table
+----+---------------------------------+
| id | title                           |
+----+---------------------------------+
| 1  | LLA | Rec | po#069762 | saddasd |
| 2  | Hello amustromg dsfood          |
| 3  | Hel | sdfke bones.              |
+----+---------------------------------+
Output
+----+--------+
| id | result |
+----+--------+
| 1  | LLA    |
| 2  |        |
| 3  | Hel    |
+----+--------+
Attempt
REGEXP_EXTRACT(title, '(.*)([^|]*)', 1)
Thank you

Your pattern doesn't work because (.*) is greedy, so the first group swallows the entire title. Using the base string functions we can try:
SELECT id,
       CASE WHEN title LIKE '%|%'
            THEN TRIM(SUBSTR(title, 1, STRPOS(title, '|') - 1))
            ELSE '' END AS result
FROM yourTable
ORDER BY id;
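If you would rather stay with a regex, anchoring at the start and requiring a | should also work. A minimal sketch (REGEXP_EXTRACT returns NULL when the pattern does not match, hence the COALESCE):
SELECT id,
       COALESCE(TRIM(REGEXP_EXTRACT(title, '^([^|]*)\|', 1)), '') AS result
FROM yourTable
ORDER BY id;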

Related

Query array column in BigQuery by condition

I have a table in BigQuery with this format:
+------------+-----------------+------------+------------------+---------------------------------+
| event_date | event_timestamp | event_name | event_params.key | event_params.value.string_value |
+------------+-----------------+------------+------------------+---------------------------------+
| 20201110   | 2929929292      | my_event   | previous_page    | /some-page                      |
|            |                 |            | layer            | /some-page/layer                |
|            |                 |            | session_id       | 99292                           |
|            |                 |            | user_id          | 2929292                         |
+------------+-----------------+------------+------------------+---------------------------------+
| 20201110   | 2882829292      | my_event   | previous_page    | /some-page                      |
|            |                 |            | layer            | /some-page/layer                |
|            |                 |            | session_id       | 29292                           |
|            |                 |            | user_id          | 229292                          |
+------------+-----------------+------------+------------------+---------------------------------+
I want to write a query that returns all rows where event_params.value.string_value matches the regex /layer.
I have tried this:
SELECT
  "event_params.value.string_value",
FROM `my_project.my_dataset.my_events_20210110`,
  UNNEST(event_params) AS event_param
WHERE event_param.key = 'layer'
  AND REGEXP_CONTAINS(event_param.value.string_value, r'/layer')
LIMIT 100
But I'm getting this output:
+---------------------------------+
| event_params.value.string_value |
+---------------------------------+
| event_params.value.string_value |
+---------------------------------+
| event_params.value.string_value |
+---------------------------------+
| event_params.value.string_value |
+---------------------------------+
| event_params.value.string_value |
+---------------------------------+
Any ideas what I'm doing wrong?
You are selecting a string literal, not a column.
The other problem is that you're cross joining the table with its own arrays, effectively bloating up the table.
The solution is to use a subquery in the WHERE clause:
SELECT
  *  -- Not sure what you actually need from the table ...
FROM `my_project.my_dataset.my_events_20210110`
WHERE
  -- COUNT(*)>0 means "if you find more than zero" then return TRUE
  (SELECT COUNT(*) > 0
   FROM UNNEST(event_params) AS event_param
   WHERE event_param.key = 'layer'
     AND REGEXP_CONTAINS(event_param.value.string_value, r'/layer')
  )
LIMIT 100
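An EXISTS subquery expresses the same condition and arguably reads more naturally than COUNT(*) > 0; this is a stylistic alternative with the same result:
SELECT
  *
FROM `my_project.my_dataset.my_events_20210110`
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(event_params) AS event_param
  WHERE event_param.key = 'layer'
    AND REGEXP_CONTAINS(event_param.value.string_value, r'/layer')
)
LIMIT 100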
If you actually want the values from the array, the quick fix is to remove the quotes and reference the unnested alias:
SELECT
  event_param.value.string_value
FROM `my_project.my_dataset.my_events_20210110`,
  UNNEST(event_params) AS event_param
WHERE event_param.key = 'layer'
  AND REGEXP_CONTAINS(event_param.value.string_value, r'/layer')
LIMIT 100

How do I add additional rows in M Query

I want to add more rows using the Query Editor (Power Query / M Query), populating only the Start Date and End Date columns:
+----------+------------------+--------------+-----------+------------+------------+
| Employee | Booking Type     | Jobs         | WorkLoad% | Start Date | End date   |
+----------+------------------+--------------+-----------+------------+------------+
| John     | Chargeable       | CNS          | 20        | 04/02/2020 | 31/03/2020 |
| John     | Chargeable       | CNS          | 20        | 04/03/2020 | 27/04/2020 |
| Bernard  | Vacation/Holiday | SN           | 100       | 30/04/2020 | 11/05/2020 |
| Bernard  | Vacation/Holiday | Annual leave | 100       | 23/01/2020 | 24/02/2020 |
| Bernard  | Chargeable       | Tech PLC     | 50        | 29/02/2020 | 30/03/2020 |
+----------+------------------+--------------+-----------+------------+------------+
I want to find the MIN(Start Date) and MAX(End Date) and then append the range of start to end dates to this table, filling only the Start Date and End Date columns, in the Query Editor (Power Query / M Query). Preferably I would create another table (table2) duplicating the original table and append these rows to it.
For example:
+----------+------------------+--------------+-----------+------------+------------+
| Employee | Booking Type     | Jobs         | WorkLoad% | Start Date | End date   |
+----------+------------------+--------------+-----------+------------+------------+
| John     | Chargeable       | CNS          | 20        | 04/02/2020 | 31/03/2020 |
| John     | Chargeable       | CNS          | 20        | 04/03/2020 | 27/04/2020 |
| Bernard  | Vacation/Holiday | SN           | 100       | 30/04/2020 | 11/05/2020 |
| Bernard  | Vacation/Holiday | Annual leave | 100       | 23/01/2020 | 24/02/2020 |
| Bernard  | Chargeable       | Tech PLC     | 50        | 29/02/2020 | 30/03/2020 |
|          |                  |              |           | 23/01/2020 | 23/01/2020 |
|          |                  |              |           | 24/01/2020 | 24/01/2020 |
|          |                  |              |           | 25/01/2020 | 25/01/2020 |
|          |                  |              |           | 26/01/2020 | 26/01/2020 |
|          |                  |              |           | 27/01/2020 | 27/01/2020 |
|          |                  |              |           | 28/01/2020 | 28/01/2020 |
|          |                  |              |           | 29/01/2020 | 29/01/2020 |
|          |                  |              |           | 30/01/2020 | 30/01/2020 |
|          |                  |              |           | 31/01/2020 | 31/01/2020 |
|          |                  |              |           | ...        | ...        |
|          |                  |              |           | 11/05/2020 | 11/05/2020 |
+----------+------------------+--------------+-----------+------------+------------+
The List.Dates function is pretty useful here.
Generate the dates in your range, duplicate that list into two columns, and then append the result to the original table (referenced below as StartTable):
let
    StartDate = List.Min(StartTable[Start Date]),
    EndDate = List.Max(StartTable[End Date]),
    // +1 so that the list includes EndDate itself
    DateList = List.Dates(StartDate, Duration.Days(EndDate - StartDate) + 1, #duration(1, 0, 0, 0)),
    DateCols = Table.FromColumns({DateList, DateList}, {"Start Date", "End Date"}),
    AppendDates = Table.Combine({StartTable, DateCols})
in
    AppendDates
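Table.Combine aligns columns by name and fills the columns missing from DateCols (Employee, Booking Type, Jobs, WorkLoad%) with null, which is exactly the blank cells in your desired output. If you want this as a separate table2 rather than a change to the original, reference the original query in a new blank query and apply these steps there.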

Django queryset exclude() with multiple values

I have a database schema like this.
# periode
+------+--------------+--------------+
| id   | from         | to           |
+------+--------------+--------------+
| 1    | 2018-04-12   | 2018-05-11   |
| 2    | 2018-05-12   | 2018-06-11   |
+------+--------------+--------------+
# foo
+------+---------+
| id   | name    |
+------+---------+
| 1    | John    |
| 2    | Doe     |
| 3    | Trodi   |
| 4    | son     |
| 5    | Alex    |
+------+---------+
# bar
+------+-------------+------------+
| id   | employee_id | periode_id |
+------+-------------+------------+
| 1    | 1           | 1          |
| 2    | 2           | 1          |
| 3    | 1           | 2          |
| 4    | 3           | 1          |
+------+-------------+------------+
I need to show the employees that are not in the salary (bar) table for a given period.
For now I do it like this:
queryset = Bar.objects.all().filter(periode_id=1)
result = Foo.objects.exclude(id=queryset)
but it fails. How do I filter the employee list to those not in the salary table?
Well, here you basically want the Foo rows for which there is no Bar row with periode_id=1.
We can make this work with:
ex = Bar.objects.filter(periode_id=1).values_list('employee_id', flat=True)
result = Foo.objects.exclude(id__in=ex)
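You can also hand the queryset straight to __in, in which case Django builds a single SQL query with a subselect instead of fetching the ids first (same logic, one round trip to the database):
result = Foo.objects.exclude(
    id__in=Bar.objects.filter(periode_id=1).values('employee_id')
)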

sed: remove NULL, but only when NULL means empty or no value

I am exporting a table from MySQL, and fields that have no value contain the keyword NULL.
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | NULL |
I have written a script to remove all occurrences of NULL automatically with a sed one-liner, which removes the NULL in the date column correctly:
sed -i 's/NULL//g' file
However, how do we handle it if we have the following?
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | NULL |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | NULLZIET | NULL| 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |
Obviously, a global search-and-replace removes every occurrence of NULL, so even "ALA PUHU MINULLE" becomes "ALA PUHU MIE", which is incorrect.
I suppose a regex could be used to apply the rule, but if so, will "DJ Null Bee" be affected and become "DJ Bee"? The desired outcome should really be:
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | NULLZIET | | 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |
NULL is a special keyword for databases, but there is nothing stopping anyone from calling themselves DJ NULL, or having the word NULL in a field because it means something different in another language.
Any ideas on how to resolve this? Any suggestions welcome. Thank you!
All you need is:
$ sed 's/|[[:space:]]*NULL[[:space:]]*|/| |/g; s/|[[:space:]]*NULL[[:space:]]*|/| |/g' file
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | NULLZIET | | 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |
That will work in any POSIX sed.
You have to do the substitution twice because each match consumes all of its characters: when the input contains | NULL | NULL |, the middle | is consumed by the match on the first | NULL |, leaving NULL | behind, which does not match | NULL |. A second pass picks up the fields the first one skipped.
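If Perl is available, a zero-width lookahead avoids the two-pass trick altogether, because the trailing | is checked but not consumed (a sketch, not POSIX sed):
perl -pe 's/\|[[:space:]]*NULL[[:space:]]*(?=\|)/| /g' file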
Use awk:
awk -F'|' '{ for (i=2; i<=NF; i++) { if ($i ~ /^[[:space:]]*NULL[[:space:]]*$/) printf "| "; else printf "|%s", $i } printf "\n" }' file
Using pipe as the field separator, go through each field and check whether it consists of nothing but NULL with optional surrounding whitespace. If it does, print an empty field; otherwise print the field as is.
$ sed -r 's/(\| )NULL( \|)/\1\2/g' mysql.txt
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | NULLZIET | NULL| 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |
This will only remove capital NULL fields that are delimited by a pipe plus a single space on each side.
It will also keep the origin column "| NULL|" in the line "| 3 | NULL AND VOID | NULLZIET | NULL| 2016-05-13 |" untouched, since there is no space before the closing pipe.
awk '{ sub(/BRAZIL \| NULL/, "BRAZIL | "); sub(/NULLZIET \| NULL/, "NULLZIET | ") } 1' file
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | NULLZIET | | 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |
(Note that this variant hard-codes the neighbouring values, so it only fits this exact sample.)

How can I reposition patterns within a string using sed?

I have a FASTA file of 20k+ intronic sequences with headers I can describe as:
>ENSG[0-9] | ENST[0-9] | start_position | end_position | name |
I would like to swap the positions of ENSG[0-9] and ENST[0-9] and add "NASCENT" to the ENST[0-9] pattern.
I tried:
sed 's/\(ENSG\d*\) *| *\(ENST\d*\) */\2 | \1/'
to first focus on just the repositioning, but to no avail. It's probably the escapes that I've confused.
Any hint, or a better solution?
Not 100% sure I got your input format right, but if an example file looks like this:
>ENSG1 | ENST1 | 1 | 3 | name1 |
ATG
>ENSG2 | ENST2 | 4 | 9 | name2 |
ATGATG
>ENSG12 | ENST12 | 10 | 17 | name12 |
ATGATGATG
calling sed with the following expression (note that sed does not understand Perl's \d, which is likely why your attempt failed; use [0-9] instead):
sed 's/\(ENSG[0-9]\+\).*\(ENST[0-9]\+\)\(.*\)/NASCENT_\2 | \1\3/g'
would give you
>NASCENT_ENST1 | ENSG1 | 1 | 3 | name1 |
ATG
>NASCENT_ENST2 | ENSG2 | 4 | 9 | name2 |
ATGATG
>NASCENT_ENST12 | ENSG12 | 10 | 17 | name12 |
ATGATGATG
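Since FASTA headers all start with >, it may be safer to restrict the substitution to header lines so sequence data can never be touched by accident. The same expression with an address (\+ is a GNU sed extension; use \{1,\} in strict POSIX sed):
sed '/^>/ s/\(ENSG[0-9]\+\).*\(ENST[0-9]\+\)\(.*\)/NASCENT_\2 | \1\3/' file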