How can I reposition patterns within a string using sed?

I have a FASTA file of more than 20k intronic sequences whose headers follow this pattern:
>ENSG[0-9] | ENST[0-9] | start_position | end_position | name |
I would like to swap the positions of ENSG[0-9] and ENST[0-9] and add "NASCENT" to the ENST[0-9] part.
I tried:
sed 's/\(ENSG\d*\) *| *\(ENST\d*\) */\2 | \1/'
as a first step, focusing only on the repositioning, but to no avail. I have probably confused the escapes.
Any hint or a better solution?

I'm not 100% sure I got your input format right, but if an example file looks like this:
>ENSG1 | ENST1 | 1 | 3 | name1 |
ATG
>ENSG2 | ENST2 | 4 | 9 | name2 |
ATGATG
>ENSG12 | ENST12 | 10 | 17 | name12 |
ATGATGATG
calling sed with the following expression (note [0-9] instead of \d, which sed does not support; \+ is a GNU extension):
sed 's/\(ENSG[0-9]\+\).*\(ENST[0-9]\+\)\(.*\)/NASCENT_\2 | \1\3/g'
would give you
>NASCENT_ENST1 | ENSG1 | 1 | 3 | name1 |
ATG
>NASCENT_ENST2 | ENSG2 | 4 | 9 | name2 |
ATGATG
>NASCENT_ENST12 | ENSG12 | 10 | 17 | name12 |
ATGATGATG
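For comparison, the same swap can be sketched in Python. This is only an illustration and not part of the original answer: the names HEADER_IDS and swap_ids are made up, and the pattern assumes every header starts with ">ENSG<digits> | ENST<digits>".

```python
import re

# Matches the two leading IDs of a header such as ">ENSG1 | ENST1 | ...".
# Assumption: the ENSG ID always comes first, directly after ">".
HEADER_IDS = re.compile(r"^>(ENSG[0-9]+)\s*\|\s*(ENST[0-9]+)")


def swap_ids(line: str) -> str:
    """Swap the ENSG and ENST IDs and prefix the ENST ID with NASCENT_."""
    return HEADER_IDS.sub(r">NASCENT_\2 | \1", line)


for line in [">ENSG12 | ENST12 | 10 | 17 | name12 |", "ATGATGATG"]:
    print(swap_ids(line))
```

Sequence lines do not match the pattern, so they pass through unchanged.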

Related

extract a string before certain punctuation regex

How to extract words before the first punctuation | in presto SQL?
Table
+----+------------------------------------+
| id | title |
+----+------------------------------------+
| 1 | LLA | Rec | po#069762 | saddasd |
| 2 | Hello amustromg dsfood |
| 3 | Hel | sdfke bones. |
+----+------------------------------------+
Output
+----+------------------------------------+
| id | result |
+----+------------------------------------+
| 1 | LLA |
| 2 | |
| 3 | Hel |
+----+------------------------------------+
Attempt
REGEXP_EXTRACT(title, '(.*)([^|]*)', 1)
Thank you
Using the base string functions we can try:
SELECT id,
CASE WHEN title LIKE '%|%'
THEN TRIM(SUBSTR(title, 1, STRPOS(title, '|') - 1))
ELSE '' END AS result
FROM yourTable
ORDER BY id;
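To see why the attempted pattern returns the whole title, and what a pattern anchored on the first pipe returns instead, here is a small Python sketch. It uses Python's re module rather than Presto, so it only illustrates the regex logic; the function name before_first_pipe and the sample dictionary are made up for this example.

```python
import re

titles = {
    1: "LLA | Rec | po#069762 | saddasd",
    2: "Hello amustromg dsfood",
    3: "Hel | sdfke bones.",
}


def before_first_pipe(title: str) -> str:
    """Return the trimmed text before the first '|', or '' if there is none."""
    match = re.match(r"([^|]*)\|", title)
    return match.group(1).strip() if match else ""


results = {row_id: before_first_pipe(title) for row_id, title in titles.items()}
print(results)

# The attempted pattern: the greedy (.*) swallows the entire string,
# leaving nothing for the second group.
greedy = re.match(r"(.*)([^|]*)", titles[1]).group(1)
print(greedy)
```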

Regular Expression for Census Tracts

I am trying to make sure a form input is validated correctly in Google Forms, and I am using my limited regex knowledge to keep people from entering tracts incorrectly.
A tract is usually given in a form such as 5129.01.
All tracts in the county start with 5 and have a second character that is either 0 or 1. If it is 1, the third character is in [0-3]; otherwise it is in [0-9].
I have a working expression, but I would like to ensure that if the second character is a 1, the user cannot enter a tract like 5150.01.
This is what I have:
^5[0-1]([0-9]{2})(\.([0-9]{2}))?$
and this is what is not working:
^5[0-1](?(?<=1)\d|[0-3])(\.([0-9]{2}))?$
Any help would be appreciated, thanks.
I hope this answers your question, please comment and I can edit accordingly.
The pattern:
^(?:51[0-3]|50[0-9])[0-9](?:\.[0-9]{1,2})?$
HTML implementation:
<input name="example" pattern="^(?:51[0-3]|50[0-9])[0-9](?:\.[0-9]{1,2})?$">
Value validation examples:
+----------+--------+
| Value | Status |
+----------+--------+
| 51 | fail |
| 510 | fail |
| 5101 | pass |
| 5111 | pass |
| 5121 | pass |
| 5131 | pass |
| 5141 | fail |
| 5102. | fail |
| 5102.1 | pass |
| 5102.12 | pass |
| 5102.123 | fail |
| 50 | fail |
| 500 | fail |
| 5001 | pass |
| 5011 | pass |
| 5021 | pass |
| 5031 | pass |
| 5041 | pass |
| 5051 | pass |
| 5061 | pass |
| 5071 | pass |
| 5081 | pass |
| 5091 | pass |
| 50911 | fail |
| 5092. | fail |
| 5092.1 | pass |
| 5092.12 | pass |
| 5092.123 | fail |
+----------+--------+
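A few of the rows above can be checked with a short Python sketch. The pattern is the one from the answer; the names TRACT and is_valid_tract are made up for illustration, and Python's re module is assumed to behave like the HTML pattern attribute for this expression.

```python
import re

# Pattern copied from the answer above.
TRACT = re.compile(r"^(?:51[0-3]|50[0-9])[0-9](?:\.[0-9]{1,2})?$")


def is_valid_tract(value: str) -> bool:
    """Return True if the whole value is a valid tract number."""
    return TRACT.fullmatch(value) is not None


for value in ["5129.01", "5150.01", "5141", "5092.1", "5102.123"]:
    print(value, "pass" if is_valid_tract(value) else "fail")
```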

django Queryset exclude() multiple data

I have a database schema like this:
# periode
+------+--------------+--------------+
| id | from | to |
+------+--------------+--------------+
| 1 | 2018-04-12 | 2018-05-11 |
| 2 | 2018-05-12 | 2018-06-11 |
+------+--------------+--------------+
# foo
+------+---------+
| id | name |
+------+---------+
| 1 | John |
| 2 | Doe |
| 3 | Trodi |
| 4 | son |
| 5 | Alex |
+------+---------+
#bar
+------+---------------+--------------+
| id | employee_id | periode_id |
+------+---------------+--------------+
| 1 | 1 |1 |
| 2 | 2 |1 |
| 3 | 1 |2 |
| 4 | 3 |1 |
+------+---------------+--------------+
I need to show the employees that are not in salary.
For now I do it like this:
queryset=Bar.objects.all().filter(periode_id=1)
result=Foo.objects.exclude(id=queryset)
but it fails. How do I filter the list of employees that are not in salary?
Here you basically want the Foos for which there is no Bar row with periode_id=1.
An exact lookup (id=...) expects a single value; to test membership, collect the employee_ids and use the id__in lookup:
ex = Bar.objects.all().filter(periode_id=1).values_list('employee_id', flat=True)
result=Foo.objects.exclude(id__in=ex)

Sed remove NULL but only when the NULL means empty or no value

I am exporting a table from MySQL in which fields that have no value contain the keyword NULL.
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | NULL |
I have written a script to automatically remove all occurrences of NULL using a one-liner sed, which will remove the NULL in date column correctly:
sed -i 's/NULL//g'
However, how do we handle the following?
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | NULL |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | NULLZIET | NULL| 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |
With the global search and replace, every occurrence of NULL is removed, so even "ALA PUHU MINULLE" becomes "ALA PUHU MIE", which is incorrect.
I suppose a regex could be used to apply the rule? But if so, will "DJ Null Bee" be affected and become "DJ Bee"? The desired outcome should really be:
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | DJ Null Bee| | 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |
NULL is a special keyword in any database, but nothing stops anyone from calling themselves DJ NULL, or from having the word NULL in a name because it means something different in another language.
Any ideas on how to resolve this? Any suggestions welcomed. Thank you!
All you need is:
$ sed 's/|[[:space:]]*NULL[[:space:]]*|/| |/g; s/|[[:space:]]*NULL[[:space:]]*|/| |/g' file
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | NULLZIET | | 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |
That will work in any POSIX sed.
You have to do the substitution twice because each match consumes every character it matches. Given | NULL | NULL |, the first match on | NULL | consumes the middle |, so what remains is NULL |, which does not match | NULL |. A second pass catches the fields that the first pass skipped.
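The consumed-delimiter effect can be reproduced outside sed. The following is a minimal Python sketch, not part of the original answer; the sample row and variable names are made up for illustration. It also shows a lookaround variant, offered here as an assumption about an alternative approach, that needs only one pass because the pipes are checked but not consumed.

```python
import re

row = "| 1 | NULL | NULL | x |"

# Same idea as the sed expression: the match includes both pipes.
with_pipes = re.compile(r"\|\s*NULL\s*\|")
once = with_pipes.sub("| |", row)    # the second NULL survives the first pass
twice = with_pipes.sub("| |", once)  # the second pass removes it

# Lookarounds assert the pipes without consuming them, so one pass is enough.
between_pipes = re.compile(r"(?<=\|)\s*NULL\s*(?=\|)")
single_pass = between_pipes.sub(" ", row)

print(once)
print(twice)
print(single_pass)
```

Lookarounds are not available in POSIX sed, which is why the sed answer repeats the substitution instead.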
Use awk:
awk -F'|' -v OFS='|' '{ for (i=2; i<NF; i++) if ($i ~ /^[[:space:]]*NULL[[:space:]]*$/) $i = " " } 1' filename
Using the pipe as the field separator, go through each field between the outer pipes and check whether it consists solely of NULL, ignoring surrounding whitespace. If it does, replace it with a single space; otherwise leave the field as is. This blanks both " NULL " and " NULL" while leaving values such as "NULL AND VOID" and "MINULLE" untouched.
$ cat mysql.txt | sed -r 's/(\| )NULL( \|)/\1\2/g'
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | NULLZIET | NULL| 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |
This removes only uppercase NULL fields that are delimited by exactly one space and a pipe on each side.
It therefore leaves the origin column "| NULL|" untouched in the line "| 3 | NULL AND VOID | NULLZIET | NULL| 2016-05-13 |".
awk '{sub(/BRAZIL \| NULL/,"BRAZIL \| ")sub(/NULLZIET \| NULL/,"DJ Null Bee\| ")}1' file
| id | name | nickname | origin | date |
| 1 | Joe | Mini-J | BRAZIL | |
| 2 | Dees | DJ Null Bee| US| 2017-04-01 |
| 3 | NULL AND VOID | DJ Null Bee| | 2016-05-13 |
| 4 | Pablo | ALA PUHU MINULLE | GERMANY| 2017-02-14 |

Return a Regex Match excluding parts of the match

Given I have a Twiki page with the following source:
| *Region* | *Owner* | *Partner / Project* | *Type* | *Description* | *Timeline* | *Flag* | *Update* |
| Region 1 | Olaf | RedHat | type1 | nothing to do at topic 1 | time1 | Progress | TBC |
| Region 2 | Olaf | Gentoo | type2 | nothing to do at topic 2 | time2 | | none |
I want to construct a regex pattern that should match everything after
Update* |
It's a simple problem, folks!
If by egrep you mean egrep binary in *nix then you can do:
egrep -o "\*Update\*.*" <<< $(<file)
OUTPUT:
*Update* | | Region 1 | Olaf | RedHat | type1 | nothing to do at topic 1 | time1 | Progress | TBC | | Region 2 | Olaf | Gentoo | type2 | nothing to do at topic 2 | time2 | | none |
Is this what you want?
awk 'o;/\*Update\* \|$/{o=1}' yourFile
output
| Region 1 | Olaf | RedHat | type1 | nothing to do at topic 1 | time1 | Progress | TBC |
| Region 2 | Olaf | Gentoo | type2 | nothing to do at topic 2 | time2 | | none |
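If the page source is available as a string, the same "everything after the header" extraction can be sketched in Python without a regex. The function name rows_after_header and the shortened sample text are made up for illustration; the marker is assumed to appear exactly once, at the end of the header row.

```python
def rows_after_header(text: str, marker: str = "*Update* |") -> str:
    """Return everything after the first occurrence of marker, or '' if absent."""
    _, found, tail = text.partition(marker)
    # Drop the newline that ends the header row itself.
    return tail.lstrip("\n") if found else ""


page = (
    "| *Flag* | *Update* |\n"
    "| Region 1 | Progress | TBC |\n"
    "| Region 2 | | none |\n"
)
print(rows_after_header(page), end="")
```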