Extracting all rows that match a "pattern" from a range - regex

I'm very new to this and I've googled regex and RE2, but found little help. It seems to be a rather niche topic and the material isn't very comprehensive. I would appreciate any help with this. Thank you!
I have a list with over 30k rows of text and I would like to extract those that match this pattern:
`=whatever`+whatever-whatever`+whatever/whatever.whatever
The data looks like this:
Detail 3
Detail 4
Detail 3
Detail 4
blah
blah
Detail 5
Detail 4
Detail 5
`=P2385`+P2385-M01 `+MC2385/5.0
`=BS2366`+P2366-A1 `+FIELD/14.5
`=P2385`+P2385-M02 `+MC2385/5.4
blah
blah
`=P2385`+P2385-M03 `+MC2385/5.5
`=P2385`+P2385-M04 `+MC2385/5.9
`=P2385`+P2385-M05 `+MC2385/6.0
`=BS2366`+P2366-A1 `+FIELD/14.5
`=P2385`+P2385-M06 `+MC2385/6.4
`=P2385`+P2385-M07 `+MC2385/6.5
`=P2385`+P2385-M08 `+MC2385/6.9
blah
blah
`=P2385`+P2385-M09 `+MC2385/7.0
`=P2385`+P2385-M10 `+MC2385/7.4
`=P2385`+P2385-M11 `+MC2385/7.5
`=P2385`+P2385-M12 `+MC2385/7.9
`=P2381`+P2381-B31 `+MC2381/12.5
blah
blah
`=P2381`+P2381-B32 `+MC2381/12.9
`=P2381`+P2381-B33 `+MC2381/13.0
`=P2370`+P2370-M01
blah
blah
`+MC2370/6.0
`=P2370`+P2370-M02 `+MC2370/6.4
`=P2370`+P2370-M03 `+MC2370/6.5
`=P2370`+P2370-M04 `+MC2370/6.9
`=BS2366`+P2366-A1 `+FIELD/14.5
`=P2368`+P2368-M01 `+MC2370/11.0
`=P2368`+P2368-M05 `+MC2370/12.0
`=BS2366`+P2366-A1 `+FIELD/14.5 `=P2366`+P2366-M01 `+MC2370/10.5
`=P2366`+P2366-M02 `+MC2370/10.9
Detail 3
Detail 4
Detail 3
blah
blah
Detail 4
Detail 5
Detail 5
blah
blah
Detail 4
Detail 5
blah
blah
blah
blah
The output from the regex extract should ideally look like this:
`=P2385`+P2385-M01 `+MC2385/5.0
`=P2385`+P2385-M02 `+MC2385/5.4
`=P2385`+P2385-M03 `+MC2385/5.5
`=P2385`+P2385-M04 `+MC2385/5.9
`=P2385`+P2385-M05 `+MC2385/6.0
`=BS2366`+P2366-A1 `+FIELD/14.5
`=P2385`+P2385-M06 `+MC2385/6.4
`=P2385`+P2385-M07 `+MC2385/6.5
`=P2385`+P2385-M08 `+MC2385/6.9
`=P2385`+P2385-M09 `+MC2385/7.0
`=P2385`+P2385-M10 `+MC2385/7.4
`=P2385`+P2385-M11 `+MC2385/7.5
`=BS2366`+P2366-A1 `+FIELD/14.5
`=P2385`+P2385-M12 `+MC2385/7.9
`=P2381`+P2381-B31 `+MC2381/12.5
`=P2381`+P2381-B32 `+MC2381/12.9
`=P2381`+P2381-B33 `+MC2381/13.0
`=P2370`+P2370-M02 `+MC2370/6.4
`=P2370`+P2370-M03 `+MC2370/6.5
`=P2370`+P2370-M04 `+MC2370/6.9
`=P2368`+P2368-M01 `+MC2370/11.0
`=P2368`+P2368-M05 `+MC2370/12.0
`=BS2366`+P2366-A1 `+FIELD/14.5
`=P2366`+P2366-M01 `+MC2370/10.5
`=P2366`+P2366-M02 `+MC2370/10.9

=ARRAYFORMULA(TRANSPOSE(SPLIT(SUBSTITUTE("♦"&TEXTJOIN("♦", 1,
IFERROR(REGEXEXTRACT(A1:A, "`.+\+.+-.+\+.+/.+"))), " `=", "♦`="), "♦")))
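For anyone who wants the same filtering outside of Sheets, here is a minimal Python sketch of the idea (data.txt is a hypothetical plain-text export of column A): it keeps only the rows that match the backtick pattern and, like the SUBSTITUTE/SPLIT step in the formula, breaks a row that holds two entries back to back into separate lines.

import re

# Rough equivalent of the REGEXEXTRACT pattern: `=...`+...-... `+.../...
pattern = re.compile(r"`=.+`\+.+-.+`\+.+/.+")

with open("data.txt", encoding="utf-8") as src:  # hypothetical export of the column
    for line in src:
        match = pattern.search(line)
        if not match:
            continue
        # A single row can contain two entries (see the `=BS2366 ... `=P2366 ... row);
        # split before each " `=" just like the SUBSTITUTE/SPLIT step does.
        for entry in re.split(r"\s+(?=`=)", match.group(0).strip()):
            print(entry)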

To match the specific data in your example, use this pattern:
^`=[A-Z0-9]+`\+[A-Z0-9]+-[A-Z0-9]+\s*`\+[A-Z0-9]+\/\d+\.\d+
You can test the pattern on an online regex tester, which will also explain all the details of the regex.
Instead of the specific [A-Z0-9]+ you can use a generic .+?.
You do not specify where your list is stored.
If the list is stored in a text file, then you can use grep as the obvious tool for the job:
grep "^`=[A-Z0-9]+`\+[A-Z0-9]+-[A-Z0-9]+\s*`\+[A-Z0-9]+\/\d+\.\d+" input_file >output_file
where:
> is the output redirection operator
input_file is the name of the input file
output_file is the name of the file where you want to store the results.
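If grep is not available (for example on a plain Windows machine), a short Python sketch along the same lines would do the same job (input_file.txt and output_file.txt are placeholders):

import re

# Same anchored pattern as the grep example above.
pattern = re.compile(r"^`=[A-Z0-9]+`\+[A-Z0-9]+-[A-Z0-9]+\s*`\+[A-Z0-9]+/\d+\.\d+")

with open("input_file.txt", encoding="utf-8") as src, \
     open("output_file.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if pattern.match(line):
            dst.write(line)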
If it is in a spreadsheet, then @player0 already provided a good answer.

How to get the email address in between 2 different characters in Excel or Google Sheets using formula only?

The cell A2 has the following sample email address:
Jose Rizal <jose#email.com>
I want to get the email address only in cell B2, so I tried:
=right(A2,len(A2) - search("<",A2,1))
but the result was: jose#email.com> (with the > as the last character).
The table looks like this and the expected result is on B2:
|   | A                           | B                  |
| 1 | complete email address      | email address only |
| 2 | Jose Rizal <jose#email.com> | jose#email.com     |
What should I improve in my formula?
Paste this in B2:
=REGEXEXTRACT(A2, "<(.*)>")
and the ARRAYFORMULA version would be:
=ARRAYFORMULA(IFERROR(REGEXEXTRACT(A2:A, "<(.*)>")))
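The same capture group carries over directly to other tools; here is a minimal Python sketch using the sample value from the question:

import re

cell = "Jose Rizal <jose#email.com>"  # sample value from the question

# <(.*)> captures everything between the angle brackets, just like REGEXEXTRACT.
match = re.search(r"<(.*)>", cell)
if match:
    print(match.group(1))  # prints jose#email.com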
In Excel or Google Sheets (but player0's REGEXEXTRACT is better to use in Google Sheets):
=MID(REPLACE(A2,FIND(">",A2),LEN(A2),""),FIND("<",A2)+1,LEN(A2))
And drag the formula down.
Add another LEFT in there to trim off the trailing >:
=LEFT(RIGHT(A2,LEN(A2) - SEARCH("<",A2,1)),LEN(RIGHT(A2,LEN(A2) - SEARCH("<",A2,1)))-1)
Another attempt using FILTERXML, if you are using a version of Excel that supports it:
=FILTERXML("<b><a>"&SUBSTITUTE(LEFT(A2,LEN(A2)-1),"<","</a><a>")&"</a></b>","//a[2]")
Assuming your data starts from cell A2, drag the formula down to apply it to the rest of the column.
For the logic behind this formula, you may want to read the article Extract Words with FILTERXML.

match variable string at end of field with awk

Yet again my unfamiliarity with awk lets me down: I can't figure out how to match a variable at the end of a line.
This would be fairly trivial with grep etc, but I'm interested in matching integers at the end of a string in a specific field of a tsv, and all the posts suggest (and I believe it to be the case!) that awk is the way to go.
If I want to just match a single one explicitly, that's easy:
Here's my example file:
PVClopT_11 PAU_02102 PAU_02064 1pqx 1pqx_A 37.4 13 0.00035 31.4 >1pqx_A Conserved hypothetical protein; ZR18,structure, autostructure,spins,autoassign, northeast structural genomics consortium; NMR {Staphylococcus aureus subsp} SCOP: d.267.1.1 PDB: 2ffm_A 2m6q_A 2m8w_A No DOI found.
PVCpnf_18 PAK_3526 PAK_03186 3fxq 3fxq_A 99.7 2.7e-21 7e-26 122.2 >3fxq_A LYSR type regulator of TSAMBCD; transcriptional regulator, LTTR, TSAR, WHTH, DNA- transcription, transcription regulation; 1.85A {Comamonas testosteroni} PDB: 3fxr_A* 3fxu_A* 3fzj_A 3n6t_A 3n6u_A* 10.1111/j.1365-2958.2010.07043.x
PVCunit1_19 PAU_02807 PAU_02793 3kx6 3kx6_A 19.7 45 0.0012 31.3 >3kx6_A Fructose-bisphosphate aldolase; ssgcid, NIH, niaid, SBRI, UW, emerald biostructures, glycolysis, lyase, STRU genomics; HET: CIT; 2.10A {Babesia bovis} No DOI found.
PVClumt_17 PAU_02231 PAU_02190 3lfh 3lfh_A 39.7 12 0.0003 28.9 >3lfh_A Manxa, phosphotransferase system, mannose/fructose-speci component IIA; PTS; 1.80A {Thermoanaerobacter tengcongensis} No DOI found.
PVCcif_11 plu2521 PLT_02558 3h2t 3h2t_A 96.6 2.6e-05 6.7e-10 79.0 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCpnf_16 PAU_03338 PAU_03377 5jbr 5jbr_A 29.2 22 0.00058 23.9 >5jbr_A Uncharacterized protein BCAV_2135; structural genomics, PSI-biology, midwest center for structu genomics, MCSG, unknown function; 1.65A {Beutenbergia cavernae} No DOI found.
PVCunit1_17 PAK_2892 PAK_02622 1cii 1cii_A 63.2 2.7 6.9e-05 41.7 >1cii_A Colicin IA; bacteriocin, ION channel formation, transmembrane protein; 3.00A {Escherichia coli} SCOP: f.1.1.1 h.4.3.1 10.1038/385461a0
PVCunit1_11 PAK_2886 PAK_02616 3h2t 3h2t_A 96.6 1.9e-05 4.9e-10 79.9 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCpnf_11 PAU_03343 PAU_03382 3h2t 3h2t_A 97.4 4.4e-07 1.2e-11 89.7 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCunit1_5 afp5 PAU_02779 4tv4 4tv4_A 63.6 2.6 6.7e-05 30.5 >4tv4_A Uncharacterized protein; unknown function, ssgcid, virulence, structural genomics; 2.10A {Burkholderia pseudomallei} No DOI found.
And I can pull out all the lines which have a "_11" at the end of the first column by running the following on the commandline:
awk '{ if ($1 ~ /_11$/) { print } }' 02052017_HHresults_sorted.tsv
I want to enclose this in a loop to cover all integers from 1 - 5 (for instance), but I'm having trouble passing a variable in to the text match.
I expect it should be something like the following, but $i$ seems like it's probably incorrect and my google-fu failed me:
awk 'BEGIN{ for (i=1;i<=5;i++){ if ($1 ~ /_$i$/) { print } } }' 02052017_HHresults_sorted.tsv
There may be other issues I haven't spotted with that awk command too, as I say, I'm not very awk-savvy.
EDIT FOR CLARIFICATION
I want to separate out all the matches, so can't use a character class. i.e. I want all the lines ending in "_1" in one file, then all the ones ending in "_2" in another, and so on (hence the loop).
You can't put variables inside //. Use string concatenation, which in awk is done simply by putting the strings next to each other. You don't need a regexp literal with the ~ operator; it always treats its right-hand operand as a regexp.
awk '{ for (i = 1; i <= 5; i++) {
    if ( $1 ~ ("_" i "$") ) { print; break; }
} }' 02052017_HHresults_sorted.tsv
It sounds like you're thinking about this all wrong and what you really need is just (with GNU awk for gensub()):
awk '{ print > ("out" gensub(/.*_/,"",1,$1)) }' 02052017_HHresults_sorted.tsv
or with any awk:
awk '{ n=$1; sub(/.*_/,"",n); print > ("out" n) }' 02052017_HHresults_sorted.tsv
No need to loop; use a regex character class [1-5] with a capture group (this needs GNU awk for the three-argument match()):
awk 'match($1,/_([1-5])$/,a){ print >> (a[1] ".txt") }' 02052017_HHresults_sorted.tsv
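If GNU awk is not an option, here is a short Python sketch of the same split-into-files idea (covering every suffix, not just 1-5; the out_N.txt names are placeholders mirroring the a[1] ".txt" naming above):

import re

# Route each line to a file named after the integer suffix of the first field,
# e.g. "PVClopT_11 ..." goes to out_11.txt.
suffix_re = re.compile(r"_(\d+)$")

with open("02052017_HHresults_sorted.tsv", encoding="utf-8") as src:
    for line in src:
        fields = line.split()
        if not fields:
            continue
        m = suffix_re.search(fields[0])
        if m:
            # Open in append mode so lines with the same suffix collect in one file.
            with open("out_" + m.group(1) + ".txt", "a", encoding="utf-8") as dst:
                dst.write(line)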

doxygen equations not appearing correctly

I am having some issues with doxygen. I am trying to include an inline formula:
blah blah \f$ x \in [0,1] \f$ blah blah
but the html looks like
blah blah \( x \in [0,1] \) blah blah
Does anyone know why? If it helps:
EXTRA_PACKAGES = mathtools amsmath
USE_MATHJAX = YES
Make sure you have LaTeX installed and verify that you have these settings in your Doxygen configuration file:
GENERATE_LATEX = YES
LATEX_OUTPUT = latex
LATEX_CMD_NAME = latex #latex command name to be called from terminal

Regular Expression to deal with text issue

I have the following text sample:
Sample Supplier 123
 AP Invoices 123456 -229.00
 AP Invoices 235435 337.00
 AP Invoices 444323 228.00
 AP Invoices 576432 248.00
It's from a text file with 21,000 lines, which lists invoices against a supplier.
The pattern is always the same on each block of invoices against each supplier, where:
The supplier name starts at the beginning of a line
The invoices are listed starting 2 rows down from the supplier name, indented by one space.
I wondered if I can use a Regular Expression (I'm using TextPad as a Text Editor on a Windows PC) to:
Append each invoice line with a tab (\t)
Append the supplier name in front of the tab so each invoice line now starts with the supplier name, and a tab, where the supplier name is taken from 2 rows above the start of each block of invoices
Delete the supplier name line from above the invoice block.
Expected output:
Sample Supplier 123 AP Invoices 123456 -229.00
Sample Supplier 123 AP Invoices 235435 337.00
Sample Supplier 123 AP Invoices 444323 228.00
Sample Supplier 123 AP Invoices 576432 248.00
I realise I am probably asking for "the moon on a stick" here, but the alternative is to go through a 21,000 line text file and copy and paste the data into Excel, which might not be a very good use of my time.
Maybe I can't do it using a simple regular expression, or maybe it's simply not possible at all.
Any advice would be much appreciated.
Thanks
I would use a simple Python script to solve this issue:
currentheader = ""
with open("yourfile.txt") as f:
    with open("newfile.txt", "w") as fw:
        for line in f:
            if len(line.strip()) == 0:
                continue
            elif line[0] != " ":  # new header
                currentheader = line[:-1]
            else:
                fw.write(currentheader + "\t" + line[1:])
For this to work on Windows you will have to install Python; either Python 2 or Python 3 will work with this script. After installing Python, open a command line (Win+R, cmd, Enter), navigate to the folder your file is in using cd foldername if necessary, and then type python dealWithTextIssue.py (after having saved the script there as "dealWithTextIssue.py").
I don't think this is solvable with regex alone; you'll have to do some programming. I made a little script in PHP:
$string = <<<EOL
Sample Supplier 123
 AP Invoices 123456 -229.00
 AP Invoices 235435 337.00
 AP Invoices 444323 228.00
 AP Invoices 576432 248.00
Second Supplier
 A B C D
 B F
EOL;
$array = preg_split("~[\n\r]+~", $string);
foreach ($array as $value) {
    if (strpos($value, " ") == 0) {
        if (strlen(trim($value)) > 0) {
            echo "\t".$header.rtrim($value).PHP_EOL;
        }
    }
    else {
        $header = $value;
    }
}
You can paste the script into an online PHP sandbox and click "execute code" to see it at work.

AWStats multiple columns in extra section

I have AWStats running and the reports are built from IIS log files.
I have an extra section to view all the actions of the executed Perl scripts on the site.
The config looks like this:
ExtraSectionName1="Actions"
ExtraSectionCodeFilter1="200 304"
ExtraSectionCondition1="URL,\/cgi\-bin\/.+\.pl"
ExtraSectionFirstColumnTitle1="Action"
ExtraSectionFirstColumnValues1="QUERY_STRING,action=([a-zA-Z0-9]+)"
ExtraSectionFirstColumnFormat1="%s"
ExtraSectionStatTypes1=HPB
ExtraSectionAddAverageRow1=0
ExtraSectionAddSumRow1=1
MaxNbOfExtra1=20
MinHitExtra1=1
The output looks like this:
Action Pages Hits
foo 1234 1234
bar 5678 5678
But there are some actions with the same name in different perl scripts.
I would need this:
Script Action Pages Hits
foo.pl foo 1234 1234
bar.pl foo 1234 1234
foo.pl bar 5678 5678
bar.pl bar 5678 5678
Does anyone know how to create such a report?
EDIT:
I did some more research, and all the forum posts I've found say that it is not possible to have two columns in an extra section without hacking awstats.pl.
Now I am trying to put it into one column using URLWITHQUERY to output something like this:
Action Pages Hits
foo.pl?action=foo 1234 1234
foo.pl?action=bar 1234 1234
bar.pl?action=foo 5678 5678
...
The new problem is that the query string has more parameters than just action, and they are unordered.
I tried this
ExtraSectionFirstColumnValues1="URLWITHQUERY,([a-zA-Z0-9]+\.pl\?).*(action=[a-zA-Z0-9]+)"
but AWStats only takes the value from the first capture group and ignores the rest. I think it internally works with $1 provided by Perl's regex matching.
Any ideas?
maybe?
ExtraSectionFirstColumnTitle1="Script"
ExtraSectionFirstColumnValues1="URL,\/cgi\-bin\/(.+\.pl)`enter code here`"
ExtraSectionFirstColumnFormat1="%s"
ExtraSectionFirstColumnTitle2="Action"
ExtraSectionFirstColumnValues2="QUERY_STRING,action=([a-zA-Z0-9]+)"
ExtraSectionFirstColumnFormat2="%s"
I've found a solution.
awstats.pl fetches the data for the specified extra sections in lines 19664 - 19750.
This is my modification:
# Line 19693 - 19701 in awstats.pl (AWStats version 7 Revision 1.971)
elsif ( $rowkeytype eq 'URLWITHQUERY' ) {
    if ( "$urlwithnoquery$tokenquery$standalonequery" =~
        /$rowkeytypeval/ )
    {
        $rowkeyval = "$1$2";    # I simply added a $2 for the second capture group
        $rowkeyok  = 1;
        last;
    }
}
This will get the first and the second capture group specified in the ExtraSectionFirstColumnValuesX regex.
Example:
ExtraSectionFirstColumnValues1="URLWITHQUERY,([a-zA-Z0-9]+\.pl\?).*(action=[a-zA-Z0-9]+)"
Needless to say, you need to add $3, $4, $5, ... if you need more groups.
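For reference, a quick Python check of that two-capture-group pattern against a made-up URL with a query string (the request itself is hypothetical) shows what ends up in "$1$2":

import re

pattern = r"([a-zA-Z0-9]+\.pl\?).*(action=[a-zA-Z0-9]+)"
url_with_query = "/cgi-bin/foo.pl?user=jdoe&action=bar&page=2"  # hypothetical request

m = re.search(pattern, url_with_query)
if m:
    # Same concatenation the patched awstats.pl builds with "$1$2".
    print(m.group(1) + m.group(2))  # prints foo.pl?action=bar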