How can I split an XML file using regular expressions in Perl?

Here is my XML:
<root>
<SchoolOrInstitution schoolType="highschool">
<SchoolName>BUTLER HIGH SCHOOL</SchoolName>
<Degree degreeType="highschool"/>
</SchoolOrInstitution>
<SchoolOrInstitution schoolType="highschool">
<SchoolName>BUTLER HIGH SCHOOL</SchoolName>
<Degree degreeType="highschool"/>
</SchoolOrInstitution>
<SchoolOrInstitution schoolType="highschool">
<SchoolName>BUTLER HIGH SCHOOL</SchoolName>
<Degree degreeType="highschool"/>
</SchoolOrInstitution>
..............
</root>
I want to split this XML into multiple files based on the number of SchoolOrInstitution tags.
Each output XML should have 3 such tags.
So if the main file has 9 SchoolOrInstitution tags, it should create 3 child files.
The root tag should also be applied to all 3 files.

xml_split is a tool that comes with XML::Twig and seems quite close to what you are looking for. xml_split -g3 -l1 my.xml will get you most of the way; the only difference is that the top-level element will have a tag name assigned by the tool instead of the original name.
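If you want the original root tag name preserved, here is a short sketch using XML::Twig directly; the input and output file names are my own assumptions, not from the question:
#!/usr/bin/perl
# A sketch with XML::Twig itself, so each child file keeps the original
# <root> tag name. Input/output file names are assumptions, not from the post.
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(pretty_print => 'indented');
$twig->parsefile('my.xml');

my @schools = $twig->root->children('SchoolOrInstitution');
my $n = 1;
while (my @chunk = splice @schools, 0, 3) {      # take 3 elements per file
    my $new_root = XML::Twig::Elt->new('root');
    $_->move(last_child => $new_root) for @chunk;
    open my $out, '>', "child_$n.xml" or die "child_$n.xml: $!";
    $new_root->print($out);
    close $out;
    $n++;
}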

Using xsh, a wrapper around XML::LibXML:
my $old := open 19741254.xml ;
my $n = 1 ;
while $old/root/SchoolOrInstitution[1] {
    my $new := create root ;
    xmv $old/root/SchoolOrInstitution[position() <= 3] into $new/root ;
    save :f concat($n, '.xml') $new ;
    $n = $n + 1 ;
}

Related

How do I retrieve values from successive lines in Perl?

I have this data below, called data.txt, and I want to retrieve four columns from it. First, I want to retrieve the degradome category, then the p-value, then the text before and after Query:. So the result should look like this (showing the first row only):
Degardome Category: 3 Degradome p-value: 0.0195958324320822 3' UGACGUUUCAGUUCCCAGUAU 5' Seq_3694_200
data.txt:
5' CCGGUAAGGUUAUGGGUCAUG 3' Transcript: Supercontig_2.8_1446328:1451-1471 Slice Site:1462
|o||o||o| |||||||o
3' UGACGUUUCAGUUCCCAGUAU 5' Query: Seq_3694_200
SiteID: Supercontig_2.8_1446328:1462
MFE of perfect match: -36.10
MFE of this site: -23.60
MFEratio: 0.653739612188366
Allen et al. score: 7.5
Paired Regions (query5'-query3',transcript3'-transcript5')
1-8,1471-1464
10-18,1462-1454
Unpaired Regions (query5'-query3',transcript3'-transcript5')
9-9,1463-1463 SIL: Symmetric internal loop
19-21,1453-1451 UP3: Unpaired region at 3' of query
Degradome data file: /media/owner/newdrive/phasing/degradome/_degradome.20171210/bbduk_trimmed/merged_HV2.fasta_dd.txt
Degardome Category: 3
Degradome p-value: 0.0195958324320822
T-Plot file: T-plots-IGR/Seq_3694_200_Supercontig_2.8_1446328_1462_TPlot.pdf
Position Reads Category
1462 4 3 <<<<<<<<<<
2949 7 3
4179 517 0
---------------------------------------------------
---------------------------------------------------
5' GGUGAGGAGGGGGGUUUG-GUC 3' Transcript: Supercontig_2.8_1511075:1311-1331 Slice Site:1323
| |||||oo||| |||o |||
3' AC-CUCCUUUCCCGAAAUACAG 5' Query: Seq_2299_664
SiteID: Supercontig_2.8_1511075:1323
MFE of perfect match: -37.90
MFE of this site: -25.30
MFEratio: 0.66754617414248
Allen et al. score: 8
Paired Regions (query5'-query3',transcript3'-transcript5')
1-3,1331-1329
5-8,1328-1325
10-19,1323-1314
20-20,1312-1312
Unpaired Regions (query5'-query3',transcript3'-transcript5')
4-4,x-x BULq: Bulge on query side
9-9,1324-1324 SIL: Symmetric internal loop
x-x,1313-1313 BULt: Bulge on transcript side
21-21,1311-1311 UP3: Unpaired region at 3' of query
Degradome data file: /media/owner/newdrive/phasing/degradome/_degradome.20171210/bbduk_trimmed/merged_HV2.fasta_dd.txt
Degardome Category: 4
Degradome p-value: 0.013385336399181
I tried to do this for the before and after values, but I keep getting errors. Sorry, I am new to Perl and would really appreciate your help. Here is some of the code I tried:
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use Modern::Perl;
my word = "Query:";
my $filename = $ARGV[0];
open(INPUT_FILE, $filename);
while (<<>>) {
    chomp;
    my ($before, $after) = m/(.+)(?:\t\Q$word\E:\t)(.+)/i;
    say "word: $word\tbefore: $before\tafter: $after";
}
Since you need straight pieces of data from each section, and both sections and data come clearly demarcated, the only question is what data structure to use. Given that you want mere lines with values collected from each section, a simple array should be fine.
This assumes that the phrases of interest (Query:, then Degardome Category: N, then the p-value) are unique to the contexts and places shown in the sample.
use warnings;
use strict;
use feature 'say';

my $file = shift || die "Usage $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";

my (@res, @query, $category, $pvalue);

while (<$fh>) {
    next if not /\S/;

    if (/(.*?)\s+Query:\s+(.*)/) {
        @query = ($1, $2);
        next;
    }

    if (/^\s*(Degardome Category:\s+[0-9]+)/) {
        $category = $1;
    }
    elsif (/^\s*(Degradome p-value:\s+[0-9.]+)/) {
        $pvalue = $1;
        push @res, [$category, $pvalue, @query];
    }
}

say "@$_" for @res;
The end of a section is detected with the p-value: line, at which point we add to @res an arrayref with all needed values captured up to that point.
The regex throughout depends on properties of data seen in the sample. Please review and adjust if some of my assumptions aren't right.
Details can also be pried out of the data more precisely, even by simply adding capture groups to the regexes above (and saving those captures into additional data structures).
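For instance, here is a sketch of that variation, capturing only the numbers themselves rather than the whole labelled phrases (same assumptions about the data layout as above):
use warnings;
use strict;
use feature 'say';

my ($category, @query, @res);
while (<>) {
    if (/(.*?)\s+Query:\s+(.*)/) { @query = ($1, $2) }
    elsif (/^\s*Degardome Category:\s+([0-9]+)/) { $category = $1 }    # number only
    elsif (/^\s*Degradome p-value:\s+([0-9.]+)/) {
        push @res, [$category, $1, @query];                            # p-value number only
    }
}
say join "\t", @$_ for @res;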

Rename files in PowerShell with a reference file

Sorry for previous confusion...
I've spent several hours today trying to write a PowerShell script that will pull a client ID off a PDF from system #1 (for example, Smith,John_H123_20171012.pdf, where the client ID is the H#### value), then look it up in an Excel spreadsheet that contains the client IDs for system 1 and system 2, then rename the file to the format needed for system 2 (xxx_0000000123_yyy.pdf).
One gotcha is that the client # is 2-4 digits in system 2 and always preceded by 0's.
Using PowerShell and regular expressions.
This is the first part I am trying to use for my initial rename:
Get-ChildItem -Filter *.pdf | Foreach-Object{
    $pattern = "_H(.*?)_2"
    $OrionID = [regex]::Match($file, $pattern).Groups[1].value
    Rename-Item -NewName $OrionID
}
It is not accepting "NewName" because it states it is an empty string. I have run:
Get-Variable | select name,value,Description
And NewName shows up as a name, but with no value. How can I pass the output from the regex into the rename?
Run this code line by line in a debugger and you will understand how it works.
#Starts an Excel process; you can see Excel.exe as a background process
$processExcel = New-Object -com Excel.Application
#If you set it to $False you won't see what's going on in the Excel app
$processExcel.visible = $True
$filePath="C:\somePath\file.xls"
#Open the $filePath file
$Workbook=$processExcel.Workbooks.Open($filePath)
#Select sheet 1
$sheet = $Workbook.Worksheets.Item(1)
#Select the sheet named "Name of some sheet"
$sheetTwo = $Workbook.Worksheets.Item("Name of some sheet")
#This stores the text of cell A3 (row 3, column 1) in the variable
$cellString = $sheet.cells.item(3,1).text
#This sets cell D1 (row 1, column 4) to the variable's value
$sheet.cells.item(1,4) = $cellString
#Iterate through the whole used range of the sheet
$lastUsedRow = $sheet.UsedRange.Rows.count
$LastUsedColumn = $sheet.UsedRange.Columns.count
for ($i = 1; $i -le $lastUsedRow; $i++){
    for ($j = 1; $j -le $LastUsedColumn; $j++){
        $otherString = $sheet.cells.item($i,$j).text
    }
}
#Create a new workbook and add a sheet to it
$newWorkBook = $processExcel.Workbooks.Add()
$newWorkBook.worksheets.add()
$newSheet = $newWorkBook.worksheets.item(1)
$newSheet.name="SomeName"
#Close the workbook; if you pass $False it won't save any changes (same as closing without saving)
$Workbook.close($True)
#$Workbook.SaveAs("C:\newPath\newFile.xls",56) #You can also save the workbook with SaveAs; 56 is a format code, look the codes up online
$newWorkBook.close($False)
#Closes the Excel app
$processExcel.Quit()
#This removes the Excel process from the OS; it does not always work
[System.Runtime.Interopservices.Marshal]::ReleaseComObject($processExcel)
Remove-Variable processExcel
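As for the rename loop itself: it fails because $file is never assigned; inside ForEach-Object the current file is $_. Here is a minimal sketch with that fixed, using a hypothetical hashtable standing in for the Excel lookup, and with the system-2 name format assumed from the question:
# hypothetical lookup built from the spreadsheet: system-1 ID -> system-2 client number
$lookup = @{ 'H123' = '123' }

Get-ChildItem -Filter *.pdf | ForEach-Object {
    $m = [regex]::Match($_.Name, '_(H\d+)_')
    if ($m.Success -and $lookup.ContainsKey($m.Groups[1].Value)) {
        # pad the 2-4 digit client number with leading zeros
        $clientNo = $lookup[$m.Groups[1].Value].PadLeft(10, '0')
        Rename-Item -Path $_.FullName -NewName "xxx_${clientNo}_yyy.pdf"
    }
}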
I ended up using a utility called "Bulk Rename Utility" together with Excel. I can run the various renaming regexes through BRU and apply the reference .txt file after some Excel formatting.

How can I fetch only the files containing the word 'test' from practice.txt and merge their data into one file?

cat practice.txt
test_0909_3434 test_8838 test_case_5656_5433 case_4333_3211 note_4433_2212
practice.txt contains more filenames like these.
required output:
test_0909_3434 test_8838
These test files contain some data, and I need to merge the data from these two files into one final file.
test_0909_3434 file contains:
id name
1 hh
2 ii
test_8838 file contains:
id name
2 ii
3 gg
4 kk
The final output file, mergedfile.txt, will look as follows:
id name
1 hh
2 ii
3 gg
4 kk
Redundant data also needs to be removed, as shown in mergedfile.txt above.
1) Simplistic (data is "in-order" and "well-formatted" in both input files):
cat f1 f2 | sort -u > f3
2) More complex (not "in-order" and not "well-formatted"): use a regex.
Read records from both input files. Assume the current input record is called x.
if [[ "${x}" =~ ^[[:space:]]*([[:digit:]]+)[[:space:]]+(.*)$ ]]; then
d="${BASH_REMATCH[1]}"
s="${BASH_REMATCH[2]}"
echo "d == $d, s == $s"
fi
aa["${d}"]="${k}"
Where aa is a Bash associative array (available in Bash >= 4.x):
declare -A aa=()
This assumes that the first field is an integer (and the key). You can process accordingly, depending on whether or not the key is unique.
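Putting the pieces together, here is a sketch of approach 2 as one script; the grep pattern for picking out the filenames and the header handling are my assumptions from the sample:
#!/usr/bin/env bash
# requires Bash >= 4 for associative arrays
declare -A aa=()

# assumed pattern: filenames that start with "test_" followed by digits
for f in $(grep -oE '\btest_[0-9][0-9_]*\b' practice.txt); do
    while IFS= read -r x; do
        if [[ "${x}" =~ ^[[:space:]]*([[:digit:]]+)[[:space:]]+(.*)$ ]]; then
            aa["${BASH_REMATCH[1]}"]="${BASH_REMATCH[2]}"   # keyed by id, so duplicates collapse
        fi
    done < "$f"
done

{
    echo "id name"
    for k in $(printf '%s\n' "${!aa[@]}" | sort -n); do
        echo "${k} ${aa[$k]}"
    done
} > mergedfile.txt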
If it's any more complex than this, consider using Perl.

match variable string at end of field with awk

Yet again my unfamiliarity with awk lets me down; I can't figure out how to match a variable at the end of a line.
This would be fairly trivial with grep etc., but I'm interested in matching integers at the end of a string in a specific field of a TSV, and all the posts suggest (and I believe it to be the case!) that awk is the way to go.
If I want to just match a single one explicitly, that's easy:
Here's my example file:
PVClopT_11 PAU_02102 PAU_02064 1pqx 1pqx_A 37.4 13 0.00035 31.4 >1pqx_A Conserved hypothetical protein; ZR18,structure, autostructure,spins,autoassign, northeast structural genomics consortium; NMR {Staphylococcus aureus subsp} SCOP: d.267.1.1 PDB: 2ffm_A 2m6q_A 2m8w_A No DOI found.
PVCpnf_18 PAK_3526 PAK_03186 3fxq 3fxq_A 99.7 2.7e-21 7e-26 122.2 >3fxq_A LYSR type regulator of TSAMBCD; transcriptional regulator, LTTR, TSAR, WHTH, DNA- transcription, transcription regulation; 1.85A {Comamonas testosteroni} PDB: 3fxr_A* 3fxu_A* 3fzj_A 3n6t_A 3n6u_A* 10.1111/j.1365-2958.2010.07043.x
PVCunit1_19 PAU_02807 PAU_02793 3kx6 3kx6_A 19.7 45 0.0012 31.3 >3kx6_A Fructose-bisphosphate aldolase; ssgcid, NIH, niaid, SBRI, UW, emerald biostructures, glycolysis, lyase, STRU genomics; HET: CIT; 2.10A {Babesia bovis} No DOI found.
PVClumt_17 PAU_02231 PAU_02190 3lfh 3lfh_A 39.7 12 0.0003 28.9 >3lfh_A Manxa, phosphotransferase system, mannose/fructose-speci component IIA; PTS; 1.80A {Thermoanaerobacter tengcongensis} No DOI found.
PVCcif_11 plu2521 PLT_02558 3h2t 3h2t_A 96.6 2.6e-05 6.7e-10 79.0 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCpnf_16 PAU_03338 PAU_03377 5jbr 5jbr_A 29.2 22 0.00058 23.9 >5jbr_A Uncharacterized protein BCAV_2135; structural genomics, PSI-biology, midwest center for structu genomics, MCSG, unknown function; 1.65A {Beutenbergia cavernae} No DOI found.
PVCunit1_17 PAK_2892 PAK_02622 1cii 1cii_A 63.2 2.7 6.9e-05 41.7 >1cii_A Colicin IA; bacteriocin, ION channel formation, transmembrane protein; 3.00A {Escherichia coli} SCOP: f.1.1.1 h.4.3.1 10.1038/385461a0
PVCunit1_11 PAK_2886 PAK_02616 3h2t 3h2t_A 96.6 1.9e-05 4.9e-10 79.9 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCpnf_11 PAU_03343 PAU_03382 3h2t 3h2t_A 97.4 4.4e-07 1.2e-11 89.7 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A 10.1016/j.str.2009.04.005
PVCunit1_5 afp5 PAU_02779 4tv4 4tv4_A 63.6 2.6 6.7e-05 30.5 >4tv4_A Uncharacterized protein; unknown function, ssgcid, virulence, structural genomics; 2.10A {Burkholderia pseudomallei} No DOI found.
And I can pull out all the lines which have "_11" at the end of the first column by running the following on the command line:
awk '{ if ($1 ~ /_11$/) { print } }' 02052017_HHresults_sorted.tsv
I want to enclose this in a loop to cover all integers from 1 to 5 (for instance), but I'm having trouble passing a variable into the text match.
I expect it should be something like the following, but $i$ seems like it's probably incorrect, and my google-fu failed me:
awk 'BEGIN{ for (i=1;i<=5;i++){ if ($1 ~ /_$i$/) { print } } }' 02052017_HHresults_sorted.tsv
There may be other issues I haven't spotted with that awk command too; as I say, I'm not very awk-savvy.
EDIT FOR CLARIFICATION
I want to separate out all the matches, so can't use a character class. i.e. I want all the lines ending in "_1" in one file, then all the ones ending in "_2" in another, and so on (hence the loop).
You can't put variables inside //. Use string concatenation, which is done by simply putting the strings adjacent to each other in awk. You don't need to use a regexp literal when you use the ~ operator; it always treats the second argument as a regexp.
awk '{ for (i = 1; i <= 5; i++) {
           if ( $1 ~ ("_" i "$") ) { print; break }
       }
     }' 02052017_HHresults_sorted.tsv
It sounds like you're thinking about this all wrong and what you really need is just (with GNU awk for gensub()):
awk '{ print > ("out" gensub(/.*_/,"",1,$1)) }' 02052017_HHresults_sorted.tsv
or with any awk:
awk '{ n=$1; sub(/.*_/,"",n); print > ("out" n) }' 02052017_HHresults_sorted.tsv
No need to loop; use a regex character class [...] and capture the digit. Note that the three-argument form of match() requires GNU awk:
awk 'match($1,/_([1-5])$/,a){ print >> (a[1] ".txt") }' 02052017_HHresults_sorted.tsv

Python - creating a dictionary from large text file where the key matches regex pattern

My question: how do I create a dictionary from a list by assigning dictionary keys based on a regex pattern match ('^--L-[0-9]{8}'), and assigning as each key's value all the lines between it and the next key?
Example excerpt from the raw file:
SQL> --L-93752133
SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;
SQL>
SQL> --L-52852243
SQL>
SQL> SELECT log_mode FROM v$database;
LOG_MODE
------------
NOARCHIVELOG
SQL>
SQL> archive log list
Database log mode No Archive Mode
Automatic archival Disabled
Archive destination USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence 3
Current log sequence 5
SQL>
SQL> --L-42127143
SQL>
SQL> SELECT t.name "TSName", e.encryptionalg "Algorithm", d.file_name "File Name"
2 FROM v$tablespace t
3 , v$encrypted_tablespaces e
4 , dba_data_files d
5 WHERE t.ts# = e.ts#
6 AND t.name = d.tablespace_name;
no rows selected
Some additional detail: The raw file can be large (at least 80K+ lines, but often much larger) and I need to preserve the original spacing so the output is still easy to read. Here's how I'm reading the file in and removing "SQL>" from the beginning of each line:
with open(rawFile, 'r') as inFile:
    content = inFile.read()
rawList = content.splitlines()
for line in rawList:
    cleanLine = re.sub('^SQL> ', '', line)
Finding the dictionary keys I'm looking for is easy:
pattern = re.compile(r'^--L-[0-9]{8}')
if pattern.search(cleanLine) is not None:
    itemID = pattern.search(cleanLine)
    print(itemID.group(0))
But how do I assign all lines between each key as the value belonging to the most recent key preceding them? I've been playing around with new lists, tuples, and dictionaries but everything I do is returning garbage. The goal is to have the data and keys linked to each other so that I can return them as needed later in my script.
I spent a while searching for a similar question, but in most other cases the source file was already in a dictionary-like format so creating the new dictionary was a less complicated problem. Maybe a dictionary or tuple isn't the right answer, but any help would be appreciated! Thanks!
In general, you should question why you would read the entire file, split the lines into a list, and then iterate over the list. This is a Python anti-pattern.
For line oriented text files, just do:
with open(fn) as f:
    for line in f:
        pass  # process a line
It sounds, however, like you have multi-line, block-oriented patterns. If so, with smaller files, read the entire file into a single string and use a regex on that. Then you would use group 1 and group 2 as the key and value in your dict:
pat = re.compile(pattern, flags)
with open(file_name) as f:
    di = {m.group(1): m.group(2) for m in pat.finditer(f.read())}
With a larger file, use an mmap:
import re, mmap

# note: a pattern used against an mmap must be a bytes pattern
pat = re.compile(pattern, flags)
with open(file_name, 'r+') as f:
    mm = mmap.mmap(f.fileno(), 0)
    for i, m in enumerate(pat.finditer(mm)):
        pass  # process each block accordingly...
As far as the regex, I am a little unclear on what you are trying to capture or not. I think this regex is what I am understanding you want:
^SQL> (--L-[0-9]{8})(.*?)(?=SQL> --L-[0-9]{8}|\Z)
In either case, running that regex with the example string yields:
>>> pat=re.compile(r'^SQL> (--L-[0-9]{8})\s*(.*?)\s*(?=SQL> --L-[0-9]{8}|\Z)', re.S | re.M)
>>> with open(file_name) as f:
...     di = {m.group(1): m.group(2) for m in pat.finditer(f.read())}
...
>>> di
{'--L-52852243': 'SQL> \nSQL> SELECT log_mode FROM v;\n\n LOG_MODE\n ------------\n NOARCHIVELOG\n\nSQL> \nSQL> archive log list\n Database log mode No Archive Mode\n Automatic archival Disabled\n Archive destination USE_DB_RECOVERY_FILE_DEST\n Oldest online log sequence 3\n Current log sequence 5\nSQL>',
'--L-93752133': 'SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;\nSQL>',
'--L-42127143': 'SQL> \nSQL> SELECT t.name TSName, e.encryptionalg Algorithm, d.file_name File Name\n 2 FROM v t\n 3 , v e\n 4 , dba_data_files d\n 5 WHERE t.ts# = e.ts#\n 6 AND t.name = d.tablespace_name;\n\n no rows selected'}
Something like this?
import re

with open(rawFile, 'r') as inFile:
    content = inFile.read()
rawList = content.splitlines()

pattern = re.compile(r'^--L-[0-9]{8}')
keyed_dict = {}
in_between_lines = ""
last_key = None
for line in rawList:
    cleanLine = re.sub('^SQL> ', '', line)
    match = pattern.search(cleanLine)
    if match is not None:
        if last_key:
            keyed_dict[last_key] = in_between_lines
        last_key = match.group(0)
        in_between_lines = ""
    else:
        in_between_lines += cleanLine + "\n"
# don't lose the lines after the final key
if last_key:
    keyed_dict[last_key] = in_between_lines
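With either approach, a section can then be fetched by its ID later in the script; for example, using a key from the sample data above:
print(keyed_dict.get('--L-52852243', 'no such key'))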