I am working with Tcl and trying to set up a regex to get the data within my XML string. The code below has an example string of what I am dealing with. The regexp attempts to find the first closing bracket and keep the data up to the next opening bracket, then place that into the variable number. Unfortunately, the output I get is "<RouteLabel>Hurdman<" instead of the expected "Hurdman". Any help would really be appreciated.
set direction(1) {<RouteLabel>Hurdman</RouteLabel>}
regexp {^.*>(.*)<} $direction(1) number
The issue here is not with the regex but with how you are using it.
The syntax you need is
regexp <PATTERN> <INPUT> <WHOLE_MATCH_VAR> <CAPTURE_1_VAR> ... <CAPTURE_n_VAR>
So, in your case, as you are not interested in the whole match, just put _ where the whole match is expected:
set direction(1) {<RouteLabel>Hurdman</RouteLabel>}
regexp {^.*>(.*)<} $direction(1) _ number
puts $number
This prints Hurdman. See the online Tcl demo.
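The same split between the whole match and the capture groups shows up in most regex APIs. As a rough illustration only (Python rather than Tcl, with variable names chosen for this sketch), group 0 plays the role of the whole-match variable and group 1 the role of the first capture variable:

```python
import re

direction = "<RouteLabel>Hurdman</RouteLabel>"

# Same pattern as in the question.
m = re.match(r"^.*>(.*)<", direction)

whole_match = m.group(0)  # what the first Tcl match variable receives
number = m.group(1)       # what the second Tcl match variable receives

print(whole_match)  # <RouteLabel>Hurdman<
print(number)       # Hurdman
```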
Crash course in tDOM for this exact task:
Get tDOM (note different spelling in package name):
% package require tdom
0.8.3
Create an empty document with a root element called foobar:
% set doc [dom createDocument foobar]
domDoc02569130
Get a fix on the root:
% set root [$doc documentElement]
domNode025692E0
Set up one of your XML strings:
% set direction(1) {<RouteLabel>Hurdman</RouteLabel>}
<RouteLabel>Hurdman</RouteLabel>
Add it to the DOM tree at the root:
% $root appendXML $direction(1)
domNode025692E0
Get the string you want by XPath expression:
% $root selectNodes {string(//RouteLabel/text())}
Hurdman
Or by querying the root (only works if there is only one single text node inserted at a time, otherwise you get them all concatenated):
% $root asText
Hurdman
If you want to clear the DOM tree from the root to make it ready for appending new strings without the old ones interfering:
% foreach node [$root childNodes] {$node delete}
But if you use XPath expressions you should be able to append any number of XML strings and still retrieve their content.
Once again:
package require tdom
set doc [dom createDocument foobar]
set root [$doc documentElement]
set direction(1) {<RouteLabel>Hurdman</RouteLabel>}
$root appendXML $direction(1)
$root selectNodes {string(//RouteLabel/text())}
# => Hurdman
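For comparison, the same parse-then-query idea can be sketched with Python's standard library. This is not tDOM; the root element name foobar is simply carried over from the example above, and xml.etree.ElementTree supports only a limited XPath subset:

```python
import xml.etree.ElementTree as ET

# Create a root element and append the parsed fragment to it.
root = ET.Element("foobar")
root.append(ET.fromstring("<RouteLabel>Hurdman</RouteLabel>"))

# Query by path (ElementTree's limited XPath subset).
label = root.findtext(".//RouteLabel")

# Or concatenate all text below the root, similar to asText.
all_text = "".join(root.itertext())

print(label)     # Hurdman
print(all_text)  # Hurdman
```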
Documentation:
tdom (package)
I have a list of items (for example, files in a folder); each item in the list is its own string.
In the example, the X--Y-- parts have incrementing digits.
My program has the filenames in a list, e.g. ["file1.txt", "file2.txt"].
item 1:
"X1Y2 alehandro alex.txt"
item 2:
"X1Y3 james file of files.txt"
For each string I want to keep only the first part (the "X1Y2" part) plus the extension, so I need to remove all the extra text from the filename.
I just want a regex that does this; I still struggle with regex.
I need to pass it through a "replace with empty string" operation (I am using Microsoft PowerToys PowerRename to do this).
Alternatives in PowerShell are also welcome.
Any advice would be appreciated.
I want the output to be the following:
["X1Y2.txt","X2Y3.txt","X4Y3.txt"]
with the unwanted extra text removed.
A general solution using re.sub along with a list comprehension might be:
import re

files = ["X1Y2 alehandro alex.txt", "X1Y3 james file of files.txt"]
# Keep the first run of non-space characters and the extension; drop everything between.
output = [re.sub(r'(\S+).*\.(\w+)$', r'\1.\2', f) for f in files]
print(output) # ['X1Y2.txt', 'X1Y3.txt']
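If the regex feels opaque, the same result can be had without one. A minimal sketch, assuming the wanted prefix never contains whitespace and every name has text before the extension (short_name is a name made up for this example):

```python
import os

def short_name(filename):
    # Split off the extension, then keep only the first whitespace-separated token.
    stem, ext = os.path.splitext(filename)
    return stem.split()[0] + ext

files = ["X1Y2 alehandro alex.txt", "X1Y3 james file of files.txt"]
output = [short_name(f) for f in files]
print(output)  # ['X1Y2.txt', 'X1Y3.txt']
```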
I have some URLs in my cas_fnd_dwd_det table:
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf
www.casiac.net/fnds/casi/as.pdf
www.casiac.net/fnds/casi/vindq.pdf
www.casiac.net/fnds/CASI/mnip.pdf
How do I copy the letters between the last '/' and '.pdf' to another column?
Expected outcome:
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf qnxp
www.casiac.net/fnds/casi/as.pdf as
www.casiac.net/fnds/casi/vindq.pdf vindq
www.casiac.net/fnds/CASI/mnip.pdf mnip
The URL prefixes below are static:
www.casiac.net/fnds/CASI/
www.casiac.net/fnds/casi/
Please advise: how do I select the codes between the last '/' and '.pdf'?
I would recommend taking a look at REGEXP_SUBSTR, which lets you apply a regular expression. Db2 has other string processing functions, but the regex function may be the easiest solution. See SO question on regex and URI parts for different ways of writing the expression. The following returns the last slash, the file name, and the extension:
SELECT REGEXP_SUBSTR('http://fobar.com/one/two/abc.pdf','\/(\w)*\.pdf' ,1,1)
FROM sysibm.sysdummy1
/abc.pdf
The following uses REGEXP_REPLACE; the pattern is from this SO question, with the pdf file extension added. It splits the string into three parts: everything up to the last slash (a non-capturing group), then the file name (group 1), then the ".pdf" (group 2). The '$1' returns group 1; group 0 would be the whole match.
SELECT REGEXP_REPLACE('http://fobar.com/one/two/abc.pdf','(?:.+\/)(.+)(\.pdf)','$1' ,1,1)
FROM sysibm.sysdummy1
abc
You could apply LENGTH and SUBSTR to extract the relevant part or try to build that into the regex.
For Db2 versions older than 11.1: I am not sure whether this works on 9.5, but it should work from 9.7 onward.
Try this as is.
with cas_fnd_dwd_det (casi_imp_urls) as (values
'www.casiac.net/fnds/CASI/qnxp.pdf'
, 'www.casiac.net/fnds/casi/as.pdf'
, 'www.casiac.net/fnds/casi/vindq.pdf'
, 'www.casiac.net/fnds/CASI/mnip.PDF'
)
select
casi_imp_urls
, xmlcast(xmlquery('fn:replace($s, ".*/(.*)\.pdf", "$1", "i")' passing casi_imp_urls as "s") as varchar(50)) cas_code
from cas_fnd_dwd_det
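The pattern logic can be checked outside the database before putting it into a query. A small Python sketch, assuming the same four sample values as above; extract_code and CODE_PATTERN are names invented for this example, and the IGNORECASE flag mirrors the "i" flag passed to fn:replace:

```python
import re

urls = [
    "www.casiac.net/fnds/CASI/qnxp.pdf",
    "www.casiac.net/fnds/casi/as.pdf",
    "www.casiac.net/fnds/casi/vindq.pdf",
    "www.casiac.net/fnds/CASI/mnip.PDF",
]

# Capture everything after the last slash, up to a trailing ".pdf" in any case.
CODE_PATTERN = re.compile(r"/([^/]+)\.pdf$", re.IGNORECASE)

def extract_code(url):
    # Return an empty string when the URL does not end in a .pdf file name.
    match = CODE_PATTERN.search(url)
    return match.group(1) if match else ""

codes = [extract_code(u) for u in urls]
print(codes)  # ['qnxp', 'as', 'vindq', 'mnip']
```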
I am looking for a regular expression to locate numerous expressions for find and replace. Each expression looks like s360a__fieldname__c. I need to find all the instances where s360a__ is followed by __c.
The catch is that the match has to be within one line, so that it does not pair a starting s360a__ with the next __c that may be several lines below.
Here is an example of some of the XML I am changing.
<fields>
<fullName>s360a__AddressPreferredStreetAddressCity__c</fullName>
<deprecated>false</deprecated>
<externalId>false</externalId>
<label>Preferred Street City</label>
<length>255</length>
<required>false</required>
<trackFeedHistory>false</trackFeedHistory>
<trackHistory>false</trackHistory>
<type>Text</type>
<unique>false</unique>
</fields>
<fields>
<fullName>s360a__AddressPreferredStreetAddressCountry__c</fullName>
<deprecated>false</deprecated>
<externalId>false</externalId>
<label>Preferred Street Country</label>
<picklist>
You would be better off using a parser combined with an XPath expression instead. Here's an example in PHP (it can easily be adapted for e.g. Python as well). The idea is to load the DOM, then use XPath functions to filter the elements (starts-with() and text() in this example):
<?php
$xml = '<fields>
<fullName>s360a__AddressPreferredStreetAddressCity__c</fullName>
<deprecated>false</deprecated>
<externalId>false</externalId>
<label>Preferred Street City</label>
<length>255</length>
<required>false</required>
<trackFeedHistory>false</trackFeedHistory>
<trackHistory>false</trackHistory>
<type>Text</type>
<unique>false</unique>
</fields>';
$dom = simplexml_load_string($xml);
// find everything where the text starts with 's360a_'
$fields = $dom->xpath("//*[starts-with(text(), 's360a_')]");
print_r($fields);
# s360a__AddressPreferredStreetAddressCity__c
The code checks whether the text starts with s360a_. To also check that it ends with a specific string, you need to fiddle quite a bit, because the corresponding function ends-with() belongs to XPath 2.0 and is not supported by PHP's XPath 1.0 engine.
# check if the node text starts and ends with a specific string
$fields = $dom->xpath("//*[starts-with(., 's360a_') and substring(text(), string-length(text()) - string-length('_c') +1) = '_c']");
?>
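If a plain regex is still wanted for a find-and-replace tool, one way to keep the match on a single line is to allow only word characters between the prefix and the suffix, since those can never include a line break or a tag. A Python sketch of that idea (the shortened sample text and the names pattern and names are assumptions made for this example):

```python
import re

xml_text = """<fields>
    <fullName>s360a__AddressPreferredStreetAddressCity__c</fullName>
    <label>Preferred Street City</label>
</fields>
<fields>
    <fullName>s360a__AddressPreferredStreetAddressCountry__c</fullName>
    <label>Preferred Street Country</label>
</fields>"""

# \w matches letters, digits and underscore only, so the match cannot
# run across a newline or past a closing tag.
pattern = re.compile(r"s360a__(\w+?)__c\b")

names = pattern.findall(xml_text)
print(names)
```

Here findall returns only the captured field names; pattern.sub could be used in the same way once the replacement text is known.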
I'm trying to write a script to simplify searching through a particular application's log files for specific information. I thought there might be a way to convert them into an XML tree, and I'm off to a decent start, but the problem is that the application's log files are an absolute mess, if you ask me.
Some entries are simple
2014/04/09 11:27:03 INFO Some.code.function - Doing stuff
Ideally I'd like to turn the above into something like this
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
</Message>
Other entries are something like this where there's additional information and line breaks
2014/04/09 11:27:04 INFO Some.code.function - Something happens
changes:
this stuff happened
I'd like to turn this last chunk into something like the above, but add the additional info into a section
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
<details>changes:
this stuff happened</details>
</Message>
and then other messages, errors will be in the form of
2014/04/09 11:27:03 ERROR Some.code.function - Something didn't work right
Log Entry: LONGARSEDGUID
Error Code: E3145
Application: Name
Details:
message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry
I'd like to convert this last chunk like the two examples above, but adding XML nodes for log entry, error code, application, and again details, like so
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>ERROR </Type>
<Source>Some.code.function</Source>
<Sub>Something didn't work right</Sub>
<Entry>LONGARSEDGUID</Entry>
<Code>E3145</Code>
<Application>Name</Application>
<details>message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry</details>
</Message>
Now I know that Select-String has a context option which would let me select a number of lines after the line I've filtered; the problem is that this isn't a constant number.
I'm thinking a regular expression would allow me to select the paragraph chunk before the date string, but regular expressions are not a strong point of mine, and I thought there might be a better way because the one constant is that new entries start with a date string.
The idea, though, is to break these up into either XML or tables of sorts, and from there I'm hoping it might make filtering out non-relevant or recurring messages a little easier.
I have a sample I just tossed on pastebin after removing/replacing a few bits of information for privacy reasons
http://pastebin.com/raw.php?i=M9iShyT2
Sorry this is kind of late, I got tied up with work for a bit there (darn work expecting me to be productive while on their dime). I ended up with something similar to Ansgar Wiechers' solution, but formatted things into objects and collected those into an array. It doesn't manage the XML that you added later, but this gives you a nice array of objects to work with for the other records. I'll explain the main RegEx line here, and I'll comment in-line where it's practical.
'(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+?] (\w+?) {1,2}(.+?) - (.+)$' is the RegEx that detects the start of a new record. I started to explain it, but there are probably better resources for learning RegEx than me explaining it to you. See this RegEx101.com link for a full breakdown and examples.
$Records=@() #Create empty array that we will populate with custom objects later
$Event = $Null #make sure nothing in $Event to give script a clean start
Get-Content 'C:\temp\test1.txt' | #Load file, and start looping through it line-by-line.
?{![string]::IsNullOrEmpty($_)}|% { #Filter out blank lines, and then perform the following on each line
if ($_ -match '(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+?] (\w+?) {1,2}(.+?) - (.+)$') { #New Record Detector line! If it finds this RegEx match, it means we're starting a new record.
if ($Event) { #If there's already a record in progress, add it to the Array
$Records+=$Event
}
$Event = New-Object PSObject -Property @{ #Create a custom PSObject object with these properties that we just got from that RegEx match
DateStamp = [datetime](get-date $Matches[1]) #We convert the date/time stamp into an actual DateTime object. That way sorting works better, and you can compare it to real dates if needed.
Type = $Matches[2]
Source = $Matches[3]
Message = $Matches[4]}
Ok, little pause for the cause here. $Matches isn't defined by me, so why am I referencing it? When PowerShell gets a match from a RegEx expression it automagically stores the results in $Matches. So all the groups that we just matched in parentheses become $Matches[1], $Matches[2], and so on. Yes, there is also a $Matches[0], but that is the entire text that was matched, not just the groups. (Strictly speaking, $Matches is a hashtable keyed by group number rather than an array.) We now return you to your regularly scheduled script...
} else { #End of the 'New Record' section. If it's not a new record it does the following
if($_ -match "^((?:[^ ^\[])(?:\w| |\.)+?):(.*)$"){
RegEx match again. It starts off by stating that this has to be the beginning of the string with the caret character (^). Then it says, in a non-capturing group (noted by the (?:<stuff>) format, which for my purposes just means it won't show up in $Matches), [^ \[]: the next character cannot be a space or an opening bracket (escaped with a \). That is just to speed things up and skip those lines for this check. If you have things in brackets [] and the first character is a caret, it means 'don't match any of the characters in these brackets'.
I actually just changed this next part to include periods, and used \w instead of [a-zA-Z0-9] because it's essentially the same thing but shorter. \w is a "word character" in RegEx, and includes letters, numbers, and the underscore. I'm not sure why the underscore is considered part of a word, but I don't make the rules I just play the game. I was using [a-zA-Z0-9] which matches anything between 'a' and 'z' (lowercase), anything between 'A' and 'Z' (uppercase), and anything between '0' and '9'. At the risk of including the underscore character \w is a lot shorter and simpler.
Then the actual capturing part of this RegEx. This has 2 groups; the first is letters, numbers, underscores, spaces, and periods (escaped with a \ because '.' on its own matches any character). Then a colon. Then a second group that is everything else until the end of the line.
$Field = $Matches[1] #Everything before the colon is the name of the field
$Value = $Matches[2].trim() #everything after the colon is the data in that field
$Event | Add-Member $Field $Value #Add the Field to $Event as a NoteProperty, with a value of $Value. Those two are actually positional parameters for Add-Member, so we don't have to go and specify what kind of member, specify what the name is, and what the value is. Just Add-Member <[string]name> <value can be a string, array, yeti, whatever... it's not picky>
} #End of New Field for current record
else { #If it didn't match the field regex, this line is just more data for the last field, so don't change the field, just append the line to it.
$Event.$Field += if(![string]::isNullOrEmpty($Event.$Field)){"`r`n$_"}else{$_}}
This is a long explanation for a fairly short bit of code. Really all it does is add data to the field! It uses an inverted (prefixed with !) If check to see whether the current field already has data or is currently Null or Empty. If it has data, it adds a new line and then adds the current line. If it is empty, it skips the new line bit and just adds the data.
} #End of the 'not a new record' section
} #End of the per-line loop
$Records+=$Event #Adds the last event to the array of records.
Sorry, I'm not very good with XML. But at least this gets you clean records.
Edit: Ok, the code is annotated now; hopefully everything is explained well enough. If something is still confusing, perhaps I can refer you to a site that explains it better than I can. I ran the above against your sample input in PasteBin.
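For readers who want to experiment with the record-splitting logic outside PowerShell, here is a rough Python sketch of the same idea. It assumes the simpler header format from the sample lines in the question (no bracketed number after the timestamp), and parse_log, HEADER and the dictionary keys are names invented for this example:

```python
import re

# Header line: date, time, type, source, " - ", message.
HEADER = re.compile(
    r"^(\d{4}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (\w+)\s+(\S+) - (.+)$"
)

def parse_log(lines):
    records = []
    current = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        m = HEADER.match(line)
        if m:
            # A header line starts a new record; store the finished one first.
            if current is not None:
                records.append(current)
            date, time, kind, source, message = m.groups()
            current = {"Date": date, "Time": time, "Type": kind,
                       "Source": source, "Message": message, "Details": []}
        elif current is not None:
            # Anything else belongs to the record in progress.
            current["Details"].append(line.strip())
    if current is not None:
        records.append(current)  # do not lose the last record
    return records

sample = """2014/04/09 11:27:03 INFO Some.code.function - Doing stuff
2014/04/09 11:27:04 INFO Some.code.function - Something happens
changes:
this stuff happened
""".splitlines()

records = parse_log(sample)
print(len(records))           # 2
print(records[1]["Details"])  # ['changes:', 'this stuff happened']
```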
One possible way to deal with such files is to process them line by line. Each log entry starts with a timestamp and ends when the next line starting with a timestamp appears, so you could do something like this:
Get-Content 'C:\path\to\your.log' | % {
if ($_ -match '^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}') {
if ($logRecord) {
# If a current log record exists, it is complete now, so it can be added
# to your XML or whatever, e.g.:
$logRecord -match '^(\d{4}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (\S+) ...'
$message = $xml.CreateElement('Message')
$date = $xml.CreateElement('Date')
$date.InnerText = $matches[1]
$message.AppendChild($date)
$time = $xml.CreateElement('Time')
$time.InnerText = $matches[2]
$message.AppendChild($time)
$type = $xml.CreateElement('Type')
$type.InnerText = $matches[3]
$message.AppendChild($type)
...
$xml.SelectSingleNode('...').AppendChild($message)
}
$logRecord = $_ # start new record
} else {
$logRecord += "`r`n$_" # append to current record
}
}
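The element-building step above has a close counterpart in most XML libraries. A minimal Python sketch, assuming a record has already been split into fields (the record dictionary and the record_to_xml helper are hypothetical; the tag names follow the layout asked for in the question):

```python
import xml.etree.ElementTree as ET

def record_to_xml(record):
    # Build one <Message> element with a child element per field.
    message = ET.Element("Message")
    for tag in ("Date", "Time", "Type", "Source", "Sub"):
        ET.SubElement(message, tag).text = record[tag]
    # The details section is optional.
    if record.get("details"):
        ET.SubElement(message, "details").text = record["details"]
    return message

record = {
    "Date": "2014/04/09",
    "Time": "11:27:04",
    "Type": "INFO",
    "Source": "Some.code.function",
    "Sub": "Something happens",
    "details": "changes:\nthis stuff happened",
}

xml_string = ET.tostring(record_to_xml(record), encoding="unicode")
print(xml_string)
```

ElementTree escapes characters such as & and < in the text for you, which is one reason to build the tree rather than concatenate strings.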
This command line parses a contact list document in which each contact may or may not have a phone, e-mail, or web site listed. If a contact has all three, everything works great, and the return value of FormatContact() is appended at the end of the line for data uploading:
silent!/^\d/+1|ki|/\n^\d\|\%$/-1|kj|'i,'jd|let @a = substitute(@",'\s*Phone: \([^,]*\)\_.*','\1',"")|let @b = substitute(@",'^\_.*E-mail:\s\[\d*\]\([-_@.0-9a-zA-Z]*\)\_.*','\1',"")|let @c = substitute(@",'^\_.*Web site:\s*\[\d*\]\([-_.:/0-9a-zA-Z]*\)\_.*','\1',"")|?^\d\+?s/$/\=','.FormatContact(@a,@b,@c)
or, broken down:
silent!/^\d/+1|ki|/\n^\d\|\%$/-1|kj|'i,'jd
let @a = substitute(@",'\s*Phone: \([^,]*\)\_.*','\1',"")
let @b = substitute(@",'^\_.*E-mail:\s\[\d*\]\([-_@.0-9a-zA-Z]*\)\_.*','\1',"")
let @c = substitute(@",'^\_.*Web site:\s*\[\d*\]\([-_.:/0-9a-zA-Z]*\)\_.*','\1',"")
?^\d\+?s/$/\=','.FormatContact(@a,@b,@c)
I created three separate searches so as not to make any ONE search fail if one atom failed to match because - again - the contact info may or may not exist per contact.
The problem that solution created is that when the pattern does not match, I get the whole @" in @a. Instead, I need it to be empty when the match does not occur. I need each variable represented (phone, email, web) whether it is empty or not.
I see no flags that can be set in the substitution function that will do this.
Is there a way to return "" if \1 is empty?
Is there a way to create an optional atom so the search query(ies) could still account for an empty match so as to properly record it as empty?
Instead of using substitutions that replace the whole captured text
with its part of interest, one can match only that target part. Unlike
substitution routines, matching ones either locate the text conforming
to the given pattern, or report that there is no such text. Thus,
using the matchstr() function in preference to substitute(), the
parsing code listed in the question can be changed as follows:
let @a = matchstr(@", '\<Phone:\s*\zs[^,]*')
let @b = matchstr(@", '\<E-mail:\s*\[\d*\]\zs[-_@.0-9a-zA-Z]*')
let @c = matchstr(@", '\<Web site:\s*\[\d*\]\zs[-_.:/0-9a-zA-Z]*')
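The difference between substituting and matching is not specific to Vim. As an illustration in Python (the contact text is made-up sample data and first_group is a helper invented for this sketch), a substitution hands back the untouched input when nothing matches, whereas a search can report the miss and fall back to an empty string:

```python
import re

# Made-up contact block with no phone number.
text = "1 Jane Doe\nE-mail: [2]jane@example.com\nWeb site: [3]http://example.com/\n"

# Substitution: with no match, the whole input comes back unchanged.
via_sub = re.sub(r"(?s)^.*Phone: ([^,]*).*$", r"\1", text)

def first_group(pattern, subject):
    # Matching: return the captured part, or "" when there is no match.
    m = re.search(pattern, subject)
    return m.group(1) if m else ""

phone = first_group(r"\bPhone:\s*([^,\n]*)", text)
email = first_group(r"\bE-mail:\s*\[\d*\]([-_@.0-9a-zA-Z]*)", text)
web = first_group(r"\bWeb site:\s*\[\d*\]([-_.:/0-9a-zA-Z]*)", text)

print(via_sub == text)            # True
print(repr(phone), email, web)    # '' jane@example.com http://example.com/
```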
Just in case you want linewise processing, consider using substitute() in combination with :global, e.g.
let @a=""
g/text to match/let @A=substitute(getline("."), '^.*\(' . @/ . '\).*$', '\1\r', '')
This collects the matched text from every line that contained it into register a, one match per line, which you can then print:
echo @a
The beautiful thing here is that you can make it work with the last-used search pattern very easily:
g//let @A=substitute(getline("."), '^.*\(' . @/ . '\).*$', '\1\r', '')