Find and verify path strings in text file using PowerShell, RegEx search

First time posting here, and I'll try to be clear and detailed, but be gentle if I missed an existing answer when I searched these boards.
First, the issues:
How to exclude a RegEx response that contains a specific keyword ("fastcopy")
How to include path results that do not end in a file name/wildcard
I am working with a set of text files that are very similar to batch files. They are plain text, and contain header lines, lines containing paths to files on a server, and comment lines. Commented lines begin with a semicolon (;), so that is simple enough to rule out. The paths should all start with a variable %INSTDIR%, but they may or may not have quotes surrounding the path, and they may or may not have execution options following the path. One last note... the company uses FastCopy.exe to dump files/folders down from the network, and in such a line I would like to return the folder/file being copied instead of the path containing fastcopy.exe.
Here is a sample (kind of large to show potential issues):
[Installing .NET 3.5 Hotfix KB943326 for App1]
; *** Added NET 3.5 SP1 hotfix KB943326: resolves App1 hidden menus force laptop re-booting
1 = %INSTDIR%\ToolShare$\Sample_Toolbox\applications\.NET_3.5_Hotfix_KB943326\WindowsXP-KB943326-x86-ENU.exe /quiet /norestart
[Installing Agent 5.3.1]
1 = %INSTDIR%\ToolShare$\Sample_Toolbox\applications\AGenT_531_2.0\w7wxp_ze_20\install.exe
[Installing APR Manager 2.1]
1 = %INSTDIR%\ToolShare$\Sample_Toolbox\applications\APRManager_21_Updated_2.0\wviwxp_ze_20\install.exe
[Installing Scope Simulator]
1 = MD "C:\Temp\scope_simulator_10"
2 = start /wait /high %INSTDIR%\ToolShare$\Site_Toolbox\Custom_Scripts\Source\fastcopy.exe /auto_close /no_confirm_del /no_confirm_stop /log=FALSE /open_window /force_start /force_close /stream=FALSE /cmd=diff "%INSTDIR%\ToolShare$\Sample_Toolbox\applications\scope_simulator_10" /to="C:\Temp\scope_simulator_10"
3 = "C:\Temp\scope_simulator_10\w7wxp_ze_10\Install.exe"
4 = RD "C:\temp\scope_simulator_10" /q /s
[Installing Log Analyzer Offline 2.6.1]
1 = %INSTDIR%\ToolShare$\Sample_Toolbox\applications\Log_Analyzer_Offline_261\wxp_ze_10\install.exe
[Installing Data Migration Script]
1 = MD "C:\Temp\Data Migration"
2 = xcopy "%INSTDIR%\ToolShare$\Sample_Toolbox\Support\Data Migration\*.*" "C:\Temp\Data Migration" /y /e
3 = xcopy "%INSTDIR%\ToolShare$\Sample_Toolbox\Support\Data Migration\Data Migration.lnk" C:\DOCUME~1\ALLUSE~1\Desktop\ /Y
I have it set to pull a 'dir \\UNCPath\*.ini' and then loop through that doing a ForEach ($INI in $Results) bit. The line that I have been using inside the loop to try and pull the paths from each line is:
gc $ini|?{!($_ -match "^;") -and ($_ -match "%INST[^`"]*?\\.*(\.\w{3}|\.\*)(?=`"|\s|\Z)")}|%{$TestPath = $Matches[0].replace("%INSTDIR%","\\ServerName1");if(test-path $testpath){write-host " [OK] " -foregroundcolor Green -NoNewline}else{write-host "[Missing] " -ForegroundColor red -NoNewline};write-host "$testpath"}
This gets me almost everything I could want. What it doesn't do is get anything that does not end in either a .* or a standard three-character extension (.exe, .cmd, .jar, etc.). Plus it kicks back the fastcopy path instead of the path that is being copied.
What I would like for results:
%INSTDIR%\ToolShare$\Sample_Toolbox\applications\.NET_3.5_Hotfix_KB943326\WindowsXP-KB943326-x86-ENU.exe
%INSTDIR%\ToolShare$\Sample_Toolbox\applications\AGenT_531_2.0\w7wxp_ze_20\install.exe
%INSTDIR%\ToolShare$\Sample_Toolbox\applications\APRManager_21_Updated_2.0\wviwxp_ze_20\install.exe
%INSTDIR%\ToolShare$\Sample_Toolbox\applications\scope_simulator_10
%INSTDIR%\ToolShare$\Sample_Toolbox\applications\Log_Analyzer_Offline_261\wxp_ze_10\install.exe
%INSTDIR%\ToolShare$\Sample_Toolbox\Support\Data Migration\*.*
%INSTDIR%\ToolShare$\Sample_Toolbox\Support\Data Migration\Data Migration.lnk
I do not get the second result (instead I get the FastCopy path, but even if I strip Fastcopy from the line and only have the desired path it won't return it). Any suggestions are welcome.

The following script should work just fine.
$paths = Get-Content $ini | Foreach {
if ($_ -match "^(?=[^;]).*?(?<delimiter>[""' ])(?<path>%INSTDIR%(?!.*?fastcopy.exe).*?)(?:\1|$)")
{
Write-Output $Matches["path"]
}
}
The $paths variable will now contain all the requested paths. Note that an %INSTDIR% path is skipped whenever the literal string "fastcopy.exe" appears anywhere after it on the same line.
An attempt at explaining the regular expression:
^ - match the start of the line
(?=[^;]) - positive lookahead verifying that the line does not start with a semicolon
.*? - any character, as few as possible (to remove all characters before the path we want to match)
(?<delimiter>["' ]) - named group capturing the character immediately before the path: a space, a quotation mark or an apostrophe.
(?<path> - start a named capturing group for capturing the "path"
%INSTDIR% - matches the literal string '%INSTDIR%'
(?!.*?fastcopy.exe) - negative lookahead verifying that the part of the line we're trying to match (which has started with %INSTDIR%) doesn't contain the word fastcopy.exe anywhere later in the string (the second time the %INSTDIR% occurs on the fastcopy line, the rest of the line does not contain the fastcopy.exe literal string).
.*? - matches any character, as few as possible, to make sure that we stop as soon as we find a matching delimiter character below
) - ends the named capturing group "path"
(?:\1|$) - matches (in a non-capturing group) the character found by the delimiter group above (to match a quotation character, apostrophe or space, depending on what character was immediately before the %INSTDIR% literal string), or the end of the line.
If anything is unclear, please add a comment below asking for clarifications.
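For readers who want to try the pattern outside PowerShell, here is a minimal Python sketch of the same idea. It is an illustration rather than part of the answer: the names PATH_RE, extract_paths and sample are made up, the named-group syntax is changed to Python's (?P<name>...) form, the dot in fastcopy.exe is escaped, and the fastcopy sample line is shortened.

```python
import re

# Python spelling of the pattern explained above: (?P<name>...) replaces
# (?<name>...), and (?P=delimiter) replaces the \1 backreference.
PATH_RE = re.compile(
    r"""^(?=[^;]).*?(?P<delimiter>["' ])(?P<path>%INSTDIR%(?!.*?fastcopy\.exe).*?)(?:(?P=delimiter)|$)""",
    re.IGNORECASE,  # PowerShell's -match is case-insensitive by default
)


def extract_paths(lines):
    """Return the first qualifying %INSTDIR% path on each non-comment line."""
    paths = []
    for line in lines:
        match = PATH_RE.search(line)
        if match:
            paths.append(match.group("path"))
    return paths


# Sample lines adapted from the question (the fastcopy line is shortened).
sample = [
    r'; *** Added NET 3.5 SP1 hotfix KB943326: resolves App1 hidden menus',
    r'1 = %INSTDIR%\ToolShare$\Sample_Toolbox\applications\.NET_3.5_Hotfix_KB943326\WindowsXP-KB943326-x86-ENU.exe /quiet /norestart',
    r'2 = start /wait /high %INSTDIR%\ToolShare$\Site_Toolbox\Custom_Scripts\Source\fastcopy.exe /auto_close /cmd=diff "%INSTDIR%\ToolShare$\Sample_Toolbox\applications\scope_simulator_10" /to="C:\Temp\scope_simulator_10"',
    r'2 = xcopy "%INSTDIR%\ToolShare$\Sample_Toolbox\Support\Data Migration\*.*" "C:\Temp\Data Migration" /y /e',
]

for path in extract_paths(sample):
    print(path)
```

One caveat of the delimiter approach: an unquoted path that contains a space would be cut off at that space, because the space is then treated as the closing delimiter.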


Add text to the end if not already added

I have the following lines:
source = "git::ssh://git@github.abc.com/test//bar"
source = "git::ssh://git@github.abc.com/test//foo?ref=tf12"
resource = "bar"
I want to update any lines that contain the words source and git by adding ?ref=tf12 to the end of the line, but inside the closing ". If the line already contains ?ref=tf12, it should be skipped:
source = "git::ssh://git@github.abc.com/test//bar?ref=tf12"
source = "git::ssh://git@github.abc.com/test//foo?ref=tf12"
resource = "bar"
I have the following expression using sed, but it produces the wrong output:
sed 's#source.*git.*//.*#&?ref=tf12#' file.tf
source = "git::ssh://git@github.abc.com/test//bar"?ref=tf12
source = "git::ssh://git@github.abc.com/test//foo?ref=tf12"?ref=tf12
resource = "bar"
resource = "bar"
Using simple regular expressions for this is rather brittle; if at all possible, using a more robust configuration file parser would probably be a better idea. If that's not possible, you might want to tighten up the regular expressions to make sure you don't modify unrelated lines. But here is a really simple solution, at least as a starting point.
sed -e '/^ *source *= *"git/!b' -e '/?ref=tf12" *$/b' -e 's/" *$/?ref=tf12"/' file.tf
This consists of three commands. Remember that sed examines one line at a time.
/^ *source *= *"git/!b - if this line does not begin with source="git (with optional spaces between the tokens), leave it alone. (! means "does not match" and b means "branch (to the end of this script)", i.e. skip this line and fetch the next one.)
/?ref=tf12" *$/b similarly says to leave alone lines which match this regex. In other words, if the line ends with ?ref=tf12" (with optional spaces after), don't modify it.
s/" *$/?ref=tf12"/ says to change the last double quote to include ?ref=tf12 before it. This will only happen on lines which were not skipped by the two previous commands.
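The three sed commands can also be read as three small decisions per line. The following Python sketch mirrors them one for one; it is only an illustration, and the names REF and add_ref are made up for the example.

```python
import re

REF = "?ref=tf12"  # suffix to add inside the closing quote


def add_ref(line):
    """Apply the same three steps as the sed script to a single line."""
    # Step 1: leave the line alone unless it starts with source = "git
    if not re.match(r' *source *= *"git', line):
        return line
    # Step 2: leave the line alone if it already ends with ?ref=tf12"
    if re.search(re.escape(REF) + r'" *$', line):
        return line
    # Step 3: put the suffix in front of the final double quote
    return re.sub(r'" *$', REF + '"', line)


lines = [
    'source = "git::ssh://git@github.abc.com/test//bar"',
    'source = "git::ssh://git@github.abc.com/test//foo?ref=tf12"',
    'resource = "bar"',
]
for line in lines:
    print(add_ref(line))
```

Only the first line changes; the second already carries the suffix and the third does not start with source.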
sed '/?ref=tf12"/!s#\(source.*git.*//.*\)"#\1?ref=tf12"#' file.tf
/?ref=tf12"/! Only run the substitute command if this pattern (?ref=tf12") doesn't match.
\(...\)", \1 Instead of appending to the entire line using &, only match the line up to the last ". Use parentheses to capture everything before that " into a group, which can then be referenced with \1 in the replacement (where the " is re-added so that it doesn't get lost).

Perl search and replace until positive lookahead over several lines - not working as expected?

The overall goal here is to remove a block of text starting with a particular string and ending with a positive lookahead. From the testing I've done, it seems that newlines are causing the problem, but I'm not sure what exactly is going on or the best way to fix it.
More context: I want to remove taxa from a .fasta file, including the taxon name and header information and the associated sequence. (fasta format begins with a header >locusname-locusnumber-species_name |locusname-locusnumber \n). Missing data in the sequence is coded as "-". Eventually I would like to do this for several species_names and do so for each of several thousand files in a directory.
I presumed this would be a simple task to do as a perl one-liner in bash (Ubuntu 18.04.2).
As an example, from the excerpt below I would like to remove the entire sequence of Pseudomyrmex seminole D1367, i.e. the string that starts with >uce-483_Pseudomyrmex_seminole_D1367 |uce-483 and ends with the newline before >uce-483_Pseudomyrmex_seminole_D1435...
For this, I have: perl -pe 's/>(.)+(Pseudomyrmex_seminole_D1367)[\s\S]+(?=>)//' infile.fasta > outfile.fasta
or equivalently perl -pe 's/>(.)+(Pseudomyrmex_seminole_D1367(.)+(?=>)//s' infile.fasta > outfile.fasta
Both of these seem to have no effect at all (i.e. diff infile.fasta outfile.fasta is empty.) If I remove the positive lookahead, it works correctly but only up to the first newline.
Here's an excerpt from the .fasta for context and testing:
>uce-483_Pseudomyrmex_seminole_D1366 |uce-483
------------------------------------------------------------
---------------------------------------------------tgtaaacgt
tataatacatgcgtatgaaaaaaaaaagtgaacacccggtacgtacccgtgctgaaacgt
tcagatttacatccatttgtagtagcattttcgctagttttttcaagagcaaaaaggaca
cattcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatacctatatatgtgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactccgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata----tgtgtgtgtgtgcgcgcgtatgtgca
cgtac------acacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatgtagatactaat
tgtggaaaatacatattcctcttcagatacacgggaatgttgaattattttcactcgctc
cacgcgcgagtgttcgctccttttacgcacaacgagtccttctgctgcagc--gagatag
aaaatatttttgcgcggtaatcgtaaacgtatgagtgcctttcgacgtgaattctcttat
ggcagttctcacggtgtaaattataatcgaattaacattgcgagtgtgatctcaatataa
ttatagcgtctaagaacaaacacgtaacatgcacacacacacacacacac----------
---
>uce-483_Pseudomyrmex_seminole_D1367 |uce-483
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--ttcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatg---atatatatgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactctgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata--tgtgtgtgtgtgtgtgcgcgtatgtgca
cgtacgcgcgcacacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatg-----------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---
>uce-483_Pseudomyrmex_seminole_D1435 |uce-483
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------tacatccatttgtagtagcattttcgctagttttttcaagagcaaaaaggaca
cattcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatacctatatatgtgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactccgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata--tgtgtgtgtgtgtgtgcgcgtatgtgca
cgtac------acacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatgtagatactaat
tgtggaaaatacatattcctcttcagatacacgggaa-----------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---
With -p (or -n) the one-liner is reading a line at a time; so it just can't match multiline patterns. One solution is to "slurp" the whole file in, if it isn't too large (see end for line-by-line solution)
perl -0777 -pe'...' in > out
See Command Switches in perlrun.
Then, the code shown in the question has an unbalanced parenthesis, so it doesn't compile. Further, there is no reason to capture those .s, so drop the parentheses around them. Next, the pattern
s/>.+Pseudomyrmex_seminole_D1367...//;
matches everything from the very first > to the name of interest, so all preceding sequences are matched and removed as well. Instead, match >[^>]+...D1367 for example, so everything that isn't > after a >, to that phrase.
Finally, the last .+(?=>) will match everything to the very last > and thus the regex will remove all following sequences, not what you want according to the description. Instead, limit it to match to the first following >, either by making it "non-greedy" with .+?(?=>) or, more simply, with [^>]+.
All corrected
perl -0777 -pe's/>[^>]+?Pseudomyrmex_seminole_D1367[^>]+//' in > out
Note that there is no need for /s modifier now, since its purpose is to make . match a newline and here we don't need that since the [^>] does match newlines as well (anything other than >). The quantifier is +? to (hopefully) prevent backtracking each whole sequence that doesn't match.
Or, with your original use of lookahead
perl -0777 -pe's/>[^>]+?Pseudomyrmex_seminole_D1367.+?(?=>)//s' in > out
These work as expected with your sample, as well as with an extended example I made up with further sequences (>...) added.
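To see the corrected pattern at work on a whole string, here is a small Python sketch of the same substitution. It is an illustration only: TAXON_RE, remove_taxon and the shortened sequences are made up, and the file is assumed to fit in memory, as with -0777.

```python
import re

# Same pattern as the corrected one-liner: from a '>' up to the name of
# interest without crossing another '>', then everything up to the next '>'.
TAXON_RE = re.compile(r">[^>]+?Pseudomyrmex_seminole_D1367[^>]+")


def remove_taxon(fasta_text):
    """Remove the D1367 record from FASTA text held in one string."""
    return TAXON_RE.sub("", fasta_text)


# Shortened stand-in for the excerpt in the question.
fasta = (
    ">uce-483_Pseudomyrmex_seminole_D1366 |uce-483\n"
    "---tgtaaacgt\n"
    ">uce-483_Pseudomyrmex_seminole_D1367 |uce-483\n"
    "--ttcaaaactg\n"
    "---\n"
    ">uce-483_Pseudomyrmex_seminole_D1435 |uce-483\n"
    "-------tacat\n"
)

print(remove_taxon(fasta), end="")
```

Because [^>] also matches newlines, no DOTALL-style flag is needed, just as the /s modifier is not needed in the Perl version.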
For reference, and since a fasta file can be too big to slurp into a string, here it is line by line.
Once you see the >... line of interest set a flag; print a line if that flag isn't set (and if we aren't on that very line). Once you reach the next > clear the flag (print that line, too).
perl -ne'
if (/^>.+?Pseudomyrmex_seminole_D1367/) { $f = 1 }
elsif (not $f) { print }
elsif (/^>/) { $f = 0; print }
' in > out
I suspect that this may also perform considerably better on very large files.
The regex in the first solution has to scan each sequence whole in order to find that it is not the one of interest; it is only once it hits the next > that it can decide that the sequence doesn't match (and with no backtracking, hopefully, since +? would've stopped it had the right phrase been encountered).
Here the code mostly checks the first character and a flag.
So it's an incomparably lesser workload here -- but here the regex engine is started up on every line, and that is expensive. I can't tell with confidence how they stack against each other without trying.
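The flag-based, line-by-line idea translates directly to other languages. Below is a minimal Python sketch of it; the function name drop_taxon and the tiny sample are made up, and the flag is simply recomputed on every header line, which has the same effect as setting and clearing it.

```python
def drop_taxon(lines, name):
    """Yield all lines except the record whose header contains `name`."""
    skipping = False
    for line in lines:
        if line.startswith(">"):
            # A header starts a new record: skip it only if it is the target.
            skipping = name in line
        if not skipping:
            yield line


records = [
    ">uce-483_Pseudomyrmex_seminole_D1366 |uce-483\n",
    "---tgtaaacgt\n",
    ">uce-483_Pseudomyrmex_seminole_D1367 |uce-483\n",
    "--ttcaaaactg\n",
    ">uce-483_Pseudomyrmex_seminole_D1435 |uce-483\n",
    "-------tacat\n",
]

for line in drop_taxon(records, "Pseudomyrmex_seminole_D1367"):
    print(line, end="")
```

Since it is a generator, it can be fed an open file object and never holds more than one line in memory.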
You can also use > as the input record separator. This way you avoid slurping the whole file, and since the main loop loads your file block by block, you only have to test which block is the target and not print it (without describing the whole block in a pattern):
perl -ln076e's/\n$//;print ">$_" if $_ && !/Pseudomyrmex_seminole_D1367/' file
The l switch sets the output record separator to the input record separator (a newline by default).
The 0 switch sets the input record separator to > (76 in octal).

Perl Range command mismatching similar strings with one ending in a carriage return

The range command in Perl
RANGE
/^ identifier cust_pri/ .. /addr-type-none/
matches on strings with cust_pri and cust_pri_sip where a carriage return is immediately after the string cust_pri (and cust_pri_sip). I don't want a match on cust_pri_sip but only on cust_pri.
I tried putting in \r\n and both individually to no avail. Is there a string or metachar I can put into the end of perl range to help differentiate the two strings?
I need to look at data for both types of interfaces, but the first range command also collects the data that the second range command collects (cust_pri_sip), causing my first script to error out. The second works fine. I cannot change the input data, and I need a way to differentiate the two.
This is a sub script of the main Perl program
WIDTH = 65
DIRECTORY = /home/myfiles/
MASTER Config Lines
identifier cust_pri
description *
addr prefix 0.0.0.0
network interfaces M00|1:\d*
tcp media profile
monitoring filters
node functionality
default location string
alt family realm
addr-type-none
RANGE
/^ identifier cust_pri/ .. /addr-type-none/
#
There is another sub script that is similar to above
RANGE
/^ identifier cust_pri_sip/ .. /addr-type-none/
The first script also collects the data of both scripts because it matches.
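You can explicitly exclude _sip with /^ identifier cust_pri(?!_sip)/, or you can require that cust_pri is at the end of the line with nothing after it: /^ identifier cust_pri$/. If the lines really do end with a carriage return, $ will not match in front of it; /^ identifier cust_pri\s*$/ allows for the trailing \r.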
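The effect of the two start patterns on a range can be checked with a small emulation. The Python sketch below is an assumption-laden illustration: collect_range is a made-up helper that imitates Perl's .. operator, and the sample lines (including the trailing \r) are invented to match the description in the question.

```python
import re


def collect_range(lines, start_re, end_re):
    """Collect lines from a start match through the next end match."""
    inside = False
    collected = []
    for line in lines:
        if not inside and start_re.search(line):
            inside = True
        if inside:
            collected.append(line)
            if end_re.search(line):
                inside = False
    return collected


# Invented sample with carriage returns left at the end of each line.
config = [
    " identifier cust_pri\r",
    " description first\r",
    " addr-type-none\r",
    " identifier cust_pri_sip\r",
    " description second\r",
    " addr-type-none\r",
]

end = re.compile(r"addr-type-none")
loose = re.compile(r"^ identifier cust_pri")               # matches both blocks
lookahead = re.compile(r"^ identifier cust_pri(?!_sip)")   # excludes _sip
anchored = re.compile(r"^ identifier cust_pri\s*$")        # tolerates a trailing \r

print(len(collect_range(config, loose, end)))
print(collect_range(config, lookahead, end))
print(collect_range(config, anchored, end))
```

The lookahead and the anchored variant both select only the first block, while the original pattern picks up both.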

regex to remove paths from file names but only when path begins with a given pattern

I have a file containing file names (amongst other things). Only some of the file names are at the start of a line in the file:
~/remove/me/myexec.pl /some/other/path/exec.pl
/yet/another/path/pipeit.pl | ~/remove/me/subdir/tome.pl
~/remove/me/deeply/nested/exec.pl
I want to remove the file path of any file that starts with ~/remove/me. I also want any sub-directories of ~/remove/me to be removed.
Here is my desired output from the above:
myexec.pl /some/other/path/exec.pl
/yet/another/path/pipeit.pl | tome.pl
exec.pl
The paths of files not beginning with ~/remove/me must be left alone.
The nearest I can get is using a regex like this:
s{~/remove/me/[^/]*?}{}gxms
But this does not deal with subdirectories properly, giving me the following output:
myexec.pl /some/other/path/exec.pl
/yet/another/path/pipeit.pl | subdir/tome.pl
deeply/nested/exec.pl
Can anyone come up with a regex to solve this?
Another way - s{~/remove/me/(?:[^/\s]*?/)*}{}g
~/remove/me/
(?: # Optional - Many non-spaced subdir's
[^/\s]*?
/
)*
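As a quick check of this pattern against the sample from the question, here is a short Python sketch. The regular expression is the one shown above; the names REMOVE_RE and strip_prefix are made up for the example.

```python
import re

# ~/remove/me/ followed by any number of "subdir/" segments that contain
# neither a slash nor whitespace.
REMOVE_RE = re.compile(r"~/remove/me/(?:[^/\s]*?/)*")


def strip_prefix(text):
    """Drop ~/remove/me/ and any sub-directories, keeping the file name."""
    return REMOVE_RE.sub("", text)


text = (
    "~/remove/me/myexec.pl /some/other/path/exec.pl\n"
    "/yet/another/path/pipeit.pl | ~/remove/me/subdir/tome.pl\n"
    "~/remove/me/deeply/nested/exec.pl\n"
)

print(strip_prefix(text), end="")
```

Excluding whitespace from the sub-directory segments is what stops the match from running on into /some/other/path on the first line.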
Try this:
~\/remove\/me[^\s]*\/(?=[^\s]+)
Regex live here.
Explaining:
~\/remove\/me # starts with "~/remove/me"
[^\s]*\/ # match any non-space till last slash "/"
(?=[^\s]+) # match without taking the name and extension
Hope it helps.
A quick one, not perfect, but I think it does what's required; of course it could be optimised.
my $text = "~/remove/me/myexec.pl /some/other/path/exec.pl\n/yet/another/path/pipeit.pl | ~/remove/me/subdir/tome.pl\n~/remove/me/deeply/nested/exec.pl";
$text =~ s/~\/remove\/me[a-zA-Z0-9\/]*\/([a-zA-Z0-9.]+)/$1/g;
print $text;
This results in the following:
myexec.pl /some/other/path/exec.pl
/yet/another/path/pipeit.pl | tome.pl
exec.pl

Parsing log files

I'm trying to write a script to simplify the process of searching through a particular application's log files for specific information. So I thought maybe there's a way to convert them into an XML tree, and I'm off to a decent start... but the problem is, the application log files are an absolute mess, if you ask me.
Some entries are simple
2014/04/09 11:27:03 INFO Some.code.function - Doing stuff
Ideally I'd like to turn the above into something like this
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
</Message>
Other entries are something like this where there's additional information and line breaks
2014/04/09 11:27:04 INFO Some.code.function - Something happens
changes:
this stuff happened
I'd like to turn this last chunk into something like the above, but add the additional info into a section
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
<details>changes:
this stuff happened</details>
</Message>
and then other messages, errors will be in the form of
2014/04/09 11:27:03 ERROR Some.code.function - Something didn't work right
Log Entry: LONGARSEDGUID
Error Code: E3145
Application: Name
Details:
message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry
This last chunk I'd like to convert like the last two examples above, but adding XML nodes for log entry, error code, application, and again, details, like so
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>ERROR </Type>
<Source>Some.code.function</Source>
<Sub>Something didn't work right</Sub>
<Entry>LONGARSEDGUID</Entry>
<Code>E3145</Code>
<Application>Name</Application>
<details>message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry</details>
</Message>
Now I know that Select-String has a context option which would let me select a number of lines after the line I've filtered, the problem is, this isn't a constant number.
I'm thinking a regular expression would also allow me to select the paragraph chunk before the date string, but regular expressions are not a strong point of mine, and I thought there might be a better way, because the one constant is that new entries start with a date string.
The idea, though, is to break these up into XML or tables of sorts, and from there I'm hoping it might make the task of filtering out non-relevant or recurring messages a little easier.
I have a sample I just tossed on pastebin after removing/replacing a few bits of information for privacy reasons
http://pastebin.com/raw.php?i=M9iShyT2
Sorry this is kind of late, I got tied up with work for a bit there (darn work expecting me to be productive while on their dime). I ended up with something similar to Ansgar Wiechers solution, but formatted things into objects and collected those into an array. It doesn't manage your XML that you added later, but this gives you a nice array of objects to work with for the other records. I'll explain the main RegEx line here, I'll comment in-line where it's practical.
'(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+?] (\w+?) {1,2}(.+?) - (.+)$' is the RegEx that detects the start of a new record. I started to explain it, but there are probably better resources for you to learn RegEx than me explaining it to you. See this RegEx101.com link for a full breakdown and examples.
$Records=@() #Create empty array that we will populate with custom objects later
$Event = $Null #make sure nothing in $Event to give script a clean start
Get-Content 'C:\temp\test1.txt' | #Load file, and start looping through it line-by-line.
?{![string]::IsNullOrEmpty($_)}|% { #Filter out blank lines, and then perform the following on each line
if ($_ -match '(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+?] (\w+?) {1,2}(.+?) - (.+)$') { #New Record Detector line! If it finds this RegEx match, it means we're starting a new record.
if ($Event) { #If there's already a record in progress, add it to the Array
$Records+=$Event
}
$Event = New-Object PSObject -Property @{ #Create a custom PSObject object with these properties that we just got from that RegEx match
DateStamp = [datetime](get-date $Matches[1]) #We convert the date/time stamp into an actual DateTime object. That way sorting works better, and you can compare it to real dates if needed.
Type = $Matches[2]
Source = $Matches[3]
Message = $Matches[4]}
OK, little pause for the cause here. $Matches isn't defined by me, so why am I referencing it? When PowerShell gets matches from a RegEx expression, it automagically stores the resulting matches in $Matches. So all the groups that we just matched in parentheses become $Matches[1], $Matches[2], and so on. Yes, it's indexed like an array (it is actually a hashtable), and there is a $Matches[0], but that is the entire text that the RegEx matched, not just the individual groups. We now return you to your regularly scheduled script...
} else { #End of the 'New Record' section. If it's not a new record it does the following
if($_ -match "^((?:[^ ^\[])(?:\w| |\.)+?):(.*)$"){
RegEx match again. It starts off by stating that this has to be the beginning of the string with the caret character (^). Then it says (in a non-capturing group noted by the (?:<stuff>) format, which really for my purposes just means it won't show up in $Matches) [^ \[]; that means that the next character cannot be a space or an opening bracket (escaped with a \), just to speed things up and skip those lines for this check. If you have things in brackets [] and the first character is a caret, it means 'don't match anything in these brackets'.
I actually just changed this next part to include periods, and used \w instead of [a-zA-Z0-9] because it's essentially the same thing but shorter. \w is a "word character" in RegEx, and includes letters, numbers, and the underscore. I'm not sure why the underscore is considered part of a word, but I don't make the rules I just play the game. I was using [a-zA-Z0-9] which matches anything between 'a' and 'z' (lowercase), anything between 'A' and 'Z' (uppercase), and anything between '0' and '9'. At the risk of including the underscore character \w is a lot shorter and simpler.
Then the actual capturing part of this RegEx. This has 2 groups: the first is letters, numbers, underscores, spaces, and periods (escaped with a \ because '.' on its own matches any character). Then a colon. Then a second group that is everything else until the end of the line.
$Field = $Matches[1] #Everything before the colon is the name of the field
$Value = $Matches[2].trim() #everything after the colon is the data in that field
$Event | Add-Member $Field $Value #Add the Field to $Event as a NoteProperty, with a value of $Value. Those two are actually positional parameters for Add-Member, so we don't have to go and specify what kind of member, specify what the name is, and what the value is. Just Add-Member <[string]name> <value can be a string, array, yeti, whatever... it's not picky>
} else { #If it didn't match the field RegEx, this line is just more data for the last field, so don't change the field; append the line to it.
$Event.$Field += if(![string]::isNullOrEmpty($Event.$Field)){"`r`n$_"}else{$_}}
This is a long explanation for a fairly short bit of code. Really all it does is add data to the field! It has an inverted (prefixed with !) check to see whether the current field is Null or Empty. If the field already has data, it adds a new line and then adds the current line. If the field is empty, it skips the new line bit and just adds the data.
}
}
$Records+=$Event #Adds the last event to the array of records.
Sorry, I'm not very good with XML. But at least this gets you clean records.
Edit: Ok, code is notated now, hopefully everything is explained well enough. If something is still confusing perhaps I can refer you to a site that explains better than I can. I ran the above against your sample input in PasteBin.
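For comparison, the same record-building logic can be sketched in Python. This is an illustration under stated assumptions, not a port of the tested script: parse_log, HEADER_RE, FIELD_RE and the sample lines are made up, the bracketed number after the timestamp is inferred from the RegEx above (the real sample lives on pastebin), and continuation lines that arrive before any named field are appended to Message, which is a choice of this sketch.

```python
import re
from datetime import datetime

# Header and field patterns carried over from the RegEx lines in the answer.
HEADER_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+?\] (\w+?) {1,2}(.+?) - (.+)$"
)
FIELD_RE = re.compile(r"^((?:[^ ^\[])(?:\w| |\.)+?):(.*)$")


def parse_log(lines):
    """Turn log lines into a list of dicts, one per record."""
    records = []
    event = None
    field = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        header = HEADER_RE.match(line)
        if header:
            if event is not None:
                records.append(event)  # previous record is complete
            event = {
                "DateStamp": datetime.strptime(header.group(1), "%Y/%m/%d %H:%M:%S"),
                "Type": header.group(2),
                "Source": header.group(3),
                "Message": header.group(4),
            }
            field = "Message"  # assumption: stray lines extend the message
            continue
        if event is None:
            continue  # ignore anything before the first header
        named = FIELD_RE.match(line)
        if named:
            field = named.group(1)  # text before the colon names the field
            event[field] = named.group(2).strip()
        else:
            # More data for the current field; add a line break only if needed.
            event[field] = event[field] + "\n" + line if event[field] else line
    if event is not None:
        records.append(event)  # do not forget the last record
    return records


# Hypothetical lines in the shape the header RegEx expects.
sample_log = [
    "2014/04/09 11:27:03 [7] INFO Some.code.function - Doing stuff",
    "2014/04/09 11:27:03 [7] ERROR Some.code.function - Something didn't work right",
    "Log Entry: LONGARSEDGUID",
    "Error Code: E3145",
    "Application: Name",
    "Details:",
    "message information etc",
    "and more line breaks",
]

for record in parse_log(sample_log):
    print(record)
```

Storing each record as a dict plays the role of the custom PSObject: every named field becomes a key, so the records can be filtered or sorted on DateStamp, Type and so on.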
One possible way to deal with such files is to process them line by line. Each log entry starts with a timestamp and ends when the next line starting with a timestamp appears, so you could do something like this:
Get-Content 'C:\path\to\your.log' | % {
if ($_ -match '^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}') {
if ($logRecord) {
# If a current log record exists, it is complete now, so it can be added
# to your XML or whatever, e.g.:
$logRecord -match '^(\d{4}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (\S+) ...'
$message = $xml.CreateElement('Message')
$date = $xml.CreateElement('Date')
$date.InnerText = $matches[1]
$message.AppendChild($date)
$time = $xml.CreateElement('Time')
$time.InnerText = $matches[2]
$message.AppendChild($time)
$type = $xml.CreateElement('Type')
$type.InnerText = $matches[3]
$message.AppendChild($type)
...
$xml.SelectSingleNode('...').AppendChild($message)
}
$logRecord = $_ # start new record
} else {
$logRecord += "`r`n$_" # append to current record
}
}
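The grouping idea (a record runs from one timestamp line to the next) can be sketched in Python together with a small XML conversion. This is only an illustration: group_records, record_to_xml and the header pattern are made up for the example, the element names follow the layout asked for in the question, and the standard library's xml.etree.ElementTree stands in for the $xml object used above.

```python
import re
import xml.etree.ElementTree as ET

STAMP_RE = re.compile(r"^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}")
# Assumed header layout: date, time, type, source, " - ", short message.
HEADER_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (\S+) +(.+?) - (.*)$"
)


def group_records(lines):
    """Group lines into records; each record starts at a timestamp line."""
    records = []
    current = None
    for line in lines:
        if STAMP_RE.match(line):
            if current is not None:
                records.append(current)  # the previous record is complete
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        records.append(current)  # the last record has no following timestamp
    return records


def record_to_xml(record):
    """Build a <Message> element from one grouped record."""
    header = HEADER_RE.match(record[0])
    if header is None:
        raise ValueError("unexpected header line: " + record[0])
    message = ET.Element("Message")
    tags = ("Date", "Time", "Type", "Source", "Sub")
    for tag, text in zip(tags, header.groups()):
        ET.SubElement(message, tag).text = text
    if len(record) > 1:
        ET.SubElement(message, "details").text = "\n".join(record[1:])
    return message


log_lines = [
    "2014/04/09 11:27:03 INFO Some.code.function - Doing stuff",
    "2014/04/09 11:27:04 INFO Some.code.function - Something happens",
    "changes:",
    "this stuff happened",
]

for record in group_records(log_lines):
    print(ET.tostring(record_to_xml(record), encoding="unicode"))
```

Note that the final record is only emitted after the loop ends, since no further timestamp line arrives to close it.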