I wanted to extract some strings from a set of text files. After examining the files, I found a pattern in how the target strings appear.
With some help from Google searches, I put together a short PowerShell script. It takes two parameters (the text file path and an extraction keyword) and extracts the matching strings from the text file.
After finding and extracting the target strings from the file $tpath\temp.txt, the script saves them to another file, $tpath\tmpVI.txt.
Set-PSDebug -Trace 2 -step
$txtpath=$args[0]
$exkey=$args[1]
$tfile=gc "$tpath\temp.txt"
$savextracted="$tpath\tmpVI.txt"
$tfile -replace '&amp;', '&' -replace '^.*$exkey', '' -replace '\s.*$', '' -replace '\\.*$','' | out-file "$savextracted" -encoding ascii
But so far the extracted and saved result has been wrong; it never contains the strings I want.
From PS debugging, it seems the regular expressions in the last line are causing trouble, and so is the variable $exkey inside the single-quoted replace pattern. But I don't know how to fix this. What should I do?
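Edit: to show what I mean about $exkey and the quoting, here is a stripped-down example with a made-up keyword:
$exkey = 'KEYWORD'
'prefix KEYWORD target' -replace '^.*$exkey', ''   # output unchanged: the single-quoted pattern is the literal text ^.*$exkey
'prefix KEYWORD target' -replace "^.*$exkey", ''   # pattern expands to ^.*KEYWORD and the output is ' target'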
If you're looking to capture lines that have your match, here's a snippet that solves that problem:
Function Get-Matches
{
    Param(
        [Parameter(Mandatory, Position = 0)]
        [String] $Path,
        [Parameter(Mandatory, Position = 1)]
        [String] $Regex
    )
    # Return every line of the file that matches the regex
    (Get-Content -Path $Path) -match $Regex
}
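For example, a hypothetical call using the path and keyword variables from the question above ($tpath and $exkey are assumed to already be set):
# Lines of temp.txt that contain the keyword (interpreted as a regex)
Get-Matches -Path "$tpath\temp.txt" -Regex $exkey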
I'm trying to create a PowerShell script to modify a file and replace rows starting with "][" with a comma.
I have a file text.json with some broken JSON like this:
[
{
"Id": "5413146",
"Datasets": [
{
"DatasetId": "354843154864",
"DatasetName": "testset"
}
],
"SharingAction": "Direct"
}
][][][][][
{
"Id": "656156462",
"LastRefreshTime": "may"
}
][][
{
"Id": "32448542",
"LastRefreshTime": "jan"
}
]
To fix it, I need to replace the rows with multiple brackets with a comma, and I need to do it with a PowerShell script.
I found out that I can read the contents of the file to a variable like this:
$text = Get-Content text.json -Raw
Then I can replace normal text and output the modified contents to a new file like this:
$text -replace 'may','june' | Out-File -FilePath text_modified.json
However, I'm having issues using regex to match the row with the brackets.
I found out that a regex to match a row starting with "][" would be like this:
^]\[.*
I tested with two different online regex validators and it seems to work fine. So then I believe the command I'm looking for should be:
$text -replace '^]\[.*',',' | Out-File -FilePath text_modified.json
It doesn't replace anything. Seems like it doesn't match the brackets properly.
$text -match ']' returns True but when I try $text -match '^]' it returns False. I also tried '^\]' which also returns False.
Any ideas? Thanks for any help.
The -Raw parameter of Get-Content returns a single string, not an array of strings representing each line in the file. Your current expression is effectively looking for the pattern at the start of input, or in this case, the start of the file.
If you remove -Raw, -replace will instead operate on each line of the file, and each line will be processed as its own input. This means your pattern ^]\[.*, anchored to the beginning of the input, will now match on the correct lines.
And then of course, Out-File will write your changed file contents to disk.
If you really want to use -Raw, as mentioned in the comments you can prefix your pattern with (?m). This is the .NET regex modifier for Multiline Mode. Basically, this modifier makes ^ and $ match the beginning and end of a line, not just the beginning and end of the input.
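For reference, a rough sketch of both options, using the file names from the question:
# Option 1: drop -Raw so each line is matched on its own
(Get-Content text.json) -replace '^]\[.*', ',' | Out-File -FilePath text_modified.json
# Option 2: keep -Raw and enable multiline mode so ^ anchors to line starts
$text = Get-Content text.json -Raw
$text -replace '(?m)^]\[.*', ',' | Out-File -FilePath text_modified.json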
I have a question which I'm pretty much stuck on.
I have a file called xml_data.txt and another file called entry.txt
I want to replace everything between <core:topics> and </core:topics>
I have written the below script
$test = Get-Content -Path ./xml_data.txt
$newtest = Get-Content -Path ./entry.txt
$pattern = "<core:topics>(.*?)</core:topics>"
$result0 = [regex]::match($test, $pattern).Groups[1].Value
$result1 = [regex]::match($newtest, $pattern).Groups[1].Value
$test -replace $result0, $result1
When I run the script, it outputs to the console, but it doesn't look like it made any change.
Can someone please help me out?
There are three main issues here:
You read the file line by line, but the blocks of text are multiline strings
Your regex does not match newlines as . does not match a newline by default
Also, the literal text you search for must be regex-escaped, and when the replacement string is dynamic you must always dollar-escape any $ symbols in it. Or use the simple string .Replace() method.
So, you need to
Read the whole file in to a single variable, $test = Get-Content -Path ./xml_data.txt -Raw
Use the $pattern = "(?s)<core:topics>(.*?)</core:topics>" regex (it can be enhanced in case it works too slow by unrolling it to <core:topics>([^<]*(?:<(?!</?core:topics>).*)*)</core:topics>)
Use $test -replace [regex]::Escape($result0), $result1.Replace('$', '$$') to "protect" $ chars in the replacement, or $test.Replace($result0, $result1).
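Putting those three points together, a sketch of the corrected script (assuming both files actually contain a <core:topics>...</core:topics> block):
$test    = Get-Content -Path ./xml_data.txt -Raw
$newtest = Get-Content -Path ./entry.txt -Raw
$pattern = "(?s)<core:topics>(.*?)</core:topics>"
# Inner text of the <core:topics> element in each file
$result0 = [regex]::Match($test, $pattern).Groups[1].Value
$result1 = [regex]::Match($newtest, $pattern).Groups[1].Value
# Escape the search text and protect any $ in the replacement
$test -replace [regex]::Escape($result0), $result1.Replace('$', '$$')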
I am trying to make a script that takes an XML file, looks for a matching condition, and, when it finds it, adds a new line of asterisks; then, when done going through the file, it strips all the XML tags and leaves the data in a plain text file.
The script has been tested on a small input XML file and works fine, but when I pass it a large XML file it takes forever (I'm not actually sure how long, as I ran it for over an hour with still no result, so I just stopped it).
I'm guessing I must be performing the work in an extremely inefficient manner, hoping you guys can help me make it fast and efficient.
Here is the script below:
# Takes input XML File, cleans up XML elements, outputs plain text file
$FileName = "C:\Users\someguy\Desktop\input.xml"
$Pattern = "ProcessSpecifier = ""true"""
$FileOriginal = Get-Content $FileName
[String[]] $FileModified = @()
Foreach ($Line in $FileOriginal)
{
$FileModified += $Line
if ($Line -match $Pattern)
{
#Add Lines after the selected pattern
$FileModified += "*************isActive=true*****************"
}
}
$FileModified -replace "<[^>]+>", "" | Out-File C:\Users\someguy\Desktop\Output.txt
Let's go with a lookbehind and a bunch of regex to speed things up here. Also, I'm not going to store the whole thing in memory; I'm just going to pass it down the pipeline, which should help. I remove whitespace from the beginnings and ends of lines and filter out blank lines, but you can remove that bit if you want.
# Takes input XML File, cleans up XML elements, outputs plain text file
$FileName = "C:\Users\someguy\Desktop\input.xml"
$Pattern = '(?<=^.*ProcessSpecifier = "true".*$)'
(Get-Content $FileName) -replace $Pattern, "`n*************isActive=true*****************" -replace '<[^>]+?>' -replace '^\s*|\s$' | ?{$_} | Set-Content C:\Users\someguy\Desktop\Output.txt
So, the main thing here is that I use a look behind to find your pattern text, and then add a new line and the asterisk line to that line. So that the line
<SomeTag>ProcessSpecifier = "true"</SomeTag>
becomes:
<SomeTag>ProcessSpecifier = "true"</SomeTag>`n*************isActive=true*****************
When used inside double quotes, a backtick ` followed by n creates a new line, so the '*************isActive=true*****************' is on its own line immediately following your search-pattern line. Past that, I remove the XML tags and then any leading or trailing whitespace from each line.
After the RegEx replacements I pass the result to a Where statement that removes blank lines, and then pass the remaining lines to Set-Content which I've seen better performance out of than Out-File.
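If it helps to see the lookbehind on its own, here is a quick check against the sample line above (not part of the original one-liner):
$Pattern = '(?<=^.*ProcessSpecifier = "true".*$)'
'<SomeTag>ProcessSpecifier = "true"</SomeTag>' -replace $Pattern, "`n*************isActive=true*****************"
# Output: the original line, followed by the asterisk line on a new line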
Variation of TheMadTechnician's answer:
# Takes input XML File, cleans up XML elements, outputs plain text file
$FileName = "C:\Users\someguy\Desktop\input.xml"
$Pattern = '(?<=^.*ProcessSpecifier = "true".*$)'
Set-Content -Path C:\Users\someguy\Desktop\Output.txt -Value (((Get-Content $FileName) -replace $Pattern, "`n*************isActive=true*****************" -replace '<[^>]+?>' -replace '^\s*|\s$').Where{$_})
I actually try to avoid the pipeline; it is rather slow, AFAIK. Of course, you will run into problems with memory consumption if the files are very large.
The "().Where" construct doesn't work on all PowerShell versions (it requires version 4+, IIRC).
This is a guess; I am not sure whether this is actually faster than TheMadTechnician's. I'd be curious about the result :)
I am trying to write a powershell script to search for a pattern in a text file. Specifically I am looking at reading a file line by line and returning any line that has a space at the 32nd character position.
I have this so far, but it just returns all lines that have whitespace. I need to narrow it down to the 32nd position:
Get-Content -path C:\VM_names.txt | Where-Object {$_ -match "\s+"}
Use this pattern:
-match '^.{31} '
Explanation:
^ - beginning of the string
. - any character
{31} - repeated 31 times
' ' - a space
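For example, dropped into the pipeline from the question (just a sketch with the same file path):
Get-Content -Path C:\VM_names.txt | Where-Object { $_ -match '^.{31} ' }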
This is actually really easy to do. By default, Get-Content reads a text file as an array of strings (individual lines), unless you use the -Raw parameter, which reads it as a single string. You can use the -match PowerShell operator to "match" the lines that meet your regular expression.
(Get-Content -Path c:\VM_names.txt) -match '^.{31}\s'
The result of the above command is an array of lines that match the desired regular expression.
NOTE: The call to Get-Content must be wrapped in parentheses, otherwise the PowerShell parser will think that -match is a parameter on that command.
NOTE2: As a good practice, use single quotes around strings, unless you specifically know that you need double quotes. You'll save yourself from accidental interpolation.
I have a programming background, but I am fairly new to both powershell scripting and regexp. Regexp has always eluded me, and my prior projects have never 'forced' me to learn it.
With that in mind, I have a file with a line of text that I need to replace. I cannot depend on knowing where the line is, whether it has whitespace in front of it, or what the ACTUAL text being replaced IS. I DO KNOW what will come immediately before and after the text being replaced.
AGAIN, I will not KNOW the value of "Replace This Text". I will only know what prefaces it ("<find-this-text>") and what follows it ("</find-this-text>").
LINE OF TEXT I NEED TO REPLACE
<find-this-text>Replace This Text</find-this-text>
POTENTIAL CODE
(gc $file) | % { $_ -replace "", "" } | sc $file
Get the content of the file, enclosing this in parentheses to ensure the file is first read and then closed, so it doesn't throw an error when trying to save the file.
Iterate through each line, and issue replace statement. THIS IS WHERE I COULD USE HELP.
Save the file by using Set-Content. My understanding is that this method is preferable because it takes encoding into consideration, like UTF8.
XML is not a line oriented format (nodes may span several lines, just as well as a line may contain several nodes), so it shouldn't be edited as if it were. Use a proper XML parser instead.
$xmlfile = 'C:\path\to\your.xml'
[xml]$xml = Get-Content $xmlfile
$node = $xml.SelectSingleNode('//find-this-text')
$node.'#text' = 'replacement text'
For saving the XML in "UTF-8 without BOM" format you can call the Save() method with a StreamWriter doing The Right Thing™:
$UTF8withoutBOM = New-Object Text.UTF8Encoding($false)
$writer = New-Object IO.StreamWriter ($xmlfile, $false, $UTF8withoutBOM)
$xml.Save($writer)
$writer.Close()
The .* in the regular expression would be considered "greedy" and dangerous by many. If the line that contains this tag and its data contains nothing else, then there really isn't any significant risk, according to my understanding.
$file = "c:\temp\sms.txt"
$OpenTag = "<find-this-text>"
$CloseTag = "</find-this-text>"
$NewText = $OpenTag + "New text" + $CloseTag
(Get-Content $file) | Foreach-Object {$_ -replace "$OpenTag.*$CloseTag", $NewText} | Set-Content $file