Text manipulation problem - how to replace text after a known value - regex

I have a large text file containing filenames ending in .txt
Some of the rows of the file have unwanted text after the filename extension.
I am trying to find a way to search+replace or trim the whole file so that if a row is found with .txt, anything after this is simply removed. Example
C:\Test1.txt
C:\Test2.txtHelloWorld this is my
problem
C:\Test3.txt_____Annoying
stuff1234 .r
Desired result
C:\Test1.txt
C:\Test2.txt
C:\Test3.txt
I have tried with Notepad++ and with batch/PowerShell, and got close, but no cigar.
(Get-Content "D:\checkthese.txt") |
Foreach-Object {$_ -replace '.txt*', ".txt"} |
Set-Content "D:\CLEAN.txt"
My thinking here is that if I replace anything (wildcard *) after .txt then I would trim off what I need, but this doesn't work. I think I need to use a regular expression, but have the syntax wrong.

Simply change the * to a .*, like so:
(Get-Content "D:\checkthese.txt") |
Foreach-Object {$_ -replace '\.txt.*', ".txt"} |
Set-Content "D:\CLEAN.txt"
In regular expressions, * means "0 or more times", and in this case it'd act on the final t of .txt, so .txt* would only match .tx, .txt, .txtt, .txttt, etc...
., however, matches any character. This means, .* matches 0 or more of anything, which is what you want. Because of this, I also escaped the . in .txt, as it otherwise could break on filenames like: alovelytxtfile.txt, which would be trimmed to alovel.txt.
For more information, see:
Regex Tutorial - .
Regex Tutorial - *
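The difference between the two patterns can be checked on in-memory strings before touching a file. The following is only a sketch: the sample rows are copied from the question, and the variable names ($rows, $attempt, $trimmed and so on) are made up for illustration.

```powershell
# Sample rows modelled on the question's example.
$rows = @(
    'C:\Test1.txt'
    'C:\Test2.txtHelloWorld this is my'
    'C:\Test3.txt_____Annoying'
)

# '.txt*': the '*' repeats only the final 't', so the trailing text survives.
$attempt = $rows -replace '.txt*', '.txt'

# '\.txt.*': a literal dot, then '.*' consumes the rest of the line.
$trimmed = $rows -replace '\.txt.*', '.txt'

# Why the dot is escaped: an unescaped '.' also matches the 'y' before "txtfile".
$unescaped = 'alovelytxtfile.txt' -replace '.txt.*', '.txt'    # alovel.txt
$escaped   = 'alovelytxtfile.txt' -replace '\.txt.*', '.txt'   # alovelytxtfile.txt

$trimmed
```

Lines that contain no .txt at all, such as the wrapped "problem" line in the question, pass through -replace unchanged. Dropping them would need an extra filter, for example Where-Object { $_ -match '\.txt' } placed before Set-Content.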

Related

PowerShell v5.1: problem with replacing text in a file using regex patterns

I tried to follow answers provided here and here.
In a "test1.txt" file I have these contents:
20220421
20220422
20220423
20220424:
222
I want to replace the contents so that they would look like this in the output file "test2.txt":
20220421:
20220422:
20220423:
20220424:
222
I attempted to achieve this with the following code:
(Get-Content '.\test1.txt').replace('^\d{8}$', '^\d{8}:$') | Out-File '.\test2.txt'
However, instead of the expected results, I got the following content in "test2.txt":
20220421
20220422
20220423
20220424:
222
Can someone explain why I'm not achieving the expected results?
You are not using the regex-supporting -replace operator, and you are using a regex in the replacement instead of the correct replacement pattern.
You can use
(Get-Content '.\test1.txt') -replace '^(\d{8}):?\r?$', '$1:' | Out-File '.\test2.txt'
The ^(\d{8}):?\r?$ regex matches eight digits capturing them into Group 1, and then an optional colon, an optional CR and then asserts the end of string position.
The replacement is Group 1 value ($1) plus the colon char.
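The contrast between the .Replace() method and the -replace operator can be seen without any files. This is a minimal sketch using an in-memory array in place of Get-Content output; the variable names are hypothetical.

```powershell
# Stand-in for the lines Get-Content would return from test1.txt.
$lines = @('20220421', '20220422', '20220423', '20220424:', '222')

# String.Replace() searches for the literal text '^\d{8}$', which never occurs,
# so every line comes back unchanged.
$literal = $lines.Replace('^\d{8}$', '^\d{8}:$')

# -replace treats its first operand as a regex and '$1' as a reference to Group 1.
$fixed = $lines -replace '^(\d{8}):?\r?$', '$1:'

$fixed
```

The optional :? in the pattern is what keeps the line that already ends in a colon from getting a second one.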
PowerShell treats ^ and $ as the beginning and end of the whole string, not of individual lines, when the input is one multi-line string such as the here-string below.
This is not quite what you want - I can't get the line break in the replacement string.
@"
20220421
20220422
20220423
20220424:
222
"@ -replace "(\d{8})\n",'$1:'
line breaks not working
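For a single multi-line string, one possible way to keep the line breaks is to turn on multiline mode with (?m) and use a lookahead so that the line ending is checked but not consumed. This is a sketch under the assumption that the text arrives as one string (for example from Get-Content -Raw); it is not what either answer above proposed.

```powershell
# A single multi-line string, as a here-string or Get-Content -Raw would produce.
$text = "20220421`n20220422`n20220423`n20220424:`n222"

# (?m) makes ^ and $ match at every line.
# The lookahead (?=\r?$) requires the line to end right after the eight digits
# but leaves the line break itself in place.
$result = $text -replace '(?m)^(\d{8})(?=\r?$)', '$1:'

$result
```

Because the lookahead only matches when the digits are the whole line, the line that already carries a colon and the short 222 line are left alone.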

How to replace lines depending on the remaining text in file using PowerShell

I need to edit txt file using PowerShell. The problem is that I need to apply changes for the string only if the remaining part of the string matches some pattern. For example, I need to change 'specific_text' to 'other_text' only if the line ends with 'pattern':
'specific_text and pattern' -> changes to 'other_text and pattern'
But if the line doesn't end with pattern, I don't need to change it:
'specific_text and something else' -> no changes
I know about Replace function in PowerShell, but as far as I know it makes simple change for all matches of the regex. There is also Select-String function, but I couldn't combine them properly. My idea was to make it this way:
((get-content myfile.txt | select-string -pattern "pattern") -Replace "specific_text", "other_text") | Out-File myfile.txt
But this call rewrites the whole file and leaves only changed lines.
You may use
(get-content myfile.txt) -replace 'specific_text(?=.*pattern$)', "other_text" | Out-File myfile.txt
The specific_text(?=.*pattern$) pattern matches
specific_text - some specific_text...
(?=.*pattern$) - a positive lookahead: specific_text must be immediately followed by any 0 or more chars other than a newline, as many as possible, and then pattern at the end of the string ($).
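The lookahead can be tried on the two example lines from the question. A minimal sketch, with an in-memory array standing in for the contents of myfile.txt:

```powershell
# The two example lines from the question.
$lines = @(
    'specific_text and pattern'
    'specific_text and something else'
)

# The lookahead only checks that the line ends with 'pattern';
# the text it inspects is not part of the match, so only 'specific_text' is replaced.
$result = $lines -replace 'specific_text(?=.*pattern$)', 'other_text'

$result
```

Since -replace returns non-matching lines unchanged, every line is still written out, which avoids the "only changed lines" problem of the Select-String attempt.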

find repetitive string in a file

I have a text file Data.txt that is a few MB in size.
It has repetitive lines like
VolumeTradingDate=2017-09-05T00:00:00.000 VolumeTotal=73147 LastTradeConditions=0 in key=value format.
There are various key=value pairs; for simplicity I am showing very few.
Values are changing in lines.
I want to find all occurrences of VolumeTotal with its value and print/dump only that part on separate lines. Its value can be up to 25 characters.
I tried using cmd FindStr
findstr /C:VolumeTotal= "C:\Work\Data.txt"
But this doesn't give me desired result. It prints entire line.
Could anyone suggest what could be possible script in cmd or powershell to achieve this?
You can do this in PowerShell with a RegEx that uses look ahead and look behind:
Get-Content Data.txt | ForEach-Object {
    $Check = $_ -Match '(?<= VolumeTotal\=)\d*(?= )'
    If ($Check) { $Matches.Values }
}
The pattern: (?<= VolumeTotal\=)\d*(?= ) looks for any number of digits \d* between the strings ' VolumeTotal=' and a space character.
The result is sent to the automatic variable $Matches so we return the value of this variable if the pattern has been found.
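One limitation of -match is that it stops at the first hit in each line. If a line could hold more than one VolumeTotal entry, [regex]::Matches returns all of them. The sketch below makes two assumptions that are not in the question: the sample line has a second, made-up VolumeTotal=512 appended, and the trailing (?= ) is dropped in favour of \d+ so that a value at the very end of a line is also found.

```powershell
# Sample line from the question, with a hypothetical second VolumeTotal added.
$line = 'VolumeTradingDate=2017-09-05T00:00:00.000 VolumeTotal=73147 LastTradeConditions=0 VolumeTotal=512'

# [regex]::Matches returns every occurrence; note it is case-sensitive,
# unlike the -match operator.
$values = [regex]::Matches($line, '(?<=\bVolumeTotal=)\d+') |
    ForEach-Object { $_.Value }

$values
```

To run this over the whole file, the same expression could be placed inside Get-Content Data.txt | ForEach-Object { ... } with $_ in place of $line.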

Regex on Powershell script: Replace till end of line

I have some config files structured like:
PATH_KEY=C:\\dir\\project
foo=bar
I want to write a small script that replaces a certain key with current folder.
So basically I'm trying to replace "PATH_KEY=..." with "PATH_KEY=$PSScriptRoot"
My code so far:
$cfgs = Get-Childitem $PSScriptRoot -Filter *name*.cfg
foreach ($cfg in $cfgs)
{
( Get-Content $cfg) -replace 'PATH_KEY=.*?\n','PATH_KEY=$PSScriptRoot' | Set-Content $cfg
}
But the regular expression to take everything till end of line is not working.
Any help is appreciated!
You can use
'(?m)^PATH_KEY=.*'
or even
'PATH_KEY=.*'
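Get-Content returns each line without its line break, so a pattern that requires \n never matches; .* already runs to the end of the line.
Note that a literal $ in the replacement should be doubled ($$) to denote a single $, but a lone $ is only a problem when what follows forms a substitution, such as a digit ($1). The replacement must also be a double-quoted string, "PATH_KEY=$PSScriptRoot", because PowerShell does not expand variables inside single quotes.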
See the demo:
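The original demo did not survive in this copy of the thread, so the following is only a sketch of what such a demo might look like. $root is a hypothetical stand-in for $PSScriptRoot (which is empty outside a script file), and the array stands in for the lines of a .cfg file.

```powershell
# Hypothetical stand-in for $PSScriptRoot so the sketch runs anywhere.
$root = 'C:\dir\current'

# Stand-in for the lines Get-Content would return from a config file.
$lines = @('PATH_KEY=C:\\dir\\project', 'foo=bar')

# Build the replacement from the variable; doubling '$' guards against a path
# that happens to contain a dollar sign. Backslashes need no escaping here.
$replacement = 'PATH_KEY=' + $root.Replace('$', '$$')

$result = $lines -replace '^PATH_KEY=.*', $replacement

$result
```

In the real script the last step would be piped to Set-Content $cfg, with Get-Content kept in parentheses so the file is fully read before it is overwritten.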

Usage of | in PowerShell regex

I'm trying to split some text using PowerShell, and I'm doing a little experimenting with regex, and I would like to know exactly what the "|" character does in a PowerShell regex. For example, I have the following line of code:
"[02]: ./media/active-directory-dotnet-how-to-use-access-control/acs-01.png" | select-string '\[\d+\]:' | foreach-object {($_ -split '\[|\]')}
Running this line of code gives me the following output:
-blank line-
02
: ./media/active-directory-dotnet-how-to-use-access-control/acs-01.png
If I run the code without the "|" in the -split statement as such:
"[02]: ./media/active-directory-dotnet-how-to-use-access-control/acs-01.png" | select-string '\[\d+\]:' | foreach-object {($_ -split '\[\]')}
I get the following output without the [] being stripped (essentially it's just displaying the select-string output):
[02]: ./media/active-directory-dotnet-how-to-use-access-control/acs-01.png
If I modify the code and run it like this:
"[02]: ./media/active-directory-dotnet-how-to-use-access-control/acs-01.png" | select-string '\[\d+\]:' | foreach-object {($_ -split '\[|')}
In the output, the [ is stripped from the beginning but the output has a carriage return after each character (I did not include the full output for space purposes).
0
2
]
:
.
/
m
e
The Pipe character, "|", separates alternatives in regex.
You can see all the metacharacters defined here:
http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1
The answers already explain what the | is for but I would like to explain what is happening with each example that you have above.
-split '\[|\]': You are matching either [ or ], which is why you get 3 results. The first is an empty string, because nothing precedes the first [ at the beginning of the line.
-split '\[\]': Since you are omitting the | symbol in this example, you are requesting a split on the character sequence [], which does not appear in your string. This contrasts with the .NET method call $_.split('\[\]'), which in Windows PowerShell treats its argument as a set of characters and splits on any of \, [ or ]. This is by design.
-split '\[|': Here you are running into a caveat of not specifying the right hand operand for the | operator. To quote the help from Regex101 when this regex is specified:
(null, matches any position)
Warning: An empty alternative effectively truncates the regex at this
point because it will always find a zero-width match
Which is why the last example split on every element. Also, I don't think any of this is PowerShell only. This behavior should be seen on other engines as well.
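The three cases can be compared side by side on a plain string. This is a minimal sketch that skips the Select-String step from the question and splits the string directly; the variable names are made up for illustration.

```powershell
$s = '[02]: ./media/active-directory-dotnet-how-to-use-access-control/acs-01.png'

# Alternation: split wherever '[' or ']' occurs.
$either = $s -split '\[|\]'

# No alternation: split on the two-character sequence '[]', which never occurs.
$sequence = $s -split '\[\]'

# Empty right-hand alternative: matches at every position, so the string
# falls apart into single characters.
$empty = $s -split '\[|'

$either.Count      # 3
$sequence.Count    # 1
```

The first element of $either is the empty string that sits before the leading [.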
Walter Mitty is correct, | is for alternation.
You can also use [Regex]::Escape("string") in Powershell and it will return a string that has all the special characters escaped. So you can use that on any strings you want to match literally (or to determine if a specific character does or can have special meaning in a regex).
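As a small illustration of [Regex]::Escape, the sketch below escapes a string containing a dot and compares matching with and without the escape. The sample file names are hypothetical.

```powershell
$needle  = '.txt'
$pattern = [regex]::Escape($needle)            # \.txt

# With the escaped pattern the dot must be a literal dot.
$hitsLiteral = 'notes.txt' -match $pattern     # True
$hitsOther   = 'notes_txt' -match $pattern     # False

# Without escaping, '.' matches any character, including '_'.
$unescaped   = 'notes_txt' -match $needle      # True
```

Comparing the escaped result with the input is also a quick way to see which characters in a string carry special meaning in a regex.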