find repetitive string in a file - regex

I have a text file, Data.txt, a few MB in size.
It contains repetitive lines in key=value format, such as
VolumeTradingDate=2017-09-05T00:00:00.000 VolumeTotal=73147 LastTradeConditions=0
Each line holds many key=value pairs; for simplicity I am showing only a few.
The values change from line to line.
I want to find every occurrence of VolumeTotal together with its value and print/dump only that part, one per line. The value can be up to 25 characters long.
I tried the cmd tool findstr:
findstr /C:VolumeTotal= "C:\Work\Data.txt"
But this doesn't give me the desired result; it prints the entire line.
Could anyone suggest a cmd or PowerShell script to achieve this?

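You can do this in PowerShell with a regex that uses a lookbehind and a lookahead:
Get-Content Data.txt | ForEach-Object {
    $Check = $_ -Match '(?<= VolumeTotal\=)\d*(?= )'
    If ($Check) { $Matches.Values }
}
The pattern (?<= VolumeTotal\=)\d*(?= ) matches a run of digits (\d*) that is preceded by the string ' VolumeTotal=' and followed by a space. The lookbehind and lookahead are not part of the match, so only the digits are returned (for example 73147), without the VolumeTotal= prefix.
When -match succeeds on a single string, it stores the result in the automatic variable $Matches, so the loop outputs that value for every line where the pattern is found. Note that -match reports only the first match in each line, and that the pattern requires a space on both sides, so a VolumeTotal at the very start or end of a line is not matched.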
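If the key should be printed together with its value, a minimal sketch with grep could look like the following. It assumes a grep that supports -o (GNU or BSD grep), values that contain no spaces, and made-up sample data standing in for Data.txt; {1,25} reflects the 25-character limit from the question.

```shell
# Hypothetical sample data standing in for Data.txt.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
VolumeTradingDate=2017-09-05T00:00:00.000 VolumeTotal=73147 LastTradeConditions=0
VolumeTradingDate=2017-09-06T00:00:00.000 VolumeTotal=1200 LastTradeConditions=1
EOF

# -o prints only the matched text, one match per output line.
# [^ ]{1,25} accepts 1 to 25 non-space characters as the value.
result=$(grep -oE 'VolumeTotal=[^ ]{1,25}' "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

Unlike the lookaround pattern, this also reports several occurrences on one line and a VolumeTotal at the end of a line. It would also match inside a longer key that merely ends in VolumeTotal, so the pattern may need a leading anchor or space if such keys exist.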

Related

How to output multiple regex matches through comma on the same line

I want to use grep/awk/sed to extract matched strings from each line of a log file and then write them to a CSV file.
The strings I am after in the first line are 1434, 53 and http://www.espn.com/.
If the input is:
2018-10-31 18:48:01.717,INFO,15592.15627,PfbProxy::handlePfbFetchDone(0x1d69850, pfbId=561, pid=15912, state=4, fd=78, timer=61), FETCH DONE: len=45, PFBId=561, pid=0, loadTime=1434 ms, objects=53, fetchReqEpoch=0.0, fetchDoneEpoch:0.0, fetchId=26, URL=http://www.espn.com/
2018-10-31 18:48:01.806,DEBUG,15592.15621,FETCH DONE: len=45, PFBId=82, pid=0, loadTime=1301 ms, objects=54, fetchReqEpoch=0.0, fetchDoneEpoch:0.0, fetchId=28, URL=http://www.diply.com/
Expected output for the above log lines:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
This is an example, and the actual Log File will have much more data.
--My-Solution-So-far-
For now I use grep to get all lines containing the keyword 'FETCH DONE' (these lines contain the strings I am looking for).
I did come up with a regular expression that matches the data I need, but when I grep with it and write the result to the file, each string is printed on its own line, which is not quite what I am looking for.
The grep and regular expression I use (online regex tool: https://regexr.com/42cah):
echo -en 'url,loadtime,object\n'>test1.csv #add header
grep -Po '(?<=loadTime=).{1,5}(?= )|((?<=URL=).*|\/(?=.))|((?<=objects=).{1,5}(?=\,))'>>test1.csv #get matching strings
Actual output:
URL,LoadTime,Objects
http://www.espn.com
1434
53
http://www.diply.com
1301
54
Expected output:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
I tried using awk to match multiple regexes and print a comma in between, but I couldn't get it to work at all, even though my regex matches the correct strings.
Another idea I have is to use sed to replace some of the '\n' characters with ',':
for(i=1;i<=n;i++)
if(i % 3 != 0){
sed REPLACE "\n" with "," on i-th line
}
I'm pretty sure there is a more efficient way of doing it.

Using sed:
sed -n 's/.*loadTime=\([0-9]*\)[^,]*, objects=\([0-9]*\).* URL=\(.*\)/\3,\1,\2/p' input | \
sed 1i'URL,LoadTime,Objects'
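To see what the substitution does, here is a small sketch that runs the same sed expression against the two sample entries. It assumes each log entry sits on a single line, and it prints the header with printf rather than with the one-line 1i form, which is a GNU sed extension. The temporary file is only there to make the sketch self-contained.

```shell
# The two sample log entries from the question, one entry per line.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
2018-10-31 18:48:01.717,INFO,15592.15627,PfbProxy::handlePfbFetchDone(0x1d69850, pfbId=561, pid=15912, state=4, fd=78, timer=61), FETCH DONE: len=45, PFBId=561, pid=0, loadTime=1434 ms, objects=53, fetchReqEpoch=0.0, fetchDoneEpoch:0.0, fetchId=26, URL=http://www.espn.com/
2018-10-31 18:48:01.806,DEBUG,15592.15621,FETCH DONE: len=45, PFBId=82, pid=0, loadTime=1301 ms, objects=54, fetchReqEpoch=0.0, fetchDoneEpoch:0.0, fetchId=28, URL=http://www.diply.com/
EOF

# \1 = loadTime digits, \2 = objects digits, \3 = everything after " URL=".
# -n together with the p flag prints only lines where the substitution succeeded.
result=$(
  printf 'URL,LoadTime,Objects\n'
  sed -n 's/.*loadTime=\([0-9]*\)[^,]*, objects=\([0-9]*\).* URL=\(.*\)/\3,\1,\2/p' "$tmp"
)
printf '%s\n' "$result"

rm -f "$tmp"
```

Because the three values are captured in one substitution per line, they end up on the same output line, which is what the grep -o approach could not do.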

Findstr - Return only a regex match

I have this string in a text file (test.txt):
BLA BLA BLA
BLA BLA
Found 11 errors and 7 warnings
I run this command:
findstr /r "[0-9]+ errors" test.txt
in order to get just the string 11 errors.
Instead, the output is:
Found 11 errors and 7 warnings
Can someone assist?

findstr always returns every full line that contains a match; it cannot return sub-strings only, so you need to do the sub-string extraction yourself. There are also some issues in your findstr command line that I want to point out:
The string parameter of findstr actually defines multiple search strings separated by white-space, so one search string is [0-9]+ and the other one is errors. The line Found 11 errors and 7 warnings in your text file is returned because of the word errors only. The numeric part is not part of the match, because findstr does not support the + character (one or more occurrences of the previous character or class); you need to change that part of the search string to [0-9][0-9]* instead. To treat the whole string as one search string, you need to provide the /C option; since this defaults to literal search mode, you additionally need to add the /R option explicitly.
findstr /R /C:"[0-9][0-9]* errors" "test.txt"
Changing all this would however also match strings like x5 errorse; to avoid that you could use word boundaries like \< (beginning of word) and \> (end of word). (Alternatively you could also include a space on either side of the search string, so /C:" [0-9][0-9]* errors ", but this might cause trouble if the search string appears at the very beginning or end of the applicable line.)
So regarding all of the above, the corrected and improved command line looks like this:
findstr /R /C:"\<[0-9][0-9]* errors\>" "test.txt"
This will return the entire line containing a match:
Found 11 errors and 7 warnings
If you want to return such lines only and exclude lines like 2 errors are enough or 35 warnings but less than 3 errors, you could of course extend the search string accordingly:
findstr /R /C:"^Found [0-9][0-9]* errors and [0-9][0-9]* warnings$" "test.txt"
Anyway, to extract the portion 11 errors there are several options:
a for /F loop could parse the output of findstr and extract certain tokens:
for /F "tokens=2-3 delims= " %%E in ('
findstr /R /C:"\<[0-9][0-9]* errors\>" "test.txt"
') do echo(%%E %%F
the sub-string replacement syntax could also be used:
for /F "delims=" %%L in ('
findstr /R /C:"\<[0-9][0-9]* errors\>" "test.txt"
') do set "LINE=%%L"
set "LINE=%LINE:* =%"
set "LINE=%LINE: and =" & rem "%"
echo(%LINE%

The findstr tool cannot be used to extract only the matches. It is much easier to use PowerShell for this.
Here is an example:
$input_path = 'c:\ps\in.txt'
$output_file = 'c:\ps\out.txt'
$regex = '[0-9]+ errors'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file
See the Windows PowerShell: Extracting Strings Using Regular Expressions article on how to use the script above.

Piping type (or cat) into grep can do this.
This allows for any error count of up to four digits.
type c:\temp\test.txt | grep -Eo '[0-9]{1,4} errors'
11 errors
If the error count can have more than four digits, raise the upper bound in {1,4} to the largest number of digits you expect.
For an exact, case-sensitive match:
type c:\temp\test.txt | grep -o "11 errors"
11 errors
Or a case-insensitive match with cat:
cat c:\temp\test.txt | grep -o -i "11 ERRORS"
11 errors
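If the number of digits is not known in advance, the upper bound can be dropped altogether. A minimal sketch, assuming a grep that supports -o and -E and using the three sample lines from the question as input:

```shell
# Feed the sample text through grep; [0-9]+ means "one or more digits",
# so no maximum digit count has to be chosen.
result=$(printf 'BLA BLA BLA\nBLA BLA\nFound 11 errors and 7 warnings\n' |
  grep -oE '[0-9]+ errors')
printf '%s\n' "$result"
```

Here -o restricts the output to the matched text, so the rest of the line (Found ... and 7 warnings) is not printed.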

Regex on Powershell script: Replace till end of line

I have some config files structured like:
PATH_KEY=C:\\dir\\project
foo=bar
I want to write a small script that replaces the value of a certain key with the current folder.
So basically I'm trying to replace "PATH_KEY=..." with "PATH_KEY=$PSScriptRoot".
My code so far:
$cfgs = Get-Childitem $PSScriptRoot -Filter *name*.cfg
foreach ($cfg in $cfgs)
{
( Get-Content $cfg) -replace 'PATH_KEY=.*?\n','PATH_KEY=$PSScriptRoot' | Set-Content $cfg
}
But the regular expression to take everything till end of line is not working.
Any help is appreciated!

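Get-Content returns the file as an array of lines with the line breaks already removed, so a pattern that ends in \n never matches. Match up to the end of the line instead. You can use
'(?m)^PATH_KEY=.*'
or even
'PATH_KEY=.*'
Also put the replacement in double quotes, "PATH_KEY=$PSScriptRoot", because PowerShell does not expand variables inside single quotes.
Note that a literal $ in the replacement should be doubled ($$) to denote a single $, but this is not a problem unless there is a digit after it.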
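The same idea, matching from the key to the end of the line with .*, can be sketched outside PowerShell with sed. This is only an illustration: /opt/project is a hypothetical stand-in for the script folder, and the temporary file stands in for one of the .cfg files.

```shell
# Sample config file with the two lines from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
PATH_KEY=C:\\dir\\project
foo=bar
EOF

new_dir=/opt/project   # hypothetical replacement value

# . never matches a newline, so .* stops at the end of the line by itself.
# | is used as the delimiter so the slashes in the path need no escaping.
result=$(sed "s|^PATH_KEY=.*|PATH_KEY=$new_dir|" "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

A replacement value that contains |, & or a backslash would have to be escaped before being placed into the sed expression.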

search a pattern from a file with a combination of defined variable and extra strings using powershell

I have a text file as mentioned below:
04Jul#15:08 ERROR: The Ticket and Load data do not match: NUM[MXS035]
04Jul#15:14 No data for MXS035
04Jul#15:14 Ticket = [MXS035]
04Jul#15:39 Ticket = [ABC077]
04Jul#16:14 gNoRcomp = [72]
04Jul#16:14 Test lines 12345
04Jul#16:14 gNoRcomp = [72]
04Jul#16:14 test file content not displayed
MU: module rpt3.cpp, line 8652
Database 0
Communications 0
I capture the part common to these lines, 04Jul for a particular date, in a variable:
$date_value=Get-date -Format ddMMM
Displaying $date_value gives 04Jul.
I need to search the text file for lines in which the date and the word Ticket are constant while the rest of the line changes.
For example, I need to capture the lines below:
04Jul#15:14 Ticket = [MXS035]
04Jul#15:39 Ticket = [ABC077]
These start with 04Jul (already held in $date_value) followed by #; the time field changes, " Ticket = [" is constant, the next 6 characters change, and the closing ] is constant.
So the requirement is
$date_value#......Ticket=[......]
That is the common part of each line that needs to be captured.
I tried the Select-String command below, but it is not working.
select-string -pattern "$date_value#\d+:\d+ Ticket = [[]ABCDEF[]]" test.txt
Any suggestions, please?

You need to amend the part of the regex where you are looking for a reference in square brackets.
If you want to look for characters that have special meaning in regular expression syntax, you must escape them using the backslash character first e.g. to escape an opening square bracket, it's \[ (in fact to type this, I had to escape the backslash itself by typing it twice)
The following works:
select-string -pattern "$date_value#\d+:\d+ Ticket = \[[A-Za-z]{3}\d{3}\]" test.txt
So, it was all fine up until the reference in square brackets. What I've done here tells it to look for an opening square bracket, followed by 3 letters of either upper or lower case, followed by 3 digits, and finally a closing square bracket.
In my test, with a file using the contents you provided, I got back the following results:
test.txt:3:04Jul#15:14 Ticket = [MXS035]
test.txt:4:04Jul#15:39 Ticket = [ABC077]
...which tells you filename, line number that matched, and the line contents.
For further help, in the PowerShell console or the ISE, enter: help about_regular_expressions

This is an example of a regex for a string like "04Jul#15:14 Ticket = [MXS035]":
"04Jul#15:14 Ticket = [MXS035]" -Match "(\d{2}\w{3})#(\d{2}:\d{2})\sTicket\s=\s\[(\w*)\]"
$date = ($Matches[1] | Get-date -Format ddMMM)
$time = $Matches[2]
$ticket = $Matches[3]
$date, $time, $ticket
This command selects all lines that match the pattern:
Select-String -Pattern "(\d{2}\w{3})#(\d{2}:\d{2})\sTicket\s=\s\[(\w*)\]" test.txt
With the date in a variable (the concatenation is wrapped in parentheses so that it is evaluated as an expression before being passed to -Pattern):
$date_value='04Jul'
Select-String -Pattern ($date_value + "#\d{2}:\d{2}\sTicket\s=\s\[\w*\]") -Path "C:\test.txt"

In your pattern, change [[]ABCDEF[]] to \[[A-Z]{3}[0-9]{3}\] to match a ticket number consisting of 3 uppercase letters followed by 3 digits between literal square brackets:
$date_value = Get-date -Format ddMMM
Select-String "$date_value#\d+:\d+ Ticket = \[[A-Z]{3}[0-9]{3}\]" test.txt
This gives me the following output when I use it on a file test.txt with the sample content from your question:
test.txt:3:04Jul#15:14 Ticket = [MXS035]
test.txt:4:04Jul#15:39 Ticket = [ABC077]
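For comparison, the same "variable prefix plus fixed pattern" search can be sketched with grep -E. The date is hard-coded here so that the sample is reproducible (the question derives it from the current date), and the temporary file holds a few of the sample lines from the question.

```shell
# A few sample lines from the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
04Jul#15:08 ERROR: The Ticket and Load data do not match: NUM[MXS035]
04Jul#15:14 No data for MXS035
04Jul#15:14 Ticket = [MXS035]
04Jul#15:39 Ticket = [ABC077]
04Jul#16:14 gNoRcomp = [72]
EOF

date_value=04Jul   # fixed for the sketch; normally taken from the current date

# Double quotes let the shell expand ${date_value} inside the pattern.
# \[ is a literal opening bracket; a closing ] outside a bracket expression
# is an ordinary character in an extended regex and needs no backslash.
result=$(grep -E "^${date_value}#[0-9]+:[0-9]+ Ticket = \[[A-Z]{3}[0-9]{3}]" "$tmp")
printf '%s\n' "$result"

rm -f "$tmp"
```

Only the two Ticket lines are printed; the ERROR line also contains the word Ticket and a bracketed code, but it does not have " Ticket = [" directly after the time field.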

Text manipulation problem - how to replace text after a known value

I have a large text file containing filenames ending in .txt
Some of the rows of the file have unwanted text after the filename extension.
I am trying to find a way to search+replace or trim the whole file so that if a row is found with .txt, anything after this is simply removed. Example
C:\Test1.txt
C:\Test2.txtHelloWorld this is my
problem
C:\Test3.txt_____Annoying
stuff1234 .r
Desired result
C:\Test1.txt
C:\Test2.txt
C:\Test3.txt
I have tried with Notepad++ and with batch/PowerShell; I got close, but no cigar.
(Get-Content "D:\checkthese.txt") |
Foreach-Object {$_ -replace '.txt*', ".txt"} |
Set-Content "D:\CLEAN.txt"
My thinking was that replacing anything (wildcard *) after .txt would trim off what I don't need, but this doesn't work. I think I need a regular expression, but I have the syntax wrong.

Simply change the * to a .*, like so:
(Get-Content "D:\checkthese.txt") |
Foreach-Object {$_ -replace '\.txt.*', ".txt"} |
Set-Content "D:\CLEAN.txt"
In regular expressions, * means "zero or more of the preceding element". Here it acts on the final t of .txt, so .txt* only matches .tx, .txt, .txtt, .txttt, and so on.
A ., however, matches any character. This means .* matches zero or more of anything, which is what you want. For the same reason I also escaped the . in .txt, as the pattern could otherwise break on filenames like alovelytxtfile.txt, which would be trimmed to alovel.txt.
For more information, see:
Regex Tutorial - .
Regex Tutorial - *
