Findstr - Return only a regex match - regex

I have this string in a text file (test.txt):
BLA BLA BLA
BLA BLA
Found 11 errors and 7 warnings
I perform this command:
findstr /r "[0-9]+ errors" test.txt
in order to get just the string 11 errors.
Instead, the output is:
Found 11 errors and 7 warnings
Can someone assist?

findstr always returns every full line that contains a match; it is not capable of returning sub-strings only. Hence you need to do the sub-string extraction on your own. Anyway, there are some issues in your findstr command line, which I want to point out:
The string parameter of findstr actually defines multiple search strings separated by whitespace, so one search string is [0-9]+ and the other one is errors. The line Found 11 errors and 7 warnings in your text file is returned because of the word errors only; the numeric part is not part of the match, because findstr does not support the + character (one or more occurrences of the previous character or class). You need to change that part of the search string to [0-9][0-9]* to achieve that. To treat the whole string as one search string, you need to provide the /C option; since this defaults to literal search mode, you additionally need to add the /R option explicitly.
findstr /R /C:"[0-9][0-9]* errors" "test.txt"
Changing all this would however also match strings like x5 errorse; to avoid that you could use word boundaries like \< (beginning of word) and \> (end of word). (Alternatively you could also include a space on either side of the search string, so /C:" [0-9][0-9]* errors ", but this might cause trouble if the search string appears at the very beginning or end of the applicable line.)
So regarding all of the above, the corrected and improved command line looks like this:
findstr /R /C:"\<[0-9][0-9]* errors\>" "test.txt"
This will return the entire line containing a match:
Found 11 errors and 7 warnings
If you want to return such lines only and exclude lines like 2 errors are enough or 35 warnings but less than 3 errors, you could of course extend the search string accordingly:
findstr /R /C:"^Found [0-9][0-9]* errors and [0-9][0-9]* warnings$" "test.txt"
Anyway, to extract the portion 11 errors there are several options:
a for /F loop could parse the output of findstr and extract certain tokens:
for /F "tokens=2-3 delims= " %%E in ('
findstr /R /C:"\<[0-9][0-9]* errors\>" "test.txt"
') do echo(%%E %%F
the sub-string replacement syntax could also be used:
for /F "delims=" %%L in ('
findstr /R /C:"\<[0-9][0-9]* errors\>" "test.txt"
') do set "LINE=%%L"
set "LINE=%LINE:* =%"
set "LINE=%LINE: and =" & rem "%"
echo(%LINE%

The findstr tool cannot be used to extract matches only. It is much easier to use PowerShell for this.
Here is an example:
$input_path = 'c:\ps\in.txt'
$output_file = 'c:\ps\out.txt'
$regex = '[0-9]+ errors'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file
See the Windows PowerShell: Extracting Strings Using Regular Expressions article on how to use the script above.

You can do this with type (or cat) piped into grep, provided a grep port is installed.
This allows for an arbitrary number of errors (up to four digits).
type c:\temp\test.txt | grep -Eo '[0-9]{1,4} errors'
11 errors
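If the error count can exceed four digits, raise the upper bound to the largest expected number of digits, or use [0-9]+ to accept any number of digits. With {1,4}, a longer number is not rejected; only its last four digits are returned.
For an exact case-sensitive option: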
type c:\temp\test.txt | grep -o "11 errors"
11 errors
Or this case-insensitive option with cat:
cat c:\temp\test.txt | grep -o -i "11 ERRORS"
11 errors
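A minimal sketch of the difference between the bounded and unbounded quantifier, assuming a POSIX shell and a grep that supports -E and -o. The sample line with a six-digit count is made up for illustration.

```shell
# Hypothetical sample line with a six-digit error count.
line='Found 123456 errors and 7 warnings'

# Bounded quantifier: at most four digits may precede " errors",
# so the match starts in the middle of the number.
capped=$(printf '%s\n' "$line" | grep -Eo '[0-9]{1,4} errors')

# Unbounded quantifier: one or more digits, so the whole number is kept.
full=$(printf '%s\n' "$line" | grep -Eo '[0-9]+ errors')

printf '%s\n' "$capped"   # 3456 errors
printf '%s\n' "$full"     # 123456 errors
```

If truncated numbers would be misleading, the unbounded form is probably the safer default.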

Related

Findstr with spaces does not work with character class

I have read How to write a search pattern to include a space in findstr? and I thought the character class should solve my problem.
I came up with the following minimal reproducible example:
echo 1.999 K| findstr /R "\....[ ]K"
I expect \. to match the dot, ... to match 999, [ ] to match the space and K to match K. However, there's no output.
When I do
echo 1.999 K| findstr /R "\.... K"
it will be interpreted as 2 individual search terms, which I can't use, because I need the K to be after the number (compare to echo K 1.999 M| findstr /R "\.... K")
I'm on Windows 10 1903 18362.1016 if that matters.
Using /R /C:"\....[ ]K" works, as does /R /C:"\.... K". I've personally long given up trying to understand findstr's idiosyncrasies with parsing its arguments or regular expressions¹. If you don't have multiple search terms, using /C: should never do harm and only cut down on unexpected failures.
The other option, if you have it available, might be to pipe your data through PowerShell:
echo 1.999 K| powershell "$input -match '\....[ ]K'"
echo 1.999 K| pwsh -command "$input -match '\....[ ]K'"
Slower (PowerShell 7 is quite a bit faster than 5.1), but a lot more predictable.
¹ dbenham has reverse-engineered findstr once, with a list of all known bugs. My guess here would be that despite the [], the space is still parsed as a separator between search terms, each of which then contain a literal bracket, leading to no match.
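To confirm that the pattern itself is sound and that the trouble lies in how findstr splits its argument, the same basic regular expression can be tried with grep, where a space inside the quoted pattern is not a separator. This is only a sketch assuming a POSIX shell with grep; the helper name matches is made up.

```shell
# matches: hypothetical helper that reports whether a line matches the pattern.
matches() {
  if printf '%s\n' "$1" | grep -q '\....[ ]K'; then
    echo yes
  else
    echo no
  fi
}

first=$(matches '1.999 K')     # dot, three characters, space, K
second=$(matches 'K 1.999 M')  # K comes before the number, M after it

echo "$first"    # yes
echo "$second"   # no
```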

How to output multiple regex matches through comma on the same line

I want to use grep/awk/sed to extract matched strings for each line of a log file. Then place it into csv file.
Highlighted strings (1434,53,http://www.espn.com/)
If the input is:
2018-10-31
18:48:01.717,INFO,15592.15627,PfbProxy::handlePfbFetchDone(0x1d69850,
pfbId=561, pid=15912, state=4, fd=78, timer=61), FETCH DONE: len=45,
PFBId=561, pid=0, loadTime=1434 ms, objects=53, fetchReqEpoch=0.0,
fetchDoneEpoch:0.0, fetchId=26, URL=http://www.espn.com/
2018-10-31
18:48:01.806,DEBUG,15592.15621,FETCH DONE: len=45, PFBId=82, pid=0,
loadTime=1301 ms, objects=54, fetchReqEpoch=0.0, fetchDoneEpoch:0.0,
fetchId=28, URL=http://www.diply.com/
Expected output for the above log lines:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
This is an example, and the actual Log File will have much more data.
--My-Solution-So-far-
For now I used grep to get all lines containing keyword 'FETCH DONE' (these lines contain strings I am looking for).
I did come up with regular expression that matches the data I need, but when I grep it and put it in the file it prints each string on the new line which is not quite what I am looking for.
The grep and regular expression I use (online regex tool: https://regexr.com/42cah):
echo -en 'url,loadtime,object\n'>test1.csv #add header
grep -Po '(?<=loadTime=).{1,5}(?= )|((?<=URL=).*|\/(?=.))|((?<=objects=).{1,5}(?=\,))'>>test1.csv #get matching strings
Actual output:
URL,LoadTime,Objects
http://www.espn.com
1434
53
http://www.diply.com
1301
54
Expected output:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
I tried using awk to match multiple regexes and print a comma in between. I couldn't get it to work at all for some reason, even though my regex matches the correct strings.
Another idea I have is to use sed to replace some '\n' for ',':
for(i=1;i<=n;i++)
if(i % 3 != 0){
sed REPLACE "\n" with "," on i-th line
}
I'm pretty sure there is a more efficient way of doing it.
Using sed:
sed -n 's/.*loadTime=\([0-9]*\)[^,]*, objects=\([0-9]*\).* URL=\(.*\)/\3,\1,\2/p' input | \
sed 1i'URL,LoadTime,Objects'
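A self-contained sketch of the sed approach, under the assumption that each log record sits on one physical line (the question shows them wrapped). The header is written with printf rather than the GNU-specific 1i one-liner, which is a swap made here for portability, not part of the answer above.

```shell
# Two sample records from the question, each joined onto ONE line (assumed layout).
log='2018-10-31 18:48:01.717,INFO,15592.15627,PfbProxy::handlePfbFetchDone(0x1d69850, pfbId=561, pid=15912, state=4, fd=78, timer=61), FETCH DONE: len=45, PFBId=561, pid=0, loadTime=1434 ms, objects=53, fetchReqEpoch=0.0, fetchDoneEpoch:0.0, fetchId=26, URL=http://www.espn.com/
2018-10-31 18:48:01.806,DEBUG,15592.15621,FETCH DONE: len=45, PFBId=82, pid=0, loadTime=1301 ms, objects=54, fetchReqEpoch=0.0, fetchDoneEpoch:0.0, fetchId=28, URL=http://www.diply.com/'

csv=$(
  # Header first, then one CSV row per line that contains all three fields.
  printf '%s\n' 'URL,LoadTime,Objects'
  printf '%s\n' "$log" |
    sed -n 's/.*loadTime=\([0-9]*\)[^,]*, objects=\([0-9]*\).* URL=\(.*\)/\3,\1,\2/p'
)

printf '%s\n' "$csv"
```

Lines without loadTime, objects and URL in that order are dropped by -n together with the p flag, so a separate grep for FETCH DONE should not be needed.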

Powershell: how to replace quoted text from a batch file

I have a text file that contains:
#define VERSION "0.1.2"
I need to replace that version number from a running batch file.
set NEW_VERSION="0.2.0"
powershell -Command "(gc BBB.iss) -replace '#define VERSION ', '#define VERSION %NEW_VERSION% ' | Out-File BBB.iss"
I know that my match pattern is not correct. I need to select the entire line including the "0.2.0", but I can't figure out how to escape all that because it's all enclosed in double quotes so it runs in a batch file.
I'm guessing that [0-9].[0-9].[0-9] will match the actual old version number, but what about the quotes?
but what about the quotes?
When calling PowerShell's CLI from cmd.exe (a batch file) with powershell -command "....", use \" to pass embedded ".
(This may be surprising, given that PowerShell-internally you typically use `" or "" inside "...", but it is the safe choice from the outside.[1])
Note:
While \" works robustly on the PowerShell side, it can situationally break cmd.exe's parsing. In that case, use "^"" (sic) with powershell.exe (Windows PowerShell), and "" with pwsh.exe (PowerShell (Core) 7+), inside overall "..." quoting. See this answer for details.
Here's an approach that matches and replaces everything between "..." after #define VERSION :
:: Define the new version *without* double quotes
set NEW_VERSION=0.2.0
powershell -Command "(gc BBB.iss) -replace '(?<=#define VERSION\s+\").+?(?=\")', '%NEW_VERSION%' | Set-Content -Encoding ascii BBB.iss"
Note that using Out-File (as used in the question) to rewrite the file creates a UTF-16LE ("Unicode") encoded file, which may be undesired; use Set-Content -Encoding ... to control the output encoding. The above command uses Set-Content -Encoding ascii as an example.
Also note that rewriting an existing file this way (read existing content into memory, write modified content back) bears the slight risk of data loss, if writing the file gets interrupted.
(?<=#define VERSION\s+\") is a look-behind assertion ((?<=...)) that matches literal #define VERSION followed by at least one space or tab (\s+) and a literal "
Note how the " is escaped as \", which - surprisingly - is how you need to escape literal " chars. when you pass a command to PowerShell from cmd.exe (a batch file).[1]
.+? then non-greedily (?) matches one or more (+) characters (.)...
...until the closing " (escaped as \") is found via (?=\"), a look-ahead assertion ((?=...))
The net effect is that only the characters between "..." are matched - i.e., the mere version number - which then allows replacing it with just '%NEW_VERSION%', the new version number.
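The same "replace only what is between the quotes" idea can be sketched outside PowerShell with sed, assuming a POSIX shell. sed has no look-around, so this sketch swaps in a capture group that is written back instead; the variable names are made up.

```shell
line='#define VERSION "0.1.2"'
new_version='0.2.0'   # hypothetical new version, defined without quotes

# \(...\) captures the prefix up to and including the opening quote,
# [^"]* matches the old version, and the closing quote is matched and re-emitted.
updated=$(printf '%s\n' "$line" |
  sed 's/\(#define VERSION[[:space:]][[:space:]]*"\)[^"]*"/\1'"$new_version"'"/')

printf '%s\n' "$updated"   # #define VERSION "0.2.0"
```

This sketch would break if the new version contained / or &, since both are special in the replacement part of s/.../.../.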
A simpler alternative, if all that is needed is to replace the 1st line, without needing to preserve specific information from it:
powershell -nop -Command "@('#define VERSION \"%NEW_VERSION%\"') + (gc BBB.iss | Select -Skip 1) | Set-Content -Encoding ascii BBB.iss"
The command simply creates an array (@(...)) of output lines from the new 1st line and (+) all but the 1st line from the existing file (gc ... | Select-Object -Skip 1) and writes that back to the file.
[1] When calling from cmd.exe, escaping an embedded " as "" sometimes, but not always, works (try
powershell -Command "'Nat ""King"" Cole'").
Instead, \"-escaping is the safe choice.
`", which is the typical PowerShell-internal way to escape " inside "...", never works when calling from cmd.exe.
You can try this,
powershell -Command "(gc BBB.iss) -replace '(?m)^\s*#define VERSION .*$', '#define VERSION %NEW_VERSION% ' | Out-File BBB.iss"
If you want the double quotes kept, escape them as \" (this assumes NEW_VERSION is defined without quotes):
powershell -Command "(gc BBB.iss) -replace '(?m)^\s*#define VERSION .*$', '#define VERSION \"%NEW_VERSION%\"' | Out-File BBB.iss"

find repetitive string in a file

I have a text file, Data.txt, a few MB in size.
It has repetitive lines like
VolumeTradingDate=2017-09-05T00:00:00.000 VolumeTotal=73147 LastTradeConditions=0 in key=value format.
There are various key=value pairs; for simplicity I am showing only a few.
The values change from line to line.
I want to search for all occurrences of VolumeTotal with its value and print/dump only that part on separate lines. Its value can be up to 25 characters long.
I tried using cmd FindStr
findstr /C:VolumeTotal= "C:\Work\Data.txt"
But this doesn't give me desired result. It prints entire line.
Could anyone suggest a possible script in cmd or PowerShell to achieve this?
You can do this in PowerShell with a RegEx that uses look ahead and look behind:
Get-Content Data.txt | ForEach-Object {
$Check = $_ -Match '(?<= VolumeTotal\=)\d*(?= )'
If ($Check) { $Matches.Values }
}
The pattern: (?<= VolumeTotal\=)\d*(?= ) looks for any number of digits \d* between the strings ' VolumeTotal=' and a space character.
The result is sent to the automatic variable $Matches so we return the value of this variable if the pattern has been found.
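For comparison, a sketch of the same extraction with grep and sed, assuming a POSIX shell and a grep that supports -o (the question itself asks about cmd or PowerShell). The second sample line is made up to show that every occurrence is printed on its own line.

```shell
# First line is from the question; the second is a hypothetical extra record.
data='VolumeTradingDate=2017-09-05T00:00:00.000 VolumeTotal=73147 LastTradeConditions=0
VolumeTradingDate=2017-09-06T00:00:00.000 VolumeTotal=81002 LastTradeConditions=0'

# Key together with its value: everything up to the next space.
pairs=$(printf '%s\n' "$data" | grep -o 'VolumeTotal=[^ ]*')

# Value only: capture the digits after the key and print just the capture.
values=$(printf '%s\n' "$data" | sed -n 's/.*VolumeTotal=\([0-9]*\).*/\1/p')

printf '%s\n' "$pairs"
printf '%s\n' "$values"
```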

Adding a new-line after a specific number of characters in a string

I've found this thread:
How to insert a new line character after a fixed number of characters in a file
but the answers there unfortunately don't help me.
I have this huge string $temp that I want to have cut off after a specific amount of characters (e.g. 20) to match it with the line above. After that number of characters there should be added a \n and in the end the formatted string should be inserted into a variable. There should be no cutting if the length of the string is < the number.
Right now I'm sticking to
sed -e "s/.\{20\}/&\n/g" <<< $temp
but it doesn't work. Instead of adding \n it's adding spaces.
The easiest is probably to use the fold utility, as suggested in the thread you referenced. For instance:
printf "%s\n" "$temp" | fold -cw 20
When you say:
to match it with the line above
...Perhaps it would be helpful to try the uniq program to weed out (or count) duplicates:
printf "%s\n" "$temp" | fold -cw 20 | uniq
And of course, if you want the output in a new variable, wrap it in $() like so:
new="$(printf "%s\n" "$temp" | fold -cw 20 | uniq)"
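A small sketch of the fold approach with made-up input, assuming a POSIX shell and using only the -w option. It also shows one plausible cause of the "spaces instead of \n" symptom, which is an assumption, since the question does not show how the result was printed: expanding the variable without quotes lets the shell turn the newlines into spaces.

```shell
temp='abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJ'   # 46 characters, hypothetical input
short='hello'                                           # shorter than the width

new=$(printf '%s\n' "$temp" | fold -w 20)
unchanged=$(printf '%s\n' "$short" | fold -w 20)

printf '%s\n' "$new"   # quoted expansion: three lines of 20, 20 and 6 characters
flat=$(echo $new)      # unquoted expansion: word splitting replaces newlines with spaces
printf '%s\n' "$flat"
```

Strings shorter than the width pass through untouched, which matches the "no cutting" requirement in the question.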
NewTemp="$( printf "%s" "${temp}" | sed -e 's/.\{20\}/&\
/g' )"
Your sed is fine, but it is always a bit tricky when source and destination are the same if the application does not specifically account for this (as the -i option of GNU sed does).
Use single quotes if possible (when no substitution is needed).
I prefer to use a real newline (when not writing a one-liner) instead of \n, so that it works in every sed version (POSIX does not specify \n in the replacement text).
Just be sure of the meaning of any existing newlines in the temp variable:
sed works line by line by default (so each line is processed one at a time)
if using multi-line mode (via an option or by loading the buffer), a newline IS also a character matched by .