I am trying to use grep -v -e '' to exclude comments (lines with # as the first non-whitespace character) from a file.
The # can appear either at the beginning of the line, or after any combination of spaces and tabs.
Assume the file np4 contains this:
# hash at the begining of the line
## two hashes at the begining of the line
#### four hashes at the begining of the line
# two white spaces then a hash
a good line
another good line starting with a few spaces
a good line starting with a combination of spaces and tabs
# two white spaces, two tabes and then a hash
## two tabs, two white spaces and then two hashes
# tab, ws, tab, ws, tab then hash
I tried using the command below, but it does not work as I thought it would. I should only get three lines as the output.
grep -v -e '^\s*#.*$' np4
I believe what you are missing is the plus sign, which matches one or more pound symbols. I tested this in a regex tester rather than with grep itself, but it looked good there.
^\s*#+.*$
Here is a permalink to the test on regexpal.com.
I'm not sure \s works well in grep.
Could you try "^[ 	]*#.*"?
Inside the [] there are two characters: a space and a tab.
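For comparison, here is a minimal sketch of the same filter in Python, where the character class is spelled out as a space and \t. It is not part of the original thread; the helper name strip_comment_lines and the sample text are made up for illustration.

```python
import re

# A comment line: optional spaces/tabs, then '#'.
COMMENT = re.compile(r'^[ \t]*#')

def strip_comment_lines(text):
    """Return the lines of text that are not comments (hypothetical helper)."""
    return [line for line in text.splitlines() if not COMMENT.match(line)]

# Made-up sample, loosely modelled on the np4 file in the question.
sample = (
    "# hash\n"
    "  # two spaces then hash\n"
    "a good line\n"
    " \t# space, tab, hash\n"
    "   another good line\n"
)
print(strip_comment_lines(sample))
```

Only the two "good" lines survive, which is the behaviour the question expects from grep -v.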
I need to extract several lines of text (which vary in length along the 500 MB document) between a line that starts with "Query #" and two consecutive carriage returns. This is being done on a Mac. For example, the document format is:
Query #1: 020.1-Bni_its1_2019_envio1set1
lines I need to extract
Alignments (the following lines I don't need)
xyz
xyx
Query #2: This and the following lines I need. And so on.
There are always exactly two carriage returns before the word "Alignments". So basically I need all the lines from Query #.: until Alignments.
I tried the following regex, but I only recover the first line.
ggrep -P 'Query #.*?(?:[\r\n]{2}|\Z)'
I have tested the regex with multiple iterations at Regex101, but I have not yet found the answer.
The expected output is:
Query #1. Text.
Lines I need to extract
Query #2: This and following lines I need.
Lines I need.
Query #....
With pcregrep, you can use
pcregrep -oM 'Query #.*(?:\R(?!\R{2}).*)*' file.txt > results.txt
Here,
o - outputs matched texts
M - enables matching across lines (puts line endings into "pattern space")
Query #.*(?:\R(?!\R{2}).*)* matches
Query # - literal text
.* - the rest of the line
(?:\R(?!\R{2}).*)* - zero or more sequences of a line break sequence (\R) not immediately followed with two line break sequences ((?!\R{2})) and then the rest of the line.
From Regular Expressions: Now You Have Two Problems:
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
Using any AWK implementation in any shell on every Unix box:
awk '/^Query #/{f=1} /^Alignments/{f=0} f' file
Output:
Query #1: 020.1-Bni_its1_2019_envio1set1
lines I need to extract
Query #2: This and the following lines I need. And so on.
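The flag idea behind the awk one-liner can also be sketched in Python. This is only an illustration of the same logic, assuming the input is available as an iterable of lines; the function name extract_queries is hypothetical.

```python
def extract_queries(lines):
    """Yield every line from a 'Query #' line up to, but not including,
    the next line that starts with 'Alignments' (hypothetical helper)."""
    keep = False
    for line in lines:
        if line.startswith("Query #"):
            keep = True       # same role as f=1 in the awk script
        elif line.startswith("Alignments"):
            keep = False      # same role as f=0
        if keep:
            yield line

# Made-up sample in the shape described by the question.
doc = [
    "Query #1: a",
    "lines I need",
    "",
    "",
    "Alignments",
    "xyz",
    "Query #2: b",
]
print(list(extract_queries(doc)))
```

Like the awk version, this keeps the blank lines that precede "Alignments"; filter them out afterwards if they are unwanted.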
Every line of the input file will match one of the patterns:
"SCnnnn"
"SC-nnnn"
"SC_nnnn"
( n=[0-9], SC is literal but may be upper or lowercase and will be followed immediately by 1-4 digits delimited at the end by an alphanumeric, space or other non-numeric character)
Somewhere in the line there will also be a file extension (matching ".abc") where abc = upper|lower alphanumeric in any position.
I want to extract the first pattern and print this together with the extracted file extension for each line. This is what I have so far:
sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p' infile
Here's a sample input line:
SCSCSCSCSCSCSCSCSC1867SCBrSCSCSCSC&SCBlSCkSCSCBSCrSCbSCckSC.xyz
with required output being:
SC1867.xyz
but what I am getting is:
SCSCSCSCSCSCSCSCSC1867.xyz
Can someone please tell me why this is returning the "SC"s before the part I want? I know it's something to do with greediness, but I can't get my head around it.
(Everything works fine where my "SCnnnn" match is at the beginning of the line.)
I am open to other tools - e.g. awk - if they offer a more straightforward solution.
EDIT: I think I found a solution - at least it appears to work:
sed -E -n 's/.*([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p'
Greediness is not really what is at play here. sed replaces only the part of the line that the pattern matches, and the p flag on your s// command then prints the whole line, including whatever came before the match.
To see this more clearly, make infile contain a more obvious string like 0o0o0o0o0o0o0o0oSC1867lalalalalalfalalala.xyz and run your first command. This is the result:
[user@localhost ~]$ sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p' infile
0o0o0o0o0o0o0o0oSC1867.xyz
As a slow-mo: sed finds your [Ss][Cc] characters beginning after the 0o0o0s and dutifully replaces the string you have described with the desired substitution; namely, it maintains the SC_-like part and four digits, then deletes everything after the numbers until the suffix. The problem is seen when the p command prints out the partially-changed line, including all of the unwanted 0oze.
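The same effect can be reproduced outside sed. The sketch below uses Python's re.sub with the pattern from the question, purely as an illustration that a substitution leaves the unmatched prefix of the line in place; it is not what the original poster ran.

```python
import re

# The unanchored pattern from the question, unchanged.
pattern = r'([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})'
line = '0o0o0o0o0o0o0o0oSC1867lalalalalalfalalala.xyz'

# Only the matched span (from "SC1867" to ".xyz") is rewritten;
# everything before the match is left untouched.
result = re.sub(pattern, r'\1\2', line, count=1)
print(result)  # prints 0o0o0o0o0o0o0o0oSC1867.xyz
```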
Alternately
As an alternate solution, not involving printing partially changed lines but instead matching an entire line and altering it to your purpose, the following command extracted the correct answer to stdout for a file containing your example string:
[user@localhost ~]$ sed -e 's/^.*\([Ss][Cc][-_]\?[0-9]\{4\}\).*\(\.[a-zA-Z0-9]\{3\}\)$/\1\2/' infile
SC1867.xyz
To break that regex down a bit: it begins at the start of the line (^), consumes all characters (.*) until it sees an SC (upper or lower case, [Ss][Cc]), then checks for an optional hyphen or underscore ([-_]\?), followed by exactly four digits ([0-9]\{4\}). Then all characters are consumed until a dot (\.) is seen, followed by exactly three alphanumeric characters ([a-zA-Z0-9]\{3\}) and the end of the line ($). The two expressions not consumed by a wildcard are captured in groups and concatenated (\1\2).
... sed -E 's/^.*([Ss][Cc][-_]?[0-9]{4}).*(\.[a-zA-Z0-9]{3})$/\1\2/' infile works too, if you don't enjoy backslashes as much as I do.
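Since the question is open to other tools, here is a minimal Python sketch that avoids substitution altogether: it searches for the first SC code and returns only the captured groups. The function name code_and_extension is made up, and the {1,4} digit count follows the question's description ("1-4 digits") rather than the four-digit sed answer above.

```python
import re

# First SC code (1-4 digits), then the last 3-character extension on the line.
SC_LINE = re.compile(r'([Ss][Cc][-_]?[0-9]{1,4}).*(\.[A-Za-z0-9]{3})')

def code_and_extension(line):
    """Return '<SC code><extension>' or None if the line does not match
    (hypothetical helper)."""
    match = SC_LINE.search(line)
    if match is None:
        return None
    # Only the captured groups are returned, so no stray prefix can leak in.
    return match.group(1) + match.group(2)

sample = 'SCSCSCSCSCSCSCSCSC1867SCBrSCSCSCSC&SCBlSCkSCSCBSCrSCbSCckSC.xyz'
print(code_and_extension(sample))  # prints SC1867.xyz
```

Because search() scans left to right, the first SC that is actually followed by digits is the one that gets captured.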
I have a file in format:
has | have | had\tmeaning of have\n
apple\tmeaning of apple\n
write | wrote\tmeaning of write\n
I want to have it in the following format:
has\tmeaning of have\n
have\tmeaning of have\n
had\tmeaning of have\n
apple\tmeaning of apple\n
etc. There can be a single word or multiple words (has, have, had). Multiple words are separated by space, pipe character, space. The words are followed by a tab character and then the meaning, which ends with a newline. I am not sure, but I want to assume that the meaning may contain a pipe or tab character (or, better, any character except newline). Can it be done in Notepad++? If not, is there another easy alternative?
My input file uses actual newline and tab characters. Since I can't paste them on Stack Overflow, I have presented them as \n and \t (escape sequences) in the examples.
EDIT
It sounds like your input contains real tab and newline characters rather than the literal text \t and \n. In that case, this should work:
Search: \s*([^ |]+) \|\s*(?=.*?\t(.*?)(?=(?:\R|$)))
Replace: \1\t\2\n
Original
If the file contains the literal text \t and \n instead: in the Replace tab, make sure to check the "regex" box at the bottom left, then use this:
Search: \s*([^ |]+) \|\s*(?=.*?\\t(.*?)(?=(?:\\n|$)))
Replace: \1\t\2\n
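As one possible alternative to Notepad++, the transformation can be sketched in a few lines of Python. This assumes real tab and newline characters in the input, as the question states; the function name expand_entries is hypothetical.

```python
def expand_entries(text):
    """Turn 'a | b | c<TAB>meaning' lines into one 'word<TAB>meaning'
    line per word (hypothetical helper)."""
    out = []
    for line in text.splitlines():
        # Split at the FIRST tab only, so the meaning may itself
        # contain tabs or pipe characters.
        words, _, meaning = line.partition("\t")
        for word in words.split(" | "):
            out.append(f"{word}\t{meaning}")
    return "\n".join(out) + "\n"

sample = "has | have | had\tmeaning of have\napple\tmeaning of apple\n"
print(expand_entries(sample), end="")
```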
I have thousands of text files with an empty first row. Is it possible to delete this row in all files at once?
You need a batch script like this:
@echo off
for %%i in (*.txt) do (
more +1 "%%~fi">>temp
del "%%~fi"
ren temp "%%~nxi"
)
Save the above code as something.bat and run it at your directory.
This will work using Notepad++ (tested with version 6.2.3):
\A[\r\n]+
Explanation:
\A and \Z always match the beginning and end of the entire file, irrespective of the multiline setting.
Note: This regex is slightly more general than what the OP asked for. It will remove any number of consecutive initial blank rows terminated with any line break sequence (\r\n, \r or \n).
Nothing is worse than changing thousands of files only to find later that a couple have a different line break sequence or have multiple initial blank lines.
Alternative:
Another regex that works is:
(?<!.)[\r\n]+
Explanation:
This uses negative look-behind, (?<!), to make sure no character exists before the sequence of CRs and LFs.
Note: You must tick the . matches newline check box for this to work.
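If a script is preferable to an editor, the \A[\r\n]+ idea carries over directly to Python. The sketch below is an assumption-laden illustration, not something from the answers above: the helper names are made up, files are assumed to be UTF-8, and only *.txt files directly inside the given directory are touched.

```python
import re
from pathlib import Path

# \A anchors at the very start of the text, as in the Notepad++ regex.
LEADING_BLANKS = re.compile(r'\A[\r\n]+')

def strip_leading_blank_lines(text):
    """Remove any run of CR/LF characters at the start of text."""
    return LEADING_BLANKS.sub('', text)

def clean_directory(directory):
    """Rewrite each *.txt file in directory without its leading blank lines
    (hypothetical helper; assumes UTF-8 files)."""
    for path in Path(directory).glob('*.txt'):
        # newline='' keeps the file's own line endings untouched.
        with open(path, newline='', encoding='utf-8') as f:
            original = f.read()
        cleaned = strip_leading_blank_lines(original)
        if cleaned != original:
            with open(path, 'w', newline='', encoding='utf-8') as f:
                f.write(cleaned)
```

Try it on a copy of the directory first, for the same reason given above: a few files may turn out to differ from what you expect.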
I am trying to parse some reStructuredText and want to be able to identify when the indent level has changed. So, I need to be able to see when an indent of 8 spaces has changed to an indent of 4 spaces (for example), so that I can change the color of that text block. Is there a way of using regular expressions to count the number of spaces in the indent and pick out the next line that contains a shallower indent?
Something like this will work:
/
^([ \t]*)\S.*\n   # Find a line with some amount of indentation
(?:\1\S.*\n)*     # Find more lines with the same indentation
^.*$              # This is the line you want
/xm               # x to ignore whitespace in the regex,
                  # m to have ^ and $ match at every line
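A different way to get at the same information, without a regex, is to count the leading spaces of each line and compare with the previous non-blank line. The sketch below is a swapped-in technique rather than the regex approach above; it assumes indentation uses spaces only, and the function name indent_decreases is made up.

```python
def indent_decreases(text):
    """Return (line_number, line) pairs for lines whose indent is shallower
    than the previous non-blank line (hypothetical helper)."""
    results = []
    previous = None
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines do not change the indent level
        indent = len(line) - len(line.lstrip(' '))
        if previous is not None and indent < previous:
            results.append((number, line))
        previous = indent
    return results

sample = (
    "        eight\n"
    "        eight again\n"
    "    four\n"
    "    four again\n"
    "zero\n"
)
print(indent_decreases(sample))
```

For the sample, lines 3 and 5 are reported, which are the points where the text block would change color.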