egrep search: every letter appears maximum 2 times - regex

Ok, I've got this issue where I have a list of md5's and a word next to it separated with a space, and I need to filter out some lines.
Example snippet:
...
F08A4C9263AD215D70B9C216F0B385CB wrapup
7B286E6F0615D64ACD4A7BC3578871DD wrath
8E35BA3D27A7730840EB1694386F69A0 wrathful
096762EA6790EDA22BF2369347FD53AC wreak
56AC6677205EB591A7173BADBB610F5C wreath
A85C0CB6C0367AF9D23442DF56EC9E1C wreathe
9E44AAE612306D44B91C4162DB5C26B7 wreck
6D9C795CBB3075DC1A482F6F78DC6D68 wreckage
BD907BC4DC65934D133BD5C472B78CC0 wrench
758C70E9B6F437D49D98D92E28E95939 wrest
81A4471F73DFDA0B534F58F4E1501FAB wrestle
94183CC7C7A66338DE89DB9C7460A8A2 wretch
AFEED5CE5BACCEC17EC54E68A97CCD0F wriggle
...
I need a regular expression for (e)grep that pulls out every line where every letter (so [A-F]) appears only 2 times maximum.
so an example for that would be:
4F2048B829C2834A23832F28928DE38E turtle
If anyone can help me with this i'd appreciate it very much!

You could use:
egrep -v "^\S*([A-F])\S*\1\S*\1" inputfile
That would list every line which does not include the letters A-F repeated three times in the same line.
EDIT: changed to avoid matching uppercase characters in the words...

you mentioned:
pulls out every line where every letter (so [A-F]) appears only 2
times maximum.
So my understanding is, the selected line should contain 0-2 [A-F]. Based on this, the following awk oneliner should do the job:
awk 'BEGIN{FS=""}{delete a;for(i=1;i<=NF;i++)if($i~/[A-F]/){a[$i]++;if(a[$i]>2)next}}1' file
Test
Note, the given input has NO line satisfies your requirement. So I added the 'turtle' line at the end:
kent$ echo "F08A4C9263AD215D70B9C216F0B385CB wrapup
7B286E6F0615D64ACD4A7BC3578871DD wrath
8E35BA3D27A7730840EB1694386F69A0 wrathful
096762EA6790EDA22BF2369347FD53AC wreak
56AC6677205EB591A7173BADBB610F5C wreath
A85C0CB6C0367AF9D23442DF56EC9E1C wreathe
9E44AAE612306D44B91C4162DB5C26B7 wreck
6D9C795CBB3075DC1A482F6F78DC6D68 wreckage
BD907BC4DC65934D133BD5C472B78CC0 wrench
758C70E9B6F437D49D98D92E28E95939 wrest
81A4471F73DFDA0B534F58F4E1501FAB wrestle
94183CC7C7A66338DE89DB9C7460A8A2 wretch
AFEED5CE5BACCEC17EC54E68A97CCD0F wriggle
4F2048B829C2834A23832F28928DE38E turtle"|awk 'BEGIN{FS=""}{delete a;for(i=1;i<=NF;i++)if($i~/[A-F]/){a[$i]++;if(a[$i]>2)next}}1'
4F2048B829C2834A23832F28928DE38E turtle

Related

Format a text file by regex match and replace

I have a text file that looks like the following:
Chanelle
Jettie
Winnie
Jen
Shella
Krysta
Tish
Monika
Lynwood
Danae
2649
2466
2890
2224
2829
2427
2816
2648
2833
2453
I need to make it look like this
Chanelle 2649
Jettie 2466
... ...
I tried a lot on sublime editor but couldn't figure out the regex to do that. Can somebody demonstrate if it can be done.
I tested the following in Notepad++ but it should work universally.
Use this as the search string:
(?:(\s+[A-Za-z]+)(\r?\n))((?:\s*[A-Za-z]*\r?\n)+)\s+(\d+)
and this as the replacement:
$1 $4$2$3
Running a replace with it once will do one line at a time, if you run it multiple times it'll continue to replace lines until there are no matching lines left.
Alternatively, you can use this as the replacement if you want to have the values aligned by tabs, but it's not going to match in all cases:
$1\t\t$4$2$3
While the regex answer by SeinopSys will work, you don't need a regex to do this - instead, you can take advantage of Sublime's multiple cursors.
Place your cursor at the beginning of line 1, then hold down Shift↓ to select all the names.
Hit CtrlShiftL (Selection -> Split into Lines) to split the selection into lines.
CtrlC to copy.
Place your cursor on line 11 (the first number line) and press CtrlShift↓ (Windows/OS X) or AltShift↓ (Linux) to place a cursor at the beginning of each number line.
Hit CtrlV to paste the names before the numbers.
You can now delete the names at the top and you're all set. Alternatively, you could use CtrlX to cut the names in step 3.

How do i delete first 2 lines which match with a text given by me ( using sed )?

How do i delete first 2 lines which match with a text given by me ( using sed ! )
E.g :
#file.txt contains following lines :
abc
def
def
abc
abc
def
And i want to delete first 2 "abc"
Using "sed"
While #EdMorton has pointed out that sed is not the best tool for this job (if you wonder why exactly, see my answer below and compare it to the awk code), my research showed that the solution to the generalized problem
Delete occurences "N" through "M" of a line matching a given pattern using sed
indeed is a very tricky one in my opinion. There seem to be many suggestions for how to replace the "N"th occurence of a matching pattern with sed, but I found that deleting a specific matching line (or a range of lines) is a much more complex undertaking.
While the generalized problem with arbitrary values for N, M, and the pattern would probably be solved best by writing a "sed script generator" on the basis of a Finite State Machine, the solution to the special case asked by the OP is still simple enough to be coded by hand. I must admit that I wasn't very familiar with the obfuscated intricacies of the sed command syntax before, but I found this challenge to be quite useful for gaining more experience with non-trivial sed usage.
Anyway, here's my solution for deleting the first two occurences of a line containing "abc" in a file. If there's a simpler approach, I'm eager to learn about it, as this has taken me some time now.
A final caveat: this assumes GNU sed, as I was unable to find a solution with POSIX sed:
sed -n ':1;/abc/{n;b2;};p;$b4;n;b1;:2;/abc/{n;b3;};p;$b4;n;b2;:3;p;$b4;n;b3;:4;q' file
or, in more verbose syntax:
sed -n '
# BEGIN - look for first match
:first;
/abc/ {
# First match found. Skip line and jump to second section
n; bsecond;
};
# Line does not match. Print it and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bfirst;
# END - look for first match
# BEGIN - look for second match
:second;
/abc/ {
# Second match found. Skip line and jump to final section
n; bfinal;
}
# Line does not match. Print it and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bsecond;
# END - look for second match
# BEGIN - both matches found; print remaining lines
:final;
# Print line and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bfinal;
# END - print remaining lines
# QUIT
:end;
q;
' file
sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk:
$ awk '!(/abc/ && ++c<3)' file
def
def
abc
def

Vim: regular expression to delete all lines except those starting with a given list of numbers

I have a csv file where every line but the first starts with a number and looks like this:
subject,parameter1,parameter2,parameter3
1,blah,blah,blah
3,blah,blah,blah
2,blah,blah,blah
44,blah,blah,blah
12,blah,blah,blah
14,blah,blah,blah
11,blah,blah,blah
10,blah,blah,blah
11,blah,blah,blah
13,blah,blah,blah
3,blah,blah,blah
...
I would like to delete all lines except the first that start, say, with the numbers 1,6,12.
I was trying something like this:
:g!/^[1 6 12]\|^subject/d
But the 12 is interpreted as "1 or 2" so this also erases the lines that start with 2..
What am I missing, and what should be the most efficient way to do this?
Btw instead of 1, 6, 12, my list contains many multiple single and 2-digit numbers.
The character class [1 6 12] means "any single character that is in this class,
i.e. any one of ' ', 1, 2, 6 (the repeated 1 is ignored).
You could use
:g!/^1,\|^6,\|^12,\|^subject/d
which is close to your original syntax - but it works (tested with vim on Mac OS X).
Note - it is important to include the comma, so that the line starting with 1 doesn't "protect" 11, 12345, etc.
You might want to do this differently though - using grep.
Put all the "white listed" numbers in a file, one per line, like so:
^subject
^1,
^2,
^6,
^12,
then do
grep -f whitelist csvFile
and the output will be your "edited" file (which you can pipe to a new file).
If you are even more interested in "efficiency", you could make your text file (let's continue to call it whitelist) just
subject
1
2
6
12
and use the following command:
cat whitelist | xargs -I {} grep "^"{}"," cvsFile
This needs a bit of explaining.
xargs - take the input one line at a time
-I {} - and insert that line in the command that follows, at the {}
This means that the grep command will be run n times (once per line in the whitelist file), and each time the regular expression that is fed into grep will be the concatenation of
"^" - start of line
{} - contents of one line of the input file (whitelist)
"," - comma that follows the number
So this is a compact way of writing
grep "^subject," csvFile; grep "^1," csvFile; grep "^2," csvFile;
etc.
It has the advantage that you can now generate your whitelist any way you want - as long as it ends up in a file, one line at a time, you can use it; the disadvantage is that you are essentially running grep n times. If your files get very large, and you have a large number of items in your white list, that may start to be a problem; but since your OS is likely to put the file into cache after the first read-through, it is really quite fast. The use of the ^ anchor makes the regular expression very efficient - as soon as it doesn't find a match it goes on to the next line.
Use a global match:
:v/^\(subject\|1\|6\|12\),/ delete
For every line that does not match that regular expression, delete it.
It yields:
subject,parameter1,parameter2,parameter3
1,blah,blah,blah
12,blah,blah,blah
EDIT: Just now I realised that you were already using the global match. You error was in the character class. It matches any character inside it regardless of repeated ones, in your case numbers one, two, six and a space. You must separate them in different branches, like I did before.
a "functional" alternative:
:g/./if index([1,12,6],str2nr(split(getline("."),",")[0]))<0|exec 'normal! dd'|endif

Regular Expression over multiple lines

I'm stuck with this for several hours now and cycled through a wealth of different tools to get the job done. Without success. It would be fantastic, if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400mb+) that is not formatted correctly. Right now it looks something like this:
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
As you can probably see the titles ",Title1" and ",Title2" should actually be on the same line as the foregoing sentence. Then it would look something like this:
This is a long abstract describing something. What follows is the tile for this sentence.",Title1
This is another sentence that is running on one line. On the next line you can find the title.,Title2
Please note that the end of the sentence can contain quotes or not. In the end they should be replaced too.
Here is what I came up with so far:
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
This should actually get the job done of matching the expression over multiple lines. Unfortunately it doesn't :)
The expression is looking for the dot at the end of the sentence and the optional quotes plus a newline character that I'm trying to match with .*.
Help much appreciated. And it doesn't really matter what tool gets the job done (awk, perl, sed, tr, etc.).
Multiline in sed isn't necessarily tricky per se, it's just that it uses commands most people aren't familiar with and have certain side effects, like delimiting the current line from the next line with a '\n' when you use 'N' to append the next line to the pattern space.
Anyway, it's much easier if you match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:
sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
Input
$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line
Output
$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
Yours works with a couple of small changes:
sed -n '1h;1!H;${;g;s/\."\?\n,//g;p;}' inputfile
The ? needs to be escaped and . doesn't match newlines.
Here's another way to do it which doesn't require using the hold space:
sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D' inputfile
Here is a commented version:
sed -n '
$ # for the last input line
{
p; # print
q # and quit
};
N; # otherwise, append the next line
/\n,/ # if it starts with a comma
{
s/"\?\n//p; # delete an optional comma and the newline and print the result
b # branch to the end to read the next line
};
P; # it doesn't start with a comma so print it
D # delete the first line of the pair (it's just been printed) and loop to the top
' inputfile

what does this sed commands does? please explain its bits and pieces

Please explain this sed command?
sed -n "s/[^>]*>/ /gp"
What is gp?
It looks for non-greater-than characters preceding a greater-than symbol, and changes all of them to a single space. Thus, it will convert this input (where I've used _ to indicate a space):
foo>_bar> b
x>>_a
to
___b
___a
As Mark notes, "g" means global, and "p" means "print the line".
g means global: i.e. replace all occurences, not just the first.
p means to print the modified line. Otherwise due to the -n switch it would not be printed.
The command finds all lines containing at least one > and prints some spaces followed by the text after the final >. The number of spaces printed is the number of > in the line.
For example if this line is in the input file:
123>456>789
Then this is printed:
789
I was typing up a long explanation, but Brian beat me to it. To clarify a tiny bit, the "p" prints the modified / matching line. The "-n" in your command line tells sed to "not print the file". Combined with the "p", it works kinda like grep, but within the scope of the script (ie, anything it changes/matches).