What does this sed command do? Please explain its bits and pieces (regex)

Please explain this sed command?
sed -n "s/[^>]*>/ /gp"
What is gp?

It looks for non-greater-than characters preceding a greater-than symbol, and changes all of them to a single space. Thus, it will convert this input (where I've used _ to indicate a space):
foo>_bar>_b
x>>_a
to
___b
___a
As Mark notes, "g" means global, and "p" means "print the line".

g means global: i.e. replace all occurrences, not just the first.
p means to print the modified line. Otherwise due to the -n switch it would not be printed.
The command finds all lines containing at least one > and prints one space for each > in the line, followed by the text after the final >.
For example if this line is in the input file:
123>456>789
Then this is printed (with two leading spaces, one for each >):
  789

I was typing up a long explanation, but Brian beat me to it. To clarify a tiny bit, the "p" prints the line only if the substitution actually changed something on it. The "-n" in your command line tells sed not to print every line automatically. Combined with the "p", it works kinda like grep, but within the scope of the script (i.e., only the lines it changes/matches are shown).
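To see the substitution at work, here is a minimal sketch that feeds the two example lines (plus one line without a >) through the same sed command. The function name show_matches is made up for this illustration, and tr is only there to turn the resulting spaces into underscores so they are visible.

```shell
# show_matches is a hypothetical helper name, not part of the original question.
show_matches() {
  # sed replaces each "run of non-> characters plus >" by one space;
  # tr then makes those spaces visible as underscores.
  printf '%s\n' 'foo> bar> b' 'x>> a' 'no angle bracket here' |
    sed -n 's/[^>]*>/ /gp' |
    tr ' ' '_'
}

show_matches
# prints:
# ___b
# ___a
```

The third input line contains no >, so no substitution happens on it and, because of -n, it is not printed at all.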

Edit CSV rows in two different ways

I have a bash script that outputs two CSV columns. I need to prepend the three-digit number of those rows of the second column that contain them with "f. " and keep the rest of the rows intact. I have tried different ways so far but each has failed in one way or another.
What I've tried mainly has been to use regular expressions with either the first or second column to separate the desired rows from the rest, but I can't separate and prepend at the same time without cancelling out or messing up the process somehow. Some of the commands I've used so far have been sed and cut, as well as (nested) for loops, read-while loops, if/else and if/elif/else statements, etc. What follows is one such (failed) solution:
for var1 in "^.*_[^f]_.*"
do
sed -i "" "s:$MSname::" $pathToCSV"_final.csv"
for var2 in "^.*_f_.*"
do
sed -i "" "s:$MSname:f.:" $pathToCSV"_final.csv"
done
done
And these are some sample rows:
abc_deg0014_0001_a_1.tif,British Library 1 Front Board Outside
abc_deg0014_0002_b_000.tif,British Library 1 Front Board Inside
abc_deg0014_0003_f_001r.tif,British Library 1 001r
abc_deg0014_0004_f_001v.tif,British Library 1 001v
…
abc_deg0014_0267_f_132r.tif,British Library 1 132r
abc_deg0014_0268_f_132v.tif,British Library 1 132v
abc_deg0014_0269_y_999.tif,British Library 1 Back Board Inside
abc_deg0014_0270_z_1.tif,British Library 1 Back Board Outside
Here $MSname = British Library 1 (since with different CSVs the "British Library 1" part can change to other words that I need to remove/replace and that's why I use parameter expansion).
The desired result:
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
…
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
If you look closely, you'll notice these rows are also differentiated from the rest by "f" in their first column (the rows that shouldn't get the "f. " in front of their second column are differentiated by "a", "b", "y", and "z", respectively, in the first column).
You are not using var1 or var2 for anything, and even if you did, looping over variables and repeatedly running sed -i on the same output file is extremely wasteful. Ideally, you would like to write all the modifications into a single sed script, and process the file only once.
Without being able to guess what other strings than "British Library 1" you have and whether those require different kinds of actions, I would suggest something along the lines of
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/' "${pathToCSV}_final.csv"
Notice how the sed script in single quotes can be wrapped over multiple physical lines. The first line finds any lines where the text between the last pair of underscores in the first comma-separated column is f, and replaces ",British Library 1 " with ",f. ". (I made some adjustments to the spacing here -- I hope they make sense for you.) On the following line, we simply replace any (remaining) occurrences of ",British Library 1 " with just a comma; the idea is that only the lines which didn't match the regex on the previous line will still contain this string, and so we don't have to do another regex match. (With BSD/macOS sed, as in your attempt, keep the empty backup suffix and write sed -i "" instead of sed -i.)
This can easily be extended to cover more patterns in the same sed script, rather than repeatedly looping over the file and rewriting one pattern at a time. For example, if your next task is to replace Windsor Palace A with either a. or nothing depending on whether the penultimate underscore-separated subfield in the first field contains a, that should be obvious enough:
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/
/^[^,]*_a_[^,_]*,/s/,Windsor Palace A /,a. /
s/,Windsor Palace A /,/' "${pathToCSV}_final.csv"
In some more detail, the regex says
^ beginning of line
[^,]* any sequence of characters which are not a comma
_f_ literal characters underscore, f, underscore
[^,_]* any sequence of characters which are not a comma or an underscore
, literal comma
You should be able to see that this will target the last pair of underscores in the first column. It's important to never skip across the first comma, and near the end, not allow any underscores after the ones we specifically target before we finally allow the comma column delimiter.
Finally, also notice how we always use double quotes around variables which contain file names. There are scenarios where you can avoid this but you have to know what you are doing; the easy and straightforward rule of thumb is to always put double quotes around variables. For the full scoop, see When to wrap quotes around a shell variable?
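As a quick check of the two-rule script, and of the quoting advice, here is a sketch that runs it on two of the sample rows without -i, so nothing is modified on disk. The function names sample_rows and strip_msname are hypothetical, and passing the manuscript name in through $MSname assumes it contains no slashes or regex metacharacters. The script is in double quotes here only so that the shell expands $MSname.

```shell
# Hypothetical helper: prints two rows copied from the question.
sample_rows() {
  printf '%s\n' \
    'abc_deg0014_0002_b_000.tif,British Library 1 Front Board Inside' \
    'abc_deg0014_0003_f_001r.tif,British Library 1 001r'
}

# Hypothetical helper: $1 is the manuscript name to strip or replace.
# Assumption: the name contains no "/" and no regex metacharacters.
strip_msname() {
  MSname=$1
  sed -e "/^[^,]*_f_[^,_]*,/s/,$MSname /,f. /" \
      -e "s/,$MSname /,/"
}

sample_rows | strip_msname 'British Library 1'
# prints:
# abc_deg0014_0002_b_000.tif,Front Board Inside
# abc_deg0014_0003_f_001r.tif,f. 001r
```

Once the output looks right, the same two -e expressions can be used with -i on the real file.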
With awk, you can look at the fifth field to see whether it matches "three digits plus one letter", then print it with f. in this case and just drop fields 2, 3 and 4 in the other case. For example:
awk -F'[, ]' '{
if($5 ~ /.?[[:digit:]]{3}[a-z]$/) {
printf("%s,f. %s\n",$1,$5)}
else {
printf("%s,%s %s %s\n",$1,$5,$6,$7)
}
}' test.txt
On the example you provide, it gives:
abc_deg0014_0001_a_1.tif,Front Board Outside
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
abc_deg0014_0004_f_001v.tif,f. 001v
abc_deg0014_0267_f_132r.tif,f. 132r
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
abc_deg0014_0270_z_1.tif,Back Board Outside

Vim: regular expression to delete all lines except those starting with a given list of numbers

I have a csv file where every line but the first starts with a number and looks like this:
subject,parameter1,parameter2,parameter3
1,blah,blah,blah
3,blah,blah,blah
2,blah,blah,blah
44,blah,blah,blah
12,blah,blah,blah
14,blah,blah,blah
11,blah,blah,blah
10,blah,blah,blah
11,blah,blah,blah
13,blah,blah,blah
3,blah,blah,blah
...
I would like to delete all lines except the first (header) line and those that start with, say, the numbers 1, 6 or 12.
I was trying something like this:
:g!/^[1 6 12]\|^subject/d
But the 12 is interpreted as "1 or 2", so this also keeps the lines that start with 2.
What am I missing, and what would be the most efficient way to do this?
Btw instead of 1, 6, 12, my real list contains many single- and two-digit numbers.
The character class [1 6 12] means "any single character that is in this class",
i.e. any one of ' ', 1, 2, 6 (the repeated 1 is ignored).
You could use
:g!/^1,\|^6,\|^12,\|^subject/d
which is close to your original syntax - but it works (tested with vim on Mac OS X).
Note - it is important to include the comma, so that the line starting with 1 doesn't "protect" 11, 12345, etc.
You might want to do this differently though - using grep.
Put all the "white listed" numbers in a file, one per line, like so:
^subject
^1,
^2,
^6,
^12,
then do
grep -f whitelist csvFile
and the output will be your "edited" file (which you can pipe to a new file).
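A minimal sketch of the whitelist approach, using made-up sample data in a temporary directory (the paths and the variable names tmpdir and kept are illustrative only, and the whitelist here uses the numbers 1, 6 and 12 from the question):

```shell
tmpdir=$(mktemp -d)

# A few rows in the shape of the CSV from the question.
printf '%s\n' 'subject,parameter1,parameter2,parameter3' \
  '1,blah,blah,blah' '3,blah,blah,blah' '2,blah,blah,blah' \
  '12,blah,blah,blah' '11,blah,blah,blah' > "$tmpdir/csvFile"

# One anchored pattern per line; the trailing comma keeps "1," from matching 11 or 12.
printf '%s\n' '^subject' '^1,' '^6,' '^12,' > "$tmpdir/whitelist"

kept=$(grep -f "$tmpdir/whitelist" "$tmpdir/csvFile")
printf '%s\n' "$kept"
# prints:
# subject,parameter1,parameter2,parameter3
# 1,blah,blah,blah
# 12,blah,blah,blah

rm -rf "$tmpdir"
```

The file is read once and the surviving lines come out in their original order.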
If you are even more interested in "efficiency", you could make your text file (let's continue to call it whitelist) just
subject
1
2
6
12
and use the following command:
cat whitelist | xargs -I {} grep "^"{}"," csvFile
This needs a bit of explaining.
xargs - take the input one line at a time
-I {} - and insert that line in the command that follows, at the {}
This means that the grep command will be run n times (once per line in the whitelist file), and each time the regular expression that is fed into grep will be the concatenation of
"^" - start of line
{} - contents of one line of the input file (whitelist)
"," - comma that follows the number
So this is a compact way of writing
grep "^subject," csvFile; grep "^1," csvFile; grep "^2," csvFile;
etc.
It has the advantage that you can now generate your whitelist any way you want - as long as it ends up in a file, one line at a time, you can use it; the disadvantage is that you are essentially running grep n times. If your files get very large, and you have a large number of items in your white list, that may start to be a problem; but since your OS is likely to put the file into cache after the first read-through, it is really quite fast. The use of the ^ anchor makes the regular expression very efficient - as soon as it doesn't find a match it goes on to the next line.
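A small sketch of the xargs variant, again with made-up sample data in a temporary directory (file and variable names are illustrative only). One thing it makes visible: the output follows the order of the whitelist, not the order of the lines in the CSV file.

```shell
tmpdir=$(mktemp -d)

printf '%s\n' 'subject,parameter1,parameter2,parameter3' \
  '1,blah,blah,blah' '3,blah,blah,blah' '12,blah,blah,blah' > "$tmpdir/csvFile"

# Plain values, one per line; every entry here occurs in the sample file.
printf '%s\n' 'subject' '12' '1' > "$tmpdir/whitelist"

# xargs runs grep once per whitelist line, substituting the line for {}.
by_whitelist=$(xargs -I {} grep "^{}," "$tmpdir/csvFile" < "$tmpdir/whitelist")
printf '%s\n' "$by_whitelist"
# prints:
# subject,parameter1,parameter2,parameter3
# 12,blah,blah,blah
# 1,blah,blah,blah

rm -rf "$tmpdir"
```

If the original line order matters, the grep -f form is the safer choice.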
Use a global match:
:v/^\(subject\|1\|6\|12\),/ delete
For every line that does not match that regular expression, delete it.
It yields:
subject,parameter1,parameter2,parameter3
1,blah,blah,blah
12,blah,blah,blah
EDIT: Just now I realised that you were already using the global match. Your error was in the character class. It matches any single character listed inside it, regardless of repeats; in your case the digits one, two, six and a space. You must separate the numbers into different branches, like I did above.
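A "functional" alternative (as written it also deletes the header line, because str2nr("subject") is 0, which is not in the list):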
:g/./if index([1,12,6],str2nr(split(getline("."),",")[0]))<0|exec 'normal! dd'|endif

egrep search: every letter appears maximum 2 times

Ok, I've got this issue where I have a list of md5's and a word next to it separated with a space, and I need to filter out some lines.
Example snippet:
...
F08A4C9263AD215D70B9C216F0B385CB wrapup
7B286E6F0615D64ACD4A7BC3578871DD wrath
8E35BA3D27A7730840EB1694386F69A0 wrathful
096762EA6790EDA22BF2369347FD53AC wreak
56AC6677205EB591A7173BADBB610F5C wreath
A85C0CB6C0367AF9D23442DF56EC9E1C wreathe
9E44AAE612306D44B91C4162DB5C26B7 wreck
6D9C795CBB3075DC1A482F6F78DC6D68 wreckage
BD907BC4DC65934D133BD5C472B78CC0 wrench
758C70E9B6F437D49D98D92E28E95939 wrest
81A4471F73DFDA0B534F58F4E1501FAB wrestle
94183CC7C7A66338DE89DB9C7460A8A2 wretch
AFEED5CE5BACCEC17EC54E68A97CCD0F wriggle
...
I need a regular expression for (e)grep that pulls out every line where every letter (so [A-F]) appears only 2 times maximum.
so an example for that would be:
4F2048B829C2834A23832F28928DE38E turtle
If anyone can help me with this i'd appreciate it very much!
You could use:
egrep -v "^\S*([A-F])\S*\1\S*\1" inputfile
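That lists every line in which no letter A-F occurs three or more times in the hash, i.e. the first whitespace-free field. (\S and back-references in extended regular expressions are GNU grep extensions.)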
EDIT: changed to avoid matching uppercase characters in the words...
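Here is a small sketch of the same idea on two of the lines from the question. It is a variation rather than the exact command above: it uses grep -E instead of egrep and [^ ] instead of \S, and the function name at_most_twice is made up for this example. Back-references in extended regular expressions are still an extension, so this assumes a grep that supports them (GNU grep does).

```shell
# at_most_twice is a hypothetical name: it keeps lines whose hash contains
# no letter A-F three or more times.
at_most_twice() {
  grep -Ev '^[^ ]*([A-F])[^ ]*\1[^ ]*\1'
}

printf '%s\n' \
  'F08A4C9263AD215D70B9C216F0B385CB wrapup' \
  '4F2048B829C2834A23832F28928DE38E turtle' |
  at_most_twice
# prints:
# 4F2048B829C2834A23832F28928DE38E turtle
```

The first hash contains C (and B) three times, so -v drops it; in the second one no letter occurs more than twice.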
you mentioned:
pulls out every line where every letter (so [A-F]) appears only 2
times maximum.
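So my understanding is that in a selected line each letter in [A-F] may appear 0-2 times. Based on this, the following awk one-liner should do the job (FS="" splits the line into single characters; this works in gawk but is not guaranteed by POSIX):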
awk 'BEGIN{FS=""}{delete a;for(i=1;i<=NF;i++)if($i~/[A-F]/){a[$i]++;if(a[$i]>2)next}}1' file
Test
Note, the given input has NO line that satisfies your requirement. So I added the 'turtle' line at the end:
kent$ echo "F08A4C9263AD215D70B9C216F0B385CB wrapup
7B286E6F0615D64ACD4A7BC3578871DD wrath
8E35BA3D27A7730840EB1694386F69A0 wrathful
096762EA6790EDA22BF2369347FD53AC wreak
56AC6677205EB591A7173BADBB610F5C wreath
A85C0CB6C0367AF9D23442DF56EC9E1C wreathe
9E44AAE612306D44B91C4162DB5C26B7 wreck
6D9C795CBB3075DC1A482F6F78DC6D68 wreckage
BD907BC4DC65934D133BD5C472B78CC0 wrench
758C70E9B6F437D49D98D92E28E95939 wrest
81A4471F73DFDA0B534F58F4E1501FAB wrestle
94183CC7C7A66338DE89DB9C7460A8A2 wretch
AFEED5CE5BACCEC17EC54E68A97CCD0F wriggle
4F2048B829C2834A23832F28928DE38E turtle"|awk 'BEGIN{FS=""}{delete a;for(i=1;i<=NF;i++)if($i~/[A-F]/){a[$i]++;if(a[$i]>2)next}}1'
4F2048B829C2834A23832F28928DE38E turtle

What would be the best approach to this substitution in Vim?

A several line document has a header/title section and then about 10 listings under each. I need to put the header/title info in with each of the listings so that they can be properly uploaded into a website (using comma and pipe delimiters). It looks like this:
SectionName1 and TitleName1
1111 - The SubSectionName A
222 - The SubSectionName B
3333 - The SubSectionName C
SectionName2 and TitleName2
444 - The SubSectionName D
55555 - The SubSectionName E
66 - The SubSectionName F
Repeating several hundred times. What I need is to produce something like:
SectionName1,TitleName1,1111,SubSectionNameA
SectionName1,TitleName1,222,SubSectionNameB
SectionName1,TitleName1,3333,SubSectionNameC
SectionName2,TitleName2,444,SubSectionNameD
SectionName2,TitleName2,55555,SubSectionNameE
SectionName2,TitleName2,66,SubSectionNameF
I realize there can be multiple approaches to this solution, but I'm having a difficult time pulling the trigger on any one method. I understand submatches, joins and getline but I am not good at practical use of them in this scenario.
Any help to get me mentally started would be greatly appreciated.
Let me propose the following quite general Ex command solving the
issue (see footnote 1).
:g/^\s*\h/d|let@"=substitute(@"[:-2],'\s\+and\s\+',',','')|ki|/\n\s*\h\|\%$/kj|
\ 'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=@".','.submatch(1).','/|'i,'js/\s\+//g
At the top level, this is the :global command that enumerates the lines
starting with zero or more whitespace characters followed by a Latin letter or
an underscore (see :help /\h). The lines matching this pattern are supposed
to be the header lines containing section and title names. The rest of the
command, after the pattern describing the header lines, are instructions to be
executed for each of those lines.
The actions to be performed on the headers can be divided into three steps.
Delete the current header line, at the same time extracting section
and title names from it.
:d|let@"=substitute(@"[:-2],'\s\+and\s\+',',','')
First, remove the current line, saving it into the unnamed register,
using the :delete command. Then, update the contents of that
register (referred to as @"; see :help @r and :help "") to be the
result of the substitution changing the word and surrounded by
whitespace characters, to a single comma. The actual replacement is
carried out by the substitute() function.
However, the input is not the exact string containing the whole header
line, but its prefix leaving out the last character, which is
a newline symbol. The [:-2] notation is a short form of the
[0:-2] subscript expression that designates the substring from the
very first byte to the second one counting from the end (see :help
expr-[:]). This way, the unnamed register holds the section and the
title names separated by comma.
Determine the range of dependent subsection lines.
:ki|/\n\s*\h\|\%$/kj
After the first step, the subsection records belonging to the just
parsed header line are located starting from the current line (the one
following the header) until the next header line or, if there is no
such line below, the end of buffer. The numbers of these lines are
stored in the marks i and j, respectively. (See :helpg ^A mark
is for description of marks.)
The marks are placed using the :k command that sets a specified mark
at the last line of a given range which is the current line, by
default. So, unlike the first line of the considered block, the last
one requires a specific line range to point out its location.
A particular form of range, denoting the next line where a given
pattern matches, is used in this case (see :help :range). The
pattern defining the location of the line to be found, is composed in
such a way that it matches a line immediately preceding a header (a
line starting with possible whitespace followed by an alphabetical
character), or the very last line. (See :help pattern for details
about syntax of Vim regular expressions.)
Transform the delineated subsection lines according to desired format,
prepending section and title names found in the corresponding header
line.
:'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=@".','.submatch(1).','/|'i,'js/\s\+//g
This step comprises two :substitute commands that are run
over the range of lines delimited by the locations labelled by the
marks i and j (see :help [range]).
The first substitution command matches the beginning of a subsection
line (an identifier followed by a hyphen and the word The, all
surrounded by whitespace) and replaces it with the contents of the
unnamed register, holding the section and title names concatenated
with a comma, the matched identifier, and another comma. The second
substitution finalizes the transformation by squeezing all whitespace
characters on the line to gum the subsection name and the following
letter together.
To construct the replacement string in the first :substitute
command, the substitute-with-an-expression feature is used (see :help
sub-replace-\=). The substitution part of the command should start
with \= for Vim to interpret the remaining text not in a regular
way, but as an expression (see :help expression). The result of
that expression's evaluation becomes the substitution string. Note
the use of the submatch() function in the substitute expression to
retrieve the text of a submatch by its number.
1 The command is wrapped for better readability, its one-line
version is listed below for ease of copy-pasting into Vim command line. Note
that the wrapped command can be used in a Vim script without any change.
:g/^\s*\h/d|let@"=substitute(@"[:-2],'\s\+and\s\+',',','')|ki|/\n\s*\h\|\%$/kj|'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=@".','.submatch(1).','/|'i,'js/\s\+//g
Simplest/fastest way I can think of is a simple macro. Do once, rinse, repeat.
Assuming your cursor is initially on the first character of the first line (S of SectionName), this macro should work as long as the document is exactly in the same format as posted above.
f ctT,<Esc>yyjpjjpjddkkkddkkkJr,f ctS,<Esc>f xjJr,f ctS,f xjJr,f ctS,<Esc>f xjdd
Well, I think the question is not that clear. Why, in your demo input, was the text after the "-" like this:
55555 - The SubSectionName E
but in your expected output it turned into:
55555,SubSectionNameE
All spaces were removed, which is OK, but why was "The" removed as well? Is there any pattern for "The"?
I wrote an awk one-liner; it removes all spaces in the output but leaves the "The" there. You can change it to get exactly the output you need.
awk -F' and ' -vOFS="," 'NF>1{s=$1;t=$2;next;}$1{gsub(/\s+/,"");gsub(/-/,",");print s,t,$0} ' input
test on your example input:
kent$ cat v
SectionName1 and TitleName1
1111 - The SubSectionName A
222 - The SubSectionName B
3333 - The SubSectionName C
SectionName2 and TitleName2
444 - The SubSectionName D
55555 - The SubSectionName E
66 - The SubSectionName F
kent$ awk -F' and ' -vOFS="," 'NF>1{s=$1;t=$2;next;}$1{gsub(/\s+/,"");gsub(/-/,",");print s,t,$0} ' v
SectionName1,TitleName1,1111,TheSubSectionNameA
SectionName1,TitleName1,222,TheSubSectionNameB
SectionName1,TitleName1,3333,TheSubSectionNameC
SectionName2,TitleName2,444,TheSubSectionNameD
SectionName2,TitleName2,55555,TheSubSectionNameE
SectionName2,TitleName2,66,TheSubSectionNameF
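If the literal word "The" should always be dropped, as in the expected output from the question, the one-liner can be adapted along these lines. This is only a sketch: the function name merge_headers is made up, and it assumes that header lines contain " and ", that subsection lines always look like "<number> - The <name>", and that names contain neither a hyphen nor " and ". It avoids \s so that it does not depend on gawk.

```shell
# merge_headers is a hypothetical name; see the assumptions in the text above.
merge_headers() {
  awk -F' and ' -v OFS=',' '
    NF > 1 { section = $1; title = $2; next }     # header: remember both names
    $0 != "" {
      id = $0;   sub(/ *-.*/, "", id)             # keep the number before the hyphen
      name = $0; sub(/^[^-]*- *The */, "", name)  # keep the text after "- The"
      gsub(/ /, "", id); gsub(/ /, "", name)      # squeeze out remaining spaces
      print section, title, id, name
    }'
}

printf '%s\n' 'SectionName1 and TitleName1' \
  '1111 - The SubSectionName A' \
  '222 - The SubSectionName B' |
  merge_headers
# prints:
# SectionName1,TitleName1,1111,SubSectionNameA
# SectionName1,TitleName1,222,SubSectionNameB
```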

Regular Expression over multiple lines

I'm stuck with this for several hours now and cycled through a wealth of different tools to get the job done. Without success. It would be fantastic, if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400mb+) that is not formatted correctly. Right now it looks something like this:
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
As you can probably see the titles ",Title1" and ",Title2" should actually be on the same line as the foregoing sentence. Then it would look something like this:
This is a long abstract describing something. What follows is the tile for this sentence.",Title1
This is another sentence that is running on one line. On the next line you can find the title.,Title2
Please note that the end of the sentence can contain quotes or not. In the end they should be replaced too.
Here is what I came up with so far:
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
This should actually get the job done of matching the expression over multiple lines. Unfortunately it doesn't :)
The expression is looking for the dot at the end of the sentence and the optional quotes plus a newline character that I'm trying to match with .*.
Help much appreciated. And it doesn't really matter what tool gets the job done (awk, perl, sed, tr, etc.).
Multiline in sed isn't necessarily tricky per se; it's just that it uses commands most people aren't familiar with, and those commands have certain side effects, like separating the current line from the next line with a '\n' when you use 'N' to append the next line to the pattern space.
Anyway, it's much easier if you match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:
sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
Input
$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line
Output
$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
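Yours works with a couple of small changes:
sed -n '1h;1!H;${;g;s/"\?\n,/,/g;p;}' inputfile
The ? needs to be escaped (\? is the GNU spelling in basic regular expressions), and the newline has to be matched explicitly with \n: .* is greedy, so it would swallow everything up to the last comma in the buffer. The replacement also has to put the comma back, and the \. is dropped from the pattern so that the full stop survives. Keep in mind that this version holds the whole 400 MB file in memory.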
Here's another way to do it which doesn't require using the hold space:
sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D' inputfile
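Here is a commented version (meant for reading; some sed implementations reject a comment placed directly after an address, so run the one-liner above):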
sed -n '
$ # for the last input line
{
p; # print
q # and quit
};
N; # otherwise, append the next line
/\n,/ # if the appended line starts with a comma
{
s/"\?\n//p; # delete an optional quote and the newline and print the result
b # branch to the end to read the next line
};
P; # it doesn't start with a comma so print it
D # delete the first line of the pair (it's just been printed) and loop to the top
' inputfile
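Since the question says any tool will do, here is a sketch of the same join in awk, which avoids the GNU-only \? and only ever keeps one line in memory. The function name join_titles is made up for this example. Like the sed answers above, it drops an optional trailing quote (and trailing spaces) before gluing on a line that starts with a comma.

```shell
# join_titles is a hypothetical name. It delays printing by one line so that
# a following line starting with "," can still be appended to it.
join_titles() {
  awk '
    NR > 1 && /^,/ { sub(/"? *$/, "", prev); prev = prev $0; next }
    NR > 1         { print prev }
                   { prev = $0 }
    END            { if (NR > 0) print prev }'
}

printf '%s\n' 'keep this line' 'An abstract."' ',Title1' \
  'Another abstract.' ',Title2' |
  join_titles
# prints:
# keep this line
# An abstract.,Title1
# Another abstract.,Title2
```

For the real file this would be run as join_titles < out.csv > out1.csv.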