Edit CSV rows in two different ways - regex

I have a bash script that outputs two CSV columns. In the second column, I need to prepend "f. " to the rows that contain a three-digit number, and keep the rest of the rows intact. I have tried different approaches so far, but each has failed in one way or another.
Mainly, I've tried using regular expressions against either the first or second column to separate the desired rows from the rest, but I can't separate and prepend at the same time without cancelling out or messing up the process somehow. Some of the tools I've used so far are sed and cut, as well as (nested) for loops, while-read loops, and if/else and if/elif/else statements. What follows is one such (failed) solution:
for var1 in "^.*_[^f]_.*"
do
    sed -i "" "s:$MSname::" $pathToCSV"_final.csv"
    for var2 in "^.*_f_.*"
    do
        sed -i "" "s:$MSname:f.:" $pathToCSV"_final.csv"
    done
done
And these are some sample rows:
abc_deg0014_0001_a_1.tif,British Library 1 Front Board Outside
abc_deg0014_0002_b_000.tif,British Library 1 Front Board Inside
abc_deg0014_0003_f_001r.tif,British Library 1 001r
abc_deg0014_0004_f_001v.tif,British Library 1 001v
…
abc_deg0014_0267_f_132r.tif,British Library 1 132r
abc_deg0014_0268_f_132v.tif,British Library 1 132v
abc_deg0014_0269_y_999.tif,British Library 1 Back Board Inside
abc_deg0014_0270_z_1.tif,British Library 1 Back Board Outside
Here $MSname = British Library 1 (with different CSVs, the "British Library 1" part can change to other words that I need to remove/replace, which is why I use parameter expansion).
The desired result:
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
…
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
If you look closely, you'll notice these rows are also differentiated from the rest by "f" in their first column (the rows that shouldn't get the "f. " in front of their second column are differentiated by "a", "b", "y", and "z", respectively, in the first column).

You are not using var1 or var2 for anything, and even if you did, looping over variables and repeatedly running sed -i on the same output file is extremely wasteful. Ideally, you would like to write all the modifications into a single sed script, and process the file only once.
Without being able to guess what other strings than "British Library 1" you have and whether those require different kinds of actions, I would suggest something along the lines of
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/' "${pathToCSV}_final.csv"
Notice how the sed script in single quotes can be wrapped over multiple physical lines. The first line finds any lines where the text between the last pair of underscores in the first comma-separated column is f, and replaces ",British Library 1 " with ",f. ". (I made some adjustments to the spacing here; I hope they make sense for you.) On the following line, we simply replace any (remaining) occurrences of ",British Library 1 " with just a comma; the idea is that only the lines which didn't match the regex on the previous line will still contain this string, and so we don't have to do another regex match.
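If you want to preview the result before overwriting anything, you can run the same script without -i first (this is just the non-destructive form of the command above):
sed '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/' "${pathToCSV}_final.csv" | head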
This can easily be extended to cover more patterns in the same sed script, rather than repeatedly looping over the file and rewriting one pattern at a time. For example, if your next task is to replace Windsor Palace A with either a. or nothing depending on whether the penultimate underscore-separated subfield in the first field contains a, that should be obvious enough:
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/
/^[^,]*_a_[^,_]*,/s/,Windsor Palace A /,a. /
s/,Windsor Palace A /,/' "${pathToCSV}_final.csv"
In some more detail, the regex says
^ beginning of line
[^,]* any sequence of characters which are not a comma
_f_ literal characters underscore, f, underscore
[^,_]* any sequence of characters which are not a comma or an underscore
, literal comma
You should be able to see that this will target the last pair of underscores in the first column. It's important to never skip across the first comma, and near the end, not allow any underscores after the ones we specifically target before we finally allow the comma column delimiter.
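To convince yourself which lines the address part selects before running the substitution, the same pattern can be tested in isolation with grep (a quick check, not part of the solution):
grep '^[^,]*_f_[^,_]*,' "${pathToCSV}_final.csv"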
Finally, also notice how we always use double quotes around variables which contain file names. There are scenarios where you can avoid this but you have to know what you are doing; the easy and straightforward rule of thumb is to always put double quotes around variables. For the full scoop, see When to wrap quotes around a shell variable?
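For instance, with a purely hypothetical path that contains spaces:
pathToCSV='/data/My Scans/batch 1'            # hypothetical path, just for illustration
sed -i "" 's/x/y/' ${pathToCSV}_final.csv     # unquoted: splits into /data/My, Scans/batch, 1_final.csv
sed -i "" 's/x/y/' "${pathToCSV}_final.csv"   # quoted: one file name, as intended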

With awk, you can look at the fifth field to see whether it matches "3 digits + 1 letter"; print it prefixed with f. in that case, and just drop fields 2, 3 and 4 otherwise. For example:
awk -F'[, ]' '{
    if ($5 ~ /.?[[:digit:]]{3}[a-z]$/) {
        printf("%s,f. %s\n", $1, $5)
    }
    else {
        printf("%s,%s %s %s\n", $1, $5, $6, $7)
    }
}' test.txt
On the example you provide, it gives:
abc_deg0014_0001_a_1.tif,Front Board Outside
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
abc_deg0014_0004_f_001v.tif,f. 001v
abc_deg0014_0267_f_132r.tif,f. 132r
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
abc_deg0014_0270_z_1.tif,Back Board Outside
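If the description part won't always be exactly three words plus a final token, a variant that splits only on the comma and strips the first three space-separated words from the second field is a bit more robust; a sketch, assuming the label is always exactly three words (like "British Library 1"):
awk -F',' -v OFS=',' '{
    if ($1 ~ /_f_[^_,]*$/)
        sub(/^[^ ]+ [^ ]+ [^ ]+ /, "f. ", $2)   # f-rows: swap the three-word label for "f. "
    else
        sub(/^[^ ]+ [^ ]+ [^ ]+ /, "", $2)      # other rows: drop the label
    print
}' test.txt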


Highlight line for specific commit in git log graph

I am trying to highlight the whole line for a specific commit in my git log graph. I have previously created a git log alias to format the output of my logs. I have attempted to highlight a specific line containing the commit id, using my alias.
Alias in ~/.gitconfig
# Base command for log formatting
lg-base = "log --graph --decorate=short --decorate-refs-exclude='refs/tags/*' --color=always"
# Version 1 log format
lg1 = !"git lg-base --format=format:'%C(#f0890c)%h%C(reset) - %C(bold green)(%ar)%C(reset) %C(white)%s%C(reset) %C(dim white)- %an%C(reset)%C(#d10000)%d%C(reset)'"
As a test I'm searching for "6 months" instead, just because it should behave the same and might showcase my issue a bit better.
git lg1 | grep --color=always -E '(6 months).*|$'
This matches the correct lines, but it doesn't highlight the whole line to the right, and when I try to highlight the left part of the line as well, it doesn't work as expected. Probably because of my lacking regex skills.
git lg1 | grep --color=always -E '.*(6 months).*|$'
Instead, it marks the * at the beginning.
If you have a totally different approach, that is fine with me, as long as I can keep using my formatted git log alias.
Thomas' comment is the key to the issue here: although grep is adding its own color (or colour) changing escape sequences to highlight the line, Git has already put in color changing directives. Each such directive, for one of the named colors, looks like this:
ESC[numberm
where the number part is 30 through 37 for a foreground color and 40 through 47 for a background color (plus some extra codes for bold or dim, which I won't include here). %C(reset) sends ESC [ m, and your orange selector uses a 24-bit color directive, which is less widely supported than the eight base colors, which go back to the 1990s. Hence the original output reads:
* <sp> <orange> <hash-ID> <reset> <sp> - <sp> <blue> (n months ago) <reset> ...
The grep adds red, which is ESC [ 31 m, and a reset, around the matched expression—but the existing escapes within the expression remain.
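To see exactly which sequences are colliding, you can make them visible by piping through cat -v, which renders each ESC as ^[:
git lg1 | grep --color=always -E '6 months' | cat -v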
The easiest way by far to avoid all this is to stop using color escape sequences at all, so that grep's added ones stick out like a sore red thumb. Of course that defeats your goal, which is to keep the color-changing escapes in lines that aren't highlighted. But you haven't explained what you'd like done with the color-changing escapes in lines, or parts of lines, that are highlighted. Answering that will determine what to do next.
There are any number of ways you could handle this. For instance, instead of %C(color)%<directive>%C(reset) you could use %x1b(name-of-color)%<directive>%x1b(reset) to insert the literal sequences ESC ( name of color or reset ). Alternatively, assume that the terminal in question will use ANSI-style escapes that end with the lowercase m character, and write something up in sed or awk (I'd use awk for something this complex, just because it's less like writing line noise) that does the match (awk supports regex matching) and, if found, strips out the color sequences from the matched part and adds its own. Post-process this with something that inserts the appropriate terminal-dependent color-change sequences, or keep the original ESC [ ... m sequences on the assumption that you're in a window that uses that form, and you'll have the output you want (which you can now pipe through less -R if desired).
A skeleton awk program that does what you want is:
/<desired regex>/ { handle matched line; next; }
{ print }
The hard part is the "handle matched line". GNU awk has RSTART and RLENGTH to help out a lot; see, e.g., this answer. The substring of the line from the beginning to RSTART-1 wasn't matched (this may be empty), and the substring from RSTART+RLENGTH to the end of the line (which may also be empty) also was not matched; the substring of $0 at RSTART for length RLENGTH was matched and here's where you would strip out any color-changing sequences, if you want your basic red (or whatever) applied throughout.
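As a minimal sketch of that "handle matched line" part, with the pattern and a plain red highlight hard-coded (the \x1b string escape is the same gawk/nawk extension used in the script below):
match($0, /6 months/) {
    pre  = substr($0, 1, RSTART - 1)
    mid  = substr($0, RSTART, RLENGTH)
    post = substr($0, RSTART + RLENGTH)
    gsub("\x1b\\[[0-9;]*m", "", mid)    # strip existing color sequences from the matched part
    printf("%s\x1b[31m%s\x1b[m%s\n", pre, mid, post)
    next
}
{ print }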
Sample script (by Robin Hellmers)
Create a script and place it where you please, e.g.
~/.local/bin/highlight-commit.awk
with the contents
#!/usr/bin/nawk -f
BEGIN {
    n = split(commits, arrayCommits, " ");
    background = "145;0;0"
    foreground = "255;255;255"
}
{
    # Compare with every given input, e.g. commit id
    for (i = 1; i <= n; i++) {
        if (match($0, arrayCommits[i])) {
            # Remove any ANSI color escape sequence for matching row
            gsub("\x1b\\[[0-9;]*m", "", $0)
            # Create ANSI color escape sequence for whole row
            $0 = sprintf("\x1b[48;2;%sm\x1b[38;2;%sm%s\x1b[0m\x1b[0m",
                         background,
                         foreground,
                         $0);
            break;
        }
    }
    printf("%s\n", $0);
}
In ~/.gitconfig, add the following alias:
[alias]
highlight-commit = "!f() { git lg | awk -v commits=\"$*\" -f ~/.local/bin/highlight-commit.awk | less -XR; }; f"
By calling with e.g. two commits:
git highlight-commit 82451f8 310fca4

How to join lines adding a separator?

The command J joins lines.
The command gJ joins lines without inserting spaces.
Is there also a command to join lines adding a separator between them?
Example:
Input:
text
other text
more text
text
What I want to do:
- select these 4 lines
- if there are spaces at start and/or EOL remove them
- join lines adding a separator '//' between them
Output:
text//other text//more text//text
You can use :substitute for that, matching on \n:
:%s#\s*\n\s*#//#g
However, this appends the separator at the end, too (because the last line in the range also has a newline). You could remove that manually, or specify the c flag and quit the substitution before the last one, or reduce the range by one and :join the last one instead:
:1,$-1s#\s*\n\s*#//#g|join
I wrote a plugin, "Join", which can do what you want, and more.
https://github.com/sk1418/Join
In addition to all the features provided by the built-in :join command, Join can:
Join lines with separator (string)
Join lines with or without trimming the leading/trailing whitespaces
Join lines with negative count (backwards join)
Join lines in reverse
Join lines and keep joined lines (without removing joined lines)
Join lines with any combinations of above options
Check the homepage for details and examples/screenshots.
There are a few ways to do it, but I would recommend going by the simplest route possible: recording a macro or doing a multi-step command, for example by:
Appending // to all lines excluding the last, either
using substitution (:1,$-1s#$#//#), or
appending with :normal (:1,$-1norm A//)
And then joining using visual selection (vGgJ) or any other method.
Unless you're doing this operation very often, you will most likely forget any complex commands or the existence of a specialized plugin in your config; thus my recommendation of using generic, often-used sub-steps.
Another substitution, for the sake of diversity:
:%s:\n\ze.://
Here \ze marks the end of the match, so the character after the newline is only required to exist, not replaced; that is what keeps the separator off the very last line.
This lists 50 items per line (xargs joins them with spaces, and sed then replaces the spaces with commas):
seq 0 70 | xargs -L 50 | sed 's/ /,/g'
Output:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70
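In the same spirit, the trim-and-join from the question can be done outside Vim as well; a small awk sketch, assuming the four input lines are in a file called input.txt:
awk '{
    gsub(/^[ \t]+|[ \t]+$/, "")              # strip spaces at start and EOL
    printf "%s%s", (NR > 1 ? "//" : ""), $0  # separator before every line but the first
} END { print "" }' input.txt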

Vim: regular expression to delete all lines except those starting with a given list of numbers

I have a csv file where every line but the first starts with a number and looks like this:
subject,parameter1,parameter2,parameter3
1,blah,blah,blah
3,blah,blah,blah
2,blah,blah,blah
44,blah,blah,blah
12,blah,blah,blah
14,blah,blah,blah
11,blah,blah,blah
10,blah,blah,blah
11,blah,blah,blah
13,blah,blah,blah
3,blah,blah,blah
...
I would like to delete all lines except the first (the header) and those that start with, say, the numbers 1, 6 or 12.
I was trying something like this:
:g!/^[1 6 12]\|^subject/d
But the 12 is interpreted as "1 or 2", so this also keeps the lines that start with 2.
What am I missing, and what should be the most efficient way to do this?
Btw, instead of just 1, 6, 12, my actual list contains many single- and 2-digit numbers.
The character class [1 6 12] means "any single character that is in this class", i.e. any one of ' ', 1, 2, 6 (the repeated 1 is ignored).
You could use
:g!/^1,\|^6,\|^12,\|^subject/d
which is close to your original syntax - but it works (tested with vim on Mac OS X).
Note - it is important to include the comma, so that the line starting with 1 doesn't "protect" 11, 12345, etc.
You might want to do this differently though - using grep.
Put all the "white listed" numbers in a file, one per line, like so:
^subject
^1,
^2,
^6,
^12,
then do
grep -f whitelist csvFile
and the output will be your "edited" file (which you can pipe to a new file).
If you are even more interested in "efficiency", you could make your text file (let's continue to call it whitelist) just
subject
1
2
6
12
and use the following command:
cat whitelist | xargs -I {} grep "^"{}"," csvFile
This needs a bit of explaining.
xargs - take the input one line at a time
-I {} - and insert that line in the command that follows, at the {}
This means that the grep command will be run n times (once per line in the whitelist file), and each time the regular expression that is fed into grep will be the concatenation of
"^" - start of line
{} - contents of one line of the input file (whitelist)
"," - comma that follows the number
So this is a compact way of writing
grep "^subject," csvFile; grep "^1," csvFile; grep "^2," csvFile;
etc.
It has the advantage that you can now generate your whitelist any way you want - as long as it ends up in a file, one line at a time, you can use it; the disadvantage is that you are essentially running grep n times. If your files get very large, and you have a large number of items in your white list, that may start to be a problem; but since your OS is likely to put the file into cache after the first read-through, it is really quite fast. The use of the ^ anchor makes the regular expression very efficient - as soon as it doesn't find a match it goes on to the next line.
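If running grep n times ever does become a problem, the whole whitelist can also be collapsed into a single alternation and matched in one pass (same anchors and trailing commas as above):
grep -E '^(subject|1|2|6|12),' csvFile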
Use a global match:
:v/^\(subject\|1\|6\|12\),/ delete
For every line that does not match that regular expression, delete it.
It yields:
subject,parameter1,parameter2,parameter3
1,blah,blah,blah
12,blah,blah,blah
EDIT: Just now I realised that you were already using the global match. Your error was in the character class: it matches any single character inside it, regardless of repeated ones; in your case the numbers one, two and six, plus a space. You must separate them into different branches, like I did before.
a "functional" alternative:
:g/./if index([1,12,6],str2nr(split(getline("."),",")[0]))<0|exec 'normal! dd'|endif

Copying only the value at column n Vim

I have a file with long lines and need to see/copy the values at a specific location (or locations) for the whole file, but not the rest of the line.
If the text width is small enough, ~184 columns, I can use :set colorcolumn=<num> to highlight the value. However, over 184 characters it gets a bit unwieldy with scrolling.
I tried :g/\%1237c/y Z, for one of the positions I needed, but that yanked the entire line.
E.g. for the smaller sample below, :g/\%49c/y Z will yank all of lines 1 and 2, but I want to yank, or copy, only the character at that column, i.e. = on line 1 and x on line 2.
vim: filetype=help foldmethod=indent foldclose=all modifiable noreadonly
Table of Contents *sfcontents* *vim* *regex* *sfregex*
*sfsearch* - Search specific commands
|Ampersand-replaces-previous-pattern|
|append-a-global-search-to-a-register|
*sfHelp* Various Help related commands
There are two problems with your :g command:
For each matching line, the cursor is positioned on the first column. So even though you've matched at a particular column, that position is lost.
The \%c atom actually matches byte indices (what Vim somewhat confusingly names "columns"), so your measurement will be off for Tab and non-ASCII characters. Use the virtual column atom \%v instead.
Instead of :global, I would use :substitute with a replace-expression, in the idiom described at how to extract regex matches using vim:
:let t=[] | %s/\%49v./\=add(t, submatch(0))[-1]/g | let @@ = join(t, "\n")
Alternatively, if you install my ExtractMatches plugin, it would be this short command invocation:
:YankMatchesToReg /\%50v./
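Outside Vim, grabbing one column from every line is a one-liner; a sketch assuming plain ASCII and no tabs (so byte columns and screen columns coincide), with file standing in for your file name:
cut -c49 file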

Remove the first character of each line and append using Vim

I have a data file as follows.
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
Using vim, I want to remove the 1's from each of the lines and append them to the end. The resultant file would look like this:
14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,1
13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185,1
14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480,1
13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735,1
I was looking for an elegant way to do this.
Actually, I tried something like
:%s/$/,/g
And then
:%s/$/^./g
But I could not make it work.
EDIT: Well, actually I made one mistake in my question. In the data file, the first character is not always 1; the lines are a mixture of 1, 2 and 3. So, from all the answers to this question, I came up with the solution:
:%s/^\([1-3]\),\(.*\)/\2,\1/g
and it is working now.
A regular expression that doesn't care which number, how many digits, or which separator you've used. That is, this would work both for lines that have 1 as their first number and for lines that have, say, 114:
:%s/\([0-9]*\)\(.\)\(.*\)/\3\2\1/
Explanation:
:%s// - Substitute on every line (%)
\(<something>\) - Extract and store to \n
[0-9]* - A digit, 0 or more times
. - Any single char; here, the comma
.* - Any char, 0 or more times
\3\2\1 - Replace with the captured groups, reordered
So: cut up 1 , <the rest> into \1, \2 and \3 respectively, and reorder them.
This could be somewhat simpler to understand:
:%s/^1,//
:%s/$/,1/
:%s/^1,\(.*\)/\1,1/
This will do the replacement on each line in the file. The \1 puts back everything captured by the \(.*\).
:%s/1,\(.*$\)/\1,1/gc
You could also solve this one using a macro. First, think about how to delete the 1, from the start of a line and append it to the end:
0 go to the start of the line
df, delete everything to and including the first ,
A,<ESC> append a comma to the end of the line
p paste the thing you deleted with df,
x delete the trailing comma
So, to sum it up, the following will convert a single line:
0df,A,<ESC>px
Now if you'd like to apply this set of modifications to all the lines, you will first need to record them:
qj start recording into the 'j' register
0df,A,<ESC>px convert a single line
j go to the next line
q stop recording
Finally, you can execute the macro anytime you want using @j, or convert your entire file with 99@j (using a higher number than 99 if you have more than 99 lines).
Here's the complete version:
qj0df,A,<ESC>pxjq99@j
This one might be easier to understand than the other solutions if you're not used to regular expressions!
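For completeness, the same move-the-first-field-to-the-end transform is a one-liner outside Vim as well; a sed sketch using -E for extended regular expressions, with data.csv standing in for your data file:
sed -E 's/^([0-9]+),(.*)$/\2,\1/' data.csv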