Ordering output files - regex

I have a file containing a large number of protein sequences. Each sequence is headed by a "protein ID number" (a GI number, for those who know). I am using an awk command that prints everything between two regular expressions. In the first regex I enter a list of GI numbers separated by "|". The second regex matches a marker (ABC123) that I added after every protein so that the awk range has an end.
Therefore the code I am using is as follows
awk '/GI1|GI2|GI3|GI4|GIX.../,/ABC123/' database.txt > output.txt
As you can see from the above code, I am searching within database.txt and writing a new file. The problem is, when I open output.txt the list of GI's is in the wrong order. In output.txt I need them to occur in the same order as they occur in the first regex field i.e
GI1
GI2
GI3...
Instead, they occur in the order which they are found in database.txt, so in output.txt they look all jumbled i.e
GI3
GI4
GI1
GI2
GI5
Does anyone know how I can get the list of GIs in the output file to match the same order as the list of GIs I input in the 1st regex field?

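Try this command:
awk '/GI1|GI2|GI3|GI4|GIX.../,/ABC123/' database.txt | sort -k1.3,1.3 > output.txt
Now output.txt contains the sorted list.
The key specification 1.3,1.3 says that the sort key starts at field 1, character 3, and ends at the same place. Note that sort reorders every line it receives and that the key is only that one character, so this works only when each record is a single line and the IDs differ at that position.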
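If the records span several lines, one way to get them out in the order the IDs were listed is to collect each record in awk and print them at the end in the requested order. The following is only a sketch: extract_in_order is a made-up helper name, it assumes the ID appears on the first line of each record, that no listed ID is a substring of another (GI1 would also match GI12, just as in the original regex), and that the selected records fit in memory.

```shell
# Hypothetical helper: extract_in_order DATABASE ID1 ID2 ...
# Prints each record (ID line through the ABC123 marker) in argument order.
extract_in_order() {
    db=$1; shift
    # Hand the wanted IDs to awk as one space-separated string, in order.
    awk -v ids="$*" '
        BEGIN { n = split(ids, want, " ") }
        # Outside a record: check whether this line starts a wanted record.
        !inrec {
            for (k = 1; k <= n; k++)
                if (index($0, want[k])) { cur = want[k]; inrec = 1; break }
        }
        # Inside a record: append the line; the ABC123 marker closes it.
        inrec {
            rec[cur] = rec[cur] $0 "\n"
            if (/ABC123/) inrec = 0
        }
        # Emit the stored records in the order the IDs were given.
        END { for (k = 1; k <= n; k++) printf "%s", rec[want[k]] }
    ' "$db"
}
```

It would be called as: extract_in_order database.txt GI1 GI2 GI3 > output.txt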

Related

Edit CSV rows in two different ways

I have a bash script that outputs a two-column CSV. I need to prepend "f. " to the three-digit number in those rows of the second column that contain one, and keep the rest of the rows intact. I have tried different ways so far, but each has failed in one way or another.
What I've mainly tried is to use regular expressions on either the first or second column to separate the desired rows from the rest, but I can't separate and prepend at the same time without cancelling out or messing up the process somehow. The commands I've used so far include sed and cut, as well as (nested) for loops, read-while loops, if/else and if/elif/else statements, etc. What follows is one such (failed) solution:
for var1 in "^.*_[^f]_.*"
do
sed -i "" "s:$MSname::" $pathToCSV"_final.csv"
for var2 in "^.*_f_.*"
do
sed -i "" "s:$MSname:f.:" $pathToCSV"_final.csv"
done
done
And these are some sample rows:
abc_deg0014_0001_a_1.tif,British Library 1 Front Board Outside
abc_deg0014_0002_b_000.tif,British Library 1 Front Board Inside
abc_deg0014_0003_f_001r.tif,British Library 1 001r
abc_deg0014_0004_f_001v.tif,British Library 1 001v
…
abc_deg0014_0267_f_132r.tif,British Library 1 132r
abc_deg0014_0268_f_132v.tif,British Library 1 132v
abc_deg0014_0269_y_999.tif,British Library 1 Back Board Inside
abc_deg0014_0270_z_1.tif,British Library 1 Back Board Outside
Here $MSname = British Library 1 (since with different CSVs the "British Library 1" part can change to other words that I need to remove/replace and that's why I use parameter expansion).
The desired result:
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
…
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
If you look closely, you'll notice these rows are also differentiated from the rest by "f" in their first column (the rows that shouldn't get the "f. " in front of their second column are differentiated by "a", "b", "y", and "z", respectively, in the first column).
You are not using var1 or var2 for anything, and even if you did, looping over variables and repeatedly running sed -i on the same output file is extremely wasteful. Ideally, you would like to write all the modifications into a single sed script, and process the file only once.
Without being able to guess what other strings than "British Library 1" you have and whether those require different kinds of actions, I would suggest something along the lines of
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/' "${pathToCSV}_final.csv"
Notice how the sed script in single quotes can be wrapped over multiple physical lines. The first line finds any line where the text between the last pair of underscores in the first comma-separated column is f, and replaces ",British Library 1 " with ",f. ". (I made some adjustments to the spacing here -- I hope they make sense for you.) On the following line, we simply replace any (remaining) occurrences of ",British Library 1 " with just a comma; the idea is that only the lines which didn't match the regex on the previous line will still contain this string, and so we don't have to do another regex match.
This can easily be extended to cover more patterns in the same sed script, rather than repeatedly looping over the file and rewriting one pattern at a time. For example, if your next task is to replace Windsor Palace A with either a. or nothing depending on whether the penultimate underscore-separated subfield in the first field contains a, that should be obvious enough:
sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
s/,British Library 1 /,/
/^[^,]*_a_[^,_]*,/s/,Windsor Palace A /,a. /
s/,Windsor Palace A /,/' "${pathToCSV}_final.csv"
In some more detail, the regex says
^ beginning of line
[^,]* any sequence of characters which are not a comma
_f_ literal characters underscore, f, underscore
[^,_]* any sequence of characters which are not a comma or an underscore
, literal comma
You should be able to see that this will target the last pair of underscores in the first column. It's important to never skip across the first comma, and near the end, not allow any underscores after the ones we specifically target before we finally allow the comma column delimiter.
Finally, also notice how we always use double quotes around variables which contain file names. There are scenarios where you can avoid this but you have to know what you are doing; the easy and straightforward rule of thumb is to always put double quotes around variables. For the full scoop, see When to wrap quotes around a shell variable?
With awk, you can look at the fifth field to see whether it matches "3 digits + 1 letter", then print it with f. in that case and just remove fields 2, 3 and 4 in the other case. For example:
awk -F'[, ]' '{
if($5 ~ /.?[[:digit:]]{3}[a-z]$/) {
printf("%s,f. %s\n",$1,$5)}
else {
printf("%s,%s %s %s\n",$1,$5,$6,$7)
}
}' test.txt
On the example you provide, it gives:
abc_deg0014_0001_a_1.tif,Front Board Outside
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
abc_deg0014_0004_f_001v.tif,f. 001v
abc_deg0014_0267_f_132r.tif,f. 132r
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
abc_deg0014_0270_z_1.tif,Back Board Outside
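Since the question says the "British Library 1" part changes from one CSV to the next, here is a sketch that takes that name as a parameter instead of counting words. strip_ms_name is a hypothetical helper name; it assumes exactly two comma-separated columns (no commas inside the second one) and decides on the "f. " prefix from the _f_ tag in the first column rather than from the shape of the number.

```shell
# Hypothetical helper: strip_ms_name "MANUSCRIPT NAME" FILE.csv
# Writes the rewritten rows to standard output.
strip_ms_name() {
    awk -F',' -v OFS=',' -v ms="$1" '
    {
        desc = $2
        # Drop the manuscript name (and the space after it) if it leads the column.
        if (index(desc, ms " ") == 1)
            desc = substr(desc, length(ms) + 2)
        # Folio rows have _f_ as the last underscore-delimited tag in column 1.
        if ($1 ~ /_f_[^_]*$/)
            desc = "f. " desc
        print $1, desc
    }' "$2"
}
```

For example: strip_ms_name "$MSname" "${pathToCSV}_final.csv" > "${pathToCSV}_clean.csv" (the output file name is made up here).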

Mass regex search-and-replace BETWEEN patterns

I have a directory with a bunch of text files, all of which follow this structure:
...
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
- Again, some list items of random text
- Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
....
And I need to run a replace operation (let's say, I need to prepend CCC at the beginning of the line, just after the dash) on only those "list items", which are between PATTERN_A and PATTERN_B. The problem is they aren't really much different from the text above PATTERN_A, or below PATTERN_B, so an ordinary regex can't really catch them without also affecting the remaining text.
So, my question would be, what tool and what regex should I use to perform that replacement?
(Just in case, I'm fine with Vim, and I can collect those files in a QuickFix for a further :cdo, for example. I'm not that good with awk, unfortunately, and absolutely bad with Perl :))
Thanks!
If I have understood your question, you can do so quite easily with a pattern-range selection and the general substitution form with sed (stream editor). For example, in your case:
$ sed '/PATTERN_A/,/PATTERN_B/s/^\([ ]*-\)/\1CCC/' file
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
-CCC Again, some list items of random text
-CCC Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
(note: to substitute in place within the file add the -i option, and to create a backup of the original add -i.bak which will save the original file as file.bak)
Explanation
/PATTERN_A/,/PATTERN_B/ - select lines between PATTERN_A and PATTERN_B
s/^\([ ]*-\)/\1CCC/ - substitute (general form 's/find/replace/') where find is from beginning of line ^ capturing text between \(...\) that contains [ ]*- (any number of spaces and a hyphen) and then replace with \1 (called a backreference that contains all characters you captured with the capture group \(...\)) and appending CCC to its end.
Look things over and let me know if you have questions or if I misinterpreted your question.
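Because the question is about a whole directory of files, the same sed command can be wrapped in a loop. This is a minimal sketch: prefix_between is a made-up helper name, and running sed once per file is a deliberate choice so that a range left open in one file (a PATTERN_A with no PATTERN_B) cannot spill into the next.

```shell
# Hypothetical helper: prefix_between FILE...
# Edits each file in place and keeps the original as FILE.bak.
prefix_between() {
    for f in "$@"; do
        # Only list items between PATTERN_A and PATTERN_B get CCC after the dash.
        sed -i.bak '/PATTERN_A/,/PATTERN_B/s/^\([ ]*-\)/\1CCC/' "$f"
    done
}
```

It could be run as: prefix_between ./*.txt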
Perl gives the same result:
> perl -pe ' { s/^(\s*-)/${1}CCC/ if /PATTERN_A/../PATTERN_B/ } ' mass_replace.txt
...
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
-CCC Again, some list items of random text
-CCC Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
....

Get part of a string based on conditions using regex

For the life of me, I can't figure out the combination of regular expression characters to use to parse the part of the string I want. Each string is one line from a file of some 400 thousand lines (in no particular order), read inside a for loop. I find the line by matching a unique number taken from an array that the loop iterates over.
For every string I'm trying to get a date number (such as 20151212 below).
Given the following examples of the strings (pulled from a CSV file with 400k++ lines of strings):
String1:
314513,,Jr.,John,Doe,652622,U51523144,,20151212,A,,,,,,,
String2:
365422,johnd#blankity.com,John,Doe.,Jr,987235,U23481,z725432,20160221,,,,,,,,
String3:
6231,,,,31248,U51523144,,,CB,,,,,,,
There are several complications here...
Some names have a "," in them, so it makes it more than 15 commas.
We don't know the value of the date, just that it is a date format such as (get-date).tostring("yyyyMMdd")
For those who can think of a better way...
We are given two CSV files to match. Algorithmic steps:
Look in the CSV file 1 for the ID Number (found on the 2nd column)
** No ID Numbers will be blank for CSV file 1
Look in the CSV file 2 and match the ID number from CSV file 1. On this same line, get the date. Once have date, append in 5th column on CSV file 1 with the same row as ID number
** Note: CSV file 2 will have $null for some of the values in the ID number column
I'm open to suggestions (including using the Import-Csv cmdlet, although I am not too familiar yet with its flags or with the syntax of for loops over those values).
You could try something like this:
,(19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01]),
This will match all dates in the given format from 1900 - 2099. It is also specific enough to rule out most other random numbers, although without a larger sample of data, it's impossible to say.
Then in PowerShell:
gc data.csv | where { $_ -match ",((19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])),"} | % { $matches[1] }
In the PowerShell match we added capturing parentheses around the part we want, and we reference that group by its number as the index into $matches.
If you are only interested in matching one line based on a preceding id you could use a lookbehind. For example,
$id=314513; # Or maybe U23481
gc c:\temp\reg.txt | where { $_ -match "(?<=$id.*),((19|20)[0-9]{2}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])),"} | % { $matches[1] }

Command line to merge lines with matching first field, 50 GB input

A while back, I asked a question about merging lines which have a common first field. Here's the original: Command line to match lines with matching first field (sed, awk, etc.)
Sample input:
a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output:
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
The idea is that if the first field matches, then the lines are merged. The input is sorted. The actual content is more complex, but uses the pipe as the sole delimiter.
The methods provided in the prior question worked well on my 0.5GB file, processing in ~16 seconds. However, my new file is approx 100x larger, and I prefer a method that streams. In theory, this will be able to run in ~30 minutes. The prior method failed to complete after running 24 hours.
Running on MacOS (i.e., BSD-type unix).
Ideas? [Note, the prior answer to the prior question was NOT a one-liner.]
You can append your results to a file on the fly so that you don't need to build a 50GB array (which I assume you don't have the memory for!). This command will concatenate the join fields for each of the different indices in a string which is written to a file named after the respective index with some suffix.
EDIT: on the basis of OP's comment that content may have spaces, I would suggest using -F"|" instead of sub and also the following answer is designed to write to standard out
(New) Code:
# split the file on the pipe using -F
# if index "i" is still $1 (and i exists) concatenate the string
# if index "i" is not $1 or doesn't exist yet, print current a
# (will be a single blank line for first line)
# afterwards, this will print the concatenated data for the last index
# reset a for the new index and take the first data set
# set i to $1 each time
# END statement to print the single last string "a"
awk -F"|" '$1==i{a=a"|"$2}$1!=i{print a; a=$2}{i=$1}END{print a}'
This builds a string of "data" while in a given index and then prints it out when index changes and starts building the next string on the new index until that one ends... repeat...
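As written, the one-liner above prints the values without the key and also emits groups that consist of a single line. A variant that keeps the key and prints only the merged groups, as in the desired output, might look like the sketch below. merge_groups is a hypothetical helper name; it assumes sorted input, and it holds only one group in memory at a time, so it streams.

```shell
# Hypothetical helper: merge_groups [FILE...]
# Reads sorted key|value... lines (from files or stdin), prints merged groups.
merge_groups() {
    awk -F'|' '
        # Same key as the previous line: append everything after the key.
        NR > 1 && $1 == key {
            line = line substr($0, length($1) + 1)
            count++
            next
        }
        # New key: flush the previous group if it had more than one line.
        {
            if (count > 1) print line
            key = $1; line = $0; count = 1
        }
        # Flush the final group.
        END { if (count > 1) print line }
    ' "$@"
}
```

For example: merge_groups input.txt > output.txt (file names are placeholders).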
sed '# label anchor for a jump
:loop
# load a new line in working buffer (so always 2 lines loaded after)
N
# verify if the 2 lines have same starting pattern and join if the case
/^\(\([^|]\)*\(|.*\)\)\n\2/ s//\1/
# if end of file quit (and print result)
$ b
# if lines were joined, cycle and try again with the next line (jump to :loop)
t loop
# (No joined lines here)
# if the first line has more than 2 elements, print it
/.*|.*|.*\n/ P
# remove first line (using last search pattern)
s///
# (if anything was modified) cycle (jump to :loop)
t loop
# exit and print working buffer
' YourFile
POSIX version (maybe --posix on Mac)
self-commented
assumes sorted input, no empty lines, and no pipe (escaped or otherwise) in the data
use unbuffered -u for stream processing, if available

Vim: regular expression to delete all lines except those starting with a given list of numbers

I have a csv file where every line but the first starts with a number and looks like this:
subject,parameter1,parameter2,parameter3
1,blah,blah,blah
3,blah,blah,blah
2,blah,blah,blah
44,blah,blah,blah
12,blah,blah,blah
14,blah,blah,blah
11,blah,blah,blah
10,blah,blah,blah
11,blah,blah,blah
13,blah,blah,blah
3,blah,blah,blah
...
I would like to delete all lines except the first one and those that start with, say, the numbers 1, 6, or 12.
I was trying something like this:
:g!/^[1 6 12]\|^subject/d
But the 12 is interpreted as "1 or 2" so this also erases the lines that start with 2.
What am I missing, and what would be the most efficient way to do this?
Btw, instead of 1, 6, 12, my actual list contains many one- and two-digit numbers.
The character class [1 6 12] means "any single character that is in this class", i.e. any one of ' ', 1, 2, 6 (the repeated 1 is ignored).
You could use
:g!/^1,\|^6,\|^12,\|^subject/d
which is close to your original syntax - but it works (tested with vim on Mac OS X).
Note - it is important to include the comma, so that the line starting with 1 doesn't "protect" 11, 12345, etc.
You might want to do this differently though - using grep.
Put all the "white listed" numbers in a file, one per line, like so:
^subject
^1,
^2,
^6,
^12,
then do
grep -f whitelist csvFile
and the output will be your "edited" file (which you can pipe to a new file).
If you are even more interested in "efficiency", you could make your text file (let's continue to call it whitelist) just
subject
1
2
6
12
and use the following command:
cat whitelist | xargs -I {} grep "^"{}"," csvFile
This needs a bit of explaining.
xargs - take the input one line at a time
-I {} - and insert that line in the command that follows, at the {}
This means that the grep command will be run n times (once per line in the whitelist file), and each time the regular expression that is fed into grep will be the concatenation of
"^" - start of line
{} - contents of one line of the input file (whitelist)
"," - comma that follows the number
So this is a compact way of writing
grep "^subject," csvFile; grep "^1," csvFile; grep "^2," csvFile;
etc.
It has the advantage that you can now generate your whitelist any way you want - as long as it ends up in a file, one line at a time, you can use it; the disadvantage is that you are essentially running grep n times. If your files get very large, and you have a large number of items in your white list, that may start to be a problem; but since your OS is likely to put the file into cache after the first read-through, it is really quite fast. The use of the ^ anchor makes the regular expression very efficient - as soon as it doesn't find a match it goes on to the next line.
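If running grep once per whitelist entry becomes too slow, the plain whitelist (one value per line, no ^ or comma) can also be applied in a single pass. The sketch below swaps grep for awk and compares the first CSV field exactly against the listed values. filter_csv is a made-up helper name, and it assumes the whitelist file is not empty.

```shell
# Hypothetical helper: filter_csv WHITELIST CSVFILE
# Prints the CSV rows whose first field appears in the whitelist, in file order.
filter_csv() {
    awk -F',' '
        # First file (the whitelist): remember every entry.
        NR == FNR { keep[$0]; next }
        # Second file (the CSV): print rows whose first field was listed.
        $1 in keep
    ' "$1" "$2"
}
```

For example: filter_csv whitelist csvFile > newFile (newFile is a placeholder). Unlike the xargs version, the rows come out in the order they have in the CSV, not in whitelist order.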
Use a global match:
:v/^\(subject\|1\|6\|12\),/ delete
For every line that does not match that regular expression, delete it.
It yields:
subject,parameter1,parameter2,parameter3
1,blah,blah,blah
12,blah,blah,blah
EDIT: I have just realised that you were already using the global match. Your error was in the character class. It matches any single character listed inside it, regardless of repeats: in your case the digits one, two and six, and a space. You must separate the alternatives into different branches, as I did above.
a "functional" alternative:
:g/./if index([1,12,6],str2nr(split(getline("."),",")[0]))<0|exec 'normal! dd'|endif