Fill line up to specific length with regex

This is part of a list. All lines need to be filled up with zeros up to 12 characters at the start of each line. Some lines already are length 12...
801095126710
2227121
19472168
21521070
21945110
25260089
92000077
93400015
132300300
132405100
211304212
934000107
934000108
934000110
934000120
934000144
93400138
160908013840
822100052908
822100053358
How can this be done with regex?

Warning: this is ugly.
You can look for:
^(.{0,11})$
and replace with 0$1. Click Replace All up to 11 times and voilà: each pass prepends one zero to every line that is still shorter than 12 characters.
You can't do math with regex. Regexes are for string matching.
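For anyone doing this outside an editor, the repeated Replace All can be sketched as a loop in Python. This is only an illustration of the idea above: the names SHORT_LINE and pad_line are made up here, and it assumes each line is handled on its own, without a trailing newline.

```python
import re

# Same pattern as above: a whole line of at most 11 characters
SHORT_LINE = re.compile(r"^(.{0,11})$")

def pad_line(line: str) -> str:
    """Prepend one zero per pass until the pattern no longer matches."""
    while True:
        padded = SHORT_LINE.sub(r"0\1", line)
        if padded == line:  # nothing changed, so the line has 12+ characters
            return line
        line = padded

for line in ["801095126710", "2227121", "93400138"]:
    print(pad_line(line))
```

Each pass of the loop corresponds to one click on Replace All.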

Description
I see this as a two-step process. Step One: insert twelve 0s at the beginning of each line. Step Two: capture the last 12 characters and all leading 0s, and replace them with just the 12 captured characters.
Step One - Insert 12 zeros at the start of each line
^
Replace with: 000000000000
Live Demo: https://regex101.com/r/rM8bK2/1
Sample Text
801095126710
2227121
19472168
21521070
21945110
25260089
92000077
93400015
132300300
132405100
211304212
934000107
934000108
934000110
934000120
934000144
93400138
160908013840
822100052908
822100053358
After Replacement
000000000000801095126710
0000000000002227121
00000000000019472168
00000000000021521070
00000000000021945110
00000000000025260089
00000000000092000077
00000000000093400015
000000000000132300300
000000000000132405100
000000000000211304212
000000000000934000107
000000000000934000108
000000000000934000110
000000000000934000120
000000000000934000144
00000000000093400138
000000000000160908013840
000000000000822100052908
000000000000822100053358
123456789,123456789,123456789
Note: I inserted the number line here to help illustrate the number and position of characters
Step Two - Capture the last 12 characters and all leading zeros
0*([0-9]{12})$
Replace with: $1
Live Demo: https://regex101.com/r/aS2xG0/1
Sample Text
Because this is step two, the sample text is the output from step one above
000000000000801095126710
0000000000002227121
00000000000019472168
00000000000021521070
00000000000021945110
00000000000025260089
00000000000092000077
00000000000093400015
000000000000132300300
000000000000132405100
000000000000211304212
000000000000934000107
000000000000934000108
000000000000934000110
000000000000934000120
000000000000934000144
00000000000093400138
000000000000160908013840
000000000000822100052908
000000000000822100053358
After Replacement
801095126710
000002227121
000019472168
000021521070
000021945110
000025260089
000092000077
000093400015
000132300300
000132405100
000211304212
000934000107
000934000108
000934000110
000934000120
000934000144
000093400138
160908013840
822100052908
822100053358
123456789,123456789,
Note: I inserted the number line here to help illustrate the number and position of characters
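The two steps can also be tried in Python as a rough sketch. The function name pad_two_step and the constant ZEROS are invented for this example, and it assumes every line consists of at most 12 digits and is processed one line at a time.

```python
import re

ZEROS = "0" * 12

def pad_two_step(line: str) -> str:
    # Step one: insert 12 zeros at the start of the line
    widened = re.sub(r"^", ZEROS, line)
    # Step two: keep only the last 12 digits and drop the surplus leading zeros
    return re.sub(r"0*([0-9]{12})$", r"\1", widened)

for line in ["801095126710", "2227121", "19472168", "132300300"]:
    print(pad_two_step(line))
```

For plain digit strings like these, line.zfill(12) should give the same result without any regex.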

Related

RegEx in Notepad++ to find lines with less or more than n pipes

I have a large pipe-delimited text file that should have one 3-column record per line. Many of the records are split up by line breaks within a column.
I need to do a find/replace to get three, and only three, pipes per line/record.
Here's an example (I added the line breaks (\r\n) to demonstrate where they are and what needs to be replaced):
12-1234|The quick brown fox jumped over the lazy dog.|Every line should look similar to this one|\r\n
56-7890A|This record is split\r\n
\r\n
on to multiple lines|More text|\r\n
09-1234AS|\r\n
||\r\n
\r\n
56-1234|Some text|Some more text\r\n
|\r\n
76-5432ABC|A record will always start with two digits, a dash and four digits|There may or may not be up to three letters after the four digits|\r\n
The caveat is that I need to retain those mid-record line breaks for the target system. They need to be replaced with \.br\. So the final result of the above should look like this:
12-1234|The quick brown fox jumped over the lazy dog.|Every line should look similar to this one|\r\n
56-7890A|This record is split\.br\\.br\on to multiple lines|More text|\r\n
09-1234AS|\.br\||\.br\\r\n
56-1234|Some text|Some more text\.br\|\r\n
76-5432ABC|A record will always start with two digits, a dash and four digits|There may or may not be up to three letters after the four digits|\r\n
As you can see the mid-record line breaks have all been replaced with \.br\ and the end-of-line line breaks have been retained to keep each three-column/pipe record on its own line. Note the last record's text, explaining how each line/record begins. I included that in case that would help in building a regex to properly identify the beginning of a record.
I'm not sure if this can be done in one find/replace step or if it needs to be (or just should be) split up into a couple of steps.
I had the thought to first search for |\r\n, since all records end with a pipe and a CRLF, and replace those with dummy text !##$. Then search for the remaining line breaks with \r\n, which will be mid-column line breaks and replace those with \.br\, then replace the dummy text with the original line breaks that I want to keep |\r\n.
That worked for all but records that looked like the third record in the first example, which has several line breaks after a pipe within the record. In such a large file as I am working with it wasn't until much later that I found that the above process I was using didn't properly catch those instances.
You can use
(?:\G(?!^(?<!.))|^\d{2}-\d+[A-Z]*\|[^|]*?(?:\|[^|]*?)?)\K\R+
Replace with \\.br\\. See the regex demo. Details:
(?:\G(?!^(?<!.))|^\d{2}-\d+[A-Z]*\|[^|]*?(?:\|[^|]*?)?) - either the end of the previous match (\G(?!^(?<!.))) or (|) start of a line, two digits, a -, one or more digits, zero or more letters, a |, then any zero or more chars other than |, as few as possible, and then an optional sequence of | and any zero or more chars other than |, as few as possible (see ^\d{2}-\d+[A-Z]*\|[^|]*?(?:\|[^|]*?)?)
\K - omit the text matched
\R+ - one or more line breaks.
If you need to remove empty lines after this, use Edit > Line Operations > Remove Empty Lines.
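If the file can be processed with a script instead of Notepad++, a line-by-line approach may be easier to reason about. The sketch below is a different technique from the \G/\K regex above (Python's built-in re module does not support those constructs). It assumes a record always starts with two digits, a dash, four digits, up to three letters and a pipe, as described in the question; the names RECORD_START, MARKER and join_split_records are hypothetical.

```python
import re

# Assumed record start: two digits, a dash, four digits, up to three letters, a pipe
RECORD_START = re.compile(r"\d{2}-\d{4}[A-Z]{0,3}\|")
MARKER = "\\.br\\"  # the literal text \.br\

def join_split_records(text: str) -> str:
    records = []
    for line in text.splitlines():
        if RECORD_START.match(line) or not records:
            records.append(line)             # a new record begins here
        else:
            records[-1] += MARKER + line     # continuation of the previous record
    return "".join(record + "\r\n" for record in records)
```

Unlike \R+, this writes one \.br\ per line break, which matches the expected output shown in the question. A continuation line that happens to look like a record start would be treated as a new record, so the result is worth spot-checking.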

Modify position in a line if Regular Expression found

I need to modify position number 10 of every line that contains the word 'Example' (I can't use the actual data here) and insert the string '(ID) ' there. The line doesn't necessarily have to begin with 9 digits; the string just needs to be added at position number 10.
For example, this line should be modified like this:
ORIGINAL: 123456789This line is being used as an Example
SOLUTION: 123456789(ID) This line is being used as an Example
So far I have this, to find the Example and copy the rest of the line as to not lose the text:
Find: (.*)Example
Bonus points if it works for two different words 'Example1' and 'Example2' in different sentences, the 'and also' part of this example would change in every line.
ORIGINAL: 123456789This line is being used as an Example1 and also Example2
SOLUTION: 123456789(ID) This line is being used as an Example1 and also Example2
This would have this search:
Find: (.*)Example1(.*)Example2
Thank you
You could try:
Find: (\d{9})(?=.*\bExample1\b.*\bExample2\b)
Replace: $1(ID) 
(note the single space after (ID))
The regex pattern used matches and captures a 9 digit number (you may adjust to any width, or range of widths, which you want). It also uses a positive lookahead to assert that Example1 and Example2 in fact occur later in the same line:
(?=.*\bExample1\b.*\bExample2\b)
This is how you add characters at a certain position. I accepted Tim's answer because it's very similar and it helped me figure this out:
^(\S{9})(?=.*\bExample1\b.*\bExample2\b)
As you can see, I only added '^' so the position is counted from the start of the line, and used 'S' instead of 'd' so it counts any non-whitespace characters instead of only digits. This should work for any type of line you have.
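To check the pattern outside the editor, here is a small Python sketch. The names INSERT_AT_10 and add_id are invented for the example; note that Python writes the backreference as \1 rather than $1.

```python
import re

# Nine non-whitespace characters at the start of a line, but only if
# Example1 and later Example2 appear further along the same line
INSERT_AT_10 = re.compile(
    r"^(\S{9})(?=.*\bExample1\b.*\bExample2\b)", re.MULTILINE
)

def add_id(text: str) -> str:
    # Re-emit the nine captured characters, then insert "(ID) " at position 10
    return INSERT_AT_10.sub(r"\1(ID) ", text)

print(add_id("123456789This line is being used as an Example1 and also Example2"))
```

Lines that do not contain both words are left untouched, because the lookahead fails and nothing is replaced.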

Renaming fasta headers to bracketed text

I have a file with 250 fasta sequences. Right now, the headers look like this:
>NP_041982.1 DNA polymerase [Enterobacteria phage T7]
I want to change the headers so they look like this:
>Enterobacteria phage T7
For each header, I only want what is in-between the brackets. I'm trying to do this through linux commands.
Can anyone help with this?
file.fa contents
>Sequence One [Species 1]
actgtattagctaatcgatcagttacgattcga
tagctacgtacgtacgatcgatcagtcagctag
>Sequence Two [Species 2]
ttgtagctagctagctagctagctagctacgta
tgcatcgatcgattaatatcgcgccctaactcg
>Sequence Three
atgatagtctggtcatcgattcagtcagttcat
ttgcatgatctactagatcgatattagctagat
>Sequence Four [early bracket] text
tagctacgtacgatcgtacgatcgatcgtatat
gctagtcgactagctagctacgtacgtacgtaa
sed command:
sed 's#^>[^[]*\[\([^]]*\)]$#>\1#g' file.fa
It looks a bit convoluted, but it means...
take any string of characters that matches the pattern of "a line that starts with >, followed by any number of characters besides [, followed by a literal [, followed by any number of characters besides ], followed by ] at the end of the line". Capture the string between the brackets, and replace the entire match with just the thing in the brackets.
prints the output
>Species 1
actgtattagctaatcgatcagttacgattcga
tagctacgtacgtacgatcgatcagtcagctag
>Species 2
ttgtagctagctagctagctagctagctacgta
tgcatcgatcgattaatatcgcgccctaactcg
>Sequence Three
atgatagtctggtcatcgattcagtcagttcat
ttgcatgatctactagatcgatattagctagat
>Sequence Four [early bracket] text
tagctacgtacgatcgtacgatcgatcgtatat
gctagtcgactagctagctacgtacgtacgtaa
the output can be saved to a new file with
sed 's#^>[^[]*\[\([^]]*\)]$#>\1#g' file.fa > converted_filename.fa
Note that any headers without matches are printed as-is, and any lines that have characters after the final bracket will also be printed as-is. Might act odd if it encounters left brackets that are not closed on the same line. I'd recommend you double check that the new file has the same number of lines as the original.
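If sed is not available, roughly the same substitution can be sketched in Python. The names HEADER and rename_headers are hypothetical, and the sketch assumes the lines have already been read without their trailing newlines.

```python
import re

# A header line: ">", anything but "[", then "[captured text]" at the end of the line
HEADER = re.compile(r"^>[^\[]*\[([^\]]*)\]$")

def rename_headers(lines):
    # Non-matching lines (sequence data, headers without brackets) pass through unchanged
    return [HEADER.sub(r">\1", line) for line in lines]

print(rename_headers([">NP_041982.1 DNA polymerase [Enterobacteria phage T7]"]))
```

As with the sed command, headers with text after the closing bracket are left as they are.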

Remove the first character of each line and append using Vim

I have a data file as follows.
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
Using vim, I want to remove the 1's from the start of each line and append them to the end. The resulting file would look like this:
14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,1
13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185,1
14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480,1
13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735,1
I was looking for an elegant way to do this.
Actually I tried it like
:%s/$/,/g
And then
:%s/$/^./g
But I could not make it to work.
EDIT: Well, actually I made one mistake in my question. In the data file, the first character is not always 1; it is a mixture of 1, 2 and 3. So, from all the answers to this question, I came up with the solution:
:%s/^\([1-3]\),\(.*\)/\2,\1/g
and it is working now.
A regular expression that doesn't care which number, how many digits, or which separator you've used. That is, this would work whether a line has 1 or 114 as its first number:
:%s/\([0-9]*\)\(.\)\(.*\)/\3\2\1/
Explanation:
:%s// - Substitute on every line (%)
\(<something>\) - Capture <something> and store it as \1, \2, ... in order
[0-9]* - A digit, zero or more times
. - Any single character; in this case, the comma
.* - Any character, zero or more times
\3\2\1 - Replace the match with the captured groups in reverse order
So: cut the line up into 1, the comma, and <the rest> as \1, \2 and \3 respectively, and reorder them.
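The same three-group reordering can be tried in Python as a quick check. This is a sketch only; the function name move_first_field_to_end is invented here, and Python writes groups as ( ) where Vim uses \( \).

```python
import re

def move_first_field_to_end(line: str) -> str:
    # \1 = leading digits, \2 = the separator, \3 = the rest of the line
    return re.sub(r"([0-9]*)(.)(.*)", r"\3\2\1", line, count=1)

print(move_first_field_to_end("1,14.23,1.71,2.43"))
```

count=1 mirrors the Vim command, which has no g flag and so replaces only the first match on each line.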
This
:%s/^1,//
:%s/$/,1/
could be somewhat simpler to understand.
:%s/^1,\(.*\)/\1,1/
This will do the replacement on each line in the file. The \1 replaces everything captured by the (.*)
:%s/1,\(.*$\)/\1,1/gc
.........................
You could also solve this one using a macro. First, think about how to delete the 1, from the start of a line and append it to the end:
0 go to the start of the line
df, delete everything to and including the first ,
A,<ESC> append a comma to the end of the line
p paste the thing you deleted with df,
x delete the trailing comma
So, to sum it up, the following will convert a single line:
0df,A,<ESC>px
Now if you'd like to apply this set of modifications to all the lines, you will first need to record them:
qj start recording into the 'j' register
0df,A,<ESC>px convert a single line
j go to the next line
q stop recording
Finally, you can execute the macro anytime you want using @j, or convert your entire file with 99@j (using a higher number than 99 if you have more than 99 lines).
Here's the complete version:
qj0df,A,<ESC>pxjq99@j
This one might be easier to understand than the other solutions if you're not used to regular expressions!

Regex for splitting below data

I have data, one item per line, in a file as below
BMT.PQ
DMZ.IV
VLD.Q
WPS.T
I am looking for a regex to split out into two categories of output
One where starting letter of the data is between A to M
and other
where starting letter of the data is from N to Z
I tried this
[A-M].* for getting first half of data with first beginning letter from A to M
and i was expecting a result/regex text match of
only :
BMT.PQ
DMZ.IV
but it also gave a match for
LD.Q which was incorrect for me.
I even unsuccessfully tried [(A-M)(A-M)(A-M)].*
Basically i want to split based on starting letter in the data. One half for data beginning with letters from A to M and second half for data beginning with letter N to Z.
You are close, all you need is to add the ^ for start of string and $ for end of string.
^[A-M].*$
and
^[N-Z].*$
Make sure you enable multiline mode. Multiline mode (usually the m flag) allows ^ and $ to detect start of line and end of line respectively.
The caret symbol represents the start of the string being searched/matched. The two regexes that you probably need are:
^[A-M]
^[N-Z]
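As a quick check of the anchored patterns with multiline mode, here is a short Python sketch. The variable names are made up for the example; the data is the sample from the question.

```python
import re

data = "BMT.PQ\nDMZ.IV\nVLD.Q\nWPS.T"

# re.MULTILINE makes ^ and $ match at the start and end of every line
first_half = re.findall(r"^[A-M].*$", data, flags=re.MULTILINE)
second_half = re.findall(r"^[N-Z].*$", data, flags=re.MULTILINE)

print(first_half)   # ['BMT.PQ', 'DMZ.IV']
print(second_half)  # ['VLD.Q', 'WPS.T']
```

Without the ^ anchor, [A-M].* would also pick up LD.Q from inside VLD.Q, which is the unwanted match described in the question.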