vim: substitute specific character, but only after nth occurrence - regex

I need to make this exercise about regexes and text manipulation in vim.
So I have this file about the most scoring soccer players in history, with 50 entries looking like this:
1 Cristiano Ronaldo Portugal 88 121 0.73 03 Manchester United Real Madrid
The whitespaces between the fields are tabs (\t)
The fields each correspond to a different category: etc...
This last field contains one or more clubs the player has played in. (so not a fixed number of clubs)
The question: replace all tabs with a ';', except for the last field, where the clubs need to be separated by a ','.
So I thought: I just replace all of them with a comma, and then I replace the first 7 commas with a semicolon. But how do you do that? Everything - from regex to vim commands - is allowed.
The first part is easy: :2,$s/\t/,/g
But the second part, I can't seem to figure out.
Any help would be greatly appreciated.
Thanks, Zeno

This answer is similar to @Amadan's, but it makes use of the ability to provide an expression as the replace string to actually do the difficult bit of changing the first set of tabs to semicolons:
%s/\v(.{-}\t){7}/\=substitute(submatch('0'), '\t', ';', 'g')/|%s/\t/,/g
Broken down this is a set of three substitute commands. The first two are cobbled together with a sub-replace-expression:
%s/\v(.{-}\t){7}/\=substitute(submatch('0'), '\t', ';', 'g')/
What this does is find exactly seven occurrences ({7}) of any characters followed by a tab, matched non-greedily ((.{-}\t)). Then we replace this entire match (submatch(0)) with the result of the substitute expression (\=substitute(...)). The substitute expression is simple by comparison, as it just converts all tabs to semicolons.
The last substitute just changes any other tabs on the line to commas.
See :help sub-replace-expression
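Outside Vim, the same two-step idea can be sketched in Python. This is only an illustration, not part of the original answer: the function name convert_line is made up, and the sample string assumes the tab positions implied by the question (seven tabs before the club list), since the tabs in the posted sample are not visible.

```python
import re

def convert_line(line: str, n: int = 7) -> str:
    """Turn the first n tabs into ';' and any later tabs into ','."""
    # Match the first n "field + tab" groups at the start of the line,
    # much like \v(.{-}\t){7} does in the Vim pattern.
    head = re.compile(r"^(?:[^\t]*\t){%d}" % n)
    # The replacement is computed from the match, like \=substitute(...).
    line = head.sub(lambda m: m.group(0).replace("\t", ";"), line, count=1)
    # Whatever tabs are left belong to the club list.
    return line.replace("\t", ",")

# Assumed layout of the sample row: 7 tabs, then the tab-separated clubs.
sample = "1\tCristiano Ronaldo\tPortugal\t88\t121\t0.73\t03\tManchester United\tReal Madrid"
converted = convert_line(sample)
print(converted)
```

As in the Vim version, a line with fewer than n tabs is not touched by the first step, so all of its tabs end up as commas.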

Here's one way you could do it:
:let @q=":s/\t/;\<cr>"
:2,$norm 7@q
:2,$s/\t/,/g
Explanation:
First, we define a macro 'q' that will replace one tab with a semicolon. Now, on any line we can simply run this macro n times to replace the first n tabs. To automatically do this to every line, we use the norm command:
:2,$norm 7@q
This is essentially the same thing as literally typing 7@q (i.e. "run macro 'q' seven times") on every line in the specified range. From there, we can simply replace every remaining tab with a comma.
:2,$s/\t/,/g

:2,$s/\t\(.*\t\)\@=/;/g
:2,$s/\t/,
Change any tabs where there is a tab later to ;
Change any remaining tabs to ,
EDIT: Misunderstood. Here is a fixed version:
:2,$s/\(\(\t.*\)\{7}\)\@<=\t/,/g
:2,$s/\t/;/g
Change any tabs where there's seven tabs before it to ,
Change any remaining tabs to ;

My PatternsOnText plugin has (among others) a :SubstituteSelected command that allows to specify the match positions. With this, you can easily replace the first 8 tabs with semicolons, and then use a regular substitute to change the remaining tabs into commas:
:2,$SubstituteSelected/\t/;/g 1-8
:2,$s/\t/,/g

We solved the issue by just capturing the first 8 groups manually ([^\t]*\t)(...)(...), then separating them with semicolons (\1;\2;...;), and then replacing the remaining tabs with commas | 2,$s/\t/,/g
Thanks to everyone trying to help!

Related

Need a regex to modify a second match and ignore the first and last

Library context, using MarcEdit which can also use regex.
I need this:
=773 \\$tEtudes inuit$x0701-1008$1Vol. 44 1-2, $2p. 53-84
to be changed to this:
=773 \\$tEtudes inuit$x0701-1008$1Vol. 44, no. 1-2, $2p. 53-84
Problem is, the 44 in this case and the 1-2 are numbers that will change from one book to the other and I am building commands to automate it.
I tried focusing on changing the space between the 44 and the 1-2 into a ', no. ' with \s but it obviously changes all space characters.
The adding ', no. ' is easy because there is a different box for it but I can't focus on the 2nd space while ignoring the first and last and also keeping every characters before and after.
Thank you for helping, I've been looking/trying all day!
MarcEdit example
If the regular expression implementation supports look ahead, you can require that this space is followed by a range and a comma:
Find: \s(?=\d+-\d+,)
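Replace: ", no. " (without the quotes; keep the trailing space, because the matched space is replaced and the lookahead itself consumes nothing)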
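To check the lookahead outside MarcEdit, here is a small Python sketch. It assumes Python's re module as a stand-in for MarcEdit's regex engine, and the variable names are only for illustration; the field value is the one from the question.

```python
import re

# The 773 field from the question, as a raw string so the backslashes survive.
field = r"=773 \\$tEtudes inuit$x0701-1008$1Vol. 44 1-2, $2p. 53-84"

# Replace only the whitespace that is directly followed by "digits-digits,".
# The lookahead checks for the range but does not consume it.
fixed = re.sub(r"\s(?=\d+-\d+,)", ", no. ", field)
print(fixed)
```

The other spaces are left alone because none of them is followed by a number range and a comma: "0701-1008" is preceded by "$x" rather than a space, and "53-84" has no comma after it.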

Using Regex to clean a csv file in R

This is my first post so I hope it is clear enough.
I am having a problem regarding cleaning my CSV files before I can read them into R and have spent the entire day trying to find a solution.
My data is supposed to be in the form of two columns. The first column is a timestamp consisting of 10 digits and the second an ID consisting of 11 or 12 Letters and numbers (the first 6 are always numbers).
For example:
logger10 |
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F0831102744
010700EDDA0F|
would become:
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F
0831102744 | 010700EDDA0F
(please excuse the lines in the middle, that was my attempt at separating the columns...).
The csv file seems to occasionally be missing a comma which means that sometimes one row will end up like this:
0923120531,010300033ADD0925075301,010700EDD00A
My hardware also adds the word logger10 (or whichever number logger this is) whenever it restarts which gives a similar problem e.g. logger10logger100831102744.
I think I have managed to solve the logger text problem (see code) but I am sure this could be improved. Also, I really don't want to delete any of the data.
My real trouble is making sure there is a line break in the right place after the ID and, if not, I would like to add one. I thought I could use regex for this but I'm having difficulty understanding it.
Any help would be greatly appreciated!
Here is my attempt:
temp <- list.files(pattern="*.CSV") #list of each csv/logger file
for(i in temp){
#clean each csv
tmp<-readLines(i) #check each line in file
tmp<-gsub("logger([0-9]{2})","",tmp) #remove logger text
pattern <- ("[0-9]{10}\\,[0-9]{6}[A-Z,0-9]{5,6}") #regex pattern ??
if (tmp!= pattern){
#I have no idea where to start here...
}
}
here is some raw data:
logger01
0729131218,020700EE1961
0729131226,020700EE1961
0831103159,0203000316DB
0831103207,0203000316DB0831103253,010700EDE28C
0831103301,010700EDE28C
0831103522,010300029815
0831103636,010300029815
0831103657,020300029815
If you want to do this in a single pass:
(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?
can be replaced with
\1\t\2\n
What this does is look for any of those rogue logger01 entries (including the space after it) optionally: That trailing ? after the group means that it can match 0 or 1 time: if it does match, it will. If it's not there, the match just keeps going anyway.
Following that, you look for (and capture) exactly 10 hex values (either digits or A-F). The ,? means that if a comma exists, it will match, but it can match 0 or 1 time as well (making it optional).
Following that, look for (and capture) exactly 12 hex values. Finally, to get rid of any strange trailing spaces, the ? (a space character followed by ?) will optionally match the trailing space.
Your replacement will replace the first captured group (the 10 hex digits), add in a tab, replace the second captured group (the 12 hex digits), and then a newline.
You can see this in use on regex101 to see the results. You can use code generator on the left side of that page to get some preformatted PHP/Javascript/Python that you can just drop into a script.
If you're doing this from the command line, perl could be used:
perl -pe 's/(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?/\1\t\2\n/g'
If another language, you may need to adapt it slightly to fit your needs.
EDIT
Re-reading the OP and comments, a slightly more rigid regex could be
(?:logger\d\d\ )?([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})\ ?
I updated the regex101 link with the changes.
This still looks for the first 10 hex values, but now looks for exactly 6 digits, followed by 5-6 hex values, so the total number of characters matched is 11 or 12.
The replacement would be the same.
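For comparison, here is a rough Python sketch of the same idea. It is a variation rather than this answer's exact replace: it strips the logger text first (as the question's gsub does) and then pulls out every timestamp/ID pair with findall instead of rewriting the text. The names RECORD and clean are made up for the example, and the input is a few lines of the raw data from the question.

```python
import re

# Timestamp: 10 digits. ID: 6 digits followed by 5 or 6 hex characters.
# The comma is optional because the logger sometimes drops it.
RECORD = re.compile(r"(\d{10}),?(\d{6}[\dA-F]{5,6})")

def clean(raw: str) -> list[tuple[str, str]]:
    """Return (timestamp, id) pairs found anywhere in the raw text."""
    # Remove the restart marker first, as the question's gsub does.
    raw = re.sub(r"logger\d{2}", "", raw)
    return RECORD.findall(raw)

raw = (
    "logger01\n"
    "0729131218,020700EE1961\n"
    "0831103207,0203000316DB0831103253,010700EDE28C\n"
)
rows = clean(raw)
for timestamp, ident in rows:
    print(f"{timestamp},{ident}")
```

One caveat: because {5,6} is greedy, an 11-character ID that runs straight into the next timestamp would swallow that timestamp's first digit, so such rows would need an extra check.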
Paste your regex here https://regex101.com/ to see whether it catches all cases. The 5 or 6 letters or digits could pose an issue as it may catch the first digit of the timestamp when the logger misses out a comma. Append an '\n' to the end of the tmp string should work provided the regex catches all cases.

Regex selecting the last 6 numbers of

I am a noob at regex and i've been trying to select 6 numbers from within a file and then replace those 6 numbers with the same numbers plus , new line (making a CSV obviously).
Anyway sample data is simply nonsense like this:
fafksadjlkgtjafglkj210000adsfaklgjadklgjag3600001skfjaklaj093i393593390000002sadfljafkjgakjgasafksadjlkgtjafglkj£94.00 489438adsfaklgjadklgjag7700001skfjaklaj093i393593390000002ssafksa djlkgtjafglkj000000adsfaklgjadklgjag0000001skfj aklaj093i393593£39.00900002ssafksadjlk gtjafglkj000000adsfaklgjadklgjag0000001skfjaklaj093i3935£933.90000002s
Note some of the numbers are attached to currency values as well (and some are next to it but contain a space before hand) but the end will always be 6 numbers (consider them to be random as I can't see a pattern).
So I basically need to select strings matching numerics that are six digits long or longer, if longer then it just uses the last 6 digits.
Then I will replace it with itself and a comma and new line.
I hope that makes sense, i've tried a few things without success..
Thanks, edit the closest I have is:
(\d)\d{6}(?!\d)
In the Find what: text field, type in (\d{6})(\D). In the Replace with: text field, type in $1\r\n$2. Make sure that the regular expression radio button is selected. For your input, that should yield this:
fafksadjlkgtjafglkj210000
adsfaklgjadklgjag3600001
skfjaklaj093i393593390000002
sadfljafkjgakjgasafksadjlkgtjafglkj£94.00 489438
adsfaklgjadklgjag7700001
skfjaklaj093i393593390000002
ssafksa djlkgtjafglkj000000
adsfaklgjadklgjag0000001
skfj aklaj093i393593
£39.00900002
ssafksadjlk gtjafglkj000000
adsfaklgjadklgjag0000001
skfjaklaj093i3935£933.90000002
s
You want
\d{6}(?=\D*$)
Read more about anchors here.
i've been trying to select 6 numbers from within a file and then replace those 6 numbers with the same numbers plus , new line
So you're basically trying to do this, right?:
Find:
(\d{6})(\D)
Replace:
\1\n\2
How about:
Find what: (\d{6,})(?:\D*)$
Replace with: $1,\n
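If the goal is "every run of six or more digits, keeping only the last six", a negative lookahead is one way to express it. The Python sketch below is just an illustration under that reading of the question; the function name and the short test string are invented, not taken from the sample data.

```python
import re

def mark_last_six(text: str) -> str:
    """Append ',' and a newline after the last six digits of each digit run."""
    # (\d{6})(?!\d): six digits that are NOT followed by another digit,
    # i.e. the final six digits of any run that is at least six long.
    return re.sub(r"(\d{6})(?!\d)", r"\1,\n", text)

demo = "ab210000cd3600001ef12345gh"
result = mark_last_six(demo)
print(result)
```

In the demo string the seven-digit run 3600001 keeps its leading 3 on the previous line, and the five-digit run 12345 is left untouched because it is too short.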

Regex for single space

I'm trying to match a file which is delimited by multiple spaces. The problem I have is that the first field can contain a single space. How can I match this with a regex?
Eg:
Name           Other Data    Other Data 2
Bob Smith      XX1           0101010101
John Doe       XX2           0101010101
Bob Doe        XX3           0101010101
John Smith     XX4           0101010101
Can I split these lines into three fields with a regex, splitting by a space but allowing for the single space in the first field?
Hi the following regex should work
(\w*\s\w*)\s+\w{2}\d\s+\d*
This would work:
Pattern:
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
Replacement:
+$1+ -$2- *$3*
$1 contains the first column, $2 the second and $3 the third one.
Example:
http://regexr.com?32tbt
You could split at two or more spaces:
[ ]{2,}
But you are probably better off, determining the lengths of the captures of this regular expression:
(Name[ ]+)(Other Data[ ]+)
And then to use a simple substring method that slices your lines into portions of the same length.
So in your case the first capture would be 15 characters long, the second 14 and the column would have 13 (but the last one doesn't really matter, which is why it isn't actually captured). Then you take the first 15, the next 14 and the remaining characters of every line and trim each one (remove trailing whitespace).
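Both approaches can be sketched in a few lines of Python. This is only an illustration: the widths 15 and 14 are the ones mentioned above, the helper names are made up, and the test line is built with ljust so the padding is explicit.

```python
import re

def split_on_gaps(line: str) -> list[str]:
    """Split on runs of two or more spaces; single spaces stay inside a field."""
    return re.split(r" {2,}", line.strip())

def split_fixed(line: str, widths=(15, 14)) -> list[str]:
    """Slice the line at fixed column widths and trim each piece."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    fields.append(line[start:].strip())  # the last column takes the rest
    return fields

# A row padded to the assumed column widths (15 and 14 characters).
row = "Bob Smith".ljust(15) + "XX1".ljust(14) + "0101010101"
by_gaps = split_on_gaps(row)
by_width = split_fixed(row)
print(by_gaps, by_width)
```

The fixed-width version still works when a name fills its whole column and no gap is left, which is where splitting on two or more spaces breaks down.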
I think the simplest is to use a regex that matches two or more spaces.
/  +/
Which breaks down as... delimiter (/) followed by a space ( ) followed by another space one or more times ( +) followed by the end delimiter (/ in my example, but this is language specific).
So simply put, use regex to match space, then one or more spaces as a means to split your string.
Usually, with this kind of files, the best approach is to get a substring based on where your required information is and then trim it. I see your file contains 16 chars before the second field, you can get a substring of length 16 from the beginning which will contain your desired text. You should trim it to get only the text you need without the spaces.
If the spacing pattern you posted is consistent (if it won't change among different files of this kind) you have also another problem: what happens to longer names?
Name           Other Data
Johnny AppleseeXX1
TutankamonfirstXX2
if you really want to use a regex, be sure to avoid those corner cases.

Replacing multiple blank lines with one blank line using RegEx search and replace

I have a file that I need to reformat and remove "extra" blank lines.
I am using the Perl syntax regular expression search and replace functionality of UltraEdit and need the regular expression to put in the "Find What:" field.
Here is a sample of the file I need to re-format.
All current text
REPLACE with all the following:
Winter 2011 Class Schedule
Winter 2011 Class Registration Dates: Dec. 6, 2010 – Jan. 1, 2011
Winter 2011 Class Session Dates: Jan. 5 – Feb. 12, 2011
DANCE
Adventures in Ballet & Tap
3 – 6 years Instructor: Ann Newby
Tots ages 3 – 6 years old develop a greater sense of rhythm, flexibility and coordination as they explore the basic elements of movement.
Saturdays 9 - 10 a.m. Jan. 8 – Feb. 12 Six-week fees: $30
African Storytelling
3 – 6 years Instructor: Ann Newby
Tots ages 3 – 6 years old explore storytelling and fables through spoken word, music, movement and visual arts experiences.
Saturdays 10 – 11 a.m. Jan. 8 – Feb. 12 Six-week fee: $30
African Dance / Children
You'll notice that some of the double blank lines have spaces or tabs or both in them.
After the search and replace has been run I should have a file that looks like this.
All current text
REPLACE with all the following:
Winter 2011 Class Schedule
Winter 2011 Class Registration Dates: Dec. 6, 2010 – Jan. 1, 2011
Winter 2011 Class Session Dates: Jan. 5 – Feb. 12, 2011
DANCE
Adventures in Ballet & Tap
3 – 6 years Instructor: Ann Newby
Tots ages 3 – 6 years old develop a greater sense of rhythm, flexibility and coordination as they explore the basic elements of movement.
Saturdays 9 - 10 a.m. Jan. 8 – Feb. 12 Six-week fees: $30
African Storytelling
3 – 6 years Instructor: Ann Newby
Tots ages 3 – 6 years old explore storytelling and fables through spoken word, music, movement and visual arts experiences.
Saturdays 10 – 11 a.m. Jan. 8 – Feb. 12 Six-week fee: $30
African Dance / Children
Replacing
^(\s*\r\n){2,}
With
\r\n
Is what I ended up with.
This only selects blank lines in multiples of two or more and replaces them with one.
It depends what the line endings are. Assuming \n, replace this:
([ \t]*\n){3,}
with \n\n.
Try this perl oneliner perl -00pe0, if you want in place editing, just add -i option
Replacing
\n\s*\n\s*
with
\n\n
should do the trick
For completeness I want to reference here the large post Remove / delete blank and empty lines in the user forums of UltraEdit which contains at bottom after all the explanations for newbies the solution for reducing two or more lines with nothing (empty lines) or just whitespaces (blank lines) to one empty line independent on line terminator type.
And some words on what Alan Moore wrote in his answer:
UltraEdit's Perl regular expression support is not crippled by its line-based architecture. Perl regular expression engines have a flag which determine if a dot matches all characters except newline characters like carriage return (CR) and line feed (LF) or really all characters including CR and LF. This makes the difference if a text file is interpreted as large byte stream or as a sequence of lines for Perl regular expression finds/replaces. In UltraEdit the flag is set by default to not include \r (CR) and \n (LF) by a dot in the regular expression search string. But this behavior can be easily changed in UltraEdit by starting the regular expression string with (?s) which changes the value of the flag match_not_dot_newline as posted in UltraEdit user forums at topic "." in Perl regular expressions doesn't include CRLFs?
A Perl regular expression replace working for files with
carriage return + line feed (DOS/Windows) or
only line feed (Unix, Mac OS 10.0 and later versions) or
only carriage return (Mac OS 9 and previous versions)
as line ending with optionally trailing spaces and tabs at end of a paragraph (one or more lines) and with two or more lines without (empty line) or with whitespaces (blank line) below the paragraph could be done with search string \h*(\r?\n|\r)(?:\h*\1){2,} and \1\1 as replace string.
Explanation:
\h* matches any horizontal whitespace character according to Unicode 0 or more times. This first part of the search expression matches horizontal whitespace characters at end of a line like horizontal tabs, normal spaces, no-break-spaces and some other not often used spaces.
The usage of \s is not good as this character class matches any whitespace character including the vertical whitespace characters carriage return and line feed.
(\r?\n|\r) ... is an OR expression with two arguments in a marking group. The first argument matches a line feed optionally with a preceding carriage return while the second argument matches just a carriage return. So this expression matches all three common types of line terminations completely correct. It is important for the rest of the search and the replace to match always either CR+LF (both together) or just LF or just CR.
(?:\h*\1) ... is a non marking group which matches 0 or more horizontal whitespaces and the newline as found before back-referenced with \1, i.e. CR+LF or just LF or just CR. So this part of the expression finds an empty or blank line.
{2,} ... is a multiplier for the previous expression in the non marking group which means at least two times. So after end of a paragraph there must be two or more empty or blank lines. Only one empty or blank line below a paragraph is not enough for a positive match of search expression.
The replace string \1\1 references twice the first found line break.
The advantage of this regular expression in comparison to the others posted here is that the line ending type does not need to be known. The search expression finds that out, and the line ending found is referenced in the replace string. Any trailing whitespace at the end of a paragraph and whitespace on the following lines is also removed by this regular expression replace if there are two or more empty or blank lines below a paragraph.
{2,} can be replaced by + in search string if trimming whitespaces at end of a paragraph and on next empty or blank line should be also done on running this Perl regular expression replace. But please note that in this case the replace makes replaces which do not change anything at all if there are not trailing whitespaces at end of a paragraph and next line is an empty line.
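The same expression can be tried outside UltraEdit. The Python sketch below is an adaptation, not the UltraEdit search string itself: Python's re module has no \h, so [ \t] stands in for "horizontal whitespace" (which drops the Unicode spaces \h would cover), and the function name is invented.

```python
import re

# [ \t]* replaces \h* because Python's re module does not support \h.
# Group 1 remembers which line ending was found (CRLF, LF or CR);
# \1 then requires the same ending on the following blank lines.
BLANK_RUN = re.compile(r"[ \t]*(\r?\n|\r)(?:[ \t]*\1){2,}")

def squeeze_blank_lines(text: str) -> str:
    """Reduce two or more empty/blank lines to exactly one empty line."""
    return BLANK_RUN.sub(r"\1\1", text)

unix_text = "a  \n\n \t\n\nb\n"
dos_text = "a\r\n\r\n\r\nb"
print(repr(squeeze_blank_lines(unix_text)))
print(repr(squeeze_blank_lines(dos_text)))
```

A paragraph followed by a single empty line is left alone, since {2,} needs at least two blank lines after the paragraph's own line ending.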
In Vim, Using
:%!cat -s
I find this is the easiest way to delete extra empty line so far.
I'm not sure what UltraEdit lets you get away with in the "replace" area, but if you cannot use a newline (I've had this problem before) but can use capture references, this might work:
Find : \s*(\r\n)\s*(\r\n)\s*\r\n
Replace : $1$2
Not tested extensively, but seems to work on the sample you provided.
See this thread for what's causing the problem. As I understand it, UltraEdit regexes are greedy at the character level (i.e., within a line), but non-greedy at the line level (roughly speaking). I don't have access to UE, but I would try writing the regex so it has to match something concrete after the last blank line. For example:
search: (\r\n[ \t]*){2,}(\S)
replace: $1$2
This matches and captures two or more instances of a line separator and any horizontal whitespace that follows it, but it only retains the last one. The \S should force it to keep matching until it finds a line with at least one non-whitespace character.
I admit that I don't have a whole lot of confidence in this solution; UltraEdit's regex support is crippled by its line-based architecture. If you want an editor that does regexes right, and you don't want to learn a whole new regex syntax (like vim's), get EditPadPro.
Should also work with spaces on blank lines
Search - /\n^\s*\n/
Replace - \n\n
In my IntelliJ IDE, what worked was searching for \n\n and replacing it with \n