vim: search, capture & replace on different lines using regex - regex

Relatively new linux/vim/regex user here. I want to use regex to search for a numerical patterns, capture it, and then use the captured value to append a string to the previous line. In other words...I have a file of format:
title: description_id
text: {en: '2. text description'}
I want to capture the values from the text field and append them to the beginning of the title field...to yield something like this:
title: q2_description_id
text: {en: '2. text description'}
I feel like I've come across a way to reference other lines in a search & replace but am having trouble finding that now. Or maybe a macro would be suitable. Any help would be appreciated...thanks!

Perhaps something like:
:%s/\(title: \)\(.*\n\)\(text: \D*\)\(\d*\)/\1q\4_\2\3\4/
Where we are searching for 4 parts:
"title: "
rest of line and \n
"text: " and everything until next digit in line
first string of consecutive digits in line
and spitting them back out, with 4) inserted between 1) and 2).
EDIT: Shorter solution by Peter in the comments:
:%s/title: \zs\ze\_.\{-}text: \D*\(\d*\)/q\1_/

Use \n for the new lines (and ^v+enter for new lines on the substitute line): A quick and not very elegant example:
:%s/title: description_id\n\ntext: {en: '\(\i*\)\(.*\)/title: q\1_description_id^Mtext: {en: '\1\2/

Related

Tcl - How to Add Text after last character through regex?

I need a tip, tip or suggestion followed by some example of how I can add an extension in .txt format after the last character of a variable's output line.
For example:
set txt " ONLINE ENGLISH COURSE - LESSON 5 "
set result [concat "$txt" .txt]
Print:
Note that there is space in the start, means and fin of the variable phrase (txt). What must be maintained are the spaces of the start and means. But replace the last space after the end of the sentence, with the format of the extension [.txt].
With the built-in concat method of Tcl, it does not achieve the desired effect.
The expected result was something like this:
ONLINE ENGLISH COURSE - LESSON 5.txt
I know I could remove spaces with string map but I don't know how to remove just the last occurrence on the line.
And otherwise I don’t know how to remove the last space to add the text [.txt]
If anyone can point me to one or more solutions, thank you in advance.
set result "[string trimright $txt].txt"
or
set result [regsub {\s*$} $txt ".txt"]

Mass regex search-and-replace BETWEEN patterns

I have a directory with a bunch of text files, all of which follow this structure:
...
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
- Again, some list items of random text
- Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
....
And I need to run a replace operation (let's say, I need to prepend CCC at the beginning of the line, just after the dash) on only those "list items", which are between PATTERN_A and PATTERN_B. The problem is they aren't really much different from the text above PATTERN_A, or below PATTERN_B, so an ordinary regex can't really catch them without also affecting the remaining text.
So, my question would be, what tool and what regex should I use to perform that replacement?
(Just in case, I'm fine with Vim, and I can collect those files in a QuickFix for a further :cdo, for example. I'm not that good with awk, unfortunately, and absolutely bad with Perl :))
Thanks!
If I have understood your questions, you can do so quite easily with a pattern-range selection and the general substitution form with sed (stream editor). For example, in your case:
$ sed '/PATTERN_A/,/PATTERN_B/s/^\([ ]*-\)/\1CCC/' file
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
-CCC Again, some list items of random text
-CCC Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
(note: to substitute in place within the file add the -i option, and to create a backup of the original add -i.bak which will save the original file as file.bak)
Explanation
/PATTERN_A/,/PATTERN_B/ - select lines between PATTERN_A and PATTERN_B
s/^\([ ]*-\)/\1CCC/ - substitute (general form 's/find/replace/') where find is from beginning of line ^ capturing text between \(...\) that contains [ ]*- (any number of spaces and a hyphen) and then replace with \1 (called a backreference that contains all characters you captured with the capture group \(...\)) and appending CCC to its end.
Look things over and let me know if you have questions or if I misinterpreted your question.
With Perl also, you can get the results
> perl -pe ' { s/^(\s*-)/\1CCC/g if /PATTERN_A/../PATTERN_B/ } ' mass_replace.txt
...
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
-CCC Again, some list items of random text
-CCC Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
....
>

Remove lines that is shorter than or equal 5 characters after the : using Notepad++

The question is like: Remove lines that is shorter than 5 characters before the # using Notepad++
But it differs a bit...
I have like that:
abc:123
abc:1234
abc:12345
PLEASE NOTE: abc is not on all the lines, it is just an example.
I want to remove the first line in the previous example because 123 which is after : is shorter than or not equal to 5 characters.
Any help would be appreciated.
Thanks!
Open Notepad++ find and replace choose regex mode in the search and place ^((?!.+:\d{5,}).)*$ in search and keep replace with blank and press replaceAll
^((?!.+:\d{5,}).)*$
Without knowing the language there is only so much help I can offer. I'll give you an example of how I would solve this problem in C#.
Start by creating a string for your updated file (without the short lines)
string content = "";
Read a line in from your file.
Then get a substring of the line you read in - the abc: portion and check the length.
line = line.substring(indexof(":"), length - indexof(":"))
if(line.length > 5)
{
content += line;
}
At the end, truncate your file and write content to it.

Advanced text replacement (cloze deletion)

Well, I'd like to replace specific texts based on text, yeah sounds funny, so here it is.
The problem is how to replace the tab-separated values. Essentially, what I'd like to do is replace the matching vocabulary string found on the sentence with {...}.
The value before the tab \t is the vocab, the value after the tab is the sentence. The value on the left of the \t is the first column, to its right is the second column
TL;DR Version (English Version)
Essentially, I want to replace the text on the second column based on the first Column.
Examples:
ABCD \t 19475ABCD_97jdhgbl
would turn into
ABCD \t 19475{...}_97jdhgbl
ABCD is the first column here and 19475ABCD_97jdhgbl is the second one.
If you don't get the context of the Long Version below, solving this ABCD problem would be fine by me. I think it's quite a simple code but given that it's been about 4 years since I last coded in C and I've only recently started learning python, I can't do it.
Long Version: (Japanese-specific text)
1. Case 1: (For pure Kanji)
全部 \t それ、全部ください。
would become
全部 \t それ、{...}ください。
2. Case 2: (For pure Kana)**
ああ \t ああうるさい人は苦手です。
would become
ああ \t {...}うるさい人は苦手です。
あいづち \t 彼の話に私はあいづちを打ったの。
would become
あいづち \t 彼の話に私は{...}を打ったの。
For Case 1 and Case 2 it has to be exact matches, especially for kana because otherwise it might replace other kana in the sentence. The coding for Case 3 has to be different (see next).
3. Case 3: (for mixed Kana and Kanji)
This is the most complex one. For this one, I'd like the script/solution to change only the matching strings, i.e., it will ignore what doesn't match and only replace those with found matches. What it does is it takes the longest possible match and replace accordingly.
上げる \t 彼は荷物をあみだなに上げた。
would become
上げる \t 彼は荷物をあみだなに{...}た。
Note here that the first column has 上げる but the second column has 上げた because it has changed in tense (First column has る while the second one has た).
So, Ideally the solution should take the longest string found in both columns, in this case it is 上げ, so this is the only string replaced with {...}, while it leaves た.
Another example
が増える \t 値段がが増える
would become
が増える \t 値段が{...}
More TL;DR
I'm actually using this for Anki.
I could use excel or notepad++ but I don't think they could replace text based on placeholders.
My goal here is to create pseudo-cloze sentences that I can use as hints hidden in a hint field only to be used for ridiculously hard synonyms or homonyms (I have an Auditory card).
I know I'm missing a fourth case, i.e., pure kana with the possibility of a sentence having changed its tense, hence its spelling. Well, that'd be really hard to code so I'd rather do it manually so as not to mess up the other kana in the sentence.
Update
I forgot to say that the text is contained in a .txt file in this format:
全部 \t それ、全部ください。
ああ \t ああうるさい人は苦手です。
あいづち \t 彼の話に私はあいづちを打ったの。
上げる \t 彼は荷物をあみだなに上げた。
There are about 7000 lines of those things so it has to check the replacements for every line.
Code works, thanks, just a minor bug with sentences including non-full replacements, it creates broken characters.
上げたxxxx 彼は荷物をあみだなに上げあ。
ABCD ABCD123
86876 xx86876h897
全部 それ、全部ください
ああ ああうるさい人は苦手です。
上げたxxxx 彼は荷物をあみだなに上げあ。
務める ああうるさい人は苦手で務めす。
務める ああうるさい務めす人は苦手で。
turns into:
Just edited James' code a bit for testing purposes (I'm using this edited version to check what kind of strings would throw off the code.
So far I've discovered that spaces in the vocabulary could cause some trouble.
This code prints the original line below the parsed line.
Just change this line:
fout.write(output)
to this
fout.write(output+str(line)+'\n')
This regex should deal with the cases you are looking for (including matching the longest possible pattern in the first column):
^(\S+)(\S*?)\s+?(\S*?(\1)\S*?)$
Regex demo here.
You can then go on to use the match groups to make the specific replacement you are looking for. Here is an example solution in python:
import re
regex = re.compile(r'^(\S+)(\S*?)\s+?(\S*?(\1)\S*?)$')
with open('output.txt', 'w', encoding='utf-8') as fout:
with open('file.txt', 'r', encoding='utf-8') as fin:
for line in fin:
match = regex.match(line)
if match:
hint = match.group(3).replace(match.group(1), '{...}')
output = '{0}\t{1}\n'.format(match.group(1) + match.group(2), hint)
fout.write(output)
Python demo here.

How to split CSV line according to specific pattern

In a .csv file I have lines like the following :
10,"nikhil,khandare","sachin","rahul",viru
I want to split line using comma (,). However I don't want to split words between double quotes (" "). If I split using comma I will get array with the following items:
10
nikhil
khandare
sachin
rahul
viru
But I don't want the items between double-quotes to be split by comma. My desired result is:
10
nikhil,khandare
sachin
rahul
viru
Please help me to sort this out.
The character used for separating fields should not be present in the fields themselves. If possible, replace , with ; for separating fields in the csv file, it'll make your life easier. But if you're stuck with using , as separator, you can split each line using this regular expression:
/((?:[^,"]|"[^"]*")+)/
For example, in Python:
import re
s = '10,"nikhil,khandare","sachin","rahul",viru'
re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]
=> ['10', '"nikhil,khandare"', '"sachin"', '"rahul"', 'viru']
Now to get the exact result shown in the question, we only need to remove those extra " characters:
[e.strip('" ') for e in re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]]
=> ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
If you really have such a simple structure always, you can use splitting with "," (yes, with quotes) after discarding first number and comma
If no, you can use a very simple form of state machine parsing your input from left to right. You will have two states: insides quotes and outside. Regular expressions is a also a good (and simpler) way if you already know them (as they are basically an equivalent of state machine, just in another form)