Regex to capture hyphenated words separated by new line character - regex

I have a pattern such as word-\nword, i.e. the words are hyphenated and separated by a newline character.
I would like the output to be word-word, but I get word-\nword with the code below.
text_string = "word-\nword"
result=re.findall("[A-Za-z]+-\n[A-Za-z]+", text_string)
print(result)
I also tried this, but it did not work; I get no result.
text_string = "word-\nword"
result=re.findall("[A-Za-z]+-(?=\n)[A-Za-z]+", text_string)
print(result)
How can I achieve this?
Thank You !
Edit:
Would it be efficient to do a replace and run a simple regex
text_string = "aaa bbb ccc-\nddd eee fff"
replaced_text = text_string.replace('-\n', '-')
result = re.findall(r"\w+-\w+", replaced_text)
print(result)
or use the method suggested by CertainPerformance
text_string = "word-\nword"
result = re.sub(r"(?i)(\w+)-\n(\w+)", r'\1-\2', text_string)
print(result)

You should use re.sub instead of re.findall:
result = re.sub(r"(?<=-)\n+", "", text_string)
This matches any newlines that follow a - and replaces them with an empty string.
You can alternatively use
(?<=-)\n(?=\w)
which matches new lines only if there is a - before it and it is followed by word characters.
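As a rough illustration of the two lookaround patterns above, here is a minimal sketch. The sample string is made up for the demonstration, and the final findall step is borrowed from the question's edit.

```python
import re

# Hypothetical sample text: two words hyphenated across line breaks,
# plus one ordinary line break that should be left alone.
text_string = "aaa bbb ccc-\nddd eee\nfff ggg-\nhhh"

# Remove one or more newlines that directly follow a hyphen.
joined = re.sub(r"(?<=-)\n+", "", text_string)

# Stricter variant: the newline must also be followed by a word character.
joined_strict = re.sub(r"(?<=-)\n(?=\w)", "", text_string)

# The hyphenated words can then be collected with a simple pattern.
hyphenated = re.findall(r"\w+-\w+", joined)
print(hyphenated)  # ['ccc-ddd', 'ggg-hhh']
```

On this input both substitutions give the same result; they differ only when a hyphen at the end of a line is followed by something other than a word character.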

If the string is composed of just that, then a pure regex solution is to use re.sub, capture the first word and the second word in a group, then echo those two groups back (without the dash and newline):
result = re.sub(r"(?i)([a-z]+)-\n([a-z]+)", r'\1\2', text_string)
Otherwise, if there is other stuff in the string, iterate over each match and join the groups:
text_string = "wordone-\nwordtwo wordthree-\nwordfour"
result = re.findall(r"(?i)([a-z]+)-\n([a-z]+)", text_string)
for match in result:
    print(''.join(match))

You can simply replace any occurrences of '-\n' with '-' instead:
result = text_string.replace('-\n', '-')

Related

Python - how to add a new line every time there is a pattern is found in a string?

How can I add a new line every time a pattern from a list of regexes is found in a string?
I am using python 3.6.
I got the following input:
12.13.14 Here is supposed to start a new line.
12.13.15 Here is supposed to start a new line.
Here is some text. It is written in one lines. 12.13. Here is some more text. 2.12.14. Here is even more text.
I wish to have the following output:
12.13.14
Here is supposed to start a new line.
12.13.15
Here is supposed to start a new line.
Here is some text. It is written in one lines.
12.13.
Here is some more text.
2.12.14.
Here is even more text.
My first try returns as the output the same as the input:
in_file2 = 'work1-T1.txt'
out_file2 = 'work2-T1.txt'
start_rx = re.compile('|'.join(
    ['\d\d\.\d\d\.', '\d\.\d\d\.\d\d','\d\d\.\d\d\.\d\d']))
with open(in_file2,'r', encoding='utf-8') as fin2, open(out_file2, 'w', encoding='utf-8') as fout2:
    text_list = fin2.read().split()
    fin2.seek(0)
    for string in fin2:
        if re.match(start_rx, string):
            string = str.replace(start_rx, '\n\n' + start_rx + '\n')
        fout2.write(string)
My second try raises an error: TypeError: unsupported operand type(s) for +: '_sre.SRE_Pattern' and 'str'
in_file2 = 'work1-T1.txt'
out_file2 = 'work2-T1.txt'
start_rx = re.compile('|'.join(
    ['\d\d\.\d\d\.', '\d\.\d\d\.\d\d','\d\d\.\d\d\.\d\d']))
with open(in_file2,"r") as fin2, open(out_file2, 'w') as fout3:
    for line in fin2:
        start = False
        if re.match(start_rx, line):
            start = True
        if start == False:
            print ('do something')
        if start == True:
            line = '\n' + line ## blank space before the position number
            line = line.replace(start_rx, start_rx + '\n')
        fout3.write(line)
First of all, to search and replace with a regex, you need to use re.sub, not str.replace.
Second, when you use re.sub, you can't use the regex pattern inside the replacement pattern. You need to group the parts of the regex you want to keep and use backreferences in the replacement (or, if you just want to refer to the whole match, use the \g<0> backreference; no capturing groups are required).
Third, when you build an unanchored alternation pattern, make sure longer alternatives come first, i.e. start_rx = re.compile('|'.join([r'\d\d\.\d\d\.\d\d', r'\d\.\d\d\.\d\d', r'\d\d\.\d\d\.'])). However, you may write a more precise pattern manually here.
Here is how your code can be fixed:
with open(in_file2,'r', encoding='utf-8') as fin2, open(out_file2, 'w', encoding='utf-8') as fout2:
    text = fin2.read()
    fout2.write(re.sub(r'\s*(\d+(?:\.\d+)+\.?)\s*', r'\n\n\1\n', text))
The pattern is
\s*(\d+(?:\.\d+)+\.?)\s*
Details
\s* - 0+ whitespaces
(\d+(?:\.\d+)+\.?) - Group 1 (\1 in the replacement pattern):
    \d+ - 1+ digits
    (?:\.\d+)+ - 1 or more repetitions of . and 1+ digits
    \.? - an optional .
\s* - 0+ whitespaces
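To see what the pattern described above does, here is a minimal sketch that applies it to the sample input from the question, held in a string instead of being read from a file (an assumption made to keep the example self-contained).

```python
import re

# Sample input copied from the question.
text = (
    "12.13.14 Here is supposed to start a new line.\n"
    "12.13.15 Here is supposed to start a new line.\n"
    "Here is some text. It is written in one lines. 12.13. Here is some more text. "
    "2.12.14. Here is even more text."
)

# Surrounding whitespace is consumed, and the number (Group 1) is put
# back with two newlines before it and one newline after it.
pattern = re.compile(r"\s*(\d+(?:\.\d+)+\.?)\s*")
result = pattern.sub(r"\n\n\1\n", text)
print(result)
```

Note that the replacement puts a blank line before every number, including the first one, so the output is slightly more spread out than the layout shown in the question. Using '\n\1\n' as the replacement would be one way to tighten it.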
Try this (assuming text holds the contents of the input file, not the file name):
text = re.sub(r'(\d+) ', r'\1\n', text)
text = re.sub(r'(\w+)\.', r'\1.\n', text)

Replace only the first occurrence of a word with regex in text-editor

I want to replace only the first occurrence of a word (default) in each line with another word (rwd).
For example, I want this:
../app/design/adminhtml/default/default/layout/pmodule.xml
../app/design/frontend/default/default/layout/pmodule.xml
../app/design/frontend/default/default/template/company/module/gmap.phtml
To be replaced to this:
../app/design/adminhtml/rwd/default/layout/pmodule.xml
../app/design/frontend/rwd/default/layout/pmodule.xml
../app/design/frontend/rwd/default/template/company/module/gmap.phtml
I have tried \bdefault\b but in vain.
You can use a regex with a lazy dot matching pattern:
^(.*?)\bdefault\b
To replace with \1rwd.
Pattern details:
^ - start of line/string
(.*?) - Group 1 capturing any 0+ characters other than a newline as few as possible up to the first
\bdefault\b - whole word default.
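The question is about a text editor, but the same search and replace can be tried out in Python. This is only a sketch under that assumption: the re.MULTILINE flag stands in for the editor's line-by-line handling of ^.

```python
import re

# The three sample lines from the question.
paths = """../app/design/adminhtml/default/default/layout/pmodule.xml
../app/design/frontend/default/default/layout/pmodule.xml
../app/design/frontend/default/default/template/company/module/gmap.phtml"""

# re.MULTILINE makes ^ match at the start of every line.
# The lazy (.*?) stops at the first whole word "default" on each line.
replaced = re.sub(r"^(.*?)\bdefault\b", r"\1rwd", paths, flags=re.MULTILINE)
print(replaced)
```

Because the pattern is anchored with ^ and the dot does not cross line breaks, at most one match is possible per line, so the second default on each line is left untouched.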
You can search using this lookahead regex:
^((?:(?!\bdefault\b).)*)default
And replace it using:
\1rwd
var path = @"C:\Users\pcNameHere\Downloads\dictionary.csv";
var destination = @"C:\Users\pcNameHere\Downloads\dictionaryEnglish.csv";
var database = File.ReadAllLines(path);
var pattern = ",";
Regex reg = new Regex(pattern);
string[] result = new string[database.Length];
for (int i = 0; i < database.Length; i++)
{
    // Replace only the first two commas on each line.
    result[i] = reg.Replace(database[i], "|", 2);
}
File.WriteAllLines(destination, result);
Here is my sample for anyone looking in the future. I had the English dictionary text file with lines like these:
"Abacus","n.","A table or tray strewn with sand, anciently used for drawing, calculating, etc."
And I had to replace comma delimiters with something else so I could import them into a database, because there were commas inside the word definitions and the database would treat them as new columns. This code changed the first two occurrences of a comma in each line in the text file.
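For comparison, limiting the number of replacements per line can be sketched in Python with the count argument of re.sub. This is an illustrative equivalent, not the code from the answer above, and it works on the single sample line instead of a file.

```python
import re

# The sample dictionary line quoted above.
line = '"Abacus","n.","A table or tray strewn with sand, anciently used for drawing, calculating, etc."'

# count=2 limits the substitution to the first two commas;
# the commas inside the definition are left alone.
converted = re.sub(",", "|", line, count=2)
print(converted)
```

For a fixed single-character delimiter like this, line.replace(",", "|", 2) does the same job without a regex.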

Regex to match strings that begin with specific word and after that words separated by slashes

I want to use a regex to match all strings of the form
(word1|word2|word3)/some/more/text/..unlimited parts.../more
That is, the string starts with one of the specific words and does not end with /.
examples to match:
word1/ok/ready
word2/hello
word3/ok/ok/ok/ready
In the end, when I have a text containing the 3 examples above (spread around in random text), I want to receive an array with those 3 matches after calling regex.exec(text);
Does anybody have an idea how to start?
Thanks!
Something like this should work:
^(word1|word2|word3)(/\w+)+$
If you're using this in an environment where you need to delimit the regex with slashes, then you'll need to escape the inner slash:
/^(word1|word2|word3)(\/\w+)+$/
Edit
If you don't want to capture the second part, make it a non-capturing group:
/^(word1|word2|word3)(?:\/\w+)+$/
                      ^^ Add those two characters
I think this is what you want, but who knows:
var input = '';
input += 'here is a potential match word1/ok/ready that is followed by another ';
input += 'one word2/hello and finally the last one word3/ok/ok/ok/ready';
var regex = /(word1|word2|word3)(\/\w+)+/g;
var results = [];
var result;
while ((result = regex.exec(input)) !== null) {
    results.push(result[0].split('/'));
}
console.log(results);
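The same unanchored search can be sketched in Python, where re.finditer plays the role of the repeated regex.exec calls. The input is the sample text from the snippet above. Unlike that snippet, the sketch keeps each match as a whole string instead of splitting it on /, and it uses non-capturing groups.

```python
import re

text = (
    "here is a potential match word1/ok/ready that is followed by another "
    "one word2/hello and finally the last one word3/ok/ok/ok/ready"
)

# No ^ or $ anchors, so matches are found anywhere in the text.
pattern = re.compile(r"(?:word1|word2|word3)(?:/\w+)+")
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)  # ['word1/ok/ready', 'word2/hello', 'word3/ok/ok/ok/ready']
```

Since the pattern has no word boundary at the start, it would also match inside a longer token such as sword1/ok. Prefixing it with \b is one way to avoid that, if it matters for the input.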

negating with re.search (find strings that don't contain a specific character)

I'm trying to get re.search to find strings that don't have the letter p in them. My regex code returns everything in the list which is what I don't want. I wrote an alternate solution that gives me the exact results that I want, but I want to see if this can be solved with re.search, but I'll also accept another regex solution. I also tried re.findall and that didn't work, and re.match won't work because it looks for the pattern at the beginning of a string.
import re
someList = ['python', 'ppython', 'ython', 'cython', '.python', '.ythop', 'zython', 'cpython', 'www.python.org', 'xyzthon', 'perl', 'javap', 'c++']
# this returns everything from the source list which is what I DON'T want
pattern = re.compile('[^p]')
result = []
for word in someList:
    if pattern.search(word):
        result.append(word)
print '\n', result
''' ['python', 'ppython', 'ython', 'cython', '.python', '.ythop', 'zython', 'cpython', 'www.python.org', 'xyzthon', 'perl', 'javap', 'c++'] '''
# this non regex solution returns the results I want
cnt = 0; no_p = []
for word in someList:
    for letter in word:
        if letter == 'p':
            cnt += 1
        pass
    if cnt == 0:
        no_p.append(word)
    cnt = 0
print '\n', no_p
''' ['ython', 'cython', 'zython', 'xyzthon', 'c++'] '''
You are almost there. The pattern you are using is looking for at least one letter that is not 'p'. You need a more strict one. Try:
pattern = re.compile('^[^p]*$')
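A minimal sketch of the anchored pattern applied to the list from the question. It is written with Python 3 syntax and a list comprehension, and the name some_list is used here only for illustration.

```python
import re

some_list = ['python', 'ppython', 'ython', 'cython', '.python', '.ythop',
             'zython', 'cpython', 'www.python.org', 'xyzthon', 'perl',
             'javap', 'c++']

# ^[^p]*$ only matches when every character in the string is something other than p.
pattern = re.compile(r'^[^p]*$')
no_p = [word for word in some_list if pattern.search(word)]
print(no_p)  # ['ython', 'cython', 'zython', 'xyzthon', 'c++']
```

The anchors are what make the difference: without them, [^p] is satisfied by a single non-p character anywhere in the word.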
Your understanding of character-set negation is flawed. The regex [^p] will match any string that has a character other than p in it, which is all of your strings. To "negate" a regex, simply negate the condition in the if statement. So:
import re
someList = ['python', 'ppython', 'ython', 'cython', '.python', '.ythop', 'zython', 'cpython', 'www.python.org', 'xyzthon', 'perl', 'javap', 'c++']
pattern = re.compile('p')
result = []
for word in someList:
    if not pattern.search(word):
        result.append(word)
print result
It is, of course, rather pointless to use a regex to see if a single specific character is in the string. Your second attempt is more apt for this, but it could be coded better:
result = []
for word in someList:
    if 'p' not in word:
        result.append(word)
print result

Extract text between single quotes in MATLAB

I have multiple lines in some text files such as
.model sdata1 s tstonefile='../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p' passive=2
I want to extract the text between the single quotes in MATLAB.
Much help would be appreciated.
To get all of the text inside multiple '' blocks, regexp can be used as follows:
regexp(txt,'''(.[^'']*)''','tokens')
This says to get text surrounded by ' characters, where the captured text does not itself include a '. For example, consider this file with two lines (I made up a different file name for the second one),
txt = ['.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'' passive=2 ', char(10), ...
'.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_3port_via_minstub.s00p'' passive=2']
>> stringCell = regexp(txt,'''(.[^'']*)''','tokens');
>> stringCell{:}
ans =
'../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'
ans =
'../data/s_element/isdimm_rcv_via_3port_via_minstub.s00p'
>>
Trivia:
char(10) gives a newline character because 10 is the ASCII code for newline.
The . character in a regexp pattern (regex in the rest of the coding world) usually does not match a newline, which would make this a safer pattern. In MATLAB, however, a dot in regexp does match a newline; to disable this, we could add 'dotexceptnewline' as the last input argument to regexp. This is convenient to ensure we don't get the text outside of the quotes instead, but it is not needed here since the first match sets the precedent.
Instead of excluding a ' from the match with [^''], the match can be made non-greedy with ? as follows, regexp(txt,'''(.*?)''','tokens').
If you plan to use textscan:
fid = fopen('data.txt','r');
rawdata = textscan(fid,'%s','delimiter','''');
fclose(fid);
output = rawdata{:}(2)
As in the other answers, a single apostrophe ' inside a string is written as a doubled one, '', e.g. for the delimiter.
Considering the comment:
fid = fopen('data.txt','r');
rawdata = textscan(fid,'%s','delimiter','\n');
fclose(fid);
lines = rawdata{1,1};
L = size(lines,1);
output = cell(L,1);
for ii=1:L
    temp = textscan(lines{ii},'%s','delimiter','''');
    output{ii,1} = temp{:}(2);
end
One easy way is to split the string with single quote delimiter and take the even-numbered strings in the output:
str = fileread('test.txt');
out = regexp(str, '''', 'split');
out = out(2:2:end);
You can do this using regular expressions. Assuming that there is only one occurrence of text between quotation marks:
% select all chars between single quotation marks.
out = regexp(inputString,'''(.*)''','tokens','once');
After identifying which lines you want to extract info from, you could tokenize them, or do something like this if they all have the same form:
test='.model sdata1 s tstonefile=''../data/s_element/isdimm_rcv_via_2port_via_minstub.s50p'' passive=2';
a=strfind(test,'''')
test=test(a(1)+1:a(2)-1) % keep only the text between the first pair of quotes