python regex split on repeating character

I have a string for example
--------------------------------
hello world !
--------------------------------
world hello !
--------------------------------
! hello world
and I want to split it on the lines of hyphens. The hyphen runs can be of variable length, which is why I decided to use a regex. The information I want to extract is ['hello world !', 'world hello !', '! hello world']. Splitting on a fixed number of hyphens works, but I am not sure how to handle a variable length. I have tried:
re.split(r'\-{3,}', str1)
However, that did not seem to work.

You may strip the unnecessary whitespace from the input and resulting split chunks with a .strip() method:
import re
p = re.compile(r'(?m)^-{3,}$')
t = "--------------------------------\nhello world !\n--------------------------------\nworld hello !\n--------------------------------\n! hello world"
result = [x.strip() for x in p.split(t.strip("-\n\r"))]
print(result)
As for the regex, I suggest limiting to the hyphen-only lines with (?m)^-{3,}$ that matches 3 or more hyphens between the start of line (^) and end of line ($) (due to (?m), these anchors match the line boundaries, not the string boundaries).
See the IDEONE demo
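To see what the original attempt actually returns, here is a minimal sketch that runs both patterns on similar input. The sample text is an assumption (the middle separator is shortened to five hyphens to exercise the variable-length case), and the variable names are illustrative only. The unanchored split does split the string, but it leaves an empty first chunk and newline characters around each piece, which may be why it seemed not to work.

```python
import re

# Sample input (assumed): separators of different lengths.
text = (
    "--------------------------------\n"
    "hello world !\n"
    "-----\n"
    "world hello !\n"
    "--------------------------------\n"
    "! hello world"
)

# Original attempt: splits on any run of 3+ hyphens, but keeps the
# surrounding newlines and an empty chunk before the first separator.
raw = re.split(r'-{3,}', text)

# Anchored pattern from the answer; empty chunks are filtered out
# after stripping whitespace.
separator = re.compile(r'(?m)^-{3,}$')
chunks = [c.strip() for c in separator.split(text) if c.strip()]

print(raw)     # ['', '\nhello world !\n', '\nworld hello !\n', '\n! hello world']
print(chunks)  # ['hello world !', 'world hello !', '! hello world']
```

Filtering on `c.strip()` is one alternative to stripping hyphens from the input first; either approach removes the empty leading chunk.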

Related

Get segment of string in between characters

I have a giant data set that includes lots of file names with various parts of strings that I need to grab.
I have this code segment currently:
def fps(data):
    for i in data:
        pattern = r'.(\d{4}).' # finds data in between the periods
        frames = re.findall(pattern, ' '.join(data)) #puts info into frames list
    frames.sort()
    for i in range(len(frames)): #Turns the str into integers
        frames[i] = int(frames[i])
    return frames
This is great and all but it only returns 4 characters after and before a period.
How would I grab the part of the string after a period and before the next period?
Preferably without using regular expressions, because they are a little too complex for a simpleton like me.
For example:
One string may look like this
string = ['filename.0530.extension']
while the others may look like this
string2 = ['filename.042.extension']
string3 = ['filename.045363.extension']
I would need to output the numbers in between the periods on the terminal so:
0530, 042, 045363

To match your example data, you could match a dot, capture one or more digits in a group with \d+ (instead of exactly four with \d{4}), and then match another dot:
\.(\d+)\.
If you want to match everything between the dots, you might use a negated character class [^.] to match any character that is not a dot:
\.([^.]+)\.
Note that if you want to match a literal dot you should escape it \.
Demo
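Applied to the file names from the question, the pattern above could be used roughly like this. This is a sketch only: the function names are hypothetical, and the second function is a regex-free alternative that assumes every name has the shape name.middle.extension.

```python
import re

names = [
    'filename.0530.extension',
    'filename.042.extension',
    'filename.045363.extension',
]

def between_dots_regex(names):
    """Collect every run of digits that sits between two literal dots."""
    pattern = re.compile(r'\.(\d+)\.')
    found = []
    for name in names:
        found.extend(pattern.findall(name))
    return found

def between_dots_split(names):
    """Regex-free variant: take the second dot-separated field.

    Assumes each name contains at least two dots.
    """
    return [name.split('.')[1] for name in names if name.count('.') >= 2]

print(', '.join(between_dots_regex(names)))  # 0530, 042, 045363
print(', '.join(between_dots_split(names)))  # 0530, 042, 045363
```

The results stay as strings so that leading zeros such as the one in 0530 are preserved; converting with int() would drop them.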

To match the numbers between your periods in your example, you can use this:
^.*\.[^.\s]*?\.?(\d+)\..*$
Here's an online example

Java replace all method appending the replacement string instead of replacing

I am trying to replace all the words starting with a vowel with "XXXXX" in my text file. I am using a regex to do this, but when I call the replaceAll method, my replacement string is getting appended instead of replacing the word.
Here is my text file, code and output.
Hello 12 I am John
How are you
I am good
Thank you 89767 0
#$%^
code:
String dest = data.replaceAll("\\b(?=[AEIOUaeiou])","XXXXX");
System.out.println(dest);
data is the string that contains all my file data.
output :
Hello 12 XXXXXI XXXXXam John
How XXXXXare you
XXXXXI XXXXXam good
Thank you 89767 0
#$%^
Please help me out in solving this issue. I have gone through some answers regarding replaceAll() method but I am not able to find answer to my problem.

Your pattern only contains zero-width assertions: \\b matches a word boundary location and (?=[AEIOUaeiou]) positive lookahead asserts the position before a vowel.
Make the pattern consuming. Use
data = data.replaceAll("\\b[AEIOUaeiou]\\w*","XXXXX");
To only match letters, replace \w with \p{Alpha}.
See regex demo and a Java demo:
String data = "Hello 12 I am John\nHow are you\nI am good\nThank you 89767 0\n#$%^";
data = data.replaceAll("\\b[AEIOUaeiou]\\p{Alpha}*","XXXXX");
System.out.println(data);
Output:
Hello 12 XXXXX XXXXX John
How XXXXX you
XXXXX XXXXX good
Thank you 89767 0
#$%^
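The difference between a zero-width pattern and a consuming one is not specific to Java. As an illustration only, here is a sketch of the same two substitutions in Python using re.sub. This is an assumption for demonstration, not the asker's code, and [A-Za-z] stands in for \p{Alpha}, which Python's re module does not support.

```python
import re

data = "Hello 12 I am John\nHow are you\nI am good\nThank you 89767 0\n#$%^"

# Zero-width pattern: it matches a position, not characters, so the
# replacement is inserted in front of each vowel-initial word.
inserted = re.sub(r'\b(?=[AEIOUaeiou])', 'XXXXX', data)

# Consuming pattern: the whole word is part of the match, so the
# replacement takes its place.
replaced = re.sub(r'\b[AEIOUaeiou][A-Za-z]*', 'XXXXX', data)

print(inserted.splitlines()[0])  # Hello 12 XXXXXI XXXXXam John
print(replaced.splitlines()[0])  # Hello 12 XXXXX XXXXX John
```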

regex for tokenizing words and punctuation

I have been tokenizing English strings with a simple \b split. However, given the string Hello, "Joe!", a split on \b gives back these tokens:
print join "\n", split /\b/, 'Hello, "Joe!"';
Hello
, "
Joe
!"
I need separate punctuation to be separate tokens. What I need is this list below:
print join "\n", split /awesome regex here/, 'Hello, "Joe!"';
Hello
,
"
Joe
!
"
I can process the whitespace afterwards, but I can't think of a quick regex way to split the string properly. Any ideas?
EDIT
A better test case is "Hello there, Joe!", since it checks that words are split correctly.

(?=\W)|(?<=\W)|\s+
You can try this. See the demo:
https://regex101.com/r/fX3oF6/4

Do matching instead of splitting.
[A-Za-z]+|[^\w\s]
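The question uses Perl, but the matching approach carries over to other regex engines. A minimal sketch in Python follows; the function name is hypothetical. One caveat worth stating: with this pattern, digits and underscores match neither alternative and are silently dropped.

```python
import re

# A token is either a run of ASCII letters or a single character
# that is neither a word character nor whitespace (i.e. punctuation).
TOKEN = re.compile(r'[A-Za-z]+|[^\w\s]')

def tokenize(text):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN.findall(text)

print(tokenize('Hello, "Joe!"'))      # ['Hello', ',', '"', 'Joe', '!', '"']
print(tokenize('Hello there, Joe!'))  # ['Hello', 'there', ',', 'Joe', '!']
```

Because whitespace matches neither alternative, no separate cleanup step for spaces is needed.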

You can use lookarounds regex to get this:
print join "\n", split /\s+|(?=\p{P})|(?<=\p{P})/, 'Hello, "Joe!"';
Output:
Hello
,
"
Joe
!
"
\p{P} matches any punctuation character.
Example 2:
print join "\n", split /\s+|(?=\p{P})|(?<=\p{P})/, 'hello there, Joe!';
hello
there
,
Joe
!

regular expression - match word only once in line

Case:
ehello goodbye hellot hello goodbye
ehello goodbye hello hello goodbye
I want to match line 1 (only has 'hello' once!)
DO NOT want to match line 2 (contains 'hello' more than once)
Tried using negative look ahead look behind and what not... without any real success..

A simple option is this (using the multiline flag and not dot-all):
^(?!.*\bhello\b.*\bhello\b).*\bhello\b.*$
First, check you don't have 'hello' twice, and then check you have it at least once.
There are other ways to check for the same thing, but I think this one is pretty simple.
Of course, you can simply match \bhello\b and count the number of matches...
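Since the question does not name a language, here is a sketch of both ideas in Python, assuming the two sample lines from the question. The variable and function names are illustrative.

```python
import re

text = (
    "ehello goodbye hellot hello goodbye\n"
    "ehello goodbye hello hello goodbye"
)

# Multiline flag on, dot-all off: each line is tested on its own,
# because '.' does not cross a newline.
once = re.compile(r'^(?!.*\bhello\b.*\bhello\b).*\bhello\b.*$', re.MULTILINE)
matching_lines = once.findall(text)

def has_word_once(line, word='hello'):
    """Counting alternative: exactly one whole-word occurrence."""
    return len(re.findall(r'\b' + re.escape(word) + r'\b', line)) == 1

print(matching_lines)  # ['ehello goodbye hellot hello goodbye']
print([has_word_once(line) for line in text.splitlines()])  # [True, False]
```

Note that "ehello" and "hellot" are not counted, because \b requires a word boundary on both sides of "hello".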

A generic regex would be:
^(?:\b(\w+)\b\W*(?!.*?\b\1\b))*\z
Although it could be cleaner to invert the result of this match:
\b(\w+)\b(?=.*?\b\1\b)
This works by matching a word and capturing it, then using a lookahead and a backreference to check whether the same word does (second pattern) or does not (first pattern) appear again later in the string.
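The second pattern can be tried out as follows. This is a sketch in Python with hypothetical helper names; it uses only the shorter pattern, since \z is not part of Python's regex syntax.

```python
import re

# Matches a word that occurs again later in the same string.
repeated_word = re.compile(r'\b(\w+)\b(?=.*?\b\1\b)')

def repeated_words(line):
    """Return each word occurrence that is followed by another copy of itself."""
    return repeated_word.findall(line)

def every_word_unique(line):
    """Inverted result: True when no word is repeated."""
    return repeated_word.search(line) is None

print(repeated_words("ehello goodbye hellot hello goodbye"))  # ['goodbye']
print(repeated_words("ehello goodbye hello hello goodbye"))   # ['goodbye', 'hello']
print(every_word_unique("hello goodbye"))                     # True
```

As the first result shows, this generic pattern flags any repeated word, not just "hello", so it answers a broader question than the one asked.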

Since you're only worried about words (ie tokens separated by whitespace), you can just split on spaces and see how often "hello" appears. Since you didn't mention a language, here's an implementation in Perl:
use strict;
use warnings;
my $a1="ehello goodbye hellot hello goodbye";
my $a2="ehello goodbye hello hello goodbye";
my @arr1=split(/\s+/,$a1);
my @arr2=split(/\s+/,$a2);
#grab the number of times that "hello" appears
my $num_hello1=scalar(grep{$_ eq "hello"}@arr1);
my $num_hello2=scalar(grep{$_ eq "hello"}@arr2);
print "$num_hello1, $num_hello2\n";
The output is
1, 2

Python extract words from a txt file

Is it possible to search for a series of words and extract the next word? For example, in a txt file, search for the word 'Test' and then return the word directly after it?
Test.txt
This is a test to test the function of the python code in the test environ_ment
I'm looking to get the results:-
to, the, environ_ment

You can use a regular expression for this:
import re
txt = "This is a test to test the function of the python code in the test environ_ment"
print(re.findall(r"test\s+(\S+)", txt)) # ['to', 'the', 'environ_ment']
The regular expression matches with "test" when it is followed by white space (\s+) and a series of non-white space characters \S+. The latter matches the words you are looking for and is put in a capture group (with parentheses) in order to return that part of the matches.
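Two details may matter in practice: the question searches for 'Test' with a capital T, and a bare "test" also matches inside longer words such as "contest". The sketch below handles both with re.IGNORECASE and a leading \b. The function name and the second sample sentence are assumptions for illustration.

```python
import re

def words_after(keyword, text):
    """Return the word following each whole-word, case-insensitive
    occurrence of keyword."""
    pattern = re.compile(r'\b' + re.escape(keyword) + r'\s+(\S+)', re.IGNORECASE)
    return pattern.findall(text)

txt = "This is a test to test the function of the python code in the test environ_ment"
print(words_after('Test', txt))  # ['to', 'the', 'environ_ment']

# The leading \b keeps "contest" from counting as a match.
print(words_after('Test', "A contest entry, then a Test run"))  # ['run']
```

To work on a file, the text could be read first with open(...).read() and passed in the same way.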
