Removing lines with only digits - regex

I am new to both Python and regex. I am trying to process a text file from which I want to remove lines that contain only digits and spaces. This is the regular expression I am using:
^\s*[0-9]*\s*$
I am able to match the lines I want to remove in the Notepad++ Find dialog, but when I try to do the same with Python, the lines are not matched. Is there a problem in the regex itself, or is there a problem with my Python code?
Python code that I am using:
contacts = re.sub(r'^\s*[0-9]*\s*$','\n',contents)
Sample text
Age:30
Gender:Male
20
Name:संगीता शर्मा
HusbandsName:नरेश कुमार शर्मा
HouseNo:10/183
30 30
Gender:Female
21
Name:मोनू शर्मा
FathersName:कैलाश शर्मा
HouseNo:10/183
30
Gender:Male

Use re.sub in multiline mode:
contacts = re.sub(r'^\s*([0-9]+\s*)+$','\n',contents, flags=re.M)
By default, ^ and $ match only at the start and end of the whole string. For them to match at the start and end of each line, you need multiline mode (re.M).
In addition, use the following to match a line containing only groups of digits, possibly separated by whitespace:
^\s*([0-9]+\s*)+$
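As a minimal sketch of the multiline fix, the snippet below deletes digit-only lines from a shortened version of the sample text. The pattern is a simplified variant of the one above, not the answer's exact regex: it uses [ \t] rather than \s so that a match never runs across a line break, and it also consumes the line break so no empty line is left behind. The names PATTERN and drop_digit_lines are made up for this illustration.

```python
import re

# Hypothetical variant of the answer's pattern: optional spaces/tabs, at least
# one digit, then only digits/spaces/tabs up to the end of the line, followed
# by the line break itself (if there is one).
PATTERN = r'^[ \t]*[0-9][0-9 \t]*$\n?'

def drop_digit_lines(text):
    """Remove lines that consist only of digits and spaces/tabs."""
    return re.sub(PATTERN, '', text, flags=re.M)

sample = "Age:30\nGender:Male\n20\nName:A\nHouseNo:10/183\n30 30\nGender:Female\n"

cleaned = drop_digit_lines(sample)
# Without re.M, ^ matches only at the very start of the string, so nothing changes.
unchanged = re.sub(PATTERN, '', sample)
print(cleaned)
```

Comparing cleaned with unchanged shows the effect of the flag: the same pattern does nothing to this sample unless re.M is passed.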

You don't even need regex for that. A simple translate() call that removes the characters you're not interested in, followed by a check for whether anything is left, should more than suffice:
import string
# bytes to strip before testing a line; encoded because the files are opened in binary mode
clear_chars = (string.digits + string.whitespace).encode("ascii")
# open input.txt for reading, output.txt for writing
with open("input.txt", "rb") as f_in, open("output.txt", "wb") as f_out:
    for line in f_in:  # iterate over the input file line by line
        if line.translate(None, clear_chars):  # remove the chars, check if anything is left
            f_out.write(line)  # write the line to the output file
        # uncomment the following if you want added newlines when pattern matched
        # else:
        #     f_out.write(b"\n")  # write a new line on match
Which will produce for your sample input:
Age:30
Gender:Male
Name:संगीता शर्मा
HusbandsName:नरेश कुमार शर्मा
HouseNo:10/183
Gender:Female
Name:मोनू शर्मा
FathersName:कैलाश शर्मा
HouseNo:10/183
Gender:Male
If you want the matching lines replaced with a new line, just uncomment the else clause.

Related

How to extract text between two substrings from a Python file

I want to read the text between two markers (“#*” and “##”) from a file. My file contains thousands of records in the format shown below. I have tried using the code below, but it is not returning the required output.
import re
start = '#*'
end = '##'
myfile = open('lorem.txt')
for line in fhand:
    text = text.rstrip()
    print (line[line.find(start)+len(start):line.rfind(end)])
myfile.close()
My Input:
#*OQL[C++]: Extending C++ with an Object Query Capability
##José A. Blakeley
#t1995
#cModern Database Systems
#index0
#*Transaction Management in Multidatabase Systems
##Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz
#t1995
#cModern Database Systems
#index1
My Output:
51103
OQL[C++]: Extending C++ with an Object Query Capability
t199
cModern Database System
index
...
Expected output:
OQL[C++]: Extending C++ with an Object Query Capability
Transaction Management in Multidatabase Systems
You are reading the file line by line, but your matches span across lines. You need to read the whole file in and process it with a regex that can match any characters across lines:
import re
start = '#*'
end = '##'
# Escape special chars, build the pattern dynamically, capture only the text between the markers
rx = r'{}(.*?){}'.format(re.escape(start), re.escape(end))
with open('lorem.txt') as myfile:
    contents = myfile.read()  # Read file into a variable
for match in re.findall(rx, contents, re.S):  # Note re.S will make . match line breaks, too
    print(match.strip())  # Process each match individually; strip() drops the trailing line break
Use the following regex:
#\*([\s\S]*?)##
This regex lazily captures every character, whitespace or not (so line breaks too), between #* and ##. In Python, re.findall already returns all matches, so no /g flag is needed.
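A minimal sketch of how that regex could be applied in Python, assuming the record layout from the question. The sample string is abbreviated, and the names TITLE_RX and titles are illustrative only.

```python
import re

# [\s\S] matches any character, including line breaks, so re.S is not needed.
TITLE_RX = re.compile(r'#\*([\s\S]*?)##')

sample = (
    "#*OQL[C++]: Extending C++ with an Object Query Capability\n"
    "##José A. Blakeley\n"
    "#t1995\n"
    "#cModern Database Systems\n"
    "#index0\n"
    "#*Transaction Management in Multidatabase Systems\n"
    "##Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz\n"
    "#t1995\n"
)

# Each captured title ends with the line break that precedes "##"; strip it.
titles = [title.strip() for title in TITLE_RX.findall(sample)]
for title in titles:
    print(title)
```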

python regex to remove extra characters from papers' doi

I am new to regex and I have a list of some papers' DOIs. Some of the DOIs include extra characters or strings, and I want to remove all of those extras. Here is the sample data:
10.1038/ncomms3230
10.1111/hojo.12033
blog/uninews #invalid
article/info%3Adoi%2F10.1371%2Fjournal.pone.0076852utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+plosone%2FPLoSONE+%28PLOS+ONE+Alerts%3A+New+Articles%29
#want to extract 10.1371/journal.pone.0076852
utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+plosone%2 #invalid
10.1002/dta.1578
enhanced/doi #invalid
doi/pgen.1005204
doi:10.2135/cropsci2014.11.0791 #want to remove "doi:"
10.1126/science.aab1052
gp/about-springer
10.1038/srep14556
10.1002/rcm.7274
10.1177/0959353515592899
Some of the entries don't have DOIs at all. I want to replace them with "".
Here is the regex that I came up with:
for doi in doi_lst:
    doi = re.sub(r"^[^10\.][^a-z0-9//\.]+", "", doi)
But it does nothing. I searched many other Stack Overflow questions but couldn't find one that covers my case. Kindly help me out here.
P.S. I am working with Python 3.
Assuming a DOI is a substring starting with 10. followed by more digits, a /, and then one or more word or . characters, you may first decode the strings with urllib.parse.unquote (to turn percent-escapes such as %2F into literal characters) and then use re.search with the \b10\.\d+/[\w.]+\b pattern to extract each DOI from the list items:
import re, urllib.parse
doi_list = ["10.1038/ncomms3230", "10.1111/hojo.12033", "blog/uninews", "article/info%3Adoi%2F10.1371%2Fjournal.pone.0076852? ", "utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+plosone%2",
    "10.1002/dta.1578", "enhanced/doi", "doi/pgen.1005204", "doi:10.2135/cropsci2014.11.0791", "10.1126/science.aab1052", "gp/about-springer", "10.1038/srep14556", "10.1002/rcm.7274", "10.1177/0959353515592899"]
new_doi_list = []
for doi in doi_list:
    doi = urllib.parse.unquote(doi)  # decode %3A, %2F, etc.
    m = re.search(r'\b10\.\d+/[\w.]+\b', doi)
    if m:
        new_doi_list.append(m.group())
        print(m.group())  # DEMO
Output:
10.1038/ncomms3230
10.1111/hojo.12033
10.1371/journal.pone.0076852
10.1002/dta.1578
10.2135/cropsci2014.11.0791
10.1126/science.aab1052
10.1038/srep14556
10.1002/rcm.7274
10.1177/0959353515592899
To include an empty string for items without a match, add an else: new_doi_list.append("") branch to the if in the above code.
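The variant with empty strings for non-matching entries could be sketched as follows. The helper name extract_doi and the shortened sample list are assumptions made for this illustration; the pattern is the one from the answer.

```python
import re
import urllib.parse

DOI_RX = re.compile(r'\b10\.\d+/[\w.]+\b')

def extract_doi(raw):
    """Return the first DOI-like substring in raw, or "" if there is none."""
    decoded = urllib.parse.unquote(raw)  # turn %3A, %2F, ... into literal characters
    m = DOI_RX.search(decoded)
    return m.group() if m else ""

samples = [
    "10.1038/ncomms3230",
    "blog/uninews",
    "article/info%3Adoi%2F10.1371%2Fjournal.pone.0076852? ",
    "doi:10.2135/cropsci2014.11.0791",
    "doi/pgen.1005204",
]

cleaned = [extract_doi(s) for s in samples]
print(cleaned)
```

Keeping one output entry per input entry means the cleaned list stays aligned with the original list by index.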

Find and remove specific string from a line

I am hoping to receive some feedback on some code I have written in Python 3 - I am attempting to write a program that reads an input file which has page numbers in it. The page numbers are formatted as: "[13]" (this means you are on page 13). My code right now is:
pattern='\[\d\]'
for line in f:
    if pattern in line:
        re.sub('\[\d\]',' ')
        re.compile(line)
        output.write(line.replace('\[\d\]', ''))
I have also tried:
for line in f:
    if pattern in line:
        re.replace('\[\d\]','')
        re.compile(line)
        output_file.write(line)
When I run these programs, a blank file is created, rather than a file containing the original text minus the page numbers. Thank you in advance for any advice!
Your if statement won't work because it isn't doing a regex match; it's looking for the literal string \[\d\] in line. Note also that \d matches a single digit, so a page number like [13] needs \d+.
for line in f:
    # re.search scans the whole line; re.match would only test its start
    if re.search(r'\[\d+\]', line):
        line = re.sub(r'\[\d+\]', ' ', line)
    output_file.write(line)
Additionally, you're using re.compile() incorrectly. Its purpose is to pre-compile your pattern into a pattern object. This can improve performance if you use the pattern a lot, because the expression is parsed once rather than each time through the loop.
pattern = re.compile(r'\[\d+\]')
if pattern.search(line):
    # ...
Lastly, you're getting a blank file because write() is only called inside the if block, and that test never succeeds. output_file.write(line) is the right call for writing each line; file objects have no writeline() method.
You don't write unmodified lines to your output.
Try something like this:
if pattern in line:
    #remove page number stuff
output_file.write(line) # note that it's not part of the if block above
That's why your output file is empty.
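Putting both answers together, a minimal self-contained sketch might look like this. It works on a list of lines rather than real files so it can be run as-is; the names PAGE_MARKER and strip_page_numbers are hypothetical, and \d+ is assumed so that multi-digit page numbers such as [13] are covered.

```python
import re

# One or more digits in square brackets, e.g. "[13]".
PAGE_MARKER = re.compile(r'\[\d+\]')

def strip_page_numbers(lines):
    """Yield every line, with any page markers removed."""
    for line in lines:
        # sub() returns the line unchanged when there is no match,
        # so unmodified lines are still passed through.
        yield PAGE_MARKER.sub('', line)

sample = ["First line [13] of text\n", "No marker here\n", "[7]Start\n"]
cleaned = list(strip_page_numbers(sample))
print("".join(cleaned), end="")
```

With real files, the same generator could be fed an open input file, and each yielded line written with output_file.write(line).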

Format a text file by regex match and replace

I have a text file that looks like the following:
Chanelle
Jettie
Winnie
Jen
Shella
Krysta
Tish
Monika
Lynwood
Danae
2649
2466
2890
2224
2829
2427
2816
2648
2833
2453
I need to make it look like this
Chanelle 2649
Jettie 2466
... ...
I tried a lot in the Sublime editor but couldn't figure out the regex to do that. Can somebody demonstrate whether it can be done?
I tested the following in Notepad++ but it should work universally.
Use this as the search string:
(?:(\s+[A-Za-z]+)(\r?\n))((?:\s*[A-Za-z]*\r?\n)+)\s+(\d+)
and this as the replacement:
$1 $4$2$3
Running a replace with it once will do one line at a time. If you run it multiple times, it will continue to replace lines until there are no matching lines left.
Alternatively, you can use this as the replacement if you want to have the values aligned by tabs, but it's not going to match in all cases:
$1\t\t$4$2$3
While the regex answer by SeinopSys will work, you don't need a regex to do this - instead, you can take advantage of Sublime's multiple cursors.
1. Place your cursor at the beginning of line 1, then hold down Shift+↓ to select all the names.
2. Hit Ctrl+Shift+L (Selection -> Split into Lines) to split the selection into lines.
3. Press Ctrl+C to copy.
4. Place your cursor on line 11 (the first number line) and press Ctrl+Shift+↓ (Windows/OS X) or Alt+Shift+↓ (Linux) to place a cursor at the beginning of each number line.
5. Hit Ctrl+V to paste the names before the numbers.
You can now delete the names at the top and you're all set. Alternatively, you could use Ctrl+X to cut the names in step 3.
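If the file is too large to handle comfortably in an editor, the same pairing can be sketched in a few lines of Python. This is a different technique from the regex and multiple-cursor answers above, and it assumes that every number entry is purely digits and that names and numbers appear in the same order. The function name pair_names_with_numbers is made up for this illustration.

```python
def pair_names_with_numbers(lines):
    """Pair the n-th name with the n-th number, one "name number" string each."""
    items = [line.strip() for line in lines if line.strip()]  # drop blank lines
    names = [item for item in items if not item.isdigit()]
    numbers = [item for item in items if item.isdigit()]
    return ["{} {}".format(name, number) for name, number in zip(names, numbers)]

sample = ["Chanelle\n", "Jettie\n", "Winnie\n", "2649\n", "2466\n", "2890\n"]
paired = pair_names_with_numbers(sample)
print("\n".join(paired))
```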

Vim: How to delete repetition in a line

I have a log file to analyze in which a few of the lines repeat part of themselves, though not completely. For example:
Alex is here and Alex is here and we went out
We bothWe both went out
I want to remove the first occurrence and get
Alex is here and we went out
We both went out
Please share a regex to do this in Vim on Windows.
I don't recommend trying to use regex magic to solve this problem. Just write an external filter and use that.
Here's an external filter written in Python. You can use this to pre-process the log file, like so:
python prefix_chop.py logfile.txt > chopped.txt
But it also works by standard input:
cat logfile.txt | prefix_chop.py > chopped.txt
This means you can use it in Vim with the ! command. Try these commands: go to line 1, then pipe everything from the current line through the last line through the external program prefix_chop.py:
1G
!Gprefix_chop.py<Enter>
Or you can do it from ex mode:
:1,$!prefix_chop.py<Enter>
Here's the program:
#!/usr/bin/python
import sys
infile = sys.stdin if len(sys.argv) < 2 else open(sys.argv[1])
def repeated_prefix_chop(line):
    """
    Check line for a repeated prefix string. If one is found,
    return the line with that string removed, else return the
    line unchanged.
    """
    # Repeated string cannot be more than half of the line.
    # So, start looking at mid-point of the line.
    i = len(line) // 2 + 1
    while True:
        # Look for longest prefix that is found in the string after pos 0.
        # The prefix starts at pos 0 and always matches itself, of course.
        pos = line.rfind(line[:i])
        if pos > 0:
            return line[pos:]
        i -= 1
        # Stop testing before we hit a length-1 prefix, in case a line
        # happens to start with a word like "oops" or a number like "77".
        if i < 2:
            return line
for line in infile:
    sys.stdout.write(repeated_prefix_chop(line))
I put a #! comment on the first line, so this will work as a stand-alone program on Linux, Mac OS X, or on Windows if you are using Cygwin. If you are just using Windows without Cygwin, you might need to make a batch file to run this, or just type the whole command python prefix_chop.py. If you make a macro to run this you don't have to do the typing yourself.
EDIT: This program is pretty simple. Maybe it could be done in "vimscript" and run purely inside vim. But the external filter program can be used outside of vim... you can set things up so that the log file is run through the filter once per day every day, if you like.
Regex: \b(.*)\1\b
Replace with: \1 or $1
If you want to deal with more than two repeating sentences, you can try this:
\b(.+?\b)\1+\b
The \b anchors avoid matching repeated individual characters inside a word, like xxx.
NOTE: In Vim, use \< and \> instead of \b.
You could do it by matching as much as possible at the beginning of the line and then using a backreference to match the repeated bit.
For example, this command solves the problem you describe:
:%s/^\(.*\)\(\1.*\)/\2
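For comparison outside Vim, the same backreference idea can be sketched with Python's re module. This is a direct translation of the substitution above, not part of the original answer, and the function name chop_repeated_prefix is hypothetical. Like the Vim command, it would also shorten a line that legitimately starts with a doubled character.

```python
import re

# Group 1 greedily grabs the longest prefix that is immediately followed by a
# copy of itself (\1); group 2 is that copy plus the rest of the line.
REPEAT_RX = re.compile(r'^(.*)(\1.*)')

def chop_repeated_prefix(line):
    """Drop the first copy of a prefix that is repeated right after itself."""
    return REPEAT_RX.sub(r'\2', line, count=1)

first = chop_repeated_prefix("Alex is here and Alex is here and we went out")
second = chop_repeated_prefix("We bothWe both went out")
plain = chop_repeated_prefix("no repeats")
print(first)
print(second)
```

When no prefix is repeated, group 1 ends up empty and the line is returned unchanged.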