regex excluding newline - regex

I have a simple word counter that works with one exception. It is splitting on the \n character.
The small sample text file is:
'''
A tree is a woody perennial plant,typically with branches.
I added this second line,just to add eleven more words.
'''
Line #1 has ten words, line #2 has eleven. Total word count = 21.
This code yields a count of 22 because it is including the \n character at the end of line #1:
import re
testfile = "d:\\python\\workbook\\words2.txt"
number_of_words = 0
with open(testfile, "r") as datafile:
    for line in datafile:
        number_of_words += len(re.split(",|\s", line))
print(number_of_words)
If I change my regex to: number_of_words += len(re.split(",|^\n|\s", line))
the word count (22) remains unchanged.
My question is: why is the exclude-newline pattern [^\n] failing? More broadly, how
should I write my regex so that the trailing \n is excluded and the above code arrives at the correct word total of 21?

You can simply use:
number_of_words = 0
with open(testfile, "r") as datafile:
    for line in datafile:
        number_of_words += len(re.findall(r'\w+', line))
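As for why the ^\n attempt changes nothing: outside square brackets, ^ is an anchor rather than a negation, so ^\n only matches a newline at the very start of the string, and the trailing newline is still matched by \s. re.split then returns an empty string after that final separator, and that empty string is the extra "word". A minimal Python 3 sketch of the difference, using an in-memory copy of the sample text instead of the file (the names sample and count_words are made up for illustration):

```python
import re

def count_words(lines):
    """Count runs of word characters across an iterable of lines."""
    return sum(len(re.findall(r"\w+", line)) for line in lines)

# In-memory stand-in for the sample file (assumption: both lines end in "\n").
sample = [
    "A tree is a woody perennial plant,typically with branches.\n",
    "I added this second line,just to add eleven more words.\n",
]

# Splitting on separators leaves an empty string after each trailing "\n".
split_counts = [len(re.split(r",|\s", line)) for line in sample]

# Counting matches instead of separators has no such leftover piece.
total = count_words(sample)

print(split_counts)
print(total)
```

Counting matches with \w+ sidesteps the problem because nothing is left over after the last separator. Note that \w+ treats an apostrophe or a hyphen as a word break, which may or may not be what you want.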

Related

Crystal get from n line to n line from a file

How can I get specific lines in a file and add it to array?
For example: I want to get lines 200-300 and put them inside an array. And while at that count the total line in the file. The file can be quite big.
File.each_line is a good reference for this:
lines = [] of String
index = 0
range = 200..300
File.each_line(file, chomp: true) do |line|
  index += 1
  if range.includes?(index)
    lines << line
  end
end
Now lines holds the lines in range and index is the number of total lines in the file.
To prevent reading the entire file and allocating a new array for all of its content, you can use File.each_line iterator:
lines = [] of String
File.each_line(file, chomp: true).with_index(1) do |line, idx|
  case idx
  when 1...200 then next # omit lines before line 200 (note exclusive range)
  when 200..300 then lines << line # collect lines 200-300
  else break # early break, to be efficient
  end
end

how to skip multiple header lines using python

I am new to Python. I am trying to write a script that uses the numeric columns from a file which also contains a header. Here is an example of such a file:
#File_Version: 4
PROJECTED_COORDINATE_SYSTEM
#File_Version____________-> 4
#Master_Project_______->
#Coordinate_type_________-> 1
#Horizon_name____________->
sb+
#Horizon_attribute_______-> STRUCTURE
474457.83994 6761013.11978
474482.83750 6761012.77069
474507.83506 6761012.42160
474532.83262 6761012.07251
474557.83018 6761011.72342
474582.82774 6761011.37433
474607.82530 6761011.02524
I'd like to skip the header. Here is what I tried. It works, of course, if I know which characters will appear in the header, like "#" and "#". But how can I skip all lines containing any letter?
in_file1 = open(input_file1_short, 'r')
out_file1 = open(output_file1_short,"w")
lines = in_file1.readlines ()
x = []
y = []
for line in lines:
    if "#" not in line and "#" not in line:
        strip_line = line.strip()
        replace_split = re.split(r'[ ,|;"\t]+', strip_line)
        x = (replace_split[0])
        y = (replace_split[1])
        out_file1.write("%s\t%s\n" % (str(x),str(y)))
in_file1.close ()
Thank you very much!
I think you could use some built-ins like this:
import string
for line in lines:
    if any([letter in line for letter in string.ascii_letters]):
        print "there is an ascii letter somewhere in this line"
This only looks for ASCII letters, however.
You could also do:
import unicodedata
for line in lines:
    if any([unicodedata.category(unicode(letter)).startswith('L') for letter in line]):
        print "there is a unicode letter somewhere in this line"
but only if I understand my Unicode categories correctly...
Even cleaner (using suggestions from other answers; this works for both unicode lines and plain strings):
for line in lines:
    if any([letter.isalpha() for letter in line]):
        print "there is a letter somewhere in this line"
But, interestingly, if you do:
In [57]: u'\u2161'.isdecimal()
Out[57]: False
In [58]: u'\u2161'.isdigit()
Out[58]: False
In [59]: u'\u2161'.isalpha()
Out[59]: False
The Unicode character for the Roman numeral two is none of those,
but unicodedata.category(u'\u2161') does return 'Nl', indicating a numeric character (and u'\u2161'.isnumeric() is True).
This checks the first character of each line and skips every line that does not start with a digit:
for line in lines:
    if line[0].isdigit():
        # we've got a line starting with a digit
        pass
Use a generator pipeline to filter your input stream.
This takes the lines from your original input, but first checks that there are no letters anywhere in the line.
input_stream = (line for line in lines if
                reduce((lambda x, y: (not y.isalpha()) and x), line, True))
for line in input_stream:
    strip_line = ...
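Putting the letter test together with the splitting code from the question, a minimal Python 3 sketch might look like this. The function name numeric_rows and the inline sample are assumptions for illustration only. Note that a value in scientific notation such as 1e5 contains a letter and would be skipped by this test.

```python
import re

def numeric_rows(lines):
    """Yield (x, y) string pairs from lines that contain no letters."""
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and any line containing at least one letter.
        if not stripped or any(ch.isalpha() for ch in stripped):
            continue
        # Same separator set as in the question's code.
        fields = re.split(r'[ ,|;"\t]+', stripped)
        if len(fields) >= 2:
            yield fields[0], fields[1]

# A few lines taken from the example file in the question.
sample = [
    "#File_Version: 4\n",
    "PROJECTED_COORDINATE_SYSTEM\n",
    "#Horizon_name____________->\n",
    "sb+\n",
    "#Horizon_attribute_______-> STRUCTURE\n",
    "474457.83994 6761013.11978\n",
    "474482.83750 6761012.77069\n",
]

rows = list(numeric_rows(sample))
for x, y in rows:
    print("%s\t%s" % (x, y))
```

Because the function is a generator, it can be fed an open file object directly, so the whole file never has to be read into a list first.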

Remove lines beginning with the same semi-colon delimited part with regex

I would like to use Notepad++ to remove lines with duplicate beginning of line. For example, I have a semi-colon separated file like below:
string at the beginning of line 1;second string line 1; final string line1;
string at the beginning of line 2;second string line 2; final string line2;
string at the beginning of line 1;second string line 3; final string line3;
string at the beginning of line 1;second string line 4; final string line4;
I would like to remove the third and fourth lines as they have the same first substring as the first line and get the following result:
string at the beginning of line 1;second string line 1; final string line1;
string at the beginning of line 2;second string line 2; final string line2;
You can try using the following regex:
^(([^;]*;).*\R(?:.*\R)*?)\2.*
Or
^(([^;]*;).*\R(?:.*\R)*?)\2.*(?:$|\R)
And replace with $1.
The idea is to capture the text at the beginning of a line up to and including the first ; into group 2 (([^;]*;)), then match the rest of that line (.*\R), then as few following lines as possible ((?:.*\R)*?), until a line that starts with the same text (\2). Everything before that duplicate line is captured in group 1, so replacing the match with $1 keeps it and drops the duplicate line, which the final .* matches to its end.
The drawback is that you will have to click Replace All several times until no match is found.
Thanks go to @nhahtdh, who noticed a bug in my previous regex, ^(([^;]*).*\R(?:.*\R)*?)\2.*, which can overfire.
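The same idea can be tried outside Notepad++. Below is a rough Python 3 port, not the Notepad++ expression itself: Python's re module has no \R, so this sketch assumes \n line endings, and it also excludes \n from the first field so that the field cannot span lines. The function name is hypothetical. The loop plays the role of clicking Replace All repeatedly.

```python
import re

# Rough port of the Notepad++ pattern; "\n" stands in for \R (assumption).
DUPLICATE_LINE = re.compile(
    r"^(([^;\n]*;).*\n(?:.*\n)*?)\2.*(?:\n|$)",
    re.MULTILINE,
)

def remove_duplicate_prefix_lines(text):
    """Drop later lines whose first ;-terminated field repeats an earlier one."""
    while True:
        # Each pass removes at most one duplicate per earlier line,
        # so repeat until a pass makes no replacement.
        text, replaced = DUPLICATE_LINE.subn(r"\1", text)
        if replaced == 0:
            return text

sample = (
    "string at the beginning of line 1;second string line 1; final string line1;\n"
    "string at the beginning of line 2;second string line 2; final string line2;\n"
    "string at the beginning of line 1;second string line 3; final string line3;\n"
    "string at the beginning of line 1;second string line 4; final string line4;\n"
)

cleaned = remove_duplicate_prefix_lines(sample)
print(cleaned, end="")
```

On the sample, the first pass removes line 3 and the second pass removes line 4, which mirrors the need for several Replace All rounds in the editor.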

python script for limit text file words

I have an input file like:
input.txt:
to
the
cow
eliphant
pigen
then
enthosiastic
I want to remove the words whose length is <= 4 characters, and if a word has more than 8 characters, write only its first 8 characters to the new file.
The output should look like:
output.txt:
eliphant
pigen
enthosia
This is my code:
f2 = open('output.txt', 'w+')
x2 = open('input.txt', 'r').readlines()
for y in x2:
    if (len(y) <= 4):
        y = y.replace(y, '')
        f2.write(y)
    elif (len(y) > 8):
        y = y[0:8]
        f2.write(y)
    else:
        f2.write(y)
f2.close()
print "Done!"
When I run it, it gives output like:
eliphantpigen
then
enthosia
It also writes the 4-character word... I don't understand what the problem is, or how to write the code to limit the length of the words in a text file.
Use with when working with files; this guarantees that the file will be closed.
You have then in your results because you are reading lines, not words.
Each line ends with a newline character, '\n'. So when you read the word then, you actually get the string
'then\n', and the length of this string is 5.
with open('output.txt', 'w+') as ofp, open('input.txt', 'r') as ifp:
    for line in ifp:
        line = line.strip()
        if len(line) > 8:
            line = line[:8]
        elif len(line) <= 4:
            continue
        ofp.write(line + '\n')

Python print a line that contains a number greater than "x" in a file

I'm new to Python. I have a script that prints all lines in a file that contain the number 9:
#!/usr/bin/env python
import re
testFile = open("test.txt", "r")
for line in testFile:
    if re.findall("\\b9\\b", line):
        print line
Now, how can I print all lines that contain a number greater than 9?
test.txt:
number1 9
number2 10
number3 5
number4 6
number5 15
You can use regular expression grouping:
for line in testFile:
    m = re.search(r"\b(\d+)\b", line)
    if m is not None and int(m.group(1)) > 9:
        print line
The (\d+) extracts the text matched by that part of the regex into m.group(1). Then int() converts that text to an integer, which is compared with 9.
This will extract the first instance of a number within each line. If you want to search all numbers in a line, you will need to use something like re.finditer() in combination with the above.
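A minimal Python 3 sketch of that re.finditer() variant, checking every standalone number on a line rather than only the first. The helper name has_number_greater_than and the inline sample lines are assumptions for illustration:

```python
import re

def has_number_greater_than(line, threshold):
    """Return True if any standalone integer on the line exceeds threshold."""
    return any(
        int(match.group(1)) > threshold
        for match in re.finditer(r"\b(\d+)\b", line)
    )

# Contents of test.txt from the question, held in memory for the example.
sample = ["number1 9", "number2 10", "number3 5", "number4 6", "number5 15"]

matching = [line for line in sample if has_number_greater_than(line, 9)]
for line in matching:
    print(line)
```

The \b anchors mean that the digit inside a label such as number2 is not treated as a number, since there is no word boundary between the letter and the digit.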
This prints the line if there is any space-separated number greater than 9.
testFile = open("test.txt", "r")
for line in testFile:
    for word in line.split():
        try:
            if int(word) > 9:
                print line
                break
        except ValueError:
            pass
Or, for your example:
testFile = open("test.txt", "r")
for line in testFile:
    if int(line.split()[1]) > 9:
        print line