I would like the text to be printed as shown in the exercise, where each "Lists of" entry is on its own line and starts with a *. I am still new to Python, and the Automate the Boring Stuff with Python book is sometimes confusing.
I started by typing the text into the Python editor and using Pyperclip to copy it onto the clipboard. The problem is that Pyperclip only accepts a single string, and the text is copied to the clipboard in that form.
#! python3
#bulletPointerAdder.py - Adds Wikipedia bullet points to the start
#of each line of text on the clipboard.
In the Python shell:
>>> import pyperclip
>>> text = 'Lists of monkeys Lists of donkeys Lists of pankeys'
>>> pyperclip.copy(text)
>>>
RESTART: C:\Users\User\AppData\Local\Programs\Python\Python37-32\bulletpointadder.py
>>> text
'* Lists of monkeys Lists of donkeys Lists of pankeys'
>>>
import os
import pyperclip
text = pyperclip.paste()
# Separate lines and add stars.
lines = text.split(os.linesep)
for i in range(len(lines)):  # loop through all indexes in the "lines" list
    lines[i] = '* ' + lines[i]  # add star to each string in "lines" list
text = os.linesep.join(lines)
pyperclip.copy(text)
I want the text to be printed like the sample below, but it is printed as a single string instead.
Lists of animals
Lists of aquarium life
Lists of biologists by author abbreviation
Lists of cultivars
Understand this first and move to step 3:
We split the text along its newlines to get a list in which each item is one line of the text. We store the list in lines and then loop through the items in lines.
For each line, we add a star and a space to the start of the line. Now each string in lines begins with a star.
import pyperclip
text = pyperclip.paste()
# TODO: manipulate the text from the clipboard
lines = text.split('\n')  # split the text into a list of lines
for i in range(len(lines)):
    lines[i] = '* ' + lines[i]  # prefix each line with a star and a space
text = '\n'.join(lines)  # join the lines back into a single string
pyperclip.copy(text)  # copy the result back to the clipboard
print(text)
With this code, if you copy a list of items, one per line, the result is still a list, now with a star in front of each item.
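The single-line result in the question can be reproduced without the clipboard. This is a minimal sketch, assuming a hypothetical helper named add_bullets (it is not from the book): a string with no newline characters counts as one line, so it receives exactly one star.

```python
def add_bullets(text):
    """Prefix every line of text with a star (hypothetical helper for illustration)."""
    # splitlines() handles both '\n' and '\r\n' line endings
    return '\n'.join('* ' + line for line in text.splitlines())


# Typed on one line: there is no newline, so the whole string is a single line.
one_line = 'Lists of monkeys Lists of donkeys Lists of pankeys'
print(add_bullets(one_line))

# With explicit newlines, each line gets its own star.
three_lines = 'Lists of monkeys\nLists of donkeys\nLists of pankeys'
print(add_bullets(three_lines))
```

To get one star per entry, the copied text has to contain a newline between the entries, for example by copying a multi-line list from a web page or a text editor.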
I am trying to extract a few fields from an image using OCR. I am using pytesseract to read the image file, and this works as expected.
Code :
import pytesseract
from PIL import Image
import re
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
value = Image.open("ocr.JPG")
text = pytesseract.image_to_string(value)
print(text)
Output :
ALS 1 Emergency Base Rate
Y A0427 RE ABC
Anbulance Mileage Charge
Y A0425 RE ABC
Disposable Supplies
Y A0398 RH ABC
184800230, x
Next, I have to extract A0427 and A0425 from the text, but the problem is that my loop does not go through whole lines. It takes one character at a time, and that's why my regular expression isn't working.
Code:
for line in text:
    print(line)
    x = re.findall(r'^A[0-9][0-9][0-9][0-9]', text)
    print(x)
Get rid of that for loop as well, and use only
x = re.findall(r'A[0-9][0-9][0-9][0-9]', text)
without any loop (and remove the ^ too).
text is a string, and Python's default behavior when looping over a string with a for loop is to go through its characters one by one (a string is a sequence of characters).
To loop through the lines, first split the text into lines using text.splitlines():
for line in text.splitlines():
    print(line)
    # search the current line, without ^, because the code follows "Y "
    x = re.findall(r'A[0-9][0-9][0-9][0-9]', line)
    print(x)
EDIT: Or use Patel's answer to skip the loop altogether :)
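The difference between iterating over a string and iterating over its lines can be seen in a small sketch. The two-line sample string below is an assumption taken from the OCR output in the question:

```python
text = "Y A0427 RE ABC\nY A0425 RE ABC"

# Iterating over a string yields one character at a time.
chars = [ch for ch in text]

# splitlines() yields whole lines instead.
lines = text.splitlines()

print(chars[:3])
print(lines)
```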
The problem in your regex is the start anchor ^, which requires the matching text (for example A0425) to begin at the very start of the string. That is not the case here, because there is a Y and a space before it. So just remove ^ from your regex and you should get all the expected strings. You can also replace the four repetitions of [0-9] with [0-9]{4}, so the shortened regex becomes
A[0-9]{4}
You need to modify your current code like this,
import pytesseract
from PIL import Image
import re
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
value = Image.open("ocr.JPG")
text = pytesseract.image_to_string(value)
print(re.findall(r'A[0-9]{4}', text))
This should print all your matches without needing to loop over individual lines:
['A0427', 'A0425', 'A0398']
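The effect of the anchor can be checked without Tesseract. The sketch below assumes the OCR output from the question pasted in as a string literal, and compares the anchored pattern (with and without re.MULTILINE) against the unanchored one:

```python
import re

# Sample taken from the OCR output shown in the question.
text = """ALS 1 Emergency Base Rate
Y A0427 RE ABC
Anbulance Mileage Charge
Y A0425 RE ABC
Disposable Supplies
Y A0398 RH ABC
184800230, x"""

# ^ without flags only matches at the very start of the string.
anchored = re.findall(r'^A[0-9]{4}', text)

# With re.MULTILINE, ^ matches at the start of every line, but the
# codes still follow "Y ", so nothing matches here either.
anchored_multiline = re.findall(r'^A[0-9]{4}', text, flags=re.MULTILINE)

# Without the anchor, the codes are found wherever they appear.
unanchored = re.findall(r'A[0-9]{4}', text)

print(anchored, anchored_multiline, unanchored)
```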
I wrote a script to print the lines containing a specific word from a Bible text file. The problem is that I can't match the exact word; instead it prints lines containing any variation of the word.
For example, if I search for "am", it prints sentences with words such as "lame" and "name".
Instead, I want it to print only the sentences that contain the word "am" itself,
i.e. "I am your saviour", "Here I am", etc.
Here is the code I use:
import re
text = raw_input("enter text to be searched:")
shakes = open("bible.txt", "r")
for line in shakes:
    if re.match('(.+)' + text + '(.+)', line):
        print line
Here is another way to complete your task. It may be helpful, although it departs from your current approach.
The test.txt file I fed as input had four sentences:
This is a special cat. And this is a special dog. That's an average cat. But better than that loud dog.
When you run the program, include the text file. In command line, that'd look something like:
python file.py test.txt
This is the accompanying file.py:
key = raw_input("Please enter the word you wish to search for: ")
#print "You've selected: ", key, " as your key-word."
with open('test.txt') as f:
    content = f.read()  # read the whole file as one string
#print "This is the CONTENT", content
list_of_sentences = content.split(".")
for sentence in list_of_sentences:
    words = sentence.split(" ")
    for word in words:
        if word == key:
            print sentence
For the keyword "cat", this returns:
This is a special cat
That's an average cat
(note the periods are no longer there).
I think that if you put spaces inside the string literals on either side of text, like this:
'(.+) ' + text + ' (.+)'
that would do the trick, if I correctly understand what is going on in the code.
re.findall may be useful in this case:
print re.findall(r"([^.]*?" + text + "[^.]*\.)", shakes.read())
Or even without regex:
print [sentence + '.' for sentence in shakes.read().split('.') if text in sentence]
Reading this text file:
I am your saviour. Here I am. Another sentence.
Second line.
Last line. One more sentence. I am done.
both give the same result:
['I am your saviour.', ' Here I am.', ' I am done.']
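Note that the re.findall and in approaches above match substrings, so a sentence containing "name" would still be returned for "am". A word boundary (\b) restricts the match to whole words. The following is a minimal sketch in Python 3 syntax, assuming a hypothetical helper named sentences_with_word and a made-up sample string; it is case-sensitive and splits sentences on periods only:

```python
import re


def sentences_with_word(text, word):
    """Return the sentences of text that contain word as a whole word."""
    # \b matches at a word boundary; re.escape guards against
    # regex metacharacters in the search word.
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    sentences = [s.strip() + '.' for s in text.split('.') if s.strip()]
    return [s for s in sentences if pattern.search(s)]


sample = "I am your saviour. His name was lame. Here I am."
print(sentences_with_word(sample, "am"))
```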
Replace the whole string if it contains specific letters/character…
I have a text file (myFile.txt) that contains multiple lines, for example:
The hotdog
The goal
The goat
What I want to do is the following:
If any word/string in the file contains the characters 'go' then, replace it with a brand new word/string ("boat"), so the output would look like this:
The hotdog
The boat
The boat
How can I accomplish this in Python 2.7?
It sounds like you want something like this:
with open('myFile.txt', 'r+') as word_bank:
    new_lines = []
    for line in word_bank:
        new_line = []
        for word in line.strip().split():
            if 'go' in word:
                new_line.append('boat')
            else:
                new_line.append(word)
        new_lines.append('%s\n' % ' '.join(new_line))
    word_bank.truncate(0)
    word_bank.seek(0)
    word_bank.writelines(new_lines)
This opens the file for reading and writing, then iterates through it, splitting each line into its component words and looking for instances of 'go' to replace. The new lines are kept in a separate list because you do not want to modify something you're iterating over. You will have a bad time. Once the list is built, truncate the file (erase it) and write out what you came up with. Notice that I append an explicit '\n' to each line because writelines will not do that for you.
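The per-line replacement can also be tried on its own, without touching a file. This is a small sketch in Python 3 syntax, assuming a hypothetical helper named replace_words; like the code above, it collapses runs of whitespace because it splits and rejoins on spaces:

```python
def replace_words(line, fragment, replacement):
    """Replace every whitespace-separated word that contains fragment."""
    return ' '.join(replacement if fragment in word else word
                    for word in line.split())


lines = ['The hotdog', 'The goal', 'The goat']
result = [replace_words(line, 'go', 'boat') for line in lines]
print(result)
```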
from __future__ import division
import nltk
from nltk.corpus import wordnet as wn
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("inpsyn.txt")
data = fp.read()
#to tokenize input text into sentences
print '\n-----\n'.join(tokenizer.tokenize(data))# splits text into sentences
#to tokenize the tokenized sentences into words
tokens = nltk.wordpunct_tokenize(data)
text = nltk.Text(tokens)
words = [w.lower() for w in text]
print words #to print the tokens
for a in words:
    print a
syns = wn.synsets(a)
print "synsets:", syns
for s in syns:
    for l in s.lemmas:
        print l.name
    print s.definition
    print s.examples
I could not find any code related to my question. If there is any, please mail me the link.
This code does not find the synonyms for the words in a given text file or sentence.
It's a simple coding error. Look at where you use a in the loop (for a in words), then look further down at syns = wn.synsets(a). That line sits outside the loop, so it only ever sees the last value of a. What you want is to put all your synsets code inside the for a in words loop. Here is what you want altogether:
...
words = [w.lower() for w in nltk.wordpunct_tokenize(data)]  # other lines in your code are just excessive
for a in words:
    syns = wn.synsets(a)
    print "synsets:", syns
    for s in syns:
        for l in s.lemmas:
            print l.name
        print s.definition
        print s.examples
That's a bit of a silly mistake. Also, please learn some cleaner coding; the current code is very untidy and painful to look at.
I have the following sample text file. I specify an input value to search for, in this case the word "car". I would then like to add the matched line "this is a car", and all the lines below it that are indented with two spaces, to a list.
How do I search the text for ("this is a" + my input value)?
How would I go about adding only the indented lines to the list after making a match?
This is a sample of what the text file would look like:
this is a car
  it is red
  it has big wheels
  manual transmission
this is a dog
  it is brown
  and long fur
I am thinking it would look something like this in pseudo-code
def action(self, filename, input):
    with open(filename, 'rb') as f:
        text = f.readlines()
    output = []
    for lines in text:
        if ("this is a" + input) in lines:
            i = lines.strip()
            output.append(i)
            goto next line
            while there is a single space
                i = lines.strip()
                output.append(i)
Then if I do a print of output I should see the following:
this is a car
it is red
it has big wheels
manual transmission
When you go through for lines in text, have a flag that says whether you're currently looking for the starting text (this is a car) or for the following text (lines beginning with spaces).
In pseudocode:
looking_for_indents = False
for line in text:
    if ("this is a " + input) in line:
        print line
        looking_for_indents = True
        continue
    if looking_for_indents:
        if line.startswith(' '):
            print line
        else:
            # that's all of them, stop looking
            looking_for_indents = False
Side comments:
input is a builtin function in Python; naming your argument 'input' is bad practice
lines.strip() gets rid of start/end spaces, so it can't help if you're looking for them
readlines() returns a list of strings, and for iterates over it string by string. for lines in text implies lines is plural, but each item is a single line.
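The flag approach and the side comments can be combined into a runnable sketch. It is written in Python 3 syntax and makes a few assumptions that are not in the original posts: the helper name collect_block, the parameter name keyword (instead of input), an exact match on the header line rather than a substring test (so "car" does not match "carpet"), and stopping after the first matching block:

```python
def collect_block(lines, keyword):
    """Collect the 'this is a <keyword>' line and the indented lines after it."""
    output = []
    looking_for_indents = False
    for line in lines:
        if looking_for_indents:
            if line.startswith(' '):
                output.append(line.strip())
                continue
            # The first non-indented line ends the block.
            break
        if line.strip() == 'this is a ' + keyword:
            output.append(line.strip())
            looking_for_indents = True
    return output


# Lines as readlines() would return them for the sample file.
sample = [
    'this is a car\n',
    '  it is red\n',
    '  it has big wheels\n',
    '  manual transmission\n',
    'this is a dog\n',
    '  it is brown\n',
    '  and long fur\n',
]
print(collect_block(sample, 'car'))
```

The indentation test runs before strip() is applied, so the leading spaces are still there when they are needed.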