The following code is intended to open a text file, search it for any matches from a list of strings, and then output how many results it finds. For some reason, it's always "finding" 0.
validcards = Array("NVIDIA GRID K140Q", "AMD FirePro S7150", "VMware SVGA 3D")

Set fso = CreateObject("Scripting.FileSystemObject")
textFile = fso.OpenTextFile("_cards.txt", 1, 0, 1).ReadAll
Set fso = Nothing

Set stdout = WScript.StdOut

Set query = New RegExp
With query
    .Global = True
    .MultiLine = True
    .IgnoreCase = True
    .Pattern = "^.*?" & Join(validcards, ".*?") & ".*?$"
End With

counter = 0
Set results = query.Execute(textFile)
For Each result In results
    stdout.WriteLine Escape(result)
    counter = counter + 1
Next
When I output counter it is always zero. What am I missing? Here is what the text file looks like:
Name
VMware SVGA 3D
The text file is generated using wmic path win32_VideoController get name > _cards.txt
UPDATE
In desperation, I just printed out the file after it's loaded. It looks like this:
■N a m e
V M w a r e S V G A 3 D
I was able to fix this by changing the OpenTextFile line to textFile = fso.OpenTextFile("_cards.txt", 1, 0, -1).ReadAll. The fourth argument of -1 tells OpenTextFile to read the file as Unicode, which matches the UTF-16 output produced by redirecting wmic. However, the regex is still not working.
I changed the pattern to the following and now it seems to be working fine. Joining the names with "|" turns the list into an alternation, so a line only has to contain one of the card names, whereas the original pattern, joined with ".*?", required every name to appear on the same line, in order:
.pattern="^.*(" & join(validcards,"|") & ").*$"
I'm working with a very large text file (58 GB) that I'm attempting to split into smaller chunks. The problem I'm running into is that the smaller chunks appear to be hex. I'm having my terminal print each line to stdout as well, and when it's printed to stdout it looks like normal strings to me. Is this known behavior? I've never had Python spit things out in hex before. Even odder, when I tried using Ubuntu's split from the command line, it also generated everything in hex.
Code snippet below:
from datetime import datetime
from os import path

working_dir = '/SECRET/'
output_dir = path.join(working_dir, 'output')
test_file = 'SECRET.txt'
report_file = 'SECRET_REPORT.txt'
output_chunks = 100000000
output_base = 'SECRET'

input = open(test_file, 'r')
report_output = open(report_file, 'w')
count = 0
at_line = 0
output_f = None

for line in input:
    # Start a new chunk every output_chunks lines.
    if count % output_chunks == 0:
        if output_f:
            report_output.write('[{}] wrote {} lines to {}. Total count is {}'.format(
                datetime.now(), output_chunks, str(output_base + str(at_line) + '.txt'), count))
            output_f.close()
        output_f = open('{}{}.txt'.format(output_base, str(at_line)), 'wb')
        at_line += 1
    output_f.write(line.encode('ascii', 'ignore'))
    print line.encode('ascii', 'ignore')
    count += 1
Here's what was going on:
Each line started with a NUL character. When I opened parts of the file using head or PyCharm's terminal it showed up as normal text, but when I looked at my output in Sublime Text it was picking up on that NUL character and rendering the results in hex. I had to strip '\x00' from each line of the output, and then it started looking the way I expected it to.
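For reference, here is a minimal sketch of that clean-up step (the file names are placeholders): it copies a file line by line and drops the embedded NUL bytes, so editors stop treating the output as binary and rendering it as hex.
with open('chunk_with_nuls.txt', 'r') as src, open('chunk_clean.txt', 'w') as dst:
    for line in src:
        # Remove the NUL characters; everything else is copied through unchanged.
        dst.write(line.replace('\x00', ''))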
I have been searching for an answer to this but cannot seem to find what I need. I would like a Python script that reads my text file, works its way through it line by line starting from the top, and then prints all the matches to another txt file. The content of the text file is just 4-digit numbers, like 1234.
example
1234
3214
4567
8963
1532
1234
...and so on.
I would like the output to be something like:
1234 : matches found = 2
I know that there are matches in the file, since it is almost 10000 lines long. I appreciate any help. If someone could just point me in the right direction, that would be great. Thank you.
import re
file = open("filename", 'r')
fileContent=file.read()
pattern="1234"
print len(re.findall(pattern,fileContent))
If I were you, I would open the file and use the split method to create a list of all the numbers, then use the Counter class from collections to count how many of each number in the list are duplicates.
from collections import Counter

filepath = 'original_file'
new_filepath = 'new_file'

file = open(filepath, 'r')
text = file.read()
file.close()

numbers_list = text.split('\n')

# Count every number and keep only the ones that occur more than once.
dupes = [[item, ': matches found =', str(count)]
         for item, count in Counter(numbers_list).items() if count > 1]
dupes = [' '.join(i) for i in dupes]

new_file = open(new_filepath, 'w')
for i in dupes:
    new_file.write(i + '\n')
new_file.close()
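With the sample numbers from the question, where 1234 appears twice, this writes a line such as 1234 : matches found = 2 to new_file.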
Thanks to everyone who helped me on this. Thank you to @csabinho for the code he provided and to @IanAuld for asking me "Why do you think you need recursion here?" That got me thinking that the solution was a simple one. I just wanted to know which 4-digit numbers had duplicates and how many, and also which 4-digit combos were unique. So this is what I came up with, and it worked beautifully!
import re

# Read the file once; the loop only needs its contents.
file = open("4digits.txt", 'r')
fileContent = file.read()
file.close()

a = 999
while a < 9999:
    a = a + 1
    pattern = str(a)
    result = len(re.findall(pattern, fileContent))
    if result >= 1:
        print(a, "matches", result)
    else:
        print(a, "This number is unique!")
I have this file, testpi.txt, which I'd like to convert into a list of sentences.
>>>cat testpi.txt
This is math π.
That is moth pie.
Here's what I've done:
r = open('testpi.txt', 'r')
sentence_List = r.readlines()
print sentence_List
And when the output is sent to another text file, output.txt, this is what it looks like:
['This is math \xcf\x80. That is moth pie.\n']
I tried codecs too, r = codecs.open('testpi.txt', 'r',encoding='utf-8'),
but the output then consists of a leading 'u' in all the entries.
How can I display this byte string, \xcf\x80, as π in output.txt?
Please guide me, thanks.
The problem is you're printing the entire list which gives you an output format you don't want. Instead, print each string individually and it will work:
r = open('t.txt', 'r')
sentence_List = r.readlines()
for line in sentence_List:
    print line,
Or:
print "['{}']".format("', '".join(map(str.rstrip, sentence_List)))
I am trying to split one file with two articles in it into two separate files with one article in each, for subsequent analysis of the articles. Each article in the initial file has an ID that I want to use to separate the files with, using RE.
Below is the initial input file, with ID number:
166068619 #### "Epilepsy: let's end our ignorance of this neglected condition
Helen Stephens is a young woman with epilepsy [...]."
106899978 #### "Great British Payoff shows that BBC governance is broken
If it was a television series, they'd probably call it [...]."
However, when I run my code, I do get two separate files as output, but they are empty.
This is my code:
def file_split(path_to_file):
    """Function splits bigger file into N smaller ones, based on a certain RE
    match, that is used to break the bigger file into smaller ones"""
    def pattern_extract(path_to_file):
        """Function identifies the number of RE occurences in a file,
        No. can be used in further analysis as range No."""
        import re
        x = []
        with open(path_to_file) as f:
            for line in f:
                match = re.search(r'^\d+?\t####\t', line)
                if match:
                    a = match.group()
                    x.append(a)
        return len(x)

    y = pattern_extract(path_to_file)
    m = y + 1
    files = [open('filename%i.txt' %i, 'w') for i in range(1,m)]
    with open(path_to_file) as f:
        for line in f:
            match = re.search(r'^\d+?\t####\t', line)
            if match:
                a = match.group()
                #files = [open('filename%i.txt' %i, 'w') for i in range(1, m)]
                files[i-1].write(a)
    for f in files:
        f.close()
    return files
Output result is as follows:
file_split(path)
Out[19]:
[<open file 'filename1.txt', mode 'w' at 0x7fe121b130c0>,
<open file 'filename2.txt', mode 'w' at 0x7fe121b131e0>]
I am new to Python and I am not quite sure where the problem lies. I checked some other answers that addressed the multiple file outputs but cannot figure out the solution. Help would be very much appreciated.
There are two problems with your code:
you write only the line matching the ID (actually, just the match itself), not the rest
you are always writing to the last file, as you use i, the loop variable "left over" from the list comprehension
To fix it, you could change the lower portion of your code to this:
y = pattern_extract(path_to_file)
files = [open('filename%i.txt' %i, 'w') for i in range(y)]
n = -1
with open(path_to_file) as f:
    for line in f:
        if re.search(r'^\d+\s+####\s+', line):
            n += 1
        files[n].write(line)
But you do not have to read the file twice just to count the matches: simply open another file whenever a line matches an ID line, write directly to the last file in the list, and close all the files at the end.
open_files = []
with open(path_to_file) as f:
    for line in f:
        if re.search(r'^\d+\s+####\s+', line):
            open_files.append(open('filename%d.txt' % len(open_files), 'w'))
        open_files[-1].write(line)
for f in open_files:
    f.close()
I am new to VBA (I mean, REALLY new) and I would like you to give me some tips.
I have an Excel file with 2 columns: SKU and media_gallery
I also have images stored in a folder (let's name it /imageFolder)
I need to parse imageFolder, look for ALL images whose names start with the SKU (SKU*.jpg), and put those file names into the media_gallery column, separated by a semicolon ( ; ).
Example: My SKU is "1001", so I need to search the image folder for all images starting with 1001 (all the images follow this pattern: 1001-2.jpg, 1001-3.jpg, etc...).
I can do that in Java or C#, but I want to give VBA a chance. :)
How can I do that?
EDIT: I only need the file names, yes! And I should have said that I have 20,000 images in my folder and 8,000 SKUs, so I don't know how to handle looping over 20,000 image names.
EDIT2: If an SKU contains a dash ( - ), I don't need to process it, so I can skip to the next SKU. And each SKU has a maximum of 5 images (....;SKU-5.jpg)
Thanks all.
How to insert images given you have one image name per cell in a column: How to get images to appear in Excel given image url
Take the above and introduce an inner loop for the file name:
If InStr(url_column.Cells(i).Value, "-") = 0 Then
    Dim cur_file_name As String
    ' Dir with a wildcard returns the first matching file name; calling Dir
    ' again with no argument returns the next match, until it returns "".
    cur_file_name = Dir("imageFolder\" & url_column.Cells(i).Value & "*.jpg")
    Do Until Len(cur_file_name) = 0
        image_column.Cells(i).Value = image_column.Cells(i).Value & cur_file_name & ";"
        cur_file_name = Dir
    Loop
End If