Python: Iterate through an html file - python-2.7

I'm trying to iterate through an html file from the internet.
target = br.response().read()
for row in target:
    if "[some text]" in row:
        print next(target)
The problem is this loop iterates over each character in the html file, so it'll never find a match. How do I get it to iterate through each row instead?
I've tried target = target.splitlines(), but that really messes up the file.

What you basically want to achieve is the following (reading from a file, as your header suggests):
#!/usr/bin/env python
with open("test.txt") as file:
    for line in file:
        if "got" in line:
            print "found: {0}".format(line)
You open your file ("test.txt"),
read it line by line (for .. in),
and check whether each line contains a string, which is where in comes in nicely. :)
If you are interested in the line number:
for index, line in enumerate(file):
But beware: the index starts at 0, so the current line number is index + 1.
Analogously, if you want to read from a string as if it were a file, take a look at StringIO.
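For example, here is a small self-contained sketch (using a made-up HTML snippet in place of the real response) that treats a string as a file via StringIO and collects the line after each match:

```python
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO        # Python 3

# Made-up response body standing in for br.response().read()
html = "<html>\n<body>\n[some text]\n<p>next line</p>\n</body>\n</html>"

matches = []
fh = StringIO(html)
for line in fh:
    if "[some text]" in line:
        # Advance the same iterator to grab the line that follows the match
        matches.append(next(fh).strip())
print(matches)
```

This mirrors the loop from the question, but iterates over lines rather than characters.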

Take a look at the page source for the file you're viewing, because that's what you're getting back as a response. I have a feeling the response you're getting doesn't actually have new lines where you want it to. For pages like http://docs.python.org/ where the source is readable your splitlines() method works great, but for sites where the source essentially has no line breaks, like Google's homepage, it's a lot closer to the problems you're experiencing.
Depending on what you are trying to achieve, your best bet might be to use an html/xml parsing library like lxml. Otherwise using re is probably a pretty safe approach. Both are a lot better than trying to guess where the line breaks should be.
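As a rough sketch of the re approach (the markup here is invented, so the pattern is only illustrative): match around the text you care about instead of guessing where line breaks fall:

```python
import re

# Invented response body with no line breaks at all
html = ('<html><body><div id="target">[some text]</div>'
        '<div>the next value</div></body></html>')

# Match the element that follows the marker rather than relying on lines
match = re.search(r'\[some text\]</div><div>([^<]*)</div>', html)
value = match.group(1) if match else None
print(value)
```

A real page would need a pattern adapted to its actual markup, which is exactly why a parser like lxml is the sturdier choice.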

Related

index a text file (lines with different size) in c++

I have to extract information from a text file.
In the text file there is a list of strings.
This is an example of a string: AAA101;2015-01-01 00:00:00;0.784
The value after the last ; is a non-integer value that changes from line to line, so every line has a different character length.
I want to map all of these lines into a structured vector so I can access a specific line anytime I need to, without scanning the whole file again.
I did some research and found some threads about a command that lets you jump to a specific line of a text file, but I read it only works if every line has the same character length as the others.
I was thinking about converting all the lines in the file to a fixed format so I could map the file the way I want, but I hope there is a better and quicker way.
You can try TStringList (for example via a TStringList* in C++Builder). It creates a list of AnsiStrings; each AnsiString can then be accessed via ->operator [](numberOfTheLine).

How can I handle a file in python 2.7 to separate them line-by-line to insert into database?

I am a novice in Python 2.7 and am unable to handle a large file (data-sets for my thesis). Suppose my data-set (mmd.txt) has 3 attributes.
1,2,3
0.2,2.3,4
5.4,2,3
1.3,2.4,3
9.2,2.6,7.22
5.4,2,3
5.66,4.25,7.6
45.2,52.6,7.22
5.4,20.2,3.6
5.66,4.25,7.6
Here is the data-set (mmd.txt). You can see that every line has three attributes separated by commas, and each line defines a row. So how can I handle/separate them to insert into a database? Please share your suggestions so that I can understand. Thanks for reading my post.
Hello and welcome to the wonderful world of Python.
What you are trying to do is easily accomplished. First you need to open the file using the open() function. Then you can split('\n') it into a list containing your lines. Lastly you can loop through the lines and split(',') each one, separating the comma-separated values into a list. This is just the basic framework you'd have to follow, and it needs to be adapted to the size of your input. In practice, it would look like this:
FILE_NAME = "mmd.txt"
with open(FILE_NAME, 'r') as f:
    lines = f.read().split('\n')
    for line in lines:
        print line.split(',')
This code will print your parameters nicely separated inside a list. I think you can go on with that.
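To take it one step further toward the database, here is a minimal sketch using the standard csv and sqlite3 modules; the table and column names are assumptions, and a small sample of mmd.txt is written out first so the snippet runs on its own:

```python
import csv
import sqlite3

# Write a small sample of mmd.txt so the sketch is self-contained
with open("mmd.txt", "w") as f:
    f.write("1,2,3\n0.2,2.3,4\n5.4,2,3\n")

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute("CREATE TABLE readings (a REAL, b REAL, c REAL)")  # names assumed

# csv handles the comma-splitting; skip any blank trailing row
with open("mmd.txt") as f:
    rows = [[float(v) for v in row] for row in csv.reader(f) if row]

conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(count)
```

For a real database you would swap sqlite3 for your database's driver, but the executemany pattern stays the same.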
If you want to learn more about the language, I'd recommend Codecademy's Python Track.
Happy coding

I need to create list in python from OpenOffice Calc columns

The problem is I have large amounts of data in OpenOffice Calc: approximately 3600 entries for each of 4 different categories, and 3 different sets of this data, and I need to run some calculations on it in Python. I want to create lists corresponding to each of the four categories. I am hoping someone can help guide me to an easy-ish, efficient way to do this, whether it be a script or importing the data. I am using Python 2.7 on a Windows 8 machine. Any help is greatly appreciated.
The method I am currently trying is to save the ODF file as CSV, then use genfromtxt (from numpy).
from numpy import genfromtxt
my_data = genfromtxt('C:\Users\tomdi_000\Desktop\Load modeling(WSU)\PMU Data\Data18-1fault-Alvey-csv trial.csv', delimiter=',')
print(my_data)
File "C:\Program Files (x86)\Wing IDE 101 5.0\src\debug\tserver\_sandbox.py", line 5, in <module>
File "c:\Python27\Lib\site-packages\numpy\lib\npyio.py", line 1352, in genfromtxt
fhd = iter(np.lib._datasource.open(fname, 'rbU'))
File "c:\Python27\Lib\site-packages\numpy\lib\_datasource.py", line 147, in open
return ds.open(path, mode)
File "c:\Python27\Lib\site-packages\numpy\lib\_datasource.py", line 496, in open
raise IOError("%s not found." % path)
IOError: C:\Users omdi_000\Desktop\Load modeling(WSU)\PMU Data\Data18-1fault-Alvey-csv trial.csv not found.
the error stems from this code in _datasource.py
# NOTE: _findfile will fail on a new file opened for writing.
found = self._findfile(path)
if found:
    _fname, ext = self._splitzipext(found)
    if ext == 'bz2':
        mode.replace("+", "")
    return _file_openers[ext](found, mode=mode)
else:
    raise IOError("%s not found." % path)
Your problem is that your path string 'C:\Users\tomdi_000\Desktop\Load modeling(WSU)\PMU Data\Data18-1fault-Alvey-csv trial.csv' contains an escape sequence - \t. Since you are not using raw string literal, the \t is being interpreted as a tab character, similar to the way a \n is interpreted as a newline. If you look at the line starting with IOError:, you'll see a tab has been inserted in its place. You don't get this problem with UNIX-style paths, as they use forward slashes /.
There are two ways around this. The first is to use a raw string literal:
r'C:\Users\tomdi_000\Desktop\Load modeling(WSU)\PMU Data\Data18-1fault-Alvey-csv trial.csv'
(note the r at the beginning). As explained in the link above, raw string literals don't interpret backslashes (\) as beginning an escape sequence.
The second way is to use a UNIX-style path with forward slashes as path delimiters:
'C:/Users/tomdi_000/Desktop/Load modeling(WSU)/PMU Data/Data18-1fault-Alvey-csv trial.csv'
This is fine if you're hard-coding the paths into your code. Note that escape sequences are only interpreted in string literals written in your source code; paths obtained at runtime, such as the results of an os.listdir() call, are ordinary strings and don't need any raw-string treatment.
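A quick way to see the difference between the two literal forms (the path here is made up):

```python
# \t and \n in a normal literal become a tab and a newline;
# a raw literal keeps the backslashes as written
normal = 'C:\temp\notes.csv'
raw = r'C:\temp\notes.csv'
print(repr(normal))
print(repr(raw))
```

Printing with repr() makes the embedded tab and newline in the non-raw version visible.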
If you're going to be using numpy to do the calculations on your data, then using np.genfromtxt() is fine. However, for working with CSV files, you'd be much better off using the csv module. It includes all sorts of functions for reading columns and rows, and doing data transformation. If you're just reading the data then storing it in a list, for example, csv is definitely the way to go.
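For instance, a small sketch of the csv approach (the file name and four-column layout are assumptions, and a tiny sample file is written first so the snippet is self-contained):

```python
import csv

# Stand-in for the exported spreadsheet; real data would have ~3600 rows
with open("sample.csv", "w") as f:
    f.write("1.0,2.5,3.0,4.5\n5.0,6.5,7.0,8.5\n")

# One list per category (four columns assumed)
columns = [[], [], [], []]
with open("sample.csv") as f:
    for row in csv.reader(f):
        # Distribute each row's values into the per-category lists
        for col, value in zip(columns, row):
            col.append(float(value))
print(columns)
```

Each entry of columns is then one of the per-category lists the question asks for.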

Folder with 1300 png files into html images list

I've got a folder with about 1300 PNG icons. What I need is an HTML file with all of them inside, like:
<img src="path-to-image.png" alt="file name without .png" id="file-name-without-.png" class="icon"/>
It's easy as hell, but with that number of files it's a pure waste of time to do it manually. Do you have any ideas how to automate it?
If you need it just once, then do a "dir" or "ls" and redirect it to a file, then use an editor with macro-ability like notepad++ to record modifying a single line like you desire, then hit play macro for the remainder of the file. If it's dynamic, use PHP.
I would not use C++ to do this. I would use vi, honestly, because running regular expressions repeatedly is all that is needed for this.
But you can do this in C++. I would start with a plain text file with all the file names, generated by dir or ls at the command prompt.
Then write code that takes a line of input and turns it into a line formatted the way you want. Test this and get it working on a single line first.
The RE engine of C++ is probably overkill (and is not all that well supported in compilers), but substr and basic find and replace is all you need. Is there a string library you are familiar with? std::string would do.
To generate the file name without PNG, check the last four characters and see if they exist and are .PNG (if not report an error). Then strip them. To remove dashes, copy characters to a new string but if you are reading a dash write a space. Everything else is just string concatenation.
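Alternatively, a short Python script can apply the same rules (strip .png, turn dashes into spaces for the alt text); the folder name and sample files here are made up so the sketch runs on its own:

```python
import os

ICON_DIR = "icons"  # hypothetical folder name; point this at your icon directory

# Create a couple of sample icons so the sketch is self-contained
if not os.path.isdir(ICON_DIR):
    os.makedirs(ICON_DIR)
for name in ("home.png", "user-profile.png"):
    open(os.path.join(ICON_DIR, name), "w").close()

tags = []
for fname in sorted(os.listdir(ICON_DIR)):
    if not fname.endswith(".png"):
        continue
    base = fname[:-4]             # file name without .png
    alt = base.replace("-", " ")  # dashes become spaces in the alt text
    tags.append('<img src="%s/%s" alt="%s" id="%s" class="icon"/>'
                % (ICON_DIR, fname, alt, base))
page = "\n".join(tags)
print(page)
```

Redirect the output to an .html file and the 1300 tags are generated in one go.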

Find Lines with N occurrences of a char

I have a txt file that I’m trying to import as flat file into SQL2008 that looks like this:
"123456","some text"
"543210","some more text"
"111223","other text"
etc…
The file has more than 300,000 rows, and the text is large (usually 200-500 chars), so scanning the file by hand is very time consuming and prone to error. Other similar (and even more complex) files were successfully imported.
The problem with this one is that "some lines" contain quotes in the text… (This came from an export from an old SuperBase DB that didn't let you specify a text qualifier; there's nothing I can do with the file other than clean it and try to import it.)
So the “offending” lines look like this:
"123456","this text "contains" a quote"
"543210","And the "above" text is bad"
etc…
You can see the problem here.
Now, 300,000 is not too much if I could perform a search using a text editor that supports regex; I'd manually remove the quotes from each line. The problem is not the number of offending lines, but the impossibility of finding them with a simple search. I'm sure there are fewer than 500, but spread those over a 300,000-line txt file and you know what I mean.
Based upon that, what would be the best regex I could use to identify these lines?
My first thought is: tell me which lines contain more than 4 quotes (").
But I couldn’t come up with anything (I’m not good at Regex beyond the basics).
This pattern, ^("[^"]+){4,}, will match "lines containing more than 4 quotes".
You can experiment with replacing 4 with 5 or more, depending on your data.
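A quick way to sanity-check the pattern in Python (the sample lines are taken from the question):

```python
import re

lines = [
    '"123456","some text"',
    '"123456","this text "contains" a quote"',
    '"543210","And the "above" text is bad"',
]

# Four or more quote-plus-text runs only occur on lines with extra quotes
pattern = re.compile(r'^("[^"]+){4,}')
offending = [line for line in lines if pattern.match(line)]
print(offending)
```

The well-formed first line is skipped, while both bad lines are flagged.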
I think that you can be more direct with a Regex than you're planning to be. Depending on your dialect of Regex, something like this should do it:
^"\d+",".*".*"
You could also use a regex to remove the outside quotes and use a better delimiter instead. For example, search for ^"([0-9]+)","(.*)"$ and replace it with \1+++++DELIM+++++\2.
Of course, this doesn't directly answer your question, but it might solve the problem.
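The search-and-replace just described can be sketched in Python with re.sub (the sample line comes from the question):

```python
import re

line = '"123456","this text "contains" a quote"'

# Greedy (.*) runs to the final quote, so inner quotes survive intact
fixed = re.sub(r'^"([0-9]+)","(.*)"$', r'\1+++++DELIM+++++\2', line)
print(fixed)
```

With the outer quotes gone and an unambiguous delimiter in place, the file imports cleanly.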