I am currently working with Python 2.7 on a standalone system. My preferred method for solving this problem would be to use pandas DataFrames to compare. However, I do not have access to install the library on the system I'm working with. So my question is: how else could I take a text file and look for matches of its strings in a CSV?
If I have a main CSV file with many fields (for relevance, the first one is timestamps) and several other text files that each contain a list of timestamps, how can I compare each of the txt files with the main CSV and, if a match is found, grab the entire row from the CSV based on the matching field, then output that result to another CSV?
Example:
example.csv
timestamp,otherfield,otherfield2
1.2345,,
2.3456,,
3.4567,,
5.7867,,
8.3654,,
12.3434,,
32.4355,,
example1.txt
2.3456
3.4565
example2.txt
12.3434
32.4355
If there are any questions I'm happy to answer them.
You can load all the files into lists, then search the lists:
with open('example.csv', 'r') as file_handle:
    example_file_content = file_handle.read().splitlines()
with open('example1.txt', 'r') as file_handle:
    example1_file_content = file_handle.read().splitlines()
for index, line in enumerate(example_file_content):
    # compare the first field (the timestamp) against the txt file's lines
    if line.split(",")[0] in example1_file_content:
        print('match found; line is ' + line)
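To go all the way to the output CSV the question asks for, here is a minimal sketch of the same idea using the csv and glob modules, assuming the file names from the example above; it writes one output file per text file (e.g. example1_matches.csv):
import csv
import glob

with open('example.csv', 'rb') as f:   # 'rb' for the csv module on Python 2
    rows = list(csv.reader(f))
header, data = rows[0], rows[1:]

for txt_name in glob.glob('example*.txt'):
    with open(txt_name, 'r') as f:
        timestamps = set(f.read().split())
    # keep every row whose first field (the timestamp) appears in the txt file
    matches = [row for row in data if row and row[0] in timestamps]
    out_name = txt_name.rsplit('.', 1)[0] + '_matches.csv'
    with open(out_name, 'wb') as f:    # 'wb' for the csv module on Python 2
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(matches)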
Related
I'm making a program to speed up and slow down parts of videos, and I want to support modifying times on subtitles to match. How can I search for only the timecodes in a text file and modify them?
This is for srt subtitle files. Timecodes are in the format HH:MM:SS,mmm. The files contain other numbers (e.g. in hex colors), so I only want to search for numbers in the specific timecode format.
I already have a function to take an input time in seconds and return an output time in seconds. It should also be fairly easy to convert between 'timecode' format and time in seconds.
This is an example of the text file:
1
00:00:00,000 --> 00:00:09,138
<font color="#CCCCCC">alexxa</font><font color="#E5E5E5"> who's your favorite president</font>
2
00:00:04,759 --> 00:00:12,889
<font color="#E5E5E5">George Washington</font><font color="#CCCCCC"> has my vote Alexa</font>
The only thing left is how to pick out only the timecodes and then replace them with new ones.
I'm not sure where to go from here. It would also be good to avoid looping through the text file more than necessary, because there will be a lot of timecodes to change.
Given that it's a text format, the most efficient way to match (and replace) the time-stamp format in your file would be to use regular expressions: https://en.cppreference.com/w/cpp/regex
The algorithm would look like this: you read line by line from your source file; for every line where the RE matches, you replace the time-stamps (i.e. craft a new line) and output it to a new file (or to a buffer, which could later be written back into the source file after processing is done). Other lines (where the RE does not match) you output intact, as they were read.
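Although that link is to C++'s <regex>, the same line-by-line approach in Python (the language used elsewhere on this page) would look roughly like this sketch; retime() stands in for the seconds-to-seconds function the question says already exists, and the file names are placeholders:
import re

TIMECODE = re.compile(r'(\d{2}):(\d{2}):(\d{2}),(\d{3})')

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def to_timecode(seconds):
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return '%02d:%02d:%02d,%03d' % (h, m, s, ms)

def replace_timecode(match):
    # retime() is the user's existing seconds-to-seconds function
    return to_timecode(retime(to_seconds(*match.groups())))

with open('input.srt') as src, open('output.srt', 'w') as dst:
    for line in src:
        # lines without a timecode pass through unchanged, so the
        # file is only looped over once
        dst.write(TIMECODE.sub(replace_timecode, line))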
I am using the glob module to parse through a bunch of text files. Here are the relevant lines of that code:
for file in g.glob('*.TXT'):
    for col in csv.DictReader(open(file,'rU')):
It works fine, but is there a way to grab the names of the files it iterates through? I'm thinking this is not possible since it just looks for any files with the suffix '.TXT', but I thought I would ask.
Since glob.glob() returns only a list of matching file names, it's not possible to fetch all the file names considered using just this function. That would be a job for os.listdir().
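For example, a quick sketch of the difference (fnmatch is the matching module glob uses internally; the directory and pattern here are just the ones from the question):
import fnmatch
import os

all_names = os.listdir('.')                   # everything in the directory
matched = fnmatch.filter(all_names, '*.TXT')  # what the glob pattern matches
print('considered: %s' % all_names)
print('matched: %s' % matched)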
If you only want to keep track of the matched files, you can store the return value of glob() before iterating over it:
filenames = g.glob('*.TXT')
for filename in filenames:
    for col in csv.DictReader(open(filename,'rU')):
        ...
Also note that I changed the name file to filename, because that's more precise.
I am a novice in Python 2.7 and unable to handle a large file (data-sets for my thesis). Suppose my data-set (mmd.txt) has 3 attributes:
1,2,3
0.2,2.3,4
5.4,2,3
1.3,2.4,3
9.2,2.6,7.22
5.4,2,3
5.66,4.25,7.6
45.2,52.6,7.22
5.4,20.2,3.6
5.66,4.25,7.6
Here is the data-set (mmd.txt). You can see that every line has three attributes separated by commas, and each line defines a row. So how can I handle/separate them to insert them into a database? Please share your suggestions so that I can understand. Thanks for reading my post.
Hello and welcome to the wonderful world of Python.
What you are trying to do is easily accomplished. First you need to open the file using the open() function. Then you can split it into a list containing your lines. Lastly, you can loop through the lines and split(',') each one, separating the comma-separated values into a list. This is just the basic framework you'd have to follow, and it needs to be adapted to the size of your input. In practice, it would look like this:
FILE_NAME = "mmd.txt"
with open(FILE_NAME, 'r') as f:
    lines = f.read().splitlines()
    for line in lines:
        print line.split(',')
This code will print your values nicely separated inside a list. I think you can go on from there.
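Since the goal is to insert the rows into a database, here is a minimal sketch of that step, assuming SQLite via the standard-library sqlite3 module and made-up column names a, b and c; swap in your real schema:
import csv
import sqlite3

conn = sqlite3.connect('thesis.db')  # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS mmd (a REAL, b REAL, c REAL)')

with open('mmd.txt', 'rb') as f:     # 'rb' for the csv module on Python 2
    reader = csv.reader(f)           # does the comma splitting for you
    rows = [(float(a), float(b), float(c)) for a, b, c in reader]

conn.executemany('INSERT INTO mmd VALUES (?, ?, ?)', rows)
conn.commit()
conn.close()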
If you want to learn more about the language, I'd recommend Codecademy's Python Track.
Happy coding
The problem is I have large amounts of data in OpenOffice Calc, approximately 3600 entries for each of 4 different categories and 3 different sets of this data, and I need to run some calculations on it in Python. I want to create lists corresponding to each of the four categories. I am hoping someone can guide me to an easy-ish, efficient way to do this, whether it be a script or importing the data. I am using Python 2.7 on a Windows 8 machine. Any help is greatly appreciated.
The method I am currently trying is to save the ODF file as CSV, then use genfromtxt (from numpy):
from numpy import genfromtxt
my_data = genfromtxt('C:\Users\tomdi_000\Desktop\Load modeling(WSU)\PMU Data\Data18-1fault-Alvey-csv trial.csv', delimiter=',')
print(my_data)
File "C:\Program Files (x86)\Wing IDE 101 5.0\src\debug\tserver\_sandbox.py", line 5, in <module>
File "c:\Python27\Lib\site-packages\numpy\lib\npyio.py", line 1352, in genfromtxt
fhd = iter(np.lib._datasource.open(fname, 'rbU'))
File "c:\Python27\Lib\site-packages\numpy\lib\_datasource.py", line 147, in open
return ds.open(path, mode)
File "c:\Python27\Lib\site-packages\numpy\lib\_datasource.py", line 496, in open
raise IOError("%s not found." % path)
IOError: C:\Users omdi_000\Desktop\Load modeling(WSU)\PMU Data\Data18-1fault-Alvey-csv trial.csv not found.
the error stems from this code in _datasource.py
# NOTE: _findfile will fail on a new file opened for writing.
found = self._findfile(path)
if found:
    _fname, ext = self._splitzipext(found)
    if ext == 'bz2':
        mode.replace("+", "")
    return _file_openers[ext](found, mode=mode)
else:
    raise IOError("%s not found." % path)
Your problem is that your path string 'C:\Users\tomdi_000\Desktop\Load modeling(WSU)\PMU Data\Data18-1fault-Alvey-csv trial.csv' contains an escape sequence: \t. Since you are not using a raw string literal, the \t is interpreted as a tab character, similar to the way a \n is interpreted as a newline. If you look at the line starting with IOError:, you'll see a tab has been inserted in its place. You don't get this problem with UNIX-style paths, as they use forward slashes (/).
There are two ways around this. The first is to use a raw string literal:
r'C:\Users\tomdi_000\Desktop\Load modeling(WSU)\PMU Data\Data18-1fault-Alvey-csv trial.csv'
(note the r at the beginning). As the Python documentation explains, raw string literals don't interpret backslashes (\) as beginning an escape sequence.
The second way is to use a UNIX-style path with forward slashes as path delimiters:
'C:/Users/tomdi_000/Desktop/Load modeling(WSU)/PMU Data/Data18-1fault-Alvey-csv trial.csv'
This is fine if you're hard-coding the paths into your code. Note that paths generated at runtime, such as the results of an os.listdir() call, don't suffer from this problem at all: escape sequences are only interpreted in string literals in your source code, not in strings read from files or the OS.
If you're going to be using numpy to do the calculations on your data, then using np.genfromtxt() is fine. However, for working with CSV files, you'd be much better off using the csv module. It includes all sorts of functions for reading columns and rows, and doing data transformation. If you're just reading the data then storing it in a list, for example, csv is definitely the way to go.
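As a rough sketch of the csv route, assuming each of the four categories is a column (adjust the indices to your actual layout):
import csv

path = r'C:\Users\tomdi_000\Desktop\Load modeling(WSU)\PMU Data\Data18-1fault-Alvey-csv trial.csv'

columns = [[], [], [], []]           # one list per category
with open(path, 'rb') as f:          # 'rb' for the csv module on Python 2
    for row in csv.reader(f):
        for i in range(4):
            columns[i].append(float(row[i]))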
So I'm trying to write a small program that can search for words within files of a specific folder.
It should be able to search for one word as well as a combination of two or more words. Then, as a result, it should give a list of the file names that include those words.
It's important that I use dictionaries and that the files are given a number within the dictionary.
I've tried several things but I'm still stuck. Basically I've been thinking that the words within each file have to be split into strings, and then by searching for a string you get a result.
It's important to note that I'm using unicode utf-8.
Anyone willing to help me?
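Not a complete solution, but here is a minimal sketch of one way to structure it, assuming the files live in a folder named 'docs' and are UTF-8 encoded; each file gets a number in one dictionary, and a second dictionary maps each word to the numbers of the files containing it:
# -*- coding: utf-8 -*-
import io
import os

folder = 'docs'                      # hypothetical folder name
files_by_number = {}                 # number -> file name
words_to_numbers = {}                # word -> set of file numbers

# number the files and index every word they contain
for number, name in enumerate(sorted(os.listdir(folder))):
    files_by_number[number] = name
    with io.open(os.path.join(folder, name), encoding='utf-8') as f:
        for word in f.read().split():
            words_to_numbers.setdefault(word, set()).add(number)

def search(*words):
    # keep only the file numbers that appear for every requested word
    numbers = set(files_by_number)
    for word in words:
        numbers &= words_to_numbers.get(word, set())
    return [files_by_number[n] for n in sorted(numbers)]

print(search(u'word1', u'word2'))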