Renaming files with no fixed char length in Python - python-2.7

I am currently learning Python 2.7 and am really impressed by how much it can do.
Right now, I'm working my way through basics such as functions and loops. I'd reckon a more 'real-world' problem would spur me on even further.
I use a satellite recording device to capture TV shows etc to hard drive.
The naming convention is set by the device itself, which makes the shows you want to watch harder to find after recording, as the show name is preceded by lots of redundant info...
The recordings (in .mts format) are dumped into a folder called "HBPVR" at the root of the drive. I'd be running the script on my Mac when the drive is connected to it.
Example.
"Channel_4_+1-15062015-2100-Exams__Cheating_the_....mts"
or
"BBC_Two_HD-19052015-2320-Newsnight.mts"
I included the double-quotes.
I'd like a Python script that (ideally) would remove the broadcaster name, reformat the date info, strip the time info and then put the show's name to the front of the file name.
E.g "BBC_Two_HD-19052015-2320-Newsnight.mts" ->> "Newsnight 19 May 2015.mts"
What may complicate matters is that the broadcaster names are not all of equal length.
The main pattern is that the broadcaster name runs up until the first hyphen.
I'd like to be able to re-run this script at later points for newer recordings and not have already renamed recordings renamed further.
Thanks.

Try this:
import calendar
fname = "BBC_Two_HD-19052015-2320-Newsnight.mts"
# Remove broadcaster name (everything up to the first hyphen)
fname = '-'.join(fname.split("-")[1:])
# Get show name (third field onwards, with the .mts extension stripped)
show = ' '.join(fname.split("-")[2:]).rsplit(".mts", 1)[0]
# Get date string (first field once the broadcaster is removed)
datestr = fname.split("-")[0]
day = int(datestr[0:2])                         # The day is the first two digits
month = calendar.month_name[int(datestr[2:4])]  # The month is the next two digits
year = datestr[4:8]                             # The year is the last four digits
# And the new string:
new = show + " " + str(day) + " " + month + " " + year + ".mts"
print(new)  # "Newsnight 19 May 2015.mts"
I wasn't quite sure what the '2320' was, so I chose to ignore it.

Thanks Coder256.
That has given me a bit more insight into how Python can actually help solve real world (first world!) problems like mine.
I tried it out with some different combos of broadcaster and show names and it worked.
I would like though to use the script to rename a batch of recordings/files inside the folder from time to time.
The script did throw an error when processing an already renamed recording, which is to be expected I guess. Should the renamed file have a special character at the start of its name to help avoid this happening?
e.g "_Newsnight 19 May 2015.mts"
Or is there a more aesthetically pleasing way of doing this than tacking special characters onto the name?
Thanks.

One way to approach this, since you have a defined pattern, is to use regular expressions:
>>> import datetime
>>> import re
>>> s = "BBC_Two_HD-19052015-2320-Newsnight.mts"
>>> ts, name = re.findall(r'.*?-(\d{8}-\d{4})-(.*?)\.mts', s)[0]
>>> '{} {}.mts'.format(name, datetime.datetime.strptime(ts, '%d%m%Y-%H%M').strftime('%d %B %Y'))
'Newsnight 19 May 2015.mts'
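If you want to run this over the whole HBPVR folder and skip recordings that have already been renamed, one option is to only touch files that still match the recorder's pattern. A rough sketch (the mount point below is a placeholder; adjust it to wherever your drive shows up on the Mac):
import datetime
import os
import re

folder = '/Volumes/YOUR_DRIVE/HBPVR'  # placeholder -- point this at your drive
pattern = re.compile(r'.*?-(\d{8}-\d{4})-(.*?)\.mts$')

for old_name in os.listdir(folder):
    match = pattern.match(old_name)
    if not match:
        # already renamed, or not a recording -- skip it
        continue
    ts, show = match.groups()
    date = datetime.datetime.strptime(ts, '%d%m%Y-%H%M')
    new_name = '{} {}.mts'.format(show, date.strftime('%d %B %Y'))
    os.rename(os.path.join(folder, old_name), os.path.join(folder, new_name))
Once a file has been renamed to the "Newsnight 19 May 2015.mts" style it no longer matches the pattern, so re-running the script leaves it alone and there's no need for a leading underscore.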

Related

Pandas: Grouping rows by list in CSV file?

In an effort to make our budgeting life a bit easier and help myself learn, I am creating a small program in Python that takes data from our exported bank CSV.
I will give you an example of what I want to do with this data. Say I want to group all of my fast food expenses together. There are many different names with different totals in the Description column, but I want to see it all tabulated as one "Fast Food" expense.
For instance, the CSV is set up like this:
Date     Description            Debit  Credit
1/20/20  POS PIN BLAH BLAH ###  1.75   NaN
I figured out how to group them with an or statement:
contains = df.loc[df['Description'].str.contains('food court|whataburger', flags = re.I, regex = True)]
I ultimately would like to have it read off of a list. I would like to group all my expenses into categories and check against those category lists, so that it would only output matches from them.
I tried something like:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
That obviously didn't work.
If there is a better way of doing this I am wide open to suggestions.
Also I have looked through quite a few posts here on stack and have yet to find the answer (although I am sure I overlooked it)
Any help would be greatly appreciated. I am still learning.
Thanks
You can assign a new column using str.extract and then groupby:
df = pd.DataFrame({"description":['Macdonald something', 'Whataburger something', 'pizza hut something',
'Whataburger something','Macdonald something','Macdonald otherthing',],
"debit":[1.75,2.0,3.5,4.5,1.5,2.0]})
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
df["found"] = df["description"].str.extract(f'({"|".join(fast_food)})',flags=re.I)
print (df.groupby("found").sum())
# Output:
             debit
found
Macdonald     5.25
Whataburger   6.50
pizza hut     3.50
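If you eventually want several category lists rather than just fast food, one way to extend the same idea is to assign a category column from a dict of lists and group on that. A small sketch (the category names and the extra keywords here are made up for illustration):
import re
import pandas as pd

categories = {
    "Fast Food": ['Macdonald', 'Whataburger', 'pizza hut'],
    "Groceries": ['kroger', 'heb'],   # hypothetical second list
}

df = pd.DataFrame({"Description": ['Macdonald something', 'kroger run', 'Whataburger combo'],
                   "Debit": [1.75, 20.0, 2.0]})

df["Category"] = "Other"  # default for rows that match no list
for category, keywords in categories.items():
    pattern = r"\b(?:{})\b".format("|".join(map(re.escape, keywords)))
    df.loc[df["Description"].str.contains(pattern, flags=re.I, regex=True), "Category"] = category

print(df.groupby("Category")["Debit"].sum())
Anything that matches none of the lists stays in "Other", so you can spot expenses you haven't categorised yet.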
Use dynamic pattern building:
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, fast_food)))
contains = df.loc[df['Description'].str.contains(pattern, flags = re.I, regex = True)]
The \b word boundaries find whole words, not partial words.
The re.escape will protect special characters and they will be parsed as literal characters.
If \b does not work for you, check other approaches at Match a whole word in a string using dynamic regex
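As a quick illustration of what the boundaries buy you (a tiny made-up example, not the question's data):
import re
import pandas as pd

df = pd.DataFrame({"Description": ["WHATABURGER 123 POS PIN", "thewhataburgerfanclub newsletter"]})
fast_food = ['Macdonald', 'Whataburger', 'pizza hut']
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, fast_food)))
# Only the first row matches; in the second row the name is buried inside a longer word,
# so the \b boundaries reject it.
print(df.loc[df['Description'].str.contains(pattern, flags=re.I, regex=True)])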

Regular Expressions to Update a Text File in Python

I'm trying to write a script to update a text file by replacing instances of certain characters, (i.e. 'a', 'w') with a word (i.e. 'airplane', 'worm').
If a single line of the text was something like this:
a.function(); a.CallMethod(w); E.aa(w);
I'd want it to become this:
airplane.function(); airplane.CallMethod(worm); E.aa(worm);
The difference is subtle but important: I'm only changing 'a' and 'w' where they're used as variables, not where they're just another character in some other word. And there are many lines like this in the file. Here's what I've done so far:
import re

original = open('original.js', 'r')
modified = open('modified.js', 'w')
# iterate through each line of the file
for line in original:
    # Search for the character 'a' when not part of a word of some sort
    line = re.sub(r'\W(a)\W', 'airplane', line)
    modified.write(line)
original.close()
modified.close()
I think my RE pattern is wrong, and I think I'm using the re.sub() method incorrectly as well. Any help would be greatly appreciated.
If you're concerned about the semantic meaning of the text you're changing with a regular expression, then you'd likely be better served by parsing it instead. Luckily, Python has two good modules to help you with parsing Python: look at the Abstract Syntax Tree and the Parser modules. There are probably others for JavaScript, if that's what you're doing, like slimit.
Future reference on Regular Expression questions, there's a lot of helpful information here:
https://stackoverflow.com/tags/regex/info
Reference - What does this regex mean?
And it took me 30 minutes from never having used this JavaScript parser in Python (replete with installation issues: please note the right ply version) to writing a basic solution given your example. You can too.
# Note: sudo pip3 install ply==3.4 && sudo pip3 install slimit
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor
data = 'a.funktion(); a.CallMethod(w); E.aa(w);'
tree = Parser().parse(data)
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.Identifier):
        if node.value == 'a':
            node.value = 'airplane'
        elif node.value == 'w':
            node.value = 'worm'
print(tree.to_ecma())
It runs to give this output:
$ python3 src/python_renames_js_test.py
airplane.funktion();
airplane.CallMethod(worm);
E.aa(worm);
Caveats:
function is a reserved word, I used funktion
the to_ecma method pretty prints; there is likely another way to output it closer to the original input.
line = re.sub(r'\ba\b', 'airplane', line)
should get you closer. However, note that you will also turn a.CallMethod("That is a house") into airplane.CallMethod("That is airplane house"), and open("file.txt", "a") into open("file.txt", "airplane"). Getting it right in a complex syntax environment using regexes is hard to impossible.
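Put together with the file handling from your question, a rough sketch of that word-boundary approach (same original.js / modified.js names as above) would be:
import re

replacements = {'a': 'airplane', 'w': 'worm'}

with open('original.js', 'r') as original, open('modified.js', 'w') as modified:
    for line in original:
        for var, word in replacements.items():
            # \b only matches where 'a' or 'w' stands alone as a whole token
            line = re.sub(r'\b{}\b'.format(var), word, line)
        modified.write(line)
As noted above, this will still rewrite 'a' and 'w' inside string literals and other spots a real parser would leave alone.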

Removing Duplicate Lines by Title Only

I am trying to modify a script so that it will remove duplicate lines from a text file using only the title portion of that line.
To clarify the text file lines look something like this:
Title|Image Url|Description|Page Url
At the moment the script does remove duplicates, but it does so by reading the entire line, not just the first part. All the lines in the file are not going to be 100% the same, but a few will be very similar.
I want to remove all of the lines that contain the same "title", regardless of what the rest of the line contains.
This is the script I am working with:
import sys
from collections import OrderedDict
infile = "testfile.txt"
outfile = "outfile.txt"
inf = open(infile,"r")
lines = inf.readlines()
inf.close()
newset = list(OrderedDict.fromkeys(lines))
outf = open(outfile,"w")
lstline = len(newset)
for i in range(0,lstline):
    ln = newset[i]
    outf.write(ln)
outf.close()
So far I have tried using .split() to split the lines in the list. I have also tried .readline(lines[0:25]) in hopes of using a character limit to achieve the desired results, but no luck so far. I also can't seem to find any documentation on my exact problem so I'm stuck.
I am using Windows 8 and Python 2.7.9 for this project if that helps.
I made a few changes to the program you had set up. First, I changed your file interactions to use "with" statements, since those are very convenient and automatically handle a lot of the functionality you had to write out. Second, I used a set instead of an OrderedDict, because you were basically just trying to emulate set functionality (exclusivity of elements) by using the keys of an OrderedDict. If the title hasn't been used, the script adds it to the set so it can't be used again and writes the line to the output file. If it has been used, it keeps going. I hope this helps you!
with open("testfile.txt") as infile:
with open("outfile.txt",'w') as outfile:
titleset = set()
for line in infile:
title = line.split('|')[0]
if title not in titleset:
titleset.add(title)
outfile.write(line)

matlab most efficient partial string subtract/regex

I have a project where I have a huge list of email addresses (e.g. johnhahifas@example.com) and I have a list of the top 5000 most common first names (e.g. John, Jim etc)
I am trying, for each email address, to remove any first name that appears in it, so that for example:
johnhahifas@example.com becomes hahifas@example.com
benTTTben@something.com becomes TTT@something.com
so far, I came up with a regex solution and a strfind solution; the regexprep is much faster. I even optimized it a little to remove the @example.com part so that the operation might be faster, but it still takes forever to run.
you can download the common first names from this address (CSV file)
http://www.quietaffiliate.com/Files/CSV_Database_of_First_Names.csv
The regex code I have:
fileID = fopen('bademails');
emails = textscan(fileID,'%s');
str = emails{1,1}; %//Loaded in the emails
fileID = fopen('CSV_Database_of_First_Names.csv');
names = textscan(fileID,'%s');
Names = lower(names{1,1}); %//Loaded in the Names
K = regexprep(str,Names,''); %Regex on Names
Any faster solutions would be much appreciated, thank you in advance!

How do I print each line here in a for loop

My text file (englishtweet.txt) looks something like this:
thanks for the follow :)
hii... if u want to make a new friend just add me on facebook! :) xx
Just wanna say if you ever feel lonely or sad or bored, just come and talk to me. I'm free anytime :)
I hope she not a spy for someone. I hope she real on neautral side. Because just her who i trust. :-)
not always but sometimes maybe :)
\u201c Funny how you get what you want and pray for when you want the same thing God wants. :)
Thank you :) can you follow me on Twitter so I can DM you?
RT dj got us a fallin in love and yeah earth number one m\u00fcsic listen thank you king :-)
found a cheeky weekend for \u00a380 return that's flights + hotel.. middle of april, im still looking pal :)
RT happy birthday mary ! Hope you have a good day :)
Thank god twitters not blocked on the school computers cause all my data is gone on my phone :(
enjoy tmrro. saw them earlier this wk here in tokyo :)
UPDATE:
OK, maybe my question was wrong. I have to do this:
Open the file and read from it
Remove some links, names and other stuff from it (I have used regex, but don't know if that's the right way to do it)
After I have clean text (only tweets with a sad face or happy face), I have to print each line out, because I have to loop over each one like this:
for line in tweets:
    if ':)' in line:
        cl.train(line,'happy')
    elif ':(' in line:
        cl.train(line,'sad')
You can see my code so far below, but it doesn't work yet.
import re
from pprint import pprint

tweets = []
tweets = open('englishtweet.txt').read()

regex_username = r'@[^\s]*'  # regex to detect a username in the file
regex_url = r'http[^\s]*'    # regex to detect a url in the file
regex_names = r'#[^\s]*'     # regex to detect a # hashtag in the file

for username in re.findall(regex_username, tweets):
    tweets = tweets.replace(username, '')
for url in re.findall(regex_url, tweets):
    tweets = tweets.replace(url, '')
for names in re.findall(regex_names, tweets):
    tweets = tweets.replace(names, '')
If you want to read the first line, use next
with open("englishtweet.txt","r") as infile:
print next(infile).strip()
# this prints the first line only, and consumes the first value from the
# generator so this:
for line in infile:
print line.strip()
# will print every line BUT the first (since the first has been consumed)
I'm also using a context manager here, which will automatically close the file once you exit the with block instead of you having to remember to call tweets.close(). It also helps in case of error: depending on what else you're doing with your file, you may throw an exception that never lets you reach the .close() call.
If your file is very small, you could use .readlines:
with open("englishtweet.txt","r") as infile:
tweets = infile.readlines()
# tweets is now a list, each element is a separate line from the file
print tweets[0] # so element 0 is the first line
for line in tweets[1:]: # the rest of the lines:
print line.strip()
However, reading a whole file object into memory isn't really recommended, as with some files it can simply be a huge memory waster, especially if you only need the first line; there's no reason to read the whole thing into memory.
That said, since it looks like you may be using these lines for more than just one iteration, maybe readlines IS the best approach.
You almost have it. Just remove the .read() when you originally open the file. Then you can loop through the lines.
tweets = open('englishtweet.txt','r')
for line in tweets:
    print line
tweets.close()
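If the end goal from the UPDATE is to clean each tweet and then feed it to the classifier line by line, a rough sketch that ties the pieces together might look like the following (it's wrapped in a function so that cl, the classifier from the question, is passed in rather than defined here):
import re

def train_from_file(cl, path='englishtweet.txt'):
    # patterns from the question: usernames, urls and hashtags
    patterns = [r'@[^\s]*', r'http[^\s]*', r'#[^\s]*']
    with open(path, 'r') as infile:
        for line in infile:
            # strip usernames, urls and hashtags from this tweet
            for pattern in patterns:
                line = re.sub(pattern, '', line)
            line = line.strip()
            if ':)' in line:
                cl.train(line, 'happy')
            elif ':(' in line:
                cl.train(line, 'sad')
This avoids reading the whole file into one big string and keeps the per-line loop the UPDATE asks for.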