Design advice for a personal project - "Files Renamer"? - c++

I've just started learning the WinAPI and C++ programming.
I was thinking about starting a personal project (to improve my coding, and to help me understand the WinAPI better),
and I've decided to write a command-line ("cmd") file renamer that basically takes:
1) a path
2) a keyword
3) the desired format
4) versioned or not (or numbered: if you had 20 episodes of the same show, you wouldn't want to
truncate the episode number)
5) special cases to delete (when you download a torrent, the files often have a tag like [309u394] attached to the name, and most of the time an initial [WE-RIP-TV-SHOWS-HDTV-FANSUBS-GROUPS-ETC])
I am building the logic as follows:
The program takes the path (input 1) and
performs a full file indexing. Then it compares the files found against the
given keyword (input 2) (use regex?).
Reformat file name step (inputs 3, 4, 5).
Save the file name.
Questions:
A) Is my logic flow sound? Any suggestions to improve it?
B) Should I use regex to check the file name against the keyword and the desired format? (I'm not good with regex yet.) I mean, is it the best way to perform the huge number of comparisons?

Regular expressions should do the trick. You could also use the Boost library; it has some really neat functions, including a regex library, which is probably faster than most of the functions you'll find around (:
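To make the regex idea concrete, here is a minimal sketch of the "reformat file name" step. It is written in Python rather than C++ purely for brevity, and everything in it is an assumption for illustration: the function name new_name, the output format "keyword - NN.ext", and the rule that the last run of digits in the cleaned name is the episode number (which would misfire on names containing something like "720p").

```python
import os
import re

TAG = re.compile(r"\[[^\]]*\]")      # bracketed tags such as [309u394]
SEPARATORS = re.compile(r"[._\s]+")  # dots/underscores used instead of spaces
NUMBER = re.compile(r"\d+")          # any run of digits


def new_name(filename, keyword, numbered=True):
    """Return the reformatted name, or None if the keyword is not in the name."""
    stem, ext = os.path.splitext(filename)
    # Drop bracketed tags, then normalise separators to single spaces.
    cleaned = SEPARATORS.sub(" ", TAG.sub(" ", stem)).strip()
    if keyword.lower() not in cleaned.lower():
        return None
    numbers = NUMBER.findall(cleaned)
    if not numbered or not numbers:
        return keyword + ext
    # Assumption: the last number left in the cleaned name is the episode.
    return "{} - {:02d}{}".format(keyword, int(numbers[-1]), ext)


print(new_name("[WE-RIP] My.Show.E07.[309u394].avi", "My Show"))
```

The same three patterns carry over directly to std::regex or Boost.Regex; only the string handling around them changes.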

Related

Custom extensions files readable & writable only with the program

As you can see, the title explains a little.
I know several questions have already been asked about custom extensions, but I also want the file I'm using to be readable only by the program I'm writing; if opened from Explorer, it should not make any sense (maybe encrypted).
The program I'm coding is a substitution cipher
in C++, and it works with two maps, so that every letter in the message is looked up in one map and substituted with the corresponding letter in the other.
I'm trying to store these two maps in a file, so how would you suggest I achieve this?
P.S.: I couldn't find any similar question. If you do, please give me a link.

Simple Phrases recognition

I am looking to recognize simple phrases, the way Google Calendar does,
but rather than parsing calendar entries I have to parse sentences related to finance, accounting and to-do lists. For example, I have to parse sentences like
I spent 50 dollars on food yesterday
I need to mark and separate the info as Reason: 'food', Cost: 50 and Time: <Yesterday's Date>
My question is: do I go in for full-fledged Natural Language Processing, like
that described in these questions, and use something like GATE?
Machine Learning and Natural Language Processing
Natural Language Processing in Ruby
Ideas for Natural Language Processing project?
https://stackoverflow.com/a/3058063/492561
Or is it better to write simple grammars using something like ANTLR and try to recognize the sentences?
Or should I go really low-level and just define a syntax and use regular expressions?
Time is a constraint: I have about 45-50 days, and I don't know how to use ANTLR or NLP libraries like GATE.
Preferred languages: Python, Java, Ruby (in no particular order)
PS: This is not homework, so please don't tag it as such.
PPS: Please try to give an answer with facts on why using a particular method is better.
Even if a particular method may not fit inside the time constraint, please feel free to share it, because it might benefit someone else.
You could indeed look at named entity recognition. From your question I understand your domain is pretty well defined, so you can identify the (few?) entities (dates, currencies, money amounts, time expressions, etc.) that are relevant for you. If the phrases are very simple, you could go with a rule-based approach; otherwise it's likely to get too complex too soon.
Just to get yourself up and running in a few seconds, http://timmcnamara.co.nz/post/2650550090/extracting-names-with-6-lines-of-python-code is an extremely nice example of what you could do. Of course I would not expect high accuracy from just 6 lines of Python, but it should give you an idea of how it works:
1 >>> import nltk
2 >>> def extract_entities(text):
3 ...     for sent in nltk.sent_tokenize(text):
4 ...         for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
5 ...             if hasattr(chunk, 'label'):  # entity subtrees have a label; plain tokens are tuples
6 ...                 print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))
The core idea is on lines 3 and 4: line 3 splits the text into sentences and iterates over them.
Line 4 splits each sentence into tokens, runs "part of speech" tagging on the tokens, and then feeds the POS-tagged sentence to the named entity recognition algorithm. That's the very basic pipeline.
In general, nltk is an extremely beautiful piece of software, and very well documented: I would look at it. Other answers contain very useful links.
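For the rule-based route mentioned above, a minimal sketch might look like the following. It only covers the one sentence shape from the question; the pattern, the function name parse_expense and the small set of time words are assumptions for illustration, not a general solution.

```python
import re
from datetime import date, timedelta

# Hypothetical rule for sentences shaped like
# "I spent <amount> dollars on <reason> [today|yesterday]".
PATTERN = re.compile(
    r"I spent (?P<cost>\d+(?:\.\d+)?) dollars on (?P<reason>.+?)"
    r"(?: (?P<when>today|yesterday))?$",
    re.IGNORECASE,
)


def parse_expense(sentence, today=None):
    """Return a dict with reason, cost and time, or None if no rule matches."""
    today = today or date.today()
    match = PATTERN.match(sentence.strip().rstrip("."))
    if match is None:
        return None
    when = (match.group("when") or "today").lower()
    day = today - timedelta(days=1) if when == "yesterday" else today
    return {
        "reason": match.group("reason"),
        "cost": float(match.group("cost")),
        "time": day,
    }


print(parse_expense("I spent 50 dollars on food yesterday"))
```

Each new sentence shape needs its own rule, which is where this approach tends to get complex quickly, as noted above.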
Your task is a type of Information Extraction task, specifically relation/fact extraction, preceded by Named Entity Recognition.
Take a look at the following frameworks for Java/Python:
GExp
GATE
NLTK. Python. Book chapter on Information Extraction.
UIMA. (used for IBM's Watson.)

Moving, renaming huge amount of text files based on content and size

*Update July 4*
I ended up doing the following:
Sort on date
Check if last sentence is the same
If Yes: If bigger -> this is the new message to be chosen. If smaller: remove. If no more of the same can be found, choose this one and move to another folder.
If No: move on. Loop this again until all files with certain date have been checked.
Thanks all for the help!!
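A rough Python sketch of the steps in this update, for anyone with the same problem. The assumptions are mine: file names start with the 8-digit date, the message texts are already loaded into a dict, and "bigger" is measured by text length. The function names are hypothetical.

```python
def last_line(text):
    """Return the last non-empty line, i.e. the oldest quoted sentence."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def pick_latest(messages):
    """messages maps file name -> text; return the file names worth keeping."""
    best = {}
    for name in sorted(messages):              # sort on date (name prefix)
        text = messages[name]
        key = (name[:8], last_line(text))      # same day, same last sentence
        if key not in best or len(text) > len(messages[best[key]]):
            best[key] = name                   # the bigger file wins
    return sorted(best.values())


example = {
    "20110102_1.txt": "Me, sent at 11:41:\nHi, how are you?",
    "20110102_2.txt": "You, sent at 11:42:\nGood!\nMe, sent at 11:41:\nHi, how are you?",
    "20110102_3.txt": "Me, sent at 11:43:\nGreat!\nYou, sent at 11:42:\nGood!\n"
                      "Me, sent at 11:41:\nHi, how are you?",
    "20110102_4.txt": "Me, sent at 09:00:\nLunch?",
}
print(pick_latest(example))
```

Files not returned by pick_latest are the ones to remove; the survivors can then be moved to another folder.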
I'm busy with a big project where I have a huge number of emails that I have to filter, imported from Gmail through Thunderbird. There is a big problem though.
Because Gmail uses conversations, but Thunderbird doesn't format them as such, what I have is a text file for each email that also contains the complete previous conversation. So there is a whole new text file for each reply. To clarify, an example of a conversation:
Me:Hi, how are you?
You, replying: Good!
Me: Great!
In Gmail this looks exactly as above, but for me these are now 3 files:
file 1:
Me, sent at 11:41:
Hi, how are you?
file 2:
You, sent at 11:42:
Good!
Me, sent at 11:41:
Hi, how are you?
file 3:
Me, sent at 11:43:
Great!
You, sent at 11:42:
Good!
Me, sent at 11:41:
Hi, how are you?
As you can understand, this is no problem with 3 files: I just throw away file 1 and 2 and only use file 3. That's precisely what I want to do. But considering in total there are around 30k files, I would very much like to automate that.
Unfortunately it is not possible to do this completely by file name, though partially it is. The files are named after their date, for instance 20110102 for Jan 2, 2011. However, as there are multiple email conversations on a day, I would lose a lot if I just sorted by date and only kept the largest.
I hope the problem is clear and you can help me with this.
I work on Mac OSX 10.7. I've tried using Applescript, but either my script is not good or Applescript can't handle the amount of files.
Maybe you have a recommendation for software or a script in some way? I'm open for all and not unfamiliar with programming.
Thanks in advance!
As your task is basically just text processing, any language you're familiar with, including AppleScript, PHP, bash, C, should be able to do the job. I think perhaps #inTide's breaking the problem down into discrete steps is what you need to do, building one portion at a time in the language of your choice.
Pick a language that you're familiar with, start by writing the code for the first step, and make sure it's working as you expect. Then expand, adding a small bit of new functionality at each point and making sure that functionality works before moving on. Without an example of the code you've written or a better description of how AppleScript is failing for you, additional advice is difficult.

sketch flow -- how do we rename .xaml and .cs files without breaking sketchflow map?

How do we rename .xaml and .cs files?
We would like to be able to keep development in sync with the original SketchFlow project, i.e. SketchFlow has features such as the ability to collect client feedback on a per-screen basis, etc.
... I kind of answered my own question here, so I'll post it as a follow up. Asked the original question 9 hours ago on the MS site without response... still trying to work out where the best place is to talk to the community, so sorry for the duplicate.
THE ANSWER (IS THERE A BETTER ONE?)
Context: SketchFlow is a prototyping tool. In large teams you possibly want to keep the prototype separate from the finished version, or there's a large prototyping phase.
My view is that I really like Sketchflow. It's one of the coolest things I've seen for a while (well done Microsoft).
... so for me, I want the prototype to become the finished product. I want the designers to step in and make transitions whenever they want. I want the designers to kick the process off, and the developers to put in the detail. I'd like our customers to be able to post feedback at any time during the build process. btw: get your developers to check out MVVM. It's very cool.
My bet is that the feedback could get lost if you make a breaking change (a file rename), so just beware of that. That won't be a problem for us. We'll get our file names to make sense and then mostly leave them alone. Of course MS could fix this by creating a globally unique id (Guid) for each screen that is created. Perhaps they've done this already. If someone from MS reads this, please put this on your requested features list.
THE ANSWER:
So here is the answer that works for me:
Don't try to hand-edit the xaml / cs, as all the cross-referencing that you might be doing with behaviors will break if you aren't really careful. Typical files that need to be modified: .csproj, Sketch.Flow, xxxx.xaml, and xxxx.cs.
To do it automatically, download a tool like UltraEdit. Alternatively, you might be able to just use VS 2010 (untested).
Steps with UltraEdit:
(BACKUP YOUR PROJECT FIRST)
1. Search/Replace In Files...
2. Find in files... "Screen_1_19"
3. Replace with... "Welcome"
4. In Files/Types... "*.*"
5. Directory...
6. Match Whole Word Only
7. Hit "Start"
8. Follow the prompts
9. Rename the files (.xaml & .cs) to be Welcome.???? (where ???? is .xaml or .cs). Since I use SVN, this step gets done for me in one step (no big deal).
If using VS2010 for steps 1 through 8, be careful to do the longer string replacements first, e.g. Screen_1_19 before Screen_1. I think VS treats _ as a word break. In UltraEdit you'll be fine.
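If you would rather script the search/replace than depend on an editor, the same idea can be sketched in a few lines of Python. This is untested against a real SketchFlow project; the function names and the default list of file suffixes are assumptions. It replaces whole words only and handles longer names first, so Screen_1_19 is dealt with before Screen_1.

```python
import os
import re


def rename_identifiers(text, renames):
    """Replace whole-word occurrences of each old name, longest names first."""
    for old in sorted(renames, key=len, reverse=True):
        # \b treats "_" as part of a word, so Screen_1 never matches
        # inside Screen_1_19.
        pattern = r"\b" + re.escape(old) + r"\b"
        text = re.sub(pattern, lambda match, new=renames[old]: new, text)
    return text


def rename_in_project(root, renames, suffixes=(".csproj", ".flow", ".xaml", ".cs")):
    """Apply rename_identifiers to every matching file under root (back up first!)."""
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(suffixes):
                continue
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8", newline="") as handle:
                original = handle.read()
            updated = rename_identifiers(original, renames)
            if updated != original:
                with open(path, "w", encoding="utf-8", newline="") as handle:
                    handle.write(updated)


sample = '<Screen_1_19 x:Class="App.Screen_1_19"/> Screen_1'
print(rename_identifiers(sample, {"Screen_1": "Home", "Screen_1_19": "Welcome"}))
```

Renaming the .xaml and .cs files themselves is still a separate step, as in the list above.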
If there's interest, in the spare time that I don't currently have, I could release a quick tool to do this on codeplex.
** note: because we are working with XML, and XML is very particular about being correct, I close Expression Blend down and then reopen it after the replace/rename to see whether I was successful and my screen map still has all the flow lines drawn in.
answer is above in the body of the question.

HD Regular Expression Search

I am working on a project for my computer security class and I have a couple questions. I had an idea to write a program that would search the whole hard drive looking for email addresses. I am just looking for addresses stored in plain text since it would be hard to find anything otherwise. I figured the best way to find addresses would be to use a regular expression.
I wrote an application in C# that works fairly well, but I would like to see if anyone has any better ideas. I am completely up for writing this in another language, since I'm assuming C# isn't the best for this type of thing. So far the application I created just starts at C:/ and recursively locates all files on the drive, skipping those that aren't accessible. It also skips all common image, video, audio and compressed files, and files over 512 MB. This speeds it up quite a bit, but there is a small chance that a large file could contain something useful. It takes about 12 seconds to generate the list of files, and I'm guessing about an hour to check them all. One downside is that it uses about 50% CPU while scanning.
I'm looking for ideas on how to improve the search. Is there a faster way, a more efficient way, a more thorough way, things like that? I was trying to think if there was any way that you could tell if the file would contain plain text strings or not. Just let me know if you have any cool ideas. Thanks.
To be honest, the easiest existing way to do this is to use grep. As you improve your program, compare your speeds to it, and when you get close, stop worrying about optimizing. Alternatively, take a look at its source for an example of an existing product that does what you're looking for.
As noted elsewhere, tools already exist for this if you install Win32 ports of UNIX tools. Alternatively, the Windows equivalent is:
for /r c:\ %i in (*.*) do findstr /i /r "regular expression" "%i"
You should just use grep + find. grep is optimized for searching files fast, and find is optimized for providing lists of appropriate files for things like this. People have spent a long time optimizing these tools, so there's no need to reinvent the wheel.
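If you do want to keep your own scanner for the class project, here is a rough Python sketch of the approach described in the question, plus one cheap way to guess whether a file is plain text. The email pattern, the extension list and the "NUL byte in the first 4 KB means binary" heuristic are all assumptions; the heuristic will, for example, skip UTF-16 text files.

```python
import os
import re

# Deliberately loose pattern; real-world address syntax is more permissive.
EMAIL = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
SKIP_EXT = {".jpg", ".png", ".gif", ".mp3", ".mp4", ".avi", ".zip", ".gz"}
MAX_SIZE = 512 * 1024 * 1024  # skip files over 512 MB, as in the question


def looks_like_text(path, probe=4096):
    """Heuristic: treat a file as binary if its first bytes contain a NUL."""
    with open(path, "rb") as handle:
        return b"\0" not in handle.read(probe)


def find_emails(root):
    """Walk root and return the set of address-like strings found."""
    found = set()
    # os.walk silently skips directories it cannot list.
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if os.path.splitext(name)[1].lower() in SKIP_EXT:
                continue
            path = os.path.join(folder, name)
            try:
                if os.path.getsize(path) > MAX_SIZE or not looks_like_text(path):
                    continue
                with open(path, "rb") as handle:
                    for line in handle:  # line by line keeps memory use flat
                        found.update(m.decode("ascii") for m in EMAIL.findall(line))
            except OSError:
                continue  # unreadable or vanished file
    return found
```

Calling find_emails("C:/") would scan the whole drive; comparing its run time against grep or findstr on the same tree is a reasonable way to decide when to stop optimizing.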