HD Regular Expression Search - regex

I am working on a project for my computer security class and I have a couple questions. I had an idea to write a program that would search the whole hard drive looking for email addresses. I am just looking for addresses stored in plain text since it would be hard to find anything otherwise. I figured the best way to find addresses would be to use a regular expression.
I wrote an application in C# that works fairly well, but I would like to see if anyone has better ideas. I am completely open to writing this in another language, since I'm assuming C# isn't the best fit for this type of thing. So far the application starts at C:\ and recursively locates all files on the drive, skipping those that aren't accessible. It also skips all common image, video, audio, and compressed files, as well as any file over 512 MB. This speeds it up quite a bit, but there is a small chance that a large file could contain something useful. It takes about 12 seconds to generate the list of files and, I'm guessing, about an hour to check them all. One downside is that it uses about 50% CPU while scanning.
I'm looking for ideas on how to improve the search. Is there a faster way, a more efficient way, a more thorough way, things like that? I was trying to think if there was any way that you could tell if the file would contain plain text strings or not. Just let me know if you have any cool ideas. Thanks.

To be honest, the easiest existing way to do this is to use grep. As you improve your program, compare your speeds to it, and when you get close, stop worrying about optimizing. Alternatively, take a look at its source for an example of an existing product that does what you're looking for.

As noted elsewhere, tools already exist for this if you install Win32 ports of UNIX tools. Alternatively, the Windows equivalent (typed at the command prompt; inside a batch file, write %%i instead of %i) is:
for /r c:\ %i in (*.*) do findstr /i /r "regular expression" "%i"

You should just use grep + find. grep is optimized for searching files fast, and find is optimized for providing lists of appropriate files for things like this. People have spent a long time optimizing these tools, so there is no need to reinvent the wheel.
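On the question's last point (telling plain-text files from binary ones), one common heuristic is to read the first few kilobytes and treat the file as binary if they contain a NUL byte. Below is a minimal sketch of the scan described in the question built around that check. It is written in Python rather than C# purely for brevity; the function names, the extension list, and the deliberately simple address pattern are all assumptions for illustration, and the heuristic will wrongly skip UTF-16 text, which contains NUL bytes.

```python
import os
import re

# Deliberately simple pattern (an assumption); real address syntax is much looser.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SKIP_EXTENSIONS = {".jpg", ".png", ".gif", ".mp3", ".avi", ".zip"}  # illustrative list
MAX_SIZE = 512 * 1024 * 1024  # the 512 MB cutoff mentioned in the question


def looks_like_text(path, sample_size=8192):
    """Heuristic: treat a file as binary if its first bytes contain a NUL."""
    with open(path, "rb") as handle:
        return b"\x00" not in handle.read(sample_size)


def find_emails(root):
    """Yield (path, address) pairs for plain-text files under root."""
    for folder, _dirs, names in os.walk(root):  # unreadable folders are skipped
        for name in names:
            if os.path.splitext(name)[1].lower() in SKIP_EXTENSIONS:
                continue
            path = os.path.join(folder, name)
            try:
                if os.path.getsize(path) > MAX_SIZE or not looks_like_text(path):
                    continue
                with open(path, "rb") as handle:
                    for line in handle:  # line by line keeps memory use low
                        for match in EMAIL_RE.findall(line):
                            yield path, match.decode("ascii")
            except OSError:
                continue  # inaccessible file: skip it, as the original program does
```

Looping over find_emails("C:\\") and printing each pair would reproduce the scan from the question. Matching on raw bytes avoids guessing each file's encoding, at the cost of missing addresses stored in non-ASCII-compatible encodings.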

Related

Reading data from an online text file in C++

Alright. What I'm after is, what seems to me, fairly simple.
I've got File I/O down to a fine art for basic text files.
But, what I need now, is a way to read a text file that's online.
Let's say, something like: http://www.iamawebsite.com.au/file.txt
I CAN download the file and store it locally, but that will cause a lot more pain for me in the future, and more so for redistribution of the end program, so if I can avoid doing so, I will be forever grateful. (Also, if possible, I'd like to refrain from any additional libraries. If I have to use one, I will, but if there's a way around that, I'm happy.)
I have looked around for a while on ways to do similar tasks, but they seem to be going for more than what I'm after, and skipping the small steps which are the ones I can't quite get.
(If it helps, using Windows 8, Visual Studio 2010 Ultimate, needs to work in Windows 7 and 8 if possible)
I tried many different things, and I couldn't get anything to work without overcomplicating things to a ridiculous level. I also tried libcurl but I couldn't manage to link it properly; not sure why.
I ended up using a combination of Batch and Powershell scripts, simple and powerful, and best of all it works.
If anybody is interested:
Batch script:
powershell -ExecutionPolicy Bypass -File "fileUrl\name.ps1"
Powershell script:
$webClient = New-Object System.Net.WebClient;
$url = "http://www.iamawebsite.com.au/iamafile.txt";
$file = "whereToSaveFile\desiredNameOfFile.txt";
$webClient.DownloadFile($url, $file);
I have both my Batch and Powershell files in the same directory, just to make it a little easier on myself.
Thanks!
Try the URLDownloadToCacheFile function, maybe.
One way would be to use InternetOpen, then InternetOpenUrl, and finally InternetReadFile.

Handling really large multi language projects

I am working on a really large multi-language project (1000+ classes + configs + scripts), with files distributed over network drives. I am having trouble fighting through the code, since the available tools are not helping. The main problem is finding things. For the C++ part: VS with VAX can only find files and symbols which are in the solution. A lot of them are not. Same problem with ReSharper. Right now I am stuck with doing unindexed string and file searches, which is highly inefficient on a network drive. I heard that SourceInsight would be an option, since it allows you to just specify the folders that are part of the project and then indexes them, but my company won't spend money on it.
So my question is: what tools are available to fight through an incredibly large amount of code? If possible, they should be low cost or even free/open source.
Check out -
ctags
cscope
idutils
snavigator
In every one of these tools, you would have to invest(*) some time in reading the documentation, and then building your index. Consider switching to an editor that will work with these tools.
(*): I do mean invest, because it will reap dividends once you do.
hope this helps,
If you need to maintain a large amount of code, you really should have a source code management system; a lot of them will help you find text by indexing all the files, and most of them work with various languages.
Otherwise you can install some indexer like Apache Lucene and index all your files...
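To make concrete why an index helps on a slow network drive, here is a minimal sketch of the idea behind such indexers: walk the tree once, record where every identifier-like token occurs, and answer later lookups from memory. It is a toy in Python, not how Lucene or ctags actually work; the function name, token pattern, and extension list are assumptions for illustration.

```python
import os
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")  # identifier-like words


def build_index(root, extensions=(".h", ".cpp", ".py", ".cfg")):
    """Map each token to the set of (path, line number) places it occurs."""
    index = defaultdict(set)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            if not name.endswith(extensions):  # endswith accepts a tuple
                continue
            path = os.path.join(folder, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        for token in TOKEN_RE.findall(line):
                            index[token].add((path, number))
            except OSError:
                continue  # unreadable file: leave it out of the index
    return index
```

After one pass, a lookup such as index["Widget"] returns every file and line that mentions the name without touching the drive again. Real tools add persistence, incremental updates, and language-aware parsing on top of this.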
You should take a look at LXR. This is used by many Linux kernel source listings.
Try ndexer (http://code.google.com/p/ndexer/). It promises to handle extremely large codebases!
The Perl program ack is also worth a look -- think of it as multi-file grep on steroids. The new version (in what I would call late beta) even lets you specify regexes for the files to process as well as regexes to search for -- a feature I've used extensively since it came out (I've got a subproject with 30k lines in 300+ classes, where this feature has been very helpful). You can even chain the new ack with itself so you can subselect the files to process.
"VS with VAX can only find files and symbols which are in the solution. A lot of them are not."
You can add all the files that are not in your solution and set them to not build in the settings. Your VS build will not be affected by this, but now VS knows about those files and you can search them along with your VS native files.

Getting list of files and folders on the user's computer with the filename filtered by the text line

Currently I'm developing a project that should do the thing described above on Windows. My idea is to recursively go through all of the user's drives and collect all the information on them, but that seems to be really time-consuming. So is there a better way to do such a thing (maybe using the OS's index file or the NTFS MFT)?
I use C++/Qt.
You can search for any of the many code examples for this and use one.
The library functions you would use, FindFirstFile and FindNextFile, are optimized and will go directly to the FAT. They are coded by Microsoft and I doubt that there is a faster way.
Btw, what do you mean by "filtered by the text line"? Do you mean you want only filenames matching a certain pattern (use the above), or files containing a string?
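If "filtered by the text line" means matching names against a wildcard pattern, the plain recursive approach can be sketched as follows. This is a portable illustration in Python using os.scandir, not the FindFirstFile/FindNextFile calls themselves; the function name and the case-insensitive comparison are assumptions.

```python
import fnmatch
import os


def find_names(root, pattern):
    """Yield paths under root whose file or folder name matches a wildcard pattern."""
    wanted = pattern.lower()
    stack = [root]  # explicit stack instead of recursion
    while stack:
        folder = stack.pop()
        try:
            with os.scandir(folder) as entries:
                for entry in entries:
                    # Compare lowercased names so matching ignores case.
                    if fnmatch.fnmatch(entry.name.lower(), wanted):
                        yield entry.path
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
        except OSError:
            continue  # folder not readable: skip it
```

For example, find_names("C:\\", "report*") would yield every file or folder whose name starts with "report". The walk still visits every directory, so it does not answer the question of whether reading the MFT or an OS index directly would be faster.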

Create And Convert To PDF's.. NO Toolkit

Not sure where else to ask this, so I figured I'd give good old stackoverflow a shot.
Let's say, by chance, I would like to write a library or set of libraries that will create PDFs and convert files to PDF, AND I couldn't care less about how long it will take me to complete (3 months - 10 years.. whatever). I have absolutely no interest in paying for a toolkit... the point of this would be to learn how to manipulate and create files like PDFs. There's nothing business critical about the project, I just want to learn how to do it. Where do I start? I would imagine something like this would be written in C++, but I'm not sure... maybe high level languages would work as well. I'm not looking for someone to tell me exactly how to do it, but to send me in the right direction, or at least point out the concepts I would need to concretely grasp before proceeding with such a project.
Any advice and help in directing me here is greatly appreciated : )
Well, you will need a very good understanding of the PDF file format. Adobe publishes the standard and you can start at their site. You can start with the base 1.7 standard and then read the cumulative supplements from there. It is a daunting task, but it can be done and you can pretty much use any language you want, because in the end you are just generating bytes that can be saved to a file.
If you want to convert from, let's say, Word documents, it will get a little trickier. Microsoft has published their file formats, which you would have to learn and then learn how to translate into the corresponding PDF formatting. Also note that the .doc and .docx formats are completely separate file formats and would require separate engines to convert them.
With unlimited time, it is definitely doable, you would just need to ask yourself if it is worth it.

Design advice for a personal project - "Files Renamer"?

I've just started learning the WinAPI and C++ programming.
I was thinking about starting a personal project (to improve my coding, and to help me understand the WinAPI better),
and I've decided to program a "cmd" file renamer that basically takes:
1) a path
2) a keyword
3) the desired format
4) versioned or not (or numbered; e.g. if you had 20 episodes of the same show, you wouldn't want to truncate the episode number)
5) special cases to delete (e.g. when you're downloading a torrent, the files have a [309u394] attached to the name, and most of the time an initial [WE-RIP-TV-SHOWS-HDTV-FANSUBS-GROUPS-ETC])
I am building the logic as follows:
the program takes the path (input 1),
performs a full file indexing, then compares the files found against the keyword given (input 2) (use regex?),
reformats the file name (inputs 3, 4, 5),
saves the file name.
questions:
A) Is my logic flow proper? Any suggestions to improve it?
B) Should I use regex to check the file name against the keyword and desired format? (I'm not good with regex yet.) I mean, is it the best way to perform the huge number of comparisons?
Regular expressions should do the trick. Also you could use the Boost library, it has some really neat functions including the regexp, which is probably faster than the functions you'll find around (:
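As a rough illustration of how regular expressions cover the steps in the question, here is a minimal sketch in Python; the same patterns should carry over to Boost.Regex or std::regex with little change. The function name, the output template, and the rule "the last number in the name is the episode number" are assumptions for illustration, and that rule would misfire on names containing things like a resolution or a year.

```python
import os
import re

TAG_RE = re.compile(r"\[[^\]]*\]")     # a [bracketed] release tag (input 5)
SEPARATOR_RE = re.compile(r"[._\s]+")  # runs of dots, underscores, whitespace
NUMBER_RE = re.compile(r"\d+")


def new_name(filename, keyword, template="{keyword} - {episode:02d}{ext}"):
    """Return the reformatted name, or None if the keyword is not in the file name."""
    stem, ext = os.path.splitext(filename)
    stem = TAG_RE.sub(" ", stem)                # drop the special cases
    stem = SEPARATOR_RE.sub(" ", stem).strip()  # normalise separators to spaces
    if keyword.lower() not in stem.lower():     # keyword check (input 2)
        return None
    numbers = NUMBER_RE.findall(stem)
    if not numbers:                             # unnumbered file (input 4)
        return keyword + ext
    # Assumption: the last number left in the name is the episode number.
    return template.format(keyword=keyword, episode=int(numbers[-1]), ext=ext)
```

With these rules, new_name("[WE-RIP-TV] Some.Show.12.[309u394].avi", "Some Show") gives "Some Show - 12.avi". The actual rename would then be a single call per file (os.rename here, MoveFile in the WinAPI), ideally after printing the planned names for review.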