How to run a dictionary search against a large text file? - c++

We're in the final stages of shipping our console game. On the Wii we're having the most problems with memory of course, so we're busy hunting down sloppy coding, packing bits, and so on.
I've done a dump of memory and used strings.exe (from sysinternals) to analyze it, but it's coming up with a lot of gunk like this:
''''$$$$ %%%%
''''$$$$%%%%####&&&&
''''$$$$((((!!!!$$$$''''((((####%%%%$$$$####((((
''))++.-$$%&''))
'')*>BZf8<S]^kgu[faniwkzgukzkzkz
'',,..EDCCEEONNL
I'm more interested in strings like this:
wood_wide_end.bmp
restroom_stonewall.bmp
...which mean we're still embedding some kinds of strings that need to be converted to ID's.
So my question is: what are some good ways of finding the stuff that's likely our debug data that we can eliminate?
I can do some rx's to hack off symbols or just search for certain kinds of strings. But what I'd really like to do is get a hold of a standard dictionary file and search my strings file against that. Seems slow if I were to build a big rx with aardvaark|alimony|archetype etc. Or will that work well enough if I do a .NET compiled rx assembly for it?
Looking for other ideas about how to find stuff we want to eliminate as well. Quick and dirty solutions, don't need elegant. Thanks!

First, I'd get a good word list. This NPL page has a good list of word lists of varying sizes and sources. What I would do is build a hash table of all the words in the word list, and then test each word that is output by strings against the word list. This is pretty easy to do in Python:
import sys
dictfile = open('your-word-list')
wordlist = frozenset(word.strip() for word in dictfile)
dictfile.close()
for line in sys.stdin:
# if any word in the line is in our list, print out the whole line
for word in line.split():
if word in wordlist:
print line
break
Then use it like this:
strings myexecutable.elf | python myscript.py
However, I think you're focusing your attention in the wrong place. Eliminating debug strings has very diminishing returns. Although eliminating debugging data is a Technical Certification Requirement that Nintendo requires you to do, I don't think they'll bounce you for having a couple of extra strings in your ELF.
Use a profiler and try to identify where you're using the most memory. Chances are, there will be a way to save huge amounts of memory with little effort if you focus your energy in the right place.

This sounds like an ideal task for a quick-and-dirty script in something supporting regex's. I'd probably do something in python real quick if it was me.
Here's how I would proceed:
Every time you encounter a string (from the strings.exe output), prompt the user as to whether they'd like to remember it in the dictionary or permanently ignore it. If the user chooses to permanently ignore the string, in the future when its encountered, don't prompt the user about it and throw it away. You can optionally keep an anti-dictionary file around to remember this for future runs of your script. Build up the dictionary file and for each string keep a count or any other info about it you'd like about it. Optionally sort by the number of times the string occurs, so you can focus on the most egregious offenders.
This sounds like an ideal task for learning a scripting language. I wouldn't bother messing with C#/C++ or anything real fancy to implement this.

Related

Scratch project: To check if any word in a list is contained in an answer

I have the following Scratch project which has a "kind list" of words like: "good", "kind", "love", "come" etc.
A user should be able to enter any sentence containing any of these words, and the happy face would show.
Currently if the user types "kind" the happy face shows and if it types anything else like "you are kind", the sad face shows.
How do I change this, in scratch, such that if the user types in:
"you are kind" or
"how kind you are" or
"come here"
(any sentence containing any word in the "kindlist") the face is happy,else not.
I can only find a block that allows me to select the LIST and then the ANSWER and no other alternatives. What I want is the Python equivalent of > in list
answer=input("Say something")
If any word in the input answer (sentence) in in the list.
Then do - - -
For teaching purposes, I am trying to simplify what is on https://machinelearningforkids.co.uk/#!/newproject (creating of the training set). Can this be done directly in scratch or not? Or is this why the site allows you generate blocks on their site first and import them.
Surely Scratch should have the capability to enter data into lists and then test them directly.
I've also tried using a loop (which doesn't quite work correctly either) but was hoping there was a far simpler way.
I guess Scratch deliberately offers a minimal set of functions,
on the one hand not to overwhelm beginners,
on the other hand to encourage students to piece together simple blocks into more complex systems.
Yes, a simple (sentence) contains (word) is all you get out-of-the-box;
you do need a loop to match a multi-word sentence against a multi-word whitelist.
Seems to me like you would be better off with some development environment
that will at least give you some mature text parsing capabilities.
I'm not saying it's impossible to teach student about machine learning using Scratch, but I doubt it's the best tool for the job.
It feels like somebody wants to give music lessons, but students first have to go through the process of building a piano.
As for your code, it looks like a good start.
Some suggestions:
Replace the 'forever' loop with a loop bounded by the length of list 'kindthings'.
Include a leading and a trailing space in the 'contains' check, to make sure only whole words match. Wouldn't want 'unhappy' in a sentence to match 'happy' in the whitelist.

Good coding patterns in RShiny applications?

What are some good resources/examples for good coding patterns to follow in RShiny applications?
I feel I am following two bad patterns in the shiny applications I am creating.
To make things react to user changes properly, I seem to end up wrapping most parts of server.r in observe().
At the beginning of each observe(), I want the expression to rerun if any one of a whole bunch of inputs change.
Ideally, I would like to put input[change_set] where change_set is a character vector of input names, however this gives an error of Error in [.reactivevalues: Single-bracket indexing of reactivevalues object is not allowed.
(or if I use input[[change_set]]: Error in checkName: Must use single string to index into reactivevalues)
What I end up doing to make things work is including multiple lines of input$var1, input$var2, ..., input$var15. This feels very wrong.
I am not making use of any functions like: reactive(), reactiveValues(), isolate(), withReactiveDomain(), makeReactiveBinding(), ... . I am guessing that I probably should be, but I don't know how to use them.
The solution to this problem is likely to be me rereading the small print in the documentation and reading code from example applications. Does anybody know any good quality resources for this?

How generate random word from real languages

How I can generate random word from real language?
Anybody know any API from internet with this functional?
For example I send http-request to 'ht_tp://www.any...api.com/getword?lang=en' and I get responce 'Town'. Or 'Fast'. Or 'Received'... For example I send http-request to 'ht_tp://www.any...api.com/getword?lang=ru' and I get responce 'Ходить'. Or 'Шапка'. Or 'Отправлено'... Any form (noun, adjective, verb etc...) of the words of the any language.
I find resource 'http://www.randomlists.com/random-words'. But this is not JSON format, only English, and don't any warranty work in long time.
Please any ideas.
See this answer : https://stackoverflow.com/questions/824422/can-i-get-an-english-dictionary-word-list-somewhere Download a word dictionary, stick in the databse and fetch a random record or read a random line from the file each time. This way you don't depend on 3rd party API and you can extend it in all the languages you can find words for.
You can download the OpenOffice dictionaries. They come as extension (oxt), which is nothing different than a ZIP file. You may open them with 7zip or alike. Within you will find lots of files, interesting for you are the *.dic files. They will also contain resolutions or number words.
When you encounter something like abandon/LdS get rid of the /LdS this is used for hunspell.
Take these *.dic files use their name as key, put them into a database and pick a random word from there for a given language code.
Update
Older, but easier to access, the archived hunspell dictionaries from OpenOffice.
This question can be viewed in two ways and therefore I give two answers:
To collect words, I would run a spider on websites with known language (Wikipedia is a good starting point) and strip HTML tags.
To generate words from a real language is trickier. Using statistics from the collected words, it is possible to use Markow chains that produces statistically real words. I have tried letter by letter generation, and that works poorly. It is probably a better approach to use syllable construction instead.

What is the best way to encrypt hardcoded strings in C++?

Warning: C++ noob
I've read multiple posts on StackOverflow about string encryption. By the way, they don't answer my doubts.
I must insert one or two hardcoded strings in my code but I would like to make it difficult to read in plain text when debugging/reverse engineering. That's not all: my strings are URLs, so a simple packet analyzer (Wireshark) can read it.
I've said difficult because I know that, when the code runs, the string is somewhere (in RAM?) decrypted as plain text and somebody can read it. So, assuming that is not possible to completely secure my string, what is the best way of encrypting/decrypting it in C++?
I was thinking of something like this:
//I've omitted all the #include and main stuff of course...
string encryptedUrl = "Ajdu67gGHhbh34590Hb6vfu6gu" //Encrypted url with some known algorithm
URLDownloadToFile(NULL, encryptedUrl.decrypt(), C:\temp.txt, 0, NULL);
What about packet analyzing? I'm sure there's no way to hide the URL but maybe I'm missing something? Thank you and sorry for my worst english!
Edit 1: What my application does?
It's a simple login script. My application downloads a text file from an URL. This file contains an encrypted string that is read using fstream library. The string is then decrypted and used to login on another site. It is very weak, because there's no database, no salt, no hashing. My achievement is to ensure that neither the url nor the login string are "easy" to read from a static analisys of the binary, and possibly as hard as possible with a dynamic analysis (debugging, revers engineering, etc).
If you want to stymie packet inspectors, the bare minimum requirement is to use https with a hard-coded server certificate baked into your app.
There is no panacea for encrypting in-app content. A determined hacker with the right skills will get at the plain url, no matter what you do. The best you can hope for is to make it difficult enough that most people will just give up. The way to achieve this is to implement multiple diverse obfuscation and tripwire techniques. Including, but not limited to:
Store parts of the encrypted url and the password (preferably a one-time key) in different locations and bring them together in code.
Hide the encrypted parts in large strings of randomness that looks indistinguishable from the parts.
Bring the parts together piecemeal. E.g., Concatenate the first and second third of the encrypted url into a single buffer from one initialisation function, concatenate this buffer with the last third in a different unrelated init function, and use the final concatenation in yet another function, all called from different random places in your code.
Detect when the app is running under a debugger and have different functions trash the encrypted contents at different times.
Detection should be done at various call sites using different techniques, not by calling a single "DetectDebug" function or testing a global bool, both of which create a single point of attack.
Don't use obvious names, like, "DecryptUrl" for the relevant functions.
Harvest parts of the key from seemingly unrelated, but consistent sources. E.g., read the clock and only use a few of the high bits (high enough that that they won't change for the foreseeable future, but low enough that they're not all zero), or use a random sampling of non-volatile results from initialisation code.
This is just the tip of the iceberg and will only throw novices off the scent. None of it is going to stop, or even significantly slow down, a skillful attacker, who will simply intercept calls to the SSL library using a stealth debugger. You therefore have to ask yourself:
How much is it worth to me to protect this url, and from what kind of attacker?
Can I somehow change the system design so that I don't need to secure the url?
Try XorSTR [1, 2]. It's what I used to use when trying to hamper static analysis. Most results will come from game cheat forums, there is an html generator too.
However as others have mentioned, getting the strings is still easy for anyone who puts a breakpoint on URLDownloadToFile. However, you will have made their life a bit harder if they are trying to do static analysis.
I am not sure what your URL's do, and what your goal is in all this, but XorStr + anti-debug + packing the binary will stop most amateurs from reverse engineering your application.

HD Regular Expression Search

I am working on a project for my computer security class and I have a couple questions. I had an idea to write a program that would search the whole hard drive looking for email addresses. I am just looking for addresses stored in plain text since it would be hard to find anything otherwise. I figured the best way to find addresses would be to use a regular expression.
I wrote an application in C# that works fairly well but it I would like to see if anyone has any better ideas. I am completely up for writing this in another language since I'm assuming C# isn't the best for this type of thing. So far the application I created just starts at the C:/ and recursively locates all files on the drive skipping those that aren't accessible. It also skips all common image, video, audio, compressed, and files over 512mb. This speeds it up quite a bit but there is a small chance that a large file could contain something useful. It takes about 12 seconds to generate the list of files and I'm guessing about an hour to check them all. One downside is that it uses about 50% cpu while scanning.
I'm looking for ideas on how to improve the search. Is there a faster way, a more efficient way, a more thorough way, things like that? I was trying to think if there was any way that you could tell if the file would contain plain text strings or not. Just let me know if you have any cool ideas. Thanks.
To be honest, the easiest existing way to do this is to use grep. As you improve your program, compare your speeds to it, and when you get close, stop worrying about optimizing. Alternatively, take a look at its source for an example of an existing product that does what you're looking for.
As noted elsewhere, tools already exist for this if you install Win32 ports of UNIX tools. Alternatively, the Windows equivalent is:
for /r c:\ %i in (*.*) do findstr /i /r "regular expression" "%i"
you should just use grep + find. grep is optimized for searching files fast, and find is optimized for providing lists of appropriate files for things like this. people have spent a long time optimizing these tools - no need to reinvent the wheel.