Mediawiki mass user delete/merge/block - wiki

I have 500 or so spambots and about 5 actual registered users on my wiki. I have used nuke to delete their pages but they just keep reposting. I have spambot registration under control using reCaptcha. Now, I just need a way to delete/block/merge about 500 users at once.

You could just delete the accounts from the user table manually, or at least disable their authentication info with a query such as:
UPDATE /*_*/user SET
user_password = '',
user_newpassword = '',
user_email = '',
user_token = ''
WHERE
/* condition to select the users you want to nuke */
(Replace /*_*/ with your $wgDBprefix, if any. Oh, and do make a backup first.)
Wiping out the user_password and user_newpassword fields prevents the user from logging in. Also wiping out user_email prevents them from requesting a new password via email, and wiping out user_token drops any active sessions they may have.
Update: Since I first posted this, I've had further experience of cleaning up large numbers of spam users and content from a MediaWiki installation. I've documented the method I used (which basically involves first deleting the users from the database, then wiping out up all the now-orphaned revisions, and finally running rebuildall.php to fix the link tables) in this answer on Webmasters Stack Exchange.
Alternatively, you might also find Extension:RegexBlock useful:
"RegexBlock is an extension that adds special page with the interface for blocking, viewing and unblocking user names and IP addresses using regular expressions."

There are risks involved in applying the solution in the accepted answer. The approach may damage your database! It incompletely removes users, doing nothing to preserve referential integrity, and will almost certainly cause display errors.
Here a much better solution is presented (a prerequisite is that you have installed the User merge extension):
I have a little awkward way to accomplish the bulk merge through a
work-around. Hope someone would find it useful! (Must have a little
string concatenation skills in spreadsheets; or one may use a python
or similar script; or use a text editor with bulk replacement
features)
Prepare a list of all SPAMuserIDs, store them in a spreadsheet or textfile. The list may be
prepared from the user creation logs. If you do have the
dB access, the Wiki_user table can be imported into a local list.
The post method used for submitting the Merge & Delete User form (by clicking the button) should be converted to a get method. This
will get us a long URL. See the second comment (by Matthew Simoneau)
dated 13/Jan/2009) at
http://www.mathworks.com/matlabcentral/newsreader/view_thread/242300
for the method.
The resulting URL string should be something like below:
http: //(Your Wiki domain)/Special:UserMerge?olduser=(OldUserNameHere)&newuser=(NewUserNameHere)&deleteuser=1&token=0d30d8b4033a9a523b9574ccf73abad8%2B\
Now, divide this URL into four sections:
A: http: //(Your Wiki domain)/Special:UserMerge?olduser=
B: (OldUserNameHere)
C: &newuser=(NewUserNameHere)&deleteuser=1
D: &token=0d30d8b4033a9a523b9574ccf73abad8%2B\
Now using a text editor or spreadsheet, prefix each spam userIDs with part A and Suffix each with Part C and D. Part C will include the
NewUser(which is a specially created single dummy userID). The Part D,
the Token string is a session-dependent token that will be changed per
user per session. So you will need to get a new token every time a new
session/batch of work is required.
With the above step, you should get a long list of URLs, each good to do a Merge&Delete operation for one user. We can now create a
simple HTML file, view it and use a batch downloader like DownThemAll
in Firefox.
Add two more pieces " Linktext" to each line at
beginning and end. Also add at top and at
bottom and save the file as (for eg:) userlist.html
Open the file in Firefox, use DownThemAll add-on and download all the files! Effectively, you are visiting the Merge&Delete page for
each user and clicking the button!
Although this might look a lengthy and tricky job at first, once you
follow this method, you can remove tens of thousands of users without
much manual efforts.
You can verify if the operation is going well by opening some of the
downloaded html files (or by looking through the recent changes in
another window).
One advantage is that it does not directly edit the
MySQL pages. Nor does it require direct database access.
I did a bit of rewriting to the quoted text, since the original text contains some flaws.

Related

Nested tables in livecycle fall apart on email

I have a form here with a nested table - where each table can dynamically grow, i.e., the inner table (w/ Transit No and Account No) and the outer table (Accounts by ID No). Here is an example:
(Behind the buttons:
Add - $.parent.tbl.Row.instanceManager.addInstance();
Remove - $.parent.instanceManager.removeInstance(this.parent.index); (In
production I make sure there is at least one row to remove...)
In the definition for each table I do not have checked 'Repeat Table for Each Data Item'. This works great. However I did try with that checked and the outcome was the same.
Now, when I email the form and open the attachment, this is what I see:
You can see that the second table didn't make it, and apparently a row was added to the inner table in the first, without any data.
Any ideas on what's going wrong here? And what I can do about it?
Unfortunately I'm not sure what's wrong with your form but I have made a similar form that works - so I can show you how I did it and list a few things that I can think of that can cause problems.
This is what my form looks like and when I e-mail it, it comes out exactly the way it is:
(It has repeatable parent- and childsubforms like yours)
I did it entirely with JS though, no FormCalc and Dollar $igns :D
When a button is pressed I call a function from a Scriptobject.
These are the main parts of my script inside my functions:
Adding a Subform:
var oNewInstance = subform.instanceManager.addInstance(1);
Deleting a Subform:
if (subform.instanceManager.count > subform.instanceManager.occur.min)
{
subform.instanceManager.removeInstance(subform.index);
}
And these are my subforms' properties (in German, but you can figure it out :P):
Your problem might also have completely other reasons though, make sure you don't have any changes in an initialize,docReady, preSubmit and similar actions that occur between sending and opening the sent PDF.
Also before sending it as an e-mail you have to save it in Acrobat as a Reader Extended PDF:
Besides that I've noticed that sometimes problems can occur due to the target version (Selectable in LCD under File > Form Properties > Defaults).
It helped me sometimes to set it to the newest one.

Google App Engine - Add & Reflect pattern, working around eventual consistency

I'm building a web application on app engine.
In my case, that's built on django-nonrel, but the key point is that it's using Google's datastore.
I love the fact that I don't need to deal with replication, sharding, backups and such, but one thing that is always getting in my way is the eventual consistency, which seems to get in the way of implementing a common web app pattern which I'm calling "Add & Reflect".
Let's say I have a project management app. The Project is its central model.
Now there's a web page page where I see a list of all projects, can add a project, and then I'll reflect back the list of all projects, which should include the project I just added (assuming no errors).
So the pattern goes like this:
Get and display list of existing projects
User adds new project (using a form on that page)
New project is created
As a response, get and display list of existing projects (now includes the new project)
Now the thing is, that due to eventual consistency, there is no guarantee whatsoever that I will get that new project when I get a list of all projects right after adding a new project.
Now that would be fine if this momentary inconsistency happened when another request (e.g. another user: user B) requested the list of projects one second after the project was added by the first user (user A), but it's really a problem when user A performs an operation, and does not see the results of his action, therefore does not get feedback.
I have gotten used to doing something like this to work around this problem:
def create_project(request):
response_context = {}
new_project = Project(name=request.POST['name'])
project.save()
response_context['projects'] = Project.get_serialized_projects()
# on GAE, eventual consistency means we are not guaranteed to see the
# new projects while querying for all projects, therefore we might need
# to add it manually...
if project.serialize() not in response_context['projects']:
response_context['projects'].append(project.serialize())
return render('projects.html', response_context)
The problem is that this happens in many places in my code, so I'm thinking maybe I'm missing something there, since this pattern is such a basic web app pattern.
Any suggestions for other ways to handle this?
Yes its a common issue. No theres no magic fix.
From client-side once you know the commit succeeded you can save the item locally (globals or storage) and then when querying from datastore merge your saved data. Put an expiration on it so its temporary. Its not trivial to make it work in all cases (say added an item then removed/renamed it so also update cache etc).
From server-side its common to cache recent saves in memcache and also merge with your queries.

Removing query string from url in django while keeping GET information

I am working on a Django setup where I can receive a url containining a query string as part of a GET. I would like to be able to process the data provided in the query string and return a page that is adjusted for that data but does not contain the query string in the URL.
Ordinarily I would just use reverse(), but I am not sure how to apply it in this case. Here are the details of the situation:
Example URL: .../test/123/?list_options=1&list_options=2&list_options=3
urls.py
urlpatterns = patterns('',
url(r'test/(P<testrun_id>\d+)/'), views.testrun, name='testrun')
)
views.py
def testrun(request, testrun_id):
if 'list_options' in request.GET.keys():
lopt = request.GET.getlist('list_options')
:
:
[process lopt list]
:
:
:
:
[other processing]
:
:
context = { ...stuff... }
return render(request, 'test_tracker/testview.html', context)
When the example URL is processed, Django will return the page I want but with the URL still containing the query string on the end. The standard way of stripping off the unwanted query string would be to return the testrun function with return HttpResponseRedirect(reverse('testrun', args=(testrun_id,))). However, if I do that here then I'm going to get an infinite loop through the testrun function. Furthermore, I am unsure if the list_options data that was on the original request will still be available after the redirect given that it has been removed from the URL.
How should I work around this? I can see that it might make sense to move the parsing of the list_options variable out into a separate function to avoid the infinite recursion, but I'm afraid that it will lose me the list_options data from the request if I do it that way. Is there a neat way of simultaneously lopping the query string off the end of the URL and returning the page I want in one place so I can avoid having separate things out into multiple functions?
EDIT: A little bit of extra background, since there have been a couple of "Why would you want to do this?" queries.
The website I'm designing is to report on the results of various tests of the software I'm working on. This particular page is for reporting on the results of a single test, and often I will link to it from a bigger list of tests.
The list_options array is a way of specifying the other tests in the list I have just come from. This allows me to populate a drop-down menu with other relevant tests to allow me to easily switch between them.
As such, I could easily end up passing in 15-20 different values and creating huge URLs, which I'd like to avoid. The page is designed to have a default set of other tests to fill in the menu in question if I don't suggest any others in the URL, so it's not a big deal if I remove the list_options. If the user wishes to come back to the page directly he won't care about the other tests in the list, so it's not a problem if that information is not available.
First a word of caution. This is probably not a good idea to do for various reasons:
Bookmarking. Imagine that .../link?q=bar&order=foo will filter some search results and also sort the results in particular order. If you will automatically strip out the querystring, then you will effectively disallow users to bookmark specific search queries.
Tests. Any time you add any automation, things can and will probably go wrong in ways you never imagined. It is always better to stick with simple yet effective approaches since they are widely used thus are less error-prone. Ill give an example for this below.
Maintenance. This is not a standard behavior model therefore this will make maintenance harder for future developers since first they will have to understand first what is going on.
If you still want to achieve this, one of the simplest methods is to use sessions. The idea is that when there is a querystring, you save its contents into a session and then you retrieve it later on when there is no querystring. For example:
def testrun(request, testrun_id):
# save the get data
if request.META['QUERY_STRING']:
request.session['testrun_get'] = request.GET
# the following will not have querystring hence no infinite loop
return HttpResponseRedirect(reverse('testrun', args=(testrun_id,)))
# there is no querystring so retreive it from session
# however someone could visit the url without the querystring
# without visiting the querystring version first hence
# you have to test for it
get_data = request.session.get('testrun_get', None)
if get_data:
if 'list_options' in get_data.keys():
...
else:
# do some default option
...
context = { ...stuff... }
return render(request, 'test_tracker/testview.html', context)
That should work however it can break rather easily and there is no way to easily fix it. This should illustrate the second bullet from above. For example, imagine a user wants to compare two search queries side-by-side. So he will try to visit .../link?q=bar&order=foo and `.../link?q=cat&order=dog in different tabs of the same browser. So far so good because each page will open correct results however as soon as the user will try to refresh the first opened tab, he will get results from the second tab since that is what is currently stored in the session and because browser will have a single session token for both tabs.
Even if you will find some other method to achieve what you want without using sessions, I imagine that you will encounter similar issues because HTTP is stateless hence you will have to store state on the server.
There is actually a way to do this without breaking much of the functionality - store state on client instead of server-side. So you will have a url without a querystring and then let javascript query some API for whatever you will need to display on that page. That however will force you to make some sort of API and use some javascript which does not exactly fall into the scope of your question. So it is possible to do cleanly however that will involve more than just using Django.

Read AICC Server response in cross domain implementation

I am currently trying to develop a web activity that a client would like to track via their Learning Management System. Their LMS uses the AICC standard (HACP binding), and they keep the actual learning objects on a separate content repository.
Right now I'm struggling with the types of communication between the LMS and the "course" given that they sit on two different servers. I'm able to retreive the sessionId and the aicc_url from the URL string when the course launches, and I can successfully post values to the aicc_url on the LMS.
The difficulty is that I can not read and parse the return response from the LMS (which is formatted as plain text). AICC stipulates that the course start with posting a "getParam" command to the aicc_url with the session id in order to retrieve information like completion status, bookmarking information from previous sessions, user ID information, etc, all of which I need.
I have tried three different approaches so far:
1 - I started with using jQuery (1.7) and AJAX, which is how I would typically go about a same-server implementation. This returned a "no transport" error on the XMLHttpRequest. After some forum reading, I tried making sure that the ajax call's crossdomain property was set to true, as well as a recommendation to insert $.support.cors = true above the ajax call, neither of which helped.
2 & 3 - I tried using an oldschool frameset with a form in a bottom frame which would submit and refresh with the returned text from the LMS and then reading that via javascript; and then a variation upon that using an iFrame as a target of an actual form with an onload handler to read and parse the contents. Both of these approaches worked in a same-server environment, but fail in the cross-domain environment.
I'm told that all the other courses running off the content repository bookmark as well as track completion, so obviously it is possible to read the return values from the LMS somehow; AICC is pitched frequently as working in cross-server scenarios, so I'm thinking there must be a frequently-used method to doing this in the AICC structure that I am overlooking. My forum searches so far haven't turned up anything that's gotten me much further, so if anyone has any experience in cross-domain AICC implementations I could certainly use recommendations!
The only idea I have left is to try setting up a PHP "relay" form on the same server as the course, and having the front-end page send values to that, and using the PHP to submit those to the LMS, and relay the return text from the LMS to the front-end iframe or ajax call so that it would be perceived as being within the same domain.... I'm not sure if there's a way to solve the issue without going server-side. It seems likely there must be a common solution to this within AICC.
Thanks in advance!
Edits and updates:
For anyone encountering similar problems, I found a few resources that may help explain the problem as well as some alternate solutions.
The first is specific to Plateau, a big player in the LMS industry that was acquired by Successfactors. It's some documentation that provide on setting up a proxy to handle cross-domain content:
http://content.plateausystems.com/ContentIntegration/content/support_files/Cross-domain_Proxlet_Installation.pdf
The second I found was a slide presentation from Successfactors that highlights the challenge of cross-domain content, and illustrates so back-end ideas for resolving it; including the use of reverse proxies. The relevant parts start around slide 21-22 (page 11 in the PDF).
http://www.successfactors.com/static/docs/successconnect/sf/successfactors-content-integration-turley.pdf
Hope that helps anyone else out there trying to resolve the same issues!
The answer in this post may lead you in the right direction:
Best Practice: Legitimate Cross-Site Scripting
I think you are on the right track with setting up a PHP "relay." I think this is similar to choice #1 in the answer from the other post and seems to make most sense with what you described in your question.

Externally linked images - How to prevent cross site scripting

On my site, I want to allow users to add reference to images which are hosted anywhere on the internet. These images can then be seen by all users of my site. As far as I understand, this could open the risk of cross site scripting, as in the following scenario:
User A adds a link to a gif which he hosts on his own webserver. This webserver is configured in such a way, that it returns javascript instead of the image.
User B opens the page containg the image. Instead of seeing the image, javascript is executed.
My current security messures are currently such, that both on save and open, all content is encoded.
I am using asp.net(c#) on the server and a lot of jquery on the client to build ui elements, including the generation of image tags.
Is this fear of mine correct? Am I missing any other important security loopholes here? And most important of all, how do I prevent this attack? The only secure way I can think of right now, is to webrequest the image url on the server and check if it contains anything else than binary data...
Checking the file is indeed an image won't help. An attacker could return one thing when the server requests and another when a potential victim makes the same request.
Having said that, as long as you restrict the URL to only ever be printed inside the src attribute of an img tag, then you have a CSRF flaw, but not an XSS one.
Someone could for instance create an "image" URL along the lines of:
http://yoursite.com/admin/?action=create_user&un=bob&pw=alice
Or, more realistically but more annoyingly; http://yoursite.com/logout/
If all sensitive actions (logging out, editing profiles, creating posts, changing language/theme) have tokens, then an attack vector like this wouldn't give the user any benefit.
But going back to your question; unless there's some current browser bug I can't think of you won't have XSS. Oh, remember to ensure their image URL doesn't include odd characters. ie: an image URL of "><script>alert(1)</script><!-- may obviously have bad effects. I presumed you know to escape that.
Your approach to security is incorrect.
Don't approach the topic as "I have a user input, so how can I prevent XSS". Rather approach it like it this: "I have user input - it should be restrictive as possible - i.e. allowing nothing through". Then based on that allow only what's absolutely essential - plain-text strings thoroughly sanitized to prevent anything but a URL, and the specific, necessary characters for URLS only. Then Once it is sanitized I should only allow images. Testing for that is hard because it can be easily tricked. However, it should still be tested for. Then because you're using an input field you should make sure that everything from javascript scripts and escape characters, HTML, XML and SQL injections are all converted to plaintext and rendered harmless and useless. Consider your users as being both idiots and hackers - that they'll input everything incorrectly and try to hack something into your input space.
Aside from that you may run into som legal issues with regard to copyright. Copyrighted images generally may not be used on other people's sites without the copyright owner's consent and permission - usually obtained in writing (or email). So allowing users the opportunity to simply lift images from a site could run the risk of allowing them to take copyrighted material and reposting it on your site without permission which is illegal. Some sites are okay with citing the source, others require a fee to be paid, and others will sue you and bring your whole domain down for copyright infringement.