Submit form after user tweet - regex

I have an idea and I'm not sure if it'll work (and I hope this is the appropriate place to ask this question).
Basically, what I'm trying to do is grab a user's tweet (one user in particular), and if the tweet matches a RegEx pattern, grab a part of it and submit a form on another page. The form has two parts: the first is for the information, and the second is just a confirmation.
The only two languages I know that would have this capability (to my knowledge) are PHP and Java. Unfortunately, my knowledge of PHP is fairly mediocre and my Java is pretty basic, so I'd need to do some research.
First of all, is what I want to do even possible, and secondly, what would I have to be looking at to pull this off? (I'm open to learning a similar language if necessary)
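For what it's worth, the matching step on its own is small in most languages. Here is a minimal sketch in Python (not one of the two languages mentioned above), assuming the tweet text has already been fetched somehow. The pattern, the field name "code" and both function names are hypothetical placeholders; fetching the tweet and posting the two-step form are left out entirely.

```python
import re
from urllib.parse import urlencode

# Hypothetical pattern: a code such as "ABC-1234" somewhere in the tweet.
CODE_PATTERN = re.compile(r"\b([A-Z]{3}-\d{4})\b")


def extract_code(tweet_text):
    """Return the first matching code in the tweet, or None if there is none."""
    match = CODE_PATTERN.search(tweet_text)
    return match.group(1) if match else None


def build_form_body(code):
    """URL-encode the value the way a form POST body would carry it.

    The field name "code" is an assumption, not the real form's field.
    """
    return urlencode({"code": code})


tweet = "Today's giveaway code is ABC-1234, good luck!"
code = extract_code(tweet)
if code is not None:
    print(build_form_body(code))
```

The confirmation step would be a second request using whatever the first response returns, which depends entirely on the target site.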

Can I do direct database updates in Django easily?

I am trying to write a web application that displays the content of a database as a website, e.g. in the form of a table, and lets the user update the table entries, which should automatically be reflected in the database, so that on page reload, the table looks exactly as the user left it.
Since my web development skills are fairly outdated, I wanted to take this as an excuse to try some new stuff. I know my way around SQL and Python quite well, so I thought Django would be a good choice. I don't have a lot of experience in Javascript however. I already worked through the tutorial, which covers classic HTML forms where you enter a bunch of data and then hit "submit" to push to the database.
What I would really prefer, though, is to have my whole table freely editable and to save any change to the database immediately (e.g. whenever I click a checkbox or "focus out" of a text box). As a second option I thought about having a single "save" button for the whole page (which may easily be several screens in size).
Now, for the first option, I assume I will likely have to use Javascript and Ajax techniques, which I am not comfortable with yet, so writing larger pieces of Javascript code is something I am not very keen on at the moment.
For the second option, I would probably have my whole table be a huge, single form with a single submit button. I am a bit wary about this as it does not seem very robust to me.
So what my question boils down to: Are there ways to accomplish what I want in a robust and easy way without having to reinvent the wheel? From my understanding, Django does not cover the final rendering in HTML, it only provides the data, so I would assume I need some third party technology to handle that part?
Yes. For your second idea, submitting the whole table at once, Django has a thing called a ModelFormSet, where you define a web form which is repeated for each row in the table (or for the set of records you select). There are a good number of basics you'll need to understand to do it, e.g. how to create a Django view, how to set up a URL, and how to write templates, but you say you want to learn Django, so it's a good exercise. The Django documentation has a good tutorial that leads you through development of a basic working app, and from there it's not much further to do what you're seeking.
Here's the part of the Django documentation that discusses ModelFormSets:
https://docs.djangoproject.com/en/3.0/topics/forms/modelforms/#model-formsets
BTW, Django detects which rows have changed so it won't write every row every time, even though you've submitted them all.

How to refine text data?

I built many spiders to get news articles from different websites and I have an API to convert the text to audio clips, but I need a framework or Python tools to refine the articles' text, such as:
removing anything related to the source.
removing any date formats.
removing URLs.
changing acronyms, such as CEO to chief executive officer, for example.
removing special characters and typos.
making sure that the sentence is written correctly after all the edits.
using the previously edited articles as a reference for the new articles.
I am using Python, nltk and re, but it's exhausting: each time I think I have covered all the cases, I find new cases to add, and I think I am stuck in an infinite loop.
Any suggestions?
First of all, expanding acronyms to their full form is non-trivial and should probably not be considered part of scraping but rather part of a second step of processing (cf. IBM's The Art of Tokenization).
Cleaning scraped data is tedious, unfortunately. There is no magical solution, because everyone is interested in scraping something different from what you are; some might be interested only in URLs, for example. Nevertheless, have you tried BeautifulSoup? It's a Python library which offers a very nice API for handling many common scraping-related tasks.
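One way to make the regex approach less exhausting is to keep each rule as a small named step, so a new case means adding one pattern instead of reworking the whole script. The following is only a rough sketch using the standard re module: the acronym table, the two date shapes and the function name clean_article are assumptions for illustration, and a table lookup does not solve the ambiguity problem with acronyms mentioned above.

```python
import re

# Hypothetical lookup table; a real one would be much larger and domain-specific.
ACRONYMS = {"CEO": "chief executive officer", "CFO": "chief financial officer"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Only two date shapes are covered here: 2020-01-31 and 31/01/2020.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{2,4}\b")
ACRONYM_RE = re.compile(r"\b(" + "|".join(map(re.escape, ACRONYMS)) + r")\b")
# Keep word characters, whitespace and basic punctuation; drop everything else.
SPECIAL_RE = re.compile(r"[^\w\s.,;:!?'\"()-]")


def clean_article(text):
    """Apply each cleaning rule in turn, then collapse leftover whitespace."""
    text = URL_RE.sub("", text)
    text = DATE_RE.sub("", text)
    text = ACRONYM_RE.sub(lambda m: ACRONYMS[m.group(1)], text)
    text = SPECIAL_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_article("2021-03-12 https://example.com/a The CEO resigned today. ★"))
```

Removing a date or URL from the middle of a sentence can leave fragments behind, so the "is the sentence still correct" check from the question still needs a separate pass.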

A tool which checks that a local version of a site is fully translated (for continuous integration)

I'm working on a project, in which we design a localized version of an existing site (written in English) for another country (which is not English-speaking). And the business requirement is "no English text for all possible and impossible cases".
Does anyone know of checker software or a service which could check whether a site is fully translated, that is, which checks that there is no English text in it?
I know that there are sites for checking broken links, HTML validity etc. I need something like http://validator.w3.org/checklink, but for checking that there is no English text on any page of the site.
The reasons I think this is needed are:
1. There is a lot of code which is common (both on backend and frontend) for all countries.
2. If someone commits anything to the common code, I need to be sure that this will not lead to English text issues in the localized version.
3. From a business point of view, it is preferable that the site does not support some functionality than that it shows English text (legal matters).
4. The code on both frontend and backend changes a lot.
5. There are a lot of files which affect text on the client's screen, not just one with messages, unfortunately. Some of the messages come from the backend, but most of them are in the frontend.
6. Due to all those facts, currently someone manually fills in all the forms and checks them with their own eyes, and that is before each deploy...
I think you're approaching the problem from the wrong direction. You're looking for an algorithm or web crawler that can detect whether any text is English or not? I don't know, but I doubt such a thing even exists.
If you have translated the website, you have full access to the codebase and/or translation texts, right? Can't you just open both the English and non-English strings files (.resx or whatever you are using) in a compare tool like Notepad++ and check the differences to see if there are any missing strings? And check the source code and verify that all parts that can output user-displayable text use the meta:resourceKey property (or whatever you are using).
If you want to go the way of crawling, I'm not aware of an existing crawler that does this, but it sounds like a combination of two simple issues:
Finding existing open-source code for a web crawler should be dead simple
Identifying a language through n-gram analysis is trivial if there's a limited number of languages the text can be in.
The only difficult part would be to ensure that the analyzer always has a decent chunk of text to work with. You could extract stuff paragraph by paragraph. For forms you'd probably have to combine the text of several form labels.
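To give an idea of the n-gram part, here is a minimal sketch that compares character trigrams of a text chunk against one profile per language. The two training sentences are tiny made-up samples and the function names are mine; a usable checker would need much larger profiles and, as noted above, a decent chunk of text per check.

```python
from collections import Counter


def trigrams(text):
    """Count overlapping character trigrams in lower-cased, space-normalised text."""
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + 3] for i in range(len(normalised) - 2))


def overlap(sample, profile):
    """Fraction of the sample's trigrams that also occur in the profile."""
    total = sum(sample.values())
    if total == 0:
        return 0.0
    hits = sum(count for gram, count in sample.items() if gram in profile)
    return hits / total


# Tiny made-up training texts; real profiles need far more text per language.
PROFILES = {
    "en": trigrams("the user must enter a name and password to log in to the site"),
    "de": trigrams("der benutzer muss einen namen und ein passwort eingeben um sich anzumelden"),
}


def guess_language(text):
    """Return the code of the profile with the highest trigram overlap."""
    sample = trigrams(text)
    return max(PROFILES, key=lambda lang: overlap(sample, PROFILES[lang]))


print(guess_language("please enter a password"))
```

In a continuous integration job, the crawler would feed each extracted paragraph to something like guess_language and fail the build when a chunk comes back as "en".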

What are my options for white-listing HTML in ColdFusion?

I want to allow my users to input HTML.
Requirements
1. Allow a specific set of HTML tags.
2. Preserve characters (do not encode ã into &atilde;, for example).
Existing options
AntiSamy. Unfortunately AntiSamy encodes special characters and breaks requirement 2.
Native ColdFusion functions (HTMLCodeFormat() etc...) don't work as they encode HTML into entities, and thus fail requirement 1.
I found this set of functions somewhere, but I have no way of telling how secure this is: http://pastie.org/2072867
So what are my options? Are there existing libraries for this?
Portcullis works well for ColdFusion for attack-specific issues. I've used a couple of other regex solutions I found on the web over time that have worked well, though they haven't been nearly as fleshed out. In 15 years (10 as a CMS developer) nothing I've built has been hacked... knock on wood.
When developing input fields of any type, it's good to look at the problem from different angles. You've got the UI side, which includes both usability and client-side validation. Yes, it can be bypassed, but javascript-based validation is quicker, more responsive, and rates higher on the magical UI scale than backend-interruption method or simply making things "disappear" without warning. It will speed up the back-end validation because it does the initial screening. So, it's not an "instead of" but an "in-addition to" type solution that can't be ignored.
Also on the UI front, giving your users a good-quality editor can make a huge difference in the process. My personal favorite is CKEditor, simply because it's the only one that can handle Microsoft Word code on the front side, keeping it far away from my DB. It seems silly, but Word HTML is valid, so it won't set off any red flags... but on a moderately sized document it will quickly exceed the maximum size of a DB field insert, believe it or not. Not only will a good editor reduce the amount of silly HTML that comes in, it will also just make things faster for the user... win/win.
I personally encode and decode my characters... it's always just worked well, so I've never changed practice.
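For comparison, the two requirements from the question (allow only certain tags, leave characters such as ã alone) can be sketched in a few lines. This is written in Python with its standard html.parser module, not ColdFusion, purely to show the idea; the tag list and the names WhitelistSanitizer and sanitize are assumptions, all attributes are dropped, and it has not been reviewed as a security control.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical white-list; pick whatever tags your application should allow.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br"}


class WhitelistSanitizer(HTMLParser):
    """Re-emit allowed tags without attributes and escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped entirely, which removes onclick=, style=, etc.
        if tag in ALLOWED_TAGS:
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag != "br":
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # quote=False escapes only &, < and >; characters such as ã are untouched.
        self.parts.append(escape(data, quote=False))


def sanitize(html_text):
    """Return html_text with only white-listed tags kept."""
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


print(sanitize('<b onclick="x()">ã</b><img src=x onerror=alert(1)>'))
```

The same shape (parse, keep known tags, escape everything else) is what to look for when evaluating a library for this job.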

Web Application Cross Site Scripting

My website http://www.imayne.com seems to have this issue, verified by McAfee. Can someone show me how to fix this? (Title)
It says this:
General Solution:
When accepting user input ensure that you are HTML encoding potentially malicious characters if you ever display the data back to the client.
Ensure that parameters and user input are sanitized by doing the following:
Remove < input and replace with "&lt;";
Remove > input and replace with "&gt;";
Remove ' input and replace with "&apos;";
Remove " input and replace with "&#x22;";
Remove ) input and replace with "&#x29;";
Remove ( input and replace with "&#x28;";
I cannot seem to show the actual code. This website is showing something else.
I'm not a web dev but I can do a little. I'm trying to be PCI compliant.
Let me both answer your question and give you some advice. Preventing XSS properly needs to be done by defining a white-list of acceptable values at the point of user input, not a black-list of disallowed values. This needs to happen first and foremost, before you even begin thinking about encoding.
Once you get to encoding, use a library from your chosen framework, don't attempt character substitution yourself. There's more information about this here in OWASP Top 10 for .NET developers part 2: Cross-Site Scripting (XSS) (don't worry about it being .NET orientated, the concepts are consistent across all frameworks).
Now for some friendly advice: get some expert support ASAP. You've got a fundamentally obvious reflective XSS flaw in an e-commerce site and based on your comments on this page, this is not something you want to tackle on your own. The obvious nature of this flaw suggests you've quite likely got more obscure problems in the site as well. By your own admission, "you're a noob here" and you're not going to gain the competence required to sufficiently secure a website such as this overnight.
The type of changes you are describing are often accomplished in several languages via an HTML encoding function. What is the site written in? If this is an ASP.NET site, this article may help:
http://weblogs.asp.net/scottgu/archive/2010/04/06/new-lt-gt-syntax-for-html-encoding-output-in-asp-net-4-and-asp-net-mvc-2.aspx
In PHP use this function to wrap all text being output:
http://ch2.php.net/manual/en/function.htmlentities.php
Anyplace you see echo(...) or print(...) you can replace it with:
echo(htmlentities($whateverWasHereOriginally, ENT_QUOTES, 'UTF-8'));
Take a look at the examples section in the middle of the page for other guidance.
Follow those steps exactly, and you're good to go. The main thing is to ensure that you don't treat anything the user submits to you as code (HTML, SQL, Javascript, or otherwise). If you fail to properly clean up the inputs, you run the risk of script injection.
If you want to see a trivial example of this problem in action, search for
<span style="color:red">red</span>
on your site, and you'll see that the echoed search term is red.
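The fix for that example looks the same in any language: encode the term before echoing it. As a small illustration, here it is in Python using the standard-library html.escape, which plays the same role as htmlentities above; the function name render_search_heading and the heading text are made up for the example.

```python
from html import escape


def render_search_heading(term):
    """Build the 'results for ...' line with the user's term encoded as text."""
    return "Search results for: " + escape(term, quote=True)


print(render_search_heading('<span style="color:red">red</span>'))
# prints: Search results for: &lt;span style=&quot;color:red&quot;&gt;red&lt;/span&gt;
```

With the term encoded, the browser shows the markup as literal text instead of rendering a red word.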