Django testing view input with different character sets - django

I'm having trouble trying to generate a test for my views. I've a view, that consumes in a given input from form, some characters. That characters are commited to DB, without problems.
All I was trying was to generate a test to ensure that different characters, from different languages, were accepted.
I tested this one:
Český jazyk neboli čeština
This input is correctly got from HTML form, and stored in DB. When I try to set this one from test, something weird happens, and view throws an error, saying that
Warning: Incorrect string value: '\xC4\x8Cesk\xC3...' for column 'title' at row 1
My code is as simple as follows:
str1 = "Český jazyk neboli čeština"
self.client.post(url, {"title": str1})
And tryied all combinations:
str1 = u"..."
str1 = str1.encode('utf-8')
str1 = str1.decode('utf-8')
Without any success.
Can anyone tell me what I'm missing?
Thank you in advance

First of all: Make sure you have included this at the beginning of your script:
#-*- coding: utf-8 -*-
That is to tell the interpreter that this file's encoding is utf-8 (make sure it is from your text editor)
Second: instead of
str1 = "Český jazyk neboli čeština"
declare the str1 as unicode like this:
str1 = u"Český jazyk neboli čeština"
Now, I'll suggest you that if you want to include non ascii chars, declare them with their proper unicode code instead of the character to avoid weird encoding issues.
str1 = u'\u010cesk\xfd jazyk neboli \u010de\u0161tina'
This is a useful page to check characters unicode code
Hope this helps!

Related

Django 'ascii' codec can't encode characters despite encoding in UTF-8? What am I doing wrong?

I'm still in the process of learning Django. I have a bit of a problem with encoding a cyrillic strings. I have a text input. I append it's value using JS to the URL and then get that value in my view (I know I should probably use a form for that, but that's not the issue).
So here's my code (it's not complete, but it shows the main idea I think).
JS/HTML
var notes = document.getElementById("notes").value;
...
window.location.href = 'http://my-site/example?notes='+notes
<input type="text" class="notes" name="notes" id="notes">
Django/Python
notes= request.GET.get('notes', 0)
try:
notes = notes.encode('UTF-8')
except:
pass
...
sql = 'INSERT INTO table(notes) VALUES(%s)' % str(notes)
The issue is, whenever I type a string in cyrillic I get this error message: 'ascii' codec can't encode characters at position... Also I know that I probably shouldn't pass strings like that to the query, but it's a personal project so... that would do for now. I've been stuck there for a while now. Any suggestions as to what's causing this would be appreciated.
request.GET.get("key") will already get a string, why you need to encode it?
May set request.encoding="utf-8" work for you.

How can I use regex to construct an API call in my Jekyll plugin?

I'm trying to write my own Jekyll plugin to construct an api query from a custom tag. I've gotten as far as creating the basic plugin and tag, but I've run into the limits of my programming skills so looking to you for help.
Here's my custom tag for reference:
{% card "Arbor Elf | M13" %}
Here's the progress on my plugin:
module Jekyll
class Scryfall < Liquid::Tag
def initialize(tag_name, text, tokens)
super
#text = text
end
def render(context)
# Store the name of the card, ie "Arbor Elf"
#card_name =
# Store the name of the set, ie "M13"
#card_set =
# Build the query
#query = "https://api.scryfall.com/cards/named?exact=#{#card_name}&set=#{#card_set}"
# Store a specific JSON property
#card_art =
# Finally we render out the result
"<img src='#{#card_art}' title='#{#card_name}' />"
end
end
end
Liquid::Template.register_tag('cards', Jekyll::Scryfall)
For reference, here's an example query using the above details (paste it into your browser to see the response you get back)
https://api.scryfall.com/cards/named?exact=arbor+elf&set=m13
My initial attempts after Googling around was to use regex to split the #text at the |, like so:
#card_name = "#{#text}".split(/| */)
This didn't quite work, instead it output this:
[“A”, “r”, “b”, “o”, “r”, “ “, “E”, “l”, “f”, “ “, “|”, “ “, “M”, “1”, “3”, “ “]
I'm also then not sure how to access and store specific properties within the JSON response. Ideally, I can do something like this:
#card_art = JSONRESPONSE.image_uri.large
I'm well aware I'm asking a lot here, but I'd love to try and get this working and learn from it.
Thanks for reading.
Actually, your split should work – you just need to give it the correct regex (and you can call that on #text directly). You also need to escape the pipe character in the regex, because pipes can have special meaning. You can use rubular.com to experiment with regexes.
parts = #text.split(/\|/)
# => => ["Arbor Elf ", " M13"]
Note that they also contain some extra whitespace, which you can remove with strip.
#card_name = parts.first.strip
#card_set = parts.last.strip
This might also be a good time to answer questions like: what happens if the user inserts multiple pipes? What if they insert none? Will your code give them a helpful error message for this?
You'll also need to escape these values in your URL. What if one of your users adds a card containing a & character? Your URL will break:
https://api.scryfall.com/cards/named?exact=Sword of Dungeons & Dragons&set=und
That looks like a URL with three parameters, exact, set and Dragons. You need to encode the user input to be included in a URL:
require 'cgi'
query = "https://api.scryfall.com/cards/named?exact=#{CGI.escape(#card_name)}&set=#{CGI.escape(#card_set)}"
# => "https://api.scryfall.com/cards/named?exact=Sword+of+Dungeons+%26+Dragons&set=und"
What comes after that is a little less clear, because you haven't written the code yet. Try making the call with the Net::HTTP module and then parsing the response with the JSON module. If you have trouble, come back here and ask a new question.

how to fix bad decoding?

i am using shown below procedure to write (spidered from the web) a list of persons data.
I think the code itself is ok, but i am confused with results. Some chars are decoded properly, some not. For example:
STANIS?AWurodzony/a 01.01.1888, ?Ԅ?
HALINAurodzony/a 05.07.1927, ŁÓDŹ
the last word in both strings is the same!
Also ? sign used to replace non-translatable chars once is used once not:
STANISŁAWurodzony/a 24.03.1907, RAKSZANY
^
| here is written ok - not replaced
And here is the code:
def findPerson():
file = codecs.open('Names.txt','a','ISO-8859-1','replace')
try:
with codecs.open('./listNames_links.txt','r','ISO-8859-1','replace') as f:
line = f.readline()
while line != '':
#print line,
line = f.readline()
res = requests.get('http://real.address.gov.pl'+line)
res.raise_for_status()
soup = BeautifulSoup(res.text)
linkElems = soup.find('a','css_class_name').text
file.write(linkElems)
file.write('\r\n')#preserve end-of-line
question:
How to fix this. Is my procedure wrong? Or the source page has broken encoding? (I suppose it is ok, i can read it in browser without any errors.)
I'm not very familiar with python, but had similar problems with java programs. And in almost every case it was the problem that not the same encoding was used in all steps and thus while converting it results in ugly charcters.
I'd suggest to use UTF-8 during the whole process if possible.
An off-topic side remark: Because I stumpled upon this so very often, I bought myself this T-Shirt (correctly spelled Scheiß, German for crap) and wear it at work ;-)
This looks like a polish language, and doesn't look like multi-byte encoding such as UTF-8 is used. You are writing to your Names.txt file using ISO-8859-1 which doesn't support all polish characters.
You could use ISO-8859-2 for Polish characgter support, but even better would be to use UTF-8 which supports all languages and is the common standard on the web.
Try
file = codecs.open('Names.txt','a','UTF-8','replace')
When you make a request using Requests, try checking the encoding of each page. For example:
res = requests.get('http://real.address.gov.pl'+line)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text)

how to read only URL from txt file in MATLAB

I have a text file having multiple URLs with other information of the URL. How can I read the txt file and save the URLs only in an array to download it? I want to use
C = textscan(fileId, formatspec);
What should I mention in formatspec for URL as format?
This is not a job for textscan; you should use regular expressions for this. In MATLAB, regexes are described here.
For URLs, also refer here or here for examples in other languages.
Here's an example in MATLAB:
% This string is obtained through textscan or something
str = {...
'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};
% find URLs
C = regexpi(str, ...
['((http|https|ftp|file)://|www\.|ftp\.)',...
'[-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$]'], 'match');
C{:}
Result:
ans =
'http://www.example.com/index.php?query=test&otherStuf=info'
ans =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
Note that this regex requires you to have the protocol included, or have a leading www. or ftp.. Something like example.com/universal_remote.cgi?redirect= is NOT matched.
You could go on and make the regex cover more and more cases. However, eventually you'll stumble upon the the most important conclusion (as made here for example; where I got my regex from): given the full definition of what precisely constitutes a valid URL, there is no single regex able to always match every valid URL. That is, there are valid URLs you can dream up that are not captured by any of the regexes shown.
But please keep in mind that this last statement is more theoretical rather than practical -- those non-matchable URLs are valid but not often encountered in practice :) In other words, if your URLs have a pretty standard form, you're pretty much covered with the regex I gave you.
Now, I fooled around a bit with the Java suggestion by pm89. As I suspected, it is an order of magnitude slower than just a regex, since you introduce another "layer of goo" to the code (in my timings, the difference was about 40x slower, excluding the imports). Here's my version:
import java.net.URL;
import java.net.MalformedURLException;
str = {...
'pre-URL garbage http://www.example.com/index.php?query=test&otherStuf=info more stuff here'
'pre--URL garbage example.com/index.php?query=test&otherStuf=info more stuff here'
'other foolish stuff ftp://localhost/home/ruler_of_the_world/awesomeContent.py 1 2 3 4 misleading://';
};
% Attempt to convert each item into an URL.
for ii = 1:numel(str)
cc = textscan(str{ii}, '%s');
for jj = 1:numel(cc{1})
try
url = java.net.URL(cc{1}{jj})
catch ME
% rethrow any non-url related errors
if isempty(regexpi(ME.message, 'MalformedURLException'))
throw(ME);
end
end
end
end
Results:
url =
'http://www.example.com/index.php?query=test&otherStuf=info'
url =
'ftp://localhost/home/ruler_of_the_world/awesomeContent.py'
I'm not too familiar with java.net.URL, but apparently, it is also unable to find URLs without leading protocol or standard domain (e.g., example.com/path/to/page).
This snippet can undoubtedly be improved upon, but I would urge you to consider why you'd want to do this for this longer, inherently slower and far uglier solution :)
As I suspected you could use java.net.URL according to this answer.
To implement the same code in Matlab:
First read the file into a string, using fileread for example:
str = fileread('Sample.txt');
Then split the text with respect to spaces, using strsplit:
spl_str = strsplit(str);
Finally use java.net.URL to detect the URLs:
for k = 1:length(spl_str)
try
url = java.net.URL(spl_str{k})
% Store or save the URL contents here
catch e
% it's not a URL.
end
end
You can write the URL contents into a file using urlwrite. But first convert the URLs obtained from java.net.URL to char:
url = java.net.URL(spl_str{k});
urlwrite(char(url), 'test.html');
Hope it helps.

weird django translations behaviour

I have weird problem with django translations that i need help figuring out. Problem is that instead of getting translated string every time i seem to get randomly either translated string or default string.
I have created a class for putting "action" buttons on several pages. Many of those buttons are reusable so why should i put
blabla
into several templates when i can create simple toolbar and use
toolbar.add(tool)
in view and then just use templatetag for rendering all the tools.... anyway.
This is example of one tool:
class NewUserGroupButton(MyTool):
tool_id = 'new-user-group'
button_label = ugettext(u'Create new user group')
tool_href = 'new/'
js = ()
def render(self):
s = '<a id="%(tool_id)s" href="%(tool_href)s" class="button blue">%(button_label)s</a>'
return s % {'tool_id': self.tool_id, 'tool_href': self.tool_href, 'button_label':self.button_label}
The tools are rendered like this:
class ToolbarTools:
def __init__(self):
self.tools = []
def add(self, tool):
self.tools.append(tool)
def render(self):
# Organize tools by weight
self.tools.sort(key=lambda x: x.weight)
# Render tools
output = '<ul id="toolbar" class="clearfix">'
for tool in self.tools:
output += '<li id="%s">' % ('tool-'+ tool.tool_id)
output += tool.render() if tool.renderable else ''
output += '</li>'
output += '</ul>'
# Reset tools container
self.tools = []
return mark_safe(output)
im using ugettext to translate the string. using (u)gettext=lambda s:s or ugettext_lazy gives me either no translations or proxy object corresponding to function choices.
And like i said - while rest of the page is in correct language the toolbar buttons seem to be randomly either translated or not. The faulty behaviour remains consistent if i move between different pages with different "tools" or even refresh the same page several times.
Could someone please help me to figure out what is causing this problem and how to fix it?
Problem solved.
The issue itself (random translating) was perhaps caused by using ugettext. But at the same time i should have used ugettext_lazy instead.
And thus the problem really came from ugettext_lazy returning proxy object not translated string... and that is caused by this:
[Quote]
Working with lazy translation objects
The result of a ugettext_lazy() call can be used wherever you would use a unicode string (an object with type unicode) in Python. If you try to use it where a bytestring (a str object) is expected, things will not work as expected, since a ugettext_lazy() object doesn't know how to convert itself to a bytestring. You can't use a unicode string inside a bytestring, either, so this is consistent with normal Python behavior. For example:
This is fine: putting a unicode proxy into a unicode string.
u"Hello %s" % ugettext_lazy("people")
This will not work, since you cannot insert a unicode object
into a bytestring (nor can you insert our unicode proxy there)
"Hello %s" % ugettext_lazy("people")
[/Quote]
taken from here:
https://docs.djangoproject.com/en/dev/topics/i18n/translation/#working-with-lazy-translation-objects
Alan
I had the same issue just now. The problem was caused by incorrect Middleware order - I set LocaleMiddleware after CommonMiddleware. After I placed it between SessionMiddleware and CommonMiddleware, it seems to work correctly.
See more here: https://lokalise.com/blog/advanced-django-internationalization/
I know, the problem was solved a long ago, but it can be helpful for someone.