| Title: | Handling POST forms in WSGI |
|---|---|
| Author: | Ian Bicking <ianb@colorstudy.com> |
| Discussions-To: | Python Web-SIG <web-sig@python.org> |
| Status: | Withdrawn |
| Created: | 21-Oct-2006 |
This suggests a way that WSGI middleware, applications, and frameworks can access POST form bodies so that there is less contention for the wsgi.input stream.
I decided that there were opportunities to decorate the wsgi.input stream itself, and have been pursuing them in WSGIRemote. I may describe that strategy in a specification later.
Currently environ['wsgi.input'] points to a stream that represents the body of the HTTP request. Once this stream has been read, it cannot necessarily be read again. It may not have a seek method (none is required by the WSGI specification, and frequently none is provided by WSGI servers).
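The one-shot behaviour can be seen with a small stand-in stream. This is only an illustrative sketch: `OneShotInput` is a hypothetical class that mimics a server-provided `wsgi.input` without a `seek` method, and the request body is made up.

```python
import io


class OneShotInput:
    """Hypothetical stand-in for a server's wsgi.input that offers no seek()."""

    def __init__(self, body):
        self._buffer = io.BytesIO(body)

    def read(self, size=-1):
        return self._buffer.read(size)


environ = {'wsgi.input': OneShotInput(b'name=Ian&lang=python')}

# The first consumer receives the whole body.
first = environ['wsgi.input'].read()
# Any later consumer finds the stream drained and cannot rewind it.
second = environ['wsgi.input'].read()
```

Here `first` holds the full body while `second` is empty, which is the contention this proposal tries to reduce.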
As a result, any piece of a system that looks at the request body essentially takes ownership of that body, and no one else is able to access it. This is particularly problematic for POST form requests, as many framework pieces expect to have access to the form data. One notable case is when a request “enters” a traditional web framework which parses the POST form, then “exits” back to WSGI through some framework-specific WSGI gateway.
The specification covers library code that multiple frameworks can implement. This is not functionality that is intended to be added to a WSGI “stack”.
This applies when certain requirements of the WSGI environment are met:

    def is_post_request(environ):
        if environ['REQUEST_METHOD'].upper() != 'POST':
            return False
        content_type = environ.get('CONTENT_TYPE', 'application/x-www-form-urlencoded')
        return (content_type.startswith('application/x-www-form-urlencoded')
                or content_type.startswith('multipart/form-data'))

That is, it must be a POST request, and it must be a form request (generally application/x-www-form-urlencoded or, when there are file uploads, multipart/form-data).
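To make the requirement concrete, the check can be run against a few hand-built environments. The function is repeated here so the example runs on its own; the sample dictionaries and their values are illustrative assumptions, not part of the proposal.

```python
def is_post_request(environ):
    if environ['REQUEST_METHOD'].upper() != 'POST':
        return False
    content_type = environ.get('CONTENT_TYPE', 'application/x-www-form-urlencoded')
    return (content_type.startswith('application/x-www-form-urlencoded')
            or content_type.startswith('multipart/form-data'))


# Illustrative environments (only the keys the check looks at).
form_post = {'REQUEST_METHOD': 'POST',
             'CONTENT_TYPE': 'application/x-www-form-urlencoded; charset=utf-8'}
upload_post = {'REQUEST_METHOD': 'POST',
               'CONTENT_TYPE': 'multipart/form-data; boundary=xyz'}
json_post = {'REQUEST_METHOD': 'POST', 'CONTENT_TYPE': 'application/json'}
untyped_post = {'REQUEST_METHOD': 'POST'}  # no CONTENT_TYPE: treated as a form
plain_get = {'REQUEST_METHOD': 'GET'}

results = {
    'form_post': is_post_request(form_post),
    'upload_post': is_post_request(upload_post),
    'json_post': is_post_request(json_post),
    'untyped_post': is_post_request(untyped_post),
    'plain_get': is_post_request(plain_get),
}
```

Note that a POST with no `CONTENT_TYPE` at all is treated as a form, because the check falls back to `application/x-www-form-urlencoded`.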
When these requirements are met, the form can be parsed by cgi.FieldStorage. The result of this parsing is put in wsgi.post_form as (new_wsgi_input, old_wsgi_input, FieldStorage_object).
The new_wsgi_input can be used to check if an intermediary has replaced the input since wsgi.post_form was calculated. If the input has been changed, the wsgi.post_form data should be discarded. The old_wsgi_input can be used if you want to get access to the original input stream (which may be seekable, and so still useful).
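The staleness check described above might look like the following sketch. `cached_post_form` is a hypothetical helper name, and plain placeholder objects stand in for the streams and for the `FieldStorage` instance so the example does not depend on any parser.

```python
def cached_post_form(environ):
    """Return the cached parsed form, or None if it is missing or stale."""
    post_form = environ.get('wsgi.post_form')
    if post_form is None:
        return None
    new_input, old_input, parsed = post_form
    if new_input is not environ['wsgi.input']:
        # An intermediary replaced wsgi.input after parsing: discard the cache.
        del environ['wsgi.post_form']
        return None
    return parsed


# Placeholders only; real code would hold streams and a FieldStorage object.
original_input = object()
replacement_input = object()
parsed_form = {'name': ['Ian']}

environ = {
    'wsgi.input': replacement_input,
    'wsgi.post_form': (replacement_input, original_input, parsed_form),
}

fresh = cached_post_form(environ)   # wsgi.input unchanged: cache is usable
environ['wsgi.input'] = object()    # some intermediary swaps the stream
stale = cached_post_form(environ)   # identity check fails: cache is dropped
```

The comparison uses identity (`is`), not equality, since the point is to detect that a different stream object now sits in `wsgi.input`.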
The replacement wsgi.input guards against routines that access the data but don’t conform to this specification. Ideally the replacement will act like the original wsgi.input (producing the same data), but if not it should raise an exception. The input should not block or produce inaccurate data.

    import cgi  # standard library; removed in Python 3.13

    def get_post_form(environ):
        assert is_post_request(environ)
        wsgi_input = environ['wsgi.input']
        post_form = environ.get('wsgi.post_form')
        if (post_form is not None
                and post_form[0] is wsgi_input):
            # Already parsed, and wsgi.input has not been replaced since
            return post_form[2]
        # This must be done to avoid a bug in cgi.FieldStorage
        environ.setdefault('QUERY_STRING', '')
        fs = cgi.FieldStorage(fp=wsgi_input,
                              environ=environ,
                              keep_blank_values=True)
        new_input = InputProcessed()
        post_form = (new_input, wsgi_input, fs)
        environ['wsgi.post_form'] = post_form
        environ['wsgi.input'] = new_input
        return fs

    class InputProcessed(object):
        # Every way of reading the replaced stream fails loudly
        def read(self, *args):
            raise EOFError('The wsgi.input stream has already been consumed')
        readline = readlines = __iter__ = read

By using this routine, multiple consumers can parse a POST form, accessing the form data in any order (later consumers will get the already-parsed data).
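The sharing behaviour can be sketched without `cgi`, which is no longer available in recent Python versions. The example below is an assumption-laden variant, not the routine from this proposal: the hypothetical `get_urlencoded_form` handles only urlencoded bodies, uses `urllib.parse.parse_qs` in place of `cgi.FieldStorage`, and therefore stores a plain dict as the third tuple element.

```python
import io
from urllib.parse import parse_qs


class InputProcessed:
    def read(self, *args):
        raise EOFError('The wsgi.input stream has already been consumed')
    readline = readlines = __iter__ = read


def get_urlencoded_form(environ):
    """Hypothetical stand-in for get_post_form, for urlencoded bodies only."""
    wsgi_input = environ['wsgi.input']
    post_form = environ.get('wsgi.post_form')
    if post_form is not None and post_form[0] is wsgi_input:
        return post_form[2]  # an earlier consumer already parsed the body
    length = int(environ.get('CONTENT_LENGTH') or 0)
    body = wsgi_input.read(length).decode('ascii')
    form = parse_qs(body, keep_blank_values=True)
    new_input = InputProcessed()
    environ['wsgi.post_form'] = (new_input, wsgi_input, form)
    environ['wsgi.input'] = new_input
    return form


body = b'name=Ian&note='
environ = {
    'REQUEST_METHOD': 'POST',
    'CONTENT_LENGTH': str(len(body)),
    'wsgi.input': io.BytesIO(body),
}

first = get_urlencoded_form(environ)   # parses the body, replaces wsgi.input
second = get_urlencoded_form(environ)  # reuses the cached result
```

Both calls return the same object, and any code that bypasses the helper and reads `environ['wsgi.input']` directly gets an `EOFError` instead of silently empty data.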
Note that nothing in this specification touches or applies to the query string (in environ['QUERY_STRING']). This is not parsed as part of the process, and nothing in this specification applies to GET requests, or to the query string which may be present in a POST request.
While this proposal makes it more feasible for middleware to access POST form data, it should not be read as encouraging middleware to do so. In particular, no consumer should ever expect that wsgi.post_form is in the request environment. Also, no intermediary should parse the POST form data unless it actually is interested in that data – access should be deferred until there is a real need for the POST data.
One of the simplest possibilities is to add this information to environ['wsgi.input'] itself as a separate attribute. E.g.:

    fs = getattr(environ['wsgi.input'], 'cgi_FieldStorage', None)
    if fs is None:
        # parse and replace wsgi.input...

There’s a certain elegance to keeping wsgi.input self-describing and movable.
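A rough sketch of this alternative follows. Everything in it is an assumption made for illustration: `ParsedInput`, `get_form`, and the `parsed_form` attribute are hypothetical names, `parse_qs` stands in for `cgi.FieldStorage`, and the body is buffered in memory so the replacement stream can replay it.

```python
import io
from urllib.parse import parse_qs


class ParsedInput:
    """Hypothetical replacement wsgi.input that carries the parsed form."""

    def __init__(self, body, parsed_form):
        self._buffer = io.BytesIO(body)
        self.parsed_form = parsed_form  # hypothetical attribute name

    def read(self, size=-1):
        return self._buffer.read(size)

    def readline(self, size=-1):
        return self._buffer.readline(size)


def get_form(environ):
    wsgi_input = environ['wsgi.input']
    form = getattr(wsgi_input, 'parsed_form', None)
    if form is None:
        # Parse once, then swap in a stream that describes itself.
        length = int(environ.get('CONTENT_LENGTH') or 0)
        body = wsgi_input.read(length)
        form = parse_qs(body.decode('ascii'), keep_blank_values=True)
        environ['wsgi.input'] = ParsedInput(body, form)
    return form


body = b'name=Ian'
environ = {'CONTENT_LENGTH': str(len(body)), 'wsgi.input': io.BytesIO(body)}

first = get_form(environ)
second = get_form(environ)

# "Movable": the parsed data follows the stream into another environment.
other_environ = {'wsgi.input': environ['wsgi.input']}
moved = get_form(other_environ)
```

With this layout there is no separate environment key to keep in sync, at the cost of requiring every replacement stream to expose the extra attribute.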
This doesn’t address non-form-submission POST requests. Most of the same issues apply to such requests, except that frameworks tend not to touch the request body in that case. The body may be large, so the actual contents of the request body shouldn’t go in the environment. Perhaps they could go in a temporary file, but this too might be an unnecessary indirection in many cases. Also other kinds of request (like PUT) that have a request body are not covered, for largely the same reason. In both these cases, it is much easier to construct a new wsgi.input that accesses whatever your internal representation of the request body is.
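For such requests, constructing a new wsgi.input could look like the sketch below. `replace_input` is a hypothetical helper, the JSON body is made up, and the example assumes the body is small enough to hold in memory.

```python
import io
import json


def replace_input(environ, body):
    """Hypothetical helper: expose an in-memory body as a fresh wsgi.input."""
    environ['wsgi.input'] = io.BytesIO(body)
    environ['CONTENT_LENGTH'] = str(len(body))


raw = b'{"title": "WSGI"}'
environ = {
    'REQUEST_METHOD': 'PUT',
    'CONTENT_TYPE': 'application/json',
    'CONTENT_LENGTH': str(len(raw)),
    'wsgi.input': io.BytesIO(raw),
}

# A framework consumes the body into its own internal representation...
body = environ['wsgi.input'].read(int(environ['CONTENT_LENGTH']))
document = json.loads(body)

# ...and rebuilds wsgi.input before handing the request on.
replace_input(environ, body)
downstream = environ['wsgi.input'].read()
```

A framework that keeps large bodies in a temporary file could open that file as the new `wsgi.input` instead of using `io.BytesIO`.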
Does wsgi.post_form need to be a tuple, or could it just be the FieldStorage instance? Should all the information go in wsgi.input directly?
Should wsgi.input be replaced by InputProcessed, or just left as is? Or should we look for code that serializes FieldStorage objects back to parseable strings?
Does QUERY_STRING actually have to be set for cgi not to mess up, or is that just an issue with GET requests?