% mechanize -- FAQ

<div class="expanded">


  * <span class="q">Which version of Python do I need?</span>

    Python 2.4, 2.5, 2.6, or 2.7.  Python 3 is not yet supported.

  * <span class="q">Does mechanize depend on BeautifulSoup?</span>

    No.  mechanize offers a few classes that make use of BeautifulSoup, but
these classes are not required to use mechanize.  mechanize bundles
BeautifulSoup version 2, so that module is no longer required.  A future
version of mechanize will support BeautifulSoup version 3, at which point
mechanize will likely no longer bundle the module.

  * <span class="q">Does mechanize depend on ClientForm?</span>

    No, ClientForm is now part of mechanize.

  * <span class="q">Which license?</span>

    mechanize is dual-licensed: you may pick either the [BSD
license](http://www.opensource.org/licenses/bsd-license.php), or the [ZPL
2.1](http://www.zope.org/Resources/ZPL) (both are included in the
distribution).


Usage
-----

  * <span class="q">I'm not getting the HTML page I expected to see.</span>

    [Debugging tips](hints.html)

  * <span class="q">`Browser` doesn't have all of the forms/links I see in the
HTML.  Why not?</span>

    Perhaps the default parser can't cope with invalid HTML.  Try using the
included BeautifulSoup 2 parser instead:

~~~~{.python}
import mechanize

browser = mechanize.Browser(factory=mechanize.RobustFactory())
browser.open("http://example.com/")
print browser.forms
~~~~

    Alternatively, you can process the HTML (and headers) arbitrarily:

~~~~{.python}
browser = mechanize.Browser()
browser.open("http://example.com/")
html = browser.response().get_data().replace("<br/>", "<br />")
response = mechanize.make_response(
    html, [("Content-Type", "text/html")],
    "http://example.com/", 200, "OK")
browser.set_response(response)
~~~~

  * <span class="q">Is JavaScript supported?</span>

    No, sorry.  See [FAQs](#change-value) [below](#script).

  * <span class="q">My HTTP response data is truncated.</span>

    `mechanize.Browser's` response objects support the `.seek()` method, and
can still be used after `.close()` has been called.  Response data is not
fetched until it is needed, so navigation away from a URL before fetching all
of the response will truncate it.  Call `response.get_data()` before navigation
if you don't want that to happen.

  * <a name="xhtml" /><span class="q">I'm *sure* this page is HTML, why does `mechanize.Browser`
think otherwise?</span>

~~~~{.python}
b = mechanize.Browser(
    # mechanize's XHTML support needs work, so is currently switched off.  If
    # we want to get our work done, we have to turn it on by supplying a
    # mechanize.Factory (with XHTML support turned on):
    factory=mechanize.DefaultFactory(i_want_broken_xhtml_support=True)
    )
~~~~

  * <span class="q">Why don't timeouts work for me?</span>

    Timeouts are ignored with with versions of Python earlier than 2.6.
Timeouts do not apply to DNS lookups.

  * <span class="q">Is there any example code?</span>

    Look in the `examples/` directory.  Note that the examples on the [forms
    page](./forms.html) are executable as-is.  Contributions of example code
    would be very welcome!


Cookies
-------

  * <span class="q">Doesn't the standard Python library module, `Cookie`, do
    this?</span>

    No: module `Cookie` does the server end of the job.  It doesn't know when
    to accept cookies from a server or when to send them back.  Part of
    mechanize has been contributed back to the standard library as module
    `cookielib` (there are a few differences, notably that `cookielib` contains
    thread synchronization code; mechanize does not use `cookielib`).

  * <span class="q">Which HTTP cookie protocols does mechanize support?</span>

    Netscape and [RFC 2965](http://www.ietf.org/rfc/rfc2965.txt).  RFC 2965
    handling is switched off by default.

  * <span class="q">What about RFC 2109?</span>

    RFC 2109 cookies are currently parsed as Netscape cookies, and treated
    by default as RFC 2965 cookies thereafter if RFC 2965 handling is enabled,
    or as Netscape cookies otherwise.


  * <span class="q">Why don't I have any cookies?</span>

    See [here](hints.html#cookies).

  * <span class="q">My response claims to be empty, but I know it's not!</span>

    Did you call `response.read()` (e.g., in a debug statement), then forget
    that all the data has already been read?  In that case, you may want to use
    `mechanize.response_seek_wrapper`.  `mechanize.Browser` always returns
    [seekable responses](doc.html#seekable-responses), so it's not necessary to
    use this explicitly in that case.

  * <span class="q">What's the difference between the `.load()` and `.revert()`
    methods of `CookieJar`?</span>

    `.load()` *appends* cookies from a file.  `.revert()` discards all
    existing cookies held by the `CookieJar` first (but it won't lose any
    existing cookies if the loading fails).

  * <span class="q">Is it threadsafe?</span>

    No.  As far as I know, you can use mechanize in threaded code, but it
    provides no synchronisation: you have to provide that yourself.

  * <span class="q">How do I do <X\></span>

    Refer to the API documentation in docstrings.


Forms
-----

  * <span class="q">Doesn't the standard Python library module, `cgi`, do this?</span>

    No: the `cgi` module does the server end of the job.  It doesn't know
    how to parse or fill in a form or how to send it back to the server.

  * <span class="q">How do I figure out what control names and values to use?</span>

    `print form` is usually all you need.  In your code, things like the
    `HTMLForm.items` attribute of `HTMLForm` instances can be useful to inspect
    forms at runtime.  Note that it's possible to use item labels instead of
    item names, which can be useful — use the `by_label` arguments to the
    various methods, and the `.get_value_by_label()` / `.set_value_by_label()`
    methods on `ListControl`.

  * <span class="q">What do those `'*'` characters mean in the string
    representations of list controls?</span>

    A `*` next to an item means that item is selected.

  * <span class="q">What do those parentheses (round brackets) mean in the string
    representations of list controls?</span>

    Parentheses `(foo)` around an item mean that item is disabled.

  * <span class="q">Why doesn't <some control\> turn up in the data returned by
    `.click*()` when that control has non-`None` value?</span>

    Either the control is disabled, or it is not successful for some other
    reason. 'Successful' (see [HTML 4
    specification](http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.2))
    means that the control will cause data to get sent to the server.

  * <span class="q">Why does mechanize not follow the HTML 4.0 / RFC 1866
    standards for `RADIO` and multiple-selection `SELECT` controls?</span>

    Because by default, it follows browser behaviour when setting the
    initially-selected items in list controls that have no items explicitly
    selected in the HTML.  Use the `select_default` argument to `ParseResponse`
    if you want to follow the RFC 1866 rules instead.  Note that browser
    behaviour violates the HTML 4.01 specification in the case of `RADIO`
    controls.

  * <span class="q">Why does `.click()`ing on a button not work for me?</span>

      * Clicking on a `RESET` button doesn't do anything, by design - this is a
        library for web automation, not an interactive browser.  Even in an
        interactive browser, clicking on `RESET` sends nothing to the server,
        so there is little point in having `.click()` do anything special here.

      * Clicking on a `BUTTON TYPE=BUTTON` doesn't do anything either, also by
        design.  This time, the reason is that that `BUTTON` is only in the
        HTML standard so that one can attach JavaScript callbacks to its
        events.  Their execution may result in information getting sent back to
        the server.  mechanize, however, knows nothing about these callbacks,
        so it can't do anything useful with a click on a `BUTTON` whose type is
        `BUTTON`.

      * Generally, JavaScript may be messing things up in all kinds of ways.
        See the answer to the next question.

  * <a name="change-value" /><span class="q">How do I change `INPUT
TYPE=HIDDEN` field values (for example, to emulate the effect of JavaScript
code)?</span>

    As with any control, set the control's `readonly` attribute false.

~~~~{.python}
form.find_control("foo").readonly = False # allow changing .value of control foo
form.set_all_readonly(False) # allow changing the .value of all controls
~~~~

  * <span class="q">I'm having trouble debugging my code.</span>

    See [here](hints.html) for few relevant tips.

  * <span class="q">I have a control containing a list of integers.  How do I
    select the one whose value is nearest to the one I want?</span>

~~~~{.python}
import bisect
def closest_int_value(form, ctrl_name, value):
    values = map(int, [item.name for item in form.find_control(ctrl_name).items])
    return str(values[bisect.bisect(values, value) - 1])

form["distance"] = [closest_int_value(form, "distance", 23)]
~~~~


General
-------

  * <a name="sniffing" /><span class="q">I want to see what my web browser is
    doing, but standard network sniffers like
    [wireshark](http://www.wireshark.org/) or netcat (nc) don't work for HTTPS.
    How do I sniff HTTPS traffic?</span>

    Three good options:

      * Mozilla plugin: [LiveHTTPHeaders](http://livehttpheaders.mozdev.org/).

      * [ieHTTPHeaders](http://www.blunck.info/iehttpheaders.html) does
        the same for MSIE.

      * Use [`lynx`](http://lynx.browser.org/) `-trace`, and filter out
        the junk with a script.

  * <a name="script" /><span class="q">JavaScript is messing up my
    web-scraping.  What do I do?</span>

    JavaScript is used in web pages for many purposes -- for example: creating
    content that was not present in the page at load time, submitting or
    filling in parts of forms in response to user actions, setting cookies,
    etc.  mechanize does not provide any support for JavaScript.

    If you come across this in a page you want to automate, you have four
    options.  Here they are, roughly in order of simplicity.

      * Figure out what the JavaScript is doing and emulate it in your Python
        code: for example, by manually adding cookies to your `CookieJar`
        instance, calling methods on `HTMLForm`s, calling `urlopen`, etc.  See
        [above](#change-value) re forms.

      * Use Java's [HtmlUnit](http://htmlunit.sourceforge.net/) or
        [HttpUnit](http://httpunit.sourceforge.net) from Jython, since they
        know some JavaScript.

      * Instead of using mechanize, automate a browser instead.  For example
        use MS Internet Explorer via its COM automation interfaces, using the
        [Python for Windows
        extensions](http://starship.python.net/crew/mhammond/), aka pywin32,
        aka win32all (e.g. [simple
        function](http://vsbabu.org/mt/archives/2003/06/13/ie_automation.html),
        [pamie](http://pamie.sourceforge.net/); [pywin32 chapter from the
        O'Reilly
        book](http://www.oreilly.com/catalog/pythonwin32/chapter/ch12.html)) or
        [ctypes](http://python.net/crew/theller/ctypes/)
        ([example](http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305273)).
        [This](http://www.brunningonline.net/simon/blog/archives/winGuiAuto.py.html)
        kind of thing may also come in useful on Windows for cases where the
        automation API is lacking.  For Firefox, there is
        [PyXPCOM](https://developer.mozilla.org/en/PyXPCOM).

      * Get ambitious and automatically delegate the work to an appropriate
        interpreter (Mozilla's JavaScript interpreter, for instance).  This is
        what HtmlUnit and httpunit do.  I did a spike along these lines some
        years ago, but I think it would (still) be quite a lot of work to do
        well.

  * <span class="q">Misc links</span>

      * <a name="parsing" />The following libraries can be useful for dealing
        with bad HTML: [lxml.html](http://codespeak.net/lxml/lxmlhtml.html),
        [html5lib](http://code.google.com/p/html5lib/), [BeautifulSoup
        3](http://www.crummy.com/software/BeautifulSoup/CHANGELOG.html),
        [mxTidy](http://www.egenix.com/files/python/mxTidy.html) and
        [mu-Tidylib](http://utidylib.berlios.de/).

      * [Selenium](http://www.openqa.org/selenium/): In-browser web functional
        testing.  If you need to test websites against real browsers, this is a
        standard way to do it.

      * O'Reilly book: [Spidering
        Hacks](http://oreilly.com/catalog/9780596005771).  Very Perl-oriented.

      * Standard extensions for web development with Firefox, which are also
        handy if you're scraping the web: [Web
        Developer](http://chrispederick.com/work/webdeveloper/) (amongst other
        things, this can display HTML form information),
        [Firebug](http://getfirebug.com/).

      * Similar functionality for IE6 and IE7: [Internet Explorer Developer
        Toolbar](http://www.google.co.uk/search?q=internet+explorer+developer+toolbar&btnI=I'm+Feeling+Lucky)
        (IE8 comes with something equivalent built-in, as does Google Chrome).

      * [Open source functional testing
        tools](http://www.opensourcetesting.org/functional.php).

      * [A HOWTO on web
        scraping](http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html) from
        Dave Kuhlman.

  * <span class="q">Will any of this code make its way into the Python standard
    library?</span>

    The request / response processing extensions to `urllib2` from mechanize
    have been merged into `urllib2` for Python 2.4.  The cookie processing has
    been added, as module `cookielib`.  There are other features that would be
    appropriate additions to `urllib2`, but since Python 2 is heading into
    bugfix-only mode, and I'm not using Python 3, they're unlikely to be added.

  * <span class="q">Where can I find out about the relevant standards?</span>

      * [HTML 4.01 Specification](http://www.w3.org/TR/html401/)

      * [Draft HTML 5 Specification](http://dev.w3.org/html5/spec/)

      * [RFC 1866](http://www.ietf.org/rfc/rfc1866.txt) - the HTML 2.0
        standard (you don't want to read this)

      * [RFC 1867](http://www.ietf.org/rfc/rfc1867.txt) - Form-based file
        upload

      * [RFC 2616](http://www.ietf.org/rfc/rfc2616.txt) - HTTP 1.1
        Specification

      * [RFC 3986](http://www.ietf.org/rfc/rfc3986.txt) - URIs

      * [RFC 3987](http://www.ietf.org/rfc/rfc3987.txt) - IRIs

</div>

<!-- Local Variables: -->
<!-- fill-column:79 -->
<!-- End: -->
