When writing small tools I often use urllib2.build_opener([handler, …]). I generally use the urllib2 module in one of these two ways:

  • Method 1

    import urllib2
    
    # Log the HTTP traffic and keep cookies between requests
    http_handler_debug = urllib2.HTTPHandler(debuglevel=1)
    http_cookie_processor = urllib2.HTTPCookieProcessor()
    opener = urllib2.build_opener(http_handler_debug, http_cookie_processor)
    
    # Send an IE 8 User-agent header with every request
    client_version = 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)'
    opener.addheaders = [('User-agent', client_version)]
    
    r = opener.open('http://192.168.9.2/')
    c = r.read()
    
  • Method 2

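    # Excerpt from inside a class; assumption: self._urllib refers to the
    # urllib2 module and _debug holds the debug level
    http_handler  = self._urllib.HTTPHandler(debuglevel=_debug)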
    https_handler = self._urllib.HTTPSHandler(debuglevel=_debug)
    
    # alex added proxy function
    proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8087',
        'https': 'http://127.0.0.1:8087'})
    
    opener = self._urllib.OpenerDirector()
    opener.add_handler(http_handler)
    opener.add_handler(https_handler)
    
    # alex added proxy_handler
    opener.add_handler(proxy_handler)
    
    r = opener.open('http://192.168.9.2/')
    c = r.read()
    

Over time, two questions formed in my mind:

  1. How is the order of the handlers in urllib2.build_opener([handler, …]) determined?
  2. When writing code, do the add_handler() calls have to be made in a particular order?

"With the source code in front of you, there are no secrets", so let's start exploring urllib2.py.

  • Open urllib2.py and find class OpenerDirector.

    Its add_handler(self, handler) method contains this snippet:

    if added:
        # the handlers must work in an specific order, the order
        # is specified in a Handler attribute
        bisect.insort(self.handlers, handler)
    

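    Note the two comment lines. They say, roughly, that the handlers must work in a specific order, and that this order is specified in an attribute of the handler.

    Note: newer versions of the urllib2.py source no longer contain those two comment lines.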

  • Next, look at the source of the individual handlers in urllib2.py. The attribute in question turns out to be handler_order.

    class BaseHandler:
        handler_order = 500
    
    class HTTPErrorProcessor(BaseHandler):
        """Process HTTP error responses."""
        handler_order = 1000  # after all other processing
    
    class ProxyHandler(BaseHandler):
        # Proxies must be in front
        handler_order = 100
    
    class HTTPDigestAuthHandler(BaseHandler, AbstractDigestAuthHandler):
        """An authentication protocol defined by RFC 2069
    
        Digest authentication improves on basic authentication because it
        does not transmit passwords in the clear.
        """
    
        auth_header = 'Authorization'
        handler_order = 490  # before Basic auth
    
    class ProxyDigestAuthHandler(BaseHandler, AbstractDigestAuthHandler):
    
        auth_header = 'Proxy-Authorization'
        handler_order = 490  # before Basic auth
    
  • Here is the complete definition of BaseHandler:

    class BaseHandler:
        handler_order = 500
    
        def add_parent(self, parent):
            self.parent = parent
    
        def close(self):
            # Only exists for backwards compatibility
            pass
    
        def __lt__(self, other):
            if not hasattr(other, "handler_order"):
                # Try to preserve the old behavior of having custom classes
                # inserted after default ones (works only for custom user
                # classes which are not aware of handler_order).
                return True
            return self.handler_order < other.handler_order
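
The ordering mechanism above can be sketched without urllib2 at all. In the minimal example below, SketchHandler is a hypothetical stand-in class (it is not part of urllib2) that copies only the handler_order attribute and the __lt__ logic; the instance names and order values are illustrative.

```python
import bisect


class SketchHandler(object):
    """Hypothetical stand-in for a urllib2 handler; not part of urllib2."""
    handler_order = 500  # same default as BaseHandler

    def __init__(self, name, order=None):
        self.name = name
        if order is not None:
            self.handler_order = order  # override the class default

    def __lt__(self, other):
        # Mirrors BaseHandler.__lt__: objects without handler_order sort last.
        if not hasattr(other, "handler_order"):
            return True
        return self.handler_order < other.handler_order


handlers = []
for h in (SketchHandler("errors", 1000),
          SketchHandler("http"),
          SketchHandler("proxy", 100),
          SketchHandler("https")):
    # Same call as in OpenerDirector.add_handler: the list stays sorted
    # after every insertion.
    bisect.insort(handlers, h)

names = [h.name for h in handlers]
print(names)  # ['proxy', 'http', 'https', 'errors']
```

"https" lands after "http" even though both use the default of 500, because bisect.insort places a new item after existing items that compare equal.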
    

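With that, the answers to the two questions follow naturally:

  • Each handler determines its own position. Every handler class has a handler_order attribute (500 by default), and OpenerDirector.add_handler() uses bisect.insort to keep the handler list sorted by it, lowest value first.
  • When writing code you generally do not need to worry about the order in which handlers are added. Handlers with the same handler_order simply keep the order in which they were added.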
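
To check the second answer against the real library, the sketch below adds the same three handlers to two OpenerDirector instances in opposite orders and compares the results. The try/except import is an assumption made for portability (urllib.request is the Python 3 successor of urllib2), handler_names is a helper introduced here, and the proxy address is the one from the second usage example above. No request is sent.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 successor of urllib2


def handler_names(opener):
    # OpenerDirector keeps its sorted handlers in the `handlers` list.
    return [type(h).__name__ for h in opener.handlers]


def make_handlers():
    # Fresh instances for each opener; nothing is contacted here.
    return [urllib2.HTTPHandler(),
            urllib2.ProxyHandler({'http': 'http://127.0.0.1:8087'}),
            urllib2.HTTPErrorProcessor()]


first = urllib2.OpenerDirector()
for h in make_handlers():
    first.add_handler(h)

second = urllib2.OpenerDirector()
for h in reversed(make_handlers()):  # opposite insertion order
    second.add_handler(h)

print(handler_names(first))
print(handler_names(second))

# build_opener() goes through add_handler() as well, so its handler
# list should also be sorted by handler_order.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
orders = [h.handler_order for h in opener.handlers]
print(orders == sorted(orders))
```

Both openers should list ProxyHandler (100) first, then HTTPHandler (500), then HTTPErrorProcessor (1000), regardless of the order of the add_handler() calls.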