lxml.objectify

Author: 血刃飘香 | Published: 2018-10-09 22:19


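Adapted and translated from: https://lxml.de/objectify.html
lxml.objectify is mainly intended for data-centric documents: it infers a data type automatically from the content of each leaf node. The module still uses the ElementTree of lxml.etree, but elements fall into two kinds: structural elements (Tree Elements) and data elements (Data Elements).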

Basics

The module inherits part of the etree API, for example:

>>> from lxml import objectify, etree
>>> from io import BytesIO

>>> root = objectify.Element("root")
>>> a = objectify.SubElement(root, "a")

>>> root = objectify.fromstring("<root><a>test</a><b>11</b></root>")

>>> doc = objectify.parse(BytesIO(b"<root><a>test</a><b>11</b></root>"))
>>> root = doc.getroot()

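Note that this module's parse function still returns an ElementTree from lxml.etree, but the elements are of two kinds: structural elements, whose default type is objectify.ObjectifiedElement, and data elements, whose types are objectify.IntElement, objectify.StringElement, and so on.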

>>> from lxml import objectify
>>> from io import BytesIO
>>> doc = objectify.parse(BytesIO(b"<root><a>test</a><b>5</b></root>"))
>>> type(doc)
<class 'lxml.etree._ElementTree'>
>>> root = doc.getroot()
>>> type(root)
<class 'lxml.objectify.ObjectifiedElement'>
>>> type(root.a), type(root.b)
(<class 'lxml.objectify.StringElement'>, <class 'lxml.objectify.IntElement'>)
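
To make the type inference concrete, here is a minimal sketch that parses a small document and records the element class and the Python value chosen for each leaf. The sample document and its tag names (i, f, b, s) are made up for illustration.

```python
from lxml import objectify

# Hypothetical sample document: one leaf per kind of content.
root = objectify.fromstring(
    "<root><i>5</i><f>1.5</f><b>true</b><s>hello</s></root>"
)

# Map each child's tag to the class objectify picked for it ...
kinds = {child.tag: type(child).__name__ for child in root.iterchildren()}
# ... and to the Python value exposed through .pyval.
values = {child.tag: child.pyval for child in root.iterchildren()}

print(kinds)
print(values)
```

Only the leaf text decides the class here, because the sample carries no type annotations.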

Child elements can be accessed directly with the '.' syntax.

>>> root = objectify.Element("root")
>>> b1 = objectify.SubElement(root, "b")
>>> print(root.b.tag)
b

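Note that when several children share the same tag name, the value returned by the '.' syntax can also be treated as the sequence of children with that tag. This is because indexing an objectify element with [n] looks up its siblings, and iteration likewise runs over the siblings with the same tag name: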

>>> b2 = objectify.SubElement(root, "b")
>>> root.b[0] is b1, root.b[1] is b2
(True, True)
>>> for b in root.b: print(b.tag)
b
b

Pay particular attention to the following usage:

# Iterating over an element itself
>>> for b in b1: print(b.tag)
b
b

The source implementation (Cython) shows why:

cdef class ObjectifiedElement(ElementBase):
    def __iter__(self):
        u"""Iterate over self and all siblings with the same tag.
        """
        parent = self.getparent()
        if parent is None:
            return iter([self])
        return etree.ElementChildIterator(parent, tag=self.tag)

    def __len__(self):
        u"""Count self and siblings with the same tag.
        """
        return _countSiblings(self._c_node)

    def __getattr__(self, tag):
        u"""Return the (first) child with the given tag name.  If no namespace
        is provided, the child will be looked up in the same one as self.
        """
        if is_special_method(tag):
            return object.__getattr__(self, tag)
        return _lookupChildOrRaise(self, tag)

    def __getitem__(self, key):
        u"""Return a sibling, counting from the first child of the parent.  The
        method behaves like both a dict and a sequence.
        * If argument is an integer, returns the sibling at that position.
        * If argument is a string, does the same as getattr().  This can be
        used to provide namespaces for element lookup, or to look up
        children with special names (``text`` etc.).
        * If argument is a slice object, returns the matching slice.
        """
        cdef tree.xmlNode* c_self_node
        cdef tree.xmlNode* c_parent
        cdef tree.xmlNode* c_node
        cdef Py_ssize_t c_index
        if python._isString(key):
            return _lookupChildOrRaise(self, key)
        elif isinstance(key, slice):
            return list(self)[key]
        # normal item access
        c_index = key   # raises TypeError if necessary
        c_self_node = self._c_node
        c_parent = c_self_node.parent
        if c_parent is NULL:
            if c_index == 0:
                return self
            else:
                raise IndexError, unicode(key)
        if c_index < 0:
            c_node = c_parent.last
        else:
            c_node = c_parent.children
        c_node = _findFollowingSibling(
            c_node, tree._getNs(c_self_node), c_self_node.name, c_index)
        if c_node is NULL:
            raise IndexError, unicode(key)
        return elementFactory(self._doc, c_node)
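
The branches of __getitem__ in the excerpt above (integer, slice and string keys) can be exercised with a short sketch. The sample document is illustrative, and the variable names are not part of lxml.

```python
from lxml import objectify

root = objectify.fromstring(
    "<root><b>10</b><b>11</b><a>test</a><b>12</b></root>"
)
first_b = root.b

# Integer key: position among the same-tag siblings, not among all children.
third = first_b[2].pyval
# Negative integer key: counted from the last same-tag sibling.
last = first_b[-1].pyval

# Slice key: a plain list of the matching siblings.
tail = [el.pyval for el in first_b[1:]]

# String key: child lookup, equivalent to attribute access.
same_child = root["a"].text == root.a.text

# An index past the last same-tag sibling raises IndexError.
try:
    first_b[3]
    out_of_range = False
except IndexError:
    out_of_range = True

print(third, last, tail, same_child, out_of_range)
```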

Iterating over root.X yields the children of root whose tag name is X. To visit all children of root, use the iterchildren() or getchildren() method.

>>> root = objectify.fromstring("<root><b>10</b><b>11</b><a>test</a><b>12</b></root>")
>>> [el.tag for el in root.b]
['b', 'b', 'b']
>>> [el.tag for el in root.iterchildren()]
['b', 'b', 'a', 'b']
>>> root.index(root.b[0]), root.index(root.b[1]), root.index(root.b[2])
(0, 1, 3)

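Similarly, len(elt) returns the number of siblings of elt that share its tag (including elt itself), while elt.countchildren() returns the number of children of elt.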
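
The difference between len() and countchildren() is easy to get wrong, so a small sketch may help. It reuses the sample document from the previous session; the variable names are only for illustration.

```python
from lxml import objectify

root = objectify.fromstring(
    "<root><b>10</b><b>11</b><a>test</a><b>12</b></root>"
)

b_siblings = len(root.b)                 # the three <b> elements share a tag
a_siblings = len(root.a)                 # <a> has no same-tag sibling
root_children = root.countchildren()     # every child of root, whatever its tag
leaf_children = root.b.countchildren()   # a leaf element has no children

print(b_siblings, a_siblings, root_children, leaf_children)
```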

Miscellaneous

Element attributes are still read and written with the get and set methods.

>>> root.set('myattr', 'someval')
>>> root.get('myattr')
'someval'

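Assigning with the '.' syntax also adds a child element. The assigned element (subtree) is automatically deep-copied, and the tag of the subtree's root is changed to the given name: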

>>> el = objectify.Element('other')
>>> root.c = el
>>> root.c.tag
'c'
>>> el.tag
'other'
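
One consequence of the deep copy is that later changes to the stored child do not affect the element that was assigned. The following sketch illustrates this; the child name x and the values are arbitrary.

```python
from lxml import objectify

root = objectify.Element("root")
el = objectify.Element("other")
el.x = 1

root.c = el      # stores a renamed deep copy, not el itself
root.c.x = 2     # changes only the copy that lives inside root

original_value = el.x.pyval
copied_value = root.c.x.pyval
original_parent = el.getparent()   # el was never attached to root

print(original_value, copied_value, original_parent)
```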

A list can be assigned as well:

>>> root.c = [ objectify.Element("c"), objectify.Element("c") ]
>>> [el.tag for el in root.c]
['c', 'c']

Note that assigning a number, a string, or a similar value creates a Data Element and adds it to the tree:

>>> root.d = 1
>>> root.d
1
>>> type(root.d)
<class 'lxml.objectify.IntElement'>

For a Data Element, the .pyval attribute gives the corresponding Python value, while the .text attribute is still a string.

>>> root.d.pyval
1
>>> root.d.text
'1'
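
Numeric data elements can also be used directly in arithmetic, which is often more convenient than reading .pyval first. A minimal sketch, with made-up child names (qty, ratio, label):

```python
from lxml import objectify

root = objectify.Element("root")
root.qty = 3          # becomes an IntElement
root.ratio = 1.5      # becomes a FloatElement
root.label = "abc"    # becomes a StringElement

total = root.qty + 1          # arithmetic works on the element itself
scaled = root.ratio * 2
qty_text = root.qty.text      # the underlying XML text is still a string
label_value = root.label.pyval

print(total, scaled, repr(qty_text), repr(label_value))
```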

objectify provides the DataElement() factory function for creating data elements. The tag of the resulting element defaults to 'value'.

>>> el = objectify.DataElement(5, _pytype="int")
>>> el.pyval
5
>>> el.tag
'value'
>>> root.e = objectify.DataElement(5, _pytype="int")
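
Combining this with the assignment rule described earlier, the element stored in the tree takes the name it was assigned to, while the original data element keeps its default tag. A short sketch of that interaction:

```python
from lxml import objectify

root = objectify.Element("root")
el = objectify.DataElement(5, _pytype="int")
root.e = el                      # a renamed copy is added to root

default_tag = el.tag             # the standalone element keeps the default tag
stored_tag = root.e.tag          # the copy in the tree is named after the attribute
stored_value = root.e.pyval

print(default_tag, stored_tag, stored_value)
```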

Type annotations

Some functions (such as the Element() factory) automatically add namespace declarations and type annotations when creating an element:

>>> a = objectify.Element('a')
>>> etree.tostring(a)
b'<a xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" py:pytype="TREE"/>'

Others (such as fromstring) do not add these annotations automatically:

>>> a = objectify.fromstring("<a>test</a>")
>>> etree.tostring(a)
b'<a>test</a>'

However, when the source data carries type annotations, fromstring uses them when parsing the data:

>>> a = objectify.fromstring('<a xmlns:py="http://codespeak.net/lxml/objectify/pytype" py:pytype="str">5</a>')
>>> a
'5'
>>> etree.tostring(a)
b'<a xmlns:py="http://codespeak.net/lxml/objectify/pytype" py:pytype="str">5</a>'

The annotate and deannotate functions add and remove these annotations:

>>> objectify.deannotate(a, cleanup_namespaces=True)
>>> etree.tostring(a)
b'<a>5</a>'
>>> type(a)
<class 'lxml.objectify.StringElement'>
>>> a.attrib
{}
>>> objectify.annotate(a)
>>> etree.tostring(a)
b'<a xmlns:py="http://codespeak.net/lxml/objectify/pytype" py:pytype="int">5</a>'
>>> type(a)         # a is still a StringElement
<class 'lxml.objectify.StringElement'>
>>> a.pyval
'5'
>>> a.attrib        # but a's attrib dict has changed
{'{http://codespeak.net/lxml/objectify/pytype}pytype': 'int'}

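Note that deannotate does not remove namespace declarations by default, so cleanup_namespaces must be set to True. Also note what may be a bug here: annotate does not take the element's existing class into account when adding the annotation, so the annotation ends up disagreeing with the class of the element itself.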
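
One way to examine the mismatch is to serialize the re-annotated element and parse it again. The sketch below repeats the session above and then re-parses the result. A possible explanation, stated here as an assumption rather than verified against the lxml source, is that the Python class is chosen when the element object is first created and is not revisited when the annotation changes later. PYTYPE_ATTR is a local helper name, not an lxml constant.

```python
from lxml import etree, objectify

# Local helper name for the annotation attribute shown in the session above.
PYTYPE_ATTR = "{http://codespeak.net/lxml/objectify/pytype}pytype"

a = objectify.fromstring(
    '<a xmlns:py="http://codespeak.net/lxml/objectify/pytype" '
    'py:pytype="str">5</a>'
)
objectify.deannotate(a, cleanup_namespaces=True)
objectify.annotate(a)

annotation = a.get(PYTYPE_ATTR)        # annotation inferred from the text
live_class = type(a).__name__          # class of the already existing object

# Round-trip through XML so that a fresh element object is built.
reparsed = objectify.fromstring(etree.tostring(a))
reparsed_class = type(reparsed).__name__

print(annotation, live_class, reparsed_class, reparsed.pyval)
```

After the round trip the class and the annotation agree, so the mismatch only concerns the element object that was already alive when annotate ran.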
