Part 3: CPython Implementation Internals: String Objects (Part 2)

Author: 铁甲万能狗 | Published 2020-07-15 15:15


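In the previous article we walked through ascii_decode, the first step of string object initialization. We noted that when ascii_decode does nothing with the C-level character pointer (char*) it is given, so that the input is not fully consumed, unicode_decode_utf8 goes on to call _PyUnicodeWriter_InitWithBuffer. The relevant fragment of unicode_decode_utf8 is shown below: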

static PyObject *
unicode_decode_utf8(const char *s, Py_ssize_t size,
                    _Py_error_handler error_handler, const char *errors,
                    Py_ssize_t *consumed)
{
    ...
    s += ascii_decode(s, end, PyUnicode_1BYTE_DATA(u));
    if (s == end) {
        return u;
    }
    // Use _PyUnicodeWriter after fast path is failed.
    _PyUnicodeWriter writer;
    _PyUnicodeWriter_InitWithBuffer(&writer, u);
    writer.pos = s - starts;

    Py_ssize_t startinpos, endinpos;
    const char *errmsg = "";
    PyObject *error_handler_obj = NULL;
    PyObject *exc = NULL;
    ...
}
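The fast-path test in the fragment above can be tried in isolation. What follows is a simplified sketch, not CPython's code: ascii_prefix_len and needs_slow_path are invented names, and the scan is byte by byte, whereas the real ascii_decode also copies into the destination buffer and reads aligned words.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of leading bytes in [start, end) that are
   pure ASCII (high bit clear). Byte-by-byte for clarity only. */
static size_t
ascii_prefix_len(const char *start, const char *end)
{
    const char *p = start;
    assert(start <= end);
    while (p < end && ((unsigned char)*p & 0x80) == 0) {
        p++;
    }
    return (size_t)(p - start);
}

/* Hypothetical helper mirroring the `s += ...; if (s == end)` test:
   returns 0 when the ASCII scan consumed everything, 1 otherwise. */
static int
needs_slow_path(const char *s, size_t size)
{
    const char *end = s + size;
    s += ascii_prefix_len(s, end);
    return s != end;
}
```

For a pure ASCII input such as "hello", needs_slow_path returns 0 and the decoder could return immediately. For an input that starts with a Chinese character, the scan stops at the first byte and the writer-based path takes over.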

Recall the call-flow diagram of this function.

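After ascii_decode returns, unicode_decode_utf8 holds a PyASCIIObject whose memory layout is shown in the figure below. Ignore the concrete addresses in the figure: heap addresses differ on every run of the interpreter.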

A pointer (reference) to this PyASCIIObject is passed as the second argument to _PyUnicodeWriter_InitWithBuffer for further processing.

The _PyUnicodeWriter interface

Before calling _PyUnicodeWriter_InitWithBuffer, unicode_decode_utf8 declares a variable of type _PyUnicodeWriter and passes its address to _PyUnicodeWriter_InitWithBuffer. So what role does _PyUnicodeWriter play in initializing a PyUnicode object? If you are curious, it is worth tracing a bit of its history.

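PEP 393, drafted in 2010 and shipped with Python 3.3, introduced a completely new Unicode implementation (the Python str type) that is still in use today. The first implementation of PEP 393 relied heavily on 32-bit character buffers (Py_UCS4), which consumed a lot of memory and required expensive conversions to 8-bit (Py_UCS1: ASCII and Latin-1) or 16-bit (Py_UCS2: BMP) characters. The internal structures behind Unicode strings in current CPython 3.x are quite complex, and building a new string without storing redundant copies takes care, so memory use has to be budgeted carefully. The _PyUnicodeWriter API reduces expensive memory copies and, in the best case, avoids them entirely.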

Below is the source of the _PyUnicodeWriter struct; the fields are largely self-explanatory.

/* --- _PyUnicodeWriter API ----------------------------------------------- */

typedef struct {
    // object allocated by PyUnicode_New
    PyObject *buffer; 
    void *data;
    enum PyUnicode_Kind kind;
    Py_UCS4 maxchar;
    Py_ssize_t size;
    Py_ssize_t pos;

    /* minimum number of allocated characters (default: 0) */
    Py_ssize_t min_length;

    /* minimum character (default: 127, ASCII) */
    Py_UCS4 min_char;

    /* If non-zero, overallocate the buffer (default: 0). */
    unsigned char overallocate;

    /* If readonly is 1, buffer is a shared string (cannot be modified)
       and size is set to 0. */
    unsigned char readonly;
} _PyUnicodeWriter ;

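Objects/unicodeobject.c makes heavy use of the family of functions prefixed with _PyUnicodeWriter_. The one covered here, _PyUnicodeWriter_InitWithBuffer, is the inline function involved in string object initialization, and its real work is done by another inline function, _PyUnicodeWriter_Update. Because both are declared static inline, the compiler will normally expand them into the body of unicode_decode_utf8, so in practice they incur no call-stack push/pop overhead at runtime.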

static inline void
_PyUnicodeWriter_Update(_PyUnicodeWriter *writer)
{
    writer->maxchar = PyUnicode_MAX_CHAR_VALUE(writer->buffer);
    writer->data = PyUnicode_DATA(writer->buffer);

    if (!writer->readonly) {
        writer->kind = PyUnicode_KIND(writer->buffer);
        writer->size = PyUnicode_GET_LENGTH(writer->buffer);
    }
    else {
        /* use a value smaller than PyUnicode_1BYTE_KIND() so
           _PyUnicodeWriter_PrepareKind() will copy the buffer. */
        writer->kind = PyUnicode_WCHAR_KIND;
        assert(writer->kind <= PyUnicode_1BYTE_KIND);

        /* Copy-on-write mode: set buffer size to 0 so
         * _PyUnicodeWriter_Prepare() will copy (and enlarge) the buffer on
         * next write. */
        writer->size = 0;
    }
}

// Initialize _PyUnicodeWriter with initial buffer
static inline void
_PyUnicodeWriter_InitWithBuffer(_PyUnicodeWriter *writer, PyObject *buffer)
{
    // zero every field of writer
    memset(writer, 0, sizeof(*writer));
    writer->buffer = buffer;
    _PyUnicodeWriter_Update(writer);
    writer->min_length = writer->size;
}
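To make the resulting writer state concrete, here is a toy model of the two functions above. It is only a sketch under stated assumptions: ToyStr and ToyWriter are invented stand-ins for the real PyObject and _PyUnicodeWriter, the object layout is not CPython's, and only the non-readonly branch of _PyUnicodeWriter_Update is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a compact ASCII string object (NOT the real layout). */
typedef struct {
    ptrdiff_t length;      /* number of characters */
    int kind;              /* 1, 2 or 4 bytes per character */
    unsigned int maxchar;  /* largest code point the buffer can hold */
    char data[32];         /* character storage */
} ToyStr;

/* Toy stand-in for _PyUnicodeWriter, reusing its field names. */
typedef struct {
    ToyStr *buffer;
    void *data;
    int kind;
    unsigned int maxchar;
    ptrdiff_t size;
    ptrdiff_t pos;
    ptrdiff_t min_length;
    unsigned char overallocate;
    unsigned char readonly;
} ToyWriter;

/* Same shape as _PyUnicodeWriter_InitWithBuffer: zero everything,
   attach the buffer, then copy its properties into the writer. */
static void
toy_writer_init_with_buffer(ToyWriter *writer, ToyStr *buffer)
{
    memset(writer, 0, sizeof(*writer));
    writer->buffer = buffer;
    writer->maxchar = buffer->maxchar;
    writer->data = buffer->data;
    writer->kind = buffer->kind;
    writer->size = buffer->length;
    writer->min_length = writer->size;
}
```

With a 30-character buffer, the toy writer ends up with size and min_length both equal to 30, while pos, overallocate and readonly stay 0 from the memset. In the real unicode_decode_utf8, pos is then set by the caller with writer.pos = s - starts.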

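Continuing the example from the previous article, once _PyUnicodeWriter_InitWithBuffer has run, the _PyUnicodeWriter and PyASCIIObject objects are in the memory state shown in the figure below.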

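Notice anything odd? Why does the void pointer in the data field of the _PyUnicodeWriter not point at the boundary between the PyASCIIObject header and the 30-byte green region, that is, at (PyASCIIObject*)+1, but at 4 bytes before that boundary?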

Think back to where unicode_decode_utf8 called ascii_decode. Did you spot this?

static Py_ssize_t
ascii_decode(const char *start, const char *end, Py_UCS1 *dest)
{
   const char *p = start;
   const char *aligned_end = (const char *) _Py_ALIGN_DOWN(end, SIZEOF_LONG);
    ....
}

The key is the _Py_ALIGN_DOWN macro, which rounds the pointer p down to the nearest a-aligned address that is not greater than p. For now, do not dig into why CPython does this.

/* Round pointer "p" down to the closest "a"-aligned address <= "p". */
#define _Py_ALIGN_DOWN(p, a) ((void *)((uintptr_t)(p) & ~(uintptr_t)((a) - 1)))
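The bit trick in _Py_ALIGN_DOWN is easy to check on plain integers. The sketch below uses an invented function name, align_down, and applies the same mask to a uintptr_t value instead of a pointer. It assumes, as the macro does, that the alignment is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as _Py_ALIGN_DOWN, on plain integers so it can be
   tried without real pointers. `a` must be a power of two: then a - 1
   is a mask of the low bits, and clearing them rounds p down. */
static uintptr_t
align_down(uintptr_t p, uintptr_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return p & ~(a - 1);
}
```

With an alignment of 8, every address from 2081478552 through 2081478559 maps to 2081478552, and 2081478551 maps to the previous boundary, 2081478544.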

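Recall the UTF-8 byte layout diagram from the end of the previous article, drawn from our first run of the compiled interpreter. The end pointer points at the closing double quote ("), at memory address 208147855 (as printed in the original; a trailing digit appears to be missing). After _Py_ALIGN_DOWN is applied, we get the address boundary closest to end that satisfies SIZEOF_LONG alignment (8 bytes here): 2081478552, which is indeed a multiple of 8. Note also that the span from start to aligned_end is exactly 24 bytes.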

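So what is the point of all this? CPython decodes byte sequences as UTF-8 by default, and our example consists entirely of Chinese characters. In UTF-8 each of these characters is stored as 3 consecutive bytes and must be read back 3 bytes at a time. As CPython reads the characters one by one, it advances a pointer toward higher addresses, and that pointer always lands on the first byte of a character, at an offset that is a multiple of 3. On x86_64, however, memory is aligned on 8-byte boundaries. CPython therefore needs a decoding scheme that respects both the encoded width of the characters (a multiple of 3 in this example) and 8-byte alignment.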
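The 3-bytes-per-character claim can be checked with a small lead-byte classifier. This is a sketch, not CPython's decoder: utf8_seq_len and utf8_count_chars are invented names, and only lead bytes are inspected, so continuation bytes are not validated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence that starts
   with lead byte `b`, or 0 if `b` cannot start a sequence. */
static int
utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;                /* 0xxxxxxx: ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2;  /* 110xxxxx */
    if (b >= 0xE0 && b <= 0xEF) return 3;  /* 1110xxxx: most CJK */
    if (b >= 0xF0 && b <= 0xF4) return 4;  /* 11110xxx */
    return 0;  /* continuation byte or invalid lead byte */
}

/* Count code points in [s, s + size) by hopping from lead byte to lead
   byte. Returns -1 on an invalid lead byte or a truncated sequence. */
static ptrdiff_t
utf8_count_chars(const char *s, size_t size)
{
    size_t i = 0;
    ptrdiff_t n = 0;
    while (i < size) {
        int len = utf8_seq_len((unsigned char)s[i]);
        if (len == 0 || (size_t)len > size - i) {
            return -1;
        }
        i += (size_t)len;
        n++;
    }
    return n;
}
```

Calling utf8_count_chars on the 6 bytes "\xe4\xb8\xad\xe6\x96\x87" (the two characters 中文) yields 2, so each character occupies 3 bytes. By the same arithmetic, 24 bytes is a multiple of both 3 and 8, which is why a 24-byte span can hold a whole number of such characters and a whole number of 8-byte words.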

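Back to the main topic. The whole _PyUnicodeWriter_ family relies heavily on the macros below, so understanding them is essential for following the details of string object initialization and of the various string operations.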

/* Fast check to determine whether an object is ready. Equivalent to
   PyUnicode_IS_COMPACT(op) || ((PyUnicodeObject*)(op))->data.any) */

#define PyUnicode_IS_READY(op) (((PyASCIIObject*)op)->state.ready)

#define PyUnicode_Check(op) \
                 PyType_FastSubclass(Py_TYPE(op), Py_TPFLAGS_UNICODE_SUBCLASS)

/* Return true if the string is compact or 0 if not.
   No type checks or Ready calls are performed. */
#define PyUnicode_IS_COMPACT(op) \
    (((PyASCIIObject*)(op))->state.compact)


/* Return a void pointer to the raw unicode buffer. */
#define _PyUnicode_COMPACT_DATA(op)                     \
    (PyUnicode_IS_ASCII(op) ?                   \
     ((void*)((PyASCIIObject*)(op) + 1)) :              \
     ((void*)((PyCompactUnicodeObject*)(op) + 1)))

#define _PyUnicode_NONCOMPACT_DATA(op)                  \
    (assert(((PyUnicodeObject*)(op))->data.any),        \
     ((((PyUnicodeObject *)(op))->data.any)))


/* Return one of the PyUnicode_*_KIND values defined above. */
#define PyUnicode_KIND(op) \
    (assert(PyUnicode_Check(op)), \
     assert(PyUnicode_IS_READY(op)),            \
     ((PyASCIIObject *)(op))->state.kind)


/* Returns the length of the unicode string. The caller has to make sure that
   the string has it's canonical representation set before calling
   this macro.  Call PyUnicode_(FAST_)Ready to ensure that. */
#define PyUnicode_GET_LENGTH(op)                \
    (assert(PyUnicode_Check(op)),               \
     assert(PyUnicode_IS_READY(op)),            \
     ((PyASCIIObject *)(op))->length)




#define PyUnicode_DATA(op) \
    (assert(PyUnicode_Check(op)), \
     PyUnicode_IS_COMPACT(op) ? _PyUnicode_COMPACT_DATA(op) :   \
     _PyUnicode_NONCOMPACT_DATA(op))

// located in Include/object.h
static inline int
PyType_HasFeature(PyTypeObject *type, unsigned long feature) {
    return ((PyType_GetFlags(type) & feature) != 0);
}

// located in Include/object.h
#define PyType_FastSubclass(type, flag) PyType_HasFeature(type, flag)


/* Return a maximum character value which is suitable for creating another
   string based on op.  This is always an approximation but more efficient
   than iterating over the string. */
#define PyUnicode_MAX_CHAR_VALUE(op) \
    (assert(PyUnicode_IS_READY(op)),                                    \
     (PyUnicode_IS_ASCII(op) ?                                          \
      (0x7f) :                                                          \
      (PyUnicode_KIND(op) == PyUnicode_1BYTE_KIND ?                     \
       (0xffU) :                                                        \
       (PyUnicode_KIND(op) == PyUnicode_2BYTE_KIND ?                    \
        (0xffffU) :                                                     \
        (0x10ffffU)))))
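Two of the macros above are worth trying on toy data: the `(Type *)(op) + 1` pointer trick in _PyUnicode_COMPACT_DATA and the decision ladder in PyUnicode_MAX_CHAR_VALUE. The sketch below uses invented names (ToyHeader, toy_compact_data, toy_max_char_value), and ToyHeader is an assumption for illustration that does not match the real PyASCIIObject fields or size.

```c
#include <assert.h>
#include <stddef.h>

/* Toy header type; the real PyASCIIObject has different fields and size. */
typedef struct {
    ptrdiff_t length;
    unsigned int kind;
    unsigned int ascii;
} ToyHeader;

/* Same trick as _PyUnicode_COMPACT_DATA: adding 1 to a header pointer
   advances by sizeof(ToyHeader) bytes, so the result points at the
   first byte after the header, where the characters would be stored. */
static void *
toy_compact_data(void *op)
{
    return (void *)((ToyHeader *)op + 1);
}

/* Same decision ladder as PyUnicode_MAX_CHAR_VALUE, on plain integers:
   the answer is an upper bound derived from the storage kind, not the
   largest character actually present in the string. */
static unsigned int
toy_max_char_value(int is_ascii, int kind)
{
    if (is_ascii) return 0x7fU;
    if (kind == 1) return 0xffU;
    if (kind == 2) return 0xffffU;
    return 0x10ffffU;
}
```

For an ASCII string the ladder returns 0x7f whatever the kind, which matches the writer's maxchar right after _PyUnicodeWriter_InitWithBuffer when the buffer was created as an ASCII object.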



To be continued...
