PyTables usage notes

Author: 井底蛙蛙呱呱呱 | Published 2021-01-28 14:46

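PyTables is a tool for fast storage and retrieval of large amounts of data. Like h5py, it stores data in the HDF5 format, but it offers more features. Those extra features also make the official documentation long and hard to navigate, so these are brief notes on using PyTables.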

PyTables can store many kinds of data: strings, numbers, arrays, character arrays, and variable-length arrays.

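For storing simple strings, numbers, or arrays, the code samples in the official documentation are enough. The examples here use one row read from a tab-separated file as sample data: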

import numpy as np
import os
import tables
import tables as tb
from tqdm import tqdm
import csv
import json

# In a Jupyter notebook, display every expression result in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# fpath is the path to the tab-separated input file (its value is not shown in the original)
with open(fpath, 'r') as f:
    fcsv = csv.DictReader(f, delimiter='\t')
    for row in tqdm(fcsv):
        label = int(float(row['label']))
        id_ = row['id']
        camp = row['mz_camp']
        arr_features = json.loads(row['features'])
        break

label
id_
camp
arr_features[:10]
# Sample output:
1
'f6379cff0f6a4e3c4b6b9a514d1e2df7'
'28136560~28203900~28207488~28229731~28241431~28256316~28280756~28305638~28320029~28327595~28330390~28335941~28350527~28355142~28357456~28358600~28363423~28367817~28377060~28382884~28383105~28389436~28390580~28408871~28441839~28442775~28462158~28467384~28467592~28468411~28473351~28476406~28499819~28505188~28510882~28523531~28523544~28524454~28535608~28541445~28564026~28569928~28570786~28570955~28590884~28623358~28623579~28625464~28633199~28640154~28651373~28670262~78188107~78195101'
[1, 76, 2475, 2651, 2651, 2651, 2651, 2651, 2651, 2651]

Storing the data in HDF5 format with PyTables:

# class Particle(tb.IsDescription):
#     uid = tb.StringCol(itemsize=32, pos=1)
#     label = tb.Int8Col(pos=2)
#     array_features = tb.Int32Col(shape=(len(arr_features),), pos=3)

arr_length = len(arr_features)
Particle2 = {
    "uid": tb.Col.from_kind("string", itemsize=32, pos=1),
    "label": tb.Col.from_type("int8", pos=2),
    "array_features": tb.Col.from_type("int32", shape=(arr_length,), pos=3),
}

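The code above defines the column types in two ways, a class (commented out) and a dictionary. In both:

itemsize is the maximum length, in bytes, of a string value; anything longer is truncated;
pos sets the order of the columns, which matters when appending rows in bulk as tuples;
shape specifies the dimensions of each value in the column;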
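
The three parameters above can be seen in action in a minimal sketch. The column layout, the file name demo_table.h5, and the sample values are made up for illustration and are not from the original data; the itemsize is kept deliberately small so that truncation is visible.

```python
import os
import tempfile

import tables as tb

# Hypothetical column layout, for illustration only.
description = {
    "uid": tb.StringCol(itemsize=4, pos=1),       # at most 4 bytes per value
    "label": tb.Int8Col(pos=2),
    "features": tb.Int32Col(shape=(3,), pos=3),   # each cell holds a length-3 vector
}

path = os.path.join(tempfile.mkdtemp(), "demo_table.h5")
with tb.open_file(path, mode="w") as h5:
    table = h5.create_table(h5.root, "demo", description)
    # Tuples must follow the column order fixed by `pos`: uid, label, features.
    table.append([(b"abcdefgh", 1, [1, 2, 3])])
    table.flush()
    column_names = table.colnames            # columns in `pos` order
    stored_uid = table.cols.uid[0]           # only the first 4 bytes are kept
    stored_features = table.cols.features[0]

print(column_names)
print(stored_uid)
print(stored_features.shape)
```

If truncation is not acceptable, itemsize has to be chosen from the longest value expected in the data before the table is created.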

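For array data, PyTables provides several classes:

tables.Array, created with create_array(), is the plainest way to store an array. Its shape is fixed, so it cannot be extended afterwards (for example by changing the number of columns), and it does not support compression. One interesting detail: an Array is read back as the same kind of object that was stored, so a list comes back as a list and a NumPy array comes back as a NumPy array;
tables.CArray differs from Array mainly in that it supports compression, set through the filters argument of create_carray(), e.g. filters=tables.Filters(complevel=5, complib='zlib');
tables.EArray, created with create_earray(), differs from CArray in that it is extendable: one dimension is declared with length 0, e.g. shape=(0, 3), and data can later be appended along it (currently only one dimension can be extended);
tables.VLArray, created with create_vlarray(), is very powerful: it can store rows of different lengths, such as [1,2], [1,2,3], [1,2,3,4];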
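
A minimal sketch of the four classes side by side may help. The node names, the file name demo_arrays.h5, and the values are assumptions chosen for illustration only.

```python
import os
import tempfile

import numpy as np
import tables as tb

path = os.path.join(tempfile.mkdtemp(), "demo_arrays.h5")
filters = tb.Filters(complevel=5, complib="zlib")

with tb.open_file(path, mode="w") as h5:
    # Array: fixed shape, no compression; a list is read back as a list.
    plain = h5.create_array(h5.root, "plain", [1, 2, 3])
    plain_back = plain.read()

    # CArray: fixed shape, but filters (compression) can be applied.
    carr = h5.create_carray(h5.root, "compressed", atom=tb.Int32Atom(),
                            shape=(2, 3), filters=filters)
    carr[:] = np.arange(6, dtype="int32").reshape(2, 3)

    # EArray: the dimension declared as 0 grows with each append().
    earr = h5.create_earray(h5.root, "growable", atom=tb.Int32Atom(),
                            shape=(0, 3), filters=filters)
    earr.append(np.array([[1, 2, 3]], dtype="int32"))
    earr.append(np.array([[4, 5, 6], [7, 8, 9]], dtype="int32"))
    earray_shape = tuple(earr.shape)

    # VLArray: every row may have a different length.
    vl = h5.create_vlarray(h5.root, "ragged", atom=tb.Int32Atom())
    for row in ([1, 2], [1, 2, 3], [1, 2, 3, 4]):
        vl.append(row)
    row_lengths = [len(r) for r in vl]

print(type(plain_back), earray_shape, row_lengths)
```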

For more on arrays, see the official documentation: Homogenous storage classes.

Next, we split the camp variable into a list of strings and store it in a VLArray:

vlarray = fileh.create_vlarray(root, 'mz_camp', atom=tables.ObjectAtom(),
                               title="text data.")

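Setting atom to tables.ObjectAtom() makes PyTables pickle each appended object, so the data read back is still a list of str. You can also use atom=tables.StringAtom(itemsize=32), but then the values come back as byte strings such as b"hello", and you need var.decode() to get a str.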
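
The difference between the two atoms can be sketched as follows. The node names and the file name demo_atoms.h5 are hypothetical, and camp is shortened to three tokens of the sample value. The strings are encoded to bytes explicitly before being stored with StringAtom, which is a choice made here to keep the example unambiguous.

```python
import os
import tempfile

import tables as tb

camp = "28136560~28203900~28207488"   # shortened sample value
tokens = camp.split("~")

path = os.path.join(tempfile.mkdtemp(), "demo_atoms.h5")
with tb.open_file(path, mode="w") as h5:
    # ObjectAtom pickles each appended object, so the list of str round-trips.
    as_objects = h5.create_vlarray(h5.root, "as_objects", atom=tb.ObjectAtom())
    as_objects.append(tokens)
    object_row = as_objects[0]

    # StringAtom stores fixed-width byte strings; values come back as bytes.
    as_bytes = h5.create_vlarray(h5.root, "as_bytes",
                                 atom=tb.StringAtom(itemsize=32))
    as_bytes.append([t.encode("ascii") for t in tokens])
    bytes_row = as_bytes[0]

# Decode each element to get str values back.
decoded_row = [bytes(b).decode("ascii") for b in bytes_row]

print(object_row)
print(decoded_row)
```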

The table and array types above cover essentially all the storage needs here. The complete code:

import numpy as np
import os
import tables
import tables as tb
from tqdm import tqdm
import csv
import json

# In a Jupyter notebook, display every expression result in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# fpath is the path to the tab-separated input file (its value is not shown in the original)
with open(fpath, 'r') as f:
    fcsv = csv.DictReader(f, delimiter='\t')
    for row in tqdm(fcsv):
        label = int(float(row['label']))
        id_ = row['id']
        camp = row['mz_camp']
        arr_features = json.loads(row['features'])
        break

# Store the data in HDF5 format
# (1) Open the file; to add data to an existing file, set mode to 'a'
fileh = tables.open_file("vlarray2.h5", mode="w")

# (2) Define the table layout and create the table node
# class Particle(tb.IsDescription):
#     uid = tb.StringCol(itemsize=32, pos=1)
#     label = tb.Int8Col(pos=2)
#     array_features = tb.Int32Col(shape=(len(arr_features),), pos=3)

arr_length = len(arr_features)
Particle2 = {
    "uid": tb.Col.from_kind("string", itemsize=32, pos=1),
    "label": tb.Col.from_type("int8", pos=2),
    "array_features": tb.Col.from_type("int32", shape=(arr_length,), pos=3),
}
# Get the root group
root = fileh.root
table = fileh.create_table(root, 'table', Particle2, "here id, label and array features.")

# (3) Create the node that stores the lists of strings
vlarray = fileh.create_vlarray(root, 'mz_camp',  atom=tables.ObjectAtom(),
                               title="text data.")

# (4) Load the data
# Append table rows in bulk
table.append([(id_, 1, arr_features),
              (id_, 2, arr_features),
              (id_, 3, arr_features),
              (id_, 4, arr_features),
              (id_, 5, arr_features),
              (id_, 6, arr_features)])
table.flush()

# Alternative: fill the table one row at a time
# # Create a shortcut to the table record object
# particle = table.row
# particle['uid'] = id_
# particle['label'] = label
# particle['array_features'] = arr_features
# particle.append()  # required, otherwise the row is never written

# Load the VLArray data. As far as I can tell, a VLArray takes one row per append(); a batch passed in one call is stored as a single row
vlarray.append(['1']+camp.split('~'))
vlarray.append(['2']+camp.split('~'))
vlarray.append(['3']+camp.split('~'))
vlarray.append(['4']+camp.split('~'))
vlarray.append(['5']+camp.split('~'))
vlarray.append(['6']+camp.split('~'))
vlarray.flush()

# Close the file
fileh.close()
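
The row-by-row alternative mentioned in the comments of the script above can be sketched as a complete example. The records, the vector length of 4, and the file name demo_rows.h5 are made up for illustration; a with block is used here so the file is closed even if an error occurs.

```python
import os
import tempfile

import tables as tb

description = {
    "uid": tb.StringCol(itemsize=32, pos=1),
    "label": tb.Int8Col(pos=2),
    "array_features": tb.Int32Col(shape=(4,), pos=3),
}
# Made-up records standing in for rows parsed from the input file.
records = [("id-a", 1, [1, 76, 2475, 2651]), ("id-b", 0, [2, 3, 4, 5])]

path = os.path.join(tempfile.mkdtemp(), "demo_rows.h5")
with tb.open_file(path, mode="w") as h5:
    table = h5.create_table(h5.root, "table", description)
    particle = table.row
    for uid, label, features in records:
        particle["uid"] = uid
        particle["label"] = label
        particle["array_features"] = features
        particle.append()   # queue the record; without this nothing is written
    table.flush()           # push the queued records to the file
    n_rows = table.nrows
    labels = table.cols.label[:].tolist()

print(n_rows, labels)
```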

The file can be read back with the following code:

fileh = tables.open_file("vlarray2.h5", mode="r")
root = fileh.root

# Pull values out of the table directly by index
root.table.cols.uid[1]
root.table.cols.label[1]
root.table.cols.array_features[1]
root.mz_camp[1][:5]
# Output
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
2
array([   1,   76, 2475, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2554,
       2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651,
       2651, 2651, 2651, 2577, 2651, 2651, 2651, 2651, 2651, 2651, 2651,
       2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651,
       2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651, 2651],
      dtype=int32)
['2', '28136560', '28203900', '28207488', '28229731']

# The data can also be read in a for loop
for id_, lab, arr, camp_ in zip(root.table.cols.uid, root.table.cols.label, root.table.cols.array_features, root.mz_camp):
    print(id_)
    print(lab)
    print(arr[:5])
    print(camp_[:5])
    
fileh.close()
# Output
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
1
[   1   76 2475 2651 2651]
['1', '28136560', '28203900', '28207488', '28229731']
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
2
[   1   76 2475 2651 2651]
['2', '28136560', '28203900', '28207488', '28229731']
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
3
[   1   76 2475 2651 2651]
['3', '28136560', '28203900', '28207488', '28229731']
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
4
[   1   76 2475 2651 2651]
['4', '28136560', '28203900', '28207488', '28229731']
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
5
[   1   76 2475 2651 2651]
['5', '28136560', '28203900', '28207488', '28229731']
b'f6379cff0f6a4e3c4b6b9a514d1e2df7'
6
[   1   76 2475 2651 2651]
['6', '28136560', '28203900', '28207488', '28229731']
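
Since uid values come back as bytes in the output above, here is a minimal sketch of reading a slice of rows in one call and decoding the strings. The tiny table, its values, and the file name demo_read.h5 are assumptions for illustration, not the original data.

```python
import os
import tempfile

import tables as tb

description = {
    "uid": tb.StringCol(itemsize=32, pos=1),
    "label": tb.Int8Col(pos=2),
}

path = os.path.join(tempfile.mkdtemp(), "demo_read.h5")
with tb.open_file(path, mode="w") as h5:
    table = h5.create_table(h5.root, "table", description)
    table.append([("id-a", 1), ("id-b", 2), ("id-c", 3)])
    table.flush()

with tb.open_file(path, mode="r") as h5:
    table = h5.root.table
    # Read rows 0 and 1 at once as a structured NumPy array.
    first_two = table.read(start=0, stop=2)
    # String columns are stored as bytes; decode them to get str.
    uids = [u.decode("ascii") for u in first_two["uid"]]
    labels = first_two["label"].tolist()

print(uids, labels)
```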

For more usage, see the official documentation:
PyTables Tutorials.
Official usage examples.
