Working with Binary Data Structures

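Sometimes a program needs to work with binary data, for example to convert Python objects into bytes to send over a network, or to receive bytes from a network and convert them back into Python objects. Python's struct module provides functions and conventions for converting between byte sequences and built-in data types.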

Binary data in transit raises the question of byte order. The common byte orders are big-endian and little-endian, and conversions need to take this into account.
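As a small illustration of the byte-order issue, the sketch below packs the same 32-bit integer in both orders. The value 0x12345678 is an arbitrary choice made here for illustration; the `<` and `>` prefixes are the struct byte-order characters.

```python
import struct

value = 0x12345678  # arbitrary illustrative value

little = struct.pack('<I', value)  # least significant byte first
big = struct.pack('>I', value)     # most significant byte first

print(little.hex())  # 78563412
print(big.hex())     # 12345678

# Decoding bytes with the wrong byte order silently yields a different number.
wrong = struct.unpack('>I', little)[0]
print(hex(wrong))    # 0x78563412
```

Nothing in the bytes themselves records which order was used, so sender and receiver have to agree on it in advance.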

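The struct module

The struct module provides functions for packing Python objects into bytes and unpacking bytes back into objects. It also provides a Struct class, which compiles a format string ahead of time; this is similar to how Python's re module precompiles a regular expression to improve performance.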
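A minimal sketch of the two styles, assuming a made-up format (`'<H 4s'`, a 2-byte unsigned integer followed by a 4-byte string) that is not part of the original article:

```python
import struct

fmt = '<H 4s'  # hypothetical format chosen for illustration

# Module-level function: the format string is passed on every call.
from_function = struct.pack(fmt, 7, b'abcd')

# Struct object: the format is compiled once and the object is reused.
record = struct.Struct(fmt)
from_object = record.pack(7, b'abcd')

print(from_function == from_object)  # True
print(record.size)                   # 6
```

Both calls produce identical bytes. The module-level functions also cache recently used formats, so the gain from a Struct object is likely to matter mainly in tight loops or when many different formats are in play.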

Packing

Packing converts Python objects into a byte sequence according to a format string, so that the data can be stored or sent over a network.

Unpacking

Unpacking is the reverse process: given a byte sequence and a format string, it restores the bytes to Python objects that the application can use.
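The round trip described above can be sketched with the module-level functions. The format `'<I 5s'` and the values are illustrative assumptions, not taken from the article's own example.

```python
import struct

fmt = '<I 5s'  # illustrative: 4-byte unsigned int + 5-byte string
packed = struct.pack(fmt, 100, b'hello')
print(len(packed), struct.calcsize(fmt))  # 9 9

number, text = struct.unpack(fmt, packed)
print(number, text)  # 100 b'hello'

# unpack needs exactly calcsize(fmt) bytes; any other length raises struct.error.
error_message = None
try:
    struct.unpack(fmt, packed[:-1])
except struct.error as exc:
    error_message = str(exc)
    print('unpack failed:', error_message)
```

Note that unpack always returns a tuple, even when the format describes a single value.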

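Byte order

By default, packing and unpacking use the machine's native byte order, as seen by the C compiler. To choose a byte order explicitly, put one of the following characters at the start of the format string.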

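Character	Meaning
@	native byte order, native size and alignment (default)
=	native byte order, standard size, no alignment
<	little-endian
>	big-endian
!	network byte order (big-endian)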
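The prefix affects size and alignment as well as byte order. The sketch below compares struct.calcsize for the format `I 5s 3f` under each prefix; the native result depends on the platform, so only the standard sizes are stated as fixed.

```python
import struct

fmt = 'I 5s 3f'

# Collect the packed size under each byte-order prefix.
sizes = {prefix: struct.calcsize(prefix + fmt) for prefix in '@=<>!'}
for prefix, size in sizes.items():
    print(repr(prefix), size)

# With '=', '<', '>' and '!' there is no padding: 4 + 5 + 3 * 4 = 21 bytes.
# With '@' the floats are aligned as the C compiler would align them, so the
# size is platform dependent (commonly 24, with 3 padding bytes after '5s').
```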

Example

The following example packs data using the format I 5s 3f.

import struct
import binascii

# I = unsigned int (typically 4 bytes), 5s = 5-byte string, 3f = three 4-byte floats
s = struct.Struct('I 5s 3f')
values = (100, b'hello', 1.2, 1.5, 1.7)
packed_data = s.pack(*values)  # arguments must match the format, in order
hex_packed_data = binascii.hexlify(packed_data)  # hex form for readable printing

print("original values:", values)
print("format string:", s.format)
print("size:", s.size, "bytes")
print("packed value", hex_packed_data)

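Program output. The size is 24 rather than 21 because native alignment inserts three padding bytes after the 5-byte string; on Python 3.7 and later, s.format is a str and prints without the b'' prefix: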

original values: (100, b'hello', 1.2, 1.5, 1.7)
format string: b'I 5s 3f'
size: 24 bytes
packed value b'6400000068656c6c6f0000009a99993f0000c03f9a99d93f'

Unpacking the data

data = binascii.unhexlify(hex_packed_data)
print("unpacking data:", s.unpack(data))

Output

unpacking data: (100, b'hello', 1.2000000476837158, 1.5, 1.7000000476837158)

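Note that the floating-point values have lost precision: the f format stores each value as a 4-byte single-precision float, whereas a Python float is double precision.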
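The effect can be isolated by round-tripping one value through the f (4-byte) and d (8-byte) formats. This is a small sketch; the explicit `<` prefix is used only to keep the sizes fixed.

```python
import struct

original = 1.2

# Round-trip through single precision (f) and double precision (d).
as_single = struct.unpack('<f', struct.pack('<f', original))[0]
as_double = struct.unpack('<d', struct.pack('<d', original))[0]

print(as_single)              # 1.2000000476837158
print(as_double == original)  # True

# 1.5 is exactly representable in binary, so even 'f' preserves it.
exact = struct.unpack('<f', struct.pack('<f', 1.5))[0]
print(exact)                  # 1.5
```

When the full precision of a Python float has to survive the round trip, d is the format to use, at the cost of 8 bytes per value instead of 4.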

Specifying byte order

import struct
import binascii

endianness = [
    ('@', 'native, native'),
    ('=', 'native, standard'),
    ('<', 'little-endian'),
    ('>', 'big-endian'),
    ('!', 'network'),
]

values = (1, 2, 3, b'hello')

for code, name in endianness:
    s = struct.Struct(code + '3I 5s')
    packed_data = s.pack(*values)
    print('format string:', s.format, 'for', name)
    print('size:', s.size, 'bytes')
    print('packed value:', binascii.hexlify(packed_data))
    print('unpacked value:', s.unpack(packed_data))

Output (from a little-endian machine, so the two native results match the little-endian one):

format string: b'@3I 5s' for native, native
size: 17 bytes
packed value: b'01000000020000000300000068656c6c6f'
unpacked value: (1, 2, 3, b'hello')
format string: b'=3I 5s' for native, standard
size: 17 bytes
packed value: b'01000000020000000300000068656c6c6f'
unpacked value: (1, 2, 3, b'hello')
format string: b'<3I 5s' for little-endian
size: 17 bytes
packed value: b'01000000020000000300000068656c6c6f'
unpacked value: (1, 2, 3, b'hello')
format string: b'>3I 5s' for big-endian
size: 17 bytes
packed value: b'00000001000000020000000368656c6c6f'
unpacked value: (1, 2, 3, b'hello')
format string: b'!3I 5s' for network
size: 17 bytes
packed value: b'00000001000000020000000368656c6c6f'
unpacked value: (1, 2, 3, b'hello')
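Network byte order is what a wire format would typically use. As a hedged sketch, the code below frames a payload with a small header; the header layout (`'!H I'`, a message type followed by a payload length) and the function names are hypothetical and chosen only for illustration, not taken from any real protocol.

```python
import struct

# Hypothetical header: 2-byte message type + 4-byte payload length, network order.
HEADER = struct.Struct('!H I')

def encode_frame(msg_type, payload):
    """Prefix the payload with its type and length."""
    return HEADER.pack(msg_type, len(payload)) + payload

def decode_frame(data):
    """Read the header, then slice out exactly `length` payload bytes."""
    msg_type, length = HEADER.unpack_from(data, 0)
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError('truncated frame')
    return msg_type, payload

frame = encode_frame(1, b'hello')
print(frame.hex())          # 00010000000568656c6c6f
print(decode_frame(frame))  # (1, b'hello')
```

Because `!` uses standard sizes with no alignment, the header is always 6 bytes regardless of the machine that packs or unpacks it.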

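Using buffers

If a program is performance-sensitive, it should allocate memory up front rather than dynamically, to reduce the overhead of repeated allocation and memory fragmentation. The pack_into and unpack_from methods write to and read from a preallocated buffer directly.

The example below uses two kinds of buffer: a ctypes string buffer and an array.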

import struct
import binascii
import ctypes
import array

values = (1, 2, 3, b'hello')
s = struct.Struct('@3I 5s')
string_buffer = ctypes.create_string_buffer(s.size)
array_buffer = array.array('B', b'\0' * s.size)

print('ctypes string buffer')
print('before:', binascii.hexlify(string_buffer.raw))
s.pack_into(string_buffer, 0, *values)
print('after:', binascii.hexlify(string_buffer.raw))
print('unpacked:', s.unpack_from(string_buffer, 0))
      
print('-'*20)

print('array buffer')
print('before:', binascii.hexlify(array_buffer))
s.pack_into(array_buffer, 0, *values)
print('after:', binascii.hexlify(array_buffer))
print('unpacked:', s.unpack_from(array_buffer, 0))

Output

ctypes string buffer
before: b'0000000000000000000000000000000000'
after: b'01000000020000000300000068656c6c6f'
unpacked: (1, 2, 3, b'hello')
--------------------
array buffer
before: b'0000000000000000000000000000000000'
after: b'01000000020000000300000068656c6c6f'
unpacked: (1, 2, 3, b'hello')
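The offset argument makes it possible to store several fixed-size records in one preallocated buffer. The following is a minimal sketch using a bytearray, which is a third buffer type not used in the example above; the record format and the names are illustrative assumptions.

```python
import struct

record = struct.Struct('<I 5s')  # illustrative 9-byte record: index + name
names = [b'alpha', b'bravo', b'delta']

# Allocate the whole buffer once, then fill it in place.
buffer = bytearray(record.size * len(names))
for index, name in enumerate(names):
    record.pack_into(buffer, index * record.size, index, name)

# Read a single record by offset without copying the buffer.
second = record.unpack_from(buffer, record.size)
print(second)  # (1, b'bravo')

# iter_unpack walks the buffer one record at a time.
all_records = list(record.iter_unpack(buffer))
print(all_records)  # [(0, b'alpha'), (1, b'bravo'), (2, b'delta')]
```

iter_unpack requires the buffer length to be a multiple of the record size, which holds here by construction.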

When reposting, please include this article's URL: https://allenwind.github.io/blog/5330
For more articles, see: https://allenwind.github.io/blog/archives/