Fast vocabulary counting, alternating between two iterators, enumerating all subsets, human-readable byte units, and sliding windows

Quickly building a character/word frequency table

import itertools
import collections

chars = collections.Counter(itertools.chain(*X))

When the dataset is small, we usually build the character/word frequency table with code like the following:

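import itertools
import collections

# X is a placeholder for your own list of texts. If each text is a string,
# chain() yields single characters; if each text is a list of tokens, it yields words.
X = [text1, text2, ..., textn]
words = collections.Counter(itertools.chain(*X))
print(words.most_common(20))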

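When the dataset is very large, the code above becomes too slow. A multiprocess Counter implementation that builds the frequency table in parallel over large datasets is available in the project count-in-parallel.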
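To illustrate the general idea, here is a minimal sketch of multiprocess counting with the standard library. It is not the code from count-in-parallel; the names `count_chunk`, `chunked`, and `count_parallel`, as well as the `processes` and `chunk_size` parameters, are assumptions made for this example. Each worker counts one chunk of texts, and the parent process merges the partial Counters.

```python
import collections
import itertools
import multiprocessing

def count_chunk(texts):
    """Count the items in one chunk of texts (runs inside a worker process)."""
    return collections.Counter(itertools.chain.from_iterable(texts))

def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def count_parallel(texts, processes=2, chunk_size=1000):
    """Merge per-chunk Counters; processes=1 runs everything in the current process."""
    total = collections.Counter()
    chunks = chunked(texts, chunk_size)
    if processes == 1:
        # Serial fallback, handy for debugging
        for partial in map(count_chunk, chunks):
            total.update(partial)
        return total
    with multiprocessing.Pool(processes) as pool:
        # Result order does not matter because Counter addition is commutative
        for partial in pool.imap_unordered(count_chunk, chunks):
            total.update(partial)
    return total
```

On platforms that start workers with "spawn" (Windows, macOS), call `count_parallel` from under an `if __name__ == "__main__":` guard. The speed-up depends on chunk size, since each chunk is pickled and sent to a worker.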

Alternating between two iterators

During model training, it can be useful to emit samples from the positive and negative classes alternately.

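import itertools

def i1():
    # Even numbers 0, 2, ..., 18 (stand-in for one class of samples)
    for i in range(0, 20, 2):
        yield i

def i2():
    # Odd numbers 1, 3, ..., 19 (stand-in for the other class)
    for i in range(1, 20, 2):
        yield i

def combine_iters():
    # cycle() restarts each iterator once it is exhausted,
    # so the combined stream never ends
    ii1 = itertools.cycle(i1())
    ii2 = itertools.cycle(i2())

    while True:
        yield next(ii1)
        yield next(ii2)

iiters = combine_iters()
for i in range(100):
    print(next(iiters))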
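The same interleaving can also be written without an explicit loop. The sketch below is one possible variant; the helper name `alternate` and the sample lists are made up for illustration. It zips two cycled iterables and flattens the pairs.

```python
import itertools

def alternate(first, second):
    """Yield first[0], second[0], first[1], second[1], ... without end."""
    pairs = zip(itertools.cycle(first), itertools.cycle(second))
    return itertools.chain.from_iterable(pairs)

positives = ["p0", "p1", "p2"]
negatives = ["n0", "n1"]

# Take the first 8 items of the endless stream
batch = list(itertools.islice(alternate(positives, negatives), 8))
print(batch)  # ['p0', 'n0', 'p1', 'n1', 'p2', 'n0', 'p0', 'n1']
```

Note that `itertools.cycle` keeps a copy of every item it has seen, so this approach suits sources that fit in memory.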

Enumerating all subsets

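import numpy as np

def all_subgroup(es):
    # Each integer i in [0, 2**n) encodes one subset:
    # a set bit means the corresponding element is kept
    size = 2**len(es)
    for i in range(size):
        bins = list(np.binary_repr(i))
        idx = (np.array(bins) == "1")
        # binary_repr has no leading zeros, so align the mask with the tail of es
        r = len(es) - len(idx)
        yield es[r:][idx]

es = np.array(["a", "b", "c", "e", "f"])
for e in all_subgroup(es):
    print(e)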
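When NumPy is not needed, the subsets can also be generated with the standard library. The sketch below follows the well-known `powerset` recipe from the itertools documentation; it yields tuples ordered by subset size rather than by bit pattern.

```python
from itertools import chain, combinations

def powerset(items):
    """Yield every subset of items as a tuple, smallest subsets first."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )

subsets = list(powerset(["a", "b", "c"]))
print(len(subsets))   # 8
print(subsets[:4])    # [(), ('a',), ('b',), ('c',)]
```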

Human-readable byte units

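def humanize_bytes(bytesize, precision=3):
    # Units are powers of 1024, listed from largest to smallest
    abbrevs = (
        (1 << 50, 'PB'),
        (1 << 40, 'TB'),
        (1 << 30, 'GB'),
        (1 << 20, 'MB'),
        (1 << 10, 'kB'),
        (1, 'bytes')
    )
    if bytesize == 1:
        return '1 byte'
    # Pick the largest unit that does not exceed bytesize;
    # values below 1 fall through to 'bytes'
    for factor, suffix in abbrevs:
        if bytesize >= factor:
            break
    return '%.*f %s' % (precision, bytesize / factor, suffix)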

Image cropping

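import os

import cropresize2
from PIL import Image

# Presumably a classmethod of the paste/file model class (the class is not shown here)
def rsize(cls, old_paste, weight, height):
    # `weight` is used as the target width in pixels (name kept as in the source)
    assert old_paste.is_image, TypeError('Unsupported Image Type.')
    with open(old_paste.path, 'rb') as f:  # make sure the file handle is closed
        im = Image.open(f)
        img = cropresize2.crop_resize(im, (int(weight), int(height)))
        rst = cls(old_paste.filename, old_paste.mimetype, 0)
        img.save(rst.path)
    filestat = os.stat(rst.path)
    rst.size = filestat.st_size
    return rst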

Time series and sliding windows

import numpy as np

_row = lambda x: x

def series2X(series, size, func=_row):
    # Convert a time series into sliding windows of length `size`
    X = np.array([series[i:i+size] for i in range(len(series)-size+1)])
    return np.apply_along_axis(func, 1, X)

def series2Xy(series, size, func=_row):
    # Convert a time series into one-step-ahead labeled data:
    # each window of length `size` is paired with the value that follows it
    X = np.array([series[:-1][i:i+size] for i in range(len(series)-size)])
    y = np.array(series[size:])
    return np.apply_along_axis(func, 1, X), y
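On NumPy 1.20 or later, the windows can also be built without a Python-level loop by using `numpy.lib.stride_tricks.sliding_window_view`. The sketch below is an alternative to the list comprehension, with a toy series chosen for illustration; it does not apply a per-row `func`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(6)   # [0 1 2 3 4 5]
size = 3

# All windows of length `size`; the result is a read-only view, not a copy
windows = sliding_window_view(series, size)   # shape (4, 3)

# One-step-ahead pairs: drop the last window, which has no following value
X = windows[:-1]     # [[0 1 2], [1 2 3], [2 3 4]]
y = series[size:]    # [3 4 5]
```

Because the result is a view onto the original array, call `.copy()` on it before modifying the windows in place.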

Obtaining a random seed

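import os
import numpy as np

# np.random.seed only accepts integers in [0, 2**32 - 1],
# so draw 4 random bytes (10 bytes would usually overflow and raise ValueError)
bs = os.urandom(4)
seed = int.from_bytes(bs, "big")

np.random.seed(seed)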
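NumPy's newer `Generator` API is an alternative to seeding the global state. As a sketch, assuming NumPy 1.17 or later, `np.random.default_rng` accepts non-negative integer seeds of arbitrary size, so a wider seed from `os.urandom` can be passed directly. Logging the seed makes the run reproducible.

```python
import os
import numpy as np

# 128 bits of OS entropy; keep this value if the run must be reproduced later
seed = int.from_bytes(os.urandom(16), "big")

rng = np.random.default_rng(seed)
sample = rng.integers(0, 10, size=5)

# A second generator built from the same seed replays the same numbers
replay = np.random.default_rng(seed).integers(0, 10, size=5)
print(np.array_equal(sample, replay))  # True
```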

Summary

When reposting, please include the address of this article: https://allenwind.github.io/blog/10568
For more articles, see: https://allenwind.github.io/blog/archives/