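Quick word-frequency counting, alternating between two iterators, enumerating all subsets, human-readable byte units, and sliding windows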
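Quickly building a word-frequency table

```python
import itertools
import collections

# X is a list of texts: strings give character counts, token lists give word counts.
chars = collections.Counter(itertools.chain(*X))
```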
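When the dataset is small, we usually build the word-frequency table with code like the following:

```python
import itertools
import collections

# Placeholder: fill in your own texts (strings or lists of tokens).
X = [text1, text2, ..., textn]

# Flatten all texts into one stream of tokens and count them.
words = collections.Counter(itertools.chain(*X))
print(words.most_common(20))
```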
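When the dataset is very large, the code above struggles. One way out is to build the frequency table in parallel: the count-in-parallel project implements Counter-style counting across multiple processes.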
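The idea can be sketched with the standard library alone. This is a minimal illustration, not the code of the count-in-parallel project: the helper names `count_chunk`, `chunked`, and `parallel_count`, the chunk size, and the sample data are all assumptions made for the example.

```python
import collections
import itertools
from multiprocessing import Pool


def count_chunk(texts):
    """Count tokens in one chunk of texts (runs inside a worker process)."""
    return collections.Counter(itertools.chain.from_iterable(texts))


def chunked(seq, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def parallel_count(texts, processes=4, chunk_size=1000):
    """Count tokens across `texts` by merging the per-chunk Counters."""
    total = collections.Counter()
    with Pool(processes) as pool:
        # Each worker counts one chunk; the parent merges results as they arrive.
        for partial in pool.imap_unordered(count_chunk, chunked(texts, chunk_size)):
            total.update(partial)
    return total


if __name__ == "__main__":
    # Hypothetical sample data: strings are counted character by character.
    X = ["hello world", "hello there"] * 1000
    print(parallel_count(X).most_common(3))
```

Each worker sends back a whole `Counter`, so this only pays off when the chunks are large enough that counting dominates the cost of pickling the partial results.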
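Alternating between two iterators

During model training, it is sometimes useful to emit samples from the positive and negative classes in turn.

```python
import itertools


def i1():
    # Stand-in for the first class: even numbers.
    for i in range(0, 20, 2):
        yield i


def i2():
    # Stand-in for the second class: odd numbers.
    for i in range(1, 20, 2):
        yield i


def combine_iters():
    # cycle() restarts each iterator when it is exhausted, so the stream never ends.
    ii1 = itertools.cycle(i1())
    ii2 = itertools.cycle(i2())
    while True:
        yield next(ii1)
        yield next(ii2)


iiters = combine_iters()
for i in range(100):
    print(next(iiters))
```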
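The same alternation can be written for any two iterables by zipping their cycles. The sketch below is one possible generalization; the function name `alternate` and the sample lists are hypothetical.

```python
import itertools


def alternate(first, second):
    """Yield items from two iterables in turn, restarting each when it runs out."""
    pairs = zip(itertools.cycle(first), itertools.cycle(second))
    return itertools.chain.from_iterable(pairs)


positives = ["p1", "p2", "p3"]
negatives = ["n1", "n2"]
stream = alternate(positives, negatives)

# Take a finite prefix of the endless stream.
first_six = list(itertools.islice(stream, 6))
print(first_six)  # ['p1', 'n1', 'p2', 'n2', 'p3', 'n1']
```

Note that `itertools.cycle` caches every item it has seen and replays them in the same order, so this suits small in-memory sample sets rather than large streams or data that should be reshuffled each pass.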
Enumerating all subsets

```python
import numpy as np


def all_subgroup(es):
    # There are 2**n subsets; the binary digits of i select the members of subset i.
    size = 2**len(es)
    for i in range(size):
        bins = list(np.binary_repr(i))
        idx = (np.array(bins) == "1")
        # binary_repr has no leading zeros, so align the mask with the tail of es.
        r = len(es) - len(idx)
        yield es[r:][idx]


es = np.array(["a", "b", "c", "e", "f"])
for e in all_subgroup(es):
    print(e)
```
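For plain Python sequences, the same enumeration can be sketched without NumPy, following the `powerset` recipe from the `itertools` documentation. The sample input is illustrative, and the subsets come out ordered by size rather than in the binary-counting order of the generator above.

```python
from itertools import chain, combinations


def powerset(items):
    """Yield every subset of `items` as a tuple, from the empty set to the full set."""
    items = list(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )


subsets = list(powerset("abc"))
print(len(subsets))  # 8
print(subsets[:4])   # [(), ('a',), ('b',), ('c',)]
```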
Human-readable byte units

```python
def humanize_bytes(bytesize, precision=3):
    # Units from largest to smallest, using binary (1024-based) factors.
    abbrevs = (
        (1 << 50, 'PB'),
        (1 << 40, 'TB'),
        (1 << 30, 'GB'),
        (1 << 20, 'MB'),
        (1 << 10, 'kB'),
        (1, 'bytes')
    )
    if bytesize == 1:
        return '1 byte'
    # Pick the largest unit that does not exceed the size; 0 falls through to 'bytes'.
    for factor, suffix in abbrevs:
        if bytesize >= factor:
            break
    return '%.*f %s' % (precision, bytesize / factor, suffix)
```
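Image cropping

```python
import os

import cropresize2
from PIL import Image


# Excerpt: meant to be a classmethod on a file model class (note the `cls` argument).
def rsize(cls, old_paste, weight, height):
    # `weight` is used as the target width in pixels (name kept from the original).
    assert old_paste.is_image, TypeError('Unsupported Image Type.')
    # The context manager closes the source file once the result has been saved.
    with open(old_paste.path, 'rb') as f:
        im = Image.open(f)
        img = cropresize2.crop_resize(im, (int(weight), int(height)))
        rst = cls(old_paste.filename, old_paste.mimetype, 0)
        img.save(rst.path)
    # Record the size of the newly written file.
    filestat = os.stat(rst.path)
    rst.size = filestat.st_size
    return rst
```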
Time series and sliding windows

```python
import numpy as np

# Default per-window transform: return the window unchanged.
_row = lambda x: x


def series2X(series, size, func=_row):
    # All windows of length `size`, one per row.
    X = np.array([series[i:i+size] for i in range(len(series)-size+1)])
    return np.apply_along_axis(func, 1, X)


def series2Xy(series, size, func=_row):
    # Each window in X is paired with the value that immediately follows it in y.
    X = np.array([series[:-1][i:i+size] for i in range(len(series)-size)])
    y = np.array(series[size:])
    return np.apply_along_axis(func, 1, X), y
```
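To make the shapes concrete, here is a NumPy-free sketch of the same windowing on a short list. The helper names `windows` and `windows_with_targets` are hypothetical stand-ins for `series2X` and `series2Xy` with the default identity transform.

```python
def windows(series, size):
    """All consecutive windows of length `size`."""
    return [series[i:i + size] for i in range(len(series) - size + 1)]


def windows_with_targets(series, size):
    """Windows paired with the value that immediately follows each one."""
    X = [series[i:i + size] for i in range(len(series) - size)]
    y = series[size:]
    return X, y


series = [1, 2, 3, 4, 5]

print(windows(series, 3))
# [[1, 2, 3], [2, 3, 4], [3, 4, 5]]

X, y = windows_with_targets(series, 3)
print(X)  # [[1, 2, 3], [2, 3, 4]]
print(y)  # [4, 5]
```

The supervised variant yields one window fewer, because the last window has no following value to predict.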
Getting a random seed

```python
import os
import numpy as np

# np.random.seed only accepts integers in [0, 2**32 - 1], so draw 4 bytes, not 10.
bs = os.urandom(4)
seed = int.from_bytes(bs, "big")
np.random.seed(seed)
```
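One reason to draw the seed explicitly, rather than letting the generator seed itself, is that the value can be logged and reused to reproduce a run. The sketch below shows this with the standard-library `random` module; the helper name `fresh_seed` is hypothetical, and 4 bytes keeps the value within the 32-bit range that NumPy's legacy `np.random.seed` accepts.

```python
import os
import random


def fresh_seed(nbytes=4):
    """Draw an unpredictable integer seed from the operating system."""
    return int.from_bytes(os.urandom(nbytes), "big")


seed = fresh_seed()
print("seed:", seed)  # record this value to reproduce the run later

# Re-seeding with the same value replays the same random sequence.
random.seed(seed)
first = [random.random() for _ in range(3)]
random.seed(seed)
second = [random.random() for _ in range(3)]
assert first == second
```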
Summary

When reposting, please include the address of this post: https://allenwind.github.io/blog/10568

For more articles, see: https://allenwind.github.io/blog/archives/