Commit e8f1b272f8349748e71a1aff2ce81093c33943f6 - golang-github-minio-sha256-simd

Update README with section on AVX512 (frankw, 6 years ago)

1 changed file(s) with 62 addition(s) and 107 deletion(s).


README.md

0	0	# sha256-simd
1	1
2		Accelerate SHA256 computations in pure Go for both Intel (AVX2, AVX, SSE) as well as ARM (arm64) platforms.
3
4		Update: As of Go 1.8, `crypto/sha256` offers similar performance for AVX2.
	2	Accelerate SHA256 computations in pure Go using AVX512 and AVX2 for Intel and ARM64 for ARM. On AVX512 it provides an up to 8x improvement (over 3 GB/s per core) in comparison to AVX2.
5	3
6	4	## Introduction
7	5
8		This package is designed as a drop-in replacement for `crypto/sha256`. For Intel CPUs it has three flavors for AVX2, AVX and SSE whereby the fastest method is automatically chosen depending on CPU capabilities. For ARM CPUs with the Cryptography Extensions advantage is taken of the SHA2 instructions resulting in a massive performance improvement.
	6	This package is designed as a replacement for `crypto/sha256`. For Intel CPUs it has two flavors for AVX512 and AVX2 (AVX/SSE are also supported). For ARM CPUs with the Cryptography Extensions, advantage is taken of the SHA2 instructions resulting in a massive performance improvement.
9	7
10		This package uses Golang assembly and as such does not depend on cgo. The Intel versions are based on the implementations as described in "Fast SHA-256 Implementations on Intel Architecture Processors" by J. Guilford et al.
	8	This package uses Golang assembly. The AVX512 version is based on Intel's "multi-buffer crypto library for IPSec" whereas the other Intel implementations are described in "Fast SHA-256 Implementations on Intel Architecture Processors" by J. Guilford et al.
	9
	10	## New: Support for AVX512
	11
	12	We have added support for AVX512 which results in an up to 8x performance improvement over AVX2 (3.0 GHz Xeon Platinum 8124M CPU):
	13
	14	```
	15	$ benchcmp avx2.txt avx512.txt
	16	benchmark AVX2 MB/s AVX512 MB/s speedup
	17	BenchmarkHash5M 448.62 3498.20 7.80x
	18	```
	19
	20	The original code was developed by Intel as part of the [multi-buffer crypto library](https://github.com/intel/intel-ipsec-mb) for IPSec or more specifically this [AVX512](https://github.com/intel/intel-ipsec-mb/blob/master/avx512/sha256_x16_avx512.asm) implementation. The key idea behind it is to process a total of 16 checksums in parallel by “transposing” 16 (independent) messages of 64 bytes between a total of 16 ZMM registers (each 64 bytes wide).
	21
	22	Transposing the input messages means that in order to take full advantage of the speedup you need to have a (server) workload where multiple threads are doing SHA256 calculations in parallel. Unfortunately for this algorithm it is not possible for two message blocks processed in parallel to be dependent on one another — because then the (interim) result of the first part of the message has to be an input into the processing of the second part of the message.
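
The transposition idea described above can be illustrated with a small pure-Go sketch. It is not the package's assembly; the name `transpose16` and the array layout are assumptions made for illustration. Each 64-byte block holds 16 big-endian 32-bit words, so "register" `w` ends up holding word `w` of all 16 messages, one message per lane.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// lanes is the number of 32-bit lanes in one 64-byte ZMM register.
const lanes = 16

// transpose16 (hypothetical helper) takes one 64-byte block from each of 16
// independent messages and returns 16 "registers": register w holds the
// big-endian word w of every message, with message i in lane i.
func transpose16(blocks *[lanes][64]byte) [16][lanes]uint32 {
	var regs [16][lanes]uint32
	for lane := 0; lane < lanes; lane++ {
		for w := 0; w < 16; w++ {
			regs[w][lane] = binary.BigEndian.Uint32(blocks[lane][4*w : 4*w+4])
		}
	}
	return regs
}

func main() {
	// Fill message i with the byte value i so lanes are easy to tell apart.
	var blocks [lanes][64]byte
	for lane := range blocks {
		for i := range blocks[lane] {
			blocks[lane][i] = byte(lane)
		}
	}
	regs := transpose16(&blocks)
	fmt.Printf("%08x\n", regs[0][3]) // prints 03030303 (word 0 of message 3)
}
```

With this layout, a single SIMD operation on register `w` advances the same round of all 16 hashes at once, which is why the 16 blocks must be independent of one another.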
	23
	24	Whereas the original Intel C implementation requires some sort of explicit scheduling of messages to be processed in parallel, for Golang it makes sense to take advantage of channels in order to group messages together and use channels as well for sending back the results (thereby effectively decoupling the calculations). We have implemented a fairly simple scheduling mechanism that seems to work well in practice.
	25
	26	Due to this different way of scheduling, we decided to use an explicit method to instantiate the AVX512 version. Essentially one or more AVX512 processing servers ([`Avx512Server`](https://github.com/minio/sha256-simd/blob/master/sha256blockAvx512_amd64.go#L294)) have to be created whereby each server can hash over 3 GB/s on a single core. A `hash.Hash` object ([`Avx512Digest`](https://github.com/minio/sha256-simd/blob/master/sha256blockAvx512_amd64.go#L45)) is then instantiated using one of these servers and used in the regular fashion:
	27
	28	```go
	29	import "github.com/minio/sha256-simd"
	30
	31	func main() {
	32	server := sha256.NewAvx512Server()
	33	h512 := sha256.NewAvx512(server)
	34	h512.Write(fileBlock)
	35	digest := h512.Sum([]byte{})
	36	}
	37	```
	38
	39	Note that, because of the scheduling overhead, for small messages (< 1 MB) you will be better off using the regular SHA256 hashing (but those are typically not performance critical anyway). Some other tips to get the best performance:
	40	* Have many goroutines doing SHA256 calculations in parallel.
	41	* Try to `Write()` messages in multiples of 64 bytes.
	42	* Try to keep the overall length of messages roughly similar, e.g. 5 MB each (this way all 16 ‘lanes’ in the AVX512 computations are contributing as much as possible).
	43
	44	More detailed information can be found in this [blog]() post including scaling across cores.
11	45
12	46	## Drop-In Replacement
13	47
14		Following code snippet shows you how you can directly replace wherever `crypto/sha256` is used can be replaced with `github.com/minio/sha256-simd`.
	48	The following code snippet shows how you can use `github.com/minio/sha256-simd`. This will automatically select the fastest method for the architecture on which it will be executed.
15	49
16		Before:
17		```go
18		import "crypto/sha256"
19
20		func main() {
21		...
22		shaWriter := sha256.New()
23		io.Copy(shaWriter, file)
24		...
25		}
26		```
27
28		After:
29	50	```go
30	51	import "github.com/minio/sha256-simd"
31	52

39	60
40	61	## Performance
41	62
42		Below is the speed in MB/s for a single core (ranked fast to slow) as well as the factor of improvement over `crypto/sha256` (when applicable).
	63	Below is the speed in MB/s for a single core (ranked fast to slow) for blocks larger than 1 MB.
43	64
44		| Processor | Package | Speed | Improvement |
45		| --------------------------------- | ------------------------- | -----------:| -----------:|
46		| 1.2 GHz ARM Cortex-A53 | minio/sha256-simd (ARM64) | 638.2 MB/s | 105x |
47		| 2.4 GHz Intel Xeon CPU E5-2620 v3 | minio/sha256-simd (AVX2) | 355.0 MB/s | 1.88x |
48		| 2.4 GHz Intel Xeon CPU E5-2620 v3 | minio/sha256-simd (AVX) | 306.0 MB/s | 1.62x |
49		| 2.4 GHz Intel Xeon CPU E5-2620 v3 | minio/sha256-simd (SSE) | 298.7 MB/s | 1.58x |
50		| 1.2 GHz ARM Cortex-A53 | crypto/sha256 | 6.1 MB/s | |
	65	| Processor | SIMD | Speed (MB/s) |
	66	| --------------------------------- | ------- | ------------:|
	67	| 3.0 GHz Intel Xeon Platinum 8124M | AVX512 | 3498 |
	68	| 1.2 GHz ARM Cortex-A53 | ARM64 | 638 |
	69	| 3.0 GHz Intel Xeon Platinum 8124M | AVX2 | 449 |
	70	| 3.1 GHz Intel Core i7 | AVX | 362 |
	71	| 3.1 GHz Intel Core i7 | SSE | 299 |
51	72
52		Note that the AVX2 version is measured with the "unrolled"/"demacro-ed" version. Due to some Golang assembly restrictions the AVX2 version that uses `defines` loses about 15% performance (you can see the macrofied version, which is a little bit easier to read, [here](https://github.com/minio/sha256-simd/blob/e1b0a493b71bb31e3f1bf82d3b8cbd0d6960dfa6/sha256blockAvx2_amd64.s)).
53
54		See further down for detailed performance.
55
56		## Comparison to other hashing techniques
	73	## asm2plan9s
57	74
58		As measured on Intel Xeon (same as above) with AVX2 version:
	75	In order to be able to work more easily with AVX512/AVX2 instructions, a separate tool was developed to convert SIMD instructions into the corresponding BYTE sequence as accepted by Go assembly. See [asm2plan9s](https://github.com/minio/asm2plan9s) for more information.
59	76
60		| Method | Package | Speed |
61		| ------- | -------------------| --------:|
62		| BLAKE2B | [minio/blake2b-simd](https://github.com/minio/blake2b-simd) | 851 MB/s |
63		| MD5 | crypto/md5 | 607 MB/s |
64		| SHA1 | crypto/sha1 | 522 MB/s |
65		| SHA256 | minio/sha256-simd | 355 MB/s |
66		| SHA512 | crypto/sha512 | 306 MB/s |
	77	## Why and benefits
67	78
68		asm2plan9s
69		----------
70
71		In order to be able to work more easily with AVX2/AVX instructions, a separate tool was developed to convert AVX2/AVX instructions into the corresponding BYTE sequence as accepted by Go assembly. See [asm2plan9s](https://github.com/minio/asm2plan9s) for more information.
72
73		Why and benefits
74		----------------
75
76		One of the most performance sensitive parts of [Minio](https://minio.io) server (object storage [server](https://github.com/minio/minio) compatible with Amazon S3) is related to SHA256 hash sums calculations. For instance during multi part uploads each part that is uploaded needs to be verified for data integrity by the server. Likewise in order to generated pre-signed URLs check sums must be calculated to ensure their validity.
	79	One of the most performance sensitive parts of the [Minio](https://github.com/minio/minio) object storage server is related to SHA256 hash sums calculations. For instance during multi part uploads each part that is uploaded needs to be verified for data integrity by the server.
77	80
78	81	Other applications that can benefit from enhanced SHA256 performance are deduplication in storage systems, intrusion detection, version control systems, integrity checking, etc.
79	82
80		ARM SHA Extensions
81		------------------
	83	## ARM SHA Extensions
82	84
83	85	The 64-bit ARMv8 core has introduced new instructions for SHA1 and SHA2 acceleration as part of the [Cryptography Extensions](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0501f/CHDFJBCJ.html). Below you can see a small excerpt highlighting one of the rounds as is done for the SHA256 calculation process (for full code see [sha256block_arm64.s](https://github.com/minio/sha256-simd/blob/master/sha256block_arm64.s)).
84	86

95	97	sha256su1 v5.4s, v7.4s, v8.4s
96	98	```
97	99
98		Detailed benchmarks
99		-------------------
100
101		### ARM64
	100	### Detailed benchmarks
102	101
103	102	Benchmarks generated on a 1.2 GHz Quad-Core ARM Cortex-A53 equipped [Pine64](https://www.pine64.com/).
104	103
105	104	```
106		minio@minio-arm:~/gopath/src/github.com/sha256-simd$ benchcmp golang.txt arm64.txt
107		benchmark old ns/op new ns/op delta
108		BenchmarkHash8Bytes-4 11836 1403 -88.15%
109		BenchmarkHash1K-4 181143 3138 -98.27%
110		BenchmarkHash8K-4 1365652 14356 -98.95%
111		BenchmarkHash1M-4 173192200 1642954 -99.05%
112
113		benchmark old MB/s new MB/s speedup
114		BenchmarkHash8Bytes-4 0.68 5.70 8.38x
115		BenchmarkHash1K-4 5.65 326.30 57.75x
116		BenchmarkHash8K-4 6.00 570.63 95.11x
117		BenchmarkHash1M-4 6.05 638.23 105.49x
	105	minio@minio-arm:$ benchcmp golang.txt arm64.txt
	106	benchmark golang arm64 speedup
	107	BenchmarkHash8Bytes-4 0.68 MB/s 5.70 MB/s 8.38x
	108	BenchmarkHash1K-4 5.65 MB/s 326.30 MB/s 57.75x
	109	BenchmarkHash8K-4 6.00 MB/s 570.63 MB/s 95.11x
	110	BenchmarkHash1M-4 6.05 MB/s 638.23 MB/s 105.49x
118	111	```
119	112
120		Example performance metrics were generated on Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz - 6 physical cores, 12 logical cores running Ubuntu GNU/Linux with kernel version 4.4.0-24-generic (vanilla with no optimizations).
121
122		### AVX
123
124		```
125		$ benchcmp go.txt avx.txt
126		benchmark old ns/op new ns/op delta
127		BenchmarkHash8Bytes-12 446 346 -22.42%
128		BenchmarkHash1K-12 5919 3701 -37.47%
129		BenchmarkHash8K-12 43791 27222 -37.84%
130		BenchmarkHash1M-12 5544989 3426938 -38.20%
131
132		benchmark old MB/s new MB/s speedup
133		BenchmarkHash8Bytes-12 17.93 23.06 1.29x
134		BenchmarkHash1K-12 172.98 276.64 1.60x
135		BenchmarkHash8K-12 187.07 300.93 1.61x
136		BenchmarkHash1M-12 189.10 305.98 1.62x
137		```
138
139		### SSE
140
141		```
142		$ benchcmp go.txt sse.txt
143		benchmark old ns/op new ns/op delta
144		BenchmarkHash8Bytes-12 446 362 -18.83%
145		BenchmarkHash1K-12 5919 3751 -36.63%
146		BenchmarkHash8K-12 43791 27396 -37.44%
147		BenchmarkHash1M-12 5544989 3444623 -37.88%
148
149		benchmark old MB/s new MB/s speedup
150		BenchmarkHash8Bytes-12 17.93 22.05 1.23x
151		BenchmarkHash1K-12 172.98 272.92 1.58x
152		BenchmarkHash8K-12 187.07 299.01 1.60x
153		BenchmarkHash1M-12 189.10 304.41 1.61x
154		```
155
156		License
157		-------
	113	## License
158	114
159	115	Released under the Apache License v2.0. You can find the complete text in the file LICENSE.
160	116
161		Contributing
162		------------
	117	## Contributing
163	118
164	119	Contributions are welcome, please send PRs for any enhancements.