Table of Contents

About

A lot of cryptographic packages such as cryptsetup and even ZFS rely on the AVX CPU extensions in order to perform computations necessary for encryption regardless whether the CPU supports AES NI or not. Typically, on low-end, consumer hardware and even middle-range enterprise hardware, a kernel message will be printed on boot indicating that AVX support has not been found for the CPU. The consequence thereof is that cryptographic primitives will run so slow the bad performance would be unjustified compared to the available hardware. It is not only Celeron CPUs that are affect, but also CPUs such as earlier Xenons that are even deliberately built for very large workloads.

The main problem is that developers chose to only support AVX instructions regarding cryptographic computations, being mostly the "latest and the brightest", but in reality the computations could be accelerated just as well by SSE. That being said, for ZFS alone there is exists solution presented by Attila Fülöp that ported the AVX computations to SSE4.1 such that older platforms can benefit from substantial speedups. Overall, it seems that with the port to SSE4.1 the speeds reach up to three times the original speed without SSE4.1 acceleration.

Generic Install

In order to use the SSE4.1 branch for ZFS acceleration, one option is to check out the source code from GitHub, compile it and build the zfs module with the modified code:

git clone https://github.com/AttilaFueloep/zfs.git
cd zfs
./autogen.sh
./configure --prefix=/
make -j4

and then installing and loading the module, whilst setting the GCM implementation to use SSE4.1:

make install
depmod
modprobe zfs
echo sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl

Debian

On Debian, there is a better solution and that involves patching the existing "zfs-dkms" module with the patches from Attila Fülöp and then using the Debian-way to recompile a package, whilst incrementing the package release number with a local marker to distinguish it from the official "zfs-dkms" package.

Following the instructions on how to recompile Debian packages, the source of zfs-dkms is pulled along with the build-requirements by issuing the command:

apt-get source zfs-dkms
apt-get build-dep zfs-dkmsv

At the time of writing, Debian pull in the zfs-linux-2.2.4 source package with which the patch is compatible, such that the next step is to apply the following patch extracted from GitHub to be found in the patch section. The patch should be saved to sse4_1.patch and then applied by issuing the following command:

git apply sse4_1.patch --stat --apply

Note that the patch should be applied with git and that applying the patch with the patch tool will not work.

Usage

With the new ZFS kernel module built and installed, upon a fresh reboot, issue:

modinfo zfs

which should output the modified ZFS module (in this case, the module was built with Debian DKMS such that the new version string reads "2.2.4-1sse411").

The next part is to check the GCM implementation currently in use by issuing:

cat /sys/module/zfs/parameters/icp_gcm_impl
cycle fastest sse4_1 [generic] pclmulqdq 

where the angle brackets show which implementation is currently in use.

Then SSE4.1 support can be enabled by issuing:

echo sse4_1 >/sys/module/zfs/parameters/icp_gcm_impl

and then checked:

cat /sys/module/zfs/parameters/icp_gcm_impl
cycle fastest [sse4_1] generic pclmulqdq 

Lastly, it should not be necessary to change the implementation at runtime using the echo command yet the sse4_1 parameter can be passed as a kernel parameter when the module is loaded. In order to achieve that effect, simply create a file at /etc/modprobe.d/zfs.conf with the following contents:

options zfs icp_gcm_impl=sse4_1

After a reboot, SSE4.1 should be the default.

Performance Comparison

On a Celeron NAS with a ZFS pool using encryption frequently the CPU would be heavily occupied and "perf top" would show the "gcm_pclmulqdq_mul" occupying even more than half of the CPU. Similarly, using Samba and NFS would sometimes be slow, especially when accessing directories with large number of files. With the patch, there is no mention of GCM visibly hogging the CPU and files seem to load instantly from directories.

The patch has not been accepted into the mainline and it might be a while till it becomes part of the ZFS tree but for now the solution is to stick with the SSE4.1 optimized version till there is an official update and merge.

Patch

sse4_1.patch
From 535bc3d6e20975d23a7b0f82939a3e028481867d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Attila=20F=C3=BCl=C3=B6p?= <attila@fueloep.org>
Date: Thu, 9 Feb 2023 19:53:30 +0100
Subject: [PATCH 1/2] initial bring-over of "Intel Intelligent Storage
 Acceleration Library" port
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
 
Signed-off-by: Attila Fülöp <attila@fueloep.org>
---
 .../icp/gcm-simd/isa-l_crypto-ported/LICENSE  |   26 +
 .../icp/gcm-simd/isa-l_crypto-ported/README   |   18 +
 .../gcm-simd/isa-l_crypto-ported/gcm128_sse.S |   31 +
 .../gcm-simd/isa-l_crypto-ported/gcm256_sse.S |   31 +
 .../isa-l_crypto-ported/gcm_defines.S         |  295 +++
 .../gcm-simd/isa-l_crypto-ported/gcm_sse.S    | 2153 ++++++++++++++++
 .../gcm-simd/isa-l_crypto-ported/reg_sizes.S  |  224 ++
 contrib/icp/gcm-simd/isa-l_crypto/LICENSE     |   26 +
 contrib/icp/gcm-simd/isa-l_crypto/README      |   10 +
 .../icp/gcm-simd/isa-l_crypto/gcm128_sse.asm  |   31 +
 .../icp/gcm-simd/isa-l_crypto/gcm256_sse.asm  |   31 +
 .../icp/gcm-simd/isa-l_crypto/gcm_defines.asm |  291 +++
 contrib/icp/gcm-simd/isa-l_crypto/gcm_sse.asm | 2171 +++++++++++++++++
 .../icp/gcm-simd/isa-l_crypto/reg_sizes.asm   |  459 ++++
 .../asm-x86_64/modes/THIRDPARTYLICENSE.intel  |   26 +
 .../modes/THIRDPARTYLICENSE.intel.descrip     |    1 +
 .../icp/asm-x86_64/modes/isalc_gcm128_sse.S   |   31 +
 .../icp/asm-x86_64/modes/isalc_gcm256_sse.S   |   31 +
 .../icp/asm-x86_64/modes/isalc_gcm_defines.S  |  293 +++
 module/icp/asm-x86_64/modes/isalc_gcm_sse.S   | 2150 ++++++++++++++++
 module/icp/asm-x86_64/modes/isalc_reg_sizes.S |  221 ++
 21 files changed, 8550 insertions(+)
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto-ported/LICENSE
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto-ported/README
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto-ported/gcm128_sse.S
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto-ported/gcm256_sse.S
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto-ported/gcm_defines.S
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto-ported/gcm_sse.S
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto-ported/reg_sizes.S
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto/LICENSE
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto/README
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto/gcm128_sse.asm
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto/gcm256_sse.asm
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto/gcm_defines.asm
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto/gcm_sse.asm
 create mode 100644 contrib/icp/gcm-simd/isa-l_crypto/reg_sizes.asm
 create mode 100644 module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.intel
 create mode 100644 module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.intel.descrip
 create mode 100644 module/icp/asm-x86_64/modes/isalc_gcm128_sse.S
 create mode 100644 module/icp/asm-x86_64/modes/isalc_gcm256_sse.S
 create mode 100644 module/icp/asm-x86_64/modes/isalc_gcm_defines.S
 create mode 100644 module/icp/asm-x86_64/modes/isalc_gcm_sse.S
 create mode 100644 module/icp/asm-x86_64/modes/isalc_reg_sizes.S
 
diff --git a/contrib/icp/gcm-simd/isa-l_crypto-ported/LICENSE b/contrib/icp/gcm-simd/isa-l_crypto-ported/LICENSE
new file mode 100644
index 000000000000..ecebef110b46
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto-ported/LICENSE
@@ -0,0 +1,26 @@
+  Copyright(c) 2011-2017 Intel Corporation All rights reserved.
+
+  Redistribution and use in source and binary forms, with or without
+  modification, are permitted provided that the following conditions
+  are met:
+    * Redistributions of source code must retain the above copyright
+      notice, this list of conditions and the following disclaimer.
+    * Redistributions in binary form must reproduce the above copyright
+      notice, this list of conditions and the following disclaimer in
+      the documentation and/or other materials provided with the
+      distribution.
+    * Neither the name of Intel Corporation nor the names of its
+      contributors may be used to endorse or promote products derived
+      from this software without specific prior written permission.
+
+  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/contrib/icp/gcm-simd/isa-l_crypto-ported/README b/contrib/icp/gcm-simd/isa-l_crypto-ported/README
new file mode 100644
index 000000000000..219d427c845e
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto-ported/README
@@ -0,0 +1,18 @@
+This directory contains the ported "Intel(R) Intelligent Storage Acceleration
+Library Crypto Version" [1] GCM x86-64 assembly files [2]. They were adapted
+for the GNU assembler and translated to AT&T syntax. The later was necessary to
+support LLVM clangs integrated assembler. It was verified that the ported
+versions still pass the GCM tests in the isa-l_crypto source tree. The original
+files can be found in the isa-l_crypto directory one level up.
+
+The ported assembler files where then further adapted to be used within the
+ICP.
+
+The main purpose to include these files (and the original ones) here, is to
+serve as a reference if upstream changes need to be applied to the files
+included and modified in the ICP. They could be used by other projects
+depending on the GNU or LLVM assemblers as a starting point as well.
+
+
+[1] https://github.com/intel/isa-l_crypto
+[2] https://github.com/intel/isa-l_crypto/tree/v2.24.0/aes
diff --git a/contrib/icp/gcm-simd/isa-l_crypto-ported/gcm128_sse.S b/contrib/icp/gcm-simd/isa-l_crypto-ported/gcm128_sse.S
new file mode 100644
index 000000000000..6b6422291dc2
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto-ported/gcm128_sse.S
@@ -0,0 +1,31 @@
+//#######################################################################
+//  Copyright(c) 2011-2016 Intel Corporation All rights reserved.
+//
+//  Redistribution and use in source and binary forms, with or without
+//  modification, are permitted provided that the following conditions
+//  are met:
+//    * Redistributions of source code must retain the above copyright
+//      notice, this list of conditions and the following disclaimer.
+//    * Redistributions in binary form must reproduce the above copyright
+//      notice, this list of conditions and the following disclaimer in
+//      the documentation and/or other materials provided with the
+//      distribution.
+//    * Neither the name of Intel Corporation nor the names of its
+//      contributors may be used to endorse or promote products derived
+//      from this software without specific prior written permission.
+//
+//  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+//  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+//  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+//  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+//  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+//  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+//  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES# LOSS OF USE,
+//  DATA, OR PROFITS# OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+//  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+//  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+//  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//#######################################################################
+
+#define GCM128_MODE 1
+#include "gcm_sse_att.S"
diff --git a/contrib/icp/gcm-simd/isa-l_crypto-ported/gcm256_sse.S b/contrib/icp/gcm-simd/isa-l_crypto-ported/gcm256_sse.S
new file mode 100644
index 000000000000..31781f598ced
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto-ported/gcm256_sse.S
@@ -0,0 +1,31 @@
+//////////////////////////////////////////////////////////////////////////
+//  Copyright(c) 2011-2016 Intel Corporation All rights reserved.
+//
+//  Redistribution and use in source and binary forms, with or without
+//  modification, are permitted provided that the following conditions
+//  are met:
+//    * Redistributions of source code must retain the above copyright
+//      notice, this list of conditions and the following disclaimer.
+//    * Redistributions in binary form must reproduce the above copyright
+//      notice, this list of conditions and the following disclaimer in
+//      the documentation and/or other materials provided with the
+//      distribution.
+//    * Neither the name of Intel Corporation nor the names of its
+//      contributors may be used to endorse or promote products derived
+//      from this software without specific prior written permission.
+//
+//  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+//  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+//  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+//  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+//  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+//  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+//  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES// LOSS OF USE,
+//  DATA, OR PROFITS// OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+//  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+//  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+//  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//////////////////////////////////////////////////////////////////////////
+
+#define GCM256_MODE 1
+#include "gcm_sse_att.S"
diff --git a/contrib/icp/gcm-simd/isa-l_crypto-ported/gcm_defines.S b/contrib/icp/gcm-simd/isa-l_crypto-ported/gcm_defines.S
new file mode 100644
index 000000000000..12a74bbe084a
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto-ported/gcm_defines.S
@@ -0,0 +1,295 @@
+////////////////////////////////////////////////////////////////////////////////
+//  Copyright(c) 2011-2016 Intel Corporation All rights reserved.
+//
+//  Redistribution and use in source and binary forms, with or without
+//  modification, are permitted provided that the following conditions
+//  are met:
+//    * Redistributions of source code must retain the above copyright
+//      notice, this list of conditions and the following disclaimer.
+//    * Redistributions in binary form must reproduce the above copyright
+//      notice, this list of conditions and the following disclaimer in
+//      the documentation and/or other materials provided with the
+//      distribution.
+//    * Neither the name of Intel Corporation nor the names of its
+//      contributors may be used to endorse or promote products derived
+//      from this software without specific prior written permission.
+//
+//  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+//  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+//  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+//  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+//  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+//  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+//  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES// LOSS OF USE,
+//  DATA, OR PROFITS// OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+//  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+//  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+//  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+////////////////////////////////////////////////////////////////////////////////
+
+#ifndef GCM_DEFINES_ASM_INCLUDED
+#define GCM_DEFINES_ASM_INCLUDED
+
+//
+// Authors:
+//       Erdinc Ozturk
+//       Vinodh Gopal
+//       James Guilford
+
+// Port to GNU as and translation to GNU as att-syntax
+// Copyright(c) 2023 Attila Fülöp <attila@fueloep.org>
+
+////////////
+
+.section .rodata
+
+.balign 16
+POLY:		.quad     0x0000000000000001, 0xC200000000000000
+
+// unused for sse
+.balign 64
+POLY2:		.quad     0x00000001C2000000, 0xC200000000000000
+		.quad     0x00000001C2000000, 0xC200000000000000
+		.quad     0x00000001C2000000, 0xC200000000000000
+		.quad     0x00000001C2000000, 0xC200000000000000
+.balign 16
+TWOONE:		.quad     0x0000000000000001, 0x0000000100000000
+
+// order of these constants should not change.
+// more specifically, ALL_F should follow SHIFT_MASK, and ZERO should
+// follow ALL_F
+
+.balign 64
+SHUF_MASK:	.quad     0x08090A0B0C0D0E0F, 0x0001020304050607
+		.quad     0x08090A0B0C0D0E0F, 0x0001020304050607
+		.quad     0x08090A0B0C0D0E0F, 0x0001020304050607
+		.quad     0x08090A0B0C0D0E0F, 0x0001020304050607
+
+SHIFT_MASK:	.quad     0x0706050403020100, 0x0f0e0d0c0b0a0908
+ALL_F:		.quad     0xffffffffffffffff, 0xffffffffffffffff
+ZERO:		.quad     0x0000000000000000, 0x0000000000000000 // unused for sse
+ONE:		.quad     0x0000000000000001, 0x0000000000000000
+TWO:		.quad     0x0000000000000002, 0x0000000000000000 // unused for sse
+ONEf:		.quad     0x0000000000000000, 0x0100000000000000
+TWOf:		.quad     0x0000000000000000, 0x0200000000000000 // unused for sse
+
+// Below unused for sse
+.balign 64
+ddq_add_1234:
+	.quad	0x0000000000000001, 0x0000000000000000
+	.quad	0x0000000000000002, 0x0000000000000000
+	.quad	0x0000000000000003, 0x0000000000000000
+	.quad	0x0000000000000004, 0x0000000000000000
+
+.balign 64
+ddq_add_5678:
+	.quad	0x0000000000000005, 0x0000000000000000
+	.quad	0x0000000000000006, 0x0000000000000000
+	.quad	0x0000000000000007, 0x0000000000000000
+	.quad	0x0000000000000008, 0x0000000000000000
+
+.balign 64
+ddq_add_4444:
+	.quad	0x0000000000000004, 0x0000000000000000
+	.quad	0x0000000000000004, 0x0000000000000000
+	.quad	0x0000000000000004, 0x0000000000000000
+	.quad	0x0000000000000004, 0x0000000000000000
+
+.balign 64
+ddq_add_8888:
+	.quad	0x0000000000000008, 0x0000000000000000
+	.quad	0x0000000000000008, 0x0000000000000000
+	.quad	0x0000000000000008, 0x0000000000000000
+	.quad	0x0000000000000008, 0x0000000000000000
+
+.balign 64
+ddq_addbe_1234:
+	.quad	0x0000000000000000, 0x0100000000000000
+	.quad	0x0000000000000000, 0x0200000000000000
+	.quad	0x0000000000000000, 0x0300000000000000
+	.quad	0x0000000000000000, 0x0400000000000000
+
+.balign 64
+ddq_addbe_5678:
+	.quad	0x0000000000000000, 0x0500000000000000
+	.quad	0x0000000000000000, 0x0600000000000000
+	.quad	0x0000000000000000, 0x0700000000000000
+	.quad	0x0000000000000000, 0x0800000000000000
+
+.balign 64
+ddq_addbe_4444:
+	.quad	0x0000000000000000, 0x0400000000000000
+	.quad	0x0000000000000000, 0x0400000000000000
+	.quad	0x0000000000000000, 0x0400000000000000
+	.quad	0x0000000000000000, 0x0400000000000000
+
+.balign 64
+ddq_addbe_8888:
+	.quad	0x0000000000000000, 0x0800000000000000
+	.quad	0x0000000000000000, 0x0800000000000000
+	.quad	0x0000000000000000, 0x0800000000000000
+	.quad	0x0000000000000000, 0x0800000000000000
+
+.balign 64
+byte_len_to_mask_table:
+	.short      0x0000, 0x0001, 0x0003, 0x0007
+	.short      0x000f, 0x001f, 0x003f, 0x007f
+	.short      0x00ff, 0x01ff, 0x03ff, 0x07ff
+	.short      0x0fff, 0x1fff, 0x3fff, 0x7fff
+	.short      0xffff
+
+.balign 64
+byte64_len_to_mask_table:
+	.quad      0x0000000000000000, 0x0000000000000001
+	.quad      0x0000000000000003, 0x0000000000000007
+	.quad      0x000000000000000f, 0x000000000000001f
+	.quad      0x000000000000003f, 0x000000000000007f
+	.quad      0x00000000000000ff, 0x00000000000001ff
+	.quad      0x00000000000003ff, 0x00000000000007ff
+	.quad      0x0000000000000fff, 0x0000000000001fff
+	.quad      0x0000000000003fff, 0x0000000000007fff
+	.quad      0x000000000000ffff, 0x000000000001ffff
+	.quad      0x000000000003ffff, 0x000000000007ffff
+	.quad      0x00000000000fffff, 0x00000000001fffff
+	.quad      0x00000000003fffff, 0x00000000007fffff
+	.quad      0x0000000000ffffff, 0x0000000001ffffff
+	.quad      0x0000000003ffffff, 0x0000000007ffffff
+	.quad      0x000000000fffffff, 0x000000001fffffff
+	.quad      0x000000003fffffff, 0x000000007fffffff
+	.quad      0x00000000ffffffff, 0x00000001ffffffff
+	.quad      0x00000003ffffffff, 0x00000007ffffffff
+	.quad      0x0000000fffffffff, 0x0000001fffffffff
+	.quad      0x0000003fffffffff, 0x0000007fffffffff
+	.quad      0x000000ffffffffff, 0x000001ffffffffff
+	.quad      0x000003ffffffffff, 0x000007ffffffffff
+	.quad      0x00000fffffffffff, 0x00001fffffffffff
+	.quad      0x00003fffffffffff, 0x00007fffffffffff
+	.quad      0x0000ffffffffffff, 0x0001ffffffffffff
+	.quad      0x0003ffffffffffff, 0x0007ffffffffffff
+	.quad      0x000fffffffffffff, 0x001fffffffffffff
+	.quad      0x003fffffffffffff, 0x007fffffffffffff
+	.quad      0x00ffffffffffffff, 0x01ffffffffffffff
+	.quad      0x03ffffffffffffff, 0x07ffffffffffffff
+	.quad      0x0fffffffffffffff, 0x1fffffffffffffff
+	.quad      0x3fffffffffffffff, 0x7fffffffffffffff
+	.quad      0xffffffffffffffff
+
+.balign 64
+mask_out_top_block:
+	.quad      0xffffffffffffffff, 0xffffffffffffffff
+	.quad      0xffffffffffffffff, 0xffffffffffffffff
+	.quad      0xffffffffffffffff, 0xffffffffffffffff
+	.quad      0x0000000000000000, 0x0000000000000000
+
+.section .text
+
+
+////define the fields of gcm_data struct
+//typedef struct gcm_data
+//{
+//        u8 expanded_keys[16*15]//
+//        u8 shifted_hkey_1[16]//  // store HashKey <<1 mod poly here
+//        u8 shifted_hkey_2[16]//  // store HashKey^2 <<1 mod poly here
+//        u8 shifted_hkey_3[16]//  // store HashKey^3 <<1 mod poly here
+//        u8 shifted_hkey_4[16]//  // store HashKey^4 <<1 mod poly here
+//        u8 shifted_hkey_5[16]//  // store HashKey^5 <<1 mod poly here
+//        u8 shifted_hkey_6[16]//  // store HashKey^6 <<1 mod poly here
+//        u8 shifted_hkey_7[16]//  // store HashKey^7 <<1 mod poly here
+//        u8 shifted_hkey_8[16]//  // store HashKey^8 <<1 mod poly here
+//        u8 shifted_hkey_1_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_2_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^2 <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_3_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^3 <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_4_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^4 <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_5_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^5 <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_6_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^6 <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_7_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^7 <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_8_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^8 <<1 mod poly here (for Karatsuba purposes)
+//} gcm_data//
+
+#ifndef GCM_KEYS_VAES_AVX512_INCLUDED
+#define HashKey         16*15    // store HashKey <<1 mod poly here
+#define HashKey_1       16*15    // store HashKey <<1 mod poly here
+#define HashKey_2       16*16    // store HashKey^2 <<1 mod poly here
+#define HashKey_3       16*17    // store HashKey^3 <<1 mod poly here
+#define HashKey_4       16*18    // store HashKey^4 <<1 mod poly here
+#define HashKey_5       16*19    // store HashKey^5 <<1 mod poly here
+#define HashKey_6       16*20    // store HashKey^6 <<1 mod poly here
+#define HashKey_7       16*21    // store HashKey^7 <<1 mod poly here
+#define HashKey_8       16*22    // store HashKey^8 <<1 mod poly here
+#define HashKey_k       16*23    // store XOR of High 64 bits and Low 64 bits of  HashKey <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_2_k     16*24    // store XOR of High 64 bits and Low 64 bits of  HashKey^2 <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_3_k     16*25   // store XOR of High 64 bits and Low 64 bits of  HashKey^3 <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_4_k     16*26   // store XOR of High 64 bits and Low 64 bits of  HashKey^4 <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_5_k     16*27   // store XOR of High 64 bits and Low 64 bits of  HashKey^5 <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_6_k     16*28   // store XOR of High 64 bits and Low 64 bits of  HashKey^6 <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_7_k     16*29   // store XOR of High 64 bits and Low 64 bits of  HashKey^7 <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_8_k     16*30   // store XOR of High 64 bits and Low 64 bits of  HashKey^8 <<1 mod poly here (for Karatsuba purposes)
+#endif
+
+#define AadHash		16*0	// store current Hash of data which has been input
+#define AadLen		16*1	// store length of input data which will not be encrypted or decrypted
+#define InLen		(16*1)+8 // store length of input data which will be encrypted or decrypted
+#define PBlockEncKey	16*2	// encryption key for the partial block at the end of the previous update
+#define OrigIV		16*3	// input IV
+#define CurCount	16*4	// Current counter for generation of encryption key
+#define PBlockLen	16*5	// length of partial block at the end of the previous update
+
+.macro xmmreg name, num
+	.set xmm\name, %xmm\num
+.endm
+
+#define arg(x) (STACK_OFFSET + 8*(x))(%r14)
+
+
+#if __OUTPUT_FORMAT__ != elf64
+#define arg1 %rcx
+#define arg2 %rdx
+#define arg3 %r8
+#define arg4 %r9
+#define arg5 %rsi
+#define arg6 (STACK_OFFSET + 8*6)(%r14)
+#define arg7 (STACK_OFFSET + 8*7)(%r14)
+#define arg8 (STACK_OFFSET + 8*8)(%r14)
+#define arg9 (STACK_OFFSET + 8*9)(%r14)
+#define arg10 (STACK_OFFSET + 8*10)(%r14)
+#else
+#define arg1 %rdi
+#define arg2 %rsi
+#define arg3 %rdx
+#define arg4 %rcx
+#define arg5 %r8
+#define arg6 %r9
+#define arg7 ((STACK_OFFSET) + 8*1)(%r14)
+#define arg8 ((STACK_OFFSET) + 8*2)(%r14)
+#define arg9 ((STACK_OFFSET) + 8*3)(%r14)
+#define arg10 ((STACK_OFFSET) + 8*4)(%r14)
+#endif
+
+#ifdef NT_LDST
+#define NT_LD
+#define NT_ST
+#endif
+
+////// Use Non-temporal load/stor
+#ifdef NT_LD
+#define	XLDR	 movntdqa
+#define	VXLDR	 vmovntdqa
+#define	VX512LDR vmovntdqa
+#else
+#define	XLDR	 movdqu
+#define	VXLDR	 vmovdqu
+#define	VX512LDR vmovdqu8
+#endif
+
+////// Use Non-temporal load/stor
+#ifdef NT_ST
+#define	XSTR	 movntdq
+#define	VXSTR	 vmovntdq
+#define	VX512STR vmovntdq
+#else
+#define	XSTR	 movdqu
+#define	VXSTR	 vmovdqu
+#define	VX512STR vmovdqu8
+#endif
+
+#endif // GCM_DEFINES_ASM_INCLUDED
diff --git a/contrib/icp/gcm-simd/isa-l_crypto-ported/gcm_sse.S b/contrib/icp/gcm-simd/isa-l_crypto-ported/gcm_sse.S
new file mode 100644
index 000000000000..eec65600ddc6
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto-ported/gcm_sse.S
@@ -0,0 +1,2153 @@
+////////////////////////////////////////////////////////////////////////////////
+//  Copyright(c) 2011-2017 Intel Corporation All rights reserved.
+//
+//  Redistribution and use in source and binary forms, with or without
+//  modification, are permitted provided that the following conditions
+//  are met:
+//    * Redistributions of source code must retain the above copyright
+//      notice, this list of conditions and the following disclaimer.
+//    * Redistributions in binary form must reproduce the above copyright
+//      notice, this list of conditions and the following disclaimer in
+//      the documentation and/or other materials provided with the
+//      distribution.
+//    * Neither the name of Intel Corporation nor the names of its
+//      contributors may be used to endorse or promote products derived
+//      from this software without specific prior written permission.
+//
+//  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+//  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+//  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+//  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+//  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+//  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+//  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES// LOSS OF USE,
+//  DATA, OR PROFITS// OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+//  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+//  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+//  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+////////////////////////////////////////////////////////////////////////////////
+
+////////////////////////////////////////////////////////////////////////////////
+//
+// Authors:
+//       Erdinc Ozturk
+//       Vinodh Gopal
+//       James Guilford
+//
+//
+// References:
+//       This code was derived and highly optimized from the code described in
+//	 paper:
+//               Vinodh Gopal et. al. Optimized Galois-Counter-Mode
+//		  Implementation on Intel Architecture Processors. August, 2010
+//
+//       For the shift-based reductions used in this code, we used the method
+//	 described in paper:
+//               Shay Gueron, Michael E. Kounavis. Intel Carry-Less
+//		  Multiplication Instruction and its Usage for Computing the GCM
+//		  Mode. January, 2010.
+//
+//
+// Assumptions:
+//
+//
+//
+// iv:
+//       0                   1                   2                   3
+//       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                             Salt  (From the SA)               |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                     Initialization Vector                     |
+//       |         (This is the sequence number from IPSec header)       |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                              0x1                              |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//
+//
+//
+// AAD:
+//       AAD will be padded with 0 to the next 16byte multiple
+//       for example, assume AAD is a u32 vector
+//
+//       if AAD is 8 bytes:
+//       AAD[3] = {A0, A1};
+//       padded AAD in xmm register = {A1 A0 0 0}
+//
+//       0                   1                   2                   3
+//       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                               SPI (A1)                        |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                     32-bit Sequence Number (A0)               |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                              0x0                              |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//
+//                                       AAD Format with 32-bit Sequence Number
+//
+//       if AAD is 12 bytes:
+//       AAD[3] = {A0, A1, A2};
+//       padded AAD in xmm register = {A2 A1 A0 0}
+//
+//       0                   1                   2                   3
+//       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                               SPI (A2)                        |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                 64-bit Extended Sequence Number {A1,A0}       |
+//       |                                                               |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                              0x0                              |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//
+//        AAD Format with 64-bit Extended Sequence Number
+//
+//
+// aadLen:
+//       Must be a multiple of 4 bytes and from the definition of the spec.
+//       The code additionally supports any aadLen length.
+//
+// TLen:
+//       from the definition of the spec, TLen can only be 8, 12 or 16 bytes.
+//
+// poly = x^128 + x^127 + x^126 + x^121 + 1
+// throughout the code, one tab and two tab indentations are used. one tab is
+// for GHASH part, two tabs is for AES part.
+//
+
+// Port to GNU as and translation to GNU as att-syntax
+// Copyright(c) 2023 Attila Fülöp <attila@fueloep.org>
+
+// .altmacro
+.att_syntax prefix
+
+#include "../include/reg_sizes_att.S"
+#include "gcm_defines_att.S"
+
+#if !defined(GCM128_MODE) && !defined(GCM256_MODE)
+#error "No GCM mode selected for gcm_sse.S!"
+#endif
+
+#if defined(FUNCT_EXTENSION)
+#error "No support for non-temporal versions yet!"
+#endif
+#define	_nt	1
+
+#ifdef GCM128_MODE
+#define FN_NAME(x,y) aes_gcm_ ## x ## _128 ## y ## sse
+#define NROUNDS 9
+#endif
+
+#ifdef GCM256_MODE
+#define FN_NAME(x,y) aes_gcm_ ## x ## _256 ## y ## sse
+#define NROUNDS 13
+#endif
+
+
+// need to push 5 registers into stack to maintain
+#define STACK_OFFSET 8*5
+
+#define	TMP2	16*0    // Temporary storage for AES State 2 (State 1 is stored in an XMM register)
+#define	TMP3	16*1    // Temporary storage for AES State 3
+#define	TMP4	16*2    // Temporary storage for AES State 4
+#define	TMP5	16*3    // Temporary storage for AES State 5
+#define	TMP6	16*4    // Temporary storage for AES State 6
+#define	TMP7	16*5    // Temporary storage for AES State 7
+#define	TMP8	16*6    // Temporary storage for AES State 8
+
+#define	LOCAL_STORAGE	16*7
+
+#if __OUTPUT_FORMAT == win64
+#define	XMM_STORAGE	16*10
+#else
+#define	XMM_STORAGE	0
+#endif
+
+#define	VARIABLE_OFFSET	LOCAL_STORAGE + XMM_STORAGE
+
+////////////////////////////////////////////////////////////////
+// Utility Macros
+////////////////////////////////////////////////////////////////
+
+////////////////////////////////////////////////////////////////////////////////
+// GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0)
+// Input: A and B (128-bits each, bit-reflected)
+// Output: C = A*B*x mod poly, (i.e. >>1 )
+// To compute GH = GH*HashKey mod poly, give HK = HashKey<<1 mod poly as input
+// GH = GH * HK * x mod poly which is equivalent to GH*HashKey mod poly.
+////////////////////////////////////////////////////////////////////////////////
+.macro  GHASH_MUL  GH, HK, T1, T2, T3, T4, T5
+	// \GH, \HK hold the values for the two operands which are carry-less
+	// multiplied.
+	////////////////////////////////////////////////////////////////////////
+	// Karatsuba Method
+	movdqa	\GH, \T1
+	pshufd	$0b01001110, \GH, \T2
+	pshufd	$0b01001110, \HK, \T3
+	pxor	\GH, \T2			// \T2 = (a1+a0)
+	pxor	\HK, \T3			// \T3 = (b1+b0)
+
+	pclmulqdq	$0x11, \HK, \T1		// \T1 = a1*b1
+	pclmulqdq       $0x00, \HK, \GH		// \GH = a0*b0
+	pclmulqdq       $0x00, \T3, \T2		// \T2 = (a1+a0)*(b1+b0)
+	pxor	\GH, \T2
+	pxor	\T1, \T2			// \T2 = a0*b1+a1*b0
+
+	movdqa	\T2, \T3
+	pslldq	$8, \T3		// shift-L \T3 2 DWs
+	psrldq	$8, \T2		// shift-R \T2 2 DWs
+	pxor	\T3, \GH
+	pxor	\T2, \T1	// <\T1:\GH> holds the result of the carry-less multiplication of \GH by \HK
+
+
+	//first phase of the reduction
+	movdqa	\GH, \T2
+	movdqa	\GH, \T3
+	movdqa	\GH, \T4	// move \GH into \T2, \T3, \T4 in order to perform the three shifts independently
+
+	pslld	$31, \T2	// packed right shifting << 31
+	pslld	$30, \T3	// packed right shifting shift << 30
+	pslld	$25, \T4	// packed right shifting shift << 25
+	pxor	\T3, \T2	// xor the shifted versions
+	pxor	\T4, \T2
+
+	movdqa	\T2, \T5
+	psrldq	$4, \T5		 // shift-R \T5 1 DW
+
+	pslldq	$12, \T2	// shift-L \T2 3 DWs
+	pxor	\T2, \GH	// first phase of the reduction complete
+	////////////////////////////////////////////////////////////////////////
+
+	//second phase of the reduction
+	movdqa	\GH, \T2	// make 3 copies of \GH (in in \T2, \T3, \T4) for doing three shift operations
+	movdqa	\GH, \T3
+	movdqa	\GH, \T4
+
+	psrld	$1, \T2		// packed left shifting >> 1
+	psrld	$2, \T3		// packed left shifting >> 2
+	psrld	$7, \T4		// packed left shifting >> 7
+	pxor	\T3, \T2	// xor the shifted versions
+	pxor	\T4, \T2
+
+	pxor	\T5, \T2
+	pxor	\T2, \GH
+	pxor	\T1, \GH	// the result is in \T1
+
+.endm // GHASH_MUL
+
+////////////////////////////////////////////////////////////////////////////////
+// PRECOMPUTE: Precompute HashKey_{2..8} and HashKey{,_{2..8}}_k.
+// HasKey_i_k holds XORed values of the low and high parts of the Haskey_i.
+////////////////////////////////////////////////////////////////////////////////
+.macro PRECOMPUTE GDATA, HK, T1, T2, T3, T4, T5, T6
+
+	movdqa	\HK, \T4
+	pshufd	$0b01001110, \HK, \T1
+	pxor	\HK, \T1
+	movdqu	\T1, HashKey_k(\GDATA)
+
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^2<<1 mod poly
+	movdqu	\T4, HashKey_2(\GDATA)		//  [HashKey_2] = HashKey^2<<1 mod poly
+	pshufd	$0b01001110, \T4, \T1
+	pxor	\T4, \T1
+	movdqu	\T1, HashKey_2_k(\GDATA)
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^3<<1 mod poly
+	movdqu	\T4, HashKey_3(\GDATA)
+	pshufd	$0b01001110, \T4, \T1
+	pxor	\T4, \T1
+	movdqu	\T1, HashKey_3_k(\GDATA)
+
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^4<<1 mod poly
+	movdqu	\T4, HashKey_4(\GDATA)
+	pshufd	$0b01001110, \T4, \T1
+	pxor	\T4, \T1
+	movdqu	\T1, HashKey_4_k(\GDATA)
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^5<<1 mod poly
+	movdqu	\T4, HashKey_5(\GDATA)
+	pshufd	$0b01001110, \T4, \T1
+	pxor	\T4, \T1
+	movdqu	\T1, HashKey_5_k(\GDATA)
+
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^6<<1 mod poly
+	movdqu	\T4, HashKey_6(\GDATA)
+	pshufd	$0b01001110, \T4, \T1
+	pxor	\T4, \T1
+	movdqu	\T1, HashKey_6_k(\GDATA)
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^7<<1 mod poly
+	movdqu	\T4, HashKey_7(\GDATA)
+	pshufd	$0b01001110, \T4, \T1
+	pxor	\T4, \T1
+	movdqu	\T1, HashKey_7_k(\GDATA)
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^8<<1 mod poly
+	movdqu  \T4, HashKey_8(\GDATA)
+	pshufd  $0b01001110, \T4, \T1
+	pxor    \T4, \T1
+	movdqu  \T1, HashKey_8_k(\GDATA)
+
+.endm // PRECOMPUTE
+
+
+////////////////////////////////////////////////////////////////////////////////
+// READ_SMALL_DATA_INPUT: Packs xmm register with data when data input is less
+// than 16 bytes.
+// Returns 0 if data has length 0.
+// Input: The input data (INPUT), that data's length (LENGTH).
+// Output: The packed xmm register (OUTPUT).
+////////////////////////////////////////////////////////////////////////////////
+.macro READ_SMALL_DATA_INPUT	OUTPUT, INPUT, LENGTH, \
+				END_READ_LOCATION, COUNTER, TMP1
+
+	// clang compat: no local support
+	// LOCAL _byte_loop_1, _byte_loop_2, _done
+
+	pxor	\OUTPUT, \OUTPUT
+	mov	\LENGTH, \COUNTER
+	mov	\INPUT, \END_READ_LOCATION
+	add	\LENGTH, \END_READ_LOCATION
+	xor	\TMP1, \TMP1
+
+
+	cmp	$8, \COUNTER
+	jl	_byte_loop_2_\@
+	pinsrq	$0, (\INPUT), \OUTPUT	//Read in 8 bytes if they exists
+	je	_done_\@
+
+	sub	$8, \COUNTER
+
+_byte_loop_1_\@:		//Read in data 1 byte at a time while data is left
+	shl	$8, \TMP1	//This loop handles when 8 bytes were already read in
+	dec	\END_READ_LOCATION
+
+	////  mov	BYTE(\TMP1), BYTE [\END_READ_LOCATION]
+	bytereg \TMP1
+	movb	(\END_READ_LOCATION), breg
+	dec	\COUNTER
+	jg	_byte_loop_1_\@
+	pinsrq	$1, \TMP1, \OUTPUT
+	jmp	_done_\@
+
+_byte_loop_2_\@:		//Read in data 1 byte at a time while data is left
+	cmp	$0, \COUNTER
+	je	_done_\@
+	shl	$8, \TMP1	//This loop handles when no bytes were already read in
+	dec	\END_READ_LOCATION
+	//// mov	BYTE(\TMP1), BYTE [\END_READ_LOCATION]
+	bytereg \TMP1
+	movb	(\END_READ_LOCATION), breg
+	dec	\COUNTER
+	jg	_byte_loop_2_\@
+	pinsrq	$0, \TMP1, \OUTPUT
+_done_\@:
+
+.endm // READ_SMALL_DATA_INPUT
+
+
+////////////////////////////////////////////////////////////////////////////////
+// CALC_AAD_HASH: Calculates the hash of the data which will not be encrypted.
+// Input: The input data (A_IN), that data's length (A_LEN), and the hash key
+// (HASH_KEY).
+// Output: The hash of the data (AAD_HASH).
+////////////////////////////////////////////////////////////////////////////////
+.macro	CALC_AAD_HASH	A_IN, A_LEN, AAD_HASH, HASH_KEY, XTMP1, XTMP2, XTMP3, \
+			XTMP4, XTMP5, T1, T2, T3, T4, T5
+
+	// clang compat: no local support
+	// LOCAL _get_AAD_loop16, _get_small_AAD_block, _CALC_AAD_done
+
+	mov	\A_IN, \T1		// T1 = AAD
+	mov	\A_LEN, \T2		// T2 = aadLen
+	pxor	\AAD_HASH, \AAD_HASH
+
+	cmp	$16, \T2
+	jl	_get_small_AAD_block_\@
+
+_get_AAD_loop16_\@:
+
+	movdqu	(\T1), \XTMP1
+	//byte-reflect the AAD data
+	pshufb	 SHUF_MASK(%rip), \XTMP1
+	pxor	\XTMP1, \AAD_HASH
+	GHASH_MUL \AAD_HASH, \HASH_KEY, \XTMP1, \XTMP2, \XTMP3, \XTMP4, \XTMP5
+
+	sub	$16, \T2
+	je	_CALC_AAD_done_\@
+
+	add	$16, \T1
+	cmp	$16, \T2
+	jge	_get_AAD_loop16_\@
+
+_get_small_AAD_block_\@:
+	READ_SMALL_DATA_INPUT	\XTMP1, \T1, \T2, \T3, \T4, \T5
+	//byte-reflect the AAD data
+	pshufb	 SHUF_MASK(%rip), \XTMP1
+	pxor	\XTMP1, \AAD_HASH
+	GHASH_MUL \AAD_HASH, \HASH_KEY, \XTMP1, \XTMP2, \XTMP3, \XTMP4, \XTMP5
+
+_CALC_AAD_done_\@:
+
+.endm // CALC_AAD_HASH
+
+
+
+////////////////////////////////////////////////////////////////////////////////
+// PARTIAL_BLOCK: Handles encryption/decryption and the tag partial blocks
+// between update calls. Requires the input data be at least 1 byte long.
+// Input: gcm_key_data (GDATA_KEY), gcm_context_data (GDATA_CTX), input text
+// (PLAIN_CYPH_IN), input text length (PLAIN_CYPH_LEN), the current data offset
+// (DATA_OFFSET), and whether encoding or decoding (ENC_DEC).
+// Output: A cypher of the first partial block (CYPH_PLAIN_OUT), and updated
+// GDATA_CTX.
+// Clobbers rax, r10, r12, r13, r15, xmm0, xmm1, xmm2, xmm3, xmm5, xmm6, xmm9,
+// xmm10, xmm11, xmm13
+////////////////////////////////////////////////////////////////////////////////
+.macro PARTIAL_BLOCK	GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
+			PLAIN_CYPH_LEN, DATA_OFFSET, AAD_HASH, ENC_DEC
+
+	// clang compat: no local support
+	// LOCAL _fewer_than_16_bytes, _data_read, _no_extra_mask_1
+	// LOCAL _partial_incomplete_1, _dec_done, _no_extra_mask_2
+	// LOCAL _partial_incomplete_2, _encode_done, _partial_fill
+	// LOCAL _count_set, _less_than_8_bytes_left, _partial_block_done
+
+	mov	PBlockLen(\GDATA_CTX), %r13
+	cmp	$0, %r13
+	je	_partial_block_done_\@		//Leave Macro if no partial blocks
+
+	cmp	$16, \PLAIN_CYPH_LEN		//Read in input data without over reading
+	jl	_fewer_than_16_bytes_\@
+	XLDR	(\PLAIN_CYPH_IN), %xmm1		//If more than 16 bytes of data, just fill the xmm register
+	jmp	_data_read_\@
+
+_fewer_than_16_bytes_\@:
+	lea	(\PLAIN_CYPH_IN, \DATA_OFFSET), %r10
+	READ_SMALL_DATA_INPUT	%xmm1, %r10, \PLAIN_CYPH_LEN, %rax, %r12, %r15
+	mov	PBlockLen(\GDATA_CTX), %r13
+
+_data_read_\@:				//Finished reading in data
+
+
+	movdqu	PBlockEncKey(\GDATA_CTX), %xmm9	//xmm9 = ctx_data.partial_block_enc_key
+	movdqu	HashKey(\GDATA_KEY), %xmm13
+
+	lea	SHIFT_MASK(%rip), %r12
+
+	add	%r13, %r12		// adjust the shuffle mask pointer to be able to shift r13 bytes (16-r13 is the number of bytes in plaintext mod 16)
+	movdqu	(%r12), %xmm2		// get the appropriate shuffle mask
+	pshufb	%xmm2, %xmm9		// shift right r13 bytes
+
+	.ifc	\ENC_DEC, DEC
+
+	movdqa	%xmm1, %xmm3
+	pxor	%xmm1, %xmm9		// Cyphertext XOR E(K, Yn)
+
+	mov	\PLAIN_CYPH_LEN, %r15
+	add	%r13, %r15
+	sub	$16, %r15		//Set r15 to be the amount of data left in CYPH_PLAIN_IN after filling the block
+	jge	_no_extra_mask_1_\@	//Determine if if partial block is not being filled and shift mask accordingly
+	sub	%r15, %r12
+_no_extra_mask_1_\@:
+
+	movdqu	(ALL_F - SHIFT_MASK)(%r12), %xmm1	// get the appropriate mask to mask out bottom r13 bytes of xmm9
+	pand	%xmm1, %xmm9			// mask out bottom r13 bytes of xmm9
+
+	pand	%xmm1, %xmm3
+	pshufb	SHUF_MASK(%rip), %xmm3
+	pshufb	%xmm2, %xmm3
+	pxor	%xmm3, \AAD_HASH
+
+
+	cmp	$0, %r15
+	jl	_partial_incomplete_1_\@
+
+	GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6	//GHASH computation for the last <16 Byte block
+	xor	%rax, %rax
+	mov	%rax, PBlockLen(\GDATA_CTX)
+	jmp	_dec_done_\@
+_partial_incomplete_1_\@:
+	add	\PLAIN_CYPH_LEN, PBlockLen(\GDATA_CTX)
+_dec_done_\@:
+	movdqu	\AAD_HASH, AadHash(\GDATA_CTX)
+
+	.else	// .ifc \ENC_DEC, DEC
+
+	pxor	%xmm1, %xmm9		// Plaintext XOR E(K, Yn)
+
+	mov	\PLAIN_CYPH_LEN, %r15
+	add	%r13, %r15
+	sub	$16, %r15		//Set r15 to be the amount of data left in CYPH_PLAIN_IN after filling the block
+	jge	_no_extra_mask_2_\@	//Determine if if partial block is not being filled and shift mask accordingly
+	sub	%r15, %r12
+_no_extra_mask_2_\@:
+
+	movdqu	(ALL_F - SHIFT_MASK)(%r12), %xmm1  // get the appropriate mask to mask out bottom r13 bytes of xmm9
+	pand	%xmm1, %xmm9			 // mask out bottom r13  bytes of xmm9
+
+	pshufb	SHUF_MASK(%rip), %xmm9
+	pshufb	%xmm2, %xmm9
+	pxor	%xmm9, \AAD_HASH
+
+	cmp	$0, %r15
+	jl	_partial_incomplete_2_\@
+
+	GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6	//GHASH computation for the last <16 Byte block
+	xor	%rax, %rax
+	mov	%rax, PBlockLen(\GDATA_CTX)
+	jmp	_encode_done_\@
+_partial_incomplete_2_\@:
+	add     \PLAIN_CYPH_LEN, PBlockLen(\GDATA_CTX)
+_encode_done_\@:
+	movdqu	\AAD_HASH, AadHash(\GDATA_CTX)
+
+	pshufb	SHUF_MASK(%rip), %xmm9	// shuffle xmm9 back to output as ciphertext
+	pshufb	%xmm2, %xmm9
+
+	.endif	// .ifc \ENC_DEC, DEC
+
+
+	//////////////////////////////////////////////////////////
+	// output encrypted Bytes
+	cmp	$0, %r15
+	jl	_partial_fill_\@
+	mov	%r13, %r12
+	mov	$16, %r13
+	sub	%r12, %r13		// Set r13 to be the number of bytes to write out
+	jmp	_count_set_\@
+_partial_fill_\@:
+	mov	\PLAIN_CYPH_LEN, %r13
+_count_set_\@:
+	movq	%xmm9, %rax
+	cmp	$8, %r13
+	jle	_less_than_8_bytes_left_\@
+	mov	%rax, (\CYPH_PLAIN_OUT, \DATA_OFFSET)
+	add	$8, \DATA_OFFSET
+	psrldq	$8, %xmm9
+	movq	%xmm9, %rax
+	sub	$8, %r13
+_less_than_8_bytes_left_\@:
+	mov	%al, (\CYPH_PLAIN_OUT, \DATA_OFFSET)
+	add	$1, \DATA_OFFSET
+	shr	$8, %rax
+	sub	$1, %r13
+	jne	_less_than_8_bytes_left_\@
+	//////////////////////////////////////////////////////////
+_partial_block_done_\@:
+.endm // PARTIAL_BLOCK
+
+////////////////////////////////////////////////////////////////////////////////
+// INITIAL_BLOCKS: If a = number of total plaintext bytes; b = floor(a/16);
+// \num_initial_blocks = b mod 8;  encrypt the initial \num_initial_blocks
+// blocks and apply ghash on the ciphertext.
+// \GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, r14 are used as a
+// pointer only, not modified.
+// Updated AAD_HASH is returned in \T3.
+////////////////////////////////////////////////////////////////////////////////
+.macro INITIAL_BLOCKS	GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
+			LENGTH, DATA_OFFSET, num_initial_blocks, T1, HASH_KEY, \
+			T3, T4, T5, CTR, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, \
+			XMM7, XMM8, T6, T_key, ENC_DEC
+
+	// clang compat: no local support
+	// LOCAL _initial_blocks_done
+
+.altmacro
+.set i, (8-\num_initial_blocks)
+		xmmreg	i, %i
+		movdqu	\XMM8, xmmi	// move AAD_HASH to temp reg
+
+		// start AES for \num_initial_blocks blocks
+		movdqu	CurCount(\GDATA_CTX), \CTR	// \CTR = Y0
+
+
+.set i, (9-\num_initial_blocks)
+.rept \num_initial_blocks
+		xmmreg	i, %i
+		paddd	ONE(%rip), \CTR			// INCR Y0
+		movdqa	\CTR, xmmi
+		pshufb	SHUF_MASK(%rip), xmmi		// perform a 16Byte swap
+.set i, (i+1)
+.endr
+
+movdqu	16*0(\GDATA_KEY), \T_key
+.set i, (9-\num_initial_blocks)
+.rept \num_initial_blocks
+		xmmreg	i, %i
+		pxor	\T_key, xmmi
+.set i, (i+1)
+.endr
+
+.set j, 1
+.rept NROUNDS						// encrypt N blocks with 13 key rounds (11 for GCM192)
+movdqu	16*j(\GDATA_KEY), \T_key
+.set i, (9-\num_initial_blocks)
+.rept \num_initial_blocks
+		xmmreg	i, %i
+		aesenc	\T_key, xmmi
+.set i, (i+1)
+.endr
+
+.set j, (j+1)
+.endr
+
+movdqu	16*j(\GDATA_KEY), \T_key			// encrypt with last (14th) key round (12 for GCM192)
+.set i, (9-\num_initial_blocks)
+.rept \num_initial_blocks
+		xmmreg	i, %i
+		aesenclast	\T_key, xmmi
+.set i, (i+1)
+.endr
+
+.set i, (9-\num_initial_blocks)
+.rept \num_initial_blocks
+		xmmreg	i, %i
+		XLDR	(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, xmmi
+		XSTR	xmmi, (\CYPH_PLAIN_OUT, \DATA_OFFSET)	// write back ciphertext for \num_initial_blocks blocks
+		add	$16, \DATA_OFFSET
+		.ifc \ENC_DEC, DEC
+		movdqa	\T1, xmmi
+		.endif
+		pshufb	SHUF_MASK(%rip), xmmi      // prepare ciphertext for GHASH computations
+.set i, (i+1)
+.endr
+
+
+.set i, (8-\num_initial_blocks)
+.set j, (9-\num_initial_blocks)
+.rept \num_initial_blocks
+	xmmreg	i, %i
+	xmmreg	j, %j
+	pxor	xmmi, xmmj
+	GHASH_MUL xmmj, <\HASH_KEY>, <\T1>, <\T3>, <\T4>, <\T5>, <\T6>	// apply GHASH on \num_initial_blocks blocks
+.set i, (i+1)
+.set j, (j+1)
+.endr
+.noaltmacro
+
+	// \XMM8 has the current Hash Value
+	movdqa	\XMM8, \T3
+
+	cmp	$128, \LENGTH
+	jl	_initial_blocks_done_\@	// no need for precomputed constants
+
+////////////////////////////////////////////////////////////////////////////////
+// Haskey_i_k holds XORed values of the low and high parts of the Haskey_i
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM1
+		pshufb  SHUF_MASK(%rip), \XMM1	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM2
+		pshufb  SHUF_MASK(%rip), \XMM2	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM3
+		pshufb  SHUF_MASK(%rip), \XMM3	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM4
+		pshufb  SHUF_MASK(%rip), \XMM4	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM5
+		pshufb  SHUF_MASK(%rip), \XMM5	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM6
+		pshufb  SHUF_MASK(%rip), \XMM6	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM7
+		pshufb  SHUF_MASK(%rip), \XMM7	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM8
+		pshufb  SHUF_MASK(%rip), \XMM8	// perform a 16Byte swap
+
+		movdqu  16*0(\GDATA_KEY), \T_key
+		pxor	\T_key, \XMM1
+		pxor	\T_key, \XMM2
+		pxor	\T_key, \XMM3
+		pxor	\T_key, \XMM4
+		pxor	\T_key, \XMM5
+		pxor	\T_key, \XMM6
+		pxor	\T_key, \XMM7
+		pxor	\T_key, \XMM8
+
+.set i, 1
+.rept    NROUNDS			// do early (13) rounds (11 for GCM192)
+		movdqu  16*i(\GDATA_KEY), \T_key
+		aesenc	\T_key, \XMM1
+		aesenc	\T_key, \XMM2
+		aesenc	\T_key, \XMM3
+		aesenc	\T_key, \XMM4
+		aesenc	\T_key, \XMM5
+		aesenc	\T_key, \XMM6
+		aesenc	\T_key, \XMM7
+		aesenc	\T_key, \XMM8
+.set i, (i+1)
+.endr
+
+		movdqu	16*i(\GDATA_KEY), \T_key	// do final key round
+		aesenclast	\T_key, \XMM1
+		aesenclast	\T_key, \XMM2
+		aesenclast	\T_key, \XMM3
+		aesenclast	\T_key, \XMM4
+		aesenclast	\T_key, \XMM5
+		aesenclast	\T_key, \XMM6
+		aesenclast	\T_key, \XMM7
+		aesenclast	\T_key, \XMM8
+
+		XLDR	16*0(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM1
+		XSTR	\XMM1, 16*0(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM1
+		.endif
+
+		XLDR	16*1(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM2
+		XSTR	\XMM2, 16*1(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM2
+		.endif
+
+		XLDR	16*2(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM3
+		XSTR	\XMM3, 16*2(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM3
+		.endif
+
+		XLDR	16*3(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM4
+		XSTR	\XMM4, 16*3(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM4
+		.endif
+
+		XLDR	16*4(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM5
+		XSTR	\XMM5, 16*4(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM5
+		.endif
+
+		XLDR	16*5(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM6
+		XSTR	\XMM6, 16*5(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM6
+		.endif
+
+		XLDR	16*6(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM7
+		XSTR	\XMM7, 16*6(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM7
+		.endif
+
+		XLDR	16*7(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM8
+		XSTR	\XMM8, 16*7(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM8
+		.endif
+
+		add	$128, \DATA_OFFSET
+
+		pshufb  SHUF_MASK(%rip), \XMM1	// perform a 16Byte swap
+		pxor	\T3, \XMM1		// combine GHASHed value with the corresponding ciphertext
+		pshufb  SHUF_MASK(%rip), \XMM2	// perform a 16Byte swap
+		pshufb  SHUF_MASK(%rip), \XMM3	// perform a 16Byte swap
+		pshufb  SHUF_MASK(%rip), \XMM4	// perform a 16Byte swap
+		pshufb  SHUF_MASK(%rip), \XMM5	// perform a 16Byte swap
+		pshufb  SHUF_MASK(%rip), \XMM6	// perform a 16Byte swap
+		pshufb  SHUF_MASK(%rip), \XMM7	// perform a 16Byte swap
+		pshufb  SHUF_MASK(%rip), \XMM8	// perform a 16Byte swap
+
+////////////////////////////////////////////////////////////////////////////////
+
+_initial_blocks_done_\@:
+.noaltmacro
+.endm // INITIAL_BLOCKS
+
+
+////////////////////////////////////////////////////////////////////////////////
+// GHASH_8_ENCRYPT_8_PARALLEL: Encrypt 8 blocks at a time and ghash the 8
+// previously encrypted ciphertext blocks.
+// \GDATA (KEY), \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN are used as pointers only,
+// not modified.
+// \DATA_OFFSET is the data offset value
+////////////////////////////////////////////////////////////////////////////////
+.macro GHASH_8_ENCRYPT_8_PARALLEL GDATA, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
+				  DATA_OFFSET, T1, T2, T3, T4, T5, T6, CTR, \
+				  XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7, \
+				  XMM8, T7, loop_idx, ENC_DEC
+
+
+	movdqa	\XMM1, \T7
+	movdqu	\XMM2, TMP2(%rsp)
+	movdqu	\XMM3, TMP3(%rsp)
+	movdqu	\XMM4, TMP4(%rsp)
+	movdqu	\XMM5, TMP5(%rsp)
+	movdqu	\XMM6, TMP6(%rsp)
+	movdqu	\XMM7, TMP7(%rsp)
+	movdqu	\XMM8, TMP8(%rsp)
+
+	////////////////////////////////////////////////////////////////////////
+	//// Karatsuba Method
+
+	movdqa	\T7, \T4
+	pshufd	$0b01001110, \T7, \T6
+	pxor	\T7, \T6
+		.ifc \loop_idx, in_order
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		.else
+		paddd	ONEf(%rip), \CTR			// INCR CNT
+		.endif
+	movdqu	HashKey_8(\GDATA), \T5
+	pclmulqdq	 $0x11, \T5, \T4		// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T7			// \T7 = a0*b0
+	movdqu	HashKey_8_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T6			// \T2 = (a1+a0)*(b1+b0)
+		movdqa	\CTR, \XMM1
+
+		.ifc \loop_idx, in_order
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM2
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM3
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM4
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM5
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM6
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM7
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM8
+
+		pshufb	SHUF_MASK(%rip), \XMM1		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM2		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM3		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM4		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM5		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM6		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM7		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM8		// perform a 16Byte swap
+
+		.else	// .ifc \loop_idx, in_order
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM2
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM3
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM4
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM5
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM6
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM7
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM8
+
+		.endif	// .ifc \loop_idx, in_order
+	////////////////////////////////////////////////////////////////////////
+
+		movdqu	16*0(\GDATA), \T1
+		pxor	\T1, \XMM1
+		pxor	\T1, \XMM2
+		pxor	\T1, \XMM3
+		pxor	\T1, \XMM4
+		pxor	\T1, \XMM5
+		pxor	\T1, \XMM6
+		pxor	\T1, \XMM7
+		pxor	\T1, \XMM8
+
+	// \XMM6, \T5 hold the values for the two operands which are
+	// carry-less multiplied
+	////////////////////////////////////////////////////////////////////////
+	// Karatsuba Method
+	movdqu	TMP2(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey_7(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1			// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3			// \T3 = a0*b0
+	movdqu  HashKey_7_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2			// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T1, \T4				// accumulate the results in \T4:\T7, \T6 holds the middle part
+	pxor	\T3, \T7
+	pxor	\T2, \T6
+
+		movdqu	16*1(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*2(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+	////////////////////////////////////////////////////////////////////////
+	// Karatsuba Method
+	movdqu	TMP3(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey_6(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
+	movdqu  HashKey_6_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
+	pxor	\T3, \T7
+	pxor	\T2, \T6
+
+		movdqu	16*3(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+	movdqu	TMP4(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey_5(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
+	movdqu  HashKey_5_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
+	pxor	\T3, \T7
+	pxor	\T2, \T6
+
+		movdqu	16*4(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*5(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+	movdqu	TMP5(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey_4(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
+	movdqu  HashKey_4_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
+	pxor	\T3, \T7
+	pxor	\T2, \T6
+
+		movdqu	16*6(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+
+	movdqu	TMP6(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey_3(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
+	movdqu  HashKey_3_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
+	pxor	\T3, \T7
+	pxor	\T2, \T6
+
+		movdqu	16*7(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+	movdqu	TMP7(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey_2(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1			// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3			// \T3 = a0*b0
+	movdqu  HashKey_2_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2			// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T1, \T4				// accumulate the results in \T4:\T7, \T6 holds the middle part
+	pxor	\T3, \T7
+	pxor	\T2, \T6
+
+		movdqu	16*8(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+
+	// \XMM8, \T5 hold the values for the two operands which are
+	// carry-less multiplied.
+	////////////////////////////////////////////////////////////////////////
+	// Karatsuba Method
+	movdqu	TMP8(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
+	movdqu  HashKey_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T3, \T7
+	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
+
+		movdqu	16*9(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+
+#ifdef GCM128_MODE
+		movdqu	16*10(\GDATA), \T5
+#endif
+#ifdef GCM192_MODE
+		movdqu	16*10(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*11(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*12(\GDATA), \T5	// finish last key round
+#endif
+#ifdef GCM256_MODE
+		movdqu	16*10(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*11(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*12(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*13(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+	movdqu	16*14(\GDATA), \T5		// finish last key round
+#endif
+
+.altmacro
+.set i, 0
+.set j, 1
+.rept 8
+		xmmreg j, %j
+		XLDR	16*i(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+
+		.ifc \ENC_DEC, DEC
+		movdqa	\T1, \T3
+		.endif
+
+		pxor	\T5, \T1
+		aesenclast	\T1, xmmj				// XMM1:XMM8
+		XSTR	xmmj, 16*i(\CYPH_PLAIN_OUT, \DATA_OFFSET)	// Write to the Output buffer
+
+		.ifc \ENC_DEC, DEC
+		movdqa	\T3, xmmj
+		.endif
+.set i, (i+1)
+.set j, (j+1)
+.endr
+.noaltmacro
+
+	pxor	\T6, \T2
+	pxor	\T4, \T2
+	pxor	\T7, \T2
+
+
+	movdqa	\T2, \T3
+	pslldq	$8, \T3			// shift-L \T3 2 DWs
+	psrldq	$8, \T2			// shift-R \T2 2 DWs
+	pxor	\T3, \T7
+	pxor	\T2, \T4		// accumulate the results in \T4:\T7
+
+
+
+	//first phase of the reduction
+	movdqa	\T7, \T2
+	movdqa	\T7, \T3
+	movdqa	\T7, \T1		// move \T7 into \T2, \T3, \T1 in order to perform the three shifts independently
+
+	pslld	$31, \T2		// packed right shifting << 31
+	pslld	$30, \T3		// packed right shifting shift << 30
+	pslld	$25, \T1		// packed right shifting shift << 25
+	pxor	\T3, \T2		// xor the shifted versions
+	pxor	\T1, \T2
+
+	movdqa	\T2, \T5
+	psrldq	$4, \T5			// shift-R \T5 1 DW
+
+	pslldq	$12, \T2		// shift-L \T2 3 DWs
+	pxor	\T2, \T7		// first phase of the reduction complete
+
+	////////////////////////////////////////////////////////////////////////
+
+		pshufb	SHUF_MASK(%rip), \XMM1	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM2	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM3	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM4	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM5	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM6	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM7	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM8	// perform a 16Byte swap
+
+	//second phase of the reduction
+	movdqa	\T7, \T2		// make 3 copies of \T7 (in in \T2, \T3, \T1) for doing three shift operations
+	movdqa	\T7, \T3
+	movdqa	\T7, \T1
+
+	psrld	$1, \T2			// packed left shifting >> 1
+	psrld	$2, \T3			// packed left shifting >> 2
+	psrld	$7, \T1			// packed left shifting >> 7
+	pxor	\T3, \T2		// xor the shifted versions
+	pxor	\T1, \T2
+
+	pxor	\T5, \T2
+	pxor	\T2, \T7
+	pxor	\T4, \T7		// the result is in \T4
+
+
+	pxor    \T7, \XMM1
+
+.endm // GHASH_8_ENCRYPT_8_PARALLEL
+
+////////////////////////////////////////////////////////////////////////////////
+// GHASH_LAST_8: GHASH the last 8 ciphertext blocks.
+////////////////////////////////////////////////////////////////////////////////
+.macro	GHASH_LAST_8	GDATA, T1, T2, T3, T4, T5, T6, T7, \
+			XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7, XMM8
+
+
+	// Karatsuba Method
+	movdqa	\XMM1, \T6
+	pshufd	$0b01001110,  \XMM1, \T2
+	pxor	\XMM1, \T2
+	movdqu	HashKey_8(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T6		// \T6 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM1	// \XMM1 = a0*b0
+	movdqu	HashKey_8_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	movdqa	\XMM1, \T7
+	movdqa	\T2, \XMM1			// result in \T6, \T7, \XMM1
+
+	// Karatsuba Method
+	movdqa	\XMM2, \T1
+	pshufd	$0b01001110,  \XMM2, \T2
+	pxor	\XMM2, \T2
+	movdqu	HashKey_7(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM2	// \XMM2 = a0*b0
+	movdqu	HashKey_7_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM2, \T7
+	pxor	\T2, \XMM1			// results accumulated in \T6, \T7, \XMM1
+
+	// Karatsuba Method
+	movdqa	\XMM3, \T1
+	pshufd	$0b01001110,  \XMM3, \T2
+	pxor	\XMM3, \T2
+	movdqu	HashKey_6(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM3	// \XMM3 = a0*b0
+	movdqu	HashKey_6_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM3, \T7
+	pxor	\T2, \XMM1			// results accumulated in \T6, \T7, \XMM1
+
+	// Karatsuba Method
+	movdqa	\XMM4, \T1
+	pshufd	$0b01001110,  \XMM4, \T2
+	pxor	\XMM4, \T2
+	movdqu	HashKey_5(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM4	// \XMM4 = a0*b0
+	movdqu	HashKey_5_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM4, \T7
+	pxor	\T2, \XMM1			// results accumulated in \T6, \T7, \XMM1
+
+	// Karatsuba Method
+	movdqa	\XMM5, \T1
+	pshufd	$0b01001110,  \XMM5, \T2
+	pxor	\XMM5, \T2
+	movdqu	HashKey_4(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM5	// \XMM5 = a0*b0
+	movdqu	HashKey_4_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM5, \T7
+	pxor	\T2, \XMM1			// results accumulated in \T6, \T7, \XMM1
+
+	// Karatsuba Method
+	movdqa	\XMM6, \T1
+	pshufd	$0b01001110,  \XMM6, \T2
+	pxor	\XMM6, \T2
+	movdqu	HashKey_3(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM6	// \XMM6 = a0*b0
+	movdqu	HashKey_3_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM6, \T7
+	pxor	\T2, \XMM1			// results accumulated in \T6, \T7, \XMM1
+
+	// Karatsuba Method
+	movdqa	\XMM7, \T1
+	pshufd	$0b01001110,  \XMM7, \T2
+	pxor	\XMM7, \T2
+	movdqu	HashKey_2(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM7	// \XMM7 = a0*b0
+	movdqu	HashKey_2_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM7, \T7
+	pxor	\T2, \XMM1			// results accumulated in \T6, \T7, \XMM1
+
+
+	// Karatsuba Method
+	movdqa	\XMM8, \T1
+	pshufd	$0b01001110,  \XMM8, \T2
+	pxor	\XMM8, \T2
+	movdqu	HashKey(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM8	// \XMM8 = a0*b0
+	movdqu	HashKey_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM8, \T7
+	pxor	\XMM1, \T2
+	pxor	\T6, \T2
+	pxor	\T7, \T2		// middle section of the temp results combined as in Karatsuba algorithm
+
+
+	movdqa	\T2, \T4
+	pslldq	$8, \T4			// shift-L \T4 2 DWs
+	psrldq	$8, \T2			// shift-R \T2 2 DWs
+	pxor	\T4, \T7
+	pxor	\T2, \T6		// <\T6:\T7> holds the result of the accumulated carry-less multiplications
+
+
+	//first phase of the reduction
+	movdqa	\T7, \T2
+	movdqa	\T7, \T3
+	movdqa	\T7, \T4		// move \T7 into \T2, \T3, \T4 in order to perform the three shifts independently
+
+	pslld	$31, \T2		// packed right shifting << 31
+	pslld	$30, \T3		// packed right shifting shift << 30
+	pslld	$25, \T4		// packed right shifting shift << 25
+	pxor	\T3, \T2		// xor the shifted versions
+	pxor	\T4, \T2
+
+	movdqa	\T2, \T1
+	psrldq	$4, \T1			// shift-R \T1 1 DW
+
+	pslldq	$12, \T2		// shift-L \T2 3 DWs
+	pxor	\T2, \T7		// first phase of the reduction complete
+	////////////////////////////////////////////////////////////////////////
+
+	//second phase of the reduction
+	movdqa	\T7, \T2		// make 3 copies of \T7 (in in \T2, \T3, \T4) for doing three shift operations
+	movdqa	\T7, \T3
+	movdqa	\T7, \T4
+
+	psrld	$1, \T2			// packed left shifting >> 1
+	psrld	$2, \T3			// packed left shifting >> 2
+	psrld	$7, \T4			// packed left shifting >> 7
+	pxor	\T3, \T2		// xor the shifted versions
+	pxor	\T4, \T2
+
+	pxor	\T1, \T2
+	pxor	\T2, \T7
+	pxor	\T7, \T6		// the result is in \T6
+
+.endm // GHASH_LAST_8
+
+////////////////////////////////////////////////////////////////////////////////
+// ENCRYPT_SINGLE_BLOCK: Encrypt a single block.
+////////////////////////////////////////////////////////////////////////////////
+.macro ENCRYPT_SINGLE_BLOCK	GDATA, ST, T1
+
+		movdqu	16*0(\GDATA), \T1
+		pxor	\T1, \ST
+
+.set i, 1
+.rept NROUNDS
+		movdqu	16*i(\GDATA), \T1
+		aesenc	\T1, \ST
+
+.set i, (i+1)
+.endr
+		movdqu	16*i(\GDATA), \T1
+		aesenclast	\T1, \ST
+.endm // ENCRYPT_SINGLE_BLOCK
+
+
+////////////////////////////////////////////////////////////////////////////////
+// FUNC_SAVE: Save clobbered regs on the stack.
+////////////////////////////////////////////////////////////////////////////////
+.macro FUNC_SAVE
+	//// Required for Update/GMC_ENC
+	//the number of pushes must equal STACK_OFFSET
+	push    %r12
+	push    %r13
+	push    %r14
+	push    %r15
+	push    %rsi
+	mov	%rsp, %r14
+
+	sub	$(VARIABLE_OFFSET), %rsp
+	and	$~63, %rsp
+
+#if __OUTPUT_FORMAT__ == win64
+	// xmm6:xmm15 need to be maintained for Windows
+	movdqu	%xmm6, (LOCAL_STORAGE + 0*16)(%rsp)
+	movdqu	%xmm7, (LOCAL_STORAGE + 1*16)(%rsp)
+	movdqu	%xmm8, (LOCAL_STORAGE + 2*16)(%rsp)
+	movdqu	%xmm9, (LOCAL_STORAGE + 3*16)(%rsp)
+	movdqu	%xmm10, (LOCAL_STORAGE + 4*16)(%rsp)
+	movdqu	%xmm11, (LOCAL_STORAGE + 5*16)(%rsp)
+	movdqu	%xmm12, (LOCAL_STORAGE + 6*16)(%rsp)
+	movdqu	%xmm13, (LOCAL_STORAGE + 7*16)(%rsp)
+	movdqu	%xmm14, (LOCAL_STORAGE + 8*16)(%rsp)
+	movdqu	%xmm15, (LOCAL_STORAGE + 9*16)(%rsp)
+
+	mov	arg(5), arg5  // XXXX [r14 + STACK_OFFSET + 8*5]
+#endif
+.endm // FUNC_SAVE
+
+////////////////////////////////////////////////////////////////////////////////
+// FUNC_RESTORE: Restore clobbered regs from the stack.
+////////////////////////////////////////////////////////////////////////////////
+.macro FUNC_RESTORE
+
+#if __OUTPUT_FORMAT__ == win64
+	movdqu	(LOCAL_STORAGE + 9*16)(%rsp), %xmm15
+	movdqu	(LOCAL_STORAGE + 8*16)(%rsp), %xmm14
+	movdqu	(LOCAL_STORAGE + 7*16)(%rsp), %xmm13
+	movdqu	(LOCAL_STORAGE + 6*16)(%rsp), %xmm12
+	movdqu	(LOCAL_STORAGE + 5*16)(%rsp), %xmm11
+	movdqu	(LOCAL_STORAGE + 4*16)(%rsp), %xmm10
+	movdqu	(LOCAL_STORAGE + 3*16)(%rsp), %xmm9
+	movdqu	(LOCAL_STORAGE + 2*16)(%rsp), %xmm8
+	movdqu	(LOCAL_STORAGE + 1*16)(%rsp), %xmm7
+	movdqu	(LOCAL_STORAGE + 0*16)(%rsp), %xmm6
+#endif
+
+	// Required for Update/GMC_ENC
+	mov	%r14, %rsp
+	pop	%rsi
+	pop	%r15
+	pop	%r14
+	pop	%r13
+	pop	%r12
+.endm // FUNC_RESTORE
+
+
+////////////////////////////////////////////////////////////////////////////////
+// GCM_INIT: Initializes a gcm_context_data struct to prepare for
+// encoding/decoding.
+// Input: gcm_key_data * (GDATA_KEY), gcm_context_data *(GDATA_CTX), IV,
+// Additional Authentication data (A_IN), Additional Data length (A_LEN).
+// Output: Updated GDATA_CTX with the hash of A_IN (AadHash) and initialized
+// other parts of GDATA.
+// Clobbers rax, r10-r13 and xmm0-xmm6
+////////////////////////////////////////////////////////////////////////////////
+.macro  GCM_INIT	GDATA_KEY, GDATA_CTX, IV, A_IN, A_LEN
+
+#define AAD_HASH	%xmm0
+#define SUBHASH		%xmm1
+
+	movdqu  HashKey(\GDATA_KEY), SUBHASH
+
+	CALC_AAD_HASH \A_IN, \A_LEN, AAD_HASH, SUBHASH, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %r10, %r11, %r12, %r13, %rax
+	pxor	%xmm3, %xmm2
+	mov	\A_LEN, %r10
+
+	movdqu	AAD_HASH, AadHash(\GDATA_CTX)	// ctx_data.aad hash = aad_hash
+	mov	%r10, AadLen(\GDATA_CTX)		// ctx_data.aad_length = aad_length
+	xor	%r10, %r10
+	mov	%r10, InLen(\GDATA_CTX)		// ctx_data.in_length = 0
+	mov	 %r10, PBlockLen(\GDATA_CTX)	// ctx_data.partial_block_length = 0
+	movdqu	%xmm2, PBlockEncKey(\GDATA_CTX)	// ctx_data.partial_block_enc_key = 0
+	mov	\IV, %r10
+	movdqa  ONEf(%rip), %xmm2			// read 12 IV bytes and pad with 0x00000001
+	pinsrq  $0, (%r10), %xmm2
+	pinsrd  $2, 8(%r10), %xmm2
+	movdqu	%xmm2, OrigIV(\GDATA_CTX)	// ctx_data.orig_IV = iv
+
+	pshufb	SHUF_MASK(%rip), %xmm2
+
+	movdqu	%xmm2, CurCount(\GDATA_CTX)	// ctx_data.current_counter = iv
+.endm // GCM_INIT
+
+
+////////////////////////////////////////////////////////////////////////////////
+// GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed
+// gcm_context_data struct has been initialized by GCM_INIT.
+// Requires the input data be at least 1 byte long because of
+// READ_SMALL_INPUT_DATA.
+// Input: gcm_key_data * (GDATA_KEY), gcm_context_data (GDATA_CTX),
+// input text (PLAIN_CYPH_IN), input text length (PLAIN_CYPH_LEN) and whether
+// encoding or decoding (ENC_DEC).
+// Output: A cypher of the given plain text (CYPH_PLAIN_OUT), and updated
+// GDATA_CTX
+// Clobbers rax, r10-r15, and xmm0-xmm15
+////////////////////////////////////////////////////////////////////////////////
+.macro	GCM_ENC_DEC	GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
+			PLAIN_CYPH_LEN, ENC_DEC
+
+#define	DATA_OFFSET		%r11
+
+	// clang compat: no local support
+	// LOCAL _initial_num_blocks_is_7, _initial_num_blocks_is_6
+	// LOCAL _initial_num_blocks_is_5, _initial_num_blocks_is_4
+	// LOCAL _initial_num_blocks_is_3, _initial_num_blocks_is_2
+	// LOCAL _initial_num_blocks_is_1, _initial_num_blocks_is_0
+	// LOCAL _initial_blocks_encrypted, _encrypt_by_8_new, _encrypt_by_8
+	// LOCAL _eight_cipher_left, _zero_cipher_left, _large_enough_update
+	// LOCAL _data_read, _less_than_8_bytes_left, _multiple_of_16_bytes
+
+// Macro flow:
+// calculate the number of 16byte blocks in the message
+// process (number of 16byte blocks) mod 8 '_initial_num_blocks_is_# .. _initial_blocks_encrypted'
+// process 8 16 byte blocks at a time until all are done '_encrypt_by_8_new .. _eight_cipher_left'
+// if there is a block of less tahn 16 bytes process it '_zero_cipher_left .. _multiple_of_16_bytes'
+
+	cmp	$0, \PLAIN_CYPH_LEN
+	je	_multiple_of_16_bytes_\@
+
+	xor	DATA_OFFSET, DATA_OFFSET
+	add	\PLAIN_CYPH_LEN, InLen(\GDATA_CTX)	//Update length of data processed
+	movdqu	HashKey(\GDATA_KEY), %xmm13		// xmm13 = HashKey
+	movdqu	AadHash(\GDATA_CTX), %xmm8
+
+
+	PARTIAL_BLOCK \GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, \PLAIN_CYPH_LEN, DATA_OFFSET, %xmm8, \ENC_DEC
+
+	mov	\PLAIN_CYPH_LEN, %r13                                // save the number of bytes of plaintext/ciphertext
+	sub	DATA_OFFSET, %r13
+	mov	%r13, %r10		//save the amount of data left to process in r10
+	and     $-16, %r13		// r13 = r13 - (r13 mod 16)
+
+	mov	%r13, %r12
+	shr	$4, %r12
+	and	$7, %r12
+	jz      _initial_num_blocks_is_0_\@
+
+
+	cmp	$7, %r12
+	je      _initial_num_blocks_is_7_\@
+	cmp	$6, %r12
+	je      _initial_num_blocks_is_6_\@
+	cmp	$5, %r12
+	je      _initial_num_blocks_is_5_\@
+	cmp	$4, %r12
+	je      _initial_num_blocks_is_4_\@
+	cmp	$3, %r12
+	je      _initial_num_blocks_is_3_\@
+	cmp	$2, %r12
+	je      _initial_num_blocks_is_2_\@
+
+	jmp     _initial_num_blocks_is_1_\@
+
+_initial_num_blocks_is_7_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 7, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*7), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_6_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 6, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*6), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_5_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 5, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*5), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_4_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 4, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*4), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_3_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 3, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*3), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_2_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 2, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*2), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_1_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 1, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*1), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_0_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 0, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+
+_initial_blocks_encrypted_\@:
+	cmp	$0, %r13
+	je      _zero_cipher_left_\@
+
+	sub	$128, %r13
+	je      _eight_cipher_left_\@
+
+	movd	%xmm9, %r15d
+	and	$255, %r15d
+	pshufb	SHUF_MASK(%rip), %xmm9
+
+
+_encrypt_by_8_new_\@:
+	cmp	$(255-8), %r15d
+	jg      _encrypt_by_8_\@
+
+	add	$8, %r15b
+	GHASH_8_ENCRYPT_8_PARALLEL	\GDATA_KEY, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, DATA_OFFSET, %xmm0, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm15, out_order, \ENC_DEC
+	add	$128, DATA_OFFSET
+	sub	$128, %r13
+	jne     _encrypt_by_8_new_\@
+
+	pshufb	SHUF_MASK(%rip), %xmm9
+	jmp     _eight_cipher_left_\@
+
+_encrypt_by_8_\@:
+	pshufb	SHUF_MASK(%rip), %xmm9
+	add	$8, %r15b
+
+	GHASH_8_ENCRYPT_8_PARALLEL	\GDATA_KEY, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, DATA_OFFSET, %xmm0, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm15, in_order, \ENC_DEC
+	pshufb	SHUF_MASK(%rip), %xmm9
+	add	$128, DATA_OFFSET
+	sub	$128, %r13
+	jne     _encrypt_by_8_new_\@
+
+	pshufb	SHUF_MASK(%rip), %xmm9
+
+
+
+_eight_cipher_left_\@:
+	GHASH_LAST_8	\GDATA_KEY, %xmm0, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8
+
+
+_zero_cipher_left_\@:
+	movdqu	%xmm14, AadHash(\GDATA_CTX) 
+	movdqu	%xmm9, CurCount(\GDATA_CTX) 
+
+	mov	%r10, %r13
+	and	$15, %r13			// r13 = (\PLAIN_CYPH_LEN mod 16)
+
+	je      _multiple_of_16_bytes_\@
+
+	mov	%r13, PBlockLen(\GDATA_CTX)		// my_ctx.data.partial_blck_length = r13
+	// handle the last <16 Byte block seperately
+
+	paddd	ONE(%rip), %xmm9			// INCR CNT to get Yn
+	movdqu	%xmm9, CurCount(\GDATA_CTX)	// my_ctx.data.current_counter = xmm9
+	pshufb  SHUF_MASK(%rip), %xmm9
+	ENCRYPT_SINGLE_BLOCK	\GDATA_KEY, %xmm9, %xmm2	// E(K, Yn)
+	movdqu	%xmm9, PBlockEncKey(\GDATA_CTX)	// my_ctx_data.partial_block_enc_key = xmm9
+
+	cmp	$16, \PLAIN_CYPH_LEN
+	jge	_large_enough_update_\@
+
+	lea	(\PLAIN_CYPH_IN, DATA_OFFSET), %r10
+	READ_SMALL_DATA_INPUT	%xmm1, %r10, %r13, %r12, %r15, %rax
+	lea	(SHIFT_MASK + 16)(%rip), %r12
+	sub	%r13, %r12
+	jmp	_data_read_\@
+
+_large_enough_update_\@:
+	sub	$16, DATA_OFFSET
+	add	%r13, DATA_OFFSET
+
+	movdqu	(\PLAIN_CYPH_IN, DATA_OFFSET), %xmm1	// receive the last <16 Byte block
+
+	sub	%r13, DATA_OFFSET
+	add	$16, DATA_OFFSET
+
+	lea	(SHIFT_MASK + 16)(%rip), %r12
+	sub	%r13, %r12			// adjust the shuffle mask pointer to be able to shift 16-r13 bytes (r13 is the number of bytes in plaintext mod 16)
+	movdqu	(%r12), %xmm2			// get the appropriate shuffle mask
+	pshufb	%xmm2, %xmm1			// shift right 16-r13 bytes
+_data_read_\@:
+	.ifc  \ENC_DEC, DEC
+
+	movdqa	%xmm1, %xmm2
+	pxor	%xmm1, %xmm9			// Plaintext XOR E(K, Yn)
+	movdqu	(ALL_F - SHIFT_MASK)(%r12), %xmm1	// get the appropriate mask to mask out top 16-r13 bytes of xmm9
+	pand	%xmm1, %xmm9			// mask out top 16-r13 bytes of xmm9
+	pand	%xmm1, %xmm2
+	pshufb	SHUF_MASK(%rip), %xmm2
+	pxor	%xmm2, %xmm14
+	movdqu	%xmm14, AadHash(\GDATA_CTX)
+
+	.else	// .ifc  \ENC_DEC, DEC
+
+	pxor	%xmm1, %xmm9			// Plaintext XOR E(K, Yn)
+	movdqu	(ALL_F - SHIFT_MASK)(%r12), %xmm1	// get the appropriate mask to mask out top 16-r13 bytes of xmm9
+	pand	%xmm1, %xmm9			// mask out top 16-r13 bytes of xmm9
+	pshufb	SHUF_MASK(%rip), %xmm9
+	pxor	%xmm9, %xmm14
+	movdqu	%xmm14, AadHash(\GDATA_CTX)
+
+	pshufb	SHUF_MASK(%rip), %xmm9		// shuffle xmm9 back to output as ciphertext
+
+	.endif	// .ifc  \ENC_DEC, DEC
+
+
+	//////////////////////////////////////////////////////////
+	// output r13 Bytes
+	movq	%xmm9, %rax
+	cmp	$8, %r13
+	jle     _less_than_8_bytes_left_\@
+
+	mov	%rax, (\CYPH_PLAIN_OUT, DATA_OFFSET)
+	add	$8, DATA_OFFSET
+	psrldq	$8, %xmm9
+	movq	%xmm9, %rax
+	sub	$8, %r13
+
+_less_than_8_bytes_left_\@:
+	movb	%al, (\CYPH_PLAIN_OUT, DATA_OFFSET)
+	add     $1, DATA_OFFSET
+	shr	$8, %rax
+	sub	$1, %r13
+	jne     _less_than_8_bytes_left_\@
+	//////////////////////////////////////////////////////////
+
+_multiple_of_16_bytes_\@:
+
+.endm // GCM_ENC_DEC
+
+
+////////////////////////////////////////////////////////////////////////////////
+// GCM_COMPLETE: Finishes Encyrption/Decryption of last partial block after
+// GCM_UPDATE finishes.
+// Input: A gcm_key_data * (GDATA_KEY), gcm_context_data * (GDATA_CTX) and
+// whether encoding or decoding (ENC_DEC).
+// Output: Authorization Tag (AUTH_TAG) and Authorization Tag length
+// (AUTH_TAG_LEN)
+// Clobbers %rax, r10-r12, and xmm0, xmm1, xmm5, xmm6, xmm9, xmm11, xmm14, xmm15
+////////////////////////////////////////////////////////////////////////////////
+.macro	GCM_COMPLETE	GDATA_KEY, GDATA_CTX, AUTH_TAG, AUTH_TAG_LEN, ENC_DEC
+
+#define	PLAIN_CYPH_LEN	%rax
+
+	// clang compat: no local support
+	// LOCAL _partial_done, _return_T, _T_8, _T_12, _T_16, _return_T_done
+
+	mov	PBlockLen(\GDATA_CTX), %r12	// r12 = aadLen (number of bytes)
+	movdqu	AadHash(\GDATA_CTX), %xmm14
+	movdqu	HashKey(\GDATA_KEY), %xmm13
+
+	cmp	$0, %r12
+
+	je _partial_done_\@
+
+	GHASH_MUL %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6 //GHASH computation for the last <16 Byte block
+	movdqu	%xmm14, AadHash(\GDATA_CTX)
+
+_partial_done_\@:
+
+	mov	AadLen(\GDATA_CTX), %r12		// r12 = aadLen (number of bytes)
+	mov	InLen(\GDATA_CTX), PLAIN_CYPH_LEN
+
+	shl	$3, %r12			// convert into number of bits
+	movd	%r12d, %xmm15		// len(A) in xmm15
+
+	shl	$3, PLAIN_CYPH_LEN	// len(C) in bits  (*128)
+	movq	PLAIN_CYPH_LEN, %xmm1
+	pslldq	$8, %xmm15		// xmm15 = len(A)|| 0x0000000000000000
+	pxor	%xmm1, %xmm15		// xmm15 = len(A)||len(C)
+
+	pxor	%xmm15, %xmm14
+	GHASH_MUL %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6    // final GHASH computation
+	pshufb	SHUF_MASK(%rip), %xmm14		// perform a 16Byte swap
+	movdqu	OrigIV(\GDATA_CTX), %xmm9	// xmm9 = Y0
+
+	ENCRYPT_SINGLE_BLOCK	\GDATA_KEY, %xmm9, %xmm2	// E(K, Y0)
+
+	pxor	%xmm14, %xmm9
+
+_return_T_\@:
+	mov	\AUTH_TAG, %r10			// r10 = authTag
+	mov	\AUTH_TAG_LEN, %r11		// r11 = auth_tag_len
+
+	cmp	$16, %r11
+	je      _T_16_\@
+
+	cmp	$12, %r11
+	je      _T_12_\@
+
+_T_8_\@:
+	movq	%xmm9, %rax
+	mov	%rax, (%r10)
+	jmp     _return_T_done_\@
+
+_T_12_\@:
+	movq	%xmm9, %rax
+	mov	%rax, (%r10)
+	psrldq	$8, %xmm9
+	movd	%xmm9, %eax
+	mov	%eax, 8(%r10)
+	jmp     _return_T_done_\@
+
+_T_16_\@:
+	movdqu	%xmm9, (%r10)
+
+_return_T_done_\@:
+.endm //GCM_COMPLETE
+
+
+#if 1
+
+	.balign 16
+////////////////////////////////////////////////////////////////////////////////
+//void	aes_gcm_precomp_{128,256}_sse
+//        (struct gcm_key_data *key_data);
+////////////////////////////////////////////////////////////////////////////////
+#if FUNCT_EXTENSION != _nt
+.global FN_NAME(precomp,_)
+FN_NAME(precomp,_):
+
+	endbranch
+
+	push	%r12
+	push	%r13
+	push	%r14
+	push	%r15
+
+	mov     %rsp, %r14
+
+	sub	$(VARIABLE_OFFSET), %rsp
+	and	$(~63), %rsp				// align rsp to 64 bytes
+
+#if __OUTPUT_FORMAT__ == win64
+	// only xmm6 needs to be maintained
+	movdqu	%xmm6, (LOCAL_STORAGE + 0*16)(%rsp)
+#endif
+
+	pxor	%xmm6, %xmm6
+	ENCRYPT_SINGLE_BLOCK	arg1, %xmm6, %xmm2	// xmm6 = HashKey
+
+	pshufb	SHUF_MASK(%rip), %xmm6
+	///////////////  PRECOMPUTATION of HashKey<<1 mod poly from the HashKey
+	movdqa	%xmm6, %xmm2
+	psllq	$1, %xmm6
+	psrlq	$63, %xmm2
+	movdqa	%xmm2, %xmm1
+	pslldq	$8, %xmm2
+	psrldq	$8, %xmm1
+	por	%xmm2, %xmm6
+
+	//reduction
+	pshufd	$0b00100100, %xmm1, %xmm2
+	pcmpeqd	TWOONE(%rip), %xmm2
+	pand	POLY(%rip), %xmm2
+	pxor	%xmm2, %xmm6		// xmm6 holds the HashKey<<1 mod poly
+	///////////////////////////////////////////////////////////////////////
+	movdqu	 %xmm6, HashKey(arg1)	// store HashKey<<1 mod poly
+
+	PRECOMPUTE  arg1, %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5
+
+#if __OUTPUT_FORMAT__ == win64
+	movdqu (LOCAL_STORAGE + 0*16)(%rsp), %xmm6
+#endif
+	mov	%r14, %rsp
+
+	pop	%r15
+	pop	%r14
+	pop	%r13
+	pop	%r12
+	ret
+#endif	// _nt
+
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_init_128_sse / aes_gcm_init_256_sse (
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *iv,
+//        const   u8 *aad,
+//        u64     aad_len);
+////////////////////////////////////////////////////////////////////////////////
+#if FUNCT_EXTENSION != _nt
+.global FN_NAME(init,_)
+FN_NAME(init,_):
+	endbranch
+
+	push	%r12
+	push	%r13
+#if __OUTPUT_FORMAT__ == win64
+	push	arg5
+	sub	$(1*16), %rsp
+	movdqu	%xmm6, (0*16)(%rsp)
+	mov	(1*16 + 8*3 + 8*5)(%rsp), arg5
+#endif
+
+	GCM_INIT arg1, arg2, arg3, arg4, arg5
+
+#if __OUTPUT_FORMAT__ == win64
+	movdqu	(0*16)(%rsp), %xmm6
+	add	$(1*16), %rsp
+	pop	arg5
+#endif
+	pop	%r13
+	pop	%r12
+	ret
+#endif	// _nt
+
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_enc_128_update_sse / aes_gcm_enc_256_update_sse
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *out,
+//        const   u8 *in,
+//        u64     plaintext_len);
+////////////////////////////////////////////////////////////////////////////////
+.global FN_NAME(enc,_update_)
+FN_NAME(enc,_update_):
+	endbranch
+
+	FUNC_SAVE
+
+	GCM_ENC_DEC arg1, arg2, arg3, arg4, arg5, ENC
+
+	FUNC_RESTORE
+
+	ret
+
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_dec_256_update_sse / aes_gcm_dec_256_update_sse
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *out,
+//        const   u8 *in,
+//        u64     plaintext_len);
+////////////////////////////////////////////////////////////////////////////////
+.global FN_NAME(dec,_update_)
+FN_NAME(dec,_update_):
+	endbranch
+
+	FUNC_SAVE
+
+	GCM_ENC_DEC arg1, arg2, arg3, arg4, arg5, DEC
+
+	FUNC_RESTORE
+
+	ret
+
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_enc_128_finalize_sse / aes_gcm_enc_256_finalize_sse
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *auth_tag,
+//        u64     auth_tag_len);
+////////////////////////////////////////////////////////////////////////////////
+#if FUNCT_EXTENSION != _nt
+.global FN_NAME(enc,_finalize_)
+FN_NAME(enc,_finalize_):
+
+	endbranch
+
+	push	%r12
+
+#if __OUTPUT_FORMAT__ == win64
+	// xmm6:xmm15 need to be maintained for Windows
+	sub	$(5*16), %rsp
+	movdqu	%xmm6, (0*16)(%rsp)
+	movdqu	%xmm9, (1*16)(%rsp)
+	movdqu	%xmm11, (2*16)(%rsp)
+	movdqu	%xmm14, (3*16)(%rsp)
+	movdqu	%xmm15, (4*16)(%rsp)
+#endif
+	GCM_COMPLETE	arg1, arg2, arg3, arg4, ENC
+
+#if __OUTPUT_FORMAT__ == win64
+	movdqu	(4*16)(%rsp), %xmm15
+	movdqu	(3*16)(%rsp), %xmm14
+	movdqu	(2*16)(%rsp), %xmm11
+	movdqu	(1*16)(%rsp), %xmm9
+	movdqu	(0*16)(%rsp), %xmm6
+	add	$(5*16), %rsp
+#endif
+
+	pop	%r12
+	ret
+#endif	// _nt
+
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_dec_128_finalize_sse / aes_gcm_dec_256_finalize_sse
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *auth_tag,
+//        u64     auth_tag_len);
+////////////////////////////////////////////////////////////////////////////////
+#if FUNCT_EXTENSION != _nt
+.global FN_NAME(dec,_finalize_)
+FN_NAME(dec,_finalize_):
+
+	endbranch
+
+	push	%r12
+
+#if __OUTPUT_FORMAT == win64
+	// xmm6:xmm15 need to be maintained for Windows
+	sub	$(5*16), %rsp
+	movdqu	%xmm6, (0*16)(%rsp)
+	movdqu	%xmm9, (1*16)(%rsp)
+	movdqu	%xmm11, (2*16)(%rsp)
+	movdqu	%xmm14, (3*16)(%rsp)
+	movdqu	%xmm15, (4*16)(%rsp)
+#endif
+	GCM_COMPLETE	arg1, arg2, arg3, arg4, DEC
+
+#if __OUTPUT_FORMAT__ == win64
+	movdqu	(4*16)(%rsp), %xmm15
+	movdqu	(3*16)(%rsp), %xmm14
+	movdqu	(2*16)(%rsp), %xmm11
+	movdqu	(1*16)(%rsp), %xmm9
+	movdqu	(0*16)(%rsp), %xmm6
+	add	$(5*16), %rsp
+#endif
+
+	pop	%r12
+	ret
+#endif	// _nt
+
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_enc_128_sse / aes_gcm_enc_256_sse
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *out,
+//        const   u8 *in,
+//        u64     plaintext_len,
+//        u8      *iv,
+//        const   u8 *aad,
+//        u64     aad_len,
+//        u8      *auth_tag,
+//        u64     auth_tag_len)//
+////////////////////////////////////////////////////////////////////////////////
+.global FN_NAME(enc,_)
+FN_NAME(enc,_):
+	endbranch
+
+	FUNC_SAVE
+
+	GCM_INIT arg1, arg2, arg6, arg7, arg8
+
+	GCM_ENC_DEC  arg1, arg2, arg3, arg4, arg5, ENC
+
+	GCM_COMPLETE arg1, arg2, arg9, arg10, ENC
+	FUNC_RESTORE
+
+	ret
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_dec_128_sse / aes_gcm_dec_256_sse
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *out,
+//        const   u8 *in,
+//        u64     plaintext_len,
+//        u8      *iv,
+//        const   u8 *aad,
+//        u64     aad_len,
+//        u8      *auth_tag,
+//        u64     auth_tag_len)//
+////////////////////////////////////////////////////////////////////////////////
+.global FN_NAME(dec,_)
+FN_NAME(dec,_):
+	endbranch
+
+	FUNC_SAVE
+
+	GCM_INIT arg1, arg2, arg6, arg7, arg8
+
+	GCM_ENC_DEC  arg1, arg2, arg3, arg4, arg5, DEC
+
+	GCM_COMPLETE arg1, arg2, arg9, arg10, DEC
+	FUNC_RESTORE
+
+	ret
+
+.global FN_NAME(this_is_gas,_)
+FN_NAME(this_is_gas,_):
+	endbranch
+	FUNC_SAVE
+	FUNC_RESTORE
+	ret
+
+#else
+	// GAS doesnt't provide the linenuber in the macro
+	////////////////////////
+	// GHASH_MUL xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6
+	// PRECOMPUTE rax, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6
+	// READ_SMALL_DATA_INPUT xmm1, r10, 8, rax, r12, r15
+	// ENCRYPT_SINGLE_BLOCK rax, xmm0, xmm1
+	// INITIAL_BLOCKS rdi,rsi,rdx,rcx,r13,r11,7,xmm12,xmm13,xmm14,xmm15,xmm11,xmm9,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,xmm7,xmm8,xmm10,xmm0,ENC
+	// CALC_AAD_HASH [r14+8*5+8*1],[r14+8*5+8*2],xmm0,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,r10,r11,r12,r13,rax
+	// READ_SMALL_DATA_INPUT	xmm2, r10, r11, r12, r13, rax
+	// PARTIAL_BLOCK  rdi,rsi,rdx,rcx,r8,r11,xmm8,ENC
+	// GHASH_8_ENCRYPT_8_PARALLEL rdi,rdx,rcx,r11,xmm0,xmm10,xmm11,xmm12,xmm13,xmm14,xmm9,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,xmm7,xmm8,xmm15,out_order,ENC
+	//GHASH_LAST_8 rdi,xmm0,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,xmm7,xmm8
+#endif
diff --git a/contrib/icp/gcm-simd/isa-l_crypto-ported/reg_sizes.S b/contrib/icp/gcm-simd/isa-l_crypto-ported/reg_sizes.S
new file mode 100644
index 000000000000..0b63dbd2a0ef
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto-ported/reg_sizes.S
@@ -0,0 +1,224 @@
+////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+//  Copyright(c) 2011-2019 Intel Corporation All rights reserved.
+//
+//  Redistribution and use in source and binary forms, with or without
+//  modification, are permitted provided that the following conditions
+//  are met:
+//    * Redistributions of source code must retain the above copyright
+//      notice, this list of conditions and the following disclaimer.
+//    * Redistributions in binary form must reproduce the above copyright
+//      notice, this list of conditions and the following disclaimer in
+//      the documentation and/or other materials provided with the
+//      distribution.
+//    * Neither the name of Intel Corporation nor the names of its
+//      contributors may be used to endorse or promote products derived
+//      from this software without specific prior written permission.
+//
+//  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+//  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+//  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+//  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+//  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+//  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+//  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES// LOSS OF USE,
+//  DATA, OR PROFITS// OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+//  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+//  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+//  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// Port to GNU as and translation to GNU as att-syntax
+// Copyright(c) 2023 Attila Fülöp <attila@fueloep.org>
+
+#ifndef _REG_SIZES_ASM_
+#define _REG_SIZES_ASM_
+
+
+// define d, w  and b variants for registers
+
+.macro dwordreg reg
+	.if \reg == %r8 || \reg == %r9 || \reg == %r10 || \reg == %r11 || \reg == %r12 || \reg == %r13 || \reg == %r14 || \reg == %r15
+		.set dreg, \reg\()d
+	.elseif \reg == %rax
+		.set dreg, %eax
+	.elseif \reg == %rcx
+		.set dreg, %ecx
+	.elseif \reg == %rdx
+		.set dreg, %edx
+	.elseif \reg == %rbx
+		.set dreg, %ebx
+	.elseif \reg == %rsp
+		.set dreg, %esp
+	.elseif \reg == %rbp
+		.set dreg, %ebp
+	.elseif \reg == %rsi
+		.set dreg, %esi
+	.elseif \reg == %rdi
+		.set dreg, %edi
+	.else
+		.error "Invalid register '\reg\()' while expanding macro 'dwordreg\()'"
+	.endif
+.endm
+
+.macro wordreg reg
+	.if \reg == %r8 || \reg == %r9 || \reg == %r10 || \reg == %r11 || \reg == %r12 || \reg == %r13 || \reg == %r14 || \reg == %r15
+		.set wreg, \reg\()w
+	.elseif \reg == %rax
+		.set wreg, %ax
+	.elseif \reg == %rcx
+		.set wreg, %cx
+	.elseif \reg == %rdx
+		.set wreg, %dx
+	.elseif \reg == %rbx
+		.set wreg, %bx
+	.elseif \reg == %rsp
+		.set wreg, %sp
+	.elseif \reg == %rbp
+		.set wreg, %bp
+	.elseif \reg == %rsi
+		.set wreg, %si
+	.elseif \reg == %rdi
+		.set wreg, %di
+	.else
+		.error "Invalid register '\reg\()' while expanding macro 'wordreg\()'"
+	.endif
+.endm
+
+
+.macro bytereg reg
+	.if \reg == %r8 || \reg == %r9 || \reg == %r10 || \reg == %r11 || \reg == %r12 || \reg == %r13 || \reg == %r14 || \reg == %r15
+		.set breg, \reg\()b
+	.elseif \reg == %rax
+		.set breg, %al
+	.elseif \reg == %rcx
+		.set breg, %cl
+	.elseif \reg == %rdx
+		.set breg, %dl
+	.elseif \reg == %rbx
+		.set breg, %bl
+	.elseif \reg == rsp
+		.set breg, %spl
+	.elseif \reg == %rbp
+		.set breg, %bpl
+	.elseif \reg == rsi
+		.set breg, %sil
+	.elseif \reg == rdi
+		.set breg, %dil
+	.else
+		.error "Invalid register '\reg\()' while expanding macro 'bytereg\()'"
+	.endif
+.endm
+
+// clang compat: Below won't owrk with clang; do it a bit different
+// #define ZERO_TO_THIRTYONE \
+// 	0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16, \
+// 	17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
+
+// .macro xword reg
+// 	.irep i, ZERO_TO_THIRTYONE
+// 		.if \reg == %xmm\i || \reg == %ymm\i || \reg == %zmm\i
+// 			.set xmmreg, %xmm\i
+// 		.endif
+// 	.endr
+// .endm
+
+// .macro yword reg
+// 	.irep i, ZERO_TO_THIRTYONE
+// 		.if \reg == %xmm\i || \reg == %ymm\i || \reg == %zmm\i
+// 			.set ymmreg, %ymm\i
+// 		.endif
+// 	.endr
+// .endm
+
+// .macro zword reg
+// 	.irep i, ZERO_TO_THIRTYONE
+// 		.if \reg == %xmm\i || \reg == %ymm\i || \reg == %zmm\i
+// 			.set zmmreg, %zmm\i
+// 		.endif
+// 	.endr
+// .endm
+
+// Example usage:
+// xword %zmm12
+// pxor xmmreg, xmmreg // =>  pxor %xmm12, %xmm12
+.macro xword reg
+	.set i, 0
+	.rep 32
+		.altmacro
+		do_xyzword <\reg>, xmm, %i
+		.noaltmacro
+	.set i, (i+1)
+	.endr
+.endm
+
+.macro yword reg
+	.set i, 0
+	.rep 32
+		.altmacro
+		do_xyzword <\reg>, ymm, %i
+		.noaltmacro
+	.set i, (i+1)
+	.endr
+.endm
+
+.macro zword reg
+	.set i, 0
+	.rep 32
+		.altmacro
+		do_xyzword <\reg>, zmm, %i
+		.noaltmacro
+	.set i, (i+1)
+	.endr
+.endm
+
+.macro  do_xyzword creg, prfx, idx
+	.if \creg == %xmm\idx || \creg == %ymm\idx || \creg == %zmm\idx
+		.set \prfx\()reg, %\prfx\idx
+	.endif
+.endm
+
+
+// FIXME: handle later
+#define	elf32		1
+#define	elf64		2
+#define	win64		3
+#define	machos64	4
+
+#ifndef __OUTPUT_FORMAT__
+#define __OUTPUT_FORMAT__ elf64
+#endif
+
+#if __OUTPUT_FORMAT__ == elf32
+.section .note.GNU-stack,"",%progbits
+.section .text
+#endif
+#if __OUTPUT_FORMAT__ == elf64
+#ifndef __x86_64__
+#define __x86_64__
+#endif
+.section .note.GNU-stack,"",%progbits
+.section .text
+#endif
+#if __OUTPUT_FORMAT__ == win64
+#define __x86_64__
+#endif
+#if __OUTPUT_FORMAT__ == macho64
+#define __x86_64__
+#endif
+
+
+#ifdef __x86_64__
+#define endbranch .byte 0xf3, 0x0f, 0x1e, 0xfa
+#else
+#define endbranch .byte 0xf3, 0x0f, 0x1e, 0xfb
+#endif
+
+#ifdef REL_TEXT
+#define WRT_OPT
+#elif __OUTPUT_FORMAT__ == elf64
+#define WRT_OPT        wrt ..plt
+#else
+#define WRT_OPT
+#endif
+
+#endif // ifndef _REG_SIZES_ASM_
diff --git a/contrib/icp/gcm-simd/isa-l_crypto/LICENSE b/contrib/icp/gcm-simd/isa-l_crypto/LICENSE
new file mode 100644
index 000000000000..ecebef110b46
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto/LICENSE
@@ -0,0 +1,26 @@
+  Copyright(c) 2011-2017 Intel Corporation All rights reserved.
+
+  Redistribution and use in source and binary forms, with or without
+  modification, are permitted provided that the following conditions
+  are met:
+    * Redistributions of source code must retain the above copyright
+      notice, this list of conditions and the following disclaimer.
+    * Redistributions in binary form must reproduce the above copyright
+      notice, this list of conditions and the following disclaimer in
+      the documentation and/or other materials provided with the
+      distribution.
+    * Neither the name of Intel Corporation nor the names of its
+      contributors may be used to endorse or promote products derived
+      from this software without specific prior written permission.
+
+  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/contrib/icp/gcm-simd/isa-l_crypto/README b/contrib/icp/gcm-simd/isa-l_crypto/README
new file mode 100644
index 000000000000..55317bb4459b
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto/README
@@ -0,0 +1,10 @@
+This directory contains some of the original "Intel(R) Intelligent Storage
+Acceleration Library Crypto Version" [1] GCM x86-64 assembly files [2]. They
+are included here for reference purposes only.
+
+These files were ported to the GNU assembler to be used within the ICP. The
+ported version can be found in the isa-l_crypto-ported directory one level up.
+
+
+[1] https://github.com/intel/isa-l_crypto
+[2] https://github.com/intel/isa-l_crypto/tree/v2.24.0/aes
\ No newline at end of file
diff --git a/contrib/icp/gcm-simd/isa-l_crypto/gcm128_sse.asm b/contrib/icp/gcm-simd/isa-l_crypto/gcm128_sse.asm
new file mode 100644
index 000000000000..1717a86628fd
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto/gcm128_sse.asm
@@ -0,0 +1,31 @@
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;  Copyright(c) 2011-2016 Intel Corporation All rights reserved.
+;
+;  Redistribution and use in source and binary forms, with or without
+;  modification, are permitted provided that the following conditions
+;  are met:
+;    * Redistributions of source code must retain the above copyright
+;      notice, this list of conditions and the following disclaimer.
+;    * Redistributions in binary form must reproduce the above copyright
+;      notice, this list of conditions and the following disclaimer in
+;      the documentation and/or other materials provided with the
+;      distribution.
+;    * Neither the name of Intel Corporation nor the names of its
+;      contributors may be used to endorse or promote products derived
+;      from this software without specific prior written permission.
+;
+;  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+;  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+;  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+;  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+;  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+;  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+;  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+;  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+;  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+;  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+;  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+%define GCM128_MODE 1
+%include "gcm_sse.asm"
diff --git a/contrib/icp/gcm-simd/isa-l_crypto/gcm256_sse.asm b/contrib/icp/gcm-simd/isa-l_crypto/gcm256_sse.asm
new file mode 100644
index 000000000000..c583d02b86ca
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto/gcm256_sse.asm
@@ -0,0 +1,31 @@
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;  Copyright(c) 2011-2016 Intel Corporation All rights reserved.
+;
+;  Redistribution and use in source and binary forms, with or without
+;  modification, are permitted provided that the following conditions
+;  are met:
+;    * Redistributions of source code must retain the above copyright
+;      notice, this list of conditions and the following disclaimer.
+;    * Redistributions in binary form must reproduce the above copyright
+;      notice, this list of conditions and the following disclaimer in
+;      the documentation and/or other materials provided with the
+;      distribution.
+;    * Neither the name of Intel Corporation nor the names of its
+;      contributors may be used to endorse or promote products derived
+;      from this software without specific prior written permission.
+;
+;  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+;  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+;  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+;  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+;  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+;  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+;  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+;  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+;  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+;  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+;  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+%define GCM256_MODE 1
+%include "gcm_sse.asm"
diff --git a/contrib/icp/gcm-simd/isa-l_crypto/gcm_defines.asm b/contrib/icp/gcm-simd/isa-l_crypto/gcm_defines.asm
new file mode 100644
index 000000000000..e823b79596df
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto/gcm_defines.asm
@@ -0,0 +1,291 @@
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;  Copyright(c) 2011-2016 Intel Corporation All rights reserved.
+;
+;  Redistribution and use in source and binary forms, with or without
+;  modification, are permitted provided that the following conditions
+;  are met:
+;    * Redistributions of source code must retain the above copyright
+;      notice, this list of conditions and the following disclaimer.
+;    * Redistributions in binary form must reproduce the above copyright
+;      notice, this list of conditions and the following disclaimer in
+;      the documentation and/or other materials provided with the
+;      distribution.
+;    * Neither the name of Intel Corporation nor the names of its
+;      contributors may be used to endorse or promote products derived
+;      from this software without specific prior written permission.
+;
+;  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+;  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+;  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+;  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+;  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+;  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+;  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+;  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+;  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+;  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+;  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+%ifndef GCM_DEFINES_ASM_INCLUDED
+%define GCM_DEFINES_ASM_INCLUDED
+
+;
+; Authors:
+;       Erdinc Ozturk
+;       Vinodh Gopal
+;       James Guilford
+
+
+;;;;;;
+
+section .data
+
+align 16
+
+POLY            dq     0x0000000000000001, 0xC200000000000000
+
+align 64
+POLY2           dq     0x00000001C2000000, 0xC200000000000000
+                dq     0x00000001C2000000, 0xC200000000000000
+                dq     0x00000001C2000000, 0xC200000000000000
+                dq     0x00000001C2000000, 0xC200000000000000
+align 16
+TWOONE          dq     0x0000000000000001, 0x0000000100000000
+
+; order of these constants should not change.
+; more specifically, ALL_F should follow SHIFT_MASK, and ZERO should follow ALL_F
+
+align 64
+SHUF_MASK       dq     0x08090A0B0C0D0E0F, 0x0001020304050607
+                dq     0x08090A0B0C0D0E0F, 0x0001020304050607
+                dq     0x08090A0B0C0D0E0F, 0x0001020304050607
+                dq     0x08090A0B0C0D0E0F, 0x0001020304050607
+
+SHIFT_MASK      dq     0x0706050403020100, 0x0f0e0d0c0b0a0908
+ALL_F           dq     0xffffffffffffffff, 0xffffffffffffffff
+ZERO            dq     0x0000000000000000, 0x0000000000000000
+ONE             dq     0x0000000000000001, 0x0000000000000000
+TWO             dq     0x0000000000000002, 0x0000000000000000
+ONEf            dq     0x0000000000000000, 0x0100000000000000
+TWOf            dq     0x0000000000000000, 0x0200000000000000
+
+align 64
+ddq_add_1234:
+        dq	0x0000000000000001, 0x0000000000000000
+        dq	0x0000000000000002, 0x0000000000000000
+        dq	0x0000000000000003, 0x0000000000000000
+        dq	0x0000000000000004, 0x0000000000000000
+
+align 64
+ddq_add_5678:
+        dq	0x0000000000000005, 0x0000000000000000
+        dq	0x0000000000000006, 0x0000000000000000
+        dq	0x0000000000000007, 0x0000000000000000
+        dq	0x0000000000000008, 0x0000000000000000
+
+align 64
+ddq_add_4444:
+        dq	0x0000000000000004, 0x0000000000000000
+        dq	0x0000000000000004, 0x0000000000000000
+        dq	0x0000000000000004, 0x0000000000000000
+        dq	0x0000000000000004, 0x0000000000000000
+
+align 64
+ddq_add_8888:
+        dq	0x0000000000000008, 0x0000000000000000
+        dq	0x0000000000000008, 0x0000000000000000
+        dq	0x0000000000000008, 0x0000000000000000
+        dq	0x0000000000000008, 0x0000000000000000
+
+align 64
+ddq_addbe_1234:
+        dq	0x0000000000000000, 0x0100000000000000
+        dq	0x0000000000000000, 0x0200000000000000
+        dq	0x0000000000000000, 0x0300000000000000
+        dq	0x0000000000000000, 0x0400000000000000
+
+align 64
+ddq_addbe_5678:
+        dq	0x0000000000000000, 0x0500000000000000
+        dq	0x0000000000000000, 0x0600000000000000
+        dq	0x0000000000000000, 0x0700000000000000
+        dq	0x0000000000000000, 0x0800000000000000
+
+align 64
+ddq_addbe_4444:
+        dq	0x0000000000000000, 0x0400000000000000
+        dq	0x0000000000000000, 0x0400000000000000
+        dq	0x0000000000000000, 0x0400000000000000
+        dq	0x0000000000000000, 0x0400000000000000
+
+align 64
+ddq_addbe_8888:
+        dq	0x0000000000000000, 0x0800000000000000
+        dq	0x0000000000000000, 0x0800000000000000
+        dq	0x0000000000000000, 0x0800000000000000
+        dq	0x0000000000000000, 0x0800000000000000
+
+align 64
+byte_len_to_mask_table:
+        dw      0x0000, 0x0001, 0x0003, 0x0007,
+        dw      0x000f, 0x001f, 0x003f, 0x007f,
+        dw      0x00ff, 0x01ff, 0x03ff, 0x07ff,
+        dw      0x0fff, 0x1fff, 0x3fff, 0x7fff,
+        dw      0xffff
+
+align 64
+byte64_len_to_mask_table:
+        dq      0x0000000000000000, 0x0000000000000001
+        dq      0x0000000000000003, 0x0000000000000007
+        dq      0x000000000000000f, 0x000000000000001f
+        dq      0x000000000000003f, 0x000000000000007f
+        dq      0x00000000000000ff, 0x00000000000001ff
+        dq      0x00000000000003ff, 0x00000000000007ff
+        dq      0x0000000000000fff, 0x0000000000001fff
+        dq      0x0000000000003fff, 0x0000000000007fff
+        dq      0x000000000000ffff, 0x000000000001ffff
+        dq      0x000000000003ffff, 0x000000000007ffff
+        dq      0x00000000000fffff, 0x00000000001fffff
+        dq      0x00000000003fffff, 0x00000000007fffff
+        dq      0x0000000000ffffff, 0x0000000001ffffff
+        dq      0x0000000003ffffff, 0x0000000007ffffff
+        dq      0x000000000fffffff, 0x000000001fffffff
+        dq      0x000000003fffffff, 0x000000007fffffff
+        dq      0x00000000ffffffff, 0x00000001ffffffff
+        dq      0x00000003ffffffff, 0x00000007ffffffff
+        dq      0x0000000fffffffff, 0x0000001fffffffff
+        dq      0x0000003fffffffff, 0x0000007fffffffff
+        dq      0x000000ffffffffff, 0x000001ffffffffff
+        dq      0x000003ffffffffff, 0x000007ffffffffff
+        dq      0x00000fffffffffff, 0x00001fffffffffff
+        dq      0x00003fffffffffff, 0x00007fffffffffff
+        dq      0x0000ffffffffffff, 0x0001ffffffffffff
+        dq      0x0003ffffffffffff, 0x0007ffffffffffff
+        dq      0x000fffffffffffff, 0x001fffffffffffff
+        dq      0x003fffffffffffff, 0x007fffffffffffff
+        dq      0x00ffffffffffffff, 0x01ffffffffffffff
+        dq      0x03ffffffffffffff, 0x07ffffffffffffff
+        dq      0x0fffffffffffffff, 0x1fffffffffffffff
+        dq      0x3fffffffffffffff, 0x7fffffffffffffff
+        dq      0xffffffffffffffff
+
+align 64
+mask_out_top_block:
+        dq      0xffffffffffffffff, 0xffffffffffffffff
+        dq      0xffffffffffffffff, 0xffffffffffffffff
+        dq      0xffffffffffffffff, 0xffffffffffffffff
+        dq      0x0000000000000000, 0x0000000000000000
+
+section .text
+
+
+;;define the fields of gcm_data struct
+;typedef struct gcm_data
+;{
+;        u8 expanded_keys[16*15];
+;        u8 shifted_hkey_1[16];  // store HashKey <<1 mod poly here
+;        u8 shifted_hkey_2[16];  // store HashKey^2 <<1 mod poly here
+;        u8 shifted_hkey_3[16];  // store HashKey^3 <<1 mod poly here
+;        u8 shifted_hkey_4[16];  // store HashKey^4 <<1 mod poly here
+;        u8 shifted_hkey_5[16];  // store HashKey^5 <<1 mod poly here
+;        u8 shifted_hkey_6[16];  // store HashKey^6 <<1 mod poly here
+;        u8 shifted_hkey_7[16];  // store HashKey^7 <<1 mod poly here
+;        u8 shifted_hkey_8[16];  // store HashKey^8 <<1 mod poly here
+;        u8 shifted_hkey_1_k[16];  // store XOR of High 64 bits and Low 64 bits of  HashKey <<1 mod poly here (for Karatsuba purposes)
+;        u8 shifted_hkey_2_k[16];  // store XOR of High 64 bits and Low 64 bits of  HashKey^2 <<1 mod poly here (for Karatsuba purposes)
+;        u8 shifted_hkey_3_k[16];  // store XOR of High 64 bits and Low 64 bits of  HashKey^3 <<1 mod poly here (for Karatsuba purposes)
+;        u8 shifted_hkey_4_k[16];  // store XOR of High 64 bits and Low 64 bits of  HashKey^4 <<1 mod poly here (for Karatsuba purposes)
+;        u8 shifted_hkey_5_k[16];  // store XOR of High 64 bits and Low 64 bits of  HashKey^5 <<1 mod poly here (for Karatsuba purposes)
+;        u8 shifted_hkey_6_k[16];  // store XOR of High 64 bits and Low 64 bits of  HashKey^6 <<1 mod poly here (for Karatsuba purposes)
+;        u8 shifted_hkey_7_k[16];  // store XOR of High 64 bits and Low 64 bits of  HashKey^7 <<1 mod poly here (for Karatsuba purposes)
+;        u8 shifted_hkey_8_k[16];  // store XOR of High 64 bits and Low 64 bits of  HashKey^8 <<1 mod poly here (for Karatsuba purposes)
+;} gcm_data;
+
+%ifndef GCM_KEYS_VAES_AVX512_INCLUDED
+%define HashKey         16*15    ; store HashKey <<1 mod poly here
+%define HashKey_1       16*15    ; store HashKey <<1 mod poly here
+%define HashKey_2       16*16    ; store HashKey^2 <<1 mod poly here
+%define HashKey_3       16*17    ; store HashKey^3 <<1 mod poly here
+%define HashKey_4       16*18    ; store HashKey^4 <<1 mod poly here
+%define HashKey_5       16*19    ; store HashKey^5 <<1 mod poly here
+%define HashKey_6       16*20    ; store HashKey^6 <<1 mod poly here
+%define HashKey_7       16*21    ; store HashKey^7 <<1 mod poly here
+%define HashKey_8       16*22    ; store HashKey^8 <<1 mod poly here
+%define HashKey_k       16*23    ; store XOR of High 64 bits and Low 64 bits of  HashKey <<1 mod poly here (for Karatsuba purposes)
+%define HashKey_2_k     16*24    ; store XOR of High 64 bits and Low 64 bits of  HashKey^2 <<1 mod poly here (for Karatsuba purposes)
+%define HashKey_3_k     16*25   ; store XOR of High 64 bits and Low 64 bits of  HashKey^3 <<1 mod poly here (for Karatsuba purposes)
+%define HashKey_4_k     16*26   ; store XOR of High 64 bits and Low 64 bits of  HashKey^4 <<1 mod poly here (for Karatsuba purposes)
+%define HashKey_5_k     16*27   ; store XOR of High 64 bits and Low 64 bits of  HashKey^5 <<1 mod poly here (for Karatsuba purposes)
+%define HashKey_6_k     16*28   ; store XOR of High 64 bits and Low 64 bits of  HashKey^6 <<1 mod poly here (for Karatsuba purposes)
+%define HashKey_7_k     16*29   ; store XOR of High 64 bits and Low 64 bits of  HashKey^7 <<1 mod poly here (for Karatsuba purposes)
+%define HashKey_8_k     16*30   ; store XOR of High 64 bits and Low 64 bits of  HashKey^8 <<1 mod poly here (for Karatsuba purposes)
+%endif
+
+%define AadHash		16*0	; store current Hash of data which has been input
+%define AadLen		16*1	; store length of input data which will not be encrypted or decrypted
+%define InLen		(16*1)+8 ; store length of input data which will be encrypted or decrypted
+%define PBlockEncKey	16*2	; encryption key for the partial block at the end of the previous update
+%define OrigIV		16*3	; input IV
+%define CurCount	16*4	; Current counter for generation of encryption key
+%define PBlockLen	16*5	; length of partial block at the end of the previous update
+
+%define reg(q) xmm %+ q
+%define arg(x) [r14 + STACK_OFFSET + 8*x]
+
+
+
+
+%ifnidn __OUTPUT_FORMAT__, elf64
+    %xdefine arg1 rcx
+    %xdefine arg2 rdx
+    %xdefine arg3 r8
+    %xdefine arg4 r9
+    %xdefine arg5 rsi ;[r14 + STACK_OFFSET + 8*5] - need push and load
+    %xdefine arg6 [r14 + STACK_OFFSET + 8*6]
+    %xdefine arg7 [r14 + STACK_OFFSET + 8*7]
+    %xdefine arg8 [r14 + STACK_OFFSET + 8*8]
+    %xdefine arg9 [r14 + STACK_OFFSET + 8*9]
+    %xdefine arg10 [r14 + STACK_OFFSET + 8*10]
+
+%else
+    %xdefine arg1 rdi
+    %xdefine arg2 rsi
+    %xdefine arg3 rdx
+    %xdefine arg4 rcx
+    %xdefine arg5 r8
+    %xdefine arg6 r9
+    %xdefine arg7 [r14 + STACK_OFFSET + 8*1]
+    %xdefine arg8 [r14 + STACK_OFFSET + 8*2]
+    %xdefine arg9 [r14 + STACK_OFFSET + 8*3]
+    %xdefine arg10 [r14 + STACK_OFFSET + 8*4]
+%endif
+
+%ifdef NT_LDST
+	%define NT_LD
+	%define NT_ST
+%endif
+
+;;; Use Non-temporal load/stor
+%ifdef NT_LD
+	%define	XLDR	 movntdqa
+	%define	VXLDR	 vmovntdqa
+	%define	VX512LDR vmovntdqa
+%else
+	%define	XLDR	 movdqu
+	%define	VXLDR	 vmovdqu
+	%define	VX512LDR vmovdqu8
+%endif
+
+;;; Use Non-temporal load/stor
+%ifdef NT_ST
+	%define	XSTR	 movntdq
+	%define	VXSTR	 vmovntdq
+	%define	VX512STR vmovntdq
+%else
+	%define	XSTR	 movdqu
+	%define	VXSTR	 vmovdqu
+	%define	VX512STR vmovdqu8
+%endif
+
+%endif ; GCM_DEFINES_ASM_INCLUDED
diff --git a/contrib/icp/gcm-simd/isa-l_crypto/gcm_sse.asm b/contrib/icp/gcm-simd/isa-l_crypto/gcm_sse.asm
new file mode 100644
index 000000000000..e35860496357
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto/gcm_sse.asm
@@ -0,0 +1,2171 @@
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;  Copyright(c) 2011-2017 Intel Corporation All rights reserved.
+;
+;  Redistribution and use in source and binary forms, with or without
+;  modification, are permitted provided that the following conditions
+;  are met:
+;    * Redistributions of source code must retain the above copyright
+;      notice, this list of conditions and the following disclaimer.
+;    * Redistributions in binary form must reproduce the above copyright
+;      notice, this list of conditions and the following disclaimer in
+;      the documentation and/or other materials provided with the
+;      distribution.
+;    * Neither the name of Intel Corporation nor the names of its
+;      contributors may be used to endorse or promote products derived
+;      from this software without specific prior written permission.
+;
+;  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+;  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+;  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+;  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+;  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+;  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+;  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+;  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+;  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+;  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+;  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;
+; Authors:
+;       Erdinc Ozturk
+;       Vinodh Gopal
+;       James Guilford
+;
+;
+; References:
+;       This code was derived and highly optimized from the code described in paper:
+;               Vinodh Gopal et. al. Optimized Galois-Counter-Mode Implementation on Intel Architecture Processors. August, 2010
+;
+;       For the shift-based reductions used in this code, we used the method described in paper:
+;               Shay Gueron, Michael E. Kounavis. Intel Carry-Less Multiplication Instruction and its Usage for Computing the GCM Mode. January, 2010.
+;
+;
+;
+;
+; Assumptions:
+;
+;
+;
+; iv:
+;       0                   1                   2                   3
+;       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+;       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+;       |                             Salt  (From the SA)               |
+;       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+;       |                     Initialization Vector                     |
+;       |         (This is the sequence number from IPSec header)       |
+;       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+;       |                              0x1                              |
+;       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+;
+;
+;
+; AAD:
+;       AAD will be padded with 0 to the next 16byte multiple
+;       for example, assume AAD is a u32 vector
+;
+;       if AAD is 8 bytes:
+;       AAD[3] = {A0, A1};
+;       padded AAD in xmm register = {A1 A0 0 0}
+;
+;       0                   1                   2                   3
+;       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+;       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+;       |                               SPI (A1)                        |
+;       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+;       |                     32-bit Sequence Number (A0)               |
+;       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+;       |                              0x0                              |
+;       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+;
+;                                       AAD Format with 32-bit Sequence Number
+;
+;       if AAD is 12 bytes:
+;       AAD[3] = {A0, A1, A2};
+;       padded AAD in xmm register = {A2 A1 A0 0}
+;
+;       0                   1                   2                   3
+;       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+;       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+;       |                               SPI (A2)                        |
+;       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+;       |                 64-bit Extended Sequence Number {A1,A0}       |
+;       |                                                               |
+;       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+;       |                              0x0                              |
+;       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+;
+;        AAD Format with 64-bit Extended Sequence Number
+;
+;
+; aadLen:
+;       Must be a multiple of 4 bytes and from the definition of the spec.
+;       The code additionally supports any aadLen length.
+;
+; TLen:
+;       from the definition of the spec, TLen can only be 8, 12 or 16 bytes.
+;
+; poly = x^128 + x^127 + x^126 + x^121 + 1
+; throughout the code, one tab and two tab indentations are used. one tab is for GHASH part, two tabs is for AES part.
+;
+
+%include "reg_sizes.asm"
+%include "gcm_defines.asm"
+
+%ifndef GCM128_MODE
+%ifndef GCM192_MODE
+%ifndef GCM256_MODE
+%error "No GCM mode selected for gcm_sse.asm!"
+%endif
+%endif
+%endif
+
+%ifndef FUNCT_EXTENSION
+%define FUNCT_EXTENSION
+%endif
+
+%ifdef GCM128_MODE
+%define FN_NAME(x,y) aes_gcm_ %+ x %+ _128 %+ y %+ sse %+ FUNCT_EXTENSION
+%define NROUNDS 9
+%endif
+
+%ifdef GCM192_MODE
+%define FN_NAME(x,y) aes_gcm_ %+ x %+ _192 %+ y %+ sse %+ FUNCT_EXTENSION
+%define NROUNDS 11
+%endif
+
+%ifdef GCM256_MODE
+%define FN_NAME(x,y) aes_gcm_ %+ x %+ _256 %+ y %+ sse %+ FUNCT_EXTENSION
+%define NROUNDS 13
+%endif
+
+
+default rel
+; need to push 5 registers into stack to maintain
+%define STACK_OFFSET 8*5
+
+%define	TMP2	16*0    ; Temporary storage for AES State 2 (State 1 is stored in an XMM register)
+%define	TMP3	16*1    ; Temporary storage for AES State 3
+%define	TMP4	16*2    ; Temporary storage for AES State 4
+%define	TMP5	16*3    ; Temporary storage for AES State 5
+%define	TMP6	16*4    ; Temporary storage for AES State 6
+%define	TMP7	16*5    ; Temporary storage for AES State 7
+%define	TMP8	16*6    ; Temporary storage for AES State 8
+
+%define	LOCAL_STORAGE	16*7
+
+%ifidn __OUTPUT_FORMAT__, win64
+	%define	XMM_STORAGE	16*10
+%else
+	%define	XMM_STORAGE	0
+%endif
+
+%define	VARIABLE_OFFSET	LOCAL_STORAGE + XMM_STORAGE
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; Utility Macros
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0)
+; Input: A and B (128-bits each, bit-reflected)
+; Output: C = A*B*x mod poly, (i.e. >>1 )
+; To compute GH = GH*HashKey mod poly, give HK = HashKey<<1 mod poly as input
+; GH = GH * HK * x mod poly which is equivalent to GH*HashKey mod poly.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+%macro  GHASH_MUL  7
+%define %%GH %1         ; 16 Bytes
+%define %%HK %2         ; 16 Bytes
+%define %%T1 %3
+%define %%T2 %4
+%define %%T3 %5
+%define %%T4 %6
+%define %%T5 %7
+        ; %%GH, %%HK hold the values for the two operands which are carry-less multiplied
+        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+        ; Karatsuba Method
+        movdqa  %%T1, %%GH
+        pshufd  %%T2, %%GH, 01001110b
+        pshufd  %%T3, %%HK, 01001110b
+        pxor    %%T2, %%GH                              ; %%T2 = (a1+a0)
+        pxor    %%T3, %%HK                              ; %%T3 = (b1+b0)
+
+        pclmulqdq       %%T1, %%HK, 0x11                ; %%T1 = a1*b1
+        pclmulqdq       %%GH, %%HK, 0x00                ; %%GH = a0*b0
+        pclmulqdq       %%T2, %%T3, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+        pxor    %%T2, %%GH
+        pxor    %%T2, %%T1                              ; %%T2 = a0*b1+a1*b0
+
+        movdqa  %%T3, %%T2
+        pslldq  %%T3, 8                                 ; shift-L %%T3 2 DWs
+        psrldq  %%T2, 8                                 ; shift-R %%T2 2 DWs
+        pxor    %%GH, %%T3
+        pxor    %%T1, %%T2                              ; <%%T1:%%GH> holds the result of the carry-less multiplication of %%GH by %%HK
+
+
+        ;first phase of the reduction
+        movdqa  %%T2, %%GH
+        movdqa  %%T3, %%GH
+        movdqa  %%T4, %%GH                              ; move %%GH into %%T2, %%T3, %%T4 in order to perform the three shifts independently
+
+        pslld   %%T2, 31                                ; packed right shifting << 31
+        pslld   %%T3, 30                                ; packed right shifting shift << 30
+        pslld   %%T4, 25                                ; packed right shifting shift << 25
+        pxor    %%T2, %%T3                              ; xor the shifted versions
+        pxor    %%T2, %%T4
+
+        movdqa  %%T5, %%T2
+        psrldq  %%T5, 4                                 ; shift-R %%T5 1 DW
+
+        pslldq  %%T2, 12                                ; shift-L %%T2 3 DWs
+        pxor    %%GH, %%T2                              ; first phase of the reduction complete
+        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+        ;second phase of the reduction
+        movdqa  %%T2,%%GH                               ; make 3 copies of %%GH (in in %%T2, %%T3, %%T4) for doing three shift operations
+        movdqa  %%T3,%%GH
+        movdqa  %%T4,%%GH
+
+        psrld   %%T2,1                                  ; packed left shifting >> 1
+        psrld   %%T3,2                                  ; packed left shifting >> 2
+        psrld   %%T4,7                                  ; packed left shifting >> 7
+        pxor    %%T2,%%T3                               ; xor the shifted versions
+        pxor    %%T2,%%T4
+
+        pxor    %%T2, %%T5
+        pxor    %%GH, %%T2
+        pxor    %%GH, %%T1                              ; the result is in %%T1
+
+
+%endmacro
+
+
+%macro PRECOMPUTE 8
+%define	%%GDATA	%1
+%define	%%HK	%2
+%define	%%T1	%3
+%define	%%T2	%4
+%define	%%T3	%5
+%define	%%T4	%6
+%define	%%T5	%7
+%define	%%T6	%8
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; Haskey_i_k holds XORed values of the low and high parts of the Haskey_i
+        movdqa  %%T4, %%HK
+        pshufd  %%T1, %%HK, 01001110b
+        pxor    %%T1, %%HK
+        movdqu  [%%GDATA + HashKey_k], %%T1
+
+
+        GHASH_MUL %%T4, %%HK, %%T1, %%T2, %%T3, %%T5, %%T6      ;  %%T4 = HashKey^2<<1 mod poly
+        movdqu  [%%GDATA + HashKey_2], %%T4                         ;  [HashKey_2] = HashKey^2<<1 mod poly
+        pshufd  %%T1, %%T4, 01001110b
+        pxor    %%T1, %%T4
+        movdqu  [%%GDATA + HashKey_2_k], %%T1
+
+        GHASH_MUL %%T4, %%HK, %%T1, %%T2, %%T3, %%T5, %%T6              ;  %%T4 = HashKey^3<<1 mod poly
+        movdqu  [%%GDATA + HashKey_3], %%T4
+        pshufd  %%T1, %%T4, 01001110b
+        pxor    %%T1, %%T4
+        movdqu  [%%GDATA + HashKey_3_k], %%T1
+
+
+        GHASH_MUL %%T4, %%HK, %%T1, %%T2, %%T3, %%T5, %%T6              ;  %%T4 = HashKey^4<<1 mod poly
+        movdqu  [%%GDATA + HashKey_4], %%T4
+        pshufd  %%T1, %%T4, 01001110b
+        pxor    %%T1, %%T4
+        movdqu  [%%GDATA + HashKey_4_k], %%T1
+
+        GHASH_MUL %%T4, %%HK, %%T1, %%T2, %%T3, %%T5, %%T6              ;  %%T4 = HashKey^5<<1 mod poly
+        movdqu  [%%GDATA + HashKey_5], %%T4
+        pshufd  %%T1, %%T4, 01001110b
+        pxor    %%T1, %%T4
+        movdqu  [%%GDATA + HashKey_5_k], %%T1
+
+
+        GHASH_MUL %%T4, %%HK, %%T1, %%T2, %%T3, %%T5, %%T6              ;  %%T4 = HashKey^6<<1 mod poly
+        movdqu  [%%GDATA + HashKey_6], %%T4
+        pshufd  %%T1, %%T4, 01001110b
+        pxor    %%T1, %%T4
+        movdqu  [%%GDATA + HashKey_6_k], %%T1
+
+        GHASH_MUL %%T4, %%HK, %%T1, %%T2, %%T3, %%T5, %%T6              ;  %%T4 = HashKey^7<<1 mod poly
+        movdqu  [%%GDATA + HashKey_7], %%T4
+        pshufd  %%T1, %%T4, 01001110b
+        pxor    %%T1, %%T4
+        movdqu  [%%GDATA + HashKey_7_k], %%T1
+
+        GHASH_MUL %%T4, %%HK, %%T1, %%T2, %%T3, %%T5, %%T6              ;  %%T4 = HashKey^8<<1 mod poly
+        movdqu  [%%GDATA + HashKey_8], %%T4
+        pshufd  %%T1, %%T4, 01001110b
+        pxor    %%T1, %%T4
+        movdqu  [%%GDATA + HashKey_8_k], %%T1
+
+
+%endmacro
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; READ_SMALL_DATA_INPUT: Packs xmm register with data when data input is less than 16 bytes.
+; Returns 0 if data has length 0.
+; Input: The input data (INPUT), that data's length (LENGTH).
+; Output: The packed xmm register (OUTPUT).
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+%macro READ_SMALL_DATA_INPUT	6
+%define	%%OUTPUT		%1 ; %%OUTPUT is an xmm register
+%define	%%INPUT			%2
+%define	%%LENGTH		%3
+%define	%%END_READ_LOCATION	%4 ; All this and the lower inputs are temp registers
+%define	%%COUNTER		%5
+%define	%%TMP1			%6
+
+	pxor	%%OUTPUT, %%OUTPUT
+	mov	%%COUNTER, %%LENGTH
+	mov	%%END_READ_LOCATION, %%INPUT
+	add	%%END_READ_LOCATION, %%LENGTH
+	xor	%%TMP1, %%TMP1
+
+
+	cmp	%%COUNTER, 8
+	jl	%%_byte_loop_2
+	pinsrq	%%OUTPUT, [%%INPUT],0		;Read in 8 bytes if they exists
+	je	%%_done
+
+	sub	%%COUNTER, 8
+
+%%_byte_loop_1:					;Read in data 1 byte at a time while data is left
+	shl	%%TMP1, 8			;This loop handles when 8 bytes were already read in
+	dec	%%END_READ_LOCATION
+	mov	BYTE(%%TMP1), BYTE [%%END_READ_LOCATION]
+	dec	%%COUNTER
+	jg	%%_byte_loop_1
+	pinsrq	%%OUTPUT, %%TMP1, 1
+	jmp	%%_done
+
+%%_byte_loop_2:					;Read in data 1 byte at a time while data is left
+	cmp	%%COUNTER, 0
+	je	%%_done
+	shl	%%TMP1, 8			;This loop handles when no bytes were already read in
+	dec	%%END_READ_LOCATION
+	mov	BYTE(%%TMP1), BYTE [%%END_READ_LOCATION]
+	dec	%%COUNTER
+	jg	%%_byte_loop_2
+	pinsrq	%%OUTPUT, %%TMP1, 0
+%%_done:
+
+%endmacro ; READ_SMALL_DATA_INPUT
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; CALC_AAD_HASH: Calculates the hash of the data which will not be encrypted.
+; Input: The input data (A_IN), that data's length (A_LEN), and the hash key (HASH_KEY).
+; Output: The hash of the data (AAD_HASH).
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+%macro	CALC_AAD_HASH	14
+%define	%%A_IN		%1
+%define	%%A_LEN		%2
+%define	%%AAD_HASH	%3
+%define	%%HASH_KEY	%4
+%define	%%XTMP1		%5	; xmm temp reg 5
+%define	%%XTMP2		%6
+%define	%%XTMP3		%7
+%define	%%XTMP4		%8
+%define	%%XTMP5		%9	; xmm temp reg 5
+%define	%%T1		%10	; temp reg 1
+%define	%%T2		%11
+%define	%%T3		%12
+%define	%%T4		%13
+%define	%%T5		%14	; temp reg 5
+
+
+	mov	%%T1, %%A_IN		; T1 = AAD
+	mov	%%T2, %%A_LEN		; T2 = aadLen
+	pxor	%%AAD_HASH, %%AAD_HASH
+
+	cmp	%%T2, 16
+	jl	%%_get_small_AAD_block
+
+%%_get_AAD_loop16:
+
+	movdqu	%%XTMP1, [%%T1]
+	;byte-reflect the AAD data
+	pshufb	%%XTMP1, [SHUF_MASK]
+	pxor	%%AAD_HASH, %%XTMP1
+	GHASH_MUL	%%AAD_HASH, %%HASH_KEY, %%XTMP1, %%XTMP2, %%XTMP3, %%XTMP4, %%XTMP5
+
+	sub	%%T2, 16
+	je	%%_CALC_AAD_done
+
+	add	%%T1, 16
+	cmp	%%T2, 16
+	jge	%%_get_AAD_loop16
+
+%%_get_small_AAD_block:
+	READ_SMALL_DATA_INPUT	%%XTMP1, %%T1, %%T2, %%T3, %%T4, %%T5
+	;byte-reflect the AAD data
+	pshufb	%%XTMP1, [SHUF_MASK]
+	pxor	%%AAD_HASH, %%XTMP1
+	GHASH_MUL	%%AAD_HASH, %%HASH_KEY, %%XTMP1, %%XTMP2, %%XTMP3, %%XTMP4, %%XTMP5
+
+%%_CALC_AAD_done:
+
+%endmacro ; CALC_AAD_HASH
+
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; PARTIAL_BLOCK: Handles encryption/decryption and the tag partial blocks between update calls.
+; Requires the input data be at least 1 byte long.
+; Input: gcm_key_data (GDATA_KEY), gcm_context_data (GDATA_CTX), input text (PLAIN_CYPH_IN),
+; input text length (PLAIN_CYPH_LEN), the current data offset (DATA_OFFSET),
+; and whether encoding or decoding (ENC_DEC).
+; Output: A cypher of the first partial block (CYPH_PLAIN_OUT), and updated GDATA_CTX
+; Clobbers rax, r10, r12, r13, r15, xmm0, xmm1, xmm2, xmm3, xmm5, xmm6, xmm9, xmm10, xmm11, xmm13
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+%macro PARTIAL_BLOCK	8
+%define	%%GDATA_KEY		%1
+%define	%%GDATA_CTX		%2
+%define	%%CYPH_PLAIN_OUT	%3
+%define	%%PLAIN_CYPH_IN		%4
+%define	%%PLAIN_CYPH_LEN	%5
+%define	%%DATA_OFFSET		%6
+%define	%%AAD_HASH		%7
+%define	%%ENC_DEC		%8
+	mov	r13, [%%GDATA_CTX + PBlockLen]
+	cmp	r13, 0
+	je	%%_partial_block_done		;Leave Macro if no partial blocks
+
+	cmp	%%PLAIN_CYPH_LEN, 16		;Read in input data without over reading
+	jl	%%_fewer_than_16_bytes
+	XLDR	xmm1, [%%PLAIN_CYPH_IN]		;If more than 16 bytes of data, just fill the xmm register
+	jmp	%%_data_read
+
+%%_fewer_than_16_bytes:
+	lea	r10, [%%PLAIN_CYPH_IN + %%DATA_OFFSET]
+	READ_SMALL_DATA_INPUT	xmm1, r10, %%PLAIN_CYPH_LEN, rax, r12, r15
+	mov	r13, [%%GDATA_CTX + PBlockLen]
+
+%%_data_read:				;Finished reading in data
+
+
+	movdqu	xmm9, [%%GDATA_CTX + PBlockEncKey]	;xmm9 = ctx_data.partial_block_enc_key
+	movdqu	xmm13, [%%GDATA_KEY + HashKey]
+
+	lea	r12, [SHIFT_MASK]
+
+	add	r12, r13			; adjust the shuffle mask pointer to be able to shift r13 bytes (16-r13 is the number of bytes in plaintext mod 16)
+	movdqu	xmm2, [r12]			; get the appropriate shuffle mask
+	pshufb	xmm9, xmm2			;shift right r13 bytes
+
+%ifidn	%%ENC_DEC, DEC
+	movdqa	xmm3, xmm1
+	pxor	xmm9, xmm1			; Cyphertext XOR E(K, Yn)
+
+	mov	r15, %%PLAIN_CYPH_LEN
+	add	r15, r13
+	sub	r15, 16				;Set r15 to be the amount of data left in CYPH_PLAIN_IN after filling the block
+	jge	%%_no_extra_mask_1		;Determine if if partial block is not being filled and shift mask accordingly
+	sub	r12, r15
+%%_no_extra_mask_1:
+
+	movdqu	xmm1, [r12 + ALL_F-SHIFT_MASK]	; get the appropriate mask to mask out bottom r13 bytes of xmm9
+	pand	xmm9, xmm1			; mask out bottom r13 bytes of xmm9
+
+	pand	xmm3, xmm1
+	pshufb	xmm3, [SHUF_MASK]
+	pshufb	xmm3, xmm2
+	pxor	%%AAD_HASH, xmm3
+
+
+	cmp	r15,0
+	jl	%%_partial_incomplete_1
+
+	GHASH_MUL	%%AAD_HASH, xmm13, xmm0, xmm10, xmm11, xmm5, xmm6	;GHASH computation for the last <16 Byte block
+	xor	rax,rax
+	mov	[%%GDATA_CTX + PBlockLen], rax
+	jmp	%%_dec_done
+%%_partial_incomplete_1:
+	add	[%%GDATA_CTX + PBlockLen], %%PLAIN_CYPH_LEN
+%%_dec_done:
+	movdqu	[%%GDATA_CTX + AadHash], %%AAD_HASH
+
+%else
+	pxor	xmm9, xmm1	; Plaintext XOR E(K, Yn)
+
+	mov	r15, %%PLAIN_CYPH_LEN
+	add	r15, r13
+	sub	r15, 16				;Set r15 to be the amount of data left in CYPH_PLAIN_IN after filling the block
+	jge	%%_no_extra_mask_2		;Determine if if partial block is not being filled and shift mask accordingly
+	sub	r12, r15
+%%_no_extra_mask_2:
+
+	movdqu	xmm1, [r12 + ALL_F-SHIFT_MASK]	; get the appropriate mask to mask out bottom r13 bytes of xmm9
+	pand	xmm9, xmm1			; mask out bottom r13  bytes of xmm9
+
+	pshufb	xmm9, [SHUF_MASK]
+	pshufb	xmm9, xmm2
+	pxor	%%AAD_HASH, xmm9
+
+	cmp	r15,0
+	jl	%%_partial_incomplete_2
+
+	GHASH_MUL	%%AAD_HASH, xmm13, xmm0, xmm10, xmm11, xmm5, xmm6	;GHASH computation for the last <16 Byte block
+	xor	rax,rax
+	mov	[%%GDATA_CTX + PBlockLen], rax
+	jmp	%%_encode_done
+%%_partial_incomplete_2:
+	add     [%%GDATA_CTX + PBlockLen], %%PLAIN_CYPH_LEN
+%%_encode_done:
+	movdqu	[%%GDATA_CTX + AadHash], %%AAD_HASH
+
+	pshufb	xmm9, [SHUF_MASK]	; shuffle xmm9 back to output as ciphertext
+	pshufb	xmm9, xmm2
+%endif
+
+
+	;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+	; output encrypted Bytes
+	cmp	r15,0
+	jl	%%_partial_fill
+	mov	r12, r13
+	mov	r13, 16
+	sub	r13, r12			; Set r13 to be the number of bytes to write out
+	jmp	%%_count_set
+%%_partial_fill:
+	mov	r13, %%PLAIN_CYPH_LEN
+%%_count_set:
+	movq	rax, xmm9
+	cmp	r13, 8
+	jle	%%_less_than_8_bytes_left
+
+	mov	[%%CYPH_PLAIN_OUT+ %%DATA_OFFSET], rax
+	add	%%DATA_OFFSET, 8
+	psrldq	xmm9, 8
+	movq	rax, xmm9
+	sub	r13, 8
+%%_less_than_8_bytes_left:
+	mov	BYTE [%%CYPH_PLAIN_OUT + %%DATA_OFFSET], al
+	add	%%DATA_OFFSET, 1
+	shr	rax, 8
+	sub	r13, 1
+	jne	%%_less_than_8_bytes_left
+         ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+%%_partial_block_done:
+%endmacro ; PARTIAL_BLOCK
+
+
+; if a = number of total plaintext bytes
+; b = floor(a/16)
+; %%num_initial_blocks = b mod 8;
+; encrypt the initial %%num_initial_blocks blocks and apply ghash on the ciphertext
+; %%GDATA_KEY, %%GDATA_CTX, %%CYPH_PLAIN_OUT, %%PLAIN_CYPH_IN, r14 are used as a pointer only, not modified
+; Updated AAD_HASH is returned in %%T3
+
+%macro INITIAL_BLOCKS 24
+%define	%%GDATA_KEY		%1
+%define	%%GDATA_CTX		%2
+%define	%%CYPH_PLAIN_OUT	%3
+%define	%%PLAIN_CYPH_IN		%4
+%define	%%LENGTH		%5
+%define	%%DATA_OFFSET		%6
+%define	%%num_initial_blocks	%7	; can be 0, 1, 2, 3, 4, 5, 6 or 7
+%define	%%T1		%8
+%define	%%HASH_KEY	%9
+%define	%%T3		%10
+%define	%%T4		%11
+%define	%%T5		%12
+%define	%%CTR		%13
+%define	%%XMM1		%14
+%define	%%XMM2		%15
+%define	%%XMM3		%16
+%define	%%XMM4		%17
+%define	%%XMM5		%18
+%define	%%XMM6		%19
+%define	%%XMM7		%20
+%define	%%XMM8		%21
+%define	%%T6		%22
+%define	%%T_key		%23
+%define	%%ENC_DEC	%24
+
+%assign i       (8-%%num_initial_blocks)
+		movdqu	reg(i), %%XMM8	; move AAD_HASH to temp reg
+
+	        ; start AES for %%num_initial_blocks blocks
+	        movdqu  %%CTR, [%%GDATA_CTX + CurCount]	; %%CTR = Y0
+
+
+%assign i (9-%%num_initial_blocks)
+%rep %%num_initial_blocks
+                paddd   %%CTR, [ONE]           ; INCR Y0
+                movdqa  reg(i), %%CTR
+                pshufb  reg(i), [SHUF_MASK]     ; perform a 16Byte swap
+%assign i (i+1)
+%endrep
+
+movdqu  %%T_key, [%%GDATA_KEY+16*0]
+%assign i (9-%%num_initial_blocks)
+%rep %%num_initial_blocks
+                pxor    reg(i),%%T_key
+%assign i (i+1)
+%endrep
+
+%assign j 1
+%rep NROUNDS							; encrypt N blocks with 13 key rounds (11 for GCM192)
+movdqu  %%T_key, [%%GDATA_KEY+16*j]
+%assign i (9-%%num_initial_blocks)
+%rep %%num_initial_blocks
+                aesenc  reg(i),%%T_key
+%assign i (i+1)
+%endrep
+
+%assign j (j+1)
+%endrep
+
+
+movdqu  %%T_key, [%%GDATA_KEY+16*j]				; encrypt with last (14th) key round (12 for GCM192)
+%assign i (9-%%num_initial_blocks)
+%rep %%num_initial_blocks
+                aesenclast      reg(i),%%T_key
+%assign i (i+1)
+%endrep
+
+%assign i (9-%%num_initial_blocks)
+%rep %%num_initial_blocks
+                XLDR  %%T1, [%%PLAIN_CYPH_IN + %%DATA_OFFSET]
+                pxor    reg(i), %%T1
+                XSTR  [%%CYPH_PLAIN_OUT + %%DATA_OFFSET], reg(i)            ; write back ciphertext for %%num_initial_blocks blocks
+                add     %%DATA_OFFSET, 16
+                %ifidn  %%ENC_DEC, DEC
+                movdqa  reg(i), %%T1
+                %endif
+                pshufb  reg(i), [SHUF_MASK]     ; prepare ciphertext for GHASH computations
+%assign i (i+1)
+%endrep
+
+
+%assign i (8-%%num_initial_blocks)
+%assign j (9-%%num_initial_blocks)
+
+%rep %%num_initial_blocks
+        pxor    reg(j), reg(i)
+        GHASH_MUL       reg(j), %%HASH_KEY, %%T1, %%T3, %%T4, %%T5, %%T6      ; apply GHASH on %%num_initial_blocks blocks
+%assign i (i+1)
+%assign j (j+1)
+%endrep
+        ; %%XMM8 has the current Hash Value
+        movdqa  %%T3, %%XMM8
+
+        cmp     %%LENGTH, 128
+        jl      %%_initial_blocks_done                  ; no need for precomputed constants
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; Haskey_i_k holds XORed values of the low and high parts of the Haskey_i
+                paddd   %%CTR, [ONE]                   ; INCR Y0
+                movdqa  %%XMM1, %%CTR
+                pshufb  %%XMM1, [SHUF_MASK]             ; perform a 16Byte swap
+
+                paddd   %%CTR, [ONE]                   ; INCR Y0
+                movdqa  %%XMM2, %%CTR
+                pshufb  %%XMM2, [SHUF_MASK]             ; perform a 16Byte swap
+
+                paddd   %%CTR, [ONE]                   ; INCR Y0
+                movdqa  %%XMM3, %%CTR
+                pshufb  %%XMM3, [SHUF_MASK]             ; perform a 16Byte swap
+
+                paddd   %%CTR, [ONE]                   ; INCR Y0
+                movdqa  %%XMM4, %%CTR
+                pshufb  %%XMM4, [SHUF_MASK]             ; perform a 16Byte swap
+
+                paddd   %%CTR, [ONE]                   ; INCR Y0
+                movdqa  %%XMM5, %%CTR
+                pshufb  %%XMM5, [SHUF_MASK]             ; perform a 16Byte swap
+
+                paddd   %%CTR, [ONE]                   ; INCR Y0
+                movdqa  %%XMM6, %%CTR
+                pshufb  %%XMM6, [SHUF_MASK]             ; perform a 16Byte swap
+
+                paddd   %%CTR, [ONE]                   ; INCR Y0
+                movdqa  %%XMM7, %%CTR
+                pshufb  %%XMM7, [SHUF_MASK]             ; perform a 16Byte swap
+
+                paddd   %%CTR, [ONE]                   ; INCR Y0
+                movdqa  %%XMM8, %%CTR
+                pshufb  %%XMM8, [SHUF_MASK]             ; perform a 16Byte swap
+
+                movdqu  %%T_key, [%%GDATA_KEY+16*0]
+                pxor    %%XMM1, %%T_key
+                pxor    %%XMM2, %%T_key
+                pxor    %%XMM3, %%T_key
+                pxor    %%XMM4, %%T_key
+                pxor    %%XMM5, %%T_key
+                pxor    %%XMM6, %%T_key
+                pxor    %%XMM7, %%T_key
+                pxor    %%XMM8, %%T_key
+
+
+%assign i 1
+%rep    NROUNDS       						; do early (13) rounds (11 for GCM192)
+                movdqu  %%T_key, [%%GDATA_KEY+16*i]
+                aesenc  %%XMM1, %%T_key
+                aesenc  %%XMM2, %%T_key
+                aesenc  %%XMM3, %%T_key
+                aesenc  %%XMM4, %%T_key
+                aesenc  %%XMM5, %%T_key
+                aesenc  %%XMM6, %%T_key
+                aesenc  %%XMM7, %%T_key
+                aesenc  %%XMM8, %%T_key
+%assign i (i+1)
+%endrep
+
+
+                movdqu          %%T_key, [%%GDATA_KEY+16*i]		; do final key round
+                aesenclast      %%XMM1, %%T_key
+                aesenclast      %%XMM2, %%T_key
+                aesenclast      %%XMM3, %%T_key
+                aesenclast      %%XMM4, %%T_key
+                aesenclast      %%XMM5, %%T_key
+                aesenclast      %%XMM6, %%T_key
+                aesenclast      %%XMM7, %%T_key
+                aesenclast      %%XMM8, %%T_key
+
+                XLDR  %%T1, [%%PLAIN_CYPH_IN + %%DATA_OFFSET + 16*0]
+                pxor    %%XMM1, %%T1
+                XSTR  [%%CYPH_PLAIN_OUT + %%DATA_OFFSET + 16*0], %%XMM1
+                %ifidn  %%ENC_DEC, DEC
+                movdqa  %%XMM1, %%T1
+                %endif
+
+                XLDR  %%T1, [%%PLAIN_CYPH_IN + %%DATA_OFFSET + 16*1]
+                pxor    %%XMM2, %%T1
+                XSTR  [%%CYPH_PLAIN_OUT + %%DATA_OFFSET + 16*1], %%XMM2
+                %ifidn  %%ENC_DEC, DEC
+                movdqa  %%XMM2, %%T1
+                %endif
+
+                XLDR  %%T1, [%%PLAIN_CYPH_IN + %%DATA_OFFSET + 16*2]
+                pxor    %%XMM3, %%T1
+                XSTR  [%%CYPH_PLAIN_OUT + %%DATA_OFFSET + 16*2], %%XMM3
+                %ifidn  %%ENC_DEC, DEC
+                movdqa  %%XMM3, %%T1
+                %endif
+
+                XLDR  %%T1, [%%PLAIN_CYPH_IN + %%DATA_OFFSET + 16*3]
+                pxor    %%XMM4, %%T1
+                XSTR  [%%CYPH_PLAIN_OUT + %%DATA_OFFSET + 16*3], %%XMM4
+                %ifidn  %%ENC_DEC, DEC
+                movdqa  %%XMM4, %%T1
+                %endif
+
+                XLDR  %%T1, [%%PLAIN_CYPH_IN + %%DATA_OFFSET + 16*4]
+                pxor    %%XMM5, %%T1
+                XSTR  [%%CYPH_PLAIN_OUT + %%DATA_OFFSET + 16*4], %%XMM5
+                %ifidn  %%ENC_DEC, DEC
+                movdqa  %%XMM5, %%T1
+                %endif
+
+                XLDR  %%T1, [%%PLAIN_CYPH_IN + %%DATA_OFFSET + 16*5]
+                pxor    %%XMM6, %%T1
+                XSTR  [%%CYPH_PLAIN_OUT + %%DATA_OFFSET + 16*5], %%XMM6
+                %ifidn  %%ENC_DEC, DEC
+                movdqa  %%XMM6, %%T1
+                %endif
+
+                XLDR  %%T1, [%%PLAIN_CYPH_IN + %%DATA_OFFSET + 16*6]
+                pxor    %%XMM7, %%T1
+                XSTR  [%%CYPH_PLAIN_OUT + %%DATA_OFFSET + 16*6], %%XMM7
+                %ifidn  %%ENC_DEC, DEC
+                movdqa  %%XMM7, %%T1
+                %endif
+
+                XLDR  %%T1, [%%PLAIN_CYPH_IN + %%DATA_OFFSET + 16*7]
+                pxor    %%XMM8, %%T1
+                XSTR  [%%CYPH_PLAIN_OUT + %%DATA_OFFSET + 16*7], %%XMM8
+                %ifidn  %%ENC_DEC, DEC
+                movdqa  %%XMM8, %%T1
+                %endif
+
+                add     %%DATA_OFFSET, 128
+
+                pshufb  %%XMM1, [SHUF_MASK]             ; perform a 16Byte swap
+                pxor    %%XMM1, %%T3                    ; combine GHASHed value with the corresponding ciphertext
+                pshufb  %%XMM2, [SHUF_MASK]             ; perform a 16Byte swap
+                pshufb  %%XMM3, [SHUF_MASK]             ; perform a 16Byte swap
+                pshufb  %%XMM4, [SHUF_MASK]             ; perform a 16Byte swap
+                pshufb  %%XMM5, [SHUF_MASK]             ; perform a 16Byte swap
+                pshufb  %%XMM6, [SHUF_MASK]             ; perform a 16Byte swap
+                pshufb  %%XMM7, [SHUF_MASK]             ; perform a 16Byte swap
+                pshufb  %%XMM8, [SHUF_MASK]             ; perform a 16Byte swap
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+%%_initial_blocks_done:
+
+
+%endmacro
+
+
+
+; encrypt 8 blocks at a time
+; ghash the 8 previously encrypted ciphertext blocks
+; %%GDATA (KEY), %%CYPH_PLAIN_OUT, %%PLAIN_CYPH_IN are used as pointers only, not modified
+; %%DATA_OFFSET is the data offset value
+%macro GHASH_8_ENCRYPT_8_PARALLEL 22
+%define	%%GDATA			%1
+%define	%%CYPH_PLAIN_OUT	%2
+%define	%%PLAIN_CYPH_IN		%3
+%define	%%DATA_OFFSET		%4
+%define	%%T1	%5
+%define	%%T2	%6
+%define	%%T3	%7
+%define	%%T4	%8
+%define	%%T5	%9
+%define	%%T6	%10
+%define	%%CTR	%11
+%define	%%XMM1	%12
+%define	%%XMM2	%13
+%define	%%XMM3	%14
+%define	%%XMM4	%15
+%define	%%XMM5	%16
+%define	%%XMM6	%17
+%define	%%XMM7	%18
+%define	%%XMM8	%19
+%define	%%T7	%20
+%define	%%loop_idx	%21
+%define	%%ENC_DEC	%22
+
+        movdqa  %%T7, %%XMM1
+        movdqu  [rsp + TMP2], %%XMM2
+        movdqu  [rsp + TMP3], %%XMM3
+        movdqu  [rsp + TMP4], %%XMM4
+        movdqu  [rsp + TMP5], %%XMM5
+        movdqu  [rsp + TMP6], %%XMM6
+        movdqu  [rsp + TMP7], %%XMM7
+        movdqu  [rsp + TMP8], %%XMM8
+
+        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+        ;; Karatsuba Method
+
+        movdqa  %%T4, %%T7
+        pshufd  %%T6, %%T7, 01001110b
+        pxor    %%T6, %%T7
+                %ifidn %%loop_idx, in_order
+                paddd  %%CTR, [ONE]                    ; INCR CNT
+                %else
+                paddd  %%CTR, [ONEf]                   ; INCR CNT
+                %endif
+        movdqu  %%T5, [%%GDATA + HashKey_8]
+        pclmulqdq       %%T4, %%T5, 0x11                        ; %%T1 = a1*b1
+        pclmulqdq       %%T7, %%T5, 0x00                        ; %%T7 = a0*b0
+        movdqu  %%T5, [%%GDATA + HashKey_8_k]
+        pclmulqdq       %%T6, %%T5, 0x00                        ; %%T2 = (a1+a0)*(b1+b0)
+                movdqa %%XMM1, %%CTR
+
+                %ifidn %%loop_idx, in_order
+                paddd  %%CTR, [ONE]                    ; INCR CNT
+                movdqa %%XMM2, %%CTR
+
+                paddd  %%CTR, [ONE]                    ; INCR CNT
+                movdqa %%XMM3, %%CTR
+
+                paddd  %%CTR, [ONE]                    ; INCR CNT
+                movdqa %%XMM4, %%CTR
+
+                paddd  %%CTR, [ONE]                    ; INCR CNT
+                movdqa %%XMM5, %%CTR
+
+                paddd  %%CTR, [ONE]                    ; INCR CNT
+                movdqa %%XMM6, %%CTR
+
+                paddd  %%CTR, [ONE]                    ; INCR CNT
+                movdqa %%XMM7, %%CTR
+
+                paddd  %%CTR, [ONE]                    ; INCR CNT
+                movdqa %%XMM8, %%CTR
+
+                pshufb  %%XMM1, [SHUF_MASK]             ; perform a 16Byte swap
+                pshufb  %%XMM2, [SHUF_MASK]             ; perform a 16Byte swap
+                pshufb  %%XMM3, [SHUF_MASK]             ; perform a 16Byte swap
+                pshufb  %%XMM4, [SHUF_MASK]             ; perform a 16Byte swap
+                pshufb  %%XMM5, [SHUF_MASK]             ; perform a 16Byte swap
+                pshufb  %%XMM6, [SHUF_MASK]             ; perform a 16Byte swap
+                pshufb  %%XMM7, [SHUF_MASK]             ; perform a 16Byte swap
+                pshufb  %%XMM8, [SHUF_MASK]             ; perform a 16Byte swap
+                %else
+                paddd  %%CTR, [ONEf]                   ; INCR CNT
+                movdqa %%XMM2, %%CTR
+
+                paddd  %%CTR, [ONEf]                   ; INCR CNT
+                movdqa %%XMM3, %%CTR
+
+                paddd  %%CTR, [ONEf]                   ; INCR CNT
+                movdqa %%XMM4, %%CTR
+
+                paddd  %%CTR, [ONEf]                   ; INCR CNT
+                movdqa %%XMM5, %%CTR
+
+                paddd  %%CTR, [ONEf]                   ; INCR CNT
+                movdqa %%XMM6, %%CTR
+
+                paddd  %%CTR, [ONEf]                   ; INCR CNT
+                movdqa %%XMM7, %%CTR
+
+                paddd  %%CTR, [ONEf]                   ; INCR CNT
+                movdqa %%XMM8, %%CTR
+                %endif
+        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+                movdqu  %%T1, [%%GDATA + 16*0]
+                pxor    %%XMM1, %%T1
+                pxor    %%XMM2, %%T1
+                pxor    %%XMM3, %%T1
+                pxor    %%XMM4, %%T1
+                pxor    %%XMM5, %%T1
+                pxor    %%XMM6, %%T1
+                pxor    %%XMM7, %%T1
+                pxor    %%XMM8, %%T1
+
+        ;; %%XMM6, %%T5 hold the values for the two operands which are carry-less multiplied
+        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+        ;; Karatsuba Method
+        movdqu  %%T1, [rsp + TMP2]
+        movdqa  %%T3, %%T1
+
+        pshufd  %%T2, %%T3, 01001110b
+        pxor    %%T2, %%T3
+        movdqu  %%T5, [%%GDATA + HashKey_7]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+        pclmulqdq       %%T3, %%T5, 0x00                ; %%T3 = a0*b0
+        movdqu  %%T5, [%%GDATA + HashKey_7_k]
+        pclmulqdq       %%T2, %%T5, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+        pxor    %%T4, %%T1                              ; accumulate the results in %%T4:%%T7, %%T6 holds the middle part
+        pxor    %%T7, %%T3
+        pxor    %%T6, %%T2
+
+                movdqu  %%T1, [%%GDATA + 16*1]
+                aesenc  %%XMM1, %%T1
+                aesenc  %%XMM2, %%T1
+                aesenc  %%XMM3, %%T1
+                aesenc  %%XMM4, %%T1
+                aesenc  %%XMM5, %%T1
+                aesenc  %%XMM6, %%T1
+                aesenc  %%XMM7, %%T1
+                aesenc  %%XMM8, %%T1
+
+
+                movdqu  %%T1, [%%GDATA + 16*2]
+                aesenc  %%XMM1, %%T1
+                aesenc  %%XMM2, %%T1
+                aesenc  %%XMM3, %%T1
+                aesenc  %%XMM4, %%T1
+                aesenc  %%XMM5, %%T1
+                aesenc  %%XMM6, %%T1
+                aesenc  %%XMM7, %%T1
+                aesenc  %%XMM8, %%T1
+
+        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+        ; Karatsuba Method
+        movdqu  %%T1, [rsp + TMP3]
+        movdqa  %%T3, %%T1
+        pshufd  %%T2, %%T3, 01001110b
+        pxor    %%T2, %%T3
+        movdqu  %%T5, [%%GDATA + HashKey_6]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+        pclmulqdq       %%T3, %%T5, 0x00                ; %%T3 = a0*b0
+        movdqu  %%T5, [%%GDATA + HashKey_6_k]
+        pclmulqdq       %%T2, %%T5, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+        pxor    %%T4, %%T1                              ; accumulate the results in %%T4:%%T7, %%T6 holds the middle part
+        pxor    %%T7, %%T3
+        pxor    %%T6, %%T2
+
+                movdqu  %%T1, [%%GDATA + 16*3]
+                aesenc  %%XMM1, %%T1
+                aesenc  %%XMM2, %%T1
+                aesenc  %%XMM3, %%T1
+                aesenc  %%XMM4, %%T1
+                aesenc  %%XMM5, %%T1
+                aesenc  %%XMM6, %%T1
+                aesenc  %%XMM7, %%T1
+                aesenc  %%XMM8, %%T1
+
+        movdqu  %%T1, [rsp + TMP4]
+        movdqa  %%T3, %%T1
+        pshufd  %%T2, %%T3, 01001110b
+        pxor    %%T2, %%T3
+        movdqu  %%T5, [%%GDATA + HashKey_5]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+        pclmulqdq       %%T3, %%T5, 0x00                ; %%T3 = a0*b0
+        movdqu  %%T5, [%%GDATA + HashKey_5_k]
+        pclmulqdq       %%T2, %%T5, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+        pxor    %%T4, %%T1                              ; accumulate the results in %%T4:%%T7, %%T6 holds the middle part
+        pxor    %%T7, %%T3
+        pxor    %%T6, %%T2
+
+                movdqu  %%T1, [%%GDATA + 16*4]
+                aesenc  %%XMM1, %%T1
+                aesenc  %%XMM2, %%T1
+                aesenc  %%XMM3, %%T1
+                aesenc  %%XMM4, %%T1
+                aesenc  %%XMM5, %%T1
+                aesenc  %%XMM6, %%T1
+                aesenc  %%XMM7, %%T1
+                aesenc  %%XMM8, %%T1
+
+                movdqu  %%T1, [%%GDATA + 16*5]
+                aesenc  %%XMM1, %%T1
+                aesenc  %%XMM2, %%T1
+                aesenc  %%XMM3, %%T1
+                aesenc  %%XMM4, %%T1
+                aesenc  %%XMM5, %%T1
+                aesenc  %%XMM6, %%T1
+                aesenc  %%XMM7, %%T1
+                aesenc  %%XMM8, %%T1
+
+        movdqu  %%T1, [rsp + TMP5]
+        movdqa  %%T3, %%T1
+        pshufd  %%T2, %%T3, 01001110b
+        pxor    %%T2, %%T3
+        movdqu  %%T5, [%%GDATA + HashKey_4]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+        pclmulqdq       %%T3, %%T5, 0x00                ; %%T3 = a0*b0
+        movdqu  %%T5, [%%GDATA + HashKey_4_k]
+        pclmulqdq       %%T2, %%T5, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+        pxor    %%T4, %%T1                              ; accumulate the results in %%T4:%%T7, %%T6 holds the middle part
+        pxor    %%T7, %%T3
+        pxor    %%T6, %%T2
+
+
+                movdqu  %%T1, [%%GDATA + 16*6]
+                aesenc  %%XMM1, %%T1
+                aesenc  %%XMM2, %%T1
+                aesenc  %%XMM3, %%T1
+                aesenc  %%XMM4, %%T1
+                aesenc  %%XMM5, %%T1
+                aesenc  %%XMM6, %%T1
+                aesenc  %%XMM7, %%T1
+                aesenc  %%XMM8, %%T1
+        movdqu  %%T1, [rsp + TMP6]
+        movdqa  %%T3, %%T1
+        pshufd  %%T2, %%T3, 01001110b
+        pxor    %%T2, %%T3
+        movdqu  %%T5, [%%GDATA + HashKey_3]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+        pclmulqdq       %%T3, %%T5, 0x00                ; %%T3 = a0*b0
+        movdqu  %%T5, [%%GDATA + HashKey_3_k]
+        pclmulqdq       %%T2, %%T5, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+        pxor    %%T4, %%T1                              ; accumulate the results in %%T4:%%T7, %%T6 holds the middle part
+        pxor    %%T7, %%T3
+        pxor    %%T6, %%T2
+
+                movdqu  %%T1, [%%GDATA + 16*7]
+                aesenc  %%XMM1, %%T1
+                aesenc  %%XMM2, %%T1
+                aesenc  %%XMM3, %%T1
+                aesenc  %%XMM4, %%T1
+                aesenc  %%XMM5, %%T1
+                aesenc  %%XMM6, %%T1
+                aesenc  %%XMM7, %%T1
+                aesenc  %%XMM8, %%T1
+
+        movdqu  %%T1, [rsp + TMP7]
+        movdqa  %%T3, %%T1
+        pshufd  %%T2, %%T3, 01001110b
+        pxor    %%T2, %%T3
+        movdqu  %%T5, [%%GDATA + HashKey_2]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+        pclmulqdq       %%T3, %%T5, 0x00                ; %%T3 = a0*b0
+        movdqu  %%T5, [%%GDATA + HashKey_2_k]
+        pclmulqdq       %%T2, %%T5, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+        pxor    %%T4, %%T1                              ; accumulate the results in %%T4:%%T7, %%T6 holds the middle part
+        pxor    %%T7, %%T3
+        pxor    %%T6, %%T2
+
+                movdqu  %%T1, [%%GDATA + 16*8]
+                aesenc  %%XMM1, %%T1
+                aesenc  %%XMM2, %%T1
+                aesenc  %%XMM3, %%T1
+                aesenc  %%XMM4, %%T1
+                aesenc  %%XMM5, %%T1
+                aesenc  %%XMM6, %%T1
+                aesenc  %%XMM7, %%T1
+                aesenc  %%XMM8, %%T1
+
+
+        ;; %%XMM8, %%T5 hold the values for the two operands which are carry-less multiplied
+        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+        ;; Karatsuba Method
+        movdqu  %%T1, [rsp + TMP8]
+        movdqa  %%T3, %%T1
+
+        pshufd  %%T2, %%T3, 01001110b
+        pxor    %%T2, %%T3
+        movdqu  %%T5, [%%GDATA + HashKey]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+        pclmulqdq       %%T3, %%T5, 0x00                ; %%T3 = a0*b0
+        movdqu  %%T5, [%%GDATA + HashKey_k]
+        pclmulqdq       %%T2, %%T5, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+        pxor    %%T7, %%T3
+        pxor    %%T4, %%T1
+
+                movdqu  %%T1, [%%GDATA + 16*9]
+                aesenc  %%XMM1, %%T1
+                aesenc  %%XMM2, %%T1
+                aesenc  %%XMM3, %%T1
+                aesenc  %%XMM4, %%T1
+                aesenc  %%XMM5, %%T1
+                aesenc  %%XMM6, %%T1
+                aesenc  %%XMM7, %%T1
+                aesenc  %%XMM8, %%T1
+
+
+%ifdef GCM128_MODE
+		movdqu	%%T5, [%%GDATA + 16*10]
+%endif
+%ifdef GCM192_MODE
+		movdqu	%%T1, [%%GDATA + 16*10]
+		aesenc	%%XMM1, %%T1
+		aesenc	%%XMM2, %%T1
+		aesenc	%%XMM3, %%T1
+		aesenc	%%XMM4, %%T1
+		aesenc	%%XMM5, %%T1
+		aesenc	%%XMM6, %%T1
+		aesenc	%%XMM7, %%T1
+		aesenc	%%XMM8, %%T1
+
+		movdqu	%%T1, [%%GDATA + 16*11]
+		aesenc	%%XMM1, %%T1
+		aesenc	%%XMM2, %%T1
+		aesenc	%%XMM3, %%T1
+		aesenc	%%XMM4, %%T1
+		aesenc	%%XMM5, %%T1
+		aesenc	%%XMM6, %%T1
+		aesenc	%%XMM7, %%T1
+		aesenc	%%XMM8, %%T1
+
+		movdqu	%%T5, [%%GDATA + 16*12]        ; finish last key round
+%endif
+%ifdef GCM256_MODE
+		movdqu	%%T1, [%%GDATA + 16*10]
+		aesenc	%%XMM1, %%T1
+		aesenc	%%XMM2, %%T1
+		aesenc	%%XMM3, %%T1
+		aesenc	%%XMM4, %%T1
+		aesenc	%%XMM5, %%T1
+		aesenc	%%XMM6, %%T1
+		aesenc	%%XMM7, %%T1
+		aesenc	%%XMM8, %%T1
+
+		movdqu	%%T1, [%%GDATA + 16*11]
+		aesenc	%%XMM1, %%T1
+		aesenc	%%XMM2, %%T1
+		aesenc	%%XMM3, %%T1
+		aesenc	%%XMM4, %%T1
+		aesenc	%%XMM5, %%T1
+		aesenc	%%XMM6, %%T1
+		aesenc	%%XMM7, %%T1
+		aesenc	%%XMM8, %%T1
+
+		movdqu	%%T1, [%%GDATA + 16*12]
+		aesenc	%%XMM1, %%T1
+		aesenc	%%XMM2, %%T1
+		aesenc	%%XMM3, %%T1
+		aesenc	%%XMM4, %%T1
+		aesenc	%%XMM5, %%T1
+		aesenc	%%XMM6, %%T1
+		aesenc	%%XMM7, %%T1
+		aesenc	%%XMM8, %%T1
+
+		movdqu	%%T1, [%%GDATA + 16*13]
+		aesenc	%%XMM1, %%T1
+		aesenc	%%XMM2, %%T1
+		aesenc	%%XMM3, %%T1
+		aesenc	%%XMM4, %%T1
+		aesenc	%%XMM5, %%T1
+		aesenc	%%XMM6, %%T1
+		aesenc	%%XMM7, %%T1
+		aesenc	%%XMM8, %%T1
+
+	        movdqu	%%T5, [%%GDATA + 16*14]        ; finish last key round
+%endif
+
+%assign i 0
+%assign j 1
+%rep 8
+                XLDR  %%T1, [%%PLAIN_CYPH_IN+%%DATA_OFFSET+16*i]
+
+%ifidn %%ENC_DEC, DEC
+                movdqa  %%T3, %%T1
+%endif
+
+                pxor    %%T1, %%T5
+                aesenclast      reg(j), %%T1          ; XMM1:XMM8
+                XSTR  [%%CYPH_PLAIN_OUT+%%DATA_OFFSET+16*i], reg(j)       ; Write to the Output buffer
+
+%ifidn %%ENC_DEC, DEC
+                movdqa  reg(j), %%T3
+%endif
+%assign i (i+1)
+%assign j (j+1)
+%endrep
+
+
+
+
+        pxor    %%T2, %%T6
+        pxor    %%T2, %%T4
+        pxor    %%T2, %%T7
+
+
+        movdqa  %%T3, %%T2
+        pslldq  %%T3, 8                                 ; shift-L %%T3 2 DWs
+        psrldq  %%T2, 8                                 ; shift-R %%T2 2 DWs
+        pxor    %%T7, %%T3
+        pxor    %%T4, %%T2                              ; accumulate the results in %%T4:%%T7
+
+
+
+        ;first phase of the reduction
+        movdqa  %%T2, %%T7
+        movdqa  %%T3, %%T7
+        movdqa  %%T1, %%T7                              ; move %%T7 into %%T2, %%T3, %%T1 in order to perform the three shifts independently
+
+        pslld   %%T2, 31                                ; packed right shifting << 31
+        pslld   %%T3, 30                                ; packed right shifting shift << 30
+        pslld   %%T1, 25                                ; packed right shifting shift << 25
+        pxor    %%T2, %%T3                              ; xor the shifted versions
+        pxor    %%T2, %%T1
+
+        movdqa  %%T5, %%T2
+        psrldq  %%T5, 4                                 ; shift-R %%T5 1 DW
+
+        pslldq  %%T2, 12                                ; shift-L %%T2 3 DWs
+        pxor    %%T7, %%T2                              ; first phase of the reduction complete
+        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+                pshufb  %%XMM1, [SHUF_MASK]     ; perform a 16Byte swap
+                pshufb  %%XMM2, [SHUF_MASK]     ; perform a 16Byte swap
+                pshufb  %%XMM3, [SHUF_MASK]     ; perform a 16Byte swap
+                pshufb  %%XMM4, [SHUF_MASK]     ; perform a 16Byte swap
+                pshufb  %%XMM5, [SHUF_MASK]     ; perform a 16Byte swap
+                pshufb  %%XMM6, [SHUF_MASK]     ; perform a 16Byte swap
+                pshufb  %%XMM7, [SHUF_MASK]     ; perform a 16Byte swap
+                pshufb  %%XMM8, [SHUF_MASK]     ; perform a 16Byte swap
+
+        ;second phase of the reduction
+        movdqa  %%T2,%%T7                               ; make 3 copies of %%T7 (in in %%T2, %%T3, %%T1) for doing three shift operations
+        movdqa  %%T3,%%T7
+        movdqa  %%T1,%%T7
+
+        psrld   %%T2,1                                  ; packed left shifting >> 1
+        psrld   %%T3,2                                  ; packed left shifting >> 2
+        psrld   %%T1,7                                  ; packed left shifting >> 7
+        pxor    %%T2,%%T3                               ; xor the shifted versions
+        pxor    %%T2,%%T1
+
+        pxor    %%T2, %%T5
+        pxor    %%T7, %%T2
+        pxor    %%T7, %%T4                              ; the result is in %%T4
+
+
+        pxor    %%XMM1, %%T7
+
+%endmacro
+
+
+; GHASH the last 4 ciphertext blocks.
+%macro	GHASH_LAST_8 16
+%define	%%GDATA	%1
+%define	%%T1	%2
+%define	%%T2	%3
+%define	%%T3	%4
+%define	%%T4	%5
+%define	%%T5	%6
+%define	%%T6	%7
+%define	%%T7	%8
+%define	%%XMM1	%9
+%define	%%XMM2	%10
+%define	%%XMM3	%11
+%define	%%XMM4	%12
+%define	%%XMM5	%13
+%define	%%XMM6	%14
+%define	%%XMM7	%15
+%define	%%XMM8	%16
+
+        ; Karatsuba Method
+        movdqa  %%T6, %%XMM1
+        pshufd  %%T2, %%XMM1, 01001110b
+        pxor    %%T2, %%XMM1
+        movdqu  %%T5, [%%GDATA + HashKey_8]
+        pclmulqdq       %%T6, %%T5, 0x11                ; %%T6 = a1*b1
+
+        pclmulqdq       %%XMM1, %%T5, 0x00              ; %%XMM1 = a0*b0
+        movdqu  %%T4, [%%GDATA + HashKey_8_k]
+        pclmulqdq       %%T2, %%T4, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+
+        movdqa  %%T7, %%XMM1
+        movdqa  %%XMM1, %%T2                            ; result in %%T6, %%T7, %%XMM1
+
+
+        ; Karatsuba Method
+        movdqa  %%T1, %%XMM2
+        pshufd  %%T2, %%XMM2, 01001110b
+        pxor    %%T2, %%XMM2
+        movdqu  %%T5, [%%GDATA + HashKey_7]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+
+        pclmulqdq       %%XMM2, %%T5, 0x00              ; %%XMM2 = a0*b0
+        movdqu  %%T4, [%%GDATA + HashKey_7_k]
+        pclmulqdq       %%T2, %%T4, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+
+        pxor    %%T6, %%T1
+        pxor    %%T7, %%XMM2
+        pxor    %%XMM1, %%T2                            ; results accumulated in %%T6, %%T7, %%XMM1
+
+
+        ; Karatsuba Method
+        movdqa  %%T1, %%XMM3
+        pshufd  %%T2, %%XMM3, 01001110b
+        pxor    %%T2, %%XMM3
+        movdqu  %%T5, [%%GDATA + HashKey_6]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+
+        pclmulqdq       %%XMM3, %%T5, 0x00              ; %%XMM3 = a0*b0
+        movdqu  %%T4, [%%GDATA + HashKey_6_k]
+        pclmulqdq       %%T2, %%T4, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+
+        pxor    %%T6, %%T1
+        pxor    %%T7, %%XMM3
+        pxor    %%XMM1, %%T2                            ; results accumulated in %%T6, %%T7, %%XMM1
+
+        ; Karatsuba Method
+        movdqa  %%T1, %%XMM4
+        pshufd  %%T2, %%XMM4, 01001110b
+        pxor    %%T2, %%XMM4
+        movdqu  %%T5, [%%GDATA + HashKey_5]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+
+        pclmulqdq       %%XMM4, %%T5, 0x00              ; %%XMM3 = a0*b0
+        movdqu  %%T4, [%%GDATA + HashKey_5_k]
+        pclmulqdq       %%T2, %%T4, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+
+        pxor    %%T6, %%T1
+        pxor    %%T7, %%XMM4
+        pxor    %%XMM1, %%T2                            ; results accumulated in %%T6, %%T7, %%XMM1
+
+        ; Karatsuba Method
+        movdqa  %%T1, %%XMM5
+        pshufd  %%T2, %%XMM5, 01001110b
+        pxor    %%T2, %%XMM5
+        movdqu  %%T5, [%%GDATA + HashKey_4]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+
+        pclmulqdq       %%XMM5, %%T5, 0x00              ; %%XMM3 = a0*b0
+        movdqu  %%T4, [%%GDATA + HashKey_4_k]
+        pclmulqdq       %%T2, %%T4, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+
+        pxor    %%T6, %%T1
+        pxor    %%T7, %%XMM5
+        pxor    %%XMM1, %%T2                            ; results accumulated in %%T6, %%T7, %%XMM1
+
+        ; Karatsuba Method
+        movdqa  %%T1, %%XMM6
+        pshufd  %%T2, %%XMM6, 01001110b
+        pxor    %%T2, %%XMM6
+        movdqu  %%T5, [%%GDATA + HashKey_3]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+
+        pclmulqdq       %%XMM6, %%T5, 0x00              ; %%XMM3 = a0*b0
+        movdqu  %%T4, [%%GDATA + HashKey_3_k]
+        pclmulqdq       %%T2, %%T4, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+
+        pxor    %%T6, %%T1
+        pxor    %%T7, %%XMM6
+        pxor    %%XMM1, %%T2                            ; results accumulated in %%T6, %%T7, %%XMM1
+
+        ; Karatsuba Method
+        movdqa  %%T1, %%XMM7
+        pshufd  %%T2, %%XMM7, 01001110b
+        pxor    %%T2, %%XMM7
+        movdqu  %%T5, [%%GDATA + HashKey_2]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+
+        pclmulqdq       %%XMM7, %%T5, 0x00              ; %%XMM3 = a0*b0
+        movdqu  %%T4, [%%GDATA + HashKey_2_k]
+        pclmulqdq       %%T2, %%T4, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+
+        pxor    %%T6, %%T1
+        pxor    %%T7, %%XMM7
+        pxor    %%XMM1, %%T2                            ; results accumulated in %%T6, %%T7, %%XMM1
+
+
+        ; Karatsuba Method
+        movdqa  %%T1, %%XMM8
+        pshufd  %%T2, %%XMM8, 01001110b
+        pxor    %%T2, %%XMM8
+        movdqu  %%T5, [%%GDATA + HashKey]
+        pclmulqdq       %%T1, %%T5, 0x11                ; %%T1 = a1*b1
+
+        pclmulqdq       %%XMM8, %%T5, 0x00              ; %%XMM4 = a0*b0
+        movdqu  %%T4, [%%GDATA + HashKey_k]
+        pclmulqdq       %%T2, %%T4, 0x00                ; %%T2 = (a1+a0)*(b1+b0)
+
+        pxor    %%T6, %%T1
+        pxor    %%T7, %%XMM8
+        pxor    %%T2, %%XMM1
+        pxor    %%T2, %%T6
+        pxor    %%T2, %%T7                              ; middle section of the temp results combined as in Karatsuba algorithm
+
+
+        movdqa  %%T4, %%T2
+        pslldq  %%T4, 8                                 ; shift-L %%T4 2 DWs
+        psrldq  %%T2, 8                                 ; shift-R %%T2 2 DWs
+        pxor    %%T7, %%T4
+        pxor    %%T6, %%T2                              ; <%%T6:%%T7> holds the result of the accumulated carry-less multiplications
+
+
+        ;first phase of the reduction
+        movdqa %%T2, %%T7
+        movdqa %%T3, %%T7
+        movdqa %%T4, %%T7                               ; move %%T7 into %%T2, %%T3, %%T4 in order to perform the three shifts independently
+
+        pslld %%T2, 31                                  ; packed right shifting << 31
+        pslld %%T3, 30                                  ; packed right shifting shift << 30
+        pslld %%T4, 25                                  ; packed right shifting shift << 25
+        pxor %%T2, %%T3                                 ; xor the shifted versions
+        pxor %%T2, %%T4
+
+        movdqa %%T1, %%T2
+        psrldq %%T1, 4                                  ; shift-R %%T1 1 DW
+
+        pslldq %%T2, 12                                 ; shift-L %%T2 3 DWs
+        pxor %%T7, %%T2                                 ; first phase of the reduction complete
+        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+        ;second phase of the reduction
+        movdqa %%T2,%%T7                                ; make 3 copies of %%T7 (in in %%T2, %%T3, %%T4) for doing three shift operations
+        movdqa %%T3,%%T7
+        movdqa %%T4,%%T7
+
+        psrld %%T2,1                                    ; packed left shifting >> 1
+        psrld %%T3,2                                    ; packed left shifting >> 2
+        psrld %%T4,7                                    ; packed left shifting >> 7
+        pxor %%T2,%%T3                                  ; xor the shifted versions
+        pxor %%T2,%%T4
+
+        pxor %%T2, %%T1
+        pxor %%T7, %%T2
+        pxor %%T6, %%T7                                 ; the result is in %%T6
+
+%endmacro
+
+; Encryption of a single block
+%macro ENCRYPT_SINGLE_BLOCK 3
+%define	%%GDATA	%1
+%define	%%ST	%2
+%define	%%T1	%3
+		movdqu	%%T1, [%%GDATA+16*0]
+                pxor    %%ST, %%T1
+%assign i 1
+%rep NROUNDS
+		movdqu	%%T1, [%%GDATA+16*i]
+                aesenc  %%ST, %%T1
+%assign i (i+1)
+%endrep
+		movdqu	%%T1, [%%GDATA+16*i]
+                aesenclast      %%ST, %%T1
+%endmacro
+
+
+;; Start of Stack Setup
+
+%macro FUNC_SAVE 0
+	;; Required for Update/GMC_ENC
+	;the number of pushes must equal STACK_OFFSET
+        push    r12
+        push    r13
+        push    r14
+        push    r15
+        push    rsi
+        mov     r14, rsp
+
+	sub     rsp, VARIABLE_OFFSET
+	and     rsp, ~63
+
+%ifidn __OUTPUT_FORMAT__, win64
+        ; xmm6:xmm15 need to be maintained for Windows
+        movdqu [rsp + LOCAL_STORAGE + 0*16],xmm6
+        movdqu [rsp + LOCAL_STORAGE + 1*16],xmm7
+        movdqu [rsp + LOCAL_STORAGE + 2*16],xmm8
+        movdqu [rsp + LOCAL_STORAGE + 3*16],xmm9
+        movdqu [rsp + LOCAL_STORAGE + 4*16],xmm10
+        movdqu [rsp + LOCAL_STORAGE + 5*16],xmm11
+        movdqu [rsp + LOCAL_STORAGE + 6*16],xmm12
+        movdqu [rsp + LOCAL_STORAGE + 7*16],xmm13
+        movdqu [rsp + LOCAL_STORAGE + 8*16],xmm14
+        movdqu [rsp + LOCAL_STORAGE + 9*16],xmm15
+
+        mov	arg5, arg(5) ;[r14 + STACK_OFFSET + 8*5]
+%endif
+%endmacro
+
+
+%macro FUNC_RESTORE 0
+
+%ifidn __OUTPUT_FORMAT__, win64
+        movdqu xmm15  , [rsp + LOCAL_STORAGE + 9*16]
+        movdqu xmm14  , [rsp + LOCAL_STORAGE + 8*16]
+        movdqu xmm13  , [rsp + LOCAL_STORAGE + 7*16]
+        movdqu xmm12  , [rsp + LOCAL_STORAGE + 6*16]
+        movdqu xmm11  , [rsp + LOCAL_STORAGE + 5*16]
+        movdqu xmm10  , [rsp + LOCAL_STORAGE + 4*16]
+        movdqu xmm9 , [rsp + LOCAL_STORAGE + 3*16]
+        movdqu xmm8 , [rsp + LOCAL_STORAGE + 2*16]
+        movdqu xmm7 , [rsp + LOCAL_STORAGE + 1*16]
+        movdqu xmm6 , [rsp + LOCAL_STORAGE + 0*16]
+%endif
+
+;; Required for Update/GMC_ENC
+        mov     rsp, r14
+        pop     rsi
+        pop     r15
+        pop     r14
+        pop     r13
+        pop     r12
+%endmacro
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; GCM_INIT initializes a gcm_context_data struct to prepare for encoding/decoding.
+; Input: gcm_key_data * (GDATA_KEY), gcm_context_data *(GDATA_CTX), IV,
+; Additional Authentication data (A_IN), Additional Data length (A_LEN).
+; Output: Updated GDATA_CTX with the hash of A_IN (AadHash) and initialized other parts of GDATA.
+; Clobbers rax, r10-r13 and xmm0-xmm6
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+%macro  GCM_INIT 	5
+%define %%GDATA_KEY	%1
+%define %%GDATA_CTX	%2
+%define %%IV		%3
+%define %%A_IN		%4
+%define %%A_LEN		%5
+%define %%AAD_HASH	xmm0
+%define %%SUBHASH	xmm1
+
+
+        movdqu  %%SUBHASH, [%%GDATA_KEY + HashKey]
+
+	CALC_AAD_HASH %%A_IN, %%A_LEN, %%AAD_HASH, %%SUBHASH, xmm2, xmm3, xmm4, xmm5, xmm6, r10, r11, r12, r13, rax
+	pxor	xmm2, xmm3
+	mov	r10, %%A_LEN
+
+	movdqu	[%%GDATA_CTX + AadHash], %%AAD_HASH	; ctx_data.aad hash = aad_hash
+	mov	[%%GDATA_CTX + AadLen], r10		; ctx_data.aad_length = aad_length
+	xor	r10, r10
+	mov	[%%GDATA_CTX + InLen], r10		; ctx_data.in_length = 0
+	mov	[%%GDATA_CTX + PBlockLen], r10		; ctx_data.partial_block_length = 0
+	movdqu	[%%GDATA_CTX + PBlockEncKey], xmm2	; ctx_data.partial_block_enc_key = 0
+	mov	r10, %%IV
+        movdqa  xmm2, [rel ONEf]                        ; read 12 IV bytes and pad with 0x00000001
+        pinsrq  xmm2, [r10], 0
+        pinsrd  xmm2, [r10+8], 2
+	movdqu	[%%GDATA_CTX + OrigIV], xmm2		; ctx_data.orig_IV = iv
+
+	pshufb xmm2, [SHUF_MASK]
+
+	movdqu	[%%GDATA_CTX + CurCount], xmm2		; ctx_data.current_counter = iv
+%endmacro
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed gcm_context_data
+; struct has been initialized by GCM_INIT.
+; Requires the input data be at least 1 byte long because of READ_SMALL_INPUT_DATA.
+; Input: gcm_key_data * (GDATA_KEY), gcm_context_data (GDATA_CTX), input text (PLAIN_CYPH_IN),
+; input text length (PLAIN_CYPH_LEN) and whether encoding or decoding (ENC_DEC)
+; Output: A cypher of the given plain text (CYPH_PLAIN_OUT), and updated GDATA_CTX
+; Clobbers rax, r10-r15, and xmm0-xmm15
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+%macro	GCM_ENC_DEC		6
+%define	%%GDATA_KEY		%1
+%define	%%GDATA_CTX		%2
+%define	%%CYPH_PLAIN_OUT	%3
+%define	%%PLAIN_CYPH_IN		%4
+%define	%%PLAIN_CYPH_LEN	%5
+%define	%%ENC_DEC		%6
+%define	%%DATA_OFFSET		r11
+
+; Macro flow:
+; calculate the number of 16byte blocks in the message
+; process (number of 16byte blocks) mod 8 '%%_initial_num_blocks_is_# .. %%_initial_blocks_encrypted'
+; process 8 16 byte blocks at a time until all are done '%%_encrypt_by_8_new .. %%_eight_cipher_left'
+; if there is a block of less tahn 16 bytes process it '%%_zero_cipher_left .. %%_multiple_of_16_bytes'
+
+	cmp	%%PLAIN_CYPH_LEN, 0
+	je	%%_multiple_of_16_bytes
+
+	xor	%%DATA_OFFSET, %%DATA_OFFSET
+	add	[%%GDATA_CTX + InLen], %%PLAIN_CYPH_LEN ;Update length of data processed
+	movdqu	xmm13, [%%GDATA_KEY + HashKey]                 ; xmm13 = HashKey
+	movdqu	xmm8, [%%GDATA_CTX + AadHash]
+
+
+	PARTIAL_BLOCK %%GDATA_KEY, %%GDATA_CTX, %%CYPH_PLAIN_OUT, %%PLAIN_CYPH_IN, %%PLAIN_CYPH_LEN, %%DATA_OFFSET, xmm8, %%ENC_DEC
+
+        mov     r13, %%PLAIN_CYPH_LEN                               ; save the number of bytes of plaintext/ciphertext
+	sub	r13, %%DATA_OFFSET
+	mov	r10, r13	;save the amount of data left to process in r10
+        and     r13, -16                                ; r13 = r13 - (r13 mod 16)
+
+        mov     r12, r13
+        shr     r12, 4
+        and     r12, 7
+        jz      %%_initial_num_blocks_is_0
+
+        cmp     r12, 7
+        je      %%_initial_num_blocks_is_7
+        cmp     r12, 6
+        je      %%_initial_num_blocks_is_6
+        cmp     r12, 5
+        je      %%_initial_num_blocks_is_5
+        cmp     r12, 4
+        je      %%_initial_num_blocks_is_4
+        cmp     r12, 3
+        je      %%_initial_num_blocks_is_3
+        cmp     r12, 2
+        je      %%_initial_num_blocks_is_2
+
+        jmp     %%_initial_num_blocks_is_1
+
+%%_initial_num_blocks_is_7:
+	INITIAL_BLOCKS	%%GDATA_KEY, %%GDATA_CTX, %%CYPH_PLAIN_OUT, %%PLAIN_CYPH_IN, r13, %%DATA_OFFSET, 7, xmm12, xmm13, xmm14, xmm15, xmm11, xmm9, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm10, xmm0, %%ENC_DEC
+        sub     r13, 16*7
+        jmp     %%_initial_blocks_encrypted
+
+%%_initial_num_blocks_is_6:
+	INITIAL_BLOCKS	%%GDATA_KEY, %%GDATA_CTX, %%CYPH_PLAIN_OUT, %%PLAIN_CYPH_IN, r13, %%DATA_OFFSET, 6, xmm12, xmm13, xmm14, xmm15, xmm11, xmm9, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm10, xmm0, %%ENC_DEC
+        sub     r13, 16*6
+        jmp     %%_initial_blocks_encrypted
+
+%%_initial_num_blocks_is_5:
+	INITIAL_BLOCKS	%%GDATA_KEY, %%GDATA_CTX, %%CYPH_PLAIN_OUT, %%PLAIN_CYPH_IN, r13, %%DATA_OFFSET, 5, xmm12, xmm13, xmm14, xmm15, xmm11, xmm9, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm10, xmm0, %%ENC_DEC
+        sub     r13, 16*5
+        jmp     %%_initial_blocks_encrypted
+
+%%_initial_num_blocks_is_4:
+	INITIAL_BLOCKS	%%GDATA_KEY, %%GDATA_CTX, %%CYPH_PLAIN_OUT, %%PLAIN_CYPH_IN, r13, %%DATA_OFFSET, 4, xmm12, xmm13, xmm14, xmm15, xmm11, xmm9, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm10, xmm0, %%ENC_DEC
+        sub     r13, 16*4
+        jmp     %%_initial_blocks_encrypted
+
+
+%%_initial_num_blocks_is_3:
+	INITIAL_BLOCKS	%%GDATA_KEY, %%GDATA_CTX, %%CYPH_PLAIN_OUT, %%PLAIN_CYPH_IN, r13, %%DATA_OFFSET, 3, xmm12, xmm13, xmm14, xmm15, xmm11, xmm9, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm10, xmm0, %%ENC_DEC
+        sub     r13, 16*3
+        jmp     %%_initial_blocks_encrypted
+%%_initial_num_blocks_is_2:
+	INITIAL_BLOCKS	%%GDATA_KEY, %%GDATA_CTX, %%CYPH_PLAIN_OUT, %%PLAIN_CYPH_IN, r13, %%DATA_OFFSET, 2, xmm12, xmm13, xmm14, xmm15, xmm11, xmm9, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm10, xmm0, %%ENC_DEC
+        sub     r13, 16*2
+        jmp     %%_initial_blocks_encrypted
+
+%%_initial_num_blocks_is_1:
+	INITIAL_BLOCKS	%%GDATA_KEY, %%GDATA_CTX, %%CYPH_PLAIN_OUT, %%PLAIN_CYPH_IN, r13, %%DATA_OFFSET, 1, xmm12, xmm13, xmm14, xmm15, xmm11, xmm9, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm10, xmm0, %%ENC_DEC
+        sub     r13, 16
+        jmp     %%_initial_blocks_encrypted
+
+%%_initial_num_blocks_is_0:
+	INITIAL_BLOCKS	%%GDATA_KEY, %%GDATA_CTX, %%CYPH_PLAIN_OUT, %%PLAIN_CYPH_IN, r13, %%DATA_OFFSET, 0, xmm12, xmm13, xmm14, xmm15, xmm11, xmm9, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm10, xmm0, %%ENC_DEC
+
+
+%%_initial_blocks_encrypted:
+        cmp     r13, 0
+        je      %%_zero_cipher_left
+
+        sub     r13, 128
+        je      %%_eight_cipher_left
+
+
+
+
+        movd    r15d, xmm9
+        and     r15d, 255
+        pshufb  xmm9, [SHUF_MASK]
+
+
+%%_encrypt_by_8_new:
+        cmp     r15d, 255-8
+        jg      %%_encrypt_by_8
+
+
+
+        add     r15b, 8
+	GHASH_8_ENCRYPT_8_PARALLEL	%%GDATA_KEY, %%CYPH_PLAIN_OUT, %%PLAIN_CYPH_IN, %%DATA_OFFSET, xmm0, xmm10, xmm11, xmm12, xmm13, xmm14, xmm9, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm15, out_order, %%ENC_DEC
+        add     %%DATA_OFFSET, 128
+        sub     r13, 128
+        jne     %%_encrypt_by_8_new
+
+        pshufb  xmm9, [SHUF_MASK]
+        jmp     %%_eight_cipher_left
+
+%%_encrypt_by_8:
+        pshufb  xmm9, [SHUF_MASK]
+        add     r15b, 8
+	GHASH_8_ENCRYPT_8_PARALLEL	%%GDATA_KEY, %%CYPH_PLAIN_OUT, %%PLAIN_CYPH_IN, %%DATA_OFFSET, xmm0, xmm10, xmm11, xmm12, xmm13, xmm14, xmm9, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8, xmm15, in_order, %%ENC_DEC
+        pshufb  xmm9, [SHUF_MASK]
+        add     %%DATA_OFFSET, 128
+        sub     r13, 128
+        jne     %%_encrypt_by_8_new
+
+        pshufb  xmm9, [SHUF_MASK]
+
+
+
+
+%%_eight_cipher_left:
+	GHASH_LAST_8	%%GDATA_KEY, xmm0, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8
+
+
+%%_zero_cipher_left:
+	movdqu	[%%GDATA_CTX + AadHash], xmm14
+	movdqu	[%%GDATA_CTX + CurCount], xmm9
+
+        mov     r13, r10
+        and     r13, 15                                ; r13 = (%%PLAIN_CYPH_LEN mod 16)
+
+        je      %%_multiple_of_16_bytes
+
+	mov	[%%GDATA_CTX + PBlockLen], r13		; my_ctx.data.partial_blck_length = r13
+        ; handle the last <16 Byte block seperately
+
+        paddd   xmm9, [ONE]                     ; INCR CNT to get Yn
+	movdqu	[%%GDATA_CTX + CurCount], xmm9		; my_ctx.data.current_counter = xmm9
+        pshufb  xmm9, [SHUF_MASK]
+	ENCRYPT_SINGLE_BLOCK	%%GDATA_KEY, xmm9, xmm2                    ; E(K, Yn)
+	movdqu	[%%GDATA_CTX + PBlockEncKey], xmm9		; my_ctx_data.partial_block_enc_key = xmm9
+
+	cmp	%%PLAIN_CYPH_LEN, 16
+	jge	%%_large_enough_update
+
+	lea	r10, [%%PLAIN_CYPH_IN + %%DATA_OFFSET]
+	READ_SMALL_DATA_INPUT	xmm1, r10, r13, r12, r15, rax
+	lea	r12, [SHIFT_MASK + 16]
+	sub	r12, r13
+	jmp	%%_data_read
+
+%%_large_enough_update:
+        sub     %%DATA_OFFSET, 16
+        add     %%DATA_OFFSET, r13
+
+        movdqu  xmm1, [%%PLAIN_CYPH_IN+%%DATA_OFFSET]                        ; receive the last <16 Byte block
+
+	sub     %%DATA_OFFSET, r13
+        add     %%DATA_OFFSET, 16
+
+        lea     r12, [SHIFT_MASK + 16]
+        sub     r12, r13                                ; adjust the shuffle mask pointer to be able to shift 16-r13 bytes (r13 is the number of bytes in plaintext mod 16)
+        movdqu  xmm2, [r12]                             ; get the appropriate shuffle mask
+        pshufb  xmm1, xmm2                              ; shift right 16-r13 bytes
+%%_data_read:
+        %ifidn  %%ENC_DEC, DEC
+        movdqa  xmm2, xmm1
+        pxor    xmm9, xmm1                              ; Plaintext XOR E(K, Yn)
+        movdqu  xmm1, [r12 + ALL_F - SHIFT_MASK]        ; get the appropriate mask to mask out top 16-r13 bytes of xmm9
+        pand    xmm9, xmm1                              ; mask out top 16-r13 bytes of xmm9
+        pand    xmm2, xmm1
+        pshufb  xmm2, [SHUF_MASK]
+        pxor    xmm14, xmm2
+	movdqu	[%%GDATA_CTX + AadHash], xmm14
+
+        %else
+        pxor    xmm9, xmm1                              ; Plaintext XOR E(K, Yn)
+        movdqu  xmm1, [r12 + ALL_F - SHIFT_MASK]        ; get the appropriate mask to mask out top 16-r13 bytes of xmm9
+        pand    xmm9, xmm1                              ; mask out top 16-r13 bytes of xmm9
+        pshufb  xmm9, [SHUF_MASK]
+        pxor    xmm14, xmm9
+	movdqu	[%%GDATA_CTX + AadHash], xmm14
+
+        pshufb  xmm9, [SHUF_MASK]               ; shuffle xmm9 back to output as ciphertext
+        %endif
+
+
+        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+        ; output r13 Bytes
+        movq    rax, xmm9
+        cmp     r13, 8
+        jle     %%_less_than_8_bytes_left
+
+        mov     [%%CYPH_PLAIN_OUT + %%DATA_OFFSET], rax
+        add     %%DATA_OFFSET, 8
+        psrldq  xmm9, 8
+        movq    rax, xmm9
+        sub     r13, 8
+
+%%_less_than_8_bytes_left:
+        mov     BYTE [%%CYPH_PLAIN_OUT + %%DATA_OFFSET], al
+        add     %%DATA_OFFSET, 1
+        shr     rax, 8
+        sub     r13, 1
+        jne     %%_less_than_8_bytes_left
+        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+%%_multiple_of_16_bytes:
+
+%endmacro
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; GCM_COMPLETE Finishes Encyrption/Decryption of last partial block after GCM_UPDATE finishes.
+; Input: A gcm_key_data * (GDATA_KEY), gcm_context_data * (GDATA_CTX) and
+; whether encoding or decoding (ENC_DEC).
+; Output: Authorization Tag (AUTH_TAG) and Authorization Tag length (AUTH_TAG_LEN)
+; Clobbers rax, r10-r12, and xmm0, xmm1, xmm5, xmm6, xmm9, xmm11, xmm14, xmm15
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+%macro	GCM_COMPLETE		5
+%define	%%GDATA_KEY		%1
+%define	%%GDATA_CTX		%2
+%define	%%AUTH_TAG		%3
+%define	%%AUTH_TAG_LEN		%4
+%define	%%ENC_DEC		%5
+%define	%%PLAIN_CYPH_LEN	rax
+
+        mov     r12, [%%GDATA_CTX + PBlockLen]		; r12 = aadLen (number of bytes)
+	movdqu	xmm14, [%%GDATA_CTX + AadHash]
+	movdqu	xmm13, [%%GDATA_KEY + HashKey]
+
+	cmp	r12, 0
+
+	je %%_partial_done
+
+	GHASH_MUL xmm14, xmm13, xmm0, xmm10, xmm11, xmm5, xmm6 ;GHASH computation for the last <16 Byte block
+	movdqu	[%%GDATA_CTX + AadHash], xmm14
+
+%%_partial_done:
+
+	mov	r12, [%%GDATA_CTX + AadLen]			; r12 = aadLen (number of bytes)
+	mov	%%PLAIN_CYPH_LEN, [%%GDATA_CTX + InLen]
+
+        shl     r12, 3                                  ; convert into number of bits
+        movd    xmm15, r12d                             ; len(A) in xmm15
+
+        shl     %%PLAIN_CYPH_LEN, 3                     ; len(C) in bits  (*128)
+        movq    xmm1, %%PLAIN_CYPH_LEN
+        pslldq  xmm15, 8                                ; xmm15 = len(A)|| 0x0000000000000000
+        pxor    xmm15, xmm1                             ; xmm15 = len(A)||len(C)
+
+        pxor    xmm14, xmm15
+        GHASH_MUL       xmm14, xmm13, xmm0, xmm10, xmm11, xmm5, xmm6    ; final GHASH computation
+        pshufb  xmm14, [SHUF_MASK]                      ; perform a 16Byte swap
+
+        movdqu  xmm9, [%%GDATA_CTX + OrigIV]            ; xmm9 = Y0
+
+	ENCRYPT_SINGLE_BLOCK	%%GDATA_KEY, xmm9, xmm2	; E(K, Y0)
+
+        pxor    xmm9, xmm14
+
+
+
+%%_return_T:
+	mov	r10, %%AUTH_TAG				; r10 = authTag
+	mov	r11, %%AUTH_TAG_LEN			; r11 = auth_tag_len
+
+        cmp     r11, 16
+        je      %%_T_16
+
+        cmp     r11, 12
+        je      %%_T_12
+
+%%_T_8:
+        movq    rax, xmm9
+        mov     [r10], rax
+        jmp     %%_return_T_done
+%%_T_12:
+        movq    rax, xmm9
+        mov     [r10], rax
+        psrldq  xmm9, 8
+        movd    eax, xmm9
+        mov     [r10 + 8], eax
+        jmp     %%_return_T_done
+
+%%_T_16:
+        movdqu  [r10], xmm9
+
+%%_return_T_done:
+%endmacro ;GCM_COMPLETE
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;void	aes_gcm_precomp_128_sse / aes_gcm_precomp_192_sse / aes_gcm_precomp_256_sse
+;        (struct gcm_key_data *key_data);
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+%ifnidn FUNCT_EXTENSION, _nt
+global FN_NAME(precomp,_)
+FN_NAME(precomp,_):
+	endbranch
+
+        push    r12
+        push    r13
+        push    r14
+        push    r15
+
+        mov     r14, rsp
+
+
+
+        sub     rsp, VARIABLE_OFFSET
+        and     rsp, ~63                                ; align rsp to 64 bytes
+
+%ifidn __OUTPUT_FORMAT__, win64
+        ; only xmm6 needs to be maintained
+        movdqu [rsp + LOCAL_STORAGE + 0*16],xmm6
+%endif
+
+	pxor	xmm6, xmm6
+	ENCRYPT_SINGLE_BLOCK	arg1, xmm6, xmm2	; xmm6 = HashKey
+
+        pshufb  xmm6, [SHUF_MASK]
+        ;;;;;;;;;;;;;;;  PRECOMPUTATION of HashKey<<1 mod poly from the HashKey;;;;;;;;;;;;;;;
+        movdqa  xmm2, xmm6
+        psllq   xmm6, 1
+        psrlq   xmm2, 63
+        movdqa  xmm1, xmm2
+        pslldq  xmm2, 8
+        psrldq  xmm1, 8
+        por     xmm6, xmm2
+        ;reduction
+        pshufd  xmm2, xmm1, 00100100b
+        pcmpeqd xmm2, [TWOONE]
+        pand    xmm2, [POLY]
+        pxor    xmm6, xmm2                             ; xmm6 holds the HashKey<<1 mod poly
+        ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+        movdqu  [arg1 + HashKey], xmm6                  ; store HashKey<<1 mod poly
+
+
+        PRECOMPUTE  arg1, xmm6, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5
+
+%ifidn __OUTPUT_FORMAT__, win64
+       movdqu xmm6, [rsp + LOCAL_STORAGE + 0*16]
+%endif
+        mov     rsp, r14
+
+        pop     r15
+        pop     r14
+        pop     r13
+        pop     r12
+ret
+%endif	; _nt
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;void   aes_gcm_init_128_sse / aes_gcm_init_192_sse / aes_gcm_init_256_sse (
+;        const struct gcm_key_data *key_data,
+;        struct gcm_context_data *context_data,
+;        u8      *iv,
+;        const   u8 *aad,
+;        u64     aad_len);
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+%ifnidn FUNCT_EXTENSION, _nt
+global FN_NAME(init,_)
+FN_NAME(init,_):
+	endbranch
+
+	push	r12
+	push	r13
+%ifidn __OUTPUT_FORMAT__, win64
+	; xmm6:xmm15 need to be maintained for Windows
+        push    arg5
+	sub	rsp, 1*16
+	movdqu	[rsp + 0*16],xmm6
+        mov     arg5, [rsp + 1*16 + 8*3 + 8*5]
+%endif
+
+	GCM_INIT arg1, arg2, arg3, arg4, arg5
+
+%ifidn __OUTPUT_FORMAT__, win64
+	movdqu	xmm6 , [rsp + 0*16]
+	add	rsp, 1*16
+        pop     arg5
+%endif
+	pop	r13
+	pop	r12
+        ret
+%endif	; _nt
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;void   aes_gcm_enc_128_update_sse / aes_gcm_enc_192_update_sse / aes_gcm_enc_256_update_sse
+;        const struct gcm_key_data *key_data,
+;        struct gcm_context_data *context_data,
+;        u8      *out,
+;        const   u8 *in,
+;        u64     plaintext_len);
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+global FN_NAME(enc,_update_)
+FN_NAME(enc,_update_):
+	endbranch
+
+	FUNC_SAVE
+
+	GCM_ENC_DEC arg1, arg2, arg3, arg4, arg5, ENC
+
+	FUNC_RESTORE
+
+	ret
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;void   aes_gcm_dec_256_update_sse / aes_gcm_dec_192_update_sse / aes_gcm_dec_256_update_sse
+;        const struct gcm_key_data *key_data,
+;        struct gcm_context_data *context_data,
+;        u8      *out,
+;        const   u8 *in,
+;        u64     plaintext_len);
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+global FN_NAME(dec,_update_)
+FN_NAME(dec,_update_):
+	endbranch
+
+	FUNC_SAVE
+
+	GCM_ENC_DEC arg1, arg2, arg3, arg4, arg5, DEC
+
+	FUNC_RESTORE
+
+	ret
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;void   aes_gcm_enc_128_finalize_sse / aes_gcm_enc_192_finalize_sse / aes_gcm_enc_256_finalize_sse
+;        const struct gcm_key_data *key_data,
+;        struct gcm_context_data *context_data,
+;        u8      *auth_tag,
+;        u64     auth_tag_len);
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+%ifnidn FUNCT_EXTENSION, _nt
+global FN_NAME(enc,_finalize_)
+FN_NAME(enc,_finalize_):
+	endbranch
+
+	push r12
+
+%ifidn __OUTPUT_FORMAT__, win64
+	; xmm6:xmm15 need to be maintained for Windows
+	sub	rsp, 5*16
+	movdqu	[rsp + 0*16],xmm6
+	movdqu	[rsp + 1*16],xmm9
+	movdqu	[rsp + 2*16],xmm11
+	movdqu	[rsp + 3*16],xmm14
+	movdqu	[rsp + 4*16],xmm15
+%endif
+	GCM_COMPLETE	arg1, arg2, arg3, arg4, ENC
+
+%ifidn __OUTPUT_FORMAT__, win64
+	movdqu	xmm15  , [rsp + 4*16]
+	movdqu	xmm14  , [rsp+ 3*16]
+	movdqu	xmm11  , [rsp + 2*16]
+	movdqu	xmm9 , [rsp + 1*16]
+	movdqu	xmm6 , [rsp + 0*16]
+	add	rsp, 5*16
+%endif
+
+	pop r12
+        ret
+%endif	; _nt
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;void   aes_gcm_dec_128_finalize_sse / aes_gcm_dec_192_finalize_sse / aes_gcm_dec_256_finalize_sse
+;        const struct gcm_key_data *key_data,
+;        struct gcm_context_data *context_data,
+;        u8      *auth_tag,
+;        u64     auth_tag_len);
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+%ifnidn FUNCT_EXTENSION, _nt
+global FN_NAME(dec,_finalize_)
+FN_NAME(dec,_finalize_):
+	endbranch
+
+	push r12
+
+%ifidn __OUTPUT_FORMAT__, win64
+	; xmm6:xmm15 need to be maintained for Windows
+	sub	rsp, 5*16
+	movdqu	[rsp + 0*16],xmm6
+	movdqu	[rsp + 1*16],xmm9
+	movdqu	[rsp + 2*16],xmm11
+	movdqu	[rsp + 3*16],xmm14
+	movdqu	[rsp + 4*16],xmm15
+%endif
+	GCM_COMPLETE	arg1, arg2, arg3, arg4, DEC
+
+%ifidn __OUTPUT_FORMAT__, win64
+	movdqu	xmm15  , [rsp + 4*16]
+	movdqu	xmm14  , [rsp+ 3*16]
+	movdqu	xmm11  , [rsp + 2*16]
+	movdqu	xmm9 , [rsp + 1*16]
+	movdqu	xmm6 , [rsp + 0*16]
+	add	rsp, 5*16
+%endif
+
+	pop r12
+        ret
+%endif	; _nt
+
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;void   aes_gcm_enc_128_sse / aes_gcm_enc_192_sse / aes_gcm_enc_256_sse
+;        const struct gcm_key_data *key_data,
+;        struct gcm_context_data *context_data,
+;        u8      *out,
+;        const   u8 *in,
+;        u64     plaintext_len,
+;        u8      *iv,
+;        const   u8 *aad,
+;        u64     aad_len,
+;        u8      *auth_tag,
+;        u64     auth_tag_len);
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+global FN_NAME(enc,_)
+FN_NAME(enc,_):
+	endbranch
+
+	FUNC_SAVE
+
+	GCM_INIT arg1, arg2, arg6, arg7, arg8
+
+	GCM_ENC_DEC  arg1, arg2, arg3, arg4, arg5, ENC
+
+	GCM_COMPLETE arg1, arg2, arg9, arg10, ENC
+
+	FUNC_RESTORE
+
+	ret
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;void   aes_gcm_dec_128_sse / aes_gcm_dec_192_sse / aes_gcm_dec_256_sse
+;        const struct gcm_key_data *key_data,
+;        struct gcm_context_data *context_data,
+;        u8      *out,
+;        const   u8 *in,
+;        u64     plaintext_len,
+;        u8      *iv,
+;        const   u8 *aad,
+;        u64     aad_len,
+;        u8      *auth_tag,
+;        u64     auth_tag_len);
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+global FN_NAME(dec,_)
+FN_NAME(dec,_):
+	endbranch
+
+	FUNC_SAVE
+
+	GCM_INIT arg1, arg2, arg6, arg7, arg8
+
+	GCM_ENC_DEC  arg1, arg2, arg3, arg4, arg5, DEC
+
+	GCM_COMPLETE arg1, arg2, arg9, arg10, DEC
+
+	FUNC_RESTORE
+
+	ret
diff --git a/contrib/icp/gcm-simd/isa-l_crypto/reg_sizes.asm b/contrib/icp/gcm-simd/isa-l_crypto/reg_sizes.asm
new file mode 100644
index 000000000000..991fe48b80a0
--- /dev/null
+++ b/contrib/icp/gcm-simd/isa-l_crypto/reg_sizes.asm
@@ -0,0 +1,459 @@
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+;  Copyright(c) 2011-2019 Intel Corporation All rights reserved.
+;
+;  Redistribution and use in source and binary forms, with or without
+;  modification, are permitted provided that the following conditions
+;  are met:
+;    * Redistributions of source code must retain the above copyright
+;      notice, this list of conditions and the following disclaimer.
+;    * Redistributions in binary form must reproduce the above copyright
+;      notice, this list of conditions and the following disclaimer in
+;      the documentation and/or other materials provided with the
+;      distribution.
+;    * Neither the name of Intel Corporation nor the names of its
+;      contributors may be used to endorse or promote products derived
+;      from this software without specific prior written permission.
+;
+;  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+;  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+;  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+;  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+;  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+;  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+;  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+;  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+;  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+;  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+;  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+%ifndef _REG_SIZES_ASM_
+%define _REG_SIZES_ASM_
+
+%ifndef AS_FEATURE_LEVEL
+%define AS_FEATURE_LEVEL 4
+%endif
+
+%define EFLAGS_HAS_CPUID        (1<<21)
+%define FLAG_CPUID1_ECX_CLMUL   (1<<1)
+%define FLAG_CPUID1_EDX_SSE2    (1<<26)
+%define FLAG_CPUID1_ECX_SSE3	(1)
+%define FLAG_CPUID1_ECX_SSE4_1  (1<<19)
+%define FLAG_CPUID1_ECX_SSE4_2  (1<<20)
+%define FLAG_CPUID1_ECX_POPCNT  (1<<23)
+%define FLAG_CPUID1_ECX_AESNI   (1<<25)
+%define FLAG_CPUID1_ECX_OSXSAVE (1<<27)
+%define FLAG_CPUID1_ECX_AVX     (1<<28)
+%define FLAG_CPUID1_EBX_AVX2    (1<<5)
+
+%define FLAG_CPUID7_EBX_AVX2           (1<<5)
+%define FLAG_CPUID7_EBX_AVX512F        (1<<16)
+%define FLAG_CPUID7_EBX_AVX512DQ       (1<<17)
+%define FLAG_CPUID7_EBX_AVX512IFMA     (1<<21)
+%define FLAG_CPUID7_EBX_AVX512PF       (1<<26)
+%define FLAG_CPUID7_EBX_AVX512ER       (1<<27)
+%define FLAG_CPUID7_EBX_AVX512CD       (1<<28)
+%define FLAG_CPUID7_EBX_SHA            (1<<29)
+%define FLAG_CPUID7_EBX_AVX512BW       (1<<30)
+%define FLAG_CPUID7_EBX_AVX512VL       (1<<31)
+
+%define FLAG_CPUID7_ECX_AVX512VBMI     (1<<1)
+%define FLAG_CPUID7_ECX_AVX512VBMI2    (1 << 6)
+%define FLAG_CPUID7_ECX_GFNI           (1 << 8)
+%define FLAG_CPUID7_ECX_VAES           (1 << 9)
+%define FLAG_CPUID7_ECX_VPCLMULQDQ     (1 << 10)
+%define FLAG_CPUID7_ECX_VNNI           (1 << 11)
+%define FLAG_CPUID7_ECX_BITALG         (1 << 12)
+%define FLAG_CPUID7_ECX_VPOPCNTDQ      (1 << 14)
+
+%define FLAGS_CPUID7_EBX_AVX512_G1 (FLAG_CPUID7_EBX_AVX512F | FLAG_CPUID7_EBX_AVX512VL | FLAG_CPUID7_EBX_AVX512BW | FLAG_CPUID7_EBX_AVX512CD | FLAG_CPUID7_EBX_AVX512DQ)
+%define FLAGS_CPUID7_ECX_AVX512_G2 (FLAG_CPUID7_ECX_AVX512VBMI2 | FLAG_CPUID7_ECX_GFNI | FLAG_CPUID7_ECX_VAES | FLAG_CPUID7_ECX_VPCLMULQDQ | FLAG_CPUID7_ECX_VNNI | FLAG_CPUID7_ECX_BITALG | FLAG_CPUID7_ECX_VPOPCNTDQ)
+
+%define FLAG_XGETBV_EAX_XMM            (1<<1)
+%define FLAG_XGETBV_EAX_YMM            (1<<2)
+%define FLAG_XGETBV_EAX_XMM_YMM        0x6
+%define FLAG_XGETBV_EAX_ZMM_OPM        0xe0
+
+%define FLAG_CPUID1_EAX_AVOTON     0x000406d0
+%define FLAG_CPUID1_EAX_STEP_MASK  0xfffffff0
+
+; define d and w variants for registers
+
+%define	raxd	eax
+%define raxw	ax
+%define raxb	al
+
+%define	rbxd	ebx
+%define rbxw	bx
+%define rbxb	bl
+
+%define	rcxd	ecx
+%define rcxw	cx
+%define rcxb	cl
+
+%define	rdxd	edx
+%define rdxw	dx
+%define rdxb	dl
+
+%define	rsid	esi
+%define rsiw	si
+%define rsib	sil
+
+%define	rdid	edi
+%define rdiw	di
+%define rdib	dil
+
+%define	rbpd	ebp
+%define rbpw	bp
+%define rbpb	bpl
+
+%define zmm0x xmm0
+%define zmm1x xmm1
+%define zmm2x xmm2
+%define zmm3x xmm3
+%define zmm4x xmm4
+%define zmm5x xmm5
+%define zmm6x xmm6
+%define zmm7x xmm7
+%define zmm8x xmm8
+%define zmm9x xmm9
+%define zmm10x xmm10
+%define zmm11x xmm11
+%define zmm12x xmm12
+%define zmm13x xmm13
+%define zmm14x xmm14
+%define zmm15x xmm15
+%define zmm16x xmm16
+%define zmm17x xmm17
+%define zmm18x xmm18
+%define zmm19x xmm19
+%define zmm20x xmm20
+%define zmm21x xmm21
+%define zmm22x xmm22
+%define zmm23x xmm23
+%define zmm24x xmm24
+%define zmm25x xmm25
+%define zmm26x xmm26
+%define zmm27x xmm27
+%define zmm28x xmm28
+%define zmm29x xmm29
+%define zmm30x xmm30
+%define zmm31x xmm31
+
+%define ymm0x xmm0
+%define ymm1x xmm1
+%define ymm2x xmm2
+%define ymm3x xmm3
+%define ymm4x xmm4
+%define ymm5x xmm5
+%define ymm6x xmm6
+%define ymm7x xmm7
+%define ymm8x xmm8
+%define ymm9x xmm9
+%define ymm10x xmm10
+%define ymm11x xmm11
+%define ymm12x xmm12
+%define ymm13x xmm13
+%define ymm14x xmm14
+%define ymm15x xmm15
+%define ymm16x xmm16
+%define ymm17x xmm17
+%define ymm18x xmm18
+%define ymm19x xmm19
+%define ymm20x xmm20
+%define ymm21x xmm21
+%define ymm22x xmm22
+%define ymm23x xmm23
+%define ymm24x xmm24
+%define ymm25x xmm25
+%define ymm26x xmm26
+%define ymm27x xmm27
+%define ymm28x xmm28
+%define ymm29x xmm29
+%define ymm30x xmm30
+%define ymm31x xmm31
+
+%define xmm0x xmm0
+%define xmm1x xmm1
+%define xmm2x xmm2
+%define xmm3x xmm3
+%define xmm4x xmm4
+%define xmm5x xmm5
+%define xmm6x xmm6
+%define xmm7x xmm7
+%define xmm8x xmm8
+%define xmm9x xmm9
+%define xmm10x xmm10
+%define xmm11x xmm11
+%define xmm12x xmm12
+%define xmm13x xmm13
+%define xmm14x xmm14
+%define xmm15x xmm15
+%define xmm16x xmm16
+%define xmm17x xmm17
+%define xmm18x xmm18
+%define xmm19x xmm19
+%define xmm20x xmm20
+%define xmm21x xmm21
+%define xmm22x xmm22
+%define xmm23x xmm23
+%define xmm24x xmm24
+%define xmm25x xmm25
+%define xmm26x xmm26
+%define xmm27x xmm27
+%define xmm28x xmm28
+%define xmm29x xmm29
+%define xmm30x xmm30
+%define xmm31x xmm31
+
+%define zmm0y ymm0
+%define zmm1y ymm1
+%define zmm2y ymm2
+%define zmm3y ymm3
+%define zmm4y ymm4
+%define zmm5y ymm5
+%define zmm6y ymm6
+%define zmm7y ymm7
+%define zmm8y ymm8
+%define zmm9y ymm9
+%define zmm10y ymm10
+%define zmm11y ymm11
+%define zmm12y ymm12
+%define zmm13y ymm13
+%define zmm14y ymm14
+%define zmm15y ymm15
+%define zmm16y ymm16
+%define zmm17y ymm17
+%define zmm18y ymm18
+%define zmm19y ymm19
+%define zmm20y ymm20
+%define zmm21y ymm21
+%define zmm22y ymm22
+%define zmm23y ymm23
+%define zmm24y ymm24
+%define zmm25y ymm25
+%define zmm26y ymm26
+%define zmm27y ymm27
+%define zmm28y ymm28
+%define zmm29y ymm29
+%define zmm30y ymm30
+%define zmm31y ymm31
+
+%define xmm0y ymm0
+%define xmm1y ymm1
+%define xmm2y ymm2
+%define xmm3y ymm3
+%define xmm4y ymm4
+%define xmm5y ymm5
+%define xmm6y ymm6
+%define xmm7y ymm7
+%define xmm8y ymm8
+%define xmm9y ymm9
+%define xmm10y ymm10
+%define xmm11y ymm11
+%define xmm12y ymm12
+%define xmm13y ymm13
+%define xmm14y ymm14
+%define xmm15y ymm15
+%define xmm16y ymm16
+%define xmm17y ymm17
+%define xmm18y ymm18
+%define xmm19y ymm19
+%define xmm20y ymm20
+%define xmm21y ymm21
+%define xmm22y ymm22
+%define xmm23y ymm23
+%define xmm24y ymm24
+%define xmm25y ymm25
+%define xmm26y ymm26
+%define xmm27y ymm27
+%define xmm28y ymm28
+%define xmm29y ymm29
+%define xmm30y ymm30
+%define xmm31y ymm31
+
+%define xmm0z zmm0
+%define xmm1z zmm1
+%define xmm2z zmm2
+%define xmm3z zmm3
+%define xmm4z zmm4
+%define xmm5z zmm5
+%define xmm6z zmm6
+%define xmm7z zmm7
+%define xmm8z zmm8
+%define xmm9z zmm9
+%define xmm10z zmm10
+%define xmm11z zmm11
+%define xmm12z zmm12
+%define xmm13z zmm13
+%define xmm14z zmm14
+%define xmm15z zmm15
+%define xmm16z zmm16
+%define xmm17z zmm17
+%define xmm18z zmm18
+%define xmm19z zmm19
+%define xmm20z zmm20
+%define xmm21z zmm21
+%define xmm22z zmm22
+%define xmm23z zmm23
+%define xmm24z zmm24
+%define xmm25z zmm25
+%define xmm26z zmm26
+%define xmm27z zmm27
+%define xmm28z zmm28
+%define xmm29z zmm29
+%define xmm30z zmm30
+%define xmm31z zmm31
+
+%define ymm0z zmm0
+%define ymm1z zmm1
+%define ymm2z zmm2
+%define ymm3z zmm3
+%define ymm4z zmm4
+%define ymm5z zmm5
+%define ymm6z zmm6
+%define ymm7z zmm7
+%define ymm8z zmm8
+%define ymm9z zmm9
+%define ymm10z zmm10
+%define ymm11z zmm11
+%define ymm12z zmm12
+%define ymm13z zmm13
+%define ymm14z zmm14
+%define ymm15z zmm15
+%define ymm16z zmm16
+%define ymm17z zmm17
+%define ymm18z zmm18
+%define ymm19z zmm19
+%define ymm20z zmm20
+%define ymm21z zmm21
+%define ymm22z zmm22
+%define ymm23z zmm23
+%define ymm24z zmm24
+%define ymm25z zmm25
+%define ymm26z zmm26
+%define ymm27z zmm27
+%define ymm28z zmm28
+%define ymm29z zmm29
+%define ymm30z zmm30
+%define ymm31z zmm31
+
+%define DWORD(reg) reg %+ d
+%define WORD(reg)  reg %+ w
+%define BYTE(reg)  reg %+ b
+
+%define XWORD(reg) reg %+ x
+%define YWORD(reg) reg %+ y
+%define ZWORD(reg) reg %+ z
+
+%ifdef INTEL_CET_ENABLED
+ %ifdef __NASM_VER__
+  %if AS_FEATURE_LEVEL >= 10
+   %ifidn __OUTPUT_FORMAT__,elf32
+section .note.gnu.property  note  alloc noexec align=4
+DD 0x00000004,0x0000000c,0x00000005,0x00554e47
+DD 0xc0000002,0x00000004,0x00000003
+   %endif
+   %ifidn __OUTPUT_FORMAT__,elf64
+section .note.gnu.property  note  alloc noexec align=8
+DD 0x00000004,0x00000010,0x00000005,0x00554e47
+DD 0xc0000002,0x00000004,0x00000003,0x00000000
+   %endif
+  %endif
+ %endif
+%endif
+
+%ifidn __OUTPUT_FORMAT__,elf32
+section .note.GNU-stack noalloc noexec nowrite progbits
+section .text
+%endif
+%ifidn __OUTPUT_FORMAT__,elf64
+ %define __x86_64__
+section .note.GNU-stack noalloc noexec nowrite progbits
+section .text
+%endif
+%ifidn __OUTPUT_FORMAT__,win64
+ %define __x86_64__
+%endif
+%ifidn __OUTPUT_FORMAT__,macho64
+ %define __x86_64__
+%endif
+
+%ifdef __x86_64__
+ %define endbranch db 0xf3, 0x0f, 0x1e, 0xfa
+%else
+ %define endbranch db 0xf3, 0x0f, 0x1e, 0xfb
+%endif
+
+%ifdef REL_TEXT
+ %define WRT_OPT
+%elifidn __OUTPUT_FORMAT__, elf64
+ %define WRT_OPT        wrt ..plt
+%else
+ %define WRT_OPT
+%endif
+
+%macro mk_global 1-3
+  %ifdef __NASM_VER__
+    %ifidn __OUTPUT_FORMAT__, macho64
+	global %1
+    %elifidn __OUTPUT_FORMAT__, win64
+	global %1
+    %else
+	global %1:%2 %3
+    %endif
+  %else
+	global %1:%2 %3
+  %endif
+%endmacro
+
+
+; Fixes for nasm lack of MS proc helpers
+%ifdef __NASM_VER__
+  %ifidn __OUTPUT_FORMAT__, win64
+    %macro alloc_stack 1
+	sub	rsp, %1
+    %endmacro
+
+    %macro proc_frame 1
+	%1:
+    %endmacro
+
+    %macro save_xmm128 2
+	movdqa	[rsp + %2], %1
+    %endmacro
+
+    %macro save_reg 2
+	mov	[rsp + %2], %1
+    %endmacro
+
+    %macro rex_push_reg	1
+	push	%1
+    %endmacro
+
+    %macro push_reg 1
+	push	%1
+    %endmacro
+
+    %define end_prolog
+  %endif
+
+  %define endproc_frame
+%endif
+
+%ifidn __OUTPUT_FORMAT__, macho64
+ %define elf64 macho64
+ mac_equ equ 1
+%endif
+
+%macro slversion 4
+	section .text
+	global %1_slver_%2%3%4
+	global %1_slver
+	%1_slver:
+	%1_slver_%2%3%4:
+		dw 0x%4
+		db 0x%3, 0x%2
+%endmacro
+
+%endif ; ifndef _REG_SIZES_ASM_
diff --git a/module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.intel b/module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.intel
new file mode 100644
index 000000000000..ecebef110b46
--- /dev/null
+++ b/module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.intel
@@ -0,0 +1,26 @@
+  Copyright(c) 2011-2017 Intel Corporation All rights reserved.
+
+  Redistribution and use in source and binary forms, with or without
+  modification, are permitted provided that the following conditions
+  are met:
+    * Redistributions of source code must retain the above copyright
+      notice, this list of conditions and the following disclaimer.
+    * Redistributions in binary form must reproduce the above copyright
+      notice, this list of conditions and the following disclaimer in
+      the documentation and/or other materials provided with the
+      distribution.
+    * Neither the name of Intel Corporation nor the names of its
+      contributors may be used to endorse or promote products derived
+      from this software without specific prior written permission.
+
+  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.intel.descrip b/module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.intel.descrip
new file mode 100644
index 000000000000..6184759c8b74
--- /dev/null
+++ b/module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.intel.descrip
@@ -0,0 +1 @@
+PORTIONS OF GCM and GHASH FUNCTIONALITY
diff --git a/module/icp/asm-x86_64/modes/isalc_gcm128_sse.S b/module/icp/asm-x86_64/modes/isalc_gcm128_sse.S
new file mode 100644
index 000000000000..f552d8630073
--- /dev/null
+++ b/module/icp/asm-x86_64/modes/isalc_gcm128_sse.S
@@ -0,0 +1,31 @@
+//#######################################################################
+//  Copyright(c) 2011-2016 Intel Corporation All rights reserved.
+//
+//  Redistribution and use in source and binary forms, with or without
+//  modification, are permitted provided that the following conditions
+//  are met:
+//    * Redistributions of source code must retain the above copyright
+//      notice, this list of conditions and the following disclaimer.
+//    * Redistributions in binary form must reproduce the above copyright
+//      notice, this list of conditions and the following disclaimer in
+//      the documentation and/or other materials provided with the
+//      distribution.
+//    * Neither the name of Intel Corporation nor the names of its
+//      contributors may be used to endorse or promote products derived
+//      from this software without specific prior written permission.
+//
+//  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+//  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+//  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+//  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+//  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+//  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+//  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES# LOSS OF USE,
+//  DATA, OR PROFITS# OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+//  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+//  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+//  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//#######################################################################
+
+#define GCM128_MODE 1
+#include "isalc_gcm_sse_att.S"
diff --git a/module/icp/asm-x86_64/modes/isalc_gcm256_sse.S b/module/icp/asm-x86_64/modes/isalc_gcm256_sse.S
new file mode 100644
index 000000000000..c88cb0ed055f
--- /dev/null
+++ b/module/icp/asm-x86_64/modes/isalc_gcm256_sse.S
@@ -0,0 +1,31 @@
+//////////////////////////////////////////////////////////////////////////
+//  Copyright(c) 2011-2016 Intel Corporation All rights reserved.
+//
+//  Redistribution and use in source and binary forms, with or without
+//  modification, are permitted provided that the following conditions
+//  are met:
+//    * Redistributions of source code must retain the above copyright
+//      notice, this list of conditions and the following disclaimer.
+//    * Redistributions in binary form must reproduce the above copyright
+//      notice, this list of conditions and the following disclaimer in
+//      the documentation and/or other materials provided with the
+//      distribution.
+//    * Neither the name of Intel Corporation nor the names of its
+//      contributors may be used to endorse or promote products derived
+//      from this software without specific prior written permission.
+//
+//  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+//  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+//  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+//  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+//  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+//  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+//  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES// LOSS OF USE,
+//  DATA, OR PROFITS// OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+//  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+//  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+//  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//////////////////////////////////////////////////////////////////////////
+
+#define GCM256_MODE 1
+#include "isalc_gcm_sse_att.S"
diff --git a/module/icp/asm-x86_64/modes/isalc_gcm_defines.S b/module/icp/asm-x86_64/modes/isalc_gcm_defines.S
new file mode 100644
index 000000000000..00ec4c654d9f
--- /dev/null
+++ b/module/icp/asm-x86_64/modes/isalc_gcm_defines.S
@@ -0,0 +1,293 @@
+////////////////////////////////////////////////////////////////////////////////
+//  Copyright(c) 2011-2016 Intel Corporation All rights reserved.
+//
+//  Redistribution and use in source and binary forms, with or without
+//  modification, are permitted provided that the following conditions
+//  are met:
+//    * Redistributions of source code must retain the above copyright
+//      notice, this list of conditions and the following disclaimer.
+//    * Redistributions in binary form must reproduce the above copyright
+//      notice, this list of conditions and the following disclaimer in
+//      the documentation and/or other materials provided with the
+//      distribution.
+//    * Neither the name of Intel Corporation nor the names of its
+//      contributors may be used to endorse or promote products derived
+//      from this software without specific prior written permission.
+//
+//  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+//  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+//  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+//  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+//  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+//  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+//  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES// LOSS OF USE,
+//  DATA, OR PROFITS// OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+//  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+//  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+//  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+////////////////////////////////////////////////////////////////////////////////
+
+#ifndef GCM_DEFINES_ASM_INCLUDED
+#define GCM_DEFINES_ASM_INCLUDED
+
+//
+// Authors:
+//       Erdinc Ozturk
+//       Vinodh Gopal
+//       James Guilford
+
+
+////////////
+
+.section .rodata
+
+.balign 16
+POLY:		.quad     0x0000000000000001, 0xC200000000000000
+
+// unused for sse
+.balign 64
+POLY2:		.quad     0x00000001C2000000, 0xC200000000000000
+		.quad     0x00000001C2000000, 0xC200000000000000
+		.quad     0x00000001C2000000, 0xC200000000000000
+		.quad     0x00000001C2000000, 0xC200000000000000
+.balign 16
+TWOONE:		.quad     0x0000000000000001, 0x0000000100000000
+
+// order of these constants should not change.
+// more specifically, ALL_F should follow SHIFT_MASK, and ZERO should
+// follow ALL_F
+
+.balign 64
+SHUF_MASK:	.quad     0x08090A0B0C0D0E0F, 0x0001020304050607
+		.quad     0x08090A0B0C0D0E0F, 0x0001020304050607
+		.quad     0x08090A0B0C0D0E0F, 0x0001020304050607
+		.quad     0x08090A0B0C0D0E0F, 0x0001020304050607
+
+SHIFT_MASK:	.quad     0x0706050403020100, 0x0f0e0d0c0b0a0908
+ALL_F:		.quad     0xffffffffffffffff, 0xffffffffffffffff
+ZERO:		.quad     0x0000000000000000, 0x0000000000000000 // unused for sse
+ONE:		.quad     0x0000000000000001, 0x0000000000000000
+TWO:		.quad     0x0000000000000002, 0x0000000000000000 // unused for sse
+ONEf:		.quad     0x0000000000000000, 0x0100000000000000
+TWOf:		.quad     0x0000000000000000, 0x0200000000000000 // unused for sse
+
+// Below unused for sse
+.balign 64
+ddq_add_1234:
+	.quad	0x0000000000000001, 0x0000000000000000
+	.quad	0x0000000000000002, 0x0000000000000000
+	.quad	0x0000000000000003, 0x0000000000000000
+	.quad	0x0000000000000004, 0x0000000000000000
+
+.balign 64
+ddq_add_5678:
+	.quad	0x0000000000000005, 0x0000000000000000
+	.quad	0x0000000000000006, 0x0000000000000000
+	.quad	0x0000000000000007, 0x0000000000000000
+	.quad	0x0000000000000008, 0x0000000000000000
+
+.balign 64
+ddq_add_4444:
+	.quad	0x0000000000000004, 0x0000000000000000
+	.quad	0x0000000000000004, 0x0000000000000000
+	.quad	0x0000000000000004, 0x0000000000000000
+	.quad	0x0000000000000004, 0x0000000000000000
+
+.balign 64
+ddq_add_8888:
+	.quad	0x0000000000000008, 0x0000000000000000
+	.quad	0x0000000000000008, 0x0000000000000000
+	.quad	0x0000000000000008, 0x0000000000000000
+	.quad	0x0000000000000008, 0x0000000000000000
+
+.balign 64
+ddq_addbe_1234:
+	.quad	0x0000000000000000, 0x0100000000000000
+	.quad	0x0000000000000000, 0x0200000000000000
+	.quad	0x0000000000000000, 0x0300000000000000
+	.quad	0x0000000000000000, 0x0400000000000000
+
+.balign 64
+ddq_addbe_5678:
+	.quad	0x0000000000000000, 0x0500000000000000
+	.quad	0x0000000000000000, 0x0600000000000000
+	.quad	0x0000000000000000, 0x0700000000000000
+	.quad	0x0000000000000000, 0x0800000000000000
+
+.balign 64
+ddq_addbe_4444:
+	.quad	0x0000000000000000, 0x0400000000000000
+	.quad	0x0000000000000000, 0x0400000000000000
+	.quad	0x0000000000000000, 0x0400000000000000
+	.quad	0x0000000000000000, 0x0400000000000000
+
+.balign 64
+ddq_addbe_8888:
+	.quad	0x0000000000000000, 0x0800000000000000
+	.quad	0x0000000000000000, 0x0800000000000000
+	.quad	0x0000000000000000, 0x0800000000000000
+	.quad	0x0000000000000000, 0x0800000000000000
+
+.balign 64
+byte_len_to_mask_table:
+	.short      0x0000, 0x0001, 0x0003, 0x0007
+	.short      0x000f, 0x001f, 0x003f, 0x007f
+	.short      0x00ff, 0x01ff, 0x03ff, 0x07ff
+	.short      0x0fff, 0x1fff, 0x3fff, 0x7fff
+	.short      0xffff
+
+.balign 64
+byte64_len_to_mask_table:
+	.quad      0x0000000000000000, 0x0000000000000001
+	.quad      0x0000000000000003, 0x0000000000000007
+	.quad      0x000000000000000f, 0x000000000000001f
+	.quad      0x000000000000003f, 0x000000000000007f
+	.quad      0x00000000000000ff, 0x00000000000001ff
+	.quad      0x00000000000003ff, 0x00000000000007ff
+	.quad      0x0000000000000fff, 0x0000000000001fff
+	.quad      0x0000000000003fff, 0x0000000000007fff
+	.quad      0x000000000000ffff, 0x000000000001ffff
+	.quad      0x000000000003ffff, 0x000000000007ffff
+	.quad      0x00000000000fffff, 0x00000000001fffff
+	.quad      0x00000000003fffff, 0x00000000007fffff
+	.quad      0x0000000000ffffff, 0x0000000001ffffff
+	.quad      0x0000000003ffffff, 0x0000000007ffffff
+	.quad      0x000000000fffffff, 0x000000001fffffff
+	.quad      0x000000003fffffff, 0x000000007fffffff
+	.quad      0x00000000ffffffff, 0x00000001ffffffff
+	.quad      0x00000003ffffffff, 0x00000007ffffffff
+	.quad      0x0000000fffffffff, 0x0000001fffffffff
+	.quad      0x0000003fffffffff, 0x0000007fffffffff
+	.quad      0x000000ffffffffff, 0x000001ffffffffff
+	.quad      0x000003ffffffffff, 0x000007ffffffffff
+	.quad      0x00000fffffffffff, 0x00001fffffffffff
+	.quad      0x00003fffffffffff, 0x00007fffffffffff
+	.quad      0x0000ffffffffffff, 0x0001ffffffffffff
+	.quad      0x0003ffffffffffff, 0x0007ffffffffffff
+	.quad      0x000fffffffffffff, 0x001fffffffffffff
+	.quad      0x003fffffffffffff, 0x007fffffffffffff
+	.quad      0x00ffffffffffffff, 0x01ffffffffffffff
+	.quad      0x03ffffffffffffff, 0x07ffffffffffffff
+	.quad      0x0fffffffffffffff, 0x1fffffffffffffff
+	.quad      0x3fffffffffffffff, 0x7fffffffffffffff
+	.quad      0xffffffffffffffff
+
+.balign 64
+mask_out_top_block:
+	.quad      0xffffffffffffffff, 0xffffffffffffffff
+	.quad      0xffffffffffffffff, 0xffffffffffffffff
+	.quad      0xffffffffffffffff, 0xffffffffffffffff
+	.quad      0x0000000000000000, 0x0000000000000000
+
+.section .text
+
+
+////define the fields of gcm_data struct
+//typedef struct gcm_data
+//{
+//        u8 expanded_keys[16*15]//
+//        u8 shifted_hkey_1[16]//  // store HashKey <<1 mod poly here
+//        u8 shifted_hkey_2[16]//  // store HashKey^2 <<1 mod poly here
+//        u8 shifted_hkey_3[16]//  // store HashKey^3 <<1 mod poly here
+//        u8 shifted_hkey_4[16]//  // store HashKey^4 <<1 mod poly here
+//        u8 shifted_hkey_5[16]//  // store HashKey^5 <<1 mod poly here
+//        u8 shifted_hkey_6[16]//  // store HashKey^6 <<1 mod poly here
+//        u8 shifted_hkey_7[16]//  // store HashKey^7 <<1 mod poly here
+//        u8 shifted_hkey_8[16]//  // store HashKey^8 <<1 mod poly here
+//        u8 shifted_hkey_1_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_2_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^2 <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_3_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^3 <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_4_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^4 <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_5_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^5 <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_6_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^6 <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_7_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^7 <<1 mod poly here (for Karatsuba purposes)
+//        u8 shifted_hkey_8_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^8 <<1 mod poly here (for Karatsuba purposes)
+//} gcm_data//
+
+#ifndef GCM_KEYS_VAES_AVX512_INCLUDED
+#define HashKey         16*15    // store HashKey <<1 mod poly here
+#define HashKey_1       16*15    // store HashKey <<1 mod poly here
+#define HashKey_2       16*16    // store HashKey^2 <<1 mod poly here
+#define HashKey_3       16*17    // store HashKey^3 <<1 mod poly here
+#define HashKey_4       16*18    // store HashKey^4 <<1 mod poly here
+#define HashKey_5       16*19    // store HashKey^5 <<1 mod poly here
+#define HashKey_6       16*20    // store HashKey^6 <<1 mod poly here
+#define HashKey_7       16*21    // store HashKey^7 <<1 mod poly here
+#define HashKey_8       16*22    // store HashKey^8 <<1 mod poly here
+#define HashKey_k       16*23    // store XOR of High 64 bits and Low 64 bits of  HashKey <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_2_k     16*24    // store XOR of High 64 bits and Low 64 bits of  HashKey^2 <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_3_k     16*25   // store XOR of High 64 bits and Low 64 bits of  HashKey^3 <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_4_k     16*26   // store XOR of High 64 bits and Low 64 bits of  HashKey^4 <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_5_k     16*27   // store XOR of High 64 bits and Low 64 bits of  HashKey^5 <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_6_k     16*28   // store XOR of High 64 bits and Low 64 bits of  HashKey^6 <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_7_k     16*29   // store XOR of High 64 bits and Low 64 bits of  HashKey^7 <<1 mod poly here (for Karatsuba purposes)
+#define HashKey_8_k     16*30   // store XOR of High 64 bits and Low 64 bits of  HashKey^8 <<1 mod poly here (for Karatsuba purposes)
+#endif
+
+#define AadHash		16*0	// store current Hash of data which has been input
+#define AadLen		16*1	// store length of input data which will not be encrypted or decrypted
+#define InLen		(16*1)+8 // store length of input data which will be encrypted or decrypted
+#define PBlockEncKey	16*2	// encryption key for the partial block at the end of the previous update
+#define OrigIV		16*3	// input IV
+#define CurCount	16*4	// Current counter for generation of encryption key
+#define PBlockLen	16*5	// length of partial block at the end of the previous update
+
+.macro xmmreg name, num
+	.set xmm\name, %xmm\num
+.endm
+
+#define arg(x) (STACK_OFFSET + 8*(x))(%r14)
+
+
+#if __OUTPUT_FORMAT__ != elf64
+#define arg1 %rcx
+#define arg2 %rdx
+#define arg3 %r8
+#define arg4 %r9
+#define arg5 %rsi
+#define arg6 (STACK_OFFSET + 8*6)(%r14)
+#define arg7 (STACK_OFFSET + 8*7)(%r14)
+#define arg8 (STACK_OFFSET + 8*8)(%r14)
+#define arg9 (STACK_OFFSET + 8*9)(%r14)
+#define arg10 (STACK_OFFSET + 8*10)(%r14)
+#else
+#define arg1 %rdi
+#define arg2 %rsi
+#define arg3 %rdx
+#define arg4 %rcx
+#define arg5 %r8
+#define arg6 %r9
+#define arg7 ((STACK_OFFSET) + 8*1)(%r14)
+#define arg8 ((STACK_OFFSET) + 8*2)(%r14)
+#define arg9 ((STACK_OFFSET) + 8*3)(%r14)
+#define arg10 ((STACK_OFFSET) + 8*4)(%r14)
+#endif
+
+#ifdef NT_LDST
+#define NT_LD
+#define NT_ST
+#endif
+
+////// Use Non-temporal load/stor
+#ifdef NT_LD
+#define	XLDR	 movntdqa
+#define	VXLDR	 vmovntdqa
+#define	VX512LDR vmovntdqa
+#else
+#define	XLDR	 movdqu
+#define	VXLDR	 vmovdqu
+#define	VX512LDR vmovdqu8
+#endif
+
+////// Use Non-temporal load/stor
+#ifdef NT_ST
+#define	XSTR	 movntdq
+#define	VXSTR	 vmovntdq
+#define	VX512STR vmovntdq
+#else
+#define	XSTR	 movdqu
+#define	VXSTR	 vmovdqu
+#define	VX512STR vmovdqu8
+#endif
+
+#endif // GCM_DEFINES_ASM_INCLUDED
diff --git a/module/icp/asm-x86_64/modes/isalc_gcm_sse.S b/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
new file mode 100644
index 000000000000..5d5be5068904
--- /dev/null
+++ b/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
@@ -0,0 +1,2150 @@
+////////////////////////////////////////////////////////////////////////////////
+//  Copyright(c) 2011-2017 Intel Corporation All rights reserved.
+//
+//  Redistribution and use in source and binary forms, with or without
+//  modification, are permitted provided that the following conditions
+//  are met:
+//    * Redistributions of source code must retain the above copyright
+//      notice, this list of conditions and the following disclaimer.
+//    * Redistributions in binary form must reproduce the above copyright
+//      notice, this list of conditions and the following disclaimer in
+//      the documentation and/or other materials provided with the
+//      distribution.
+//    * Neither the name of Intel Corporation nor the names of its
+//      contributors may be used to endorse or promote products derived
+//      from this software without specific prior written permission.
+//
+//  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+//  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+//  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+//  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+//  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+//  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+//  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES// LOSS OF USE,
+//  DATA, OR PROFITS// OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+//  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+//  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+//  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+////////////////////////////////////////////////////////////////////////////////
+
+////////////////////////////////////////////////////////////////////////////////
+//
+// Authors:
+//       Erdinc Ozturk
+//       Vinodh Gopal
+//       James Guilford
+//
+//
+// References:
+//       This code was derived and highly optimized from the code described in
+//	 paper:
+//               Vinodh Gopal et. al. Optimized Galois-Counter-Mode
+//		  Implementation on Intel Architecture Processors. August, 2010
+//
+//       For the shift-based reductions used in this code, we used the method
+//	 described in paper:
+//               Shay Gueron, Michael E. Kounavis. Intel Carry-Less
+//		  Multiplication Instruction and its Usage for Computing the GCM
+//		  Mode. January, 2010.
+//
+//
+// Assumptions:
+//
+//
+//
+// iv:
+//       0                   1                   2                   3
+//       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                             Salt  (From the SA)               |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                     Initialization Vector                     |
+//       |         (This is the sequence number from IPSec header)       |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                              0x1                              |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//
+//
+//
+// AAD:
+//       AAD will be padded with 0 to the next 16byte multiple
+//       for example, assume AAD is a u32 vector
+//
+//       if AAD is 8 bytes:
+//       AAD[3] = {A0, A1};
+//       padded AAD in xmm register = {A1 A0 0 0}
+//
+//       0                   1                   2                   3
+//       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                               SPI (A1)                        |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                     32-bit Sequence Number (A0)               |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                              0x0                              |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//
+//                                       AAD Format with 32-bit Sequence Number
+//
+//       if AAD is 12 bytes:
+//       AAD[3] = {A0, A1, A2};
+//       padded AAD in xmm register = {A2 A1 A0 0}
+//
+//       0                   1                   2                   3
+//       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                               SPI (A2)                        |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                 64-bit Extended Sequence Number {A1,A0}       |
+//       |                                                               |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//       |                              0x0                              |
+//       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+//
+//        AAD Format with 64-bit Extended Sequence Number
+//
+//
+// aadLen:
+//       Must be a multiple of 4 bytes and from the definition of the spec.
+//       The code additionally supports any aadLen length.
+//
+// TLen:
+//       from the definition of the spec, TLen can only be 8, 12 or 16 bytes.
+//
+// poly = x^128 + x^127 + x^126 + x^121 + 1
+// throughout the code, one tab and two tab indentations are used. one tab is
+// for GHASH part, two tabs is for AES part.
+//
+
+// .altmacro
+.att_syntax prefix
+
+#include "isalc_reg_sizes_att.S"
+#include "isalc_gcm_defines_att.S"
+
+#if !defined(GCM128_MODE) && !defined(GCM256_MODE)
+#error "No GCM mode selected for gcm_sse.S!"
+#endif
+
+#if defined(FUNCT_EXTENSION)
+#error "No support for non-temporal versions yet!"
+#endif
+#define	_nt	1
+
+#ifdef GCM128_MODE
+#define FN_NAME(x,y) aes_gcm_ ## x ## _128 ## y ## sse
+#define NROUNDS 9
+#endif
+
+#ifdef GCM256_MODE
+#define FN_NAME(x,y) aes_gcm_ ## x ## _256 ## y ## sse
+#define NROUNDS 13
+#endif
+
+
+// need to push 5 registers into stack to maintain
+#define STACK_OFFSET 8*5
+
+#define	TMP2	16*0    // Temporary storage for AES State 2 (State 1 is stored in an XMM register)
+#define	TMP3	16*1    // Temporary storage for AES State 3
+#define	TMP4	16*2    // Temporary storage for AES State 4
+#define	TMP5	16*3    // Temporary storage for AES State 5
+#define	TMP6	16*4    // Temporary storage for AES State 6
+#define	TMP7	16*5    // Temporary storage for AES State 7
+#define	TMP8	16*6    // Temporary storage for AES State 8
+
+#define	LOCAL_STORAGE	16*7
+
+#if __OUTPUT_FORMAT == win64
+#define	XMM_STORAGE	16*10
+#else
+#define	XMM_STORAGE	0
+#endif
+
+#define	VARIABLE_OFFSET	LOCAL_STORAGE + XMM_STORAGE
+
+////////////////////////////////////////////////////////////////
+// Utility Macros
+////////////////////////////////////////////////////////////////
+
+////////////////////////////////////////////////////////////////////////////////
+// GHASH_MUL MACRO to implement: Data*HashKey mod (128,127,126,121,0)
+// Input: A and B (128-bits each, bit-reflected)
+// Output: C = A*B*x mod poly, (i.e. >>1 )
+// To compute GH = GH*HashKey mod poly, give HK = HashKey<<1 mod poly as input
+// GH = GH * HK * x mod poly which is equivalent to GH*HashKey mod poly.
+////////////////////////////////////////////////////////////////////////////////
+.macro  GHASH_MUL  GH, HK, T1, T2, T3, T4, T5
+	// \GH, \HK hold the values for the two operands which are carry-less
+	// multiplied.
+	////////////////////////////////////////////////////////////////////////
+	// Karatsuba Method
+	movdqa	\GH, \T1
+	pshufd	$0b01001110, \GH, \T2
+	pshufd	$0b01001110, \HK, \T3
+	pxor	\GH, \T2			// \T2 = (a1+a0)
+	pxor	\HK, \T3			// \T3 = (b1+b0)
+
+	pclmulqdq	$0x11, \HK, \T1		// \T1 = a1*b1
+	pclmulqdq       $0x00, \HK, \GH		// \GH = a0*b0
+	pclmulqdq       $0x00, \T3, \T2		// \T2 = (a1+a0)*(b1+b0)
+	pxor	\GH, \T2
+	pxor	\T1, \T2			// \T2 = a0*b1+a1*b0
+
+	movdqa	\T2, \T3
+	pslldq	$8, \T3		// shift-L \T3 2 DWs
+	psrldq	$8, \T2		// shift-R \T2 2 DWs
+	pxor	\T3, \GH
+	pxor	\T2, \T1	// <\T1:\GH> holds the result of the carry-less multiplication of \GH by \HK
+
+
+	//first phase of the reduction
+	movdqa	\GH, \T2
+	movdqa	\GH, \T3
+	movdqa	\GH, \T4	// move \GH into \T2, \T3, \T4 in order to perform the three shifts independently
+
+	pslld	$31, \T2	// packed right shifting << 31
+	pslld	$30, \T3	// packed right shifting shift << 30
+	pslld	$25, \T4	// packed right shifting shift << 25
+	pxor	\T3, \T2	// xor the shifted versions
+	pxor	\T4, \T2
+
+	movdqa	\T2, \T5
+	psrldq	$4, \T5		 // shift-R \T5 1 DW
+
+	pslldq	$12, \T2	// shift-L \T2 3 DWs
+	pxor	\T2, \GH	// first phase of the reduction complete
+	////////////////////////////////////////////////////////////////////////
+
+	//second phase of the reduction
+	movdqa	\GH, \T2	// make 3 copies of \GH (in in \T2, \T3, \T4) for doing three shift operations
+	movdqa	\GH, \T3
+	movdqa	\GH, \T4
+
+	psrld	$1, \T2		// packed left shifting >> 1
+	psrld	$2, \T3		// packed left shifting >> 2
+	psrld	$7, \T4		// packed left shifting >> 7
+	pxor	\T3, \T2	// xor the shifted versions
+	pxor	\T4, \T2
+
+	pxor	\T5, \T2
+	pxor	\T2, \GH
+	pxor	\T1, \GH	// the result is in \T1
+
+.endm // GHASH_MUL
+
+////////////////////////////////////////////////////////////////////////////////
+// PRECOMPUTE: Precompute HashKey_{2..8} and HashKey{,_{2..8}}_k.
+// HasKey_i_k holds XORed values of the low and high parts of the Haskey_i.
+////////////////////////////////////////////////////////////////////////////////
+.macro PRECOMPUTE GDATA, HK, T1, T2, T3, T4, T5, T6
+
+	movdqa	\HK, \T4
+	pshufd	$0b01001110, \HK, \T1
+	pxor	\HK, \T1
+	movdqu	\T1, HashKey_k(\GDATA)
+
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^2<<1 mod poly
+	movdqu	\T4, HashKey_2(\GDATA)		//  [HashKey_2] = HashKey^2<<1 mod poly
+	pshufd	$0b01001110, \T4, \T1
+	pxor	\T4, \T1
+	movdqu	\T1, HashKey_2_k(\GDATA)
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^3<<1 mod poly
+	movdqu	\T4, HashKey_3(\GDATA)
+	pshufd	$0b01001110, \T4, \T1
+	pxor	\T4, \T1
+	movdqu	\T1, HashKey_3_k(\GDATA)
+
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^4<<1 mod poly
+	movdqu	\T4, HashKey_4(\GDATA)
+	pshufd	$0b01001110, \T4, \T1
+	pxor	\T4, \T1
+	movdqu	\T1, HashKey_4_k(\GDATA)
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^5<<1 mod poly
+	movdqu	\T4, HashKey_5(\GDATA)
+	pshufd	$0b01001110, \T4, \T1
+	pxor	\T4, \T1
+	movdqu	\T1, HashKey_5_k(\GDATA)
+
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^6<<1 mod poly
+	movdqu	\T4, HashKey_6(\GDATA)
+	pshufd	$0b01001110, \T4, \T1
+	pxor	\T4, \T1
+	movdqu	\T1, HashKey_6_k(\GDATA)
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^7<<1 mod poly
+	movdqu	\T4, HashKey_7(\GDATA)
+	pshufd	$0b01001110, \T4, \T1
+	pxor	\T4, \T1
+	movdqu	\T1, HashKey_7_k(\GDATA)
+
+	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^8<<1 mod poly
+	movdqu  \T4, HashKey_8(\GDATA)
+	pshufd  $0b01001110, \T4, \T1
+	pxor    \T4, \T1
+	movdqu  \T1, HashKey_8_k(\GDATA)
+
+.endm // PRECOMPUTE
+
+
+////////////////////////////////////////////////////////////////////////////////
+// READ_SMALL_DATA_INPUT: Packs xmm register with data when data input is less
+// than 16 bytes.
+// Returns 0 if data has length 0.
+// Input: The input data (INPUT), that data's length (LENGTH).
+// Output: The packed xmm register (OUTPUT).
+////////////////////////////////////////////////////////////////////////////////
+.macro READ_SMALL_DATA_INPUT	OUTPUT, INPUT, LENGTH, \
+				END_READ_LOCATION, COUNTER, TMP1
+
+	// clang compat: no local support
+	// LOCAL _byte_loop_1, _byte_loop_2, _done
+
+	pxor	\OUTPUT, \OUTPUT
+	mov	\LENGTH, \COUNTER
+	mov	\INPUT, \END_READ_LOCATION
+	add	\LENGTH, \END_READ_LOCATION
+	xor	\TMP1, \TMP1
+
+
+	cmp	$8, \COUNTER
+	jl	_byte_loop_2_\@
+	pinsrq	$0, (\INPUT), \OUTPUT	//Read in 8 bytes if they exists
+	je	_done_\@
+
+	sub	$8, \COUNTER
+
+_byte_loop_1_\@:		//Read in data 1 byte at a time while data is left
+	shl	$8, \TMP1	//This loop handles when 8 bytes were already read in
+	dec	\END_READ_LOCATION
+
+	////  mov	BYTE(\TMP1), BYTE [\END_READ_LOCATION]
+	bytereg \TMP1
+	movb	(\END_READ_LOCATION), breg
+	dec	\COUNTER
+	jg	_byte_loop_1_\@
+	pinsrq	$1, \TMP1, \OUTPUT
+	jmp	_done_\@
+
+_byte_loop_2_\@:		//Read in data 1 byte at a time while data is left
+	cmp	$0, \COUNTER
+	je	_done_\@
+	shl	$8, \TMP1	//This loop handles when no bytes were already read in
+	dec	\END_READ_LOCATION
+	//// mov	BYTE(\TMP1), BYTE [\END_READ_LOCATION]
+	bytereg \TMP1
+	movb	(\END_READ_LOCATION), breg
+	dec	\COUNTER
+	jg	_byte_loop_2_\@
+	pinsrq	$0, \TMP1, \OUTPUT
+_done_\@:
+
+.endm // READ_SMALL_DATA_INPUT
+
+
+////////////////////////////////////////////////////////////////////////////////
+// CALC_AAD_HASH: Calculates the hash of the data which will not be encrypted.
+// Input: The input data (A_IN), that data's length (A_LEN), and the hash key
+// (HASH_KEY).
+// Output: The hash of the data (AAD_HASH).
+////////////////////////////////////////////////////////////////////////////////
+.macro	CALC_AAD_HASH	A_IN, A_LEN, AAD_HASH, HASH_KEY, XTMP1, XTMP2, XTMP3, \
+			XTMP4, XTMP5, T1, T2, T3, T4, T5
+
+	// clang compat: no local support
+	// LOCAL _get_AAD_loop16, _get_small_AAD_block, _CALC_AAD_done
+
+	mov	\A_IN, \T1		// T1 = AAD
+	mov	\A_LEN, \T2		// T2 = aadLen
+	pxor	\AAD_HASH, \AAD_HASH
+
+	cmp	$16, \T2
+	jl	_get_small_AAD_block_\@
+
+_get_AAD_loop16_\@:
+
+	movdqu	(\T1), \XTMP1
+	//byte-reflect the AAD data
+	pshufb	 SHUF_MASK(%rip), \XTMP1
+	pxor	\XTMP1, \AAD_HASH
+	GHASH_MUL \AAD_HASH, \HASH_KEY, \XTMP1, \XTMP2, \XTMP3, \XTMP4, \XTMP5
+
+	sub	$16, \T2
+	je	_CALC_AAD_done_\@
+
+	add	$16, \T1
+	cmp	$16, \T2
+	jge	_get_AAD_loop16_\@
+
+_get_small_AAD_block_\@:
+	READ_SMALL_DATA_INPUT	\XTMP1, \T1, \T2, \T3, \T4, \T5
+	//byte-reflect the AAD data
+	pshufb	 SHUF_MASK(%rip), \XTMP1
+	pxor	\XTMP1, \AAD_HASH
+	GHASH_MUL \AAD_HASH, \HASH_KEY, \XTMP1, \XTMP2, \XTMP3, \XTMP4, \XTMP5
+
+_CALC_AAD_done_\@:
+
+.endm // CALC_AAD_HASH
+
+
+
+////////////////////////////////////////////////////////////////////////////////
+// PARTIAL_BLOCK: Handles encryption/decryption and the tag partial blocks
+// between update calls. Requires the input data be at least 1 byte long.
+// Input: gcm_key_data (GDATA_KEY), gcm_context_data (GDATA_CTX), input text
+// (PLAIN_CYPH_IN), input text length (PLAIN_CYPH_LEN), the current data offset
+// (DATA_OFFSET), and whether encoding or decoding (ENC_DEC).
+// Output: A cypher of the first partial block (CYPH_PLAIN_OUT), and updated
+// GDATA_CTX.
+// Clobbers rax, r10, r12, r13, r15, xmm0, xmm1, xmm2, xmm3, xmm5, xmm6, xmm9,
+// xmm10, xmm11, xmm13
+////////////////////////////////////////////////////////////////////////////////
+.macro PARTIAL_BLOCK	GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
+			PLAIN_CYPH_LEN, DATA_OFFSET, AAD_HASH, ENC_DEC
+
+	// clang compat: no local support
+	// LOCAL _fewer_than_16_bytes, _data_read, _no_extra_mask_1
+	// LOCAL _partial_incomplete_1, _dec_done, _no_extra_mask_2
+	// LOCAL _partial_incomplete_2, _encode_done, _partial_fill
+	// LOCAL _count_set, _less_than_8_bytes_left, _partial_block_done
+
+	mov	PBlockLen(\GDATA_CTX), %r13
+	cmp	$0, %r13
+	je	_partial_block_done_\@		//Leave Macro if no partial blocks
+
+	cmp	$16, \PLAIN_CYPH_LEN		//Read in input data without over reading
+	jl	_fewer_than_16_bytes_\@
+	XLDR	(\PLAIN_CYPH_IN), %xmm1		//If more than 16 bytes of data, just fill the xmm register
+	jmp	_data_read_\@
+
+_fewer_than_16_bytes_\@:
+	lea	(\PLAIN_CYPH_IN, \DATA_OFFSET), %r10
+	READ_SMALL_DATA_INPUT	%xmm1, %r10, \PLAIN_CYPH_LEN, %rax, %r12, %r15
+	mov	PBlockLen(\GDATA_CTX), %r13
+
+_data_read_\@:				//Finished reading in data
+
+
+	movdqu	PBlockEncKey(\GDATA_CTX), %xmm9	//xmm9 = ctx_data.partial_block_enc_key
+	movdqu	HashKey(\GDATA_KEY), %xmm13
+
+	lea	SHIFT_MASK(%rip), %r12
+
+	add	%r13, %r12		// adjust the shuffle mask pointer to be able to shift r13 bytes (16-r13 is the number of bytes in plaintext mod 16)
+	movdqu	(%r12), %xmm2		// get the appropriate shuffle mask
+	pshufb	%xmm2, %xmm9		// shift right r13 bytes
+
+	.ifc	\ENC_DEC, DEC
+
+	movdqa	%xmm1, %xmm3
+	pxor	%xmm1, %xmm9		// Cyphertext XOR E(K, Yn)
+
+	mov	\PLAIN_CYPH_LEN, %r15
+	add	%r13, %r15
+	sub	$16, %r15		//Set r15 to be the amount of data left in CYPH_PLAIN_IN after filling the block
+	jge	_no_extra_mask_1_\@	//Determine if if partial block is not being filled and shift mask accordingly
+	sub	%r15, %r12
+_no_extra_mask_1_\@:
+
+	movdqu	(ALL_F - SHIFT_MASK)(%r12), %xmm1	// get the appropriate mask to mask out bottom r13 bytes of xmm9
+	pand	%xmm1, %xmm9			// mask out bottom r13 bytes of xmm9
+
+	pand	%xmm1, %xmm3
+	pshufb	SHUF_MASK(%rip), %xmm3
+	pshufb	%xmm2, %xmm3
+	pxor	%xmm3, \AAD_HASH
+
+
+	cmp	$0, %r15
+	jl	_partial_incomplete_1_\@
+
+	GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6	//GHASH computation for the last <16 Byte block
+	xor	%rax, %rax
+	mov	%rax, PBlockLen(\GDATA_CTX)
+	jmp	_dec_done_\@
+_partial_incomplete_1_\@:
+	add	\PLAIN_CYPH_LEN, PBlockLen(\GDATA_CTX)
+_dec_done_\@:
+	movdqu	\AAD_HASH, AadHash(\GDATA_CTX)
+
+	.else	// .ifc \ENC_DEC, DEC
+
+	pxor	%xmm1, %xmm9		// Plaintext XOR E(K, Yn)
+
+	mov	\PLAIN_CYPH_LEN, %r15
+	add	%r13, %r15
+	sub	$16, %r15		//Set r15 to be the amount of data left in CYPH_PLAIN_IN after filling the block
+	jge	_no_extra_mask_2_\@	//Determine if if partial block is not being filled and shift mask accordingly
+	sub	%r15, %r12
+_no_extra_mask_2_\@:
+
+	movdqu	(ALL_F - SHIFT_MASK)(%r12), %xmm1  // get the appropriate mask to mask out bottom r13 bytes of xmm9
+	pand	%xmm1, %xmm9			 // mask out bottom r13  bytes of xmm9
+
+	pshufb	SHUF_MASK(%rip), %xmm9
+	pshufb	%xmm2, %xmm9
+	pxor	%xmm9, \AAD_HASH
+
+	cmp	$0, %r15
+	jl	_partial_incomplete_2_\@
+
+	GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6	//GHASH computation for the last <16 Byte block
+	xor	%rax, %rax
+	mov	%rax, PBlockLen(\GDATA_CTX)
+	jmp	_encode_done_\@
+_partial_incomplete_2_\@:
+	add     \PLAIN_CYPH_LEN, PBlockLen(\GDATA_CTX)
+_encode_done_\@:
+	movdqu	\AAD_HASH, AadHash(\GDATA_CTX)
+
+	pshufb	SHUF_MASK(%rip), %xmm9	// shuffle xmm9 back to output as ciphertext
+	pshufb	%xmm2, %xmm9
+
+	.endif	// .ifc \ENC_DEC, DEC
+
+
+	//////////////////////////////////////////////////////////
+	// output encrypted Bytes
+	cmp	$0, %r15
+	jl	_partial_fill_\@
+	mov	%r13, %r12
+	mov	$16, %r13
+	sub	%r12, %r13		// Set r13 to be the number of bytes to write out
+	jmp	_count_set_\@
+_partial_fill_\@:
+	mov	\PLAIN_CYPH_LEN, %r13
+_count_set_\@:
+	movq	%xmm9, %rax
+	cmp	$8, %r13
+	jle	_less_than_8_bytes_left_\@
+	mov	%rax, (\CYPH_PLAIN_OUT, \DATA_OFFSET)
+	add	$8, \DATA_OFFSET
+	psrldq	$8, %xmm9
+	movq	%xmm9, %rax
+	sub	$8, %r13
+_less_than_8_bytes_left_\@:
+	mov	%al, (\CYPH_PLAIN_OUT, \DATA_OFFSET)
+	add	$1, \DATA_OFFSET
+	shr	$8, %rax
+	sub	$1, %r13
+	jne	_less_than_8_bytes_left_\@
+	//////////////////////////////////////////////////////////
+_partial_block_done_\@:
+.endm // PARTIAL_BLOCK
+
+////////////////////////////////////////////////////////////////////////////////
+// INITIAL_BLOCKS: If a = number of total plaintext bytes; b = floor(a/16);
+// \num_initial_blocks = b mod 8;  encrypt the initial \num_initial_blocks
+// blocks and apply ghash on the ciphertext.
+// \GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, r14 are used as a
+// pointer only, not modified.
+// Updated AAD_HASH is returned in \T3.
+////////////////////////////////////////////////////////////////////////////////
+.macro INITIAL_BLOCKS	GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
+			LENGTH, DATA_OFFSET, num_initial_blocks, T1, HASH_KEY, \
+			T3, T4, T5, CTR, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, \
+			XMM7, XMM8, T6, T_key, ENC_DEC
+
+	// clang compat: no local support
+	// LOCAL _initial_blocks_done
+
+.altmacro
+.set i, (8-\num_initial_blocks)
+		xmmreg	i, %i
+		movdqu	\XMM8, xmmi	// move AAD_HASH to temp reg
+
+		// start AES for \num_initial_blocks blocks
+		movdqu	CurCount(\GDATA_CTX), \CTR	// \CTR = Y0
+
+
+.set i, (9-\num_initial_blocks)
+.rept \num_initial_blocks
+		xmmreg	i, %i
+		paddd	ONE(%rip), \CTR			// INCR Y0
+		movdqa	\CTR, xmmi
+		pshufb	SHUF_MASK(%rip), xmmi		// perform a 16Byte swap
+.set i, (i+1)
+.endr
+
+movdqu	16*0(\GDATA_KEY), \T_key
+.set i, (9-\num_initial_blocks)
+.rept \num_initial_blocks
+		xmmreg	i, %i
+		pxor	\T_key, xmmi
+.set i, (i+1)
+.endr
+
+.set j, 1
+.rept NROUNDS						// encrypt N blocks with 13 key rounds (11 for GCM192)
+movdqu	16*j(\GDATA_KEY), \T_key
+.set i, (9-\num_initial_blocks)
+.rept \num_initial_blocks
+		xmmreg	i, %i
+		aesenc	\T_key, xmmi
+.set i, (i+1)
+.endr
+
+.set j, (j+1)
+.endr
+
+movdqu	16*j(\GDATA_KEY), \T_key			// encrypt with last (14th) key round (12 for GCM192)
+.set i, (9-\num_initial_blocks)
+.rept \num_initial_blocks
+		xmmreg	i, %i
+		aesenclast	\T_key, xmmi
+.set i, (i+1)
+.endr
+
+.set i, (9-\num_initial_blocks)
+.rept \num_initial_blocks
+		xmmreg	i, %i
+		XLDR	(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, xmmi
+		XSTR	xmmi, (\CYPH_PLAIN_OUT, \DATA_OFFSET)	// write back ciphertext for \num_initial_blocks blocks
+		add	$16, \DATA_OFFSET
+		.ifc \ENC_DEC, DEC
+		movdqa	\T1, xmmi
+		.endif
+		pshufb	SHUF_MASK(%rip), xmmi      // prepare ciphertext for GHASH computations
+.set i, (i+1)
+.endr
+
+
+.set i, (8-\num_initial_blocks)
+.set j, (9-\num_initial_blocks)
+.rept \num_initial_blocks
+	xmmreg	i, %i
+	xmmreg	j, %j
+	pxor	xmmi, xmmj
+	GHASH_MUL xmmj, <\HASH_KEY>, <\T1>, <\T3>, <\T4>, <\T5>, <\T6>	// apply GHASH on \num_initial_blocks blocks
+.set i, (i+1)
+.set j, (j+1)
+.endr
+.noaltmacro
+
+	// \XMM8 has the current Hash Value
+	movdqa	\XMM8, \T3
+
+	cmp	$128, \LENGTH
+	jl	_initial_blocks_done_\@	// no need for precomputed constants
+
+////////////////////////////////////////////////////////////////////////////////
+// Haskey_i_k holds XORed values of the low and high parts of the Haskey_i
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM1
+		pshufb  SHUF_MASK(%rip), \XMM1	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM2
+		pshufb  SHUF_MASK(%rip), \XMM2	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM3
+		pshufb  SHUF_MASK(%rip), \XMM3	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM4
+		pshufb  SHUF_MASK(%rip), \XMM4	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM5
+		pshufb  SHUF_MASK(%rip), \XMM5	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM6
+		pshufb  SHUF_MASK(%rip), \XMM6	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM7
+		pshufb  SHUF_MASK(%rip), \XMM7	// perform a 16Byte swap
+
+		paddd   ONE(%rip), \CTR		// INCR Y0
+		movdqa  \CTR, \XMM8
+		pshufb  SHUF_MASK(%rip), \XMM8	// perform a 16Byte swap
+
+		movdqu  16*0(\GDATA_KEY), \T_key
+		pxor	\T_key, \XMM1
+		pxor	\T_key, \XMM2
+		pxor	\T_key, \XMM3
+		pxor	\T_key, \XMM4
+		pxor	\T_key, \XMM5
+		pxor	\T_key, \XMM6
+		pxor	\T_key, \XMM7
+		pxor	\T_key, \XMM8
+
+.set i, 1
+.rept    NROUNDS			// do early (13) rounds (11 for GCM192)
+		movdqu  16*i(\GDATA_KEY), \T_key
+		aesenc	\T_key, \XMM1
+		aesenc	\T_key, \XMM2
+		aesenc	\T_key, \XMM3
+		aesenc	\T_key, \XMM4
+		aesenc	\T_key, \XMM5
+		aesenc	\T_key, \XMM6
+		aesenc	\T_key, \XMM7
+		aesenc	\T_key, \XMM8
+.set i, (i+1)
+.endr
+
+		movdqu	16*i(\GDATA_KEY), \T_key	// do final key round
+		aesenclast	\T_key, \XMM1
+		aesenclast	\T_key, \XMM2
+		aesenclast	\T_key, \XMM3
+		aesenclast	\T_key, \XMM4
+		aesenclast	\T_key, \XMM5
+		aesenclast	\T_key, \XMM6
+		aesenclast	\T_key, \XMM7
+		aesenclast	\T_key, \XMM8
+
+		XLDR	16*0(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM1
+		XSTR	\XMM1, 16*0(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM1
+		.endif
+
+		XLDR	16*1(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM2
+		XSTR	\XMM2, 16*1(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM2
+		.endif
+
+		XLDR	16*2(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM3
+		XSTR	\XMM3, 16*2(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM3
+		.endif
+
+		XLDR	16*3(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM4
+		XSTR	\XMM4, 16*3(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM4
+		.endif
+
+		XLDR	16*4(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM5
+		XSTR	\XMM5, 16*4(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM5
+		.endif
+
+		XLDR	16*5(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM6
+		XSTR	\XMM6, 16*5(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM6
+		.endif
+
+		XLDR	16*6(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM7
+		XSTR	\XMM7, 16*6(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM7
+		.endif
+
+		XLDR	16*7(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+		pxor	\T1, \XMM8
+		XSTR	\XMM8, 16*7(\CYPH_PLAIN_OUT, \DATA_OFFSET)
+		.ifc	\ENC_DEC, DEC
+		movdqa	\T1, \XMM8
+		.endif
+
+		add	$128, \DATA_OFFSET
+
+		pshufb  SHUF_MASK(%rip), \XMM1	// perform a 16Byte swap
+		pxor	\T3, \XMM1		// combine GHASHed value with the corresponding ciphertext
+		pshufb  SHUF_MASK(%rip), \XMM2	// perform a 16Byte swap
+		pshufb  SHUF_MASK(%rip), \XMM3	// perform a 16Byte swap
+		pshufb  SHUF_MASK(%rip), \XMM4	// perform a 16Byte swap
+		pshufb  SHUF_MASK(%rip), \XMM5	// perform a 16Byte swap
+		pshufb  SHUF_MASK(%rip), \XMM6	// perform a 16Byte swap
+		pshufb  SHUF_MASK(%rip), \XMM7	// perform a 16Byte swap
+		pshufb  SHUF_MASK(%rip), \XMM8	// perform a 16Byte swap
+
+////////////////////////////////////////////////////////////////////////////////
+
+_initial_blocks_done_\@:
+.noaltmacro
+.endm // INITIAL_BLOCKS
+
+
+////////////////////////////////////////////////////////////////////////////////
+// GHASH_8_ENCRYPT_8_PARALLEL: Encrypt 8 blocks at a time and ghash the 8
+// previously encrypted ciphertext blocks.
+// \GDATA (KEY), \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN are used as pointers only,
+// not modified.
+// \DATA_OFFSET is the data offset value
+////////////////////////////////////////////////////////////////////////////////
+.macro GHASH_8_ENCRYPT_8_PARALLEL GDATA, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
+				  DATA_OFFSET, T1, T2, T3, T4, T5, T6, CTR, \
+				  XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7, \
+				  XMM8, T7, loop_idx, ENC_DEC
+
+
+	movdqa	\XMM1, \T7
+	movdqu	\XMM2, TMP2(%rsp)
+	movdqu	\XMM3, TMP3(%rsp)
+	movdqu	\XMM4, TMP4(%rsp)
+	movdqu	\XMM5, TMP5(%rsp)
+	movdqu	\XMM6, TMP6(%rsp)
+	movdqu	\XMM7, TMP7(%rsp)
+	movdqu	\XMM8, TMP8(%rsp)
+
+	////////////////////////////////////////////////////////////////////////
+	//// Karatsuba Method
+
+	movdqa	\T7, \T4
+	pshufd	$0b01001110, \T7, \T6
+	pxor	\T7, \T6
+		.ifc \loop_idx, in_order
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		.else
+		paddd	ONEf(%rip), \CTR			// INCR CNT
+		.endif
+	movdqu	HashKey_8(\GDATA), \T5
+	pclmulqdq	 $0x11, \T5, \T4		// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T7			// \T7 = a0*b0
+	movdqu	HashKey_8_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T6			// \T2 = (a1+a0)*(b1+b0)
+		movdqa	\CTR, \XMM1
+
+		.ifc \loop_idx, in_order
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM2
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM3
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM4
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM5
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM6
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM7
+
+		paddd	ONE(%rip), \CTR			// INCR CNT
+		movdqa	\CTR, \XMM8
+
+		pshufb	SHUF_MASK(%rip), \XMM1		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM2		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM3		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM4		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM5		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM6		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM7		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM8		// perform a 16Byte swap
+
+		.else	// .ifc \loop_idx, in_order
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM2
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM3
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM4
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM5
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM6
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM7
+
+		paddd	ONEf(%rip), \CTR		// INCR CNT
+		movdqa	\CTR, \XMM8
+
+		.endif	// .ifc \loop_idx, in_order
+	////////////////////////////////////////////////////////////////////////
+
+		movdqu	16*0(\GDATA), \T1
+		pxor	\T1, \XMM1
+		pxor	\T1, \XMM2
+		pxor	\T1, \XMM3
+		pxor	\T1, \XMM4
+		pxor	\T1, \XMM5
+		pxor	\T1, \XMM6
+		pxor	\T1, \XMM7
+		pxor	\T1, \XMM8
+
+	// \XMM6, \T5 hold the values for the two operands which are
+	// carry-less multiplied
+	////////////////////////////////////////////////////////////////////////
+	// Karatsuba Method
+	movdqu	TMP2(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey_7(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1			// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3			// \T3 = a0*b0
+	movdqu  HashKey_7_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2			// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T1, \T4				// accumulate the results in \T4:\T7, \T6 holds the middle part
+	pxor	\T3, \T7
+	pxor	\T2, \T6
+
+		movdqu	16*1(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*2(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+	////////////////////////////////////////////////////////////////////////
+	// Karatsuba Method
+	movdqu	TMP3(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey_6(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
+	movdqu  HashKey_6_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
+	pxor	\T3, \T7
+	pxor	\T2, \T6
+
+		movdqu	16*3(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+	movdqu	TMP4(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey_5(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
+	movdqu  HashKey_5_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
+	pxor	\T3, \T7
+	pxor	\T2, \T6
+
+		movdqu	16*4(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*5(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+	movdqu	TMP5(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey_4(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
+	movdqu  HashKey_4_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
+	pxor	\T3, \T7
+	pxor	\T2, \T6
+
+		movdqu	16*6(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+
+	movdqu	TMP6(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey_3(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
+	movdqu  HashKey_3_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
+	pxor	\T3, \T7
+	pxor	\T2, \T6
+
+		movdqu	16*7(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+	movdqu	TMP7(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey_2(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1			// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3			// \T3 = a0*b0
+	movdqu  HashKey_2_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2			// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T1, \T4				// accumulate the results in \T4:\T7, \T6 holds the middle part
+	pxor	\T3, \T7
+	pxor	\T2, \T6
+
+		movdqu	16*8(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+
+	// \XMM8, \T5 hold the values for the two operands which are
+	// carry-less multiplied.
+	////////////////////////////////////////////////////////////////////////
+	// Karatsuba Method
+	movdqu	TMP8(%rsp), \T1
+	movdqa	\T1, \T3
+
+	pshufd	$0b01001110, \T3, \T2
+	pxor	\T3, \T2
+	movdqu	HashKey(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
+	movdqu  HashKey_k(\GDATA), \T5
+	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
+	pxor	\T3, \T7
+	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
+
+		movdqu	16*9(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+
+#ifdef GCM128_MODE
+		movdqu	16*10(\GDATA), \T5
+#endif
+#ifdef GCM192_MODE
+		movdqu	16*10(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*11(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*12(\GDATA), \T5	// finish last key round
+#endif
+#ifdef GCM256_MODE
+		movdqu	16*10(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*11(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*12(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+		movdqu	16*13(\GDATA), \T1
+		aesenc	\T1, \XMM1
+		aesenc	\T1, \XMM2
+		aesenc	\T1, \XMM3
+		aesenc	\T1, \XMM4
+		aesenc	\T1, \XMM5
+		aesenc	\T1, \XMM6
+		aesenc	\T1, \XMM7
+		aesenc	\T1, \XMM8
+
+	movdqu	16*14(\GDATA), \T5		// finish last key round
+#endif
+
+.altmacro
+.set i, 0
+.set j, 1
+.rept 8
+		xmmreg j, %j
+		XLDR	16*i(\PLAIN_CYPH_IN, \DATA_OFFSET), \T1
+
+		.ifc \ENC_DEC, DEC
+		movdqa	\T1, \T3
+		.endif
+
+		pxor	\T5, \T1
+		aesenclast	\T1, xmmj				// XMM1:XMM8
+		XSTR	xmmj, 16*i(\CYPH_PLAIN_OUT, \DATA_OFFSET)	// Write to the Output buffer
+
+		.ifc \ENC_DEC, DEC
+		movdqa	\T3, xmmj
+		.endif
+.set i, (i+1)
+.set j, (j+1)
+.endr
+.noaltmacro
+
+	pxor	\T6, \T2
+	pxor	\T4, \T2
+	pxor	\T7, \T2
+
+
+	movdqa	\T2, \T3
+	pslldq	$8, \T3			// shift-L \T3 2 DWs
+	psrldq	$8, \T2			// shift-R \T2 2 DWs
+	pxor	\T3, \T7
+	pxor	\T2, \T4		// accumulate the results in \T4:\T7
+
+
+
+	//first phase of the reduction
+	movdqa	\T7, \T2
+	movdqa	\T7, \T3
+	movdqa	\T7, \T1		// move \T7 into \T2, \T3, \T1 in order to perform the three shifts independently
+
+	pslld	$31, \T2		// packed right shifting << 31
+	pslld	$30, \T3		// packed right shifting shift << 30
+	pslld	$25, \T1		// packed right shifting shift << 25
+	pxor	\T3, \T2		// xor the shifted versions
+	pxor	\T1, \T2
+
+	movdqa	\T2, \T5
+	psrldq	$4, \T5			// shift-R \T5 1 DW
+
+	pslldq	$12, \T2		// shift-L \T2 3 DWs
+	pxor	\T2, \T7		// first phase of the reduction complete
+
+	////////////////////////////////////////////////////////////////////////
+
+		pshufb	SHUF_MASK(%rip), \XMM1	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM2	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM3	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM4	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM5	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM6	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM7	// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), \XMM8	// perform a 16Byte swap
+
+	//second phase of the reduction
+	movdqa	\T7, \T2		// make 3 copies of \T7 (in in \T2, \T3, \T1) for doing three shift operations
+	movdqa	\T7, \T3
+	movdqa	\T7, \T1
+
+	psrld	$1, \T2			// packed left shifting >> 1
+	psrld	$2, \T3			// packed left shifting >> 2
+	psrld	$7, \T1			// packed left shifting >> 7
+	pxor	\T3, \T2		// xor the shifted versions
+	pxor	\T1, \T2
+
+	pxor	\T5, \T2
+	pxor	\T2, \T7
+	pxor	\T4, \T7		// the result is in \T4
+
+
+	pxor    \T7, \XMM1
+
+.endm // GHASH_8_ENCRYPT_8_PARALLEL
+
+////////////////////////////////////////////////////////////////////////////////
+// GHASH_LAST_8: GHASH the last 8 ciphertext blocks.
+////////////////////////////////////////////////////////////////////////////////
+.macro	GHASH_LAST_8	GDATA, T1, T2, T3, T4, T5, T6, T7, \
+			XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7, XMM8
+
+
+	// Karatsuba Method
+	movdqa	\XMM1, \T6
+	pshufd	$0b01001110,  \XMM1, \T2
+	pxor	\XMM1, \T2
+	movdqu	HashKey_8(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T6		// \T6 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM1	// \XMM1 = a0*b0
+	movdqu	HashKey_8_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	movdqa	\XMM1, \T7
+	movdqa	\T2, \XMM1			// result in \T6, \T7, \XMM1
+
+	// Karatsuba Method
+	movdqa	\XMM2, \T1
+	pshufd	$0b01001110,  \XMM2, \T2
+	pxor	\XMM2, \T2
+	movdqu	HashKey_7(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM2	// \XMM2 = a0*b0
+	movdqu	HashKey_7_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM2, \T7
+	pxor	\T2, \XMM1			// results accumulated in \T6, \T7, \XMM1
+
+	// Karatsuba Method
+	movdqa	\XMM3, \T1
+	pshufd	$0b01001110,  \XMM3, \T2
+	pxor	\XMM3, \T2
+	movdqu	HashKey_6(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM3	// \XMM3 = a0*b0
+	movdqu	HashKey_6_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM3, \T7
+	pxor	\T2, \XMM1			// results accumulated in \T6, \T7, \XMM1
+
+	// Karatsuba Method
+	movdqa	\XMM4, \T1
+	pshufd	$0b01001110,  \XMM4, \T2
+	pxor	\XMM4, \T2
+	movdqu	HashKey_5(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM4	// \XMM4 = a0*b0
+	movdqu	HashKey_5_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM4, \T7
+	pxor	\T2, \XMM1			// results accumulated in \T6, \T7, \XMM1
+
+	// Karatsuba Method
+	movdqa	\XMM5, \T1
+	pshufd	$0b01001110,  \XMM5, \T2
+	pxor	\XMM5, \T2
+	movdqu	HashKey_4(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM5	// \XMM5 = a0*b0
+	movdqu	HashKey_4_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM5, \T7
+	pxor	\T2, \XMM1			// results accumulated in \T6, \T7, \XMM1
+
+	// Karatsuba Method
+	movdqa	\XMM6, \T1
+	pshufd	$0b01001110,  \XMM6, \T2
+	pxor	\XMM6, \T2
+	movdqu	HashKey_3(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM6	// \XMM6 = a0*b0
+	movdqu	HashKey_3_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM6, \T7
+	pxor	\T2, \XMM1			// results accumulated in \T6, \T7, \XMM1
+
+	// Karatsuba Method
+	movdqa	\XMM7, \T1
+	pshufd	$0b01001110,  \XMM7, \T2
+	pxor	\XMM7, \T2
+	movdqu	HashKey_2(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM7	// \XMM7 = a0*b0
+	movdqu	HashKey_2_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM7, \T7
+	pxor	\T2, \XMM1			// results accumulated in \T6, \T7, \XMM1
+
+
+	// Karatsuba Method
+	movdqa	\XMM8, \T1
+	pshufd	$0b01001110,  \XMM8, \T2
+	pxor	\XMM8, \T2
+	movdqu	HashKey(\GDATA), \T5
+	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
+
+	pclmulqdq	$0x00, \T5, \XMM8	// \XMM8 = a0*b0
+	movdqu	HashKey_k(\GDATA), \T4
+	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
+
+	pxor	\T1, \T6
+	pxor	\XMM8, \T7
+	pxor	\XMM1, \T2
+	pxor	\T6, \T2
+	pxor	\T7, \T2		// middle section of the temp results combined as in Karatsuba algorithm
+
+
+	movdqa	\T2, \T4
+	pslldq	$8, \T4			// shift-L \T4 2 DWs
+	psrldq	$8, \T2			// shift-R \T2 2 DWs
+	pxor	\T4, \T7
+	pxor	\T2, \T6		// <\T6:\T7> holds the result of the accumulated carry-less multiplications
+
+
+	//first phase of the reduction
+	movdqa	\T7, \T2
+	movdqa	\T7, \T3
+	movdqa	\T7, \T4		// move \T7 into \T2, \T3, \T4 in order to perform the three shifts independently
+
+	pslld	$31, \T2		// packed right shifting << 31
+	pslld	$30, \T3		// packed right shifting shift << 30
+	pslld	$25, \T4		// packed right shifting shift << 25
+	pxor	\T3, \T2		// xor the shifted versions
+	pxor	\T4, \T2
+
+	movdqa	\T2, \T1
+	psrldq	$4, \T1			// shift-R \T1 1 DW
+
+	pslldq	$12, \T2		// shift-L \T2 3 DWs
+	pxor	\T2, \T7		// first phase of the reduction complete
+	////////////////////////////////////////////////////////////////////////
+
+	//second phase of the reduction
+	movdqa	\T7, \T2		// make 3 copies of \T7 (in in \T2, \T3, \T4) for doing three shift operations
+	movdqa	\T7, \T3
+	movdqa	\T7, \T4
+
+	psrld	$1, \T2			// packed left shifting >> 1
+	psrld	$2, \T3			// packed left shifting >> 2
+	psrld	$7, \T4			// packed left shifting >> 7
+	pxor	\T3, \T2		// xor the shifted versions
+	pxor	\T4, \T2
+
+	pxor	\T1, \T2
+	pxor	\T2, \T7
+	pxor	\T7, \T6		// the result is in \T6
+
+.endm // GHASH_LAST_8
+
+////////////////////////////////////////////////////////////////////////////////
+// ENCRYPT_SINGLE_BLOCK: Encrypt a single block.
+////////////////////////////////////////////////////////////////////////////////
+.macro ENCRYPT_SINGLE_BLOCK	GDATA, ST, T1
+
+		movdqu	16*0(\GDATA), \T1
+		pxor	\T1, \ST
+
+.set i, 1
+.rept NROUNDS
+		movdqu	16*i(\GDATA), \T1
+		aesenc	\T1, \ST
+
+.set i, (i+1)
+.endr
+		movdqu	16*i(\GDATA), \T1
+		aesenclast	\T1, \ST
+.endm // ENCRYPT_SINGLE_BLOCK
+
+
+////////////////////////////////////////////////////////////////////////////////
+// FUNC_SAVE: Save clobbered regs on the stack.
+////////////////////////////////////////////////////////////////////////////////
+.macro FUNC_SAVE
+	//// Required for Update/GMC_ENC
+	//the number of pushes must equal STACK_OFFSET
+	push    %r12
+	push    %r13
+	push    %r14
+	push    %r15
+	push    %rsi
+	mov	%rsp, %r14
+
+	sub	$(VARIABLE_OFFSET), %rsp
+	and	$~63, %rsp
+
+#if __OUTPUT_FORMAT__ == win64
+	// xmm6:xmm15 need to be maintained for Windows
+	movdqu	%xmm6, (LOCAL_STORAGE + 0*16)(%rsp)
+	movdqu	%xmm7, (LOCAL_STORAGE + 1*16)(%rsp)
+	movdqu	%xmm8, (LOCAL_STORAGE + 2*16)(%rsp)
+	movdqu	%xmm9, (LOCAL_STORAGE + 3*16)(%rsp)
+	movdqu	%xmm10, (LOCAL_STORAGE + 4*16)(%rsp)
+	movdqu	%xmm11, (LOCAL_STORAGE + 5*16)(%rsp)
+	movdqu	%xmm12, (LOCAL_STORAGE + 6*16)(%rsp)
+	movdqu	%xmm13, (LOCAL_STORAGE + 7*16)(%rsp)
+	movdqu	%xmm14, (LOCAL_STORAGE + 8*16)(%rsp)
+	movdqu	%xmm15, (LOCAL_STORAGE + 9*16)(%rsp)
+
+	mov	arg(5), arg5  // XXXX [r14 + STACK_OFFSET + 8*5]
+#endif
+.endm // FUNC_SAVE
+
+////////////////////////////////////////////////////////////////////////////////
+// FUNC_RESTORE: Restore clobbered regs from the stack.
+////////////////////////////////////////////////////////////////////////////////
+.macro FUNC_RESTORE
+
+#if __OUTPUT_FORMAT__ == win64
+	movdqu	(LOCAL_STORAGE + 9*16)(%rsp), %xmm15
+	movdqu	(LOCAL_STORAGE + 8*16)(%rsp), %xmm14
+	movdqu	(LOCAL_STORAGE + 7*16)(%rsp), %xmm13
+	movdqu	(LOCAL_STORAGE + 6*16)(%rsp), %xmm12
+	movdqu	(LOCAL_STORAGE + 5*16)(%rsp), %xmm11
+	movdqu	(LOCAL_STORAGE + 4*16)(%rsp), %xmm10
+	movdqu	(LOCAL_STORAGE + 3*16)(%rsp), %xmm9
+	movdqu	(LOCAL_STORAGE + 2*16)(%rsp), %xmm8
+	movdqu	(LOCAL_STORAGE + 1*16)(%rsp), %xmm7
+	movdqu	(LOCAL_STORAGE + 0*16)(%rsp), %xmm6
+#endif
+
+	// Required for Update/GMC_ENC
+	mov	%r14, %rsp
+	pop	%rsi
+	pop	%r15
+	pop	%r14
+	pop	%r13
+	pop	%r12
+.endm // FUNC_RESTORE
+
+
+////////////////////////////////////////////////////////////////////////////////
+// GCM_INIT: Initializes a gcm_context_data struct to prepare for
+// encoding/decoding.
+// Input: gcm_key_data * (GDATA_KEY), gcm_context_data *(GDATA_CTX), IV,
+// Additional Authentication data (A_IN), Additional Data length (A_LEN).
+// Output: Updated GDATA_CTX with the hash of A_IN (AadHash) and initialized
+// other parts of GDATA.
+// Clobbers rax, r10-r13 and xmm0-xmm6
+////////////////////////////////////////////////////////////////////////////////
+.macro  GCM_INIT	GDATA_KEY, GDATA_CTX, IV, A_IN, A_LEN
+
+#define AAD_HASH	%xmm0
+#define SUBHASH		%xmm1
+
+	movdqu  HashKey(\GDATA_KEY), SUBHASH
+
+	CALC_AAD_HASH \A_IN, \A_LEN, AAD_HASH, SUBHASH, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %r10, %r11, %r12, %r13, %rax
+	pxor	%xmm3, %xmm2
+	mov	\A_LEN, %r10
+
+	movdqu	AAD_HASH, AadHash(\GDATA_CTX)	// ctx_data.aad hash = aad_hash
+	mov	%r10, AadLen(\GDATA_CTX)		// ctx_data.aad_length = aad_length
+	xor	%r10, %r10
+	mov	%r10, InLen(\GDATA_CTX)		// ctx_data.in_length = 0
+	mov	 %r10, PBlockLen(\GDATA_CTX)	// ctx_data.partial_block_length = 0
+	movdqu	%xmm2, PBlockEncKey(\GDATA_CTX)	// ctx_data.partial_block_enc_key = 0
+	mov	\IV, %r10
+	movdqa  ONEf(%rip), %xmm2			// read 12 IV bytes and pad with 0x00000001
+	pinsrq  $0, (%r10), %xmm2
+	pinsrd  $2, 8(%r10), %xmm2
+	movdqu	%xmm2, OrigIV(\GDATA_CTX)	// ctx_data.orig_IV = iv
+
+	pshufb	SHUF_MASK(%rip), %xmm2
+
+	movdqu	%xmm2, CurCount(\GDATA_CTX)	// ctx_data.current_counter = iv
+.endm // GCM_INIT
+
+
+////////////////////////////////////////////////////////////////////////////////
+// GCM_ENC_DEC Encodes/Decodes given data. Assumes that the passed
+// gcm_context_data struct has been initialized by GCM_INIT.
+// Requires the input data be at least 1 byte long because of
+// READ_SMALL_INPUT_DATA.
+// Input: gcm_key_data * (GDATA_KEY), gcm_context_data (GDATA_CTX),
+// input text (PLAIN_CYPH_IN), input text length (PLAIN_CYPH_LEN) and whether
+// encoding or decoding (ENC_DEC).
+// Output: A cypher of the given plain text (CYPH_PLAIN_OUT), and updated
+// GDATA_CTX
+// Clobbers rax, r10-r15, and xmm0-xmm15
+////////////////////////////////////////////////////////////////////////////////
+.macro	GCM_ENC_DEC	GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
+			PLAIN_CYPH_LEN, ENC_DEC
+
+#define	DATA_OFFSET		%r11
+
+	// clang compat: no local support
+	// LOCAL _initial_num_blocks_is_7, _initial_num_blocks_is_6
+	// LOCAL _initial_num_blocks_is_5, _initial_num_blocks_is_4
+	// LOCAL _initial_num_blocks_is_3, _initial_num_blocks_is_2
+	// LOCAL _initial_num_blocks_is_1, _initial_num_blocks_is_0
+	// LOCAL _initial_blocks_encrypted, _encrypt_by_8_new, _encrypt_by_8
+	// LOCAL _eight_cipher_left, _zero_cipher_left, _large_enough_update
+	// LOCAL _data_read, _less_than_8_bytes_left, _multiple_of_16_bytes
+
+// Macro flow:
+// calculate the number of 16byte blocks in the message
+// process (number of 16byte blocks) mod 8 '_initial_num_blocks_is_# .. _initial_blocks_encrypted'
+// process 8 16 byte blocks at a time until all are done '_encrypt_by_8_new .. _eight_cipher_left'
+// if there is a block of less tahn 16 bytes process it '_zero_cipher_left .. _multiple_of_16_bytes'
+
+	cmp	$0, \PLAIN_CYPH_LEN
+	je	_multiple_of_16_bytes_\@
+
+	xor	DATA_OFFSET, DATA_OFFSET
+	add	\PLAIN_CYPH_LEN, InLen(\GDATA_CTX)	//Update length of data processed
+	movdqu	HashKey(\GDATA_KEY), %xmm13		// xmm13 = HashKey
+	movdqu	AadHash(\GDATA_CTX), %xmm8
+
+
+	PARTIAL_BLOCK \GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, \PLAIN_CYPH_LEN, DATA_OFFSET, %xmm8, \ENC_DEC
+
+	mov	\PLAIN_CYPH_LEN, %r13                                // save the number of bytes of plaintext/ciphertext
+	sub	DATA_OFFSET, %r13
+	mov	%r13, %r10		//save the amount of data left to process in r10
+	and     $-16, %r13		// r13 = r13 - (r13 mod 16)
+
+	mov	%r13, %r12
+	shr	$4, %r12
+	and	$7, %r12
+	jz      _initial_num_blocks_is_0_\@
+
+
+	cmp	$7, %r12
+	je      _initial_num_blocks_is_7_\@
+	cmp	$6, %r12
+	je      _initial_num_blocks_is_6_\@
+	cmp	$5, %r12
+	je      _initial_num_blocks_is_5_\@
+	cmp	$4, %r12
+	je      _initial_num_blocks_is_4_\@
+	cmp	$3, %r12
+	je      _initial_num_blocks_is_3_\@
+	cmp	$2, %r12
+	je      _initial_num_blocks_is_2_\@
+
+	jmp     _initial_num_blocks_is_1_\@
+
+_initial_num_blocks_is_7_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 7, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*7), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_6_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 6, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*6), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_5_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 5, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*5), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_4_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 4, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*4), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_3_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 3, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*3), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_2_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 2, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*2), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_1_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 1, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	sub	$(16*1), %r13
+	jmp     _initial_blocks_encrypted_\@
+
+_initial_num_blocks_is_0_\@:
+	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 0, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+
+_initial_blocks_encrypted_\@:
+	cmp	$0, %r13
+	je      _zero_cipher_left_\@
+
+	sub	$128, %r13
+	je      _eight_cipher_left_\@
+
+	movd	%xmm9, %r15d
+	and	$255, %r15d
+	pshufb	SHUF_MASK(%rip), %xmm9
+
+
+_encrypt_by_8_new_\@:
+	cmp	$(255-8), %r15d
+	jg      _encrypt_by_8_\@
+
+	add	$8, %r15b
+	GHASH_8_ENCRYPT_8_PARALLEL	\GDATA_KEY, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, DATA_OFFSET, %xmm0, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm15, out_order, \ENC_DEC
+	add	$128, DATA_OFFSET
+	sub	$128, %r13
+	jne     _encrypt_by_8_new_\@
+
+	pshufb	SHUF_MASK(%rip), %xmm9
+	jmp     _eight_cipher_left_\@
+
+_encrypt_by_8_\@:
+	pshufb	SHUF_MASK(%rip), %xmm9
+	add	$8, %r15b
+
+	GHASH_8_ENCRYPT_8_PARALLEL	\GDATA_KEY, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, DATA_OFFSET, %xmm0, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm15, in_order, \ENC_DEC
+	pshufb	SHUF_MASK(%rip), %xmm9
+	add	$128, DATA_OFFSET
+	sub	$128, %r13
+	jne     _encrypt_by_8_new_\@
+
+	pshufb	SHUF_MASK(%rip), %xmm9
+
+
+
+_eight_cipher_left_\@:
+	GHASH_LAST_8	\GDATA_KEY, %xmm0, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8
+
+
+_zero_cipher_left_\@:
+	movdqu	%xmm14, AadHash(\GDATA_CTX) 
+	movdqu	%xmm9, CurCount(\GDATA_CTX) 
+
+	mov	%r10, %r13
+	and	$15, %r13			// r13 = (\PLAIN_CYPH_LEN mod 16)
+
+	je      _multiple_of_16_bytes_\@
+
+	mov	%r13, PBlockLen(\GDATA_CTX)		// my_ctx.data.partial_blck_length = r13
+	// handle the last <16 Byte block seperately
+
+	paddd	ONE(%rip), %xmm9			// INCR CNT to get Yn
+	movdqu	%xmm9, CurCount(\GDATA_CTX)	// my_ctx.data.current_counter = xmm9
+	pshufb  SHUF_MASK(%rip), %xmm9
+	ENCRYPT_SINGLE_BLOCK	\GDATA_KEY, %xmm9, %xmm2	// E(K, Yn)
+	movdqu	%xmm9, PBlockEncKey(\GDATA_CTX)	// my_ctx_data.partial_block_enc_key = xmm9
+
+	cmp	$16, \PLAIN_CYPH_LEN
+	jge	_large_enough_update_\@
+
+	lea	(\PLAIN_CYPH_IN, DATA_OFFSET), %r10
+	READ_SMALL_DATA_INPUT	%xmm1, %r10, %r13, %r12, %r15, %rax
+	lea	(SHIFT_MASK + 16)(%rip), %r12
+	sub	%r13, %r12
+	jmp	_data_read_\@
+
+_large_enough_update_\@:
+	sub	$16, DATA_OFFSET
+	add	%r13, DATA_OFFSET
+
+	movdqu	(\PLAIN_CYPH_IN, DATA_OFFSET), %xmm1	// receive the last <16 Byte block
+
+	sub	%r13, DATA_OFFSET
+	add	$16, DATA_OFFSET
+
+	lea	(SHIFT_MASK + 16)(%rip), %r12
+	sub	%r13, %r12			// adjust the shuffle mask pointer to be able to shift 16-r13 bytes (r13 is the number of bytes in plaintext mod 16)
+	movdqu	(%r12), %xmm2			// get the appropriate shuffle mask
+	pshufb	%xmm2, %xmm1			// shift right 16-r13 bytes
+_data_read_\@:
+	.ifc  \ENC_DEC, DEC
+
+	movdqa	%xmm1, %xmm2
+	pxor	%xmm1, %xmm9			// Plaintext XOR E(K, Yn)
+	movdqu	(ALL_F - SHIFT_MASK)(%r12), %xmm1	// get the appropriate mask to mask out top 16-r13 bytes of xmm9
+	pand	%xmm1, %xmm9			// mask out top 16-r13 bytes of xmm9
+	pand	%xmm1, %xmm2
+	pshufb	SHUF_MASK(%rip), %xmm2
+	pxor	%xmm2, %xmm14
+	movdqu	%xmm14, AadHash(\GDATA_CTX)
+
+	.else	// .ifc  \ENC_DEC, DEC
+
+	pxor	%xmm1, %xmm9			// Plaintext XOR E(K, Yn)
+	movdqu	(ALL_F - SHIFT_MASK)(%r12), %xmm1	// get the appropriate mask to mask out top 16-r13 bytes of xmm9
+	pand	%xmm1, %xmm9			// mask out top 16-r13 bytes of xmm9
+	pshufb	SHUF_MASK(%rip), %xmm9
+	pxor	%xmm9, %xmm14
+	movdqu	%xmm14, AadHash(\GDATA_CTX)
+
+	pshufb	SHUF_MASK(%rip), %xmm9		// shuffle xmm9 back to output as ciphertext
+
+	.endif	// .ifc  \ENC_DEC, DEC
+
+
+	//////////////////////////////////////////////////////////
+	// output r13 Bytes
+	movq	%xmm9, %rax
+	cmp	$8, %r13
+	jle     _less_than_8_bytes_left_\@
+
+	mov	%rax, (\CYPH_PLAIN_OUT, DATA_OFFSET)
+	add	$8, DATA_OFFSET
+	psrldq	$8, %xmm9
+	movq	%xmm9, %rax
+	sub	$8, %r13
+
+_less_than_8_bytes_left_\@:
+	movb	%al, (\CYPH_PLAIN_OUT, DATA_OFFSET)
+	add     $1, DATA_OFFSET
+	shr	$8, %rax
+	sub	$1, %r13
+	jne     _less_than_8_bytes_left_\@
+	//////////////////////////////////////////////////////////
+
+_multiple_of_16_bytes_\@:
+
+.endm // GCM_ENC_DEC
+
+
+////////////////////////////////////////////////////////////////////////////////
+// GCM_COMPLETE: Finishes Encyrption/Decryption of last partial block after
+// GCM_UPDATE finishes.
+// Input: A gcm_key_data * (GDATA_KEY), gcm_context_data * (GDATA_CTX) and
+// whether encoding or decoding (ENC_DEC).
+// Output: Authorization Tag (AUTH_TAG) and Authorization Tag length
+// (AUTH_TAG_LEN)
+// Clobbers %rax, r10-r12, and xmm0, xmm1, xmm5, xmm6, xmm9, xmm11, xmm14, xmm15
+////////////////////////////////////////////////////////////////////////////////
+.macro	GCM_COMPLETE	GDATA_KEY, GDATA_CTX, AUTH_TAG, AUTH_TAG_LEN, ENC_DEC
+
+#define	PLAIN_CYPH_LEN	%rax
+
+	// clang compat: no local support
+	// LOCAL _partial_done, _return_T, _T_8, _T_12, _T_16, _return_T_done
+
+	mov	PBlockLen(\GDATA_CTX), %r12	// r12 = aadLen (number of bytes)
+	movdqu	AadHash(\GDATA_CTX), %xmm14
+	movdqu	HashKey(\GDATA_KEY), %xmm13
+
+	cmp	$0, %r12
+
+	je _partial_done_\@
+
+	GHASH_MUL %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6 //GHASH computation for the last <16 Byte block
+	movdqu	%xmm14, AadHash(\GDATA_CTX)
+
+_partial_done_\@:
+
+	mov	AadLen(\GDATA_CTX), %r12		// r12 = aadLen (number of bytes)
+	mov	InLen(\GDATA_CTX), PLAIN_CYPH_LEN
+
+	shl	$3, %r12			// convert into number of bits
+	movd	%r12d, %xmm15		// len(A) in xmm15
+
+	shl	$3, PLAIN_CYPH_LEN	// len(C) in bits  (*128)
+	movq	PLAIN_CYPH_LEN, %xmm1
+	pslldq	$8, %xmm15		// xmm15 = len(A)|| 0x0000000000000000
+	pxor	%xmm1, %xmm15		// xmm15 = len(A)||len(C)
+
+	pxor	%xmm15, %xmm14
+	GHASH_MUL %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6    // final GHASH computation
+	pshufb	SHUF_MASK(%rip), %xmm14		// perform a 16Byte swap
+	movdqu	OrigIV(\GDATA_CTX), %xmm9	// xmm9 = Y0
+
+	ENCRYPT_SINGLE_BLOCK	\GDATA_KEY, %xmm9, %xmm2	// E(K, Y0)
+
+	pxor	%xmm14, %xmm9
+
+_return_T_\@:
+	mov	\AUTH_TAG, %r10			// r10 = authTag
+	mov	\AUTH_TAG_LEN, %r11		// r11 = auth_tag_len
+
+	cmp	$16, %r11
+	je      _T_16_\@
+
+	cmp	$12, %r11
+	je      _T_12_\@
+
+_T_8_\@:
+	movq	%xmm9, %rax
+	mov	%rax, (%r10)
+	jmp     _return_T_done_\@
+
+_T_12_\@:
+	movq	%xmm9, %rax
+	mov	%rax, (%r10)
+	psrldq	$8, %xmm9
+	movd	%xmm9, %eax
+	mov	%eax, 8(%r10)
+	jmp     _return_T_done_\@
+
+_T_16_\@:
+	movdqu	%xmm9, (%r10)
+
+_return_T_done_\@:
+.endm //GCM_COMPLETE
+
+
+#if 1
+
+	.balign 16
+////////////////////////////////////////////////////////////////////////////////
+//void	aes_gcm_precomp_{128,256}_sse
+//        (struct gcm_key_data *key_data);
+////////////////////////////////////////////////////////////////////////////////
+#if FUNCT_EXTENSION != _nt
+.global FN_NAME(precomp,_)
+FN_NAME(precomp,_):
+
+	endbranch
+
+	push	%r12
+	push	%r13
+	push	%r14
+	push	%r15
+
+	mov     %rsp, %r14
+
+	sub	$(VARIABLE_OFFSET), %rsp
+	and	$(~63), %rsp				// align rsp to 64 bytes
+
+#if __OUTPUT_FORMAT__ == win64
+	// only xmm6 needs to be maintained
+	movdqu	%xmm6, (LOCAL_STORAGE + 0*16)(%rsp)
+#endif
+
+	pxor	%xmm6, %xmm6
+	ENCRYPT_SINGLE_BLOCK	arg1, %xmm6, %xmm2	// xmm6 = HashKey
+
+	pshufb	SHUF_MASK(%rip), %xmm6
+	///////////////  PRECOMPUTATION of HashKey<<1 mod poly from the HashKey
+	movdqa	%xmm6, %xmm2
+	psllq	$1, %xmm6
+	psrlq	$63, %xmm2
+	movdqa	%xmm2, %xmm1
+	pslldq	$8, %xmm2
+	psrldq	$8, %xmm1
+	por	%xmm2, %xmm6
+
+	//reduction
+	pshufd	$0b00100100, %xmm1, %xmm2
+	pcmpeqd	TWOONE(%rip), %xmm2
+	pand	POLY(%rip), %xmm2
+	pxor	%xmm2, %xmm6		// xmm6 holds the HashKey<<1 mod poly
+	///////////////////////////////////////////////////////////////////////
+	movdqu	 %xmm6, HashKey(arg1)	// store HashKey<<1 mod poly
+
+	PRECOMPUTE  arg1, %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5
+
+#if __OUTPUT_FORMAT__ == win64
+	movdqu (LOCAL_STORAGE + 0*16)(%rsp), %xmm6
+#endif
+	mov	%r14, %rsp
+
+	pop	%r15
+	pop	%r14
+	pop	%r13
+	pop	%r12
+	ret
+#endif	// _nt
+
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_init_128_sse / aes_gcm_init_256_sse (
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *iv,
+//        const   u8 *aad,
+//        u64     aad_len);
+////////////////////////////////////////////////////////////////////////////////
+#if FUNCT_EXTENSION != _nt
+.global FN_NAME(init,_)
+FN_NAME(init,_):
+	endbranch
+
+	push	%r12
+	push	%r13
+#if __OUTPUT_FORMAT__ == win64
+	push	arg5
+	sub	$(1*16), %rsp
+	movdqu	%xmm6, (0*16)(%rsp)
+	mov	(1*16 + 8*3 + 8*5)(%rsp), arg5
+#endif
+
+	GCM_INIT arg1, arg2, arg3, arg4, arg5
+
+#if __OUTPUT_FORMAT__ == win64
+	movdqu	(0*16)(%rsp), %xmm6
+	add	$(1*16), %rsp
+	pop	arg5
+#endif
+	pop	%r13
+	pop	%r12
+	ret
+#endif	// _nt
+
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_enc_128_update_sse / aes_gcm_enc_256_update_sse
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *out,
+//        const   u8 *in,
+//        u64     plaintext_len);
+////////////////////////////////////////////////////////////////////////////////
+.global FN_NAME(enc,_update_)
+FN_NAME(enc,_update_):
+	endbranch
+
+	FUNC_SAVE
+
+	GCM_ENC_DEC arg1, arg2, arg3, arg4, arg5, ENC
+
+	FUNC_RESTORE
+
+	ret
+
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_dec_256_update_sse / aes_gcm_dec_256_update_sse
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *out,
+//        const   u8 *in,
+//        u64     plaintext_len);
+////////////////////////////////////////////////////////////////////////////////
+.global FN_NAME(dec,_update_)
+FN_NAME(dec,_update_):
+	endbranch
+
+	FUNC_SAVE
+
+	GCM_ENC_DEC arg1, arg2, arg3, arg4, arg5, DEC
+
+	FUNC_RESTORE
+
+	ret
+
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_enc_128_finalize_sse / aes_gcm_enc_256_finalize_sse
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *auth_tag,
+//        u64     auth_tag_len);
+////////////////////////////////////////////////////////////////////////////////
+#if FUNCT_EXTENSION != _nt
+.global FN_NAME(enc,_finalize_)
+FN_NAME(enc,_finalize_):
+
+	endbranch
+
+	push	%r12
+
+#if __OUTPUT_FORMAT__ == win64
+	// xmm6:xmm15 need to be maintained for Windows
+	sub	$(5*16), %rsp
+	movdqu	%xmm6, (0*16)(%rsp)
+	movdqu	%xmm9, (1*16)(%rsp)
+	movdqu	%xmm11, (2*16)(%rsp)
+	movdqu	%xmm14, (3*16)(%rsp)
+	movdqu	%xmm15, (4*16)(%rsp)
+#endif
+	GCM_COMPLETE	arg1, arg2, arg3, arg4, ENC
+
+#if __OUTPUT_FORMAT__ == win64
+	movdqu	(4*16)(%rsp), %xmm15
+	movdqu	(3*16)(%rsp), %xmm14
+	movdqu	(2*16)(%rsp), %xmm11
+	movdqu	(1*16)(%rsp), %xmm9
+	movdqu	(0*16)(%rsp), %xmm6
+	add	$(5*16), %rsp
+#endif
+
+	pop	%r12
+	ret
+#endif	// _nt
+
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_dec_128_finalize_sse / aes_gcm_dec_256_finalize_sse
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *auth_tag,
+//        u64     auth_tag_len);
+////////////////////////////////////////////////////////////////////////////////
+#if FUNCT_EXTENSION != _nt
+.global FN_NAME(dec,_finalize_)
+FN_NAME(dec,_finalize_):
+
+	endbranch
+
+	push	%r12
+
+#if __OUTPUT_FORMAT == win64
+	// xmm6:xmm15 need to be maintained for Windows
+	sub	$(5*16), %rsp
+	movdqu	%xmm6, (0*16)(%rsp)
+	movdqu	%xmm9, (1*16)(%rsp)
+	movdqu	%xmm11, (2*16)(%rsp)
+	movdqu	%xmm14, (3*16)(%rsp)
+	movdqu	%xmm15, (4*16)(%rsp)
+#endif
+	GCM_COMPLETE	arg1, arg2, arg3, arg4, DEC
+
+#if __OUTPUT_FORMAT__ == win64
+	movdqu	(4*16)(%rsp), %xmm15
+	movdqu	(3*16)(%rsp), %xmm14
+	movdqu	(2*16)(%rsp), %xmm11
+	movdqu	(1*16)(%rsp), %xmm9
+	movdqu	(0*16)(%rsp), %xmm6
+	add	$(5*16), %rsp
+#endif
+
+	pop	%r12
+	ret
+#endif	// _nt
+
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_enc_128_sse / aes_gcm_enc_256_sse
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *out,
+//        const   u8 *in,
+//        u64     plaintext_len,
+//        u8      *iv,
+//        const   u8 *aad,
+//        u64     aad_len,
+//        u8      *auth_tag,
+//        u64     auth_tag_len)//
+////////////////////////////////////////////////////////////////////////////////
+.global FN_NAME(enc,_)
+FN_NAME(enc,_):
+	endbranch
+
+	FUNC_SAVE
+
+	GCM_INIT arg1, arg2, arg6, arg7, arg8
+
+	GCM_ENC_DEC  arg1, arg2, arg3, arg4, arg5, ENC
+
+	GCM_COMPLETE arg1, arg2, arg9, arg10, ENC
+	FUNC_RESTORE
+
+	ret
+
+////////////////////////////////////////////////////////////////////////////////
+//void   aes_gcm_dec_128_sse / aes_gcm_dec_256_sse
+//        const struct gcm_key_data *key_data,
+//        struct gcm_context_data *context_data,
+//        u8      *out,
+//        const   u8 *in,
+//        u64     plaintext_len,
+//        u8      *iv,
+//        const   u8 *aad,
+//        u64     aad_len,
+//        u8      *auth_tag,
+//        u64     auth_tag_len)//
+////////////////////////////////////////////////////////////////////////////////
+.global FN_NAME(dec,_)
+FN_NAME(dec,_):
+	endbranch
+
+	FUNC_SAVE
+
+	GCM_INIT arg1, arg2, arg6, arg7, arg8
+
+	GCM_ENC_DEC  arg1, arg2, arg3, arg4, arg5, DEC
+
+	GCM_COMPLETE arg1, arg2, arg9, arg10, DEC
+	FUNC_RESTORE
+
+	ret
+
+.global FN_NAME(this_is_gas,_)
+FN_NAME(this_is_gas,_):
+	endbranch
+	FUNC_SAVE
+	FUNC_RESTORE
+	ret
+
+#else
+	// GAS doesnt't provide the linenuber in the macro
+	////////////////////////
+	// GHASH_MUL xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6
+	// PRECOMPUTE rax, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6
+	// READ_SMALL_DATA_INPUT xmm1, r10, 8, rax, r12, r15
+	// ENCRYPT_SINGLE_BLOCK rax, xmm0, xmm1
+	// INITIAL_BLOCKS rdi,rsi,rdx,rcx,r13,r11,7,xmm12,xmm13,xmm14,xmm15,xmm11,xmm9,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,xmm7,xmm8,xmm10,xmm0,ENC
+	// CALC_AAD_HASH [r14+8*5+8*1],[r14+8*5+8*2],xmm0,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,r10,r11,r12,r13,rax
+	// READ_SMALL_DATA_INPUT	xmm2, r10, r11, r12, r13, rax
+	// PARTIAL_BLOCK  rdi,rsi,rdx,rcx,r8,r11,xmm8,ENC
+	// GHASH_8_ENCRYPT_8_PARALLEL rdi,rdx,rcx,r11,xmm0,xmm10,xmm11,xmm12,xmm13,xmm14,xmm9,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,xmm7,xmm8,xmm15,out_order,ENC
+	//GHASH_LAST_8 rdi,xmm0,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,xmm7,xmm8
+#endif
diff --git a/module/icp/asm-x86_64/modes/isalc_reg_sizes.S b/module/icp/asm-x86_64/modes/isalc_reg_sizes.S
new file mode 100644
index 000000000000..d77291ce58a1
--- /dev/null
+++ b/module/icp/asm-x86_64/modes/isalc_reg_sizes.S
@@ -0,0 +1,221 @@
+////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+//  Copyright(c) 2011-2019 Intel Corporation All rights reserved.
+//
+//  Redistribution and use in source and binary forms, with or without
+//  modification, are permitted provided that the following conditions
+//  are met:
+//    * Redistributions of source code must retain the above copyright
+//      notice, this list of conditions and the following disclaimer.
+//    * Redistributions in binary form must reproduce the above copyright
+//      notice, this list of conditions and the following disclaimer in
+//      the documentation and/or other materials provided with the
+//      distribution.
+//    * Neither the name of Intel Corporation nor the names of its
+//      contributors may be used to endorse or promote products derived
+//      from this software without specific prior written permission.
+//
+//  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+//  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+//  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+//  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+//  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+//  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+//  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES// LOSS OF USE,
+//  DATA, OR PROFITS// OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+//  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+//  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+//  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+
+#ifndef _REG_SIZES_ASM_
+#define _REG_SIZES_ASM_
+
+
+// define d, w  and b variants for registers
+
+.macro dwordreg reg
+	.if \reg == %r8 || \reg == %r9 || \reg == %r10 || \reg == %r11 || \reg == %r12 || \reg == %r13 || \reg == %r14 || \reg == %r15
+		.set dreg, \reg\()d
+	.elseif \reg == %rax
+		.set dreg, %eax
+	.elseif \reg == %rcx
+		.set dreg, %ecx
+	.elseif \reg == %rdx
+		.set dreg, %edx
+	.elseif \reg == %rbx
+		.set dreg, %ebx
+	.elseif \reg == %rsp
+		.set dreg, %esp
+	.elseif \reg == %rbp
+		.set dreg, %ebp
+	.elseif \reg == %rsi
+		.set dreg, %esi
+	.elseif \reg == %rdi
+		.set dreg, %edi
+	.else
+		.error "Invalid register '\reg\()' while expanding macro 'dwordreg\()'"
+	.endif
+.endm
+
+.macro wordreg reg
+	.if \reg == %r8 || \reg == %r9 || \reg == %r10 || \reg == %r11 || \reg == %r12 || \reg == %r13 || \reg == %r14 || \reg == %r15
+		.set wreg, \reg\()w
+	.elseif \reg == %rax
+		.set wreg, %ax
+	.elseif \reg == %rcx
+		.set wreg, %cx
+	.elseif \reg == %rdx
+		.set wreg, %dx
+	.elseif \reg == %rbx
+		.set wreg, %bx
+	.elseif \reg == %rsp
+		.set wreg, %sp
+	.elseif \reg == %rbp
+		.set wreg, %bp
+	.elseif \reg == %rsi
+		.set wreg, %si
+	.elseif \reg == %rdi
+		.set wreg, %di
+	.else
+		.error "Invalid register '\reg\()' while expanding macro 'wordreg\()'"
+	.endif
+.endm
+
+
+.macro bytereg reg
+	.if \reg == %r8 || \reg == %r9 || \reg == %r10 || \reg == %r11 || \reg == %r12 || \reg == %r13 || \reg == %r14 || \reg == %r15
+		.set breg, \reg\()b
+	.elseif \reg == %rax
+		.set breg, %al
+	.elseif \reg == %rcx
+		.set breg, %cl
+	.elseif \reg == %rdx
+		.set breg, %dl
+	.elseif \reg == %rbx
+		.set breg, %bl
+	.elseif \reg == rsp
+		.set breg, %spl
+	.elseif \reg == %rbp
+		.set breg, %bpl
+	.elseif \reg == rsi
+		.set breg, %sil
+	.elseif \reg == rdi
+		.set breg, %dil
+	.else
+		.error "Invalid register '\reg\()' while expanding macro 'bytereg\()'"
+	.endif
+.endm
+
+// clang compat: Below won't owrk with clang; do it a bit different
+// #define ZERO_TO_THIRTYONE \
+// 	0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16, \
+// 	17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
+
+// .macro xword reg
+// 	.irep i, ZERO_TO_THIRTYONE
+// 		.if \reg == %xmm\i || \reg == %ymm\i || \reg == %zmm\i
+// 			.set xmmreg, %xmm\i
+// 		.endif
+// 	.endr
+// .endm
+
+// .macro yword reg
+// 	.irep i, ZERO_TO_THIRTYONE
+// 		.if \reg == %xmm\i || \reg == %ymm\i || \reg == %zmm\i
+// 			.set ymmreg, %ymm\i
+// 		.endif
+// 	.endr
+// .endm
+
+// .macro zword reg
+// 	.irep i, ZERO_TO_THIRTYONE
+// 		.if \reg == %xmm\i || \reg == %ymm\i || \reg == %zmm\i
+// 			.set zmmreg, %zmm\i
+// 		.endif
+// 	.endr
+// .endm
+
+// Example usage:
+// xword %zmm12
+// pxor xmmreg, xmmreg // =>  pxor %xmm12, %xmm12
+.macro xword reg
+	.set i, 0
+	.rep 32
+		.altmacro
+		do_xyzword <\reg>, xmm, %i
+		.noaltmacro
+	.set i, (i+1)
+	.endr
+.endm
+
+.macro yword reg
+	.set i, 0
+	.rep 32
+		.altmacro
+		do_xyzword <\reg>, ymm, %i
+		.noaltmacro
+	.set i, (i+1)
+	.endr
+.endm
+
+.macro zword reg
+	.set i, 0
+	.rep 32
+		.altmacro
+		do_xyzword <\reg>, zmm, %i
+		.noaltmacro
+	.set i, (i+1)
+	.endr
+.endm
+
+.macro  do_xyzword creg, prfx, idx
+	.if \creg == %xmm\idx || \creg == %ymm\idx || \creg == %zmm\idx
+		.set \prfx\()reg, %\prfx\idx
+	.endif
+.endm
+
+
+// FIXME: handle later
+#define	elf32		1
+#define	elf64		2
+#define	win64		3
+#define	machos64	4
+
+#ifndef __OUTPUT_FORMAT__
+#define __OUTPUT_FORMAT__ elf64
+#endif
+
+#if __OUTPUT_FORMAT__ == elf32
+.section .note.GNU-stack,"",%progbits
+.section .text
+#endif
+#if __OUTPUT_FORMAT__ == elf64
+#ifndef __x86_64__
+#define __x86_64__
+#endif
+.section .note.GNU-stack,"",%progbits
+.section .text
+#endif
+#if __OUTPUT_FORMAT__ == win64
+#define __x86_64__
+#endif
+#if __OUTPUT_FORMAT__ == macho64
+#define __x86_64__
+#endif
+
+
+#ifdef __x86_64__
+#define endbranch .byte 0xf3, 0x0f, 0x1e, 0xfa
+#else
+#define endbranch .byte 0xf3, 0x0f, 0x1e, 0xfb
+#endif
+
+#ifdef REL_TEXT
+#define WRT_OPT
+#elif __OUTPUT_FORMAT__ == elf64
+#define WRT_OPT        wrt ..plt
+#else
+#define WRT_OPT
+#endif
+
+#endif // ifndef _REG_SIZES_ASM_
 
From de13d7cd75869df6375da318b53fad36aa644bdf Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Attila=20F=C3=BCl=C3=B6p?= <attila@fueloep.org>
Date: Fri, 10 Feb 2023 00:09:09 +0100
Subject: [PATCH 2/2] ICP: AES_GCM: Add sse4 asm routines, first stab
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
 
add asm_linkage.h and .cfi
use macros for push and pop
add gcm192 support
add gcm192 support
adapt to gcm_ctx_t offsets
adapt to gcm_ctx_t keysched and htab
adapt to gcm_ctx_t keysched and htab
adapt to gcm_ctx_t keysched and htab
integrate to build sys
builds, next is the fun part: debugging
passes cursory SSE/AVX cross testing
various cleanup
cstyle
cleanup
avoid triggering meaningless asserts
cmn_err implies newline
oc there are bugs in the debugging code as well
fix merge error
update moved gcm_clear_ctx()
minor comment cleanup
 
Signed-off-by: Attila Fülöp <attila@fueloep.org>
---
 Makefile.am                                   |    2 +
 lib/libicp/Makefile.am                        |    3 +
 module/Kbuild.in                              |    5 +-
 module/icp/algs/modes/gcm.c                   | 1004 +++++++++++++++--
 module/icp/algs/modes/modes.c                 |   18 +-
 .../icp/asm-x86_64/modes/isalc_gcm128_sse.S   |    7 +-
 .../icp/asm-x86_64/modes/isalc_gcm192_sse.S   |   36 +
 .../icp/asm-x86_64/modes/isalc_gcm256_sse.S   |    7 +-
 .../icp/asm-x86_64/modes/isalc_gcm_defines.S  |  193 +++-
 module/icp/asm-x86_64/modes/isalc_gcm_sse.S   |  759 ++++++-------
 module/icp/asm-x86_64/modes/isalc_reg_sizes.S |   13 +-
 module/icp/include/modes/modes.h              |   57 +-
 module/icp/io/aes.c                           |    1 -
 14 files changed, 1529 insertions(+), 577 deletions(-)
 create mode 100644 module/icp/asm-x86_64/modes/isalc_gcm192_sse.S
 
diff --git a/Makefile.am b/Makefile.am
index 11e45dae8255..1fb636972566 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -51,6 +51,8 @@ dist_noinst_DATA += module/icp/asm-x86_64/aes/THIRDPARTYLICENSE.openssl
 dist_noinst_DATA += module/icp/asm-x86_64/aes/THIRDPARTYLICENSE.openssl.descrip
 dist_noinst_DATA += module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.cryptogams
 dist_noinst_DATA += module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.cryptogams.descrip
+dist_noinst_DATA += module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.intel
+dist_noinst_DATA += module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.intel.descrip
 dist_noinst_DATA += module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.openssl
 dist_noinst_DATA += module/icp/asm-x86_64/modes/THIRDPARTYLICENSE.openssl.descrip
 dist_noinst_DATA += module/os/linux/spl/THIRDPARTYLICENSE.gplv2
diff --git a/lib/libicp/Makefile.am b/lib/libicp/Makefile.am
index 4ba55b2158bc..0c9994c3a7b2 100644
--- a/lib/libicp/Makefile.am
+++ b/lib/libicp/Makefile.am
@@ -74,6 +74,9 @@ nodist_libicp_la_SOURCES += \
 	module/icp/asm-x86_64/modes/gcm_pclmulqdq.S \
 	module/icp/asm-x86_64/modes/aesni-gcm-x86_64.S \
 	module/icp/asm-x86_64/modes/ghash-x86_64.S \
+	module/icp/asm-x86_64/modes/isalc_gcm128_sse.S \
+	module/icp/asm-x86_64/modes/isalc_gcm192_sse.S \
+	module/icp/asm-x86_64/modes/isalc_gcm256_sse.S \
 	module/icp/asm-x86_64/sha2/sha256-x86_64.S \
 	module/icp/asm-x86_64/sha2/sha512-x86_64.S \
 	module/icp/asm-x86_64/blake3/blake3_avx2.S \
diff --git a/module/Kbuild.in b/module/Kbuild.in
index 8d29f56c2fb8..cbfce110d322 100644
--- a/module/Kbuild.in
+++ b/module/Kbuild.in
@@ -125,7 +125,10 @@ ICP_OBJS_X86_64 := \
 	asm-x86_64/sha2/sha512-x86_64.o \
 	asm-x86_64/modes/aesni-gcm-x86_64.o \
 	asm-x86_64/modes/gcm_pclmulqdq.o \
-	asm-x86_64/modes/ghash-x86_64.o
+	asm-x86_64/modes/ghash-x86_64.o \
+	asm-x86_64/modes/isalc_gcm128_sse.o \
+	asm-x86_64/modes/isalc_gcm192_sse.o \
+	asm-x86_64/modes/isalc_gcm256_sse.o \
 
 ICP_OBJS_X86 := \
 	algs/aes/aes_impl_aesni.o \
diff --git a/module/icp/algs/modes/gcm.c b/module/icp/algs/modes/gcm.c
index dd8db6f97460..f6ceb49fb393 100644
--- a/module/icp/algs/modes/gcm.c
+++ b/module/icp/algs/modes/gcm.c
@@ -35,7 +35,13 @@
 #include <aes/aes_impl.h>
 #endif
 
-#define	GHASH(c, d, t, o) \
+#ifdef DEBUG_GCM_ASM
+/* Can't attach to inline funcs with bpftrace */
+#undef	inline
+#define	inline __attribute__((__noinline__))
+#endif
+
+#define	GHASH(c, d, t, o)				      \
 	xor_block((uint8_t *)(d), (uint8_t *)(c)->gcm_ghash); \
 	(o)->mul((uint64_t *)(void *)(c)->gcm_ghash, (c)->gcm_H, \
 	(uint64_t *)(void *)(t));
@@ -43,9 +49,14 @@
 /* Select GCM implementation */
 #define	IMPL_FASTEST	(UINT32_MAX)
 #define	IMPL_CYCLE	(UINT32_MAX-1)
-#ifdef CAN_USE_GCM_ASM
+#ifdef CAN_USE_GCM_ASM_AVX
 #define	IMPL_AVX	(UINT32_MAX-2)
 #endif
+#ifdef CAN_USE_GCM_ASM_SSE
+#define	IMPL_SSE4_1	(UINT32_MAX-3)
+#endif
+/* TODO: add AVX2, VAES */
+
 #define	GCM_IMPL_READ(i) (*(volatile uint32_t *) &(i))
 static uint32_t icp_gcm_impl = IMPL_FASTEST;
 static uint32_t user_sel_impl = IMPL_FASTEST;
@@ -55,30 +66,269 @@ static inline int gcm_init_ctx_impl(boolean_t, gcm_ctx_t *, char *, size_t,
     void (*)(uint8_t *, uint8_t *),
     void (*)(uint8_t *, uint8_t *));
 
+/* TODO: move below to seperate header (gcm_simd.h) ? */
 #ifdef CAN_USE_GCM_ASM
+#ifdef CAN_USE_GCM_ASM_AVX
 /* Does the architecture we run on support the MOVBE instruction? */
 boolean_t gcm_avx_can_use_movbe = B_FALSE;
+extern boolean_t ASMABI atomic_toggle_boolean_nv(volatile boolean_t *);
+#endif
 /*
- * Whether to use the optimized openssl gcm and ghash implementations.
- * Set to true if module parameter icp_gcm_impl == "avx".
+ * Which optimized gcm SIMD assembly implementations to use.
+ * Set to the SIMD implementation contained in icp_gcm_impl unless it's
+ * IMPL_CYCLE or IMPL_FASTEST. For IMPL_CYCLE we cycle through all available
+ * SIMD implementations on each call to gcm_init_ctx. For IMPL_FASTEST we set
+ * it to the fastest supported SIMD implementation. gcm_init__ctx() uses
+ * this to decide which SIMD implementation to use.
  */
-static boolean_t gcm_use_avx = B_FALSE;
-#define	GCM_IMPL_USE_AVX	(*(volatile boolean_t *)&gcm_use_avx)
+static gcm_simd_impl_t gcm_simd_impl = GSI_NONE;
+#define	GCM_SIMD_IMPL_READ	(*(volatile gcm_simd_impl_t *)&gcm_simd_impl)
+
+static inline void gcm_set_simd_impl(gcm_simd_impl_t);
+static inline gcm_simd_impl_t gcm_cycle_simd_impl(void);
+static inline size_t gcm_simd_get_htab_size(gcm_simd_impl_t);
+static inline int get_isalc_gcm_keylen_index(const gcm_ctx_t *ctx);
+static inline int get_isalc_gcm_impl_index(const gcm_ctx_t *ctx);
+
+/* TODO: move later */
+
+extern void ASMABI icp_isalc_gcm_precomp_128_sse(gcm_ctx_t *ctx);
+extern void ASMABI icp_isalc_gcm_precomp_192_sse(gcm_ctx_t *ctx);
+extern void ASMABI icp_isalc_gcm_precomp_256_sse(gcm_ctx_t *ctx);
+typedef void ASMABI (*isalc_gcm_precomp_fp)(gcm_ctx_t *);
+
+extern void ASMABI icp_isalc_gcm_init_128_sse(gcm_ctx_t *ctx, const uint8_t *iv,
+    const uint8_t *aad, uint64_t aad_len, uint64_t tag_len);
+extern void ASMABI icp_isalc_gcm_init_192_sse(gcm_ctx_t *ctx, const uint8_t *iv,
+    const uint8_t *aad, uint64_t aad_len, uint64_t tag_len);
+extern void ASMABI icp_isalc_gcm_init_256_sse(gcm_ctx_t *ctx, const uint8_t *iv,
+    const uint8_t *aad, uint64_t aad_len, uint64_t tag_len);
+typedef void ASMABI (*isalc_gcm_init_fp)(gcm_ctx_t *, const uint8_t *,
+    const uint8_t *, uint64_t, uint64_t);
+
+extern void ASMABI icp_isalc_gcm_enc_128_update_sse(gcm_ctx_t *ctx,
+    uint8_t *out, const uint8_t *in, uint64_t plaintext_len);
+extern void ASMABI icp_isalc_gcm_enc_192_update_sse(gcm_ctx_t *ctx,
+    uint8_t *out, const uint8_t *in, uint64_t plaintext_len);
+extern void ASMABI icp_isalc_gcm_enc_256_update_sse(gcm_ctx_t *ctx,
+    uint8_t *out, const uint8_t *in, uint64_t plaintext_len);
+typedef void ASMABI (*isalc_gcm_enc_update_fp)(gcm_ctx_t *, uint8_t *,
+    const uint8_t *, uint64_t);
+
+extern void ASMABI icp_isalc_gcm_dec_128_update_sse(gcm_ctx_t *ctx,
+    uint8_t *out, const uint8_t *in, uint64_t plaintext_len);
+extern void ASMABI icp_isalc_gcm_dec_192_update_sse(gcm_ctx_t *ctx,
+    uint8_t *out, const uint8_t *in, uint64_t plaintext_len);
+extern void ASMABI icp_isalc_gcm_dec_256_update_sse(gcm_ctx_t *ctx,
+    uint8_t *out, const uint8_t *in, uint64_t plaintext_len);
+typedef void ASMABI (*isalc_gcm_dec_update_fp)(gcm_ctx_t *, uint8_t *,
+    const uint8_t *, uint64_t);
+
+extern void ASMABI icp_isalc_gcm_enc_128_finalize_sse(gcm_ctx_t	*ctx);
+extern void ASMABI icp_isalc_gcm_enc_192_finalize_sse(gcm_ctx_t	*ctx);
+extern void ASMABI icp_isalc_gcm_enc_256_finalize_sse(gcm_ctx_t	*ctx);
+typedef void ASMABI (*isalc_gcm_enc_finalize_fp)(gcm_ctx_t *);
+
+extern void ASMABI icp_isalc_gcm_dec_128_finalize_sse(gcm_ctx_t	*ctx);
+extern void ASMABI icp_isalc_gcm_dec_192_finalize_sse(gcm_ctx_t	*ctx);
+extern void ASMABI icp_isalc_gcm_dec_256_finalize_sse(gcm_ctx_t	*ctx);
+typedef void ASMABI (*isalc_gcm_dec_finalize_fp)(gcm_ctx_t *);
+
+extern void ASMABI icp_isalc_gcm_enc_128_sse(gcm_ctx_t *ctx, uint8_t *out,
+    const uint8_t *in, uint64_t plaintext_len, const uint8_t *iv,
+    const uint8_t *aad, uint64_t aad_len, uint64_t tag_len);
+extern void ASMABI icp_isalc_gcm_enc_192_sse(gcm_ctx_t *ctx, uint8_t *out,
+    const uint8_t *in, uint64_t plaintext_len, const uint8_t *iv,
+    const uint8_t *aad, uint64_t aad_len, uint64_t tag_len);
+extern void ASMABI icp_isalc_gcm_enc_256_sse(gcm_ctx_t *ctx, uint8_t *out,
+    const uint8_t *in, uint64_t plaintext_len, const uint8_t *iv,
+    const uint8_t *aad, uint64_t aad_len, uint64_t tag_len);
+typedef void ASMABI (*isalc_gcm_enc_fp)(gcm_ctx_t *, uint8_t *, const uint8_t *,
+    uint64_t, const uint8_t *, const uint8_t *, uint64_t, uint64_t);
+
+extern void ASMABI icp_isalc_gcm_dec_128_sse(gcm_ctx_t *ctx, uint8_t *out,
+    const uint8_t *in, uint64_t plaintext_len, const uint8_t *iv,
+    const uint8_t *aad, uint64_t aad_len, uint64_t tag_len);
+extern void ASMABI icp_isalc_gcm_dec_192_sse(gcm_ctx_t *ctx, uint8_t *out,
+    const uint8_t *in, uint64_t plaintext_len, const uint8_t *iv,
+    const uint8_t *aad, uint64_t aad_len, uint64_t tag_len);
+extern void ASMABI icp_isalc_gcm_dec_256_sse(gcm_ctx_t *ctx, uint8_t *out,
+    const uint8_t *in, uint64_t plaintext_len, const uint8_t *iv,
+    const uint8_t *aad, uint64_t aad_len, uint64_t tag_len);
+typedef void ASMABI (*isalc_gcm_dec_fp)(gcm_ctx_t *, uint8_t *, const uint8_t *,
+    uint64_t, const uint8_t *, const uint8_t *, uint64_t, uint64_t);
+
+/* struct isalc_ops holds arrays for all isalc asm functions ... */
+typedef struct isalc_gcm_ops {
+	isalc_gcm_precomp_fp		igo_precomp[GSI_ISALC_NUM_IMPL][3];
+	isalc_gcm_init_fp		igo_init[GSI_ISALC_NUM_IMPL][3];
+	isalc_gcm_enc_update_fp		igo_enc_update[GSI_ISALC_NUM_IMPL][3];
+	isalc_gcm_dec_update_fp		igo_dec_update[GSI_ISALC_NUM_IMPL][3];
+	isalc_gcm_enc_finalize_fp	igo_enc_finalize[GSI_ISALC_NUM_IMPL][3];
+	isalc_gcm_dec_finalize_fp	igo_dec_finalize[GSI_ISALC_NUM_IMPL][3];
+	isalc_gcm_enc_fp		igo_enc[GSI_ISALC_NUM_IMPL][3];
+	isalc_gcm_dec_fp		igo_dec[GSI_ISALC_NUM_IMPL][3];
+} isalc_gcm_ops_t;
+
+static isalc_gcm_ops_t isalc_ops = {
+	.igo_precomp = {
+		[0][0] = icp_isalc_gcm_precomp_128_sse,
+		[0][1] = icp_isalc_gcm_precomp_192_sse,
+		[0][2] = icp_isalc_gcm_precomp_256_sse,
+		/* TODO: add [1][0..2] for AVX2 ... */
+	},
+	.igo_init = {
+		[0][0] = icp_isalc_gcm_init_128_sse,
+		[0][1] = icp_isalc_gcm_init_192_sse,
+		[0][2] = icp_isalc_gcm_init_256_sse,
+		/* TODO: add [1][0..2] for AVX2 ... */
+	},
+	.igo_enc_update = {
+		[0][0] = icp_isalc_gcm_enc_128_update_sse,
+		[0][1] = icp_isalc_gcm_enc_192_update_sse,
+		[0][2] = icp_isalc_gcm_enc_256_update_sse,
+		/* TODO: add [1][0..2] for AVX2 ... */
+	},
+	.igo_dec_update = {
+		[0][0] = icp_isalc_gcm_dec_128_update_sse,
+		[0][1] = icp_isalc_gcm_dec_192_update_sse,
+		[0][2] = icp_isalc_gcm_dec_256_update_sse,
+		/* TODO: add [1][0..2] for AVX2 ... */
+	},
+	.igo_enc_finalize = {
+		[0][0] = icp_isalc_gcm_enc_128_finalize_sse,
+		[0][1] = icp_isalc_gcm_enc_192_finalize_sse,
+		[0][2] = icp_isalc_gcm_enc_256_finalize_sse,
+		/* TODO: add [1][0..2] for AVX2 ... */
+	},
+	.igo_dec_finalize = {
+		[0][0] = icp_isalc_gcm_dec_128_finalize_sse,
+		[0][1] = icp_isalc_gcm_dec_192_finalize_sse,
+		[0][2] = icp_isalc_gcm_dec_256_finalize_sse,
+		/* TODO: add [1][0..2] for AVX2 ... */
+	},
+	.igo_enc = {
+		[0][0] = icp_isalc_gcm_enc_128_sse,
+		[0][1] = icp_isalc_gcm_enc_192_sse,
+		[0][2] = icp_isalc_gcm_enc_256_sse,
+		/* TODO: add [1][0..2] for AVX2 ... */
+	},
+	.igo_dec = {
+		[0][0] = icp_isalc_gcm_dec_128_sse,
+		[0][1] = icp_isalc_gcm_dec_192_sse,
+		[0][2] = icp_isalc_gcm_dec_256_sse,
+		/* TODO: add [1][0..2] for AVX2 ... */
+	}
+};
 
-extern boolean_t ASMABI atomic_toggle_boolean_nv(volatile boolean_t *);
+/*
+ * Return B_TRUE if impl is a isalc implementation.
+ */
+static inline boolean_t
+is_isalc_impl(gcm_simd_impl_t impl)
+{
+	int i = (int)impl;
 
-static inline boolean_t gcm_avx_will_work(void);
-static inline void gcm_set_avx(boolean_t);
-static inline boolean_t gcm_toggle_avx(void);
-static inline size_t gcm_simd_get_htab_size(boolean_t);
+	if (i >= GSI_ISALC_FIRST_IMPL && i <= GSI_ISALC_LAST_IMPL) {
+		return (B_TRUE);
+	} else {
+		return (B_FALSE);
+	}
+}
+
+/*
+ * Get the index into the isalc function pointer array for the different
+ * SIMD (SSE, AVX2, VAES) isalc implementations.
+ */
+static inline int
+get_isalc_gcm_impl_index(const gcm_ctx_t *ctx)
+{
+	gcm_simd_impl_t impl = ctx->gcm_simd_impl;
+	int index = (int)impl - GSI_ISALC_FIRST_IMPL;
+
+	ASSERT3S(index, >=, 0);
+	ASSERT3S(index, <, GSI_ISALC_NUM_IMPL);
+
+	return (index);
+}
+
+/*
+ * Get the index (0..2) into the isalc function pointer array for the GCM
+ * key length (128,192,256) the given ctx uses.
+ */
+static inline int
+get_isalc_gcm_keylen_index(const gcm_ctx_t *ctx)
+{
+	const void *keysched = ((aes_key_t *)ctx->gcm_keysched)->encr_ks.ks32;
+	int aes_rounds = ((aes_key_t *)keysched)->nr;
+	/* AES uses 10,12,14 rounds for AES-{128,192,256}. */
+	int index = (aes_rounds - 10) >> 1;
+
+	ASSERT3S(index, >=, 0);
+	ASSERT3S(index, <=, 2);
+
+	return (index);
+}
+
+static inline boolean_t gcm_sse_will_work(void);
+
+#ifdef DEBUG_GCM_ASM
+/*
+ * Call this in gcm_init_ctx before doing anything else. The shadowed ctx
+ * is stored in ctx->gcm_shadow_ctx.
+ */
+static  __attribute__((__noinline__)) gcm_ctx_t *
+gcm_duplicate_ctx(gcm_ctx_t *ctx)
+{
+	ASSERT3P(ctx->gcm_pt_buf, ==, NULL);
+	ASSERT3P(ctx->gcm_shadow_ctx, ==, NULL); /* No nested ctxs allowed. */
+
+	gcm_ctx_t *new_ctx;
+	size_t sz = sizeof (gcm_ctx_t);
+
+	if ((new_ctx = kmem_zalloc(sz, KM_SLEEP)) == NULL)
+		return (NULL);
+
+	(void) memcpy(new_ctx, ctx, sz);
+	new_ctx->gcm_simd_impl = DEBUG_GCM_ASM;
+	size_t htab_len = gcm_simd_get_htab_size(new_ctx->gcm_simd_impl);
+	if (htab_len == 0) {
+		kmem_free(new_ctx, sz);
+		return (NULL);
+	}
+	new_ctx->gcm_htab_len = htab_len;
+	new_ctx->gcm_Htable = kmem_alloc(htab_len, KM_SLEEP);
+	if (new_ctx->gcm_Htable == NULL) {
+		kmem_free(new_ctx, sz);
+		return (NULL);
+	}
+	new_ctx->gcm_is_shadow = B_TRUE;
 
-static int gcm_mode_encrypt_contiguous_blocks_avx(gcm_ctx_t *, char *, size_t,
-    crypto_data_t *, size_t);
+	ctx->gcm_shadow_ctx = new_ctx;
+	return (new_ctx);
+}
+#endif /* ifndef DEBUG_GCM_ASM */
+
+static inline void gcm_init_isalc(gcm_ctx_t *, const uint8_t *, size_t,
+    const uint8_t *, size_t);
+
+static inline int gcm_mode_encrypt_contiguous_blocks_isalc(gcm_ctx_t *,
+    const uint8_t *, size_t, crypto_data_t *);
+
+static inline int gcm_encrypt_final_isalc(gcm_ctx_t *, crypto_data_t *);
+static inline int gcm_decrypt_final_isalc(gcm_ctx_t *, crypto_data_t *);
+
+#ifdef CAN_USE_GCM_ASM_AVX
+static inline boolean_t gcm_avx_will_work(void);
+static int gcm_mode_encrypt_contiguous_blocks_avx(gcm_ctx_t *, const uint8_t *,
+    size_t, crypto_data_t *, size_t);
 
 static int gcm_encrypt_final_avx(gcm_ctx_t *, crypto_data_t *, size_t);
 static int gcm_decrypt_final_avx(gcm_ctx_t *, crypto_data_t *, size_t);
-static int gcm_init_avx(gcm_ctx_t *, const uint8_t *, size_t, const uint8_t *,
+static void gcm_init_avx(gcm_ctx_t *, const uint8_t *, size_t, const uint8_t *,
     size_t, size_t);
+
+#endif /* ifdef CAN_USE_GCM_ASM_AVX */
 #endif /* ifdef CAN_USE_GCM_ASM */
 
 /*
@@ -93,11 +343,19 @@ gcm_mode_encrypt_contiguous_blocks(gcm_ctx_t *ctx, char *data, size_t length,
     void (*xor_block)(uint8_t *, uint8_t *))
 {
 #ifdef CAN_USE_GCM_ASM
-	if (ctx->gcm_use_avx == B_TRUE)
+	if (is_isalc_impl(ctx->gcm_simd_impl) == B_TRUE)
+		return (gcm_mode_encrypt_contiguous_blocks_isalc(
+		    ctx, (const uint8_t *)data, length, out));
+
+#ifdef CAN_USE_GCM_ASM_AVX
+	if (ctx->gcm_simd_impl == GSI_OSSL_AVX)
 		return (gcm_mode_encrypt_contiguous_blocks_avx(
-		    ctx, data, length, out, block_size));
+		    ctx, (const uint8_t *)data, length, out, block_size));
 #endif
 
+	ASSERT3S(ctx->gcm_simd_impl, ==, GSI_NONE);
+#endif /* ifdef CAN_USE_GCM_ASM */
+
 	const gcm_impl_ops_t *gops;
 	size_t remainder = length;
 	size_t need = 0;
@@ -211,11 +469,19 @@ gcm_encrypt_final(gcm_ctx_t *ctx, crypto_data_t *out, size_t block_size,
     void (*xor_block)(uint8_t *, uint8_t *))
 {
 	(void) copy_block;
+
 #ifdef CAN_USE_GCM_ASM
-	if (ctx->gcm_use_avx == B_TRUE)
+	if (is_isalc_impl(ctx->gcm_simd_impl) == B_TRUE)
+		return (gcm_encrypt_final_isalc(ctx, out));
+
+#ifdef CAN_USE_GCM_ASM_AVX
+	if (ctx->gcm_simd_impl == GSI_OSSL_AVX)
 		return (gcm_encrypt_final_avx(ctx, out, block_size));
 #endif
 
+	ASSERT3S(ctx->gcm_simd_impl, ==, GSI_NONE);
+#endif /* ifdef CAN_USE_GCM_ASM */
+
 	const gcm_impl_ops_t *gops;
 	uint64_t counter_mask = ntohll(0x00000000ffffffffULL);
 	uint8_t *ghash, *macp = NULL;
@@ -367,8 +633,8 @@ gcm_mode_decrypt_contiguous_blocks(gcm_ctx_t *ctx, char *data, size_t length,
 		    length);
 		ctx->gcm_processed_data_len += length;
 	}
-
 	ctx->gcm_remainder_len = 0;
+
 	return (CRYPTO_SUCCESS);
 }
 
@@ -378,10 +644,17 @@ gcm_decrypt_final(gcm_ctx_t *ctx, crypto_data_t *out, size_t block_size,
     void (*xor_block)(uint8_t *, uint8_t *))
 {
 #ifdef CAN_USE_GCM_ASM
-	if (ctx->gcm_use_avx == B_TRUE)
+	if (is_isalc_impl(ctx->gcm_simd_impl) == B_TRUE)
+		return (gcm_decrypt_final_isalc(ctx, out));
+
+#ifdef CAN_USE_GCM_ASM_AVX
+	if (ctx->gcm_simd_impl == GSI_OSSL_AVX)
 		return (gcm_decrypt_final_avx(ctx, out, block_size));
 #endif
 
+	ASSERT3S(ctx->gcm_simd_impl, ==, GSI_NONE);
+#endif /* ifdef CAN_USE_GCM_ASM */
+
 	const gcm_impl_ops_t *gops;
 	size_t pt_len;
 	size_t remainder;
@@ -622,6 +895,7 @@ gmac_init_ctx(gcm_ctx_t *gcm_ctx, char *param, size_t block_size,
  * Init the GCM context struct. Handle the cycle and avx implementations here.
  * Initialization of a GMAC context differs slightly from a GCM context.
  */
+/* XXXX: inline __attribute__((__always_inline__) ??? */
 static inline int
 gcm_init_ctx_impl(boolean_t gmac_mode, gcm_ctx_t *gcm_ctx, char *param,
     size_t block_size, int (*encrypt_block)(const void *, const uint8_t *,
@@ -629,6 +903,7 @@ gcm_init_ctx_impl(boolean_t gmac_mode, gcm_ctx_t *gcm_ctx, char *param,
     void (*xor_block)(uint8_t *, uint8_t *))
 {
 	CK_AES_GCM_PARAMS *gcm_param;
+	boolean_t can_use_isalc = B_TRUE;
 	int rv = CRYPTO_SUCCESS;
 	size_t tag_len, iv_len;
 
@@ -640,23 +915,32 @@ gcm_init_ctx_impl(boolean_t gmac_mode, gcm_ctx_t *gcm_ctx, char *param,
 			if ((rv = gcm_validate_args(gcm_param)) != 0) {
 				return (rv);
 			}
+			/* XXXX: redundant? already done in gcm_alloc_ctx */
 			gcm_ctx->gcm_flags |= GCM_MODE;
-
+			/*
+			 * The isalc implementations do not support a IV lenght
+			 * other than 12 bytes and only 8, 12 and 16 bytes tag
+			 * length.
+			 */
 			size_t tbits = gcm_param->ulTagBits;
+			if (gcm_param->ulIvLen != 12 ||
+			    (tbits != 64 && tbits != 96 && tbits != 128)) {
+				can_use_isalc = B_FALSE;
+			}
 			tag_len = CRYPTO_BITS2BYTES(tbits);
 			iv_len = gcm_param->ulIvLen;
 		} else {
 			/* GMAC mode. */
+			ASSERT3U(AES_GMAC_TAG_BITS, ==, 128);
+			ASSERT3U(AES_GMAC_IV_LEN, ==, 12);
+
+			/* XXXX: redundant? already done in gmac_alloc_ctx */
 			gcm_ctx->gcm_flags |= GMAC_MODE;
 			tag_len = CRYPTO_BITS2BYTES(AES_GMAC_TAG_BITS);
 			iv_len = AES_GMAC_IV_LEN;
 		}
-		gcm_ctx->gcm_tag_len = tag_len;
 		gcm_ctx->gcm_processed_data_len = 0;
-
-		/* these values are in bits */
-		gcm_ctx->gcm_len_a_len_c[0]
-		    = htonll(CRYPTO_BYTES2BITS(gcm_param->ulAADLen));
+		gcm_ctx->gcm_tag_len = tag_len;
 	} else {
 		return (CRYPTO_MECHANISM_PARAM_INVALID);
 	}
@@ -670,40 +954,46 @@ gcm_init_ctx_impl(boolean_t gmac_mode, gcm_ctx_t *gcm_ctx, char *param,
 	    ((aes_key_t *)gcm_ctx->gcm_keysched)->ops->needs_byteswap;
 
 	if (GCM_IMPL_READ(icp_gcm_impl) != IMPL_CYCLE) {
-		gcm_ctx->gcm_use_avx = GCM_IMPL_USE_AVX;
+		gcm_ctx->gcm_simd_impl = GCM_SIMD_IMPL_READ;
 	} else {
 		/*
-		 * Handle the "cycle" implementation by creating avx and
-		 * non-avx contexts alternately.
+		 * Handle the "cycle" implementation by cycling through all
+		 * supported SIMD implementation. This can only be done once
+		 *  per context since they differ in requirements.
 		 */
-		gcm_ctx->gcm_use_avx = gcm_toggle_avx();
-
-		/* The avx impl. doesn't handle byte swapped key schedules. */
-		if (gcm_ctx->gcm_use_avx == B_TRUE && needs_bswap == B_TRUE) {
-			gcm_ctx->gcm_use_avx = B_FALSE;
+		gcm_ctx->gcm_simd_impl = gcm_cycle_simd_impl();
+		/*
+		 * We don't handle byte swapped key schedules in the SIMD
+		 * code paths.
+		 */
+		aes_key_t *ks = (aes_key_t *)gcm_ctx->gcm_keysched;
+		if (ks->ops->needs_byteswap == B_TRUE) {
+			gcm_ctx->gcm_simd_impl = GSI_NONE;
 		}
+#ifdef CAN_USE_GCM_ASM_AVX
 		/*
 		 * If this is a GCM context, use the MOVBE and the BSWAP
 		 * variants alternately. GMAC contexts code paths do not
 		 * use the MOVBE instruction.
 		 */
-		if (gcm_ctx->gcm_use_avx == B_TRUE && gmac_mode == B_FALSE &&
-		    zfs_movbe_available() == B_TRUE) {
+		if (gcm_ctx->gcm_simd_impl == GSI_OSSL_AVX &&
+		    gmac_mode == B_FALSE && zfs_movbe_available() == B_TRUE) {
 			(void) atomic_toggle_boolean_nv(
 			    (volatile boolean_t *)&gcm_avx_can_use_movbe);
 		}
+#endif
 	}
 	/*
-	 * We don't handle byte swapped key schedules in the avx code path,
+	 * We don't handle byte swapped key schedules in the SIMD code paths,
 	 * still they could be created by the aes generic implementation.
 	 * Make sure not to use them since we'll corrupt data if we do.
 	 */
-	if (gcm_ctx->gcm_use_avx == B_TRUE && needs_bswap == B_TRUE) {
-		gcm_ctx->gcm_use_avx = B_FALSE;
+	if (gcm_ctx->gcm_simd_impl != GSI_NONE && needs_bswap == B_TRUE) {
+		gcm_ctx->gcm_simd_impl = GSI_NONE;
 
 		cmn_err_once(CE_WARN,
 		    "ICP: Can't use the aes generic or cycle implementations "
-		    "in combination with the gcm avx implementation!");
+		    "in combination with the gcm SIMD implementations!");
 		cmn_err_once(CE_WARN,
 		    "ICP: Falling back to a compatible implementation, "
 		    "aes-gcm performance will likely be degraded.");
@@ -711,10 +1001,17 @@ gcm_init_ctx_impl(boolean_t gmac_mode, gcm_ctx_t *gcm_ctx, char *param,
 		    "ICP: Choose at least the x86_64 aes implementation to "
 		    "restore performance.");
 	}
-
+	/*
+	 * Only use isalc if the given IV and tag lengths match what we support.
+	 * This will almost always be the case.
+	 */
+	if (can_use_isalc == B_FALSE && is_isalc_impl(gcm_ctx->gcm_simd_impl)) {
+		gcm_ctx->gcm_simd_impl = GSI_NONE;
+	}
 	/* Allocate Htab memory as needed. */
-	if (gcm_ctx->gcm_use_avx == B_TRUE) {
-		size_t htab_len = gcm_simd_get_htab_size(gcm_ctx->gcm_use_avx);
+	if (gcm_ctx->gcm_simd_impl != GSI_NONE) {
+		size_t htab_len =
+		    gcm_simd_get_htab_size(gcm_ctx->gcm_simd_impl);
 
 		if (htab_len == 0) {
 			return (CRYPTO_MECHANISM_PARAM_INVALID);
@@ -727,20 +1024,31 @@ gcm_init_ctx_impl(boolean_t gmac_mode, gcm_ctx_t *gcm_ctx, char *param,
 			return (CRYPTO_HOST_MEMORY);
 		}
 	}
-	/* Avx and non avx context initialization differs from here on. */
-	if (gcm_ctx->gcm_use_avx == B_FALSE) {
+	/* Avx and non avx context initialization differ from here on. */
+	if (gcm_ctx->gcm_simd_impl == GSI_NONE) {
 #endif /* ifdef CAN_USE_GCM_ASM */
+		/* these values are in bits */
+		gcm_ctx->gcm_len_a_len_c[0] =
+		    htonll(CRYPTO_BYTES2BITS(aad_len));
+
 		if (gcm_init(gcm_ctx, iv, iv_len, aad, aad_len, block_size,
 		    encrypt_block, copy_block, xor_block) != CRYPTO_SUCCESS) {
 			rv = CRYPTO_MECHANISM_PARAM_INVALID;
 		}
 #ifdef CAN_USE_GCM_ASM
-	} else {
-		if (gcm_init_avx(gcm_ctx, iv, iv_len, aad, aad_len,
-		    block_size) != CRYPTO_SUCCESS) {
-			rv = CRYPTO_MECHANISM_PARAM_INVALID;
-		}
 	}
+	if (is_isalc_impl(gcm_ctx->gcm_simd_impl) == B_TRUE) {
+		gcm_init_isalc(gcm_ctx, iv, iv_len, aad, aad_len);
+	}
+#ifdef CAN_USE_GCM_ASM_AVX
+	if (gcm_ctx->gcm_simd_impl == GSI_OSSL_AVX) {
+		/* these values are in bits */
+		gcm_ctx->gcm_len_a_len_c[0] =
+		    htonll(CRYPTO_BYTES2BITS(aad_len));
+
+		gcm_init_avx(gcm_ctx, iv, iv_len, aad, aad_len, block_size);
+	}
+#endif /* ifdef CAN_USE_GCM_ASM_AVX */
 #endif /* ifdef CAN_USE_GCM_ASM */
 
 	return (rv);
@@ -876,21 +1184,34 @@ gcm_impl_init(void)
 	strlcpy(gcm_fastest_impl.name, "fastest", GCM_IMPL_NAME_MAX);
 
 #ifdef CAN_USE_GCM_ASM
+	/* Statically select the fastest SIMD implementation: (AVX > SSE). */
+	/* TODO: Use a benchmark like other SIMD implementations do. */
+	gcm_simd_impl_t fastest_simd = GSI_NONE;
+
+	if (gcm_sse_will_work()) {
+		fastest_simd = GSI_ISALC_SSE;
+	}
+
+#ifdef CAN_USE_GCM_ASM_AVX
 	/*
 	 * Use the avx implementation if it's available and the implementation
 	 * hasn't changed from its default value of fastest on module load.
 	 */
 	if (gcm_avx_will_work()) {
+		fastest_simd = GSI_OSSL_AVX;
 #ifdef HAVE_MOVBE
 		if (zfs_movbe_available() == B_TRUE) {
 			atomic_swap_32(&gcm_avx_can_use_movbe, B_TRUE);
 		}
-#endif
-		if (GCM_IMPL_READ(user_sel_impl) == IMPL_FASTEST) {
-			gcm_set_avx(B_TRUE);
-		}
+#endif /* ifdef HAVE_MOVBE */
 	}
-#endif
+#endif /* CAN_USE_GCM_ASM_AVX */
+
+	if (GCM_IMPL_READ(user_sel_impl) == IMPL_FASTEST) {
+		gcm_set_simd_impl(fastest_simd);
+	}
+#endif /* ifdef CAN_USE_GCM_ASM */
+
 	/* Finish initialization */
 	atomic_swap_32(&icp_gcm_impl, user_sel_impl);
 	gcm_impl_initialized = B_TRUE;
@@ -902,9 +1223,12 @@ static const struct {
 } gcm_impl_opts[] = {
 		{ "cycle",	IMPL_CYCLE },
 		{ "fastest",	IMPL_FASTEST },
-#ifdef CAN_USE_GCM_ASM
+#ifdef CAN_USE_GCM_ASM_AVX
 		{ "avx",	IMPL_AVX },
 #endif
+#ifdef CAN_USE_GCM_ASM
+		{ "sse4_1",	IMPL_SSE4_1 },
+#endif
 };
 
 /*
@@ -934,16 +1258,24 @@ gcm_impl_set(const char *val)
 	strlcpy(req_name, val, GCM_IMPL_NAME_MAX);
 	while (i > 0 && isspace(req_name[i-1]))
 		i--;
+
 	req_name[i] = '\0';
 
 	/* Check mandatory options */
 	for (i = 0; i < ARRAY_SIZE(gcm_impl_opts); i++) {
 #ifdef CAN_USE_GCM_ASM
+		/* Ignore sse implementation if it won't work. */
+		if (gcm_impl_opts[i].sel == IMPL_SSE4_1 &&
+		    !gcm_sse_will_work()) {
+			continue;
+		}
+#ifdef CAN_USE_GCM_ASM_AVX
 		/* Ignore avx implementation if it won't work. */
 		if (gcm_impl_opts[i].sel == IMPL_AVX && !gcm_avx_will_work()) {
 			continue;
 		}
-#endif
+#endif /* ifdef CAN_USE_GCM_ASM_AVX */
+#endif /* ifdef CAN_USE_GCM_ASM */
 		if (strcmp(req_name, gcm_impl_opts[i].name) == 0) {
 			impl = gcm_impl_opts[i].sel;
 			err = 0;
@@ -964,16 +1296,23 @@ gcm_impl_set(const char *val)
 	}
 #ifdef CAN_USE_GCM_ASM
 	/*
-	 * Use the avx implementation if available and the requested one is
-	 * avx or fastest.
+	 * Use the requested SIMD implementation if available.
+	 * If the requested one is fastest, use the fastest SIMD impl.
 	 */
+	gcm_simd_impl_t simd_impl = GSI_NONE;
+
+	if (gcm_sse_will_work() == B_TRUE &&
+	    (impl == IMPL_SSE4_1 || impl == IMPL_FASTEST)) {
+		simd_impl = GSI_ISALC_SSE;
+	}
+#ifdef CAN_USE_GCM_ASM_AVX
 	if (gcm_avx_will_work() == B_TRUE &&
 	    (impl == IMPL_AVX || impl == IMPL_FASTEST)) {
-		gcm_set_avx(B_TRUE);
-	} else {
-		gcm_set_avx(B_FALSE);
+		simd_impl = GSI_OSSL_AVX;
 	}
-#endif
+#endif /* ifdef CAN_USE_GCM_ASM_AVX */
+	gcm_set_simd_impl(simd_impl);
+#endif /* ifdef CAN_USE_GCM_ASM */
 
 	if (err == 0) {
 		if (gcm_impl_initialized)
@@ -1005,11 +1344,17 @@ icp_gcm_impl_get(char *buffer, zfs_kernel_param_t *kp)
 	/* list mandatory options */
 	for (i = 0; i < ARRAY_SIZE(gcm_impl_opts); i++) {
 #ifdef CAN_USE_GCM_ASM
+		if (gcm_impl_opts[i].sel == IMPL_SSE4_1 &&
+		    !gcm_sse_will_work()) {
+			continue;
+		}
+#ifdef CAN_USE_GCM_ASM_AVX
 		/* Ignore avx implementation if it won't work. */
 		if (gcm_impl_opts[i].sel == IMPL_AVX && !gcm_avx_will_work()) {
 			continue;
 		}
-#endif
+#endif /* ifdef CAN_USE_GCM_ASM_AVX */
+#endif /* ifdef CAN_USE_GCM_ASM */
 		fmt = (impl == gcm_impl_opts[i].sel) ? "[%s] " : "%s ";
 		cnt += kmem_scnprintf(buffer + cnt, PAGE_SIZE - cnt, fmt,
 		    gcm_impl_opts[i].name);
@@ -1028,10 +1373,122 @@ icp_gcm_impl_get(char *buffer, zfs_kernel_param_t *kp)
 module_param_call(icp_gcm_impl, icp_gcm_impl_set, icp_gcm_impl_get,
     NULL, 0644);
 MODULE_PARM_DESC(icp_gcm_impl, "Select gcm implementation.");
-#endif /* defined(__KERNEL) */
+#endif /* defined(__KERNEL) && defined(__linux__) */
+
 
 #ifdef CAN_USE_GCM_ASM
+
+static inline boolean_t
+gcm_sse_will_work(void)
+{
+	/* Avx should imply aes-ni and pclmulqdq, but make sure anyhow. */
+	return (kfpu_allowed() &&
+	    zfs_sse4_1_available() && zfs_aes_available() &&
+	    zfs_pclmulqdq_available());
+}
+
+static inline size_t
+gcm_simd_get_htab_size(gcm_simd_impl_t simd_mode)
+{
+	switch (simd_mode) {
+	case GSI_NONE:
+		return (0);
+		break;
+	case GSI_OSSL_AVX:
+		return (2 * 6 * 2 * sizeof (uint64_t));
+		break;
+	case GSI_ISALC_SSE:
+		return (2 * 8 * 2 * sizeof (uint64_t));
+		break;
+	default:
+#ifdef _KERNEL
+		cmn_err(CE_WARN, "Undefined simd_mode %d!", (int)simd_mode);
+#endif
+		return (0);
+	}
+}
+
+/* TODO: it's an enum now: adapt */
+static inline void
+gcm_set_simd_impl(gcm_simd_impl_t val)
+{
+	atomic_swap_32(&gcm_simd_impl, val);
+}
+
+/*
+ * Cycle through all supported SIMD implementations, used by IMPL_CYCLE.
+ * The cycle must be done atomically since multiple threads may try to do it
+ * concurrently. So we do a atomic compare and swap for each possible value,
+ * trying n_tries times to cycle the value.
+ *
+ * Please note that since higher level SIMD instruction sets include the lower
+ * level ones, the code for newer ones must be placed at the top of this
+ * function.
+ */
+static inline gcm_simd_impl_t
+gcm_cycle_simd_impl(void)
+{
+	int n_tries = 10;
+
+	/* TODO: Add here vaes and avx2 with vaes beeing top most */
+
+#ifdef CAN_USE_GCM_ASM_AVX
+	if (gcm_avx_will_work() == B_TRUE) {
+		for (int i = 0; i < n_tries; ++i) {
+			if (atomic_cas_32(&GCM_SIMD_IMPL_READ,
+			    GSI_NONE, GSI_ISALC_SSE) == GSI_NONE)
+				return (GSI_ISALC_SSE);
+
+			if (atomic_cas_32(&GCM_SIMD_IMPL_READ,
+			    GSI_ISALC_SSE, GSI_OSSL_AVX) == GSI_ISALC_SSE)
+				return (GSI_OSSL_AVX);
+
+			if (atomic_cas_32(&GCM_SIMD_IMPL_READ,
+			    GSI_OSSL_AVX, GSI_NONE) == GSI_OSSL_AVX)
+				return (GSI_NONE);
+		}
+		/* We failed to cycle, return current value. */
+		return (GCM_SIMD_IMPL_READ);
+	}
+#endif
+#ifdef CAN_USE_GCM_ASM_SSE
+	if (gcm_sse_will_work() == B_TRUE) {
+		for (int i = 0; i < n_tries; ++i) {
+			if (atomic_cas_32(&GCM_SIMD_IMPL_READ,
+			    GSI_NONE, GSI_ISALC_SSE) == GSI_NONE)
+				return (GSI_ISALC_SSE);
+
+			if (atomic_cas_32(&GCM_SIMD_IMPL_READ,
+			    GSI_ISALC_SSE, GSI_NONE) == GSI_ISALC_SSE)
+				return (GSI_NONE);
+
+		}
+		/* We failed to cycle, return current value. */
+		return (GCM_SIMD_IMPL_READ);
+	}
+#endif
+	/* No supported SIMD implementations. */
+	return (GSI_NONE);
+}
+
+#define	GCM_ISALC_MIN_CHUNK_SIZE 1024		/* 64 16 byte blocks */
+#define	GCM_ISALC_MAX_CHUNK_SIZE 1024*1024	/* XXXXXX */
+/* Get the chunk size module parameter. */
+#define	GCM_ISALC_CHUNK_SIZE_READ *(volatile uint32_t *) &gcm_isalc_chunk_size
+
+/*
+ * Module parameter: number of bytes to process at once while owning the FPU.
+ * Rounded down to the next multiple of 512 bytes and ensured to be greater
+ * or equal to GCM_ISALC_MIN_CHUNK_SIZE and less or equal to
+ * GCM_ISALC_MAX_CHUNK_SIZE. It defaults to 32 kiB.
+ */
+static uint32_t gcm_isalc_chunk_size = 32 * 1024;
+
+
+
+#ifdef CAN_USE_GCM_ASM_AVX
 #define	GCM_BLOCK_LEN 16
+
 /*
  * The openssl asm routines are 6x aggregated and need that many bytes
  * at minimum.
@@ -1054,7 +1511,7 @@ MODULE_PARM_DESC(icp_gcm_impl, "Select gcm implementation.");
 #define	gcm_incr_counter_block(ctx) gcm_incr_counter_block_by(ctx, 1)
 
 /* Get the chunk size module parameter. */
-#define	GCM_CHUNK_SIZE_READ *(volatile uint32_t *) &gcm_avx_chunk_size
+#define	GCM_AVX_CHUNK_SIZE_READ *(volatile uint32_t *) &gcm_avx_chunk_size
 
 /*
  * Module parameter: number of bytes to process at once while owning the FPU.
@@ -1079,6 +1536,15 @@ extern size_t ASMABI aesni_gcm_encrypt(const uint8_t *, uint8_t *, size_t,
 extern size_t ASMABI aesni_gcm_decrypt(const uint8_t *, uint8_t *, size_t,
     const void *, uint64_t *, uint64_t *);
 
+
+/* XXXX: DEBUG: don't disable preemption while debugging */
+#if 0
+#undef	kfpu_begin
+#undef	kfpu_end
+#define	kfpu_begin()
+#define	kfpu_end()
+#endif
+
 static inline boolean_t
 gcm_avx_will_work(void)
 {
@@ -1088,37 +1554,6 @@ gcm_avx_will_work(void)
 	    zfs_pclmulqdq_available());
 }
 
-static inline void
-gcm_set_avx(boolean_t val)
-{
-	if (gcm_avx_will_work() == B_TRUE) {
-		atomic_swap_32(&gcm_use_avx, val);
-	}
-}
-
-static inline boolean_t
-gcm_toggle_avx(void)
-{
-	if (gcm_avx_will_work() == B_TRUE) {
-		return (atomic_toggle_boolean_nv(&GCM_IMPL_USE_AVX));
-	} else {
-		return (B_FALSE);
-	}
-}
-
-static inline size_t
-gcm_simd_get_htab_size(boolean_t simd_mode)
-{
-	switch (simd_mode) {
-	case B_TRUE:
-		return (2 * 6 * 2 * sizeof (uint64_t));
-
-	default:
-		return (0);
-	}
-}
-
-
 /* Increment the GCM counter block by n. */
 static inline void
 gcm_incr_counter_block_by(gcm_ctx_t *ctx, int n)
@@ -1137,14 +1572,14 @@ gcm_incr_counter_block_by(gcm_ctx_t *ctx, int n)
  * if possible. While processing a chunk the FPU is "locked".
  */
 static int
-gcm_mode_encrypt_contiguous_blocks_avx(gcm_ctx_t *ctx, char *data,
+gcm_mode_encrypt_contiguous_blocks_avx(gcm_ctx_t *ctx, const uint8_t *data,
     size_t length, crypto_data_t *out, size_t block_size)
 {
 	size_t bleft = length;
 	size_t need = 0;
 	size_t done = 0;
 	uint8_t *datap = (uint8_t *)data;
-	size_t chunk_size = (size_t)GCM_CHUNK_SIZE_READ;
+	size_t chunk_size = (size_t)GCM_AVX_CHUNK_SIZE_READ;
 	const aes_key_t *key = ((aes_key_t *)ctx->gcm_keysched);
 	uint64_t *ghash = ctx->gcm_ghash;
 	uint64_t *cb = ctx->gcm_cb;
@@ -1276,6 +1711,36 @@ gcm_mode_encrypt_contiguous_blocks_avx(gcm_ctx_t *ctx, char *data,
 out:
 	clear_fpu_regs();
 	kfpu_end();
+
+#ifdef DEBUG_GCM_ASM
+	if (ctx->gcm_shadow_ctx != NULL) {
+		gcm_ctx_t *sc = ctx->gcm_shadow_ctx;
+
+		(void) gcm_mode_encrypt_contiguous_blocks_isalc(
+		    sc, data, length, NULL);
+
+		if (ctx->gcm_remainder_len != sc->gcm_remainder_len) {
+			cmn_err(CE_WARN,
+			    "AVX vs SSE: encrypt: remainder_len differs!");
+		}
+		/*
+		 * Handling of partial GCM blocks differ between AVX and SSE,
+		 * so the tags will not match in this case.
+		 */
+		if (ctx->gcm_remainder_len == 0) {
+			/* Byte swap the SSE tag, it is in host byte order. */
+			uint64_t shadow_ghash[2];
+			shadow_ghash[0] = htonll(sc->gcm_ghash[1]);
+			shadow_ghash[1] = htonll(sc->gcm_ghash[0]);
+
+			if (memcmp(ghash, shadow_ghash, ctx->gcm_tag_len)) {
+				cmn_err(CE_WARN,
+				    "AVX vs SSE: encrypt: tags differ!");
+			}
+		}
+	}
+#endif
+
 out_nofpu:
 	if (ct_buf != NULL) {
 		vmem_free(ct_buf, chunk_size);
@@ -1331,6 +1796,15 @@ gcm_encrypt_final_avx(gcm_ctx_t *ctx, crypto_data_t *out, size_t block_size)
 	clear_fpu_regs();
 	kfpu_end();
 
+#ifdef DEBUG_GCM_ASM
+	if (ctx->gcm_shadow_ctx != NULL) {
+		(void) gcm_encrypt_final_isalc(ctx->gcm_shadow_ctx, NULL);
+		if (memcmp(ghash, ctx->gcm_shadow_ctx->gcm_ghash,
+		    ctx->gcm_tag_len)) {
+			cmn_err(CE_WARN, "AVX vs SSE: enc_final: tags differ!");
+		}
+	}
+#endif
 	/* Output remainder. */
 	if (rem_len > 0) {
 		rv = crypto_put_output_data(remainder, out, rem_len);
@@ -1359,7 +1833,35 @@ gcm_decrypt_final_avx(gcm_ctx_t *ctx, crypto_data_t *out, size_t block_size)
 	ASSERT3S(((aes_key_t *)ctx->gcm_keysched)->ops->needs_byteswap, ==,
 	    B_FALSE);
 
-	size_t chunk_size = (size_t)GCM_CHUNK_SIZE_READ;
+#ifdef DEBUG_GCM_ASM
+	/* Copy over the plaintext buf to the shadow context. */
+	if (ctx->gcm_shadow_ctx != NULL) {
+		gcm_ctx_t *sc = ctx->gcm_shadow_ctx;
+		size_t sc_buf_len = ctx->gcm_pt_buf_len;
+		uint8_t *sc_pt_buf = vmem_alloc(sc_buf_len, KM_SLEEP);
+
+		if (sc_pt_buf != NULL) {
+			memcpy(sc_pt_buf, ctx->gcm_pt_buf, sc_buf_len);
+			sc->gcm_pt_buf = sc_pt_buf;
+			sc->gcm_pt_buf_len = sc_buf_len;
+			sc->gcm_processed_data_len = sc_buf_len;
+			/* Not strictly needed, for completeness. */
+			sc->gcm_remainder_len = 0;
+		} else {
+			/*
+			 * Memory allocation failed, just drop this shadow
+			 * context and leave a note in the log.
+			 */
+			gcm_clear_ctx(sc);
+			kmem_free(sc, sizeof (gcm_ctx_t));
+			ctx->gcm_shadow_ctx = NULL;
+			cmn_err(CE_WARN,
+			    "Failed to alloc pt_buf for shadow context!");
+		}
+	}
+#endif /* DEBUG_GCM_ASM */
+
+	size_t chunk_size = (size_t)GCM_AVX_CHUNK_SIZE_READ;
 	size_t pt_len = ctx->gcm_processed_data_len - ctx->gcm_tag_len;
 	uint8_t *datap = ctx->gcm_pt_buf;
 	const aes_key_t *key = ((aes_key_t *)ctx->gcm_keysched);
@@ -1428,6 +1930,7 @@ gcm_decrypt_final_avx(gcm_ctx_t *ctx, crypto_data_t *out, size_t block_size)
 		datap += block_size;
 		bleft -= block_size;
 	}
+	/* TODO: Remove later, we don't set rv up to here. */
 	if (rv != CRYPTO_SUCCESS) {
 		clear_fpu_regs();
 		kfpu_end();
@@ -1445,6 +1948,21 @@ gcm_decrypt_final_avx(gcm_ctx_t *ctx, crypto_data_t *out, size_t block_size)
 	clear_fpu_regs();
 	kfpu_end();
 
+#ifdef DEBUG_GCM_ASM
+	if (ctx->gcm_shadow_ctx != NULL) {
+		(void) gcm_decrypt_final_isalc(ctx->gcm_shadow_ctx, NULL);
+		/* Ensure decrypted plaintext and tag are identical. */
+		if (memcmp(ctx->gcm_pt_buf, ctx->gcm_shadow_ctx->gcm_pt_buf,
+		    pt_len)) {
+			cmn_err(CE_WARN,
+			    "AVX vs SSE: decrypt: plaintexts differ!");
+		}
+		if (memcmp(ghash, ctx->gcm_shadow_ctx->gcm_ghash,
+		    ctx->gcm_tag_len)) {
+			cmn_err(CE_WARN, "AVX vs SSE: decrypt: tags differ!");
+		}
+	}
+#endif
 	/* Compare the input authentication tag with what we calculated. */
 	if (memcmp(&ctx->gcm_pt_buf[pt_len], ghash, ctx->gcm_tag_len)) {
 		/* They don't match. */
@@ -1462,7 +1980,7 @@ gcm_decrypt_final_avx(gcm_ctx_t *ctx, crypto_data_t *out, size_t block_size)
  * Initialize the GCM params H, Htabtle and the counter block. Save the
  * initial counter block.
  */
-static int
+static void
 gcm_init_avx(gcm_ctx_t *ctx, const uint8_t *iv, size_t iv_len,
     const uint8_t *auth_data, size_t auth_data_len, size_t block_size)
 {
@@ -1471,7 +1989,7 @@ gcm_init_avx(gcm_ctx_t *ctx, const uint8_t *iv, size_t iv_len,
 	const void *keysched = ((aes_key_t *)ctx->gcm_keysched)->encr_ks.ks32;
 	int aes_rounds = ((aes_key_t *)ctx->gcm_keysched)->nr;
 	const uint8_t *datap = auth_data;
-	size_t chunk_size = (size_t)GCM_CHUNK_SIZE_READ;
+	size_t chunk_size = (size_t)GCM_AVX_CHUNK_SIZE_READ;
 	size_t bleft;
 
 	ASSERT(block_size == GCM_BLOCK_LEN);
@@ -1539,10 +2057,291 @@ gcm_init_avx(gcm_ctx_t *ctx, const uint8_t *iv, size_t iv_len,
 	}
 	clear_fpu_regs();
 	kfpu_end();
+#ifdef DEBUG_GCM_ASM
+	if (gcm_duplicate_ctx(ctx) != NULL) {
+		gcm_init_isalc(ctx->gcm_shadow_ctx, iv, iv_len, auth_data,
+		    auth_data_len);
+
+		if (memcmp(ctx->gcm_J0, ctx->gcm_shadow_ctx->gcm_J0, 16)) {
+			cmn_err(CE_WARN, "AVX vs SSE: init: ICBs differ!");
+		}
+		if (memcmp(ctx->gcm_H, ctx->gcm_shadow_ctx->gcm_H, 16)) {
+			cmn_err(CE_WARN,
+			    "AVX vs SSE: init: hash keys differ!");
+		}
+	}
+#endif
+
+}
+#endif /* ifdef CAN_USE_GCM_ASM_AVX */
+
+/*
+ * Initialize the GCM params H, Htabtle and the counter block. Save the
+ * initial counter block.
+ *
+ */
+
+static inline void
+gcm_init_isalc(gcm_ctx_t *ctx, const uint8_t *iv, size_t iv_len,
+    const uint8_t *auth_data, size_t auth_data_len)
+{
+	/*
+	 * We know that iv_len must be 12 since that's the only iv_len isalc
+	 * supports, and we made sure it's 12 before calling here.
+	 */
+	ASSERT3U(iv_len, ==, 12UL);
+
+	const uint8_t *aad = auth_data;
+	size_t aad_len = auth_data_len;
+	size_t tag_len = ctx->gcm_tag_len;
+
+	int impl = get_isalc_gcm_impl_index((const gcm_ctx_t *)ctx);
+	int keylen = get_isalc_gcm_keylen_index((const gcm_ctx_t *)ctx);
+
+	kfpu_begin();
+	(*(isalc_ops.igo_precomp[impl][keylen]))(ctx);	/* Init H and Htab */
+	(*(isalc_ops.igo_init[impl][keylen]))(ctx, iv, aad, aad_len, tag_len);
+	kfpu_end();
+}
+
+
+/*
+ * Encrypt multiple blocks of data in GCM mode.
+ * This is done in gcm_isalc_chunk_size chunks, utilizing ported Intel(R)
+ * Intelligent Storage Acceleration Library Crypto Version SIMD assembler
+ * routines. While processing a chunk the FPU is "locked".
+ */
+static inline int
+gcm_mode_encrypt_contiguous_blocks_isalc(gcm_ctx_t *ctx, const uint8_t *data,
+    size_t length, crypto_data_t *out)
+{
+	size_t bleft = length;
+	size_t chunk_size = (size_t)GCM_ISALC_CHUNK_SIZE_READ;
+	uint8_t *ct_buf = NULL;
+	int ct_buf_size;
+
+	/*
+	 * XXXX: It may make sense to allocate a multiple of 'chunk_size'
+	 * up to 'length' to reduce the overhead of crypto_put_output_data()
+	 * and to keep the caches warm.
+	 */
+	/* Allocate a buffer to encrypt to. */
+	if (bleft >= chunk_size) {
+		ct_buf_size = chunk_size;
+	} else {
+		ct_buf_size = bleft;
+	}
+	ct_buf = vmem_alloc(ct_buf_size, KM_SLEEP);
+	if (ct_buf == NULL) {
+		return (CRYPTO_HOST_MEMORY);
+	}
+
+	/* Do the bulk encryption in chunk_size blocks. */
+	int impl = get_isalc_gcm_impl_index((const gcm_ctx_t *)ctx);
+	int keylen = get_isalc_gcm_keylen_index((const gcm_ctx_t *)ctx);
+	const uint8_t *datap = data;
+	int rv = CRYPTO_SUCCESS;
+
+	for (; bleft >= chunk_size; bleft -= chunk_size) {
+		kfpu_begin();
+		(*(isalc_ops.igo_enc_update[impl][keylen]))(
+		    ctx, ct_buf, datap, chunk_size);
+
+		kfpu_end();
+		datap += chunk_size;
+#ifdef DEBUG_GCM_ASM
+		if (ctx->gcm_is_shadow == B_TRUE) {
+			continue;
+		}
+#endif
+		rv = crypto_put_output_data(ct_buf, out, chunk_size);
+		if (rv != CRYPTO_SUCCESS) {
+			/* Indicate that we're done. */
+			bleft = 0;
+			break;
+		}
+		out->cd_offset += chunk_size;
+
+	}
+	/* Check if we are already done. */
+	if (bleft > 0) {
+		/* Bulk encrypt the remaining data. */
+		kfpu_begin();
+		(*(isalc_ops.igo_enc_update[impl][keylen]))(
+		    ctx, ct_buf, datap, bleft);
+
+		kfpu_end();
+
+#ifdef DEBUG_GCM_ASM
+		if (ctx->gcm_is_shadow == B_TRUE) {
+			if (ct_buf != NULL) {
+				vmem_free(ct_buf, ct_buf_size);
+			}
+			return (CRYPTO_SUCCESS);
+
+		}
+#endif
+		rv = crypto_put_output_data(ct_buf, out, bleft);
+		if (rv == CRYPTO_SUCCESS) {
+			out->cd_offset += bleft;
+		}
+	}
+	if (ct_buf != NULL) {
+		vmem_free(ct_buf, ct_buf_size);
+	}
+	return (rv);
+}
+
+/*
+ * XXXX: IIRC inplace ops have a performance penalty in isalc but I can't
+ * find it anymore
+ */
+/*
+ * Finalize decryption: We just have accumulated crypto text, so now we
+ * decrypt it here inplace.
+ */
+static inline int
+gcm_decrypt_final_isalc(gcm_ctx_t *ctx, crypto_data_t *out)
+{
+	ASSERT3U(ctx->gcm_processed_data_len, ==, ctx->gcm_pt_buf_len);
+
+	size_t chunk_size = (size_t)GCM_ISALC_CHUNK_SIZE_READ;
+	size_t pt_len = ctx->gcm_processed_data_len - ctx->gcm_tag_len;
+	uint8_t *datap = ctx->gcm_pt_buf;
+
+	/*
+	 * The isalc routines will increment ctx->gcm_processed_data_len
+	 * on decryption, so reset it.
+	 */
+	ctx->gcm_processed_data_len = 0;
+
+	int impl = get_isalc_gcm_impl_index((const gcm_ctx_t *)ctx);
+	int keylen = get_isalc_gcm_keylen_index((const gcm_ctx_t *)ctx);
+
+	/* Decrypt in chunks of gcm_avx_chunk_size. */
+	size_t bleft;
+	for (bleft = pt_len; bleft >= chunk_size; bleft -= chunk_size) {
+		kfpu_begin();
+		(*(isalc_ops.igo_dec_update[impl][keylen]))(
+		    ctx, datap, datap, chunk_size);
+		kfpu_end();
+		datap += chunk_size;
+	}
+	/*
+	 * Decrypt remainder, which is less than chunk size, in one go and
+	 * finish the tag. Since this won't consume much time, do it in a
+	 * single kfpu block. dec_update() will handle a zero bleft properly.
+	 */
+	kfpu_begin();
+	(*(isalc_ops.igo_dec_update[impl][keylen]))(ctx, datap, datap, bleft);
+	datap += bleft;
+	(*(isalc_ops.igo_dec_finalize[impl][keylen]))(ctx);
+	kfpu_end();
+
+	ASSERT3U(ctx->gcm_processed_data_len, ==, pt_len);
+
+	/*
+	 * Compare the input authentication tag with what we calculated.
+	 * datap points to the expected tag at the end of ctx->gcm_pt_buf.
+	 */
+	if (memcmp(datap, ctx->gcm_ghash, ctx->gcm_tag_len)) {
+		/* They don't match. */
+		return (CRYPTO_INVALID_MAC);
+	}
+#ifdef DEBUG_GCM_ASM
+	if (ctx->gcm_is_shadow == B_TRUE) {
+		return (CRYPTO_SUCCESS);
+	}
+#endif
+	int rv = crypto_put_output_data(ctx->gcm_pt_buf, out, pt_len);
+	if (rv != CRYPTO_SUCCESS) {
+		return (rv);
+	}
+	out->cd_offset += pt_len;
+	/* io/aes.c asserts this, so be nice and meet expectations. */
+	ctx->gcm_remainder_len = 0;
+
+	/* Sensitive data in the context is cleared on ctx destruction. */
+	return (CRYPTO_SUCCESS);
+}
+
+/*
+ * Finalize the encryption: We have already written out all encrypted data.
+ * We update the hash with the last incomplete block, calculate
+ * len(A) || len (C), encrypt gcm->gcm_J0 (initial counter block), calculate
+ * the tag and store it in gcm->ghash and finally output the tag.
+ */
+static inline int
+gcm_encrypt_final_isalc(gcm_ctx_t *ctx, crypto_data_t *out)
+{
+	uint64_t tag_len = ctx->gcm_tag_len;
+
+/* For security measures we pass NULL as the out pointer for shadow contexts. */
+#ifndef DEBUG_GCM_ASM
+	if (out->cd_length < tag_len) {
+		return (CRYPTO_DATA_LEN_RANGE);
+	}
+#endif
+
+	int impl = get_isalc_gcm_impl_index((const gcm_ctx_t *)ctx);
+	int keylen = get_isalc_gcm_keylen_index((const gcm_ctx_t *)ctx);
+
+	kfpu_begin();
+	(*(isalc_ops.igo_enc_finalize[impl][keylen]))(ctx);
+	kfpu_end();
+
+#ifdef DEBUG_GCM_ASM
+	if (ctx->gcm_is_shadow == B_TRUE) {
+		return (CRYPTO_SUCCESS);
+	}
+#endif
+
+	/* Write the tag out. */
+	uint8_t *ghash = (uint8_t *)ctx->gcm_ghash;
+	int rv = crypto_put_output_data(ghash, out, tag_len);
+
+	if (rv != CRYPTO_SUCCESS)
+		return (rv);
+
+	out->cd_offset += tag_len;
+	/* io/aes.c asserts this, so be nice and meet expectations. */
+	ctx->gcm_remainder_len = 0;
+
+	/* Sensitive data in the context is cleared on ctx destruction. */
 	return (CRYPTO_SUCCESS);
 }
 
 #if defined(_KERNEL)
+
+static int
+icp_gcm_isalc_set_chunk_size(const char *buf, zfs_kernel_param_t *kp)
+{
+	unsigned long val;
+	char val_rounded[16];
+	int error = 0;
+
+	error = kstrtoul(buf, 0, &val);
+	if (error)
+		return (error);
+
+	/* XXXX; introduce #def */
+	val = val & ~(512UL - 1UL);
+
+	if (val < GCM_ISALC_MIN_CHUNK_SIZE || val > GCM_ISALC_MAX_CHUNK_SIZE)
+		return (-EINVAL);
+
+	snprintf(val_rounded, 16, "%u", (uint32_t)val);
+	error = param_set_uint(val_rounded, kp);
+	return (error);
+}
+
+module_param_call(icp_gcm_isalc_chunk_size, icp_gcm_isalc_set_chunk_size,
+    param_get_uint, &gcm_isalc_chunk_size, 0644);
+
+MODULE_PARM_DESC(icp_gcm_isalc_chunk_size,
+	"The number of bytes the isalc routines process while owning the FPU");
+
+#ifdef CAN_USE_GCM_ASM_AVX
 static int
 icp_gcm_avx_set_chunk_size(const char *buf, zfs_kernel_param_t *kp)
 {
@@ -1568,7 +2367,8 @@ module_param_call(icp_gcm_avx_chunk_size, icp_gcm_avx_set_chunk_size,
     param_get_uint, &gcm_avx_chunk_size, 0644);
 
 MODULE_PARM_DESC(icp_gcm_avx_chunk_size,
-	"How many bytes to process while owning the FPU");
+	"The number of bytes the avx routines process while owning the FPU");
 
+#endif /* ifdef CAN_USE_GCM_ASM_AVX */
 #endif /* defined(__KERNEL) */
 #endif /* ifdef CAN_USE_GCM_ASM */
diff --git a/module/icp/algs/modes/modes.c b/module/icp/algs/modes/modes.c
index 6f6649b3b58b..31a19d2aa594 100644
--- a/module/icp/algs/modes/modes.c
+++ b/module/icp/algs/modes/modes.c
@@ -180,7 +180,7 @@ gcm_clear_ctx(gcm_ctx_t *ctx)
 	explicit_memset(ctx->gcm_remainder, 0, sizeof (ctx->gcm_remainder));
 	explicit_memset(ctx->gcm_H, 0, sizeof (ctx->gcm_H));
 #if defined(CAN_USE_GCM_ASM)
-	if (ctx->gcm_use_avx == B_TRUE) {
+	if (ctx->gcm_simd_impl != GSI_NONE) {
 		ASSERT3P(ctx->gcm_Htable, !=, NULL);
 		memset(ctx->gcm_Htable, 0, ctx->gcm_htab_len);
 		kmem_free(ctx->gcm_Htable, ctx->gcm_htab_len);
@@ -193,4 +193,20 @@ gcm_clear_ctx(gcm_ctx_t *ctx)
 	/* Optional */
 	explicit_memset(ctx->gcm_J0, 0, sizeof (ctx->gcm_J0));
 	explicit_memset(ctx->gcm_tmp, 0, sizeof (ctx->gcm_tmp));
+
+#ifdef DEBUG_GCM_ASM
+	if (ctx->gcm_shadow_ctx != NULL) {
+		/* No need to clear data while debugging, just free memory. */
+		gcm_ctx_t *sc = ctx->gcm_shadow_ctx;
+
+		if (sc->gcm_Htable != NULL) {
+			kmem_free(sc->gcm_Htable, sc->gcm_htab_len);
+		}
+		if (sc->gcm_pt_buf != NULL) {
+			vmem_free(sc->gcm_pt_buf, sc->gcm_pt_buf_len);
+		}
+		kmem_free(sc, sizeof (gcm_ctx_t));
+		ctx->gcm_shadow_ctx = NULL;
+	}
+#endif
 }
diff --git a/module/icp/asm-x86_64/modes/isalc_gcm128_sse.S b/module/icp/asm-x86_64/modes/isalc_gcm128_sse.S
index f552d8630073..0d924cf6428f 100644
--- a/module/icp/asm-x86_64/modes/isalc_gcm128_sse.S
+++ b/module/icp/asm-x86_64/modes/isalc_gcm128_sse.S
@@ -27,5 +27,10 @@
 //  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 //#######################################################################
 
+#if defined(__x86_64__) && defined(HAVE_SSE4_1) && defined(HAVE_AES) && \
+    defined(HAVE_PCLMULQDQ)
+
 #define GCM128_MODE 1
-#include "isalc_gcm_sse_att.S"
+#include "isalc_gcm_sse.S"
+
+#endif
diff --git a/module/icp/asm-x86_64/modes/isalc_gcm192_sse.S b/module/icp/asm-x86_64/modes/isalc_gcm192_sse.S
new file mode 100644
index 000000000000..851837a34dd5
--- /dev/null
+++ b/module/icp/asm-x86_64/modes/isalc_gcm192_sse.S
@@ -0,0 +1,36 @@
+//#######################################################################
+//  Copyright(c) 2011-2016 Intel Corporation All rights reserved.
+//
+//  Redistribution and use in source and binary forms, with or without
+//  modification, are permitted provided that the following conditions
+//  are met:
+//    * Redistributions of source code must retain the above copyright
+//      notice, this list of conditions and the following disclaimer.
+//    * Redistributions in binary form must reproduce the above copyright
+//      notice, this list of conditions and the following disclaimer in
+//      the documentation and/or other materials provided with the
+//      distribution.
+//    * Neither the name of Intel Corporation nor the names of its
+//      contributors may be used to endorse or promote products derived
+//      from this software without specific prior written permission.
+//
+//  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+//  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+//  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+//  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+//  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+//  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+//  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES# LOSS OF USE,
+//  DATA, OR PROFITS# OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+//  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+//  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+//  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//#######################################################################
+
+#if defined(__x86_64__) && defined(HAVE_SSE4_1) && defined(HAVE_AES) && \
+    defined(HAVE_PCLMULQDQ)
+
+#define GCM192_MODE 1
+#include "isalc_gcm_sse.S"
+
+#endif
diff --git a/module/icp/asm-x86_64/modes/isalc_gcm256_sse.S b/module/icp/asm-x86_64/modes/isalc_gcm256_sse.S
index c88cb0ed055f..75b99f664348 100644
--- a/module/icp/asm-x86_64/modes/isalc_gcm256_sse.S
+++ b/module/icp/asm-x86_64/modes/isalc_gcm256_sse.S
@@ -27,5 +27,10 @@
 //  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 //////////////////////////////////////////////////////////////////////////
 
+#if defined(__x86_64__) && defined(HAVE_SSE4_1) && defined(HAVE_AES) && \
+    defined(HAVE_PCLMULQDQ)
+
 #define GCM256_MODE 1
-#include "isalc_gcm_sse_att.S"
+#include "isalc_gcm_sse.S"
+
+#endif
diff --git a/module/icp/asm-x86_64/modes/isalc_gcm_defines.S b/module/icp/asm-x86_64/modes/isalc_gcm_defines.S
index 00ec4c654d9f..825f46a52dc6 100644
--- a/module/icp/asm-x86_64/modes/isalc_gcm_defines.S
+++ b/module/icp/asm-x86_64/modes/isalc_gcm_defines.S
@@ -36,10 +36,10 @@
 //       Vinodh Gopal
 //       James Guilford
 
+// Port to GNU as, translation to GNU as att-syntax and adoptions for the ICP
+// Copyright(c) 2023 Attila Fülöp <attila@fueloep.org>
 
-////////////
-
-.section .rodata
+SECTION_STATIC
 
 .balign 16
 POLY:		.quad     0x0000000000000001, 0xC200000000000000
@@ -181,76 +181,146 @@ mask_out_top_block:
 
 .section .text
 
+// #define	KEYSCHED_LEN (15 * GCM_BLOCKSIZE)
+// #define	AES_KEY_LEN (2 * KEYSCHED_LEN + 16 + 8 + 4 + 4)	// 512
 
-////define the fields of gcm_data struct
-//typedef struct gcm_data
-//{
-//        u8 expanded_keys[16*15]//
-//        u8 shifted_hkey_1[16]//  // store HashKey <<1 mod poly here
-//        u8 shifted_hkey_2[16]//  // store HashKey^2 <<1 mod poly here
-//        u8 shifted_hkey_3[16]//  // store HashKey^3 <<1 mod poly here
-//        u8 shifted_hkey_4[16]//  // store HashKey^4 <<1 mod poly here
-//        u8 shifted_hkey_5[16]//  // store HashKey^5 <<1 mod poly here
-//        u8 shifted_hkey_6[16]//  // store HashKey^6 <<1 mod poly here
-//        u8 shifted_hkey_7[16]//  // store HashKey^7 <<1 mod poly here
-//        u8 shifted_hkey_8[16]//  // store HashKey^8 <<1 mod poly here
-//        u8 shifted_hkey_1_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey <<1 mod poly here (for Karatsuba purposes)
-//        u8 shifted_hkey_2_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^2 <<1 mod poly here (for Karatsuba purposes)
-//        u8 shifted_hkey_3_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^3 <<1 mod poly here (for Karatsuba purposes)
-//        u8 shifted_hkey_4_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^4 <<1 mod poly here (for Karatsuba purposes)
-//        u8 shifted_hkey_5_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^5 <<1 mod poly here (for Karatsuba purposes)
-//        u8 shifted_hkey_6_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^6 <<1 mod poly here (for Karatsuba purposes)
-//        u8 shifted_hkey_7_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^7 <<1 mod poly here (for Karatsuba purposes)
-//        u8 shifted_hkey_8_k[16]//  // store XOR of High 64 bits and Low 64 bits of  HashKey^8 <<1 mod poly here (for Karatsuba purposes)
-//} gcm_data//
+// Offsets into struct gcm_ctx:
+//
+// typedef struct gcm_ctx {
+//	void *gcm_keysched;		OFFSET:   0	=   0
+//	size_t gcm_keysched_len;	OFFSET:	  1*8	=   8
+//	uint64_t gcm_cb[2];		OFFSET:   2*8	=  16
+//	uint64_t gcm_remainder[2];	OFFSET:   4*8	=  32
+//	size_t gcm_remainder_len;	OFFSET:   6*8	=  48
+//	uint8_t *gcm_lastp;		OFFSET:   7*8	=  56
+//	uint8_t *gcm_copy_to;		OFFSET:   8*8	=  64
+//	uint32_t gcm_flags;		OFFSET:   9*8	=  72
+//	size_t gcm_tag_len;		OFFSET:  10*8	=  80
+//	size_t gcm_processed_data_len;	OFFSET:  11*8	=  88
+//	size_t gcm_pt_buf_len;		OFFSET:  12*8	=  96
+//	uint32_t gcm_tmp[4];		OFFSET:  13*8	= 104
+//	uint64_t gcm_ghash[2];		OFFSET:  15*8	= 120
+//	uint64_t gcm_H[2];		OFFSET:  17*8	= 136
+//	uint64_t *gcm_Htable;		OFFSET:  19*8	= 152
+//	size_t gcm_htab_len;		OFFSET:  20*8	= 160
+//	uint64_t gcm_J0[2];		OFFSET:  21*8	= 168
+//	uint64_t gcm_len_a_len_c[2];	OFFSET:  23*8	= 184
+//	uint8_t *gcm_pt_buf;		OFFSET:  25*8	= 200
+//	gcm_simd_impl_t gcm_simd_impl;	OFFSET:  26*8	= 208
+// } gcm_ctx_t;				SIZE:		= 216
+
+// AadHash:
+//	Store current Hash of data which has been input: gcm_ctx->ghash.
+//
+// AadLen:
+//	Store length of input data which will not be encrypted or decrypted:
+//	gcm_ctx->gcm_tag_len.
+//
+// InLen:
+//	Store length of input data which will be encrypted or decrypted:
+//	gcm_ctx->gcm_processed_data_len.
+//
+// PBlockEncKey:
+//	Encryption key for the partial block at the end of the previous update:
+//	no real match, use: gcm_ctx->gcm_remainder.
+//
+// OrigIV:
+//	The initial counter: 12 bytes IV with (int32_t) 1 appended:
+//	gcm_ctx->gcm_J0.
+//
+// CurCount:
+//	Current counter for generation of encryption key: gcm_ctx->gcm_cb.
+//
+// PBlockLen:
+//	Length of partial block at the end of the previous update:
+//	gcm_ctx->gcm_remainder_len.
+
+#define KeySched	    0		// gcm_ctx->gcm_keysched
+#define AadHash		(15*8)		// gcm_ctx->gcm_ghash
+#define AadLen		(23*8)		// gcm_ctx->gcm_len_a_len_c[0]
+#define	TagLen		(10*8)		// gcm_ctx->gcm_tag_len
+#define InLen		(11*8)		// gcm_ctx->gcm_processed_data_len
+#define PBlockEncKey	 (4*8)		// gcm_ctx->gcm_remainder
+#define OrigIV		(21*8)		// gcm_ctx->gcm_J0
+#define CurCount	 (2*8)		// gcm_ctx->gcm_cb
+#define PBlockLen	 (6*8)		// gcm_ctx->gcm_remainder_len
+#define GcmH		(17*8)		// gcm_ctx->gcm_H
+#define GcmHtab		(19*8)		// gcm_ctx->gcm_Htable
+#define LenALenC	(23*8)		// gcm_ctx->gcm_len_a_len_c
+
+// Define the offsets into gcm_ctx of the fields fields of gcm_htab.
+//  u8 shifted_hkey_1[16]	store HashKey <<1 mod poly here
+//  u8 shifted_hkey_2[16]	store HashKey^2 <<1 mod poly here
+//  u8 shifted_hkey_3[16]	store HashKey^3 <<1 mod poly here
+//  u8 shifted_hkey_4[16]	store HashKey^4 <<1 mod poly here
+//  u8 shifted_hkey_5[16]	store HashKey^5 <<1 mod poly here
+//  u8 shifted_hkey_6[16]	store HashKey^6 <<1 mod poly here
+//  u8 shifted_hkey_7[16]	store HashKey^7 <<1 mod poly here
+//  u8 shifted_hkey_8[16]	store HashKey^8 <<1 mod poly here
+//  u8 shifted_hkey_1_k[16]	store XOR of High 64 bits and Low 64 bits of  HashKey <<1 mod poly here (for Karatsuba purposes)
+//  u8 shifted_hkey_2_k[16]	store XOR of High 64 bits and Low 64 bits of  HashKey^2 <<1 mod poly here (for Karatsuba purposes)
+//  u8 shifted_hkey_3_k[16]	store XOR of High 64 bits and Low 64 bits of  HashKey^3 <<1 mod poly here (for Karatsuba purposes)
+//  u8 shifted_hkey_4_k[16]	store XOR of High 64 bits and Low 64 bits of  HashKey^4 <<1 mod poly here (for Karatsuba purposes)
+//  u8 shifted_hkey_5_k[16]	store XOR of High 64 bits and Low 64 bits of  HashKey^5 <<1 mod poly here (for Karatsuba purposes)
+//  u8 shifted_hkey_6_k[16]	store XOR of High 64 bits and Low 64 bits of  HashKey^6 <<1 mod poly here (for Karatsuba purposes)
+//  u8 shifted_hkey_7_k[16]	store XOR of High 64 bits and Low 64 bits of  HashKey^7 <<1 mod poly here (for Karatsuba purposes)
+//  u8 shifted_hkey_8_k[16]	store XOR of High 64 bits and Low 64 bits of  HashKey^8 <<1 mod poly here (for Karatsuba purposes)
+
+#define	GCM_BLOCKSIZE	16
 
 #ifndef GCM_KEYS_VAES_AVX512_INCLUDED
-#define HashKey         16*15    // store HashKey <<1 mod poly here
-#define HashKey_1       16*15    // store HashKey <<1 mod poly here
-#define HashKey_2       16*16    // store HashKey^2 <<1 mod poly here
-#define HashKey_3       16*17    // store HashKey^3 <<1 mod poly here
-#define HashKey_4       16*18    // store HashKey^4 <<1 mod poly here
-#define HashKey_5       16*19    // store HashKey^5 <<1 mod poly here
-#define HashKey_6       16*20    // store HashKey^6 <<1 mod poly here
-#define HashKey_7       16*21    // store HashKey^7 <<1 mod poly here
-#define HashKey_8       16*22    // store HashKey^8 <<1 mod poly here
-#define HashKey_k       16*23    // store XOR of High 64 bits and Low 64 bits of  HashKey <<1 mod poly here (for Karatsuba purposes)
-#define HashKey_2_k     16*24    // store XOR of High 64 bits and Low 64 bits of  HashKey^2 <<1 mod poly here (for Karatsuba purposes)
-#define HashKey_3_k     16*25   // store XOR of High 64 bits and Low 64 bits of  HashKey^3 <<1 mod poly here (for Karatsuba purposes)
-#define HashKey_4_k     16*26   // store XOR of High 64 bits and Low 64 bits of  HashKey^4 <<1 mod poly here (for Karatsuba purposes)
-#define HashKey_5_k     16*27   // store XOR of High 64 bits and Low 64 bits of  HashKey^5 <<1 mod poly here (for Karatsuba purposes)
-#define HashKey_6_k     16*28   // store XOR of High 64 bits and Low 64 bits of  HashKey^6 <<1 mod poly here (for Karatsuba purposes)
-#define HashKey_7_k     16*29   // store XOR of High 64 bits and Low 64 bits of  HashKey^7 <<1 mod poly here (for Karatsuba purposes)
-#define HashKey_8_k     16*30   // store XOR of High 64 bits and Low 64 bits of  HashKey^8 <<1 mod poly here (for Karatsuba purposes)
+#define	HashKey		(GCM_BLOCKSIZE *  0)
+#define	HashKey_1	(GCM_BLOCKSIZE *  0)
+#define	HashKey_2	(GCM_BLOCKSIZE *  1)
+#define	HashKey_3	(GCM_BLOCKSIZE *  2)
+#define	HashKey_4	(GCM_BLOCKSIZE *  3)
+#define	HashKey_5	(GCM_BLOCKSIZE *  4)
+#define	HashKey_6	(GCM_BLOCKSIZE *  5)
+#define	HashKey_7	(GCM_BLOCKSIZE *  6)
+#define	HashKey_8	(GCM_BLOCKSIZE *  7)
+#define	HashKey_k	(GCM_BLOCKSIZE *  8)
+#define	HashKey_2_k	(GCM_BLOCKSIZE *  9)
+#define	HashKey_3_k	(GCM_BLOCKSIZE * 10)
+#define	HashKey_4_k	(GCM_BLOCKSIZE * 11)
+#define	HashKey_5_k	(GCM_BLOCKSIZE * 12)
+#define	HashKey_6_k	(GCM_BLOCKSIZE * 13)
+#define	HashKey_7_k	(GCM_BLOCKSIZE * 14)
+#define	HashKey_8_k	(GCM_BLOCKSIZE * 15)
 #endif
 
-#define AadHash		16*0	// store current Hash of data which has been input
-#define AadLen		16*1	// store length of input data which will not be encrypted or decrypted
-#define InLen		(16*1)+8 // store length of input data which will be encrypted or decrypted
-#define PBlockEncKey	16*2	// encryption key for the partial block at the end of the previous update
-#define OrigIV		16*3	// input IV
-#define CurCount	16*4	// Current counter for generation of encryption key
-#define PBlockLen	16*5	// length of partial block at the end of the previous update
-
 .macro xmmreg name, num
 	.set xmm\name, %xmm\num
 .endm
 
+// Push a 64 bit register to the stack and generate the needed CFI directives.
+.macro CFI_PUSHQ	REG, OFFS
+	pushq	\REG
+	.cfi_adjust_cfa_offset	8
+	.cfi_offset	\REG, \OFFS
+.endm
+
+// Pop a 64 bit register from the stack and generate the needed CFI directives.
+.macro CFI_POPQ		REG
+	popq	\REG
+	.cfi_restore	\REG
+	.cfi_adjust_cfa_offset	-8
+.endm
+
 #define arg(x) (STACK_OFFSET + 8*(x))(%r14)
 
+/*
+.macro STACK_FRAME_NON_STANDARD func:req
+	.pushsection .discard.func_stack_frame_non_standard, "aw"
+-		.long \func - .
++#ifdef CONFIG_64BIT
++		.quad \func
++#else
++		.long \func
++#endif
+ 	.popsection
+.endm
+*/
 
-#if __OUTPUT_FORMAT__ != elf64
-#define arg1 %rcx
-#define arg2 %rdx
-#define arg3 %r8
-#define arg4 %r9
-#define arg5 %rsi
-#define arg6 (STACK_OFFSET + 8*6)(%r14)
-#define arg7 (STACK_OFFSET + 8*7)(%r14)
-#define arg8 (STACK_OFFSET + 8*8)(%r14)
-#define arg9 (STACK_OFFSET + 8*9)(%r14)
-#define arg10 (STACK_OFFSET + 8*10)(%r14)
-#else
 #define arg1 %rdi
 #define arg2 %rsi
 #define arg3 %rdx
@@ -261,7 +331,6 @@ mask_out_top_block:
 #define arg8 ((STACK_OFFSET) + 8*2)(%r14)
 #define arg9 ((STACK_OFFSET) + 8*3)(%r14)
 #define arg10 ((STACK_OFFSET) + 8*4)(%r14)
-#endif
 
 #ifdef NT_LDST
 #define NT_LD
diff --git a/module/icp/asm-x86_64/modes/isalc_gcm_sse.S b/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
index 5d5be5068904..fab97e7d8408 100644
--- a/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
+++ b/module/icp/asm-x86_64/modes/isalc_gcm_sse.S
@@ -116,30 +116,59 @@
 // for GHASH part, two tabs is for AES part.
 //
 
-// .altmacro
+// Port to GNU as, translation to GNU as att-syntax and adoptions for the ICP
+// Copyright(c) 2023 Attila Fülöp <attila@fueloep.org>
+
 .att_syntax prefix
 
-#include "isalc_reg_sizes_att.S"
-#include "isalc_gcm_defines_att.S"
+#define _ASM
+#include <sys/asm_linkage.h>
 
-#if !defined(GCM128_MODE) && !defined(GCM256_MODE)
+#if !defined(GCM128_MODE) && !defined(GCM192_MODE) && !defined(GCM256_MODE)
 #error "No GCM mode selected for gcm_sse.S!"
 #endif
 
-#if defined(FUNCT_EXTENSION)
-#error "No support for non-temporal versions yet!"
+#if 0
+#ifdef GCM128_MODE
+#define FN_NAME(x,y) ENTRY_NP(icp_isalc_gcm_ ## x ## _128 ## y ## sse)
+//#define FN_NAME(x,y) aes_gcm_ ## x ## _128 ## y ## sse:
+#define NROUNDS 9
+#endif
+
+#ifdef GCM192_MODE
+#define FN_NAME(x,y) ENTRY(icp_isalc_gcm_ ## x ## _192 ## y ## sse)
+#define NROUNDS 11
 #endif
-#define	_nt	1
 
+#ifdef GCM256_MODE
+#define FN_NAME(x,y) ENTRY(icp_isalc_gcm_ ## x ## _256 ## y ## sse)
+#define NROUNDS 13
+#endif
+#else
 #ifdef GCM128_MODE
-#define FN_NAME(x,y) aes_gcm_ ## x ## _128 ## y ## sse
+#define FN_NAME(x,y) icp_isalc_gcm_ ## x ## _128 ## y ## sse
 #define NROUNDS 9
 #endif
 
+#ifdef GCM192_MODE
+#define FN_NAME(x,y) icp_isalc_gcm_ ## x ## _192 ## y ## sse
+#define NROUNDS 11
+#endif
+
 #ifdef GCM256_MODE
-#define FN_NAME(x,y) aes_gcm_ ## x ## _256 ## y ## sse
+#define FN_NAME(x,y) icp_isalc_gcm_ ## x ## _256 ## y ## sse
 #define NROUNDS 13
 #endif
+#endif
+
+#include "isalc_reg_sizes.S"
+#include "isalc_gcm_defines.S"
+
+
+#if defined(FUNCT_EXTENSION)
+#error "No support for non-temporal versions yet!"
+#endif
+#define	_nt	1
 
 
 // need to push 5 registers into stack to maintain
@@ -235,59 +264,59 @@
 
 ////////////////////////////////////////////////////////////////////////////////
 // PRECOMPUTE: Precompute HashKey_{2..8} and HashKey{,_{2..8}}_k.
-// HasKey_i_k holds XORed values of the low and high parts of the Haskey_i.
+// HashKey_i_k holds XORed values of the low and high parts of the Haskey_i.
 ////////////////////////////////////////////////////////////////////////////////
-.macro PRECOMPUTE GDATA, HK, T1, T2, T3, T4, T5, T6
+.macro PRECOMPUTE HTAB, HK, T1, T2, T3, T4, T5, T6
 
 	movdqa	\HK, \T4
 	pshufd	$0b01001110, \HK, \T1
 	pxor	\HK, \T1
-	movdqu	\T1, HashKey_k(\GDATA)
+	movdqu	\T1, HashKey_k(\HTAB)
 
 
 	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^2<<1 mod poly
-	movdqu	\T4, HashKey_2(\GDATA)		//  [HashKey_2] = HashKey^2<<1 mod poly
+	movdqu	\T4, HashKey_2(\HTAB)		//  [HashKey_2] = HashKey^2<<1 mod poly
 	pshufd	$0b01001110, \T4, \T1
 	pxor	\T4, \T1
-	movdqu	\T1, HashKey_2_k(\GDATA)
+	movdqu	\T1, HashKey_2_k(\HTAB)
 
 	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^3<<1 mod poly
-	movdqu	\T4, HashKey_3(\GDATA)
+	movdqu	\T4, HashKey_3(\HTAB)
 	pshufd	$0b01001110, \T4, \T1
 	pxor	\T4, \T1
-	movdqu	\T1, HashKey_3_k(\GDATA)
+	movdqu	\T1, HashKey_3_k(\HTAB)
 
 
 	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^4<<1 mod poly
-	movdqu	\T4, HashKey_4(\GDATA)
+	movdqu	\T4, HashKey_4(\HTAB)
 	pshufd	$0b01001110, \T4, \T1
 	pxor	\T4, \T1
-	movdqu	\T1, HashKey_4_k(\GDATA)
+	movdqu	\T1, HashKey_4_k(\HTAB)
 
 	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^5<<1 mod poly
-	movdqu	\T4, HashKey_5(\GDATA)
+	movdqu	\T4, HashKey_5(\HTAB)
 	pshufd	$0b01001110, \T4, \T1
 	pxor	\T4, \T1
-	movdqu	\T1, HashKey_5_k(\GDATA)
+	movdqu	\T1, HashKey_5_k(\HTAB)
 
 
 	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^6<<1 mod poly
-	movdqu	\T4, HashKey_6(\GDATA)
+	movdqu	\T4, HashKey_6(\HTAB)
 	pshufd	$0b01001110, \T4, \T1
 	pxor	\T4, \T1
-	movdqu	\T1, HashKey_6_k(\GDATA)
+	movdqu	\T1, HashKey_6_k(\HTAB)
 
 	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^7<<1 mod poly
-	movdqu	\T4, HashKey_7(\GDATA)
+	movdqu	\T4, HashKey_7(\HTAB)
 	pshufd	$0b01001110, \T4, \T1
 	pxor	\T4, \T1
-	movdqu	\T1, HashKey_7_k(\GDATA)
+	movdqu	\T1, HashKey_7_k(\HTAB)
 
 	GHASH_MUL \T4, \HK, \T1, \T2, \T3, \T5, \T6	//  \T4 = HashKey^8<<1 mod poly
-	movdqu  \T4, HashKey_8(\GDATA)
+	movdqu  \T4, HashKey_8(\HTAB)
 	pshufd  $0b01001110, \T4, \T1
 	pxor    \T4, \T1
-	movdqu  \T1, HashKey_8_k(\GDATA)
+	movdqu  \T1, HashKey_8_k(\HTAB)
 
 .endm // PRECOMPUTE
 
@@ -397,7 +426,7 @@ _CALC_AAD_done_\@:
 ////////////////////////////////////////////////////////////////////////////////
 // PARTIAL_BLOCK: Handles encryption/decryption and the tag partial blocks
 // between update calls. Requires the input data be at least 1 byte long.
-// Input: gcm_key_data (GDATA_KEY), gcm_context_data (GDATA_CTX), input text
+// Input: gcm_key_data (GCM_HTAB), gcm_context_data (GDATA_CTX), input text
 // (PLAIN_CYPH_IN), input text length (PLAIN_CYPH_LEN), the current data offset
 // (DATA_OFFSET), and whether encoding or decoding (ENC_DEC).
 // Output: A cypher of the first partial block (CYPH_PLAIN_OUT), and updated
@@ -405,7 +434,7 @@ _CALC_AAD_done_\@:
 // Clobbers rax, r10, r12, r13, r15, xmm0, xmm1, xmm2, xmm3, xmm5, xmm6, xmm9,
 // xmm10, xmm11, xmm13
 ////////////////////////////////////////////////////////////////////////////////
-.macro PARTIAL_BLOCK	GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
+.macro PARTIAL_BLOCK	GCM_HTAB, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
 			PLAIN_CYPH_LEN, DATA_OFFSET, AAD_HASH, ENC_DEC
 
 	// clang compat: no local support
@@ -432,7 +461,7 @@ _data_read_\@:				//Finished reading in data
 
 
 	movdqu	PBlockEncKey(\GDATA_CTX), %xmm9	//xmm9 = ctx_data.partial_block_enc_key
-	movdqu	HashKey(\GDATA_KEY), %xmm13
+	movdqu	HashKey(\GCM_HTAB), %xmm13
 
 	lea	SHIFT_MASK(%rip), %r12
 
@@ -440,7 +469,7 @@ _data_read_\@:				//Finished reading in data
 	movdqu	(%r12), %xmm2		// get the appropriate shuffle mask
 	pshufb	%xmm2, %xmm9		// shift right r13 bytes
 
-	.ifc	\ENC_DEC, DEC
+	.ifc	\ENC_DEC, DEC		// We are decrypting.
 
 	movdqa	%xmm1, %xmm3
 	pxor	%xmm1, %xmm9		// Cyphertext XOR E(K, Yn)
@@ -473,7 +502,7 @@ _partial_incomplete_1_\@:
 _dec_done_\@:
 	movdqu	\AAD_HASH, AadHash(\GDATA_CTX)
 
-	.else	// .ifc \ENC_DEC, DEC
+	.else	// .ifc \ENC_DEC, DEC; We are encrypting.
 
 	pxor	%xmm1, %xmm9		// Plaintext XOR E(K, Yn)
 
@@ -542,11 +571,11 @@ _partial_block_done_\@:
 // INITIAL_BLOCKS: If a = number of total plaintext bytes; b = floor(a/16);
 // \num_initial_blocks = b mod 8;  encrypt the initial \num_initial_blocks
 // blocks and apply ghash on the ciphertext.
-// \GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, r14 are used as a
+// \KEYSCHED, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, r14 are used as a
 // pointer only, not modified.
 // Updated AAD_HASH is returned in \T3.
 ////////////////////////////////////////////////////////////////////////////////
-.macro INITIAL_BLOCKS	GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
+.macro INITIAL_BLOCKS	KEYSCHED, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
 			LENGTH, DATA_OFFSET, num_initial_blocks, T1, HASH_KEY, \
 			T3, T4, T5, CTR, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, \
 			XMM7, XMM8, T6, T_key, ENC_DEC
@@ -566,13 +595,13 @@ _partial_block_done_\@:
 .set i, (9-\num_initial_blocks)
 .rept \num_initial_blocks
 		xmmreg	i, %i
-		paddd	ONE(%rip), \CTR			// INCR Y0
+		paddd	ONE(%rip), \CTR		// INCR Y0
 		movdqa	\CTR, xmmi
-		pshufb	SHUF_MASK(%rip), xmmi		// perform a 16Byte swap
+		pshufb	SHUF_MASK(%rip), xmmi	// perform a 16Byte swap
 .set i, (i+1)
 .endr
 
-movdqu	16*0(\GDATA_KEY), \T_key
+movdqu	16*0(\KEYSCHED), \T_key
 .set i, (9-\num_initial_blocks)
 .rept \num_initial_blocks
 		xmmreg	i, %i
@@ -581,8 +610,8 @@ movdqu	16*0(\GDATA_KEY), \T_key
 .endr
 
 .set j, 1
-.rept NROUNDS						// encrypt N blocks with 13 key rounds (11 for GCM192)
-movdqu	16*j(\GDATA_KEY), \T_key
+.rept NROUNDS				// encrypt N blocks with 13 key rounds (11 for GCM192)
+movdqu	16*j(\KEYSCHED), \T_key
 .set i, (9-\num_initial_blocks)
 .rept \num_initial_blocks
 		xmmreg	i, %i
@@ -593,7 +622,7 @@ movdqu	16*j(\GDATA_KEY), \T_key
 .set j, (j+1)
 .endr
 
-movdqu	16*j(\GDATA_KEY), \T_key			// encrypt with last (14th) key round (12 for GCM192)
+movdqu	16*j(\KEYSCHED), \T_key		// encrypt with last (14th) key round (12 for GCM192)
 .set i, (9-\num_initial_blocks)
 .rept \num_initial_blocks
 		xmmreg	i, %i
@@ -668,7 +697,7 @@ movdqu	16*j(\GDATA_KEY), \T_key			// encrypt with last (14th) key round (12 for
 		movdqa  \CTR, \XMM8
 		pshufb  SHUF_MASK(%rip), \XMM8	// perform a 16Byte swap
 
-		movdqu  16*0(\GDATA_KEY), \T_key
+		movdqu  16*0(\KEYSCHED), \T_key
 		pxor	\T_key, \XMM1
 		pxor	\T_key, \XMM2
 		pxor	\T_key, \XMM3
@@ -680,7 +709,7 @@ movdqu	16*j(\GDATA_KEY), \T_key			// encrypt with last (14th) key round (12 for
 
 .set i, 1
 .rept    NROUNDS			// do early (13) rounds (11 for GCM192)
-		movdqu  16*i(\GDATA_KEY), \T_key
+		movdqu  16*i(\KEYSCHED), \T_key
 		aesenc	\T_key, \XMM1
 		aesenc	\T_key, \XMM2
 		aesenc	\T_key, \XMM3
@@ -692,7 +721,7 @@ movdqu	16*j(\GDATA_KEY), \T_key			// encrypt with last (14th) key round (12 for
 .set i, (i+1)
 .endr
 
-		movdqu	16*i(\GDATA_KEY), \T_key	// do final key round
+		movdqu	16*i(\KEYSCHED), \T_key	// do final key round
 		aesenclast	\T_key, \XMM1
 		aesenclast	\T_key, \XMM2
 		aesenclast	\T_key, \XMM3
@@ -780,14 +809,14 @@ _initial_blocks_done_\@:
 ////////////////////////////////////////////////////////////////////////////////
 // GHASH_8_ENCRYPT_8_PARALLEL: Encrypt 8 blocks at a time and ghash the 8
 // previously encrypted ciphertext blocks.
-// \GDATA (KEY), \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN are used as pointers only,
-// not modified.
+// \KEYSCHED, \GCM_HTAB, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN are used as pointers
+//  only, not modified.
 // \DATA_OFFSET is the data offset value
 ////////////////////////////////////////////////////////////////////////////////
-.macro GHASH_8_ENCRYPT_8_PARALLEL GDATA, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
-				  DATA_OFFSET, T1, T2, T3, T4, T5, T6, CTR, \
-				  XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7, \
-				  XMM8, T7, loop_idx, ENC_DEC
+.macro GHASH_8_ENCRYPT_8_PARALLEL KEYSCHED, GCM_HTAB, CYPH_PLAIN_OUT, \
+				  PLAIN_CYPH_IN, DATA_OFFSET, T1, T2, T3, T4, \
+				  T5, T6, CTR, XMM1, XMM2, XMM3, XMM4, XMM5, \
+				  XMM6, XMM7, XMM8, T7, loop_idx, ENC_DEC
 
 
 	movdqa	\XMM1, \T7
@@ -810,10 +839,10 @@ _initial_blocks_done_\@:
 		.else
 		paddd	ONEf(%rip), \CTR			// INCR CNT
 		.endif
-	movdqu	HashKey_8(\GDATA), \T5
+	movdqu	HashKey_8(\GCM_HTAB), \T5
 	pclmulqdq	 $0x11, \T5, \T4		// \T1 = a1*b1
 	pclmulqdq	$0x00, \T5, \T7			// \T7 = a0*b0
-	movdqu	HashKey_8_k(\GDATA), \T5
+	movdqu	HashKey_8_k(\GCM_HTAB), \T5
 	pclmulqdq	$0x00, \T5, \T6			// \T2 = (a1+a0)*(b1+b0)
 		movdqa	\CTR, \XMM1
 
@@ -875,7 +904,7 @@ _initial_blocks_done_\@:
 		.endif	// .ifc \loop_idx, in_order
 	////////////////////////////////////////////////////////////////////////
 
-		movdqu	16*0(\GDATA), \T1
+		movdqu	16*0(\KEYSCHED), \T1
 		pxor	\T1, \XMM1
 		pxor	\T1, \XMM2
 		pxor	\T1, \XMM3
@@ -894,16 +923,16 @@ _initial_blocks_done_\@:
 
 	pshufd	$0b01001110, \T3, \T2
 	pxor	\T3, \T2
-	movdqu	HashKey_7(\GDATA), \T5
+	movdqu	HashKey_7(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1			// \T1 = a1*b1
 	pclmulqdq	$0x00, \T5, \T3			// \T3 = a0*b0
-	movdqu  HashKey_7_k(\GDATA), \T5
+	movdqu  HashKey_7_k(\GCM_HTAB), \T5
 	pclmulqdq	$0x00, \T5, \T2			// \T2 = (a1+a0)*(b1+b0)
 	pxor	\T1, \T4				// accumulate the results in \T4:\T7, \T6 holds the middle part
 	pxor	\T3, \T7
 	pxor	\T2, \T6
 
-		movdqu	16*1(\GDATA), \T1
+		movdqu	16*1(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -913,7 +942,7 @@ _initial_blocks_done_\@:
 		aesenc	\T1, \XMM7
 		aesenc	\T1, \XMM8
 
-		movdqu	16*2(\GDATA), \T1
+		movdqu	16*2(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -930,16 +959,16 @@ _initial_blocks_done_\@:
 
 	pshufd	$0b01001110, \T3, \T2
 	pxor	\T3, \T2
-	movdqu	HashKey_6(\GDATA), \T5
+	movdqu	HashKey_6(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
 	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
-	movdqu  HashKey_6_k(\GDATA), \T5
+	movdqu  HashKey_6_k(\GCM_HTAB), \T5
 	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
 	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
 	pxor	\T3, \T7
 	pxor	\T2, \T6
 
-		movdqu	16*3(\GDATA), \T1
+		movdqu	16*3(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -954,16 +983,16 @@ _initial_blocks_done_\@:
 
 	pshufd	$0b01001110, \T3, \T2
 	pxor	\T3, \T2
-	movdqu	HashKey_5(\GDATA), \T5
+	movdqu	HashKey_5(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
 	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
-	movdqu  HashKey_5_k(\GDATA), \T5
+	movdqu  HashKey_5_k(\GCM_HTAB), \T5
 	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
 	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
 	pxor	\T3, \T7
 	pxor	\T2, \T6
 
-		movdqu	16*4(\GDATA), \T1
+		movdqu	16*4(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -973,7 +1002,7 @@ _initial_blocks_done_\@:
 		aesenc	\T1, \XMM7
 		aesenc	\T1, \XMM8
 
-		movdqu	16*5(\GDATA), \T1
+		movdqu	16*5(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -988,16 +1017,16 @@ _initial_blocks_done_\@:
 
 	pshufd	$0b01001110, \T3, \T2
 	pxor	\T3, \T2
-	movdqu	HashKey_4(\GDATA), \T5
+	movdqu	HashKey_4(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
 	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
-	movdqu  HashKey_4_k(\GDATA), \T5
+	movdqu  HashKey_4_k(\GCM_HTAB), \T5
 	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
 	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
 	pxor	\T3, \T7
 	pxor	\T2, \T6
 
-		movdqu	16*6(\GDATA), \T1
+		movdqu	16*6(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -1013,16 +1042,16 @@ _initial_blocks_done_\@:
 
 	pshufd	$0b01001110, \T3, \T2
 	pxor	\T3, \T2
-	movdqu	HashKey_3(\GDATA), \T5
+	movdqu	HashKey_3(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
 	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
-	movdqu  HashKey_3_k(\GDATA), \T5
+	movdqu  HashKey_3_k(\GCM_HTAB), \T5
 	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
 	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
 	pxor	\T3, \T7
 	pxor	\T2, \T6
 
-		movdqu	16*7(\GDATA), \T1
+		movdqu	16*7(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -1037,16 +1066,16 @@ _initial_blocks_done_\@:
 
 	pshufd	$0b01001110, \T3, \T2
 	pxor	\T3, \T2
-	movdqu	HashKey_2(\GDATA), \T5
+	movdqu	HashKey_2(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1			// \T1 = a1*b1
 	pclmulqdq	$0x00, \T5, \T3			// \T3 = a0*b0
-	movdqu  HashKey_2_k(\GDATA), \T5
+	movdqu  HashKey_2_k(\GCM_HTAB), \T5
 	pclmulqdq	$0x00, \T5, \T2			// \T2 = (a1+a0)*(b1+b0)
 	pxor	\T1, \T4				// accumulate the results in \T4:\T7, \T6 holds the middle part
 	pxor	\T3, \T7
 	pxor	\T2, \T6
 
-		movdqu	16*8(\GDATA), \T1
+		movdqu	16*8(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -1066,15 +1095,15 @@ _initial_blocks_done_\@:
 
 	pshufd	$0b01001110, \T3, \T2
 	pxor	\T3, \T2
-	movdqu	HashKey(\GDATA), \T5
+	movdqu	HashKey(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
 	pclmulqdq	$0x00, \T5, \T3		// \T3 = a0*b0
-	movdqu  HashKey_k(\GDATA), \T5
+	movdqu  HashKey_k(\GCM_HTAB), \T5
 	pclmulqdq	$0x00, \T5, \T2		// \T2 = (a1+a0)*(b1+b0)
 	pxor	\T3, \T7
 	pxor	\T1, \T4			// accumulate the results in \T4:\T7, \T6 holds the middle part
 
-		movdqu	16*9(\GDATA), \T1
+		movdqu	16*9(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -1086,10 +1115,10 @@ _initial_blocks_done_\@:
 
 
 #ifdef GCM128_MODE
-		movdqu	16*10(\GDATA), \T5
+		movdqu	16*10(\KEYSCHED), \T5
 #endif
 #ifdef GCM192_MODE
-		movdqu	16*10(\GDATA), \T1
+		movdqu	16*10(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -1099,7 +1128,7 @@ _initial_blocks_done_\@:
 		aesenc	\T1, \XMM7
 		aesenc	\T1, \XMM8
 
-		movdqu	16*11(\GDATA), \T1
+		movdqu	16*11(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -1109,10 +1138,10 @@ _initial_blocks_done_\@:
 		aesenc	\T1, \XMM7
 		aesenc	\T1, \XMM8
 
-		movdqu	16*12(\GDATA), \T5	// finish last key round
+		movdqu	16*12(\KEYSCHED), \T5	// finish last key round
 #endif
 #ifdef GCM256_MODE
-		movdqu	16*10(\GDATA), \T1
+		movdqu	16*10(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -1122,7 +1151,7 @@ _initial_blocks_done_\@:
 		aesenc	\T1, \XMM7
 		aesenc	\T1, \XMM8
 
-		movdqu	16*11(\GDATA), \T1
+		movdqu	16*11(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -1132,7 +1161,7 @@ _initial_blocks_done_\@:
 		aesenc	\T1, \XMM7
 		aesenc	\T1, \XMM8
 
-		movdqu	16*12(\GDATA), \T1
+		movdqu	16*12(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -1142,7 +1171,7 @@ _initial_blocks_done_\@:
 		aesenc	\T1, \XMM7
 		aesenc	\T1, \XMM8
 
-		movdqu	16*13(\GDATA), \T1
+		movdqu	16*13(\KEYSCHED), \T1
 		aesenc	\T1, \XMM1
 		aesenc	\T1, \XMM2
 		aesenc	\T1, \XMM3
@@ -1152,7 +1181,7 @@ _initial_blocks_done_\@:
 		aesenc	\T1, \XMM7
 		aesenc	\T1, \XMM8
 
-	movdqu	16*14(\GDATA), \T5		// finish last key round
+	movdqu	16*14(\KEYSCHED), \T5		// finish last key round
 #endif
 
 .altmacro
@@ -1242,7 +1271,7 @@ _initial_blocks_done_\@:
 ////////////////////////////////////////////////////////////////////////////////
 // GHASH_LAST_8: GHASH the last 8 ciphertext blocks.
 ////////////////////////////////////////////////////////////////////////////////
-.macro	GHASH_LAST_8	GDATA, T1, T2, T3, T4, T5, T6, T7, \
+.macro	GHASH_LAST_8	GCM_HTAB, T1, T2, T3, T4, T5, T6, T7, \
 			XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7, XMM8
 
 
@@ -1250,11 +1279,11 @@ _initial_blocks_done_\@:
 	movdqa	\XMM1, \T6
 	pshufd	$0b01001110,  \XMM1, \T2
 	pxor	\XMM1, \T2
-	movdqu	HashKey_8(\GDATA), \T5
+	movdqu	HashKey_8(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T6		// \T6 = a1*b1
 
 	pclmulqdq	$0x00, \T5, \XMM1	// \XMM1 = a0*b0
-	movdqu	HashKey_8_k(\GDATA), \T4
+	movdqu	HashKey_8_k(\GCM_HTAB), \T4
 	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
 
 	movdqa	\XMM1, \T7
@@ -1264,11 +1293,11 @@ _initial_blocks_done_\@:
 	movdqa	\XMM2, \T1
 	pshufd	$0b01001110,  \XMM2, \T2
 	pxor	\XMM2, \T2
-	movdqu	HashKey_7(\GDATA), \T5
+	movdqu	HashKey_7(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
 
 	pclmulqdq	$0x00, \T5, \XMM2	// \XMM2 = a0*b0
-	movdqu	HashKey_7_k(\GDATA), \T4
+	movdqu	HashKey_7_k(\GCM_HTAB), \T4
 	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
 
 	pxor	\T1, \T6
@@ -1279,11 +1308,11 @@ _initial_blocks_done_\@:
 	movdqa	\XMM3, \T1
 	pshufd	$0b01001110,  \XMM3, \T2
 	pxor	\XMM3, \T2
-	movdqu	HashKey_6(\GDATA), \T5
+	movdqu	HashKey_6(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
 
 	pclmulqdq	$0x00, \T5, \XMM3	// \XMM3 = a0*b0
-	movdqu	HashKey_6_k(\GDATA), \T4
+	movdqu	HashKey_6_k(\GCM_HTAB), \T4
 	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
 
 	pxor	\T1, \T6
@@ -1294,11 +1323,11 @@ _initial_blocks_done_\@:
 	movdqa	\XMM4, \T1
 	pshufd	$0b01001110,  \XMM4, \T2
 	pxor	\XMM4, \T2
-	movdqu	HashKey_5(\GDATA), \T5
+	movdqu	HashKey_5(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
 
 	pclmulqdq	$0x00, \T5, \XMM4	// \XMM4 = a0*b0
-	movdqu	HashKey_5_k(\GDATA), \T4
+	movdqu	HashKey_5_k(\GCM_HTAB), \T4
 	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
 
 	pxor	\T1, \T6
@@ -1309,11 +1338,11 @@ _initial_blocks_done_\@:
 	movdqa	\XMM5, \T1
 	pshufd	$0b01001110,  \XMM5, \T2
 	pxor	\XMM5, \T2
-	movdqu	HashKey_4(\GDATA), \T5
+	movdqu	HashKey_4(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
 
 	pclmulqdq	$0x00, \T5, \XMM5	// \XMM5 = a0*b0
-	movdqu	HashKey_4_k(\GDATA), \T4
+	movdqu	HashKey_4_k(\GCM_HTAB), \T4
 	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
 
 	pxor	\T1, \T6
@@ -1324,11 +1353,11 @@ _initial_blocks_done_\@:
 	movdqa	\XMM6, \T1
 	pshufd	$0b01001110,  \XMM6, \T2
 	pxor	\XMM6, \T2
-	movdqu	HashKey_3(\GDATA), \T5
+	movdqu	HashKey_3(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
 
 	pclmulqdq	$0x00, \T5, \XMM6	// \XMM6 = a0*b0
-	movdqu	HashKey_3_k(\GDATA), \T4
+	movdqu	HashKey_3_k(\GCM_HTAB), \T4
 	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
 
 	pxor	\T1, \T6
@@ -1339,11 +1368,11 @@ _initial_blocks_done_\@:
 	movdqa	\XMM7, \T1
 	pshufd	$0b01001110,  \XMM7, \T2
 	pxor	\XMM7, \T2
-	movdqu	HashKey_2(\GDATA), \T5
+	movdqu	HashKey_2(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
 
 	pclmulqdq	$0x00, \T5, \XMM7	// \XMM7 = a0*b0
-	movdqu	HashKey_2_k(\GDATA), \T4
+	movdqu	HashKey_2_k(\GCM_HTAB), \T4
 	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
 
 	pxor	\T1, \T6
@@ -1355,11 +1384,11 @@ _initial_blocks_done_\@:
 	movdqa	\XMM8, \T1
 	pshufd	$0b01001110,  \XMM8, \T2
 	pxor	\XMM8, \T2
-	movdqu	HashKey(\GDATA), \T5
+	movdqu	HashKey(\GCM_HTAB), \T5
 	pclmulqdq	$0x11, \T5, \T1		// \T1 = a1*b1
 
 	pclmulqdq	$0x00, \T5, \XMM8	// \XMM8 = a0*b0
-	movdqu	HashKey_k(\GDATA), \T4
+	movdqu	HashKey_k(\GCM_HTAB), \T4
 	pclmulqdq	$0x00, \T4, \T2		// \T2 = (a1+a0)*(b1+b0)
 
 	pxor	\T1, \T6
@@ -1414,19 +1443,19 @@ _initial_blocks_done_\@:
 ////////////////////////////////////////////////////////////////////////////////
 // ENCRYPT_SINGLE_BLOCK: Encrypt a single block.
 ////////////////////////////////////////////////////////////////////////////////
-.macro ENCRYPT_SINGLE_BLOCK	GDATA, ST, T1
+.macro ENCRYPT_SINGLE_BLOCK	KEYSCHED, ST, T1
 
-		movdqu	16*0(\GDATA), \T1
+		movdqu	16*0(\KEYSCHED), \T1
 		pxor	\T1, \ST
 
 .set i, 1
 .rept NROUNDS
-		movdqu	16*i(\GDATA), \T1
+		movdqu	16*i(\KEYSCHED), \T1
 		aesenc	\T1, \ST
 
 .set i, (i+1)
 .endr
-		movdqu	16*i(\GDATA), \T1
+		movdqu	16*i(\KEYSCHED), \T1
 		aesenclast	\T1, \ST
 .endm // ENCRYPT_SINGLE_BLOCK
 
@@ -1437,92 +1466,67 @@ _initial_blocks_done_\@:
 .macro FUNC_SAVE
 	//// Required for Update/GMC_ENC
 	//the number of pushes must equal STACK_OFFSET
-	push    %r12
-	push    %r13
-	push    %r14
-	push    %r15
-	push    %rsi
-	mov	%rsp, %r14
+	CFI_PUSHQ	%r12, -16
+	CFI_PUSHQ	%r13, -24
+	CFI_PUSHQ	%r14, -32
+	CFI_PUSHQ	%r15, -40
+	CFI_PUSHQ	%rsi, -48	// XXXX Why push %rsi ????
+	mov		%rsp, %r14
+	.cfi_def_cfa_register	%r14
 
 	sub	$(VARIABLE_OFFSET), %rsp
 	and	$~63, %rsp
 
-#if __OUTPUT_FORMAT__ == win64
-	// xmm6:xmm15 need to be maintained for Windows
-	movdqu	%xmm6, (LOCAL_STORAGE + 0*16)(%rsp)
-	movdqu	%xmm7, (LOCAL_STORAGE + 1*16)(%rsp)
-	movdqu	%xmm8, (LOCAL_STORAGE + 2*16)(%rsp)
-	movdqu	%xmm9, (LOCAL_STORAGE + 3*16)(%rsp)
-	movdqu	%xmm10, (LOCAL_STORAGE + 4*16)(%rsp)
-	movdqu	%xmm11, (LOCAL_STORAGE + 5*16)(%rsp)
-	movdqu	%xmm12, (LOCAL_STORAGE + 6*16)(%rsp)
-	movdqu	%xmm13, (LOCAL_STORAGE + 7*16)(%rsp)
-	movdqu	%xmm14, (LOCAL_STORAGE + 8*16)(%rsp)
-	movdqu	%xmm15, (LOCAL_STORAGE + 9*16)(%rsp)
-
-	mov	arg(5), arg5  // XXXX [r14 + STACK_OFFSET + 8*5]
-#endif
 .endm // FUNC_SAVE
 
 ////////////////////////////////////////////////////////////////////////////////
 // FUNC_RESTORE: Restore clobbered regs from the stack.
 ////////////////////////////////////////////////////////////////////////////////
 .macro FUNC_RESTORE
-
-#if __OUTPUT_FORMAT__ == win64
-	movdqu	(LOCAL_STORAGE + 9*16)(%rsp), %xmm15
-	movdqu	(LOCAL_STORAGE + 8*16)(%rsp), %xmm14
-	movdqu	(LOCAL_STORAGE + 7*16)(%rsp), %xmm13
-	movdqu	(LOCAL_STORAGE + 6*16)(%rsp), %xmm12
-	movdqu	(LOCAL_STORAGE + 5*16)(%rsp), %xmm11
-	movdqu	(LOCAL_STORAGE + 4*16)(%rsp), %xmm10
-	movdqu	(LOCAL_STORAGE + 3*16)(%rsp), %xmm9
-	movdqu	(LOCAL_STORAGE + 2*16)(%rsp), %xmm8
-	movdqu	(LOCAL_STORAGE + 1*16)(%rsp), %xmm7
-	movdqu	(LOCAL_STORAGE + 0*16)(%rsp), %xmm6
-#endif
-
 	// Required for Update/GMC_ENC
-	mov	%r14, %rsp
-	pop	%rsi
-	pop	%r15
-	pop	%r14
-	pop	%r13
-	pop	%r12
+	mov		%r14, %rsp
+	.cfi_def_cfa_register	%rsp
+	CFI_POPQ	%rsi
+	CFI_POPQ	%r15
+	CFI_POPQ	%r14
+	CFI_POPQ	%r13
+	CFI_POPQ	%r12
 .endm // FUNC_RESTORE
 
 
 ////////////////////////////////////////////////////////////////////////////////
 // GCM_INIT: Initializes a gcm_context_data struct to prepare for
 // encoding/decoding.
-// Input: gcm_key_data * (GDATA_KEY), gcm_context_data *(GDATA_CTX), IV,
+// Input: gcm_ctx->gcm_Htable *(GCM_HTAB), gcm_ctx_t *(GDATA_CTX), IV,
 // Additional Authentication data (A_IN), Additional Data length (A_LEN).
 // Output: Updated GDATA_CTX with the hash of A_IN (AadHash) and initialized
-// other parts of GDATA.
+// other parts of GDATA_CTX.
 // Clobbers rax, r10-r13 and xmm0-xmm6
 ////////////////////////////////////////////////////////////////////////////////
-.macro  GCM_INIT	GDATA_KEY, GDATA_CTX, IV, A_IN, A_LEN
+.macro  GCM_INIT	GCM_HTAB, GDATA_CTX, IV, A_IN, A_LEN, TAG_LEN
 
 #define AAD_HASH	%xmm0
 #define SUBHASH		%xmm1
 
-	movdqu  HashKey(\GDATA_KEY), SUBHASH
+	movdqu  HashKey(\GCM_HTAB), SUBHASH
 
 	CALC_AAD_HASH \A_IN, \A_LEN, AAD_HASH, SUBHASH, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %r10, %r11, %r12, %r13, %rax
 	pxor	%xmm3, %xmm2
-	mov	\A_LEN, %r10
+	movq	\A_LEN, %r10			// %r10 = AAD length
 
-	movdqu	AAD_HASH, AadHash(\GDATA_CTX)	// ctx_data.aad hash = aad_hash
-	mov	%r10, AadLen(\GDATA_CTX)		// ctx_data.aad_length = aad_length
+	movdqu	AAD_HASH, AadHash(\GDATA_CTX)	// gcm_ctx.gcm_ghash = aad_hash
+	movq	%r10, AadLen(\GDATA_CTX)	// gcm_ctx->gcm_len_a_len_c[0] = aad_length
+	movq	\TAG_LEN, %r10			// %r10 = aad_tag_len
+	movq	%r10, TagLen(\GDATA_CTX)	// gcm_ctx->gcm_tag_len = aad_tag_len
 	xor	%r10, %r10
-	mov	%r10, InLen(\GDATA_CTX)		// ctx_data.in_length = 0
-	mov	 %r10, PBlockLen(\GDATA_CTX)	// ctx_data.partial_block_length = 0
-	movdqu	%xmm2, PBlockEncKey(\GDATA_CTX)	// ctx_data.partial_block_enc_key = 0
-	mov	\IV, %r10
-	movdqa  ONEf(%rip), %xmm2			// read 12 IV bytes and pad with 0x00000001
+	movq	%r10, InLen(\GDATA_CTX)		// gcm_ctx.gcm_processed_data_len = 0
+	movq	%r10, PBlockLen(\GDATA_CTX)	// gcm_ctx.gcm_remainder_len = 0
+	movdqu	%xmm2, PBlockEncKey(\GDATA_CTX)	// XXXX last counter block ???? gcm_ctx.gcm_remainder = 0
+	movq	\IV, %r10
+	movdqa  ONEf(%rip), %xmm2		// read 12 IV bytes and pad with 0x00000001
 	pinsrq  $0, (%r10), %xmm2
 	pinsrd  $2, 8(%r10), %xmm2
-	movdqu	%xmm2, OrigIV(\GDATA_CTX)	// ctx_data.orig_IV = iv
+	movdqu	%xmm2, OrigIV(\GDATA_CTX)	// gcm_ctx.gcm_J0 = CTR0
 
 	pshufb	SHUF_MASK(%rip), %xmm2
 
@@ -1535,15 +1539,15 @@ _initial_blocks_done_\@:
 // gcm_context_data struct has been initialized by GCM_INIT.
 // Requires the input data be at least 1 byte long because of
 // READ_SMALL_INPUT_DATA.
-// Input: gcm_key_data * (GDATA_KEY), gcm_context_data (GDATA_CTX),
+// Input: gcm_key_data * (KEYSCHED, GCM_HTAB), gcm_context_data (GDATA_CTX),
 // input text (PLAIN_CYPH_IN), input text length (PLAIN_CYPH_LEN) and whether
 // encoding or decoding (ENC_DEC).
 // Output: A cypher of the given plain text (CYPH_PLAIN_OUT), and updated
 // GDATA_CTX
 // Clobbers rax, r10-r15, and xmm0-xmm15
 ////////////////////////////////////////////////////////////////////////////////
-.macro	GCM_ENC_DEC	GDATA_KEY, GDATA_CTX, CYPH_PLAIN_OUT, PLAIN_CYPH_IN, \
-			PLAIN_CYPH_LEN, ENC_DEC
+.macro	GCM_ENC_DEC	KEYSCHED, GCM_HTAB, GDATA_CTX, CYPH_PLAIN_OUT, \
+			PLAIN_CYPH_IN, PLAIN_CYPH_LEN, ENC_DEC
 
 #define	DATA_OFFSET		%r11
 
@@ -1567,11 +1571,10 @@ _initial_blocks_done_\@:
 
 	xor	DATA_OFFSET, DATA_OFFSET
 	add	\PLAIN_CYPH_LEN, InLen(\GDATA_CTX)	//Update length of data processed
-	movdqu	HashKey(\GDATA_KEY), %xmm13		// xmm13 = HashKey
+	movdqu	HashKey(\GCM_HTAB), %xmm13		// xmm13 = HashKey
 	movdqu	AadHash(\GDATA_CTX), %xmm8
 
-
-	PARTIAL_BLOCK \GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, \PLAIN_CYPH_LEN, DATA_OFFSET, %xmm8, \ENC_DEC
+	PARTIAL_BLOCK \GCM_HTAB, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, \PLAIN_CYPH_LEN, DATA_OFFSET, %xmm8, \ENC_DEC
 
 	mov	\PLAIN_CYPH_LEN, %r13                                // save the number of bytes of plaintext/ciphertext
 	sub	DATA_OFFSET, %r13
@@ -1600,42 +1603,42 @@ _initial_blocks_done_\@:
 	jmp     _initial_num_blocks_is_1_\@
 
 _initial_num_blocks_is_7_\@:
-	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 7, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	INITIAL_BLOCKS	\KEYSCHED, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 7, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
 	sub	$(16*7), %r13
 	jmp     _initial_blocks_encrypted_\@
 
 _initial_num_blocks_is_6_\@:
-	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 6, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	INITIAL_BLOCKS	\KEYSCHED, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 6, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
 	sub	$(16*6), %r13
 	jmp     _initial_blocks_encrypted_\@
 
 _initial_num_blocks_is_5_\@:
-	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 5, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	INITIAL_BLOCKS	\KEYSCHED, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 5, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
 	sub	$(16*5), %r13
 	jmp     _initial_blocks_encrypted_\@
 
 _initial_num_blocks_is_4_\@:
-	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 4, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	INITIAL_BLOCKS	\KEYSCHED, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 4, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
 	sub	$(16*4), %r13
 	jmp     _initial_blocks_encrypted_\@
 
 _initial_num_blocks_is_3_\@:
-	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 3, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	INITIAL_BLOCKS	\KEYSCHED, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 3, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
 	sub	$(16*3), %r13
 	jmp     _initial_blocks_encrypted_\@
 
 _initial_num_blocks_is_2_\@:
-	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 2, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	INITIAL_BLOCKS	\KEYSCHED, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 2, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
 	sub	$(16*2), %r13
 	jmp     _initial_blocks_encrypted_\@
 
 _initial_num_blocks_is_1_\@:
-	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 1, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	INITIAL_BLOCKS	\KEYSCHED, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 1, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
 	sub	$(16*1), %r13
 	jmp     _initial_blocks_encrypted_\@
 
 _initial_num_blocks_is_0_\@:
-	INITIAL_BLOCKS	\GDATA_KEY, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 0, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
+	INITIAL_BLOCKS	\KEYSCHED, \GDATA_CTX, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, %r13, DATA_OFFSET, 0, %xmm12, %xmm13, %xmm14, %xmm15, %xmm11, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm10, %xmm0, \ENC_DEC
 
 _initial_blocks_encrypted_\@:
 	cmp	$0, %r13
@@ -1654,7 +1657,7 @@ _encrypt_by_8_new_\@:
 	jg      _encrypt_by_8_\@
 
 	add	$8, %r15b
-	GHASH_8_ENCRYPT_8_PARALLEL	\GDATA_KEY, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, DATA_OFFSET, %xmm0, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm15, out_order, \ENC_DEC
+	GHASH_8_ENCRYPT_8_PARALLEL	\KEYSCHED, \GCM_HTAB, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, DATA_OFFSET, %xmm0, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm15, out_order, \ENC_DEC
 	add	$128, DATA_OFFSET
 	sub	$128, %r13
 	jne     _encrypt_by_8_new_\@
@@ -1666,7 +1669,7 @@ _encrypt_by_8_\@:
 	pshufb	SHUF_MASK(%rip), %xmm9
 	add	$8, %r15b
 
-	GHASH_8_ENCRYPT_8_PARALLEL	\GDATA_KEY, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, DATA_OFFSET, %xmm0, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm15, in_order, \ENC_DEC
+	GHASH_8_ENCRYPT_8_PARALLEL	\KEYSCHED, \GCM_HTAB, \CYPH_PLAIN_OUT, \PLAIN_CYPH_IN, DATA_OFFSET, %xmm0, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm9, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8, %xmm15, in_order, \ENC_DEC
 	pshufb	SHUF_MASK(%rip), %xmm9
 	add	$128, DATA_OFFSET
 	sub	$128, %r13
@@ -1677,12 +1680,12 @@ _encrypt_by_8_\@:
 
 
 _eight_cipher_left_\@:
-	GHASH_LAST_8	\GDATA_KEY, %xmm0, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8
+	GHASH_LAST_8	\GCM_HTAB, %xmm0, %xmm10, %xmm11, %xmm12, %xmm13, %xmm14, %xmm15, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5, %xmm6, %xmm7, %xmm8
 
 
 _zero_cipher_left_\@:
-	movdqu	%xmm14, AadHash(\GDATA_CTX) 
-	movdqu	%xmm9, CurCount(\GDATA_CTX) 
+	movdqu	%xmm14, AadHash(\GDATA_CTX)
+	movdqu	%xmm9, CurCount(\GDATA_CTX)
 
 	mov	%r10, %r13
 	and	$15, %r13			// r13 = (\PLAIN_CYPH_LEN mod 16)
@@ -1695,7 +1698,7 @@ _zero_cipher_left_\@:
 	paddd	ONE(%rip), %xmm9			// INCR CNT to get Yn
 	movdqu	%xmm9, CurCount(\GDATA_CTX)	// my_ctx.data.current_counter = xmm9
 	pshufb  SHUF_MASK(%rip), %xmm9
-	ENCRYPT_SINGLE_BLOCK	\GDATA_KEY, %xmm9, %xmm2	// E(K, Yn)
+	ENCRYPT_SINGLE_BLOCK	\KEYSCHED, %xmm9, %xmm2	// E(K, Yn)
 	movdqu	%xmm9, PBlockEncKey(\GDATA_CTX)	// my_ctx_data.partial_block_enc_key = xmm9
 
 	cmp	$16, \PLAIN_CYPH_LEN
@@ -1774,13 +1777,12 @@ _multiple_of_16_bytes_\@:
 ////////////////////////////////////////////////////////////////////////////////
 // GCM_COMPLETE: Finishes Encyrption/Decryption of last partial block after
 // GCM_UPDATE finishes.
-// Input: A gcm_key_data * (GDATA_KEY), gcm_context_data * (GDATA_CTX) and
-// whether encoding or decoding (ENC_DEC).
-// Output: Authorization Tag (AUTH_TAG) and Authorization Tag length
-// (AUTH_TAG_LEN)
+// Input: A gcm_key_data * (KEYSCHED, GCM_HTAB), gcm_context_data * (GDATA_CTX)
+//  and whether encoding or decoding (ENC_DEC).
+// Output: Authorization Tag (AUTH_TAG) stored in gcm_ctx.gcm_ghash
 // Clobbers %rax, r10-r12, and xmm0, xmm1, xmm5, xmm6, xmm9, xmm11, xmm14, xmm15
 ////////////////////////////////////////////////////////////////////////////////
-.macro	GCM_COMPLETE	GDATA_KEY, GDATA_CTX, AUTH_TAG, AUTH_TAG_LEN, ENC_DEC
+.macro	GCM_COMPLETE	KEYSCHED, GCM_HTAB, GDATA_CTX, ENC_DEC
 
 #define	PLAIN_CYPH_LEN	%rax
 
@@ -1789,7 +1791,7 @@ _multiple_of_16_bytes_\@:
 
 	mov	PBlockLen(\GDATA_CTX), %r12	// r12 = aadLen (number of bytes)
 	movdqu	AadHash(\GDATA_CTX), %xmm14
-	movdqu	HashKey(\GDATA_KEY), %xmm13
+	movdqu	HashKey(\GCM_HTAB), %xmm13
 
 	cmp	$0, %r12
 
@@ -1803,26 +1805,32 @@ _partial_done_\@:
 	mov	AadLen(\GDATA_CTX), %r12		// r12 = aadLen (number of bytes)
 	mov	InLen(\GDATA_CTX), PLAIN_CYPH_LEN
 
-	shl	$3, %r12			// convert into number of bits
+	shl	$3, %r12		// convert into number of bits
 	movd	%r12d, %xmm15		// len(A) in xmm15
 
 	shl	$3, PLAIN_CYPH_LEN	// len(C) in bits  (*128)
 	movq	PLAIN_CYPH_LEN, %xmm1
 	pslldq	$8, %xmm15		// xmm15 = len(A)|| 0x0000000000000000
 	pxor	%xmm1, %xmm15		// xmm15 = len(A)||len(C)
-
+#ifdef DEBUG
+	pshufb	SHUF_MASK(%rip), %xmm15		// perform a 16Byte swap
+	movdqu	%xmm15, LenALenC(\GDATA_CTX)
+	pshufb	SHUF_MASK(%rip), %xmm15		// undo 16Byte swap
+#endif
 	pxor	%xmm15, %xmm14
 	GHASH_MUL %xmm14, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6    // final GHASH computation
 	pshufb	SHUF_MASK(%rip), %xmm14		// perform a 16Byte swap
 	movdqu	OrigIV(\GDATA_CTX), %xmm9	// xmm9 = Y0
 
-	ENCRYPT_SINGLE_BLOCK	\GDATA_KEY, %xmm9, %xmm2	// E(K, Y0)
+	ENCRYPT_SINGLE_BLOCK	\KEYSCHED, %xmm9, %xmm2	// E(K, Y0)
 
 	pxor	%xmm14, %xmm9
 
 _return_T_\@:
-	mov	\AUTH_TAG, %r10			// r10 = authTag
-	mov	\AUTH_TAG_LEN, %r11		// r11 = auth_tag_len
+	// mov	\AUTH_TAG, %r10			// r10 = authTag
+	// mov	\AUTH_TAG_LEN, %r11		// r11 = auth_tag_len
+	lea	AadHash(\GDATA_CTX), %r10	// r10 = authTag
+	movq	TagLen(\GDATA_CTX), %r11	// r11 = auth_tag_len
 
 	cmp	$16, %r11
 	je      _T_16_\@
@@ -1830,14 +1838,14 @@ _return_T_\@:
 	cmp	$12, %r11
 	je      _T_12_\@
 
-_T_8_\@:
-	movq	%xmm9, %rax
-	mov	%rax, (%r10)
+_T_8_\@:			// XXXX: Why use intermediate reg %rax/%eax?
+	movq	%xmm9, %rax	// %rax (ret val) contains 8 bytes tag
+	movq	%rax, (%r10)
 	jmp     _return_T_done_\@
 
 _T_12_\@:
-	movq	%xmm9, %rax
-	mov	%rax, (%r10)
+	movq	%xmm9, %rax	// %rax (ret val) contains upper and lower 4 bytes of tag
+	movq	%rax, (%r10)
 	psrldq	$8, %xmm9
 	movd	%xmm9, %eax
 	mov	%eax, 8(%r10)
@@ -1850,37 +1858,34 @@ _return_T_done_\@:
 .endm //GCM_COMPLETE
 
 
-#if 1
-
-	.balign 16
 ////////////////////////////////////////////////////////////////////////////////
-//void	aes_gcm_precomp_{128,256}_sse
-//        (struct gcm_key_data *key_data);
+// void	icp_isalc_gcm_precomp_{128,192,256}_sse(
+//          gcm_ctx_t *context_data		/* arg1 */
+// );
 ////////////////////////////////////////////////////////////////////////////////
 #if FUNCT_EXTENSION != _nt
-.global FN_NAME(precomp,_)
-FN_NAME(precomp,_):
 
-	endbranch
+ENTRY_NP(FN_NAME(precomp,_))
+.cfi_startproc
+	ENDBR
 
-	push	%r12
-	push	%r13
-	push	%r14
-	push	%r15
-
-	mov     %rsp, %r14
+	CFI_PUSHQ	%r12, -16
+	CFI_PUSHQ	%r13, -24
+	CFI_PUSHQ	%r14, -32
+	CFI_PUSHQ	%r15, -40
+	mov		%rsp, %r14
+	.cfi_def_cfa_register	%r14
 
 	sub	$(VARIABLE_OFFSET), %rsp
-	and	$(~63), %rsp				// align rsp to 64 bytes
-
-#if __OUTPUT_FORMAT__ == win64
-	// only xmm6 needs to be maintained
-	movdqu	%xmm6, (LOCAL_STORAGE + 0*16)(%rsp)
-#endif
+	and	$(~63), %rsp		// align rsp to 64 bytes
 
+	mov	KeySched(arg1), arg2	// arg2 = gcm_ctx->gcm_keysched
+	mov	GcmHtab(arg1), arg3	// arg3 = gcm_ctx->gcm_Htable
 	pxor	%xmm6, %xmm6
-	ENCRYPT_SINGLE_BLOCK	arg1, %xmm6, %xmm2	// xmm6 = HashKey
-
+	ENCRYPT_SINGLE_BLOCK	arg2, %xmm6, %xmm2	// xmm6 = HashKey
+#ifdef DEBUG
+	movdqu	%xmm6, GcmH(arg1)	// Save hash key to context.
+#endif
 	pshufb	SHUF_MASK(%rip), %xmm6
 	///////////////  PRECOMPUTATION of HashKey<<1 mod poly from the HashKey
 	movdqa	%xmm6, %xmm2
@@ -1897,254 +1902,218 @@ FN_NAME(precomp,_):
 	pand	POLY(%rip), %xmm2
 	pxor	%xmm2, %xmm6		// xmm6 holds the HashKey<<1 mod poly
 	///////////////////////////////////////////////////////////////////////
-	movdqu	 %xmm6, HashKey(arg1)	// store HashKey<<1 mod poly
+	movdqu	 %xmm6, HashKey(arg3)	// store HashKey<<1 mod poly
 
-	PRECOMPUTE  arg1, %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5
+	PRECOMPUTE  arg3, %xmm6, %xmm0, %xmm1, %xmm2, %xmm3, %xmm4, %xmm5
 
-#if __OUTPUT_FORMAT__ == win64
-	movdqu (LOCAL_STORAGE + 0*16)(%rsp), %xmm6
-#endif
-	mov	%r14, %rsp
+	mov		%r14, %rsp
+	.cfi_def_cfa_register	%rsp
+	CFI_POPQ	%r15
+	CFI_POPQ	%r14
+	CFI_POPQ	%r13
+	CFI_POPQ	%r12
+	RET
+.cfi_endproc
+SET_SIZE(FN_NAME(precomp,_))
 
-	pop	%r15
-	pop	%r14
-	pop	%r13
-	pop	%r12
-	ret
 #endif	// _nt
 
 
 ////////////////////////////////////////////////////////////////////////////////
-//void   aes_gcm_init_128_sse / aes_gcm_init_256_sse (
-//        const struct gcm_key_data *key_data,
-//        struct gcm_context_data *context_data,
-//        u8      *iv,
-//        const   u8 *aad,
-//        u64     aad_len);
+// void   icp_isalc_gcm_init_{128,192,256}_sse
+//            gcm_ctx_t		*context_data,	/* arg1 */
+//            const uint8_t	*iv,		/* arg2 */
+//            const uint8_t	*aad,		/* arg3 */
+//            uint64_t		aad_len		/* arg4 */
+//            uint64_t		tag_len		/* arg5 */
+// );
 ////////////////////////////////////////////////////////////////////////////////
 #if FUNCT_EXTENSION != _nt
-.global FN_NAME(init,_)
-FN_NAME(init,_):
-	endbranch
-
-	push	%r12
-	push	%r13
-#if __OUTPUT_FORMAT__ == win64
-	push	arg5
-	sub	$(1*16), %rsp
-	movdqu	%xmm6, (0*16)(%rsp)
-	mov	(1*16 + 8*3 + 8*5)(%rsp), arg5
-#endif
+ENTRY_NP(FN_NAME(init,_))
+.cfi_startproc
+	ENDBR
 
-	GCM_INIT arg1, arg2, arg3, arg4, arg5
+	CFI_PUSHQ	%r12, -16
+	CFI_PUSHQ	%r13, -24
 
-#if __OUTPUT_FORMAT__ == win64
-	movdqu	(0*16)(%rsp), %xmm6
-	add	$(1*16), %rsp
-	pop	arg5
-#endif
-	pop	%r13
-	pop	%r12
-	ret
-#endif	// _nt
+	mov	 GcmHtab(arg1), arg6		// arg5 = gcm_ctx->gcm_Htable
+	GCM_INIT arg6, arg1, arg2, arg3, arg4, arg5
 
+	CFI_POPQ	%r13
+	CFI_POPQ	%r12
+	RET
+.cfi_endproc
+SET_SIZE(FN_NAME(init,_))
+#endif	// _nt
 
 ////////////////////////////////////////////////////////////////////////////////
-//void   aes_gcm_enc_128_update_sse / aes_gcm_enc_256_update_sse
-//        const struct gcm_key_data *key_data,
-//        struct gcm_context_data *context_data,
-//        u8      *out,
-//        const   u8 *in,
-//        u64     plaintext_len);
+// void   icp_isalc_gcm_enc_{128,192,256}_update_sse(
+//            gcm_ctx_t		*context_data,	/* arg1 */
+//            uint8_t		*out,		/* arg2 */
+//            const uint8_t	*in,		/* arg3 */
+//            uint64_t		plaintext_len	/* arg4 */
+// );
 ////////////////////////////////////////////////////////////////////////////////
-.global FN_NAME(enc,_update_)
-FN_NAME(enc,_update_):
-	endbranch
+ENTRY_NP(FN_NAME(enc,_update_))
+.cfi_startproc
+	ENDBR
 
 	FUNC_SAVE
 
-	GCM_ENC_DEC arg1, arg2, arg3, arg4, arg5, ENC
+	movq	KeySched(arg1), arg5	// arg5 = gcm_ctx->gcm_keysched
+	movq	GcmHtab(arg1), arg6	// arg6 = gcm_ctx->gcm_Htable
 
-	FUNC_RESTORE
+	GCM_ENC_DEC arg5, arg6, arg1, arg2, arg3, arg4, ENC
 
-	ret
+	FUNC_RESTORE
 
+	RET
+.cfi_endproc
+SET_SIZE(FN_NAME(enc,_update_))
 
 ////////////////////////////////////////////////////////////////////////////////
-//void   aes_gcm_dec_256_update_sse / aes_gcm_dec_256_update_sse
-//        const struct gcm_key_data *key_data,
-//        struct gcm_context_data *context_data,
-//        u8      *out,
-//        const   u8 *in,
-//        u64     plaintext_len);
+// void   icp_isalc_gcm_dec_{128,192,256}_update_sse(
+//            gcm_ctx_t		*context_data,	/* arg1 */
+//            uint8_t		*out,		/* arg2 */
+//            const uint8_t	*in,		/* arg3 */
+//            uint64_t		plaintext_len	/* arg4 */
+// );
 ////////////////////////////////////////////////////////////////////////////////
-.global FN_NAME(dec,_update_)
-FN_NAME(dec,_update_):
-	endbranch
+ENTRY_NP(FN_NAME(dec,_update_))
+.cfi_startproc
+	ENDBR
 
 	FUNC_SAVE
 
-	GCM_ENC_DEC arg1, arg2, arg3, arg4, arg5, DEC
+	mov	KeySched(arg1), arg5	// arg5 = gcm_ctx->gcm_keysched
+	mov	GcmHtab(arg1), arg6	// arg6 = gcm_ctx->gcm_Htable
+
+	GCM_ENC_DEC arg5, arg6, arg1, arg2, arg3, arg4, DEC
 
 	FUNC_RESTORE
 
-	ret
+	RET
+.cfi_endproc
+SET_SIZE(FN_NAME(dec,_update_))
 
 
 ////////////////////////////////////////////////////////////////////////////////
-//void   aes_gcm_enc_128_finalize_sse / aes_gcm_enc_256_finalize_sse
-//        const struct gcm_key_data *key_data,
-//        struct gcm_context_data *context_data,
-//        u8      *auth_tag,
-//        u64     auth_tag_len);
+// void   icp_isalc_gcm_enc_{128,192,256}_finalize_sse(
+//           gcm_ctx_t	*context_data,	/* arg1 */
+// );
 ////////////////////////////////////////////////////////////////////////////////
 #if FUNCT_EXTENSION != _nt
-.global FN_NAME(enc,_finalize_)
-FN_NAME(enc,_finalize_):
+ENTRY_NP(FN_NAME(enc,_finalize_))
+.cfi_startproc
+	ENDBR
 
-	endbranch
+	CFI_PUSHQ	%r12, -16
 
-	push	%r12
+	movq		KeySched(arg1), arg2	// arg4 = gcm_ctx->gcm_keysched
+	movq		GcmHtab(arg1), arg3	// arg5 = gcm_ctx->gcm_Htable
 
-#if __OUTPUT_FORMAT__ == win64
-	// xmm6:xmm15 need to be maintained for Windows
-	sub	$(5*16), %rsp
-	movdqu	%xmm6, (0*16)(%rsp)
-	movdqu	%xmm9, (1*16)(%rsp)
-	movdqu	%xmm11, (2*16)(%rsp)
-	movdqu	%xmm14, (3*16)(%rsp)
-	movdqu	%xmm15, (4*16)(%rsp)
-#endif
-	GCM_COMPLETE	arg1, arg2, arg3, arg4, ENC
-
-#if __OUTPUT_FORMAT__ == win64
-	movdqu	(4*16)(%rsp), %xmm15
-	movdqu	(3*16)(%rsp), %xmm14
-	movdqu	(2*16)(%rsp), %xmm11
-	movdqu	(1*16)(%rsp), %xmm9
-	movdqu	(0*16)(%rsp), %xmm6
-	add	$(5*16), %rsp
-#endif
+	GCM_COMPLETE	arg2, arg3, arg1, ENC
 
-	pop	%r12
-	ret
+	CFI_POPQ	%r12
+	RET
+.cfi_endproc
+SET_SIZE(FN_NAME(enc,_finalize_))
 #endif	// _nt
 
 
 ////////////////////////////////////////////////////////////////////////////////
-//void   aes_gcm_dec_128_finalize_sse / aes_gcm_dec_256_finalize_sse
-//        const struct gcm_key_data *key_data,
-//        struct gcm_context_data *context_data,
-//        u8      *auth_tag,
-//        u64     auth_tag_len);
+// void   icp_isalc_gcm_dec_{128,129,256}_finalize_sse(
+//           gcm_ctx_t	*context_data,	/* arg1 */
+// );
 ////////////////////////////////////////////////////////////////////////////////
 #if FUNCT_EXTENSION != _nt
-.global FN_NAME(dec,_finalize_)
-FN_NAME(dec,_finalize_):
+ENTRY_NP(FN_NAME(dec,_finalize_))
+.cfi_startproc
+	ENDBR
 
-	endbranch
+	CFI_PUSHQ	%r12, -16
 
-	push	%r12
+	movq	KeySched(arg1), arg2	// arg4 = gcm_ctx->gcm_keysched
+	movq	GcmHtab(arg1), arg3	// arg5 = gcm_ctx->gcm_Htable
 
-#if __OUTPUT_FORMAT == win64
-	// xmm6:xmm15 need to be maintained for Windows
-	sub	$(5*16), %rsp
-	movdqu	%xmm6, (0*16)(%rsp)
-	movdqu	%xmm9, (1*16)(%rsp)
-	movdqu	%xmm11, (2*16)(%rsp)
-	movdqu	%xmm14, (3*16)(%rsp)
-	movdqu	%xmm15, (4*16)(%rsp)
-#endif
-	GCM_COMPLETE	arg1, arg2, arg3, arg4, DEC
-
-#if __OUTPUT_FORMAT__ == win64
-	movdqu	(4*16)(%rsp), %xmm15
-	movdqu	(3*16)(%rsp), %xmm14
-	movdqu	(2*16)(%rsp), %xmm11
-	movdqu	(1*16)(%rsp), %xmm9
-	movdqu	(0*16)(%rsp), %xmm6
-	add	$(5*16), %rsp
-#endif
+	GCM_COMPLETE	arg2, arg3, arg1, DEC
 
-	pop	%r12
-	ret
+	CFI_POPQ	%r12
+	RET
+.cfi_endproc
+SET_SIZE(FN_NAME(dec,_finalize_))
 #endif	// _nt
 
-
 ////////////////////////////////////////////////////////////////////////////////
-//void   aes_gcm_enc_128_sse / aes_gcm_enc_256_sse
-//        const struct gcm_key_data *key_data,
-//        struct gcm_context_data *context_data,
-//        u8      *out,
-//        const   u8 *in,
-//        u64     plaintext_len,
-//        u8      *iv,
-//        const   u8 *aad,
-//        u64     aad_len,
-//        u8      *auth_tag,
-//        u64     auth_tag_len)//
+// void   icp_isalc_gcm_enc_{128,192,256}_sse(
+//            gcm_ctx_t		*context_data,	/* arg1 */
+//            uint8_t		*out,		/* arg2 */
+//            const uint8_t	*in,		/* arg3 */
+//            uint64_t		plaintext_len,	/* arg4 */
+//            const uint8_t	*iv,		/* arg5 */
+//            const uint8_t	*aad,		/* arg6 */
+//            uint64_t		aad_len,	/* arg7 */
+//            uint64_t		tag_len,	/* arg8 */
+// );
 ////////////////////////////////////////////////////////////////////////////////
-.global FN_NAME(enc,_)
-FN_NAME(enc,_):
-	endbranch
+ENTRY_NP(FN_NAME(enc,_))
+.cfi_startproc
+	ENDBR
 
 	FUNC_SAVE
 
-	GCM_INIT arg1, arg2, arg6, arg7, arg8
+	pushq	arg2
+	movq	GcmHtab(arg1), arg2	// arg2 = gcm_ctx->gcm_Htable
 
-	GCM_ENC_DEC  arg1, arg2, arg3, arg4, arg5, ENC
+	GCM_INIT	arg2, arg1, arg5, arg6, arg7, arg8
+
+	popq	arg2
+	mov	KeySched(arg1), arg5	// arg5 = gcm_ctx->gcm_keysched
+	mov	GcmHtab(arg1), arg6	// arg6 = gcm_ctx->gcm_Htable
+
+	GCM_ENC_DEC  arg5, arg6, arg1, arg2, arg3, arg4, ENC
+	GCM_COMPLETE arg5, arg6, arg1, ENC
 
-	GCM_COMPLETE arg1, arg2, arg9, arg10, ENC
 	FUNC_RESTORE
 
-	ret
+	RET
+.cfi_endproc
+SET_SIZE(FN_NAME(enc,_))
 
 ////////////////////////////////////////////////////////////////////////////////
-//void   aes_gcm_dec_128_sse / aes_gcm_dec_256_sse
-//        const struct gcm_key_data *key_data,
-//        struct gcm_context_data *context_data,
-//        u8      *out,
-//        const   u8 *in,
-//        u64     plaintext_len,
-//        u8      *iv,
-//        const   u8 *aad,
-//        u64     aad_len,
-//        u8      *auth_tag,
-//        u64     auth_tag_len)//
+// void   icp_isalc_gcm_dec_{128,192,256}_sse(
+//            gcm_ctx_t *context_data,		/* arg1 */
+//            u8        *out,			/* arg2 */
+//            const u8  *in,			/* arg3 */
+//            u64       plaintext_len,		/* arg4 */
+//            u8        *iv,			/* arg5 */
+//            const u8  *aad,			/* arg6 */
+//            u64       aad_len,		/* arg7 */
+//            u64       tag_len,		/* arg8 */
+// );
 ////////////////////////////////////////////////////////////////////////////////
-.global FN_NAME(dec,_)
-FN_NAME(dec,_):
-	endbranch
+ENTRY_NP(FN_NAME(dec,_))
+.cfi_startproc
+	ENDBR
 
 	FUNC_SAVE
 
-	GCM_INIT arg1, arg2, arg6, arg7, arg8
+	pushq	arg2
+	movq	GcmHtab(arg1), arg2	// arg2 = gcm_ctx->gcm_Htable
 
-	GCM_ENC_DEC  arg1, arg2, arg3, arg4, arg5, DEC
+	GCM_INIT	arg2, arg1, arg5, arg6, arg7, arg8
 
-	GCM_COMPLETE arg1, arg2, arg9, arg10, DEC
-	FUNC_RESTORE
+	popq	arg2
+	mov	KeySched(arg1), arg5	// arg5 = gcm_ctx->gcm_keysched
+	mov	GcmHtab(arg1), arg6	// arg6 = gcm_ctx->gcm_Htable
 
-	ret
+	GCM_ENC_DEC  arg5, arg6, arg1, arg2, arg3, arg4, DEC
+	GCM_COMPLETE arg5, arg6, arg1, DEC
 
-.global FN_NAME(this_is_gas,_)
-FN_NAME(this_is_gas,_):
-	endbranch
-	FUNC_SAVE
 	FUNC_RESTORE
-	ret
 
-#else
-	// GAS doesnt't provide the linenuber in the macro
-	////////////////////////
-	// GHASH_MUL xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6
-	// PRECOMPUTE rax, xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6
-	// READ_SMALL_DATA_INPUT xmm1, r10, 8, rax, r12, r15
-	// ENCRYPT_SINGLE_BLOCK rax, xmm0, xmm1
-	// INITIAL_BLOCKS rdi,rsi,rdx,rcx,r13,r11,7,xmm12,xmm13,xmm14,xmm15,xmm11,xmm9,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,xmm7,xmm8,xmm10,xmm0,ENC
-	// CALC_AAD_HASH [r14+8*5+8*1],[r14+8*5+8*2],xmm0,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,r10,r11,r12,r13,rax
-	// READ_SMALL_DATA_INPUT	xmm2, r10, r11, r12, r13, rax
-	// PARTIAL_BLOCK  rdi,rsi,rdx,rcx,r8,r11,xmm8,ENC
-	// GHASH_8_ENCRYPT_8_PARALLEL rdi,rdx,rcx,r11,xmm0,xmm10,xmm11,xmm12,xmm13,xmm14,xmm9,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,xmm7,xmm8,xmm15,out_order,ENC
-	//GHASH_LAST_8 rdi,xmm0,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15,xmm1,xmm2,xmm3,xmm4,xmm5,xmm6,xmm7,xmm8
-#endif
+	RET
+.cfi_endproc
+SET_SIZE(FN_NAME(dec,_))
+
+// -eof-
diff --git a/module/icp/asm-x86_64/modes/isalc_reg_sizes.S b/module/icp/asm-x86_64/modes/isalc_reg_sizes.S
index d77291ce58a1..3475264d2e78 100644
--- a/module/icp/asm-x86_64/modes/isalc_reg_sizes.S
+++ b/module/icp/asm-x86_64/modes/isalc_reg_sizes.S
@@ -1,4 +1,4 @@
-////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+////////////////////////////////////////////////////////////////////////////////
 //  Copyright(c) 2011-2019 Intel Corporation All rights reserved.
 //
 //  Redistribution and use in source and binary forms, with or without
@@ -25,7 +25,10 @@
 //  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 //  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 //  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
+////////////////////////////////////////////////////////////////////////////////
+
+// Port to GNU as, translation to GNU as att-syntax and adoptions for the ICP
+// Copyright(c) 2023 Attila Fülöp <attila@fueloep.org>
 
 #ifndef _REG_SIZES_ASM_
 #define _REG_SIZES_ASM_
@@ -204,12 +207,6 @@
 #endif
 
 
-#ifdef __x86_64__
-#define endbranch .byte 0xf3, 0x0f, 0x1e, 0xfa
-#else
-#define endbranch .byte 0xf3, 0x0f, 0x1e, 0xfb
-#endif
-
 #ifdef REL_TEXT
 #define WRT_OPT
 #elif __OUTPUT_FORMAT__ == elf64
diff --git a/module/icp/include/modes/modes.h b/module/icp/include/modes/modes.h
index 23bf46ab51a0..81e66e178896 100644
--- a/module/icp/include/modes/modes.h
+++ b/module/icp/include/modes/modes.h
@@ -36,14 +36,28 @@ extern "C" {
 
 /*
  * Does the build chain support all instructions needed for the GCM assembler
- * routines. AVX support should imply AES-NI and PCLMULQDQ, but make sure
- * anyhow.
+ * routines.
  */
-#if defined(__x86_64__) && defined(HAVE_AVX) && \
-    defined(HAVE_AES) && defined(HAVE_PCLMULQDQ)
+#if defined(__x86_64__) && defined(HAVE_AES) && defined(HAVE_PCLMULQDQ)
+/* XXXX: does AES + PCLMULQDQ really imply at least SSE4_1? */
 #define	CAN_USE_GCM_ASM
+
+#ifdef DEBUG
+/* Defines this to the gcm_simd_impl_t to debug. */
+#define	DEBUG_GCM_ASM	GSI_ISALC_SSE
+#endif
+#if defined(HAVE_SSE4_1)
+#define	CAN_USE_GCM_ASM_SSE
+#endif
+#if defined(HAVE_AVX)
+#define	CAN_USE_GCM_ASM_AVX
 extern boolean_t gcm_avx_can_use_movbe;
 #endif
+#if defined(HAVE_AVX2)
+#define	CAN_USE_GCM_ASM_AVX2
+#endif
+/* TODO: Add VAES/AVX512 */
+#endif /* defined(__x86_64__) && defined(HAVE_AES) && defined(HAVE_PCLMULQDQ) */
 
 #define	ECB_MODE			0x00000002
 #define	CBC_MODE			0x00000004
@@ -183,6 +197,35 @@ typedef struct ccm_ctx {
 #define	ccm_copy_to		ccm_common.cc_copy_to
 #define	ccm_flags		ccm_common.cc_flags
 
+#if defined(CAN_USE_GCM_ASM)
+/*
+ * enum gcm_simd_impl holds the types of the implemented gcm asm routines for
+ * the various x86 SIMD extensions. Please note that other parts of the code
+ * depends on the order given below, so do not change the order and append new
+ * implementations at the end, but before GSI_NUM_IMPL.
+ */
+typedef enum gcm_simd_impl {
+	GSI_NONE,
+	GSI_OSSL_AVX,
+	GSI_ISALC_SSE,
+	GSI_NUM_IMPL
+} gcm_simd_impl_t;
+
+#define	GSI_ISALC_FIRST_IMPL	((int)GSI_ISALC_SSE)
+#define	GSI_ISALC_LAST_IMPL	((int)GSI_ISALC_SSE)
+
+/*
+ * XXXX: Serves as a template to remind us what to do if adding an isalc impl
+ * #ifdef CAN_USE_GCM_ASM_AVX2
+ * #undef GSI_ISALC_LAST_IMPL
+ * #define	GSI_ISALC_LAST_IMPL	((int)GSI_ISALC_AVX2)
+ * #endif
+ */
+
+#define	GSI_ISALC_NUM_IMPL	(GSI_ISALC_LAST_IMPL - GSI_ISALC_FIRST_IMPL +1)
+
+#endif /* if defined(CAN_USE_GCM_ASM) */
+
 /*
  * gcm_tag_len:		Length of authentication tag.
  *
@@ -228,7 +271,11 @@ typedef struct gcm_ctx {
 	uint64_t gcm_len_a_len_c[2];
 	uint8_t *gcm_pt_buf;
 #ifdef CAN_USE_GCM_ASM
-	boolean_t gcm_use_avx;
+	gcm_simd_impl_t gcm_simd_impl;
+#ifdef DEBUG_GCM_ASM
+	struct gcm_ctx *gcm_shadow_ctx;
+	boolean_t gcm_is_shadow;
+#endif
 #endif
 } gcm_ctx_t;
 
diff --git a/module/icp/io/aes.c b/module/icp/io/aes.c
index d6f01304f56b..0e146a53d522 100644
--- a/module/icp/io/aes.c
+++ b/module/icp/io/aes.c
@@ -1095,7 +1095,6 @@ aes_decrypt_atomic(crypto_mechanism_t *mechanism,
 	} else if (aes_ctx.ac_flags & (GCM_MODE|GMAC_MODE)) {
 		gcm_clear_ctx((gcm_ctx_t *)&aes_ctx);
 	}
-
 	return (ret);
 }