Note about imprementing major cryptographic Hash functions(SHA256, SHA512, SHA1, MD5, RIPEMD169) with standard C11. Entire codes in gist are:
These hash are similar algorithms. Once writing one of them, writing others would be easy.
Port from SHA256 ECMAScript implementation
At first, I wrote the SHA256 function with pure ECMAScript.
The implementation is very slower than other implementation: coreutils sha256sum
and crypto
in node.js.
So I want to comparing them with WebAssembly based implementations. Before the writing pure WebAssembly program, I use the emscripten to generate from C implementations. Then I make the C11 SHA256 implementation from experiments of ECMAScript version.
Implementing SHA256 with C11 is easier than with ECMAScript, because:
- 32bit unsigned integer operations exists: no overflow/minus sign processing required
- SHA256 not required memory management: no
malloc()
andfree()
Used standard library functions are just memcpy()
and memset()
in <string.h>
.
They are easy to implement yourself (when no platform dependent speed up codes).
The export API’s are similar with ECMAScript versions: sha256_init()
, sha256_update()
and sha256_digest()
.
These arguments are pointers for memories allocated on the stack (local variables) as:
#include <stdio.h>
#include "sha256.h"
int main() {
struct sha256 ctx;
unsigned char digest[SHA256_HASH_BYTES];
char* buffer = "Hello World!";
sha256_init(&ctx);
sha256_update(&ctx, buffer, 12);
sha256_digest(&ctx, digest);
for (size_t i = 0; i < SHA256_HASH_BYTES; i++) printf("%02x", digest[i]);
printf(" \"%s\"\n", buffer);
return 0;
}
About function to big endian
I prefer to use union
to convert on memory layout than bit operations as:
static inline uint32_t big32(uint32_t v) {
union {
struct {uint8_t b0, b1, b2, b3;};
uint32_t v;
} le = {.v = v};
union {
struct {uint8_t b3, b2, b1, b0;};
uint32_t v;
} be = {.b0 = le.b0, .b1 = le.b1, .b2 = le.b2, .b3 = le.b3};
return be.v;
}
When using bit operations to convert endian, I prefer filter cascading:
static inline uint32_t big32(uint32_t v) {
v = ((v & 0xff00ff00) >> 8) | ((v & 00ff00ff) << 8);
return ((v & 0xffff0000) >> 16) | ((v & 0000ffff) << 16));
}
In standard C, there are no method to judge host endians at compile time.
But gcc and clang has
predefined macros: __BYTE_ORDER__
, __ORDER_BIG_ENDIAN__
.
It can print as:
$ cc -dMN -E - < /dev/null | grep BYTE_ORDER
#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__
Manual loop extraction
In invertal update()
function, the loops for mxing chunks to the vector is not effective
because all variable assigned at last but only few variables(for example, d
and h
in SHA256)
updated in each step, other assignments are just rotating.
// loop version
uint32_t a = sh[0], b = sh[1], c = sh[2], d = sh[3],
e = sh[4], f = sh[5], g = sh[6], h = sh[7];
for (int i = 0; i < 64; i++) {
uint32_t s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22);
uint32_t s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25);
uint32_t ch = (e & f) ^ (~e & g);
uint32_t maj = (a & b) ^ (a & c) ^ (b & c);
uint32_t t1 = h + s1 + ch + K[i] + w[i];
uint32_t t2 = s0 + maj;
h = g; g = f; f = e; e = d + t1;
d = c; c = b; b = a; a = t1 + t2;
}
When extracted loop, variable assignments are turned to parameter rotation. The step updates the variables only, others are just referred. The execution time would be reduced for the loop extraction, called loop unrolling technique.
// step as inline function
static inline void
step(uint32_t a, uint32_t b, uint32_t c, uint32_t* d,
uint32_t e, uint32_t f, uint32_t g, uint32_t* h, uint32_t w, uint32_t k) {
*h += t1(e, f, g, w, k);
*d += *h;
*h += t2(a, b, c);
}
// extracted loop with inline function
uint32_t a = sh[0], b = sh[1], c = sh[2], d = sh[3],
e = sh[4], f = sh[5], g = sh[6], h = sh[7];
step(a, b, c, &d, e, f, g, &h, w[0], K[0]);
step(h, a, b, &c, d, e, f, &g, w[1], K[1]);
step(g, h, a, &b, c, d, e, &f, w[2], K[2]);
step(f, g, h, &a, b, c, d, &e, w[3], K[3]);
step(e, f, g, &h, a, b, c, &d, w[4], K[4]);
step(d, e, f, &g, h, a, b, &c, w[5], K[5]);
step(c, d, e, &f, g, h, a, &b, w[6], K[6]);
step(b, c, d, &e, f, g, h, &a, w[7], K[7]);
step(a, b, c, &d, e, f, g, &h, w[8], K[8]);
....
step(b, c, d, &e, f, g, h, &a, w[63], K[63]);
FYI: execution times for 512MB data as:
$ dd if=/dev/zero bs=1m count=512 of=512M.data
$ time ./sha256 512M.data
real time | |
---|---|
loop (clang-4) | 3.684s |
extracted (clang-4) | 3.205s |
loop (gcc-7) | 3.052s |
extracted (gcc-7) | 3.063s |
(coreutil sha256sum) | 3.207s |
clang-4 with pragma unroll
clang-4 has loop unrolling optimizations with
#pragma unroll
for each loop statements as:
uint32_t a = sh[0], b = sh[1], c = sh[2], d = sh[3],
e = sh[4], f = sh[5], g = sh[6], h = sh[7];
#pragma unroll
for (int i = 0; i < 64; i++) {
uint32_t s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22);
uint32_t s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25);
uint32_t ch = (e & f) ^ (~e & g);
uint32_t maj = (a & b) ^ (a & c) ^ (b & c);
uint32_t t1 = h + s1 + ch + K[i] + w[i];
uint32_t t2 = s0 + maj;
h = g; g = f; f = e; e = d + t1;
d = c; c = b; b = a; a = t1 + t2;
}
Apply it at loop steps and former chunk mixing in update()
.
It turns as:
real time | |
---|---|
loop (clang-4) | 3.192s |
extracted (clang-4) | 3.058s |
With “pragma unroll”, the extracted version reaches to gcc-7’s speed.
About main function: emulating coreutils hash commands
At first, I only implemented output to coreutils compatible format as:
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 /dev/null
It only uses standard C file functions: fopen()
, fread()
and fclose()
.
note on libc getline()
Then I implement checker option “-c” at next.
It required text file line reading.
I choose the getline()
function in <stdio.h>
.
But it is functions in POSIX extension; not available in some environments.
So I just try to implement getline()
compatible functions at:
I tried both with fgets()
version and with fgetc()
version.
The fgets()
version would be faster than the fgetc()
version but the code is tricky a little.
static ssize_t
getline(char** restrict pline, size_t* restrict plsize, FILE* restrict fp) {
if (*plsize < 2) *pline = realloc(*pline, *plsize = 128);
if (feof(fp)) return -1;
char* cur = *pline;
size_t cursize = *plsize;
for (;;) {
char* ret = fgets(cur, cursize, fp);
if (ret == NULL && !feof(fp)) return -1; //=> read error
if (ret == NULL && cur == *pline) return -1; //=> read empty
char* eod = memchr(cur, '\0', cursize);
if (feof(fp) || eod[-1] == '\n') return eod - *pline;
// line continued
cursize = *plsize + 1; // last of *pline is '\0'
*pline = realloc(*pline, *plsize *= 2);
cur = *pline + *plsize - cursize;
}
}
(I think fgets()
would be better if returned the position of '\0'
).
note on the line format for sum file
The code of parsing sum lines is complex for emulating coreutils error/warning messages. The splitter of hex digest and file name is at most two, the first character allow SPACE and TAB, but second one accepts only SPACE. The file names started with SPACE areallowed for the splitter.
# blank lines and # comment lines allowed
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 /dev/null
# single space silitter is allowed
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 /dev/null
# space allowed fron t of lines
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 /dev/null
Even if invalid formatted lines are exist, they do not effect the exit result value.
CMakeFiles.txt for C programs
I added to a cmake build script. To build binaries and libraries as:
$ git clone https://gist.github.com/bellbind/2fd66a09340942b9845ef432a5c8c419 sha256
$ cd sha256
$ mkdir build
$ cd build
$ cmake ..
$ make
$ ./sha256e /dev/null
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 /dev/null
To switch the compiler with cmake options as:
$ make clean
$ cmake -DCMAKE_C_COMPILER=gcc-7 ..
$ make
Difference of other hash functions from SHA256
SHA1 from SHA256
SHA1 is:
- left cycric shift
- hash size: 5 x uint32
- loop steps: 80
- only 4 keys used used for each 20 loop steps
- loop step uses 4 diffrent non-linear function
f(b, c, d)
each 20 steps
MD5 from SHA256
MD5 is:
- little endian byte order
- left cycric shift
- hash size: 4 x uint32
- loop steps: 64
- chunk words are not mixed for values, indexes are mixed
- loop step uses 4 diffrent non-linear function
f(b, c, d)
, word index functiong(i)
and cycric shift amountS[i]
for each 16 steps
RIPEMD160 from SHA256
- little endian byte order
- left cycric shift
- hash size: 5 x uint32
- chunk words are not mixed for values, indexes are mixed
- loop steps: 80
- two variable rotations in loop step: left side and right side
- only 5 keys used for each 16 loop steps
- on both side, loop step uses 5 diffrent non-linear function
f(b, c, d)
, word index functiong(i)
and cycric shift amountS[i]
for each 16 steps - hash vector also rotated for each chunk:
h[i] = h[(i + 1) % 5] + l[(i + 2) % 5] + r[(i + 3) % 5]
SHA512 from SHA256
SHA512 is:
- expand uint64 operators
- hash size: 8 x uint64
- chunk size: 128 bytes
- bit size: uint128
- bit shift amount is diffrent
- (value of initial vector and keys are different as uint64)
- loop steps: 80
Use sha256 C implementation for emscripten (with WebAssemby)
The SHA256 C codes are also acceptable for emscripten compile without any modifications. After set up emscripten with bynaryen, for compiling with WebAssembly target as:
emcc -std=c11 -pedantic -O3 -Wall -Wextra sha256extracted.c -o sha256.js -s WASM=1 -s EXPORTED_FUNCTIONS='["_sha256_init","_sha256_update","_sha256_digest"]'
It generates as emscripten module “sha256.js” and “sha256.wasm”. But it is not friendly for usual JavaScript programming because of:
- asynchronous module initialization existed (it would not use module synchromously as in node.js)
- memory allocation and releasing required: use
let ptr = m._malloc(s)
andm._free(ptr)
- function params only integer. arrays and structs put in memory and pass its address values
- total memory size is fixed at compile time: default is 256 * 64k = 16M
So usually, the emscripten module is wrapped with more user friendly module code. (But it cannot avoid async initialization and resource releasing entirely).
The wrapper module: sha256mod.js
is:
It is a basic example for using emscripten modules.
The repository includes:
- main code “sha256main.js”: outputs hash sums and checking sum files with “-c” option.
- package.json for building with
npm install
: emscripten embironments required - run with
npx
as:
$ npx https://gist.github.com/bellbind/de02a751e52ab3a3d35e843221d3d37d /dev/null
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 /dev/null
Execution time for 512MB data as:
time | |
---|---|
emscripten module | 4.854s |
nodejs crypto module | 1.777s |
pure ECMAScript module | 1m10.004s |
(coreutils sha256sum) | 3.2007s |