BetterOS.org : an attempt to make computer machines run better

BetterOS.org : an attempt to make computer machines run better


home | better linux | games | software | tutorials | reference | web log |
index |
MD5 Musings
I am not using MD5 for security. I'm making that the first sentence so I don't get a flood of hatemail. In fact, I'm not even using it or recommending its use at all. However, companies all over the world are still using it in their software.
Although I am not using MD5, let me just clear up one misconception before I get started. MD5 is not always "insecure". MD5 is only insecure when used for verifying file integrity against intentional corruption or for password hashing or for any generated tokens that should be kept secret. I used MD5 one time for random number whitening, and I got so much hate for it in such a short time because "it's not secure!!!"
Now while there were certainly better algorithms I could have and should have used, MD5 didn't magically make my application insecure, MD5 has been broken by finding a practical method for generating collisions. I didn't care if somebody else could generate the same random number as me, the numbers I was generating were only 16 bits long anyway, that can be collided in a negligable amount of time without MD5, and they were based on noise from the microphone so a collision wouldn't even help in determining the next number the RNG would generate. MD5 had no impact whatsoever on security. That being said, MD5 was the wrong algorithm for that application, and its really the wrong algorithm for any application, security or otherwise.

However, it is still in common use in the world, so it's still worth knowing about it.

So, I took a few days an wrote myself an implementation of it. I didn't really know the algorithm and I wanted to learn how it worked and see what I could do with it. I also wanted to spend some time practicing my assembly. So I wrote the algorithm in assembly.

For reference, I used the psuedo-code on Wikipedia, which is based on reference code from the original RFC document. To make things easier on myself, I decided that at first I was going to assume that a message would be whole bytes and less than 45 bytes long. If I feel like it I will extend that later, it shouldn't be difficult.
I also wanted to see how my algorithm would perform, so I made optimizations that I thought would be beneficial, although I know there is much more I could have done.

MD5 goes through 4 sets of 16 rounds, each with slightly different functions transforming the state until the final hash is reached. Each set of rounds performs similar operations, including and, or, not, xor, additions, multiplications, and modulus, plus two bit shift operations in each round. I noticed that each multiplication needed was close to a power of two, so I replaced all of the multiplication by left shifts either addition or subtraction. This was less of an optimization and more of me just not wanting to deal with multiplication in assembly, but it probably performs well also. I also noticed that the modulus operations are always modulo 16, and 16 is a power of 2, so I replaced all "modulo 16" operations with "and 15", eliminating the need for div instructions.

I also made some modifications to the loop, to avoid checking the value of the counter twice every round, and I managed to keep everything completely in registers (except for the rotate function).

I anticipated that one of my big problems would be endianness, but it turned out to not be an issue at all. Instead my main problems were just me forgetting simple things. I decided that I wanted to use rcx as my round counter, which I actively avoided using in the main code, but I forgot that I used it in the rotate function (because the lower portion, cl, is the only register allowed as the second operator for shr and shl), so I was overwriting it every loop. Then I wrote a function to print out the state each round so I could check where it was going wrong, and I forgot that when I made the write syscall, the kernel smashes rcx and r11, which was storing the third byte of the state between rounds, so my debugging actually made things worse. I also added the wrong constant at first in rounds 48-64, but that was an easy error to spot. Then finally, I made the dumb mistake of mixing up bits and bytes. The MD5 algorith pads the message to a multiple of 512 bits, but I stupidly was padding it to 512 bytes, which does not result in the correct hash...

In the end, I figured it all out and now I am able to calculate an MD5 that matches MD5s generated by other applications. I tested the speed of my implemetation against the FreeBSD md5 implementation (1000000 repetitions), and mine came out on top being roughly 3 times faster. Now I realize that it isn't a fair test since I wrote mine in assembly, and its statically linked, and doesn't spend any time starting up and loading libraries or reading from stdin or args, so basically, my application is cheating. So I checked again, this time ignoring time spent in system calls, only time spent in userland. Using that metric, my implementation came out ahead, this time clocking in at about 5 times faster. Of course its still not fair to compare them because its not as complete of an implementation or of an application as the one I am comparing it against. However, I the comparison makes me happy.

Here is the full assembly language code I wrote for MD5:
section .data
	;s specifies the per-round shift amounts
	s:		db 7, 12, 17, 22,  7, 12, 17, 22,  7, 12, 17, 22,  7, 12, 17, 22
			db 5,  9, 14, 20,  5,  9, 14, 20,  5,  9, 14, 20,  5,  9, 14, 20
			db 4, 11, 16, 23,  4, 11, 16, 23,  4, 11, 16, 23,  4, 11, 16, 23
			db 6, 10, 15, 21,  6, 10, 15, 21,  6, 10, 15, 21,  6, 10, 15, 21

	;integer parts of sines of integers
	K:		dd 0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee
			dd 0xf57c0faf, 0x4787c62a, 0xa8304613, 0xfd469501
			dd 0x698098d8, 0x8b44f7af, 0xffff5bb1, 0x895cd7be
			dd 0x6b901122, 0xfd987193, 0xa679438e, 0x49b40821
			dd 0xf61e2562, 0xc040b340, 0x265e5a51, 0xe9b6c7aa
			dd 0xd62f105d, 0x02441453, 0xd8a1e681, 0xe7d3fbc8
			dd 0x21e1cde6, 0xc33707d6, 0xf4d50d87, 0x455a14ed
			dd 0xa9e3e905, 0xfcefa3f8, 0x676f02d9, 0x8d2a4c8a
			dd 0xfffa3942, 0x8771f681, 0x6d9d6122, 0xfde5380c
			dd 0xa4beea44, 0x4bdecfa9, 0xf6bb4b60, 0xbebfbc70
			dd 0x289b7ec6, 0xeaa127fa, 0xd4ef3085, 0x04881d05
			dd 0xd9d4d039, 0xe6db99e5, 0x1fa27cf8, 0xc4ac5665
			dd 0xf4292244, 0x432aff97, 0xab9423a7, 0xfc93a039
			dd 0x655b59c3, 0x8f0ccc92, 0xffeff47d, 0x85845dd1
			dd 0x6fa87e4f, 0xfe2ce6e0, 0xa3014314, 0x4e0811a1
			dd 0xf7537e82, 0xbd3af235, 0x2ad7d2bb, 0xeb86d391

	a0:		db 0x01,0x23,0x45,0x67
	b0:		db 0x89,0xab,0xcd,0xef
	c0:		db 0xfe,0xdc,0xba,0x98
	d0:		db 0x76,0x54,0x32,0x10

	A:		db 0x01,0x23,0x45,0x67
	B:		db 0x89,0xab,0xcd,0xef
	C:		db 0xfe,0xdc,0xba,0x98
	D:		db 0x76,0x54,0x32,0x10

	msg:	db "test"
			times 64 db 0

	hexbin: db "0123456789abcdef",10

section .text
	global _start

	_start:
		xor		rbp,rbp				;AMD64 ABI Requirement
		pop		rdi					;Get argument count
		mov		rsi,rsp				;Get argument array
		and		rsp,-16				;Align stack pointer
		push	rdi					;Save argument count
		push	rsi					;Save argument array
	main:
		lea		rdi,[msg]
		mov		rsi,32
		call	md5

		; fo testing
		mov rax,1
		mov	rdi,r8
		mov	rsi,r9
		syscall

		ret;

	; rdi = pointer to message
	; rsi = length of message (in bits)
	md5:
		call	md5_pad

	md5_loop:
		; A=a0
		; B=b0
		; C=c0
		; D=d0

		mov		r8,[A]
		mov		r9,[B]
		mov		r10,[C]
		mov		r11,[D]

		xor		rcx,rcx		; i = 0

	md5_round1:

		mov		r14,r11		;save value of D

		xor		r14,r10		;r14 = D xor C
		and		r14,r9		;r14 = B and (r14)
		xor		r14,r11		;r14 = F = D xor (r14)

		mov		r12,r11		; tmp = D
		mov		r11,r10		; D = C
		mov		r10,r9		; C = B

		mov		rdi,r8
		add		rdi,r14
		lea		r14,[rcx*4+K]
		add		edi,DWORD[r14]
		lea		r14,[rcx*4+msg]

		add		edi,DWORD[r14]
		lea		r14,[s+rcx]
		mov		esi,DWORD[r14]
		call	md5_rotate	; leftrotate((A+F+K[i]+msg[i]),s[i])
		add		r9,rax		; B += 
		mov		r8,r12		; A = tmp

		inc		rcx
		cmp		rcx,16
		jl		md5_round1	; while i < 16

	md5_round2:
		mov		r14,r10		;save value of C

		xor		r14,r9		;r14 = C xor B
		and		r14,r11		;r14 = D and (r14)
		xor		r14,r10		;r14 = F = C xor (r14)

		mov		r12,r11		; tmp = D
		mov		r11,r10		; D = C
		mov		r10,r9		; C = B

		mov		rdi,r8
		add		rdi,r14
		lea		r14,[rcx*4+K]
		add		edi,DWORD[r14]
		mov		r14,rcx
		shl		r14,2		; (i*5
		add		r14,rcx		; 
		add		r14,1		; +1)
		and		r14,15		; %16
		lea		r14,[r14*4+msg]
		add		edi,DWORD[r14]
		lea		r14,[s+rcx]
		mov		esi,DWORD[r14]
		call	md5_rotate	; leftrotate((A+F+K[i]+msg[g]),s[i])

		add		r9,rax		; B += 
		mov		r8,r12		; A = tmp

		inc		rcx
		cmp		rcx,32
		jl		md5_round2	; while i < 32

	md5_round3:
		mov		r14,r9		;save value of C

		xor		r14,r10		;r14 = B xor C
		xor		r14,r11		;r14 = F = (r14) xor D

		mov		r12,r11		; tmp = D
		mov		r11,r10		; D = C
		mov		r10,r9		; C = B

		mov		rdi,r8
		add		rdi,r14
		lea		r14,[rcx*4+K]
		add		edi,DWORD[r14]
		mov		r14,rcx
		shl		r14,2		; (i*3
		sub		r14,rcx		; 
		add		r14,5		; +5)
		and		r14,15		; %16
		lea		r14,[r14*4+msg]
		add		edi,DWORD[r14]
		lea		r14,[s+rcx]
		mov		esi,DWORD[r14]
		call	md5_rotate	; leftrotate((A+F+K[i]+msg[g]),s[i])
		add		r9,rax		; B += 
		mov		r8,r12		; A = tmp

		inc		rcx
		cmp		rcx,48
		jl		md5_round3	; while i < 48

	md5_round4:
		mov		r14,r11		;save value of D

		not		r14			; not D (this should be the 32 bit part of r14 instead)
		or		r14,r9		;r14 = not D or B
		xor		r14,r10		;r14 = F = (r14) xor C

		mov		r12,r11		; tmp = D
		mov		r11,r10		; D = C
		mov		r10,r9		; C = B

		mov		rdi,r8
		add		rdi,r14
		lea		r14,[rcx*4+K]
		add		edi,DWORD[r14]
		mov		r14,rcx
		shl		r14,3		; (i*7
		sub		r14,rcx		; 
		and		r14,15		; %16
		lea		r14,[r14*4+msg]
		add		edi,DWORD[r14]
		lea		r14,[s+rcx]
		mov		esi,DWORD[r14]
		call	md5_rotate	; leftrotate((A+F+K[i]+msg[g]),s[i])
		add		r9,rax		; B += 
		mov		r8,r12		; A = tmp

		inc		rcx
		cmp		rcx,64
		jl		md5_round4	; while i < 64

		add		r8,[a0]
		add		r9,[b0]
		add		r10,[c0]
		add		r11,[d0]

		;print the full hash
		mov		rdi,r8
		call	hex_print32be
		mov		rdi,r9
		call	hex_print32be
		mov		rdi,r10
		call	hex_print32be
		mov		rdi,r11
		call	hex_print32be
		call	nl

;		jmp md5_loop
	md5_end:
		ret;

	; rdi = pointer to message
	; rsi = length of message (in bits)
	md5_pad:
		push	rdi
		push	rsi
		push	rax

		mov		rax,rdi
		add		rax,56
		mov		[rax],rsi
		shr		rsi,3
		lea		rdi,[rdi+rsi]
		mov		[rdi],byte 0x80

		pop		rax
		pop		rsi
		pop		rdi
		ret;

	; leftrotate (x,c)
	; rdi = x
	; rsi = c
	; (x << c) | (x >> (32-c))
	md5_rotate:
		push rcx
		mov		rax,rdi
		mov		rcx,rsi
		shl		rdi,cl

		mov		ecx,32
		sub		ecx,esi
		shr		rax,cl

		or		rax,rdi
		pop rcx
		ret;

	;	rdi = byte to print
	hex_print:
		push	rcx
		push	rbx
		push	rdx
		push r14
		push r11
		push r10
		push r9
		push r8
		push rax
		push rcx
		push rdi
		push rsi

		mov		rbx,rdi
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,2		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		shr		rdi,4
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,2		; stderr
		mov		rdx,1		; 1 byte
		syscall
		
		pop rsi
		pop	rdi
		pop rcx
		pop rax
		pop r8
		pop r9
		pop r10
		pop r11
		pop r14
		pop		rdx
		pop		rbx
		pop		rcx
		ret

	;	rdi = dword to print
	hex_print32:
		push	rcx
		push	rbx
		push	rdx
		push r14
		push r11
		push r10
		push r9
		push r8
		push rax
		push rcx
		push rdi
		push rsi

		mov		rbx,rdi

		mov		rdi,rbx
		shr		rdi,28
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,2		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		shr		rdi,24
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,2		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		shr		rdi,20
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,2		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		shr		rdi,16
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,2		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		shr		rdi,12
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,2		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		shr		rdi,8
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,2		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		shr		rdi,4
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,2		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,2		; stderr
		mov		rdx,1		; 1 byte
		syscall

		pop rsi
		pop	rdi
		pop rcx
		pop rax
		pop r8
		pop r9
		pop r10
		pop r11
		pop r14
		pop		rdx
		pop		rbx
		pop		rcx
		ret


	;	rdi = dword to print
	hex_print32be:
		push	rcx
		push	rbx
		push	rdx
		push r14
		push r11
		push r10
		push r9
		push r8
		push rax
		push rcx
		push rdi
		push rsi

		mov		rbx,rdi

		mov		rdi,rbx
		shr		rdi,4
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,1		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,1		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		shr		rdi,12
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,1		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		shr		rdi,8
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,1		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		shr		rdi,20
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,1		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		shr		rdi,16
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,1		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		shr		rdi,28
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,1		; stderr
		mov		rdx,1		; 1 byte
		syscall

		mov		rdi,rbx
		shr		rdi,24
		and		rdi,15
		lea		rsi,[hexbin]
		add		rsi,rdi
		mov		rax,4		; write
		mov		rdi,1		; stderr
		mov		rdx,1		; 1 byte
		syscall

		pop rsi
		pop	rdi
		pop rcx
		pop rax
		pop r8
		pop r9
		pop r10
		pop r11
		pop r14
		pop		rdx
		pop		rbx
		pop		rcx
		ret

	nl:
		push	rsi
		push	rdi
		push	rdx
		push	rcx
		push r14
		push r11
		push r10
		push r9
		push r8
		push rax
		push rcx
		push rdi
		push rsi

		lea		rsi,[hexbin+16]
		mov		rax,4		; write
		mov		rdi,1		; stderr
		mov		rdx,1		; 1 byte
		syscall

		pop rsi
		pop	rdi
		pop rcx
		pop rax
		pop r8
		pop r9
		pop r10
		pop r11
		pop r14
		pop 	rcx
		pop		rdx
		pop		rdi
		pop		rsi
		ret