Using Zeek to Analyze POP3 Protocol (2)

Author: Morphy Chan

We know that the POP3 command for retrieving emails is RETR. Based on the Zeek API test results from Using Zeek to Analyze POP3 Protocol (1), we can outline a pattern for parsing email content:

  1. Email start markers:
    1. Client sends a RETR command to request an email
    2. Server responds with OK
  2. Email content: multi-line strings
  3. Email end marker: encountering a new command

Both start conditions must hold: RETR requests an email, and OK confirms that the server is returning it. Every multi-line string that arrives between that OK and the next request command is assumed to be email content. Parsing an email therefore comes down to tracking this pattern line by line in the multi-line string output delivered by the API.

First, let's create a record Msg to store information for a single email:

type Msg: record {
    ts:                 time;
    uid:                string;
    id:                 conn_id;
    flag_retr_succ:     bool;       # Whether RETR request successfully retrieved email
    request:            string;     # Request command
    req_arg:            string;     # Request command argument
    reply:              string;     # Response
    retr_data_linenum:  int;        # Number of lines in email content
    retr_data:          vector of string;  # Stores email content
};

The flag_retr_succ field tracks whether both email start marker conditions are satisfied.

1. Detecting Email Start

Track the RETR command in pop3_request:

event pop3_request(c: connection, is_orig: bool, command: string, arg: string)
{
    ...
    if(command == "RETR") {
        local retr_msg: Msg = [$ts = network_time(),
                                    $id = c$id,
                                    $uid = c$uid,
                                    $flag_retr_succ = F,
                                    $request = "RETR",
                                    $req_arg = arg,
                                    $reply = "",
                                    $retr_data_linenum = 0,
                                    $retr_data = vector()];
        g_retr_msg = retr_msg;
    }
    ...
}

When a RETR command is detected, a fresh Msg is built and stored in the global g_retr_msg.

Correspondingly, check if the server responds with OK in pop3_reply:

event pop3_reply(c: connection, is_orig: bool, cmd: string, msg: string)
{
    ...
    if(cmd == "OK" && g_retr_msg?$flag_retr_succ) { # g_retr_msg exists
        if(g_retr_msg$request == "RETR" && g_retr_msg$reply == "") {
            g_retr_msg$flag_retr_succ = T;
            g_retr_msg$reply = "OK";
        }
    }
}

If we find that:

  1. The reply command is OK
  2. The global msg records a RETR request whose reply is still empty (so flag_retr_succ is still False)

then the reply answers that RETR request and the email was retrieved successfully. Both email start conditions are now satisfied:

[Figure: zeek_pop3_mail_b]

2. Saving Email Content

Multi-line strings following the email start are treated as email content. We use pop3_data to save this content:

event pop3_data(c: connection, is_orig: bool, data: string)
{
    if(g_retr_msg?$flag_retr_succ && g_retr_msg$flag_retr_succ == T) {
        if(g_retr_msg$retr_data_linenum < g_retr_msg_max_line) {
            g_retr_msg$retr_data += data;
            g_retr_msg$retr_data_linenum += 1;
        }
    }
}

When flag_retr_succ is True, each data line belongs to the email. We append it to a string vector for line-by-line parsing later, storing at most g_retr_msg_max_line lines.

Detecting Email End and Parsing Content

Email termination is tracked by calling pop3_proc_g_retr_msg at the top of the event handler. The snippet below shows the call in pop3_reply:

event pop3_reply(c: connection, is_orig: bool, cmd: string, msg: string)
{
    ...
    pop3_proc_g_retr_msg(); # Check email end marker and parse
    if(cmd == "OK" && g_retr_msg?$flag_retr_succ) { # g_retr_msg exists
        ...
    }
    ...
}

The pop3_proc_g_retr_msg function:

function pop3_proc_g_retr_msg()
{
    if(g_retr_msg?$flag_retr_succ && g_retr_msg$flag_retr_succ == T) {
        # Update POP3 info
        local rec: POP3::Info = [$ts = g_retr_msg$ts,
                                 $uid = g_retr_msg$uid,
                                 $id = g_retr_msg$id];
        g_pop3_rec = rec;
        ...

        # Parse email content
        for(idx in g_retr_msg$retr_data) {
            # print g_retr_msg$retr_data[idx];
            local data:string = g_retr_msg$retr_data[idx];
            local key: string = "";
            local val: string = "";
            local len: int;
            if(data != "") {
                # Match "to" field
                if(/^[tT][oO]:/ in data) {
                    key = "to";
                    val = data[3:];
                }
                else if(/^[fF][rR][oO][mM]:/ in data) {
                    key = "from";
                    val = data[6:];
                }
                ...
            }
            if(key != "" && val != "")
                pop3_update_g_rec_data(key, val);
        }
        # Write to POP3 log
        Log::write(POP3::LOG, g_pop3_rec);

        # Finished parsing one email, reinitialize global msg
        ...
    }
}

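The function runs before the handler processes the current event (the snippet above shows the call in pop3_reply) and checks whether flag_retr_succ in the global msg is already True. If so, the flag was set by an earlier OK reply, so a new command has since been encountered and the email retrieval is complete: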

[Figure: zeek_pop3_mail_e]

The function then parses the email content saved in the msg and updates the POP3 info record. Following the pattern of Zeek's default SMTP parsing script, the POP3 script creates a similar info record for writing parsed results to the log.

The saved lines are matched against regular expressions to extract key header fields such as From and To, again following the approach of Zeek's SMTP parsing script.
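As an alternative to one regex per field, a header line can be split at its first colon and the lower-cased name used as the key. The sketch below is only an illustration: split_header is a hypothetical helper that does not appear in the article's script, built on Zeek's split_string1, to_lower and strip built-ins. Unlike the fixed-offset slicing shown earlier, it trims whitespace around the value, so the logged values would differ slightly.

```zeek
# Hypothetical helper for illustration; not part of the article's script.
# Splits "Name: value" at the first colon and returns [name, value],
# or an empty vector if the line contains no colon.
function split_header(line: string): string_vec
{
    local empty: string_vec = vector();

    # split_string1 splits at the first match only, so colons inside the
    # value (for example the time in a Date header) are left untouched.
    local parts = split_string1(line, /:/);
    if(|parts| != 2)
        return empty;

    parts[0] = to_lower(parts[0]);   # "Subject" -> "subject"
    parts[1] = strip(parts[1]);      # drop leading/trailing whitespace
    return parts;
}

event zeek_init()
{
    print split_header("Subject: This is a test mail");
    print split_header("Date: Fri, 5 Mar 2021 23:00:37 -0500");
}
```

Body lines can contain colons too, so a caller would still need to check the returned name against the fields it cares about before updating the log record.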

3. Script Parsing Results

Here's the parsing result of the script on the test email:

$ cat pop3.log | jq
{
  "ts": 1615003258.432899,
  "uid": "CeJci2byiawb4zZlk",
  "id.orig_h": "192.168.153.18",
  "id.orig_p": 39118,
  "id.resp_h": "192.168.153.19",
  "id.resp_p": 110,
  "command": [
    "RETR",
    "OK"
  ],
  "arg": [
    "2"
  ],
  "date": " Fri, 5 Mar 2021 23:00:37 -0500",
  "from": "lisi <lisi@localdomain.com>",
  "to": [
    " zhangsan@localdomain.com"
  ],
  "msg_id": " <7ea7b5a3-3e76-ceee-2a49-a9ab81d5cc4c@localdomain.com>",
  "subject": " This is a test mail",
  "user_agent": " Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101"
}

Comparing this with the test email content from Using Zeek to Analyze POP3 Protocol (1), the script successfully extracts the relevant email fields. To parse additional fields, simply add more regex matching rules.

This script is based on a relatively rough pattern, and some edge cases may not be covered. There are also some open questions:

  1. Some email fields (like the User-Agent in the test email) span multiple lines. How should multi-line field values be handled?
  2. How should email attachments be parsed?
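
For the first question, one possible approach relies on the header folding rule of RFC 5322: a header line that begins with a space or tab continues the previous header. The following is a minimal sketch under that assumption, operating on lines stored the way retr_data stores them. unfold_headers is a hypothetical helper, not part of the article's script, and it has not been run against the test capture.

```zeek
# Hypothetical helper for illustration; not part of the article's script.
# Joins folded header lines (continuations start with a space or tab)
# onto the header line they belong to. Lines from the first empty line
# onward are treated as body and copied unchanged.
function unfold_headers(lines: vector of string): vector of string
{
    local out: vector of string = vector();
    local in_headers = T;

    for(i in lines) {
        local line = lines[i];

        # An empty line separates the headers from the body.
        if(line == "")
            in_headers = F;

        if(in_headers && |out| > 0 && /^[[:blank:]]/ in line)
            # Continuation: append to the previous header line.
            out[|out| - 1] = out[|out| - 1] + " " + strip(line);
        else
            out += line;
    }

    return out;
}

event zeek_init()
{
    local lines = vector("User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0)",
                         " Gecko/20100101",
                         "Subject: This is a test mail");
    print unfold_headers(lines);
}
```

Running the unfolded lines through the existing regex matching would then let a folded User-Agent value be captured in one piece.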