Spam Agent Architecture
© 2002, Matthew Helsley
Contents
I. Introduction: The Goal
II. Without Spam Agents
III. With Spam Agents
IV. When the System Will Fail
V. Current Anti-Spam Tools, and Where They Fit
VI. Further Recommendations
I. Introduction: The Goal
The goal of Spam Agent is to construct a system that automatically detects and blocks unsolicited email (hereafter referred to by the derogatory term 'spam') in a distributed fashion. Furthermore, it should be possible to use existing modules (MTAs such as sendmail and exim; Identifier modules such as Spam Assassin) with little or no modification.
II. Without Spam Agents
A trusted MTA receives an email from an untrusted MTA. The email is either spam or not spam. The MTA forwards this email to its destination regardless of which it is. When one or more spammers send spam to this server, they destroy the ability of local users to use the email system effectively as a means of communication. When the MTA lets one or more spammers who do not originate from it send mail through this system, it is known as an open relay (i.e. it will relay email from any source to any destination).
III. With Spam Agents
As above, the trusted MTA receives an email from an untrusted MTA. The trusted MTA might classify the untrusted MTA based on numerous databases. If the untrusted MTA is determined to be malicious, the trusted MTA might present a 'honey pot' to it (which will keep the malicious MTA busy and thus prevent it from spamming others). If the remote MTA is neither malicious nor incompetent, the email is then classified and ultimately ends in one of three states: accept, reject, or delayed processing. The resulting state is communicated to the trusted MTA so it may accept, reject, or reprocess the email for delivery.
Messages the MTA can immediately identify as spam can be trivially rejected. These typically involve relay-route-based checks (DNS lookups, spam block list lookups, open relay list lookups), from/to rules, and assorted other rules based on the envelope. These checks typically occur inside the MTA. Once the email has been accepted by the MTA, it is a candidate for further processing by an external filter agent. High-traffic MTAs may choose to find cases where the external filter agent can be short-circuited. These short-circuits might include email originating on the trusted host sent from a trusted user (root, postmaster, listserv, majordomo, mailman...). One important short circuit that prevents infinite delay in delivery of a message is the short circuit for messages marked with a delayed-processing flag.
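The short-circuit rules described above could be expressed roughly as follows. This is a minimal sketch, not part of any existing MTA: the `Message` fields and the `skip_external_filter` function are hypothetical names chosen for illustration, and the trusted-user list is simply the set of examples given in the text.

```python
from dataclasses import dataclass

# Trusted local senders named in the text; a real site would configure its own list.
TRUSTED_LOCAL_USERS = {"root", "postmaster", "listserv", "majordomo", "mailman"}


@dataclass
class Message:
    """Hypothetical, simplified view of an accepted email."""
    sender_user: str            # local part of the envelope sender
    from_trusted_host: bool     # True if the mail originated on the trusted host
    delayed_flag: bool = False  # True if the mail already went through the delay queue


def skip_external_filter(msg: Message) -> bool:
    """Return True when the external filter agent can be short-circuited."""
    # Mail that already waited in the delay queue must not be delayed again,
    # otherwise its delivery could be postponed forever.
    if msg.delayed_flag:
        return True
    # Locally originated mail from a trusted system account.
    if msg.from_trusted_host and msg.sender_user in TRUSTED_LOCAL_USERS:
        return True
    return False
```

Checking the delayed-processing flag first keeps the rule that prevents infinite delay independent of whatever site-specific trust rules follow it.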
The external filter (e.g. Spam Assassin [URL]) further classifies the email based upon content. Email classified as spam should then be processed by each trace plugin. Trace plugins would typically process the envelope and gather information on routes, hosts, and domains. Advanced trace plugins might gather information using email content in addition to the envelope. To speed up processing and reduce overall traffic in the system, the trace plugin should recognize envelopes that will produce the same trace results (i.e. cache previous traces). This will reduce the number of nominations sent to distributed databases (see next step).
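One way to picture the trace cache is a wrapper that remembers results keyed by the relay route. The sketch below assumes the route can be reduced to an ordered list of host names; the `CachingTracer` class and its `trace_fn` argument are hypothetical and stand in for whatever a real trace plugin would do.

```python
class CachingTracer:
    """Wraps a trace function and reuses results for identical relay routes."""

    def __init__(self, trace_fn):
        self._trace_fn = trace_fn  # hypothetical: does the expensive lookups
        self._cache = {}           # maps a route (tuple of hosts) to its trace result
        self.misses = 0            # number of traces that actually ran

    def trace(self, relay_hosts):
        key = tuple(relay_hosts)   # lists are not hashable, tuples are
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._trace_fn(key)
        return self._cache[key]


# Usage: the second call with the same route is served from the cache.
tracer = CachingTracer(lambda hosts: {"hosts": list(hosts)})
tracer.trace(["relay.example", "mx.example"])
tracer.trace(["relay.example", "mx.example"])
```

A production cache would also need an expiry policy, since the text expects the distributed databases to change over time.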
For each trace plugin, a list of nomination plugins searches the trace output and attempts to find data that can be nominated for entry into distributed databases such as open relay databases, spam-friendly ISP/IP block/domain databases, etc. The results of the nomination are fed into a report and logging stage, and finally the email is flagged for delayed processing before entering the delay queue.
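The relationship between a trace result and its nomination plugins might look like the following sketch. Everything here is an assumption made for illustration: the trace result is modelled as a plain dictionary, the keys `suspected_open_relays` and `spam_domains` are invented, and the database names are placeholders rather than real services.

```python
def open_relay_nominator(trace_result):
    """Nominate every host the trace flagged as a possible open relay."""
    return [("open-relay-db", host)
            for host in trace_result.get("suspected_open_relays", [])]


def domain_nominator(trace_result):
    """Nominate every domain the trace associated with the spam."""
    return [("domain-db", domain)
            for domain in trace_result.get("spam_domains", [])]


def nominate(trace_result, nominators):
    """Run each nomination plugin over one trace result and collect the nominees."""
    nominations = []
    for nominator in nominators:
        nominations.extend(nominator(trace_result))
    return nominations


trace = {"suspected_open_relays": ["192.0.2.10"], "spam_domains": ["example.invalid"]}
nominations = nominate(trace, [open_relay_nominator, domain_nominator])
```

The resulting list of (database, value) pairs is what the report and logging stage would receive before the email is flagged for delayed processing.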
The email then waits on the delay queue for an administrator-specified period of time while delayed nominations take effect (e.g. open relays get added to an open relay database). After the delay, the email is re-processed by the MTA. NOTE that the delay may be indefinite, and emptying of the queue could be triggered by an external source (e.g. a database that has responded to the nomination submissions for each queue entry). If the reprocessed message is not rejected by the MTA, then it should be automatically accepted (to prevent infinite delay in delivery).
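The delay queue and the second pass through the MTA can be sketched as below. This is only an illustration under stated assumptions: messages are plain dictionaries, time is passed in explicitly as a number of seconds, and `DelayQueue`, `reprocess`, and the `mta_rejects` callback are hypothetical names. It covers the fixed-delay case only, not the externally triggered release mentioned in the NOTE.

```python
import heapq
import itertools


class DelayQueue:
    """Holds flagged messages until an administrator-specified delay has passed."""

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self._heap = []
        self._counter = itertools.count()  # tie-breaker so messages are never compared

    def put(self, message, now):
        message["delayed"] = True  # the delayed-processing flag
        entry = (now + self.delay_seconds, next(self._counter), message)
        heapq.heappush(self._heap, entry)

    def release_due(self, now):
        """Return every message whose delay has expired, oldest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due


def reprocess(message, mta_rejects):
    """Second pass: reject if the MTA now rejects the mail, otherwise always accept."""
    return "reject" if mta_rejects(message) else "accept"
```

Because `reprocess` has only two outcomes, a message can pass through the delay queue at most once.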
IV. When the System Will Fail
The processing delays for adding entries to the distributed database mean that either email must potentially wait long periods of time in the delay queue, or some spam will get through. The problem with long waits is that legitimate mail that looks like spam (according to the Identification Module) could be placed into the queue. That legitimate mail might then wait on the order of hours to days (assuming the delay is set long enough to allow changes in the distributed databases) before the system can determine whether it originated from an open relay or known spam source.
The alternative is to keep a short delay on the delay queue and allow some spam through.
A better approach would include a local 'node' in the distributed database(s) and local tools such as open relay detectors. In addition to participating in the distributed database system, the local node would provide a quick means of adding database information. The drawbacks to this include a necessity for higher-frequency processing of the local database (to reduce delay queue waits), inability to immediately recognize spam via distributed voting mechanisms, and of course increased load.
V. Current Anti-Spam Tools, and Where They Fit
Open Relay Detectors - these tools probe a host to determine whether it is an open mail relay. Nomination modules can use them to determine whether a host/IP should be nominated for inclusion in an open relay database (see below).
Open Relay Databases - this tool stores and optionally ages a set of known open relays. The ideal open relay database system would periodically sweep its contents and detect relays that have been closed. Nomination modules would submit nominees to one or more databases.
Domain name databases - lists domains that have been determined to be operated by known spammers (people who send spam). Similar to open relay databases, these databases would be the target of submissions by nomination modules.
IP databases - similar to domain databases, but instead of listing names, the list contains IPs. Similar to open relay databases, these databases would be the target of submissions by nomination modules.
Dialup IP databases - blocks of IPs used by dialup service providers and often abused by spammers. NOTE that eventually these databases may choose to include dynamically served IP address blocks. This method of blocking spam is questionable due to the nature of IP block assignment protocols and the possibility that it might exclude potentially large quantities of legitimate email traffic (i.e. people who run mail servers on dynamically served IPs are not necessarily spammers). Similar to open relay databases, these databases would be the target of submissions by nomination modules (however, this kind of module might not be a good idea).
Spam Assassin - a spam filtering tool. Uses advanced pattern matching to prevent spam from reaching its destination. Spam Assassin would be an Identify module in the model.
<check for numerous log monitoring/processing tools>
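The storing, aging, and sweeping behaviour described for open relay databases in this section can be illustrated with a small in-memory model. It is a sketch under assumptions, not a description of any existing database: `OpenRelayDatabase` and the `still_open` probe callback (standing in for an open relay detector) are hypothetical, and timestamps are plain numbers of seconds.

```python
class OpenRelayDatabase:
    """In-memory list of open relays with optional aging and a periodic sweep."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self._listed_at = {}  # host -> time the nomination was recorded

    def nominate(self, host, now):
        """Record a nominee submitted by a nomination module."""
        self._listed_at[host] = now

    def is_listed(self, host, now):
        """A host counts as listed only while its entry is younger than the maximum age."""
        listed = self._listed_at.get(host)
        return listed is not None and now - listed <= self.max_age_seconds

    def sweep(self, still_open):
        """Re-probe every entry and drop relays that have since been closed."""
        for host in list(self._listed_at):
            if not still_open(host):
                del self._listed_at[host]
```

Keeping aging and sweeping separate lets a site run the expensive probes less often than the cheap age check.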
VI. Further Recommendations
An XML protocol for communication between the modules should be established. Every module should be capable of dealing with this protocol, even if a single program actually implements several modules at once. In this way, system administrators could build chains of trace/nominate/report/queue modules to meet their individual needs, while the design would still allow for complex handling of potential spam that may require exceptional processing power (and thus require a monolithic module architecture).
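No message format is defined yet, so the following is purely a sketch of what one inter-module message could look like. The element name `nomination`, the `target` child, and the `message-id` and `database` attributes are all invented for illustration; only the standard-library `xml.etree.ElementTree` calls are real.

```python
import xml.etree.ElementTree as ET


def build_nomination(message_id, database, value):
    """Serialize one nomination as an XML string (hypothetical schema)."""
    root = ET.Element("nomination", {"message-id": message_id})
    target = ET.SubElement(root, "target", {"database": database})
    target.text = value
    return ET.tostring(root, encoding="unicode")


def parse_nomination(xml_text):
    """Read back the fields written by build_nomination."""
    root = ET.fromstring(xml_text)
    target = root.find("target")
    return root.get("message-id"), target.get("database"), target.text


wire = build_nomination("abc123", "open-relay-db", "192.0.2.10")
fields = parse_nomination(wire)
```

Because each module only needs to build and parse such messages, a nominate module and a report module could be chained as separate processes or merged into one program without changing the format.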