Archiving email is an amazingly old topic in the world of email. Products which store email in another location instead of a mailbox have existed in one form or another for decades.
PSTs, also referred to as the cockroaches of email, came about in an Exchange 4.0 world where server storage was minimal and expensive. We stored email on our desktops or file shares for years, hoping the files would not grow beyond their size limits. Those limits were either 2GB or 20GB (more or less), depending on whether you were using ANSI or Unicode PSTs. PSTs were a popular mechanism to archive email and, like public folders, bred a legacy of bad practices which haunt email administrators to this day. Microsoft maintains guidance documentation on the limits of using PST files over WAN and LAN links.
Third-party products stored email in a variety of places, like databases and file shares, to minimize the amount of email stored on Exchange servers. This effectively deferred the problem of storing email in the native Exchange databases by moving the email into other storage and then trying to keep track of it.
Before Exchange 2010, we didn’t have a lot of confidence in storing gigabytes or even terabytes of email in Exchange. Exchange 2010 changed that by introducing technology in the database engine which could take advantage of cheaper and slower storage, directly attached to the Exchange servers, instead of expensive SAN-based storage.
The perception that you don’t store a lot of email in Exchange, especially in a regulated industry, remained, however. This is partly because it was difficult to find email in Exchange, whether for individual mailbox searches or for departmental or companywide discovery searches. While Exchange became marvelously competent at storing large amounts of email, as evidenced by the advent of 100GB mailboxes in Exchange Online and unlimited email archives, searching and finding email remained a challenge.
Let's take a step back and look at why you would store email anywhere else besides Exchange.
Size does matter
The first factor is storage, both for the end user as well as the server administrator.
Mailboxes needed to be size constrained, as Outlook OSTs didn’t handle a synchronized mailbox of greater than 20GB well without desktop SSD technology. The documented limits for PST and OST files were considerably larger at 50GB; however, physics prevails, and large files make for a poor user experience, depending on the desktop drive technology.
Email database sizes needed to be relatively constrained, as backup and restore times were one of the primary limiting factors. If the backup was the only other copy of the mailbox database and it took 8 hours to restore from tape, then there was a world of hurt waiting for the Exchange administrator. That assumes that the restored database mounted in the first place and didn’t need additional intervention.
Calculating the amount of storage required for Exchange 2013, which further optimized the database technology to make use of cheap storage instead of SAN, was a non-trivial exercise.
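To make the restore-time constraint concrete, here is a minimal arithmetic sketch. The throughput figure of 40 MB/s is purely an assumption for illustration and is not a measured tape speed; the function names are hypothetical. It only models raw streaming time and ignores log replay, mounting and any repair work.

```python
def restore_hours(database_gb: float, throughput_mb_per_s: float) -> float:
    """Estimate the hours needed to stream a database back from backup media."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    total_mb = database_gb * 1024
    return total_mb / throughput_mb_per_s / 3600


def largest_database_gb(window_hours: float, throughput_mb_per_s: float) -> float:
    """Largest database that can be streamed back within the restore window."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return window_hours * 3600 * throughput_mb_per_s / 1024


if __name__ == "__main__":
    # Assumed values: an 8 hour restore window and 40 MB/s sustained throughput.
    limit = largest_database_gb(window_hours=8, throughput_mb_per_s=40)
    print(f"Largest database for an 8 hour restore: {limit:.0f} GB")
    print(f"Hours to restore that database: {restore_hours(limit, 40):.1f}")
```

Working backwards from the restore window you can tolerate, rather than forwards from the mailbox quota you would like, is what kept database sizes small.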
Understanding storage options for Exchange 2016 and Exchange 2019 is still a critical part of designing for on-premises Exchange today.
Email stubbing allowed a third-party archive to strip attachments out of the mailbox and store them elsewhere, to limit mailbox sizes. While popular, this solution was acknowledged but not recommended by the Exchange team, whose guidance was to use supported and cheaper storage technologies. Stubbing required both Exchange and the third-party archive to be highly available and responsive to ensure a good user experience.
When something is immutable, it means that it cannot be changed. Email, as the primary form of communication, needed to be stored in an immutable format for legal reasons, so that it could be found in the original format as part of legal discovery or litigation. Older versions of Exchange didn’t have a method of preserving email for the long term, since on-premises storage was expensive.
There’s no point in storing email for the long term if you can’t find it. Both desktop and server search have been less than accurate when searching large amounts of email. Long-running search queries which returned inaccurate results did not instill confidence in the discovery officer or the overburdened email administrator tasked with the problem. On-premises versions of Exchange had exceptional capabilities in many areas; however, accurate search in a timely manner was not one of them.
On-premises Exchange offers basic email discovery features which, in earlier versions of Exchange, required the Exchange administrator to perform searches and export data. This scenario isn’t optimal when performing legal discovery for all kinds of reasons, including if the email administrator is the subject!
Discovery requires accurate search, the ability to work with a large dataset which can be narrowed down over time, and the ability to abstract searches and content into manageable cases. Most of these features are not implemented in Exchange on-premises eDiscovery.
Case management itself is a large subject, which requires the functionality to separate cases from each other, grant just the right level of access to searched and discovered data, etc.
So far we have examined database size constraints, the requirement for immutability, accurate search and eDiscovery as factors contributing to removing or duplicating email in systems outside of Exchange. Earlier versions of Exchange offered either essential or no functionality in these areas, requiring customers to duplicate their email into third-party systems. Harvesting email from Exchange directly is not a guarantee that the email is the original item, nor that it has been deleted. To answer this need, Microsoft created the journaling standard for Exchange on-premises and Exchange Online.
Journaling creates a copy of the email message that passes through Exchange and sends it to a journal recipient, which requires an on-premises mailbox. Journaling to Exchange Online is explicitly not supported, which means you’re probably in hybrid mode and journaling to on-premises or forwarding the email directly to your third-party archive service.
Journaling happens in the Exchange Transport service and is not affected by anything the user does, thus a copy of anything that is sent or received is journaled, subject to the configured journal rules. Although journaling is a catch-all mechanism for mail that is sent, received or transiting your Exchange on-premises installation or Exchange Online tenant, it does have a glaring hole. You may want to follow along for this exercise.
Changing stored email
Send an email to someone, or just navigate to your sent items. Open a message, navigate to Message, Actions, Edit Message. Make the changes you want to make, and then click Save, or close the mail and acknowledge that you’d like to save your changes. Notice that the message which is now in your sent items is not the same message which you originally sent.
Now create an entry in a shared calendar, type in something incriminating and ask a colleague to type something back. Do the same thing with a task, note, contact, etc., anything that is not sent or received. If you’re following along, you’ll notice the flaw in journaling. None of the actions we have committed to email, tasks, notes, contacts or even files which we have dragged into an email folder are recorded in the email journal, since they are not transmitted.
While journaling is excellent for recording communications that have been sent or received, it does not account for a plethora of actions which we are able to perform on our mailboxes. Journaling will transmit items which have been sent or received but cannot do anything else to track the changes which occur in our mailboxes. Therefore, the picture which an auditor or a legal counsel receives from their third-party archive will be incomplete. To address this, third-party archiving systems perform a scheduled scrape of the mailbox, which cannot run in real time without affecting the performance of the mail system. Should a savvy user understand the schedule of the mailbox scrape, they could easily circumvent this mechanism.
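The gap between journaling and a scheduled scrape can be illustrated with a small simulation. This is a simplified model and not how any particular archive product works: the event names, the nightly scrape time and the assumption that a delete removes the item outright are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MailboxEvent:
    minute: int   # minutes since midnight
    item_id: str
    action: str   # "send", "receive", "create", "edit" or "delete"


# Only these actions pass through transport, so only these are journaled.
TRANSPORT_ACTIONS = {"send", "receive"}


def journaled(events):
    """Item ids that journaling would copy."""
    return {e.item_id for e in events if e.action in TRANSPORT_ACTIONS}


def scraped(events, scrape_minutes):
    """Item ids present in the mailbox at any of the scheduled scrape times."""
    ordered = sorted(events, key=lambda e: e.minute)
    seen = set()
    for scrape_at in scrape_minutes:
        present = set()
        for e in ordered:
            if e.minute > scrape_at:
                break
            if e.action == "delete":
                present.discard(e.item_id)
            else:
                present.add(e.item_id)
        seen |= present
    return seen


events = [
    MailboxEvent(540, "mail-1", "send"),     # 09:00, sent mail
    MailboxEvent(600, "note-1", "create"),   # 10:00, note created
    MailboxEvent(660, "note-1", "delete"),   # 11:00, note removed again
    MailboxEvent(700, "task-1", "create"),   # 11:40, task created
]

captured_by_journal = journaled(events)
captured_by_scrape = scraped(events, scrape_minutes=[1380])  # assumed 23:00 scrape
missed = {e.item_id for e in events} - captured_by_journal - captured_by_scrape
print("Never archived:", sorted(missed))
```

In this model the note never reaches the archive: it was not transmitted, so journaling ignores it, and it no longer exists by the time the scrape runs.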
People who have left
People who have left the company are called leavers. An appropriate name, since they have in fact left. However, we need to retain leaver mailbox data for as long as our requirements dictate. Often that requirement is measured in years, and it needs to account for the company reusing names or email addresses.
Creating an accurate email archive is not trivial. At the very least we need to be able to:
- store an accurate copy of email in an immutable manner
- prove that the email has not been changed
- store the email for the long term and include leaver data
- discover the email using a supported mechanism.
This is before we consider any other requirement we may have, including case management, intelligent tagging, searching, or filing of emails.
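One common way to show that a stored message has not been changed is to record a cryptographic fingerprint of it at capture time and compare it later. The sketch below is illustrative only and does not describe how Exchange or any archive product does this; the addresses and helper names are hypothetical, and it uses only the Python standard library.

```python
import hashlib
from email.message import EmailMessage


def build_message(body: str) -> EmailMessage:
    """Build a small test message (addresses are placeholders)."""
    message = EmailMessage()
    message["From"] = "sender@example.com"
    message["To"] = "recipient@example.com"
    message["Subject"] = "Quarterly figures"
    message.set_content(body)
    return message


def fingerprint(message: EmailMessage) -> str:
    """SHA-256 digest over the serialized headers and body."""
    return hashlib.sha256(message.as_bytes()).hexdigest()


original = build_message("The figures are attached.")
recorded = fingerprint(original)  # stored at the time the message is archived

# Later: one copy is untouched, another has been edited in the mailbox.
untouched = build_message("The figures are attached.")
edited = build_message("The figures are attached. Please ignore page 3.")

print("untouched matches:", fingerprint(untouched) == recorded)
print("edited matches:   ", fingerprint(edited) == recorded)
```

A fingerprint only detects change. It proves nothing unless the recorded value itself is kept somewhere the mailbox owner and the email administrator cannot alter.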
In this article we have examined the challenges inherent in archiving email. In part two of this article, we will address how these challenges are answered in Office 365.