If The Bard were around today, his work on eDiscovery would undoubtedly include an update to one of his more famous turns of phrase: “To de-dupe or not to de-dupe – what is the question?”
We all want to save money during eDiscovery, and the typical enterprise contains a substantial amount of duplicate information that increases the costs and risks of processing and review. For example, we know that a single email message to four colleagues can immediately result in five copies of that message – the sender’s copy, along with a copy for each recipient. With replies and forwards, the number seems to increase almost exponentially. Similarly, laptops and fileshares abound with duplicates of documents from collaborative efforts or resulting from saving an “extra” copy of a document sent with an email message. All of these extra copies drive up eDiscovery costs.
How De-Duplication Can Help In eDiscovery
The temptation in eDiscovery is to quickly get rid of all of these extra copies – to de-duplicate the information so that we can spend less in collecting, processing and reviewing that information. We commonly see duplication rates of 15 to 30% depending upon the source data and the environment. Thus, a company spending $1,000,000 for collection and review in a case could, in theory, save $150,000 to $300,000 just by eliminating duplicate data.
In reality, de-duplication for eDiscovery can be a tricky and worrisome process. It’s important to understand exactly what data you will be de-duplicating, and how, to ensure that you get the result you expect and do not run afoul of preservation requirements or compliance concerns.
The Technology Behind De-Dupe
A typical “document” can have many different parts, so when we perform de-duplication, we need to understand which of those parts are being used to determine whether two documents are the same. Without being too philosophical, it’s clear that no two documents can ever be completely identical. At minimum, even if they are identical in all other respects, they will be stored in two different locations. In some cases, even that fact – the location or “path” of the document — could be important.
To take a closer look at this issue, let’s start with the file “badStuff.doc”, a file containing text information created in Microsoft Word, which is saved on a laptop. This file has several different types of information, including:
- File contents: The file itself consists of all of the words and information that are saved as the body or text of the document.
- Document Metadata: Many documents contain “metadata” that is stored, hidden, inside the document itself. (For a more complete discussion of metadata, see J. Shook, “Metadata Is Closer Than It Appears”, 4/28/11). The file “badStuff.doc” has metadata fields labeled Author, Title, Subject, Keywords, etc. This information does not appear in the basic editing screen in Word, but is easy to locate using the Properties feature. The file also contains other hidden metadata such as versioning or redline information, font names and sizes, and formatting information such as bold and italics. These latter categories may not seem very important, but keep in mind that a document containing the word “the” may not be considered a duplicate of a document in which that same word is set in bold.
- File system metadata (file attributes): The “badStuff.doc” file is tracked and managed by the operating system, in this case Windows XP, as an object. The operating system maintains information about the file including its name, the location where it is stored (the filepath), its dates of creation, last access and modification, and other information such as access control lists for security. Typically this information is not stored within the file itself, but is held by the operating system to assist it in tracking and maintaining the file.
Determining Duplicates
Most frequently, de-duplication is performed by using the hash value or “fingerprint” of a file. (For a thorough discussion of hash values in eDiscovery, see Ralph Losey’s excellent e-Discovery Team blog entry and Ralph’s related Law Review article). The hash of a file is normally based upon its contents – the “file contents” that we listed above. The hash typically does not take into account the file system metadata (nor should it, because then almost nothing would ever be a duplicate).
For example, if I make a copy of “badStuff.doc” on a flash drive for my colleague Alice, and send it to Bob as an email attachment, they will both have an equivalent “copy” with the same hash value. If Alice opens the file to read it, then the file’s system metadata for the access date will change but the file itself will retain the same hash fingerprint. If Bob opens the file and then saves it back to his hard drive, without making any changes, the Modified and Accessed dates will change – although the hash value will stay the same.
Let’s say that Bob goes into the document and finds an error: he inserts a comma that was missing and then re-saves the document. The documents are no longer duplicates for purposes of hash values. Because the file contents have changed with the addition of the comma, the hash value will completely change. Even a minor change will produce a substantial change in the value of the hash, so we cannot use hash values as a tool to determine whether documents are near-duplicates.
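The behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SHA-256 stands in for whatever hash algorithm a given tool uses, plain bytes stand in for the Word file, and the helper name `content_hash` and the file names other than “badStuff.doc” are made up for the example.

```python
import hashlib
import os
import tempfile


def content_hash(path):
    """Hash only the bytes inside the file (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "badStuff.doc")
    alice_copy = os.path.join(workdir, "alice_badStuff.doc")
    bob_edit = os.path.join(workdir, "bob_badStuff.doc")

    text = b"Sell now before the news is public"
    with open(original, "wb") as handle:
        handle.write(text)
    with open(alice_copy, "wb") as handle:
        handle.write(text)
    with open(bob_edit, "wb") as handle:
        handle.write(b"Sell now, before the news is public")  # one comma added

    # Change only file system metadata (accessed and modified times).
    os.utime(alice_copy, (1_000_000_000, 1_000_000_000))

    same_after_touch = content_hash(original) == content_hash(alice_copy)
    same_after_edit = content_hash(original) == content_hash(bob_edit)

print("Same hash after timestamp change:", same_after_touch)
print("Same hash after adding a comma:", same_after_edit)
```

The different file name, location and timestamps of Alice’s copy do not affect its fingerprint, while Bob’s one-character edit produces an unrelated value.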
Advanced De-Duplication: Email
Email messages are even trickier, as they include far more fields and information than just the typical fields (To, From, Date, Subject, Body, etc.) that we normally see. Copies of a message with the exact same origin, such as one that Alice sends to three different people at the same time, will include different information in their metadata about routes, times, mail servers, etc. Normally, eDiscovery tools and experts are aware of these issues and include in hash calculations only those fields that are useful in determining whether messages are duplicates.
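As a rough sketch of that idea, the Python below hashes only a chosen set of fields plus the body. The field list in `DEDUPE_FIELDS`, the function name `email_fingerprint`, and the sample addresses and servers are all assumptions for illustration; real tools choose and normalize fields in their own ways.

```python
import hashlib
from email import message_from_string

# Assumed choice of fields; actual eDiscovery tools may use a different set.
DEDUPE_FIELDS = ("From", "To", "Subject", "Date")


def email_fingerprint(raw_message):
    """Hash selected headers and the body, ignoring routing headers."""
    msg = message_from_string(raw_message)
    parts = [(msg.get(name) or "").strip() for name in DEDUPE_FIELDS]
    parts.append(msg.get_payload().strip())
    joined = "\x1f".join(parts)  # separator keeps field boundaries distinct
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


common = (
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Q2 numbers\n"
    "Date: Mon, 2 May 2011 09:00:00 -0400\n"
    "\n"
    "See attached.\n"
)
# Each recipient's copy carries its own routing information.
bob_copy = "Received: from mx1.example.com; Mon, 2 May 2011 09:00:03 -0400\n" + common
carol_copy = "Received: from mx2.example.com; Mon, 2 May 2011 09:00:07 -0400\n" + common

print(email_fingerprint(bob_copy) == email_fingerprint(carol_copy))
print(hashlib.sha256(bob_copy.encode()).hexdigest()
      == hashlib.sha256(carol_copy.encode()).hexdigest())
```

Hashing the selected fields treats the two copies as duplicates, whereas hashing each raw message in full does not, because the routing headers differ.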
Putting It All Together
Let’s apply this information to a sample eDiscovery case. Alice, Bob and Carol are key players in an insider trading scandal, and the company’s eDiscovery Team is identifying and collecting relevant evidence. It turns out that a file called “badStuff.doc” is very relevant. A copy was stored on each of their laptops, another sits in Bob’s fileshare directory, and another is attached to an email message stored in the company’s email archive: five copies in total. The lawyers simply instructed the company’s IT Liaison to “de-duplicate everything”. In this case, the IT team collected laptop data first, so Bob’s and Carol’s laptop copies are removed from processing as duplicates, because they have the same hash value as Alice’s laptop copy (which came first). Another IT person collected data from the fileshares and the email archive, so both of those copies are also preserved.
The parties agree to a “global” production of all data and to perform further de-duplication on production. In that case, Alice’s version of “badStuff.doc” would be produced, but not Bob’s fileshare copy (because it’s a duplicate) and not Bob’s or Carol’s laptop copies (because we dropped those at collection). The email attachment may be produced, depending on the tools and processes that we use. If we produce the data by de-duplicating by custodian, then we’ll have a different result because Bob’s fileshare version will not be dropped.
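The difference between the two approaches can be sketched as follows. This is a simplified model, not any particular tool’s behavior: the `Item` record, the function names and the placeholder hash value are invented for the example, and only the two file copies that survived collection (Alice’s laptop copy and Bob’s fileshare copy) are modeled.

```python
from collections import namedtuple

# Hypothetical record for one collected file.
Item = namedtuple("Item", "custodian source content_hash")


def dedupe_global(items):
    """Keep the first item seen for each hash, across all custodians."""
    seen, kept = set(), []
    for item in items:
        if item.content_hash not in seen:
            seen.add(item.content_hash)
            kept.append(item)
    return kept


def dedupe_by_custodian(items):
    """Keep the first item seen for each (custodian, hash) pair."""
    seen, kept = set(), []
    for item in items:
        key = (item.custodian, item.content_hash)
        if key not in seen:
            seen.add(key)
            kept.append(item)
    return kept


# Copies remaining after collection, in the order they were collected.
collected = [
    Item("Alice", "laptop", "hash-of-badStuff"),
    Item("Bob", "fileshare", "hash-of-badStuff"),
]

print([i.custodian for i in dedupe_global(collected)])
print([i.custodian for i in dedupe_by_custodian(collected)])
# Global de-duplication depends on order: reverse it and Bob's copy survives.
print([i.custodian for i in dedupe_global(list(reversed(collected)))])
```

Global de-duplication leaves a single copy attributed to whichever custodian happened to come first, while per-custodian de-duplication keeps one copy for each person who held the file.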
Is this important? As with so many other things, and to the significant frustration of most lawyers, the answer is “it depends”. Certainly we’ve met at least part of our duty to provide the other side with all relevant information – they have a copy of “badStuff.doc”. But was it important in this case that Bob and Carol also had their own copies? Again, in our case the email might show that they received a copy at some point – but what if actually having a copy is more important in this particular matter? Or what if the emails with that attachment had been deleted (according to a retention schedule), prior to the investigation? In that case, there would be no link to Carol ever having received the document (despite the fact that we found it on her laptop) and it’s a hit-or-miss proposition on whether we produce Bob’s version. If we collect in a different order, we may produce a copy from Bob and drop the versions held by Alice and Carol. Is this worrisome?
Conclusion
Ideally, systems that de-duplicate data will de-duplicate objects while maintaining a “stub” of information with metadata and a “pointer” to the original file. In our case, then, there would be several stubs of information establishing where we found all of the documents, each of which “pointed” to a single instance of the file itself, since it was identical. This way, when we searched, viewed or produced the file, we would maintain the information that it actually resided in several locations, and not just one. In this way, we can still obtain many of the benefits of de-duplication without the problems.
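One way to picture the stub-and-pointer approach is the minimal Python sketch below. It is an assumption-laden illustration rather than a description of any product: the `Stub` and `DedupeStore` names, the use of a SHA-256 hash as the “pointer”, and the sample paths are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Stub:
    """Where a copy was found, plus a pointer (the hash) to the stored file."""
    custodian: str
    path: str
    content_hash: str


class DedupeStore:
    def __init__(self):
        self.blobs = {}  # content hash -> bytes, one instance per unique file
        self.stubs = []  # one stub for every location where a copy was found

    def add(self, custodian, path, data):
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)  # store the contents only once
        stub = Stub(custodian, path, key)
        self.stubs.append(stub)
        return stub

    def locations(self, content_hash):
        return [s for s in self.stubs if s.content_hash == content_hash]


store = DedupeStore()
contents = b"contents of badStuff.doc"
first = store.add("Alice", "laptop:/docs/badStuff.doc", contents)
store.add("Bob", "laptop:/docs/badStuff.doc", contents)
store.add("Carol", "laptop:/docs/badStuff.doc", contents)
store.add("Bob", "fileshare:/bob/badStuff.doc", contents)
store.add("Archive", "email-attachment:badStuff.doc", contents)

print(len(store.blobs), "stored file;", len(store.stubs), "stubs")
print(sorted({s.custodian for s in store.locations(first.content_hash)}))
```

Only one instance of the file is stored and reviewed, yet every custodian and location remains on record and can be reported at production time.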
So when you’re looking to save money with de-duplication, do your homework and be careful what you ask for. The same rule applies when the other side is providing you with de-duplicated information: understand what you are getting, and whether it matters in your case.
Filed under: Uncategorized Tagged: | de-duplication, duplicate information, eDiscovery, EMC, enterprise, James D. Shook, metadata, Ralph Losey, technology


Dear Jim:
Great post. Clear, interesting and helpful. Every lawyer should read this. Best regards, Craig Ball
Craig, high praise indeed coming from you. Thank you. And for anyone who enjoyed this article but wants (a lot) more, you should check out Mr. Ball’s excellent site: http://www.craigball.com.