Deduplicating PDF Files (Emails) Using AutoPortfolio™ Plug-in For Adobe® Acrobat®
What is Email/Document De-Duplication?
Emails are one of the most important types of litigation documents. It is often necessary to compile hundreds or even thousands of emails for a single court case. Typically, a significant number of these emails belong to email “threads” and are redundant, because email replies almost always include the content of the previous emails. It is sufficient to keep only the last email from each “thread” and discard the intermediate emails. The process of finding unique documents (emails) is often referred to as “de-duplication”. Detecting and discarding redundant documents can greatly reduce the number of documents/emails that need to be prepared during the electronic discovery process.
The AutoPortfolio plug-in provides functionality for de-duplication of PDF documents. These can be PDF files created from emails or any other kinds of text documents. The process is specifically fine-tuned for handling emails. The emails need to be converted into PDF format in order to be used in the de-duplication. This allows both emails and their attachments to be included in the de-duplication process. The conversion into PDF format is provided by both Adobe Acrobat and the AutoPortfolio plug-in.
What is a Duplicate File?
Any PDF file whose text is either identical to or fully contained in the text of another PDF file is considered a duplicate. Note that only text content is compared. This process cannot be used for scanned PDF files that have not been run through text recognition (OCR). The de-duplication process also does not compare images. There are other types of processing available for finding duplicate pages where the comparison is performed "visually" without using the actual text. The de-duplication can also instantly detect files that are completely identical at the “binary level”.
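The definition above can be illustrated with a minimal Python sketch. It assumes the text of each PDF has already been extracted into a string; the function names, the whitespace normalization, and the tie-breaking rule for identical texts are illustrative assumptions and do not describe the plug-in's actual implementation.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase (an assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def binary_identical(a: bytes, b: bytes) -> bool:
    """True when two files are the same on the binary level (SHA-256 digests match)."""
    return hashlib.sha256(a).digest() == hashlib.sha256(b).digest()


def find_duplicates(texts: dict[str, str]) -> set[str]:
    """Return names of files whose text is identical to or contained in another file.

    Files with no text (for example, scans without text recognition) are never
    flagged. When two texts are identical, the file whose name sorts first is
    kept and the other is reported as the duplicate (an assumed tie-break).
    """
    norm = {name: normalize(text) for name, text in texts.items()}
    duplicates = set()
    for name, text in norm.items():
        if not text:
            continue  # nothing to compare
        for other, other_text in norm.items():
            if other == name:
                continue
            identical = text == other_text
            if text in other_text and (not identical or other < name):
                duplicates.add(name)
                break
    return duplicates


if __name__ == "__main__":
    sample = {
        "first.pdf": "Can we meet on Monday?",
        "reply.pdf": "Yes, Monday works.\nCan we meet   on Monday?",
    }
    print(find_duplicates(sample))
```

In this sketch the original message is flagged because its text is fully contained in the reply, which mirrors the rule of keeping only the last email of a thread.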
Email Handling
The algorithm is specialized for processing email text: it avoids comparing email headers, which may differ even when the email text is the same. This can happen when the same email is collected from multiple recipients, or when it was sent to a group of people and received by the same person more than once. There can be multiple unique emails in a single email thread if the original email or any of the replies contain attachments.
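One way to picture header-insensitive comparison is the sketch below. It assumes that the extracted text of each email starts with lines such as "From:", "To:" or "Subject:" followed by the message body; the list of field names and the function name are hypothetical, and the plug-in's real header handling is not documented here.

```python
import re

# Assumed set of header field names found at the top of a converted email.
HEADER_LINE = re.compile(r"^(from|to|cc|bcc|sent|date|subject)\s*:", re.IGNORECASE)


def strip_leading_headers(text: str) -> str:
    """Drop the leading block of header lines (and blank lines) from email text."""
    lines = text.splitlines()
    i = 0
    while i < len(lines) and (HEADER_LINE.match(lines[i]) or not lines[i].strip()):
        i += 1
    return "\n".join(lines[i:]).strip()


if __name__ == "__main__":
    copy_for_bob = "From: Ann\nTo: Bob\nSubject: Lunch\n\nSee you at noon."
    copy_for_carol = "From: Ann\nTo: Carol\nSubject: Lunch\n\nSee you at noon."
    # The two copies differ only in the "To:" header, so their bodies match.
    print(strip_leading_headers(copy_for_bob) == strip_leading_headers(copy_for_carol))
```

Only the leading header block is removed in this sketch; headers quoted inside a reply remain part of the body text.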
Workflow Outline
1. Export email messages (or whole folders) from MS Outlook (or any other email app) into PDF Portfolio format. This is standard functionality provided by Adobe Acrobat. The output is a single PDF Portfolio file in which the emails are converted into PDF, but all attachments remain in their native file formats. A PDF Portfolio is an archive of other files, not a regular PDF document.
2. Extract individual emails with attachments as separate PDF documents by using the AutoPortfolio plug-in for Adobe Acrobat. Each email is exported from the PDF Portfolio as a separate PDF file, with attachments converted into PDF and appended right after the email text.
3. Run the de-duplication process to find redundant documents. Unique documents can be copied into another folder and duplicate files can be discarded.
4. Additional files can be quickly checked for duplicates using the existing set of files. The de-duplication process computes a special “fingerprint” file for every PDF document. It takes some time to create, but once it is computed, it is very fast to check new files for duplicates. Fingerprint files are computed only once.
Input Documents
In this tutorial we are going to use a sample email folder that contains 4 threads with 5 email replies each. The goal is to find emails that contain text from other emails and discard documents that are redundant. The de-duplication process will then be repeated with one new reply added to each thread, to illustrate that new files can be quickly checked for duplicates once the fingerprint files are computed.
You need a copy of Microsoft® Outlook® (or any other email application) and Adobe® Acrobat® Professional with the AutoPortfolio™ plug-in installed on your computer in order to follow this tutorial. You can download trial versions of both Adobe Acrobat and AutoPortfolio™.
Step 1 - Export Outlook® Email Folder To PDF Portfolio File
Start the Microsoft® Outlook® application. Right-click the email folder you want to convert (for example “Inbox”), then select "Convert “Inbox” to Adobe PDF" from the popup menu.
Step 2 - Specify Output File Name And Location
Specify output file name and location in the "Save Adobe PDF File As" dialog that will appear on the screen. Press "Save" button to start conversion.
Step 3 - Inspect the Conversion Results
Once the conversion is finished, the output PDF portfolio is going to be automatically opened in Adobe® Acrobat®. Inspect the results and close the tab with PDF Portfolio file.
Step 4 - Open "Extract Files From PDF Portfolio" Dialog
Select "Plug-ins > AutoPortfolio Plug-in > Extract Files From Portfolio(s)" from the main menu.
Step 5 - Select Input PDF Portfolio
Press "Add Files" to specify an input PDF Portfolio file.
Select a PDF Portfolio file. Click "Open" once done.
Step 6 - Sort PDF Portfolio Records
The "Specify Sorting Order" dialog appears on the screen. Click on the column headers to arrange the emails in the desired order. Use the "Select Records" menu for more sorting options.
Step 7 - Select Records For Extraction
Either all emails or only a few specific ones can be selected for extraction. In the following example the records have been sorted by date and 20 entries have been selected. Click "OK" once done.
Step 8 - Start the Extraction
Click "Browse" and specify an output folder. Check output options if you want to extract and merge file attachments. Click "OK" to start the extraction process.
Step 9 - Open the Processing Report
Once the processing is completed, a message appears on the screen reporting the number of files that have been extracted and asking whether you want to display a processing report. Click "OK" to display the detailed report. The report is in HTML format and will be opened in the default web browser installed on your computer.
Step 10 - Inspect the Processing Report
The report lists the file name, description, file creation and modification dates, file size in bytes, number of attachments, and MD5 hash value for each email/document and attachment.
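The MD5 values in the report can be cross-checked independently. The following sketch uses Python's standard hashlib module to compute the MD5 hash of a file in chunks; the function name and the example file name are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    example = Path("extracted_email.pdf")  # hypothetical file name
    if example.exists():
        print(example.name, md5_of_file(example))
```

Two files with the same MD5 value are, for practical purposes, identical on the binary level, so the value can be compared against the one listed in the report.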
Step 11 - Open "PDF Document Deduplication" Menu
Open Adobe® Acrobat® and select "Plug-ins > AutoPortfolio Plug-in > Deduplicate PDF Files" from the main menu.
Step 12 - Select Deduplication Method
Click "Select All Files From Folder".
Step 13 - Specify an Input Folder
Select the input folder that contains extracted PDF files. Click "OK" once done.
Step 14 - Start Deduplication
The "Find Duplicate and Near-Duplicate Documents" dialog will be opened. It contains the list of input PDF files. Click "Deduplicate" to start the process.
Step 15 - Examine the Report Dialog
The dialog reports the number of duplicate documents. Click "OK" to proceed.
Step 16 - Inspect the Results
Once the deduplication process is completed, all duplicate files will be marked in red. The user can now use the "File", "Select" and "Edit" menus to perform various operations on the results. Files can be copied to another folder (use the "File" menu selections), saved as a load file (use the "Save File List As..." button), or saved as an Excel-ready CSV spreadsheet. Note that if some PDF files cannot be opened or processed, they will be highlighted in yellow and show a "Processing Error" status in the "Is Duplicate" column.
The plug-in creates a special "fingerprint" file for each input PDF file. If a PDF file already has a corresponding "fingerprint" file, then the existing file is used. The "fingerprint" file contains a text "map" of the document that allows a fast comparison of two files without the need to compare every byte of each file to every possible location in another file. Creating a "fingerprint" file takes some time, but because it is saved to disk, this is a one-time cost. Once a file has a "fingerprint" computed, the comparison between two files is extremely fast. Do not delete the "fingerprint" files if you want to run de-duplication multiple times.
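The idea of computing a fingerprint once and reusing it can be sketched as follows. The format of the plug-in's fingerprint files is not described in this tutorial, so everything here is an assumption made for illustration: the fingerprint is a set of hashed five-word windows ("shingles"), it is cached in a JSON file next to an already extracted text file, and containment is approximated by a subset test.

```python
import hashlib
import json
from pathlib import Path


def shingles(text: str, k: int = 5) -> set[str]:
    """Hash every window of k consecutive words (one window if the text is shorter)."""
    words = text.lower().split()
    count = max(1, len(words) - k + 1)
    return {
        hashlib.sha1(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(count)
    }


def load_or_build_fingerprint(text_path: Path) -> set[str]:
    """Reuse a cached fingerprint if one exists; otherwise compute and save it."""
    cache = text_path.with_name(text_path.name + ".fp.json")  # hypothetical naming
    if cache.exists():
        return set(json.loads(cache.read_text(encoding="utf-8")))
    fingerprint = shingles(text_path.read_text(encoding="utf-8"))
    cache.write_text(json.dumps(sorted(fingerprint)), encoding="utf-8")
    return fingerprint


def probably_contained(inner: set[str], outer: set[str]) -> bool:
    """Approximate test: every shingle of the inner text also occurs in the outer text."""
    return inner <= outer
```

With cached fingerprints, checking a new file against an existing set only requires building one new fingerprint and running set comparisons, which is the reason repeated runs are much faster. The subset test is a necessary condition for containment, not a proof of it, so a real tool would need a stricter check.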
Step 17 - Add More Files For the Deduplication Process
The plug-in allows new files to be added to the deduplication process at any time. Let's assume that a new email reply has been received for each of the 4 conversations, and that these emails have already been converted into a PDF Portfolio and extracted into separate PDF files. Click "Add Files" to add more PDF files to the deduplication process.
Step 18 - Select PDF Files
Select new PDF files for deduplication. Click "Open" once done.
The dialog reports the number of files that have been added. Click "OK" to proceed.
Step 19 - Start Deduplication
Click "Deduplicate" to run the process again. Note that this time the deduplication process will be much faster than the previous one, because the existing "fingerprints" are used.
Step 20 - Examine the Report Dialog
The dialog reports the number of duplicate documents that have been detected. Click "OK" to proceed.
Step 21 - Inspect the Results
Once the deduplication process is completed, all duplicate files are marked with red highlights. Note that this time the deduplication process produces different results, because the newly added files contain the full text of all previous messages.
Step 22 - Copy Unique Files To Folder
The plug-in can separate unique documents from duplicates by copying them into another folder. Select "File > Copy Unique Files To Folder" from the menu.
Step 23 - Select Destination Folder
Select a destination folder. Click "OK" once done.
Step 24 - Examine the Report Dialog
The dialog shows the number of files that have been copied to the destination folder. Click "OK" to proceed.
Step 25 - Inspect the Results
All unique files have been copied into the destination folder.
The unique files are the last emails of each thread, each of which includes the text of all the previous emails and replies. Note that if any of the emails contained attachments, then there will be more than one unique email for that thread.
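For readers who keep a list of duplicate file names (for example, from a saved CSV file list), the copy operation in Steps 22-25 can be approximated outside Acrobat with a short Python sketch. The function name and its arguments are hypothetical; this is not how the plug-in itself performs the copy.

```python
import shutil
from pathlib import Path


def copy_unique_files(files, duplicate_names, destination):
    """Copy every file whose name is not listed as a duplicate into destination.

    Returns the names of the files that were copied.
    """
    target = Path(destination)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in map(Path, files):
        if source.name in duplicate_names:
            continue  # skip files already marked as duplicates
        shutil.copy2(source, target / source.name)  # copy2 also preserves timestamps
        copied.append(source.name)
    return copied
```

The source files are left in place, so the original folder still contains both the unique and the duplicate documents after the copy.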