Why use Enron data, anyway?

From Wikipedia's article on the Enron Corpus:

The Enron Corpus is a large database of over 600,000 emails generated by 158 employees[1] of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse.[2]

This makes it ideal, because:

  • It's public
  • It's relational
  • It has a relatively large number of rows, but still fits inside a 60 GB VM

(Another really good candidate, if you have the VM / hard drive space, is the Stack Overflow data set.)

