Showing posts with label enron. Show all posts

Wednesday, December 30, 2020

mySQL: Sample Databases

How do I load customer-agnostic sample databases into mySQL?


The issue with demos and general-purpose query learning is that both require data, and sometimes different types of data.


For example, for general-purpose query learning, the Enron dataset is amazing. It is especially good for window functions, where a large number of rows in a single table is what you want. I found it here:

http://www.ahschulz.de/enron-email-data/
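As a small, self-contained sketch of the kind of window function query this dataset is good for, the Python snippet below uses an in-memory SQLite database as a stand-in for mySQL, so it runs without a server or the Enron download. The recipientinfo table and its rid column appear in the queries elsewhere on this blog; the other columns (mid, rtype, rvalue) and the five sample rows are assumptions made up for illustration. It assumes SQLite 3.25 or later (window function support); the same SELECT should also run on mySQL 8.0 or later.

```python
import sqlite3

# In-memory SQLite database standing in for the mySQL "enron" schema.
conn = sqlite3.connect(":memory:")

# Assumed, simplified layout: only rid is taken from the blog's queries.
conn.execute(
    "CREATE TABLE recipientinfo ("
    "rid INTEGER PRIMARY KEY, mid INTEGER, rtype TEXT, rvalue TEXT)"
)

# Made-up sample rows: message 100 has two recipients, message 101 has three.
conn.executemany(
    "INSERT INTO recipientinfo VALUES (?, ?, ?, ?)",
    [
        (1, 100, "TO", "a@example.com"),
        (2, 100, "CC", "b@example.com"),
        (3, 101, "TO", "c@example.com"),
        (4, 101, "TO", "d@example.com"),
        (5, 101, "BCC", "e@example.com"),
    ],
)

# Number each recipient within its message and attach the per-message count,
# without collapsing the rows the way GROUP BY would.
query = """
SELECT rid,
       mid,
       ROW_NUMBER() OVER (PARTITION BY mid ORDER BY rid) AS position_in_message,
       COUNT(*)     OVER (PARTITION BY mid)              AS recipients_in_message
FROM recipientinfo
ORDER BY rid
"""
window_rows = conn.execute(query).fetchall()

for row in window_rows:
    print(row)
```

Every input row survives in the result, which is the main difference from a grouped aggregate, and it is why a table with many rows makes for better practice.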


However, for other purposes, especially those with more of a business flavor such as aggregates and groups, something like the Northwind database is ideal.

The GitHub user jpwhite3 has published the Northwind sample database as a set of mySQL scripts.

https://github.com/jpwhite3/northwind-MySQL
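To give a feel for the "aggregates and groups" style of query, here is a minimal sketch in Python, again using in-memory SQLite as a stand-in so that it runs without loading anything. The order_details table, its columns, and its rows are hypothetical and only loosely modeled on a Northwind-style order line table; they are not copied from the scripts above.

```python
import sqlite3

# In-memory SQLite database; no mySQL server or Northwind scripts needed.
conn = sqlite3.connect(":memory:")

# Hypothetical order-line table, invented for this example.
conn.execute(
    "CREATE TABLE order_details ("
    "order_id INTEGER, product TEXT, quantity INTEGER, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO order_details VALUES (?, ?, ?, ?)",
    [
        (1, "Chai", 10, 18.0),
        (1, "Chang", 5, 19.0),
        (2, "Chai", 2, 18.0),
        (3, "Tofu", 4, 23.25),
    ],
)

# One output row per product: line count, units sold, and revenue.
query = """
SELECT product,
       COUNT(*)                   AS order_lines,
       SUM(quantity)              AS units,
       SUM(quantity * unit_price) AS revenue
FROM order_details
GROUP BY product
ORDER BY product
"""
product_totals = conn.execute(query).fetchall()

for row in product_totals:
    print(row)
```

The SELECT itself is plain SQL and should carry over to mySQL unchanged once a real table is substituted for the invented one.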


Tuesday, December 29, 2020

mySQL: Basic aggregates from a table

In mySQL, how do I get the basic aggregates from a table? 



mySQL:

use enron;
select COUNT(*) as counted from recipientinfo;
select MIN(rid) as rid_min from recipientinfo;
select MAX(rid) as rid_max from recipientinfo;
select SUM(rid/10.0) as rid_summed from recipientinfo;
select AVG(rid/10.0) as rid_avgd from recipientinfo;

Explanation:

No real surprises here - they work the same in both places.

SQL Server:

use Enron;
go
select COUNT(*) as counted from dbo.recipientinfo;
select MIN(rid) as rid_min from dbo.recipientinfo;
select MAX(rid) as rid_max from dbo.recipientinfo;
select SUM(rid/10.0) as rid_summed from dbo.recipientinfo;
select AVG(rid/10.0) as rid_avgd from dbo.recipientinfo;
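The five aggregates can also be folded into a single SELECT so the table is scanned once. The Python sketch below tries that against an in-memory SQLite stand-in rather than the real enron database; the one-column recipientinfo table and its four rid values are made up so the results are easy to check by hand.

```python
import sqlite3

# In-memory SQLite database standing in for the enron schema.
conn = sqlite3.connect(":memory:")

# Simplified, assumed table: just the rid column used by the aggregates.
conn.execute("CREATE TABLE recipientinfo (rid INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO recipientinfo (rid) VALUES (?)",
    [(10,), (20,), (30,), (40,)],
)

# All five aggregates in one pass over the table.
query = """
SELECT COUNT(*)        AS counted,
       MIN(rid)        AS rid_min,
       MAX(rid)        AS rid_max,
       SUM(rid / 10.0) AS rid_summed,
       AVG(rid / 10.0) AS rid_avgd
FROM recipientinfo
"""
counted, rid_min, rid_max, rid_summed, rid_avgd = conn.execute(query).fetchone()

print(counted, rid_min, rid_max, rid_summed, rid_avgd)
```

Dividing by 10.0 rather than 10 keeps the arithmetic out of integer math, so the sum and average come back as fractional values.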

Tuesday, September 5, 2017

Enron data on github

I've posted the Enron data on GitHub, found here:

https://github.com/busynovadad/EnronData


Tuesday, August 15, 2017

Why use Enron data, anyway?

From Wikipedia's article on the Enron Corpus:

The Enron Corpus is a large database of over 600,000 emails generated by 158 employees[1] of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse.[2]

This makes it ideal, because:

  • It's public
  • It's relational
  • It has a relatively large number of rows, but still fits inside a 60 GB VM


(Another really good candidate, if you have the VM / hard drive space, is the Stack Overflow data set.)

Tuesday, August 8, 2017

How does one get Enron data?

For upcoming samples, I'm going to be using Enron data.

Some places to find this information:

https://www.cs.cmu.edu/~./enron/

Others:

https://www.opensciencedatacloud.org/publicdata/enron-emails/

https://www.kaggle.com/wcukierski/enron-email-dataset