What Will You Learn?

  • What is Amazon Athena?
  • What are the basics of Amazon Athena?
  • What are the use cases for Amazon Athena?
  • How can I import data into Amazon Athena?
  • Where can I learn more about Amazon Athena?

The Basics

Learn Amazon Athena basics. Amazon Athena is meant for querying copious amounts of data. It’s a straightforward service based on Presto. The Presto Foundation calls Amazon Athena and Amazon EMR Presto Cloud on their website.

It’s not cheap to use Amazon Athena, but it is convenient. It costs $5.00 per TB of data scanned. It’s possible to pay a third of that with compression. Additional savings occur when querying a single column. The golden feature of Athena is its scaled cluster of servers. Setting up your own Presto analytics engine would take more work.

Primary Use Cases

  • Querying vast amounts of
    • Log Data
    • Behavioral Data
    • Noncritical data

Importing Data

I needed to query a 58G CSV file. I could load the ~1 billion rows into PostgreSQL, but any queries on the data caused my machine to crash. I wrote a Python script to query the data, which works but takes too long to process. So, I brought out the big guns.

I tried Athena as a solution. Here’s how it went down.

I created this table to store my CSV file. This table holds the output of an AWS command-line query.

DROP TABLE learnamazonathena.s3_objects;
CREATE EXTERNAL TABLE IF NOT EXISTS learnamazonathena.s3_objects (
         time string,
         bytes bigint,
         object string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://learn-amazon-athena/csv';

Amazon S3 is now the data store for the CSV file. I generated the CSV file with these commands.

# Query all S3 objects recursively
aws s3 ls --recursive learn-amazon-athena > s3_objects.csv

# Format the output as a CSV file
perl -p -i -e ';s/(^.{0,19})(\s+)([0-9]+)(\s)(.+)$/"\1","\3","\5"/g'; applications.csv

I ran this query against the newly created table.

SELECT split(object,
         ';/';), SUM(bytes)
FROM learnamazonathena.s3_objects
GROUP BY  split(object, ';/';)

This query took seventeen seconds to complete.

I love how fast and straightforward Amazon Athena is. The CSV file I created is small in the grand scheme of things. Facebook built Presto to query obscene amounts of data and later open-sourced it.

Go open source!

Learn Amazon Athena - Beyond the Basics