SQL for Data Analyst

These days lots of data available in digital form, ability to analyze and get meaning from that data become more important. Usually such job is called Data Analysis or Data Mining and the person who does that is called Data Analyst.

Actually, I wrote that article for my brother (who's Analyst and ask me about SQL) and decided to publish it because it may be also useful for others.

Main skills of Analyst are Mathematics, Statistics and Domain expertise. But in order to apply those skills he should be able to get the data itself. Widely supported way to get access to data is called SQL, and Analyst can benefit greatly if it knows basics of it.

SQL is a declarative language for querying and transforming data stored in relational database.

  • Declarative - it means that you declare what you want without explicitly telling how to do that. And database figures it out by herself how to fulfill your request in the best way. It's a good thing, because describe what you want usually much simpler than to explain how to do that.
  • Relational database - special type of database (it's the most widely used type of database) that stores data as rows (also called records) in tables.

Basics of SQL

Basics of SQL is very simple, it may looks complex at first, but usually a couple of hours enough to get started and a couple of days to get familiar.

There are two types of operations you (as an Analyst) usually want to perform on data in database - query it for subset of data you are interested in, and (more rarely) update data.

Thankfully those operations is what SQL is very good at, and it's easy to learn and understand it.

Everything in relational database is stored in tables, there may be multiple tables and they may be connected (related) to each other. Description of how data is split between tables and how tables are related is called the schema of database.

Things you need to know about relational databases:

  • What are table, row, column and cell.
  • How data is stored in database and how it's split between tables.
  • Relations between tables, what's the cartesian product of tables.
  • What are primary and foreign keys of table.
  • Data types, it's enough to know about string, number, currency and date only.
  • Entity–relationship model - not required, but it really helps to understand how tables are related to each other and to build complex queries.

When you query database you tell database what rows from what tables you need, the result of querying is also represented as a table.

Things you need to know about querying:

  • How to build query - the select statement.
  • Define what fields you need as a result of query (what columns should be in the result table).
  • Define what rows should be returned by query - where statement.
  • How to prevent rows show multiple times in result, distinct statement.
  • Sorting, sort by statement.
  • Grouping and aggregation group by statement and formulas sum, avg, count.
  • How to get results from multiple tables - join statement, types of joins - left, right and full (at first it looks complex, but actually it's easy, allow some time to get it). It would be helpful to understand entity-relationship model, it helps a lot.

Usually it's rare when Analyst need to update database, but sometimes it's handy to quickly fix something or alter data to see it in another way, it's helpful to know how to do that.

Things you need to know about updating data:

  • How to insert new row into table, the insert statement.
  • How to update one or multiple rows, the update statement.

How to learn

  1. Read quickly that course and do exercises http://www.w3schools.com/sql
  2. Get in touch with someone who knows SQL, if you don't know such person - ask question on http://stackoverflow.com or similar Q&A forum.
  3. Start using it in your job for the real things, if in trouble ask for help that person.

Don't be afraid and don't delay it for long. Also don't buy fat books about SQL and database design, too much details will be a burden.

SQL and relational databases are easy and practical things - best way to learn it - by doing.

Advanced SQL

There are also complex things in SQL and Relational databases - usually it's needed for Developers only and almost never used by Analysts, you can safely ignore it.

Things you don't need to know:

  • Transactions and ACID - things needed to reliably update data in database without surprises and side effects.
  • Indexing and query optimization - allow to make queries faster.
  • Triggers, Stored procedures, Views - allow to do advanced stuff for querying and update.
  • Normalization - how to design database schema and split data between tables in database to prevent data duplication.
  • Concurrency and locks - things needed to prevent surprises and side effects when database used by multiple users.

What's OLAP and Business Intelligence?

Formally OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly. Sounds complex, but usually it's a pretty simple tool that execute a database query and display returned data as a graph, table or in other formats.

Sometimes OLAP tools can also automatically build SQL queries (only very simple ones), it's handy when you need to play and experiment with data, slice it quickly in multiple ways and see graphs.

Same with BI - usually it some sort of simple graphs built on result of queries and aggregations. In rare cases it may be really advanced stuff, but in most cases - it's pretty simple.