EMF SQL Webinar Tutorial

Reference: EMF SQL Reference Guide

For this tutorial, you’ll create a custom QA step for a dataset and then try out different SQL queries on that dataset. We’ve included links to the EMF User’s Guide if you need more information on finding datasets or creating your QA step.

Getting Started

Locate the dataset named “ptnonipm_2011NEIv2_POINT_20140913_16sep2014_v0.csv” (Finding Datasets) and open its QA tab in the Dataset Properties Editor (Viewing and Editing Dataset Properties). Add a custom QA step using your name or initials in the name of the QA step. Set the QA step program to SQL (Adding Custom QA Steps).

For our starting query, let’s select all the columns from the underlying dataset table. We’ll restrict the results to records with CO emissions. Paste the following SQL as the argument to your QA step. Tip: Paste into a text editor (e.g., Notepad) to avoid formatting issues. This is also a good way to edit and document queries.

SELECT *
FROM emissions.ds_ptnonipm_2011NEIv2_POINT_20140913_16sep2014_v0_csv_455405112
WHERE poll = 'CO'
ORDER BY ann_value

Notice that we’ve specified the exact table name for our dataset. The table is in the “emissions” table group and the table name “ds_ptnonipm_2011NEIv2_POINT_20140913_16sep2014_v0_csv_455405112” is based on the name of the dataset plus a unique identifier. The syntax for table names in the EMF is table_group.table_name. The inventory files are in the emissions group. Later we will work with tables of reference data (e.g., SCC descriptions, county names, etc.) in the reference group.

Run your QA step and view the results.

Question

How many columns are in the output of your SQL query? [78]

To make things simpler, let’s update the query to use the EMF-specific syntax to refer to the underlying table name. (See EMF-Specific Syntax in the EMF SQL Reference Guide for more information.)

SELECT *
FROM $TABLE[1] e
WHERE e.poll = 'CO'
ORDER BY e.ann_value

Remember that the $TABLE[1] e syntax requires the one character table alias. Technically, we don’t need to specify the table name or alias in the WHERE clause or ORDER BY list when the query only pulls data from one table since there’s no ambiguity about which table to use. We’ll include the alias when we refer to columns in our queries for later when we do join to other tables.

Question

How would you change this query to only return the columns containing the country code (country_cd), region code (region_cd), pollutant name (poll), and annual emissions value (ann_value)? [Replace SELECT * with SELECT e.country_cd, e.region_cd, e.poll, e.ann_value]

Dealing with Errors

What happens if you try to run a QA step with an error in the SQL query? Try running the following query where we’ve mistyped the annual emissions column name.

SELECT e.country_cd, e.region_cd, e.poll, e.ann_emis
FROM $TABLE[1] e
WHERE e.poll = 'CO'
ORDER BY e.ann_value

The QA step will fail to run and you should see a message like the following in the Status window:

Failed to run QA step ‘Step Name’ for Version ‘Initial Version’ of Dataset ‘ptnonipm_2011NEIv2_POINT_20140913_16sep2014_v0.csv’. Check the query - ERROR: column e.ann_emis does not exist

The status message lets us know we should check our SQL query and shows us the error that the database server returned.

WHERE Clause Options

Next, we’ll try out three different WHERE clause options. For each query, think about which rows should be returned. Then run your QA step and look at the results.

SELECT e.country_cd, e.region_cd, e.poll, e.ann_value
FROM $TABLE[1] e
WHERE e.poll ILIKE 'pm%'
ORDER BY e.ann_value

Question

What would happen if the WHERE clause were WHERE e.poll LIKE 'pm%'? [ILIKE does a case-insensitive match whereas LIKE is case-sensistive. There aren’t any pollutants in the dataset that start with lowercase ‘pm’ so the query wouldn’t return any records.]

SELECT e.country_cd, e.region_cd, e.poll, e.ann_value
FROM $TABLE[1] e
WHERE e.poll ILIKE 'pm%' 
AND e.ann_value > 300
ORDER BY e.ann_value

SELECT e.country_cd, e.region_cd, e.poll, e.ann_value AS total
FROM $TABLE[1] e
WHERE e.poll ILIKE 'pm%' 
AND e.ann_value > 300 
AND e.region_cd IN ('23001', '23003', '23005', '23007', 
'23009', '23011', '23013', '23015', '23017', '23019', 
'23021', '23023', '23025', '23027', '23029', '23031')
ORDER BY e.ann_value

Question

How could you use the SUBSTR(string, starting position, length) function to make the last WHERE clause simpler? [Instead of listing all the county codes, you could extract the first two characters of the region code: WHERE ... AND SUBSTR(e.region_cd, 1, 2) = '23' ORDER BY e.ann_value]

In the previous query, how do we find out the list of region codes in the dataset? We can use the DISTINCT keyword to grab unique values.

SELECT DISTINCT e.region_cd
FROM $TABLE[1] e
WHERE e.region_cd LIKE '23%'
ORDER BY e.region_cd

Helpful tip: starting with a simple DISTINCT pollutant QA step is a good way to see which pollutants are in an inventory since different inventory types use different pollutant codes. (See SELECT DISTINCT in the EMF SQL Reference Guide for more information.)

SELECT DISTINCT e.poll
FROM $TABLE[1] e

Back to WHERE clause options, let’s write a query to match records by facility name.

SELECT e.country_cd, e.facility_name, e.region_cd, e.poll, e.ann_value AS total
FROM $TABLE[1] e
WHERE e.poll = 'PM25-PRI' 
AND (e.facility_name ILIKE 'intermountain%' OR e.facility_name ILIKE 'bassett%')
ORDER BY e.facility_name

The next couple of queries use longer WHERE clauses with substring functions on the region_cd and poll columns to select specific states and exclude pollutants. Also, we’re going to change the output columns and change the ordering of the results.

SELECT e.region_cd, e.facility_name, e.poll, e.ann_value
FROM $TABLE[1] e
WHERE SUBSTR(e.region_cd, 1, 2) IN ('09', '11', '10', '25', '24', 
'23', '37', '33', '34', '36', '42', '44', '51', '50', '54') 
AND SUBSTR(e.poll, 1, 1) NOT IN ('1', '2', '3', '4', '5', '6', '7', '8', '9')
ORDER BY e.region_cd, e.facility_name

SELECT e.region_cd, e.scc, e.facility_name, e.poll, e.ann_value
FROM $TABLE[1] e
WHERE SUBSTR(e.region_cd, 1, 2) IN ('09', '11', '10', '25', '24', 
'23', '37', '33', '34', '36', '42', '44', '51', '50', '54') 
AND SUBSTR(e.poll, 1, 1) NOT IN ('1', '2', '3', '4', '5', '6', '7', '8', '9')
ORDER BY e.region_cd, e.scc, e.facility_name

Question

Do the two previous queries return a different number of results? Why or why not? [The two queries use the same WHERE clause so the same rows are matched. The second query adds the SCC column but the number of rows doesn’t change. We are not using any aggregation or grouping so adding the SCC column does not change the aggregation level.]

Remember that row filters in the EMF use the same syntax as the SQL WHERE clause (but do not include the keyword WHERE). Go to the Data tab in the Dataset Properties Editor and View the dataset. Then copy & paste this part of the WHERE clause from the previous query into the row filter box of the Data Viewer:

SUBSTR(e.region_cd, 1, 2) IN ('09', '11', '10', '25', '24', 
'23', '37', '33', '34', '36', '42', '44', '51', '50', '54') 
AND SUBSTR(e.poll, 1, 1) NOT IN ('1', '2', '3', '4', '5', '6', '7', '8', '9')

Combining Columns; Dealing with NULL Values

For our next query, let’s say we want to put a couple of fields together in our output. As an example, we’ll concatenate the facility name with the facility source type. In the output, we’ll label this column facility_info. (Concatenation is described in the Functions for String Values table.)

SELECT e.region_cd, e.facility_name, 
   e.facility_name || ' - ' || e.fac_source_type AS facility_info,
   e.poll, e.ann_value
FROM $TABLE[1] e
WHERE SUBSTR(e.region_cd, 1, 2) IN ('09', '11', '10', '25', '24', 
'23', '37', '33', '34', '36', '42', '44', '51', '50', '54') 
AND SUBSTR(e.poll, 1, 1) NOT IN ('1', '2', '3', '4', '5', '6', '7', '8', '9')
ORDER BY e.region_cd, e.facility_name

Unfortunately, this doesn’t quite give us what we want. When the fac_source_type column is NULL, our new column facility_info is also NULL. You can compare the results for the facility named “BRIDGEPORT HOSPITAL” vs. “BRIDGEPORT INSULATED WIRE CO”. To fix this, we’ll use the COALESCE function to deal with the facility source type column. (See Dealing with NULL Values Using COALESCE in the EMF SQL Reference Guide for more information.)

SELECT e.region_cd, e.facility_name, 
   e.facility_name || ' - ' || COALESCE(e.fac_source_type, 'No source type') AS facility_info,
   e.poll, e.ann_value
FROM $TABLE[1] e
WHERE SUBSTR(e.region_cd, 1, 2) IN ('09', '11', '10', '25', '24', 
'23', '37', '33', '34', '36', '42', '44', '51', '50', '54') 
AND SUBSTR(e.poll, 1, 1) NOT IN ('1', '2', '3', '4', '5', '6', '7', '8', '9')
ORDER BY e.region_cd, e.facility_name

Question

In our fixed query, what value is returned in the facility_info column for the “BRIDGEPORT INSULATED WIRE CO” facility? [BRIDGEPORT INSULATED WIRE CO - No source type]

Aggregate Functions

Up next, we’ll look at some queries that return aggregated values from multiple rows using grouping and aggregate functions.

SELECT e.region_cd, e.scc, e.facility_name, e.poll, SUM(e.ann_value)
FROM $TABLE[1] e
WHERE SUBSTR(e.region_cd, 1, 2) IN ('09', '11', '10', '25', '24', 
'23', '37', '33', '34', '36', '42', '44', '51', '50', '54') 
AND SUBSTR(e.poll, 1, 1) NOT IN ('1', '2', '3', '4', '5', '6', '7', '8', '9')
GROUP BY e.region_cd, e.scc, e.facility_name, e.poll
ORDER BY e.region_cd, e.scc, e.facility_name

The previous query gave us a facility-SCC level report. Next, we’ll generate a county-SCC summary.

SELECT e.region_cd, e.scc, e.poll, SUM(e.ann_value)
FROM $TABLE[1] e
WHERE SUBSTR(e.region_cd, 1, 2) IN ('09', '11', '10', '25', '24', 
'23', '37', '33', '34', '36', '42', '44', '51', '50', '54') 
AND SUBSTR(e.poll, 1, 1) NOT IN ('1', '2', '3', '4', '5', '6', '7', '8', '9')
GROUP BY e.region_cd, e.scc, e.poll
ORDER BY e.region_cd, e.scc

Compare the results of the two queries. Note that when you use an aggregating function in the SELECT statement, you will aggregate the data in the selected column (in this case annual emissions) to the level specified by the columns in your GROUP BY statement. In general, the more columns you specify, the more disaggregated the results will be.

Joining to Reference Data

Now, let’s improve our summary by adding some more descriptive information to the output. We’ll join to the county information reference table reference.fips and include the county and state names in the output.

SELECT e.region_cd, fips.state_name, fips.county, e.scc, e.poll, SUM(e.ann_value)
FROM $TABLE[1] e 
LEFT JOIN reference.fips 
ON e.country_cd = fips.country_code
AND e.region_cd = fips.state_county_fips
WHERE SUBSTR(e.region_cd, 1, 2) IN ('09', '11', '10', '25', '24', 
'23', '37', '33', '34', '36', '42', '44', '51', '50', '54') 
AND SUBSTR(e.poll, 1, 1) NOT IN ('1', '2', '3', '4', '5', '6', '7', '8', '9')
GROUP BY e.region_cd, fips.state_name, fips.county, e.scc, e.poll
ORDER BY e.region_cd, e.scc

Question

Our inventory only contains data for one country while the reference.fips table contains data for multiple countries. Why is it important to include both the region code and country code in our JOIN definition? [If we don’t include the country code match, our output will have multiple rows for each inventory source. For example, region code “09001” matches Fairfield County in Connecticut but is also used by regions in Cuba, Haiti, and the Dominican Republic.]

In the following query we’ve joined to both the reference.fips table and the reference.scc table to add SCC descriptions to our output.

SELECT e.region_cd, fips.state_name, fips.county, 
       e.scc, scc.scc_description, e.poll, SUM(e.ann_value)
FROM $TABLE[1] e 
LEFT JOIN reference.fips 
ON e.country_cd = fips.country_code
AND e.region_cd = fips.state_county_fips
LEFT JOIN reference.scc
ON e.scc = scc.scc
WHERE SUBSTR(e.region_cd, 1, 2) IN ('09', '11', '10', '25', '24', 
'23', '37', '33', '34', '36', '42', '44', '51', '50', '54') 
AND SUBSTR(e.poll, 1, 1) NOT IN ('1', '2', '3', '4', '5', '6', '7', '8', '9')
GROUP BY e.region_cd, fips.state_name, fips.county, e.scc, scc.scc_description, e.poll
ORDER BY e.region_cd, e.scc

Joining to Other Datasets

The EMF also has special syntax to refer to other datasets. In the query below, we are joining our inventory dataset with another dataset named “Gentpro_TREF_HOURLY_BASH_NH3_agNH3_bash_2011ea_11f_newgent_emf_txt_06sep2013_nf_v4_tref.csv”. This dataset’s type is “Temporal Cross Reference (CSV)” and it contains information used for temporally allocating emissions to finer resolutions (i.e. approximating monthly or daily emissions from annual data).

In the example query, we are using the SCC field to join our inventory with the temporal cross-reference. Since this is just an example, we’re limiting the query to a single county for speed.

SELECT e.region_cd, e.facility_name, e.scc, t.profile_type, t.profile_id
FROM $TABLE[1] e
LEFT JOIN $DATASET_TABLE["Gentpro_TREF_HOURLY_BASH_NH3_agNH3_bash_2011ea_11f_newgent_emf_txt_06sep2013_nf_v4_tref.csv", 1] t
ON t.scc = e.scc
WHERE e.region_cd = '23001'
ORDER BY e.region_cd, e.facility_name, e.scc, t.profile_type

Question

Which columns in the output come from our inventory dataset (ptnonipm_2011NEIv2_POINT_20140913_16sep2014_v0.csv) and which columns come from the temporal cross-reference dataset? What in the query tells us which dataset the column comes from? [Inventory dataset columns: region_cd, facility_name, scc; temporal cross-reference dataset columns: profile_type, profile_id. In the SELECT list of the query, the column names include the table aliases. Column names starting with ‘e.’ are from the inventory dataset while names starting with ‘t.’ come from the temporal cross-reference dataset.]

Last updated: November 14, 2014