I’ve been using Google’s cloud services more and more, specifically BigQuery. Besides the speed of queries and the simple API integration for different languages, like R, BigQuery makes available a large number of public data sets that come in quite handy. Here’s a quick guide to leveraging some of the Census data sets with your own data.
Let’s say you have a data set that includes locations with longitude and latitude coordinates like so:
place | longitude | latitude |
---|---|---|
Dodgers Stadium | -118.240288 | 34.072578 |
Citi Field | -73.503273 | 42.81469 |
Smithsonian National Museum of Natural History | -77.025963 | 38.892071 |
Disneyland | -117.926399 | 33.815395 |
Now, we would like to get Census tract-level information for each location. This is pretty simple with BigQuery.
Finding a Location’s Census Tract
First, we need to figure out the Census tract for each location. We can use standard geomgraphy functions to see which Census tract each address intersects with. Here’s a sample query:
SELECT *
EXCEPT (internal_point_geo, tract_geom)
FROM `mm-sandbox-decision-sciences.sample_places.sample_places_long_lat` places
JOIN `bigquery-public-data.geo_census_tracts.us_census_tracts_national` as c_us_n
ON ST_CONTAINS(c_us_n.tract_geom, st_geogpoint(places.longitude, places.latitude))
We need to make the long/lat for each of our locations a geography object (specifically, a point) and then we join the us_census_tracts_national
table to our locations based on which tract (based on tract_geom
) each point intersects with.
Note that we exclude two columns from our output (another convenient feature of BigQuery’s syntax)
Here’s what the results look like:
place | longitude | latitude | state_name | state_fips_code | county_fips_code | tract_ce | geo_id | tract_name | lsad_name | functional_status | area_land_meters | area_water_meters | internal_point_lat | internal_point_lon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Citi Field | -73.503273 | 42.81469 | New York | 36 | 083 | 051800 | 36083051800 | 518 | Census Tract 518 | S | 159662573 | 8309272 | +42.8663030 | -073.5166161 |
Smithsonian National Museum of Natural History | -77.025963 | 38.892071 | District of Columbia | 11 | 001 | 006202 | 11001006202 | 62.02 | Census Tract 62.02 | S | 6539769 | 4970897 | +38.8809933 | -077.0363219 |
Dodgers Stadium | -118.240288 | 34.072578 | California | 06 | 037 | 980010 | 06037980010 | 9800.10 | Census Tract 9800.10 | S | 3757605 | 76091 | +34.0786807 | -118.2395140 |
Disneyland | -117.926399 | 33.815395 | California | 06 | 059 | 980000 | 06059980000 | 9800 | Census Tract 9800 | S | 2766486 | 0 | +33.8096249 | -117.9186718 |
The key here is the geo_id
. We can use that to join in information from the American Community Survey (ACS) for each of our locations based on their Census tract.
Joining ACS Data
Now that we know which Census tracts our locations belong to we can pull in lots of rich information from the ACS. All we need is to reference the right ACS table (found withing the census_bureau_acs
dataset), as there are various levels represented there (e.g. census tracts, census blocks, counties, etc.), and join based on the geo_id
.
Here’s the full query, combining both elements:
SELECT *
EXCEPT (internal_point_geo, tract_geom)
FROM `mm-sandbox-decision-sciences.sample_places.sample_places_long_lat` places
JOIN `bigquery-public-data.geo_census_tracts.us_census_tracts_national` as c_us_n
ON ST_CONTAINS(c_us_n.tract_geom, st_geogpoint(places.longitude, places.latitude))
JOIN `bigquery-public-data.census_bureau_acs.censustract_2018_5yr` as ctract
ON c_us_n.geo_id = ctract.geo_id
And here’s the output for one location (json just to make it easier to read here):
[
{
"place": "Dodgers Stadium",
"longitude": "-118.240288",
"latitude": "34.072578",
"state_name": "California",
"state_fips_code": "06",
"county_fips_code": "037",
"tract_ce": "980010",
"geo_id": "06037980010",
"tract_name": "9800.10",
"lsad_name": "Census Tract 9800.10",
"functional_status": "S",
"area_land_meters": "3757605",
"area_water_meters": "76091",
"internal_point_lat": "+34.0786807",
"internal_point_lon": "-118.2395140",
"geo_id_1": "06037980010",
"do_date": "2014-01-01",
"total_pop": "189.0",
"households": "69.0",
"male_pop": "96.0",
"female_pop": "93.0",
"median_age": "44.4",
"male_under_5": "5.0",
"male_5_to_9": "0.0",
"male_10_to_14": "0.0",
"male_15_to_17": "0.0",
"male_18_to_19": "0.0",
"male_20": "0.0",
"male_21": "1.0",
"male_22_to_24": "0.0",
"male_25_to_29": "0.0",
"male_30_to_34": "5.0",
"male_35_to_39": "13.0",
"male_40_to_44": "12.0",
"male_45_to_49": "17.0",
"male_50_to_54": "0.0",
"male_55_to_59": "5.0",
"male_60_to_61": "0.0",
"male_62_to_64": "7.0",
"male_65_to_66": "5.0",
"male_67_to_69": "6.0",
"male_70_to_74": "0.0",
"male_75_to_79": "5.0",
"male_80_to_84": "5.0",
"male_85_and_over": "10.0",
"female_under_5": "0.0",
"female_5_to_9": "0.0",
"female_10_to_14": "17.0",
"female_15_to_17": "32.0",
"female_18_to_19": "0.0",
"female_20": "0.0",
"female_21": "0.0",
"female_22_to_24": "0.0",
"female_25_to_29": "0.0",
"female_30_to_34": "0.0",
"female_35_to_39": "0.0",
"female_40_to_44": "19.0",
"female_45_to_49": "0.0",
"female_50_to_54": "0.0",
"female_55_to_59": "5.0",
"female_60_to_61": "5.0",
"female_62_to_64": "0.0",
"female_65_to_66": "0.0",
"female_67_to_69": "5.0",
"female_70_to_74": "0.0",
"female_75_to_79": "5.0",
"female_80_to_84": "5.0",
"female_85_and_over": "0.0",
"white_pop": "45.0",
"population_1_year_and_over": "189.0",
"population_3_years_over": "184.0",
"pop_5_years_over": null,
"pop_15_and_over": null,
"pop_16_over": "151.0",
"pop_25_years_over": "134.0",
"pop_25_64": "88.0",
"pop_never_married": null,
"pop_now_married": null,
"pop_separated": null,
"pop_widowed": null,
"pop_divorced": null,
"not_us_citizen_pop": "43.0",
"black_pop": "2.0",
"asian_pop": "25.0",
"hispanic_pop": "117.0",
"amerindian_pop": "0.0",
"other_race_pop": "0.0",
"two_or_more_races_pop": "0.0",
"hispanic_any_race": "117.0",
"not_hispanic_pop": "72.0",
"asian_male_45_54": "0.0",
"asian_male_55_64": "5.0",
"black_male_45_54": "0.0",
"black_male_55_64": "0.0",
"hispanic_male_45_54": "17.0",
"hispanic_male_55_64": "0.0",
"white_male_45_54": "0.0",
"white_male_55_64": "7.0",
"median_income": null,
"income_per_capita": "15788.0",
"income_less_10000": "27.0",
"income_10000_14999": "9.0",
"income_15000_19999": "5.0",
"income_20000_24999": "5.0",
"income_25000_29999": "0.0",
"income_30000_34999": "5.0",
"income_35000_39999": "0.0",
"income_40000_44999": "9.0",
"income_45000_49999": "0.0",
"income_50000_59999": "0.0",
"income_60000_74999": "0.0",
"income_75000_99999": "0.0",
"income_100000_124999": "5.0",
"income_125000_149999": "0.0",
"income_150000_199999": "0.0",
"income_200000_or_more": "4.0",
"pop_determined_poverty_status": "189.0",
"poverty": "136.0",
"gini_index": "0.7086",
"housing_units": "69.0",
"renter_occupied_housing_units_paying_cash_median_gross_rent": null,
"owner_occupied_housing_units_lower_value_quartile": "785300.0",
"owner_occupied_housing_units_median_value": "875000.0",
"owner_occupied_housing_units_upper_value_quartile": "964700.0",
"occupied_housing_units": "69.0",
"housing_units_renter_occupied": "36.0",
"vacant_housing_units": "0.0",
"vacant_housing_units_for_rent": "0.0",
"vacant_housing_units_for_sale": "0.0",
"dwellings_1_units_detached": "28.0",
"dwellings_1_units_attached": "22.0",
"dwellings_2_units": "0.0",
"dwellings_3_to_4_units": "0.0",
"dwellings_5_to_9_units": "0.0",
"dwellings_10_to_19_units": "10.0",
"dwellings_20_to_49_units": "9.0",
"dwellings_50_or_more_units": "0.0",
"mobile_homes": "0.0",
"housing_built_2005_or_later": "0.0",
"housing_built_2000_to_2004": "0.0",
"housing_built_1939_or_earlier": "10.0",
"median_year_structure_built": "1953.0",
"married_households": "31.0",
"nonfamily_households": "38.0",
"family_households": "31.0",
"households_public_asst_or_food_stamps": "5.0",
"male_male_households": "0.0",
"female_female_households": "0.0",
"children": "54.0",
"children_in_single_female_hh": "0.0",
"median_rent": "1544.0",
"percent_income_spent_on_rent": "51.0",
"rent_burden_not_computed": "5.0",
"rent_over_50_percent": "31.0",
"rent_40_to_50_percent": "0.0",
"rent_35_to_40_percent": "0.0",
"rent_30_to_35_percent": "0.0",
"rent_25_to_30_percent": "0.0",
"rent_20_to_25_percent": "0.0",
"rent_15_to_20_percent": "0.0",
"rent_10_to_15_percent": "0.0",
"rent_under_10_percent": "0.0",
"owner_occupied_housing_units": "33.0",
"million_dollar_housing_units": "0.0",
"mortgaged_housing_units": "23.0",
"different_house_year_ago_different_city": "1.0",
"different_house_year_ago_same_city": "11.0",
"families_with_young_children": "5.0",
"two_parent_families_with_young_children": "5.0",
"two_parents_in_labor_force_families_with_young_children": "5.0",
"two_parents_father_in_labor_force_families_with_young_children": "0.0",
"two_parents_mother_in_labor_force_families_with_young_children": "0.0",
"two_parents_not_in_labor_force_families_with_young_children": "0.0",
"one_parent_families_with_young_children": "0.0",
"father_one_parent_families_with_young_children": "0.0",
"father_in_labor_force_one_parent_families_with_young_children": "0.0",
"commute_less_10_mins": "1.0",
"commute_10_14_mins": "0.0",
"commute_15_19_mins": "4.0",
"commute_20_24_mins": "0.0",
"commute_25_29_mins": "5.0",
"commute_30_34_mins": "10.0",
"commute_35_44_mins": "24.0",
"commute_60_more_mins": "1.0",
"commute_45_59_mins": "5.0",
"commuters_16_over": "50.0",
"walked_to_work": "1.0",
"worked_at_home": "11.0",
"no_car": "0.0",
"no_cars": "9.0",
"one_car": "10.0",
"two_cars": "31.0",
"three_cars": "10.0",
"four_more_cars": "9.0",
"aggregate_travel_time_to_work": null,
"commuters_by_public_transportation": "1.0",
"commuters_by_bus": "1.0",
"commuters_by_car_truck_van": "48.0",
"commuters_by_carpool": "0.0",
"commuters_by_subway_or_elevated": "0.0",
"commuters_drove_alone": "48.0",
"group_quarters": "5.0",
"associates_degree": "6.0",
"bachelors_degree": "26.0",
"high_school_diploma": "19.0",
"less_one_year_college": "5.0",
"masters_degree": "5.0",
"one_year_more_college": "10.0",
"less_than_high_school_graduate": "47.0",
"high_school_including_ged": "30.0",
"bachelors_degree_2": "26.0",
"bachelors_degree_or_higher_25_64": "26.0",
"graduate_professional_degree": "10.0",
"some_college_and_associates_degree": "21.0",
"male_45_64_associates_degree": "6.0",
"male_45_64_bachelors_degree": "0.0",
"male_45_64_graduate_degree": "0.0",
"male_45_64_less_than_9_grade": "17.0",
"male_45_64_grade_9_12": "5.0",
"male_45_64_high_school": "1.0",
"male_45_64_some_college": "0.0",
"male_45_to_64": "29.0",
"employed_pop": "61.0",
"unemployed_pop": "1.0",
"pop_in_labor_force": "62.0",
"not_in_labor_force": "89.0",
"workers_16_and_over": "61.0",
"armed_forces": "0.0",
"civilian_labor_force": "62.0",
"employed_agriculture_forestry_fishing_hunting_mining": "0.0",
"employed_arts_entertainment_recreation_accommodation_food": "0.0",
"employed_construction": "7.0",
"employed_education_health_social": "12.0",
"employed_finance_insurance_real_estate": "13.0",
"employed_information": "0.0",
"employed_manufacturing": "10.0",
"employed_other_services_not_public_admin": "1.0",
"employed_public_administration": "0.0",
"employed_retail_trade": "0.0",
"employed_science_management_admin_waste": "1.0",
"employed_transportation_warehousing_utilities": "17.0",
"employed_wholesale_trade": "0.0",
"occupation_management_arts": "17.0",
"occupation_natural_resources_construction_maintenance": "16.0",
"occupation_production_transportation_material": "18.0",
"occupation_sales_office": "10.0",
"occupation_services": "0.0",
"management_business_sci_arts_employed": "17.0",
"sales_office_employed": "10.0",
"in_grades_1_to_4": "0.0",
"in_grades_5_to_8": "17.0",
"in_grades_9_to_12": "32.0",
"in_school": "51.0",
"in_undergrad_college": "1.0",
"speak_only_english_at_home": null,
"speak_spanish_at_home": null,
"speak_spanish_at_home_low_english": null
}
]