Chapter 4 Data
All of our data are based on the 2016 Medicare & Medicaid EHR Incentive Program because participation is cumulative: a provider who adopted and used EHR during the 2014 program, for example, is likely to still be in the program in the following years, so the 2016 program should be the most complete source of providers who have adopted EHR. As mentioned in Related Work, EHR adoption has been rising consistently over the years across the U.S., so there is little to gain from studying use over time when we can just look at the latest collection of data. And since 2017 isn't over yet, it wouldn't be fair to look at EHR usage within the 2017 program, so the most recent complete data are from the 2016 program.
4.1 Eligible Professionals (EPs)
To answer the first of our Initial Questions as to whether or not a physician’s age, education, practice location, etc. may be related to their use of EHR in the incentive program, we must be able to find demographics data at the practitioner level.
4.1.1 Source/Metadata
And here we are: readily available on Data.Medicare.gov are the demographics of physicians eligible for the Medicare & Medicaid EHR Incentive Program at the practitioner level. We used the API link to read in the data set and applied SQL-style filtering options to speed up the download, since it is a large file. We set the limit to exactly 2254703 because that is the exact number of rows the data contain; if we don't specify this, the default is to download only 100 rows. To keep the download manageable, we also specified the exact columns of interest in the download link (we looked through the metadata on the website to select the meaningful and interesting variables).
dat <- read.csv("https://data.medicare.gov/resource/c8qv-268j.csv?$limit=2254703&$select=npi,frst_nm,gndr,cred,med_sch,grd_yr,pri_spec,cty,st,zip,hosp_afl_1,hosp_afl_lbn_1,ehr&$order=npi")
saveRDS(dat, "./data/dat_raw.rds") # saving data for reproducibility
4.1.2 Wrangling
Let’s first take a look at the data structure and a summary of all the columns.
str(dat)
## 'data.frame': 1527291 obs. of 13 variables:
## $ cred : Factor w/ 22 levels "","AA","AU","CNA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ cty : Factor w/ 10192 levels "20850","ABBEVILLE",..: 2345 274 7938 2790 9076 1041 5545 5545 364 364 ...
## $ ehr : Factor w/ 2 levels "","Y": 1 1 1 1 2 1 1 1 1 1 ...
## $ frst_nm : Factor w/ 54156 levels "A","A'LANNE",..: 3360 3360 3360 48311 38890 10170 20706 20706 24158 24158 ...
## $ gndr : Factor w/ 3 levels "F","M","U": 2 2 2 2 2 2 1 1 2 2 ...
## $ grd_yr : int 1994 1994 1994 2003 1999 2003 2007 2007 1997 1997 ...
## $ hosp_afl_1 : Factor w/ 4746 levels "","010001","010005",..: 4371 4371 4371 1112 3220 3617 3168 3168 622 622 ...
## $ hosp_afl_lbn_1: Factor w/ 4515 levels "","ABBEVILLE AREA MEDICAL CENTER",..: 2020 2020 2020 1127 2405 2967 1943 1943 4203 4203 ...
## $ med_sch : Factor w/ 398 levels "A T STILL UN, ARIZONA SCHL OF DENT.Y & ORAL HLTH",..: 215 215 215 321 215 223 261 261 209 209 ...
## $ npi : int 1003000126 1003000126 1003000126 1003000134 1003000142 1003000407 1003000423 1003000423 1003000480 1003000480 ...
## $ pri_spec : Factor w/ 83 levels "","ADDICTION MEDICINE",..: 38 38 38 61 5 24 51 51 27 27 ...
## $ st : Factor w/ 56 levels "","AK","AL","AR",..: 50 50 50 17 39 42 39 39 7 7 ...
## $ zip : Factor w/ 188675 levels "","000075949",..: 43170 47812 50388 117622 89610 35299 90294 90294 150777 150777 ...
summary(dat)
## cred cty ehr frst_nm
## :1012568 HOUSTON : 25754 :1194451 MICHAEL: 30701
## MD : 344822 PITTSBURGH : 25135 Y: 332840 DAVID : 27486
## PA : 29223 NEW YORK : 18521 JOHN : 27117
## NP : 25666 PHILADELPHIA: 18025 ROBERT : 21989
## DO : 21639 CHICAGO : 16532 JAMES : 20402
## CNA : 16635 ANN ARBOR : 16390 (Other):1399580
## (Other): 76738 (Other) :1406934 NA's : 16
## gndr grd_yr hosp_afl_1
## F:681413 Min. :1942 : 449488
## M:845877 1st Qu.:1988 230046 : 13889
## U: 1 Median :1999 450076 : 7003
## Mean :1997 490009 : 5784
## 3rd Qu.:2008 390111 : 5496
## Max. :2017 050454 : 5261
## NA's :4024 (Other):1040370
## hosp_afl_lbn_1
## : 450721
## UNIVERSITY OF MICHIGAN HEALTH SYSTEM : 13889
## UNIVERSITY OF TEXAS M D ANDERSON CANCER CENTER,THE: 7003
## UNIVERSITY OF VIRGINIA MEDICAL CENTER : 5784
## HOSPITAL OF UNIV OF PENNSYLVANIA : 5496
## UCSF MEDICAL CENTER : 5261
## (Other) :1039137
## med_sch npi
## OTHER :771781 Min. :1.003e+09
## WAYNE STATE UNIVERSITY SCHOOL OF MEDICINE : 11559 1st Qu.:1.165e+09
## INDIANA UNIVERSITY SCHOOL OF MEDICINE : 10794 Median :1.337e+09
## PHILADELPHIA COLLEGE OF OSTEOPATHIC MEDICINE: 10036 Mean :1.340e+09
## TEMPLE UNIVERSITY SCHOOL OF MEDICINE : 9473 3rd Qu.:1.509e+09
## UNIVERSITY OF MINNESOTA MEDICAL SCHOOL : 9266 Max. :1.680e+09
## (Other) :704382
## pri_spec st zip
## NURSE PRACTITIONER :148237 CA :131284 481095000: 14678
## INTERNAL MEDICINE :141051 TX :114489 152124756: 6601
## FAMILY PRACTICE :119255 PA :100942 770304000: 6033
## PHYSICIAN ASSISTANT :111992 NY : 95038 229080001: 4827
## PHYSICAL THERAPY : 69682 FL : 84422 326103003: 4336
## DIAGNOSTIC RADIOLOGY: 64327 MI : 66982 110303816: 3727
## (Other) :872747 (Other):934134 (Other) :1487089
Observations from the above output:
- There are 56 levels of states, which means this data set includes territories beyond the 50 states and DC. Since we don't have sufficient data on the Virgin Islands, Guam, and other territories, we will filter them out.
- There is one person whose gender is unknown; we will filter this person out since we want to group by gender.
- There are 4024 rows with missing graduation years, which we will need to filter out if graduation year is one of the main demographics of interest (missing values would interfere with plotting). Also, one or more physicians who graduated in 2017 slipped into our data…we will filter those out too.
- It looks like most zip codes in this data set are full 9-digit zip codes, which can be split into the standard 5-digit code and the 4-digit extension.
- There are far too many levels of medical schools and primary specialties, so we will probably focus on the most common ones during exploratory analysis rather than in the final primary analysis.
data("state")
clean <- dat %>% filter(st %in% c(state.abb, "DC")) %>%
filter(gndr %in% c("F", "M") & !is.na(grd_yr) & grd_yr != 2017)
Now we only have 51 levels of states, 2 levels of gender, and no records with missing graduation year. To make sure we keep only valid zip codes, we can use R's zipcode package to validate them. This is important because we want to be able to accurately perform our analyses on the location variables and the regional analysis (especially zip code plotting).
# separate out the first five zip codes from the last four extension
clean <- clean %>% mutate(zip.ext = substr(as.character(zip), 6, 9),
zip = substr(as.character(zip), 1, 5))
# use R zipcode package and data
library(zipcode)
data("zipcode")
zip_dat <- clean %>% left_join(zipcode,by='zip')
# check state mismatches
zip_mismatch <- zip_dat %>%
mutate(st = as.character(st)) %>%
filter(st != state) %>%
select(npi, frst_nm, zip, cty, city, st, state, latitude, longitude)
# check number of records mismatched
nrow(zip_mismatch)
## [1] 1034
# look at the top five mismatches
(top5_mismatch <- zip_mismatch %>%
group_by(zip, cty, st, city, state) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
head(5))
## # A tibble: 5 x 6
## # Groups: zip, cty, st, city [5]
## zip cty st city state n
## <chr> <fctr> <chr> <chr> <chr> <int>
## 1 99362 WALLA WALLA WA Walla Walla OR 222
## 2 52761 MUSCATINE IA Muscatine IL 114
## 3 43530 GRANGER IN Jewell OH 86
## 4 22401 FREDERICKSBURG MD Fredericksburg VA 82
## 5 20176 HANOVER MD Leesburg VA 74
In order not to just blatantly throw away 1034 mismatches, we will look at the top 5 most common mismatches on a case-by-case basis to try to salvage as many records as possible without searching too deeply in our data set. Besides, from the 6th most common mismatch onward, the number of records n was down to 58 or fewer, so those would have little effect on the data set.
The results of looking up the top 5 mismatches on Google Maps:
- The zip code 99362 spans across Walla Walla in both WA and OR, so it is likely that the practice is located on the WA side of the zip code coverage. We will ignore the state mismatch from the zip code package for this case.
- The zip code 52761 does belong to Muscatine, IA, which is right on the border of IL. Again, the practice is most likely located on the IA side. Will ignore the state mismatch for this case as well.
- The zip code for Granger, IN was most likely mistyped. The correct zip code for Granger is 46530, which is very close to 43530. We will fix the zip code for this particular case only.
- The zip code 22401 does belong to Fredericksburg, VA, and is not near the VA-MD border at all. There is a "Frederick" city in MD, but none of its zip codes looks similar enough to 22401. With no clear culprit, we will have to let this one go. :(
- The zip code for Hanover, MD is 21076, which was most likely mistyped as 20176. We will fix this particular case as well.
# leave out the 22401 Fredericksburg MD mismatch, fix the zip codes for Granger and Hanover.
zip_fix <- top5_mismatch %>% ungroup() %>%
filter(cty != 'FREDERICKSBURG') %>%
mutate(zip = replace(zip, zip %in% c(43530, 20176), c(46530, 21076))) %>%
select(-n)
mismatch_tokeep <- zip_dat %>%
select(-zip) %>% # we want to replace the original wrong zip codes with the fixed zip codes
inner_join(zip_fix, by = c('cty','st', 'city', 'state'))
## Warning: Column `st` joining factor and character vector, coercing into
## character vector
clean <- zip_dat %>%
mutate(st = as.character(st), state = as.character(state)) %>%
filter(st == state | is.na(state)) %>% # keep matches, plus zips not found in the zipcode data (state is NA)
bind_rows(mismatch_tokeep) %>% # then add in the mismatches we wanted to keep or have fixed
select(-city, -state) # will rely on the original city and state columns
Continuing on with our wrangling…
- About 29.4% of the (original) data are of physicians not affiliated with any hospital; these are the practitioners we will subset for our practitioner-level analysis, since these are the only cases in which a physician's own demographics can have an effect on the choice of using EHR. For physicians who are affiliated with a hospital, their personal demographics are no longer informative as to whether or not they use EHR; it would instead be the demographics of the hospitals that inform EHR use at the hospital level. We were able to find demographics on hospitals across the U.S. in general, which will be discussed in the next section of this chapter, but that source carries no indication of participation in the incentive program nor of EHR use, so we will have to merge that hospital demographic data with the practitioners in this data set who are affiliated with a hospital.
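The affiliation proportion quoted above can be recomputed directly; below is a minimal sketch on a hypothetical toy data frame (in the real data, a blank `hosp_afl_1` means no hospital affiliation):

```r
# Toy illustration (hypothetical values): the unaffiliated proportion is
# just the share of rows whose first hospital affiliation is blank.
dat_demo <- data.frame(
  npi        = 1:5,
  hosp_afl_1 = c("", "230046", "", "", "450076"),
  stringsAsFactors = FALSE
)
prop_unaffiliated <- mean(dat_demo$hosp_afl_1 == "")
prop_unaffiliated # 0.6 on this toy data; about 0.294 on the real data
```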
So for the practitioner-level analysis we want to:
- Calculate the number of years since graduation
- Use 1’s and 0’s instead of ‘Y’ and blanks as indicator of EHR use for our binary outcome
- Keep only one record per physician, which means losing the distinct practice locations for physicians who have multiple practices. But since we can't fit our logistic model with city, state, or zip code as predictors anyway, there's no point in keeping this detail. Instead, we can calculate the number of distinct practices each physician has in order to collapse the data to the practitioner level.
EPs <- clean %>% filter(hosp_afl_1 == '') %>%
mutate(yrs_grd = 2016 - as.integer(grd_yr), # calc the n of years since graduation
cred = case_when(cred == '' ~ 'NA', # assign blanks to NA's
TRUE ~ as.character(cred)),
cred = as.factor(cred),
ehr = case_when(ehr == 'Y' ~ 1, # use numeric indicators
ehr == '' ~ 0),
ehr = as.factor(ehr)) %>%
select(-hosp_afl_1, -hosp_afl_lbn_1, -frst_nm) %>%
distinct(npi, .keep_all = TRUE) # keep only one row per physician
# calculate the number of practices for each physician
SL <- dat %>% group_by(npi) %>% summarise(locations = n_distinct(zip))
EPs <- left_join(EPs, SL, by="npi")
rm(SL)
summary(EPs)
## cred cty ehr gndr
## NA :167692 NEW YORK : 3555 0:214471 F:128382
## MD : 19334 HOUSTON : 1875 1: 18284 M:104373
## DC : 9698 BROOKLYN : 1701 U: 0
## PT : 6881 LOS ANGELES: 1700
## OD : 6773 CHICAGO : 1586
## CSW : 5750 PHOENIX : 1490
## (Other): 16627 (Other) :220848
## grd_yr med_sch
## Min. :1942 OTHER :140789
## 1st Qu.:1988 PALMER COLLEGE CHIROPRACTIC - DAVENPORT : 4547
## Median :1999 LOGAN COLLEGE OF CHIROPRACTIC : 2312
## Mean :1997 PENNSYLVANIA COLLEGE OF OPTOMETRY : 2310
## 3rd Qu.:2009 SOUTHERN COLLEGE OF OPTOMETRY : 2211
## Max. :2016 ILLINOIS COLLEGE OF OPTOMETRY AT CHICAGO: 2184
## (Other) : 78402
## npi pri_spec st
## Min. :1.003e+09 PHYSICAL THERAPY :34788 Length:232755
## 1st Qu.:1.165e+09 CHIROPRACTIC :27033 Class :character
## Median :1.336e+09 CLINICAL SOCIAL WORKER:25225 Mode :character
## Mean :1.340e+09 NURSE PRACTITIONER :22163
## 3rd Qu.:1.509e+09 OPTOMETRY :20352
## Max. :1.680e+09 CLINICAL PSYCHOLOGIST :16522
## (Other) :86672
## zip zip.ext latitude longitude
## Length:232755 Length:232755 Min. :19.10 Min. :-170.27
## Class :character Class :character 1st Qu.:34.14 1st Qu.: -98.57
## Mode :character Mode :character Median :39.40 Median : -86.18
## Mean :38.19 Mean : -91.39
## 3rd Qu.:41.75 3rd Qu.: -77.34
## Max. :66.90 Max. : -67.24
## NA's :140 NA's :140
## yrs_grd locations
## Min. : 0.00 Min. : 1.000
## 1st Qu.: 7.00 1st Qu.: 1.000
## Median :17.00 Median : 1.000
## Mean :18.55 Mean : 1.293
## 3rd Qu.:28.00 3rd Qu.: 1.000
## Max. :74.00 Max. :359.000
##
4.1.3 Output
# permanently save physician-level analysis dataset to repo
saveRDS(EPs, file = "./data/EPs.rds")
# for the rest of practitioners who are affiliated with a hospital
hosp_afl <- clean %>% # filter on the clean version
filter(hosp_afl_1 != '') %>%
mutate(hosp_afl_lbn_1 = as.character(hosp_afl_lbn_1),
cty = str_trim(as.character(cty)),
st = as.character(st),
ehr = ifelse(ehr == "Y", 1, 0)) %>%
# only keep relevant or potentially interesting variables (keeping npi to count number of physicians)
distinct(npi, hosp_afl_lbn_1, st, cty, ehr, hosp_afl_1, pri_spec, gndr, grd_yr)
# clean up intermediate data sets
rm(dat, clean, zipcode, zip_dat, zip_fix, zip_mismatch, mismatch_tokeep, top5_mismatch)
- We are storing the `EPs` data set as a `.rds` file in our repo under the `/data` directory for reproducibility of our analysis, since this is the clean analysis-ready data set that will be used to analyze the potential effects of practitioner demographics on EHR use in our Primary Analysis.
- The `hosp_afl` data set, still containing one row per physician per hospital location, is to be merged with the hospital demographics data discussed in the next section.
- Both `EPs` and the merged `hosp_afl` data containing the hospital demographics are to be merged with the EHR vendor and product information in the last section of this chapter, in order to explore both physician and hospital demographics by the specific types of EHR used.
4.2 Eligible Hospitals (Hosp)
General hospital demographics such as the total number of staffed beds, total number of discharges, total number of patient days, and total gross patient revenue (inpatient and outpatient) are all publicly accessible on the American Hospital Directory, organized by U.S. state. Since there is no `.csv` file ready for us to download and the hospital demographics are already neatly summarized in a table for each state, we decided that we could do some web scraping for this.
4.2.1 Source/Metadata
The link to each U.S. state's summary table of hospital demographics by hospital name begins with the same URL, https://www.ahd.com/states/hospital_, and ends with the state's abbreviation plus `.html`. For example, a table of hospital demographics for each hospital in Massachusetts can be scraped directly from the link https://www.ahd.com/states/hospital_MA.html. The details of these hospital statistics are described here.
hosp_web <- data.frame()
for (st in c(state.abb, "DC")){
url = paste("https://www.ahd.com/states/hospital_", st, ".html", sep='')
st_df <- read_html(url) %>%
html_nodes("table") %>%
.[[2]] %>%
html_table()
st_df <- as.data.frame(st_df) # html_table() already returns the parsed table
# removing the TOTAL row at the end of each table
st_df <- st_df[!(st_df$"Hospital Name"=="T O T A L"),]
st_df$State <- st
hosp_web <- rbind(hosp_web, st_df)
}
saveRDS(hosp_web, "./data/hosp_raw.rds") # saving data for reproducibility
4.2.2 Wrangling
Let's take a look at the data structure of the raw scraped hospital demographics.
str(hosp_web)
## 'data.frame': 3907 obs. of 7 variables:
## $ Hospital_name : chr "Alaska Native Medical Center" "Alaska Regional Hospital" "Bartlett Regional Hospital" "Central Peninsula General Hospital" ...
## $ City : chr "Anchorage" "Anchorage" "Juneau" "Soldotna" ...
## $ Staffed_beds : chr "152" "174" "57" "106" ...
## $ Total_discharges : chr "7,320" "6,405" "1,710" "2,575" ...
## $ Patient_days : chr "38,687" "33,507" "6,325" "10,186" ...
## $ Gross_patient_revenue: chr "$0" "$947,754" "$147,231" "$305,469" ...
## $ State : chr "AK" "AK" "AK" "AK" ...
To do:
- We need to turn all the demographic information into numeric columns so we can use them as continuous variables.
- These values are reported with commas, so we need to strip these out.
- All the gross patient revenues are reported with a dollar sign in addition to the commas, so we need to strip those out too.
- Proper case is harder to match on, so we will convert all letters in city and hospital names to upper case.
- It looks like some hospitals have a gross patient revenue of $0…is this possible? We will explore this after dealing with all of the above first.
hosp_clean <- transform(hosp_web,
Staffed_beds = as.numeric(gsub(",", "", Staffed_beds)),
Total_discharges = as.numeric(gsub(",", "", Total_discharges)),
Patient_days = as.numeric(gsub(",","", Patient_days)),
Gross_patient_revenue = as.numeric(gsub("\\$|,", "", Gross_patient_revenue)),
Hospital_name = toupper(Hospital_name),
City = toupper(str_trim(as.character(City)))
)
summary(hosp_clean)
## Hospital_name City Staffed_beds Total_discharges
## Length:3907 Length:3907 Min. : 0.0 Min. : 0
## Class :character Class :character 1st Qu.: 48.0 1st Qu.: 1244
## Mode :character Mode :character Median : 133.0 Median : 4562
## Mean : 204.8 Mean : 8017
## 3rd Qu.: 264.0 3rd Qu.: 11336
## Max. :36985.0 Max. :136646
## Patient_days Gross_patient_revenue State
## Min. : 0 Min. : 0 Length:3907
## 1st Qu.: 4104 1st Qu.: 98340 Class :character
## Median : 18892 Median : 377425 Mode :character
## Mean : 37794 Mean : 797958
## 3rd Qu.: 51621 3rd Qu.: 1001421
## Max. :695643 Max. :15618749
It looks like the zero values weren't limited to gross patient revenue. From the summary output, we can see that the minimum values of all the hospital demographics (staffed beds, total discharges, patient days, gross patient revenue) are zero, which doesn't seem right: it is essentially impossible for a functioning hospital to have a total of zero on any of these characteristics. Would having zero staffed beds really make it a hospital? Can a hospital survive on a gross patient revenue of zero dollars? We thought it might be a web scraping error, but when we went back and looked at the website, some of their tables definitely have scattered zero statistics and some even have a whole row of zeros for certain hospitals. The metadata weren't very clear on this, but it seems to be their way of representing missing data. Instead of keeping the zeros, which would affect any computation we perform on these continuous variables, we will convert every zero into a true missing value, `NA`. But if a hospital has all demographics missing (an entire row of zeros), we will just remove that hospital entirely (delete the whole row).
# the demographics are columns 3 to 6 of the data frame
is.na(hosp_clean[,3:6]) <- !hosp_clean[,3:6] # turn 0's to NAs
hosp_clean <- hosp_clean[rowSums(is.na(hosp_clean)) < 4,] # only delete hosp with all four NAs
summary(hosp_clean)
## Hospital_name City Staffed_beds Total_discharges
## Length:3456 Length:3456 Min. : 1.0 Min. : 1
## Class :character Class :character 1st Qu.: 73.0 1st Qu.: 2141
## Mode :character Mode :character Median : 157.0 Median : 5692
## Mean : 231.6 Mean : 9087
## 3rd Qu.: 290.0 3rd Qu.: 12564
## Max. :36985.0 Max. :136646
## NA's :1 NA's :9
## Patient_days Gross_patient_revenue State
## Min. : 1 Min. : 2298 Length:3456
## 1st Qu.: 7740 1st Qu.: 188177 Class :character
## Median : 23922 Median : 504646 Mode :character
## Mean : 42838 Mean : 923192
## 3rd Qu.: 57391 3rd Qu.: 1149258
## Max. :695643 Max. :15618749
## NA's :9 NA's :79
Now the minimum value is 1…still questionable and potentially causing the skew in the distributions. But we did look at the specific hospitals with any demographic value of 1, and they tend to have low values on the other characteristics as well (not exactly 1 across the board, which would have been even more suspicious), so these may well be valid, who knows. Anyway, we need to move on with our lives, I mean, to merging with `hosp_afl` (the subset of practitioners affiliated with a hospital from the previous section) on hospital name and state, since different hospitals in different locations can share a name. Because `hosp_afl` can have multiple rows per hospital whenever more than one physician in the data is affiliated with it, we will aggregate the data back to the hospital level after the merge by summarizing the physician-level information.
hosps <- inner_join(hosp_clean, hosp_afl, by=c("Hospital_name"="hosp_afl_lbn_1", "State"="st"))
hosps <- hosps %>%
group_by(hosp_afl_1, Hospital_name, State, Staffed_beds, Total_discharges, Patient_days, Gross_patient_revenue) %>%
summarize(num_phys = n_distinct(npi),
female_prop = round(mean(ifelse(gndr == "F", 1, 0)),2),
avg_grad_year = round(mean(grd_yr),2),
n_specialty =n_distinct(pri_spec),
EHR_use = as.factor(max(ehr)))
summary(hosps)
## hosp_afl_1 Hospital_name State Staffed_beds
## 150042 : 5 Length:1857 Length:1857 Min. : 7.0
## 050471 : 3 Class :character Class :character 1st Qu.: 72.0
## 360052 : 3 Mode :character Mode :character Median : 154.0
## 360134 : 3 Mean : 232.1
## 050257 : 2 3rd Qu.: 277.0
## 050380 : 2 Max. :36985.0
## (Other):1839
## Total_discharges Patient_days Gross_patient_revenue num_phys
## Min. : 1 Min. : 2 Min. : 3624 Min. : 1.0
## 1st Qu.: 2056 1st Qu.: 7248 1st Qu.: 185397 1st Qu.: 28.0
## Median : 5350 Median : 21967 Median : 488683 Median : 69.0
## Mean : 8376 Mean : 38899 Mean : 854423 Mean : 114.4
## 3rd Qu.:11983 3rd Qu.: 53935 3rd Qu.: 1100914 3rd Qu.: 152.0
## Max. :77757 Max. :425114 Max. :14140692 Max. :1460.0
## NA's :1 NA's :1 NA's :9
## female_prop avg_grad_year n_specialty EHR_use
## Min. :0.0000 Min. :1971 Min. : 1.00 1:1701
## 1st Qu.:0.2800 1st Qu.:1994 1st Qu.:12.00 0: 156
## Median :0.3600 Median :1996 Median :22.00
## Mean :0.3592 Mean :1996 Mean :22.79
## 3rd Qu.:0.4300 3rd Qu.:1998 3rd Qu.:33.00
## Max. :1.0000 Max. :2014 Max. :62.00
##
table(hosps[which(hosps$num_phys <= 10),]$EHR_use)
##
## 1 0
## 52 114
Well, we managed to match about 54% (1857 of 3456) of the hospitals between the two data sets, and the summary statistics on those hospital demographics look a lot nicer and less skewed! Even though we aggregated the physician-level demographics because we didn't want to just throw away data, we do acknowledge that these physician-level demographics are not really meaningful, because they are not representative of the demographics of the hospital. It is unlikely that we had complete physician information from `hosp_afl` for each matched hospital (i.e. the 10 physicians we have data on for a hospital may not accurately represent all the practitioners in that hospital). We thought of excluding hospitals with fewer than 10 practitioners in the data to obtain accurate physician demographics for each hospital, but that would mean cutting away far too much of our limited data. We decided to just focus on the main hospital demographics scraped from the website, and keep the aggregated physician demographics at the exploratory level. Merging with `hosp_afl` still provided us the main outcome of interest, `EHR_use` (whether the hospital uses an electronic health record system), calculated as 1 if at least one practitioner in the hospital uses EHR and 0 if none of the practitioners in the hospital uses EHR; that's why we took the maximum in our code above. Reminder that for practitioners affiliated with hospitals, we assumed that EHR use reflects hospital-level adoption and not an individual choice. Thus it makes sense that if at least one of the practitioners is recorded in the data as using EHR, we will assume the hospital uses EHR.
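The "at least one practitioner uses EHR" collapse can be illustrated on a hypothetical mini data set; taking the maximum of the 0/1 indicator within each hospital is equivalent to asking whether any affiliated practitioner uses EHR:

```r
library(dplyr)

# Hypothetical mini version of hosp_afl: two hospitals, 0/1 EHR indicator.
afl_demo <- data.frame(
  hosp_afl_1 = c("A", "A", "A", "B", "B"),
  ehr        = c(0, 1, 0, 0, 0)
)

afl_demo %>%
  group_by(hosp_afl_1) %>%
  summarize(EHR_use = max(ehr))
# hospital A -> EHR_use 1 (one user is enough); hospital B -> EHR_use 0
```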
4.2.3 Output
# clean up intermediate data sets
rm(hosp_web, hosp_afl)
# hosp_clean will be used later in secondary regional analysis
# permanently save hospital-level analysis dataset to repo
saveRDS(hosps, file = "./data/hosps.rds")
- Again, we are storing the `hosps` data set as a `.rds` file in our repo under the `/data` directory for reproducibility of our analysis, since it will be used to analyze the potential effects of hospital demographics on the use of EHR, which is the other aspect of our Primary Analysis.
- This data set will also subsequently be merged with the EHR vendor and product information discussed in the next section to explore hospital demographics by the specific types of EHR used.
4.3 Vendors & Products (EHR)
Since so many professionals and hospitals across the U.S. have already adopted EHR, we realized that simply looking at the binary outcome of EHR use would make for too boring a project. Thus, we were curious to further explore the demographics of those who do use EHR by the specific EHR vendor or product. Fortunately, we were able to find data to support this additional secondary/exploratory analysis.
4.3.1 Source/Metadata
The Health IT Dashboard provides certified health IT product data from the ONC Certified Health IT Product List (CHPL), such as the unique vendors, products, and product types of each certified health IT product used as part of the Medicare EHR Incentive Program. We downloaded only the 2016 data set, which also includes unique provider identifiers (NPI), in order to match the `EPs` data set discussed in the first section of this chapter. As the metadata explain, a provider in this data set can be either an eligible professional (EP) or an eligible hospital (Hospital), as distinguished by the `Provider_Type` column. Thus, only the `Provider_Type == 'EP'` records are merged with the subset of `EPs` who are not affiliated with any hospital, while the `Provider_Type == 'Hospital'` records are merged with the subset of `EPs` who are affiliated with a hospital, and therefore directly merged with the combined and cleaned version of the `hosps` data set discussed in the previous section of this chapter.
EHR <- read.csv("https://dashboard.healthit.gov/datadashboard/data/MU_REPORT_2016.csv")
saveRDS(EHR, "./data/EHR_raw.rds") # saving data for reproducibility
4.3.2 Wrangling
Data structure, you know the drill.
str(EHR)
## 'data.frame': 505914 obs. of 23 variables:
## $ NPI : int 1003000142 1003000522 1003000597 1003000639 1003000902 1003000936 1003000936 1003001256 1003001256 1003001462 ...
## $ CCN : int NA NA NA NA NA NA NA NA NA NA ...
## $ Provider_Type : Factor w/ 2 levels "EP","Hospital": 1 1 1 1 1 1 1 1 1 1 ...
## $ Business_State_Territory : Factor w/ 56 levels "Alabama","Alaska",..: 38 11 39 5 17 45 45 6 6 12 ...
## $ ZIP : Factor w/ 11921 levels "","00603","00612",..: 5619 4304 9168 10670 5353 3731 3731 10037 10037 4055 ...
## $ Specialty : Factor w/ 288 levels "","Acupuncturist",..: 177 9 279 263 79 27 27 79 79 235 ...
## $ Hospital_Type : Factor w/ 3 levels "","Critical Access",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Program_Type : Factor w/ 2 levels "Medicare","Medicare/Medicaid": 1 1 1 1 1 1 1 1 1 1 ...
## $ Program_Year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ Provider_Stage_Number : Factor w/ 2 levels "Stage 1","Stage 2": 2 2 1 1 2 2 2 2 2 1 ...
## $ Payment_Year : int 3 5 NA NA 5 3 3 5 5 NA ...
## $ Attestation_Month : int 2 1 1 2 1 5 5 2 2 3 ...
## $ Attestation_Year : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
## $ MU_Definition_2014 : logi NA NA NA NA NA NA ...
## $ Stage_2_Scheduled_2014 : int 0 1 0 0 1 0 0 0 0 0 ...
## $ EHR_Certification_Number : Factor w/ 3550 levels "0014E0087U0WV35",..: 2699 1660 2177 2303 2965 1119 1119 2791 2791 2023 ...
## $ EHR_Product_CHP_Id : Factor w/ 2317 levels "14.02.02.1026.A033.01.01.0.170217",..: 1876 346 1168 1403 2226 136 1747 1403 1876 1004 ...
## $ Vendor_Name : Factor w/ 484 levels "4medica, Inc.",..: 134 35 338 134 159 170 170 134 134 120 ...
## $ EHR_Product_Name : Factor w/ 798 levels "1 Connect BuildYourEMR",..: 192 42 554 192 81 251 252 192 192 533 ...
## $ EHR_Product_Version : Factor w/ 699 levels "1","1.0","1.0.0",..: 489 113 385 494 50 549 206 494 489 155 ...
## $ Product_Classification : Factor w/ 3 levels "","Complete EHR",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Product_Setting : Factor w/ 3 levels "","Ambulatory",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Product_Certification_Edition_Yr: int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
There are actually a lot of variables here that are not informative, so we can drop them and focus on a smaller set; e.g., program year would just be 2016 on every row of the data, and the EHR certification number or CHP ID is at an unnecessary level of detail. Let's keep the important columns plus a few potentially interesting variables to summarize by, and de-duplicate the remaining rows (thus using `distinct` instead of `select`).
EHR_clean <- EHR %>% distinct(NPI, CCN, Provider_Type, Business_State_Territory, ZIP, Specialty, Hospital_Type, Vendor_Name, EHR_Product_Name, Product_Classification, Product_Setting)
summary(EHR_clean)
## NPI CCN Provider_Type
## Min. :1.003e+09 Min. : 10001 EP :393375
## 1st Qu.:1.255e+09 1st Qu.:121301 Hospital: 16330
## Median :1.508e+09 Median :241361
## Mean :1.501e+09 Mean :261394
## 3rd Qu.:1.750e+09 3rd Qu.:390179
## Max. :1.993e+09 Max. :670114
## NA's :393375
## Business_State_Territory ZIP
## California : 31692 : 16330
## Minnesota : 24252 55905 : 11278
## Florida : 22664 48109 : 3336
## Texas : 22307 02114 : 3021
## Pennsylvania: 20564 32224 : 2818
## Michigan : 19739 60611 : 2521
## (Other) :268487 (Other):370401
## Specialty
## Family Medicine : 61485
## Internal Medicine : 45820
## Student in an Organized Health Care Education/Training Program: 17552
## Cardiovascular Disease : 17524
## : 16330
## (Other) :250764
## NA's : 230
## Hospital_Type Vendor_Name
## :393375 Epic Systems Corporation:133637
## Critical Access: 3348 Cerner Corporation : 70367
## General : 12982 Allscripts : 28067
## athenahealth, Inc. : 20970
## eClinicalWorks, LLC : 16354
## GE Healthcare : 14882
## (Other) :125428
## EHR_Product_Name
## EpicCare Ambulatory 2014 Certified EHR Suite: 71207
## athenaClinicals : 20945
## Beacon Oncology 2014 Certified Module : 16281
## eClinicalWorks : 16097
## NextGen Ambulatory EHR : 13616
## Powerchart (Clinical) : 13293
## (Other) :258266
## Product_Classification Product_Setting
## : 14690 : 14690
## Complete EHR:202827 Ambulatory:325708
## Modular EHR :192188 Inpatient : 69307
##
##
##
##
Observations:
- The records with missing CCN (the identifying number for eligible hospitals) and blank hospital type match up exactly with the records where Provider_Type == 'EP'. This makes sense, since those variables are not relevant to an eligible professional (EP), who is not affiliated with any hospital.
- There are exactly 16330 records with missing zip codes and blank specialty (specifically blank, not NA), and this number matches the number of records where Provider_Type == 'Hospital'.
- It sounds like we just need to split the data into EPs and hospitals, since each type has its own set of relevant variables. We can then clean each subset before merging it with the corresponding level of demographics.
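These count checks can be made explicit. A minimal sketch, using a toy data frame as a stand-in for EHR_clean (the real check would compare sum(is.na(EHR_clean$CCN)) against sum(EHR_clean$Provider_Type == 'EP')):

```r
# Toy stand-in for EHR_clean: two EP rows (no CCN) and two hospital rows
toy <- data.frame(
  CCN = c(NA, NA, 10001, 670114),
  Provider_Type = c("EP", "EP", "Hospital", "Hospital")
)

# Missing CCNs should align exactly with the EP rows
n_na_ccn <- sum(is.na(toy$CCN))
n_eps <- sum(toy$Provider_Type == "EP")
n_na_ccn == n_eps
## [1] TRUE
```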
Split:
# split for EPs
EHR_EPs <- EHR_clean %>% filter(Provider_Type == 'EP') %>%
select(-CCN, -Hospital_Type, -Provider_Type)
summary(EHR_EPs)
## NPI Business_State_Territory ZIP
## Min. :1.003e+09 California : 30425 55905 : 11278
## 1st Qu.:1.255e+09 Minnesota : 23895 48109 : 3336
## Median :1.508e+09 Florida : 21678 02114 : 3021
## Mean :1.501e+09 Texas : 20941 32224 : 2818
## 3rd Qu.:1.750e+09 Pennsylvania: 19919 60611 : 2521
## Max. :1.993e+09 Michigan : 19221 23298 : 2235
## (Other) :257296 (Other):368166
## Specialty
## Family Medicine : 61485
## Internal Medicine : 45820
## Student in an Organized Health Care Education/Training Program: 17552
## Cardiovascular Disease : 17524
## Obstetrics & Gynecology : 15059
## (Other) :235705
## NA's : 230
## Vendor_Name
## Epic Systems Corporation:131883
## Cerner Corporation : 64893
## Allscripts : 27743
## athenahealth, Inc. : 20929
## eClinicalWorks, LLC : 16345
## GE Healthcare : 14861
## (Other) :116721
## EHR_Product_Name
## EpicCare Ambulatory 2014 Certified EHR Suite: 71047
## athenaClinicals : 20928
## Beacon Oncology 2014 Certified Module : 16189
## eClinicalWorks : 16097
## NextGen Ambulatory EHR : 13613
## Powerchart (Clinical) : 12548
## (Other) :242953
## Product_Classification Product_Setting
## : 14462 : 14462
## Complete EHR:200889 Ambulatory:324643
## Modular EHR :178024 Inpatient : 54270
##
##
##
##
# check for valid zip codes
EHR_EPs %>% filter(nchar(as.character(ZIP)) < 5) %>%
group_by(ZIP) %>% summarize(n())
## # A tibble: 3 x 2
## ZIP `n()`
## <fctr> <int>
## 1 1010 1
## 2 EH16 2
## 3 IP28 9
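The code "1010" is plausibly a five-digit zip ("01010", in Massachusetts) whose leading zero was dropped upstream, while "EH16" and "IP28" resemble UK postcode prefixes and cannot be recovered. We simply drop all of these below, but as a hedged alternative, the all-digit case could be repaired by re-padding:

```r
# Hypothetical repair: re-pad short, all-digit zips with leading zeros;
# alphanumeric codes are left untouched (and dropped later)
zips <- c("1010", "EH16", "IP28")
short_numeric <- grepl("^[0-9]{1,4}$", zips)
zips[short_numeric] <- sprintf("%05d", as.integer(zips[short_numeric]))
zips
## [1] "01010" "EH16"  "IP28"
```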
# split for hospitals
EHR_hosps <- EHR_clean %>% filter(Provider_Type == 'Hospital') %>%
select(-ZIP, -Specialty, -Provider_Type)
summary(EHR_hosps)
## NPI CCN Business_State_Territory
## Min. :1.003e+09 Min. : 10001 Texas : 1366
## 1st Qu.:1.245e+09 1st Qu.:121301 California : 1267
## Median :1.488e+09 Median :241361 Florida : 986
## Mean :1.495e+09 Mean :261394 Illinois : 667
## 3rd Qu.:1.740e+09 3rd Qu.:390179 Pennsylvania: 645
## Max. :1.993e+09 Max. :670114 Ohio : 564
## (Other) :10835
## Hospital_Type
## : 0
## Critical Access: 3348
## General :12982
##
##
##
##
## Vendor_Name
## Cerner Corporation :5474
## Medical Information Technology, Inc. (MEDITECH):2879
## Epic Systems Corporation :1754
## MEDHOST : 653
## McKesson : 625
## Cerner Health Services, Inc. : 525
## (Other) :4420
## EHR_Product_Name
## HealthSentry : 851
## Powerchart (Clinical) : 745
## Powerchart (All CQM's) : 703
## P2 Sentinel (Powered by Sensage) : 580
## EpicCare Inpatient 2014 Certified EHR Suite: 565
## HealtheLife and Powerchart : 552
## (Other) :12334
## Product_Classification Product_Setting
## : 228 : 228
## Complete EHR: 1938 Ambulatory: 1065
## Modular EHR :14164 Inpatient :15037
##
##
##
##
The summaries look much better for the two separate sets. Now we just have to clean up the zip codes for the physician-level EHRs, and it looks like only a few records have invalid zip codes. Similar to how we cleaned up the zip codes for EPs, we will match the zip codes against R's own zipcode package.
# first remove non-US states, appending DC to list
state_name <- c(state.name,"District Of Columbia")
states <- data.frame(state_name = state_name, state_abb = c(state.abb, "DC"))
EHR_EPs <- EHR_EPs %>% filter(Business_State_Territory %in% state_name)
length(unique(EHR_EPs$Business_State_Territory))
## [1] 51
# remove invalid zip codes
EHR_EPs <- EHR_EPs %>% filter(nchar(as.character(ZIP)) >= 5)
# merge R zip codes
library(zipcode)
data("zipcode")
EHR_EPs <- EHR_EPs %>% left_join(zipcode, by = c("ZIP" = "zip"))
## Warning: Column `ZIP`/`zip` joining factor and character vector, coercing
## into character vector
# check state mismatches
EHR_mismatch <- EHR_EPs %>%
left_join(states, by = c("state" = "state_abb")) %>% # need to add in long state names
mutate(Business_State_Territory = as.character(Business_State_Territory)) %>%
filter(state_name != Business_State_Territory)
## Warning: Column `state`/`state_abb` joining character vector and factor,
## coercing into character vector
# percent state mismatch
nrow(EHR_mismatch)/nrow(EHR_EPs)
## [1] 0.04492693
# check which zips are mismatching
EHR_mismatch %>%
group_by(ZIP, Business_State_Territory, state_name) %>%
summarize(num = n()) %>%
arrange(desc(num)) %>%
head(8)
## # A tibble: 8 x 4
## # Groups: ZIP, Business_State_Territory [8]
## ZIP Business_State_Territory state_name num
## <chr> <chr> <fctr> <int>
## 1 51503 Nebraska Iowa 120
## 2 99362 Washington Oregon 83
## 3 54601 Minnesota Wisconsin 80
## 4 19104 New Jersey Pennsylvania 65
## 5 55905 Florida Minnesota 63
## 6 33331 Ohio Florida 58
## 7 85259 Minnesota Arizona 48
## 8 55905 Wisconsin Minnesota 46
The top mismatches:
- Zip code 51503 is in Iowa (agreeing with the zipcode package) and borders Nebraska, but was reported as Nebraska. Since Nebraska neighbors Iowa, this may well be correct, so we'll keep Nebraska as reported.
- Zip code 55905 is in Minnesota, but the reported state, Florida, is nowhere near Minnesota. Since there is likely a typo in either the zip or the state, we'll remove these mismatched records altogether.
- Zip code 54601 is in Wisconsin and borders Minnesota, but was reported as Minnesota, which neighbors Wisconsin. We'll keep Minnesota as reported.
- Zip code 33331 is in Florida, which is nowhere near Ohio. Since it is unclear which field holds the typo, we'll delete these mismatches as well.
- Zip code 55905 is in Minnesota; it is actually the Mayo Clinic (as in #2 above). Wisconsin and Minnesota border each other, so we'll keep Wisconsin as reported. In general, if we look at EHR_mismatch %>% filter(ZIP == "55905"), we notice that this zip code is incorrectly used for multiple states, suggesting data entry/quality issues, which validates our decision to clean up the zip codes.
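That check for a zip code spanning multiple reported states can be generalized. A sketch using a toy version of the mismatch table (hypothetical rows echoing the output above):

```r
library(dplyr)

# Toy mismatch records: 55905 appears under two different reported states
mismatch <- data.frame(
  ZIP = c("55905", "55905", "51503"),
  Business_State_Territory = c("Florida", "Wisconsin", "Nebraska"),
  stringsAsFactors = FALSE
)

# Flag zip codes reported under more than one state -- likely entry errors
suspect <- mismatch %>%
  group_by(ZIP) %>%
  summarize(n_states = n_distinct(Business_State_Territory)) %>%
  filter(n_states > 1)
suspect$ZIP
## [1] "55905"
```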
EHR_zip <- EHR_EPs %>%
left_join(states, by = c("state" = "state_abb")) %>%
mutate(Business_State_Territory = as.character(Business_State_Territory)) %>%
filter(
(state_name == Business_State_Territory) |
(state_name != Business_State_Territory & ZIP == "51503" & Business_State_Territory == "Nebraska") |
(state_name != Business_State_Territory & ZIP == "54601" & Business_State_Territory == "Minnesota") |
(state_name != Business_State_Territory & ZIP == "55905" & Business_State_Territory == "Wisconsin")
) %>%
select(NPI, Business_State_Territory, ZIP, Specialty, Vendor_Name, EHR_Product_Name, Product_Classification, Product_Setting)
## Warning: Column `state`/`state_abb` joining character vector and factor,
## coercing into character vector
We want to inner join with the EPs data set on NPI. The EPs data has one row per physician, keyed by their unique NPI, so this is a one-to-many merge.
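A toy illustration of the join semantics (hypothetical NPIs): rows whose NPI is missing from either side are dropped, and an NPI with several product records on the EHR side yields several merged rows:

```r
library(dplyr)

# Toy versions of the two tables: NPI 1 has two EHR product rows,
# NPI 2 has no demographics, and npi 3 has no EHR records
ehr_toy <- data.frame(NPI = c(1, 1, 2), product = c("A", "B", "C"))
eps_toy <- data.frame(npi = c(1, 3), gndr = c("F", "M"))

merged <- inner_join(ehr_toy, eps_toy, by = c("NPI" = "npi"))
nrow(merged)  # NPI 2 and npi 3 are dropped; NPI 1 contributes two rows
## [1] 2
```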
# merge with EPs data
EHR_EPs <- inner_join(EHR_zip, EPs, by=c('NPI'='npi'))
summary(EHR_EPs)
## NPI Business_State_Territory ZIP
## Min. :1.003e+09 Length:24155 Length:24155
## 1st Qu.:1.165e+09 Class :character Class :character
## Median :1.336e+09 Mode :character Mode :character
## Mean :1.340e+09
## 3rd Qu.:1.509e+09
## Max. :1.680e+09
##
## Specialty Vendor_Name
## Optometrist : 4746 Epic Systems Corporation : 4225
## Chiropractor : 2893 Integrated Practice Solutions, Inc.: 1600
## Family Medicine: 2397 Modernizing Medicine, Inc. : 1544
## Ophthalmology : 1906 NextGen Healthcare : 1285
## Dermatology : 1571 GE Healthcare : 1048
## (Other) :10638 Mayo Clinic : 1035
## NA's : 4 (Other) :13418
## EHR_Product_Name
## EpicCare Ambulatory 2014 Certified EHR Suite: 2043
## ChiroTouch : 1281
## NextGen Ambulatory EHR : 1209
## athenaClinicals : 1019
## EMA : 888
## eClinicalWorks : 732
## (Other) :16983
## Product_Classification Product_Setting cred
## : 352 : 352 NA :15578
## Complete EHR:15438 Ambulatory:21513 MD : 4987
## Modular EHR : 8365 Inpatient : 2290 OD : 1675
## DC : 1271
## DO : 325
## DPM : 290
## (Other): 29
## cty ehr gndr grd_yr
## ROCHESTER: 1663 0: 5355 F: 9673 Min. :1951
## LOS ALTOS: 687 1:18800 M:14482 1st Qu.:1985
## PALO ALTO: 527 U: 0 Median :1996
## NEW YORK : 335 Mean :1995
## FREMONT : 276 3rd Qu.:2005
## BOSTON : 207 Max. :2016
## (Other) :20460
## med_sch pri_spec
## OTHER : 5140 OPTOMETRY :5155
## PALMER COLLEGE CHIROPRACTIC - DAVENPORT : 699 CHIROPRACTIC :3161
## ILLINOIS COLLEGE OF OPTOMETRY AT CHICAGO: 640 FAMILY PRACTICE :2637
## SOUTHERN COLLEGE OF OPTOMETRY : 572 DERMATOLOGY :2534
## PENNSYLVANIA COLLEGE OF OPTOMETRY : 538 OPHTHALMOLOGY :2044
## INDIANA UNIVERSITY - SCHOOL OF OPTOMETRY: 420 INTERNAL MEDICINE:1499
## (Other) :16146 (Other) :7125
## st zip zip.ext latitude
## Length:24155 Length:24155 Length:24155 Min. :19.53
## Class :character Class :character Class :character 1st Qu.:34.61
## Mode :character Mode :character Mode :character Median :39.04
## Mean :38.31
## 3rd Qu.:42.13
## Max. :64.85
## NA's :11
## longitude yrs_grd locations
## Min. :-159.38 Min. : 0.00 Min. : 1.000
## 1st Qu.:-104.96 1st Qu.:11.00 1st Qu.: 1.000
## Median : -90.03 Median :20.00 Median : 1.000
## Mean : -93.52 Mean :21.31 Mean : 1.281
## 3rd Qu.: -80.31 3rd Qu.:31.00 3rd Qu.: 1.000
## Max. : -67.88 Max. :65.00 Max. :53.000
## NA's :11
# check EHR use (should expect all 1's or "yes")
table(EHR_EPs$ehr)
##
## 0 1
## 5355 18800
Notice that the ehr column is not all 1's when we expect it to be. This explains why we had so many physicians who fell in the ehr = '' category: the physicians for whom we have EHR product usage information did not all answer "yes" to EHR use in the original EPs data.
4.3.3 Output
saveRDS(EHR_EPs, file = "./data/EHR_EPs.rds")
Although the specific product names are too granular to be worth cleaning, we believe that exploring the professional and hospital demographics across vendors may still be interesting, especially if we focus on the ten most popular vendors and classify the smaller local vendors as one group.
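As a sketch of that grouping idea (assuming the forcats package is available), fct_lump_n can collapse everything outside the top n vendors into a single "Other" level; on the real data we would lump Vendor_Name with n = 10. A toy vendor vector keeps the example self-contained:

```r
library(forcats)

# Toy vendor factor: two frequent vendors plus two one-off local vendors
vendors <- factor(c(rep("Epic Systems Corporation", 5),
                    rep("Cerner Corporation", 3),
                    "Local Vendor A", "Local Vendor B"))

# Keep the 2 most common vendors; everything else becomes "Other"
lumped <- fct_lump_n(vendors, n = 2)
table(lumped)
```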