Chapter 4 Data
All of our data are based on the 2016 Medicare & Medicaid EHR Incentive Program because participation is cumulative: a provider who adopted and used EHR during the 2014 program, for example, is likely to still be in the program in the following years, so the 2016 program should be the most complete source of providers who have adopted EHR. As mentioned in Related Work, EHR adoption has been rising consistently over the years across the U.S., so there is little to gain from studying use over time when we can just look at the latest collection of data. And since 2017 isn't over yet, it wouldn't be fair to look at EHR usage within the 2017 program, so the most recent complete data are from the 2016 program.
4.1 Eligible Professionals (EPs)
To answer the first of our Initial Questions as to whether or not a physician’s age, education, practice location, etc. may be related to their use of EHR in the incentive program, we must be able to find demographics data at the practitioner level.
4.1.1 Source/Metadata
And here we are: readily available on Data.Medicare.gov are the demographics of physicians eligible for the Medicare & Medicaid EHR Incentive Program at the practitioner level. We used the API link to read in the data set and applied SQL-style filtering options to speed up the download, since it is a large file. We set the limit to exactly 2254703 because that is the exact number of rows the data contain; if we don't specify this, the default is to download only 100 rows. To keep the download manageable, we also specified the exact columns of interest in the download link (we looked through the metadata on the website to select the meaningful and interesting variables).
dat <- read.csv("https://data.medicare.gov/resource/c8qv-268j.csv?$limit=2254703&$select=npi,frst_nm,gndr,cred,med_sch,grd_yr,pri_spec,cty,st,zip,hosp_afl_1,hosp_afl_lbn_1,ehr&$order=npi")
saveRDS(dat, "./data/dat_raw.rds") # saving data for reproducibility
4.1.2 Wrangling
Let’s first take a look at the data structure and a summary of all the columns.
str(dat)
## 'data.frame': 1527291 obs. of 13 variables:
## $ cred : Factor w/ 22 levels "","AA","AU","CNA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ cty : Factor w/ 10192 levels "20850","ABBEVILLE",..: 2345 274 7938 2790 9076 1041 5545 5545 364 364 ...
## $ ehr : Factor w/ 2 levels "","Y": 1 1 1 1 2 1 1 1 1 1 ...
## $ frst_nm : Factor w/ 54156 levels "A","A'LANNE",..: 3360 3360 3360 48311 38890 10170 20706 20706 24158 24158 ...
## $ gndr : Factor w/ 3 levels "F","M","U": 2 2 2 2 2 2 1 1 2 2 ...
## $ grd_yr : int 1994 1994 1994 2003 1999 2003 2007 2007 1997 1997 ...
## $ hosp_afl_1 : Factor w/ 4746 levels "","010001","010005",..: 4371 4371 4371 1112 3220 3617 3168 3168 622 622 ...
## $ hosp_afl_lbn_1: Factor w/ 4515 levels "","ABBEVILLE AREA MEDICAL CENTER",..: 2020 2020 2020 1127 2405 2967 1943 1943 4203 4203 ...
## $ med_sch : Factor w/ 398 levels "A T STILL UN, ARIZONA SCHL OF DENT.Y & ORAL HLTH",..: 215 215 215 321 215 223 261 261 209 209 ...
## $ npi : int 1003000126 1003000126 1003000126 1003000134 1003000142 1003000407 1003000423 1003000423 1003000480 1003000480 ...
## $ pri_spec : Factor w/ 83 levels "","ADDICTION MEDICINE",..: 38 38 38 61 5 24 51 51 27 27 ...
## $ st : Factor w/ 56 levels "","AK","AL","AR",..: 50 50 50 17 39 42 39 39 7 7 ...
## $ zip : Factor w/ 188675 levels "","000075949",..: 43170 47812 50388 117622 89610 35299 90294 90294 150777 150777 ...
summary(dat)
## cred cty ehr frst_nm
## :1012568 HOUSTON : 25754 :1194451 MICHAEL: 30701
## MD : 344822 PITTSBURGH : 25135 Y: 332840 DAVID : 27486
## PA : 29223 NEW YORK : 18521 JOHN : 27117
## NP : 25666 PHILADELPHIA: 18025 ROBERT : 21989
## DO : 21639 CHICAGO : 16532 JAMES : 20402
## CNA : 16635 ANN ARBOR : 16390 (Other):1399580
## (Other): 76738 (Other) :1406934 NA's : 16
## gndr grd_yr hosp_afl_1
## F:681413 Min. :1942 : 449488
## M:845877 1st Qu.:1988 230046 : 13889
## U: 1 Median :1999 450076 : 7003
## Mean :1997 490009 : 5784
## 3rd Qu.:2008 390111 : 5496
## Max. :2017 050454 : 5261
## NA's :4024 (Other):1040370
## hosp_afl_lbn_1
## : 450721
## UNIVERSITY OF MICHIGAN HEALTH SYSTEM : 13889
## UNIVERSITY OF TEXAS M D ANDERSON CANCER CENTER,THE: 7003
## UNIVERSITY OF VIRGINIA MEDICAL CENTER : 5784
## HOSPITAL OF UNIV OF PENNSYLVANIA : 5496
## UCSF MEDICAL CENTER : 5261
## (Other) :1039137
## med_sch npi
## OTHER :771781 Min. :1.003e+09
## WAYNE STATE UNIVERSITY SCHOOL OF MEDICINE : 11559 1st Qu.:1.165e+09
## INDIANA UNIVERSITY SCHOOL OF MEDICINE : 10794 Median :1.337e+09
## PHILADELPHIA COLLEGE OF OSTEOPATHIC MEDICINE: 10036 Mean :1.340e+09
## TEMPLE UNIVERSITY SCHOOL OF MEDICINE : 9473 3rd Qu.:1.509e+09
## UNIVERSITY OF MINNESOTA MEDICAL SCHOOL : 9266 Max. :1.680e+09
## (Other) :704382
## pri_spec st zip
## NURSE PRACTITIONER :148237 CA :131284 481095000: 14678
## INTERNAL MEDICINE :141051 TX :114489 152124756: 6601
## FAMILY PRACTICE :119255 PA :100942 770304000: 6033
## PHYSICIAN ASSISTANT :111992 NY : 95038 229080001: 4827
## PHYSICAL THERAPY : 69682 FL : 84422 326103003: 4336
## DIAGNOSTIC RADIOLOGY: 64327 MI : 66982 110303816: 3727
## (Other) :872747 (Other):934134 (Other) :1487089
Observations from the above output:
- There are 56 levels of states, which means this data set includes territories beyond the 50 states and DC. Since we don't have sufficient data on the Virgin Islands, Guam, and other territories, we will filter them out.
- There is one person whose gender is unknown; we will filter this person out since we want to group by gender.
- There are 4024 rows with missing graduation years, which we will need to filter out if graduation year is one of the main demographics of interest (missing values would interfere with plotting). Also, one or more physicians who graduated in 2017 slipped into our data…we will filter those out too.
- It looks like most zip codes in this data set are full 9-digit zip codes, which can be split into the standard 5-digit code and the 4-digit extension.
- There are far too many levels of medical schools and primary specialties, so we will probably focus on the most common ones during exploratory analysis rather than in the final primary analysis.
data("state")
clean <- dat %>% filter(st %in% c(state.abb, "DC")) %>%
filter(gndr %in% c("F", "M") & !is.na(grd_yr) & grd_yr != 2017)
Now we only have 51 levels of states, 2 levels of gender, and no records with missing graduation year. To make sure we keep only valid zip codes, we can use R's zipcode package to validate them. This is important because we want to be able to accurately perform our analyses on the location variables and the regional analysis (especially zip code plotting).
# separate out the first five zip codes from the last four extension
clean <- clean %>% mutate(zip.ext = substr(as.character(zip), 6, 9),
zip = substr(as.character(zip), 1, 5))
# use R zipcode package and data
library(zipcode)
data("zipcode")
zip_dat <- clean %>% left_join(zipcode,by='zip')
# check state mismatches
zip_mismatch <- zip_dat %>%
mutate(st = as.character(st)) %>%
filter(st != state) %>%
select(npi, frst_nm, zip, cty, city, st, state, latitude, longitude)
# check number of records mismatched
nrow(zip_mismatch)
## [1] 1034
# look at the top five mismatches
(top5_mismatch <- zip_mismatch %>%
group_by(zip, cty, st, city, state) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
head(5))
## # A tibble: 5 x 6
## # Groups: zip, cty, st, city [5]
## zip cty st city state n
## <chr> <fctr> <chr> <chr> <chr> <int>
## 1 99362 WALLA WALLA WA Walla Walla OR 222
## 2 52761 MUSCATINE IA Muscatine IL 114
## 3 43530 GRANGER IN Jewell OH 86
## 4 22401 FREDERICKSBURG MD Fredericksburg VA 82
## 5 20176 HANOVER MD Leesburg VA 74
In order not to just blatantly throw away 1034 mismatches, we will look at the top 5 most common mismatches on a case-by-case basis to try to salvage as many records as possible without searching too deeply in our data set. Besides, from the 6th most common mismatch onward, the number of records n was down to 58 or fewer, so those would have little effect on the data set.
The results of looking up the top 5 mismatches on Google Maps:
- The zip code 99362 spans across Walla Walla in both WA and OR, so it is likely that the practice is located on the WA side of the zip code coverage. We will ignore the state mismatch from the zip code package for this case.
- The zip code 52761 does belong to Muscatine, IA, which is right on the border of IL. Again, the practice is most likely located on the IA side. Will ignore the state mismatch for this case as well.
- The zip code for Granger, IN was most likely mistyped. The correct zip code for Granger is 46530, which is very close to 43530. We will fix the zip code for this particular case only.
- The zip code 22401 does belong to Fredericksburg, VA, and is not near the VA-MD border at all. There is a "Frederick" city in MD, but none of its zip codes looks similar enough to 22401. With no clear culprit, we will have to let this one go. :(
- The zip code for Hanover, MD is 21076, which was most likely mistyped as 20176. We will fix this particular case as well.
# leave out the 22401 Fredericksburg MD mismatch, fix the zip codes for Granger and Hanover.
zip_fix <- top5_mismatch %>% ungroup() %>%
filter(cty != 'FREDERICKSBURG') %>%
mutate(zip = replace(zip, zip %in% c(43530, 20176), c(46530, 21076))) %>%
select(-n)
mismatch_tokeep <- zip_dat %>%
select(-zip) %>% # we want to replace the original wrong zip codes with the fixed zip codes
inner_join(zip_fix, by = c('cty','st', 'city', 'state'))
## Warning: Column `st` joining factor and character vector, coercing into
## character vector
clean <- zip_dat %>%
mutate(st = as.character(st), state = as.character(state)) %>%
filter(st == state | is.na(state)) %>% # keep matches, plus zips not found in the zipcode data (state is NA)
bind_rows(mismatch_tokeep) %>% # then add in the mismatches we wanted to keep or have fixed
select(-city, -state) # will rely on the original city and state columns
Continuing on with our wrangling…
- About 29.4% of the (original) data are of physicians not affiliated with any hospital; these are the practitioners we will subset for our practitioner-level analysis, since these are the only cases in which a physician's own demographics can have an effect on the choice of using EHR. For physicians who are affiliated with a hospital, their personal demographics are no longer informative as to whether or not they use EHR; it would instead be the demographics of the hospitals that inform EHR use at the hospital level. We were able to find demographics on hospitals across the U.S. in general, which will be discussed in the next section of this chapter, but that source carries no indication of participation in the incentive program nor of EHR use, so we will have to merge that hospital demographic data with the practitioners in this data set who are affiliated with a hospital.
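The affiliation proportion quoted above can be recomputed directly; below is a minimal sketch on a hypothetical toy data frame (in the real data, a blank `hosp_afl_1` means no hospital affiliation):

```r
# Toy illustration (hypothetical values): the unaffiliated proportion is
# just the share of rows whose first hospital affiliation is blank.
dat_demo <- data.frame(
  npi        = 1:5,
  hosp_afl_1 = c("", "230046", "", "", "450076"),
  stringsAsFactors = FALSE
)
prop_unaffiliated <- mean(dat_demo$hosp_afl_1 == "")
prop_unaffiliated # 0.6 on this toy data; about 0.294 on the real data
```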
So for the practitioner-level analysis we want to:
- Calculate the number of years since graduation
- Use 1’s and 0’s instead of ‘Y’ and blanks as indicator of EHR use for our binary outcome
- Keep only one record per physician, which means losing the distinct practice locations for physicians who have multiple practices. But since we can't fit our logistic model with city, state, or zip code as predictors anyway, there's no point in keeping this detail. Instead, we can calculate the number of distinct practices each physician has in order to collapse the data to the practitioner level.
EPs <- clean %>% filter(hosp_afl_1 == '') %>%
mutate(yrs_grd = 2016 - as.integer(grd_yr), # calc the n of years since graduation
cred = case_when(cred == '' ~ 'NA', # assign blanks to NA's
TRUE ~ as.character(cred)),
cred = as.factor(cred),
ehr = case_when(ehr == 'Y' ~ 1, # use numeric indicators
ehr == '' ~ 0),
ehr = as.factor(ehr)) %>%
select(-hosp_afl_1, -hosp_afl_lbn_1, -frst_nm) %>%
distinct(npi, .keep_all = TRUE) # keep only one row per physician
# calculate the number of practices for each physician
SL <- dat %>% group_by(npi) %>% summarise(locations = n_distinct(zip))
EPs <- left_join(EPs, SL, by="npi")
rm(SL)
summary(EPs)
## cred cty ehr gndr
## NA :167692 NEW YORK : 3555 0:214471 F:128382
## MD : 19334 HOUSTON : 1875 1: 18284 M:104373
## DC : 9698 BROOKLYN : 1701 U: 0
## PT : 6881 LOS ANGELES: 1700
## OD : 6773 CHICAGO : 1586
## CSW : 5750 PHOENIX : 1490
## (Other): 16627 (Other) :220848
## grd_yr med_sch
## Min. :1942 OTHER :140789
## 1st Qu.:1988 PALMER COLLEGE CHIROPRACTIC - DAVENPORT : 4547
## Median :1999 LOGAN COLLEGE OF CHIROPRACTIC : 2312
## Mean :1997 PENNSYLVANIA COLLEGE OF OPTOMETRY : 2310
## 3rd Qu.:2009 SOUTHERN COLLEGE OF OPTOMETRY : 2211
## Max. :2016 ILLINOIS COLLEGE OF OPTOMETRY AT CHICAGO: 2184
## (Other) : 78402
## npi pri_spec st
## Min. :1.003e+09 PHYSICAL THERAPY :34788 Length:232755
## 1st Qu.:1.165e+09 CHIROPRACTIC :27033 Class :character
## Median :1.336e+09 CLINICAL SOCIAL WORKER:25225 Mode :character
## Mean :1.340e+09 NURSE PRACTITIONER :22163
## 3rd Qu.:1.509e+09 OPTOMETRY :20352
## Max. :1.680e+09 CLINICAL PSYCHOLOGIST :16522
## (Other) :86672
## zip zip.ext latitude longitude
## Length:232755 Length:232755 Min. :19.10 Min. :-170.27
## Class :character Class :character 1st Qu.:34.14 1st Qu.: -98.57
## Mode :character Mode :character Median :39.40 Median : -86.18
## Mean :38.19 Mean : -91.39
## 3rd Qu.:41.75 3rd Qu.: -77.34
## Max. :66.90 Max. : -67.24
## NA's :140 NA's :140
## yrs_grd locations
## Min. : 0.00 Min. : 1.000
## 1st Qu.: 7.00 1st Qu.: 1.000
## Median :17.00 Median : 1.000
## Mean :18.55 Mean : 1.293
## 3rd Qu.:28.00 3rd Qu.: 1.000
## Max. :74.00 Max. :359.000
##
4.1.3 Output
# permanently save physician-level analysis dataset to repo
saveRDS(EPs, file = "./data/EPs.rds")
# for the rest of practitioners who are affiliated with a hospital
hosp_afl <- clean %>% # filter on the clean version
filter(hosp_afl_1 != '') %>%
mutate(hosp_afl_lbn_1 = as.character(hosp_afl_lbn_1),
cty = str_trim(as.character(cty)),
st = as.character(st),
ehr = ifelse(ehr == "Y", 1, 0)) %>%
# only keep relevant or potentially interesting variables (keeping npi to count number of physicians)
distinct(npi, hosp_afl_lbn_1, st, cty, ehr, hosp_afl_1, pri_spec, gndr, grd_yr)
# clean up intermediate data sets
rm(dat, clean, zipcode, zip_dat, zip_fix, zip_mismatch, mismatch_tokeep, top5_mismatch)
- We are storing the `EPs` data set as a `.rds` file in our repo under the `/data` directory for reproducibility of our analysis, since this is the clean analysis-ready data set that will be used to analyze the potential effects of practitioner demographics on EHR use in our Primary Analysis.
- The `hosp_afl` data set, still containing one row per physician per hospital location, is to be merged with the hospital demographics data discussed in the next section.
- Both `EPs` and the merged `hosp_afl` data containing the hospital demographics are to be merged with the EHR vendor and product information in the last section of this chapter, in order to explore both physician and hospital demographics by the specific types of EHR used.
4.2 Eligible Hospitals (Hosp)
General hospital demographics such as the total number of staffed beds, total number of discharges, total number of patient days, and total gross patient revenue (inpatient and outpatient) are all publicly accessible on the American Hospital Directory, organized by U.S. state. Since there is no `.csv` file ready for us to download and the hospital demographics are already neatly summarized in a table for each state, we decided that we could do some web scraping for this.
4.2.1 Source/Metadata
The link to each U.S. state's summary table of hospital demographics by hospital name begins with the same URL, https://www.ahd.com/states/hospital_, and ends with the state's abbreviation plus `.html`. For example, a table of hospital demographics for each hospital in Massachusetts can be scraped directly from the link https://www.ahd.com/states/hospital_MA.html. The details of these hospital statistics are described here.
hosp_web <- data.frame()
for (st in c(state.abb, "DC")){
url = paste("https://www.ahd.com/states/hospital_", st, ".html", sep='')
st_df <- read_html(url) %>%
html_nodes("table") %>%
.[[2]] %>%
html_table()
st_df <- as.data.frame(st_df) # html_table() already returns the parsed table
# removing the TOTAL row at the end of each table
st_df <- st_df[!(st_df$"Hospital Name"=="T O T A L"),]
st_df$State <- st
hosp_web <- rbind(hosp_web, st_df)
}
saveRDS(hosp_web, "./data/hosp_raw.rds") # saving data for reproducibility
4.2.2 Wrangling
Let's take a look at the data structure of the raw scraped hospital demographics.
str(hosp_web)
## 'data.frame': 3907 obs. of 7 variables:
## $ Hospital_name : chr "Alaska Native Medical Center" "Alaska Regional Hospital" "Bartlett Regional Hospital" "Central Peninsula General Hospital" ...
## $ City : chr "Anchorage" "Anchorage" "Juneau" "Soldotna" ...
## $ Staffed_beds : chr "152" "174" "57" "106" ...
## $ Total_discharges : chr "7,320" "6,405" "1,710" "2,575" ...
## $ Patient_days : chr "38,687" "33,507" "6,325" "10,186" ...
## $ Gross_patient_revenue: chr "$0" "$947,754" "$147,231" "$305,469" ...
## $ State : chr "AK" "AK" "AK" "AK" ...
To do:
- We need to turn all the demographic information into numeric columns so we can use them as continuous variables.
- These values are reported with commas, so we need to strip these out.
- All the gross patient revenues are reported with a dollar sign in addition to the commas, so we need to strip those out too.
- Proper case is harder to match on, so we will convert all letters in city and hospital names to upper case.
- It looks like some hospitals have a gross patient revenue of $0…is this possible? We will explore this after dealing with all of the above first.
hosp_clean <- transform(hosp_web,
Staffed_beds = as.numeric(gsub(",", "", Staffed_beds)),
Total_discharges = as.numeric(gsub(",", "", Total_discharges)),
Patient_days = as.numeric(gsub(",","", Patient_days)),
Gross_patient_revenue = as.numeric(gsub("\\$|,", "", Gross_patient_revenue)),
Hospital_name = toupper(Hospital_name),
City = toupper(str_trim(as.character(City)))
)
summary(hosp_clean)
## Hospital_name City Staffed_beds Total_discharges
## Length:3907 Length:3907 Min. : 0.0 Min. : 0
## Class :character Class :character 1st Qu.: 48.0 1st Qu.: 1244
## Mode :character Mode :character Median : 133.0 Median : 4562
## Mean : 204.8 Mean : 8017
## 3rd Qu.: 264.0 3rd Qu.: 11336
## Max. :36985.0 Max. :136646
## Patient_days Gross_patient_revenue State
## Min. : 0 Min. : 0 Length:3907
## 1st Qu.: 4104 1st Qu.: 98340 Class :character
## Median : 18892 Median : 377425 Mode :character
## Mean : 37794 Mean : 797958
## 3rd Qu.: 51621 3rd Qu.: 1001421
## Max. :695643 Max. :15618749
It looks like the zero values weren't limited to gross patient revenue. From the summary output, we can see that the minimum values of all the hospital demographics (staffed beds, total discharges, patient days, gross patient revenue) are zero, which doesn't seem right: it is essentially impossible for a functioning hospital to have a total of zero on any of these characteristics. Would having zero staffed beds really make it a hospital? Can a hospital survive on a gross patient revenue of zero dollars? We thought it might be a web scraping error, but when we went back and looked at the website, some of their tables definitely have scattered zero statistics and some even have a whole row of zeros for certain hospitals. The metadata weren't very clear on this, but it seems to be their way of representing missing data. Instead of keeping the zeros, which would affect any computation we perform on these continuous variables, we will convert every zero into a true missing value, `NA`. But if a hospital has all demographics missing (an entire row of zeros), we will just remove that hospital entirely (delete the whole row).
# the demographics are columns 3 to 6 of the data frame
is.na(hosp_clean[,3:6]) <- !hosp_clean[,3:6] # turn 0's to NAs
hosp_clean <- hosp_clean[rowSums(is.na(hosp_clean)) < 4,] # only delete hosp with all four NAs
summary(hosp_clean)
## Hospital_name City Staffed_beds Total_discharges
## Length:3456 Length:3456 Min. : 1.0 Min. : 1
## Class :character Class :character 1st Qu.: 73.0 1st Qu.: 2141
## Mode :character Mode :character Median : 157.0 Median : 5692
## Mean : 231.6 Mean : 9087
## 3rd Qu.: 290.0 3rd Qu.: 12564
## Max. :36985.0 Max. :136646
## NA's :1 NA's :9
## Patient_days Gross_patient_revenue State
## Min. : 1 Min. : 2298 Length:3456
## 1st Qu.: 7740 1st Qu.: 188177 Class :character
## Median : 23922 Median : 504646 Mode :character
## Mean : 42838 Mean : 923192
## 3rd Qu.: 57391 3rd Qu.: 1149258
## Max. :695643 Max. :15618749
## NA's :9 NA's :79
Now the minimum value is 1…still questionable and potentially causing the skew in the distributions. But we did look at the specific hospitals with any demographic value of 1, and they tend to have low values on the other characteristics as well (not exactly 1 across the board, which would have been even more suspicious), so these may well be valid, who knows. Anyway, we need to move on with our lives, I mean, to merging with `hosp_afl` (the subset of practitioners affiliated with a hospital from the previous section) on hospital name and state, since different hospitals in different locations can share a name. Because `hosp_afl` can have multiple rows per hospital whenever more than one physician in the data is affiliated with it, we will aggregate the data back to the hospital level after the merge by summarizing the physician-level information.
hosps <- inner_join(hosp_clean, hosp_afl, by=c("Hospital_name"="hosp_afl_lbn_1", "State"="st"))
hosps <- hosps %>%
group_by(hosp_afl_1, Hospital_name, State, Staffed_beds, Total_discharges, Patient_days, Gross_patient_revenue) %>%
summarize(num_phys = n_distinct(npi),
female_prop = round(mean(ifelse(gndr == "F", 1, 0)),2),
avg_grad_year = round(mean(grd_yr),2),
n_specialty =n_distinct(pri_spec),
EHR_use = as.factor(max(ehr)))
summary(hosps)
## hosp_afl_1 Hospital_name State Staffed_beds
## 150042 : 5 Length:1857 Length:1857 Min. : 7.0
## 050471 : 3 Class :character Class :character 1st Qu.: 72.0
## 360052 : 3 Mode :character Mode :character Median : 154.0
## 360134 : 3 Mean : 232.1
## 050257 : 2 3rd Qu.: 277.0
## 050380 : 2 Max. :36985.0
## (Other):1839
## Total_discharges Patient_days Gross_patient_revenue num_phys
## Min. : 1 Min. : 2 Min. : 3624 Min. : 1.0
## 1st Qu.: 2056 1st Qu.: 7248 1st Qu.: 185397 1st Qu.: 28.0
## Median : 5350 Median : 21967 Median : 488683 Median : 69.0
## Mean : 8376 Mean : 38899 Mean : 854423 Mean : 114.4
## 3rd Qu.:11983 3rd Qu.: 53935 3rd Qu.: 1100914 3rd Qu.: 152.0
## Max. :77757 Max. :425114 Max. :14140692 Max. :1460.0
## NA's :1 NA's :1 NA's :9
## female_prop avg_grad_year n_specialty EHR_use
## Min. :0.0000 Min. :1971 Min. : 1.00 1:1701
## 1st Qu.:0.2800 1st Qu.:1994 1st Qu.:12.00 0: 156
## Median :0.3600 Median :1996 Median :22.00
## Mean :0.3592 Mean :1996 Mean :22.79
## 3rd Qu.:0.4300 3rd Qu.:1998 3rd Qu.:33.00
## Max. :1.0000 Max. :2014 Max. :62.00
##
table(hosps[which(hosps$num_phys <= 10),]$EHR_use)
##
## 1 0
## 52 114
Well, we managed to match about 54% (1857 of 3456) of the hospitals between the two data sets, and the summary statistics on those hospital demographics look a lot nicer and less skewed! Even though we aggregated the physician-level demographics because we didn't want to just throw away data, we do acknowledge that these physician-level demographics are not really meaningful, because they are not representative of the demographics of the hospital. It is unlikely that we had complete physician information from `hosp_afl` for each matched hospital (i.e. the 10 physicians we have data on for a hospital may not accurately represent all the practitioners in that hospital). We thought of excluding hospitals with fewer than 10 practitioners in the data to obtain accurate physician demographics for each hospital, but that would mean cutting away far too much of our limited data. We decided to just focus on the main hospital demographics scraped from the website, and keep the aggregated physician demographics at the exploratory level. Merging with `hosp_afl` still provided us the main outcome of interest, `EHR_use` (whether the hospital uses an electronic health record system), calculated as 1 if at least one practitioner in the hospital uses EHR and 0 if none of the practitioners in the hospital uses EHR; that's why we took the maximum in our code above. Reminder that for practitioners affiliated with hospitals, we assumed that EHR use reflects hospital-level adoption and not an individual choice. Thus it makes sense that if at least one of the practitioners is recorded in the data as using EHR, we will assume the hospital uses EHR.
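The "at least one practitioner uses EHR" collapse can be illustrated on a hypothetical mini data set; taking the maximum of the 0/1 indicator within each hospital is equivalent to asking whether any affiliated practitioner uses EHR:

```r
library(dplyr)

# Hypothetical mini version of hosp_afl: two hospitals, 0/1 EHR indicator.
afl_demo <- data.frame(
  hosp_afl_1 = c("A", "A", "A", "B", "B"),
  ehr        = c(0, 1, 0, 0, 0)
)

afl_demo %>%
  group_by(hosp_afl_1) %>%
  summarize(EHR_use = max(ehr))
# hospital A -> EHR_use 1 (one user is enough); hospital B -> EHR_use 0
```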
4.2.3 Output
# clean up intermediate data sets
rm(hosp_web, hosp_afl)
# hosp_clean will be used later in secondary regional analysis
# permanently save hospital-level analysis dataset to repo
saveRDS(hosps, file = "./data/hosps.rds")
- Again, we are storing the `hosps` data set as a `.rds` file in our repo under the `/data` directory for reproducibility of our analysis, since it will be used to analyze the potential effects of hospital demographics on the use of EHR, which is the other aspect of our Primary Analysis.
- This data set will also subsequently be merged with the EHR vendor and product information discussed in the next section to explore hospital demographics by the specific types of EHR used.
4.3 Vendors & Products (EHR)
Since so many professionals and hospitals across the U.S. have already adopted EHR, we realized that simply looking at the binary outcome of EHR use would make for too boring a project. Thus, we were curious to further explore the demographics of those who do use EHR by the specific EHR vendor or product. Fortunately, we were able to find data to support this additional secondary/exploratory analysis.
4.3.1 Source/Metadata
The Health IT Dashboard provides certified health IT product data from the ONC Certified Health IT Product List (CHPL), such as the unique vendors, products, and product types of each certified health IT product used as part of the Medicare EHR Incentive Program. We downloaded only the 2016 data set, which also includes unique provider identifiers (NPI), in order to match the `EPs` data set discussed in the first section of this chapter. As the metadata explain, a provider in this data set can be either an eligible professional (EP) or an eligible hospital (Hospital), as distinguished by the `Provider_Type` column. Thus, only the `Provider_Type == 'EP'` records are merged with the subset of `EPs` who are not affiliated with any hospital, while the `Provider_Type == 'Hospital'` records are merged with the subset of `EPs` who are affiliated with a hospital, and therefore directly merged with the combined and cleaned version of the `hosps` data set discussed in the previous section of this chapter.
EHR <- read.csv("https://dashboard.healthit.gov/datadashboard/data/MU_REPORT_2016.csv")
saveRDS(EHR, "./data/EHR_raw.rds") # saving data for reproducibility
4.3.2 Wrangling
Data structure, you know the drill.
str(EHR)
## 'data.frame': 505914 obs. of 23 variables:
## $ NPI : int 1003000142 1003000522 1003000597 1003000639 1003000902 1003000936 1003000936 1003001256 1003001256 1003001462 ...
## $ CCN : int NA NA NA NA NA NA NA NA NA NA ...
## $ Provider_Type : Factor w/ 2 levels "EP","Hospital": 1 1 1 1 1 1 1 1 1 1 ...
## $ Business_State_Territory : Factor w/ 56 levels "Alabama","Alaska",..: 38 11 39 5 17 45 45 6 6 12 ...
## $ ZIP : Factor w/ 11921 levels "","00603","00612",..: 5619 4304 9168 10670 5353 3731 3731 10037 10037 4055 ...
## $ Specialty : Factor w/ 288 levels "","Acupuncturist",..: 177 9 279 263 79 27 27 79 79 235 ...
## $ Hospital_Type : Factor w/ 3 levels "","Critical Access",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Program_Type : Factor w/ 2 levels "Medicare","Medicare/Medicaid": 1 1 1 1 1 1 1 1 1 1 ...
## $ Program_Year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ Provider_Stage_Number : Factor w/ 2 levels "Stage 1","Stage 2": 2 2 1 1 2 2 2 2 2 1 ...
## $ Payment_Year : int 3 5 NA NA 5 3 3 5 5 NA ...
## $ Attestation_Month : int 2 1 1 2 1 5 5 2 2 3 ...
## $ Attestation_Year : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
## $ MU_Definition_2014 : logi NA NA NA NA NA NA ...
## $ Stage_2_Scheduled_2014 : int 0 1 0 0 1 0 0 0 0 0 ...
## $ EHR_Certification_Number : Factor w/ 3550 levels "0014E0087U0WV35",..: 2699 1660 2177 2303 2965 1119 1119 2791 2791 2023 ...
## $ EHR_Product_CHP_Id : Factor w/ 2317 levels "14.02.02.1026.A033.01.01.0.170217",..: 1876 346 1168 1403 2226 136 1747 1403 1876 1004 ...
## $ Vendor_Name : Factor w/ 484 levels "4medica, Inc.",..: 134 35 338 134 159 170 170 134 134 120 ...
## $ EHR_Product_Name : Factor w/ 798 levels "1 Connect BuildYourEMR",..: 192 42 554 192 81 251 252 192 192 533 ...
## $ EHR_Product_Version : Factor w/ 699 levels "1","1.0","1.0.0",..: 489 113 385 494 50 549 206 494 489 155 ...
## $ Product_Classification : Factor w/ 3 levels "","Complete EHR",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Product_Setting : Factor w/ 3 levels "","Ambulatory",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Product_Certification_Edition_Yr: int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
There are actually a lot of variables here that are not informative, so we can drop them and focus on a smaller set; e.g., program year would just be 2016 on every row of the data, and the EHR certification number or CHP ID is at an unnecessary level of detail. Let's keep the important columns plus a few potentially interesting variables to summarize by, and de-duplicate the remaining rows (thus using `distinct` instead of `select`).
EHR_clean <- EHR %>% distinct(NPI, CCN, Provider_Type, Business_State_Territory, ZIP, Specialty, Hospital_Type, Vendor_Name, EHR_Product_Name, Product_Classification, Product_Setting)
summary(EHR_clean)
## NPI CCN Provider_Type
## Min. :1.003e+09 Min. : 10001 EP :393375
## 1st Qu.:1.255e+09 1st Qu.:121301 Hospital: 16330
## Median :1.508e+09 Median :241361
## Mean :1.501e+09 Mean :261394
## 3rd Qu.:1.750e+09 3rd Qu.:390179
## Max. :1.993e+09 Max. :670114
## NA's :393375
## Business_State_Territory ZIP
## California : 31692 : 16330
## Minnesota : 24252 55905 : 11278
## Florida : 22664 48109 : 3336
## Texas : 22307 02114 : 3021
## Pennsylvania: 20564 32224 : 2818
## Michigan : 19739 60611 : 2521
## (Other) :268487 (Other):370401
## Specialty
## Family Medicine : 61485
## Internal Medicine : 45820
## Student in an Organized Health Care Education/Training Program: 17552
## Cardiovascular Disease : 17524
## : 16330
## (Other) :250764
## NA's : 230
## Hospital_Type Vendor_Name
## :393375 Epic Systems Corporation:133637
## Critical Access: 3348 Cerner Corporation : 70367
## General : 12982 Allscripts : 28067
## athenahealth, Inc. : 20970
## eClinicalWorks, LLC : 16354
## GE Healthcare : 14882
## (Other) :125428
## EHR_Product_Name
## EpicCare Ambulatory 2014 Certified EHR Suite: 71207
## athenaClinicals : 20945
## Beacon Oncology 2014 Certified Module : 16281
## eClinicalWorks : 16097
## NextGen Ambulatory EHR : 13616
## Powerchart (Clinical) : 13293
## (Other) :258266
## Product_Classification Product_Setting
## : 14690 : 14690
## Complete EHR:202827 Ambulatory:325708
## Modular EHR :192188 Inpatient : 69307
##
##
##
##
Observations:
- The records with missing CCN (the identifying number for eligible hospitals) and blank hospital type match up exactly with the records where Provider_Type == 'EP'. This makes sense, since those variables are not relevant to an eligible professional (EP), who is not affiliated with any hospital.
- There are exactly 16330 records with missing zip codes and blank specialty (specifically blank, not NA), and this number matches the number of records where Provider_Type == 'Hospital'.
- It sounds like we just need to split the data into EPs and hospitals, since each type has its own set of relevant variables. We can then clean each subset before merging it with the corresponding level of demographics.
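These count checks can be made explicit. A minimal sketch, using a toy data frame as a stand-in for EHR_clean (the real check would compare sum(is.na(EHR_clean$CCN)) against sum(EHR_clean$Provider_Type == 'EP')):

```r
# Toy stand-in for EHR_clean: two EP rows (no CCN) and two hospital rows
toy <- data.frame(
  CCN = c(NA, NA, 10001, 670114),
  Provider_Type = c("EP", "EP", "Hospital", "Hospital")
)

# Missing CCNs should align exactly with the EP rows
n_na_ccn <- sum(is.na(toy$CCN))
n_eps <- sum(toy$Provider_Type == "EP")
n_na_ccn == n_eps
## [1] TRUE
```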
Split:
# split for EPs
EHR_EPs <- EHR_clean %>% filter(Provider_Type == 'EP') %>%
select(-CCN, -Hospital_Type, -Provider_Type)
summary(EHR_EPs)
## NPI Business_State_Territory ZIP
## Min. :1.003e+09 California : 30425 55905 : 11278
## 1st Qu.:1.255e+09 Minnesota : 23895 48109 : 3336
## Median :1.508e+09 Florida : 21678 02114 : 3021
## Mean :1.501e+09 Texas : 20941 32224 : 2818
## 3rd Qu.:1.750e+09 Pennsylvania: 19919 60611 : 2521
## Max. :1.993e+09 Michigan : 19221 23298 : 2235
## (Other) :257296 (Other):368166
## Specialty
## Family Medicine : 61485
## Internal Medicine : 45820
## Student in an Organized Health Care Education/Training Program: 17552
## Cardiovascular Disease : 17524
## Obstetrics & Gynecology : 15059
## (Other) :235705
## NA's : 230
## Vendor_Name
## Epic Systems Corporation:131883
## Cerner Corporation : 64893
## Allscripts : 27743
## athenahealth, Inc. : 20929
## eClinicalWorks, LLC : 16345
## GE Healthcare : 14861
## (Other) :116721
## EHR_Product_Name
## EpicCare Ambulatory 2014 Certified EHR Suite: 71047
## athenaClinicals : 20928
## Beacon Oncology 2014 Certified Module : 16189
## eClinicalWorks : 16097
## NextGen Ambulatory EHR : 13613
## Powerchart (Clinical) : 12548
## (Other) :242953
## Product_Classification Product_Setting
## : 14462 : 14462
## Complete EHR:200889 Ambulatory:324643
## Modular EHR :178024 Inpatient : 54270
##
##
##
##
# check for valid zip codes
EHR_EPs %>% filter(nchar(as.character(ZIP)) < 5) %>%
group_by(ZIP) %>% summarize(n())
## # A tibble: 3 x 2
## ZIP `n()`
## <fctr> <int>
## 1 1010 1
## 2 EH16 2
## 3 IP28 9
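The code "1010" is plausibly a five-digit zip ("01010", in Massachusetts) whose leading zero was dropped upstream, while "EH16" and "IP28" resemble UK postcode prefixes and cannot be recovered. We simply drop all of these below, but as a hedged alternative, the all-digit case could be repaired by re-padding:

```r
# Hypothetical repair: re-pad short, all-digit zips with leading zeros;
# alphanumeric codes are left untouched (and dropped later)
zips <- c("1010", "EH16", "IP28")
short_numeric <- grepl("^[0-9]{1,4}$", zips)
zips[short_numeric] <- sprintf("%05d", as.integer(zips[short_numeric]))
zips
## [1] "01010" "EH16"  "IP28"
```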
# split for hospitals
EHR_hosps <- EHR_clean %>% filter(Provider_Type == 'Hospital') %>%
select(-ZIP, -Specialty, -Provider_Type)
summary(EHR_hosps)
## NPI CCN Business_State_Territory
## Min. :1.003e+09 Min. : 10001 Texas : 1366
## 1st Qu.:1.245e+09 1st Qu.:121301 California : 1267
## Median :1.488e+09 Median :241361 Florida : 986
## Mean :1.495e+09 Mean :261394 Illinois : 667
## 3rd Qu.:1.740e+09 3rd Qu.:390179 Pennsylvania: 645
## Max. :1.993e+09 Max. :670114 Ohio : 564
## (Other) :10835
## Hospital_Type
## : 0
## Critical Access: 3348
## General :12982
##
##
##
##
## Vendor_Name
## Cerner Corporation :5474
## Medical Information Technology, Inc. (MEDITECH):2879
## Epic Systems Corporation :1754
## MEDHOST : 653
## McKesson : 625
## Cerner Health Services, Inc. : 525
## (Other) :4420
## EHR_Product_Name
## HealthSentry : 851
## Powerchart (Clinical) : 745
## Powerchart (All CQM's) : 703
## P2 Sentinel (Powered by Sensage) : 580
## EpicCare Inpatient 2014 Certified EHR Suite: 565
## HealtheLife and Powerchart : 552
## (Other) :12334
## Product_Classification Product_Setting
## : 228 : 228
## Complete EHR: 1938 Ambulatory: 1065
## Modular EHR :14164 Inpatient :15037
##
##
##
##
The summaries look much better for the two separate sets. Now we just have to clean up the zip codes for the physician-level EHRs, and it looks like only a few records have invalid zip codes. Similar to how we cleaned up the zip codes for EPs, we will match the zip codes against R's own zipcode package.
# first remove non-US states, appending DC to list
state_name <- c(state.name,"District Of Columbia")
states <- data.frame(state_name = state_name, state_abb = c(state.abb, "DC"))
EHR_EPs <- EHR_EPs %>% filter(Business_State_Territory %in% state_name)
length(unique(EHR_EPs$Business_State_Territory))
## [1] 51
# remove invalid zip codes
EHR_EPs <- EHR_EPs %>% filter(nchar(as.character(ZIP)) >= 5)
# merge R zip codes
library(zipcode)
data("zipcode")
EHR_EPs <- EHR_EPs %>% left_join(zipcode, by = c("ZIP" = "zip"))
## Warning: Column `ZIP`/`zip` joining factor and character vector, coercing
## into character vector
# check state mismatches
EHR_mismatch <- EHR_EPs %>%
left_join(states, by = c("state" = "state_abb")) %>% # need to add in long state names
mutate(Business_State_Territory = as.character(Business_State_Territory)) %>%
filter(state_name != Business_State_Territory)
## Warning: Column `state`/`state_abb` joining character vector and factor,
## coercing into character vector
# percent state mismatch
nrow(EHR_mismatch)/nrow(EHR_EPs)
## [1] 0.04492693
# check which zips are mismatching
EHR_mismatch %>%
group_by(ZIP, Business_State_Territory, state_name) %>%
summarize(num = n()) %>%
arrange(desc(num)) %>%
head(8)
## # A tibble: 8 x 4
## # Groups: ZIP, Business_State_Territory [8]
## ZIP Business_State_Territory state_name num
## <chr> <chr> <fctr> <int>
## 1 51503 Nebraska Iowa 120
## 2 99362 Washington Oregon 83
## 3 54601 Minnesota Wisconsin 80
## 4 19104 New Jersey Pennsylvania 65
## 5 55905 Florida Minnesota 63
## 6 33331 Ohio Florida 58
## 7 85259 Minnesota Arizona 48
## 8 55905 Wisconsin Minnesota 46
The top mismatches:
- Zip code 51503 is in Iowa (agreeing with the zipcode package) and borders Nebraska, but was reported as Nebraska. Since Nebraska neighbors Iowa, this may well be correct, so we'll keep Nebraska as reported.
- Zip code 55905 is in Minnesota, but the reported state, Florida, is nowhere near Minnesota. Since there is likely a typo in either the zip or the state, we'll remove these mismatched records altogether.
- Zip code 54601 is in Wisconsin and borders Minnesota, but was reported as Minnesota, which neighbors Wisconsin. We'll keep Minnesota as reported.
- Zip code 33331 is in Florida, which is nowhere near Ohio. Since it is unclear which field holds the typo, we'll delete these mismatches as well.
- Zip code 55905 is in Minnesota; it is actually the Mayo Clinic (as in #2 above). Wisconsin and Minnesota border each other, so we'll keep Wisconsin as reported. In general, if we look at EHR_mismatch %>% filter(ZIP == "55905"), we notice that this zip code is incorrectly used for multiple states, suggesting data entry/quality issues, which validates our decision to clean up the zip codes.
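That check for a zip code spanning multiple reported states can be generalized. A sketch using a toy version of the mismatch table (hypothetical rows echoing the output above):

```r
library(dplyr)

# Toy mismatch records: 55905 appears under two different reported states
mismatch <- data.frame(
  ZIP = c("55905", "55905", "51503"),
  Business_State_Territory = c("Florida", "Wisconsin", "Nebraska"),
  stringsAsFactors = FALSE
)

# Flag zip codes reported under more than one state -- likely entry errors
suspect <- mismatch %>%
  group_by(ZIP) %>%
  summarize(n_states = n_distinct(Business_State_Territory)) %>%
  filter(n_states > 1)
suspect$ZIP
## [1] "55905"
```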
EHR_zip <- EHR_EPs %>%
left_join(states, by = c("state" = "state_abb")) %>%
mutate(Business_State_Territory = as.character(Business_State_Territory)) %>%
filter(
(state_name == Business_State_Territory) |
(state_name != Business_State_Territory & ZIP == "51503" & Business_State_Territory == "Nebraska") |
(state_name != Business_State_Territory & ZIP == "54601" & Business_State_Territory == "Minnesota") |
(state_name != Business_State_Territory & ZIP == "55905" & Business_State_Territory == "Wisconsin")
) %>%
select(NPI, Business_State_Territory, ZIP, Specialty, Vendor_Name, EHR_Product_Name, Product_Classification, Product_Setting)
## Warning: Column `state`/`state_abb` joining character vector and factor,
## coercing into character vector
We want to inner join with the EPs data set on NPI. The EPs data has one row per physician, keyed by their unique NPI, so this is a one-to-many merge.
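A toy illustration of the join semantics (hypothetical NPIs): rows whose NPI is missing from either side are dropped, and an NPI with several product records on the EHR side yields several merged rows:

```r
library(dplyr)

# Toy versions of the two tables: NPI 1 has two EHR product rows,
# NPI 2 has no demographics, and npi 3 has no EHR records
ehr_toy <- data.frame(NPI = c(1, 1, 2), product = c("A", "B", "C"))
eps_toy <- data.frame(npi = c(1, 3), gndr = c("F", "M"))

merged <- inner_join(ehr_toy, eps_toy, by = c("NPI" = "npi"))
nrow(merged)  # NPI 2 and npi 3 are dropped; NPI 1 contributes two rows
## [1] 2
```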
# merge with EPs data
EHR_EPs <- inner_join(EHR_zip, EPs, by=c('NPI'='npi'))
summary(EHR_EPs)
## NPI Business_State_Territory ZIP
## Min. :1.003e+09 Length:24155 Length:24155
## 1st Qu.:1.165e+09 Class :character Class :character
## Median :1.336e+09 Mode :character Mode :character
## Mean :1.340e+09
## 3rd Qu.:1.509e+09
## Max. :1.680e+09
##
## Specialty Vendor_Name
## Optometrist : 4746 Epic Systems Corporation : 4225
## Chiropractor : 2893 Integrated Practice Solutions, Inc.: 1600
## Family Medicine: 2397 Modernizing Medicine, Inc. : 1544
## Ophthalmology : 1906 NextGen Healthcare : 1285
## Dermatology : 1571 GE Healthcare : 1048
## (Other) :10638 Mayo Clinic : 1035
## NA's : 4 (Other) :13418
## EHR_Product_Name
## EpicCare Ambulatory 2014 Certified EHR Suite: 2043
## ChiroTouch : 1281
## NextGen Ambulatory EHR : 1209
## athenaClinicals : 1019
## EMA : 888
## eClinicalWorks : 732
## (Other) :16983
## Product_Classification Product_Setting cred
## : 352 : 352 NA :15578
## Complete EHR:15438 Ambulatory:21513 MD : 4987
## Modular EHR : 8365 Inpatient : 2290 OD : 1675
## DC : 1271
## DO : 325
## DPM : 290
## (Other): 29
## cty ehr gndr grd_yr
## ROCHESTER: 1663 0: 5355 F: 9673 Min. :1951
## LOS ALTOS: 687 1:18800 M:14482 1st Qu.:1985
## PALO ALTO: 527 U: 0 Median :1996
## NEW YORK : 335 Mean :1995
## FREMONT : 276 3rd Qu.:2005
## BOSTON : 207 Max. :2016
## (Other) :20460
## med_sch pri_spec
## OTHER : 5140 OPTOMETRY :5155
## PALMER COLLEGE CHIROPRACTIC - DAVENPORT : 699 CHIROPRACTIC :3161
## ILLINOIS COLLEGE OF OPTOMETRY AT CHICAGO: 640 FAMILY PRACTICE :2637
## SOUTHERN COLLEGE OF OPTOMETRY : 572 DERMATOLOGY :2534
## PENNSYLVANIA COLLEGE OF OPTOMETRY : 538 OPHTHALMOLOGY :2044
## INDIANA UNIVERSITY - SCHOOL OF OPTOMETRY: 420 INTERNAL MEDICINE:1499
## (Other) :16146 (Other) :7125
## st zip zip.ext latitude
## Length:24155 Length:24155 Length:24155 Min. :19.53
## Class :character Class :character Class :character 1st Qu.:34.61
## Mode :character Mode :character Mode :character Median :39.04
## Mean :38.31
## 3rd Qu.:42.13
## Max. :64.85
## NA's :11
## longitude yrs_grd locations
## Min. :-159.38 Min. : 0.00 Min. : 1.000
## 1st Qu.:-104.96 1st Qu.:11.00 1st Qu.: 1.000
## Median : -90.03 Median :20.00 Median : 1.000
## Mean : -93.52 Mean :21.31 Mean : 1.281
## 3rd Qu.: -80.31 3rd Qu.:31.00 3rd Qu.: 1.000
## Max. : -67.88 Max. :65.00 Max. :53.000
## NA's :11
# check EHR use (should expect all 1's or "yes")
table(EHR_EPs$ehr)
##
## 0 1
## 5355 18800
Notice that the ehr column is not all 1's when we expect it to be. This explains why we had so many physicians who fell in the ehr = '' category: the physicians for whom we have EHR product usage information did not all answer "yes" to EHR use in the original EPs data.
4.3.3 Output
saveRDS(EHR_EPs, file = "./data/EHR_EPs.rds")
Although the specific product names are too granular to be worth cleaning, we believe that exploring the professional and hospital demographics across vendors may still be interesting, especially if we focus on the ten most popular vendors and classify the smaller local vendors as one group.
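As a sketch of that grouping idea (assuming the forcats package is available), fct_lump_n can collapse everything outside the top n vendors into a single "Other" level; on the real data we would lump Vendor_Name with n = 10. A toy vendor vector keeps the example self-contained:

```r
library(forcats)

# Toy vendor factor: two frequent vendors plus two one-off local vendors
vendors <- factor(c(rep("Epic Systems Corporation", 5),
                    rep("Cerner Corporation", 3),
                    "Local Vendor A", "Local Vendor B"))

# Keep the 2 most common vendors; everything else becomes "Other"
lumped <- fct_lump_n(vendors, n = 2)
table(lumped)
```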