Introduction to epidict

The goal of {epidict} is to provide standardised data dictionaries for use in epidemiological data analysis templates. Currently it supports standardised dictionaries from MSF OCA. This is a product of the R4EPIs project; learn more at https://r4epis.netlify.com

Installation

You can install {epidict} from CRAN:

install.packages("epidict")

If there is a bugfix or feature that is not yet on CRAN, you can install it via the {drat} package:

You can also install the in-development version from GitHub using the {remotes} package (but there’s no guarantee that it will be stable):

# install.packages("remotes")
remotes::install_github("R4EPI/epidict") 

Accessing dictionaries

There are four MSF outbreak dictionaries available in {epidict} based on DHIS2 exports:

  • Acute Jaundice Syndrome (often suspected to be Hepatitis E) (“AJS”)
  • Cholera/Acute watery diarrhea (“Cholera”)
  • Measles/Rubella (“Measles”)
  • Meningitis (“Meningitis”)

You can read more about the outbreak dictionaries at https://r4epis.netlify.com/outbreaks

A dictionary can be obtained via the msf_dict() function. It returns a table with one row per recorded variable (named in data_element_shortname) and, for categorical variables, the possible options:

library("epidict")
msf_dict("AJS")
#> # A tibble: 68 × 8
#>    data_element_uid data_element_name                     data_element_shortname
#>    <lgl>            <chr>                                 <chr>                 
#>  1 NA               egen_044_event_file_type              event_file_type       
#>  2 NA               egen_001_patient_case_number          case_number           
#>  3 NA               egen_004_date_of_consultation_admiss… date_of_consultation_…
#>  4 NA               egen_022_detected_by                  detected_by           
#>  5 NA               egen_005_patient_facility_type        patient_facility_type 
#>  6 NA               egen_029_msf_involvement              msf_involvement       
#>  7 NA               egen_008_age_years                    age_years             
#>  8 NA               egen_009_age_months                   age_months            
#>  9 NA               egen_010_age_days                     age_days              
#> 10 NA               egen_011_sex                          sex                   
#> # ℹ 58 more rows
#> # ℹ 5 more variables: data_element_description <chr>,
#> #   data_element_valuetype <chr>, data_element_formname <chr>,
#> #   used_optionset_uid <chr>, options <list>
msf_dict("Cholera")
#> # A tibble: 45 × 8
#>    data_element_uid data_element_name                     data_element_shortname
#>    <chr>            <chr>                                 <chr>                 
#>  1 AafTlSwliVQ      egen_001_patient_case_number          case_number           
#>  2 OTGOtWBz39J      egen_004_date_of_consultation_admiss… date_of_consultation_…
#>  3 wnmMr2V3T3u      egen_006_patient_origin               patient_origin        
#>  4 sbgqjeVwtb8      egen_008_age_years                    age_years             
#>  5 eXYhovYyl61      egen_009_age_months                   age_months            
#>  6 UrYJSk2Wp46      egen_010_age_days                     age_days              
#>  7 D1Ky5K7pFN6      egen_011_sex                          sex                   
#>  8 dTm5R53YYXC      egen_012_pregnancy_status             pregnant              
#>  9 FF7d81Zy0yQ      egen_013_pregnancy_trimester          trimester             
#> 10 vLAmA6Pmjip      egen_014_pregnant_foetus_alive_at_ad… foetus_alive_at_admis…
#> # ℹ 35 more rows
#> # ℹ 5 more variables: data_element_description <chr>,
#> #   data_element_valuetype <chr>, data_element_formname <chr>,
#> #   used_optionset_uid <chr>, options <list>
msf_dict("Measles")
#> # A tibble: 52 × 8
#>    data_element_uid data_element_name                     data_element_shortname
#>    <chr>            <chr>                                 <chr>                 
#>  1 DE_EGEN_001      egen_001_patient_case_number          case_number           
#>  2 DE_EGEN_004      egen_004_date_of_consultation_admiss… date_of_consultation_…
#>  3 DE_EGEN_005      egen_005_patient_facility_type        patient_facility_type 
#>  4 DE_EGEN_006      egen_006_patient_origin               patient_origin        
#>  5 DE_EGEN_008      egen_008_age_years                    age_years             
#>  6 DE_EGEN_009      egen_009_age_months                   age_months            
#>  7 DE_EGEN_010      egen_010_age_days                     age_days              
#>  8 DE_EGEN_011      egen_011_sex                          sex                   
#>  9 DE_EGEN_012      egen_012_pregnancy_status             pregnant              
#> 10 DE_EGEN_013      egen_013_pregnancy_trimester          trimester             
#> # ℹ 42 more rows
#> # ℹ 5 more variables: data_element_description <chr>,
#> #   data_element_valuetype <chr>, data_element_formname <chr>,
#> #   used_optionset_uid <chr>, options <list>
msf_dict("Meningitis")
#> # A tibble: 53 × 8
#>    data_element_uid data_element_name                     data_element_shortname
#>    <chr>            <chr>                                 <chr>                 
#>  1 AafTlSwliVQ      egen_001_patient_case_number          case_number           
#>  2 OTGOtWBz39J      egen_004_date_of_consultation_admiss… date_of_consultation_…
#>  3 udXAcFEE1dl      egen_005_patient_facility_type        patient_facility_type 
#>  4 wnmMr2V3T3u      egen_006_patient_origin               patient_origin        
#>  5 sbgqjeVwtb8      egen_008_age_years                    age_years             
#>  6 eXYhovYyl61      egen_009_age_months                   age_months            
#>  7 UrYJSk2Wp46      egen_010_age_days                     age_days              
#>  8 D1Ky5K7pFN6      egen_011_sex                          sex                   
#>  9 ADfNqpCL5kf      egen_015_exit_status                  exit_status           
#> 10 JZ8yqTow79G      egen_016_date_of_exit                 date_of_exit          
#> # ℹ 43 more rows
#> # ℹ 5 more variables: data_element_description <chr>,
#> #   data_element_valuetype <chr>, data_element_formname <chr>,
#> #   used_optionset_uid <chr>, options <list>
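
To make the compact layout concrete, here is a minimal base R sketch of a dictionary with one row per variable and a list column of options. The toy_dict object and all of its values are invented for illustration and are not taken from {epidict}; the inner layout of each options entry (option_code, option_name) is an assumption modelled on the column names of the long-format dictionary.

```r
# Toy compact dictionary (hypothetical values, not real {epidict} content)
toy_dict <- data.frame(
  data_element_shortname = c("case_number", "sex", "exit_status"),
  data_element_valuetype = c("TEXT", "TEXT", "TEXT"),
  stringsAsFactors = FALSE
)

# List column: one small data frame of options per variable;
# free-text variables get an empty data frame (assumed layout)
toy_dict$options <- list(
  data.frame(option_code = character(0), option_name = character(0),
             stringsAsFactors = FALSE),
  data.frame(option_code = c("M", "F"), option_name = c("Male", "Female"),
             stringsAsFactors = FALSE),
  data.frame(option_code = c("DH", "DD"), option_name = c("Discharged", "Dead"),
             stringsAsFactors = FALSE)
)

# Look up the options recorded for a single variable
sex_row     <- which(toy_dict$data_element_shortname == "sex")
sex_options <- toy_dict$options[[sex_row]]
print(sex_options)

# Count the options per variable (0 means the variable is not categorical)
n_options <- vapply(toy_dict$options, nrow, integer(1))
names(n_options) <- toy_dict$data_element_shortname
print(n_options)
```

With a real dictionary, the same indexing pattern should apply to the options column, although the exact contents of each entry may differ from this toy version.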

In addition, there are four MSF survey dictionaries available:

  • Retrospective mortality and access to care (“Mortality”)
  • Malnutrition (“Nutrition”)
  • Vaccination coverage long form (“vaccination_long”)
  • Vaccination coverage short form (“vaccination_short”)

You can read more about the survey dictionaries at https://r4epis.netlify.com/surveys

These are accessible via msf_dict_survey(), where the variable names are stored in the name column. You can also read in your own Kobo (ODK) dictionaries by specifying template = FALSE and then setting name = <path to your .xlsx>.

msf_dict_survey("Mortality")
#> # A tibble: 174 × 15
#>    type     name  short_name label_english label_french hint_english hint_french
#>    <chr>    <chr> <chr>      <chr>         <chr>        <chr>        <chr>      
#>  1 start    start start      Start Time    <NA>         <NA>         <NA>       
#>  2 end      end   end        End Time      <NA>         <NA>         <NA>       
#>  3 today    today today      Date of Surv… <NA>         <NA>         <NA>       
#>  4 deviceid devi… deviceid   Phone Serial… <NA>         <NA>         <NA>       
#>  5 date     date  Date of c… Date          Date         <NA>         <NA>       
#>  6 integer  team… Team numb… Team number   Numéro d'éq… <NA>         <NA>       
#>  7 village  vill… Village  … Village name  Nom du vill… <NA>         <NA>       
#>  8 text     vill… Other vil… Specify other Autre, spéc… <NA>         <NA>       
#>  9 integer  clus… Cluster n… Cluster numb… Numéro de l… <NA>         <NA>       
#> 10 integer  hous… Household… Household nu… Numéro du m… <NA>         <NA>       
#> # ℹ 164 more rows
#> # ℹ 8 more variables: default <chr>, relevant <chr>, appearance <chr>,
#> #   constraint <chr>, repeat_count <chr>, calculation <chr>, value_type <chr>,
#> #   options <list>
msf_dict_survey("Nutrition")
#> # A tibble: 27 × 15
#>    type     name  short_name label_english label_french hint_english hint_french
#>    <chr>    <chr> <chr>      <chr>         <chr>        <chr>        <chr>      
#>  1 start    start <NA>       Start Time    <NA>         <NA>         <NA>       
#>  2 end      end   <NA>       End Time      <NA>         <NA>         <NA>       
#>  3 today    today <NA>       Date of Surv… <NA>         <NA>         <NA>       
#>  4 deviceid devi… <NA>       Phone Serial… <NA>         <NA>         <NA>       
#>  5 date     date  Date       Date          <NA>         <NA>         <NA>       
#>  6 integer  team… Team numb… Team number   <NA>         <NA>         <NA>       
#>  7 village  vill… Village n… Village name  Nom du vill… <NA>         <NA>       
#>  8 text     vill… Other vil… Specify other Précisez au… <NA>         <NA>       
#>  9 geopoint vill… Village l… Village loca… Localisatio… <NA>         <NA>       
#> 10 integer  clus… Cluster n… Cluster numb… Numéro de g… <NA>         <NA>       
#> # ℹ 17 more rows
#> # ℹ 8 more variables: repeat_count <chr>, relevant <chr>, calculation <lgl>,
#> #   constraint <chr>, appearance <chr>, default <chr>, value_type <chr>,
#> #   options <list>
msf_dict_survey("Vaccination_long")
#> # A tibble: 106 × 15
#>    type     name  short_name label_english label_french hint_english hint_french
#>    <chr>    <chr> <chr>      <chr>         <chr>        <chr>        <chr>      
#>  1 start    start <NA>       Start Time    Start Time   <NA>         <NA>       
#>  2 end      end   <NA>       End Time      End Time     <NA>         <NA>       
#>  3 today    today <NA>       Date of Surv… Date of Sur… <NA>         <NA>       
#>  4 deviceid devi… <NA>       Phone Serial… Phone Seria… <NA>         <NA>       
#>  5 date     date  Date       Date          Date         <NA>         <NA>       
#>  6 integer  team… Team numb… Team number   Numéro de l… <NA>         <NA>       
#>  7 village  vill… Village n… Village name  Nom du vill… <NA>         <NA>       
#>  8 text     vill… Other vil… Specify other Veuillez sp… <NA>         <NA>       
#>  9 integer  clus… Cluster n… Cluster numb… Numéro de l… <NA>         <NA>       
#> 10 integer  hous… Household… Household nu… Numéro du m… <NA>         <NA>       
#> # ℹ 96 more rows
#> # ℹ 8 more variables: default <chr>, relevant <chr>, appearance <chr>,
#> #   repeat_count <chr>, constraint <chr>, calculation <chr>, value_type <chr>,
#> #   options <list>
msf_dict_survey("Vaccination_short")
#> # A tibble: 38 × 16
#>    type     name  short_name label_english label_french hint_english hint_french
#>    <chr>    <chr> <chr>      <chr>         <chr>        <chr>        <chr>      
#>  1 start    start <NA>       Start Time    Start Time   <NA>         <NA>       
#>  2 end      end   <NA>       End Time      End Time     <NA>         <NA>       
#>  3 today    today <NA>       Date of Surv… Date of Sur… <NA>         <NA>       
#>  4 deviceid devi… <NA>       Phone Serial… Phone Seria… <NA>         <NA>       
#>  5 date     date  Date       Date          Date         <NA>         <NA>       
#>  6 integer  team… Team numb… Team number   Numéro de l… <NA>         <NA>       
#>  7 village  vill… Village n… Village name  Nom du vill… <NA>         <NA>       
#>  8 text     vill… Other vil… Specify other Veuillez sp… <NA>         <NA>       
#>  9 integer  clus… Cluster n… Cluster numb… Numéro de l… <NA>         <NA>       
#> 10 integer  hous… Household… Household nu… Numéro du m… <NA>         <NA>       
#> # ℹ 28 more rows
#> # ℹ 9 more variables: default <chr>, relevant <chr>, appearance <chr>,
#> #   repeat_count <chr>, constraint <chr>, calculation <chr>, hxl <chr>,
#> #   value_type <chr>, options <list>
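
Reading your own Kobo (ODK) dictionary might look like the sketch below. The helper read_kobo_dict() and the file name "my_kobo_form.xlsx" are hypothetical; only the name and template arguments come from the description above. Whether msf_dict_survey() needs further arguments in this mode is not stated here, so check ?msf_dict_survey before relying on it.

```r
# Hypothetical helper: read a custom Kobo (ODK) dictionary if possible
read_kobo_dict <- function(path) {
  # Return NULL instead of failing when {epidict} or the file is missing
  if (!requireNamespace("epidict", quietly = TRUE) || !file.exists(path)) {
    return(NULL)
  }
  # `name` and `template` are the two arguments described in the text;
  # calling with only these two is an assumption
  epidict::msf_dict_survey(name = path, template = FALSE)
}

# "my_kobo_form.xlsx" is a placeholder path, not a file shipped with {epidict}
my_dict <- read_kobo_dict("my_kobo_form.xlsx")
if (is.null(my_dict)) {
  message("No dictionary read: check the path and that {epidict} is installed.")
}
```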

Generating data

The {epidict} package provides gen_data() for generating example data from a dictionary. Its main arguments are the dictionary, the dictionary column that holds the variable names (varnames), and the number of rows needed in the output (numcases). The examples here also pass org = "MSF".

gen_data("Measles", varnames = "data_element_shortname", numcases = 100, org = "MSF")
#> # A tibble: 100 × 52
#>    case_number date_of_consultation_admis…¹ patient_facility_type patient_origin
#>    <chr>       <date>                       <fct>                 <chr>         
#>  1 A1          2018-04-17                   OP                    Village A     
#>  2 A2          2018-01-27                   OP                    Village B     
#>  3 A3          2018-04-03                   IP                    Village D     
#>  4 A4          2018-01-08                   OP                    Village B     
#>  5 A5          2018-04-28                   OP                    Village C     
#>  6 A6          2018-01-14                   OP                    Village B     
#>  7 A7          2018-03-08                   IP                    Village D     
#>  8 A8          2018-04-25                   OP                    Village B     
#>  9 A9          2018-03-13                   IP                    Village D     
#> 10 A10         2018-02-18                   OP                    Village D     
#> # ℹ 90 more rows
#> # ℹ abbreviated name: ¹​date_of_consultation_admission
#> # ℹ 48 more variables: age_years <int>, age_months <int>, age_days <int>,
#> #   sex <fct>, pregnant <fct>, trimester <fct>,
#> #   foetus_alive_at_admission <fct>, exit_status <fct>, date_of_exit <date>,
#> #   time_to_death <fct>, pregnancy_outcome_at_exit <fct>,
#> #   baby_born_with_complications <fct>, previously_vaccinated <fct>, …
gen_data("Vaccination_long", varnames = "name", numcases = 100, org = "MSF")
#> # A tibble: 100 × 120
#>    start end   today deviceid date       team_number village_name village_other
#>    <lgl> <lgl> <lgl> <lgl>    <date>     <lgl>       <fct>        <lgl>        
#>  1 NA    NA    NA    NA       2018-01-04 NA          village_10   NA           
#>  2 NA    NA    NA    NA       2018-01-23 NA          village_8    NA           
#>  3 NA    NA    NA    NA       2018-03-16 NA          village_9    NA           
#>  4 NA    NA    NA    NA       2018-03-29 NA          village_1    NA           
#>  5 NA    NA    NA    NA       2018-01-10 NA          village_9    NA           
#>  6 NA    NA    NA    NA       2018-01-29 NA          other        NA           
#>  7 NA    NA    NA    NA       2018-01-24 NA          village_4    NA           
#>  8 NA    NA    NA    NA       2018-04-04 NA          village_10   NA           
#>  9 NA    NA    NA    NA       2018-04-05 NA          village_9    NA           
#> 10 NA    NA    NA    NA       2018-02-06 NA          village_5    NA           
#> # ℹ 90 more rows
#> # ℹ 112 more variables: cluster_number <dbl>, household_number <int>,
#> #   households_building <int>, random_hh <int>, consent <chr>,
#> #   no_consent_reason <fct>, no_consent_other <lgl>, caretaker_relation <fct>,
#> #   caretaker_other <lgl>, number_children <dbl>, child_number <chr>,
#> #   sex <fct>, date_birth <date>, age_years <int>, age_months <int>,
#> #   any_vaccine <fct>, vaccine_card <fct>, hf_records <fct>, …
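
The idea of generating data from a dictionary can be illustrated in base R: for each categorical variable, draw values at random from the options the dictionary allows. This is only a sketch of the concept, not how gen_data() is implemented; toy_long, sketch_gen_data() and every value in them are hypothetical.

```r
set.seed(42)  # make the random draws reproducible

# Toy long-format dictionary: one row per variable/option pair (hypothetical)
toy_long <- data.frame(
  data_element_shortname = c("sex", "sex",
                             "exit_status", "exit_status", "exit_status"),
  option_code            = c("M", "F", "DH", "DD", "TR"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sample `numcases` values per variable from its codes
sketch_gen_data <- function(dict, varnames, numcases) {
  # Group the allowed codes by the column that names the variables
  codes_by_var <- split(dict$option_code, dict[[varnames]])
  out <- lapply(codes_by_var, function(codes) {
    factor(sample(codes, numcases, replace = TRUE), levels = codes)
  })
  as.data.frame(out)
}

toy_data <- sketch_gen_data(toy_long,
  varnames = "data_element_shortname",
  numcases = 10
)
str(toy_data)
```

Calling set.seed() before generating data is a general R habit for making random output repeatable between runs.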

Cleaning data with the dictionaries

You can use the dictionaries to clean the data via the {matchmaker} package:

library("matchmaker")
library("dplyr")

dat <- gen_data(dictionary = "Cholera", 
  varnames = "data_element_shortname",
  numcases = 20,
  org = "MSF"
)
print(dat)
#> # A tibble: 20 × 45
#>    case_number date_of_consultation_admiss…¹ patient_origin age_years age_months
#>    <chr>       <date>                        <chr>              <int>      <int>
#>  1 A1          2018-04-14                    Village A             42         NA
#>  2 A2          2018-02-23                    Village D             26         NA
#>  3 A3          2018-04-29                    Village A             27         NA
#>  4 A4          2018-02-10                    Village C             50         NA
#>  5 A5          2018-03-13                    Village D             16         NA
#>  6 A6          2018-02-10                    Village D             83         NA
#>  7 A7          2018-03-16                    Village A              7         NA
#>  8 A8          2018-04-14                    Village B             25         NA
#>  9 A9          2018-02-15                    Village A             36         NA
#> 10 A10         2018-02-06                    Village B              7         NA
#> 11 A11         2018-01-28                    Village A             12         NA
#> 12 A12         2018-03-21                    Village C             54         NA
#> 13 A13         2018-01-15                    Village B             43         NA
#> 14 A14         2018-04-20                    Village C              9         NA
#> 15 A15         2018-04-21                    Village B             55         NA
#> 16 A16         2018-01-19                    Village B             NA         NA
#> 17 A17         2018-04-09                    Village C             22         NA
#> 18 A18         2018-02-02                    Village D             63         NA
#> 19 A19         2018-04-19                    Village D              7         NA
#> 20 A20         2018-02-02                    Village B             21         NA
#> # ℹ abbreviated name: ¹​date_of_consultation_admission
#> # ℹ 40 more variables: age_days <int>, sex <fct>, pregnant <fct>,
#> #   trimester <fct>, foetus_alive_at_admission <fct>, exit_status <fct>,
#> #   date_of_exit <date>, time_to_death <fct>, pregnancy_outcome_at_exit <fct>,
#> #   previously_vaccinated <fct>, previous_vaccine_doses_received <fct>,
#> #   readmission <fct>, msf_involvement <fct>,
#> #   cholera_treatment_facility_type <fct>, residential_status_brief <fct>, …

# We want the expanded dictionary, so we will select `compact = FALSE`
dict <- msf_dict(disease = "Cholera", 
  long    = TRUE,
  compact = FALSE,
  tibble  = TRUE
)
print(dict)
#> # A tibble: 182 × 11
#>    data_element_uid data_element_name                     data_element_shortname
#>    <chr>            <chr>                                 <chr>                 
#>  1 AafTlSwliVQ      egen_001_patient_case_number          case_number           
#>  2 OTGOtWBz39J      egen_004_date_of_consultation_admiss… date_of_consultation_…
#>  3 wnmMr2V3T3u      egen_006_patient_origin               patient_origin        
#>  4 sbgqjeVwtb8      egen_008_age_years                    age_years             
#>  5 eXYhovYyl61      egen_009_age_months                   age_months            
#>  6 UrYJSk2Wp46      egen_010_age_days                     age_days              
#>  7 D1Ky5K7pFN6      egen_011_sex                          sex                   
#>  8 D1Ky5K7pFN6      egen_011_sex                          sex                   
#>  9 D1Ky5K7pFN6      egen_011_sex                          sex                   
#> 10 dTm5R53YYXC      egen_012_pregnancy_status             pregnant              
#> # ℹ 172 more rows
#> # ℹ 8 more variables: data_element_description <chr>,
#> #   data_element_valuetype <chr>, data_element_formname <chr>,
#> #   used_optionset_uid <chr>, option_code <chr>, option_name <chr>,
#> #   option_uid <chr>, option_order_in_set <dbl>

# Now we can use matchmaker to clean the data by matching it against the dictionary
dat_clean <- matchmaker::match_df(dat, dict, 
  from  = "option_code",
  to    = "option_name",
  by    = "data_element_shortname",
  order = "option_order_in_set"
)
print(dat_clean)
#> # A tibble: 20 × 45
#>    case_number date_of_consultation_admiss…¹ patient_origin age_years age_months
#>    <chr>       <date>                        <chr>              <int>      <int>
#>  1 A1          2018-04-14                    Village A             42         NA
#>  2 A2          2018-02-23                    Village D             26         NA
#>  3 A3          2018-04-29                    Village A             27         NA
#>  4 A4          2018-02-10                    Village C             50         NA
#>  5 A5          2018-03-13                    Village D             16         NA
#>  6 A6          2018-02-10                    Village D             83         NA
#>  7 A7          2018-03-16                    Village A              7         NA
#>  8 A8          2018-04-14                    Village B             25         NA
#>  9 A9          2018-02-15                    Village A             36         NA
#> 10 A10         2018-02-06                    Village B              7         NA
#> 11 A11         2018-01-28                    Village A             12         NA
#> 12 A12         2018-03-21                    Village C             54         NA
#> 13 A13         2018-01-15                    Village B             43         NA
#> 14 A14         2018-04-20                    Village C              9         NA
#> 15 A15         2018-04-21                    Village B             55         NA
#> 16 A16         2018-01-19                    Village B             NA         NA
#> 17 A17         2018-04-09                    Village C             22         NA
#> 18 A18         2018-02-02                    Village D             63         NA
#> 19 A19         2018-04-19                    Village D              7         NA
#> 20 A20         2018-02-02                    Village B             21         NA
#> # ℹ abbreviated name: ¹​date_of_consultation_admission
#> # ℹ 40 more variables: age_days <int>, sex <fct>, pregnant <fct>,
#> #   trimester <fct>, foetus_alive_at_admission <fct>, exit_status <fct>,
#> #   date_of_exit <date>, time_to_death <fct>, pregnancy_outcome_at_exit <fct>,
#> #   previously_vaccinated <fct>, previous_vaccine_doses_received <fct>,
#> #   readmission <fct>, msf_involvement <fct>,
#> #   cholera_treatment_facility_type <fct>, residential_status_brief <fct>, …
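
To show what dictionary-based cleaning does in principle, here is a small base R sketch that recodes option codes to option names and orders the factor levels by the dictionary. It is not the {matchmaker} implementation; sketch_match_df(), its argument names (chosen to echo from, to, by and order in the call above) and all toy values are hypothetical.

```r
# Toy data with coded values (hypothetical, not produced by gen_data())
toy_data <- data.frame(
  case_number = c("A1", "A2", "A3", "A4"),
  sex         = c("M", "F", "F", "M"),
  exit_status = c("DD", "DH", "DH", "TR"),
  stringsAsFactors = FALSE
)

# Toy long-format dictionary (hypothetical values)
toy_long <- data.frame(
  data_element_shortname = c("sex", "sex",
                             "exit_status", "exit_status", "exit_status"),
  option_code            = c("M", "F", "DH", "DD", "TR"),
  option_name            = c("Male", "Female",
                             "Discharged home", "Dead", "Transferred"),
  option_order_in_set    = c(1, 2, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: recode every column of `x` that the dictionary describes
sketch_match_df <- function(x, dictionary, from_col, to_col, by_col, order_col) {
  for (variable in intersect(names(x), unique(dictionary[[by_col]]))) {
    rules <- dictionary[dictionary[[by_col]] == variable, ]
    rules <- rules[order(rules[[order_col]]), ]
    # Values without a dictionary entry become NA in this sketch
    labels <- rules[[to_col]][match(x[[variable]], rules[[from_col]])]
    x[[variable]] <- factor(labels, levels = rules[[to_col]])
  }
  x
}

toy_clean <- sketch_match_df(toy_data, toy_long,
  from_col  = "option_code",
  to_col    = "option_name",
  by_col    = "data_element_shortname",
  order_col = "option_order_in_set"
)
print(toy_clean)
```

Columns that the dictionary does not mention, such as case_number here, are left untouched.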