The goal of {epidict} is to provide standardized data dictionaries for use in epidemiological data analysis templates. Currently it supports standardized dictionaries from MSF OCA. This is a product of the R4EPIs project; learn more at https://r4epis.netlify.com
You can install {epidict} from CRAN:
If there is a bugfix or feature that is not yet on CRAN, you can install it via the {drat} package:
You can also install the in-development version from GitHub using the {remotes} package (but there’s no guarantee that it will be stable):
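The commands for the three installation routes above can be sketched as follows. The CRAN call is the standard install.packages() usage. The drat repository name ("R4EPI") and the GitHub path ("R4EPI/epidict") are assumptions based on the project name and should be checked against the project page before use.

```r
# Route 1: released version from CRAN
install.packages("epidict")

# Route 2: newer builds via {drat}
# ASSUMPTION: the drat repository is named "R4EPI"
# install.packages("drat")
drat::addRepo("R4EPI")
install.packages("epidict")

# Route 3: in-development version from GitHub (may be unstable)
# ASSUMPTION: the source repository is R4EPI/epidict
# install.packages("remotes")
remotes::install_github("R4EPI/epidict")
```

Only one of the three routes is needed; running a later one replaces the version installed by an earlier one.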
There are four MSF outbreak dictionaries available in {epidict}, based on DHIS2 exports: AJS, Cholera, Measles, and Meningitis.
You can read more about the outbreak dictionaries at https://r4epis.netlify.com/outbreaks
Each dictionary can be obtained via the msf_dict() function, which returns a table with the recorded variables (data_element_shortname) in rows and their possible options (if categorical):
library("epidict")
msf_dict("AJS")
#> # A tibble: 68 × 8
#> data_element_uid data_element_name data_element_shortname
#> <lgl> <chr> <chr>
#> 1 NA egen_044_event_file_type event_file_type
#> 2 NA egen_001_patient_case_number case_number
#> 3 NA egen_004_date_of_consultation_admiss… date_of_consultation_…
#> 4 NA egen_022_detected_by detected_by
#> 5 NA egen_005_patient_facility_type patient_facility_type
#> 6 NA egen_029_msf_involvement msf_involvement
#> 7 NA egen_008_age_years age_years
#> 8 NA egen_009_age_months age_months
#> 9 NA egen_010_age_days age_days
#> 10 NA egen_011_sex sex
#> # ℹ 58 more rows
#> # ℹ 5 more variables: data_element_description <chr>,
#> # data_element_valuetype <chr>, data_element_formname <chr>,
#> # used_optionset_uid <chr>, options <list>
msf_dict("Cholera")
#> # A tibble: 45 × 8
#> data_element_uid data_element_name data_element_shortname
#> <chr> <chr> <chr>
#> 1 AafTlSwliVQ egen_001_patient_case_number case_number
#> 2 OTGOtWBz39J egen_004_date_of_consultation_admiss… date_of_consultation_…
#> 3 wnmMr2V3T3u egen_006_patient_origin patient_origin
#> 4 sbgqjeVwtb8 egen_008_age_years age_years
#> 5 eXYhovYyl61 egen_009_age_months age_months
#> 6 UrYJSk2Wp46 egen_010_age_days age_days
#> 7 D1Ky5K7pFN6 egen_011_sex sex
#> 8 dTm5R53YYXC egen_012_pregnancy_status pregnant
#> 9 FF7d81Zy0yQ egen_013_pregnancy_trimester trimester
#> 10 vLAmA6Pmjip egen_014_pregnant_foetus_alive_at_ad… foetus_alive_at_admis…
#> # ℹ 35 more rows
#> # ℹ 5 more variables: data_element_description <chr>,
#> # data_element_valuetype <chr>, data_element_formname <chr>,
#> # used_optionset_uid <chr>, options <list>
msf_dict("Measles")
#> # A tibble: 52 × 8
#> data_element_uid data_element_name data_element_shortname
#> <chr> <chr> <chr>
#> 1 DE_EGEN_001 egen_001_patient_case_number case_number
#> 2 DE_EGEN_004 egen_004_date_of_consultation_admiss… date_of_consultation_…
#> 3 DE_EGEN_005 egen_005_patient_facility_type patient_facility_type
#> 4 DE_EGEN_006 egen_006_patient_origin patient_origin
#> 5 DE_EGEN_008 egen_008_age_years age_years
#> 6 DE_EGEN_009 egen_009_age_months age_months
#> 7 DE_EGEN_010 egen_010_age_days age_days
#> 8 DE_EGEN_011 egen_011_sex sex
#> 9 DE_EGEN_012 egen_012_pregnancy_status pregnant
#> 10 DE_EGEN_013 egen_013_pregnancy_trimester trimester
#> # ℹ 42 more rows
#> # ℹ 5 more variables: data_element_description <chr>,
#> # data_element_valuetype <chr>, data_element_formname <chr>,
#> # used_optionset_uid <chr>, options <list>
msf_dict("Meningitis")
#> # A tibble: 53 × 8
#> data_element_uid data_element_name data_element_shortname
#> <chr> <chr> <chr>
#> 1 AafTlSwliVQ egen_001_patient_case_number case_number
#> 2 OTGOtWBz39J egen_004_date_of_consultation_admiss… date_of_consultation_…
#> 3 udXAcFEE1dl egen_005_patient_facility_type patient_facility_type
#> 4 wnmMr2V3T3u egen_006_patient_origin patient_origin
#> 5 sbgqjeVwtb8 egen_008_age_years age_years
#> 6 eXYhovYyl61 egen_009_age_months age_months
#> 7 UrYJSk2Wp46 egen_010_age_days age_days
#> 8 D1Ky5K7pFN6 egen_011_sex sex
#> 9 ADfNqpCL5kf egen_015_exit_status exit_status
#> 10 JZ8yqTow79G egen_016_date_of_exit date_of_exit
#> # ℹ 43 more rows
#> # ℹ 5 more variables: data_element_description <chr>,
#> # data_element_valuetype <chr>, data_element_formname <chr>,
#> # used_optionset_uid <chr>, options <list>
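The compact dictionaries printed above keep the categorical options in a list column named options. A minimal sketch of pulling out the options for one variable follows; it assumes only the column names visible in the printed output, and it makes no claim about the internal layout of each options element.

```r
library("epidict")

# Compact Measles dictionary: one row per recorded variable
dict <- msf_dict("Measles")

# Select the row that describes the `sex` variable
sex_row <- dict[dict$data_element_shortname == "sex", ]

# `options` is a list column, so extract its first element
# to inspect the options recorded for `sex`
sex_options <- sex_row$options[[1]]
print(sex_options)
```

Base subsetting is used here so the sketch needs no packages beyond {epidict}.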
In addition, there are four MSF survey dictionaries available: Mortality, Nutrition, Vaccination_long, and Vaccination_short.
You can read more about the survey dictionaries at https://r4epis.netlify.com/surveys
These are accessible via msf_dict_survey(), where the variables are in the name column. You can also read in your own Kobo (ODK) dictionaries by specifying template = FALSE and then setting name = <path to your .xlsx>.
msf_dict_survey("Mortality")
#> # A tibble: 174 × 15
#> type name short_name label_english label_french hint_english hint_french
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 start start start Start Time <NA> <NA> <NA>
#> 2 end end end End Time <NA> <NA> <NA>
#> 3 today today today Date of Surv… <NA> <NA> <NA>
#> 4 deviceid devi… deviceid Phone Serial… <NA> <NA> <NA>
#> 5 date date Date of c… Date Date <NA> <NA>
#> 6 integer team… Team numb… Team number Numéro d'éq… <NA> <NA>
#> 7 village vill… Village … Village name Nom du vill… <NA> <NA>
#> 8 text vill… Other vil… Specify other Autre, spéc… <NA> <NA>
#> 9 integer clus… Cluster n… Cluster numb… Numéro de l… <NA> <NA>
#> 10 integer hous… Household… Household nu… Numéro du m… <NA> <NA>
#> # ℹ 164 more rows
#> # ℹ 8 more variables: default <chr>, relevant <chr>, appearance <chr>,
#> # constraint <chr>, repeat_count <chr>, calculation <chr>, value_type <chr>,
#> # options <list>
msf_dict_survey("Nutrition")
#> # A tibble: 27 × 15
#> type name short_name label_english label_french hint_english hint_french
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 start start <NA> Start Time <NA> <NA> <NA>
#> 2 end end <NA> End Time <NA> <NA> <NA>
#> 3 today today <NA> Date of Surv… <NA> <NA> <NA>
#> 4 deviceid devi… <NA> Phone Serial… <NA> <NA> <NA>
#> 5 date date Date Date <NA> <NA> <NA>
#> 6 integer team… Team numb… Team number <NA> <NA> <NA>
#> 7 village vill… Village n… Village name Nom du vill… <NA> <NA>
#> 8 text vill… Other vil… Specify other Précisez au… <NA> <NA>
#> 9 geopoint vill… Village l… Village loca… Localisatio… <NA> <NA>
#> 10 integer clus… Cluster n… Cluster numb… Numéro de g… <NA> <NA>
#> # ℹ 17 more rows
#> # ℹ 8 more variables: repeat_count <chr>, relevant <chr>, calculation <lgl>,
#> # constraint <chr>, appearance <chr>, default <chr>, value_type <chr>,
#> # options <list>
msf_dict_survey("Vaccination_long")
#> # A tibble: 106 × 15
#> type name short_name label_english label_french hint_english hint_french
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 start start <NA> Start Time Start Time <NA> <NA>
#> 2 end end <NA> End Time End Time <NA> <NA>
#> 3 today today <NA> Date of Surv… Date of Sur… <NA> <NA>
#> 4 deviceid devi… <NA> Phone Serial… Phone Seria… <NA> <NA>
#> 5 date date Date Date Date <NA> <NA>
#> 6 integer team… Team numb… Team number Numéro de l… <NA> <NA>
#> 7 village vill… Village n… Village name Nom du vill… <NA> <NA>
#> 8 text vill… Other vil… Specify other Veuillez sp… <NA> <NA>
#> 9 integer clus… Cluster n… Cluster numb… Numéro de l… <NA> <NA>
#> 10 integer hous… Household… Household nu… Numéro du m… <NA> <NA>
#> # ℹ 96 more rows
#> # ℹ 8 more variables: default <chr>, relevant <chr>, appearance <chr>,
#> # repeat_count <chr>, constraint <chr>, calculation <chr>, value_type <chr>,
#> # options <list>
msf_dict_survey("Vaccination_short")
#> # A tibble: 38 × 16
#> type name short_name label_english label_french hint_english hint_french
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 start start <NA> Start Time Start Time <NA> <NA>
#> 2 end end <NA> End Time End Time <NA> <NA>
#> 3 today today <NA> Date of Surv… Date of Sur… <NA> <NA>
#> 4 deviceid devi… <NA> Phone Serial… Phone Seria… <NA> <NA>
#> 5 date date Date Date Date <NA> <NA>
#> 6 integer team… Team numb… Team number Numéro de l… <NA> <NA>
#> 7 village vill… Village n… Village name Nom du vill… <NA> <NA>
#> 8 text vill… Other vil… Specify other Veuillez sp… <NA> <NA>
#> 9 integer clus… Cluster n… Cluster numb… Numéro de l… <NA> <NA>
#> 10 integer hous… Household… Household nu… Numéro du m… <NA> <NA>
#> # ℹ 28 more rows
#> # ℹ 9 more variables: default <chr>, relevant <chr>, appearance <chr>,
#> # repeat_count <chr>, constraint <chr>, calculation <chr>, hxl <chr>,
#> # value_type <chr>, options <list>
The {epidict} package has a function for generating data called gen_data(). It takes three main arguments: the dictionary, the column that describes the variable names, and the number of rows needed in the output. The calls below also pass org = "MSF".
gen_data("Measles", varnames = "data_element_shortname", numcases = 100, org = "MSF")
#> # A tibble: 100 × 52
#> case_number date_of_consultation_admis…¹ patient_facility_type patient_origin
#> <chr> <date> <fct> <chr>
#> 1 A1 2018-01-20 OP Village C
#> 2 A2 2018-01-02 IP Village A
#> 3 A3 2018-04-13 OP Village A
#> 4 A4 2018-01-30 IP Village B
#> 5 A5 2018-01-09 IP Village A
#> 6 A6 2018-01-03 IP Village C
#> 7 A7 2018-03-26 OP Village C
#> 8 A8 2018-02-25 OP Village A
#> 9 A9 2018-01-20 IP Village C
#> 10 A10 2018-04-29 IP Village C
#> # ℹ 90 more rows
#> # ℹ abbreviated name: ¹date_of_consultation_admission
#> # ℹ 48 more variables: age_years <int>, age_months <int>, age_days <int>,
#> # sex <fct>, pregnant <fct>, trimester <fct>,
#> # foetus_alive_at_admission <fct>, exit_status <fct>, date_of_exit <date>,
#> # time_to_death <fct>, pregnancy_outcome_at_exit <fct>,
#> # baby_born_with_complications <fct>, previously_vaccinated <fct>, …
gen_data("Vaccination_long", varnames = "name", numcases = 100, org = "MSF")
#> # A tibble: 100 × 120
#> start end today deviceid date team_number village_name village_other
#> <lgl> <lgl> <lgl> <lgl> <date> <lgl> <fct> <lgl>
#> 1 NA NA NA NA 2018-02-04 NA village_6 NA
#> 2 NA NA NA NA 2018-02-16 NA village_8 NA
#> 3 NA NA NA NA 2018-01-30 NA village_9 NA
#> 4 NA NA NA NA 2018-02-16 NA village_7 NA
#> 5 NA NA NA NA 2018-04-19 NA village_8 NA
#> 6 NA NA NA NA 2018-01-07 NA village_8 NA
#> 7 NA NA NA NA 2018-03-01 NA village_2 NA
#> 8 NA NA NA NA 2018-03-19 NA village_5 NA
#> 9 NA NA NA NA 2018-02-21 NA village_2 NA
#> 10 NA NA NA NA 2018-04-30 NA village_2 NA
#> # ℹ 90 more rows
#> # ℹ 112 more variables: cluster_number <dbl>, household_number <int>,
#> # households_building <int>, random_hh <int>, consent <chr>,
#> # no_consent_reason <fct>, no_consent_other <lgl>, caretaker_relation <fct>,
#> # caretaker_other <lgl>, number_children <dbl>, child_number <chr>,
#> # sex <fct>, date_birth <date>, age_years <int>, age_months <int>,
#> # any_vaccine <fct>, vaccine_card <fct>, hf_records <fct>, …
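Because the generated values are random, repeated calls produce different tables. A minimal sketch of making a run repeatable follows, assuming gen_data() draws from R's standard random number generator (which the output above does not confirm) so that set.seed() fixes the result. The seed value and the choice of 50 cases are arbitrary.

```r
library("epidict")

# ASSUMPTION: gen_data() uses R's global RNG, so a fixed seed
# makes the generated data repeatable between runs
set.seed(2018)

dat <- gen_data("Measles",
  varnames = "data_element_shortname",
  numcases = 50,
  org = "MSF"
)

# One row per generated case, one column per dictionary variable
dim(dat)

# Tabulate a categorical variable, including missing values if present
table(dat$sex, useNA = "ifany")
```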
You can use the dictionaries to clean the data via the {matchmaker} package:
library("matchmaker")
library("dplyr")
dat <- gen_data(
  dictionary = "Cholera",
  varnames = "data_element_shortname",
  numcases = 20,
  org = "MSF"
)
print(dat)
#> # A tibble: 20 × 45
#> case_number date_of_consultation_admiss…¹ patient_origin age_years age_months
#> <chr> <date> <chr> <int> <int>
#> 1 A1 2018-03-21 Village B 11 NA
#> 2 A2 2018-04-18 Village A 36 NA
#> 3 A3 2018-01-25 Village A 13 NA
#> 4 A4 2018-01-09 Village B 13 NA
#> 5 A5 2018-03-09 Village B NA 21
#> 6 A6 2018-02-11 Village A 66 NA
#> 7 A7 2018-01-27 Village A 62 NA
#> 8 A8 2018-01-14 Village C 59 NA
#> 9 A9 2018-03-29 Village C 67 NA
#> 10 A10 2018-02-25 Village B 75 NA
#> 11 A11 2018-02-12 Village C 18 NA
#> 12 A12 2018-04-29 Village B 16 NA
#> 13 A13 2018-02-07 Village B 56 NA
#> 14 A14 2018-03-08 Village D 38 NA
#> 15 A15 2018-02-23 Village B 41 NA
#> 16 A16 2018-01-19 Village A 4 NA
#> 17 A17 2018-02-19 Village C 45 NA
#> 18 A18 2018-01-30 Village B 11 NA
#> 19 A19 2018-03-21 Village C 24 NA
#> 20 A20 2018-01-07 Village D 36 NA
#> # ℹ abbreviated name: ¹date_of_consultation_admission
#> # ℹ 40 more variables: age_days <int>, sex <fct>, pregnant <fct>,
#> # trimester <fct>, foetus_alive_at_admission <fct>, exit_status <fct>,
#> # date_of_exit <date>, time_to_death <fct>, pregnancy_outcome_at_exit <fct>,
#> # previously_vaccinated <fct>, previous_vaccine_doses_received <fct>,
#> # readmission <fct>, msf_involvement <fct>,
#> # cholera_treatment_facility_type <fct>, residential_status_brief <fct>, …
# We want the expanded dictionary, so we will select `compact = FALSE`
dict <- msf_dict(
  disease = "Cholera",
  long = TRUE,
  compact = FALSE,
  tibble = TRUE
)
print(dict)
#> # A tibble: 182 × 11
#> data_element_uid data_element_name data_element_shortname
#> <chr> <chr> <chr>
#> 1 AafTlSwliVQ egen_001_patient_case_number case_number
#> 2 OTGOtWBz39J egen_004_date_of_consultation_admiss… date_of_consultation_…
#> 3 wnmMr2V3T3u egen_006_patient_origin patient_origin
#> 4 sbgqjeVwtb8 egen_008_age_years age_years
#> 5 eXYhovYyl61 egen_009_age_months age_months
#> 6 UrYJSk2Wp46 egen_010_age_days age_days
#> 7 D1Ky5K7pFN6 egen_011_sex sex
#> 8 D1Ky5K7pFN6 egen_011_sex sex
#> 9 D1Ky5K7pFN6 egen_011_sex sex
#> 10 dTm5R53YYXC egen_012_pregnancy_status pregnant
#> # ℹ 172 more rows
#> # ℹ 8 more variables: data_element_description <chr>,
#> # data_element_valuetype <chr>, data_element_formname <chr>,
#> # used_optionset_uid <chr>, option_code <chr>, option_name <chr>,
#> # option_uid <chr>, option_order_in_set <dbl>
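# Now we can use matchmaker to recode the data: values matching `option_code`
# are replaced by `option_name` for each variable in `data_element_shortname`
dat_clean <- matchmaker::match_df(dat, dict,
  from = "option_code",
  to = "option_name",
  by = "data_element_shortname",
  order = "option_order_in_set"
)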
print(dat_clean)
#> # A tibble: 20 × 45
#> case_number date_of_consultation_admiss…¹ patient_origin age_years age_months
#> <chr> <date> <chr> <int> <int>
#> 1 A1 2018-03-21 Village B 11 NA
#> 2 A2 2018-04-18 Village A 36 NA
#> 3 A3 2018-01-25 Village A 13 NA
#> 4 A4 2018-01-09 Village B 13 NA
#> 5 A5 2018-03-09 Village B NA 21
#> 6 A6 2018-02-11 Village A 66 NA
#> 7 A7 2018-01-27 Village A 62 NA
#> 8 A8 2018-01-14 Village C 59 NA
#> 9 A9 2018-03-29 Village C 67 NA
#> 10 A10 2018-02-25 Village B 75 NA
#> 11 A11 2018-02-12 Village C 18 NA
#> 12 A12 2018-04-29 Village B 16 NA
#> 13 A13 2018-02-07 Village B 56 NA
#> 14 A14 2018-03-08 Village D 38 NA
#> 15 A15 2018-02-23 Village B 41 NA
#> 16 A16 2018-01-19 Village A 4 NA
#> 17 A17 2018-02-19 Village C 45 NA
#> 18 A18 2018-01-30 Village B 11 NA
#> 19 A19 2018-03-21 Village C 24 NA
#> 20 A20 2018-01-07 Village D 36 NA
#> # ℹ abbreviated name: ¹date_of_consultation_admission
#> # ℹ 40 more variables: age_days <int>, sex <fct>, pregnant <fct>,
#> # trimester <fct>, foetus_alive_at_admission <fct>, exit_status <fct>,
#> # date_of_exit <date>, time_to_death <fct>, pregnancy_outcome_at_exit <fct>,
#> # previously_vaccinated <fct>, previous_vaccine_doses_received <fct>,
#> # readmission <fct>, msf_involvement <fct>,
#> # cholera_treatment_facility_type <fct>, residential_status_brief <fct>, …
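The columns printed above do not show any recoded values, so the effect of match_df() is easier to see on a small example. The sketch below uses a hand-made dictionary and data set: toy_dict, toy_dat, and all their values are hypothetical and only mimic the column names of the long MSF dictionary.

```r
library("matchmaker")

# Hypothetical long-format dictionary: one row per option of one variable
toy_dict <- data.frame(
  option_code = c("M", "F", "U"),
  option_name = c("Male", "Female", "Unknown"),
  data_element_shortname = "sex",
  option_order_in_set = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical data that stores the short codes
toy_dat <- data.frame(
  case_number = c("A1", "A2", "A3"),
  sex = c("F", "M", "F"),
  stringsAsFactors = FALSE
)

# Replace each code in `sex` with its label from the dictionary
toy_clean <- match_df(toy_dat, toy_dict,
  from = "option_code",
  to = "option_name",
  by = "data_element_shortname",
  order = "option_order_in_set"
)

print(toy_clean)
```

Columns that are not named in the by column of the dictionary, such as case_number here, are left unchanged.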