introduction.Rmd
This vignette gives a brief overview of the ctrialsgov package. To start, we load the package into R.
library(ctrialsgov)
In the next few sections, we see how to set up the data set, query it, and then visualize the output.
Before querying the ClinicalTrials.gov data, we need to load a pre-processed version of the data into R. There are three ways to do this. If you have installed a copy of the data set locally in PostgreSQL, the data can be created from scratch with the following block of code (it will take a couple of minutes to finish):
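library(DBI)
library(RPostgreSQL)

# Connect to the local copy of the database (here named "aact")
drv <- dbDriver("PostgreSQL")
con <- DBI::dbConnect(drv, dbname = "aact")

# Create the pre-processed data set from the database
ctgov_create_data(con)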
Alternatively, we can download a static version of the data from GitHub and load it into R without needing to set up a local version of the database. The download is cached locally so that it can be re-loaded without downloading each time. To download and load this data, use the following:
Finally, we can load a small sample dataset (2% of the total) that is included with the package itself using the following:
This is the version of the data that is used in most of the tests, examples, and in this vignette.
The primary function for querying the dataset is called ctgov_query. It can be called after using any of the functions in the previous section. We will see a few examples here; see the help pages for a complete list of options.
A number of fields in the data are categorical and are matched exactly. Here, for example, we find the interventional studies:
ctgov_query(study_type = "Interventional")
## # A tibble: 2,899 × 28
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04670484 2021-12-01 N/A 30 Mask Hypoxi… To Determine …
## 2 NCT04784052 2021-10-31 Phase 1/Phase 2 12 Depleted Do… TCR+ T-cell/C…
## 3 NCT04851119 2021-10-29 Phase 1/Phase 2 38 Tegavivint … A Phase 1/2 S…
## 4 NCT04538482 2021-09-30 N/A 112 DASH INterv… Determining t…
## 5 NCT04430452 2021-09-30 Phase 2 30 Hypofractio… Phase II Tria…
## 6 NCT04826393 2021-09-30 Phase 1 65 ASP8374 + C… Phase Ib Tria…
## 7 NCT04271683 2021-09-01 N/A 70 Safe Use of… Safe Use of C…
## 8 NCT03029637 2021-08-31 N/A 90 No-preparat… No-preparatio…
## 9 NCT04307433 2021-08-31 N/A 120 Storytellin… mHealth Deliv…
## 10 NCT04453709 2021-08-31 N/A 232 Family-cent… Reducing Stre…
## # … with 2,889 more rows, and 22 more variables:
## # primary_completion_date <date>, study_type <chr>, description <chr>,
## # eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## # minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>, …
Or, all of the interventional studies that have a primary industry sponsor:
ctgov_query(study_type = "Interventional", sponsor_type = "Industry")
## # A tibble: 783 × 28
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04683120 2021-07-15 N/A 100 Real-time D… Real-time Dia…
## 2 NCT04875754 2021-06-30 Phase 1/Phase 2 24 A Study Eva… A Phase 1/2a,…
## 3 NCT04821310 2021-06-04 Phase 2 110 Protonix Tr… An Explorator…
## 4 NCT04839042 2021-06-01 Phase 1 40 A Phase 1, … A Phase 1, Fi…
## 5 NCT04707768 2021-05-31 Phase 3 450 Study Evalu… A Randomized,…
## 6 NCT04855136 2021-05-04 Phase 1/Phase 2 415 Safety and … An Explorator…
## 7 NCT04732286 2021-04-30 Phase 3 100 A Study of … A Phase IIIb,…
## 8 NCT04793620 2021-04-30 Phase 1 60 Pertussis A… A Phase 1, Ra…
## 9 NCT04803253 2021-04-30 N/A 14 Study of th… Evaluating th…
## 10 NCT04786873 2021-04-30 Phase 3 100 A Research … Multicenter, …
## # … with 773 more rows, and 22 more variables: primary_completion_date <date>,
## # study_type <chr>, description <chr>, eudract_num <chr>, other_id <chr>,
## # allocation <chr>, intervention_model <chr>, observational_model <chr>,
## # primary_purpose <chr>, time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## # minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>,
## # sponsor <chr>, sponsor_type <chr>, conditions <chr>, interventions <list>
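To make the idea of exact category matching concrete, here is a minimal base R sketch on a toy table. The helper match_exact and the toy values are invented for illustration; this is not how ctgov_query is implemented, only one way to picture filtering on one category and then narrowing by a second.

```r
# Toy stand-in for the trials table; all values are invented for illustration.
trials <- data.frame(
  nct_id       = c("T1", "T2", "T3", "T4"),
  study_type   = c("Interventional", "Observational", "Interventional", "Interventional"),
  sponsor_type = c("Industry", "Industry", "NIH", "Industry"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose `field` equals one of `values` exactly.
match_exact <- function(data, field, values) {
  data[data[[field]] %in% values, , drop = FALSE]
}

# Filter on one category, then narrow the result by a second category.
interventional <- match_exact(trials, "study_type", "Interventional")
industry_interventional <- match_exact(interventional, "sponsor_type", "Industry")

print(industry_interventional$nct_id)  # "T1" "T4"
```

In this sketch, a value only matches when the whole string is equal, so a partial string such as "Intervention" would select nothing.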
A few fields have continuous values that can be searched by giving a vector with two values. The query returns the studies whose value falls between the lower bound (first value) and the upper bound (second value), with both bounds included. Here, we find the studies that have between 40 and 42 patients enrolled in them:
ctgov_query(enrollment_range = c(40, 42))
## # A tibble: 115 × 28
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04839042 2021-06-01 Phase 1 40 A Phase 1, Firs… A Phase 1, First-…
## 2 NCT04843852 2021-05-31 Phase 2 40 TLR-9 Adjuvante… Augmentation of H…
## 3 NCT04392180 2021-05-31 <NA> 42 COA-APTIC Careg… COA-APTIC Caregiv…
## 4 NCT04800991 2021-03-17 N/A 40 NuGa (Nutrition… A Pilot Study to …
## 5 NCT04646577 2020-12-01 N/A 40 The Effect of t… The Effect of Tra…
## 6 NCT03301805 2020-12-01 Phase 2 40 A Phase II, Eva… A Phase II, Open …
## 7 NCT04592224 2020-11-12 <NA> 40 Cancer Informat… Cancer Informatio…
## 8 NCT04473430 2020-11-05 N/A 40 Use of Continuo… Use of Continuous…
## 9 NCT04568369 2020-10-31 N/A 40 Treatment of Po… Functional Near I…
## 10 NCT04349644 2020-10-21 N/A 40 Enhancing Socia… Enhancing Social …
## # … with 105 more rows, and 22 more variables: primary_completion_date <date>,
## # study_type <chr>, description <chr>, eudract_num <chr>, other_id <chr>,
## # allocation <chr>, intervention_model <chr>, observational_model <chr>,
## # primary_purpose <chr>, time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## # minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>,
## # sponsor <chr>, sponsor_type <chr>, conditions <chr>, interventions <list>
Setting one end of the range to NA leaves that end unbounded. For example, the following finds any studies with 1000 or more patients:
ctgov_query(enrollment_range = c(1000, NA))
## # A tibble: 237 × 28
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04381195 2021-06-30 <NA> 1300 Adult Functioni… Developing a Gold…
## 2 NCT03928639 2021-05-17 <NA> 9999 Structural Hear… NHLBI Structural …
## 3 NCT04884048 2021-05-01 <NA> 2000 Multicentric Bo… Multicentric Stud…
## 4 NCT04104555 2021-05-01 N/A 1055 Orthotics for T… Orthotics for Tre…
## 5 NCT04727437 2021-04-30 Phase 3 1466 STOPping Antico… STOPping Anticoag…
## 6 NCT04658888 2021-04-26 N/A 1211 St. Joe's Invit… Increasing Cancer…
## 7 NCT04703790 2021-04-07 <NA> 1200 Acceptability o… Acceptability Acr…
## 8 NCT04775992 2021-03-01 <NA> 1000 Preemptive Anal… Preemptive Analge…
## 9 NCT04762862 2021-02-01 <NA> 1200 QUANTACT : Impa… QUANTACT : Impact…
## 10 NCT04682730 2021-01-12 N/A 1200 An Implementati… An Implementation…
## # … with 227 more rows, and 22 more variables: primary_completion_date <date>,
## # study_type <chr>, description <chr>, eudract_num <chr>, other_id <chr>,
## # allocation <chr>, intervention_model <chr>, observational_model <chr>,
## # primary_purpose <chr>, time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## # minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>,
## # sponsor <chr>, sponsor_type <chr>, conditions <chr>, interventions <list>
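The range behaviour seen above can be pictured with a small base R sketch. The helper in_range is hypothetical and is not taken from the package. It assumes that both bounds are inclusive, which agrees with the output shown, and that studies with a missing value are dropped, which is an assumption of this sketch only.

```r
# Hypothetical helper: inclusive range filter where an NA bound means "no limit".
in_range <- function(x, range) {
  stopifnot(length(range) == 2)
  lower_ok <- if (is.na(range[1])) rep(TRUE, length(x)) else x >= range[1]
  upper_ok <- if (is.na(range[2])) rep(TRUE, length(x)) else x <= range[2]
  # Missing values of x are always excluded (an assumption of this sketch).
  !is.na(x) & lower_ok & upper_ok
}

# Invented enrollment counts, including one missing value.
enrollment <- c(12L, 40L, 41L, 42L, 1000L, 9999L, NA)

print(enrollment[in_range(enrollment, c(40, 42))])    # 40 41 42
print(enrollment[in_range(enrollment, c(1000, NA))])  # 1000 9999
print(enrollment[in_range(enrollment, c(NA, 40))])    # 12 40
```

Treating NA as "no limit" lets the same two-element vector express a closed range, a lower bound only, or an upper bound only.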
Similarly, we can give a range of dates. These are given in the form of strings as “YYYY-MM-DD”:
ctgov_query(date_range = c("2020-01-01", "2020-02-01"))
## # A tibble: 25 × 28
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04162236 2020-02-01 <NA> 440 Cardiac Dys… Cardiac Dysfu…
## 2 NCT04036539 2020-02-01 N/A 34 Photobiomod… Evaluation of…
## 3 NCT04617158 2020-02-01 N/A 40 Comparing A… Comparison Be…
## 4 NCT03999164 2020-02-01 Phase 1 30 Imaging of … Imaging of Gl…
## 5 NCT04203927 2020-02-01 Early Phase 1 50 Effects of … Effects of Em…
## 6 NCT03880149 2020-02-01 N/A 140 The Effects… The Effects o…
## 7 NCT04263623 2020-01-31 Phase 2 75 Clinical St… A Double-Blin…
## 8 NCT04199520 2020-01-31 Phase 2 155 Compare the… Compare the E…
## 9 NCT04219735 2020-01-30 Phase 2 200 Effect of M… Effect of Min…
## 10 NCT04774809 2020-01-29 Phase 2/Phase 3 300 Assess the … A Randomized,…
## # … with 15 more rows, and 22 more variables: primary_completion_date <date>,
## # study_type <chr>, description <chr>, eudract_num <chr>, other_id <chr>,
## # allocation <chr>, intervention_model <chr>, observational_model <chr>,
## # primary_purpose <chr>, time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## # minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>,
## # sponsor <chr>, sponsor_type <chr>, conditions <chr>, interventions <list>
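As a rough illustration of how "YYYY-MM-DD" strings can act as range bounds, the base R sketch below converts them with as.Date and compares them with a vector of invented start dates. It assumes that the comparison is made against the start date and that both ends are inclusive; the output above shows that the upper end is included, and the rest is an assumption.

```r
# Invented start dates around the boundaries of the range.
start_date <- as.Date(c("2019-12-31", "2020-01-01", "2020-01-15",
                        "2020-02-01", "2020-02-02"))

# Bounds given as "YYYY-MM-DD" strings, converted to Date before comparing.
date_range <- as.Date(c("2020-01-01", "2020-02-01"))

keep <- start_date >= date_range[1] & start_date <= date_range[2]
print(format(start_date[keep]))  # "2020-01-01" "2020-01-15" "2020-02-01"
```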
Finally, we can also search free text fields using keywords. The following, for example, finds any study that includes the phrase “lung cancer” (ignoring case) in the description field:
ctgov_query(description_kw = "lung cancer")
## # A tibble: 54 × 28
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04696939 2021-01-31 Phase 2 100 Combined Ate… Clinical Study…
## 2 NCT03421678 2020-10-31 <NA> 0 Ethnic/Racia… Ethnic/Racial …
## 3 NCT04452214 2020-09-24 Phase 1 15 A Study of t… An Open-label,…
## 4 NCT04186988 2019-11-05 Early Phase 1 20 [18F]-AraG f… Imaging of T-C…
## 5 NCT03979170 2019-04-01 <NA> 50 Patient-deri… Patient-derive…
## 6 NCT03720873 2018-10-31 Phase 2 90 EGFR-TKIs Co… An Multicenter…
## 7 NCT03611738 2018-10-31 Phase 1 48 Ceritinib Pl… Phase I Study …
## 8 NCT03366675 2017-12-01 Phase 2 15 AZD 2811 Mon… Phase II, Sing…
## 9 NCT04454853 2017-12-01 <NA> 40 Methylated D… A Research of …
## 10 NCT03261947 2017-10-25 Phase 2 101 A Study to E… An Open-Label,…
## # … with 44 more rows, and 22 more variables: primary_completion_date <date>,
## # study_type <chr>, description <chr>, eudract_num <chr>, other_id <chr>,
## # allocation <chr>, intervention_model <chr>, observational_model <chr>,
## # primary_purpose <chr>, time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## # minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>,
## # sponsor <chr>, sponsor_type <chr>, conditions <chr>, interventions <list>
We can also search for two terms at once. By default, the query returns the studies that match at least one of the terms:
ctgov_query(description_kw = c("lung cancer", "colon cancer"))
## # A tibble: 61 × 28
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04696939 2021-01-31 Phase 2 100 Combined Ate… Clinical Study…
## 2 NCT03421678 2020-10-31 <NA> 0 Ethnic/Racia… Ethnic/Racial …
## 3 NCT04452214 2020-09-24 Phase 1 15 A Study of t… An Open-label,…
## 4 NCT04186988 2019-11-05 Early Phase 1 20 [18F]-AraG f… Imaging of T-C…
## 5 NCT03979170 2019-04-01 <NA> 50 Patient-deri… Patient-derive…
## 6 NCT03720873 2018-10-31 Phase 2 90 EGFR-TKIs Co… An Multicenter…
## 7 NCT03611738 2018-10-31 Phase 1 48 Ceritinib Pl… Phase I Study …
## 8 NCT03366675 2017-12-01 Phase 2 15 AZD 2811 Mon… Phase II, Sing…
## 9 NCT04454853 2017-12-01 <NA> 40 Methylated D… A Research of …
## 10 NCT03261947 2017-10-25 Phase 2 101 A Study to E… An Open-Label,…
## # … with 51 more rows, and 22 more variables: primary_completion_date <date>,
## # study_type <chr>, description <chr>, eudract_num <chr>, other_id <chr>,
## # allocation <chr>, intervention_model <chr>, observational_model <chr>,
## # primary_purpose <chr>, time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## # minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>,
## # sponsor <chr>, sponsor_type <chr>, conditions <chr>, interventions <list>
The match_all flag can instead be set to require that both terms match at the same time (here, that returns no matches):
ctgov_query(description_kw = c("lung cancer", "colon cancer"), match_all = TRUE)
## # A tibble: 0 × 28
## # … with 28 variables: nct_id <chr>, start_date <date>, phase <chr>,
## # enrollment <int>, brief_title <chr>, official_title <chr>,
## # primary_completion_date <date>, study_type <chr>, description <chr>,
## # eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, …
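The difference between matching any term and matching all terms can be sketched in base R as follows. The helper match_keywords is hypothetical and the descriptions are invented. It assumes plain case-insensitive substring matching, which may differ from what the package does internally.

```r
# Hypothetical helper: case-insensitive substring search for several keywords.
match_keywords <- function(text, keywords, match_all = FALSE) {
  # One column per keyword, one row per text entry.
  hits <- vapply(
    keywords,
    function(kw) grepl(tolower(kw), tolower(text), fixed = TRUE),
    logical(length(text))
  )
  hits <- matrix(hits, nrow = length(text))
  if (match_all) rowSums(hits) == length(keywords) else rowSums(hits) > 0
}

# Invented descriptions for illustration.
description <- c(
  "A trial in Lung Cancer patients",
  "Screening for colon cancer",
  "Lung cancer and colon cancer cohorts",
  "Asthma management"
)
kw <- c("lung cancer", "colon cancer")

print(match_keywords(description, kw))                    # TRUE TRUE TRUE FALSE
print(match_keywords(description, kw, match_all = TRUE))  # FALSE FALSE TRUE FALSE
```

With match_all = TRUE a description is kept only when every keyword occurs in it, so the result can never contain more rows than the default search.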
Other keyword fields include official_title_kw, source_kw, and criteria_kw.
Any of the options can be combined as needed.
ctgov_query(
  description_kw = "cancer",
  enrollment_range = c(100, 200),
  date_range = c("2019-01-01", "2020-02-01")
)
## # A tibble: 4 × 28
## nct_id start_date phase enrollment brief_title official_title primary_complet…
## <chr> <date> <chr> <int> <chr> <chr> <date>
## 1 NCT04199520 2020-01-31 Phase 2 155 Compare th… Compare the E… 2021-01-31
## 2 NCT03728829 2019-12-30 <NA> 100 Targeted N… An Observatio… 2022-09-30
## 3 NCT04498689 2019-08-01 Phase 2 117 Efficacy a… A Phase II, S… 2022-12-31
## 4 NCT02749552 2019-01-22 N/A 200 The Role o… The Role of V… 2019-08-31
## # … with 21 more variables: study_type <chr>, description <chr>,
## # eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## # minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>,
## # sponsor <chr>, sponsor_type <chr>, conditions <chr>, interventions <list>
Finally, we can also pass an existing version of the data set, such as the result of an earlier query, to the query function rather than starting with the full data set. This is useful when you want to combine queries in a more complex way. For example, this is equivalent to the above:
library(dplyr)
ctgov_query() %>%
  ctgov_query(description_kw = "cancer") %>%
  ctgov_query(enrollment_range = c(100, 200)) %>%
  ctgov_query(date_range = c("2019-01-01", "2020-02-01"))
## # A tibble: 4 × 28
## nct_id start_date phase enrollment brief_title official_title primary_complet…
## <chr> <date> <chr> <int> <chr> <chr> <date>
## 1 NCT04199520 2020-01-31 Phase 2 155 Compare th… Compare the E… 2021-01-31
## 2 NCT03728829 2019-12-30 <NA> 100 Targeted N… An Observatio… 2022-09-30
## 3 NCT04498689 2019-08-01 Phase 2 117 Efficacy a… A Phase II, S… 2022-12-31
## 4 NCT02749552 2019-01-22 N/A 200 The Role o… The Role of V… 2019-08-31
## # … with 21 more variables: study_type <chr>, description <chr>,
## # eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## # minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>,
## # sponsor <chr>, sponsor_type <chr>, conditions <chr>, interventions <list>
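The reason the chained and the combined forms agree is that each condition only narrows the set of rows, so applying the conditions one after another keeps the same rows as applying them together. A minimal base R sketch on invented data, with hypothetical helpers by_keyword and by_enrollment standing in for individual query steps:

```r
# Toy stand-in for the trials table; all values are invented for illustration.
trials <- data.frame(
  nct_id      = c("T1", "T2", "T3", "T4"),
  description = c("Breast cancer trial", "Cancer registry", "Asthma study", "Skin cancer"),
  enrollment  = c(150L, 50L, 120L, 200L),
  stringsAsFactors = FALSE
)

# Hypothetical filter steps, each taking and returning a data frame.
by_keyword <- function(data, kw) {
  data[grepl(tolower(kw), tolower(data$description), fixed = TRUE), , drop = FALSE]
}
by_enrollment <- function(data, range) {
  data[data$enrollment >= range[1] & data$enrollment <= range[2], , drop = FALSE]
}

# All conditions applied at once.
combined <- trials[
  grepl("cancer", tolower(trials$description), fixed = TRUE) &
    trials$enrollment >= 100 & trials$enrollment <= 200, , drop = FALSE
]

# One condition per step, each step narrowing the previous result.
chained <- by_enrollment(by_keyword(trials, "cancer"), c(100, 200))

print(chained$nct_id)                               # "T1" "T4"
print(identical(combined$nct_id, chained$nct_id))   # TRUE
```

The order of the steps does not change which rows are kept, only how much work each later step has to do.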
The package also contains a number of tools for visualizing the output. Here is one example:
ctgov_query(
  description_kw = "cancer",
  enrollment_range = c(100, 200),
  date_range = c("2019-01-01", "2020-02-01")
) %>%
  ctgov_plot_timeline() +
  ggplot2::theme_minimal()