Information infrastructure project
The information infrastructure project (INF) supports all projects in collecting, preprocessing and sharing their research data, and in developing and publishing efficient implementations of their statistical methods in popular open-source software environments. INF ensures that TRR 391 can meet the highest standards with respect to the FAIR data principles, for example the reproducibility of research results and the re-use of research data, and it provides training on these principles.
Project Leaders
Dr. Philipp Breidenbach
Research Data Center Ruhr
RWI Leibniz Institute for Economic Research
Prof. Dr. Paul-Christian Bürkner
Department of Statistics - Chair of Computational Statistics
TU Dortmund University
Prof. Dr. Andreas Groll
Department of Statistics - Chair of Statistical Methods for Big Data
TU Dortmund University
Summary
The goal of INF is to ensure that all data processes run smoothly across the participating teams from different universities and disciplines. Beyond overall research data management, this includes the interoperability of data and software between the participating disciplines, and the support and training of researchers in the requirements of interdisciplinary workflows and FAIR data. Research data often come from a variety of sources and in a variety of formats, and different disciplines and researchers use different data structures, which makes it difficult to standardize processes and ensure consistency. In addition to incoming data, project products such as coding schemes and newly developed software must also be integrated into the collaboration. This is further complicated by the different "understandings" of data and information among the disciplines involved in TRR 391.
The data come from different types of sources: simulations, experiments, surveys, open access data, and industry collaborations, the last of which require special efforts to transfer data and results. The interdisciplinary composition and the broad set of data sources also give INF the opportunity to develop a shared language that conveys a common understanding of data, data processing and data provision across disciplines. Such a translation, if successful, would benefit future interdisciplinary projects beyond TRR 391, and especially the National Research Data Infrastructure (NFDI), which aims to establish a common research data infrastructure.
Faced with the challenges described above, INF aims to manage all data processes efficiently, to provide data according to the FAIR principles and to train scientists in these specific tasks. INF will promote the exchange and merging of research data and of efficient software implementations of the statistical methods investigated within our TRR, as well as the availability of the research data and output, both inside and outside of TRR 391. The support and infrastructure provided by this project will ensure that TRR 391 can meet the highest standards with respect to the reproducibility of research results and the re-use of research data. To this end, the following principles will be followed in all projects:
- Open data: Data created and re-used in our TRR shall be handled according to the FAIR principles and the research data management guidelines of the participating institutions. All data sets will be brought to comparable standards by the researchers. The data will be enriched with metadata according to a uniform standard and with a data description. The data sets will be shared as openly as possible and kept as closed as necessary.
- Open source: The research projects have in common that their results will be based on extensive source code. For quality assurance and replication, it is particularly important that other researchers can rerun this source code and that it follows the guidelines developed within the INF project. The source code of all methods developed in TRR 391 will be published as open source under a suitable license.
- Reproducible research: Research articles from TRR 391 shall be published jointly with the computational tools necessary to reproduce the results. In particular, this includes the software and a precise description of how the results were obtained. In combination with open data and open source code, we strive for the best possible reproducibility of all results.
Comprehensive training measures will ensure that the researchers can follow these principles. Special emphasis is placed on keeping all solutions low-barrier, so that every project in TRR 391 can participate. The INF project will maintain extensive and direct communication with NFDI sections and consortia. This is intended to let NFDI developments flow directly into TRR 391 and, conversely, to feed findings from TRR 391 back to the NFDI.
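As an illustration of the open data principle above, the following Python sketch shows what a uniform metadata record and a simple completeness check could look like. It is an assumption-laden example only: the text does not name a metadata standard, so the field names, the `source_type` vocabulary (derived from the five kinds of data sources listed in the summary), and the functions `missing_fields` and `validate` are all hypothetical.

```python
import json

# Hypothetical minimal field set; the text does not specify a metadata standard.
REQUIRED_FIELDS = ("title", "creators", "description", "license", "source_type", "access")

# Assumed vocabulary, based on the kinds of data sources named in the summary.
SOURCE_TYPES = {"simulation", "experiment", "survey", "open_access", "industry"}


def missing_fields(record):
    """Return the required fields that are absent or empty in a metadata record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def validate(record):
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = [f"missing field: {name}" for name in missing_fields(record)]
    source_type = record.get("source_type")
    if source_type and source_type not in SOURCE_TYPES:
        problems.append(f"unknown source_type: {source_type}")
    return problems


# Illustrative record with made-up content, not a real TRR 391 data set.
example = {
    "title": "Simulated survival times (illustrative)",
    "creators": ["A. Researcher"],
    "description": "Synthetic data set used to demonstrate the record layout.",
    "license": "CC-BY-4.0",
    "source_type": "simulation",
    "access": "open",
}

print(json.dumps(example, indent=2))
print(validate(example))
```

A check of this kind could run before a data set is shared, so that incomplete descriptions are caught early. Restricted data, such as data from industry collaborations, could carry a different `access` value while keeping the same record layout.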