Data Sciences Facility Core (DSFC)
Providing novel, state-of-the-art services that support and integrate research and outreach
Overview
TiCER’s mission is to identify, understand, and reduce adverse environmental health risks for individuals and populations. To accomplish this goal, TiCER investigators must navigate the challenges of data collection, storage, analysis, and integration.
Goals & Structure:
The need for data science services is a requirement across many environmental health research projects, so the DSFC is closely entwined with the other cores. The DSFC exists to support the data science needs of investigators by providing access to experts and resources available at Texas A&M in the areas of
- computational toxicological data sciences,
- biostatistics/bioinformatics, and
- geospatial sciences.
Through the DSFC, TiCER members have access to cutting-edge research methodologies that improve existing experimental approaches. DSFC members have a track record of innovative data science research and applications, including:
- Multi-omics, including gene expression, sequence alignment, functional genomics, regulatory pathways, gene-environment interactions, microbiome and metagenomics
- Bioinformatics tool development and visualization
- Statistical analysis for complex data frameworks
- Use of statistical data sparsity methods to improve analysis, e.g., predictions, canonical correlations
- Chemical exposure assessment, dose- and concentration-response analysis
- Pharmacokinetics, including population physiologically-based pharmacokinetic modeling
- Bayesian approaches to characterizing uncertainty and population variability
- In vitro-to-in vivo extrapolation (IVIVE)
- Geospatial data analytics, including geocoding/georeferencing and spatial interpolation
- Geographic visualization, including interactive dashboards, story maps, and cartography
The components of the DSFC are organized to facilitate broad core utilization across TiCER. The DSFC uses a business model that combines fully supported core administration, data infrastructure, and initial consultations, along with partial support for project-specific services.
Administration
Administration of the DSFC is the responsibility of Dr. Weihsueh Chiu (director) and Dr. Raymond Carroll (deputy director). They provide overall management and are responsible for decision-making regarding project support. They also provide guidance to staff on data science methods and approaches.
Initial Consultations & Requests
The purpose of initial consultations is to refine the request and to plan project-specific support. Based on previous experience with bioinformatics and statistical support, an initial consultation limit of 4 hours per request is imposed. In addition, if it is determined that the request can be fulfilled with an effort of 1 person-day or less (e.g., routine statistical analysis of small datasets), then such efforts will be supported wholly by the DSFC.
Data Infrastructure
The DSFC maintains a robust data repository infrastructure for storage, analysis, sharing, and dissemination of data from TiCER-associated projects and cores. The data management specialist, Dr. John Blazier, will be responsible for maintaining this infrastructure, with guidance from Dr. Ivan Ivanov. Data security and integrity will be addressed in consultation with TiCER’s quality assurance officer.
Project-Specific Support
Support is available for the DSFC to provide dedicated services to individual projects (salaries are two-thirds cost-shared by the institution). Junior investigators and pilot project recipients are expected to cover only 50% of the cost of DSFC services, whereas senior investigators are expected to cover 75% of the cost. Services are billed at a flat rate of $50/hr to simplify budget accounting.
Operations Process:
DSFC operations consist of a decision process for investigator-initiated service requests on a monthly cycle and a set of organizational/coordination/reporting tasks on monthly, quarterly, or annual cycles. The decision process for service requests includes the following steps:
- Principal Investigator (PI) Request
PIs submit requests for DSFC services. Request forms require contact and billing data and include a request for responses to data science-centric questions that help the DSFC understand the goals and expectations of a new project or user. These include the scientific question being posed, the technical details of the proposed project, and the relevance of the research to environmental health sciences research and to TiCER themes. The request also identifies a preferred DSFC point of contact (POC).
- Triage
The DSFC director and/or deputy director performs an initial triage of the request. If the request is clearly outside TiCER’s scope, then it will be immediately rejected. Otherwise, a suitable DSFC POC will be identified, and the request will be forwarded to them to set up an initial consultation with the PI.
- Initial Consultation
An initial consultation meeting will be scheduled between the DSFC POC and the requesting PI to determine the appropriate avenue of DSFC support. This meeting will help establish the level of DSFC utilization requested, including any training needs and new methods development. There are three possible outcomes: request for subsidized DSFC support, request for small-effort DSFC support, or cancellation of the request.
- Monthly Review of Requests
Prior to allocating support from the DSFC, the core director and deputy director review all requests on a monthly basis. Requests will be prioritized for funding based on several key factors, including (a) environmental health and TiCER thematic relevance, (b) appropriate use of DSFC resources, (c) availability of expertise, and (d) availability of funds.
- Funding Decision
The DSFC director and deputy director make the final decision regarding funding of service requests. Requests may be approved, denied, or deferred to a later cycle. The highest priority requests will be approved subject to funds availability. Deferred requests will be eligible for consideration in a later cycle, and the investigator will have the option to revise the initial service request. Requests that are low priority will generally be denied.
Services
Computational Toxicology Modeling
Population physiologically-based pharmacokinetic (PBPK) modeling using a hierarchical Bayesian approach
Population PK models are powerful tools in environmental health and can integrate various types and sizes of PK data sets into a common analysis framework. This approach can be enhanced significantly by incorporating the anatomy and physiology of the underlying biological system through the use of PBPK models, which have a much wider range of applicability than traditional empirical compartmental PK models. This greater applicability domain is particularly important when trying to understand population PK, since it is generally infeasible to obtain empirical data over the full range of variability in environmentally exposed populations and communities.
Unlike the parameters of traditional compartmental models, PBPK model parameters are not uniquely identifiable from experimental PK data alone: it is infeasible in human populations to obtain all the information (tissue volumes and flows, tissue-specific metabolism and transport, and tissue-specific compound concentrations) needed to identify every parameter value on the basis of maximum likelihood alone. It has therefore been proposed that population PBPK models, like other mechanistic models, are best parameterized in a Bayesian framework, which seamlessly integrates the prior information available from the literature on PBPK parameter values and their correlations. A population model further aids Bayesian integration of prior information, since priors rest on population parameters (e.g., average blood volume) rather than on individual values.
The PBPK model–population model–Bayesian framework results in a clear, elegant, versatile, flexible approach that does not depend on asymptotic or Gaussian approximations. This approach has been demonstrated in multiple case studies of environmental or occupational toxicants, including several by Dr. Chiu. These include the use of population PBPK models for bi-directional translation of mouse population-based models to human health and risk assessment.
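As an illustration of the population-model layer described above, the sketch below draws individual PK parameters from lognormal population distributions and propagates them through a one-compartment model, a deliberately simplified stand-in for a full PBPK model. All parameter values are hypothetical and for demonstration only.

```python
import math
import random

random.seed(1)

# One-compartment PK model used as a stand-in for a full PBPK model:
# C(t) = Dose / V * exp(-(CL / V) * t)
def concentration(dose_mg, cl_l_per_h, v_l, t_h):
    return dose_mg / v_l * math.exp(-(cl_l_per_h / v_l) * t_h)

# Population (hierarchical) parameters: in a Bayesian analysis, priors are
# placed on these population-level quantities rather than on individuals.
# Values here are illustrative, not from any specific study.
POP_GM_CL, POP_GSD_CL = 10.0, 1.4   # clearance, L/h (geometric mean, GSD)
POP_GM_V,  POP_GSD_V  = 40.0, 1.2   # volume of distribution, L

def sample_individual():
    """Draw one individual's parameters from the population distributions."""
    cl = random.lognormvariate(math.log(POP_GM_CL), math.log(POP_GSD_CL))
    v = random.lognormvariate(math.log(POP_GM_V), math.log(POP_GSD_V))
    return cl, v

# Monte Carlo over the population: concentration at t = 4 h after a 100 mg dose
samples = []
for _ in range(5000):
    cl, v = sample_individual()
    samples.append(concentration(100.0, cl, v, 4.0))

samples.sort()
median = samples[len(samples) // 2]
p95 = samples[int(0.95 * len(samples))]
print(f"median C(4h) = {median:.2f} mg/L, 95th percentile = {p95:.2f} mg/L")
```

The spread between the median and upper percentile is what the hierarchical Bayesian analysis quantifies formally, with posterior distributions replacing the fixed population values assumed here.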
Toxicokinetic modeling for in vitro-to-in vivo extrapolation (IVIVE)
While in vitro data offer a great potential to fill knowledge gaps related to toxicity information, several factors that can significantly influence in vivo toxicity, such as bioavailability, metabolic and renal clearance, and protein binding, are not accounted for in in vitro test data. Therefore, the incorporation of toxicokinetic dosimetry modeling is a necessary element of the vision to move toxicity testing away from in vivo rodent studies to human-based in vitro assays and enable direct comparisons with human exposures for evaluating risk.
IVIVE forms a critical link from Stressors to Responses, one of TiCER’s themes. IVIVE models also account for population variability in toxicokinetics and, therefore, also touch on the theme of Individuals to Populations. Moreover, IVIVE approaches have been demonstrated to have a substantial impact on hazard prioritization and ranking, thereby providing critical information to inform the theme of Environmental Justice & Policy. Because of the large number of chemicals TiCER investigates, both individually and in combination, an essential element of the DSFC is to maintain and augment, as necessary, a repository of models and their parameters in a readily computable form. Pearce et al. recently developed an open-source library, “httk” (high-throughput toxicokinetics), in the R statistical software environment that includes toxicokinetic models and pre-calculated RTK factors, as well as computer code for dynamic simulation and Monte Carlo sampling using several different toxicokinetic model formulations. TiCER investigators have also extended these approaches to mixtures.
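A minimal sketch of the reverse-dosimetry logic behind IVIVE (not the httk implementation itself): estimate the steady-state plasma concentration for a unit oral dose from a simplified clearance model, then scale the in vitro AC50 into an administered equivalent dose. The chemical-specific values, helper names, and clearance model are all illustrative assumptions.

```python
# Illustrative reverse dosimetry, in the spirit of (but not taken from) httk.
# All parameter values and function names are hypothetical placeholders.

def css_1mgkg(fub, clint_l_per_h, q_gfr=6.7, q_liver=90.0, bw_kg=70.0):
    """Steady-state plasma concentration (mg/L) for a constant 1 mg/kg/day
    oral dose, using a simplified clearance model: renal clearance
    (GFR x unbound fraction) plus hepatic clearance (well-stirred liver).
    q_gfr and q_liver are flows in L/h."""
    hepatic = q_liver * fub * clint_l_per_h / (q_liver + fub * clint_l_per_h)
    dose_rate = 1.0 * bw_kg / 24.0            # mg/h for 1 mg/kg/day
    return dose_rate / (q_gfr * fub + hepatic)

def aed_mgkgday(ac50_uM, mw_g_per_mol, fub, clint_l_per_h):
    """Administered equivalent dose (mg/kg/day) whose steady-state plasma
    concentration equals the in vitro AC50."""
    ac50_mg_per_l = ac50_uM * mw_g_per_mol / 1000.0   # uM -> mg/L
    return ac50_mg_per_l / css_1mgkg(fub, clint_l_per_h)

# Hypothetical chemical: AC50 = 5 uM, MW = 250 g/mol, fub = 0.1, Clint = 20 L/h
print(f"AED = {aed_mgkgday(5.0, 250.0, 0.1, 20.0):.2f} mg/kg/day")
```

Comparing such AEDs against exposure estimates is what drives the hazard prioritization mentioned above; httk additionally propagates population variability via Monte Carlo.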
Population-based dose-response modeling
A number of commercial and freely available software packages can conduct dose-response modeling, but their use in evaluating toxicological datasets is still quite limited compared to the traditional application of pair-wise statistical tests to determine “no observed adverse effect levels” (NOAELs).
The limitations of the NOAEL as a starting point for toxicological risk assessment have been recognized for decades, and it is generally accepted that the benchmark dose (BMD), introduced by Crump, is more scientifically appropriate. The BMD is the dose associated with a specific size of effect, the benchmark response (BMR; we use M, for magnitude of effect, to denote this value). The BMD_M is estimated, with an associated confidence interval or statistical distribution, by fitting a statistical model to dose-response data. To facilitate the use of toxicological data generated by TiCER investigators in regulatory decision-making, BMD modeling services are provided.
We previously developed a standardized workflow for BMD modeling that was applied to hundreds of datasets and can be adapted to the needs of supported TiCER investigators. Additionally, the recent advent of high-throughput and population-based in vitro and in vivo models presents a new challenge. We have recently demonstrated a hierarchical Bayesian workflow for dose-response modeling of population data, which has particular application to population-based in vitro data. This approach enables a statistically rigorous analysis of population variability in toxic responses, providing key information on toxicodynamic susceptibility.
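The BMD concept can be illustrated with a toy continuous dataset: fit an exponential dose-response model by least squares and solve for the dose producing a 10% change from background. The data and model choice below are illustrative only, not defaults of the workflow described above.

```python
import math

# Illustrative benchmark-dose (BMD) calculation for a continuous endpoint,
# using an exponential model f(d) = a * exp(b * d).  The dose-response data
# below are made up for demonstration.
doses = [0.0, 10.0, 30.0, 100.0]
means = [5.0, 5.4, 6.3, 10.1]     # observed mean response at each dose

# Fit log(mean) = log(a) + b*d by ordinary least squares (closed form).
n = len(doses)
y = [math.log(m) for m in means]
xbar = sum(doses) / n
ybar = sum(y) / n
b = sum((x - xbar) * (yi - ybar) for x, yi in zip(doses, y)) / \
    sum((x - xbar) ** 2 for x in doses)
a = math.exp(ybar - b * xbar)

# BMD for a benchmark response of M = 10% change from background:
# a * exp(b * BMD) = a * (1 + M)  =>  BMD = ln(1 + M) / b
M = 0.10
bmd = math.log(1.0 + M) / b
print(f"fitted a = {a:.2f}, b = {b:.4f}, BMD_10 = {bmd:.1f}")
```

A full analysis would also report a confidence bound (the BMDL) from the sampling distribution of the fit, which is the quantity typically used in risk assessment.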
Population-based risk prediction
Methods for population-based risk prediction for the purposes of informing environmental health decisions continue to evolve, with increasing emphasis on the use of in vitro data, dose-response modeling, and probabilistic analyses of uncertainty and variability. Recent National Academy of Sciences reports have recommended the development of predictive risk estimates that provide stakeholders and decision-makers with a fuller characterization of the potential human health effects in populations associated with different levels of exposure.
Recently, the World Health Organization (WHO) developed a harmonized probabilistic approach based on the concept of the “target human dose” (denoted HD_M^I) that responds directly to these National Academy of Sciences recommendations. While similar to the traditional “reference dose” (RfD) in its derivation, it differs in several important ways. First, being based on dose-response modeling, it is always tied to a specific health effect, unlike a NOAEL or RfD. Second, it is a harmonized approach applicable to both cancer and non-cancer effects. Furthermore, it extends existing decision benchmarks by quantitatively defining the protection goals in terms of the population incidence (I) and magnitude (M) of the critical effect in the human population. This allows for incorporation into regulatory impact analyses that require economic benefit-cost estimates. Finally, it uses uncertainty distributions derived from historical and/or chemical-specific data and analyses to derive confidence intervals that quantify the level of uncertainty in the overall result.
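A hedged Monte Carlo sketch of the probabilistic idea: sample uncertainty and variability factors from lognormal distributions, divide the point of departure by their product, and report a lower percentile as the target human dose. The distribution parameters and point of departure below are placeholders, not the WHO/IPCS defaults.

```python
import math
import random

random.seed(7)

# Illustrative Monte Carlo version of a probabilistic reference-dose
# calculation.  POD and the lognormal parameters are hypothetical.
POD = 10.0                      # point of departure, mg/kg/day
N = 10000

hd_samples = []
for _ in range(N):
    # Interspecies and intraspecies adjustments sampled as lognormals
    # (geometric mean 3, geometric SD 2 -- placeholder values).
    interspecies = random.lognormvariate(math.log(3.0), math.log(2.0))
    intraspecies = random.lognormvariate(math.log(3.0), math.log(2.0))
    hd_samples.append(POD / (interspecies * intraspecies))

hd_samples.sort()
hd_mi = hd_samples[int(0.05 * N)]   # lower 5th percentile of the distribution
print(f"target human dose (5th percentile) = {hd_mi:.3f} mg/kg/day")
```

Unlike a deterministic RfD, the full distribution of samples, not just the lower bound, is available for benefit-cost analyses of the kind mentioned above.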
Bioinformatics
The DSFC provides guidance to facilitate robust bioinformatics analyses and workflows and provides resources for the development of novel bioinformatics methods. DSFC personnel are experienced in the design and implementation of software solutions and computational pipelines for biological problems, and will provide computational biology and bioinformatics programming support to assist research faculty, staff, and students with the management and analysis of genomic and other biological data.
Custom Software Pipelines, Apps, and Workflows
The diversity of data stemming from different biological systems poses specific challenges to life sciences researchers during data analysis, particularly those who are not experienced in computer programming. Many of the tools used in life sciences research are developed by other researchers at various academic institutions and deployed in UNIX/Linux computing environments, which require at least a basic knowledge of those operating systems. The DSFC can support investigators and their staff by providing algorithm design and computer programming services to develop easy-to-use command-line tools and scripts, graphical user interfaces, and/or online web apps/interfaces that are customized to their data analysis workflows. Recent examples include:
- lncRNApipe (https://github.com/biocoder/Perl-for-Bioinformatics/tree/master/NGS-Utils) is a command-line-based software pipeline for identifying putative long non-coding RNAs from RNA-Seq reads prepared with the Illumina Total RNA-Seq protocol. It is written in Perl.
- SBEToolbox (https://github.com/biocoder/SBEToolbox) is a menu-driven graphical user interface written in MATLAB to compute various topological metrics and deduce clusters within biological networks.
- gQTL (https://genomics.tamu.edu/gqtl) is a web-based application for QTL analysis of Collaborative Cross Mice.
Bioinformatics Methods Development
In some cases, novel bioinformatics methods may need to be developed to address Center project needs. Past examples include the following:
- Regulatory Pathway Inference and Classification: Many associations between gene expression-activation and clinical response are correlative as opposed to mechanistic. As a result, these associations are often poor predictors of how cellular signaling pathways will behave if perturbed by environmental exposure modifiers such as diet and exercise. Using systems biology approaches, we have developed mathematical models to integrate the dynamics of reactions in cells and interactions between cells in response to environmental exposure, leading to a dynamic, potentially personalized, “molecular signature.” We have recently demonstrated that the integration of existing pathway knowledge with expression data in a novel systems-based Bayesian approach can be used to achieve optimal classification accuracy, relative to standard data-driven approaches.
- Prediction of Toxicology Outcomes Using Data Integration and Bridging from Cell and Animal Models to Human Disease: One approach to understanding mechanisms from Stressors to Responses, one of the key themes of this center, is through predictive models that integrate the different modalities of data, including prior information, for the different animal models and interventional and observational studies (environmental exposure and intervention treatments). Some approaches include the concept of the Coefficient of Determination (CoD), Bayesian network models, and Canonical Correlation Analysis (CCA). The CoD paradigm has been used to detect so-called “master-slave” regulatory relationships between genes in a Gene Regulatory Network (GRN) and, more recently, to estimate the probabilities of detecting true feature sets. The Bayesian network model allows for a proper interpretation of plausible causality relationships between regulatory and regulated elements in a given pathway of interest. CCA can integrate existing pathway knowledge with expression data to detect diet-dependent interactions between the host transcriptome and microbiome. More recently, we have developed a novel methodology for detecting correlative structures in NextGen sequencing data. This methodology can be used, for example, to examine the effects of phthalate exposure during two critical developmental windows (prenatal and early childhood) on the gut metagenome and host gut gene expression.
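The CoD idea can be sketched for binary gene states: compare the prediction error for a target gene when the regulator is ignored with the error when the regulator is known. The simulated regulatory rule below is made up for illustration and is not drawn from any TiCER dataset.

```python
import random

random.seed(3)

# Illustrative coefficient of determination (CoD) for a "master-slave"
# relationship between binary (on/off) gene states: how much does knowing
# regulator gene X reduce the error of predicting target gene Y?
def simulate(n=5000, noise=0.1):
    """Simulate (X, Y) pairs where Y copies X except for a noise fraction."""
    data = []
    for _ in range(n):
        x = random.randint(0, 1)
        y = x if random.random() > noise else 1 - x
        data.append((x, y))
    return data

def cod(data):
    n = len(data)
    # eps0: error of the best predictor that ignores X (predict majority Y)
    p1 = sum(y for _, y in data) / n
    eps0 = min(p1, 1 - p1)
    # eps_opt: error of the best predictor of Y given each value of X
    eps_opt = 0.0
    for xv in (0, 1):
        ys = [y for x, y in data if x == xv]
        if ys:
            p = sum(ys) / len(ys)
            eps_opt += min(p, 1 - p) * len(ys) / n
    return (eps0 - eps_opt) / eps0 if eps0 > 0 else 0.0

value = cod(simulate())
print(f"CoD = {value:.2f}")   # high CoD => X strongly predicts Y
```

With a 10% noise rate the CoD lands well above zero, flagging X as a candidate "master" of Y; in GRN applications this comparison is repeated over many candidate regulator sets.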
Geospatial Sciences
There is a growing need for geospatial risk analyses, especially in support of community engagement and other community-based participatory research across TiCER.
Spatially explicit study design and modeling
This group provides tailored tools to support four categories of TiCER activities: data collection, data integration/curation, data analytics, and visualization. For data collection, remote sensing and LiDAR data can be employed to develop spatial sampling schemes. By combining these with demographic and socioeconomic data, the center can develop cluster or stratified sampling strategies for cross-sectional or longitudinal survey methods. Regarding data integration, geospatial and temporal features can serve as the primary data linkage, while other formats of data can be converted into a geospatial database by geocoding/georeferencing/georectification. When projects use sensor networks (e.g., biosensing, environmental monitoring, crowd-sensing), contextual data that capture the dynamics of the environment and human exposure can be obtained and scaled for analysis. Next, geoprocessing, spatial interpolation, spatial autocorrelation, and spatial machine-learning methods can be applied to model global and local geospatial processes or address spatial dependence and heterogeneity. Lastly, static maps (e.g., choropleth maps, heatmaps) and interactive web maps and dashboards will be developed to visualize the dataset and project results. In addition, DSFC personnel have experience leveraging virtual and augmented reality technologies for realistic representations of the environment. Three-dimensional digital models will be developed based on LiDAR point clouds in ESRI CityEngine and game engines (e.g., Unreal, Unity) to simulate various environmental hazards (e.g., flooding, oil spill).
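As one concrete example of the spatial interpolation step, a minimal inverse-distance-weighting (IDW) sketch is shown below; monitor locations and readings are hypothetical.

```python
import math

# Illustrative inverse-distance-weighted (IDW) spatial interpolation:
# estimate a value at an unsampled location from nearby monitoring sites.
def idw(points, target, power=2.0):
    """points: list of ((x, y), value); target: (x, y) to estimate."""
    num = den = 0.0
    for (x, y), value in points:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return value          # exactly at a sample point
        w = 1.0 / d ** power      # closer monitors get larger weights
        num += w * value
        den += w
    return num / den

# Hypothetical PM2.5 readings (ug/m^3) at four monitors on a km grid
monitors = [((0, 0), 12.0), ((4, 0), 20.0), ((0, 3), 8.0), ((5, 5), 15.0)]
estimate = idw(monitors, (2, 2))
print(f"IDW estimate at (2, 2): {estimate:.1f} ug/m^3")
```

IDW is deliberately the simplest of the interpolation methods listed above; kriging adds a spatial autocorrelation model and an uncertainty estimate, at the cost of fitting a variogram.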
Geospatial data integration for climate and environmental justice assessment
Texas environmental justice communities face compounded hazards that manifest as “acute-on-chronic” stressors. Health disparities are increasingly recognized as geospatial, closely tied to the spatial-temporal heterogeneity in the distribution of environmental and social stressors. For example, populations in closer proximity to Toxic Release Inventory sites, industrial land uses, or transportation corridors run a higher risk of being affected by toxins through various exposure pathways, and these areas are often occupied by communities of color. By integrating domains of multisource and multi-type data (e.g., climate extremes, environment, infrastructure, socioeconomic, and health), this core can identify vulnerable populations with exposure levels, pathways, and scales specific to their unique needs, further increasing the validity and accuracy of methodologies and the robustness of findings.
GeoDesign as a community-based participatory research framework
Geospatial technology can not only reveal distributive injustice in existing environmental stressors and health disparities but also promote procedural justice through enabling information access and meaningful participation of community members, thus facilitating equitable allocation of resources and decision-making. DSFC utilizes a GeoDesign framework with six key modeling approaches, including representation models, process models, evaluation models, change models, impact models, and decision models, to support community-based participatory research. The representation and process models aim to enable community-driven problem identification through efficient spatial data sharing and modeling individual, household, and community-level health risk factors. The evaluation and change models utilize scenario-based approaches to evaluate health outcomes based on community co-design solutions. The impact and decision models assess the primary health outcomes and auxiliary benefits of the solutions, as well as any unintended social consequences.
Contact Information
DSFC
588 Raymond Stotzer Parkway
College Station, TX 77843
Tel: 979.845.4106
DSFC Team
Peer-Reviewed Research
- Characterizing Uncertainty and Variability in Physiologically Based Pharmacokinetic Models: State of the Science and Needs for Research and Implementation
- Evaluation of physiologically based pharmacokinetic models for use in risk assessment
- Key Scientific Issues in the Health Risk Assessment of Trichloroethylene
- Carcinogenicity of perfluorooctanoic acid, tetrafluoroethylene, dichloromethane, 1,2-dichloropropane, and 1,3-propane sultone
- Trichloroethylene: Mechanistic, epidemiologic and other supporting evidence of carcinogenic hazard
- Prognostic factors for local recurrence, metastasis, and survival rates in squamous cell carcinoma of the skin, ear, and lip: Implications for treatment modality selection
- Long-Term Recurrence Rates in Previously Untreated (Primary) Basal Cell Carcinoma: Implications for Patient Follow-Up
- Estimation and Comparison of Changes in the Presence of Informative Right Censoring by Modeling the Censoring Process
- The PSA−/lo Prostate Cancer Cell Population Harbors Self-Renewing Long-Term Tumor-Propagating Cells that Resist Castration
- Prognostic Validity of a Novel American Joint Committee on Cancer Staging Classification for Pancreatic Neuroendocrine Tumors
- Assessing the potential exposure risk and control for airborne titanium dioxide and carbon black nanoparticles in the workplace
- Assessing the potential risks to zebrafish posed by environmentally relevant copper and silver nanoparticles