Data Sciences Facility Core [DSFC]

The DSFC provides novel, state-of-the-art data science services to support and integrate scientific research and outreach activities.

About this Core

TiCER’s ultimate goal is to identify, understand, and reduce adverse environmental health risks for individuals and populations. To accomplish this goal, TiCER scientists must navigate challenges of data collection, storage, analysis, and integration. The DSFC exists to support these needs by providing access to data science experts and resources available at Texas A&M in three areas: toxicological data sciences, biostatistics/bioinformatics, and geospatial sciences. Data science services are needed across many environmental health research projects, and the DSFC is therefore closely entwined with all other Facility Cores.

The Members of the DSFC Have a Track Record of Innovative Data Science Research and Applications

These include:

Multi-omics, including gene expression, sequence alignment, functional genomics, regulatory pathways, gene-environment interactions, microbiome and metagenomics
Bioinformatics tool development and visualization
Statistical analysis for complex data frameworks
Use of statistical data sparsity methods to improve analysis, e.g., predictions, canonical correlations
Chemical exposure assessment, dose- and concentration-response analysis
Pharmacokinetics, including population physiologically-based pharmacokinetic modeling
Bayesian approaches to characterizing uncertainty and population variability
In vitro-to-in vivo extrapolation (IVIVE)
Geospatial data analytics, including geocoding/georeferencing, spatial interpolation
Geographic visualization, including interactive dashboards and story maps, cartography/map making

Through the DSFC, Center Members Have Access to Cutting-Edge Research Methodologies That Improve Existing Experimental Approaches

DSFC Components

The components of the DSFC are organized to facilitate broad Core utilization across the Center. The DSFC uses a business model that combines fully supported Core administration, data infrastructure, and initial consultations, along with partial support for project-specific services.

Administration

Administration of the DSFC is the responsibility of Drs. Chiu (Core Director) and Carroll (Deputy Director). They provide overall management of the DSFC and are responsible for decision-making regarding project support. They also provide guidance to staff on data science methods and approaches.

Data Infrastructure

The DSFC maintains a robust data repository infrastructure for storage, analysis, sharing, and dissemination of data from Center-associated projects and Cores. The Data Management Specialist, Dr. Blazier, will be responsible for maintaining this infrastructure, with guidance from Dr. Ivanov. Data security and integrity will be addressed in consultation with the Center Quality Assurance Officer.

Initial Consultations and Requests

The purpose of initial consultations is to refine the initial request and to plan project-specific support. Based on previous experience with bioinformatics and statistical support, an initial consultation limit of 4 hours per request is imposed. In addition, if it is determined that the request can be fulfilled with an effort of 1 person-day or less (e.g., routine statistical analysis of small datasets), then such efforts will be supported wholly out of the DSFC.

Project-Specific Support

Support is available for the DSFC to provide dedicated support services to individual projects (salaries are 2/3 institutional cost-shared). Junior investigators and pilot project recipients are expected to cover only 50% of the cost of DSFC services, whereas senior investigators are expected to cover 75% of the cost of DSFC services. Services are billed at a flat rate of $50/hr to simplify budget accounting.
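As a rough illustration of the cost-sharing arithmetic above, the following Python sketch computes the amount an investigator would cover. It assumes that the investigator's share is applied to hours multiplied by the flat rate, which the text does not state explicitly. The function name and category labels are hypothetical.

```python
FLAT_RATE_PER_HOUR = 50.0  # flat billing rate stated in the text, in USD

# Fraction of the service cost each investigator category is expected to cover.
# The category labels are hypothetical names for this sketch.
COST_SHARE = {"junior": 0.50, "pilot": 0.50, "senior": 0.75}


def investigator_charge(hours: float, category: str) -> float:
    """Return the amount covered by the investigator, in USD.

    Assumption: the share is applied to hours * flat rate.
    """
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return hours * FLAT_RATE_PER_HOUR * COST_SHARE[category]


# Example: 40 hours of project-specific support.
junior_cost = investigator_charge(40, "junior")  # 40 h * $50/h * 0.50
senior_cost = investigator_charge(40, "senior")  # 40 h * $50/h * 0.75
```

Under these assumptions, the same 40-hour request costs a senior investigator 1.5 times what it costs a junior investigator.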

DSFC Operations Process

DSFC operations consist of a decision process for investigator-initiated service requests on a monthly cycle, and a set of organizational/coordination/reporting tasks on monthly, quarterly, or annual cycles. The decision process for service requests includes the following steps:

PI Online Request
A PI will submit an online request for DSFC services through the Center website linked to the Project Management and Tracking System. Request forms will require contact and billing data, and will include a request for responses to Data Science-centric questions that will help the Core understand the goals and expectations of a new project or user. These will include the scientific question being posed, the technical details of the proposed project, and the relevance of the research to environmental health sciences research and to Center themes. The request form will also identify a preferred DSFC point of contact (POC).
Triage
The Director and/or Deputy Director will perform an initial triage of the request form. If the request is clearly outside the scope of the Center, then it will be immediately rejected. Otherwise, a suitable DSFC POC will be identified and the request forwarded to him/her to set up an initial consultation with the PI.
Initial Consultation
An initial consultation meeting will be scheduled between the DSFC POC and the requesting PI to determine the appropriate avenue of DSFC support to the PI. This meeting will help establish the level of DSFC utilization requested, including any training needs and new methods development. There are three possible outcomes: request for subsidized DSFC support, request for small-effort DSFC support, or cancellation of the request.
Monthly Review of Requests
Prior to allocating support from DSFC, the Core Director and Deputy Director will review all requests on a monthly basis. Requests will be prioritized for funding based on several key factors, including: (a) environmental health and Center thematic relevance; (b) appropriate use of DSFC resources; (c) availability of expertise; and (d) availability of funds.
Funding Decision
The DSFC Director and Deputy Director will make the final decision regarding funding of service requests. Requests may be approved, denied, or deferred to a later cycle. The highest priority requests will be approved subject to funds availability. Deferred requests will be eligible for consideration in a later cycle, and the investigator will have the option to revise the initial service request. Requests that are low priority will generally be denied.
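The monthly review and funding decision described above can be sketched as a simple allocation rule. This is only an illustration under stated assumptions: the numeric priority score, the minimum-priority cutoff, and the rule that requests exceeding the remaining budget are deferred are all hypothetical, since the text names the prioritization factors but not how they are combined.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPROVED = "approved"
    DEFERRED = "deferred"
    DENIED = "denied"


@dataclass
class Request:
    name: str
    priority: int  # hypothetical score from the monthly review; higher is better
    cost: float    # estimated DSFC cost in USD


def monthly_decisions(requests, budget, min_priority=2):
    """Approve the highest-priority requests while funds last.

    Assumptions: requests scoring below min_priority are denied, and
    requests that do not fit the remaining budget are deferred.
    """
    decisions = {}
    remaining = budget
    for req in sorted(requests, key=lambda r: r.priority, reverse=True):
        if req.priority < min_priority:
            decisions[req.name] = Decision.DENIED
        elif req.cost <= remaining:
            decisions[req.name] = Decision.APPROVED
            remaining -= req.cost
        else:
            decisions[req.name] = Decision.DEFERRED
    return decisions
```

A deferred request would re-enter the next monthly cycle, optionally after the investigator revises it.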

The DSFC Provides State-of-the-Art Computational Toxicology Modeling Services

Population physiologically based pharmacokinetic (PBPK) modeling using a hierarchical Bayesian approach.

Population pharmacokinetic (PK) models are powerful tools in environmental health and can integrate various types and sizes of PK data sets into a common analysis framework. This approach can be enhanced significantly by incorporating the anatomy and physiology of the underlying biological system into the framework through the use of PBPK models, which have a much wider range of applicability than traditional empirical compartmental PK models. This greater applicability domain is particularly important when trying to understand population PK, since it is generally infeasible to obtain empirical data over the full range of variability in environmentally exposed populations and communities.

Unlike traditional compartmental models, PBPK model parameters are not uniquely identifiable from experimental PK data alone: for human populations, it is infeasible to obtain all the information (tissue volumes and flows, tissue-specific metabolism and transport, and tissue-specific compound concentrations) needed to identify every parameter value by maximum likelihood alone. It has therefore been proposed that population PBPK models, like other mechanistic models, are best parameterized in a Bayesian framework, which seamlessly integrates the prior information available from the literature on PBPK parameter values and correlations. A population model further helps Bayesian integration of prior information, since priors rest on population parameters (e.g., average blood volume) rather than on individual values.

The PBPK model–population model–Bayesian framework results in a clear, elegant, versatile, flexible approach that does not depend on asymptotic or Gaussian approximations. This approach has been demonstrated in multiple case studies of environmental or occupational toxicants, including several by Dr. Chiu. These include use of population PBPK models for bi-directional translation of mouse population-based models to human health and risk assessment.
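The three-level structure behind this framework (population parameters, individual parameters, observations) can be illustrated with a short forward simulation in Python. This is a minimal sketch, not the Core's workflow: a one-compartment intravenous bolus model stands in for a full PBPK model, all parameter values are hypothetical, and only the generative side of the hierarchy is shown. In a Bayesian analysis, the population-level values would themselves carry prior distributions and be updated from data.

```python
import math
import random


def concentration(dose_mg, cl_l_per_h, v_l, t_h):
    """One-compartment IV bolus model (a stand-in for a full PBPK model)."""
    return (dose_mg / v_l) * math.exp(-(cl_l_per_h / v_l) * t_h)


def simulate_population(n_subjects, times_h, seed=0):
    """Simulate noisy concentration data for a population of subjects."""
    rng = random.Random(seed)
    # Level 1: population parameters (hypothetical medians and geometric SDs).
    pop_median_cl, pop_gsd_cl = 5.0, 1.5  # clearance, L/h
    pop_median_v, pop_gsd_v = 40.0, 1.2   # volume of distribution, L
    resid_gsd = 1.1                       # multiplicative measurement error
    data = []
    for _ in range(n_subjects):
        # Level 2: individual parameters drawn from the population distribution.
        cl = rng.lognormvariate(math.log(pop_median_cl), math.log(pop_gsd_cl))
        v = rng.lognormvariate(math.log(pop_median_v), math.log(pop_gsd_v))
        # Level 3: observations scattered around the individual's prediction.
        obs = [
            concentration(100.0, cl, v, t)
            * rng.lognormvariate(0.0, math.log(resid_gsd))
            for t in times_h
        ]
        data.append({"cl": cl, "v": v, "obs": obs})
    return data


population = simulate_population(n_subjects=5, times_h=[1.0, 2.0, 4.0, 8.0])
```

Because priors are placed on the population-level medians and spreads, literature values such as an average tissue volume map directly onto Level 1 of this structure.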

Toxicokinetic modeling for in vitro-to-in vivo extrapolation (IVIVE).

While in vitro data offer great potential to fill knowledge gaps related to toxicity information, several factors that can significantly influence in vivo toxicity, such as bioavailability, metabolic and renal clearance, and protein binding, are not accounted for in in vitro test data. Therefore, the incorporation of toxicokinetic dosimetry modeling is a necessary element of the vision to move toxicity testing away from in vivo rodent studies to human-based in vitro assays and enable direct comparisons with human exposures for evaluating risk.

IVIVE forms a critical link from Stressors to Responses, one of the Center’s themes. IVIVE models also account for population variability in toxicokinetics, and therefore also touch on the theme of Individuals to Populations. Moreover, IVIVE approaches have been demonstrated to have a substantial impact on hazard prioritization and ranking, thereby providing critical information to inform the theme of Community, Regulation, and Policy. Because of the large number of chemicals the Center will investigate, both individually and in combination, an essential element of the DSFC is to maintain, and augment as necessary, a repository of models and their parameters in a readily computable form. Pearce et al. recently developed an open-source library “httk” (high-throughput toxicokinetics) in the “R” statistical software program that includes toxicokinetic models and pre-calculated RTK factors, as well as computer code for dynamic simulation and Monte Carlo sampling using several different toxicokinetic model formulations. Center investigators have also extended these approaches to mixtures.
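To make the IVIVE calculation concrete, the Python sketch below uses a commonly cited steady-state formulation that combines renal filtration with well-stirred hepatic clearance, then converts an in vitro active concentration into an oral equivalent dose. It is not the httk API; the function names, units convention (all clearances per kg body weight), and example values are assumptions made for illustration.

```python
def css_mg_per_l(dose_rate_mg_kg_day, gfr, q_liver, f_ub, cl_int):
    """Steady-state plasma concentration (mg/L) for a constant oral dose rate.

    gfr, q_liver, and cl_int are in L/h per kg body weight; f_ub is the
    unbound fraction in plasma. Assumes complete oral absorption.
    """
    renal = gfr * f_ub
    hepatic = (q_liver * f_ub * cl_int) / (q_liver + f_ub * cl_int)
    return (dose_rate_mg_kg_day / 24.0) / (renal + hepatic)


def oral_equivalent_dose(ac50_um, mw_g_mol, gfr, q_liver, f_ub, cl_int):
    """Dose (mg/kg/day) whose steady-state concentration equals the AC50."""
    # Css in micromolar produced by a unit dose rate of 1 mg/kg/day.
    css_um = css_mg_per_l(1.0, gfr, q_liver, f_ub, cl_int) / mw_g_mol * 1000.0
    return ac50_um / css_um


# Hypothetical chemical: AC50 of 10 uM, molecular weight 200 g/mol.
oed = oral_equivalent_dose(
    ac50_um=10.0, mw_g_mol=200.0, gfr=0.1, q_liver=1.0, f_ub=0.5, cl_int=0.3
)
```

Because the model is linear in dose, the steady-state concentration for a unit dose rate serves as a scaling factor. Population variability can be layered on by sampling the physiological parameters, which is the role Monte Carlo sampling plays in the text above.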

Population-based dose-response modeling.

A number of commercial and freely available software packages can conduct dose-response modeling, but their use in evaluating toxicological datasets is still quite limited compared to the traditional application of pair-wise statistical tests to determine “no observed adverse effect levels” (NOAELs).

The limitations of the NOAEL as a starting point for toxicological risk assessment have been recognized for decades, and it is generally accepted that the benchmark dose (BMD), introduced by Crump, is more scientifically appropriate. The BMD is the dose associated with a specific size of effect, the benchmark response (BMR; we use M, for magnitude of effect, to denote this value). The BMD_M is estimated, with an associated confidence interval or statistical distribution, by statistical model fitting to dose-response data. In order to facilitate use of toxicological data generated by Center investigators in regulatory decision-making, BMD modeling services will be provided.
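As a small worked example of the BMD definition, the Python sketch below finds the dose at which an already fitted, increasing dose-response model reaches a relative change M from background. It covers only the inversion step, not model fitting or uncertainty estimation, and the exponential model and its parameter values are hypothetical.

```python
import math


def bmd_bisect(model, m, d_max, tol=1e-10):
    """Dose at which model(d) / model(0) - 1 equals m.

    Assumes model is increasing in dose with a positive background response.
    """
    target = model(0.0) * (1.0 + m)
    lo, hi = 0.0, d_max
    if model(hi) < target:
        raise ValueError("benchmark response not reached within d_max")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if model(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical fitted exponential model: f(d) = a * exp(b * d).
a, b = 2.0, 0.1
bmd_10 = bmd_bisect(lambda d: a * math.exp(b * d), m=0.10, d_max=100.0)

# For this model the BMD has a closed form, ln(1 + M) / b, useful as a check.
bmd_10_exact = math.log(1.10) / b
```

Repeating the inversion over posterior or bootstrap samples of the model parameters would give a distribution for BMD_M rather than a single value.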

We previously developed a standardized workflow for BMD modeling that was applied to hundreds of datasets, and which can be adapted to the needs of supported Center investigators. Additionally, the recent advent of high-throughput and population-based in vitro and in vivo models presents a new challenge. We have recently demonstrated development of a hierarchical Bayesian workflow for dose-response modeling of population data, which has a particular application to population-based in vitro data. This approach enabled a statistically rigorous analysis of population variability in toxic responses, providing key information on toxicodynamic susceptibility.

Population-based risk prediction.

Methods for population-based risk prediction for the purposes of informing environmental health decisions continue to evolve, with increasing emphasis on use of in vitro data, dose-response modeling, and probabilistic analyses of uncertainty and variability. Recent National Academy of Sciences reports have recommended development of predictive risk estimates that provide stakeholders and decision-makers with a fuller characterization of the potential population human health effects associated with different levels of exposure.

Recently, the World Health Organization (WHO) developed a harmonized probabilistic approach based on the concept of the “target human dose” (denoted HD_M^I) that responds directly to these National Academy of Sciences recommendations. While similar to the traditional “reference dose” (RfD) in its derivation, it differs in several important ways. First, being based on dose-response modeling, it is always tied to a specific health effect, unlike a NOAEL or RfD. Second, it is a harmonized approach applicable to both cancer and non-cancer effects. Furthermore, it extends existing decision benchmarks by quantitatively defining the protection goals in terms of population incidence (I) and magnitude (M) of the critical effect in the human population. This allows for incorporation into regulatory impact analyses that require economic benefit-cost estimates. Finally, it uses uncertainty distributions derived from historical and/or chemical-specific data and analyses in order to derive confidence intervals that quantify the level of uncertainty in the overall result.
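The probabilistic logic of a target human dose can be sketched with a simple Monte Carlo calculation in Python. This is a deliberately simplified illustration: it divides an uncertain BMD by two uncertain adjustment factors (interspecies, and human variability at a chosen incidence I), and every distribution and value below is a hypothetical placeholder rather than a recommended default.

```python
import math
import random


def sample_target_human_dose(n, seed=0):
    """Return sorted Monte Carlo samples of a simplified HD_M^I (mg/kg-day)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Hypothetical lognormal uncertainty distributions (median, GSD).
        bmd = rng.lognormvariate(math.log(10.0), math.log(1.5))
        af_inter = rng.lognormvariate(math.log(4.0), math.log(2.0))
        af_intra = rng.lognormvariate(math.log(10.0), math.log(2.5))
        samples.append(bmd / (af_inter * af_intra))
    return sorted(samples)


def percentile(sorted_vals, p):
    """Nearest-rank style percentile of a sorted list, with p in [0, 1]."""
    idx = int(round(p * (len(sorted_vals) - 1)))
    return sorted_vals[idx]


hd = sample_target_human_dose(2000)
hd_p05, hd_p50, hd_p95 = (percentile(hd, p) for p in (0.05, 0.50, 0.95))
```

The interval between the 5th and 95th percentiles is a confidence interval of the kind described above, and the lower bound plays a role comparable to a traditional reference value.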

Bioinformatics and the DSFC

The DSFC will provide guidance to facilitate robust bioinformatics analyses and workflows, as well as provide resources for development of novel bioinformatics methods. DSFC personnel are experienced in the design and implementation of software solutions and computational pipelines for biological problems, and will provide computational biology and bioinformatics programming support to assist research faculty, staff, and students with management and analysis of genomic and other biological data.

Custom Software Pipelines, Apps, and Workflows

The diversity of data stemming from different biological systems poses specific challenges to life sciences researchers during the data analysis steps, particularly to those who are not experienced in computer programming. Many of the tools developed within life sciences research come from researchers at various academic institutions and are deployed in a Unix/Linux computing environment, which requires at least a basic knowledge of those operating systems. The DSFC can support investigators and their staff by providing algorithm design and computer programming services to develop easy-to-use command-line tools and scripts, graphical user interfaces, and/or online web apps/interfaces that are customized to their data analysis workflow. Recent examples include:

lncRNApipe (https://github.com/biocoder/Perl-for-Bioinformatics/tree/master/NGS-Utils) is a command-line-based software pipeline for identifying putative long non-coding RNA from RNA-Seq reads prepared with the Total RNA-Seq protocol from Illumina. It is written in the Perl programming language.
SBEToolbox (https://github.com/biocoder/SBEToolbox) is a menu-driven graphical user interface written in MATLAB to compute various topological metrics and deduce various clusters within biological networks.
gQTL (https://genomics.tamu.edu/gqtl) is a web-based application for QTL analysis of Collaborative Cross Mice.

Bioinformatics Methods Development

In some cases, novel bioinformatics methods may need to be developed to address Center project needs. Past examples include the following:

Regulatory Pathway Inference and Classification. Many associations between gene expression-activation and clinical response are correlative as opposed to mechanistic. As a result, these associations are often poor predictors of how cellular signaling pathways will behave if perturbed by environmental exposure modifiers such as diet and exercise. Using systems biology approaches, we have developed mathematical models to integrate the dynamics of reactions in cells and interactions between cells in response to environmental exposure, leading to a dynamic, potentially personalized, “molecular signature.” We have recently demonstrated that the integration of existing pathway knowledge with expression data in a novel systems-based Bayesian approach can be used to achieve optimal classification accuracy, relative to standard data-driven approaches.
Prediction of Toxicology Outcomes using Data Integration and Bridging from Cell and Animal Models to Human Disease. One approach to understanding mechanisms from Stressors to Response, one of the key themes of this center, is through predictive models that integrate the different modalities of data, including prior information, for the different animal models and interventional and observational studies (environmental exposure and intervention treatments). Some approaches include the concept of the Coefficient of Determination (CoD), Bayesian network models, and Canonical Correlation Analysis (CCA). The CoD paradigm has been used to detect the so-called “master-slave” regulatory relationships between genes in a Gene Regulatory Network (GRN) and more recently to estimate the probabilities of detecting true feature sets. The Bayesian network model allows for a proper interpretation of plausible causality relationships between regulatory and regulated elements in a given pathway of interest. CCA can integrate existing pathway knowledge with expression data to detect diet-dependent interactions between host transcriptome and microbiome. More recently, we have developed a novel methodology for detecting correlative structures in NextGen sequencing data. This methodology can be used, for example, to examine exposure to phthalates during two critical developmental windows—prenatal and early childhood—on the gut metagenome and host gut gene expression.
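To illustrate the Coefficient of Determination mentioned above, the Python sketch below computes a plug-in (resubstitution) estimate for discrete data. It uses the usual definition, the relative reduction in prediction error when the predictor genes are observed compared with predicting the target gene without them. The function name and toy data are hypothetical, and a real analysis would need an error estimate that guards against overfitting.

```python
from collections import Counter


def coefficient_of_determination(xs, ys):
    """Plug-in CoD = (eps0 - eps_opt) / eps0 for discrete data.

    xs: list of hashable predictor states (e.g., tuples of binarized genes).
    ys: list of target states, same length as xs.
    """
    n = len(ys)
    # Error of the best constant predictor (no predictors observed).
    eps0 = 1.0 - max(Counter(ys).values()) / n
    if eps0 == 0.0:
        return 0.0  # target is constant; nothing left to explain
    # Error of the best predictor given xs: majority vote within each state.
    best_count = {}
    for (x, _y), count in Counter(zip(xs, ys)).items():
        best_count[x] = max(best_count.get(x, 0), count)
    eps_opt = 1.0 - sum(best_count.values()) / n
    return (eps0 - eps_opt) / eps0


# Toy example: the target gene copies the first predictor gene exactly.
predictors = [(0, 1), (0, 0), (1, 1), (1, 0)]
target = [0, 0, 1, 1]
cod = coefficient_of_determination(predictors, target)
```

A value near 1 indicates that the predictors almost fully determine the target, while a value near 0 indicates that they add little beyond the target's own distribution.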

Geospatial Sciences at the DSFC

There are growing needs for geospatial risk analyses, especially in support of community engagement and other community-based participatory research across the Center.

Spatially explicit study design and modeling

This core provides tailored tools to support four categories of center activities: data collection, integration/curation, data analytics, and visualization. For data collection, remote sensing and LiDAR data can be employed to develop spatial sampling schemes. When these are combined with demographic and socioeconomic data, the center can develop cluster or stratified sampling strategies for cross-sectional or longitudinal survey methods. Regarding data integration, geospatial and temporal features can be the primary data linkage, while other formats of data can be converted into a geospatial database by geocoding/georeferencing/georectification. As the projects use sensor networks (e.g., biosensing, environmental monitoring, crowd-sensing), contextual data that capture the dynamics of the environment and human exposure can be obtained and scaled for analysis. Next, geoprocessing, spatial interpolation, spatial autocorrelation, and spatial machine learning methods can be applied to model global and local geospatial processes or address spatial dependence and heterogeneity. Lastly, static maps (e.g., choropleth maps, heatmaps) and interactive webmaps and dashboards will be developed to visualize datasets and project results. In addition, DSFC personnel have experience leveraging virtual and augmented reality technologies for realistic representations of the environment. Three-dimensional digital models will be developed based on LiDAR point clouds in ESRI CityEngine and game engines (e.g., Unreal, Unity) to simulate various environmental hazards (e.g., flooding, oil spills).
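As one concrete instance of spatial interpolation, the Python sketch below implements inverse distance weighting (IDW), which estimates a value at an unsampled location as a distance-weighted average of nearby measurements. IDW is chosen here only because it is simple to show; the text does not specify which interpolation methods are used. The sketch assumes planar (projected) coordinates and uses hypothetical sensor readings.

```python
import math


def idw(samples, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    samples: list of (x, y, value) tuples in planar coordinates.
    """
    num = 0.0
    den = 0.0
    for sx, sy, value in samples:
        d = math.hypot(x - sx, y - sy)
        if d == 0.0:
            return value  # exactly on a sample point
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den


# Hypothetical sensor readings: (easting, northing, concentration).
sensors = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0), (0.0, 2.0, 30.0)]
estimate = idw(sensors, 1.0, 0.0)
```

A larger power makes the estimate more local, and methods such as kriging would additionally model spatial autocorrelation and provide uncertainty estimates.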

Geospatial data integration for climate and environmental justice assessment

Texas environmental justice (EJ) communities face compounded hazards that manifest as “acute-on-chronic” stressors. There is increasing recognition that health disparities are geospatial, closely tied to the spatial-temporal heterogeneity of environmental and social stressor distribution. For example, populations in closer proximity to Toxic Release Inventory sites, industrial land uses, or transportation corridors run a higher risk of being affected by toxics through various exposure pathways; these areas are often occupied by communities of color. By integrating domains of multisource and multi-type data (e.g., climate extremes, environment, infrastructure, socioeconomic, and health), this core can identify vulnerable populations with specific exposure levels, pathways, and scales relevant to their unique needs, further increasing the validity and accuracy of methodologies and the robustness of their findings.
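A proximity screen of the kind described above can be sketched in a few lines of Python: compute great-circle distances from each community location to each facility and flag locations that fall within a buffer. The coordinates, the buffer distance, and the function names are hypothetical, and distance alone is only a crude proxy for exposure.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_buffer(points, sites, buffer_km):
    """For each (lat, lon) point, report whether any site lies within buffer_km."""
    return [
        any(haversine_km(plat, plon, slat, slon) <= buffer_km for slat, slon in sites)
        for plat, plon in points
    ]


# Hypothetical coordinates for two neighborhoods and one facility.
neighborhoods = [(29.70, -95.25), (30.60, -96.30)]
facilities = [(29.72, -95.23)]
flags = within_buffer(neighborhoods, facilities, buffer_km=5.0)
```

In practice, such flags would be joined with demographic, socioeconomic, and health data to characterize which populations bear the highest combined burden.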

GeoDesign as a community-based participatory research framework

Geospatial technology can not only reveal distributive injustice in existing environmental stressors and health disparities, but also promote procedural justice by enabling information access and meaningful participation of community members, thus facilitating equitable allocation of resources and decision-making. The DSFC will utilize a GeoDesign framework with six key modeling approaches (representation models, process models, evaluation models, change models, impact models, and decision models) to support community-based participatory research. The representation and process models aim to enable community-driven problem identification through efficient spatial data sharing and modeling of individual, household, and community level health risk factors. The evaluation and change models utilize scenario-based approaches to evaluate health outcomes based on community co-design solutions. The impact and decision models assess the primary health outcomes and auxiliary benefits of the solutions, as well as any unintended social consequences.