
Quantitative Biology

New submissions

[ total of 13 entries: 1-13 ]

New submissions for Mon, 20 May 24

[1]  arXiv:2405.10343 [pdf, other]
Title: UniCorn: A Unified Contrastive Learning Approach for Multi-view Molecular Representation Learning
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)

Recently, a noticeable trend has emerged in developing pre-trained foundation models in the domains of CV and NLP. However, molecular pre-training still lacks a universal model that applies effectively to various categories of molecular tasks, since existing prevalent pre-training methods are each effective for specific types of downstream tasks. Furthermore, the lack of a profound understanding of existing pre-training methods, including 2D graph masking, 2D-3D contrastive learning, and 3D denoising, hampers the advancement of molecular foundation models. In this work, we provide a unified comprehension of existing pre-training methods through the lens of contrastive learning. Thus their distinctions lie in clustering different views of molecules, which is shown to be beneficial to specific downstream tasks. To achieve a complete and general-purpose molecular representation, we propose a novel pre-training framework, named UniCorn, that inherits the merits of the three methods, depicting molecular views in three different levels. State-of-the-art performance across quantum, physicochemical, and biological tasks, along with a comprehensive ablation study, validates the universality and effectiveness of UniCorn.

[2]  arXiv:2405.10345 [pdf, other]
Title: Machine Learning Driven Biomarker Selection for Medical Diagnosis
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Recent advances in experimental methods have enabled researchers to collect data on thousands of analytes simultaneously. This has led to correlational studies that associated molecular measurements with diseases such as Alzheimer's, liver, and gastric cancer. However, the use of thousands of biomarkers selected from the analytes is not practical for real-world medical diagnosis and is likely undesirable due to the risk of spurious correlations. In this study, we evaluate 4 different methods for biomarker selection and 4 different machine learning (ML) classifiers for identifying correlations, evaluating 16 approaches in all. We found that contemporary methods outperform previously reported logistic regression in cases where 3 and 10 biomarkers are permitted. When specificity is fixed at 0.9, ML approaches produced a sensitivity of 0.240 (3 biomarkers) and 0.520 (10 biomarkers), while standard logistic regression provided a sensitivity of 0.000 (3 biomarkers) and 0.040 (10 biomarkers). We also noted that causal-based methods for biomarker selection proved to be the most performant when fewer biomarkers were permitted, while univariate feature selection was the most performant when a greater number of biomarkers were permitted.
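
The metric quoted in this abstract, sensitivity at a fixed specificity of 0.9, can be illustrated with a minimal sketch. The function name, the toy scores, and the thresholding rule (predict "case" when the score exceeds the threshold) are assumptions made for illustration; they are not taken from the paper.

```python
def sensitivity_at_specificity(scores, labels, target_specificity=0.9):
    """Best sensitivity achievable while keeping specificity >= target.

    scores: classifier outputs, higher means "more likely a case".
    labels: 1 for a case, 0 for a control.
    (Hypothetical helper for illustration, not the paper's code.)
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one case and one control")

    best = 0.0
    # Try every observed score as a threshold; predict "case" if score > threshold.
    for threshold in sorted(set(scores)):
        true_negatives = sum(1 for s in negatives if s <= threshold)
        specificity = true_negatives / len(negatives)
        if specificity >= target_specificity:
            true_positives = sum(1 for s in positives if s > threshold)
            best = max(best, true_positives / len(positives))
    return best


# Toy data: ten controls scored 0..9 and four cases.
toy_scores = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 8.5, 9.5, 5, 2]
toy_labels = [0] * 10 + [1] * 4
toy_result = sensitivity_at_specificity(toy_scores, toy_labels, 0.9)
```

On the toy data, a threshold of 8 keeps nine of ten controls negative (specificity 0.9) and catches two of the four cases, so `toy_result` is 0.5.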

[3]  arXiv:2405.10348 [pdf, other]
Title: Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Protein-protein bindings play a key role in a variety of fundamental biological processes, and thus predicting the effects of amino acid mutations on protein-protein binding is crucial. To tackle the scarcity of annotated mutation data, pre-training with massive unlabeled data has emerged as a promising solution. However, this process faces a series of challenges: (1) complex higher-order dependencies among multiple (more than paired) structural scales have not yet been fully captured; (2) it is rarely explored how mutations alter the local conformation of the surrounding microenvironment; (3) pre-training is costly, both in data size and computational burden. In this paper, we first construct a hierarchical prompt codebook to record common microenvironmental patterns at different structural scales independently. Then, we develop a novel codebook pre-training task, namely masked microenvironment modeling, to model the joint distribution of each mutation with its residue types, angular statistics, and local conformational changes in the microenvironment. With the constructed prompt codebook, we encode the microenvironment around each mutation into multiple hierarchical prompts and combine them to flexibly provide information to wild-type and mutated protein complexes about their microenvironmental differences. Such a hierarchical prompt learning framework has demonstrated superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction and a case study of optimizing human antibodies against SARS-CoV-2.

[4]  arXiv:2405.10432 [pdf, ps, other]
Title: Lysine-Cysteine-Serine-Tryptophan Inserted into the DNA-Binding Domain of Human Mineralocorticoid Receptor Increases Transcriptional Activation by Aldosterone
Comments: 21 pages, 5 figures
Subjects: Biomolecules (q-bio.BM)

Due to alternative splicing in an ancestral DNA-binding domain (DBD) of the mineralocorticoid receptor (MR), humans contain two almost identical MR transcripts with either 984 amino acids (MR-984) or 988 amino acids (MR-988), in which their DBDs differ by only four amino acids, Lys,Cys,Ser,Trp (KCSW). Human MRs also contain mutations at two sites, codons 180 and 241, in the amino terminal domain (NTD). Together, there are five distinct full-length human MR genes in GenBank. Human MR-984, which was cloned in 1987, has been extensively studied. Human MR-988, cloned in 1995, contains KCSW in its DBD. Neither this human MR-988 nor the other human MR-988 genes have been studied for their response to aldosterone and other corticosteroids. Here, we report that transcriptional activation of human MR-988 by aldosterone is increased by about 50% compared to activation of human MR-984 in HEK293 cells transfected with the TAT3 promoter, while the half-maximal response (EC50) is similar for aldosterone activation of MR-984 and MR-988. Transcriptional activation of human MR also depends on the amino acids at codons 180 and 241. Interestingly, in HEK293 cells transfected with the MMTV promoter, transcriptional activation by aldosterone of human MR-988 is similar to activation of human MR-984, indicating that the promoter has a role in the regulation of the response of human MR-988 to aldosterone. The physiological responses to aldosterone and other corticosteroids in humans with MR genes containing KCSW and with differences at codons 180 and 241 in the NTD warrant investigation.

[5]  arXiv:2405.10486 [pdf, ps, other]
Title: Comparison of reaction networks of insulin signaling
Comments: 18 pages, 0 figures
Subjects: Molecular Networks (q-bio.MN)

Understanding the insulin signaling cascade provides insights on the underlying mechanisms of biological phenomena such as insulin resistance, diabetes, Alzheimer's disease, and cancer. For this reason, previous studies utilized chemical reaction network theory to perform comparative analyses of reaction networks of insulin signaling in healthy (INSMS: INSulin Metabolic Signaling) and diabetic cells (INRES: INsulin RESistance). This study extends these analyses using various methods which give further insights regarding insulin signaling. Using embedded networks, we discuss evidence of the presence of a structural "bifurcation" in the signaling process between INSMS and INRES. Concordance profiles of INSMS and INRES show that both have a high propensity to remain monostationary. Moreover, the concordance properties allow us to present heuristic evidence that INRES has a higher level of stability beyond its monostationarity. Finally, we discuss a new way of analyzing reaction networks through network translation. This method gives rise to three new insights: (i) each stoichiometric class of INSMS and INRES contains a unique positive equilibrium; (ii) any positive equilibrium of INSMS is exponentially stable and is a global attractor in its stoichiometric class; and (iii) any positive equilibrium of INRES is locally asymptotically stable. These results open up opportunities for collaboration with experimental biologists to understand insulin signaling better.

[6]  arXiv:2405.10488 [pdf, ps, other]
Title: Comparative prospects of imaging methods for whole-brain mammalian connectomics
Comments: See page 10 after references for Supplemental Information
Subjects: Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)

Mammalian whole-brain connectomes are a crucial ingredient for holistic understanding of brain function. Imaging these connectomes at sufficient resolution to densely reconstruct cellular morphology and synapses represents a longstanding goal in neuroscience. Although the technologies needed to reconstruct whole-brain connectomes have not yet reached full maturity, they are advancing rapidly enough that the mouse brain might be within reach in the near future. Human connectomes remain a more distant goal. Here, we quantitatively compare existing and emerging imaging technologies that have potential to enable whole-brain mammalian connectomics. We perform calculations on electron microscopy (EM) techniques and expansion microscopy coupled with light-sheet fluorescence microscopy (ExLSFM) methods. We consider techniques from the literature that have sufficiently high resolution to identify all synapses and sufficiently high speed to be relevant for whole mammalian brains. Each imaging modality comes with benefits and drawbacks, so we suggest that attacking the problem through multiple approaches could yield the best outcomes. We offer this analysis as a resource for those considering how to organize efforts towards imaging whole-brain mammalian connectomes.

[7]  arXiv:2405.10812 [pdf, other]
Title: VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling
Comments: ICML 2024. Preprint V1 with 16 pages and 5 figures
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)

Similar to natural language models, pre-trained genome language models are proposed to capture the underlying intricacies within genomes with unsupervised sequence modeling. They have become essential tools for researchers and practitioners in biology. However, the hand-crafted tokenization policies used in these models may not encode the most discriminative patterns from the limited vocabulary of genomic data. In this paper, we introduce VQDNA, a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebook as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings in an end-to-end manner. To further push its limits, we propose Hierarchical Residual Quantization (HRQ), where varying scales of codebooks are designed in a hierarchy to enrich the genome vocabulary in a coarse-to-fine manner. Extensive experiments on 32 genome datasets demonstrate VQDNA's superiority and favorable parameter efficiency compared to existing genome language models. Notably, empirical analysis of SARS-CoV-2 mutations reveals the fine-grained pattern awareness and biological significance of learned HRQ vocabulary, highlighting its untapped potential for broader applications in genomics.
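
For readers unfamiliar with the terms, the general idea of codebook lookup and coarse-to-fine residual quantization can be sketched as follows. This is generic residual vector quantization with fixed toy codebooks, offered as an assumption-laden illustration; it is not the VQDNA or HRQ implementation, and the function names and codebook values are hypothetical.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def residual_quantize(vector, codebooks):
    """Quantize coarse-to-fine: each level encodes what earlier levels missed.

    Returns the chosen index per level and the summed reconstruction.
    """
    residual = list(vector)
    reconstruction = [0.0] * len(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        indices.append(idx)
        reconstruction = [r + c for r, c in zip(reconstruction, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, reconstruction


# Hypothetical two-level codebooks: a coarse level and a finer correction level.
coarse = [[0.0, 0.0], [4.0, 4.0]]
fine = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
indices, reconstruction = residual_quantize([4.9, 4.1], [coarse, fine])
```

Here the coarse level picks `[4.0, 4.0]` and the fine level adds `[1.0, 0.0]`, so the input `[4.9, 4.1]` is represented by the index pair `[1, 1]` and reconstructed as `[5.0, 4.0]`. In a learned setting the codebook entries would be trained rather than fixed.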

[8]  arXiv:2405.10911 [pdf, other]
Title: A minimal scenario for the origin of non-equilibrium order
Subjects: Populations and Evolution (q-bio.PE); Statistical Mechanics (cond-mat.stat-mech); Molecular Networks (q-bio.MN)

Extant life contains numerous non-equilibrium mechanisms to create order not achievable at equilibrium; it is generally assumed that these mechanisms evolved because the resulting order was sufficiently beneficial to overcome associated costs of time and energy. Here, we identify a broad range of conditions under which non-equilibrium order-creating mechanisms will evolve as an inevitable consequence of self-replication, even if the order is not directly functional. We show that models of polymerases, when expanded to include known stalling effects, can evolve kinetic proofreading through selection for fast replication alone, consistent with data from recent mutational screens. Similarly, replication contingent on fast self-assembly can select for non-equilibrium instabilities and result in more ordered structures without any direct selection for order. We abstract these results into a framework that predicts that self-replication intrinsically amplifies dissipative order-enhancing mechanisms if the distribution of replication times is wide enough. Our work suggests the intriguing possibility that non-equilibrium order can arise more easily than assumed, even before that order is directly functional, with consequences ranging from mutation rate evolution and kinetic traps in self-assembly to the origin of life.

Cross-lists for Mon, 20 May 24

[9]  arXiv:2405.10283 (cross-list from cond-mat.stat-mech) [pdf, other]
Title: Power-law relaxation of a confined diffusing particle subject to resetting with memory
Comments: 19 pages, 3 figures
Subjects: Statistical Mechanics (cond-mat.stat-mech); Populations and Evolution (q-bio.PE)

We study the relaxation of a Brownian particle with long range memory under confinement in one dimension. The particle diffuses in an arbitrary confining potential and resets at random times to previously visited positions, chosen with a probability proportional to the local time spent there by the particle since the initial time. This model mimics an animal which moves erratically in its home range and returns preferentially to familiar places from time to time. The steady state density of the position is given by the equilibrium Boltzmann-Gibbs distribution, as in standard diffusion, while the transient part of the density can be obtained through a mapping of the Fokker-Planck equation of the process to a Schrödinger eigenvalue problem. Due to memory, the approach at large time toward the steady state is critically self-organised, in the sense that it always follows a sluggish power-law form, in contrast to the exponential decay that characterises Markov processes. The exponent of this power-law depends in a simple way on the resetting rate and on the relaxation rate of the Brownian particle in the absence of resetting. We apply these findings to several exactly solvable examples, such as the harmonic, V-shaped and box potentials.
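
The dynamics described in this abstract can be simulated with a rough discrete-time sketch. It assumes a harmonic potential, an Euler-Maruyama step, and a per-step reset probability standing in for the resetting rate; the function and parameter names are hypothetical and the paper itself works analytically. Picking a uniformly random past time step and jumping to the position held then is used here as the discrete analogue of choosing a site with probability proportional to the time spent there.

```python
import random


def simulate_memory_resetting(steps, dt=0.01, diffusion=1.0, stiffness=1.0,
                              reset_prob=0.05, x0=0.0, seed=0):
    """Discrete-time sketch of a confined diffusing particle with memory resetting.

    Assumptions (not from the paper): harmonic potential V(x) = stiffness * x**2 / 2,
    Euler-Maruyama integration, and a fixed reset probability per time step.
    """
    rng = random.Random(seed)
    x = x0
    trajectory = [x0]
    for _ in range(steps):
        if rng.random() < reset_prob:
            # Uniform choice over past time steps: positions occupied longer
            # appear more often in the history and are chosen more often.
            x = rng.choice(trajectory)
        else:
            drift = -stiffness * x  # force -dV/dx for the harmonic potential
            noise = rng.gauss(0.0, 1.0) * (2.0 * diffusion * dt) ** 0.5
            x = x + drift * dt + noise
        trajectory.append(x)
    return trajectory


path = simulate_memory_resetting(steps=1000)
```

Setting `reset_prob=1.0` is a quick sanity check: the particle can then only revisit its starting point, so every entry of the trajectory equals `x0`.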

[10]  arXiv:2405.10625 (cross-list from cs.CL) [pdf, other]
Title: Specialising and Analysing Instruction-Tuned and Byte-Level Language Models for Organic Reaction Prediction
Comments: Preprint
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)

Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining using tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: Can FlanT5 and ByT5, the encoder-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study on several key issues of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that despite being pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation to fine-tune for reaction prediction, and thus become 'chemistry domain compatible' in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful yet not essential to leverage the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy although some variation across different models does exist. Notably, tokenisation and vocabulary trimming slightly affect final performance but can speed up training and inference; the most efficient greedy decoding strategy is very competitive while only marginal gains can be achieved from more sophisticated decoding algorithms. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.

[11]  arXiv:2405.10780 (cross-list from eess.SP) [pdf, ps, other]
Title: Intelligent Neural Interfaces: An Emerging Era in Neurotechnology
Subjects: Signal Processing (eess.SP); Hardware Architecture (cs.AR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Integrating smart algorithms on neural devices presents significant opportunities for various brain disorders. In this paper, we review the latest advancements in the development of three categories of intelligent neural prostheses featuring embedded signal processing on the implantable or wearable device. These include: 1) Neural interfaces for closed-loop symptom tracking and responsive stimulation; 2) Neural interfaces for emerging network-related conditions, such as psychiatric disorders; and 3) Intelligent BMI SoCs for movement recovery following paralysis.

Replacements for Mon, 20 May 24

[12]  arXiv:2403.15523 (replaced) [pdf, other]
Title: Towards auditory attention decoding with noise-tagging: A pilot study
Comments: 6 pages, 2 figures, 9th Graz Brain-Computer Interface Conference 2024
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[13]  arXiv:2404.10260 (replaced) [pdf, other]
Title: HelixFold-Multimer: Elevating Protein Complex Structure Prediction to New Heights
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)