=====
Paper
=====

======
Python
======

==========
TensorFlow
==========

=======
PyTorch
=======

=====
Keras
=====

======
Topics
======

====
Link
====

=====
Video
=====

===========
Drug Design
===========

=================
Materials Science
=================

=========
Economics
=========


===========================
Natural Language Processing
===========================

BERT Biological Language Model nltk
HuggingFace Transformers DNABERT transformer-xl
bert GitHub word vector GPT
BigBird longformer (GitHub)  
google/bigbird-pegasus-large-bigpatent allenai/longformer-base-4096  

Protein Language Models (Paper) || BLM

pLM4ACE: A protein language model based predictor for antihypertensive peptide screening. 2024. Food Chemistry
UniDL4BioPep: a universal deep learning architecture for binary classification in peptide bioactivity
A High Efficient Biological Language Model for Predicting Protein-Protein Interactions
Linguistically inspired roadmap for building biologically reliable protein language models (NATURE MACHINE INTELLIGENCE)
NeuroPred-PLM: an interpretable and robust model for neuropeptide prediction by protein language model
SHINE: protein language model-based pathogenicity prediction for short inframe insertion and deletion variants
A robust protein language model for SARS-CoV-2 protein-protein interaction network prediction
Genome-wide prediction of disease variant effects with a deep protein language model (NATURE GENETICS)
QuoteTarget: A sequence-based transformer protein language model to identify potentially druggable protein targets
Generative power of a protein language model trained on multiple sequence alignments
Evolutionary-scale prediction of atomic-level protein structure with a language model (science)
Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models
Survey of Protein Sequence Embedding Models
BepiPred-3.0: Improved B-cell epitope prediction using protein language models
TCRconv: predicting recognition between T cell receptors and epitopes using contextualized motifs
De novo design of protein structure and function with RFdiffusion (Nature)
ProteinBERT: a universal deep-learning model of protein sequence and function
DrugFinder: Druggable Protein Identification Model Based on Pre-Trained Models and Evolutionary Information
DeepLoc 2.0: multi-label subcellular localization prediction using protein language models
Finding functional motifs in protein sequences with deep learning and natural language models
Identifying B-cell epitopes using AlphaFold2 predicted structures and pretrained language model
SignalP 6.0 predicts all five types of signal peptides using protein language models (NATURE BIOTECHNOLOGY)
IUP-BERT: Identification of Umami Peptides Based on BERT Features. 2022
Learning meaningful representations of protein sequences. 2022
Identifying promising sequences for protein engineering using a deep transformer protein language model. 2023
The language of proteins: NLP, machine learning & protein sequences. 2021
Leveraging transformers-based language models in proteome bioinformatics. 2023. Review
Learning the Protein Language Model of SARS-CoV-2 Spike Proteins. 2023
DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts. 2022
Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction. 2022
Superior protein thermophilicity prediction with protein language model embeddings. 2023
A method for multiple-sequence-alignment-free protein structure prediction using a protein language model. 2023. NATURE MACHINE INTELLIGENCE
TooT-BERT-M: Discriminating membrane proteins from non-membrane proteins using a BERT representation of protein primary sequences. 2022
Predicting the specific substrate for transmembrane transport proteins using BERT language model. 2022
Embeddings from protein language models predict conservation and variant effects. 2022
Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings. 2023
AggBERT: Best in Class Prediction of Hexapeptide Amyloidogenesis with a Semi-Supervised ProtBERT Model. 2023
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. 2021. Highly cited
Learning the language of viral evolution and escape. 2021. Science
Efficient evolution of human antibodies from general protein language models. 2023. NATURE BIOTECHNOLOGY
ProGen2: Exploring the boundaries of protein language models. 2023. CELL SYSTEMS
IgLM: Infilling language modeling for antibody sequence design. 2023. CELL SYSTEMS
ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. 2022. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. Highly cited
Learning functional properties of proteins with language models. NATURE MACHINE INTELLIGENCE. 2022
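Several of the entries above (notably the embedding surveys) turn on one idea: a variable-length protein sequence is mapped to a fixed-length numeric vector. As a purely illustrative, stdlib-only baseline, not a protein language model and not any listed paper's method, an amino-acid composition vector shows the input/output shape of that idea:

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_vector(sequence: str) -> list[float]:
    """Map a protein sequence to a 20-dim amino-acid frequency vector.

    This is a classical hand-crafted representation; protein language
    models such as ESM or ProtTrans replace it with learned, context-
    dependent embeddings, but the interface is the same: sequence in,
    fixed-length numeric vector out.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS) or 1  # avoid div-by-zero
    return [counts[aa] / total for aa in AMINO_ACIDS]

vec = composition_vector("MKTAYIAKQR")  # toy 10-residue sequence
print(len(vec))            # 20
print(round(sum(vec), 6))  # 1.0
```

Downstream classifiers in many of the listed papers consume exactly such per-sequence vectors; swapping this baseline for pLM embeddings is what typically drives the reported accuracy gains.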

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

DNA Language Models (Paper)

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome || GitHub
iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations
BERT-Promoter: An improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection
Improving language model of human genome for DNA–protein binding prediction based on task-specific pre-training
miProBERT: identification of microRNA promoters based on the pre-trained model BERT. 2023
M6A-BERT-Stacking: A Tissue-Specific Predictor for Identifying RNA N6-Methyladenosine Sites Based on BERT and Stacking Strategy. 2023
Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review. 2023
Predicting the Sequence Specificities of DNA-Binding Proteins by DNA Fine-Tuned Language Model With Decaying Learning Rates . 2023
Classification of Highly Divergent Viruses from DNA/RNA Sequence Using Transformer-Based Models . 2023
Prediction of MicroRNA Subcellular Localization by Using a Sequence-to-Sequence Model . 2018
TSSNote-CyaPromBERT: Development of an integrated platform for highly accurate promoter prediction and visualization of Synechococcus sp. and Synechocystis sp. through a state-of-the-art natural language processing model BERT . 2022
ViBE: a hierarchical BERT model to identify eukaryotic viruses using metagenome sequencing data . 2022
A self-supervised deep learning method for data-efficient training in genomics. 2023
PLPMpro: Enhancing promoter sequence prediction with prompt-learning based pre-trained language model. 2023
SENet: A deep learning framework for discriminating super- and typical enhancers by sequence information. 2023
GSRNet, an adversarial training-based deep framework with multi-scale CNN and BiGRU for predicting genomic signals and regions. 2023
A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome. 2023
BERT2OME: Prediction of 2'-O-Methylation Modifications From RNA Sequence by Transformer Architecture Based on BERT. 2023
PD-BertEDL: An Ensemble Deep Learning Method Using BERT and Multivariate Representation to Predict Peptide Detectability. 2022
Identification of bacteriophage genome sequences with representation learning. 2022
Immune2vec: Embedding B/T Cell Receptor Sequences in ℝ^N Using Natural Language Processing. 2021
Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. 2022
Effective gene expression prediction from sequence by integrating long-range interactions. 2021. NATURE METHODS
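DNABERT-style genome models in the list above tokenize DNA into overlapping k-mers before feeding it to a transformer. A minimal stdlib sketch of that tokenization step, using k = 6 as the commonly cited default (vocabulary construction and special-token handling in any particular model may differ):

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Treating each k-mer as one token, a sequence of length L yields
    L - k + 1 tokens; sequences shorter than k yield no tokens.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

tokens = kmer_tokenize("ATGGCTA", k=3)
print(tokens)  # ['ATG', 'TGG', 'GGC', 'GCT', 'CTA']
```

With stride 1 the k-mers overlap heavily, which is deliberate: each nucleotide then appears in up to k tokens, giving the model several contextual views of every position.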

RNA Language Models (Paper)

miProBERT: identification of microRNA promoters based on the pre-trained model BERT
preMLI: a pre-trained method to uncover microRNA-lncRNA potential interactions
scGPT, the first large foundation language model for single-cell biology, pre-trained on more than 10 million cells || GitHub
scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI pdf
Prediction of RNA-protein interactions using a nucleotide language model
Rm-LR: A long-range-based deep learning model for predicting multiple types of RNA modifications. 2023
Multiple sequence alignment-based RNA language model and its application to structural inference. Nucleic Acids Research. 2023 Nov

Foundation Models and Other Topics

Large language models encode clinical knowledge. 2023
Large language models in medicine. 2023
On the opportunities and risks of foundation models. 2021
The future landscape of large language models in medicine. 2023
A foundation model for generalizable disease detection from retinal images. 2023
Representation learning applications in biological sequence analysis. 2021

Accelerating drug target inhibitor discovery with a deep generative foundation model. 2023
Genomic benchmarks: a collection of datasets for genomic sequence classification. 2023
DeepBIO: an automated and interpretable deep-learning platform for high-throughput biological sequence prediction, functional annotation and visualization analysis. 2023

Fine-tuning techniques for large language models: SFT, LoRA, and Freeze supervised fine-tuning methods

Beijing Academy of Artificial Intelligence (BAAI) || Shanghai AI Laboratory || Baichuan Intelligence

FlagEval October leaderboard: Aquila2-34B, InternLM-20B, Qwen-14B, and other newly added models
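Of the fine-tuning methods mentioned above, LoRA is the most algorithmically distinctive: the pretrained weight matrix W stays frozen, and only a low-rank product B·A is trained and added on top. A dependency-free toy sketch of the effective-weight computation (the dimensions, scaling convention, and rank-1 example are illustrative, not taken from any specific library):

```python
def matmul(X, Y):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_effective_weight(W, A, B, alpha=1.0):
    """Return W + (alpha / r) * B @ A, the effective LoRA weight.

    W: (d_out x d_in) frozen pretrained weight.
    B: (d_out x r) and A: (r x d_in) trainable low-rank factors, r << d_in.
    Only r * (d_in + d_out) parameters are trained instead of d_out * d_in.
    """
    r = len(A)                      # rank of the update
    delta = matmul(B, A)            # low-rank weight update
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Toy 2x2 weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # d_out x r, r = 1
A = [[0.5, 0.5]]     # r x d_in
print(lora_effective_weight(W, A, B))  # [[1.5, 0.5], [1.0, 2.0]]
```

Freeze tuning, by contrast, simply disables gradients for most layers, and plain SFT updates all weights; LoRA sits in between, keeping the base model intact while adding a small trainable delta.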

=========================
Biological Language Model
=========================

Large language models like GPT-4 have taken the world by storm with their remarkable mastery of natural language. However, the most important long-term opportunity for large language models (LLMs) will require an entirely different kind of language: the language of biology. A striking theme has emerged from the long march of research advances in biochemistry, molecular biology, and genetics over the past century: biology, it turns out, is a decipherable, programmable, and in some respects even digital system. DNA uses only four variables, A (adenine), C (cytosine), G (guanine), and T (thymine), to encode the complete genetic instructions for every living organism on Earth. Compare this to modern computing systems, which encode all the world's digital electronic information using two variables (0 and 1). One system is binary and the other is quaternary, but the two have surprising conceptual overlap; both can rightly be considered digital.
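The binary-versus-quaternary comparison can be made concrete: each DNA base carries exactly log2(4) = 2 bits, so any sequence maps losslessly onto a bit string. A small stdlib sketch (the particular base-to-bits assignment is an arbitrary choice):

```python
# Each DNA base carries log2(4) = 2 bits, so one base maps to two binary digits.
BASE_TO_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}

def dna_to_bits(seq: str) -> str:
    """Encode a DNA string into a binary string, 2 bits per base."""
    return "".join(BASE_TO_BITS[b] for b in seq.upper())

def bits_to_dna(bits: str) -> str:
    """Decode a binary string (even length) back into the DNA string."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

encoded = dna_to_bits("GATTACA")
print(encoded)               # 10001111000100
print(bits_to_dna(encoded))  # GATTACA
```

The round trip is lossless, which is the precise sense in which the genome can be called a digital medium: its information content survives conversion to and from ordinary bits unchanged.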