======
Papers
======

======
Python
======

==========
TensorFlow
==========

=======
PyTorch
=======

=====
Keras
=====

======
Topics
======

=====
Links
=====

======
Videos
======

===========
Drug Design
===========

=================
Materials Science
=================

=====================
Economics and Finance
=====================

===========================
Natural Language Processing
===========================

 

Keywords: BERT, Biological Language Model, nltk, HuggingFace Transformers, DNABERT, transformer-xl, BERT GitHub, word vector, GPT
 

Models on Hugging Face


==============
Protein Models
==============


1. ESM-2 (Hugging Face) (15B, 3B, ..., 8M)

ESM-2 is a state-of-the-art protein model trained on a masked language modelling objective. It is suitable for fine-tuning on a wide range of tasks that take protein sequences as input. For detailed information on the model architecture and training data, please refer to the accompanying paper. You may also be interested in some demo notebooks (PyTorch, TensorFlow) which demonstrate how to fine-tune ESM-2 models on your tasks of interest.

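A minimal fine-tuning setup sketch (not taken from the demo notebooks; it assumes the smallest public checkpoint id, facebook/esm2_t6_8M_UR50D, and the standard Transformers classes):

# Minimal sketch: loading an ESM-2 checkpoint for sequence-level fine-tuning.
# Assumption: the smallest public checkpoint, facebook/esm2_t6_8M_UR50D.
from transformers import AutoTokenizer, EsmForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmForSequenceClassification.from_pretrained(
    "facebook/esm2_t6_8M_UR50D", num_labels=2  # e.g. a binary protein property
)

# Tokenize one amino-acid sequence and run a forward pass.
inputs = tokenizer("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (1, num_labels)

The same pattern applies to the larger checkpoints; only the model id and memory requirements change.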


2. proteinBERT (Hugging Face)

Model description

Pretrained Protein language model, using a mixed masked language modeling (MLM) & ELECTRA objective, as well as an additional pretraining task of predicting GO (Gene ontology) function for all UniRef90 proteins.

It was introduced in our ProteinBERT paper and is also fully available in the Github repository - https://github.com/nadavbra/protein_bert.

Intended uses & limitations

A pretrained language model for predicting Protein (AA) sequences and their properties. Can predict on new tasks, including whole-sequence or local (per-position) tasks, including classification, multilabel and regression. Expected input is an amino acid (protein) sequence. The model provided here outputs the concatenated embedding of all hidden states. Can be adapted for any application.

Caveat:

Conversion of the model may have changed compatibility, as TensorFlow "sanitized" input-seq to input_seq and input-annotations to input_annotations. In case of compatibility issues or errors, we refer to the original pretraining & finetuning code, model dump and ProteinBERT package: https://github.com/nadavbra/protein_bert

Training and evaluation data

Trained on ~106M proteins from UniRef90. Sequences were filtered in advance to remove any with over 30% similarity (by BLAST score) to any sequence in any of the TAPE benchmark datasets. 8943 most frequent GO annotations were kept for the pretraining task.



3. RITA_s (Hugging Face) (1.2B, 680M, ...., 85M)

RITA is a family of autoregressive protein models, developed by a collaboration of LightOn, the OATML group at Oxford, and the Debbie Marks Lab at Harvard.

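A minimal generation sketch (an assumption, not taken from the model card: it presumes the small checkpoint is hosted as lightonai/RITA_s and that its custom architecture loads via trust_remote_code):

# Minimal sketch: sampling protein sequences with RITA.
# Assumptions: hub id lightonai/RITA_s; custom model code loaded with trust_remote_code=True.
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")
model = AutoModelForCausalLM.from_pretrained("lightonai/RITA_s", trust_remote_code=True)

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
# Start from a short amino-acid prefix and sample continuations.
print(generator("MKT", max_length=40, do_sample=True, num_return_sequences=2))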


==========
DNA Models
==========


1. dnagpt/human_gpt2-v1 (Hugging Face)

A DNA language model trained with the GPT-2 architecture on human genome data.

Key features of the dnagpt models (a minimal loading sketch follows this list):

  1. BPE tokenization instead of k-mers (DNABERT and DNABERT-2 also use BPE)
  2. GPT-style model, rather than BERT-style (DNABERT, GENA_LM)
  3. pre-training on the latest T2T human genome assembly
  4. details: https://github.com/maris205/dnagpt (includes training/BPE code)
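A minimal loading sketch (an assumption based on the description above, not code from the repository: it treats dnagpt/human_gpt2-v1 as a standard GPT-2-style checkpoint on the Hugging Face Hub):

# Minimal sketch: loading dnagpt/human_gpt2-v1 with the generic Auto classes.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dnagpt/human_gpt2-v1")
model = AutoModel.from_pretrained("dnagpt/human_gpt2-v1")

# Embed a short DNA fragment with the BPE tokenizer described above.
inputs = tokenizer("GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCG", return_tensors="pt")
outputs = model(**inputs)
print(inputs["input_ids"].shape, outputs.last_hidden_state.shape)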

2. dnagpt/dummy-model-test (Hugging Face)


3. dnagpt (Hugging Face)

human_genome_GCF_009914755.1 Dataset


4. GENA-LM (Hugging Face)

GENA-LM (gena-lm-bert-base)

GENA-LM is a Family of Open-Source Foundational Models for Long DNA Sequences.

GENA-LM models are transformer masked language models trained on human DNA sequences.

Differences between GENA-LM (gena-lm-bert-base) and DNABERT:

  • BPE tokenization instead of k-mers;
  • input sequence size is about 4500 nucleotides (512 BPE tokens) compared to 512 nucleotides of DNABERT
  • pre-training on T2T vs. GRCh38.p13 human genome assembly.

Source code and data: https://github.com/AIRI-Institute/GENA_LM

Paper: https://www.biorxiv.org/content/10.1101/2023.06.12.544594v1
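A small tokenizer sketch illustrating the nucleotide-to-BPE-token compression listed above (assumption: the checkpoint is published as AIRI-Institute/gena-lm-bert-base and its tokenizer loads with the standard Auto class):

# Minimal sketch: comparing nucleotide length to BPE token count for GENA-LM.
# Assumption: hub id AIRI-Institute/gena-lm-bert-base.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base")
seq = "ATGGCGTACGATCGATCGTAGCTAGCTAGCATCGATCGGGCTACGATCGATCGAT"
tokens = tokenizer.tokenize(seq)
# Per the list above, roughly 4500 nt fit into 512 BPE tokens on average.
print(f"{len(seq)} nucleotides -> {len(tokens)} BPE tokens")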


 


5. HyenaDNA (Hugging Face)

HyenaDNA is a long-range genomic foundation model pretrained on context lengths of up to 1 million tokens at single nucleotide resolution.



6. dnagpt/dnallama_lora (Hugging Face)


7. dnagpt/dnagpt_unigram (Hugging Face)


8. The Nucleotide Transformers (Hugging Face)

The Nucleotide Transformers are a collection of foundational language models that were pre-trained on DNA sequences from whole genomes. Compared to other approaches, our models not only integrate information from single reference genomes, but also leverage DNA sequences from over 3,200 diverse human genomes, as well as 850 genomes from a wide range of species, including model and non-model organisms. Through robust and extensive evaluation, we show that these large models provide extremely accurate molecular phenotype prediction compared to existing methods.

Part of this collection is the nucleotide-transformer-2.5b-multi-species, a 2.5B-parameter transformer pre-trained on a collection of 850 genomes from a wide range of species, including model and non-model organisms. The model is made available in both TensorFlow and PyTorch.

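A minimal embedding sketch (assumption: the multi-species checkpoint is hosted under the InstaDeepAI organization; a smaller variant would work the same way):

# Minimal sketch: extracting embeddings from a Nucleotide Transformer checkpoint.
# Assumption: hub id InstaDeepAI/nucleotide-transformer-2.5b-multi-species.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "InstaDeepAI/nucleotide-transformer-2.5b-multi-species"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

seq = "ATTCCGATTCCGATTCCGGGACCTAGACTTCAGCACC"
inputs = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
print(out.hidden_states[-1].shape)  # (1, num_tokens, hidden_size)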


9. DNABERT-2-117M (Hugging Face)

This is the official pre-trained model introduced in DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. DNABERT-2 is a transformer-based genome foundation model trained on multi-species genomes.


Conventional BERT models usually have a maximum sequence length, e.g. 512 tokens; the input the model can process cannot exceed this limit. For DNABERT this translates into a limit on nucleotides, as follows:

  1. DNABERT's tokenization: in DNABERT, a "token" is a k-mer, i.e. a DNA fragment of length k. The maximum DNA sequence length the model can handle therefore depends on the k-mer size (the value of k) and the model's maximum token count. For example, with 3-mers as tokens, a 512-token limit, and non-overlapping k-mers, the longest processable DNA sequence is 512 × 3 = 1536 bases.

  2. Adjustments in practice: when using DNABERT, the configuration can be adapted to specific needs, for example by choosing a different k or modifying the model to support longer sequences. This flexibility lets researchers tailor the model to their particular datasets and research goals.

  3. Impact of compute: the ability to process longer sequences is also bounded by computational resources; increasing the sequence length or the k-mer size raises the computational cost significantly, because the model must process more information.

The concrete sequence length DNABERT can handle thus depends on the model configuration, the chosen k-mer size, and the available compute. Choosing these parameters is usually a trade-off between model performance, computational efficiency, and the needs of the specific application.
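A small sketch of this token-budget arithmetic (plain Python; the overlapping case, which reflects DNABERT's sliding-window k-mer tokenization, is added here for comparison):

# Token-budget arithmetic for k-mer tokenization (illustrative sketch only).
MAX_TOKENS = 512  # typical BERT-style sequence limit
K = 3             # k-mer size

# Non-overlapping k-mers: every token covers K new bases.
max_bases_non_overlapping = MAX_TOKENS * K      # 512 * 3 = 1536
# Overlapping k-mers with stride 1: after the first k-mer, each token adds one base.
max_bases_overlapping = MAX_TOKENS + K - 1      # 512 + 3 - 1 = 514

print(max_bases_non_overlapping, max_bases_overlapping)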


==========
RNA Models
==========


=================
Biomedical Models
=================


BioGPT-Large

Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Among the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first one has been extensively studied in the biomedical domain, such as BioBERT and PubMedBERT. While they have achieved great success on a variety of discriminative downstream biomedical tasks, the lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, creating a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms.


Example usage (microsoft/biogpt):

from transformers import pipeline, set_seed
from transformers import BioGptTokenizer, BioGptForCausalLM

# Load the pre-trained BioGPT weights and matching tokenizer from the Hub.
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")

# Build a text-generation pipeline and sample five continuations of the prompt.
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator("COVID-19 is", max_length=20, num_return_sequences=5, do_sample=True)

BioGPT-Large-PubMedQA 


pubmedbert-base-embeddings

This is a PubMedBERT-base model fine-tuned using sentence-transformers. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. The training dataset was generated using a random sample of PubMed title-abstract pairs along with similar title pairs.

PubMedBERT Embeddings produces higher quality embeddings than generalized models for medical literature. Further fine-tuning for a medical subdomain will result in even better performance.

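A minimal embedding sketch (assumption: the checkpoint is published as neuml/pubmedbert-base-embeddings and loads with the sentence-transformers API):

# Minimal sketch: embedding biomedical sentences for semantic search / clustering.
# Assumption: hub id neuml/pubmedbert-base-embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("neuml/pubmedbert-base-embeddings")
sentences = [
    "Metformin is a first-line treatment for type 2 diabetes.",
    "BRCA1 mutations increase the risk of breast cancer.",
]
embeddings = model.encode(sentences)  # shape: (2, 768)
print(util.cos_sim(embeddings[0], embeddings[1]))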


biobert-v1.1 



Natural Language Processing (NLP)

Natural Language Processing (NLP) is an important direction in computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. NLP is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, the language people use every day, so it is closely related to linguistics, though with important differences. NLP is not the general study of natural language; rather, it aims to build computer systems, especially software systems, that can effectively carry out natural-language communication. It is therefore a part of computer science.